Re: [OMPI users] Valgrind Functionality

2008-07-08 Thread Tom Riddle
Thanks Ashley. After going through your suggestions we reran our test with 
valgrind 3.3.0 and with glibc-devel-2.5-18.el5_1.1 installed; both exhibit the 
same results. A simple non-MPI test program, however, returns the expected 
responses, so valgrind itself looks OK. We then checked that the same (shared) 
libc gets linked in both the MPI and non-MPI cases, and adding -pthread to the 
cc command line yields the same result. The only difference, it appears, is 
the Open MPI libraries.

Now mpicc links against libopen-pal, which defines malloc for its own purposes. 
The big difference seems to be that libopen-pal.so is providing its own malloc 
replacement.
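
For anyone who wants to double-check this, something along these lines shows 
where malloc is coming from (the library path just assumes the --prefix from 
our configure line, and osu_latency1 is our test binary):

  $ nm -D /opt/wkspace/openmpi-1.3a1r18303/lib/libopen-pal.so | grep -w malloc
  $ ldd ./osu_latency1 | grep open-pal

A "T malloc" line from nm means libopen-pal exports its own malloc, and the 
ldd line confirms the test binary actually pulls that library in.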

Is there perhaps something I have missed at runtime? I have included the 
ompi_info output just in case. Thanks in advance. Tom


--- On Tue, 7/8/08, Ashley Pittman  wrote:
From: Ashley Pittman 
Subject: Re: [OMPI users] Valgrind Functionality
To: rarebit...@yahoo.com, "Open MPI Users" 
Date: Tuesday, July 8, 2008, 2:05 AM

On Mon, 2008-07-07 at 19:09 -0700, Tom Riddle wrote:
> 
> I was attempting to get valgrind working with a simple MPI app
> (osu_latency) on OpenMPI. While it appears to report uninitialized
> values it fails to report any mallocs or frees that have been
> conducted. 

The normal reason for this is either using static applications or having
a very stripped glibc.  It doesn't appear you've done the former, as you
are linking in libpthread, but the latter is a possibility; you might
benefit from installing the glibc-devel package.  I don't recall RHEL
being the worst offender at stripping libc, however.

> I am using RHEL 5, gcc 4.2.3 and a drop from the repo labeled
> openmpi-1.3a1r18303. configured with  
> 
>  $ ../configure --prefix=/opt/wkspace/openmpi-1.3a1r18303 CC=gcc 
> CXX=g++ --disable-mpi-f77 --enable-debug --enable-memchecker 
> --with-psm=/usr/include --with-valgrind=/opt/wkspace/valgrind-3.3.0/

> As the FAQs suggest, I am running a later version of valgrind,
> enabling the memchecker and debug. I tested a slightly modified
> osu_latency test which has a simple char buffer malloc and free but
> the valgrind summary shows no malloc/free activity whatsoever. This is
> running on a dual node system using Infinipath HCAs.  Here is a
> trimmed output.

Although you configured Open MPI with what appears to be valgrind 3.3.0,
the version of valgrind you are actually running is 3.2.1; perhaps you want
to specify the full path to valgrind on the mpirun command line?

> [tom@lab01 ~]$ mpirun --mca pml cm -np 2 --hostfile my_hostfile
> valgrind ./osu_latency1 
> ==17839== Memcheck, a memory error detector.
> ==17839== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et
> al.
> ==17839== Using LibVEX rev 1658, a library for dynamic binary
> translation.
> ==17839== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
> ==17839== Using valgrind-3.2.1, a dynamic binary instrumentation
> framework.
> ==17839== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et
> al.
> ==17839== For more details, rerun with: -v

Ashley Pittman.




ompi_info.out.bz2
Description: Binary data


Re: [OMPI users] Gridengine + Open MPI

2008-07-08 Thread Pak Lui

Pak Lui wrote:

Romaric David wrote:

Pak Lui wrote:

It was fixed at one point in the trunk before v1.3 went official, but 
while rolling the gridengine PLM code into the rsh PLM, this feature was 
left out because there were some lingering issues that I didn't resolve, 
and I lost track of it. Sorry, but thanks for bringing it up; I will need 
to look at the issue again and reopen this ticket against v1.3:

OK, so I have to wait for a 1.3 version for job suspend to work, or will 
it be back-ported to a 1.2.x release?


I believe it will definitely be in the 1.3 series; I am not sure about 
v1.2 at this point.







So even though it is the rsh PLM that starts the parallel job under SGE, the 
rsh PLM can detect whether the Open MPI job is started under the SGE 
Parallel Environment (by checking some SGE environment variables) and use the 
"qrsh --inherit" command to launch the parallel job the same way as 
before. You can check by adding something like "--mca plm_base_verbose 10" 
to your mpirun command and looking for the launch commands that mpirun 
uses, as in the example below.
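
For example (my_mpi_app is just a placeholder; under SGE the slot count 
would normally come from $NSLOTS):

  $ mpirun --mca plm_base_verbose 10 -np $NSLOTS ./my_mpi_app

The verbose output should show whether qrsh or plain rsh/ssh is being used 
to start the remote daemons.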



It looks like the shepherd cannot be started, for a reason I haven't been able to determine yet.
/opt/SGE/utilbin/lx24-amd64/rsh exited with exit code 0
reading exit code from shepherd ... 255
[hostname:16745] 


Romaric,

I just made a fix in r18844 for the problem I've shown below. I think it 
is essentially the same problem that you are running into here.


Please let me know if you still see the problem where the SGE tight 
integration job errors out, and I'll look at the suspend/resume feature 
later on.






 Regards,
Romaric


How recent is the build that you used to generate the error above? I 
assume you are using a trunk build?


I didn't see the complete error messages that you are seeing, but I 
think I am running into the exact same error. It seems to be a weird 
error saying that the 'ssh' component was not found. I don't believe 
there's a component named 'ssh' here, because ssh and rsh share the 
same component.
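
As a generic sanity check (not specific to this failure), ompi_info will 
list which plm components are actually installed on a node:

  $ ompi_info | grep plm

If ssh and rsh really do share one component, only rsh should show up in 
that list.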


Well, it looks like something is broken in the plm that is responsible 
for launching the tight integration job for SGE.


I checked that it used to work without problems with my earlier trunk build 
(r18645). I have to find out what has happened since...




Starting server daemon at host "burl-ct-v440-4"
Server daemon successfully started with task id "1.burl-ct-v440-4"
Establishing /opt/sge/utilbin/sol-sparc64/rsh session to host 
burl-ct-v440-4 ...
[burl-ct-v440-4:13749] mca: base: components_open: Looking for plm 
components

--
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:  burl-ct-v440-4
Framework: plm
Component: ssh
--
[burl-ct-v440-4:13749] [[4569,0],1] ORTE_ERROR_LOG: Error in file 
base/ess_base_std_orted.c at line 70
[burl-ct-v440-4:13749] 
--

It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_plm_base_open failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--
[burl-ct-v440-4:13749] [[4569,0],1] ORTE_ERROR_LOG: Error in file 
ess_env_module.c at line 135
[burl-ct-v440-4:13749] [[4569,0],1] ORTE_ERROR_LOG: Error in file 
runtime/orte_init.c at line 132
[burl-ct-v440-4:13749] 
--

It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--
[burl-ct-v440-4:13749] [[4569,0],1] ORTE_ERROR_LOG: Error in file 
orted/orted_main.c at line 311

/opt/sge/utilbin/sol-sparc64/rsh exited with exit code 0
reading exit code from shepherd ... 255
[burl-ct-v440-5:09789] 
--

A daemon (pid 9

Re: [OMPI users] ORTE_ERROR_LOG timeout

2008-07-08 Thread Ralph H Castain
Several things are going on here. First, this error message:

> mpirun noticed that job rank 1 with PID 9658 on node mac1 exited on signal
> 6 (Aborted).
> 2 additional processes aborted (not shown)

indicates that your application procs are aborting for some reason. The
system is then attempting to shut down and somehow got itself "hung", hence
the timeout error messages.

I'm not sure that increasing the timeout value will help in this situation.
Unfortunately, 1.2.x has problems with this scenario (1.3 is -much- better!
;-)). If you want to try adjusting the timeout anyway, you can do so with:

mpirun -mca orte_abort_timeout x ...

where x is the specified timeout in seconds.
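
For example, to give it 60 seconds before declaring a timeout (the value, 
process count, and application name below are just placeholders):

  mpirun -mca orte_abort_timeout 60 -np 3 ./my_app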

Hope that helps.
Ralph



On 7/8/08 8:55 AM, "Alastair Basden"  wrote:

> Hi,
> I've got some code that uses Open MPI, and sometimes it crashes after
> printing something like:
> 
> [mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> base/pls_base_orted_cmds.c at line 275
> [mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
> line 1166
> [mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line
> 90
> mpirun noticed that job rank 1 with PID 9658 on node mac1 exited on signal
> 6 (Aborted).
> 2 additional processes aborted (not shown)
> [mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> base/pls_base_orted_cmds.c at line 188
> [mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
> line 1198
> --
> mpirun was unable to cleanly terminate the daemons for this job. Returned
> value Timeout instead of ORTE_SUCCESS.
> --
> 
> In this case, all processes were running on the same machine, so it's not a
> connection problem.  Is this a bug, or is something else wrong?  Is there a
> way to increase the timeout?
> 
> Thanks...
> 




[OMPI users] ORTE_ERROR_LOG timeout

2008-07-08 Thread Alastair Basden

Hi,
I've got some code that uses Open MPI, and sometimes it crashes after 
printing something like:


[mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
base/pls_base_orted_cmds.c at line 275
[mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at 
line 1166
[mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 
90
mpirun noticed that job rank 1 with PID 9658 on node mac1 exited on signal 
6 (Aborted).

2 additional processes aborted (not shown)
[mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
base/pls_base_orted_cmds.c at line 188
[mac1:09654] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at 
line 1198

--
mpirun was unable to cleanly terminate the daemons for this job. Returned 
value Timeout instead of ORTE_SUCCESS.

--

In this case, all processes were running on the same machine, so it's not a 
connection problem.  Is this a bug, or is something else wrong?  Is there a 
way to increase the timeout?


Thanks...



Re: [OMPI users] Valgrind Functionality

2008-07-08 Thread Ashley Pittman
On Mon, 2008-07-07 at 19:09 -0700, Tom Riddle wrote:
> 
> I was attempting to get valgrind working with a simple MPI app
> (osu_latency) on OpenMPI. While it appears to report uninitialized
> values it fails to report any mallocs or frees that have been
> conducted. 

The normal reason for this is either using static applications or having
a very stripped glibc.  It doesn't appear you've done the former, as you
are linking in libpthread, but the latter is a possibility; you might
benefit from installing the glibc-devel package.  I don't recall RHEL
being the worst offender at stripping libc, however.
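
A couple of quick things you could check (assuming the binary is the 
osu_latency1 shown in your output and this is a stock RHEL 5 install; 
adjust names as needed):

  $ file ./osu_latency1          # should say "dynamically linked"
  $ ldd ./osu_latency1           # shows which libc/libpthread actually get used
  $ rpm -q glibc glibc-devel     # confirms which glibc packages are installed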

> I am using RHEL 5, gcc 4.2.3 and a drop from the repo labeled
> openmpi-1.3a1r18303. configured with  
> 
>  $ ../configure --prefix=/opt/wkspace/openmpi-1.3a1r18303 CC=gcc 
> CXX=g++ --disable-mpi-f77 --enable-debug --enable-memchecker 
> --with-psm=/usr/include --with-valgrind=/opt/wkspace/valgrind-3.3.0/

> As the FAQs suggest, I am running a later version of valgrind,
> enabling the memchecker and debug. I tested a slightly modified
> osu_latency test which has a simple char buffer malloc and free but
> the valgrind summary shows no malloc/free activity whatsoever. This is
> running on a dual node system using Infinipath HCAs.  Here is a
> trimmed output.

Although you configured Open MPI with what appears to be valgrind 3.3.0,
the version of valgrind you are actually running is 3.2.1; perhaps you want
to specify the full path to valgrind on the mpirun command line?
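
Something along these lines should do it (I'm assuming the 3.3.0 tree you 
pointed --with-valgrind at has the usual bin/valgrind layout; adjust the 
path if yours differs):

  $ mpirun --mca pml cm -np 2 --hostfile my_hostfile \
      /opt/wkspace/valgrind-3.3.0/bin/valgrind ./osu_latency1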

> [tom@lab01 ~]$ mpirun --mca pml cm -np 2 --hostfile my_hostfile
> valgrind ./osu_latency1 
> ==17839== Memcheck, a memory error detector.
> ==17839== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et
> al.
> ==17839== Using LibVEX rev 1658, a library for dynamic binary
> translation.
> ==17839== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
> ==17839== Using valgrind-3.2.1, a dynamic binary instrumentation
> framework.
> ==17839== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et
> al.
> ==17839== For more details, rerun with: -v

Ashley Pittman.