[OMPI devel] ORTE Scaling results: updated

2008-04-08 Thread Ralph H Castain
Hello all, The wiki page has been updated with the latest test results from a new branch that implemented inbound collectives on the modex and barrier operations. As you will see from the graphs, ORTE/OMPI now exhibits a negative 2nd derivative on the launch-time curve for mpi_no_op (i.e., MPI_Init

Re: [OMPI devel] MPI_Comm_connect/Accept

2008-04-08 Thread Aurélien Bouteiller
Still no luck here. I launch these three processes:
term1$ ompi-server -d --report-uri URIFILE
term2$ mpirun -mca routed unity -ompi-server file:URIFILE -np 1 simple_accept
term3$ mpirun -mca routed unity -ompi-server file:URIFILE -np 1 simple_connect
The output of ompi-server shows a s

Re: [OMPI devel] Signals

2008-04-08 Thread Richard Graham
On 4/8/08 2:19 PM, "Ralph H Castain" wrote: > > > > On 4/8/08 12:10 PM, "Pak Lui" wrote: > >> Richard Graham wrote: >>> What happens if I deliver sigusr2 to mpirun ? What I observe (for both >>> ssh/rsh and torque) that if I deliver a sigusr2 to mpirun, the signal does >>> get propagated

Re: [OMPI devel] Signals

2008-04-08 Thread Ralph H Castain
On 4/8/08 12:10 PM, "Pak Lui" wrote: > Richard Graham wrote: >> What happens if I deliver sigusr2 to mpirun ? What I observe (for both >> ssh/rsh and torque) that if I deliver a sigusr2 to mpirun, the signal does >> get propagated to the mpi procs, which do invoke the signal handler I >> regi

Re: [OMPI devel] Signals

2008-04-08 Thread Pak Lui
Richard Graham wrote: What happens if I deliver sigusr2 to mpirun ? What I observe (for both ssh/rsh and torque) that if I deliver a sigusr2 to mpirun, the signal does get propagated to the mpi procs, which do invoke the signal handler I registered, but the job is terminated right after that. H

Re: [OMPI devel] Signals

2008-04-08 Thread Ralph H Castain
Hmmm...well, I'll take a look. I haven't seen that behavior, but I haven't checked it in some time. On 4/8/08 11:54 AM, "Richard Graham" wrote: > What happens if I deliver sigusr2 to mpirun ? What I observe (for both > ssh/rsh and torque) that if I deliver a sigusr2 to mpirun, the signal does

Re: [OMPI devel] Signals

2008-04-08 Thread Richard Graham
What happens if I deliver SIGUSR2 to mpirun? What I observe (for both ssh/rsh and torque) is that if I deliver a SIGUSR2 to mpirun, the signal does get propagated to the MPI procs, which do invoke the signal handler I registered, but the job is terminated right after that. However, if I deliver the

Re: [OMPI devel] Signals

2008-04-08 Thread Ralph H Castain
I found what Pak said a little confusing as the wait_daemon function doesn't actually receive a signal itself - it only detects that a proc has exited and checks to see if that happened due to a signal. If so, it flags that situation and will order the job aborted. So if the proc continues alive,
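The distinction drawn above (a wait function does not receive the signal itself; it only notices that a process has exited and then checks whether a signal caused the exit) can be illustrated with plain POSIX calls. This is a minimal sketch and not the ORTE wait_daemon code: the names reap_and_classify and spawn_child are hypothetical and exist only for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: wait for child `pid` to terminate and classify how.
 * Returns 1 if the child was killed by a signal (stored in *signo),
 * 0 if it exited on its own, and -1 if waitpid() failed. */
int reap_and_classify(pid_t pid, int *signo)
{
    int status = 0;
    pid_t r;

    assert(signo != NULL);

    /* Retry if the wait itself is interrupted by a signal in the parent. */
    do {
        r = waitpid(pid, &status, 0);
    } while (r == -1 && errno == EINTR);

    if (r == -1) {
        return -1;
    }
    if (WIFSIGNALED(status)) {
        *signo = WTERMSIG(status);   /* the signal that ended the child */
        return 1;
    }
    return 0;                        /* normal exit via exit()/_exit() */
}

/* Hypothetical helper: fork a child that sends itself `sig` using the
 * default disposition (sig > 0) or exits cleanly (sig == 0). */
pid_t spawn_child(int sig)
{
    pid_t pid = fork();

    if (pid == 0) {
        if (sig > 0) {
            signal(sig, SIG_DFL);    /* make sure no handler is inherited */
            raise(sig);
        }
        _exit(0);
    }
    return pid;                      /* child's pid, or -1 if fork() failed */
}
```

A parent that calls spawn_child(SIGUSR2) followed by reap_and_classify() sees the "killed by signal" case only because the child has no handler installed. A child that catches the signal and keeps running never shows up as exited at all.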

Re: [OMPI devel] Signals

2008-04-08 Thread Pak Lui
First, can your user executable install a signal handler that catches SIGUSR2 so that it does not exit? By default on Solaris the process is going to exit, unless you catch the signal and have the process do nothing. From signal(3HEAD):
Name Value Default Event
SIGUSR1 16 Ex
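The suggestion above, catching SIGUSR2 so that the default action does not terminate the process, might look like the following in a user executable. This is a sketch using standard POSIX sigaction(); the function and variable names are hypothetical, and it says nothing about how mpirun itself reacts after forwarding the signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Flag set by the handler; sig_atomic_t is safe to write from a handler. */
static volatile sig_atomic_t got_sigusr2 = 0;

/* Handler: record the signal and return, so the process keeps running. */
static void on_sigusr2(int signo)
{
    (void)signo;
    got_sigusr2 = 1;
}

/* Hypothetical helper: replace the default SIGUSR2 action (terminate)
 * with the handler above. Returns 0 on success, -1 on failure. */
int install_sigusr2_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigusr2;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;        /* restart interrupted system calls */

    return sigaction(SIGUSR2, &sa, NULL);
}

/* Hypothetical helper: report whether SIGUSR2 has been seen so far. */
int sigusr2_was_delivered(void)
{
    return got_sigusr2 != 0;
}

/* Self-check: send SIGUSR2 to this process and confirm it survives. */
void sigusr2_self_check(void)
{
    int rc = install_sigusr2_handler();

    assert(rc == 0);
    raise(SIGUSR2);                  /* returns after the handler has run */
    assert(got_sigusr2 == 1);
}
```

In an MPI program, install_sigusr2_handler() would typically be called early in main(), and the main loop would poll sigusr2_was_delivered() rather than doing real work inside the handler.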

[OMPI devel] Signals

2008-04-08 Thread Richard Graham
I am running into a situation where I am trying to deliver a signal (SIGUSR2) to the MPI procs. I deliver this to mpirun, which propagates it to the MPI procs, but then proceeds to kill the children. Is there an easy way that I can get around this? I am using this mechanism in a situation where

Re: [OMPI devel] mpirun return code problems

2008-04-08 Thread Ralph H Castain
I'm aware - as we discussed on a recent telecon, I put it on my list of things to resolve. Solution is known - just busy with other things at the moment. On 4/8/08 6:06 AM, "Tim Prins" wrote: > Hi all, > > I reported this before, but it seems that the report got lost. I have > found some situa

[OMPI devel] mpirun return code problems

2008-04-08 Thread Tim Prins
Hi all, I reported this before, but it seems that the report got lost. I have found some situations where mpirun will return a '0' when there is an error. An easy way to reproduce this is to edit the file 'orte/mca/plm/base/plm_base_launch_support.c' and on line 154 put in 'return ORTE_ERROR