Re: [OMPI users] TotalView Memory debugging and OpenMPI

2011-05-16 Thread Ralph Castain
Guess I'm having trouble reading your diff...different notation than I'm used 
to seeing. I'll have to parse thru it when I have more time.


On May 16, 2011, at 1:02 PM, Peter Thompson wrote:

> Hmmm?  We're not removing the putenv() calls.  Just adding a strdup() 
> beforehand, and then calling putenv() with the string duplicated from env[j]. 
>  Of course, if the strdup fails, then we bail out. 
> As for why it's suddenly a problem, I'm not quite as certain.   The problem 
> we do show is a double free, so someone has already freed that memory used by 
> putenv(), and I do know that while that used to be just flagged as an event 
> before, now we seem to be unable to continue past it.   Not sure if that is 
> our change or a library/system change. 
> PeterT
> 
> 
> Ralph Castain wrote:
>> On May 16, 2011, at 12:45 PM, Peter Thompson wrote:
>> 
>>  
>>> Hi Ralph,
>>> 
>>> We've had a number of user complaints about this.   Since it seems on the 
>>> face of it that it is a debugger issue, it may not have made its way back 
>>> here.  Is your objection that the patch basically aborts if it gets a bad 
>>> value?   I could understand that being a concern.   Of course, it aborts on 
>>> TotalView now if we attempt to move forward without this patch.
>>> 
>>>
>> 
>> No - my concern is that you appear to be removing the "putenv" calls. OMPI 
>> places some values into the local environment so the user can control 
>> behavior. Removing those causes problems.
>> 
>> What I need to know is why, after it has worked with TV for years, these 
>> putenv's are suddenly a problem. Is the problem occurring during shutdown? 
>> Or is this something that causes TV to break?
>> 
>> 
>>  
>>> I've passed your comment back to the engineer, with a suspicion about the 
>>> concerns about the abort, but if you have other objections, let me know.
>>> 
>>> Cheers,
>>> PeterT
>>> 
>>> 
>>> Ralph Castain wrote:
>>>
 That would be a problem, I fear. We need to push those envars into the 
 environment.
 
 Is there some particular problem causing what you see? We have no other 
 reports of this issue, and orterun has had that code forever.
 
 
 
 Sent from my iPad
 
 On May 11, 2011, at 2:05 PM, Peter Thompson  
 wrote:
 
   
> We've gotten a few reports of problems with memory debugging when using 
> OpenMPI under TotalView.  Usually, TotalView will attach to the processes 
> started after an MPI_Init.  However in the case where memory debugging is 
> enabled, things seemed to run away or fail.   My analysis showed that we 
> had a number of core files left over from the attempt, and all were 
> mpirun (or orterun) cores.   It seemed to be a regression on our part, 
> since testing seemed to indicate this worked okay before TotalView 
> 8.9.0-0, so I filed an internal bug and passed it to engineering.   After 
> giving our engineer a brief tutorial on how to build a debug version of 
> OpenMPI, he found what appears to be a problem in the code for orterun.c. 
>   He's made a slight change that fixes the issue in 1.4.2, 1.4.3, 
> 1.4.4rc2 and 1.5.3, those being the versions he's tested with so far.
> He doesn't subscribe to this list that I know of, so I offered to pass 
> this by the group.   Of course, I'm not sure if this is exactly the right 
> place to submit patches, but I'm sure you'd tell me where to put it if 
> I'm in the wrong here.   It's a short patch, so I'll cut and paste it, 
> and attach as well, since cut and paste can do weird things to formatting.
> 
> Credit goes to Ariel Burton for this patch.  Of course he used TotalView 
> to find this ;-)  It shows up if you do 'mpirun -tv -np 4 ./foo'   or 
> 'totalview mpirun -a -np 4 ./foo'
> 
> Cheers,
> PeterT
> 
> 
> more ~/patches/anbs-patch
> *** orte/tools/orterun/orterun.c  2010-04-13 13:30:34.0 -0400
> --- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c  2011-05-09 20:28:16.588183000 -0400
> ***
> *** 1578,1588 
>   }
>   if (NULL != env) {
>   size1 = opal_argv_count(env);
>   for (j = 0; j < size1; ++j) {
> ! putenv(env[j]);
>   }
>   }
>   /* All done */
> --- 1578,1600 
>   }
>   if (NULL != env) {
>   size1 = opal_argv_count(env);
>   for (j = 0; j < size1; ++j) {
> ! /* Use-after-Free error possible here.  putenv does not copy
> !the string passed to it, and instead stores only the 
> pointer.
> !env[j] may be freed later, in which case the pointer
> !in environ will now be left dangling into a deallocated
> !

Re: [OMPI users] TotalView Memory debugging and OpenMPI

2011-05-16 Thread Peter Thompson
Hmmm?  We're not removing the putenv() calls.  Just adding a strdup() 
beforehand, and then calling putenv() with the string duplicated from 
env[j].  Of course, if the strdup fails, then we bail out. 

As for why it's suddenly a problem, I'm not quite as certain.   The 
problem we do show is a double free, so someone has already freed that 
memory used by putenv(), and I do know that while that used to be just 
flagged as an event before, now we seem to be unable to continue past 
it.   Not sure if that is our change or a library/system change. 
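
For anyone trying to picture the failure mode described above, here is a
minimal, self-contained sketch (a hypothetical example for illustration, not
code taken from orterun):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        char *val = strdup("OMPI_EXAMPLE=1");   /* stands in for env[j] */

        putenv(val);   /* putenv() keeps the pointer; no copy is made */
        free(val);     /* ...so if the caller later frees env[j]...   */

        /* ...environ is left dangling, and any later lookup (or a second
           free of the same block) is what the memory debugger flags. */
        printf("%s\n", getenv("OMPI_EXAMPLE"));   /* use-after-free */

        /* The patch avoids this by handing putenv() its own copy: */
        char *s = strdup("OMPI_EXAMPLE=2");
        if (NULL == s) {
            return 1;   /* analogous to returning OPAL_ERR_OUT_OF_RESOURCE */
        }
        putenv(s);      /* s is never freed, so environ stays valid */
        return 0;
    }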


PeterT


Ralph Castain wrote:

On May 16, 2011, at 12:45 PM, Peter Thompson wrote:

  

Hi Ralph,

We've had a number of user complaints about this.   Since it seems on the face 
of it that it is a debugger issue, it may not have made its way back here.  Is 
your objection that the patch basically aborts if it gets a bad value?   I 
could understand that being a concern.   Of course, it aborts on TotalView now 
if we attempt to move forward without this patch.




No - my concern is that you appear to be removing the "putenv" calls. OMPI 
places some values into the local environment so the user can control behavior. Removing 
those causes problems.

What I need to know is why, after it has worked with TV for years, these 
putenv's are suddenly a problem. Is the problem occurring during shutdown? Or 
is this something that causes TV to break?


  

I've passed your comment back to the engineer, with a suspicion about the 
concerns about the abort, but if you have other objections, let me know.

Cheers,
PeterT


Ralph Castain wrote:


That would be a problem, I fear. We need to push those envars into the 
environment.

Is there some particular problem causing what you see? We have no other reports 
of this issue, and orterun has had that code forever.



Sent from my iPad

On May 11, 2011, at 2:05 PM, Peter Thompson  
wrote:

 
  

We've gotten a few reports of problems with memory debugging when using OpenMPI 
under TotalView.  Usually, TotalView will attach to the processes started after 
an MPI_Init.  However in the case where memory debugging is enabled, things 
seemed to run away or fail.   My analysis showed that we had a number of core 
files left over from the attempt, and all were mpirun (or orterun) cores.   It 
seemed to be a regression on our part, since testing seemed to indicate this 
worked okay before TotalView 8.9.0-0, so I filed an internal bug and passed it 
to engineering.   After giving our engineer a brief tutorial on how to build a 
debug version of OpenMPI, he found what appears to be a problem in the code for 
orterun.c.   He's made a slight change that fixes the issue in 1.4.2, 1.4.3, 
1.4.4rc2 and 1.5.3, those being the versions he's tested with so far.  He 
doesn't subscribe to this list that I know of, so I offered to pass this by the 
group.   Of course, I'm not sure if this is exactly the right place to submit 
patches, but I'm sure you'd tell me where to put it if I'm in the wrong here.   
It's a short patch, so I'll cut and paste it, and attach as well, since cut and 
paste can do weird things to formatting.

Credit goes to Ariel Burton for this patch.  Of course he used TotalView to 
find this ;-)  It shows up if you do 'mpirun -tv -np 4 ./foo'   or 'totalview 
mpirun -a -np 4 ./foo'

Cheers,
PeterT


more ~/patches/anbs-patch
*** orte/tools/orterun/orterun.c  2010-04-13 13:30:34.0 -0400
--- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c  2011-05-09 20:28:16.588183000 -0400
***
*** 1578,1588 
   }
   if (NULL != env) {
   size1 = opal_argv_count(env);
   for (j = 0; j < size1; ++j) {
! putenv(env[j]);
   }
   }
   /* All done */
--- 1578,1600 
   }
   if (NULL != env) {
   size1 = opal_argv_count(env);
   for (j = 0; j < size1; ++j) {
! /* Use-after-Free error possible here.  putenv does not copy
!the string passed to it, and instead stores only the pointer.
!env[j] may be freed later, in which case the pointer
!in environ will now be left dangling into a deallocated
!region.
!So we make a copy of the variable.
! */
! char *s = strdup(env[j]);
!
! if (NULL == s) {
! return OPAL_ERR_OUT_OF_RESOURCE;
! }
! putenv(s);
   }
   }
   /* All done */

*** orte/tools/orterun/orterun.c  2010-04-13 13:30:34.0 -0400
--- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c  2011-05-09 20:28:16.588183000 -0400
***
*** 1578,1588 
}

if (NULL != env) {
size1 = opal_argv_count(env);
for (j = 0; j < size1; ++j) {
! putenv(env[j]);
}
}

  

Re: [OMPI users] TotalView Memory debugging and OpenMPI

2011-05-16 Thread Ralph Castain

On May 16, 2011, at 12:45 PM, Peter Thompson wrote:

> Hi Ralph,
> 
> We've had a number of user complaints about this.   Since it seems on the 
> face of it that it is a debugger issue, it may not have made its way back 
> here.  Is your objection that the patch basically aborts if it gets a bad 
> value?   I could understand that being a concern.   Of course, it aborts on 
> TotalView now if we attempt to move forward without this patch.
> 

No - my concern is that you appear to be removing the "putenv" calls. OMPI 
places some values into the local environment so the user can control behavior. 
Removing those causes problems.

What I need to know is why, after it has worked with TV for years, these 
putenv's are suddenly a problem. Is the problem occurring during shutdown? Or 
is this something that causes TV to break?


> I've passed your comment back to the engineer, with a suspicion about the 
> concerns about the abort, but if you have other objections, let me know.
> 
> Cheers,
> PeterT
> 
> 
> Ralph Castain wrote:
>> That would be a problem, I fear. We need to push those envars into the 
>> environment.
>> 
>> Is there some particular problem causing what you see? We have no other 
>> reports of this issue, and orterun has had that code forever.
>> 
>> 
>> 
>> Sent from my iPad
>> 
>> On May 11, 2011, at 2:05 PM, Peter Thompson  
>> wrote:
>> 
>>  
>>> We've gotten a few reports of problems with memory debugging when using 
>>> OpenMPI under TotalView.  Usually, TotalView will attach to the processes 
>>> started after an MPI_Init.  However in the case where memory debugging is 
>>> enabled, things seemed to run away or fail.   My analysis showed that we 
>>> had a number of core files left over from the attempt, and all were mpirun 
>>> (or orterun) cores.   It seemed to be a regression on our part, since 
>>> testing seemed to indicate this worked okay before TotalView 8.9.0-0, so I 
>>> filed an internal bug and passed it to engineering.   After giving our 
>>> engineer a brief tutorial on how to build a debug version of OpenMPI, he 
>>> found what appears to be a problem in the code for orterun.c.   He's made a 
>>> slight change that fixes the issue in 1.4.2, 1.4.3, 1.4.4rc2 and 1.5.3, 
>>> those being the versions he's tested with so far.  He doesn't subscribe 
>>> to this list that I know of, so I offered to pass this by the group.   Of 
>>> course, I'm not sure if this is exactly the right place to submit patches, 
>>> but I'm sure you'd tell me where to put it if I'm in the wrong here.   It's 
>>> a short patch, so I'll cut and paste it, and attach as well, since cut and 
>>> paste can do weird things to formatting.
>>> 
>>> Credit goes to Ariel Burton for this patch.  Of course he used TotalView to 
>>> find this ;-)  It shows up if you do 'mpirun -tv -np 4 ./foo'   or 
>>> 'totalview mpirun -a -np 4 ./foo'
>>> 
>>> Cheers,
>>> PeterT
>>> 
>>> 
>>> more ~/patches/anbs-patch
>>> *** orte/tools/orterun/orterun.c  2010-04-13 13:30:34.0 -0400
>>> --- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c  2011-05-09 20:28:16.588183000 -0400
>>> ***
>>> *** 1578,1588 
>>>}
>>>if (NULL != env) {
>>>size1 = opal_argv_count(env);
>>>for (j = 0; j < size1; ++j) {
>>> ! putenv(env[j]);
>>>}
>>>}
>>>/* All done */
>>> --- 1578,1600 
>>>}
>>>if (NULL != env) {
>>>size1 = opal_argv_count(env);
>>>for (j = 0; j < size1; ++j) {
>>> ! /* Use-after-Free error possible here.  putenv does not copy
>>> !the string passed to it, and instead stores only the 
>>> pointer.
>>> !env[j] may be freed later, in which case the pointer
>>> !in environ will now be left dangling into a deallocated
>>> !region.
>>> !So we make a copy of the variable.
>>> ! */
>>> ! char *s = strdup(env[j]);
>>> !
>>> ! if (NULL == s) {
>>> ! return OPAL_ERR_OUT_OF_RESOURCE;
>>> ! }
>>> ! putenv(s);
>>>}
>>>}
>>>/* All done */
>>> 
>>> *** orte/tools/orterun/orterun.c  2010-04-13 13:30:34.0 -0400
>>> --- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c  2011-05-09 20:28:16.588183000 -0400
>>> ***
>>> *** 1578,1588 
>>> }
>>> 
>>> if (NULL != env) {
>>> size1 = opal_argv_count(env);
>>> for (j = 0; j < size1; ++j) {
>>> ! putenv(env[j]);
>>> }
>>> }
>>> 
>>> /* All done */
>>> 
>>> --- 1578,1600 
>>> }
>>> 
>>> if (NULL != env) {
>>> size1 = opal_argv_count(env);
>>> for (j = 0; j < size1; ++j) {
>>> ! /* Use-after-Free 

Re: [OMPI users] Scheduling dynamically spawned processes

2011-05-16 Thread Ralph Castain
You need to use MPI_Comm_spawn_multiple. Despite the name, it results in a 
single communicator being created by a single launch - it just allows you to 
specify multiple applications to run.

In this case, we use the same app, but give each element a different "host" 
info key to get the behavior we want. Looks something like this:

MPI_Comm child;
char *cmds[3] = {"myapp", "myapp", "myapp"};
MPI_Info info[3];
int maxprocs[] = { 1, 3, 1 };

MPI_Info_create(&info[0]);
MPI_Info_set(info[0], "host", "m1");

MPI_Info_create(&info[1]);
MPI_Info_set(info[1], "host", "m2");

MPI_Info_create(&info[2]);
MPI_Info_set(info[2], "host", "m3");

MPI_Comm_spawn_multiple(3, cmds, NULL, maxprocs,
                        info, 0, MPI_COMM_WORLD,
                        &child, MPI_ERRCODES_IGNORE);

I won't claim the above is correct - but it gives the gist of the idea.
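
For completeness, the other route mentioned below is the "hostfile" info key
with a single MPI_Comm_spawn. A rough, untested sketch, assuming a hypothetical
file named "my_hosts" containing the slot counts Rodrigo listed
(m1 slots=1, m2 slots=3, m3 slots=1); whether the 5 processes land exactly
1/3/1 also depends on the mapping policy in use:

    MPI_Comm child;
    MPI_Info info;

    MPI_Info_create(&info);
    MPI_Info_set(info, "hostfile", "my_hosts");

    MPI_Comm_spawn("myapp", MPI_ARGV_NULL, 5, info, 0,
                   MPI_COMM_WORLD, &child, MPI_ERRCODES_IGNORE);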


On May 16, 2011, at 12:19 PM, Thatyene Louise Alves de Souza Ramos wrote:

> Ralph,
> 
> I have the same issue and I've been searching for how to do this, but I couldn't 
> find it. 
> 
> What exactly must be the string in the host info key to do what Rodrigo 
> described?
> 
> <<< Inside your master, you would create an MPI_Info key "host" that has a 
> value 
> <<< consisting of a string "host1,host2,host3" identifying the hosts you want 
> <<< your slave to execute upon. Those hosts must have been included in 
> <<< my_hostfile. Include that key in the MPI_Info array passed to your Spawn.
> 
> I tried to do what you said above but ompi ignores the repetition of hosts. 
> Using Rodrigo's example I did:
> 
> host info key = "m1,m2,m2,m2,m3" and number of processes = 5 and the result 
> was
> 
> m1 -> 2
> m2 -> 2
> m3 -> 1
> 
> and not
> 
> m1 -> 1
> m2 -> 3
> m3 -> 1
> 
> as I wanted.
> 
> Thanks in advance.
> 
> Thatyene Ramos
> 
> On Fri, May 13, 2011 at 9:16 PM, Ralph Castain  wrote:
> I believe I answered that question. You can use the hostfile info key, or you 
> can use the host info key - either one will do what you require.
> 
> On May 13, 2011, at 4:11 PM, Rodrigo Silva Oliveira wrote:
> 
>> Hi,
>> 
>> I think I was not specific enough. I need to spawn the copies of a process 
>> in a single mpi_spawn call. That is, I have to specify a list of machines and 
>> how many copies of the process will be spawned on each one. Is it possible?
>> 
>> It would be something like this:
>> 
>> machines   #copies
>> m1         1
>> m2         3
>> m3         1
>> 
>> After a single call to spawn, I want the copies running in this fashion. I 
>> tried to use a hostfile with the slots option, but I'm not sure if it is the 
>> best way.
>> 
>> hostfile:
>> 
>> m1 slots=1
>> m2 slots=3
>> m3 slots=1
>> 
>> Thanks
>> 
>> -- 
>> Rodrigo Silva Oliveira
>> M.Sc. Student - Computer Science
>> Universidade Federal de Minas Gerais
>> www.dcc.ufmg.br/~rsilva
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] TotalView Memory debugging and OpenMPI

2011-05-16 Thread Peter Thompson

Hi Ralph,

We've had a number of user complaints about this.   Since it seems on 
the face of it that it is a debugger issue, it may not have made its 
way back here.  Is your objection that the patch basically aborts if it 
gets a bad value?   I could understand that being a concern.   Of 
course, it aborts on TotalView now if we attempt to move forward without 
this patch.


I've passed your comment back to the engineer, with a suspicion about 
the concerns about the abort, but if you have other objections, let me know.


Cheers,
PeterT


Ralph Castain wrote:

That would be a problem, I fear. We need to push those envars into the 
environment.

Is there some particular problem causing what you see? We have no other reports 
of this issue, and orterun has had that code forever.



Sent from my iPad

On May 11, 2011, at 2:05 PM, Peter Thompson  
wrote:

  

We've gotten a few reports of problems with memory debugging when using OpenMPI 
under TotalView.  Usually, TotalView will attach to the processes started after 
an MPI_Init.  However in the case where memory debugging is enabled, things 
seemed to run away or fail.   My analysis showed that we had a number of core 
files left over from the attempt, and all were mpirun (or orterun) cores.   It 
seemed to be a regression on our part, since testing seemed to indicate this 
worked okay before TotalView 8.9.0-0, so I filed an internal bug and passed it 
to engineering.   After giving our engineer a brief tutorial on how to build a 
debug version of OpenMPI, he found what appears to be a problem in the code for 
orterun.c.   He's made a slight change that fixes the issue in 1.4.2, 1.4.3, 
1.4.4rc2 and 1.5.3, those being the versions he's tested with so far.  He 
doesn't subscribe to this list that I know of, so I offered to pass this by the 
group.   Of course, I'm not sure if this is exactly the right place to submit 
patches, but I'm sure you'd tell me where to put it if I'm in the wrong here.   
It's a short patch, so I'll cut and paste it, and attach as well, since cut and 
paste can do weird things to formatting.

Credit goes to Ariel Burton for this patch.  Of course he used TotalView to 
find this ;-)  It shows up if you do 'mpirun -tv -np 4 ./foo'   or 'totalview 
mpirun -a -np 4 ./foo'

Cheers,
PeterT


more ~/patches/anbs-patch
*** orte/tools/orterun/orterun.c  2010-04-13 13:30:34.0 -0400
--- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c  2011-05-09 20:28:16.588183000 -0400
***
*** 1578,1588 
}
if (NULL != env) {
size1 = opal_argv_count(env);
for (j = 0; j < size1; ++j) {
! putenv(env[j]);
}
}
/* All done */
--- 1578,1600 
}
if (NULL != env) {
size1 = opal_argv_count(env);
for (j = 0; j < size1; ++j) {
! /* Use-after-Free error possible here.  putenv does not copy
!the string passed to it, and instead stores only the pointer.
!env[j] may be freed later, in which case the pointer
!in environ will now be left dangling into a deallocated
!region.
!So we make a copy of the variable.
! */
! char *s = strdup(env[j]);
!
! if (NULL == s) {
! return OPAL_ERR_OUT_OF_RESOURCE;
! }
! putenv(s);
}
}
/* All done */

*** orte/tools/orterun/orterun.c  2010-04-13 13:30:34.0 -0400
--- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c  2011-05-09 20:28:16.588183000 -0400
***
*** 1578,1588 
 }

 if (NULL != env) {
 size1 = opal_argv_count(env);
 for (j = 0; j < size1; ++j) {
! putenv(env[j]);
 }
 }

 /* All done */

--- 1578,1600 
 }

 if (NULL != env) {
 size1 = opal_argv_count(env);
 for (j = 0; j < size1; ++j) {
! /* Use-after-Free error possible here.  putenv does not copy
!the string passed to it, and instead stores only the pointer.
!env[j] may be freed later, in which case the pointer
!in environ will now be left dangling into a deallocated
!region.
!So we make a copy of the variable.
! */
! char *s = strdup(env[j]);
! 
! if (NULL == s) {

! return OPAL_ERR_OUT_OF_RESOURCE;
! }
! putenv(s);
 }
 }

 /* All done */

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users






Re: [OMPI users] Scheduling dynamically spawned processes

2011-05-16 Thread Thatyene Louise Alves de Souza Ramos
Ralph,

I have the same issue and I've been searching for how to do this, but I couldn't
find it.

What exactly must be the string in the host info key to do what Rodrigo
described?

<<< Inside your master, you would create an MPI_Info key "host" that has a
value
<<< consisting of a string "host1,host2,host3" identifying the hosts you
want
<<< your slave to execute upon. Those hosts must have been included in
<<< my_hostfile. Include that key in the MPI_Info array passed to your
Spawn.

I tried to do what you said above but ompi ignores the repetition of hosts.
Using Rodrigo's example I did:

host info key = "m1,m2,m2,m2,m3" and number of processes = 5 and the result
was

m1 -> 2
m2 -> 2
m3 -> 1

and not

m1 -> 1
m2 -> 3
m3 -> 1

as I wanted.

Thanks in advance.

Thatyene Ramos

On Fri, May 13, 2011 at 9:16 PM, Ralph Castain  wrote:

> I believe I answered that question. You can use the hostfile info key, or
> you can use the host info key - either one will do what you require.
>
> On May 13, 2011, at 4:11 PM, Rodrigo Silva Oliveira wrote:
>
> Hi,
>
> I think I was not specific enough. I need to spawn the copies of a process
> in a single mpi_spawn call. That is, I have to specify a list of machines and
> how many copies of the process will be spawned on each one. Is it possible?
>
> It would be something like this:
>
> machines   #copies
> m1         1
> m2         3
> m3         1
>
> After a single call to spawn, I want the copies running in this fashion. I
> tried to use a hostfile with the slots option, but I'm not sure if it is the
> best way.
>
> hostfile:
>
> m1 slots=1
> m2 slots=3
> m3 slots=1
>
> Thanks
>
> --
> Rodrigo Silva Oliveira
> M.Sc. Student - Computer Science
> Universidade Federal de Minas Gerais
> www.dcc.ufmg.br/~rsilva 
>  ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


[OMPI users] Segfault after malloc()?

2011-05-16 Thread Paul van der Walt
Hi all,

I hope to provide enough information to make my problem clear. I
have been debugging a lot after continually getting a segfault
in my program, but then I decided to try and run it on another
node, and it didn't segfault! The program which causes this
strange behaviour can be downloaded with

$ git clone https://toothbr...@github.com/toothbrush/bsp-cg.git

It depends on bsponmpi (can be found at:
http://bsponmpi.sourceforge.net/ ).

The machine on which I get a segfault is 
Linux scarlatti 2.6.38-2-amd64 #1 SMP Thu Apr 7 04:28:07 UTC 2011 x86_64 
GNU/Linux
OpenMPI --version: mpirun (Open MPI) 1.4.3

And the error message is:
[scarlatti:22100] *** Process received signal ***
[scarlatti:22100] Signal: Segmentation fault (11)
[scarlatti:22100] Signal code:  (128)
[scarlatti:22100] Failing at address: (nil)
[scarlatti:22100] [ 0] /lib/libpthread.so.0(+0xef60) [0x7f33ca69ef60]
[scarlatti:22100] [ 1] /lib/libc.so.6(+0x74121) [0x7f33ca3a3121]
[scarlatti:22100] [ 2] /lib/libc.so.6(__libc_malloc+0x70) [0x7f33ca3a5930]
[scarlatti:22100] [ 3] src/cg(vecalloci+0x2c) [0x401789]
[scarlatti:22100] [ 4] src/cg(bspmv_init+0x60) [0x40286a]
[scarlatti:22100] [ 5] src/cg(bspcg+0x63b) [0x401f8b]
[scarlatti:22100] [ 6] src/cg(main+0xd3) [0x402517]
[scarlatti:22100] [ 7] /lib/libc.so.6(__libc_start_main+0xfd) [0x7f33ca34dc4d]
[scarlatti:22100] [ 8] src/cg() [0x401609]
[scarlatti:22100] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 22100 on node scarlatti exited on 
signal 11 (Segmentation fault).
--

The program can be invoked (after downloading the source,
running make, and cd'ing into the project's root directory)
like:

$ mpirun -np 2 src/cg examples/test.mtx-P2 examples/test.mtx-v2 
examples/test.mtx-u2

The program seems to fail at src/bspedupack.c:vecalloci(), but
printf'ing the pointer that's returned by malloc() looks okay.

The node on which the program DOES run without segfault is as
follows: (OS X laptop)

Darwin purcell 10.7.0 Darwin Kernel Version 10.7.0: Sat Jan 29 15:17:16 PST 
2011; root:xnu-1504.9.37~1/RELEASE_I386 i386
OpenMPI --version: mpirun (Open MPI) 1.2.8

Please inform if this is a real bug in OpenMPI, or if I'm coding
something incorrectly. Note that I'm not asking anyone to debug
my code for me, it's purely in case people want to try and
reproduce my error locally. 

If I can provide more info, please advise. I'm not an MPI
expert, unfortunately. 

Kind regards,

Paul van der Walt

-- 
O< ascii ribbon campaign - stop html mail - www.asciiribbon.org




Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-16 Thread George Bosilca
Here is the output of the "ompi_info --param btl openib":

 MCA btl: parameter "btl_openib_flags" (current value: <306>, data
          source: default value)
          BTL bit flags (general flags: SEND=1, PUT=2, GET=4,
          SEND_INPLACE=8, RDMA_MATCHED=64, HETEROGENEOUS_RDMA=256; flags
          only used by the "dr" PML (ignored by others): ACK=16,
          CHECKSUM=32, RDMA_COMPLETION=128; flags only used by the "bfo"
          PML (ignored by others): FAILOVER_SUPPORT=512)

So the 305 value means: HETEROGENEOUS_RDMA | CHECKSUM | ACK | SEND. Most of 
these flags are totally useless in the current version of Open MPI (DR is not 
supported), so the only flags that really matter are SEND | HETEROGENEOUS_RDMA.

If you want to enable the send protocol, try first with SEND | SEND_INPLACE (9); 
if not, downgrade to SEND (1).
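
For reference, the arithmetic behind those values, using the flag bits listed
above:

    305 = 256 (HETEROGENEOUS_RDMA) + 32 (CHECKSUM) + 16 (ACK) + 1 (SEND)
      9 =   8 (SEND_INPLACE) + 1 (SEND)
      1 =   1 (SEND)

If the setting needs to be site-wide rather than on every command line, the
usual MCA mechanisms apply as well (for example the OMPI_MCA_btl_openib_flags
environment variable, or a "btl_openib_flags = ..." line in
openmpi-mca-params.conf).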

  george.

On May 16, 2011, at 11:33 , Samuel K. Gutierrez wrote:

> 
> On May 16, 2011, at 8:53 AM, Brock Palen wrote:
> 
>> 
>> 
>> 
>> On May 16, 2011, at 10:23 AM, Samuel K. Gutierrez wrote:
>> 
>>> Hi,
>>> 
>>> Just out of curiosity - what happens when you add the following MCA option 
>>> to your openib runs?
>>> 
>>> -mca btl_openib_flags 305
>> 
>> You Sir found the magic combination.
> 
> :-)  - cool.
> 
> Developers - does this smell like a registered memory availability hang?
> 
>> I verified this lets IMB and CRASH progress pass their lockup points,
>> I will have a user test this, 
> 
> Please let us know what you find.
> 
>> Is this an ok option to put in our environment?  What does 305 mean?
> 
> There may be a performance hit associated with this configuration, but if it 
> lets your users run, then I don't see a problem with adding it to your 
> environment.
> 
> If I'm reading things correctly, 305 turns off RDMA PUT/GET and turns on SEND.
> 
> OpenFabrics gurus - please correct me if I'm wrong :-).
> 
> Samuel Gutierrez
> Los Alamos National Laboratory
> 
> 
>> 
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> bro...@umich.edu
>> (734)936-1985
>> 
>>> 
>>> Thanks,
>>> 
>>> Samuel Gutierrez
>>> Los Alamos National Laboratory
>>> 
>>> On May 13, 2011, at 2:38 PM, Brock Palen wrote:
>>> 
 On May 13, 2011, at 4:09 PM, Dave Love wrote:
 
> Jeff Squyres  writes:
> 
>> On May 11, 2011, at 3:21 PM, Dave Love wrote:
>> 
>>> We can reproduce it with IMB.  We could provide access, but we'd have to
>>> negotiate with the owners of the relevant nodes to give you interactive
>>> access to them.  Maybe Brock's would be more accessible?  (If you
>>> contact me, I may not be able to respond for a few days.)
>> 
>> Brock has replied off-list that he, too, is able to reliably reproduce 
>> the issue with IMB, and is working to get access for us.  Many thanks 
>> for your offer; let's see where Brock's access takes us.
> 
> Good.  Let me know if we could be useful
> 
 -- we have not closed this issue,
>>> 
>>> Which issue?   I couldn't find a relevant-looking one.
>> 
>> https://svn.open-mpi.org/trac/ompi/ticket/2714
> 
> Thanks.  In case it's useful info, it hangs for me with 1.5.3 & np=32 on
> connectx with more than one collective I can't recall.
 
 Extra data point, that ticket said it ran with mpi_preconnect_mpi 1,  well 
 that doesn't help here, both my production code (crash) and IMB still hang.
 
 
 Brock Palen
 www.umich.edu/~brockp
 Center for Advanced Computing
 bro...@umich.edu
 (734)936-1985
 
> 
> -- 
> Excuse the typping -- I have a broken wrist
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
 
 
 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> 
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

George Bosilca
Research Assistant Professor
Innovative Computing Laboratory
Department of Electrical Engineering and Computer Science
University of Tennessee, Knoxville
http://web.eecs.utk.edu/~bosilca/




Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-16 Thread Brock Palen



On May 16, 2011, at 10:23 AM, Samuel K. Gutierrez wrote:

> Hi,
> 
> Just out of curiosity - what happens when you add the following MCA option to 
> your openib runs?
> 
> -mca btl_openib_flags 305

You Sir found the magic combination.
I verified this lets IMB and CRASH progress pass their lockup points,
I will have a user test this, 
Is this an ok option to put in our environment?  What does 305 mean?


Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985

> 
> Thanks,
> 
> Samuel Gutierrez
> Los Alamos National Laboratory
> 
> On May 13, 2011, at 2:38 PM, Brock Palen wrote:
> 
>> On May 13, 2011, at 4:09 PM, Dave Love wrote:
>> 
>>> Jeff Squyres  writes:
>>> 
 On May 11, 2011, at 3:21 PM, Dave Love wrote:
 
> We can reproduce it with IMB.  We could provide access, but we'd have to
> negotiate with the owners of the relevant nodes to give you interactive
> access to them.  Maybe Brock's would be more accessible?  (If you
> contact me, I may not be able to respond for a few days.)
 
 Brock has replied off-list that he, too, is able to reliably reproduce the 
 issue with IMB, and is working to get access for us.  Many thanks for your 
 offer; let's see where Brock's access takes us.
>>> 
>>> Good.  Let me know if we could be useful
>>> 
>> -- we have not closed this issue,
> 
> Which issue?   I couldn't find a relevant-looking one.
 
 https://svn.open-mpi.org/trac/ompi/ticket/2714
>>> 
>>> Thanks.  In case it's useful info, it hangs for me with 1.5.3 & np=32 on
>>> connectx with more than one collective I can't recall.
>> 
>> Extra data point, that ticket said it ran with mpi_preconnect_mpi 1,  well 
>> that doesn't help here, both my production code (crash) and IMB still hang.
>> 
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> bro...@umich.edu
>> (734)936-1985
>> 
>>> 
>>> -- 
>>> Excuse the typping -- I have a broken wrist
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> 
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 




Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-16 Thread Samuel K. Gutierrez
Hi,

Just out of curiosity - what happens when you add the following MCA option to 
your openib runs?

-mca btl_openib_flags 305

Thanks,

Samuel Gutierrez
Los Alamos National Laboratory

On May 13, 2011, at 2:38 PM, Brock Palen wrote:

> On May 13, 2011, at 4:09 PM, Dave Love wrote:
> 
>> Jeff Squyres  writes:
>> 
>>> On May 11, 2011, at 3:21 PM, Dave Love wrote:
>>> 
 We can reproduce it with IMB.  We could provide access, but we'd have to
 negotiate with the owners of the relevant nodes to give you interactive
 access to them.  Maybe Brock's would be more accessible?  (If you
 contact me, I may not be able to respond for a few days.)
>>> 
>>> Brock has replied off-list that he, too, is able to reliably reproduce the 
>>> issue with IMB, and is working to get access for us.  Many thanks for your 
>>> offer; let's see where Brock's access takes us.
>> 
>> Good.  Let me know if we could be useful
>> 
> -- we have not closed this issue,
 
 Which issue?   I couldn't find a relevant-looking one.
>>> 
>>> https://svn.open-mpi.org/trac/ompi/ticket/2714
>> 
>> Thanks.  In case it's useful info, it hangs for me with 1.5.3 & np=32 on
>> connectx with more than one collective I can't recall.
> 
> Extra data point, that ticket said it ran with mpi_preconnect_mpi 1,  well 
> that doesn't help here, both my production code (crash) and IMB still hang.
> 
> 
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
>> 
>> -- 
>> Excuse the typping -- I have a broken wrist
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users