Re: [OMPI users] Using POSIX shared memory as send buffer

2015-10-02 Thread Brice Goglin


On 28/09/2015 21:44, Dave Goodell (dgoodell) wrote:
> It may have to do with NUMA effects and the way you're allocating/touching 
> your shared memory vs. your private (malloced) memory.  If you have a 
> multi-NUMA-domain system (i.e., any 2+ socket server, and even some 
> single-socket servers) then you are likely to run into this sort of issue.  
> The PCI bus on which your IB HCA communicates is almost certainly closer to 
> one NUMA domain than the others, and performance will usually be worse if you 
> are sending/receiving from/to a "remote" NUMA domain.
>
> "lstopo" and other tools can sometimes help you get a handle on the 
> situation, though I don't know if it knows how to show memory affinity.

So, you'd like "lstopo --ps" or "hwloc-ps" for displaying memory binding
and/or memory location instead of CPU binding? Shouldn't be too hard.

Brice



>   I think you can find memory affinity for a process via 
> "/proc/<pid>/numa_maps".  There's lots of info about NUMA affinity here: 
> https://queue.acm.org/detail.cfm?id=2513149
>




Re: [OMPI users] Using POSIX shared memory as send buffer

2015-09-30 Thread marcin.krotkiewski

Hi, Nathan

I have compiled 2.x with your patch. I must say it works _much_ better 
with your changes. I have no idea how you figured that out! A short 
table with my bandwidth calculations (MB/s)


            PROT_READ      PROT_READ | PROT_WRITE
1.10.0      2500           5700
2.x+patch   4800-5200      5700

That is not a very thorough study, but essentially I was getting 
2500 MB/s with read-only shm. With your patch it is somewhat shaky (very 
occasionally I still get 2500), but most of the time it is around 5000 MB/s.


It seems that mmapping the memory read-write still yields marginally better 
results. Again, I do not have very solid data to support this - just a 
bunch of runs.
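
For reference, the kind of probe behind these numbers can be as simple as the 
following sketch (illustrative only - message size, iteration count and the 
malloc'ed buffer are placeholders; the shm mapping under test goes in its place):

/* Illustrative sketch only: time repeated sends of a large buffer between
 * two ranks and report MB/s. Run with at least two ranks (mpirun -np 2).
 * Swap the malloc'ed buffer for the mmap'ed shm pointer to compare. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int len = 64 * 1024 * 1024;   /* 64 MB message, illustrative */
    const int iters = 20;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(len);            /* or the shm mapping under test */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0)
            MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("%.0f MB/s\n", iters * (len / 1.0e6) / (t1 - t0));

    free(buf);
    MPI_Finalize();
    return 0;
}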


Do you have an idea why such a performance difference exists?

Thanks a lot!

Marcin


On 09/30/2015 12:37 AM, Nathan Hjelm wrote:

There was a bug in that patch that affected IB systems. Updated patch:

https://github.com/hjelmn/ompi/commit/c53df23c0bcf8d1c531e04d22b96c8c19f9b3fd1.patch

-Nathan


Re: [OMPI users] Using POSIX shared memory as send buffer

2015-09-29 Thread Nathan Hjelm

There was a bug in that patch that affected IB systems. Updated patch:

https://github.com/hjelmn/ompi/commit/c53df23c0bcf8d1c531e04d22b96c8c19f9b3fd1.patch

-Nathan


Re: [OMPI users] Using POSIX shared memory as send buffer

2015-09-29 Thread Nathan Hjelm

I have a branch with the changes available at:

https://github.com/hjelmn/ompi.git

in the mpool_update branch. If you prefer you can apply this patch to
either a 2.x or a master tarball.

https://github.com/hjelmn/ompi/commit/8839dbfae85ba8f443b2857f9bbefdc36c4ebc1a.patch

Let me know if this resolves the performance issues.

-Nathan

On Tue, Sep 29, 2015 at 09:57:54PM +0200, marcin.krotkiewski wrote:
>I've now run a few more tests and I think I can reasonably confidently say
>that the read only mmap is a problem. Let me know if you have a possible
>fix - I will gladly test it.
> 
>Marcin
> 





Re: [OMPI users] Using POSIX shared memory as send buffer

2015-09-29 Thread marcin.krotkiewski


I've now run a few more tests and I think I can reasonably confidently 
say that the read only mmap is a problem. Let me know if you have a 
possible fix - I will gladly test it.


Marcin


On 09/29/2015 04:59 PM, Nathan Hjelm wrote:

We register the memory with the NIC for both read and write access. This
may be the source of the slowdown. We recently added internal support to
allow the point-to-point layer to specify the access flags but the
openib btl does not yet make use of the new support. I plan to make the
necessary changes before the 2.0.0 release. I should have them complete
later this week. I can send you a note when they are ready if you would
like to try it and see if it addresses the problem.

-Nathan
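
For context, the registration Nathan describes happens at the verbs level via 
the access flags passed to ibv_reg_mr(). The sketch below only illustrates the 
idea; it is not the actual openib btl code, and the helper and its flag choices 
are illustrative assumptions:

/* Rough sketch, not Open MPI code: with libibverbs a buffer is registered
 * with ibv_reg_mr(), and the kernel refuses to grant access rights that the
 * mapping's protection does not allow (e.g. local write on PROT_READ memory). */
#include <stddef.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_send_buffer(struct ibv_pd *pd, void *buf, size_t len,
                                    int mapping_is_read_only)
{
    /* Typical "register for everything" flags used for generic RDMA buffers. */
    int access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ;

    /* Hypothetical send-only registration: no write access requested, so it
     * can also succeed on a PROT_READ-only mapping. */
    if (mapping_is_read_only)
        access = IBV_ACCESS_REMOTE_READ;

    return ibv_reg_mr(pd, buf, len, access);
}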





Re: [OMPI users] Using POSIX shared memory as send buffer

2015-09-29 Thread Marcin Krotkiewski

Thanks, Dave.

I have verified the memory locality and IB card locality, all's fine.

Quite accidentally I have found that there is a huge penalty if I mmap 
the shm with PROT_READ only. Using PROT_READ | PROT_WRITE yields good 
results, although I must look at this further. I'll report when I am 
certain, in case somebody finds this useful.
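
Concretely, the two mappings being compared look roughly like the sketch 
below (segment name and size are placeholders, error checking omitted; link 
with -lrt on older glibc):

/* Sketch of the two variants. The sender only ever reads the buffer, so the
 * mappings are functionally equivalent - only the protection flags differ. */
#include <fcntl.h>
#include <sys/mman.h>

#define SHM_NAME "/my_segment"            /* appears as /dev/shm/my_segment */
#define SHM_SIZE (64UL * 1024 * 1024)

void *map_readonly(void)
{
    int fd = shm_open(SHM_NAME, O_RDONLY, 0);
    return mmap(NULL, SHM_SIZE, PROT_READ, MAP_SHARED, fd, 0);
}

void *map_readwrite(void)
{
    int fd = shm_open(SHM_NAME, O_RDWR, 0);   /* O_RDWR needed for PROT_WRITE on MAP_SHARED */
    return mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}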


Is this an OS feature, or is Open MPI somehow working differently? I 
don't suspect you guys write to the send buffer, right? Even if you 
did, there would be a segfault. So I guess it could be the OS preventing 
any writes to that mapping that introduced the overhead?


Marcin



On 09/28/2015 09:44 PM, Dave Goodell (dgoodell) wrote:

On Sep 27, 2015, at 1:38 PM, marcin.krotkiewski  
wrote:

Hello, everyone

I am struggling a bit with IB performance when sending data from a POSIX shared 
memory region (/dev/shm). The memory is shared among many MPI processes within 
the same compute node. Essentially, I see somewhat erratic performance, and it 
seems that my code is roughly twice as slow as when using a usual, malloced 
send buffer.

It may have to do with NUMA effects and the way you're allocating/touching your shared 
memory vs. your private (malloced) memory.  If you have a multi-NUMA-domain system (i.e., 
any 2+ socket server, and even some single-socket servers) then you are likely to run 
into this sort of issue.  The PCI bus on which your IB HCA communicates is almost 
certainly closer to one NUMA domain than the others, and performance will usually be 
worse if you are sending/receiving from/to a "remote" NUMA domain.

"lstopo" and other tools can sometimes help you get a handle on the situation, though I don't 
know if it knows how to show memory affinity.  I think you can find memory affinity for a process via 
"/proc/<pid>/numa_maps".  There's lots of info about NUMA affinity here: 
https://queue.acm.org/detail.cfm?id=2513149

-Dave





Re: [OMPI users] Using POSIX shared memory as send buffer

2015-09-28 Thread Dave Goodell (dgoodell)
On Sep 27, 2015, at 1:38 PM, marcin.krotkiewski  
wrote:
> 
> Hello, everyone
> 
> I am struggling a bit with IB performance when sending data from a POSIX 
> shared memory region (/dev/shm). The memory is shared among many MPI 
> processes within the same compute node. Essentially, I see somewhat erratic 
> performance, and it seems that my code is roughly twice as slow as when 
> using a usual, malloced send buffer.

It may have to do with NUMA effects and the way you're allocating/touching your 
shared memory vs. your private (malloced) memory.  If you have a 
multi-NUMA-domain system (i.e., any 2+ socket server, and even some 
single-socket servers) then you are likely to run into this sort of issue.  The 
PCI bus on which your IB HCA communicates is almost certainly closer to one 
NUMA domain than the others, and performance will usually be worse if you are 
sending/receiving from/to a "remote" NUMA domain.

"lstopo" and other tools can sometimes help you get a handle on the situation, 
though I don't know if it knows how to show memory affinity.  I think you can 
find memory affinity for a process via "/proc/<pid>/numa_maps".  There's lots 
of info about NUMA affinity here: https://queue.acm.org/detail.cfm?id=2513149
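
Besides numa_maps, one can also ask the kernel directly which node backs a 
given buffer; a minimal sketch using move_pages(2) follows (the helper name 
is made up, link with -lnuma):

/* Minimal sketch: report which NUMA node backs the first page of a buffer.
 * With nodes == NULL, move_pages(2) is a pure query; status[0] receives the
 * node id, or a negative errno (e.g. -ENOENT if the page was never touched). */
#include <numaif.h>

int buffer_numa_node(void *buf)
{
    void *pages[1]  = { buf };
    int   status[1] = { -1 };

    if (move_pages(0 /* current process */, 1, pages, NULL, status, 0) != 0)
        return -1;

    return status[0];
}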

-Dave



[OMPI users] Using POSIX shared memory as send buffer

2015-09-27 Thread marcin.krotkiewski

Hello, everyone

I am struggling a bit with IB performance when sending data from a POSIX 
shared memory region (/dev/shm). The memory is shared among many MPI 
processes within the same compute node. Essentially, I see somewhat 
erratic performance, and it seems that my code is roughly twice as slow 
as when using a usual, malloced send buffer.
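
For concreteness, the setup looks roughly like the sketch below (segment name, 
size and the fill are placeholders; error checking and synchronization between 
the processes sharing the segment are omitted):

/* Stripped-down sketch of the scenario: create a POSIX shm segment, fill it,
 * and use the mapping directly as the MPI send buffer. Run with >= 2 ranks. */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const int len = 64 * 1024 * 1024;            /* 64 MB, illustrative */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int fd = shm_open("/sendbuf", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, len);
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    if (rank == 0)
        memset(buf, 0, len);                     /* producer touches/fills the pages */

    if (rank == 0) {
        MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);   /* shm as send buffer */
    } else if (rank == 1) {
        char *dst = malloc(len);
        MPI_Recv(dst, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        free(dst);
    }

    munmap(buf, len);
    close(fd);
    if (rank == 0)
        shm_unlink("/sendbuf");
    MPI_Finalize();
    return 0;
}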


I was wondering - have any of you had experience with sending from SHM over 
InfiniBand? Why would I see so much worse results? Is it, e.g., because 
this memory cannot be pinned and Open MPI is reallocating it? Or is it 
some OS peculiarity?


I would appreciate any hints at all. Thanks a lot!

Marcin