Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions

2008-02-13 Thread Christian Bell
On Wed, 13 Feb 2008, Christoph Lameter wrote:

 Right. We (SGI) have done something like this for a long time with XPmem 
 and it scales ok.

I'd dispute this based on experience developing PGAS language support
on the Altix, but more importantly (and less subjectively), I think
that "scales ok" refers to a very specific case.  Sure, pages (and/or
regions) can be large on some systems and the number of systems may
not always be in the thousands but you're still claiming scalability
for a mechanism that essentially logs who accesses the regions.  Then
there's the fact that reclaim becomes a collective communication
operation over all region accessors.  Makes me nervous.

  When messages are sufficiently large, the control messaging necessary
  to setup/teardown the regions is relatively small.  This is not
  always the case however -- in programming models that employ smaller
  messages, the one-sided nature of RDMA is the most attractive part of
  it.  
 
 The messaging would only be needed if a process comes under memory 
 pressure. As long as there is enough memory nothing like this will occur.
 
  Nothing any communication/runtime system can't already do today.  The
  point of RDMA demand paging is enabling the possibility of using RDMA
  without the implied synchronization -- the optimistic part.  Using
  the notifiers to duplicate existing memory region handling for RDMA
  hardware that doesn't have HW page tables is possible but undermines
  the more important consumer of your patches in my opinion.
 

 The notifier scheme should integrate into existing memory region 
 handling and not cause a duplication. If you already have library layers 
 that do this then it should be possible to integrate it.

I appreciate that you're trying to make a general case for the
applicability of notifiers on all types of existing RDMA hardware and
wire protocols.  Also, I'm not disagreeing whether a HW page table
is required or not: clearly it's not required to make *some* use of
the notifier scheme.

However, short of providing user-level notifications for pinned pages
that are inadvertently released to the O/S, I don't believe that the
patchset provides any significant added value for the HPC community
that can't optimistically do RDMA demand paging.


. . christian


-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions

2008-02-12 Thread Christian Bell
On Tue, 12 Feb 2008, Christoph Lameter wrote:

 On Tue, 12 Feb 2008, Jason Gunthorpe wrote:
 
  The problem is that the existing wire protocols do not have a
  provision for doing an 'are you ready' or 'I am not ready' exchange
  and they are not designed to store page tables on both sides as you
  propose. The remote side can send RDMA WRITE traffic at any time after
  the RDMA region is established. The local side must be able to handle
  it. There is no way to signal that a page is not ready and the remote
  should not send.
  
  This means the only possible implementation is to stall/discard at the
  local adaptor when a RDMA WRITE is received for a page that has been
  reclaimed. This is what leads to deadlock/poor performance.

You're arguing that a HW page table is not needed by describing a use
case that is essentially what all RDMA solutions already do above the
wire protocols (all solutions except Quadrics, of course).

 You would only use the wire protocols *after* having established the RDMA 
 region. The notifier chains allow a RDMA region (or parts thereof) to be 
 torn down on demand by the VM. The region can be reestablished if one of 
 the sides accesses it. I hope I got that right. Not much exposure to 
 Infiniband so far.

RDMA is already always used *after* memory regions are set up --
they are set up out-of-band w.r.t. RDMA, but essentially this is the
"before" part.

 Let's say you have two systems, A and B. Each has its memory region, MemA 
 and MemB. Each side also has page tables for this region PtA and PtB.
 
 Now you establish a RDMA connection between both sides. The pages in both
 MemB and MemA are present and so are entries in PtA and PtB. RDMA 
 traffic can proceed.
 
 The VM on system A now gets into a situation in which memory becomes 
 heavily used by another (maybe non-RDMA) process and after checking that 
 there was no recent reference to MemA and MemB (via a notifier aging 
 callback) decides to reclaim the memory from MemA.
 
 In that case it will notify the RDMA subsystem on A that it is trying to
 reclaim a certain page.
 
 The RDMA subsystem on A will then send a message to B notifying it that 
 the memory will be going away. B now has to remove its corresponding page 
 from memory (and drop the entry in PtB) and confirm to A that this has 
 happened. RDMA traffic is then stopped for this page. Then A can also 
 remove its page, the corresponding entry in PtA and the page is reclaimed 
 or pushed out to swap completing the page reclaim.
 
 If either side then accesses the page again then the reverse process 
 happens. If B accesses the page then it will first of all incur a page 
 fault because the entry in PtB is missing. The fault will then cause a 
 message to be sent to A to establish the page again. A will create an 
 entry in PtA and will then confirm to B that the page was established. At 
 that point RDMA operations can occur again.

The notifier-reclaim cycle you describe is akin to the out-of-band
pin-unpin control messages used by existing communication libraries.
Also, I think what you are proposing can have problems at scale -- A
must keep track of all of the (potentially many) systems accessing MemA
and cooperatively get agreement from all of them before reclaiming
the page.
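
To make the scaling concern concrete, here is a toy model (not taken
from the patchset).  It assumes the owner of a page records every remote
accessor and may only reclaim once each accessor has dropped its mapping
and acknowledged.  The names Owner, Accessor and reclaim_cost are
hypothetical, and the "network" is just a method call that counts
control messages.

```python
class Accessor:
    """Remote system holding a mapping (its PtB entry) for the owner's page."""

    def __init__(self, name):
        self.name = name
        self.mapped = set()

    def invalidate(self, page):
        # Drop the local page-table entry and acknowledge to the owner.
        self.mapped.discard(page)
        return True


class Owner:
    """System A: must remember who maps each page before it can reclaim."""

    def __init__(self):
        self.accessors = {}    # page -> list of Accessor
        self.control_msgs = 0  # control messages exchanged so far

    def establish(self, page, accessor):
        self.accessors.setdefault(page, []).append(accessor)
        accessor.mapped.add(page)
        self.control_msgs += 2  # request + confirmation

    def reclaim(self, page):
        # Collective step: one invalidate and one ack per accessor.
        for acc in self.accessors.pop(page, []):
            acked = acc.invalidate(page)
            assert acked
            self.control_msgs += 2
        return True


def reclaim_cost(n_accessors, page=0):
    """Control messages needed to reclaim one page mapped by n_accessors."""
    owner = Owner()
    peers = [Accessor(f"B{i}") for i in range(n_accessors)]
    for p in peers:
        owner.establish(page, p)
    before = owner.control_msgs
    owner.reclaim(page)
    assert all(page not in p.mapped for p in peers)
    return owner.control_msgs - before
```

In this model the reclaim cost grows linearly with the number of
accessors, and the owner's bookkeeping grows with it; a real
implementation could batch or tree-structure the exchange, but it still
has to reach every accessor.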

When messages are sufficiently large, the control messaging necessary
to setup/teardown the regions is relatively small.  This is not
always the case however -- in programming models that employ smaller
messages, the one-sided nature of RDMA is the most attractive part of
it.  
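
A back-of-the-envelope sketch of that trade-off, under assumed numbers:
four control messages per transfer (setup request/ack, teardown
request/ack) of 64 bytes each.  Both figures are illustrative
assumptions, not measurements, and the model counts bytes only; it
ignores the round-trip latency that the synchronization also adds.

```python
def control_overhead(payload_bytes, control_msgs=4, control_msg_bytes=64):
    """Fraction of bytes on the wire spent on setup/teardown control traffic.

    control_msgs and control_msg_bytes are assumed values for illustration.
    """
    control = control_msgs * control_msg_bytes
    return control / (control + payload_bytes)


# Overhead fraction for an 8-byte, a 256-byte and a 1 MiB message.
for size in (8, 256, 1 << 20):
    print(size, round(control_overhead(size), 4))
```

With these assumptions the control traffic is negligible for a 1 MiB
transfer, equals the payload at 256 bytes, and dominates for an 8-byte
put.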

 So the whole scheme does not really need a hardware page table in the RDMA 
 hardware. The page tables of the two systems A and B are sufficient.
 
 The scheme can also be applied to a larger range than only a single page. 
 The RDMA subsystem could tear down a large section when reclaim is 
 pushing on it and then reestablish it as needed.

Nothing any communication/runtime system can't already do today.  The
point of RDMA demand paging is enabling the possibility of using RDMA
without the implied synchronization -- the optimistic part.  Using
the notifiers to duplicate existing memory region handling for RDMA
hardware that doesn't have HW page tables is possible but undermines
the more important consumer of your patches in my opinion.

One other area that has not been brought up yet (I think) is the
applicability of notifiers in letting users know when pinned memory
is reclaimed by the kernel.  This is useful when a lower-level
library employs lazy deregistration strategies on memory regions that
are subsequently released to the kernel via the application's use of
munmap or sbrk.  Ohio Supercomputing Center has work in this area but
a generalized approach in the kernel would certainly be welcome.
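
A minimal sketch of the lazy deregistration problem, assuming a
user-level registration cache keyed by (address, length).  The class
RegistrationCache and its on_unmap hook are hypothetical; on_unmap
stands in for whatever notification the kernel would deliver when the
application releases the range via munmap or sbrk.

```python
class RegistrationCache:
    """Caches pinned registrations so repeated transfers skip re-pinning."""

    def __init__(self):
        self._entries = {}     # (addr, length) -> registration handle
        self._next_handle = 1
        self.pin_calls = 0     # counts the expensive pin/register operations

    def register(self, addr, length):
        key = (addr, length)
        if key not in self._entries:
            self.pin_calls += 1  # stands in for the real registration call
            self._entries[key] = self._next_handle
            self._next_handle += 1
        return self._entries[key]

    def deregister(self, addr, length):
        # Lazy: keep the registration cached in case the buffer is reused.
        pass

    def on_unmap(self, addr, length):
        # Hypothetical notifier callback: drop every cached range that
        # overlaps [addr, addr + length) and report how many were dropped.
        end = addr + length
        stale = [k for k in self._entries if k[0] < end and addr < k[0] + k[1]]
        for k in stale:
            del self._entries[k]
        return len(stale)
```

Without the on_unmap call, a buffer that is unmapped and later mapped
again at the same virtual address would hit the stale cache entry, and
the library would hand the adapter a registration that refers to pages
the process no longer owns.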


. . christian

-- 
[EMAIL PROTECTED]
(QLogic Host Solutions Group, formerly Pathscale)
