Re: [ofa-general] Re: Demand paging for memory regions
On Wed, 13 Feb 2008, Jason Gunthorpe wrote: > Christoph: It seemed to me you were first talking about > freeing/swapping/faulting RDMA'able pages - but would pure migration > as a special hardware supported case be useful like Catilan suggested? That is a special case of the proposed solution. You could mlock the regions of interest. Those can then only be migrated but not swapped out. However, I think we need some limit on the number of pages one can mlock. Otherwise the VM can get into a situation where reclaim is not possible because the majority of memory is either mlocked or pinned by I/O etc. ___ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [ofa-general] Re: Demand paging for memory regions
> -Original Message- > From: Christoph Lameter [mailto:[EMAIL PROTECTED] > Sent: Friday, February 15, 2008 2:50 PM > To: Caitlin Bestler > Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; > [email protected]; [EMAIL PROTECTED] > Subject: RE: [ofa-general] Re: Demand paging for memory regions > > On Fri, 15 Feb 2008, Caitlin Bestler wrote: > > > There isn't much point in the RDMA layer subscribing to mmu > > notifications > > if the specific RDMA device will not be able to react appropriately > when > > the notification occurs. I don't see how you get around needing to > know > > which devices are capable of supporting page migration (via > > suspend/resume > > or other mechanisms) and which can only respond to a page migration > by > > aborting connections. > > You either register callbacks if the device can react properly or you > dont. If you dont then the device will continue to have the problem > with > page pinning etc until someone comes around and implements the > mmu callbacks to fix these issues. > > I have doubts regarding the claim that some devices just cannot be made > to > suspend and resume appropriately. They obviously can be shutdown and so > its a matter of sequencing the things the right way. I.e. stop the app > wait for a quiet period then release resources etc. > > That is true. What some devices will be unable to do is suspend and resume in a manner that is transparent to the application. However, for the duration required to re-arrange pages it is definitely feasible to do so transparently to the application. Presumably the Virtual Memory Manager would be more willing to take an action that is transparent to the user than one that is disruptive, although obviously as the owner of the physical memory it has the right to do either. ___ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [ofa-general] Re: Demand paging for memory regions
On Fri, 15 Feb 2008, Caitlin Bestler wrote: > There isn't much point in the RDMA layer subscribing to mmu > notifications > if the specific RDMA device will not be able to react appropriately when > the notification occurs. I don't see how you get around needing to know > which devices are capable of supporting page migration (via > suspend/resume > or other mechanisms) and which can only respond to a page migration by > aborting connections. You either register callbacks if the device can react properly or you dont. If you dont then the device will continue to have the problem with page pinning etc until someone comes around and implements the mmu callbacks to fix these issues. I have doubts regarding the claim that some devices just cannot be made to suspend and resume appropriately. They obviously can be shutdown and so its a matter of sequencing the things the right way. I.e. stop the app wait for a quiet period then release resources etc. ___ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [ofa-general] Re: Demand paging for memory regions
Christoph Lameter wrote > > > Merely mlocking pages deals with the end-to-end RDMA semantics. > > What still needs to be addressed is how a fastpath interface > > would dynamically pin and unpin. Yielding pins for short-term > > suspensions (and flushing cached translations) deals with the > > rest. Understanding the range of support that existing devices > > could provide with software updates would be the next step if > > you wanted to pursue this. > > That is addressed on the VM level by the mmu_notifier which started > this whole thread. The RDMA layers need to subscribe to this notifier > and then do whatever the hardware requires to unpin and pin memory. > I can only go as far as dealing with the VM layer. If you have any > issues there I'd be glad to help. There isn't much point in the RDMA layer subscribing to mmu notifications if the specific RDMA device will not be able to react appropriately when the notification occurs. I don't see how you get around needing to know which devices are capable of supporting page migration (via suspend/resume or other mechanisms) and which can only respond to a page migration by aborting connections. ___ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [ofa-general] Re: Demand paging for memory regions
On Fri, 15 Feb 2008, Caitlin Bestler wrote: > So that would mean that mlock is used by the application before it > registers memory for direct access, and then it is up to the RDMA > layer and the OS to negotiate actual pinning of the addresses for > whatever duration is required. Right. > There is no *protocol* barrier to replacing pages within a Memory > Region as long as it is done in a way that keeps the content of > those page coherent. But existing devices have their own ideas > on how this is done and existing devices are notoriously poor at > learning new tricks. H.. Okay. But that is mainly a device driver maintenance issue. > Merely mlocking pages deals with the end-to-end RDMA semantics. > What still needs to be addressed is how a fastpath interface > would dynamically pin and unpin. Yielding pins for short-term > suspensions (and flushing cached translations) deals with the > rest. Understanding the range of support that existing devices > could provide with software updates would be the next step if > you wanted to pursue this. That is addressed on the VM level by the mmu_notifier which started this whole thread. The RDMA layers need to subscribe to this notifier and then do whatever the hardware requires to unpin and pin memory. I can only go as far as dealing with the VM layer. If you have any issues there I'd be glad to help. ___ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [ofa-general] Re: Demand paging for memory regions
> -Original Message- > From: Christoph Lameter [mailto:[EMAIL PROTECTED] > Sent: Friday, February 15, 2008 10:46 AM > To: Caitlin Bestler > Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; > [email protected]; [EMAIL PROTECTED] > Subject: RE: [ofa-general] Re: Demand paging for memory regions > > On Fri, 15 Feb 2008, Caitlin Bestler wrote: > > > > What does it mean that the "application layer has to be determine > what > > > pages are registered"? The application does not know which of its > > pages > > > are currently in memory. It can only force these pages to stay in > > > memory if their are mlocked. > > > > > > > An application that advertises an RDMA accessible buffer > > to a remote peer *does* have to know that its pages *are* > > currently in memory. > > Ok that would mean it needs to inform the VM of that issue by mlocking > these pages. > > > But the more fundamental issue is recognizing that applications > > that use direct interfaces need to know that buffers that they > > enable truly have committed resources. They need a way to > > ask for twenty *real* pages, not twenty pages of address > > space. And they need to do it in a way that allows memory > > to be rearranged or even migrated with them to a new host. > > mlock will force the pages to stay in memory without requiring the OS > to keep them where they are. So that would mean that mlock is used by the application before it registers memory for direct access, and then it is up to the RDMA layer and the OS to negotiate actual pinning of the addresses for whatever duration is required. There is no *protocol* barrier to replacing pages within a Memory Region as long as it is done in a way that keeps the content of those pages coherent. But existing devices have their own ideas on how this is done and existing devices are notoriously poor at learning new tricks. Merely mlocking pages deals with the end-to-end RDMA semantics. What still needs to be addressed is how a fastpath interface would dynamically pin and unpin. Yielding pins for short-term suspensions (and flushing cached translations) deals with the rest. Understanding the range of support that existing devices could provide with software updates would be the next step if you wanted to pursue this. ___ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
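To make the mlock-before-register flow described above concrete, here is a minimal userspace sketch using standard POSIX and libibverbs calls. It is an illustration only, not code from the thread: the helper name and the choice of access flags are arbitrary, and error handling is reduced to early returns.

#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <infiniband/verbs.h>

/* Lock a buffer into memory, then hand it to the RDMA layer.  mlock()
 * keeps the pages resident, but the kernel may still migrate them --
 * which is exactly the case the mmu notifier would have to cover on
 * behalf of the driver. */
static struct ibv_mr *register_locked_buf(struct ibv_pd *pd, size_t len)
{
    void *buf;

    if (posix_memalign(&buf, sysconf(_SC_PAGESIZE), len))
        return NULL;

    if (mlock(buf, len)) {          /* pages stay resident from here on */
        free(buf);
        return NULL;
    }

    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}

Nothing here prevents page migration; it only takes swap-out off the table, which is the division of labour being discussed.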
RE: [ofa-general] Re: Demand paging for memory regions
On Fri, 15 Feb 2008, Caitlin Bestler wrote: > > What does it mean that the "application layer has to be determine what > > pages are registered"? The application does not know which of its > pages > > are currently in memory. It can only force these pages to stay in > > memory if their are mlocked. > > > > An application that advertises an RDMA accessible buffer > to a remote peer *does* have to know that its pages *are* > currently in memory. Ok that would mean it needs to inform the VM of that issue by mlocking these pages. > But the more fundamental issue is recognizing that applications > that use direct interfaces need to know that buffers that they > enable truly have committed resources. They need a way to > ask for twenty *real* pages, not twenty pages of address > space. And they need to do it in a way that allows memory > to be rearranged or even migrated with them to a new host. mlock will force the pages to stay in memory without requiring the OS to keep them where they are. ___ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [ofa-general] Re: Demand paging for memory regions
Christoph Lameter asked: > > What does it mean that the "application layer has to be determine what > pages are registered"? The application does not know which of its pages > are currently in memory. It can only force these pages to stay in > memory if their are mlocked. > An application that advertises an RDMA accessible buffer to a remote peer *does* have to know that its pages *are* currently in memory. The application does *not* need for the virtual-to-physical mapping of those pages to be frozen for the lifespan of the Memory Region. But it is issuing an invitation to its peer to perform direct writes to the advertised buffer. When the peer decides to exercise that invitation the pages have to be there. An analogy: when you write a check for $100 you do not have to identify the serial numbers of ten $10 bills, but you are expected to have the funds in your account. Issuing a buffer advertisement for memory you do not have is the network equivalent of writing a check that you do not have funds for. Now, just as your bank may offer overdraft protection, an RDMA device could merely report a page fault rather than tearing down the connection itself. But that does not grant permission for applications to advertise buffer space that they do not have committed, it merely helps recovery from a programming fault. A suspend/resume interface between the Virtual Memory Manager and the RDMA layer allows pages to be re-arranged at the convenience of the Virtual Memory Manager without breaking the application layer peer-to-peer contract. The current interfaces that pin exact pages are really the equivalent of having to tell the bank that when Joe cashes this $100 check that you should give him *these* ten $10 bills. It works, but it adds too much overhead and is very inflexible. So there are a lot of good reasons to evolve this interface to better deal with these issues. Other areas of possible evolution include allowing growing or trimming of Memory Regions without invalidating their advertised handles. But the more fundamental issue is recognizing that applications that use direct interfaces need to know that buffers that they enable truly have committed resources. They need a way to ask for twenty *real* pages, not twenty pages of address space. And they need to do it in a way that allows memory to be rearranged or even migrated with them to a new host. ___ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] Re: Demand paging for memory regions
On Fri, Feb 15, 2008 at 07:47:06AM +1100, David Singleton wrote: > Caitlin Bestler wrote: >> But the broader question is what the goal is here. Allowing memory to >> be shuffled is valuable, and perhaps even ultimately a requirement for >> high availability systems. RDMA and other direct-access APIs should >> be evolving their interfaces to accommodate these needs. >> Oversubscribing memory is a totally different matter. If an application >> is working with memory that is oversubscribed by a factor of 2 or more >> can it really benefit from zero-copy direct placement? At first glance I >> can't see what RDMA could be bringing of value when the overhead of >> swapping is going to be that large. > > A related use case from HPC. Some of us have batch scheduling > systems based on suspend/resume of jobs (which is really just > SIGSTOP and SIGCONT of all job processes). The value of this > system is enhanced greatly by being able to page out the suspended > job (just normal Linux demand paging caused by the incoming job is > OK). Apart from this (relatively) brief period of paging, both > jobs benefit from RDMA. > > SGI kindly implemented a /proc mechanism for unpinning of XPMEM > pages to allow suspended jobs to be paged on their Altix system. > > Note that this use case would not benefit from Pete Wyckoff's > approach of notifying user applications/libraries of VM changes. We will be implementing xpmem on top of mmu_notifiers (actively working on that now) so in that case, you would no longer need to use the /proc/xpmem/ mechanism for unpinning. Hopefully, we will have xpmem in before 2.6.26 and get it into the base OS now instead of an add-on. Oh yeah, and memory migration will not need the unpin thing either so you can move smaller jobs around more easily. Thanks, Robin ___ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [ofa-general] Re: Demand paging for memory regions
On Thu, 14 Feb 2008, Caitlin Bestler wrote: > So any solution that requires the upper layers to suspend operations > for a brief bit will require explicit interaction with those layers. > No RDMA layer can perform the sleight of hand tricks that you seem > to want it to perform. Looks like it has to be up there right. > AT the RDMA layer the best you could get is very brief suspensions for > the purpose of *re-arranging* memory, not of reducing the amount of > registered memory. If you need to reduce the amount of registered memory > then you have to talk to the application. Discussions on making it > easier for the application to trim a memory region dynamically might be > in order, but you will not work around the fact that the application > layer needs to determine what pages are registered. And they would > really prefer just to be told how much memory they can have up front, > they can figure out how to deal with that amount of memory on their own. What does it mean that the "application layer has to determine what pages are registered"? The application does not know which of its pages are currently in memory. It can only force these pages to stay in memory if they are mlocked. ___ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [ofa-general] Re: Demand paging for memory regions
> -Original Message- > From: Christoph Lameter [mailto:[EMAIL PROTECTED] > Sent: Thursday, February 14, 2008 2:49 PM > To: Caitlin Bestler > Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; > [email protected]; [EMAIL PROTECTED] > Subject: Re: [ofa-general] Re: Demand paging for memory regions > > On Thu, 14 Feb 2008, Caitlin Bestler wrote: > > > I have no problem with that, as long as the application layer is > responsible for > > tearing down and re-establishing the connections. The RDMA/transport > layers > > are incapable of tearing down and re-establishing a connection > transparently > > because connections need to be approved above the RDMA layer. > > I am not that familiar with the RDMA layers but it seems that RDMA has > a library that does device driver like things right? So the logic would > best fit in there I guess. > > If you combine mlock with the mmu notifier then you can actually > guarantee that a certain memory range will not be swapped out. The > notifier will then only be called if the memory range will need to be > moved for page migration, memory unplug etc etc. There may be a limit > on > the percentage of memory that you can mlock in the future. This may be > done to guarantee that the VM still has memory to work with. > The problem is that with existing APIs, or even slightly modified APIs, the RDMA layer will not be able to figure out which connections need to be "interrupted" in order to deal with which memory suspensions. Further, because any request for a new connection will be handled by the remote *application layer* peer there is no way for the two RDMA layers to agree to covertly tear down and re-establish the connection. Nor really should there be: connections should be approved by OS layer networking controls. RDMA should not be able to tell the network stack, "trust me, you don't have to check if this connection is legitimate". Another example: if you terminate a connection, pending receive operations complete *to the user* in a Completion Queue. Those completions are NOT seen by the RDMA layer, and especially not by the Connection Manager. It has absolutely no way to repost them transparently to the same connection when the connection is re-established. Even worse, some portions of a receive operation might have been placed in the receive buffer and acknowledged to the remote peer. But there is no mechanism to report this fact in the CQE. A receive operation that is aborted is aborted. There is no concept of partial success. Therefore you cannot covertly terminate a connection mid-operation and covertly re-establish it later. Data will be lost, it will no longer be a reliable connection, and therefore it needs to be torn down anyway. The RDMA layers also cannot tell the other side not to transmit. Flow control is the responsibility of the application layer, not RDMA. What the RDMA layer could do is this: once you tell it to suspend a given memory region it can either tell you that it doesn't know how to do that or it can instruct the device to stop processing a set of connections that will cease all access for a given Memory Region. When you resume it can guarantee that it is no longer using any cached older mappings for the memory region (assuming it was capable of doing the suspend), and then because RDMA connections are reliable everything will recover unless the connection timed out. The chance that it will time out is probably low, but the chance that the underlying connection will be in slow start or equivalent is much higher. 
So any solution that requires the upper layers to suspend operations for a brief bit will require explicit interaction with those layers. No RDMA layer can perform the sleight of hand tricks that you seem to want it to perform. At the RDMA layer the best you could get is very brief suspensions for the purpose of *re-arranging* memory, not of reducing the amount of registered memory. If you need to reduce the amount of registered memory then you have to talk to the application. Discussions on making it easier for the application to trim a memory region dynamically might be in order, but you will not work around the fact that the application layer needs to determine what pages are registered. And they would really prefer just to be told how much memory they can have up front; they can figure out how to deal with that amount of memory on their own. ___ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] Re: Demand paging for memory regions
On Thu, 14 Feb 2008, Caitlin Bestler wrote: > I have no problem with that, as long as the application layer is responsible > for > tearing down and re-establishing the connections. The RDMA/transport layers > are incapable of tearing down and re-establishing a connection transparently > because connections need to be approved above the RDMA layer. I am not that familiar with the RDMA layers but it seems that RDMA has a library that does device driver like things right? So the logic would best fit in there I guess. If you combine mlock with the mmu notifier then you can actually guarantee that a certain memory range will not be swapped out. The notifier will then only be called if the memory range will need to be moved for page migration, memory unplug etc etc. There may be a limit on the percentage of memory that you can mlock in the future. This may be done to guarantee that the VM still has memory to work with. ___ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] Re: Demand paging for memory regions
On Thu, Feb 14, 2008 at 12:20 PM, Christoph Lameter <[EMAIL PROTECTED]> wrote: > On Thu, 14 Feb 2008, Caitlin Bestler wrote: > > > So suspend/resume to re-arrange pages is one thing. Suspend/resume to cover > > swapping out pages so they can be reallocated is an exercise in futility. > By the > > time you resume the connections will be broken or at the minimum damaged. > > The connections would then have to be torn down before swap out and would > have to be reestablished after the pages have been brought back from swap. > > I have no problem with that, as long as the application layer is responsible for tearing down and re-establishing the connections. The RDMA/transport layers are incapable of tearing down and re-establishing a connection transparently because connections need to be approved above the RDMA layer. Further, the teardown will have visible artifacts that the application must deal with, such as flushed Recv WQEs. This is still a case of "the RDMA device will do X and will not worry about Y". The reasons for not worrying about Y could be that the suspend will be very short, or that other mechanisms have taken care of all the Ys independently. For example, an HPC cluster that suspended the *entire* cluster would not have to worry about dropped packets. ___ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] Re: Demand paging for memory regions
Caitlin Bestler wrote: But the broader question is what the goal is here. Allowing memory to be shuffled is valuable, and perhaps even ultimately a requirement for high availability systems. RDMA and other direct-access APIs should be evolving their interfaces to accommodate these needs. Oversubscribing memory is a totally different matter. If an application is working with memory that is oversubscribed by a factor of 2 or more can it really benefit from zero-copy direct placement? At first glance I can't see what RDMA could be bringing of value when the overhead of swapping is going to be that large. A related use case from HPC. Some of us have batch scheduling systems based on suspend/resume of jobs (which is really just SIGSTOP and SIGCONT of all job processes). The value of this system is enhanced greatly by being able to page out the suspended job (just normal Linux demand paging caused by the incoming job is OK). Apart from this (relatively) brief period of paging, both jobs benefit from RDMA. SGI kindly implemented a /proc mechanism for unpinning of XPMEM pages to allow suspended jobs to be paged on their Altix system. Note that this use case would not benefit from Pete Wyckoff's approach of notifying user applications/libraries of VM changes. And one of the grand goals of HPC developers has always been to have checkpoint/restart of jobs. David ___ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] Re: Demand paging for memory regions
On Thu, 14 Feb 2008, Caitlin Bestler wrote: > So suspend/resume to re-arrange pages is one thing. Suspend/resume to cover > swapping out pages so they can be reallocated is an exercise in futility. By > the > time you resume the connections will be broken or at the minimum damaged. The connections would then have to be torn down before swap out and would have to be reestablished after the pages have been brought back from swap. ___ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] Re: Demand paging for memory regions
On Thu, Feb 14, 2008 at 11:39 AM, Christoph Lameter <[EMAIL PROTECTED]> wrote: > On Thu, 14 Feb 2008, Steve Wise wrote: > > > Note that for T3, this involves suspending _all_ rdma connections that are > in > > the same PD as the MR being remapped. This is because the driver doesn't > know > > who the application advertised the rkey/stag to. So without that > knowledge, > > all connections that _might_ rdma into the MR must be suspended. If the MR > > was only setup for local access, then the driver could track the > connections > > with references to the MR and only quiesce those connections. > > > > Point being, it will stop probably all connections that an application is > > using (assuming the application uses a single PD). > > Right but if the system starts reclaiming pages of the application then we > have a memory shortage. So the user should address that by not running > other apps concurrently. The stopping of all connections is still better > than the VM getting into major trouble. And the stopping of connections in > order to move the process memory into a more advantageous memory location > (f.e. using page migration) or stopping of connections in order to be able > to move the process memory out of a range of failing memory is certainly > good. > In that spirit, there are two important aspects of a suspend/resume API that would enable the memory manager to solve problems most effectively: 1) The device should be allowed flexibility to extend the scope of the suspend to what it is capable of implementing -- rather than being forced to say that it does not support suspend/resume merely because it does so at a different granularity. 2) It is very important that users of this API understand that it is only the RDMA device handling of incoming packets and WQEs that is being suspended. The peers are not suspended by this API, or even told that this end is suspending. Unless the suspend is kept *extremely* short there will be adverse impacts. And "short" here is measured in network terms, not human terms. The blink of an eye is *way* too long. Any external dependencies between "suspend" and "resume" will probably mean that things will not work, especially if the external entities involve a disk drive. So suspend/resume to re-arrange pages is one thing. Suspend/resume to cover swapping out pages so they can be reallocated is an exercise in futility. By the time you resume the connections will be broken or at the minimum damaged. ___ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
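Point (1) above amounts to a device reporting the granularity at which it can suspend instead of a yes/no flag. The structures below are purely hypothetical -- nothing like them exists in the verbs API -- and only illustrate the idea:

/* Hypothetical capability report; all names are made up for illustration. */
enum rdma_suspend_scope {
    RDMA_SUSPEND_NONE,      /* cannot suspend, can only abort connections     */
    RDMA_SUSPEND_MR,        /* only traffic touching the suspended MR stops   */
    RDMA_SUSPEND_PD,        /* everything under the same PD stops (e.g. T3)   */
    RDMA_SUSPEND_DEVICE,    /* all traffic on the device stops                */
};

struct rdma_suspend_caps {
    enum rdma_suspend_scope scope;
    unsigned int            max_suspend_usec;   /* how short "short" must be  */
};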
Re: [ofa-general] Re: Demand paging for memory regions
On Thu, 14 Feb 2008, Steve Wise wrote: > Note that for T3, this involves suspending _all_ rdma connections that are in > the same PD as the MR being remapped. This is because the driver doesn't know > who the application advertised the rkey/stag to. So without that knowledge, > all connections that _might_ rdma into the MR must be suspended. If the MR > was only setup for local access, then the driver could track the connections > with references to the MR and only quiesce those connections. > > Point being, it will stop probably all connections that an application is > using (assuming the application uses a single PD). Right but if the system starts reclaiming pages of the application then we have a memory shortage. So the user should address that by not running other apps concurrently. The stopping of all connections is still better than the VM getting into major trouble. And the stopping of connections in order to move the process memory into a more advantageous memory location (f.e. using page migration) or stopping of connections in order to be able to move the process memory out of a range of failing memory is certainly good. ___ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] Re: Demand paging for memory regions
On Wed, 13 Feb 2008, Kanoj Sarcar wrote: > Oh ok, yes, I did see the discussion on this; sorry I > missed it. I do see what notifiers bring to the table > now (without endorsing it :-)). > > An orthogonal question is this: is IB/rdma the only > "culprit" that elevates page refcounts? Are there no > other subsystems which do a similar thing? Yes there are actually two projects by SGI that also ran into the same issue that motivated the work on this. One is XPmem which allows sharing of process memory between different Linux instances and then there is the GRU which is a kind of DMA engine. Then there is KVM and probably multiple other drivers. ___ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] Re: Demand paging for memory regions
On Thu, Feb 14, 2008 at 8:23 AM, Steve Wise <[EMAIL PROTECTED]> wrote: > Robin Holt wrote: > > On Thu, Feb 14, 2008 at 09:09:08AM -0600, Steve Wise wrote: > >> Note that for T3, this involves suspending _all_ rdma connections that are > >> in the same PD as the MR being remapped. This is because the driver > >> doesn't know who the application advertised the rkey/stag to. So without > > > > Is there a reason the driver can not track these. > > > > Because advertising of a MR (ie telling the peer about your rkey/stag, > offset and length) is application-specific and can be done out of band, > or in band as simple SEND/RECV payload. Either way, the driver has no > way of tracking this because the protocol used is application-specific. > > I fully agree. If there is one important thing about RDMA and other fastpath solutions that must be understood, it is that the driver does not see the payload. This is a fundamental strength, but it means that you have to identify what if any intercept points there are in advance. You also raise a good point on the scope of any suspend/resume API. Device reporting of this capability would not be a simple boolean, but more of a suspend/resume scope. A minimal scope would be any connection that actually attempts to use the suspended MR. Slightly wider would be any connection *allowed* to use the MR, which could expand all the way to any connection under the same PD. Conceivably I could imagine an RDMA device reporting that it could support suspend/resume, but only at the scope of the entire device. But even at such a wide scope, suspend/resume could be useful to a Memory Manager. The pages could be fully migrated to the new location, and the only work that was still required during the critical suspend/resume region was to actually shift to the new map. That might be short enough that not accepting *any* incoming RDMA packet would be acceptable. And if the goal is to replace a memory card the alternative might be migrating the applications to other physical servers, which would mean a much longer period of not accepting incoming RDMA packets. But the broader question is what the goal is here. Allowing memory to be shuffled is valuable, and perhaps even ultimately a requirement for high availability systems. RDMA and other direct-access APIs should be evolving their interfaces to accommodate these needs. Oversubscribing memory is a totally different matter. If an application is working with memory that is oversubscribed by a factor of 2 or more can it really benefit from zero-copy direct placement? At first glance I can't see what RDMA could be bringing of value when the overhead of swapping is going to be that large. If it really does make sense, then explicitly registering the portion of memory that should be enabled to receive incoming traffic while the application is swapped out actually makes sense. Current Memory Registration methods force applications to either register too much or too often. They register too much when the cost of registration is high, and the application responds by registering its entire buffer pool permanently. This is a problem when it overstates the amount of memory that the application needs to have resident, or when the device imposes limits on the size of memory maps that it can know. The alternative is to register too often, that is on a per-operation basis. 
To me that suggests the solutions lie in making it more reasonable to register more memory, or in making it practical to register memory on-the-fly on a per-operation basis with low enough overhead that applications don't feel the need to build elaborate registration caching schemes. As has been pointed out a few times in this thread, the RDMA and transport layers simply do not have enough information to know which portion of registered memory *really* had to be registered. So any back-pressure scheme where the Memory Manager is asking for pinned memory to be "given back" would have to go all the way to the application. Only the application knows what it is "really" using. I also suspect that most applications that are interested in using RDMA would rather be told they can allocate 200M indefinitely (and with real memory backing it) than be given 1GB of virtual memory that is backed by 200-300M of physical memory, especially if it meant dealing with memory pressure upcalls. > >> Point being, it will stop probably all connections that an application is > >> using (assuming the application uses a single PD). > > > > It seems like the need to not stop all would be a compelling enough reason > > to modify the driver to track which processes have received the rkey/stag. > > > > Yes, _if_ the driver could track this. > > And _if_ the rdma API and paradigm was such that the kernel/driver could > keep track, then remote revokations of MR tags could be supported. > > Stevo > > > ___ > general mailing list > general@lists.
Re: [ofa-general] Re: Demand paging for memory regions
Robin Holt wrote: On Thu, Feb 14, 2008 at 09:09:08AM -0600, Steve Wise wrote: Note that for T3, this involves suspending _all_ rdma connections that are in the same PD as the MR being remapped. This is because the driver doesn't know who the application advertised the rkey/stag to. So without Is there a reason the driver can not track these. Because advertising of a MR (ie telling the peer about your rkey/stag, offset and length) is application-specific and can be done out of band, or in band as simple SEND/RECV payload. Either way, the driver has no way of tracking this because the protocol used is application-specific. Point being, it will stop probably all connections that an application is using (assuming the application uses a single PD). It seems like the need to not stop all would be a compelling enough reason to modify the driver to track which processes have received the rkey/stag. Yes, _if_ the driver could track this. And _if_ the rdma API and paradigm was such that the kernel/driver could keep track, then remote revocations of MR tags could be supported. Stevo ___ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] Re: Demand paging for memory regions
On Thu, Feb 14, 2008 at 09:09:08AM -0600, Steve Wise wrote: > Note that for T3, this involves suspending _all_ rdma connections that are > in the same PD as the MR being remapped. This is because the driver > doesn't know who the application advertised the rkey/stag to. So without Is there a reason the driver cannot track these? > Point being, it will stop probably all connections that an application is > using (assuming the application uses a single PD). It seems like the need to not stop all would be a compelling enough reason to modify the driver to track which processes have received the rkey/stag. Thanks, Robin ___ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] Re: Demand paging for memory regions
Felix Marti wrote: That is correct, not a change we can make for T3. We could, in theory, deal with changing mappings though. The change would need to be synchronized though: the VM would need to tell us which mapping were about to change and the driver would then need to disable DMA to/from it, do the change and resume DMA. Note that for T3, this involves suspending _all_ rdma connections that are in the same PD as the MR being remapped. This is because the driver doesn't know who the application advertised the rkey/stag to. So without that knowledge, all connections that _might_ rdma into the MR must be suspended. If the MR was only setup for local access, then the driver could track the connections with references to the MR and only quiesce those connections. Point being, it will stop probably all connections that an application is using (assuming the application uses a single PD). Steve. ___ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] Re: Demand paging for memory regions
Hi Kanoj, On Wed, Feb 13, 2008 at 03:43:17PM -0800, Kanoj Sarcar wrote: > Oh ok, yes, I did see the discussion on this; sorry I > missed it. I do see what notifiers bring to the table > now (without endorsing it :-)). I'm not really sure livelocks are the big issue here. I'm running N 1G VMs on a 1G ram system, with (N-1)G swapped out. Combining this with auto-ballooning, rss limiting, and ksm ram sharing provides really advanced and low-level virtualization VM capabilities to the linux kernel while at the same time guaranteeing no oom failures as long as the guest pages are lower than ram+swap (just slower runtime if too many pages are unshared or if the balloons are deflated etc..). Swapping the virtual machine in the host may be more efficient than having the guest swapping over a virtual swap paravirt storage for example. As more management features are added admins will gain more experience in handling those new features and they'll find what's best for them. mmu notifiers and real reliable swapping are the enablers for those more advanced VM features. oom livelocks wouldn't happen anyway with KVM as long as the maximal amount of guest physical memory is lower than RAM. > An orthogonal question is this: is IB/rdma the only > "culprit" that elevates page refcounts? Are there no > other subsystems which do a similar thing? > > The example I am thinking about is rawio (Oracle's > mlock'ed SHM regions are handed to rawio, isn't it?). > My understanding of how rawio works in Linux is quite > dated though ... rawio in-flight I/O shall be limited. As long as each task can't pin more than X ram, and the ram is released when the task is oom killed, and the first get_user_pages/alloc_pages/slab_alloc that returns -ENOMEM takes an oom fail path that returns failure to userland, everything is ok. Even with IB, deadlock could only happen if IB would allow unlimited memory to be pinned down by unprivileged users. If IB is insecure and DoSable without mmu notifiers, then I'm not sure how enabling swapping of the IB memory could be enough to fix the DoS. Keep in mind that even tmpfs can't be safe allowing all ram+swap to be allocated in a tmpfs file (despite the fact that the tmpfs file storage includes swap and not only ram). Pinning the whole ram+swap with tmpfs livelocks the same way as pinning the whole ram with ramfs. So if you add mmu notifier support to IB, you only need to RDMA an area as large as ram+swap to livelock again as before... no difference at all. I don't think livelocks have anything to do with mmu notifiers (other than deferring the livelock to the "swap+ram" point of no return instead of the current "ram" point of no return). Livelocks have to be solved the usual way: handling alloc_pages/get_user_pages/slab allocation failures with a fail path that returns to userland and allows the ram to be released if the task was selected for oom-killage. The real benefit of the mmu notifiers for IB would be to allow the rdma region to be larger than RAM without triggering the oom killer (or without triggering a livelock if it's DoSable, but then the livelock would need fixing to be converted into a regular oom-killing by some other means not related to the mmu-notifier, it's really an orthogonal problem). So suppose you have an MPI simulation that requires a 10G array and you have only 1G of ram, then you can rdma over 10G as if you had 10G of ram. Things will perform ok only if there's some huge locality of the computations. 
For virtualization it's orders of magnitude more useful than for computer clusters, but certain simulations really do swap, so I don't exclude that certain RDMA apps will also need this (dunno about IB). ___ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] Re: Demand paging for memory regions
On Wed, Feb 13, 2008 at 06:23:08PM -0500, Pete Wyckoff wrote:
> [EMAIL PROTECTED] wrote on Tue, 12 Feb 2008 20:09 -0800:
> > One other area that has not been brought up yet (I think) is the
> > applicability of notifiers in letting users know when pinned memory
> > is reclaimed by the kernel. This is useful when a lower-level
> > library employs lazy deregistration strategies on memory regions that
> > are subsequently released to the kernel via the application's use of
> > munmap or sbrk. Ohio Supercomputing Center has work in this area but
> > a generalized approach in the kernel would certainly be welcome.
>
> The whole need for memory registration is a giant pain. There is no
> motivating application need for it---it is simply a hack around
> virtual memory and the lack of full VM support in current hardware.
> There are real hardware issues that interact poorly with virtual
> memory, as discussed previously in this thread.
Well, the registrations also exist to provide protection against
rogue/faulty remotes, but for the purposes of MPI that is probably not
important.
Here is a thought.. Some RDMA hardware can change the page tables on
the fly. What if the kernel had a mechanism to dynamically maintain a
full registration of the process's entire address space ('mlocked' but
able to be migrated)? MPI would never need to register a buffer, and
all the messy cases with munmap/sbrk/etc go away - the risk is that
other MPI nodes can randomly scribble all over the process :)
Christoph: It seemed to me you were first talking about
freeing/swapping/faulting RDMA'able pages - but would pure migration
as a special hardware supported case be useful like Caitlin suggested?
Regards,
Jason
___
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] Re: Demand paging for memory regions
--- Christoph Lameter <[EMAIL PROTECTED]> wrote: > On Wed, 13 Feb 2008, Kanoj Sarcar wrote: > > > It seems that the need is to solve potential > memory > > shortage and overcommit issues by being able to > > reclaim pages pinned by rdma driver/hardware. Is > my > > understanding correct? > > Correct. > > > If I do understand correctly, then why is rdma > page > > pinning any different than eg mlock pinning? I > imagine > > Oracle pins lots of memory (using mlock), how come > > they do not run into vm overcommit issues? > > Mlocked pages are not pinned. They are movable by > f.e. page migration and > will be potentially be moved by future memory defrag > approaches. Currently > we have the same issues with mlocked pages as with > pinned pages. There is > work in progress to put mlocked pages onto a > different lru so that reclaim > exempts these pages and more work on limiting the > percentage of memory > that can be mlocked. > > > Are we up against some kind of breaking c-o-w > issue > > here that is different between mlock and rdma > pinning? > > Not that I know. > > > Asked another way, why should effort be spent on a > > notifier scheme, and rather not on fixing any > memory > > accounting problems and unifying how pin pages are > > accounted for that get pinned via mlock() or rdma > > drivers? > > There are efforts underway to account for and limit > mlocked pages as > described above. Page pinning the way it is done by > Infiniband through > increasing the page refcount is treated by the VM as > a temporary > condition not as a permanent pin. The VM will > continually try to reclaim > these pages thinking that the temporary usage of the > page must cease > soon. This is why the use of large amounts of pinned > pages can lead to > livelock situations. Oh ok, yes, I did see the discussion on this; sorry I missed it. I do see what notifiers bring to the table now (without endorsing it :-)). An orthogonal question is this: is IB/rdma the only "culprit" that elevates page refcounts? Are there no other subsystems which do a similar thing? The example I am thinking about is rawio (Oracle's mlock'ed SHM regions are handed to rawio, isn't it?). My understanding of how rawio works in Linux is quite dated though ... Kanoj > > If we want to have pinning behavior then we could > mark pinned pages > specially so that the VM will not continually try to > evict these pages. We > could manage them similar to mlocked pages but just > not allow page > migration, memory unplug and defrag to occur on > pinned memory. All of > theses would have to fail. With the notifier scheme > the device driver > could be told to get rid of the pinned memory. This > would make these 3 > techniques work despite having an RDMA memory > section. > > > Startup benefits are well understood with the > notifier > > scheme (ie, not all pages need to be faulted in at > > memory region creation time), specially when most > of > > the memory region is not accessed at all. I would > > imagine most of HPC does not work this way though. > > No for optimal performance you would want to > prefault all pages like > it is now. The notifier scheme would only become > relevant in memory > shortage situations. > > > Then again, as rdma hardware is applied > (increasingly?) towards apps > > with short lived connections, the notifier scheme > will help with startup > > times. > > The main use of the notifier scheme is for stability > and reliability. The > "pinned" pages become unpinnable on request by the > VM. 
So the VM can work > itself out of memory shortage situations in > cooperation with the > RDMA logic instead of simply failing. ___ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] Re: Demand paging for memory regions
On Wed, 13 Feb 2008, Caitlin Bestler wrote: > The very limited objective presented above was actually discussed in RNIC-PI. > A minimalist solution (from the hardware viewpoint) is to "suspend" a Memory > Region for a very brief time to allow the Host to re-arrange memory, and then > to "resume" operation once the pages were copied and the map updated. Exactly. Could you post that back to the full cc list? ___ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] Re: Demand paging for memory regions
On Feb 13, 2008 3:02 PM, Christoph Lameter <[EMAIL PROTECTED]> wrote: > > The main use of the notifier scheme is for stability and reliability. The > "pinned" pages become unpinnable on request by the VM. So the VM can work > itself out of memory shortage situations in cooperation with the > RDMA logic instead of simply failing. > The very limited objective presented above was actually discussed in RNIC-PI. A minimalist solution (from the hardware viewpoint) is to "suspend" a Memory Region for a very brief time to allow the Host to re-arrange memory, and then to "resume" operation once the pages were copied and the map updated. The RDMA device has to avoid processing incoming packets that reference the suspended Memory Region (rather than failing the connection) and flush any cached mappings from before the "suspend" so that everything is learned/ fetched after the "resume". The advertised pages have to have the same *meaning* and they have to be committed, but they do not have to be the same physical pages for the lifetime of the memory region (at least from the protocol perspective). Obviously any add-on hardware functionality would have to be a documented option so that the memory manager would know whether a given device actually could do this. ___ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
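For readers following along, here is a rough kernel-side sketch of how a driver might wire that suspend/resume behaviour to the mmu notifier callbacks this thread started with. The callback signatures follow the patchset under discussion and may differ from what is finally merged; my_mr_suspend(), my_mr_update_map() and my_mr_resume() are imaginary driver internals standing in for whatever a real device would need.

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

struct my_mr {
    struct mmu_notifier mn;
    unsigned long       start, end;     /* user VA range of the MR */
    /* ... device-specific translation state ... */
};

/* Imaginary driver internals, stubbed out for the sketch. */
static void my_mr_suspend(struct my_mr *mr) { /* quiesce RX and WQE processing */ }
static void my_mr_update_map(struct my_mr *mr, struct mm_struct *mm) { /* re-fetch VA->PA map */ }
static void my_mr_resume(struct my_mr *mr) { /* re-enable the MR */ }

static void my_mr_invalidate_range_start(struct mmu_notifier *mn,
                                         struct mm_struct *mm,
                                         unsigned long start, unsigned long end)
{
    struct my_mr *mr = container_of(mn, struct my_mr, mn);

    if (end <= mr->start || start >= mr->end)
        return;
    /* Stop placing packets and drop cached mappings before the VM
     * rearranges the pages underneath us. */
    my_mr_suspend(mr);
}

static void my_mr_invalidate_range_end(struct mmu_notifier *mn,
                                       struct mm_struct *mm,
                                       unsigned long start, unsigned long end)
{
    struct my_mr *mr = container_of(mn, struct my_mr, mn);

    if (end <= mr->start || start >= mr->end)
        return;
    my_mr_update_map(mr, mm);
    my_mr_resume(mr);
}

static const struct mmu_notifier_ops my_mr_mn_ops = {
    .invalidate_range_start = my_mr_invalidate_range_start,
    .invalidate_range_end   = my_mr_invalidate_range_end,
};

The driver would attach this with mmu_notifier_register() when the MR is created; the packet-processing side would supply the real suspend/resume logic Caitlin describes.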
Re: [ofa-general] Re: Demand paging for memory regions
[EMAIL PROTECTED] wrote on Tue, 12 Feb 2008 20:09 -0800: > One other area that has not been brought up yet (I think) is the > applicability of notifiers in letting users know when pinned memory > is reclaimed by the kernel. This is useful when a lower-level > library employs lazy deregistration strategies on memory regions that > are subsequently released to the kernel via the application's use of > munmap or sbrk. Ohio Supercomputing Center has work in this area but > a generalized approach in the kernel would certainly be welcome. The whole need for memory registration is a giant pain. There is no motivating application need for it---it is simply a hack around virtual memory and the lack of full VM support in current hardware. There are real hardware issues that interact poorly with virtual memory, as discussed previously in this thread. The way a messaging cycle goes in IB is:
    register buf
    post send from buf
    wait for completion
    deregister buf
This tends to get hidden via userspace software libraries into a single call:
    MPI_send(buf)
Now if you actually do the reg/dereg every time, things are very slow. So userspace library writers came up with the idea of caching registrations:
    if buf is not registered:
        register buf
    post send from buf
    wait for completion
The second time that the app happens to do a send from the same buffer, it proceeds much faster. Spatial locality applies here, and this caching is generally worth it. Some libraries have schemes to limit the size of the registration cache too. But there are plenty of ways to hurt yourself with such a scheme. The first being a huge pool of unused but registered memory, as the library doesn't know the app patterns, and it doesn't know the VM pressure level in the kernel. There are plenty of subtle ways that this breaks too. If the registered buf is removed from the address space via munmap() or sbrk() or other ways, the mapping and registration are gone, but the library has no way of knowing that the app just did this. Sure the physical page is still there and pinned, but the app cannot get at it. Later if new address space arrives at the same virtual address but a different physical page, the library will mistakenly think it already has it registered properly, and data is transferred from this old now-unmapped physical page. The whole situation is rather ridiculous, but we are quite stuck with it for current generation IB and iWarp hardware. If we can't have the kernel interact with the device directly, we could at least manage state in these multiple userspace registration caches. The VM could ask for certain (or any) pages to be released, and the library would respond if they are indeed not in use by the device. The app itself does not know about pinned regions, and the library is aware of exactly which regions are potentially in use. Since the great majority of userspace messaging over IB goes through middleware like MPI or PGAS languages, and they all have the same approach to registration caching, this approach could fix the problem for a big segment of use cases. More text on the registration caching problem is here: http://www.osc.edu/~pw/papers/wyckoff-memreg-ccgrid05.pdf with an approach using vm_ops open and close operations in a kernel module here: http://www.osc.edu/~pw/dreg/ There is a place for VM notifiers in RDMA messaging, but not in talking to devices, at least not the current set. 
If you can define a reasonable userspace interface for VM notifiers, libraries can manage registration caches more efficiently, letting the kernel unmap pinned pages as it likes. -- Pete ___ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
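For readers who have not seen one, the registration cache Pete describes boils down to something like the sketch below (libibverbs, illustration only; real MPI stacks key the lookup off an interval tree and add eviction). Note that nothing in it can notice munmap() or sbrk(), which is exactly the stale-entry hazard he points out.

#include <stddef.h>
#include <infiniband/verbs.h>

#define REG_CACHE_SLOTS 64

struct reg_entry {
    void          *addr;
    size_t         len;
    struct ibv_mr *mr;
};

static struct reg_entry reg_cache[REG_CACHE_SLOTS];

/* "if buf is not registered: register buf" -- return a cached MR when the
 * buffer was seen before, otherwise register it and remember the result. */
static struct ibv_mr *cached_reg_mr(struct ibv_pd *pd, void *addr, size_t len)
{
    int i;

    for (i = 0; i < REG_CACHE_SLOTS; i++)
        if (reg_cache[i].mr && reg_cache[i].addr == addr &&
            reg_cache[i].len >= len)
            return reg_cache[i].mr;        /* hit: skip ibv_reg_mr() */

    for (i = 0; i < REG_CACHE_SLOTS; i++) {
        if (reg_cache[i].mr)
            continue;
        reg_cache[i].mr = ibv_reg_mr(pd, addr, len, IBV_ACCESS_LOCAL_WRITE);
        if (!reg_cache[i].mr)
            return NULL;
        reg_cache[i].addr = addr;
        reg_cache[i].len  = len;
        return reg_cache[i].mr;
    }
    return NULL;                           /* cache full: caller must handle */
}

A VM notification delivered to userspace, as proposed above, would give the library a hook to drop (and ibv_dereg_mr()) exactly those entries whose pages have gone away.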
Re: [ofa-general] Re: Demand paging for memory regions
On Wed, 13 Feb 2008, Kanoj Sarcar wrote: > It seems that the need is to solve potential memory > shortage and overcommit issues by being able to > reclaim pages pinned by rdma driver/hardware. Is my > understanding correct? Correct. > If I do understand correctly, then why is rdma page > pinning any different than eg mlock pinning? I imagine > Oracle pins lots of memory (using mlock), how come > they do not run into vm overcommit issues? Mlocked pages are not pinned. They are movable by f.e. page migration and will potentially be moved by future memory defrag approaches. Currently we have the same issues with mlocked pages as with pinned pages. There is work in progress to put mlocked pages onto a different lru so that reclaim exempts these pages and more work on limiting the percentage of memory that can be mlocked. > Are we up against some kind of breaking c-o-w issue > here that is different between mlock and rdma pinning? Not that I know. > Asked another way, why should effort be spent on a > notifier scheme, and rather not on fixing any memory > accounting problems and unifying how pin pages are > accounted for that get pinned via mlock() or rdma > drivers? There are efforts underway to account for and limit mlocked pages as described above. Page pinning the way it is done by Infiniband through increasing the page refcount is treated by the VM as a temporary condition not as a permanent pin. The VM will continually try to reclaim these pages thinking that the temporary usage of the page must cease soon. This is why the use of large amounts of pinned pages can lead to livelock situations. If we want to have pinning behavior then we could mark pinned pages specially so that the VM will not continually try to evict these pages. We could manage them similar to mlocked pages but just not allow page migration, memory unplug and defrag to occur on pinned memory. All of these would have to fail. With the notifier scheme the device driver could be told to get rid of the pinned memory. This would make these 3 techniques work despite having an RDMA memory section. > Startup benefits are well understood with the notifier > scheme (ie, not all pages need to be faulted in at > memory region creation time), specially when most of > the memory region is not accessed at all. I would > imagine most of HPC does not work this way though. No, for optimal performance you would want to prefault all pages like it is now. The notifier scheme would only become relevant in memory shortage situations. > Then again, as rdma hardware is applied (increasingly?) towards apps > with short lived connections, the notifier scheme will help with startup > times. The main use of the notifier scheme is for stability and reliability. The "pinned" pages become unpinnable on request by the VM. So the VM can work itself out of memory shortage situations in cooperation with the RDMA logic instead of simply failing. ___ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] Re: Demand paging for memory regions
--- Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Wed, 13 Feb 2008, Christian Bell wrote:
>
> > not always be in the thousands but you're still claiming scalability
> > for a mechanism that essentially logs who accesses the regions. Then
> > there's the fact that reclaim becomes a collective communication
> > operation over all region accessors. Makes me nervous.
>
> Well, reclaim is not a very fast process (and we usually try to avoid it
> as much as possible for our HPC). Essentially it's only there to allow
> shifts of processing loads and to allow efficient caching of application
> data.
>
> > However, short of providing user-level notifications for pinned pages
> > that are inadvertently released to the O/S, I don't believe that the
> > patchset provides any significant added value for the HPC community
> > that can't optimistically do RDMA demand paging.
>
> We currently also run XPmem with pinning. It's great as long as you just
> run one load on the system. No reclaim ever occurs.
>
> However, if you do things that require lots of allocations etc. then
> the page pinning can easily lead to livelock if reclaim is finally
> triggered, and also strange OOM situations since the VM cannot free any
> pages. So the main issue that is addressed here is reliability of pinned
> page operations. Better VM integration avoids these issues because we can
> unpin on request to deal with memory shortages.

I have a question on the basic need for the mmu notifier stuff wrt rdma hardware and pinning memory.

It seems that the need is to solve potential memory shortage and overcommit issues by being able to reclaim pages pinned by rdma driver/hardware. Is my understanding correct?

If I do understand correctly, then why is rdma page pinning any different than e.g. mlock pinning? I imagine Oracle pins lots of memory (using mlock), how come they do not run into vm overcommit issues?

Are we up against some kind of breaking c-o-w issue here that is different between mlock and rdma pinning?

Asked another way, why should effort be spent on a notifier scheme, and rather not on fixing any memory accounting problems and unifying how pinned pages are accounted for, whether they get pinned via mlock() or rdma drivers?

Startup benefits are well understood with the notifier scheme (ie, not all pages need to be faulted in at memory region creation time), especially when most of the memory region is not accessed at all. I would imagine most of HPC does not work this way though.

Then again, as rdma hardware is applied (increasingly?) towards apps with short-lived connections, the notifier scheme will help with startup times.

Kanoj
Re: [ofa-general] Re: Demand paging for memory regions
On Wed, 13 Feb 2008, Jason Gunthorpe wrote:

> Unfortunately it really has little to do with the drivers - changes,
> for instance, need to be made to support this in the user space MPI
> libraries. The RDMA ops do not pass through the kernel, userspace
> talks directly to the hardware which complicates building any sort of
> abstraction.

Ok, so the notifiers have to be handed over to the user space library that has the function of the device driver here...

> That is where I think you run into trouble: if you ask the MPI people
> to add code to their critical path to support swapping they probably
> will not be too interested. At a minimum, to support your idea you need
> to check on every RDMA if the remote page is mapped... Plus the
> overheads Christian was talking about in the OOB channel(s).

You only need to check if a handle has been receiving invalidates. If not, then you can just go ahead as now. You can use the notifier to take down the whole region if any reclaim occurs against it (probably the best and simplest approach to implement). Then you mark the handle so that the mapping is reestablished before the next operation.
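A sketch of the check being suggested, as it might look in a userspace library. Every name here is invented for illustration; the point is only that the common case stays a single flag test:

    #include <stddef.h>

    struct ibv_mr;                       /* from <infiniband/verbs.h> */

    /* Hypothetical per-region handle maintained by the library. */
    struct rdma_handle {
        struct ibv_mr *mr;
        volatile int   invalidated;      /* set when an invalidate arrived */
    };

    /* Hypothetical helpers supplied elsewhere in the library. */
    void reestablish_mapping(struct rdma_handle *h);
    void issue_write(struct ibv_mr *mr, void *buf, size_t len);

    static void rdma_write(struct rdma_handle *h, void *buf, size_t len)
    {
        if (h->invalidated) {            /* slow path: reclaim hit us */
            reestablish_mapping(h);      /* re-register / re-pin region */
            h->invalidated = 0;
        }
        issue_write(h->mr, buf, len);    /* fast path: unchanged from today */
    }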
Re: [ofa-general] Re: Demand paging for memory regions
On Wed, 13 Feb 2008, Christian Bell wrote:

> not always be in the thousands but you're still claiming scalability
> for a mechanism that essentially logs who accesses the regions. Then
> there's the fact that reclaim becomes a collective communication
> operation over all region accessors. Makes me nervous.

Well, reclaim is not a very fast process (and we usually try to avoid it as much as possible for our HPC). Essentially it's only there to allow shifts of processing loads and to allow efficient caching of application data.

> However, short of providing user-level notifications for pinned pages
> that are inadvertently released to the O/S, I don't believe that the
> patchset provides any significant added value for the HPC community
> that can't optimistically do RDMA demand paging.

We currently also run XPmem with pinning. It's great as long as you just run one load on the system. No reclaim ever occurs.

However, if you do things that require lots of allocations etc. then the page pinning can easily lead to livelock if reclaim is finally triggered, and also strange OOM situations since the VM cannot free any pages. So the main issue that is addressed here is reliability of pinned page operations. Better VM integration avoids these issues because we can unpin on request to deal with memory shortages.
Re: [ofa-general] Re: Demand paging for memory regions
On Wed, Feb 13, 2008 at 10:51:58AM -0800, Christoph Lameter wrote:
> On Tue, 12 Feb 2008, Jason Gunthorpe wrote:
>
> > But this isn't how IB or iwarp work at all. What you describe is a
> > significant change to the general RDMA operation and requires changes to
> > both sides of the connection and the wire protocol.
>
> Yes, it may require a separate connection between both sides where a
> kind of VM notification protocol is established to tear these things down and
> set them up again. That is, if there is nothing in the RDMA protocol that
> allows a notification to the other side that the mapping is being torn
> down.

Well, yes, you could build this thing you are describing on top of the RDMA protocol and get some support from some of the hardware - but it is a new set of protocols and they would need to be implemented in several places. It is not transparent to userspace and it is not compatible with existing implementations.

Unfortunately it really has little to do with the drivers - changes, for instance, need to be made to support this in the user space MPI libraries. The RDMA ops do not pass through the kernel, userspace talks directly to the hardware which complicates building any sort of abstraction.

That is where I think you run into trouble: if you ask the MPI people to add code to their critical path to support swapping they probably will not be too interested. At a minimum, to support your idea you need to check on every RDMA if the remote page is mapped... Plus the overheads Christian was talking about in the OOB channel(s).

Jason
Re: [ofa-general] Re: Demand paging for memory regions
On Wed, 13 Feb 2008, Christoph Lameter wrote:

> Right. We (SGI) have done something like this for a long time with XPmem
> and it scales ok.

I'd dispute this based on experience developing PGAS language support on the Altix, but more importantly (and less subjectively), I think that "scales ok" refers to a very specific case. Sure, pages (and/or regions) can be large on some systems and the number of systems may not always be in the thousands, but you're still claiming scalability for a mechanism that essentially logs who accesses the regions. Then there's the fact that reclaim becomes a collective communication operation over all region accessors. Makes me nervous.

> > When messages are sufficiently large, the control messaging necessary
> > to setup/teardown the regions is relatively small. This is not
> > always the case however -- in programming models that employ smaller
> > messages, the one-sided nature of RDMA is the most attractive part of
> > it.
>
> The messaging would only be needed if a process comes under memory
> pressure. As long as there is enough memory nothing like this will occur.
>
> > Nothing any communication/runtime system can't already do today. The
> > point of RDMA demand paging is enabling the possibility of using RDMA
> > without the implied synchronization -- the optimistic part. Using
> > the notifiers to duplicate existing memory region handling for RDMA
> > hardware that doesn't have HW page tables is possible but undermines
> > the more important consumer of your patches in my opinion.
>
> The notifier scheme should integrate into existing memory region
> handling and not cause a duplication. If you already have library layers
> that do this then it should be possible to integrate it.

I appreciate that you're trying to make a general case for the applicability of notifiers on all types of existing RDMA hardware and wire protocols. Also, I'm not disagreeing about whether a HW page table is required or not: clearly it's not required to make *some* use of the notifier scheme.

However, short of providing user-level notifications for pinned pages that are inadvertently released to the O/S, I don't believe that the patchset provides any significant added value for the HPC community that can't optimistically do RDMA demand paging.

. . christian
Re: [ofa-general] Re: Demand paging for memory regions
On Wed, 13 Feb 2008, Christoph Raisch wrote:

> For ehca we currently can't modify a large MR when it has been allocated.
> EHCA hardware expects the pages to be there (MRs must not have "holes").
> This is also true for the global MR covering all kernel space.
> Therefore we still need the memory to be "pinned" if ib_umem_get() is
> called.

It cannot be freed and then reallocated? What happens when a process exits?
Re: [ofa-general] Re: Demand paging for memory regions
On Tue, 12 Feb 2008, Christian Bell wrote:

> You're arguing that a HW page table is not needed by describing a use
> case that is essentially what all RDMA solutions already do above the
> wire protocols (all solutions except Quadrics, of course).

The HW page table is not essential to the notification scheme. That the RDMA uses the page table for linearization is another issue. A chip could, for example, just have a TLB cache and look up the entries using the OS page table.

> > Lets say you have two systems, A and B. Each has its memory region MemA
> > and MemB. Each side also has page tables for this region, PtA and PtB.

> > If either side then accesses the page again then the reverse process
> > happens. If B accesses the page then it will first of all incur a page
> > fault because the entry in PtB is missing. The fault will then cause a
> > message to be sent to A to establish the page again. A will create an
> > entry in PtA and will then confirm to B that the page was established. At
> > that point RDMA operations can occur again.
>
> The notifier-reclaim cycle you describe is akin to the out-of-band
> pin-unpin control messages used by existing communication libraries.
> Also, I think what you are proposing can have problems at scale -- A
> must keep track of all of the (potentially many) systems that map memA and
> cooperatively get an agreement from all these systems before reclaiming
> the page.

Right. We (SGI) have done something like this for a long time with XPmem and it scales ok.

> When messages are sufficiently large, the control messaging necessary
> to setup/teardown the regions is relatively small. This is not
> always the case however -- in programming models that employ smaller
> messages, the one-sided nature of RDMA is the most attractive part of
> it.

The messaging would only be needed if a process comes under memory pressure. As long as there is enough memory nothing like this will occur.

> Nothing any communication/runtime system can't already do today. The
> point of RDMA demand paging is enabling the possibility of using RDMA
> without the implied synchronization -- the optimistic part. Using
> the notifiers to duplicate existing memory region handling for RDMA
> hardware that doesn't have HW page tables is possible but undermines
> the more important consumer of your patches in my opinion.

The notifier scheme should integrate into existing memory region handling and not cause a duplication. If you already have library layers that do this then it should be possible to integrate it.

> One other area that has not been brought up yet (I think) is the
> applicability of notifiers in letting users know when pinned memory
> is reclaimed by the kernel. This is useful when a lower-level
> library employs lazy deregistration strategies on memory regions that
> are subsequently released to the kernel via the application's use of
> munmap or sbrk. Ohio Supercomputing Center has work in this area but
> a generalized approach in the kernel would certainly be welcome.

The driver gets the notifications about memory being reclaimed. The driver could then notify user code about the release as well.

Pinned memory currently *cannot* be reclaimed by the kernel. The refcount is elevated. This means that the VM tries to remove the mappings and then sees that it was not able to remove all references. Then it gives up and tries again and again and again. Thus the potential for livelock.
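For orientation, the driver-side hookup under discussion looks roughly like the following. This sketch is patterned on the mmu notifier API as it was eventually merged; the exact callback names varied across revisions of the patchset, so take the shape rather than the letters:

    #include <linux/mmu_notifier.h>
    #include <linux/mm.h>

    static void my_invalidate_range_start(struct mmu_notifier *mn,
                                          struct mm_struct *mm,
                                          unsigned long start,
                                          unsigned long end)
    {
        /* Quiesce RDMA into [start, end), tear down the HW mappings and
         * unpin the pages, so the VM's reclaim can actually complete. */
    }

    static const struct mmu_notifier_ops my_notifier_ops = {
        .invalidate_range_start = my_invalidate_range_start,
    };

    static struct mmu_notifier my_notifier = {
        .ops = &my_notifier_ops,
    };

    /* Registered once when the region is set up:
     *     mmu_notifier_register(&my_notifier, current->mm);
     */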
Re: [ofa-general] Re: Demand paging for memory regions
On Tue, 12 Feb 2008, Jason Gunthorpe wrote:

> But this isn't how IB or iwarp work at all. What you describe is a
> significant change to the general RDMA operation and requires changes to
> both sides of the connection and the wire protocol.

Yes, it may require a separate connection between both sides where a kind of VM notification protocol is established to tear these things down and set them up again. That is, if there is nothing in the RDMA protocol that allows a notification to the other side that the mapping is being torn down.

> - In RDMA (iwarp and IB versions) the hardware page tables exist to
>   linearize the local memory so the remote does not need to be aware
>   of non-linearities in the physical address space. The main
>   motivation for this is kernel bypass where the user space app wants
>   to instruct the remote side to DMA into memory using user space
>   addresses. Hardware provides the page tables to switch from
>   incoming user space virtual addresses to physical addresses.

s/switch/translate I guess. That is good, and those page tables could be used for the notification scheme to enable reclaim. But they are optional and are maintaining the driver state. The linearization could be reconstructed from the kernel page tables on demand.

>   Many kernel RDMA drivers (SCSI, NFS) only use the HW page tables
>   for access control and enforcing the lifetime of the mapping.

Well, the mapping would have to be on demand to avoid the issues that we currently have with pinning. The user API could stay the same. If the driver tracks the mappings using the notifier then the VM can make sure that the right things happen on exit etc.

>   The page tables in the RDMA hardware exist primarily to support
>   this, and not for other reasons. The pinning of pages is one part
>   to support the HW page tables and one part to support the RDMA
>   lifetime rules; the lifetime rules are what cause problems for
>   the VM.

So the driver software can tear down and establish page table entries at will? I do not see the problem. The RDMA hardware is one thing, the way things are visible to the user another. If the driver can establish and remove mappings as needed for RDMA then the user can have the illusion of persistent RDMA memory. This is the same as virtual memory providing the illusion of a process having lots of memory all for itself.

> - The wire protocol consists of packets that say 'Write XXX bytes to
>   offset YY in Region RRR'. Creating a region produces the RRR label
>   and currently pins the pages. So long as the RRR label is valid the
>   remote side can issue write packets at any time without any
>   further synchronization. There are no wire level events associated
>   with creating RRR. You can pass RRR to the other machine in any
>   fashion, even using carrier pigeons :)
>
> - The RDMA layer is very general (ala TCP), useful protocols (like SCSI)
>   are built on top of it and they specify the lifetime rules and
>   protocol for exchanging RRR.

Well yes, of course. What is proposed here is an additional notification mechanism (could even be via tcp/udp to simplify things) that would manage the mappings at a higher level. The writes would not occur if the mapping has not been established.

>   This is your step 'A will then send a message to B notifying..'.
>   It simply does not exist in the protocol specifications.

Of course. You need to create an additional communication layer to get that.
> What it boils down to is that to implement true removal of pages in a
> general way the kernel and HCA must either drop packets or stall
> incoming packets, both are big performance problems - and I can't see
> many users wanting this. Enterprise style people using SCSI, NFS, etc
> already have short pin periods and HPC MPI users probably won't care
> about the VM issues enough to warrant the performance overhead.

True, maybe you cannot do this by simply staying within the protocol bounds of RDMA as based on page pinning, if the RDMA protocol does not support a notification to the other side that the mapping is going away. If RDMA cannot do this then you would need additional ways of notifying the remote side that pages/mappings are invalidated.
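For readers without a verbs background: the 'RRR label' in the quoted description corresponds to the rkey produced by memory registration. A minimal sketch with the standard libibverbs call (the surrounding function is invented):

    #include <infiniband/verbs.h>
    #include <stdint.h>

    /* Registering a region produces the wire-visible label ("RRR") and,
     * with current hardware, pins the pages for the MR's lifetime. */
    static uint32_t make_region(struct ibv_pd *pd, void *buf, size_t len)
    {
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr)
            return 0;
        /* mr->rkey is handed to the peer out of band (TCP, pigeons, ...);
         * the peer may then issue RDMA WRITEs at any time while the MR
         * remains valid -- there is no wire-level event for creation. */
        return mr->rkey;
    }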
Re: [ofa-general] Re: Demand paging for memory regions
> > > Chelsio's T3 HW doesn't support this.

For ehca we currently can't modify a large MR when it has been allocated. EHCA hardware expects the pages to be there (MRs must not have "holes"). This is also true for the global MR covering all kernel space. Therefore we still need the memory to be "pinned" if ib_umem_get() is called.

So with the current implementation we don't have much use for a notifier.

"It is difficult to make predictions, especially about the future"

Gruss / Regards
Christoph Raisch + Hoang-Nam Nguyen
Re: [ofa-general] Re: Demand paging for memory regions
Jason Gunthorpe wrote:
> [mangled CC list trimmed]

Thanks, noticed that afterwards.

> This wasn't meant as a slight against Quadrics, only to point out
> that the specific wire protocols used by IB and iwarp are what cause
> this limitation; it would be easy to imagine that Quadrics has some
> additional twist that can make this easier..

The wire protocols are similar, nothing fancy. The specificity of Quadrics (and many others) is that they can change the behavior of the NIC in firmware, so they adapt to what the OS offers. They had the VM notifier support in Tru64 back in the days; they just ported the functionality to Linux.

> I meant that HPC users are unlikely to want to swap active RDMA pages
> if this causes a performance cost on normal operations. None of my

Swapping to disk is not a normal operation in HPC; it's going to be slow anyway. The main problem for HPC users is not swapping, it's that they do not know when a registered page is released to the OS through free(), sbrk() or munmap(). Like swapping, they don't expect that it will happen often, but they have to handle it gracefully.

Patrick

--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com
Re: [ofa-general] Re: Demand paging for memory regions
[mangled CC list trimmed]

On Tue, Feb 12, 2008 at 10:56:26PM -0500, Patrick Geoffray wrote:
> Jason Gunthorpe wrote:
> > I don't know much about Quadrics, but I would be hesitant to lump it
> > in too much with these RDMA semantics. Christian's comments sound like
> > they operate closer to what you described and that is why they have an
> > existing patch set. I don't know :)
>
> The Quadrics folks have been doing RDMA for 10 years, there is a reason why
> they maintained a patch.

This wasn't meant as a slight against Quadrics, only to point out that the specific wire protocols used by IB and iwarp are what cause this limitation; it would be easy to imagine that Quadrics has some additional twist that can make this easier..

> > What it boils down to is that to implement true removal of pages in a
> > general way the kernel and HCA must either drop packets or stall
> > incoming packets, both are big performance problems - and I can't see
> > many users wanting this. Enterprise style people using SCSI, NFS, etc
> > already have short pin periods and HPC MPI users probably won't care
> > about the VM issues enough to warrant the performance overhead.
>
> This is not true, HPC people do care about the VM issues a lot. Memory
> registration (pinning and translating) is usually too expensive to

I meant that HPC users are unlikely to want to swap active RDMA pages if this causes a performance cost on normal operations. None of my comments are meant to imply that lazy de-registration or page migration are not good things.

Regards,
Jason
Re: [ofa-general] Re: Demand paging for memory regions
On Tue, 12 Feb 2008, Christoph Lameter wrote:

> On Tue, 12 Feb 2008, Jason Gunthorpe wrote:
>
> > The problem is that the existing wire protocols do not have a
> > provision for doing an 'are you ready' or 'I am not ready' exchange
> > and they are not designed to store page tables on both sides as you
> > propose. The remote side can send RDMA WRITE traffic at any time after
> > the RDMA region is established. The local side must be able to handle
> > it. There is no way to signal that a page is not ready and the remote
> > should not send.
> >
> > This means the only possible implementation is to stall/discard at the
> > local adaptor when a RDMA WRITE is received for a page that has been
> > reclaimed. This is what leads to deadlock/poor performance..

You're arguing that a HW page table is not needed by describing a use case that is essentially what all RDMA solutions already do above the wire protocols (all solutions except Quadrics, of course).

> You would only use the wire protocols *after* having established the RDMA
> region. The notifier chains allow a RDMA region (or parts thereof) to be
> taken down on demand by the VM. The region can be reestablished if one of
> the sides accesses it. I hope I got that right. Not much exposure to
> Infiniband so far.

RDMA is already always used *after* memory regions are set up -- they are set up out-of-band w.r.t. RDMA, but essentially this is the "before" part.

> Lets say you have two systems, A and B. Each has its memory region MemA
> and MemB. Each side also has page tables for this region, PtA and PtB.
>
> Now you establish a RDMA connection between both sides. The pages in both
> MemB and MemA are present and so are entries in PtA and PtB. RDMA
> traffic can proceed.
>
> The VM on system A now gets into a situation in which memory becomes
> heavily used by another (maybe non-RDMA) process and, after checking that
> there was no recent reference to MemA and MemB (via a notifier aging
> callback), decides to reclaim the memory from MemA.
>
> In that case it will notify the RDMA subsystem on A that it is trying to
> reclaim a certain page.
>
> The RDMA subsystem on A will then send a message to B notifying it that
> the memory will be going away. B now has to remove its corresponding page
> from memory (and drop the entry in PtB) and confirm to A that this has
> happened. RDMA traffic is then stopped for this page. Then A can also
> remove its page, the corresponding entry in PtA, and the page is reclaimed
> or pushed out to swap, completing the page reclaim.
>
> If either side then accesses the page again then the reverse process
> happens. If B accesses the page then it will first of all incur a page
> fault because the entry in PtB is missing. The fault will then cause a
> message to be sent to A to establish the page again. A will create an
> entry in PtA and will then confirm to B that the page was established. At
> that point RDMA operations can occur again.

The notifier-reclaim cycle you describe is akin to the out-of-band pin-unpin control messages used by existing communication libraries. Also, I think what you are proposing can have problems at scale -- A must keep track of all of the (potentially many) systems that map memA and cooperatively get an agreement from all these systems before reclaiming the page.

When messages are sufficiently large, the control messaging necessary to setup/teardown the regions is relatively small. This is not always the case however -- in programming models that employ smaller messages, the one-sided nature of RDMA is the most attractive part of it.

> So the whole scheme does not really need a hardware page table in the RDMA
> hardware. The page tables of the two systems A and B are sufficient.
>
> The scheme can also be applied to a larger range than only a single page.
> The RDMA subsystem could tear down a large section when reclaim is
> pushing on it and then reestablish it as needed.

Nothing any communication/runtime system can't already do today. The point of RDMA demand paging is enabling the possibility of using RDMA without the implied synchronization -- the optimistic part. Using the notifiers to duplicate existing memory region handling for RDMA hardware that doesn't have HW page tables is possible but undermines the more important consumer of your patches in my opinion.

One other area that has not been brought up yet (I think) is the applicability of notifiers in letting users know when pinned memory is reclaimed by the kernel. This is useful when a lower-level library employs lazy deregistration strategies on memory regions that are subsequently released to the kernel via the application's use of munmap or sbrk. Ohio Supercomputing Center has work in this area but a generalized approach in the kernel would certainly be welcome.

. . christian

--
[EMAIL PROTECTED] (QLogic Host Solutions Group, formerly Pathscale)
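Absent a kernel-provided notification, the usual workaround for the munmap/sbrk problem mentioned here is for the middleware to interpose on the release calls itself. A hedged sketch of that technique using glibc symbol interposition; reg_cache_invalidate() is a hypothetical hook into the library's registration cache:

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <sys/mman.h>

    /* Hypothetical hook into the library's registration cache. */
    void reg_cache_invalidate(void *addr, size_t len);

    /* Interposed munmap: drop stale registrations before the address
     * space actually disappears, so no cached MR can outlive it. */
    int munmap(void *addr, size_t len)
    {
        static int (*real_munmap)(void *, size_t);

        if (!real_munmap)
            real_munmap = (int (*)(void *, size_t))
                          dlsym(RTLD_NEXT, "munmap");

        reg_cache_invalidate(addr, len);
        return real_munmap(addr, len);
    }

This catches munmap() but not every path by which address space can go away, which is part of why an in-kernel notifier is the more general answer.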
Re: [ofa-general] Re: Demand paging for memory regions
On Tue, Feb 12, 2008 at 06:35:09PM -0800, Christoph Lameter wrote:
> On Tue, 12 Feb 2008, Jason Gunthorpe wrote:
>
> > The problem is that the existing wire protocols do not have a
> > provision for doing an 'are you ready' or 'I am not ready' exchange
> > and they are not designed to store page tables on both sides as you
> > propose. The remote side can send RDMA WRITE traffic at any time after
> > the RDMA region is established. The local side must be able to handle
> > it. There is no way to signal that a page is not ready and the remote
> > should not send.
> >
> > This means the only possible implementation is to stall/discard at the
> > local adaptor when a RDMA WRITE is received for a page that has been
> > reclaimed. This is what leads to deadlock/poor performance..
>
> You would only use the wire protocols *after* having established the RDMA
> region. The notifier chains allow a RDMA region (or parts thereof) to be
> taken down on demand by the VM. The region can be reestablished if one of
> the sides accesses it. I hope I got that right. Not much exposure to
> Infiniband so far.

[clip explanation]

But this isn't how IB or iwarp work at all. What you describe is a significant change to the general RDMA operation and requires changes to both sides of the connection and the wire protocol.

A few comments on RDMA operation that might clarify things a little bit more:

- In RDMA (iwarp and IB versions) the hardware page tables exist to linearize the local memory so the remote does not need to be aware of non-linearities in the physical address space. The main motivation for this is kernel bypass, where the user space app wants to instruct the remote side to DMA into memory using user space addresses. Hardware provides the page tables to switch from incoming user space virtual addresses to physical addresses.

  This greatly simplifies the user space programming model since you don't need to pass around or create s/g lists for memory that is already virtually contiguous.

  Many kernel RDMA drivers (SCSI, NFS) only use the HW page tables for access control and enforcing the lifetime of the mapping.

  The page tables in the RDMA hardware exist primarily to support this, and not for other reasons. The pinning of pages is one part to support the HW page tables and one part to support the RDMA lifetime rules; the lifetime rules are what cause problems for the VM.

- The wire protocol consists of packets that say 'Write XXX bytes to offset YY in Region RRR'. Creating a region produces the RRR label and currently pins the pages. So long as the RRR label is valid the remote side can issue write packets at any time without any further synchronization. There are no wire level events associated with creating RRR. You can pass RRR to the other machine in any fashion, even using carrier pigeons :)

- The RDMA layer is very general (ala TCP), useful protocols (like SCSI) are built on top of it and they specify the lifetime rules and protocol for exchanging RRR. Every protocol is different. In-kernel protocols like SRP and NFS RDMA seem to have very short lifetimes for RRR and work more like pci_map_* in real SCSI hardware.

- HPC userspace apps, like MPI apps, have different lifetime rules and tend to be really long lived. These people will not want anything that makes their OPs more expensive and also probably don't care too much about the VM problems you are looking at (?)

- There is no protocol support to exchange RRR. This is all done by upper level protocols (ala HTTP vs TCP).
You cannot assert and revoke RRR in a general way. Every protocol is different and optimized. This is your step 'A will then send a message to B notifying..'. It simply does not exist in the protocol specifications.

I don't know much about Quadrics, but I would be hesitant to lump it in too much with these RDMA semantics. Christian's comments sound like they operate closer to what you described and that is why they have an existing patch set. I don't know :)

What it boils down to is that to implement true removal of pages in a general way the kernel and HCA must either drop packets or stall incoming packets; both are big performance problems - and I can't see many users wanting this. Enterprise style people using SCSI, NFS, etc. already have short pin periods and HPC MPI users probably won't care about the VM issues enough to warrant the performance overhead.

Regards,
Jason
Re: [ofa-general] Re: Demand paging for memory regions
On Tue, 12 Feb 2008, Jason Gunthorpe wrote:

> The problem is that the existing wire protocols do not have a
> provision for doing an 'are you ready' or 'I am not ready' exchange
> and they are not designed to store page tables on both sides as you
> propose. The remote side can send RDMA WRITE traffic at any time after
> the RDMA region is established. The local side must be able to handle
> it. There is no way to signal that a page is not ready and the remote
> should not send.
>
> This means the only possible implementation is to stall/discard at the
> local adaptor when a RDMA WRITE is received for a page that has been
> reclaimed. This is what leads to deadlock/poor performance..

You would only use the wire protocols *after* having established the RDMA region. The notifier chains allow a RDMA region (or parts thereof) to be taken down on demand by the VM. The region can be reestablished if one of the sides accesses it. I hope I got that right. Not much exposure to Infiniband so far.

Lets say you have two systems, A and B. Each has its memory region MemA and MemB. Each side also has page tables for this region, PtA and PtB.

Now you establish a RDMA connection between both sides. The pages in both MemB and MemA are present and so are entries in PtA and PtB. RDMA traffic can proceed.

The VM on system A now gets into a situation in which memory becomes heavily used by another (maybe non-RDMA) process and, after checking that there was no recent reference to MemA and MemB (via a notifier aging callback), decides to reclaim the memory from MemA.

In that case it will notify the RDMA subsystem on A that it is trying to reclaim a certain page.

The RDMA subsystem on A will then send a message to B notifying it that the memory will be going away. B now has to remove its corresponding page from memory (and drop the entry in PtB) and confirm to A that this has happened. RDMA traffic is then stopped for this page. Then A can also remove its page, the corresponding entry in PtA, and the page is reclaimed or pushed out to swap, completing the page reclaim.

If either side then accesses the page again then the reverse process happens. If B accesses the page then it will first of all incur a page fault because the entry in PtB is missing. The fault will then cause a message to be sent to A to establish the page again. A will create an entry in PtA and will then confirm to B that the page was established. At that point RDMA operations can occur again.

So the whole scheme does not really need a hardware page table in the RDMA hardware. The page tables of the two systems A and B are sufficient.

The scheme can also be applied to a larger range than only a single page. The RDMA subsystem could tear down a large section when reclaim is pushing on it and then reestablish it as needed.

Swapping and page reclaim certainly do not improve the speed of the application affected by them, but they allow the VM to manage memory effectively if multiple loads are running on a system.
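Condensing the A/B walkthrough above into an illustrative out-of-band protocol (every name here is invented; this is a sketch of the proposed scheme, not an existing API):

    /* Message types for a hypothetical VM notification channel. */
    enum vm_msg { VM_UNMAP_REQ, VM_UNMAP_ACK, VM_MAP_REQ, VM_MAP_ACK };

    struct conn;                                   /* opaque OOB channel   */
    void send_msg(struct conn *c, enum vm_msg m, unsigned long va);
    void wait_msg(struct conn *c, enum vm_msg m, unsigned long va);
    void clear_pte(unsigned long va);              /* hypothetical helpers */
    void install_pte(unsigned long va);

    /* On A, when the VM asks to reclaim a page in MemA: */
    static void reclaim_page(struct conn *c, unsigned long va)
    {
        send_msg(c, VM_UNMAP_REQ, va);  /* tell B the page is going away */
        wait_msg(c, VM_UNMAP_ACK, va);  /* B dropped its PtB entry       */
        clear_pte(va);                  /* drop PtA; page can be swapped */
    }

    /* On B, a later access faults and re-establishes the mapping: */
    static void remote_fault(struct conn *c, unsigned long va)
    {
        send_msg(c, VM_MAP_REQ, va);    /* ask A to re-establish the page */
        wait_msg(c, VM_MAP_ACK, va);    /* A recreated its PtA entry      */
        install_pte(va);                /* RDMA to this page may resume   */
    }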
Re: [ofa-general] Re: Demand paging for memory regions
On Tue, 12 Feb 2008, Christian Bell wrote:

> I think there are very real potential clients of the interface when an
> optimistic approach is used. Part of the trick, however, has to do
> with being able to re-start transfers instead of buffering the data
> or making guarantees about delivery that could cause deadlock (as was
> alluded to earlier in this thread). InfiniBand is constrained in
> this regard since it requires message-ordering between endpoints (or
> queue pairs). One could argue that this is still possible with IB,
> at the cost of throwing more packets away when a referenced page is
> not in memory. With this approach, the worst-case demand paging
> scenario is met when the active working set of referenced pages is
> larger than the amount of physical memory -- but HPC applications are
> already bound by this anyway.
>
> You'll find that Quadrics has the most experience in this area and
> that their entire architecture is adapted to being optimistic about
> demand paging in RDMA transfers -- they've been maintaining a patchset
> to do this for years.

The notifier patchset that we are discussing here was mostly inspired by their work.

There is no need to restart transfers that you have never started in the first place. The remote side would never start a transfer if the page reference has been torn down. In order to start the transfer, a fault handler on the remote side would have to set up the association between the memory on both ends again.
Re: [ofa-general] Re: Demand paging for memory regions
On Tue, 12 Feb 2008, Christoph Lameter wrote:

> On Tue, 12 Feb 2008, Jason Gunthorpe wrote:
>
> > Well, certainly today the memfree IB devices store the page tables in
> > host memory so they are already designed to hang onto packets during
> > the page lookup over PCIE, adding in faulting makes this time
> > larger.
>
> You really do not need a page table to use it. What needs to be maintained
> is knowledge on both sides about what pages are currently shared across
> RDMA. If the VM decides to reclaim a page then the notification is used to
> remove the remote entry. If the remote side then tries to access the page
> again then the page fault on the remote side will stall until the local
> page has been brought back. RDMA can proceed after both sides again agree
> on that page now being sharable.

HPC environments won't be amenable to a pessimistic approach of synchronizing before every data transfer. RDMA is assumed to be a low-level data movement mechanism that has no implied synchronization. In some parallel programming models, it's not uncommon to use RDMA to send 8-byte messages. It can be difficult to make and hold guarantees about in-memory pages when many concurrent RDMA operations are in flight (not uncommon in reasonably large machines). Some of the in-memory page information could be shared with some form of remote caching strategy, but then it's a different problem with its own scalability challenges.

I think there are very real potential clients of the interface when an optimistic approach is used. Part of the trick, however, has to do with being able to re-start transfers instead of buffering the data or making guarantees about delivery that could cause deadlock (as was alluded to earlier in this thread). InfiniBand is constrained in this regard since it requires message-ordering between endpoints (or queue pairs). One could argue that this is still possible with IB, at the cost of throwing more packets away when a referenced page is not in memory. With this approach, the worst-case demand paging scenario is met when the active working set of referenced pages is larger than the amount of physical memory -- but HPC applications are already bound by this anyway.

You'll find that Quadrics has the most experience in this area and that their entire architecture is adapted to being optimistic about demand paging in RDMA transfers -- they've been maintaining a patchset to do this for years.

. . christian
Re: [ofa-general] Re: Demand paging for memory regions
Jason Gunthorpe wrote:
> On Tue, Feb 12, 2008 at 05:01:17PM -0800, Christoph Lameter wrote:
> > On Tue, 12 Feb 2008, Jason Gunthorpe wrote:
> >
> > > Well, certainly today the memfree IB devices store the page tables in
> > > host memory so they are already designed to hang onto packets during
> > > the page lookup over PCIE, adding in faulting makes this time
> > > larger.
> >
> > You really do not need a page table to use it. What needs to be maintained
> > is knowledge on both sides about what pages are currently shared across
> > RDMA. If the VM decides to reclaim a page then the notification is used to
> > remove the remote entry. If the remote side then tries to access the page
> > again then the page fault on the remote side will stall until the local
> > page has been brought back. RDMA can proceed after both sides again agree
> > on that page now being sharable.
>
> The problem is that the existing wire protocols do not have a
> provision for doing an 'are you ready' or 'I am not ready' exchange
> and they are not designed to store page tables on both sides as you
> propose. The remote side can send RDMA WRITE traffic at any time after
> the RDMA region is established. The local side must be able to handle
> it. There is no way to signal that a page is not ready and the remote
> should not send.
>
> This means the only possible implementation is to stall/discard at the
> local adaptor when a RDMA WRITE is received for a page that has been
> reclaimed. This is what leads to deadlock/poor performance..

If the events are few and far between then this model is probably ok. For iWARP, it means TCP retransmit and slow start and all that, but if it's an infrequent event, then it's ok if it helps the host better manage memory.

Maybe... ;-)

Steve.
Re: [ofa-general] Re: Demand paging for memory regions
On Tue, Feb 12, 2008 at 05:01:17PM -0800, Christoph Lameter wrote:
> On Tue, 12 Feb 2008, Jason Gunthorpe wrote:
>
> > Well, certainly today the memfree IB devices store the page tables in
> > host memory so they are already designed to hang onto packets during
> > the page lookup over PCIE, adding in faulting makes this time
> > larger.
>
> You really do not need a page table to use it. What needs to be maintained
> is knowledge on both sides about what pages are currently shared across
> RDMA. If the VM decides to reclaim a page then the notification is used to
> remove the remote entry. If the remote side then tries to access the page
> again then the page fault on the remote side will stall until the local
> page has been brought back. RDMA can proceed after both sides again agree
> on that page now being sharable.

The problem is that the existing wire protocols do not have a provision for doing an 'are you ready' or 'I am not ready' exchange and they are not designed to store page tables on both sides as you propose. The remote side can send RDMA WRITE traffic at any time after the RDMA region is established. The local side must be able to handle it. There is no way to signal that a page is not ready and the remote should not send.

This means the only possible implementation is to stall/discard at the local adaptor when a RDMA WRITE is received for a page that has been reclaimed. This is what leads to deadlock/poor performance..

Jason
Re: [ofa-general] Re: Demand paging for memory regions
On Tue, 12 Feb 2008, Jason Gunthorpe wrote:

> Well, certainly today the memfree IB devices store the page tables in
> host memory so they are already designed to hang onto packets during
> the page lookup over PCIE, adding in faulting makes this time
> larger.

You really do not need a page table to use it. What needs to be maintained is knowledge on both sides about what pages are currently shared across RDMA. If the VM decides to reclaim a page then the notification is used to remove the remote entry. If the remote side then tries to access the page again then the page fault on the remote side will stall until the local page has been brought back. RDMA can proceed after both sides again agree on that page now being sharable.
RE: [ofa-general] Re: Demand paging for memory regions
On Tue, 12 Feb 2008, Felix Marti wrote:

> > I don't know anything about the T3 internals, but it's not clear that
> > you could do this without a new chip design in general. Lots of RDMA
> > devices were designed expecting that when a packet arrives, the HW can
> > look up the bus address for a given memory region/offset and place the
> > packet immediately. It seems like a major change to be able to
> > generate a "page fault" interrupt when a page isn't present, or even
> > just wait to scatter some data until the host finishes updating page
> > tables when the HW needs the translation.
>
> That is correct, not a change we can make for T3. We could, in theory,
> deal with changing mappings though. The change would need to be
> synchronized though: the VM would need to tell us which mappings were
> about to change and the driver would then need to disable DMA to/from
> them, do the change and resume DMA.

Right. That is the intent of the patchset.
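The synchronized disable/change/resume sequence described here maps naturally onto a paired start/end notification. An illustrative driver-side sketch: the t3_* functions are invented, and the callback shape follows the mmu notifier API as eventually merged rather than this exact patchset:

    #include <linux/mmu_notifier.h>

    /* Hypothetical device hooks for a T3-style part that cannot fault
     * but can tolerate synchronized mapping changes. */
    void t3_quiesce_dma(unsigned long start, unsigned long end);
    void t3_reload_mappings(unsigned long start, unsigned long end);
    void t3_resume_dma(unsigned long start, unsigned long end);

    static void t3_invalidate_start(struct mmu_notifier *mn,
                                    struct mm_struct *mm,
                                    unsigned long start, unsigned long end)
    {
        t3_quiesce_dma(start, end);       /* stop DMA to/from the range */
    }

    static void t3_invalidate_end(struct mmu_notifier *mn,
                                  struct mm_struct *mm,
                                  unsigned long start, unsigned long end)
    {
        t3_reload_mappings(start, end);   /* pick up the new translations */
        t3_resume_dma(start, end);        /* traffic may flow again       */
    }

    static const struct mmu_notifier_ops t3_notifier_ops = {
        .invalidate_range_start = t3_invalidate_start,
        .invalidate_range_end   = t3_invalidate_end,
    };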
Re: [ofa-general] Re: Demand paging for memory regions
On Tue, 12 Feb 2008, Roland Dreier wrote:

> I don't know anything about the T3 internals, but it's not clear that
> you could do this without a new chip design in general. Lots of RDMA
> devices were designed expecting that when a packet arrives, the HW can
> look up the bus address for a given memory region/offset and place the
> packet immediately. It seems like a major change to be able to
> generate a "page fault" interrupt when a page isn't present, or even
> just wait to scatter some data until the host finishes updating page
> tables when the HW needs the translation.

Well, if the VM wants to invalidate a page then the remote end first has to remove its mapping. If a page has been removed then the remote end would encounter a fault and would have to wait for the local end to reestablish its mapping before proceeding. So the packet would only be generated when both ends are in sync.
Re: [ofa-general] Re: Demand paging for memory regions
On Tue, Feb 12, 2008 at 02:41:48PM -0800, Roland Dreier wrote:
> > > Chelsio's T3 HW doesn't support this.
>
> > Not so far I guess but it could be equipped with these features right?
>
> I don't know anything about the T3 internals, but it's not clear that
> you could do this without a new chip design in general. Lots of RDMA
> devices were designed expecting that when a packet arrives, the HW can
> look up the bus address for a given memory region/offset and place the

Well, certainly today the memfree IB devices store the page tables in host memory so they are already designed to hang onto packets during the page lookup over PCIE; adding in faulting makes this time larger.

But this is not a good thing at all. IB's congestion model is based on the notion that end ports can always accept packets without making input contingent on output. If you take a software interrupt to fill in the page pointer then you could potentially deadlock on the fabric. For example, using this mechanism to allow swap-in of RDMA target pages and then putting the storage over IB would be deadlock prone. Even without deadlock, slowing down the input path will cause network congestion and poor performance for other nodes. It is not a desirable thing to do..

I expect that iwarp running over flow controlled ethernet has similar kinds of problems for similar reasons..

In general the best I think you can hope for with RDMA hardware is page migration using some atomic operations with the adaptor and a cpu page copy with a retry sort of scheme - but is pure page migration interesting at all?

Jason
RE: [ofa-general] Re: Demand paging for memory regions
> -Original Message-
> From: Roland Dreier
> Sent: Tuesday, February 12, 2008 2:42 PM
> Subject: Re: [ofa-general] Re: Demand paging for memory regions
>
> > > Chelsio's T3 HW doesn't support this.
>
> > Not so far I guess but it could be equipped with these features
> > right?
>
> I don't know anything about the T3 internals, but it's not clear that
> you could do this without a new chip design in general. Lots of RDMA
> devices were designed expecting that when a packet arrives, the HW can
> look up the bus address for a given memory region/offset and place the
> packet immediately. It seems like a major change to be able to
> generate a "page fault" interrupt when a page isn't present, or even
> just wait to scatter some data until the host finishes updating page
> tables when the HW needs the translation.

That is correct, not a change we can make for T3. We could, in theory, deal with changing mappings though. The change would need to be synchronized though: the VM would need to tell us which mappings were about to change and the driver would then need to disable DMA to/from them, do the change and resume DMA.
Re: [ofa-general] Re: Demand paging for memory regions
> > Chelsio's T3 HW doesn't support this.

> Not so far I guess but it could be equipped with these features right?

I don't know anything about the T3 internals, but it's not clear that you could do this without a new chip design in general. Lots of RDMA devices were designed expecting that when a packet arrives, the HW can look up the bus address for a given memory region/offset and place the packet immediately. It seems like a major change to be able to generate a "page fault" interrupt when a page isn't present, or even just wait to scatter some data until the host finishes updating page tables when the HW needs the translation.

- R.
