On Sun, Aug 18, 2024 at 2:45 PM Barry Song <21cn...@gmail.com> wrote: > > On Sun, Aug 18, 2024 at 6:27 PM Barry Song <21cn...@gmail.com> wrote: > > > > On Sun, Aug 18, 2024 at 5:51 PM Yafang Shao <laoar.s...@gmail.com> wrote: > > > > > > On Sun, Aug 18, 2024 at 11:48 AM Barry Song <21cn...@gmail.com> wrote: > > > > > > > > On Sun, Aug 18, 2024 at 2:55 PM Yafang Shao <laoar.s...@gmail.com> > > > > wrote: > > > > > > > > > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cn...@gmail.com> wrote: > > > > > > > > > > > > From: Barry Song <v-songbao...@oppo.com> > > > > > > > > > > > > When users allocate memory with the __GFP_NOFAIL flag, they might > > > > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc. This > > > > > > kind of > > > > > > non-blockable __GFP_NOFAIL is not supported and is pointless. If we > > > > > > attempt and still fail to allocate memory for these users, we have > > > > > > two > > > > > > choices: > > > > > > > > > > > > 1. We could busy-loop and hope that some other direct > > > > > > reclamation or > > > > > > kswapd rescues the current process. However, this is unreliable > > > > > > and could ultimately lead to hard or soft lockups, > > > > > > > > > > That can occur even if we set both __GFP_NOFAIL and > > > > > __GFP_DIRECT_RECLAIM, right? So, I don't believe the issue is related > > > > > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed > > > > > design of __GFP_NOFAIL itself. > > > > > > > > the point of GFP_NOFAIL is that it won't fail and its user won't check > > > > the return value. without direct_reclamation, it is sometimes > > > > impossible. > > > > but with direct reclamation, users constantly wait and finally they can > > > > > > So, what exactly is the difference between 'constantly waiting' and > > > 'busy looping'? Could you please clarify? Also, why can't we > > > 'constantly wait' when __GFP_DIRECT_RECLAIM is not set? > > > > I list two options in changelog > > 1: busy loop 2. bug_on. I am actually fine with either one. either one is > > better than the existing code. but returning null in the current code > > is definitely wrong. > > > > 1 somehow has the attempt to make __GFP_NOFAIL without direct_reclamation > > legal. so it is a bit suspicious going in the wrong direction. > > > > busy-loop is that you are not reclaiming memory you are not sleeping. > > cpu is constantly working and busy, so it might result in a lockup, either > > soft lockup or hard lockup.
Thanks for the clarification. That can be avoided by a simple cond_resched() if the hard lockup or softlockup is the main issue ;) > > > > with direct_reclamation, wait is the case you can sleep. it is not holding > > cpu, not a busy loop. in rare case, users might end in endless wait, > > but it matches the doc of __GFP_NOFAIL, never return till memory > > is gotten (the current code is implemented in this way unless users > > incorrectly combine __GFP_NOFAIL with aotmic/nowait etc.) > > > > and the essential difference between "w/ and w/o direct_reclaim": with > direct reclaim, the user is actively reclaiming memory to rescue itself > by all kinds of possible ways(compact, oom, reclamation), while without > direct reclamation, it can do nothing and just loop (busy-loop). It can wake up kswapd, which can then reclaim memory. If kswapd can't keep up, the system is likely under heavy memory pressure. In such a case, it makes little difference whether __GFP_DIRECT_RECLAIM is set or not. For reference, see the old issue: https://lore.kernel.org/lkml/d9802b6a-949b-b327-c4a6-3dbca485e...@gmx.com/. I believe the core issue persists, and the design of __GFP_NOFAIL exacerbates it. By the way, I believe we could trigger an asynchronous OOM kill in the case without direct reclaim to avoid busy looping. > > > note, long-term we won't expose __GFP_NOFAIL any more. we > > will only expose GFP_NOFAIL which enforces Blockable. I am > > quite busy on other issues, so this won't happen in a short time. > > > > > > > > > get memory. if you read the doc of __GFP_NOFAIL you will find it. > > > > it is absolutely clearly documented. > > > > > > > > > > > > > > > which might not > > > > > > be well supported by some architectures. > > > > > > > > > > > > 2. We could use BUG_ON to trigger a reliable system crash, > > > > > > avoiding > > > > > > exposing NULL dereference. > > > > > > > > > > > > Neither option is ideal, but both are improvements over the > > > > > > existing code. > > > > > > This patch selects the second option because, with the introduction > > > > > > of > > > > > > scoped API and GFP_NOFAIL—capable of enforcing direct reclamation > > > > > > for > > > > > > nofail users(which is in my plan), non-blockable nofail allocations > > > > > > will > > > > > > no longer be possible. > > > > > > > > > > > > Signed-off-by: Barry Song <v-songbao...@oppo.com> > > > > > > Cc: Michal Hocko <mho...@suse.com> > > > > > > Cc: Uladzislau Rezki (Sony) <ure...@gmail.com> > > > > > > Cc: Christoph Hellwig <h...@infradead.org> > > > > > > Cc: Lorenzo Stoakes <lorenzo.stoa...@oracle.com> > > > > > > Cc: Christoph Lameter <c...@linux.com> > > > > > > Cc: Pekka Enberg <penb...@kernel.org> > > > > > > Cc: David Rientjes <rient...@google.com> > > > > > > Cc: Joonsoo Kim <iamjoonsoo....@lge.com> > > > > > > Cc: Vlastimil Babka <vba...@suse.cz> > > > > > > Cc: Roman Gushchin <roman.gushc...@linux.dev> > > > > > > Cc: Hyeonggon Yoo <42.hye...@gmail.com> > > > > > > Cc: Linus Torvalds <torva...@linux-foundation.org> > > > > > > Cc: Kees Cook <k...@kernel.org> > > > > > > Cc: "Eugenio Pérez" <epere...@redhat.com> > > > > > > Cc: Hailong.Liu <hailong....@oppo.com> > > > > > > Cc: Jason Wang <jasow...@redhat.com> > > > > > > Cc: Maxime Coquelin <maxime.coque...@redhat.com> > > > > > > Cc: "Michael S. Tsirkin" <m...@redhat.com> > > > > > > Cc: Xuan Zhuo <xuanz...@linux.alibaba.com> > > > > > > --- > > > > > > mm/page_alloc.c | 10 +++++----- > > > > > > 1 file changed, 5 insertions(+), 5 deletions(-) > > > > > > > > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > > > > > index d2c37f8f8d09..fb5850ecd3ae 100644 > > > > > > --- a/mm/page_alloc.c > > > > > > +++ b/mm/page_alloc.c > > > > > > @@ -4399,11 +4399,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, > > > > > > unsigned int order, > > > > > > */ > > > > > > if (gfp_mask & __GFP_NOFAIL) { > > > > > > /* > > > > > > - * All existing users of the __GFP_NOFAIL are > > > > > > blockable, so warn > > > > > > - * of any new users that actually require GFP_NOWAIT > > > > > > + * All existing users of the __GFP_NOFAIL are > > > > > > blockable > > > > > > + * otherwise we introduce a busy loop with inside > > > > > > the page > > > > > > + * allocator from non-sleepable contexts > > > > > > */ > > > > > > - if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask)) > > > > > > - goto fail; > > > > > > + BUG_ON(!can_direct_reclaim); > > > > > > > > > > I'm not in favor of using BUG_ON() here, as many call sites already > > > > > handle the return value of __GFP_NOFAIL. > > > > > > > > > > > > > it is not correct to handle the return value of __GFP_NOFAIL. > > > > if you check the ret, don't use __GFP_NOFAIL. > > > > > > If so, you have many code changes to make in the linux kernel ;) > > > > > > > Please list those code using __GFP_NOFAIL and check the result > > might fail, we should get them fixed. This is insane. NOFAIL means > > no fail. You can find some instances with grep commands, but there's no reliable way to capture them all with a single command. Here are a few examples: // drivers/infiniband/hw/cxgb4/mem.c skb = alloc_skb(wr_len, GFP_KERNEL | __GFP_NOFAIL); if (!skb) return -ENOMEM; // fs/xfs/libxfs/xfs_dir2.c args = kzalloc(sizeof(*args), GFP_KERNEL | __GFP_NOFAIL); if (!args) return -ENOMEM; -- Regards Yafang