Hi, Sorry for the formatting.
On Thu, 23 Sep 2021, 06:10 Stefano Stabellini, <sstabell...@kernel.org> wrote:

> On Wed, 22 Sep 2021, Jan Beulich wrote:
> > On 22.09.2021 01:38, Stefano Stabellini wrote:
> > > On Mon, 20 Sep 2021, Ian Jackson wrote:
> > >> Jan Beulich writes ("Re: [xen-unstable test] 164996: regressions - FAIL"):
> > >>> As per
> > >>>
> > >>> Sep 15 14:44:55.502598 [ 1613.322585] Mem-Info:
> > >>> Sep 15 14:44:55.502643 [ 1613.324918] active_anon:5639 inactive_anon:15857 isolated_anon:0
> > >>> Sep 15 14:44:55.514480 [ 1613.324918] active_file:13286 inactive_file:11182 isolated_file:0
> > >>> Sep 15 14:44:55.514545 [ 1613.324918] unevictable:0 dirty:30 writeback:0 unstable:0
> > >>> Sep 15 14:44:55.526477 [ 1613.324918] slab_reclaimable:10922 slab_unreclaimable:30234
> > >>> Sep 15 14:44:55.526540 [ 1613.324918] mapped:11277 shmem:10975 pagetables:401 bounce:0
> > >>> Sep 15 14:44:55.538474 [ 1613.324918] free:8364 free_pcp:100 free_cma:1650
> > >>>
> > >>> the system doesn't look to really be out of memory; as per
> > >>>
> > >>> Sep 15 14:44:55.598538 [ 1613.419061] DMA32: 2788*4kB (UMEC) 890*8kB (UMEC) 497*16kB (UMEC) 36*32kB (UMC) 1*64kB (C) 1*128kB (C) 9*256kB (C) 7*512kB (C) 0*1024kB 0*2048kB 0*4096kB = 33456kB
> > >>>
> > >>> there even look to be a number of higher order pages available (albeit
> > >>> without digging I can't tell what "(C)" means). Nevertheless order-4
> > >>> allocations aren't really nice.
> > >>
> > >> The host history suggests this may possibly be related to a qemu update.
> > >>
> > >> http://logs.test-lab.xenproject.org/osstest/results/host/rochester0.html
> >
> > Stefano - as per some of your investigation detailed further down I
> > wonder whether you had seen this part of Ian's reply. (Question of
> > course then is how that qemu update had managed to get pushed.)
> >
> > >> The grub cfg has this:
> > >>
> > >> multiboot /xen placeholder conswitch=x watchdog noreboot async-show-all console=dtuart dom0_mem=512M,max:512M ucode=scan ${xen_rm_opts}
> > >>
> > >> It's not clear to me whether xen_rm_opts is "" or "no-real-mode edd=off".
> > >
> > > I definitely recommend increasing dom0 memory, especially as I guess
> > > the box is going to have a significant amount, far more than 4GB. I
> > > would set it to 2GB. Also the syntax on ARM is simpler, so it should be
> > > just: dom0_mem=2G
> >
> > Ian - I guess that's an adjustment relatively easy to make? I wonder
> > though whether we wouldn't want to address the underlying issue first.
> > Presumably not, because the fix would likely take quite some time to
> > propagate suitably. Yet if not, we will want to have some way of
> > verifying that an eventual fix there would have helped here.
> >
> > > In addition, I also did some investigation just in case there is
> > > actually a bug in the code and it is not a simple OOM problem.
> >
> > I think the actual issue is quite clear; what I'm struggling with is
> > why we weren't hit by it earlier.
> >
> > As imo always, non-order-0 allocations (perhaps excluding the bringing
> > up of the kernel or whichever entity) are to be avoided if at all possible.
> > The offender in this case looks to be privcmd's alloc_empty_pages().
> > For it to request through kcalloc() what ends up being an order-4
> > allocation, the original IOCTL_PRIVCMD_MMAPBATCH must specify a pretty
> > large chunk of guest memory to get mapped. Which may in turn be
> > questionable, but I'm afraid I don't have the time to try to drill
> > down where that request is coming from and whether that also wouldn't
> > better be split up.
> >
> > The solution looks simple enough - convert from kcalloc() to kvcalloc().
> > I can certainly spin up a patch to Linux to this effect.
> > Yet that still won't answer the question of why this issue has popped
> > up all of a sudden (and hence whether there are things wanting changing
> > elsewhere as well).
>
> Also, I saw your patches for Linux. Let's say that the patches are
> reviewed and enqueued immediately to be sent to Linus at the next
> opportunity. It is going to take a while for them to take effect in
> OSSTest, unless we import them somehow in the Linux tree used by OSSTest
> straight away, right?

For Arm testing we don't use a branch provided by Linux upstream, so your wait would be forever :).

> Should we arrange for one test OSSTest flight now with the patches
> applied to see if they actually fix the issue? Otherwise we might end up
> waiting for nothing...

We could push the patch to the branch we have. However, the Linux we use is fairly old (I think I did a push last year) and is not even the latest stable.

I can't remember whether we still carry some patches on top of Linux to run on Arm (specifically 32-bit). So maybe we should start tracking upstream instead? That would have the benefit of picking up any new patches.

Cheers,