Re: [xen-unstable test] 164996: regressions - FAIL

Julien Grall Wed, 22 Sep 2021 19:57:13 -0700

Hi,

Sorry for the formatting.



On Thu, 23 Sep 2021, 06:10 Stefano Stabellini, <sstabell...@kernel.org>
wrote:

> On Wed, 22 Sep 2021, Jan Beulich wrote:
> > On 22.09.2021 01:38, Stefano Stabellini wrote:
> > > On Mon, 20 Sep 2021, Ian Jackson wrote:
> > >> Jan Beulich writes ("Re: [xen-unstable test] 164996: regressions -
> FAIL"):
> > >>> As per
> > >>>
> > >>> Sep 15 14:44:55.502598 [ 1613.322585] Mem-Info:
> > >>> Sep 15 14:44:55.502643 [ 1613.324918] active_anon:5639
> inactive_anon:15857 isolated_anon:0
> > >>> Sep 15 14:44:55.514480 [ 1613.324918]  active_file:13286
> inactive_file:11182 isolated_file:0
> > >>> Sep 15 14:44:55.514545 [ 1613.324918]  unevictable:0 dirty:30
> writeback:0 unstable:0
> > >>> Sep 15 14:44:55.526477 [ 1613.324918]  slab_reclaimable:10922
> slab_unreclaimable:30234
> > >>> Sep 15 14:44:55.526540 [ 1613.324918]  mapped:11277 shmem:10975
> pagetables:401 bounce:0
> > >>> Sep 15 14:44:55.538474 [ 1613.324918]  free:8364 free_pcp:100
> free_cma:1650
> > >>>
> > >>> the system doesn't look to really be out of memory; as per
> > >>>
> > >>> Sep 15 14:44:55.598538 [ 1613.419061] DMA32: 2788*4kB (UMEC) 890*8kB
> (UMEC) 497*16kB (UMEC) 36*32kB (UMC) 1*64kB (C) 1*128kB (C) 9*256kB (C)
> 7*512kB (C) 0*1024kB 0*2048kB 0*4096kB = 33456kB
> > >>>
> > >>> there even look to be a number of higher order pages available
> (albeit
> > >>> without digging I can't tell what "(C)" means). Nevertheless order-4
> > >>> allocations aren't really nice.
> > >>
> > >> The host history suggests this may possibly be related to a qemu
> update.
> > >>
> > >>
> http://logs.test-lab.xenproject.org/osstest/results/host/rochester0.html
> >
> > Stefano - as per some of your investigation detailed further down I
> > wonder whether you had seen this part of Ian's reply. (Question of
> > course then is how that qemu update had managed to get pushed.)
> >
> > >> The grub cfg has this:
> > >>
> > >>  multiboot /xen placeholder conswitch=x watchdog noreboot
> async-show-all console=dtuart dom0_mem=512M,max:512M ucode=scan
> ${xen_rm_opts}
> > >>
> > >> It's not clear to me whether xen_rm_opts is "" or "no-real-mode
> edd=off".
> > >
> > > I definitely recommend to increase dom0 memory, especially as I guess
> > > the box is going to have a significant amount, far more than 4GB. I
> > > would set it to 2GB. Also the syntax on ARM is simpler, so it should be
> > > just: dom0_mem=2G
> >
> > Ian - I guess that's an adjustment relatively easy to make? I wonder
> > though whether we wouldn't want to address the underlying issue first.
> > Presumably not, because the fix would likely take quite some time to
> > propagate suitably. Yet if not, we will want to have some way of
> > verifying that an eventual fix there would have helped here.
> >
> > > In addition, I also did some investigation just in case there is
> > > actually a bug in the code and it is not a simple OOM problem.
> >
> > I think the actual issue is quite clear; what I'm struggling with is
> > why we weren't hit by it earlier.
> >
> > As imo always, non-order-0 allocations (perhaps excluding the bringing
> > up of the kernel or whichever entity) are to be avoided it at possible.
> > The offender in this case looks to be privcmd's alloc_empty_pages().
> > For it to request through kcalloc() what ends up being an order-4
> > allocation, the original IOCTL_PRIVCMD_MMAPBATCH must specify a pretty
> > large chunk of guest memory to get mapped. Which may in turn be
> > questionable, but I'm afraid I don't have the time to try to drill
> > down where that request is coming from and whether that also wouldn't
> > better be split up.
> >
> > The solution looks simple enough - convert from kcalloc() to kvcalloc().
> > I can certainly spin up a patch to Linux to this effect. Yet that still
> > won't answer the question of why this issue has popped up all of the
> > sudden (and hence whether there are things wanting changing elsewhere
> > as well).
>
> Also, I saw your patches for Linux. Let's say that the patches are
> reviewed and enqueued immediately to be sent to Linus at the next
> opportunity. It is going to take a while for them to take effect in
> OSSTest, unless we import them somehow in the Linux tree used by OSSTest
> straight away, right?
>

For Arm testing we don't use a branch provided by Linux upstream. So your
wait will be forever :).


> Should we arrange for one test OSSTest flight now with the patches
> applied to see if they actually fix the issue? Otherwise we might end up
> waiting for nothing...


We could push the patch in the branch we have. However the Linux we use is
not fairly old (I think I did a push last year) and not even the latest
stable.

I can't remember whether we still have some patches on top of Linux to run
on arm (specifically 32-bit). So maybe we should start to track upstream
instead?

This will have the benefits to pick any new patches.

Cheers,

.





>

Re: [xen-unstable test] 164996: regressions - FAIL

Reply via email to