>>> On 01.05.18 at 22:22, <bou...@antioche.eu.org> wrote:
> On Mon, Apr 30, 2018 at 07:31:28AM -0600, Jan Beulich wrote:
>> >>> On 25.04.18 at 16:42, <bou...@antioche.eu.org> wrote:
>> > On Wed, Apr 25, 2018 at 12:42:42PM +0200, Manuel Bouyer wrote:
>> >> > Without line numbers associated with at least the top stack trace entry
>> >> > I can only guess what it might be - could you give the patch below a 
>> >> > try?
>> >> > (This may not be the final patch, as I'm afraid there may be some race
>> >> > here, but I'd have to work this out later.)
>> >> 
>> >> Yes, this works. Thanks!
>> >> I'll now put this version on the NetBSD testbed I'm running.
>> >> This should put some pressure on it.
>> > 
>> > Running NetBSD tests in several guests I got:
>> > (XEN) 
>> > (XEN) ****************************************
>> > (XEN) Panic on CPU 1:
>> > (XEN) Assertion 'oc > 0' failed at mm.c:628
>> > (XEN) ****************************************
>> > (see attached file for complete report).
>> 
>> So in combination with your later reply I'm confused: Are you observing
>> this with 64-bit guests as well (your later reply appears to hint towards
>> 64-bit-ness), or (as the stack trace suggests) only 32-bit ones? Knowing
>> this may already narrow areas where to look.
> 
> I've seen it on a server where, I think, only 32-bit domUs are running.
> But the dom0 is a 64-bit NetBSD anyway.

Right; Dom0 bitness is of no interest. I've been going through numerous
possibly racing combinations of code paths, without being able to spot
anything yet. I'm afraid I'm not in a position to try to set up the full
environment you're observing the problem in. It would therefore really
help if you could
- debug this yourself, or
- reduce the test environment (ideally to a simple [XTF?] test), or
- at least narrow the conditions, or
- at the very least summarize the relevant actions NetBSD takes in
  terms of page table management, to hopefully reduce the set of
  code paths potentially involved (for example, knowing whether UNPIN
  is always involved across a larger set of crashes would be helpful;
  lacking further data, I've been blindly assuming that it is)
(besides a more reliable confirmation - or otherwise - that this indeed
is an issue with 32-bit guests only).

While I think I have ruled out the TLB flush time stamp setting still
happening too early / wrongly in certain cases, there's a small
debugging patch that I would hope could help prove this one way or the
other (see below).
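
One possible reading of the NB comment in the patch below, as a minimal
sketch only. It assumes that linear_pt_count aliases the upper 16 bits of
the 32-bit word that also holds the TLB flush time stamp; that layout is
an assumption and is not shown in this mail. Under it, a stale time stamp
cannot look like a non-zero count as long as stamps stay at or below
0xffff. The names sketch_count_from_word and SKETCH_WRAP_MASK are
hypothetical and do not exist in Xen.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound taken from the NB comment in the patch (assumed here). */
#define SKETCH_WRAP_MASK 0xffffu

/*
 * ASSUMPTION: the 16-bit count occupies bits 16..31 of the word that
 * also stores the time stamp. Shifts are used instead of a union so the
 * sketch does not depend on byte order.
 */
uint16_t sketch_count_from_word(uint32_t word)
{
    return (uint16_t)(word >> 16);
}

/* Returns 1 if a stored stamp would leave the aliased count at zero. */
int sketch_stamp_keeps_count_clear(uint32_t stamp)
{
    return sketch_count_from_word(stamp) == 0;
}
```

With stamps bounded by SKETCH_WRAP_MASK the aliased count always reads
zero; a larger stamp would make the new ASSERT() fire spuriously.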

Btw: You've said earlier that there wouldn't be a domain number in
the panic message. However,

(XEN) RFLAGS: 0000000000010246   CONTEXT: hypervisor (d14v3)

has it (at the end: domain 14, vCPU 3). Just in case this helps
identify further useful pieces of information.
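
When going through a larger set of crash logs, the tag can be pulled out
mechanically. A minimal sketch follows, assuming the tag always has the
form "(d<domain>v<vcpu>)" as in the line quoted above. The helper
parse_context_tag is hypothetical, not part of Xen or its tools.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: extract the domain and vCPU numbers from a
 * console line such as
 *   "(XEN) RFLAGS: 0000000000010246   CONTEXT: hypervisor (d14v3)"
 * Returns 1 and fills *dom / *vcpu on success, 0 if no tag is found.
 */
int parse_context_tag(const char *line, unsigned int *dom,
                      unsigned int *vcpu)
{
    const char *p;

    assert(line && dom && vcpu);

    /* Try every "(d" in the line until one parses as "(d<N>v<M>)". */
    for (p = strstr(line, "(d"); p; p = strstr(p + 1, "(d")) {
        unsigned int d, v;
        char close;

        if (sscanf(p, "(d%uv%u%c", &d, &v, &close) == 3 && close == ')') {
            *dom = d;
            *vcpu = v;
            return 1;
        }
    }
    return 0;
}
```

Lines without such a tag (e.g. the "Panic on CPU 1:" line) return 0 and
leave the outputs untouched.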

Jan

--- unstable.orig/xen/arch/x86/mm.c
+++ unstable/xen/arch/x86/mm.c
@@ -578,7 +578,11 @@ static inline void set_tlbflush_timestam
      */
     if ( !(page->count_info & PGC_page_table) ||
          !shadow_mode_enabled(page_get_owner(page)) )
+    {
+        /* NB: This depends on WRAP_MASK in flushtlb.c to be <= 0xffff. */
+        ASSERT(!page->linear_pt_count);
         page_set_tlbflush_timestamp(page);
+    }
 }
 
 const char __section(".bss.page_aligned.const") __aligned(PAGE_SIZE)



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
