Re: Bad link elm in vm_object_terminate [Was: crash on process exit.. current at about r332467]

2018-05-29 Thread Andriy Gapon
On 29/05/2018 19:22, Mark Johnston wrote:
> On Tue, May 29, 2018 at 04:50:14PM +0300, Andriy Gapon wrote:
>> On 23/04/2018 17:50, Julian Elischer wrote:
>>> back trace at:  http://www.freebsd.org/~julian/bob-crash.png
>>>
>>> If anyone wants to take a look..
>>>
>>> In the exit syscall, while deallocating a vm object.
>>>
>>> I haven't seen references to a similar crash in the last 10 days or so.. But if
>>> it rings any bells...
>>
>> We have just got another one:
>> panic: Bad link elm 0xf80cc3938360 prev->next != elm
>>
>> Matching disassembled code to C code, it seems that the crash is somewhere in
>> vm_object_terminate_pages (inlined into vm_object_terminate), probably in one of
>> TAILQ_REMOVE-s there:
>>         if (p->queue != PQ_NONE) {
>>                 KASSERT(p->queue < PQ_COUNT, ("vm_object_terminate: "
>>                     "page %p is not queued", p));
>>                 pq1 = vm_page_pagequeue(p);
>>                 if (pq != pq1) {
>>                         if (pq != NULL) {
>>                                 vm_pagequeue_cnt_add(pq, dequeued);
>>                                 vm_pagequeue_unlock(pq);
>>                         }
>>                         pq = pq1;
>>                         vm_pagequeue_lock(pq);
>>                         dequeued = 0;
>>                 }
>>                 p->queue = PQ_NONE;
>>                 TAILQ_REMOVE(&pq->pq_pl, p, plinks.q);
>>                 dequeued--;
>>         }
>>         if (vm_page_free_prep(p, true))
>>                 continue;
>> unlist:
>>         TAILQ_REMOVE(&object->memq, p, listq);
>> }
>>
>>
>> Please note that this is the code before r332974 "Improve VM page queue
>> scalability".
>> I am not sure if r332974 + r333256 would fix the problem or if it would just get
>> moved to a different place.
>>
>> Does this ring a bell to anyone who tinkered with that part of the VM code 
>> recently?
> 
> This doesn't look familiar to me and I doubt that r332974 fixed the
> underlying problem, whatever it is.

I see.

>> Looking a little bit further, I think that object->memq somehow got corrupted.
>> memq contains just two elements and the reported element is not there.
> 
> Based on the debugging session, it would be interesting to know if there
> were any other threads somehow manipulating the (dead) object at the
> time of the panic.

I will check for this.

> Among the panics that you observed, is it the same application that is
> causing the crash in each case?

I have two crash dumps right now and in both cases it's sh exec-ing grep.
But I cannot imagine what could be so special about that.
Actually, I see that the shell ran a long pipeline with many grep-s in it, so
there were many exec-s and exits of grep, perhaps some of them concurrent.

-- 
Andriy Gapon
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Bad link elm in vm_object_terminate [Was: crash on process exit.. current at about r332467]

2018-05-29 Thread Mark Johnston
On Tue, May 29, 2018 at 04:50:14PM +0300, Andriy Gapon wrote:
> On 23/04/2018 17:50, Julian Elischer wrote:
> > back trace at:  http://www.freebsd.org/~julian/bob-crash.png
> > 
> > If anyone wants to take a look..
> > 
> > In the exit syscall, while deallocating a vm object.
> > 
> > I haven't seen references to a similar crash in the last 10 days or so.. But if
> > it rings any bells...
> 
> We have just got another one:
> panic: Bad link elm 0xf80cc3938360 prev->next != elm
> 
> Matching disassembled code to C code, it seems that the crash is somewhere in
> vm_object_terminate_pages (inlined into vm_object_terminate), probably in one of
> TAILQ_REMOVE-s there:
>         if (p->queue != PQ_NONE) {
>                 KASSERT(p->queue < PQ_COUNT, ("vm_object_terminate: "
>                     "page %p is not queued", p));
>                 pq1 = vm_page_pagequeue(p);
>                 if (pq != pq1) {
>                         if (pq != NULL) {
>                                 vm_pagequeue_cnt_add(pq, dequeued);
>                                 vm_pagequeue_unlock(pq);
>                         }
>                         pq = pq1;
>                         vm_pagequeue_lock(pq);
>                         dequeued = 0;
>                 }
>                 p->queue = PQ_NONE;
>                 TAILQ_REMOVE(&pq->pq_pl, p, plinks.q);
>                 dequeued--;
>         }
>         if (vm_page_free_prep(p, true))
>                 continue;
> unlist:
>         TAILQ_REMOVE(&object->memq, p, listq);
> }
> 
> 
> Please note that this is the code before r332974 "Improve VM page queue
> scalability".
> I am not sure if r332974 + r333256 would fix the problem or if it would just get
> moved to a different place.
> 
> Does this ring a bell to anyone who tinkered with that part of the VM code 
> recently?

This doesn't look familiar to me and I doubt that r332974 fixed the
underlying problem, whatever it is.

> Looking a little bit further, I think that object->memq somehow got corrupted.
> memq contains just two elements and the reported element is not there.

Based on the debugging session, it would be interesting to know if there
were any other threads somehow manipulating the (dead) object at the
time of the panic.

Among the panics that you observed, is it the same application that is
causing the crash in each case?


Bad link elm in vm_object_terminate [Was: crash on process exit.. current at about r332467]

2018-05-29 Thread Andriy Gapon
On 23/04/2018 17:50, Julian Elischer wrote:
> back trace at:  http://www.freebsd.org/~julian/bob-crash.png
> 
> If anyone wants to take a look..
> 
> In the exit syscall, while deallocating a vm object.
> 
> I haven't seen references to a similar crash in the last 10 days or so.. But if
> it rings any bells...

We have just got another one:
panic: Bad link elm 0xf80cc3938360 prev->next != elm

Matching disassembled code to C code, it seems that the crash is somewhere in
vm_object_terminate_pages (inlined into vm_object_terminate), probably in one of
TAILQ_REMOVE-s there:
        if (p->queue != PQ_NONE) {
                KASSERT(p->queue < PQ_COUNT, ("vm_object_terminate: "
                    "page %p is not queued", p));
                pq1 = vm_page_pagequeue(p);
                if (pq != pq1) {
                        if (pq != NULL) {
                                vm_pagequeue_cnt_add(pq, dequeued);
                                vm_pagequeue_unlock(pq);
                        }
                        pq = pq1;
                        vm_pagequeue_lock(pq);
                        dequeued = 0;
                }
                p->queue = PQ_NONE;
                TAILQ_REMOVE(&pq->pq_pl, p, plinks.q);
                dequeued--;
        }
        if (vm_page_free_prep(p, true))
                continue;
unlist:
        TAILQ_REMOVE(&object->memq, p, listq);
}


Please note that this is the code before r332974 "Improve VM page queue
scalability".
I am not sure if r332974 + r333256 would fix the problem or if it would just get
moved to a different place.

Does this ring a bell to anyone who tinkered with that part of the VM code 
recently?

Looking a little bit further, I think that object->memq somehow got corrupted.
memq contains just two elements and the reported element is not there.

(kgdb) p *(struct vm_page *)0xf80cc3938360
$22 = {
  plinks = {
    q = {
      tqe_next = 0xf80cd7175398,
      tqe_prev = 0xf80cb9f69170
    },
    s = {
      ss = {
        sle_next = 0xf80cd7175398
      },
      pv = 0xf80cb9f69170
    },
    memguard = {
      p = 18446735332764767128,
      v = 18446735332276081008
    }
  },
  listq = {
    tqe_next = 0xf80cc3938568,  <=
    tqe_prev = 0xf8078c11b848   <=
  },
  object = 0x0,
  pindex = 1548,
  phys_addr = 14695911424,
  md = {
    pv_list = {
      tqh_first = 0x0,
      tqh_last = 0xf80cc3938398
    },
    pv_gen = 1205766,
    pat_mode = 6
  },
  wire_count = 0,
  busy_lock = 1,
  hold_count = 0,
  flags = 0,
  aflags = 0 '\000',
  oflags = 0 '\000',
  queue = 255 '\377',
  psind = 0 '\000',
  segind = 5 '\005',
  order = 13 '\r',
  pool = 0 '\000',
  act_count = 5 '\005',
  valid = 0 '\000',
  dirty = 0 '\000'
}

(kgdb) p object->memq
$11 = {
  tqh_first = 0xf80cb861cfb8,
  tqh_last = 0xf80cc3938780
}

(kgdb) p *object->memq.tqh_first
$25 = {
  plinks = {
    q = {
      tqe_next = 0xf80cb9f69108,
      tqe_prev = 0xf80cd7175398
    },
    s = {
      ss = {
        sle_next = 0xf80cb9f69108
      },
      pv = 0xf80cd7175398
    },
    memguard = {
      p = 18446735332276080904,
      v = 18446735332764767128
    }
  },
  listq = {
    tqe_next = 0xf80cb56eafb0,  <=
    tqe_prev = 0xf8078c11b848   <=
  },
  object = 0xf8078c11b800,
  pindex = 515,
  phys_addr = 7299219456,
  md = {
    pv_list = {
      tqh_first = 0xf80b99e4ff88,
      tqh_last = 0xf80b99e4ff90
    },
    pv_gen = 466177,
    pat_mode = 6
  },
  wire_count = 0,
  busy_lock = 2,
  hold_count = 0,
  flags = 0,
  aflags = 0 '\000',
  oflags = 0 '\000',
  queue = 255 '\377',
  psind = 0 '\000',
  segind = 5 '\005',
  order = 13 '\r',
  pool = 0 '\000',
  act_count = 5 '\005',
  valid = 255 '\377',
  dirty = 0 '\000'
}
(kgdb) p *object->memq.tqh_first->listq.tqe_next
$26 = {
  plinks = {
    q = {
      tqe_next = 0x0,
      tqe_prev = 0xf80cc92e1d18
    },
    s = {
      ss = {
        sle_next = 0x0
      },
      pv = 0xf80cc92e1d18
    },
    memguard = {
      p = 0,
      v = 18446735332531379480
    }
  },
  listq = {
    tqe_next = 0x0, <=
    tqe_prev = 0xf80cb861cfc8   <=
  },
  object = 0xf8078c11b800,
  pindex = 1548,
  phys_addr = 5350158336,
  md = {
    pv_list = {
      tqh_first = 0xf80a07222808,
      tqh_last = 0xf80a07222810
    },
    pv_gen = 7085,
    pat_mode = 6
  },
  wire_count = 0,
  busy_lock = 1,
  hold_count = 0,
  flags = 0,
  aflags = 1 '\001',
  oflags = 0 '\000',
  queue = 1 '\001',
  psind = 0 '\000',
  segind = 5 '\005',
  order = 13 '\r',
  pool = 0 '\000',
  act_count = 5 '\005',
  valid = 255 '\377',
  dirty = 255 '\377'
}
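
For readers following the dumps, here is a minimal userland sketch of the
consistency check behind the "Bad link elm ... prev->next != elm" message.
It is only an illustration under stated assumptions: struct page is a
hypothetical stand-in for struct vm_page that keeps nothing but the listq
linkage, the macros come from the userland <sys/queue.h>, and prev_link_ok(),
on_list() and forge_stale_first() are names invented for this example, not
kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for struct vm_page: only the object-list linkage. */
struct page {
	int pindex;
	TAILQ_ENTRY(page) listq;
};

TAILQ_HEAD(pglist, page);

/*
 * The invariant checked before a debug-enabled TAILQ_REMOVE: the slot
 * that tqe_prev points at (the head's tqh_first or the previous
 * element's tqe_next) must point back at the element itself.
 */
static bool
prev_link_ok(const struct page *p)
{
	return (*p->listq.tqe_prev == p);
}

/* Walk the list from the head and report whether p is reachable. */
static bool
on_list(struct pglist *head, const struct page *p)
{
	struct page *it;

	TAILQ_FOREACH(it, head, listq) {
		if (it == p)
			return (true);
	}
	return (false);
}

/*
 * Forge a stale element that still claims to be first on the list
 * although the head no longer points at it.  tqe_next does not matter
 * for the prev->next check, so it is simply cleared here.
 */
static void
forge_stale_first(struct pglist *head, struct page *p)
{
	assert(head != NULL && p != NULL);
	p->listq.tqe_next = NULL;
	p->listq.tqe_prev = &head->tqh_first;
}
```

In the dumps, the reported page and the first page on memq both carry
listq.tqe_prev = 0xf8078c11b848. Given object = 0xf8078c11b800, that looks like
the address of object->memq.tqh_first, although this is an inference from the
values and not something verified against the structure layout here. A page
in that state fails prev_link_ok() and is not found by on_list(), which matches
"the reported element is not there".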

Pages 0xf80cc3938360 (the reported one) and