On 22.4.2021. 11:02, Alexander Bluhm wrote:
> On Thu, Apr 22, 2021 at 09:03:22AM +0200, Hrvoje Popovski wrote:
>> something like this:
>>
>> x3550m4# pappnaiannc:iicc :p:o ppoolo_oolcla__ddcohoe__gg_eiettt::e m
>> _mmcbmualg2fkpilc2_:: chppeaag
>> gceke: ee mmmbppttuyfyp
>
> This was without my kernel lock around ARP bandage, right?
yes, yes ...
>
>> ddb{9}> mach ddbcpu 0xa
>> Stopped at x86_ipi_db+0x12: leave
>> x86_ipi_db(ffff800021a2aff0) at x86_ipi_db+0x12
>> x86_ipi_handler() at x86_ipi_handler+0x80
>> Xresume_lapic_ipi() at Xresume_lapic_ipi+0x23
>> pool_get(ffffffff8221e568,2) at pool_get+0x43
>> m_gethdr(2,1) at m_gethdr+0x3f
>> rtm_msg1(e,ffff800026e3cf70) at rtm_msg1+0x4c
>> rtm_ifchg(ffff8000005b3800) at rtm_ifchg+0x61
>> if_down(ffff8000005b3800) at if_down+0xa0
>> if_downall() at if_downall+0x5b
>> boot(104) at boot+0x99
>> reboot(104) at reboot+0x5b
>> panic(ffffffff81df855b) at panic+0x132
>> pool_do_get(ffffffff8221ebc8,2,ffff800026e3d294) at pool_do_get+0x309
>> pool_get(ffffffff8221ebc8,2) at pool_get+0x95
>> end trace frame: 0xffff800026e3d340, count: 0
>>
>> ddb{10}> mach ddbcpu 0xb
>> Stopped at db_enter+0x10: popq %rbp
>> db_enter() at db_enter+0x10
>> panic(ffffffff81df855b) at panic+0x12a
>> pool_do_get(ffffffff8221e568,2,ffff800026e43294) at pool_do_get+0x309
>> pool_get(ffffffff8221e568,2) at pool_get+0x95
>> m_clget(0,2,802) at m_clget+0xdd
>> ixgbe_get_buf(ffff80000015c0e8,e) at ixgbe_get_buf+0xa3
>> ixgbe_rxfill(ffff80000015c0e8) at ixgbe_rxfill+0xae
>> ixgbe_queue_intr(ffff80000011ac40) at ixgbe_queue_intr+0x4f
>> intr_handler(ffff800026e434b0,ffff8000000cd700) at intr_handler+0x6e
>> Xintr_ioapic_edge4_untramp() at Xintr_ioapic_edge4_untramp+0x18f
>> acpicpu_idle() at acpicpu_idle+0x1ea
>> sched_idle(ffff800021a33ff0) at sched_idle+0x27e
>> end trace frame: 0x0, count: 3
>
> Two processors 10 and 11 in pool get.
>
> CPU 10 does pool_get, panic, boot, pool_get again.
> CPU 11 was the one that originally stopped in ddb.
>
> Did you enter boot reboot before doing mach ddbcpu 0xa?
nope... is doing that ever useful?
> Or how did we get the boot sequence in this trace?
>
> Can it be that both CPUs panicked simultaneously? The mangled message
> indicates this. Then cpu 10 saw that cpu 11 had already panicked to ddb
> and tried to reboot. There it panicked again and got stuck in a
> recursive call to pool_get().
>
> The if (db_panic) in the panic() function was not written with
> simultaneous panics on multiple CPUs in mind.
if you want i'll try to reproduce it on other boxes..
maybe i can trigger it here easily because of 2 sockets?
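
To illustrate the point about simultaneous panics: below is a minimal
user-space sketch of a "first CPU to panic wins" guard. It is not the
actual panic() code and not a proposed diff. The names panic_owner,
panic_claim() and panic_owner_cpu() are made up for this example, and it
uses C11 atomics instead of the kernel's own primitives. The idea is only
that a single compare-and-swap decides which CPU may go on to ddb or
reboot, so a second CPU panicking at the same moment can tell that it
lost the race.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical sentinel: no CPU has claimed the panic path yet. */
#define NO_PANIC_CPU	(-1)

/* Hypothetical global: id of the CPU that panicked first. */
static atomic_int panic_owner = NO_PANIC_CPU;

/*
 * Try to claim the panic path for cpuid.
 * Returns true only for the first caller.  Every later caller,
 * including the same CPU panicking a second time, gets false.
 */
static bool
panic_claim(int cpuid)
{
	int expected = NO_PANIC_CPU;

	assert(cpuid >= 0);
	/* Succeeds only if panic_owner still holds NO_PANIC_CPU. */
	return atomic_compare_exchange_strong(&panic_owner, &expected, cpuid);
}

/* Return the CPU that won the race, or NO_PANIC_CPU if none has. */
static int
panic_owner_cpu(void)
{
	return atomic_load(&panic_owner);
}
```

With the numbers from the trace above: if cpu 11 calls panic_claim(11)
first it gets true, and a later panic_claim(10) on cpu 10 gets false.
What the losing CPU should do then (spin, halt, wait for ddb) is a
separate question that this sketch does not answer.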