Re: [External] : 7.1-Current crash with NET_TASKQ 4 and veb interface

2022-05-10 Thread Barbaros Bilek
Thanks for your support. I'll try to test when you get it done.

On Mon, May 9, 2022 at 8:51 PM Alexandr Nedvedicky <
alexandr.nedvedi...@oracle.com> wrote:

> Hello Barbaros,
>
> thank you for testing and for the excellent report.
>
> 
>
> > ddb{1}> trace
> > db_enter() at db_enter+0x10
> > panic(81f22e39) at panic+0xbf
> > __assert(81f96c9d,81f85ebc,a3,81fd252f) at __assert+0x25
> > assertwaitok() at assertwaitok+0xcc
> > mi_switch() at mi_switch+0x40
>
> The assert indicates that we attempted to sleep inside an SMR section,
> which must be avoided.
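
For anyone following along, here is a minimal userland sketch of what that assertion checks. It is not the kernel code: the model_* names are made up for illustration, and the only thing taken from the report is that the kernel asserts spc_smrdepth == 0 before a thread is allowed to sleep.

```c
#include <assert.h>

/*
 * Hypothetical userland model, not kernel code.
 * model_smrdepth stands in for the per-CPU spc_smrdepth counter
 * named in the panic message.
 */
static int model_smrdepth;

/* Entering a read section bumps the depth. */
static void
model_smr_read_enter(void)
{
	model_smrdepth++;
}

/* Leaving a read section drops the depth again. */
static void
model_smr_read_leave(void)
{
	assert(model_smrdepth > 0);
	model_smrdepth--;
}

/*
 * Returns 1 when sleeping would be allowed, 0 when it would trip
 * the "spc_smrdepth == 0" diagnostic assertion.
 */
static int
model_wait_ok(void)
{
	return (model_smrdepth == 0);
}
```

In this model, any path that may sleep (such as taking an rwlock that is currently held by someone else) has to see model_wait_ok() return 1, which is only the case once every model_smr_read_enter() has been paired with a model_smr_read_leave().
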
>
> > sleep_finish(800025574da0,1) at sleep_finish+0x10b
> > rw_enter(822cfe50,1) at rw_enter+0x1cb
> > pf_test(2,1,8520e000,800025575058) at pf_test+0x1088
> > ip_input_if(800025575058,800025575064,4,0,8520e000) at ip_input_if+0xcd
> > ipv4_input(8520e000,fd8053616700) at ipv4_input+0x39
> > ether_input(8520e000,fd8053616700) at ether_input+0x3ad
> > vport_if_enqueue(8520e000,fd8053616700) at vport_if_enqueue+0x19
> > veb_port_input(851c3800,fd806064c200,,82066600) at veb_port_input+0x4d2
> > ether_input(851c3800,fd806064c200) at ether_input+0x100
> > vlan_input(8095a050,fd806064c200,8000255752bc) at vlan_input+0x23d
> > ether_input(8095a050,fd806064c200) at ether_input+0x85
> > if_input_process(8095a050,800025575358) at if_input_process+0x6f
> > ifiq_process(8095a460) at ifiq_process+0x69
> > taskq_thread(80035080) at taskq_thread+0x100
>
> Above is the call stack that did the bad thing (sleeping inside an SMR
> section).
>
> In my opinion the primary suspect is veb_port_input(), whose code reads as
> follows:
>
>  966 static struct mbuf *
>  967 veb_port_input(struct ifnet *ifp0, struct mbuf *m, uint64_t dst, void *brport)
>  968 {
>  969         struct veb_port *p = brport;
>  970         struct veb_softc *sc = p->p_veb;
>  971         struct ifnet *ifp = &sc->sc_if;
>  972         struct ether_header *eh;
>  ...
> 1021         counters_pkt(ifp->if_counters, ifc_ipackets, ifc_ibytes,
> 1022             m->m_pkthdr.len);
> 1023
> 1024         /* force packets into the one routing domain for pf */
> 1025         m->m_pkthdr.ph_rtableid = ifp->if_rdomain;
> 1026
> 1027 #if NBPFILTER > 0
> 1028         if_bpf = READ_ONCE(ifp->if_bpf);
> 1029         if (if_bpf != NULL) {
> 1030                 if (bpf_mtap_ether(if_bpf, m, 0) != 0)
> 1031                         goto drop;
> 1032         }
> 1033 #endif
> 1034
> 1035         veb_span(sc, m);
> 1036
> 1037         if (ISSET(p->p_bif_flags, IFBIF_BLOCKNONIP) &&
> 1038             veb_ip_filter(m))
> 1039                 goto drop;
> 1040
> 1041         if (!ISSET(ifp->if_flags, IFF_LINK0) &&
> 1042             veb_vlan_filter(m))
> 1043                 goto drop;
> 1044
> 1045         if (veb_rule_filter(p, VEB_RULE_LIST_IN, m, src, dst))
> 1046                 goto drop;
>
> The call to veb_span() at line 1035 seems to be our culprit (in my
> opinion):
>
>  356         smr_read_enter();
>  357         SMR_TAILQ_FOREACH(p, &sc->sc_spans.l_list, p_entry) {
>  358                 ifp0 = p->p_ifp0;
>  359                 if (!ISSET(ifp0->if_flags, IFF_RUNNING))
>  360                         continue;
>  361
>  362                 m = m_dup_pkt(m0, max_linkhdr + ETHER_ALIGN, M_NOWAIT);
>  363                 if (m == NULL) {
>  364                         /* XXX count error */
>  365                         continue;
>  366                 }
>  367
>  368                 if_enqueue(ifp0, m); /* XXX count error */
>  369         }
>  370         smr_read_leave();
>
> The loop above comes from veb_span(), which calls if_enqueue() from within
> an SMR section. Line 368 ends up here:
>
> 2191 static int
> 2192 vport_if_enqueue(struct ifnet *ifp, struct mbuf *m)
> 2193 {
> 2194         /*
> 2195          * switching an l2 packet toward a vport means pushing it
> 2196          * into the network stack. this function exists to make
> 2197          * if_vinput compat with veb calling if_enqueue.
> 2198          */
> 2199
> 2200         if_vinput(ifp, m);
> 2201
> 2202         return (0);
> 2203 }
>
> which in turn calls if_vinput(), which calls further down into the IP stack,
> and the IP stack may sleep. We must change veb_span() so that such calls to
> if_vinput() happen outside of the SMR section.
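
A rough userland sketch of that idea, not the actual diff: walk the span list inside the read section and only remember which ports want a copy, then hand the packets over after leaving it. All span_* names and the fixed-size array are assumptions for illustration. A real kernel version would also have to hold a reference on each port or interface so the pointer stays valid once the read section is left.

```c
#include <assert.h>
#include <stddef.h>

#define SPAN_MAX_PENDING 8	/* hypothetical bound for the sketch */

/* Hypothetical stand-in for a span port. */
struct span_port {
	int running;	/* models IFF_RUNNING */
	int delivered;	/* packets handed to this port */
};

static int span_smr_depth;	/* models the read-section depth */
static int span_slept_in_smr;	/* counts would-be assertion failures */

/* Stands in for if_enqueue() on a vport, which may end up sleeping. */
static void
span_enqueue_may_sleep(struct span_port *p)
{
	if (span_smr_depth != 0)
		span_slept_in_smr++;
	p->delivered++;
}

/* Collect inside the read section, deliver after leaving it. */
static size_t
span_deliver(struct span_port *ports, size_t nports)
{
	struct span_port *pending[SPAN_MAX_PENDING];
	size_t i, n = 0;

	span_smr_depth++;		/* read section begins */
	for (i = 0; i < nports && n < SPAN_MAX_PENDING; i++) {
		if (!ports[i].running)
			continue;
		pending[n++] = &ports[i];	/* remember, do not enqueue yet */
	}
	span_smr_depth--;		/* read section ends */

	for (i = 0; i < n; i++)
		span_enqueue_may_sleep(pending[i]);

	return (n);
}
```

With this split, span_enqueue_may_sleep() is only ever reached while span_smr_depth is 0, so the sleeping path can no longer run inside the read section.
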
>
> I don't have such a complex setup with vlans and virtual ports. I'll try to
> cook up a diff and pass it to you for testing.
>
> thanks again for coming back to us with the report.
>
> regards
> sashan
>
>
>


7.1-Current crash with NET_TASKQ 4 and veb interface

2022-05-09 Thread Barbaros Bilek
Hello,

I have been running veb (veb+vlan+ixl) interfaces since 6.9 and they were quite stable.
My system ran as a firewall under OpenBSD 6.9 and 7.0 without problems.
I also used 7.1 for a limited time and there was no crash.
After OpenBSD's NET_TASKQ was raised to 4, it crashed after 5 days.
Here are the crash report and dmesg:

ether_input(8520e000,fd8053616700) at ether_input+0x3ad

vport_if_enqueue(8520e000,fd8053616700) at vport_if_enqueue+0x19

veb_port_input(851c3800,fd806064c200,,82066600) at veb_port_input+0x4d2

ether_input(851c3800,fd806064c200) at ether_input+0x100

end trace frame: 0x800025575290, count: 0

https://www.openbsd.org/ddb.html describes the minimum info required in bug

reports.  Insufficient info makes it difficult to find and fix bugs.

ddb{1}> show panic

*cpu1: kernel diagnostic assertion "curcpu()->ci_schedstate.spc_smrdepth == 0" failed: file "/usr/src/sys/kern/subr_xxx.c", line 163

ddb{1}> trace

db_enter() at db_enter+0x10

panic(81f22e39) at panic+0xbf

__assert(81f96c9d,81f85ebc,a3,81fd252f) at __assert+0x25

assertwaitok() at assertwaitok+0xcc

mi_switch() at mi_switch+0x40

sleep_finish(800025574da0,1) at sleep_finish+0x10b

rw_enter(822cfe50,1) at rw_enter+0x1cb

pf_test(2,1,8520e000,800025575058) at pf_test+0x1088

ip_input_if(800025575058,800025575064,4,0,8520e000) at ip_input_if+0xcd

ipv4_input(8520e000,fd8053616700) at ipv4_input+0x39

ether_input(8520e000,fd8053616700) at ether_input+0x3ad

vport_if_enqueue(8520e000,fd8053616700) at vport_if_enqueue+0x19

veb_port_input(851c3800,fd806064c200,,82066600) at veb_port_input+0x4d2

ether_input(851c3800,fd806064c200) at ether_input+0x100

vlan_input(8095a050,fd806064c200,8000255752bc) at vlan_input+0x23d

ether_input(8095a050,fd806064c200) at ether_input+0x85

if_input_process(8095a050,800025575358) at if_input_process+0x6f

ifiq_process(8095a460) at ifiq_process+0x69

taskq_thread(80035080) at taskq_thread+0x100

end trace frame: 0x0, count: -19

ddb{1}> ps /o

    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
 422021  80579      0         0x2          0    7  ifconfig
 292011   8906    502        0x12      0x400    8  mariadbd
 427181   8906    502        0x12      0x400    6K mariadbd
  86788   8906    502        0x12      0x400    3  mariadbd
 302453  98158      0     0x14000      0x200    9  softnet
  88346  66890      0     0x14000      0x200    5  softnet

ddb{1}> machine ddbcpu 2

Stopped at  x86_ipi_db+0x12:leave

x86_ipi_db(80001d1c3ff0) at x86_ipi_db+0x12

x86_ipi_handler() at x86_ipi_handler+0x80

Xresume_lapic_ipi() at Xresume_lapic_ipi+0x23

acpicpu_idle() at acpicpu_idle+0x203

sched_idle(80001d1c3ff0) at sched_idle+0x280

end trace frame: 0x0, count: 10

ddb{2}> trace

x86_ipi_db(80001d1c3ff0) at x86_ipi_db+0x12

x86_ipi_handler() at x86_ipi_handler+0x80

Xresume_lapic_ipi() at Xresume_lapic_ipi+0x23

acpicpu_idle() at acpicpu_idle+0x203

sched_idle(80001d1c3ff0) at sched_idle+0x280

end trace frame: 0x0, count: -5

ddb{2}> machine ddbcpu 2

Invalid cpu 2

ddb{2}> t[A[A

Bad character

x86_ipi_db(80001d1c3ff0) at x86_ipi_db+0x12

x86_ipi_handler() at x86_ipi_handler+0x80

Xresume_lapic_ipi() at Xresume_lapic_ipi+0x23

acpicpu_idle() at acpicpu_idle+0x203

sched_idle(80001d1c3ff0) at sched_idle+0x280

end trace frame: 0x0, count: -5

ddb{2}> machine ddbcpu 3

Stopped at  x86_ipi_db+0x12:leave

x86_ipi_db(80001d1ccff0) at x86_ipi_db+0x12

x86_ipi_handler() at x86_ipi_handler+0x80

Xresume_lapic_ipi() at Xresume_lapic_ipi+0x23

__mp_lock(823d986c) at __mp_lock+0x72

wakeup_n(8000fffeca88,1) at wakeup_n+0x32

futex_requeue(c13fb4a32e0,1,0,0,2) at futex_requeue+0xe4

sys_futex(8000fffc2008,8000265ca780,8000265ca7d0) at sys_futex+0xe6


syscall(8000265ca840) at syscall+0x374

Xsyscall() at Xsyscall+0x128

end of kernel

end trace frame: 0xc13f5b4b090, count: 6

ddb{3}> trace

x86_ipi_db(80001d1ccff0) at x86_ipi_db+0x12

x86_ipi_handler() at x86_ipi_handler+0x80

Xresume_lapic_ipi() at Xresume_lapic_ipi+0x23

__mp_lock(823d986c) at __mp_lock+0x72

wakeup_n(8000fffeca88,1) at wakeup_n+0x32

futex_requeue(c13fb4a32e0,1,0,0,2) at futex_requeue+0xe4

sys_futex(8000fffc2008,8000265ca780,8000265ca7d0) at sys_futex+0xe6


syscall(8000265ca840) at syscall+0x374

Xsyscall() at Xsyscall+0x128

end of kernel

end trace frame: 0xc13f5b4b090, count: -9

ddb{3}> machine ddbcpu 4

Stopped at  x86_ipi_db+0x12:leave

x86_ipi_db(80001d1d5ff0) at x86_ipi_db+0x12

x86_ipi_handler() at x86_ipi_handler+0x80

Xresume_lapic_ipi() at Xresume_lapic_ipi+0x23

acpicpu_idle() at acpicpu_idle+0x203

sched_idle(80001d1d5ff0) at