Re: [PATCH] veth: fix memory leak in veth_newlink()
On 2020/08/31 9:51, Rustam Kovhaev wrote:
> On Mon, Aug 31, 2020 at 09:16:32AM +0900, Toshiaki Makita wrote:
>> On 2020/08/30 22:13, Rustam Kovhaev wrote:
>>> when register_netdevice(dev) fails we should check whether struct
>>> veth_rq has been allocated via ndo_init callback and free it, because,
>>> depending on the code path, register_netdevice() might not call
>>> priv_destructor() callback
>>
>> AFAICS, register_netdevice() always goto err_uninit and calls
>> priv_destructor() on failure after ndo_init() succeeded.
>> So I could not find such a code path.
>> Would you elaborate on it?
>
> in net/core/dev.c:9863, where register_netdevice() calls
> rollback_registered(), which does not call priv_destructor(), then
> register_netdevice() returns error net/core/dev.c:9884

Thank you, now I see the code path.
But then all devices which allocate something in ndo_init() and free them
in priv_destructor() are affected?
E.g. loopback and ifb seem to do such thing.

Why not calling priv_destructor() after invocation of rollback_registered()?
It looks weird that only that path does not call priv_destructor().

Toshiaki Makita
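To make the two failure paths discussed above concrete, here is a toy user-space model. It is not kernel code: every toy_* name is hypothetical, the allocation merely stands in for struct veth_rq, and the destroy_after_rollback flag models the suggestion of calling priv_destructor() after rollback_registered().

```c
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for a net_device that allocates private state in ndo_init. */
struct toy_dev {
    void *rq;              /* stands in for struct veth_rq */
    bool destructor_ran;
};

enum toy_fail { TOY_FAIL_NONE, TOY_FAIL_EARLY, TOY_FAIL_NOTIFIER };

static int toy_ndo_init(struct toy_dev *dev)
{
    dev->rq = malloc(64);
    return dev->rq ? 0 : -1;
}

static void toy_priv_destructor(struct toy_dev *dev)
{
    free(dev->rq);
    dev->rq = NULL;
    dev->destructor_ran = true;
}

/* Returns 0 on success, -1 on failure. */
static int toy_register(struct toy_dev *dev, enum toy_fail fail,
                        bool destroy_after_rollback)
{
    if (toy_ndo_init(dev))
        return -1;

    if (fail == TOY_FAIL_EARLY) {
        /* models "goto err_uninit": the destructor runs */
        toy_priv_destructor(dev);
        return -1;
    }

    if (fail == TOY_FAIL_NOTIFIER) {
        /* models the rollback path: the destructor runs only if asked to */
        if (destroy_after_rollback)
            toy_priv_destructor(dev);
        return -1;
    }

    return 0;
}
```

With destroy_after_rollback set to false, the notifier-style failure returns an error while dev->rq is still allocated, which is the leak being discussed; setting it to true frees the state on both paths.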
Re: [PATCH] veth: fix memory leak in veth_newlink()
On 2020/08/30 22:13, Rustam Kovhaev wrote:
> when register_netdevice(dev) fails we should check whether struct
> veth_rq has been allocated via ndo_init callback and free it, because,
> depending on the code path, register_netdevice() might not call
> priv_destructor() callback

AFAICS, register_netdevice() always goto err_uninit and calls
priv_destructor() on failure after ndo_init() succeeded.
So I could not find such a code path.
Would you elaborate on it?

Thanks,
Toshiaki Makita
Re: [PATCH net] net: ethtool: Allow matching on vlan CFI bit
On 2019/06/12 0:54, Maxime Chevallier wrote:
> Using ethtool, users can specify a classification action matching on the
> full vlan tag, which includes the CFI bit.
>
> However, when converting the ethool_flow_spec to a flow_rule, we use
> dissector keys to represent the matching patterns.
>
> Since the vlan dissector key doesn't include the CFI bit, this
> information was silently discarded when translating the ethtool
> flow spec in to a flow_rule.
>
> This commit adds the CFI bit into the vlan dissector key, and allows
> propagating the information to the driver when parsing the ethtool flow
> spec.
>
> Fixes: eca4205f9ec3 ("ethtool: add ethtool_rx_flow_spec to flow_rule structure translator")
> Reported-by: Michał Mirosław
> Signed-off-by: Maxime Chevallier
> ---
> Hi all,
>
> Although this prevents information to be silently discarded when parsing
> an ethtool_flow_spec, this information doesn't seem to be used by any
> driver that converts an ethtool_flow_spec to a flow_rule, hence I'm not
> sure this is suitable for -net.
>
> Thanks,
>
> Maxime
>
>  include/net/flow_dissector.h | 1 +
>  net/core/ethtool.c           | 5 +
>  2 files changed, 6 insertions(+)
>
> diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
> index 7c5a8d9a8d2a..9d2e395c6568 100644
> --- a/include/net/flow_dissector.h
> +++ b/include/net/flow_dissector.h
> @@ -46,6 +46,7 @@ struct flow_dissector_key_tags {
>  struct flow_dissector_key_vlan {
>          u16 vlan_id:12,
> +            vlan_cfi:1,

Current IEEE 802.1Q defines this bit as DEI not CFI, so IMO this should
be vlan_dei.

Toshiaki Makita
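For readers unfamiliar with the bit in question, the sketch below splits a 16-bit VLAN TCI (in host byte order) into its three fields: 3 bits of priority, 1 bit of DEI (formerly CFI), and 12 bits of VLAN ID. The struct and function names are made up for illustration and are not the dissector key from the patch.

```c
#include <stdint.h>

#define TCI_PRIO_SHIFT 13       /* priority occupies the top 3 bits */
#define TCI_DEI_MASK   0x1000u  /* bit 12: DEI, called CFI in older text */
#define TCI_VID_MASK   0x0fffu  /* low 12 bits: VLAN ID */

/* Hypothetical container for the decoded fields. */
struct vlan_fields {
    uint8_t  pcp;
    uint8_t  dei;
    uint16_t vid;
};

/* Decode a TCI value that is already in host byte order. */
static struct vlan_fields split_tci(uint16_t tci)
{
    struct vlan_fields f;

    f.pcp = (uint8_t)(tci >> TCI_PRIO_SHIFT);
    f.dei = (tci & TCI_DEI_MASK) ? 1 : 0;
    f.vid = (uint16_t)(tci & TCI_VID_MASK);
    return f;
}
```

A key that stores only the 12-bit ID and the priority cannot distinguish TCI 0xB00A from 0xA00A; they differ only in the DEI bit.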
Re: KMSAN: uninit-value in __netif_receive_skb_core
>> 0246 R12:
>> R13: 06cd R14: 006fd3d8 R15:
>>
>> Uninit was stored to memory at:
>>  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
>>  kmsan_save_stack mm/kmsan/kmsan.c:293 [inline]
>>  kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:684
>>  __msan_chain_origin+0x69/0xc0 mm/kmsan/kmsan_instr.c:521
>>  skb_vlan_untag+0x950/0xee0 include/linux/if_vlan.h:597
>>  __netif_receive_skb_core+0x70a/0x4a80 net/core/dev.c:4460
>>  __netif_receive_skb net/core/dev.c:4627 [inline]
>>  process_backlog+0x62d/0xe20 net/core/dev.c:5307
>>  napi_poll net/core/dev.c:5705 [inline]
>>  net_rx_action+0x7c1/0x1a70 net/core/dev.c:5771
>>  __do_softirq+0x56d/0x93d kernel/softirq.c:285
>> Uninit was created at:
>>  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
>>  kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
>>  kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
>>  kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
>>  slab_post_alloc_hook mm/slab.h:445 [inline]
>>  slab_alloc_node mm/slub.c:2737 [inline]
>>  __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
>>  __kmalloc_reserve net/core/skbuff.c:138 [inline]
>>  __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
>>  alloc_skb include/linux/skbuff.h:984 [inline]
>>  alloc_skb_with_frags+0x1d4/0xb20 net/core/skbuff.c:5234
>>  sock_alloc_send_pskb+0xb56/0x1190 net/core/sock.c:2085
>>  packet_alloc_skb net/packet/af_packet.c:2803 [inline]
>>  packet_snd net/packet/af_packet.c:2894 [inline]
>>  packet_sendmsg+0x6444/0x8a10 net/packet/af_packet.c:2969
>>  sock_sendmsg_nosec net/socket.c:630 [inline]
>>  sock_sendmsg net/socket.c:640 [inline]
>>  sock_write_iter+0x3b9/0x470 net/socket.c:909
>>  do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
>>  do_iter_write+0x30d/0xd40 fs/read_write.c:932
>>  vfs_writev fs/read_write.c:977 [inline]
>>  do_writev+0x3c9/0x830 fs/read_write.c:1012
>>  SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
>>  SyS_writev+0x56/0x80 fs/read_write.c:1082
>>  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
>>  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>> ==
>>
>>
>> ---
>> This bug is generated by a dumb bot. It may contain errors.
>> See https://goo.gl/tpsmEJ for details.
>> Direct all questions to syzkal...@googlegroups.com.
>>
>> syzbot will keep track of this bug report.
>> If you forgot to add the Reported-by tag, once the fix for this bug is
>> merged into any tree, please reply to this email with:
>> #syz fix: exact-commit-title
>> To mark this as a duplicate of another syzbot report, please reply with:
>> #syz dup: exact-subject-of-another-report
>> If it's a one-off invalid bug report, please reply with:
>> #syz invalid
>> Note: if the crash happens again, it will cause creation of a new bug
>> report.
>> Note: all commands must start from beginning of the line in the email body.
>
>
--
Toshiaki Makita
Re: KMSAN: uninit-value in netif_skb_features
>> ine]
>>  __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
>>  alloc_skb include/linux/skbuff.h:984 [inline]
>>  alloc_skb_with_frags+0x1d4/0xb20 net/core/skbuff.c:5234
>>  sock_alloc_send_pskb+0xb56/0x1190 net/core/sock.c:2085
>>  packet_alloc_skb net/packet/af_packet.c:2803 [inline]
>>  packet_snd net/packet/af_packet.c:2894 [inline]
>>  packet_sendmsg+0x6444/0x8a10 net/packet/af_packet.c:2969
>>  sock_sendmsg_nosec net/socket.c:630 [inline]
>>  sock_sendmsg net/socket.c:640 [inline]
>>  sock_write_iter+0x3b9/0x470 net/socket.c:909
>>  do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
>>  do_iter_write+0x30d/0xd40 fs/read_write.c:932
>>  vfs_writev fs/read_write.c:977 [inline]
>>  do_writev+0x3c9/0x830 fs/read_write.c:1012
>>  SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
>>  SyS_writev+0x56/0x80 fs/read_write.c:1082
>>  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
>>  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>> ==
>>
>>
>> ---
>> This bug is generated by a dumb bot. It may contain errors.
>> See https://goo.gl/tpsmEJ for details.
>> Direct all questions to syzkal...@googlegroups.com.
>>
>> syzbot will keep track of this bug report.
>> If you forgot to add the Reported-by tag, once the fix for this bug is
>> merged into any tree, please reply to this email with:
>> #syz fix: exact-commit-title
>> If you want to test a patch for this bug, please reply with:
>> #syz test: git://repo/address.git branch
>> and provide the patch inline or as an attachment.
>> To mark this as a duplicate of another syzbot report, please reply with:
>> #syz dup: exact-subject-of-another-report
>> If it's a one-off invalid bug report, please reply with:
>> #syz invalid
>> Note: if the crash happens again, it will cause creation of a new bug
>> report.
>> Note: all commands must start from beginning of the line in the email body.
>
>
--
Toshiaki Makita
Re: [PATCH bpf-next v3 1/3] libbpf: add function to setup XDP
On 2017/12/28 17:04, Eric Leblond wrote:
> Most of the code is taken from set_link_xdp_fd() in bpf_load.c and
> slightly modified to be library compliant.
>
> Signed-off-by: Eric Leblond <e...@regit.org>
> Acked-by: Alexei Starovoitov <a...@kernel.org>
> ---
...
> +int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
...
> +        if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
> +                ret = -errno;
> +                goto cleanup;
> +        }
> +
> +        addrlen = sizeof(sa);
> +        if (getsockname(sock, (struct sockaddr *)&sa, &addrlen) < 0) {
> +                ret = errno;

Still errno is not inverted,

> +                goto cleanup;
> +        }
> +
> +        if (addrlen != sizeof(sa)) {
> +                ret = errno;

And not set here.

> +                goto cleanup;
> +        }

--
Toshiaki Makita
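The two review comments above boil down to one convention: return a negated errno only when a call has just failed, and pick an explicit code when no call failed. A minimal sketch follows. The helper name is hypothetical, and using -EINVAL for the length mismatch is an assumption made here for illustration, not what the patch or libbpf uses.

```c
#include <errno.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: fetch the address a socket is bound to and check
 * that the kernel returned the length we expect.
 * Returns 0 on success or a negative errno-style value on failure.
 */
static int get_bound_addr(int sock, struct sockaddr *sa, socklen_t expected_len)
{
    socklen_t addrlen = expected_len;

    if (getsockname(sock, sa, &addrlen) < 0)
        return -errno;   /* the call failed, so errno is valid: negate it */

    if (addrlen != expected_len)
        return -EINVAL;  /* nothing failed here, errno would be stale
                          * (-EINVAL is an assumption for this sketch) */

    return 0;
}
```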
Re: [PATCH 1/4] libbpf: add function to setup XDP
On 2017/12/28 3:02, Eric Leblond wrote:
> Most of the code is taken from set_link_xdp_fd() in bpf_load.c and
> slightly modified to be library compliant.
>
> Signed-off-by: Eric Leblond <e...@regit.org>
> ---
...
> +int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
...
> +        if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
> +                ret = -errno;
> +                goto cleanup;
> +        }
> +
> +        addrlen = sizeof(sa);
> +        if (getsockname(sock, (struct sockaddr *)&sa, &addrlen) < 0) {
> +                ret = errno;

forgot to prepend '-'?

> +                goto cleanup;
> +        }
> +
> +        if (addrlen != sizeof(sa)) {
> +                ret = errno;

errno is not set?

> +                goto cleanup;
> +        }

--
Toshiaki Makita
Re: [PATCH net-next] libbpf: add function to setup XDP
On 2017/12/09 23:43, Eric Leblond wrote:
> Most of the code is taken from set_link_xdp_fd() in bpf_load.c and
> slightly modified to be library compliant.
>
> Signed-off-by: Eric Leblond <e...@regit.org>
...
> +int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
...
> +        for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
> +             nh = NLMSG_NEXT(nh, len)) {
> +                if (nh->nlmsg_pid != getpid()) {

Generally nlmsg_pid should not be compared with process id.
See man netlink and
https://github.com/iovisor/bcc/pull/1275/commits/69ce96a54c55960c8de3392061254c97b6306a6d

> +                        ret = -LIBBPF_ERRNO__WRNGPID;
> +                        goto cleanup;
> +                }

--
Toshiaki Makita
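One way to follow that advice, sketched below for Linux, is to ask the kernel which port ID it assigned to the bound netlink socket and compare replies against that value rather than getpid(). The two function names are hypothetical, and this is only an illustration of the idea, not the code from the patch or the linked commit.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/*
 * Query the port ID the kernel assigned to a bound netlink socket.
 * Returns 0 on success or a negative errno-style value.
 */
static int netlink_local_port(int sock, __u32 *port)
{
    struct sockaddr_nl sa;
    socklen_t addrlen = sizeof(sa);

    memset(&sa, 0, sizeof(sa));
    if (getsockname(sock, (struct sockaddr *)&sa, &addrlen) < 0)
        return -errno;
    if (addrlen != sizeof(sa) || sa.nl_family != AF_NETLINK)
        return -EINVAL;  /* not a netlink address (assumed error code) */

    *port = sa.nl_pid;
    return 0;
}

/* Compare a reply's nlmsg_pid with our socket's port ID, not with getpid(). */
static bool reply_is_for_us(const struct nlmsghdr *nh, __u32 local_port)
{
    return nh->nlmsg_pid == local_port;
}
```

The port ID only happens to equal the process ID in some cases, for example it may differ when a process opens more than one netlink socket, so comparing against the value from getsockname() avoids rejecting valid replies.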
Re: Sending 802.1Q packets using AF_PACKET socket on filtered bridge forwards with wrong MAC addresses
Hi,

(CC: Vlad)

On 2017/11/30 7:01, Brandon Carpenter wrote:
> I narrowed the search to a memmove() called from
> skb_reorder_vlan_header() in net/core/skbuff.c.
>
>> memmove(skb->data - ETH_HLEN, skb->data - skb->mac_len - VLAN_HLEN,
>>         2 * ETH_ALEN);
>
> Calling skb_reset_mac_len() after skb_reset_mac_header() before
> calling br_allowed_ingress() in net/bridge/br_device.c fixes the
> problem.
>
> diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
> index af5b8c87f590..e10131e2f68f 100644
> --- a/net/bridge/br_device.c
> +++ b/net/bridge/br_device.c
> @@ -58,6 +58,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
>         BR_INPUT_SKB_CB(skb)->brdev = dev;
>
>         skb_reset_mac_header(skb);
> +       skb_reset_mac_len(skb);
>         eth = eth_hdr(skb);
>         skb_pull(skb, ETH_HLEN);

Thanks for debugging this problem.
It seems this has been broken since a6e18ff11170 ("vlan: Fix untag
operations of stacked vlans with REORDER_HEADER off").

Unfortunately this does not always work correctly, since in tx path
drivers assume network header to be set to L3 protocol header offset.
Packet socket (packet_snd()) determines network header by
dev_hard_header which is ETH_HLEN in bridge devices, so this works for
packet socket, but with vlan devices on top of bridge device with
tx-vlan hwaccel disabled we get ETH_HLEN + VLAN_HLEN or longer by
mac_len.

Since mac_len can be arbitrarily long if we stack vlan devices on bridge
devices, and since we want to untag the outermost tag, using mac_len to
untag in tx path is probably no longer correct.
I'll think deeper about how to fix it.

> I'll put together an official patch and submit it. Should I use
> another email account? Are my emails being ignored because of that
> stupid disclaimer my employer attaches to my messages (outside my
> control)?
>
> Brandon
>

--
Toshiaki Makita
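The effect of a wrong mac_len on the quoted memmove() can be reproduced in user space with a plain byte buffer. The sketch below is a toy model, not kernel code: the toy_* names are hypothetical, offsets assume a single 802.1Q tag, and the stale mac_len value of 0 is an assumption chosen only to show the failure mode.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { TOY_ETH_ALEN = 6, TOY_ETH_HLEN = 14, TOY_VLAN_HLEN = 4 };

/*
 * Same arithmetic as the quoted memmove(): data_off is the offset of the
 * data pointer after the VLAN header has been pulled.
 */
static void toy_reorder_vlan_header(uint8_t *frame, size_t data_off,
                                    size_t mac_len)
{
    memmove(frame + data_off - TOY_ETH_HLEN,
            frame + data_off - mac_len - TOY_VLAN_HLEN,
            2 * TOY_ETH_ALEN);
}

/* Fill a 32-byte buffer: dst MAC, src MAC, TPID, TCI, protocol, zero payload. */
static void toy_build_tagged_frame(uint8_t frame[32])
{
    memset(frame, 0, 32);
    memset(frame, 0x11, TOY_ETH_ALEN);                /* dst MAC */
    memset(frame + TOY_ETH_ALEN, 0x22, TOY_ETH_ALEN); /* src MAC */
    frame[12] = 0x81; frame[13] = 0x00;               /* TPID 0x8100 */
    frame[14] = 0x00; frame[15] = 0x0a;               /* TCI: VLAN 10 */
    frame[16] = 0x08; frame[17] = 0x00;               /* encapsulated proto */
}
```

With data_off = 18 and mac_len = 14 the two addresses are copied from offset 0 and land intact at offset 4, directly in front of the encapsulated protocol. With mac_len = 0 the copy starts at offset 14 instead, so the bytes that end up where the MAC addresses should be are the TCI, protocol, and payload.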
Re: Inconsistency in packet drop due to MTU (eth vs veth)
On 17/02/03 (Fri) 17:07, Fredrik Markstrom wrote:
> On Tue, 31 Jan 2017 17:27:09 +0100 Eric Dumazet <eric.duma...@gmail.com> wrote
> > On Tue, 2017-01-31 at 14:32 +0100, Fredrik Markstrom wrote:
> > > On Thu, 19 Jan 2017 19:53:47 +0100 Eric Dumazet <eric.duma...@gmail.com> wrote
> > > > On Thu, 2017-01-19 at 17:41 +0100, Fredrik Markstrom wrote:
> > > > > Hello,
> > > > >
> > > > > I've noticed an inconsistency between how physical ethernet and
> > > > > veth handles mtu.
> > > > >
> > > > > If I setup two physical interfaces (directly connected) with
> > > > > different mtu:s, only the size of the outgoing packets are
> > > > > limited by the mtu. But with veth a packet is dropped if the mtu
> > > > > of the receiving interface is smaller than the packet size.
> > > > >
> > > > > This seems inconsistent to me, but maybe there is a reason for
> > > > > it ?
> > > > >
> > > > > Can someone confirm if it's a deliberate inconsistency or just a
> > > > > side effect of using dev_forward_skb() ?
> > > >
> > > > It looks this was added in commit
> > > > 38d408152a86598a50680a82fe3353b506630409
> > > > ("veth: Allow setting the L3 MTU")
> > > >
> > > > But what was really needed here was a way to change MRU :(
> > >
> > > Ok, do we consider this correct and/or something we need to be
> > > backwards compatible with ? Is it insane to believe that we can fix
> > > this "inconsistency" by removing the check ?
> > >
> > > The commit message reads "For consistency I drop packets on the
> > > receive side when they are larger than the MTU", do we know what
> > > it's supposed to be consistent with or is that lost in history ?
> >
> > There is no consistency among existing Ethernet drivers.
> >
> > Many ethernet drivers size the buffers they post in RX ring buffer
> > according to MTU.
> >
> > If MTU is set to 1500, RX buffers are sized to be about 1536 bytes,
> > so you won't be able to receive a 1700 bytes frame.
> >
> > I guess that you could add a specific veth attribute to precisely
> > control MRU, that would not break existing applications.
>
> Ok, I will propose a patch shortly. And thanks, your response time is
> awesome !

But why do you want to configure MRU?
What is the problem with setting MTU instead?

Toshiaki Makita
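The receive-side drop described in this thread amounts to a length check against the receiving device's MTU. The function below is a simplified model of such a check, written for illustration only: the name is hypothetical, and both the Ethernet header allowance and the extra room for one VLAN tag are assumptions about how a check of this kind is usually written, not a copy of the kernel's code.

```c
#include <stdbool.h>

enum { SKETCH_ETH_HLEN = 14, SKETCH_VLAN_HLEN = 4 };

/*
 * Simplified model: accept a frame only if it fits the receiver's MTU
 * plus link-layer header room (assumed: Ethernet header + one VLAN tag).
 */
static bool fits_receiver_mtu(unsigned int frame_len, unsigned int rx_mtu)
{
    return frame_len <= rx_mtu + SKETCH_ETH_HLEN + SKETCH_VLAN_HLEN;
}
```

Under this model a 1700-byte payload (1714 bytes with the Ethernet header) is dropped by a receiver whose MTU is 1500 but accepted once the receiver's MTU is raised to 1700, which is the behaviour the thread contrasts with physical Ethernet.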
Re: DSA vs envelope frames
On 2016/11/30 23:58, Nikita Yushchenko wrote:
>>> (1) When DSA is in use, frames processed by FEC chip contain DSA tag and
>>> thus can be larger than hardcoded limit of 1522. This issue is not
>>> FEC-specific, any driver that hardcodes maximum frame size to 1522 (many
>>> do) will have this issue if used with DSA.
>>
>> BTW I'm trying to introduce envelope frames to solve this kind of problems.
>> http://marc.info/?t=14749669155=1=2
>> http://marc.info/?t=14749669153=1=2
>> http://marc.info/?t=14749669152=1=2
>> http://marc.info/?t=14749669154=1=2
>> http://marc.info/?t=14749669151=1=2
>>
>> It needs jumbo frame support of NICs though.
>
> Thanks for pointing to this.
>
> Indeed frame with DSA tag conceptually is an envelope frame.
>
> ndev->env_hdr_len introduced by your patches, actually is explicitly
> handled difference between (MTU + 18) and frame that HW should allow.
> If this is known, hardware can be configured to work with DSA. At least
> FEC hardware that can send and receive "slightly larger" frames after
> simple register configuration.
>
> Furthermore, since DSA configuration is known statically (it comes from
> device tree), ndo_set_env_hdr_len method could be automatically called
> at init, making setup working by default if driver supports that. And if
> not, perhaps can automatically lower MTU.
>
> Looks like a solution :)
>
> What's current status of this work?

Thank you for taking a look.
I'm planning to post v2 soon.

> What is not really clear - what if several tagging protocols are used
> together. AFAIU, things may be more complex that simple appending of
> tags, e.g. EDSA tag can carry VLAN id inside.

If kernel is aware of VLAN configuration, add 4 bytes + DSA tag size.
(I'm not familiar with how dsa knows vlan configuration, but probably
through switchdev_port_obj_add()? If so, dsa should be able to take into
account additional vlan tag size.)
If vlan tag is opaque from kernel, e.g. forwarding vlan tagged frames
without configuring vlan_filtering in bridge, admin needs to set
env_hdr_len manually. This is why I'm proposing manual operation.

Regards,
Toshiaki Makita
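The "automatically lower MTU" fallback mentioned in the quoted text can be written as simple arithmetic: subtract the fixed Ethernet overhead (the 18 bytes from "MTU + 18") and the envelope header length from the largest frame the hardware accepts. The function name is hypothetical and the tag sizes in the usage note are example values, not taken from any particular switch.

```c
enum { ETH_HDR_PLUS_FCS = 18 };  /* the "18" in "MTU + 18" */

/*
 * Largest MTU that still fits when the hardware frame limit cannot be
 * raised. Returns -1 if the overhead alone exceeds the limit.
 */
static int mtu_for_hw_limit(unsigned int hw_max_frame, unsigned int env_hdr_len)
{
    unsigned int overhead = ETH_HDR_PLUS_FCS + env_hdr_len;

    if (overhead > hw_max_frame)
        return -1;
    return (int)(hw_max_frame - overhead);
}
```

For a hardcoded 1522-byte limit, an envelope of 4 bytes (one VLAN tag) leaves the usual 1500-byte MTU, while an envelope of 8 bytes (a VLAN tag plus an assumed 4-byte switch tag) leaves 1496.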
Re: [patch net / RFC] net: fec: increase frame size limitation to actually available buffer
On 2016/11/30 15:36, Nikita Yushchenko wrote:
>> But I think it is not necessary since the driver don't support jumbo frame.
>
> Hardcoded 1522 raises two separate issues.
>
> (1) When DSA is in use, frames processed by FEC chip contain DSA tag and
> thus can be larger than hardcoded limit of 1522. This issue is not
> FEC-specific, any driver that hardcodes maximum frame size to 1522 (many
> do) will have this issue if used with DSA.
>
> Clean solution for this must take into account that difference between
> MTU and max frame size is no longer known at compile time. Actually this
> is the case even without DSA, due to VLANs: max frame size is (MTU + 18)
> without VLANs, but (MTU + 22) with VLANs. However currently drivers tend
> to ignore this and hardcode 22. With DSA, 22 is not enough, need to add
> switch-specific tag size to that.
>
> Not yet sure how to handle this. DSA-specific API to find out tag size
> could be added, but generic solution should handle all cases of dynamic
> difference between MTU and max frame size, not only DSA.

BTW I'm trying to introduce envelope frames to solve this kind of problems.
http://marc.info/?t=14749669155=1=2
http://marc.info/?t=14749669153=1=2
http://marc.info/?t=14749669152=1=2
http://marc.info/?t=14749669154=1=2
http://marc.info/?t=14749669151=1=2

It needs jumbo frame support of NICs though.

Regards,
Toshiaki Makita
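The frame-size arithmetic in the quoted message (MTU + 18 untagged, MTU + 22 with one VLAN tag, more with a switch tag) can be written out as a small function. The name and parameters are hypothetical, and the switch tag length is left as an input because it depends on the tagging protocol.

```c
#include <stdbool.h>

enum {
    FRAME_ETH_HLEN  = 14,  /* dst MAC + src MAC + EtherType */
    FRAME_FCS_LEN   = 4,   /* frame check sequence */
    FRAME_VLAN_HLEN = 4    /* one 802.1Q tag */
};

/*
 * Maximum on-wire frame length for a given MTU, an optional VLAN tag,
 * and an optional switch-specific tag of switch_tag_len bytes.
 */
static unsigned int max_frame_len(unsigned int mtu, bool vlan,
                                  unsigned int switch_tag_len)
{
    unsigned int len = mtu + FRAME_ETH_HLEN + FRAME_FCS_LEN;

    if (vlan)
        len += FRAME_VLAN_HLEN;
    return len + switch_tag_len;
}
```

With an MTU of 1500 this gives 1518 untagged and 1522 with a VLAN tag; adding an assumed 4-byte switch tag gives 1526, which no longer fits a hardcoded 1522-byte limit.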
Re: [PATCH] bridge: missing null bridge device check causing null pointer dereference (bugfix)
On 2014/11/06 16:58, 박수현 wrote:
>> -----Original Message-----
>> From: Toshiaki Makita [mailto:makita.toshi...@lab.ntt.co.jp]
>> Sent: Thursday, November 06, 2014 4:07 PM
>> To: 박수현; Stephen Hemminger; David S. Miller
>> Cc: bri...@lists.linux-foundation.org; net...@vger.kernel.org;
>> linux-ker...@vger.kernel.org
>> Subject: Re: [PATCH] bridge: missing null bridge device check causing null
>> pointer dereference (bugfix)
>>
>> On 2014/11/06 15:26, Su-Hyun Park wrote:
>>> the bridge device can be null if the bridge is being deleted while
>>> processing the packet, which causes the null pointer dereference in
>>> switch statement.
>>
>> How can this happen??
>> It is guarded by rcu.
>> netdev_rx_handler_unregister() ensures rx_handler_data is non NULL.
>>
>
> The RCU protect rx_handler_data, not the bridge member port. It can be NULL
> according to below code.
>
> static inline struct net_bridge_port *br_port_get_rcu(const struct net_device *dev)
> {
> 	struct net_bridge_port *port = rcu_dereference(dev->rx_handler_data);
> 	return br_port_exists(dev) ? port : NULL;
> }

Seems to have been fixed for a year.
716ec052d228 ("bridge: fix NULL pointer deref of br_port_get_rcu")

Thanks,
Toshiaki Makita
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
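Editor's note: the point argued in this exchange is that the quoted helper can return NULL even while rx_handler_data itself is non-NULL, so a caller that dereferences the result unconditionally can crash. A minimal user-space sketch of that pattern follows. The types and function names here are simplified stand-ins invented for illustration; this is not kernel code and not the change made in 716ec052d228.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; not the real definitions. */
struct port { int state; };
struct device {
    bool is_bridge_port;       /* models br_port_exists(dev) */
    struct port *handler_data; /* models dev->rx_handler_data */
};

/* Models the quoted helper: may return NULL even when handler_data is set. */
static struct port *port_get(const struct device *dev)
{
    assert(dev != NULL);
    return dev->is_bridge_port ? dev->handler_data : NULL;
}

enum rx_result { RX_DROPPED, RX_HANDLED };

/* Models a guarded caller: check the port before dereferencing it. */
static enum rx_result handle_frame(const struct device *dev)
{
    struct port *p = port_get(dev);

    if (!p)
        return RX_DROPPED;  /* corresponds to "goto drop" in the proposed patch */
    (void)p->state;         /* safe: p is known to be non-NULL here */
    return RX_HANDLED;
}
```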
Re: [PATCH] bridge: missing null bridge device check causing null pointer dereference (bugfix)
On 2014/11/06 15:26, Su-Hyun Park wrote:
> the bridge device can be null if the bridge is being deleted while processing
> the packet, which causes the null pointer dereference in switch statement.

How can this happen??
It is guarded by rcu.
netdev_rx_handler_unregister() ensures rx_handler_data is non NULL.

Thanks,
Toshiaki Makita

>
> crash dump snippet:
>
> <1>BUG: unable to handle kernel NULL pointer dereference at 0021
> <1>IP: [<814179f6>] br_handle_frame+0xe6/0x270
>
> <0>Code: 4c 0f 44 f0 89 f8 66 33 15 32 52 24 00 66 33 05 29 52 24 00 09 c2 89
> f0 66 33 05 22 52 24 00 80 e4 f0 66 09 c2 0f 84 eb 00 00 00 <41> 0f b6 46 21
> 3c 02 74 61 3c 03 74 1d 48 89 df e8 d5 bc f0 ff
> ---
>  net/bridge/br_input.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
> index 6fd5522..7e899ca 100644
> --- a/net/bridge/br_input.c
> +++ b/net/bridge/br_input.c
> @@ -176,6 +176,8 @@ rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
>  		return RX_HANDLER_CONSUMED;
>
>  	p = br_port_get_rcu(skb->dev);
> +	if (!p)
> +		goto drop;
>
>  	if (unlikely(is_link_local_ether_addr(dest))) {
>  		u16 fwd_mask = p->br->group_fwd_mask_required;
Re: [PATCH net] bridge: notify user space after fdb update
(2014/05/29 16:27), Jon Maxwell wrote:
> There has been a number incidents recently where customers running KVM have
> reported that VM hosts on different Hypervisors are unreachable. Based on
> pcap traces we found that the bridge was broadcasting the ARP request out
> onto the network. However some NICs have an inbuilt switch which on occasions
> were broadcasting the VMs ARP request back through the physical NIC on the
> Hypervisor. This resulted in the bridge changing ports and incorrectly learning
> that the VMs mac address was external. As a result the ARP reply was directed
> back onto the external network and VM never updated it's ARP cache. This patch
> will notify the bridge command, after a fdb has been updated to identify such
> port toggling.
>
> Signed-off-by: Jon Maxwell <jmaxwel...@gmail.com>

Acked-by: Toshiaki Makita <makita.toshi...@lab.ntt.co.jp>
Re: [PATCH net] bridge: notify user space of fdb port change
(2014/05/23 13:59), Jon Maxwell wrote:
...
> Makita-san,
>
> I recoded this using your idea and ran it through a reproducer.
> It work fine. After some more consideration I agree that
> setting fdb->dst = source is only required when source != fdb->dst.
>
> Thanks for your suggestions. This is the revised patch. It should
> retain the original behaviour except for the notify after the fdb update.
>
> Please let me know if you have any further input?

I have no more comments except for style problems (bracket position,
indentation, type mismatch).
thank you for rewriting :)

Thanks,
Toshiaki Makita

>
> $ diff -Naur br_fdb.c br_fdb.c.patch
> --- br_fdb.c	2014-05-17 12:43:23.346319609 +1000
> +++ br_fdb.c.patch	2014-05-17 16:54:46.280235728 +1000
> @@ -487,6 +487,7 @@
>  {
>  	struct hlist_head *head = &br->hash[br_mac_hash(addr, vid)];
>  	struct net_bridge_fdb_entry *fdb;
> +	bool fdb_modified = 0;
>
>  	/* some users want to always flood. */
>  	if (hold_time(br) == 0)
> @@ -507,10 +508,16 @@
>  					source->dev->name);
>  		} else {
>  			/* fastpath: update of existing entry */
> -			fdb->dst = source;
> +			if (unlikely(source != fdb->dst))
> +			{
> +				fdb->dst = source;
> +				fdb_modified = 1;
> +			}
>  			fdb->updated = jiffies;
>  			if (unlikely(added_by_user))
>  				fdb->added_by_user = 1;
> +			if (unlikely(fdb_modified))
> +				fdb_notify(br, fdb, RTM_NEWNEIGH);
>  		}
>  	} else {
>  		spin_lock(&br->hash_lock);
>
Re: [PATCH net] bridge: notify user space of fdb port change
(2014/05/13 16:55), Jon Maxwell wrote:
> From: Jon Maxwell <jmaxwel...@gmail.com>
>
> There has been a number incidents recently where customers running KVM have
> reported that VM hosts on different Hypervisors are unreachable. Based on
> pcap traces we found that the bridge was broadcasting the ARP request out
> onto the network. However some NICs have an inbuilt switch which on occasions
> were broadcasting the VMs ARP request back through the physical NIC on the
> Hypervisor. This resulted in the bridge changing ports and incorrectly learning
> that the VMs mac address was external. As a result the ARP reply was directed
> back onto the external network and VM never updated it's ARP cache. This patch
> will notify the bridge command to identify such port toggling.
>
> Signed-off-by: Jon Maxwell <jmaxwel...@gmail.com>
> ---
>  net/bridge/br_fdb.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
> index 9203d5a..37742e2 100644
> --- a/net/bridge/br_fdb.c
> +++ b/net/bridge/br_fdb.c
> @@ -507,6 +507,8 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
>  					source->dev->name);
>  		} else {
>  			/* fastpath: update of existing entry */
> +			if (source->port_no != fdb->dst->port_no)

It seems that we don't need to fetch port_no and it is enough to compare
source and fdb->dst.

> +				fdb_notify(br, fdb, RTM_NEWNEIGH);
>  			fdb->dst = source;
>  			fdb->updated = jiffies;
>  			if (unlikely(added_by_user))
>

This notifies fdb entry before updating existing entry.
Is this on purpose?
I think we should notify the updated fdb entry.
Similar code fdb_add_entry() does after updating it.

Also, isn't it better to move update of dst into "if" block?

	if (source != fdb->dst) {
		fdb->dst = source;
		modified = true;
	}
	...
	if (modified)
		...

Thanks,
Toshiaki Makita
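Editor's note: the review comments above ask for three things: compare the port pointers directly, update the entry before notifying, and notify only when the destination actually changed. A self-contained user-space sketch of that ordering is shown below. All names (`struct fdb_entry`, `struct notify_log`, `fdb_update()`) are simplified stand-ins made up for this illustration; the real code lives in net/bridge/br_fdb.c and calls fdb_notify().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins; not the kernel definitions. */
struct port { int port_no; };
struct fdb_entry {
    struct port *dst;
    unsigned long updated;
};

/* Hypothetical notification sink that records what listeners would see. */
struct notify_log {
    int count;
    const struct port *last_dst;
};

static void notify(struct notify_log *log, const struct fdb_entry *fdb)
{
    log->count++;
    log->last_dst = fdb->dst;
}

/* Fast path for an existing entry: update first, then notify if changed. */
static bool fdb_update(struct fdb_entry *fdb, struct port *source,
                       unsigned long now, struct notify_log *log)
{
    bool modified = false;

    assert(fdb != NULL && source != NULL && log != NULL);

    if (source != fdb->dst) {   /* pointer compare, no port_no fetch */
        fdb->dst = source;
        modified = true;
    }
    fdb->updated = now;

    if (modified)
        notify(log, fdb);       /* listeners see the new destination */
    return modified;
}
```

Because the notification runs after the assignment, the recorded destination is the port the entry moved to, not the stale one.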
Re: [Bridge] [PATCH 1/3] bridge: preserve random init MAC address
On Tue, 2014-03-18 at 18:10 -0700, Luis R. Rodriguez wrote:
> On Tue, Mar 18, 2014 at 6:04 PM, Toshiaki Makita
> <makita.toshi...@lab.ntt.co.jp> wrote:
> > (2014/03/19 9:50), Luis R. Rodriguez wrote:
> >> On Tue, Mar 18, 2014 at 5:42 PM, Toshiaki Makita
> >> <makita.toshi...@lab.ntt.co.jp> wrote:
> >>> nit,
> >>> If the last detached port happens to have the same addr as
> >>> random_init_addr, this seems to call br_stp_change_bridge_id() even
> >>> though bridge_id is not changed.
> >>
> >> Ah good point.
> >>
> >>> Shouldn't the assignment of random_init_addr be done before the check of
> >>> "no change"?
> >>
> >> Good question, should we even allow two ports to have the same MAC
> >> address or should we complain and refuse to add it? If so that should
> >> mean we should also have to monitor any manual address changes or
> >> events for address changes on the ports.
> >
> > This was recently discussed by Stephen and me.
> > I'm thinking it should be allowed.
> >
> > http://marc.info/?l=linux-netdev&m=139182743919257&w=2
>
> Great now that that's sorted out though I still think calling
> br_stp_change_bridge_id() is right just as calling the update features
> as the device is different. It could however be confusing when this
> situation is run and folks might report odd bugs unless we could tell
> them apart clearly. Thoughts?

br_stp_change_bridge_id() is currently called only if bridge_id.addr
should be changed.
If the addr should not be changed but some updates are needed,
br_stp_recalculate_bridge_id() doesn't seem to fit into it.

Toshiaki Makita
Re: [PATCH 1/3] bridge: preserve random init MAC address
(2014/03/19 9:50), Luis R. Rodriguez wrote:
> On Tue, Mar 18, 2014 at 5:42 PM, Toshiaki Makita
> <makita.toshi...@lab.ntt.co.jp> wrote:
>> nit,
>> If the last detached port happens to have the same addr as
>> random_init_addr, this seems to call br_stp_change_bridge_id() even
>> though bridge_id is not changed.
>
> Ah good point.
>
>> Shouldn't the assignment of random_init_addr be done before the check of
>> "no change"?
>
> Good question, should we even allow two ports to have the same MAC
> address or should we complain and refuse to add it? If so that should
> mean we should also have to monitor any manual address changes or
> events for address changes on the ports.

This was recently discussed by Stephen and me.
I'm thinking it should be allowed.

http://marc.info/?l=linux-netdev&m=139182743919257&w=2

Toshiaki Makita

>
> Stephen?
>
> Luis
Re: [PATCH 1/3] bridge: preserve random init MAC address
(2014/03/13 12:15), Luis R. Rodriguez wrote:
> From: "Luis R. Rodriguez" <mcg...@suse.com>
>
> As it is now if you add create a bridge it gets started
> with a random MAC address and if you then add a net_device
> as a slave but later kick it out you end up with a zero
> MAC address. Instead preserve the original random MAC
> address and use it.
>
> If you manually set the bridge address that will always
> be respected. This change only takes effect if at the time
> of computing the new root port we determine we have found
> no candidates.
>
> Cc: Stephen Hemminger <step...@networkplumber.org>
> Cc: bri...@lists.linux-foundation.org
> Cc: net...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: xen-de...@lists.xenproject.org
> Cc: k...@vger.kernel.org
> Signed-off-by: Luis R. Rodriguez <mcg...@suse.com>
> ---
>  net/bridge/br_device.c  | 1 +
>  net/bridge/br_private.h | 1 +
>  net/bridge/br_stp_if.c  | 3 +++
>  3 files changed, 5 insertions(+)
>
> diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
> index b063050..5f13eac 100644
> --- a/net/bridge/br_device.c
> +++ b/net/bridge/br_device.c
> @@ -368,6 +368,7 @@ void br_dev_setup(struct net_device *dev)
>  	br->bridge_id.prio[1] = 0x00;
>
>  	ether_addr_copy(br->group_addr, eth_reserved_addr_base);
> +	ether_addr_copy(br->random_init_addr, dev->dev_addr);
>
>  	br->stp_enabled = BR_NO_STP;
>  	br->group_fwd_mask = BR_GROUPFWD_DEFAULT;
> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
> index e1ca1dc..32a06da 100644
> --- a/net/bridge/br_private.h
> +++ b/net/bridge/br_private.h
> @@ -240,6 +240,7 @@ struct net_bridge
>  	unsigned long bridge_hello_time;
>  	unsigned long bridge_forward_delay;
>
> +	u8 random_init_addr[ETH_ALEN];
>  	u8 group_addr[ETH_ALEN];
>  	u16 root_port;
>
> diff --git a/net/bridge/br_stp_if.c b/net/bridge/br_stp_if.c
> index 189ba1e..4c9ad45 100644
> --- a/net/bridge/br_stp_if.c
> +++ b/net/bridge/br_stp_if.c
> @@ -239,6 +239,9 @@ bool br_stp_recalculate_bridge_id(struct net_bridge *br)
>  	if (ether_addr_equal(br->bridge_id.addr, addr))
>  		return false; /* no change */
>
> +	if (ether_addr_equal(addr, br_mac_zero))
> +		addr = br->random_init_addr;
> +
>  	br_stp_change_bridge_id(br, addr);
>  	return true;
>  }

nit,
If the last detached port happens to have the same addr as
random_init_addr, this seems to call br_stp_change_bridge_id() even
though bridge_id is not changed.

Shouldn't the assignment of random_init_addr be done before the check of
"no change"?

Toshiaki Makita
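Editor's note: the ordering suggested in the nit above (apply the random_init_addr fallback first, then run the "no change" check) can be sketched in user-space C as follows. `struct bridge`, `recalculate_bridge_id()` and the `id_changes` counter are hypothetical stand-ins for illustration; the counter takes the place of the call to br_stp_change_bridge_id(), and this is not the code that was eventually merged.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAC_LEN 6

/* Simplified stand-in for struct net_bridge; not the kernel definition. */
struct bridge {
    unsigned char bridge_addr[MAC_LEN];      /* models br->bridge_id.addr */
    unsigned char random_init_addr[MAC_LEN]; /* address chosen at creation */
    int id_changes;                          /* counts bridge id changes */
};

static const unsigned char mac_zero[MAC_LEN] = { 0 };

/*
 * candidate is the smallest port address, or all zeros when no port is
 * left. Returns true only when the bridge address really changes.
 */
static bool recalculate_bridge_id(struct bridge *br,
                                  const unsigned char *candidate)
{
    const unsigned char *addr = candidate;

    assert(br != NULL && candidate != NULL);

    /* Fallback first: no ports left means use the initial random address. */
    if (memcmp(addr, mac_zero, MAC_LEN) == 0)
        addr = br->random_init_addr;

    /* Only now decide whether anything changes. */
    if (memcmp(br->bridge_addr, addr, MAC_LEN) == 0)
        return false; /* no change */

    memcpy(br->bridge_addr, addr, MAC_LEN);
    br->id_changes++;
    return true;
}
```

With this order, detaching the last port when the bridge already carries random_init_addr reports "no change" instead of triggering a redundant update.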
Re: [PATCH 12/21] bridge: slight optimization of addr compare
On Mon, 2013-12-23 at 13:10 +0800, Ding Tianhong wrote:
> Use the recently added and possibly more efficient
> ether_addr_equal_unaligned to instead of memcmp.
>
> Cc: Stephen Hemminger <step...@networkplumber.org>
> Cc: David Miller <da...@davemloft.net>
> Cc: bri...@lists.linux-foundation.org
> Cc: net...@vger.kernel.org
> Signed-off-by: Wang Weidong <wangweido...@huawei.com>
> Signed-off-by: Ding Tianhong <dingtianh...@huawei.com>
> ---
>  net/bridge/br_stp_if.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/bridge/br_stp_if.c b/net/bridge/br_stp_if.c
> index 656a6f3..04217d1 100644
> --- a/net/bridge/br_stp_if.c
> +++ b/net/bridge/br_stp_if.c
> @@ -229,7 +229,7 @@ bool br_stp_recalculate_bridge_id(struct net_bridge *br)
>
>  	list_for_each_entry(p, &br->port_list, list) {
>  		if (addr == br_mac_zero ||
> -		    memcmp(p->dev->dev_addr, addr, ETH_ALEN) < 0)
> +		    !ether_addr_equal_unaligned(p->dev->dev_addr, addr) < 0)
>  			addr = p->dev->dev_addr;
>
>  	}

We cannot do this change.
!ether_addr_equal() isn't identical to memcmp().
memcmp() can return negative value but ether_addr_equal() cannot.

br_stp_recalculate_bridge_id() is searching the smallest address among
its ports. This change breaks it.

Thanks,
Toshiaki Makita
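Editor's note: the objection above rests on the difference between an equality test and an ordering comparison. The user-space sketch below illustrates it with hypothetical helpers: `mac_equal()` stands in for an ether_addr_equal()-style predicate and `smallest_mac()` for the "find the smallest address" loop. Neither is a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAC_LEN 6

/* Equality only: answers "same or not", carries no ordering information. */
static bool mac_equal(const unsigned char *a, const unsigned char *b)
{
    return memcmp(a, b, MAC_LEN) == 0;
}

/*
 * macs points at n addresses stored back to back (MAC_LEN bytes each).
 * Returns the smallest one in byte order, or NULL when n is 0.
 */
static const unsigned char *smallest_mac(const unsigned char *macs, size_t n)
{
    const unsigned char *best = NULL;

    assert(macs != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        const unsigned char *cur = macs + i * MAC_LEN;

        /* memcmp() < 0 means cur sorts before best: an ordering test. */
        if (best == NULL || memcmp(cur, best, MAC_LEN) < 0)
            best = cur;
    }
    return best;
}
```

A boolean negation such as `!mac_equal(a, b)` evaluates to 0 or 1, so comparing it with `< 0` is never true; substituting it for the memcmp() test would leave the first address selected regardless of order.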