[Kernel-packages] [Bug 1779678] Re: deadlocks in copy_net_ns

2018-07-04 Thread Christian Brauner
I've been running a 4.18 kernel for a long time now and I haven't been
able to reproduce the bug. Please note, however, that this bug was a
race, so it is entirely possible that the race has merely become too
unlikely to hit in practice. I doubt it, though, since a) there's a
proper explanation for the prior bug and b) the locking has changed
completely upstream.

** Tags added: kernel-fixed-upstream

** Changed in: linux (Ubuntu)
   Status: Incomplete => Confirmed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1779678

Title:
  deadlocks in copy_net_ns

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  Various users have reported hangs happening during network namespace
  creation. This mostly manifests in issues while starting lxc containers,
  and, when triggered, can be seen clearly by running `unshare -n` which
  will simply hang forever. This has been happening randomly for quite a
  few kernel versions now. This has been confirmed on 4.13 by Proxmox
  users (Proxmox uses an Ubuntu-based kernel with few patches), and on
  various other older and newer kernels as found by reports in the links [1][2][3]
  below. [2] in particular contains the same symptoms across multiple
  distributions and kernel versions. The posted stack traces do include
  copy_net_ns() on top as well.
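
  A minimal sketch of how the `unshare -n` check could be scripted,
  assuming Python 3 and root privileges. The function name
  `probe_netns_creation`, the 10 second timeout and the returned strings
  are hypothetical choices for illustration and do not come from the
  reports.

```python
import subprocess


def probe_netns_creation(cmd=("unshare", "-n", "true"), timeout=10.0):
    """Run cmd and classify the outcome as "ok", "error" or "hung".

    "hung" only means the command did not finish within `timeout`
    seconds, which is the symptom described above; it is not proof
    of the copy_net_ns() deadlock by itself.
    """
    try:
        proc = subprocess.Popen(
            list(cmd), stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError:
        return "error"  # e.g. the binary is not installed
    try:
        returncode = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        try:
            # Reap the child if possible, but never block on it forever.
            proc.wait(timeout=1.0)
        except subprocess.TimeoutExpired:
            pass
        return "hung"
    return "ok" if returncode == 0 else "error"
```

  Calling `probe_netns_creation()` as root on an affected machine would
  be expected to return "hung" once the race has triggered; "error"
  usually means missing privileges or a missing `unshare` binary.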

  There are races in the network code causing copy_net_ns() to hang
  (seemingly permanently). Some of these are caused by specific types of
  interfaces being in use and have been addressed (various refcount leak
  fixes), but that's not all of them. We've received yet another report
  with the current version 4.15.0-22.24 / 4.15.17 with the same symptoms.

  Processes in this state always have copy_net_ns() on top of their
  /proc/$pid/stack looking like:

  ~/ cat /proc/5228/stack 
  [<0>] copy_net_ns+0xab/0x220
  [<0>] create_new_namespaces+0x11b/0x1e0
  [<0>] unshare_nsproxy_namespaces+0x5a/0xb0
  [<0>] SyS_unshare+0x201/0x3a0
  [<0>] do_syscall_64+0x73/0x130
  [<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
  [<0>] 0x

  or

  cat /proc/23900/stack 
  [<0>] copy_net_ns+0xab/0x220
  [<0>] create_new_namespaces+0x11b/0x1e0
  [<0>] copy_namespaces+0x6d/0xa0
  [<0>] copy_process.part.35+0x941/0x1ab0
  [<0>] _do_fork+0xdf/0x3f0
  [<0>] SyS_clone+0x19/0x20
  [<0>] do_syscall_64+0x73/0x130
  [<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
  [<0>] 0x
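
  To find affected processes without checking each PID by hand, the
  traces above can be matched mechanically. This is only a sketch,
  assuming Python 3, that reading /proc/<pid>/stack is permitted
  (normally root only) and that the frame format matches the traces
  shown here. The helper names `top_frame` and `stuck_pids` are made up
  for this example.

```python
import glob
import os
import re

# Matches lines such as "[<0>] copy_net_ns+0xab/0x220" and captures the symbol.
FRAME_RE = re.compile(r"^\[<[0-9a-fx]+>\]\s+([A-Za-z_][\w.]*)\+")


def top_frame(stack_text):
    """Return the symbol of the first parsable frame, or None."""
    for line in stack_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            return match.group(1)
    return None


def stuck_pids(proc_root="/proc", symbol="copy_net_ns"):
    """Return the sorted PIDs whose kernel stack has `symbol` on top."""
    pids = []
    for path in glob.glob(os.path.join(proc_root, "[0-9]*", "stack")):
        try:
            with open(path) as handle:
                text = handle.read()
        except OSError:
            continue  # process exited or access was denied
        if top_frame(text) == symbol:
            pids.append(int(os.path.basename(os.path.dirname(path))))
    return sorted(pids)
```

  A process that merely passes through copy_net_ns() at the moment of
  sampling would also be listed, so a PID should only be treated as
  stuck if it shows up repeatedly.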

  This randomly affects users of network namespaces (lxc, lxd, docker, PVE
  as well as service units using systemd's PrivateNetwork option and
  various others).

  Upstream there have been a lot of changes to the involved locking
  mechanism since 4.16 and we should try to backport these patches.
  This includes most of Kirill Tkhai's network patches and some others.

  I've been going through the following ones, generated via various `git
  log` calls on net/ and drivers/net/ (initially limited to the ones with
  `--author='Kirill Tkhai'` as a starting point).
  There's also a long list of patches we don't need to pick as they're
  implicitly reverted by a single later change, provided we include all
  the necessary patches. They are still nice to review, given that they
  form a progressive change: first introducing a flag about async-safety,
  then going through all the affected areas with commit messages
  detailing why/if/how they're safe, and finally, once every area has
  been converted, a commit removing the flag again.
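
  One possible form of such a `git log` call, written as a small Python
  helper so it can be reused per author or path. This is a sketch under
  assumptions: the revision range `v4.15..v4.18-rc3` and the function
  name `git_log_command` are hypothetical and were not part of the
  original search. Only the `--author` filter and the two paths are
  taken from the text above.

```python
def git_log_command(
    author="Kirill Tkhai",
    paths=("net/", "drivers/net/"),
    rev_range="v4.15..v4.18-rc3",  # assumed range, adjust to the trees compared
):
    """Build the argv for a one-line-per-commit git log limited to paths."""
    return [
        "git",
        "log",
        "--oneline",
        f"--author={author}",
        rev_range,
        "--",
        *paths,
    ]
```

  The list can be passed to `subprocess.run()` with `cwd` set to a
  kernel checkout. `git log` prints newest commits first by default,
  which matches the ordering used for the list below.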

  Ordered newest to oldest
  U .. already in the Ubuntu kernel, included due to its order when
       viewing related patches
  P .. should be cherry-picked
  Q .. (just 1) included for completeness; will conflict if the patches
       adding NETDEV_{C,S}VLAN_FILTER_PUSH_INFO are backported, which is
       probably good as a reminder for verification?
  - .. if all other patches are applied, they're made obsolete by
       2f635ceeb22b ("net: Drop pernet_operations::async")

  Q 3f5ecd8a90dd net: Fix coccinelle warning
  P eb7f54b90bd8 kcm: Fix use-after-free caused by clonned sockets
  P 554873e51711 net: Do not take net_rwsem in __rtnl_link_unregister()
  P fc1dd36992bb net: Remove net_rwsem from {, un}register_netdevice_notifier()
  P 328fbe747ad4 net: Close race between {un, }register_netdevice_notifier() and setup_net()/cleanup_net()
  P 9e2f6c5d78db netfilter: Rework xt_TEE netdevice notifier
  P e9a441b6e729 xfrm: Register xfrm_dev_notifier in appropriate place
  P 152f253152cc net: Remove rtnl_lock() in nf_ct_iterate_destroy()
  P ec9c780925c5 ovs: Remove rtnl_lock() from ovs_exit_net()
  P 350311aab4c0 security: Remove rtnl_lock() in selinux_xfrm_notify_policyload()
  P 10256debb918 net: Don't take rtnl_lock() in wireless_nlevent_flush()
  P f0b07bb151b0 net: Introduce net_rwsem to protect net_namespace_list
  d 8518e9bb98b6 net: Add more comments
  P 4420bf21fb6c net: Rename net_sem to pernet_ops_rwsem
  P 2f635ceeb22b net: Drop pernet_operations::async
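
  Since the list is ordered newest to oldest while a backport would
  normally apply the oldest commit first, the marked list can be turned
  into a cherry-pick order mechanically. A rough sketch, assuming
  Python 3, 12-character abbreviated hashes as used above, and that
  wrapped subject lines belong to the preceding entry. The names
  `parse_patch_list` and `cherry_pick_order` are invented for this
  example.

```python
import re

# "<marker> <12 hex digits> <subject>", e.g. "P f0b07bb151b0 net: Introduce ..."
ENTRY_RE = re.compile(r"^\s*(\S)\s+([0-9a-f]{12})\b\s*(.*)$")


def parse_patch_list(text):
    """Return (marker, sha, subject) tuples in the order they are listed."""
    entries = []
    for line in text.splitlines():
        match = ENTRY_RE.match(line)
        if match:
            entries.append([match.group(1), match.group(2), match.group(3).strip()])
        elif entries and line.strip():
            # A wrapped subject line: append it to the previous entry.
            entries[-1][2] = (entries[-1][2] + " " + line.strip()).strip()
    return [tuple(entry) for entry in entries]


def cherry_pick_order(entries, marker="P"):
    """Return the hashes carrying `marker`, oldest first."""
    picks = [sha for mark, sha, _ in entries if mark == marker]
    return list(reversed(picks))
```

  The resulting hashes could then be fed to `git cherry-pick` one at a
  time, so that each conflict is reviewed individually.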

[Kernel-packages] [Bug 1779678] Re: deadlocks in copy_net_ns

2018-07-03 Thread Joseph Salisbury
Would it be possible for you to test the latest upstream kernel? Refer to
https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest
v4.18 kernel[0].

If this bug is fixed in the mainline kernel, please add the following
tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag:
'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as
"Confirmed".


Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.18-rc3


** Changed in: linux (Ubuntu)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu)
   Status: Confirmed => Incomplete

** Tags added: kernel-da-key

[Kernel-packages] [Bug 1779678] Re: deadlocks in copy_net_ns

2018-07-02 Thread Christian Brauner
** Changed in: linux (Ubuntu)
   Status: Incomplete => Confirmed

Bug description:
  I've been going through the following ones, generated via various `git
  log` calls on net/ and drivers/net/ (initially limited to the ones with
  `--author='Kirill Tkhai'` as a starting point).
  There's also a long list of patches we don't need to pick as they're
  implicitly reverted by a single later change, provided we include all
  the necessary patches. They are still nice to review, given that they
  form a progressive change: first introducing a flag about async-safety,
  then going through all the affected areas with commit messages
  detailing why/if/how they're safe, and finally, once every area has
  been converted, a commit removing the flag again.

  Ordered newest to oldest
  U .. already in the Ubuntu kernel, included due to its order when
       viewing related patches
  P .. should be cherry-picked
  Q .. (just 1) included for completeness; will conflict if the patches
       adding NETDEV_{C,S}VLAN_FILTER_PUSH_INFO are backported, which is
       probably good as a reminder for verification?
  - .. if all other patches are applied, they're made obsolete by
       2f635ceeb22b ("net: Drop pernet_operations::async")

  Q 3f5ecd8a90dd net: Fix coccinelle warning
  P eb7f54b90bd8 kcm: Fix use-after-free caused by clonned sockets
  P 554873e51711 net: Do not take net_rwsem in __rtnl_link_unregister()
  P fc1dd36992bb net: Remove net_rwsem from {, un}register_netdevice_notifier()
  P 328fbe747ad4 net: Close race between {un, }register_netdevice_notifier() and setup_net()/cleanup_net()
  P 9e2f6c5d78db netfilter: Rework xt_TEE netdevice notifier
  P e9a441b6e729 xfrm: Register xfrm_dev_notifier in appropriate place
  P 152f253152cc net: Remove rtnl_lock() in nf_ct_iterate_destroy()
  P ec9c780925c5 ovs: Remove rtnl_lock() from ovs_exit_net()
  P 350311aab4c0 security: Remove rtnl_lock() in selinux_xfrm_notify_policyload()
  P 10256debb918 net: Don't take rtnl_lock() in wireless_nlevent_flush()
  P f0b07bb151b0 net: Introduce net_rwsem to protect net_namespace_list
  d 8518e9bb98b6 net: Add more comments
  P 4420bf21fb6c net: Rename net_sem to pernet_ops_rwsem
  P 2f635ceeb22b net: Drop pernet_operations::async
  P 094374e5e173 net: Reflect all pernet_operations are converted
  - 67441c2472dd net: Convert nfsd_net_ops
  - dbf7bb443726 net: Convert nfs4blocklayout_net_ops
  - 436de500948e net: Convert nfs4_dns_resolver_ops
  - 5e804a6077dc net: Convert sunrpc_net_ops
  - 855aeba34047 net: Convert rpcsec_gss_net_ops
  P 070f2d7e264a net: Drop NETDEV_UNREGISTER_FINAL
  P 3e0c2dbfea28 infi