Re: Incomplete fix for be9c798 / 7b2ee50 hv_netvsc: common detach logic?

2018-06-20 Thread Thomas Walker
On Tue, Jun 19, 2018 at 04:14:40PM -0700, Stephen Hemminger wrote:
> On Wed, 20 Jun 2018 08:01:50 +0900
> Greg KH  wrote:
> 
> > On Tue, Jun 19, 2018 at 03:19:41PM -0400, Thomas Walker wrote:
> > > Upon updating some internal kernels from 4.14.18 to 4.14.49, our hyper-v 
> > > guests failed to bring up network interfaces on boot, logging "A link 
> > > change request failed with some changes committed already. Interface eth0 
> > > may have been left with an inconsistent configuration, please check."  
> > > Running 'ifconfig eth0 up' appears to fix the problem temporarily so I 
> > > went about bisecting which landed on:
> > > 
> > > commit be9c798d0d13ae609a91177323ac816545c39d28
> > > Author: Stephen Hemminger 
> > > Date:   Mon May 14 15:32:18 2018 -0700
> > > 
> > > hv_netvsc: common detach logic
> > > 
> > > [ Commit 7b2ee50c0cd513a176a26a71f2989facdd75bfea upstream. ]
> > > 
> > > Make common function for detaching internals of device
> > > during changes to MTU and RSS. Make sure no more packets
> > > are transmitted and all packets have been received before
> > > doing device teardown.
> > > 
> > > Change the wait logic to be common and use usleep_range().
> > > 
> > > Changes transmit enabling logic so that transmit queues are disabled
> > > during the period when lower device is being changed. And enabled
> > > only after sub channels are setup. This avoids issue where it could
> > > be that a packet was being sent while subchannel was not initialized.
> > > 
> > > Fixes: 8195b1396ec8 ("hv_netvsc: fix deadlock on hotplug")
> > > Signed-off-by: Stephen Hemminger 
> > > Signed-off-by: David S. Miller 
> > > Signed-off-by: Greg Kroah-Hartman 
> > > 
> > > 
> > > The removal of which does indeed fix the problem.  That led me to:
> > > 
> > > commit 52acf73b6e9a6962045feb2ba5a8921da2201915
> > > Author: Dexuan Cui 
> > > Date:   Wed Jun 6 21:32:51 2018 +
> > > 
> > > hv_netvsc: Fix a network regression after ifdown/ifup
> > > 
> > > Recently people reported the NIC stops working after
> > > "ifdown eth0; ifup eth0". It turns out in this case the TX queues are 
> > > not
> > > enabled, after the refactoring of the common detach logic: when the 
> > > NIC
> > > has sub-channels, usually we enable all the TX queues after all
> > > sub-channels are set up: see rndis_set_subchannel() ->
> > > netif_device_attach(), but in the case of "ifdown eth0; ifup eth0" 
> > > where
> > > the number of channels doesn't change, we also must make sure the TX 
> > > queues
> > > are enabled. The patch fixes the regression.
> > > 
> > > Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
> > > Signed-off-by: Dexuan Cui 
> > > Cc: Stephen Hemminger 
> > > Cc: K. Y. Srinivasan 
> > > Cc: Haiyang Zhang 
> > > Signed-off-by: David S. Miller 
> > > 
> > > Which sounded very promising, but does not seem to fully fix the issue.  
> > > Doing some more digging, I was able to determine that the message 
> > > coincides with 'ip link set dev eth0 mtu 1300 up' very shortly (>~1 
> > > second) after the hv_netvsc driver loads.  If I delay the mtu change 
> > > until well after the driver loads, everything works fine.  If I unload 
> > > hv_netvsc and then reload it and apply the mtu change immediately, the 
> > > failure re-occurs.  So something is racy here, and the above doesn't 
> > > entirely address it.
> > > 
> > > I'm happy to test out any suggested patches and/or do additional 
> > > debugging if anyone has any suggestions.
> > > 
> > > (oh, and I did also try 4.18-rc1 and the problem still persists)  
> > 
> > Always cc: the authors of the patch you are having problems with, along
> > with the mailing list for the networking subsystem, so that the right
> > people know to look at this.
> 
> How are you changing the MTU? and starting the device.
> When MTU changes the device has to reconnect with the host. This takes a 
> small amount of time
> and no changes to device state are possible then.
> 
> If MTU change happens and at the same time some other script tries to bring 
> up the connection,
> then that script will get an error.

Simply setting 'mtu 1300' in /etc/network/interfaces (Debian userland) which 
results in ifup executing:

ip addr add 192.168.1.100/255.255.255.0 broadcast 192.168.1.255 dev eth0 label 
eth0
ip link set dev eth0 mtu 1300 up   
ip route add default via 192.168.1.1 dev eth0   

Running those commands standalone (outside of ifup) reproduces the problem as 
well.  Trying lots of different permutations of loading the driver and 
different commands with different delays, my initial suspicion that this was 
somehow related to proximity to the driver load appears incorrect.  The only 
thing that seems to matter is setting the mtu and link state in one go.  If I 
separate the 'ip link' command 

Re: Incomplete fix for be9c798 / 7b2ee50 hv_netvsc: common detach logic?

2018-06-19 Thread Stephen Hemminger
On Wed, 20 Jun 2018 08:01:50 +0900
Greg KH  wrote:

> On Tue, Jun 19, 2018 at 03:19:41PM -0400, Thomas Walker wrote:
> > Upon updating some internal kernels from 4.14.18 to 4.14.49, our hyper-v 
> > guests failed to bring up network interfaces on boot, logging "A link 
> > change request failed with some changes committed already. Interface eth0 
> > may have been left with an inconsistent configuration, please check."  
> > Running 'ifconfig eth0 up' appears to fix the problem temporarily so I went 
> > about bisecting which landed on:
> > 
> > commit be9c798d0d13ae609a91177323ac816545c39d28
> > Author: Stephen Hemminger 
> > Date:   Mon May 14 15:32:18 2018 -0700
> > 
> > hv_netvsc: common detach logic
> > 
> > [ Commit 7b2ee50c0cd513a176a26a71f2989facdd75bfea upstream. ]
> > 
> > Make common function for detaching internals of device
> > during changes to MTU and RSS. Make sure no more packets
> > are transmitted and all packets have been received before
> > doing device teardown.
> > 
> > Change the wait logic to be common and use usleep_range().
> > 
> > Changes transmit enabling logic so that transmit queues are disabled
> > during the period when lower device is being changed. And enabled
> > only after sub channels are setup. This avoids issue where it could
> > be that a packet was being sent while subchannel was not initialized.
> > 
> > Fixes: 8195b1396ec8 ("hv_netvsc: fix deadlock on hotplug")
> > Signed-off-by: Stephen Hemminger 
> > Signed-off-by: David S. Miller 
> > Signed-off-by: Greg Kroah-Hartman 
> > 
> > 
> > The removal of which does indeed fix the problem.  That led me to:
> > 
> > commit 52acf73b6e9a6962045feb2ba5a8921da2201915
> > Author: Dexuan Cui 
> > Date:   Wed Jun 6 21:32:51 2018 +
> > 
> > hv_netvsc: Fix a network regression after ifdown/ifup
> > 
> > Recently people reported the NIC stops working after
> > "ifdown eth0; ifup eth0". It turns out in this case the TX queues are 
> > not
> > enabled, after the refactoring of the common detach logic: when the NIC
> > has sub-channels, usually we enable all the TX queues after all
> > sub-channels are set up: see rndis_set_subchannel() ->
> > netif_device_attach(), but in the case of "ifdown eth0; ifup eth0" where
> > the number of channels doesn't change, we also must make sure the TX 
> > queues
> > are enabled. The patch fixes the regression.
> > 
> > Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
> > Signed-off-by: Dexuan Cui 
> > Cc: Stephen Hemminger 
> > Cc: K. Y. Srinivasan 
> > Cc: Haiyang Zhang 
> > Signed-off-by: David S. Miller 
> > 
> > Which sounded very promising, but does not seem to fully fix the issue.  
> > Doing some more digging, I was able to determine that the message coincides 
> > with 'ip link set dev eth0 mtu 1300 up' very shortly (>~1 second) after the 
> > hv_netvsc driver loads.  If I delay the mtu change until well after the 
> > driver loads, everything works fine.  If I unload hv_netvsc and then reload 
> > it and apply the mtu change immediately, the failure re-occurs.  So 
> > something is racy here, and the above doesn't entirely address it.
> > 
> > I'm happy to test out any suggested patches and/or do additional debugging 
> > if anyone has any suggestions.
> > 
> > (oh, and I did also try 4.18-rc1 and the problem still persists)  
> 
> Always cc: the authors of the patch you are having problems with, along
> with the mailing list for the networking subsystem, so that the right
> people know to look at this.

How are you changing the MTU? and starting the device.
When MTU changes the device has to reconnect with the host. This takes a small 
amount of time
and no changes to device state are possible then.

If MTU change happens and at the same time some other script tries to bring up 
the connection,
then that script will get an error.


Re: Incomplete fix for be9c798 / 7b2ee50 hv_netvsc: common detach logic?

2018-06-19 Thread Stephen Hemminger
On Wed, 20 Jun 2018 08:01:50 +0900
Greg KH  wrote:

> On Tue, Jun 19, 2018 at 03:19:41PM -0400, Thomas Walker wrote:
> > Upon updating some internal kernels from 4.14.18 to 4.14.49, our hyper-v 
> > guests failed to bring up network interfaces on boot, logging "A link 
> > change request failed with some changes committed already. Interface eth0 
> > may have been left with an inconsistent configuration, please check."  
> > Running 'ifconfig eth0 up' appears to fix the problem temporarily so I went 
> > about bisecting which landed on:
> > 
> > commit be9c798d0d13ae609a91177323ac816545c39d28
> > Author: Stephen Hemminger 
> > Date:   Mon May 14 15:32:18 2018 -0700
> > 
> > hv_netvsc: common detach logic
> > 
> > [ Commit 7b2ee50c0cd513a176a26a71f2989facdd75bfea upstream. ]
> > 
> > Make common function for detaching internals of device
> > during changes to MTU and RSS. Make sure no more packets
> > are transmitted and all packets have been received before
> > doing device teardown.
> > 
> > Change the wait logic to be common and use usleep_range().
> > 
> > Changes transmit enabling logic so that transmit queues are disabled
> > during the period when lower device is being changed. And enabled
> > only after sub channels are setup. This avoids issue where it could
> > be that a packet was being sent while subchannel was not initialized.
> > 
> > Fixes: 8195b1396ec8 ("hv_netvsc: fix deadlock on hotplug")
> > Signed-off-by: Stephen Hemminger 
> > Signed-off-by: David S. Miller 
> > Signed-off-by: Greg Kroah-Hartman 
> > 
> > 
> > The removal of which does indeed fix the problem.  That led me to:
> > 
> > commit 52acf73b6e9a6962045feb2ba5a8921da2201915
> > Author: Dexuan Cui 
> > Date:   Wed Jun 6 21:32:51 2018 +
> > 
> > hv_netvsc: Fix a network regression after ifdown/ifup
> > 
> > Recently people reported the NIC stops working after
> > "ifdown eth0; ifup eth0". It turns out in this case the TX queues are 
> > not
> > enabled, after the refactoring of the common detach logic: when the NIC
> > has sub-channels, usually we enable all the TX queues after all
> > sub-channels are set up: see rndis_set_subchannel() ->
> > netif_device_attach(), but in the case of "ifdown eth0; ifup eth0" where
> > the number of channels doesn't change, we also must make sure the TX 
> > queues
> > are enabled. The patch fixes the regression.
> > 
> > Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
> > Signed-off-by: Dexuan Cui 
> > Cc: Stephen Hemminger 
> > Cc: K. Y. Srinivasan 
> > Cc: Haiyang Zhang 
> > Signed-off-by: David S. Miller 
> > 
> > Which sounded very promising, but does not seem to fully fix the issue.  
> > Doing some more digging, I was able to determine that the message coincides 
> > with 'ip link set dev eth0 mtu 1300 up' very shortly (>~1 second) after the 
> > hv_netvsc driver loads.  If I delay the mtu change until well after the 
> > driver loads, everything works fine.  If I unload hv_netvsc and then reload 
> > it and apply the mtu change immediately, the failure re-occurs.  So 
> > something is racy here, and the above doesn't entirely address it.
> > 
> > I'm happy to test out any suggested patches and/or do additional debugging 
> > if anyone has any suggestions.
> > 
> > (oh, and I did also try 4.18-rc1 and the problem still persists)  
> 
> Always cc: the authors of the patch you are having problems with, along
> with the mailing list for the networking subsystem, so that the right
> people know to look at this.

Let me take a look at it, and log it into the bug system.


Re: Incomplete fix for be9c798 / 7b2ee50 hv_netvsc: common detach logic?

2018-06-19 Thread Greg KH
On Tue, Jun 19, 2018 at 03:19:41PM -0400, Thomas Walker wrote:
> Upon updating some internal kernels from 4.14.18 to 4.14.49, our hyper-v 
> guests failed to bring up network interfaces on boot, logging "A link change 
> request failed with some changes committed already. Interface eth0 may have 
> been left with an inconsistent configuration, please check."  Running 
> 'ifconfig eth0 up' appears to fix the problem temporarily so I went about 
> bisecting which landed on:
> 
> commit be9c798d0d13ae609a91177323ac816545c39d28
> Author: Stephen Hemminger 
> Date:   Mon May 14 15:32:18 2018 -0700
> 
> hv_netvsc: common detach logic
> 
> [ Commit 7b2ee50c0cd513a176a26a71f2989facdd75bfea upstream. ]
> 
> Make common function for detaching internals of device
> during changes to MTU and RSS. Make sure no more packets
> are transmitted and all packets have been received before
> doing device teardown.
> 
> Change the wait logic to be common and use usleep_range().
> 
> Changes transmit enabling logic so that transmit queues are disabled
> during the period when lower device is being changed. And enabled
> only after sub channels are setup. This avoids issue where it could
> be that a packet was being sent while subchannel was not initialized.
> 
> Fixes: 8195b1396ec8 ("hv_netvsc: fix deadlock on hotplug")
> Signed-off-by: Stephen Hemminger 
> Signed-off-by: David S. Miller 
> Signed-off-by: Greg Kroah-Hartman 
> 
> 
> The removal of which does indeed fix the problem.  That led me to:
> 
> commit 52acf73b6e9a6962045feb2ba5a8921da2201915
> Author: Dexuan Cui 
> Date:   Wed Jun 6 21:32:51 2018 +
> 
> hv_netvsc: Fix a network regression after ifdown/ifup
> 
> Recently people reported the NIC stops working after
> "ifdown eth0; ifup eth0". It turns out in this case the TX queues are not
> enabled, after the refactoring of the common detach logic: when the NIC
> has sub-channels, usually we enable all the TX queues after all
> sub-channels are set up: see rndis_set_subchannel() ->
> netif_device_attach(), but in the case of "ifdown eth0; ifup eth0" where
> the number of channels doesn't change, we also must make sure the TX 
> queues
> are enabled. The patch fixes the regression.
> 
> Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
> Signed-off-by: Dexuan Cui 
> Cc: Stephen Hemminger 
> Cc: K. Y. Srinivasan 
> Cc: Haiyang Zhang 
> Signed-off-by: David S. Miller 
> 
> Which sounded very promising, but does not seem to fully fix the issue.  
> Doing some more digging, I was able to determine that the message coincides 
> with 'ip link set dev eth0 mtu 1300 up' very shortly (>~1 second) after the 
> hv_netvsc driver loads.  If I delay the mtu change until well after the 
> driver loads, everything works fine.  If I unload hv_netvsc and then reload 
> it and apply the mtu change immediately, the failure re-occurs.  So something 
> is racy here, and the above doesn't entirely address it.
> 
> I'm happy to test out any suggested patches and/or do additional debugging if 
> anyone has any suggestions.
> 
> (oh, and I did also try 4.18-rc1 and the problem still persists)

Always cc: the authors of the patch you are having problems with, along
with the mailing list for the networking subsystem, so that the right
people know to look at this.

Now fixed...

thanks,

greg k-h