Bug#864642: vmxnet3: Reports suspect GRO implementation on vSphere hosts / one VM crashes
On 10.08.2017 15:09, Andrew Moore wrote:
> Both of those reports were me. I suspect the issue may be isolated to
> the HPE custom implementation of the ESXi 6.5u1 build. I haven't seen
> any similar reports of people using the vanilla 6.5u1 build.

Not surprising. It wouldn't be the first time HPE horribly botched their
ESX custom ISOs. (Which is the prime reason I don't *ever* use custom
ISOs from any vendor in the first place.)

> Interestingly none of the fixes that have been discussed work with this
> build either. This includes disabling the rx-mini buffer (# ethtool -G
> rx-mini 0) and adding vmxnet3.rev.30 = FALSE to the VM's vmx
> file.

Very strange, indeed.

> The only way I've managed to restore stability is by removing vmxnet3
> out of the equation completely and changing to the e1000 NIC type.

Using a HW version lower than 13 should also help.

Unfortunately, the sample size of people reporting failure or success is
very small at this time, so a conclusive result can't be drawn, I am
afraid.

Regards,
Sven.
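Editorial aside, not part of the original mail: the message above ties the problem to the virtual hardware version (13) and the vmxnet3 NIC type. A minimal shell sketch for reading both values out of a VM's .vmx file could look like the following. It assumes the usual `key = "value"` layout of .vmx files and that the first NIC is `ethernet0`; the function names `vmx_get` and `vmx_report` are hypothetical helpers introduced here for illustration.

```shell
#!/bin/sh
# Print the value of one key from a .vmx file (lines look like: key = "value").
# Usage: vmx_get KEY FILE
vmx_get() {
    awk -v key="$1" '
        {
            # Everything before the first "=" is the key; trim whitespace.
            pos = index($0, "=")
            if (pos == 0) next
            k = substr($0, 1, pos - 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)
            if (k == key) {
                v = substr($0, pos + 1)
                sub(/^[ \t]*"?/, "", v)     # strip leading blanks and quote
                sub(/"?[ \t\r]*$/, "", v)   # strip trailing quote and blanks
                print v
                exit
            }
        }' "$2"
}

# Summarise the two settings discussed in this thread for one .vmx file.
# Assumption: the NIC of interest is ethernet0.
vmx_report() {
    hw=$(vmx_get virtualHW.version "$1")
    nic=$(vmx_get ethernet0.virtualDev "$1")
    echo "hw=${hw:-unknown} nic=${nic:-unknown}"
    if [ "$nic" = "vmxnet3" ] && [ "${hw:-0}" -ge 13 ] 2>/dev/null; then
        echo "vmxnet3 with HW version >= 13: the combination reported as problematic in this thread"
    fi
}
```

Calling `vmx_report /path/to/vm.vmx` (path is a placeholder) only reads the file; changing the hardware version or NIC type is still done through the vSphere tooling.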
Bug#864642: vmxnet3: Reports suspect GRO implementation on vSphere hosts / one VM crashes
On Tue, 8 Aug 2017 11:38:06 +0200 (CEST) Sven Hartge wrote:
> At 16:22 on 03.08.17, Sven Hartge wrote:
> > On 03.08.2017 15:34, Patrick Matthäi wrote:
> >> On 16.07.2017 at 23:42, Ben Hutchings wrote:
> >>> On Thu, 2017-07-06 at 21:50 +0200, Sven Hartge wrote:
> >>>>> Could this be https://bugzilla.kernel.org/show_bug.cgi?id=191201 ?
> >>> Note that this has been root-caused as a bug in the virtual device, not
> >>> the driver. (Though it would be nice if the driver could work around
> >>> it.)
> >
> >> I can confirm, that the VMs do not crash anymore with vSphere 6.5 build
> >> 5969303 from 27.07.2017, that is why I lowered the severity.
> >
> > This is the version from 6.5u1, right?
> >
> > Still: Stretch is basically unusable with HW13 on ESX6.5 below Update1.
>
> Hmm. There are discussions on Reddit right now indicating the bug still
> occurs even with the latest ESXi6.5u1 (Build 5969303).
>
> https://www.reddit.com/r/homelab/comments/6s5dh6/debian_9_on_esxi_65u1_complete_lockup/
>
> One of the latest comments on the Kernel Bugzilla shows the same:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=191201#c54
>
> (For me, this is really frustrating right now, since I waited until
> ESX6.5u1 before updating my infrastructure and now it seems I have to push
> this update even farther into the future because of this critical blocker
> bug.)
>
> I really wonder what could be done on the Kernel side to avoid the
> problem, since only newer Kernel are affected while older one don't show
> the problem.
>
> Regards,
> Sven.

Hi Sven,

Both of those reports were from me. I suspect the issue may be isolated
to the HPE custom implementation of the ESXi 6.5u1 build. I haven't seen
any similar reports from people using the vanilla 6.5u1 build.

Interestingly, none of the fixes that have been discussed work with this
build either. This includes disabling the rx-mini buffer (# ethtool -G
rx-mini 0) and adding vmxnet3.rev.30 = FALSE to the VM's vmx file.

The only way I've managed to restore stability is by removing vmxnet3
out of the equation completely and changing to the e1000 NIC type.

Thanks,
Andrew
Bug#864642: vmxnet3: Reports suspect GRO implementation on vSphere hosts / one VM crashes
At 16:22 on 03.08.17, Sven Hartge wrote:
> On 03.08.2017 15:34, Patrick Matthäi wrote:
>> On 16.07.2017 at 23:42, Ben Hutchings wrote:
>>> On Thu, 2017-07-06 at 21:50 +0200, Sven Hartge wrote:
>>>>> Could this be https://bugzilla.kernel.org/show_bug.cgi?id=191201 ?
>>> Note that this has been root-caused as a bug in the virtual device, not
>>> the driver. (Though it would be nice if the driver could work around
>>> it.)
>
>> I can confirm, that the VMs do not crash anymore with vSphere 6.5 build
>> 5969303 from 27.07.2017, that is why I lowered the severity.
>
> This is the version from 6.5u1, right?
>
> Still: Stretch is basically unusable with HW13 on ESX6.5 below Update1.

Hmm. There are discussions on Reddit right now indicating the bug still
occurs even with the latest ESXi6.5u1 (Build 5969303).

https://www.reddit.com/r/homelab/comments/6s5dh6/debian_9_on_esxi_65u1_complete_lockup/

One of the latest comments on the Kernel Bugzilla shows the same:

https://bugzilla.kernel.org/show_bug.cgi?id=191201#c54

(For me, this is really frustrating right now, since I waited until
ESX6.5u1 before updating my infrastructure, and now it seems I have to
push this update even farther into the future because of this critical
blocker bug.)

I really wonder what could be done on the kernel side to avoid the
problem, since only newer kernels are affected while older ones don't
show the problem.

Regards,
Sven.
Bug#864642: vmxnet3: Reports suspect GRO implementation on vSphere hosts / one VM crashes
On 03.08.2017 15:34, Patrick Matthäi wrote:
> On 16.07.2017 at 23:42, Ben Hutchings wrote:
>> On Thu, 2017-07-06 at 21:50 +0200, Sven Hartge wrote:
>>>> Could this be https://bugzilla.kernel.org/show_bug.cgi?id=191201 ?
>> Note that this has been root-caused as a bug in the virtual device, not
>> the driver. (Though it would be nice if the driver could work around
>> it.)

> I can confirm, that the VMs do not crash anymore with vSphere 6.5 build
> 5969303 from 27.07.2017, that is why I lowered the severity.

This is the version from 6.5u1, right?

Still: Stretch is basically unusable with HW13 on ESX6.5 below Update1.

Regards,
Sven.
Bug#864642: vmxnet3: Reports suspect GRO implementation on vSphere hosts / one VM crashes
severity #864642 normal
thanks

On 16.07.2017 at 23:42, Ben Hutchings wrote:
> Control: tag -1 moreinfo
>
> Sven asked this, but forgot to add you to the recipients:
>
> On Thu, 2017-07-06 at 21:50 +0200, Sven Hartge wrote:
>> Hi!
>>
>>> Could this be https://bugzilla.kernel.org/show_bug.cgi?id=191201 ?
> Note that this has been root-caused as a bug in the virtual device, not
> the driver. (Though it would be nice if the driver could work around
> it.)
>
> Ben.

I can confirm that the VMs do not crash anymore with vSphere 6.5 build
5969303 from 27.07.2017, which is why I lowered the severity.

But we still have the issue with "Driver has suspect GRO implementation,
TCP performance may be compromised", and the fact that 4.9.18-1 neither
crashed nor showed this message, while 4.9.30-1 crashed and showed the
message.

-- 
/*
With kind regards,
 Patrick Matthäi
 GNU/Linux Debian Developer

  Blog: http://www.linux-dev.org/
E-Mail: pmatth...@debian.org
        patr...@linux-dev.org
*/
Bug#864642: vmxnet3: Reports suspect GRO implementation on vSphere hosts / one VM crashes
Control: tag -1 moreinfo

Sven asked this, but forgot to add you to the recipients:

On Thu, 2017-07-06 at 21:50 +0200, Sven Hartge wrote:
> Hi!
>
> > Could this be https://bugzilla.kernel.org/show_bug.cgi?id=191201 ?

Note that this has been root-caused as a bug in the virtual device, not
the driver. (Though it would be nice if the driver could work around
it.)

Ben.

> Try the following, from comment 37
> https://bugzilla.kernel.org/show_bug.cgi?id=191201#c37
>
> > In the meantime, suggested workaround:
> > - disable rx data ring: ethtool -G eth? rx-mini 0
>
> Also adding "vmxnet3.rev.30 = FALSE" to the vmx file of the VM seems to
> be needed. https://bugzilla.kernel.org/show_bug.cgi?id=191201#c40
>
> Also: Which hardware version are you running? It is v10 for me (highest
> for ESX5.5)

-- 
Ben Hutchings
If the facts do not conform to your theory, they must be disposed of.
Bug#864642: vmxnet3: Reports suspect GRO implementation on vSphere hosts / one VM crashes
forwarded #864642 https://bugzilla.kernel.org/show_bug.cgi?id=191201
thanks

On 11.07.2017 at 10:24, Patrick Matthäi wrote:
> found #864642 4.9.30-2+deb9u2
> thanks
>
> And it still crashes..
>
>
> On 22.06.2017 at 14:58, Patrick Matthäi wrote:
>> found #864642 4.9.30-2+deb9u1
>> thanks
>>
>>
>> On 12.06.2017 at 10:02, Patrick Matthäi wrote:
>>> Package: src:linux
>>> Version: 4.9.30-1
>>> Severity: important
>>> File: linux
>>>
>>> Dear Maintainer,
>>>
>>> *** Reporter, please consider answering these questions, where
>>> appropriate ***
>>>
>>> Since updating the kernel from linux-image-4.9.0-2-amd64 (4.9.18-1) to
>>> linux-image-4.9.0-3-amd64 (4.9.30-1) all VMs report - just for the
>>> "primary" interface this:
>>>
>>> TCP: ens192: Driver has suspect GRO implementation, TCP performance may
>>> be compromised.
>>>
>>> I can't see any performance impact. This happens on all our vSphere 6.0
>>> and 6.5 hosts (running on HPE ProLiant DL 360 G8 - G9 HW / ProLiant ML
>>> 350 G9 and so on).
>>>
>>> Why is this bug important? Because on one VM this also produces a kernel
>>> panic after some time (minutes or hours). I just could get the panic
>>> attached as screenshot. The only "big" difference between the crashing
>>> host and the others may be, that it is also running PM2, NPM, NodeJS and
>>> a NFS kernel server.
>>>
>>> If I boot the VM with 4.9.30-1 and deactivate gro and lro with:
>>> ethtool -K ens192 gro off
>>> ethtool -K ens192 lro off
>>> .. it does not crash.
>>>
>>> Booting 4.9.18-1 and everything is completely fine ;)
>>>
>> The VM keeps on crashing after a few hours
>>

-- 
/*
With kind regards,
 Patrick Matthäi
 GNU/Linux Debian Developer

  Blog: http://www.linux-dev.org/
E-Mail: pmatth...@debian.org
        patr...@linux-dev.org
*/
Bug#864642: vmxnet3: Reports suspect GRO implementation on vSphere hosts / one VM crashes
found #864642 4.9.30-2+deb9u2
thanks

And it still crashes..


On 22.06.2017 at 14:58, Patrick Matthäi wrote:
> found #864642 4.9.30-2+deb9u1
> thanks
>
>
> On 12.06.2017 at 10:02, Patrick Matthäi wrote:
>> Package: src:linux
>> Version: 4.9.30-1
>> Severity: important
>> File: linux
>>
>> Dear Maintainer,
>>
>> *** Reporter, please consider answering these questions, where
>> appropriate ***
>>
>> Since updating the kernel from linux-image-4.9.0-2-amd64 (4.9.18-1) to
>> linux-image-4.9.0-3-amd64 (4.9.30-1) all VMs report - just for the
>> "primary" interface this:
>>
>> TCP: ens192: Driver has suspect GRO implementation, TCP performance may
>> be compromised.
>>
>> I can't see any performance impact. This happens on all our vSphere 6.0
>> and 6.5 hosts (running on HPE ProLiant DL 360 G8 - G9 HW / ProLiant ML
>> 350 G9 and so on).
>>
>> Why is this bug important? Because on one VM this also produces a kernel
>> panic after some time (minutes or hours). I just could get the panic
>> attached as screenshot. The only "big" difference between the crashing
>> host and the others may be, that it is also running PM2, NPM, NodeJS and
>> a NFS kernel server.
>>
>> If I boot the VM with 4.9.30-1 and deactivate gro and lro with:
>> ethtool -K ens192 gro off
>> ethtool -K ens192 lro off
>> .. it does not crash.
>>
>> Booting 4.9.18-1 and everything is completely fine ;)
>>
> The VM keeps on crashing after a few hours
>

-- 
/*
With kind regards,
 Patrick Matthäi
 GNU/Linux Debian Developer

  Blog: http://www.linux-dev.org/
E-Mail: pmatth...@debian.org
        patr...@linux-dev.org
*/
Bug#864642: vmxnet3: Reports suspect GRO implementation on vSphere hosts / one VM crashes
Hi!

Could this be https://bugzilla.kernel.org/show_bug.cgi?id=191201 ?

Try the following, from comment 37
https://bugzilla.kernel.org/show_bug.cgi?id=191201#c37

| In the meantime, suggested workaround:
| - disable rx data ring: ethtool -G eth? rx-mini 0

Also adding "vmxnet3.rev.30 = FALSE" to the vmx file of the VM seems to
be needed. https://bugzilla.kernel.org/show_bug.cgi?id=191201#c40

Also: Which hardware version are you running? It is v10 for me (highest
for ESX5.5)

Regards,
Sven.
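Editorial aside, not part of the original mail: the two workarounds quoted above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the interface name (for example ens192) must be supplied by the caller, the helper names `run`, `disable_rx_data_ring` and `vmx_has_rev30_disabled` are hypothetical, and `DRY_RUN` is an invented switch that defaults to printing the command instead of executing it.

```shell
#!/bin/sh
# DRY_RUN=1 (default) only prints commands; set DRY_RUN=0 to execute them
# (which requires root inside the guest).
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Guest side: disable the rx data ring on one interface (e.g. ens192).
disable_rx_data_ring() {
    run ethtool -G "$1" rx-mini 0
}

# Host side: succeed if the given .vmx file already contains the
# vmxnet3.rev.30 = FALSE setting (quoted or unquoted).
vmx_has_rev30_disabled() {
    grep -Eiq '^[[:space:]]*vmxnet3\.rev\.30[[:space:]]*=[[:space:]]*"?FALSE"?' "$1"
}
```

With the defaults, `disable_rx_data_ring ens192` just prints the ethtool command, which makes it easy to review before running it for real. The current ring settings can be inspected with `ethtool -g` on the same interface.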
Bug#864642: vmxnet3: Reports suspect GRO implementation on vSphere hosts / one VM crashes
On Mon, 12 Jun 2017 10:02:56 +0200 Patrick Matthäi wrote:
> Since updating the kernel from linux-image-4.9.0-2-amd64 (4.9.18-1) to
> linux-image-4.9.0-3-amd64 (4.9.30-1) all VMs report - just for the
> "primary" interface this:
>
> TCP: ens192: Driver has suspect GRO implementation, TCP performance may
> be compromised.
>
> I can't see any performance impact. This happens on all our vSphere 6.0
> and 6.5 hosts (running on HPE ProLiant DL 360 G8 - G9 HW / ProLiant ML
> 350 G9 and so on).

I see the same on my Stretch test VMs, running on ESXi 5.5 on a Dell
R720.

I have yet to experience a kernel panic, but those VMs are mostly idle
and don't transfer much data over the network, so the crash frequency
might be related to the amount of data transmitted or to the peak
throughput.

Regards,
Sven.
Bug#864642: vmxnet3: Reports suspect GRO implementation on vSphere hosts / one VM crashes
found #864642 4.9.30-2+deb9u1
thanks


On 12.06.2017 at 10:02, Patrick Matthäi wrote:
> Package: src:linux
> Version: 4.9.30-1
> Severity: important
> File: linux
>
> Dear Maintainer,
>
> *** Reporter, please consider answering these questions, where
> appropriate ***
>
> Since updating the kernel from linux-image-4.9.0-2-amd64 (4.9.18-1) to
> linux-image-4.9.0-3-amd64 (4.9.30-1) all VMs report - just for the
> "primary" interface this:
>
> TCP: ens192: Driver has suspect GRO implementation, TCP performance may
> be compromised.
>
> I can't see any performance impact. This happens on all our vSphere 6.0
> and 6.5 hosts (running on HPE ProLiant DL 360 G8 - G9 HW / ProLiant ML
> 350 G9 and so on).
>
> Why is this bug important? Because on one VM this also produces a kernel
> panic after some time (minutes or hours). I just could get the panic
> attached as screenshot. The only "big" difference between the crashing
> host and the others may be, that it is also running PM2, NPM, NodeJS and
> a NFS kernel server.
>
> If I boot the VM with 4.9.30-1 and deactivate gro and lro with:
> ethtool -K ens192 gro off
> ethtool -K ens192 lro off
> .. it does not crash.
>
> Booting 4.9.18-1 and everything is completely fine ;)
>

The VM keeps on crashing after a few hours

-- 
/*
With kind regards,
 Patrick Matthäi
 GNU/Linux Debian Developer

  Blog: http://www.linux-dev.org/
E-Mail: pmatth...@debian.org
        patr...@linux-dev.org
*/
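Editorial aside, not part of the original mail: the GRO/LRO workaround from the quoted report can be wrapped in a small shell function so it is easy to apply to several interfaces. This is a sketch only. The helper names `run` and `disable_offloads` and the `DRY_RUN` switch are hypothetical; the ethtool invocations themselves are the ones from the report.

```shell
#!/bin/sh
# DRY_RUN=1 (default) only prints commands; set DRY_RUN=0 to execute them
# (which requires root inside the guest).
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Turn off generic and large receive offload on one interface,
# using the same two ethtool calls as in the bug report.
disable_offloads() {
    for feature in gro lro; do
        run ethtool -K "$1" "$feature" off
    done
}
```

For example, `disable_offloads ens192` prints the two commands in dry-run mode. Settings made with `ethtool -K` apply to the running system only and are lost on reboot, and `ethtool -k` on the interface shows the current offload state.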