[Bug 1819786] Re: 4.15 kernel ip_vs --ops causes performance and hang problem

2019-04-29 Thread Marc Hasson
I'd like to note that I tested/verified BOTH of these kernel versions in
their respective Proposed states that have the ipvs fix.  We most need
the 16.04 4.15 HWE kernel released, which appears to be in progress, but
since this is a bionic bug it's unclear whether another step is
required.

These kernels passed my ipvs tests properly, the fix worked perfectly:

4.15.0-49.52~16.04.1 (xenial-proposed)
Linux director-16-04 4.15.0-49-generic #52~16.04.1-Ubuntu SMP Thu Apr 25 
18:54:26 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

4.15.0-49.53 (bionic-proposed)
Linux direct-18-04 4.15.0-49-generic #53-Ubuntu SMP Fri Apr 26 06:45:49 UTC 
2019 x86_64 x86_64 x86_64 GNU/Linux


** Tags removed: verification-needed-bionic
** Tags added: verification-done-bionic

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1819786

Title:
  4.15 kernel ip_vs --ops causes performance and hang problem

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1819786/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1817247] Re: 4.15 kernel's ip_vs module gets refcount errors with --ops usage

2019-04-09 Thread Marc Hasson
Thanks, that source change makes sense.  Hopefully, now that the similar
bug:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1819786

has been marked with a status of "Fix Released", we can get a similar
result here.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1817247

Title:
  4.15 kernel's ip_vs module gets refcount errors with --ops usage

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-hwe/+bug/1817247/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1817247] Re: 4.15 kernel's ip_vs module gets refcount errors with --ops usage

2019-04-08 Thread Marc Hasson
The test kernel supplied for the 4.15 kernel in the 18.04 LTS release
fixed the identical issue there:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1819786


Given our wider 16.04 deployment with the HWE 4.15 kernel, we'd
especially like to see the same version of the fix applied here too.

Thanks!


[Bug 1819786] Re: 4.15 kernel ip_vs --ops causes performance and hang problem

2019-04-08 Thread Marc Hasson
Well, my apologies.  I retract my skepticism!  The -47 kernel you
referenced above appears to have fixed the problem, while the stock -47
kernel showed the failure when I tested it first this evening (I already
had the distributed 4.15.0-47 kernel installed, then removed it and
installed your referenced build).

So, we will be looking forward to the official rollout of this fix.  But
even more important to us is the Xenial 4.15 kernel version, since it is
more widely distributed, as referenced in:

https://bugs.launchpad.net/ubuntu/+source/linux-signed-hwe/+bug/1817247


[Bug 1819786] Re: 4.15 kernel ip_vs --ops causes performance and hang problem

2019-04-08 Thread Marc Hasson
Thanks for getting back to me on this.  Sorry for the slow response; I
did not get (or did not see?) an email notification about an update.

I'll try the -47 kernel you referenced, but I'm skeptical since the
failure occurs on the -46 and the -47 changelog doesn't mention the
ipvs refcount issue.  Nor does the -47 source I downloaded from Ubuntu
show the fix.

But on the off chance that you're just asking me to confirm against the
latest, or that you included the fix below in the kernel you referenced
even though it's named identically to the kernel we received last month,
I'll give it a try and let you know.

Again, for reference, the upstream fix/patch we are requesting for both
this launchpad report as well as
https://bugs.launchpad.net/ubuntu/+source/linux-signed-hwe/+bug/1817247
is as follows:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a050d345cef0dc6249263540da1e902bba617e43


I'll let you know the results of testing this latest -47.  And I presume
I'll have the -48 shortly as well; that one's changelog also gives no
indication of having the fix yet.

Thanks.


[Bug 1817247] Re: 4.15 kernel's ip_vs module gets refcount errors with --ops usage

2019-03-13 Thread Marc Hasson
** Tags added: bionic


[Bug 1817247] Re: 4.15 kernel's ip_vs module gets refcount errors with --ops usage

2019-03-12 Thread Marc Hasson
This report does not appear to have received any human attention at
Ubuntu yet, even though it was written weeks ago and describes a
significant performance/functionality problem: the --ops facility cannot
be used as-is and must be patched.  Unfortunately for us, in the interim
we discovered that the -36 kernel we built the livepatch remedy for now
has to be quickly replaced by a later kernel, due to a TCP window
performance problem fixed in a later maintenance release.

So, in case the combination of 16.04 LTS with the HWE 4.15 kernel is not
sufficient to garner attention, I have also tested this exact issue on a
stock 18.04 system with the distributed kernel.  As one would expect
from the similar 4.15 kernel, the issue exists there as well, and it is
reported at the following URL with a title that better conveys its
seriousness:

  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1819786


[Bug 1819786] [NEW] 4.15 kernel ip_vs --ops causes performance and hang problem

2019-03-12 Thread Marc Hasson
Public bug reported:

On our 16.04LTS (and earlier) systems we used the ipvsadm --ops UDP
support (one-packet scheduling) to get a better distribution amongst
our real servers behind the load-balancer for some small subset of
applications.

This has worked fine through the 4.4.0-xxx kernels. But when we started
a program to upgrade systems to use the 4.15 series of kernels to take
advantage of new facilities, the subset of systems which used the --ops
option ran into problems. Everything else with the 4.15 kernels appeared
to work well.

This issue was reported in #1817247 against 16.04 LTS with the HWE 4.15
kernel, but has received no acknowledgement in the weeks since.  So we
have moved on to confirm that a stock 18.04 LTS system with the latest
expected/standard 4.15 kernel has this issue as well, and we report that
here.  Perhaps this will get more attention.

The issue appears to have been the change in the ip_vs module from the
"atomic_*()" increment/decrement functions used in the 4.4 kernel to the
"refcount_*()" functions used in later kernels, including the 4.15 one
we switched to.  Unfortunately, the simple refcount_dec() call is
inadequate: it emits a time-consuming warning, with extra handling, when
the refcount drops to zero, and zero is the expected value in the case
of --ops support, which retains no state after packet delivery.  I will
upload an attachment with the sample messages that get put out at packet
arrival rate, which of course destroys performance.  This test VM
reports the identical errors we see on our production servers, but at
least throwing only a couple of test --ops packets at it doesn't
crash/hang the 18.04 system as it did the 16.04 VM reported earlier.
And in production, with far greater packet rates, our systems fail since
the attached call backtrace *** appears on every packet!! ***

This issue was apparently already recognized as an error and has appeared
as a fix in upstream kernels. This is a reference to the 4.17 version
of the fix that we'd like to see incorporated into the next possible
kernel maintenance release:

https://github.com/torvalds/linux/commit/a050d345cef0dc6249263540da1e902bba617e43#diff-75923493f6e3f314b196a8223b0d6342

We have successfully used the livepatch facility to build a livepatch .ko
with the above diffs on our 4.15.0-36 system and demonstrated the
contrast in good/bad behavior with/without the livepatch module loaded.
But we'd rather not have to build a version of livepatch.ko for each
kernel maintenance release, such as the 4.15.0-46 kernel used here to
demonstrate that the issue persists in the Ubuntu mainline distro.

The problem is easy to generate, with only a couple of packets
and a simple configuration. Here's a very basic test (addresses
rewritten/obscured) version of an example configuration for 2 servers
that worked on my test VM:

ipvsadm -A -f 100 -s rr --ops
ipvsadm -a -f 100 -r 10.129.131.227:0 -g -w 
ipvsadm -a -f 100 -r 10.129.131.228:0 -g -w 
iptables -t mangle -A PREROUTING -d 172.16.5.1/32 -j MARK --set-xmark 
0x64/0x
ifconfig lo:0 172.16.5.1/32 up

Routing and addressing to achieve the above, or adaptation for one's
own test environment, is left to the tester.  I just added alias
10.129.131.x addresses on my "outbound" interface, and a static route
for 172.16.5.1 on my client system so the test packets arrived on the
"inbound" interface.
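As a rough sketch of that plumbing, the client-side route and director-side alias might look like the following; the director's eth1 address and both interface names are placeholders for illustration, not values from this report:

```shell
# Hypothetical reproduction plumbing (all addresses/devices below are
# placeholders -- adapt to your own topology).

# On the client: route the marked VIP via the director's "inbound"
# (eth1) address, assumed here to be 10.0.0.10 on a shared segment,
# so the test packets arrive on the director's eth1:
ip route add 172.16.5.1/32 via 10.0.0.10

# On the director: put an address in the real-server subnet on the
# "outbound" interface (eth2 assumed) so -g (direct routing) can
# reach the 10.129.131.x real servers:
ip addr add 10.129.131.1/24 dev eth2
```

These commands require root and a second NIC; the report leaves the exact topology to the tester.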

I set up routing and addresses on my 2-NIC test system such that packets
arrived on its eth1 NIC and were directed by ip_vs out eth2.  To test,
all I did was throw a few UDP packets via traceroute at the address in
the iptables/firewall mark rule, with the test system's eth1 interface
as the traceroute system's default gateway:

  traceroute -m 2 172.16.5.1

Without the fix, my test ip_vs system either hangs or puts out messages
as per the attached.  With our livepatch module using the above commit's
contents, all is well.  Both of the test ("real" as opposed to
"virtual") servers configured above via ipvsadm get packets, and no
errors are reported in the logs.
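One quick way to confirm that good behavior after sending the test packets is to check the per-real-server counters and the kernel log; these are standard ipvsadm/dmesg invocations, and the expectation of roughly equal counts is my reading of round-robin --ops behavior, not a figure from this report:

```shell
# Show per-real-server packet counters; with "-s rr --ops" each probe
# packet is scheduled independently, so the two real servers should
# accumulate roughly equal packet counts:
ipvsadm -L -n --stats

# Check the kernel log for the refcount warnings/backtraces this bug
# produces on every --ops packet; a fixed kernel should show none:
dmesg | grep -iE 'refcount|ip_vs' | tail -n 20
```

Both commands need root (and the ip_vs module loaded) but only inspect state, so they are safe to run on a system under test.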

Let me know of anything I can do to help accelerate the understanding or
resolution of this issue.  Incorporating the fix seems fairly
straightforward, and without it --ops is a performance disaster for
anyone using the facility to any significant degree.

Thanks!

$ lsb_release -rd
Description:    Ubuntu 18.04.2 LTS
Release:        18.04
$ uname -a
Linux direct-18-04 4.15.0-46-generic #49-Ubuntu SMP Wed Feb 6 09:33:07 UTC 2019 
x86_64 x86_64 x86_64 GNU/Linux

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: linux-image-4.15.0-46-generic 4.15.0-46.49
ProcVersionSignature: Ubuntu 4.15.0-46.49-generic 4.15.18
Uname: Linux 4.15.0-46-generic x86_64
ApportVersion: 2.20.9-0ubuntu7.5
Architecture: amd64
AudioDevicesInUse:
 USERPID ACCESS COMMAND
 /dev/snd/controlC0:  marc   1980 F pulseaudio
CurrentDesktop: ubuntu:GNOME
Date: Tue Mar 12 13:14:31 2019

[Bug 1817247] Re: 4.15 kernel's ip_vs module gets refcount errors with --ops usage

2019-02-22 Thread Marc Hasson
** Description changed:

  On our 16.04LTS system we use the ipvsadm --ops UDP support (one-packet
  scheduling) to get a better distribution amongst our real servers behind
  the load-balancer for some small subset of applications.
  
  This has worked fine through the 4.4.0-xxx kernels.   But when we
  started a program to upgrade systems to use the 4.15 series of kernels
  to take advantage of new facilities, the subset of systems which used
  the --ops option ran into problems.   Everything else with the 4.15
  kernels appeared to work well.
  
  The issue appears to have been the change in the ip_vs module from using
  "atomic_*()" increment/decrement functions in the 4.4 kernel to instead
  use "refcount_*()" functions in a later kernel, including the 4.15 one
  we switched to.  Unfortunately, the simple refcount_dec() function was
  inadequate, in putting out a time-consuming message and handling when
  the refcount dropped to zero, which is expected in the case of --ops
  support that retains no state post packet delivery.   I will upload an
  attachment with the sample messages that get put out at packet arrival
  rate, which destroys performance of course.   On this VM, with far more
  limited # of CPUs than production servers, the system actually hangs
  (crashes?) for quite some time.
  
  This issue was apparently already recognized as an error and has
  appeared as a fix in upstream kernels. This is a reference to the 4.17
  version of the fix that we'd like to see incorporated into the next
  possible kernel maintenance release:
  
  
https://github.com/torvalds/linux/commit/a050d345cef0dc6249263540da1e902bba617e43
  #diff-75923493f6e3f314b196a8223b0d6342
  
  We have successfully used the livepatch facility to build a livepatch
  .ko with the above diffs on our 4.15.0-36 system and successfully
  demonstrated the contrast in good/bad behavior with/without the
  livepatch  module loaded.   But we'd rather not have to build a version
  of livepatch.ko for each kernel maintenance release.
  
  The problem is easy to generate, with only a couple of packets and a
  simple configuration.   Here's a very basic test (addresses
- rewritten/obscrued) version of an example configuration for 2 servers
+ rewritten/obscured) version of an example configuration for 2 servers
  that worked on my test VM:
  
  ipvsadm -A -f 100 -s rr --ops
  ipvsadm -a -f 100 -r 10.129.131.227:0 -g -w 
  ipvsadm -a -f 100 -r 10.129.131.228:0 -g -w 
  iptables -t mangle -A PREROUTING -d 172.16.5.1/32 -j MARK --set-xmark 
0x64/0x
  ifconfig lo:0 172.16.5.1/32 up
  
  Routing and addressing to achieve the above, or adaptation for one's own
  test environment, is left to the tester.
  
  I set up routing and addresses on my 2 NIC test such that packets
  arrived on my test machine's eth1 NIC and were directed by ip_vs out the
  eth2.   To test, all I did was throw a few UDP packets via traceroute at
  the address on the iptables/firewall mark rule so that the eth1
  interface of the test system was the traceroute system's default
  gateway:
  
-   traceroute -m 2 172.16.5.1
+   traceroute -m 2 172.16.5.1
  
- Without the fix my test ip_vs system either hangs or puts out messages
- as per the attached.  With our livepatch module using the above commit's
- contents, all is well.
+ Without the fix my test ip_vs system either hangs or puts out messages as per 
the attached.  With our livepatch module using the above commit's contents, all 
is well.  Both real servers configured above, get packets
+ and no errors are reported in the logs.
  
  Just read that, despite using apport-bug, that I should include the
  lsb_release info requested.   Here it is inline, with the uname -a to
  emphasize the kernel we are running which is the issue:
  
  $ lsb_release -rd
  Description:  Ubuntu 16.04.5 LTS
  Release:  16.04
  $ uname -a
  Linux director-16-04 4.15.0-36-generic #39~16.04.1-Ubuntu SMP Tue Sep 25 
08:59:23 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  
- 
- Let me know of anything I can do to help accelerate addressing of this issue 
or understanding.  It seems that the fix incorporation is fairly 
straightforward, and is a performance disaster without it for anyone using the 
--ops facility to any significant degree.
+ Let me know of anything I can do to help accelerate addressing of this
+ issue or understanding.  It seems that the fix incorporation is fairly
+ straightforward, and is a performance disaster without it for anyone
+ using the --ops facility to any significant degree.
  
  Thanks!
  
  ProblemType: Bug
  DistroRelease: Ubuntu 16.04
  Package: linux-image-4.15.0-36-generic 4.15.0-36.39~16.04.1
  ProcVersionSignature: Ubuntu 4.15.0-36.39~16.04.1-generic 4.15.18
  Uname: Linux 4.15.0-36-generic x86_64
  ApportVersion: 2.20.1-0ubuntu2.18
  Architecture: amd64
  CurrentDesktop: Unity
  Date: Thu Feb 21 20:13:18 2019
  InstallationDate: Installed on 2017-06-21 (611 days ago)
  InstallationMedia: Ubuntu 

[Bug 1817247] [NEW] 4.15 kernel's ip_vs module gets refcount errors with --ops usage

2019-02-21 Thread Marc Hasson
Public bug reported:

On our 16.04LTS system we use the ipvsadm --ops UDP support (one-packet
scheduling) to get a better distribution amongst our real servers behind
the load-balancer for some small subset of applications.

This has worked fine through the 4.4.0-xxx kernels.   But when we
started a program to upgrade systems to use the 4.15 series of kernels
to take advantage of new facilities, the subset of systems which used
the --ops option ran into problems.   Everything else with the 4.15
kernels appeared to work well.

The issue appears to have been the change in the ip_vs module from the
"atomic_*()" increment/decrement functions used in the 4.4 kernel to the
"refcount_*()" functions used in later kernels, including the 4.15 one
we switched to.  Unfortunately, the simple refcount_dec() call is
inadequate: it emits a time-consuming warning, with extra handling, when
the refcount drops to zero, and zero is the expected value in the case
of --ops support, which retains no state after packet delivery.  I will
upload an attachment with the sample messages that get put out at packet
arrival rate, which of course destroys performance.  On this VM, with a
far more limited number of CPUs than our production servers, the system
actually hangs (crashes?) for quite some time.

This issue was apparently already recognized as an error and has
appeared as a fix in upstream kernels. This is a reference to the 4.17
version of the fix that we'd like to see incorporated into the next
possible kernel maintenance release:

https://github.com/torvalds/linux/commit/a050d345cef0dc6249263540da1e902bba617e43
#diff-75923493f6e3f314b196a8223b0d6342

We have successfully used the livepatch facility to build a livepatch
.ko with the above diffs on our 4.15.0-36 system and successfully
demonstrated the contrast in good/bad behavior with/without the
livepatch  module loaded.   But we'd rather not have to build a version
of livepatch.ko for each kernel maintenance release.

The problem is easy to generate, with only a couple of packets and a
simple configuration.  Here's a very basic test (addresses
rewritten/obscured) version of an example configuration for 2 servers
that worked on my test VM:

ipvsadm -A -f 100 -s rr --ops
ipvsadm -a -f 100 -r 10.129.131.227:0 -g -w 
ipvsadm -a -f 100 -r 10.129.131.228:0 -g -w 
iptables -t mangle -A PREROUTING -d 172.16.5.1/32 -j MARK --set-xmark 
0x64/0x
ifconfig lo:0 172.16.5.1/32 up

Routing and addressing to achieve the above, or adaptation for one's own
test environment, is left to the tester.

I set up routing and addresses on my 2 NIC test such that packets
arrived on my test machine's eth1 NIC and were directed by ip_vs out the
eth2.   To test, all I did was throw a few UDP packets via traceroute at
the address on the iptables/firewall mark rule so that the eth1
interface of the test system was the traceroute system's default
gateway:

  traceroute -m 2 172.16.5.1

Without the fix my test ip_vs system either hangs or puts out messages
as per the attached.  With our livepatch module using the above commit's
contents, all is well.

I just read that, despite using apport-bug, I should include the
requested lsb_release info.  Here it is inline, with uname -a to
emphasize the kernel we are running, which is the issue:

$ lsb_release -rd
Description:    Ubuntu 16.04.5 LTS
Release:        16.04
$ uname -a
Linux director-16-04 4.15.0-36-generic #39~16.04.1-Ubuntu SMP Tue Sep 25 
08:59:23 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux


Let me know of anything I can do to help accelerate the understanding or
resolution of this issue.  Incorporating the fix seems fairly
straightforward, and without it --ops is a performance disaster for
anyone using the facility to any significant degree.

Thanks!

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.15.0-36-generic 4.15.0-36.39~16.04.1
ProcVersionSignature: Ubuntu 4.15.0-36.39~16.04.1-generic 4.15.18
Uname: Linux 4.15.0-36-generic x86_64
ApportVersion: 2.20.1-0ubuntu2.18
Architecture: amd64
CurrentDesktop: Unity
Date: Thu Feb 21 20:13:18 2019
InstallationDate: Installed on 2017-06-21 (611 days ago)
InstallationMedia: Ubuntu 16.04 LTS "Xenial Xerus" - Release amd64 (20160420.1)
SourcePackage: linux-signed-hwe
UpgradeStatus: No upgrade log present (probably fresh install)

** Affects: linux-signed-hwe (Ubuntu)
 Importance: Undecided
 Status: New


** Tags: amd64 apport-bug xenial

** Attachment added: "ip_vs refcount error call trace logs for a single --ops 
packet"
   
https://bugs.launchpad.net/bugs/1817247/+attachment/5240862/+files/kern_log.txt


[Bug 1618299] Re: IPv6 with LVS Performance issue in latest 3.13LTS kernels

2016-09-26 Thread Marc Hasson
Thanks so much for the rapid turnaround on this report, guys.  I've
changed the tag to verification-done-trusty, as requested.  I pulled
down all the linux-image, source, and dbgsym packages for the
3.13.0-97.144 kernel from proposed for installing/testing.  I verified
the source code manually as well.

All looks as good as I can determine short of actual production use
here; let's go with it.


** Tags removed: verification-needed-trusty verified-test-kernel-works
** Tags added: verification-done-trusty

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1618299

Title:
  IPv6 with LVS Performance issue in latest 3.13LTS kernels

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1618299/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs


[Bug 1618299] Re: IPv6 with LVS Performance issue in latest 3.13LTS kernels

2016-09-12 Thread Marc Hasson
** Tags added: verified-test-kernel-works



[Bug 1618299] Re: IPv6 with LVS Performance issue in latest 3.13LTS kernels

2016-09-07 Thread Marc Hasson
Joseph, the commits you've included appear correct to me.  I've tested
this kernel as best I can without the dbgsym package.  I could really
use that package so I can verify the performance with "perf" as well as
use "systemtap" to verify things are going down the right paths.  But I
did use /proc/net/ipv6_route's lookups counter as an indication that,
with your test kernel, excessive lookups were no longer occurring when
routing generation changes were injected.

Bottom line: this test kernel looks like it has what we've been needing
for our production servers; we'd appreciate any/all efforts toward
getting this into the earliest 3.13 kernel maintenance release you can
manage.

Thanks!



[Bug 1618299] Re: IPv6 with LVS Performance issue in latest 3.13LTS kernels

2016-09-06 Thread Marc Hasson
Thanks so much, Joseph!  My apologies for not seeing earlier that you
had a test kernel ready; I was offsite almost all of last week and
swamped.  I will download it now and test it later this evening or
tomorrow morning in my test rig.  By any chance, to aid my systemtap
monitoring, do you have a matching dbgsym package available for your
test kernel?



[Bug 1618299] [NEW] IPv6 with LVS Performance issue in latest 3.13LTS kernels

2016-08-29 Thread Marc Hasson
Public bug reported:

We experienced a major performance regression between 12.04's 3.2
kernels and 14.04's 3.13 kernels when using IPv6 with the LVS
load-balancing facility.  Through analysis of perf events and a
workaround, we've determined that an upstream fix is available which
addresses the issue.  Ubuntu has picked up the "IPv6: remove rt6i_genid"
fix, which appears to address our issue, in its 3.16 and later kernels.
This was checked into the upstream 3.16 kernel back in late 2014.  That
fix addressed an issue introduced by the (late 2012) "ipv6: use
net->rt_genid to check dst validity", which is the source of our problem
due to the mismatch between a dst/route instantiation's rt6i_genid and
the IPv6 rt_genid field.

Since we have drivers and other software in the field tied to the 3.13
kernels, this report requests that the upstream fix be backported to the
3.13 LTS kernels as well, since we were planning on several more years
for those deployed systems.  It seems relatively straightforward.

In our 3.13 kernel we used the "systemtap" facility on a test system to
temporarily address the "obsolete" determination mistakenly made by the
ip6_dst_check() function, which checks dst validity on behalf of the LVS
(ip_vs) code.  By updating rt6i_genid to the current global value, we
were able to restore our test systems to the performance previously
obtained with the 3.2 kernels.  But clearly we want the official
upstream fix incorporated, which pulls the troublesome rt6i_genid field
out altogether, since according to mid-2014 threads with the upstream
kernel folks its mishandling affected more than IPv6/LVS support, such
as a per-socket route caching issue.

Note that the apport-bug collected info from a 3.13.0-87 system but we see the
issue on all 3.13.0-xxx kernels.

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: linux-image-3.13.0-87-generic 3.13.0-87.133
ProcVersionSignature: Ubuntu 3.13.0-87.133-generic 3.13.11-ckt39
Uname: Linux 3.13.0-87-generic x86_64
ApportVersion: 2.14.1-0ubuntu3.21
Architecture: amd64
AudioDevicesInUse:
 USERPID ACCESS COMMAND
 /dev/snd/controlC0:  marc   2433 F pulseaudio
Date: Mon Aug 29 20:05:57 2016
HibernationDevice: RESUME=UUID=c4187d86-ea40-4f53-af39-1b7e83964502
InstallationDate: Installed on 2014-04-30 (852 days ago)
InstallationMedia: Ubuntu 14.04 LTS "Trusty Tahr" - Release amd64 (20140417)
Lsusb:
 Bus 001 Device 003: ID 0e0f:0002 VMware, Inc. Virtual USB Hub
 Bus 001 Device 002: ID 0e0f:0003 VMware, Inc. Virtual Mouse
 Bus 001 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
MachineType: VMware, Inc. VMware Virtual Platform
ProcEnviron:
 LANGUAGE=en_US
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 svgadrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.13.0-87-generic 
root=UUID=92211e82-1c0b-42e6-bb12-0003b7f6db54 ro quiet splash 
crashkernel=384M-:128M
PulseList:
 Error: command ['pacmd', 'list'] failed with exit code 1: Home directory not 
accessible: Permission denied
 No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-3.13.0-87-generic N/A
 linux-backports-modules-3.13.0-87-generic  N/A
 linux-firmware 1.127.22
RfKill:
 
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 05/20/2014
dmi.bios.vendor: Phoenix Technologies LTD
dmi.bios.version: 6.00
dmi.board.name: 440BX Desktop Reference Platform
dmi.board.vendor: Intel Corporation
dmi.board.version: None
dmi.chassis.asset.tag: No Asset Tag
dmi.chassis.type: 1
dmi.chassis.vendor: No Enclosure
dmi.chassis.version: N/A
dmi.modalias: 
dmi:bvnPhoenixTechnologiesLTD:bvr6.00:bd05/20/2014:svnVMware,Inc.:pnVMwareVirtualPlatform:pvrNone:rvnIntelCorporation:rn440BXDesktopReferencePlatform:rvrNone:cvnNoEnclosure:ct1:cvrN/A:
dmi.product.name: VMware Virtual Platform
dmi.product.version: None
dmi.sys.vendor: VMware, Inc.

** Affects: linux (Ubuntu)
 Importance: Undecided
 Status: New


** Tags: amd64 apport-bug trusty



[Bug 1021375] Re: Nautilus says the USB stick is read only when it is not

2014-12-11 Thread Marc Hasson
I too have the nautilus read-only misbehavior on the very latest
12.04LTS (x86_64) system and have the following observations, including
a simple-enough workaround I have used since running into this problem
quite a while back.

I can cause the false read-only behavior on demand by first inserting a
read-only USB flash device, such as a camera's Secure Digital card with
the lock switch thrown, via an SD card reader.  But after
removing/unmounting that read-only SD card, *every* subsequent
insertion of a normally-writable USB flash drive then gets the above-
reported symptoms: it's considered read-only by nautilus, yet a "cp"
command from the shell to the USB media works fine, which shows the
system mounted it correctly for writing.
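The shell-side verification mentioned above can be sketched like this (a
minimal check; /media/usb0 is a placeholder mount point, not a path taken
from this report):

```shell
# Confirm the kernel mounted the stick read-write even though nautilus
# claims it is read-only.  MNT is a placeholder, not from the report.
MNT=/media/usb0
mount | grep "$MNT" || echo "$MNT not mounted"   # look for "rw" in the options
touch "$MNT/write-test" 2>/dev/null && echo "writable" || echo "not writable here"
```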

As I believe someone mentioned earlier, it must be that the nautilus
state is contaminated/broken under some scenarios.  My intentional read-
only flash drive method is just one solid way to cause that broken
behavior hangover.  So my workaround is:

When I encounter this false read-only, I simply unmount/remove the
removable device(s), close any windows using nautilus, and then I do a
"killall nautilus" from the shell (I always see a non-window nautilus
whose parent is pid 1 with my user id, presumably this is the thing with
the broken state that I kill).  After that, reinserting the very USB
drive which falsely received the read-only error will now work properly.
I can drag things onto the drive using the fresh instance of nautilus
that is launched.
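The workaround described above, as a small script (a sketch of my manual
steps, not something shipped with the system):

```shell
# Kill the stale per-session nautilus so a fresh instance, with clean
# state, is launched on next use.  Run only after unmounting the media
# and closing any file-manager windows.
if pgrep -u "$(id -un)" -x nautilus >/dev/null 2>&1; then
    killall nautilus
    echo "nautilus killed; reinsert the USB drive and try again"
else
    echo "no nautilus process found for this user"
fi
```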

It's a shame that after more than 2 years, on a claimed LTS system, I
have yet to see even a "we aren't going to bother fixing this one"
response from the Ubuntu folks on this bug report.  It's not been closed
or marked as a duplicate; it's still marked unassigned.  Makes me
question why I've been recommending Ubuntu LTS releases at all to my
companies for production use.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1021375

Title:
  Nautilus says the USB stick is read only when it is not

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/nautilus/+bug/1021375/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs


[Bug 1154876] Re: 3.2.0-38 and earlier systems hang with heavy memory usage

2013-10-16 Thread Marc Hasson
Summary:

Tried 3.11rc7, very happy with how it behaved in our testing.  Tried
this week's 3.12rc5, disappointed that a step backwards was taken
on that one for us.  The difference for us was in the low memory killer
that was configured in the 3.11rc7 build but not the 3.12rc5 system.
Details below; as a consequence I'm tagging this bug as both "upstream
3.11rc7 fixes" and "upstream 3.12rc5 doesn't fix"!


Details:

I've now switched to a real hardware (Dell multicore) platform to make
sure no one has any doubts as to this kernel problem being an issue on
real hardware as well as my VM testbed.  I can achieve the same hang
failure in the original bug description using either my 2GB VM or the
actual machine now.

I first reproduced the hang with a more recent 3.2.0-45 kernel on this
64-bit Dell hardware and then tried both the mainline 3.11rc7 and this
week's 3.12rc5 kernels from the URL supplied above by Christopher.

The good news is that I was unable to reproduce a problem using the
3.11rc7 kernel and the system was extremely well-behaved!  That is,
despite running a very heavy load it remained responsive to new requests,
appeared to get more overall work accomplished compared to the 3.2 system
in the same time period, and had a minimum of kswapd scan rates in the
sar records.  And no direct allocation failure scan rates at all.
Naturally, the system was SIGKILL'ing off selected processes periodically
but this is the price I'd expect for running the memory-overloading
test I have here and in my real-world environment.  We much prefer
this behavior of individual processes being killed off, which can be
subsequently relaunched, rather than hanging or crashing the entire
system.  Especially since it appeared that the SIGKILLs in my tests
were *always* directed at processes that were actively doing the memory
consuming work, so they were good choices.

I note that the processes SIGKILL'ed off in the above 3.11rc7 system
were dispatched to their death by the low memory killer logic in the
lowmemorykiller.c code.  The standard kernel OOM killer rarely, if ever,
was invoked.  The 3.11rc7 kernel appears to have been built with the
CONFIG_ANDROID_LOW_MEMORY_KILLER=y setting which caused that low memory
killer code to be statically linked into the kernel and register its low
memory shrinker callback function which issued the appropriate SIGKILLs
under overloaded conditions.

The bad news is that the more recent 3.12rc5 kernel I tried did NOT
have the above CONFIG_ANDROID_LOW_MEMORY_KILLER=y setting and instead
relied upon just the kernel OOM killer.  This 3.12rc5 system is behaving
similarly to when I turned off the 3.11rc7's low memory killer via
a /sys/module low memory minfree parameter.  That is, the 3.12rc5 (or
3.11rc7 with low memory killer disabled) system experienced:

 1) Much longer, and with wide variance, user response times
External wget queries went from 1-5 seconds with the low memory
killer enabled during the overloading tests to 2 *minutes* without
that facility!

 2) High kswapd scans of .5M-1M/second in the sar reports
With the low memory killer, kswapd scan rates never exceeded a few K/sec.

 3) Fairly high direct allocation failure scans as well (K/sec)

 4) Multiple processes critical to system functions were OOM'ed
Management shell/terminal sessions that were idle, sshd, cron, etc.

 5) Even a panic in one test sequence
Kernel panic - not syncing: Out of memory and no killable processes...

The behavior of our test systems without the low memory killer
functionality is poor, with the system either crashing or providing
a poor (simulated) customer response.  Either is better than the 3.2
hang I've reported, but not by much for our production/response needs!

I understand that there are concerns about the low memory killer
killing off processes before even getting to use the allocated
swap space on a system.  I observed that as well, which for us was
fine.   But I appreciate that it may not be desirable to have the
CONFIG_ANDROID_LOW_MEMORY_KILLER=y option for all folks' usage cases
as was done for the 3.11rc7 build.  But what about supplying that low
memory killer as an optionally loadable module by simply building with
CONFIG_ANDROID_LOW_MEMORY_KILLER=m in the kernel/distribution package?
That way, those of us who desire to not use any swap area and prefer a
more responsive system overall will have a simple way to load that module
distributed with the then-current Ubuntu kernel.  There are usage cases
where it's better to shed load by killing off processes earlier rather than
degrade response time by using the swap area to preserve those processes.
The default would be to retain the current 3.12rc5 behavior: do NOT load
the low memory killer and in so doing experience the standard kernel OOM
handling.  The latter could be improved over time as a separate effort,
if needed.
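The proposal above, as concrete steps.  The module and parameter names are
as they appear in the kernel's staging tree; the minfree values are
illustrative only, not recommendations.  Shown as a dry run (echo) since
the real steps need root and a kernel built with the module available:

```shell
# Kernel config change proposed above:
#   CONFIG_ANDROID_LOW_MEMORY_KILLER=m
# Then, on systems that prefer early process-killing over swapping,
# load the module and tune its thresholds (values are illustrative).
echo "modprobe lowmemorykiller"
echo "echo 1024,2048,4096,8192,16384,32768 > /sys/module/lowmemorykiller/parameters/minfree"
```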

We would consider the above minor loadable module configuration change as
a simple way to 

[Bug 1154876] Re: 3.2.0-38 and earlier systems hang with heavy memory usage

2013-08-09 Thread Marc Hasson
Christopher, it looks like I actually have a reasonable record of the
VMware version I was using for this reproduction, despite having
regularly updated my VMware.  The VMware installer has a log that
shows that at the time of the reproduction/report here I was running the
VMware vmplayer 4.0.4 x86_64 version, build #744019.

Since I was causing reproductions of this issue well before and after
the dates in March that I reported it here, I'm quite certain that I've
reproduced this issue across multiple versions of the vmplayer.   And
we've seen similar-appearing issues on our real servers.  Hope this
helps.

Thanks for looking into this!  It's still an ongoing issue, especially
with the 2.6 kernels in another bug I wrote related to this one.



** Changed in: linux (Ubuntu)
   Status: Incomplete => Confirmed

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1154876

Title:
  3.2.0-38 and earlier systems hang with heavy memory usage

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1154876/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs


[Bug 1154876] Re: 3.2.0-38 and earlier systems hang with heavy memory usage

2013-08-09 Thread Marc Hasson
Christopher, I did such a test back in March upon request with no response
to my testing results then.  Nor *any* activity, until your recent notes.
Do we have any specific reason or bugfix to believe that this memory issue
has been addressed since then?  Will I get at least a response this time to
my testing results as to next steps?  I ask because this test can take
several days to perform and I will not be able to start it immediately; I
want to make sure it's worth our time to do in order to get the most
effective progress on this.

BTW, have you noticed the bug report below?  It seems fairly similar to
what I've been seeing in different kernels/testing as well as has a similar
reproduction method.  In my testing there definitely is the OOM deadlock in
my 2.6 kernel testing while the 3.2 kernel testing seemed to have a
slightly different deadlock in believing it had made page freeing
progress when in fact it had not done so.  My upstream kernel testing back
in March had not totally deadlocked, but would freeze for long periods of
time.  Here's the kernel.org bug, with no response, that seemed somewhat
similar and could be an indication that me doing upstream testing would not
be all that useful:

https://bugzilla.kernel.org/show_bug.cgi?id=59901


Assuming a positive response for me to still proceed with the upstream
testing you requested, I will first have to reconfirm that my current
testbed can reproduce the issue with the latest 3.2 -51 Ubuntu kernel and
then I will try the upstream kernel you referenced.  So it will be a little
while before I can report on this testing, which took multiple days in the
past.  Nor can I start this testing immediately.  It sure would be nice if
you guys could reproduce this, I thought my info on this score would be
adequate for that.


  -- Marc --


On Fri, Aug 9, 2013 at 7:35 PM, Christopher M. Penalver 
christopher.m.penal...@gmail.com wrote:

 Marc Hasson, could you please test the latest upstream kernel available
 following https://wiki.ubuntu.com/KernelMainlineBuilds ? It will allow
 additional upstream developers to examine the issue. Please do not test the
 daily folder, but the one all the way at the bottom. Once you've tested the
 upstream kernel, please comment on which kernel version specifically you
 tested. If this bug is fixed in the mainline kernel, please add the
 following tags:
 kernel-fixed-upstream
 kernel-fixed-upstream-VERSION-NUMBER

 where VERSION-NUMBER is the version number of the kernel you tested. For
 example:
 kernel-fixed-upstream-v3.11-rc4

 This can be done by clicking on the yellow circle with a black pencil icon
 next to the word Tags located at the bottom of the bug description. As
 well, please remove the tag:
 needs-upstream-testing

 If the mainline kernel does not fix this bug, please add the following
 tags:
 kernel-bug-exists-upstream
 kernel-bug-exists-upstream-VERSION-NUMBER

 As well, please remove the tag:
 needs-upstream-testing

 If you are unable to test the mainline kernel, please comment as to why
 specifically you were unable to test it and add the following tags:
 kernel-unable-to-test-upstream
 kernel-unable-to-test-upstream-VERSION-NUMBER

 Once testing of the upstream kernel is complete, please mark this bug's
 Status as Confirmed. Please let us know your results. Thank you for your
 understanding.

 ** Changed in: linux (Ubuntu)
 Status: Confirmed => Incomplete




** Bug watch added: Linux Kernel Bug Tracker #59901
   http://bugzilla.kernel.org/show_bug.cgi?id=59901

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1154876

Title:
  3.2.0-38 and earlier systems hang with heavy memory usage

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1154876/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs


[Bug 1154876] Re: 3.2.0-38 and earlier systems hang with heavy memory usage

2013-05-22 Thread Marc Hasson
So, it's been many weeks without any kind of acknowledgement of either my
previous note in this bug from March or the 10.04 variant I filed in
bug #1161202 for the 10.04 base.

Is there any way to get a response of anything further to do on these
matters?  You guys have the scripts/description and dumps, these issues
are reproducible at will on 2 different LTS releases and still cause
ongoing operational issues for us.  The newest upstream kernel we tried,
as reported in March, appears to be an improvement but is still
unacceptable with its many minutes of going mute.   In practical
commerce terms, that's just as severe as permanently hanging from the
user's viewpoint.

Is there anything more I can provide, test, or do?

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1154876

Title:
  3.2.0-38 and earlier systems hang with heavy memory usage

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1154876/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs


[Bug 1154876] Re: 3.2.0-38 and earlier systems hang with heavy memory usage

2013-05-22 Thread Marc Hasson
** Tags added: kernel-bug-exists-upstream

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1154876

Title:
  3.2.0-38 and earlier systems hang with heavy memory usage

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1154876/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs


[Bug 1161202] [NEW] All our Lucid 2.6 and 3.0 kernels hang with heavy memory loads

2013-03-27 Thread Marc Hasson
Public bug reported:

The purpose of this bug is to report/emphasize the severe number of
system hangs, which require power-cycling, on our deployment of servers
running the 10.04LTS (Lucid) release.  The issue here is essentially
identical to that reported for the 3.2 kernels on 12.04LTS in bug
#1154876 at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1154876
almost 2 weeks ago.

The kernel version which fails for us in the field is a 2.6.38-16
Ubuntu kernel.  Though the power-cycling recovery in the field means we
never received a crash dump for analysis, we've been able to reproduce
what appears to be the identical symptoms on an in-house VMware testbed.
The exact same failure as the other bug also occurs in our testbed on
Lucid using the very latest stock 3.0.0-32-generic kernel from the
repository.

See the other bug for details of scripts/loads and details of a kdb
session during the hang.  I didn't reproduce all those attachments for
this bug report since everything for this version of the system would be
similar.   Essentially all processes remain stuck in
__alloc_pages_nodemask and never succeed in allocating memory.  All CPUs
are busy rerunning each process to try again, to no avail.  The OOM
logic is not invoked on the 3.0 kernel while in this hang, even though
plenty of OOMs had occurred in the time leading up to the hang.   In the
2.6.38 kernels it looks essentially the same except that even during the
hang we see the OOM select_bad_process() function continually called but
no OOM candidate is returned, due to a pending one previously selected.
But the end result is identical: continual memory allocation failures,
short sleeps, try again, and the system becomes totally non-responsive
other than for pings.  The serial console and all other CLI or GUI
interfaces go totally dead, with no response.  The only thing one can do
is break in with kdb to investigate, as shown in the other bug.

Before the hangs even occur we will also see very heavy pgscank and
pgscand numbers, as reported by the sar facility.  On our production
machines these can each hit millions of page scans per second and seem
to occur even when there are several gigabytes of available memory.  The
system hangs are invariably immediately preceded by exceptionally high
levels of pgscank and usually pgscand as well.
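The pgscank/s and pgscand/s numbers mentioned come from sysstat's paging
report; a minimal sketch of how such rates can be watched (assumes the
sysstat package; the interval and count are arbitrary):

```shell
# Sample the kernel paging statistics (including pgscank/s and pgscand/s)
# once a second for five samples, if sysstat's sar is available.
if command -v sar >/dev/null 2>&1; then
    sar -B 1 5
else
    echo "sysstat not installed; try: sudo apt-get install sysstat"
fi
```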

We really need a remedy or some kind of workaround for this issue.

Requested system release info:

marc@direct-10-04:~$ lsb_release -rd
Description:Ubuntu 10.04.4 LTS
Release:10.04


Requested package info:

marc@direct-10-04:~$ dpkg -l | fgrep linux-image-3.0.0
ii  linux-image-3.0.0-32-generic3.0.0-32.50~lucid1  
Linux kernel image for version 3.0.0 on x86/

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: linux-image-3.0.0-32-generic 3.0.0-32.50~lucid1
ProcVersionSignature: Ubuntu 3.0.0-32.50~lucid1-generic 3.0.65
Uname: Linux 3.0.0-32-generic x86_64
Architecture: amd64
Date: Wed Mar 27 20:45:58 2013
InstallationMedia: Ubuntu 10.04.3 LTS Lucid Lynx - Release amd64 (20110719.2)
ProcEnviron:
 PATH=(custom, no user)
 LANG=en_US.utf8
 SHELL=/bin/bash
SourcePackage: linux-lts-backport-oneiric

** Affects: linux-lts-backport-oneiric (Ubuntu)
 Importance: Undecided
 Status: New


** Tags: amd64 apport-bug lucid

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1161202

Title:
  All our Lucid 2.6 and 3.0 kernels hang with heavy memory loads

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-lts-backport-oneiric/+bug/1161202/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs


[Bug 1160674] [NEW] ddeb package missing for 3.2.0-31-generic kernel (and 3.2.0-30 too)

2013-03-26 Thread Marc Hasson
Public bug reported:

As the summary says, we are unable to find the
linux-image-3.2.0-31-generic-dbgsym*.ddeb package.  We need this for our
kernel so that we can get more effective crash dumps.  We have many
systems deployed with the 3.2.0-31 kernel and it's not convenient to
upgrade them at this time to the -32 or later kernels which *do* have the
needed ddeb packages available.  We will upgrade at some point, but it
would be more convenient now to just install the missing .ddeb package
which Ubuntu should have made available.

At http://ddebs.ubuntu.com/pool/main/l/linux/ one can see that the
*generic-dbgsym*.ddeb packages are available for pretty much all the
kernels except the very one we happen to need, 3.2.0-31 (and it now
appears -30 is missing too, I *thought* it was there in the last month
or so but I may be wrong about that).

Here is the sequence from -29 thru -32 on the above-mentioned web page,
showing that the -30 and -31 generic-dbgsym ones are missing:

linux-image-3.2.0-29-generic-dbgsym_3.2.0-29.46_amd64.ddeb      27-Jul-2012 18:03  630M
linux-image-3.2.0-29-generic-dbgsym_3.2.0-29.46_i386.ddeb       27-Jul-2012 18:34  637M
linux-image-3.2.0-29-generic-pae-dbgsym_3.2.0-29.46_i386.ddeb   27-Jul-2012 18:49  637M
linux-image-3.2.0-29-highbank-dbgsym_3.2.0-29.46_armhf.ddeb     27-Jul-2012 23:05  19M
linux-image-3.2.0-29-omap-dbgsym_3.2.0-29.46_armel.ddeb         27-Jul-2012 21:26  289M
linux-image-3.2.0-29-omap-dbgsym_3.2.0-29.46_armhf.ddeb         27-Jul-2012 22:53  289M
linux-image-3.2.0-29-virtual-dbgsym_3.2.0-29.46_amd64.ddeb      27-Jul-2012 18:20  629M
linux-image-3.2.0-29-virtual-dbgsym_3.2.0-29.46_i386.ddeb       27-Jul-2012 19:04  637M
linux-image-3.2.0-30-highbank-dbgsym_3.2.0-30.48_armhf.ddeb     24-Aug-2012 23:11  19M
linux-image-3.2.0-30-omap-dbgsym_3.2.0-30.48_armel.ddeb         24-Aug-2012 20:50  289M
linux-image-3.2.0-30-omap-dbgsym_3.2.0-30.48_armhf.ddeb         24-Aug-2012 22:58  289M
linux-image-3.2.0-31-highbank-dbgsym_3.2.0-31.50_armhf.ddeb     07-Sep-2012 22:24  19M
linux-image-3.2.0-31-omap-dbgsym_3.2.0-31.50_armel.ddeb         07-Sep-2012 20:25  289M
linux-image-3.2.0-31-omap-dbgsym_3.2.0-31.50_armhf.ddeb         07-Sep-2012 22:12  290M
linux-image-3.2.0-32-generic-dbgsym_3.2.0-32.51_amd64.ddeb      07-Nov-2012 10:21  630M
linux-image-3.2.0-32-generic-dbgsym_3.2.0-32.51_i386.ddeb       07-Nov-2012 10:23  638M
linux-image-3.2.0-32-generic-pae-dbgsym_3.2.0-32.51_i386.ddeb   07-Nov-2012 10:28  638M
linux-image-3.2.0-32-highbank-dbgsym_3.2.0-32.51_armhf.ddeb     27-Sep-2012 03:39  19M
linux-image-3.2.0-32-omap-dbgsym_3.2.0-32.51_armel.ddeb         27-Sep-2012 01:25  290M
linux-image-3.2.0-32-omap-dbgsym_3.2.0-32.51_armhf.ddeb         27-Sep-2012 03:27  289M
linux-image-3.2.0-32-virtual-dbgsym_3.2.0-32.51_amd64.ddeb      07-Nov-2012 10:32  630M
linux-image-3.2.0-32-virtual-dbgsym_3.2.0-32.51_i386.ddeb       07-Nov-2012 10:43  638M

ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: linux-image-3.2.0-31-generic 3.2.0-31.50
ProcVersionSignature: Ubuntu 3.2.0-31.50-generic 3.2.28
Uname: Linux 3.2.0-31-generic x86_64
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.24.
ApportVersion: 2.0.1-0ubuntu17.1
Architecture: amd64
ArecordDevices:
  List of CAPTURE Hardware Devices 
 card 0: AudioPCI [Ensoniq AudioPCI], device 0: ES1371/1 [ES1371 DAC2/ADC]
   Subdevices: 1/1
   Subdevice #0: subdevice #0
AudioDevicesInUse:
 USERPID ACCESS COMMAND
 /dev/snd/controlC0:  marc   2417 F pulseaudio
CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not 
found.
Card0.Amixer.info:
 Card hw:0 'AudioPCI'/'Ensoniq AudioPCI ENS1371 at 0x20c0, irq 18'
   Mixer name   : 'Cirrus Logic CS4297A rev 3'
   Components   : 'AC97a:43525913'
   Controls  : 24
   Simple ctrls  : 13
CurrentDmesg:
 [   31.156966] init: vmware-tools pre-start process (1153) terminated with 
status 1
 [   31.416257] eth0: no IPv6 routers present
 [   31.711553] eth1: no IPv6 routers present
 [   31.775418] eth2: no IPv6 routers present
Date: Tue Mar 26 18:52:36 2013
HibernationDevice: RESUME=UUID=2342cd45-2970-47d7-bb6d-6801d361cb3e
InstallationMedia: Ubuntu 12.04 LTS Precise Pangolin - Release amd64 
(20120425)
Lsusb:
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 002 Device 002: ID 0e0f:0003 VMware, Inc. Virtual Mouse
 Bus 002 Device 003: ID 0e0f:0002 VMware, Inc. Virtual USB Hub
MachineType: VMware, Inc. VMware Virtual Platform
MarkForUpload: True
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash

[Bug 1160674] Re: ddeb package missing for 3.2.0-31-generic kernel (and 3.2.0-30 too)

2013-03-26 Thread Marc Hasson
** Attachment added: "Requested lspci-vnvn.log"
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1160674/+attachment/3599940/+files/lspci-vnvn.log

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1160674

Title:
  ddeb package missing for 3.2.0-31-generic kernel (and 3.2.0-30 too)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1160674/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs


[Bug 1154876] Re: 3.2.0-38 and earlier systems hang with heavy memory usage

2013-03-25 Thread Marc Hasson
My testing on the 3.9 kernel has been underway since the note above; it's
surpassed 11 days of running the loads from the scripts attached, and
even higher.  The previous 3.2 and 3.5 kernel testing never exceeded 4.5
days before hanging solidly, and usually were less.

So, the 3.9 kernel appears to be considerably more robust at the very
least since I could not cause it to solidly hang as I could in my 2.6
and 3.2/3.5 kernel testbeds.   So it would be good to see 3.9 backported
to Precise for supported usage on our deployed 12.04 systems.  And I
will write another bug for the 2.6 systems that are suffering the most
so that perhaps something can be done there as well.  BUT.

... I could not tag this bug either as "kernel-bug-exists-upstream" or
"kernel-fixed-upstream", because while the solid hang/failure symptom
*is* fixed in the upstream kernel we *still* experienced the same hangs
but of only 5-10 minutes each event through at least the later half of
the 3.9 kernel testing.  I had no way to measure these hangs other than
my own observations at my testing consoles; I had the impression they
occurred a couple of times a day.  I first noticed them a few days into
the test, and can not say for sure whether they were there from the
beginning or not.  5-10 minutes of outage from our servers would look
the same to most network operations folks as a permanently solid hang,
one can't have customers twiddling their thumbs for that long when
engaged in transactions of some kind.  I believe these transient hangs
were also seen in my 3.2/3.5 testing, but I didn't time them since I was
most concerned about the solid hang/failure.  When any of the kernels,
including this 3.9 test, hangs like this I can see that all CPUs are
100% busy and I presume it's the same symptom I've reported earlier about
the constant rescheduling of all processes for another page that I reported
as part of my kdb session attachment.  But I did not break in with kdb
to confirm that in this round of testing, I didn't want to risk
disrupting the longer-term survival testing I was going for primarily.
I can confirm that pings were still responded to during these hangs and
that the serial console remained unresponsive for the 5-10 minutes of
hang.

** Changed in: linux (Ubuntu)
   Status: Incomplete => Confirmed

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1154876

Title:
  3.2.0-38 and earlier systems hang with heavy memory usage

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1154876/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs


[Bug 1154876] Re: 3.2.0-38 and earlier systems hang with heavy memory usage

2013-03-14 Thread Marc Hasson
Sure Joseph, in progress.  I have the 3.9 kernel you referenced now
running my tests on my 12.04 system.  Its so far behaving normally, it
will likely take a few days to know whether there is any difference as
far as the hang is concerned.

Just for the record, I had previously tested with
linux-image-3.5.0-21-generic_3.5.0-21.32~precise1_amd64.deb, and the hang
failure could still be seen with that kernel.  I had not checked my
records when I submitted this bug, so had forgotten.  I could possibly
have entitled this bug as 3.5.0-21 or earlier fail, but was focused on
using one of the regularly distributed kernels to test/reproduce the
failure for you folks.

I had also tested with: linux-image-3.8.0-0-generic_3.8.0-0.3_amd64.deb.
On that system the hang did not occur BUT for some reason it also
appeared to be the case that my loading tests were not pushing the
system as hard either.  So I figured some mismatch between that kernel
and precise was the cause and that this 3.8 test was inconclusive.

Your 3.9 kernel seems to be allowing my tests to allocate as much memory
and inflict as many memory overloading events (OOM killer) as the 3.5
and 3.2 kernels, so this test looks like we will be able to gather a
datapoint on the issue, either way.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1154876

Title:
  3.2.0-38 and earlier systems hang with heavy memory usage

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1154876/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs


[Bug 1154876] [NEW] 3.2.0-38 and earlier systems hang with heavy memory usage

2013-03-13 Thread Marc Hasson
Public bug reported:

Background

We've been experiencing mysterious hangs on our 2.6.38-16 Ubuntu 10.04
systems in the field.  The systems have large amounts of memory and disk,
along with up to a couple dozen CPU threads.  Our operations folks have
to power-cycle the machines to recover them; they do not panic.  Our use
of "hang" means the system will no longer respond to any current shell
prompts, will not accept new logins, and may not even respond to pings.
It appears totally dead.

Using log files and the sar utility from the sysstat package we
gradually put together the following clues to the hangs:

  Numerous "INFO: task task-name:pid blocked for more than 120 seconds" messages
  High CPU usage suddenly on all CPUs heading into the hang, 92% or higher
  Very high kswapd page scan rates (pgscank/s) - up to 7 million per second
  Very high direct page scan rates (pgscand/s) - up to 3 million per second

In addition to noting the above events just before the hangs, we have
some evidence that the high kswapd scans occur at other times for no
seemingly obvious reason, such as when there is a significant (25%) amount
of kbmemfree.  Also, we've seen cases where there are application errors
related to a system's responsiveness and that has sometimes correlated
with either high pgscank/s or pgscand/s that lasts for some number of
sar records before the system returns to normal running.  The peaks of
these transients aren't usually as high as those we see leading to a
solid system hang/failure.  And sometimes these are not transients,
but last for hours with no apparent event related to the starting or
stopping of this behavior!

So we decided to see if we can reproduce these symptoms on a VMware
testbed that we could easily examine with kdb and snapshot/dump.
Through a combination of tar, find, and cat commands launched from
a shell script we could recreate a system hang on both our 2.6.38-16
systems as well as the various flavors of the 3.2 kernels, with the
one crashdump'ed here being the latest 3.2.0-38 at the time of testing.
The sar utility on our 2.6 testing confirmed similar behavior of the
CPUs, kswapd scans, and direct scans leading up to the testbed hangs as
to what we see in the field failures of our servers.

Details on the shell scripts can be found in the file referenced below.
It's important to read the information below on how the crash dump was
taken before investigating it.  Reproduction on a 2-CPU VM took 1.5-4
days for a 3.2 kernel, and usually considerably less for a 2.6 kernel.
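A hedged sketch of the kind of filesystem-walking load generator described
(the real scripts are attached to the bug; this is illustrative only, and
shown as a dry run because actually running it is designed to exhaust
memory):

```shell
# Illustrative only -- the actual reproduction scripts are in the bug
# attachments.  Shown as a dry run; remove the echo to launch the load
# for real, and expect the box to be driven deep into page reclaim.
for i in 1 2 3 4; do
    echo "tar cf - /usr 2>/dev/null > /dev/null &"
    echo "find / -xdev -type f -exec cat {} + > /dev/null 2>&1 &"
done
```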


Hang/crashdump details:

In the crashdump the crash "dmesg" command will also show Call Traces that
occurred *after* kdb investigations started.  It's important to note the
kernel timestamp that indicates the start of those kdb actions and only
examine output prior to that for clues as to the hang proper:

[160512.756748] SysRq : DEBUG
[164936.052464] psmouse serio1: Wheel Mouse at isa0060/serio1/input0 lost synchronization, throwing 2 bytes away.
[164943.764441] psmouse serio1: resync failed, issuing reconnect request
[165296.817468] SysRq : DEBUG

Everything previous to the above dmesg output occurs prior to (or during)
the full system hang.  The kdb session started over 12 hours after the
hang, the system was totally non-responsive at either its serial console
or GUI.  Did not try a ping in this instance.

The kdb actions taken may be seen in an actual log of that session,
recorded in console_kdb_session.txt.  It shows where these 3.2 kernels
are spending their time when hung in our testbed (spinning in
__alloc_pages_slowpath: failing an allocation, sleeping, retrying).
We see the same behavior in the 2.6 kernels/tests as well, except for
one difference described below.  In the 3.2 dump included here all our
script/load processes, as well as system processes, are constantly failing
to allocate a page, sleeping briefly, and trying again.  This occurs
across all CPUs (2 CPUs in this system/dump), which fits with what we
believe we see in our field machines for the 2.6 kernels.
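On a machine that is still (barely) responsive, a quick way to
corroborate this picture without kdb is to tally every task's kernel
wait channel; a wedged allocator shows up as most tasks parked in
reclaim/allocation symbols.  A sketch (ps option spellings assume
procps-ng):

```shell
#!/bin/sh
# Tally the kernel wait channel of every task; on a system stuck in
# the allocator, reclaim/allocation symbols dominate this table.
summary=$(ps -eo wchan:40 --no-headers | sort | uniq -c | sort -rn | head)
printf '%s\n' "$summary"
```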

For the 2.6 kernels the only difference we see is that there is typically
a call to the __alloc_pages_may_oom function, which in turn selects a
process to kill; but at the hang there is already a process being killed
by the OOM killer, so no additional ones are selected.  And we deadlock,
just as the comment in oom_kill.c's select_bad_process() says.  In the
3.2 kernels we are now moving our systems to, we see in our testbed hang
that the code does not go down the __alloc_pages_may_oom path.  Yet from
the logs we include, and the dmesg within crash, one can see that prior
to the hang OOM killing is invoked frequently.  The key seems to be a
difference in the did_some_progress value returned when we are very
low on memory; it is always 1 in the 3.2 kernels on our testbed.
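To make the suspected logic concrete, here is a toy model (plain shell,
not kernel code) of the decision the slowpath appears to make: as long
as reclaim reports did_some_progress as nonzero, the allocator just
sleeps and retries, and the __alloc_pages_may_oom path is never
reached, so an allocation that can never actually succeed loops
forever.  The loop is capped here purely so the sketch halts:

```shell
#!/bin/sh
# Toy model of the 3.2 slowpath decision described above: with
# did_some_progress always 1, the allocator retries endlessly and
# never falls through to the OOM-kill path.  Capped at 5 iterations
# so this sketch terminates; the real loop has no such cap.
did_some_progress=1
allocation_succeeded=0
attempts=0
while [ "$allocation_succeeded" -eq 0 ] && [ "$attempts" -lt 5 ]; do
    if [ "$did_some_progress" -ne 0 ]; then
        # kernel analogue: stall briefly, then retry the allocation
        attempts=$((attempts + 1))
    else
        echo "would invoke __alloc_pages_may_oom (the 2.6 behavior)"
        break
    fi
done
echo "retried $attempts times without reaching the OOM killer"
```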

Though the kernel used here is 3.2.0-38-generic, we have also caused this
to occur with earlier 3.2 Ubuntu kernels.  We have also reproduced the
failures with 2.6.38-8, 2.6.38-16, and 3.0 Ubuntu kernels.

[Bug 1154876] Re: 3.2.0-38 and earlier systems hang with heavy memory usage

2013-03-13 Thread Marc Hasson
** Attachment added: boot up messages until standard running state of OOMs spew out
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1154876/+attachment/3573090/+files/console_boot_output.txt

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1154876

Title:
  3.2.0-38 and earlier systems hang with heavy memory usage

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1154876/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs


[Bug 1154876] Re: 3.2.0-38 and earlier systems hang with heavy memory usage

2013-03-13 Thread Marc Hasson
** Attachment added: dmesg file from boot, mostly duplicates start of console_boot_output.txt
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1154876/+attachment/3573109/+files/dmesg_of_boot.txt


[Bug 1154876] Re: 3.2.0-38 and earlier systems hang with heavy memory usage

2013-03-13 Thread Marc Hasson
** Attachment added: last messages on serial console when system hung
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1154876/+attachment/3573110/+files/console_last_output.txt


[Bug 1154876] Re: 3.2.0-38 and earlier systems hang with heavy memory usage

2013-03-13 Thread Marc Hasson
** Attachment added: kdb session demo'ing where system is spinning
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1154876/+attachment/3573111/+files/console_kdb_session.txt


[Bug 1154876] Re: 3.2.0-38 and earlier systems hang with heavy memory usage

2013-03-13 Thread Marc Hasson
** Attachment added: Machine environment and script/data used in our testbed
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1154876/+attachment/3573112/+files/reproduction_info.txt


[Bug 1154876] Re: 3.2.0-38 and earlier systems hang with heavy memory usage

2013-03-13 Thread Marc Hasson
** Attachment added: Requested version.log
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1154876/+attachment/3573113/+files/version.log


[Bug 1154876] Re: 3.2.0-38 and earlier systems hang with heavy memory usage

2013-03-13 Thread Marc Hasson
** Attachment added: Requested lspci-vnvn.log
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1154876/+attachment/3573114/+files/lspci-vnvn.log
