Re: [Bug 1469214] Re: HP ProLiant m400 Server crashes with unhandled level 3 translation fault

2015-07-13 Thread dann frazier
On Mon, Jul 13, 2015 at 9:27 AM, Ming Lei 1469...@bugs.launchpad.net wrote:
 Dann,

 Please follow the steps in #12, with which you should be able to trigger
 the crash within 4 minutes.

I've been running that in a loop and I'm currently on iteration #76
w/o a crash :(
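As a rough sanity check, assume each 4-minute run of those steps independently triggers the fault with probability p. Ming's own estimate elsewhere in this thread is roughly 3/4 per run, with his debug patch applied. Under that assumption the number of runs until the first fault is geometric, and 76 clean iterations would be extremely unlikely:

```latex
P(\text{no fault in } n \text{ runs}) = (1-p)^{n}, \qquad
\left(1 - \tfrac{3}{4}\right)^{76} = 4^{-76} = 2^{-152} \approx 1.75 \times 10^{-46}

\mathbb{E}[\text{runs until first fault}] = \frac{1}{p} = \frac{4}{3} \approx 1.33
\;\text{runs} \;(\approx 5.3 \text{ minutes at 4 minutes per run})
```

So if that estimate holds, a streak like this more likely means the two setups differ in some way (for example whether the stress-ng debug patch is applied, or root vs. non-root) than plain bad luck.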

Maybe it's
Linux ms10-33-mcdivittB0 3.19.0-22-generic #22-Ubuntu SMP Tue Jun 16
17:18:17 UTC 2015 aarch64 aarch64 aarch64 GNU/Linux

 BTW, it looks like the wily kernel can't boot to a shell prompt on mcdivitt.

OK - mind filing a separate bug for that?

-- 
You received this bug notification because you are a member of Ubuntu
Server Team, which is subscribed to irqbalance in Ubuntu.
https://bugs.launchpad.net/bugs/1469214

Title:
  HP ProLiant m400 Server crashes with unhandled level 3 translation
  fault

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/irqbalance/+bug/1469214/+subscriptions

-- 
Ubuntu-server-bugs mailing list
Ubuntu-server-bugs@lists.ubuntu.com
Modify settings or unsubscribe at: 
https://lists.ubuntu.com/mailman/listinfo/ubuntu-server-bugs



Re: [Bug 1469214] Re: HP ProLiant m400 Server crashes with unhandled level 3 translation fault

2015-07-09 Thread dann frazier
On Tue, Jul 7, 2015 at 2:25 AM, Ming Lei 1469...@bugs.launchpad.net wrote:
 On Tue, Jul 7, 2015 at 11:16 AM, Ming Lei ming@canonical.com wrote:
 It looks like there are two kinds of translation faults from irqbalance:

 1) one happened in place_irq_in_node(), which can be reproduced with the vivid package

 2) the second one happened in glib2, with an irqbalance I built myself:
 irqbalance can use its own local copy of glib if glib2 isn't available,
 and glib2 does exist on the server where I build irqbalance.


 Both of the above reports can be fixed by the following irqbalance commit:

 NUMA is not available fix

 https://github.com/Irqbalance/irqbalance/commit/a3c812eb6cd627cd3fae45b8345538558b86973c

 It looks like stress-ng can find not only kernel bugs but also userspace
 issues :-)

I was looking to upload a fix for wily, but I haven't been able to
reproduce it in order to verify the fix. I ran 'stress-ng --seq 0
-t 60 --syslog --metrics --times -v' overnight in a loop, but
irqbalance never crashed. How long should I expect this to take on
average? Does it usually crash in a single run?
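One way to answer "how long" empirically is to stop at the first crash and count the runs. A minimal POSIX sh sketch, assuming that a crash shows up as irqbalance's PID changing or disappearing; run_until_fault, record_irqbalance_pid and irqbalance_pid_changed are hypothetical helper names, and the stress-ng arguments are the ones quoted above.

```sh
#!/bin/sh
# Hypothetical helper: run CMD... up to MAX times, stopping as soon as
# CHECK (a command or shell function) succeeds; prints the run count.
run_until_fault() {
  local max="$1" check="$2" i=1
  shift 2
  while [ "$i" -le "$max" ]; do
    "$@"                          # one reproduction attempt
    if "$check"; then
      echo "fault after $i run(s)"
      return 0
    fi
    i=$((i + 1))
  done
  echo "no fault in $max run(s)"
  return 1
}

# Assumed crash signal: irqbalance's PID differs from the one recorded
# before the loop (it died, or something restarted it under a new PID).
record_irqbalance_pid() {
  IRQBALANCE_PID=$(pidof irqbalance)
}
irqbalance_pid_changed() {
  [ "$(pidof irqbalance)" != "$IRQBALANCE_PID" ]
}

# Example (not executed here):
#   record_irqbalance_pid
#   run_until_fault 100 irqbalance_pid_changed \
#     ./stress-ng --seq 0 -t 60 --syslog --metrics --times -v
```

Checking the PID rather than grepping the kernel log avoids counting a fault that was already logged before the loop started.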


Re: [Bug 1469214] Re: HP ProLiant m400 Server crashes with unhandled level 3 translation fault

2015-07-07 Thread Ming Lei
On Tue, Jul 7, 2015 at 11:16 AM, Ming Lei ming@canonical.com wrote:
 It looks like there are two kinds of translation faults from irqbalance:

 1) one happened in place_irq_in_node(), which can be reproduced with the vivid package

 2) the second one happened in glib2, with an irqbalance I built myself:
 irqbalance can use its own local copy of glib if glib2 isn't available,
 and glib2 does exist on the server where I build irqbalance.


Both of the above reports can be fixed by the following irqbalance commit:

NUMA is not available fix

https://github.com/Irqbalance/irqbalance/commit/a3c812eb6cd627cd3fae45b8345538558b86973c

It looks like stress-ng can find not only kernel bugs but also userspace
issues :-)

Thanks,
Ming



Re: [Bug 1469214] Re: HP ProLiant m400 Server crashes with unhandled level 3 translation fault

2015-07-06 Thread Ming Lei
On Tue, Jul 7, 2015 at 2:37 AM, Colin Ian King
1469...@bugs.launchpad.net wrote:
 captured irqbalance segfaulting:

 Program received signal SIGSEGV, Segmentation fault.
 0x00408f8c in place_irq_in_node (info=0x2c3d0050, data=0x0) at placement.c:145
 145 if (irq_numa_node(info)->number != -1) {
 (gdb) where
 #0  0x00408f8c in place_irq_in_node (info=0x2c3d0050, data=0x0) at placement.c:145
 #1  0x00405154 in for_each_irq (list=0x2c3df660, cb=0x408f4c <place_irq_in_node>, data=0x0)
     at classify.c:508
 #2  0x0040923c in calculate_placement () at placement.c:196
 #3  0x00407800 in main (argc=2, argv=0x7fcd014928) at irqbalance.c:372

 (gdb) print info
 $1 = (struct irq_info *) 0x2c3d0050

Supposing info is an address in the heap, it is valid, so the segfault
should be caused by an invalid info->numa_node.

Thanks


 --
 You received this bug notification because you are subscribed to linux
 in Ubuntu.
 https://bugs.launchpad.net/bugs/1469214

 Title:
   HP ProLiant m400 Server crashes with unhandled level 3 translation
   fault

 Status in linux package in Ubuntu:
   Triaged

 Bug description:
   Running stress-ng on a HP ProLiant m400 server can cause unhandled
   level 3 translation faults:

   use stress-ng from git://kernel.ubuntu.com/cking/stress-ng

   ./stress-ng --seq 0 -t 60 -v

   and after some time this trips the following:

   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922560] systemd-timesyn[481]: unhandled level 3 translation fault (7) at 0x7fa8ea6008, esr 0x9207
   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922561] pgd = ffcfb563f000
   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922563] [7fa8ea6008] *pgd=004fb4f28003, *pud=004fb4f28003, *pmd=004fb4f38003, *pte=1d151c00
   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922566]
   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922569] CPU: 6 PID: 481 Comm: systemd-timesyn Not tainted 3.19.0-21-generic #21-Ubuntu
   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922571] Hardware name: HP ProLiant m400 Server Cartridge (DT)
   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922573] task: ffcfb4e3b100 ti: ffcfb4d2c000 task.ti: ffcfb4d2c000
   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922588] PC is at 0x7fa8d81824
   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922589] LR is at 0x7fa8e3b3e4
   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922590] pc : [007fa8d81824] lr : [007fa8e3b3e4] pstate: 8000
   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922591] sp : 007ff120d660
   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922592] x29: 007ff120d660 x28: 007fa8f1c000
   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922594] x27: 007fa8f32084 x26: 007fa8f32000
   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922595] x25: 007fa8f1d788 x24: 007fa8f1d888
   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922597] x23: 0001 x22: 007fa8f1faa0
   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922599] x21: 007ff120d7f0 x20: 007ff120d7d0
   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922600] x19: 007fa8f31000 x18: 007fa8f1e000
   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922602] x17: 007fa8e3b3b8 x16: 007fa8ea6000
   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922603] x15: 003b9aca x14: 00219bbdd000
   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922605] x13: aa751223 x12:
   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922607] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f
   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922609] x9 : 37333c43484f5e46 x8 : 007ff120d818
   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922610] x7 : 007ff120d8f0 x6 : 007ff120d828
   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922612] x5 : ff80ffd0 x4 : 007ff120d8c0
   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922613] x3 : 007ff120d7d0 x2 : 007fa8f1faa0
   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922615] x1 : 0001 x0 : 0064
   Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922616]

 To manage notifications about this bug go to:
 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1469214/+subscriptions



Re: [Bug 1469214] Re: HP ProLiant m400 Server crashes with unhandled level 3 translation fault

2015-07-06 Thread Ming Lei
It looks like there are two kinds of translation faults from irqbalance:

1) one happened in place_irq_in_node(), which can be reproduced with the vivid package

2) the second one happened in glib2, with an irqbalance I built myself:
irqbalance can use its own local copy of glib if glib2 isn't available,
and glib2 does exist on the server where I build irqbalance.



Re: [Bug 1469214] Re: HP ProLiant m400 Server crashes with unhandled level 3 translation fault

2015-07-06 Thread Ming Lei
On Mon, Jul 6, 2015 at 9:28 PM, Colin Ian King
1469...@bugs.launchpad.net wrote:
 I re-ran this today with the following script as a non-root user:

 #!/bin/bash
 # stressors to run one after another; quoted so the list is a single string
 tests="affinity aio bigheap brk bsearch cache chdir chmod clock context cpu
 crypt dentry dir dup epoll eventfd fstat fallocate fault fifo flock fork
 futex get getrandom hdd hsearch inotify io itimer kcmp kill lease link lockf
 longjmp lsearch malloc matrix memcpy memfd mincore mlock mmap mmapmany mremap
 msg mq nice null open pipe poll procfs pthread qsort readahead rename rlimit
 seek sem sem-sysv sendfile shm-sysv sigfd sigfpe sigq sigsegv sock splice
 stack str switch symlink sysinfo sysfs tee timer timerfd tsearch udp
 udp-flood urandom utime vecmath vfork vm vm-rw vm-splice wcs wait yield xattr
 zero zombie"

 for t in $tests
 do
 echo $t
 echo $t | sudo tee /dev/kmsg
 ./stress-ng --$t 0 -v -t 60
 done

 and hit this issue:

 [14098.848615] urandom
 [14111.696335] irqbalance[828]: unhandled level 2 translation fault (11) at 0x4f64, esr 0x9206
 [14111.696341] pgd = ffcfef71b000
 [14111.737149] [4f64] *pgd=004fef1f3003, *pud=004fef1f3003, *pmd=


As I suggested, it would help to provide /proc/$(pidof
irqbalance)/maps; otherwise we can't tell where the faulting address
and the PC are.

Finally I have figured out one simple way to reproduce the issue:

1) apply the attached debug patch to stress-ng

2) run the following script:

sudo cat /proc/$(pidof irqbalance)/maps
/home/ubuntu/git/stress-ng/stress-ng --sequential 0 --seq-start 80 \
    --seq-end 84 -t 60 --syslog --metrics --times -v

The command above runs just the following 4 stressors, taking 4 minutes in total:

stress-ng: info:  [1067] dispatching hogs: 8 tsearch, 8 udp, 8 udp-flood, 8 urandom

3) the above may trigger faults in irqbalance with roughly 3/4
probability. The faulting address is in the heap and the PC points into
the code of libglib-2.0.so, so could this be a use-after-free in
irqbalance or libglib? Nothing shows that it is related to the kernel,
and the four stressors are quite simple and shouldn't cause the kernel
any trouble.


# irqbalance memory maps
0040-0040a000 r-xp  08:02 10496929 /usr/sbin/irqbalance
00419000-0041a000 r-xp 9000 08:02 10496929 /usr/sbin/irqbalance
0041a000-0041b000 rwxp a000 08:02 10496929 /usr/sbin/irqbalance
16294000-162b5000 rwxp  00:00 0  [heap]
162b5000-162ce000 rwxp  00:00 0  [heap]
7f8fbf9000-7f8fbfb000 rwxp  00:00 0
7f8fbfb000-7f8fc11000 r-xp  08:02 4722034 /lib/aarch64-linux-gnu/libpthread-2.21.so
7f8fc11000-7f8fc2 ---p 00016000 08:02 4722034 /lib/aarch64-linux-gnu/libpthread-2.21.so
7f8fc2-7f8fc21000 r-xp 00015000 08:02 4722034 /lib/aarch64-linux-gnu/libpthread-2.21.so
7f8fc21000-7f8fc22000 rwxp 00016000 08:02 4722034 /lib/aarch64-linux-gnu/libpthread-2.21.so
7f8fc22000-7f8fc26000 rwxp  00:00 0
7f8fc26000-7f8fc7f000 r-xp  08:02 4718668 /lib/aarch64-linux-gnu/libpcre.so.3.13.1
7f8fc7f000-7f8fc8f000 ---p 00059000 08:02 4718668 /lib/aarch64-linux-gnu/libpcre.so.3.13.1
7f8fc8f000-7f8fc9 r-xp 00059000 08:02 4718668 /lib/aarch64-linux-gnu/libpcre.so.3.13.1
7f8fc9-7f8fc91000 rwxp 0005a000 08:02 4718668 /lib/aarch64-linux-gnu/libpcre.so.3.13.1
7f8fc91000-7f8fdc1000 r-xp  08:02 4722027 /lib/aarch64-linux-gnu/libc-2.21.so
7f8fdc1000-7f8fdd ---p 0013 08:02 4722027 /lib/aarch64-linux-gnu/libc-2.21.so
7f8fdd-7f8fdd4000 r-xp 0012f000 08:02 4722027 /lib/aarch64-linux-gnu/libc-2.21.so
7f8fdd4000-7f8fdd6000 rwxp 00133000 08:02 4722027 /lib/aarch64-linux-gnu/libc-2.21.so
7f8fdd6000-7f8fdda000 rwxp  00:00 0
7f8fdda000-7f8fde3000 r-xp  08:02 10885206 /usr/lib/aarch64-linux-gnu/libnuma.so.1.0.0
7f8fde3000-7f8fdf2000 ---p 9000 08:02 10885206 /usr/lib/aarch64-linux-gnu/libnuma.so.1.0.0
7f8fdf2000-7f8fdf3000 r-xp 8000 08:02 10885206 /usr/lib/aarch64-linux-gnu/libnuma.so.1.0.0
7f8fdf3000-7f8fdf4000 rwxp 9000 08:02 10885206 /usr/lib/aarch64-linux-gnu/libnuma.so.1.0.0
7f8fdf4000-7f8fdf8000 rwxp  00:00 0
7f8fdf8000-7f8fe89000 r-xp  08:02 4722041 /lib/aarch64-linux-gnu/libm-2.21.so
7f8fe89000-7f8fe98000 ---p 00091000 08:02 4722041 /lib/aarch64-linux-gnu/libm-2.21.so
7f8fe98000-7f8fe99000 r-xp 0009 08:02 4722041 /lib/aarch64-linux-gnu/libm-2.21.so
7f8fe99000-7f8fe9a000 rwxp 00091000 08:02 4722041 /lib/aarch64-linux-gnu/libm-2.21.so
7f8fe9a000-7f8ff8c000 r-xp  08:02 4718610 /lib/aarch64-linux-gnu/libglib-2.0.so.0.4400.1
7f8ff8c000-7f8ff9c000 ---p 000f2000 08:02 4718610 /lib/aarch64-linux-gnu/libglib-2.0.so.0.4400.1
7f8ff9c000-7f8ff9d000 r-xp 000f2000 08:02 4718610 /lib/aarch64-linux-gnu/libglib-2.0.so.0.4400.1
7f8ff9d000-7f8ff9e000 rwxp 000f3000 08:02 4718610 /lib/aarch64-linux-gnu/libglib-2.0.so.0.4400.1
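Locating a faulting address or PC in a maps dump like this is a range lookup over the first column. A minimal POSIX sh sketch, assuming an intact /proc/PID/maps file (several addresses in the dump above look truncated and would not parse as-is); find_mapping is a hypothetical helper name.

```sh
#!/bin/sh
# Hypothetical helper: read a /proc/PID/maps dump on stdin and print the
# mapping that contains ADDR (hex, with or without 0x), its permissions,
# and ADDR's offset from the start of that mapping.
find_mapping() {
  local addr=$((0x${1#0x})) found=1 range perms offset dev inode path start end
  while read -r range perms offset dev inode path; do
    case $range in
      *-*) ;;            # expected "start-end" first column
      *) continue ;;     # skip blank or malformed lines
    esac
    start=$((0x${range%-*}))
    end=$((0x${range#*-}))
    if [ $((addr >= start && addr < end)) -eq 1 ]; then
      printf '%s %s +0x%x\n' "${path:-[anonymous]}" "$perms" $((addr - start))
      found=0
    fi
  done
  return "$found"
}
```

For example, `sudo cat /proc/$(pidof irqbalance)/maps | find_mapping 0x406078`. For a file-backed mapping, adding the third column (the file offset) to the printed offset gives the offset into the mapped file.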

Re: [Bug 1469214] Re: HP ProLiant m400 Server crashes with unhandled level 3 translation fault

2015-07-03 Thread Ming Lei
Hi Colin,

On Sat, Jul 4, 2015 at 12:43 AM, Colin Ian King
1469...@bugs.launchpad.net wrote:
 I was able to hit the following translation fault running
 sudo ./stress-ng --seq 0 -t 60 --syslog --metrics --times -v

I suggest not running stress-ng as root; otherwise the problems found
may be less serious, because:

  - the root user can easily do bad things, and it is easy to kill any process
  - in reality, most workloads are run as non-root

If system processes (irqbalance, systemd-*) are only killed because
stress-ng is running as root, it may be a low-priority issue.
Otherwise we need to pay close attention to it.

I always run 'stress-ng' as the ubuntu user without sudo, which may
be why it is difficult for me to reproduce this.

Even with the two new approaches, it is still not easy for me to
reproduce this. I have seen the translation fault only once in 6 hours
with your first approach (./stress-ng --seq 0 ...), and I can't trigger
it with your second approach (the bash script).

The log[1] I triggered follows, and I think it is very likely a userspace
issue. Using the irqbalance-dbgsym package, we can easily see that 'PC is at
0x406078' is an address in the text section, and it should be inside
the function 'place_irq_in_node' because the executable isn't built as
position-independent (relocatable) code. One thing I still can't understand
is why the fault address is '0x0040' in this context.
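This reasoning relies on irqbalance being linked at a fixed address rather than as a position-independent executable, so that a PC such as 0x406078 can be looked up directly in its symbols (for instance with gdb's `info symbol 0x406078` once the debug symbols are installed). The ELF header's e_type field records this: EXEC for a fixed-address executable, DYN for a PIE or a shared library; `readelf -h` shows it as "Type:". A minimal POSIX sh sketch that reads the field directly, assuming a little-endian ELF file such as an aarch64 Ubuntu binary; elf_type is a hypothetical helper name.

```sh
#!/bin/sh
# Hypothetical helper: print the ELF e_type of FILE as EXEC, DYN or other,
# or "not-elf" if the ELF magic is missing. Assumes little-endian ELF.
elf_type() {
  local f="$1" magic
  magic=$(od -An -t x1 -N 4 "$f" | tr -d ' \n')
  if [ "$magic" != "7f454c46" ]; then    # "\177ELF"
    echo "not-elf"
    return 1
  fi
  # e_type is the 16-bit field at byte offset 16 of the ELF header
  set -- $(od -An -t u1 -j 16 -N 2 "$f")
  case $(( ${1:-0} + 256 * ${2:-0} )) in
    2) echo "EXEC" ;;    # ET_EXEC: fixed load address, not PIE
    3) echo "DYN" ;;     # ET_DYN: PIE executable or shared library
    *) echo "other" ;;
  esac
}
```

If the assumption above holds, `elf_type /usr/sbin/irqbalance` would be expected to print EXEC.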


[1]
[ 3616.92] Bits 55-60 of /proc/PID/pagemap entries are about to stop being page-shift some time soon. See the linux/Documentation/vm/pagemap.txt for details.
[ 3616.93] Bits 55-60 of /proc/PID/pagemap entries are about to stop being page-shift some time soon. See the linux/Documentation/vm/pagemap.txt for details.
[ 5316.367265] irqbalance[1457]: unhandled level 2 translation fault (11) at 0x0040, esr 0x9206
[ 5316.476937] pgd = ffcfb5478000
[ 5316.520692] [0040] *pgd=004fb4a3c003, *pud=004fb4a3c003, *pmd=
[ 5316.620270]
[ 5316.638140] CPU: 7 PID: 1457 Comm: irqbalance Not tain-21-generic #21-Ubuntu
[ 5316.733212] Hardware name: HP ProLiant m400 Server Cartridge (DT)
[ 5316.806382] task: ffcfb55e6e40 ti: ffcfa72b task.ti: ffcfa72b
[ 5316.896258] PC is at 0x406078
[ 5316.931865] LR is at 0x404100
[ 5316.967457] pc : [00406078] lr : [00404100] pstate: 2000
[ 5317.056268] sp : 007fc07ff2d0
[ 5317.096038] x29: 007fc07ff2d0 x28: 004095a0
[ 5317.160023] x27: 00409548 x26: 0041a000
[ 5317.223897] x25: 00405000 x24: 0041acf8
[ 5317.287868] x23: 0041a000 x22: 0041a000
[ 5317.351841] x21: 2e0d6050 x20: 0041a000
[ 5317.415744] x19: 2e0e9020 x18: 
[ 5317.479620] x17: 007fb5ac287c x16: 0041a188
[ 5317.543490] x15: 003bdd2370f74a1c x14: 2030203020302030
[ 5317.607373] x13: 2030203020302030 x12: 2030203020302030
[ 5317.671263] x11: 2030203020302030 x10: 2030203020302030
[ 5317.735137] x9 : 00a0 x8 : 0001
[ 5317.799113] x7 : 0033 x6 : 2e0d6e08
[ 5317.862983] x5 : 0040 x4 : 
[ 5317.926867] x3 : 2e0d7008 x2 : 
[ 5317.990840] x1 : 002c x0 : 0003
[ 5318.054713]



Re: [Bug 1469214] Re: HP ProLiant m400 Server crashes with unhandled level 3 translation fault

2015-07-03 Thread Ming Lei
Hi Colin,

That looks like progress, but it still takes time to reproduce, and I
will use your new approach to reproduce it.

While you are doing that, could you dump /proc/$(pidof
irqbalance)/maps so that we can see where the faulting addresses are
in the process's VM space?

thanks,


On Sat, Jul 4, 2015 at 4:10 AM, Colin Ian King
1469...@bugs.launchpad.net wrote:
 Running the following:

 #!/bin/bash
 # stressors to run one after another; quoted so the list is a single string
 tests="affinity aio bigheap brk bsearch cache chdir chmod clock context cpu
 crypt dentry dir dup epoll eventfd fstat fallocate fault fifo flock fork
 futex get getrandom hdd hsearch inotify io itimer kcmp kill lease link lockf
 longjmp lsearch malloc matrix memcpy memfd mincore mlock mmap mmapmany mremap
 msg mq nice null open pipe poll procfs pthread qsort readahead rename rlimit
 seek sem sem-sysv sendfile shm-sysv sigfd sigfpe sigq sigsegv sock splice
 stack str switch symlink sysinfo sysfs tee timer timerfd tsearch udp
 udp-flood urandom utime vecmath vfork vm vm-rw vm-splice wcs wait yield xattr
 zero zombie"

 for t in $tests
 do
 echo $t
 echo $t > /dev/kmsg
 ./stress-ng --$t 0 -v -t 60
 done

 eventually tripped the translation fault in irqbalance.  I ran this
 after a clean reboot.

 [ 4901.799846] timerfd
 [ 4961.807050] tsearch
 [ 5021.884456] udp
 [ 5081.895058] udp-flood
 [ 5141.674365] irqbalance[827]: unhandled level 2 translation fault (11) at 0x002d6da4, esr 0x9206
 [ 5141.674376] pgd = ffcfb51a
 [ 5141.715215] [002d6da4] *pgd=004fb677e003, *pud=004fb677e003, *pmd=

 [ 5141.816183] CPU: 0 PID: 827 Comm: irqbalance Not tainted 3.19.0-21-generic #21-Ubuntu
 [ 5141.816185] Hardware name: HP ProLiant m400 Server Cartridge (DT)
 [ 5141.816188] task: ffcfac088000 ti: ffcfab71 task.ti: ffcfab71
 [ 5141.816206] PC is at 0x7f88287834
 [ 5141.816208] LR is at 0x7f882877f4
 [ 5141.816210] pc : [007f88287834] lr : [007f882877f4] pstate: 8000
 [ 5141.816212] sp : 007ff2e46b30
 [ 5141.816214] x29: 007ff2e46b30 x28: 004095a0
 [ 5141.816217] x27: 00409548 x26: 0041a000
 [ 5141.816220] x25: 0001 x24: 0010
 [ 5141.816222] x23: 2d6c98a0 x22: 2d6c9880
 [ 5141.816225] x21: 0018 x20: 007f88323000
 [ 5141.816228] x19: 0002 x18: 
 [ 5141.816230] x17: 007f87f8d8ec x16: 007f883222e0
 [ 5141.816233] x15: 0020 x14: 0001
 [ 5141.816235] x13:  x12: 
 [ 5141.816237] x11: 007ff2e446a0 x10: 0010
 [ 5141.816240] x9 : 00a0 x8 : 0007
 [ 5141.816242] x7 : 0033 x6 : 2d6c9c80
 [ 5141.816245] x5 : 0001 x4 : 007f87fa62a0
 [ 5141.816247] x3 : 2d6c9880 x2 : 0001
 [ 5141.816250] x1 : 03fa x0 : 002d6d9c

 [ 5141.907792] urandom
 [ 5201.928712] utime
 [ 5261.934534] vecmath
 [ 5321.940302] vfork
 [ 5381.947904] vm
 [ 5441.991784] vm-rw
 [ 5502.017614] vm-splice
 [ 5562.023334] wcs
 [ 5622.037054] wait
 [ 5682.043302] yield
 [ 5742.056595] xattr
 [ 5802.075772] zero
 [ 5862.087396] zombie
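Because the script echoes each stressor's name into /dev/kmsg before running it, the kernel log itself records which stressor had most recently started when irqbalance faulted. A minimal POSIX sh sketch, assuming dmesg-style lines (a bracketed timestamp at the start of each line) and the marker lines produced by the script above; last_stressor_before_fault is a hypothetical helper name.

```sh
#!/bin/sh
# Hypothetical helper: read a kernel log on stdin and, for every
# "unhandled level N translation fault", print the most recent stressor
# marker line followed by the faulting "comm[pid]".
last_stressor_before_fault() {
  awk '
    BEGIN { last = "(none)" }
    # marker lines such as "[ 4901.799846] timerfd": remember the name
    /^\[ *[0-9]+\.[0-9]+] [a-z][a-z0-9-]*$/ { last = $NF; next }
    # fault lines: find the "comm[pid]:" field and report it
    /unhandled level [0-9] translation fault/ {
      for (i = 1; i <= NF; i++)
        if ($i ~ /\[[0-9]+]:$/) { who = $i; sub(/:$/, "", who); print last, who; break }
    }
  '
}
```

For example, `dmesg | last_stressor_before_fault`; on the log above it would report udp-flood together with irqbalance[827].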
