Re: [Bug 1469214] Re: HP ProLiant m400 Server crashes with unhandled level 3 translation fault
On Mon, Jul 13, 2015 at 9:27 AM, Ming Lei 1469...@bugs.launchpad.net wrote:
> Dann,
>
> Please follow the steps in #12, in which you should trigger the crash
> in 4 minutes.

I've been running that in a loop and I'm currently on iteration #76 w/o a crash :( Maybe it's

Linux ms10-33-mcdivittB0 3.19.0-22-generic #22-Ubuntu SMP Tue Jun 16 17:18:17 UTC 2015 aarch64 aarch64 aarch64 GNU/Linux

> BTW, looks wily kernel can't boot to shell prompt on mcdivitt.

OK - mind filing a separate bug for that?

--
You received this bug notification because you are a member of Ubuntu Server Team, which is subscribed to irqbalance in Ubuntu.
https://bugs.launchpad.net/bugs/1469214

Title:
  HP ProLiant m400 Server crashes with unhandled level 3 translation fault

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/irqbalance/+bug/1469214/+subscriptions

--
Ubuntu-server-bugs mailing list
Ubuntu-server-bugs@lists.ubuntu.com
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-server-bugs
Re: [Bug 1469214] Re: HP ProLiant m400 Server crashes with unhandled level 3 translation fault
On Tue, Jul 7, 2015 at 2:25 AM, Ming Lei 1469...@bugs.launchpad.net wrote:
> On Tue, Jul 7, 2015 at 11:16 AM, Ming Lei ming@canonical.com wrote:
>> It looks like there are two kinds of translation fault from irqbalance:
>>
>> 1) the first happened in place_irq_in_node() and can be reproduced with the vivid package
>>
>> 2) the second happened in glib2, in my own build; irqbalance can use its own local glib if glib2 isn't available, but glib2 does exist on the server on which I built irqbalance.
>
> Both of the above reports can be fixed by the following irqbalance commit:
>
>     NUMA is not available fix
>     https://github.com/Irqbalance/irqbalance/commit/a3c812eb6cd627cd3fae45b8345538558b86973c
>
> It looks like stress-ng can find not only kernel bugs but also userspace issues. :-)

I was looking to upload a fix for wily, but I haven't been able to reproduce it in order to verify the fix. I ran 'stress-ng --seq 0 -t 60 --syslog --metrics --times -v' overnight in a loop, but irqbalance never crashed. How long should I expect this to take on average? Does it usually crash in a single run?
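For anyone else trying to reproduce this, here is a minimal sketch of such a loop, assuming the goal is simply to stop as soon as the watched daemon disappears. The function name run_until_gone and its arguments are hypothetical helpers made up for illustration; they are not part of stress-ng or irqbalance.

```bash
#!/bin/bash
# Sketch only: repeat a workload until a watched process disappears.
# Usage: run_until_gone MAX_ITER WATCH_CMD WORK_CMD [ARGS...]
#   WATCH_CMD is run via `bash -c` and must succeed while the process is alive.
#   run_until_gone is a hypothetical helper, not an existing tool.
run_until_gone() {
    local max_iter=$1 watch_cmd=$2
    shift 2
    local i
    for ((i = 1; i <= max_iter; i++)); do
        # One pass of the workload; its exit status is deliberately ignored.
        "$@" >/dev/null 2>&1 || true
        # Stop as soon as the liveness check fails.
        if ! bash -c "$watch_cmd" >/dev/null 2>&1; then
            echo "gone after iteration $i"
            return 0
        fi
    done
    echo "still alive after $max_iter iterations"
    return 1
}
```

With the command line quoted above, it could be called as: run_until_gone 100 'pidof irqbalance' ./stress-ng --seq 0 -t 60 --syslog --metrics --times -v. pidof exits non-zero once no irqbalance process is left, which ends the loop.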
Re: [Bug 1469214] Re: HP ProLiant m400 Server crashes with unhandled level 3 translation fault
On Tue, Jul 7, 2015 at 11:16 AM, Ming Lei ming@canonical.com wrote:
> It looks like there are two kinds of translation fault from irqbalance:
>
> 1) the first happened in place_irq_in_node() and can be reproduced with the vivid package
>
> 2) the second happened in glib2, in my own build; irqbalance can use its own local glib if glib2 isn't available, but glib2 does exist on the server on which I built irqbalance.

Both of the above reports can be fixed by the following irqbalance commit:

    NUMA is not available fix
    https://github.com/Irqbalance/irqbalance/commit/a3c812eb6cd627cd3fae45b8345538558b86973c

It looks like stress-ng can find not only kernel bugs but also userspace issues. :-)

Thanks,
Ming

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1469214

Title:
  HP ProLiant m400 Server crashes with unhandled level 3 translation fault

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1469214/+subscriptions

--
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Re: [Bug 1469214] Re: HP ProLiant m400 Server crashes with unhandled level 3 translation fault
On Tue, Jul 7, 2015 at 2:37 AM, Colin Ian King 1469...@bugs.launchpad.net wrote:
> captured irqbalance segfaulting:
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x00408f8c in place_irq_in_node (info=0x2c3d0050, data=0x0) at placement.c:145
> 145             if (irq_numa_node(info)->number != -1) {
> (gdb) where
> #0  0x00408f8c in place_irq_in_node (info=0x2c3d0050, data=0x0) at placement.c:145
> #1  0x00405154 in for_each_irq (list=0x2c3df660, cb=0x408f4c <place_irq_in_node>, data=0x0) at classify.c:508
> #2  0x0040923c in calculate_placement () at placement.c:196
> #3  0x00407800 in main (argc=2, argv=0x7fcd014928) at irqbalance.c:372
> (gdb) print info
> $1 = (struct irq_info *) 0x2c3d0050

Suppose info is an address in the heap; then it is valid, and the segfault should be caused by an invalid info->numa_node.

Thanks

--
You received this bug notification because you are subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1469214

Title:
  HP ProLiant m400 Server crashes with unhandled level 3 translation fault

Status in linux package in Ubuntu:
  Triaged

Bug description:
  Running stress-ng on a HP ProLiant m400 server can cause unhandled level 3 translation faults:

  use stress-ng from git://kernel.ubuntu.com/cking/stress-ng

  ./stress-ng --seq 0 -t 60 -v

  and after some time this trips the following:

  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922560] systemd-timesyn[481]: unhandled level 3 translation fault (7) at 0x7fa8ea6008, esr 0x9207
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922561] pgd = ffcfb563f000
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922563] [7fa8ea6008] *pgd=004fb4f28003, *pud=004fb4f28003, *pmd=004fb4f38003, *pte=1d151c00
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922566]
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922569] CPU: 6 PID: 481 Comm: systemd-timesyn Not tainted 3.19.0-21-generic #21-Ubuntu
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922571] Hardware name: HP ProLiant m400 Server Cartridge (DT)
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922573] task: ffcfb4e3b100 ti: ffcfb4d2c000 task.ti: ffcfb4d2c000
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922588] PC is at 0x7fa8d81824
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922589] LR is at 0x7fa8e3b3e4
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922590] pc : [007fa8d81824] lr : [007fa8e3b3e4] pstate: 8000
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922591] sp : 007ff120d660
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922592] x29: 007ff120d660 x28: 007fa8f1c000
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922594] x27: 007fa8f32084 x26: 007fa8f32000
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922595] x25: 007fa8f1d788 x24: 007fa8f1d888
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922597] x23: 0001 x22: 007fa8f1faa0
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922599] x21: 007ff120d7f0 x20: 007ff120d7d0
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922600] x19: 007fa8f31000 x18: 007fa8f1e000
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922602] x17: 007fa8e3b3b8 x16: 007fa8ea6000
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922603] x15: 003b9aca x14: 00219bbdd000
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922605] x13: aa751223 x12:
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922607] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922609] x9 : 37333c43484f5e46 x8 : 007ff120d818
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922610] x7 : 007ff120d8f0 x6 : 007ff120d828
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922612] x5 : ff80ffd0 x4 : 007ff120d8c0
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922613] x3 : 007ff120d7d0 x2 : 007fa8f1faa0
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922615] x1 : 0001 x0 : 0064
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922616]

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1469214/+subscriptions
Re: [Bug 1469214] Re: HP ProLiant m400 Server crashes with unhandled level 3 translation fault
It looks like there are two kinds of translation fault from irqbalance:

1) the first happened in place_irq_in_node() and can be reproduced with the vivid package

2) the second happened in glib2, in my own build; irqbalance can use its own local glib if glib2 isn't available, but glib2 does exist on the server on which I built irqbalance.
Re: [Bug 1469214] Re: HP ProLiant m400 Server crashes with unhandled level 3 translation fault
On Mon, Jul 6, 2015 at 9:28 PM, Colin Ian King 1469...@bugs.launchpad.net wrote:
> I re-ran this today with the following script as a non-root user:
>
> #!/bin/bash
> tests="affinity aio bigheap brk bsearch cache chdir chmod clock context cpu crypt dentry dir dup epoll eventfd fstat fallocate fault fifo flock fork futex get getrandom hdd hsearch inotify io itimer kcmp kill lease link lockf longjmp lsearch malloc matrix memcpy memfd mincore mlock mmap mmapmany mremap msg mq nice null open pipe poll procfs pthread qsort readahead rename rlimit seek sem sem-sysv sendfile shm-sysv sigfd sigfpe sigq sigsegv sock splice stack str switch symlink sysinfo sysfs tee timer timerfd tsearch udp udp-flood urandom utime vecmath vfork vm vm-rw vm-splice wcs wait yield xattr zero zombie"
> for t in $tests
> do
>     echo $t
>     echo $t | sudo tee /dev/kmsg
>     ./stress-ng --$t 0 -v -t 60
> done
>
> and hit this issue:
>
> [14098.848615] urandom
> [14111.696335] irqbalance[828]: unhandled level 2 translation fault (11) at 0x4f64, esr 0x9206
> [14111.696341] pgd = ffcfef71b000
> [14111.737149] [4f64] *pgd=004fef1f3003, *pud=004fef1f3003, *pmd=

As I suggested, it would be helpful to provide /proc/$(pidof irqbalance)/maps; otherwise we can't know where the faulting address and the PC are.

Finally, I have figured out a simple way to reproduce the issue:

1) apply the attached debug patch to stress-ng

2) run the following script:

    sudo cat /proc/$(pidof irqbalance)/maps
    /home/ubuntu/git/stress-ng/stress-ng --sequential 0 --seq-start 80 --seq-end 84 -t 60 --syslog --metrics --times -v

   The above command runs just the following 4 stressors in 4 minutes:

    stress-ng: info: [1067] dispatching hogs: 8 tsearch, 8 udp, 8 udp-flood, 8 urandom

3) the above may trigger the following faults from irqbalance with ~3/4 probability. The faulting address is in the heap and the PC points into the code of libglib-2.0.so, so it looks like a use-after-free in irqbalance or libglib?
Nothing shows that it is related to the kernel; also, the four stressors are quite simple and shouldn't cause trouble for the kernel.

# irqbalance memory maps
0040-0040a000 r-xp 08:02 10496929 /usr/sbin/irqbalance
00419000-0041a000 r-xp 9000 08:02 10496929 /usr/sbin/irqbalance
0041a000-0041b000 rwxp a000 08:02 10496929 /usr/sbin/irqbalance
16294000-162b5000 rwxp 00:00 0 [heap]
162b5000-162ce000 rwxp 00:00 0 [heap]
7f8fbf9000-7f8fbfb000 rwxp 00:00 0
7f8fbfb000-7f8fc11000 r-xp 08:02 4722034 /lib/aarch64-linux-gnu/libpthread-2.21.so
7f8fc11000-7f8fc2 ---p 00016000 08:02 4722034 /lib/aarch64-linux-gnu/libpthread-2.21.so
7f8fc2-7f8fc21000 r-xp 00015000 08:02 4722034 /lib/aarch64-linux-gnu/libpthread-2.21.so
7f8fc21000-7f8fc22000 rwxp 00016000 08:02 4722034 /lib/aarch64-linux-gnu/libpthread-2.21.so
7f8fc22000-7f8fc26000 rwxp 00:00 0
7f8fc26000-7f8fc7f000 r-xp 08:02 4718668 /lib/aarch64-linux-gnu/libpcre.so.3.13.1
7f8fc7f000-7f8fc8f000 ---p 00059000 08:02 4718668 /lib/aarch64-linux-gnu/libpcre.so.3.13.1
7f8fc8f000-7f8fc9 r-xp 00059000 08:02 4718668 /lib/aarch64-linux-gnu/libpcre.so.3.13.1
7f8fc9-7f8fc91000 rwxp 0005a000 08:02 4718668 /lib/aarch64-linux-gnu/libpcre.so.3.13.1
7f8fc91000-7f8fdc1000 r-xp 08:02 4722027 /lib/aarch64-linux-gnu/libc-2.21.so
7f8fdc1000-7f8fdd ---p 0013 08:02 4722027 /lib/aarch64-linux-gnu/libc-2.21.so
7f8fdd-7f8fdd4000 r-xp 0012f000 08:02 4722027 /lib/aarch64-linux-gnu/libc-2.21.so
7f8fdd4000-7f8fdd6000 rwxp 00133000 08:02 4722027 /lib/aarch64-linux-gnu/libc-2.21.so
7f8fdd6000-7f8fdda000 rwxp 00:00 0
7f8fdda000-7f8fde3000 r-xp 08:02 10885206 /usr/lib/aarch64-linux-gnu/libnuma.so.1.0.0
7f8fde3000-7f8fdf2000 ---p 9000 08:02 10885206 /usr/lib/aarch64-linux-gnu/libnuma.so.1.0.0
7f8fdf2000-7f8fdf3000 r-xp 8000 08:02 10885206 /usr/lib/aarch64-linux-gnu/libnuma.so.1.0.0
7f8fdf3000-7f8fdf4000 rwxp 9000 08:02 10885206 /usr/lib/aarch64-linux-gnu/libnuma.so.1.0.0
7f8fdf4000-7f8fdf8000 rwxp 00:00 0
7f8fdf8000-7f8fe89000 r-xp 08:02 4722041 /lib/aarch64-linux-gnu/libm-2.21.so
7f8fe89000-7f8fe98000 ---p 00091000 08:02 4722041 /lib/aarch64-linux-gnu/libm-2.21.so
7f8fe98000-7f8fe99000 r-xp 0009 08:02 4722041 /lib/aarch64-linux-gnu/libm-2.21.so
7f8fe99000-7f8fe9a000 rwxp 00091000 08:02 4722041 /lib/aarch64-linux-gnu/libm-2.21.so
7f8fe9a000-7f8ff8c000 r-xp 08:02 4718610 /lib/aarch64-linux-gnu/libglib-2.0.so.0.4400.1
7f8ff8c000-7f8ff9c000 ---p 000f2000 08:02 4718610 /lib/aarch64-linux-gnu/libglib-2.0.so.0.4400.1
7f8ff9c000-7f8ff9d000 r-xp 000f2000 08:02 4718610 /lib/aarch64-linux-gnu/libglib-2.0.so.0.4400.1
7f8ff9d000-7f8ff9e000 rwxp 000f3000 08:02 4718610 /lib/aarch64-linux-gnu/libglib-2.0.so.0.4400.1
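Checking by hand which mapping holds the faulting address or the PC gets tedious, so here is a rough sketch of a lookup helper. It assumes a maps file saved straight from /proc/PID/maps, with complete start-end ranges and all six fields per line; the name find_mapping is made up for illustration and is not an existing tool.

```bash
#!/bin/bash
# Sketch only: report which mapping in a /proc/PID/maps-style file contains
# a given address. find_mapping is a hypothetical helper name.
# Usage: find_mapping MAPS_FILE HEX_ADDR   (HEX_ADDR with or without 0x)
# Assumes user-space addresses that fit in bash's signed 64-bit arithmetic.
find_mapping() {
    local maps_file=$1 addr_hex=${2#0x}
    local addr=$((16#$addr_hex))
    local range perms offset dev inode path start end
    while read -r range perms offset dev inode path; do
        [[ -n $range ]] || continue          # skip blank lines
        start=$((16#${range%-*}))            # text before the dash
        end=$((16#${range#*-}))              # text after the dash
        if ((addr >= start && addr < end)); then
            echo "${path:-[anonymous]} $perms"
            return 0
        fi
    done < "$maps_file"
    echo "unmapped"
    return 1
}
```

For example, after saving the maps with sudo cat /proc/$(pidof irqbalance)/maps > maps.txt, running find_mapping maps.txt followed by the "PC is at" value from the fault log prints the backing file and permissions of the mapping that holds that address, or "unmapped" if no range matches.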
Re: [Bug 1469214] Re: HP ProLiant m400 Server crashes with unhandled level 3 translation fault
Hi Colin,

On Sat, Jul 4, 2015 at 12:43 AM, Colin Ian King 1469...@bugs.launchpad.net wrote:
> I was able to hit the following translation fault running sudo ./stress-ng --seq 0 -t 60 --syslog --metrics --times -v

I suggest not running stress-ng as root; otherwise the issue can be less serious, because:

- the root user can do bad things easily, and it is quite easy to kill any process
- in reality most loads are run as non-root

If some system processes (irqbalance, systemd-*) are only killed because stress-ng is running as root, it can be a low-priority issue. Otherwise we need to pay close attention to the issue.

I always run 'stress-ng' as the ubuntu user without sudo, and that may be the reason why it is difficult for me to reproduce.

Even with the two new approaches, it is still not easy for me to reproduce. I have seen only one translation fault with your first approach (./stress-ng --seq 0 ...) in 6 hours, and can't trigger it with your second approach (the bash script).

The log I triggered follows [1], and I think it is very likely a userspace issue. From the irqbalance-dbgsym package, we can easily find that 'PC is at 0x406078' is an address in the text section, and it should be inside the function 'place_irq_in_node', because the executable isn't built as relocatable.

One thing I still can't understand is why the fault address is '0x0040' in this context.

[1]
[ 3616.92] Bits 55-60 of /proc/PID/pagemap entries are about to stop being page-shift some time soon. See the linux/Documentation/vm/pagemap.txt for details.
[ 3616.93] Bits 55-60 of /proc/PID/pagemap entries are about to stop being page-shift some time soon. See the linux/Documentation/vm/pagemap.txt for details.
[ 5316.367265] irqbalance[1457]: unhandled level 2 translation fault (11) at 0x0040, esr 0x9206
[ 5316.476937] pgd = ffcfb5478000
[ 5316.520692] [0040] *pgd=004fb4a3c003, *pud=004fb4a3c003, *pmd=
[ 5316.620270]
[ 5316.638140] CPU: 7 PID: 1457 Comm: irqbalance Not tainted 3.19.0-21-generic #21-Ubuntu
[ 5316.733212] Hardware name: HP ProLiant m400 Server Cartridge (DT)
[ 5316.806382] task: ffcfb55e6e40 ti: ffcfa72b task.ti: ffcfa72b
[ 5316.896258] PC is at 0x406078
[ 5316.931865] LR is at 0x404100
[ 5316.967457] pc : [00406078] lr : [00404100] pstate: 2000
[ 5317.056268] sp : 007fc07ff2d0
[ 5317.096038] x29: 007fc07ff2d0 x28: 004095a0
[ 5317.160023] x27: 00409548 x26: 0041a000
[ 5317.223897] x25: 00405000 x24: 0041acf8
[ 5317.287868] x23: 0041a000 x22: 0041a000
[ 5317.351841] x21: 2e0d6050 x20: 0041a000
[ 5317.415744] x19: 2e0e9020 x18:
[ 5317.479620] x17: 007fb5ac287c x16: 0041a188
[ 5317.543490] x15: 003bdd2370f74a1c x14: 2030203020302030
[ 5317.607373] x13: 2030203020302030 x12: 2030203020302030
[ 5317.671263] x11: 2030203020302030 x10: 2030203020302030
[ 5317.735137] x9 : 00a0 x8 : 0001
[ 5317.799113] x7 : 0033 x6 : 2e0d6e08
[ 5317.862983] x5 : 0040 x4 :
[ 5317.926867] x3 : 2e0d7008 x2 :
[ 5317.990840] x1 : 002c x0 : 0003
[ 5318.054713]
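When comparing several of these reports, it can help to pull the key fields out of the first line of each fault. Below is a minimal sketch, assuming the message has the same shape as the "unhandled level N translation fault" lines quoted in this thread; parse_fault is a made-up helper name, not an existing tool.

```bash
#!/bin/bash
# Sketch only: extract process name, PID, fault level and faulting address
# from an arm64 "unhandled level N translation fault" kernel message.
# parse_fault is a hypothetical helper name.
parse_fault() {
    local line=$1
    # comm[pid]: unhandled level N translation fault (code) at 0xADDR
    local re='([^][ ]+)\[([0-9]+)\]: unhandled level ([0-9]) translation fault \([0-9]+\) at (0x[0-9a-fA-F]+)'
    if [[ $line =~ $re ]]; then
        echo "comm=${BASH_REMATCH[1]} pid=${BASH_REMATCH[2]} level=${BASH_REMATCH[3]} addr=${BASH_REMATCH[4]}"
    else
        return 1    # not a translation fault line
    fi
}
```

Lines that do not match return a non-zero status and print nothing, so the function can be applied to every line of dmesg output, for instance with dmesg | while read -r l; do parse_fault "$l"; done.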
Re: [Bug 1469214] Re: HP ProLiant m400 Server crashes with unhandled level 3 translation fault
Hi Colin,

That looks like progress, but it still takes time to reproduce, and I will use your new approach to reproduce it.

When you are doing that, could you dump /proc/$(pidof irqbalance)/maps so that we can see where the faulting addresses are in the process's VM space?

thanks,

On Sat, Jul 4, 2015 at 4:10 AM, Colin Ian King 1469...@bugs.launchpad.net wrote:
> Running the following:
>
> #!/bin/bash
> tests="affinity aio bigheap brk bsearch cache chdir chmod clock context cpu crypt dentry dir dup epoll eventfd fstat fallocate fault fifo flock fork futex get getrandom hdd hsearch inotify io itimer kcmp kill lease link lockf longjmp lsearch malloc matrix memcpy memfd mincore mlock mmap mmapmany mremap msg mq nice null open pipe poll procfs pthread qsort readahead rename rlimit seek sem sem-sysv sendfile shm-sysv sigfd sigfpe sigq sigsegv sock splice stack str switch symlink sysinfo sysfs tee timer timerfd tsearch udp udp-flood urandom utime vecmath vfork vm vm-rw vm-splice wcs wait yield xattr zero zombie"
> for t in $tests
> do
>     echo $t
>     echo $t > /dev/kmsg
>     ./stress-ng --$t 0 -v -t 60
> done
>
> eventually tripped the translation fault in irqbalance. I ran this after a clean reboot.
> [ 4901.799846] timerfd
> [ 4961.807050] tsearch
> [ 5021.884456] udp
> [ 5081.895058] udp-flood
> [ 5141.674365] irqbalance[827]: unhandled level 2 translation fault (11) at 0x002d6da4, esr 0x9206
> [ 5141.674376] pgd = ffcfb51a
> [ 5141.715215] [002d6da4] *pgd=004fb677e003, *pud=004fb677e003, *pmd=
> [ 5141.816183] CPU: 0 PID: 827 Comm: irqbalance Not tainted 3.19.0-21-generic #21-Ubuntu
> [ 5141.816185] Hardware name: HP ProLiant m400 Server Cartridge (DT)
> [ 5141.816188] task: ffcfac088000 ti: ffcfab71 task.ti: ffcfab71
> [ 5141.816206] PC is at 0x7f88287834
> [ 5141.816208] LR is at 0x7f882877f4
> [ 5141.816210] pc : [007f88287834] lr : [007f882877f4] pstate: 8000
> [ 5141.816212] sp : 007ff2e46b30
> [ 5141.816214] x29: 007ff2e46b30 x28: 004095a0
> [ 5141.816217] x27: 00409548 x26: 0041a000
> [ 5141.816220] x25: 0001 x24: 0010
> [ 5141.816222] x23: 2d6c98a0 x22: 2d6c9880
> [ 5141.816225] x21: 0018 x20: 007f88323000
> [ 5141.816228] x19: 0002 x18:
> [ 5141.816230] x17: 007f87f8d8ec x16: 007f883222e0
> [ 5141.816233] x15: 0020 x14: 0001
> [ 5141.816235] x13: x12:
> [ 5141.816237] x11: 007ff2e446a0 x10: 0010
> [ 5141.816240] x9 : 00a0 x8 : 0007
> [ 5141.816242] x7 : 0033 x6 : 2d6c9c80
> [ 5141.816245] x5 : 0001 x4 : 007f87fa62a0
> [ 5141.816247] x3 : 2d6c9880 x2 : 0001
> [ 5141.816250] x1 : 03fa x0 : 002d6d9c
> [ 5141.907792] urandom
> [ 5201.928712] utime
> [ 5261.934534] vecmath
> [ 5321.940302] vfork
> [ 5381.947904] vm
> [ 5441.991784] vm-rw
> [ 5502.017614] vm-splice
> [ 5562.023334] wcs
> [ 5622.037054] wait
> [ 5682.043302] yield
> [ 5742.056595] xattr
> [ 5802.075772] zero
> [ 5862.087396] zombie