[Bug 1534413] Re: Precise: lockup during fadvise syscall with POSIX_FADV_DONTNEED
[Expired for linux (Ubuntu) because there has been no activity for 60 days.] ** Changed in: linux (Ubuntu) Status: Incomplete => Expired -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1534413 Title: Precise: lockup during fadvise syscall with POSIX_FADV_DONTNEED To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1534413/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1534413] Re: Precise: lockup during fadvise syscall with POSIX_FADV_DONTNEED
** Changed in: linux (Ubuntu) Status: In Progress => Incomplete ** Changed in: linux (Ubuntu) Assignee: Rafael David Tinoco (inaddy) => (unassigned) -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1534413 Title: Precise: lockup during fadvise syscall with POSIX_FADV_DONTNEED To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1534413/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1534413] Re: Precise: lockup during fadvise syscall with POSIX_FADV_DONTNEED
Unfortunately I received 2 other kdumps from the same user and it looks like that other 2 kdumps were happening because of a NULL dereference (3.2.0-79). I asked them to test latest 3.2 kernel + the fix I commented above. User still faced kdumps with latest 3.2 but only with khungtask complaining about soft lockups. A first look into the dumps showed a lock contention: crash> bt 883f70c78000 PID: 41940 TASK: 883f70c78000 CPU: 13 COMMAND: "java" #0 [883f69093d98] __schedule at 81666f4a #1 [883f69093e20] schedule at 816675cf #2 [883f69093e30] rwsem_down_failed_common at 8166940d #3 [883f69093ea0] rwsem_down_write_failed at 816694c3 #4 [883f69093eb0] call_rwsem_down_write_failed at 8131fb43 #5 [883f69093f10] sys_mprotect at 811488f0 #6 [883f69093f80] system_call_fastpath at 81671c62 RIP: 7f1adc172cc7 RSP: 7f1adcc89c60 RFLAGS: 00010297 RAX: 000a RBX: 81671c62 RCX: ef038c60 RDX: 0003 RSI: 2000 RDI: 7f1ad49bc000 RBP: 2010 R8: 009be000 R9: R10: 009bc000 R11: 0206 R12: 0ac0 R13: 7f1ad49bb540 R14: 2000 R15: 7f1ad478 ORIG_RAX: 000a CS: 0033 SS: 002b on the task monitored by khungtask thread. -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1534413 Title: Precise: lockup during fadvise syscall with POSIX_FADV_DONTNEED To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1534413/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1534413] Re: Precise: lockup during fadvise syscall with POSIX_FADV_DONTNEED
** Changed in: linux (Ubuntu) Importance: Undecided => Critical ** Changed in: linux (Ubuntu) Importance: Critical => Medium -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1534413 Title: Precise: lockup during fadvise syscall with POSIX_FADV_DONTNEED To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1534413/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1534413] Re: Precise: lockup during fadvise syscall with POSIX_FADV_DONTNEED
Disconsider my previous analysis please. After disassembling carefully the java process dump I could see that we had the instruction pointer at: if (iter.task) put_task_struct(iter.task); Some instructions before a loop that could be a potential candidate for deadlocks, which I was referring to before. 0x811df242 <+34>:je 0x811df24f0x811df244 <+36>:lock decl 0x10(%rdx) 0x811df248 <+40>:sete %al 0x811df24b <+43>:test %al,%al The combination of those 3 instructions (decl, sete & test) shows me that this is actually the implementation of atomic_dec_and_test function. It servers to atomically (CPU coordination) decrement a variable from task_struct for the given task. Deadlock is probably NOT there. I went for khungtask stack to check which task it was complaining about being locked and I could not find it (probably taskid was in registers and not on the stack) BUT I could find an output message telling me which task was locked: [15956.371298] INFO: task jsvc:57263 blocked for more than 120 seconds. And this comes exactly from khungtaskd code. It means that PID 57263 was NOT scheduled for more than 120 seconds. Checking its stack: PID: 57263 COMMAND: "jsvc" TASK: 883e0c4e2e00 [THREAD_INFO: 883e849c2000] CPU: 1 STATE: TASK_UNINTERRUPTIBLE|TASK_TRACED|EXIT_DEAD I can see that it is in UNINTERRUPTIBLE state (it can't received signals, for ex). Checking its stack: crash> bt PID: 57263 TASK: 883e0c4e2e00 CPU: 1 COMMAND: "jsvc" #0 [883e849c3c88] __schedule at 8166256a #1 [883e849c3d10] schedule at 81662c0f #2 [883e849c3d20] schedule_timeout at 8166324d #3 [883e849c3dd0] wait_for_common at 81662a4f #4 [883e849c3e70] wait_for_completion at 81662bcd #5 [883e849c3e80] flush_work at 8108757e #6 [883e849c3ed0] schedule_on_each_cpu at 81087813 #7 [883e849c3f10] lru_add_drain_all at 81127365 #8 [883e849c3f20] sys_fadvise64_64 at 8111e189 #9 [883e849c3f70] sys_fadvise64 at 8111e27e #10 [883e849c3f80] system_call_fastpath at 8166d2c2 ## sys_fadvise64_64 The task entered kernel mode after the syscall: fadvise. It basically tells the kernel how the file is going to be accessed so it can prepare itself regarding to different caching techniques (expect page references in random order, expect page references in sequential order, expect delay in accessing file contents, etc). It turns out that the JSVC process basically told kernel that it would NOT use the opened file (flag POSIX_FADV_DONTNEED) until further moment, at a certain offset. With that, kernel calculated page start/end index (offset, offset+len) for the memory range of given file and tried to invalidate all unlocked pages (page cache) for that area. Dirty, locked, write-back or mapped pages are NOT invalidated. After first attempt of invalidating pages from page-cache from that particular file, Kernel found out that fewer pages (than the ones given to fadvise - in the form of offset/len delta) than expected were invalidated (in other words, it could not invalidate all pages from the the range) so it resolved calling the LRU DRAIN logic. ## lru_add_drain_all LRU stands for Least-Recent-Used algorithm and is used for user mode address space and for page cache (this case). It defines 2 lists of memory pages: active and inactive. The LRU DRAIN logic schedules a drain logic on every CPU on the system in order to drain per-cpu existent page vectors (pagevecs). Page vecs are temporary small & finite data structures containing page descriptor pointers. Instead of changing LRU lists (active/inactive) directly, the LRU algorithm uses per-cpu page vectors until they are full (avoiding having to lock/unlock while LRU lists more often). ## schedule_on_each_cpu Task 883e0c4e2e00 continued the LRU DRAIN logic by scheduling its call back function using schedule_on_each_cpu: schedule_on_each_cpu(drain_cpu_pagevecs) scheduled the drain_cpu_pagevecs on all cpus and kept waiting - for more than 120 seconds - for all cpus to execute the drain. While it was waiting, the task 883e0c4e2e00 was made UNINTERRUPTIBLE and was locked inside this loop where it waits for drain_cpu_pagevecs to finish on all cpus (it never did). ## precise commit (cherry-picked from upstream) adding that logic: commit a68692ea42825466b0ade8a94a98afc924d2564e Author: Mel Gorman Date: Fri Feb 22 16:35:59 2013 -0800 mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages BugLink: http://bugs.launchpad.net/bugs/1150557 commit 67d46b296a1ba1477c0df8ff3bc5e0167a0b0732 upstream. It was added to Precise (3.2.0-40.64). I'm removing that logic and providing a kernel without it to see if it mitigates the issue. Per original commit
[Bug 1534413] Re: Precise: lockup during fadvise syscall with POSIX_FADV_DONTNEED
Disconsider my previous analysis please. After disassembling carefully the java process dump I could see that we had the instruction pointer at: if (iter.task) put_task_struct(iter.task); Some instructions before a loop that could be a potential candidate for deadlocks, which I was referring to before. 0x811df242 <+34>:je 0x811df24f0x811df244 <+36>:lock decl 0x10(%rdx) 0x811df248 <+40>:sete %al 0x811df24b <+43>:test %al,%al The combination of those 3 instructions (decl, sete & test) shows me that this is actually the implementation of atomic_dec_and_test function. It servers to atomically (CPU coordination) decrement a variable from task_struct for the given task. Deadlock is probably NOT there. I went for khungtask stack to check which task it was complaining about being locked and I could not find it (probably taskid was in registers and not on the stack) BUT I could find an output message telling me which task was locked: [15956.371298] INFO: task jsvc:57263 blocked for more than 120 seconds. And this comes exactly from khungtaskd code. It means that PID 57263 was NOT scheduled for more than 120 seconds. Checking its stack: PID: 57263 COMMAND: "jsvc" TASK: 883e0c4e2e00 [THREAD_INFO: 883e849c2000] CPU: 1 STATE: TASK_UNINTERRUPTIBLE|TASK_TRACED|EXIT_DEAD I can see that it is in UNINTERRUPTIBLE state (it can't received signals, for ex). Checking its stack: crash> bt PID: 57263 TASK: 883e0c4e2e00 CPU: 1 COMMAND: "jsvc" #0 [883e849c3c88] __schedule at 8166256a #1 [883e849c3d10] schedule at 81662c0f #2 [883e849c3d20] schedule_timeout at 8166324d #3 [883e849c3dd0] wait_for_common at 81662a4f #4 [883e849c3e70] wait_for_completion at 81662bcd #5 [883e849c3e80] flush_work at 8108757e #6 [883e849c3ed0] schedule_on_each_cpu at 81087813 #7 [883e849c3f10] lru_add_drain_all at 81127365 #8 [883e849c3f20] sys_fadvise64_64 at 8111e189 #9 [883e849c3f70] sys_fadvise64 at 8111e27e #10 [883e849c3f80] system_call_fastpath at 8166d2c2 ## sys_fadvise64_64 The task entered kernel mode after the syscall: fadvise. It basically tells the kernel how the file is going to be accessed so it can prepare itself regarding to different caching techniques (expect page references in random order, expect page references in sequential order, expect delay in accessing file contents, etc). It turns out that the JSVC process basically told kernel that it would NOT use the opened file (flag POSIX_FADV_DONTNEED) until further moment, at a certain offset. With that, kernel calculated page start/end index (offset, offset+len) for the memory range of given file and tried to invalidate all unlocked pages (page cache) for that area. Dirty, locked, write-back or mapped pages are NOT invalidated. After first attempt of invalidating pages from page-cache from that particular file, Kernel found out that fewer pages (than the ones given to fadvise - in the form of offset/len delta) than expected were invalidated (in other words, it could not invalidate all pages from the the range) so it resolved calling the LRU DRAIN logic. ## lru_add_drain_all LRU stands for Least-Recent-Used algorithm and is used for user mode address space and for page cache (this case). It defines 2 lists of memory pages: active and inactive. The LRU DRAIN logic schedules a drain logic on every CPU on the system in order to drain per-cpu existent page vectors (pagevecs). Page vecs are temporary small & finite data structures containing page descriptor pointers. Instead of changing LRU lists (active/inactive) directly, the LRU algorithm uses per-cpu page vectors until they are full (avoiding having to lock/unlock while LRU lists more often). ## schedule_on_each_cpu Task 883e0c4e2e00 continued the LRU DRAIN logic by scheduling its call back function using schedule_on_each_cpu: schedule_on_each_cpu(drain_cpu_pagevecs) scheduled the drain_cpu_pagevecs on all cpus and kept waiting - for more than 120 seconds - for all cpus to execute the drain. While it was waiting, the task 883e0c4e2e00 was made UNINTERRUPTIBLE and was locked inside this loop where it waits for drain_cpu_pagevecs to finish on all cpus (it never did). ## precise commit (cherry-picked from upstream) adding that logic: commit a68692ea42825466b0ade8a94a98afc924d2564e Author: Mel Gorman Date: Fri Feb 22 16:35:59 2013 -0800 mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages BugLink: http://bugs.launchpad.net/bugs/1150557 commit 67d46b296a1ba1477c0df8ff3bc5e0167a0b0732 upstream. It was added to Precise (3.2.0-40.64). I'm removing that logic and providing a kernel without it to see if it mitigates the issue. Per original commit description, this was supposed
[Bug 1534413] Re: Precise: lockup during fadvise syscall with POSIX_FADV_DONTNEED
Because Precise 3.2.0-79 is missing debug symbols froms ddebs.ubuntu.com I had to compile a 3.2.0-79 kernel in a PPA and expect the symbols to be close to what they used to be in that version. That led me to a wrong initial analysis that I document here for historical purposes: > 178 2 18 881f716bdc00 RU 0.0 0 0 [khungtaskd] > 3680 2808 38 881f71a5c500 RU 1.6 6629520 4303188 java > 50279 49370 31 883f0c7e8000 RU 0.0 4121160 20 java > 50757 50322 23 881ef27eae00 RU 0.3 4149720 870892 java crash> bt 881ef27eae00 PID: 50757 TASK: 881ef27eae00 CPU: 23 COMMAND: "java" #0 [881fbfba6ee0] crash_nmi_callback at 81031ac9 #1 [881fbfba6ef0] default_do_nmi at 81666079 #2 [881fbfba6f30] do_nmi at 816662b0 #3 [881fbfba6f50] nmi at 81665620 [exception RIP: next_tgid+40] RIP: 811df248 RSP: 881cd4b67da8 RFLAGS: 0202 RAX: RBX: 81c281a0 RCX: RDX: 881f7283 RSI: 0074 RDI: 81c281a0 RBP: 881cd4b67df8 R8: a88d R9: 0004 R10: 883f70922540 R11: 0001f8f579768213 R12: 0074 R13: 881f0073 R14: 81c281a0 R15: 881f73138000 ORIG_RAX: CS: 0010 SS: 0018 --- --- #4 [881cd4b67da8] next_tgid at 811df248 #5 [881cd4b67e00] proc_pid_readdir at 811e117e #6 [881cd4b67eb0] proc_root_readdir at 811dbe0a #7 [881cd4b67ee0] vfs_readdir at 8118e1d0 #8 [881cd4b67f30] sys_getdents at 8118e4a9 #9 [881cd4b67f80] system_call_fastpath at 8166d2c2 RIP: 7f855e94d605 RSP: 7f853a457bf0 RFLAGS: 0283 RAX: 004e RBX: 8166d2c2 RCX: 0010 RDX: 8000 RSI: 7f8548057980 RDI: 00fb RBP: 7f854846b140 R8: 7f8548057980 R9: 0008 R10: 0010 R11: 0246 R12: 0016 R13: ffa0 R14: 7f853a45a4d0 R15: 7f8548057950 ORIG_RAX: 004e CS: 0033 SS: 002b 1) khungtaskd is complaining about a hung task 2) task 881ef27eae00 is in "next_tgid" from procfs vfs subsystem: We are probably stuck here: rcu_read_lock(); retry: iter.task = NULL; pid = find_ge_pid(iter.tgid, ns); if (pid) { iter.tgid = pid_nr_ns(pid, ns); iter.task = pid_task(pid, PIDTYPE_PID); /* What we to know is if the pid we have find is the * pid of a thread_group_leader. Testing for task * being a thread_group_leader is the obvious thing * todo but there is a window when it fails, due to * the pid transfer logic in de_thread. * * So we perform the straight forward test of seeing * if the pid we have found is the pid of a thread * group leader, and don't worry if the task we have * found doesn't happen to be a thread group leader. * As we don't care in the case of readdir. */ if (!iter.task || !has_group_leader_pid(iter.task)) { iter.tgid += 1; goto retry; } get_task_struct(iter.task); } rcu_read_unlock(); Trying to find a task group leader when reading procfs structure (by the JVM process). I'm still analysing code. -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1534413 Title: Precise: lockup during fadvise syscall with POSIX_FADV_DONTNEED To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1534413/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1534413] Re: Precise: lockup during fadvise syscall with POSIX_FADV_DONTNEED
Available PPA for testing purposes: https://launchpad.net/~inaddy/+archive/ubuntu/lp1534413 -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1534413 Title: Precise: lockup during fadvise syscall with POSIX_FADV_DONTNEED To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1534413/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs