Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
On Tue, Jul 12, 2016 at 03:12:35PM +0200, Peter Zijlstra wrote: > On Mon, Jul 11, 2016 at 03:33:53PM -0700, Greg KH wrote: > > > Oops, this commit does not apply cleanly to 4.6 or 4.4-stable trees. > > Can someone send me the backported verision that they have tested to > > work properly so I can queue it up? > > I've never actually been able to reproduce, but the attached patches > apply, the reject was trivial. > > They seem to compile and boot on my main test rig, but nothing else was > done but build the next kernel with it. Thanks for these, now applied. greg k-h
Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
On Mon, Jul 11, 2016 at 03:33:53PM -0700, Greg KH wrote: > Oops, this commit does not apply cleanly to 4.6 or 4.4-stable trees. > Can someone send me the backported verision that they have tested to > work properly so I can queue it up? I've never actually been able to reproduce, but the attached patches apply, the reject was trivial. They seem to compile and boot on my main test rig, but nothing else was done but build the next kernel with it. commit 8974189222159154c55f24ddad33e3613960521a Author: Peter Zijlstra Date: Thu Jun 16 10:50:40 2016 +0200 sched/fair: Fix cfs_rq avg tracking underflow As per commit: b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization") > the code generated from update_cfs_rq_load_avg(): > > if (atomic_long_read(&cfs_rq->removed_load_avg)) { > s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); > sa->load_avg = max_t(long, sa->load_avg - r, 0); > sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); > removed_load = 1; > } > > turns into: > > 81087064: 49 8b 85 98 00 00 00mov0x98(%r13),%rax > 8108706b: 48 85 c0test %rax,%rax > 8108706e: 74 40 je 810870b0 > 81087070: 4c 89 f8mov%r15,%rax > 81087073: 49 87 85 98 00 00 00xchg %rax,0x98(%r13) > 8108707a: 49 29 45 70 sub%rax,0x70(%r13) > 8108707e: 4c 89 f9mov%r15,%rcx > 81087081: bb 01 00 00 00 mov$0x1,%ebx > 81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13) > 8108708b: 49 0f 49 4d 70 cmovns 0x70(%r13),%rcx > > Which you'll note ends up with sa->load_avg -= r in memory at > 8108707a. So I _should_ have looked at other unserialized users of ->load_avg, but alas. Luckily nikbor reported a similar /0 from task_h_load() which instantly triggered recollection of this here problem. Aside from the intermediate value hitting memory and causing problems, there's another problem: the underflow detection relies on the signed bit. This reduces the effective width of the variables, IOW its effectively the same as having these variables be of signed type. This patch changes to a different means of unsigned underflow detection to not rely on the signed bit. This allows the variables to use the 'full' unsigned range. And it does so with explicit LOAD - STORE to ensure any intermediate value will never be visible in memory, allowing these unserialized loads. Note: GCC generates crap code for this, might warrant a look later. Note2: I say 'full' above, if we end up at U*_MAX we'll still explode; maybe we should do clamping on add too. Signed-off-by: Peter Zijlstra (Intel) Cc: Andrey Ryabinin Cc: Chris Wilson Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Yuyang Du Cc: bseg...@google.com Cc: ker...@kyup.com Cc: morten.rasmus...@arm.com Cc: p...@google.com Cc: steve.muc...@linaro.org Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking") Link: http://lkml.kernel.org/r/20160617091948.gj30...@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 33 + 1 file changed, 25 insertions(+), 8 deletions(-) --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2682,6 +2682,23 @@ static inline void update_tg_load_avg(st static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq); +/* + * Unsigned subtract and clamp on underflow. + * + * Explicitly do a load-store to ensure the intermediate value never hits + * memory. This allows lockless observations without ever seeing the negative + * values. + */ +#define sub_positive(_ptr, _val) do {\ + typeof(_ptr) ptr = (_ptr);\ + typeof(*ptr) val = (_val);\ + typeof(*ptr) res, var = READ_ONCE(*ptr); \ + res = var - val; \ + if (res > var) \ + res = 0; \ + WRITE_ONCE(*ptr, res); \ +} while (0) + /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) { @@ -2690,15 +2707,15 @@ static inline int update_cfs_rq_load_avg if (atomic_long_read(&cfs_rq->removed_load_avg)) { s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); - sa->load_avg = max_t(long, sa->load_avg - r, 0); - sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); + sub_positive(&sa->load_avg, r); + sub_positive(&sa->load_sum, r * LOAD_AVG_MAX); removed = 1; } if (atomic_long_read(&cfs_rq->removed_util_avg)) { long r = atomic_long_xchg
Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
On Thu, Jul 07, 2016 at 09:42:32AM +0200, Peter Zijlstra wrote: > On Thu, Jul 07, 2016 at 11:20:36AM +1200, Campbell Steven wrote: > > > > commit 8974189222159154c55f24ddad33e3613960521a > > > Author: Peter Zijlstra > > > Date: Thu Jun 16 10:50:40 2016 +0200 > > > Since these early reports from Stefan and I it looks like it's been > > hit but alot more folks now so I'd like to ask what the process is for > > getting this backported into 4.6, 4.5 and 4.4 as in our testing all > > those versions for their latest point release seem to have the same > > problem. > > I think this should do; Greg is on Cc and will mark the commit > somewhere. It is already in Linus' tree and should indeed be sufficient. > > It has a Fixes tag referring the commit that introduced it, which IIRC > is somewhere around v4.2. > > Greg, anything else required? Oops, this commit does not apply cleanly to 4.6 or 4.4-stable trees. Can someone send me the backported verision that they have tested to work properly so I can queue it up? thanks, greg k-h
Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
On Thu, Jul 07, 2016 at 09:42:32AM +0200, Peter Zijlstra wrote: > On Thu, Jul 07, 2016 at 11:20:36AM +1200, Campbell Steven wrote: > > > > commit 8974189222159154c55f24ddad33e3613960521a > > > Author: Peter Zijlstra > > > Date: Thu Jun 16 10:50:40 2016 +0200 > > > Since these early reports from Stefan and I it looks like it's been > > hit but alot more folks now so I'd like to ask what the process is for > > getting this backported into 4.6, 4.5 and 4.4 as in our testing all > > those versions for their latest point release seem to have the same > > problem. > > I think this should do; Greg is on Cc and will mark the commit > somewhere. It is already in Linus' tree and should indeed be sufficient. > > It has a Fixes tag referring the commit that introduced it, which IIRC > is somewhere around v4.2. > > Greg, anything else required? Nope, that should be fine. greg k-h
Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
On Thu, Jul 07, 2016 at 11:20:36AM +1200, Campbell Steven wrote: > > commit 8974189222159154c55f24ddad33e3613960521a > > Author: Peter Zijlstra > > Date: Thu Jun 16 10:50:40 2016 +0200 > Since these early reports from Stefan and I it looks like it's been > hit but alot more folks now so I'd like to ask what the process is for > getting this backported into 4.6, 4.5 and 4.4 as in our testing all > those versions for their latest point release seem to have the same > problem. I think this should do; Greg is on Cc and will mark the commit somewhere. It is already in Linus' tree and should indeed be sufficient. It has a Fixes tag referring the commit that introduced it, which IIRC is somewhere around v4.2. Greg, anything else required?
Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
On 22 June 2016 at 18:13, Peter Zijlstra wrote: > On Wed, Jun 22, 2016 at 01:19:54PM +1200, Campbell Steven wrote: >> >>> This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load: >> >>> >> >>> update_cfs_rq_h_load(cfs_rq); >> >>> return div64_ul(p->se.avg.load_avg * cfs_rq->h_load, >> >>> cfs_rq_load_avg(cfs_rq) + 1); >> >>> > > > --- > commit 8974189222159154c55f24ddad33e3613960521a > Author: Peter Zijlstra > Date: Thu Jun 16 10:50:40 2016 +0200 > > sched/fair: Fix cfs_rq avg tracking underflow > > As per commit: > > b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() > serialization") > > > the code generated from update_cfs_rq_load_avg(): > > > > if (atomic_long_read(&cfs_rq->removed_load_avg)) { > > s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); > > sa->load_avg = max_t(long, sa->load_avg - r, 0); > > sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); > > removed_load = 1; > > } Hi Peter, I just wanted to report back to say thanks for this, and we have (and others) have tested this out in 4.7 rc6 and have not been able to repeat the issue. It seems that anyone running busy ceph osd's or high load KVM instances is able to trigger this on a dual socket box pretty easily. Since these early reports from Stefan and I it looks like it's been hit but alot more folks now so I'd like to ask what the process is for getting this backported into 4.6, 4.5 and 4.4 as in our testing all those versions for their latest point release seem to have the same problem. Thanks Campbell
Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
21 juin 2016 14:13 "Yannis Aribaud" a écrit: > Hi everyone, > > I recently it this bug in the kernel using a vanilla 4.6.2 release. > It seems that somewhere in the load average calculation a division by 0 > occurs (see the stack trace > at the end). > > [snipped] > > I'm not an expert at all but I suspect that is the issue's origin. Shouldn't > the function > cfs_rq_load_avg use an atomic_long_read() to avoid this ? After digging a bit more, this can't be the problem as this function obviously can't return negative value. I found that it can maybe come from the update_cfs_rq_load_avg function in the following block: if (atomic_long_read(&cfs_rq->removed_load_avg)) { s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); sa->load_avg = max_t(long, sa->load_avg - r, 0); sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); removed_load = 1; } The max_t(long, sa->load_avg - r, 0) can result in a negative value keeped by the max_t function as the long would wrap up then generate a division by zero in task_h_load function. Best regards, -- Yannis Aribaud
Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
On Wed, Jun 22, 2016 at 01:19:54PM +1200, Campbell Steven wrote: > >>> This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load: > >>> > >>> update_cfs_rq_h_load(cfs_rq); > >>> return div64_ul(p->se.avg.load_avg * cfs_rq->h_load, > >>> cfs_rq_load_avg(cfs_rq) + 1); > >>> --- commit 8974189222159154c55f24ddad33e3613960521a Author: Peter Zijlstra Date: Thu Jun 16 10:50:40 2016 +0200 sched/fair: Fix cfs_rq avg tracking underflow As per commit: b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization") > the code generated from update_cfs_rq_load_avg(): > > if (atomic_long_read(&cfs_rq->removed_load_avg)) { > s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); > sa->load_avg = max_t(long, sa->load_avg - r, 0); > sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); > removed_load = 1; > } > > turns into: > > 81087064: 49 8b 85 98 00 00 00mov0x98(%r13),%rax > 8108706b: 48 85 c0test %rax,%rax > 8108706e: 74 40 je 810870b0 > 81087070: 4c 89 f8mov%r15,%rax > 81087073: 49 87 85 98 00 00 00xchg %rax,0x98(%r13) > 8108707a: 49 29 45 70 sub%rax,0x70(%r13) > 8108707e: 4c 89 f9mov%r15,%rcx > 81087081: bb 01 00 00 00 mov$0x1,%ebx > 81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13) > 8108708b: 49 0f 49 4d 70 cmovns 0x70(%r13),%rcx > > Which you'll note ends up with sa->load_avg -= r in memory at > 8108707a. So I _should_ have looked at other unserialized users of ->load_avg, but alas. Luckily nikbor reported a similar /0 from task_h_load() which instantly triggered recollection of this here problem. Aside from the intermediate value hitting memory and causing problems, there's another problem: the underflow detection relies on the signed bit. This reduces the effective width of the variables, IOW its effectively the same as having these variables be of signed type. This patch changes to a different means of unsigned underflow detection to not rely on the signed bit. This allows the variables to use the 'full' unsigned range. And it does so with explicit LOAD - STORE to ensure any intermediate value will never be visible in memory, allowing these unserialized loads. Note: GCC generates crap code for this, might warrant a look later. Note2: I say 'full' above, if we end up at U*_MAX we'll still explode; maybe we should do clamping on add too. Signed-off-by: Peter Zijlstra (Intel) Cc: Andrey Ryabinin Cc: Chris Wilson Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Yuyang Du Cc: bseg...@google.com Cc: ker...@kyup.com Cc: morten.rasmus...@arm.com Cc: p...@google.com Cc: steve.muc...@linaro.org Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking") Link: http://lkml.kernel.org/r/20160617091948.gj30...@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a2348deab7a3..2ae68f0e3bf5 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2904,6 +2904,23 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq) } } +/* + * Unsigned subtract and clamp on underflow. + * + * Explicitly do a load-store to ensure the intermediate value never hits + * memory. This allows lockless observations without ever seeing the negative + * values. + */ +#define sub_positive(_ptr, _val) do { \ + typeof(_ptr) ptr = (_ptr); \ + typeof(*ptr) val = (_val); \ + typeof(*ptr) res, var = READ_ONCE(*ptr);\ + res = var - val;\ + if (res > var) \ + res = 0;\ + WRITE_ONCE(*ptr, res); \ +} while (0) + /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq) @@ -2913,15 +2930,15 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq) if (atomic_long_read(&cfs_rq->removed_load_avg)) { s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); - sa->load_avg = max_t(long, sa->load_avg - r, 0); - sa-
Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
On 17 May 2016 at 21:21, Campbell Steven wrote: > Thanks Stefan, > > I am seeing this on 4.5.0 and 4.5.4 both are compiled from mainline > neither include any patches over and above the tree. I ran for well > over a month in production on 4.5.0 with no issues at all on a single > socket server (E5-2670 v3 @ 2.30GHz) but as soon as we try to run > either 4.5.0 or 4.5.4 on a dual socket box we get these traces after > 12 or so hours of uptime and the same sky rocketing load. Given you > have stopped your issues by applying those patches to 4.4 it's > possible I have different issues, but the symptoms are too similar to > ignore, our trace is here: > > http://pastebin.com/Q7nd8nfP > > I am using the cfg scheduler, and not the fair scheduler as Stefan > reported so that is another difference. > > Happy to proceed with any suggestions or further information that may > help isolate this, use case on these servers are 12 x CEPH osd's with > bcache so they are seeing alot of disk activity via NVMe and SATA > buses. > > Thanks > > Campbell > > On 17 May 2016 at 18:01, Stefan Priebe - Profihost AG > wrote: >> Am 21.03.2016 um 14:38 schrieb Greg KH: >>> On Mon, Mar 21, 2016 at 11:52:23AM +0100, Stefan Priebe - Profihost AG >>> wrote: Am 20.03.2016 um 22:41 schrieb Greg KH: > On Sun, Mar 20, 2016 at 10:27:23PM +0100, Stefan Priebe wrote: >> >> Am 19.03.2016 um 23:26 schrieb Vlastimil Babka: >>> On 03/17/2016 07:45 PM, Greg KH wrote: On Thu, Mar 17, 2016 at 07:38:03PM +0100, Stefan Priebe wrote: > Hi, > > while running qemu 2.5 on a host running 4.4.6 the host system has > crashed > (load > 200) 3 times in the last 3 days. > > Always with this stack trace: (copy left here: > http://pastebin.com/raw/bCWTLKyt) > > [69068.874268] divide error: [#1] SMP > [69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT > nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter > ip6_tables > ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4 > nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack > nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap > macvlan > kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm > xt_multiport > iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi > scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp > 8021q > garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor > raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid > sd_mod > ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb > i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core > mdio ptp > pps_core > [69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted > 4.4.6+7-ph #1 > [69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1 > 03/17/2012 > [69068.898578] task: 880fc7f28000 ti: 880fda2c4000 task.ti: > 880fda2c4000 > [69068.900377] RIP: 0010:[] [] > task_h_load+0xcc/0x100 >>> >>> decodecode says: >>> >>> 27: 48 83 c1 01 add$0x1,%rcx >>> 2b:* 48 f7 f1div%rcx <-- trapping >>> instruction >>> >>> This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load: >>> >>> update_cfs_rq_h_load(cfs_rq); >>> return div64_ul(p->se.avg.load_avg * cfs_rq->h_load, >>> cfs_rq_load_avg(cfs_rq) + 1); >>> >>> So the load avg is -1, thus after adding 1 we get division by 0, huh? >> >> Yes CONFIG_FAIR_GROUP_SCHED is set. I cherry picked now all those >> commits up >> to 4.5 for fair.c: >> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/kernel/sched/fair.c?h=v4.5 >> >> It didn't happen again with v4.4.6 + 4.5 patches for fair.c > > Ok, that's a lot of patches, how about figuring out which single patch, > or shortest number of patches, makes things work again? will do so but it seems most out of those 9 patches are based on each other. So it wouldn't be easy. >>> >>> Worst case, we take all 9. Best case, we only need one :) >> >> Today i got another report from: casteven (at) gmail.com having the same >> issue. I wasn't able to figure out which of those 9 patches helps - i >> still can just tell that it does not happen with this series on top of 4.4. >> >> Stefan Just reinvigorating this thread based off yesterdays post by Yannis here: https://lkml.org/lkml/headers/2016/6/21/491 of which the trace is more or less is identical to what we see. In addition to this we have since tested the 4.6.x releases and have seen the same
Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
Hi everyone, I recently it this bug in the kernel using a vanilla 4.6.2 release. It seems that somewhere in the load average calculation a division by 0 occurs (see the stack trace at the end). After digging a bit (be fair it's my first time) in the kernel sources, I found that we "recently" added the function cfs_rq_load_avg (commit 6f2b04524f0b38bfbb8413f98d2d6af234508309) and started using it in the function task_h_load which do a division with the value returned (kernel/sched/fair.c) like this: static unsigned long task_h_load(struct task_struct *p) { struct cfs_rq *cfs_rq = task_cfs_rq(p); update_cfs_rq_h_load(cfs_rq); return div64_ul(p->se.avg.load_avg * cfs_rq->h_load, cfs_rq_load_avg(cfs_rq) + 1); } But the load_avg filed from sched_avg struct is an atomic_long_t and the cfs_rq_load_avg returns this field as an unsigned long without doing any type conversion. static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq) { return cfs_rq->avg.load_avg; } I'm not an expert at all but I suspect that is the issue's origin. Shouldn't the function cfs_rq_load_avg use an atomic_long_read() to avoid this ? Here is the stack trace: [534814.112500] divide error: [#1] SMP [534814.112550] Modules linked in: vhost_net vhost macvtap macvlan ipmi_si mpt3sas raid_class scsi_transport_sas ipmi_devintf dell_rbu tun nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc bridge 8021q garp mrp stp llc bonding xfs libcrc32c bcache usbhid hid uhci_hcd ohci_hcd x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass ghash_clmulni_intel iTCO_wdt iTCO_vendor_support sha256_generic hmac drbg dcdbas ansi_cprng aesni_intel aes_x86_64 ablk_helper cryptd lrw gf128mul glue_helper shpchp evdev sb_edac edac_core ehci_pci ehci_hcd lpc_ich usbcore mfd_core usb_common ipmi_msghandler acpi_cpufreq wmi tpm_tis tpm processor acpi_power_meter button ext4 crc16 jbd2 mbcache sg sd_mod dm_mod crc32c_intel igb megaraid_sas i2c_algo_bit i2c_core dca ptp scsi_mod pps_core [last unloaded: ipmi_si] [534814.113345] CPU: 10 PID: 38568 Comm: ceph-osd Not tainted 4.6.2-ig1virt #16 [534814.113390] Hardware name: Dell Inc. PowerEdge R730xd/0H21J3, BIOS 1.1.4 11/03/2014 [534814.113458] task: 88100cf5ef00 ti: 8814827e task.ti: 8814827e [534814.113525] RIP: 0010:[] [] task_h_load+0x4f/0xc7 [534814.113613] RSP: :8814827e3c00 EFLAGS: 00010256 [534814.113654] RAX: RBX: 00d7 RCX: [534814.113720] RDX: RSI: 88103d8a5f00 RDI: 88100cf5ef00 [534814.113786] RBP: 8814827e3c90 R08: 000107f70c76 R09: [534814.113851] R10: 0001 R11: R12: 0001 [534814.113917] R13: 0015 R14: R15: 88207ec14580 [534814.113984] FS: 7eff83cbb700() GS:88107f4a() knlGS: [534814.114053] CS: 0010 DS: ES: CR0: 80050033 [534814.114095] CR2: 146b07c0 CR3: 00103f605000 CR4: 001426e0 [534814.119456] Stack: [534814.119488] 8106fc4c 88103d639400 00d7 00014580 [534814.119571] fe19 88107f4a 00d7 0027 [534814.119653] 88100cf5ef00 025f 0100 0188 [534814.119736] Call Trace: [534814.119774] [] ? task_numa_find_cpu+0x1d2/0x2ec [534814.119819] [] ? task_numa_migrate+0x120/0x328 [534814.119864] [] ? ttwu_do_wakeup+0xf/0xcd [534814.119907] [] ? task_numa_fault+0x912/0x9a9 [534814.119954] [] ? mpol_misplaced+0x138/0x14a [534814.120001] [] ? handle_mm_fault+0xe28/0xf31 [534814.120046] [] ? fput+0xd/0x81 [534814.120087] [] ? __do_page_fault+0x425/0x485 [534814.120131] [] ? page_fault+0x22/0x30 [534814.120171] Code: 63 92 38 09 00 00 48 8b 80 b8 00 00 00 48 8b 04 d0 75 1c 48 8b 86 b0 00 00 00 48 8b 4e 78 31 d2 48 0f af 87 58 01 00 00 48 ff c1 <48> f7 f1 c3 48 c7 86 c0 00 00 00 00 00 00 00 48 89 f1 eb 18 48 [534814.120582] RIP [] task_h_load+0x4f/0xc7 [534814.120628] RSP [534814.121242] ---[ end trace ca72a3c25fb6f0dc ]--- Best regards, -- Yannis Aribaud -- Yannis Aribaud
Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
Thanks Stefan, I am seeing this on 4.5.0 and 4.5.4 both are compiled from mainline neither include any patches over and above the tree. I ran for well over a month in production on 4.5.0 with no issues at all on a single socket server (E5-2670 v3 @ 2.30GHz) but as soon as we try to run either 4.5.0 or 4.5.4 on a dual socket box we get these traces after 12 or so hours of uptime and the same sky rocketing load. Given you have stopped your issues by applying those patches to 4.4 it's possible I have different issues, but the symptoms are too similar to ignore, our trace is here: http://pastebin.com/Q7nd8nfP I am using the cfg scheduler, and not the fair scheduler as Stefan reported so that is another difference. Happy to proceed with any suggestions or further information that may help isolate this, use case on these servers are 12 x CEPH osd's with bcache so they are seeing alot of disk activity via NVMe and SATA buses. Thanks Campbell On 17 May 2016 at 18:01, Stefan Priebe - Profihost AG wrote: > Am 21.03.2016 um 14:38 schrieb Greg KH: >> On Mon, Mar 21, 2016 at 11:52:23AM +0100, Stefan Priebe - Profihost AG wrote: >>> >>> Am 20.03.2016 um 22:41 schrieb Greg KH: On Sun, Mar 20, 2016 at 10:27:23PM +0100, Stefan Priebe wrote: > > Am 19.03.2016 um 23:26 schrieb Vlastimil Babka: >> On 03/17/2016 07:45 PM, Greg KH wrote: >>> On Thu, Mar 17, 2016 at 07:38:03PM +0100, Stefan Priebe wrote: Hi, while running qemu 2.5 on a host running 4.4.6 the host system has crashed (load > 200) 3 times in the last 3 days. Always with this stack trace: (copy left here: http://pastebin.com/raw/bCWTLKyt) [69068.874268] divide error: [#1] SMP [69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap macvlan kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm xt_multiport iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp 8021q garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid sd_mod ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core mdio ptp pps_core [69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted 4.4.6+7-ph #1 [69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1 03/17/2012 [69068.898578] task: 880fc7f28000 ti: 880fda2c4000 task.ti: 880fda2c4000 [69068.900377] RIP: 0010:[] [] task_h_load+0xcc/0x100 >> >> decodecode says: >> >> 27: 48 83 c1 01 add$0x1,%rcx >> 2b:* 48 f7 f1div%rcx <-- trapping >> instruction >> >> This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load: >> >> update_cfs_rq_h_load(cfs_rq); >> return div64_ul(p->se.avg.load_avg * cfs_rq->h_load, >> cfs_rq_load_avg(cfs_rq) + 1); >> >> So the load avg is -1, thus after adding 1 we get division by 0, huh? > > Yes CONFIG_FAIR_GROUP_SCHED is set. I cherry picked now all those commits > up > to 4.5 for fair.c: > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/kernel/sched/fair.c?h=v4.5 > > It didn't happen again with v4.4.6 + 4.5 patches for fair.c Ok, that's a lot of patches, how about figuring out which single patch, or shortest number of patches, makes things work again? >>> >>> will do so but it seems most out of those 9 patches are based on each >>> other. So it wouldn't be easy. >> >> Worst case, we take all 9. Best case, we only need one :) > > Today i got another report from: casteven (at) gmail.com having the same > issue. I wasn't able to figure out which of those 9 patches helps - i > still can just tell that it does not happen with this series on top of 4.4. > > Stefan
Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
Am 21.03.2016 um 14:38 schrieb Greg KH: > On Mon, Mar 21, 2016 at 11:52:23AM +0100, Stefan Priebe - Profihost AG wrote: >> >> Am 20.03.2016 um 22:41 schrieb Greg KH: >>> On Sun, Mar 20, 2016 at 10:27:23PM +0100, Stefan Priebe wrote: Am 19.03.2016 um 23:26 schrieb Vlastimil Babka: > On 03/17/2016 07:45 PM, Greg KH wrote: >> On Thu, Mar 17, 2016 at 07:38:03PM +0100, Stefan Priebe wrote: >>> Hi, >>> >>> while running qemu 2.5 on a host running 4.4.6 the host system has >>> crashed >>> (load > 200) 3 times in the last 3 days. >>> >>> Always with this stack trace: (copy left here: >>> http://pastebin.com/raw/bCWTLKyt) >>> >>> [69068.874268] divide error: [#1] SMP >>> [69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT >>> nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter >>> ip6_tables >>> ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4 >>> nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack >>> nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap macvlan >>> kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm >>> xt_multiport >>> iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi >>> scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp >>> 8021q >>> garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor >>> raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid >>> sd_mod >>> ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb >>> i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core >>> mdio ptp >>> pps_core >>> [69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted >>> 4.4.6+7-ph #1 >>> [69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1 >>> 03/17/2012 >>> [69068.898578] task: 880fc7f28000 ti: 880fda2c4000 task.ti: >>> 880fda2c4000 >>> [69068.900377] RIP: 0010:[] [] >>> task_h_load+0xcc/0x100 > > decodecode says: > > 27: 48 83 c1 01 add$0x1,%rcx > 2b:* 48 f7 f1div%rcx <-- trapping > instruction > > This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load: > > update_cfs_rq_h_load(cfs_rq); > return div64_ul(p->se.avg.load_avg * cfs_rq->h_load, > cfs_rq_load_avg(cfs_rq) + 1); > > So the load avg is -1, thus after adding 1 we get division by 0, huh? Yes CONFIG_FAIR_GROUP_SCHED is set. I cherry picked now all those commits up to 4.5 for fair.c: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/kernel/sched/fair.c?h=v4.5 It didn't happen again with v4.4.6 + 4.5 patches for fair.c >>> >>> Ok, that's a lot of patches, how about figuring out which single patch, >>> or shortest number of patches, makes things work again? >> >> will do so but it seems most out of those 9 patches are based on each >> other. So it wouldn't be easy. > > Worst case, we take all 9. Best case, we only need one :) Today i got another report from: casteven (at) gmail.com having the same issue. I wasn't able to figure out which of those 9 patches helps - i still can just tell that it does not happen with this series on top of 4.4. Stefan
Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
On Mon, Mar 21, 2016 at 11:52:23AM +0100, Stefan Priebe - Profihost AG wrote: > > Am 20.03.2016 um 22:41 schrieb Greg KH: > > On Sun, Mar 20, 2016 at 10:27:23PM +0100, Stefan Priebe wrote: > >> > >> Am 19.03.2016 um 23:26 schrieb Vlastimil Babka: > >>> On 03/17/2016 07:45 PM, Greg KH wrote: > On Thu, Mar 17, 2016 at 07:38:03PM +0100, Stefan Priebe wrote: > > Hi, > > > > while running qemu 2.5 on a host running 4.4.6 the host system has > > crashed > > (load > 200) 3 times in the last 3 days. > > > > Always with this stack trace: (copy left here: > > http://pastebin.com/raw/bCWTLKyt) > > > > [69068.874268] divide error: [#1] SMP > > [69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT > > nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter > > ip6_tables > > ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4 > > nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack > > nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap macvlan > > kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm > > xt_multiport > > iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi > > scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp > > 8021q > > garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor > > raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid > > sd_mod > > ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb > > i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core > > mdio ptp > > pps_core > > [69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted > > 4.4.6+7-ph #1 > > [69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1 > > 03/17/2012 > > [69068.898578] task: 880fc7f28000 ti: 880fda2c4000 task.ti: > > 880fda2c4000 > > [69068.900377] RIP: 0010:[] [] > > task_h_load+0xcc/0x100 > >>> > >>> decodecode says: > >>> > >>> 27: 48 83 c1 01 add$0x1,%rcx > >>> 2b:* 48 f7 f1div%rcx <-- trapping > >>> instruction > >>> > >>> This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load: > >>> > >>> update_cfs_rq_h_load(cfs_rq); > >>> return div64_ul(p->se.avg.load_avg * cfs_rq->h_load, > >>> cfs_rq_load_avg(cfs_rq) + 1); > >>> > >>> So the load avg is -1, thus after adding 1 we get division by 0, huh? > >> > >> Yes CONFIG_FAIR_GROUP_SCHED is set. I cherry picked now all those commits > >> up > >> to 4.5 for fair.c: > >> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/kernel/sched/fair.c?h=v4.5 > >> > >> It didn't happen again with v4.4.6 + 4.5 patches for fair.c > > > > Ok, that's a lot of patches, how about figuring out which single patch, > > or shortest number of patches, makes things work again? > > will do so but it seems most out of those 9 patches are based on each > other. So it wouldn't be easy. Worst case, we take all 9. Best case, we only need one :) thanks, greg k-h
Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
Am 20.03.2016 um 22:41 schrieb Greg KH: > On Sun, Mar 20, 2016 at 10:27:23PM +0100, Stefan Priebe wrote: >> >> Am 19.03.2016 um 23:26 schrieb Vlastimil Babka: >>> On 03/17/2016 07:45 PM, Greg KH wrote: On Thu, Mar 17, 2016 at 07:38:03PM +0100, Stefan Priebe wrote: > Hi, > > while running qemu 2.5 on a host running 4.4.6 the host system has > crashed > (load > 200) 3 times in the last 3 days. > > Always with this stack trace: (copy left here: > http://pastebin.com/raw/bCWTLKyt) > > [69068.874268] divide error: [#1] SMP > [69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT > nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter > ip6_tables > ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4 > nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack > nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap macvlan > kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm > xt_multiport > iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi > scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp > 8021q > garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor > raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid > sd_mod > ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb > i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core > mdio ptp > pps_core > [69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted > 4.4.6+7-ph #1 > [69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1 > 03/17/2012 > [69068.898578] task: 880fc7f28000 ti: 880fda2c4000 task.ti: > 880fda2c4000 > [69068.900377] RIP: 0010:[] [] > task_h_load+0xcc/0x100 >>> >>> decodecode says: >>> >>> 27: 48 83 c1 01 add$0x1,%rcx >>> 2b:* 48 f7 f1div%rcx <-- trapping >>> instruction >>> >>> This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load: >>> >>> update_cfs_rq_h_load(cfs_rq); >>> return div64_ul(p->se.avg.load_avg * cfs_rq->h_load, >>> cfs_rq_load_avg(cfs_rq) + 1); >>> >>> So the load avg is -1, thus after adding 1 we get division by 0, huh? >> >> Yes CONFIG_FAIR_GROUP_SCHED is set. I cherry picked now all those commits up >> to 4.5 for fair.c: >> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/kernel/sched/fair.c?h=v4.5 >> >> It didn't happen again with v4.4.6 + 4.5 patches for fair.c > > Ok, that's a lot of patches, how about figuring out which single patch, > or shortest number of patches, makes things work again? will do so but it seems most out of those 9 patches are based on each other. So it wouldn't be easy. Stefan > > thanks, > > greg k-h >
Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
On Sun, Mar 20, 2016 at 10:27:23PM +0100, Stefan Priebe wrote: > > Am 19.03.2016 um 23:26 schrieb Vlastimil Babka: > >On 03/17/2016 07:45 PM, Greg KH wrote: > >>On Thu, Mar 17, 2016 at 07:38:03PM +0100, Stefan Priebe wrote: > >>>Hi, > >>> > >>>while running qemu 2.5 on a host running 4.4.6 the host system has > >>>crashed > >>>(load > 200) 3 times in the last 3 days. > >>> > >>>Always with this stack trace: (copy left here: > >>>http://pastebin.com/raw/bCWTLKyt) > >>> > >>>[69068.874268] divide error: [#1] SMP > >>>[69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT > >>>nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter > >>>ip6_tables > >>>ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4 > >>>nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack > >>>nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap macvlan > >>>kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm > >>>xt_multiport > >>>iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi > >>>scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp > >>>8021q > >>>garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor > >>>raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid > >>>sd_mod > >>>ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb > >>>i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core > >>>mdio ptp > >>>pps_core > >>>[69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted > >>>4.4.6+7-ph #1 > >>>[69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1 > >>>03/17/2012 > >>>[69068.898578] task: 880fc7f28000 ti: 880fda2c4000 task.ti: > >>>880fda2c4000 > >>>[69068.900377] RIP: 0010:[] [] > >>>task_h_load+0xcc/0x100 > > > >decodecode says: > > > > 27: 48 83 c1 01 add$0x1,%rcx > > 2b:* 48 f7 f1div%rcx <-- trapping > >instruction > > > >This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load: > > > > update_cfs_rq_h_load(cfs_rq); > > return div64_ul(p->se.avg.load_avg * cfs_rq->h_load, > > cfs_rq_load_avg(cfs_rq) + 1); > > > >So the load avg is -1, thus after adding 1 we get division by 0, huh? > > Yes CONFIG_FAIR_GROUP_SCHED is set. I cherry picked now all those commits up > to 4.5 for fair.c: > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/kernel/sched/fair.c?h=v4.5 > > It didn't happen again with v4.4.6 + 4.5 patches for fair.c Ok, that's a lot of patches, how about figuring out which single patch, or shortest number of patches, makes things work again? thanks, greg k-h
Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
Am 19.03.2016 um 23:26 schrieb Vlastimil Babka: On 03/17/2016 07:45 PM, Greg KH wrote: On Thu, Mar 17, 2016 at 07:38:03PM +0100, Stefan Priebe wrote: Hi, while running qemu 2.5 on a host running 4.4.6 the host system has crashed (load > 200) 3 times in the last 3 days. Always with this stack trace: (copy left here: http://pastebin.com/raw/bCWTLKyt) [69068.874268] divide error: [#1] SMP [69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap macvlan kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm xt_multiport iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp 8021q garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid sd_mod ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core mdio ptp pps_core [69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted 4.4.6+7-ph #1 [69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1 03/17/2012 [69068.898578] task: 880fc7f28000 ti: 880fda2c4000 task.ti: 880fda2c4000 [69068.900377] RIP: 0010:[] [] task_h_load+0xcc/0x100 decodecode says: 27: 48 83 c1 01 add$0x1,%rcx 2b:* 48 f7 f1div%rcx <-- trapping instruction This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load: update_cfs_rq_h_load(cfs_rq); return div64_ul(p->se.avg.load_avg * cfs_rq->h_load, cfs_rq_load_avg(cfs_rq) + 1); So the load avg is -1, thus after adding 1 we get division by 0, huh? Yes CONFIG_FAIR_GROUP_SCHED is set. I cherry picked now all those commits up to 4.5 for fair.c: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/kernel/sched/fair.c?h=v4.5 It didn't happen again with v4.4.6 + 4.5 patches for fair.c Greets, Stefan [69068.961763] RSP: :880fda2c7b50 EFLAGS: 00010257 [69069.023910] RAX: RBX: 880fda2c7c10 RCX: [69069.085953] RDX: RSI: 0001 RDI: 880fc7f28000 [69069.151731] RBP: 880fda2c7bc8 R08: 0001041955df R09: 880fffd153f8 [69069.213757] R10: 0009 R11: 0193 R12: 881f6832c780 [69069.274271] R13: 88203fc35380 R14: 0007 R15: 00b6 [69069.334727] FS: 7f578a3fb700() GS:880fffd0() knlGS: [69069.396435] CS: 0010 DS: ES: CR0: 80050033 [69069.458522] CR2: 7f5784f18468 CR3: 001fe9738000 CR4: 26e0 [69069.520799] Stack: [69069.581430] 860b6855 880fda2c7b78 0005 [69069.642629] 880fffd0 0015 fe7d 0015 [69069.702815] 00015380 880fda2c7bc8 880fc7f28000 01e9 [69069.761881] Call Trace: [69069.819883] [] ? task_numa_find_cpu+0x225/0x670 [69069.878368] [] task_numa_migrate+0x550/0x950 [69069.936059] [] ? find_next_bit+0x18/0x20 [69069.993262] [] numa_migrate_preferred+0x7d/0x90 [69070.050528] [] task_numa_fault+0x7c5/0xaa0 [69070.106544] [] ? mpol_misplaced+0x16b/0x1b0 [69070.163705] [] __handle_mm_fault+0x9ae/0x11f0 [69070.220013] [] ? inet_recvmsg+0x72/0x90 [69070.276558] [] ? SYSC_recvfrom+0x12b/0x170 [69070.332283] [] handle_mm_fault+0xdf/0x180 [69070.388515] [] __do_page_fault+0x164/0x380 [69070.443897] [] ? account_user_time+0x73/0x80 [69070.498534] [] ? vtime_account_user+0x4e/0x70 [69070.552598] [] do_page_fault+0x37/0x90 [69070.605960] [] ? syscall_return_slowpath+0x83/0xf0 [69070.660705] [] page_fault+0x28/0x30 [69070.715707] Code: 86 b8 00 00 00 48 89 86 b0 00 00 00 48 85 c9 75 ca 49 8b 81 b0 00 00 00 49 8b 49 78 31 d2 48 0f af 87 d8 01 00 00 5d 48 83 c1 01 <48> f7 f1 c3 4c 89 ce 48 8b 8e c0 00 00 00 48 8b 46 78 4c 89 86 [69070.835144] RIP [] task_h_load+0xcc/0x100 [69070.894095] RSP [69070.953213] ---[ end trace 8d6f449a03dacfd4 ]--- Would be nice if we can fix this in 4.4? Does this also happen in 4.5? Did this work on 4.4.5? Some previous release? If you can find the offending patch using 'git bisect', that would be great. thanks, greg k-h -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: mailto:"d...@kvack.org";> em...@kvack.org
Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
On Thu, Mar 17, 2016 at 07:38:03PM +0100, Stefan Priebe wrote: > Hi, > > while running qemu 2.5 on a host running 4.4.6 the host system has crashed > (load > 200) 3 times in the last 3 days. > > Always with this stack trace: (copy left here: > http://pastebin.com/raw/bCWTLKyt) > > [69068.874268] divide error: [#1] SMP > [69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT > nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables > ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4 > nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack > nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap macvlan > kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm xt_multiport > iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi > scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp 8021q > garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor > raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid sd_mod > ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb > i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core mdio ptp > pps_core > [69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted 4.4.6+7-ph #1 > [69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1 03/17/2012 > [69068.898578] task: 880fc7f28000 ti: 880fda2c4000 task.ti: > 880fda2c4000 > [69068.900377] RIP: 0010:[] [] > task_h_load+0xcc/0x100 > [69068.961763] RSP: :880fda2c7b50 EFLAGS: 00010257 > [69069.023910] RAX: RBX: 880fda2c7c10 RCX: > > [69069.085953] RDX: RSI: 0001 RDI: > 880fc7f28000 > [69069.151731] RBP: 880fda2c7bc8 R08: 0001041955df R09: > 880fffd153f8 > [69069.213757] R10: 0009 R11: 0193 R12: > 881f6832c780 > [69069.274271] R13: 88203fc35380 R14: 0007 R15: > 00b6 > [69069.334727] FS: 7f578a3fb700() GS:880fffd0() > knlGS: > [69069.396435] CS: 0010 DS: ES: CR0: 80050033 > [69069.458522] CR2: 7f5784f18468 CR3: 001fe9738000 CR4: > 26e0 > [69069.520799] Stack: > [69069.581430] 860b6855 880fda2c7b78 > 0005 > [69069.642629] 880fffd0 0015 fe7d > 0015 > [69069.702815] 00015380 880fda2c7bc8 880fc7f28000 > 01e9 > [69069.761881] Call Trace: > [69069.819883] [] ? task_numa_find_cpu+0x225/0x670 > [69069.878368] [] task_numa_migrate+0x550/0x950 > [69069.936059] [] ? find_next_bit+0x18/0x20 > [69069.993262] [] numa_migrate_preferred+0x7d/0x90 > [69070.050528] [] task_numa_fault+0x7c5/0xaa0 > [69070.106544] [] ? mpol_misplaced+0x16b/0x1b0 > [69070.163705] [] __handle_mm_fault+0x9ae/0x11f0 > [69070.220013] [] ? inet_recvmsg+0x72/0x90 > [69070.276558] [] ? SYSC_recvfrom+0x12b/0x170 > [69070.332283] [] handle_mm_fault+0xdf/0x180 > [69070.388515] [] __do_page_fault+0x164/0x380 > [69070.443897] [] ? account_user_time+0x73/0x80 > [69070.498534] [] ? vtime_account_user+0x4e/0x70 > [69070.552598] [] do_page_fault+0x37/0x90 > [69070.605960] [] ? syscall_return_slowpath+0x83/0xf0 > [69070.660705] [] page_fault+0x28/0x30 > [69070.715707] Code: 86 b8 00 00 00 48 89 86 b0 00 00 00 48 85 c9 75 ca 49 > 8b 81 b0 00 00 00 49 8b 49 78 31 d2 48 0f af 87 d8 01 00 00 5d 48 83 c1 01 > <48> f7 f1 c3 4c 89 ce 48 8b 8e c0 00 00 00 48 8b 46 78 4c 89 86 > [69070.835144] RIP [] task_h_load+0xcc/0x100 > [69070.894095] RSP > [69070.953213] ---[ end trace 8d6f449a03dacfd4 ]--- > > Would be nice if we can fix this in 4.4? Does this also happen in 4.5? Did this work on 4.4.5? Some previous release? If you can find the offending patch using 'git bisect', that would be great. thanks, greg k-h
Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
On 03/17/2016 07:45 PM, Greg KH wrote: On Thu, Mar 17, 2016 at 07:38:03PM +0100, Stefan Priebe wrote: Hi, while running qemu 2.5 on a host running 4.4.6 the host system has crashed (load > 200) 3 times in the last 3 days. Always with this stack trace: (copy left here: http://pastebin.com/raw/bCWTLKyt) [69068.874268] divide error: [#1] SMP [69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap macvlan kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm xt_multiport iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp 8021q garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid sd_mod ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core mdio ptp pps_core [69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted 4.4.6+7-ph #1 [69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1 03/17/2012 [69068.898578] task: 880fc7f28000 ti: 880fda2c4000 task.ti: 880fda2c4000 [69068.900377] RIP: 0010:[] [] task_h_load+0xcc/0x100 decodecode says: 27: 48 83 c1 01 add$0x1,%rcx 2b:* 48 f7 f1div%rcx <-- trapping instruction This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load: update_cfs_rq_h_load(cfs_rq); return div64_ul(p->se.avg.load_avg * cfs_rq->h_load, cfs_rq_load_avg(cfs_rq) + 1); So the load avg is -1, thus after adding 1 we get division by 0, huh? [69068.961763] RSP: :880fda2c7b50 EFLAGS: 00010257 [69069.023910] RAX: RBX: 880fda2c7c10 RCX: [69069.085953] RDX: RSI: 0001 RDI: 880fc7f28000 [69069.151731] RBP: 880fda2c7bc8 R08: 0001041955df R09: 880fffd153f8 [69069.213757] R10: 0009 R11: 0193 R12: 881f6832c780 [69069.274271] R13: 88203fc35380 R14: 0007 R15: 00b6 [69069.334727] FS: 7f578a3fb700() GS:880fffd0() knlGS: [69069.396435] CS: 0010 DS: ES: CR0: 80050033 [69069.458522] CR2: 7f5784f18468 CR3: 001fe9738000 CR4: 26e0 [69069.520799] Stack: [69069.581430] 860b6855 880fda2c7b78 0005 [69069.642629] 880fffd0 0015 fe7d 0015 [69069.702815] 00015380 880fda2c7bc8 880fc7f28000 01e9 [69069.761881] Call Trace: [69069.819883] [] ? task_numa_find_cpu+0x225/0x670 [69069.878368] [] task_numa_migrate+0x550/0x950 [69069.936059] [] ? find_next_bit+0x18/0x20 [69069.993262] [] numa_migrate_preferred+0x7d/0x90 [69070.050528] [] task_numa_fault+0x7c5/0xaa0 [69070.106544] [] ? mpol_misplaced+0x16b/0x1b0 [69070.163705] [] __handle_mm_fault+0x9ae/0x11f0 [69070.220013] [] ? inet_recvmsg+0x72/0x90 [69070.276558] [] ? SYSC_recvfrom+0x12b/0x170 [69070.332283] [] handle_mm_fault+0xdf/0x180 [69070.388515] [] __do_page_fault+0x164/0x380 [69070.443897] [] ? account_user_time+0x73/0x80 [69070.498534] [] ? vtime_account_user+0x4e/0x70 [69070.552598] [] do_page_fault+0x37/0x90 [69070.605960] [] ? syscall_return_slowpath+0x83/0xf0 [69070.660705] [] page_fault+0x28/0x30 [69070.715707] Code: 86 b8 00 00 00 48 89 86 b0 00 00 00 48 85 c9 75 ca 49 8b 81 b0 00 00 00 49 8b 49 78 31 d2 48 0f af 87 d8 01 00 00 5d 48 83 c1 01 <48> f7 f1 c3 4c 89 ce 48 8b 8e c0 00 00 00 48 8b 46 78 4c 89 86 [69070.835144] RIP [] task_h_load+0xcc/0x100 [69070.894095] RSP [69070.953213] ---[ end trace 8d6f449a03dacfd4 ]--- Would be nice if we can fix this in 4.4? Does this also happen in 4.5? Did this work on 4.4.5? Some previous release? If you can find the offending patch using 'git bisect', that would be great. thanks, greg k-h -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: mailto:"d...@kvack.org";> em...@kvack.org
divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
Hi, while running qemu 2.5 on a host running 4.4.6 the host system has crashed (load > 200) 3 times in the last 3 days. Always with this stack trace: (copy left here: http://pastebin.com/raw/bCWTLKyt) [69068.874268] divide error: [#1] SMP [69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap macvlan kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm xt_multiport iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp 8021q garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid sd_mod ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core mdio ptp pps_core [69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted 4.4.6+7-ph #1 [69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1 03/17/2012 [69068.898578] task: 880fc7f28000 ti: 880fda2c4000 task.ti: 880fda2c4000 [69068.900377] RIP: 0010:[] [] task_h_load+0xcc/0x100 [69068.961763] RSP: :880fda2c7b50 EFLAGS: 00010257 [69069.023910] RAX: RBX: 880fda2c7c10 RCX: [69069.085953] RDX: RSI: 0001 RDI: 880fc7f28000 [69069.151731] RBP: 880fda2c7bc8 R08: 0001041955df R09: 880fffd153f8 [69069.213757] R10: 0009 R11: 0193 R12: 881f6832c780 [69069.274271] R13: 88203fc35380 R14: 0007 R15: 00b6 [69069.334727] FS: 7f578a3fb700() GS:880fffd0() knlGS: [69069.396435] CS: 0010 DS: ES: CR0: 80050033 [69069.458522] CR2: 7f5784f18468 CR3: 001fe9738000 CR4: 26e0 [69069.520799] Stack: [69069.581430] 860b6855 880fda2c7b78 0005 [69069.642629] 880fffd0 0015 fe7d 0015 [69069.702815] 00015380 880fda2c7bc8 880fc7f28000 01e9 [69069.761881] Call Trace: [69069.819883] [] ? task_numa_find_cpu+0x225/0x670 [69069.878368] [] task_numa_migrate+0x550/0x950 [69069.936059] [] ? find_next_bit+0x18/0x20 [69069.993262] [] numa_migrate_preferred+0x7d/0x90 [69070.050528] [] task_numa_fault+0x7c5/0xaa0 [69070.106544] [] ? mpol_misplaced+0x16b/0x1b0 [69070.163705] [] __handle_mm_fault+0x9ae/0x11f0 [69070.220013] [] ? inet_recvmsg+0x72/0x90 [69070.276558] [] ? SYSC_recvfrom+0x12b/0x170 [69070.332283] [] handle_mm_fault+0xdf/0x180 [69070.388515] [] __do_page_fault+0x164/0x380 [69070.443897] [] ? account_user_time+0x73/0x80 [69070.498534] [] ? vtime_account_user+0x4e/0x70 [69070.552598] [] do_page_fault+0x37/0x90 [69070.605960] [] ? syscall_return_slowpath+0x83/0xf0 [69070.660705] [] page_fault+0x28/0x30 [69070.715707] Code: 86 b8 00 00 00 48 89 86 b0 00 00 00 48 85 c9 75 ca 49 8b 81 b0 00 00 00 49 8b 49 78 31 d2 48 0f af 87 d8 01 00 00 5d 48 83 c1 01 <48> f7 f1 c3 4c 89 ce 48 8b 8e c0 00 00 00 48 8b 46 78 4c 89 86 [69070.835144] RIP [] task_h_load+0xcc/0x100 [69070.894095] RSP [69070.953213] ---[ end trace 8d6f449a03dacfd4 ]--- Would be nice if we can fix this in 4.4? Greets, Stefan