Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6

2016-07-12 Thread Greg KH
On Tue, Jul 12, 2016 at 03:12:35PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 11, 2016 at 03:33:53PM -0700, Greg KH wrote:
> 
> > Oops, this commit does not apply cleanly to 4.6 or 4.4-stable trees.
> > Can someone send me the backported verision that they have tested to
> > work properly so I can queue it up?
> 
> I've never actually been able to reproduce, but the attached patches
> apply, the reject was trivial.
> 
> They seem to compile and boot on my main test rig, but nothing else was
> done but build the next kernel with it.

Thanks for these, now applied.

greg k-h


Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6

2016-07-12 Thread Peter Zijlstra
On Mon, Jul 11, 2016 at 03:33:53PM -0700, Greg KH wrote:

> Oops, this commit does not apply cleanly to 4.6 or 4.4-stable trees.
> Can someone send me the backported verision that they have tested to
> work properly so I can queue it up?

I've never actually been able to reproduce, but the attached patches
apply, the reject was trivial.

They seem to compile and boot on my main test rig, but nothing else was
done but build the next kernel with it.


commit 8974189222159154c55f24ddad33e3613960521a
Author: Peter Zijlstra 
Date:   Thu Jun 16 10:50:40 2016 +0200

sched/fair: Fix cfs_rq avg tracking underflow

As per commit:

  b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization")

> the code generated from update_cfs_rq_load_avg():
>
> 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
> 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
> 		sa->load_avg = max_t(long, sa->load_avg - r, 0);
> 		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
> 		removed_load = 1;
> 	}
>
> turns into:
>
> 81087064:   49 8b 85 98 00 00 00mov0x98(%r13),%rax
> 8108706b:   48 85 c0test   %rax,%rax
> 8108706e:   74 40   je 810870b0 
> 81087070:   4c 89 f8mov%r15,%rax
> 81087073:   49 87 85 98 00 00 00xchg   %rax,0x98(%r13)
> 8108707a:   49 29 45 70 sub%rax,0x70(%r13)
> 8108707e:   4c 89 f9mov%r15,%rcx
> 81087081:   bb 01 00 00 00  mov$0x1,%ebx
> 81087086:   49 83 7d 70 00  cmpq   $0x0,0x70(%r13)
> 8108708b:   49 0f 49 4d 70  cmovns 0x70(%r13),%rcx
>
> Which you'll note ends up with sa->load_avg -= r in memory at
> 8108707a.

So I _should_ have looked at other unserialized users of ->load_avg,
but alas. Luckily nikbor reported a similar /0 from task_h_load() which
instantly triggered recollection of this here problem.

Aside from the intermediate value hitting memory and causing problems,
there's another problem: the underflow detection relies on the signed
bit. This reduces the effective width of the variables, IOW its
effectively the same as having these variables be of signed type.

This patch changes to a different means of unsigned underflow
detection to not rely on the signed bit. This allows the variables to
use the 'full' unsigned range. And it does so with explicit LOAD -
STORE to ensure any intermediate value will never be visible in
memory, allowing these unserialized loads.

Note: GCC generates crap code for this, might warrant a look later.

Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
   maybe we should do clamping on add too.

Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrey Ryabinin 
Cc: Chris Wilson 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Yuyang Du 
Cc: bseg...@google.com
Cc: ker...@kyup.com
Cc: morten.rasmus...@arm.com
Cc: p...@google.com
Cc: steve.muc...@linaro.org
Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
Link: http://lkml.kernel.org/r/20160617091948.gj30...@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar 

---
 kernel/sched/fair.c |   33 +
 1 file changed, 25 insertions(+), 8 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2682,6 +2682,23 @@ static inline void update_tg_load_avg(st
 
 static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
 
+/*
+ * Unsigned subtract and clamp on underflow.
+ *
+ * Explicitly do a load-store to ensure the intermediate value never hits
+ * memory. This allows lockless observations without ever seeing the negative
+ * values.
+ */
+#define sub_positive(_ptr, _val) do {\
+	typeof(_ptr) ptr = (_ptr);\
+	typeof(*ptr) val = (_val);\
+	typeof(*ptr) res, var = READ_ONCE(*ptr);		\
+	res = var - val;	\
+	if (res > var)		\
+		res = 0;	\
+	WRITE_ONCE(*ptr, res);	\
+} while (0)
+
 /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */
 static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 {
@@ -2690,15 +2707,15 @@ static inline int update_cfs_rq_load_avg
 
 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
-		sa->load_avg = max_t(long, sa->load_avg - r, 0);
-		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
+		sub_positive(&sa->load_avg, r);
+		sub_positive(&sa->load_sum, r * LOAD_AVG_MAX);
 		removed = 1;
 	}
 
 	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
 		long r = atomic_long_xchg

Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6

2016-07-11 Thread Greg KH
On Thu, Jul 07, 2016 at 09:42:32AM +0200, Peter Zijlstra wrote:
> On Thu, Jul 07, 2016 at 11:20:36AM +1200, Campbell Steven wrote:
> 
> > > commit 8974189222159154c55f24ddad33e3613960521a
> > > Author: Peter Zijlstra 
> > > Date:   Thu Jun 16 10:50:40 2016 +0200
> 
> > Since these early reports from Stefan and I it looks like it's been
> > hit but alot more folks now so I'd like to ask what the process is for
> > getting this backported into 4.6, 4.5 and 4.4 as in our testing all
> > those versions for their latest point release seem to have the same
> > problem.
> 
> I think this should do; Greg is on Cc and will mark the commit
> somewhere. It is already in Linus' tree and should indeed be sufficient.
> 
> It has a Fixes tag referring the commit that introduced it, which IIRC
> is somewhere around v4.2.
> 
> Greg, anything else required?

Oops, this commit does not apply cleanly to 4.6 or 4.4-stable trees.
Can someone send me the backported verision that they have tested to
work properly so I can queue it up?

thanks,

greg k-h


Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6

2016-07-08 Thread Greg KH
On Thu, Jul 07, 2016 at 09:42:32AM +0200, Peter Zijlstra wrote:
> On Thu, Jul 07, 2016 at 11:20:36AM +1200, Campbell Steven wrote:
> 
> > > commit 8974189222159154c55f24ddad33e3613960521a
> > > Author: Peter Zijlstra 
> > > Date:   Thu Jun 16 10:50:40 2016 +0200
> 
> > Since these early reports from Stefan and I it looks like it's been
> > hit but alot more folks now so I'd like to ask what the process is for
> > getting this backported into 4.6, 4.5 and 4.4 as in our testing all
> > those versions for their latest point release seem to have the same
> > problem.
> 
> I think this should do; Greg is on Cc and will mark the commit
> somewhere. It is already in Linus' tree and should indeed be sufficient.
> 
> It has a Fixes tag referring the commit that introduced it, which IIRC
> is somewhere around v4.2.
> 
> Greg, anything else required?

Nope, that should be fine.

greg k-h


Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6

2016-07-07 Thread Peter Zijlstra
On Thu, Jul 07, 2016 at 11:20:36AM +1200, Campbell Steven wrote:

> > commit 8974189222159154c55f24ddad33e3613960521a
> > Author: Peter Zijlstra 
> > Date:   Thu Jun 16 10:50:40 2016 +0200

> Since these early reports from Stefan and I it looks like it's been
> hit but alot more folks now so I'd like to ask what the process is for
> getting this backported into 4.6, 4.5 and 4.4 as in our testing all
> those versions for their latest point release seem to have the same
> problem.

I think this should do; Greg is on Cc and will mark the commit
somewhere. It is already in Linus' tree and should indeed be sufficient.

It has a Fixes tag referring the commit that introduced it, which IIRC
is somewhere around v4.2.

Greg, anything else required?


Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6

2016-07-06 Thread Campbell Steven
On 22 June 2016 at 18:13, Peter Zijlstra  wrote:
> On Wed, Jun 22, 2016 at 01:19:54PM +1200, Campbell Steven wrote:
>> >>> This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load:
>> >>>
>> >>> update_cfs_rq_h_load(cfs_rq);
>> >>> return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
>> >>> cfs_rq_load_avg(cfs_rq) + 1);
>> >>>
>
>
> ---
> commit 8974189222159154c55f24ddad33e3613960521a
> Author: Peter Zijlstra 
> Date:   Thu Jun 16 10:50:40 2016 +0200
>
> sched/fair: Fix cfs_rq avg tracking underflow
>
> As per commit:
>
>   b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() 
> serialization")
>
> > the code generated from update_cfs_rq_load_avg():
> >
> >   if (atomic_long_read(&cfs_rq->removed_load_avg)) {
> >   s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
> >   sa->load_avg = max_t(long, sa->load_avg - r, 0);
> >   sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
> >   removed_load = 1;
> >   }


Hi Peter,

I just wanted to report back to say thanks for this, and we have (and
others) have tested this out in 4.7 rc6 and have not been able to
repeat the issue. It seems that anyone running busy ceph osd's or high
load KVM instances is able to trigger this on a dual socket box pretty
easily.

Since these early reports from Stefan and I it looks like it's been
hit but alot more folks now so I'd like to ask what the process is for
getting this backported into 4.6, 4.5 and 4.4 as in our testing all
those versions for their latest point release seem to have the same
problem.

Thanks

Campbell


Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6

2016-06-22 Thread Yannis Aribaud
21 juin 2016 14:13 "Yannis Aribaud"  a écrit:
> Hi everyone,
> 
> I recently it this bug in the kernel using a vanilla 4.6.2 release.
> It seems that somewhere in the load average calculation a division by 0 
> occurs (see the stack trace
> at the end).
>
> [snipped]
> 
> I'm not an expert at all but I suspect that is the issue's origin. Shouldn't 
> the function
> cfs_rq_load_avg use an atomic_long_read() to avoid this ?

After digging a bit more, this can't be the problem as this function obviously 
can't return negative value.

I found that it can maybe come from the update_cfs_rq_load_avg function in the 
following block:

if (atomic_long_read(&cfs_rq->removed_load_avg)) {
s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
sa->load_avg = max_t(long, sa->load_avg - r, 0);
sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
removed_load = 1;
}

The max_t(long, sa->load_avg - r, 0) can result in a negative value keeped by 
the max_t function as the long would wrap up then generate a division by zero 
in task_h_load function.

Best regards,
--
Yannis Aribaud


Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6

2016-06-22 Thread Peter Zijlstra
On Wed, Jun 22, 2016 at 01:19:54PM +1200, Campbell Steven wrote:
> >>> This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load:
> >>>
> >>> update_cfs_rq_h_load(cfs_rq);
> >>> return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
> >>> cfs_rq_load_avg(cfs_rq) + 1);
> >>>


---
commit 8974189222159154c55f24ddad33e3613960521a
Author: Peter Zijlstra 
Date:   Thu Jun 16 10:50:40 2016 +0200

sched/fair: Fix cfs_rq avg tracking underflow

As per commit:

  b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization")

> the code generated from update_cfs_rq_load_avg():
>
>   if (atomic_long_read(&cfs_rq->removed_load_avg)) {
>   s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
>   sa->load_avg = max_t(long, sa->load_avg - r, 0);
>   sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
>   removed_load = 1;
>   }
>
> turns into:
>
> 81087064:   49 8b 85 98 00 00 00mov0x98(%r13),%rax
> 8108706b:   48 85 c0test   %rax,%rax
> 8108706e:   74 40   je 810870b0 

> 81087070:   4c 89 f8mov%r15,%rax
> 81087073:   49 87 85 98 00 00 00xchg   %rax,0x98(%r13)
> 8108707a:   49 29 45 70 sub%rax,0x70(%r13)
> 8108707e:   4c 89 f9mov%r15,%rcx
> 81087081:   bb 01 00 00 00  mov$0x1,%ebx
> 81087086:   49 83 7d 70 00  cmpq   $0x0,0x70(%r13)
> 8108708b:   49 0f 49 4d 70  cmovns 0x70(%r13),%rcx
>
> Which you'll note ends up with sa->load_avg -= r in memory at
> 8108707a.

So I _should_ have looked at other unserialized users of ->load_avg,
but alas. Luckily nikbor reported a similar /0 from task_h_load() which
instantly triggered recollection of this here problem.

Aside from the intermediate value hitting memory and causing problems,
there's another problem: the underflow detection relies on the signed
bit. This reduces the effective width of the variables, IOW its
effectively the same as having these variables be of signed type.

This patch changes to a different means of unsigned underflow
detection to not rely on the signed bit. This allows the variables to
use the 'full' unsigned range. And it does so with explicit LOAD -
STORE to ensure any intermediate value will never be visible in
memory, allowing these unserialized loads.

Note: GCC generates crap code for this, might warrant a look later.

Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
   maybe we should do clamping on add too.

Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrey Ryabinin 
Cc: Chris Wilson 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Yuyang Du 
Cc: bseg...@google.com
Cc: ker...@kyup.com
Cc: morten.rasmus...@arm.com
Cc: p...@google.com
Cc: steve.muc...@linaro.org
Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization 
average tracking")
Link: 
http://lkml.kernel.org/r/20160617091948.gj30...@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar 

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a2348deab7a3..2ae68f0e3bf5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2904,6 +2904,23 @@ static inline void cfs_rq_util_change(struct cfs_rq 
*cfs_rq)
}
 }
 
+/*
+ * Unsigned subtract and clamp on underflow.
+ *
+ * Explicitly do a load-store to ensure the intermediate value never hits
+ * memory. This allows lockless observations without ever seeing the negative
+ * values.
+ */
+#define sub_positive(_ptr, _val) do {  \
+   typeof(_ptr) ptr = (_ptr);  \
+   typeof(*ptr) val = (_val);  \
+   typeof(*ptr) res, var = READ_ONCE(*ptr);\
+   res = var - val;\
+   if (res > var)  \
+   res = 0;\
+   WRITE_ONCE(*ptr, res);  \
+} while (0)
+
 /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */
 static inline int
 update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
@@ -2913,15 +2930,15 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, 
bool update_freq)
 
if (atomic_long_read(&cfs_rq->removed_load_avg)) {
s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
-   sa->load_avg = max_t(long, sa->load_avg - r, 0);
-   sa-

Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6

2016-06-21 Thread Campbell Steven
On 17 May 2016 at 21:21, Campbell Steven  wrote:
> Thanks Stefan,
>
> I am seeing this on 4.5.0 and 4.5.4 both are compiled from mainline
> neither include any patches over and above the tree. I ran for well
> over a month in production on 4.5.0 with no issues at all on a single
> socket server (E5-2670 v3 @ 2.30GHz) but as soon as we try to run
> either 4.5.0 or 4.5.4 on a dual socket box we get these traces after
> 12 or so hours of uptime and the same sky rocketing load. Given you
> have stopped your issues by applying those patches to 4.4 it's
> possible I have different issues, but the symptoms are too similar to
> ignore, our trace is here:
>
> http://pastebin.com/Q7nd8nfP
>
> I am using the cfg scheduler, and not the fair scheduler as Stefan
> reported so that is another difference.
>
> Happy to proceed with any suggestions or further information that may
> help isolate this, use case on these servers are 12 x CEPH osd's with
> bcache so they are seeing alot of disk activity via NVMe and SATA
> buses.
>
> Thanks
>
> Campbell
>
> On 17 May 2016 at 18:01, Stefan Priebe - Profihost AG
>  wrote:
>> Am 21.03.2016 um 14:38 schrieb Greg KH:
>>> On Mon, Mar 21, 2016 at 11:52:23AM +0100, Stefan Priebe - Profihost AG 
>>> wrote:

 Am 20.03.2016 um 22:41 schrieb Greg KH:
> On Sun, Mar 20, 2016 at 10:27:23PM +0100, Stefan Priebe wrote:
>>
>> Am 19.03.2016 um 23:26 schrieb Vlastimil Babka:
>>> On 03/17/2016 07:45 PM, Greg KH wrote:
 On Thu, Mar 17, 2016 at 07:38:03PM +0100, Stefan Priebe wrote:
> Hi,
>
> while running qemu 2.5 on a host running 4.4.6 the host system has
> crashed
> (load > 200) 3 times in the last 3 days.
>
> Always with this stack trace: (copy left here:
> http://pastebin.com/raw/bCWTLKyt)
>
> [69068.874268] divide error:  [#1] SMP
> [69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT
> nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter
> ip6_tables
> ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4
> nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack
> nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap 
> macvlan
> kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm
> xt_multiport
> iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi
> scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp
> 8021q
> garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor
> raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid
> sd_mod
> ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb
> i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core
> mdio ptp
> pps_core
> [69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted
> 4.4.6+7-ph #1
> [69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1
> 03/17/2012
> [69068.898578] task: 880fc7f28000 ti: 880fda2c4000 task.ti:
> 880fda2c4000
> [69068.900377] RIP: 0010:[]  []
> task_h_load+0xcc/0x100
>>>
>>> decodecode says:
>>>
>>>   27:   48 83 c1 01 add$0x1,%rcx
>>>   2b:*  48 f7 f1div%rcx <-- trapping
>>> instruction
>>>
>>> This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load:
>>>
>>> update_cfs_rq_h_load(cfs_rq);
>>> return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
>>> cfs_rq_load_avg(cfs_rq) + 1);
>>>
>>> So the load avg is -1, thus after adding 1 we get division by 0, huh?
>>
>> Yes CONFIG_FAIR_GROUP_SCHED is set. I cherry picked now all those 
>> commits up
>> to 4.5 for fair.c:
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/kernel/sched/fair.c?h=v4.5
>>
>> It didn't happen again with v4.4.6 + 4.5 patches for fair.c
>
> Ok, that's a lot of patches, how about figuring out which single patch,
> or shortest number of patches, makes things work again?

 will do so but it seems most out of those 9 patches are based on each
 other. So it wouldn't be easy.
>>>
>>> Worst case, we take all 9.  Best case, we only need one :)
>>
>> Today i got another report from: casteven (at) gmail.com having the same
>> issue. I wasn't able to figure out which of those 9 patches helps - i
>> still can just tell that it does not happen with this series on top of 4.4.
>>
>> Stefan

Just reinvigorating this thread based off yesterdays post by Yannis
here: https://lkml.org/lkml/headers/2016/6/21/491 of which the trace
is more or less is identical to what we see.

In addition to this we have since tested the 4.6.x releases and have
seen the same

Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6

2016-06-21 Thread Yannis Aribaud
Hi everyone,

I recently it this bug in the kernel using a vanilla 4.6.2 release.
It seems that somewhere in the load average calculation a division by 0 occurs 
(see the stack trace
at the end).

After digging a bit (be fair it's my first time) in the kernel sources, I found 
that we "recently"
added the function cfs_rq_load_avg (commit 
6f2b04524f0b38bfbb8413f98d2d6af234508309) and started
using it in the function task_h_load which do a division with the value returned
(kernel/sched/fair.c) like this:

static unsigned long task_h_load(struct task_struct *p)
{
struct cfs_rq *cfs_rq = task_cfs_rq(p);

update_cfs_rq_h_load(cfs_rq);
return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
cfs_rq_load_avg(cfs_rq) + 1);
}

But the load_avg filed from sched_avg struct is an atomic_long_t and the 
cfs_rq_load_avg returns
this field as an unsigned long without doing any type conversion.

static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
{
return cfs_rq->avg.load_avg;
}

I'm not an expert at all but I suspect that is the issue's origin. Shouldn't 
the function
cfs_rq_load_avg use an atomic_long_read() to avoid this ?

Here is the stack trace:

[534814.112500] divide error:  [#1] SMP
[534814.112550] Modules linked in: vhost_net vhost macvtap macvlan ipmi_si 
mpt3sas raid_class
scsi_transport_sas ipmi_devintf dell_rbu tun nfsd auth_rpcgss oid_registry 
nfs_acl nfs lockd grace
fscache sunrpc bridge 8021q garp mrp stp llc bonding xfs libcrc32c bcache 
usbhid hid uhci_hcd
ohci_hcd x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass 
ghash_clmulni_intel iTCO_wdt
iTCO_vendor_support sha256_generic hmac drbg dcdbas ansi_cprng aesni_intel 
aes_x86_64 ablk_helper
cryptd lrw gf128mul glue_helper shpchp evdev sb_edac edac_core ehci_pci 
ehci_hcd lpc_ich usbcore
mfd_core usb_common ipmi_msghandler acpi_cpufreq wmi tpm_tis tpm processor 
acpi_power_meter button
ext4 crc16 jbd2 mbcache sg sd_mod dm_mod crc32c_intel igb megaraid_sas 
i2c_algo_bit i2c_core dca
ptp scsi_mod pps_core [last unloaded: ipmi_si]
[534814.113345] CPU: 10 PID: 38568 Comm: ceph-osd Not tainted 4.6.2-ig1virt #16
[534814.113390] Hardware name: Dell Inc. PowerEdge R730xd/0H21J3, BIOS 1.1.4 
11/03/2014
[534814.113458] task: 88100cf5ef00 ti: 8814827e task.ti: 
8814827e
[534814.113525] RIP: 0010:[] [] 
task_h_load+0x4f/0xc7
[534814.113613] RSP: :8814827e3c00 EFLAGS: 00010256
[534814.113654] RAX:  RBX: 00d7 RCX: 

[534814.113720] RDX:  RSI: 88103d8a5f00 RDI: 
88100cf5ef00
[534814.113786] RBP: 8814827e3c90 R08: 000107f70c76 R09: 

[534814.113851] R10: 0001 R11:  R12: 
0001
[534814.113917] R13: 0015 R14:  R15: 
88207ec14580
[534814.113984] FS: 7eff83cbb700() GS:88107f4a() 
knlGS:
[534814.114053] CS: 0010 DS:  ES:  CR0: 80050033
[534814.114095] CR2: 146b07c0 CR3: 00103f605000 CR4: 
001426e0
[534814.119456] Stack:
[534814.119488] 8106fc4c 88103d639400 00d7 
00014580
[534814.119571] fe19 88107f4a 00d7 
0027
[534814.119653] 88100cf5ef00 025f 0100 
0188
[534814.119736] Call Trace:
[534814.119774] [] ? task_numa_find_cpu+0x1d2/0x2ec
[534814.119819] [] ? task_numa_migrate+0x120/0x328
[534814.119864] [] ? ttwu_do_wakeup+0xf/0xcd
[534814.119907] [] ? task_numa_fault+0x912/0x9a9
[534814.119954] [] ? mpol_misplaced+0x138/0x14a
[534814.120001] [] ? handle_mm_fault+0xe28/0xf31
[534814.120046] [] ? fput+0xd/0x81
[534814.120087] [] ? __do_page_fault+0x425/0x485
[534814.120131] [] ? page_fault+0x22/0x30
[534814.120171] Code: 63 92 38 09 00 00 48 8b 80 b8 00 00 00 48 8b 04 d0 75 1c 
48 8b 86 b0 00 00 00
48 8b 4e 78 31 d2 48 0f af 87 58 01 00 00 48 ff c1 <48> f7 f1 c3 48 c7 86 c0 00 
00 00 00 00 00 00
48 89 f1 eb 18 48
[534814.120582] RIP [] task_h_load+0x4f/0xc7
[534814.120628] RSP 
[534814.121242] ---[ end trace ca72a3c25fb6f0dc ]---

Best regards,
--
Yannis Aribaud
-- 
Yannis Aribaud


Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6

2016-05-17 Thread Campbell Steven
Thanks Stefan,

I am seeing this on 4.5.0 and 4.5.4 both are compiled from mainline
neither include any patches over and above the tree. I ran for well
over a month in production on 4.5.0 with no issues at all on a single
socket server (E5-2670 v3 @ 2.30GHz) but as soon as we try to run
either 4.5.0 or 4.5.4 on a dual socket box we get these traces after
12 or so hours of uptime and the same sky rocketing load. Given you
have stopped your issues by applying those patches to 4.4 it's
possible I have different issues, but the symptoms are too similar to
ignore, our trace is here:

http://pastebin.com/Q7nd8nfP

I am using the cfg scheduler, and not the fair scheduler as Stefan
reported so that is another difference.

Happy to proceed with any suggestions or further information that may
help isolate this, use case on these servers are 12 x CEPH osd's with
bcache so they are seeing alot of disk activity via NVMe and SATA
buses.

Thanks

Campbell

On 17 May 2016 at 18:01, Stefan Priebe - Profihost AG
 wrote:
> Am 21.03.2016 um 14:38 schrieb Greg KH:
>> On Mon, Mar 21, 2016 at 11:52:23AM +0100, Stefan Priebe - Profihost AG wrote:
>>>
>>> Am 20.03.2016 um 22:41 schrieb Greg KH:
 On Sun, Mar 20, 2016 at 10:27:23PM +0100, Stefan Priebe wrote:
>
> Am 19.03.2016 um 23:26 schrieb Vlastimil Babka:
>> On 03/17/2016 07:45 PM, Greg KH wrote:
>>> On Thu, Mar 17, 2016 at 07:38:03PM +0100, Stefan Priebe wrote:
 Hi,

 while running qemu 2.5 on a host running 4.4.6 the host system has
 crashed
 (load > 200) 3 times in the last 3 days.

 Always with this stack trace: (copy left here:
 http://pastebin.com/raw/bCWTLKyt)

 [69068.874268] divide error:  [#1] SMP
 [69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT
 nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter
 ip6_tables
 ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4
 nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack
 nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap macvlan
 kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm
 xt_multiport
 iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi
 scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp
 8021q
 garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor
 raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid
 sd_mod
 ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb
 i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core
 mdio ptp
 pps_core
 [69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted
 4.4.6+7-ph #1
 [69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1
 03/17/2012
 [69068.898578] task: 880fc7f28000 ti: 880fda2c4000 task.ti:
 880fda2c4000
 [69068.900377] RIP: 0010:[]  []
 task_h_load+0xcc/0x100
>>
>> decodecode says:
>>
>>   27:   48 83 c1 01 add$0x1,%rcx
>>   2b:*  48 f7 f1div%rcx <-- trapping
>> instruction
>>
>> This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load:
>>
>> update_cfs_rq_h_load(cfs_rq);
>> return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
>> cfs_rq_load_avg(cfs_rq) + 1);
>>
>> So the load avg is -1, thus after adding 1 we get division by 0, huh?
>
> Yes CONFIG_FAIR_GROUP_SCHED is set. I cherry picked now all those commits 
> up
> to 4.5 for fair.c:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/kernel/sched/fair.c?h=v4.5
>
> It didn't happen again with v4.4.6 + 4.5 patches for fair.c

 Ok, that's a lot of patches, how about figuring out which single patch,
 or shortest number of patches, makes things work again?
>>>
>>> will do so but it seems most out of those 9 patches are based on each
>>> other. So it wouldn't be easy.
>>
>> Worst case, we take all 9.  Best case, we only need one :)
>
> Today i got another report from: casteven (at) gmail.com having the same
> issue. I wasn't able to figure out which of those 9 patches helps - i
> still can just tell that it does not happen with this series on top of 4.4.
>
> Stefan


Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6

2016-05-16 Thread Stefan Priebe - Profihost AG
Am 21.03.2016 um 14:38 schrieb Greg KH:
> On Mon, Mar 21, 2016 at 11:52:23AM +0100, Stefan Priebe - Profihost AG wrote:
>>
>> Am 20.03.2016 um 22:41 schrieb Greg KH:
>>> On Sun, Mar 20, 2016 at 10:27:23PM +0100, Stefan Priebe wrote:

 Am 19.03.2016 um 23:26 schrieb Vlastimil Babka:
> On 03/17/2016 07:45 PM, Greg KH wrote:
>> On Thu, Mar 17, 2016 at 07:38:03PM +0100, Stefan Priebe wrote:
>>> Hi,
>>>
>>> while running qemu 2.5 on a host running 4.4.6 the host system has
>>> crashed
>>> (load > 200) 3 times in the last 3 days.
>>>
>>> Always with this stack trace: (copy left here:
>>> http://pastebin.com/raw/bCWTLKyt)
>>>
>>> [69068.874268] divide error:  [#1] SMP
>>> [69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT
>>> nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter
>>> ip6_tables
>>> ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4
>>> nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack
>>> nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap macvlan
>>> kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm
>>> xt_multiport
>>> iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi
>>> scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp
>>> 8021q
>>> garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor
>>> raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid
>>> sd_mod
>>> ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb
>>> i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core
>>> mdio ptp
>>> pps_core
>>> [69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted
>>> 4.4.6+7-ph #1
>>> [69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1
>>> 03/17/2012
>>> [69068.898578] task: 880fc7f28000 ti: 880fda2c4000 task.ti:
>>> 880fda2c4000
>>> [69068.900377] RIP: 0010:[]  []
>>> task_h_load+0xcc/0x100
>
> decodecode says:
>
>   27:   48 83 c1 01 add$0x1,%rcx
>   2b:*  48 f7 f1div%rcx <-- trapping
> instruction
>
> This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load:
>
> update_cfs_rq_h_load(cfs_rq);
> return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
> cfs_rq_load_avg(cfs_rq) + 1);
>
> So the load avg is -1, thus after adding 1 we get division by 0, huh?

 Yes CONFIG_FAIR_GROUP_SCHED is set. I cherry picked now all those commits 
 up
 to 4.5 for fair.c:
 https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/kernel/sched/fair.c?h=v4.5

 It didn't happen again with v4.4.6 + 4.5 patches for fair.c
>>>
>>> Ok, that's a lot of patches, how about figuring out which single patch,
>>> or shortest number of patches, makes things work again?
>>
>> will do so but it seems most out of those 9 patches are based on each
>> other. So it wouldn't be easy.
> 
> Worst case, we take all 9.  Best case, we only need one :)

Today i got another report from: casteven (at) gmail.com having the same
issue. I wasn't able to figure out which of those 9 patches helps - i
still can just tell that it does not happen with this series on top of 4.4.

Stefan


Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6

2016-03-21 Thread Greg KH
On Mon, Mar 21, 2016 at 11:52:23AM +0100, Stefan Priebe - Profihost AG wrote:
> 
> Am 20.03.2016 um 22:41 schrieb Greg KH:
> > On Sun, Mar 20, 2016 at 10:27:23PM +0100, Stefan Priebe wrote:
> >>
> >> Am 19.03.2016 um 23:26 schrieb Vlastimil Babka:
> >>> On 03/17/2016 07:45 PM, Greg KH wrote:
>  On Thu, Mar 17, 2016 at 07:38:03PM +0100, Stefan Priebe wrote:
> > Hi,
> >
> > while running qemu 2.5 on a host running 4.4.6 the host system has
> > crashed
> > (load > 200) 3 times in the last 3 days.
> >
> > Always with this stack trace: (copy left here:
> > http://pastebin.com/raw/bCWTLKyt)
> >
> > [69068.874268] divide error:  [#1] SMP
> > [69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT
> > nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter
> > ip6_tables
> > ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4
> > nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack
> > nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap macvlan
> > kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm
> > xt_multiport
> > iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi
> > scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp
> > 8021q
> > garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor
> > raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid
> > sd_mod
> > ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb
> > i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core
> > mdio ptp
> > pps_core
> > [69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted
> > 4.4.6+7-ph #1
> > [69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1
> > 03/17/2012
> > [69068.898578] task: 880fc7f28000 ti: 880fda2c4000 task.ti:
> > 880fda2c4000
> > [69068.900377] RIP: 0010:[]  []
> > task_h_load+0xcc/0x100
> >>>
> >>> decodecode says:
> >>>
> >>>   27:   48 83 c1 01 add$0x1,%rcx
> >>>   2b:*  48 f7 f1div%rcx <-- trapping
> >>> instruction
> >>>
> >>> This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load:
> >>>
> >>> update_cfs_rq_h_load(cfs_rq);
> >>> return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
> >>> cfs_rq_load_avg(cfs_rq) + 1);
> >>>
> >>> So the load avg is -1, thus after adding 1 we get division by 0, huh?
> >>
> >> Yes CONFIG_FAIR_GROUP_SCHED is set. I cherry picked now all those commits 
> >> up
> >> to 4.5 for fair.c:
> >> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/kernel/sched/fair.c?h=v4.5
> >>
> >> It didn't happen again with v4.4.6 + 4.5 patches for fair.c
> > 
> > Ok, that's a lot of patches, how about figuring out which single patch,
> > or shortest number of patches, makes things work again?
> 
> will do so but it seems most out of those 9 patches are based on each
> other. So it wouldn't be easy.

Worst case, we take all 9.  Best case, we only need one :)

thanks,

greg k-h


Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6

2016-03-21 Thread Stefan Priebe - Profihost AG

Am 20.03.2016 um 22:41 schrieb Greg KH:
> On Sun, Mar 20, 2016 at 10:27:23PM +0100, Stefan Priebe wrote:
>>
>> Am 19.03.2016 um 23:26 schrieb Vlastimil Babka:
>>> On 03/17/2016 07:45 PM, Greg KH wrote:
 On Thu, Mar 17, 2016 at 07:38:03PM +0100, Stefan Priebe wrote:
> Hi,
>
> while running qemu 2.5 on a host running 4.4.6 the host system has
> crashed
> (load > 200) 3 times in the last 3 days.
>
> Always with this stack trace: (copy left here:
> http://pastebin.com/raw/bCWTLKyt)
>
> [69068.874268] divide error:  [#1] SMP
> [69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT
> nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter
> ip6_tables
> ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4
> nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack
> nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap macvlan
> kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm
> xt_multiport
> iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi
> scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp
> 8021q
> garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor
> raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid
> sd_mod
> ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb
> i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core
> mdio ptp
> pps_core
> [69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted
> 4.4.6+7-ph #1
> [69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1
> 03/17/2012
> [69068.898578] task: 880fc7f28000 ti: 880fda2c4000 task.ti:
> 880fda2c4000
> [69068.900377] RIP: 0010:[]  []
> task_h_load+0xcc/0x100
>>>
>>> decodecode says:
>>>
>>>   27:   48 83 c1 01 add$0x1,%rcx
>>>   2b:*  48 f7 f1div%rcx <-- trapping
>>> instruction
>>>
>>> This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load:
>>>
>>> update_cfs_rq_h_load(cfs_rq);
>>> return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
>>> cfs_rq_load_avg(cfs_rq) + 1);
>>>
>>> So the load avg is -1, thus after adding 1 we get division by 0, huh?
>>
>> Yes CONFIG_FAIR_GROUP_SCHED is set. I cherry picked now all those commits up
>> to 4.5 for fair.c:
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/kernel/sched/fair.c?h=v4.5
>>
>> It didn't happen again with v4.4.6 + 4.5 patches for fair.c
> 
> Ok, that's a lot of patches, how about figuring out which single patch,
> or shortest number of patches, makes things work again?

will do so but it seems most out of those 9 patches are based on each
other. So it wouldn't be easy.

Stefan

> 
> thanks,
> 
> greg k-h
> 


Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6

2016-03-20 Thread Greg KH
On Sun, Mar 20, 2016 at 10:27:23PM +0100, Stefan Priebe wrote:
> 
> Am 19.03.2016 um 23:26 schrieb Vlastimil Babka:
> >On 03/17/2016 07:45 PM, Greg KH wrote:
> >>On Thu, Mar 17, 2016 at 07:38:03PM +0100, Stefan Priebe wrote:
> >>>Hi,
> >>>
> >>>while running qemu 2.5 on a host running 4.4.6 the host system has
> >>>crashed
> >>>(load > 200) 3 times in the last 3 days.
> >>>
> >>>Always with this stack trace: (copy left here:
> >>>http://pastebin.com/raw/bCWTLKyt)
> >>>
> >>>[69068.874268] divide error:  [#1] SMP
> >>>[69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT
> >>>nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter
> >>>ip6_tables
> >>>ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4
> >>>nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack
> >>>nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap macvlan
> >>>kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm
> >>>xt_multiport
> >>>iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi
> >>>scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp
> >>>8021q
> >>>garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor
> >>>raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid
> >>>sd_mod
> >>>ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb
> >>>i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core
> >>>mdio ptp
> >>>pps_core
> >>>[69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted
> >>>4.4.6+7-ph #1
> >>>[69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1
> >>>03/17/2012
> >>>[69068.898578] task: 880fc7f28000 ti: 880fda2c4000 task.ti:
> >>>880fda2c4000
> >>>[69068.900377] RIP: 0010:[]  []
> >>>task_h_load+0xcc/0x100
> >
> >decodecode says:
> >
> >   27:   48 83 c1 01 add$0x1,%rcx
> >   2b:*  48 f7 f1div%rcx <-- trapping
> >instruction
> >
> >This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load:
> >
> > update_cfs_rq_h_load(cfs_rq);
> > return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
> > cfs_rq_load_avg(cfs_rq) + 1);
> >
> >So the load avg is -1, thus after adding 1 we get division by 0, huh?
> 
> Yes CONFIG_FAIR_GROUP_SCHED is set. I cherry picked now all those commits up
> to 4.5 for fair.c:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/kernel/sched/fair.c?h=v4.5
> 
> It didn't happen again with v4.4.6 + 4.5 patches for fair.c

Ok, that's a lot of patches, how about figuring out which single patch,
or shortest number of patches, makes things work again?

thanks,

greg k-h


Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6

2016-03-20 Thread Stefan Priebe


Am 19.03.2016 um 23:26 schrieb Vlastimil Babka:

On 03/17/2016 07:45 PM, Greg KH wrote:

On Thu, Mar 17, 2016 at 07:38:03PM +0100, Stefan Priebe wrote:

Hi,

while running qemu 2.5 on a host running 4.4.6 the host system has
crashed
(load > 200) 3 times in the last 3 days.

Always with this stack trace: (copy left here:
http://pastebin.com/raw/bCWTLKyt)

[69068.874268] divide error:  [#1] SMP
[69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT
nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter
ip6_tables
ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4
nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack
nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap macvlan
kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm
xt_multiport
iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi
scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp
8021q
garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor
raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid
sd_mod
ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb
i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core
mdio ptp
pps_core
[69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted
4.4.6+7-ph #1
[69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1
03/17/2012
[69068.898578] task: 880fc7f28000 ti: 880fda2c4000 task.ti:
880fda2c4000
[69068.900377] RIP: 0010:[]  []
task_h_load+0xcc/0x100


decodecode says:

   27:   48 83 c1 01 add$0x1,%rcx
   2b:*  48 f7 f1div%rcx <-- trapping
instruction

This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load:

 update_cfs_rq_h_load(cfs_rq);
 return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
 cfs_rq_load_avg(cfs_rq) + 1);

So the load avg is -1, thus after adding 1 we get division by 0, huh?


Yes CONFIG_FAIR_GROUP_SCHED is set. I cherry picked now all those 
commits up to 4.5 for fair.c:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/kernel/sched/fair.c?h=v4.5

It didn't happen again with v4.4.6 + 4.5 patches for fair.c

Greets,
Stefan




[69068.961763] RSP: :880fda2c7b50  EFLAGS: 00010257
[69069.023910] RAX:  RBX: 880fda2c7c10 RCX:

[69069.085953] RDX:  RSI: 0001 RDI:
880fc7f28000
[69069.151731] RBP: 880fda2c7bc8 R08: 0001041955df R09:
880fffd153f8
[69069.213757] R10: 0009 R11: 0193 R12:
881f6832c780
[69069.274271] R13: 88203fc35380 R14: 0007 R15:
00b6
[69069.334727] FS:  7f578a3fb700() GS:880fffd0()
knlGS:
[69069.396435] CS:  0010 DS:  ES:  CR0: 80050033
[69069.458522] CR2: 7f5784f18468 CR3: 001fe9738000 CR4:
26e0
[69069.520799] Stack:
[69069.581430]  860b6855 880fda2c7b78 
0005
[69069.642629]  880fffd0 0015 fe7d
0015
[69069.702815]  00015380 880fda2c7bc8 880fc7f28000
01e9
[69069.761881] Call Trace:
[69069.819883]  [] ? task_numa_find_cpu+0x225/0x670
[69069.878368]  [] task_numa_migrate+0x550/0x950
[69069.936059]  [] ? find_next_bit+0x18/0x20
[69069.993262]  [] numa_migrate_preferred+0x7d/0x90
[69070.050528]  [] task_numa_fault+0x7c5/0xaa0
[69070.106544]  [] ? mpol_misplaced+0x16b/0x1b0
[69070.163705]  [] __handle_mm_fault+0x9ae/0x11f0
[69070.220013]  [] ? inet_recvmsg+0x72/0x90
[69070.276558]  [] ? SYSC_recvfrom+0x12b/0x170
[69070.332283]  [] handle_mm_fault+0xdf/0x180
[69070.388515]  [] __do_page_fault+0x164/0x380
[69070.443897]  [] ? account_user_time+0x73/0x80
[69070.498534]  [] ? vtime_account_user+0x4e/0x70
[69070.552598]  [] do_page_fault+0x37/0x90
[69070.605960]  [] ? syscall_return_slowpath+0x83/0xf0
[69070.660705]  [] page_fault+0x28/0x30
[69070.715707] Code: 86 b8 00 00 00 48 89 86 b0 00 00 00 48 85 c9 75
ca 49
8b 81 b0 00 00 00 49 8b 49 78 31 d2 48 0f af 87 d8 01 00 00 5d 48 83
c1 01
<48> f7 f1 c3 4c 89 ce 48 8b 8e c0 00 00 00 48 8b 46 78 4c 89 86
[69070.835144] RIP  [] task_h_load+0xcc/0x100
[69070.894095]  RSP 
[69070.953213] ---[ end trace 8d6f449a03dacfd4 ]---

Would be nice if we can fix this in 4.4?


Does this also happen in 4.5?  Did this work on 4.4.5?  Some previous
release?  If you can find the offending patch using 'git bisect', that
would be great.

thanks,

greg k-h

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: mailto:"d...@kvack.org";> em...@kvack.org 





Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6

2016-03-19 Thread Greg KH
On Thu, Mar 17, 2016 at 07:38:03PM +0100, Stefan Priebe wrote:
> Hi,
> 
> while running qemu 2.5 on a host running 4.4.6 the host system has crashed
> (load > 200) 3 times in the last 3 days.
> 
> Always with this stack trace: (copy left here:
> http://pastebin.com/raw/bCWTLKyt)
> 
> [69068.874268] divide error:  [#1] SMP
> [69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT
> nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables
> ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4
> nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack
> nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap macvlan
> kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm xt_multiport
> iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi
> scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp 8021q
> garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor
> raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid sd_mod
> ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb
> i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core mdio ptp
> pps_core
> [69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted 4.4.6+7-ph #1
> [69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1 03/17/2012
> [69068.898578] task: 880fc7f28000 ti: 880fda2c4000 task.ti:
> 880fda2c4000
> [69068.900377] RIP: 0010:[]  []
> task_h_load+0xcc/0x100
> [69068.961763] RSP: :880fda2c7b50  EFLAGS: 00010257
> [69069.023910] RAX:  RBX: 880fda2c7c10 RCX:
> 
> [69069.085953] RDX:  RSI: 0001 RDI:
> 880fc7f28000
> [69069.151731] RBP: 880fda2c7bc8 R08: 0001041955df R09:
> 880fffd153f8
> [69069.213757] R10: 0009 R11: 0193 R12:
> 881f6832c780
> [69069.274271] R13: 88203fc35380 R14: 0007 R15:
> 00b6
> [69069.334727] FS:  7f578a3fb700() GS:880fffd0()
> knlGS:
> [69069.396435] CS:  0010 DS:  ES:  CR0: 80050033
> [69069.458522] CR2: 7f5784f18468 CR3: 001fe9738000 CR4:
> 26e0
> [69069.520799] Stack:
> [69069.581430]  860b6855 880fda2c7b78 
> 0005
> [69069.642629]  880fffd0 0015 fe7d
> 0015
> [69069.702815]  00015380 880fda2c7bc8 880fc7f28000
> 01e9
> [69069.761881] Call Trace:
> [69069.819883]  [] ? task_numa_find_cpu+0x225/0x670
> [69069.878368]  [] task_numa_migrate+0x550/0x950
> [69069.936059]  [] ? find_next_bit+0x18/0x20
> [69069.993262]  [] numa_migrate_preferred+0x7d/0x90
> [69070.050528]  [] task_numa_fault+0x7c5/0xaa0
> [69070.106544]  [] ? mpol_misplaced+0x16b/0x1b0
> [69070.163705]  [] __handle_mm_fault+0x9ae/0x11f0
> [69070.220013]  [] ? inet_recvmsg+0x72/0x90
> [69070.276558]  [] ? SYSC_recvfrom+0x12b/0x170
> [69070.332283]  [] handle_mm_fault+0xdf/0x180
> [69070.388515]  [] __do_page_fault+0x164/0x380
> [69070.443897]  [] ? account_user_time+0x73/0x80
> [69070.498534]  [] ? vtime_account_user+0x4e/0x70
> [69070.552598]  [] do_page_fault+0x37/0x90
> [69070.605960]  [] ? syscall_return_slowpath+0x83/0xf0
> [69070.660705]  [] page_fault+0x28/0x30
> [69070.715707] Code: 86 b8 00 00 00 48 89 86 b0 00 00 00 48 85 c9 75 ca 49
> 8b 81 b0 00 00 00 49 8b 49 78 31 d2 48 0f af 87 d8 01 00 00 5d 48 83 c1 01
> <48> f7 f1 c3 4c 89 ce 48 8b 8e c0 00 00 00 48 8b 46 78 4c 89 86
> [69070.835144] RIP  [] task_h_load+0xcc/0x100
> [69070.894095]  RSP 
> [69070.953213] ---[ end trace 8d6f449a03dacfd4 ]---
> 
> Would be nice if we can fix this in 4.4?

Does this also happen in 4.5?  Did this work on 4.4.5?  Some previous
release?  If you can find the offending patch using 'git bisect', that
would be great.

thanks,

greg k-h


Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6

2016-03-19 Thread Vlastimil Babka

On 03/17/2016 07:45 PM, Greg KH wrote:

On Thu, Mar 17, 2016 at 07:38:03PM +0100, Stefan Priebe wrote:

Hi,

while running qemu 2.5 on a host running 4.4.6 the host system has crashed
(load > 200) 3 times in the last 3 days.

Always with this stack trace: (copy left here:
http://pastebin.com/raw/bCWTLKyt)

[69068.874268] divide error:  [#1] SMP
[69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT
nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables
ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4
nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack
nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap macvlan
kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm xt_multiport
iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi
scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp 8021q
garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor
raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid sd_mod
ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb
i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core mdio ptp
pps_core
[69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted 4.4.6+7-ph #1
[69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1 03/17/2012
[69068.898578] task: 880fc7f28000 ti: 880fda2c4000 task.ti:
880fda2c4000
[69068.900377] RIP: 0010:[]  []
task_h_load+0xcc/0x100


decodecode says:

  27:   48 83 c1 01 add$0x1,%rcx
  2b:*  48 f7 f1div%rcx <-- trapping instruction

This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load:

update_cfs_rq_h_load(cfs_rq);
return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
cfs_rq_load_avg(cfs_rq) + 1);

So the load avg is -1, thus after adding 1 we get division by 0, huh?


[69068.961763] RSP: :880fda2c7b50  EFLAGS: 00010257
[69069.023910] RAX:  RBX: 880fda2c7c10 RCX:

[69069.085953] RDX:  RSI: 0001 RDI:
880fc7f28000
[69069.151731] RBP: 880fda2c7bc8 R08: 0001041955df R09:
880fffd153f8
[69069.213757] R10: 0009 R11: 0193 R12:
881f6832c780
[69069.274271] R13: 88203fc35380 R14: 0007 R15:
00b6
[69069.334727] FS:  7f578a3fb700() GS:880fffd0()
knlGS:
[69069.396435] CS:  0010 DS:  ES:  CR0: 80050033
[69069.458522] CR2: 7f5784f18468 CR3: 001fe9738000 CR4:
26e0
[69069.520799] Stack:
[69069.581430]  860b6855 880fda2c7b78 
0005
[69069.642629]  880fffd0 0015 fe7d
0015
[69069.702815]  00015380 880fda2c7bc8 880fc7f28000
01e9
[69069.761881] Call Trace:
[69069.819883]  [] ? task_numa_find_cpu+0x225/0x670
[69069.878368]  [] task_numa_migrate+0x550/0x950
[69069.936059]  [] ? find_next_bit+0x18/0x20
[69069.993262]  [] numa_migrate_preferred+0x7d/0x90
[69070.050528]  [] task_numa_fault+0x7c5/0xaa0
[69070.106544]  [] ? mpol_misplaced+0x16b/0x1b0
[69070.163705]  [] __handle_mm_fault+0x9ae/0x11f0
[69070.220013]  [] ? inet_recvmsg+0x72/0x90
[69070.276558]  [] ? SYSC_recvfrom+0x12b/0x170
[69070.332283]  [] handle_mm_fault+0xdf/0x180
[69070.388515]  [] __do_page_fault+0x164/0x380
[69070.443897]  [] ? account_user_time+0x73/0x80
[69070.498534]  [] ? vtime_account_user+0x4e/0x70
[69070.552598]  [] do_page_fault+0x37/0x90
[69070.605960]  [] ? syscall_return_slowpath+0x83/0xf0
[69070.660705]  [] page_fault+0x28/0x30
[69070.715707] Code: 86 b8 00 00 00 48 89 86 b0 00 00 00 48 85 c9 75 ca 49
8b 81 b0 00 00 00 49 8b 49 78 31 d2 48 0f af 87 d8 01 00 00 5d 48 83 c1 01
<48> f7 f1 c3 4c 89 ce 48 8b 8e c0 00 00 00 48 8b 46 78 4c 89 86
[69070.835144] RIP  [] task_h_load+0xcc/0x100
[69070.894095]  RSP 
[69070.953213] ---[ end trace 8d6f449a03dacfd4 ]---

Would be nice if we can fix this in 4.4?


Does this also happen in 4.5?  Did this work on 4.4.5?  Some previous
release?  If you can find the offending patch using 'git bisect', that
would be great.

thanks,

greg k-h

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: mailto:"d...@kvack.org";> em...@kvack.org 





divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6

2016-03-19 Thread Stefan Priebe

Hi,

while running qemu 2.5 on a host running 4.4.6 the host system has 
crashed (load > 200) 3 times in the last 3 days.


Always with this stack trace: (copy left here: 
http://pastebin.com/raw/bCWTLKyt)


[69068.874268] divide error:  [#1] SMP
[69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT 
nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter 
ip6_tables ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment 
nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype 
xt_conntrack nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost 
macvtap macvlan kvm_intel nfnetlink_log kvm nfnetlink irqbypass 
netconsole dlm xt_multiport iptable_filter ip_tables x_tables iscsi_tcp 
libiscsi_tcp libiscsi scsi_transport_iscsi nfsd auth_rpcgss oid_registry 
bonding coretemp 8021q garp fuse i2c_i801 i7core_edac edac_core 
i5500_temp button btrfs xor raid6_pq dm_mod raid1 md_mod usb_storage 
ohci_hcd bcache sg usbhid sd_mod ata_generic uhci_hcd ehci_pci ehci_hcd 
usbcore ata_piix usb_common igb i2c_algo_bit mpt3sas raid_class ixgbe 
scsi_transport_sas i2c_core mdio ptp pps_core

[69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted 4.4.6+7-ph #1
[69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1 
03/17/2012
[69068.898578] task: 880fc7f28000 ti: 880fda2c4000 task.ti: 
880fda2c4000
[69068.900377] RIP: 0010:[]  [] 
task_h_load+0xcc/0x100

[69068.961763] RSP: :880fda2c7b50  EFLAGS: 00010257
[69069.023910] RAX:  RBX: 880fda2c7c10 RCX: 

[69069.085953] RDX:  RSI: 0001 RDI: 
880fc7f28000
[69069.151731] RBP: 880fda2c7bc8 R08: 0001041955df R09: 
880fffd153f8
[69069.213757] R10: 0009 R11: 0193 R12: 
881f6832c780
[69069.274271] R13: 88203fc35380 R14: 0007 R15: 
00b6
[69069.334727] FS:  7f578a3fb700() GS:880fffd0() 
knlGS:

[69069.396435] CS:  0010 DS:  ES:  CR0: 80050033
[69069.458522] CR2: 7f5784f18468 CR3: 001fe9738000 CR4: 
26e0

[69069.520799] Stack:
[69069.581430]  860b6855 880fda2c7b78  
0005
[69069.642629]  880fffd0 0015 fe7d 
0015
[69069.702815]  00015380 880fda2c7bc8 880fc7f28000 
01e9

[69069.761881] Call Trace:
[69069.819883]  [] ? task_numa_find_cpu+0x225/0x670
[69069.878368]  [] task_numa_migrate+0x550/0x950
[69069.936059]  [] ? find_next_bit+0x18/0x20
[69069.993262]  [] numa_migrate_preferred+0x7d/0x90
[69070.050528]  [] task_numa_fault+0x7c5/0xaa0
[69070.106544]  [] ? mpol_misplaced+0x16b/0x1b0
[69070.163705]  [] __handle_mm_fault+0x9ae/0x11f0
[69070.220013]  [] ? inet_recvmsg+0x72/0x90
[69070.276558]  [] ? SYSC_recvfrom+0x12b/0x170
[69070.332283]  [] handle_mm_fault+0xdf/0x180
[69070.388515]  [] __do_page_fault+0x164/0x380
[69070.443897]  [] ? account_user_time+0x73/0x80
[69070.498534]  [] ? vtime_account_user+0x4e/0x70
[69070.552598]  [] do_page_fault+0x37/0x90
[69070.605960]  [] ? syscall_return_slowpath+0x83/0xf0
[69070.660705]  [] page_fault+0x28/0x30
[69070.715707] Code: 86 b8 00 00 00 48 89 86 b0 00 00 00 48 85 c9 75 ca 
49 8b 81 b0 00 00 00 49 8b 49 78 31 d2 48 0f af 87 d8 01 00 00 5d 48 83 
c1 01 <48> f7 f1 c3 4c 89 ce 48 8b 8e c0 00 00 00 48 8b 46 78 4c 89 86

[69070.835144] RIP  [] task_h_load+0xcc/0x100
[69070.894095]  RSP 
[69070.953213] ---[ end trace 8d6f449a03dacfd4 ]---

Would be nice if we can fix this in 4.4?

Greets,
Stefan