Re: v4.16+ seeing many unaligned access in dequeue_task_fair() on IA64

2018-04-05 Thread Ingo Molnar

* Peter Zijlstra  wrote:

> On Wed, Apr 04, 2018 at 06:53:10PM +0200, Peter Zijlstra wrote:
> > Awesome, I'll go get it merged, even though I don't understand where it
> > went wobbly.
> 
> Ingo, could you magic this in?
> 
> ---
> Subject: sched: Force alignment of struct util_est
> From: Peter Zijlstra 
> Date: Thu Apr  5 09:56:16 CEST 2018

Sure, done!

Thanks,

Ingo


Re: v4.16+ seeing many unaligned access in dequeue_task_fair() on IA64

2018-04-05 Thread Peter Zijlstra
On Wed, Apr 04, 2018 at 06:53:10PM +0200, Peter Zijlstra wrote:
> Awesome, I'll go get it merged, even though I don't understand where it
> went wobbly.

Ingo, could you magic this in?

---
Subject: sched: Force alignment of struct util_est
From: Peter Zijlstra 
Date: Thu Apr  5 09:56:16 CEST 2018

For some as yet not understood reason, Tony gets unaligned access
traps on IA64 because of:

  struct util_est ue = READ_ONCE(p->se.avg.util_est);

and:

  WRITE_ONCE(p->se.avg.util_est, ue);

introduced by commit:

  d519329f72a6 ("sched/fair: Update util_est only on util_avg updates")

Normally those two fields should end up on an 8-byte aligned location,
but UP and RANDSTRUCT can mess that up so enforce the alignment
explicitly.

Also make the alignment on sched_avg unconditional, as it is really
about data locality, not false-sharing.

With or without this patch the layout for sched_avg on an
ia64-defconfig build looks like:

$ pahole -EC sched_avg ia64-defconfig/kernel/sched/core.o
die__process_function: tag not supported (INVALID)!
struct sched_avg {
        /* typedef u64 */ long long unsigned int last_update_time;  /*  0  8 */
        /* typedef u64 */ long long unsigned int load_sum;          /*  8  8 */
        /* typedef u64 */ long long unsigned int runnable_load_sum; /* 16  8 */
        /* typedef u32 */ unsigned int           util_sum;          /* 24  4 */
        /* typedef u32 */ unsigned int           period_contrib;    /* 28  4 */
        long unsigned int                        load_avg;          /* 32  8 */
        long unsigned int                        runnable_load_avg; /* 40  8 */
        long unsigned int                        util_avg;          /* 48  8 */
        struct util_est {
                unsigned int                     enqueued;          /* 56  4 */
                unsigned int                     ewma;              /* 60  4 */
        } util_est;                                                 /* 56  8 */
        /* --- cacheline 1 boundary (64 bytes) --- */

        /* size: 64, cachelines: 1, members: 9 */
};

Fixes: d519329f72a6 ("sched/fair: Update util_est only on util_avg updates")
Reported-and-Tested-by: Tony Luck 
Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/sched.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -300,7 +300,7 @@ struct util_est {
unsigned int    enqueued;
unsigned int    ewma;
 #define UTIL_EST_WEIGHT_SHIFT  2
-};
+} __attribute__((__aligned__(sizeof(u64))));
 
 /*
  * The load_avg/util_avg accumulates an infinite geometric series
@@ -364,7 +364,7 @@ struct sched_avg {
unsigned long   runnable_load_avg;
unsigned long   util_avg;
struct util_est util_est;
-};
+} ____cacheline_aligned;
 
 struct sched_statistics {
 #ifdef CONFIG_SCHEDSTATS
@@ -435,7 +435,7 @@ struct sched_entity {
 * Put into separate cache line so it does not
 * collide with read-mostly values above.
 */
-   struct sched_avg        avg ____cacheline_aligned_in_smp;
+   struct sched_avg        avg;
 #endif
 };
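
A minimal user-space sketch (illustrative only; it assumes GCC/Clang
attribute semantics and is not part of the patch) of what the one-line
util_est change buys:

#include <stdint.h>

/* Before the patch: two 4-byte members give the struct a natural
 * alignment of only 4 bytes. */
struct util_est_before {
	unsigned int enqueued;
	unsigned int ewma;
};

/* After the patch: same members, but the attribute raises the
 * required alignment of every instance to 8 bytes. */
struct util_est_after {
	unsigned int enqueued;
	unsigned int ewma;
} __attribute__((__aligned__(sizeof(uint64_t))));

_Static_assert(__alignof__(struct util_est_before) == 4,
	       "two u32s only guarantee 4-byte alignment");
_Static_assert(__alignof__(struct util_est_after) == 8,
	       "the attribute forces 8-byte alignment");

int main(void) { return 0; }

With the attribute in place, the compiler must also lay out any
enclosing structure so that util_est lands on an 8-byte boundary, which
is what makes a single 8-byte READ_ONCE()/WRITE_ONCE() access safe on
IA64.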
 


Re: v4.16+ seeing many unaligned access in dequeue_task_fair() on IA64

2018-04-04 Thread Peter Zijlstra
On Wed, Apr 04, 2018 at 09:38:56AM -0700, Luck, Tony wrote:
> On Wed, Apr 04, 2018 at 09:25:13AM +0200, Peter Zijlstra wrote:
> > Right, I remember being careful with that. Which again brings me to the
> > RANDSTRUCT thing, which will mess that up.
> 
> No RANDSTRUCT config options set for my build.

Weird though, with or without that patch, my ia64-defconfig gives the
below layout.

$ pahole -EC sched_avg ia64-defconfig/kernel/sched/core.o
die__process_function: tag not supported (INVALID)!
struct sched_avg {
        /* typedef u64 */ long long unsigned int last_update_time;  /*  0  8 */
        /* typedef u64 */ long long unsigned int load_sum;          /*  8  8 */
        /* typedef u64 */ long long unsigned int runnable_load_sum; /* 16  8 */
        /* typedef u32 */ unsigned int           util_sum;          /* 24  4 */
        /* typedef u32 */ unsigned int           period_contrib;    /* 28  4 */
        long unsigned int                        load_avg;          /* 32  8 */
        long unsigned int                        runnable_load_avg; /* 40  8 */
        long unsigned int                        util_avg;          /* 48  8 */
        struct util_est {
                unsigned int                     enqueued;          /* 56  4 */
                unsigned int                     ewma;              /* 60  4 */
        } util_est;                                                 /* 56  8 */
        /* --- cacheline 1 boundary (64 bytes) --- */

        /* size: 64, cachelines: 1, members: 9 */
};

> > Does the below cure things? It makes absolutely no difference for my
> > x86_64-defconfig build, but it puts more explicit alignment constraints
> > on things.
> 
> Yes. That fixes it. No unaligned traps with this patch applied.
> 
> Tested-by: Tony Luck 

Awesome, I'll go get it merged, even though I don't understand where it
went wobbly.


Re: v4.16+ seeing many unaligned access in dequeue_task_fair() on IA64

2018-04-04 Thread Luck, Tony
On Wed, Apr 04, 2018 at 09:25:13AM +0200, Peter Zijlstra wrote:
> Right, I remember being careful with that. Which again brings me to the
> RANDSTRUCT thing, which will mess that up.

No RANDSTRUCT config options set for my build.

> Does the below cure things? It makes absolutely no difference for my
> x86_64-defconfig build, but it puts more explicit alignment constraints
> on things.

Yes. That fixes it. No unaligned traps with this patch applied.

Tested-by: Tony Luck 

Thanks

-Tony


Re: v4.16+ seeing many unaligned access in dequeue_task_fair() on IA64

2018-04-04 Thread Peter Zijlstra
On Wed, Apr 04, 2018 at 12:04:00AM +, Luck, Tony wrote:
> > bisect says:
> >
> > d519329f72a6 ("sched/fair: Update util_est only on util_avg updates")
> > 
> > Reverting just this commit makes the problem go away.
> 
> The unaligned read and write seem to come from:
> 
> struct util_est ue = READ_ONCE(p->se.avg.util_est);
> WRITE_ONCE(p->se.avg.util_est, ue);
> 
> which is puzzling, as they were around before. Also, the "avg"
> field is tagged with an attribute to make it cache-aligned,
> and there don't appear to be holes in the structure that would
> make util_est not be 8-byte aligned ... though it does consist
> of two 4-byte fields, so it is legal for it to be only 4-byte aligned.

Right, I remember being careful with that. Which again brings me to the
RANDSTRUCT thing, which will mess that up.

Does the below cure things? It makes absolutely no difference for my
x86_64-defconfig build, but it puts more explicit alignment constraints
on things.


diff --git a/include/linux/sched.h b/include/linux/sched.h
index f228c6033832..b3d697f3b573 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -300,7 +300,7 @@ struct util_est {
unsigned int    enqueued;
unsigned int    ewma;
 #define UTIL_EST_WEIGHT_SHIFT  2
-};
+} __attribute__((__aligned__(sizeof(u64))));
 
 /*
  * The load_avg/util_avg accumulates an infinite geometric series
@@ -364,7 +364,7 @@ struct sched_avg {
unsigned long   runnable_load_avg;
unsigned long   util_avg;
struct util_est util_est;
-};
+} ____cacheline_aligned;
 
 struct sched_statistics {
 #ifdef CONFIG_SCHEDSTATS
@@ -435,7 +435,7 @@ struct sched_entity {
 * Put into separate cache line so it does not
 * collide with read-mostly values above.
 */
-   struct sched_avg        avg ____cacheline_aligned_in_smp;
+   struct sched_avg        avg;
 #endif
 };
 


RE: v4.16+ seeing many unaligned access in dequeue_task_fair() on IA64

2018-04-03 Thread Luck, Tony
> bisect says:
>
> d519329f72a6 ("sched/fair: Update util_est only on util_avg updates")
> 
> Reverting just this commit makes the problem go away.

The unaligned read and write seem to come from:

struct util_est ue = READ_ONCE(p->se.avg.util_est);
WRITE_ONCE(p->se.avg.util_est, ue);

which is puzzling, as they were around before. Also, the "avg"
field is tagged with an attribute to make it cache-aligned,
and there don't appear to be holes in the structure that would
make util_est not be 8-byte aligned ... though it does consist
of two 4-byte fields, so it is legal for it to be only 4-byte aligned.
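
A hypothetical user-space sketch (not from the kernel build) of why a
single 64-bit READ_ONCE()-style access to a merely 4-byte-aligned
struct traps on IA64:

#include <stdint.h>
#include <string.h>

struct util_est {	/* two u32 fields: alignof is only 4 */
	uint32_t enqueued;
	uint32_t ewma;
};

/* For an 8-byte object, a READ_ONCE()-style access boils down to a
 * single volatile 64-bit load; IA64 faults if the address is not
 * 8-byte aligned. */
static struct util_est read_once_sketch(const struct util_est *p)
{
	uint64_t raw = *(const volatile uint64_t *)p;
	struct util_est ue;

	memcpy(&ue, &raw, sizeof(ue));
	return ue;
}

int main(void)
{
	/* Place the struct on a 4-mod-8 boundary: legal as far as C
	 * is concerned, but fatal for the single 8-byte load on IA64. */
	static uint32_t buf[4] __attribute__((aligned(8)));
	struct util_est *misaligned = (struct util_est *)&buf[1];

	return (int)read_once_sketch(misaligned).enqueued;
}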

-Tony




Re: v4.16+ seeing many unaligned access in dequeue_task_fair() on IA64

2018-04-03 Thread Luck, Tony
On Tue, Apr 03, 2018 at 09:37:06AM +0200, Peter Zijlstra wrote:
> On Mon, Apr 02, 2018 at 04:24:49PM -0700, Luck, Tony wrote:
> > Any guesses before I start to bisect?
> 
> That doesn't sound good. The only guess I have at this moment is you
> accidentally enabled RANDSTRUCT_PLUGIN and that messes things up.
> 
> struct task_struct should be at least L1_CACHE_BYTES aligned, and C
> otherwise makes it fairly hard to cause unaligned accesses. Packed
> structures and/or casting are required, and I don't think we added
> anything dodgy like that here.

bisect says:

d519329f72a6 ("sched/fair: Update util_est only on util_avg updates")

Reverting just this commit makes the problem go away.

-Tony


Re: v4.16+ seeing many unaligned access in dequeue_task_fair() on IA64

2018-04-03 Thread Peter Zijlstra
On Mon, Apr 02, 2018 at 04:24:49PM -0700, Luck, Tony wrote:
> v4.16 boots cleanly. But with the first bunch of merges
> (Linus HEAD = 46e0d28bdb8e6d00e27a0fe9e1d15df6098f0ffb)
> I see a bunch of:
> 
> ia64_handle_unaligned: 4863 callbacks suppressed
> kernel unaligned access to 0xe0031660fd74, ip=0xa001000f23e0
> kernel unaligned access to 0xe0033bdffbcc, ip=0xa001000f2370
> kernel unaligned access to 0xe0031660fd74, ip=0xa001000f23e0
> kernel unaligned access to 0xe0033bdffbcc, ip=0xa001000f2370
> kernel unaligned access to 0xe0031660fd74, ip=0xa001000f23e0
> 
> The addresses are all 4-byte aligned, but not 8-byte aligned.
> 
> Any guesses before I start to bisect?

That doesn't sound good. The only guess I have at this moment is you
accidentally enabled RANDSTRUCT_PLUGIN and that messes things up.

struct task_struct should be at least L1_CACHE_BYTES aligned, and C
otherwise makes it fairly hard to cause unaligned accesses. Packed
structures and/or casting are required, and I don't think we added
anything dodgy like that here.
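
For illustration, a hypothetical example (nothing like it exists in the
scheduler code) of the kind of construct that does force unaligned
accesses in C:

#include <stdint.h>

/* Without 'packed' the compiler would pad 'tag' out to 8 bytes so
 * that 'val' lands on an 8-byte boundary; with it, 'val' sits at
 * offset 4 and every access to it is unaligned. */
struct oddball {
	uint32_t tag;
	uint64_t val;
} __attribute__((packed));

uint64_t read_val(const struct oddball *p)
{
	/* Because the struct is packed, the compiler knows 'val' may
	 * be unaligned and emits byte-wise accesses on strict-alignment
	 * architectures. */
	return p->val;
}

int main(void)
{
	struct oddball o = { 1, 2 };
	return (int)read_val(&o);
}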



RE: v4.16+ seeing many unaligned access in dequeue_task_fair() on IA64

2018-04-02 Thread Luck, Tony
> kernel unaligned access to 0xe0031660fd74, ip=0xa001000f23e0
> kernel unaligned access to 0xe0033bdffbcc, ip=0xa001000f2370

Here's the disassembly of dequeue_task_fair() in case it would help to see
which two instructions are getting all the faults:

a001000f21c0 <dequeue_task_fair>:
a001000f21c0:   08 28 29 0e 80 05   [MMI]   alloc r37=ar.pfs,10,7,0
a001000f21c6:   c0 00 33 7e 46 00   adds r12=-32,r12
a001000f21cc:   c2 08 85 84 adds r16=4236,r33
a001000f21d0:   09 58 40 ab 16 27   [MMI]   addl r11=-685232,r1
a001000f21d6:   80 02 84 02 42 20   adds r40=128,r33
a001000f21dc:   05 10 01 84 mov r41=r34;;
a001000f21e0:   08 00 00 00 01 00   [MMI]   nop.m 0x0
a001000f21e6:   f0 00 2c 00 42 80   mov r15=r11
a001000f21ec:   04 00 c4 00 mov r36=b0
a001000f21f0:   19 30 00 50 07 39   [MMB]   cmp.eq p6,p7=0,r40
a001000f21f6:   30 00 c0 a3 4e 03   mov r3=-219008
a001000f21fc:   60 00 00 41   (p06) br.cond.spnt.few a001000f2250 ;;
a001000f2200:   0b 50 00 20 10 10   [MMI]   ld4 r10=[r16];;
a001000f2206:   90 50 3c 24 40 00   shladd r9=r10,3,r15
a001000f220c:   00 00 04 00 nop.i 0x0;;
a001000f2210:   0b 40 00 12 18 10   [MMI]   ld8 r8=[r9];;
a001000f2216:   60 1a 20 00 40 00   add r38=r3,r8
a001000f221c:   00 00 04 00 nop.i 0x0;;
a001000f2220:   11 18 71 4c 01 21   [MIB]   adds r35=156,r38
a001000f2226:   70 02 98 02 42 00   adds r39=128,r38
a001000f222c:   68 f3 ff 58 br.call.sptk.many b0=a001000f1580 ;;
a001000f2230:   0a 10 00 46 10 10   [MMI]   ld4 r2=[r35];;
a001000f2236:   e0 f8 0b 7e 46 00   adds r14=-1,r2
a001000f223c:   00 00 04 00 nop.i 0x0
a001000f2240:   0a 00 00 00 01 00   [MMI]   nop.m 0x0;;
a001000f2246:   00 70 8c 20 23 00   st4 [r35]=r14
a001000f224c:   00 00 04 00 nop.i 0x0
a001000f2250:   09 98 c0 02 d7 26   [MMI]   addl r19=-2069584,r1
a001000f2256:   50 21 80 00 42 00   adds r21=4,r32
a001000f225c:   83 01 05 84 adds r24=152,r32;;
a001000f2260:   09 90 00 26 18 10   [MMI]   ld8 r18=[r19]
a001000f2266:   60 01 54 20 20 00   ld4 r22=[r21]
a001000f226c:   00 00 04 00 nop.i 0x0;;
a001000f2270:   09 a0 fc 2d 3f 23   [MMI]   adds r20=-1,r22
a001000f2276:   10 01 48 a0 20 00   ld4.a r17=[r18]
a001000f227c:   00 00 04 00 nop.i 0x0;;
a001000f2280:   08 00 00 00 01 00   [MMI]   nop.m 0x0
a001000f2286:   00 a0 54 20 23 00   st4 [r21]=r20
a001000f228c:   a1 8a 24 50 tbit.z p8,p9=r17,21
a001000f2290:   18 00 00 00 01 00   [MMB]   nop.m 0x0
a001000f2296:   10 d9 00 80 02 00   chk.a.clr r17,a001000f2440
a001000f229c:   00 00 00 20 nop.b 0x0
a001000f22a0:   10 00 00 00 01 00   [MIB]   nop.m 0x0
a001000f22a6:   00 00 00 02 00 04   nop.i 0x0
a001000f22ac:   e0 00 00 43   (p08) br.cond.dpnt.few a001000f2380
a001000f22b0:   0b b8 00 30 10 10   [MMI]   ld4 r23=[r24];;
a001000f22b6:   b0 00 5c 14 73 00   cmp4.eq p11,p10=0,r23
a001000f22bc:   00 00 04 00 nop.i 0x0;;
a001000f22c0:   71 01 81 40 02 e1   [MIB] (p11) adds r32=288,r32
a001000f22c6:   12 01 00 00 42 05 (p11) mov r17=r0
a001000f22cc:   e0 00 00 42   (p10) br.cond.dptk.few a001000f23a0 ;;
a001000f22d0:   09 f8 40 18 00 21   [MMI]   adds r31=16,r12
a001000f22d6:   00 88 80 60 23 c0   st4.rel [r32]=r17
a001000f22dc:   00 10 1d 50 tbit.z p6,p7=r34,0;;
a001000f22e0:   10 00 44 3e 90 11   [MIB]   st4 [r31]=r17
a001000f22e6:   00 00 00 02 00 03   nop.i 0x0
a001000f22ec:   a0 00 00 43   (p06) br.cond.dpnt.few a001000f2380
a001000f22f0:   09 88 e0 42 02 21   [MMI]   adds r17=312,r33
a001000f22f6:   e0 80 85 04 42 40   adds r14=304,r33
a001000f22fc:   c4 0b 09 84 adds r34=316,r33;;
a001000f2300:   09 00 01 22 10 10   [MMI]   ld4 r32=[r17]
a000