Re: v4.16+ seeing many unaligned access in dequeue_task_fair() on IA64
* Peter Zijlstra wrote:

> On Wed, Apr 04, 2018 at 06:53:10PM +0200, Peter Zijlstra wrote:
> > Awesome, I'll go get it merged, even though I don't understand where it
> > went wobbly.
>
> Ingo, could you magic this in?
>
> ---
> Subject: sched: Force alignment of struct util_est
> From: Peter Zijlstra
> Date: Thu Apr 5 09:56:16 CEST 2018

Sure, done!

Thanks,

	Ingo
Re: v4.16+ seeing many unaligned access in dequeue_task_fair() on IA64
On Wed, Apr 04, 2018 at 06:53:10PM +0200, Peter Zijlstra wrote:
> Awesome, I'll go get it merged, even though I don't understand where it
> went wobbly.

Ingo, could you magic this in?

---
Subject: sched: Force alignment of struct util_est
From: Peter Zijlstra
Date: Thu Apr 5 09:56:16 CEST 2018

For some as yet not understood reason, Tony gets unaligned access
traps on IA64 because of:

  struct util_est ue = READ_ONCE(p->se.avg.util_est);

and:

  WRITE_ONCE(p->se.avg.util_est, ue);

introduced by commit:

  d519329f72a6 ("sched/fair: Update util_est only on util_avg updates")

Normally those two fields should end up on an 8-byte aligned location,
but UP and RANDSTRUCT can mess that up so enforce the alignment
explicitly.

Also make the alignment on sched_avg unconditional, as it is really
about data locality, not false-sharing.

With or without this patch the layout for sched_avg on a
ia64-defconfig build looks like:

  $ pahole -EC sched_avg ia64-defconfig/kernel/sched/core.o
  die__process_function: tag not supported (INVALID)!
  struct sched_avg {
          /* typedef u64 */ long long unsigned int last_update_time;  /*  0  8 */
          /* typedef u64 */ long long unsigned int load_sum;          /*  8  8 */
          /* typedef u64 */ long long unsigned int runnable_load_sum; /* 16  8 */
          /* typedef u32 */ unsigned int util_sum;                    /* 24  4 */
          /* typedef u32 */ unsigned int period_contrib;              /* 28  4 */
          long unsigned int load_avg;                                 /* 32  8 */
          long unsigned int runnable_load_avg;                        /* 40  8 */
          long unsigned int util_avg;                                 /* 48  8 */
          struct util_est {
                  unsigned int enqueued;                              /* 56  4 */
                  unsigned int ewma;                                  /* 60  4 */
          } util_est;                                                 /* 56  8 */
          /* --- cacheline 1 boundary (64 bytes) --- */

          /* size: 64, cachelines: 1, members: 9 */
  };

Fixes: d519329f72a6 ("sched/fair: Update util_est only on util_avg updates")
Reported-and-Tested-by: Tony Luck
Signed-off-by: Peter Zijlstra (Intel)
---
 include/linux/sched.h |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -300,7 +300,7 @@ struct util_est {
 	unsigned int			enqueued;
 	unsigned int			ewma;
 #define UTIL_EST_WEIGHT_SHIFT		2
-};
+} __attribute__((__aligned__(sizeof(u64))));
 
 /*
  * The load_avg/util_avg accumulates an infinite geometric series
@@ -364,7 +364,7 @@ struct sched_avg {
 	unsigned long			runnable_load_avg;
 	unsigned long			util_avg;
 	struct util_est			util_est;
-};
+} ____cacheline_aligned;
 
 struct sched_statistics {
 #ifdef CONFIG_SCHEDSTATS
@@ -435,7 +435,7 @@ struct sched_entity {
 	 * Put into separate cache line so it does not
 	 * collide with read-mostly values above.
 	 */
-	struct sched_avg		avg ____cacheline_aligned_in_smp;
+	struct sched_avg		avg;
 #endif
 };
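
To make the effect of the explicit alignment attribute in the patch concrete, here is a minimal user-space sketch. It is not kernel code: the struct names (util_est_plain, util_est_aligned, holder_plain, holder_aligned) are hypothetical, uint64_t stands in for the kernel's u64, and it assumes GCC or Clang on an ABI where unsigned int is 4 bytes. The holder structs put a 4-byte member in front so that the following member's offset depends only on its alignment requirement.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical copy of the two-field struct without any attribute. */
struct util_est_plain {
	unsigned int enqueued;
	unsigned int ewma;
};

/* Same fields, with the alignment attribute used in the patch
 * (uint64_t is an assumed stand-in for the kernel's u64). */
struct util_est_aligned {
	unsigned int enqueued;
	unsigned int ewma;
} __attribute__((__aligned__(sizeof(uint64_t))));

/* A leading 4-byte member makes the next offset depend on alignment. */
struct holder_plain {
	unsigned int lead;
	struct util_est_plain ue;
};

struct holder_aligned {
	unsigned int lead;
	struct util_est_aligned ue;
};

/* The attribute raises the type's alignment to that of a 64-bit word. */
static_assert(alignof(struct util_est_aligned) == sizeof(uint64_t),
	      "aligned variant must be 8-byte aligned");

static size_t plain_align(void)    { return alignof(struct util_est_plain); }
static size_t aligned_align(void)  { return alignof(struct util_est_aligned); }
static size_t plain_offset(void)   { return offsetof(struct holder_plain, ue); }
static size_t aligned_offset(void) { return offsetof(struct holder_aligned, ue); }
```

Under those assumptions the plain variant may legally sit at offset 4, while the attributed variant is pushed to offset 8 by compiler-inserted padding, so an 8-byte access to it is always naturally aligned.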
Re: v4.16+ seeing many unaligned access in dequeue_task_fair() on IA64
On Wed, Apr 04, 2018 at 09:38:56AM -0700, Luck, Tony wrote:
> On Wed, Apr 04, 2018 at 09:25:13AM +0200, Peter Zijlstra wrote:
> > Right, I remember being careful with that. Which again brings me to the
> > RANDSTRUCT thing, which will mess that up.
>
> No RANDSTRUCT config options set for my build.

Weird though, with or without that patch, my ia64-defconfig gives the
below layout.

  $ pahole -EC sched_avg ia64-defconfig/kernel/sched/core.o
  die__process_function: tag not supported (INVALID)!
  struct sched_avg {
          /* typedef u64 */ long long unsigned int last_update_time;  /*  0  8 */
          /* typedef u64 */ long long unsigned int load_sum;          /*  8  8 */
          /* typedef u64 */ long long unsigned int runnable_load_sum; /* 16  8 */
          /* typedef u32 */ unsigned int util_sum;                    /* 24  4 */
          /* typedef u32 */ unsigned int period_contrib;              /* 28  4 */
          long unsigned int load_avg;                                 /* 32  8 */
          long unsigned int runnable_load_avg;                        /* 40  8 */
          long unsigned int util_avg;                                 /* 48  8 */
          struct util_est {
                  unsigned int enqueued;                              /* 56  4 */
                  unsigned int ewma;                                  /* 60  4 */
          } util_est;                                                 /* 56  8 */
          /* --- cacheline 1 boundary (64 bytes) --- */

          /* size: 64, cachelines: 1, members: 9 */
  };

> > Does the below cure things? It makes absolutely no difference for my
> > x86_64-defconfig build, but it puts more explicit alignment constraints
> > on things.
>
> Yes. That fixes it. No unaligned traps with this patch applied.
>
> Tested-by: Tony Luck

Awesome, I'll go get it merged, even though I don't understand where it
went wobbly.
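
The offsets in the pahole output above can be cross-checked without a kernel build. The sketch below is a hypothetical user-space replica (sched_avg_copy and util_est_copy are made-up names), with fixed-width types substituted for the kernel types; in particular it assumes unsigned long is 8 bytes, as on 64-bit Linux, and models it as uint64_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical replica of struct util_est: two 4-byte fields. */
struct util_est_copy {
	uint32_t enqueued;
	uint32_t ewma;
};

/* Hypothetical replica of struct sched_avg; comments give the
 * offset and size columns from the pahole listing. */
struct sched_avg_copy {
	uint64_t last_update_time;	/*  0  8 */
	uint64_t load_sum;		/*  8  8 */
	uint64_t runnable_load_sum;	/* 16  8 */
	uint32_t util_sum;		/* 24  4 */
	uint32_t period_contrib;	/* 28  4 */
	uint64_t load_avg;		/* 32  8, assumed 8-byte unsigned long */
	uint64_t runnable_load_avg;	/* 40  8, assumed 8-byte unsigned long */
	uint64_t util_avg;		/* 48  8, assumed 8-byte unsigned long */
	struct util_est_copy util_est;	/* 56  8 */
};

/* Compile-time check that the replica matches the listing. */
static_assert(offsetof(struct sched_avg_copy, util_est) == 56,
	      "util_est expected at offset 56");

static size_t util_est_offset(void)
{
	return offsetof(struct sched_avg_copy, util_est);
}

static size_t sched_avg_size(void)
{
	return sizeof(struct sched_avg_copy);
}
```

Offset 56 is a multiple of 8, so within the struct util_est is 8-byte aligned; whether its absolute address is 8-byte aligned then depends only on where the enclosing sched_avg is placed.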
Re: v4.16+ seeing many unaligned access in dequeue_task_fair() on IA64
On Wed, Apr 04, 2018 at 09:25:13AM +0200, Peter Zijlstra wrote:
> Right, I remember being careful with that. Which again brings me to the
> RANDSTRUCT thing, which will mess that up.

No RANDSTRUCT config options set for my build.

> Does the below cure things? It makes absolutely no difference for my
> x86_64-defconfig build, but it puts more explicit alignment constraints
> on things.

Yes. That fixes it. No unaligned traps with this patch applied.

Tested-by: Tony Luck

Thanks

-Tony
Re: v4.16+ seeing many unaligned access in dequeue_task_fair() on IA64
On Wed, Apr 04, 2018 at 12:04:00AM +, Luck, Tony wrote:
> > bisect says:
> >
> > d519329f72a6 ("sched/fair: Update util_est only on util_avg updates")
> >
> > Reverting just this commit makes the problem go away.
>
> The unaligned read and write seem to come from:
>
> 	struct util_est ue = READ_ONCE(p->se.avg.util_est);
> 	WRITE_ONCE(p->se.avg.util_est, ue);
>
> which is puzzling as they were around before. Also the "avg"
> field is tagged with an attribute to make it cache aligned
> and there don't look to be holes in the structure that would
> make util_est not be 8-byte aligned ... though it does consist
> of two 4-byte fields, so legal for it to be 4-byte aligned.

Right, I remember being careful with that. Which again brings me to the
RANDSTRUCT thing, which will mess that up.

Does the below cure things? It makes absolutely no difference for my
x86_64-defconfig build, but it puts more explicit alignment constraints
on things.

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f228c6033832..b3d697f3b573 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -300,7 +300,7 @@ struct util_est {
 	unsigned int			enqueued;
 	unsigned int			ewma;
 #define UTIL_EST_WEIGHT_SHIFT		2
-};
+} __attribute__((__aligned__(sizeof(u64))));
 
 /*
  * The load_avg/util_avg accumulates an infinite geometric series
@@ -364,7 +364,7 @@ struct sched_avg {
 	unsigned long			runnable_load_avg;
 	unsigned long			util_avg;
 	struct util_est			util_est;
-};
+} ____cacheline_aligned;
 
 struct sched_statistics {
 #ifdef CONFIG_SCHEDSTATS
@@ -435,7 +435,7 @@ struct sched_entity {
 	 * Put into separate cache line so it does not
 	 * collide with read-mostly values above.
 	 */
-	struct sched_avg		avg ____cacheline_aligned_in_smp;
+	struct sched_avg		avg;
 #endif
 };
RE: v4.16+ seeing many unaligned access in dequeue_task_fair() on IA64
> bisect says:
>
> d519329f72a6 ("sched/fair: Update util_est only on util_avg updates")
>
> Reverting just this commit makes the problem go away.

The unaligned read and write seem to come from:

	struct util_est ue = READ_ONCE(p->se.avg.util_est);
	WRITE_ONCE(p->se.avg.util_est, ue);

which is puzzling as they were around before. Also the "avg"
field is tagged with an attribute to make it cache aligned
and there don't look to be holes in the structure that would
make util_est not be 8-byte aligned ... though it does consist
of two 4-byte fields, so legal for it to be 4-byte aligned.

-Tony
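
As an illustration of the last point, that a struct of two 4-byte fields may legally live at an address that is 4-byte but not 8-byte aligned, the following user-space sketch places such a struct at offset 4 of an 8-byte aligned buffer and copies it out with memcpy, which makes no alignment assumption. This is only a sketch: util_est_copy, is_aligned, load_any_alignment and roundtrip_at_offset4 are hypothetical names, it assumes a 4-byte unsigned int, and it is not a claim about how READ_ONCE() is implemented or about the root cause of the traps.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical copy of the two-field struct from the thread. */
struct util_est_copy {
	unsigned int enqueued;
	unsigned int ewma;
};

/* True when p is a multiple of align (align must be a power of two). */
static int is_aligned(const void *p, uintptr_t align)
{
	return ((uintptr_t)p & (align - 1)) == 0;
}

/* Copy the struct out byte-wise; valid at any source alignment. */
static struct util_est_copy load_any_alignment(const void *p)
{
	struct util_est_copy ue;

	memcpy(&ue, p, sizeof(ue));
	return ue;
}

/* Store a struct at offset 4 of an 8-byte aligned buffer, record
 * whether that address is 8-byte aligned, and read the struct back. */
static struct util_est_copy roundtrip_at_offset4(int *was_8_aligned)
{
	_Alignas(8) unsigned char buf[16] = { 0 };
	struct util_est_copy in = { .enqueued = 10, .ewma = 20 };

	memcpy(buf + 4, &in, sizeof(in));
	assert(is_aligned(buf + 4, 4));	/* fine for two 4-byte fields */
	*was_8_aligned = is_aligned(buf + 4, 8);
	return load_any_alignment(buf + 4);
}
```

A single 8-byte load from buf + 4 would be misaligned, which is the kind of access an architecture such as IA64 reports; two 4-byte loads or a byte-wise copy are not.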
Re: v4.16+ seeing many unaligned access in dequeue_task_fair() on IA64
On Tue, Apr 03, 2018 at 09:37:06AM +0200, Peter Zijlstra wrote:
> On Mon, Apr 02, 2018 at 04:24:49PM -0700, Luck, Tony wrote:
> > Any guesses before I start to bisect?
>
> That doesn't sound good. The only guess I have at this moment is you
> accidentally enabled RANDSTRUCT_PLUGIN and that messes things up.
>
> struct task_struct should be at least L1_CACHE_BYTES aligned, and C
> otherwise makes it fairly hard to cause unaligned accesses. Packed
> structures and/or casting are required, and I don't think we added
> anything dodgy like that here.

bisect says:

d519329f72a6 ("sched/fair: Update util_est only on util_avg updates")

Reverting just this commit makes the problem go away.

-Tony
Re: v4.16+ seeing many unaligned access in dequeue_task_fair() on IA64
On Mon, Apr 02, 2018 at 04:24:49PM -0700, Luck, Tony wrote:
> v4.16 boots cleanly. But with the first bunch of merges
> (Linus HEAD = 46e0d28bdb8e6d00e27a0fe9e1d15df6098f0ffb)
> I see a bunch of:
>
> ia64_handle_unaligned: 4863 callbacks suppressed
> kernel unaligned access to 0xe0031660fd74, ip=0xa001000f23e0
> kernel unaligned access to 0xe0033bdffbcc, ip=0xa001000f2370
> kernel unaligned access to 0xe0031660fd74, ip=0xa001000f23e0
> kernel unaligned access to 0xe0033bdffbcc, ip=0xa001000f2370
> kernel unaligned access to 0xe0031660fd74, ip=0xa001000f23e0
>
> The addresses are all 4-byte, but not 8-byte aligned.
>
> Any guesses before I start to bisect?

That doesn't sound good. The only guess I have at this moment is you
accidentally enabled RANDSTRUCT_PLUGIN and that messes things up.

struct task_struct should be at least L1_CACHE_BYTES aligned, and C
otherwise makes it fairly hard to cause unaligned accesses. Packed
structures and/or casting are required, and I don't think we added
anything dodgy like that here.
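
The remark in the quoted report that the faulting addresses are 4-byte but not 8-byte aligned can be checked with plain bit arithmetic. The sketch below uses the two data addresses exactly as they appear in this copy of the thread; the helper names (low3, aligned4_not8, check_thread_addresses) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Low three bits of an address: 0 means 8-byte aligned. */
static unsigned int low3(uint64_t addr)
{
	return (unsigned int)(addr & 7);
}

/* True for addresses that are multiples of 4 but not of 8. */
static int aligned4_not8(uint64_t addr)
{
	return (addr & 3) == 0 && (addr & 7) != 0;
}

/* The two data addresses from the trap messages, as printed above. */
static void check_thread_addresses(void)
{
	assert(aligned4_not8(0xe0031660fd74ULL));	/* ends in 0x4 */
	assert(aligned4_not8(0xe0033bdffbccULL));	/* ends in 0xc */
}
```

Both addresses have low three bits equal to 4 (0x4 and 0xc & 7), which fits an 8-byte access to an object that is only guaranteed 4-byte alignment.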
RE: v4.16+ seeing many unaligned access in dequeue_task_fair() on IA64
> kernel unaligned access to 0xe0031660fd74, ip=0xa001000f23e0
> kernel unaligned access to 0xe0033bdffbcc, ip=0xa001000f2370

Here's the disassembly of dequeue_task_fair() in case it would help
to see which two instructions are getting all the faults:

a001000f21c0 :
a001000f21c0: 08 28 29 0e 80 05 [MMI] alloc r37=ar.pfs,10,7,0
a001000f21c6: c0 00 33 7e 46 00       adds r12=-32,r12
a001000f21cc: c2 08 85 84             adds r16=4236,r33
a001000f21d0: 09 58 40 ab 16 27 [MMI] addl r11=-685232,r1
a001000f21d6: 80 02 84 02 42 20       adds r40=128,r33
a001000f21dc: 05 10 01 84             mov r41=r34;;
a001000f21e0: 08 00 00 00 01 00 [MMI] nop.m 0x0
a001000f21e6: f0 00 2c 00 42 80       mov r15=r11
a001000f21ec: 04 00 c4 00             mov r36=b0
a001000f21f0: 19 30 00 50 07 39 [MMB] cmp.eq p6,p7=0,r40
a001000f21f6: 30 00 c0 a3 4e 03       mov r3=-219008
a001000f21fc: 60 00 00 41             (p06) br.cond.spnt.few a001000f2250 ;;
a001000f2200: 0b 50 00 20 10 10 [MMI] ld4 r10=[r16];;
a001000f2206: 90 50 3c 24 40 00       shladd r9=r10,3,r15
a001000f220c: 00 00 04 00             nop.i 0x0;;
a001000f2210: 0b 40 00 12 18 10 [MMI] ld8 r8=[r9];;
a001000f2216: 60 1a 20 00 40 00       add r38=r3,r8
a001000f221c: 00 00 04 00             nop.i 0x0;;
a001000f2220: 11 18 71 4c 01 21 [MIB] adds r35=156,r38
a001000f2226: 70 02 98 02 42 00       adds r39=128,r38
a001000f222c: 68 f3 ff 58             br.call.sptk.many b0=a001000f1580 ;;
a001000f2230: 0a 10 00 46 10 10 [MMI] ld4 r2=[r35];;
a001000f2236: e0 f8 0b 7e 46 00       adds r14=-1,r2
a001000f223c: 00 00 04 00             nop.i 0x0
a001000f2240: 0a 00 00 00 01 00 [MMI] nop.m 0x0;;
a001000f2246: 00 70 8c 20 23 00       st4 [r35]=r14
a001000f224c: 00 00 04 00             nop.i 0x0
a001000f2250: 09 98 c0 02 d7 26 [MMI] addl r19=-2069584,r1
a001000f2256: 50 21 80 00 42 00       adds r21=4,r32
a001000f225c: 83 01 05 84             adds r24=152,r32;;
a001000f2260: 09 90 00 26 18 10 [MMI] ld8 r18=[r19]
a001000f2266: 60 01 54 20 20 00       ld4 r22=[r21]
a001000f226c: 00 00 04 00             nop.i 0x0;;
a001000f2270: 09 a0 fc 2d 3f 23 [MMI] adds r20=-1,r22
a001000f2276: 10 01 48 a0 20 00       ld4.a r17=[r18]
a001000f227c: 00 00 04 00             nop.i 0x0;;
a001000f2280: 08 00 00 00 01 00 [MMI] nop.m 0x0
a001000f2286: 00 a0 54 20 23 00       st4 [r21]=r20
a001000f228c: a1 8a 24 50             tbit.z p8,p9=r17,21
a001000f2290: 18 00 00 00 01 00 [MMB] nop.m 0x0
a001000f2296: 10 d9 00 80 02 00       chk.a.clr r17,a001000f2440
a001000f229c: 00 00 00 20             nop.b 0x0
a001000f22a0: 10 00 00 00 01 00 [MIB] nop.m 0x0
a001000f22a6: 00 00 00 02 00 04       nop.i 0x0
a001000f22ac: e0 00 00 43             (p08) br.cond.dpnt.few a001000f2380
a001000f22b0: 0b b8 00 30 10 10 [MMI] ld4 r23=[r24];;
a001000f22b6: b0 00 5c 14 73 00       cmp4.eq p11,p10=0,r23
a001000f22bc: 00 00 04 00             nop.i 0x0;;
a001000f22c0: 71 01 81 40 02 e1 [MIB] (p11) adds r32=288,r32
a001000f22c6: 12 01 00 00 42 05       (p11) mov r17=r0
a001000f22cc: e0 00 00 42             (p10) br.cond.dptk.few a001000f23a0 ;;
a001000f22d0: 09 f8 40 18 00 21 [MMI] adds r31=16,r12
a001000f22d6: 00 88 80 60 23 c0       st4.rel [r32]=r17
a001000f22dc: 00 10 1d 50             tbit.z p6,p7=r34,0;;
a001000f22e0: 10 00 44 3e 90 11 [MIB] st4 [r31]=r17
a001000f22e6: 00 00 00 02 00 03       nop.i 0x0
a001000f22ec: a0 00 00 43             (p06) br.cond.dpnt.few a001000f2380
a001000f22f0: 09 88 e0 42 02 21 [MMI] adds r17=312,r33
a001000f22f6: e0 80 85 04 42 40       adds r14=304,r33
a001000f22fc: c4 0b 09 84             adds r34=316,r33;;
a001000f2300: 09 00 01 22 10 10 [MMI] ld4 r32=[r17]
a000