On Mon, Sep 5, 2016 at 1:19 PM, matthew green <[email protected]> wrote:
> Ryota Ozaki writes:
>> On Thu, Sep 1, 2016 at 4:04 PM, matthew green <[email protected]> wrote:
>> > have you tested other values than 1 and 16? what about 4 or 8?
>>
>> 4 and 8 are not so good; their performance fluctuations are
>> similar to the unaligned case in my experiments.
>>
>> >
>> > can you post the size difference of kernels? particularly the
>> > kernel without DIAGNOSTIC or DEBUG (since those are the ones
>> > where performance matters most.)
>>
>> I measured the sizes of GENERIC kernels, i.e., DIAGNOSTIC on
>> and DEBUG off.
>
> DIAGNOSTIC is enabled on most -current GENERIC kernels including
> the amd64 one. it's disabled on release branches.

I tried without DIAGNOSTIC. The overhead due to alignment doesn't
change, but the total text size of the kernel is reduced by 660kB,
so the ratio of overhead increases a bit (< 1%).

>
>> The sizes of kernel binaries don't change in most cases because
>> the alignment of __rodata_start that begins just after kernel text
>> hides the changes due to -falign-functions.
>>
>> The sizes of the actual kernel text (from kernel_text to _etext)
>> slightly changes. The difference between that of GENERIC kernels
>> w/ and w/o -falign-functions=16 is 200kB. That is 1% of the total
>> kernel text size.
>>
>> BTW, as I noted, I'm not exploring an alignment size that provides
>> best performance, I just want to reduce performance fluctuations.
>
> 200KB is a lot of text. that's a non trivial i-cache issue.
>
> what are the CPU specifics of the system you're testing on?

dut1# cpuctl identify 0
cpu0: highest basic info 0000000b
cpu0: highest extended info 80000008
cpu0: "Intel(R) Atom(TM) CPU C2558 @ 2.40GHz"
cpu0: Intel Atom C2000 (686-class), 2400.27 MHz
cpu0: family 0x6 model 0x4d stepping 0x8 (id 0x406d8)
cpu0: features 0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE>
cpu0: features 0xbfebfbff<MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2>
cpu0: features 0xbfebfbff<SS,HTT,TM,SBF>
cpu0: features1 0x43d8e3bf<SSE3,PCLMULQDQ,DTES64,MONITOR,DS-CPL,VMX,EST,TM2>
cpu0: features1 0x43d8e3bf<SSSE3,CX16,xTPR,PDCM,SSE41,SSE42,MOVBE,POPCNT>
cpu0: features1 0x43d8e3bf<DEADLINE,AES,RDRAND>
cpu0: features2 0x28100800<SYSCALL/SYSRET,XD,RDTSCP,EM64T>
cpu0: features3 0x101<LAHF,PREFETCHW>
cpu0: I-cache 32KB 64B/line 8-way, D-cache 24KB 64B/line 6-way
cpu0: L2 cache 1MB 64B/line 16-way
cpu0: ITLB 48 4KB entries fully associative
cpu0: DTLB 128 4KB entries 4-way, 4K/2M: 16 entries
cpu0: Initial APIC ID 0
cpu0: Cluster/Package ID 0
cpu0: Core ID 0
cpu0: SMT ID 0
cpu0: DSPM-eax 0x5<DTS,ARAT>
cpu0: DSPM-ecx 0x9<HWF,EPB>
cpu0: SEF highest subleaf 00000000
cpu0: SEF-main 0x2282<TSCADJUST,SMEP,ERMS,FPUCSDS>
cpu0: microcode version 0x127, platform ID 0

> can you run performance tests on systems with small cache?

I haven't tested that yet. It will take a while because I don't
have a suitable machine. BTW, what size do you consider small?

Thanks,
  ozaki-r
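
For anyone who wants to repeat the text-size comparison, here is a rough
sketch of one way it could be scripted. It assumes the kernel_text and
_etext addresses are taken from `nm` output for each kernel. The helper
names (text_size_from_nm, overhead_percent) and the addresses in the
sample input are made up for illustration; they are not the measurements
quoted above.

```python
def text_size_from_nm(nm_output, start="kernel_text", end="_etext"):
    """Return (end - start) in bytes, given the text output of `nm`."""
    addrs = {}
    for line in nm_output.splitlines():
        fields = line.split()
        # Defined symbols are printed as: <hex address> <type> <name>
        if len(fields) == 3 and fields[2] in (start, end):
            addrs[fields[2]] = int(fields[0], 16)
    return addrs[end] - addrs[start]


def overhead_percent(aligned_bytes, baseline_bytes):
    """Text growth caused by alignment, as a percentage of the baseline."""
    return 100.0 * (aligned_bytes - baseline_bytes) / baseline_bytes


# Hypothetical nm excerpts; the addresses are illustrative only.
baseline = """\
ffffffff80100000 T kernel_text
ffffffff81400000 T _etext
"""
aligned = """\
ffffffff80100000 T kernel_text
ffffffff81432000 T _etext
"""

base_size = text_size_from_nm(baseline)
aligned_size = text_size_from_nm(aligned)
print(aligned_size - base_size, "bytes of extra text")
print(round(overhead_percent(aligned_size, base_size), 2), "% overhead")
```

Comparing the kernel_text.._etext span rather than the file size matters
here, since the alignment of __rodata_start hides the difference in the
size of the kernel binary.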
