Re: [PATCH v5 00/10] Function Granular KASLR

2020-09-28 Thread Kristen Carlson Accardi
Hi,

On Fri, 2020-09-25 at 15:06 +0200, Miroslav Benes wrote:
> Hi Kristen,
> 
> On Wed, 23 Sep 2020, Kristen Carlson Accardi wrote:
> 
> > Function Granular Kernel Address Space Layout Randomization (fgkaslr)
> > ---------------------------------------------------------------------
> > 
> > This patch set is an implementation of finer grained kernel address space
> > randomization. It rearranges your kernel code at load time
> > on a per-function level granularity, with only around a second added to
> > boot time.
> 
> I ran live patching kernel selftests on the patch set and everything
> passed fine.
> 
> However, we also use not-yet-upstream set of tests at SUSE for testing
> live patching [1] and one of them, klp_tc_12.sh, is failing. You should be
> able to run the set on upstream as is.
> 
> The test uninterruptedly sleeps in a kretprobed function called by a
> patched one. The current master without fgkaslr patch set reports the
> stack of the sleeping task as unreliable and live patching fails. The
> situation is different with fgkaslr (even with nofgkaslr on the command
> line). The stack is returned as reliable. It looks something like
> 
> [<0>] __schedule+0x465/0xa40
> [<0>] schedule+0x55/0xd0
> [<0>] orig_do_sleep+0xb1/0x110 [klp_test_support_mod]
> [<0>] swap_pages+0x7f/0x7f
> 
> where the last entry is not reliable. I've seen
> kretprobe_trampoline+0x0/0x4a and some other symbols there too. Since the
> patched function (orig_sleep_uninterruptible_set) is not on the stack,
> live patching succeeds, which is not intended.
> 
> With kprobe setting removed, all works as expected.
> 
> So I wonder if there is still some issue with ORC somewhere as you
> mentioned in v4 thread. I'll investigate more next week, but wanted to
> report early.
> 
> Regards
> Miroslav
> 
> [1] https://github.com/lpechacek/qa_test_klp

Thanks for testing and reporting. I will grab your test and see what I
can find.




[PATCH v5 10/10] livepatch: only match unique symbols when using fgkaslr

2020-09-23 Thread Kristen Carlson Accardi
If any type of function granular randomization is enabled, the sympos
algorithm will fail, as it will be impossible to resolve symbols when
there are duplicates using the previous symbol position.

Override the value of sympos to always be zero if fgkaslr is enabled for
either the core kernel or modules, forcing the algorithm
to require that only unique symbols are allowed to be patched.

Signed-off-by: Kristen Carlson Accardi 
---
 kernel/livepatch/core.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index f76fdb925532..da08e40f2da2 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -170,6 +170,17 @@ static int klp_find_object_symbol(const char *objname, const char *name,
 		kallsyms_on_each_symbol(klp_find_callback, &args);
 	mutex_unlock(&module_mutex);
 
+	/*
+	 * If any type of function granular randomization is enabled, it
+	 * will be impossible to resolve symbols when there are duplicates
+	 * using the previous symbol position (i.e. sympos != 0). Override
+	 * the value of sympos to always be zero in this case. This will
+	 * force the algorithm to require that only unique symbols are
+	 * allowed to be patched.
+	 */
+	if (IS_ENABLED(CONFIG_FG_KASLR) || IS_ENABLED(CONFIG_MODULE_FG_KASLR))
+		sympos = 0;
+
 	/*
 	 * Ensure an address was found. If sympos is 0, ensure symbol is unique;
 	 * otherwise ensure the symbol position count matches sympos.
-- 
2.20.1
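
For readers unfamiliar with sympos, the following is a minimal standalone
sketch of the matching rule the patch relies on. It is not the kernel's
implementation: the struct sym table and the find_symbol() helper are
hypothetical names used only for illustration. With sympos == 0 a name
must be unique; with sympos == N the N-th occurrence in table order is
taken, which is exactly the assumption that stops holding once the order
of functions is randomized.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical symbol table entry; not the kernel's kallsyms layout. */
struct sym {
	const char *name;
	unsigned long addr;
};

/*
 * Resolve `name` in `tab`. With sympos == 0 the name must be unique;
 * with sympos == N (N > 0) the N-th occurrence in table order is used.
 * Returns 0 and stores the address in *addr on success, -1 otherwise.
 */
int find_symbol(const struct sym *tab, size_t n, const char *name,
		unsigned long sympos, unsigned long *addr)
{
	unsigned long count = 0, found = 0;
	size_t i;

	assert(tab && name && addr);

	for (i = 0; i < n; i++) {
		if (strcmp(tab[i].name, name) != 0)
			continue;
		found = tab[i].addr;
		count++;
		/* Stop at the requested occurrence, or once ambiguity is proven. */
		if ((sympos && count == sympos) || (!sympos && count > 1))
			break;
	}

	if (count == 0)
		return -1;	/* symbol not found */
	if (sympos == 0 && count > 1)
		return -1;	/* ambiguous: duplicates exist */
	if (sympos > 0 && count != sympos)
		return -1;	/* fewer than sympos occurrences */

	*addr = found;
	return 0;
}
```

Forcing sympos to 0, as the patch does, means a lookup of a duplicated
name always takes the "ambiguous" path instead of silently picking an
occurrence whose position no longer means anything.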



[PATCH v5 03/10] x86: Makefile: Add build and config option for CONFIG_FG_KASLR

2020-09-23 Thread Kristen Carlson Accardi
Allow the user to select CONFIG_FG_KASLR if dependencies are met. Change
the Makefile to build with -ffunction-sections if CONFIG_FG_KASLR is set.

While the only architecture that supports CONFIG_FG_KASLR does not
currently enable HAVE_LD_DEAD_CODE_DATA_ELIMINATION, make sure these
2 features play nicely together for the future by ensuring that if
CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is selected when used with
CONFIG_FG_KASLR the function sections will not be consolidated back
into .text. Thanks to Kees Cook for the dead code elimination changes.

Signed-off-by: Kristen Carlson Accardi 
Reviewed-by: Tony Luck 
Reviewed-by: Kees Cook 
Tested-by: Tony Luck 
---
 Makefile  |  6 +-
 arch/x86/Kconfig  |  4 
 include/asm-generic/vmlinux.lds.h | 16 ++--
 init/Kconfig  | 14 ++
 4 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/Makefile b/Makefile
index 2b66d3398878..0c116b833fd5 100644
--- a/Makefile
+++ b/Makefile
@@ -878,10 +878,14 @@ KBUILD_CFLAGS += $(call cc-option, -fno-inline-functions-called-once)
 endif
 
 ifdef CONFIG_LD_DEAD_CODE_DATA_ELIMINATION
-KBUILD_CFLAGS_KERNEL += -ffunction-sections -fdata-sections
+KBUILD_CFLAGS_KERNEL += -fdata-sections
 LDFLAGS_vmlinux += --gc-sections
 endif
 
+ifneq ($(CONFIG_LD_DEAD_CODE_DATA_ELIMINATION)$(CONFIG_FG_KASLR),)
+KBUILD_CFLAGS += -ffunction-sections
+endif
+
 ifdef CONFIG_SHADOW_CALL_STACK
 CC_FLAGS_SCS   := -fsanitize=shadow-call-stack
 KBUILD_CFLAGS  += $(CC_FLAGS_SCS)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7101ac64bb20..ff0f90d0421f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -374,6 +374,10 @@ config CC_HAS_SANE_STACKPROTECTOR
   We have to make sure stack protector is unconditionally disabled if
   the compiler produces broken code.
 
+config ARCH_HAS_FG_KASLR
+   def_bool y
+   depends on RANDOMIZE_BASE && X86_64
+
 menu "Processor type and features"
 
 config ZONE_DMA
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 5430febd34be..afd5cdf79a3a 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -93,14 +93,12 @@
  * sections to be brought in with rodata.
  */
 #ifdef CONFIG_LD_DEAD_CODE_DATA_ELIMINATION
-#define TEXT_MAIN .text .text.[0-9a-zA-Z_]*
 #define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..LPBX*
 #define SDATA_MAIN .sdata .sdata.[0-9a-zA-Z_]*
 #define RODATA_MAIN .rodata .rodata.[0-9a-zA-Z_]*
 #define BSS_MAIN .bss .bss.[0-9a-zA-Z_]*
 #define SBSS_MAIN .sbss .sbss.[0-9a-zA-Z_]*
 #else
-#define TEXT_MAIN .text
 #define DATA_MAIN .data
 #define SDATA_MAIN .sdata
 #define RODATA_MAIN .rodata
@@ -108,6 +106,20 @@
 #define SBSS_MAIN .sbss
 #endif
 
+/*
+ * Both LD_DEAD_CODE_DATA_ELIMINATION and CONFIG_FG_KASLR options enable
+ * -ffunction-sections, which produces separately named .text sections. In
+ * the case of CONFIG_FG_KASLR, they need to stay distinct so they can be
+ * separately randomized. Without CONFIG_FG_KASLR, the separate .text
+ * sections can be collected back into a common section, which makes the
+ * resulting image slightly smaller.
+ */
+#if defined(CONFIG_LD_DEAD_CODE_DATA_ELIMINATION) && !defined(CONFIG_FG_KASLR)
+#define TEXT_MAIN .text .text.[0-9a-zA-Z_]*
+#else
+#define TEXT_MAIN .text
+#endif
+
 /*
  * GCC 4.5 and later have a 32 bytes section alignment for structures.
  * Except GCC 4.9, that feels the need to align on 64 bytes.
diff --git a/init/Kconfig b/init/Kconfig
index d6a0b31b13dc..81220973b064 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2019,6 +2019,20 @@ config PROFILING
 config TRACEPOINTS
bool
 
+config FG_KASLR
+   bool "Function Granular Kernel Address Space Layout Randomization"
+   depends on $(cc-option, -ffunction-sections)
+   depends on ARCH_HAS_FG_KASLR
+   default n
+   help
+ This option improves the randomness of the kernel text
+ over basic Kernel Address Space Layout Randomization (KASLR)
+ by reordering the kernel text at boot time. This feature
+ uses information generated at compile time to re-layout the
+ kernel text section at boot time at function level granularity.
+
+ If unsure, say N.
+
 endmenu # General setup
 
 source "arch/Kconfig"
-- 
2.20.1
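
As a rough illustration of what the TEXT_MAIN pattern in this patch
selects, the sketch below feeds the same glob to POSIX fnmatch(). This is
an assumption-laden stand-in: the linker has its own wildcard matcher, and
is_function_section() is a hypothetical helper, not something in the
kernel tree. It only shows which section names the pattern would gather
back into .text when CONFIG_FG_KASLR is off, and therefore which ones
stay separate when it is on.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stdbool.h>

/* Pattern copied from TEXT_MAIN; fnmatch() only approximates ld's matcher. */
#define FUNC_SECTION_GLOB ".text.[0-9a-zA-Z_]*"

/* Returns true if the name looks like a per-function .text.<symbol> section. */
bool is_function_section(const char *section_name)
{
	assert(section_name);
	return fnmatch(FUNC_SECTION_GLOB, section_name, 0) == 0;
}
```

With this pattern ".text.do_fork" matches, plain ".text" does not, and a
double-dot name such as the hypothetical ".text..example" does not either,
because the character after ".text." must be alphanumeric or an underscore.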



[PATCH v5 06/10] x86/boot/compressed: Avoid duplicate malloc() implementations

2020-09-23 Thread Kristen Carlson Accardi
From: Kees Cook 

The preboot malloc() (and free()) implementation in
include/linux/decompress/mm.h (which is also included by the
static decompressors) is static. This is fine when the only thing
interested in using malloc() is the decompression code, but the
x86 preboot environment uses malloc() in a couple places, leading to a
potential collision when the static copies of the available memory
region ("malloc_ptr") get reset to the global "free_mem_ptr" value.
As it happens, the existing usage pattern was safe because each user
did one malloc() and one free() before returning, and the calls were
not nested:

extract_kernel() (misc.c)
	choose_random_location() (kaslr.c)
		mem_avoid_init()
			handle_mem_options()
				malloc()
				...
				free()
	...
	parse_elf() (misc.c)
		malloc()
		...
		free()

Adding FGKASLR, however, will insert additional malloc() calls local to
fgkaslr.c in the middle of parse_elf()'s malloc()/free() pair:

parse_elf() (misc.c)
	malloc()
	if (...) {
		layout_randomized_image(output, &ehdr, phdrs);
			malloc() <- boom
			...
	else
		layout_image(output, &ehdr, phdrs);
	free()

To avoid collisions, there must be a single implementation of malloc().
Adjust include/linux/decompress/mm.h so that visibility can be
controlled, provide prototypes in misc.h, and implement the functions in
misc.c. This also results in a small size savings:

$ size vmlinux.before vmlinux.after
   text    data     bss     dec     hex filename
8842314     468  178320 9021102  89a6ae vmlinux.before
8842240     468  178320 9021028  89a664 vmlinux.after

Signed-off-by: Kees Cook 
Signed-off-by: Kristen Carlson Accardi 
---
 arch/x86/boot/compressed/kaslr.c |  4 
 arch/x86/boot/compressed/misc.c  |  3 +++
 arch/x86/boot/compressed/misc.h  |  2 ++
 include/linux/decompress/mm.h| 12 ++--
 4 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index dde7cb3724df..e811071ce5d2 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -32,10 +32,6 @@
 #include 
 #include 
 
-/* Macros used by the included decompressor code below. */
-#define STATIC
-#include <linux/decompress/mm.h>
-
 #ifdef CONFIG_X86_5LEVEL
 unsigned int __pgtable_l5_enabled;
 unsigned int pgdir_shift __ro_after_init = 39;
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index e478e40fbe5a..dc396321eba8 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -28,6 +28,9 @@
 
 /* Macros used by the included decompressor code below. */
 #define STATIC static
+/* Define an externally visible malloc()/free(). */
+#define MALLOC_VISIBLE
+#include <linux/decompress/mm.h>
 
 /*
  * Provide definitions of memzero and memmove as some of the decompressors will
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 726e264410ff..81fbc8d686fa 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -39,6 +39,8 @@
 /* misc.c */
 extern memptr free_mem_ptr;
 extern memptr free_mem_end_ptr;
+extern void *malloc(int size);
+extern void free(void *where);
 extern struct boot_params *boot_params;
 void __putstr(const char *s);
 void __puthex(unsigned long value);
diff --git a/include/linux/decompress/mm.h b/include/linux/decompress/mm.h
index 868e9eacd69e..9192986b1a73 100644
--- a/include/linux/decompress/mm.h
+++ b/include/linux/decompress/mm.h
@@ -25,13 +25,21 @@
 #define STATIC_RW_DATA static
 #endif
 
+/*
+ * When an architecture needs to share the malloc()/free() implementation
+ * between compilation units, it needs to have non-local visibility.
+ */
+#ifndef MALLOC_VISIBLE
+#define MALLOC_VISIBLE static
+#endif
+
 /* A trivial malloc implementation, adapted from
  *  malloc by Hannu Savolainen 1993 and Matthias Urlichs 1994
  */
 STATIC_RW_DATA unsigned long malloc_ptr;
 STATIC_RW_DATA int malloc_count;
 
-static void *malloc(int size)
+MALLOC_VISIBLE void *malloc(int size)
 {
void *p;
 
@@ -52,7 +60,7 @@ static void *malloc(int size)
return p;
 }
 
-static void free(void *where)
+MALLOC_VISIBLE void free(void *where)
 {
malloc_count--;
if (!malloc_count)
-- 
2.20.1
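
The collision described in the commit message can be reproduced outside
the kernel with a small model. The sketch below is only an approximation
under stated assumptions: struct bump_state, bump_alloc(), bump_free() and
the arena buffer are hypothetical names, and each struct instance stands
in for the private malloc_ptr/malloc_count pair that every static copy of
malloc() would carry. Two instances that both start from the same free
memory pointer hand out the same address; a single shared instance does
not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shared arena, standing in for the region that free_mem_ptr points at. */
static unsigned char arena[256];

/* Models the private state each static copy of malloc() would carry. */
struct bump_state {
	uintptr_t next;	/* like malloc_ptr: 0 means "start of arena" */
	int count;	/* like malloc_count: live allocations */
};

void *bump_alloc(struct bump_state *st, size_t size)
{
	void *p;

	assert(st);
	if (!st->next)
		st->next = (uintptr_t)arena;	/* (re)start at the arena base */

	st->next = (st->next + 3) & ~(uintptr_t)3;	/* 4-byte alignment */
	if (st->next + size > (uintptr_t)arena + sizeof(arena))
		return NULL;	/* out of space */

	p = (void *)st->next;
	st->next += size;
	st->count++;
	return p;
}

void bump_free(struct bump_state *st)
{
	assert(st && st->count > 0);
	if (--st->count == 0)
		st->next = 0;	/* everything released: start over */
}
```

Calling bump_alloc() on two separate states returns the same pointer for
both, which is the "boom" case above; calling it twice on one shared state
returns two distinct pointers, which is what a single visible malloc()
provides.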



[PATCH v5 04/10] x86: Make sure _etext includes function sections

2020-09-23 Thread Kristen Carlson Accardi
When using -ffunction-sections to place each function in
its own text section so it can be randomized at load time, the
linker considers these .text.* sections "orphaned sections", and
will place them after the first similar section (.text). In order
to accurately represent the end of the text section and the
orphaned sections, _etext must be moved so that it is after both
.text and .text.*. The text size must also be calculated to
include .text AND .text.*.

Signed-off-by: Kristen Carlson Accardi 
Reviewed-by: Tony Luck 
Tested-by: Tony Luck 
Reviewed-by: Kees Cook 
---
 arch/x86/kernel/vmlinux.lds.S | 17 +++--
 include/asm-generic/vmlinux.lds.h |  2 +-
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 9a03e5b23135..b0718eef283f 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -146,9 +146,22 @@ SECTIONS
 #endif
} :text =0x
 
-   /* End of text section, which should occupy whole number of pages */
-   _etext = .;
+   /*
+* -ffunction-sections creates .text.* sections, which are considered
+* "orphan sections" and added after the first similar section (.text).
+* Placing this ALIGN statement before _etext causes the address of
+* _etext to be below that of all the .text.* orphaned sections
+*/
. = ALIGN(PAGE_SIZE);
+   _etext = .;
+
+   /*
+* the size of the .text section is used to calculate the address
+* range for orc lookups. If we just use SIZEOF(.text), we will
+* miss all the .text.* sections. Calculate the size using _etext
+* and _stext and save the value for later.
+*/
+   text_size = _etext - _stext;
 
X86_ALIGN_RODATA_BEGIN
RO_DATA(PAGE_SIZE)
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index afd5cdf79a3a..6f7239e033e8 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -863,7 +863,7 @@
. = ALIGN(4);   \
.orc_lookup : AT(ADDR(.orc_lookup) - LOAD_OFFSET) { \
orc_lookup = .; \
-   . += (((SIZEOF(.text) + LOOKUP_BLOCK_SIZE - 1) /\
+   . += (((text_size + LOOKUP_BLOCK_SIZE - 1) /\
LOOKUP_BLOCK_SIZE) + 1) * 4;\
orc_lookup_end = .; \
}
-- 
2.20.1
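
To make the effect of the text_size change concrete, here is the
.orc_lookup sizing expression from the hunk above written as plain C. This
is a sketch, not kernel code: orc_lookup_bytes() is a hypothetical helper,
and the 256-byte LOOKUP_BLOCK_SIZE is an assumption about the x86 value
rather than something stated in this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 256-byte lookup blocks; the patch itself does not state this. */
#define LOOKUP_BLOCK_SIZE 256u

/*
 * Bytes reserved for .orc_lookup: one 4-byte entry per lookup block of
 * text (rounded up), plus one extra entry, as in the linker script.
 */
size_t orc_lookup_bytes(size_t text_size)
{
	assert(text_size <= SIZE_MAX - LOOKUP_BLOCK_SIZE);	/* no overflow */
	return ((text_size + LOOKUP_BLOCK_SIZE - 1) / LOOKUP_BLOCK_SIZE + 1) * 4;
}
```

With these assumptions, a 1 MiB .text alone yields 16388 bytes, while
10 MiB of .text plus .text.* yields 163844 bytes. Sizing the table from
SIZEOF(.text) would therefore leave no lookup entries for addresses in
the orphaned function sections.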



[PATCH v5 02/10] x86/boot: Allow a "silent" kaslr random byte fetch

2020-09-23 Thread Kristen Carlson Accardi
From: Kees Cook 

Under earlyprintk, each RNG call produces a debug report line. When
shuffling hundreds of functions, this is not useful information (each
line is identical and tells us nothing new). Instead, allow for a NULL
"purpose" to suppress the debug reporting.

Signed-off-by: Kees Cook 
Signed-off-by: Kristen Carlson Accardi 
---
 arch/x86/lib/kaslr.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/x86/lib/kaslr.c b/arch/x86/lib/kaslr.c
index a53665116458..2b3eb8c948a3 100644
--- a/arch/x86/lib/kaslr.c
+++ b/arch/x86/lib/kaslr.c
@@ -56,11 +56,14 @@ unsigned long kaslr_get_random_long(const char *purpose)
unsigned long raw, random = get_boot_seed();
bool use_i8254 = true;
 
-   debug_putstr(purpose);
-   debug_putstr(" KASLR using");
+   if (purpose) {
+   debug_putstr(purpose);
+   debug_putstr(" KASLR using");
+   }
 
if (has_cpuflag(X86_FEATURE_RDRAND)) {
-   debug_putstr(" RDRAND");
+   if (purpose)
+   debug_putstr(" RDRAND");
if (rdrand_long(&raw)) {
random ^= raw;
use_i8254 = false;
@@ -68,7 +71,8 @@ unsigned long kaslr_get_random_long(const char *purpose)
}
 
if (has_cpuflag(X86_FEATURE_TSC)) {
-   debug_putstr(" RDTSC");
+   if (purpose)
+   debug_putstr(" RDTSC");
raw = rdtsc();
 
random ^= raw;
@@ -76,7 +80,8 @@ unsigned long kaslr_get_random_long(const char *purpose)
}
 
if (use_i8254) {
-   debug_putstr(" i8254");
+   if (purpose)
+   debug_putstr(" i8254");
random ^= i8254();
}
 
@@ -86,7 +91,8 @@ unsigned long kaslr_get_random_long(const char *purpose)
: "a" (random), "rm" (mix_const));
random += raw;
 
-   debug_putstr("...\n");
+   if (purpose)
+   debug_putstr("...\n");
 
return random;
 }
-- 
2.20.1



Re: [PATCH v4 00/10] Function Granular KASLR

2020-08-21 Thread Kristen Carlson Accardi
On Wed, 2020-07-22 at 16:33 -0500, Josh Poimboeuf wrote:
> On Wed, Jul 22, 2020 at 12:56:10PM -0700, Kristen Carlson Accardi wrote:
> > On Wed, 2020-07-22 at 12:42 -0700, Kees Cook wrote:
> > > On Wed, Jul 22, 2020 at 11:07:30AM -0500, Josh Poimboeuf wrote:
> > > > On Wed, Jul 22, 2020 at 07:39:55AM -0700, Kees Cook wrote:
> > > > > On Wed, Jul 22, 2020 at 11:27:30AM +0200, Miroslav Benes wrote:
> > > > > > Let me CC live-patching ML, because from a quick glance this is
> > > > > > something which could impact live patching code. At least it
> > > > > > invalidates assumptions which "sympos" is based on.
> > > > > 
> > > > > In a quick skim, it looks like the symbol resolution is using
> > > > > kallsyms_on_each_symbol(), so I think this is safe? What's a good
> > > > > selftest for live-patching?
> > > > 
> > > > The problem is duplicate symbols.  If there are two static functions
> > > > named 'foo' then livepatch needs a way to distinguish them.
> > > > 
> > > > Our current approach to that problem is "sympos".  We rely on the
> > > > fact that the second foo() always comes after the first one in the
> > > > symbol list and kallsyms.  So they're referred to as foo,1 and foo,2.
> > > 
> > > Ah. Fun. In that case, perhaps the LTO series has some solutions. I
> > > think builds with LTO end up renaming duplicate symbols like that, so
> > > it'll be back to being unique.
> > > 
> > 
> > Well, glad to hear there might be some precedent for how to solve
> > this, as I wasn't able to think of something reasonable off the top of
> > my head. Are you speaking of the Clang LTO series? 
> > https://lore.kernel.org/lkml/20200624203200.78870-1-samitolva...@google.com/
> 
> I'm not sure how LTO does it, but a few more (half-brained) ideas that
> could work:
> 
> 1) Add a field in kallsyms to keep track of a symbol's original offset
>    before randomization/re-sorting.  Livepatch could use that field to
>    determine the original sympos.
> 
> 2) In fgkaslr code, go through all the sections and mark the ones which
>    have duplicates (i.e. same name).  Then when shuffling the sections,
>    skip a shuffle if it involves a duplicate section.  That way all the
>    duplicates would retain their original sympos.
> 
> 3) Livepatch could uniquely identify symbols by some feature other than
>    sympos.  For example:
> 
>    Symbol/function size - obviously this would only work if duplicately
>    named symbols have different sizes.
> 
>    Checksum - as part of a separate feature we're also looking at giving
>    each function its own checksum, calculated based on its instruction
>    opcodes.  Though calculating checksums at runtime could be
>    complicated by IP-relative addressing.
> 
> I'm thinking #1 or #2 wouldn't be too bad.  #3 might be harder.
> 

Hi there! I was trying to find a super easy way to address this, so I
thought the best thing would be if there were a compiler or linker
switch to just eliminate any duplicate symbols at compile time for
vmlinux. I filed this question on the binutils bugzilla looking to see
if there were existing flags that might do this, but H.J. Lu went ahead
and created a new one "-z unique", that seems to do what we would need
it to do. 

https://sourceware.org/bugzilla/show_bug.cgi?id=26391

When I use this option, it renames any duplicate symbols with an
extension - for example duplicatefunc.1 or duplicatefunc.2. You could
either match on the full unique name of the specific binary you are
trying to patch, or you match the base name and use the extension to
determine original position. Do you think this solution would work? If
so, I can modify livepatch to refuse to patch duplicated symbols when
CONFIG_FG_KASLR is enabled. Once this option is merged into the
toolchain, I can add it to KBUILD_LDFLAGS under CONFIG_FG_KASLR, and
livepatching should then work in all cases.
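
To make the "match the base name and use the extension" idea concrete,
here is a minimal sketch of splitting a renamed symbol into its base name
and numeric suffix. It assumes the naming described above
(duplicatefunc.1, duplicatefunc.2); split_unique_suffix() is a
hypothetical helper, not part of livepatch or binutils, and the final
linker behavior may differ.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Split "base.N" into the length of "base" and the position N.
 * Returns N (>= 1) when the name ends in ".<digits>", otherwise 0 with
 * *base_len set to the full length of the name.
 */
unsigned long split_unique_suffix(const char *sym, size_t *base_len)
{
	const char *dot, *p;
	unsigned long pos;

	assert(sym && base_len);
	*base_len = strlen(sym);	/* default: whole string is the base name */

	dot = strrchr(sym, '.');
	if (!dot || dot == sym || dot[1] == '\0')
		return 0;
	for (p = dot + 1; *p; p++)
		if (!isdigit((unsigned char)*p))
			return 0;	/* e.g. "foo.cold" is not a position */

	pos = strtoul(dot + 1, NULL, 10);
	if (pos == 0)
		return 0;		/* ".0" is not treated as a position */

	*base_len = (size_t)(dot - sym);
	return pos;
}
```

A caller could compare the first base_len bytes against the name in the
patch and treat the returned number as the original position, while
names without a numeric suffix are treated as unique.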



[PATCH] objtool: support symtab_shndx during dump

2020-08-12 Thread Kristen Carlson Accardi
When getting the symbol index number, make sure to use the
extended symbol table information in order to support symbol
indexes greater than 64K.

Signed-off-by: Kristen Carlson Accardi 
---
 tools/objtool/orc_dump.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/tools/objtool/orc_dump.c b/tools/objtool/orc_dump.c
index fca46e006fc2..cf835069724a 100644
--- a/tools/objtool/orc_dump.c
+++ b/tools/objtool/orc_dump.c
@@ -74,7 +74,8 @@ int orc_dump(const char *_objname)
GElf_Rela rela;
GElf_Sym sym;
Elf_Data *data, *symtab = NULL, *rela_orc_ip = NULL;
-
+   Elf_Data *xsymtab = NULL;
+   Elf32_Word shndx;
 
objname = _objname;
 
@@ -138,6 +139,8 @@ int orc_dump(const char *_objname)
orc_ip_addr = sh.sh_addr;
} else if (!strcmp(name, ".rela.orc_unwind_ip")) {
rela_orc_ip = data;
+   } else if (!strcmp(name, ".symtab_shndx")) {
+   xsymtab = data;
}
}
 
@@ -157,13 +160,22 @@ int orc_dump(const char *_objname)
return -1;
}
 
-		if (!gelf_getsym(symtab, GELF_R_SYM(rela.r_info), &sym)) {
-			WARN_ELF("gelf_getsym");
+		if (!gelf_getsymshndx(symtab, xsymtab,
+				      GELF_R_SYM(rela.r_info),
+				      &sym, &shndx)) {
+			WARN_ELF("gelf_getsymshndx");
return -1;
}
 
if (GELF_ST_TYPE(sym.st_info) == STT_SECTION) {
-   scn = elf_getscn(elf, sym.st_shndx);
+   if ((sym.st_shndx > SHN_UNDEF &&
+sym.st_shndx < SHN_LORESERVE) ||
+   (xsymtab && sym.st_shndx == SHN_XINDEX)) {
+   if (sym.st_shndx != SHN_XINDEX)
+   shndx = sym.st_shndx;
+   }
+
+   scn = elf_getscn(elf, shndx);
if (!scn) {
WARN_ELF("elf_getscn");
return -1;
-- 
2.20.1
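
The index selection that the patch adds can be looked at in isolation.
The sketch below is a standalone approximation, not the objtool code:
resolve_section_index() is a hypothetical helper, and the X_SHN_*
constants mirror the ELF values normally provided by <elf.h>. Unlike the
hunk above, it reports failure explicitly for undefined or reserved
indexes instead of falling through to the lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values from the ELF specification (normally provided by <elf.h>). */
#define X_SHN_UNDEF     0x0000u
#define X_SHN_LORESERVE 0xff00u
#define X_SHN_XINDEX    0xffffu

/*
 * Pick the real section index for a symbol.
 *   st_shndx     - the 16-bit field from the symbol table entry
 *   xindex       - the matching .symtab_shndx entry (ignored when absent)
 *   have_xsymtab - whether the object has a .symtab_shndx section
 * Returns true and stores the index in *out when the symbol refers to a
 * real section, false for undefined or reserved values.
 */
bool resolve_section_index(uint16_t st_shndx, uint32_t xindex,
			   bool have_xsymtab, uint32_t *out)
{
	assert(out);

	if (st_shndx > X_SHN_UNDEF && st_shndx < X_SHN_LORESERVE) {
		*out = st_shndx;	/* ordinary index fits in 16 bits */
		return true;
	}
	if (have_xsymtab && st_shndx == X_SHN_XINDEX) {
		*out = xindex;		/* escape value: use the extended table */
		return true;
	}
	return false;			/* undefined or reserved (e.g. SHN_ABS) */
}
```

This matters for fgkaslr because building with -ffunction-sections can
push the section count past what the 16-bit st_shndx field can encode,
at which point the real index lives only in the extended table.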



Re: [PATCH v4 00/10] Function Granular KASLR

2020-08-12 Thread Kristen Carlson Accardi
On Tue, 2020-08-04 at 14:23 -0400, Joe Lawrence wrote:
> On Fri, Jul 17, 2020 at 09:59:57AM -0700, Kristen Carlson Accardi wrote:
> > Function Granular Kernel Address Space Layout Randomization (fgkaslr)
> > ---------------------------------------------------------------------
> > 
> > This patch set is an implementation of finer grained kernel address space
> > randomization. It rearranges your kernel code at load time
> > on a per-function level granularity, with only around a second added to
> > boot time.
> > 
> > Changes in v4:
> > --------------
> > * dropped the patch to split out change to STATIC definition in
> >   x86/boot/compressed/misc.c and replaced with a patch authored
> >   by Kees Cook to avoid the duplicate malloc definitions
> > * Added a section to Documentation/admin-guide/kernel-parameters.txt
> >   to document the fgkaslr boot option.
> > * redesigned the patch to hide the new layout when reading
> >   /proc/kallsyms. The previous implementation utilized a dynamically
> >   allocated linked list to display the kernel and module symbols
> >   in alphabetical order. The new implementation uses a randomly
> >   shuffled index array to display the kernel and module symbols
> >   in a random order.
> > 
> > Changes in v3:
> > --------------
> > * Makefile changes to accommodate CONFIG_LD_DEAD_CODE_DATA_ELIMINATION
> > * removal of extraneous ALIGN_PAGE from _etext changes
> > * changed variable names in x86/tools/relocs to be less confusing
> > * split out change to STATIC definition in x86/boot/compressed/misc.c
> > * Updates to Documentation to make it more clear what is preserved in .text
> > * much more detailed commit message for function granular KASLR patch
> > * minor tweaks and changes that make for more readable code
> > * this cover letter updated slightly to add additional details
> > 
> > Changes in v2:
> > --------------
> > * Fix to address i386 build failure
> > * Allow module reordering patch to be configured separately so that
> >   arm (or other non-x86_64 arches) can take advantage of module function
> >   reordering. This support has not been tested by me, but smoke tested by
> >   Ard Biesheuvel on arm.
> > * Fix build issue when building on arm as reported by
> >   Ard Biesheuvel
> > 
> > Patches to objtool are included because they are dependencies for this
> > patchset, however they have been submitted by their maintainer separately.
> > 
> > Background
> > ----------
> > KASLR was merged into the kernel with the objective of increasing the
> > difficulty of code reuse attacks. Code reuse attacks reused existing code
> > snippets to get around existing memory protections. They exploit software bugs
> > which expose addresses of useful code snippets to control the flow of
> > execution for their own nefarious purposes. KASLR moves the entire kernel
> > code text as a unit at boot time in order to make addresses less predictable.
> > The order of the code within the segment is unchanged - only the base address
> > is shifted. There are a few shortcomings to this algorithm.
> > 
> > 1. Low Entropy - there are only so many locations the kernel can fit in. This
> >    means an attacker could guess without too much trouble.
> > 2. Knowledge of a single address can reveal the offset of the base address,
> >    exposing all other locations for a published/known kernel image.
> > 3. Info leaks abound.
> > 
> > Finer grained ASLR has been proposed as a way to make ASLR more resistant
> > to info leaks. It is not a new concept at all, and there are many variations
> > possible. Function reordering is an implementation of finer grained ASLR
> > which randomizes the layout of an address space on a function level
> > granularity. We use the term "fgkaslr" in this document to refer to the
> > technique of function reordering when used with KASLR, as well as finer grained
> > KASLR in general.
> > 
> > Proposed Improvement
> > --------------------
> > This patch set proposes adding function reordering on top of the existing
> > KASLR base address randomization. The over-arching objective is incremental
> > improvement over what we already have. It is designed to work in combination
> > with the e

Re: [PATCH v4 00/10] Function Granular KASLR

2020-08-10 Thread Kristen Carlson Accardi
On Fri, 2020-08-07 at 10:20 -0700, Kees Cook wrote:
> On Fri, Aug 07, 2020 at 09:38:11AM -0700, Kristen Carlson Accardi wrote:
> > Thanks for testing. Yes, Josh and I have been discussing the orc_unwind
> > issues. I've root caused one issue already, in that objtool places an
> > orc_unwind_ip address just outside the section, so my algorithm fails
> > to relocate this address. There are other issues as well that I still
> > haven't root caused. I'll be addressing this in v5 and plan to have
> > something that passes livepatch testing with that version.
> 
> FWIW, I'm okay with seeing fgkaslr be developed progressively. Getting
> it working with !livepatching would be fine as a first step. There's
> value in getting the general behavior landed, and then continuing to
> improve it.
> 

In this case, part of the issue with livepatching appears to be a more
general issue with objtool and how it creates the orc unwind entries
when you have >64K sections. So livepatching is a good test case for
making sure that the orc tables are actually correct. However, the
other issue with livepatching (the duplicate symbols), might be worth
deferring if the solution is complex - I will keep that in mind as I
look at it more closely.




Re: [PATCH v4 00/10] Function Granular KASLR

2020-08-07 Thread Kristen Carlson Accardi
On Tue, 2020-08-04 at 14:23 -0400, Joe Lawrence wrote:

Re: [PATCH v4 00/10] Function Granular KASLR

2020-08-06 Thread Kristen Carlson Accardi
Hi Mingo, thanks for taking a look, I am glad you like the idea. Some
replies below:

On Thu, 2020-08-06 at 17:32 +0200, Ingo Molnar wrote:
> * Kristen Carlson Accardi wrote:
> 
> > Function Granular Kernel Address Space Layout Randomization (fgkaslr)
> > ---------------------------------------------------------------------
> > 
> > This patch set is an implementation of finer grained kernel address space
> > randomization. It rearranges your kernel code at load time
> > on a per-function level granularity, with only around a second added to
> > boot time.
> 
> This is a very nice feature IMO, and it should be far more effective
> at randomizing the kernel, due to the sheer number of randomization
> bits that kernel function granular randomization presents.
> 
> If this is a good approximation of fg-kaslr randomization depth:
> 
>   thule:~/tip> grep ' [tT] ' /proc/kallsyms  | wc -l
>   88488
> 
> ... then that's 80K bits of randomization instead of the mere handful
> of kaslr bits we have today. Very nice!
> 
> > In order to hide our new layout, symbols reported through 
> > /proc/kallsyms will be displayed in a random order.
> 
> Neat. :-)
> 
> > Performance Impact
> > --
> > * Run time
> > The performance impact at run-time of function reordering varies by
> > workload.
> > Using kcbench, a kernel compilation benchmark, the performance of a
> > kernel
> > build with finer grained KASLR was about 1% slower than a kernel
> > with standard
> > KASLR. Analysis with perf showed a slightly higher percentage of 
> > L1-icache-load-misses. Other workloads were examined as well, with
> > varied
> > results. Some workloads performed significantly worse under
> > FGKASLR, while
> > others stayed the same or were mysteriously better. In general, it
> > will
> > depend on the code flow whether or not finer grained KASLR will
> > impact
> > your workload, and how the underlying code was designed. Because
> > the layout
> > changes per boot, each time a system is rebooted the performance of
> > a workload
> > may change.
> 
> I'd guess that the biggest performance impact comes from tearing
> apart 
> 'groups' of functions that particular workloads are using.
> 
> In that sense it might be worthwhile to add a '__kaslr_group'
> function 
> tag to key functions, which would keep certain performance critical 
> functions next to each other.
> 
> This shouldn't really be a problem, as even with generous amount of 
> grouping the number of randomization bits is incredibly large.

So my strategy so far was to try to get a very basic non-performance
optimized fgkaslr mode merged first, then add performance optimized
options as a next step. For example, a user might pass fgkaslr="group"
on the kernel command line to select a layout which groups some things
by whatever criteria we want, to mitigate some of the performance
impact of full randomization, or they might choose fgkaslr="full",
which just randomizes everything (the current implementation). If
people think it's worth adding the performance optimizations for the
initial merge, I can certainly work on those, but I thought it might be
better to keep it super simple at first.

> 
> > Future work could identify hot areas that may not be randomized and
> > either
> > leave them in the .text section or group them together into a
> > single section
> > that may be randomized. If grouping things together helps, one
> > other thing to
> > consider is that if we could identify text blobs that should be
> > grouped together
> > to benefit a particular code flow, it could be interesting to
> > explore
> > whether this security feature could be also be used as a
> > performance
> > feature if you are interested in optimizing your kernel layout for
> > a
> > particular workload at boot time. Optimizing function layout for a
> > particular
> > workload has been researched and proven effective - for more
> > information
> > read the Facebook paper "Optimizing Function Placement for Large-
> > Scale
> > Data-Center Applications" (see references section below).
> 
> I'm pretty sure the 'grouping' solution would address any real 
> slowdowns.
> 
> I'd also suggest allowing the passing in of a boot-time pseudo-
> random 
> generator seed number, which would allow the creation of a 
> pseudo-randomized but repeatable layout across reboots.

We talked during the RFC stage of porting the chacha20 code to this
early boot stage to use as a prand gener

Re: [PATCH v4 00/10] Function Granular KASLR

2020-07-22 Thread Kristen Carlson Accardi
On Wed, 2020-07-22 at 12:42 -0700, Kees Cook wrote:
> On Wed, Jul 22, 2020 at 11:07:30AM -0500, Josh Poimboeuf wrote:
> > On Wed, Jul 22, 2020 at 07:39:55AM -0700, Kees Cook wrote:
> > > On Wed, Jul 22, 2020 at 11:27:30AM +0200, Miroslav Benes wrote:
> > > > Let me CC live-patching ML, because from a quick glance this is
> > > > something 
> > > > which could impact live patching code. At least it invalidates
> > > > assumptions 
> > > > which "sympos" is based on.
> > > 
> > > In a quick skim, it looks like the symbol resolution is using
> > > kallsyms_on_each_symbol(), so I think this is safe? What's a good
> > > selftest for live-patching?
> > 
> > The problem is duplicate symbols.  If there are two static
> > functions
> > named 'foo' then livepatch needs a way to distinguish them.
> > 
> > Our current approach to that problem is "sympos".  We rely on the
> > fact
> > that the second foo() always comes after the first one in the
> > symbol
> > list and kallsyms.  So they're referred to as foo,1 and foo,2.
> 
> Ah. Fun. In that case, perhaps the LTO series has some solutions. I
> think builds with LTO end up renaming duplicate symbols like that, so
> it'll be back to being unique.
> 

Well, glad to hear there might be some precedent for how to solve
this, as I wasn't able to think of something reasonable off the top of
my head. Are you speaking of the Clang LTO series? 
https://lore.kernel.org/lkml/20200624203200.78870-1-samitolva...@google.com/
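
For readers following the sympos discussion quoted above, here is a
minimal userspace sketch of why list order matters. It assumes a
simplified symbol table and a hypothetical helper, find_by_sympos();
neither is the actual livepatch or kallsyms code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical symbol record; real kallsyms data is laid out differently. */
struct sym {
	const char *name;
	unsigned long addr;
};

/*
 * Return the address of the pos-th symbol called name (pos counts from 1),
 * walking the table in the order given. Returns 0 if there is no such entry.
 */
static unsigned long find_by_sympos(const struct sym *tab, size_t n,
				    const char *name, unsigned int pos)
{
	unsigned int seen = 0;
	size_t i;

	assert(pos > 0);
	for (i = 0; i < n; i++) {
		if (strcmp(tab[i].name, name) == 0 && ++seen == pos)
			return tab[i].addr;
	}
	return 0;
}
```

Because the lookup depends only on the order in which duplicates are
listed, "foo,2" names a different function as soon as the two foo()
entries swap places in the list.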



Re: [PATCH v4 00/10] Function Granular KASLR

2020-07-22 Thread Kristen Carlson Accardi
On Wed, 2020-07-22 at 10:56 -0400, Joe Lawrence wrote:
> On 7/22/20 10:51 AM, Joe Lawrence wrote:
> > On 7/22/20 10:39 AM, Kees Cook wrote:
> > > On Wed, Jul 22, 2020 at 11:27:30AM +0200, Miroslav Benes wrote:
> > > > Let me CC live-patching ML, because from a quick glance this is
> > > > something
> > > > which could impact live patching code. At least it invalidates
> > > > assumptions
> > > > which "sympos" is based on.
> > > 
> > > In a quick skim, it looks like the symbol resolution is using
> > > kallsyms_on_each_symbol(), so I think this is safe? What's a good
> > > selftest for live-patching?
> > > 
> > 
> > Hi Kees,
> > 
> > I don't think any of the in-tree tests currently exercise the
> > kallsyms/sympos end of livepatching.
> > 
> 
> On second thought, I misspoke. The general livepatch code does use
> it:
> 
> klp_init_object
>   klp_init_object_loaded
>     klp_find_object_symbol
> 
> in which case any of the current kselftests should exercise that.
> 
>   % make -C tools/testing/selftests/livepatch run_tests
> 
> -- Joe
> 

Thanks, it looks like this should work for helping me exercise the live
patch code paths. I will take a look and get back to you all.



Re: [PATCH v4 09/10] kallsyms: Hide layout

2020-07-20 Thread Kristen Carlson Accardi
On Sun, 2020-07-19 at 18:25 -0700, Kees Cook wrote:
> On Fri, Jul 17, 2020 at 10:00:06AM -0700, Kristen Carlson Accardi
> wrote:
> > This patch makes /proc/kallsyms display in a random order, rather
> > than sorted by address in order to hide the newly randomized
> > address
> > layout.
> 
> Ah! Much nicer. Is there any reason not to just do this
> unconditionally,
> regardless of FGKASLR? It's a smallish dynamic allocation, and
> displaying kallsyms is hardly fast-path...

My only concern would be whether or not there are scripts out there
which assume the list would be ordered. If someone chooses to enable
CONFIG_FG_KASLR, I think that it is reasonable to break those scripts.
On the flip side, I don't know why it needs to come out of
/proc/kallsyms in order; you can always just sort it after the fact if
you need it sorted. It would make it more maintainable to not special
case this. Hopefully a maintainer will comment on their preference.
Another thing I do in this patch is continue to use the existing
sorted-by-address functions if you are root. I didn't know if this was
necessary - it'd be nice if we could just do it the same way all the
time. But I need some guidance here.
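
A minimal userspace sketch of the shuffled-index idea discussed here,
assuming a Fisher-Yates shuffle and a toy xorshift generator standing in
for a kernel random source. The names build_shuffled_index() and
next_rand() are illustrative and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small xorshift PRNG; a userspace stand-in for a kernel random source. */
static uint32_t next_rand(uint32_t *state)
{
	uint32_t x = *state;

	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	*state = x;
	return x;
}

/*
 * Fill idx[0..n-1] with a random permutation of 0..n-1 (Fisher-Yates).
 * The modulo introduces a slight bias, which is acceptable for a sketch.
 */
static void build_shuffled_index(size_t *idx, size_t n, uint32_t seed)
{
	uint32_t state = seed ? seed : 1;	/* xorshift must not start at 0 */
	size_t i;

	assert(idx != NULL || n == 0);

	for (i = 0; i < n; i++)
		idx[i] = i;

	for (i = n; i > 1; i--) {
		size_t j = next_rand(&state) % i;
		size_t tmp = idx[i - 1];

		idx[i - 1] = idx[j];
		idx[j] = tmp;
	}
}
```

A reader of the sequence then visits symbol idx[pos] instead of symbol
pos, so every symbol is still printed exactly once but the output order
no longer follows the address order.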

> 
> > Signed-off-by: Kristen Carlson Accardi 
> > Reviewed-by: Tony Luck 
> > Tested-by: Tony Luck 
> > ---
> >  kernel/kallsyms.c | 163
> > +-
> >  1 file changed, 162 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
> > index bb14e64f62a4..45d147f7f10e 100644
> > --- a/kernel/kallsyms.c
> > +++ b/kernel/kallsyms.c
> > @@ -446,6 +446,12 @@ struct kallsym_iter {
> > int show_value;
> >  };
> >  
> > +struct kallsyms_shuffled_iter {
> > +   struct kallsym_iter iter;
> > +   loff_t total_syms;
> > > 
> > > (I need to go read how kallsyms doesn't miscount in general when
> > the
> > > symbol table changes out from under it...)
> > > 
> > > 
> > +   loff_t shuffled_index[];
> > +};
> > +
> >  int __weak arch_get_kallsym(unsigned int symnum, unsigned long
> > *value,
> > char *type, char *name)
> >  {
> > @@ -661,7 +667,7 @@ bool kallsyms_show_value(const struct cred
> > *cred)
> > }
> >  }
> >  
> > -static int kallsyms_open(struct inode *inode, struct file *file)
> > +static int __kallsyms_open(struct inode *inode, struct file *file)
> >  {
> > /*
> >  * We keep iterator in m->private, since normal case is to
> > @@ -682,6 +688,161 @@ static int kallsyms_open(struct inode *inode,
> > struct file *file)
> > return 0;
> >  }
> >  
> > +/*
> > + * When function granular kaslr is enabled, we need to print out
> > the symbols
> > + * at random so we don't reveal the new layout.
> > + */
> > +#if defined(CONFIG_FG_KASLR)
> > +static int update_random_pos(struct kallsyms_shuffled_iter
> > *s_iter,
> > +loff_t pos, loff_t *new_pos)
> > +{
> > +   loff_t new;
> > +
> > +   if (pos >= s_iter->total_syms)
> > +   return 0;
> > +
> > +   new = s_iter->shuffled_index[pos];
> > +
> > +   /*
> > +* normally this would be done as part of update_iter, however,
> > +* we want to avoid triggering this in the event that new is
> > +* zero since we don't want to blow away our pos end
> > indicators.
> > +*/
> > +   if (new == 0) {
> > +   s_iter->iter.name[0] = '\0';
> > +   s_iter->iter.nameoff = get_symbol_offset(new);
> > +   s_iter->iter.pos = new;
> > +   }
> > +
> > +   *new_pos = new;
> > +   return 1;
> > +}
> > +
> > +static void *shuffled_start(struct seq_file *m, loff_t *pos)
> > +{
> > +   struct kallsyms_shuffled_iter *s_iter = m->private;
> > +   loff_t new_pos;
> > +
> > +   if (!update_random_pos(s_iter, *pos, &new_pos))
> > +   return NULL;
> > +
> > +   return s_start(m, &new_pos);
> > +}
> > +
> > +static void *shuffled_next(struct seq_file *m, void *p, loff_t
> > *pos)
> > +{
> > +   struct kallsyms_shuffled_iter *s_iter = m->private;
> > +   loff_t new_pos;
> > +
> > +   (*pos)++;
> > +
> > +   if (!update_random_pos(s_iter, *pos, &new_pos))
> > +   return NULL;
> > +
> > +   if (!update_iter(m->private, new_pos))
> > +   return NULL;
> > +
> > +   return p;
> > +}
> > +
> > +

[PATCH v4 05/10] x86: Make sure _etext includes function sections

2020-07-17 Thread Kristen Carlson Accardi
When using -ffunction-sections to place each function in
its own text section so it can be randomized at load time, the
linker considers these .text.* sections "orphaned sections", and
will place them after the first similar section (.text). In order
to accurately represent the end of the text section and the
orphaned sections, _etext must be moved so that it is after both
.text and .text.*. The text size must also be calculated to
include .text AND .text.*.

Signed-off-by: Kristen Carlson Accardi 
Reviewed-by: Tony Luck 
Tested-by: Tony Luck 
Reviewed-by: Kees Cook 
---
 arch/x86/kernel/vmlinux.lds.S | 17 +++--
 include/asm-generic/vmlinux.lds.h |  2 +-
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 3bfc8dd8a43d..e8da7eeb4d8d 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -146,9 +146,22 @@ SECTIONS
 #endif
} :text =0x
 
-   /* End of text section, which should occupy whole number of pages */
-   _etext = .;
+   /*
+* -ffunction-sections creates .text.* sections, which are considered
+* "orphan sections" and added after the first similar section (.text).
+* Placing this ALIGN statement before _etext causes the address of
+* _etext to be below that of all the .text.* orphaned sections
+*/
. = ALIGN(PAGE_SIZE);
+   _etext = .;
+
+   /*
+* the size of the .text section is used to calculate the address
+* range for orc lookups. If we just use SIZEOF(.text), we will
+* miss all the .text.* sections. Calculate the size using _etext
+* and _stext and save the value for later.
+*/
+   text_size = _etext - _stext;
 
X86_ALIGN_RODATA_BEGIN
RO_DATA(PAGE_SIZE)
diff --git a/include/asm-generic/vmlinux.lds.h 
b/include/asm-generic/vmlinux.lds.h
index a5552cf28d5d..34eab6513fdc 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -835,7 +835,7 @@
. = ALIGN(4);   \
.orc_lookup : AT(ADDR(.orc_lookup) - LOAD_OFFSET) { \
orc_lookup = .; \
-   . += (((SIZEOF(.text) + LOOKUP_BLOCK_SIZE - 1) /\
+   . += (((text_size + LOOKUP_BLOCK_SIZE - 1) /\
LOOKUP_BLOCK_SIZE) + 1) * 4;\
orc_lookup_end = .; \
}
-- 
2.20.1
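
The .orc_lookup sizing expression changed by the patch above can be
checked with a small userspace sketch. It assumes 256-byte lookup
blocks (the LOOKUP_BLOCK_SIZE value is an assumption here and should be
checked against the tree in use); orc_lookup_bytes() is an illustrative
name, not a kernel function.

```c
#include <assert.h>
#include <limits.h>

/* Assumed block size; verify against the kernel tree being built. */
#define LOOKUP_BLOCK_SIZE 256UL

/*
 * Bytes reserved for .orc_lookup: one 4-byte entry per lookup block of
 * text (rounded up), plus one extra entry, mirroring the linker script.
 */
static unsigned long orc_lookup_bytes(unsigned long text_size)
{
	assert(text_size <= ULONG_MAX - LOOKUP_BLOCK_SIZE);
	return (((text_size + LOOKUP_BLOCK_SIZE - 1) / LOOKUP_BLOCK_SIZE) + 1) * 4;
}
```

If the size passed in covers only .text while most functions live in
.text.* sections, the table is sized for far fewer blocks than the
unwinder will index, which is why the patch derives the size from
_etext - _stext instead of SIZEOF(.text).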



[PATCH v4 04/10] x86: Makefile: Add build and config option for CONFIG_FG_KASLR

2020-07-17 Thread Kristen Carlson Accardi
Allow the user to select CONFIG_FG_KASLR if dependencies are met. Change
the Makefile to build with -ffunction-sections if CONFIG_FG_KASLR is set.

While the only architecture that supports CONFIG_FG_KASLR does not
currently enable HAVE_LD_DEAD_CODE_DATA_ELIMINATION, make sure these
2 features play nicely together for the future by ensuring that if
CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is selected when used with
CONFIG_FG_KASLR the function sections will not be consolidated back
into .text. Thanks to Kees Cook for the dead code elimination changes.

Signed-off-by: Kristen Carlson Accardi 
Reviewed-by: Tony Luck 
Reviewed-by: Kees Cook 
Tested-by: Tony Luck 
---
 Makefile  |  6 +-
 arch/x86/Kconfig  |  4 
 include/asm-generic/vmlinux.lds.h | 16 ++--
 init/Kconfig  | 14 ++
 4 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/Makefile b/Makefile
index 0b5f8538bde5..66427b12de53 100644
--- a/Makefile
+++ b/Makefile
@@ -872,7 +872,7 @@ KBUILD_CFLAGS += $(call cc-option, 
-fno-inline-functions-called-once)
 endif
 
 ifdef CONFIG_LD_DEAD_CODE_DATA_ELIMINATION
-KBUILD_CFLAGS_KERNEL += -ffunction-sections -fdata-sections
+KBUILD_CFLAGS_KERNEL += -fdata-sections
 LDFLAGS_vmlinux += --gc-sections
 endif
 
@@ -880,6 +880,10 @@ ifdef CONFIG_LIVEPATCH
 KBUILD_CFLAGS += $(call cc-option, -flive-patching=inline-clone)
 endif
 
+ifneq ($(CONFIG_LD_DEAD_CODE_DATA_ELIMINATION)$(CONFIG_FG_KASLR),)
+KBUILD_CFLAGS += -ffunction-sections
+endif
+
 ifdef CONFIG_SHADOW_CALL_STACK
 CC_FLAGS_SCS   := -fsanitize=shadow-call-stack
 KBUILD_CFLAGS  += $(CC_FLAGS_SCS)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 883da0abf779..e7a2db3e270d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -372,6 +372,10 @@ config CC_HAS_SANE_STACKPROTECTOR
   We have to make sure stack protector is unconditionally disabled if
   the compiler produces broken code.
 
+config ARCH_HAS_FG_KASLR
+   def_bool y
+   depends on RANDOMIZE_BASE && X86_64
+
 menu "Processor type and features"
 
 config ZONE_DMA
diff --git a/include/asm-generic/vmlinux.lds.h 
b/include/asm-generic/vmlinux.lds.h
index db600ef218d7..a5552cf28d5d 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -93,14 +93,12 @@
  * sections to be brought in with rodata.
  */
 #ifdef CONFIG_LD_DEAD_CODE_DATA_ELIMINATION
-#define TEXT_MAIN .text .text.[0-9a-zA-Z_]*
 #define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..LPBX*
 #define SDATA_MAIN .sdata .sdata.[0-9a-zA-Z_]*
 #define RODATA_MAIN .rodata .rodata.[0-9a-zA-Z_]*
 #define BSS_MAIN .bss .bss.[0-9a-zA-Z_]*
 #define SBSS_MAIN .sbss .sbss.[0-9a-zA-Z_]*
 #else
-#define TEXT_MAIN .text
 #define DATA_MAIN .data
 #define SDATA_MAIN .sdata
 #define RODATA_MAIN .rodata
@@ -108,6 +106,20 @@
 #define SBSS_MAIN .sbss
 #endif
 
+/*
+ * Both LD_DEAD_CODE_DATA_ELIMINATION and CONFIG_FG_KASLR options enable
+ * -ffunction-sections, which produces separately named .text sections. In
+ * the case of CONFIG_FG_KASLR, they need to stay distinct so they can be
+ * separately randomized. Without CONFIG_FG_KASLR, the separate .text
+ * sections can be collected back into a common section, which makes the
+ * resulting image slightly smaller
+ */
+#if defined(CONFIG_LD_DEAD_CODE_DATA_ELIMINATION) && !defined(CONFIG_FG_KASLR)
+#define TEXT_MAIN .text .text.[0-9a-zA-Z_]*
+#else
+#define TEXT_MAIN .text
+#endif
+
 /*
  * Align to a 32 byte boundary equal to the
  * alignment gcc 4.5 uses for a struct
diff --git a/init/Kconfig b/init/Kconfig
index 0498af567f70..82f042a1062f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1990,6 +1990,20 @@ config PROFILING
 config TRACEPOINTS
bool
 
+config FG_KASLR
+   bool "Function Granular Kernel Address Space Layout Randomization"
+   depends on $(cc-option, -ffunction-sections)
+   depends on ARCH_HAS_FG_KASLR
+   default n
+   help
+ This option improves the randomness of the kernel text
+ over basic Kernel Address Space Layout Randomization (KASLR)
+ by reordering the kernel text at boot time. This feature
+ uses information generated at compile time to re-layout the
+ kernel text section at boot time at function level granularity.
+
+ If unsure, say N.
+
 endmenu# General setup
 
 source "arch/Kconfig"
-- 
2.20.1



[PATCH v4 07/10] x86/boot/compressed: Avoid duplicate malloc() implementations

2020-07-17 Thread Kristen Carlson Accardi
From: Kees Cook 

The preboot malloc() (and free()) implementation in
include/linux/decompress/mm.h (which is also included by the
static decompressors) is static. This is fine when the only thing
interested in using malloc() is the decompression code, but the
x86 preboot environment uses malloc() in a couple places, leading to a
potential collision when the static copies of the available memory
region ("malloc_ptr") get reset to the global "free_mem_ptr" value.
As it happens, the existing usage pattern was safe because each
user did 1 malloc() and 1 free() before returning and the calls were
not nested:

extract_kernel() (misc.c)
	choose_random_location() (kaslr.c)
		mem_avoid_init()
			handle_mem_options()
				malloc()
				...
				free()
	...
	parse_elf() (misc.c)
		malloc()
		...
		free()

Adding FGKASLR, however, will insert additional malloc() calls local to
fgkaslr.c in the middle of parse_elf()'s malloc()/free() pair:

parse_elf() (misc.c)
	malloc()
	if (...) {
		layout_randomized_image(output, &ehdr, phdrs);
			malloc() <- boom
			...
	else
		layout_image(output, &ehdr, phdrs);
	free()

To avoid collisions, there must be a single implementation of malloc().
Adjust include/linux/decompress/mm.h so that visibility can be
controlled, provide prototypes in misc.h, and implement the functions in
misc.c. This also results in a small size savings:

$ size vmlinux.before vmlinux.after
   text    data     bss     dec     hex filename
8842314     468  178320 9021102  89a6ae vmlinux.before
8842240     468  178320 9021028  89a664 vmlinux.after

Signed-off-by: Kees Cook 
Signed-off-by: Kristen Carlson Accardi 
---
 arch/x86/boot/compressed/kaslr.c |  4 
 arch/x86/boot/compressed/misc.c  |  3 +++
 arch/x86/boot/compressed/misc.h  |  2 ++
 include/linux/decompress/mm.h| 12 ++--
 4 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index d7408af55738..6f596bd5b6e5 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -39,10 +39,6 @@
 #include 
 #include 
 
-/* Macros used by the included decompressor code below. */
-#define STATIC
-#include <linux/decompress/mm.h>
-
 #ifdef CONFIG_X86_5LEVEL
 unsigned int __pgtable_l5_enabled;
 unsigned int pgdir_shift __ro_after_init = 39;
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 9652d5c2afda..90a4b64b3037 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -28,6 +28,9 @@
 
 /* Macros used by the included decompressor code below. */
 #define STATIC static
+/* Define an externally visible malloc()/free(). */
+#define MALLOC_VISIBLE
+#include <linux/decompress/mm.h>
 
 /*
  * Use normal definitions of mem*() from string.c. There are already
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 726e264410ff..81fbc8d686fa 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -39,6 +39,8 @@
 /* misc.c */
 extern memptr free_mem_ptr;
 extern memptr free_mem_end_ptr;
+extern void *malloc(int size);
+extern void free(void *where);
 extern struct boot_params *boot_params;
 void __putstr(const char *s);
 void __puthex(unsigned long value);
diff --git a/include/linux/decompress/mm.h b/include/linux/decompress/mm.h
index 868e9eacd69e..9192986b1a73 100644
--- a/include/linux/decompress/mm.h
+++ b/include/linux/decompress/mm.h
@@ -25,13 +25,21 @@
 #define STATIC_RW_DATA static
 #endif
 
+/*
+ * When an architecture needs to share the malloc()/free() implementation
+ * between compilation units, it needs to have non-local visibility.
+ */
+#ifndef MALLOC_VISIBLE
+#define MALLOC_VISIBLE static
+#endif
+
 /* A trivial malloc implementation, adapted from
  *  malloc by Hannu Savolainen 1993 and Matthias Urlichs 1994
  */
 STATIC_RW_DATA unsigned long malloc_ptr;
 STATIC_RW_DATA int malloc_count;
 
-static void *malloc(int size)
+MALLOC_VISIBLE void *malloc(int size)
 {
void *p;
 
@@ -52,7 +60,7 @@ static void *malloc(int size)
return p;
 }
 
-static void free(void *where)
+MALLOC_VISIBLE void free(void *where)
 {
malloc_count--;
if (!malloc_count)
-- 
2.20.1
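
The collision described in the commit message above can be reproduced
in userspace with a small sketch. It models each static copy of the
allocator state as a struct bump_state; that struct, the heap[] array
and the bump_malloc()/bump_free() names are illustrative assumptions
and not the code in include/linux/decompress/mm.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One private copy of the allocator state (malloc_ptr, malloc_count). */
struct bump_state {
	uintptr_t ptr;	/* next free byte, 0 until first use */
	int count;	/* outstanding allocations */
};

static unsigned char heap[1024];	/* stands in for the free_mem_ptr region */

static void *bump_malloc(struct bump_state *s, int size)
{
	void *p;

	assert(s != NULL);
	if (size < 0)
		return NULL;
	if (!s->ptr)
		s->ptr = (uintptr_t)heap;	/* every copy starts at the same base */
	s->ptr = (s->ptr + 3) & ~(uintptr_t)3;	/* 4-byte alignment */
	p = (void *)s->ptr;
	s->ptr += (uintptr_t)size;
	if (s->ptr >= (uintptr_t)heap + sizeof(heap))
		return NULL;
	s->count++;
	return p;
}

static void bump_free(struct bump_state *s, void *where)
{
	(void)where;
	if (--s->count == 0)
		s->ptr = (uintptr_t)heap;	/* last free rewinds the region */
}
```

Two compilation units that each hold their own bump_state hand out the
same address for their first allocation, which is the nested-call
hazard the patch removes by keeping a single shared implementation.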



[PATCH v4 03/10] x86/boot: Allow a "silent" kaslr random byte fetch

2020-07-17 Thread Kristen Carlson Accardi
From: Kees Cook 

Under earlyprintk, each RNG call produces a debug report line. When
shuffling hundreds of functions, this is not useful information (each
line is identical and tells us nothing new). Instead, allow for a NULL
"purpose" to suppress the debug reporting.

Signed-off-by: Kees Cook 
Signed-off-by: Kristen Carlson Accardi 
---
 arch/x86/lib/kaslr.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/x86/lib/kaslr.c b/arch/x86/lib/kaslr.c
index a53665116458..2b3eb8c948a3 100644
--- a/arch/x86/lib/kaslr.c
+++ b/arch/x86/lib/kaslr.c
@@ -56,11 +56,14 @@ unsigned long kaslr_get_random_long(const char *purpose)
unsigned long raw, random = get_boot_seed();
bool use_i8254 = true;
 
-   debug_putstr(purpose);
-   debug_putstr(" KASLR using");
+   if (purpose) {
+   debug_putstr(purpose);
+   debug_putstr(" KASLR using");
+   }
 
if (has_cpuflag(X86_FEATURE_RDRAND)) {
-   debug_putstr(" RDRAND");
+   if (purpose)
+   debug_putstr(" RDRAND");
if (rdrand_long()) {
random ^= raw;
use_i8254 = false;
@@ -68,7 +71,8 @@ unsigned long kaslr_get_random_long(const char *purpose)
}
 
if (has_cpuflag(X86_FEATURE_TSC)) {
-   debug_putstr(" RDTSC");
+   if (purpose)
+   debug_putstr(" RDTSC");
raw = rdtsc();
 
random ^= raw;
@@ -76,7 +80,8 @@ unsigned long kaslr_get_random_long(const char *purpose)
}
 
if (use_i8254) {
-   debug_putstr(" i8254");
+   if (purpose)
+   debug_putstr(" i8254");
random ^= i8254();
}
 
@@ -86,7 +91,8 @@ unsigned long kaslr_get_random_long(const char *purpose)
: "a" (random), "rm" (mix_const));
random += raw;
 
-   debug_putstr("...\n");
+   if (purpose)
+   debug_putstr("...\n");
 
return random;
 }
-- 
2.20.1



[PATCH v4 06/10] x86/tools: Add relative relocs for randomized functions

2020-07-17 Thread Kristen Carlson Accardi
When reordering functions, the relative offsets for relocs that
are either in the randomized sections, or refer to the randomized
sections will need to be adjusted. Add code to detect whether a
reloc satisfies these cases, and if so, add them to the appropriate
reloc list.
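
The two cases named above (the relocation lives in a randomized section,
or it refers to one) can be sketched as a standalone check on section
names. The helper names below are illustrative, and passing section
names directly is a simplification of how the relocs tool looks them up.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if a section name looks like a per-function text section. */
static int is_function_section_name(const char *name)
{
	assert(name != NULL);
	/* ".text." is 6 characters, so plain ".text" does not match. */
	return strncmp(name, ".text.", 6) == 0;
}

/*
 * Decide whether a PC-relative relocation must be kept: either the
 * section it applies to may move, the symbol it targets may move, or
 * the target is a percpu symbol (the pre-existing case).
 */
static int must_keep_pc32_reloc(const char *reloc_sec, const char *sym_sec,
				int is_percpu)
{
	return is_function_section_name(reloc_sec) ||
	       is_function_section_name(sym_sec) || is_percpu;
}
```

A call between two functions that both stay in .text keeps a fixed
relative distance and needs no entry, while a call into or out of any
.text.* section does.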

Signed-off-by: Kristen Carlson Accardi 
Reviewed-by: Tony Luck 
Tested-by: Tony Luck 
Reviewed-by: Kees Cook 
---
 arch/x86/boot/compressed/Makefile |  7 +-
 arch/x86/tools/relocs.c   | 41 ---
 arch/x86/tools/relocs.h   |  4 +--
 arch/x86/tools/relocs_common.c| 15 +++
 4 files changed, 55 insertions(+), 12 deletions(-)

diff --git a/arch/x86/boot/compressed/Makefile 
b/arch/x86/boot/compressed/Makefile
index 7619742f91c9..c17b1c8ec82c 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -119,6 +119,11 @@ $(obj)/vmlinux: $(vmlinux-objs-y) FORCE
$(call if_changed,check-and-link-vmlinux)
 
 OBJCOPYFLAGS_vmlinux.bin :=  -R .comment -S
+
+ifdef CONFIG_FG_KASLR
+   RELOCS_ARGS += --fg-kaslr
+endif
+
 $(obj)/vmlinux.bin: vmlinux FORCE
$(call if_changed,objcopy)
 
@@ -126,7 +131,7 @@ targets += $(patsubst $(obj)/%,%,$(vmlinux-objs-y)) 
vmlinux.bin.all vmlinux.relo
 
 CMD_RELOCS = arch/x86/tools/relocs
 quiet_cmd_relocs = RELOCS  $@
-  cmd_relocs = $(CMD_RELOCS) $< > $@;$(CMD_RELOCS) --abs-relocs $<
+  cmd_relocs = $(CMD_RELOCS) $(RELOCS_ARGS) $< > $@;$(CMD_RELOCS) 
$(RELOCS_ARGS) --abs-relocs $<
 $(obj)/vmlinux.relocs: vmlinux FORCE
$(call if_changed,relocs)
 
diff --git a/arch/x86/tools/relocs.c b/arch/x86/tools/relocs.c
index 31b2d151aa63..e0665038742e 100644
--- a/arch/x86/tools/relocs.c
+++ b/arch/x86/tools/relocs.c
@@ -42,6 +42,8 @@ struct section {
 };
 static struct section *secs;
 
+static int fgkaslr_mode;
+
 static const char * const sym_regex_kernel[S_NSYMTYPES] = {
 /*
  * Following symbols have been audited. There values are constant and do
@@ -818,6 +820,32 @@ static int is_percpu_sym(ElfW(Sym) *sym, const char 
*symname)
strncmp(symname, "init_per_cpu_", 13);
 }
 
+static int is_function_section(struct section *sec)
+{
+   const char *name;
+
+   if (!fgkaslr_mode)
+   return 0;
+
+   name = sec_name(sec->shdr.sh_info);
+
+   return(!strncmp(name, ".text.", 6));
+}
+
+static int is_randomized_sym(ElfW(Sym) *sym)
+{
+   const char *name;
+
+   if (!fgkaslr_mode)
+   return 0;
+
+   if (sym->st_shndx > shnum)
+   return 0;
+
+   name = sec_name(sym_index(sym));
+   return(!strncmp(name, ".text.", 6));
+}
+
 static int do_reloc64(struct section *sec, Elf_Rel *rel, ElfW(Sym) *sym,
  const char *symname)
 {
@@ -842,13 +870,17 @@ static int do_reloc64(struct section *sec, Elf_Rel *rel, 
ElfW(Sym) *sym,
case R_X86_64_PC32:
case R_X86_64_PLT32:
/*
-* PC relative relocations don't need to be adjusted unless
-* referencing a percpu symbol.
+* we need to keep pc relative relocations for sections which
+* might be randomized, and for the percpu section.
+* We also need to keep relocations for any offset which might
+* reference an address in a section which has been randomized.
 *
 * NB: R_X86_64_PLT32 can be treated as R_X86_64_PC32.
 */
-   if (is_percpu_sym(sym, symname))
+   if (is_function_section(sec) || is_randomized_sym(sym) ||
+   is_percpu_sym(sym, symname))
add_reloc(&relocs32neg, offset);
+
break;
 
case R_X86_64_PC64:
@@ -1158,8 +1190,9 @@ static void print_reloc_info(void)
 
 void process(FILE *fp, int use_real_mode, int as_text,
 int show_absolute_syms, int show_absolute_relocs,
-int show_reloc_info)
+int show_reloc_info, int fgkaslr)
 {
+   fgkaslr_mode = fgkaslr;
regex_init(use_real_mode);
read_ehdr(fp);
read_shdrs(fp);
diff --git a/arch/x86/tools/relocs.h b/arch/x86/tools/relocs.h
index 43c83c0fd22c..f582895c04dd 100644
--- a/arch/x86/tools/relocs.h
+++ b/arch/x86/tools/relocs.h
@@ -31,8 +31,8 @@ enum symtype {
 
 void process_32(FILE *fp, int use_real_mode, int as_text,
int show_absolute_syms, int show_absolute_relocs,
-   int show_reloc_info);
+   int show_reloc_info, int fgkaslr);
 void process_64(FILE *fp, int use_real_mode, int as_text,
int show_absolute_syms, int show_absolute_relocs,
-   int show_reloc_info);
+   int show_reloc_info, int fgkaslr);
 #endif /* RELOCS_H */
diff --git a/arch/x86/tools/relocs_common.c b/arch/x86/tools/relocs_common.c
index 6634352a20bc..a80efa2f53ff 100644
--- a/arch/x86/tools/relocs_common.c
+++ b/arch/x8

[PATCH v4 08/10] x86: Add support for function granular KASLR

2020-07-17 Thread Kristen Carlson Accardi
This commit contains the changes required to re-layout the kernel text
sections generated by -ffunction-sections shortly after decompression.
Documentation of the feature is also added.

After decompression, the decompressed image's elf headers are parsed.
In order to manually update certain data structures that are built with
relative offsets during the kernel build process, certain symbols are
not stripped by objcopy and their location is retained in the elf symbol
tables. These addresses are saved.

If the image was built with -ffunction-sections, there will be ELF section
headers present which contain information about the address range of each
section. Anything that is not broken out into function sections (i.e. is
consolidated into .text) is left in its original location, but any other
executable section which begins with ".text." is located and shuffled
randomly within the remaining text segment address range.

After the sections have been copied to their new locations, but before
relocations have been applied, the kallsyms tables must be updated to
reflect the new symbol locations. Because it is expected that these tables
will be sorted by address, the kallsyms tables will need to be sorted
after the update.

When applying relocations, the address of the relocation needs to be
adjusted by the offset from the original location of the section that was
randomized to its new location. In addition, if the value at that relocation
was a location in the text segment that was randomized, its value will be
adjusted to the new location.
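
The address adjustment described in this paragraph can be sketched in
userspace as a lookup from an old address to the delta of the section
that contained it. The struct and function names are hypothetical, and
the linear search is a simplification chosen for brevity rather than a
description of what fgkaslr.c does.

```c
#include <assert.h>
#include <stddef.h>

/* Record of one section that was moved during randomization. */
struct moved_section {
	unsigned long old_start;	/* original start address */
	unsigned long size;		/* section size in bytes */
	long delta;			/* new_start - old_start */
};

/* Return the delta of the moved section containing addr, or 0 if none. */
static long delta_for(const struct moved_section *secs, size_t n,
		      unsigned long addr)
{
	size_t i;

	assert(secs != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (addr >= secs[i].old_start &&
		    addr - secs[i].old_start < secs[i].size)
			return secs[i].delta;
	}
	return 0;
}

/* Translate an address from the original layout to the shuffled layout. */
static unsigned long adjust_addr(const struct moved_section *secs, size_t n,
				 unsigned long addr)
{
	return addr + (unsigned long)delta_for(secs, n, addr);
}
```

The same translation applies twice per relocation: once to the location
being patched, and once to the value stored there if that value points
into a section that moved.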

After relocations have been applied, the exception table must be updated
with the new symbol locations, and then re-sorted by the new address. The
orc table will have been updated as part of applying relocations, but since
it is expected to be sorted by address, it will need to be resorted.

Signed-off-by: Kristen Carlson Accardi 
Reviewed-by: Tony Luck 
Tested-by: Tony Luck 
Reviewed-by: Kees Cook 
---
 .../admin-guide/kernel-parameters.txt |   7 +
 Documentation/security/fgkaslr.rst| 172 
 Documentation/security/index.rst  |   1 +
 arch/x86/boot/compressed/Makefile |   2 +
 arch/x86/boot/compressed/fgkaslr.c| 811 ++
 arch/x86/boot/compressed/misc.c   | 154 +++-
 arch/x86/boot/compressed/misc.h   |  28 +
 arch/x86/boot/compressed/utils.c  |  11 +
 arch/x86/boot/compressed/vmlinux.symbols  |  17 +
 arch/x86/include/asm/boot.h   |  15 +-
 include/uapi/linux/elf.h  |   1 +
 11 files changed, 1191 insertions(+), 28 deletions(-)
 create mode 100644 Documentation/security/fgkaslr.rst
 create mode 100644 arch/x86/boot/compressed/fgkaslr.c
 create mode 100644 arch/x86/boot/compressed/utils.c
 create mode 100644 arch/x86/boot/compressed/vmlinux.symbols

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index fb95fad81c79..0affa1458017 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2082,6 +2082,13 @@
kernel and module base offset ASLR (Address Space
Layout Randomization).
 
+   fgkaslr [KNL]
+   Format: { "off" }
+   When CONFIG_FG_KASLR is set, setting this parameter
+   to "off" disables kernel function granular ASLR
+   (Address Space Layout Randomization).
+   See Documentation/security/fgkaslr.rst.
+
kasan_multi_shot
[KNL] Enforce KASAN (Kernel Address Sanitizer) to print
report on every invalid memory access. Without this
diff --git a/Documentation/security/fgkaslr.rst 
b/Documentation/security/fgkaslr.rst
new file mode 100644
index ..d52af50d6715
--- /dev/null
+++ b/Documentation/security/fgkaslr.rst
@@ -0,0 +1,172 @@
+.. SPDX-License-Identifier: GPL-2.0
+
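+=====================================================================
+Function Granular Kernel Address Space Layout Randomization (fgkaslr)
+=====================================================================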
+
+:Date: 6 April 2020
+:Author: Kristen Accardi
+
+Kernel Address Space Layout Randomization (KASLR) was merged into the kernel
+with the objective of increasing the difficulty of code reuse attacks. Code
+reuse attacks reuse existing code snippets to get around existing memory
+protections. They exploit software bugs which expose addresses of useful code
+snippets to control the flow of execution for their own nefarious purposes.
+KASLR as it was originally implemented moves the entire kernel code text as a
+unit at boot time in order to make addresses less predictable. The order of the
+code within the segment is unchanged - only the base address is shifted. There
+are a few 

[PATCH v4 02/10] x86: tools/relocs: Support >64K section headers

2020-07-17 Thread Kristen Carlson Accardi
While the relocs tool already supports finding the total number
of section headers if vmlinux exceeds 64K sections, it fails to
read the extended symbol table to get section header indexes for symbols,
causing incorrect symbol table indexes to be used when there are > 64K
symbols.

Parse the elf file to read the extended symbol table info, and then
replace all direct references to st_shndx with calls to sym_index(),
which will determine whether the value can be read directly or
whether the value should be pulled out of the extended table.
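
The lookup described above can be sketched in user space as follows. This
is only an illustration under stated assumptions, not the code from the
patch: section_index_of() is a hypothetical name, the tables are assumed to
be already loaded into memory, and host byte order is assumed (the real tool
converts with elf32_to_cpu()).

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the section header index for symbol number i.
 * symtab points at the first entry of the SHT_SYMTAB section and xsymtab
 * at the first entry of the matching SHT_SYMTAB_SHNDX section.
 * Hypothetical helper for illustration; host byte order is assumed.
 */
static uint32_t section_index_of(const Elf64_Sym *symtab,
                                 const Elf32_Word *xsymtab, size_t i)
{
        /* Small indexes are stored directly in the 16-bit st_shndx field. */
        if (symtab[i].st_shndx != SHN_XINDEX)
                return symtab[i].st_shndx;

        /* SHN_XINDEX is an escape value: the real index is in xsymtab[i]. */
        assert(xsymtab != NULL);
        return xsymtab[i];
}
```

The extended table is indexed by symbol number, which is why sym_index() in
the patch derives the index from the symbol's offset within the symbol table.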

Signed-off-by: Kristen Carlson Accardi 
Reviewed-by: Kees Cook 
Reviewed-by: Tony Luck 
Tested-by: Tony Luck 
---
 arch/x86/tools/relocs.c | 104 ++--
 1 file changed, 78 insertions(+), 26 deletions(-)

diff --git a/arch/x86/tools/relocs.c b/arch/x86/tools/relocs.c
index ce7188cbdae5..31b2d151aa63 100644
--- a/arch/x86/tools/relocs.c
+++ b/arch/x86/tools/relocs.c
@@ -14,6 +14,10 @@
 static Elf_Ehdr        ehdr;
 static unsigned long   shnum;
 static unsigned int    shstrndx;
+static unsigned int    shsymtabndx;
+static unsigned int    shxsymtabndx;
+
+static int sym_index(Elf_Sym *sym);
 
 struct relocs {
uint32_t   *offset;
@@ -32,6 +36,7 @@ struct section {
Elf_Shdr   shdr;
struct section *link;
Elf_Sym    *symtab;
+   Elf32_Word *xsymtab;
Elf_Rel    *reltab;
char       *strtab;
 };
@@ -265,7 +270,7 @@ static const char *sym_name(const char *sym_strtab, Elf_Sym 
*sym)
name = sym_strtab + sym->st_name;
}
else {
-   name = sec_name(sym->st_shndx);
+   name = sec_name(sym_index(sym));
}
return name;
 }
@@ -335,6 +340,23 @@ static uint64_t elf64_to_cpu(uint64_t val)
 #define elf_xword_to_cpu(x)    elf32_to_cpu(x)
 #endif
 
+static int sym_index(Elf_Sym *sym)
+{
+   Elf_Sym *symtab = secs[shsymtabndx].symtab;
+   Elf32_Word *xsymtab = secs[shxsymtabndx].xsymtab;
+   unsigned long offset;
+   int index;
+
+   if (sym->st_shndx != SHN_XINDEX)
+   return sym->st_shndx;
+
+   /* calculate offset of sym from head of table. */
+   offset = (unsigned long)sym - (unsigned long)symtab;
+   index = offset / sizeof(*sym);
+
+   return elf32_to_cpu(xsymtab[index]);
+}
+
 static void read_ehdr(FILE *fp)
 {
if (fread(&ehdr, sizeof(ehdr), 1, fp) != 1) {
@@ -468,31 +490,60 @@ static void read_strtabs(FILE *fp)
 static void read_symtabs(FILE *fp)
 {
int i,j;
+
for (i = 0; i < shnum; i++) {
struct section *sec = &secs[i];
-   if (sec->shdr.sh_type != SHT_SYMTAB) {
+   int num_syms;
+
+   switch (sec->shdr.sh_type) {
+   case SHT_SYMTAB_SHNDX:
+   sec->xsymtab = malloc(sec->shdr.sh_size);
+   if (!sec->xsymtab) {
+   die("malloc of %d bytes for xsymtab failed\n",
+   sec->shdr.sh_size);
+   }
+   if (fseek(fp, sec->shdr.sh_offset, SEEK_SET) < 0) {
+   die("Seek to %d failed: %s\n",
+   sec->shdr.sh_offset, strerror(errno));
+   }
+   if (fread(sec->xsymtab, 1, sec->shdr.sh_size, fp)
+   != sec->shdr.sh_size) {
+   die("Cannot read extended symbol table: %s\n",
+   strerror(errno));
+   }
+   shxsymtabndx = i;
+   continue;
+
+   case SHT_SYMTAB:
+   num_syms = sec->shdr.sh_size / sizeof(Elf_Sym);
+
+   sec->symtab = malloc(sec->shdr.sh_size);
+   if (!sec->symtab) {
+   die("malloc of %d bytes for symtab failed\n",
+   sec->shdr.sh_size);
+   }
+   if (fseek(fp, sec->shdr.sh_offset, SEEK_SET) < 0) {
+   die("Seek to %d failed: %s\n",
+   sec->shdr.sh_offset, strerror(errno));
+   }
+   if (fread(sec->symtab, 1, sec->shdr.sh_size, fp)
+   != sec->shdr.sh_size) {
+   die("Cannot read symbol table: %s\n",
+   strerror(errno));
+   }
+   for (j = 0; j < num_syms; j++) {
+   Elf_Sym *sym = &sec->symtab[j];
+
+   sym->st_name  = elf_word_to_cpu(sym->st_name);
+   sym->st_val

[PATCH v4 10/10] module: Reorder functions

2020-07-17 Thread Kristen Carlson Accardi
Introduce a new config option to allow modules to be re-ordered
by function. This option can be enabled independently of the
kernel text KASLR or FG_KASLR settings so that it can be used
by architectures that do not support either of these features.
This option will be selected by default if CONFIG_FG_KASLR is
selected.

If a module has functions split out into separate text sections
(i.e. compiled with the -ffunction-sections flag), reorder the
functions to provide some code diversification to modules.
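
For reference, the shuffle used for the section list is a Fisher-Yates
shuffle. A minimal user-space sketch is below. It is not the code from the
patch: demo_random_u32() is a stand-in xorshift generator used only so the
example is self-contained (the patch uses get_random_bytes()), and the
rejection loop in demo_random_below() is an optional refinement that avoids
the small bias a plain modulo introduces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in PRNG (xorshift32) for illustration only; not suitable for security. */
static uint32_t demo_state = 0x12345678u;

static uint32_t demo_random_u32(void)
{
        demo_state ^= demo_state << 13;
        demo_state ^= demo_state >> 17;
        demo_state ^= demo_state << 5;
        return demo_state;
}

/* Return a uniformly distributed value in [0, bound) by rejection sampling. */
static uint32_t demo_random_below(uint32_t bound)
{
        uint32_t limit, r;

        assert(bound > 0);
        /* limit is a multiple of bound, so [0, limit) splits into equal buckets. */
        limit = UINT32_MAX - (UINT32_MAX % bound);
        do {
                r = demo_random_u32();
        } while (r >= limit);

        return r % bound;
}

/* Fisher-Yates: walk from the end, swapping each slot with a random earlier one. */
static void shuffle_ptrs(void **list, size_t size)
{
        size_t i;

        if (size < 2)
                return;

        for (i = size - 1; i > 0; i--) {
                size_t j = demo_random_below((uint32_t)(i + 1));
                void *tmp = list[i];

                list[i] = list[j];
                list[j] = tmp;
        }
}
```

Whatever the random source, the result is always a permutation of the input:
every section pointer appears exactly once after the shuffle.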

Signed-off-by: Kristen Carlson Accardi 
Reviewed-by: Kees Cook 
Acked-by: Ard Biesheuvel 
Tested-by: Ard Biesheuvel 
Reviewed-by: Tony Luck 
Tested-by: Tony Luck 
---
 arch/x86/Makefile |  5 +++
 init/Kconfig  | 12 +++
 kernel/kallsyms.c |  2 +-
 kernel/module.c   | 81 +++
 4 files changed, 99 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 00e378de8bc0..0f2dbc46eb5c 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -51,6 +51,11 @@ ifdef CONFIG_X86_NEED_RELOCS
 LDFLAGS_vmlinux := --emit-relocs --discard-none
 endif
 
+ifndef CONFIG_FG_KASLR
+   ifdef CONFIG_MODULE_FG_KASLR
+   KBUILD_CFLAGS_MODULE += -ffunction-sections
+   endif
+endif
 #
 # Prevent GCC from generating any FP code by mistake.
 #
diff --git a/init/Kconfig b/init/Kconfig
index 82f042a1062f..b4741838da40 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1994,6 +1994,7 @@ config FG_KASLR
bool "Function Granular Kernel Address Space Layout Randomization"
depends on $(cc-option, -ffunction-sections)
depends on ARCH_HAS_FG_KASLR
+   select MODULE_FG_KASLR
default n
help
  This option improves the randomness of the kernel text
@@ -2278,6 +2279,17 @@ config UNUSED_KSYMS_WHITELIST
  one per line. The path can be absolute, or relative to the kernel
  source tree.
 
+config MODULE_FG_KASLR
+   depends on $(cc-option, -ffunction-sections)
+   bool "Module Function Granular Layout Randomization"
+   help
+ This option randomizes the module text section by reordering the text
+ section by function at module load time. In order to use this
+ feature, the module must have been compiled with the
+ -ffunction-sections compiler flag.
+
+ If unsure, say N.
+
 endif # MODULES
 
 config MODULES_TREE_LOOKUP
diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
index 45d147f7f10e..e3f7d0fd3270 100644
--- a/kernel/kallsyms.c
+++ b/kernel/kallsyms.c
@@ -692,7 +692,7 @@ static int __kallsyms_open(struct inode *inode, struct file 
*file)
  * When function granular kaslr is enabled, we need to print out the symbols
  * at random so we don't reveal the new layout.
  */
-#if defined(CONFIG_FG_KASLR)
+#if defined(CONFIG_FG_KASLR) || defined(CONFIG_MODULE_FG_KASLR)
 static int update_random_pos(struct kallsyms_shuffled_iter *s_iter,
 loff_t pos, loff_t *new_pos)
 {
diff --git a/kernel/module.c b/kernel/module.c
index aa183c9ac0a2..0f4f4e567a42 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -56,6 +56,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "module-internal.h"
 
@@ -2391,6 +2392,83 @@ static long get_offset(struct module *mod, unsigned int 
*size,
return ret;
 }
 
+/*
+ * shuffle_text_list()
+ * Use a Fisher Yates algorithm to shuffle a list of text sections.
+ */
+static void shuffle_text_list(Elf_Shdr **list, int size)
+{
+   int i;
+   unsigned int j;
+   Elf_Shdr *temp;
+
+   for (i = size - 1; i > 0; i--) {
+   /*
+* pick a random index from 0 to i
+*/
+   get_random_bytes(&j, sizeof(j));
+   j = j % (i + 1);
+
+   temp = list[i];
+   list[i] = list[j];
+   list[j] = temp;
+   }
+}
+
+/*
+ * randomize_text()
+ * Look through the core section looking for executable code sections.
+ * Store sections in an array and then shuffle the sections
+ * to reorder the functions.
+ */
+static void randomize_text(struct module *mod, struct load_info *info)
+{
+   int i;
+   int num_text_sections = 0;
+   Elf_Shdr **text_list;
+   int size = 0;
+   int max_sections = info->hdr->e_shnum;
+   unsigned int sec = find_sec(info, ".text");
+
+   if (sec == 0)
+   return;
+
+   text_list = kmalloc_array(max_sections, sizeof(*text_list), GFP_KERNEL);
+   if (!text_list)
+   return;
+
+   for (i = 0; i < max_sections; i++) {
+   Elf_Shdr *shdr = &info->sechdrs[i];
+   const char *sname = info->secstrings + shdr->sh_name;
+
+   if (!(shdr->sh_flags & SHF_ALLOC) ||
+   !(shdr->sh_flags & SHF_EXECINSTR) ||
+   strstarts(sname, ".init"))
+ 

[PATCH v4 09/10] kallsyms: Hide layout

2020-07-17 Thread Kristen Carlson Accardi
This patch makes /proc/kallsyms display in a random order, rather
than sorted by address in order to hide the newly randomized address
layout.

Signed-off-by: Kristen Carlson Accardi 
Reviewed-by: Tony Luck 
Tested-by: Tony Luck 
---
 kernel/kallsyms.c | 163 +-
 1 file changed, 162 insertions(+), 1 deletion(-)

diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
index bb14e64f62a4..45d147f7f10e 100644
--- a/kernel/kallsyms.c
+++ b/kernel/kallsyms.c
@@ -446,6 +446,12 @@ struct kallsym_iter {
int show_value;
 };
 
+struct kallsyms_shuffled_iter {
+   struct kallsym_iter iter;
+   loff_t total_syms;
+   loff_t shuffled_index[];
+};
+
 int __weak arch_get_kallsym(unsigned int symnum, unsigned long *value,
char *type, char *name)
 {
@@ -661,7 +667,7 @@ bool kallsyms_show_value(const struct cred *cred)
}
 }
 
-static int kallsyms_open(struct inode *inode, struct file *file)
+static int __kallsyms_open(struct inode *inode, struct file *file)
 {
/*
 * We keep iterator in m->private, since normal case is to
@@ -682,6 +688,161 @@ static int kallsyms_open(struct inode *inode, struct file 
*file)
return 0;
 }
 
+/*
+ * When function granular kaslr is enabled, we need to print out the symbols
+ * at random so we don't reveal the new layout.
+ */
+#if defined(CONFIG_FG_KASLR)
+static int update_random_pos(struct kallsyms_shuffled_iter *s_iter,
+loff_t pos, loff_t *new_pos)
+{
+   loff_t new;
+
+   if (pos >= s_iter->total_syms)
+   return 0;
+
+   new = s_iter->shuffled_index[pos];
+
+   /*
+* normally this would be done as part of update_iter, however,
+* we want to avoid triggering this in the event that new is
+* zero since we don't want to blow away our pos end indicators.
+*/
+   if (new == 0) {
+   s_iter->iter.name[0] = '\0';
+   s_iter->iter.nameoff = get_symbol_offset(new);
+   s_iter->iter.pos = new;
+   }
+
+   *new_pos = new;
+   return 1;
+}
+
+static void *shuffled_start(struct seq_file *m, loff_t *pos)
+{
+   struct kallsyms_shuffled_iter *s_iter = m->private;
+   loff_t new_pos;
+
+   if (!update_random_pos(s_iter, *pos, &new_pos))
+   return NULL;
+
+   return s_start(m, &new_pos);
+}
+
+static void *shuffled_next(struct seq_file *m, void *p, loff_t *pos)
+{
+   struct kallsyms_shuffled_iter *s_iter = m->private;
+   loff_t new_pos;
+
+   (*pos)++;
+
+   if (!update_random_pos(s_iter, *pos, &new_pos))
+   return NULL;
+
+   if (!update_iter(m->private, new_pos))
+   return NULL;
+
+   return p;
+}
+
+/*
+ * shuffle_index_list()
+ * Use a Fisher Yates algorithm to shuffle a list of symbol indexes.
+ */
+static void shuffle_index_list(loff_t *indexes, loff_t size)
+{
+   int i;
+   unsigned int j;
+   loff_t temp;
+
+   for (i = size - 1; i > 0; i--) {
+   /* pick a random index from 0 to i */
+   get_random_bytes(&j, sizeof(j));
+   j = j % (i + 1);
+
+   temp = indexes[i];
+   indexes[i] = indexes[j];
+   indexes[j] = temp;
+   }
+}
+
+static const struct seq_operations kallsyms_shuffled_op = {
+   .start = shuffled_start,
+   .next = shuffled_next,
+   .stop = s_stop,
+   .show = s_show
+};
+
+static int kallsyms_random_open(struct inode *inode, struct file *file)
+{
+   loff_t pos;
+   struct kallsyms_shuffled_iter *shuffled_iter;
+   struct kallsym_iter iter;
+   bool show_value;
+
+   /*
+* If privileged, go ahead and use the normal algorithm for
+* displaying symbols
+*/
+   show_value = kallsyms_show_value(file->f_cred);
+   if (show_value)
+   return __kallsyms_open(inode, file);
+
+   /*
+* we need to figure out how many extra symbols there are
+* to print out past kallsyms_num_syms
+*/
+   pos = kallsyms_num_syms;
+   reset_iter(&iter, 0);
+   do {
+   if (!update_iter(&iter, pos))
+   break;
+   pos++;
+   } while (1);
+
+   /*
+* add storage space for an array of loff_t equal to the size
+* of the total number of symbols we need to print
+*/
+   shuffled_iter = __seq_open_private(file, &kallsyms_shuffled_op,
+  sizeof(*shuffled_iter) +
+  (sizeof(pos) * pos));
+   if (!shuffled_iter)
+   return -ENOMEM;
+
+   reset_iter(&shuffled_iter->iter, 0);
+   shuffled_iter->iter.show_value = show_value;
+   shuffled_iter->total_syms = pos;
+
+   /*
+* the existing update_iter algorithm requires that we
+* are either moving along increasing pos 

[PATCH v4 01/10] objtool: Do not assume order of parent/child functions

2020-07-17 Thread Kristen Carlson Accardi
If a .cold function is examined prior to its parent, the link
to the parent/child function can be overwritten when the parent
is examined. Only update pfunc and cfunc if they were previously
NULL to prevent this from happening.
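
The ordering problem can be reproduced with a small stand-alone sketch.
This is a simplified model, not objtool itself: struct demo_sym,
demo_find() and demo_link() are hypothetical names, and the parent lookup
is a plain linear search by name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct demo_sym {
        const char *name;
        struct demo_sym *pfunc;         /* parent function, or self */
        struct demo_sym *cfunc;         /* child (.cold) function, or self */
};

/* Find the symbol whose name is exactly the first len bytes of name. */
static struct demo_sym *demo_find(struct demo_sym *syms, size_t n,
                                  const char *name, size_t len)
{
        size_t i;

        for (i = 0; i < n; i++) {
                if (strlen(syms[i].name) == len &&
                    !strncmp(syms[i].name, name, len))
                        return &syms[i];
        }
        return NULL;
}

static void demo_link(struct demo_sym *syms, size_t n)
{
        size_t i;

        for (i = 0; i < n; i++) {
                struct demo_sym *sym = &syms[i], *parent;
                const char *cold;

                /* Default to self only if no earlier iteration set a link. */
                if (!sym->pfunc)
                        sym->pfunc = sym;
                if (!sym->cfunc)
                        sym->cfunc = sym;

                cold = strstr(sym->name, ".cold");
                if (!cold)
                        continue;

                /* The parent's name is everything before ".cold". */
                parent = demo_find(syms, n, sym->name,
                                   (size_t)(cold - sym->name));
                if (!parent)
                        continue;

                sym->pfunc = parent;
                parent->cfunc = sym;
        }
}
```

With the symbols ordered as "foo.cold" followed by "foo", an unconditional
assignment of pfunc and cfunc would reset foo's cfunc to itself on the second
iteration; the NULL checks keep the link set while processing "foo.cold".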

Signed-off-by: Kristen Carlson Accardi 
Acked-by: Josh Poimboeuf 
---
 tools/objtool/elf.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/objtool/elf.c b/tools/objtool/elf.c
index 26d11d821941..c18f0f216740 100644
--- a/tools/objtool/elf.c
+++ b/tools/objtool/elf.c
@@ -434,7 +434,13 @@ static int read_symbols(struct elf *elf)
size_t pnamelen;
if (sym->type != STT_FUNC)
continue;
-   sym->pfunc = sym->cfunc = sym;
+
+   if (sym->pfunc == NULL)
+   sym->pfunc = sym;
+
+   if (sym->cfunc == NULL)
+   sym->cfunc = sym;
+
coldstr = strstr(sym->name, ".cold");
if (!coldstr)
continue;
-- 
2.20.1



[PATCH v4 00/10] Function Granular KASLR

2020-07-17 Thread Kristen Carlson Accardi
ze
--
Adding additional section headers as a result of compiling with
-ffunction-sections will increase the size of the vmlinux ELF file.
With a standard distro config, the resulting vmlinux was increased by
about 3%. The compressed image is also increased due to the header files,
as well as the extra relocations that must be added. You can expect fgkaslr
to increase the size of the compressed image by about 15%.

Memory Usage
------------
fgkaslr increases the amount of heap that is required at boot time,
although this extra memory is released when the kernel has finished
decompression. As a result, it may not be appropriate to use this feature on
systems without much memory.

Building
--------
To enable fine grained KASLR, you need to have the following config options
set (including all the ones you would use to build normal KASLR)

CONFIG_FG_KASLR=y

In addition, fgkaslr is only supported for the X86_64 architecture.

Modules
-------
Modules are randomized similarly to the rest of the kernel by shuffling
the sections at load time prior to moving them into memory. The module must
also have been built with the -ffunction-sections compiler option.

Although fgkaslr for the kernel is only supported for the X86_64 architecture,
it is possible to use fgkaslr with modules on other architectures. To enable
this feature, select

CONFIG_MODULE_FG_KASLR=y

This option is selected automatically for X86_64 when CONFIG_FG_KASLR is set.

Disabling
---------
Disabling normal KASLR using the nokaslr command line option also disables
fgkaslr. It is also possible to disable fgkaslr separately by booting with
fgkaslr=off on the command line.

References
----------
There are a lot of academic papers which explore finer grained ASLR.
This paper in particular contributed the most to my implementation design
as well as my overall understanding of the problem space:

Selfrando: Securing the Tor Browser against De-anonymization Exploits,
M. Conti, S. Crane, T. Frassetto, et al.

For more information on how function layout impacts performance, see:

Optimizing Function Placement for Large-Scale Data-Center Applications,
G. Ottoni, B. Maher

Kees Cook (2):
  x86/boot: Allow a "silent" kaslr random byte fetch
  x86/boot/compressed: Avoid duplicate malloc() implementations

Kristen Carlson Accardi (8):
  objtool: Do not assume order of parent/child functions
  x86: tools/relocs: Support >64K section headers
  x86: Makefile: Add build and config option for CONFIG_FG_KASLR
  x86: Make sure _etext includes function sections
  x86/tools: Add relative relocs for randomized functions
  x86: Add support for function granular KASLR
  kallsyms: Hide layout
  module: Reorder functions

 .../admin-guide/kernel-parameters.txt |   7 +
 Documentation/security/fgkaslr.rst| 172 
 Documentation/security/index.rst  |   1 +
 Makefile  |   6 +-
 arch/x86/Kconfig  |   4 +
 arch/x86/Makefile |   5 +
 arch/x86/boot/compressed/Makefile |   9 +-
 arch/x86/boot/compressed/fgkaslr.c| 811 ++
 arch/x86/boot/compressed/kaslr.c  |   4 -
 arch/x86/boot/compressed/misc.c   | 157 +++-
 arch/x86/boot/compressed/misc.h   |  30 +
 arch/x86/boot/compressed/utils.c  |  11 +
 arch/x86/boot/compressed/vmlinux.symbols  |  17 +
 arch/x86/include/asm/boot.h   |  15 +-
 arch/x86/kernel/vmlinux.lds.S |  17 +-
 arch/x86/lib/kaslr.c  |  18 +-
 arch/x86/tools/relocs.c   | 143 ++-
 arch/x86/tools/relocs.h   |   4 +-
 arch/x86/tools/relocs_common.c|  15 +-
 include/asm-generic/vmlinux.lds.h |  18 +-
 include/linux/decompress/mm.h |  12 +-
 include/uapi/linux/elf.h  |   1 +
 init/Kconfig  |  26 +
 kernel/kallsyms.c | 163 +++-
 kernel/module.c   |  81 ++
 tools/objtool/elf.c   |   8 +-
 26 files changed, 1670 insertions(+), 85 deletions(-)
 create mode 100644 Documentation/security/fgkaslr.rst
 create mode 100644 arch/x86/boot/compressed/fgkaslr.c
 create mode 100644 arch/x86/boot/compressed/utils.c
 create mode 100644 arch/x86/boot/compressed/vmlinux.symbols


base-commit: 11ba468877bb23f28956a35e896356252d63c983
-- 
2.20.1



Re: [PATCH v3 09/10] kallsyms: Hide layout

2020-07-08 Thread Kristen Carlson Accardi
On Tue, 2020-07-07 at 23:16 +, Luck, Tony wrote:
> > Signed-off-by: Kristen Carlson Accardi 
> > Reviewed-by: Tony Luck 
> > Tested-by: Tony Luck 
> 
> I'll happily review and test again ... but since you've made
> substantive
> changes, you should drop these tags until I do.

Will do - thanks! If nobody thinks this is a horrible direction, I'll
clean up this patch and submit it with the rest as part of v4.

> 
> FWIW I think random order is a good idea.  Do you shuffle once?
> Or every time somebody opens /proc/kallsyms?

I am shuffling every single time that somebody opens /proc/kallsyms -
this is because it's possible that somebody has loaded modules or bpf
stuff and there may be new/different symbols to display.




Re: [PATCH v3 09/10] kallsyms: Hide layout

2020-07-07 Thread Kristen Carlson Accardi
On Wed, 2020-06-24 at 08:18 -0700, Kees Cook wrote:
> On Wed, Jun 24, 2020 at 12:21:16PM +0200, Jann Horn wrote:
> > On Tue, Jun 23, 2020 at 7:26 PM Kristen Carlson Accardi
> >  wrote:
> > > This patch makes /proc/kallsyms display alphabetically by symbol
> > > name rather than sorted by address in order to hide the newly
> > > randomized address layout.
> > [...]
> > > +static int sorted_show(struct seq_file *m, void *p)
> > > +{
> > > +   struct list_head *list = m->private;
> > > +   struct kallsyms_iter_list *iter;
> > > +   int rc;
> > > +
> > > +   if (list_empty(list))
> > > +   return 0;
> > > +
> > > +   iter = list_first_entry(list, struct kallsyms_iter_list,
> > > next);
> > > +
> > > +   m->private = iter;
> > > +   rc = s_show(m, p);
> > > +   m->private = list;
> > > +
> > > +   list_del(&iter->next);
> > > +   kfree(iter);
> > 
> > Does anything like this kfree() happen if someone only reads the
> > start
> > of kallsyms and then closes the file? IOW, does "while true; do
> > head
> > -n1 /proc/kallsyms; done" leak memory?
> 
> Oop, nice catch. It seems the list would need to be walked on s_stop.
> 
> > > +   return rc;
> > > +}
> > [...]
> > > +static int kallsyms_list_cmp(void *priv, struct list_head *a,
> > > +struct list_head *b)
> > > +{
> > > +   struct kallsyms_iter_list *iter_a, *iter_b;
> > > +
> > > +   iter_a = list_entry(a, struct kallsyms_iter_list, next);
> > > +   iter_b = list_entry(b, struct kallsyms_iter_list, next);
> > > +
> > > +   return strcmp(iter_a->iter.name, iter_b->iter.name);
> > > +}
> > 
> > This sorts only by name, but kallsyms prints more information
> > (module
> > names and types). This means that if there are elements whose names
> > are the same, but whose module names or types are different, then
> > some
> > amount of information will still be leaked by the ordering of
> > elements
> > with the same name. In practice, since list_sort() is stable, this
> > means you can see the ordering of many modules, and you can see the
> > ordering of symbols with same name but different visibility (e.g.
> > "t
> > user_read" from security/selinux/ss/policydb.c vs "T user_read"
> > from
> > security/keys/user_defined.c, and a couple other similar cases).
> 
> i.e. sub-sort by visibility?
> 
> > [...]
> > > +#if defined(CONFIG_FG_KASLR)
> > > +/*
> > > + * When fine grained kaslr is enabled, we need to
> > > + * print out the symbols sorted by name rather than by
> > > + * by address, because this reveals the randomization order.
> > > + */
> > > +static int kallsyms_open(struct inode *inode, struct file *file)
> > > +{
> > > +   int ret;
> > > +   struct list_head *list;
> > > +
> > > +   list = __seq_open_private(file, &kallsyms_sorted_op,
> > > sizeof(*list));
> > > +   if (!list)
> > > +   return -ENOMEM;
> > > +
> > > +   INIT_LIST_HEAD(list);
> > > +
> > > +   ret = kallsyms_on_each_symbol(get_all_symbol_name, list);
> > > +   if (ret != 0)
> > > +   return ret;
> > > +
> > > +   list_sort(NULL, list, kallsyms_list_cmp);
> > 
> > This permits running an algorithm (essentially mergesort) with
> > secret-dependent branches and memory addresses on essentially
> > secret
> > data, triggerable and arbitrarily repeatable (although with partly
> > different addresses on each run) by the attacker, and probably a
> > fairly low throughput (comparisons go through indirect function
> > calls,
> > which are slowed down by retpolines, and linked list iteration
> > implies
> > slow pointer chases). Those are fairly favorable conditions for
> > typical side-channel attacks.
> > 
> > Do you have estimates of how hard it would be to leverage such side
> > channels to recover function ordering (both on old hardware that
> > only
> > has microcode fixes for Spectre and such, and on newer hardware
> > with
> > enhanced IBRS and such)?
> 
> I wonder, instead, if sorting should be just done once per module
> load/unload? That would make the performance and memory management
> easier too.
> 

I've be

Re: [PATCH v3 09/10] kallsyms: Hide layout

2020-06-25 Thread Kristen Carlson Accardi
On Wed, 2020-06-24 at 08:18 -0700, Kees Cook wrote:
> On Wed, Jun 24, 2020 at 12:21:16PM +0200, Jann Horn wrote:
> > On Tue, Jun 23, 2020 at 7:26 PM Kristen Carlson Accardi
> >  wrote:
> > > This patch makes /proc/kallsyms display alphabetically by symbol
> > > name rather than sorted by address in order to hide the newly
> > > randomized address layout.
> > [...]
> > > +static int sorted_show(struct seq_file *m, void *p)
> > > +{
> > > +   struct list_head *list = m->private;
> > > +   struct kallsyms_iter_list *iter;
> > > +   int rc;
> > > +
> > > +   if (list_empty(list))
> > > +   return 0;
> > > +
> > > +   iter = list_first_entry(list, struct kallsyms_iter_list,
> > > next);
> > > +
> > > +   m->private = iter;
> > > +   rc = s_show(m, p);
> > > +   m->private = list;
> > > +
> > > +   list_del(&iter->next);
> > > +   kfree(iter);
> > 
> > Does anything like this kfree() happen if someone only reads the
> > start
> > of kallsyms and then closes the file? IOW, does "while true; do
> > head
> > -n1 /proc/kallsyms; done" leak memory?
> 
> Oop, nice catch. It seems the list would need to be walked on s_stop.
> 
> > > +   return rc;
> > > +}
> > [...]
> > > +static int kallsyms_list_cmp(void *priv, struct list_head *a,
> > > +struct list_head *b)
> > > +{
> > > +   struct kallsyms_iter_list *iter_a, *iter_b;
> > > +
> > > +   iter_a = list_entry(a, struct kallsyms_iter_list, next);
> > > +   iter_b = list_entry(b, struct kallsyms_iter_list, next);
> > > +
> > > +   return strcmp(iter_a->iter.name, iter_b->iter.name);
> > > +}
> > 
> > This sorts only by name, but kallsyms prints more information
> > (module
> > names and types). This means that if there are elements whose names
> > are the same, but whose module names or types are different, then
> > some
> > amount of information will still be leaked by the ordering of
> > elements
> > with the same name. In practice, since list_sort() is stable, this
> > means you can see the ordering of many modules, and you can see the
> > ordering of symbols with same name but different visibility (e.g.
> > "t
> > user_read" from security/selinux/ss/policydb.c vs "T user_read"
> > from
> > security/keys/user_defined.c, and a couple other similar cases).
> 
> i.e. sub-sort by visibility?
> 
> > [...]
> > > +#if defined(CONFIG_FG_KASLR)
> > > +/*
> > > + * When fine grained kaslr is enabled, we need to
> > > + * print out the symbols sorted by name rather than by
> > > + * by address, because this reveals the randomization order.
> > > + */
> > > +static int kallsyms_open(struct inode *inode, struct file *file)
> > > +{
> > > +   int ret;
> > > +   struct list_head *list;
> > > +
> > > +   list = __seq_open_private(file, &kallsyms_sorted_op,
> > > sizeof(*list));
> > > +   if (!list)
> > > +   return -ENOMEM;
> > > +
> > > +   INIT_LIST_HEAD(list);
> > > +
> > > +   ret = kallsyms_on_each_symbol(get_all_symbol_name, list);
> > > +   if (ret != 0)
> > > +   return ret;
> > > +
> > > +   list_sort(NULL, list, kallsyms_list_cmp);
> > 
> > This permits running an algorithm (essentially mergesort) with
> > secret-dependent branches and memory addresses on essentially
> > secret
> > data, triggerable and arbitrarily repeatable (although with partly
> > different addresses on each run) by the attacker, and probably a
> > fairly low throughput (comparisons go through indirect function
> > calls,
> > which are slowed down by retpolines, and linked list iteration
> > implies
> > slow pointer chases). Those are fairly favorable conditions for
> > typical side-channel attacks.
> > 
> > Do you have estimates of how hard it would be to leverage such side
> > channels to recover function ordering (both on old hardware that
> > only
> > has microcode fixes for Spectre and such, and on newer hardware
> > with
> > enhanced IBRS and such)?
> 
> I wonder, instead, if sorting should be just done once per module
> load/unload? That would make the performance and memory management
> easier too.
> 

My first solution (just don't show kallsyms at all for non-root) is
looking better and better :). But seriously, I will rewrite this one
with something like this, but I think we are going to need another
close look to see if sidechannel issues still exist with the new
implementation. Hopefully Jann can take another look then.




[PATCH v3 00/10] Function Granular KASLR

2020-06-23 Thread Kristen Carlson Accardi
 As a result, it may not be appropriate to use this feature on
systems without much memory.

Building
--------
To enable fine grained KASLR, you need to have the following config options
set (including all the ones you would use to build normal KASLR)

CONFIG_FG_KASLR=y

In addition, fgkaslr is only supported for the X86_64 architecture.

Modules
-------
Modules are randomized similarly to the rest of the kernel by shuffling
the sections at load time prior to moving them into memory. The module must
also have been built with the -ffunction-sections compiler option.

Although fgkaslr for the kernel is only supported for the X86_64 architecture,
it is possible to use fgkaslr with modules on other architectures. To enable
this feature, select

CONFIG_MODULE_FG_KASLR=y

This option is selected automatically for X86_64 when CONFIG_FG_KASLR is set.

Disabling
---------
Disabling normal KASLR using the nokaslr command line option also disables
fgkaslr. It is also possible to disable fgkaslr separately by booting with
fgkaslr=off on the command line.

References
----------
There are a lot of academic papers which explore finer grained ASLR.
This paper in particular contributed the most to my implementation design
as well as my overall understanding of the problem space:

Selfrando: Securing the Tor Browser against De-anonymization Exploits,
M. Conti, S. Crane, T. Frassetto, et al.

For more information on how function layout impacts performance, see:

Optimizing Function Placement for Large-Scale Data-Center Applications,
G. Ottoni, B. Maher

Kees Cook (1):
  x86/boot: Allow a "silent" kaslr random byte fetch

Kristen Carlson Accardi (9):
  objtool: Do not assume order of parent/child functions
  x86: tools/relocs: Support >64K section headers
  x86: Makefile: Add build and config option for CONFIG_FG_KASLR
  x86: Make sure _etext includes function sections
  x86/tools: Add relative relocs for randomized functions
  x86/boot/compressed: change definition of STATIC
  x86: Add support for function granular KASLR
  kallsyms: Hide layout
  module: Reorder functions

 Documentation/security/fgkaslr.rst   | 173 +
 Documentation/security/index.rst |   1 +
 Makefile |   6 +-
 arch/x86/Kconfig |   4 +
 arch/x86/Makefile|   5 +
 arch/x86/boot/compressed/Makefile|   9 +-
 arch/x86/boot/compressed/fgkaslr.c   | 812 +++
 arch/x86/boot/compressed/kaslr.c |   4 -
 arch/x86/boot/compressed/misc.c  | 161 -
 arch/x86/boot/compressed/misc.h  |  34 +
 arch/x86/boot/compressed/utils.c |  12 +
 arch/x86/boot/compressed/vmlinux.symbols |  17 +
 arch/x86/include/asm/boot.h  |  15 +-
 arch/x86/kernel/vmlinux.lds.S|  17 +-
 arch/x86/lib/kaslr.c |  18 +-
 arch/x86/tools/relocs.c  | 143 +++-
 arch/x86/tools/relocs.h  |   4 +-
 arch/x86/tools/relocs_common.c   |  15 +-
 include/asm-generic/vmlinux.lds.h|  18 +-
 include/uapi/linux/elf.h |   1 +
 init/Kconfig |  26 +
 kernel/kallsyms.c| 128 
 kernel/module.c  |  81 +++
 tools/objtool/elf.c  |   8 +-
 24 files changed, 1627 insertions(+), 85 deletions(-)
 create mode 100644 Documentation/security/fgkaslr.rst
 create mode 100644 arch/x86/boot/compressed/fgkaslr.c
 create mode 100644 arch/x86/boot/compressed/utils.c
 create mode 100644 arch/x86/boot/compressed/vmlinux.symbols


base-commit: 48778464bb7d346b47157d21ffde2af6b2d39110
-- 
2.20.1



[PATCH v3 08/10] x86: Add support for function granular KASLR

2020-06-23 Thread Kristen Carlson Accardi
This commit contains the changes required to re-layout the kernel text
sections generated by -ffunction-sections shortly after decompression.
Documentation of the feature is also added.

After decompression, the decompressed image's elf headers are parsed.
In order to manually update certain data structures that are built with
relative offsets during the kernel build process, certain symbols are
not stripped by objcopy and their location is retained in the elf symbol
tables. These addresses are saved.

If the image was built with -ffunction-sections, there will be ELF section
headers present which contain information about the address range of each
section. Anything that is not broken out into function sections (i.e. is
consolidated into .text) is left in its original location, but any other
executable section which begins with ".text." is located and shuffled
randomly within the remaining text segment address range.
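
As a rough illustration of the layout step, the sketch below places
sections back to back in an already shuffled order while honoring each
section's alignment. It is a simplified model and not the code in
fgkaslr.c: struct demo_section, demo_align_up() and demo_layout() are
hypothetical names, and the shuffled order is passed in as an index array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct demo_section {
        uint64_t old_addr;      /* link-time address */
        uint64_t size;
        uint64_t align;         /* power of two, like sh_addralign */
        uint64_t new_addr;      /* filled in by demo_layout() */
};

/* Round v up to the next multiple of align (align must be a power of two). */
static uint64_t demo_align_up(uint64_t v, uint64_t align)
{
        return (v + align - 1) & ~(align - 1);
}

/*
 * Place the sections one after another in the order given by order[],
 * starting at base. Returns the first address past the last section.
 */
static uint64_t demo_layout(struct demo_section *secs, const size_t *order,
                            size_t n, uint64_t base)
{
        uint64_t cursor = base;
        size_t i;

        for (i = 0; i < n; i++) {
                struct demo_section *s = &secs[order[i]];

                assert(s->align && !(s->align & (s->align - 1)));
                cursor = demo_align_up(cursor, s->align);
                s->new_addr = cursor;
                cursor += s->size;
        }
        return cursor;
}
```

The difference new_addr - old_addr for each section is the per-section
offset that later steps need when fixing up symbols and relocations.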

After the sections have been copied to their new locations, but before
relocations have been applied, the kallsyms tables must be updated to
reflect the new symbol locations. Because it is expected that these tables
will be sorted by address, the kallsyms tables will need to be sorted
after the update.

When applying relocations, the address of the relocation needs to be
adjusted by the offset from the original location of the section that was
randomized to its new location. In addition, if the value at that relocation
was a location in the text segment that was randomized, its value will be
adjusted to the new location.
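
To make the adjustment described above concrete, here is a minimal userspace
sketch. The struct moved_section record and the adjust_address() helper are
hypothetical names made up for illustration and are not taken from fgkaslr.c;
a linear scan is used only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one shuffled section; not the kernel's structure. */
struct moved_section {
	unsigned long orig_addr;  /* link-time start address */
	unsigned long size;       /* section size in bytes */
	long offset;              /* new start minus original start */
};

/*
 * Translate an address that may lie inside a shuffled section.
 * Addresses outside every moved section are returned unchanged.
 * A linear scan keeps the sketch short; a sorted table with a binary
 * search would be the obvious choice for thousands of sections.
 */
static unsigned long adjust_address(const struct moved_section *secs,
				    size_t count, unsigned long addr)
{
	size_t i;

	assert(secs != NULL || count == 0);

	for (i = 0; i < count; i++) {
		if (addr >= secs[i].orig_addr &&
		    addr - secs[i].orig_addr < secs[i].size)
			return addr + (unsigned long)secs[i].offset;
	}
	return addr;
}
```

The same lookup would serve both cases in the paragraph above: once for the
location of the relocation itself, and once for the value stored there if
that value points into randomized text.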

After relocations have been applied, the exception table must be updated
with the new symbol locations, and then re-sorted by the new address. The
ORC table will have been updated as part of applying relocations, but since
it is expected to be sorted by address, it will need to be re-sorted.

Signed-off-by: Kristen Carlson Accardi 
Reviewed-by: Tony Luck 
Tested-by: Tony Luck 
---
 Documentation/security/fgkaslr.rst   | 173 +
 Documentation/security/index.rst |   1 +
 arch/x86/boot/compressed/Makefile|   2 +
 arch/x86/boot/compressed/fgkaslr.c   | 812 +++
 arch/x86/boot/compressed/misc.c  | 154 -
 arch/x86/boot/compressed/misc.h  |  28 +
 arch/x86/boot/compressed/utils.c |  12 +
 arch/x86/boot/compressed/vmlinux.symbols |  17 +
 arch/x86/include/asm/boot.h  |  15 +-
 include/uapi/linux/elf.h |   1 +
 10 files changed, 1187 insertions(+), 28 deletions(-)
 create mode 100644 Documentation/security/fgkaslr.rst
 create mode 100644 arch/x86/boot/compressed/fgkaslr.c
 create mode 100644 arch/x86/boot/compressed/utils.c
 create mode 100644 arch/x86/boot/compressed/vmlinux.symbols

diff --git a/Documentation/security/fgkaslr.rst b/Documentation/security/fgkaslr.rst
new file mode 100644
index ..9478b1298ae8
--- /dev/null
+++ b/Documentation/security/fgkaslr.rst
@@ -0,0 +1,173 @@
+.. SPDX-License-Identifier: GPL-2.0
+
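+=====================================================================
+Function Granular Kernel Address Space Layout Randomization (fgkaslr)
+=====================================================================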
+
+:Date: 6 April 2020
+:Author: Kristen Accardi
+
+Kernel Address Space Layout Randomization (KASLR) was merged into the kernel
+with the objective of increasing the difficulty of code reuse attacks. Code
+reuse attacks reuse existing code snippets to get around existing memory
+protections. They exploit software bugs which expose addresses of useful code
+snippets to control the flow of execution for their own nefarious purposes.
+KASLR as it was originally implemented moves the entire kernel code text as a
+unit at boot time in order to make addresses less predictable. The order of the
+code within the segment is unchanged - only the base address is shifted. There
+are a few shortcomings to this algorithm.
+
+1. Low Entropy - there are only so many locations the kernel can fit in. This
+   means an attacker could guess without too much trouble.
+2. Knowledge of a single address can reveal the offset of the base address,
+   exposing all other locations for a published/known kernel image.
+3. Info leaks abound.
+
+Finer grained ASLR has been proposed as a way to make ASLR more resistant
+to info leaks. It is not a new concept at all, and there are many variations
+possible. Function reordering is an implementation of finer grained ASLR
+which randomizes the layout of an address space on a function level
+granularity. The term "fgkaslr" is used in this document to refer to the
+technique of function reordering when used with KASLR, as well as finer grained
+KASLR in general.
+
+The objective of this patch set is to improve a technology that is already
+merged into the kernel (KASLR). This code will not prevent all code reuse
+attacks, and should be considered as one of several tools that can be used.
+
+Im

[PATCH v3 10/10] module: Reorder functions

2020-06-23 Thread Kristen Carlson Accardi
Introduce a new config option to allow modules to be re-ordered
by function. This option can be enabled independently of the
kernel text KASLR or FG_KASLR settings so that it can be used
by architectures that do not support either of these features.
This option will be selected by default if CONFIG_FG_KASLR is
selected.

If a module has functions split out into separate text sections
(i.e. compiled with the -ffunction-sections flag), reorder the
functions to provide some code diversification to modules.
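
For readers who want to experiment outside the kernel, the reordering step
can be sketched in userspace as below. struct text_sec and
shuffle_sections() are hypothetical names for this example, and rand() stands
in for the kernel's random number source, so this is an illustration of the
Fisher-Yates idea rather than the module loader code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a text section; only the name matters here. */
struct text_sec {
	const char *name;
};

/*
 * Fisher-Yates shuffle of an array of section pointers.
 * rand() is used for illustration only and the small modulo bias
 * is ignored in this sketch.
 */
static void shuffle_sections(struct text_sec **list, size_t count)
{
	size_t i;

	assert(list != NULL || count == 0);

	if (count < 2)
		return;

	for (i = count - 1; i > 0; i--) {
		/* pick a random index j in the range [0, i] */
		size_t j = (size_t)rand() % (i + 1);
		struct text_sec *tmp = list[i];

		list[i] = list[j];
		list[j] = tmp;
	}
}
```

Every pointer is swapped within the same array, so the result is always a
permutation of the input and no section is lost or duplicated.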

Signed-off-by: Kristen Carlson Accardi 
Reviewed-by: Kees Cook 
Acked-by: Ard Biesheuvel 
Tested-by: Ard Biesheuvel 
Reviewed-by: Tony Luck 
Tested-by: Tony Luck 
---
 arch/x86/Makefile |  5 +++
 init/Kconfig  | 12 +++
 kernel/kallsyms.c |  2 +-
 kernel/module.c   | 81 +++
 4 files changed, 99 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 00e378de8bc0..0f2dbc46eb5c 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -51,6 +51,11 @@ ifdef CONFIG_X86_NEED_RELOCS
 LDFLAGS_vmlinux := --emit-relocs --discard-none
 endif
 
+ifndef CONFIG_FG_KASLR
+   ifdef CONFIG_MODULE_FG_KASLR
+   KBUILD_CFLAGS_MODULE += -ffunction-sections
+   endif
+endif
 #
 # Prevent GCC from generating any FP code by mistake.
 #
diff --git a/init/Kconfig b/init/Kconfig
index e29c032e4d66..87706a08a2ca 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1994,6 +1994,7 @@ config FG_KASLR
bool "Function Granular Kernel Address Space Layout Randomization"
depends on $(cc-option, -ffunction-sections)
depends on ARCH_HAS_FG_KASLR
+   select MODULE_FG_KASLR
default n
help
  This option improves the randomness of the kernel text
@@ -2278,6 +2279,17 @@ config UNUSED_KSYMS_WHITELIST
  one per line. The path can be absolute, or relative to the kernel
  source tree.
 
+config MODULE_FG_KASLR
+   depends on $(cc-option, -ffunction-sections)
+   bool "Module Function Granular Layout Randomization"
+   help
+ This option randomizes the module text section by reordering the text
+ section by function at module load time. In order to use this
+ feature, the module must have been compiled with the
+ -ffunction-sections compiler flag.
+
+ If unsure, say N.
+
 endif # MODULES
 
 config MODULES_TREE_LOOKUP
diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
index df2b20e1b7f2..da65593bffc7 100644
--- a/kernel/kallsyms.c
+++ b/kernel/kallsyms.c
@@ -761,7 +761,7 @@ int get_all_symbol_name(void *data, const char *name, struct module *mod,
return 0;
 }
 
-#if defined(CONFIG_FG_KASLR)
+#if defined(CONFIG_FG_KASLR) || defined(CONFIG_MODULE_FG_KASLR)
 /*
  * When fine grained kaslr is enabled, we need to
  * print out the symbols sorted by name rather than by
diff --git a/kernel/module.c b/kernel/module.c
index e8a198588f26..ff1e82b54127 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -56,6 +56,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "module-internal.h"
 
@@ -2388,6 +2389,83 @@ static long get_offset(struct module *mod, unsigned int *size,
return ret;
 }
 
+/*
+ * shuffle_text_list()
+ * Use a Fisher Yates algorithm to shuffle a list of text sections.
+ */
+static void shuffle_text_list(Elf_Shdr **list, int size)
+{
+   int i;
+   unsigned int j;
+   Elf_Shdr *temp;
+
+   for (i = size - 1; i > 0; i--) {
+   /*
+* pick a random index from 0 to i
+*/
+   get_random_bytes(&j, sizeof(j));
+   j = j % (i + 1);
+
+   temp = list[i];
+   list[i] = list[j];
+   list[j] = temp;
+   }
+}
+
+/*
+ * randomize_text()
+ * Look through the core section looking for executable code sections.
+ * Store sections in an array and then shuffle the sections
+ * to reorder the functions.
+ */
+static void randomize_text(struct module *mod, struct load_info *info)
+{
+   int i;
+   int num_text_sections = 0;
+   Elf_Shdr **text_list;
+   int size = 0;
+   int max_sections = info->hdr->e_shnum;
+   unsigned int sec = find_sec(info, ".text");
+
+   if (sec == 0)
+   return;
+
+   text_list = kmalloc_array(max_sections, sizeof(*text_list), GFP_KERNEL);
+   if (!text_list)
+   return;
+
+   for (i = 0; i < max_sections; i++) {
+   Elf_Shdr *shdr = &info->sechdrs[i];
+   const char *sname = info->secstrings + shdr->sh_name;
+
+   if (!(shdr->sh_flags & SHF_ALLOC) ||
+   !(shdr->sh_flags & SHF_EXECINSTR) ||
+   strstarts(sname, ".init"))
+   continue;
+
+   text_list[num_text_sections] = shdr;
+   num_text_sections++;
+  

[PATCH v3 02/10] x86: tools/relocs: Support >64K section headers

2020-06-23 Thread Kristen Carlson Accardi
While the relocs tool already supports finding the total number
of section headers if vmlinux exceeds 64K sections, it fails to
read the extended symbol table to get section header indexes for symbols,
causing incorrect section header indexes to be used when there are more
than 64K sections.

Parse the elf file to read the extended symbol table info, and then
replace all direct references to st_shndx with calls to sym_index(),
which will determine whether the value can be read directly or
whether the value should be pulled out of the extended table.
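
As a rough illustration of the lookup described above, the sketch below uses
a cut-down symbol type instead of the real ELF structures. struct mini_sym,
XINDEX_ESCAPE and section_index_of() are hypothetical names; the only fact
borrowed from the ELF specification is that SHN_XINDEX has the value 0xffff.
Byte-order conversion is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XINDEX_ESCAPE 0xffffu  /* value of SHN_XINDEX in the ELF spec */

/* Hypothetical cut-down symbol; a real ELF symbol has more fields. */
struct mini_sym {
	uint16_t st_shndx;
};

/*
 * Return the section header index for symbol number sym_no.
 * symtab:  contents of the symbol table
 * xsymtab: contents of the extended index table, one 32-bit word per
 *          symbol, in the same order as symtab
 */
static uint32_t section_index_of(const struct mini_sym *symtab,
				 const uint32_t *xsymtab, size_t sym_no)
{
	assert(symtab != NULL);

	if (symtab[sym_no].st_shndx != XINDEX_ESCAPE)
		return symtab[sym_no].st_shndx;

	/* the escape value means the real index lives in the parallel table */
	assert(xsymtab != NULL);
	return xsymtab[sym_no];
}
```

The 16-bit st_shndx field cannot hold large section numbers, which is why
the escape value redirects the reader to the 32-bit parallel table.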

Signed-off-by: Kristen Carlson Accardi 
Reviewed-by: Kees Cook 
Reviewed-by: Tony Luck 
Tested-by: Tony Luck 
---
 arch/x86/tools/relocs.c | 104 ++--
 1 file changed, 78 insertions(+), 26 deletions(-)

diff --git a/arch/x86/tools/relocs.c b/arch/x86/tools/relocs.c
index ce7188cbdae5..31b2d151aa63 100644
--- a/arch/x86/tools/relocs.c
+++ b/arch/x86/tools/relocs.c
@@ -14,6 +14,10 @@
 static Elf_Ehdr        ehdr;
 static unsigned long   shnum;
 static unsigned int    shstrndx;
+static unsigned int    shsymtabndx;
+static unsigned int    shxsymtabndx;
+
+static int sym_index(Elf_Sym *sym);
 
 struct relocs {
uint32_t    *offset;
@@ -32,6 +36,7 @@ struct section {
Elf_Shdr   shdr;
struct section *link;
Elf_Sym    *symtab;
+   Elf32_Word *xsymtab;
Elf_Rel    *reltab;
char   *strtab;
 };
@@ -265,7 +270,7 @@ static const char *sym_name(const char *sym_strtab, Elf_Sym *sym)
name = sym_strtab + sym->st_name;
}
else {
-   name = sec_name(sym->st_shndx);
+   name = sec_name(sym_index(sym));
}
return name;
 }
@@ -335,6 +340,23 @@ static uint64_t elf64_to_cpu(uint64_t val)
 #define elf_xword_to_cpu(x)    elf32_to_cpu(x)
 #endif
 
+static int sym_index(Elf_Sym *sym)
+{
+   Elf_Sym *symtab = secs[shsymtabndx].symtab;
+   Elf32_Word *xsymtab = secs[shxsymtabndx].xsymtab;
+   unsigned long offset;
+   int index;
+
+   if (sym->st_shndx != SHN_XINDEX)
+   return sym->st_shndx;
+
+   /* calculate offset of sym from head of table. */
+   offset = (unsigned long)sym - (unsigned long)symtab;
+   index = offset / sizeof(*sym);
+
+   return elf32_to_cpu(xsymtab[index]);
+}
+
 static void read_ehdr(FILE *fp)
 {
if (fread(&ehdr, sizeof(ehdr), 1, fp) != 1) {
@@ -468,31 +490,60 @@ static void read_strtabs(FILE *fp)
 static void read_symtabs(FILE *fp)
 {
int i,j;
+
for (i = 0; i < shnum; i++) {
struct section *sec = &secs[i];
-   if (sec->shdr.sh_type != SHT_SYMTAB) {
+   int num_syms;
+
+   switch (sec->shdr.sh_type) {
+   case SHT_SYMTAB_SHNDX:
+   sec->xsymtab = malloc(sec->shdr.sh_size);
+   if (!sec->xsymtab) {
+   die("malloc of %d bytes for xsymtab failed\n",
+   sec->shdr.sh_size);
+   }
+   if (fseek(fp, sec->shdr.sh_offset, SEEK_SET) < 0) {
+   die("Seek to %d failed: %s\n",
+   sec->shdr.sh_offset, strerror(errno));
+   }
+   if (fread(sec->xsymtab, 1, sec->shdr.sh_size, fp)
+   != sec->shdr.sh_size) {
+   die("Cannot read extended symbol table: %s\n",
+   strerror(errno));
+   }
+   shxsymtabndx = i;
+   continue;
+
+   case SHT_SYMTAB:
+   num_syms = sec->shdr.sh_size / sizeof(Elf_Sym);
+
+   sec->symtab = malloc(sec->shdr.sh_size);
+   if (!sec->symtab) {
+   die("malloc of %d bytes for symtab failed\n",
+   sec->shdr.sh_size);
+   }
+   if (fseek(fp, sec->shdr.sh_offset, SEEK_SET) < 0) {
+   die("Seek to %d failed: %s\n",
+   sec->shdr.sh_offset, strerror(errno));
+   }
+   if (fread(sec->symtab, 1, sec->shdr.sh_size, fp)
+   != sec->shdr.sh_size) {
+   die("Cannot read symbol table: %s\n",
+   strerror(errno));
+   }
+   for (j = 0; j < num_syms; j++) {
+   Elf_Sym *sym = &sec->symtab[j];
+
+   sym->st_name  = elf_word_to_cpu(sym->st_name);
+   sym->st_val

[PATCH v3 03/10] x86/boot: Allow a "silent" kaslr random byte fetch

2020-06-23 Thread Kristen Carlson Accardi
From: Kees Cook 

Under earlyprintk, each RNG call produces a debug report line. When
shuffling hundreds of functions, this is not useful information (each
line is identical and tells us nothing new). Instead, allow for a NULL
"purpose" to suppress the debug reporting.

Signed-off-by: Kees Cook 
Signed-off-by: Kristen Carlson Accardi 
---
 arch/x86/lib/kaslr.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/x86/lib/kaslr.c b/arch/x86/lib/kaslr.c
index a53665116458..2b3eb8c948a3 100644
--- a/arch/x86/lib/kaslr.c
+++ b/arch/x86/lib/kaslr.c
@@ -56,11 +56,14 @@ unsigned long kaslr_get_random_long(const char *purpose)
unsigned long raw, random = get_boot_seed();
bool use_i8254 = true;
 
-   debug_putstr(purpose);
-   debug_putstr(" KASLR using");
+   if (purpose) {
+   debug_putstr(purpose);
+   debug_putstr(" KASLR using");
+   }
 
if (has_cpuflag(X86_FEATURE_RDRAND)) {
-   debug_putstr(" RDRAND");
+   if (purpose)
+   debug_putstr(" RDRAND");
if (rdrand_long()) {
random ^= raw;
use_i8254 = false;
@@ -68,7 +71,8 @@ unsigned long kaslr_get_random_long(const char *purpose)
}
 
if (has_cpuflag(X86_FEATURE_TSC)) {
-   debug_putstr(" RDTSC");
+   if (purpose)
+   debug_putstr(" RDTSC");
raw = rdtsc();
 
random ^= raw;
@@ -76,7 +80,8 @@ unsigned long kaslr_get_random_long(const char *purpose)
}
 
if (use_i8254) {
-   debug_putstr(" i8254");
+   if (purpose)
+   debug_putstr(" i8254");
random ^= i8254();
}
 
@@ -86,7 +91,8 @@ unsigned long kaslr_get_random_long(const char *purpose)
: "a" (random), "rm" (mix_const));
random += raw;
 
-   debug_putstr("...\n");
+   if (purpose)
+   debug_putstr("...\n");
 
return random;
 }
-- 
2.20.1



[PATCH v3 07/10] x86/boot/compressed: change definition of STATIC

2020-06-23 Thread Kristen Carlson Accardi
In preparation for changes to the upcoming fgkaslr commit, change misc.c
to not define STATIC as static, but instead set STATIC to "". This allows
memptr to become accessible to multiple files.

Signed-off-by: Kristen Carlson Accardi 
---
 arch/x86/boot/compressed/kaslr.c | 4 
 arch/x86/boot/compressed/misc.c  | 7 ---
 arch/x86/boot/compressed/misc.h  | 6 ++
 3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index d7408af55738..6f596bd5b6e5 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -39,10 +39,6 @@
 #include 
 #include 
 
-/* Macros used by the included decompressor code below. */
-#define STATIC
-#include 
-
 #ifdef CONFIG_X86_5LEVEL
 unsigned int __pgtable_l5_enabled;
 unsigned int pgdir_shift __ro_after_init = 39;
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 9652d5c2afda..a55a4ec48422 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -26,9 +26,6 @@
  * it is not safe to place pointers in static structures.
  */
 
-/* Macros used by the included decompressor code below. */
-#define STATIC static
-
 /*
  * Use normal definitions of mem*() from string.c. There are already
  * included header files which expect a definition of memset() and by
@@ -49,6 +46,10 @@ struct boot_params *boot_params;
 
 memptr free_mem_ptr;
 memptr free_mem_end_ptr;
+#ifdef CONFIG_FG_KASLR
+unsigned long malloc_ptr;
+int malloc_count;
+#endif
 
 static char *vidmem;
 static int vidport;
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 726e264410ff..d2ec7c745cfa 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -39,6 +39,12 @@
 /* misc.c */
 extern memptr free_mem_ptr;
 extern memptr free_mem_end_ptr;
+#define STATIC
+#ifdef CONFIG_FG_KASLR
+#define STATIC_RW_DATA extern
+#endif
+#include 
+
 extern struct boot_params *boot_params;
 void __putstr(const char *s);
 void __puthex(unsigned long value);
-- 
2.20.1



[PATCH v3 06/10] x86/tools: Add relative relocs for randomized functions

2020-06-23 Thread Kristen Carlson Accardi
When reordering functions, the relative offsets for relocs that
are either in the randomized sections, or refer to the randomized
sections will need to be adjusted. Add code to detect whether a
reloc satisfies these cases, and if so, add them to the appropriate
reloc list.
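
The detection boils down to a section name prefix test. A small standalone
sketch, with is_function_section_name() being a hypothetical helper name for
this example:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if a section name looks like a per-function text section
 * as produced by -ffunction-sections, 0 otherwise. Plain ".text" does
 * not match because the trailing dot is part of the prefix.
 */
static int is_function_section_name(const char *name)
{
	assert(name != NULL);
	return strncmp(name, ".text.", 6) == 0;
}
```

Note that a pure prefix test also matches names such as
".text.unlikely.foo", so anything that should stay out of the randomized set
has to be filtered separately.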

Signed-off-by: Kristen Carlson Accardi 
Reviewed-by: Tony Luck 
Tested-by: Tony Luck 
Reviewed-by: Kees Cook 
---
 arch/x86/boot/compressed/Makefile |  7 +-
 arch/x86/tools/relocs.c   | 41 ---
 arch/x86/tools/relocs.h   |  4 +--
 arch/x86/tools/relocs_common.c| 15 +++
 4 files changed, 55 insertions(+), 12 deletions(-)

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 7619742f91c9..c17b1c8ec82c 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -119,6 +119,11 @@ $(obj)/vmlinux: $(vmlinux-objs-y) FORCE
$(call if_changed,check-and-link-vmlinux)
 
 OBJCOPYFLAGS_vmlinux.bin :=  -R .comment -S
+
+ifdef CONFIG_FG_KASLR
+   RELOCS_ARGS += --fg-kaslr
+endif
+
 $(obj)/vmlinux.bin: vmlinux FORCE
$(call if_changed,objcopy)
 
@@ -126,7 +131,7 @@ targets += $(patsubst $(obj)/%,%,$(vmlinux-objs-y)) vmlinux.bin.all vmlinux.relo
 
 CMD_RELOCS = arch/x86/tools/relocs
 quiet_cmd_relocs = RELOCS  $@
-  cmd_relocs = $(CMD_RELOCS) $< > $@;$(CMD_RELOCS) --abs-relocs $<
+  cmd_relocs = $(CMD_RELOCS) $(RELOCS_ARGS) $< > $@;$(CMD_RELOCS) $(RELOCS_ARGS) --abs-relocs $<
 $(obj)/vmlinux.relocs: vmlinux FORCE
$(call if_changed,relocs)
 
diff --git a/arch/x86/tools/relocs.c b/arch/x86/tools/relocs.c
index 31b2d151aa63..e0665038742e 100644
--- a/arch/x86/tools/relocs.c
+++ b/arch/x86/tools/relocs.c
@@ -42,6 +42,8 @@ struct section {
 };
 static struct section *secs;
 
+static int fgkaslr_mode;
+
 static const char * const sym_regex_kernel[S_NSYMTYPES] = {
 /*
  * Following symbols have been audited. There values are constant and do
@@ -818,6 +820,32 @@ static int is_percpu_sym(ElfW(Sym) *sym, const char *symname)
strncmp(symname, "init_per_cpu_", 13);
 }
 
+static int is_function_section(struct section *sec)
+{
+   const char *name;
+
+   if (!fgkaslr_mode)
+   return 0;
+
+   name = sec_name(sec->shdr.sh_info);
+
+   return(!strncmp(name, ".text.", 6));
+}
+
+static int is_randomized_sym(ElfW(Sym) *sym)
+{
+   const char *name;
+
+   if (!fgkaslr_mode)
+   return 0;
+
+   if (sym->st_shndx > shnum)
+   return 0;
+
+   name = sec_name(sym_index(sym));
+   return(!strncmp(name, ".text.", 6));
+}
+
 static int do_reloc64(struct section *sec, Elf_Rel *rel, ElfW(Sym) *sym,
  const char *symname)
 {
@@ -842,13 +870,17 @@ static int do_reloc64(struct section *sec, Elf_Rel *rel, ElfW(Sym) *sym,
case R_X86_64_PC32:
case R_X86_64_PLT32:
/*
-* PC relative relocations don't need to be adjusted unless
-* referencing a percpu symbol.
+* we need to keep pc relative relocations for sections which
+* might be randomized, and for the percpu section.
+* We also need to keep relocations for any offset which might
+* reference an address in a section which has been randomized.
 *
 * NB: R_X86_64_PLT32 can be treated as R_X86_64_PC32.
 */
-   if (is_percpu_sym(sym, symname))
+   if (is_function_section(sec) || is_randomized_sym(sym) ||
+   is_percpu_sym(sym, symname))
add_reloc(&relocs32neg, offset);
+
break;
 
case R_X86_64_PC64:
@@ -1158,8 +1190,9 @@ static void print_reloc_info(void)
 
 void process(FILE *fp, int use_real_mode, int as_text,
 int show_absolute_syms, int show_absolute_relocs,
-int show_reloc_info)
+int show_reloc_info, int fgkaslr)
 {
+   fgkaslr_mode = fgkaslr;
regex_init(use_real_mode);
read_ehdr(fp);
read_shdrs(fp);
diff --git a/arch/x86/tools/relocs.h b/arch/x86/tools/relocs.h
index 43c83c0fd22c..f582895c04dd 100644
--- a/arch/x86/tools/relocs.h
+++ b/arch/x86/tools/relocs.h
@@ -31,8 +31,8 @@ enum symtype {
 
 void process_32(FILE *fp, int use_real_mode, int as_text,
int show_absolute_syms, int show_absolute_relocs,
-   int show_reloc_info);
+   int show_reloc_info, int fgkaslr);
 void process_64(FILE *fp, int use_real_mode, int as_text,
int show_absolute_syms, int show_absolute_relocs,
-   int show_reloc_info);
+   int show_reloc_info, int fgkaslr);
 #endif /* RELOCS_H */
diff --git a/arch/x86/tools/relocs_common.c b/arch/x86/tools/relocs_common.c
index 6634352a20bc..a80efa2f53ff 100644
--- a/arch/x86/tools/relocs_common.c
+++ b/arch/x8

[PATCH v3 05/10] x86: Make sure _etext includes function sections

2020-06-23 Thread Kristen Carlson Accardi
When using -ffunction-sections to place each function in
its own text section so it can be randomized at load time, the
linker considers these .text.* sections "orphaned sections", and
will place them after the first similar section (.text). In order
to accurately represent the end of the text section and the
orphaned sections, _etext must be moved so that it is after both
.text and .text.*. The text size must also be calculated to
include .text AND .text.*.
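
To see why the text size matters, the orc_lookup sizing expression used in
the linker script can be worked through in plain C. orc_lookup_bytes() is a
hypothetical helper for this example, and the block size of 256 used in the
note below is an assumption about LOOKUP_BLOCK_SIZE, not something stated in
this patch.

```c
#include <assert.h>

/*
 * Bytes reserved for the lookup table: one 4-byte entry per block of
 * text, rounded up, plus one extra entry.
 */
static unsigned long orc_lookup_bytes(unsigned long text_size,
				      unsigned long block_size)
{
	assert(block_size > 0);
	return ((text_size + block_size - 1) / block_size + 1) * 4;
}
```

With a block size of 256, a text size of 1000 bytes gives (4 + 1) * 4 = 20
bytes. If the size only covered .text and missed the .text.* sections, the
table would come out too small for the address range it has to cover.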

Signed-off-by: Kristen Carlson Accardi 
Reviewed-by: Tony Luck 
Tested-by: Tony Luck 
---
 arch/x86/kernel/vmlinux.lds.S | 17 +++--
 include/asm-generic/vmlinux.lds.h |  2 +-
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 3bfc8dd8a43d..e8da7eeb4d8d 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -146,9 +146,22 @@ SECTIONS
 #endif
} :text =0x
 
-   /* End of text section, which should occupy whole number of pages */
-   _etext = .;
+   /*
+* -ffunction-sections creates .text.* sections, which are considered
+* "orphan sections" and added after the first similar section (.text).
+* Placing this ALIGN statement before _etext causes the address of
+* _etext to be below that of all the .text.* orphaned sections
+*/
. = ALIGN(PAGE_SIZE);
+   _etext = .;
+
+   /*
+* the size of the .text section is used to calculate the address
+* range for orc lookups. If we just use SIZEOF(.text), we will
+* miss all the .text.* sections. Calculate the size using _etext
+* and _stext and save the value for later.
+*/
+   text_size = _etext - _stext;
 
X86_ALIGN_RODATA_BEGIN
RO_DATA(PAGE_SIZE)
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index a5552cf28d5d..34eab6513fdc 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -835,7 +835,7 @@
. = ALIGN(4);   \
.orc_lookup : AT(ADDR(.orc_lookup) - LOAD_OFFSET) { \
orc_lookup = .; \
-   . += (((SIZEOF(.text) + LOOKUP_BLOCK_SIZE - 1) /\
+   . += (((text_size + LOOKUP_BLOCK_SIZE - 1) /\
LOOKUP_BLOCK_SIZE) + 1) * 4;\
orc_lookup_end = .; \
}
-- 
2.20.1



[PATCH v3 04/10] x86: Makefile: Add build and config option for CONFIG_FG_KASLR

2020-06-23 Thread Kristen Carlson Accardi
Allow user to select CONFIG_FG_KASLR if dependencies are met. Change
the Makefile to build with -ffunction-sections if CONFIG_FG_KASLR is set.

While the only architecture that supports CONFIG_FG_KASLR does not
currently enable HAVE_LD_DEAD_CODE_DATA_ELIMINATION, make sure these
2 features play nicely together for the future by ensuring that if
CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is selected when used with
CONFIG_FG_KASLR the function sections will not be consolidated back
into .text. Thanks to Kees Cook for the dead code elimination changes.

Signed-off-by: Kristen Carlson Accardi 
Reviewed-by: Tony Luck 
Reviewed-by: Kees Cook 
Tested-by: Tony Luck 
---
 Makefile  |  6 +-
 arch/x86/Kconfig  |  4 
 include/asm-generic/vmlinux.lds.h | 16 ++--
 init/Kconfig  | 14 ++
 4 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/Makefile b/Makefile
index ac2c61c37a73..363f53798fca 100644
--- a/Makefile
+++ b/Makefile
@@ -872,7 +872,7 @@ KBUILD_CFLAGS += $(call cc-option, -fno-inline-functions-called-once)
 endif
 
 ifdef CONFIG_LD_DEAD_CODE_DATA_ELIMINATION
-KBUILD_CFLAGS_KERNEL += -ffunction-sections -fdata-sections
+KBUILD_CFLAGS_KERNEL += -fdata-sections
 LDFLAGS_vmlinux += --gc-sections
 endif
 
@@ -880,6 +880,10 @@ ifdef CONFIG_LIVEPATCH
 KBUILD_CFLAGS += $(call cc-option, -flive-patching=inline-clone)
 endif
 
+ifneq ($(CONFIG_LD_DEAD_CODE_DATA_ELIMINATION)$(CONFIG_FG_KASLR),)
+KBUILD_CFLAGS += -ffunction-sections
+endif
+
 ifdef CONFIG_SHADOW_CALL_STACK
 CC_FLAGS_SCS   := -fsanitize=shadow-call-stack
 KBUILD_CFLAGS  += $(CC_FLAGS_SCS)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6a0cc524882d..932cbc327af0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -372,6 +372,10 @@ config CC_HAS_SANE_STACKPROTECTOR
   We have to make sure stack protector is unconditionally disabled if
   the compiler produces broken code.
 
+config ARCH_HAS_FG_KASLR
+   def_bool y
+   depends on RANDOMIZE_BASE && X86_64
+
 menu "Processor type and features"
 
 config ZONE_DMA
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index db600ef218d7..a5552cf28d5d 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -93,14 +93,12 @@
  * sections to be brought in with rodata.
  */
 #ifdef CONFIG_LD_DEAD_CODE_DATA_ELIMINATION
-#define TEXT_MAIN .text .text.[0-9a-zA-Z_]*
 #define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..LPBX*
 #define SDATA_MAIN .sdata .sdata.[0-9a-zA-Z_]*
 #define RODATA_MAIN .rodata .rodata.[0-9a-zA-Z_]*
 #define BSS_MAIN .bss .bss.[0-9a-zA-Z_]*
 #define SBSS_MAIN .sbss .sbss.[0-9a-zA-Z_]*
 #else
-#define TEXT_MAIN .text
 #define DATA_MAIN .data
 #define SDATA_MAIN .sdata
 #define RODATA_MAIN .rodata
@@ -108,6 +106,20 @@
 #define SBSS_MAIN .sbss
 #endif
 
+/*
+ * Both LD_DEAD_CODE_DATA_ELIMINATION and CONFIG_FG_KASLR options enable
+ * -ffunction-sections, which produces separately named .text sections. In
+ * the case of CONFIG_FG_KASLR, they need to stay distinct so they can be
+ * separately randomized. Without CONFIG_FG_KASLR, the separate .text
+ * sections can be collected back into a common section, which makes the
+ * resulting image slightly smaller.
+ */
+#if defined(CONFIG_LD_DEAD_CODE_DATA_ELIMINATION) && !defined(CONFIG_FG_KASLR)
+#define TEXT_MAIN .text .text.[0-9a-zA-Z_]*
+#else
+#define TEXT_MAIN .text
+#endif
+
 /*
  * Align to a 32 byte boundary equal to the
  * alignment gcc 4.5 uses for a struct
diff --git a/init/Kconfig b/init/Kconfig
index a46aa8f3174d..e29c032e4d66 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1990,6 +1990,20 @@ config PROFILING
 config TRACEPOINTS
bool
 
+config FG_KASLR
+   bool "Function Granular Kernel Address Space Layout Randomization"
+   depends on $(cc-option, -ffunction-sections)
+   depends on ARCH_HAS_FG_KASLR
+   default n
+   help
+ This option improves the randomness of the kernel text
+ over basic Kernel Address Space Layout Randomization (KASLR)
+ by reordering the kernel text at boot time. This feature
+ uses information generated at compile time to re-layout the
+ kernel text section at boot time at function level granularity.
+
+ If unsure, say N.
+
 endmenu # General setup
 
 source "arch/Kconfig"
-- 
2.20.1



[PATCH v3 09/10] kallsyms: Hide layout

2020-06-23 Thread Kristen Carlson Accardi
This patch makes /proc/kallsyms display alphabetically by symbol
name rather than sorted by address in order to hide the newly
randomized address layout.
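
The effect can be tried out in userspace with a plain qsort(). struct
sym_entry and sort_symbols_by_name() are hypothetical names for this example;
the kernel patch below builds a linked list and uses list_sort() instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical symbol record for illustration. */
struct sym_entry {
	unsigned long addr;
	const char *name;
};

/* qsort() comparator: order entries by symbol name only. */
static int cmp_by_name(const void *a, const void *b)
{
	const struct sym_entry *sa = a;
	const struct sym_entry *sb = b;

	return strcmp(sa->name, sb->name);
}

/* Sort in place so the output order no longer follows the addresses. */
static void sort_symbols_by_name(struct sym_entry *syms, size_t count)
{
	assert(syms != NULL || count == 0);

	if (count > 1)
		qsort(syms, count, sizeof(*syms), cmp_by_name);
}
```

Once the list is ordered by name, a reader who is not allowed to see the
address values learns nothing about which functions ended up next to each
other.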

Signed-off-by: Kristen Carlson Accardi 
Reviewed-by: Tony Luck 
Tested-by: Tony Luck 
---
 kernel/kallsyms.c | 128 ++
 1 file changed, 128 insertions(+)

diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
index 16c8c605f4b0..df2b20e1b7f2 100644
--- a/kernel/kallsyms.c
+++ b/kernel/kallsyms.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * These will be re-linked against their real values
@@ -446,6 +447,11 @@ struct kallsym_iter {
int show_value;
 };
 
+struct kallsyms_iter_list {
+   struct kallsym_iter iter;
+   struct list_head next;
+};
+
 int __weak arch_get_kallsym(unsigned int symnum, unsigned long *value,
char *type, char *name)
 {
@@ -660,6 +666,127 @@ int kallsyms_show_value(void)
}
 }
 
+static int sorted_show(struct seq_file *m, void *p)
+{
+   struct list_head *list = m->private;
+   struct kallsyms_iter_list *iter;
+   int rc;
+
+   if (list_empty(list))
+   return 0;
+
+   iter = list_first_entry(list, struct kallsyms_iter_list, next);
+
+   m->private = iter;
+   rc = s_show(m, p);
+   m->private = list;
+
+   list_del(&iter->next);
+   kfree(iter);
+
+   return rc;
+}
+
+static void *sorted_start(struct seq_file *m, loff_t *pos)
+{
+   return m->private;
+}
+
+static void *sorted_next(struct seq_file *m, void *p, loff_t *pos)
+{
+   struct list_head *list = m->private;
+
+   (*pos)++;
+
+   if (list_empty(list))
+   return NULL;
+
+   return p;
+}
+
+static const struct seq_operations kallsyms_sorted_op = {
+   .start = sorted_start,
+   .next = sorted_next,
+   .stop = s_stop,
+   .show = sorted_show
+};
+
+static int kallsyms_list_cmp(void *priv, struct list_head *a,
+struct list_head *b)
+{
+   struct kallsyms_iter_list *iter_a, *iter_b;
+
+   iter_a = list_entry(a, struct kallsyms_iter_list, next);
+   iter_b = list_entry(b, struct kallsyms_iter_list, next);
+
+   return strcmp(iter_a->iter.name, iter_b->iter.name);
+}
+
+int get_all_symbol_name(void *data, const char *name, struct module *mod,
+   unsigned long addr)
+{
+   unsigned long sym_pos;
+   struct kallsyms_iter_list *node, *last;
+   struct list_head *head = (struct list_head *)data;
+
+   node = kmalloc(sizeof(*node), GFP_KERNEL);
+   if (!node)
+   return -ENOMEM;
+
+   if (list_empty(head)) {
+   sym_pos = 0;
+   memset(node, 0, sizeof(*node));
+   reset_iter(&node->iter, 0);
+   node->iter.show_value = kallsyms_show_value();
+   } else {
+   last = list_first_entry(head, struct kallsyms_iter_list, next);
+   memcpy(node, last, sizeof(*node));
+   sym_pos = last->iter.pos;
+   }
+
+   INIT_LIST_HEAD(&node->next);
+   list_add(&node->next, head);
+
+   /*
+* update_iter returns false when at end of file
+* which in this case we don't care about and can
+* safely ignore. update_iter() will increment
+* the value of iter->pos, for ksymbol_core.
+*/
+   if (sym_pos >= kallsyms_num_syms)
+   sym_pos++;
+
+   (void)update_iter(&node->iter, sym_pos);
+
+   return 0;
+}
+
+#if defined(CONFIG_FG_KASLR)
+/*
+ * When fine grained kaslr is enabled, we need to
+ * print out the symbols sorted by name rather than
+ * by address, because address order would reveal the layout.
+ */
+static int kallsyms_open(struct inode *inode, struct file *file)
+{
+   int ret;
+   struct list_head *list;
+
+   list = __seq_open_private(file, &kallsyms_sorted_op, sizeof(*list));
+   if (!list)
+   return -ENOMEM;
+
+   INIT_LIST_HEAD(list);
+
+   ret = kallsyms_on_each_symbol(get_all_symbol_name, list);
+   if (ret != 0)
+   return ret;
+
+   list_sort(NULL, list, kallsyms_list_cmp);
+
+   return 0;
+}
+#else
 static int kallsyms_open(struct inode *inode, struct file *file)
 {
/*
@@ -676,6 +803,7 @@ static int kallsyms_open(struct inode *inode, struct file *file)
iter->show_value = kallsyms_show_value();
return 0;
 }
+#endif /* CONFIG_FG_KASLR */
 
 #ifdef CONFIG_KGDB_KDB
 const char *kdb_walk_kallsyms(loff_t *pos)
-- 
2.20.1



[PATCH v3 01/10] objtool: Do not assume order of parent/child functions

2020-06-23 Thread Kristen Carlson Accardi
If a .cold function is examined prior to its parent, the link
to the parent/child function can be overwritten when the parent
is examined. Only update pfunc and cfunc if they were previously
nil to prevent this from happening.
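
The ordering problem can be reproduced with a tiny model. struct sym here is
a hypothetical, trimmed-down stand-in for objtool's symbol, and init_links()
and link_cold() are made-up helper names for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down symbol; objtool's real struct has more fields. */
struct sym {
	const char *name;
	struct sym *pfunc;  /* parent function */
	struct sym *cfunc;  /* child (.cold) function */
};

/* Default a symbol's links to itself, but only if nothing set them yet. */
static void init_links(struct sym *s)
{
	assert(s != NULL);

	if (s->pfunc == NULL)
		s->pfunc = s;
	if (s->cfunc == NULL)
		s->cfunc = s;
}

/* Record that 'cold' is the .cold child of 'parent'. */
static void link_cold(struct sym *parent, struct sym *cold)
{
	assert(parent != NULL && cold != NULL);

	cold->pfunc = parent;
	parent->cfunc = cold;
}
```

If the .cold symbol is visited first and linked to its parent, a later
init_links() call on the parent leaves the existing cfunc link alone. An
unconditional assignment at that point would reset it to the parent itself.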

Signed-off-by: Kristen Carlson Accardi 
Acked-by: Josh Poimboeuf 
---
 tools/objtool/elf.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/objtool/elf.c b/tools/objtool/elf.c
index 84225679f96d..f953d3a15612 100644
--- a/tools/objtool/elf.c
+++ b/tools/objtool/elf.c
@@ -434,7 +434,13 @@ static int read_symbols(struct elf *elf)
size_t pnamelen;
if (sym->type != STT_FUNC)
continue;
-   sym->pfunc = sym->cfunc = sym;
+
+   if (sym->pfunc == NULL)
+   sym->pfunc = sym;
+
+   if (sym->cfunc == NULL)
+   sym->cfunc = sym;
+
coldstr = strstr(sym->name, ".cold");
if (!coldstr)
continue;
-- 
2.20.1



[tip: objtool/core] objtool: Do not assume order of parent/child functions

2020-06-17 Thread tip-bot2 for Kristen Carlson Accardi
The following commit has been merged into the objtool/core branch of tip:

Commit-ID:     e000acc145928693833f09152244242a678d3cd5
Gitweb:        https://git.kernel.org/tip/e000acc145928693833f09152244242a678d3cd5
Author:        Kristen Carlson Accardi 
AuthorDate:    Wed, 15 Apr 2020 14:04:43 -07:00
Committer:     Josh Poimboeuf 
CommitterDate: Thu, 28 May 2020 11:06:05 -05:00

objtool: Do not assume order of parent/child functions

If a .cold function is examined prior to its parent, the link
to the parent/child function can be overwritten when the parent
is examined. Only update pfunc and cfunc if they were previously
NULL to prevent this from happening.

This fixes an issue seen when compiling with -ffunction-sections.

Signed-off-by: Kristen Carlson Accardi 
Signed-off-by: Josh Poimboeuf 
---
 tools/objtool/elf.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/objtool/elf.c b/tools/objtool/elf.c
index 8422567..f953d3a 100644
--- a/tools/objtool/elf.c
+++ b/tools/objtool/elf.c
@@ -434,7 +434,13 @@ static int read_symbols(struct elf *elf)
size_t pnamelen;
if (sym->type != STT_FUNC)
continue;
-   sym->pfunc = sym->cfunc = sym;
+
+   if (sym->pfunc == NULL)
+   sym->pfunc = sym;
+
+   if (sym->cfunc == NULL)
+   sym->cfunc = sym;
+
coldstr = strstr(sym->name, ".cold");
if (!coldstr)
continue;


Re: [PATCH v2 9/9] module: Reorder functions

2020-06-09 Thread Kristen Carlson Accardi
On Tue, 2020-06-09 at 13:42 -0700, Kees Cook wrote:
> On Tue, Jun 09, 2020 at 01:14:04PM -0700, Kristen Carlson Accardi
> wrote:
> > On Thu, 2020-05-21 at 14:33 -0700, Kees Cook wrote:
> > > Oh! And I am reminded suddenly about CONFIG_FG_KASLR needing to
> > > interact
> > > correctly with CONFIG_LD_DEAD_CODE_DATA_ELIMINATION in that we do
> > > NOT
> > > want the sections to be collapsed at link time:
> > 
> > sorry - I'm a little confused and was wondering if you could
> > clarify
> > something. Does this mean you expect CONFIG_FG_KASLR=y and
> > CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y to be a valid config? I am
> > not
> 
> Yes, I don't see a reason they can't be used together.
> 
> > familiar with the option, but it seems like you are saying that it
> > requires sections to be collapsed, in which case both of these
> > options
> > as yes would not be allowed? Should I actively prevent this in the
> > Kconfig?
> 
> No, I'm saying that CONFIG_LD_DEAD_CODE_DATA_ELIMINATION does _not_
> actually require that the sections be collapsed, but the Makefile
> currently does this just to keep the resulting ELF "tidy". We want
> that disabled (for the .text parts) in the case of CONFIG_FG_KASLR.
> The
> dead code elimination step, is, IIUC, done at link time before the
> output sections are written.
> 

Ah ok, that makes sense. Thanks.




Re: [PATCH v2 9/9] module: Reorder functions

2020-06-09 Thread Kristen Carlson Accardi
On Thu, 2020-05-21 at 14:33 -0700, Kees Cook wrote:
> Oh! And I am reminded suddenly about CONFIG_FG_KASLR needing to
> interact
> correctly with CONFIG_LD_DEAD_CODE_DATA_ELIMINATION in that we do NOT
> want the sections to be collapsed at link time:

sorry - I'm a little confused and was wondering if you could clarify
something. Does this mean you expect CONFIG_FG_KASLR=y and
CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y to be a valid config? I am not
familiar with the option, but it seems like you are saying that it
requires sections to be collapsed, in which case both of these options
as yes would not be allowed? Should I actively prevent this in the
Kconfig?

Thanks.

Kristen

> 
> #ifdef CONFIG_LD_DEAD_CODE_DATA_ELIMINATION
> #define TEXT_MAIN .text .text.[0-9a-zA-Z_]*
> 
> (I think I had fixed this in some earlier version?)
> 
> I think you want this (untested):
> 
> 
> diff --git a/Makefile b/Makefile
> index 04f5662ae61a..a0d9acd3b900 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -853,8 +853,11 @@ ifdef CONFIG_DEBUG_SECTION_MISMATCH
>  KBUILD_CFLAGS += $(call cc-option, -fno-inline-functions-called-once)
>  endif
>  
> +ifneq ($(CONFIG_LD_DEAD_CODE_DATA_ELIMINATION)$(CONFIG_FG_KASLR),)
> +KBUILD_CFLAGS_KERNEL += -ffunction-sections
> +endif
>  ifdef CONFIG_LD_DEAD_CODE_DATA_ELIMINATION
> -KBUILD_CFLAGS_KERNEL += -ffunction-sections -fdata-sections
> +KBUILD_CFLAGS_KERNEL += -fdata-sections
>  LDFLAGS_vmlinux += --gc-sections
>  endif
>  
> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> index 71e387a5fe90..5f5c692751dd 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -93,20 +93,31 @@
>   * sections to be brought in with rodata.
>   */
>  #ifdef CONFIG_LD_DEAD_CODE_DATA_ELIMINATION
> -#define TEXT_MAIN .text .text.[0-9a-zA-Z_]*
>  #define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..LPBX*
>  #define SDATA_MAIN .sdata .sdata.[0-9a-zA-Z_]*
>  #define RODATA_MAIN .rodata .rodata.[0-9a-zA-Z_]*
>  #define BSS_MAIN .bss .bss.[0-9a-zA-Z_]*
>  #define SBSS_MAIN .sbss .sbss.[0-9a-zA-Z_]*
>  #else
> -#define TEXT_MAIN .text
>  #define DATA_MAIN .data
>  #define SDATA_MAIN .sdata
>  #define RODATA_MAIN .rodata
>  #define BSS_MAIN .bss
>  #define SBSS_MAIN .sbss
>  #endif
> +/*
> + * Both LD_DEAD_CODE_DATA_ELIMINATION and CONFIG_FG_KASLR options enable
> + * -ffunction-sections, which produces separately named .text sections. In
> + * the case of CONFIG_FG_KASLR, they need to stay distinct so they can be
> + * separately randomized. Without CONFIG_FG_KASLR, the separate .text
> + * sections can be collected back into a common section, which makes the
> + * resulting image slightly smaller.
> + */
> +#if defined(CONFIG_LD_DEAD_CODE_DATA_ELIMINATION) && !defined(CONFIG_FG_KASLR)
> +#define TEXT_MAIN .text .text.[0-9a-zA-Z_]*
> +#else
> +#define TEXT_MAIN .text
> +#endif



Re: [PATCH v2 7/9] x86: Add support for function granular KASLR

2020-06-04 Thread Kristen Carlson Accardi
On Thu, 2020-05-21 at 14:08 -0700, Kees Cook wrote:
> On Thu, May 21, 2020 at 09:56:38AM -0700, Kristen Carlson Accardi
> wrote:
> > At boot time, find all the function sections that have separate
> > .text
> > sections, shuffle them, and then copy them to new locations. Adjust
> > any relocations accordingly.
> 
> Commit log length vs "11 files changed, 1159 insertions(+), 15
> deletions(-)" implies to me that a lot more detail is needed here. ;)
> 
> Perhaps describe what the code pieces are, why the code is being
> added
> are here, etc (I see at least the ELF parsing, the ELF shuffling, the
> relocation updates, the symbol list, and the re-sorting of kallsyms,
> ORC, and extables. I think the commit log should prepare someone to
> read
> the diff and know what to expect to find. (In the end, I wonder if
> these
> pieces should be split up into logically separate patches, but for
> now,
> let's just go with it -- though I've made some suggestions below
> about
> things that might be worth splitting out.)
> 
> More below...
> 
> > Signed-off-by: Kristen Carlson Accardi 
> > Reviewed-by: Tony Luck 
> > Tested-by: Tony Luck 
> > ---
> >  Documentation/security/fgkaslr.rst   | 155 +
> >  Documentation/security/index.rst |   1 +
> >  arch/x86/boot/compressed/Makefile|   3 +
> >  arch/x86/boot/compressed/fgkaslr.c   | 823
> > +++
> >  arch/x86/boot/compressed/kaslr.c |   4 -
> >  arch/x86/boot/compressed/misc.c  | 109 ++-
> >  arch/x86/boot/compressed/misc.h  |  34 +
> >  arch/x86/boot/compressed/utils.c |  12 +
> >  arch/x86/boot/compressed/vmlinux.symbols |  17 +
> >  arch/x86/include/asm/boot.h  |  15 +-
> >  include/uapi/linux/elf.h |   1 +
> >  11 files changed, 1159 insertions(+), 15 deletions(-)
> >  create mode 100644 Documentation/security/fgkaslr.rst
> >  create mode 100644 arch/x86/boot/compressed/fgkaslr.c
> >  create mode 100644 arch/x86/boot/compressed/utils.c
> >  create mode 100644 arch/x86/boot/compressed/vmlinux.symbols
> > 
> > diff --git a/Documentation/security/fgkaslr.rst
> > b/Documentation/security/fgkaslr.rst
> > new file mode 100644
> > index ..94939c62c50d
> > --- /dev/null
> > +++ b/Documentation/security/fgkaslr.rst
> > @@ -0,0 +1,155 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +==
> > ===
> > +Function Granular Kernel Address Space Layout Randomization
> > (fgkaslr)
> > +==
> > ===
> > +
> > +:Date: 6 April 2020
> > +:Author: Kristen Accardi
> > +
> > +Kernel Address Space Layout Randomization (KASLR) was merged into
> > the kernel
> > +with the objective of increasing the difficulty of code reuse
> > attacks. Code
> > +reuse attacks reused existing code snippets to get around existing
> > memory
> > +protections. They exploit software bugs which expose addresses of
> > useful code
> > +snippets to control the flow of execution for their own nefarious
> > purposes.
> > +KASLR as it was originally implemented moves the entire kernel
> > code text as a
> > +unit at boot time in order to make addresses less predictable. The
> > order of the
> > +code within the segment is unchanged - only the base address is
> > shifted. There
> > +are a few shortcomings to this algorithm.
> > +
> > +1. Low Entropy - there are only so many locations the kernel can
> > fit in. This
> > +   means an attacker could guess without too much trouble.
> > +2. Knowledge of a single address can reveal the offset of the base
> > address,
> > +   exposing all other locations for a published/known kernel
> > image.
> > +3. Info leaks abound.
> > +
> > +Finer grained ASLR has been proposed as a way to make ASLR more
> > resistant
> > +to info leaks. It is not a new concept at all, and there are many
> > variations
> > +possible. Function reordering is an implementation of finer
> > grained ASLR
> > +which randomizes the layout of an address space on a function
> > level
> > +granularity. The term "fgkaslr" is used in this document to refer
> > to the
> > +technique of function reordering when used with KASLR, as well as
> > finer grained
> > +KASLR in general.
> > +
> > +The objective of this patch set is to improve a technology that is
> > al

Re: [PATCH v2 0/9] Function Granular KASLR

2020-05-21 Thread Kristen Carlson Accardi
On Thu, 2020-05-21 at 16:30 -0700, Kees Cook wrote:
> On Fri, May 22, 2020 at 12:26:30AM +0200, Thomas Gleixner wrote:
> > I understand how this is supposed to work, but I fail to find an
> > explanation how all of this is preserving the text subsections we
> > have,
> > i.e. .kprobes.text, .entry.text ...?
> 
> I had the same question when I first started looking at earlier
> versions
> of this series! :)

Thanks for responding - clearly I do need to update the cover letter
and documentation.

> 
> > I assume that the functions in these subsections are reshuffled
> > within
> > their own randomized address space so that __xxx_text_start and
> > __xxx_text_end markers still make sense, right?
> 
> No, but perhaps in the future. Right now, they are entirely ignored
> and
> left untouched. The current series only looks at the sections
> produced
> by -ffunction-sections, which is to say only things named
> ".text.$thing"
> (e.g. ".text.func1", ".text.func2"). Since the "special" text
> sections
> in the kernel are named ".$thing.text" (specifically to avoid other
> long-standing linker logic that does similar .text.* pattern matches)
> they get ignored by FGKASLR right now too.
> 
> Even more specifically, they're ignored because all of these special
> _input_ sections are actually manually collected by the linker script
> into the ".text" _output_ section, which FGKASLR ignores -- it can
> only
> randomize the final output sections (and has no basic block
> visibility
> into the section contents), so everything in .text is untouched.
> Because
> these special sections are collapsed into the single .text output
> section is why we've needed the __$thing_start and __$thing_end
> symbols
> manually constructed by the linker scripts: we lose input section
> location/size details once the linker collects them into an output
> section.
> 
> > I'm surely too tired to figure it out from the patches, but you
> > really
> > want to explain that very detailed for mere mortals who are not
> > deep
> > into this magic as you are.
> 
> Yeah, it's worth calling out, especially since it's an area of future
> work -- I think if we can move the special sections out of .text into
> their own output sections that can get randomized and we'll have
> section
> position/size information available without the manual ..._start/_end
> symbols. But this will require work with the compiler and linker to
> get
> what's needed relative to -ffunction-sections, teach the kernel about
> the new way of getting _start/_end, etc etc.
> 
> So, before any of that, just .text.* is a good first step, and after
> that I think next would be getting .text randomized relative to the
> other
> .text.* sections (IIUC, it is entirely untouched currently, so only
> the
> standard KASLR base offset moves it around). Only after that do we
> start
> poking around trying to munge the special section contents (which
> requires use solving a few problems simultaneously). :)
> 

That's right - we keep .text unrandomized, so any special sections that
are collected into .text are still in their original layout. Like you
said, they still get to take advantage of normal KASLR (base address
randomization).
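
The naming rule discussed above (".text.$thing" is randomized, ".$thing.text"
and plain ".text" are not) can be written as a tiny predicate. This is a
sketch only: is_function_section_name() is a hypothetical helper that mirrors
the strncmp(name, ".text.", 6) prefix test used in the relocs tool patch, and
it looks at the name alone.

```c
#include <assert.h>
#include <string.h>

/*
 * Returns 1 for per-function sections such as ".text.func1" and 0 for
 * everything else, including plain ".text" and ".kprobes.text".
 */
static int is_function_section_name(const char *name)
{
	assert(name != NULL);
	return strncmp(name, ".text.", 6) == 0;
}
```

Because the test is a prefix match on ".text.", the special sections keep
their original layout simply by virtue of how they are named.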




Re: [PATCH v2 7/9] x86: Add support for function granular KASLR

2020-05-21 Thread Kristen Carlson Accardi
Hi Kees,
Thanks for your review - I will incorporate what I can into v3, or
explain why not once I give it a try :).

On Thu, 2020-05-21 at 14:08 -0700, Kees Cook wrote:
> > 


> On Thu, May 21, 2020 at 09:56:38AM -0700, Kristen Carlson Accardi
> wrote:
> > +   /*
> > +* sometimes we are updating a relative offset that would
> > +* normally be relative to the next instruction (such as a
> > call).
> > +* In this case to calculate the target, you need to add 32bits
> > to
> > +* the pc to get the next instruction value. However, sometimes
> > +* targets are just data that was stored in a table such as
> > ksymtab
> > +* or cpu alternatives. In this case our target is not relative
> > to
> > +* the next instruction.
> > +*/
> 
> Excellent and scary comment. ;) Was this found by trial and error?
> That
> sounds "fun" to debug. :P

This did suck to debug. Thank goodness for debugging with gdb in a VM.
As you know, I previously had a patch that used a pseudo-random number
generator (prand) to retain the same layout across boots, and that came
in handy here. While we decided not to submit this functionality with
this initial merge attempt, I will add it in the future, as it makes
debugging much easier when you can reliably reproduce failure modes.
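
The two cases in the quoted comment can be worked through with plain
arithmetic. The following is a sketch under assumptions, not the code from
the patch: rel32_target() and rel32_adjust() are hypothetical helpers, and
the call case assumes the 32-bit operand is the last field of the
instruction, so the next instruction starts 4 bytes after it.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Address referenced by a 32-bit relative value stored at 'loc'.
 * For a call operand the displacement is applied to the address of the
 * next instruction (loc + 4); for a relative entry in a data table it is
 * applied to 'loc' itself.
 */
static uint64_t rel32_target(uint64_t loc, int32_t disp, int next_insn_relative)
{
	uint64_t base = next_insn_relative ? loc + 4 : loc;

	return base + (uint64_t)(int64_t)disp;
}

/*
 * New displacement after the referencing location moved by 'loc_delta'
 * bytes and the target moved by 'target_delta' bytes (either may be zero).
 */
static int32_t rel32_adjust(int32_t disp, int64_t loc_delta, int64_t target_delta)
{
	int64_t adjusted = (int64_t)disp + target_delta - loc_delta;

	assert(adjusted >= INT32_MIN && adjusted <= INT32_MAX);
	return (int32_t)adjusted;
}
```

The adjustment itself is the same in both cases; the distinction matters when
computing the target address, which is needed to find out which section the
target lives in and therefore how far it moved.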




[PATCH v2 7/9] x86: Add support for function granular KASLR

2020-05-21 Thread Kristen Carlson Accardi
At boot time, find all the function sections that have separate .text
sections, shuffle them, and then copy them to new locations. Adjust
any relocations accordingly.

Signed-off-by: Kristen Carlson Accardi 
Reviewed-by: Tony Luck 
Tested-by: Tony Luck 
---
 Documentation/security/fgkaslr.rst   | 155 +
 Documentation/security/index.rst |   1 +
 arch/x86/boot/compressed/Makefile|   3 +
 arch/x86/boot/compressed/fgkaslr.c   | 823 +++
 arch/x86/boot/compressed/kaslr.c |   4 -
 arch/x86/boot/compressed/misc.c  | 109 ++-
 arch/x86/boot/compressed/misc.h  |  34 +
 arch/x86/boot/compressed/utils.c |  12 +
 arch/x86/boot/compressed/vmlinux.symbols |  17 +
 arch/x86/include/asm/boot.h  |  15 +-
 include/uapi/linux/elf.h |   1 +
 11 files changed, 1159 insertions(+), 15 deletions(-)
 create mode 100644 Documentation/security/fgkaslr.rst
 create mode 100644 arch/x86/boot/compressed/fgkaslr.c
 create mode 100644 arch/x86/boot/compressed/utils.c
 create mode 100644 arch/x86/boot/compressed/vmlinux.symbols

diff --git a/Documentation/security/fgkaslr.rst b/Documentation/security/fgkaslr.rst
new file mode 100644
index ..94939c62c50d
--- /dev/null
+++ b/Documentation/security/fgkaslr.rst
@@ -0,0 +1,155 @@
+.. SPDX-License-Identifier: GPL-2.0
+
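+=====================================================================
+Function Granular Kernel Address Space Layout Randomization (fgkaslr)
+=====================================================================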
+
+:Date: 6 April 2020
+:Author: Kristen Accardi
+
+Kernel Address Space Layout Randomization (KASLR) was merged into the kernel
+with the objective of increasing the difficulty of code reuse attacks. Code
+reuse attacks reuse existing code snippets to get around existing memory
+protections. They exploit software bugs which expose addresses of useful code
+snippets to control the flow of execution for their own nefarious purposes.
+KASLR as it was originally implemented moves the entire kernel code text as a
+unit at boot time in order to make addresses less predictable. The order of the
+code within the segment is unchanged - only the base address is shifted. There
+are a few shortcomings to this algorithm.
+
+1. Low Entropy - there are only so many locations the kernel can fit in. This
+   means an attacker could guess without too much trouble.
+2. Knowledge of a single address can reveal the offset of the base address,
+   exposing all other locations for a published/known kernel image.
+3. Info leaks abound.
+
+Finer grained ASLR has been proposed as a way to make ASLR more resistant
+to info leaks. It is not a new concept at all, and there are many variations
+possible. Function reordering is an implementation of finer grained ASLR
+which randomizes the layout of an address space on a function level
+granularity. The term "fgkaslr" is used in this document to refer to the
+technique of function reordering when used with KASLR, as well as finer grained
+KASLR in general.
+
+The objective of this patch set is to improve a technology that is already
+merged into the kernel (KASLR). This code will not prevent all code reuse
+attacks, and should be considered as one of several tools that can be used.
+
+Implementation Details
+======================
+
+The over-arching objective of the fgkaslr implementation is incremental
+improvement over the existing KASLR algorithm. It is designed to work with
+the existing solution, and there are two main areas where code changes occur:
+Build time, and Load time.
+
+Build time
+----------
+
+GCC has had an option to place functions into individual .text sections
+for many years now (-ffunction-sections). This option is used to implement
+function reordering at load time. The final compiled vmlinux retains all the
+section headers, which can be used to help find the address ranges of each
+function. Using this information and an expanded table of relocation addresses,
+individual text sections can be shuffled immediately after decompression.
+Some data tables inside the kernel that have assumptions about order
+require sorting after the update. In order to modify these tables,
+a few key symbols are preserved through the objcopy symbol stripping process
+for use after shuffling the text segments.
+
+Load time
+---------
+
+The boot kernel was modified to parse the vmlinux elf file after
+decompression to check for symbols for modifying data tables, and to
+look for any .text.* sections to randomize. The sections are then shuffled,
+and tables are updated or resorted. The existing code which updated relocation
+addresses was modified to account for not just a fixed delta from the load
+address, but the offset that the function section was moved to. This requires
+inspection of each address to see if it was impacted by a randomization.
+
+In order to hide the new layout, symbols reported through /pro

[PATCH v2 6/9] x86/tools: Add relative relocs for randomized functions

2020-05-21 Thread Kristen Carlson Accardi
When reordering functions, the relative offsets for relocs that
are either in the randomized sections, or refer to the randomized
sections, will need to be adjusted. Add code to detect whether a
reloc satisfies either of these cases, and if so, add it to the
appropriate reloc list.

Signed-off-by: Kristen Carlson Accardi 
Reviewed-by: Tony Luck 
Tested-by: Tony Luck 
---
 arch/x86/boot/compressed/Makefile |  7 +++-
 arch/x86/tools/relocs.c   | 55 ---
 arch/x86/tools/relocs.h   |  4 +--
 arch/x86/tools/relocs_common.c| 15 ++---
 4 files changed, 62 insertions(+), 19 deletions(-)

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 5f7c262bcc99..3a5a004498de 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -117,6 +117,11 @@ $(obj)/vmlinux: $(vmlinux-objs-y) FORCE
$(call if_changed,check-and-link-vmlinux)
 
 OBJCOPYFLAGS_vmlinux.bin :=  -R .comment -S
+
+ifdef CONFIG_FG_KASLR
+   RELOCS_ARGS += --fg-kaslr
+endif
+
 $(obj)/vmlinux.bin: vmlinux FORCE
$(call if_changed,objcopy)
 
@@ -124,7 +129,7 @@ targets += $(patsubst $(obj)/%,%,$(vmlinux-objs-y)) vmlinux.bin.all vmlinux.relo
 
 CMD_RELOCS = arch/x86/tools/relocs
 quiet_cmd_relocs = RELOCS  $@
-  cmd_relocs = $(CMD_RELOCS) $< > $@;$(CMD_RELOCS) --abs-relocs $<
+  cmd_relocs = $(CMD_RELOCS) $(RELOCS_ARGS) $< > $@;$(CMD_RELOCS) $(RELOCS_ARGS) --abs-relocs $<
 $(obj)/vmlinux.relocs: vmlinux FORCE
$(call if_changed,relocs)
 
diff --git a/arch/x86/tools/relocs.c b/arch/x86/tools/relocs.c
index a00dc133f109..bf51ff1854ff 100644
--- a/arch/x86/tools/relocs.c
+++ b/arch/x86/tools/relocs.c
@@ -42,6 +42,8 @@ struct section {
 };
 static struct section *secs;
 
+static int fg_kaslr;
+
 static const char * const sym_regex_kernel[S_NSYMTYPES] = {
 /*
  * Following symbols have been audited. There values are constant and do
@@ -351,8 +353,8 @@ static int sym_index(Elf_Sym *sym)
return sym->st_shndx;
 
/* calculate offset of sym from head of table. */
-   offset = (unsigned long) sym - (unsigned long) symtab;
-   index = offset/sizeof(*sym);
+   offset = (unsigned long)sym - (unsigned long)symtab;
+   index = offset / sizeof(*sym);
 
return elf32_to_cpu(xsymtab[index]);
 }
@@ -500,22 +502,22 @@ static void read_symtabs(FILE *fp)
sec->xsymtab = malloc(sec->shdr.sh_size);
if (!sec->xsymtab) {
die("malloc of %d bytes for xsymtab failed\n",
-   sec->shdr.sh_size);
+   sec->shdr.sh_size);
}
if (fseek(fp, sec->shdr.sh_offset, SEEK_SET) < 0) {
die("Seek to %d failed: %s\n",
-   sec->shdr.sh_offset, strerror(errno));
+   sec->shdr.sh_offset, strerror(errno));
}
if (fread(sec->xsymtab, 1, sec->shdr.sh_size, fp)
-   != sec->shdr.sh_size) {
+   != sec->shdr.sh_size) {
die("Cannot read extended symbol table: %s\n",
-   strerror(errno));
+   strerror(errno));
}
shxsymtabndx = i;
continue;
 
case SHT_SYMTAB:
-   num_syms = sec->shdr.sh_size/sizeof(Elf_Sym);
+   num_syms = sec->shdr.sh_size / sizeof(Elf_Sym);
 
sec->symtab = malloc(sec->shdr.sh_size);
if (!sec->symtab) {
@@ -818,6 +820,32 @@ static int is_percpu_sym(ElfW(Sym) *sym, const char *symname)
strncmp(symname, "init_per_cpu_", 13);
 }
 
+static int is_function_section(struct section *sec)
+{
+   const char *name;
+
+   if (!fg_kaslr)
+   return 0;
+
+   name = sec_name(sec->shdr.sh_info);
+
+   return(!strncmp(name, ".text.", 6));
+}
+
+static int is_randomized_sym(ElfW(Sym) *sym)
+{
+   const char *name;
+
+   if (!fg_kaslr)
+   return 0;
+
+   if (sym->st_shndx > shnum)
+   return 0;
+
+   name = sec_name(sym_index(sym));
+   return(!strncmp(name, ".text.", 6));
+}
+
 static int do_reloc64(struct section *sec, Elf_Rel *rel, ElfW(Sym) *sym,
  const char *symname)
 {
@@ -842,13 +870,17 @@ static int do_reloc64(struct section *sec, Elf_Rel *rel, ElfW(Sym) *sym,
case R_X86_64_PC32:
case R_X86_64_PLT32:
/*
-* PC relative relocations don't need to 

[PATCH v2 4/9] x86: Makefile: Add build and config option for CONFIG_FG_KASLR

2020-05-21 Thread Kristen Carlson Accardi
Allow the user to select CONFIG_FG_KASLR if dependencies are met. Change
the Makefile to build with -ffunction-sections if CONFIG_FG_KASLR is
selected.

Signed-off-by: Kristen Carlson Accardi 
Reviewed-by: Tony Luck 
Tested-by: Tony Luck 
---
 Makefile |  4 
 arch/x86/Kconfig | 13 +
 2 files changed, 17 insertions(+)

diff --git a/Makefile b/Makefile
index 04f5662ae61a..28e515baa824 100644
--- a/Makefile
+++ b/Makefile
@@ -862,6 +862,10 @@ ifdef CONFIG_LIVEPATCH
 KBUILD_CFLAGS += $(call cc-option, -flive-patching=inline-clone)
 endif
 
+ifdef CONFIG_FG_KASLR
+KBUILD_CFLAGS += -ffunction-sections
+endif
+
 # arch Makefile may override CC so keep this after arch Makefile is included
 NOSTDINC_FLAGS += -nostdinc -isystem $(shell $(CC) -print-file-name=include)
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2d3f963fd6f1..50e83ea57d70 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2183,6 +2183,19 @@ config RANDOMIZE_BASE
 
  If unsure, say Y.
 
+config FG_KASLR
+   bool "Function Granular Kernel Address Space Layout Randomization"
+   depends on $(cc-option, -ffunction-sections)
+   depends on RANDOMIZE_BASE && X86_64
+   help
+ This option improves the randomness of the kernel text
+ over basic Kernel Address Space Layout Randomization (KASLR)
+ by reordering the kernel text at boot time. This feature
+ uses information generated at compile time to re-layout the
+ kernel text section at boot time at function level granularity.
+
+ If unsure, say N.
+
 # Relocation on x86 needs some additional build support
 config X86_NEED_RELOCS
def_bool y
-- 
2.20.1



[PATCH v2 5/9] x86: Make sure _etext includes function sections

2020-05-21 Thread Kristen Carlson Accardi
When using -ffunction-sections to place each function in
its own text section so it can be randomized at load time, the
linker considers these .text.* sections "orphaned sections", and
will place them after the first similar section (.text). In order
to accurately represent the end of the text section and the
orphaned sections, _etext must be moved so that it is after both
.text and .text.*. The text size must also be calculated to
include both .text and .text.*.

Signed-off-by: Kristen Carlson Accardi 
Reviewed-by: Tony Luck 
Tested-by: Tony Luck 
---
 arch/x86/kernel/vmlinux.lds.S | 18 +-
 include/asm-generic/vmlinux.lds.h |  2 +-
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 1bf7e312361f..044f7528a2f0 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -147,8 +147,24 @@ SECTIONS
 #endif
} :text =0x
 
-   /* End of text section, which should occupy whole number of pages */
+#ifdef CONFIG_FG_KASLR
+   /*
+* -ffunction-sections creates .text.* sections, which are considered
+* "orphan sections" and added after the first similar section (.text).
+* Adding this ALIGN statement causes _etext to be placed after
+* all the .text.* orphaned sections
+*/
+   . = ALIGN(PAGE_SIZE);
+#endif
_etext = .;
+
+   /*
+* the size of the .text section is used to calculate the address
+* range for orc lookups. If we just use SIZEOF(.text), we will
+* miss all the .text.* sections. Calculate the size using _etext
+* and _stext and save the value for later.
+*/
+   text_size = _etext - _stext;
. = ALIGN(PAGE_SIZE);
 
X86_ALIGN_RODATA_BEGIN
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 71e387a5fe90..f5baee74854c 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -813,7 +813,7 @@
. = ALIGN(4);   \
.orc_lookup : AT(ADDR(.orc_lookup) - LOAD_OFFSET) { \
orc_lookup = .; \
-   . += (((SIZEOF(.text) + LOOKUP_BLOCK_SIZE - 1) /\
+   . += (((text_size + LOOKUP_BLOCK_SIZE - 1) /\
LOOKUP_BLOCK_SIZE) + 1) * 4;\
orc_lookup_end = .; \
}
-- 
2.20.1
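
The effect of replacing SIZEOF(.text) with text_size in the .orc_lookup
sizing can be checked with the same arithmetic in C. This is an illustrative
sketch: orc_lookup_bytes() is a hypothetical helper, and the 256-byte
LOOKUP_BLOCK_SIZE is an assumption about the kernel's value rather than
something stated in this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed lookup block size in bytes; treat the value as illustrative. */
#define LOOKUP_BLOCK_SIZE 256u

/* Bytes reserved for .orc_lookup: one 4-byte entry per block, plus one. */
static size_t orc_lookup_bytes(size_t text_size)
{
	assert(text_size > 0);
	return ((text_size + LOOKUP_BLOCK_SIZE - 1) / LOOKUP_BLOCK_SIZE + 1) * 4;
}
```

If only the size of .text were passed in while most functions live in
.text.* sections, the table would be sized for a fraction of the code, which
is why the size is derived from _stext and _etext instead.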



[PATCH v2 9/9] module: Reorder functions

2020-05-21 Thread Kristen Carlson Accardi
Introduce a new config option to allow modules to be re-ordered
by function. This option can be enabled independently of the
kernel text KASLR or FG_KASLR settings so that it can be used
by architectures that do not support either of these features.
This option will be selected by default if CONFIG_FG_KASLR is
selected.

If a module has functions split out into separate text sections
(i.e. compiled with the -ffunction-sections flag), reorder the
functions to provide some code diversification to modules.

Signed-off-by: Kristen Carlson Accardi 
Reviewed-by: Kees Cook 
Acked-by: Ard Biesheuvel 
Tested-by: Ard Biesheuvel 
Reviewed-by: Tony Luck 
Tested-by: Tony Luck 
---
 arch/x86/Kconfig  |  1 +
 arch/x86/Makefile |  3 ++
 init/Kconfig  | 11 +++
 kernel/module.c   | 81 +++
 4 files changed, 96 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 50e83ea57d70..d0bdd5c8c432 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2187,6 +2187,7 @@ config FG_KASLR
bool "Function Granular Kernel Address Space Layout Randomization"
depends on $(cc-option, -ffunction-sections)
depends on RANDOMIZE_BASE && X86_64
+   select MODULE_FG_KASLR
help
  This option improves the randomness of the kernel text
  over basic Kernel Address Space Layout Randomization (KASLR)
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index b65ec63c7db7..8c830c37c74c 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -51,6 +51,9 @@ ifdef CONFIG_X86_NEED_RELOCS
 LDFLAGS_vmlinux := --emit-relocs --discard-none
 endif
 
+ifdef CONFIG_MODULE_FG_KASLR
+   KBUILD_CFLAGS_MODULE += -ffunction-sections
+endif
 #
 # Prevent GCC from generating any FP code by mistake.
 #
diff --git a/init/Kconfig b/init/Kconfig
index 74a5ac65644f..b19920413bcc 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2227,6 +2227,17 @@ config UNUSED_KSYMS_WHITELIST
  one per line. The path can be absolute, or relative to the kernel
  source tree.
 
+config MODULE_FG_KASLR
+   depends on $(cc-option, -ffunction-sections)
+   bool "Module Function Granular Layout Randomization"
+   help
+ This option randomizes the module text section by reordering the text
+ section by function at module load time. In order to use this
+ feature, the module must have been compiled with the
+ -ffunction-sections compiler flag.
+
+ If unsure, say N.
+
 endif # MODULES
 
 config MODULES_TREE_LOOKUP
diff --git a/kernel/module.c b/kernel/module.c
index 646f1e2330d2..e3cd619c60c2 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -53,6 +53,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "module-internal.h"
 
@@ -2370,6 +2371,83 @@ static long get_offset(struct module *mod, unsigned int *size,
return ret;
 }
 
+/*
+ * shuffle_text_list()
+ * Use a Fisher Yates algorithm to shuffle a list of text sections.
+ */
+static void shuffle_text_list(Elf_Shdr **list, int size)
+{
+   int i;
+   unsigned int j;
+   Elf_Shdr *temp;
+
+   for (i = size - 1; i > 0; i--) {
+   /*
+* pick a random index from 0 to i
+*/
+   get_random_bytes(&j, sizeof(j));
+   j = j % (i + 1);
+
+   temp = list[i];
+   list[i] = list[j];
+   list[j] = temp;
+   }
+}
+
+/*
+ * randomize_text()
+ * Look through the core section looking for executable code sections.
+ * Store sections in an array and then shuffle the sections
+ * to reorder the functions.
+ */
+static void randomize_text(struct module *mod, struct load_info *info)
+{
+   int i;
+   int num_text_sections = 0;
+   Elf_Shdr **text_list;
+   int size = 0;
+   int max_sections = info->hdr->e_shnum;
+   unsigned int sec = find_sec(info, ".text");
+
+   if (sec == 0)
+   return;
+
+   text_list = kmalloc_array(max_sections, sizeof(*text_list), GFP_KERNEL);
+   if (!text_list)
+   return;
+
+   for (i = 0; i < max_sections; i++) {
+   Elf_Shdr *shdr = &info->sechdrs[i];
+   const char *sname = info->secstrings + shdr->sh_name;
+
+   if (!(shdr->sh_flags & SHF_ALLOC) ||
+   !(shdr->sh_flags & SHF_EXECINSTR) ||
+   strstarts(sname, ".init"))
+   continue;
+
+   text_list[num_text_sections] = shdr;
+   num_text_sections++;
+   }
+
+   shuffle_text_list(text_list, num_text_sections);
+
+   for (i = 0; i < num_text_sections; i++) {
+   Elf_Shdr *shdr = text_list[i];
+
+   /*
+* get_offset has a section index for its last
+* argument, that is only used by arch_mod_section_prepend(),
+   
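
The shuffle used by shuffle_text_list() above can be tried out in user space
with a short stand-alone version. This is a sketch under assumptions:
shuffle_ints() is a hypothetical helper that shuffles plain ints rather than
Elf_Shdr pointers, and rand() stands in for get_random_bytes(), so it is not
suitable for anything security related.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Fisher-Yates shuffle of an int array. rand() is only a stand-in for a
 * real entropy source and must not be used for security purposes.
 */
static void shuffle_ints(int *list, size_t size)
{
	size_t i;

	assert(list != NULL || size == 0);
	if (size < 2)
		return;

	for (i = size - 1; i > 0; i--) {
		/* pick a random index from 0 to i, inclusive */
		size_t j = (size_t)rand() % (i + 1);
		int temp = list[i];

		list[i] = list[j];
		list[j] = temp;
	}
}
```

Like the patch, this reduces the random value with a plain modulo, which is
very slightly biased whenever i + 1 does not evenly divide the range of the
random source.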

[PATCH v2 8/9] kallsyms: Hide layout

2020-05-21 Thread Kristen Carlson Accardi
This patch makes /proc/kallsyms display alphabetically by symbol
name rather than sorted by address in order to hide the newly
randomized address layout.
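
As a rough illustration of the idea (not the code in this patch, which
builds a list and sorts it with list_sort() and a strcmp() comparator),
the userspace sketch below orders symbol records by name using qsort().
The struct sketch_ksym type and both function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified symbol record; not the kernel's kallsym_iter. */
struct sketch_ksym {
	unsigned long addr;
	const char *name;
};

/* qsort() comparator: order by name only, so address order is not exposed. */
static int sketch_cmp_name(const void *a, const void *b)
{
	const struct sketch_ksym *sa = a;
	const struct sketch_ksym *sb = b;

	return strcmp(sa->name, sb->name);
}

static void sketch_sort_by_name(struct sketch_ksym *syms, size_t count)
{
	assert(syms != NULL);
	qsort(syms, count, sizeof(*syms), sketch_cmp_name);
}
```

Because the output order depends only on the names, a reader of the
listing learns nothing about where the functions were placed relative
to each other.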

Signed-off-by: Kristen Carlson Accardi 
Reviewed-by: Tony Luck 
Tested-by: Tony Luck 
---
 kernel/kallsyms.c | 138 +-
 1 file changed, 137 insertions(+), 1 deletion(-)

diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
index 16c8c605f4b0..558963b275ec 100644
--- a/kernel/kallsyms.c
+++ b/kernel/kallsyms.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * These will be re-linked against their real values
@@ -446,6 +447,11 @@ struct kallsym_iter {
int show_value;
 };
 
+struct kallsyms_iter_list {
+   struct kallsym_iter iter;
+   struct list_head next;
+};
+
 int __weak arch_get_kallsym(unsigned int symnum, unsigned long *value,
char *type, char *name)
 {
@@ -660,6 +666,121 @@ int kallsyms_show_value(void)
}
 }
 
+static int sorted_show(struct seq_file *m, void *p)
+{
+   struct list_head *list = m->private;
+   struct kallsyms_iter_list *iter;
+   int rc;
+
+   if (list_empty(list))
+   return 0;
+
+   iter = list_first_entry(list, struct kallsyms_iter_list, next);
+
+   m->private = iter;
+   rc = s_show(m, p);
+   m->private = list;
+
+   list_del(&iter->next);
+   kfree(iter);
+
+   return rc;
+}
+
+static void *sorted_start(struct seq_file *m, loff_t *pos)
+{
+   return m->private;
+}
+
+static void *sorted_next(struct seq_file *m, void *p, loff_t *pos)
+{
+   struct list_head *list = m->private;
+
+   (*pos)++;
+
+   if (list_empty(list))
+   return NULL;
+
+   return p;
+}
+
+static const struct seq_operations kallsyms_sorted_op = {
+   .start = sorted_start,
+   .next = sorted_next,
+   .stop = s_stop,
+   .show = sorted_show
+};
+
+static int kallsyms_list_cmp(void *priv, struct list_head *a,
+struct list_head *b)
+{
+   struct kallsyms_iter_list *iter_a, *iter_b;
+
+   iter_a = list_entry(a, struct kallsyms_iter_list, next);
+   iter_b = list_entry(b, struct kallsyms_iter_list, next);
+
+   return strcmp(iter_a->iter.name, iter_b->iter.name);
+}
+
+int get_all_symbol_name(void *data, const char *name, struct module *mod,
+   unsigned long addr)
+{
+   unsigned long sym_pos;
+   struct kallsyms_iter_list *node, *last;
+   struct list_head *head = (struct list_head *)data;
+
+   node = kmalloc(sizeof(*node), GFP_KERNEL);
+   if (!node)
+   return -ENOMEM;
+
+   if (list_empty(head)) {
+   sym_pos = 0;
+   memset(node, 0, sizeof(*node));
+   reset_iter(&node->iter, 0);
+   node->iter.show_value = kallsyms_show_value();
+   } else {
+   last = list_first_entry(head, struct kallsyms_iter_list, next);
+   memcpy(node, last, sizeof(*node));
+   sym_pos = last->iter.pos;
+   }
+
+   INIT_LIST_HEAD(&node->next);
+   list_add(&node->next, head);
+
+   /*
+* update_iter returns false when at end of file
+* which in this case we don't care about and can
+* safely ignore. update_iter() will increment
+* the value of iter->pos, for ksymbol_core.
+*/
+   if (sym_pos >= kallsyms_num_syms)
+   sym_pos++;
+
+   (void)update_iter(&node->iter, sym_pos);
+
+   return 0;
+}
+
+static int kallsyms_sorted_open(struct inode *inode, struct file *file)
+{
+   int ret;
+   struct list_head *list;
+
+   list = __seq_open_private(file, &kallsyms_sorted_op, sizeof(*list));
+   if (!list)
+   return -ENOMEM;
+
+   INIT_LIST_HEAD(list);
+
+   ret = kallsyms_on_each_symbol(get_all_symbol_name, list);
+   if (ret != 0)
+   return ret;
+
+   list_sort(NULL, list, kallsyms_list_cmp);
+
+   return 0;
+}
+
 static int kallsyms_open(struct inode *inode, struct file *file)
 {
/*
@@ -704,9 +825,24 @@ static const struct proc_ops kallsyms_proc_ops = {
.proc_release   = seq_release_private,
 };
 
+static const struct proc_ops kallsyms_sorted_proc_ops = {
+   .proc_open = kallsyms_sorted_open,
+   .proc_read = seq_read,
+   .proc_lseek = seq_lseek,
+   .proc_release = seq_release_private,
+};
+
 static int __init kallsyms_init(void)
 {
-   proc_create("kallsyms", 0444, NULL, &kallsyms_proc_ops);
+   /*
+* When fine grained kaslr is enabled, we need to
+* print out the symbols sorted by name rather than
+* by address, because address order reveals the randomization order.
+*/
+   if (!IS_ENABLED(CONFIG_FG_KASLR))
+   proc_create("kallsyms", 0444, NULL, &kallsyms_proc_ops);
+   else
+   proc_create("kallsyms", 0444, NULL, &kallsyms_sorted_proc_ops);
return 0;
 }
 device_initcall(kallsyms_init);
-- 
2.20.1



[PATCH v2 0/9] Function Granular KASLR

2020-05-21 Thread Kristen Carlson Accardi
LR using the nokaslr command line option also disables
fgkaslr. It is also possible to disable fgkaslr separately by booting with
fgkaslr=off on the commandline.

References
--
There are a lot of academic papers which explore finer grained ASLR.
This paper in particular contributed the most to my implementation design
as well as my overall understanding of the problem space:

Selfrando: Securing the Tor Browser against De-anonymization Exploits,
M. Conti, S. Crane, T. Frassetto, et al.

For more information on how function layout impacts performance, see:

Optimizing Function Placement for Large-Scale Data-Center Applications,
G. Ottoni, B. Maher

Kees Cook (1):
  x86/boot: Allow a "silent" kaslr random byte fetch

Kristen Carlson Accardi (8):
  objtool: Do not assume order of parent/child functions
  x86: tools/relocs: Support >64K section headers
  x86: Makefile: Add build and config option for CONFIG_FG_KASLR
  x86: Make sure _etext includes function sections
  x86/tools: Add relative relocs for randomized functions
  x86: Add support for function granular KASLR
  kallsyms: Hide layout
  module: Reorder functions

 Documentation/security/fgkaslr.rst   | 155 +
 Documentation/security/index.rst |   1 +
 Makefile |   4 +
 arch/x86/Kconfig |  14 +
 arch/x86/Makefile|   3 +
 arch/x86/boot/compressed/Makefile|  10 +-
 arch/x86/boot/compressed/fgkaslr.c   | 823 +++
 arch/x86/boot/compressed/kaslr.c |   4 -
 arch/x86/boot/compressed/misc.c  | 109 ++-
 arch/x86/boot/compressed/misc.h  |  34 +
 arch/x86/boot/compressed/utils.c |  12 +
 arch/x86/boot/compressed/vmlinux.symbols |  17 +
 arch/x86/include/asm/boot.h  |  15 +-
 arch/x86/kernel/vmlinux.lds.S|  18 +-
 arch/x86/lib/kaslr.c |  18 +-
 arch/x86/tools/relocs.c  | 143 +++-
 arch/x86/tools/relocs.h  |   4 +-
 arch/x86/tools/relocs_common.c   |  15 +-
 include/asm-generic/vmlinux.lds.h|   2 +-
 include/uapi/linux/elf.h |   1 +
 init/Kconfig |  11 +
 kernel/kallsyms.c| 138 +++-
 kernel/module.c  |  81 +++
 tools/objtool/elf.c  |   8 +-
 24 files changed, 1578 insertions(+), 62 deletions(-)
 create mode 100644 Documentation/security/fgkaslr.rst
 create mode 100644 arch/x86/boot/compressed/fgkaslr.c
 create mode 100644 arch/x86/boot/compressed/utils.c
 create mode 100644 arch/x86/boot/compressed/vmlinux.symbols


base-commit: b9bbe6ed63b2b9f2c9ee5cbd0f2c946a2723f4ce
-- 
2.20.1



[PATCH v2 1/9] objtool: Do not assume order of parent/child functions

2020-05-21 Thread Kristen Carlson Accardi
If a .cold function is examined prior to its parent, the link
to the parent/child function can be overwritten when the parent
is examined. Only update pfunc and cfunc if they were previously
nil to prevent this from happening.
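
A minimal sketch of the ordering problem, using a reduced and
hypothetical struct sketch_symbol rather than objtool's real symbol
type: if the .cold child has already linked itself to its parent,
defaulting the links only when they are still NULL preserves that work
no matter which symbol is visited first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced version of objtool's symbol record. */
struct sketch_symbol {
	struct sketch_symbol *pfunc;	/* parent function */
	struct sketch_symbol *cfunc;	/* child (.cold) function */
};

/* Default both links to the symbol itself, but keep links set earlier. */
static void sketch_init_links(struct sketch_symbol *sym)
{
	assert(sym != NULL);

	if (sym->pfunc == NULL)
		sym->pfunc = sym;

	if (sym->cfunc == NULL)
		sym->cfunc = sym;
}
```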

Signed-off-by: Kristen Carlson Accardi 
Acked-by: Josh Poimboeuf 
---
 tools/objtool/elf.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/objtool/elf.c b/tools/objtool/elf.c
index c4857fa3f1d1..b998c853a1f0 100644
--- a/tools/objtool/elf.c
+++ b/tools/objtool/elf.c
@@ -408,7 +408,13 @@ static int read_symbols(struct elf *elf)
size_t pnamelen;
if (sym->type != STT_FUNC)
continue;
-   sym->pfunc = sym->cfunc = sym;
+
+   if (sym->pfunc == NULL)
+   sym->pfunc = sym;
+
+   if (sym->cfunc == NULL)
+   sym->cfunc = sym;
+
coldstr = strstr(sym->name, ".cold");
if (!coldstr)
continue;
-- 
2.20.1



[PATCH v2 3/9] x86/boot: Allow a "silent" kaslr random byte fetch

2020-05-21 Thread Kristen Carlson Accardi
From: Kees Cook 

Under earlyprintk, each RNG call produces a debug report line. When
shuffling hundreds of functions, this is not useful information (each
line is identical and tells us nothing new). Instead, allow for a NULL
"purpose" to suppress the debug reporting.

Signed-off-by: Kees Cook 
Signed-off-by: Kristen Carlson Accardi 
---
 arch/x86/lib/kaslr.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/x86/lib/kaslr.c b/arch/x86/lib/kaslr.c
index a53665116458..2b3eb8c948a3 100644
--- a/arch/x86/lib/kaslr.c
+++ b/arch/x86/lib/kaslr.c
@@ -56,11 +56,14 @@ unsigned long kaslr_get_random_long(const char *purpose)
unsigned long raw, random = get_boot_seed();
bool use_i8254 = true;
 
-   debug_putstr(purpose);
-   debug_putstr(" KASLR using");
+   if (purpose) {
+   debug_putstr(purpose);
+   debug_putstr(" KASLR using");
+   }
 
if (has_cpuflag(X86_FEATURE_RDRAND)) {
-   debug_putstr(" RDRAND");
+   if (purpose)
+   debug_putstr(" RDRAND");
 if (rdrand_long(&raw)) {
random ^= raw;
use_i8254 = false;
@@ -68,7 +71,8 @@ unsigned long kaslr_get_random_long(const char *purpose)
}
 
if (has_cpuflag(X86_FEATURE_TSC)) {
-   debug_putstr(" RDTSC");
+   if (purpose)
+   debug_putstr(" RDTSC");
raw = rdtsc();
 
random ^= raw;
@@ -76,7 +80,8 @@ unsigned long kaslr_get_random_long(const char *purpose)
}
 
if (use_i8254) {
-   debug_putstr(" i8254");
+   if (purpose)
+   debug_putstr(" i8254");
random ^= i8254();
}
 
@@ -86,7 +91,8 @@ unsigned long kaslr_get_random_long(const char *purpose)
: "a" (random), "rm" (mix_const));
random += raw;
 
-   debug_putstr("...\n");
+   if (purpose)
+   debug_putstr("...\n");
 
return random;
 }
-- 
2.20.1



[PATCH v2 2/9] x86: tools/relocs: Support >64K section headers

2020-05-21 Thread Kristen Carlson Accardi
While the relocs tool already supports finding the total number
of section headers if vmlinux exceeds 64K sections, it fails to
read the extended symbol table to get section header indexes for symbols,
causing incorrect symbol table indexes to be used when there are > 64K
symbols.

Parse the elf file to read the extended symbol table info, and then
replace all direct references to st_shndx with calls to sym_index(),
which will determine whether the value can be read directly or
whether the value should be pulled out of the extended table.
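
The lookup described above can be sketched in isolation as follows.
This is a simplified illustration, not the code in the patch: struct
sketch_elf_sym and sketch_sym_index() are hypothetical, the symbol
entry is reduced to its section index field, and byte-order conversion
is left out. The escape value 0xffff is SHN_XINDEX from the ELF
specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Escape value defined by the ELF specification (SHN_XINDEX). */
#define SKETCH_SHN_XINDEX 0xffff

/* Hypothetical, reduced symbol entry: only the section index field. */
struct sketch_elf_sym {
	uint16_t st_shndx;
};

/*
 * Return the section index for sym. symtab is the start of the symbol
 * table containing sym, and xsymtab is the parallel SHT_SYMTAB_SHNDX
 * array with one 32-bit entry per symbol.
 */
static uint32_t sketch_sym_index(const struct sketch_elf_sym *symtab,
				 const uint32_t *xsymtab,
				 const struct sketch_elf_sym *sym)
{
	size_t index;

	assert(sym >= symtab);

	/* small indexes are stored directly in the symbol */
	if (sym->st_shndx != SKETCH_SHN_XINDEX)
		return sym->st_shndx;

	/* position of sym within the table selects the extended entry */
	index = (size_t)(sym - symtab);
	return xsymtab[index];
}
```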

Signed-off-by: Kristen Carlson Accardi 
Reviewed-by: Kees Cook 
Reviewed-by: Tony Luck 
Tested-by: Tony Luck 
---
 arch/x86/tools/relocs.c | 104 ++--
 1 file changed, 78 insertions(+), 26 deletions(-)

diff --git a/arch/x86/tools/relocs.c b/arch/x86/tools/relocs.c
index ce7188cbdae5..a00dc133f109 100644
--- a/arch/x86/tools/relocs.c
+++ b/arch/x86/tools/relocs.c
@@ -14,6 +14,10 @@
 static Elf_Ehdr ehdr;
 static unsigned long shnum;
 static unsigned int shstrndx;
+static unsigned int shsymtabndx;
+static unsigned int shxsymtabndx;
+
+static int sym_index(Elf_Sym *sym);
 
 struct relocs {
 uint32_t *offset;
@@ -32,6 +36,7 @@ struct section {
Elf_Shdr   shdr;
struct section *link;
 Elf_Sym *symtab;
+   Elf32_Word *xsymtab;
 Elf_Rel *reltab;
 char *strtab;
 };
@@ -265,7 +270,7 @@ static const char *sym_name(const char *sym_strtab, Elf_Sym *sym)
name = sym_strtab + sym->st_name;
}
else {
-   name = sec_name(sym->st_shndx);
+   name = sec_name(sym_index(sym));
}
return name;
 }
@@ -335,6 +340,23 @@ static uint64_t elf64_to_cpu(uint64_t val)
 #define elf_xword_to_cpu(x) elf32_to_cpu(x)
 #endif
 
+static int sym_index(Elf_Sym *sym)
+{
+   Elf_Sym *symtab = secs[shsymtabndx].symtab;
+   Elf32_Word *xsymtab = secs[shxsymtabndx].xsymtab;
+   unsigned long offset;
+   int index;
+
+   if (sym->st_shndx != SHN_XINDEX)
+   return sym->st_shndx;
+
+   /* calculate offset of sym from head of table. */
+   offset = (unsigned long) sym - (unsigned long) symtab;
+   index = offset/sizeof(*sym);
+
+   return elf32_to_cpu(xsymtab[index]);
+}
+
 static void read_ehdr(FILE *fp)
 {
 if (fread(&ehdr, sizeof(ehdr), 1, fp) != 1) {
@@ -468,31 +490,60 @@ static void read_strtabs(FILE *fp)
 static void read_symtabs(FILE *fp)
 {
int i,j;
+
for (i = 0; i < shnum; i++) {
 struct section *sec = &secs[i];
-   if (sec->shdr.sh_type != SHT_SYMTAB) {
+   int num_syms;
+
+   switch (sec->shdr.sh_type) {
+   case SHT_SYMTAB_SHNDX:
+   sec->xsymtab = malloc(sec->shdr.sh_size);
+   if (!sec->xsymtab) {
+   die("malloc of %d bytes for xsymtab failed\n",
+   sec->shdr.sh_size);
+   }
+   if (fseek(fp, sec->shdr.sh_offset, SEEK_SET) < 0) {
+   die("Seek to %d failed: %s\n",
+   sec->shdr.sh_offset, strerror(errno));
+   }
+   if (fread(sec->xsymtab, 1, sec->shdr.sh_size, fp)
+   != sec->shdr.sh_size) {
+   die("Cannot read extended symbol table: %s\n",
+   strerror(errno));
+   }
+   shxsymtabndx = i;
+   continue;
+
+   case SHT_SYMTAB:
+   num_syms = sec->shdr.sh_size/sizeof(Elf_Sym);
+
+   sec->symtab = malloc(sec->shdr.sh_size);
+   if (!sec->symtab) {
+   die("malloc of %d bytes for symtab failed\n",
+   sec->shdr.sh_size);
+   }
+   if (fseek(fp, sec->shdr.sh_offset, SEEK_SET) < 0) {
+   die("Seek to %d failed: %s\n",
+   sec->shdr.sh_offset, strerror(errno));
+   }
+   if (fread(sec->symtab, 1, sec->shdr.sh_size, fp)
+   != sec->shdr.sh_size) {
+   die("Cannot read symbol table: %s\n",
+   strerror(errno));
+   }
+   for (j = 0; j < num_syms; j++) {
+   Elf_Sym *sym = &sec->symtab[j];
+
+   sym->st_name  = elf_word_to_cpu(sym->st_name)

[PATCH] x86: entry: flush the cache if syscall error

2018-10-11 Thread Kristen Carlson Accardi
This patch aims to make it harder to perform cache timing attacks on data
left behind by system calls. If we have an error returned from a syscall,
flush the L1 cache.

It's important to note that this patch is not addressing any specific
exploit, nor is it intended to be a complete defense against anything.
It is intended to be a low cost way of eliminating some of the side effects
of a failed system call.

A performance test using sysbench on one hyperthread and a script which
attempts to repeatedly access files it does not have permission to access
on the other hyperthread found no significant performance impact.
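
To make the decision in l1_cache_flush() below easier to follow, here
is a userspace sketch of just the predicate. The helper name
sketch_should_flush() is hypothetical. It mirrors the condition in the
patch as posted: success (0) and a short list of common errors skip the
flush, and every other value, which includes positive return values as
well as the remaining error codes, leads to one.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Decide whether a syscall return value would trigger a flush.
 * The skipped values are copied from the condition in the patch.
 */
static bool sketch_should_flush(long ret)
{
	if (ret == 0 || ret == -EAGAIN ||
	    ret == -EEXIST || ret == -ENOENT ||
	    ret == -EXDEV || ret == -ETIMEDOUT ||
	    ret == -ENOTCONN || ret == -EINPROGRESS)
		return false;

	return true;
}
```

For example, a failed permission check (-EACCES) would flush, while a
lookup of a missing file (-ENOENT) would not.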

Suggested-by: Alan Cox 
Signed-off-by: Kristen Carlson Accardi 
---
 arch/x86/Kconfig|  9 +
 arch/x86/entry/common.c | 18 ++
 2 files changed, 27 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1a0be022f91d..bde978eb3b4e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -445,6 +445,15 @@ config RETPOLINE
  code are eliminated. Since this includes the syscall entry path,
  it is not entirely pointless.
 
+config SYSCALL_FLUSH
+   bool "Clear L1 Cache on syscall errors"
+   default n
+   help
+ Selecting 'y' allows the L1 cache to be cleared upon return of
+ an error code from a syscall if the CPU supports "flush_l1d".
+ This may reduce the likelihood of speculative execution style
+ attacks on syscalls.
+
 config INTEL_RDT
bool "Intel Resource Director Technology support"
default n
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 3b2490b81918..26de8ea71293 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -268,6 +268,20 @@ __visible inline void syscall_return_slowpath(struct pt_regs *regs)
prepare_exit_to_usermode(regs);
 }
 
+__visible inline void l1_cache_flush(struct pt_regs *regs)
+{
+   if (IS_ENABLED(CONFIG_SYSCALL_FLUSH) &&
+   static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
+   if (regs->ax == 0 || regs->ax == -EAGAIN ||
+   regs->ax == -EEXIST || regs->ax == -ENOENT ||
+   regs->ax == -EXDEV || regs->ax == -ETIMEDOUT ||
+   regs->ax == -ENOTCONN || regs->ax == -EINPROGRESS)
+   return;
+
+   wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
+   }
+}
+
 #ifdef CONFIG_X86_64
 __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 {
@@ -290,6 +304,8 @@ __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
regs->ax = sys_call_table[nr](regs);
}
 
+   l1_cache_flush(regs);
+
syscall_return_slowpath(regs);
 }
 #endif
@@ -338,6 +354,8 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
 #endif /* CONFIG_IA32_EMULATION */
}
 
+   l1_cache_flush(regs);
+
syscall_return_slowpath(regs);
 }
 
-- 
2.14.4



Re: [PATCH] cpufreq, intel_pstate, Fix intel_pstate powersave min_perf_pct value

2015-10-14 Thread Kristen Carlson Accardi
On Wed, 14 Oct 2015 07:41:59 -0400
Prarit Bhargava  wrote:

> On systems that initialize the intel_pstate driver with the performance
> governor, and then switch to the powersave governor will not transition to
> lower cpu frequencies until /sys/devices/system/cpu/intel_pstate/min_perf_pct
> is set to a low value.
> 
> The behavior of governor switching changed after commit a04759924e25
> ("[cpufreq] intel_pstate: honor user space min_perf_pct override on
>  resume").  The commit introduced tracking of performance percentage
> changes via sysfs in order to restore userspace changes during
> suspend/resume.  The problem occurs because the global values of the newly
> introduced max_sysfs_pct and min_sysfs_pct are not lowered on the governor
> change and this causes the powersave governor to inherit the performance
> governor's settings.
> 
> A simple change would have been to reset max_sysfs_pct to 100 and
> min_sysfs_pct to 0 on a governor change, which fixes the problem with
> governor switching.  However, since we cannot break userspace[1] the fix
> is now to give each governor its own limits storage area so that governor
> specific changes are tracked.
> 
> I successfully tested this by booting with both the performance governor
> and the powersave governor by default, and switching between the two
> governors (while monitoring /sys/devices/system/cpu/intel_pstate/ values,
> and looking at the output of cpupower frequency-info).  Suspend/Resume
> testing was performed by Doug Smythies.
> 
> [1] Systems which suspend/resume using the unmaintained pm-utils package
> will always transition to the performance governor before the suspend and
> after the resume.  This means a system using the powersave governor will
> go from powersave to performance, then suspend/resume, performance to
> powersave.  The simple change during governor changes would have been
> overwritten when the governor changed before and after the suspend/resume.
> I have submitted https://bugzilla.redhat.com/show_bug.cgi?id=1271225
> against Fedora to remove the 94cpufreq file that causes the problem.  It
> should be noted that pm-utils is obsoleted with newer versions of systemd.
> 
> Cc: Kristen Carlson Accardi 
> Cc: "Rafael J. Wysocki" 
> Cc: Viresh Kumar 
> Cc: linux...@vger.kernel.org
> Cc: Doug Smythies 
> Signed-off-by: Prarit Bhargava 

Acked-by: Kristen Carlson Accardi 

BTW - I think I can see an issue here with HWP enabled systems.  It
looks to me like the hwp settings will not be programmed correctly
during a governor switch.  This probably needs to be addressed in a
separate patch.

> ---
>  drivers/cpufreq/intel_pstate.c |  120 
> +---
>  1 file changed, 75 insertions(+), 45 deletions(-)
> 
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index 3af9dd7..78b4be5 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -156,7 +156,20 @@ struct perf_limits {
>   int min_sysfs_pct;
>  };
>  
> -static struct perf_limits limits = {
> +static struct perf_limits performance_limits = {
> + .no_turbo = 0,
> + .turbo_disabled = 0,
> + .max_perf_pct = 100,
> + .max_perf = int_tofp(1),
> + .min_perf_pct = 100,
> + .min_perf = int_tofp(1),
> + .max_policy_pct = 100,
> + .max_sysfs_pct = 100,
> + .min_policy_pct = 0,
> + .min_sysfs_pct = 0,
> +};
> +
> +static struct perf_limits powersave_limits = {
>   .no_turbo = 0,
>   .turbo_disabled = 0,
>   .max_perf_pct = 100,
> @@ -169,6 +182,12 @@ static struct perf_limits limits = {
>   .min_sysfs_pct = 0,
>  };
>  
> +#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE
> +static struct perf_limits *limits = &performance_limits;
> +#else
> +static struct perf_limits *limits = &powersave_limits;
> +#endif
> +
>  static inline void pid_reset(struct _pid *pid, int setpoint, int busy,
>int deadband, int integral) {
>   pid->setpoint = setpoint;
> @@ -255,7 +274,7 @@ static inline void update_turbo_state(void)
>  
>   cpu = all_cpu_data[0];
>   rdmsrl(MSR_IA32_MISC_ENABLE, misc_en);
> - limits.turbo_disabled =
> + limits->turbo_disabled =
>   (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE ||
>cpu->pstate.max_pstate == cpu->pstate.turbo_pstate);
>  }
> @@ -274,14 +293,14 @@ static void intel_pstate_hwp_set(void)
>  
>   for_each_online_cpu(cpu) {
>   rdmsrl_on_cpu(cpu, MSR_HWP_REQUEST, &value);
> - adj_range = limits.min_perf_pct * range / 100;
> + adj_range = limits->min_perf_pct * range / 100;
>   min = hw

[tip:x86/urgent] x86/cpufeatures: Correct spelling of the HWP_NOTIFY flag

2015-09-23 Thread tip-bot for Kristen Carlson Accardi
Commit-ID:  a7adb91b13c104e5ad950fbe1795aa2722f2ea0a
Gitweb: http://git.kernel.org/tip/a7adb91b13c104e5ad950fbe1795aa2722f2ea0a
Author: Kristen Carlson Accardi 
AuthorDate: Tue, 22 Sep 2015 10:51:36 -0700
Committer:  Ingo Molnar 
CommitDate: Wed, 23 Sep 2015 09:57:24 +0200

x86/cpufeatures: Correct spelling of the HWP_NOTIFY flag

Because noitification just isn't right.

Signed-off-by: Kristen Carlson Accardi 
Acked-by: Rafael J. Wysocki 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: linux-kernel@vger.kernel.org
Cc: r...@rjwysocki.net
Link: http://lkml.kernel.org/r/1442944296-11737-1-git-send-email-kris...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/cpufeature.h | 2 +-
 arch/x86/kernel/cpu/scattered.c   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index e6cf2ad..9727b3b 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -193,7 +193,7 @@
 #define X86_FEATURE_HW_PSTATE  ( 7*32+ 8) /* AMD HW-PState */
 #define X86_FEATURE_PROC_FEEDBACK ( 7*32+ 9) /* AMD ProcFeedbackInterface */
 #define X86_FEATURE_HWP ( 7*32+ 10) /* "hwp" Intel HWP */
-#define X86_FEATURE_HWP_NOITFY ( 7*32+ 11) /* Intel HWP_NOTIFY */
+#define X86_FEATURE_HWP_NOTIFY ( 7*32+ 11) /* Intel HWP_NOTIFY */
 #define X86_FEATURE_HWP_ACT_WINDOW ( 7*32+ 12) /* Intel HWP_ACT_WINDOW */
 #define X86_FEATURE_HWP_EPP ( 7*32+13) /* Intel HWP_EPP */
 #define X86_FEATURE_HWP_PKG_REQ ( 7*32+14) /* Intel HWP_PKG_REQ */
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index 3d423a1..608fb26 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -37,7 +37,7 @@ void init_scattered_cpuid_features(struct cpuinfo_x86 *c)
{ X86_FEATURE_PLN,  CR_EAX, 4, 0x0006, 0 },
{ X86_FEATURE_PTS,  CR_EAX, 6, 0x0006, 0 },
{ X86_FEATURE_HWP,  CR_EAX, 7, 0x0006, 0 },
-   { X86_FEATURE_HWP_NOITFY,   CR_EAX, 8, 0x0006, 0 },
+   { X86_FEATURE_HWP_NOTIFY,   CR_EAX, 8, 0x0006, 0 },
{ X86_FEATURE_HWP_ACT_WINDOW,   CR_EAX, 9, 0x0006, 0 },
{ X86_FEATURE_HWP_EPP,  CR_EAX,10, 0x0006, 0 },
{ X86_FEATURE_HWP_PKG_REQ,  CR_EAX,11, 0x0006, 0 },
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[tip:x86/urgent] x86/cpufeatures: Correct spelling of the HWP_NOTIFY flag

2015-09-23 Thread tip-bot for Kristen Carlson Accardi
Commit-ID:  a7adb91b13c104e5ad950fbe1795aa2722f2ea0a
Gitweb: http://git.kernel.org/tip/a7adb91b13c104e5ad950fbe1795aa2722f2ea0a
Author: Kristen Carlson Accardi <kris...@linux.intel.com>
AuthorDate: Tue, 22 Sep 2015 10:51:36 -0700
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Wed, 23 Sep 2015 09:57:24 +0200

x86/cpufeatures: Correct spelling of the HWP_NOTIFY flag

Because noitification just isn't right.

Signed-off-by: Kristen Carlson Accardi <kris...@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: r...@rjwysocki.net
Link: 
http://lkml.kernel.org/r/1442944296-11737-1-git-send-email-kris...@linux.intel.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 arch/x86/include/asm/cpufeature.h | 2 +-
 arch/x86/kernel/cpu/scattered.c   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/cpufeature.h 
b/arch/x86/include/asm/cpufeature.h
index e6cf2ad..9727b3b 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -193,7 +193,7 @@
 #define X86_FEATURE_HW_PSTATE  ( 7*32+ 8) /* AMD HW-PState */
 #define X86_FEATURE_PROC_FEEDBACK ( 7*32+ 9) /* AMD ProcFeedbackInterface */
 #define X86_FEATURE_HWP ( 7*32+ 10) /* "hwp" Intel HWP */
-#define X86_FEATURE_HWP_NOITFY ( 7*32+ 11) /* Intel HWP_NOTIFY */
+#define X86_FEATURE_HWP_NOTIFY ( 7*32+ 11) /* Intel HWP_NOTIFY */
 #define X86_FEATURE_HWP_ACT_WINDOW ( 7*32+ 12) /* Intel HWP_ACT_WINDOW */
 #define X86_FEATURE_HWP_EPP ( 7*32+13) /* Intel HWP_EPP */
 #define X86_FEATURE_HWP_PKG_REQ ( 7*32+14) /* Intel HWP_PKG_REQ */
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index 3d423a1..608fb26 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -37,7 +37,7 @@ void init_scattered_cpuid_features(struct cpuinfo_x86 *c)
    { X86_FEATURE_PLN,  CR_EAX, 4, 0x0006, 0 },
    { X86_FEATURE_PTS,  CR_EAX, 6, 0x0006, 0 },
    { X86_FEATURE_HWP,  CR_EAX, 7, 0x0006, 0 },
-   { X86_FEATURE_HWP_NOITFY,   CR_EAX, 8, 0x0006, 0 },
+   { X86_FEATURE_HWP_NOTIFY,   CR_EAX, 8, 0x0006, 0 },
    { X86_FEATURE_HWP_ACT_WINDOW,   CR_EAX, 9, 0x0006, 0 },
    { X86_FEATURE_HWP_EPP,  CR_EAX,10, 0x0006, 0 },
    { X86_FEATURE_HWP_PKG_REQ,  CR_EAX,11, 0x0006, 0 },


Re: [PATCH] [v2] intel_pstate: Fix user input of min/max to legal policy region

2015-09-09 Thread Kristen Carlson Accardi
On Wed,  9 Sep 2015 18:27:31 +0800
Chen Yu <yu.c.c...@intel.com> wrote:

> In current code, max_perf_pct might be smaller than min_perf_pct
> by improper user input:
> 
> $ grep . /sys/devices/system/cpu/intel_pstate/m*_perf_pct
> /sys/devices/system/cpu/intel_pstate/max_perf_pct:100
> /sys/devices/system/cpu/intel_pstate/min_perf_pct:100
> 
> $ echo 80 > /sys/devices/system/cpu/intel_pstate/max_perf_pct
> 
> $ grep . /sys/devices/system/cpu/intel_pstate/m*_perf_pct
> /sys/devices/system/cpu/intel_pstate/max_perf_pct:80
> /sys/devices/system/cpu/intel_pstate/min_perf_pct:100
> 
> Fix this problem in two steps:
> 1. Normalize the user input to [min_policy, max_policy].
> 2. Make sure max_perf_pct >= min_perf_pct, as suggested by Seiichi Ikarashi.
> 
> Signed-off-by: Chen Yu <yu.c.c...@intel.com>

Acked-by: Kristen Carlson Accardi <kris...@linux.intel.com>
> ---
> v2:
>  - Add logic to ensure max_perf_pct>=min_perf_pct.
> ---
>  drivers/cpufreq/intel_pstate.c | 17 ++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index fcb929e..a0b935f 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -423,6 +423,8 @@ static ssize_t store_max_perf_pct(struct kobject *a, struct attribute *b,
>  
>   limits.max_sysfs_pct = clamp_t(int, input, 0 , 100);
>   limits.max_perf_pct = min(limits.max_policy_pct, limits.max_sysfs_pct);
> + limits.max_perf_pct = max(limits.min_policy_pct, limits.max_perf_pct);
> + limits.max_perf_pct = max(limits.min_perf_pct, limits.max_perf_pct);
>   limits.max_perf = div_fp(int_tofp(limits.max_perf_pct), int_tofp(100));
>  
>   if (hwp_active)
> @@ -442,6 +444,8 @@ static ssize_t store_min_perf_pct(struct kobject *a, struct attribute *b,
>  
>   limits.min_sysfs_pct = clamp_t(int, input, 0 , 100);
>   limits.min_perf_pct = max(limits.min_policy_pct, limits.min_sysfs_pct);
> + limits.min_perf_pct = min(limits.max_policy_pct, limits.min_perf_pct);
> + limits.min_perf_pct = min(limits.max_perf_pct, limits.min_perf_pct);
>   limits.min_perf = div_fp(int_tofp(limits.min_perf_pct), int_tofp(100));
>  
>   if (hwp_active)
> @@ -985,12 +989,19 @@ static int intel_pstate_set_policy(struct cpufreq_policy *policy)
>  
>   limits.min_policy_pct = (policy->min * 100) / policy->cpuinfo.max_freq;
>   limits.min_policy_pct = clamp_t(int, limits.min_policy_pct, 0 , 100);
> - limits.min_perf_pct = max(limits.min_policy_pct, limits.min_sysfs_pct);
> - limits.min_perf = div_fp(int_tofp(limits.min_perf_pct), int_tofp(100));
> -
>   limits.max_policy_pct = (policy->max * 100) / policy->cpuinfo.max_freq;
>   limits.max_policy_pct = clamp_t(int, limits.max_policy_pct, 0 , 100);
> +
> + /* Normalize user input to [min_policy_pct, max_policy_pct] */
> + limits.min_perf_pct = max(limits.min_policy_pct, limits.min_sysfs_pct);
> + limits.min_perf_pct = min(limits.max_policy_pct, limits.min_perf_pct);
>   limits.max_perf_pct = min(limits.max_policy_pct, limits.max_sysfs_pct);
> + limits.max_perf_pct = max(limits.min_policy_pct, limits.max_perf_pct);
> +
> + /* Make sure min_perf_pct <= max_perf_pct */
> + limits.min_perf_pct = min(limits.max_perf_pct, limits.min_perf_pct);
> +
> + limits.min_perf = div_fp(int_tofp(limits.min_perf_pct), int_tofp(100));
>   limits.max_perf = div_fp(int_tofp(limits.max_perf_pct), int_tofp(100));
>  
>   if (hwp_active)

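As a rough illustration of the two steps named in the quoted changelog, here is a minimal user-space sketch. It is not the driver code: struct perf_limits, min_int(), max_int() and normalize_limits() are hypothetical names, and only the field names and the order of operations are taken from the intel_pstate_set_policy() hunk above.

```c
#include <assert.h>

/* Hypothetical stand-in for the driver's limits; field names follow the patch. */
struct perf_limits {
	int min_policy_pct, max_policy_pct; /* range allowed by the cpufreq policy */
	int min_sysfs_pct, max_sysfs_pct;   /* raw values written through sysfs */
	int min_perf_pct, max_perf_pct;     /* effective limits after normalization */
};

static int min_int(int a, int b) { return a < b ? a : b; }
static int max_int(int a, int b) { return a > b ? a : b; }

/* Same order of operations as the intel_pstate_set_policy() hunk. */
static void normalize_limits(struct perf_limits *l)
{
	assert(l->min_policy_pct <= l->max_policy_pct);

	/* Step 1: clamp both user values into [min_policy_pct, max_policy_pct]. */
	l->min_perf_pct = max_int(l->min_policy_pct, l->min_sysfs_pct);
	l->min_perf_pct = min_int(l->max_policy_pct, l->min_perf_pct);
	l->max_perf_pct = min_int(l->max_policy_pct, l->max_sysfs_pct);
	l->max_perf_pct = max_int(l->min_policy_pct, l->max_perf_pct);

	/* Step 2: make sure min_perf_pct <= max_perf_pct. */
	l->min_perf_pct = min_int(l->max_perf_pct, l->min_perf_pct);
}
```

With the values from the report (min_sysfs_pct of 100, max_sysfs_pct of 80, policy range 0 to 100), this sketch ends with both effective limits at 80 rather than leaving max below min.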

Re: [PATCH] intel_pstate: append more Oracle OEM table id to vendor bypass list

2015-08-05 Thread Kristen Carlson Accardi
On Wed,  5 Aug 2015 09:28:50 +0900
Ethan Zhao <ethan.z...@oracle.com> wrote:

> Append more Oracle X86 servers that have their own power management,
> 
> SUN FIRE X4275 M3
> SUN FIRE X4170 M3
> and
> SUN FIRE X6-2
> 
> Signed-off-by: Ethan Zhao <ethan.z...@oracle.com>
> ---
>  drivers/cpufreq/intel_pstate.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index c45d274..c57b011 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -1156,6 +1156,10 @@ static struct hw_vendor_info vendor_info[] = {
>   {1, "ORACLE", "X4270M3 ", PPC},
>   {1, "ORACLE", "X4270M2 ", PPC},
>   {1, "ORACLE", "X4170M2 ", PPC},
> + {1, "ORACLE", "X4170 M3", PPC},
> + {1, "ORACLE", "X4275 M3", PPC},
> + {1, "ORACLE", "X6-2", PPC},
> + {1, "ORACLE", "Sudbury ", PPC},
>   {0, "", ""},
>  };
>  

Acked-by: Kristen Carlson Accardi <kris...@linux.intel.com>
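
For readers unfamiliar with this kind of table, the following is a simplified sketch of how a sentinel-terminated vendor list might be scanned. It is an assumption-laden illustration, not the driver's lookup: struct vendor_entry, bypass_list, on_bypass_list() and the two field-width constants are hypothetical, and the real code also consults the power-management table type (the PPC column), which is left out here.

```c
#include <assert.h>
#include <string.h>

#define OEM_ID_LEN       6 /* assumption: width of the ACPI OEM ID field */
#define OEM_TABLE_ID_LEN 8 /* assumption: width of the ACPI OEM table ID field */

/* Hypothetical, simplified counterpart of struct hw_vendor_info. */
struct vendor_entry {
	int valid;                /* 0 marks the end of the list */
	const char *oem_id;
	const char *oem_table_id;
};

/* Two of the entries added by the patch, plus the terminator. */
static const struct vendor_entry bypass_list[] = {
	{1, "ORACLE", "X4170 M3"},
	{1, "ORACLE", "X4275 M3"},
	{0, "", ""},
};

/* Returns 1 when both IDs match an entry, 0 otherwise. */
static int on_bypass_list(const char *oem_id, const char *oem_table_id)
{
	const struct vendor_entry *v;

	assert(oem_id && oem_table_id);
	for (v = bypass_list; v->valid; v++) {
		if (!strncmp(oem_id, v->oem_id, OEM_ID_LEN) &&
		    !strncmp(oem_table_id, v->oem_table_id, OEM_TABLE_ID_LEN))
			return 1;
	}
	return 0;
}
```

The {0, "", ""} terminator is what lets the loop stop without a separate element count, which is why new entries are added above it.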


Re: [PATCH] intel_pstate: Add get_scaling cpu_defaults param to Knights Landing

2015-07-21 Thread Kristen Carlson Accardi
On Tue, 21 Jul 2015 10:41:13 +0200
Lukasz Anaczkowski <lukasz.anaczkow...@intel.com> wrote:

> Scaling for Knights Landing is the same as the default scaling (10).
> When Knights Landing support was added to the pstate driver, this
> parameter was omitted resulting in a kernel panic during boot.
> 
> Reported-by: Yasuaki Ishimatsu <yishi...@redhat.com>
> Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramo...@intel.com>
> Signed-off-by: Lukasz Anaczkowski <lukasz.anaczkow...@intel.com>
> ---
>  drivers/cpufreq/intel_pstate.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index 15ada47..fcb929e 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -681,6 +681,7 @@ static struct cpu_defaults knl_params = {
>   .get_max = core_get_max_pstate,
>   .get_min = core_get_min_pstate,
>   .get_turbo = knl_get_turbo_pstate,
> + .get_scaling = core_get_scaling,
>   .set = core_set_pstate,
>   },
>  };

Acked-by: Kristen Carlson Accardi <kris...@linux.intel.com>
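
The failure mode described in the changelog comes from a C rule worth spelling out: a member left out of a designated initializer is zero-initialized, so an omitted callback is a NULL function pointer. The sketch below is a user-space illustration only. struct pstate_funcs is reduced to two members, and knl_before, knl_after, the stub functions and their return values are hypothetical placeholders.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced, hypothetical version of the driver's callback table. */
struct pstate_funcs {
	int (*get_max)(void);
	int (*get_scaling)(void);
};

static int stub_get_max(void)     { return 13; } /* placeholder value */
static int stub_get_scaling(void) { return 7; }  /* placeholder value */

/* Before the fix: .get_scaling is not listed, so it is NULL. */
static const struct pstate_funcs knl_before = {
	.get_max = stub_get_max,
};

/* After the fix: every callback is filled in. */
static const struct pstate_funcs knl_after = {
	.get_max = stub_get_max,
	.get_scaling = stub_get_scaling,
};

/* Calls get_scaling if present, otherwise returns the fallback. */
static int get_scaling_checked(const struct pstate_funcs *f, int fallback)
{
	assert(f != NULL);
	return f->get_scaling ? f->get_scaling() : fallback;
}
```

Calling through knl_before.get_scaling without such a check would dereference a NULL pointer, which in kernel context is consistent with the boot-time panic the changelog reports.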


Re: [PATCH] cpufreq, Fix overflow in busy_scaled due to long delay [v2]

2015-06-15 Thread Kristen Carlson Accardi
3 which is 16.777 seconds.
> 
> The duration between reads of the APERF and MPERF registers overflowed a s32
> sized integer in intel_pstate_get_scaled_busy()'s call to div_fp().  The result
> is that int_tofp(duration_us) == 0, and the kernel attempts to divide by 0.
> 
> While the kernel shouldn't be delaying for a long time, it can and does
> happen and the intel_pstate driver should not panic in this situation.  This
> patch changes the div_fp() function to use div64_s64() to allow for "long"
> division.  This will avoid the overflow condition on long delays.
> 
> [v2]: use div64_s64() in div_fp()

Were you able to resolve your original concerns with doing this?  I
thought you mentioned that you'd tested it and it gave you some
negative side effects?

Thanks,
Kristen

> 
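> Cc: Kristen Carlson Accardi <kris...@linux.intel.com>
> Cc: "Rafael J. Wysocki" <r...@rjwysocki.net>
> Cc: Viresh Kumar <viresh.ku...@linaro.org>
> Cc: Doug Smythies <dsmyth...@telus.net>
> Cc: linux...@vger.kernel.org
> 
> Signed-off-by: Prarit Bhargava <pra...@redhat.com>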
> ---
>  drivers/cpufreq/intel_pstate.c |   10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index 6414661..b153d86 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -48,9 +48,9 @@ static inline int32_t mul_fp(int32_t x, int32_t y)
>   return ((int64_t)x * (int64_t)y) >> FRAC_BITS;
>  }
>  
> -static inline int32_t div_fp(int32_t x, int32_t y)
> +static inline int32_t div_fp(s64 x, s64 y)
>  {
> - return div_s64((int64_t)x << FRAC_BITS, y);
> + return div64_s64((int64_t)x << FRAC_BITS, y);
>  }
>  
>  static inline int ceiling_fp(int32_t x)
> @@ -794,7 +794,7 @@ static inline void intel_pstate_set_sample_time(struct cpudata *cpu)
>  static inline int32_t intel_pstate_get_scaled_busy(struct cpudata *cpu)
>  {
>   int32_t core_busy, max_pstate, current_pstate, sample_ratio;
> - u32 duration_us;
> + s64 duration_us;
>   u32 sample_time;
>  
>   /*
> @@ -821,8 +821,8 @@ static inline int32_t intel_pstate_get_scaled_busy(struct cpudata *cpu)
>    * to adjust our busyness.
>    */
>   sample_time = pid_params.sample_rate_ms  * USEC_PER_MSEC;
> - duration_us = (u32) ktime_us_delta(cpu->sample.time,
> -cpu->last_sample_time);
> + duration_us = ktime_us_delta(cpu->sample.time,
> +  cpu->last_sample_time);
>   if (duration_us > sample_time * 3) {
>   sample_ratio = div_fp(int_tofp(sample_time),
> int_tofp(duration_us));

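The arithmetic behind the overflow can be checked in user space. The sketch below assumes FRAC_BITS is 8 and that int_tofp() is a left shift by FRAC_BITS, as in the driver of that era; low32() and div_fp64() are hypothetical helpers, not kernel functions. With those assumptions, a delay of 2^24 microseconds (about 16.777 seconds) becomes exactly 2^32 in fixed point, whose low 32 bits are zero.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 8 /* assumption: fixed-point fraction width used by the driver */
#define int_tofp(x) ((int64_t)(x) << FRAC_BITS)

/* What a 32-bit parameter keeps of a 64-bit argument: the low 32 bits. */
static uint32_t low32(int64_t v)
{
	return (uint32_t)v;
}

/* 64-bit fixed-point division, in the spirit of the div64_s64() based div_fp(). */
static int32_t div_fp64(int64_t x, int64_t y)
{
	assert(y != 0);
	return (int32_t)((x << FRAC_BITS) / y);
}
```

With 64-bit operands the divisor for a 2^24 microsecond delay stays nonzero, so the division yields a small ratio instead of faulting.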

Re: [PATCH] intel_pstate: set BYT MSR with wrmsrl_on_cpu()

2015-05-11 Thread Kristen Carlson Accardi
On Fri, 08 May 2015 15:59:14 +0200
"Rafael J. Wysocki" <r...@rjwysocki.net> wrote:

> On Thursday, May 07, 2015 04:22:32 PM Joe Konno wrote:
> > On Thu, May 07, 2015 at 10:58:11PM +0200, Rafael J. Wysocki wrote:
> > > On Thursday, May 07, 2015 09:59:39 AM Joe Konno wrote:
> > > > From: Joe Konno <joe.ko...@intel.com>
> > > > 
> > > > In instances where the default cpufreq governor is Performance, reading
> > > 
> > > I'm not really sure what this is about.  You're talking about cpufreq governors
> > > and this is an intel_pstate patch.  What gives?
> > 
> > I'll reshuffle the paragraph to bring detail to the fix first, and the
> > "when/why" second.
> > 
> > In debug I have only seen the bug during boot when cpufreq calls
> > intel_pstate's init for each logical core-- often from one, sometimes
> > two logical cores.
> > 
> > The bug may occur after init as well, but not enough data to conclude
> > one way or the other. I personally have not seen it happen after init in
> > my local testing.
> > 
> > > 
> > > > from MSR 0x199 on an applicable multi-core Atom system saw boot-to-boot
> > > > variability in the P-State value set to each logical core.  Sometimes
> > > > only one logical core would be set properly, other times two or three.
> > > > There was an assumption in the code that only a thread on the intended
> > > > logical core would be calling the wrmsrl() function. That was disproven
> > > > during debug, as cpufreq, at init, was not always calling from the same
> > > > logical core as the one it targeted. Thus, use wrmsrl_on_cpu() instead, as
> > > > done in the core_set_pstate() function.
> > > > 
> > > > For: LCK-1822
> > > 
> > > This tag is meaningless upstream.
> > 
> > Mimicked another subsystem's practice. I have no problem removing it.
> > 
> > > 
> > > > Fixes: 007bea098b86 ("intel_pstate: Add setting voltage value for
> > > >baytrail P states.")
> > > 
> > > So, you're fixing a function introduced by the above commit, right?
> > 
> > Correct. That commit introduced the byt_set_pstate() function with the
> > wrmsrl() call.
> > 
> > > 
> > > > Signed-off-by: Joe Konno <joe.ko...@intel.com>
> > > > ---
> > > >  drivers/cpufreq/intel_pstate.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> > > > index 6414661ac1c4..c45d274a75c8 100644
> > > > --- a/drivers/cpufreq/intel_pstate.c
> > > > +++ b/drivers/cpufreq/intel_pstate.c
> > > > @@ -535,7 +535,7 @@ static void byt_set_pstate(struct cpudata *cpudata, int pstate)
> > > >  
> > > > val |= vid;
> > > >  
> > > > -   wrmsrl(MSR_IA32_PERF_CTL, val);
> > > > +   wrmsrl_on_cpu(cpudata->cpu, MSR_IA32_PERF_CTL, val);
> > > 
> > > So the bug is that this may run on a CPU which is not cpudata->cpu in which
> > > case the write will not happen where it should.  Is that correct?
> > 
> > Yes-- I believe my first inline comment spoke to this.
> 
> So here's the changelog I'd use with this patch:
> 
> "Commit 007bea098b86 (intel_pstate: Add setting voltage value for baytrail
>  P states.) introduced byt_set_pstate() with the assumption that it would
>  always be run by the CPU whose MSR is to be written by it.  It turns out,
>  however, that is not always the case in practice, so modify byt_set_pstate()
>  to enforce the MSR write done by it to always happen on the right CPU."
> 
> I don't think you need to say anything more in it.  Mentioning governors in
> particular is unnecessary and confusing.
> 
> Kristen, what do you think?
> 
> 

Looks good to me with the modified changelog.

Acked-by: Kristen Carlson Accardi <kris...@linux.intel.com>
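
The difference between the two calls can be pictured with a toy user-space model. Everything here is hypothetical: perf_ctl[] stands in for one register per logical CPU, running_cpu for whichever CPU the caller happens to execute on, and toy_wrmsrl() and toy_wrmsrl_on_cpu() only mimic the addressing behaviour discussed in the thread, not the real kernel helpers.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NR_CPUS 4

static uint64_t perf_ctl[TOY_NR_CPUS]; /* toy model: one register per logical CPU */
static int running_cpu;                /* CPU the caller happens to run on */

/* Models a plain MSR write: it lands on whatever CPU executes it. */
static void toy_wrmsrl(uint64_t val)
{
	perf_ctl[running_cpu] = val;
}

/* Models a targeted MSR write: it lands on the CPU named by the caller. */
static void toy_wrmsrl_on_cpu(int cpu, uint64_t val)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	perf_ctl[cpu] = val;
}
```

If code running on CPU 0 means to program CPU 2, the plain write updates entry 0 and leaves entry 2 untouched, which matches the boot-to-boot variability described in the original report.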


Re: [PATCH] cpufreq/intel_pstate: Fix an annoying !CONFIG_SMP warning

2015-04-13 Thread Kristen Carlson Accardi
On Sat, 11 Apr 2015 02:22:03 +0200
"Rafael J. Wysocki" <r...@rjwysocki.net> wrote:

> On Friday, April 03, 2015 03:19:53 PM Borislav Petkov wrote:
> > From: Borislav Petkov <b...@suse.de>
> > 
> > I keep seeing
> > 
> >   drivers/cpufreq/intel_pstate.c: In function ‘intel_pstate_init’:
> >   drivers/cpufreq/intel_pstate.c:1187:26: warning: initialization from incompatible pointer type
> >     struct cpuinfo_x86 *c = &boot_cpu_data;
> > 
> > when doing randconfig builds.
> > 
> > This is caused by the fact that when !CONFIG_SMP, asm/processor.h
> > defines cpu_info to boot_cpu_data and the local variable
> > 
> >   struct cpu_defaults *cpu_info
> > 
> > overshadows it leading to this unfortunate assignment in the
> > preprocessed source:
> > 
> >  struct cpu_defaults *boot_cpu_data;
> >  struct cpuinfo_x86 *c = &boot_cpu_data;
> > 
> > Rename the local variable and use static_cpu_has_safe() which alleviates
> > the need for defining a local cpuinfo_x86 pointer.
> 
> Kristen, any comments here?

Seems fine to me.

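Acked-by: Kristen Carlson Accardi <kris...@linux.intel.com>

> 
> > Signed-off-by: Borislav Petkov <b...@suse.de>
> > Cc: Kristen Carlson Accardi <kris...@linux.intel.com>
> > Cc: "Rafael J. Wysocki" <r...@rjwysocki.net>
> > Cc: Viresh Kumar <viresh.ku...@linaro.org>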
> > Cc: linux...@vger.kernel.org
> > ---
> >  drivers/cpufreq/intel_pstate.c | 12 ++--
> >  1 file changed, 6 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> > index 872c5772c5d3..0b883f131a73 100644
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -31,6 +31,7 @@
> >  #include <asm/div64.h>
> >  #include <asm/msr.h>
> >  #include <asm/cpu_device_id.h>
> > +#include <asm/cpufeature.h>
> >  
> >  #define BYT_RATIOS 0x66a
> >  #define BYT_VIDS   0x66b
> > @@ -1183,8 +1184,7 @@ static int __init intel_pstate_init(void)
> >  {
> > int cpu, rc = 0;
> > const struct x86_cpu_id *id;
> > -   struct cpu_defaults *cpu_info;
> > -   struct cpuinfo_x86 *c = &boot_cpu_data;
> > +   struct cpu_defaults *cpu_def;
> >  
> > if (no_load)
> > return -ENODEV;
> > @@ -1200,10 +1200,10 @@ static int __init intel_pstate_init(void)
> > if (intel_pstate_platform_pwr_mgmt_exists())
> > return -ENODEV;
> >  
> > -   cpu_info = (struct cpu_defaults *)id->driver_data;
> > +   cpu_def = (struct cpu_defaults *)id->driver_data;
> >  
> > -   copy_pid_params(&cpu_info->pid_policy);
> > -   copy_cpu_funcs(&cpu_info->funcs);
> > +   copy_pid_params(&cpu_def->pid_policy);
> > +   copy_cpu_funcs(&cpu_def->funcs);
> >  
> > if (intel_pstate_msrs_not_valid())
> > return -ENODEV;
> > @@ -1214,7 +1214,7 @@ static int __init intel_pstate_init(void)
> > if (!all_cpu_data)
> > return -ENOMEM;
> >  
> > -   if (cpu_has(c,X86_FEATURE_HWP) && !no_hwp)
> > +   if (static_cpu_has_safe(X86_FEATURE_HWP) && !no_hwp)
> > intel_pstate_hwp_enable();
> >  
> > if (!hwp_active && hwp_only)
> > 
> 

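The renaming effect described in the changelog can be reproduced with the preprocessor alone. The sketch below assumes the !CONFIG_SMP definition is a plain object-like macro, as the changelog states; STR(), STR2() and the three functions are hypothetical helpers that only show which identifier the compiler ends up seeing.

```c
#include <assert.h>
#include <string.h>

/* Assumed !CONFIG_SMP definition, as described in the changelog. */
#define cpu_info boot_cpu_data

/* Two-level stringification so the argument is macro-expanded first. */
#define STR2(x) #x
#define STR(x) STR2(x)

/* Identifier the compiler sees for a local declared as cpu_info. */
static const char *expanded_cpu_info(void)
{
	return STR(cpu_info);
}

/* Identifier the compiler sees for a local declared as cpu_def. */
static const char *expanded_cpu_def(void)
{
	return STR(cpu_def);
}

/* Sanity check: the rename takes the local out of the macro's reach. */
static void check_rename(void)
{
	assert(strcmp(expanded_cpu_info(), "boot_cpu_data") == 0);
	assert(strcmp(expanded_cpu_def(), "cpu_def") == 0);
}
```

Because the local declared as cpu_info is silently rewritten to boot_cpu_data, it shadows the global of that name, and taking its address yields a pointer to a pointer, hence the incompatible pointer type warning. Renaming the local to cpu_def avoids the macro entirely.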
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] cpufreq/intel_pstate: Fix an annoying !CONFIG_SMP warning

2015-04-13 Thread Kristen Carlson Accardi
On Sat, 11 Apr 2015 02:22:03 +0200
"Rafael J. Wysocki" <r...@rjwysocki.net> wrote:

> On Friday, April 03, 2015 03:19:53 PM Borislav Petkov wrote:
> > From: Borislav Petkov <b...@suse.de>
> > 
> > I keep seeing
> > 
> > drivers/cpufreq/intel_pstate.c: In function ‘intel_pstate_init’:
> > drivers/cpufreq/intel_pstate.c:1187:26: warning: initialization from incompatible pointer type
> >   struct cpuinfo_x86 *c = &boot_cpu_data;
> > 
> > when doing randconfig builds.
> > 
> > This is caused by the fact that when !CONFIG_SMP, asm/processor.h
> > defines cpu_info to boot_cpu_data and the local variable
> > 
> >   struct cpu_defaults *cpu_info
> > 
> > overshadows it leading to this unfortunate assignment in the
> > preprocessed source:
> > 
> >   struct cpu_defaults *boot_cpu_data;
> >   struct cpuinfo_x86 *c = &boot_cpu_data;
> > 
> > Rename the local variable and use static_cpu_has_safe() which alleviates
> > the need for defining a local cpuinfo_x86 pointer.
> 
> Kristen, any comments here?

Seems fine to me.

Acked-by: Kristen Carlson Accardi <kris...@linux.intel.com>

> 
> > Signed-off-by: Borislav Petkov <b...@suse.de>
> > Cc: Kristen Carlson Accardi <kris...@linux.intel.com>
> > Cc: "Rafael J. Wysocki" <r...@rjwysocki.net>
> > Cc: Viresh Kumar <viresh.ku...@linaro.org>
> > Cc: linux...@vger.kernel.org
> > ---
> >  drivers/cpufreq/intel_pstate.c | 12 ++--
> >  1 file changed, 6 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> > index 872c5772c5d3..0b883f131a73 100644
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -31,6 +31,7 @@
> >  #include <asm/div64.h>
> >  #include <asm/msr.h>
> >  #include <asm/cpu_device_id.h>
> > +#include <asm/cpufeature.h>
> >  
> >  #define BYT_RATIOS 0x66a
> >  #define BYT_VIDS   0x66b
> > @@ -1183,8 +1184,7 @@ static int __init intel_pstate_init(void)
> >  {
> >  	int cpu, rc = 0;
> >  	const struct x86_cpu_id *id;
> > -	struct cpu_defaults *cpu_info;
> > -	struct cpuinfo_x86 *c = &boot_cpu_data;
> > +	struct cpu_defaults *cpu_def;
> >  
> >  	if (no_load)
> >  		return -ENODEV;
> > @@ -1200,10 +1200,10 @@ static int __init intel_pstate_init(void)
> >  	if (intel_pstate_platform_pwr_mgmt_exists())
> >  		return -ENODEV;
> >  
> > -	cpu_info = (struct cpu_defaults *)id->driver_data;
> > +	cpu_def = (struct cpu_defaults *)id->driver_data;
> >  
> > -	copy_pid_params(&cpu_info->pid_policy);
> > -	copy_cpu_funcs(&cpu_info->funcs);
> > +	copy_pid_params(&cpu_def->pid_policy);
> > +	copy_cpu_funcs(&cpu_def->funcs);
> >  
> >  	if (intel_pstate_msrs_not_valid())
> >  		return -ENODEV;
> > @@ -1214,7 +1214,7 @@ static int __init intel_pstate_init(void)
> >  	if (!all_cpu_data)
> >  		return -ENOMEM;
> >  
> > -	if (cpu_has(c,X86_FEATURE_HWP) && !no_hwp)
> > +	if (static_cpu_has_safe(X86_FEATURE_HWP) && !no_hwp)
> >  		intel_pstate_hwp_enable();
> >  
> >  	if (!hwp_active && hwp_only)
> > 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
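
The shadowing described in the commit message of the !CONFIG_SMP warning fix can be reproduced in userspace. This is only a sketch under stated assumptions: the two structs are simplified stand-ins rather than the kernel definitions, the #define imitates what asm/processor.h does on !CONFIG_SMP builds, and shadowed_size() and model_via_global() are hypothetical helpers added for illustration.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel types (assumption, not the real layout). */
struct cpuinfo_x86 {
	char pad[64];
	int x86_model;
};

struct cpu_defaults {
	int pid_policy;
};

struct cpuinfo_x86 boot_cpu_data = { .x86_model = 42 };

/* Imitates the !CONFIG_SMP definition in asm/processor.h. */
#define cpu_info boot_cpu_data

/* Old shape: the local is named cpu_info, so the preprocessor rewrites it. */
size_t shadowed_size(void)
{
	/* Expands to "struct cpu_defaults *boot_cpu_data = NULL;". */
	struct cpu_defaults *cpu_info = NULL;

	(void)cpu_info;
	/* This now measures the local pointer, not the global struct. */
	return sizeof(boot_cpu_data);
}

/* Fixed shape: cpu_def is untouched by the macro, so the global stays visible. */
int model_via_global(void)
{
	struct cpu_defaults *cpu_def = NULL;
	struct cpuinfo_x86 *c = &boot_cpu_data;

	(void)cpu_def;
	return c->x86_model;
}
```

In the old shape, any later use of &boot_cpu_data inside the function yields the address of the local pointer, which is what triggers the incompatible pointer type warning. Renaming the local avoids the macro entirely.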


Re: [PATCH 2/2 V9] intel_pstate: add kernel parameter to force loading.

2014-12-10 Thread Kristen Carlson Accardi
On Tue, 9 Dec 2014 10:43:19 +0900
Ethan Zhao  wrote:

> To force loading on Oracle Sun X86 servers, provide one kernel command line
> parameter
> 
>   intel_pstate=force
> 
> This is for users who are aware of the risk that power capping will not
> work and who want better performance from this driver.
> 
> Signed-off-by: Ethan Zhao 
> Tested-by: Alexey Kodanev 
> Reviewed-by: Linda Knippers 

Acked-by: Kristen Carlson Accardi 

> ---
>  v2: change to hardware vendor specific naming parameter.
>  v4: refine code and doc.
>  v5: fix a typo in doc.
>  v7: change enum PCC to PPC.
>  v8: change the name of kernel command line parameter to generic one.
>  v9: refine doc
> 
>  Documentation/kernel-parameters.txt | 5 +
>  drivers/cpufreq/intel_pstate.c  | 6 +-
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index 479f332..7d0983e 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -1446,6 +1446,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>  disable
>    Do not enable intel_pstate as the default
>    scaling driver for the supported processors
> +force
> +  Enable intel_pstate on systems that prohibit it by default
> +  in favor of acpi-cpufreq. Forcing the intel_pstate driver
> +  instead of acpi-cpufreq may disable platform features, such
> +  as thermal controls and power capping, that rely on ACPI
> +  P-States information being indicated to OSPM and therefore
> +  should be used with caution. This option does not work with
> +  processors that aren't supported by the intel_pstate driver
> +  or on platforms that use pcc-cpufreq instead of acpi-cpufreq.
>  
>   intremap=   [X86-64, Intel-IOMMU]
>   on  enable Interrupt Remapping (default)
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index 1bb62ca..2654e13 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -866,6 +866,7 @@ static struct cpufreq_driver intel_pstate_driver = {
>  };
>  
>  static int __initdata no_load;
> +static unsigned int  force_load;
>  
>  static int intel_pstate_msrs_not_valid(void)
>  {
> @@ -1003,7 +1004,8 @@ static bool intel_pstate_platform_pwr_mgmt_exists(void)
>   case PSS:
>   return intel_pstate_no_acpi_pss();
>   case PPC:
> - return intel_pstate_has_acpi_ppc();
> + return intel_pstate_has_acpi_ppc() &&
> + (!force_load);
>   }
>   }
>  
> @@ -1078,6 +1080,8 @@ static int __init intel_pstate_setup(char *str)
>  
>   if (!strcmp(str, "disable"))
>   no_load = 1;
> + if (!strcmp(str, "force"))
> + force_load = 1;
>   return 0;
>  }
>  early_param("intel_pstate", intel_pstate_setup);
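
The decision this patch encodes can be sketched in userspace: the parameter string sets a flag, and a detected _PPC method blocks the driver only while that flag is clear. This is a sketch under assumptions, not the driver code. struct pstate_opts, parse_intel_pstate_opt() and ppc_blocks_driver() are hypothetical names, and the has_acpi_ppc argument stands in for the result of intel_pstate_has_acpi_ppc().

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical userspace mirror of the driver's no_load / force_load flags. */
struct pstate_opts {
	bool no_load;
	bool force_load;
};

/* Mirrors the strcmp() checks in intel_pstate_setup() from the patch. */
void parse_intel_pstate_opt(const char *str, struct pstate_opts *opts)
{
	if (!str)
		return;
	if (!strcmp(str, "disable"))
		opts->no_load = true;
	if (!strcmp(str, "force"))
		opts->force_load = true;
}

/*
 * Mirrors the PPC case of the vendor switch: platform power management
 * wins only when _PPC is present and the user did not pass "force".
 */
bool ppc_blocks_driver(bool has_acpi_ppc, const struct pstate_opts *opts)
{
	return has_acpi_ppc && !opts->force_load;
}
```

With "force" parsed, ppc_blocks_driver() returns false even when _PPC is present, which is the behaviour the documentation hunk warns should be used with caution.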


Re: [PATCH 2/2 V7] intel_pstate: add kernel parameter to force loading on Sun X86 servers.

2014-12-04 Thread Kristen Carlson Accardi
On Thu, 04 Dec 2014 23:10:58 +0100
"Rafael J. Wysocki"  wrote:

> On Thursday, December 04, 2014 11:07:31 AM Ethan Zhao wrote:
> > To force loading on Oracle Sun X86 servers, provide one kernel command line
> > parameter
> > 
> >   intel_pstate=ora_force
> 
> I would suggest changing the name of the option to "oracle_force" or
> "sun_force" for clarity.
> 
> Anyway, I need an ACK from Kristen if this patch is to be applied.
> 
> > This is for users who are aware of the risk that power capping will not
> > work and who want better performance from this driver.
> > 
> > Signed-off-by: Ethan Zhao 
> > ---
> >  v2: change to hardware vendor specific naming parameter.
> >  v4: refine code and doc.
> >  v5: fix a typo in doc.
> >  v7: change enum PCC to PPC.
> > 
> >  Documentation/kernel-parameters.txt | 5 +
> >  drivers/cpufreq/intel_pstate.c  | 6 +-
> >  2 files changed, 10 insertions(+), 1 deletion(-)
> > 
> > diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> > index 479f332..7d0983e 100644
> > --- a/Documentation/kernel-parameters.txt
> > +++ b/Documentation/kernel-parameters.txt
> > @@ -1446,6 +1446,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> >    disable
> >      Do not enable intel_pstate as the default
> >      scaling driver for the supported processors
> > +  ora_force
> > +    Force loading intel_pstate on Oracle Sun Servers(X86).
> > +    only for those who be aware of the risk of no power capping
> > +    capability working and try to get better performance with this
> > +    driver.
> 
> That is not sufficiently clear.  What does "risk of no power capping
> capability working" mean, in particular?
> 
> >  
> > intremap=   [X86-64, Intel-IOMMU]
> > on  enable Interrupt Remapping (default)
> > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> > index 1bb62ca..2654e13 100644
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -866,6 +866,7 @@ static struct cpufreq_driver intel_pstate_driver = {
> >  };
> >  
> >  static int __initdata no_load;
> > +static unsigned int  ora_force;
> >  
> >  static int intel_pstate_msrs_not_valid(void)
> >  {
> > @@ -1003,7 +1004,8 @@ static bool intel_pstate_platform_pwr_mgmt_exists(void)
> > case PSS:
> > return intel_pstate_no_acpi_pss();
> > case PPC:
> > -   return intel_pstate_has_acpi_ppc();
> > +   return intel_pstate_has_acpi_ppc() &&
> > +   (!ora_force);
> > }
> > }
> >  
> > @@ -1078,6 +1080,8 @@ static int __init intel_pstate_setup(char *str)
> >  
> > if (!strcmp(str, "disable"))
> > no_load = 1;
> > +   if (!strcmp(str, "ora_force"))
> > +   ora_force = 1;
> > return 0;
> >  }
> >  early_param("intel_pstate", intel_pstate_setup);
> 
> And can anyone please remind me what was wrong with a "force" option that
> would work for everyone, not just Oracle/Sun?
> 

That was my suggestion as well (i.e. a parameter to bypass the vendor
checks), but Linda didn't like it.  My personal opinion is that unless
it's generic, I don't really feel like having a force option solely for
Oracle.  I'm not convinced you want this for production machines, and I
think for debug purposes I don't want a vendor-specific param.


Re: [PATCH 2/2 V6] intel_pstate: add kernel parameter to force loading on Sun X86 servers.

2014-12-02 Thread Kristen Carlson Accardi
On Fri, 28 Nov 2014 12:36:17 +0900
Ethan Zhao  wrote:

> To force loading on Oracle Sun X86 servers, provide one kernel command line
> parameter
> 
>   intel_pstate=ora_force
> 
> This is for users who are aware of the risk that power capping will not
> work and who want better performance from this driver.

So, is this something you'd expect users to use on production systems,
or is it just for debug?

Thanks,
Kristen

> 
> Signed-off-by: Ethan Zhao 
> ---
>  v2: change to hardware vendor specific naming parameter.
>  v4: refine code and doc.
>  v5: fix a typo in doc.
> 
>  Documentation/kernel-parameters.txt | 5 +
>  drivers/cpufreq/intel_pstate.c  | 6 +-
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index 479f332..7d0983e 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -1446,6 +1446,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>  disable
>    Do not enable intel_pstate as the default
>    scaling driver for the supported processors
> +ora_force
> +  Force loading intel_pstate on Oracle Sun Servers(X86).
> +  only for those who be aware of the risk of no power capping
> +  capability working and try to get better performance with this
> +  driver.
>  
>   intremap=   [X86-64, Intel-IOMMU]
>   on  enable Interrupt Remapping (default)
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index 1bb62ca..2654e13 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -866,6 +866,7 @@ static struct cpufreq_driver intel_pstate_driver = {
>  };
>  
>  static int __initdata no_load;
> +static unsigned int  ora_force;
>  
>  static int intel_pstate_msrs_not_valid(void)
>  {
> @@ -1003,7 +1004,8 @@ static bool intel_pstate_platform_pwr_mgmt_exists(void)
>   case PSS:
>   return intel_pstate_no_acpi_pss();
>   case PCC:
> - return intel_pstate_has_acpi_ppc();
> + return intel_pstate_has_acpi_ppc() &&
> + (!ora_force);
>   }
>   }
>  
> @@ -1078,6 +1080,8 @@ static int __init intel_pstate_setup(char *str)
>  
>   if (!strcmp(str, "disable"))
>   no_load = 1;
> + if (!strcmp(str, "ora_force"))
> + ora_force = 1;
>   return 0;
>  }
>  early_param("intel_pstate", intel_pstate_setup);


Re: [PATCH 1/2 V6] intel_pstate: skip this driver if Sun server has _PPC method

2014-12-02 Thread Kristen Carlson Accardi
On Mon, 1 Dec 2014 11:32:08 +0900
Ethan Zhao  wrote:

> Oracle Sun X86 servers have dynamic power capping capability that works via
> ACPI _PPC method etc, so skip loading this driver if Sun server has ACPI _PPC
> enabled.
> 
> Signed-off-by: Ethan Zhao 
> Signed-off-by: Dirk Brandewie 
> Tested-by: Linda Knippers 

Acked-by: Kristen Carlson Accardi 

> ---
>   v2: fix break HP Proliant issue.
>   v3: expand the hardware vendor list.
>   v4: refine code.
>   v5v6: change enum PCC to PPC.
> 
>  drivers/cpufreq/intel_pstate.c | 45 ++
>  1 file changed, 41 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index 27bb6d3..1bb62ca 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -943,15 +943,46 @@ static bool intel_pstate_no_acpi_pss(void)
>   return true;
>  }
>  
> +static bool intel_pstate_has_acpi_ppc(void)
> +{
> + int i;
> +
> + for_each_possible_cpu(i) {
> + struct acpi_processor *pr = per_cpu(processors, i);
> +
> + if (!pr)
> + continue;
> + if (acpi_has_method(pr->handle, "_PPC"))
> + return true;
> + }
> + return false;
> +}
> +
> +enum {
> + PSS,
> + PPC,
> +};
> +
>  struct hw_vendor_info {
>   u16  valid;
>   char oem_id[ACPI_OEM_ID_SIZE];
>   char oem_table_id[ACPI_OEM_TABLE_ID_SIZE];
> + int  oem_pwr_table;
>  };
>  
>  /* Hardware vendor-specific info that has its own power management modes */
>  static struct hw_vendor_info vendor_info[] = {
> - {1, "HP", "ProLiant"},
> + {1, "HP", "ProLiant", PSS},
> + {1, "ORACLE", "X4-2", PPC},
> + {1, "ORACLE", "X4-2L   ", PPC},
> + {1, "ORACLE", "X4-2B   ", PPC},
> + {1, "ORACLE", "X3-2", PPC},
> + {1, "ORACLE", "X3-2L   ", PPC},
> + {1, "ORACLE", "X3-2B   ", PPC},
> + {1, "ORACLE", "X4470M2 ", PPC},
> + {1, "ORACLE", "X4270M3 ", PPC},
> + {1, "ORACLE", "X4270M2 ", PPC},
> + {1, "ORACLE", "X4170M2 ", PPC},
>   {0, "", ""},
>  };
>  
> @@ -966,15 +997,21 @@ static bool intel_pstate_platform_pwr_mgmt_exists(void)
>  
>   for (v_info = vendor_info; v_info->valid; v_info++) {
>   if (!strncmp(hdr.oem_id, v_info->oem_id, ACPI_OEM_ID_SIZE) &&
> - !strncmp(hdr.oem_table_id, v_info->oem_table_id, ACPI_OEM_TABLE_ID_SIZE) &&
> - intel_pstate_no_acpi_pss())
> - return true;
> + !strncmp(hdr.oem_table_id, v_info->oem_table_id,
> + ACPI_OEM_TABLE_ID_SIZE))
> + switch (v_info->oem_pwr_table) {
> + case PSS:
> + return intel_pstate_no_acpi_pss();
> + case PPC:
> + return intel_pstate_has_acpi_ppc();
> + }
>   }
>  
>   return false;
>  }
>  #else /* CONFIG_ACPI not enabled */
>  static inline bool intel_pstate_platform_pwr_mgmt_exists(void) { return false; }
> +static inline bool intel_pstate_has_acpi_ppc(void) { return false; }
>  #endif /* CONFIG_ACPI */
>  
>  static int __init intel_pstate_init(void)
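
The vendor table in this patch matches fixed-width ACPI OEM fields with strncmp(), which is why several Oracle table IDs carry trailing spaces. The lookup can be sketched in userspace as follows. This is an illustration under assumptions, not the driver code: lookup_pwr_table() and the NO_MATCH value are hypothetical, the table holds only a few of the patch's entries, and the widths 6 and 8 are the ACPI table header field sizes that ACPI_OEM_ID_SIZE and ACPI_OEM_TABLE_ID_SIZE stand for.

```c
#include <stddef.h>
#include <string.h>

#define OEM_ID_SIZE       6	/* width of the ACPI header OEM ID field */
#define OEM_TABLE_ID_SIZE 8	/* width of the ACPI header OEM table ID field */

enum pwr_table { PSS, PPC, NO_MATCH };

struct hw_vendor_info {
	char oem_id[OEM_ID_SIZE];		/* not necessarily NUL-terminated */
	char oem_table_id[OEM_TABLE_ID_SIZE];	/* space- or NUL-padded */
	enum pwr_table oem_pwr_table;
};

/* A subset of the entries from the patch. */
static const struct hw_vendor_info vendor_info[] = {
	{ "HP",     "ProLiant", PSS },
	{ "ORACLE", "X4-2",     PPC },
	{ "ORACLE", "X4-2L   ", PPC },
	{ "ORACLE", "X4470M2 ", PPC },
};

/* Returns the power table type for a matching vendor entry, else NO_MATCH. */
enum pwr_table lookup_pwr_table(const char *oem_id, const char *oem_table_id)
{
	size_t i;

	for (i = 0; i < sizeof(vendor_info) / sizeof(vendor_info[0]); i++) {
		const struct hw_vendor_info *v = &vendor_info[i];

		/* Bounded compares: the fields may fill their full width. */
		if (!strncmp(oem_id, v->oem_id, OEM_ID_SIZE) &&
		    !strncmp(oem_table_id, v->oem_table_id, OEM_TABLE_ID_SIZE))
			return v->oem_pwr_table;
	}
	return NO_MATCH;
}
```

Because the comparison covers the whole field width, "X4-2L" without its three trailing spaces does not match the "X4-2L   " entry, so the padding in the table has to agree with what the firmware reports.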


Re: [PATCH 3/3] intel_pstate: add module and kernel command line parameter to ignore ACPI _PPC

2014-11-20 Thread Kristen Carlson Accardi
On Thu, 20 Nov 2014 08:57:34 +0800
ethan  wrote:

> 
> 
> > On 20 Nov 2014, at 03:05, Kristen Carlson Accardi wrote:
> > 
> > On Tue, 18 Nov 2014 17:37:06 +0900
> > Ethan Zhao  wrote:
> > 
> >> Add kernel command line parameter
> >> intel_pstate=ignore_acpi_ppc
> >> and module parameter
> >> ignore_acpi_ppc=1
> >> to allow the driver to ignore the existence of ACPI _PPC even on Sun x86 servers.
> >> These parameters can be used for debugging, testing, or as a workaround.
> >> 
> >> Signed-off-by: Ethan Zhao 
> > 
> > > What if we used a more generic parameter like "force" that would bypass
> > > any vendor specific checks and just load anyway?  This way we don't have
> > > to add new parameters every time some new thing shows up that we want to
> > > ignore.
> > > 
> To be honest, I prefer a more generic parameter. But to avoid a possible
> negative effect on other vendors, I went back to this approach.

Well, your parameter can still impact other vendors as it is.  It
is pretty typical to assume that using a parameter like "force" means
you know what you are doing and accept the risks.  Especially if it's
documented as such.

> 
> Thanks,
> Ethan
> >> ---
> >> Documentation/kernel-parameters.txt | 3 +++
> >> drivers/cpufreq/intel_pstate.c  | 8 +++-
> >> 2 files changed, 10 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> >> index 4c81a86..f502b85 100644
> >> --- a/Documentation/kernel-parameters.txt
> >> +++ b/Documentation/kernel-parameters.txt
> >> @@ -1446,6 +1446,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> >>     disable
> >>       Do not enable intel_pstate as the default
> >>       scaling driver for the supported processors
> >> +   ignore_acpi_ppc
> >> +     Ignore the existence of ACPI method _PPC for Sun x86 servers
> >> +     and load the driver.
> >> 
> >>    intremap=   [X86-64, Intel-IOMMU]
> >>    on  enable Interrupt Remapping (default)
> >> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> >> index 7c5faea..388387b 100644
> >> --- a/drivers/cpufreq/intel_pstate.c
> >> +++ b/drivers/cpufreq/intel_pstate.c
> >> @@ -870,6 +870,7 @@ static struct cpufreq_driver intel_pstate_driver = {
> >> };
> >> 
> >> static int __initdata no_load;
> >> +static unsigned int  ignore_acpi_ppc;
> >> 
> >> static int intel_pstate_msrs_not_valid(void)
> >> {
> >> @@ -990,7 +991,7 @@ static bool intel_pstate_platform_pwr_mgmt_exists(void)
> >>intel_pstate_no_acpi_pss())
> >>return true;
> >>if (!strncmp(hdr.oem_id, v_info->oem_id, ACPI_OEM_ID_SIZE) &&
> >> -intel_pstate_has_acpi_ppc())
> >> +intel_pstate_has_acpi_ppc() && !ignore_acpi_ppc)
> >>return true;
> >>}
> >> 
> >> @@ -1066,11 +1067,16 @@ static int __init intel_pstate_setup(char *str)
> >> 
> >>if (!strcmp(str, "disable"))
> >>no_load = 1;
> >> +if (!strcmp(str, "ignore_acpi_ppc"))
> >> +ignore_acpi_ppc = 1;
> >>return 0;
> >> }
> >> early_param("intel_pstate", intel_pstate_setup);
> >> #endif
> >> 
> >> +module_param(ignore_acpi_ppc, uint, 0644);
> >> +MODULE_PARM_DESC(ignore_acpi_ppc,
> >> +"value 0 or non-zero. non-zero -> ignore ACPI _PPC and load this driver");
> >> MODULE_AUTHOR("Dirk Brandewie ");
> >> MODULE_DESCRIPTION("'intel_pstate' - P state driver Intel Core processors");
> >> MODULE_LICENSE("GPL");
> > 


Re: [PATCH 3/3] intel_pstate: add module and kernel command line parameter to ignore ACPI _PPC

2014-11-20 Thread Kristen Carlson Accardi
On Thu, 20 Nov 2014 08:57:34 +0800
ethan ethan.ker...@gmail.com wrote:

 
 
  在 2014年11月20日,03:05,Kristen Carlson Accardi kris...@linux.intel.com 写道:
  
  On Tue, 18 Nov 2014 17:37:06 +0900
  Ethan Zhao ethan.z...@oracle.com wrote:
  
  Add kernel command line parameter
  intel_pstate = ignore_acpi_ppc
  and module parameter
  ignore_acpi_ppc = 1
  to allow driver to ignore the ACPI _PPC existence even for Sun x86 servers.
  These parameter could be used for debug\test\workaround etc purpose.
  
  Signed-off-by: Ethan Zhao ethan.z...@oracle.com
  
  What if we used a more generic parameter like force that would bypass
  any vendor specific checks and just load anyway?  This way we don't have
  to add new parameters everything some new thing shows up that we want to
  ignore.
  
 To be honest, I prefer more generic parameter. But to avoid the possible 
 negative affect 
 To another vendors. I back to this way.

Well, your parameter can still impact other vendors as it is.  it
is pretty typical to assume that using a parameter like force means
you know what you are doing and accept the risks.  Especially if its
documented as such.

> 
> Thanks,
> Ethan
> > > ---
> > >  Documentation/kernel-parameters.txt | 3 +++
> > >  drivers/cpufreq/intel_pstate.c  | 8 +++-
> > >  2 files changed, 10 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> > > index 4c81a86..f502b85 100644
> > > --- a/Documentation/kernel-parameters.txt
> > > +++ b/Documentation/kernel-parameters.txt
> > > @@ -1446,6 +1446,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> > >  disable
> > >Do not enable intel_pstate as the default
> > >scaling driver for the supported processors
> > > +ignore_acpi_ppc
> > > +  Ignore the existence of ACPI method _PPC for Sun x86 servers
> > > +  and load the driver.
> > >  
> > >   intremap=   [X86-64, Intel-IOMMU]
> > >   on  enable Interrupt Remapping (default)
> > > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> > > index 7c5faea..388387b 100644
> > > --- a/drivers/cpufreq/intel_pstate.c
> > > +++ b/drivers/cpufreq/intel_pstate.c
> > > @@ -870,6 +870,7 @@ static struct cpufreq_driver intel_pstate_driver = {
> > >  };
> > >  
> > >  static int __initdata no_load;
> > > +static unsigned int  ignore_acpi_ppc;
> > >  
> > >  static int intel_pstate_msrs_not_valid(void)
> > >  {
> > > @@ -990,7 +991,7 @@ static bool intel_pstate_platform_pwr_mgmt_exists(void)
> > >   intel_pstate_no_acpi_pss())
> > >   return true;
> > >   if (!strncmp(hdr.oem_id, v_info->oem_id, ACPI_OEM_ID_SIZE) &&
> > > - intel_pstate_has_acpi_ppc())
> > > + intel_pstate_has_acpi_ppc() && !ignore_acpi_ppc)
> > >   return true;
> > >   }
> > >  
> > > @@ -1066,11 +1067,16 @@ static int __init intel_pstate_setup(char *str)
> > >  
> > >   if (!strcmp(str, "disable"))
> > >   no_load = 1;
> > > + if (!strcmp(str, "ignore_acpi_ppc"))
> > > + ignore_acpi_ppc = 1;
> > >   return 0;
> > >  }
> > >  early_param("intel_pstate", intel_pstate_setup);
> > >  #endif
> > >  
> > > +module_param(ignore_acpi_ppc, uint, 0644);
> > > +MODULE_PARM_DESC(ignore_acpi_ppc,
> > > + "value 0 or non-zero. non-zero -> ignore ACPI _PPC and load this driver");
> > >  MODULE_AUTHOR("Dirk Brandewie <dirk.j.brande...@intel.com>");
> > >  MODULE_DESCRIPTION("'intel_pstate' - P state driver Intel Core processors");
> > >  MODULE_LICENSE("GPL");
> > > 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] intel_pstate: add module and kernel command line parameter to ignore ACPI _PPC

2014-11-19 Thread Kristen Carlson Accardi
On Tue, 18 Nov 2014 17:37:06 +0900
Ethan Zhao <ethan.z...@oracle.com> wrote:

> Add kernel command line parameter
>  intel_pstate = ignore_acpi_ppc
> and module parameter
>  ignore_acpi_ppc = 1
> to allow driver to ignore the ACPI _PPC existence even for Sun x86 servers.
> These parameter could be used for debug\test\workaround etc purpose.
> 
> Signed-off-by: Ethan Zhao <ethan.z...@oracle.com>

What if we used a more generic parameter like "force" that would bypass
any vendor specific checks and just load anyway?  This way we don't have
to add new parameters every time some new thing shows up that we want to
ignore.
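
To make the suggestion concrete, here is a minimal userspace sketch of how
a single "force" value could short-circuit the vendor checks.  It is only
an illustration under assumptions: force_load, pstate_setup() and
platform_pwr_mgmt_exists() are hypothetical stand-ins for the driver's
intel_pstate_setup() and intel_pstate_platform_pwr_mgmt_exists(), and the
two boolean arguments stand in for the OEM-ID match and the _PPC lookup.
It is not the code that was posted or merged.

```c
#include <stdbool.h>
#include <string.h>

/* no_load mirrors the flag in the quoted patch; force_load is hypothetical. */
static int no_load;
static int force_load;

/*
 * Stand-in for intel_pstate_setup(): parses the value that follows
 * "intel_pstate=" on the kernel command line.
 */
static int pstate_setup(const char *str)
{
	if (!str)
		return -1;	/* the kernel handler would return -EINVAL */

	if (!strcmp(str, "disable"))
		no_load = 1;
	if (!strcmp(str, "force"))
		force_load = 1;
	return 0;
}

/*
 * Stand-in for the platform check.  vendor_match and has_acpi_ppc are
 * assumptions replacing the OEM table comparison and the ACPI _PPC probe.
 * Returning true means "firmware handles power management, do not load".
 */
static bool platform_pwr_mgmt_exists(bool vendor_match, bool has_acpi_ppc)
{
	if (force_load)
		return false;	/* user accepted the risk: skip every vendor check */

	return vendor_match && has_acpi_ppc;
}
```

With this shape, a new vendor quirk only adds another condition below the
force_load test; no new command line value is needed for each one.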

> ---
>  Documentation/kernel-parameters.txt | 3 +++
>  drivers/cpufreq/intel_pstate.c  | 8 +++-
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/kernel-parameters.txt 
> b/Documentation/kernel-parameters.txt
> index 4c81a86..f502b85 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -1446,6 +1446,9 @@ bytes respectively. Such letter suffixes can also be 
> entirely omitted.
>  disable
>Do not enable intel_pstate as the default
>scaling driver for the supported processors
> +ignore_acpi_ppc
> +  Ignore the existence of ACPI method _PPC for Sun x86 
> servers
> +  and load the driver.
>  
>   intremap=   [X86-64, Intel-IOMMU]
>   on  enable Interrupt Remapping (default)
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index 7c5faea..388387b 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -870,6 +870,7 @@ static struct cpufreq_driver intel_pstate_driver = {
>  };
>  
>  static int __initdata no_load;
> +static unsigned int  ignore_acpi_ppc;
>  
>  static int intel_pstate_msrs_not_valid(void)
>  {
> @@ -990,7 +991,7 @@ static bool intel_pstate_platform_pwr_mgmt_exists(void)
>   intel_pstate_no_acpi_pss())
>   return true;
>   if (!strncmp(hdr.oem_id, v_info->oem_id, ACPI_OEM_ID_SIZE) &&
> - intel_pstate_has_acpi_ppc())
> + intel_pstate_has_acpi_ppc() && !ignore_acpi_ppc)
>   return true;
>   }
>  
> @@ -1066,11 +1067,16 @@ static int __init intel_pstate_setup(char *str)
>  
>   if (!strcmp(str, "disable"))
>   no_load = 1;
> + if (!strcmp(str, "ignore_acpi_ppc"))
> + ignore_acpi_ppc = 1;
>   return 0;
>  }
>  early_param("intel_pstate", intel_pstate_setup);
>  #endif
>  
> +module_param(ignore_acpi_ppc, uint, 0644);
> +MODULE_PARM_DESC(ignore_acpi_ppc,
> + "value 0 or non-zero. non-zero -> ignore ACPI _PPC and load this 
> driver");
>  MODULE_AUTHOR("Dirk Brandewie ");
>  MODULE_DESCRIPTION("'intel_pstate' - P state driver Intel Core processors");
>  MODULE_LICENSE("GPL");



Re: [PATCH 2/3] intel_pstate: allow driver to be built as a module

2014-11-19 Thread Kristen Carlson Accardi
On Tue, 18 Nov 2014 17:37:05 +0900
Ethan Zhao <ethan.z...@oracle.com> wrote:

> From: Brian Maly <brian.m...@oracle.com>
> 
> To provide the flexibility of module, allow this driver to
> be configured and built as a module.
> 
> Signed-off-by: Brian Maly <brian.m...@oracle.com>
> Signed-off-by: Ethan Zhao <ethan.z...@oracle.com>

I believe the entire concept of being able to use intel_pstate as a
module just isn't going to work.  There are load order issues - and
additionally the driver doesn't clean up after itself in any way.


> ---
>  drivers/cpufreq/Kconfig.x86| 2 +-
>  drivers/cpufreq/intel_pstate.c | 6 ++
>  2 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cpufreq/Kconfig.x86 b/drivers/cpufreq/Kconfig.x86
> index 89ae88f..94c9e6b 100644
> --- a/drivers/cpufreq/Kconfig.x86
> +++ b/drivers/cpufreq/Kconfig.x86
> @@ -3,7 +3,7 @@
>  #
>  
>  config X86_INTEL_PSTATE
> -   bool "Intel P state control"
> +   tristate "Intel P state control"
> depends on X86
> help
>This driver provides a P state for Intel core processors.
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index 5498eb0..7c5faea 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -590,7 +590,9 @@ static void intel_pstate_set_pstate(struct cpudata *cpu, 
> int pstate)
>   if (pstate == cpu->pstate.current_pstate)
>   return;
>  
> +#ifndef MODULE
>   trace_cpu_frequency(pstate * cpu->pstate.scaling, cpu->cpu);
> +#endif
>  
>   cpu->pstate.current_pstate = pstate;
>  
> @@ -705,12 +707,14 @@ static void intel_pstate_timer_func(unsigned long 
> __data)
>  
>   intel_pstate_adjust_busy_pstate(cpu);
>  
> +#ifndef MODULE
>   trace_pstate_sample(fp_toint(sample->core_pct_busy),
>   fp_toint(intel_pstate_get_scaled_busy(cpu)),
>   cpu->pstate.current_pstate,
>   sample->mperf,
>   sample->aperf,
>   sample->freq);
> +#endif
>  
>   intel_pstate_set_sample_time(cpu);
>  }
> @@ -1054,6 +1058,7 @@ out:
>  }
>  device_initcall(intel_pstate_init);
>  
> +#ifndef MODULE
>  static int __init intel_pstate_setup(char *str)
>  {
>   if (!str)
> @@ -1064,6 +1069,7 @@ static int __init intel_pstate_setup(char *str)
>   return 0;
>  }
>  early_param("intel_pstate", intel_pstate_setup);
> +#endif
>  
>  MODULE_AUTHOR("Dirk Brandewie ");
>  MODULE_DESCRIPTION("'intel_pstate' - P state driver Intel Core processors");



Re: 2.6.25-rc2 Thinkpad T30 docking fails with oops.

2008-02-22 Thread Kristen Carlson Accardi
On Fri, 22 Feb 2008 00:34:17 -0800
Andrew Morton <[EMAIL PROTECTED]> wrote:

> On Tue, 19 Feb 2008 20:47:08 + Paul Martin <[EMAIL PROTECTED]> wrote:
> 
> > Now, this was working in 2.6.23, but is not in any later kernel.
> > Previously reported, but I guess that the previous report was
> > ignored due to the kernel being tainted.
> 
> Not really.  A lot of acpi-related bug reports get the
> crickets-chirping treatment.
> 
> The usual response with acpi is "please raise a bugzilla report", and
> the acpi developers do respond well to bugzilla reports.  But really,
> a recent and oops-causing regression shouldn't require such actions -
> developers should be running around with their hair on fire.
> 
> > Well, there's nothing other than pure Linus here.
> > 
> > When booted up docked, there is no oops until you try to undock.
> > 
> > Whilst I will be looking for replies in the list, I'd appreciate
> > being CC'd on any replies.
> > 
> > ACPI: \_SB_.PCI0.PCI1.DOCK - docking
> > PCI: Transparent bridge - :02:03.0
> > ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCI1.DOCK._PRT]
> > BUG: unable to handle kernel NULL pointer dereference at 
> > IP: [<c01dd6f2>] pdev_sort_resources+0x8a/0x116
> > *pde =  
> > Oops:  [#1] PREEMPT SMP 
> > Modules linked in: radeon drm rfcomm l2cap bluetooth ppdev lp ipv6
> > ext3 jbd mbcache fuse dm_crypt crypto_blkcipher dm_mod
> > cpufreq_stats speedstep_ich speedstep_lib thinkpad_acpi nvram
> > acpiphp bay dock pcmcia firmware_class joydev yenta_socket
> > rsrc_nonstatic pcmcia_core battery irtty_sir ac sir_dev
> > snd_intel8x0 snd_ac97_codec nsc_ircc ac97_bus snd_pcm_oss
> > snd_mixer_oss snd_pcm snd_timer irda parport_pc i2c_i801 button
> > shpchp pci_hotplug crc_ccitt parport snd psmouse intel_agp agpgart
> > iTCO_wdt evdev serio_raw soundcore snd_page_alloc rtc pcspkr xfs
> > floppy e100 mii uhci_hcd usbcore sg sr_mod cdrom sd_mod thermal
> > processor fan ata_piix libata scsi_mod radeonfb fb_ddc i2c_algo_bit
> > i2c_core
> > 
> > Pid: 39, comm: kacpi_notify Not tainted (2.6.25-rc2 #1)
> > EIP: 0060:[<c01dd6f2>] EFLAGS: 00010246 CPU: 0
> > EIP is at pdev_sort_resources+0x8a/0x116
> > EAX:  EBX:  ECX:  EDX: 0fff
> > ESI: f5e30cc4 EDI: f74f1e80 EBP: f74f1e68 ESP: f74f1e48
> >  DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068
> > Process kacpi_notify (pid: 39, ti=f74f task=f74cf120
> > task.ti=f74f) Stack:  f5e30cf0 f74f1e80 f5e30c00
> > 0007 f5e30c00 f74cb414 f74cb400 f74f1e9c c02a9cce f5e30c08
> > f7431f90 f7431978 f5420c00  f7d6d408 f7d6d3c0 f74f1e9c
> >  f7d6d3c8 f7d6d3c0 f74f1ed8 f9b278ca f74cb40c Call Trace:
> >  [<c02a9cce>] ? pci_bus_assign_resources+0x59/0x341
> >  [<f9b278ca>] ? acpiphp_enable_slot+0x2a1/0x3a3 [acpiphp]
> >  [<f9b27ad9>] ? handle_hotplug_event_func+0x63/0x101 [acpiphp]
> >  [<f9b27167>] ? post_dock_fixups+0x6c/0x79 [acpiphp]
> >  [<c013506e>] ? notifier_call_chain+0x2b/0x4a
> >  [<f9b27a76>] ? handle_hotplug_event_func+0x0/0x101 [acpiphp]
> >  [<f9b22168>] ? hotplug_dock_devices+0x39/0xe1 [dock]
> >  [<f9b22441>] ? dock_notify+0x75/0xc0 [dock]
> >  [<c01f83c9>] ? acpi_ev_notify_dispatch+0x4f/0x5a
> >  [<c01f32c0>] ? acpi_os_execute_deferred+0x20/0x2c
> >  [<c012ef41>] ? run_workqueue+0x78/0xfb
> >  [<c01f32a0>] ? acpi_os_execute_deferred+0x0/0x2c
> >  [<c012f79d>] ? worker_thread+0xb6/0xc2
> >  [<c0131b25>] ? autoremove_wake_function+0x0/0x30
> >  [<c012f6e7>] ? worker_thread+0x0/0xc2
> >  [<c0131a52>] ? kthread+0x3b/0x61
> >  [<c0131a17>] ? kthread+0x0/0x61
> >  [<c010568f>] ? kernel_thread_helper+0x7/0x10
> >  ===
> > Code: 50 52 51 ff 75 f0 68 a0 af 34 c0 e8 ce 4e f4 ff 83 c4 1c e9
> > 87 00 00 00 8b 7d e8 8d 42 01 83 7d f0 06 89 7d e0 0f 4e c8 8b 45
> > e0 <8b> 18 31 c0 85 db 74 29 8b 53 04 8b 43 08 89 d7 05 8c 01 00 00
> > EIP: [<c01dd6f2>] pdev_sort_resources+0x8a/0x116 SS:ESP
> > 0068:f74f1e48 ---[ end trace 623ea8d57da4defe ]---
> 
> I suppose that if you're feeling keen, a bisection search as per
> http://www.kernel.org/doc/local/git-quick.html would help things
> along, thanks.
> 

Well, I seem to be having problems finding a system to duplicate.  I
guess we'll have to do this the hard way.  Can you reproduce with
acpiphp loaded with debug=1 and send me the dmesg output when you first
load the driver?  Also, do you get the same problem if you boot
undocked, dock, then undock again?

Unfortunately, I'll be out of town next week, but I'll try to look more
at this the week after.


Re: 2.6.25-rc2 Thinkpad T30 docking fails with oops.

2008-02-22 Thread Kristen Carlson Accardi
On Fri, 22 Feb 2008 00:34:17 -0800
Andrew Morton <[EMAIL PROTECTED]> wrote:

> On Tue, 19 Feb 2008 20:47:08 + Paul Martin <[EMAIL PROTECTED]> wrote:
> 
> > Now, this was working in 2.6.23, but is not in any later kernel.
> > Previously reported, but I guess that the previous report was
> > ignored due to the kernel being tainted.
> 
> Not really.  A lot of acpi-related bug reports get the
> crickets-chirping treatment.
> 
> The usual response with acpi is "please raise a bugzilla report", and
> the acpi developers do respond well to bugzilla reports.  But really,
> a recent and oops-causing regression shouldn't require such actions -
> developers should be running around with their hair on fire.
> 
> > Well, there's nothing other than pure Linus here.
> > 
> > When booted up docked, there is no oops until you try to undock.
> > 
> > Whilst I will be looking for replies in the list, I'd appreciate
> > being CC'd on any replies.
> > 
> > ACPI: \_SB_.PCI0.PCI1.DOCK - docking
> > PCI: Transparent bridge - :02:03.0
> > ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCI1.DOCK._PRT]
> > BUG: unable to handle kernel NULL pointer dereference at 
> > IP: [<c01dd6f2>] pdev_sort_resources+0x8a/0x116
> > *pde =  
> > Oops:  [#1] PREEMPT SMP 
> > Modules linked in: radeon drm rfcomm l2cap bluetooth ppdev lp ipv6
> > ext3 jbd mbcache fuse dm_crypt crypto_blkcipher dm_mod
> > cpufreq_stats speedstep_ich speedstep_lib thinkpad_acpi nvram
> > acpiphp bay dock pcmcia firmware_class joydev yenta_socket
> > rsrc_nonstatic pcmcia_core battery irtty_sir ac sir_dev
> > snd_intel8x0 snd_ac97_codec nsc_ircc ac97_bus snd_pcm_oss
> > snd_mixer_oss snd_pcm snd_timer irda parport_pc i2c_i801 button
> > shpchp pci_hotplug crc_ccitt parport snd psmouse intel_agp agpgart
> > iTCO_wdt evdev serio_raw soundcore snd_page_alloc rtc pcspkr xfs
> > floppy e100 mii uhci_hcd usbcore sg sr_mod cdrom sd_mod thermal
> > processor fan ata_piix libata scsi_mod radeonfb fb_ddc i2c_algo_bit
> > i2c_core
> > 
> > Pid: 39, comm: kacpi_notify Not tainted (2.6.25-rc2 #1)
> > EIP: 0060:[<c01dd6f2>] EFLAGS: 00010246 CPU: 0
> > EIP is at pdev_sort_resources+0x8a/0x116
> > EAX:  EBX:  ECX:  EDX: 0fff
> > ESI: f5e30cc4 EDI: f74f1e80 EBP: f74f1e68 ESP: f74f1e48
> >  DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068
> > Process kacpi_notify (pid: 39, ti=f74f task=f74cf120
> > task.ti=f74f) Stack:  f5e30cf0 f74f1e80 f5e30c00
> > 0007 f5e30c00 f74cb414 f74cb400 f74f1e9c c02a9cce f5e30c08
> > f7431f90 f7431978 f5420c00  f7d6d408 f7d6d3c0 f74f1e9c
> >  f7d6d3c8 f7d6d3c0 f74f1ed8 f9b278ca f74cb40c Call Trace:
> >  [<c02a9cce>] ? pci_bus_assign_resources+0x59/0x341
> >  [<f9b278ca>] ? acpiphp_enable_slot+0x2a1/0x3a3 [acpiphp]
> >  [<f9b27ad9>] ? handle_hotplug_event_func+0x63/0x101 [acpiphp]
> >  [<f9b27167>] ? post_dock_fixups+0x6c/0x79 [acpiphp]
> >  [<c013506e>] ? notifier_call_chain+0x2b/0x4a
> >  [<f9b27a76>] ? handle_hotplug_event_func+0x0/0x101 [acpiphp]
> >  [<f9b22168>] ? hotplug_dock_devices+0x39/0xe1 [dock]
> >  [<f9b22441>] ? dock_notify+0x75/0xc0 [dock]
> >  [<c01f83c9>] ? acpi_ev_notify_dispatch+0x4f/0x5a
> >  [<c01f32c0>] ? acpi_os_execute_deferred+0x20/0x2c
> >  [<c012ef41>] ? run_workqueue+0x78/0xfb
> >  [<c01f32a0>] ? acpi_os_execute_deferred+0x0/0x2c
> >  [<c012f79d>] ? worker_thread+0xb6/0xc2
> >  [<c0131b25>] ? autoremove_wake_function+0x0/0x30
> >  [<c012f6e7>] ? worker_thread+0x0/0xc2
> >  [<c0131a52>] ? kthread+0x3b/0x61
> >  [<c0131a17>] ? kthread+0x0/0x61
> >  [<c010568f>] ? kernel_thread_helper+0x7/0x10
> >  ===
> > Code: 50 52 51 ff 75 f0 68 a0 af 34 c0 e8 ce 4e f4 ff 83 c4 1c e9
> > 87 00 00 00 8b 7d e8 8d 42 01 83 7d f0 06 89 7d e0 0f 4e c8 8b 45
> > e0 <8b> 18 31 c0 85 db 74 29 8b 53 04 8b 43 08 89 d7 05 8c 01 00 00
> > EIP: [<c01dd6f2>] pdev_sort_resources+0x8a/0x116 SS:ESP
> > 0068:f74f1e48 ---[ end trace 623ea8d57da4defe ]---
> 
> I suppose that if you're feeling keen, a bisection search as per
> http://www.kernel.org/doc/local/git-quick.html would help things
> along, thanks.
> 

Hi Paul,
I'll try to take a look at this today - I've got a T30 around here
somewhere I can try to duplicate.

Kristen


Re: [PATCH] enclosure: add support for enclosure services

2008-02-13 Thread Kristen Carlson Accardi
On Tue, 12 Feb 2008 13:28:15 -0600
James Bottomley <[EMAIL PROTECTED]> wrote:

> On Tue, 2008-02-12 at 11:07 -0800, Kristen Carlson Accardi wrote:
> > I understand what you are trying to do - I guess I just doubt the
> > value you've added by doing this.  I think that there's going to be
> > so much customization that system vendors will want to add, that
> > they are going to wind up adding a custom library regardless, so
> > standardising those few things won't buy us anything.
> 
> It depends ... if you actually have a use for the customisations, yes.
> If you just want the basics of who (what's in the enclosure), what
> (activity) and where (locate) then I think it solves your problem
> almost entirely.
> 
> So, entirely as a straw horse, tell me what else your enclosures
> provide that I haven't listed in the four points.  The SES standards
> too provide a huge range of things that no-one ever seems to
> implement (temperature, power, fan speeds etc).
> 
> I think the users of enclosures fall into these categories:
> 
> 85% just want to know where their device actually is (i.e. that sdc is
> in enclosure slot 5)
> 50% like watching the activity lights
> 30% want to be able to have a visual locate function
> 20% want a visual failure indication (the other 80% rely on some OS
> notification instead)
> 
> When you add up the overlapping needs, you get about 90% of people
> happy with the basics that the enclosure services provide.  Could
> there be more ... sure; should there be more ... I don't think so ...
> that's what value add the user libraries can provide.
> 
> James
> 
> 

I don't think I'm arguing whether or not your solution may work; what I
am arguing is really a more philosophical point.  Not "can we do it
this way", but "should we do it this way".  I am of the opinion that
management belongs in userspace.  I also am of the opinion that if you
can successfully accomplish something in user space, you should.  I
also believe that even if you provide this basic interface, all system
vendors are going to provide libraries on top of that to customize it,
so you've not added much value to just a simple message passing
interface.

So, I'm happy to defer to Jeff's judgement call here - I just want to
do what's right for our customers and get an enclosure management
interface for SATA exposed, preferably in time for the 2.6.26 merge
window.  If he prefers your design, I'll disagree, but commit to his
decision and try to get this to work for SATA. If he'd rather see
something along the lines of what I proposed, then since it is 100% self
contained in the SATA subsystem, it shouldn't impact whatever you
want to do in the SCSI subsystem.

Jeff?


Re: [PATCH] enclosure: add support for enclosure services

2008-02-12 Thread Kristen Carlson Accardi
On Tue, 12 Feb 2008 12:45:35 -0600
James Bottomley <[EMAIL PROTECTED]> wrote:

> On Tue, 2008-02-12 at 10:22 -0800, Kristen Carlson Accardi wrote:
> > I apologize for taking so long to review this patch.  I obviously
> > agree wholeheartedly with Luben.  The problem I ran into while
> > trying to design an enclosure management interface for the SATA
> > devices is that there is all this vendor defined stuff.  For
> > example, for the AHCI LED protocol, the only "defined" LED is
> > 'activity'.  For LED2 and LED3 it is up to hardware vendors to
> > define these.  For SGPIO there's all kinds of ways for hw vendors
> > to customize.  I felt that it was going to be a maintenance
> > nightmare to have to keep track of various vendors' enclosure
> > implementations in the ahci driver, and that it'd be better to just
> > have user space libraries take care of that.  Plus, that way a
> > vendor doesn't have to get a patch into the kernel to get their new
> > spiffy wizzy bang blinky lights working (think of how long it takes
> > something to even get into a vendor kernel, which is what these
> > guys care about...).  So I'm still not sold on having an enclosure
> > abstraction in the kernel - at least for the SATA controllers.
> 
> Correct me if I'm wrong, but didn't the original AHCI enclosure patch
> expose activity LEDs via sysfs?

You are sort of wrong.  We exposed a sysfs entry to enable a software
controlled activity LED, then the driver was responsible for turning it
on and off. (blech, I know, but some vendors want this feature).

> 
> I'm not saying there aren't a lot of non standard pieces that need to
> be activated by direct commands or other user activated protocol.  I
> am saying there are a lot of standard pieces that we could do with
> showing in a uniform manner.
> 
> The pieces I think are absolutely standard are
> 
> 1. Actual enclosure presence (is this device in an enclosure)
> 2. Activity LED, this seems to be a feature of every enclosure.
> 
> I also think the following are reasonably standard (based on the fact
> that most enclosure standards recommend but don't require this):
> 
> 3. Locate LED (for locating the device).  Even if you only have an
> activity LED, this is usually done by flashing the activity LED in a
> well defined pattern.
> 4. Fault.  This is the least standardised of the lot, but does seem to
> be present in about every enclosure implementation.
> 
> All I've done is standardise these four pieces ... the services
> actually take into account that it might not be possible to do
> certain of these (like fault).
> 
> James
> 
> 

I understand what you are trying to do - I guess I just doubt the value
you've added by doing this.  I think that there's going to be so much
customization that system vendors will want to add, that they are going
to wind up adding a custom library regardless, so standardising those
few things won't buy us anything.
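
For reference, the four pieces James lists above could be modelled
roughly as below.  This is only a sketch to make the discussion
concrete: the names slot_status, slot_ind and slot_set are made up for
illustration and are not taken from the patch under review.  The
"unsupported" state reflects his point that some enclosures cannot do
certain of these (like fault).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tri-state: a slot may not implement a given indicator. */
enum slot_ind {
	SLOT_IND_UNSUPPORTED = -1,
	SLOT_IND_OFF = 0,
	SLOT_IND_ON = 1
};

/* Hypothetical per-slot view of the four "standard" pieces. */
struct slot_status {
	bool          in_enclosure;	/* 1. device sits in an enclosure slot */
	enum slot_ind activity;		/* 2. activity LED */
	enum slot_ind locate;		/* 3. locate LED */
	enum slot_ind fault;		/* 4. fault indication, often missing */
};

/* Set one indicator; returns 0 on success, -1 if the slot cannot do it. */
int slot_set(enum slot_ind *ind, bool on)
{
	assert(ind != NULL);
	if (*ind == SLOT_IND_UNSUPPORTED)
		return -1;
	*ind = on ? SLOT_IND_ON : SLOT_IND_OFF;
	return 0;
}
```

Anything vendor specific beyond these fields would still need its own
path, which is the part I expect the vendor libraries to cover.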


Re: [PATCH] enclosure: add support for enclosure services

2008-02-12 Thread Kristen Carlson Accardi
On Mon, 4 Feb 2008 18:01:36 -0800 (PST)
Luben Tuikov <[EMAIL PROTECTED]> wrote:

> --- On Mon, 2/4/08, James Bottomley
> <[EMAIL PROTECTED]> wrote:
> > > > The enclosure misc device is really just a library providing
> > > > sysfs support for physical enclosure devices and their
> > > > components.
> > > 
> > > Who is the target audience/user of those facilities?
> > > a) The kernel itself needing to read/write SES pages?
> > 
> > That depends on the enclosure integration, but right at the
> > moment, it doesn't
> 
> Yes, I didn't suspect so.
> 
> > 
> > > b) A user space application using sysfs to read/write
> > >    SES pages?
> > 
> > Not an application so much as a user.  The idea of sysfs is
> > to allow users to get and set the information in addition to
> > applications.
> 
> Exactly the same argument stands for a user-space
> application with a user-space library.
> 
> This is the classical case of where it is better to
> do this in user-space as opposed to the kernel.
> 
> The kernel provides capability to access the SES
> device.  The user space application and library
> provide interpretation and control.  Thus if the
> enclosure were upgraded, one doesn't need to
> upgrade their kernel in order to utilize the new
> capabilities of the SES device.  Plus upgrading
> a user-space application is a lot easier than
> the kernel (and no reboot necessary).
> 
> Consider another thing: vendors would really like
> unprecedented access to the SES device in the enclosure
> so as your ses/enclosure code keeps state it would
> get out of sync when vendor user-space enclosure
> applications access (and modify) the SES device's
> pages.
> 
> You can test this yourself: submit a patch
> that removes SES /dev/sgX support; advertise your
> ses/class solution and watch the fun.
> 
> > > At the moment SES device management is done via
> > > an application (user-space) and a user-space library
> > > used by the application and /dev/sgX to send SCSI
> > > commands to the SES device.
> > 
> > I must have missed that when I was looking for
> > implementations; what's the URL?
> 
> I'm not aware of any GPLed ones.  That doesn't
> necessarily mean that the best course of action is
> to bloat the kernel.  You can move your ses/enclosure
> stuff to a user space application library
> and thus start a GPLed one.
> 
> > But, if we have non-scsi enclosures to integrate, that
> > makes it harder for a user application because it has to
> > know all the implementations.
> 
> So does the kernel.  And as I pointed out above, it
> is a lot easier to upgrade a user-space application and
> library than it is to upgrade a new kernel and having
> to reboot the computer to run the new kernel.
> 
> > A sysfs framework on the other hand is a universal known
> > thing for the user applications.
> 
> So would a user-space ses library, a la libses.so.
> 
> > > One could have a very good argument to not bloat
> > > the kernel with this but leave it to a user-space
> > > application and a library to do all this and
> > > communicate with the SES device via the kernel's
> > > /dev/sgX.
> > 
> > The same thing goes for other esoteric SCSI infrastructure
> > pieces like cd changers.  On the whole, given that ATA is
> > asking for enclosure management in kernel, it makes sense to
> > consolidate the infrastructure and a ses ULD is a very good
> > test bed.
> 
> What is wrong with exporting the SES device as /dev/sgX
> and having a user-space application and library to
> do all this?
> 
> Luben
> 

Hi,
I apologize for taking so long to review this patch.  I obviously agree
wholeheartedly with Luben.  The problem I ran into while trying to
design an enclosure management interface for the SATA devices is that
there is all this vendor defined stuff.  For example, for the AHCI LED
protocol, the only "defined" LED is 'activity'.  For LED2 and LED3 it
is up to hardware vendors to define these.  For SGPIO there's all kinds
of ways for hw vendors to customize.  I felt that it was going to be a
maintenance nightmare to have to keep track of various vendors'
enclosure implementations in the ahci driver, and that it'd be better
to just have user space libraries take care of that.  Plus, that way a
vendor doesn't have to get a patch into the kernel to get their new
spiffy wizzy bang blinky lights working (think of how long it takes
something to even get into a vendor kernel, which is what these guys
care about...).  So I'm still not sold on having an enclosure
abstraction in the kernel - at least for the SATA controllers.
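
To make the user space route concrete, here is a rough sketch of what
Luben describes: the kernel only passes the command through /dev/sgX,
and a library builds the RECEIVE DIAGNOSTIC RESULTS request and
interprets the page it gets back.  The function names are made up for
illustration, the 5 second timeout and sense buffer size are arbitrary,
and none of this has been run against a real enclosure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

#define RECEIVE_DIAGNOSTIC_RESULTS 0x1c	/* SCSI opcode */

/* Build the 6-byte CDB asking for one diagnostic page of alloc_len bytes. */
void ses_build_rdr_cdb(uint8_t cdb[6], uint8_t page_code, uint16_t alloc_len)
{
	assert(cdb != NULL);
	memset(cdb, 0, 6);
	cdb[0] = RECEIVE_DIAGNOSTIC_RESULTS;
	cdb[1] = 0x01;				/* PCV: page code is valid */
	cdb[2] = page_code;
	cdb[3] = (uint8_t)(alloc_len >> 8);	/* allocation length, big-endian */
	cdb[4] = (uint8_t)(alloc_len & 0xff);
}

/* Read one page through an open /dev/sgX fd; 0 on success, -1 on error. */
int ses_read_page(int fd, uint8_t page_code, uint8_t *buf, uint16_t len)
{
	uint8_t cdb[6], sense[32];
	struct sg_io_hdr hdr;

	ses_build_rdr_cdb(cdb, page_code, len);
	memset(&hdr, 0, sizeof(hdr));
	hdr.interface_id = 'S';
	hdr.dxfer_direction = SG_DXFER_FROM_DEV;
	hdr.cmdp = cdb;
	hdr.cmd_len = sizeof(cdb);
	hdr.dxferp = buf;
	hdr.dxfer_len = len;
	hdr.sbp = sense;
	hdr.mx_sb_len = sizeof(sense);
	hdr.timeout = 5000;			/* milliseconds */

	if (ioctl(fd, SG_IO, &hdr) < 0)
		return -1;
	return (hdr.status || hdr.host_status || hdr.driver_status) ? -1 : 0;
}
```

A caller would open the SES device's sg node, ask for a page (for
example page code 0x02, the enclosure status page), and decode the
vendor defined parts itself, with no kernel change needed when the
enclosure firmware grows new fields.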

Kristen

Re: [patch] ata: ahci: Enclosure Management via LED rev2

2007-12-03 Thread Kristen Carlson Accardi
On Sat, 01 Dec 2007 18:28:54 -0500
Jeff Garzik <[EMAIL PROTECTED]> wrote:

> Kristen Carlson Accardi wrote:
> > Enclosure Management via LED
> > 
> > This patch implements Enclosure Management via the LED protocol as
> > specified in AHCI specification.
> > 
> > Signed-off-by: Kristen Carlson Accardi <[EMAIL PROTECTED]>
> > ---
> > This revision makes the change to the comment requested by Mark
> > Lord, fixes some bugs in the bit shifting for writing the new led
> > state, and implements a show function so that led status can be
> > read as well as written.
> 
> Overall looks pretty good, from a technical review perspective.
> 
> Two worries:
> 
> 1) exporting ata_scsi_find_dev(), and assuming a scsi device is 
> attached.  The latter can be fixed by a !NULL check (and should be),
> but it's a bit of a layering violation since long term we want to make
> the SCSI simulator optional for all ATA devices.
> 
> 2) vaguely related to #1, I'm not so sure the attributes should be 
> implemented directly in ahci.  If this __or something like it__
> appears on non-Intel hardware, the code should be somewhere more
> generic.
> 

When I first started developing this patch, I did have a more generic
approach, but it added a lot of complexity that isn't needed for this simple
protocol, and one of the problems I encountered was that for different
EM protocols, you'd probably want to have a different set of attributes
defined.  Also, even using the same protocol, you may have hardware
that supports more attributes.  For example, in the case of this LED
protocol, some implementations may support the Activity LED (software
controlled), and some may not.  For protocols like SGPIO, they have a
lot of attributes defined by the spec, but I'm guessing hardware may
not support all of them.  When I tried to abstract hardware and
protocol away and make some kind of generic enclosure management
framework, it turned into this big ordeal.  So, I can keep going along
those lines, but to me it started to seem silly since I had no other
hardware that I knew of that was going to be helped by all this.  I
thought maybe the right thing to do was to keep it simple and then wait
for other hardware to appear.
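
To show how small the LED protocol case is, here is a rough sketch of
the state packing involved.  The field layout (port number in the low
bits, port multiplier slot in bits 15:8, LED state in the upper 16 bits
as three 3-bit fields) is my reading of the AHCI LED message format and
should be checked against the spec; the names are made up for this
example and are not the ones used in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of the 32-bit LED message word (verify against spec). */
#define LED_MSG_PORT_MASK	0x0000000fu	/* HBA port number */
#define LED_MSG_PMP_SHIFT	8		/* port multiplier slot, bits 15:8 */
#define LED_MSG_VALUE_SHIFT	16		/* LED state, bits 31:16 */
#define LED_FIELD_BITS		3		/* assumed width of each LED field */

enum led_index { LED_ACTIVITY = 0, LED2 = 1, LED3 = 2 };

/* Combine port, port multiplier slot and LED state into one message word. */
uint32_t led_msg_pack(unsigned port, unsigned pmp, uint16_t value)
{
	return (port & LED_MSG_PORT_MASK) |
	       ((uint32_t)(pmp & 0xffu) << LED_MSG_PMP_SHIFT) |
	       ((uint32_t)value << LED_MSG_VALUE_SHIFT);
}

/* Replace one 3-bit LED field in the state, leaving the others alone. */
uint16_t led_value_set(uint16_t value, enum led_index led, unsigned state)
{
	unsigned shift = (unsigned)led * LED_FIELD_BITS;
	uint16_t mask = (uint16_t)(0x7u << shift);

	assert(state <= 0x7u);
	return (uint16_t)((value & ~mask) | (state << shift));
}
```

What LED2 and LED3 mean is left to the hardware vendor, which is why a
generic attribute set on top of this gets complicated quickly.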