Re: [PATCH 5/9] x86/dumpstack: Improve opcodes dumping in the Code: section

2018-03-09 Thread Borislav Petkov
On Thu, Mar 08, 2018 at 03:20:58PM -0800, Linus Torvalds wrote:
> Yes, so the reason the prologue is more important is that there's
> really two cases for the "Code:" line:

Yap, I had a hunch it must be about some of those but thanks for taking
the time and writing it down!

I've tried to summarize the reasoning and slap it as a comment above it
so that it is clear why we do it this way:

---
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index c3aa0e29513e..7a21098c9128 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -71,6 +71,25 @@ static void printk_stack_address(unsigned long address, int 
reliable,
printk("%s %s%pB\n", log_lvl, reliable ? "" : "? ", (void *)address);
 }
 
+/*
+ * The are a couple of reasons for the 2/3rd prologue, courtesy of Linus:
+ *
+ * In case where we don't have the exact kernel image (which, if we did, we can
+ * simply disassemble and navigate to the RIP), the purpose of the bigger
+ * prologue is to have more context and to be able to correlate the code from
+ * the different toolchains better.
+ *
+ * In addition, it helps in recreating the register allocation of the failing
+ * kernel and thus make sense of the register dump.
+ *
+ * What is more, the additional complication of a variable length insn arch 
like
+ * x86 warrants having longer byte sequence before rIP so that the disassembler
+ * can "sync" up properly and find instruction boundaries when decoding the
+ * opcode bytes.
+ *
+ * Thus, the 2/3rds prologue and 64 byte OPCODE_BUFSIZE is just a random
+ * guesstimate in attempt to achieve all of the above.
+ */
 void show_opcodes(u8 *rip, const char *loglvl)
 {
 #define OPCODE_BUFSIZE 64

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH 5/9] x86/dumpstack: Improve opcodes dumping in the Code: section

2018-03-08 Thread Linus Torvalds
On Thu, Mar 8, 2018 at 2:36 PM, Borislav Petkov  wrote:
>
> Btw, do we have any explanation for the two-thirds prologue? I dug it
> out to the patch below but it doesn't say why the prologue being bigger
> is more important than the epilogue.

Yes, so the reason the prologue is more important is that there's
really two cases for the "Code:" line:

 (a) it doesn't matter at all, because you have the kernel image
anyway, and can just get the code from there based on the faulting
instruction pointer address.

 (b) it ends up helping you match things up roughly, because maybe you
have a different compiler version and different code generation
(and/or maybe just different config options)

In the (a) case, the code doesn't matter at all, and you might as well
just look at the whole function by just doing "gdb vmlinux" and then
disassemble it. You have everything just from %rip.

In the (b) case, there are two reasons to give code "around" the
faulting instruction pointer:

 - to help match up when you compare your not-identical vmlinux sequence

 - to actually figure out what the register allocation in the faulting
kernel was, so that you can make sense of the register dump, because
your local kernel might have completely different register allocaiton.

And for that *second* case, the instructions that _precede_ the
faulting instructions are generally much more important than the
instructions that follow the faulting instruction.

The instructions that *follow* the faulting instructions generally
have little or no relevance for the register dump (obviously loops etc
_can_ make them  matter, but you see the rough point).

So the instructions that precede the faulting instruction are in many
ways much more relevant than the instructions that follow it. So you
want more of those.

Also, the instructions that precede the faulting instruction have
another issue - with variable-length instructions, it's hard to sync
up things. Even if we were to use a disassembler (which we almost
certainly don't at fault time), walking backwards can easily be
ambiguous.

So you want extra "slop" in the bytes that get dumped before the
instruction, because you don't know quite where the instruction
boundaries may be.  In contrast, the faulting instruction itself - and
the instrucvtions that follow it - you can tell where the instruction
boundaries are (unless you had an exception due to a corrupted
indirect branch, which does happen but is quite rare).

So there's two reasons for why the bytes _before_ the instruction are
more important than the bytes _after_ the instruction: the register
state is more relevant for them, and the slop for decoding.

The "two thirds" then just comes from that. It's just a random
estimate of "instruction bytes that precede the faulting instructions
are more important than later ones".

In fact, to some degree the "one third" can be seen as "the faulting
instruction itself needs some space".

An x86 instruction can be at most 15 bytes, and you do likely want at
least that. Any bytes after the first instruction aren't _useless_
(because they definitely can help line things up when you have
slightly different code generation due to compiler versions etc), so
"at least 15 bytes, and maybe a bit more" again makes that 2/3rds
thing work out well when you print out 64 bytes.

  Linus

PS. Historical side note:

We haven't always printed out that much. This is Linux-0.01:

for(i=0;i<10;i++)
printk("%02x ",0xff & get_seg_byte(esp[1],(i+(char *)esp[0])));

so it just printed out 10 bytes, basically just the faulting
instruction (and not even all of it if it has a lot of immediates).

We eventually extended the 10 bytes into 20 bytes, again starting at
the existing instruction, because if you don't print out a lot, you
need to start at the faulting instruction itself because of slop.

The expansion from 20 to 64 bytes happened in 2004 in v2.6.9. And
that's also when we started dumping preceding instructions, because
once you print out a fair chunk of bytes, that's when "preceding
instructions are more important" comes in. Because before that, you
want to guarantee that you at least print out the faulting
instruction.


Re: [PATCH 5/9] x86/dumpstack: Improve opcodes dumping in the Code: section

2018-03-08 Thread Borislav Petkov
On Thu, Mar 08, 2018 at 10:00:09AM -0800, Linus Torvalds wrote:
> On Thu, Mar 8, 2018 at 2:16 AM, Borislav Petkov  wrote:
> > +#define OPCODE_BUFSIZE 64
> > +   unsigned int code_prologue = OPCODE_BUFSIZE * 43 / OPCODE_BUFSIZE;
> 
> Heh.
> 
> That's a very odd way of writing "43".

I was simply search-replacing code_bytes :-)

> Honestly, the "43" is just "two thirds" rounded to closest, and it's
> not important anyway, so I think you should just write it as
> 
> unsigned int code_prologue = OPCODE_BUFSIZE * 2 / 3;

Btw, do we have any explanation for the two-thirds prologue? I dug it
out to the patch below but it doesn't say why the prologue being bigger
is more important than the epilogue.

I would've made it half and half but I guess it is more important to see
the opcode bytes leading to rip... Oh well.

> and never mind that it will now be 42.
>
> 42 is obviously the right answer anyway, which makes me think we got
> it wrong earlier.

Doh, of course! What was I thinking?!? :-)

Done.

---
>From 313c2652ce75899d7801b021ed40e5b8ef233aca Mon Sep 17 00:00:00 2001
From: Keith Owens 
Date: Sun, 22 Aug 2004 22:36:42 -0700
Subject: [PATCH] [PATCH] i386 oops output: dump preceding code

This teaches the i386 oops dumper to dump opcodes preceding and after the
offending EIP.  Supporting code against ksymoops has been tested and produces
output like the below.

Support for this was added to ksymoops-2.4.9.

Note that ksymoops will guarantee that the disassembly after the  value
is always in sync - if the disassembly from the start of the Code: line does
not sync up with the EIP address ksymoops will perform the resync.


Warning (merge_maps): no symbols in merged map
Mar 18 23:47:36 vmm kernel: kernel BUG at fs/open.c:802!
Mar 18 23:47:36 vmm kernel: invalid operand:  [#1]
Mar 18 23:47:36 vmm kernel: CPU:0
Mar 18 23:47:36 vmm kernel: EIP:0060:[] VLINot tainted
Using defaults from ksymoops -t elf32-i386 -a i386
Mar 18 23:47:36 vmm kernel: EFLAGS: 00010246
Mar 18 23:47:36 vmm kernel: eax: ccdfb900   ebx: 4001020d   ecx:    
edx: 007b
Mar 18 23:47:36 vmm kernel: esi:    edi: bfffdd70   ebp: ccdfdfbc   
esp: ccdfdfb0
Mar 18 23:47:36 vmm kernel: ds: 007b   es: 007b   ss: 0068
Mar 18 23:47:36 vmm kernel: Stack: 4001020d  bfffdd70 ccdfc000 c0109213 
4001020d  0003
Mar 18 23:47:36 vmm kernel: bfffdd70 bfffdc88 0005 007b 
007b 0005 4000ef94
Mar 18 23:47:36 vmm kernel:0073 0206 bfffdbd8 007b
Mar 18 23:47:36 vmm kernel: Call Trace:
Mar 18 23:47:36 vmm kernel:  [] syscall_call+0x7/0xb
Mar 18 23:47:36 vmm kernel: Code: 14 98 f0 81 41 04 00 00 00 01 5b 89 ec 5d c3 
90 b8 00 e0 ff ff 21 e0 55 89 e5 57 56 53 8b 00 81 b8 e4 01 00 00 0f 27 00 00 
75 08 <0f> 0b 22 03 85 18 2f c0 8b 45 08 50 e8 30 d4 00 00 89 c7 83 c4


>>EIP; c014fedf No symbols available   <=

Trace; c0109213 No symbols available

This architecture has variable length instructions, decoding before eip
is unreliable, take these instructions with a pinch of salt.

Code;  c014feb4 No symbols available
 <_EIP>:
Code;  c014feb4 No symbols available
   0:   14 98 adc$0x98,%al
Code;  c014feb6 No symbols available
   2:   f0 81 41 04 00 00 00  lock addl $0x100,0x4(%ecx)
Code;  c014febd No symbols available
   9:   01
Code;  c014febe No symbols available
   a:   5bpop%ebx
Code;  c014febf No symbols available
   b:   89 ec mov%ebp,%esp
Code;  c014fec1 No symbols available
   d:   5dpop%ebp
Code;  c014fec2 No symbols available
   e:   c3ret
Code;  c014fec3 No symbols available
   f:   90nop
Code;  c014fec4 No symbols available
  10:   b8 00 e0 ff ffmov$0xe000,%eax
Code;  c014fec9 No symbols available
  15:   21 e0 and%esp,%eax
Code;  c014fecb No symbols available
  17:   55push   %ebp
Code;  c014fecc No symbols available
  18:   89 e5 mov%esp,%ebp
Code;  c014fece No symbols available
  1a:   57push   %edi
Code;  c014fecf No symbols available
  1b:   56push   %esi
Code;  c014fed0 No symbols available
  1c:   53push   %ebx
Code;  c014fed1 No symbols available
  1d:   8b 00 mov(%eax),%eax
Code;  c014fed3 No symbols available
  1f:   81 b8 e4 01 00 00 0f  cmpl   $0x270f,0x1e4(%eax)
Code;  c014feda No symbols available
  26:   27 00 00
Code;  c014fedd No symbols available
  29:   75 08 jne33 <_EIP+0x33> c014fee7 No symbols 
available

This decode from eip onwards should be reliable

Code;  c014fedf No symbols available
 <_EIP>:
Code;  c014fedf No symbols available   <=
   0:   0f 0b ud2a  <=
Code;  c014fee1 No symbols available
   2:   22 0

Re: [PATCH 5/9] x86/dumpstack: Improve opcodes dumping in the Code: section

2018-03-08 Thread Linus Torvalds
On Thu, Mar 8, 2018 at 2:16 AM, Borislav Petkov  wrote:
> +#define OPCODE_BUFSIZE 64
> +   unsigned int code_prologue = OPCODE_BUFSIZE * 43 / OPCODE_BUFSIZE;

Heh.

That's a very odd way of writing "43".

Honestly, the "43" is just "two thirds" rounded to closest, and it's
not important anyway, so I think you should just write it as

unsigned int code_prologue = OPCODE_BUFSIZE * 2 / 3;

and never mind that it will now be 42.

42 is obviously the right answer anyway, which makes me think we got
it wrong earlier.

 Linus


Re: [PATCH 5/9] x86/dumpstack: Improve opcodes dumping in the Code: section

2018-03-08 Thread Borislav Petkov
On Wed, Mar 07, 2018 at 01:08:32PM -0800, Linus Torvalds wrote:
> On Wed, Mar 7, 2018 at 5:25 AM, Josh Poimboeuf  wrote:
> >
> > How about we just remove the 'code_bytes=' option?  (Or at the very
> > least, reduce its possible range to a reasonable max?)
> 
> Ack. Just limit it to 64 bytes max sounds plenty.

Done, combined diff ontop. With a 64-byte on-stack opcodes buffer:

---
diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index b37c1c30c16f..2c74c1694d9d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -587,11 +587,6 @@
Sets the size of memory pool for coherent, atomic dma
allocations, by default set to 256K.
 
-   code_bytes  [X86] How many bytes of object code to print
-   in an oops report.
-   Range: 0 - 8192
-   Default: 64
-
com20020=   [HW,NET] ARCnet - COM20020 chipset
Format:

[,[,[,[,[,]
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 409a5bd02a18..2fc009a1824e 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -22,13 +22,9 @@
 #include 
 #include 
 
-#define OPCODE_BUFSIZE 64
-
 int panic_on_unrecovered_nmi;
 int panic_on_io_nmi;
-static unsigned int code_bytes = OPCODE_BUFSIZE;
-static u8 __opc[OPCODE_BUFSIZE];
-static u8 *opcodes = __opc;
+
 static int die_counter;
 
 static struct pt_regs exec_summary_regs;
@@ -77,19 +73,21 @@ static void printk_stack_address(unsigned long address, int 
reliable,
 
 void show_opcodes(u8 *rip, const char *loglvl)
 {
-   unsigned int code_prologue = code_bytes * 43 / OPCODE_BUFSIZE;
+#define OPCODE_BUFSIZE 64
+   unsigned int code_prologue = OPCODE_BUFSIZE * 43 / OPCODE_BUFSIZE;
+   u8 opcodes[OPCODE_BUFSIZE];
u8 *ip;
int i;
 
printk("%sCode: ", loglvl);
 
ip = (u8 *)rip - code_prologue;
-   if (probe_kernel_read(opcodes, ip, code_bytes)) {
+   if (probe_kernel_read(opcodes, ip, OPCODE_BUFSIZE)) {
pr_cont(" Bad RIP value.\n");
return;
}
 
-   for (i = 0; i < code_bytes; i++, ip++) {
+   for (i = 0; i < OPCODE_BUFSIZE; i++, ip++) {
if (ip == (u8 *)rip)
pr_cont("<%02x> ", opcodes[i]);
else
@@ -387,34 +385,6 @@ void die(const char *str, struct pt_regs *regs, long err)
oops_end(flags, regs, sig);
 }
 
-static int __init code_bytes_setup(char *s)
-{
-   unsigned long val;
-   ssize_t ret;
-
-   if (!s)
-   return -EINVAL;
-
-   ret = kstrtoul(s, 0, &val);
-   if (ret)
-   return ret;
-
-   code_bytes = val;
-   if (code_bytes > 8192)
-   code_bytes = 8192;
-
-   if (code_bytes > OPCODE_BUFSIZE) {
-   u8 *new_buf = kzalloc(code_bytes, GFP_KERNEL);
-   if (!new_buf)
-   return -ENOMEM;
-
-   opcodes = new_buf;
-   }
-
-   return 1;
-}
-__setup("code_bytes=", code_bytes_setup);
-
 void show_regs(struct pt_regs *regs)
 {
bool all = true;

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH 5/9] x86/dumpstack: Improve opcodes dumping in the Code: section

2018-03-07 Thread Linus Torvalds
On Wed, Mar 7, 2018 at 5:25 AM, Josh Poimboeuf  wrote:
>
> How about we just remove the 'code_bytes=' option?  (Or at the very
> least, reduce its possible range to a reasonable max?)

Ack. Just limit it to 64 bytes max sounds plenty.

 Linus


Re: [PATCH 5/9] x86/dumpstack: Improve opcodes dumping in the Code: section

2018-03-07 Thread Borislav Petkov
On Wed, Mar 07, 2018 at 07:25:35AM -0600, Josh Poimboeuf wrote:
> How about we just remove the 'code_bytes=' option?

Haha, removing stuff is my usual solution :-)

I'd love to.

> (Or at the very
> least, reduce its possible range to a reasonable max?)
> 
> I doubt anybody actually uses it.  I'd never heard of it before, nor
> have I ever seen an oops with a long code dump.  I can't fathom why
> somebody would even need it.  64 bytes is plenty, and an 8k code dump
> just sounds insane.

Yeah, I see a Chuck Ebbert in git log output with a gmail account, maybe
that's the same person. (I've assumed he's not at RH anymore, otherwise
you would've CCed him :-)).

Let's ask him. CCed.

> It comes from the following commit:
> 
> commit 86c418374223be3f328b5522545196db02c8ceda
> Author: Chuck Ebbert 
> Date:   Tue Feb 13 13:26:25 2007 +0100
> 
> [PATCH] i386: add option to show more code in oops reports
> 
> Sometimes developers need to see more object code in an oops report,
> e.g. when kernel may be corrupted at runtime.
> 
> Add the "code_bytes" option for this.
> 
> Signed-off-by: Chuck Ebbert 
> Signed-off-by: Andi Kleen 
> Cc: Andi Kleen 
> Signed-off-by: Andrew Morton 
> 
> But I've never seen a case where somebody needed to use it.
> 
> -- 
> Josh

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH 5/9] x86/dumpstack: Improve opcodes dumping in the Code: section

2018-03-07 Thread Josh Poimboeuf
On Wed, Mar 07, 2018 at 11:13:14AM +0100, Borislav Petkov wrote:
> And that is fine if I do a 64-byte default, on-stack buffer but that
> code_bytes= thing can be 2 pages max which is yuck. No way I'm doing
> on-stack buffers then.

How about we just remove the 'code_bytes=' option?  (Or at the very
least, reduce its possible range to a reasonable max?)

I doubt anybody actually uses it.  I'd never heard of it before, nor
have I ever seen an oops with a long code dump.  I can't fathom why
somebody would even need it.  64 bytes is plenty, and an 8k code dump
just sounds insane.

It comes from the following commit:

commit 86c418374223be3f328b5522545196db02c8ceda
Author: Chuck Ebbert 
Date:   Tue Feb 13 13:26:25 2007 +0100

[PATCH] i386: add option to show more code in oops reports

Sometimes developers need to see more object code in an oops report,
e.g. when kernel may be corrupted at runtime.

Add the "code_bytes" option for this.

Signed-off-by: Chuck Ebbert 
Signed-off-by: Andi Kleen 
Cc: Andi Kleen 
Signed-off-by: Andrew Morton 

But I've never seen a case where somebody needed to use it.

-- 
Josh


Re: [PATCH 5/9] x86/dumpstack: Improve opcodes dumping in the Code: section

2018-03-07 Thread Borislav Petkov
On Tue, Mar 06, 2018 at 10:47:56AM -0800, Linus Torvalds wrote:
> Are these always serialized? For oopses, I think we end up serializing
> with die_lock, but is that always the case?

Hmm, good question.

> Maybe at least a comment about why a static allocation is ok?

Well, I'm afraid it is not ok but let me show what I'm seeing - maybe
I'm wrong somewhere:

Normally, when something calls die() we do this:

die
|-> oops_begin
|-> arch_spin_lock(&die_lock)   <-- grab die_lock
|-> __die
|-> show_regs
|-> __show_regs
|-> show_iret_regs
|-> show_ip
|-> show_opcodes

and we dump fine here.

But, if, for example, a #PF happens while we die(), we could do this:

do_page_fault
|-> __do_page_fault
|-> bad_area_nosemaphore
|-> __bad_area_nosemaphore
|-> show_signal_msg
|-> show_opcodes

that's the catch-all case in:

if (unlikely(fault_in_kernel_space(address))) {

and that doesn't sync with the die_lock, AFAICT, and we're walking all
over the opcodes buffer.

Unless I'm missing something, that is.

If I'm not, then I guess I need to think about a better way to solve
this. Because I like the improvement of not having to probe_kernel_read()
byte-by-byte but read it all at once.

And that is fine if I do a 64-byte default, on-stack buffer but that
code_bytes= thing can be 2 pages max which is yuck. No way I'm doing
on-stack buffers then.

Hmm, I need to think about it.

Thanks!

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH 5/9] x86/dumpstack: Improve opcodes dumping in the Code: section

2018-03-06 Thread Linus Torvalds
On Tue, Mar 6, 2018 at 1:49 AM, Borislav Petkov  wrote:
>
> Make it read the whole buffer of code_bytes size in one go. By default
> use a statically allocated 64 bytes buffer. If "code_bytes=" is supplied
> on the cmdline a new buffer gets allocated.

Are these always serialized? For oopses, I think we end up serializing
with die_lock, but is that always the case?

Maybe at least a comment about why a static allocation is ok?

 Linus