Re: [PATCH 5/9] x86/dumpstack: Improve opcodes dumping in the Code: section
On Thu, Mar 08, 2018 at 03:20:58PM -0800, Linus Torvalds wrote: > Yes, so the reason the prologue is more important is that there's > really two cases for the "Code:" line: Yap, I had a hunch it must be about some of those but thanks for taking the time and writing it down! I've tried to summarize the reasoning and slap it as a comment above it so that it is clear why we do it this way: --- diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c index c3aa0e29513e..7a21098c9128 100644 --- a/arch/x86/kernel/dumpstack.c +++ b/arch/x86/kernel/dumpstack.c @@ -71,6 +71,25 @@ static void printk_stack_address(unsigned long address, int reliable, printk("%s %s%pB\n", log_lvl, reliable ? "" : "? ", (void *)address); } +/* + * The are a couple of reasons for the 2/3rd prologue, courtesy of Linus: + * + * In case where we don't have the exact kernel image (which, if we did, we can + * simply disassemble and navigate to the RIP), the purpose of the bigger + * prologue is to have more context and to be able to correlate the code from + * the different toolchains better. + * + * In addition, it helps in recreating the register allocation of the failing + * kernel and thus make sense of the register dump. + * + * What is more, the additional complication of a variable length insn arch like + * x86 warrants having longer byte sequence before rIP so that the disassembler + * can "sync" up properly and find instruction boundaries when decoding the + * opcode bytes. + * + * Thus, the 2/3rds prologue and 64 byte OPCODE_BUFSIZE is just a random + * guesstimate in attempt to achieve all of the above. + */ void show_opcodes(u8 *rip, const char *loglvl) { #define OPCODE_BUFSIZE 64 -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply.
Re: [PATCH 5/9] x86/dumpstack: Improve opcodes dumping in the Code: section
On Thu, Mar 8, 2018 at 2:36 PM, Borislav Petkov wrote: > > Btw, do we have any explanation for the two-thirds prologue? I dug it > out to the patch below but it doesn't say why the prologue being bigger > is more important than the epilogue. Yes, so the reason the prologue is more important is that there's really two cases for the "Code:" line: (a) it doesn't matter at all, because you have the kernel image anyway, and can just get the code from there based on the faulting instruction pointer address. (b) it ends up helping you match things up roughly, because maybe you have a different compiler version and different code generation (and/or maybe just different config options) In the (a) case, the code doesn't matter at all, and you might as well just look at the whole function by just doing "gdb vmlinux" and then disassemble it. You have everything just from %rip. In the (b) case, there are two reasons to give code "around" the faulting instruction pointer: - to help match up when you compare your not-identical vmlinux sequence - to actually figure out what the register allocation in the faulting kernel was, so that you can make sense of the register dump, because your local kernel might have completely different register allocaiton. And for that *second* case, the instructions that _precede_ the faulting instructions are generally much more important than the instructions that follow the faulting instruction. The instructions that *follow* the faulting instructions generally have little or no relevance for the register dump (obviously loops etc _can_ make them matter, but you see the rough point). So the instructions that precede the faulting instruction are in many ways much more relevant than the instructions that follow it. So you want more of those. Also, the instructions that precede the faulting instruction have another issue - with variable-length instructions, it's hard to sync up things. Even if we were to use a disassembler (which we almost certainly don't at fault time), walking backwards can easily be ambiguous. So you want extra "slop" in the bytes that get dumped before the instruction, because you don't know quite where the instruction boundaries may be. In contrast, the faulting instruction itself - and the instrucvtions that follow it - you can tell where the instruction boundaries are (unless you had an exception due to a corrupted indirect branch, which does happen but is quite rare). So there's two reasons for why the bytes _before_ the instruction are more important than the bytes _after_ the instruction: the register state is more relevant for them, and the slop for decoding. The "two thirds" then just comes from that. It's just a random estimate of "instruction bytes that precede the faulting instructions are more important than later ones". In fact, to some degree the "one third" can be seen as "the faulting instruction itself needs some space". An x86 instruction can be at most 15 bytes, and you do likely want at least that. Any bytes after the first instruction aren't _useless_ (because they definitely can help line things up when you have slightly different code generation due to compiler versions etc), so "at least 15 bytes, and maybe a bit more" again makes that 2/3rds thing work out well when you print out 64 bytes. Linus PS. Historical side note: We haven't always printed out that much. This is Linux-0.01: for(i=0;i<10;i++) printk("%02x ",0xff & get_seg_byte(esp[1],(i+(char *)esp[0]))); so it just printed out 10 bytes, basically just the faulting instruction (and not even all of it if it has a lot of immediates). We eventually extended the 10 bytes into 20 bytes, again starting at the existing instruction, because if you don't print out a lot, you need to start at the faulting instruction itself because of slop. The expansion from 20 to 64 bytes happened in 2004 in v2.6.9. And that's also when we started dumping preceding instructions, because once you print out a fair chunk of bytes, that's when "preceding instructions are more important" comes in. Because before that, you want to guarantee that you at least print out the faulting instruction.
Re: [PATCH 5/9] x86/dumpstack: Improve opcodes dumping in the Code: section
On Thu, Mar 08, 2018 at 10:00:09AM -0800, Linus Torvalds wrote: > On Thu, Mar 8, 2018 at 2:16 AM, Borislav Petkov wrote: > > +#define OPCODE_BUFSIZE 64 > > + unsigned int code_prologue = OPCODE_BUFSIZE * 43 / OPCODE_BUFSIZE; > > Heh. > > That's a very odd way of writing "43". I was simply search-replacing code_bytes :-) > Honestly, the "43" is just "two thirds" rounded to closest, and it's > not important anyway, so I think you should just write it as > > unsigned int code_prologue = OPCODE_BUFSIZE * 2 / 3; Btw, do we have any explanation for the two-thirds prologue? I dug it out to the patch below but it doesn't say why the prologue being bigger is more important than the epilogue. I would've made it half and half but I guess it is more important to see the opcode bytes leading to rip... Oh well. > and never mind that it will now be 42. > > 42 is obviously the right answer anyway, which makes me think we got > it wrong earlier. Doh, of course! What was I thinking?!? :-) Done. --- >From 313c2652ce75899d7801b021ed40e5b8ef233aca Mon Sep 17 00:00:00 2001 From: Keith Owens Date: Sun, 22 Aug 2004 22:36:42 -0700 Subject: [PATCH] [PATCH] i386 oops output: dump preceding code This teaches the i386 oops dumper to dump opcodes preceding and after the offending EIP. Supporting code against ksymoops has been tested and produces output like the below. Support for this was added to ksymoops-2.4.9. Note that ksymoops will guarantee that the disassembly after the value is always in sync - if the disassembly from the start of the Code: line does not sync up with the EIP address ksymoops will perform the resync. Warning (merge_maps): no symbols in merged map Mar 18 23:47:36 vmm kernel: kernel BUG at fs/open.c:802! Mar 18 23:47:36 vmm kernel: invalid operand: [#1] Mar 18 23:47:36 vmm kernel: CPU:0 Mar 18 23:47:36 vmm kernel: EIP:0060:[] VLINot tainted Using defaults from ksymoops -t elf32-i386 -a i386 Mar 18 23:47:36 vmm kernel: EFLAGS: 00010246 Mar 18 23:47:36 vmm kernel: eax: ccdfb900 ebx: 4001020d ecx: edx: 007b Mar 18 23:47:36 vmm kernel: esi: edi: bfffdd70 ebp: ccdfdfbc esp: ccdfdfb0 Mar 18 23:47:36 vmm kernel: ds: 007b es: 007b ss: 0068 Mar 18 23:47:36 vmm kernel: Stack: 4001020d bfffdd70 ccdfc000 c0109213 4001020d 0003 Mar 18 23:47:36 vmm kernel: bfffdd70 bfffdc88 0005 007b 007b 0005 4000ef94 Mar 18 23:47:36 vmm kernel:0073 0206 bfffdbd8 007b Mar 18 23:47:36 vmm kernel: Call Trace: Mar 18 23:47:36 vmm kernel: [] syscall_call+0x7/0xb Mar 18 23:47:36 vmm kernel: Code: 14 98 f0 81 41 04 00 00 00 01 5b 89 ec 5d c3 90 b8 00 e0 ff ff 21 e0 55 89 e5 57 56 53 8b 00 81 b8 e4 01 00 00 0f 27 00 00 75 08 <0f> 0b 22 03 85 18 2f c0 8b 45 08 50 e8 30 d4 00 00 89 c7 83 c4 >>EIP; c014fedf No symbols available <= Trace; c0109213 No symbols available This architecture has variable length instructions, decoding before eip is unreliable, take these instructions with a pinch of salt. Code; c014feb4 No symbols available <_EIP>: Code; c014feb4 No symbols available 0: 14 98 adc$0x98,%al Code; c014feb6 No symbols available 2: f0 81 41 04 00 00 00 lock addl $0x100,0x4(%ecx) Code; c014febd No symbols available 9: 01 Code; c014febe No symbols available a: 5bpop%ebx Code; c014febf No symbols available b: 89 ec mov%ebp,%esp Code; c014fec1 No symbols available d: 5dpop%ebp Code; c014fec2 No symbols available e: c3ret Code; c014fec3 No symbols available f: 90nop Code; c014fec4 No symbols available 10: b8 00 e0 ff ffmov$0xe000,%eax Code; c014fec9 No symbols available 15: 21 e0 and%esp,%eax Code; c014fecb No symbols available 17: 55push %ebp Code; c014fecc No symbols available 18: 89 e5 mov%esp,%ebp Code; c014fece No symbols available 1a: 57push %edi Code; c014fecf No symbols available 1b: 56push %esi Code; c014fed0 No symbols available 1c: 53push %ebx Code; c014fed1 No symbols available 1d: 8b 00 mov(%eax),%eax Code; c014fed3 No symbols available 1f: 81 b8 e4 01 00 00 0f cmpl $0x270f,0x1e4(%eax) Code; c014feda No symbols available 26: 27 00 00 Code; c014fedd No symbols available 29: 75 08 jne33 <_EIP+0x33> c014fee7 No symbols available This decode from eip onwards should be reliable Code; c014fedf No symbols available <_EIP>: Code; c014fedf No symbols available <= 0: 0f 0b ud2a <= Code; c014fee1 No symbols available 2: 22 0
Re: [PATCH 5/9] x86/dumpstack: Improve opcodes dumping in the Code: section
On Thu, Mar 8, 2018 at 2:16 AM, Borislav Petkov wrote: > +#define OPCODE_BUFSIZE 64 > + unsigned int code_prologue = OPCODE_BUFSIZE * 43 / OPCODE_BUFSIZE; Heh. That's a very odd way of writing "43". Honestly, the "43" is just "two thirds" rounded to closest, and it's not important anyway, so I think you should just write it as unsigned int code_prologue = OPCODE_BUFSIZE * 2 / 3; and never mind that it will now be 42. 42 is obviously the right answer anyway, which makes me think we got it wrong earlier. Linus
Re: [PATCH 5/9] x86/dumpstack: Improve opcodes dumping in the Code: section
On Wed, Mar 07, 2018 at 01:08:32PM -0800, Linus Torvalds wrote: > On Wed, Mar 7, 2018 at 5:25 AM, Josh Poimboeuf wrote: > > > > How about we just remove the 'code_bytes=' option? (Or at the very > > least, reduce its possible range to a reasonable max?) > > Ack. Just limit it to 64 bytes max sounds plenty. Done, combined diff ontop. With a 64-byte on-stack opcodes buffer: --- diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index b37c1c30c16f..2c74c1694d9d 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -587,11 +587,6 @@ Sets the size of memory pool for coherent, atomic dma allocations, by default set to 256K. - code_bytes [X86] How many bytes of object code to print - in an oops report. - Range: 0 - 8192 - Default: 64 - com20020= [HW,NET] ARCnet - COM20020 chipset Format: [,[,[,[,[,] diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c index 409a5bd02a18..2fc009a1824e 100644 --- a/arch/x86/kernel/dumpstack.c +++ b/arch/x86/kernel/dumpstack.c @@ -22,13 +22,9 @@ #include #include -#define OPCODE_BUFSIZE 64 - int panic_on_unrecovered_nmi; int panic_on_io_nmi; -static unsigned int code_bytes = OPCODE_BUFSIZE; -static u8 __opc[OPCODE_BUFSIZE]; -static u8 *opcodes = __opc; + static int die_counter; static struct pt_regs exec_summary_regs; @@ -77,19 +73,21 @@ static void printk_stack_address(unsigned long address, int reliable, void show_opcodes(u8 *rip, const char *loglvl) { - unsigned int code_prologue = code_bytes * 43 / OPCODE_BUFSIZE; +#define OPCODE_BUFSIZE 64 + unsigned int code_prologue = OPCODE_BUFSIZE * 43 / OPCODE_BUFSIZE; + u8 opcodes[OPCODE_BUFSIZE]; u8 *ip; int i; printk("%sCode: ", loglvl); ip = (u8 *)rip - code_prologue; - if (probe_kernel_read(opcodes, ip, code_bytes)) { + if (probe_kernel_read(opcodes, ip, OPCODE_BUFSIZE)) { pr_cont(" Bad RIP value.\n"); return; } - for (i = 0; i < code_bytes; i++, ip++) { + for (i = 0; i < OPCODE_BUFSIZE; i++, ip++) { if (ip == (u8 *)rip) pr_cont("<%02x> ", opcodes[i]); else @@ -387,34 +385,6 @@ void die(const char *str, struct pt_regs *regs, long err) oops_end(flags, regs, sig); } -static int __init code_bytes_setup(char *s) -{ - unsigned long val; - ssize_t ret; - - if (!s) - return -EINVAL; - - ret = kstrtoul(s, 0, &val); - if (ret) - return ret; - - code_bytes = val; - if (code_bytes > 8192) - code_bytes = 8192; - - if (code_bytes > OPCODE_BUFSIZE) { - u8 *new_buf = kzalloc(code_bytes, GFP_KERNEL); - if (!new_buf) - return -ENOMEM; - - opcodes = new_buf; - } - - return 1; -} -__setup("code_bytes=", code_bytes_setup); - void show_regs(struct pt_regs *regs) { bool all = true; -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply.
Re: [PATCH 5/9] x86/dumpstack: Improve opcodes dumping in the Code: section
On Wed, Mar 7, 2018 at 5:25 AM, Josh Poimboeuf wrote: > > How about we just remove the 'code_bytes=' option? (Or at the very > least, reduce its possible range to a reasonable max?) Ack. Just limit it to 64 bytes max sounds plenty. Linus
Re: [PATCH 5/9] x86/dumpstack: Improve opcodes dumping in the Code: section
On Wed, Mar 07, 2018 at 07:25:35AM -0600, Josh Poimboeuf wrote: > How about we just remove the 'code_bytes=' option? Haha, removing stuff is my usual solution :-) I'd love to. > (Or at the very > least, reduce its possible range to a reasonable max?) > > I doubt anybody actually uses it. I'd never heard of it before, nor > have I ever seen an oops with a long code dump. I can't fathom why > somebody would even need it. 64 bytes is plenty, and an 8k code dump > just sounds insane. Yeah, I see a Chuck Ebbert in git log output with a gmail account, maybe that's the same person. (I've assumed he's not at RH anymore, otherwise you would've CCed him :-)). Let's ask him. CCed. > It comes from the following commit: > > commit 86c418374223be3f328b5522545196db02c8ceda > Author: Chuck Ebbert > Date: Tue Feb 13 13:26:25 2007 +0100 > > [PATCH] i386: add option to show more code in oops reports > > Sometimes developers need to see more object code in an oops report, > e.g. when kernel may be corrupted at runtime. > > Add the "code_bytes" option for this. > > Signed-off-by: Chuck Ebbert > Signed-off-by: Andi Kleen > Cc: Andi Kleen > Signed-off-by: Andrew Morton > > But I've never seen a case where somebody needed to use it. > > -- > Josh -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply.
Re: [PATCH 5/9] x86/dumpstack: Improve opcodes dumping in the Code: section
On Wed, Mar 07, 2018 at 11:13:14AM +0100, Borislav Petkov wrote: > And that is fine if I do a 64-byte default, on-stack buffer but that > code_bytes= thing can be 2 pages max which is yuck. No way I'm doing > on-stack buffers then. How about we just remove the 'code_bytes=' option? (Or at the very least, reduce its possible range to a reasonable max?) I doubt anybody actually uses it. I'd never heard of it before, nor have I ever seen an oops with a long code dump. I can't fathom why somebody would even need it. 64 bytes is plenty, and an 8k code dump just sounds insane. It comes from the following commit: commit 86c418374223be3f328b5522545196db02c8ceda Author: Chuck Ebbert Date: Tue Feb 13 13:26:25 2007 +0100 [PATCH] i386: add option to show more code in oops reports Sometimes developers need to see more object code in an oops report, e.g. when kernel may be corrupted at runtime. Add the "code_bytes" option for this. Signed-off-by: Chuck Ebbert Signed-off-by: Andi Kleen Cc: Andi Kleen Signed-off-by: Andrew Morton But I've never seen a case where somebody needed to use it. -- Josh
Re: [PATCH 5/9] x86/dumpstack: Improve opcodes dumping in the Code: section
On Tue, Mar 06, 2018 at 10:47:56AM -0800, Linus Torvalds wrote: > Are these always serialized? For oopses, I think we end up serializing > with die_lock, but is that always the case? Hmm, good question. > Maybe at least a comment about why a static allocation is ok? Well, I'm afraid it is not ok but let me show what I'm seeing - maybe I'm wrong somewhere: Normally, when something calls die() we do this: die |-> oops_begin |-> arch_spin_lock(&die_lock) <-- grab die_lock |-> __die |-> show_regs |-> __show_regs |-> show_iret_regs |-> show_ip |-> show_opcodes and we dump fine here. But, if, for example, a #PF happens while we die(), we could do this: do_page_fault |-> __do_page_fault |-> bad_area_nosemaphore |-> __bad_area_nosemaphore |-> show_signal_msg |-> show_opcodes that's the catch-all case in: if (unlikely(fault_in_kernel_space(address))) { and that doesn't sync with the die_lock, AFAICT, and we're walking all over the opcodes buffer. Unless I'm missing something, that is. If I'm not, then I guess I need to think about a better way to solve this. Because I like the improvement of not having to probe_kernel_read() byte-by-byte but read it all at once. And that is fine if I do a 64-byte default, on-stack buffer but that code_bytes= thing can be 2 pages max which is yuck. No way I'm doing on-stack buffers then. Hmm, I need to think about it. Thanks! -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply.
Re: [PATCH 5/9] x86/dumpstack: Improve opcodes dumping in the Code: section
On Tue, Mar 6, 2018 at 1:49 AM, Borislav Petkov wrote: > > Make it read the whole buffer of code_bytes size in one go. By default > use a statically allocated 64 bytes buffer. If "code_bytes=" is supplied > on the cmdline a new buffer gets allocated. Are these always serialized? For oopses, I think we end up serializing with die_lock, but is that always the case? Maybe at least a comment about why a static allocation is ok? Linus