Re: [PATCH 0/8] tracing: Persistent traces across a reboot or crash
On Sat, 9 Mar 2024 12:40:51 -0800
Kees Cook wrote:

> The part I'd like to get wired up sanely is having pstore find the
> nvdimm area automatically, but it never quite happened:
> https://lore.kernel.org/lkml/CAGXu5jLtmb3qinZnX3rScUJLUFdf+pRDVPjy=cs4kutw9tl...@mail.gmail.com/

The automatic detection is what I'm looking for.

> > Thanks for the info. We use pstore on ChromeOS, but it is currently
> > restricted to 1MB, which is too small for the tracing buffers. From
> > what I understand, it's also in a specific location where there's
> > only 1MB available for contiguous memory.
>
> That's the area that is specifically hardware-backed with persistent
> RAM.
>
> > I'm looking at finding a way to get consistent memory outside that
> > range. That's what I'll be doing next week ;-)
> >
> > But this code was just to see if I could get a single contiguous range
> > of memory mapped to ftrace, and this patch set does exactly that.
>
> Well, please take a look at pstore. It should be able to do everything
> you mention already; it just needs a way to define multiple regions if
> you want to use an area outside of the persistent RAM area defined by
> Chrome OS's platform driver.

I'm not exactly sure how to use pstore here. At boot up I just need some
consistent memory reserved for the tracing buffer. It just needs to be at
the same location on every boot up. I don't need a front end.

If you mean a way to access it from user space, the front end is the
tracefs directory, as I need all the features that the tracefs directory
gives.

I'm going to look at how pstore is set up in ChromeOS and see if I can
use whatever it does to allocate another location.

-- Steve
Re: [PATCH 0/8] tracing: Persistent traces across a reboot or crash
On Sat, Mar 09, 2024 at 01:51:16PM -0500, Steven Rostedt wrote:
> On Sat, 9 Mar 2024 10:27:47 -0800
> Kees Cook wrote:
>
> > On Tue, Mar 05, 2024 at 08:59:10PM -0500, Steven Rostedt wrote:
> > > This is a way to map a ring buffer instance across reboots.
> >
> > As mentioned on Fedi, check out the persistent storage subsystem
> > (pstore)[1]. It already does what you're starting to construct for RAM
> > backends (but also supports Reed-Solomon ECC), and supports several
> > other backends including EFI storage (which is default enabled on at
> > least Fedora[2]), block devices, etc. It has an existing mechanism for
> > handling reservations (including via device tree), and supports
> > multiple "frontends" including the Oops handler, console output, and
> > even ftrace, which does per-cpu recording and event reconstruction
> > (Joel wrote this frontend).
>
> Mathieu was telling me about the pmem infrastructure.

I use nvdimm to back my RAM backend testing with qemu so I can examine
the storage "externally":

RAM_SIZE=16384
NVDIMM_SIZE=200
MAX_SIZE=$(( RAM_SIZE + NVDIMM_SIZE ))
...
qemu-system-x86_64 \
	...
	-machine pc,nvdimm=on \
	-m ${RAM_SIZE}M,slots=2,maxmem=${MAX_SIZE}M \
	-object memory-backend-file,id=mem1,share=on,mem-path=$IMAGES/x86/nvdimm.img,size=${NVDIMM_SIZE}M,align=128M \
	-device nvdimm,id=nvdimm1,memdev=mem1,label-size=1M \
	...
	-append 'console=uart,io,0x3f8,115200n8 loglevel=8 root=/dev/vda1 ro ramoops.mem_size=1048576 ramoops.ecc=1 ramoops.mem_address=0x44000 ramoops.console_size=16384 ramoops.ftrace_size=16384 ramoops.pmsg_size=16384 ramoops.record_size=32768 panic=-1 init=/root/resume.sh '"$@"

The part I'd like to get wired up sanely is having pstore find the
nvdimm area automatically, but it never quite happened:
https://lore.kernel.org/lkml/CAGXu5jLtmb3qinZnX3rScUJLUFdf+pRDVPjy=cs4kutw9tl...@mail.gmail.com/

> Thanks for the info. We use pstore on ChromeOS, but it is currently
> restricted to 1MB, which is too small for the tracing buffers. From
> what I understand, it's also in a specific location where there's only
> 1MB available for contiguous memory.

That's the area that is specifically hardware-backed with persistent
RAM.

> I'm looking at finding a way to get consistent memory outside that
> range. That's what I'll be doing next week ;-)
>
> But this code was just to see if I could get a single contiguous range
> of memory mapped to ftrace, and this patch set does exactly that.

Well, please take a look at pstore. It should be able to do everything
you mention already; it just needs a way to define multiple regions if
you want to use an area outside of the persistent RAM area defined by
Chrome OS's platform driver.

-Kees

-- 
Kees Cook
Re: [PATCH 0/8] tracing: Persistent traces across a reboot or crash
On Sat, 9 Mar 2024 10:27:47 -0800
Kees Cook wrote:

> On Tue, Mar 05, 2024 at 08:59:10PM -0500, Steven Rostedt wrote:
> > This is a way to map a ring buffer instance across reboots.
>
> As mentioned on Fedi, check out the persistent storage subsystem
> (pstore)[1]. It already does what you're starting to construct for RAM
> backends (but also supports Reed-Solomon ECC), and supports several
> other backends including EFI storage (which is default enabled on at
> least Fedora[2]), block devices, etc. It has an existing mechanism for
> handling reservations (including via device tree), and supports
> multiple "frontends" including the Oops handler, console output, and
> even ftrace, which does per-cpu recording and event reconstruction
> (Joel wrote this frontend).

Mathieu was telling me about the pmem infrastructure.

This patch set doesn't care where the memory comes from. You just give
it an address and size, and it will do the rest.

> It should be pretty straightforward to implement a new frontend if the
> ftrace one isn't flexible enough. It's a bit clunky still to add one,
> but search for "ftrace" in fs/pstore/ram.c to see how to plumb a new
> frontend into the RAM backend.
>
> I continue to want to lift the frontend configuration options up into
> the pstore core, since it would avoid a bunch of redundancy, but this
> is where we are currently. :)

Thanks for the info. We use pstore on ChromeOS, but it is currently
restricted to 1MB, which is too small for the tracing buffers. From what
I understand, it's also in a specific location where there's only 1MB
available for contiguous memory.

I'm looking at finding a way to get consistent memory outside that
range. That's what I'll be doing next week ;-)

But this code was just to see if I could get a single contiguous range
of memory mapped to ftrace, and this patch set does exactly that.

> -Kees
>
> [1] CONFIG_PSTORE et al. in fs/pstore/
>     https://docs.kernel.org/admin-guide/ramoops.html
> [2] https://www.freedesktop.org/software/systemd/man/latest/systemd-pstore.service.html

Thanks!

-- Steve
Re: [PATCH 0/8] tracing: Persistent traces across a reboot or crash
On Tue, Mar 05, 2024 at 08:59:10PM -0500, Steven Rostedt wrote:
> This is a way to map a ring buffer instance across reboots.

As mentioned on Fedi, check out the persistent storage subsystem
(pstore)[1]. It already does what you're starting to construct for RAM
backends (but also supports Reed-Solomon ECC), and supports several
other backends including EFI storage (which is default enabled on at
least Fedora[2]), block devices, etc. It has an existing mechanism for
handling reservations (including via device tree), and supports multiple
"frontends" including the Oops handler, console output, and even ftrace,
which does per-cpu recording and event reconstruction (Joel wrote this
frontend).

It should be pretty straightforward to implement a new frontend if the
ftrace one isn't flexible enough. It's a bit clunky still to add one,
but search for "ftrace" in fs/pstore/ram.c to see how to plumb a new
frontend into the RAM backend.

I continue to want to lift the frontend configuration options up into
the pstore core, since it would avoid a bunch of redundancy, but this is
where we are currently. :)

-Kees

[1] CONFIG_PSTORE et al. in fs/pstore/
    https://docs.kernel.org/admin-guide/ramoops.html
[2] https://www.freedesktop.org/software/systemd/man/latest/systemd-pstore.service.html

-- 
Kees Cook
[POC] !!! Re: [PATCH 0/8] tracing: Persistent traces across a reboot or crash
I forgot to add [POC] to the topic. All these patches are a proof of concept. -- Steve
[PATCH 0/8] tracing: Persistent traces across a reboot or crash
This is a way to map a ring buffer instance across reboots. The
requirement is that you have a memory region that is not erased. I
tested this on a Debian VM running on qemu on a Debian server, and even
tested it on a bare-metal box running Fedora. I was surprised that it
worked on the bare-metal box, but it does so surprisingly consistently.

The idea is that you can reserve a memory region and save it in two
special variables:

  trace_buffer_start and trace_buffer_size

If these are set by fs_initcall(), then a "boot_mapped" instance is
created. The memory that was reserved is used by the ring buffer of this
instance. It acts like a memory-mapped instance, so it has some
limitations: it does not allow snapshots, nor does it allow tracers that
use a snapshot buffer (like the irqsoff and wakeup tracers).

On boot up, when setting up the ring buffer, it looks at the current
content and does a vigorous test to see if the content is valid. It even
walks the events in all the sub-buffers to make sure the ring buffer
meta data is correct. If it determines that the content is valid, it
will reconstruct the ring buffer to use the content it has found.

If the buffer is valid, on the next boot the boot_mapped instance will
contain the data from the previous boot. You can cat the trace or
trace_pipe file, or even run trace-cmd extract on it to make a trace.dat
file that holds the data. This is much better than dealing with
ftrace_dump_on_oops (I wish I had this a decade ago!)

There are still some limitations of this buffer. One is that it assumes
that the kernel you are booting back into is the same one that crashed;
at least the trace_events (like sched_switch and friends) must all have
the same ids. This would be true with the same kernel, as the ids are
determined at link time. Module events could possibly be a problem, as
the ids may not match.
One idea is to just print the raw fields and not process the print
formats for this instance, as the print formats may do some crazy things
with data that does not match.

Another limitation is that any print format that has "%pS" will likely
not work. That's because the pointer in the old ring buffer is for an
address that may be different from what the function points to now. I
was thinking of adding a file in the boot_mapped instance that holds the
delta of the old mapping to the new mapping, so that trace-cmd and perf
could calculate the current kallsyms from the old pointers.

Finally, this is still a proof of concept. How to create this memory
mapping isn't decided yet. In this patch set I simply hacked into the
kexec crash code and hard coded an address that worked for one of my
machines (for the other machine I had to play around to find another
address). Perhaps we could add a kernel command line parameter that lets
people decide, or an option where it could possibly look at the ACPI
(for Intel) tables to come up with an address on its own.

Anyway, I plan on using this for debugging, as it is already pretty
featureful, but there's much more that can be done. Basically, all you
need to do is:

  echo 1 > /sys/kernel/tracing/instances/boot_mapped/events/enable

Do whatever you want until the system crashes (and boots to the same
kernel). Then:

  cat /sys/kernel/tracing/instances/boot_mapped/trace

and it will have the trace.

I'm sure there are still some gotchas here, which is why this is
currently still just a POC.

Enjoy...
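The "delta of the old mapping to the new mapping" idea for %pS amounts
to simple pointer arithmetic: with KASLR the kernel text base differs
between boots, so a function pointer recorded in the old buffer has to
be rebased before it can be resolved against the current kallsyms. A
minimal sketch, with made-up addresses (not values from this patch set):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Hypothetical kernel text bases of the two boots. */
	uint64_t old_text_base = 0xffffffff81000000ull; /* previous boot */
	uint64_t new_text_base = 0xffffffffa2000000ull; /* current boot  */
	int64_t delta = (int64_t)(new_text_base - old_text_base);

	/* A function pointer as recorded in the persisted buffer ... */
	uint64_t old_ptr = 0xffffffff811234f0ull;
	/* ... rebased so trace-cmd/perf can look it up in today's
	 * kallsyms. */
	uint64_t cur_ptr = old_ptr + (uint64_t)delta;

	/* The offset within the text section is preserved. */
	assert(cur_ptr - new_text_base == old_ptr - old_text_base);
	printf("0x%llx\n", (unsigned long long)cur_ptr);
	return 0;
}
```

The proposed file in the boot_mapped instance would only need to expose
that single delta value; all the rebasing could then happen in the
userspace tools.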
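For the command-line idea, one existing mechanism that yields a fixed
region at the same physical address on every boot is the x86 memmap=
parameter, which reserves a range out of System RAM. The sketch below is
purely illustrative: the address and size are made up, and the
trace_buffer_start/trace_buffer_size parameters shown do not exist in
this patch set (there they are variables set from fs_initcall(), not
command-line options):

```shell
# Hypothetical boot command line: steal 16MB of System RAM at physical
# address 0x30000000 so the kernel never hands it out, then (in this
# imagined interface) point the boot_mapped instance at it. Note the
# '$' in memmap= usually needs escaping in bootloader config files.
cmdline='memmap=16M$0x30000000 trace_buffer_start=0x30000000 trace_buffer_size=0x1000000'
echo "$cmdline"
```

Whether this, ACPI table inspection, or a device-tree reservation (as
pstore already supports) is the right interface is exactly what the
cover letter leaves open.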
Steven Rostedt (Google) (8):
  ring-buffer: Allow mapped field to be set without mapping
  ring-buffer: Add ring_buffer_alloc_range()
  tracing: Create "boot_mapped" instance for memory mapped buffer
  HACK: Hard code in mapped tracing buffer address
  ring-buffer: Add ring_buffer_meta data
  ring-buffer: Add output of ring buffer meta page
  ring-buffer: Add test if range of boot buffer is valid
  ring-buffer: Validate boot range memory events

 arch/x86/kernel/setup.c     |  20 ++
 include/linux/ring_buffer.h |  17 +
 include/linux/trace.h       |   7 +
 kernel/trace/ring_buffer.c  | 826 ++--
 kernel/trace/trace.c        |  95 -
 kernel/trace/trace.h        |   5 +
 6 files changed, 856 insertions(+), 114 deletions(-)