[PATCH 1/8] pseries: phyp dump: Documentation
Basic documentation for hypervisor-assisted dump.

Signed-off-by: Linas Vepstas [EMAIL PROTECTED]
Signed-off-by: Manish Ahuja [EMAIL PROTECTED]

 Documentation/powerpc/phyp-assisted-dump.txt | 127 +++
 1 file changed, 127 insertions(+)

Index: 2.6.25-rc1/Documentation/powerpc/phyp-assisted-dump.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ 2.6.25-rc1/Documentation/powerpc/phyp-assisted-dump.txt	2008-02-18 03:22:33.000000000 -0600
@@ -0,0 +1,127 @@
+
+			Hypervisor-Assisted Dump
+
+				November 2007
+
+The goal of hypervisor-assisted dump is to enable the dump of
+a crashed system, and to do so from a fully-reset system, and
+to minimize the total elapsed time until the system is back
+in production use.
+
+As compared to kdump or other strategies, hypervisor-assisted
+dump offers several strong, practical advantages:
+
+-- Unlike kdump, the system has been reset, and loaded
+   with a fresh copy of the kernel.  In particular,
+   PCI and I/O devices have been reinitialized and are
+   in a clean, consistent state.
+-- As the dump is performed, the dumped memory becomes
+   immediately available to the system for normal use.
+-- After the dump is completed, no further reboots are
+   required; the system will be fully usable, and running
+   in its normal, production mode on its normal kernel.
+
+The above can only be accomplished by coordination with,
+and assistance from, the hypervisor.  The procedure is
+as follows:
+
+-- When a system crashes, the hypervisor will save
+   the low 256MB of RAM to a previously registered
+   save region.  It will also save system state, system
+   registers, and hardware PTEs.
+
+-- After the low 256MB area has been saved, the
+   hypervisor will reset PCI and other hardware state.
+   It will *not* clear RAM.  It will then launch the
+   bootloader, as normal.
+
+-- The freshly booted kernel will notice that there
+   is a new node (ibm,dump-kernel) in the device tree,
+   indicating that there is crash data available from
+   a previous boot.  It will boot into only 256MB of RAM,
+   reserving the rest of system memory.
+
+-- Userspace tools will parse /sys/kernel/release_region
+   and read /proc/vmcore to obtain the contents of memory,
+   which holds the previous crashed kernel.  The userspace
+   tools may copy this info to disk, network, NAS, SAN,
+   iSCSI, etc., as desired.
+
+   For example: the values in /sys/kernel/release_region
+   would look something like this (address-range pairs):
+   CPU:0x177fee000-0x10000: HPTE:0x177ffe020-0x1000: /
+   DUMP:0x177fff020-0x10000000, 0x10000000-0x16F1D370A
+
+-- As the userspace tools complete saving a portion of
+   the dump, they echo an offset and size to
+   /sys/kernel/release_region to release the reserved
+   memory back to general use.
+
+   An example of this is:
+     echo "0x40000000 0x10000000" > /sys/kernel/release_region
+   which will release 256MB at the 1GB boundary.
+
+Please note that the hypervisor-assisted dump feature
+is only available on POWER6-based systems with recent
+firmware versions.
+
+Implementation details:
+----------------------
+
+During boot, a check is made to see if firmware supports
+this feature on this particular machine.  If it does, then
+we check to see if an active dump is waiting for us.  If so,
+everything but 256MB of RAM is reserved during early boot.
+This area is released once the dump has been collected by
+the userland scripts that are run.  If there is dump data,
+then the /sys/kernel/release_region file is created, and
+the reserved memory is held.
+
+If there is no waiting dump data, then only the highest
+256MB of RAM is reserved as a scratch area.
+This area is *not* released: this region will be kept
+permanently reserved, so that it can act as a receptacle
+for a copy of the low 256MB in the case a crash does occur.
+See, however, the open issues below, as to whether such a
+reserved region is really needed.
+
+Currently the dump will be copied from /proc/vmcore to
+a new file upon user intervention.  The starting address
+to be read and the range for each data point is provided
+in /sys/kernel/release_region.
+
+The tools to examine the dump will be the same as the ones
+used for kdump.
+
+General notes:
+--------------
+Security: please note that there are potential security issues
+with any sort of dump mechanism.  In particular, plaintext
+(unencrypted) data, and possibly passwords, may be present in
+the dump data.  Userspace tools must take adequate precautions
+to preserve security.
+
+Open issues/ToDo:
+
+ o The various code paths that tell the hypervisor that a crash
+   occurred, vs. it simply being a normal reboot, should be
+   reviewed, and possibly clarified/fixed.
+
+ o Instead of using /sys/kernel, should there be a /sys/dump
+   instead?  There is a
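[Editor's note: a minimal sketch, for concreteness, of the userspace
side described in the patch above.  It assumes the documented
behaviour of /sys/kernel/release_region; the chunk size, the output
file name, and the shortcut of echoing raw /proc/vmcore file offsets
back as release addresses are illustrative assumptions only -- a real
tool would parse the address-range pairs published in release_region.]

/*
 * Hypothetical vmcore saver: copy /proc/vmcore out in chunks,
 * releasing each saved range back to the kernel as we go.
 */
#include <stdio.h>

#define CHUNK (16 * 1024 * 1024)	/* save 16MB at a time */

int main(void)
{
	static char buf[CHUNK];
	FILE *in  = fopen("/proc/vmcore", "rb");
	FILE *out = fopen("vmcore.saved", "wb");
	FILE *rel = fopen("/sys/kernel/release_region", "w");
	unsigned long long off = 0;
	size_t n;

	if (!in || !out || !rel)
		return 1;

	while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
		fwrite(buf, 1, n, out);
		/* hand the just-saved range back for general use */
		fprintf(rel, "0x%llx 0x%zx\n", off, n);
		fflush(rel);
		off += n;
	}
	return 0;
}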
Re: [PATCH 1/8] pseries: phyp dump: Documentation
On 13/01/2008, Olof Johansson [EMAIL PROTECTED] wrote:

> How do you expect to have it in full production if you don't have all
> resources available for it? It's not until the dump has finished that
> you can return all memory to the production environment and use it.

With the PHYP dump, each chunk of RAM is returned for general use
immediately after being dumped; so it's not an all-or-nothing
proposition.  Production systems don't often hit 100% RAM use right
out of the gate; they often take hours or days to get there, so again,
there should be time to dump.

> This can very easily be argued in both directions, with no clear
> winner: If the crash is stress-induced (say a slashdotted website),
> for those cases it seems more rational to take the time, collect
> _good data_ even if it takes a little longer, and then go back into
> production.  Especially if the alternative is to go back into
> production immediately, collect about half of the data, and then
> crash again.  Rinse and repeat.

Again, the mode of operation for the phyp dump is that you'll always
have all of the data from the *first* crash, even if there are
multiple crashes.  That's because the as-yet-undumped RAM is not put
back into production.

> It really surprises me that there's no way to reset a device through
> PHYP though.  Seems like such a fundamental feature.

I don't know who said that; that's not right.  The EEH function
certainly does allow you to halt/restart PCI traffic to a particular
device, and also to reset the device.  So, yes, the pSeries kexec code
should call into the EEH subsystem to rationalize the device state.

> I think people are overly optimistic if they think it'll be possible
> to do all of this reliably (as in with consistent performance)
> without a second reboot though.

The NUMA issues do concern me.  But then, the whole virtualized,
fractional-cpu, tickless operation stuff sounds like a performance
tuning nightmare to begin with.

> At least without similar amounts of work being done as it would have
> taken to fix kdump's reliability in the first place.

:-)

> Speaking of reboots.  PHYP isn't known for being quick at rebooting a
> partition; it used to take in the order of minutes even on a small
> machine.  Has that been fixed?

Dunno.  Probably not.

> If not, the "avoiding an extra reboot" argument hardly seems like a
> benefit versus kdump+kexec, which reboots nearly instantly and
> without involvement from PHYP.

OK, let me tell you what I'm up against right now.  I'm dealing with
sporadic corruption on my home box.  About a month ago, I bought a
whizzy ASUS M2NE motherboard, an AMD64 dual-core CPU, and two sticks
of RAM, 1GB per stick.  I have one new SATA hard drive, and one old
PATA drive from my old machine.  The two disks are mirrored in a
RAID-1 config.  Running Ubuntu.

During install/upgrade a month ago, I noticed some of the install
files seemed to have gotten corrupted, but that downloading them
again got me a working version.  This put a serious frown on my face:
maybe a bad ethernet card or connection!?

Two weeks ago, gcc stopped working one morning, although it worked
fine the night before.  I'd done nothing in the interim but sleep.
Reinstalling it made it work again.

Yesterday, something else stopped working.  I found the offending
library, compared file checksums against a known-good version, and
they were off. (!!!)  Disk corruption?  Then apt-get stopped working.
The /var/lib/dpkg/status file had randomly corrupted single bytes.
It's ASCII, so I hand-repaired it; it had maybe 10 bad bytes out of
2MB total size.  I installed tripwire.
Between the first run of tripwire and the second, less than an hour
later, it reported that several dozen files had changed checksums.
Manual inspection of some of these files against known-good versions
shows that, at least this morning, that's no longer the case.

The system hasn't crashed in a month, since first boot.  So what's
going on?  Is it possible that one of the two disks is serving up bad
data, which explains the funny checksum behaviour?  Or maybe it's bad
RAM, so that a fresh disk read shows good data?  If it's bad RAM, why
doesn't the system crash?  I forced fsck last night; fsck came back
spotless.

So ... moral of the story: If phyp is doing some sort of hardware
checks and validation, that's great.  I wish I could afford a pSeries
system for my home computer, because my impression is that they are
very stable, and don't do things like data corruption.  I'm such a
friggin cheapskate that I can't bear to spend many thousands instead
of many hundreds of dollars.  However, I will trade a longer boot for
the dream of higher reliability.

--linas
Re: [PATCH 1/8] pseries: phyp dump: Documentation
On Fri, Jan 11, 2008 at 10:57:51AM -0600, Linas Vepstas wrote:
> On 10/01/2008, Nathan Lynch [EMAIL PROTECTED] wrote:
> > Mike Strosaker wrote:
> > > At the risk of repeating what others have already said, the
> > > PHYP-assistance method provides some advantages that the kexec
> > > method cannot:
> > > - Availability of the system for production use before the dump
> > >   data is collected.  As was mentioned before, some production
> > >   systems may choose not to operate with the limited memory
> > >   initially available after the reboot, but it sure is nice to
> > >   provide the option.
> >
> > I'm more concerned that this design encourages the user to resume a
> > workload *which is almost certainly known to result in a system
> > crash* before collection of crash data is complete.  Maybe the
> > gamble will pay off most of the time, but I wouldn't want to be
> > working support when it doesn't.
>
> Workloads that cause crashes within hours of startup tend to be
> weeded-out/discovered during pre-production test of the system to be
> deployed.  Since it's a pre-production test, dumps can be taken in a
> leisurely manner.  Heck, even a session at the xmon prompt can be
> contemplated.
>
> The problem is when the crash only reproduces after days or weeks of
> uptime, on a production machine.  Since the machine is in production,
> it's got to be brought back up ASAP.  Since it's crashing only after
> days/weeks, the dump should have plenty of time to complete.  (And if
> it crashes quickly after that reboot ... well, support people always
> welcome ways in which a bug can be reproduced more quickly/easily.)

How do you expect to have it in full production if you don't have all
resources available for it? It's not until the dump has finished that
you can return all memory to the production environment and use it.

This can very easily be argued in both directions, with no clear
winner: If the crash is stress-induced (say a slashdotted website),
for those cases it seems more rational to take the time, collect
_good data_ even if it takes a little longer, and then go back into
production.  Especially if the alternative is to go back into
production immediately, collect about half of the data, and then
crash again.  Rinse and repeat.

Anyway -- I can agree that some of the arguments w.r.t. robustness
and reliability of collecting dumps can be higher using this
approach.  It really surprises me that there's no way to reset a
device through PHYP though.  Seems like such a fundamental feature.

I think people are overly optimistic if they think it'll be possible
to do all of this reliably (as in with consistent performance)
without a second reboot though.  At least without similar amounts of
work being done as it would have taken to fix kdump's reliability in
the first place.

Speaking of reboots.  PHYP isn't known for being quick at rebooting a
partition; it used to take in the order of minutes even on a small
machine.  Has that been fixed?  If not, the "avoiding an extra
reboot" argument hardly seems like a benefit versus kdump+kexec,
which reboots nearly instantly and without involvement from PHYP.

-Olof
Re: [PATCH 1/8] pseries: phyp dump: Documentation
On 10/01/2008, Nathan Lynch [EMAIL PROTECTED] wrote:
> Mike Strosaker wrote:
> > At the risk of repeating what others have already said, the
> > PHYP-assistance method provides some advantages that the kexec
> > method cannot:
> > - Availability of the system for production use before the dump
> >   data is collected.  As was mentioned before, some production
> >   systems may choose not to operate with the limited memory
> >   initially available after the reboot, but it sure is nice to
> >   provide the option.
>
> I'm more concerned that this design encourages the user to resume a
> workload *which is almost certainly known to result in a system
> crash* before collection of crash data is complete.  Maybe the gamble
> will pay off most of the time, but I wouldn't want to be working
> support when it doesn't.

Workloads that cause crashes within hours of startup tend to be
weeded-out/discovered during pre-production test of the system to be
deployed.  Since it's a pre-production test, dumps can be taken in a
leisurely manner.  Heck, even a session at the xmon prompt can be
contemplated.

The problem is when the crash only reproduces after days or weeks of
uptime, on a production machine.  Since the machine is in production,
it's got to be brought back up ASAP.  Since it's crashing only after
days/weeks, the dump should have plenty of time to complete.  (And if
it crashes quickly after that reboot ... well, support people always
welcome ways in which a bug can be reproduced more quickly/easily.)

--linas
Re: [PATCH 1/8] pseries: phyp dump: Documentation
On Wed, Jan 09, 2008 at 10:12:13PM -0600, Linas Vepstas wrote:
> On 09/01/2008, Olof Johansson [EMAIL PROTECTED] wrote:
> > On Wed, Jan 09, 2008 at 08:33:53PM -0600, Linas Vepstas wrote:
> > > Heh.  That's the elbow-grease of this thing.  The easy part is to
> > > get the core function working.  The hard part is to test these
> > > various configs, and when they don't work, figure out what went
> > > wrong.  That will take perseverance and brains.
> >
> > This just sounds like a whole lot of extra work to get a feature
> > that already exists.
>
> Well, no.  kexec is horribly ill-behaved with respect to PCI.  The
> kexec kernel starts running with PCI devices in some random state;
> maybe they're DMA'ing or who knows what.  kexec tries real hard to
> whack a few needed PCI devices into submission, but it has been
> hit-n-miss, and the source of 90% of the kexec headaches and
> debugging effort.  It's not pretty.

It surprises me that this hasn't been possible to resolve with less
than architecting a completely new interface, given that the platform
has all this fancy support for isolating and resetting adapters.
After all, the exact same thing has to be done by the hypervisor
before rebooting the partition.

> If all pci-host bridges could shut-down or settle the bus, and raise
> the #RST line high, and then if all BIOSes supported this, you'd be
> right.  But they can't.

This argument doesn't hold.  We're not talking about some generic PC
with a crappy BIOS here, we're specifically talking about POWER6
PHYP.  It certainly already has ways to reset adapters in it, or EEH
wouldn't work.  Actually, the whole phyp dump feature wouldn't work
either, since it's exactly what the firmware has to do under the
covers as well.

-Olof
Re: [PATCH 1/8] pseries: phyp dump: Documentation
On 10/01/2008, Olof Johansson [EMAIL PROTECTED] wrote:
> On Wed, Jan 09, 2008 at 10:12:13PM -0600, Linas Vepstas wrote:
> > On 09/01/2008, Olof Johansson [EMAIL PROTECTED] wrote:
> > > On Wed, Jan 09, 2008 at 08:33:53PM -0600, Linas Vepstas wrote:
> > > > Heh.  That's the elbow-grease of this thing.  The easy part is
> > > > to get the core function working.  The hard part is to test
> > > > these various configs, and when they don't work, figure out
> > > > what went wrong.  That will take perseverance and brains.
> > >
> > > This just sounds like a whole lot of extra work to get a feature
> > > that already exists.
> >
> > Well, no.  kexec is horribly ill-behaved with respect to PCI.  The
> > kexec kernel starts running with PCI devices in some random state;
> > maybe they're DMA'ing or who knows what.  kexec tries real hard to
> > whack a few needed PCI devices into submission, but it has been
> > hit-n-miss, and the source of 90% of the kexec headaches and
> > debugging effort.  It's not pretty.
>
> It surprises me that this hasn't been possible to resolve with less
> than architecting a completely new interface, given that the platform
> has all this fancy support for isolating and resetting adapters.
> After all, the exact same thing has to be done by the hypervisor
> before rebooting the partition.

OK, point taken.

-- The phyp interfaces are there for AIX, which I guess must not have
kexec-like ability.  So this is a case of Linux leveraging a feature
architected for AIX.

-- There's also this idea, somewhat weak, that the crash may have
corrupted the RAM where the kexec kernel sits.  For someone who is
used to seeing crashes due to null pointer derefs, this seems fairly
unlikely.  But perhaps crashes in production systems are more
mind-bending.  (We did have a case where a USB stick used for boot
continued to scribble on memory long after it was supposed to be
quiet and unused.  This resulted in a very hard-to-debug crash.)

A solution to a corrupted kexec kernel would be to disable memory
access to where kexec sits, e.g. un-mapping or making r/o the pages
where it lies.  This begs the questions of who unhides the kexec
kernel, and what if this 'who' gets corrupted?

In short, the kexec kernel does not boot exactly the same as a cold
boot, and so this opens a can of worms about: well, what's different,
how do we minimize these differences, etc.  I think that led AIX to
punt, and say let's just use one single, well-known bootloader/boot
sequence instead of inventing a new one, thus leading to the phyp
design.  But that's just my guess. :-)

--linas
Re: [PATCH 1/8] pseries: phyp dump: Documentation
Linas Vepstas wrote:
> On 10/01/2008, Olof Johansson [EMAIL PROTECTED] wrote:
> > On Wed, Jan 09, 2008 at 10:12:13PM -0600, Linas Vepstas wrote:
> > > On 09/01/2008, Olof Johansson [EMAIL PROTECTED] wrote:
> > > > On Wed, Jan 09, 2008 at 08:33:53PM -0600, Linas Vepstas wrote:
> > > > > Heh.  That's the elbow-grease of this thing.  The easy part
> > > > > is to get the core function working.  The hard part is to
> > > > > test these various configs, and when they don't work, figure
> > > > > out what went wrong.  That will take perseverance and brains.
> > > >
> > > > This just sounds like a whole lot of extra work to get a
> > > > feature that already exists.
> > >
> > > Well, no.  kexec is horribly ill-behaved with respect to PCI.
> > > The kexec kernel starts running with PCI devices in some random
> > > state; maybe they're DMA'ing or who knows what.  kexec tries real
> > > hard to whack a few needed PCI devices into submission, but it
> > > has been hit-n-miss, and the source of 90% of the kexec headaches
> > > and debugging effort.  It's not pretty.
> >
> > It surprises me that this hasn't been possible to resolve with less
> > than architecting a completely new interface, given that the
> > platform has all this fancy support for isolating and resetting
> > adapters.  After all, the exact same thing has to be done by the
> > hypervisor before rebooting the partition.
>
> OK, point taken.
>
> -- The phyp interfaces are there for AIX, which I guess must not
> have kexec-like ability.  So this is a case of Linux leveraging a
> feature architected for AIX.

Certainly AIX was in a more difficult position at the time, because
they don't have a kexec equivalent, and thus were collecting dump
data with a potentially faulty kernel.

It makes sense to have something outside the partition collect or
maintain the data; ideally, some kind of service partition would
extract dump data from a failed partition, but giving one partition
total access to the memory of another is clearly risky.  Both the
PHYP-assistance method and the kexec method are ways to simulate that
without the risk.

At the risk of repeating what others have already said, the
PHYP-assistance method provides some advantages that the kexec method
cannot:
- Availability of the system for production use before the dump data
  is collected.  As was mentioned before, some production systems may
  choose not to operate with the limited memory initially available
  after the reboot, but it sure is nice to provide the option.
- Ensuring that the devices are in a good state.  PHYP doesn't expose
  a method to force adapters into a frozen state (which I agree would
  be useful), and I don't know of any plans to do so.  What we are
  starting to see is that some drivers need modifications in order to
  work correctly with kdump [1].  Supporting PHYP-assisted dump would
  eliminate those issues.
- The small possibility that the kexec area could have been munged by
  the failing kernel, preventing it from being able to collect a dump.

The NUMA issues are daunting, but not insurmountable.  Release early,
release often, n'est-ce pas?

Mike

[1] http://ozlabs.org/pipermail/linuxppc-dev/2007-November/045663.html
Re: [PATCH 1/8] pseries: phyp dump: Documentation
Mike Strosaker wrote:
> At the risk of repeating what others have already said, the
> PHYP-assistance method provides some advantages that the kexec
> method cannot:
> - Availability of the system for production use before the dump data
>   is collected.  As was mentioned before, some production systems
>   may choose not to operate with the limited memory initially
>   available after the reboot, but it sure is nice to provide the
>   option.

I'm more concerned that this design encourages the user to resume a
workload *which is almost certainly known to result in a system
crash* before collection of crash data is complete.  Maybe the gamble
will pay off most of the time, but I wouldn't want to be working
support when it doesn't.
Re: [PATCH 1/8] pseries: phyp dump: Documentation
Hi Linas,

Linas Vepstas wrote:
> On 08/01/2008, Nathan Lynch [EMAIL PROTECTED] wrote:
> > Manish Ahuja wrote:
> > > +
> > > +The goal of hypervisor-assisted dump is to enable the dump of
> > > +a crashed system, and to do so from a fully-reset system, and
> > > +to minimize the total elapsed time until the system is back
> > > +in production use.
> >
> > Is it actually faster than kdump?
>
> This is a basic presumption;

I used the word "actually".  I already know that it is intended to be
faster. :)

> it should blow it away, as, after all, it requires one less reboot!

There's more than rebooting going on during system dump processing.
Depending on the system type, booting may not be where most time is
spent.

> As a side effect, the system is in production *while* the dump is
> being taken;

A dubious feature IMO.  Seems that the design potentially trades
reliability of first failure data capture for availability.  E.g.
system crashes, reboots, resumes processing while copying dump,
crashes again before dump procedure is complete.  How is that
handled, if at all?

> with kdump, you can't go into production until after the dump is
> finished, and the system has been rebooted a second time.  On
> systems with terabytes of RAM, the time difference can be hours.

The difference in time it takes to resume the normal workload may be
significant, yes.  But the time it takes to get a usable dump image
would seem to be basically the same.

Since you bring up large systems... a system with terabytes of RAM is
practically guaranteed to be a NUMA configuration with dozens of
cpus.  When processing a dump on such a system, I wonder how well we
fare: can we successfully boot with (say) 128 cpus and 256MB of
usable memory?  Do we have to hot-online nodes as system memory is
freed up (and does that even work)?  We need to be able to restore
the system to its optimal topology when the dump is finished; if the
best we can do is a degraded configuration, the workload will suffer
and the system admin is likely to just reboot the machine again so
the kernel will have the right NUMA topology.

> > > +Implementation details:
> > > +----------------------
> > > +In order for this scheme to work, memory needs to be reserved
> > > +quite early in the boot cycle.  However, access to the device
> > > +tree this early in the boot cycle is difficult, and device-tree
> > > +access is needed to determine if there is a crash data waiting.
> >
> > I don't think this bit about early device tree access is correct.
> > By the time your code is reserving memory (from
> > early_init_devtree(), I think), RTAS has been instantiated and you
> > are able to test for the existence of /rtas/ibm,dump-kernel.
>
> If I remember right, it was still too early to look up this token
> directly, so we wrote some code to crawl the flat device tree to
> find it.  But not only was that a lot of work, but I somehow decided
> that doing this to the flat tree was wrong, as otherwise someone
> would surely have written the access code.  If this can be made to
> work, that would be great, but we couldn't make it work at the time.
>
> > > +To work around this problem, all but 256MB of RAM is reserved
> > > +during early boot.  A short while later in boot, a check is made
> > > +to determine if there is dump data waiting.  If there isn't,
> > > +then the reserved memory is released to general kernel use.
> >
> > So I think these gymnastics are unneeded -- unless I'm
> > misunderstanding something, you should be able to determine very
> > early whether to reserve that memory.
>
> Only if you can get at rtas, but you can't get at rtas at that point.

Sorry, but I think you are mistaken (see Michael's earlier reply).
Re: [PATCH 1/8] pseries: phyp dump: Documentation
> I used the word "actually".  I already know that it is intended to
> be faster. :)
>
> > it should blow it away, as, after all, it requires one less reboot!
>
> There's more than rebooting going on during system dump processing.
> Depending on the system type, booting may not be where most time is
> spent.
>
> > As a side effect, the system is in production *while* the dump is
> > being taken;
>
> A dubious feature IMO.  Seems that the design potentially trades
> reliability of first failure data capture for availability.  E.g.
> system crashes, reboots, resumes processing while copying dump,
> crashes again before dump procedure is complete.  How is that
> handled, if at all?

This is a simple version.  The intent was not to have a complex
dump-taking mechanism in version 1.  Subsequent versions will see
planned improvement in the way the pages are tracked and freed.
Also, it is now easily possible to register for another dump as soon
as the scratch area is copied to a user-designated region.  But for
now this simple implementation exists.

It is also possible to extend this further to preserve only pages
that are kernel pages, and free the non-required pages like user/data
pages etc.  This would reduce the space preserved, and would prevent
any issues that are caused by reserving everything in memory except
for the first 256MB.  Improvements and future versions are planned to
make this efficient, but for now the intent is to get this off the
ground and handle simple cases.

> > with kdump, you can't go into production until after the dump is
> > finished, and the system has been rebooted a second time.  On
> > systems with terabytes of RAM, the time difference can be hours.
>
> The difference in time it takes to resume the normal workload may be
> significant, yes.  But the time it takes to get a usable dump image
> would seem to be basically the same.
>
> Since you bring up large systems... a system with terabytes of RAM
> is practically guaranteed to be a NUMA configuration with dozens of
> cpus.  When processing a dump on such a system, I wonder how well we
> fare: can we successfully boot with (say) 128 cpus and 256MB of
> usable memory?  Do we have to hot-online nodes as system memory is
> freed up (and does that even work)?  We need to be able to restore
> the system to its optimal topology when the dump is finished; if the
> best we can do is a degraded configuration, the workload will suffer
> and the system admin is likely to just reboot the machine again so
> the kernel will have the right NUMA topology.
>
> > > > +Implementation details:
> > > > +----------------------
> > > > +In order for this scheme to work, memory needs to be reserved
> > > > +quite early in the boot cycle.  However, access to the device
> > > > +tree this early in the boot cycle is difficult, and
> > > > +device-tree access is needed to determine if there is a crash
> > > > +data waiting.
> > >
> > > I don't think this bit about early device tree access is correct.
> > > By the time your code is reserving memory (from
> > > early_init_devtree(), I think), RTAS has been instantiated and
> > > you are able to test for the existence of /rtas/ibm,dump-kernel.
> >
> > If I remember right, it was still too early to look up this token
> > directly, so we wrote some code to crawl the flat device tree to
> > find it.  But not only was that a lot of work, but I somehow
> > decided that doing this to the flat tree was wrong, as otherwise
> > someone would surely have written the access code.  If this can be
> > made to work, that would be great, but we couldn't make it work at
> > the time.
> >
> > > > +To work around this problem, all but 256MB of RAM is reserved
> > > > +during early boot.  A short while later in boot, a check is
> > > > +made to determine if there is dump data waiting.  If there
> > > > +isn't, then the reserved memory is released to general kernel
> > > > +use.
> > >
> > > So I think these gymnastics are unneeded -- unless I'm
> > > misunderstanding something, you should be able to determine very
> > > early whether to reserve that memory.
> >
> > Only if you can get at rtas, but you can't get at rtas at that
> > point.
>
> Sorry, but I think you are mistaken (see Michael's earlier reply).
Re: [PATCH 1/8] pseries: phyp dump: Documentation
> It's in production with 256MB of RAM?  Err.  Sure, as the dump
> progresses more RAM will be freed, but that's hardly production.  I
> think Nathan's right, any sysadmin who wants predictability will
> probably double reboot anyway.

That's a changeable parameter; it's something we chose for now.  It
by no means is set in stone, and it's not a design parameter.  If
you'd like to allocate 1GB, we can; 256MB is just what we chose for
now.  We expect this to be a variable value dependent upon the size
of the system: so if you have a 128GB system and you can spare 10GB,
you should be able to have 10GB to boot with.
Re: [PATCH 1/8] pseries: phyp dump: Documentation
On Wed, 2008-01-09 at 12:44 -0600, Nathan Lynch wrote:
> Hi Linas,
>
> Linas Vepstas wrote:
> > On 08/01/2008, Nathan Lynch [EMAIL PROTECTED] wrote:
> > > Manish Ahuja wrote:
> > > > +
> > > > +The goal of hypervisor-assisted dump is to enable the dump of
> > > > +a crashed system, and to do so from a fully-reset system, and
> > > > +to minimize the total elapsed time until the system is back
> > > > +in production use.
> > >
> > > Is it actually faster than kdump?
> >
> > This is a basic presumption;
> >
> > As a side effect, the system is in production *while* the dump is
> > being taken;

It's in production with 256MB of RAM?  Err.  Sure, as the dump
progresses more RAM will be freed, but that's hardly production.  I
think Nathan's right, any sysadmin who wants predictability will
probably double reboot anyway.

> > with kdump, you can't go into production until after the dump is
> > finished, and the system has been rebooted a second time.  On
> > systems with terabytes of RAM, the time difference can be hours.
>
> Since you bring up large systems... a system with terabytes of RAM
> is practically guaranteed to be a NUMA configuration with dozens of
> cpus.  When processing a dump on such a system, I wonder how well we
> fare: can we successfully boot with (say) 128 cpus and 256MB of
> usable memory?  Do we have to hot-online nodes as system memory is
> freed up (and does that even work)?  We need to be able to restore
> the system to its optimal topology when the dump is finished; if the
> best we can do is a degraded configuration, the workload will suffer
> and the system admin is likely to just reboot the machine again so
> the kernel will have the right NUMA topology.

Yeah, that's a good question.  Even if the hot-onlining works, there
are still kernel data structures allocated at boot which want to be
node-local.  So the end result will be != a production boot.

> > > > +Implementation details:
> > > > +----------------------
> > > > +In order for this scheme to work, memory needs to be reserved
> > > > +quite early in the boot cycle.  However, access to the device
> > > > +tree this early in the boot cycle is difficult, and
> > > > +device-tree access is needed to determine if there is a crash
> > > > +data waiting.
> > >
> > > I don't think this bit about early device tree access is correct.
> > > By the time your code is reserving memory (from
> > > early_init_devtree(), I think), RTAS has been instantiated and
> > > you are able to test for the existence of /rtas/ibm,dump-kernel.
> >
> > If I remember right, it was still too early to look up this token
> > directly, so we wrote some code to crawl the flat device tree to
> > find it.  But not only was that a lot of work, but I somehow
> > decided that doing this to the flat tree was wrong, as otherwise
> > someone would surely have written the access code.  If this can be
> > made to work, that would be great, but we couldn't make it work at
> > the time.
> >
> > > > +To work around this problem, all but 256MB of RAM is reserved
> > > > +during early boot.  A short while later in boot, a check is
> > > > +made to determine if there is dump data waiting.  If there
> > > > +isn't, then the reserved memory is released to general kernel
> > > > +use.
> > >
> > > So I think these gymnastics are unneeded -- unless I'm
> > > misunderstanding something, you should be able to determine very
> > > early whether to reserve that memory.
> >
> > Only if you can get at rtas, but you can't get at rtas at that
> > point.

AFAICT you don't need to get at RTAS, you just need to look at the
device tree to see if the property is present, and that is trivial.
You probably just need to add a check in early_init_dt_scan_rtas()
which sets a flag for the PHYP dump stuff, or add your own scan
routine if you need.

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person
Re: [PATCH 1/8] pseries: phyp dump: Documentation
On 09/01/2008, Nathan Lynch [EMAIL PROTECTED] wrote:
> Hi Linas,
>
> Linas Vepstas wrote:
> > As a side effect, the system is in production *while* the dump is
> > being taken;
>
> A dubious feature IMO.

Hmm.  Take it up with Ken Rozendal; this is supposed to be one of the
two main selling points of this thing.

> Seems that the design potentially trades reliability of first
> failure data capture for availability.  E.g. system crashes,
> reboots, resumes processing while copying dump, crashes again before
> dump procedure is complete.  How is that handled, if at all?

It's handled by the hypervisor.  phyp maintains the copy of the RMO
of the first crash, until such time that the OS declares the dump of
the RMO to be complete.  So you'll always have the RMO of the first
crash.

For the rest of RAM, it will come in two parts: some portion will
have been dumped already.  The rest has not yet been dumped, and it
will still be there, preserved across the second crash.  So you get
both the RMO and all of RAM from the first crash.

> > with kdump, you can't go into production until after the dump is
> > finished, and the system has been rebooted a second time.  On
> > systems with terabytes of RAM, the time difference can be hours.
>
> The difference in time it takes to resume the normal workload may be
> significant, yes.  But the time it takes to get a usable dump image
> would seem to be basically the same.

Yes.

> Since you bring up large systems... a system with terabytes of RAM
> is practically guaranteed to be a NUMA configuration with dozens of
> cpus.  When processing a dump on such a system, I wonder how well we
> fare: can we successfully boot with (say) 128 cpus and 256MB of
> usable memory?  Do we have to hot-online nodes as system memory is
> freed up (and does that even work)?  We need to be able to restore
> the system to its optimal topology when the dump is finished; if the
> best we can do is a degraded configuration, the workload will suffer
> and the system admin is likely to just reboot the machine again so
> the kernel will have the right NUMA topology.

Heh.  That's the elbow-grease of this thing.  The easy part is to get
the core function working.  The hard part is to test these various
configs, and when they don't work, figure out what went wrong.  That
will take perseverance and brains.

--linas
Re: [PATCH 1/8] pseries: phyp dump: Documentation
On 09/01/2008, Michael Ellerman [EMAIL PROTECTED] wrote:
> > Only if you can get at rtas, but you can't get at rtas at that
> > point.
>
> AFAICT you don't need to get at RTAS, you just need to look at the
> device tree to see if the property is present, and that is trivial.
> You probably just need to add a check in early_init_dt_scan_rtas()
> which sets a flag for the PHYP dump stuff, or add your own scan
> routine if you need.

I no longer remember the details.  I do remember spending a lot of
time trying to figure out how to do this.  I know I didn't want to
write my own scan routine; maybe that's what stopped me.  As it
happens, we also did most of the development on a broken phyp which
simply did not even have this property, no matter what, and so that
may have brain-damaged me.

I went for the most elegant solution, where "most elegant" is defined
as fewest lines of code, least effort, etc.  Manish may need some
hands-on help to extract this token during early boot.  Hopefully,
he'll let us know.

--linas
Re: [PATCH 1/8] pseries: phyp dump: Documentation
On Wed, Jan 09, 2008 at 08:33:53PM -0600, Linas Vepstas wrote:
> Heh.  That's the elbow-grease of this thing.  The easy part is to
> get the core function working.  The hard part is to test these
> various configs, and when they don't work, figure out what went
> wrong.  That will take perseverance and brains.

This just sounds like a whole lot of extra work to get a feature that
already exists.  Also, features like these seem to just get tested
when the next enterprise distro is released, so they're broken for
long stretches of time in mainline.

There's a bunch of problems like the NUMA ones, which would by far be
easiest to solve by just doing another reboot or kexec, wouldn't they?

-Olof
Re: [PATCH 1/8] pseries: phyp dump: Documentation
On Wed, 2008-01-09 at 20:47 -0600, Linas Vepstas wrote:
> On 09/01/2008, Michael Ellerman [EMAIL PROTECTED] wrote:
> > > Only if you can get at rtas, but you can't get at rtas at that
> > > point.
> >
> > AFAICT you don't need to get at RTAS, you just need to look at the
> > device tree to see if the property is present, and that is
> > trivial.  You probably just need to add a check in
> > early_init_dt_scan_rtas() which sets a flag for the PHYP dump
> > stuff, or add your own scan routine if you need.
>
> I no longer remember the details.  I do remember spending a lot of
> time trying to figure out how to do this.  I know I didn't want to
> write my own scan routine; maybe that's what stopped me.  As it
> happens, we also did most of the development on a broken phyp which
> simply did not even have this property, no matter what, and so that
> may have brain-damaged me.

Sure, the API docs for the kernel are a little lacking ;)

> I went for the most elegant solution, where "most elegant" is
> defined as fewest lines of code, least effort, etc.  Manish may need
> some hands-on help to extract this token during early boot.
> Hopefully, he'll let us know.

It would just be something like:

--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -901,6 +901,11 @@ int __init early_init_dt_scan_rtas(unsigned long node,
 		rtas.size = *sizep;
 	}
 
+#ifdef CONFIG_PHYP_DUMP
+	if (of_get_flat_dt_prop(node, "ibm,dump-kernel", NULL))
+		phyp_dump_is_active++;
+#endif
+
 #ifdef CONFIG_UDBG_RTAS_CONSOLE
 	basep = of_get_flat_dt_prop(node, "put-term-char", NULL);
 	if (basep)

Or to do your own scan routine:

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index acc0d24..442134e 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -1022,6 +1022,7 @@ void __init early_init_devtree(void *params)
 	/* Some machines might need RTAS info for debugging, grab it now. */
 	of_scan_flat_dt(early_init_dt_scan_rtas, NULL);
 #endif
+	of_scan_flat_dt(early_init_dt_scan_phyp_dump, NULL);
 
 	/* Retrieve various informations from the /chosen node of the
 	 * device-tree, including the platform type, initrd location and
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index 52e95c2..af2b6e8 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -883,6 +883,19 @@ void __init rtas_initialize(void)
 #endif
 }
 
+int __init early_init_dt_scan_phyp_dump(unsigned long node,
+		const char *uname, int depth, void *data)
+{
+#ifdef CONFIG_PHYP_DUMP
+	if (depth != 1 || strcmp(uname, "rtas") != 0)
+		return 0;
+
+	if (of_get_flat_dt_prop(node, "ibm,dump-kernel", NULL))
+		phyp_dump_is_active++;
+#endif
+	return 1;
+}
+
 int __init early_init_dt_scan_rtas(unsigned long node,
 	const char *uname, int depth, void *data)
 {

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person
Re: [PATCH 1/8] pseries: phyp dump: Documentation
On 09/01/2008, Olof Johansson [EMAIL PROTECTED] wrote:
> On Wed, Jan 09, 2008 at 08:33:53PM -0600, Linas Vepstas wrote:
> > Heh.  That's the elbow-grease of this thing.  The easy part is to
> > get the core function working.  The hard part is to test these
> > various configs, and when they don't work, figure out what went
> > wrong.  That will take perseverance and brains.
>
> This just sounds like a whole lot of extra work to get a feature
> that already exists.

Well, no.  kexec is horribly ill-behaved with respect to PCI.  The
kexec kernel starts running with PCI devices in some random state;
maybe they're DMA'ing or who knows what.  kexec tries real hard to
whack a few needed PCI devices into submission, but it has been
hit-n-miss, and the source of 90% of the kexec headaches and
debugging effort.  It's not pretty.

If all pci-host bridges could shut-down or settle the bus, and raise
the #RST line high, and then if all BIOSes supported this, you'd be
right.  But they can't.

--linas
Re: [PATCH 1/8] pseries: phyp dump: Documentation
On Wed, 2008-01-09 at 22:12 -0600, Linas Vepstas wrote:
> On 09/01/2008, Olof Johansson [EMAIL PROTECTED] wrote:
> > On Wed, Jan 09, 2008 at 08:33:53PM -0600, Linas Vepstas wrote:
> > > Heh.  That's the elbow-grease of this thing.  The easy part is
> > > to get the core function working.  The hard part is to test
> > > these various configs, and when they don't work, figure out what
> > > went wrong.  That will take perseverance and brains.
> >
> > This just sounds like a whole lot of extra work to get a feature
> > that already exists.
>
> Well, no.  kexec is horribly ill-behaved with respect to PCI.  The
> kexec kernel starts running with PCI devices in some random state;
> maybe they're DMA'ing or who knows what.  kexec tries real hard to
> whack a few needed PCI devices into submission, but it has been
> hit-n-miss, and the source of 90% of the kexec headaches and
> debugging effort.  It's not pretty.

Isn't that what EEH and the IOMMU are for? :)

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person
Re: [PATCH 1/8] pseries: phyp dump: Documentation
Manish Ahuja wrote:
> +
> +			Hypervisor-Assisted Dump
> +
> +				November 2007

Date is unneeded (and, uhm, dated :)

> +The goal of hypervisor-assisted dump is to enable the dump of
> +a crashed system, and to do so from a fully-reset system, and
> +to minimize the total elapsed time until the system is back
> +in production use.

Is it actually faster than kdump?

> +As compared to kdump or other strategies, hypervisor-assisted
> +dump offers several strong, practical advantages:
> +
> +-- Unlike kdump, the system has been reset, and loaded
> +   with a fresh copy of the kernel.  In particular,
> +   PCI and I/O devices have been reinitialized and are
> +   in a clean, consistent state.
> +-- As the dump is performed, the dumped memory becomes
> +   immediately available to the system for normal use.
> +-- After the dump is completed, no further reboots are
> +   required; the system will be fully usable, and running
> +   in it's normal, production mode on it normal kernel.
> +
> +The above can only be accomplished by coordination with,
> +and assistance from the hypervisor.  The procedure is
> +as follows:
> +
> +-- When a system crashes, the hypervisor will save
> +   the low 256MB of RAM to a previously registered
> +   save region.  It will also save system state, system
> +   registers, and hardware PTE's.
> +
> +-- After the low 256MB area has been saved, the
> +   hypervisor will reset PCI and other hardware state.
> +   It will *not* clear RAM.  It will then launch the
> +   bootloader, as normal.
> +
> +-- The freshly booted kernel will notice that there
> +   is a new node (ibm,dump-kernel) in the device tree,
> +   indicating that there is crash data available from
> +   a previous boot.  It will boot into only 256MB of RAM,
> +   reserving the rest of system memory.
> +
> +-- Userspace tools will parse /sys/kernel/release_region
> +   and read /proc/vmcore to obtain the contents of memory,
> +   which holds the previous crashed kernel.  The userspace
> +   tools may copy this info to disk, or network, nas, san,
> +   iscsi, etc. as desired.
> +
> +   For Example: the values in /sys/kernel/release-region
> +   would look something like this (address-range pairs).
> +   CPU:0x177fee000-0x10000: HPTE:0x177ffe020-0x1000: /
> +   DUMP:0x177fff020-0x10000000, 0x10000000-0x16F1D370A
> +
> +-- As the userspace tools complete saving a portion of
> +   dump, they echo an offset and size to
> +   /sys/kernel/release_region to release the reserved
> +   memory back to general use.
> +
> +   An example of this is:
> +     echo "0x40000000 0x10000000" > /sys/kernel/release_region
> +   which will release 256MB at the 1GB boundary.

This violates the "one file, one value" rule of sysfs, but nobody
really takes that seriously, I guess.  In any case, consider
documenting this in Documentation/ABI.

> +Please note that the hypervisor-assisted dump feature
> +is only available on Power6-based systems with recent
> +firmware versions.

This statement will of course become dated/incorrect, so I recommend
removing it.

> +Implementation details:
> +----------------------
> +In order for this scheme to work, memory needs to be reserved
> +quite early in the boot cycle.  However, access to the device
> +tree this early in the boot cycle is difficult, and device-tree
> +access is needed to determine if there is a crash data waiting.

I don't think this bit about early device tree access is correct.  By
the time your code is reserving memory (from early_init_devtree(), I
think), RTAS has been instantiated and you are able to test for the
existence of /rtas/ibm,dump-kernel.

> +To work around this problem, all but 256MB of RAM is reserved
> +during early boot.  A short while later in boot, a check is made
> +to determine if there is dump data waiting.  If there isn't,
> +then the reserved memory is released to general kernel use.

So I think these gymnastics are unneeded -- unless I'm
misunderstanding something, you should be able to determine very
early whether to reserve that memory.

> +If there is dump data, then the /sys/kernel/release_region
> +file is created, and the reserved memory is held.
> +
> +If there is no waiting dump data, then all but 256MB of the
> +reserved ram will be released for general kernel use.  The
> +highest 256 MB of RAM will *not* be released: this region
> +will be kept permanently reserved, so that it can act as
> +a receptacle for a copy of the low 256MB in the case a crash
> +does occur.  See, however, open issues below, as to whether
> +such a reserved region is really needed.
> +
> +Currently the dump will be copied from /proc/vmcore to a
> +a new file upon user intervention.  The starting address
> +to be read and the range for each data point in provided
                                                 ^is
> +in /sys/kernel/release_region.
> +
> +The tools to examine the dump will be same as the ones
> +used for kdump.
> +
> +
> +General notes:
> +--------------
> +Security: please note that there are
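[Editor's note: for readers wondering what the kernel side of such a
sysfs file might look like, here is a hedged sketch against the
kobject API of the 2.6.2x era.  phyp_dump_release() is a hypothetical
helper, not a symbol from this patch series, and the actual patches
may wire the attribute up differently.]

/*
 * Sketch: a /sys/kernel/release_region attribute that accepts
 * "<addr> <size>" writes and returns the range to the allocator.
 */
#include <linux/kernel.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>
#include <linux/init.h>

extern void phyp_dump_release(unsigned long addr, unsigned long size);

static ssize_t release_region_store(struct kobject *kobj,
				    struct kobj_attribute *attr,
				    const char *buf, size_t count)
{
	unsigned long addr, size;

	/* userspace writes e.g. "0x40000000 0x10000000" */
	if (sscanf(buf, "%lx %lx", &addr, &size) != 2)
		return -EINVAL;

	phyp_dump_release(addr, size);	/* hypothetical helper */
	return count;
}

static struct kobj_attribute release_region_attr =
	__ATTR(release_region, 0200, NULL, release_region_store);

static int __init release_region_init(void)
{
	return sysfs_create_file(kernel_kobj, &release_region_attr.attr);
}
__initcall(release_region_init);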
Re: [PATCH 1/8] pseries: phyp dump: Documentation
On Tue, 2008-01-08 at 22:29 -0600, Nathan Lynch wrote:
> Manish Ahuja wrote:
> > +
> > +			Hypervisor-Assisted Dump
> > +
> > +				November 2007
>
> Date is unneeded (and, uhm, dated :)
>
> > +The goal of hypervisor-assisted dump is to enable the dump of
> > +a crashed system, and to do so from a fully-reset system, and
> > +to minimize the total elapsed time until the system is back
> > +in production use.
>
> Is it actually faster than kdump?
>
> > +As compared to kdump or other strategies, hypervisor-assisted
> > +dump offers several strong, practical advantages:
> > +
> > +-- Unlike kdump, the system has been reset, and loaded
> > +   with a fresh copy of the kernel.  In particular,
> > +   PCI and I/O devices have been reinitialized and are
> > +   in a clean, consistent state.
> > +-- As the dump is performed, the dumped memory becomes
> > +   immediately available to the system for normal use.
> > +-- After the dump is completed, no further reboots are
> > +   required; the system will be fully usable, and running
> > +   in it's normal, production mode on it normal kernel.
> > +
> > +The above can only be accomplished by coordination with,
> > +and assistance from the hypervisor.  The procedure is
> > +as follows:
> > +
> > +-- When a system crashes, the hypervisor will save
> > +   the low 256MB of RAM to a previously registered
> > +   save region.  It will also save system state, system
> > +   registers, and hardware PTE's.
> > +
> > +-- After the low 256MB area has been saved, the
> > +   hypervisor will reset PCI and other hardware state.
> > +   It will *not* clear RAM.  It will then launch the
> > +   bootloader, as normal.
> > +
> > +-- The freshly booted kernel will notice that there
> > +   is a new node (ibm,dump-kernel) in the device tree,
> > +   indicating that there is crash data available from
> > +   a previous boot.  It will boot into only 256MB of RAM,
> > +   reserving the rest of system memory.
> > +
> > +-- Userspace tools will parse /sys/kernel/release_region
> > +   and read /proc/vmcore to obtain the contents of memory,
> > +   which holds the previous crashed kernel.  The userspace
> > +   tools may copy this info to disk, or network, nas, san,
> > +   iscsi, etc. as desired.
> > +
> > +   For Example: the values in /sys/kernel/release-region
> > +   would look something like this (address-range pairs).
> > +   CPU:0x177fee000-0x10000: HPTE:0x177ffe020-0x1000: /
> > +   DUMP:0x177fff020-0x10000000, 0x10000000-0x16F1D370A
> > +
> > +-- As the userspace tools complete saving a portion of
> > +   dump, they echo an offset and size to
> > +   /sys/kernel/release_region to release the reserved
> > +   memory back to general use.
> > +
> > +   An example of this is:
> > +     echo "0x40000000 0x10000000" > /sys/kernel/release_region
> > +   which will release 256MB at the 1GB boundary.
>
> This violates the "one file, one value" rule of sysfs, but nobody
> really takes that seriously, I guess.  In any case, consider
> documenting this in Documentation/ABI.
>
> > +Please note that the hypervisor-assisted dump feature
> > +is only available on Power6-based systems with recent
> > +firmware versions.
>
> This statement will of course become dated/incorrect, so I recommend
> removing it.
>
> > +Implementation details:
> > +----------------------
> > +In order for this scheme to work, memory needs to be reserved
> > +quite early in the boot cycle.  However, access to the device
> > +tree this early in the boot cycle is difficult, and device-tree
> > +access is needed to determine if there is a crash data waiting.
>
> I don't think this bit about early device tree access is correct.
> By the time your code is reserving memory (from early_init_devtree(),
> I think), RTAS has been instantiated and you are able to test for the
> existence of /rtas/ibm,dump-kernel.

Yep, it's early_init_devtree(), and yes it's fairly easy to access
the (flattened) device tree at that point.

> > +To work around this problem, all but 256MB of RAM is reserved
> > +during early boot.  A short while later in boot, a check is made
> > +to determine if there is dump data waiting.  If there isn't,
> > +then the reserved memory is released to general kernel use.
>
> So I think these gymnastics are unneeded -- unless I'm
> misunderstanding something, you should be able to determine very
> early whether to reserve that memory.

I agree.

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person
[PATCH 1/8] pseries: phyp dump: Documentation
Basic documentation for hypervisor-assisted dump.

Signed-off-by: Linas Vepstas [EMAIL PROTECTED]
Signed-off-by: Manish Ahuja [EMAIL PROTECTED]

 Documentation/powerpc/phyp-assisted-dump.txt | 129 +++
 1 file changed, 129 insertions(+)

Index: 2.6.24-rc5/Documentation/powerpc/phyp-assisted-dump.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ 2.6.24-rc5/Documentation/powerpc/phyp-assisted-dump.txt	2008-01-07 18:05:46.000000000 -0600
@@ -0,0 +1,129 @@
+
+			Hypervisor-Assisted Dump
+
+				November 2007
+
+The goal of hypervisor-assisted dump is to enable the dump of
+a crashed system, and to do so from a fully-reset system, and
+to minimize the total elapsed time until the system is back
+in production use.
+
+As compared to kdump or other strategies, hypervisor-assisted
+dump offers several strong, practical advantages:
+
+-- Unlike kdump, the system has been reset, and loaded
+   with a fresh copy of the kernel.  In particular,
+   PCI and I/O devices have been reinitialized and are
+   in a clean, consistent state.
+-- As the dump is performed, the dumped memory becomes
+   immediately available to the system for normal use.
+-- After the dump is completed, no further reboots are
+   required; the system will be fully usable, and running
+   in it's normal, production mode on it normal kernel.
+
+The above can only be accomplished by coordination with,
+and assistance from the hypervisor.  The procedure is
+as follows:
+
+-- When a system crashes, the hypervisor will save
+   the low 256MB of RAM to a previously registered
+   save region.  It will also save system state, system
+   registers, and hardware PTE's.
+
+-- After the low 256MB area has been saved, the
+   hypervisor will reset PCI and other hardware state.
+   It will *not* clear RAM.  It will then launch the
+   bootloader, as normal.
+
+-- The freshly booted kernel will notice that there
+   is a new node (ibm,dump-kernel) in the device tree,
+   indicating that there is crash data available from
+   a previous boot.  It will boot into only 256MB of RAM,
+   reserving the rest of system memory.
+
+-- Userspace tools will parse /sys/kernel/release_region
+   and read /proc/vmcore to obtain the contents of memory,
+   which holds the previous crashed kernel.  The userspace
+   tools may copy this info to disk, or network, nas, san,
+   iscsi, etc. as desired.
+
+   For Example: the values in /sys/kernel/release-region
+   would look something like this (address-range pairs).
+   CPU:0x177fee000-0x10000: HPTE:0x177ffe020-0x1000: /
+   DUMP:0x177fff020-0x10000000, 0x10000000-0x16F1D370A
+
+-- As the userspace tools complete saving a portion of
+   dump, they echo an offset and size to
+   /sys/kernel/release_region to release the reserved
+   memory back to general use.
+
+   An example of this is:
+     echo "0x40000000 0x10000000" > /sys/kernel/release_region
+   which will release 256MB at the 1GB boundary.
+
+Please note that the hypervisor-assisted dump feature
+is only available on Power6-based systems with recent
+firmware versions.
+
+Implementation details:
+----------------------
+In order for this scheme to work, memory needs to be reserved
+quite early in the boot cycle.  However, access to the device
+tree this early in the boot cycle is difficult, and device-tree
+access is needed to determine if there is a crash data waiting.
+To work around this problem, all but 256MB of RAM is reserved
+during early boot.  A short while later in boot, a check is made
+to determine if there is dump data waiting.  If there isn't,
+then the reserved memory is released to general kernel use.
+If there is dump data, then the /sys/kernel/release_region
+file is created, and the reserved memory is held.
+
+If there is no waiting dump data, then all but 256MB of the
+reserved ram will be released for general kernel use.  The
+highest 256 MB of RAM will *not* be released: this region
+will be kept permanently reserved, so that it can act as
+a receptacle for a copy of the low 256MB in the case a crash
+does occur.  See, however, open issues below, as to whether
+such a reserved region is really needed.
+
+Currently the dump will be copied from /proc/vmcore to a
+a new file upon user intervention.  The starting address
+to be read and the range for each data point in provided
+in /sys/kernel/release_region.
+
+The tools to examine the dump will be same as the ones
+used for kdump.
+
+
+General notes:
+--------------
+Security: please note that there are potential security issues
+with any sort of dump mechanism.  In particular, plaintext
+(unencrypted) data, and possibly passwords, may be present in
+the dump data.  Userspace tools must take adequate precautions to
+preserve security.
+
+Open issues/ToDo:
+
+ o The various code paths that tell the hypervisor that a
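[Editor's note: to make the reservation step concrete, here is a
sketch of the early-boot reservation described above, written against
the powerpc LMB allocator of that era (asm/lmb.h).  The symbol names
are illustrative, not the actual ones from this series; the flag
would be set while scanning the flattened device tree for
ibm,dump-kernel, as in Michael Ellerman's example earlier in the
thread.]

/*
 * Sketch: if a dump is waiting, keep everything above the low 256MB
 * out of the allocator until userspace has saved and released it.
 */
#include <linux/init.h>
#include <asm/lmb.h>

#define PHYP_DUMP_BOOT_RAM	(256UL << 20)	/* boot into 256MB */

static int phyp_dump_is_active;

static void __init phyp_dump_reserve_mem(void)
{
	unsigned long base = PHYP_DUMP_BOOT_RAM;

	if (!phyp_dump_is_active)
		return;

	/* reserve [256MB, end-of-DRAM) until the dump is collected */
	lmb_reserve(base, lmb_end_of_DRAM() - base);
}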