[PATCH 1/8] pseries: phyp dump: Documentation

2008-02-28 Thread Manish Ahuja

Basic documentation for hypervisor-assisted dump.

Signed-off-by: Linas Vepstas [EMAIL PROTECTED]
Signed-off-by: Manish Ahuja [EMAIL PROTECTED]


 Documentation/powerpc/phyp-assisted-dump.txt |  127 +++
 1 file changed, 127 insertions(+)

Index: 2.6.25-rc1/Documentation/powerpc/phyp-assisted-dump.txt
===================================================================
--- /dev/null   1970-01-01 00:00:00.0 +
+++ 2.6.25-rc1/Documentation/powerpc/phyp-assisted-dump.txt 2008-02-18 03:22:33.0 -0600
@@ -0,0 +1,127 @@
+
+   Hypervisor-Assisted Dump
+   
+   November 2007
+
+The goal of hypervisor-assisted dump is to enable the dump of
+a crashed system, and to do so from a fully-reset system, and
+to minimize the total elapsed time until the system is back
+in production use.
+
+As compared to kdump or other strategies, hypervisor-assisted
+dump offers several strong, practical advantages:
+
+-- Unlike kdump, the system has been reset, and loaded
+   with a fresh copy of the kernel.  In particular,
+   PCI and I/O devices have been reinitialized and are
+   in a clean, consistent state.
+-- As the dump is performed, the dumped memory becomes
+   immediately available to the system for normal use.
+-- After the dump is completed, no further reboots are
+   required; the system will be fully usable, and running
+   in its normal, production mode on its normal kernel.
+
+The above can only be accomplished by coordination with,
+and assistance from the hypervisor. The procedure is
+as follows:
+
+-- When a system crashes, the hypervisor will save
+   the low 256MB of RAM to a previously registered
+   save region. It will also save system state, system
+   registers, and hardware PTE's.
+
+-- After the low 256MB area has been saved, the
+   hypervisor will reset PCI and other hardware state.
+   It will *not* clear RAM. It will then launch the
+   bootloader, as normal.
+
+-- The freshly booted kernel will notice that there
+   is a new node (ibm,dump-kernel) in the device tree,
+   indicating that there is crash data available from
+   a previous boot. It will boot into only 256MB of RAM,
+   reserving the rest of system memory.
+
+-- Userspace tools will parse /sys/kernel/release_region
+   and read /proc/vmcore to obtain the contents of memory,
+   which holds the previous crashed kernel. The userspace
+   tools may copy this info to disk, or network, nas, san,
+   iscsi, etc. as desired.
+
+   For example: the values in /sys/kernel/release_region
+   would look something like this (address-range pairs).
+   CPU:0x177fee000-0x1: HPTE:0x177ffe020-0x1000: /
+   DUMP:0x177fff020-0x1000, 0x1000-0x16F1D370A
+
+-- As the userspace tools complete saving a portion of
+   dump, they echo an offset and size to
+   /sys/kernel/release_region to release the reserved
+   memory back to general use.
+
+   An example of this is:
+ echo 0x40000000 0x10000000 > /sys/kernel/release_region
+   which will release 256MB at the 1GB boundary.
+
+Please note that the hypervisor-assisted dump feature
+is only available on Power6-based systems with recent
+firmware versions.
+
+Implementation details:
+-----------------------
+
+During boot, a check is made to see if firmware supports
+this feature on this particular machine. If it does, then
+we check to see if an active dump is waiting for us. If yes,
+then everything but 256 MB of RAM is reserved during early
+boot. This area is released once the dump is collected by
+the userland scripts that are run. If there is dump data, then
+the /sys/kernel/release_region file is created, and
+the reserved memory is held.
+
+If there is no waiting dump data, then only the highest
+256MB of the ram is reserved as a scratch area. This area
+is *not* released: this region will be kept permanently
+reserved, so that it can act as a receptacle for a copy
+of the low 256MB in the case a crash does occur. See,
+however, open issues below, as to whether
+such a reserved region is really needed.
+
+Currently the dump will be copied from /proc/vmcore to
+a new file upon user intervention. The starting address
+to be read and the range for each data point is provided
+in /sys/kernel/release_region.
+
+The tools to examine the dump will be the same as the ones
+used for kdump.
+
+General notes:
+--------------
+Security: please note that there are potential security issues
+with any sort of dump mechanism. In particular, plaintext
+(unencrypted) data, and possibly passwords, may be present in
+the dump data. Userspace tools must take adequate precautions to
+preserve security.
+
+Open issues/ToDo:
+
+ o The various code paths that tell the hypervisor that a crash
+   occurred, vs. it simply being a normal reboot, should be
+   reviewed, and possibly clarified/fixed.
+
+ o Instead of using /sys/kernel, should there be a /sys/dump
+   instead? There is a 
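
As an illustration of the userspace flow described above, here is a
minimal sketch of a collection tool in C. It is only a sketch: it
assumes the /proc/vmcore and /sys/kernel/release_region semantics
described in the patch, and the destination path is a made-up example.

/*
 * Hypothetical sketch of the collect-and-release flow described
 * above. A real tool would walk every advertised DUMP range from
 * /sys/kernel/release_region instead of copying blindly and
 * releasing one fixed chunk.
 */
#include <stdio.h>

int main(void)
{
	char buf[65536];
	size_t n;
	FILE *in, *out, *rel;

	/* Save the previous kernel's memory image. */
	in = fopen("/proc/vmcore", "rb");
	out = fopen("/var/crash/vmcore", "wb");	/* assumed path */
	if (!in || !out)
		return 1;
	while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
		fwrite(buf, 1, n, out);
	fclose(out);
	fclose(in);

	/* Release a saved chunk back to the kernel: offset, then size.
	 * Example from the text above: 256MB at the 1GB boundary. */
	rel = fopen("/sys/kernel/release_region", "w");
	if (!rel)
		return 1;
	fprintf(rel, "0x40000000 0x10000000\n");
	fclose(rel);
	return 0;
}

Whether memory is released chunk-by-chunk as it is copied, or all at
once at the end, is left to the tool; the patch only defines the
release interface.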

Re: [PATCH 1/8] pseries: phyp dump: Documentation

2008-01-14 Thread Linas Vepstas
On 13/01/2008, Olof Johansson [EMAIL PROTECTED] wrote:

 How do you expect to have it in full production if you don't have all
 resources available for it? It's not until the dump has finished that you
 can return all memory to the production environment and use it.

With the PHYP dump, each chunk of RAM is returned for
general use immediately after being dumped; so it's not
an all-or-nothing proposition.  Production systems don't
often hit 100% RAM use right out of the gate; they often
take hours or days to get there, so again, there should
be time to dump.

 This can very easily be argued in both directions, with no clear winner:
 If the crash is stress-induced (say a slashdotted website), for those
 cases it seems more rational to take the time, collect _good data_ even
 if it takes a little longer, and then go back into production. Especially
 if the alternative is to go back into production immediately, collect
 about half of the data, and then crash again. Rinse and repeat.

Again, the mode of operation for the phyp dump is that you'll
always have all of the data from the *first* crash, even if there
are multiple crashes. That's because the as-yet-undumped
RAM is not put back into production.

 really surprises me that there's no way to reset a device through PHYP
 though. Seems like such a fundamental feature.

I don't know who said that; that's not right. The EEH function
certainly does allow you to halt/restart PCI traffic to a particular
device and also to reset the device.  So, yes, the pSeries
kexec code should call into the eeh subsystem to rationalize
the device state.

 I think people are overly optimistic if they think it'll be possible
 to do all of this reliably (as in with consistent performance) without
 a second reboot though.

The NUMA issues do concern me. But then, the whole virtualized,
fractional-cpu, tickless operation stuff sounds like a performance
tuning nightmare to begin with.

 At least without similar amounts of work being
 done as it would have taken to fix kdump's reliability in the first place.

:-)


 Speaking of reboots. PHYP isn't known for being quick at rebooting a
 partition, it used to take in the order of minutes even on a small
 machine. Has that been fixed?

Dunno.  Probably not.

  If not, the "avoiding an extra reboot"
 argument hardly seems like a benefit versus kdump+kexec, which reboots
 nearly instantly and without involvement from PHYP.

OK, let me tell you what I'm up against right now.
I'm dealing with sporadic corruption on my home box.

About a month ago, I bought a whizzy ASUS M2NE
motherboard, an AMD64 2-core CPU, and two sticks
of RAM, 1GB per stick. I have one new hard drive,
SATA, and one old hard drive from my old machine,
which is PATA.  The two disks are mirrored in a RAID-1
config. Running Ubuntu.

During install/upgrade a month ago, I noticed some of
the install files seemed to have gotten corrupted, but
that downloading them again got me a working version.
This put a serious frown on my face: maybe a bad ethernet
card or connection!?

Two weeks ago, gcc stopped working one morning, although
it worked fine the night before. I'd done nothing in the interim
but sleep. Reinstalling it made it work again. Yesterday,
something else stopped working.  I found the offending
library; I compared file checksums against a known-good
version, and they were off. (!!!) Disk corruption?

Then apt-get stopped working. The /var/lib/dpkg/status file
had randomly corrupted single bytes. It's ASCII, so I
hand-repaired it; it had maybe 10 bad bytes out of 2MB total size.

I installed tripwire. Between the first run of tripwire, and the
second, less than an hour later, it reported that several dozen
files had changed checksums. Manual inspection of some
of these files against known-good versions shows that, at least
this morning, that's no longer the case.

System hasn't crashed in a month, since first boot.  So
what's going on? Is it possible that one of the two disks
is serving up bad data, which explains the funny checksum
behaviour? Or maybe it's bad RAM, so that a fresh disk
read shows good data?  If it's bad RAM, why doesn't the
system crash?  I forced fsck last night; fsck came back
spotless.

So ... moral of the story: if phyp is doing some sort of
hardware checks and validation, that's great. I wish I could
afford a pSeries system for my home computer, because
my impression is that they are very stable, and don't do
things like data corruption.  I'm such a friggin cheapskate
that I can't bear to spend many thousands instead of many
hundreds of dollars. However, I will trade a longer boot
for the dream of higher reliability.

--linas


Re: [PATCH 1/8] pseries: phyp dump: Documentation

2008-01-13 Thread Olof Johansson
On Fri, Jan 11, 2008 at 10:57:51AM -0600, Linas Vepstas wrote:
 On 10/01/2008, Nathan Lynch [EMAIL PROTECTED] wrote:
  Mike Strosaker wrote:
  
   At the risk of repeating what others have already said, the
   PHYP-assistance method provides some advantages that the kexec
   method cannot:
    - Availability of the system for production use before the dump
   data is collected.  As was mentioned before, some production
   systems may choose not to operate with the limited memory
   initially available after the reboot, but it sure is nice to
   provide the option.
 
  I'm more concerned that this design encourages the user to resume a
  workload *which is almost certainly known to result in a system crash*
  before collection of crash data is complete.  Maybe the gamble will
  pay off most of the time, but I wouldn't want to be working support
  when it doesn't.
 
 Workloads that cause crashes within hours of startup tend to be
 weeded-out/discovered during pre-production test of the system
 to be deployed. Since it's a pre-production test, dumps can be
 taken in a leisurely manner. Heck, even a session at the
 xmon prompt can be contemplated.
 
 The problem is when the crash only reproduces after days or
 weeks of uptime, on a production machine.  Since the machine
 is in production, its got to be brought back up ASAP.  Since
 its crashing only after days/weeks, the dump should have
 plenty of time to complete.  (And if it crashes quickly after
 that reboot ... well, support people always welcome ways
 in which a bug can be reproduced more quickly/easily).

How do you expect to have it in full production if you don't have all
resources available for it? It's not until the dump has finished that you
can return all memory to the production environment and use it.

This can very easily be argued in both directions, with no clear winner:
If the crash is stress-induced (say a slashdotted website), for those
cases it seems more rational to take the time, collect _good data_ even
if it takes a little longer, and then go back into production. Especially
if the alternative is to go back into production immediately, collect
about half of the data, and then crash again. Rinse and repeat.

Anyway -- I can agree that some of the arguments w.r.t. robustness and
reliability of collecting dumps can be higher using this approach. It
really surprises me that there's no way to reset a device through PHYP
though. Seems like such a fundamental feature.

I think people are overly optimistic if they think it'll be possible
to do all of this reliably (as in with consistent performance) without
a second reboot though. At least without similar amounts of work being
done as it would have taken to fix kdump's reliability in the first place.

Speaking of reboots. PHYP isn't known for being quick at rebooting a
partition; it used to take on the order of minutes even on a small
machine. Has that been fixed? If not, the "avoiding an extra reboot"
argument hardly seems like a benefit versus kdump+kexec, which reboots
nearly instantly and without involvement from PHYP.


-Olof



Re: [PATCH 1/8] pseries: phyp dump: Documentation

2008-01-11 Thread Linas Vepstas
On 10/01/2008, Nathan Lynch [EMAIL PROTECTED] wrote:
 Mike Strosaker wrote:
 
  At the risk of repeating what others have already said, the PHYP-assistance
  method provides some advantages that the kexec method cannot:
   - Availability of the system for production use before the dump data is
  collected.  As was mentioned before, some production systems may choose not
  to operate with the limited memory initially available after the reboot,
  but it sure is nice to provide the option.

 I'm more concerned that this design encourages the user to resume a
 workload *which is almost certainly known to result in a system crash*
 before collection of crash data is complete.  Maybe the gamble will
 pay off most of the time, but I wouldn't want to be working support
 when it doesn't.

Workloads that cause crashes within hours of startup tend to be
weeded-out/discovered during pre-production test of the system
to be deployed. Since it's a pre-production test, dumps can be
taken in a leisurely manner. Heck, even a session at the
xmon prompt can be contemplated.

The problem is when the crash only reproduces after days or
weeks of uptime, on a production machine.  Since the machine
is in production, it's got to be brought back up ASAP.  Since
it's crashing only after days/weeks, the dump should have
plenty of time to complete.  (And if it crashes quickly after
that reboot ... well, support people always welcome ways
in which a bug can be reproduced more quickly/easily).

--linas


Re: [PATCH 1/8] pseries: phyp dump: Documentation

2008-01-10 Thread Olof Johansson
On Wed, Jan 09, 2008 at 10:12:13PM -0600, Linas Vepstas wrote:
 On 09/01/2008, Olof Johansson [EMAIL PROTECTED] wrote:
  On Wed, Jan 09, 2008 at 08:33:53PM -0600, Linas Vepstas wrote:
 
   Heh. That's the elbow-grease of this thing.  The easy part is to get
   the core function working. The hard part is to test these various configs,
   and when they don't work, figure out what went wrong. That will take
  perseverance and brains.
 
  This just sounds like a whole lot of extra work to get a feature that
  already exists.
 
 Well, no. kexec is horribly ill-behaved with respect to PCI. The
 kexec kernel starts running with PCI devices in some random
 state; maybe they're DMA'ing or who knows what. kexec tries
 real hard to whack a few needed PCI devices into submission
 but it has been hit-n-miss, and the source of 90% of the kexec
 headaches and debugging effort. It's not pretty.

It surprises me that this hasn't been possible to resolve with less than
architecting a completely new interface, given that the platform has
all this fancy support for isolating and resetting adapters. After all,
the exact same thing has to be done by the hypervisor before rebooting
the partition.

 If all PCI host bridges could shut down or settle the bus, and
 raise the #RST line high, and then if all BIOSes supported
 this, you'd be right. But they can't 

This argument doesn't hold. We're not talking about some generic PC with
a crappy BIOS here, we're specifically talking about POWER6 PHYP. It
certainly already has ways to reset adapters in it, or EEH wouldn't
work. Actually, the whole phyp dump feature wouldn't work either, since
it's exactly what the firmware has to do under the covers as well.


-Olof



Re: [PATCH 1/8] pseries: phyp dump: Documentation

2008-01-10 Thread Linas Vepstas
On 10/01/2008, Olof Johansson [EMAIL PROTECTED] wrote:
 On Wed, Jan 09, 2008 at 10:12:13PM -0600, Linas Vepstas wrote:
  On 09/01/2008, Olof Johansson [EMAIL PROTECTED] wrote:
   On Wed, Jan 09, 2008 at 08:33:53PM -0600, Linas Vepstas wrote:
  
Heh. That's the elbow-grease of this thing.  The easy part is to get
the core function working. The hard part is to test these various
configs, and when they don't work, figure out what went wrong. That
will take perseverance and brains.
  
   This just sounds like a whole lot of extra work to get a feature that
   already exists.
 
  Well, no. kexec is horribly ill-behaved with respect to PCI. The
  kexec kernel starts running with PCI devices in some random
  state; maybe they're DMA'ing or who knows what. kexec tries
  real hard to whack a few needed PCI devices into submission
  but it has been hit-n-miss, and the source of 90% of the kexec
  headaches and debugging effort. It's not pretty.

 It surprises me that this hasn't been possible to resolve with less than
 architecting a completely new interface, given that the platform has
 all this fancy support for isolating and resetting adapters. After all,
 the exact same thing has to be done by the hypervisor before rebooting
 the partition.

OK, point taken.

-- The phyp interfaces are there for AIX, which I guess must
   not have kexec-like ability. So this is a case of Linux leveraging
  a feature architected for AIX.

-- There's also this idea, somewhat weak, that the crash may
   have corrupted the RAM where the kexec kernel sits.
   For someone who is used to seeing crashes due to
   null pointer derefs, this seems fairly unlikely. But perhaps
   crashes in production systems are more mind-bending.
   (we did have a case where a USB stick used for boot
   continued to scribble on memory long after it was
   supposed to be quiet and unused. This resulted in
   a very hard to debug crash.)

   A solution to a corrupted
   kexec kernel would be to disable memory access to
   where kexec sits, e.g. un-mapping or making r/o the
   pages where it lies. This begs the question of who
   unhides the kexec kernel, and what if this 'who' gets
   corrupted?

   In short, the kexec kernel does not boot
   exactly the same as a cold boot, and so this opens
   a can of worms about well, what's different, how do
   we minimize these differences, etc. and I think that
   lead AIX to punt, and say lets just use one single,
   well-known boot loader/ boot sequence instead of
   inventing a new one, thus leading to the phyp design.

   But that's just my guess.. :-)

--linas


Re: [PATCH 1/8] pseries: phyp dump: Documentation

2008-01-10 Thread Mike Strosaker
Linas Vepstas wrote:
 On 10/01/2008, Olof Johansson [EMAIL PROTECTED] wrote:
 
On Wed, Jan 09, 2008 at 10:12:13PM -0600, Linas Vepstas wrote:

On 09/01/2008, Olof Johansson [EMAIL PROTECTED] wrote:

On Wed, Jan 09, 2008 at 08:33:53PM -0600, Linas Vepstas wrote:


Heh. That's the elbow-grease of this thing.  The easy part is to get
the core function working. The hard part is to test these various configs,
and when they don't work, figure out what went wrong. That will take
perseverance and brains.

This just sounds like a whole lot of extra work to get a feature that
already exists.

Well, no. kexec is horribly ill-behaved with respect to PCI. The
kexec kernel starts running with PCI devices in some random
state; maybe they're DMA'ing or who knows what. kexec tries
real hard to whack a few needed PCI devices into submission
but it has been hit-n-miss, and the source of 90% of the kexec
headaches and debugging effort. It's not pretty.

It surprises me that this hasn't been possible to resolve with less than
architecting a completely new interface, given that the platform has
all this fancy support for isolating and resetting adapters. After all,
the exact same thing has to be done by the hypervisor before rebooting
the partition.
 
 
 OK, point taken.
 
 -- The phyp interfaces are there for AIX, which I guess must
not have kexec-like ability. So this is a case of Linux leveraging
   a feature architected for AIX.

Certainly AIX was in a more difficult position at the time, because
they don't have a kexec equivalent, and thus were collecting dump
data with a potentially faulty kernel.  It makes sense to have
something outside the partition collect or maintain the data;
ideally, some kind of service partition would extract dump data
from a failed partition, but giving one partition total access to
the memory of another is clearly risky.  Both the PHYP-assistance
method and the kexec method are ways to simulate that without the
risk.

At the risk of repeating what others have already said, the
PHYP-assistance method provides some advantages that the kexec
method cannot:
  - Availability of the system for production use before the dump
data is collected.  As was mentioned before, some production systems
may choose not to operate with the limited memory initially available
after the reboot, but it sure is nice to provide the option.
  - Ensuring that the devices are in a good state.  PHYP doesn't
expose a method to force adapters into a frozen state (which I agree
would be useful), and I don't know of any plans to do so.  What we
are starting to see is that some drivers need modifications in order
to work correctly with kdump [1].  Supporting PHYP-assisted dump
would eliminate those issues.
  - The small possibility that the kexec area could have been munged
by the failing kernel, preventing it from being able to collect a dump.

The NUMA issues are daunting, but not insurmountable.  Release early,
release often, n'est-ce pas?

Mike

[1] http://ozlabs.org/pipermail/linuxppc-dev/2007-November/045663.html



Re: [PATCH 1/8] pseries: phyp dump: Documentation

2008-01-10 Thread Nathan Lynch
Mike Strosaker wrote:
 
 At the risk of repeating what others have already said, the PHYP-assistance 
 method provides some advantages that the kexec method cannot:
  - Availability of the system for production use before the dump data is 
 collected.  As was mentioned before, some production systems may choose not 
 to operate with the limited memory initially available after the reboot, 
 but it sure is nice to provide the option.

I'm more concerned that this design encourages the user to resume a
workload *which is almost certainly known to result in a system crash*
before collection of crash data is complete.  Maybe the gamble will
pay off most of the time, but I wouldn't want to be working support
when it doesn't.


Re: [PATCH 1/8] pseries: phyp dump: Documentation

2008-01-09 Thread Nathan Lynch
Hi Linas,

Linas Vepstas wrote:
 
 On 08/01/2008, Nathan Lynch [EMAIL PROTECTED] wrote:
  Manish Ahuja wrote:
   +
   +The goal of hypervisor-assisted dump is to enable the dump of
   +a crashed system, and to do so from a fully-reset system, and
   +to minimize the total elapsed time until the system is back
   +in production use.
 
  Is it actually faster than kdump?
 
 This is a basic presumption;

I used the word "actually".  I already know that it is intended to be
faster.  :)


 it should blow it away, as, after all,
 it requires one less reboot!

There's more than rebooting going on during system dump processing.
Depending on the system type, booting may not be where most time is
spent.


 As a side effect, the system is in
 production *while* the dump is being taken;

A dubious feature IMO.  Seems that the design potentially trades
reliability of first failure data capture for availability.
E.g. system crashes, reboots, resumes processing while copying dump,
crashes again before dump procedure is complete.  How is that handled,
if at all?


 with kdump,
 you can't go into production until after the dump is finished,
 and the system has been rebooted a second time.  On
 systems with terabytes of RAM, the time difference can be
 hours.

The difference in time it takes to resume the normal workload may be
significant, yes.  But the time it takes to get a usable dump image
would seem to be the basically the same.

Since you bring up large systems... a system with terabytes of RAM is
practically guaranteed to be a NUMA configuration with dozens of cpus.
When processing a dump on such a system, I wonder how well we fare:
can we successfully boot with (say) 128 cpus and 256MB of usable
memory?  Do we have to hot-online nodes as system memory is freed up
(and does that even work)?  We need to be able to restore the system
to its optimal topology when the dump is finished; if the best we can
do is a degraded configuration, the workload will suffer and the
system admin is likely to just reboot the machine again so the kernel
will have the right NUMA topology.


   +Implementation details:
   +--
   +In order for this scheme to work, memory needs to be reserved
   +quite early in the boot cycle. However, access to the device
   +tree this early in the boot cycle is difficult, and device-tree
   +access is needed to determine if there is a crash data waiting.
 
  I don't think this bit about early device tree access is correct.  By
  the time your code is reserving memory (from early_init_devtree(), I
  think), RTAS has been instantiated and you are able to test for the
  existence of /rtas/ibm,dump-kernel.
 
 If I remember right, it was still too early to look up this token directly,
 so we wrote some code to crawl the flat device tree to find it.  Not
 only was that a lot of work, but I somehow decided that doing this
 to the flat tree was wrong, as otherwise someone would surely have
 written the access code.  If this can be made to work, that would be
 great, but we couldn't make it work at the time.
 
   +To work around this problem, all but 256MB of RAM is reserved
   +during early boot. A short while later in boot, a check is made
   +to determine if there is dump data waiting. If there isn't,
   +then the reserved memory is released to general kernel use.
 
  So I think these gymnastics are unneeded -- unless I'm
  misunderstanding something, you should be able to determine very early
  whether to reserve that memory.
 
 Only if you can get at rtas, but you can't get at rtas at that point.

Sorry, but I think you are mistaken (see Michael's earlier reply).



Re: [PATCH 1/8] pseries: phyp dump: Documentation

2008-01-09 Thread Manish Ahuja
 
 I used the word "actually".  I already know that it is intended to be
 faster.  :)
 
 it should blow it away, as, after all,
 it requires one less reboot!
 
 There's more than rebooting going on during system dump processing.
 Depending on the system type, booting may not be where most time is
 spent.
 
 
 As a side effect, the system is in
 production *while* the dump is being taken;
 
 A dubious feature IMO.  Seems that the design potentially trades
 reliability of first failure data capture for availability.
 E.g. system crashes, reboots, resumes processing while copying dump,
 crashes again before dump procedure is complete.  How is that handled,
 if at all?

This is a simple version. The intent was not to have a complex
dump-taking mechanism in version 1. Subsequent versions will see
planned improvements in the way the pages are tracked and freed.

Also, it is now very easily possible to register for another dump as
soon as the scratch area is copied to a user-designated region. But
for now, this simple implementation exists.

It is also possible to extend this further to preserve only the
kernel pages and free the non-required pages (user/data pages, etc.).
This would reduce the space preserved and would prevent any issues
caused by reserving everything in memory except for the first 256 MB.

Improvements and future versions are planned to make this efficient.
But for now, the intent is to get this off the ground and handle the
simple cases.

 
 
 with kdump,
 you can't go into production until after the dump is finished,
 and the system has been rebooted a second time.  On
 systems with terabytes of RAM, the time difference can be
 hours.
 
 The difference in time it takes to resume the normal workload may be
 significant, yes.  But the time it takes to get a usable dump image
 would seem to be the basically the same.
 
 Since you bring up large systems... a system with terabytes of RAM is
 practically guaranteed to be a NUMA configuration with dozens of cpus.
 When processing a dump on such a system, I wonder how well we fare:
 can we successfully boot with (say) 128 cpus and 256MB of usable
 memory?  Do we have to hot-online nodes as system memory is freed up
 (and does that even work)?  We need to be able to restore the system
 to its optimal topology when the dump is finished; if the best we can
 do is a degraded configuration, the workload will suffer and the
 system admin is likely to just reboot the machine again so the kernel
 will have the right NUMA topology.
 
 
 +Implementation details:
 +--
 +In order for this scheme to work, memory needs to be reserved
 +quite early in the boot cycle. However, access to the device
 +tree this early in the boot cycle is difficult, and device-tree
 +access is needed to determine if there is a crash data waiting.
 I don't think this bit about early device tree access is correct.  By
 the time your code is reserving memory (from early_init_devtree(), I
 think), RTAS has been instantiated and you are able to test for the
 existence of /rtas/ibm,dump-kernel.
 If I remember right, it was still too early to look up this token directly,
  so we wrote some code to crawl the flat device tree to find it.  Not
  only was that a lot of work, but I somehow decided that doing this
 to the flat tree was wrong, as otherwise someone would surely have
 written the access code.  If this can be made to work, that would be
 great, but we couldn't make it work at the time.

 +To work around this problem, all but 256MB of RAM is reserved
 +during early boot. A short while later in boot, a check is made
 +to determine if there is dump data waiting. If there isn't,
 +then the reserved memory is released to general kernel use.
 So I think these gymnastics are unneeded -- unless I'm
 misunderstanding something, you should be able to determine very early
 whether to reserve that memory.
 Only if you can get at rtas, but you can't get at rtas at that point.
 
 Sorry, but I think you are mistaken (see Michael's earlier reply).
 


Re: [PATCH 1/8] pseries: phyp dump: Documentation

2008-01-09 Thread Manish Ahuja
 It's in production with 256MB of RAM? Err. Sure, as the dump progresses
 more RAM will be freed, but that's hardly production. I think Nathan's
 right: any sysadmin who wants predictability will probably double-reboot
 anyway.

That's a changeable parameter; it's something we chose for now, and it
is by no means set in stone. It's not a design parameter: if you'd like
to allocate 1GB, we can. We expect this to be a variable value dependent
upon the size of the system, so if you have a 128 GB system and can
spare 10 GB, you should be able to have 10 GB to boot with.



Re: [PATCH 1/8] pseries: phyp dump: Documentation

2008-01-09 Thread Michael Ellerman
On Wed, 2008-01-09 at 12:44 -0600, Nathan Lynch wrote:
 Hi Linas,
 
 Linas Vepstas wrote:
  
  On 08/01/2008, Nathan Lynch [EMAIL PROTECTED] wrote:
   Manish Ahuja wrote:
+
+The goal of hypervisor-assisted dump is to enable the dump of
+a crashed system, and to do so from a fully-reset system, and
+to minimize the total elapsed time until the system is back
+in production use.
  
   Is it actually faster than kdump?
  
  This is a basic presumption;
 
  As a side effect, the system is in
  production *while* the dump is being taken;

It's in production with 256MB of RAM? Err. Sure, as the dump progresses
more RAM will be freed, but that's hardly production. I think Nathan's
right: any sysadmin who wants predictability will probably double-reboot
anyway.

  with kdump,
  you can't go into production until after the dump is finished,
  and the system has been rebooted a second time.  On
  systems with terabytes of RAM, the time difference can be
  hours.

 Since you bring up large systems... a system with terabytes of RAM is
 practically guaranteed to be a NUMA configuration with dozens of cpus.
 When processing a dump on such a system, I wonder how well we fare:
 can we successfully boot with (say) 128 cpus and 256MB of usable
 memory?  Do we have to hot-online nodes as system memory is freed up
 (and does that even work)?  We need to be able to restore the system
 to its optimal topology when the dump is finished; if the best we can
 do is a degraded configuration, the workload will suffer and the
 system admin is likely to just reboot the machine again so the kernel
 will have the right NUMA topology.

Yeah, that's a good question. Even if the hot-onlining works, there are
still kernel data structures allocated at boot which want to be
node-local. So the end result will be != a production boot.

+Implementation details:
+--
+In order for this scheme to work, memory needs to be reserved
+quite early in the boot cycle. However, access to the device
+tree this early in the boot cycle is difficult, and device-tree
+access is needed to determine if there is a crash data waiting.
  
   I don't think this bit about early device tree access is correct.  By
   the time your code is reserving memory (from early_init_devtree(), I
   think), RTAS has been instantiated and you are able to test for the
   existence of /rtas/ibm,dump-kernel.
  
  If I remember right, it was still too early to look up this token directly,
  so we wrote some code to crawl the flat device tree to find it.  Not
  only was that a lot of work, but I somehow decided that doing this
  to the flat tree was wrong, as otherwise someone would surely have
  written the access code.  If this can be made to work, that would be
  great, but we couldn't make it work at the time.
  
+To work around this problem, all but 256MB of RAM is reserved
+during early boot. A short while later in boot, a check is made
+to determine if there is dump data waiting. If there isn't,
+then the reserved memory is released to general kernel use.
  
   So I think these gymnastics are unneeded -- unless I'm
   misunderstanding something, you should be able to determine very early
   whether to reserve that memory.
  
  Only if you can get at rtas, but you can't get at rtas at that point.

AFAICT you don't need to get at RTAS; you just need to look at the
device tree to see if the property is present, and that is trivial.

You probably just need to add a check in early_init_dt_scan_rtas() which
sets a flag for the PHYP dump stuff, or add your own scan routine if you
need.

cheers


Re: [PATCH 1/8] pseries: phyp dump: Documentation

2008-01-09 Thread Linas Vepstas
On 09/01/2008, Nathan Lynch [EMAIL PROTECTED] wrote:
 Hi Linas,

 Linas Vepstas wrote:
 
  As a side effect, the system is in
  production *while* the dump is being taken;

 A dubious feature IMO.

Hmm.  Take it up with Ken Rozendal; this is supposed to be
one of the two main selling points of this thing.

 Seems that the design potentially trades
 reliability of first failure data capture for availability.
 E.g. system crashes, reboots, resumes processing while copying dump,
 crashes again before dump procedure is complete.  How is that handled,
 if at all?

It's handled by the hypervisor.  PHYP maintains the copy of the
RMO of the first crash, until such time as the OS declares the
dump of the RMO to be complete. So you'll always have
the RMO of the first crash.

For the rest of RAM, it will come in two parts: some portion
will have been dumped already. The rest has not yet been dumped,
and it will still be there, preserved across the second crash.

So you get both RMO and all of RAM from the first crash.

  with kdump,
  you can't go into production until after the dump is finished,
  and the system has been rebooted a second time.  On
  systems with terabytes of RAM, the time difference can be
  hours.

 The difference in time it takes to resume the normal workload may be
 significant, yes.  But the time it takes to get a usable dump image
 would seem to be the basically the same.

Yes.

 Since you bring up large systems... a system with terabytes of RAM is
 practically guaranteed to be a NUMA configuration with dozens of cpus.
 When processing a dump on such a system, I wonder how well we fare:
 can we successfully boot with (say) 128 cpus and 256MB of usable
 memory?  Do we have to hot-online nodes as system memory is freed up
 (and does that even work)?  We need to be able to restore the system
 to its optimal topology when the dump is finished; if the best we can
 do is a degraded configuration, the workload will suffer and the
 system admin is likely to just reboot the machine again so the kernel
 will have the right NUMA topology.

Heh. That's the elbow-grease of this thing.  The easy part is to get
the core function working. The hard part is to test these various configs,
and when they don't work, figure out what went wrong. That will take
perseverance and brains.

--linas


Re: [PATCH 1/8] pseries: phyp dump: Documentation

2008-01-09 Thread Linas Vepstas
On 09/01/2008, Michael Ellerman [EMAIL PROTECTED] wrote:

   Only if you can get at rtas, but you can't get at rtas at that point.

 AFAICT you don't need to get at RTAS, you just need to look at the
 device tree to see if the property is present, and that is trivial.

 You probably just need to add a check in early_init_dt_scan_rtas() which
 sets a flag for the PHYP dump stuff, or add your own scan routine if you
 need.

I no longer remember the details. I do remember spending a lot of time
trying to figure out how to do this. I know I didn't want to write my own scan
routine; maybe that's what stopped me.  As it happens, we also did most
of the development on a broken phyp which simply did not even have
this property, no matter what, and so that may have brain-damaged me.

I went for the most elegant solution, where "most elegant" is defined
as "fewest lines of code, least effort", etc.

Manish may need some hands-on help to extract this token during
early boot.  Hopefully, he'll let us know.

--linas


Re: [PATCH 1/8] pseries: phyp dump: Documentation

2008-01-09 Thread Olof Johansson
On Wed, Jan 09, 2008 at 08:33:53PM -0600, Linas Vepstas wrote:

 Heh. That's the elbow-grease of this thing.  The easy part is to get
 the core function working. The hard part is to test these various configs,
 and when they don't work, figure out what went wrong. That will take
 perseverance and brains.

This just sounds like a whole lot of extra work to get a feature that
already exists. Also, features like these seem to just get tested when the
next enterprise distro is released, so they're broken for long stretches
of time in mainline.

There's a bunch of problems like the NUMA ones, which would by far be
easiest to solve by just doing another reboot or kexec, wouldn't they?


-Olof



Re: [PATCH 1/8] pseries: phyp dump: Documentation

2008-01-09 Thread Michael Ellerman
On Wed, 2008-01-09 at 20:47 -0600, Linas Vepstas wrote:
 On 09/01/2008, Michael Ellerman [EMAIL PROTECTED] wrote:
 
Only if you can get at rtas, but you can't get at rtas at that point.
 
  AFAICT you don't need to get at RTAS, you just need to look at the
  device tree to see if the property is present, and that is trivial.
 
  You probably just need to add a check in early_init_dt_scan_rtas() which
  sets a flag for the PHYP dump stuff, or add your own scan routine if you
  need.
 
 I no longer remember the details. I do remember spending a lot of time
 trying to figure out how to do this. I know I didn't want to write my own scan
 routine; maybe that's what stopped me.  As it happens, we also did most
 of the development on a broken phyp which simply did not even have
 this property, no matter what, and so that may have brain-damaged me.

Sure, the API docs for the kernel are a little lacking ;)

 I went for the most elegant solution, where "most elegant" is defined
 as "fewest lines of code, least effort", etc.
 
 Manish may need some hands-on help to extract this token during
 early boot.  Hopefully, he'll let us know.

It would just be something like:

--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -901,6 +901,11 @@ int __init early_init_dt_scan_rtas(unsigned long node,
rtas.size = *sizep;
}
 
+#ifdef CONFIG_PHYP_DUMP
+   if (of_get_flat_dt_prop(node, "ibm,dump-kernel", NULL))
+   phyp_dump_is_active++;
+#endif
+
 #ifdef CONFIG_UDBG_RTAS_CONSOLE
basep = of_get_flat_dt_prop(node, "put-term-char", NULL);
if (basep)


Or to do your own scan routine:


diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index acc0d24..442134e 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -1022,6 +1022,7 @@ void __init early_init_devtree(void *params)
/* Some machines might need RTAS info for debugging, grab it now. */
of_scan_flat_dt(early_init_dt_scan_rtas, NULL);
 #endif
+   of_scan_flat_dt(early_init_dt_scan_phyp_dump, NULL);
 
/* Retrieve various informations from the /chosen node of the
 * device-tree, including the platform type, initrd location and
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index 52e95c2..af2b6e8 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -883,6 +883,19 @@ void __init rtas_initialize(void)
 #endif
 }
 
+int __init early_init_dt_scan_phyp_dump(unsigned long node,
+   const char *uname, int depth, void *data)
+{
+#ifdef CONFIG_PHYP_DUMP
+   if (depth != 1 || strcmp(uname, "rtas") != 0)
+   return 0;
+
+   if (of_get_flat_dt_prop(node, "ibm,dump-kernel", NULL))
+   phyp_dump_is_active++;
+#endif
+   return 1;
+}
+
 int __init early_init_dt_scan_rtas(unsigned long node,
const char *uname, int depth, void *data)
 {


cheers
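
For context, here is a minimal sketch of how such a flag might then be
consumed during early memory reservation. This is purely illustrative:
phyp_dump_is_active is the flag from the diffs above, lmb_reserve()
and lmb_end_of_DRAM() are the existing powerpc LMB calls of that era,
and the function name and the 256MB policy are assumptions taken from
the patch text, not code from this thread.

/*
 * Hypothetical sketch: reserve everything above the low 256MB during
 * early boot when a dump is waiting, per the scheme in the patch.
 */
static void __init phyp_dump_reserve_mem(void)
{
	u64 base = 256ul << 20;		/* keep the low 256MB usable */

	if (!phyp_dump_is_active)
		return;

	lmb_reserve(base, lmb_end_of_DRAM() - base);
}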


Re: [PATCH 1/8] pseries: phyp dump: Documentation

2008-01-09 Thread Linas Vepstas
On 09/01/2008, Olof Johansson [EMAIL PROTECTED] wrote:
 On Wed, Jan 09, 2008 at 08:33:53PM -0600, Linas Vepstas wrote:

  Heh. That's the elbow-grease of this thing.  The easy part is to get
  the core function working. The hard part is to test these various configs,
  and when they don't work, figure out what went wrong. That will take
  perseverance and brains.

 This just sounds like a whole lot of extra work to get a feature that
 already exists.

Well, no. kexec is horribly ill-behaved with respect to PCI. The
kexec kernel starts running with PCI devices in some random
state; maybe they're DMA'ing or who knows what. kexec tries
real hard to whack a few needed PCI devices into submission
but it has been hit-n-miss, and the source of 90% of the kexec
headaches and debugging effort. It's not pretty.

If all PCI host bridges could shut down or settle the bus, and
raise the #RST line high, and then if all BIOSes supported
this, you'd be right. But they can't 

--linas


Re: [PATCH 1/8] pseries: phyp dump: Documentation

2008-01-09 Thread Michael Ellerman
On Wed, 2008-01-09 at 22:12 -0600, Linas Vepstas wrote:
 On 09/01/2008, Olof Johansson [EMAIL PROTECTED] wrote:
  On Wed, Jan 09, 2008 at 08:33:53PM -0600, Linas Vepstas wrote:
 
   Heh. That's the elbow-grease of this thing.  The easy part is to get
   the core function working. The hard part is to test these various configs,
   and when they don't work, figure out what went wrong. That will take
   perseverance and brains.
 
  This just sounds like a whole lot of extra work to get a feature that
  already exists.
 
 Well, no. kexec is horribly ill-behaved with respect to PCI. The
 kexec kernel starts running with PCI devices in some random
 state; maybe they're DMA'ing or who knows what. kexec tries
  real hard to whack a few needed PCI devices into submission
 but it has been hit-n-miss, and the source of 90% of the kexec
  headaches and debugging effort. It's not pretty.

Isn't that what EEH and the IOMMU are for? :)

cheers


Re: [PATCH 1/8] pseries: phyp dump: Documentation

2008-01-08 Thread Nathan Lynch
Manish Ahuja wrote:
 +
 +   Hypervisor-Assisted Dump
 +   
 +   November 2007

Date is unneeded (and, uhm, dated :)


 +The goal of hypervisor-assisted dump is to enable the dump of
 +a crashed system, and to do so from a fully-reset system, and
 +to minimize the total elapsed time until the system is back
 +in production use.

Is it actually faster than kdump?


 +As compared to kdump or other strategies, hypervisor-assisted
 +dump offers several strong, practical advantages:
 +
 +-- Unlike kdump, the system has been reset, and loaded
 +   with a fresh copy of the kernel.  In particular,
 +   PCI and I/O devices have been reinitialized and are
 +   in a clean, consistent state.
 +-- As the dump is performed, the dumped memory becomes
 +   immediately available to the system for normal use.
 +-- After the dump is completed, no further reboots are
 +   required; the system will be fully usable, and running
 +   in it's normal, production mode on it normal kernel.
 +
 +The above can only be accomplished by coordination with,
 +and assistance from the hypervisor. The procedure is
 +as follows:
 +
 +-- When a system crashes, the hypervisor will save
 +   the low 256MB of RAM to a previously registered
 +   save region. It will also save system state, system
 +   registers, and hardware PTE's.
 +
 +-- After the low 256MB area has been saved, the
 +   hypervisor will reset PCI and other hardware state.
 +   It will *not* clear RAM. It will then launch the
 +   bootloader, as normal.
 +
 +-- The freshly booted kernel will notice that there
 +   is a new node (ibm,dump-kernel) in the device tree,
 +   indicating that there is crash data available from
 +   a previous boot. It will boot into only 256MB of RAM,
 +   reserving the rest of system memory.
 +
 +-- Userspace tools will parse /sys/kernel/release_region
 +   and read /proc/vmcore to obtain the contents of memory,
 +   which holds the previous crashed kernel. The userspace
 +   tools may copy this info to disk, or network, nas, san,
 +   iscsi, etc. as desired.
 +
 +   For Example: the values in /sys/kernel/release-region
 +   would look something like this (address-range pairs).
 +   CPU:0x177fee000-0x1: HPTE:0x177ffe020-0x1000: /
 +   DUMP:0x177fff020-0x1000, 0x1000-0x16F1D370A
 +
 +-- As the userspace tools complete saving a portion of
 +   dump, they echo an offset and size to
 +   /sys/kernel/release_region to release the reserved
 +   memory back to general use.
 +
 +   An example of this is:
 + echo 0x40000000 0x10000000 > /sys/kernel/release_region
 +   which will release 256MB at the 1GB boundary.

This violates the "one file, one value" rule of sysfs, but nobody
really takes that seriously, I guess.  In any case, consider
documenting this in Documentation/ABI.


 +
 +Please note that the hypervisor-assisted dump feature
 +is only available on Power6-based systems with recent
 +firmware versions.

This statement will of course become dated/incorrect so I recommend
removing it.


 +
 +Implementation details:
 +--
 +In order for this scheme to work, memory needs to be reserved
 +quite early in the boot cycle. However, access to the device
 +tree this early in the boot cycle is difficult, and device-tree
 +access is needed to determine if there is a crash data waiting.

I don't think this bit about early device tree access is correct.  By
the time your code is reserving memory (from early_init_devtree(), I
think), RTAS has been instantiated and you are able to test for the
existence of /rtas/ibm,dump-kernel.


 +To work around this problem, all but 256MB of RAM is reserved
 +during early boot. A short while later in boot, a check is made
 +to determine if there is dump data waiting. If there isn't,
 +then the reserved memory is released to general kernel use.

So I think these gymnastics are unneeded -- unless I'm
misunderstanding something, you should be able to determine very early
whether to reserve that memory.


 +If there is dump data, then the /sys/kernel/release_region
 +file is created, and the reserved memory is held.
 +
 +If there is no waiting dump data, then all but 256MB of the
 +reserved ram will be released for general kernel use. The
 +highest 256 MB of RAM will *not* be released: this region
 +will be kept permanently reserved, so that it can act as
 +a receptacle for a copy of the low 256MB in the case a crash
 +does occur. See, however, open issues below, as to whether
 +such a reserved region is really needed.
 +
 +Currently the dump will be copied from /proc/vmcore to a
 +a new file upon user intervention. The starting address
 +to be read and the range for each data point in provided
   ^is

 +in /sys/kernel/release_region.
 +
 +The tools to examine the dump will be same as the ones
 +used for kdump.
 +
 +
 +General notes:
 +--
 +Security: please note that there are 

Re: [PATCH 1/8] pseries: phyp dump: Documentation

2008-01-08 Thread Michael Ellerman
On Tue, 2008-01-08 at 22:29 -0600, Nathan Lynch wrote:
 Manish Ahuja wrote:
  [...]
  +
  +Implementation details:
  +-----------------------
  +In order for this scheme to work, memory needs to be reserved
  +quite early in the boot cycle. However, access to the device
  +tree this early in the boot cycle is difficult, and device-tree
  +access is needed to determine if there is crash data waiting.
 
 I don't think this bit about early device tree access is correct.  By
 the time your code is reserving memory (from early_init_devtree(), I
 think), RTAS has been instantiated and you are able to test for the
 existence of /rtas/ibm,dump-kernel.

Yep it's early_init_devtree(), and yes it's fairly easy to access the
(flattened) device tree at that point.
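
To make that concrete, a rough sketch of such an early check might
look like the following, using the 2.6.24-era flattened-device-tree
and LMB helpers. The function names, the property location under
/rtas, and the reservation policy are illustrative assumptions, not
the actual patch code:

    #include <linux/string.h>
    #include <asm/prom.h>  /* of_scan_flat_dt(), of_get_flat_dt_prop() */
    #include <asm/lmb.h>   /* lmb_reserve(), lmb_end_of_DRAM() */

    /* Set *data if /rtas carries ibm,dump-kernel, i.e. crash data
     * from the previous boot is waiting to be collected. */
    static int __init phyp_dump_scan(unsigned long node,
                                     const char *uname, int depth,
                                     void *data)
    {
            if (depth != 1 || strcmp(uname, "rtas") != 0)
                    return 0;
            if (of_get_flat_dt_prop(node, "ibm,dump-kernel", NULL))
                    *(int *)data = 1;
            return 1;
    }

    /* Called from early_init_devtree(): reserve everything above
     * 256MB only when a dump is actually waiting. */
    static void __init phyp_dump_reserve_mem(void)
    {
            int dump_waiting = 0;

            of_scan_flat_dt(phyp_dump_scan, &dump_waiting);
            if (dump_waiting)
                    lmb_reserve(256UL << 20,
                                lmb_end_of_DRAM() - (256UL << 20));
    }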

  +To work around this problem, all but 256MB of RAM is reserved
  +during early boot. A short while later in boot, a check is made
  +to determine if there is dump data waiting. If there isn't,
  +then the reserved memory is released to general kernel use.
 
 So I think these gymnastics are unneeded -- unless I'm
 misunderstanding something, you should be able to determine very early
 whether to reserve that memory.

I agree.

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab




[PATCH 1/8] pseries: phyp dump: Documentation

2008-01-07 Thread Manish Ahuja

Basic documentation for hypervisor-assisted dump.

Signed-off-by: Linas Vepstas [EMAIL PROTECTED]
Signed-off-by: Manish Ahuja [EMAIL PROTECTED]


 Documentation/powerpc/phyp-assisted-dump.txt |  129 +++
 1 file changed, 129 insertions(+)

Index: 2.6.24-rc5/Documentation/powerpc/phyp-assisted-dump.txt
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ 2.6.24-rc5/Documentation/powerpc/phyp-assisted-dump.txt 2008-01-07 
18:05:46.0 -0600
@@ -0,0 +1,129 @@
+
+   Hypervisor-Assisted Dump
+   ------------------------
+   November 2007
+
+The goal of hypervisor-assisted dump is to enable the dump of
+a crashed system, and to do so from a fully-reset system, and
+to minimize the total elapsed time until the system is back
+in production use.
+
+As compared to kdump or other strategies, hypervisor-assisted
+dump offers several strong, practical advantages:
+
+-- Unlike kdump, the system has been reset, and loaded
+   with a fresh copy of the kernel.  In particular,
+   PCI and I/O devices have been reinitialized and are
+   in a clean, consistent state.
+-- As the dump is performed, the dumped memory becomes
+   immediately available to the system for normal use.
+-- After the dump is completed, no further reboots are
+   required; the system will be fully usable, and running
+   in its normal, production mode on its normal kernel.
+
+The above can only be accomplished by coordination with,
+and assistance from the hypervisor. The procedure is
+as follows:
+
+-- When a system crashes, the hypervisor will save
+   the low 256MB of RAM to a previously registered
+   save region. It will also save system state, system
+   registers, and hardware PTEs.
+
+-- After the low 256MB area has been saved, the
+   hypervisor will reset PCI and other hardware state.
+   It will *not* clear RAM. It will then launch the
+   bootloader, as normal.
+
+-- The freshly booted kernel will notice that there
+   is a new node (ibm,dump-kernel) in the device tree,
+   indicating that there is crash data available from
+   a previous boot. It will boot into only 256MB of RAM,
+   reserving the rest of system memory.
+
+-- Userspace tools will parse /sys/kernel/release_region
+   and read /proc/vmcore to obtain the contents of memory,
+   which holds the previously crashed kernel. The userspace
+   tools may copy this info to disk, or over the network
+   to NAS, SAN, iSCSI storage, etc., as desired.
+
+   For example, the values in /sys/kernel/release_region
+   would look something like this (address-range pairs).
+   CPU:0x177fee000-0x1: HPTE:0x177ffe020-0x1000: /
+   DUMP:0x177fff020-0x1000, 0x1000-0x16F1D370A
+
+-- As the userspace tools complete saving a portion of
+   the dump, they echo an offset and size to
+   /sys/kernel/release_region to release the reserved
+   memory back to general use.
+
+   An example of this is:
+ echo 0x40000000 0x10000000 > /sys/kernel/release_region
+   which will release 256MB at the 1GB boundary.
+
+Please note that the hypervisor-assisted dump feature
+is only available on Power6-based systems with recent
+firmware versions.
+
+Implementation details:
+-----------------------
+In order for this scheme to work, memory needs to be reserved
+quite early in the boot cycle. However, access to the device
+tree this early in the boot cycle is difficult, and device-tree
+access is needed to determine if there is crash data waiting.
+To work around this problem, all but 256MB of RAM is reserved
+during early boot. A short while later in boot, a check is made
+to determine if there is dump data waiting. If there isn't,
+then the reserved memory is released to general kernel use.
+If there is dump data, then the /sys/kernel/release_region
+file is created, and the reserved memory is held.
+
+If there is no waiting dump data, then all but 256MB of the
+reserved RAM will be released for general kernel use. The
+highest 256MB of RAM will *not* be released: this region
+will be kept permanently reserved, so that it can act as
+a receptacle for a copy of the low 256MB in case a crash
+does occur. See, however, open issues below, as to whether
+such a reserved region is really needed.
+
+Currently the dump will be copied from /proc/vmcore to
+a new file upon user intervention. The starting address
+to be read and the range for each data point is provided
+in /sys/kernel/release_region.
+
+The tools to examine the dump will be the same as the ones
+used for kdump.
+
+
+General notes:
+--------------
+Security: please note that there are potential security issues
+with any sort of dump mechanism. In particular, plaintext
+(unencrypted) data, and possibly passwords, may be present in
+the dump data. Userspace tools must take adequate precautions to
+preserve security.
+
+Open issues/ToDo:
+
+ o The various code paths that tell the hypervisor that a