Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
On Mon, Feb 04, 2008 at 07:38:40PM +0300, Michael Tokarev wrote:
> Eric Sandeen wrote:
> []
>> http://oss.sgi.com/projects/xfs/faq.html#nulls
>>
>> and note that recent fixes have been made in this area (also noted in
>> the faq)
>>
>> Also - the above all assumes that when a drive says it's written/flushed
>> data, that it truly has.  Modern write-caching drives can wreak havoc
>> with any journaling filesystem, so that's one good reason for a UPS.  If
>
> Unfortunately a UPS does not *really* help here.  Unless it has a control
> program which properly shuts the system down on loss of input power, and
> the battery really has the capacity to power the system while it's
> shutting down (has anyone tested this?  With a new UPS?  And after a year
> of use, when the battery is no longer new?), the UPS will cut the power
> at an unexpected time, while the disk(s) still have dirty caches...

If the UPS is supported by nut (http://www.networkupstools.org) you can do
this easily.  Obviously you should tune the timeout to give your systems
enough time to shut down in case of a power outage, and periodically check
your battery duration (that means real tests) and re-tune the nut software
(and when you discover your battery is dead, replace it).

L.

--
Luca Berra -- [EMAIL PROTECTED]
Communication Media & Services S.r.l.
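For reference, a minimal sketch of the nut-based shutdown Luca describes.
The file location, the UPS name "myups" and the credentials are
placeholders; check your distribution's defaults before using anything
like this.

    # /etc/nut/upsmon.conf (illustrative values only): upsmon watches the
    # UPS and runs SHUTDOWNCMD when the battery goes critical.
    MONITOR myups@localhost 1 upsmon_user secretpass master
    MINSUPPLIES 1
    POLLFREQ 5
    SHUTDOWNCMD "/sbin/shutdown -h +0"

    # A "real test", as suggested above: force the full shutdown sequence
    # occasionally so you know the whole path (and the battery) still works.
    upsmon -c fsd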
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Linda Walsh wrote:
> Michael Tokarev wrote:
>> Unfortunately a UPS does not *really* help here.  Unless it has a control
>> program which properly shuts the system down on loss of input power, and
>> the battery really has the capacity to power the system while it's
>> shutting down (has anyone tested this?
>
> Yes.  I must say, I am not connected to or paid by APC.
>
>> With a new UPS?  And after a year of use, when the battery is no longer
>> new?), -- unless the UPS actually has the capacity to shut the system
>> down, it will cut the power at an unexpected time, while the disk(s)
>> still have dirty caches...
>
> If you have a "SmartUPS" by "APC", there is a freeware daemon that monitors
> [...]

Good stuff.  I knew at least SOME UPSes are good... ;)  Too bad I rarely see
such stuff in use by regular home users...

[]
>> Note also that with linux software raid barriers are NOT supported.
>
> Are you sure about this?  When my system boots, I used to have
> 3 new IDEs, and one older one.  XFS checked each drive for barriers
> and turned off barriers for a disk that didn't support it.  ... or
> are you referring specifically to linux-raid setups?

I'm referring specifically to linux-raid setups (software raid).  md devices
don't support barriers, for a very simple reason: once more than one disk
drive is involved, the md layer can't guarantee ordering ACROSS drives.

The problem is that in case of power loss during writes, when an array needs
recovery/resync (at least for the parts which were being written, if bitmaps
are in use), the md layer will choose an arbitrary drive as the "master" and
will copy its data to the other drive (speaking of the simplest case, a
2-drive raid1 array).  But one drive may have the last two barriers written
(I mean the data that was "associated" with the barriers), and the other
neither of the two - in two different places.  And hence we may see quite
some inconsistency here.  This is regardless of whether the underlying
component devices support barriers or not.

> Would it be possible on boot to have xfs probe the Raid array,
> physically, to see if barriers are really supported (or not), and disable
> them if they are not (and optionally disabling write caching, but that's
> a major performance hit in my experience).

Xfs already probes the devices as you describe, exactly the same way as
you've seen with your ide disks, and disables barriers.  The question and
confusion was about what happens when the barriers are disabled (provided,
again, that we don't rely on a UPS and other external things).

As far as I understand, when barriers are working properly, xfs should be
safe wrt power losses (still a bit unsure about this).  Now, when barriers
are turned off (for whatever reason), is it still as safe?  I don't know.
Does it use regular cache flushes in place of barriers in that case (which
ARE supported by the md layer)?

Generally, it has been said numerous times that XFS is not
"powercut-friendly", and that it should be used where everything is stable,
including power.  Hence I'm afraid to deploy it where I know the power is
not stable (we've about 70 such places here, with servers in each, where
they don't always replace UPS batteries in time - ext3fs never crashed so
far, while ext2 did).

Thanks.

/mjt
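As an aside, the write-intent bitmap Michael mentions (which limits
post-crash resync to recently written regions) can be added to an existing
array with mdadm; the device names below are examples only.

    # add an internal write-intent bitmap to an existing array
    mdadm --grow --bitmap=internal /dev/md0

    # or create a 2-disk raid1 with a bitmap from the start
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
          --bitmap=internal /dev/sda1 /dev/sdb1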
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Michael Tokarev wrote:
> Unfortunately a UPS does not *really* help here.  Unless it has a control
> program which properly shuts the system down on loss of input power, and
> the battery really has the capacity to power the system while it's
> shutting down (anyone tested this?

Yes.  I must say, I am not connected to or paid by APC.

> With a new UPS?  And after a year of use, when the battery is no longer
> new?), -- unless the UPS actually has the capacity to shut the system
> down, it will cut the power at an unexpected time, while the disk(s)
> still have dirty caches...

If you have a "SmartUPS" by "APC", there is a freeware daemon that monitors
its status.  The UPS has USB and serial connections.  It's included in some
distributions (SuSE).  The config file is pretty straightforward.

I recommend the "1000XL" (1000 peak Volt-Amp load -- usually at startup;
note, this is not the same as watts, as some of us were taught in basic
electronics class, since the unit isn't a simple resistor like a light bulb)
over the 1500XL, because with the 1000XL you can buy several "add-on
batteries" that plug into the back.  One minor (but not fatal) design flaw:
the add-on batteries give no indication that they are "live" (I knocked a
cord on one, and only got 7 minutes of uptime before things shut down
instead of my expected 20).

I have 3 cells total (controller & 1 extra pack).  So why is my run time so
short?  I am being lazy in buying more extension packs.  The UPS is running
3 computers, the house phone (answering machine and wireless handsets), a
digital clock, and 1 LCD (usually off).  The real killer is a new
workstation with 2x2-Core-II chips and other comparable equipment.

The "1500XL" doesn't allow for adding more power packs.  The "2200XL" does
allow extra packs but comes in a rack-mount format.

It's not just a battery backup -- it conditions the power, to filter out
spikes and emit a pure sine wave.  It will kick in during over- or
under-voltage conditions (you can set the sensitivity).  Adjustable alarm
when on battery, setting of output volts (115, 230, 120, 240).  It
self-tests at least every 2 weeks or more often (to your fancy).  It also
has a network feature (that I haven't gotten to work yet -- they just
changed the format) that allows other computers on the same net to also be
notified and take action.  You specify what scripts to run at what times
(power off, power on, getting critically low, etc.).

Hasn't failed me 'yet' -- 'cept when a charger died and was replaced free of
cost (within warranty).  I have a separate setup in another room for another
computer.  The upspowerd runs on linux or windows (under cygwin, I think).
You can specify when to shut down -- like "5 minutes of battery life left".
The controller unit has 1 battery, but the add-ons have 2 batteries each, so
the first add-on adds 3x to the run time.  When my system did shut down
"prematurely", it went through the full "halt" sequence, which I'd presume
flushes disk caches.

> the drive claims to have metadata safe on disk but actually does not,
> and you lose power, the data claimed safe will evaporate, there's not
> much the fs can do.  IO write barriers address this by forcing the drive
> to flush order-critical data before continuing; xfs has them on by
> default, although they are tested at mount time and if you have
> something in between xfs and the disks which does not support barriers
> (i.e. lvm...) then they are disabled again, with a notice in the logs.
>
> Note also that with linux software raid barriers are NOT supported.

Are you sure about this?  When my system boots, I used to have 3 new IDEs,
and one older one.  XFS checked each drive for barriers and turned off
barriers for a disk that didn't support it.  ... or are you referring
specifically to linux-raid setups?

Would it be possible on boot to have xfs probe the Raid array, physically,
to see if barriers are really supported (or not), and disable them if they
are not (and optionally disabling write caching, but that's a major
performance hit in my experience)?

Linda
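The freeware daemon Linda refers to is most likely apcupsd; a rough sketch
of its configuration follows.  The values, limits and device settings are
examples only, and APC's own PowerChute software is an alternative.

    # /etc/apcupsd/apcupsd.conf (example values)
    UPSCABLE usb
    UPSTYPE usb
    DEVICE
    BATTERYLEVEL 10     # shut down when charge falls below 10%
    MINUTES 5           # ...or when estimated runtime falls below 5 minutes
    TIMEOUT 0           # 0 = rely on the two limits above

    # query charge level, runtime estimate and last self-test result
    apcaccess status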
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Michael Tokarev wrote:
> note that with some workloads, write caching in the drive actually makes
> write speed worse, not better - namely, in case of massive writes.

With write barriers enabled, I did a quick test of a large copy from one
backup filesystem to another.  I'm not sure what you refer to when you say
large, but this disk has 387G used in 975 files, averaging about 406MB/file.
I was copying from /hde (ATA100-750G) to /sdb (SATA-300-750G) (both
basically the same underlying model).  Of course your 'mileage may vary',
and these were averages over 12 runs each (with and without write caching):

  (write cache on)           write    read
  dev        ave TPS         MB/s     MB/s
  hde ave    64.67           30.94    0.0
  sdb ave    249.51          0.24     30.93

  (write cache off)          write    read
  dev        ave TPS         MB/s     MB/s
  hde ave    45.63           21.81    0.0
  xx: ave    177.76          0.24     21.96

  write w/cache   = (30.94-21.86)/21.86    => 45% faster
  w/o write cache = 100-(100*21.81/30.94)  => 30% slower

These disks have barrier support, so I'd guess the differences would have
been greater if you didn't have to worry about losing write-cache contents.
If barrier support doesn't work and one has to disable write caching, that
is a noticeable performance penalty.  All writes with noatime, nodiratime,
logbufs=8.

FWIW... slightly OT, the rates under Win for their write-through
(FAT32-perf) vs. write-back caching (NTFS-perf) were: FAT about 60% faster
than NTFS, or NTFS ~40% slower than FAT32 (with options for no-last-access
and no 3.1 filename creation).
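A rough sketch of how a comparison like this can be repeated.  The device
name and the source/target paths are examples, and hdparm's -W flag only
works on drives that honour it.

    hdparm -W0 /dev/sdb                  # disable the on-drive write cache
    time sh -c 'cp -a /mnt/backup1/big.img /mnt/backup2/ && sync'

    hdparm -W1 /dev/sdb                  # re-enable write caching
    time sh -c 'cp -a /mnt/backup1/big.img /mnt/backup2/ && sync'

    hdparm -W /dev/sdb                   # show the current write-cache setting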
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
On Mon, 4 Feb 2008, Michael Tokarev wrote:
> Eric Sandeen wrote:
> []
>> http://oss.sgi.com/projects/xfs/faq.html#nulls
>>
>> and note that recent fixes have been made in this area (also noted in
>> the faq)
>>
>> Also - the above all assumes that when a drive says it's written/flushed
>> data, that it truly has.  Modern write-caching drives can wreak havoc
>> with any journaling filesystem, so that's one good reason for a UPS.  If
>
> Unfortunately a UPS does not *really* help here.  Unless it has a control
> program which properly shuts the system down on loss of input power, and
> the battery really has the capacity to power the system while it's
> shutting down (anyone tested this?  With a new UPS?  And after a year of
> use, when the battery is no longer new?), -- unless the UPS actually has
> the capacity to shut the system down, it will cut the power at an
> unexpected time, while the disk(s) still have dirty caches...

You use nut and a large enough UPS to handle the load of the system, and it
shuts the machine down just fine.

>> the drive claims to have metadata safe on disk but actually does not,
>> and you lose power, the data claimed safe will evaporate, there's not
>> much the fs can do.  IO write barriers address this by forcing the drive
>> to flush order-critical data before continuing; xfs has them on by
>> default, although they are tested at mount time and if you have
>> something in between xfs and the disks which does not support barriers
>> (i.e. lvm...) then they are disabled again, with a notice in the logs.
>
> Note also that with linux software raid barriers are NOT supported.
>
> /mjt
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Michael Tokarev wrote:
> Unfortunately a UPS does not *really* help here.  Unless it has a control
> program which properly shuts the system down on loss of input power, and
> the battery really has the capacity to power the system while it's
> shutting down (anyone tested this?  With a new UPS?  And after a year of
> use, when the battery is no longer new?), -- unless the UPS actually has
> the capacity to shut the system down, it will cut the power at an
> unexpected time, while the disk(s) still have dirty caches...

I'm unsure what you mean here.  The Network UPS Tools project
http://www.networkupstools.org/ has been supplying software to do this for
years.  In addition, a number of UPS manufacturers, including APC, one of
the larger ones, provide Linux management and monitoring software with the
UPS.

As far as worrying whether a one-year-old battery has enough capacity to
hold up while the system shuts down: there is no reason why you cannot set
it to shut the system down gracefully after maybe 30 seconds of power loss
if you feel it is necessary.  A reputable-brand UPS with a correctly sized
battery capacity will have no trouble in this scenario unless the battery
is faulty, in which case the fault will probably be picked up during
automated load tests.

As long as the manufacturer's battery replacement schedule is followed,
genuine replacement batteries are used and automated regular UPS tests are
enabled, the risks of failure are small.

Regards,

Richard
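The "30 seconds on battery, then shut down" policy Richard mentions maps
onto nut's upssched.  A sketch follows; the file locations, the timer name
and the helper script path are assumptions.

    # appended to upsmon.conf: hand power events to upssched
    NOTIFYCMD /sbin/upssched
    NOTIFYFLAG ONBATT SYSLOG+EXEC
    NOTIFYFLAG ONLINE SYSLOG+EXEC

    # /etc/nut/upssched.conf: start a 30 s timer on battery, cancel on mains
    CMDSCRIPT /etc/nut/upssched-cmd
    PIPEFN /var/run/nut/upssched.pipe
    LOCKFN /var/run/nut/upssched.lock
    AT ONBATT * START-TIMER onbatt-shutdown 30
    AT ONLINE * CANCEL-TIMER onbatt-shutdown

    # upssched-cmd then calls "upsmon -c fsd" when onbatt-shutdown fires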
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Eric Sandeen wrote:
> Moshe Yudkowsky wrote:
>> So if I understand you correctly, you're stating that currently the most
>> reliable fs in its default configuration, in terms of protection against
>> power-loss scenarios, is XFS?
>
> I wouldn't go that far without some real-world poweroff testing, because
> various fs's are probably more or less tolerant of a write-cache
> evaporation.  I suppose it'd depend on the size of the write cache as well.

I know of no filesystem which is, as you say, tolerant of write-cache
evaporation.  If a drive says the data is written but in fact it's not, it's
a Bad Drive (tm) and it should be thrown away immediately.  Fortunately,
almost all modern disk drives don't lie this way.  The only thing needed is
for the filesystem to tell the drive to flush its cache at the appropriate
time, and to actually wait for the flush to complete.  Barriers (mentioned
in this thread) are just another way to do so, in a somewhat more efficient
way, but a normal cache flush will do as well.  IFF the write caching is
enabled in the first place - note that with some workloads, write caching in
the drive actually makes write speed worse, not better - namely, in case of
massive writes.

Speaking of XFS (and of ext3fs with write barriers enabled) - I'm confused
here as well, and the answers to my questions didn't help either.  As far as
I understand, XFS only uses barriers, not regular cache flushes, hence
without write barrier support (which is not there for linux software raid,
as explained elsewhere) it's unsafe -- and probably the same applies to ext3
with barrier support enabled.  But I'm not sure I got it all correctly.

/mjt
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Moshe Yudkowsky wrote:
> So if I understand you correctly, you're stating that currently the most
> reliable fs in its default configuration, in terms of protection against
> power-loss scenarios, is XFS?

I wouldn't go that far without some real-world poweroff testing, because
various fs's are probably more or less tolerant of a write-cache
evaporation.  I suppose it'd depend on the size of the write cache as well.

-Eric
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Eric Sandeen wrote:
[]
> http://oss.sgi.com/projects/xfs/faq.html#nulls
>
> and note that recent fixes have been made in this area (also noted in
> the faq)
>
> Also - the above all assumes that when a drive says it's written/flushed
> data, that it truly has.  Modern write-caching drives can wreak havoc
> with any journaling filesystem, so that's one good reason for a UPS.  If

Unfortunately a UPS does not *really* help here.  Unless it has a control
program which properly shuts the system down on loss of input power, and the
battery really has the capacity to power the system while it's shutting down
(anyone tested this?  With a new UPS?  And after a year of use, when the
battery is no longer new?), -- unless the UPS actually has the capacity to
shut the system down, it will cut the power at an unexpected time, while the
disk(s) still have dirty caches...

> the drive claims to have metadata safe on disk but actually does not,
> and you lose power, the data claimed safe will evaporate, there's not
> much the fs can do.  IO write barriers address this by forcing the drive
> to flush order-critical data before continuing; xfs has them on by
> default, although they are tested at mount time and if you have
> something in between xfs and the disks which does not support barriers
> (i.e. lvm...) then they are disabled again, with a notice in the logs.

Note also that with linux software raid barriers are NOT supported.

/mjt
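Whether barriers actually survived the mount-time probe on a given box can
be checked in the kernel log.  The exact message wording varies by kernel
version, so treat the pattern and sample line below as approximate; the
mount point is a placeholder.

    dmesg | grep -i barrier
    # e.g. "Filesystem "md0": Disabling barriers, trial barrier write failed"

    # barriers can also be turned off explicitly when the cache is known to
    # be safe (battery-backed controller), via xfs's nobarrier mount option
    mount -o remount,nobarrier /scratch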
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Eric,

Thanks very much for your note.  I'm becoming very leery of reiserfs at the
moment...  I'm about to run another series of crash tests.

Eric Sandeen wrote:
> Justin Piszcz wrote:
>> Why avoid XFS entirely?
>>
>> esandeen, any comments here?
>
> Heh; well, it's the meme.

Well, yeah...

> Note also that ext3 has the barrier option as well, but it is not enabled
> by default due to performance concerns.  Barriers also affect xfs
> performance, but enabling them in the non-battery-backed-write-cache
> scenario is the right thing to do for filesystem integrity.

So if I understand you correctly, you're stating that currently the most
reliable fs in its default configuration, in terms of protection against
power-loss scenarios, is XFS?

--
Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe
 "There is something fundamentally wrong with a country [USSR] where the
  citizens want to buy your underwear."  -- Paul Thereaux
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Justin Piszcz wrote:
> Why avoid XFS entirely?
>
> esandeen, any comments here?

Heh; well, it's the meme.

see:

http://oss.sgi.com/projects/xfs/faq.html#nulls

and note that recent fixes have been made in this area (also noted in the
faq).

Also - the above all assumes that when a drive says it's written/flushed
data, that it truly has.  Modern write-caching drives can wreak havoc with
any journaling filesystem, so that's one good reason for a UPS.  If the
drive claims to have metadata safe on disk but actually does not, and you
lose power, the data claimed safe will evaporate; there's not much the fs
can do.  IO write barriers address this by forcing the drive to flush
order-critical data before continuing; xfs has them on by default, although
they are tested at mount time, and if you have something in between xfs and
the disks which does not support barriers (i.e. lvm...) then they are
disabled again, with a notice in the logs.

Note also that ext3 has the barrier option as well, but it is not enabled by
default due to performance concerns.  Barriers also affect xfs performance,
but enabling them in the non-battery-backed-write-cache scenario is the
right thing to do for filesystem integrity.

-Eric

> Justin.
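A hedged /etc/fstab example of what Eric describes: xfs gets barriers by
default, while ext3 needs them requested explicitly.  The devices and mount
points are placeholders.

    # /etc/fstab (example entries)
    /dev/md1   /data   ext3   defaults,barrier=1   0 2
    /dev/md2   /srv    xfs    defaults             0 2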
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Eric Sandeen wrote:
> Justin Piszcz wrote:
>
>> Why avoid XFS entirely?
>>
>> esandeen, any comments here?
>
> Heh; well, it's the meme.
>
> see:
>
> http://oss.sgi.com/projects/xfs/faq.html#nulls
>
> and note that recent fixes have been made in this area (also noted in
> the faq)

Actually, continue reading past that specific entry to the next several; it
covers all this quite well.

-Eric
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
On Mon, 4 Feb 2008, Michael Tokarev wrote:
> Moshe Yudkowsky wrote:
> []
>> If I'm reading the man pages, Wikis, READMEs and mailing lists correctly
>> -- not necessarily the case -- the ext3 file system uses the equivalent
>> of data=journal as a default.
>
> ext3 defaults to data=ordered, not data=journal.
> ext2 doesn't have a journal at all.
>
>> The question then becomes what data scheme to use with reiserfs on the
>
> I'd say don't use reiserfs in the first place ;)
>
>> Another way to phrase this: unless you're running data-center grade
>> hardware and have absolute confidence in your UPS, you should use
>> data=journal for reiserfs and perhaps avoid XFS entirely.
>
> By the way, even if you do have a good UPS, there should be some control
> program for it, to properly shut down your system when the UPS loses AC
> power.  So far, I've seen no such programs...
>
> /mjt

Why avoid XFS entirely?

esandeen, any comments here?

Justin.
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Moshe Yudkowsky wrote:
[]
> If I'm reading the man pages, Wikis, READMEs and mailing lists correctly
> -- not necessarily the case -- the ext3 file system uses the equivalent
> of data=journal as a default.

ext3 defaults to data=ordered, not data=journal.
ext2 doesn't have a journal at all.

> The question then becomes what data scheme to use with reiserfs on the

I'd say don't use reiserfs in the first place ;)

> Another way to phrase this: unless you're running data-center grade
> hardware and have absolute confidence in your UPS, you should use
> data=journal for reiserfs and perhaps avoid XFS entirely.

By the way, even if you do have a good UPS, there should be some control
program for it, to properly shut down your system when the UPS loses AC
power.  So far, I've seen no such programs...

/mjt
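For reference, ext3's three data journalling modes are selected as mount
options; data=ordered is the default Michael refers to.  The device and
mount point below are examples, and the three mounts are alternatives, not
a sequence.

    mount -o data=journal   /dev/md3 /var/spool  # journal data and metadata
    mount -o data=ordered   /dev/md3 /var/spool  # default: data written before metadata commits
    mount -o data=writeback /dev/md3 /var/spool  # metadata-only journalling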
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
On Mon Feb 04, 2008 at 05:06:09AM -0600, Moshe Yudkowsky wrote:
> Robin, thanks for the explanation.  I have a further question.
>
> Robin Hill wrote:
>
>> Once the file system is mounted then hdX,Y maps according to the
>> device.map file (which may actually bear no resemblance to the drive
>> order at boot - I've had issues with this before).  At boot time it maps
>> to the BIOS boot order though, and (in my experience anyway) hd0 will
>> always map to the drive the BIOS is booting from.
>
> At the time that I use grub to write to the MBR, hd2,1 is /dev/sdc1.
> Therefore, I don't quite understand why this would not work:
>
> grub <<EOF
> root(hd2,1)
> setup(hd2)
> EOF
>
> This would seem to be a command to have the MBR on hd2 written to use the
> boot on hd2,1.  It's valid when written.  Are you saying that it's a
> command for the MBR on /dev/sdc to find the data on (hd2,1), the location
> of which might change at any time?  That's... a very strange way to write
> the tool.  I thought it would be a command for the MBR on hd2 (sdc) to
> look at hd2,1 (sdc1) to find its data, regardless of the boot order that
> caused sdc to be the boot disk.

This is exactly what it does, yes - the hdX,Y are mapped by GRUB onto BIOS
disk interfaces (0x80 being the first, 0x81 the second and so on), and it
writes (to sdc in this case) the instructions to look on partition 1 of BIOS
drive 0x82 (whichever drive that ends up being) for the rest of the
bootloader.

It is a bit of a strange way to work, but it's really the only way it _can_
work (and cover all circumstances).  Unfortunately, when you start playing
with bootloaders you have to get down to the BIOS level, and things weren't
written to make sense at that level (after all, when these standards were
put in place everyone was booting from a single floppy disk).  If EFI
becomes more standard then hopefully this will simplify, but we're stuck
with things as they are for now.

Cheers,
    Robin
--
Robin Hill <[EMAIL PROTECTED]>
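A sketch of the two mappings Robin contrasts.  The device.map contents are
an example only, and the grub-shell "device" command is the trick used in
Robin's remapping script elsewhere in this thread.

    # what grub consults once the system is up (may not match the BIOS
    # boot order):
    cat /boot/grub/device.map
    #   (hd0)  /dev/sda
    #   (hd1)  /dev/sdb
    #   (hd2)  /dev/sdc
    #   (hd3)  /dev/sdd

    # inside the grub shell, "device" overrides that mapping for the
    # current session:
    grub> device (hd0) /dev/sdc
    grub> root (hd0,1)
    grub> setup (hd0)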
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Robin, thanks for the explanation.  I have a further question.

Robin Hill wrote:

> Once the file system is mounted then hdX,Y maps according to the
> device.map file (which may actually bear no resemblance to the drive
> order at boot - I've had issues with this before).  At boot time it maps
> to the BIOS boot order though, and (in my experience anyway) hd0 will
> always map to the drive the BIOS is booting from.

At the time that I use grub to write to the MBR, hd2,1 is /dev/sdc1.
Therefore, I don't quite understand why this would not work:

grub <<EOF
root(hd2,1)
setup(hd2)
EOF

This would seem to be a command to have the MBR on hd2 written to use the
boot on hd2,1.  It's valid when written.  Are you saying that it's a command
for the MBR on /dev/sdc to find the data on (hd2,1), the location of which
might change at any time?  That's... a very strange way to write the tool.
I thought it would be a command for the MBR on hd2 (sdc) to look at hd2,1
(sdc1) to find its data, regardless of the boot order that caused sdc to be
the boot disk.
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Michael Tokarev wrote:
> Moshe Yudkowsky wrote:
> []
>> But that's *exactly* what I have -- well, 5GB -- and which failed.  I've
>> modified /etc/fstab to use data=journal (even on root, which I thought
>> wasn't supposed to work without a grub option!) and I can power-cycle
>> the system and bring it up reliably afterwards.
>
> Note also that data=journal effectively doubles the write time.  It's a
> bit faster for small writes (because all writes are first done into the
> journal, i.e. into the same place, so no seeking is needed), but for
> larger writes, the journal will become full and the data found in it
> needs to be written to its proper place, to free space for new data.
> Here, if you continue writing, you will have more than 2x speed
> degradation, because of a) double writes, and b) more seeking.

The alternative seems to be that portions of the / file system won't mount
because the file system is corrupted in a crash while writing.

If I'm reading the man pages, Wikis, READMEs and mailing lists correctly --
not necessarily the case -- the ext3 file system uses the equivalent of
data=journal as a default.

The question then becomes what data scheme to use with reiserfs on the
remainder of the file system: /usr, /var, /home, and others.  If they can
recover on a reboot using fsck and the default configuration of reiserfs,
then I have no problem using them.  But my understanding is that data can be
lost or destroyed if there's a crash during a write; and then there's little
point in running a RAID system that can collect corrupt data.

Another way to phrase this: unless you're running data-center grade hardware
and have absolute confidence in your UPS, you should use data=journal for
reiserfs and perhaps avoid XFS entirely.

--
Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe
 "Right in the middle of a large field where there had never been a trench
  was a shell hole... 8 feet deep by 15 across.  On the edge of it was a
  dead... rat not over twice the size of a mouse.  No wonder the war costs
  so much."  -- Col. George Patton
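The "grub option" Moshe alludes to is normally a rootflags= kernel
parameter, so the root filesystem is mounted with the desired data mode
from the start.  A menu.lst stanza sketch follows; the kernel version,
devices and paths are placeholders.

    title  Linux (data=journal root)
    root   (hd0,1)
    kernel /vmlinuz-2.6.24 root=/dev/md0 rootflags=data=journal ro
    initrd /initrd.img-2.6.24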
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Moshe Yudkowsky wrote:
[]
> But that's *exactly* what I have -- well, 5GB -- and which failed.  I've
> modified /etc/fstab to use data=journal (even on root, which I thought
> wasn't supposed to work without a grub option!) and I can power-cycle
> the system and bring it up reliably afterwards.

Note also that data=journal effectively doubles the write time.  It's a bit
faster for small writes (because all writes are first done into the journal,
i.e. into the same place, so no seeking is needed), but for larger writes,
the journal will become full and the data found in it needs to be written to
its proper place, to free space for new data.  Here, if you continue
writing, you will have more than 2x speed degradation, because of a) double
writes, and b) more seeking.

/mjt
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
On Sun Feb 03, 2008 at 02:46:54PM -0600, Moshe Yudkowsky wrote:
> Robin Hill wrote:
>
>> This is wrong - the disk you boot from will always be hd0 (no matter
>> what the map file says - that's only used after the system's booted).
>> You need to remap the hd0 device for each disk:
>>
>> grub --no-floppy <<EOF
>> root (hd0,1)
>> setup (hd0)
>> device (hd0) /dev/sdb
>> root (hd0,1)
>> setup (hd0)
>> device (hd0) /dev/sdc
>> root (hd0,1)
>> setup (hd0)
>> device (hd0) /dev/sdd
>> root (hd0,1)
>> setup (hd0)
>> EOF
>
> For my enlightenment: if the file system is mounted, then hd2,1 is a
> sensible grub operation, isn't it?  For the record, given my original
> script, when I boot I am able to edit the grub boot options to read
>
> root (hd2,1)
>
> and proceed to boot.

Once the file system is mounted then hdX,Y maps according to the device.map
file (which may actually bear no resemblance to the drive order at boot -
I've had issues with this before).  At boot time it maps to the BIOS boot
order though, and (in my experience anyway) hd0 will always map to the drive
the BIOS is booting from.  So initially you may have:

    SATA-1: hd0
    SATA-2: hd1
    SATA-3: hd2

Now, if the SATA-1 drive dies totally you will have:

    SATA-1: -
    SATA-2: hd0
    SATA-3: hd1

or if SATA-2 dies:

    SATA-1: hd0
    SATA-2: -
    SATA-3: hd1

Note that in the case where the drive is still detected but fails to boot,
the behaviour seems to be very BIOS dependent - some will continue on to
drive 2 as above, whereas others will just sit and complain.

So to answer the second part of your question: yes - at boot time currently
you can do "root (hd2,1)" or "root (hd3,1)".  If a disk dies, however
(whichever disk it is), then "root (hd3,1)" will fail to work.

Note also that the above is only my experience - if you're depending on
certain behaviour under these circumstances then you really need to test it
out on your hardware by disconnecting drives, substituting non-bootable
drives, etc.

HTH,
    Robin
--
Robin Hill <[EMAIL PROTECTED]>
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Moshe Yudkowsky wrote:
> Michael Tokarev wrote:
>
>> Speaking of repairs.  As I already mentioned, I always use a small
>> (256M..1G) raid1 array for my root partition, including /boot,
>> /bin, /etc, /sbin, /lib and so on (/usr, /home, /var are on
>> their own filesystems).  And I had the following scenarios
>> happen already:
>
> But that's *exactly* what I have -- well, 5GB -- and which failed.  I've
> modified /etc/fstab to use data=journal (even on root, which I thought
> wasn't supposed to work without a grub option!) and I can power-cycle the
> system and bring it up reliably afterwards.
>
> So I'm a little suspicious of this theory that /etc and others can be on
> the same partition as /boot in a non-ext3 file system.

If even your separate /boot failed (which should NEVER fail), what does that
say about the rest?  I mean, if you save your /boot, what help will it be if
your root fs is damaged?  That's why I said /boot is mostly irrelevant.

Well, you can have some recovery stuff in your initrd/initramfs - that's for
sure (and for that to work, you can make your /boot more reliable by
creating a separate filesystem for it).  But if you go this route, it's
better to boot off some recovery CD instead of trying recovery from the very
limited toolset available in your initramfs.

/mjt
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Michael Tokarev wrote:
> Speaking of repairs.  As I already mentioned, I always use a small
> (256M..1G) raid1 array for my root partition, including /boot,
> /bin, /etc, /sbin, /lib and so on (/usr, /home, /var are on
> their own filesystems).  And I had the following scenarios
> happen already:

But that's *exactly* what I have -- well, 5GB -- and which failed.  I've
modified /etc/fstab to use data=journal (even on root, which I thought
wasn't supposed to work without a grub option!) and I can power-cycle the
system and bring it up reliably afterwards.

So I'm a little suspicious of this theory that /etc and others can be on the
same partition as /boot in a non-ext3 file system.

--
Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe
 "Thanks to radio, TV, and the press we can now develop absurd
  misconceptions about peoples and governments we once hardly knew
  existed."  -- Charles Fair, _From the Jaws of Victory_
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Robin Hill wrote:
> This is wrong - the disk you boot from will always be hd0 (no matter
> what the map file says - that's only used after the system's booted).
> You need to remap the hd0 device for each disk:
>
> grub --no-floppy <<EOF
> root (hd0,1)
> setup (hd0)
> device (hd0) /dev/sdb
> root (hd0,1)
> setup (hd0)
> device (hd0) /dev/sdc
> root (hd0,1)
> setup (hd0)
> device (hd0) /dev/sdd
> root (hd0,1)
> setup (hd0)
> EOF

For my enlightenment: if the file system is mounted, then hd2,1 is a
sensible grub operation, isn't it?  For the record, given my original
script, when I boot I am able to edit the grub boot options to read

root (hd2,1)

and proceed to boot.

--
Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe
 "I love deadlines... especially the whooshing sound they make as they fly
  past."  -- Dermot Dobson
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Moshe Yudkowsky wrote:
> I've been reading the draft and checking it against my experience.
> Because of local power fluctuations, I've just accidentally checked my
> system: My system does *not* survive a power hit.  This has happened
> twice already today.
>
> I've got /boot and a few other pieces in a 4-disk RAID 1 (three running,
> one spare).  This partition is on /dev/sd[abcd]1.
>
> I've used grub to install grub on all three running disks:
>
> grub --no-floppy <<EOF
> root (hd0,1)
> setup (hd0)
> root (hd1,1)
> setup (hd1)
> root (hd2,1)
> setup (hd2)
> EOF
>
> (To those reading this thread to find out how to recover: According to
> grub's "map" option, /dev/sda1 maps to hd0,1.)

I usually install all the drives identically in this regard - each one set
up to be treated as the first BIOS disk (disk 0x80).  As already pointed out
in this thread, not all BIOSes are able to boot off a second or third disk,
so if your first disk (sda) fails, your only option is to put your sdb into
the place of sda and boot from it - this way, grub needs to think it's the
first boot drive too.

By the way, lilo works here more easily and more reliably.  You just install
a standard MBR (lilo has one too) which simply boots from the active
partition, and install lilo onto the raid array, telling it NOT to do
anything fancy with raid at all (raid-extra-boot none).  But for this to
work, you have to have identical partitions with identical offsets - at
least for the boot partitions.  (A minimal sketch of this setup follows at
the end of this message.)

> After the power hit, I get:
>
>> Error 16
>> Inconsistent filesystem mounted

But did it actually mount it?

> I then tried to boot up on hd1,1, hd2,1 -- none of them worked.

Which is in fact expected after the above.  You have 3 identical copies
(thanks to raid) of your boot filesystem, all 3 equally broken.  When it
boots, it assembles your /boot raid array - the same array regardless of
whether you boot off hd0, hd1 or hd2.

> The culprit, in my opinion, is the reiserfs file system.  During the
> power hit, the reiserfs file system of /boot was left in an inconsistent
> state; this meant I had up to three bad copies of /boot.

I've never seen any problem with ext[23] wrt unexpected power loss, so far,
running several hundred different systems, some since 1998, some since 2000.
Sure, there were several inconsistencies, sometimes (maybe once or twice)
some minor data loss (only a few newly created files were lost), but the
most serious was finding a few items in lost+found after an fsck - that's
ext2; never seen that with ext3.  More, I tried hard to "force" a power
failure at an "unexpected" time, by doing massive write operations and
cutting power while at it - I was never able to trigger any problem this
way, at all.  In any case, even if ext[23] is somewhat damaged, it can still
be mounted - access to some files may return I/O errors (in the parts where
it's really damaged), but the rest will work.

On the other hand, I had several immediate issues with reiserfs.  It was a
long time ago, when the filesystem was first included into the mainline
kernel, so that doesn't reflect the current situation.  Yet even at that
stage, reiserfs was declared "stable" by the authors.  Issues were trivially
triggerable by cutting the power at an "unexpected" time, and fsck didn't
help several times.  So I tend to avoid reiserfs - due to my own experience,
and due to numerous problems reported elsewhere.

> Recommendations:
>
> 1. I'm going to try adding a data=journal option to the reiserfs file
> systems, including the /boot.  If this does not work, then /boot must be
> ext3 in order to survive a power hit.

By the way, if your /boot is a separate filesystem (i.e., there's nothing
more there), I see absolutely zero reason for it to crash.  /boot is
modified VERY rarely (only when installing a kernel), and only when it's
modified is there a chance for it to be damaged somehow.  During the rest of
the time it's constant, and a power cut should not hurt it at all.  If even
for a non-modified filesystem reiserfs shows such behaviour (...)

> 2. We discussed what should be on the RAID1 bootable portion of the
> filesystem.  True, it's nice to have the ability to boot from just the
> RAID1 portion.  But if that RAID1 portion can't survive a power hit,
> there's little sense.  It might make a lot more sense to put /boot on its
> own tiny partition.

Hehe.  /boot doesn't matter really.  A separate /boot was used for 3
purposes:

1) to work around BIOS 1024th-cylinder issues (long gone with LBA);

2) to be able to put the rest of the system onto an unsupported-by-
   bootloader filesystem/raid/lvm/etc.  Like, lilo didn't support reiserfs
   (and still doesn't with tail packing enabled), so if you want to use
   reiserfs for your root fs, put /boot into a separate ext2fs.  The same
   is true for raid - you can put the rest of the system into a raid5 array
   (unsupported by grub/lilo), and in order to boot, create a small raid1
   (or other supported level) /boot;

3) to keep it as non-volatile as possible - like, an area of the disk which
   never changes [...]
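A minimal sketch of the lilo arrangement Michael describes above; the
device names are examples, and each disk's boot partition still has to be
marked active (e.g. with fdisk).

    # write a plain "boot the active partition" MBR to each disk
    lilo -M /dev/sda
    lilo -M /dev/sdb

    # /etc/lilo.conf (example): install lilo into the raid1 itself,
    # with no per-component trickery
    boot=/dev/md0
    raid-extra-boot=none
    root=/dev/md0
    image=/boot/vmlinuz
        label=linux
        initrd=/boot/initrd.img
        read-only

    # then run lilo to install the boot sector onto the md0 components
    lilo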
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
On Sun Feb 03, 2008 at 01:15:10PM -0600, Moshe Yudkowsky wrote:
> I've been reading the draft and checking it against my experience.
> Because of local power fluctuations, I've just accidentally checked my
> system: My system does *not* survive a power hit.  This has happened
> twice already today.
>
> I've got /boot and a few other pieces in a 4-disk RAID 1 (three running,
> one spare).  This partition is on /dev/sd[abcd]1.
>
> I've used grub to install grub on all three running disks:
>
> grub --no-floppy <<EOF
> root (hd0,1)
> setup (hd0)
> root (hd1,1)
> setup (hd1)
> root (hd2,1)
> setup (hd2)
> EOF
>
> (To those reading this thread to find out how to recover: According to
> grub's "map" option, /dev/sda1 maps to hd0,1.)

This is wrong - the disk you boot from will always be hd0 (no matter what
the map file says - that's only used after the system's booted).  You need
to remap the hd0 device for each disk:

grub --no-floppy <<EOF
root (hd0,1)
setup (hd0)
device (hd0) /dev/sdb
root (hd0,1)
setup (hd0)
device (hd0) /dev/sdc
root (hd0,1)
setup (hd0)
device (hd0) /dev/sdd
root (hd0,1)
setup (hd0)
EOF

> After the power hit, I get:
>
>> Error 16
>> Inconsistent filesystem mounted
>
> I then tried to boot up on hd1,1, hd2,1 -- none of them worked.
>
> The culprit, in my opinion, is the reiserfs file system.  During the
> power hit, the reiserfs file system of /boot was left in an inconsistent
> state; this meant I had up to three bad copies of /boot.

Could well be - I always use ext2 for the /boot filesystem and don't have it
automounted.  I only mount the partition to install a new kernel, then
unmount it again.

Cheers,
    Robin
--
Robin Hill <[EMAIL PROTECTED]>
RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
I've been reading the draft and checking it against my experience.  Because
of local power fluctuations, I've just accidentally checked my system: My
system does *not* survive a power hit.  This has happened twice already
today.

I've got /boot and a few other pieces in a 4-disk RAID 1 (three running, one
spare).  This partition is on /dev/sd[abcd]1.

I've used grub to install grub on all three running disks:

grub --no-floppy <<EOF
root (hd0,1)
setup (hd0)
root (hd1,1)
setup (hd1)
root (hd2,1)
setup (hd2)
EOF

(To those reading this thread to find out how to recover: According to
grub's "map" option, /dev/sda1 maps to hd0,1.)

After the power hit, I get:

> Error 16
> Inconsistent filesystem mounted

I then tried to boot up on hd1,1, hd2,1 -- none of them worked.

The culprit, in my opinion, is the reiserfs file system.  During the power
hit, the reiserfs file system of /boot was left in an inconsistent state;
this meant I had up to three bad copies of /boot.

Recommendations:

1. I'm going to try adding a data=journal option to the reiserfs file
systems, including the /boot.  If this does not work, then /boot must be
ext3 in order to survive a power hit.

2. We discussed what should be on the RAID1 bootable portion of the
filesystem.  True, it's nice to have the ability to boot from just the RAID1
portion.  But if that RAID1 portion can't survive a power hit, there's
little sense.  It might make a lot more sense to put /boot on its own tiny
partition.

The Fix:

The way to fix this problem with booting is to get the reiser file system
back into sync.  I did this by booting to my emergency single-disk partition
((hd0,0) if you must know) and then mounting the /dev/md/root that contains
/boot.  This forced a reiserfs consistency check and journal replay, and let
me reboot without problems.

--
Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe
 "A gun is, in many people's minds, like a magic wand.  If you point it at
  people, they are supposed to do your bidding."
  -- Edwin E. Moise, _Tonkin Gulf_
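A sketch of the recovery path Moshe describes, as run from a rescue boot;
the array name, component devices and mount point are examples.

    mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1
    mount -t reiserfs /dev/md0 /mnt      # mounting replays the reiserfs journal

    # if the mount still fails:
    reiserfsck --check /dev/md0          # read-only consistency check
    reiserfsck --fix-fixable /dev/md0    # repair what it can without a tree rebuild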