Re: [RFH] Partition table recovery
On Sun, July 22, 2007 18:28, Theodore Tso wrote:
> On Sun, Jul 22, 2007 at 07:10:31AM +0300, Al Boldi wrote:
>> Sounds great, but it may be advisable to hook this into the partition
>> modification routines instead of mkfs/fsck.
> [...]
> 4) The exact location of the backup may vary from filesystem to
> filesystem.  For ext2/3/4, bytes 512-1023 are always unused, and don't
> interfere with the boot sector at bytes 0-511, so that's the obvious
> location.  Other filesystems may have that location in use, and some
> other location might be a better place to store it.  Ideally it will
> be a well-known location, that isn't dependent on finding an inode
> table, or some such, but that may not be possible for all filesystems.
To be on the safe side, maybe also add a checksum, timestamp and
something identifying the disk the filesystem was created on.

Regards,

Indan

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFH] Partition table recovery
Theodore Tso wrote:
> On Sun, Jul 22, 2007 at 07:10:31AM +0300, Al Boldi wrote:
> > Sounds great, but it may be advisable to hook this into the partition
> > modification routines instead of mkfs/fsck.
> [...]
> Well, let's think about this a bit.  What are the requirements?
> [...]
> OK, so how about this as a solution that meets the above requirements?
>
> /sbin/partbackup <device> [<fspart>]
> [...]
> A filesystem utility package for a particular filesystem type is
> encouraged to make the above changes to its mkfs and fsck programs, as
> well as provide an /sbin/partbackupfs.<fstype> program.

Great!
> I would do this all in userspace, though.  Is there any reason to get
> the kernel involved?  I don't think so.

Yes, doing things in userspace, when possible, is much better.  But a
change in the partition table has to be relayed to the kernel, and when
that change happens to be on a mounted disk, the partition manager
complains that it cannot update the kernel's view.  So how can this be
addressed?

Thanks!

--
Al
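The complaint Al describes comes from the BLKRRPART ioctl, which is how partition managers ask the kernel to re-read the partition table; it fails with EBUSY while any partition on the disk is in use. A minimal sketch of the call (the function name is mine):

```c
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>   /* BLKRRPART */
#include <sys/ioctl.h>
#include <unistd.h>

/* Ask the kernel to re-read the partition table of 'dev'.
 * Returns 0 on success, or the errno describing why it failed;
 * EBUSY is the common case when a partition on the disk is mounted. */
static int reread_partitions(const char *dev)
{
	int fd = open(dev, O_RDONLY);
	if (fd < 0)
		return errno;
	int ret = ioctl(fd, BLKRRPART) ? errno : 0;
	close(fd);
	return ret;
}
```

fdisk and friends issue this ioctl after writing the table; when it returns EBUSY they print the familiar "unable to inform the kernel" warning and the old in-kernel table stays in effect.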
Re: [RFH] Partition table recovery
On Sun, Jul 22, 2007 at 07:10:31AM +0300, Al Boldi wrote:
> Sounds great, but it may be advisable to hook this into the partition
> modification routines instead of mkfs/fsck.  Which would mean that the
> partition manager could ask the kernel to instruct its fs subsystem to
> update the backup partition table for each known fs-type that supports
> such a feature.

Well, let's think about this a bit.  What are the requirements?

1) The partition manager should be able to explicitly request that a new
backup of the partition tables be stashed in each filesystem that has
room for such a backup.  That way, when the user affirmatively makes a
partition table change, it can get backed up in all of the right
places automatically.

2) The fsck program should *only* stash a backup of the partition
table if there currently isn't one in the filesystem.  It may be that
the partition table has been corrupted, and so merely doing an fsck
should not transfer a current copy of the partition table to the
filesystem-specific backup area.  It could be that the partition table
was only partially recovered, and we don't want to overwrite the
previously existing backups except on an explicit request from the
system administrator.

3) The mkfs program should automatically create a backup of the
current partition table layout.  That way we get a backup in the newly
created filesystem as soon as it is created.

4) The exact location of the backup may vary from filesystem to
filesystem.  For ext2/3/4, bytes 512-1023 are always unused, and don't
interfere with the boot sector at bytes 0-511, so that's the obvious
location.  Other filesystems may have that location in use, and some
other location might be a better place to store it.  Ideally it will
be a well-known location, that isn't dependent on finding an inode
table, or some such, but that may not be possible for all filesystems.

OK, so how about this as a solution that meets the above requirements?
/sbin/partbackup <device> [<fspart>]

	Will scan <device> (i.e., /dev/hda, /dev/sdb, etc.) and create
	a 512 byte partition backup, using the format I've previously
	described.  If <fspart> is specified on the command line, it
	will use the blkid library to determine the filesystem type of
	<fspart>, and then attempt to execute
	/sbin/partbackupfs.<fstype> to write the partition backup to
	<fspart>.  If <fspart> is '-', then it will write the 512 byte
	partition table to stdout.  If <fspart> is not specified on
	the command line, /sbin/partbackup will iterate over all
	partitions in <device>, use the blkid library to attempt to
	determine the correct filesystem type, and then execute
	/sbin/partbackupfs.<fstype> if such a backup program exists.

/sbin/partbackupfs.<fstype> <fspart>

	... is a filesystem specific program for filesystem type
	<fstype>.  It will assure that <fspart> (i.e., /dev/hda1,
	/dev/sdb3) is of an appropriate filesystem type, and then read
	512 bytes from stdin and write it out to <fspart> to an
	appropriate place for that filesystem.

Partition managers will be encouraged to check to see if
/sbin/partbackup exists, and if so, after the partition table is
written, to call it with just one argument (i.e., /sbin/partbackup
/dev/hdb).  They SHOULD provide an option for the user to suppress the
backup from happening, but the backup should be the default behavior.

A /sbin/mkfs.<fstype> program is encouraged to run /sbin/partbackup
with two arguments (i.e., /sbin/partbackup /dev/hdb /dev/hdb3) when
creating a filesystem.

A /sbin/fsck.<fstype> program is encouraged to check to see if a
partition backup exists (assuming the filesystem supports it), and if
not, call /sbin/partbackup with two arguments.

A filesystem utility package for a particular filesystem type is
encouraged to make the above changes to its mkfs and fsck programs, as
well as provide an /sbin/partbackupfs.<fstype> program.

I would do this all in userspace, though.  Is there any reason to get
the kernel involved?  I don't think so.
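For ext2/3/4 the partbackupfs helper could be very small: verify the superblock magic (0xEF53, stored little-endian at byte 1024+56) and then drop the sector into the unused bytes 512-1023 mentioned in requirement 4. A sketch under those assumptions (the helper names are mine; a real tool would presumably use the blkid library rather than poke at the magic directly):

```c
#include <stdint.h>
#include <unistd.h>

#define EXT2_SUPER_OFF   1024   /* superblock starts at byte 1024 */
#define EXT2_MAGIC_OFF   56     /* s_magic offset within the superblock */
#define EXT2_SUPER_MAGIC 0xEF53
#define BACKUP_OFF       512    /* bytes 512-1023 are unused on ext2/3/4 */

/* Return 1 if fd looks like an ext2/3/4 filesystem, else 0. */
static int is_ext2(int fd)
{
	uint8_t m[2];
	if (pread(fd, m, 2, EXT2_SUPER_OFF + EXT2_MAGIC_OFF) != 2)
		return 0;
	return (uint16_t)(m[0] | m[1] << 8) == EXT2_SUPER_MAGIC;
}

/* Write the 512-byte partition table backup into the unused sector. */
static int write_backup(int fd, const uint8_t sector[512])
{
	return pwrite(fd, sector, 512, BACKUP_OFF) == 512 ? 0 : -1;
}
```

The main program would then just read the 512 bytes from stdin, as the proposal describes, and hand them to write_backup().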
					- Ted
[GIT PATCH 1/2] raid5: add the stripe_queue object for tracking raid io requests (take2)
The raid5 stripe cache object, struct stripe_head, serves two purposes:

1/ frontend: queuing incoming requests
2/ backend: transitioning requests through the cache state machine to
   the backing devices

The problem with this model is that queuing decisions are directly tied
to cache availability.  There is no facility to determine that a request
or group of requests 'deserves' usage of the cache and disks at any
given time.

This patch separates the object members needed for queuing from the
object members used for caching.  The stripe_queue object takes over the
incoming bio lists as well as the buffer state flags.

The following fields are moved from struct stripe_head to struct
stripe_queue:

	raid5_private_data *raid_conf
	int pd_idx
	spinlock_t lock
	int bm_seq

The following fields are moved from struct r5dev to struct r5_queue_dev:

	sector_t sector
	struct bio *toread, *towrite

This patch lays the groundwork for, but does not implement, the facility
to have more queue objects in the system than available stripes;
currently this remains a 1:1 relationship.  In other words, this patch
just moves fields around and does not implement new logic.
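The split can be pictured with a simplified userspace sketch of the two new structures; the fields follow the lists above, but the stand-in typedefs and the single-entry dev array are mine, and the actual patch of course defines more:

```c
#include <stddef.h>

/* Stand-ins for the kernel types involved (illustration only). */
typedef unsigned long long sector_t;
typedef struct { int locked; } spinlock_t;
struct bio;
struct raid5_private_data;

/* Per-device queuing state split out of struct r5dev. */
struct r5_queue_dev {
	sector_t sector;
	struct bio *toread, *towrite;
};

/* The new frontend object: it owns the incoming bio lists, so queuing
 * decisions no longer require first grabbing a stripe_head from the
 * cache. */
struct stripe_queue {
	struct raid5_private_data *raid_conf;
	int pd_idx;                  /* parity disk index */
	spinlock_t lock;
	int bm_seq;                  /* bitmap sequence number */
	struct r5_queue_dev dev[1];  /* really one entry per disk */
};
```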
--- Performance Data ---

Unit information
File size = megabytes
Blk Size  = bytes
Num Thr   = number of threads
Avg Rate  = relative throughput
CPU%      = relative percentage of CPU used during the test
CPU Eff   = Rate divided by CPU% - relative throughput per cpu load

Configuration:
Platform: 1200Mhz iop348 with 4-disk sata_vsc array

mdadm --create /dev/md0 /dev/sd[abcd] -n 4 -l 5
mkfs.ext2 /dev/md0
mount /dev/md0 /mnt/raid
tiobench --size 2048 --numruns 5 --block 4096 --block 131072 --dir /mnt/raid

Sequential Reads
            File  Blk     Num  Avg    Maximum  CPU
Identifier  Size  Size    Thr  Rate   (CPU%)   Eff
----------- ----- ------  ---  -----  -------  -----
2.6.22-iop1 2048  4096    1    -1%    2%       -3%
2.6.22-iop1 2048  4096    2    -37%   -34%     -5%
2.6.22-iop1 2048  4096    4    -22%   -19%     -3%
2.6.22-iop1 2048  4096    8    -3%    -3%      -1%
2.6.22-iop1 2048  131072  1    1%     -1%      2%
2.6.22-iop1 2048  131072  2    -11%   -11%     -1%
2.6.22-iop1 2048  131072  4    25%    20%      4%
2.6.22-iop1 2048  131072  8    8%     6%       2%

Sequential Writes
            File  Blk     Num  Avg    Maximum  CPU
Identifier  Size  Size    Thr  Rate   (CPU%)   Eff
----------- ----- ------  ---  -----  -------  -----
2.6.22-iop1 2048  4096    1    26%    29%      -2%
2.6.22-iop1 2048  4096    2    40%    43%      -2%
2.6.22-iop1 2048  4096    4    24%    7%       16%
2.6.22-iop1 2048  4096    8    6%     -11%     19%
2.6.22-iop1 2048  131072  1    66%    65%      0%
2.6.22-iop1 2048  131072  2    41%    33%      6%
2.6.22-iop1 2048  131072  4    23%    -8%      34%
2.6.22-iop1 2048  131072  8    13%    -24%     49%

The read numbers in this take have improved from a 14% average decline
to a 5% average decline.  However, it is still a mystery why any
significant variance is showing up, because most reads should
completely bypass the stripe_cache.
Here is blktrace data for a component disk while running the following:

for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=1024; done

Pre-patch:
CPU0 (sda):
 Reads Queued:      7965,  31860KiB   Writes Queued:    437458,  1749MiB
 Read Dispatches:    881,  31860KiB   Write Dispatches:  26405,  1749MiB
 Reads Requeued:       0               Writes Requeued:       0
 Reads Completed:    881,  31860KiB   Writes Completed:  26415,  1749MiB
 Read Merges:       6955,  27820KiB   Write Merges:     411007,  1644MiB
 Read depth:           2               Write depth:           2
 IO unplugs:         176               Timer unplugs:       176

Post-patch:
CPU0 (sda):
 Reads Queued:     36255, 145020KiB   Writes Queued:    437727,  1750MiB
 Read Dispatches:   1960, 145020KiB   Write Dispatches:   6672,  1750MiB
 Reads Requeued:       0               Writes Requeued:       0
 Reads Completed:   1960, 145020KiB   Writes Completed:   6682,  1750MiB
 Read Merges:      34235, 136940KiB   Write Merges:     430409,  1721MiB
 Read depth:           2               Write depth:           2
 IO unplugs:         423               Timer unplugs:       423

The performance win is coming from improved merging and not from
reduced reads as previously assumed.  Note that with blktrace enabled
the throughput comes in at ~98MB/s compared to ~120MB/s without.
Pre-patch throughput hovers at ~85MB/s for this dd command.

Changes in take2:
* leave the flags with the buffers, prevents a data corruption issue
  whereby stale buffer state flags are attached to n
[GIT PATCH 0/2] stripe-queue for 2.6.23 consideration
Andrew, Neil,

The stripe-queue patches are showing solid performance improvement.

git://lost.foo-projects.org/~dwillia2/git/iop md-for-linus

 drivers/md/raid5.c         | 1484
 include/linux/raid/raid5.h |   87 +++-
 2 files changed, 1164 insertions(+), 407 deletions(-)

Dan Williams (2):
      raid5: add the stripe_queue object for tracking raid io requests (take2)
      raid5: use stripe_queues to prioritize the "most deserving" requests (take4)

I initially considered them 2.6.24 material, but after fixing the
sync+io data corruption regression, fixing the performance regression
with large 'stripe_cache_size' values, and seeing how well they
performed on my IA platform, I would like them to be considered for
2.6.23.  That being said, I have not yet tested expand operations or
raid6.

Without any tuning a 4 disk (SATA) RAID5 array can reach 190MB/s.
Previously performance was around 90MB/s.  Blktrace data confirms that
fewer reads are occurring and more writes are being merged.

$ mdadm --create /dev/md0 /dev/sd[abcd] -n 4 -l 5 --assume-clean
$ blktrace /dev/sd[abcd] &
$ for i in `seq 1 3`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=1024; done
$ fg
^C
$ blkparse /dev/sda /dev/sdb /dev/sdc /dev/sdd

=pre-patch=
Total (sda):
 Reads Queued:      3,136,   12,544KiB  Writes Queued:    187,068,  748,272KiB
 Read Dispatches:     676,   12,384KiB  Write Dispatches:  30,949,  737,052KiB
 Reads Requeued:        0                Writes Requeued:        0
 Reads Completed:     662,   12,080KiB  Writes Completed:  30,630,  736,964KiB
 Read Merges:       2,452,    9,808KiB  Write Merges:     155,885,  623,540KiB
 IO unplugs:            1                Timer unplugs:          1

Total (sdb):
 Reads Queued:      1,541,    6,164KiB  Writes Queued:     91,224,  364,896KiB
 Read Dispatches:     323,    6,184KiB  Write Dispatches:  14,603,  335,528KiB
 Reads Requeued:        0                Writes Requeued:        0
 Reads Completed:     303,    6,124KiB  Writes Completed:  13,650,  328,520KiB
 Read Merges:       1,209,    4,836KiB  Write Merges:      76,080,  304,320KiB
 IO unplugs:            0                Timer unplugs:          0

Total (sdc):
 Reads Queued:      1,372,    5,488KiB  Writes Queued:     82,995,  331,980KiB
 Read Dispatches:     297,    5,280KiB  Write Dispatches:  13,258,  304,020KiB
 Reads Requeued:        0                Writes Requeued:        0
 Reads Completed:     268,    4,948KiB  Writes Completed:  12,320,  298,668KiB
 Read Merges:       1,067,    4,268KiB  Write Merges:      69,154,  276,616KiB
 IO unplugs:            0                Timer unplugs:          0

Total (sdd):
 Reads Queued:      1,383,    5,532KiB  Writes Queued:     80,186,  320,744KiB
 Read Dispatches:     307,    5,008KiB  Write Dispatches:  13,241,  298,400KiB
 Reads Requeued:        0                Writes Requeued:        0
 Reads Completed:     276,    4,888KiB  Writes Completed:  12,677,  294,324KiB
 Read Merges:       1,050,    4,200KiB  Write Merges:      66,772,  267,088KiB
 IO unplugs:            0                Timer unplugs:          0

=post-patch=
Total (sda):
 Reads Queued:        117,      468KiB  Writes Queued:     71,511,  286,044KiB
 Read Dispatches:      17,      308KiB  Write Dispatches:   8,412,  699,204KiB
 Reads Requeued:        0                Writes Requeued:        0
 Reads Completed:       6,       96KiB  Writes Completed:   3,704,  321,552KiB
 Read Merges:          96,      384KiB  Write Merges:      67,880,  271,520KiB
 IO unplugs:           14                Timer unplugs:         15

Total (sdb):
 Reads Queued:         88,      352KiB  Writes Queued:     56,687,  226,748KiB
 Read Dispatches:      11,      288KiB  Write Dispatches:   8,142,  686,412KiB
 Reads Requeued:        0                Writes Requeued:        0
 Reads Completed:       8,      184KiB  Writes Completed:   2,770,  257,740KiB
 Read Merges:          76,      304KiB  Write Merges:      54,005,  216,020KiB
 IO unplugs:           16                Timer unplugs:         17

Total (sdc):
 Reads Queued:         60,      240KiB  Writes Queued:     61,863,  247,452KiB
 Read Dispatches:       7,      248KiB  Write Dispatches:   8,302,  699,832KiB
 Reads Requeued:        0                Writes Requeued:        0
 Reads Completed:       5,      144KiB  Writes Completed:   2,907,  258,900KiB
 Read Merges:          50,      200KiB  Write Merges:      58,926,  235,704KiB
 IO unplugs:           20                Timer unplugs:         23

Total (sdd):
 Reads Queued:         61,      244KiB  Writes Queued:     66,330,  265,320KiB
 Read Dispatches:      10,      180KiB  Write Dispatches:   9,326,  694,012KiB
 Reads Requeued:        0                Writes Requeued:        0
 Reads Completed:       4,      112KiB  Writes Completed:   3,562,  285,912KiB
 Read Merges:          47,      188KiB  Writ
Re: [RFH] Partition table recovery
On 07/22/2007 03:11 AM, Theodore Tso wrote:

>> This is a problem.  Today the CHS fields in the partition entries
>> don't mean much of anything anymore and Linux happily ignores them,
>> but DOS and (hence) Windows 9x do not.  From time to time I still
>> have the Windows 98 install that's sitting in a corner of my disk
>> throw a fit just by having set the BIOS from LBA to Large (meaning
>> the geometry the BIOS pretends the disk has changes), for example.
>> Old DOS installs that I keep around for the purpose of hardware
>> testing with the originally supplied drivers make for even more of a
>> "don't touch, don't touch!" thing -- various versions of DOS throw
>> fits for various reasons.
>
> This is true, but that's due to the fundamentally broken nature of
> CHS.  You need them to boot, and that's about it.  I will say up
> front that I don't particularly care about legacy operating systems
> such as DOS, Windows 98, or Minix 3.  So the idea of simply having
> the number of heads and sectors in the partition header is that we
> can reconstruct CHS fields such that it is likely with modern
> hardware you will get it right.

Well, I still don't believe this all to be a great idea, but it was
sort of fun, so the attached does largely what you want -- build a
list of all data partitions.

The heads/sectors fields it for now just gets from the HDIO_GETGEO
call.  A better source would be guessing the values from the partition
table itself, but that _also_ doesn't make too much sense.  If you're
reconstructing a sanitized version of the table anyway, it makes
better sense to reconstruct it with the values HDIO_GETGEO returns at
restoration time.

I kept your suggested format, but in fact the 64-bit "start" value
seems not very useful if we're getting the value from a 32-bit field
in the old partition tables anyway.  With that shrunk down to 32-bit
again, there would be enough room for the complete partition table
entry...
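Reconstructing CHS fields from an LBA start sector and a heads/sectors pair is mechanical; a sketch of the standard conversion, with the usual saturation at cylinder 1023 since the MBR entry only has 10 bits for the cylinder (a minimal illustration -- restoration tools would also want to clamp the head/sector fields in the overflow case):

```c
#include <stdint.h>

struct chs { uint16_t c; uint8_t h, s; };

/* Convert an LBA sector number to CHS for a given drive geometry.
 * Sectors are 1-based in CHS; cylinders saturate at 1023. */
static struct chs lba_to_chs(uint32_t lba, uint32_t heads, uint32_t sectors)
{
	struct chs r;
	uint32_t c = lba / (heads * sectors);

	r.c = c > 1023 ? 1023 : (uint16_t)c;
	r.h = (uint8_t)((lba / sectors) % heads);
	r.s = (uint8_t)(lba % sectors + 1);
	return r;
}
```

With the common 255-head/63-sector translated geometry, LBA 63 (the classic first-partition start) maps back to cylinder 0, head 1, sector 1.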
> For ancient systems that do all sorts of weird things such as ECHS,
> etc., yeah, you're pretty much doomed, and the bigger danger comes
> from futzing with BIOS settings, et al.  But it's 2007, gosh darn it!
> "Here's a quarter, kid, buy yourself a real computer".  :-)

Thanks, but real computers won't host my ISA cards...

> Yes, I'm very aware of the extended partitioning scheme mess.  What I
> was proposing to back up here is only the real partitions, not the
> fake extended partitions.  The idea is to store *just* enough
> information so that a partition table manager can recover the
> partition tables in such a way that the original filesystem
> information can be recovered.

This should do, I guess.  It enters all data partitions into the list
in the order in which they are encountered, and sets a flag to signify
that a partition was a logical rather than a primary.  Another option
would be to just reserve the first 4 entries for the primaries and the
rest for the logicals, but this saves entries if there are fewer than
4 primaries and was in fact easier...

The program enters partitions in what should be the same order as
Linux itself does: primaries from slot 0 to 3 as normal (but not
backed up to entry 0 to 3 as said -- the LOGICAL flag identifies
them), extended partitions in the MBR in the order encountered, with
the logicals in the second-level table as encountered, and following
only the first extended in the second-level table.

Made it into a generic C program -- didn't look at the e2fsprogs
sources yet.  Need to be off now and haven't yet stared at this as
long as I'd like, so don't slap me if I've left a few bugs in
(although it seems to work nicely).  The program dumps the backup
sector to stdout -- it's of course easy to change it to print the
entries out so they're easy to compare against, say, "fdisk -l -us".

Oh, and once you've looked at it, please throw it away.  As said, I
still don't think it's a great idea ;-)

Rene.
/*
 * Public Domain 2007, Rene Herman
 *
 * gcc -W -Wall -DTEST -D_LARGEFILE64_SOURCE -o backup backup.c
 */

#include <stdio.h>

enum {
	DOS_EXTENDED	= 0x05,
	WIN98_EXTENDED	= 0x0f,
	LINUX_EXTENDED	= 0x85,
};

struct partition {
	unsigned char boot_ind;
	unsigned char __1[3];
	unsigned char sys_ind;
	unsigned char __2[3];
	unsigned int start;
	unsigned int size;
} __attribute__((packed));

struct entry {
	unsigned char flags;
	unsigned char type;
	unsigned short __1;
	unsigned long long start;
	unsigned int size;
} __attribute__((packed));

enum {
	ENTRY_FLAG_LOGICAL	= 0x01,
	ENTRY_FLAG_BOOTABLE	= 0x80,
};

struct backup {
	unsigned char signature[8];
	unsigned short type;
	unsigned char heads;
	unsigned char sectors;
	unsigned char count;
	unsigned char __1[3];
	struct entry table[31];
} __attribute__((packed));

#define BACKUP_SIGNATURE "PARTBAK1"

enum {
	BACKUP_TYPE_MBR = 1,
};

struct backup backup = {
	.signature = B