Re: [RFH] Partition table recovery

2007-07-22 Thread Indan Zupancic
On Sun, July 22, 2007 18:28, Theodore Tso wrote:
> On Sun, Jul 22, 2007 at 07:10:31AM +0300, Al Boldi wrote:
>> Sounds great, but it may be advisable to hook this into the partition
>> modification routines instead of mkfs/fsck.  Which would mean that the
>> partition manager could ask the kernel to instruct its fs subsystem to
>> update the backup partition table for each known fs-type that supports such
>> a feature.
>
> Well, let's think about this a bit.  What are the requirements?
>
> 1) The partition manager should be able to explicitly request that a new
> backup of the partition tables be stashed in each filesystem that has
> room for such a backup.  That way, when the user affirmatively makes a
> partition table change, it can get backed up in all of the right
> places automatically.
>
> 2) The fsck program should *only* stash a backup of the partition
> table if there currently isn't one in the filesystem.  It may be that
> the partition table has been corrupted, and so merely doing an fsck
> should not transfer a current copy of the partition table to the
> filesystem-specific backup area.  It could be that the partition table
> was only partially recovered, and we don't want to overwrite the
> previously existing backups except on an explicit request from the
> system administrator.
>
> 3) The mkfs program should automatically create a backup of the
> current partition table layout.  That way we get a backup in the newly
> created filesystem as soon as it is created.
>
> 4) The exact location of the backup may vary from filesystem to
> filesystem.  For ext2/3/4, bytes 512-1023 are always unused, and don't
> interfere with the boot sector at bytes 0-511, so that's the obvious
> location.  Other filesystems may have that location in use, and some
> other location might be a better place to store it.  Ideally it will
> be a well-known location, that isn't dependent on finding an inode
> table, or some such, but that may not be possible for all filesystems.

To be on the safe side, maybe also add a checksum, timestamp and
something identifying the disk the filesystem was created on.
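
As a rough illustration, those extra fields could ride in a small header in
front of the backed-up table.  Everything below (the names, the sizes, the
choice of CRC32) is made up for the example, not part of any proposed format:

#include <stdint.h>

/* Hypothetical header for the 512-byte backup record; illustration only. */
struct partbak_header {
	uint8_t  signature[8];	/* magic, e.g. "PARTBAK1" */
	uint32_t crc32;		/* checksum over the rest of the record */
	uint64_t timestamp;	/* seconds since the epoch when written */
	uint8_t  disk_id[20];	/* serial number / WWN of the source disk */
} __attribute__((packed));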

Regards,

Indan




Re: [RFH] Partition table recovery

2007-07-22 Thread Al Boldi
Theodore Tso wrote:
> On Sun, Jul 22, 2007 at 07:10:31AM +0300, Al Boldi wrote:
> > Sounds great, but it may be advisable to hook this into the partition
> > modification routines instead of mkfs/fsck.  Which would mean that the
> > partition manager could ask the kernel to instruct its fs subsystem to
> > update the backup partition table for each known fs-type that supports
> > such a feature.
>
> Well, let's think about this a bit.  What are the requirements?
>
> 1) The partition manager should be able to explicitly request that a new
> backup of the partition tables be stashed in each filesystem that has
> room for such a backup.  That way, when the user affirmatively makes a
> partition table change, it can get backed up in all of the right
> places automatically.
>
> 2) The fsck program should *only* stash a backup of the partition
> table if there currently isn't one in the filesystem.  It may be that
> the partition table has been corrupted, and so merely doing an fsck
> should not transfer a current copy of the partition table to the
> filesystem-specific backup area.  It could be that the partition table
> was only partially recovered, and we don't want to overwrite the
> previously existing backups except on an explicit request from the
> system administrator.
>
> 3) The mkfs program should automatically create a backup of the
> current partition table layout.  That way we get a backup in the newly
> created filesystem as soon as it is created.
>
> 4) The exact location of the backup may vary from filesystem to
> filesystem.  For ext2/3/4, bytes 512-1023 are always unused, and don't
> interfere with the boot sector at bytes 0-511, so that's the obvious
> location.  Other filesystems may have that location in use, and some
> other location might be a better place to store it.  Ideally it will
> be a well-known location, that isn't dependent on finding an inode
> table, or some such, but that may not be possible for all filesystems.
>
> OK, so how about this as a solution that meets the above requirements?
>
> /sbin/partbackup <device> [<partition>]
>
>   Will scan <device> (e.g., /dev/hda, /dev/sdb, etc.) and create
>   a 512 byte partition backup, using the format I've previously
>   described.  If <partition> is specified on the command line, it
>   will use the blkid library to determine the filesystem type of
>   <partition>, and then attempt to execute
>   /sbin/partbackupfs.<fstype> to write the partition backup to
>   <partition>.  If <partition> is '-', then it will write the 512 byte
>   partition table to stdout.  If <partition> is not specified on
>   the command line, /sbin/partbackup will iterate over all
>   partitions in <device>, use the blkid library to attempt to
>   determine the correct filesystem type, and then execute
>   /sbin/partbackupfs.<fstype> if such a backup program exists.
>
> /sbin/partbackupfs.<fstype> <partition>
>
>   ... is a filesystem-specific program for filesystem type
>   <fstype>.  It will assure that <partition> (e.g., /dev/hda1,
>   /dev/sdb3) is of an appropriate filesystem type, and then read
>   512 bytes from stdin and write it out to <partition> to an
>   appropriate place for that filesystem.
>
> Partition managers will be encouraged to check whether /sbin/partbackup
> exists, and if so, after the partition table is written, to call it
> with just one argument (e.g., /sbin/partbackup /dev/hdb).  They SHOULD
> provide an option for the user to suppress the backup from happening,
> but the backup should be the default behavior.
>
> An /sbin/mkfs.<fstype> program is encouraged to run /sbin/partbackup
> with two arguments (e.g., /sbin/partbackup /dev/hdb /dev/hdb3) when
> creating a filesystem.
>
> An /sbin/fsck.<fstype> program is encouraged to check to see if a
> partition backup exists (assuming the filesystem supports it), and if
> not, call /sbin/partbackup with two arguments.
>
> A filesystem utility package for a particular filesystem type is
> encouraged to make the above changes to its mkfs and fsck programs, as
> well as provide an /sbin/partbackupfs.<fstype> program.

Great!

> I would do this all in userspace, though.  Is there any reason to get
> the kernel involved?  I don't think so.

Yes, doing things in userspace, when possible, is much better.  But a change
in the partition table has to be relayed to the kernel, and when that change
happens to be on a mounted disk, the partition manager complains that it
cannot update the kernel's view.  So how can this be addressed?
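
For reference, the userspace side of that today is the BLKRRPART ioctl; the
sketch below (the helper name is invented for the example) is roughly what
partition managers do, and it is exactly this call that fails with EBUSY
while any partition on the disk is in use:

/* Ask the kernel to re-read the partition table of a whole disk.
 * Fails with EBUSY when any partition of the disk is in use, e.g. mounted. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* BLKRRPART */

int reread_partition_table(const char *disk)
{
	int fd = open(disk, O_RDONLY);

	if (fd < 0)
		return -1;
	if (ioctl(fd, BLKRRPART) < 0) {
		fprintf(stderr, "%s: BLKRRPART: %s\n", disk, strerror(errno));
		close(fd);
		return -1;
	}
	close(fd);
	return 0;
}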


Thanks!

--
Al



Re: [RFH] Partition table recovery

2007-07-22 Thread Theodore Tso
On Sun, Jul 22, 2007 at 07:10:31AM +0300, Al Boldi wrote:
> Sounds great, but it may be advisable to hook this into the partition 
> modification routines instead of mkfs/fsck.  Which would mean that the 
> partition manager could ask the kernel to instruct its fs subsystem to 
> update the backup partition table for each known fs-type that supports such 
> a feature.

Well, let's think about this a bit.  What are the requirements?

1) The partition manager should be able to explicitly request that a new
backup of the partition tables be stashed in each filesystem that has
room for such a backup.  That way, when the user affirmatively makes a
partition table change, it can get backed up in all of the right
places automatically.

2) The fsck program should *only* stash a backup of the partition
table if there currently isn't one in the filesystem.  It may be that
the partition table has been corrupted, and so merely doing an fsck
should not transfer a current copy of the partition table to the
filesystem-specific backup area.  It could be that the partition table
was only partially recovered, and we don't want to overwrite the
previously existing backups except on an explicit request from the
system administrator.

3) The mkfs program should automatically create a backup of the
current partition table layout.  That way we get a backup in the newly
created filesystem as soon as it is created.

4) The exact location of the backup may vary from filesystem to
filesystem.  For ext2/3/4, bytes 512-1023 are always unused, and don't
interfere with the boot sector at bytes 0-511, so that's the obvious
location.  Other filesystems may have that location in use, and some
other location might be a better place to store it.  Ideally it will
be a well-known location, that isn't dependent on finding an inode
table, or some such, but that may not be possible for all filesystems.

OK, so how about this as a solution that meets the above requirements?

/sbin/partbackup <device> [<partition>]

    Will scan <device> (e.g., /dev/hda, /dev/sdb, etc.) and create
    a 512 byte partition backup, using the format I've previously
    described.  If <partition> is specified on the command line, it
    will use the blkid library to determine the filesystem type of
    <partition>, and then attempt to execute
    /sbin/partbackupfs.<fstype> to write the partition backup to
    <partition>.  If <partition> is '-', then it will write the 512 byte
    partition table to stdout.  If <partition> is not specified on
    the command line, /sbin/partbackup will iterate over all
    partitions in <device>, use the blkid library to attempt to
    determine the correct filesystem type, and then execute
    /sbin/partbackupfs.<fstype> if such a backup program exists.

/sbin/partbackupfs.<fstype> <partition>

    ... is a filesystem-specific program for filesystem type
    <fstype>.  It will assure that <partition> (e.g., /dev/hda1,
    /dev/sdb3) is of an appropriate filesystem type, and then read
    512 bytes from stdin and write it out to <partition> to an
    appropriate place for that filesystem.

Partition managers will be encouraged to check whether /sbin/partbackup
exists, and if so, after the partition table is written, to call it
with just one argument (e.g., /sbin/partbackup /dev/hdb).  They SHOULD
provide an option for the user to suppress the backup from happening,
but the backup should be the default behavior.

An /sbin/mkfs.<fstype> program is encouraged to run /sbin/partbackup
with two arguments (e.g., /sbin/partbackup /dev/hdb /dev/hdb3) when
creating a filesystem.

An /sbin/fsck.<fstype> program is encouraged to check to see if a
partition backup exists (assuming the filesystem supports it), and if
not, call /sbin/partbackup with two arguments.

A filesystem utility package for a particular filesystem type is
encouraged to make the above changes to its mkfs and fsck programs, as
well as provide an /sbin/partbackupfs.<fstype> program.

I would do this all in userspace, though.  Is there any reason to get
the kernel involved?  I don't think so.
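
To make the division of labor concrete, here is a minimal sketch of what a
partbackupfs helper for ext2/3/4 could look like.  It only illustrates the
interface described above (512 bytes on stdin, written to the unused bytes
512-1023 after a superblock magic check); it is not the actual e2fsprogs
code:

/* Sketch of /sbin/partbackupfs.ext2: verify the ext2/3/4 superblock magic,
 * then copy the 512-byte backup record from stdin into the unused second
 * sector of the partition (bytes 512-1023).  Illustration only. */
#define _XOPEN_SOURCE 500	/* pread()/pwrite() */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

#define EXT2_SUPER_MAGIC 0xEF53	/* s_magic, 56 bytes into the superblock at 1024 */

int main(int argc, char **argv)
{
	unsigned char buf[512];
	uint16_t magic;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <partition>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDWR);
	if (fd < 0 || pread(fd, &magic, 2, 1024 + 56) != 2)
		return 1;
	if (magic != EXT2_SUPER_MAGIC)	/* assumes a little-endian host */
		return 1;
	if (fread(buf, 1, sizeof(buf), stdin) != sizeof(buf))
		return 1;
	if (pwrite(fd, buf, sizeof(buf), 512) != sizeof(buf))
		return 1;
	close(fd);
	return 0;
}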

- Ted


[GIT PATCH 1/2] raid5: add the stripe_queue object for tracking raid io requests (take2)

2007-07-22 Thread Dan Williams
The raid5 stripe cache object, struct stripe_head, serves two purposes:
1/ frontend: queuing incoming requests
2/ backend: transitioning requests through the cache state machine
   to the backing devices
The problem with this model is that queuing decisions are directly tied to
cache availability.  There is no facility to determine that a request or
group of requests 'deserves' usage of the cache and disks at any given time.

This patch separates the object members needed for queuing from the object
members used for caching.  The stripe_queue object takes over the incoming
bio lists as well as the buffer state flags.

The following fields are moved from struct stripe_head to struct
stripe_queue:
raid5_private_data *raid_conf
int pd_idx
spinlock_t lock
int bm_seq

The following fields are moved from struct r5dev to struct r5_queue_dev:
sector_t sector
struct bio *toread, *towrite

This patch lays the groundwork for, but does not implement, the facility to
have more queue objects in the system than available stripes; currently this
remains a 1:1 relationship.  In other words, this patch just moves fields
around and does not implement new logic.
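
Sketched out, the new objects have roughly the following shape.  Only the
field names listed above come from the patch; the array sizing and the
comments are assumptions for illustration, not the actual definitions in
include/linux/raid/raid5.h:

struct r5_queue_dev {
	sector_t	sector;			/* moved from struct r5dev */
	struct bio	*toread, *towrite;	/* queued read/write requests */
};

struct stripe_queue {
	struct raid5_private_data *raid_conf;	/* moved from stripe_head */
	int		pd_idx;			/* parity disk index */
	spinlock_t	lock;
	int		bm_seq;			/* write-intent bitmap sequence */
	struct r5_queue_dev dev[1];		/* one entry per member disk,
						 * sized at allocation time */
};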

--- Performance Data ---

Unit information

File size = megabytes
Blk Size  = bytes
Num Thr   = number of threads
Avg Rate  = relative throughput
CPU%  = relative percentage of CPU used during the test
CPU Eff   = Rate divided by CPU% - relative throughput per cpu load

Configuration
=
Platform: 1200MHz iop348 with 4-disk sata_vsc array
mdadm --create /dev/md0 /dev/sd[abcd] -n 4 -l 5
mkfs.ext2 /dev/md0
mount /dev/md0 /mnt/raid
tiobench --size 2048 --numruns 5 --block 4096 --block 131072 --dir /mnt/raid

Sequential Reads
                 File  Blk     Num  Avg    Maximum  CPU
Identifier       Size  Size    Thr  Rate   (CPU%)   Eff
---------------  ----  ------  ---  -----  -------  ----
2.6.22-iop1      2048  4096    1    -1%    2%       -3%
2.6.22-iop1      2048  4096    2    -37%   -34%     -5%
2.6.22-iop1      2048  4096    4    -22%   -19%     -3%
2.6.22-iop1      2048  4096    8    -3%    -3%      -1%
2.6.22-iop1      2048  131072  1    1%     -1%      2%
2.6.22-iop1      2048  131072  2    -11%   -11%     -1%
2.6.22-iop1      2048  131072  4    25%    20%      4%
2.6.22-iop1      2048  131072  8    8%     6%       2%

Sequential Writes
                 File  Blk     Num  Avg    Maximum  CPU
Identifier       Size  Size    Thr  Rate   (CPU%)   Eff
---------------  ----  ------  ---  -----  -------  ----
2.6.22-iop1      2048  4096    1    26%    29%      -2%
2.6.22-iop1      2048  4096    2    40%    43%      -2%
2.6.22-iop1      2048  4096    4    24%    7%       16%
2.6.22-iop1      2048  4096    8    6%     -11%     19%
2.6.22-iop1      2048  131072  1    66%    65%      0%
2.6.22-iop1      2048  131072  2    41%    33%      6%
2.6.22-iop1      2048  131072  4    23%    -8%      34%
2.6.22-iop1      2048  131072  8    13%    -24%     49%

The read numbers in this take have improved from a 14% average decline to a
5% average decline.  However, it is still a mystery why any significant
variance shows up at all, because most reads should completely bypass the
stripe_cache.

Here is blktrace data for a component disk while running the following:
for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=1024; done

Pre-patch:
CPU0 (sda):
 Reads Queued:        7965,    31860KiB  Writes Queued:     437458,    1749MiB
 Read Dispatches:      881,    31860KiB  Write Dispatches:   26405,    1749MiB
 Reads Requeued:         0               Writes Requeued:        0
 Reads Completed:      881,    31860KiB  Writes Completed:   26415,    1749MiB
 Read Merges:         6955,    27820KiB  Write Merges:      411007,    1644MiB
 Read depth:             2               Write depth:            2
 IO unplugs:           176               Timer unplugs:        176

Post-patch:
CPU0 (sda):
 Reads Queued:       36255,   145020KiB  Writes Queued:     437727,    1750MiB
 Read Dispatches:     1960,   145020KiB  Write Dispatches:    6672,    1750MiB
 Reads Requeued:         0               Writes Requeued:        0
 Reads Completed:     1960,   145020KiB  Writes Completed:    6682,    1750MiB
 Read Merges:        34235,   136940KiB  Write Merges:      430409,    1721MiB
 Read depth:             2               Write depth:            2
 IO unplugs:           423               Timer unplugs:        423

The performance win is coming from improved merging and not from reduced
reads as previously assumed.  Note that with blktrace enabled the
throughput comes in at ~98MB/s compared to ~120MB/s without.  Pre-patch
throughput hovers at ~85MB/s for this dd command.

Changes in take2:
* leave the flags with the buffers, prevents a data corruption issue
  whereby stale buffer state flags are attached to n

[GIT PATCH 0/2] stripe-queue for 2.6.23 consideration

2007-07-22 Thread Dan Williams
Andrew, Neil,

The stripe-queue patches are showing solid performance improvement.

git://lost.foo-projects.org/~dwillia2/git/iop md-for-linus

 drivers/md/raid5.c | 1484 
 include/linux/raid/raid5.h |   87 +++-
 2 files changed, 1164 insertions(+), 407 deletions(-)

Dan Williams (2):
  raid5: add the stripe_queue object for tracking raid io requests (take2)
  raid5: use stripe_queues to prioritize the "most deserving" requests 
(take4)

I initially considered them 2.6.24 material, but after fixing the sync+io
data corruption regression, fixing the performance regression with large
'stripe_cache_size' values, and seeing how well it performed on my IA
platform, I would like them to be considered for 2.6.23.  That being said,
I have not yet tested expand operations or raid6.

Without any tuning a 4 disk (SATA) RAID5 array can reach 190MB/s.  Previously
performance was around 90MB/s.  Blktrace data confirms that fewer reads are
occurring and more writes are being merged.

$ mdadm --create /dev/md0 /dev/sd[abcd] -n 4 -l 5 --assume-clean
$ blktrace /dev/sd[abcd] &
$ for i in `seq 1 3`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=1024; done
$ fg ^C
$ blkparse /dev/sda /dev/sdb /dev/sdc /dev/sdd

=pre-patch=
Total (sda):
 Reads Queued:   3,136,   12,544KiB  Writes Queued: 187,068,  748,272KiB
 Read Dispatches:  676,   12,384KiB  Write Dispatches:   30,949,  737,052KiB
 Reads Requeued: 0   Writes Requeued: 0
 Reads Completed:  662,   12,080KiB  Writes Completed:   30,630,  736,964KiB
 Read Merges:    2,452,    9,808KiB  Write Merges:  155,885,  623,540KiB
 IO unplugs: 1   Timer unplugs:   1

Total (sdb):
 Reads Queued:   1,541,    6,164KiB  Writes Queued:  91,224,  364,896KiB
 Read Dispatches:  323,    6,184KiB  Write Dispatches:   14,603,  335,528KiB
 Reads Requeued: 0   Writes Requeued: 0
 Reads Completed:  303,    6,124KiB  Writes Completed:   13,650,  328,520KiB
 Read Merges:    1,209,    4,836KiB  Write Merges:   76,080,  304,320KiB
 IO unplugs: 0   Timer unplugs:   0

Total (sdc):
 Reads Queued:   1,372,    5,488KiB  Writes Queued:  82,995,  331,980KiB
 Read Dispatches:  297,    5,280KiB  Write Dispatches:   13,258,  304,020KiB
 Reads Requeued: 0   Writes Requeued: 0
 Reads Completed:  268,    4,948KiB  Writes Completed:   12,320,  298,668KiB
 Read Merges:    1,067,    4,268KiB  Write Merges:   69,154,  276,616KiB
 IO unplugs: 0   Timer unplugs:   0

Total (sdd):
 Reads Queued:   1,383,    5,532KiB  Writes Queued:  80,186,  320,744KiB
 Read Dispatches:  307,    5,008KiB  Write Dispatches:   13,241,  298,400KiB
 Reads Requeued: 0   Writes Requeued: 0
 Reads Completed:  276,    4,888KiB  Writes Completed:   12,677,  294,324KiB
 Read Merges:    1,050,    4,200KiB  Write Merges:   66,772,  267,088KiB
 IO unplugs: 0   Timer unplugs:   0


=post-patch=
Total (sda):
 Reads Queued: 117,  468KiB  Writes Queued:  71,511,  286,044KiB
 Read Dispatches:   17,  308KiB  Write Dispatches:    8,412,  699,204KiB
 Reads Requeued: 0   Writes Requeued: 0
 Reads Completed:    6,   96KiB  Writes Completed:    3,704,  321,552KiB
 Read Merges:   96,  384KiB  Write Merges:   67,880,  271,520KiB
 IO unplugs:14   Timer unplugs:  15

Total (sdb):
 Reads Queued:  88,  352KiB  Writes Queued:  56,687,  226,748KiB
 Read Dispatches:   11,  288KiB  Write Dispatches:    8,142,  686,412KiB
 Reads Requeued: 0   Writes Requeued: 0
 Reads Completed:    8,  184KiB  Writes Completed:    2,770,  257,740KiB
 Read Merges:   76,  304KiB  Write Merges:   54,005,  216,020KiB
 IO unplugs:16   Timer unplugs:  17

Total (sdc):
 Reads Queued:  60,  240KiB  Writes Queued:  61,863,  247,452KiB
 Read Dispatches:    7,  248KiB  Write Dispatches:    8,302,  699,832KiB
 Reads Requeued: 0   Writes Requeued: 0
 Reads Completed:    5,  144KiB  Writes Completed:    2,907,  258,900KiB
 Read Merges:   50,  200KiB  Write Merges:   58,926,  235,704KiB
 IO unplugs:20   Timer unplugs:  23

Total (sdd):
 Reads Queued:  61,  244KiB  Writes Queued:  66,330,  265,320KiB
 Read Dispatches:   10,  180KiB  Write Dispatches:    9,326,  694,012KiB
 Reads Requeued: 0   Writes Requeued: 0
 Reads Completed:    4,  112KiB  Writes Completed:    3,562,  285,912KiB
 Read Merges:   47,  188KiB  Writ

Re: [RFH] Partition table recovery

2007-07-22 Thread Rene Herman

On 07/22/2007 03:11 AM, Theodore Tso wrote:


>> This is a problem. Today the CHS fields in the partition entries don't
>> mean much of anything anymore and Linux happily ignores them, but DOS
>> and (hence) Windows 9x do not. From time to time I still have the
>> Windows 98 install that's sitting in a corner of my disk throw a fit
>> just by having set the BIOS from LBA to Large (meaning the geometry the
>> BIOS pretends the disk has changes), for example. Old DOS installs that
>> I keep around for the purpose of hardware testing with the originally
>> supplied drivers make for even more of a "don't touch, don't touch!"
>> thing -- various versions of DOS throw fits for various reasons.


> This is true, but that's due to the fundamentally broken nature of CHS.
> You need them to boot, and that's about it.  I will say up front that I
> don't particularly care about legacy operating systems such as DOS,
> Windows 98, or Minix 3.  So the idea of simply having the number of heads
> and sectors in the partition header is that we can reconstruct the CHS
> fields such that with modern hardware you will likely get them right.


Well, I still don't believe this all to be a great idea, but it was sort of
fun, so the attached does largely what you want -- build a list of all data
partitions.


For now, the heads/sectors fields are simply taken from the HDIO_GETGEO call.
A better source would be guessing the values from the partition table itself,
but that _also_ doesn't make too much sense. If you're reconstructing a
sanitized version of the table anyway, it makes better sense to reconstruct
it with the values HDIO_GETGEO returns at restoration time.
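
For reference, a minimal sketch of both halves -- fetching heads/sectors via
HDIO_GETGEO and rebuilding a CHS triple from an LBA at restore time -- with
the usual clamping when the cylinder no longer fits (illustration only):

/* Fetch the kernel's idea of the geometry and rebuild a CHS triple from an
 * LBA with it.  Values that don't fit are clamped the conventional way.
 * Sketch only. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/hdreg.h>	/* struct hd_geometry, HDIO_GETGEO */

static void lba_to_chs(unsigned long long lba, unsigned heads, unsigned sectors,
		       unsigned *c, unsigned *h, unsigned *s)
{
	if (lba / (heads * sectors) > 1023) {
		*c = 1023; *h = heads - 1; *s = sectors;
		return;
	}
	*c = lba / (heads * sectors);
	*h = (lba / sectors) % heads;
	*s = (lba % sectors) + 1;
}

int main(int argc, char **argv)
{
	struct hd_geometry geo;
	unsigned c, h, s;
	int fd;

	if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0)
		return 1;
	if (ioctl(fd, HDIO_GETGEO, &geo) < 0)
		return 1;
	lba_to_chs(63, geo.heads, geo.sectors, &c, &h, &s);
	printf("%s: heads=%u sectors=%u, LBA 63 -> C/H/S %u/%u/%u\n", argv[1],
	       (unsigned)geo.heads, (unsigned)geo.sectors, c, h, s);
	close(fd);
	return 0;
}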


I kept your suggested format, but in fact, the 64-bit "start" value seems 
not very useful if we're getting the value from a 32-bit field in the old 
partition tables anyway. With that shrunk down to 32-bit again, there would 
be enough room for the complete partition table entry...



> For ancient systems that do all sorts of weird things such as ECHS,
> etc., yeah, you're pretty much doomed, and the bigger danger comes
> from futzing with BIOS settings, et al.  But it's 2007, gosh darn it!
> "Here's a quarter, kid, buy yourself a real computer".  :-)


Thanks, but real computers won't host my ISA cards...


> Yes, I'm very aware of the extended partitioning scheme mess.  What I
> was proposing to back up here is only the real partitions, not the
> fake extended partitions.  The idea is to store *just* enough
> information so that a partition table manager can recover the
> partition tables in such a way that the original filesystem
> information can be recovered.


This should do, I guess. It enters all data partitions into the list in the
order in which they are encountered, and sets a flag to signify that a
partition was logical rather than primary. Another option would be to just
reserve the first 4 entries for the primaries and the rest for the logicals,
but this saves entries if there are fewer than 4 primaries and was in fact
easier...


The program enters partitions in what should be the same order as Linux
itself does: primaries from slot 0 to 3 as normal (but not backed up to
entries 0 to 3, as said -- the LOGICAL flag identifies them), extended
partitions in the MBR in the order encountered, with the logicals in the
second-level table as encountered, and following only the first extended in
the second-level table.


Made it into a generic C program -- didn't look at e2fsprogs sources yet.

Need to be off now and haven't yet stared at this as long as I'd like, so
don't slap me if I've left a few bugs in (although it seems to work nicely).
The program dumps the backup sector to stdout -- it's of course easy to
change it to print the entries out so they're easy to compare against, say,
"fdisk -l -us".


Oh, and once you've looked at it, please throw it away. As said, I still 
don't think it's a great idea ;-)


Rene.

/*
 * Public Domain 2007, Rene Herman
 *
 * gcc -W -Wall -DTEST -D_LARGEFILE64_SOURCE -o backup backup.c
 *
 */

#include <stdio.h>

enum {
DOS_EXTENDED   = 0x05,
WIN98_EXTENDED = 0x0f,
LINUX_EXTENDED = 0x85,
};

struct partition {
unsigned char boot_ind;
unsigned char __1[3];
unsigned char sys_ind;
unsigned char __2[3];
unsigned int start;
unsigned int size;
} __attribute__((packed));

struct entry {
unsigned char flags;
unsigned char type;
unsigned short __1;
unsigned long long start;
unsigned int size;
} __attribute__((packed));

enum {
ENTRY_FLAG_LOGICAL  = 0x01,
ENTRY_FLAG_BOOTABLE = 0x80,
};

struct backup {
unsigned char signature[8];
unsigned short type;
unsigned char heads;
unsigned char sectors;
unsigned char count;
unsigned char __1[3];
struct entry table[31];
} __attribute__((packed));
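
/* Not part of the original attachment: a compile-time reminder that the
 * 16-byte header plus 31 16-byte entries above is exactly one 512-byte
 * sector. */
typedef char backup_is_one_sector[sizeof(struct backup) == 512 ? 1 : -1];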

#define BACKUP_SIGNATURE "PARTBAK1"

enum {
BACKUP_TYPE_MBR = 1,
};

struct backup backup = {
.signature = B