Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-18 Thread Stephen C. Tweedie

Hi,

On Fri, May 11, 2001 at 04:54:44PM +0200, Daniel Phillips wrote:

> The only reasonable way I can think of getting a block-coherent view 
> underneath a mounted fs is to have a reverse map, and update it each 
> time we map block into the page cache or unmap it.

It's called the "buffer cache", and Ingo's early page-cache code in
2.3 actually did install page-cache backing buffers into the buffer
cache as aliases, mainly for debugging purposes.

Even without that, though, an application can achieve almost-coherency
via invalidation of the buffer cache before reading it.  And yes, this
won't necessarily remain coherent over the lifetime of the application
process, but then unless the filesystem is 100% quiescent then you
don't get that on 2.2 either.

Which is rather the point.  If the filesystem is active, then
coherency cannot be obtained at the block-device level in any case
without knowledge of the fs transaction activity.  If the filesystem
is quiescent, then you can sync it and flush the buffer cache and you
already get the coherency that you need.
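
A rough user-space sketch of the "sync it and flush the buffer cache" sequence for the quiescent case. It assumes Linux, a caller with enough privilege to issue BLKFLSBUF on the device, and that nothing else writes to the filesystem in the meantime. The helper name and the injectable `sync`/`ioctl` parameters are illustrative only, not from this thread.

```python
import fcntl
import os

# BLKFLSBUF is _IO(0x12, 97) in <linux/fs.h>: flush and invalidate the
# buffer cache of a block device.
BLKFLSBUF = 0x1261


def read_device_coherently(device_path, offset, length,
                           sync=os.sync, ioctl=fcntl.ioctl):
    """Sync dirty data, drop cached buffers, then read from the device.

    Only meaningful while the filesystem stays quiescent; on an active
    filesystem the result can go stale immediately.
    """
    sync()                          # push dirty fs data out to the device
    fd = os.open(device_path, os.O_RDONLY)
    try:
        ioctl(fd, BLKFLSBUF)        # invalidate stale cached buffers
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)
```

The `sync` and `ioctl` arguments default to the real calls; they are parameters only so the sequence can be exercised against an ordinary file.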

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-11 Thread Alexander Viro



On Mon, 7 May 2001, Pavel Machek wrote:
> OTOH with current way if you make mistake in kernel, fsck will not
> automatically inherit it; therefore fsck is likely to work even if
> kernel ext2 is b0rken [and that's fairly important]

... and by the same logics you should make fsck implement its own
drivers - after all, right now b0rken driver affects both the kernel
ext2 and fsck ;-)

I'm not sure that fsck of fs mounted read/write is worth doing in the
first place, but I'd rather do that via fs/ext2 exporting its metadata
explicitly than by playing silly buggers with device/fs coherency.




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-11 Thread Daniel Phillips

On Monday 07 May 2001 20:42, Pavel Machek wrote:
> > > It's not exactly "kernel-based fsck". What I've been talking
> > > about is secondary filesystem providing coherent access to
> > > primary fs metadata.  I.e. mount -t ext2meta -o master=/usr none
> > > /mnt and then access through /mnt/super, /mnt/block_bitmap, etc.
> > >
> > > Call me stupid --- but what exactly does the above actually
> > > achieve? Why would you do this?
> >
> > Coherent access to metadata? Well, for one thing, it allows stuff
> > like tunefs and friends on mounted fs. What's more useful, it
> > allows to do things like access to boot code, which is _not_ safe
> > to do through device access - usually you have superblock in
> > vicinity and no warranties about the things that will be
> > overwritten on umount. Same for debugging stuff, IO stats, etc. -
> > access through secondary tree is much saner than inventing tons of
> > ioctls for dealing with that. Moreover, it allows fsck and friends
> > to get rid of code duplication - while the repair logics, etc.
> > stays in userland (where it belongs) layout information is already
> > handled in the kernel. No need to duplicate it in userland...
>
> OTOH with current way if you make mistake in kernel, fsck will not
> automatically inherit it; therefore fsck is likely to work even if
> kernel ext2 is b0rken [and that's fairly important]

Al's idea nicely dances around a big problem with the page cache: there
is no efficient way to know which address_space a given physical block
belongs to.  It *might* be nice to have such a capability in a
fs-independent way.  We could do that now, very inefficiently, by
searching all the address_spaces (i.e., inodes) for the physical block.
We'd have to prevent further page cache operations while we did that,
and when we add fs-private address_spaces some more mechanism would be
required.  So: slow, intrusive and fragile.
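
To make the cost of the exhaustive search concrete, here is a toy model in Python (not kernel code). Each address_space is assumed to be a plain dict from page index to physical block number, and the function name is made up for illustration. The lookup has to walk every cached page of every inode, and the real thing would also have to hold off page cache updates while it runs.

```python
def find_owner_by_scan(address_spaces, physical_block):
    """Return (owner, page_index) for a physical block, or None.

    address_spaces: dict mapping an owner label (e.g. an inode) to a dict
    of {page_index: physical_block}.  Cost grows with the total number of
    cached pages, which is what makes this approach slow.
    """
    for owner, pages in address_spaces.items():
        for page_index, block in pages.items():
            if block == physical_block:
                return owner, page_index
    return None
```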

The only reasonable way I can think of getting a block-coherent view
underneath a mounted fs is to have a reverse map, and update it each
time we map a block into the page cache or unmap it.  The reverse map
would tell us if a given physical block is currently in the page
cache, and if so, which address_space it belongs to.  A block not
currently mapped into any address_space could be mapped into an
'anonymous' space covering the entire partition and moved automatically
to the correct address_space when the fs tries to map it.
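
A minimal sketch of such a reverse map, written in Python as a model rather than as kernel code. The class, its method names, and the string labels standing in for address_spaces are all assumptions made for illustration. Because every mapping goes through one table, the same structure can also flag a block that two address_spaces try to claim at once.

```python
ANONYMOUS = "anonymous"  # stand-in for the partition-wide space


class ReverseMap:
    """Model of a page cache reverse map: physical block -> owner."""

    def __init__(self):
        self._owner = {}

    def map_block(self, block, space):
        """The fs maps a block into the page cache for `space`."""
        current = self._owner.get(block, ANONYMOUS)
        if current not in (ANONYMOUS, space):
            # Same block claimed by two different address_spaces.
            raise ValueError(f"block {block} already mapped by {current}")
        # Recording the owner also moves the block out of the anonymous space.
        self._owner[block] = space

    def unmap_block(self, block):
        """The block leaves the page cache."""
        self._owner.pop(block, None)

    def map_raw(self, block):
        """Access from underneath the fs: reuse the owner if cached,
        otherwise park the block in the anonymous space."""
        return self._owner.setdefault(block, ANONYMOUS)

    def lookup(self, block):
        """Return the owning space, or None if the block is not cached."""
        return self._owner.get(block)
```

Every map and unmap pays for a table update, which is the common-case overhead this design has to justify.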

The big problem with this mechanism is it slows down the common case, 
which works perfectly well without any reverse map.  Not to mention 
adding bloat.  So the next question I thought about was, is there a way 
to switch on a page cache reverse map just when needed and do that in a 
generic way.  I convinced myself it wouldn't be too hard,  but then 
there's another question: how badly do we need this?

Al's idea does let us get at some of the specific parts of the fs 
metadata but it has its problems too.  We'd need to exhaustively 
enumerate every kind of filesystem metadata that could reasonably be 
accessed underneath the filesystem, and special-case it, not so nice.

But I couldn't come up with any killer examples where we'd really need 
a generalized, coherent view underneath a mounted filesystem, so I put 
these thoughts on hold.   Your borked-fs example sounds interesting, 
have you got more of those?

One more example I can suggest is: right now we have no way of
detecting an error condition where the same fs block is mapped into
more than one address_space.  A page cache reverse map could detect
this easily and would be a really useful debugging tool.

--
Daniel



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-11 Thread Pavel Machek

Hi!

> > It's not exactly "kernel-based fsck". What I've been talking about
> > is secondary filesystem providing coherent access to primary fs
> > metadata.  I.e. mount -t ext2meta -o master=/usr none /mnt and
> > then access through /mnt/super, /mnt/block_bitmap, etc.
> > 
> > Call me stupid --- but what exactly does the above actually achieve?
> > Why would you do this?
> 
> Coherent access to metadata? Well, for one thing, it allows stuff like
> tunefs and friends on mounted fs. What's more useful, it allows to
> do things like access to boot code, which is _not_ safe to do through
> device access - usually you have superblock in vicinity and no warranties
> about the things that will be overwritten on umount. Same for debugging
> stuff, IO stats, etc. - access through secondary tree is much saner
> than inventing tons of ioctls for dealing with that. Moreover, it allows
> fsck and friends to get rid of code duplication - while the repair
> logics, etc. stays in userland (where it belongs) layout information
> is already handled in the kernel. No need to duplicate it in userland...

OTOH with current way if you make mistake in kernel, fsck will not
automatically inherit it; therefore fsck is likely to work even if
kernel ext2 is b0rken [and that's fairly important]

> Besides, with moving bitmaps, etc. into pagecache it becomes trivial
> to implement.
> 
> BTW, we have another ugly chunk of code - duplicated between kernel
> and userland and nasty in both incarnations. I mean handling of the
> partition tables. Kernel should be able to read and parse them -
> otherwise they are useless, right? Now, we have a bunch of userland

No. You might want to see (via fdisk) the partition table, even though
*your* kernel cannot read it.
Pavel
-- 
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-09 Thread Helge Hafting

[EMAIL PROTECTED] wrote:
> 
> I have tried this approach too a couple of years ago. I came to the idea
> that I want some kind of "event reporting" mechanism to know when
> application faults and when other events (like I/O) occurs. Booting is
> just the tip of the iceberg. MOST big apps are seeking on startup because
>a) their code is spread out all over executable
Page tuning can fix that.  (Have the compiler & linker increase locality
by stuffing related code in the same page.  You want fast paths
stuffed into as few pages as possible, regardless of which functions
the instructions belong to.)  This also cuts down on swapping and TLB
misses.
OS/2 gained some nice speedups by doing this.
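
A small illustration of why the layout matters, in Python. It assumes a 4096-byte page and works at whole-function granularity, which is coarser than the per-instruction placement described above. The function name and the example layouts are made up. It counts how many distinct pages the hot code touches for a given link order.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def pages_touched(layout, hot):
    """Count distinct pages occupied by hot functions.

    layout: list of (name, size_in_bytes) in link order.
    hot:    set of function names on the fast path.
    """
    touched = set()
    offset = 0
    for name, size in layout:
        if name in hot and size > 0:
            first = offset // PAGE_SIZE
            last = (offset + size - 1) // PAGE_SIZE
            touched.update(range(first, last + 1))
        offset += size
    return len(touched)


hot = {"a", "b"}
scattered = [("a", 2048), ("c1", 2048), ("c2", 2048), ("c3", 2048), ("b", 2048)]
tuned = [("a", 2048), ("b", 2048), ("c1", 2048), ("c2", 2048), ("c3", 2048)]
```

With these made-up sizes the scattered order spreads the two hot functions over two pages, while the tuned order fits them into one.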

>b) don't forget shared libraries..
They can be page tuned too, and they're often partially in
memory already when starting apps.

>c) the practice of keeping configuration files in ~/.filename
>   implies - read a little, seek a little.
>d) GUI apps tend to have a ton of icons.
Putting several in a single file, or even the executable will
help here.

Helge Hafting



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-08 Thread volodya


I tried this approach too, a couple of years ago. I came to the idea
that I want some kind of "event reporting" mechanism to know when an
application faults and when other events (like I/O) occur. Booting is
just the tip of the iceberg. MOST big apps are seeking on startup because
   a) their code is spread out all over the executable
   b) don't forget shared libraries..
   c) the practice of keeping configuration files in ~/.filename
      implies - read a little, seek a little.
   d) GUI apps tend to have a ton of icons.

I wonder - is it possible to get this via ptrace ? - could not find this
in the manpage.

Vladimir Dergachev

On Fri, 4 May 2001, Richard Gooch wrote:

> Linus Torvalds writes:
> > Now, if you want to speed up accesses, there are things you can
> > do. You can lay out the filesystem in the access order - trace the
> > IO accesses at bootup ("which file, which offset, which metadata
> > block?") and lay out the blocks of the files in exactly the right
> > order. Then you will get linear reads _without_ doing any "dd" at
> > all.
> 
> A year ago I came up with an alternative approach for cache warming,
> but I see that it wouldn't work with our current infrastructure.
> However, maybe there is still a way to use the basic technique. If so,
> please make suggestions.
> 
> The idea I had (motivated by the desire to eliminate random disc
> seeks, which is the limiting factor in how fast my boxes boot) was:
> 
> - init(8) issues an ioctl(2) on the root FS block device which turns
>   on recording of block reads (it records block numbers)
> 
> - at the end of the bootup process, init(8) issues another ioctl(2) to
>   grab the buffered block numbers, and turn off recording
> 
> - init(8) then sorts this list in ascending order and saves the result
>   in a file
> 
> - next boot, init(8) checks the file, and if it exists, opens the root
>   FS block device and reads in each block listed in the file. The
>   effect is to warm the buffer cache extremely quickly. The head will
>   move in one direction, grabbing data as it flies by. I expect this
>   will take around 1 second
> 
> - init(8) now continues the boot process (starting the magic ioctl(2)
>   again so as to get a fresh list of blocks, in case something has
>   changed)
> 
> - booting is now super fast, thanks to no disc activity.
> 
> The advantage of this scheme over blindly reading the first 50 MB is
> that it only reads in what you *need*, and thus will work better on
> low memory systems. It's also useful for other applications, not just
> speeding up the boot process.
> 
> However, doing an ioctl(2) on the block device won't help. So the
> question is, where to add the hook? One possibility is the FS, and
> record inum,bnum pairs. But of course we don't have a way of accessing
> via inum in user-space. So that's no good. Besides, we want to get
> block numbers on the block device, because that's the only meaningful
> number to resort.
> 
> So, what, then? Some kind of hook on the page cache? Ideas?
> 
>   Regards,
> 
>   Richard
> Permanent: [EMAIL PROTECTED]
> Current:   [EMAIL PROTECTED]
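
The save-and-replay half of the scheme quoted above can be sketched in user space as follows. This is Python for brevity and covers only the sorting and sequential read-back. The recording ioctl does not exist, so the block numbers are simply passed in. The function names, the list file format (one block number per line) and the 4096-byte block size are all assumptions.

```python
import os

BLOCK_SIZE = 4096  # assumed block size in bytes


def save_block_list(blocks, list_path):
    """Sort recorded block numbers ascending, drop repeats, save them."""
    ordered = sorted(set(blocks))
    with open(list_path, "w") as f:
        for block in ordered:
            f.write(f"{block}\n")
    return ordered


def warm_cache(device_path, list_path, block_size=BLOCK_SIZE):
    """Read every listed block in ascending order; return bytes read.

    Returns 0 if no list has been saved yet (e.g. on the first boot).
    """
    if not os.path.exists(list_path):
        return 0
    with open(list_path) as f:
        blocks = [int(line) for line in f if line.strip()]
    total = 0
    fd = os.open(device_path, os.O_RDONLY)
    try:
        for block in blocks:
            # The data is discarded; the read itself populates the cache.
            total += len(os.pread(fd, block_size, block * block_size))
    finally:
        os.close(fd)
    return total
```

Reading through the block device warms the cache that belongs to the device. Whether the filesystem later benefits depends on how that cache relates to the page cache, which is the open question the quoted message ends on.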



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-06 Thread Andreas Dilger

Alan writes:
> > Actually, the EVMS project does exactly this.  All I/O is done on a full
> > disk basis, and essentially does block remapping for each partition.  This
> > also solves the problem of cache inconsistency if accessing the parent
> > device vs. accessing the partition.
> 
> Interesting. Can EVMS handle the partition labels used by the LVM layer - ie
> could it replace it as well ?

Yes, they already support all current LVM volumes (including snapshots).
However, the user-space tools to set up new LVM volumes and manage existing
ones are not ready yet.  The last I talked with the IBM folks (a week ago),
they said they were starting to work on the user-space tools.

Because the whole partition/volume code is modular in EVMS, they will be able
to handle AIX LVM, HP/UX LVM, etc. volumes in addition to the normal DOS or
other partitions.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-06 Thread Alan Cox

> Actually, the EVMS project does exactly this.  All I/O is done on a full
> disk basis, and essentially does block remapping for each partition.  This
> also solves the problem of cache inconsistency if accessing the parent
> device vs. accessing the partition.

Interesting. Can EVMS handle the partition labels used by the LVM layer - ie
could it replace it as well ?



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-06 Thread Andreas Dilger

Alan writes:
> > an interesting task when your root lives on /dev/sda1. Ditto for destroying
> > a single partition (not mounted/used by swap/etc.) while you have some
> > other partition in use. IWBNI we had a decent API for handling partition
> > tables...
> 
> Partitions are just very crude logical volumes, and ultimately I believe
> should be handled exactly that way

Actually, the EVMS project does exactly this.  All I/O is done on a full
disk basis, and essentially does block remapping for each partition.  This
also solves the problem of cache inconsistency if accessing the parent
device vs. accessing the partition.
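
The remapping idea, reduced to its arithmetic, might look like the sketch below. It is Python and is not taken from the EVMS code. The `Partition` type and `remap` function are hypothetical names. The point is that a partition-relative block number becomes a whole-disk block number, so both views address the same cached block.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    start: int   # first block of the partition on the whole disk
    length: int  # number of blocks in the partition


def remap(partition, block):
    """Translate a partition-relative block to a whole-disk block."""
    if not 0 <= block < partition.length:
        raise ValueError(f"block {block} is outside the partition")
    return partition.start + block
```

With a single cache keyed by the whole-disk block number, block 0 of a partition and block `start` of the parent device refer to the same data and cannot disagree.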

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-06 Thread Alan Cox

> an interesting task when your root lives on /dev/sda1. Ditto for destroying
> a single partition (not mounted/used by swap/etc.) while you have some
> other partition in use. IWBNI we had a decent API for handling partition
> tables...

Partitions are just very crude logical volumes, and ultimately I believe
should be handled exactly that way
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-06 Thread Alan Cox

 an interesting task when your root lives on /dev/sda1. Ditto for destroying
 a single partition (not mounted/used by swap/etc.) while you have some
 other partition in use. IWBNI we had a decent API for handling partition
 tables...

Partitions are just very crude logical volumes, and ultimiately I believe
should be handled exactly that way
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-06 Thread Andreas Dilger

Alan writes:
> > an interesting task when your root lives on /dev/sda1. Ditto for destroying
> > a single partition (not mounted/used by swap/etc.) while you have some
> > other partition in use. IWBNI we had a decent API for handling partition
> > tables...
> 
> Partitions are just very crude logical volumes, and ultimately I believe
> should be handled exactly that way

Actually, the EVMS project does exactly this.  All I/O is done on a full
disk basis, and essentially does block remapping for each partition.  This
also solves the problem of cache inconsistency if accessing the parent
device vs. accessing the partition.
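
As a rough illustration of the block remapping described above (a sketch
only, not EVMS code; the struct and function names are made up for this
example): if all I/O goes to the whole disk, a partition reduces to a
start offset and a length, and a partition-relative sector is translated
before the request is issued.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one partition: where it starts on the
 * parent disk and how long it is, both counted in sectors. */
struct part_map {
    uint64_t start;   /* first sector of the partition on the disk */
    uint64_t nsect;   /* number of sectors in the partition */
};

/* Translate a partition-relative sector into a whole-disk sector.
 * Returns 0 on success, -1 if the sector lies outside the partition. */
int remap_sector(const struct part_map *p, uint64_t sector,
                 uint64_t *disk_sector)
{
    assert(p != NULL && disk_sector != NULL);
    if (sector >= p->nsect)
        return -1;
    *disk_sector = p->start + sector;
    return 0;
}
```

Since every request ends up addressed to the parent device, there is only
one cache to keep consistent, which is the point made above.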

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-06 Thread Alan Cox

> Actually, the EVMS project does exactly this.  All I/O is done on a full
> disk basis, and essentially does block remapping for each partition.  This
> also solves the problem of cache inconsistency if accessing the parent
> device vs. accessing the partition.

Interesting. Can EVMS handle the partition labels used by the LVM layer, i.e.
could it replace it as well?



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-06 Thread Andreas Dilger

Alan writes:
> > Actually, the EVMS project does exactly this.  All I/O is done on a full
> > disk basis, and essentially does block remapping for each partition.  This
> > also solves the problem of cache inconsistency if accessing the parent
> > device vs. accessing the partition.
> 
> Interesting. Can EVMS handle the partition labels used by the LVM layer - ie
> could it replace it as well ?

Yes, they already support all current LVM volumes (including snapshots).
However, the user-space tools to set up new LVM volumes and manage existing
ones are not ready yet.  The last I talked with the IBM folks (a week ago),
they said they were starting to work on the user-space tools.

Because the whole partition/volume code is modular in EVMS, they will be able
to handle AIX LVM, HP/UX LVM, etc. volumes in addition to the normal DOS or
other partitions.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-05 Thread Alexander Viro



On Sun, 6 May 2001, Chris Wedgwood wrote:

> It's not exactly "kernel-based fsck". What I've been talking about
> is secondary filesystem providing coherent access to primary fs
> metadata.  I.e. mount -t ext2meta -o master=/usr none /mnt and
> then access through /mnt/super, /mnt/block_bitmap, etc.
> 
> Call me stupid --- but what exactly does the above actually achieve?
> Why would you do this?

Coherent access to metadata? Well, for one thing, it allows stuff like
tunefs and friends on a mounted fs. More usefully, it allows things like
access to boot code, which is _not_ safe to do through device access -
usually you have the superblock in the vicinity and no guarantees about
what will be overwritten on umount. Same for debugging stuff, IO stats,
etc. - access through a secondary tree is much saner than inventing tons
of ioctls for dealing with that. Moreover, it allows fsck and friends to
get rid of code duplication - while the repair logic, etc. stays in
userland (where it belongs), layout information is already handled in
the kernel. No need to duplicate it in userland...

Besides, with moving bitmaps, etc. into pagecache it becomes trivial
to implement.

BTW, we have another ugly chunk of code - duplicated between kernel
and userland and nasty in both incarnations. I mean handling of the
partition tables. Kernel should be able to read and parse them -
otherwise they are useless, right? Now, we have a bunch of userland
utilities that do the same. Various fdisks, that is. If you look
how they work you'll see that on the read side they duplicate
kernel code and on the write side... To put it quite mildly, they are
not doing it in a graceful way. They write the relevant sectors to disk and
use BLKRRPART to tell the kernel that it should forget about all partitions
on that disk and reread the partition tables. _Not_ a nice thing to do,
since creation of new partition out of unused space on /dev/sda becomes
an interesting task when your root lives on /dev/sda1. Ditto for destroying
a single partition (not mounted/used by swap/etc.) while you have some
other partition in use. IWBNI we had a decent API for handling partition
tables...
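
For reference, the reread step described above boils down to a single
ioctl on the whole-disk device. A minimal sketch follows; the function
name is made up for this example, and BLKRRPART is taken from
<sys/mount.h> as glibc provides it. The call typically fails with EBUSY
while any partition on the disk is in use, which is exactly the
limitation complained about here.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mount.h>   /* BLKRRPART */
#include <unistd.h>

/* Ask the kernel to drop its view of the partitions on `disk` and
 * re-parse the on-disk partition table.
 * Returns 0 on success or a negative errno value on failure. */
int reread_partition_table(const char *disk)
{
    assert(disk != NULL);
    int fd = open(disk, O_RDONLY);
    if (fd < 0)
        return -errno;
    /* All-or-nothing: there is no way to add or drop one partition. */
    int ret = ioctl(fd, BLKRRPART) < 0 ? -errno : 0;
    close(fd);
    return ret;
}
```

A caller would pass the whole-disk node (e.g. /dev/sda, not /dev/sda1)
and needs sufficient privileges to open it.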




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-05 Thread Andrea Arcangeli

On Sun, May 06, 2001 at 03:00:58PM +1200, Chris Wedgwood wrote:
> On Sun, May 06, 2001 at 04:50:01AM +0200, Andrea Arcangeli wrote:
> 
> Moving e2fsck into the kernel is a completly different matter
> than caching the blockdevice accesses with pagecache instead of
> buffercache.
> 
> No, I was takling about user space fsck using character devices.

I misread your previous email, sorry. I think you meant to fsck using
rawio (not to move fsck into the kernel). You can do that right now, but
to get decent performance fsck would have to do its own caching, and
changing fsck to do that doesn't sound worthwhile either. Note also that
rawio has nothing to do with the pagecache. In fact both rawio and
O_DIRECT bypass the pagecache and its SMP locks entirely.

> I'm not claiming it is... what I'm asking is _why_ do we need block
> devices once 'everything' lives in the page cache?

Where the cache of the blockdevice lives is completely orthogonal to
"why cached blockdevices are useful", which I addressed in the previous
email.

> It's just that by doing it in pagecache you can mmap it as well
> and it will provide overall better performance and it's probably
> cleaner design. The only visible change is that you will be able
> to mmap a blockdevice as well.
> 
> Why? What needs to mmap a block device? Since these are typically
> larger than that you can mmap into a 32-bit address space (yes, I'm
> ignoring the 5% or so of cases where this isn't true) I'm not aware
> on many applications that do it.

Last time I talked with the parted maintainer he was asking for that
feature so that parted won't need to do its own anti-OOM management in
userspace: it can simply mmap(MAP_SHARED) a quite large region of
metadata of the blockdevice, read/write to the mmapped region, and the
kernel will take care of paging when it runs low on memory. Right
now it allocates the metadata in anonymous memory and loads it via
read(). This memory will need to be swapped out if the working set
doesn't fit in RAM (and swap may not be available ;).
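
A minimal sketch of that usage pattern, assuming the mapping works on the
given path (it does for regular files; for block devices it depends on
the pagecache change discussed in this thread). The function names are
invented for this example and are not taken from parted.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map `len` bytes starting at the page-aligned `offset` of `path` as a
 * shared, writable mapping. Dirty pages are file-backed, so the kernel
 * can write them back and evict them under memory pressure without
 * needing swap. Returns NULL on failure. */
void *map_metadata(const char *path, off_t offset, size_t len)
{
    assert(path != NULL && len > 0);
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
    close(fd);                      /* the mapping keeps its own reference */
    return p == MAP_FAILED ? NULL : p;
}

/* Flush modifications to the backing store and drop the mapping.
 * Returns 0 on success, -1 on failure. */
int unmap_metadata(void *p, size_t len)
{
    if (msync(p, len, MS_SYNC) < 0)
        return -1;
    return munmap(p, len);
}
```

Compared with malloc() plus read(), nothing here is anonymous memory, so
the working set does not depend on swap being available.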

> As I said, I'm not takling about kernel based fsck, although for
> _VERY_ large filesystems even with journalling I suspect it will be
> required one day (so it can run in the background and do consistency
> checking when the machine is idle).

Being able to fsck a live filesystem is yet another exotic feature and
yes for that you would certainly need some additional kernel support.

Andrea



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-05 Thread Andreas Dilger

Chris Wedgwood writes:
> As I said, I'm not takling about kernel based fsck, although for
> _VERY_ large filesystems even with journalling I suspect it will be
> required one day (so it can run in the background and do consistency
> checking when the machine is idle).

Actually, I was talking with Ted about this, and we agreed that:
a) kernel-based e2fsck is a pain in the a** (locking issues, etc)
b) you can do an LVM snapshot of your live filesystem and do a read-only
   fsck on that to check if the filesystem is still OK.  For journaled
   filesystems like reiserfs and ext3, they need to use the super method
   write_super_lockfs() to block I/O and flush everything to disk at the
   time of the snapshot, to ensure that they don't need recovery on a
   read-only device.  This makes the LVM snapshot equivalent to unmounting
   the filesystem, copying its contents to a new device, and remounting it.

While (b) doesn't let you fix a filesystem online, unless there is a kernel
bug or hardware problem, you should not have a problem.  If you have either
of those, then fixing the filesystem online is just asking for more problems
in the future.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-05 Thread Alexander Viro



On Sun, 6 May 2001, Chris Wedgwood wrote:

> On Sun, May 06, 2001 at 04:50:01AM +0200, Andrea Arcangeli wrote:
 
> About a kernel based fsck Alexander told me he likes it, I
> personally don't care about it that much because I believe...
> 
> As I said, I'm not takling about kernel based fsck, although for
> _VERY_ large filesystems even with journalling I suspect it will be
> required one day (so it can run in the background and do consistency
> checking when the machine is idle).

It's not exactly "kernel-based fsck". What I've been talking about is
a secondary filesystem providing coherent access to primary fs metadata.
I.e. mount -t ext2meta -o master=/usr none /mnt and then access through
/mnt/super, /mnt/block_bitmap, etc.




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-05 Thread Andrea Arcangeli

On Sun, May 06, 2001 at 02:14:37PM +1200, Chris Wedgwood wrote:
> You don't need block device for fsck, in fact some OS require you use
> character devices (e.g. Solaris).

Moving e2fsck into the kernel is a completely different matter than
caching the blockdevice accesses with pagecache instead of buffercache.

And even if you move e2fsck or reiserfsck into the kernel (you could
technically do that right now regardless of where the block_dev cache
lives) there will still be parted that wants to mmap the blockdevice to
get rid of part of the fat32 partition (right now it uses read/write of
course because buffer cache cannot be mapped in userspace), there will
still be mtools, non-self-caching dbms, od /dev/hda, dd of=/dev/hda
etc..etc..etc..  that make block_dev still *very* useful.

> I'm not saying we don't need block devices, but I really don't see
> much of a use for them once everything is in the page cache... I
> assume this is why others have got rid of them completely.

I have no idea why/if others got rid of them completely, but whether block_dev
is useful has nothing to do with whether it's in pagecache or in buffercache,
really. It's just that by doing it in pagecache you can mmap it as well
and it will provide overall better performance and it's probably a cleaner
design. The only visible change is that you will be able to mmap a
blockdevice as well.

About a kernel based fsck: Alexander told me he likes it. I personally
don't care about it that much because I believe there's not that much to
share at the source level. fsck and the real fs are quite different
problems, what can be shared can be copied, and by not sharing we get
the flexibility of not breaking fsck every time we change the kernel and,
more generally, the flexibility of doing it in userspace. Sharing such
bytecode at runtime definitely doesn't matter.  It also partly depends
on the fs, but the current ext2 situation is really fine to me and I
wouldn't consider it a worthwhile project to move e2fsck into the kernel.

Andrea



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-05 Thread Andrea Arcangeli

On Sat, May 05, 2001 at 03:18:08PM +1200, Chris Wedgwood wrote:
> On Fri, May 04, 2001 at 05:29:40PM +0200, Andrea Arcangeli wrote:
> 
> once block_dev is in pagecache there will obviously be no-way to
> share cache between the block device and the filesystem, because
> all the caches will be in completly different address spaces.
> 
> Once we are at this point... will there be any use in having block
> devices? FreeBSD appears to have done without them completely about a

Moving block_dev into pagecache won't change anything from the userspace
point of view, it's a transparent change (if we ignore the total loss of
cache coherency between block_dev and fs metadata that it implies, but
as Linus said such loss of coherency will happen anyway eventually
because metadata will go into its own address space too). Basically there
will still be a use for the block devices as long as there are fsck and
other userspace applications that want to use them.

Andrea SYNAPSE (very amusing movie ;)



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-05 Thread Richard Gooch

Rogier Wolff writes:
> Richard Gooch wrote:
> > 
> > - next boot, init(8) checks the file, and if it exists, opens the root
> >   FS block device and reads in each block listed in the file. The
> >   effect is to warm the buffer cache extremely quickly. The head will
> >   move in one direction, grabbing data as it flys by. I expect this
> >   will take around 1 second
> 
> FYI: 
> 
> Around 1992 or 1993, I rewrote Minix-fsck to do this instead of
> seeking all over the place.
> 
> Cut the total time to fsck my filesystem from around 30 to 28
> seconds. (remember the days of small filesystems?)
> 
> That's when I decided that this was NOT an interesting project: there
> was very little to be gained.
> 
> The explanation is: A seek over a few tracks isn't much slower than a
> seek over hundreds of tracks. Almost any "skip" in linear access
> incurs the average 6ms rotational latency anyway.

Hm. I think the access patterns between boot-up and fsck are quite
different. An fsck has to seek to a large number of tracks. During
bootup, I think the number of tracks accessed is much lower, and there
is probably more data locality as well. Still, only one way to be
sure.
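
To make the quoted proposal concrete, here is a rough sketch of the
warming pass: sort the recorded block numbers so the head sweeps in one
direction, then read each block and throw the data away. The function
name, the in-memory block list and the block size parameter are all
assumptions for this example; how the list gets recorded is a separate
problem.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* qsort comparator for ascending block numbers. */
static int cmp_block(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    return (x > y) - (x < y);
}

/* Read every listed block once, in ascending order, so that the data
 * ends up in the kernel's cache for `fd`. The list is sorted in place.
 * Returns the number of blocks read in full, or -1 if no buffer could
 * be allocated. */
long warm_blocks(int fd, unsigned long *blocks, size_t n, size_t blksz)
{
    assert(blocks != NULL && blksz > 0);
    char *buf = malloc(blksz);
    if (buf == NULL)
        return -1;
    qsort(blocks, n, sizeof blocks[0], cmp_block);
    long done = 0;
    for (size_t i = 0; i < n; i++) {
        off_t off = (off_t)blocks[i] * (off_t)blksz;
        if (pread(fd, buf, blksz, off) == (ssize_t)blksz)
            done++;     /* contents are discarded, only the caching matters */
    }
    free(buf);
    return done;
}
```

Whether this beats the normal boot-time access pattern is the open
question; timing it is the only way to find out.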

I haven't had time to look closely at this, but one thing that bothers
me is how to find out what is being accessed in the first place. A
C-library wrapper to intercept read(2) calls isn't any good, because a
lot of stuff is memory-mapped (in particular shared libraries). Anyone
have a clean way to do this?

Regards,

Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-05 Thread Alexander Viro



On Sat, 5 May 2001, Albert D. Cahalan wrote:

>   case P_SWAP:
> sprintf(tmp, "%4.4s ",
>   scale_k(((task->size - task->resident) << CL_pg_shift), 4, 1));
> break;

Albert, you can't be serious. The system had demand-loading for almost
ten years. ->size - ->resident can be huge with no swap at all. As in,
"box had never been subjected to swapon(8)".

That value is a mix of the amount of stuff we hadn't paged in, the
amount of stuff we had paged in but then dropped (e.g. code that
had never been touched for two weeks, since the application only uses
it on startup) and the amount of stuff that had been swapped out _and_
wasn't swapped in (it may very well stay in swap).
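
A small sketch of the arithmetic in question, assuming the seven-field
statm layout that top reads (size, resident, share, trs, lrs, drs, dt,
all in pages). The struct and function names are made up for this
example. The difference it computes is "pages not currently resident",
which is nonzero on a box with no swap at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* The seven statm fields, in pages. */
struct statm {
    unsigned long size, resident, share, trs, lrs, drs, dt;
};

/* Parse one statm line. Returns 0 on success, -1 on a short line. */
int parse_statm(const char *line, struct statm *s)
{
    assert(line != NULL && s != NULL);
    int n = sscanf(line, "%lu %lu %lu %lu %lu %lu %lu",
                   &s->size, &s->resident, &s->share,
                   &s->trs, &s->lrs, &s->drs, &s->dt);
    return n == 7 ? 0 : -1;
}

/* What top labels SWAP: size minus resident. This counts never-touched
 * pages and dropped clean pages as well as anything actually in swap. */
unsigned long not_resident_pages(const struct statm *s)
{
    return s->size > s->resident ? s->size - s->resident : 0;
}
```

For a line such as "1000 200 150 10 0 50 0" the result is 800 pages,
regardless of whether a single page was ever written to swap.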

BTW, "shared" is also bogus - page_count(page) can be raised
by any number of things.

> > * makes stuff like top(1) _walk_ _whole_ _page_ _tables_ _of_ _all_
> >   _processes_ each 5 seconds. No wonder it's slow like hell and eats
> >   tons of CPU time.
> 
> On my system, "statm" takes 50% longer than "stat" or "status".
> Maybe there is a significant difference with Oracle on a 32 GB box?

Depends on the application mix.




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-05 Thread Rogier Wolff

Richard Gooch wrote:
> 
> - next boot, init(8) checks the file, and if it exists, opens the root
>   FS block device and reads in each block listed in the file. The
>   effect is to warm the buffer cache extremely quickly. The head will
>   move in one direction, grabbing data as it flys by. I expect this
>   will take around 1 second

FYI: 

Around 1992 or 1993, I rewrote Minix-fsck to do this instead of
seeking all over the place.

Cut the total time to fsck my filesystem from around 30 to 28
seconds. (remember the days of small filesystems?)

That's when I decided that this was NOT an interesting project: there
was very little to be gained.

The explanation is: A seek over a few tracks isn't much slower than a
seek over hundreds of tracks. Almost any "skip" in linear access
incurs the average 6ms rotational latency anyway.
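
A back-of-the-envelope check on that figure, as a sketch: the average
rotational latency is the time for half a revolution. The 5000 rpm
spindle speed used below is inferred from the 6ms number, not something
stated in this mail.

```c
#include <assert.h>

/* Average rotational latency in milliseconds: on average the wanted
 * sector is half a revolution away, and one revolution takes
 * 60000 / rpm milliseconds. */
double avg_rotational_latency_ms(double rpm)
{
    assert(rpm > 0.0);
    return 0.5 * 60000.0 / rpm;
}
```

At 5000 rpm this gives 6 ms, so any break in a linear read costs about
that much no matter how short the seek was.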

Roger. 
-- 
** [EMAIL PROTECTED] ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* There are old pilots, and there are bold pilots. 
* There are also old, bald pilots. 



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-05 Thread Albert D. Cahalan

Alexander Viro writes:
>> On Fri, 4 May 2001, Alexander Viro wrote:

>>> Ehh... There _is_ a way to deal with that, but it's deeply Albertesque:
>  ^^^

Ah, you learn from the master.

> ObProcfs: I don't think that walking the page tables is a good way to
> compute RSS, especially since VM maintains the thing. Mind if I rip

Handling of mapped device memory should not change. For example
there is the X server with mapped video memory. There is another
RSS value provided elsewhere in case one does not want to include
mapped device memory.

Currently top uses the statm file in the following manner:

  case P_SIZE:
    sprintf(tmp, "%5.5s ", scale_k((task->size << CL_pg_shift), 5, 1));
    break;
  case P_TRS:
    sprintf(tmp, "%4.4s ", scale_k((task->trs << CL_pg_shift), 4, 1));
    break;
  case P_SWAP:
    sprintf(tmp, "%4.4s ",
            scale_k(((task->size - task->resident) << CL_pg_shift), 4, 1));
    break;
  case P_SHARE:
    sprintf(tmp, "%5.5s ", scale_k((task->share << CL_pg_shift), 5, 1));
    break;
  case P_DT:
    sprintf(tmp, "%3.3s ", scale_k(task->dt, 3, 0));
    break;
  case P_RSS:   /* rss, not resident (which includes IO memory) */
    sprintf(tmp, "%4.4s ",
            scale_k((task->rss << CL_pg_shift), 4, 1));


> it out? In effect, implementation of /proc/<pid>/statm
>   * produces extremely bogus values (VMA is from library if it goes
> beyond 0x6000? Might be even true 7 years ago...) and nobody
> had cared about them for 6-7 years

One could count pages that are mapped executable and do not come
from the main executable... but this is pretty worthless and does
not consider non-executable library sections.

The latest "top" does not bother to display this value.

>   * makes stuff like top(1) _walk_ _whole_ _page_ _tables_ _of_ _all_
> _processes_ each 5 seconds. No wonder it's slow like hell and eats
> tons of CPU time.

On my system, "statm" takes 50% longer than "stat" or "status".
Maybe there is a significant difference with Oracle on a 32 GB box?

I'd rather top didn't have to read the file at all.



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-05 Thread Andrea Arcangeli

On Sun, May 06, 2001 at 02:14:37PM +1200, Chris Wedgwood wrote:
> You don't need block device for fsck, in fact some OS require you use
> character devices (e.g. Solaris).

Moving e2fsck into the kernel is a completly different matter than
caching the blockdevice accesses with pagecache instead of buffercache.

And even if you move e2fsck or reiserfsck into the kernel (you could
technically do that just now regardless of where the block_dev cache
lives) there will still be partd that wants to mmap the blockdevice to
get rid of part of the fat32 partition (right now it uses read/write of
course because buffer cache cannot be mapped in userspace), there will
still be mtools, not self caching dbms, od /dev/hda, dd of=/dev/hda
etc..etc..etc..  that makes block_dev still *very* useful.

> I'm not saying we don't need block devices, but I really don't see
> much of a use for them once everything is in the page cache... I
> assume this is why others have got rid of them completely.

I have no idea why/if other got rid of it completly, but the fact block_dev
is useful has nothing to do if it's in pagecache or in buffercache,
really. It's just that by doing it in pagecache you can mmap it as well
and it will provide overall better performance and it's probably cleaner
design. The only visible change is that you will be able to mmap a
blockdevice as well.

About a kernel based fsck Alexander told me he likes it, I personally
don't care about it that much because I believe there's not that much to
share at the source level, fsck and real fs are quite different
problems, and what can be shared can be copied and by not sharing we get
the flexibility of not breaking fsck every time we change the kernel and
more in general the flexibility of doing it in userspace, sharing such
bytecode at runtime definitely doesn't matter.  It also partly depends
from the fs but current ext2 situation is really fine to me and I
wouldn't consier a wortwhile project to move e2fsck into the kernel. 

Andrea



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-05 Thread Andreas Dilger

Chris Wedgewood writes:
 As I said, I'm not takling about kernel based fsck, although for
 _VERY_ large filesystems even with journalling I suspect it will be
 required one day (so it can run in the background and do consistency
 checking when the machine is idle).

Actually, I was talking with Ted about this, and we agreed that:
a) kernel-based e2fsck is a pain in the a** (locking issues, etc)
b) you can do an LVM snapshot of your live filesystem and do a read-only
   fsck on that to check if the filesystem is still OK.  For journaled
   filesystems like reiserfs and ext3, they need to use the super method
   write_super_lockfs() to block I/O and flush everything to disk at the
   time of the snapshot, to ensure that they don't need recovery on a
   read-only device.  This makes the LVM snapshot equivalent to unmounting
   the filesystem, copying its contents to a new device, and remounting it.

While (b) doesn't let you fix a filesystem online, unless there is a kernel
bug or hardware problem, you should not have a problem.  If you have either
of those, then fixing the filesystem online is just asking for more problems
in the future.

Cheers, Andreas
-- 
Andreas Dilger  \ If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-05 Thread Andrea Arcangeli

On Sun, May 06, 2001 at 03:00:58PM +1200, Chris Wedgwood wrote:
> On Sun, May 06, 2001 at 04:50:01AM +0200, Andrea Arcangeli wrote:
>
> > Moving e2fsck into the kernel is a completely different matter
> > than caching the blockdevice accesses with pagecache instead of
> > buffercache.
>
> No, I was talking about user space fsck using character devices.

Sorry, I misread your previous email; I think you meant to fsck using
rawio (not to move fsck into the kernel). You can do that right now, but
to get decent performance fsck would have to do its own caching, and
changing fsck to do that doesn't sound worthwhile either. Note also that
rawio has nothing to do with the pagecache.  In fact both rawio and
O_DIRECT bypass the pagecache entirely, including its SMP locks.

> I'm not claiming it is... what I'm asking is _why_ do we need block
> devices once 'everything' lives in the page cache?

Where the cache of the blockdevice lives is completely orthogonal to
the question of why cached blockdevices are useful, which I addressed in
the previous email.

> > It's just that by doing it in pagecache you can mmap it as well
> > and it will provide overall better performance and it's probably
> > cleaner design. The only visible change is that you will be able
> > to mmap a blockdevice as well.
>
> Why? What needs to mmap a block device? Since these are typically
> larger than what you can mmap into a 32-bit address space (yes, I'm
> ignoring the 5% or so of cases where this isn't true) I'm not aware
> of many applications that do it.

Last time I talked with the parted maintainer he was asking for that
feature so that parted won't need to do its own anti-oom management in
userspace, so he can simply mmap(MAP_SHARED) a quite large region of
metadata of the blockdevice, read/write to the mmapped region and the
kernel will take care of doing paging when it runs low on memory. Right
now it allocates the metadata in anonymous memory and loads it via
read(). This memory will need to be swapped out if the working set
doesn't fit in ram (and swap may not be available ;).
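
A minimal Python sketch of the mmap(MAP_SHARED) pattern described above. It assumes a regular temporary file standing in for the block device; the helper name patch_metadata, the 1 MiB default window and the 8 KiB demo file are made up for illustration, and this is not parted's code.

```python
import mmap
import os
import tempfile


def patch_metadata(path, offset, payload, length=1 << 20):
    """Map up to `length` bytes of `path` shared and overwrite bytes at `offset`."""
    fd = os.open(path, os.O_RDWR)
    try:
        # A regular file reports its size via fstat; a real block device
        # would need its size from elsewhere (assumption: we use a file).
        length = min(length, os.fstat(fd).st_size)
        with mmap.mmap(fd, length, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE) as m:
            # Plain slice assignment; the kernel pages data in and out as needed.
            m[offset:offset + len(payload)] = payload
            m.flush()  # ask for the dirty pages to be written back
    finally:
        os.close(fd)


# Demo on a scratch file instead of a device node.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * 8192)
    demo_path = f.name
patch_metadata(demo_path, 4096, b"LABEL")
with open(demo_path, "rb") as f:
    f.seek(4096)
    patched = f.read(5)
os.unlink(demo_path)
```

The mapped region is backed by the file itself, so under memory pressure the kernel can write pages back and drop them without needing swap, which is the property the parted case is after.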

> As I said, I'm not talking about kernel based fsck, although for
> _VERY_ large filesystems even with journalling I suspect it will be
> required one day (so it can run in the background and do consistency
> checking when the machine is idle).

Being able to fsck a live filesystem is yet another exotic feature and
yes for that you would certainly need some additional kernel support.

Andrea



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-05 Thread Alexander Viro



On Sun, 6 May 2001, Chris Wedgwood wrote:

> > It's not exactly kernel-based fsck. What I've been talking about
> > is secondary filesystem providing coherent access to primary fs
> > metadata.  I.e. mount -t ext2meta -o master=/usr none /mnt and
> > then access through /mnt/super, /mnt/block_bitmap, etc.
>
> Call me stupid --- but what exactly does the above actually achieve?
> Why would you do this?

Coherent access to metadata? Well, for one thing, it allows stuff like
tunefs and friends on a mounted fs. What's more useful, it allows
things like access to boot code, which is _not_ safe to do through
device access - usually you have the superblock in the vicinity and no
guarantees about what will be overwritten on umount. Same for debugging
stuff, IO stats, etc. - access through a secondary tree is much saner
than inventing tons of ioctls for dealing with that. Moreover, it allows
fsck and friends to get rid of code duplication - while the repair
logic, etc. stays in userland (where it belongs), layout information
is already handled in the kernel. No need to duplicate it in userland...
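
To make the duplication concrete, here is a rough Python sketch of the kind of layout knowledge a userland tool has to carry today: decoding a few fields of a raw ext2 superblock. The field offsets (inode count at 0, block count at 4, log block size at 24, magic 0xEF53 at 56) are the ext2 on-disk layout as I understand it; the function name and the synthetic test buffer are invented for the example, and nothing here reads a real device or the proposed /mnt/super.

```python
import struct

EXT2_SUPER_MAGIC = 0xEF53


def parse_ext2_super(sb):
    """Decode a few little-endian fields from a raw 1024-byte ext2 superblock."""
    inodes, blocks = struct.unpack_from("<II", sb, 0)
    (log_block_size,) = struct.unpack_from("<I", sb, 24)
    (magic,) = struct.unpack_from("<H", sb, 56)
    if magic != EXT2_SUPER_MAGIC:
        raise ValueError("not an ext2 superblock")
    return {
        "inodes": inodes,
        "blocks": blocks,
        "block_size": 1024 << log_block_size,
    }


# Synthetic superblock built in memory, purely for demonstration.
raw = bytearray(1024)
struct.pack_into("<II", raw, 0, 2048, 8192)
struct.pack_into("<I", raw, 24, 2)
struct.pack_into("<H", raw, 56, EXT2_SUPER_MAGIC)
info = parse_ext2_super(bytes(raw))
```

Every tool that parses these bytes from the device node repeats what the kernel already knows, and on a mounted fs it may also read stale data.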

Besides, with moving bitmaps, etc. into pagecache it becomes trivial
to implement.

BTW, we have another ugly chunk of code - duplicated between kernel
and userland and nasty in both incarnations. I mean handling of the
partition tables. Kernel should be able to read and parse them -
otherwise they are useless, right? Now, we have a bunch of userland
utilities that do the same. Various fdisks, that is. If you look
how they work you'll see that on the read side they duplicate
kernel code and on the write side... To put it quite mildly, they are
not doing it in a graceful way. They write relevant sectors to disk and
use BLKRRPART to tell the kernel that it should forget about all partitions
on that disk and reread the partition tables. _Not_ a nice thing to do,
since creation of new partition out of unused space on /dev/sda becomes
an interesting task when your root lives on /dev/sda1. Ditto for destroying
a single partition (not mounted/used by swap/etc.) while you have some
other partition in use. IWBNI we had a decent API for handling partition
tables...
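
For reference, the read side that the various fdisks duplicate looks roughly like the Python sketch below. It assumes a classic MBR: four 16-byte entries starting at byte 446 and the 0x55AA signature at byte 510. The function name and the synthetic sector are made up for the example; it does not touch a real disk or issue BLKRRPART.

```python
import struct


def parse_mbr(sector):
    """Return the non-empty primary partition entries of a 512-byte MBR sector."""
    if len(sector) < 512 or sector[510:512] != b"\x55\xaa":
        raise ValueError("missing 0x55AA boot signature")
    parts = []
    for i in range(4):
        entry = sector[446 + 16 * i:446 + 16 * (i + 1)]
        # status byte, 3 CHS bytes (skipped), type byte, 3 CHS bytes (skipped),
        # then 32-bit little-endian start LBA and sector count.
        status, ptype, start_lba, count = struct.unpack("<B3xB3xII", entry)
        if ptype != 0:
            parts.append({
                "index": i + 1,
                "bootable": status == 0x80,
                "type": ptype,
                "start": start_lba,
                "sectors": count,
            })
    return parts


# Synthetic sector with a single bootable Linux (0x83) partition.
sector = bytearray(512)
struct.pack_into("<B3xB3xII", sector, 446, 0x80, 0x83, 63, 1000)
sector[510:512] = b"\x55\xaa"
table = parse_mbr(bytes(sector))
```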




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Linus Torvalds



On Fri, 4 May 2001, Alan Cox wrote:
>
> iso9660 alas doesn't allow you to do that. You can speed it up by reading
> the entire file into memory rather than paging it in (or reading it in and
> then executing it). iso9660 layout is pretty constrained and designed for
> linear file reads

Note that you can do this for any filesystem, including ext2. If,
instead of trying to remember what _blocks_ the bootup process reads, you
keep the trace at a higher level, and then sort the _high_level_ trace and
re-do that with some program, then you can obviously populate the virtual
caches properly with any filesystem.

The advantage of that approach is that it will continue to work forever,
because there will never be any cache aliasing issues. You're always
"pre-caching" using the same operation that you'll actually use when you
do the real reads.

Now, that still leaves the question of how to sort the virtual cache
accesses, and you might want to know what the low-level layout of the
filesystem is to actually create the "sort". You might not want to sort
alphabetically on the file-name, but use a "where on the disk is this
file", and use _that_ as the sort order function.

That's easy to do, actually. Just use the "bmap()" ioctl.

Now, you won't be able to use "dd" to populate the caches: you'd have to
have your own program that walks your sorted action list and populates the
caches that way (and you might want to take kernel read-ahead etc
heuristics into account).
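
A rough Python sketch of such a program, under some assumptions: the "bmap()" ioctl is taken to be FIBMAP (request number 1 in <linux/fs.h>, normally root-only), only the first block of each file is used as the sort key, and read-ahead heuristics are ignored. The function names are invented, and the demo swaps in a fake block lookup so it runs without privileges.

```python
import array
import fcntl
import os
import tempfile

FIBMAP = 1  # _IO(0x00, 1) from <linux/fs.h>; usually needs CAP_SYS_RAWIO


def first_physical_block(path):
    """Ask the filesystem where logical block 0 of `path` lives on disk."""
    buf = array.array("i", [0])  # in: logical block, out: physical block
    with open(path, "rb") as f:
        fcntl.ioctl(f.fileno(), FIBMAP, buf, True)
    return buf[0]


def preload(paths, locate=first_physical_block, chunk=1 << 16):
    """Read whole files in on-disk order, using normal file reads so the
    data lands in the same cache the later real reads will use."""
    ordered = sorted(paths, key=locate)
    for path in ordered:
        with open(path, "rb") as f:
            while f.read(chunk):
                pass
    return ordered


# Demo with a fake lookup table instead of FIBMAP (hypothetical block numbers).
with tempfile.TemporaryDirectory() as tmp:
    fake_blocks = {}
    for name, blk in (("a", 300), ("b", 100), ("c", 200)):
        path = os.path.join(tmp, name)
        with open(path, "wb") as f:
            f.write(b"x" * 1024)
        fake_blocks[path] = blk
    order = [os.path.basename(p)
             for p in preload(fake_blocks, locate=fake_blocks.get)]
```

Sorting on the first block only is a simplification; fragmented files would want a per-extent action list rather than a per-file one.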

SMOP.

Linus




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Alan Cox

> Now, if you want to speed up accesses, there are things you can do. You
> can lay out the filesystem in the access order - trace the IO accesses at
> bootup ("which file, which offset, which metadata block?") and lay out the
> blocks of the files in exactly the right order. Then you will get linear
> reads _without_ doing any "dd" at all.

iso9660 alas doesn't allow you to do that. You can speed it up by reading
the entire file into memory rather than paging it in (or reading it in and
then executing it). iso9660 layout is pretty constrained and designed for
linear file reads




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Richard Gooch

Alexander Viro writes:
> 
> 
> On Fri, 4 May 2001, Richard Gooch wrote:
> 
> > I don't bother splitting /usr off /. I gave up doing that when disc
> > became cheap. There's no point anymore. And since I have a lightweight
> 
> Yes, there is. Locality. Resistance to fs fuckups. Resistance to
> disk fuckups. Easier to restore from tape. Different tunefs optimum
> (higher inodes/blocks ratio, for one thing). Ability to keep /usr
> read-only.  Enough?

The correct solution to avoiding fs fuckups is to keep /tmp, /var and
/home separate. Basically, anything that gets written to for reasons
other than sysadmin/upgrades.

However, my point is not that it's always a bad idea to split /usr,
simply that the converse is not true. IOW, it is not true to say that
/usr *should* be split off. For a generic workstation, splitting /usr
is not useful. Importantly, it is most certainly entirely valid to
keep /usr on /.

> > distribution (500 MiB and I get X, LaTeX, emacs, compilers, netscrap
> > and a pile of other things), it makes even less sense to split /usr
> > off. Sorry, I don't have those fancy desktops. Don't need 'em. I spend
> > most of my day in emacs and xterm.
> 
> What desktops? None of that crap on my boxen either. EMACS? What EMACS?
> LaTeX is unfortunately needed (I prefer troff and AMSTeX on the TeX side).
> Netrape? No chance in hell. bash  is there, but I prefer to use
> rc.
> 
> I don't see what it has to do with keeping root on a separate
> filesystem, though - the reasons have nothing to do with bloat in /usr/bin.

In any case, my point is that splitting /usr wouldn't help, because
I'd want to preload stuff from there as well. Splitting /usr doesn't
address the problem.

Regards,

Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Alexander Viro



On Fri, 4 May 2001, Richard Gooch wrote:

> I don't bother splitting /usr off /. I gave up doing that when disc
> became cheap. There's no point anymore. And since I have a lightweight

Yes, there is. Locality. Resistance to fs fuckups. Resistance to disk
fuckups. Easier to restore from tape. Different tunefs optimum (higher
inodes/blocks ratio, for one thing). Ability to keep /usr read-only.
Enough?

> distribution (500 MiB and I get X, LaTeX, emacs, compilers, netscrap
> and a pile of other things), it makes even less sense to split /usr
> off. Sorry, I don't have those fancy desktops. Don't need 'em. I spend
> most of my day in emacs and xterm.

What desktops? None of that crap on my boxen either. EMACS? What EMACS?
LaTeX is unfortunately needed (I prefer troff and AMSTeX on the TeX side).
Netrape? No chance in hell. bash  is there, but I prefer to use
rc.

I don't see what it has to do with keeping root on a separate filesystem,
though - the reasons have nothing to do with bloat in /usr/bin.




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Richard Gooch

Alexander Viro writes:
> 
> 
> On Fri, 4 May 2001, Richard Gooch wrote:
> 
> > > Two of them: use less bloated shell (and link it statically) and
> > > clean your rc scripts.
> > 
> > No, because I'm not using the latest bloated version of bash, and I'm
> 
> Umm... Last version of bash I could call not bloated was _long_ time
> ago. Something like ash(1) might be a better idea for /bin/sh.

The shell is irrelevant. I can easily preload that too, if I wanted
to, since it's just one thing. But it's not practical to preload all
files used by name, because it's just too hard to find out all that is
needed. Too much people time required, and it is specific to one
distribution (and a particular revision at that).

> > The problem is all the various daemons and system utilities (mount,
> > hwclock, ifconfig and so on) that turn a kernel into a useful system.
> > And then of course there's X...
> 
> How do you partition the thing? I.e. what's the size of your root
> partition?  I'm usually doing something from 10Mb to 30Mb - that may
> be the reason of differences.

I don't bother splitting /usr off /. I gave up doing that when disc
became cheap. There's no point anymore. And since I have a lightweight
distribution (500 MiB and I get X, LaTeX, emacs, compilers, netscrap
and a pile of other things), it makes even less sense to split /usr
off. Sorry, I don't have those fancy desktops. Don't need 'em. I spend
most of my day in emacs and xterm.

And even if I did split /usr off, that would just mean I'd want to
record block accesses for that device as well. This is because part of
my boot process requires stuff in /usr. And after that, firing up xdm.

Regards,

Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Alexander Viro



On Fri, 4 May 2001, Richard Gooch wrote:

> > Two of them: use less bloated shell (and link it statically) and
> > clean your rc scripts.
> 
> No, because I'm not using the latest bloated version of bash, and I'm

Umm... Last version of bash I could call not bloated was _long_ time
ago. Something like ash(1) might be a better idea for /bin/sh.

> The problem is all the various daemons and system utilities (mount,
> hwclock, ifconfig and so on) that turn a kernel into a useful system.
> And then of course there's X...

How do you partition the thing? I.e. what's the size of your root partition?
I'm usually doing something from 10Mb to 30Mb - that may be the reason of
differences.




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Linus Torvalds



On Fri, 4 May 2001, Alexander Viro wrote:
>
> ObProcfs: I don't think that walking the page tables is a good way to
> compute RSS, especially since VM maintains the thing.

Well, the VM didn't always use to maintain the stuff it does now, so I bet
that most of the code is just old code that still works.

Feel free to rip it out.

Linus




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Jens Axboe

On Fri, May 04 2001, Richard Gooch wrote:
> The idea I had (motivated by the desire to eliminate random disc
> seeks, which is the limiting factor in how fast my boxes boot) was:
> 
> - init(8) issues an ioctl(2) on the root FS block device which turns
>   on recording of block reads (it records block numbers)
> 
> - at the end of the bootup process, init(8) issues another ioctl(2) to
>   grab the buffered block numbers, and turn off recording
> 
> - init(8) then sorts this list in ascending order and saves the result
>   in a file
> 
> - next boot, init(8) checks the file, and if it exists, opens the root
>   FS block device and reads in each block listed in the file. The
>   effect is to warm the buffer cache extremely quickly. The head will
>   move in one direction, grabbing data as it flies by. I expect this
>   will take around 1 second
> 
> - init(8) now continues the boot process (starting the magic ioctl(2)
>   again so as to get a fresh list of blocks, in case something has
>   changed)
> 
> - booting is now super fast, thanks to no disc activity.

I did 95% of what you need sometime last year, to do I/O scheduler
profiling (blocks requested, merge stats, request sent to disk). It was
a pretty gross hack, requiring a pretty big ring buffer of kernel memory
to be able to log at a sufficiently fast rate (you'd be amazed how much
output a single dbench 48 run produces :-). A user space app would read
data from a simple char device, save for later inspection.

A better approach would be to map the ring buffer from the user app, but
it was just a quick fix.
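
The logging scheme can be modelled in a few lines of Python. This is only a toy under stated assumptions: a fixed-capacity ring that overwrites the oldest entry when full, and a reader that drains it in batches. The class name, the (block, nsectors) record shape and the dropped counter are invented for the sketch and say nothing about how the actual kernel hack was written.

```python
from collections import deque


class BlockTrace:
    """Fixed-size ring of (block, nsectors) records; oldest entries are lost."""

    def __init__(self, capacity):
        self.ring = deque(maxlen=capacity)
        self.dropped = 0

    def log(self, block, nsectors):
        # Producer side: never blocks, so a slow reader loses old records.
        if len(self.ring) == self.ring.maxlen:
            self.dropped += 1
        self.ring.append((block, nsectors))

    def drain(self):
        # Consumer side: what a read() on the char device would hand back.
        out = list(self.ring)
        self.ring.clear()
        return out


trace = BlockTrace(capacity=3)
for blk in (10, 11, 50, 51, 52):
    trace.log(blk, 8)
records = trace.drain()
```

The dropped counter shows why the buffer had to be big: if the producer outruns the reader, the only choices are losing records or stalling I/O.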

-- 
Jens Axboe




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Richard Gooch

Alexander Viro writes:
> 
> 
> On Fri, 4 May 2001, Richard Gooch wrote:
> 
> > However, doing an ioctl(2) on the block device won't help. So the
> > question is, where to add the hook? One possibility is the FS, and
> > record inum,bnum pairs. But of course we don't have a way of accessing
> > via inum in user-space. So that's no good. Besides, we want to get
> > block numbers on the block device, because that's the only meaningful
> > number to resort.
> > 
> > So, what, then? Some kind of hook on the page cache? Ideas?
> 
> Two of them: use less bloated shell (and link it statically) and
> clean your rc scripts.

No, because I'm not using the latest bloated version of bash, and I'm
not using the slow and bloated RedHat boot scripts. My boot scripts
are lean and mean. Oh. And I already have init(8) warming the cache
with these scripts.

The problem is all the various daemons and system utilities (mount,
hwclock, ifconfig and so on) that turn a kernel into a useful system.
And then of course there's X...

Sorry. A "don't do that then" answer isn't appropriate for this
problem space.

Regards,

Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Alexander Viro



On Fri, 4 May 2001, Richard Gooch wrote:

> However, doing an ioctl(2) on the block device won't help. So the
> question is, where to add the hook? One possibility is the FS, and
> record inum,bnum pairs. But of course we don't have a way of accessing
> via inum in user-space. So that's no good. Besides, we want to get
> block numbers on the block device, because that's the only meaningful
> number to resort.
> 
> So, what, then? Some kind of hook on the page cache? Ideas?

Two of them: use less bloated shell (and link it statically) and clean
your rc scripts.




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Alexander Viro



On Fri, 4 May 2001, Linus Torvalds wrote:

> 
> On Fri, 4 May 2001, Alexander Viro wrote:
> > 
> > Ehh... There _is_ a way to deal with that, but it's deeply Albertesque:
   ^^^
> > * add pagecache access for block device
> > * put your "real" root on /dev/loop0 (setup from initrd)
> > * dd
> 
> You're one sick puppy.

[snip]
/me bows

Nice to see that imitation was good enough ;-) Seriously, I half-expected
Albert to show up at that point of thread and tried to anticipate what
he'd produce.

ObProcfs: I don't think that walking the page tables is a good way to
compute RSS, especially since VM maintains the thing. Mind if I rip
it out? In effect, implementation of /proc/<pid>/statm
* produces extremely bogus values (VMA is from library if it goes
  beyond 0x6000? Might be even true 7 years ago...) and nobody
  had cared about them for 6-7 years
* makes stuff like top(1) _walk_ _whole_ _page_ _tables_ _of_ _all_
  _processes_ each 5 seconds. No wonder it's slow like hell and eats
  tons of CPU time.
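
For context, this is roughly what a consumer such as top(1) does with the file, sketched in Python. It assumes the seven-field layout documented in proc(5) (size, resident, shared, text, lib, data, dt, all in pages); the function name and the sample line are made up. The parsing is trivial, so the cost described above is on the kernel side, in producing the numbers.

```python
import os

STATM_FIELDS = ("size", "resident", "shared", "text", "lib", "data", "dt")


def parse_statm(line, page_size=None):
    """Convert one /proc/<pid>/statm line from pages to bytes, keyed by field."""
    if page_size is None:
        page_size = os.sysconf("SC_PAGE_SIZE")
    values = [int(v) for v in line.split()]
    if len(values) != len(STATM_FIELDS):
        raise ValueError("unexpected statm format")
    return {name: pages * page_size
            for name, pages in zip(STATM_FIELDS, values)}


# Hypothetical sample line; on Linux, open("/proc/self/statm").read()
# would supply a real one.
sample = parse_statm("1000 250 100 20 0 300 0", page_size=4096)
```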




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Richard Gooch

Linus Torvalds writes:
> Now, if you want to speed up accesses, there are things you can
> do. You can lay out the filesystem in the access order - trace the
> IO accesses at bootup ("which file, which offset, which metadata
> block?") and lay out the blocks of the files in exactly the right
> order. Then you will get linear reads _without_ doing any "dd" at
> all.

A year ago I came up with an alternative approach for cache warming,
but I see that it wouldn't work with our current infrastructure.
However, maybe there is still a way to use the basic technique. If so,
please make suggestions.

The idea I had (motivated by the desire to eliminate random disc
seeks, which is the limiting factor in how fast my boxes boot) was:

- init(8) issues an ioctl(2) on the root FS block device which turns
  on recording of block reads (it records block numbers)

- at the end of the bootup process, init(8) issues another ioctl(2) to
  grab the buffered block numbers, and turn off recording

- init(8) then sorts this list in ascending order and saves the result
  in a file

- next boot, init(8) checks the file, and if it exists, opens the root
  FS block device and reads in each block listed in the file. The
  effect is to warm the buffer cache extremely quickly. The head will
  move in one direction, grabbing data as it flies by. I expect this
  will take around 1 second

- init(8) now continues the boot process (starting the magic ioctl(2)
  again so as to get a fresh list of blocks, in case something has
  changed)

- booting is now super fast, thanks to no disc activity.

The advantage of this scheme over blindly reading the first 50 MB is
that it only reads in what you *need*, and thus will work better on
low memory systems. It's also useful for other applications, not just
speeding up the boot process.
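
The sort-and-replay half of the scheme can be sketched in Python as below. Assumptions: the recorded block list already exists (the recording ioctl is hypothetical and not shown), blocks are 4096 bytes, and a temporary file stands in for the block device. The function names are invented. Adjacent blocks are merged into runs so each run is one read.

```python
import os
import tempfile


def coalesce(blocks):
    """Sort and dedupe block numbers, merging neighbours into (start, count) runs."""
    runs = []
    for b in sorted(set(blocks)):
        if runs and runs[-1][0] + runs[-1][1] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return [tuple(r) for r in runs]


def warm(path, blocks, block_size=4096):
    """Read every run once in ascending order; return the number of bytes read."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        for start, count in coalesce(blocks):
            total += len(os.pread(fd, count * block_size, start * block_size))
    finally:
        os.close(fd)
    return total


# Demo: a 16-block scratch file and a made-up trace of block reads.
recorded = [9, 2, 3, 2, 10, 4]
with tempfile.NamedTemporaryFile() as f:
    f.write(b"\0" * (16 * 4096))
    f.flush()
    runs = coalesce(recorded)
    warmed = warm(f.name, recorded)
```

Reading through the device node like this fills whatever cache backs the device, which is exactly the limitation raised in the next paragraph.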

However, doing an ioctl(2) on the block device won't help. So the
question is, where to add the hook? One possibility is the FS, and
record inum,bnum pairs. But of course we don't have a way of accessing
via inum in user-space. So that's no good. Besides, we want to get
block numbers on the block device, because that's the only meaningful
number to resort.

So, what, then? Some kind of hook on the page cache? Ideas?

Regards,

Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Linus Torvalds


On Fri, 4 May 2001, Alexander Viro wrote:
> 
> Ehh... There _is_ a way to deal with that, but it's deeply Albertesque:
>   * add pagecache access for block device
>   * put your "real" root on /dev/loop0 (setup from initrd)
>   * dd

You're one sick puppy.

Now, the above is basically equivalent to using and populating a
dynamically sized ramdisk.

If you really want to go this way, I'd much rather see you using a real
ram-disk (that you populate at startup with something like a compressed
tar-file). THAT is definitely going to speed up booting - thanks to
compression you'll not only get linear reads, but you will get fewer reads
than the amount of data you need would imply.

Couple that with tmpfs, or possibly something like coda (to dynamically
move things between the ramdisk and the "backing store" filesystem), and
you can get a ramdisk approach that actually shrinks (and, in the case of
coda or whatever, truly grows) dynamically.

Think of it as an exercise in multi-level filesystems and filesystem
management. Others have done it before (usually between disk and tape, or
disk and network), and in these days of ever-growing memory it might just
make sense to do it on that level too.

(No, I don't seriously think it makes sense today. But if RAM keeps
growing and becoming ever cheaper, it might some day. At the point where
everybody has multi-gigabyte memories, and don't really need it for
anything but caching, you could think of it as just moving the caching to
a higher level - you don't cache blocks, you cache parts of the
filesystem).

>   Al, feeling sadistic today...

Sadistic you are.

Linus




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Alexander Viro



On Fri, 4 May 2001, Linus Torvalds wrote:

> Now, if you want to speed up accesses, there are things you can do. You
> can lay out the filesystem in the access order - trace the IO accesses at
> bootup ("which file, which offset, which metadata block?") and lay out the
> blocks of the files in exactly the right order. Then you will get linear
> reads _without_ doing any "dd" at all.
> 
> Now, laying out the filesystem that way is _hard_. No question about it.
> It's kind of equivalent to doing a filesystem "defragment" operation,
> except you use a different sorting function (instead of sorting blocks
> linearly within each file, you sort according to access order).

Ehh... There _is_ a way to deal with that, but it's deeply Albertesque:
* add pagecache access for block device
* put your "real" root on /dev/loop0 (setup from initrd)
* dd
The last step will populate pagecache for underlying device and later
access to root fs will ultimately hit said pagecache, be it from page
cache of files or buffer cache of /dev/loop0 - loop_make_request() will
take care of that, by copying data from pagecache of /dev/.

Al, feeling sadistic today...




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Linus Torvalds


On Fri, 4 May 2001, Andrea Arcangeli wrote:

> On Fri, May 04, 2001 at 01:56:14PM +0200, Jens Axboe wrote:
> > Or you can rewrite block_read/write to use the page cache, in which case
> > you'd have more luck doing the above.
> 
> once block_dev is in pagecache there will obviously be no way to share
> cache between the block device and the filesystem, because all the
> caches will be in completely different address spaces.

They already pretty much are.

I do want to re-write block_read/write to use the page cache, but not
because it would impact anything in this discussion. I want to do it early
in 2.5.x, because:

 - it will speed up accesses
 - it will re-use existing code better and conceptualize things more
   cleanly (ie it would turn a disk into a _really_ simple filesystem with
   just one big file ;).
 - it will make MM handling much better for things like fsck - the memory
   pressure is designed to work on page cache things.
 - it will be one less thing that uses the buffer cache as a "cache" (I
   want people to think of, and use, the buffer cache as an _IO_ entity,
   not a cache).

It will not make the "cache at bootup" thing change at all (because even
in the page cache, there is no commonality between a virtual mapping of a
_file_ (or metadata) and a virtual mapping of a _disk_). 

It would have hidden the problem with "dd" or "dump" touching buffer cache
blocks that the filesystem was using, so the original metadata corruption
that started this thread would not happen. But that's not a design issue
or a design goal, that would just have been a random result.

Linus




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Linus Torvalds


On Fri, 4 May 2001, Rogier Wolff wrote:
>
> Linus Torvalds wrote:
> > 
> > Ehh. Doing that would be extremely stupid, and would slow down your boot
> > and nothing more.
> 
> Ehhh, Linus, Linearly reading my harddisk goes at 26Mb per second.

You obviously didn't read my explanation of _why_ it is stupid.

> By analyzing my boot process I determine that 50M of my disk is used
> during boot. I can then reshuffle my disk to have that 50M of data at
> the beginning and reading all that into 50M of cache, I can save
> thousands of 10ms seeks.

No. Have you _tried_ this?

What the above would do is to move 50M of the disk into the buffer cache.

Then, a second later, when the boot proceeds, Linux would start filling
the page cache.

BY READING THE CONTENTS FROM DISK AGAIN!

In short, by doing a "dd" from the disk, you would _not_ help anything at
all. You would only make things slower, by reading things twice.

The Linux buffer cache and page cache are two separate entities. They are
not synchronized, and they are indexed through totally different
means. The page cache is virtually indexed by <inode, offset>, while the
buffer cache is indexed by <dev, blocknr, blocksize>.
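
A toy Python model of why the "dd" costs a second read. It is a sketch, not kernel code: the two caches are plain dicts, the key shapes (inode, page index) and (device, block number) are assumptions for illustration, and one page is taken to be one block. All names are invented.

```python
class Caches:
    """Two independent caches in front of one disk, keyed differently."""

    def __init__(self, disk, layout):
        self.disk = disk          # block number -> data
        self.layout = layout      # (inode, page index) -> block number
        self.buffer_cache = {}    # keyed by (dev, block)
        self.page_cache = {}      # keyed by (inode, page index)
        self.disk_reads = 0

    def _read_disk(self, block):
        self.disk_reads += 1
        return self.disk[block]

    def dd(self, dev, blocks):
        """Reading the device node fills only the buffer cache."""
        for b in blocks:
            if (dev, b) not in self.buffer_cache:
                self.buffer_cache[(dev, b)] = self._read_disk(b)

    def read_file(self, inode, index):
        """File reads look only in the page cache, never in buffer_cache."""
        key = (inode, index)
        if key not in self.page_cache:
            self.page_cache[key] = self._read_disk(self.layout[key])
        return self.page_cache[key]


disk = {b: "data%d" % b for b in range(4)}
caches = Caches(disk, layout={(1, 0): 2, (1, 1): 3})
caches.dd("hda", range(4))       # "warm" everything through the device
after_dd = caches.disk_reads
caches.read_file(1, 0)
caches.read_file(1, 1)           # both still go to disk
after_files = caches.disk_reads
caches.read_file(1, 0)           # only now is it a cache hit
after_repeat = caches.disk_reads
```

In the model the four blocks cost four reads through the device and the two file pages cost two more, even though the same blocks were already sitting in the other cache.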

> Is this simply: Don't try this then? 

Try it. You will see. 

You _can_ actually try to optimize certain things with 2.4.x: all
meta-data is still in the buffer cache in 2.4.x, so what you could do is
to lay out the image so that the metadata is at the front of the disk,
and do the "dd" to cache just the metadata. Even then you need to be
careful, and make sure that the "dd" uses the same block size as the
filesystem will use.

And even that will largely stop working very early in 2.5.x when the
directory contents and possibly inode and bitmap metadata moves into the
page cache.

Now, you may ask "why use the page cache at all then"? The answer is that
the page cache is a _lot_ faster to look up, exactly because of the
virtual indexing (and also because the data structure is much better
designed - fixed-size entities with none of the complexities of the buffer
cache. The buffer cache needs to be able to do IO, while the page cache is
_only_ a cache and does that one thing really well - doing IO is a
completely separate issue with the page cache).

Now, if you want to speed up accesses, there are things you can do. You
can lay out the filesystem in the access order - trace the IO accesses at
bootup ("which file, which offset, which metadata block?") and lay out the
blocks of the files in exactly the right order. Then you will get linear
reads _without_ doing any "dd" at all.

Now, laying out the filesystem that way is _hard_. No question about it.
It's kind of equivalent to doing a filesystem "defragment" operation,
except you use a different sorting function (instead of sorting blocks
linearly within each file, you sort according to access order).

Linus
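
A toy model may make the double read described above concrete. It is
only a sketch under stated assumptions: ToyKernel, its layout table and
boot_reads() are hypothetical names, one block per page is assumed, and
nothing here is kernel code. The only point it illustrates is that two
caches keyed differently cannot satisfy each other's lookups.

```python
class ToyKernel:
    """Toy model of a buffer cache and a page cache with unrelated keys."""

    def __init__(self, layout):
        # layout maps (inode, pagenr) -> blocknr (hypothetical, one block per page)
        self.layout = layout
        self.buffer_cache = {}   # key: (dev, blocknr, blocksize)
        self.page_cache = {}     # key: (inode, pagenr)
        self.disk_reads = 0

    def _read_disk(self, dev, blocknr):
        self.disk_reads += 1
        return (dev, blocknr)    # stand-in for the block contents

    def read_block_device(self, dev, blocknr, blocksize=4096):
        """Model of a raw device read (what "dd" does): fills the buffer cache."""
        key = (dev, blocknr, blocksize)
        if key not in self.buffer_cache:
            self.buffer_cache[key] = self._read_disk(dev, blocknr)
        return self.buffer_cache[key]

    def read_file_page(self, dev, inode, pagenr):
        """Model of a file read: consults the page cache only."""
        key = (inode, pagenr)
        if key not in self.page_cache:
            self.page_cache[key] = self._read_disk(dev, self.layout[key])
        return self.page_cache[key]


def boot_reads(prime_with_dd):
    """Count disk reads for a 4-page file, with or without a "dd" pass first."""
    layout = {(1, p): 100 + p for p in range(4)}   # inode 1 lives in blocks 100..103
    kernel = ToyKernel(layout)
    if prime_with_dd:
        for blocknr in range(100, 104):
            kernel.read_block_device("hda", blocknr)
    for pagenr in range(4):
        kernel.read_file_page("hda", 1, pagenr)
    return kernel.disk_reads
```

In this model boot_reads(False) performs 4 disk reads and
boot_reads(True) performs 8: the priming pass costs a full extra read
and saves nothing.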

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Andrea Arcangeli

On Fri, May 04, 2001 at 01:56:14PM +0200, Jens Axboe wrote:
> Or you can rewrite block_read/write to use the page cache, in which case
> you'd have more luck doing the above.

once block_dev is in pagecache there will obviously be no way to share
cache between the block device and the filesystem, because all the
caches will be in completely different address spaces.

Andrea



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Marc SCHAEFER

Rogier Wolff <[EMAIL PROTECTED]> wrote:
> during boot. I can then reshuffle my disk to have that 50M of data at
> the beginning and reading all that into 50M of cache, I can save

Wasn't that one of the goals of the LVM project, along with snapshots
and block-level HSM?




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Jens Axboe

On Fri, May 04 2001, Rogier Wolff wrote:
> > On Thu, 3 May 2001, Alan Cox wrote:
> > > Ditto for some CD based stuff. You burn the important binaries to the front
> > > of the CD, then at boot dd 64Mb to /dev/null to prime the libraries and
> > > avoid a lot of seeking during boot up from the CD-ROM.
> > > 
> > > However I could do that from an initrd before mounting
> > 
> > Ehh. Doing that would be extremely stupid, and would slow down your boot
> > and nothing more.
> 
> Ehhh, Linus, Linearly reading my harddisk goes at 26Mb per second. By
> analyzing my boot process I determine that 50M of my disk is used
> during boot. I can then reshuffle my disk to have that 50M of data at
> the beginning and reading all that into 50M of cache, I can save
> thousands of 10ms seeks. Boot time would go from several tens of
> seconds to 2 seconds worth of DISK IO plus several seconds of pure CPU
> time.

Provided that the buffer cache and page cache are coherent, which they
are not. So at most you'll cache fs meta data by doing the dd trick,
which is hardly worth the effort.

Or you can rewrite block_read/write to use the page cache, in which case
you'd have more luck doing the above.

-- 
Jens Axboe




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Rogier Wolff

Linus Torvalds wrote:
> 
> On Thu, 3 May 2001, Alan Cox wrote:
> > Ditto for some CD based stuff. You burn the important binaries to the front
> > of the CD, then at boot dd 64Mb to /dev/null to prime the libraries and
> > avoid a lot of seeking during boot up from the CD-ROM.
> > 
> > However I could do that from an initrd before mounting
> 
> Ehh. Doing that would be extremely stupid, and would slow down your boot
> and nothing more.

Ehhh, Linus, Linearly reading my harddisk goes at 26Mb per second. By
analyzing my boot process I determine that 50M of my disk is used
during boot. I can then reshuffle my disk to have that 50M of data at
the beginning and reading all that into 50M of cache, I can save
thousands of 10ms seeks. Boot time would go from several tens of
seconds to 2 seconds worth of DISK IO plus several seconds of pure CPU
time.

This doesn't work if I don't have the memory to cache 50M of
disk-blocks.

Is this simply: Don't try this then? 

Roger. 


-- 
** [EMAIL PROTECTED] ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* There are old pilots, and there are bold pilots. 
* There are also old, bald pilots. 



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Linus Torvalds


On Fri, 4 May 2001, Andrea Arcangeli wrote:

> On Fri, May 04, 2001 at 01:56:14PM +0200, Jens Axboe wrote:
> > Or you can rewrite block_read/write to use the page cache, in which case
> > you'd have more luck doing the above.
>
> once block_dev is in pagecache there will obviously be no way to share
> cache between the block device and the filesystem, because all the
> caches will be in completely different address spaces.

They already pretty much are.

I do want to re-write block_read/write to use the page cache, but not
because it would impact anything in this discussion. I want to do it early
in 2.5.x, because:

 - it will speed up accesses
 - it will re-use existing code better and conceptualize things more
   cleanly (ie it would turn a disk into a _really_ simple filesystem with
   just one big file ;).
 - it will make MM handling much better for things like fsck - the memory
   pressure is designed to work on page cache things.
 - it will be one less thing that uses the buffer cache as a cache (I
   want people to think of, and use, the buffer cache as an _IO_ entity,
   not a cache).

It will not make the "cache at bootup" thing change at all (because even
in the page cache, there is no commonality between a virtual mapping of a
_file_ (or metadata) and a virtual mapping of a _disk_).

It would have hidden the problem with dd or dump touching buffer cache
blocks that the filesystem was using, so the original metadata corruption
that started this thread would not happen. But that's not a design issue
or a design goal, that would just have been a random result.

Linus




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Alexander Viro



On Fri, 4 May 2001, Linus Torvalds wrote:

> Now, if you want to speed up accesses, there are things you can do. You
> can lay out the filesystem in the access order - trace the IO accesses at
> bootup ("which file, which offset, which metadata block?") and lay out the
> blocks of the files in exactly the right order. Then you will get linear
> reads _without_ doing any "dd" at all.
>
> Now, laying out the filesystem that way is _hard_. No question about it.
> It's kind of equivalent to doing a filesystem "defragment" operation,
> except you use a different sorting function (instead of sorting blocks
> linearly within each file, you sort according to access order).

Ehh... There _is_ a way to deal with that, but it's deeply Albertesque:
* add pagecache access for block device
* put your real root on /dev/loop0 (setup from initrd)
* dd
The last step will populate pagecache for underlying device and later
access to root fs will ultimately hit said pagecache, be it from page
cache of files or buffer cache of /dev/loop0 - loop_make_request() will
take care of that, by copying data from pagecache of /dev/real_device.

Al, feeling sadistic today...




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Linus Torvalds


On Fri, 4 May 2001, Alexander Viro wrote:
 
> Ehh... There _is_ a way to deal with that, but it's deeply Albertesque:
>   * add pagecache access for block device
>   * put your real root on /dev/loop0 (setup from initrd)
>   * dd

You're one sick puppy.

Now, the above is basically equivalent to using and populating a
dynamically sized ramdisk.

If you really want to go this way, I'd much rather see you using a real
ram-disk (that you populate at startup with something like a compressed
tar-file). THAT is definitely going to speed up booting - thanks to
compression you'll not only get linear reads, but you will get fewer reads
than the amount of data you need would imply.

Couple that with tmpfs, or possibly something like coda (to dynamically
move things between the ramdisk and the backing store filesystem), and
you can get a ramdisk approach that actually shrinks (and, in the case of
coda or whatever, truly grows) dynamically.

Think of it as an exercise in multi-level filesystems and filesystem
management. Others have done it before (usually between disk and tape, or
disk and network), and in these days of ever-growing memory it might just
make sense to do it on that level too.

(No, I don't seriously think it makes sense today. But if RAM keeps
growing and becoming ever cheaper, it might some day. At the point where
everybody has multi-gigabyte memories, and don't really need it for
anything but caching, you could think of it as just moving the caching to
a higher level - you don't cache blocks, you cache parts of the
filesystem).

>   Al, feeling sadistic today...

Sadistic you are.

Linus
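
The "populate a ram-disk from a compressed tar-file" idea can be
sketched in user space as follows. This is a minimal illustration, not
a boot script: populate_ramdisk() and its arguments are hypothetical
names, the target directory is assumed to be an already mounted
RAM-backed filesystem, and the archive is assumed to be trusted. It
returns the compressed and unpacked sizes so the "fewer reads than the
amount of data" effect can be checked.

```python
import os
import tarfile


def populate_ramdisk(archive_path, mount_point):
    """Unpack a gzip-compressed tar archive into mount_point.

    Returns (compressed_bytes, unpacked_bytes). The archive is read as one
    stream, so the disk sees one linear read of the compressed size only.
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        members = tar.getmembers()
        tar.extractall(mount_point)
    unpacked = sum(m.size for m in members if m.isfile())
    return os.path.getsize(archive_path), unpacked
```

How much smaller the first number is than the second depends entirely
on how compressible the boot files are.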




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Richard Gooch

Linus Torvalds writes:
> Now, if you want to speed up accesses, there are things you can
> do. You can lay out the filesystem in the access order - trace the
> IO accesses at bootup ("which file, which offset, which metadata
> block?") and lay out the blocks of the files in exactly the right
> order. Then you will get linear reads _without_ doing any "dd" at
> all.

A year ago I came up with an alternative approach for cache warming,
but I see that it wouldn't work with our current infrastructure.
However, maybe there is still a way to use the basic technique. If so,
please make suggestions.

The idea I had (motivated by the desire to eliminate random disc
seeks, which is the limiting factor in how fast my boxes boot) was:

- init(8) issues an ioctl(2) on the root FS block device which turns
  on recording of block reads (it records block numbers)

- at the end of the bootup process, init(8) issues another ioctl(2) to
  grab the buffered block numbers, and turn off recording

- init(8) then sorts this list in ascending order and saves the result
  in a file

- next boot, init(8) checks the file, and if it exists, opens the root
  FS block device and reads in each block listed in the file. The
  effect is to warm the buffer cache extremely quickly. The head will
  move in one direction, grabbing data as it flies by. I expect this
  will take around 1 second

- init(8) now continues the boot process (starting the magic ioctl(2)
  again so as to get a fresh list of blocks, in case something has
  changed)

- booting is now super fast, thanks to no disc activity.

The advantage of this scheme over blindly reading the first 50 MB is
that it only reads in what you *need*, and thus will work better on
low memory systems. It's also useful for other applications, not just
speeding up the boot process.

However, doing an ioctl(2) on the block device won't help. So the
question is, where to add the hook? One possibility is the FS, and
record inum,bnum pairs. But of course we don't have a way of accessing
via inum in user-space. So that's no good. Besides, we want to get
block numbers on the block device, because that's the only meaningful
number to resort.

So, what, then? Some kind of hook on the page cache? Ideas?

Regards,

Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]
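
The sort-and-replay step of the scheme above could look roughly like
this in user space. It is a sketch only: coalesce() and warm() are
hypothetical helpers, the recording ioctl(2) does not exist and is not
modelled, and the 4096-byte block size is an assumption. As noted
elsewhere in the thread, reading the block device this way would warm
only the buffer cache.

```python
def coalesce(block_numbers):
    """Sort, de-duplicate and merge block numbers into (start, count) runs."""
    runs = []
    for block in sorted(set(block_numbers)):
        if runs and runs[-1][0] + runs[-1][1] == block:
            # Block directly follows the previous run: extend it.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((block, 1))
    return runs


def warm(device_path, runs, block_size=4096):
    """Read every run once, in ascending order; return the bytes read."""
    total = 0
    with open(device_path, "rb", buffering=0) as dev:
        for start, count in runs:
            dev.seek(start * block_size)
            total += len(dev.read(count * block_size))
    return total
```

Merging adjacent blocks into runs keeps the number of read calls small
and lets the head move in one direction only.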



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Alexander Viro



On Fri, 4 May 2001, Linus Torvalds wrote:

 
> On Fri, 4 May 2001, Alexander Viro wrote:
> >
> > Ehh... There _is_ a way to deal with that, but it's deeply Albertesque:
> >                                                            ^^^
> >   * add pagecache access for block device
> >   * put your real root on /dev/loop0 (setup from initrd)
> >   * dd
>
> You're one sick puppy.

[snip]
/me bows

Nice to see that imitation was good enough ;-) Seriously, I half-expected
Albert to show up at that point of thread and tried to anticipate what
he'd produce.

ObProcfs: I don't think that walking the page tables is a good way to
compute RSS, especially since VM maintains the thing. Mind if I rip
it out? In effect, implementation of /proc/<pid>/statm
* produces extremely bogus values (VMA is from library if it goes
  beyond 0x6000? Might be even true 7 years ago...) and nobody
  had cared about them for 6-7 years
* makes stuff like top(1) _walk_ _whole_ _page_ _tables_ _of_ _all_
  _processes_ each 5 seconds. No wonder it's slow like hell and eats
  tons of CPU time.
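
For readers unfamiliar with the file in question, a statm line is a
short row of page counts. The sketch below parses one; parse_statm() is
a hypothetical helper, and the field names follow the current proc(5)
manual page, which is an assumption about kernels later than the one
discussed here.

```python
import os


def parse_statm(line, page_size=None):
    """Split a statm line into named fields, converted from pages to bytes."""
    if page_size is None:
        page_size = os.sysconf("SC_PAGE_SIZE")
    names = ("size", "resident", "shared", "text", "lib", "data", "dt")
    return {name: int(value) * page_size
            for name, value in zip(names, line.split())}
```

A caller interested only in RSS would read the "resident" entry.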




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Richard Gooch

Alexander Viro writes:
 
 
> On Fri, 4 May 2001, Richard Gooch wrote:
>
> > However, doing an ioctl(2) on the block device won't help. So the
> > question is, where to add the hook? One possibility is the FS, and
> > record inum,bnum pairs. But of course we don't have a way of accessing
> > via inum in user-space. So that's no good. Besides, we want to get
> > block numbers on the block device, because that's the only meaningful
> > number to resort.
> >
> > So, what, then? Some kind of hook on the page cache? Ideas?
>
> Two of them: use less bloated shell (and link it statically) and
> clean your rc scripts.

No, because I'm not using the latest bloated version of bash, and I'm
not using the slow and bloated RedHat boot scripts. My boot scripts
are lean and mean. Oh. And I already have init(8) warming the cache
with these scripts.

The problem is all the various daemons and system utilities (mount,
hwclock, ifconfig and so on) that turn a kernel into a useful system.
And then of course there's X...

Sorry. A "don't do that then" answer isn't appropriate for this
problem space.

Regards,

Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Jens Axboe

On Fri, May 04 2001, Richard Gooch wrote:
> The idea I had (motivated by the desire to eliminate random disc
> seeks, which is the limiting factor in how fast my boxes boot) was:
>
> - init(8) issues an ioctl(2) on the root FS block device which turns
>   on recording of block reads (it records block numbers)
>
> - at the end of the bootup process, init(8) issues another ioctl(2) to
>   grab the buffered block numbers, and turn off recording
>
> - init(8) then sorts this list in ascending order and saves the result
>   in a file
>
> - next boot, init(8) checks the file, and if it exists, opens the root
>   FS block device and reads in each block listed in the file. The
>   effect is to warm the buffer cache extremely quickly. The head will
>   move in one direction, grabbing data as it flies by. I expect this
>   will take around 1 second
>
> - init(8) now continues the boot process (starting the magic ioctl(2)
>   again so as to get a fresh list of blocks, in case something has
>   changed)
>
> - booting is now super fast, thanks to no disc activity.

I did 95% of what you need sometime last year, to do I/O scheduler
profiling (blocks requested, merge stats, request sent to disk). It was
a pretty gross hack, requiring a pretty big ring buffer of kernel memory
to be able to log at a sufficiently fast rate (you'd be amazed how much
output a single dbench 48 run produces :-). A user space app would read
data from a simple char device, save for later inspection.

A better approach would be to map the ring buffer from the user app, but
it was just a quick fix.

-- 
Jens Axboe
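
The logging arrangement described above (a fixed-size ring buffer
filled by the producer and drained by a reader) can be modelled in a
few lines. This is a user-space sketch, not the hack itself: BlockLog
and its methods are hypothetical names, and the event format
(sector, nr_sectors) is an assumption.

```python
from collections import deque


class BlockLog:
    """Fixed-size ring buffer of (sector, nr_sectors) events."""

    def __init__(self, capacity):
        self._events = deque(maxlen=capacity)
        self.dropped = 0   # events overwritten before the reader got to them

    def record(self, sector, nr_sectors):
        if len(self._events) == self._events.maxlen:
            self.dropped += 1          # the oldest entry is about to be lost
        self._events.append((sector, nr_sectors))

    def drain(self):
        """Return all buffered events, oldest first, and empty the buffer."""
        events = list(self._events)
        self._events.clear()
        return events
```

The dropped counter shows why the buffer has to be large: if the reader
falls behind a fast producer, the oldest events are silently lost.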




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Linus Torvalds



On Fri, 4 May 2001, Alexander Viro wrote:

> ObProcfs: I don't think that walking the page tables is a good way to
> compute RSS, especially since VM maintains the thing.

Well, the VM didn't always use to maintain the stuff it does now, so I bet
that most of the code is just old code that still works.

Feel free to rip it out.

Linus




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Alexander Viro



On Fri, 4 May 2001, Richard Gooch wrote:

> > Two of them: use less bloated shell (and link it statically) and
> > clean your rc scripts.
>
> No, because I'm not using the latest bloated version of bash, and I'm

Umm... Last version of bash I could call not bloated was _long_ time
ago. Something like ash(1) might be a better idea for /bin/sh.

> The problem is all the various daemons and system utilities (mount,
> hwclock, ifconfig and so on) that turn a kernel into a useful system.
> And then of course there's X...

How do you partition the thing? I.e. what's the size of your root partition?
I'm usually doing something from 10Mb to 30Mb - that may be the reason of
differences.




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Richard Gooch

Alexander Viro writes:
 
 
> On Fri, 4 May 2001, Richard Gooch wrote:
>
> > > Two of them: use less bloated shell (and link it statically) and
> > > clean your rc scripts.
> >
> > No, because I'm not using the latest bloated version of bash, and I'm
>
> Umm... Last version of bash I could call not bloated was _long_ time
> ago. Something like ash(1) might be a better idea for /bin/sh.

The shell is irrelevant. I can easily preload that too, if I wanted
to, since it's just one thing. But it's not practical to preload all
files used by name, because it's just too hard to find out all that is
needed. Too much people time required, and it is specific to one
distribution (and a particular revision at that).

> > The problem is all the various daemons and system utilities (mount,
> > hwclock, ifconfig and so on) that turn a kernel into a useful system.
> > And then of course there's X...
>
> How do you partition the thing? I.e. what's the size of your root
> partition?  I'm usually doing something from 10Mb to 30Mb - that may
> be the reason of differences.

I don't bother splitting /usr off /. I gave up doing that when disc
became cheap. There's no point anymore. And since I have a lightweight
distribution (500 MiB and I get X, LaTeX, emacs, compilers, netscrap
and a pile of other things), it makes even less sense to split /usr
off. Sorry, I don't have those fancy desktops. Don't need 'em. I spend
most of my day in emacs and xterm.

And even if I did split /usr off, that would just mean I'd want to
record block accesses for that device as well. This is because part of
my boot process requires stuff in /usr. And after that, firing up xdm.

Regards,

Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Alexander Viro



On Fri, 4 May 2001, Richard Gooch wrote:

> I don't bother splitting /usr off /. I gave up doing that when disc
> became cheap. There's no point anymore. And since I have a lightweight

Yes, there is. Locality. Resistance to fs fuckups. Resistance to disk
fuckups. Easier to restore from tape. Different tunefs optimum (higher
inodes/blocks ratio, for one thing). Ability to keep /usr read-only.
Enough?

> distribution (500 MiB and I get X, LaTeX, emacs, compilers, netscrap
> and a pile of other things), it makes even less sense to split /usr
> off. Sorry, I don't have those fancy desktops. Don't need 'em. I spend
> most of my day in emacs and xterm.

What desktops? None of that crap on my boxen either. EMACS? What EMACS?
LaTeX is unfortunately needed (I prefer troff and AMSTeX on the TeX side).
Netrape? No chance in hell. bash spit is there, but I prefer to use
rc.

I don't see what it has to do with keeping root on a separate filesystem,
though - the reasons have nothing to do with bloat in /usr/bin.




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Richard Gooch

Alexander Viro writes:
 
 
> On Fri, 4 May 2001, Richard Gooch wrote:
>
> > I don't bother splitting /usr off /. I gave up doing that when disc
> > became cheap. There's no point anymore. And since I have a lightweight
>
> Yes, there is. Locality. Resistance to fs fuckups. Resistance to
> disk fuckups. Easier to restore from tape. Different tunefs optimum
> (higher inodes/blocks ratio, for one thing). Ability to keep /usr
> read-only.  Enough?

The correct solution to avoiding fs fuckups is to keep /tmp, /var and
/home separate. Basically, anything that gets written to for reasons
other than sysadmin/upgrades.

However, my point is not that it's always a bad idea to split /usr,
simply that the converse is not true. IOW, it is not true to say that
/usr *should* be split off. For a generic workstation, splitting /usr
is not useful. Importantly, it is most certainly entirely valid to
keep /usr on /.

> > distribution (500 MiB and I get X, LaTeX, emacs, compilers, netscrap
> > and a pile of other things), it makes even less sense to split /usr
> > off. Sorry, I don't have those fancy desktops. Don't need 'em. I spend
> > most of my day in emacs and xterm.
>
> What desktops? None of that crap on my boxen either. EMACS? What EMACS?
> LaTeX is unfortunately needed (I prefer troff and AMSTeX on the TeX side).
> Netrape? No chance in hell. bash spit is there, but I prefer to use
> rc.
>
> I don't see what it has to do with keeping root on a separate
> filesystem, though - the reasons have nothing to do with bloat in /usr/bin.

In any case, my point is that splitting /usr wouldn't help, because
I'd want to preload stuff from there as well. Splitting /usr doesn't
address the problem.

Regards,

Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Linus Torvalds



On Fri, 4 May 2001, Alan Cox wrote:

> iso9660 alas doesn't allow you to do that. You can speed it up by reading
> the entire file into memory rather than paging it in (or reading it in and
> then executing it). iso9660 layout is pretty constrained and designed for
> linear file reads

Note that you can do this for any filesystem, including ext2. If, instead
of trying to remember what _blocks_ the bootup process reads, you keep the
trace at a higher level, and then sort the _high_level_ trace and re-do
that with some program, then you can obviously populate the virtual caches
properly with any filesystem.

The advantage of that approach is that it will continue to work forever,
because there will never be any cache aliasing issues. You're always
pre-caching using the same operation that you'll actually use when you
do the real reads.

Now, that still leaves the question of how to sort the virtual cache
accesses, and you might want to know what the low-level layout of the
filesystem is to actually create the sort. You might not want to sort
alphabetically on the file-name, but use a "where on the disk is this
file" key, and use _that_ as the sort order function.

That's easy to do, actually. Just use the bmap() ioctl.
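
A minimal sketch of such a sort, assuming that "the bmap() ioctl" means
Linux's FIBMAP. The trace_entry struct and both function names are made up
for illustration. FIBMAP normally needs root, so first_block_of() simply
returns -1 when the call is refused.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/fs.h>   /* FIBMAP */
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* One entry of a (hypothetical) high-level boot trace. */
struct trace_entry {
    const char *path;   /* file the boot process opened */
    long first_block;   /* physical block of logical block 0, or -1 */
};

/* Ask the filesystem where logical block 0 of `path` lives on disk.
 * Returns -1 if the file cannot be opened or FIBMAP is refused.
 * A result of 0 means the block is a hole or not mapped. */
long first_block_of(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int block = 0;      /* in: logical block number, out: physical block */
    int rc = ioctl(fd, FIBMAP, &block);
    close(fd);
    return rc < 0 ? -1 : block;
}

/* qsort comparator: ascending physical position. */
static int by_block(const void *a, const void *b)
{
    const struct trace_entry *x = a, *y = b;
    return (x->first_block > y->first_block) -
           (x->first_block < y->first_block);
}

/* Reorder the trace so that files are visited in on-disk order. */
void sort_trace_by_disk_position(struct trace_entry *t, size_t n)
{
    assert(t != NULL || n == 0);
    qsort(t, n, sizeof *t, by_block);
}
```

A caller would fill first_block with first_block_of(path) for every traced
file and then sort. Only the first block is used as the key here, which
ignores fragmentation.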

Now, you won't be able to use "dd" to populate the caches: you'd have to
have your own program that walks your sorted action list and populates the
caches that way (and you might want to take kernel read-ahead etc
heuristics into account).
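
A rough sketch of such a walker, under the assumption that simply reading
each file through the normal file interface is enough to warm the page
cache. precache_file() and precache_list() are illustrative names, and no
read-ahead tuning is attempted.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Read `path` from start to end and throw the data away, so the pages
 * land in the same cache that later real reads will hit.
 * Returns the number of bytes read, or -1 on error. */
long precache_file(const char *path)
{
    assert(path != NULL);
    char buf[65536];
    long total = 0;
    ssize_t n;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += n;
    close(fd);
    return n < 0 ? -1 : total;
}

/* Walk an already sorted list of paths and pre-cache each one.
 * Files that vanished or cannot be read are skipped.
 * Returns the total number of bytes read. */
long precache_list(const char *const *paths, size_t count)
{
    long total = 0;
    for (size_t i = 0; i < count; i++) {
        long n = precache_file(paths[i]);
        if (n > 0)
            total += n;
    }
    return total;
}
```

Because this goes through open() and read() on the files themselves, it
uses the same operation as the later real reads, which is the property the
message argues for.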

SMOP.

Linus




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-03 Thread Linus Torvalds


On Thu, 3 May 2001, Alan Cox wrote:
>
> > > discussion in itself), and there really are no valid uses for opening a
> > > block device that is already mounted. More importantly, I don't think
> > > anybody actually does.
> > 
> > Actually I did. I might do it again :) The point was to get the kernel to
> > cache certain blocks in the RAM. 
> 
> Ditto for some CD based stuff. You burn the important binaries to the front
> of the CD, then at boot dd 64Mb to /dev/null to prime the libraries and
> avoid a lot of seeking during boot up from the CD-ROM.
> 
> However I could do that from an initrd before mounting

Ehh. Doing that would be extremely stupid, and would slow down your boot
and nothing more.

The page cache is _not_ coherent with the buffer cache. For any filesystem
that uses the page cache for data caching (which pretty much all of them
do, because it's the only way to get sane mmap semantics, and it's a lot
faster than the old buffer cache ever was), the above will do _nothing_
but spend time doing IO that the page cache will just end up doing again.

Currently it can help to pre-load the meta-data, but quite frankly, even
that is suspect, and won't work in 2.5.x when Al's metadata page-cache
stuff is merged (at least directories, and likely inodes too).

In short, don't do it. It doesn't work reliably (and hasn't since 2.0.x),
and it will only get more and more unreliable.

Linus




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-03 Thread Alan Cox

> > discussion in itself), and there really are no valid uses for opening a
> > block device that is already mounted. More importantly, I don't think
> > anybody actually does.
> 
> Actually I did. I might do it again :) The point was to get the kernel to
> cache certain blocks in the RAM. 

Ditto for some CD based stuff. You burn the important binaries to the front
of the CD, then at boot dd 64Mb to /dev/null to prime the libraries and
avoid a lot of seeking during boot up from the CD-ROM.

However I could do that from an initrd before mounting

Alan




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-03 Thread volodya



On Thu, 26 Apr 2001, Linus Torvalds wrote:

> 
> 
> On Thu, 26 Apr 2001, Alexander Viro wrote:
> > On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
> > >
> > > how can the read in progress see a branch that we didn't spliced yet? We
> >
> > fd = open("/dev/hda1", O_RDONLY);
> > read(fd, buf, sizeof(buf));
> 
> Note that I think all these arguments are fairly bogus.  Doing things like
> "dump" on a live filesystem is stupid and dangerous (in my opinion it is
> stupid and dangerous to use "dump" at _all_, but that's a whole 'nother
> discussion in itself), and there really are no valid uses for opening a
> block device that is already mounted. More importantly, I don't think
> anybody actually does.

Actually I did. I might do it again :) The point was to get the kernel to
cache certain blocks in the RAM. 

 Vladimir Dergachev

> 
> The fact that you _can_ do so makes the patch valid, and I do agree with
> Al on the "least surprise" issue. I've already applied the patch, in fact.
> But the fact is that nobody should ever do the thing that could cause
> problems.
> 
>   Linus
> 



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-04-30 Thread Neil Conway

Hiya.

Linus Torvalds wrote:
> So anybody who depends on "dump" getting backups right is already playing
> Russian roulette with their backups.  It's not at all guaranteed to get the
> right results - you may end up having stale data in the buffer cache that
> ends up being "backed up".
> 
> Dump was a stupid program in the first place. Leave it behind.

Ouch.  I just re-read the man page and it doesn't caution (*) against
using it on mounted filesystems.  That probably means that there are
thousands of other losers like me using it on production machines. 
Volunteers to (a) change the man page, (b) talk to the distros about
dumping "dump"?

> However, it may be that in the long run it would be advantageous to have a
> "filesystem maintenance interface" for doing things like backups and
> defragmentation..

Yup, sounds good.

Neil

(*) The KNOWNBUGS file mentions "possible" problems while dumping active
mounted filesystems, but I've (elsewhere) seen these characterised as no
real problem; also, this falls a long way short of discouraging use in
this fashion.



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-04-28 Thread Olaf Titz

> or such. tar/cpio and friends don't deal properly with
> a. holes inside of files.
> b. hardlinks between files.

GNU tar handles both of these. (Not particularly efficiently in the
case of sparse files, but that's a minor issue in this case.) See -S flag.

Perhaps more importantly, for a _robust_ backup solution which can
deal with partially unreadable tapes, you have pretty much no option
other than tar for the actual storage.

Olaf



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-04-28 Thread Jens Axboe

On Sat, Apr 28 2001, Albert D. Cahalan wrote:
> Linus Torvalds writes:
> 
> > The buffer cache is "virtual" in the sense that /dev/hda is a
> > completely separate name-space from /dev/hda1, even if there
> > is some physical overlap.
> 
> So the aliasing problems and elevator algorithm confusion remain?

At least for the I/O scheduler confusion, requests to partitions will
remap the buffer location and this problem disappears nicely. It's not a
big issue, really.

> Is this ever likely to change, and what is with the 1 kB assumptions?
> (Hmmm, cruft left over from the 1 kB Minix filesystem blocks?)

What 1kB assumption?

-- 
Jens Axboe




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-04-28 Thread Matthias Urlichs

Martin Dalecki:
> tar/cpio and friends don't deal properly with
> 
> a. holes inside of files.
> b. hardlinks between files.
> 
??? GNU tar does both. The only thing it currently cannot handle is Not
Changing Anything: either atime or ctime _will_ be updated.
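
The trade-off can be sketched as follows, assuming a POSIX.1-2008 system
with futimens(). The function name is invented. Reading the file may bump
atime; putting atime back is itself an inode change, so ctime moves instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Read `path` the way a backup tool would, then restore the access time.
 * On return *before and *after hold the ctime seen before and after.
 * Returns 0 on success, -1 on error. */
int read_then_restore_atime(const char *path,
                            struct timespec *before, struct timespec *after)
{
    assert(path != NULL && before != NULL && after != NULL);
    struct stat st;
    char buf[4096];
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    *before = st.st_ctim;

    while (read(fd, buf, sizeof buf) > 0)
        ;                               /* reading may update atime */

    struct timespec times[2];
    times[0] = st.st_atim;              /* put the old atime back */
    times[1].tv_sec = 0;
    times[1].tv_nsec = UTIME_OMIT;      /* leave mtime alone */
    int rc = futimens(fd, times);       /* this inode update touches ctime */
    if (rc == 0 && fstat(fd, &st) == 0)
        *after = st.st_ctim;
    else
        rc = -1;
    close(fd);
    return rc;
}
```

Whether the read really changes atime depends on mount options such as
noatime, so the sketch reports both ctime values rather than asserting a
difference.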



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-04-27 Thread Albert D. Cahalan

Linus Torvalds writes:

> The buffer cache is "virtual" in the sense that /dev/hda is a
> completely separate name-space from /dev/hda1, even if there
> is some physical overlap.

So the aliasing problems and elevator algorithm confusion remain?
Is this ever likely to change, and what is with the 1 kB assumptions?
(Hmmm, cruft left over from the 1 kB Minix filesystem blocks?)



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-04-27 Thread Shane Wegner

On Fri, Apr 27, 2001 at 09:52:19AM -0700, Linus Torvalds wrote:
> 
> On Fri, 27 Apr 2001, Vojtech Pavlik wrote:
> > 
> > Actually this is done quite often, even on mounted fs's:
> > 
> > hdparm -t /dev/hda
> 
> Note that this one happens to be ok.
> 
> The buffer cache is "virtual" in the sense that /dev/hda is a completely
> separate name-space from /dev/hda1, even if there is some physical
> overlap.

Wouldn't something like "hdparm -t /dev/md0" trigger it,
though?  It is the same device that gets mounted, as md
devices aren't partitioned.

Shane


-- 
Shane Wegner: [EMAIL PROTECTED]
  http://www.cm.nu/~shane/
PGP:  1024D/FFE3035D
  A0ED DAC4 77EC D674 5487
  5B5C 4F89 9A4E FFE3 035D



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-04-27 Thread dek_ml

On Fri, Apr 27, 2001 at 11:02:17AM -0700, LA Walsh wrote:
> Andrzej Krzysztofowicz wrote:
> 
> > I know a few people that often do:
> >
> > dd if=/dev/hda1 of=/dev/hdc1
> > e2fsck /dev/hdc1
> >
> > to make an "exact" copy of a currently working system.
> 
> ---
> Presumably this isn't a problem if the source disks are either unmounted or
> mounted 'read-only'?
> 
> 

I thought the best known solution for this was to use COW snapshots,
because then you copy the filesystem in exactly the state it was in when
the snapshot was made, without impacting the writability of the filesystem
while the (potentially very long) dump is made?
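
A toy model of the idea, not LVM's actual implementation: a volume of a few
in-memory blocks with a single snapshot. All names and sizes here are
invented. The first write to a block after the snapshot saves the old
contents, so a reader of the snapshot keeps seeing the original state while
writes to the live volume continue.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { NBLOCKS = 8, BLOCK_BYTES = 16 };

/* Toy volume: live blocks plus at most one snapshot. */
struct volume {
    char live[NBLOCKS][BLOCK_BYTES];
    char saved[NBLOCKS][BLOCK_BYTES];   /* pre-write copies of changed blocks */
    bool copied[NBLOCKS];               /* true once block i was preserved */
    bool snapshot_active;
};

/* Start a snapshot: nothing is copied yet. */
void take_snapshot(struct volume *v)
{
    assert(v != NULL);
    memset(v->copied, 0, sizeof v->copied);
    v->snapshot_active = true;
}

/* Write a string into block i, preserving the old contents on the
 * first write after a snapshot (the copy-on-write step). */
void write_block(struct volume *v, int i, const char *data)
{
    assert(v != NULL && i >= 0 && i < NBLOCKS);
    if (v->snapshot_active && !v->copied[i]) {
        memcpy(v->saved[i], v->live[i], BLOCK_BYTES);
        v->copied[i] = true;
    }
    strncpy(v->live[i], data, BLOCK_BYTES - 1);
    v->live[i][BLOCK_BYTES - 1] = '\0';
}

/* Read block i as it was when the snapshot was taken. */
const char *read_snapshot(const struct volume *v, int i)
{
    assert(v != NULL && v->snapshot_active && i >= 0 && i < NBLOCKS);
    return v->copied[i] ? v->saved[i] : v->live[i];
}
```

A backup would read every block through read_snapshot() while ordinary
writers keep calling write_block().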

I tried using this on LVM, but after seeing a few messages on the list about
kernel oopses happening with snapshots of filesystems under heavy write
activity, as well as experiencing serious problems with the LVM userspace
tools (they would core dump on startup if the LVM filesystem had any sort
of corruption or integrity failure), I decided to put it away until the LVM
folks managed to get a production version ready.



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-04-27 Thread LA Walsh

Andrzej Krzysztofowicz wrote:

> I know a few people that often do:
>
> dd if=/dev/hda1 of=/dev/hdc1
> e2fsck /dev/hdc1
>
> to make an "exact" copy of a currently working system.

---
Presumably this isn't a problem if the source disks are either unmounted or
mounted 'read-only'?


--
The above thoughts and   | They may have nothing to do with
writings are my own.     | the opinions of my employer. :-)
L A Walsh                | Trust Technology, Core Linux, SGI
[EMAIL PROTECTED]        | Voice: (650) 933-5338






Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-04-27 Thread Jeff Garzik

Linus Torvalds wrote:
> On Fri, 27 Apr 2001, Neil Conway wrote:
> >
> > I'm surprised that dump is deprecated (by you at least ;-)).  What to
> > use instead for backups on machines that can't umount disks regularly?
> 
> Note that dump simply won't work reliably at all even in 2.4.x: the buffer
> cache and the page cache (where all the actual data is) are not
> coherent. This is only going to get even worse in 2.5.x, when the
> directories are moved into the page cache as well.

> Dump was a stupid program in the first place. Leave it behind.

Dump/restore are useful, on-line dump is silly.  I am personally amazed
that on-line, mounted dump was -ever- supported.  I guess it would work
if mounted ro...

dump is still the canonical solution, IMHO, for saving and restoring
filesystem metadata OFFLINE.  tar/cpio can be taught to do stuff like
security ACLs and EAs and such, but such code and formats are not yet
standardized, and they do not approach dump when it comes to taking an
accurate snapshot of the filesystem.

-- 
Jeff Garzik  | Disbelief, that's why you fail.
Building 1024|
MandrakeSoft |



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-04-27 Thread Martin Dalecki

Linus Torvalds wrote:

> Dump was a stupid program in the first place. Leave it behind.

Not quite, Linus - dump/restore are nice tools for creating, for example,
automatic over-the-network installation servers, i.e. efficient system
images or such. tar/cpio and friends don't deal properly with

a. holes inside of files.
b. hardlinks between files.

Really, they are not useless. However, I wouldn't recommend them
for backup practices either.

Please see for example:

http://www.systime-solutions.de/index.php?topic=produkte=setupserver

Well yes, if you understand German...



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-04-27 Thread Linus Torvalds


[ linux-kernel added back as a cc ]

On Fri, 27 Apr 2001, Neil Conway wrote:
> 
> I'm surprised that dump is deprecated (by you at least ;-)).  What to
> use instead for backups on machines that can't umount disks regularly? 

Note that dump simply won't work reliably at all even in 2.4.x: the buffer
cache and the page cache (where all the actual data is) are not
coherent. This is only going to get even worse in 2.5.x, when the
directories are moved into the page cache as well.

So anybody who depends on "dump" getting backups right is already playing
Russian roulette with their backups.  It's not at all guaranteed to get the
right results - you may end up having stale data in the buffer cache that
ends up being "backed up".

Dump was a stupid program in the first place. Leave it behind.

> I've always thought "tar" was a bit undesirable (updates atimes or
> ctimes for example).

Right now, the cpio/tar/xxx solutions are definitely the best ones, and
will work on multiple filesystems (another limitation of "dump"). Whatever
problems they have, they are still better than the _guaranteed_(*)  data
corruptions of "dump".

However, it may be that in the long run it would be advantageous to have a
"filesystem maintenance interface" for doing things like backups and
defragmentation..

Linus

(*) Dump may work fine for you a thousand times. But it _will_ fail under
the right circumstances. And there is nothing you can do about it.




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-04-27 Thread Linus Torvalds


On Fri, 27 Apr 2001, Vojtech Pavlik wrote:
> 
> Actually this is done quite often, even on mounted fs's:
> 
> hdparm -t /dev/hda

Note that this one happens to be ok.

The buffer cache is "virtual" in the sense that /dev/hda is a completely
separate name-space from /dev/hda1, even if there is some physical
overlap.
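
A toy illustration of what "separate name-space" means here, not the
kernel's real data structures. Entries are keyed by (device, block) only.
The device ids, the slot count and the partition start of 63 used in the
usage note are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { CACHE_SLOTS = 16, DATA_BYTES = 8 };
enum { DEV_HDA = 1, DEV_HDA1 = 2 };     /* hypothetical device ids */

struct cached_block {
    int in_use;
    int dev;                /* which device node the block was read through */
    long block;             /* block number relative to that device */
    char data[DATA_BYTES];
};

struct buffer_cache {
    struct cached_block slot[CACHE_SLOTS];
};

/* Find the entry for (dev, block); NULL on a miss. The physical
 * location on the disk is never consulted. */
struct cached_block *cache_lookup(struct buffer_cache *c, int dev, long block)
{
    assert(c != NULL);
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        struct cached_block *b = &c->slot[i];
        if (b->in_use && b->dev == dev && b->block == block)
            return b;
    }
    return NULL;
}

/* Store data under (dev, block), reusing an existing entry if present.
 * Returns NULL when the toy cache is full. */
struct cached_block *cache_store(struct buffer_cache *c, int dev, long block,
                                 const char *data)
{
    struct cached_block *b = cache_lookup(c, dev, block);
    for (size_t i = 0; b == NULL && i < CACHE_SLOTS; i++)
        if (!c->slot[i].in_use)
            b = &c->slot[i];
    if (b == NULL)
        return NULL;
    b->in_use = 1;
    b->dev = dev;
    b->block = block;
    strncpy(b->data, data, DATA_BYTES - 1);
    b->data[DATA_BYTES - 1] = '\0';
    return b;
}
```

If the partition starts at block 63, then block 68 of the whole disk and
block 5 of the partition are the same place on the platter, yet
cache_store(&c, DEV_HDA, 68, ...) and cache_lookup(&c, DEV_HDA1, 5) never
meet: the second is a miss, and a later store creates a second, independent
entry.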

Linus




Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-04-27 Thread Ville Herva

On Fri, Apr 27, 2001 at 09:23:57AM -0400, you [Alexander Viro] claimed:
> 
> 
> On Fri, 27 Apr 2001, Vojtech Pavlik wrote:
> 
> > Actually this is done quite often, even on mounted fs's:
> > 
> > hdparm -t /dev/hda
> 
> You would need either hdparm -t /dev/hda or mounting the
> whole /dev/hda.
> 
> Buffer cache for the disk is unrelated to buffer cache for partitions.

Well, I for one have been running

hdparm -t /dev/md0
or
time head -c 1000m /dev/md0 > /dev/null

while /dev/md0 was mounted without realizing that this could be "stupid" or
that it could eat my data.

/dev/md0 on /backup-versioned type ext2 (rw)

I often cat(1) or head(1) partitions or devices (even mounted ones) if I
need dummy randomish test data for compression or tape drives (that I've
been having trouble with). 

BTW: is 2.2 affected? 2.0? 


-- v --

[EMAIL PROTECTED]



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-04-27 Thread Andi Kleen

On Thu, Apr 26, 2001 at 01:08:25PM -0700, Linus Torvalds wrote:
> Note that I think all these arguments are fairly bogus.  Doing things like
> "dump" on a live filesystem is stupid and dangerous (in my opinion it is
> stupid and dangerous to use "dump" at _all_, but that's a whole 'nother
> discussion in itself), and there really are no valid uses for opening a
> block device that is already mounted. More importantly, I don't think
> anybody actually does.

You can use LVM snapshot volumes to do it safely.

-Andi


