Re: BTRFS partition usage...
> The Sun disk label only allows you to specify the start of a partition
> in cylinders, so if you want to use a filesystem like XFS you have to
> start the partition on cylinder 1 which can be many blocks into the
> disk. That entire first cylinder is completely wasted.

I don't believe a cylinder of wasted disk space is significant, and I
don't believe a cylinder of disk is worth the added complexity of sharing
a partition with a filesystem.  That complexity translates into
engineering time and mistakes.

Your other arguments for making a hole in the filesystem, based on
tradition, are more convincing.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
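[For scale, a back-of-envelope sketch of how much "a cylinder" actually is.  The geometry below is the classic fake CHS layout (255 heads, 63 sectors per track, 512-byte sectors) that partitioning tools have long assumed; real drives stopped exposing true cylinder counts ages ago, so treat the numbers as illustrative.]

```python
# Rough size of one "cylinder" under the conventional fake CHS geometry.
# These values are the common tool defaults, not any particular drive.
HEADS = 255
SECTORS_PER_TRACK = 63
SECTOR_BYTES = 512

cylinder_bytes = HEADS * SECTORS_PER_TRACK * SECTOR_BYTES
print(cylinder_bytes)           # bytes wasted by skipping cylinder 0
print(cylinder_bytes / 2**20)   # the same, in MiB (just under 8)
```

So the "wasted" first cylinder is on the order of 8 MB -- which is the point above: not significant on a modern disk.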
Re: [RFC] Parallelize IO for e2fsck
>> Incidentally, some context for the AIX approach to the OOM problem: a
>> process may exclude itself from OOM vulnerability altogether.  It places
>> itself in "early allocation" mode, which means at the time it creates
>> virtual memory, it reserves enough backing store for the worst case.  The
>> memory manager does not send such a process the SIGDANGER signal or
>> terminate it when it runs out of paging space.  Before c. 2000, this was
>> the only mode.  Now the default is late allocation mode, which is similar
>> to Linux.
>
> This is an interesting approach.  It feels like some programs might be
> interested in choosing this mode instead of risking OOM.

It's the way virtual memory always worked when it was first invented.  The
system not only reserved space to back every page of virtual memory; it
assigned the particular blocks for it.  Late allocation was a later
innovation, and I believe its main goal was to make it possible to use the
cheaper disk drives for paging instead of drums.  Late allocation gives
you better locality on disk, so the seeking doesn't eat you alive (drums
don't seek).  Even then, I assume (but am not sure) that the system at
least reserved the space in an account somewhere, so at pageout time there
was guaranteed to be a place to page out to.  Overcommitting page space to
save on disk space was a later idea.

I was surprised to see AIX do late allocation by default, because IBM's
traditional style is bulletproof systems.  A system where a process can be
killed at unpredictable times because of the resource demands of unrelated
processes doesn't really fit that style.

It's really a fairly unusual application that benefits from late
allocation: one that creates a lot more virtual memory than it ever
touches.  For example, a sparse array.  Or am I missing something?
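[The "creates a lot more virtual memory than it ever touches" case can be seen directly on a Linux system with overcommit enabled.  A minimal sketch, assuming default overcommit settings: an anonymous mapping of 1 GiB is granted instantly, and only the pages actually written get backing store -- exactly the behavior early allocation mode forgoes in exchange for its guarantee.]

```python
import mmap

GiB = 1 << 30
buf = mmap.mmap(-1, GiB)      # 1 GiB of anonymous virtual memory, granted
                              # immediately under late allocation/overcommit

buf[0] = 0x2A                 # touch the first page...
buf[GiB - 1] = 0x2A           # ...and the last; the ~262142 pages between
                              # them never get committed

first, last = buf[0], buf[GiB - 1]
buf.close()
print(first, last)
```

Under an early-allocation regime, creating `buf` would have to reserve a full gigabyte of paging space up front, even though only two pages are ever used.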
--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: [RFC] Parallelize IO for e2fsck
> I think there is a clear need for applications to be able to
> register a callback from the kernel to indicate that the machine as
> a whole is running out of memory and that the application should
> trim its caches to reduce memory utilisation.
>
> Perhaps instead of swapping immediately, a SIGLOWMEM could be sent ...

The problem with that approach is that the fsck process doesn't know how
its need for memory compares with other processes' need for memory.  How
much memory should it give up?  Maybe it should just quit altogether if
other processes are in danger of deadlocking.  Or maybe it's best for it
to keep all its memory and let some other frivolous process give up its
memory instead.  It's the OS's job to have a view of the entire system and
make resource allocation decisions.

If it's just a matter of the application choosing a better page frame to
vacate than the one the kernel would have taken (which is more a matter of
self-interest than resource allocation), then fsck can do that more
directly by just monitoring its own page fault rate.  If the rate is high,
fsck is using more real memory than the kernel thinks it's entitled to,
and it can reduce its memory footprint to improve its speed.  It can even
check whether an access to readahead data caused a page fault; if so, it
knows reading ahead is actually making things worse, and it can reduce
readahead until the page faults stop happening.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
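[The self-tuning loop suggested above might look like the following.  This is purely a hypothetical sketch: the `fault_occurred` signal, the window sizes, and the halve/grow policy are all illustrative, not from any real fsck.]

```python
# Hypothetical controller for an fsck that watches its own page faults:
# a fault on readahead data means reading ahead is evicting pages we
# still need, so shrink the window; fault-free progress lets it regrow.
def adjust_readahead(window, fault_occurred, min_window=0, max_window=256):
    """Halve the readahead window (in blocks) after a fault on readahead
    data; otherwise grow it gently, up to max_window."""
    if fault_occurred:
        window //= 2            # reading ahead is hurting us: back off hard
    else:
        window = min(window + 8, max_window)   # cautiously regrow
    return max(window, min_window)

w = 128
w = adjust_readahead(w, fault_occurred=True)    # backs off to 64
w = adjust_readahead(w, fault_occurred=False)   # regrows to 72
print(w)
```

The asymmetry (halve on trouble, creep back up otherwise) is the usual additive-increase/multiplicative-decrease shape, chosen here because over-reading is much more costly than under-reading.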
Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
I just had a talk with a colleague, John Palmer, who worked on disk drive
design for about 5 years in the '90s, and he gave me a very confident,
credible explanation of some of the things we've been wondering about disk
drive power loss in this thread, complete with demonstrations of various
generations of disk drives, dismantled.

First of all, it is plain to see that there is no spring capable of
parking the head, and there is no capacitor that looks big enough to
supply the energy to park the head, in any of the models I looked at.
Since parking the heads is essential, we can only conclude that the story
of the kinetic energy of the disks being used for that (turned into
electricity by the drive motor) is true.  The energy required is not just
to move the heads to the parking zone, but to latch them there as well.
The myth is probably just that that energy is used for anything else: it's
really easy to build a dumb circuit to park the heads using that power;
keeping a computer running is something else.

The drive does drop a write in the middle of the sector if it is writing
at the time of power loss.  The designers were too conservative to keep
writing as power fails -- there's no telling what damage you might do.  So
the drive cuts the power to the heads at the first sign of power loss.  If
a write was in progress, this means there is one garbage sector on the
disk.  It can't be read.  Trying to finish writing the sector is something
I can imagine some drive model somewhere trying to do, but if even _some_
drives take the conservative approach, everyone has to design for it, so
it doesn't matter.

A drive might then reassign that sector the next time you try to write to
it (after failing to read it), thinking the medium must be bad.  But there
are various algorithms for deciding when to reassign a sector, so it might
not.
--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
"H. Peter Anvin" <[EMAIL PROTECTED]> wrote on 01/18/2008 07:08:30 AM:

> Bryan Henderson wrote:
> >
> > We weren't actually talking about writing out the cache.  While that
> > was part of an earlier thread which ultimately conceded that disk
> > drives most probably do not use the spinning disk energy to write out
> > the cache, the claim was then made that the drive at least survives
> > long enough to finish writing the sector it was writing, thereby
> > maintaining the integrity of the data at the drive level.  People
> > often say that a disk drive guarantees atomic writes at the sector
> > level even in the face of a power failure.
> >
> > But I heard some years ago from a disk drive engineer that that is a
> > myth, just like the rotational energy thing.  I added that to the
> > discussion, but admitted that I haven't actually seen a disk drive
> > write a partial sector.
>
> A disk drive whose power is cut needs to have enough residual power to
> park its heads (or *massive* data loss will occur), and at that point it
> might as well keep enough on hand to finish an in-progress sector write.
>
> There are two possible sources of onboard temporary power: a large
> enough capacitor, or the rotational energy of the platters (an
> electrical motor also being a generator).  I don't care which one they
> use, but they need to do something.

I believe the power for that comes from a third source: a spring.
Parking the heads is too important to leave to active circuits.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
Ric Wheeler <[EMAIL PROTECTED]> wrote on 01/17/2008 03:18:05 PM:

> Theodore Tso wrote:
> > On Wed, Jan 16, 2008 at 09:02:50PM -0500, Daniel Phillips wrote:
> >> Have you observed that in the wild?  A former engineer of a disk
> >> drive company suggests to me that the capacitors on the board
> >> provide enough power to complete the last sector, even to park the
> >> head.
> >
> > Even if true (which I doubt), this is not implemented.
>
> A modern drive can have 16-32 MB of write cache.  Worst case, those
> sectors are not sequential, which implies lots of head movement.

We weren't actually talking about writing out the cache.  While that was
part of an earlier thread which ultimately conceded that disk drives most
probably do not use the spinning disk energy to write out the cache, the
claim was then made that the drive at least survives long enough to finish
writing the sector it was writing, thereby maintaining the integrity of
the data at the drive level.  People often say that a disk drive
guarantees atomic writes at the sector level even in the face of a power
failure.

But I heard some years ago from a disk drive engineer that that is a myth,
just like the rotational energy thing.  I added that to the discussion,
but admitted that I haven't actually seen a disk drive write a partial
sector.

Ted brought up the separate issue of the host sending garbage to the disk
device because its own power is failing at the same time, which makes the
integrity at the disk level moot (or even undesirable, as you'd rather
write a bad sector than a good one with the wrong data).

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
"Daniel Phillips" <[EMAIL PROTECTED]> wrote on 01/16/2008 06:02:50 PM:

> On Jan 16, 2008 2:06 PM, Bryan Henderson <[EMAIL PROTECTED]> wrote:
> > > The "disk motor as a generator" tale may not be purely folklore.
> > > When an IDE drive is not in writeback mode, something special needs
> > > to be done to ensure the last write to media is not a scribble.
> >
> > No it doesn't.  The last write _is_ a scribble.
>
> Have you observed that in the wild?  A former engineer of a disk drive
> company suggests to me that the capacitors on the board provide enough
> power to complete the last sector, even to park the head.

No, I haven't.  It's hearsay, and from about 3 years ago.

As for parking the head, that's hard to believe, since it's so easy and
more reliable to use a spring and an electromagnet.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
> The "disk motor as a generator" tale may not be purely folklore.  When
> an IDE drive is not in writeback mode, something special needs to be
> done to ensure the last write to media is not a scribble.

No it doesn't.  The last write _is_ a scribble.  Systems that make atomic
updates to disk drives use a shadow update mechanism and write the master
sector twice.  If the power fails in the middle of writing one, it will
almost certainly be unreadable due to a CRC failure, and the other one
will have either the old or the new master block contents.  And I think
there's a problem with drives that, upon sensing the unreadable sector,
assign an alternate even though the sector is fine, so that you eventually
run out of spares.

Incidentally, while this primitive behavior applies to IDE (ATA et al.)
drives, those aren't the only thing people put filesystems on.  Many
important filesystems go on higher level storage subsystems that contain
IDE drives plus cache memory and batteries.  A device like this _does_
make sure that all data it says has been written is actually retrievable
even if there's a subsequent power outage, even while giving the
performance of writeback caching.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
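[The shadow-update scheme described above can be sketched in a few lines.  This is an illustrative model, not any real filesystem's on-disk format: the master record lives in two slots, each carrying a sequence number and a CRC, and recovery takes the newest slot whose CRC checks out -- so a torn write of one slot can only ever lose the update in flight, never the previous good copy.]

```python
import zlib

def make_slot(seq, payload):
    """Encode one master-record slot: 8-byte sequence number, payload,
    then a CRC32 over both (field sizes are illustrative)."""
    body = seq.to_bytes(8, "little") + payload
    return body + zlib.crc32(body).to_bytes(4, "little")

def read_slot(raw):
    """Decode a slot; return None if the CRC says it was torn/corrupt."""
    body, crc = raw[:-4], int.from_bytes(raw[-4:], "little")
    if zlib.crc32(body) != crc:
        return None
    return int.from_bytes(body[:8], "little"), body[8:]

def recover(slot_a, slot_b):
    """Pick the payload of the valid slot with the highest sequence."""
    candidates = [s for s in (read_slot(slot_a), read_slot(slot_b)) if s]
    return max(candidates)[1]

old = make_slot(1, b"old master")
new = make_slot(2, b"new master")
torn = new[:10] + b"\x00" * (len(new) - 10)   # power failed mid-write

print(recover(old, new))    # both valid: the newer one wins
print(recover(old, torn))   # torn slot fails its CRC: old copy survives
```

The key property, matching the text: a failure in the middle of writing one copy makes that copy unreadable (CRC mismatch), and the surviving copy holds either the old or the new contents, both of which are consistent states.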
Re: [ANNOUNCE] util-linux-ng 2.13-rc1
> the maintainers of util-linux have well versed autotool people at their
> disposal, so i really dont see this as being worrisome.

As long as that is true, I agree that the fact that so many autotool
packages don't work well is irrelevant.  However, I think the difficulty
of using autotools (I mean use by packagers), as evidenced by all the
people who get it wrong, justifies people being skeptical that util-linux
really has that expertise available.

Also, many open source projects are developed by a large, diverse group of
people, so even if there exist people who can do the autotools right, it
doesn't mean they'll be done right.  One reason I try to minimize the
number of tools/skills used in maintaining packages I distribute is to
enable a larger group of people to help me maintain them.
Re: [ANNOUNCE] util-linux-ng 2.13-rc1
> i dont see how blaming autotools for other people's misuse is relevant

Here's how other people's misuse of the tool can be relevant to the choice
of the tool: some tools are easier to use right than others.  Probably the
easiest thing to use right is the system you designed and built yourself.
I've considered distributing code with an Autotools-based build system
before and determined quickly that I am not up to that challenge.  (The
bigger part of the challenge isn't writing the original input files; it's
debugging when a user says his build doesn't work.)  But as far as I know,
my hand-rolled build system is used correctly by me.

>> checks the width of integers on i386 for projects not caring about
>> that and fails to find installed libraries without telling how it was
>> supposed to find them or how to make it find that library.
>
> no idea what this rant is about.

The second part sounds like my number 1 complaint as a user of
Autotools-based packages: 'configure' often can't find my libraries.  I
know exactly where they are, and even what compiler and linker options are
needed to use them, but it often takes a half hour of tracing 'configure'
or generated make files to figure out how to force the build to understand
the same thing.  And that's with lots of experience.  The first five times
it was much more frustrating.

>> Configuring the build of an autotools program is harder than
>> necessary; if it used a config file, you could easily save it
>> somewhere while adding comments on how and why you made *that* choice,
>> and you could possibly use a set of default configs which you'd just
>> include.
>
> history shows this is a pita to maintain.  every package has its own
> build system and configuration file ...

It's my understanding that autotools _does_ provide that ability (as
stated, though I think "config file" may have been meant here as
"config.make").
The config file is a shell script that contains a 'configure' command with
a pile of options on it, and as many comments as you want, to tailor the
build to your requirements.
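[Such a saved configuration might look like the following.  The prefix, flags, and options are all illustrative -- the point is only that the whole invocation, with its rationale, lives in one commented, re-runnable file.]

```shell
#!/bin/sh
# build-config.sh -- a saved, commented 'configure' invocation, so the
# build can be reproduced (and the reasons remembered) later.
# All paths and options below are illustrative.

PREFIX=/opt/local    # our libraries live under a non-default prefix

# Tell configure (and the compiler/linker) where to find those libraries,
# which is exactly the "configure can't find my libraries" fight above.
#   --disable-static : shared libraries only; halves the build time
#   --without-x      : this is a headless build machine
CPPFLAGS="-I$PREFIX/include" LDFLAGS="-L$PREFIX/lib" \
./configure --prefix="$PREFIX" --disable-static --without-x
```

Re-running the script reproduces the build choices exactly, and the comments record why each choice was made.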
Re: Patent or not patent a new idea
> If your only purpose is to try to generate a defensive patent, then just
> dumping the idea in the public domain serves the same purpose, probably
> better.
>
> I have a few patents, some of which are defensive. That has not prevented
> the USPTO issuing quite a few patents that are in clear violation of mine.

That's not what a defensive patent is. Indeed, patenting something just so someone else can't patent it is ridiculous, because publishing is so much easier. A defensive patent is one you file so that you can trade rights to it for rights to other patents that you need.
Re: Versioning file system
> The directory is quite visible with a standard 'ls -a'. Instead,
> they simply mark it as a separate volume/filesystem: i.e. the fsid
> differs when you call stat(). The whole thing ends up acting rather like
> our bind mounts.

Hmm. So it breaks user space quite a bit. By "break," I mean uses that work with more conventional filesystems stop working if you switch to NetApp. Most programs that operate on directory trees willingly cross filesystems, right? Even ones that give you the option not to, such as GNU cp, cross by default.

But if the implementation is, as described, wildly successful, that means users are willing to tolerate this level of breakage, so it could be used for versioning too. But I think I'd rather see a truly hidden directory for this (visible only when looked up explicitly).

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: Versioning file system
> We don't need a new special character for every
> new feature. We've got one, and it's flexible enough to do what you want,
> as proven by NetApp's extremely successful implementation.

I don't know NetApp's implementation, but I assume it is more than just a choice of special character. If you merely start the directory name with a dot, you don't fool anyone but 'ls' and shell wildcard expansion. (And for some enlightened people like me, you don't even fool ls, because we use the --almost-all option to show the dot files by default, having been burned too many times by invisible files.) I assume NetApp flags the directory specially so that a POSIX directory read doesn't get it. I've seen that done elsewhere.

The same thing, by the way, is possible with Jack's filename:version idea, and I assumed that's what he had in mind. Not that that makes it all OK.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: Versioning file system
> The question that remains is where to implement versioning: directly in
> individual filesystems or in the vfs code so all filesystems can use it?

Or not in the kernel at all. I've been doing versioning of the types I described for years with user-space code, and I don't remember feeling that I compromised in order not to involve the kernel.

Of course, if you want to do it with snapshots and COW, you'll have to ask where in the kernel to put that, but that's not a file versioning question; it's the larger snapshot question.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: Finding hardlinks
> On any decent filesystem st_ino should uniquely identify an object and
> reliably provide hardlink information. The UNIX world has relied upon this
> for decades. A filesystem with st_ino collisions without being hardlinked
> (or the other way around) needs a fix.

But for at least the last of those decades, filesystems that could not do that were not uncommon. They had to present 32-bit inode numbers and either allowed more than 4G files or just didn't have the means of assigning inode numbers with the proper uniqueness to files. And the sky did not fall. I don't have an explanation why, but it makes it look to me like there are worse things than not having a total one-to-one correspondence between inode numbers and files. Having a stat or mount fail because inodes are too big, having fewer than 4G files, and waiting for the filesystem to generate a suitable inode number might fall in that category.

I fully agree that much effort should be put into making inode numbers work the way POSIX demands, but I also know that that sometimes requires more than just writing some code.

--
Bryan Henderson                                   San Jose California
IBM Almaden Research Center                       Filesystems
Re: GFS, what's remaining
I have to correct an error in perspective, or at least in the wording of it, in the following, because it affects how people see the big picture in trying to decide how the filesystem types in question fit into the world:

> Shared storage can be more efficient than network file
> systems like NFS because the storage access is often more efficient
> than network access

The shared storage access _is_ network access. In most cases, it's a Fibre Channel/FCP network. Nowadays, it's more and more common for it to be a TCP/IP network just like the one folks use for NFS (but carrying iSCSI instead of NFS). It's also been done with a handful of other TCP/IP-based block storage protocols.

The reason the storage access is expected to be more efficient than the NFS access is that the block access network protocols are supposed to be more efficient than the file access network protocols. In reality, I'm not sure there really is such a difference in efficiency between the protocols. The demonstrated differences in efficiency, or at least in speed, are due to other things that differ between a given new shared block implementation and a given old shared file implementation.

But there's another advantage to shared block over shared file that hasn't been mentioned yet: some people find it easier to manage a pool of blocks than a pool of filesystems.

> it is more reliable because it doesn't have a
> single point of failure in form of the NFS server.

This advantage isn't because it's shared (block) storage, but because it's a distributed filesystem. There are shared storage filesystems (e.g. IBM SANFS, ADIC StorNext) that have a centralized metadata or locking server that makes them unreliable (or unscalable) in the same ways as an NFS server.
--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: [RFC] atomic open(..., O_CREAT | ...)
> Have you looked at how we're dealing with this in NFSv4?

No.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: [RFC] atomic open(..., O_CREAT | ...)
> Intents are meant as optimisations, not replacements for existing
> operations. I'm therefore not really comfortable about having them
> return errors at all.

That's true of normal intents, but not what are called intents here. A normal intent merely expresses an intent, and it can be totally ignored without harm to correctness. But these "intents" were designed to be responded to by actually performing the foreshadowed operation now - irreversibly.

Linux needs an atomic lookup/open/create in order to participate in a shared filesystem and provide a POSIX interface (where shared filesystem means a filesystem that is simultaneously accessed by something besides the Linux system in question). Some operating systems do this simply with a VFS lookup/open/create function. Linux does it with this intents interface. It's hard to merge the concepts in code or in one's mind, which is why we're here now. A filesystem driver that needs to do atomic lookup/open/create has to bend over backwards to split the operation across the three filesystem driver calls that Linux wants to make.

I've always preferred just to have a new inode operation for lookup/open/create (mirroring the POSIX open operation, used for all opens if available), but if enough arguments to lookup can do it, that's practically as good. But that means returning final status from lookup, and not under any circumstance proceeding to create or open when the filesystem driver has said the entire operation is complete.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: share/private/slave a subtree - define vs enum
> I don't see how the following is tortured:
>
> enum {
>     PNODE_MEMBER_VFS = 0x01,
>     PNODE_SLAVE_VFS = 0x02
> };

Only because it's using a facility that's supposed to be for enumerated types for something that isn't. If it were a true enumerated type, the codes for the enumerations (0x01, 0x02) would be quite arbitrary, whereas here they must fundamentally be integers whose binary representation has exactly one 1 bit (because, as I understand it, these are used as bitmasks somewhere).

I can see that this paradigm has practical advantages over using macros (or a middle ground - integer constants), but only as a byproduct of what the construct is really for.
Re: share/private/slave a subtree - define vs enum
> If it's really enumerated data types, that's fine, but this example was
> about bitfield masks.

Ah. In that case, enum is a pretty tortured way to declare it, though it does have the practical advantages over define that have been mentioned because the syntax is more rigorous. The proper way to do bitfield masks is usually C bit field declarations, but I understand that tradition works even more strongly against using those than against using enum to declare enumerated types.

> there is _nothing_ wrong with using defines for constants.

I disagree with that; I find practical and, more importantly, philosophical reasons not to use defines for constants. I'm sure you've heard the arguments; I just didn't want to let that statement go uncontested.
Re: share/private/slave a subtree - define vs enum
I wasn't aware anyone preferred defines to enums for declaring enumerated data types. The practical advantages of enums are slight, but as far as I know, the practical advantages of defines are zero. Isn't the only argument for defines "that's what I'm used to"?

Two advantages of the enum declaration that haven't been mentioned yet, that help me significantly:

- If you have a typo in a define, it can be really hard to interpret the compiler error messages. The same typo in an enum gets a pointed error message referring to the line that has the typo.

- GCC warns you if a switch statement doesn't handle every case. I often add an enumeration and GCC lets me know where I forgot to consider it.

The macro language is one of the most hated parts of the C language; it makes sense to try to avoid it as a general rule.
Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages - loop device
> I did a patch which switched loop to use the file_operations.read/write
> about a year ago. Forget what happened to it. It always seemed the right
> thing to do..

This is unquestionably the right thing to do (at least compared to what we have now). The loop device driver has no business assuming that the underlying filesystem uses the generic routines. I always assumed it was a simple design error that it did. (Such errors are easy to make because prepare_write and commit_write are declared as address space operations, when they're really private to the buffer cache and generic writer.)

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages - nopage alternative
>>>> And for the vmscan->writepage() side of things I wonder if it would be
>>>> possible to overload the mapping's ->nopage handler. If the target page
>>>> lies in a hole, go off and allocate all the necessary pagecache pages,
>>>> zero them, mark them dirty?
>>>
>>> I guess it would be possible but ->nopage is used for the read case and
>>> why would we want to then cause writes/allocations?
>>
>> yup, we'd need to create a new handler for writes, or pass `write_access'
>> into ->nopage. I think others (dwdm2?) have seen a need for that.
>
> That would work as long as all writable mappings are actually written to
> everywhere. Otherwise you still get that reading the whole mmap()ped
> area but writing a small part of it would still instantiate all of it on
> disk. As far as I understand this there is no way to hook into the mmap
> system such that we have a hook whenever a mmap()ped page gets written
> to for the first time. (I may well be wrong on that one so please
> correct me if that is the case.)

I think the point is that we can't have a "handler for writes," because the writes are being done by simple CPU store instructions in a user program. The handler we're talking about is just for page faults.

Other operating systems approach this by actually _having_ a handler for a CPU store instruction, in the form of a page protection fault handler -- the nopage routine adds the page to the user's address space, but write-protects it. The first time the user tries to store into it, the filesystem driver gets a chance to do what's necessary to support a dirty cache page -- allocate a block, add additional dirty pages to the cache, etc. It would be wonderful to have that in Linux. I saw hints of such code in a Linux kernel once (a "write_protect" address space operation or something like that); I don't know what happened to it.

Short of that, I don't see any way to avoid sometimes filling in holes due to reads.
It's not a huge problem, though -- it requires someone to do a shared writable mmap and then read lots of holes and not write to them, which is a pretty rare situation for a normal file.

I didn't follow how the helper function solves this problem. If it's something involving adding the required extra pages to the cache at pageout time, then that's not going to work -- you can't make adding pages to the cache a prerequisite for cleaning a page; that would be Deadlock City.

My large-block filesystem driver does the nopage thing, and does in fact fill in files unnecessarily in this scenario. :-( The driver for the same filesystems on AIX does not, though. It has the write protection thing.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages - nopage alternative
And for the vmscan-writepage() side of things I wonder if it would be possible to overload the mapping's -nopage handler. If the target page lies in a hole, go off and allocate all the necessary pagecache pages, zero them, mark them dirty? I guess it would be possible but -nopage is used for the read case and why would we want to then cause writes/allocations? yup, we'd need to create a new handler for writes, or pass `write_access' into -nopage. I think others (dwdm2?) have seen a need for that. That would work as long as all writable mappings are actually written to everywhere. Otherwise you still get that reading the whole mmap()ped are but writing a small part of it would still instantiate all of it on disk. As far as I understand this there is no way to hook into the mmap system such that we have a hook whenever a mmap()ped page gets written to for the first time. (I may well be wrong on that one so please correct me if that is the case.) I think the point is that we can't have a handler for writes, because the writes are being done by simple CPU Store instructions in a user program. The handler we're talking about is just for page faults. Other operating systems approach this by actually _having_ a handler for a CPU store instruction, in the form of a page protection fault handler -- the nopage routine adds the page to the user's address space, but write protects it. The first time the user tries to store into it, the filesystem driver gets a chance to do what's necessary to support a dirty cache page -- allocate a block, add additional dirty pages to the cache, etc. It would be wonderful to have that in Linux. I saw hints of such code in a Linux kernel once (a write_protect address space operation or something like that); I don't know what happened to it. Short of that, I don't see any way to avoid sometimes filling in holes due to reads. 
Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages - loop device
> I did a patch which switched loop to use the file_operations.read/write
> about a year ago. Forget what happened to it. It always seemed the
> right thing to do..

This is unquestionably the right thing to do (at least compared to what we have now). The loop device driver has no business assuming that the underlying filesystem uses the generic routines. I always assumed it was a simple design error that it did. (Such errors are easy to make because prepare_write and commit_write are declared as address space operations, when they're really private to the buffer cache and generic writer.)

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: [RFC] sane access to per-fs metadata (was Re: [PATCH] Documentation/ioctl-number.txt)
> How it can be used? Well, say you've mounted JFS on /usr/local
> % mount -t jfsmeta none /mnt -o jfsroot=/usr/local
> % ls /mnt
> stats control bootcode whatever_I_bloody_want
> % cat /mnt/stats
> master is on /usr/local
> fragmentation = 5%
> 696942 reads, yodda, yodda
> % echo "defrag 69 whatever 42 13" > /mnt/control
> % umount /mnt

There's a lot of cool simplicity in this, both in implementation and application, but it leaves something to be desired in functionality. This is partly because the price you pay for being able to use existing, well-worn Unix interfaces is the ancient limitations of those interfaces -- like the inability to return adequate error information.

Specifically, transactional stuff looks really hard in this method. If I want the user to know why his "defrag" command failed, how would I pass that information back to him? What if I want to warn him of a filesystem inconsistency I found along the way? Or inform him of how effective the defrag was? And bear in mind that multiple processes may be issuing commands to /mnt/control simultaneously.

With ioctl, I can easily match a response of any kind to a request. I can even return an English text message if I want to be friendly.
RE: Linux not adhering to BIOS Drive boot order?
> If we can truly go for label based mounting
> and lilo'ing this would solve the problem.

From a layering point of view, it makes a lot more sense to me for the label (or signature or whatever) for this purpose to be in the partition table than inside the filesystem. The parts of the system that assign devices their identities already know about that part of the disk.