Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]

2006-08-01 Thread Theodore Tso
On Mon, Jul 31, 2006 at 09:41:02PM -0700, David Lang wrote:
 just because you have redundancy doesn't mean that your data is idle enough
 for you to run a repacker with your spare cycles. to run a repacker you
 need a time when the chunk of the filesystem that you are repacking is not
 being accessed or written to. it doesn't matter if that data lives on one
 disk or 9 disks all mirroring the same data, you can't just break off 1 of
 the copies and repack that because by the time you finish it won't match
 the live drives anymore.
 
 database servers have a repacker (vacuum), and they are under tremendous
 pressure from their users to avoid having to use it because of the
 performance hit that it generates. (the theory in the past is exactly what
 was presented in this thread, make things run faster most of the time and
 accept the performance hit when you repack). the trend seems to be for a
 repacker thread that runs continuously, causing a small impact all the time
 (that can be calculated into the capacity planning) instead of a large
 impact once in a while.

Ah, but as soon as the repacker thread runs continuously, then you
lose all or most of the claimed advantage of wandering logs.
Specifically, the claim of the wandering log is that you don't have
to write your data twice --- once to the log, and once to the final
location on disk (whereas with ext3 you end up having to do double
writes).  But if the repacker is running continuously, you end up
doing double writes anyway, as the repacker moves things from a
location that is convenient for the log, to a location which is
efficient for reading.  Worse yet, if the repacker is moving disk
blocks or objects which are no longer in cache, it may end up having
to read objects in before writing them to a final location on disk.
So instead of a write-write overhead, you end up with a
write-read-write overhead.
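
A rough back-of-envelope sketch of that accounting, in Python; the data volume, the 30% cold fraction, and the assumption that the repacker relocates everything it touches are illustrative worst-case figures, not measurements of any filesystem:

# Illustrative write-amplification accounting; all numbers are assumptions.
new_data_mb   = 1000      # logical data written by applications over some period
cold_fraction = 0.3       # assumed share of repacked blocks that fell out of cache

# ext3-style journaling: data hits the journal, then its final location.
ext3_mb_written = 2 * new_data_mb

# Wandering log / copy-on-write with no repacker: written once, wherever the log put it.
wandering_mb_written = 1 * new_data_mb

# Continuous repacker (worst case: it relocates everything): the original write,
# plus the relocation write, plus a read for anything no longer cached.
repacker_mb_written = 2 * new_data_mb
repacker_mb_read    = cold_fraction * new_data_mb

print("ext3          : %5d MB written" % ext3_mb_written)
print("wandering log : %5d MB written" % wandering_mb_written)
print("log + repacker: %5d MB written, %d MB re-read" %
      (repacker_mb_written, repacker_mb_read))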

But of course, people tend to disable the repacker when doing
benchmarks because they're trying to play the "my filesystem/database
has bigger performance numbers than yours" game.

- Ted


Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]

2006-08-01 Thread Avi Kivity

Theodore Tso wrote:


Ah, but as soon as the repacker thread runs continuously, then you
lose all or most of the claimed advantage of wandering logs.
Specifically, the claim of the wandering log is that you don't have
to write your data twice --- once to the log, and once to the final
location on disk (whereas with ext3 you end up having to do double
writes).  But if the repacker is running continuously, you end up
doing double writes anyway, as the repacker moves things from a
location that is convenient for the log, to a location which is
efficient for reading.  Worse yet, if the repacker is moving disk
blocks or objects which are no longer in cache, it may end up having
to read objects in before writing them to a final location on disk.
So instead of a write-write overhead, you end up with a
write-read-write overhead.



There's no reason to repack *all* of the data.  Many workloads write and 
delete whole files, so file data should be contiguous.  The repacker 
would only need to move metadata and small files.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]

2006-08-01 Thread Matthias Andree
On Tue, 01 Aug 2006, Avi Kivity wrote:

 There's no reason to repack *all* of the data.  Many workloads write and 
 delete whole files, so file data should be contiguous.  The repacker 
 would only need to move metadata and small files.

Move small files? What for?

Even if it is only moving metadata, it is no different from what ext3
or XFS do today (rewriting metadata from the intent log or block
journal to the final location).

UFS+softupdates from the BSD world looks pretty good at avoiding
unnecessary writes (at the expense of a long-running background fsck
after a crash, though that is easy on the I/O as of recent FreeBSD
versions).  That was their main argument against logging/journaling,
BTW, but they are porting XFS as well for those who need instant,
complete recovery.

-- 
Matthias Andree


Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]

2006-08-01 Thread Avi Kivity

Matthias Andree wrote:


On Tue, 01 Aug 2006, Avi Kivity wrote:

 There's no reason to repack *all* of the data.  Many workloads write and
 delete whole files, so file data should be contiguous.  The repacker
 would only need to move metadata and small files.

Move small files? What for?



WAFL-style filesystems like contiguous space, so if small files are
scattered in otherwise free space, the repacker should move them to free
that space.



Even if it is only moving metadata, it is no different from what ext3
or XFS do today (rewriting metadata from the intent log or block
journal to the final location).



There is no need to repack all metadata; only that which helps in 
creating free space.


For example: if you untar a source tree you'd get mixed metadata and 
small file data packed together, but there's no need to repack that data.
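
A minimal Python sketch of that selection policy, purely to make it concrete; the extent records, the size threshold, and the "mostly free" ratio are hypothetical illustrations, not reiser4 or WAFL interfaces:

# Relocate only what helps create contiguous free space: metadata and small
# files stranded in otherwise-free regions.  Leave everything else alone.
SMALL_FILE_BYTES = 64 * 1024      # assumed threshold for a "small" file
MOSTLY_FREE      = 0.5            # assumed "scattered in otherwise free space"

def worth_repacking(extent):
    # Densely used regions (e.g. a freshly untarred source tree) stay put.
    if extent["region_free_ratio"] <= MOSTLY_FREE:
        return False
    if extent["kind"] == "metadata":
        return True
    return extent["length"] <= SMALL_FILE_BYTES

# Example: big contiguous data and a packed untar are left alone; a lone small
# file or metadata block pinning a nearly empty region gets moved.
extents = [
    {"kind": "file",     "length": 700 * 1024 * 1024, "region_free_ratio": 0.80},  # big file: leave
    {"kind": "file",     "length": 12 * 1024,         "region_free_ratio": 0.95},  # stranded small file: repack
    {"kind": "metadata", "length": 4 * 1024,          "region_free_ratio": 0.90},  # scattered metadata: repack
    {"kind": "file",     "length": 12 * 1024,         "region_free_ratio": 0.10},  # fresh untar: leave
]
for e in extents:
    print(e["kind"], e["length"], "->", "repack" if worth_repacking(e) else "leave")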



--
error compiling committee.c: too many arguments to function



Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]

2006-08-01 Thread Hans Reiser
Theodore Tso wrote:

On Mon, Jul 31, 2006 at 09:41:02PM -0700, David Lang wrote:
  

just because you have redundancy doesn't mean that your data is idle enough
for you to run a repacker with your spare cycles. to run a repacker you
need a time when the chunk of the filesystem that you are repacking is not
being accessed or written to. it doesn't matter if that data lives on one
disk or 9 disks all mirroring the same data, you can't just break off 1 of
the copies and repack that because by the time you finish it won't match
the live drives anymore.

database servers have a repacker (vacuum), and they are under tremendous
pressure from their users to avoid having to use it because of the
performance hit that it generates. (the theory in the past is exactly what
was presented in this thread, make things run faster most of the time and
accept the performance hit when you repack). the trend seems to be for a
repacker thread that runs continuously, causing a small impact all the time
(that can be calculated into the capacity planning) instead of a large
impact once in a while.



Ah, but as soon as the repacker thread runs continuously, then you
lose all or most of the claimed advantage of wandering logs.
  

Wandering logs is a term specific to reiser4, and I think you are making
a more general remark.

You are missing the implications of the oft-cited statistic that 80% of
files never or rarely move.  You are also missing the implications of
the repacker being able to do larger I/Os than a random tiny-I/O
workload hitting a filesystem that has to perform allocations on the
fly.

Specifically, the claim of the wandering log is that you don't have
to write your data twice --- once to the log, and once to the final
location on disk (whereas with ext3 you end up having to do double
writes).  But if the repacker is running continuously, you end up
doing double writes anyway, as the repacker moves things from a
location that is convenient for the log, to a location which is
efficient for reading.  Worse yet, if the repacker is moving disk
blocks or objects which are no longer in cache, it may end up having
to read objects in before writing them to a final location on disk.
So instead of a write-write overhead, you end up with a
write-read-write overhead.

But of course, people tend to disable the repacker when doing
benchmarks because they're trying to play the "my filesystem/database
has bigger performance numbers than yours" game.
  

When the repacker is done, we will, just for you, run one of our
benchmarks the morning after the repacker has run (and reference this
email) ;-)  That was what you wanted us to do to address your
concern, yes? ;-)

   - Ted


  




Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]

2006-08-01 Thread Jan Engelhardt

Wandering logs is a term specific to reiser4, and I think you are making
a more general remark.

So, what is UDF's wandering log then?



Jan Engelhardt
-- 


Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]

2006-08-01 Thread David Masover

Theodore Tso wrote:


Ah, but as soon as the repacker thread runs continuously, then you
lose all or most of the claimed advantage of wandering logs.

[...]

So instead of a write-write overhead, you end up with a
write-read-write overhead.


This would tend to suggest that the repacker should not run constantly,
but also that while it is running, performance could still be almost as
good as ext3's.



But of course, people tend to disable the repacker when doing
benchmarks because they're trying to play the "my filesystem/database
has bigger performance numbers than yours" game.


So you run your own benchmarks, I'll run mine...  Benchmarks for
everyone!  I'd especially like to see what performance is like with the
repacker not running, and during the repack.  If performance during a
repack is comparable to ext3, I think we win, although we have to amend
that statement to "My filesystem/database has the same or bigger
performance numbers than yours."


Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]

2006-08-01 Thread Ian Stirling

David Masover wrote:

David Lang wrote:


On Mon, 31 Jul 2006, David Masover wrote:

Oh, I'm curious -- do hard drives ever carry enough 
battery/capacitance to cover their caches?  It doesn't seem like it 
would be that hard/expensive, and if it is done that way, then I 
think it's valid to leave them on.  You could just say that other 
filesystems aren't taking as much advantage of newer drive features 
as Reiser :P



there are no drives that have the ability to flush their cache after
they lose power.



Aha, so back to the usual argument:  UPS!  It takes a fraction of a 
second to flush that cache.


You probably don't actually want to flush the cache - but to write
to a journal.
16M of cache, split into 32000 writes to single sectors spread over
the disk, could well take several minutes to write. Slapping it onto
a journal would take well under 0.2 seconds.
That's a non-trivial amount of energy storage though - 3J or so, [EMAIL PROTECTED] -
a moderately large/expensive capacitor.

And if you've got to spin the drive up, you've just added another
order of magnitude.
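
The arithmetic behind those two paragraphs, as a small Python sketch; the IOPS, throughput, power, and spin-up figures are assumptions for a typical mid-2000s 7200rpm disk, not specs for any particular drive:

# Rough numbers only; every drive characteristic below is an assumption.
cache_bytes  = 16 * 1024 * 1024   # 16M write cache
sector_bytes = 512
random_iops  = 150.0              # assumed random single-sector writes per second
seq_bytes_s  = 80e6               # assumed sequential throughput for a journal region
write_watts  = 12.0               # assumed power draw while seeking/writing
spinup_watts = 25.0               # assumed draw during spin-up
spinup_secs  = 5.0

sectors         = cache_bytes // sector_bytes    # ~32768 scattered writes
scattered_flush = sectors / random_iops          # minutes, as claimed above
journal_flush   = cache_bytes / seq_bytes_s      # a couple of tenths of a second

print("scattered flush : %6.0f s (~%.1f min)" % (scattered_flush, scattered_flush / 60))
print("journal flush   : %6.2f s" % journal_flush)
print("energy (journal): %6.1f J" % (write_watts * journal_flush))   # ~3 J, capacitor-sized
print("energy (spin-up): %6.0f J" % (spinup_watts * spinup_secs))    # another order of magnitude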

You can see why a flash backup of the write cache may be nicer.
You can do it if the disk isn't spinning.
It uses moderately less energy - and at a much lower rate, which
means the power supply can be _much_ cheaper. I'd guess it's the
difference between under $2 and $10.
And if you can use it as a lazy write cache for laptops - things
just got better battery-life-wise too.


Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]

2006-08-01 Thread David Masover

Ian Stirling wrote:

David Masover wrote:

David Lang wrote:


On Mon, 31 Jul 2006, David Masover wrote:

Oh, I'm curious -- do hard drives ever carry enough 
battery/capacitance to cover their caches?  It doesn't seem like it 
would be that hard/expensive, and if it is done that way, then I 
think it's valid to leave them on.  You could just say that other 
filesystems aren't taking as much advantage of newer drive features 
as Reiser :P



there are no drives that have the ability to flush their cache after
they lose power.



Aha, so back to the usual argument:  UPS!  It takes a fraction of a 
second to flush that cache.


You probably don't actually want to flush the cache - but to write
to a journal.
16M of cache, split into 32000 writes to single sectors spread over
the disk, could well take several minutes to write. Slapping it onto
a journal would take well under 0.2 seconds.
That's a non-trivial amount of energy storage though - 3J or so, [EMAIL PROTECTED] -
a moderately large/expensive capacitor.


Before we get ahead of ourselves, remember:  ~$200 buys you a huge 
amount of battery storage.  We're talking several minutes for several 
boxes, at the very least -- more like 10 minutes.


But yes, a journal or a software suspend.


Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]

2006-07-31 Thread David Lang

On Mon, 31 Jul 2006, David Masover wrote:

Oh, I'm curious -- do hard drives ever carry enough battery/capacitance to 
cover their caches?  It doesn't seem like it would be that hard/expensive, 
and if it is done that way, then I think it's valid to leave them on.  You 
could just say that other filesystems aren't taking as much advantage of 
newer drive features as Reiser :P


there are no drives that have the ability to flush their cache after they lose
power.


now, that being said, /. had a story within the last couple of days about hard
drive manufacturers adding flash to their hard drives. they may be aiming to add
some non-volatile cache capability to their drives, although I didn't think that
flash writes were that fast (needed if you dump the cache to flash when you
lose power), or that easy on power (given that you would first lose power),
and flash has limited write cycles (needed if you always use the cache).


I've heard of too many fancy-sounding drive technologies that never hit the market;
I'll wait until they are actually available before I start counting on them for
anything (let alone design/run a filesystem that requires them :-)


external battery-backed cache is readily available, either on high-end RAID
controllers or as separate RAM drives (and in RAID array boxes), but nothing on
individual drives.


David Lang


Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]

2006-07-31 Thread David Masover

David Lang wrote:

On Mon, 31 Jul 2006, David Masover wrote:

Oh, I'm curious -- do hard drives ever carry enough 
battery/capacitance to cover their caches?  It doesn't seem like it 
would be that hard/expensive, and if it is done that way, then I think 
it's valid to leave them on.  You could just say that other 
filesystems aren't taking as much advantage of newer drive features as 
Reiser :P


there are no drives that have the ability to flush their cache after
they lose power.


Aha, so back to the usual argument:  UPS!  It takes a fraction of a 
second to flush that cache.


now, that being said, /. had a story within the last couple of days
about hard drive manufacturers adding flash to their hard drives. they
may be aiming to add some non-volatile cache capability to their drives,
although I didn't think that flash writes were that fast (needed if you
dump the cache to flash when you lose power), or that easy on power
(given that you would first lose power), and flash has limited write
cycles (needed if you always use the cache).


But, the point of flash was not to replace the RAM cache, but to be
another level.  That is, you have your flash, which may be as fast as the
disk, maybe faster, maybe slower, and you have maybe a gig worth of it.
Even the bloatiest of OSes aren't really all that big -- my OS X came
installed, with all kinds of apps I'll never use, in less than 10 gigs.


And I think this story was a while ago (a dupe?  Not surprising), and the
point of the flash is that as long as your read/write cache doesn't run
out, and you're still in that 1 gig of flash, you're a bit safer than
with the RAM cache, and you can also leave the disk off, as in, spun down.
Parked.


Very useful for a laptop -- I used to do this in Linux by using Reiser4, 
setting the disk to spin down, and letting lazy writes do their thing, 
but I didn't have enough RAM, and there's always the possibility of 
losing data.  But leaving the disk off is nice, because in the event of 
sudden motion, it's safer that way.  Besides, most hardware gets 
designed for That Other OS, which doesn't support any kind of Laptop 
Mode, so it's nice to be able to enforce this at a hardware level, in a 
safe way.


I've heard of too many fancy-sounding drive technologies that never hit the
market; I'll wait until they are actually available before I start
counting on them for anything (let alone design/run a filesystem that
requires them :-)


Or even remember their names.

external battery-backed cache is readily available, either on high-end
RAID controllers or as separate RAM drives (and in RAID array boxes),
but nothing on individual drives.


Ah.  Curses.

UPS, then.  If you have enough time, you could even do a Software 
Suspend first -- that way, when power comes back on, you boot back up, 
and if it's done quickly enough, connections won't even be dropped...




Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]

2006-07-31 Thread David Lang

On Mon, 31 Jul 2006, David Masover wrote:


And perhaps a
really good clustering filesystem for markets that
require NO downtime. 


Thing is, a cluster is about the only FS I can imagine that could reasonably 
require (and MAYBE provide) absolutely no downtime. Everything else, the more 
you say it requires no downtime, the more I say it requires redundancy.


Am I missing any more obvious examples where you can't have enough 
redundancy, but you can't have downtime either?


just because you have redundancy doesn't mean that your data is idle enough for
you to run a repacker with your spare cycles. to run a repacker you need a time
when the chunk of the filesystem that you are repacking is not being accessed or
written to. it doesn't matter if that data lives on one disk or 9 disks all
mirroring the same data, you can't just break off 1 of the copies and repack
that because by the time you finish it won't match the live drives anymore.


database servers have a repacker (vacuum), and they are under tremendous
pressure from their users to avoid having to use it because of the performance
hit that it generates. (the theory in the past is exactly what was presented in
this thread, make things run faster most of the time and accept the performance
hit when you repack). the trend seems to be for a repacker thread that runs
continuously, causing a small impact all the time (that can be calculated into
the capacity planning) instead of a large impact once in a while.


the other thing they are seeing as new people start using them is that the
newbies don't realize they need to do something as archaic as running a repacker
periodically; as a result they let things devolve down to where performance is
really bad without understanding why.


David Lang


Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]

2006-07-31 Thread David Masover

David Lang wrote:

On Mon, 31 Jul 2006, David Masover wrote:


And perhaps a
really good clustering filesystem for markets that
require NO downtime. 


Thing is, a cluster is about the only FS I can imagine that could 
reasonably require (and MAYBE provide) absolutely no downtime. 
Everything else, the more you say it requires no downtime, the more I 
say it requires redundancy.


Am I missing any more obvious examples where you can't have enough 
redundancy, but you can't have downtime either?


just because you have redundancy doesn't mean that your data is idle
enough for you to run a repacker with your spare cycles.


Then you don't have redundancy, at least not for reliability.  In that 
case, you have redundancy for speed.


to run a 
repacker you need a time when the chunk of the filesystem that you are 
repacking is not being accessed or written to.


Reasonably, yes.  But it will be an online repacker, so it will be 
somewhat tolerant of this.


it doesn't matter if that
data lives on one disk or 9 disks all mirroring the same data, you can't
just break off 1 of the copies and repack that because by the time you
finish it won't match the live drives anymore.


Aha.  That really depends on how you're doing the mirroring.

If you're doing it at the block level, then no, it won't work.  But if 
you're doing it at the filesystem level (a cluster-based FS, or 
something that layers on top of an FS), or (most likely) the 
database/application level, then when you come back up, the new data is 
just pulled in from the logs as if it had been written to the FS.


The only example I can think of that I've actually used and seen working 
is MySQL tables, but that already covers a huge number of websites.


database servers have a repacker (vacuum), and they are under tremendous
pressure from their users to avoid having to use it because of the
performance hit that it generates. (the theory in the past is exactly
what was presented in this thread, make things run faster most of the
time and accept the performance hit when you repack). the trend seems to
be for a repacker thread that runs continuously, causing a small impact
all the time (that can be calculated into the capacity planning) instead
of a large impact once in a while.


Hmm, if that could be done right, it wouldn't be so bad -- if you get
twice the performance but have to repack for 2 hours at the end of the
week, the repacker is still a win, right?  So if you could spread the 2
hours out over the week, in theory, you'd still be pretty close to twice
the performance.
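
The arithmetic, spelled out in a few lines of Python; the 2x figure is the hypothetical from the paragraph above, not a measured Reiser4 number:

# Hypothetical: a freshly packed filesystem runs 2x faster, and keeping it
# packed costs 2 hours of repacking work per 168-hour week.
speedup      = 2.0
repack_hours = 2.0
week_hours   = 168.0

overhead  = repack_hours / week_hours      # ~1.2% of the week
effective = speedup * (1.0 - overhead)     # same total work, batched or spread out

print("repacking overhead : %.1f%% of the week" % (100 * overhead))
print("effective speedup  : %.2fx (vs. %.0fx when freshly packed)" % (effective, speedup))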


But that is fairly difficult to do, and may be more difficult to do well
than to implement, say, a Reiser4 plugin that operates at about the
level of rsync, but on every file modification.


the other thing they are seeing as new people start using them is that
the newbies don't realize they need to do something as archaic as running
a repacker periodically; as a result they let things devolve down to where
performance is really bad without understanding why.


Yikes.  But then, that may be a failure of distro maintainers for not 
throwing it in cron for them.


I had a similar problem with MySQL.  I turned on binary logging so I 
could do database replication, but I didn't realize I had to actually 
delete the logs.  I now have a daily cron job that wipes out everything 
except the last day's logs.  It could probably be modified pretty easily 
to run hourly, if I needed to.
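
For what it's worth, a hypothetical version of that cron job as a small Python script; it assumes the mysql client is on the PATH and can authenticate (e.g. via ~/.my.cnf), and it uses PURGE MASTER LOGS so the server itself drops the old binlogs and updates its index, rather than deleting files behind its back. If you have replication slaves, only purge logs they have already read.

#!/usr/bin/env python
# Hypothetical daily cleanup: ask the server to discard binary logs older than
# one day.  Run it from cron; adjust the interval if you want it hourly.
import subprocess

PURGE_SQL = "PURGE MASTER LOGS BEFORE DATE_SUB(NOW(), INTERVAL 1 DAY);"

def purge_old_binlogs():
    # Same effect as running: mysql -e "PURGE MASTER LOGS BEFORE ..."
    rc = subprocess.call(["mysql", "-e", PURGE_SQL])
    if rc != 0:
        raise SystemExit("binlog purge failed with exit code %d" % rc)

if __name__ == "__main__":
    purge_old_binlogs()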


Moral of the story?  Maybe there's something to this continuous 
repacker idea, but don't ruin a good thing for the rest of us because 
of newbies.