Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
On Mon, Jul 31, 2006 at 09:41:02PM -0700, David Lang wrote: Just because you have redundancy doesn't mean that your data is idle enough for you to run a repacker with your spare cycles. To run a repacker you need a time when the chunk of the filesystem that you are repacking is not being accessed or written to. It doesn't matter if that data lives on one disk or 9 disks all mirroring the same data; you can't just break off one of the copies and repack it, because by the time you finish it won't match the live drives anymore. Database servers have a repacker (vacuum), and they are under tremendous pressure from their users to avoid having to use it, because of the performance hit that it generates. (The theory in the past was exactly what was presented in this thread: make things run faster most of the time and accept the performance hit when you repack.) The trend seems to be toward a repacker thread that runs continuously, causing a small impact all the time (which can be calculated into the capacity planning) instead of a large impact once in a while.

Ah, but as soon as the repacker thread runs continuously, you lose all or most of the claimed advantage of wandering logs. Specifically, the claim of the wandering log is that you don't have to write your data twice --- once to the log, and once to the final location on disk (whereas with ext3 you end up having to do double writes). But if the repacker is running continuously, you end up doing double writes anyway, as the repacker moves things from a location that is convenient for the log to a location that is efficient for reading. Worse yet, if the repacker is moving disk blocks or objects which are no longer in cache, it may end up having to read objects in before writing them to their final location on disk. So instead of a write-write overhead, you end up with a write-read-write overhead.
But of course, people tend to disable the repacker when doing benchmarks, because they're trying to play the "my filesystem/database has bigger performance numbers than yours" game.

- Ted
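Ted's accounting argument can be made concrete with a toy model. This is a hedged sketch, not either filesystem's real I/O path: it assumes every block is journaled, that the repacker eventually relocates a fraction of the blocks, and that a relocated block must be re-read whenever it has fallen out of cache.

```python
def ext3_style_ios(blocks):
    """Block journaling: every block is written twice,
    once to the journal and once to its final location."""
    return {"writes": 2 * blocks, "reads": 0}

def wandering_log_ios(blocks, repacked_fraction, cache_hit_rate):
    """Wandering log: one write per block up front, but the repacker
    adds a second write for each repacked block, plus a read whenever
    the block is no longer cached (Ted's write-read-write case)."""
    repacked = blocks * repacked_fraction
    writes = blocks + repacked
    reads = repacked * (1.0 - cache_hit_rate)
    return {"writes": writes, "reads": reads}

# No repacker: the claimed advantage holds (half the writes of ext3).
print(wandering_log_ios(1000, repacked_fraction=0.0, cache_hit_rate=0.0))

# Continuous repacker touching everything, nothing left in cache:
# write-read-write, i.e. as many writes as ext3 plus extra reads.
print(wandering_log_ios(1000, repacked_fraction=1.0, cache_hit_rate=0.0))
```

With a repacked fraction of zero the wandering log does half the writes of block journaling; as the fraction approaches one with a cold cache, it matches ext3's write count and adds a read on top.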
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
Theodore Tso wrote: Ah, but as soon as the repacker thread runs continuously, then you lose all or most of the claimed advantage of wandering logs. [...] So instead of a write-write overhead, you end up with a write-read-write overhead.

There's no reason to repack *all* of the data. Many workloads write and delete whole files, so file data should be contiguous. The repacker would only need to move metadata and small files.

-- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
On Tue, 01 Aug 2006, Avi Kivity wrote: There's no reason to repack *all* of the data. Many workloads write and delete whole files, so file data should be contiguous. The repacker would only need to move metadata and small files.

Move small files? What for? Even if it is only moving metadata, it is no different from what ext3 or xfs are doing today (rewriting metadata from the intent log or block journal to the final location). UFS+softupdates from the BSD world looks pretty good at avoiding unnecessary writes (at the expense of a long-running but nice background fsck after a crash, which is, however, easy on the I/O as of recent FreeBSD versions). That was their main point against logging/journaling, BTW, but they are porting XFS as well, to serve those that need instant, complete recovery.

-- Matthias Andree
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
Matthias Andree wrote: On Tue, 01 Aug 2006, Avi Kivity wrote: There's no reason to repack *all* of the data. [...] The repacker would only need to move metadata and small files. Move small files? What for?

WAFL-style filesystems like contiguous free space, so if small files are scattered in otherwise free space, the repacker should move them to coalesce that space.

Even if it is only moving metadata, it is no different from what ext3 or xfs are doing today (rewriting metadata from the intent log or block journal to the final location).

There is no need to repack all metadata; only that which helps in creating free space. For example: if you untar a source tree you get mixed metadata and small-file data packed together, but there's no need to repack that data.

-- error compiling committee.c: too many arguments to function
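Avi's "repack only what helps create free space" policy might look something like the following sketch. Everything here is illustrative: the extent representation, the smallness threshold, and the "frees more than it occupies" heuristic are assumptions of mine, not reiser4's (or WAFL's) actual repacker policy.

```python
SMALL_EXTENT_BLOCKS = 16  # assumed threshold for a "small" file/metadata extent

def repack_candidates(extents, disk_blocks):
    """extents: sorted, non-overlapping list of (start, length) allocated runs.
    Return the extents worth moving: small runs whose neighbours are free,
    since relocating them coalesces the surrounding free space."""
    candidates = []
    for i, (start, length) in enumerate(extents):
        if length > SMALL_EXTENT_BLOCKS:
            continue  # big contiguous files stay where they are
        prev_end = extents[i - 1][0] + extents[i - 1][1] if i > 0 else 0
        next_start = extents[i + 1][0] if i + 1 < len(extents) else disk_blocks
        free_around = (start - prev_end) + (next_start - (start + length))
        if free_around > length:  # moving it frees more than it occupies
            candidates.append((start, length))
    return candidates
```

Note that a freshly untarred source tree, where small files sit densely packed between their neighbours, yields no candidates under this heuristic, which matches Avi's point that such data needs no repacking.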
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
Theodore Tso wrote: On Mon, Jul 31, 2006 at 09:41:02PM -0700, David Lang wrote: [...] The trend seems to be toward a repacker thread that runs continuously, causing a small impact all the time (which can be calculated into the capacity planning) instead of a large impact once in a while. Ah, but as soon as the repacker thread runs continuously, you lose all or most of the claimed advantage of wandering logs.

Wandering logs is a term specific to reiser4, and I think you are making a more general remark. You are missing the implications of the oft-cited statistic that 80% of files never or rarely move. You are also missing the implications of the repacker being able to do larger I/Os than occur for a random tiny-I/O workload hitting a filesystem that is performing allocations on the fly.

Specifically, the claim of the wandering log is that you don't have to write your data twice --- once to the log, and once to the final location on disk (whereas with ext3 you end up having to do double writes).
But if the repacker is running continuously, you end up doing double writes anyway, as the repacker moves things from a location that is convenient for the log to a location that is efficient for reading. Worse yet, if the repacker is moving disk blocks or objects which are no longer in cache, it may end up having to read objects in before writing them to their final location on disk. So instead of a write-write overhead, you end up with a write-read-write overhead. But of course, people tend to disable the repacker when doing benchmarks, because they're trying to play the "my filesystem/database has bigger performance numbers than yours" game. - Ted

When the repacker is done, we will, just for you, run one of our benchmarks the morning after the repacker is run (and reference this email) ;-) That was what you wanted us to do to address your concern, yes? ;-)
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
Wandering logs is a term specific to reiser4, and I think you are making a more general remark.

So, what is UDF's wandering log, then?

Jan Engelhardt
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
Theodore Tso wrote: Ah, but as soon as the repacker thread runs continuously, then you lose all or most of the claimed advantage of wandering logs. [...] So instead of a write-write overhead, you end up with a write-read-write overhead.

This would tend to suggest that the repacker should not run constantly, but also that while it's running, performance could be almost as good as ext3.

But of course, people tend to disable the repacker when doing benchmarks because they're trying to play the "my filesystem/database has bigger performance numbers than yours" game.

So you run your own benchmarks, I'll run mine... benchmarks for everyone! I'd especially like to see what performance is like with the repacker not running, and during the repack. If performance during a repack is comparable to ext3, I think we win, although we have to amend that statement to "my filesystem/database has the same or bigger performance numbers than yours."
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
David Masover wrote: David Lang wrote: On Mon, 31 Jul 2006, David Masover wrote: Oh, I'm curious -- do hard drives ever carry enough battery/capacitance to cover their caches? [...] There are no drives that have the ability to flush their cache after they lose power. Aha, so back to the usual argument: UPS! It takes a fraction of a second to flush that cache.

You probably don't actually want to flush the cache, but to write to a journal. 16M of cache, split into 32,000 writes to single sectors spread over the disk, could well take several minutes to write. Slapping it onto a journal would take well under 0.2 seconds. That's a non-trivial amount of energy storage, though - 3J or so, [EMAIL PROTECTED] - a moderately large/expensive capacitor. And if you've got to spin the drive up, you've just added another order of magnitude.

You can see why a flash backup of the write cache may be nicer. You can do it if the disk isn't spinning. It uses moderately less energy, and at a much lower rate, which means the power supply can be _much_ cheaper. I'd guess it's the difference between under $2 and $10. And if you can use it as a lazy write cache for laptops, things just got better battery-life-wise too.
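Ian's figures can be sanity-checked with back-of-envelope arithmetic. The assumptions here are mine: roughly 15 W of drive power while actively writing, 0.2 s for a contiguous journal dump of the 16 MB cache, and about 10 ms of seek plus rotation per scattered single-sector write.

```python
DRIVE_POWER_W = 15.0    # assumed active power draw while writing
JOURNAL_TIME_S = 0.2    # Ian's figure for a contiguous 16 MB dump
SECTOR_WRITES = 32_000  # ~16 MB cache / 512-byte sectors
RANDOM_WRITE_S = 0.010  # assumed seek + rotation per scattered sector

# Energy to dump the cache to a journal; matches Ian's "3J or so".
journal_energy_j = DRIVE_POWER_W * JOURNAL_TIME_S

# Time to write the same cache as 32,000 scattered sector writes;
# over five minutes, consistent with "could well take several minutes".
scattered_time_s = SECTOR_WRITES * RANDOM_WRITE_S

print(journal_energy_j)   # 3.0 J
print(scattered_time_s)   # 320.0 s, i.e. over 5 minutes
```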
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
Ian Stirling wrote: You probably don't actually want to flush the cache, but to write to a journal. 16M of cache, split into 32,000 writes to single sectors spread over the disk, could well take several minutes to write. Slapping it onto a journal would take well under 0.2 seconds. That's a non-trivial amount of energy storage, though - 3J or so, [EMAIL PROTECTED] - a moderately large/expensive capacitor.

Before we get ahead of ourselves, remember: ~$200 buys you a huge amount of battery storage. We're talking several minutes for several boxes, at the very least -- more like 10 minutes. But yes, a journal or a software suspend.
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
On Mon, 31 Jul 2006, David Masover wrote: Oh, I'm curious -- do hard drives ever carry enough battery/capacitance to cover their caches? It doesn't seem like it would be that hard/expensive, and if it is done that way, then I think it's valid to leave them on. You could just say that other filesystems aren't taking as much advantage of newer drive features as Reiser :P

There are no drives that have the ability to flush their cache after they lose power.

Now, that being said, /. had a story within the last couple of days about hard drive manufacturers adding flash to their hard drives. They may be aiming to add some non-volatile cache capability to their drives, although I didn't think that flash writes were that fast (needed if you dump the cache to flash when you lose power), or that easy on power (given that you would first lose power), and flash has limited write cycles (needed if you always use the cache). I've heard too many fancy-sounding drive technologies that never hit the market; I'll wait until they are actually available before I start counting on them for anything (let alone design/run a filesystem that requires them :-)

External battery-backed cache is readily available, either on high-end raid controllers or as separate ram drives (and in raid array boxes), but nothing on individual drives.

David Lang
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
David Lang wrote: On Mon, 31 Jul 2006, David Masover wrote: Oh, I'm curious -- do hard drives ever carry enough battery/capacitance to cover their caches? [...] There are no drives that have the ability to flush their cache after they lose power.

Aha, so back to the usual argument: UPS! It takes a fraction of a second to flush that cache.

Now, that being said, /. had a story within the last couple of days about hard drive manufacturers adding flash to their hard drives. They may be aiming to add some non-volatile cache capability to their drives, although I didn't think that flash writes were that fast (needed if you dump the cache to flash when you lose power), or that easy on power (given that you would first lose power), and flash has limited write cycles (needed if you always use the cache).

But the point of flash was not to replace the RAM cache -- it's to be another level. That is, you have your flash, which may be as fast as the disk, maybe faster, maybe slower, and you have maybe a gig worth of it. Even the bloatiest of OSes aren't really all that big -- my OS X came installed, with all kinds of apps I'll never use, in less than 10 gigs. And I think this story was a while ago (a dupe? Not surprising). The point of the flash is that as long as your read/write cache doesn't run out, and you're still in that 1 gig of flash, you're a bit safer than with the RAM cache alone, and you can also leave the disk off -- as in, spun down. Parked. Very useful for a laptop -- I used to do this in Linux by using Reiser4, setting the disk to spin down, and letting lazy writes do their thing, but I didn't have enough RAM, and there's always the possibility of losing data.
But leaving the disk off is nice, because in the event of sudden motion, it's safer that way. Besides, most hardware gets designed for That Other OS, which doesn't support any kind of Laptop Mode, so it's nice to be able to enforce this at a hardware level, in a safe way.

I've heard too many fancy-sounding drive technologies that never hit the market; I'll wait until they are actually available before I start counting on them for anything (let alone design/run a filesystem that requires them :-)

Or even remember their names.

External battery-backed cache is readily available, either on high-end raid controllers or as separate ram drives (and in raid array boxes), but nothing on individual drives.

Ah. Curses. UPS, then. If you have enough time, you could even do a software suspend first -- that way, when power comes back on, you boot back up, and if it's done quickly enough, connections won't even be dropped...
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
On Mon, 31 Jul 2006, David Masover wrote: And perhaps a really good clustering filesystem for markets that require NO downtime. Thing is, a cluster is about the only FS I can imagine that could reasonably require (and MAYBE provide) absolutely no downtime. Everything else, the more you say it requires no downtime, the more I say it requires redundancy. Am I missing any more obvious examples where you can't have enough redundancy, but you can't have downtime either?

Just because you have redundancy doesn't mean that your data is idle enough for you to run a repacker with your spare cycles. To run a repacker you need a time when the chunk of the filesystem that you are repacking is not being accessed or written to. It doesn't matter if that data lives on one disk or 9 disks all mirroring the same data; you can't just break off one of the copies and repack it, because by the time you finish it won't match the live drives anymore.

Database servers have a repacker (vacuum), and they are under tremendous pressure from their users to avoid having to use it, because of the performance hit that it generates. (The theory in the past was exactly what was presented in this thread: make things run faster most of the time and accept the performance hit when you repack.) The trend seems to be toward a repacker thread that runs continuously, causing a small impact all the time (which can be calculated into the capacity planning) instead of a large impact once in a while.

The other thing they are seeing as new people start using them is that the newbies don't realize they need to do something as archaic as running a repacker periodically; as a result they let things devolve to where performance is really bad, without understanding why.

David Lang
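The capacity-planning trade-off David describes is easy to quantify. A hedged sketch with invented numbers: suppose the filesystem is twice as fast between repacks, and repacking (whether taken as one batch or smeared out continuously) costs some fraction of each week as effectively dead time.

```python
def effective_speedup(fast_factor, repack_hours, period_hours):
    """Average throughput multiplier over a full repack period,
    treating repack time as dead time. The same average holds
    whether the cost comes in one batch or is spread continuously."""
    overhead = repack_hours / period_hours
    return fast_factor * (1.0 - overhead)

# 2x faster, with 2 hours of repacking per 168-hour week:
print(effective_speedup(2.0, 2.0, 168.0))  # ~1.976 -- still close to 2x
```

The ~1.2% overhead is what "can be calculated into the capacity planning"; the hard part, as the thread notes, is engineering the continuous repacker so its impact really is that smooth.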
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
David Lang wrote: On Mon, 31 Jul 2006, David Masover wrote: Am I missing any more obvious examples where you can't have enough redundancy, but you can't have downtime either? Just because you have redundancy doesn't mean that your data is idle enough for you to run a repacker with your spare cycles.

Then you don't have redundancy, at least not for reliability. In that case, you have redundancy for speed.

To run a repacker you need a time when the chunk of the filesystem that you are repacking is not being accessed or written to.

Reasonably, yes. But it will be an online repacker, so it will be somewhat tolerant of this.

It doesn't matter if that data lives on one disk or 9 disks all mirroring the same data; you can't just break off one of the copies and repack it, because by the time you finish it won't match the live drives anymore.

Aha. That really depends how you're doing the mirroring. If you're doing it at the block level, then no, it won't work. But if you're doing it at the filesystem level (a cluster-based FS, or something that layers on top of an FS), or (most likely) the database/application level, then when you come back up, the new data is just pulled in from the logs as if it had been written to the FS. The only example I can think of that I've actually used and seen working is MySQL tables, but that already covers a huge number of websites.

Database servers have a repacker (vacuum), and they are under tremendous pressure from their users to avoid having to use it because of the performance hit that it generates.
(The theory in the past was exactly what was presented in this thread: make things run faster most of the time and accept the performance hit when you repack.) The trend seems to be toward a repacker thread that runs continuously, causing a small impact all the time (which can be calculated into the capacity planning) instead of a large impact once in a while.

Hmm, if that could be done right, it wouldn't be so bad -- if you get twice the performance but have to repack for 2 hrs at the end of the week, the repacker is better, right? So if you could spread the 2 hours out over the week, in theory, you'd still be pretty close to twice the performance. But that is fairly difficult to do, and may be more difficult to do well than to implement, say, a Reiser4 plugin that operates about on the level of rsync, but on every file modification.

The other thing they are seeing as new people start using them is that the newbies don't realize they need to do something as archaic as running a repacker periodically; as a result they let things devolve to where performance is really bad, without understanding why.

Yikes. But then, that may be a failure of distro maintainers for not throwing it in cron for them. I had a similar problem with MySQL. I turned on binary logging so I could do database replication, but I didn't realize I had to actually delete the logs. I now have a daily cron job that wipes out everything except the last day's logs. It could probably be modified pretty easily to run hourly, if I needed to. Moral of the story? Maybe there's something to this continuous repacker idea, but don't ruin a good thing for the rest of us because of newbies.