Re: Futures - HAMMER comparison testing?
Matthew Dillon wrote:
> :But - at the end of the day - how much [extra?] on-disk space will be
> :needed to ensure mount 'as-of' is 'good enough' for some realistic span
> :(a week?, a month?)? 'Forever' may be too much to ask.
>
> The amount of disk needed is precisely the same as the amount of
> historical data (different from current data) that must be retained,
> plus record overhead. So it comes down to how much space you are willing
> to eat up to store the history, and what kind of granularity you will
> want for the history.
>
> :How close are we to being able to start predicting that storage-space
> :efficiency relative to ${some_other_fs}?
> :
> :Bill
>
> Ultimately it will be extremely efficient simply by the fact that there
> will be a balancer going through it and repacking it. For the moment
> (and through the alpha release) it will be fairly inefficient because it
> is using fixed 16K data records, even for small files. The on-disk
> format doesn't care... records can reference variable-length data from
> around 1MB down to 64 bytes. But supporting variable-length data
> requires implementing some overwrite cases that I don't want to do right
> now.
>
> This only applies to regular files, of course. Directories store
> directory entries as records, not as data, so directories are packed
> really nicely.
>
> e.g. if you have one record representing, say, 1MB of data, and you
> write 64 bytes right smack in the middle of that, the write code will
> have to take that one record, mark it as deleted, then create three
> records to replace it (one pointing to the unchanged left portion of the
> original data, one pointing to the 64 bytes of overwritten data, and one
> pointing to the unchanged right portion of the original data). The
> recovery and deletion code will also have to deal with that sort of
> overlaid data situation. I'm not going to be writing that feature for a
> bit. There are some quick hacks I can do too, for small files, but it's
> not on my list prior to the alpha release.

To me it seems that this makes compression somewhat easier to implement,
mainly because the old data is kept. When you overwrite compressed data,
the system would only compress the new data and introduce pointers to the
left portion and right portion of the old data. What might be complicated
to handle is the difference in size between compressed and uncompressed
data in all the buffers.

If this yields a compression ratio of 2x, it could be extremely useful,
especially given the historic nature of HAMMER (deleted files take less
space). Even better if there were a flag to open(2) to retrieve the file
in raw format (i.e. compressed, in case it is stored compressed). Think
about web servers! Most of them send static files in compressed form if
the client accepts it. That'd be a huge benefit for serving static pages.
Also think about sendfile(2), which could then send compressed data
directly. This would make it much more usable in a web server and would
avoid storing a compressed copy of the file.

The same infrastructure could also be used to implement file-oriented
encryption. Basically, every algorithm that works on a stream of data
would be possible. For encryption, a special system call could associate
a key with an open file descriptor, or a default key to use for the whole
process.

Regards,

  Michael
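[To make the web-server idea above concrete: a minimal sketch in C of how
a server might serve a file's stored (compressed) form when the client
accepts gzip. The O_RAWDATA flag is purely hypothetical - it is Michael's
proposal, not an existing HAMMER or DragonFly interface - and the HTTP
handling is deliberately simplified.]

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /*
     * Hypothetical flag: ask the filesystem for the file's on-disk
     * (compressed) representation instead of the decompressed bytes.
     * This does NOT exist; it only illustrates the proposal above.
     */
    #define O_RAWDATA 0x40000000

    /* Serve a static file, preferring the stored compressed form. */
    static void
    serve_static(int sock, const char *path, int client_accepts_gzip)
    {
        int fd = -1;
        int raw = 0;

        if (client_accepts_gzip) {
            fd = open(path, O_RDONLY | O_RAWDATA);
            if (fd >= 0)
                raw = 1;
        }
        if (fd < 0)
            fd = open(path, O_RDONLY);  /* normal (decompressed) path */
        if (fd < 0)
            return;

        const char *hdr = raw
            ? "HTTP/1.1 200 OK\r\nContent-Encoding: gzip\r\n\r\n"
            : "HTTP/1.1 200 OK\r\n\r\n";
        write(sock, hdr, strlen(hdr));

        /* sendfile(2) could now ship the compressed bytes untouched;
         * a read/write loop stands in for it here. */
        char buf[16384];
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            write(sock, buf, (size_t)n);
        close(fd);
    }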
Re: Futures - HAMMER comparison testing?
Bill Hacker wrote:
> Michael Neumann wrote:
>> Bill Hacker wrote:
>>> I'm guessing it will be a while yet before HAMMER is ready for this,
>>> but it seems to be moving fast - and cleanly - so...
>>
>> Sorry to hijack this thread. Just wanna mention a little write-up of
>> mine about HammerFS features (and sometimes comparing it with ZFS):
>>
>> http://www.ntecs.de/blog/articles/2008/01/17/zfs-vs-hammerfs
>>
>> I can't wait to try it out for real!
>>
>> Regards,
>>
>>   Michael
>
> Michael - that's a good start!
>
> Such a good start that I'd like to suggest the 'un-blog-like' course of
> correcting the original, at least 'for a while yet', rather than
> blogging-on the errata at the tail.

Yes, I fixed it a bit ;-)

> I don't think your ZFS assessment is 100% accurate, so a bit of clean-up
> there could reduce flame-bait.

Done.

> That may earn further alteration if/as/when Sun integrates 'Lustre'
> features. At present, HAMMERfs and ZFS have only partial overlap in
> their Venn diagrams.
>
> But I encourage you to keep updating and tracking the changes. Maybe it
> should open with a date-stamped 'current state of ...'?

You mean using a "@@0x..." (HAMMER timestamp) in the URL ;-)
I'll wait until I port my blogging software natively over to HAMMER ;-)

Regards,

  Michael
Re: Futures - HAMMER comparison testing?
:> So it comes down to how much space you are willing to eat up to store
:> the history, and what kind of granularity you will want for the history.
:
:OK - so it WILL be a 'tunable', then.
:...
:HAMMER cannot protect against all forms of human error - BUT - if it
:inherently rebuilds more intelligently than the least-intelligent of
:RAID1, it can greatly reduce the opportunity for that sort of 'accident'
:to occur.

One idea I had was to number the records as they were laid down on disk,
and validate the file or directory by determining that no records were
missing. But that doesn't fly very well when things are deleted and
replaced.

Another idea, much easier to implement, is to have a way to guarantee
that all the bits and pieces of the file have been found by creating a
record which contains a CRC of the whole mess. One could have a 'whole
file' CRC, or even a 'whole directory tree' CRC (as-of a particular
timestamp). Since HAMMER is record-oriented, associating special records
with inodes is utterly trivial. For archival storage one could then 'tag'
a directory tree with such a record and have a way of validating that the
directory tree had not become corrupted, or was recovered properly. For
encryption one could 'tag' a directory tree or a file with an encryption
label. Not implemented yet, but a definite possibility. There are so many
things we can do with HAMMER due to its record-oriented nature.

:> Ultimately it will be extremely efficient simply by the fact that
:> there will be a balancer going through it and repacking it.
:>
:"... constantly, and in the background..." (I presume)

In the background, for sure. Probably not constantly, but taking a piece
at a time with a nightly cron job. One thing I've learned over the years
is that it is a bad idea to just go randomly accessing the disk at
unexpected times. The nice thing is that the balancing can occur on a
cluster-by-cluster basis, so one can do a bunch of clusters, then stop,
then do a bunch more, then stop, etc.

:Is variable-length still likely to have a payback if the data records
:were to be fixed at 512B or 1024B or integer multiples thereof?

Not a good idea for HAMMER. A HAMMER record is 96 bytes and a HAMMER
B-Tree element is 56 bytes. That's 152 bytes of overhead per record. The
smaller the data associated with each record, the larger the overhead and
the less efficient the filesystem storage model. Also, while accessing
records is localized, you only reap major benefits over a linear block
storage scheme if you can make those records reference a significant
amount of data.

So for large static files we definitely want to use a large per-record
data size, and for small static files we want to use a small data size.
Theoretically the best-case storage for a tiny file would be 96 + 56 +
128 (inode data) + 64 (data), or 344 bytes of disk space. That's very,
very good. (In the current incarnation the minimum disk space used per
file is 96 + 56 + 128 + 16384.)

-Matt

Matthew Dillon <[EMAIL PROTECTED]>
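[A rough sketch of the 'whole file' CRC idea above: a special record
associated with an inode that holds a checksum over all of the file's
data as-of a given transaction id. The struct layout and names below are
invented for illustration; they are not HAMMER's actual on-disk
structures.]

    #include <stdint.h>
    #include <stddef.h>

    /*
     * Illustrative only: fields and layout are invented, not HAMMER's
     * real on-disk format.  The tag is just another record attached to
     * the inode, which is what makes it cheap in a record-oriented fs.
     */
    struct crc_tag_record {
        uint64_t obj_id;      /* inode (or tree root) being tagged */
        uint64_t as_of_tid;   /* transaction id the CRC was computed as-of */
        uint32_t crc;         /* CRC over all data records found */
    };

    /* Plain reflected CRC-32 (poly 0xEDB88320), fed incrementally with
     * each data record's contents during validation. */
    static uint32_t
    crc32_update(uint32_t crc, const void *data, size_t len)
    {
        const uint8_t *p = data;
        crc = ~crc;
        while (len--) {
            crc ^= *p++;
            for (int i = 0; i < 8; i++)
                crc = (crc >> 1) ^ (0xEDB88320U & -(crc & 1));
        }
        return ~crc;
    }

    /*
     * Validation would walk every data record belonging to the inode
     * as-of tag->as_of_tid, run the bytes through crc32_update(), and
     * compare against tag->crc.  A mismatch means a piece of the file
     * was lost, corrupted, or not recovered.
     */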
Re: Futures - HAMMER comparison testing?
Matthew Dillon wrote:
> :But - at the end of the day - how much [extra?] on-disk space will be
> :needed to ensure mount 'as-of' is 'good enough' for some realistic span
> :(a week?, a month?)? 'Forever' may be too much to ask.
>
> The amount of disk needed is precisely the same as the amount of
> historical data (different from current data) that must be retained,
> plus record overhead. So it comes down to how much space you are willing
> to eat up to store the history, and what kind of granularity you will
> want for the history.

OK - so it WILL be a 'tunable', then.

FWIW - my yardsticks at the 'heavy' or most wasteful end are punch card &
paper/mylar tape on low/no-RAM systems, where 'backup' is essentially of
'infinite' granularity, moving through WORM storage to Plan 9's Venti, et
al. AFAIK, none of the oldest 'write once' methods are in even
'virtualized' use - save possibly in the FAA or military fields, as few
entities have any practical use for that sort of history.

At the other end, one of our projects involved storing the floor plans of
60,000 buildings on RAID1. A technician manually rebuilding a failed
array mirrored the empty HDD onto the full one, and over 600 CDs had to
be manually reloaded. In that case, there never had been risk of loss -
anyone could buy the latest CDs from the government lands department.
What his error cost us was 'only' time and inconvenience.

HAMMER cannot protect against all forms of human error - BUT - if it
inherently rebuilds more intelligently than the least-intelligent of
RAID1, it can greatly reduce the opportunity for that sort of 'accident'
to occur.

> :How close are we to being able to start predicting that storage-space
> :efficiency relative to ${some_other_fs}?
> :
> :Bill
>
> Ultimately it will be extremely efficient simply by the fact that there
> will be a balancer going through it and repacking it.

"... constantly, and in the background..." (I presume)

".. and with tunable frequency and priority." (I wish, eventually).

> For the moment (and through the alpha release) it will be fairly
> inefficient because it is using fixed 16K data records, even for small
> files. The on-disk format doesn't care... records can reference
> variable-length data from around 1MB down to 64 bytes. But supporting
> variable-length data requires implementing some overwrite cases that I
> don't want to do right now.

Is variable-length still likely to have a payback if the data records
were to be fixed at 512B or 1024B or integer multiples thereof?

> This only applies to regular files, of course. Directories store
> directory entries as records, not as data, so directories are packed
> really nicely.
>
> e.g. if you have one record representing, say, 1MB of data, and you
> write 64 bytes right smack in the middle of that, the write code will
> have to take that one record, mark it as deleted, then create three
> records to replace it (one pointing to the unchanged left portion of the
> original data, one pointing to the 64 bytes of overwritten data, and one
> pointing to the unchanged right portion of the original data). The
> recovery and deletion code will also have to deal with that sort of
> overlaid data situation. I'm not going to be writing that feature for a
> bit. There are some quick hacks I can do too, for small files, but it's
> not on my list prior to the alpha release.
>
> Remember that HAMMER is designed for large filesystems which don't fill
> up instantly. Consequently it will operate under the assumption that it
> can take its time to recover free space.
> If one doesn't want to use the history feature one can turn it off, of
> course, or use a very granular retention policy.
>
> My local backup system is currently using a 730GB UFS partition and it
> is able to back up apollo, crater, and leaf with daily cpdups (using the
> hardlink snapshot trick) going back about 3 months. In fact, I can only
> fill up that 730GB about half way because fsck runs out of memory and
> fails once you get over around 50 million inodes (mostly dependent on
> the number of directories you have)... on UFS, that is. I found that out
> the hard way.

..which reminds us what we will ALL soon face if we do NOT seek newer
solutions!

> It takes almost a day for fsck to recover the filesystem even half full.
> I'll be happy when I can throw that old stuff away.
>
> -Matt
> Matthew Dillon <[EMAIL PROTECTED]>

.. or just relegate it to what it still does faster/better. IF..

I hope and trust that DragonFly BSD will earn a place as a 'broad
spectrum' OS, competitive across the board with alternatives. But - if
not, or even just 'not at first' - much as OpenBSD and NetBSD have long
been seen as good choices for routers and firewalls, DragonFly should be
able to carve out a viable niche as the better choice for centralized / c
Re: Futures - HAMMER comparison testing?
:But - at the end of the day - how much [extra?] on-disk space will be
:needed to ensure mount 'as-of' is 'good enough' for some realistic span
:(a week?, a month?)? 'Forever' may be too much to ask.

The amount of disk needed is precisely the same as the amount of
historical data (different from current data) that must be retained, plus
record overhead. So it comes down to how much space you are willing to
eat up to store the history, and what kind of granularity you will want
for the history.

:How close are we to being able to start predicting that storage-space
:efficiency relative to ${some_other_fs}?
:
:Bill

Ultimately it will be extremely efficient simply by the fact that there
will be a balancer going through it and repacking it. For the moment (and
through the alpha release) it will be fairly inefficient because it is
using fixed 16K data records, even for small files. The on-disk format
doesn't care... records can reference variable-length data from around
1MB down to 64 bytes. But supporting variable-length data requires
implementing some overwrite cases that I don't want to do right now.

This only applies to regular files, of course. Directories store
directory entries as records, not as data, so directories are packed
really nicely.

e.g. if you have one record representing, say, 1MB of data, and you write
64 bytes right smack in the middle of that, the write code will have to
take that one record, mark it as deleted, then create three records to
replace it (one pointing to the unchanged left portion of the original
data, one pointing to the 64 bytes of overwritten data, and one pointing
to the unchanged right portion of the original data). The recovery and
deletion code will also have to deal with that sort of overlaid data
situation. I'm not going to be writing that feature for a bit. There are
some quick hacks I can do too, for small files, but it's not on my list
prior to the alpha release.

Remember that HAMMER is designed for large filesystems which don't fill
up instantly. Consequently it will operate under the assumption that it
can take its time to recover free space. If one doesn't want to use the
history feature one can turn it off, of course, or use a very granular
retention policy.

My local backup system is currently using a 730GB UFS partition and it is
able to back up apollo, crater, and leaf with daily cpdups (using the
hardlink snapshot trick) going back about 3 months. In fact, I can only
fill up that 730GB about half way because fsck runs out of memory and
fails once you get over around 50 million inodes (mostly dependent on the
number of directories you have)... on UFS, that is. I found that out the
hard way. It takes almost a day for fsck to recover the filesystem even
half full. I'll be happy when I can throw that old stuff away.

-Matt

Matthew Dillon <[EMAIL PROTECTED]>
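[The three-way split described above can be sketched as follows. The
types are invented and greatly simplified - HAMMER's real records carry
much more state - but the logic mirrors the description: the old record
is retained for history, and up to three new records cover the left
remainder, the new data, and the right remainder.]

    #include <stdint.h>

    /* Invented, simplified in-memory record: an extent of a file plus a
     * pointer to where its data lives on disk.  Not HAMMER's layout. */
    struct rec {
        int64_t file_off;   /* logical offset within the file */
        int64_t len;        /* number of bytes the record covers */
        int64_t disk_off;   /* where the data starts on disk */
        int     deleted;    /* historical records are marked, not freed */
    };

    /*
     * Overwrite [off, off+len) inside a single existing record 'old'.
     * The old record is marked deleted (it stays visible to as-of
     * lookups); up to three new records replace it.
     */
    static void
    split_overwrite(struct rec *old, int64_t off, int64_t len,
                    int64_t new_disk_off, struct rec out[3], int *nout)
    {
        int n = 0;

        old->deleted = 1;                     /* keep for history */

        if (off > old->file_off) {            /* unchanged left portion */
            out[n].file_off = old->file_off;
            out[n].len      = off - old->file_off;
            out[n].disk_off = old->disk_off;  /* reuses old data in place */
            out[n].deleted  = 0;
            n++;
        }

        out[n].file_off = off;                /* newly written middle */
        out[n].len      = len;
        out[n].disk_off = new_disk_off;       /* freshly allocated data */
        out[n].deleted  = 0;
        n++;

        if (off + len < old->file_off + old->len) {  /* right portion */
            out[n].file_off = off + len;
            out[n].len      = old->file_off + old->len - (off + len);
            out[n].disk_off = old->disk_off + (off + len - old->file_off);
            out[n].deleted  = 0;
            n++;
        }
        *nout = n;
    }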
Re: Futures - HAMMER comparison testing?
Matthew Dillon wrote:
> :Sorry to hijack this thread.

Not to worry! It was *intended* to be 'hijacked'. Welcome!

> :Just wanna mention a little write-up of
> :mine about HammerFS features (and sometimes comparing it with ZFS):
> :
> :http://www.ntecs.de/blog/articles/2008/01/17/zfs-vs-hammerfs
> :
> :I can't wait to try it out for real!
> :
> :Regards,
> :
> :  Michael
>
> Nice. There are a few factual mistakes but nothing major. ZFS is
> reliable on a 386 system, you just have to limit its memory consumption
> with (I think) a sysctl. HAMMER's resource use is still far lower,
> though. There are still reports of total corruption for ZFS on FreeBSD,
> but the issue doesn't sound insurmountable.

Not just on FreeBSD. There are some 'bad news' ZFS stories on the
Solaris-on-UltraSPARC lists as well. But - so far - more seem to be
'tuning' issues than structural.

> HAMMER doesn't journal. Theoretically, by carefully ordering certain
> I/O operations, it will not have to journal.

As with 'softupdates', the net result - reduced risk of damage - is more
important than the technique used to deliver it.

> More importantly, HAMMER can 'recover', as in regenerate, the contents
> of a cluster on the fly by scanning the records in that cluster and then
> rebuilding the B-Tree and allocation radix trees from scratch. As long
> as recognizable records are present, it should be possible to recover a
> considerable amount of information even after the disk gets scratched.

Again - so long as it can be trusted to actually *deliver the goods*, the
method is 'don't care' to the end-user.

> The historical nature and backup sections are correct, minus your
> journaling comment. Basically you make backups by working from an as-of
> access. You don't even have to make an as-of mount; there's an
> @@timestamp extension that allows you to access the filesystem as-of any
> time. So, e.g., you can do things like: 'diff /mnt /mnt/@@0x'.

From 'decision maker land' - the hottest of hot buttons are not speed or
bragging rights, but rather:

- The fs will not fail, embarrass, or cost lost time or money.

- It will need less time to install, configure, and maintain than other
  options (in which is included backup/restoral costs).

Needing less effort to muck about with tarballs and rsync offsets a
(possibly) slower fs. Bigtime. Not so lost or corrupted - or even just
wrongly erased - data.

> I am down to three major items for the release: the recovery,
> balancing, and vacuuming subsystems. All are interrelated and I am
> making good progress. Beyond that, the spike code needs some major
> tweaking, but the only effect of that is poor write performance
> (probably through the alpha release).

30% of the I/O speed of current alternatives is fine with me. 25% or less
might be problematic. 50% is even OK for the long term if the
backup/restoral/rollback pays off. Those things are otherwise not 'free'.

> Of course, there are many other little issues that need to be dealt
> with before the release as well. Post-release I'll have a go at
> implementing backup/mirroring streaming. I have a pretty good idea how
> to implement it -- basically by storing a last-transaction-id in cluster
> headers, super-cluster headers, and volume headers, in order to reduce
> the amount of initial scanning required to resynchronize a stream.
>
> -Matt
> Matthew Dillon <[EMAIL PROTECTED]>

That last part sounds like 'journaling' to me.

But - at the end of the day - how much [extra?] on-disk space will be
needed to ensure mount 'as-of' is 'good enough' for some realistic span
(a week?, a month?)? 'Forever' may be too much to ask.
How close are we to being able to start predicting that storage-space
efficiency relative to ${some_other_fs}?

Bill
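[Since the @@timestamp extension quoted above exposes history through the
normal namespace, ordinary file calls can read an as-of view with no
special API. A minimal sketch, assuming the path form from Matt's
example ('/mnt/@@0x...'); the transaction id below is a made-up
placeholder, not a real value.]

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /*
         * The @@0x... suffix names an as-of view of the filesystem.
         * The id below is an invented example; a real one would be
         * recorded by the filesystem (e.g. at backup time).
         */
        char path[256];
        snprintf(path, sizeof(path),
                 "/mnt/@@0x%016llx/etc/motd", 0x1234567890abcdefULL);

        int fd = open(path, O_RDONLY);  /* reads the historical version */
        if (fd < 0) {
            perror("open as-of");
            return 1;
        }

        char buf[4096];
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            write(STDOUT_FILENO, buf, (size_t)n);
        close(fd);
        return 0;
    }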
Re: Futures - HAMMER comparison testing?
Michael Neumann wrote:
> Bill Hacker wrote:
>> I'm guessing it will be a while yet before HAMMER is ready for this,
>> but it seems to be moving fast - and cleanly - so...
>
> Sorry to hijack this thread. Just wanna mention a little write-up of
> mine about HammerFS features (and sometimes comparing it with ZFS):
>
> http://www.ntecs.de/blog/articles/2008/01/17/zfs-vs-hammerfs
>
> I can't wait to try it out for real!
>
> Regards,
>
>   Michael

Michael - that's a good start!

Such a good start that I'd like to suggest the 'un-blog-like' course of
correcting the original, at least 'for a while yet', rather than
blogging-on the errata at the tail.

I don't think your ZFS assessment is 100% accurate, so a bit of clean-up
there could reduce flame-bait. That may earn further alteration
if/as/when Sun integrates 'Lustre' features. At present, HAMMERfs and ZFS
have only partial overlap in their Venn diagrams.

But I encourage you to keep updating and tracking the changes. Maybe it
should open with a date-stamped 'current state of ...'?

Thanks!

Bill
Re: Futures - HAMMER comparison testing?
:Sorry to hijack this thread. Just wanna mention a little write-up of
:mine about HammerFS features (and sometimes comparing it with ZFS):
:
:http://www.ntecs.de/blog/articles/2008/01/17/zfs-vs-hammerfs
:
:I can't wait to try it out for real!
:
:Regards,
:
:  Michael

Nice. There are a few factual mistakes but nothing major. ZFS is reliable
on a 386 system, you just have to limit its memory consumption with (I
think) a sysctl. HAMMER's resource use is still far lower, though. There
are still reports of total corruption for ZFS on FreeBSD, but the issue
doesn't sound insurmountable.

HAMMER doesn't journal. Theoretically, by carefully ordering certain I/O
operations, it will not have to journal. More importantly, HAMMER can
'recover', as in regenerate, the contents of a cluster on the fly by
scanning the records in that cluster and then rebuilding the B-Tree and
allocation radix trees from scratch. As long as recognizable records are
present, it should be possible to recover a considerable amount of
information even after the disk gets scratched.

The historical nature and backup sections are correct, minus your
journaling comment. Basically you make backups by working from an as-of
access. You don't even have to make an as-of mount; there's an
@@timestamp extension that allows you to access the filesystem as-of any
time. So, e.g., you can do things like: 'diff /mnt /mnt/@@0x'.

I am down to three major items for the release: the recovery, balancing,
and vacuuming subsystems. All are interrelated and I am making good
progress. Beyond that, the spike code needs some major tweaking, but the
only effect of that is poor write performance (probably through the alpha
release). Of course, there are many other little issues that need to be
dealt with before the release as well.

Post-release I'll have a go at implementing backup/mirroring streaming. I
have a pretty good idea how to implement it -- basically by storing a
last-transaction-id in cluster headers, super-cluster headers, and volume
headers, in order to reduce the amount of initial scanning required to
resynchronize a stream.

-Matt

Matthew Dillon <[EMAIL PROTECTED]>
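[A sketch of the pruning that the last-transaction-id scheme above would
enable: if every cluster, super-cluster, and volume header records the
newest transaction id beneath it, resynchronizing a stream can skip whole
subtrees the mirror has already seen. All types and names are invented
for illustration - this is a guess at the shape of the idea, not HAMMER's
eventual implementation.]

    #include <stdint.h>

    /*
     * Invented, simplified hierarchy: a volume holds super-clusters, a
     * super-cluster holds clusters, and every level records the highest
     * transaction id (last_tid) of any modification beneath it.
     */
    struct cluster      { uint64_t last_tid; /* ... records ... */ };
    struct supercluster { uint64_t last_tid;
                          struct cluster *clu; int nclu; };
    struct volume       { uint64_t last_tid;
                          struct supercluster *scl; int nscl; };

    /* Transmit records newer than sync_tid (body omitted in this sketch). */
    static void
    stream_cluster(struct cluster *c, uint64_t sync_tid)
    {
        /* Walk c's records and send those with tid > sync_tid. */
        (void)c; (void)sync_tid;
    }

    /*
     * Resynchronize a mirroring stream that has already seen everything
     * up to and including sync_tid.  Each header's last_tid lets the
     * scan skip an entire subtree without touching its contents.
     */
    static void
    resync_stream(struct volume *vol, uint64_t sync_tid)
    {
        if (vol->last_tid <= sync_tid)
            return;                       /* whole volume already synced */

        for (int i = 0; i < vol->nscl; i++) {
            struct supercluster *sc = &vol->scl[i];
            if (sc->last_tid <= sync_tid)
                continue;                 /* skip untouched super-cluster */

            for (int j = 0; j < sc->nclu; j++) {
                struct cluster *c = &sc->clu[j];
                if (c->last_tid <= sync_tid)
                    continue;             /* skip untouched cluster */
                stream_cluster(c, sync_tid);  /* only scan what changed */
            }
        }
    }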
Re: Futures - HAMMER comparison testing?
Bill Hacker wrote:
> I'm guessing it will be a while yet before HAMMER is ready for this, but
> it seems to be moving fast - and cleanly - so...

Sorry to hijack this thread. Just wanna mention a little write-up of mine
about HammerFS features (and sometimes comparing it with ZFS):

http://www.ntecs.de/blog/articles/2008/01/17/zfs-vs-hammerfs

I can't wait to try it out for real!

Regards,

  Michael
Futures - HAMMER comparison testing?
I'm guessing it will be a while yet before HAMMER is ready for this, but
it seems to be moving fast - and cleanly - so...

.. to the extent networked clustering AND inherent recoverability are
intended to be major strengths - to what comparable fs should we expect
to look if we were to compare performance & features?

My own first impression includes:

- IBM's 'GPFS' (hard for mere mortals to get access to a realistic test
  environment, but perhaps performance is well documented somewhere in a
  Redpaper?)

- Just-acquired-by-Sun 'Lustre' (an unknown to me)

- CMU's AFS (in ports. Has never seemed particularly magical to me..)

- Plan 9's Fossil/Venti (likewise, even when run 'native' - IMNSHO,
  actually no longer viable).

- Is NFS a realistic candidate in some incarnation?

Others? [1]

and - eventual performance quite aside, should we be looking at stress
test design [2], or are there benchmarks already appropriate?

Bill Hacker

[1] I don't know that ZFS *inherently* plays in the cluster yet - but
Lustre was allegedly purchased in order to be worked in with ZFS, so...
'maybe soon'.

[2] Tracking ZFS from mailing lists gives me the impression that it has
been relatively solid for PJD on small HDD and RAM. Likewise for Sun on
massively greater RAM and HDD arrays. BUT - many of the reported problems
are from folks in the 'tween decks' - i.e. larger arrays and RAM than PJD
reports as solid, but less available RAM-to-disk than Sun generally uses
for real-world use.

So the question arises whether HAMMER will:

a) be overly sensitive to / hungry for memory as a resource;

b) if so, at least just slow down when short-changed, rather than
   crashing;

c) IF crashed (or otherwise treated rudely), be more certain of
   trouble-free recovery than product 'x'.