Re: Futures - HAMMER comparison testing?

2008-01-19 Thread Michael Neumann

Matthew Dillon wrote:
:But - at the end of the day - how much [extra?] on-disk space will be 
:needed to ensure mount 'as-of' is 'good enough' for some realistic span 
:(a week?, a month?)? 'Forever' may be too much to ask.


The amount of disk needed is precisely the same as the amount of
historical data (different from current data) that must be retained,
plus record overhead.

So it comes down to how much space you are willing to eat up to store
the history, and what kind of granularity you will want for the history.

:How close are we to being able to start predicting that storage-space 
:efficiency relative to ${some_other_fs}?

:
:Bill

Ultimately it will be extremely efficient simply by the fact that
there will be a balancer going through it and repacking it.

For the moment (and through the alpha release) it will be fairly
inefficient because it is using fixed 16K data records, even for small
files.  The on-disk format doesn't care... records can reference 
variable-length data from around 1MB down to 64 bytes.  But supporting
variable-length data requires implementing some overwrite cases that
I don't want to do right now.  This only applies to regular files
of course.  Directories store directory entries as records, not as data,
so directories are packed really nicely. 


e.g. if you have one record representing, say, 1MB of data, and you
write 64 bytes right smack in the middle of that, the write code will
have to take that one record, mark it as deleted, then create three
records to replace it (one pointing to the unchanged left portion of
the original data, one pointing to the 64 bytes of overwritten data,
and one pointing to the unchanged right portion of the original data).
The recovery and deletion code will also have to deal with that sort
of overlaid data situation.  I'm not going to be writing that
feature for a bit.  There are some quick hacks I can do too, for
small files, but it's not on my list prior to the alpha release.
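
A minimal sketch of that split, with invented types (struct rec,
insert_record) standing in for HAMMER's real record and B-Tree
structures:

#include <sys/types.h>
#include <stddef.h>

struct rec {
    off_t  file_off;   /* logical offset within the file */
    off_t  data_off;   /* where the data lives on disk */
    size_t len;        /* bytes covered by this record */
    int    deleted;    /* marked deleted, not removed (history is kept) */
};

static void insert_record(struct rec *r) { (void)r; /* placeholder */ }

static void
overwrite_middle(struct rec *old, off_t woff, size_t wlen, off_t new_data_off)
{
    /* The original record is only marked as deleted. */
    old->deleted = 1;

    /* Left piece: unchanged bytes, still referencing the original data. */
    struct rec left = {
        .file_off = old->file_off,
        .data_off = old->data_off,
        .len      = (size_t)(woff - old->file_off),
    };

    /* Middle piece: the freshly written bytes. */
    struct rec mid = {
        .file_off = woff,
        .data_off = new_data_off,
        .len      = wlen,
    };

    /* Right piece: unchanged bytes after the write, again referencing
     * the original data at the appropriate offset. */
    struct rec right = {
        .file_off = woff + (off_t)wlen,
        .data_off = old->data_off + (woff + (off_t)wlen - old->file_off),
        .len      = old->len - (size_t)(woff + (off_t)wlen - old->file_off),
    };

    insert_record(&left);
    insert_record(&mid);
    insert_record(&right);
}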


To me it seems that this makes compression somewhat easier to
implement, mainly because the old data is kept. So when you overwrite
compressed data, the system would only compress the new data and
introduce pointers to the left portion and right portion of the old
data. What might be complicated to handle are the differences in size
of compressed vs. uncompressed data in all the buffers.

If this yields a compression ratio of 2x, it could be extremely useful,
especially given the historic nature of HAMMER (deleted files take up
less space).
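
As a purely hypothetical illustration of that idea - none of these
fields exist in HAMMER - a record could describe a logical byte range
while pointing into a compressed blob, so the untouched left/right
slices of an overwrite keep referencing the old blob and only the new
middle data gets compressed:

#include <sys/types.h>
#include <stddef.h>

/* Invented layout, for illustration only. */
struct comp_rec {
    off_t    file_off;     /* logical offset within the file            */
    size_t   logical_len;  /* uncompressed bytes this record covers     */
    off_t    blob_off;     /* on-disk location of the compressed blob   */
    size_t   blob_len;     /* compressed size of that blob              */
    size_t   slice_off;    /* where this range starts inside the
                            * decompressed blob, so split records can
                            * share one old blob                        */
    unsigned flags;        /* e.g. a hypothetical HREC_COMPRESSED bit   */
};

The awkward part mentioned above shows up right here: buffer sizes
track logical_len while disk allocation tracks blob_len, and the two
no longer match.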


Even better would be a flag to open(2) to retrieve the file in raw
format (i.e. compressed, in case it is stored compressed).
Think about web servers! Most of them send static files in compressed
form if the client accepts it. That'd be a huge benefit for serving
static pages. Also think about sendfile(2), which could then send
compressed data directly. This would make it much more usable in a
web server and would avoid storing a compressed copy of the file.
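
A hypothetical sketch of how a web server might use such a flag;
O_RAWDATA is invented here, while sendfile(2) is used with its
ordinary BSD signature:

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <fcntl.h>
#include <unistd.h>

#define O_RAWDATA 0x80000000    /* invented flag, for illustration only */

static int
send_static_file(int sock, const char *path, off_t nbytes)
{
    int fd = open(path, O_RDONLY | O_RAWDATA);
    if (fd < 0)
        return -1;

    /*
     * If the file is stored compressed, the kernel could hand the
     * compressed bytes straight to sendfile(2) and the server would
     * add "Content-Encoding: gzip" (assuming the client accepts it).
     */
    off_t sbytes = 0;
    int error = sendfile(fd, sock, 0, (size_t)nbytes, NULL, &sbytes, 0);

    close(fd);
    return error;
}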

The same infrastructure could also be used to implement file-oriented
encryption. Basically every algorithm that works on a stream of data
would be possible. For encryption, a special system call could
associate a key with an open file descriptor, or a default key to use
for the whole process.



Regards,

  Michael


Re: Futures - HAMMER comparison testing?

2008-01-18 Thread Michael Neumann

Bill Hacker wrote:

Michael Neumann wrote:

Bill Hacker wrote:
I'm guessing it will be a while yet before HAMMER is ready for this, 
but it seems to be moving fast - and cleanly - so...


Sorry to hijack this thread. Just wanna mention a little write-up of 
mine about HammerFS features (and sometimes comparing it with ZFS):


http://www.ntecs.de/blog/articles/2008/01/17/zfs-vs-hammerfs

I can't wait to try it out for real!

Regards,

  Michael


Michael - that's a good start!

Such a good start that I'd like to suggest the 'un-blog-like' course 
of correcting the original, at least 'for a while yet', rather than 
blogging-on the errata at the tail.


Yes, I fixed it a bit ;-)

I don't think your ZFS assessment is 100% accurate, so a bit of clean-up 
there could reduce flame-bait.


Done.

That may earn further alteration if/as/when Sun integrates 'Lustre' 
features. At present, HAMMERfs and ZFS have only partial overlap in 
their Venn diagrams.


But I encourage you to keep updating and tracking the changes.

Maybe it should open with a date-stamped 'current state of ...'?


You mean using a "@@0x..." (HAMMER timestamp) in the URL ;-)
I'll wait until I port my blogging software natively over to HAMMER ;-)

Regards,

  Michael


Re: Futures - HAMMER comparison testing?

2008-01-18 Thread Matthew Dillon

:> So it comes down to how much space you are willing to eat up to store
:> the history, and what kind of granularity you will want for the history.
:
:OK - so it WILL be a 'tunable', then.
:...
:HAMMER cannot protect against all forms of human error - BUT - if it 
:inherently rebuilds more intelligently than the least-intelligent of 
:RAID1, it can greatly reduce the opportunity for that sort of 'accident' 
:to occur.

One idea I had was to number the records as they were laid down on
disk, and validate the file or directory by determining that no records
were missing.  But that doesn't fly very well when things are deleted
and replaced.

Another idea, much easier to implement, is to have a way to guarantee
that all the bits and pieces of the file had been found by creating a
record which contains a CRC of the whole mess.   One could have a
'whole file' CRC, or even a 'whole directory tree' CRC (as-of a
particular timestamp).
Since HAMMER is record-oriented, associating special records with
inodes is utterly trivial.

For archival storage one could then 'tag' a directory tree with such a
record and have a way of validating that the directory tree had not become
corrupted, or was recovered properly.
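
A rough sketch of what such a tag and its verification could look
like, with an invented record layout and zlib's crc32() standing in
for whatever CRC HAMMER would actually use:

#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>
#include <zlib.h>        /* crc32() */

struct hammer_tag_rec {          /* invented layout */
    uint64_t as_of_tid;          /* transaction id the CRC was taken at */
    uint32_t crc;                /* CRC of the whole file (or tree)     */
};

static int
verify_file_crc(const char *path, const struct hammer_tag_rec *tag)
{
    unsigned char buf[65536];
    uLong crc = crc32(0L, Z_NULL, 0);
    ssize_t n;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        crc = crc32(crc, buf, (uInt)n);
    close(fd);

    /* 0 = matches the tag, 1 = mismatch or short read */
    return (n == 0 && (uint32_t)crc == tag->crc) ? 0 : 1;
}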

For encryption one could 'tag' a directory tree or a file with an
encryption label.

Not implemented yet but a definite possibility.  There are so many things
we can do with HAMMER due to its record oriented nature.

:> Ultimately it will be extremely efficient simply by the fact that
:> there will be a balancer going through it and repacking it.
:> 
:"... constantly, and in the background..." (I presume)

In the background, for sure.  Probably not constantly, but taking a piece
at a time with a nightly cron job.  One thing I've learned over the
years is that it is a bad idea to just go randomly accessing the disk
at unexpected times.

The nice thing is that the balancing can occur on a cluster-by-cluster
basis, so one can do a bunch of clusters, then stop, then do a bunch
more, then stop, etc.
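
In rough C, that batched pattern looks like this (the cluster helpers
are invented stand-ins, not HAMMER interfaces):

#define CLUSTERS_PER_BATCH  64

extern int  hammer_cluster_count(void);      /* hypothetical */
extern void balance_cluster(int clno);       /* hypothetical */

static void
balance_pass(int *cursor)
{
    int nclusters = hammer_cluster_count();
    int done;

    /* Repack a bounded batch of clusters, remembering where we
     * stopped so the next pass (e.g. tomorrow night's cron job)
     * resumes from there instead of churning the disk constantly. */
    for (done = 0; done < CLUSTERS_PER_BATCH; done++) {
        balance_cluster(*cursor);
        *cursor = (*cursor + 1) % nclusters;
    }
}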

:Is variable-length still likely to have a payback if the data records 
:were to be fixed at 512B or 1024B or integer multiples thereof?

Not a good idea for HAMMER.  A HAMMER record is 96 bytes and a
HAMMER B-Tree element is 56 bytes.  That's 152 bytes of overhead
per record.  The smaller the data associated with each record,
the larger the overhead and the less efficient the filesystem
storage model.  Also, while accessing records is localized, you
only reap major benefits over a linear block storage scheme
if you can make those records reference a significant amount of
data.

So for large static files we definitely want to use a large
per-record data size, and for small static files we want to use
a small data size.  Theoretically the best-case storage for a tiny
file would be 96 + 56 + 128 (inode data) + 64 (data), or
344 bytes of disk space.  That's very, very good.  (In the current
incarnation the minimum disk space use per file is 96 + 56 + 128 + 16384).
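
Plugging those figures in (a throwaway calculation, not HAMMER code):

#include <stdio.h>

int
main(void)
{
    const int record = 96;      /* HAMMER record      */
    const int btree  = 56;      /* B-Tree element     */
    const int inode  = 128;     /* inode data record  */

    int best    = record + btree + inode + 64;      /* variable-length data */
    int current = record + btree + inode + 16384;   /* fixed 16K data       */

    printf("per-record overhead:  %d bytes\n", record + btree);   /* 152   */
    printf("tiny file, best case: %d bytes\n", best);             /* 344   */
    printf("tiny file, current:   %d bytes\n", current);          /* 16664 */
    return 0;
}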

-Matt
Matthew Dillon 
<[EMAIL PROTECTED]>


Re: Futures - HAMMER comparison testing?

2008-01-18 Thread Bill Hacker

Matthew Dillon wrote:
:But - at the end of the day - how much [extra?] on-disk space will be 
:needed to ensure mount 'as-of' is 'good enough' for some realistic span 
:(a week?, a month?)? 'Forever' may be too much to ask.


The amount of disk needed is precisely the same as the amount of
historical data (different from current data) that must be retained,
plus record overhead.

So it comes down to how much space you are willing to eat up to store
the history, and what kind of granularity you will want for the history.


OK - so it WILL be a 'tunable', then.

FWIW - my yardsticks at the 'heavy' or most wasteful end are punch card 
& paper/mylar tape on low/no RAM systems, where 'backup' is essentially 
of 'infinite' granularity, moving through WORM storage to Plan9 Venti, 
et al.


AFAIK, none of the oldest 'write once' methods are in even 'virtualized' 
use - save possibly in the FAA or military fields, as few entities have 
any practical use for that sort of history.


At the other end, one of our projects involved storing the floor plans 
of 60,000 buildings on RAID1. A technician manually rebuilding a failed 
array mirrored the empty HDD over the full one, and over 600 CDs had to 
be manually reloaded.


In that case, there never had been risk of loss - anyone could buy the 
latest CDs from the government lands department.


What his error cost us was 'only' time and inconvenience.

HAMMER cannot protect against all forms of human error - BUT - if it 
inherently rebuilds more intelligently than the least-intelligent of 
RAID1, it can greatly reduce the opportunity for that sort of 'accident' 
to occur.




:How close are we to being able to start predicting that storage-space 
:efficiency relative to ${some_other_fs}?

:
:Bill

Ultimately it will be extremely efficient simply by the fact that
there will be a balancer going through it and repacking it.


"... constantly, and in the background..." (I presume)


".. and with tunable frequency and priority." (I wish, eventually).


For the moment (and through the alpha release) it will be fairly
inefficient because it is using fixed 16K data records, even for small
files.  The on-disk format doesn't care... records can reference 
variable-length data from around 1MB down to 64 bytes.  But supporting
variable-length data requires implementing some overwrite cases that
I don't want to do right now.


Is variable-length still likely to have a payback if the data records 
were to be fixed at 512B or 1024B or integer multiples thereof?


This only applies to regular files of course.  Directories store
directory entries as records, not as data, so directories are packed
really nicely.


e.g. if you have one record representing, say, 1MB of data, and you
write 64 bytes right smack in the middle of that, the write code will
have to take that one record, mark it as deleted, then create three
records to replace it (one pointing to the unchanged left portion of
the original data, one pointing to the 64 bytes of overwritten data,
and one pointing to the unchanged right portion of the original data).
The recovery and deletion code will also have to deal with that sort
of overlaid data situation.  I'm not going to be writing that
feature for a bit.  There are some quick hacks I can do too, for
small files, but it's not on my list prior to the alpha release.

Remember that HAMMER is designed for large filesystems which don't fill
up instantly.  Consequently it will operate under the assumption that
it can take its time to recover free space.  If one doesn't want to use
the history feature one can turn it off, of course, or use a very
granular retention policy.

My local backup system is currently using a 730GB UFS partition and it
is able to back up apollo, crater, and leaf with daily cpdups (using
the hardlink snapshot trick) going back about 3 months.  In fact, I
can only fill up that 730GB about half way because fsck runs out of
memory and fails once you get over around 50 million inodes (mostly
dependent on the number of directories you have)... on UFS that is.
I found that out the hard way.


...which reminds us what we will ALL soon face if we do NOT seek newer 
solutions!




It takes almost a day for fsck to
recover the filesystem even half full.  I'll be happy when I can throw
that old stuff away.

-Matt
	Matthew Dillon 
	<[EMAIL PROTECTED]>


... or just relegate it to what it still does faster/better. IF..

I hope and trust that DragonFly BSD will earn a place as a 'broad 
spectrum' OS, competitive across the board with alternatives.


But - if not, or even just 'not at first'

- much as OpenBSD and NetBSD have long been seen as good choices for 
routers and firewalls, DragonFly should be able to carve out a viable 
niche as the better choice for centralized / c

Re: Futures - HAMMER comparison testing?

2008-01-18 Thread Matthew Dillon

:But - at the end of the day - how much [extra?] on-disk space will be 
:needed to ensure mount 'as-of' is 'good enough' for some realistic span 
:(a week?, a month?)? 'Forever' may be too much to ask.

The amount of disk needed is precisely the same as the amount of
historical data (different from current data) that must be retained,
plus record overhead.

So it comes down to how much space you are willing to eat up to store
the history, and what kind of granularity you will want for the history.

:How close are we to being able to start predicting that storage-space 
:efficiency relative to ${some_other_fs}?
:
:Bill

Ultimately it will be extremely efficient simply by the fact that
there will be a balancer going through it and repacking it.

For the moment (and through the alpha release) it will be fairly
inefficient because it is using fixed 16K data records, even for small
files.  The on-disk format doesn't care... records can reference 
variable-length data from around 1MB down to 64 bytes.  But supporting
variable-length data requires implementing some overwrite cases that
I don't want to do right now.  This only applies to regular files
of course.  Directories store directory entries as records, not as data,
so directories are packed really nicely. 

e.g. if you have one record representing, say, 1MB of data, and you
write 64 bytes right smack in the middle of that, the write code will
have to take that one record, mark it as deleted, then create three
records to replace it (one pointing to the unchanged left portion of
the original data, one pointing to the 64 bytes of overwritten data,
and one pointing to the unchanged right portion of the original data).
The recovery and deletion code will also have to deal with that sort
of overlaid data situation.  I'm not going to be writing that
feature for a bit.  There are some quick hacks I can do too, for
small files, but it's not on my list prior to the alpha release.

Remember that HAMMER is designed for large filesystems which don't fill
up instantly.  Consequently it will operate under the assumption that
it can take its time to recover free space.  If one doesn't want to use
the history feature one can turn it off, of course, or use a very
granular retention policy.

My local backup system is currently using a 730GB UFS partition and it
is able to back up apollo, crater, and leaf with daily cpdups (using
the hardlink snapshot trick) going back about 3 months.  In fact, I
can only fill up that 730GB about half way because fsck runs out of
memory and fails once you get over around 50 million inodes (mostly
dependent on the number of directories you have)... on UFS that is.
I found that out the hard way.  It takes almost a day for fsck to
recover the filesystem even half full.  I'll be happy when I can throw
that old stuff away.

-Matt
Matthew Dillon 
<[EMAIL PROTECTED]>


Re: Futures - HAMMER comparison testing?

2008-01-17 Thread Bill Hacker

Matthew Dillon wrote:

:Sorry to hijack this thread.


Not to worry!

It was *intended* to be 'hijacked'.  Welcome!

:Just wanna mention a little write-up of
:mine about HammerFS features (and sometimes comparing it with ZFS):

:
:http://www.ntecs.de/blog/articles/2008/01/17/zfs-vs-hammerfs
:
:I can't wait to try it out for real!
:
:Regards,
:
:   Michael

Nice.  There are a few factual mistakes but nothing major.  ZFS is
reliable on a 386 system, you just have to limit its memory consumption
with (I think) a sysctl.  HAMMER's resource use is still far lower,
though.  There are still reports of total corruption for ZFS on FreeBSD
but the issue doesn't sound insurmountable.



Not just on FreeBSD. There are some 'bad news' ZFS stories on the 
Solaris-on-UltraSPARC lists as well.


But - so far - more seem to be 'tuning' issues than structural.


HAMMER doesn't journal.  Theoretically by carefully ordering
certain I/O operations it will not have to journal.


As with 'softupdates' the net result - reduced risk of damage -  is more 
important than the technique used to deliver it.



More importantly,
HAMMER can 'recover', as in regenerate, the contents of a cluster on
the fly by scanning the records in that cluster and then rebuilding
the B-Tree and allocation radix trees from scratch.  As long as
recognizable records are present, it should be possible to recover a
considerable amount of information even after the disk gets scratched.



Again - so long as it can be trusted to actually *deliver the goods*, 
the method is 'don't care' to the end-user.



The historical nature and backup sections are correct, minus your
journaling comment.  Basically you make backups by working from an
as-of access.   You don't even have to make an as-of mount, there's
an @@timestamp extension that allows you to access the filesystem
as-of any time.  So, e.g. you can do things like:
'diff /mnt /mnt/@@0x'.



From 'decision maker land' - the hottest of hot buttons are not speed 
or bragging rights but rather:


- The fs will not fail, embarrass, or cost lost time or money.

- It will need less time to install, configure, and maintain than other 
options (in which are included backup/restoral costs).


Needing less effort to muck about with tarballs and rsync offsets a 
(possibly) slower fs. Bigtime.


Not so for lost or corrupted or even just wrongly erased data.


I am down to three major items for the release:  The Recovery, balancing,
and vacuuming subsystems.  All are interrelated and I am making good
progress.  Beyond that the spike code needs some major tweaking but
the only effect of that is poor write performance (probably through
the alpha release).


30% of the I/O speed of current alternatives is fine with me. 25% or 
less might be problematic.


50% is even OK for the long-term if the backup/restoral/rollback pays 
off. Those things are otherwise not 'free'.


Of course, there are many other little issues that need to be dealt 
with before the release as well.

Post release I'll have a go at implementing backup/mirroring streaming.
I have a pretty good idea how to implement it -- basically by storing a
last-transaction-id in cluster headers, super-cluster headers, and
volume headers, in order to reduce the amount of initial scanning
required to resynchronize a stream.

-Matt
	Matthew Dillon 
	<[EMAIL PROTECTED]>


That last part sounds like 'journaling' to me.

But - at the end of the day - how much [extra?] on-disk space will be 
needed to ensure mount 'as-of' is 'good enough' for some realistic span 
(a week?, a month?)? 'Forever' may be too much to ask.


How close are we to being able to start predicting that storage-space 
efficiency relative to ${some_other_fs}?


Bill




Re: Futures - HAMMER comparison testing?

2008-01-17 Thread Bill Hacker

Michael Neumann wrote:

Bill Hacker wrote:
I'm guessing it will be a while yet before HAMMER is ready for this, 
but it seems to be moving fast - and cleanly - so...


Sorry to hijack this thread. Just wanna mention a little write-up of 
mine about HammerFS features (and sometimes comparing it with ZFS):


http://www.ntecs.de/blog/articles/2008/01/17/zfs-vs-hammerfs

I can't wait to try it out for real!

Regards,

  Michael


Michael - that's a good start!

Such a good start that I'd like to suggest the 'un-blog-like' course 
of correcting the original, at least 'for a while yet', rather than 
blogging-on the errata at the tail.


I don't think your ZFS assessment is 100% accurate, so a bit of clean-up 
there could reduce flame-bait.


That may earn further alteration if/as/when Sun integrates 'Lustre' 
features. At present, HAMMERfs and ZFS have only partial overlap in 
their Venn diagrams.


But I encourage you to keep updating and tracking the changes.

Maybe it should open with a date-stamped 'current state of ...'?

Thanks!

Bill


Re: Futures - HAMMER comparison testing?

2008-01-17 Thread Matthew Dillon

:Sorry to hijack this thread. Just wanna mention a little write-up of 
:mine about HammerFS features (and sometimes comparing it with ZFS):
:
:http://www.ntecs.de/blog/articles/2008/01/17/zfs-vs-hammerfs
:
:I can't wait to try it out for real!
:
:Regards,
:
:   Michael

Nice.  There are a few factual mistakes but nothing major.  ZFS is
reliable on a 386 system, you just have to limit its memory consumption
with (I think) a sysctl.  HAMMER's resource use is still far lower,
though.  There are still reports of total corruption for ZFS on FreeBSD
but the issue doesn't sound insurmountable.

HAMMER doesn't journal.  Theoretically by carefully ordering
certain I/O operations it will not have to journal.  More importantly,
HAMMER can 'recover', as in regenerate, the contents of a cluster on
the fly by scanning the records in that cluster and then rebuilding
the B-Tree and allocation radix trees from scratch.  As long as
recognizable records are present, it should be possible to recover a
considerable amount of information even after the disk gets scratched.
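
Conceptually, that per-cluster recovery scan has the following shape;
every type and helper below is an invented stand-in, only the overall
idea comes from the paragraph above:

struct rec;                                          /* opaque on-disk record */

extern int  rec_count(void *cluster);                /* hypothetical */
extern struct rec *rec_at(void *cluster, int i);     /* hypothetical */
extern int  rec_looks_valid(struct rec *r);          /* e.g. magic + CRC check */
extern void btree_insert(void *btree, struct rec *r);
extern void alloc_mark_used(void *amap, struct rec *r);

static void
recover_cluster(void *cluster, void *fresh_btree, void *fresh_amap)
{
    int i, n = rec_count(cluster);

    for (i = 0; i < n; i++) {
        struct rec *r = rec_at(cluster, i);

        if (!rec_looks_valid(r))         /* skip torn or garbage records */
            continue;
        btree_insert(fresh_btree, r);    /* rebuild the B-Tree           */
        alloc_mark_used(fresh_amap, r);  /* rebuild the allocation map   */
    }
}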

The historical nature and backup sections are correct, minus your
journaling comment.  Basically you make backups by working from an
as-of access.   You don't even have to make an as-of mount, there's
an @@timestamp extension that allows you to access the filesystem
as-of any time.  So, e.g. you can do things like:
'diff /mnt /mnt/@@0x'.

I am down to three major items for the release:  The Recovery, balancing,
and vacuuming subsystems.  All are interrelated and I am making good
progress.  Beyond that the spike code needs some major tweaking but
the only effect of that is poor write performance (probably through
the alpha release).  Of course, there are many other little issues
that need to be dealt with before the release as well.

Post release I'll have a go at implementing backup/mirroring streaming.
I have a pretty good idea how to implement it -- basically by storing a
last-transaction-id in cluster headers, super-cluster headers, and
volume headers, in order to reduce the amount of initial scanning
required to resynchronize a stream.
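
A sketch of that hierarchical skip-scan, with invented structures
carrying a last-transaction-id at each level:

#include <stdint.h>

struct clu  { uint64_t last_tid; /* ... record area ... */ };
struct sclu { uint64_t last_tid; int nclus;  struct clu  *clus;  };
struct vol  { uint64_t last_tid; int nsclus; struct sclu *sclus; };

extern void stream_cluster(struct clu *c, uint64_t since_tid); /* hypothetical */

static void
resync_volume(struct vol *v, uint64_t synced_tid)
{
    if (v->last_tid <= synced_tid)              /* nothing new anywhere */
        return;

    for (int i = 0; i < v->nsclus; i++) {
        struct sclu *s = &v->sclus[i];
        if (s->last_tid <= synced_tid)          /* untouched super-cluster */
            continue;
        for (int j = 0; j < s->nclus; j++) {
            struct clu *c = &s->clus[j];
            if (c->last_tid <= synced_tid)      /* untouched cluster */
                continue;
            stream_cluster(c, synced_tid);      /* send only newer records */
        }
    }
}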

-Matt
Matthew Dillon 
<[EMAIL PROTECTED]>


Re: Futures - HAMMER comparison testing?

2008-01-17 Thread Michael Neumann

Bill Hacker wrote:
I'm guessing it will be a while yet before HAMMER is ready for this, but 
it seems to be moving fast - and cleanly - so...


Sorry to hijack this thread. Just wanna mention a little write-up of 
mine about HammerFS features (and sometimes comparing it with ZFS):


http://www.ntecs.de/blog/articles/2008/01/17/zfs-vs-hammerfs

I can't wait to try it out for real!

Regards,

  Michael


Futures - HAMMER comparison testing?

2008-01-14 Thread Bill Hacker
I'm guessing it will be a while yet before HAMMER is ready for this, but 
it seems to be moving fast - and cleanly - so...


.. to the extent networked clustering AND inherent recoverability are 
intended to be major strengths - to what comparable fs should we expect 
to look if we were to compare performance & features?



My own first impression includes:

- IBM's 'GPFS'  (hard for mere mortals to get access to a realistic test 
environment, but perhaps performance is well documented somewhere in a 
Redpaper?)


- Just-acquired-by-Sun 'Lustre' (an unknown to me)

- CMU's AFS (in ports. Has never seemed particularly magical to me..)

- Plan 9's Fossil/Venti (likewise, even when run 'native' - IMNSHO, 
actually no longer viable).


- Is NFS a realistic candidate in some incarnation?

Others? [1]

... and - eventual performance quite aside - should we be looking at stress 
test design [2], or are there benchmarks already appropriate?



Bill Hacker


[1] I don't know that ZFS *inherently* plays in the cluster yet - but 
Lustre was allegedly purchased in order to be worked in with ZFS so... 
'maybe soon'.


[2] Tracking ZFS from mailing lists gives me the impression that it has 
been relatively solid for PJD on small HDD and RAM. Likewise for Sun on 
massively greater RAM and HDD arrays.


BUT - that many of the reported problems are from folks in the 'tween 
decks' - i.e. larger arrays and RAM than PJD reports as solid, but less 
available RAM-to-disk than Sun generally uses for real-world use.


So the question arises whether HAMMER will

a) be overly sensitive to / hungry for memory as a resource

b) if so, at least just slow down when short-changed rather than crashing.

c) IF crashed (or otherwise treated rudely), be more certain of 
trouble-free recovery than product 'x'.