Re: Some very basic questions

2008-10-21 Thread Eric Anopolsky
On Tue, 2008-10-21 at 18:18 -0400, Ric Wheeler wrote:
> Eric Anopolsky wrote:
> > On Tue, 2008-10-21 at 09:59 -0400, Chris Mason wrote:
> >   
> >>> - power loss at any time must not corrupt the fs (atomic fs modification)
> >>>   (new-data loss is acceptable)
> >>>   
> >> Done.  Btrfs already uses barriers as required for sata drives.
> >> 
> >
> > Aren't there situations in which write barriers don't do what they're
> > supposed to do?
> >
> > Cheers,
> > Eric
> >
> >   
> If the drive effectively "lies" to you about flushing the write cache, 
> you might have an issue. I have not seen that first hand with recent 
> disk drives (and I have seen a lot :-))

That does not match the understanding I get from reading the
notes/caveats section of Documentation/block/barrier.txt:

"Note that block drivers must not requeue preceding requests while
completing latter requests in an ordered sequence.  Currently, no
error checking is done against this."

and perhaps more importantly:

"[a technical scenario involving disk writes]
The problem here is that the barrier request is *supposed* to indicate
that filesystem update requests [2] and [3] made it safely to the
physical medium and, if the machine crashes after the barrier is
written, filesystem recovery code can depend on that.  Sadly, that
isn't true in this case anymore.  IOW, the success of an I/O barrier
should also be dependent on success of some of the preceding requests,
where only upper layer (filesystem) knows what 'some' is.

This can be solved by implementing a way to tell the block layer which
requests affect the success of the following barrier request and
making lower level drivers to resume operation on error only after
block layer tells it to do so.

As the probability of this happening is very low and the drive should
be faulty, implementing the fix is probably an overkill.  But, still,
it's there."

Cheers,
Eric






Re: BTRFS Performance page

2008-10-21 Thread Chris Mason
On Tue, Oct 21, 2008 at 05:20:03PM -0500, Steven Pratt wrote:
> As discussed on the BTRFS conference call, myself and Kevin Corry have  
> set up some test machines for the purpose of doing performance testing  
> on BTRFS.  The intent is to have a semi permanent setup that we can use  
> to test new features and code drops in BTRFS as well as to do  
> comparisons to other file systems.  The systems are pretty much fully  
> automated for execution, so we should be able to crank out large numbers  
> of different benchmarks as well as keep up with GIT changes.
>
> The data is hosted at http://btrfs.boxacle.net/. So far we have the data  
> for the single disk tests uploaded. We should be able to upload results  
> from the larger RAID config tomorrow.
>
> Initial tests were done with the FFSB benchmark, and we picked five common
> workloads: create, random and sequential read, random write, and a mail
> server emulation.  We plan to expand this based on feedback to include
> more FFSB tests and/or other workloads.
>
> All runs have complete analysis data with them (iostat, mpstat,  
> oprofile, sar), as well as the FFSB profiles that can be used to  
> recreate any test we ran. We have also collected blktrace data, but it has
> not been uploaded due to its size.
>
> Please follow the results link on the bottom of the main page to get to  
> the current results.  Let me know what you like or don't like.   I will  
> post again when we get the RAID data uploaded.

Very interesting data, thank you for posting this.  The first comment
I'll make is that -o nodatacow requires -o nodatasum.  The sums aren't
valid without the cow.

The FFSB mail server workload, does it do fsync writes?

For the sequential read workload, I'm guessing (hoping) the files are
created in parallel?

-chris


BTRFS Performance page

2008-10-21 Thread Steven Pratt
As discussed on the BTRFS conference call, Kevin Corry and I have
set up some test machines for the purpose of doing performance testing
on BTRFS.  The intent is to have a semi-permanent setup that we can use
to test new features and code drops in BTRFS as well as to do 
comparisons to other file systems.  The systems are pretty much fully 
automated for execution, so we should be able to crank out large numbers 
of different benchmarks as well as keep up with GIT changes.


The data is hosted at http://btrfs.boxacle.net/. So far we have the data 
for the single disk tests uploaded. We should be able to upload results 
from the larger RAID config tomorrow.


Initial tests were done with the FFSB benchmark, and we picked five common
workloads: create, random and sequential read, random write, and a mail
server emulation.  We plan to expand this based on feedback to include
more FFSB tests and/or other workloads.


All runs have complete analysis data with them (iostat, mpstat, 
oprofile, sar), as well as the FFSB profiles that can be used to 
recreate any test we ran. We have also collected blktrace data, but it has
not been uploaded due to its size.


Please follow the results link on the bottom of the main page to get to 
the current results.  Let me know what you like or don't like.   I will 
post again when we get the RAID data uploaded.



Steve


Re: Some very basic questions

2008-10-21 Thread Ric Wheeler

Eric Anopolsky wrote:
> On Tue, 2008-10-21 at 09:59 -0400, Chris Mason wrote:
> > > - power loss at any time must not corrupt the fs (atomic fs modification)
> > >   (new-data loss is acceptable)
> >
> > Done.  Btrfs already uses barriers as required for sata drives.
>
> Aren't there situations in which write barriers don't do what they're
> supposed to do?
>
> Cheers,
> Eric

If the drive effectively "lies" to you about flushing the write cache,
you might have an issue. I have not seen that first hand with recent
disk drives (and I have seen a lot :-))


Ric




Re: Some very basic questions

2008-10-21 Thread Eric Anopolsky
On Tue, 2008-10-21 at 09:59 -0400, Chris Mason wrote:
> > - power loss at any time must not corrupt the fs (atomic fs modification)
> >   (new-data loss is acceptable)
> 
> Done.  Btrfs already uses barriers as required for sata drives.

Aren't there situations in which write barriers don't do what they're
supposed to do?

Cheers,
Eric





Re: Data-deduplication?

2008-10-21 Thread Valerie Aurora Henson
On Sun, Oct 19, 2008 at 08:16:31PM -0400, Chris Mason wrote:
> 
> I think I'll have to come back to this after getting ENOSPC to work at
> all ;)  You're right that reserved space can do wonders to dig us out of

:) Having been through this before, I can say the ENOSPC accounting is
incredibly hard to get right.  It's at least worth thinking about the
edge cases while you're writing the first version, although you will
probably just have to throw one away no matter what.

> holes, it has to be reserved at a multiple of the number of procs that I
> allow into the transaction.
> 
> I should be able to go into an emergency one writer at a time theme as
> space gets really tight, but there are lots of missing pieces that
> haven't been coded yet in that area.

Makes sense.

I have the following "behave like I expect" rules for things that
often aren't right in the first version of a COW file system.

* If a write could succeed in the future without any user-level
  changes to the file system, then it will succeed the first time.

Basically, this is reflecting what happens when space used by the
previous version of the fs is freed after the next COW version is
written out.  A naive implementation of COW will fail the write if it
happens while enough other writes are outstanding, even if there would
be enough space after the other writes have been synced to disk and
the blocks from the old version are freed.  This means backing off to
the one-writer-at-a-time mode you are talking about.

* Rewriting metadata will always succeed.

Again, with naive COW, you can get into a state where doing a chmod()
on a file could end up returning ENOSPC.  Totally uncool.  Pretty much
just requires a little reserved space.

* Deletion will always succeed.

Again, reserved space, plus a little forethought in metadata design.
It is not automatically the case that your metadata will be designed
such that deletion will always result in more free space afterwards,
so it's worth a review pass just to be sure.

One thing I ran into before is that it's non-trivial to calculate
exactly how many blocks will need to be COW'd for even the tiniest
write.  Leaves split, directories grow another block, the inode block
has to be copied, the tree grows another level, you have to allocate a
new free space extent, etc., etc.  The worst case can be hundreds of
KB per 1-byte write.  Logically, you may only be writing a few bytes,
but they may require megabytes of free space to sync out to disk.
Very annoying.

-VAL


[PATCH] nuke fs wide allocation mutex

2008-10-21 Thread Josef Bacik
Hello,

This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.  There is now a pinned_mutex, which is used when messing with
the pinned_extents extent io tree, and the extent_ins_mutex, which is used with
the pending_del and extent_ins extent io trees.  The locking for the extent tree
stuff was inspired by a patch that Yan Zheng wrote to fix a race condition; I
cleaned it up some and changed the locking around a little bit, but the idea
remains the same.  Basically, instead of holding the extent_ins_mutex throughout
the processing of an extent on the extent_ins or pending_del trees, we hold it
only while searching and while clearing the bits on those trees, and instead
lock the extent itself for the duration of the operations on it.  Also, to keep
from getting hung up waiting to lock an extent, I've added a try_lock_extent: if
we cannot lock an extent, we move on to the next one in the tree and come back
to it later.  I have tested this heavily and it does not appear to break
anything.  This has to be applied on top of my find_free_extent redo patch.
Thank you,

Signed-off-by: Josef Bacik <[EMAIL PROTECTED]>


diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 9caeb37..4f2 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1390,8 +1390,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root
lowest_level = p->lowest_level;
WARN_ON(lowest_level && ins_len > 0);
WARN_ON(p->nodes[0] != NULL);
-   WARN_ON(cow && root == root->fs_info->extent_root &&
-   !mutex_is_locked(&root->fs_info->alloc_mutex));
+
if (ins_len < 0)
lowest_unlock = 2;
 
@@ -2051,6 +2050,7 @@ static noinline int split_node(struct btrfs_trans_handle *trans,
if (c == root->node) {
/* trying to split the root, lets make a new one */
ret = insert_new_root(trans, root, path, level + 1);
+   printk(KERN_ERR "splitting the root, %llu\n", c->start);
if (ret)
return ret;
} else {
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index fad58b9..d1e304f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -516,12 +516,14 @@ struct btrfs_free_space {
struct rb_node offset_index;
u64 offset;
u64 bytes;
+   unsigned long ip;
 };
 
 struct btrfs_block_group_cache {
struct btrfs_key key;
struct btrfs_block_group_item item;
spinlock_t lock;
+   struct mutex alloc_mutex;
u64 pinned;
u64 reserved;
u64 flags;
@@ -600,6 +602,7 @@ struct btrfs_fs_info {
struct mutex transaction_kthread_mutex;
struct mutex cleaner_mutex;
struct mutex alloc_mutex;
+   struct mutex extent_io_mutex;
struct mutex chunk_mutex;
struct mutex drop_mutex;
struct mutex volume_mutex;
@@ -1879,8 +1882,12 @@ int btrfs_acl_chmod(struct inode *inode);
 /* free-space-cache.c */
 int btrfs_add_free_space(struct btrfs_block_group_cache *block_group,
 u64 bytenr, u64 size);
+int btrfs_add_free_space_lock(struct btrfs_block_group_cache *block_group,
+ u64 offset, u64 bytes);
 int btrfs_remove_free_space(struct btrfs_block_group_cache *block_group,
u64 bytenr, u64 size);
+int btrfs_remove_free_space_lock(struct btrfs_block_group_cache *block_group,
+u64 offset, u64 bytes);
 void btrfs_remove_free_space_cache(struct btrfs_block_group_cache
   *block_group);
 struct btrfs_free_space *btrfs_find_free_space(struct btrfs_block_group_cache
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0be044b..6da2345 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1458,6 +1458,7 @@ struct btrfs_root *open_ctree(struct super_block *sb,
mutex_init(&fs_info->tree_log_mutex);
mutex_init(&fs_info->drop_mutex);
mutex_init(&fs_info->alloc_mutex);
+   mutex_init(&fs_info->extent_io_mutex);
mutex_init(&fs_info->chunk_mutex);
mutex_init(&fs_info->transaction_kthread_mutex);
mutex_init(&fs_info->cleaner_mutex);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5f235fc..c27c71b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -164,6 +164,7 @@ static int add_new_free_space(struct btrfs_block_group_cache *block_group,
u64 extent_start, extent_end, size;
int ret;
 
+   mutex_lock(&info->extent_io_mutex);
while (start < end) {
ret = find_first_extent_bit(&info->pinned_extents, start,
&extent_start, &extent_end,
@@ -175,7 +176,8 @@ static int add_new_free_space(struct btrfs_block_group_cache *block_group,
start = extent_end + 1;
} else if (extent_start > start && extent_start < end) {
   

Re: Some very basic questions

2008-10-21 Thread jim owens

calin wrote:
> > question is: if you had such an implementation, are there drawbacks
> > to be expected for the single-mount case? If not I'd vote for it
> > because there are not really many alternatives "on the market".
>
> As I understand it, the largest issue is in locking and boundaries.

Correct, that is the first big issue.  As soon as 2 machines can
access the same device, you must design for distributed locking.
And that means a lot more code, lower performance, and a lot of
things a local-only filesystem could do that must be disallowed.

The second issue is the purpose of having more than one host
access the data directly from the device.  There are cases
where this is a good thing because the application is designed
with data partitioning and multi-instance coordination.  It is
a bad thing for random uncoordinated use like backups or fsck.

Remember that the device bandwidth is the limiter, so even
when each host has a dedicated path to the device (as in
dual-port SAS or FC), a 2nd host cuts the throughput by
more than half with uncoordinated seeks and transfers.

And if the host device drivers are not designed for multiple
host sharing, this can cause timeouts, resets, and false
device-failed states.

And yes... even read-only access from a 2nd host is trouble
in many parts of the design and does not come for free.

jim


Btrfs conference call

2008-10-21 Thread Chris Mason
Hello everyone,

Our regular bi-weekly btrfs conf call will be Wednesday October 22nd.

There is a new dial in number below.

Time: 1:30pm US Eastern (10:30am Pacific)

* Dial-in Number(s):
* Toll Free: +1-888-967-2253
* Toll  +1-650-607-2253 
* Meeting id: 665734
* Passcode: 428737 (which hopefully spells 4Btrfs)

-chris



Re: Some very basic questions

2008-10-21 Thread Chris Mason
On Tue, 2008-10-21 at 18:27 +0200, Stephan von Krawczynski wrote:

> > > 2. general requirements
> > > - fs errors without file/dir names are useless
> > > - errors in parts of the fs are no reason for a fs to go offline as a whole
> > 
> > These two are in progress.  Btrfs won't always be able to give a file
> > and directory name, but it will be able to give something that can be
> > turned into a file or directory name.  You don't want important
> > diagnostic messages delayed by name lookup.
> 
> That's a point I really never understood. Why is it non-trivial for a fs to
> know what file or dir (name) it is currently working on?

The name lives in block A, but you might find a corruption while
processing block B.  Block A might not be in ram anymore, or it might be
in ram but locked by another process.

On top of all of that, when we print errors it's because things haven't
gone well.  They are deep inside of various parts of the filesystem, and
we might not be able to take the required locks or read from the disk in
order to find the name of the thing we're operating on.

> > 
> > > - mounting must not delay the system startup significantly
> > 
> > Mounts are fast
> > 
> > > - resizing during runtime (up and down)
> > 
> > Resize is done
> > 
> > > - parallel mounts (very important!)
> > >   (two or more hosts mount the same fs concurrently for reading and
> > >   writing)
> > 
> > As Jim and Andi have said, parallel mounts are not in the feature list
> > for Btrfs.  Network filesystems will provide these features.
> 
> Can you explain what "network filesystems" stands for in this statement,
> please name two or three examples.
> 
NFS (done), CRFS (under development), and maybe Ceph, which is also
under development.

> > > - journaling
> > 
> > Btrfs doesn't journal.  The tree logging code is close, it provides
> > optimized fsync and O_SYNC operations.  The same basic structures could
> > be used for remote replication.
> > 
> > > - versioning (file and dir)
> > 
> > From a data structure point of view, version control is fairly easy.
> > From a user interface and policy point of view, it gets difficult very
> > quickly.  Aside from snapshotting, version control is outside the scope
> > of btrfs.
> > 
> > There are lots of good version control systems available, I'd suggest
> > you use them instead.
> 
> To me versioning sounds like a not-so-easy-to-implement feature. Nevertheless
> I trust your experience. If a basic implementation is possible and not too
> complex, why deny a feature? 
> 

In general I think snapshotting solves enough of the problem for most of
the people most of the time.  I'd love for Btrfs to be the perfect FS,
but I'm afraid everyone has a different definition of perfect.

Storing multiple versions of something is pretty easy.  Making a usable
interface around those versions is the hard part, especially because you
need groups of files to be versioned together in atomic groups
(something that looks a lot like a snapshot).

Versioning is solved in userspace.  We would never be able to implement
everything that git or mercurial can do inside the filesystem.

> > > - undelete (file and dir)
> > 
> > Undelete is easy
> 
> Yes, we hear and say that all the time, name one linux fs doing it, please.
> 

The fact that nobody is doing it is not a good argument for why it
should be done ;)  Undelete is a policy decision about what to do with
files as they are removed.  I'd much rather see it implemented above the
filesystems instead of individually in each filesystem.

This doesn't mean I'll never code it, it just means it won't get
implemented directly inside of Btrfs.  In comparison with all of the
other features pending, undelete is pretty far down on the list.

> > but I think best done at a layer above the FS.
> 
> Before we got into the linux community we used n.vell netware. Undelete has
> been there since about the first day. More than ten years later (nowadays) it
> is still missing in linux. I really do suggest to provide _some_ solution and
> _then_ let's talk about the _better_ solution.
> 
> > > - snapshots
> > 
> > Done
> > 
> > > - run into hd errors more than once for the same file (as an option)
> > 
> > Sorry, I'm not sure what you mean here.
> 
> If your hd is going dead you often find out that touching broken files takes
> ages. If the fs finds out a file is corrupt because the device has errors it
> could just flag the file as broken and not re-read the same error a thousand
> times more. Obviously you want that as an option, because there can be good
> reasons for re-reading dead files...

I really agree that we want to avoid beating on a dead drive.

Btrfs will record some error information about the drive so it can
decide what to do with failures.  But, remembering that sector #12345768
is bad doesn't help much.  When the drive returned the IO error it
remapped the sector and the next write will probably s

Re: Some very basic questions

2008-10-21 Thread calin
> question is: if you had such an implementation, are there drawbacks
> to be expected for the single-mount case? If not I'd vote for it
> because there are not really many alternatives "on the market".

As I understand it, the largest issue is in locking and boundaries.  Two 
different systems could mount a filesystem, and try to use some sort of on-disk 
markers to keep from writing to the same area at the same time... but there is 
often some bit of time between when a system sends data to the disk and when it 
would become available to read from the disk, and little or no guarantee about 
the order in which the data is written.  All the work that goes into making 
transactions atomic depends on there only being a single path to the disk - 
through the code that handles transactions.  If data can arrive on the disk 
without being managed by that code, all bets are off.


Re: Some very basic questions

2008-10-21 Thread Ric Wheeler

Christoph Hellwig wrote:
> On Tue, Oct 21, 2008 at 07:01:36PM +0200, Stephan von Krawczynski wrote:
> > Sure, but what you say only reflects the ideal world. On a file service, you
> > never have that. In fact you do not even have good control about what is going
> > on. Let's say you have a setup that creates, reads and deletes files 24h a day
> > from numerous clients. At two o'clock in the morning some hd decides to
> > partially die. Files get created on it, fill data up to errors, get
> > deleted and another bunch of data arrives and yet again fs tries to allocate
> > the same dead areas. You lose a lot more data only because the fs did not map
> > out the already known dead blocks. Of course you would replace the dead drive
> > later on, but in the meantime you have a lot of fun.
> > In other words: give me a tool to freeze the world right at the time the
> > errors show up, or map out dead blocks (only because it is a lot easier).
>
> When modern disks can't solve the problems with their internal drive
> remapping anymore, you had better replace them ASAP, as it is a very strong
> indication of disk failure.  Last year's FAST had some very interesting
> statistics showing this in the field.


Doing proactive drive pulls is kind of a black art, but looking for 
*lots* of remapped sectors is always a pretty reliable clue. Note that 
modern S-ATA disks might have room to remap 2-3 thousand sectors, so you 
should not worry too much about a handful (say 20 or so). Sometimes the 
remapping happens because of transient things (junk on the platter, 
vibrations, out of spec temperature range, etc) so your drive might be 
perfectly healthy.


If you have remapped a big chunk of the sectors (say more than 10%), you
should grab the data off the disk ASAP and replace it. Worry less about
errors during reads; write errors indicate more serious problems.


The file system should not have to worry about remapping sectors
internally; by the time writes fail and you have consumed all remapped
sectors, you should definitely be in read-only mode and well on the way 
to replacing the disk :-)


ric



Re: Some very basic questions

2008-10-21 Thread Christoph Hellwig
On Tue, Oct 21, 2008 at 07:01:36PM +0200, Stephan von Krawczynski wrote:
> Sure, but what you say only reflects the ideal world. On a file service, you
> never have that. In fact you do not even have good control about what is going
> on. Let's say you have a setup that creates, reads and deletes files 24h a day
> from numerous clients. At two o'clock in the morning some hd decides to
> partially die. Files get created on it, fill data up to errors, get
> deleted and another bunch of data arrives and yet again fs tries to allocate
> the same dead areas. You lose a lot more data only because the fs did not map
> out the already known dead blocks. Of course you would replace the dead drive
> later on, but in the meantime you have a lot of fun.
> In other words: give me a tool to freeze the world right at the time the
> errors show up, or map out dead blocks (only because it is a lot easier).

When modern disks can't solve the problems with their internal drive
remapping anymore, you had better replace them ASAP, as it is a very strong
indication of disk failure.  Last year's FAST had some very interesting
statistics showing this in the field.


Re: Some very basic questions

2008-10-21 Thread Stephan von Krawczynski
On Tue, 21 Oct 2008 09:20:16 -0400
jim owens <[EMAIL PROTECTED]> wrote:

> btrfs has many of the same goals... but they are goals not code
> so when you might see them is indeterminate.

no big issue, my pension is 20 years away, I got time ;-)
 
> I believe these should not be in btrfs:
> 
> Stephan von Krawczynski wrote:
> 
> > - parallel mounts (very important!)
> 
> as Andi said, you want a cluster or distributed fs.  There
> are layered designs (CRFS or network filesystems) that can do
> the job and trying to do it in btrfs causes too many problems.

question is: if you had such an implementation, are there drawbacks to be
expected for the single-mount case? If not I'd vote for it because there are
not really many alternatives "on the market".

> > - journaling
> 
> I assume you *do not* mean metadata journaling, you mean
> sending all file updates to a single output stream (as in one
> disk, tape, or network link).  I've done that, but would not
> recommend it in btrfs because it limits the total fs bandwidth
> to what the single stream can support.  This is normally done
> today by applications like databases, not in the filesystem.

As far as I know metadata journaling is in, right?
If you mean something capable of creating live or offline images of the fs,
you got me right.
 
> > - map out dead blocks
> 
> Useless... a waste of time, code, and metadata structures.
> With current device technology, any device reporting bad blocks
> the device cannot map out is about to die and needs to be replaced!

Sure, but what you say only reflects the ideal world. On a file service, you
never have that. In fact you do not even have good control about what is going
on. Let's say you have a setup that creates, reads and deletes files 24h a day
from numerous clients. At two o'clock in the morning some hd decides to
partially die. Files get created on it, fill data up to errors, get
deleted and another bunch of data arrives and yet again fs tries to allocate
the same dead areas. You lose a lot more data only because the fs did not map
out the already known dead blocks. Of course you would replace the dead drive
later on, but in the meantime you have a lot of fun.
In other words: give me a tool to freeze the world right at the time the
errors show up, or map out dead blocks (only because it is a lot easier).

> jim

-- 
Regards,
Stephan


Re: Some very basic questions

2008-10-21 Thread Andi Kleen
Stephan von Krawczynski <[EMAIL PROTECTED]> writes:
>
> Yes, we hear and say that all the time, name one linux fs doing it, please.

ext[234] support it to some extent. It has some limitations
(especially when the files are large, and you shouldn't do too much follow-on
IO, to prevent the data from being overwritten) and the user frontends are not
very nice, but it's there.

-Andi

-- 
[EMAIL PROTECTED]


Re: Some very basic questions

2008-10-21 Thread Stephan von Krawczynski
Hello Chris, 

let me clarify some things a bit, see ...

On Tue, 21 Oct 2008 09:59:40 -0400
Chris Mason <[EMAIL PROTECTED]> wrote:

> Thanks for this input and for taking the time to post it.
> 
> > 1. filesystem-check
> > 1.1 it should not
> > - delay boot process (we have to wait for hours currently)
> > - prevent mount in case of errors
> > - be a part of the mount process at all
> > - always check the whole fs
> 
> For this, you have to define filesystem-check very carefully.  In
> reality, corruptions can prevent mounting.  We can try very very hard to
> limit the class of corruptions that prevent mounting, and use
> duplication and replication to create configurations that address the
> remaining cases.

What we would like to have is a possibility to check an already mounted and
active fs for corruption; that's the reporting part.
If some corruption is found we should be able to correct the
data/metadata/whatever on the _still active_ fs, let's say by starting fsck in
modify mode. It is often preferred not to do a run over the complete fs but
only over certain (already known-to-be-corrupted) parts/subtrees.
It is obvious that the fs should not go offline then, even if something very
ugly happens.
You can imagine:
Run fsck via cron every night. Then look at the logs in the morning, and if bad
news arrived, try to correct the broken subtree or exclude it from further
usage.

> In general, we'll be able to make things much better than they are
> today.

I am pretty sure about that ;-)

> > 1.2 it should be able 
> > - to always be started interactively by user
> > - to check parts/subtrees of the fs
> > - to run purely informational (reporting, non-modifying)
> > - to run on a mounted fs
> 
> Started interactively?  I'm not entirely sure what that means, but in
> general when you ask the user a question about if/how to fix a
> corruption, they will have no idea what the correct answer is.

see above explanation. We don't expect the classical y/n-questions during
fsck. Honestly there are only 3 types of modification modes in fsck:
- try correction in place
- exclude (i.e. delete) whole problem subtree
- duplicate to another subtree whatever can be rescued from the original place
  (and leave problem subtree as-is)

> > 2. general requirements
> > - fs errors without file/dir names are useless
> > - errors in parts of the fs are no reason for a fs to go offline as a whole
> 
> These two are in progress.  Btrfs won't always be able to give a file
> and directory name, but it will be able to give something that can be
> turned into a file or directory name.  You don't want important
> diagnostic messages delayed by name lookup.

That's a point I really never understood. Why is it non-trivial for a fs to
know what file or dir (name) it is currently working on?
It really sounds strange to me that a layer that is managing files on some
device does not know at any time during runtime what file or dir it is
actually handling. If _it_ does not know, how should the _user_, probably hours
later reading the logs, know, based on inode numbers or whatever cryptic logs
are thrown out? I mean filenames are nothing more than a human-readable
describing data structure, mostly of type char. Its only reason for existence is
readability, so why not in logs?

> 
> > - mounting must not delay the system startup significantly
> 
> Mounts are fast
> 
> > - resizing during runtime (up and down)
> 
> Resize is done
> 
> > - parallel mounts (very important!)
> >   (two or more hosts mount the same fs concurrently for reading and
> >   writing)
> 
> As Jim and Andi have said, parallel mounts are not in the feature list
> for Btrfs.  Network filesystems will provide these features.

Can you explain what "network filesystems" stands for in this statement,
please name two or three examples.

> > - journaling
> 
> Btrfs doesn't journal.  The tree logging code is close, it provides
> optimized fsync and O_SYNC operations.  The same basic structures could
> be used for remote replication.
> 
> > - versioning (file and dir)
> 
> >From a data structure point of view, version control is fairly easy.
> >From a user interface and policy point of view, it gets difficult very
> quickly.  Aside from snapshotting, version control is outside the scope
> of btrfs.
> 
> There are lots of good version control systems available, I'd suggest
> you use them instead.

To me versioning sounds like a not-so-easy-to-implement feature. Nevertheless
I trust your experience. If a basic implementation is possible and not too
complex, why deny a feature? 

> > - undelete (file and dir)
> 
> Undelete is easy

Yes, we hear and say that all the time, name one linux fs doing it, please.

> but I think best done at a layer above the FS.

Before we got into the linux community we used n.vell netware. Undelete has
been there since about the first day. More than ten years later (nowadays) it
is still missing in linux. I really do suggest to provide _some_ solution and
_then_ let's talk about the _better_ solution.

Re: Some very basic questions

2008-10-21 Thread Andi Kleen
Chris Mason <[EMAIL PROTECTED]> writes:
>
> Started interactively?  I'm not entirely sure what that means, but in
> general when you ask the user a question about if/how to fix a
> corruption, they will have no idea what the correct answer is.

While that's true today, I'm not sure it has to be true always.
I always thought traditional fsck user interfaces were a
UI disaster and could be done much better with some simple tweaks.

For example, the fsck could present the user with a list of files that ended
up in lost+found and let them examine them, instead of asking a lot of
useless questions. Or it could give a high-level summary of how many
files in which part of the directory tree were corrupted, etc.  Or
it could default to a high-level mode that only gives such high-level
information to the user.

So I don't think all corruptions could be handled in a perfectly
user-friendly way, but at least the basic user friendliness in many
situations could be much improved.

-Andi


-- 
[EMAIL PROTECTED]


Re: Some very basic questions

2008-10-21 Thread jim owens

Hearing what users think they want is always good, but...

Stephan von Krawczynski wrote:
> thanks for your feedback. Understand "minimum requirement" as "minimum
> requirement to drop the current installation and migrate the data to a
> new fs platform".

I would sure like to know what existing platform and filesystem
you have that you think has all 10 of your features.

> Of course you are right, dealing with multiple/parallel mounts can be quite a
> nasty job if the fs was not originally planned with this feature in mind.
> On the other hand I cannot really imagine how to deal with TBs of data in the
> future without such a feature.
> If you look at the big picture the things I mentioned allow you to have
> redundant front-ends for the fileservice doing the same or completely
> different applications. You can use one mount (host) for tape backup purposes
> only without heavy loss in standard file service. You can even mount for
> filesystem check purposes, a box that does nothing else but check the
> structure and keep you informed what is really going on with your data - and
> your data is still in production in the meantime.
> Whatever happens you have a real chance of keeping your file service up, even
> if parts of your fs go nuts because some underlying hd got partially damaged.
> Keeping it up and running is the most important part, performance is only
> second on the list.
> If you take a close look there are not really 10 different items on my list,
> depending on the level of abstraction you prefer, nevertheless:
>
> 1) parallel mounts

What I see from that explanation is you have a "system design" idea
using parallel machines to fix problems you have had in the past.
To implement your design, you need a filesystem to fit it.  I think
it is better to just design a filesystem without the problems and
configure the hardware to handle the necessary load.

> 2) mounting must not delay the system startup significantly
> 3) errors in parts of the fs are no reason for a fs to go offline as a whole
> 4) power loss at any time must not corrupt the fs
> 5) fsck on a mounted fs, interactively, not part of the mount (all fsck
> features)


I think all of these are part of the "reliability" goal for btrfs,
and saying "fsck" is probably misleading if I understand
your real requirement to be the same as my customers':

  - *NO* fsck
  - filesystem design "prevents problems we have had before"
  - filesystem autodetects, isolates, and (possibly) repairs errors
  - online "scan, check, repair filesystem" tool initiated by admin
  - Reliability so high that they never run that check-and-fix tool

Note that I personally have never seen a first release meet
the "no problems, no need to fix" criteria that would obviate
any need for a check/fix tool.

jim


Re: Some very basic questions

2008-10-21 Thread Stephan von Krawczynski
On Tue, 21 Oct 2008 14:13:33 +0200
Andi Kleen <[EMAIL PROTECTED]> wrote:

> Stephan von Krawczynski <[EMAIL PROTECTED]> writes:
> 
> > reading the list for a while it looks like all kinds of implementational
> > topics are covered but no basic user requests or talks are going on. Since I
> > have found no other list on vger covering these issues I chose this one,
> > forgive my ignorance if it is the wrong place.
> > Like many people on the planet we try to handle quite some amounts of data
> > (TBs) and try to solve this with several linux-based fileservers.
> > Years of (mostly bad) experience led us to the following minimum requirements
> > for a new fs on our servers:
> 
> If those are the minimum requirements, what are the maximum ones?
> 
> Also you realize that some of the requirements (like parallel read/write
> aka a full cluster file system) are extremely hard?
> 
> Perhaps it would make more sense if you extracted the top 10 items
> and ranked them by importance and posted again.

Hello Andi,

thanks for your feedback. Understand "minimum requirement" as "minimum
requirement to drop the current installation and migrate the data to a
new fs platform".
Of course you are right, dealing with multiple/parallel mounts can be quite a
nasty job if the fs was not originally planned with this feature in mind.
On the other hand I cannot really imagine how to deal with TBs of data in the
future without such a feature.
If you look at the big picture the things I mentioned allow you to have
redundant front-ends for the fileservice doing the same or completely
different applications. You can use one mount (host) for tape backup purposes
only without heavy loss in standard file service. You can even mount for
filesystem check purposes, a box that does nothing else but check the
structure and keep you informed what is really going on with your data - and
your data is still in production in the meantime.
Whatever happens you have a real chance of keeping your file service up, even
if parts of your fs go nuts because some underlying hd got partially damaged.
Keeping it up and running is the most important part, performance is only
second on the list.
If you take a close look there are not really 10 different items on my list,
depending on the level of abstraction you prefer, nevertheless:

1) parallel mounts
2) mounting must not delay the system startup significantly
3) errors in parts of the fs are no reason for a fs to go offline as a whole
4) power loss at any time must not corrupt the fs
5) fsck on a mounted fs, interactively, not part of the mount (all fsck
features)
6) journaling
7) undelete (file and dir)
8) resizing during runtime (up and down)
9) snapshots
10) performant handling of large numbers of files inside single dirs


-- 
Regards,
Stephan



Re: Some very basic questions

2008-10-21 Thread Chris Mason
On Tue, 2008-10-21 at 13:23 +0200, Stephan von Krawczynski wrote:
> Hello all,
> 
> reading the list for a while it looks like all kinds of implementational
> topics are covered but no basic user requests or talks are going on. Since I
> have found no other list on vger covering these issues I chose this one,
> forgive my ignorance if it is the wrong place.
> Like many people on the planet we try to handle quite some amounts of data
> (TBs) and try to solve this with several linux-based fileservers.
> Years of (mostly bad) experience led us to the following minimum requirements
> for a new fs on our servers:
> 

Thanks for this input and for taking the time to post it.

> 1. filesystem-check
> 1.1 it should not
> - delay boot process (we have to wait for hours currently)
> - prevent mount in case of errors
> - be a part of the mount process at all
> - always check the whole fs

For this, you have to define filesystem-check very carefully.  In
reality, corruptions can prevent mounting.  We can try very very hard to
limit the class of corruptions that prevent mounting, and use
duplication and replication to create configurations that address the
remaining cases.

In general, we'll be able to make things much better than they are
today.

> 1.2 it should be able 
> - to always be started interactively by user
> - to check parts/subtrees of the fs
> - to run purely informational (reporting, non-modifying)
> - to run on a mounted fs

Started interactively?  I'm not entirely sure what that means, but in
general when you ask the user a question about if/how to fix a
corruption, they will have no idea what the correct answer is.

> 2. general requirements
> - fs errors without file/dir names are useless
> - errors in parts of the fs are no reason for a fs to go offline as a whole

These two are in progress.  Btrfs won't always be able to give a file
and directory name, but it will be able to give something that can be
turned into a file or directory name.  You don't want important
diagnostic messages delayed by name lookup.

> - mounting must not delay the system startup significantly

Mounts are fast

> - resizing during runtime (up and down)

Resize is done

> - parallel mounts (very important!)
>   (two or more hosts mount the same fs concurrently for reading and
>   writing)

As Jim and Andi have said, parallel mounts are not in the feature list
for Btrfs.  Network filesystems will provide these features.

> - journaling

Btrfs doesn't journal.  The tree logging code is close, it provides
optimized fsync and O_SYNC operations.  The same basic structures could
be used for remote replication.

> - versioning (file and dir)

From a data structure point of view, version control is fairly easy.
From a user interface and policy point of view, it gets difficult very
quickly.  Aside from snapshotting, version control is outside the scope
of btrfs.

There are lots of good version control systems available, I'd suggest
you use them instead.

> - undelete (file and dir)

Undelete is easy but I think best done at a layer above the FS.

> - snapshots

Done

> - run into hd errors more than once for the same file (as an option)

Sorry, I'm not sure what you mean here.

> - map out dead blocks
>   (and of course display of the currently mapped out list)

I agree with Jim on this one.  Drives remap dead sectors, and when they
stop remapping them, the drive should be replaced.

> - no size limitations (more or less)
> - performant handling of large numbers of files inside single dirs
>   (to check that use > 100.000 files in a dir, understand that it is
>   no good idea to spread inode-blocks over the whole hd because of seek
>   times)

Everyone has different ideas on "large" numbers of files inside a single
dir.  The directory indexing done by btrfs can easily handle 100,000 files.

> - power loss at any time must not corrupt the fs (atomic fs modification)
>   (new-data loss is acceptable)

Done.  Btrfs already uses barriers as required for sata drives.

> 
> Remember, this is not meant to be a request for features, it is a list that
> built up over 10 years of handling data and the failures we experienced. To
> our knowledge no fs meets this list, but hey, is that a reason for not talking
> about it? Our goal is pretty simple: maximize fs uptime.
> How does btrfs match?

-chris




Re: Some very basic questions

2008-10-21 Thread jim owens

btrfs has many of the same goals... but they are goals not code
so when you might see them is indeterminate.

I believe these should not be in btrfs:

Stephan von Krawczynski wrote:
> - parallel mounts (very important!)

as Andi said, you want a cluster or distributed fs.  There
are layered designs (CRFS or network filesystems) that can do
the job and trying to do it in btrfs causes too many problems.

> - journaling

I assume you *do not* mean metadata journaling, you mean
sending all file updates to a single output stream (as in one
disk, tape, or network link).  I've done that, but would not
recommend it in btrfs because it limits the total fs bandwidth
to what the single stream can support.  This is normally done
today by applications like databases, not in the filesystem.

> - map out dead blocks

Useless... a waste of time, code, and metadata structures.
With current device technology, any device reporting bad blocks
the device cannot map out is about to die and needs to be replaced!

jim


Re: Some very basic questions

2008-10-21 Thread Andi Kleen
Stephan von Krawczynski <[EMAIL PROTECTED]> writes:

> reading the list for a while it looks like all kinds of implementational
> topics are covered but no basic user requests or talks are going on. Since I
> have found no other list on vger covering these issues I chose this one,
> forgive my ignorance if it is the wrong place.
> Like many people on the planet we try to handle quite some amounts of data
> (TBs) and try to solve this with several linux-based fileservers.
> Years of (mostly bad) experience led us to the following minimum requirements
> for a new fs on our servers:

If those are the minimum requirements, what are the maximum ones?

Also you realize that some of the requirements (like parallel read/write
aka a full cluster file system) are extremely hard?

Perhaps it would make more sense if you extracted the top 10 items
and ranked them by importance and posted again.

-Andi

-- 
[EMAIL PROTECTED]


Some very basic questions

2008-10-21 Thread Stephan von Krawczynski
Hello all,

reading the list for a while it looks like all kinds of implementational
topics are covered but no basic user requests or talks are going on. Since I
have found no other list on vger covering these issues I chose this one,
forgive my ignorance if it is the wrong place.
Like many people on the planet we try to handle quite some amounts of data
(TBs) and try to solve this with several linux-based fileservers.
Years of (mostly bad) experience led us to the following minimum requirements
for a new fs on our servers:

1. filesystem-check
1.1 it should not
- delay boot process (we have to wait for hours currently)
- prevent mount in case of errors
- be a part of the mount process at all
- always check the whole fs
1.2 it should be able 
- to always be started interactively by user
- to check parts/subtrees of the fs
- to run purely informational (reporting, non-modifying)
- to run on a mounted fs
2. general requirements
- fs errors without file/dir names are useless
- errors in parts of the fs are no reason for a fs to go offline as a whole
- mounting must not delay the system startup significantly
- resizing during runtime (up and down)
- parallel mounts (very important!)
  (two or more hosts mount the same fs concurrently for reading and
  writing)
- journaling
- versioning (file and dir)
- undelete (file and dir)
- snapshots
- run into hd errors more than once for the same file (as an option)
- map out dead blocks
  (and of course display of the currently mapped out list)
- no size limitations (more or less)
- performant handling of large numbers of files inside single dirs
  (to check that use > 100.000 files in a dir, understand that it is
  no good idea to spread inode-blocks over the whole hd because of seek
  times)
- power loss at any time must not corrupt the fs (atomic fs modification)
  (new-data loss is acceptable)

Remember, this is not meant to be a request for features, it is a list that
built up over 10 years of handling data and the failures we experienced. To
our knowledge no fs meets this list, but hey, is that a reason for not talking
about it? Our goal is pretty simple: maximize fs uptime.
How does btrfs match?
-- 
Regards,
Stephan


[PATCH] Improve space balancing code

2008-10-21 Thread Yan Zheng
Hello,

This patch improves the space balancing code to keep more sharing
of tree blocks. The only case that breaks sharing of tree blocks is
when data extents get fragmented during balancing. The main changes in
this patch are:

Add a 'drop sub-tree' function. This solves the problem in the old code
where the BTRFS_HEADER_FLAG_WRITTEN check breaks sharing of tree blocks.

Remove the relocation mapping tree. Relocation mappings are now stored in
struct btrfs_ref_path and updated dynamically while walking up/down
the reference path. This reduces CPU usage and simplifies the code.

This patch also fixes a bug. Root items for reloc trees should be
updated in btrfs_free_reloc_root.

Regards

Signed-off-by: Yan Zheng <[EMAIL PROTECTED]>

---
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 9caeb37..73899d0 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -287,7 +287,7 @@ int noinline __btrfs_cow_block(struct btrfs_trans_handle *trans,
/*
 * There are only two places that can drop reference to
 * tree blocks owned by living reloc trees, one is here,
-* the other place is btrfs_merge_path. In both places,
+* the other place is btrfs_drop_subtree. In both places,
 * we check reference count while tree block is locked.
 * Furthermore, if reference count is one, it won't get
 * increased by someone else.
@@ -312,9 +312,6 @@ int noinline __btrfs_cow_block(struct btrfs_trans_handle *trans,
}
 
if (root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID) {
-   ret = btrfs_add_reloc_mapping(root, buf->start,
- buf->len, cow->start);
-   BUG_ON(ret);
ret = btrfs_reloc_tree_cache_ref(trans, root, cow, buf->start);
WARN_ON(ret);
}
@@ -1627,61 +1624,57 @@ int btrfs_merge_path(struct btrfs_trans_handle *trans,
btrfs_node_key_to_cpu(eb, &key, slot);
key_match = !memcmp(&key, &node_keys[level - 1], sizeof(key));
 
+   if (generation == trans->transid) {
+   eb = read_tree_block(root, bytenr, blocksize,
+generation);
+   btrfs_tree_lock(eb);
+   }
+
/*
 * if node keys match and node pointer hasn't been modified
 * in the running transaction, we can merge the path. for
-* reloc trees, the node pointer check is skipped, this is
-* because the reloc trees are fully controlled by the space
-* balance code, no one else can modify them.
+* blocks owned by reloc trees, the node pointer check is
+* skipped, this is because these blocks are fully controlled
+* by the space balance code, no one else can modify them.
 */
if (!nodes[level - 1] || !key_match ||
(generation == trans->transid &&
-root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID)) {
-next_level:
-   if (level == 1 || level == lowest_level + 1)
+btrfs_header_owner(eb) != BTRFS_TREE_RELOC_OBJECTID)) {
+   if (level == 1 || level == lowest_level + 1) {
+   if (generation == trans->transid) {
+   btrfs_tree_unlock(eb);
+   free_extent_buffer(eb);
+   }
break;
+   }
 
-   eb = read_tree_block(root, bytenr, blocksize,
-generation);
-   btrfs_tree_lock(eb);
+   if (generation != trans->transid) {
+   eb = read_tree_block(root, bytenr, blocksize,
+   generation);
+   btrfs_tree_lock(eb);
+   }
 
ret = btrfs_cow_block(trans, root, eb, parent, slot,
  &eb, 0);
BUG_ON(ret);
 
+   if (root->root_key.objectid ==
+   BTRFS_TREE_RELOC_OBJECTID) {
+   if (!nodes[level - 1]) {
+   nodes[level - 1] = eb->start;
+   memcpy(&node_keys[level - 1], &key,
+  sizeof(node_keys[0]));
+   } else {
+   WARN_ON(1);
+   }
+   }
+
btrfs_tree_unlock(parent);
free_extent_buffer(parent);
parent = eb;