Re: SSD Optimizations

2010-03-12 Thread Stephan von Krawczynski
On Fri, 12 Mar 2010 02:07:40 +0100
Hubert Kario h...@qbs.com.pl wrote:

  [...]
  If the FS were to be smart and know about the 256kb requirement, it
  would do a read/modify/write cycle somewhere and then write the 4KB.
 
 If all the free blocks have been TRIMmed, FS should pick a completely free 
 erasure size block and write those 4KiB of data.
 
 Correct implementation of wear leveling in the drive should notice that the 
 write is entirely inside a free block and make just a write cycle adding 
 zeros 
 to the end of supplied data.

Your assumption here is that your _addressed_ block layout is completely
identical to the SSD's internal layout. Otherwise you cannot know where a free
erasure block is located or how to address it from the FS.
I really wonder what this assumption is based on. You still think an SSD is a
true disk with linear addressing. I doubt that very much. Even on true
spinning disks your assumption is wrong for relocated sectors. Which basically
means that disk controller firmware has been fiddling with the physical
layout for decades. Please accept that you cannot do a disk's job in the FS.
The more advanced the technology gets, the more disks become black boxes with
a defined software interface. Use this interface and drop the idea of having
inside knowledge of such a device. That's other people's work. If you want to
design smart SSD controllers, hire on at a company that builds them.

-- 
Regards,
Stephan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SSD Optimizations

2010-03-12 Thread Hubert Kario
On Friday 12 March 2010 10:15:28 Stephan von Krawczynski wrote:
 On Fri, 12 Mar 2010 02:07:40 +0100
 
 Hubert Kario h...@qbs.com.pl wrote:
   [...]
   If the FS were to be smart and know about the 256kb requirement, it
   would do a read/modify/write cycle somewhere and then write the 4KB.
  
  If all the free blocks have been TRIMmed, FS should pick a completely
  free erasure size block and write those 4KiB of data.
  
  Correct implementation of wear leveling in the drive should notice that
  the write is entirely inside a free block and make just a write cycle
  adding zeros to the end of supplied data.
 
 Your assumption here is that your _addressed_ block layout is completely
 identical to the SSD's internal layout.
 Otherwise you cannot know where a free
 erasure block is located or how to address it from the FS.
 I really wonder what this assumption is based on. You still think an SSD is
 a true disk with linear addressing. I doubt that very much.

I made no such assumptions.
I'm sure that the linearity at the ATA LBA level isn't so linear at the device 
level, especially after wear-leveling takes its toll, but I assume that the 
smallest block of data the translation layer can address is erase-block 
sized and that all the erase blocks are equal in size. Otherwise the algorithm 
would be needlessly complicated, which would make it both slower and more 
error prone.

 Even on true
 spinning disks your assumption is wrong for relocated sectors.

Which we don't have to worry about, because if the drive has fewer than 5 of 
'em the impact of hitting them is marginal, and if there are more, the user 
has much more pressing problems than the performance of the drive or FS.

 Which
 basically means that disk controller firmware has been fiddling with
 the physical layout for decades. Please accept that you cannot do a
 disk's job in the FS. The more advanced the technology gets, the more
 disks become black boxes with a defined software interface. Use this
 interface and drop the idea of having inside knowledge of such a device.
 That's other people's work. If you want to design smart SSD controllers,
 hire on at a company that builds them.

And I don't think that doing the disk's job in the FS is a good idea, but I 
think we should be able to minimise the impact of the translation layer.

The way to do this is to treat the device as a block device with sectors the 
size of erase blocks. That's nothing too fancy, don't you think?

-- 
Hubert Kario
QBS - Quality Business Software
ul. Ksawerów 30/85
02-656 Warszawa
POLAND
tel. +48 (22) 646-61-51, 646-74-24
fax +48 (22) 646-61-50


Re: [PATCH 1/2] Re: New btrfs command pushed into the btrfs-progs subvol branch - commands

2010-03-12 Thread Goffredo Baroncelli
On Friday 12 March 2010, Chris Mason wrote:
 On Thu, Mar 11, 2010 at 10:44:21PM +0100, Goffredo Baroncelli wrote:
  Hi Chris,
  
  I updated my git repository. You can pull from 
[...]

 Wonderful.  I've rebased this and put it into the subvol branch.  I
 think I got all the commits and differences.  

 The big reason for the
 rebase is to avoid the small string of single liner fixup commits.
 If you have other work pending I'm happy to rebase it in, otherwise
 please try to work against my subvol branch.

Sorry; I am new to git, so some of these concepts are new to me. Next time I 
will rebase.
 
 Thanks again!
 
 -chris
 
 


-- 
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) kreij...@inwind.it
Key fingerprint = 4769 7E51 5293 D36C 814E  C054 BF04 F161 3DC5 0512


Re: Saving and restoring btrfs snapshots

2010-03-12 Thread Goffredo Baroncelli
On Friday 12 March 2010, Pat Patterson wrote:
 Are there any plans to implement something akin to ZFS send/recv, to
 be able to create a stream representation of a snapshot and restore it
 later/somewhere else? I've spent some time trawling the mailing list
 and wiki, but I don't see anything there.

I spent a bit of time on this topic, trying to find an efficient method for 
backing up data incrementally.

AFAICT zfs send and zfs recv do much the same thing that tar does: they 
transform a tree (or the difference between a tree and its snapshot) into a 
stream, and vice versa.

Transforming a tree into a stream is not very interesting. 
The interesting part is how to compare a tree and its snapshot. In fact a 
snapshot of a tree should be a pointer to the original tree, and when a file 
is modified, a branch of the modified part (the extents of the file, the 
directories of the path) is performed (yes, I know this is a big 
simplification of the process).
The key is that the filesystem knows which parts of a snapshot are still 
equal to the source and which are not.

If this kind of data were available to user space, comparing a tree and its 
snapshot should be very fast. 

Reading the btrfs documentation, it seems that there is a version number 
associated with each transaction. With the version number of a directory, we 
would be able to verify the equality of two trees by comparing only the roots 
of the trees. This would greatly speed up comparing two trees.

But I was never able to get this version number. There is the ioctl command 
FS_IOC_GETVERSION, which seems to return this number. But when a directory or 
one of its children is updated, this number doesn't change.

I tried to hack the kernel code in order to test different version numbers: I 
tried inode->i_generation, btrfs_inode->generation, btrfs_inode->sequence 
and btrfs_inode->{last|last_sub|logged}_trans...
But none of the above was useful for my purpose.

Even though there is no clear conclusion, I hope that this note may be useful 
to start a discussion on this matter. 

Regards
Goffredo
 
 Cheers,
 
 Pat
 


-- 
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) kreijackATinwind.it
Key fingerprint = 4769 7E51 5293 D36C 814E  C054 BF04 F161 3DC5 0512


[PATCH] Btrfs: fix small race with delalloc flushing waitqueue's

2010-03-12 Thread Josef Bacik
Every time we start a new flushing thread, we init the waitqueue if there isn't a
flushing thread running.  The problem with this is that we check
space_info->flushing, which we clear right before doing a wake_up on the
flushing waitqueue, which causes problems if we init the waitqueue in the middle
of clearing the flushing flag and calling wake_up.  This is hard to hit, but
the code is wrong anyway, so init the flushing/allocating waitqueues when
creating the space info and let them be.  I haven't seen the panic since I've been
using this patch.  Thanks,

Signed-off-by: Josef Bacik jo...@redhat.com
---
 fs/btrfs/extent-tree.c |9 -
 1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 9c6fbd0..73ac69b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2678,6 +2678,8 @@ static int update_space_info(struct btrfs_fs_info *info, u64 flags,
 	INIT_LIST_HEAD(&found->block_groups);
 	init_waitqueue_head(&found->flush_wait);
 	init_rwsem(&found->groups_sem);
+	init_waitqueue_head(&found->flush_wait);
+	init_waitqueue_head(&found->allocate_wait);
 	spin_lock_init(&found->lock);
 	found->flags = flags;
 	found->total_bytes = total_bytes;
@@ -2929,12 +2931,10 @@ static void flush_delalloc(struct btrfs_root *root,
 
 	spin_lock(&info->lock);
 
-	if (!info->flushing) {
+	if (!info->flushing)
 		info->flushing = 1;
-		init_waitqueue_head(&info->flush_wait);
-	} else {
+	else
 		wait = true;
-	}
 
 	spin_unlock(&info->lock);
 
@@ -2997,7 +2997,6 @@ static int maybe_allocate_chunk(struct btrfs_root *root,
 	if (!info->allocating_chunk) {
 		info->force_alloc = 1;
 		info->allocating_chunk = 1;
-		init_waitqueue_head(&info->allocate_wait);
 	} else {
 		wait = true;
 	}
-- 
1.6.6



[PATCH] Btrfs: fix ENOSPC accounting when max_extent is not maxed out

2010-03-12 Thread Josef Bacik
A user reported a bug a few weeks back where if he set max_extent=1m and then
did a dd and then stopped it, we would panic.  This is because I miscalculated
how many extents would be needed for splits/merges.  Turns out I didn't actually
take max_extent into account properly, since we only ever add 1 extent for a
write, which isn't quite right for the case that, say, max_extent is 4k and we do
8k writes.  That would result in more than 1 extent.  So this patch makes us
properly figure out how many extents are needed for the amount of data being
written, and deals with splitting and merging better.  I've tested this ad
nauseam and it works well.  It depends on all of the other patches I've sent
recently, including the per-cpu pools patch.  Thanks,

Signed-off-by: Josef Bacik jo...@redhat.com
---
 fs/btrfs/ctree.h|8 
 fs/btrfs/extent-tree.c  |6 +-
 fs/btrfs/file.c |5 +-
 fs/btrfs/inode.c|   99 +++---
 fs/btrfs/ordered-data.c |   13 --
 5 files changed, 90 insertions(+), 41 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f12fe00..2f5c01f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1965,6 +1965,14 @@ static inline struct dentry *fdentry(struct file *file)
 	return file->f_path.dentry;
 }
 
+static inline int calculate_extents(struct btrfs_root *root, u64 bytes)
+{
+	if (bytes <= root->fs_info->max_extent)
+		return 1;
+	return (int)div64_u64(bytes + root->fs_info->max_extent - 1,
+			      root->fs_info->max_extent);
+}
+
+
 /* extent-tree.c */
 void btrfs_put_block_group(struct btrfs_block_group_cache *cache);
 int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 73ac69b..0085dcb 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2837,7 +2837,7 @@ int btrfs_unreserve_metadata_for_delalloc(struct btrfs_root *root,
 		spin_unlock(&BTRFS_I(inode)->accounting_lock);
 		return 0;
 	}
-	BTRFS_I(inode)->reserved_extents--;
+	BTRFS_I(inode)->reserved_extents -= num_items;
 	spin_unlock(&BTRFS_I(inode)->accounting_lock);
 
 	btrfs_unreserve_metadata_space(root, num_items);
@@ -3059,7 +3059,7 @@ again:
 	if (realloc_bytes >= num_bytes) {
 		pool->total_bytes += realloc_bytes;
 		spin_lock(&BTRFS_I(inode)->accounting_lock);
-		BTRFS_I(inode)->reserved_extents++;
+		BTRFS_I(inode)->reserved_extents += num_items;
 		spin_unlock(&BTRFS_I(inode)->accounting_lock);
 		spin_unlock(&pool->lock);
 		return 0;
@@ -3074,7 +3074,7 @@ again:
 	 */
 	if (pool->reserved_bytes + pool->used_bytes <= pool->total_bytes) {
 		spin_lock(&BTRFS_I(inode)->accounting_lock);
-		BTRFS_I(inode)->reserved_extents++;
+		BTRFS_I(inode)->reserved_extents += num_items;
 		spin_unlock(&BTRFS_I(inode)->accounting_lock);
 		spin_unlock(&pool->lock);
 		return 0;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index d146dde..a457a94 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -838,6 +838,7 @@ static ssize_t btrfs_file_write(struct file *file, const char __user *buf,
 	unsigned long first_index;
 	unsigned long last_index;
 	int will_write;
+	int reserved = calculate_extents(root, count);
 
 	will_write = ((file->f_flags & O_SYNC) || IS_SYNC(inode) ||
 		      (file->f_flags & O_DIRECT));
@@ -855,7 +856,7 @@ static ssize_t btrfs_file_write(struct file *file, const char __user *buf,
 	/* do the reserve before the mutex lock in case we have to do some
 	 * flushing.  We wouldn't deadlock, but this is more polite.
 	 */
-	err = btrfs_reserve_metadata_for_delalloc(root, inode, 1);
+	err = btrfs_reserve_metadata_for_delalloc(root, inode, reserved);
 	if (err)
 		goto out_nolock;
@@ -975,7 +976,7 @@ out:
 	mutex_unlock(&inode->i_mutex);
 	if (ret)
 		err = ret;
-	btrfs_unreserve_metadata_for_delalloc(root, inode, 1);
+	btrfs_unreserve_metadata_for_delalloc(root, inode, reserved);
 
 out_nolock:
kfree(pages);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b4056ca..09f18b9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1231,19 +1231,37 @@ static int btrfs_split_extent_hook(struct inode *inode,
 
 	size = orig->end - orig->start + 1;
 	if (size > root->fs_info->max_extent) {
-		u64 num_extents;
-		u64 new_size;
+		u64 left_extents, right_extents;
+		u64 orig_extents;
+		u64 left_size, right_size;
 
-		new_size = orig->end - split + 1;
-		num_extents = div64_u64(size + root->fs_info->max_extent - 1,
-   

[PATCH] Btrfs: force delalloc flushing when things get desperate

2010-03-12 Thread Josef Bacik
When testing with max_extents=4k, we ENOSPC out really really early.  The reason
for this is we really overwhelm the system with our worst-case calculation.
When we try to flush delalloc, we don't want everybody to wait around forever,
so we wake up the waiters when we've done some of the work, in hopes that it's
enough work to get everything they need done.  The problem with this is we don't
wait long enough sometimes.  So if we've already done a flush_delalloc and
didn't find what we need, do it again and this time wait for the flushing to be
completely finished before returning.  This makes my ENOSPC test actually
finish, instead of ENOSPCing out after about 20 seconds.  Thanks,

Signed-off-by: Josef Bacik jo...@redhat.com
---
 fs/btrfs/extent-tree.c |   25 +
 1 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0085dcb..aeef481 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2873,7 +2873,7 @@ static noinline void flush_delalloc_async(struct btrfs_work *work)
 	kfree(async);
 }
 
-static void wait_on_flush(struct btrfs_root *root, struct btrfs_space_info *info)
+static void wait_on_flush(struct btrfs_root *root, struct btrfs_space_info *info, int soft)
 {
 	DEFINE_WAIT(wait);
 	u64 num_bytes;
@@ -2895,6 +2895,12 @@ static void wait_on_flush(struct btrfs_root *root, struct btrfs_space_info *info
 			break;
 		}
 
+		if (!soft) {
+			spin_unlock(&info->lock);
+			schedule();
+			continue;
+		}
+
 		free = 0;
 		for_each_possible_cpu(i) {
 			struct btrfs_reserved_space_pool *pool;
@@ -2924,7 +2930,7 @@ static void wait_on_flush(struct btrfs_root *root, struct btrfs_space_info *info
 }
 
 static void flush_delalloc(struct btrfs_root *root,
-			   struct btrfs_space_info *info)
+			   struct btrfs_space_info *info, int soft)
 {
struct async_flush *async;
bool wait = false;
@@ -2939,7 +2945,7 @@ static void flush_delalloc(struct btrfs_root *root,
 	spin_unlock(&info->lock);
 
 	if (wait) {
-		wait_on_flush(root, info);
+		wait_on_flush(root, info, soft);
 		return;
 	}
 
@@ -2953,7 +2959,7 @@ static void flush_delalloc(struct btrfs_root *root,
 
 	btrfs_queue_worker(&root->fs_info->enospc_workers,
 			   &async->work);
-	wait_on_flush(root, info);
+	wait_on_flush(root, info, soft);
 	return;
 
 flush:
@@ -3146,14 +3152,17 @@ again:
 
 	if (!delalloc_flushed) {
 		delalloc_flushed = true;
-		flush_delalloc(root, meta_sinfo);
+		flush_delalloc(root, meta_sinfo, 1);
 		goto again;
 	}
 
 	if (!chunk_allocated) {
+		int ret;
+
 		chunk_allocated = true;
-		btrfs_wait_ordered_extents(root, 0, 0);
-		maybe_allocate_chunk(root, meta_sinfo);
+		ret = maybe_allocate_chunk(root, meta_sinfo);
+		if (!ret)
+			flush_delalloc(root, meta_sinfo, 0);
 		goto again;
 	}
 
@@ -3338,7 +3347,7 @@ again:
 
 	if (!delalloc_flushed) {
 		delalloc_flushed = true;
-		flush_delalloc(root, meta_sinfo);
+		flush_delalloc(root, meta_sinfo, 0);
 		goto again;
 	}
 
-- 
1.6.6



resize

2010-03-12 Thread Brian Callender
Greetings,

I estimate it has been 10 hours since I issued the command btrfsctl
-r -4g /home to attempt to free some space for a new partition. It is
still running. How long should this take? I am very concerned about
the integrity of my system. Is it safe to interrupt the process? It
seems increasingly likely there is some problem.

My btrfs tools version is 0.19. The fs is mounted with the compress and
ssd options. I am running kernel 2.6.32.

Any advice or suggestions are appreciated,
Brian