Re: [RFC] improve space utilization on off-sized raid devices
On 17.11.2011 01:27, Thomas Schmidt wrote:
> I wrote a small patch to improve allocation on differently sized raid devices.
>
> With 2.6.38 I frequently ran into a no space left error that I attribute to
> this. But I'm not entirely sure. The fs was an 8 device -d raid0 -m raid10.
> The used space was the same across all devices. 5 were full and 3 bigger ones
> still had plenty of space.
> I was unable to use the remaining space and a balance did not fix it for
> long.
>

Did you also test with 3.0? In 3.0, the allocation strategy changed vastly.
In your setup, it should stripe to all 8 devices until the 5 smaller ones
are full, and from then on stripe to the 3 remaining devices.

See commit

commit 73c5de0051533cbdf2bb656586c3eb21a475aa7d
Author: Arne Jansen
Date:   Tue Apr 12 12:07:57 2011 +0200

    btrfs: quasi-round-robin for chunk allocation

Also using raid1 instead of raid10 will yield a better space utilization.

-Arne

> Now I tried to avoid getting there again.
>
> The basic idea is to not allocate space on the devices with the least free
> space. The number of devices to leave out is calculated on each allocation
> to adjust to changing circumstances. It leaves the minimum number that still
> can achieve full space usage.
>
> Additionally I thought leaving at least one out might be of use in device
> removal.
>
> Please take extra care with this. I'm new to btrfs, kernel and C in general.
> It was written and tested with 3.0.0.
>
>
> --- volumes.c.orig 2011-10-07 16:50:04.0 +0200
> +++ volumes.c 2011-11-16 23:49:08.097085568 +0100
> @@ -2329,6 +2329,8 @@ static int __btrfs_alloc_chunk(struct bt
>  	u64 stripe_size;
>  	u64 num_bytes;
>  	int ndevs;
> +	u64 fs_total_avail;
> +	int opt_ndevs;
>  	int i;
>  	int j;
>
> @@ -2404,6 +2406,7 @@ static int __btrfs_alloc_chunk(struct bt
>  	 * about the available holes on each device.
>  	 */
>  	ndevs = 0;
> +	fs_total_avail = 0;
>  	while (cur != &fs_devices->alloc_list) {
>  		struct btrfs_device *device;
>  		u64 max_avail;
> @@ -2448,6 +2451,7 @@ static int __btrfs_alloc_chunk(struct bt
>  		devices_info[ndevs].total_avail = total_avail;
>  		devices_info[ndevs].dev = device;
>  		++ndevs;
> +		fs_total_avail += total_avail;
>  	}
>
>  	/*
> @@ -2456,6 +2460,20 @@ static int __btrfs_alloc_chunk(struct bt
>  	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>  	     btrfs_cmp_device_info, NULL);
>
> +	/*
> +	 * do not allocate space on all devices
> +	 * instead balance free space to maximise space utilization
> +	 * (this needs tweaking if parity raid gets implemented
> +	 * for n parity ignore the n first (after sort) devs in the sum and division)
> +	 */
> +	opt_ndevs = fs_total_avail / devices_info[0].total_avail;
> +	if (opt_ndevs >= ndevs)
> +		opt_ndevs = ndevs - 1; //optional, might be used for faster dev remove?
> +	if (opt_ndevs < devs_min)
> +		opt_ndevs = devs_min;
> +	if (ndevs > opt_ndevs)
> +		ndevs = opt_ndevs;
> +
>  	/* round down to number of usable stripes */
>  	ndevs -= ndevs % devs_increment;
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
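The behaviour Arne describes for 3.0 (stripe over every device that still has room, then continue on the remaining ones) can be illustrated with a toy userspace model. This is only a sketch and not the kernel allocator: `usable_raid0()`, its one-unit stripes and the `min_devs` parameter are hypothetical names chosen for the illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model (not kernel code): each round takes one unit from every
 * device that still has free space, as long as at least min_devs
 * devices can take part. Returns the total number of units allocated.
 * avail[] is modified in place.
 */
uint64_t usable_raid0(uint64_t *avail, size_t ndevs, size_t min_devs)
{
	uint64_t total = 0;

	for (;;) {
		size_t with_space = 0;

		for (size_t i = 0; i < ndevs; i++)
			if (avail[i] > 0)
				with_space++;

		/* stop when too few devices are left to form a stripe */
		if (with_space == 0 || with_space < min_devs)
			break;

		for (size_t i = 0; i < ndevs; i++) {
			if (avail[i] > 0) {
				avail[i]--;
				total++;
			}
		}
	}
	return total;
}
```

With five devices of 10 units and three of 20 units, the model uses all 8 devices for ten rounds and the 3 larger ones for ten more, so no space is stranded as long as at least two devices remain.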
[PATCH] Btrfs: rewrite btrfs_trim_block_group()
There are various bugs in block group trimming:

- It may trim from an offset smaller than the user-specified offset.
- It may trim beyond the user-specified range.
- It may leak free space for extents smaller than the specified minlen.
- It may truncate the last trimmed extent and thus leak free space.
- With mixed extents+bitmaps, some extents may not be trimmed.
- With mixed extents+bitmaps, some bitmaps may not be trimmed (possibly
  none at all). Even for those trimmed, not all the free space in the
  bitmaps will be trimmed.

I rewrote btrfs_trim_block_group() and broke it into two functions.
One trims extents only, and the other trims bitmaps only.

Signed-off-by: Li Zefan
---
 fs/btrfs/free-space-cache.c |  235 ++-
 1 files changed, 164 insertions(+), 71 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 8c32434..89cc54e 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2575,17 +2575,57 @@ void btrfs_init_free_cluster(struct btrfs_free_cluster *cluster)
 	cluster->block_group = NULL;
 }
 
-int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
-			   u64 *trimmed, u64 start, u64 end, u64 minlen)
+static int do_trimming(struct btrfs_block_group_cache *block_group,
+		       u64 *total_trimmed, u64 start, u64 bytes,
+		       u64 reserved_start, u64 reserved_bytes)
 {
-	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
-	struct btrfs_free_space *entry = NULL;
+	struct btrfs_space_info *space_info = block_group->space_info;
 	struct btrfs_fs_info *fs_info = block_group->fs_info;
-	u64 bytes = 0;
-	u64 actually_trimmed;
-	int ret = 0;
+	int ret;
+	int update = 0;
+	u64 trimmed = 0;
 
-	*trimmed = 0;
+	spin_lock(&space_info->lock);
+	spin_lock(&block_group->lock);
+	if (!block_group->ro) {
+		block_group->reserved += reserved_bytes;
+		space_info->bytes_reserved += reserved_bytes;
+		update = 1;
+	}
+	spin_unlock(&block_group->lock);
+	spin_unlock(&space_info->lock);
+
+	ret = btrfs_error_discard_extent(fs_info->extent_root,
+					 start, bytes, &trimmed);
+	if (!ret)
+		*total_trimmed += trimmed;
+
+	btrfs_add_free_space(block_group, reserved_start, reserved_bytes);
+
+	if (update) {
+		spin_lock(&space_info->lock);
+		spin_lock(&block_group->lock);
+		if (block_group->ro)
+			space_info->bytes_readonly += reserved_bytes;
+		block_group->reserved -= reserved_bytes;
+		space_info->bytes_reserved -= reserved_bytes;
+		spin_unlock(&space_info->lock);
+		spin_unlock(&block_group->lock);
+	}
+
+	return ret;
+}
+
+static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
+			  u64 *total_trimmed, u64 start, u64 end, u64 minlen)
+{
+	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
+	struct btrfs_free_space *entry;
+	struct rb_node *node;
+	int ret;
+	u64 extent_start;
+	u64 extent_bytes;
+	u64 bytes;
 
 	while (start < end) {
 		spin_lock(&ctl->tree_lock);
@@ -2596,81 +2636,118 @@ int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
 		}
 
 		entry = tree_search_offset(ctl, start, 0, 1);
-		if (!entry)
-			entry = tree_search_offset(ctl,
-						   offset_to_bitmap(ctl, start),
-						   1, 1);
-
-		if (!entry || entry->offset >= end) {
+		if (!entry) {
 			spin_unlock(&ctl->tree_lock);
 			break;
 		}
 
-		if (entry->bitmap) {
-			ret = search_bitmap(ctl, entry, &start, &bytes);
-			if (!ret) {
-				if (start >= end) {
-					spin_unlock(&ctl->tree_lock);
-					break;
-				}
-				bytes = min(bytes, end - start);
-				bitmap_clear_bits(ctl, entry, start, bytes);
-				if (entry->bytes == 0)
-					free_bitmap(ctl, entry);
-			} else {
-				start = entry->offset + BITS_PER_BITMAP *
-					block_group->sectorsize;
+		/* skip bitmaps */
+		while (entry->bitmap) {
+			node = rb_next(&entry->offset_index);
+			if (!node) {
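Several of the bugs listed in the description (trimming below the requested offset, trimming past the requested end, mishandling minlen) come down to range arithmetic. The following is a minimal userspace sketch of clamping a free extent to the requested range, not code taken from the patch: `clamp_trim_range()` and its parameters are hypothetical, and because the diff above is truncated it is only an assumption that minlen is checked against the clamped length.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Clamp the free extent [extent_start, extent_start + extent_bytes) to
 * the requested range [start, end). Returns false if nothing should be
 * trimmed, i.e. the ranges do not overlap or the overlap is shorter
 * than minlen (assumption: minlen applies to the clamped length).
 */
bool clamp_trim_range(uint64_t extent_start, uint64_t extent_bytes,
		      uint64_t start, uint64_t end, uint64_t minlen,
		      uint64_t *trim_start, uint64_t *trim_bytes)
{
	uint64_t extent_end = extent_start + extent_bytes;
	uint64_t lo = extent_start > start ? extent_start : start;
	uint64_t hi = extent_end < end ? extent_end : end;

	if (lo >= hi || hi - lo < minlen)
		return false;

	*trim_start = lo;
	*trim_bytes = hi - lo;
	return true;
}
```

For example, an extent covering [100, 150) and a request for [120, 200) yields a trim of 30 bytes starting at 120; the whole extent would still have to be removed from and returned to the free space cache, which is what the separate reserved_start/reserved_bytes arguments of do_trimming() appear to be for.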
[PATCH v2 1/2] Btrfs: fix to search one more bitmap for cluster setup
Suppose there are two bitmaps [0, 256], [256, 512] and one extent
[100, 120] in the free space cache, and we want to set up a cluster
with offset=100, bytes=50.

In this case, there will be only one bitmap [256, 512] in the temporary
bitmaps list, and then setup_cluster_bitmap() won't search bitmap [0, 256].

The cause is that the list is constructed in setup_cluster_no_bitmap(),
and only bitmaps with bitmap_entry->offset >= offset are added to the
list, while the very bitmap that covers offset has
bitmap_entry->offset <= offset.

Signed-off-by: Li Zefan
---

v2: fix a NULL pointer deref.

---
 fs/btrfs/free-space-cache.c |   12
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 181760f..8f792f4 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2453,11 +2453,23 @@ setup_cluster_bitmap(struct btrfs_block_group_cache *block_group,
 	struct btrfs_free_space *entry;
 	struct rb_node *node;
 	int ret = -ENOSPC;
+	u64 bitmap_offset = offset_to_bitmap(ctl, offset);
 
 	if (ctl->total_bitmaps == 0)
 		return -ENOSPC;
 
 	/*
+	 * The bitmap that covers offset won't be in the list unless offset
+	 * is just its start offset.
+	 */
+	entry = list_first_entry(bitmaps, struct btrfs_free_space, list);
+	if (entry->offset != bitmap_offset) {
+		entry = tree_search_offset(ctl, bitmap_offset, 1, 0);
+		if (entry && list_empty(&entry->list))
+			list_add(&entry->list, bitmaps);
+	}
+
+	/*
 	 * First check our cached list of bitmaps and see if there is an entry
 	 * here that will work.
 	 */
-- 
1.7.3.1
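The example in the description can be checked with a little arithmetic. The sketch below assumes, as the example does, that each bitmap covers 256 units and starts at a multiple of 256. `bitmap_start()` is a hypothetical, simplified stand-in for what offset_to_bitmap() computes, not the kernel function itself.

```c
#include <stdint.h>

#define BITMAP_SPAN 256ULL	/* assumed span of one bitmap, from the example */

/* Start offset of the bitmap that covers 'offset'. */
uint64_t bitmap_start(uint64_t offset)
{
	return offset - (offset % BITMAP_SPAN);
}

/*
 * The temporary list only holds bitmaps whose start is >= offset, so
 * the covering bitmap is in the list only when offset is exactly that
 * bitmap's start. Returns nonzero in that case.
 */
int covering_bitmap_in_list(uint64_t offset)
{
	return bitmap_start(offset) >= offset;
}
```

For offset=100 the covering bitmap starts at 0, which is below 100, so it is missing from the list; that is the case the added tree_search_offset() lookup handles.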
Re: [PATCH 00/21] [RFC] Btrfs: restriper
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 08/23/2011 04:01 PM, Ilya Dryomov wrote:
> Hello,
>
> This patch series adds an initial implementation of restriper (it's
> a clever name for relocation framework that allows to do selective
> profile changing and selective balancing with some goodies like
> pausing/resuming and reporting progress to the user).
>
> Profile changing is global (per-FS) so far, per-subvolume profiles
> require some discussion and can be implemented in future. This is
> a RFC so some features/problems are not yet implemented/resolved.
> The current TODO list is as follows:

I managed to use these patches to convert the raid1 system and metadata
chunks back to single and drop the second disk from a two disk array.
In doing so I noticed that the restriper required a force switch to
downgrade raid1 to single. This seems completely unnecessary to me.

A force switch to btrfs device delete might make sense, since delete may
or may not force a downgrade. With restripe, however, the request to
convert from raid1 to single is already quite explicit with no room for
ambiguity, so there should be no need for an additional confirmation
switch.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk7Ee+oACgkQJ4UciIs+XuIGIQCdFx9cP7cPQPslE9IcFNDg/6Ns
LQYAn2l2ykGwiJt/yZNvuqePyMj3sxYH
=P+HR
-----END PGP SIGNATURE-----
[PATCH] fs/btrfs/locking.c: Removed some unneeded return statements
Signed-off-by: Marcos Paulo de Souza
---
 fs/btrfs/locking.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
index d77b67c..8abb870 100644
--- a/fs/btrfs/locking.c
+++ b/fs/btrfs/locking.c
@@ -48,7 +48,6 @@ void btrfs_set_lock_blocking_rw(struct extent_buffer *eb, int rw)
 		atomic_dec(&eb->spinning_readers);
 		read_unlock(&eb->lock);
 	}
-	return;
 }
 
 /*
@@ -71,7 +70,6 @@ void btrfs_clear_lock_blocking_rw(struct extent_buffer *eb, int rw)
 		if (atomic_dec_and_test(&eb->blocking_readers))
 			wake_up(&eb->read_lock_wq);
 	}
-	return;
 }
 
 /*
-- 
1.7.4.4
[RFC] improve space utilization on off-sized raid devices
I wrote a small patch to improve allocation on differently sized raid devices.

With 2.6.38 I frequently ran into a no space left error that I attribute to
this. But I'm not entirely sure. The fs was an 8 device -d raid0 -m raid10.
The used space was the same across all devices. 5 were full and 3 bigger ones
still had plenty of space.
I was unable to use the remaining space and a balance did not fix it for long.

Now I tried to avoid getting there again.

The basic idea is to not allocate space on the devices with the least free
space. The number of devices to leave out is calculated on each allocation
to adjust to changing circumstances. It leaves the minimum number that still
can achieve full space usage.

Additionally I thought leaving at least one out might be of use in device
removal.

Please take extra care with this. I'm new to btrfs, kernel and C in general.
It was written and tested with 3.0.0.


--- volumes.c.orig 2011-10-07 16:50:04.0 +0200
+++ volumes.c 2011-11-16 23:49:08.097085568 +0100
@@ -2329,6 +2329,8 @@ static int __btrfs_alloc_chunk(struct bt
 	u64 stripe_size;
 	u64 num_bytes;
 	int ndevs;
+	u64 fs_total_avail;
+	int opt_ndevs;
 	int i;
 	int j;
 
@@ -2404,6 +2406,7 @@ static int __btrfs_alloc_chunk(struct bt
 	 * about the available holes on each device.
 	 */
 	ndevs = 0;
+	fs_total_avail = 0;
 	while (cur != &fs_devices->alloc_list) {
 		struct btrfs_device *device;
 		u64 max_avail;
@@ -2448,6 +2451,7 @@ static int __btrfs_alloc_chunk(struct bt
 		devices_info[ndevs].total_avail = total_avail;
 		devices_info[ndevs].dev = device;
 		++ndevs;
+		fs_total_avail += total_avail;
 	}
 
 	/*
@@ -2456,6 +2460,20 @@ static int __btrfs_alloc_chunk(struct bt
 	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
 	     btrfs_cmp_device_info, NULL);
 
+	/*
+	 * do not allocate space on all devices
+	 * instead balance free space to maximise space utilization
+	 * (this needs tweaking if parity raid gets implemented
+	 * for n parity ignore the n first (after sort) devs in the sum and division)
+	 */
+	opt_ndevs = fs_total_avail / devices_info[0].total_avail;
+	if (opt_ndevs >= ndevs)
+		opt_ndevs = ndevs - 1; //optional, might be used for faster dev remove?
+	if (opt_ndevs < devs_min)
+		opt_ndevs = devs_min;
+	if (ndevs > opt_ndevs)
+		ndevs = opt_ndevs;
+
 	/* round down to number of usable stripes */
 	ndevs -= ndevs % devs_increment;
 
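To see what the heuristic in the patch picks for a given layout, the calculation can be restated as a standalone userspace function. This is a sketch under stated assumptions rather than the patched kernel code: `pick_ndevs()` is a hypothetical name, the zero check is added only for the sketch, and `largest_avail` assumes the sort puts the device with the most free space first.

```c
#include <stdint.h>

/*
 * Userspace restatement of the device-count heuristic from the patch.
 * total_avail:   sum of free space over all candidate devices
 * largest_avail: free space on the first device after sorting
 *                (assumed to be the one with the most free space)
 */
int pick_ndevs(uint64_t total_avail, uint64_t largest_avail,
	       int ndevs, int devs_min, int devs_increment)
{
	int opt_ndevs;

	if (largest_avail == 0)		/* guard added for this sketch only */
		return 0;

	opt_ndevs = (int)(total_avail / largest_avail);
	if (opt_ndevs >= ndevs)
		opt_ndevs = ndevs - 1;	/* leave at least one device out */
	if (opt_ndevs < devs_min)
		opt_ndevs = devs_min;
	if (ndevs > opt_ndevs)
		ndevs = opt_ndevs;

	/* round down to number of usable stripes */
	ndevs -= ndevs % devs_increment;
	return ndevs;
}
```

For the layout from the report (five devices with 10 units free, three with 20), the total is 110 and the largest is 20, so the function picks 110 / 20 = 5 devices per chunk. With four equally sized devices it picks 3, because of the "leave one out" rule.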
Re: Segmentation Faults
Hi,

On Wed, Nov 16, 2011 at 04:39:21PM +, Tim Crone wrote:
> root@berna:~# uname -a
> Linux berna 2.6.38-bpo.2-amd64 #1 SMP Mon Jun 6 15:24:02 UTC 2011
> x86_64 GNU/Linux

2.6.38 is quite old from a btrfs perspective and it's highly possible that
the bug you've hit is already fixed. Try a newer kernel (like 3.1) and if you
still hit some sort of crash, please send the report.

> root@berna:/home/tjc# rm -rf .cache/chromium
> Segmentation fault
> root@berna:/home/tjc#
> Message from syslogd@localhost at Nov 16 11:19:35 ...
>  kernel:[ 66.877568] [ cut here ]
>
> Message from syslogd@localhost at Nov 16 11:19:35 ...
>  kernel:[ 66.877572] invalid opcode: [#1] SMP
>
> Message from syslogd@localhost at Nov 16 11:19:35 ...
>  kernel:[ 66.877573] last sysfs file: /sys/devices/virtual/block/md0/uevent
>
> Message from syslogd@localhost at Nov 16 11:19:35 ...
>  kernel:[ 66.877633] Stack:
>
> Message from syslogd@localhost at Nov 16 11:19:35 ...
>  kernel:[ 66.877640] Call Trace:
>
> Message from syslogd@localhost at Nov 16 11:19:35 ...
>  kernel:[ 66.877703] Code: 24 24 48 8b 74 24 28 48 8d 54 24 50 41 b9 01 00 00 00 48 89 d9 4c 89 e7 e8 80 6e ff ff 83 f8 00 41 89 c5 0f 8c c5 02 00 00 74 04 <0f> 0b eb fe 4c 8b 2b 8b 73 40 4c 89 ef e8 a3 e5 ff ff 41 89 c6

Btw, such output is printed to every console when the crash occurs, but it
lacks important information like the stack trace and the function names.
Without that it's very hard to get a clue about what happened.

> (I was going to include some log entries, but the lines are longer than
> 80 characters and your system will not accept them.)

As for the stack trace and line length, I personally want to see the lines
not wrapped at 80 (except e.g. the byte dump of instructions); it helps
readability.


david
Re: [Cluster-devel] fallocate vs O_(D)SYNC
On Wed, Nov 16, 2011 at 11:35:40AM -0800, Mark Fasheh wrote:
> > We should do it per FS though, I'll patch up btrfs.
>
> I agree about doing it per FS. Ocfs2 just needs a one-liner to mark the
> journal transaction as synchronous.

Joel, here's an (untested) patch to fix this in Ocfs2.
	--Mark

--
Mark Fasheh

From: Mark Fasheh

ocfs2: honor O_(D)SYNC flag in fallocate

We need to sync the transaction which updates i_size if the file is marked
as needing sync semantics.

Signed-off-by: Mark Fasheh
---
 fs/ocfs2/file.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index de4ea1a..cac00b4 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1950,6 +1950,9 @@ static int __ocfs2_change_file_space(struct file *file, struct inode *inode,
 	if (ret < 0)
 		mlog_errno(ret);
 
+	if (file->f_flags & O_SYNC)
+		handle->h_sync = 1;
+
 	ocfs2_commit_trans(osb, handle);
 
 out_inode_unlock:
-- 
1.7.6
Re: [Cluster-devel] fallocate vs O_(D)SYNC
On Wed, Nov 16, 2011 at 11:18:06AM -0500, Chris Mason wrote:
> On Wed, Nov 16, 2011 at 04:57:55PM +0100, Jan Kara wrote:
> > On Wed 16-11-11 08:42:34, Christoph Hellwig wrote:
> > > On Wed, Nov 16, 2011 at 02:39:15PM +0100, Jan Kara wrote:
> > > > > This would work fine with XFS and be equivalent to what it does for
> > > > > O_DSYNC now. But I'd rather see every filesystem do the right thing
> > > > > and make sure the update actually is on disk when doing O_(D)SYNC
> > > > > operations.
> > > > OK, I don't really have a strong opinion here. Are you afraid that just
> > > > calling fsync() need not be enough to push all updates fallocate did to
> > > > disk?
> > >
> > > No, the point is that you should not have to call fsync when doing
> > > O_SYNC I/O. That's the whole point of it.
> > I agree with you that userspace shouldn't have to call fsync. What I
> > meant is that sys_fallocate() or do_fallocate() can call
> > generic_write_sync(file, pos, len), and that would be completely
> > transparent to userspace.
>
> We should do it per FS though, I'll patch up btrfs.

I agree about doing it per FS. Ocfs2 just needs a one-liner to mark the
journal transaction as synchronous.
	--Mark

--
Mark Fasheh
Re: compressed btrfs "No space left on device"
On 14.11.2011 19:24, Arnd Hannemann wrote:
> On 14.11.2011 15:57, Arnd Hannemann wrote:
>
>> I'm using btrfs for my /usr/share/ partition and keep getting the following error
>> while installing a debian package which should take no more than 228 MB:
>>
>> Unpacking texlive-fonts-extra (from .../texlive-fonts-extra_2009-10ubuntu1_all.deb) ...
>> dpkg: error processing /var/cache/apt/archives/texlive-fonts-extra_2009-10ubuntu1_all.deb (--unpack):
>>  unable to install new version of `/usr/share/texmf-texlive/fonts/type1/public/allrunes/frutlt.pfb': No space left on device
>>
>>
>> However df reports plenty of available space:
>>
>> /dev/mapper/vg0-usr_share
>>                       5.0G  1.5G  2.5G  37% /usr/share
>>
>>
>> I already extended /dev/mapper/vg0-usr_share from 4G to 5G and ran defrag
>> and balance on it with no luck.
>> I'm using ubuntu 11.10 on amd64 with the ubuntu 3.0.0 kernel.
>
> FYI: The problem is the same with mainline kernel v3.1.1.

JFYI: the problem went away in 3.2-rc2, so someone must have fixed something.

Thanks!

Best regards
Arnd
[PATCH] Btrfs: Prefix resize related printks with btrfs:
For the user it is confusing to find something like:

[10197.627710] new size for /dev/mapper/vg0-usr_share is 3221225472

in the kernel log, because it doesn't point directly to btrfs.

This patch prefixes those messages with "btrfs:" like other btrfs
related printks.

Signed-off-by: Arnd Hannemann
---
 fs/btrfs/ioctl.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 4a34c47..b6d2a1a 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1216,12 +1216,12 @@ static noinline int btrfs_ioctl_resize(struct btrfs_root *root,
 		*devstr = '\0';
 		devstr = vol_args->name;
 		devid = simple_strtoull(devstr, &end, 10);
-		printk(KERN_INFO "resizing devid %llu\n",
+		printk(KERN_INFO "btrfs: resizing devid %llu\n",
 		       (unsigned long long)devid);
 	}
 	device = btrfs_find_device(root, devid, NULL, NULL);
 	if (!device) {
-		printk(KERN_INFO "resizer unable to find device %llu\n",
+		printk(KERN_INFO "btrfs: resizer unable to find device %llu\n",
 		       (unsigned long long)devid);
 		ret = -EINVAL;
 		goto out_unlock;
@@ -1267,7 +1267,7 @@ static noinline int btrfs_ioctl_resize(struct btrfs_root *root,
 	do_div(new_size, root->sectorsize);
 	new_size *= root->sectorsize;
 
-	printk(KERN_INFO "new size for %s is %llu\n",
+	printk(KERN_INFO "btrfs: new size for %s is %llu\n",
 		device->name, (unsigned long long)new_size);
 
 	if (new_size > old_size) {
-- 
1.7.5.4
Re: [patch] btrfs scrub: handle -ENOMEM from init_ipath()
On 16.11.2011 09:28, Dan Carpenter wrote:
> init_ipath() can return an ERR_PTR(-ENOMEM).
>
> Signed-off-by: Dan Carpenter

Signed-off-by: Jan Schmidt

Thanks,
-Jan

> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index ed11d38..b72ee47 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -256,6 +256,11 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 root, void *ctx)
>  	btrfs_release_path(swarn->path);
>
>  	ipath = init_ipath(4096, local_root, swarn->path);
> +	if (IS_ERR(ipath)) {
> +		ret = PTR_ERR(ipath);
> +		ipath = NULL;
> +		goto err;
> +	}
>  	ret = paths_from_inode(inum, ipath);
>
>  	if (ret < 0)
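For readers unfamiliar with the ERR_PTR convention this fix relies on, here is a minimal userspace sketch of the idea: a small negative errno value is encoded in the pointer itself, so the caller has to test the result with an IS_ERR-style check before dereferencing it. The lowercase helpers below only mimic the kernel macros, and `alloc_or_err()` is a hypothetical function, not init_ipath().

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value as a pointer in the top 4095 addresses. */
static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the negative errno value from an error pointer. */
static inline long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Nonzero if ptr is an encoded error rather than a valid address. */
static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical allocator that reports failure as an error pointer. */
void *alloc_or_err(size_t size)
{
	void *p = size ? malloc(size) : NULL;

	return p ? p : err_ptr(-ENOMEM);
}
```

A caller that skips the is_err() test would treat the encoded -ENOMEM as a valid address, which is the situation the added check in scrub_print_warning_inode() avoids.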
Segmentation Faults
I have some kind of corruption in my btrfs file system that causes
kernel segmentation faults when I try to delete files in my browser
cache. Let me know if you are interested in having any additional
information besides what is below. Thanks.

root@berna:~# uname -a
Linux berna 2.6.38-bpo.2-amd64 #1 SMP Mon Jun 6 15:24:02 UTC 2011 x86_64 GNU/Linux

root@berna:/usr/local/src/btrfs-progs# ./btrfsck -s 1 /dev/sda
using SB copy 1, bytenr 67108864
failed to read /dev/sr0: No medium found
ERROR: unable to scan the device '/dev/sdd' - Device or resource busy
failed to read /dev/sr0: No medium found
ERROR: unable to scan the device '/dev/sdd' - Device or resource busy
leaf parent key incorrect 1782596403200
bad block 1782596403200
incorrect offsets 3942 17510
bad block 1782603522048
warning, start mismatch 1778178670592 1778178764800
Aborted

root@berna:/home/tjc# rm -rf .cache/chromium
Segmentation fault
root@berna:/home/tjc#
Message from syslogd@localhost at Nov 16 11:19:35 ...
 kernel:[ 66.877568] [ cut here ]

Message from syslogd@localhost at Nov 16 11:19:35 ...
 kernel:[ 66.877572] invalid opcode: [#1] SMP

Message from syslogd@localhost at Nov 16 11:19:35 ...
 kernel:[ 66.877573] last sysfs file: /sys/devices/virtual/block/md0/uevent

Message from syslogd@localhost at Nov 16 11:19:35 ...
 kernel:[ 66.877633] Stack:

Message from syslogd@localhost at Nov 16 11:19:35 ...
 kernel:[ 66.877640] Call Trace:

Message from syslogd@localhost at Nov 16 11:19:35 ...
 kernel:[ 66.877703] Code: 24 24 48 8b 74 24 28 48 8d 54 24 50 41 b9 01 00 00 00 48 89 d9 4c 89 e7 e8 80 6e ff ff 83 f8 00 41 89 c5 0f 8c c5 02 00 00 74 04 <0f> 0b eb fe 4c 8b 2b 8b 73 40 4c 89 ef e8 a3 e5 ff ff 41 89 c6

(I was going to include some log entries, but the lines are longer than
80 characters and your system will not accept them.)
[PATCH] Fix URL of btrfs-progs git repository in docs
The location of the btrfs-progs repository has been changed.
This patch updates the documentation accordingly.

Signed-off-by: Arnd Hannemann
---
 Documentation/filesystems/btrfs.txt |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/btrfs.txt b/Documentation/filesystems/btrfs.txt
index 64087c3..7671352 100644
--- a/Documentation/filesystems/btrfs.txt
+++ b/Documentation/filesystems/btrfs.txt
@@ -63,8 +63,8 @@ IRC network.
 Userspace tools for creating and manipulating Btrfs file systems are
 available from the git repository at the following location:
 
- http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-progs-unstable.git
- git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs-unstable.git
+ http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-progs.git
+ git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git
 
 These include the following tools:
 
-- 
1.7.5.4
Re: [Cluster-devel] fallocate vs O_(D)SYNC
On Wed, Nov 16, 2011 at 04:57:55PM +0100, Jan Kara wrote:
> On Wed 16-11-11 08:42:34, Christoph Hellwig wrote:
> > On Wed, Nov 16, 2011 at 02:39:15PM +0100, Jan Kara wrote:
> > > > This would work fine with XFS and be equivalent to what it does for
> > > > O_DSYNC now. But I'd rather see every filesystem do the right thing
> > > > and make sure the update actually is on disk when doing O_(D)SYNC
> > > > operations.
> > > OK, I don't really have a strong opinion here. Are you afraid that just
> > > calling fsync() need not be enough to push all updates fallocate did to
> > > disk?
> >
> > No, the point is that you should not have to call fsync when doing
> > O_SYNC I/O. That's the whole point of it.
> I agree with you that userspace shouldn't have to call fsync. What I
> meant is that sys_fallocate() or do_fallocate() can call
> generic_write_sync(file, pos, len), and that would be completely
> transparent to userspace.

We should do it per FS though, I'll patch up btrfs.

-chris
Re: [Cluster-devel] fallocate vs O_(D)SYNC
On Wed, Nov 16, 2011 at 04:57:55PM +0100, Jan Kara wrote:
> I agree with you that userspace shouldn't have to call fsync. What I
> meant is that sys_fallocate() or do_fallocate() can call
> generic_write_sync(file, pos, len), and that would be completely
> transparent to userspace.

That's different from how everything else in the I/O path works.
If filesystems want to use it, that's fine, but I suspect most could
do it more efficiently.
Re: [Cluster-devel] fallocate vs O_(D)SYNC
On Wed 16-11-11 08:42:34, Christoph Hellwig wrote:
> On Wed, Nov 16, 2011 at 02:39:15PM +0100, Jan Kara wrote:
> > > This would work fine with XFS and be equivalent to what it does for
> > > O_DSYNC now. But I'd rather see every filesystem do the right thing
> > > and make sure the update actually is on disk when doing O_(D)SYNC
> > > operations.
> > OK, I don't really have a strong opinion here. Are you afraid that just
> > calling fsync() need not be enough to push all updates fallocate did to
> > disk?
>
> No, the point is that you should not have to call fsync when doing
> O_SYNC I/O. That's the whole point of it.

I agree with you that userspace shouldn't have to call fsync. What I
meant is that sys_fallocate() or do_fallocate() can call
generic_write_sync(file, pos, len), and that would be completely
transparent to userspace.

								Honza
[PATCH] btrfs: fix stat blocks accounting
Round inode bytes and delalloc bytes up to the real blocksize before
converting to sector size. Otherwise, e.g. files smaller than 512 bytes
are reported with zero blocks due to incorrect rounding.

Signed-off-by: David Sterba
---
 fs/btrfs/inode.c |    6 --
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e16215f..8ad26b1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6794,11 +6794,13 @@ static int btrfs_getattr(struct vfsmount *mnt,
 			 struct dentry *dentry, struct kstat *stat)
 {
 	struct inode *inode = dentry->d_inode;
+	u32 blocksize = inode->i_sb->s_blocksize;
+
 	generic_fillattr(inode, stat);
 	stat->dev = BTRFS_I(inode)->root->anon_dev;
 	stat->blksize = PAGE_CACHE_SIZE;
-	stat->blocks = (inode_get_bytes(inode) +
-			BTRFS_I(inode)->delalloc_bytes) >> 9;
+	stat->blocks = (ALIGN(inode_get_bytes(inode), blocksize) +
+		ALIGN(BTRFS_I(inode)->delalloc_bytes, blocksize)) >> 9;
 	return 0;
 }
 
-- 
1.7.6
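The effect of the rounding change can be reproduced with plain integer arithmetic. The sketch below is a userspace restatement, not the kernel code: `align_up()` stands in for the kernel's ALIGN() macro, the function names are hypothetical, and the 4096-byte blocksize used in the example is an assumption.

```c
#include <stdint.h>

/* Round x up to a multiple of a; a must be a power of two. */
static inline uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Number of 512-byte blocks as computed before the patch. */
uint64_t blocks_old(uint64_t inode_bytes, uint64_t delalloc_bytes)
{
	return (inode_bytes + delalloc_bytes) >> 9;
}

/* Number of 512-byte blocks as computed after the patch. */
uint64_t blocks_new(uint64_t inode_bytes, uint64_t delalloc_bytes,
		    uint64_t blocksize)
{
	return (align_up(inode_bytes, blocksize) +
		align_up(delalloc_bytes, blocksize)) >> 9;
}
```

For a 100-byte file with no delalloc bytes, the old formula gives 100 >> 9 = 0 blocks, while the new one rounds up to 4096 bytes first and reports 4096 >> 9 = 8 blocks.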
Re: BUG at fs/btrfs/inode.c:1587
2011/11/16 Chris Mason :
> On Tue, Nov 15, 2011 at 09:19:53AM +0100, Christian Brunner wrote:
>> Hi,
>>
>> this time I've hit a new bug. This happened while ceph was rebuilding
>> its filestore (heavy io).
>>
>> The btrfs version is from 3.2-rc1, applied to a 3.0 kernel.
>
> This one means some part of the kernel has set a btrfs data page dirty
> without going through the proper setup. A few of us have hit it, but we
> haven't been able to nail down a solid way to reproduce it.
>
> Have you hit it more than once?

I'm sorry, I've only hit this once and it's not reproducible.

Regards,
Christian
Re: BUG at fs/btrfs/inode.c:1587
On Tue, Nov 15, 2011 at 09:19:53AM +0100, Christian Brunner wrote:
> Hi,
>
> this time I've hit a new bug. This happened while ceph was rebuilding
> its filestore (heavy io).
>
> The btrfs version is from 3.2-rc1, applied to a 3.0 kernel.

This one means some part of the kernel has set a btrfs data page dirty
without going through the proper setup. A few of us have hit it, but we
haven't been able to nail down a solid way to reproduce it.

Have you hit it more than once?

-chris
Re: [Cluster-devel] fallocate vs O_(D)SYNC
On Wed, Nov 16, 2011 at 02:39:15PM +0100, Jan Kara wrote:
> > This would work fine with XFS and be equivalent to what it does for
> > O_DSYNC now. But I'd rather see every filesystem do the right thing
> > and make sure the update actually is on disk when doing O_(D)SYNC
> > operations.
> OK, I don't really have a strong opinion here. Are you afraid that just
> calling fsync() need not be enough to push all updates fallocate did to
> disk?

No, the point is that you should not have to call fsync when doing
O_SYNC I/O. That's the whole point of it.
Re: [Cluster-devel] fallocate vs O_(D)SYNC
On Wed 16-11-11 07:45:50, Christoph Hellwig wrote:
> On Wed, Nov 16, 2011 at 11:54:13AM +0100, Jan Kara wrote:
> > Yeah, only that nobody calls that fsync() automatically if the fd is
> > O_SYNC if I'm right. But maybe calling fdatasync() on the range which was
> > fallocated from sys_fallocate() if the fd is O_SYNC would do the trick for
> > most filesystems? That would match how we treat O_SYNC for other operations
> > as well. I'm just not sure whether XFS wouldn't take unnecessarily big hit
> > with this.
>
> This would work fine with XFS and be equivalent to what it does for
> O_DSYNC now. But I'd rather see every filesystem do the right thing
> and make sure the update actually is on disk when doing O_(D)SYNC
> operations.

OK, I don't really have a strong opinion here. Are you afraid that just
calling fsync() need not be enough to push all updates fallocate did to
disk?

Honza
--
Jan Kara
SUSE Labs, CR
Re: [Cluster-devel] fallocate vs O_(D)SYNC
On Wed, Nov 16, 2011 at 11:54:13AM +0100, Jan Kara wrote:
> Yeah, only that nobody calls that fsync() automatically if the fd is
> O_SYNC if I'm right. But maybe calling fdatasync() on the range which was
> fallocated from sys_fallocate() if the fd is O_SYNC would do the trick for
> most filesystems? That would match how we treat O_SYNC for other operations
> as well. I'm just not sure whether XFS wouldn't take unnecessarily big hit
> with this.

This would work fine with XFS and be equivalent to what it does for
O_DSYNC now. But I'd rather see every filesystem do the right thing
and make sure the update actually is on disk when doing O_(D)SYNC
operations.
Re: fallocate vs O_(D)SYNC
On Wed, Nov 16, 2011 at 03:42:56AM -0500, Christoph Hellwig wrote:
> It seems all filesystems but XFS ignore O_SYNC for fallocate, and never
> make sure the size update transaction made it to disk.
>
> Given that a fallocate without FALLOC_FL_KEEP_SIZE very much is a data
> operation (it adds new blocks that return zeroes) that seems like a
> fairly nasty surprise for O_SYNC users.

Hi all,

This patch should fix this problem in ext4.

From: Zheng Liu

Make sure the transaction is committed if the O_(D)SYNC flag is set in
ext4_fallocate().

Signed-off-by: Zheng Liu
---
 fs/ext4/extents.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 61fa9e1..f47e3ad 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4356,6 +4356,8 @@ retry:
 			ret = PTR_ERR(handle);
 			break;
 		}
+		if (file->f_flags & O_SYNC)
+			ext4_handle_sync(handle);
 		ret = ext4_map_blocks(handle, inode, &map, flags);
 		if (ret <= 0) {
 #ifdef EXT4FS_DEBUG
--
1.7.4.1
Re: [Cluster-devel] fallocate vs O_(D)SYNC
Hi,

On Wed, 2011-11-16 at 11:54 +0100, Jan Kara wrote:
> Hello,
>
> On Wed 16-11-11 09:43:08, Steven Whitehouse wrote:
> > On Wed, 2011-11-16 at 03:42 -0500, Christoph Hellwig wrote:
> > > It seems all filesystems but XFS ignore O_SYNC for fallocate, and never
> > > make sure the size update transaction made it to disk.
> > >
> > > Given that a fallocate without FALLOC_FL_KEEP_SIZE very much is a data
> > > operation (it adds new blocks that return zeroes) that seems like a
> > > fairly nasty surprise for O_SYNC users.
> >
> > In GFS2 we zero out the data blocks as we go (since our metadata doesn't
> > allow us to mark blocks as zeroed at alloc time) and also because we are
> > mostly interested in being able to do FALLOC_FL_KEEP_SIZE which we use
> > on our rindex system file in order to ensure that there is always enough
> > space to expand a filesystem.
> >
> > So there is no danger of having non-zeroed blocks appearing later, as
> > that is done before the metadata change.
> >
> > Our fallocate_chunk() function calls mark_inode_dirty(inode) on each
> > call, so that fsync should pick that up and ensure that the metadata has
> > been written back. So we should thus have both data and metadata stable
> > on disk.
> >
> > Do you have some evidence that this is not happening?
> Yeah, only that nobody calls that fsync() automatically if the fd is
> O_SYNC if I'm right. But maybe calling fdatasync() on the range which was
> fallocated from sys_fallocate() if the fd is O_SYNC would do the trick for
> most filesystems? That would match how we treat O_SYNC for other operations
> as well. I'm just not sure whether XFS wouldn't take unnecessarily big hit
> with this.
>
> Honza

Ah, I see now. Sorry, I missed the original point. So that would just be
a VFS addition to check the O_(D)SYNC flag as you suggest. I've no
objections to that, it makes sense to me,

Steve.
Re: [Cluster-devel] fallocate vs O_(D)SYNC
Hello,

On Wed 16-11-11 09:43:08, Steven Whitehouse wrote:
> On Wed, 2011-11-16 at 03:42 -0500, Christoph Hellwig wrote:
> > It seems all filesystems but XFS ignore O_SYNC for fallocate, and never
> > make sure the size update transaction made it to disk.
> >
> > Given that a fallocate without FALLOC_FL_KEEP_SIZE very much is a data
> > operation (it adds new blocks that return zeroes) that seems like a
> > fairly nasty surprise for O_SYNC users.
>
> In GFS2 we zero out the data blocks as we go (since our metadata doesn't
> allow us to mark blocks as zeroed at alloc time) and also because we are
> mostly interested in being able to do FALLOC_FL_KEEP_SIZE which we use
> on our rindex system file in order to ensure that there is always enough
> space to expand a filesystem.
>
> So there is no danger of having non-zeroed blocks appearing later, as
> that is done before the metadata change.
>
> Our fallocate_chunk() function calls mark_inode_dirty(inode) on each
> call, so that fsync should pick that up and ensure that the metadata has
> been written back. So we should thus have both data and metadata stable
> on disk.
>
> Do you have some evidence that this is not happening?

Yeah, only that nobody calls that fsync() automatically if the fd is
O_SYNC if I'm right. But maybe calling fdatasync() on the range which was
fallocated from sys_fallocate() if the fd is O_SYNC would do the trick for
most filesystems? That would match how we treat O_SYNC for other operations
as well. I'm just not sure whether XFS wouldn't take unnecessarily big hit
with this.

Honza
--
Jan Kara
SUSE Labs, CR
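For readers who want to see the idea in concrete form, here is a rough
user-space model of the proposal above: preallocate, then flush if the
descriptor is O_SYNC or O_DSYNC. It is only a sketch under several
assumptions and is not kernel code from this thread. The name
preallocate_honoring_sync() is hypothetical, posix_fallocate() stands in
for fallocate(fd, 0, ...), and fdatasync() flushes the whole file because
user space has no ranged variant.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: preallocate [offset, offset + len) and, if the
 * descriptor was opened with O_SYNC or O_DSYNC, make the result stable
 * before returning. Returns 0 on success or a positive errno value.
 */
static int preallocate_honoring_sync(int fd, off_t offset, off_t len)
{
	int flags, err;

	assert(fd >= 0);

	/* posix_fallocate() returns the error number directly. */
	err = posix_fallocate(fd, offset, len);
	if (err != 0)
		return err;

	flags = fcntl(fd, F_GETFL);
	if (flags == -1)
		return errno;

	/* Mask test: any O_SYNC bit set means the caller asked for sync I/O. */
	if ((flags & O_SYNC) != 0 && fdatasync(fd) != 0)
		return errno;

	return 0;
}
```

An application can use a wrapper like this as a workaround on filesystems
that do not honour O_SYNC in fallocate, at the cost of one extra flush per
call.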
Re: [Cluster-devel] fallocate vs O_(D)SYNC
Hi,

On Wed, 2011-11-16 at 03:42 -0500, Christoph Hellwig wrote:
> It seems all filesystems but XFS ignore O_SYNC for fallocate, and never
> make sure the size update transaction made it to disk.
>
> Given that a fallocate without FALLOC_FL_KEEP_SIZE very much is a data
> operation (it adds new blocks that return zeroes) that seems like a
> fairly nasty surprise for O_SYNC users.
>

In GFS2 we zero out the data blocks as we go (since our metadata doesn't
allow us to mark blocks as zeroed at alloc time) and also because we are
mostly interested in being able to do FALLOC_FL_KEEP_SIZE which we use
on our rindex system file in order to ensure that there is always enough
space to expand a filesystem.

So there is no danger of having non-zeroed blocks appearing later, as
that is done before the metadata change.

Our fallocate_chunk() function calls mark_inode_dirty(inode) on each
call, so that fsync should pick that up and ensure that the metadata has
been written back. So we should thus have both data and metadata stable
on disk.

Do you have some evidence that this is not happening?

Steve.
fallocate vs O_(D)SYNC
It seems all filesystems but XFS ignore O_SYNC for fallocate, and never
make sure the size update transaction made it to disk.

Given that a fallocate without FALLOC_FL_KEEP_SIZE very much is a data
operation (it adds new blocks that return zeroes) that seems like a
fairly nasty surprise for O_SYNC users.
[patch] btrfs scrub: handle -ENOMEM from init_ipath()
init_ipath() can return an ERR_PTR(-ENOMEM).

Signed-off-by: Dan Carpenter

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index ed11d38..b72ee47 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -256,6 +256,11 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 root, void *ctx)
 	btrfs_release_path(swarn->path);
 
 	ipath = init_ipath(4096, local_root, swarn->path);
+	if (IS_ERR(ipath)) {
+		ret = PTR_ERR(ipath);
+		ipath = NULL;
+		goto err;
+	}
 	ret = paths_from_inode(inum, ipath);
 
 	if (ret < 0)
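For anyone unfamiliar with the idiom used in this patch, the sketch below
is a user-space model of the kernel's ERR_PTR()/IS_ERR()/PTR_ERR()
helpers, which encode a negative errno in the top 4095 values of the
pointer range. The three helpers here are simplified re-implementations.
struct ipath_model, init_ipath_model() and print_warning_model() are
hypothetical stand-ins, not the btrfs functions. Resetting the pointer to
NULL before the jump presumably keeps the cleanup path from freeing an
error value, which is what the model shows.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Simplified user-space versions of the kernel helpers. */
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct ipath_model { size_t size; };	/* hypothetical stand-in */

/* Returns a valid object, or ERR_PTR(-ENOMEM) on (simulated) failure. */
static struct ipath_model *init_ipath_model(size_t size, int simulate_oom)
{
	struct ipath_model *p;

	assert(size > 0);
	if (simulate_oom)
		return ERR_PTR(-ENOMEM);
	p = malloc(sizeof(*p));
	if (!p)
		return ERR_PTR(-ENOMEM);
	p->size = size;
	return p;
}

/* Mirrors the control flow added by the patch. */
static int print_warning_model(int simulate_oom)
{
	struct ipath_model *ipath;
	int ret = 0;

	ipath = init_ipath_model(4096, simulate_oom);
	if (IS_ERR(ipath)) {
		ret = (int)PTR_ERR(ipath);
		ipath = NULL;	/* never hand an error value to free() */
		goto err;
	}
	/* ... the real code would use ipath here ... */
err:
	free(ipath);		/* free(NULL) is a no-op */
	return ret;
}
```

Without the IS_ERR() check, the error value would be passed on as if it
were a valid pointer and dereferenced later.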