Two identical copies of an image mounted result in changes to both images if only one is modified
Hi,

I've observed a rather strange behaviour while trying to mount two identical copies of the same image to different mount points. Each modification to one image is also performed in the second one.

Example:

    dd if=/dev/sda? of=image1 bs=1M
    cp image1 image2
    mount -o loop image1 m1
    mount -o loop image2 m2
    touch m2/hello
    ls -la m1    //will now also include a file called hello

Is this behaviour intentional and known or should I create a bug-report? I've deleted quite a bunch of files on my production system because of this...

Thanks, Clemens

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Two identical copies of an image mounted result in changes to both images if only one is modified
On Thu, Jun 20, 2013 at 3:47 PM, Clemens Eisserer linuxhi...@gmail.com wrote:
> Hi,
>
> I've observed a rather strange behaviour while trying to mount two identical copies of the same image to different mount points. Each modification to one image is also performed in the second one.
>
> Example:
>     dd if=/dev/sda? of=image1 bs=1M
>     cp image1 image2
>     mount -o loop image1 m1
>     mount -o loop image2 m2
>     touch m2/hello
>     ls -la m1    //will now also include a file called hello

What do you get if you unmount BOTH m1 and m2, and THEN mount m1 again? Is the file still there?

> Is this behaviour intentional and known or should I create a bug-report? I've deleted quite a bunch of files on my production system because of this...

I'm pretty sure this is a known behavior in btrfs.
http://markmail.org/message/i522sdkrhlxhw757#query:+page:1+mid:ksdi5d4v26eqgxpi+state:results

--
Fajar
Re: Two identical copies of an image mounted result in changes to both images if only one is modified
On Thu, Jun 20, 2013 at 10:47:53AM +0200, Clemens Eisserer wrote:
> Hi,
>
> I've observed a rather strange behaviour while trying to mount two identical copies of the same image to different mount points. Each modification to one image is also performed in the second one.
>
> Example:
>     dd if=/dev/sda? of=image1 bs=1M
>     cp image1 image2
>     mount -o loop image1 m1
>     mount -o loop image2 m2
>     touch m2/hello
>     ls -la m1    //will now also include a file called hello
>
> Is this behaviour intentional and known or should I create a bug-report?

It's known, and not desired behaviour. The problem is that you've ended up with two filesystems with the same UUID, and the FS code gets rather confused about that. The same problem exists with LVM snapshots (or other block-device-layer copies).

The solution is a combination of a tool to scan an image and change the UUID (offline), and of some code in the kernel that detects when it's being told about a duplicate image (rather than an additional device in the same FS). Neither of these has been written yet, I'm afraid.

> I've deleted quite a bunch of files on my production system because of this...

Oops. I'm sorry to hear that. :(

Hugo.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Welcome to Rivendell, Mr Anderson... ---
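To make the duplicate-UUID situation described above concrete, here is a minimal userspace sketch (not code from the thread or from btrfs-progs) that reads the filesystem UUID out of the primary superblock of two image files and compares them. It assumes the documented on-disk layout: primary superblock at 64 KiB, fsid at byte 0x20 of the superblock, magic `_BHRfS_M` at byte 0x40. The function names `read_btrfs_fsid` and `same_btrfs_fsid` are made up for this illustration.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BTRFS_SUPER_INFO_OFFSET 65536 /* primary superblock starts at 64 KiB */
#define SKETCH_FSID_OFFSET      0x20  /* fsid follows the 32-byte checksum */
#define SKETCH_MAGIC_OFFSET     0x40  /* 8-byte magic "_BHRfS_M" */
#define SKETCH_FSID_SIZE        16

/*
 * Read the 16-byte filesystem UUID from a btrfs image file.
 * Returns 0 on success, -1 on I/O error or if the magic is missing.
 */
static int read_btrfs_fsid(const char *path, uint8_t fsid[SKETCH_FSID_SIZE])
{
	uint8_t sb[SKETCH_MAGIC_OFFSET + 8];
	FILE *f = fopen(path, "rb");

	if (!f)
		return -1;
	if (fseek(f, BTRFS_SUPER_INFO_OFFSET, SEEK_SET) != 0 ||
	    fread(sb, 1, sizeof(sb), f) != sizeof(sb)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	/* Refuse anything that does not look like a btrfs superblock. */
	if (memcmp(sb + SKETCH_MAGIC_OFFSET, "_BHRfS_M", 8) != 0)
		return -1;

	memcpy(fsid, sb + SKETCH_FSID_OFFSET, SKETCH_FSID_SIZE);
	return 0;
}

/* Returns 1 if both images carry the same fsid, 0 if not, -1 on error. */
static int same_btrfs_fsid(const char *a, const char *b)
{
	uint8_t fa[SKETCH_FSID_SIZE], fb[SKETCH_FSID_SIZE];

	if (read_btrfs_fsid(a, fa) != 0 || read_btrfs_fsid(b, fb) != 0)
		return -1;
	return memcmp(fa, fb, SKETCH_FSID_SIZE) == 0;
}
```

Calling `same_btrfs_fsid("image1", "image2")` on the two files from the original report would be expected to return 1, since `cp` duplicates the superblock byte for byte. A result of 1 is the warning sign: both images should not be mounted at the same time.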
Re: Filesystem somewhat destroyed - need help for recovery/fixing
Hi

On Mon, Jun 17, 2013 at 11:43 PM, Alexander Skwar alexanders.mailinglists+nos...@gmail.com wrote:
> Hello Josef
>
> On Mon, Jun 17, 2013 at 11:21 PM, Josef Bacik jba...@fusionio.com wrote:
>> Pull down my tree git://github.com/josefbacik/btrfs-progs.git and build and run the fsck in there and see if it's a bit more friendly.
>
> I just gave it a try, but wasn't successful, it seems… Kernel still crashes. Maybe checkout the screenphotos at http://goo.gl/DWkRH or http://imgur.com/a/00pTx

Any other ideas, about what I might be able to do, to revive my btrfs filesystem?

Alexander

--
=Google+ = http://plus.skwar.me ==
= Chat (Jabber/Google Talk) = a.sk...@gmail.com ==
Re: Two identical copies of an image mounted result in changes to both images if only one is modified
On Thu, 20 Jun 2013 10:16:22 +0100, Hugo Mills wrote:
> On Thu, Jun 20, 2013 at 10:47:53AM +0200, Clemens Eisserer wrote:
>> Hi,
>>
>> I've observed a rather strange behaviour while trying to mount two identical copies of the same image to different mount points. Each modification to one image is also performed in the second one.
>>
>>     touch m2/hello
>>     ls -la m1    //will now also include a file called hello
>>
>> Is this behaviour intentional and known or should I create a bug-report?
>
> It's known, and not desired behaviour. The problem is that you've ended up with two filesystems with the same UUID, and the FS code gets rather confused about that. The same problem exists with LVM snapshots (or other block-device-layer copies).
>
> The solution is a combination of a tool to scan an image and change the UUID (offline), and of some code in the kernel that detects when it's being told about a duplicate image (rather than an additional device in the same FS). Neither of these has been written yet, I'm afraid.

To clarify, the loop devices are properly distinct, but the first device ends up mounted twice.

I've had a look at the vfs code, and it doesn't seem to be uuid-aware, which makes sense because the uuid is a property of the superblock and the fs structure doesn't expose it. It's a Btrfs problem.

Instead of redirecting to a different block device, Btrfs could and should refuse to mount an already-mounted superblock when the block device doesn't match, somewhere in or below btrfs_mount. Registering extra, distinct superblocks for an already mounted raid is a different matter, but that isn't done through the mount syscall anyway.

>> I've deleted quite a bunch of files on my production system because of this...
>
> Oops. I'm sorry to hear that. :(
>
> Hugo.
Re: Two identical copies of an image mounted result in changes to both images if only one is modified
On Thu, Jun 20, 2013 at 10:22:07AM +, Gabriel de Perthuis wrote:
> On Thu, 20 Jun 2013 10:16:22 +0100, Hugo Mills wrote:
>> On Thu, Jun 20, 2013 at 10:47:53AM +0200, Clemens Eisserer wrote:
>>> Hi,
>>>
>>> I've observed a rather strange behaviour while trying to mount two identical copies of the same image to different mount points. Each modification to one image is also performed in the second one.
>>>
>>>     touch m2/hello
>>>     ls -la m1    //will now also include a file called hello
>>>
>>> Is this behaviour intentional and known or should I create a bug-report?
>>
>> It's known, and not desired behaviour. The problem is that you've ended up with two filesystems with the same UUID, and the FS code gets rather confused about that. The same problem exists with LVM snapshots (or other block-device-layer copies).
>>
>> The solution is a combination of a tool to scan an image and change the UUID (offline), and of some code in the kernel that detects when it's being told about a duplicate image (rather than an additional device in the same FS). Neither of these has been written yet, I'm afraid.
>
> To clarify, the loop devices are properly distinct, but the first device ends up mounted twice.
>
> I've had a look at the vfs code, and it doesn't seem to be uuid-aware, which makes sense because the uuid is a property of the superblock and the fs structure doesn't expose it. It's a Btrfs problem.

Yes, it is. (I didn't intend, however obliquely, to imply that it wasn't).

> Instead of redirecting to a different block device, Btrfs could and should refuse to mount an already-mounted superblock when the block device doesn't match, somewhere in or below btrfs_mount. Registering extra, distinct superblocks for an already mounted raid is a different matter, but that isn't done through the mount syscall anyway.

The problem here is that you could quite legitimately mount /dev/sda (with UUID=AA1234) on, say, /mnt/fs-a, and /dev/sdb (with UUID=AA1234) on /mnt/fs-b -- _provided_ that /dev/sda and /dev/sdb are both part of the same filesystem. So you can't simply prevent mounting based on the device that the mount's being done with.

Hugo.
Re: Two identical copies of an image mounted result in changes to both images if only one is modified
>> Instead of redirecting to a different block device, Btrfs could and should refuse to mount an already-mounted superblock when the block device doesn't match, somewhere in or below btrfs_mount. Registering extra, distinct superblocks for an already mounted raid is a different matter, but that isn't done through the mount syscall anyway.
>
> The problem here is that you could quite legitimately mount /dev/sda (with UUID=AA1234) on, say, /mnt/fs-a, and /dev/sdb (with UUID=AA1234) on /mnt/fs-b -- _provided_ that /dev/sda and /dev/sdb are both part of the same filesystem. So you can't simply prevent mounting based on the device that the mount's being done with.

Okay. The check should rely on a list of known block devices for a given filesystem uuid.
Re: Two identical copies of an image mounted result in changes to both images if only one is modified
On Thu, Jun 20, 2013 at 10:41:53AM +, Gabriel de Perthuis wrote:
>>> Instead of redirecting to a different block device, Btrfs could and should refuse to mount an already-mounted superblock when the block device doesn't match, somewhere in or below btrfs_mount. Registering extra, distinct superblocks for an already mounted raid is a different matter, but that isn't done through the mount syscall anyway.
>>
>> The problem here is that you could quite legitimately mount /dev/sda (with UUID=AA1234) on, say, /mnt/fs-a, and /dev/sdb (with UUID=AA1234) on /mnt/fs-b -- _provided_ that /dev/sda and /dev/sdb are both part of the same filesystem. So you can't simply prevent mounting based on the device that the mount's being done with.
>
> Okay. The check should rely on a list of known block devices for a given filesystem uuid.

And this is where we fail currently -- that list is held by the btrfs module in the kernel, and is constructed on the basis of what btrfs dev scan finds by looking at superblocks on block devices. Currently, there's no method implemented for determining whether a block device with a legitimate btrfs superblock on it is a duplicate of another device, or whether it's a newly-discovered device which is part of an as-yet incompletely specified multi-device FS.

I think it should be possible to look up the device ID as well, and complain (loudly, to the user, and in the kernel) at btrfs dev scan time if we see duplicates. That would deal with the problem at the earliest point of confusion.

Hugo.
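The scan-time check suggested above could look roughly like the following userspace sketch. It is only an illustration of the idea and not kernel code: the table, the `seen_dev` structure and `register_scanned_device` are all hypothetical names, and a real implementation would also have to cope with device renames, multipath and locking. The point is that the pair (filesystem UUID, device ID) is expected to be unique, so seeing it a second time at a different path signals a duplicate image rather than an additional member device.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_SEEN 64

/* One entry per (fsid, devid) pair discovered by a device scan. */
struct seen_dev {
	uint8_t  fsid[16];
	uint64_t devid;
	char     path[64];
};

static struct seen_dev seen[MAX_SEEN];
static int nr_seen;

/*
 * Record a scanned device.
 * Returns  0 if the (fsid, devid) pair is new,
 *          1 if the same path was simply scanned again,
 *         -1 if the pair is already known at a DIFFERENT path (duplicate),
 *         -2 if the table is full.
 */
static int register_scanned_device(const uint8_t fsid[16], uint64_t devid,
				   const char *path)
{
	for (int i = 0; i < nr_seen; i++) {
		if (memcmp(seen[i].fsid, fsid, 16) != 0 ||
		    seen[i].devid != devid)
			continue;
		if (strcmp(seen[i].path, path) == 0)
			return 1;
		/* Complain loudly, as early as possible. */
		fprintf(stderr,
			"duplicate device: devid %llu already seen at %s, now at %s\n",
			(unsigned long long)devid, seen[i].path, path);
		return -1;
	}

	if (nr_seen == MAX_SEEN)
		return -2;

	memcpy(seen[nr_seen].fsid, fsid, 16);
	seen[nr_seen].devid = devid;
	snprintf(seen[nr_seen].path, sizeof(seen[nr_seen].path), "%s", path);
	nr_seen++;
	return 0;
}
```

With this scheme, /dev/sda (devid 1) and /dev/sdb (devid 2) of one multi-device filesystem both register cleanly, while a loop device carrying a byte-for-byte copy of /dev/sda is rejected because it repeats devid 1 under the same fsid.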
[PATCH 1/4] Btrfs-progs: fix misuse of skinny metadata in btrfs-image
With skinny metadata, key.offset stores the level rather than the extent length.

Signed-off-by: Liu Bo bo.li@oracle.com
---
 btrfs-image.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/btrfs-image.c b/btrfs-image.c
index 739ae35..e5ff795 100644
--- a/btrfs-image.c
+++ b/btrfs-image.c
@@ -798,9 +798,9 @@ static int copy_from_extent_tree(struct metadump_struct *metadump,
 		bytenr = key.objectid;
 		if (key.type == BTRFS_METADATA_ITEM_KEY)
-			num_bytes = key.offset;
-		else
 			num_bytes = extent_root->leafsize;
+		else
+			num_bytes = key.offset;
 
 		if (btrfs_item_size_nr(leaf, path->slots[0]) > sizeof(*ei)) {
 			ei = btrfs_item_ptr(leaf, path->slots[0],
--
1.7.7
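The rule this fix restores can be shown in isolation. The sketch below is an illustration, not the btrfs-image code: `key_sketch` and `extent_num_bytes` are hypothetical names, while the two key type values (168 and 169) are the ones btrfs uses on disk. For a regular extent item the key offset is the extent length; for a skinny metadata item it is the tree level, so the length has to come from the leaf size instead.

```c
#include <stdint.h>

#define BTRFS_EXTENT_ITEM_KEY   168 /* key.offset = extent length in bytes */
#define BTRFS_METADATA_ITEM_KEY 169 /* skinny metadata: key.offset = tree level */

/* Simplified stand-in for struct btrfs_key (hypothetical, for illustration). */
struct key_sketch {
	uint64_t objectid; /* start byte of the extent */
	uint8_t  type;
	uint64_t offset;
};

/* Length in bytes of the extent described by an extent-tree key. */
static uint64_t extent_num_bytes(const struct key_sketch *key, uint32_t leafsize)
{
	if (key->type == BTRFS_METADATA_ITEM_KEY)
		return leafsize;   /* offset holds a level, not a size */
	return key->offset;        /* regular extent item: offset is the size */
}
```

With the two branches swapped, as in the code before the patch, a skinny metadata item at level 1 would be treated as a 1-byte extent.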
[PATCH 2/4] Btrfs-progs: skip open devices which is missing
A device can be added to the device list without getting a name, so we may access illegal addresses while opening devices by name.

Signed-off-by: Liu Bo bo.li@oracle.com
---
 volumes.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/volumes.c b/volumes.c
index 8285240..a06896d 100644
--- a/volumes.c
+++ b/volumes.c
@@ -186,6 +186,10 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices, int flags)
 	list_for_each(cur, head) {
 		device = list_entry(cur, struct btrfs_device, dev_list);
+		if (!device->name) {
+			printk("no name for device %llu, skip it now\n", device->devid);
+			continue;
+		}
 
 		fd = open(device->name, flags);
 		if (fd < 0) {
--
1.7.7
[PATCH 0/4] multiple disks restore support of btrfs-image
Patch 1-3 are bug fixes for several places.
Patch 4 adds btrfs-image support of multiple disks restore.

Liu Bo (4):
  Btrfs-progs: fix misuse of skinny metadata in btrfs-image
  Btrfs-progs: skip open devices which is missing
  Btrfs-progs: delete fs_devices itself from fs_uuid list before freeing
  Btrfs-progs: exhance btrfs-image to restore image onto multiple disks

 btrfs-image.c |  298 ++--
 ctree.h       |    1 +
 disk-io.c     |   91 +-
 disk-io.h     |    5 +
 volumes.c     |    4 +
 5 files changed, 339 insertions(+), 60 deletions(-)

--
1.7.7
[PATCH 4/4] Btrfs-progs: exhance btrfs-image to restore image onto multiple disks
This adds a 'btrfs-image -m' option, which lets us restore an image built from a multi-disk btrfs onto several disks at once.

This aims to address the following case:

    $ mkfs.btrfs -m raid0 sda sdb
    $ btrfs-image sda image.file
    $ btrfs-image -r image.file sdc

Here we can only restore metadata onto sdc, and we can only mount sdc in degraded mode because we provide no information about the other disk. And since the filesystem was built as RAID0 and we have only one disk, after mounting sdc we end up in read-only mode.

This is just annoying for people (like me) who are trying to restore an image but find they cannot make it work.

So this will make your life easier; just type

    $ btrfs-image -m image.file sdc sdd

and all the metadata is restored at the same offsets as on the originals (of course, you need to provide enough disk space, at least the size of the original disks).

Besides, this also works with raid5 and raid6 metadata images.

Signed-off-by: Liu Bo bo.li@oracle.com
---
 btrfs-image.c |  294 ++---
 ctree.h       |    1 +
 disk-io.c     |   90 +-
 disk-io.h     |    5 +
 4 files changed, 332 insertions(+), 58 deletions(-)

diff --git a/btrfs-image.c b/btrfs-image.c
index e5ff795..6ca4589 100644
--- a/btrfs-image.c
+++ b/btrfs-image.c
@@ -119,6 +119,9 @@ struct mdrestore_struct {
 	int done;
 	int error;
 	int old_restore;
+	int fixup_offset;
+	int multi_devices;
+	struct btrfs_fs_info *info;
 };
 
 static void csum_block(u8 *buf, size_t len)
@@ -1233,33 +1236,67 @@ static void *restore_worker(void *data)
 			size = async->bufsize;
 		}
 
-		if (async->start == BTRFS_SUPER_INFO_OFFSET) {
-			if (mdres->old_restore) {
-				update_super_old(outbuf);
-			} else {
-				ret = update_super(outbuf);
+		if (!mdres->multi_devices) {
+			if (async->start == BTRFS_SUPER_INFO_OFFSET) {
+				if (mdres->old_restore) {
+					update_super_old(outbuf);
+				} else {
+					ret = update_super(outbuf);
+					if (ret)
+						err = ret;
+				}
+			} else if (!mdres->old_restore) {
+				ret = fixup_chunk_tree_block(mdres, async,
+							     outbuf, size);
 				if (ret)
 					err = ret;
 			}
-		} else if (!mdres->old_restore) {
-			ret = fixup_chunk_tree_block(mdres, async, outbuf, size);
-			if (ret)
-				err = ret;
 		}
 
-		ret = pwrite64(outfd, outbuf, size, async->start);
-		if (ret < size) {
-			if (ret < 0) {
-				fprintf(stderr, "Error writing to device %d\n",
-					errno);
-				err = errno;
-			} else {
-				fprintf(stderr, "Short write\n");
-				err = -EIO;
+		if (!mdres->fixup_offset) {
+			ret = pwrite64(outfd, outbuf, size, async->start);
+			if (ret != size) {
+				if (ret < 0) {
+					fprintf(stderr, "Error writing to device %d\n",
+						errno);
+					err = errno;
+				} else {
+					fprintf(stderr, "Short write\n");
+					err = -EIO;
+				}
+			}
+		} else if (async->start != BTRFS_SUPER_INFO_OFFSET) {
+			u64 cur_off;
+			size_t cur_size;
+			struct extent_buffer *eb;
+
+			cur_size = size;
+			cur_off = 0;
+			while (cur_size > 0) {
+				eb = read_tree_block(mdres->info->chunk_root,
+						     async->start + cur_off,
+						     mdres->leafsize, 0);
+				BUG_ON(!eb); /* we should have eb now */
+
+				if (memcmp(eb->data, outbuf + cur_off,
+					   mdres->leafsize)) {
+					printk("%s: eb %llu NOT same with outbuf\n", __func__, eb->start);
+
[PATCH 3/4] Btrfs-progs: delete fs_devices itself from fs_uuid list before freeing
Otherwise we will access illegal addresses while searching on fs_uuid list.

Signed-off-by: Liu Bo bo.li@oracle.com
---
 disk-io.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/disk-io.c b/disk-io.c
index 21b410d..2892300 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -1277,6 +1277,7 @@ static int close_all_devices(struct btrfs_fs_info *fs_info)
 		kfree(device->label);
 		kfree(device);
 	}
+	list_del(&fs_info->fs_devices->list);
 	kfree(fs_info->fs_devices);
 	return 0;
 }
--
1.7.7
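The bug class fixed here, freeing an object that is still linked into a global list, can be reproduced with a small stand-alone sketch. The list helpers below mimic the kernel-style `list_head` API but are written out for this example, and `fs_devices_sketch`, `sketch_open`, `sketch_close` and `sketch_count` are hypothetical names. The important line is the `list_del()` before `free()`: without it, the next walk over `fs_uuids` would follow a pointer into freed memory.

```c
#include <stdlib.h>

/* Minimal circular doubly linked list in the style of the kernel's list.h. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_add(struct list_head *n, struct list_head *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

static void list_del(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	e->next = e->prev = e;
}

/* Hypothetical stand-in for struct btrfs_fs_devices. */
struct fs_devices_sketch {
	struct list_head list; /* linked into the global fs_uuids list */
	int id;
};

/* Global list of all known filesystems, initially empty. */
static struct list_head fs_uuids = { &fs_uuids, &fs_uuids };

static struct fs_devices_sketch *sketch_open(int id)
{
	struct fs_devices_sketch *d = malloc(sizeof(*d));

	if (!d)
		return NULL;
	d->id = id;
	list_add(&d->list, &fs_uuids);
	return d;
}

static void sketch_close(struct fs_devices_sketch *d)
{
	list_del(&d->list); /* unlink first, so no list walk can reach d */
	free(d);
}

/* Walk the global list; safe only because closed entries were unlinked. */
static int sketch_count(void)
{
	int n = 0;

	for (struct list_head *p = fs_uuids.next; p != &fs_uuids; p = p->next)
		n++;
	return n;
}
```

After `sketch_close()`, `sketch_count()` no longer sees the closed entry, which corresponds to what the one-line patch achieves for searches on the fs_uuid list.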
Re: [PATCH 4/4] Btrfs-progs: exhance btrfs-image to restore image onto multiple disks
On Thu, Jun 20, 2013 at 08:05:30PM +0800, Liu Bo wrote:
> This adds a 'btrfs-image -m' option, which let us restore an image that is built from a btrfs of multiple disks onto several disks altogether.
>
> This aims to address the following case,
>     $ mkfs.btrfs -m raid0 sda sdb
>     $ btrfs-image sda image.file
>     $ btrfs-image -r image.file sdc
>
> so we can only restore metadata onto sdc, and another thing is we can only mount sdc with degraded mode as we don't provide informations of another disk. And, it's built as RAID0 and we have only one disk, so after mount sdc we'll get into readonly mode.

Um that shouldn't be happening, the restore will mask out the RAID parts of the chunk tree and it should work just fine. Are you using the most recent version of btrfs-image? If this is happening it's a bug and we need to fix it, but I've restored several file systems from users with raid0/10 file systems onto a single disk and it's worked just fine. Thanks,

Josef
Re: [PATCH 4/4] Btrfs-progs: exhance btrfs-image to restore image onto multiple disks
On Thu, Jun 20, 2013 at 08:24:32AM -0400, Josef Bacik wrote:
> On Thu, Jun 20, 2013 at 08:05:30PM +0800, Liu Bo wrote:
>> This adds a 'btrfs-image -m' option, which let us restore an image that is built from a btrfs of multiple disks onto several disks altogether.
>>
>> This aims to address the following case,
>>     $ mkfs.btrfs -m raid0 sda sdb
>>     $ btrfs-image sda image.file
>>     $ btrfs-image -r image.file sdc
>>
>> so we can only restore metadata onto sdc, and another thing is we can only mount sdc with degraded mode as we don't provide informations of another disk. And, it's built as RAID0 and we have only one disk, so after mount sdc we'll get into readonly mode.
>
> Um that shouldn't be happening, the restore will mask out the RAID parts of the chunk tree and it should work just fine. Are you using the most recent version of btrfs-image? If this is happening it's a bug and we need to fix it, but I've restored several file systems from users with raid0/10 file systems onto a single disk and it's worked just fine. Thanks,

Well apparently I've been hallucinating because it definitely doesn't work. I'd rather fix the device tree so it only restores onto one disk, since the raid level shouldn't matter and it does in fact get masked out. So the only thing left would be to fix the device tree so the only device it knows about is the device we're restoring to. Thanks,

Josef
Re: [PATCH 4/4] Btrfs-progs: exhance btrfs-image to restore image onto multiple disks
Quoting Josef Bacik (2013-06-20 08:24:32)
> On Thu, Jun 20, 2013 at 08:05:30PM +0800, Liu Bo wrote:
>> This adds a 'btrfs-image -m' option, which let us restore an image that is built from a btrfs of multiple disks onto several disks altogether.
>>
>> This aims to address the following case,
>>     $ mkfs.btrfs -m raid0 sda sdb
>>     $ btrfs-image sda image.file
>>     $ btrfs-image -r image.file sdc
>>
>> so we can only restore metadata onto sdc, and another thing is we can only mount sdc with degraded mode as we don't provide informations of another disk. And, it's built as RAID0 and we have only one disk, so after mount sdc we'll get into readonly mode.
>
> Um that shouldn't be happening, the restore will mask out the RAID parts of the chunk tree and it should work just fine. Are you using the most recent version of btrfs-image? If this is happening it's a bug and we need to fix it, but I've restored several file systems from users with raid0/10 file systems onto a single disk and it's worked just fine. Thanks,

I just pushed my current merge of Josef's patches into my master branch. Please base on that.

Josef, this should only be missing the enospc log, please go ahead and rebase/double check.

-chris
Re: Two identical copies of an image mounted result in changes to both images if only one is modified
On Thu, Jun 20, 2013 at 08:22:12AM -0500, Kevin O'Kelley wrote:
> Thank you for your reply. I appreciate it. Unfortunately this issue is a deal killer for us. The ability to take very fast snapshots and replicate them to another site is key for us. We just can't use Btrfs with this setup. That's too bad. Good luck and thank you.

If you want to make fast atomic incremental copies of btrfs to a remote system, then btrfs send/receive may be what you're looking for.

Hugo.

> Sent from my iPhone
>
> On Jun 20, 2013, at 5:56 AM, Hugo Mills h...@carfax.org.uk wrote:
>> On Thu, Jun 20, 2013 at 10:41:53AM +, Gabriel de Perthuis wrote:
>>>>> Instead of redirecting to a different block device, Btrfs could and should refuse to mount an already-mounted superblock when the block device doesn't match, somewhere in or below btrfs_mount. Registering extra, distinct superblocks for an already mounted raid is a different matter, but that isn't done through the mount syscall anyway.
>>>>
>>>> The problem here is that you could quite legitimately mount /dev/sda (with UUID=AA1234) on, say, /mnt/fs-a, and /dev/sdb (with UUID=AA1234) on /mnt/fs-b -- _provided_ that /dev/sda and /dev/sdb are both part of the same filesystem. So you can't simply prevent mounting based on the device that the mount's being done with.
>>>
>>> Okay. The check should rely on a list of known block devices for a given filesystem uuid.
>>
>> And this is where we fail currently -- that list is held by the btrfs module in the kernel, and is constructed on the basis of what btrfs dev scan finds by looking at superblocks on block devices. Currently, there's no method implemented for determining whether a block device with a legitimate btrfs superblock on it is a duplicate of another device, or whether it's a newly-discovered device which is part of an as-yet incompletely specified multi-device FS.
>>
>> I think it should be possible to look up the device ID as well, and complain (loudly, to the user, and in the kernel) at btrfs dev scan time if we see duplicates. That would deal with the problem at the earliest point of confusion.
>>
>> Hugo.
Re: Two identical copies of an image mounted result in changes to both images if only one is modified
> Thank you for your reply. I appreciate it. Unfortunately this issue is a deal killer for us. The ability to take very fast snapshots and replicate them to another site is key for us. We just can't use Btrfs with this setup. That's too bad. Good luck and thank you.

The issue we were discussing is: how to fail early when there are duplicate UUIDs. Duplicate UUIDs will never be supported.

If *your* problem has to do with fast snapshots and fast replication, that's supported, see btrfs send/receive.
Re: Two identical copies of an image mounted result in changes to both images if only one is modified
Thank you for your reply. I appreciate it. Unfortunately this issue is a deal killer for us. The ability to take very fast snapshots and replicate them to another site is key for us. We just can't use Btrfs with this setup. That's too bad. Good luck and thank you.

Sent from my iPhone

On Jun 20, 2013, at 5:56 AM, Hugo Mills h...@carfax.org.uk wrote:
> On Thu, Jun 20, 2013 at 10:41:53AM +, Gabriel de Perthuis wrote:
>>>> Instead of redirecting to a different block device, Btrfs could and should refuse to mount an already-mounted superblock when the block device doesn't match, somewhere in or below btrfs_mount. Registering extra, distinct superblocks for an already mounted raid is a different matter, but that isn't done through the mount syscall anyway.
>>>
>>> The problem here is that you could quite legitimately mount /dev/sda (with UUID=AA1234) on, say, /mnt/fs-a, and /dev/sdb (with UUID=AA1234) on /mnt/fs-b -- _provided_ that /dev/sda and /dev/sdb are both part of the same filesystem. So you can't simply prevent mounting based on the device that the mount's being done with.
>>
>> Okay. The check should rely on a list of known block devices for a given filesystem uuid.
>
> And this is where we fail currently -- that list is held by the btrfs module in the kernel, and is constructed on the basis of what btrfs dev scan finds by looking at superblocks on block devices. Currently, there's no method implemented for determining whether a block device with a legitimate btrfs superblock on it is a duplicate of another device, or whether it's a newly-discovered device which is part of an as-yet incompletely specified multi-device FS.
>
> I think it should be possible to look up the device ID as well, and complain (loudly, to the user, and in the kernel) at btrfs dev scan time if we see duplicates. That would deal with the problem at the earliest point of confusion.
>
> Hugo.
Re: [PATCH 4/4] Btrfs-progs: exhance btrfs-image to restore image onto multiple disks
On Thu, Jun 20, 2013 at 08:39:19AM -0400, Josef Bacik wrote:
> On Thu, Jun 20, 2013 at 08:24:32AM -0400, Josef Bacik wrote:
>> On Thu, Jun 20, 2013 at 08:05:30PM +0800, Liu Bo wrote:
>>> This adds a 'btrfs-image -m' option, which let us restore an image that is built from a btrfs of multiple disks onto several disks altogether.
>>>
>>> This aims to address the following case,
>>>     $ mkfs.btrfs -m raid0 sda sdb
>>>     $ btrfs-image sda image.file
>>>     $ btrfs-image -r image.file sdc
>>>
>>> so we can only restore metadata onto sdc, and another thing is we can only mount sdc with degraded mode as we don't provide informations of another disk. And, it's built as RAID0 and we have only one disk, so after mount sdc we'll get into readonly mode.
>>
>> Um that shouldn't be happening, the restore will mask out the RAID parts of the chunk tree and it should work just fine. Are you using the most recent version of btrfs-image? If this is happening it's a bug and we need to fix it, but I've restored several file systems from users with raid0/10 file systems onto a single disk and it's worked just fine. Thanks,
>
> Well apparently I've been hallucinating because it definitely doesn't work. I'd rather fix the device tree so it only restores onto one disk, since the raid level shouldn't matter and it does in fact get masked out. So the only thing left would be to fix the device tree so the only device it knows about is the device we're restoring to. Thanks,

Um, I believe that'd work and it's not hard, but I'm afraid that way we wouldn't be able to debug bugs related to raid types?

thanks,
liubo
Re: [PATCH 4/4] Btrfs-progs: exhance btrfs-image to restore image onto multiple disks
On Thu, Jun 20, 2013 at 08:39:19AM -0400, Josef Bacik wrote: On Thu, Jun 20, 2013 at 08:24:32AM -0400, Josef Bacik wrote: On Thu, Jun 20, 2013 at 08:05:30PM +0800, Liu Bo wrote: This adds a 'btrfs-image -m' option, which let us restore an image that is built from a btrfs of multiple disks onto several disks altogether. This aims to address the following case, $ mkfs.btrfs -m raid0 sda sdb $ btrfs-image sda image.file $ btrfs-image -r image.file sdc - so we can only restore metadata onto sdc, and another thing is we can only mount sdc with degraded mode as we don't provide informations of another disk. And, it's built as RAID0 and we have only one disk, so after mount sdc we'll get into readonly mode. Um that shouldn't be happening, the restore will mask out the RAID parts of the chunk tree and it should work just fine. Are you using the most recent version of btrfs-image? If this is happening it's a bug and we need to fix it, but I've restored several file systems from users with raid0/10 file systems onto a single disk and it's worked just fine. Thanks, Well apparently I've been hallucinating because it definitely doesn't work. I'd rather fix the device tree so it only restores onto one disk, since the raid level shouldn't matter and it does in fact get masked out. So the only thing left would be to fix the device tree so the only device it knows about is the device we're restoring to. Thanks, I just check the latest progs code, in commit ef2a8889ef813ba77061f6a92f4954d047a78932 Btrfs-progs: make image restore with the original device offsets, we suffer from an huge pain and take a great amount of efforts to map logical offset to physical offset. But with this patch, we'll build the same whole logical-physical mapping on the disks we're restoring to with what it is on the disks that generate the image file, so we can get rid of those pain causing by mapping issues. 
thanks,
liubo
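As a rough illustration of the logical-to-physical mapping being discussed, here is a minimal sketch of generic RAID0 striping arithmetic. It is not taken from btrfs-image; the names `stripe_target` and `raid0_map` are hypothetical, and the chunk-relative offset, stripe length and stripe count are assumed to be already known.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: which stripe holds a byte, and where inside it. */
struct stripe_target {
	uint32_t stripe_index;  /* index of the device stripe within the chunk */
	uint64_t stripe_offset; /* byte offset inside that device stripe */
};

/*
 * Textbook RAID0 striping: consecutive stripe_len-sized units rotate across
 * num_stripes devices. This is an illustration, not btrfs-image code.
 */
static struct stripe_target raid0_map(uint64_t chunk_offset,
				      uint64_t stripe_len,
				      uint32_t num_stripes)
{
	struct stripe_target t;
	uint64_t unit;

	assert(stripe_len > 0 && num_stripes > 0);
	unit = chunk_offset / stripe_len;	/* which unit overall */
	t.stripe_index = (uint32_t)(unit % num_stripes);
	t.stripe_offset = (unit / num_stripes) * stripe_len
			  + chunk_offset % stripe_len;
	return t;
}
```

With only one target disk, every `stripe_index` other than 0 has nowhere to go, which is why either the mapping has to be rewritten for a single device or the image has to be restored onto as many disks as it was taken from.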
Re: [PATCH] Btrfs: use a percpu to keep track of possibly pinned bytes
@@ -3380,6 +3382,10 @@ static int update_space_info(struct btrfs_fs_info *info, u64 flags,
 	if (!found)
 		return -ENOMEM;

+	ret = percpu_counter_init(&found->total_bytes_pinned, 0);
+	if (ret)
+		return ret;
+

Leaks *found if percpu_counter_init() fails.

-	if (space_info->bytes_pinned + delayed_rsv->size < bytes) {
+	bytes_pinned = percpu_counter_sum(&space_info->total_bytes_pinned);
+	if (bytes_pinned + delayed_rsv->size < bytes) {

This stood out as being different from the rest of the comparisons. Why manually sum the counters instead of letting _compare() optimize it away if it can? _compare(, bytes - delayed_rsv->size)?

- z
[PATCH] Btrfs: stop using try_to_writeback_inodes_sb_nr to flush delalloc
try_to_writeback_inodes_sb_nr returns 1 if writeback is already underway, which is completely fraking useless for us as we need to make sure pages are actually written before we go and check if there are ordered extents. So replace this with an open coding of try_to_writeback_inodes_sb_nr minus the writeback underway check so that we are sure to actually have flushed some dirty pages out and will have ordered extents to use. With this patch xfstests generic/273 now passes. Thanks,

Signed-off-by: Josef Bacik jba...@fusionio.com
---
 fs/btrfs/extent-tree.c |    9 ++++-----
 1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 806801a..16da187 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3941,12 +3941,11 @@ static void btrfs_writeback_inodes_sb_nr(struct btrfs_root *root,
 					 unsigned long nr_pages)
 {
 	struct super_block *sb = root->fs_info->sb;
-	int started;

-	/* If we can not start writeback, just sync all the delalloc file. */
-	started = try_to_writeback_inodes_sb_nr(sb, nr_pages,
-						WB_REASON_FS_FREE_SPACE);
-	if (!started) {
+	if (down_read_trylock(&sb->s_umount)) {
+		writeback_inodes_sb_nr(sb, nr_pages, WB_REASON_FS_FREE_SPACE);
+		up_read(&sb->s_umount);
+	} else {
 		/*
 		 * We needn't worry the filesystem going from r/w to r/o though
 		 * we don't acquire ->s_umount mutex, because the filesystem
--
1.7.7.6
Re: [PATCH 5/5] Btrfs-progs: Validate super block checksum
Ping. Is there any reason why the btrfs progs (except for btrfs-show-super) don't validate the super block's checksum?

thanks

On Mon, Jun 10, 2013 at 8:51 PM, Filipe David Borba Manana fdman...@gmail.com wrote:

After finding a super block in a device also validate its checksum. This validation is done in the kernel but it was missing in btrfs-progs. The function btrfs_check_super_csum() is imported from the file fs/btrfs/disk-io.c in the kernel source tree.

Signed-off-by: Filipe David Borba Manana fdman...@gmail.com
---
 disk-io.c | 76 +
 1 file changed, 62 insertions(+), 14 deletions(-)

diff --git a/disk-io.c b/disk-io.c
index bd9cf4e..edd4d52 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -1085,47 +1085,95 @@ struct btrfs_root *open_ctree_fd(int fp, const char *path, u64 sb_bytenr,
 	return info->fs_root;
 }

+static int btrfs_check_super_csum(char *raw_disk_sb)
+{
+	struct btrfs_super_block *disk_sb =
+		(struct btrfs_super_block *)raw_disk_sb;
+	u16 csum_type = btrfs_super_csum_type(disk_sb);
+	int ret = 0;
+
+	if (csum_type == BTRFS_CSUM_TYPE_CRC32) {
+		u32 crc = ~(u32)0;
+		const int csum_size = sizeof(crc);
+		char result[csum_size];
+
+		/*
+		 * The super_block structure does not span the whole
+		 * BTRFS_SUPER_INFO_SIZE range, we expect that the unused space
+		 * is filled with zeros and is included in the checkum.
+		 */
+		crc = btrfs_csum_data(NULL, raw_disk_sb + BTRFS_CSUM_SIZE,
+				crc, BTRFS_SUPER_INFO_SIZE - BTRFS_CSUM_SIZE);
+		btrfs_csum_final(crc, result);
+
+		if (memcmp(raw_disk_sb, result, csum_size))
+			ret = 1;
+
+		if (ret && btrfs_super_generation(disk_sb) < 10) {
+			fprintf(stderr, "btrfs: super block crcs don't match, "
+				"older mkfs detected\n");
+			ret = 0;
+		}
+	}
+
+	if (csum_type >= ARRAY_SIZE(btrfs_csum_sizes)) {
+		fprintf(stderr, "btrfs: unsupported checksum algorithm %u\n",
+			csum_type);
+		ret = 1;
+	}
+
+	return ret;
+}
+
 int btrfs_read_dev_super(int fd, struct btrfs_super_block *sb, u64 sb_bytenr)
 {
 	u8 fsid[BTRFS_FSID_SIZE];
 	int fsid_is_initialized = 0;
-	struct btrfs_super_block buf;
+	char buf[BTRFS_SUPER_INFO_SIZE];
+	struct btrfs_super_block *tmp_sb;
 	int i;
 	int ret;
 	u64 transid = 0;
 	u64 bytenr;

 	if (sb_bytenr != BTRFS_SUPER_INFO_OFFSET) {
-		ret = pread64(fd, &buf, sizeof(buf), sb_bytenr);
+		ret = pread64(fd, buf, sizeof(buf), sb_bytenr);
 		if (ret < sizeof(buf))
 			return -1;

-		if (btrfs_super_bytenr(&buf) != sb_bytenr ||
-		    buf.magic != cpu_to_le64(BTRFS_MAGIC))
+		tmp_sb = (struct btrfs_super_block *)buf;
+
+		if (btrfs_super_bytenr(tmp_sb) != sb_bytenr ||
+		    tmp_sb->magic != cpu_to_le64(BTRFS_MAGIC) ||
+		    btrfs_check_super_csum(buf))
 			return -1;

-		memcpy(sb, &buf, sizeof(*sb));
+		memcpy(sb, buf, sizeof(*sb));
 		return 0;
 	}

 	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
 		bytenr = btrfs_sb_offset(i);
-		ret = pread64(fd, &buf, sizeof(buf), bytenr);
+		ret = pread64(fd, buf, sizeof(buf), bytenr);
 		if (ret < sizeof(buf))
 			break;

-		if (btrfs_super_bytenr(&buf) != bytenr )
+		tmp_sb = (struct btrfs_super_block *)buf;
+
+		if (btrfs_super_bytenr(tmp_sb) != bytenr )
 			continue;
 		/* if magic is NULL, the device was removed */
-		if (buf.magic == 0 && i == 0)
+		if (tmp_sb->magic == 0 && i == 0)
 			return -1;
-		if (buf.magic != cpu_to_le64(BTRFS_MAGIC))
+		if (tmp_sb->magic != cpu_to_le64(BTRFS_MAGIC))
+			continue;
+		if (btrfs_check_super_csum(buf))
 			continue;

 		if (!fsid_is_initialized) {
-			memcpy(fsid, buf.fsid, sizeof(fsid));
+			memcpy(fsid, tmp_sb->fsid, sizeof(fsid));
 			fsid_is_initialized = 1;
-		} else if (memcmp(fsid, buf.fsid, sizeof(fsid))) {
+		} else if (memcmp(fsid, tmp_sb->fsid, sizeof(fsid))) {
 			/*
 			 * the superblocks (the original one and
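For readers who want to try the idea outside of btrfs-progs, below is a minimal stand-alone sketch of the same check: compute CRC32C over everything after the checksum field and compare it with the first four bytes of the block. The names `SB_SIZE`, `CSUM_FIELD`, `sb_store_csum` and `sb_csum_ok` are hypothetical stand-ins, the sizes 4096 and 32 are assumptions standing in for BTRFS_SUPER_INFO_SIZE and BTRFS_CSUM_SIZE, and the bitwise CRC replaces btrfs_csum_data()/btrfs_csum_final(). The "generation < 10, older mkfs" exemption from the patch is not modelled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SB_SIZE    4096u /* assumed stand-in for BTRFS_SUPER_INFO_SIZE */
#define CSUM_FIELD 32u   /* assumed stand-in for BTRFS_CSUM_SIZE */

/* Bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78). */
static uint32_t crc32c_update(uint32_t crc, const unsigned char *p, size_t len)
{
	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Seed with all ones and invert at the end, as the patch does with ~(u32)0. */
static uint32_t crc32c(const unsigned char *p, size_t len)
{
	return ~crc32c_update(~(uint32_t)0, p, len);
}

/* Encode the checksum of the block body as 4 little-endian bytes. */
static void sb_body_csum(const unsigned char *sb, unsigned char out[4])
{
	uint32_t crc = crc32c(sb + CSUM_FIELD, SB_SIZE - CSUM_FIELD);

	out[0] = (unsigned char)(crc & 0xff);
	out[1] = (unsigned char)((crc >> 8) & 0xff);
	out[2] = (unsigned char)((crc >> 16) & 0xff);
	out[3] = (unsigned char)((crc >> 24) & 0xff);
}

/* Hypothetical helper: write the checksum into the start of the block. */
static void sb_store_csum(unsigned char *sb)
{
	sb_body_csum(sb, sb);
}

/* Hypothetical helper: 1 if the stored checksum matches the body, else 0. */
static int sb_csum_ok(const unsigned char *sb)
{
	unsigned char want[4];

	sb_body_csum(sb, want);
	return memcmp(sb, want, sizeof(want)) == 0;
}
```

Note that the checksum covers the whole block, including the zero padding after the structure, which is why the patch switches `buf` from a struct to a full-size byte array.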
Re: [PATCH] Btrfs: use a percpu to keep track of possibly pinned bytes
On Thu, Jun 20, 2013 at 09:26:15AM -0700, Zach Brown wrote:

@@ -3380,6 +3382,10 @@ static int update_space_info(struct btrfs_fs_info *info, u64 flags,
 	if (!found)
 		return -ENOMEM;

+	ret = percpu_counter_init(&found->total_bytes_pinned, 0);
+	if (ret)
+		return ret;
+

Leaks *found if percpu_counter_init() fails.

Right thanks.

-	if (space_info->bytes_pinned + delayed_rsv->size < bytes) {
+	bytes_pinned = percpu_counter_sum(&space_info->total_bytes_pinned);
+	if (bytes_pinned + delayed_rsv->size < bytes) {

This stood out as being different from the rest of the comparisons. Why manually sum the counters instead of letting _compare() optimize it away if it can? _compare(, bytes - delayed_rsv->size)?

Cause negative numbers bother me?

Josef
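To make the sum-versus-compare question concrete, here is a small single-threaded user-space model of a batched counter. It is only a sketch of the general technique: `approx_counter`, `NR_SLOTS` and `BATCH` are hypothetical names and values, and no locking or real per-CPU storage is modelled. The point is that a compare can often answer from the cheap shared value and only falls back to a full sum when the operands are close.

```c
#include <stdint.h>

#define NR_SLOTS 4  /* stands in for the number of CPUs (assumption) */
#define BATCH    32 /* per-slot delta folded into the shared count at this size */

struct approx_counter {
	int64_t count;           /* shared value, only approximately right */
	int32_t local[NR_SLOTS]; /* per-slot deltas not yet folded in */
};

/* Add to one slot; fold into the shared count once the delta gets large. */
static void counter_add(struct approx_counter *c, int slot, int32_t amount)
{
	int64_t v = (int64_t)c->local[slot] + amount;

	if (v >= BATCH || v <= -BATCH) {
		c->count += v;
		c->local[slot] = 0;
	} else {
		c->local[slot] = (int32_t)v;
	}
}

/* Exact value: walks every slot, which is the expensive path. */
static int64_t counter_sum(const struct approx_counter *c)
{
	int64_t sum = c->count;

	for (int i = 0; i < NR_SLOTS; i++)
		sum += c->local[i];
	return sum;
}

/* Returns -1, 0 or 1; only sums when the rough value is too close to call. */
static int counter_compare(const struct approx_counter *c, int64_t rhs)
{
	int64_t diff = c->count - rhs;
	int64_t slack = (int64_t)BATCH * NR_SLOTS;
	int64_t exact;

	if (diff > slack)
		return 1;
	if (diff < -slack)
		return -1;
	exact = counter_sum(c);
	return (exact > rhs) - (exact < rhs);
}
```

In this model the right-hand side is signed, so passing something like `bytes - size` is well defined even when it goes negative; whether that is comfortable with the unsigned quantities in the patch is exactly the point Josef raises.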
[PATCH v5 6/8] Btrfs: introduce uuid-tree-gen field
In order to be able to detect the case that a filesystem is mounted with an old kernel, add a uuid-tree-gen field like the free space cache is doing it. It is part of the super block and written with each commit. Old kernels do not know this field and don't update it.

Signed-off-by: Stefan Behrens sbehr...@giantdisaster.de
---
 fs/btrfs/ctree.h       | 5 ++++-
 fs/btrfs/transaction.c | 1 +
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 89b2d78..424c38d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -481,9 +481,10 @@ struct btrfs_super_block {
 	char label[BTRFS_LABEL_SIZE];

 	__le64 cache_generation;
+	__le64 uuid_tree_generation;

 	/* future expansion */
-	__le64 reserved[31];
+	__le64 reserved[30];
 	u8 sys_chunk_array[BTRFS_SYSTEM_CHUNK_ARRAY_SIZE];
 	struct btrfs_root_backup super_roots[BTRFS_NUM_BACKUP_ROOTS];
 } __attribute__ ((__packed__));
@@ -2847,6 +2848,8 @@ BTRFS_SETGET_STACK_FUNCS(super_csum_type, struct btrfs_super_block,
 			 csum_type, 16);
 BTRFS_SETGET_STACK_FUNCS(super_cache_generation, struct btrfs_super_block,
 			 cache_generation, 64);
+BTRFS_SETGET_STACK_FUNCS(super_uuid_tree_generation, struct btrfs_super_block,
+			 uuid_tree_generation, 64);

 static inline int btrfs_super_csum_size(struct btrfs_super_block *s)
 {
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 00ae884..1ae9621 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1370,6 +1370,7 @@ static void update_super_roots(struct btrfs_root *root)
 	super->root_level = root_item->level;
 	if (btrfs_test_opt(root, SPACE_CACHE))
 		super->cache_generation = root_item->generation;
+	super->uuid_tree_generation = root_item->generation;
 }

 int btrfs_transaction_in_commit(struct btrfs_fs_info *info)
--
1.8.3
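The reason this change is safe for the on-disk format can be checked at compile time. The sketch below models only the tail of the super block touched by the patch; `sb_tail_before` and `sb_tail_after` are hypothetical names, and plain `uint64_t` stands in for `__le64`.

```c
#include <stddef.h>
#include <stdint.h>

/* Tail of the super block before the patch (only the fields involved). */
struct sb_tail_before {
	uint64_t cache_generation;
	uint64_t reserved[31];
};

/* Same region after the patch: one reserved slot becomes the new field. */
struct sb_tail_after {
	uint64_t cache_generation;
	uint64_t uuid_tree_generation;
	uint64_t reserved[30];
};

/* The overall size must not change, or every later field would move. */
_Static_assert(sizeof(struct sb_tail_before) == sizeof(struct sb_tail_after),
	       "carving a field out of reserved[] must keep the size");

/* The new field lands exactly where reserved[0] used to be. */
_Static_assert(offsetof(struct sb_tail_after, uuid_tree_generation) ==
	       offsetof(struct sb_tail_before, reserved),
	       "new field must reuse the first reserved slot");
```

Because an old kernel treats that slot as reserved and never updates it, the value it leaves behind falls out of step with the filesystem generation, which is the mismatch a newer kernel can later detect.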
[PATCH v5 3/8] Btrfs: create UUID tree if required
This tree is not created by mkfs.btrfs. Therefore when a filesystem is mounted writable and the UUID tree does not exist, this tree is created if required. The tree is also added to the fs_info structure and initialized, but this commit does not yet read or write UUID tree elements.

Signed-off-by: Stefan Behrens sbehr...@giantdisaster.de
---
 fs/btrfs/ctree.h       |  1 +
 fs/btrfs/disk-io.c     | 34 ++++++++++++++++++++++++++++++++++
 fs/btrfs/extent-tree.c |  3 +++
 fs/btrfs/volumes.c     | 26 ++++++++++++++++++++++++++
 fs/btrfs/volumes.h     |  1 +
 5 files changed, 65 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 04447b6..1dac165 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1305,6 +1305,7 @@ struct btrfs_fs_info {
 	struct btrfs_root *fs_root;
 	struct btrfs_root *csum_root;
 	struct btrfs_root *quota_root;
+	struct btrfs_root *uuid_root;

 	/* the log root tree is a directory of all the other log roots */
 	struct btrfs_root *log_root_tree;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3c2886c..1db446a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1580,6 +1580,9 @@ struct btrfs_root *btrfs_read_fs_root_no_name(struct btrfs_fs_info *fs_info,
 	if (location->objectid == BTRFS_QUOTA_TREE_OBJECTID)
 		return fs_info->quota_root ? fs_info->quota_root :
 					     ERR_PTR(-ENOENT);
+	if (location->objectid == BTRFS_UUID_TREE_OBJECTID)
+		return fs_info->uuid_root ? fs_info->uuid_root :
+					    ERR_PTR(-ENOENT);
 again:
 	root = btrfs_lookup_fs_root(fs_info, location->objectid);
 	if (root)
@@ -2037,6 +2040,12 @@ static void free_root_pointers(struct btrfs_fs_info *info, int chunk_root)
 		info->quota_root->node = NULL;
 		info->quota_root->commit_root = NULL;
 	}
+	if (info->uuid_root) {
+		free_extent_buffer(info->uuid_root->node);
+		free_extent_buffer(info->uuid_root->commit_root);
+		info->uuid_root->node = NULL;
+		info->uuid_root->commit_root = NULL;
+	}
 	if (chunk_root) {
 		free_extent_buffer(info->chunk_root->node);
 		free_extent_buffer(info->chunk_root->commit_root);
@@ -2097,11 +2106,13 @@ int open_ctree(struct super_block *sb,
 	struct btrfs_root *chunk_root;
 	struct btrfs_root *dev_root;
 	struct btrfs_root *quota_root;
+	struct btrfs_root *uuid_root;
 	struct btrfs_root *log_tree_root;
 	int ret;
 	int err = -EINVAL;
 	int num_backups_tried = 0;
 	int backup_index = 0;
+	bool create_uuid_tree = false;

 	tree_root = fs_info->tree_root = btrfs_alloc_root(fs_info);
 	chunk_root = fs_info->chunk_root = btrfs_alloc_root(fs_info);
@@ -2695,6 +2706,18 @@ retry_root_backup:
 		fs_info->quota_root = quota_root;
 	}

+	location.objectid = BTRFS_UUID_TREE_OBJECTID;
+	uuid_root = btrfs_read_tree_root(tree_root, &location);
+	if (IS_ERR(uuid_root)) {
+		ret = PTR_ERR(uuid_root);
+		if (ret != -ENOENT)
+			goto recovery_tree_root;
+		create_uuid_tree = true;
+	} else {
+		uuid_root->track_dirty = 1;
+		fs_info->uuid_root = uuid_root;
+	}
+
 	fs_info->generation = generation;
 	fs_info->last_trans_committed = generation;

@@ -2881,6 +2904,17 @@ retry_root_backup:

 	btrfs_qgroup_rescan_resume(fs_info);

+	if (create_uuid_tree) {
+		pr_info("btrfs: creating UUID tree\n");
+		ret = btrfs_create_uuid_tree(fs_info);
+		if (ret) {
+			pr_warn("btrfs: failed to create the UUID tree %d\n",
+				ret);
+			close_ctree(tree_root);
+			return ret;
+		}
+	}
+
 	return 0;

 fail_qgroup:
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 6d5c5f7..1c4694a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4308,6 +4308,9 @@ static struct btrfs_block_rsv *get_block_rsv(
 	if (root == root->fs_info->csum_root && trans->adding_csums)
 		block_rsv = trans->block_rsv;

+	if (root == root->fs_info->uuid_root)
+		block_rsv = trans->block_rsv;
+
 	if (!block_rsv)
 		block_rsv = root->block_rsv;

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c58bf19..d4c7955 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3411,6 +3411,32 @@ int btrfs_cancel_balance(struct btrfs_fs_info *fs_info)
 	return 0;
 }

+int btrfs_create_uuid_tree(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_trans_handle *trans;
+	struct btrfs_root *tree_root = fs_info->tree_root;
+	struct btrfs_root *uuid_root;
+
+	/*
+	 * 1 - root
[PATCH v5 1/8] Btrfs: introduce a tree for items that map UUIDs to something
Mapping UUIDs to subvolume IDs is an operation with a high effort today. Today, the algorithm even has quadratic effort (based on the number of existing subvolumes), which means that it takes minutes to send/receive a single subvolume if 10,000 subvolumes exist. But even linear effort would be too much since it is a waste. And these data structures to allow mapping UUIDs to subvolume IDs are created every time a btrfs send/receive instance is started.

It is much more efficient to maintain a searchable persistent data structure in the filesystem, one that is updated whenever a subvolume/snapshot is created and deleted, and when the received subvolume UUID is set by the btrfs-receive tool.

Therefore kernel code is added with this commit that is able to maintain data structures in the filesystem that allow to quickly search for a given UUID and to retrieve data that is assigned to this UUID, like which subvolume ID is related to this UUID.

This commit adds a new tree to hold UUID-to-data mapping items. The key of the items is the full UUID plus the key type BTRFS_UUID_KEY. Multiple data blocks can be stored for a given UUID, a type/length/value scheme is used.

Now follows the lengthy justification, why a new tree was added instead of using the existing root tree:

The first approach was to not create another tree that holds UUID items. Instead, the items should just go into the top root tree. Unfortunately this confused the algorithm to assign the objectid of subvolumes and snapshots. The reason is that btrfs_find_free_objectid() calls btrfs_find_highest_objectid() for the first created subvol or snapshot after mounting a filesystem, and this function simply searches for the largest used objectid in the root tree keys to pick the next objectid to assign. Of course, the UUID keys have always been the ones with the highest offset value, and the next assigned subvol ID was wastefully huge.

To use any other existing tree did not look proper.

To apply a workaround such as setting the objectid to zero in the UUID item key and to implement collision handling would either add limitations (in case of a btrfs_extend_item() approach to handle the collisions) or a lot of complexity and source code (in case a key would be looked up that is free of collisions). Adding new code that introduces limitations is not good, and adding code that is complex and lengthy for no good reason is also not good. That's the justification why a completely new tree was introduced.

Signed-off-by: Stefan Behrens sbehr...@giantdisaster.de
---
 fs/btrfs/Makefile    |   3 +-
 fs/btrfs/ctree.h     |  50 ++
 fs/btrfs/uuid-tree.c | 480 +++
 3 files changed, 532 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 3932224..a550dfc 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -8,7 +8,8 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
 	   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
 	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
-	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o
+	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
+	   uuid-tree.o

 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 76e4983..04447b6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -91,6 +91,9 @@ struct btrfs_ordered_sum;
 /* holds quota configuration and tracking */
 #define BTRFS_QUOTA_TREE_OBJECTID 8ULL

+/* for storing items that use the BTRFS_UUID_KEY */
+#define BTRFS_UUID_TREE_OBJECTID 9ULL
+
 /* for storing balance parameters in the root tree */
 #define BTRFS_BALANCE_OBJECTID -4ULL

@@ -953,6 +956,18 @@ struct btrfs_dev_replace_item {
 	__le64 num_uncorrectable_read_errors;
 } __attribute__ ((__packed__));

+/* for items that use the BTRFS_UUID_KEY */
+#define BTRFS_UUID_ITEM_TYPE_SUBVOL	0 /* for UUIDs assigned to subvols */
+#define BTRFS_UUID_ITEM_TYPE_RECEIVED_SUBVOL	1 /* for UUIDs assigned to
+						   * received subvols */
+
+/* a sequence of such items is stored under the BTRFS_UUID_KEY */
+struct btrfs_uuid_item {
+	__le16 type;	/* refer to BTRFS_UUID_ITEM_TYPE* defines above */
+	__le32 len;	/* number of following 64bit values */
+	__le64 subid[0];	/* sequence of subids */
+} __attribute__ ((__packed__));
+
 /* different types of block groups (and chunks) */
 #define BTRFS_BLOCK_GROUP_DATA		(1ULL << 0)
 #define BTRFS_BLOCK_GROUP_SYSTEM	(1ULL << 1)
@@ -1922,6 +1937,17 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_DEV_REPLACE_KEY	250

 /*
+ * Stores items that allow to quickly map UUIDs to something else.
+ * These
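To illustrate the type/length/value scheme, here is a minimal user-space sketch that walks a byte buffer laid out like a sequence of packed `btrfs_uuid_item` records (2-byte type, 4-byte count, then that many 64-bit subids, all little-endian) as declared in the patch text. The function `uuid_items_find` and its return convention are hypothetical, and nothing here claims to match how uuid-tree.c actually iterates the items.

```c
#include <stddef.h>
#include <stdint.h>

#define UUID_ITEM_HDR 6u /* packed header: 2-byte type + 4-byte len */

static uint16_t get_le16(const unsigned char *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t get_le32(const unsigned char *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint64_t get_le64(const unsigned char *p)
{
	return (uint64_t)get_le32(p) | ((uint64_t)get_le32(p + 4) << 32);
}

/*
 * Hypothetical lookup: find the first sub-item of the given type.
 * Returns the number of subids it holds (copying at most cap of them
 * into out), 0 if no such sub-item exists, or -1 if the buffer is
 * malformed (a length that runs past the end, or trailing garbage).
 */
static long uuid_items_find(const unsigned char *buf, size_t size,
			    uint16_t type, uint64_t *out, size_t cap)
{
	size_t off = 0;

	while (size - off >= UUID_ITEM_HDR) {
		uint16_t t = get_le16(buf + off);
		uint32_t n = get_le32(buf + off + 2);

		off += UUID_ITEM_HDR;
		if (n > (size - off) / 8)
			return -1; /* claimed payload does not fit */
		if (t == type) {
			for (uint32_t i = 0; i < n && i < cap; i++)
				out[i] = get_le64(buf + off + 8 * (size_t)i);
			return (long)n;
		}
		off += 8 * (size_t)n; /* skip this sub-item's payload */
	}
	return off == size ? 0 : -1;
}
```

The bounds check before touching the payload matters here: a corrupted length must stop the walk rather than send it past the end of the item.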
[PATCH v5 4/8] Btrfs: maintain subvolume items in the UUID tree
When a new subvolume or snapshot is created, a new UUID item is added to the UUID tree. Such items are removed when the subvolume is deleted. The ioctl to set the received subvolume UUID is also touched and will now also add this received UUID into the UUID tree together with the ID of the subvolume. The latter is also done when read-only snapshots are created which inherit all the send/receive information from the parent subvolume.

User mode programs use the BTRFS_IOC_TREE_SEARCH ioctl to search and read in the UUID tree.

Signed-off-by: Stefan Behrens sbehr...@giantdisaster.de
---
 fs/btrfs/ctree.h       |  1 +
 fs/btrfs/ioctl.c       | 74 +++---
 fs/btrfs/transaction.c | 19 -
 3 files changed, 83 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1dac165..f2751e7 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3678,6 +3678,7 @@ extern const struct dentry_operations btrfs_dentry_operations;
 long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
 void btrfs_update_iflags(struct inode *inode);
 void btrfs_inherit_iflags(struct inode *inode, struct inode *dir);
+int btrfs_is_empty_uuid(u8 *uuid);
 int btrfs_defrag_file(struct inode *inode, struct file *file,
 		      struct btrfs_ioctl_defrag_range_args *range,
 		      u64 newer_than, unsigned long max_pages);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0e17a30..4e0c292 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -363,6 +363,13 @@ static noinline int btrfs_ioctl_fitrim(struct file *file, void __user *arg)
 	return 0;
 }

+int btrfs_is_empty_uuid(u8 *uuid)
+{
+	static char empty_uuid[BTRFS_UUID_SIZE] = {0};
+
+	return !memcmp(uuid, empty_uuid, BTRFS_UUID_SIZE);
+}
+
 static noinline int create_subvol(struct inode *dir,
 				  struct dentry *dentry,
 				  char *name, int namelen,
@@ -396,7 +403,7 @@ static noinline int create_subvol(struct inode *dir,
 	 * of create_snapshot().
 	 */
 	ret = btrfs_subvolume_reserve_metadata(root, &block_rsv,
-					       7, &qgroup_reserved);
+					       8, &qgroup_reserved);
 	if (ret)
 		return ret;

@@ -518,9 +525,13 @@ static noinline int create_subvol(struct inode *dir,
 	ret = btrfs_add_root_ref(trans, root->fs_info->tree_root,
 				 objectid, root->root_key.objectid,
 				 btrfs_ino(dir), index, name, namelen);
-
 	BUG_ON(ret);

+	ret = btrfs_insert_uuid_subvol_item(trans, root->fs_info->uuid_root,
+					    root_item.uuid, objectid);
+	if (ret)
+		btrfs_abort_transaction(trans, root, ret);
+
 fail:
 	trans->block_rsv = NULL;
 	trans->bytes_reserved = 0;
@@ -573,9 +584,10 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
 	 * 1 - root item
 	 * 2 - root ref/backref
 	 * 1 - root of snapshot
+	 * 1 - UUID item
 	 */
 	ret = btrfs_subvolume_reserve_metadata(BTRFS_I(dir)->root,
-					&pending_snapshot->block_rsv, 7,
+					&pending_snapshot->block_rsv, 8,
 					&pending_snapshot->qgroup_reserved);
 	if (ret)
 		goto out;
@@ -2213,6 +2225,27 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file,
 			goto out_end_trans;
 		}
 	}
+
+	ret = btrfs_del_uuid_subvol_item(trans, root->fs_info->uuid_root,
+					 dest->root_item.uuid,
+					 dest->root_key.objectid);
+	if (ret && ret != -ENOENT) {
+		btrfs_abort_transaction(trans, root, ret);
+		err = ret;
+		goto out_end_trans;
+	}
+	if (!btrfs_is_empty_uuid(dest->root_item.received_uuid)) {
+		ret = btrfs_del_uuid_received_subvol_item(trans,
+				root->fs_info->uuid_root,
+				dest->root_item.received_uuid,
+				dest->root_key.objectid);
+		if (ret && ret != -ENOENT) {
+			btrfs_abort_transaction(trans, root, ret);
+			err = ret;
+			goto out_end_trans;
+		}
+	}
+
 out_end_trans:
 	trans->block_rsv = NULL;
 	trans->bytes_reserved = 0;
@@ -2424,7 +2457,6 @@ static long btrfs_ioctl_dev_info(struct btrfs_root *root, void __user *arg)
 	struct btrfs_fs_devices *fs_devices = root->fs_info->fs_devices;
 	int ret = 0;
 	char *s_uuid = NULL;
-	char empty_uuid[BTRFS_UUID_SIZE] = {0};

 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
@@ -2433,7
[PATCH v5 7/8] Btrfs: check UUID tree during mount if required
If the filesystem was mounted with an old kernel that was not aware of the UUID tree, this is detected by looking at the uuid_tree_generation field of the superblock (similar to how the free space cache is doing it). If a mismatch is detected at mount time, a thread is started that does two things:

1. Iterate through the UUID tree, check each entry, delete those entries that are not valid anymore (i.e., the subvol does not exist anymore or the value changed).
2. Iterate through the root tree, for each found subvolume, add the UUID tree entries for the subvolume (if they are not already there).

This mechanism is also used to handle and repair errors that happened during the initial creation and filling of the tree. The update of the uuid_tree_generation field (which indicates that the state of the UUID tree is up to date) is blocked until all create and repair operations are successfully completed.

Signed-off-by: Stefan Behrens sbehr...@giantdisaster.de
---
 fs/btrfs/ctree.h       |   4 ++
 fs/btrfs/disk-io.c     |  17 +-
 fs/btrfs/transaction.c |   3 +-
 fs/btrfs/uuid-tree.c   | 156 +
 fs/btrfs/volumes.c     |  82 ++
 fs/btrfs/volumes.h     |   1 +
 6 files changed, 261 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 424c38d..817894d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1648,6 +1648,7 @@ struct btrfs_fs_info {
 	atomic_t mutually_exclusive_operation_running;

 	struct completion uuid_tree_rescan_completion;
+	unsigned int update_uuid_tree_gen:1;
 };

 /*
@@ -3453,6 +3454,9 @@ void btrfs_update_root_times(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root);

 /* uuid-tree.c */
+int btrfs_uuid_tree_iterate(struct btrfs_fs_info *fs_info,
+			    int (*check_func)(struct btrfs_fs_info *, u8 *, u16,
+					      u64));
 int btrfs_lookup_uuid_subvol_item(struct btrfs_root *uuid_root, u8 *uuid,
 				  u64 *subvol_id);
 int btrfs_insert_uuid_subvol_item(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a52504b..7508b3a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2112,7 +2112,8 @@ int open_ctree(struct super_block *sb,
 	int err = -EINVAL;
 	int num_backups_tried = 0;
 	int backup_index = 0;
-	bool create_uuid_tree = false;
+	bool create_uuid_tree;
+	bool check_uuid_tree;

 	tree_root = fs_info->tree_root = btrfs_alloc_root(fs_info);
 	chunk_root = fs_info->chunk_root = btrfs_alloc_root(fs_info);
@@ -2714,9 +2715,13 @@ retry_root_backup:
 		if (ret != -ENOENT)
 			goto recovery_tree_root;
 		create_uuid_tree = true;
+		check_uuid_tree = false;
 	} else {
 		uuid_root->track_dirty = 1;
 		fs_info->uuid_root = uuid_root;
+		create_uuid_tree = false;
+		check_uuid_tree =
+		    generation != btrfs_super_uuid_tree_generation(disk_super);
 	}

 	fs_info->generation = generation;
@@ -2914,7 +2919,17 @@ retry_root_backup:
 			close_ctree(tree_root);
 			return ret;
 		}
+	} else if (check_uuid_tree) {
+		pr_info("btrfs: checking UUID tree\n");
+		ret = btrfs_check_uuid_tree(fs_info);
+		if (ret) {
+			pr_warn("btrfs: failed to check the UUID tree %d\n",
+				ret);
+			close_ctree(tree_root);
+			return ret;
+		}
 	} else {
+		fs_info->update_uuid_tree_gen = 1;
 		complete_all(&fs_info->uuid_tree_rescan_completion);
 	}

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 1ae9621..cf07548 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1370,7 +1370,8 @@ static void update_super_roots(struct btrfs_root *root)
 	super->root_level = root_item->level;
 	if (btrfs_test_opt(root, SPACE_CACHE))
 		super->cache_generation = root_item->generation;
-	super->uuid_tree_generation = root_item->generation;
+	if (root->fs_info->update_uuid_tree_gen)
+		super->uuid_tree_generation = root_item->generation;
 }

 int btrfs_transaction_in_commit(struct btrfs_fs_info *info)
diff --git a/fs/btrfs/uuid-tree.c b/fs/btrfs/uuid-tree.c
index 3939a54..59697d1 100644
--- a/fs/btrfs/uuid-tree.c
+++ b/fs/btrfs/uuid-tree.c
@@ -415,6 +415,162 @@ out:
 	return ret;
 }

+static int btrfs_uuid_iter_rem(struct btrfs_root *uuid_root, u8 *uuid,
+			       u16 sub_item_type, u64 subid)
+{
+	struct btrfs_trans_handle *trans;
+	int ret;
+
+	/* 1 - for the uuid item */
+	trans =
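The mount-time decision described in this patch boils down to three outcomes. The following sketch restates it as a pure function so the cases are easy to see; the enum and function names are hypothetical and the real logic lives inline in open_ctree() as shown in the diff.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the three outcomes visible in the diff. */
enum uuid_tree_action {
	UUID_TREE_CREATE,     /* no UUID tree found: create and fill it */
	UUID_TREE_CHECK,      /* tree exists but may be stale: check it */
	UUID_TREE_UP_TO_DATE  /* generations match: nothing to do */
};

/*
 * tree_found:           false when reading the UUID root returned -ENOENT
 * fs_generation:        generation of the filesystem being mounted
 * uuid_tree_generation: value stored in the super block field
 */
static enum uuid_tree_action
uuid_tree_mount_action(bool tree_found, uint64_t fs_generation,
		       uint64_t uuid_tree_generation)
{
	if (!tree_found)
		return UUID_TREE_CREATE;
	if (fs_generation != uuid_tree_generation)
		return UUID_TREE_CHECK;
	return UUID_TREE_UP_TO_DATE;
}
```

Only the last case allows the super block field to be updated right away; in the other two the update stays blocked until the create or check work has finished, as the commit message explains.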
[PATCH v5 0/8] Btrfs: introduce a tree for UUID to subvol ID mapping
Mapping UUIDs to subvolume IDs is an operation with a high effort today. Today, the algorithm even has quadratic effort (based on the number of existing subvolumes), which means that it takes minutes to send/receive a single subvolume if 10,000 subvolumes exist. But even linear effort would be too much since it is a waste. And these data structures to allow mapping UUIDs to subvolume IDs are created every time a btrfs send/receive instance is started.

So the issue to address is that Btrfs send / receive does not work as it is today when a high number of subvolumes exist.

The table below shows the time it takes on my testbox to send _one_ empty subvolume depending on the number of subvolumes that exist in the filesystem.

# of subvols  | without    | with
in filesystem | UUID tree  | UUID tree
--------------+------------+----------
            2 |  0m00.004s | 0m00.003s
         1000 |  0m07.010s | 0m00.004s
         2000 |  0m28.210s | 0m00.004s
         3000 |  1m04.872s | 0m00.004s
         4000 |  1m56.059s | 0m00.004s
         5000 |  3m00.489s | 0m00.004s
         6000 |  4m27.376s | 0m00.004s
         7000 |  6m08.938s | 0m00.004s
         8000 |  7m54.020s | 0m00.004s
         9000 | 10m05.108s | 0m00.004s
        10000 | 12m47.406s | 0m00.004s
        11000 | 15m05.800s | 0m00.004s
        12000 | 18m00.170s | 0m00.004s
        13000 | 21m39.438s | 0m00.004s
        14000 | 24m54.681s | 0m00.004s
        15000 | 28m09.096s | 0m00.004s
        16000 | 33m08.856s | 0m00.004s
        17000 | 37m10.562s | 0m00.004s
        18000 | 41m44.727s | 0m00.004s
        19000 | 46m14.335s | 0m00.004s
        20000 | 51m55.100s | 0m00.004s
        21000 | 56m54.346s | 0m00.004s
        22000 | 62m53.466s | 0m00.004s
        23000 | 66m57.328s | 0m00.004s
        24000 | 73m59.687s | 0m00.004s
        25000 | 81m24.476s | 0m00.004s
        26000 | 87m11.478s | 0m00.004s
        27000 | 92m59.225s | 0m00.004s

Or as a chart: http://btrfs.giantdisaster.de/Btrfs-send-recv-perf.pdf

It is much more efficient to maintain a searchable persistent data structure in the filesystem, one that is updated whenever a subvolume/snapshot is created and deleted, and when the received subvolume UUID is set by the btrfs-receive tool.

Therefore kernel code is added that is able to maintain data structures in the filesystem that allow to quickly search for a given UUID and to retrieve the subvol ID.

Now follows the lengthy justification, why a new tree was added instead of using the existing root tree:

The first approach was to not create another tree that holds UUID items. Instead, the items should just go into the top root tree. Unfortunately this confused the algorithm to assign the objectid of subvolumes and snapshots. The reason is that btrfs_find_free_objectid() calls btrfs_find_highest_objectid() for the first created subvol or snapshot after mounting a filesystem, and this function simply searches for the largest used objectid in the root tree keys to pick the next objectid to assign. Of course, the UUID keys have always been the ones with the highest offset value, and the next assigned subvol ID was wastefully huge.

To use any other existing tree did not look proper.

To apply a workaround such as setting the objectid to zero in the UUID item key and to implement collision handling would either add limitations (in case of a btrfs_extend_item() approach to handle the collisions) or a lot of complexity and source code (in case a key would be looked up that is free of collisions). Adding new code that introduces limitations is not good, and adding code that is complex and lengthy for no good reason is also not good. That's the justification why a completely new tree was introduced.

v1 -> v2:
- All review comments from David Sterba, Josef Bacik and Jan Schmidt are addressed. The biggest change was to add a mechanism that handles the case that the filesystem is mounted with an older kernel. Now that case is detected when the filesystem is mounted with a newer kernel again, and the UUID tree is updated in the background.

v2 -> v3:
- All review comments from Liu Bo are addressed:
  - shrunk the size of the uuid_item.
  - fixed the issue that the uuid-tree was not using the transaction block reserve.

v3 -> v4:
- Fixed a bug. A corrupted UUID tree entry could have caused an endless loop in the check+rescan thread.

v4 -> v5:
- On request from several people, the way a umount waits for the completion of the uuid tree rescan thread was changed. Now a struct completion is used instead of a struct semaphore.

Stefan Behrens (8):
  Btrfs: introduce a tree for items that map UUIDs to something
  Btrfs: support printing UUID tree elements
  Btrfs: create UUID tree if required
  Btrfs: maintain subvolume items in the UUID tree
  Btrfs: fill UUID tree initially
  Btrfs: introduce uuid-tree-gen field
  Btrfs: check UUID tree during mount if required
  Btrfs: add mount option to force UUID tree checking
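As a quick sanity check of the "quadratic effort" claim against the measurements above, one can scale a single reference measurement by the square of the subvolume ratio. The helper below is only an illustration; `predict_quadratic` is a hypothetical name and the model t(n) = t_ref * (n / n_ref)^2 is an assumption, not something taken from the patch set.

```c
/*
 * Predict the send time for n subvolumes from one reference point,
 * assuming purely quadratic growth: t(n) = t_ref * (n / n_ref)^2.
 */
static double predict_quadratic(double t_ref, double n_ref, double n)
{
	double scale = n / n_ref;

	return t_ref * scale * scale;
}
```

Starting from the 1000-subvolume row (7.010 s), this predicts about 28.0 s for 2000 subvolumes and about 112.2 s for 4000, against measured values of 28.210 s and 116.059 s (1m56.059s), so the numbers in the table are at least roughly consistent with quadratic growth.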
[PATCH v5 5/8] Btrfs: fill UUID tree initially
When the UUID tree is initially created, a task is spawned that walks through the root tree. For each found subvolume root_item, the uuid and received_uuid entries in the UUID tree are added. This is such a quick operation so that in case somebody wants to unmount the filesystem while the task is still running, the unmount is delayed until the UUID tree building task is finished.

Signed-off-by: Stefan Behrens sbehr...@giantdisaster.de
---
 fs/btrfs/ctree.h   |   2 +
 fs/btrfs/disk-io.c |   6 +++
 fs/btrfs/volumes.c | 148 -
 3 files changed, 155 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f2751e7..89b2d78 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1645,6 +1645,8 @@ struct btrfs_fs_info {
 	struct btrfs_dev_replace dev_replace;

 	atomic_t mutually_exclusive_operation_running;
+
+	struct completion uuid_tree_rescan_completion;
 };

 /*
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1db446a..a52504b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2288,6 +2288,7 @@ int open_ctree(struct super_block *sb,
 	init_rwsem(&fs_info->extent_commit_sem);
 	init_rwsem(&fs_info->cleanup_work_sem);
 	init_rwsem(&fs_info->subvol_sem);
+	init_completion(&fs_info->uuid_tree_rescan_completion);
 	fs_info->dev_replace.lock_owner = 0;
 	atomic_set(&fs_info->dev_replace.nesting_level, 0);
 	mutex_init(&fs_info->dev_replace.lock_finishing_cancel_unmount);
@@ -2913,6 +2914,8 @@ retry_root_backup:
 			close_ctree(tree_root);
 			return ret;
 		}
+	} else {
+		complete_all(&fs_info->uuid_tree_rescan_completion);
 	}

 	return 0;
@@ -3543,6 +3546,9 @@ int close_ctree(struct btrfs_root *root)
 	fs_info->closing = 1;
 	smp_mb();

+	/* wait for the uuid_scan task to finish */
+	wait_for_completion(&fs_info->uuid_tree_rescan_completion);
+
 	/* pause restriper - we want to resume on mount */
 	btrfs_pause_balance(fs_info);

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d4c7955..e2e2bbc 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3411,11 +3411,145 @@ int btrfs_cancel_balance(struct btrfs_fs_info *fs_info)
 	return 0;
 }

+static int btrfs_uuid_scan_kthread(void *data)
+{
+	struct btrfs_fs_info *fs_info = data;
+	struct btrfs_root *root = fs_info->tree_root;
+	struct btrfs_key key;
+	struct btrfs_key max_key;
+	struct btrfs_path *path = NULL;
+	int ret = 0;
+	struct extent_buffer *eb;
+	int slot;
+	struct btrfs_root_item root_item;
+	u32 item_size;
+	struct btrfs_trans_handle *trans;
+
+	path = btrfs_alloc_path();
+	if (!path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	key.objectid = 0;
+	key.type = BTRFS_ROOT_ITEM_KEY;
+	key.offset = 0;
+
+	max_key.objectid = (u64)-1;
+	max_key.type = BTRFS_ROOT_ITEM_KEY;
+	max_key.offset = (u64)-1;
+
+	path->keep_locks = 1;
+
+	while (1) {
+		ret = btrfs_search_forward(root, &key, &max_key, path, 0);
+		if (ret) {
+			if (ret > 0)
+				ret = 0;
+			break;
+		}
+
+		if (key.type != BTRFS_ROOT_ITEM_KEY ||
+		    (key.objectid < BTRFS_FIRST_FREE_OBJECTID &&
+		     key.objectid != BTRFS_FS_TREE_OBJECTID) ||
+		    key.objectid > BTRFS_LAST_FREE_OBJECTID)
+			goto skip;
+
+		eb = path->nodes[0];
+		slot = path->slots[0];
+		item_size = btrfs_item_size_nr(eb, slot);
+		if (item_size < sizeof(root_item))
+			goto skip;
+
+		trans = NULL;
+		read_extent_buffer(eb, &root_item,
+				   btrfs_item_ptr_offset(eb, slot),
+				   (int)sizeof(root_item));
+		if (btrfs_root_refs(&root_item) == 0)
+			goto skip;
+		if (!btrfs_is_empty_uuid(root_item.uuid)) {
+			/*
+			 * 1 - subvol uuid item
+			 * 1 - received_subvol uuid item
+			 */
+			trans = btrfs_start_transaction(fs_info->uuid_root, 2);
+			if (IS_ERR(trans)) {
+				ret = PTR_ERR(trans);
+				break;
+			}
+			ret = btrfs_insert_uuid_subvol_item(trans,
+							    fs_info->uuid_root,
+							    root_item.uuid,
+							    key.objectid);
+			if (ret < 0)
[PATCH v5 8/8] Btrfs: add mount option to force UUID tree checking
This should never be needed, but since all functions are there to check
and rebuild the UUID tree, a mount option is added that allows to force
this check and rebuild procedure.

Signed-off-by: Stefan Behrens sbehr...@giantdisaster.de
---
 fs/btrfs/ctree.h   | 1 +
 fs/btrfs/disk-io.c | 3 ++-
 fs/btrfs/super.c   | 8 +++-
 3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 817894d..ea1adf6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1986,6 +1986,7 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_MOUNT_CHECK_INTEGRITY	(1 << 20)
 #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
 #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR	(1 << 22)
+#define BTRFS_MOUNT_RESCAN_UUID_TREE	(1 << 23)
 
 #define btrfs_clear_opt(o, opt)		((o) &= ~BTRFS_MOUNT_##opt)
 #define btrfs_set_opt(o, opt)		((o) |= BTRFS_MOUNT_##opt)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7508b3a..e76554b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2919,7 +2919,8 @@ retry_root_backup:
 			close_ctree(tree_root);
 			return ret;
 		}
-	} else if (check_uuid_tree) {
+	} else if (check_uuid_tree ||
+		   btrfs_test_opt(tree_root, RESCAN_UUID_TREE)) {
 		pr_info("btrfs: checking UUID tree\n");
 		ret = btrfs_check_uuid_tree(fs_info);
 		if (ret) {
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 8eb6191..191f281 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -320,7 +320,7 @@ enum {
 	Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache,
 	Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
 	Opt_check_integrity, Opt_check_integrity_including_extent_data,
-	Opt_check_integrity_print_mask, Opt_fatal_errors,
+	Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_rescan_uuid_tree,
 	Opt_err,
 };
 
@@ -360,6 +360,7 @@ static match_table_t tokens = {
 	{Opt_check_integrity, "check_int"},
 	{Opt_check_integrity_including_extent_data, "check_int_data"},
 	{Opt_check_integrity_print_mask, "check_int_print_mask=%d"},
+	{Opt_rescan_uuid_tree, "rescan_uuid_tree"},
 	{Opt_fatal_errors, "fatal_errors=%s"},
 	{Opt_err, NULL},
 };
@@ -554,6 +555,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 		case Opt_space_cache:
 			btrfs_set_opt(info->mount_opt, SPACE_CACHE);
 			break;
+		case Opt_rescan_uuid_tree:
+			btrfs_set_opt(info->mount_opt, RESCAN_UUID_TREE);
+			break;
 		case Opt_no_space_cache:
 			printk(KERN_INFO "btrfs: disabling disk space caching\n");
 			btrfs_clear_opt(info->mount_opt, SPACE_CACHE);
@@ -928,6 +932,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",space_cache");
 	else
 		seq_puts(seq, ",nospace_cache");
+	if (btrfs_test_opt(root, RESCAN_UUID_TREE))
+		seq_puts(seq, ",rescan_uuid_tree");
 	if (btrfs_test_opt(root, CLEAR_CACHE))
 		seq_puts(seq, ",clear_cache");
 	if (btrfs_test_opt(root, USER_SUBVOL_RM_ALLOWED))
-- 
1.8.3
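For readers unfamiliar with how a mount option such as rescan_uuid_tree travels from the option string to a flag test, here is a rough userspace sketch of the same pattern: one bit per option, token-pasting macros to set, clear and test it, and a small parser. All `demo_*` and `DEMO_*` names are hypothetical. Only the bit position 23 is taken from the patch; bit 12 for the space cache flag is chosen just for the demo. The kernel uses match_token() with a token table rather than the hand-written string comparison shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bits; only bit 23 mirrors the patch. */
#define DEMO_MOUNT_SPACE_CACHE       (1UL << 12)
#define DEMO_MOUNT_RESCAN_UUID_TREE  (1UL << 23)

/* Same token-pasting idea as the btrfs_*_opt() macros. */
#define demo_set_opt(o, opt)    ((o) |= DEMO_MOUNT_##opt)
#define demo_clear_opt(o, opt)  ((o) &= ~DEMO_MOUNT_##opt)
#define demo_test_opt(o, opt)   (((o) & DEMO_MOUNT_##opt) != 0)

/* True if the token of length len at tok equals name exactly. */
static int demo_tok_is(const char *tok, size_t len, const char *name)
{
	return strlen(name) == len && strncmp(tok, name, len) == 0;
}

/*
 * Parse a comma separated option string into *mount_opt.
 * Returns 0 on success and -1 on the first unknown option.
 */
static int demo_parse_options(const char *options, unsigned long *mount_opt)
{
	const char *p = options;

	assert(options != NULL && mount_opt != NULL);
	while (*p) {
		size_t len = strcspn(p, ",");

		if (len > 0) {	/* empty tokens such as "a,,b" are ignored */
			if (demo_tok_is(p, len, "space_cache"))
				demo_set_opt(*mount_opt, SPACE_CACHE);
			else if (demo_tok_is(p, len, "nospace_cache"))
				demo_clear_opt(*mount_opt, SPACE_CACHE);
			else if (demo_tok_is(p, len, "rescan_uuid_tree"))
				demo_set_opt(*mount_opt, RESCAN_UUID_TREE);
			else
				return -1;
		}
		p += len;
		if (*p == ',')
			p++;
	}
	return 0;
}
```

After `demo_parse_options("space_cache,rescan_uuid_tree", &opt)`, a later `demo_test_opt(opt, RESCAN_UUID_TREE)` is true, which corresponds to the extra condition the patch adds in open_ctree() and to the line printed by btrfs_show_options().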
Re: [PATCH v5 1/8] Btrfs: introduce a tree for items that map UUIDs to something
> +/* for items that use the BTRFS_UUID_KEY */
> +#define BTRFS_UUID_ITEM_TYPE_SUBVOL	0 /* for UUIDs assigned to subvols */
> +#define BTRFS_UUID_ITEM_TYPE_RECEIVED_SUBVOL	1 /* for UUIDs assigned to
> +						 * received subvols */
> +
> +/* a sequence of such items is stored under the BTRFS_UUID_KEY */
> +struct btrfs_uuid_item {
> +	__le16 type;	/* refer to BTRFS_UUID_ITEM_TYPE* defines above */
> +	__le32 len;	/* number of following 64bit values */
> +	__le64 subid[0];	/* sequence of subids */
> +} __attribute__ ((__packed__));
> +

[...]

>  /*
> + * Stores items that allow to quickly map UUIDs to something else.
> + * These items are part of the filesystem UUID tree.
> + * The key is built like this:
> + * (UUID_upper_64_bits, BTRFS_UUID_KEY, UUID_lower_64_bits).
> + */
> +#if BTRFS_UUID_SIZE != 16
> +#error UUID items require BTRFS_UUID_SIZE == 16!
> +#endif
> +#define BTRFS_UUID_KEY	251

Why do we need this btrfs_uuid_item structure? Why not set the key type
to either _SUBVOL or _RECEIVED_SUBVOL instead of embedding structs with
those types under items with the constant BTRFS_UUID_KEY. Then use the
item size to determine the number of u64 subids. Then the item has a
simple array of u64s in the data which will be a lot easier to work
with.

> +/* btrfs_uuid_item */
> +BTRFS_SETGET_FUNCS(uuid_type, struct btrfs_uuid_item, type, 16);
> +BTRFS_SETGET_FUNCS(uuid_len, struct btrfs_uuid_item, len, 32);
> +BTRFS_SETGET_STACK_FUNCS(stack_uuid_type, struct btrfs_uuid_item, type, 16);
> +BTRFS_SETGET_STACK_FUNCS(stack_uuid_len, struct btrfs_uuid_item, len, 32);

This would all go away.

> +/*
> + * One key is used to store a sequence of btrfs_uuid_item items.
> + * Each item in the sequence contains a type information and a sequence of
> + * ids (together with the information about the size of the sequence of ids).
> + * {[btrfs_uuid_item type0 {id0, id1, ..., idN}],
> + * ...,
> + * [btrfs_uuid_item typeZ {id0, id1, ..., idN}]}
> + *
> + * It is forbidden to put multiple items with the same type under the same key.
> + * Instead the sequence of ids is extended and used to store any additional
> + * ids for the same item type.

This constraint, and the cost of ensuring it and repairing violations,
would go away.

> +static struct btrfs_uuid_item *btrfs_match_uuid_item_type(
> +		struct btrfs_path *path, u16 type)
> +{
> +	struct extent_buffer *eb;
> +	int slot;
> +	struct btrfs_uuid_item *ptr;
> +	u32 item_size;
> +
> +	eb = path->nodes[0];
> +	slot = path->slots[0];
> +	ptr = btrfs_item_ptr(eb, slot, struct btrfs_uuid_item);
> +	item_size = btrfs_item_size_nr(eb, slot);
> +	do {
> +		u16 sub_item_type;
> +		u64 sub_item_len;
> +
> +		if (item_size < sizeof(*ptr)) {
> +			pr_warn("btrfs: uuid item too short (%lu < %d)!\n",
> +				(unsigned long)item_size, (int)sizeof(*ptr));
> +			return NULL;
> +		}
> +		item_size -= sizeof(*ptr);
> +		sub_item_type = btrfs_uuid_type(eb, ptr);
> +		sub_item_len = btrfs_uuid_len(eb, ptr);
> +		if (sub_item_len * sizeof(u64) > item_size) {
> +			pr_warn("btrfs: uuid item too short (%llu > %lu)!\n",
> +				(unsigned long long)(sub_item_len *
> +						     sizeof(u64)),
> +				(unsigned long)item_size);
> +			return NULL;
> +		}
> +		if (sub_item_type == type)
> +			return ptr;
> +		item_size -= sub_item_len * sizeof(u64);
> +		ptr = 1 + (struct btrfs_uuid_item *)
> +			(((char *)ptr) + (sub_item_len * sizeof(u64)));
> +	} while (item_size);

> +static int btrfs_uuid_tree_lookup_prepare(struct btrfs_root *uuid_root,
> +					  u8 *uuid, u16 type,
> +					  struct btrfs_path *path,
> +					  struct btrfs_uuid_item **ptr)
> +{
> +	int ret;
> +	struct btrfs_key key;
> +
> +	if (!uuid_root) {
> +		WARN_ON_ONCE(1);
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +
> +	btrfs_uuid_to_key(uuid, &key);
> +
> +	ret = btrfs_search_slot(NULL, uuid_root, &key, path, 0, 0);
> +	if (ret < 0)
> +		goto out;
> +	if (ret > 0) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +
> +	*ptr = btrfs_match_uuid_item_type(path, type);
> +	if (!*ptr) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +
> +	ret = 0;
> +
> +out:
> +	return ret;
> +}

All of this is replaced with the simple search_slot in the caller.

> +	offset = (unsigned long)ptr;
> +	while (sub_item_len > 0) {
> +		u64 data;
> +
> +		read_extent_buffer(eb, &data, offset, sizeof(data));
> +		data = le64_to_cpu(data);
> +
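To make the layout suggested in the review above concrete, here is a rough userspace sketch, not the kernel implementation: the key type itself says whether the UUID belongs to a subvolume or a received subvolume, the item payload is a bare array of little-endian u64 subids, and the number of subids falls out of the item size. Every `demo_*` name is hypothetical, the two key type values are placeholders (251 is borrowed from the quoted BTRFS_UUID_KEY, 252 is invented), and splitting the UUID into its first and last eight bytes read as little-endian is an assumption about how "upper" and "lower" 64 bits would be derived.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder key types, one per kind of UUID mapping. */
#define DEMO_UUID_KEY_SUBVOL           251
#define DEMO_UUID_KEY_RECEIVED_SUBVOL  252

struct demo_key {
	uint64_t objectid;	/* one 64 bit half of the UUID */
	uint8_t type;		/* DEMO_UUID_KEY_* */
	uint64_t offset;	/* the other 64 bit half of the UUID */
};

/* Read an unaligned little-endian u64 independent of host byte order. */
static uint64_t demo_get_le64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Build the search key directly from the 16 byte UUID and the key type. */
static void demo_uuid_to_key(const uint8_t uuid[16], uint8_t type,
			     struct demo_key *key)
{
	assert(uuid != NULL && key != NULL);
	key->objectid = demo_get_le64(uuid);
	key->type = type;
	key->offset = demo_get_le64(uuid + 8);
}

/* Number of subids in an item, or -1 if the size is not a multiple of 8. */
static long demo_subid_count(uint32_t item_size)
{
	if (item_size == 0 || item_size % sizeof(uint64_t) != 0)
		return -1;
	return (long)(item_size / sizeof(uint64_t));
}

/* Linear search of the plain u64 array stored as the item payload. */
static int demo_has_subid(const uint8_t *item_data, uint32_t item_size,
			  uint64_t subid)
{
	long n = demo_subid_count(item_size);

	for (long i = 0; i < n; i++)
		if (demo_get_le64(item_data + i * sizeof(uint64_t)) == subid)
			return 1;
	return 0;
}
```

With this shape there is no per-item header to validate and no rule about duplicate sub-item types to enforce: a lookup is one search for the key followed by a walk over `item_size / 8` values.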
Re: hang on 3.9, 3.10-rc5
On Wed, 19 Jun 2013, Sage Weil wrote:
> Hi Chris,
>
> On Tue, 18 Jun 2013, Chris Mason wrote:
> > [...]
> > Very long way of saying I think we're one release_path short. Sage, I
> > haven't tested this at all yet, I was hoping to trigger it first.
> >
> > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> > index c276ac9..c1954b3 100644
> > --- a/fs/btrfs/tree-log.c
> > +++ b/fs/btrfs/tree-log.c
> > @@ -3730,6 +3730,7 @@ next_slot:
> >  log_extents:
> >  	if (fast_search) {
> >  		btrfs_release_path(dst_path);
> > +		btrfs_release_path(path);
> >  		ret = btrfs_log_changed_extents(trans, root, inode, dst_path);
> >  		if (ret) {
> >  			err = ret;
>
> This seems to be doing the trick. I'll keep testing overnight, but so
> far so good!

...and it's still holding up well in QA. Thanks, Chris!

sage
Re: hang on 3.9, 3.10-rc5
Quoting Sage Weil (2013-06-20 17:56:19)
> On Wed, 19 Jun 2013, Sage Weil wrote:
> > Hi Chris,
> >
> > On Tue, 18 Jun 2013, Chris Mason wrote:
> > > [...]
> > > Very long way of saying I think we're one release_path short. Sage, I
> > > haven't tested this at all yet, I was hoping to trigger it first.
> > >
> > > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> > > index c276ac9..c1954b3 100644
> > > --- a/fs/btrfs/tree-log.c
> > > +++ b/fs/btrfs/tree-log.c
> > > @@ -3730,6 +3730,7 @@ next_slot:
> > >  log_extents:
> > >  	if (fast_search) {
> > >  		btrfs_release_path(dst_path);
> > > +		btrfs_release_path(path);
> > >  		ret = btrfs_log_changed_extents(trans, root, inode, dst_path);
> > >  		if (ret) {
> > >  			err = ret;
> >
> > This seems to be doing the trick. I'll keep testing overnight, but so
> > far so good!
>
> ...and it's still holding up well in QA.

Awesome, thanks for getting the traces for us. Looks like this one has
been around since v3.7, so I'm not going to try and sneak it into the
3.10 final. I'll have it in the next merge window and for stable.

-chris
Re: hang on 3.9, 3.10-rc5
On Thu, 20 Jun 2013, Chris Mason wrote:
> Quoting Sage Weil (2013-06-20 17:56:19)
> > On Wed, 19 Jun 2013, Sage Weil wrote:
> > > Hi Chris,
> > >
> > > On Tue, 18 Jun 2013, Chris Mason wrote:
> > > > [...]
> > > > Very long way of saying I think we're one release_path short. Sage, I
> > > > haven't tested this at all yet, I was hoping to trigger it first.
> > > >
> > > > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> > > > index c276ac9..c1954b3 100644
> > > > --- a/fs/btrfs/tree-log.c
> > > > +++ b/fs/btrfs/tree-log.c
> > > > @@ -3730,6 +3730,7 @@ next_slot:
> > > >  log_extents:
> > > >  	if (fast_search) {
> > > >  		btrfs_release_path(dst_path);
> > > > +		btrfs_release_path(path);
> > > >  		ret = btrfs_log_changed_extents(trans, root, inode, dst_path);
> > > >  		if (ret) {
> > > >  			err = ret;
> > >
> > > This seems to be doing the trick. I'll keep testing overnight, but so
> > > far so good!
> >
> > ...and it's still holding up well in QA.
>
> Awesome, thanks for getting the traces for us. Looks like this one has
> been around since v3.7, so I'm not going to try and sneak it into the
> 3.10 final. I'll have it in the next merge window and for stable.

Weird, these same tests have been running on it nightly for ages and it
seems like these failures just started with 3.9. Perhaps some other
change made it hang when it didn't before?

In any case, thanks!

sage
Re: hang on 3.9, 3.10-rc5
Quoting Sage Weil (2013-06-20 21:00:21)
> On Thu, 20 Jun 2013, Chris Mason wrote:
> > Awesome, thanks for getting the traces for us. Looks like this one has
> > been around since v3.7, so I'm not going to try and sneak it into the
> > 3.10 final. I'll have it in the next merge window and for stable.
>
> Weird, these same tests have been running on it nightly for ages and it
> seems like these failures just started with 3.9. Perhaps some other
> change made it hang when it didn't before?

It's always possible, there are a ton of moving pieces here. The
wait_event you were hung on was waiting for crcs to finish, and that
part at least isn't new.

Somewhat unrelated, but are you still using notreelog?

-chris
Re: hang on 3.9, 3.10-rc5
On Thu, 20 Jun 2013, Chris Mason wrote:
> Quoting Sage Weil (2013-06-20 21:00:21)
> > On Thu, 20 Jun 2013, Chris Mason wrote:
> > > Awesome, thanks for getting the traces for us. Looks like this one has
> > > been around since v3.7, so I'm not going to try and sneak it into the
> > > 3.10 final. I'll have it in the next merge window and for stable.
> >
> > Weird, these same tests have been running on it nightly for ages and it
> > seems like these failures just started with 3.9. Perhaps some other
> > change made it hang when it didn't before?
>
> It's always possible, there are a ton of moving pieces here. The
> wait_event you were hung on was waiting for crcs to finish, and that
> part at least isn't new.

K. There was also a shift of writes to leveldb (which does the mmap
thing), so that may explain the change in behavior.

> Somewhat unrelated, but are you still using notreelog?

Nope, just noatime.

Thanks-
sage
Re: [PATCH 4/4] Btrfs-progs: exhance btrfs-image to restore image onto multiple disks
Quoting Liu Bo (2013-06-20 08:05:30)
> This adds a 'btrfs-image -m' option, which let us restore an image that
> is built from a btrfs of multiple disks onto several disks altogether.

I'd like to pull this in, could you please rebase it against my current
master?

Thanks!

-chris
Re: hang on 3.9, 3.10-rc5
Quoting Jon Nelson (2013-06-18 13:19:04)
> Josef Bacik jbacik at fusionio.com writes:
> > On Tue, Jun 11, 2013 at 11:43:30AM -0400, Sage Weil wrote:
> > > I'm also seeing this hang regularly with both 3.9 and 3.10-rc5. Is this
> > > is a known problem? In this case there is no powercycling; just a
> > > regular ceph-osd workload.
> ..
>
> I'm able to cause a complete kernel hang by defrag'ing even one file on
> 3.9.X (3.9.0 through 3.9.4, so far).

I'm not able to reproduce this here. Could you please capture the output
from sysrq-w during the hang? (It will all go to your dmesg)

-chris
Re: [PATCH 4/4] Btrfs-progs: exhance btrfs-image to restore image onto multiple disks
On Thu, Jun 20, 2013 at 09:10:24PM -0400, Chris Mason wrote:
> Quoting Liu Bo (2013-06-20 08:05:30)
> > This adds a 'btrfs-image -m' option, which let us restore an image that
> > is built from a btrfs of multiple disks onto several disks altogether.
>
> I'd like to pull this in, could you please rebase it against my current
> master?

Yeah, I'll rebase it now.

thanks,
liubo

> Thanks!
>
> -chris
Re: hang on 3.9, 3.10-rc5
Is this what you are looking for? After this, the CPU gets stuck and I
have to reboot.

[360491.932226] ------------[ cut here ]------------
[360491.932261] kernel BUG at /home/abuild/rpmbuild/BUILD/kernel-desktop-3.9.6/linux-3.9/fs/btrfs/ctree.c:1144!
[360491.932312] invalid opcode: [#1] PREEMPT SMP
[360491.932344] Modules linked in: xfs nilfs2 jfs usb_storage nls_iso8859_1 nls_cp437 vfat fat mmc_block nfsv4 auth_rpcgss nfs fscache lockd sunrpc tun snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device fuse xt_tcpudp xt_pkttype xt_LOG xt_limit af_packet ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw ipt_REJECT iptable_raw xt_CT iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack ip6table_filter ip6_tables x_tables arc4 iwldvm mac80211 snd_hda_codec_hdmi snd_hda_codec_conexant iTCO_wdt iTCO_vendor_support mperf coretemp kvm_intel kvm snd_hda_intel snd_hda_codec microcode snd_hwdep snd_pcm thinkpad_acpi snd_timer joydev pcspkr sr_mod snd tpm_tis iwlwifi cdrom e1000e wmi tpm battery ac tpm_bios cfg80211 sdhci_pci soundcore ptp sdhci snd_page_alloc i2c_i801 rfkill pps_core mmc_core lpc_ich mfd_core tcp_westwood sg autofs4 btrfs raid6_pq zlib_deflate xor libcrc32c sha256_generic dm_crypt dm_mod crc32_pclmul ghash_clmulni_intel crc32c_intel aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul thermal i915 drm_kms_helper drm i2c_algo_bit video button processor thermal_sys scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_dh ata_generic ata_piix
[360491.933095] CPU 3
[360491.933110] Pid: 22166, comm: btrfs-endio-wri Not tainted 3.9.6-1.g8ead728-desktop #1 LENOVO 4239CTO/4239CTO
[360491.933161] RIP: 0010:[a022075b] [a022075b] __tree_mod_log_rewind+0x23b/0x240 [btrfs]
[360491.933225] RSP: 0018:88014cad7888 EFLAGS: 00010297
[360491.933253] RAX: RBX: 880065705b40 RCX: 88014cad7828
[360491.933289] RDX: 0247cc54 RSI: 88014949e821 RDI: 8801bf916640
[360491.933326] RBP: 880073048490 R08: 1000 R09: 88014cad7838
[360491.933362] R10: R11: R12: 8801c0556640
[360491.933398] R13: 0001731e R14: 003d R15: 880158b76340
[360491.933435] FS: () GS:88021e2c() knlGS:
[360491.933476] CS: 0010 DS: ES: CR0: 80050033
[360491.933506] CR2: 7f0a360017f8 CR3: 00015be03000 CR4: 000407e0
[360491.933543] DR0: DR1: DR2:
[360491.933579] DR3: DR6: 0ff0 DR7: 0400
[360491.933617] Process btrfs-endio-wri (pid: 22166, threadinfo 88014cad6000, task 8801a5f8a6c0)
[360491.933662] Stack:
[360491.933674] 88011b573800 88010126e7f0 0001 1600
[360491.933719] 880073048490 a0228d45 6db6db6db6db6db7 8801c0556640
[360491.933763] 88020e764000 8800 0001731e 880197cb8158
[360491.933807] Call Trace:
[360491.933848] [a0228d45] btrfs_search_old_slot+0x635/0x950 [btrfs]
[360491.933909] [a02a1ec6] __resolve_indirect_refs+0x156/0x640 [btrfs]
[360491.934044] [a02a2e0c] find_parent_nodes+0x95c/0x1050 [btrfs]
[360491.934176] [a02a3592] btrfs_find_all_roots+0x92/0x100 [btrfs]
[360491.934307] [a02a401e] iterate_extent_inodes+0x16e/0x370 [btrfs]
[360491.934440] [a02a42b8] iterate_inodes_from_logical+0x98/0xc0 [btrfs]
[360491.934572] [a024c1c8] record_extent_backrefs+0x68/0xe0 [btrfs]
[360491.934652] [a0256d80] btrfs_finish_ordered_io+0x150/0x990 [btrfs]
[360491.934739] [a0276ef3] worker_loop+0x153/0x560 [btrfs]
[360491.934833] [810697c3] kthread+0xb3/0xc0
[360491.934864] [815dc6bc] ret_from_fork+0x7c/0xb0
[360491.934896] DWARF2 unwinder stuck at ret_from_fork+0x7c/0xb0
[360491.934925]
[360491.934934] Leftover inexact backtrace:
[360491.934934]
[360491.934965] [81069710] ? kthread_create_on_node+0x120/0x120
[360491.934999] Code: c1 48 63 43 58 48 89 c2 48 c1 e2 05 48 8d 54 10 65 48 63 43 2c 48 89 c6 48 c1 e6 05 48 8d 74 30 65 e8 3a af 04 00 e9 b3 fe ff ff 0f 0b 0f 0b 90 41 57 41 56 41 55 41 54 55 48 89 fd 53 48 83 ec
[360491.935188] RIP [a022075b] __tree_mod_log_rewind+0x23b/0x240 [btrfs]
[360491.935233] RSP 88014cad7888
[360491.946047] ---[ end trace 1475a0830dcadf9c ]---
[360491.946051] note: btrfs-endio-wri[22166] exited with preempt_count 1

On Thu, Jun 20, 2013 at 8:11 PM, Chris Mason chris.ma...@fusionio.com wrote:
> Quoting Jon Nelson (2013-06-18 13:19:04)
> > Josef Bacik jbacik at fusionio.com writes:
> > > On Tue, Jun 11, 2013 at 11:43:30AM -0400, Sage Weil wrote:
> > > > I'm also seeing this hang regularly with both 3.9 and 3.10-rc5. Is this
> > > > is a known problem? In this case there is no powercycling; just a
> > > > regular ceph-osd
Re: hang on 3.9, 3.10-rc5
On Jun 20, 2013, at 7:46 PM, Jon Nelson jnel...@jamponi.net wrote:
> Is this what you are looking for?

If you're able to reproduce while you're remoted in via ssh, then if you
get the dmesg at least you won't have to spend time trying to save it
somewhere since you'll have it on the remote system's terminal window.

https://www.kernel.org/doc/Documentation/sysrq.txt

So basically:

echo w > /proc/sysrq-trigger
dmesg

Chris Murphy