[REGRESSION?] Used+avail gives more than size of device
Hi! On 3.17, i.e. since the size reporting changes, I get:

merkaba:~ LANG=C df -hT -t btrfs
Filesystem              Type   Size  Used Avail Use% Mounted on
/dev/mapper/sata-debian btrfs   30G   19G   21G  48% /
/dev/mapper/sata-debian btrfs   30G   19G   21G  48% /mnt/debian-zeit
/dev/mapper/msata-daten btrfs  200G  185G   15G  93% /daten
/dev/mapper/msata-home  btrfs  160G  135G   48G  75% /mnt/home-zeit
/dev/mapper/msata-home  btrfs  160G  135G   48G  75% /home

I wonder about used and avail not adding up to the total size of the filesystem: 19+21 = 40 GiB instead of 30 GiB for /, and 135+48 = 183 GiB for /home. Only /daten seems to be correct. / and /home are RAID 1 spanning two SSDs; /daten is single.

I wondered whether compression is taken into account? They all use compress=lzo. While /daten also has compress=lzo, it contains mostly incompressible data: jpeg images, mp3, ogg vorbis and probably some flac music files, as well as one or the other mp4 video file; compressed media data, that is.

Any explanation for the discrepancy? Can it just be due to the compression? /home has a large maildir and lots of source files, which I expect to compress well. A Debian installation may also contain quite an amount of compressible data. But still, the ratio seems a bit off, as it would mean / was able to store 19 GiB of data within 9 GiB of actual disk space, which would be a quite high compression ratio for LZO.
merkaba:~ mount | grep btrfs
/dev/mapper/sata-debian on / type btrfs (rw,noatime,compress=lzo,ssd,space_cache)
/dev/mapper/sata-debian on /mnt/debian-zeit type btrfs (rw,noatime,compress=lzo,ssd,space_cache)
/dev/mapper/msata-daten on /daten type btrfs (rw,noatime,compress=lzo,ssd,space_cache)
/dev/mapper/msata-home on /mnt/home-zeit type btrfs (rw,noatime,compress=lzo,ssd,space_cache)
/dev/mapper/msata-home on /home type btrfs (rw,noatime,compress=lzo,ssd,space_cache)

merkaba:~ btrfs fi sh
Label: 'debian'  uuid: […]
        Total devices 2 FS bytes used 18.47GiB
        devid 1 size 30.00GiB used 30.00GiB path /dev/mapper/sata-debian
        devid 2 size 30.00GiB used 30.00GiB path /dev/mapper/msata-debian
Label: 'daten'  uuid: […]
        Total devices 1 FS bytes used 184.82GiB
        devid 1 size 200.00GiB used 188.02GiB path /dev/mapper/msata-daten
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 134.39GiB
        devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
        devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home

merkaba:~ btrfs fi df /
Data, RAID1: total=27.99GiB, used=17.84GiB
System, RAID1: total=8.00MiB, used=16.00KiB
Metadata, RAID1: total=2.00GiB, used=645.88MiB
unknown, single: total=224.00MiB, used=0.00

merkaba:~ btrfs fi df /home
Data, RAID1: total=154.97GiB, used=131.46GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=2.93GiB
unknown, single: total=512.00MiB, used=0.00

merkaba:~ btrfs fi df /daten
Data, single: total=187.01GiB, used=184.53GiB
System, single: total=4.00MiB, used=48.00KiB
Metadata, single: total=1.01GiB, used=292.31MiB
unknown, single: total=112.00MiB, used=0.00

merkaba:~ LANG=C strace df -hT -t btrfs
execve("/bin/df", ["df", "-hT", "-t", "btrfs"], [/* 55 vars */]) = 0
brk(0) = 0x13c2000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f795e7de000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=250673, ...}) = 0
mmap(NULL, 250673, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f795e7a
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0\0\1\0\0\0P\34\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1725888, ...}) = 0
mmap(NULL, 3832352, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f795e218000
mprotect(0x7f795e3b7000, 2093056, PROT_NONE) = 0
mmap(0x7f795e5b6000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x19e000) = 0x7f795e5b6000
mmap(0x7f795e5bc000, 14880, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f795e5bc000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f795e79f000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f795e79e000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f795e79d000
arch_prctl(ARCH_SET_FS, 0x7f795e79e700) = 0
mprotect(0x7f795e5b6000, 16384, PROT_READ) = 0
mprotect(0x616000, 4096, PROT_READ) = 0
mprotect(0x7f795e7e, 4096, PROT_READ) = 0
munmap(0x7f795e7a, 250673) = 0
brk(0)
Re: What is the vision for btrfs fs repair?
Am Donnerstag, 9. Oktober 2014, 21:58:53 schrieben Sie:

* btrfs-zero-log
  remove the log tree if log tree is corrupt
* btrfs rescue
  Recover a damaged btrfs filesystem
  chunk-recover
  super-recover
  How does this relate to btrfs check?
* btrfs check
  repair a btrfs filesystem
  --repair
  --init-csum-tree
  --init-extent-tree
  How does this relate to btrfs rescue?

These three translate into eight combinations of repairs; adding -o recovery, there are 9 combinations. I think this is the main source of confusion: there are just too many options, and it's also completely non-obvious which one to use in which situation.

My expectation is that eventually these get consolidated into just check and check --repair. As the repair code matures, it'd go into kernel autorecovery code. That's a guess on my part, but it's consistent with design goals.

Also, I think these should at least all be under the btrfs command. So include btrfs-zero-log in the btrfs command. And how about btrfs repair or btrfs check as an upper category, with the various options added as commands below it? Then there is at least one command and one place in the manpage to learn about the various options. But maybe some can be made automatic as well, or folded into btrfs check --repair. Ideally it would auto-detect which path to take on filesystem recovery.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: What is the vision for btrfs fs repair?
Am Freitag, 10. Oktober 2014, 10:37:44 schrieb Chris Murphy:

On Oct 10, 2014, at 6:53 AM, Bob Marley bobmar...@shiftmail.org wrote:
On 10/10/2014 03:58, Chris Murphy wrote:

* mount -o recovery
  Enable autorecovery attempts if a bad tree root is found at mount time.

I'm confused why it's not the default yet. Maybe it's continuing to evolve at a pace that suggests something could sneak in that makes things worse? It is almost an oxymoron in that I'm manually enabling an autorecovery. If true, maybe the closest indication we'd get of btrfs stability is the default enabling of autorecovery.

No way! I wouldn't want a default like that. If you think of distributed transactions: suppose a sync was issued on both sides of a distributed transaction, then power was lost on one side, and then btrfs had corruption. When I remount it, definitely the worst thing that can happen is that it auto-rolls-back to a previous known-good state.

For a general purpose file system, losing 30 seconds (or less) of questionably committed data, likely corrupt, is preferable to a file system that won't mount without user intervention, which requires a secret decoder ring to get it to mount at all, and may require the use of specialized tools to retrieve that data in any case. The fail-safe behavior is to treat the known good tree root as the default tree root, and bypass the bad tree root if it cannot be repaired, so that the volume can be mounted with default mount options (i.e. the ones in fstab). Otherwise it's a filesystem that isn't well suited for general purpose use as rootfs, let alone for boot.

To understand this a bit better: What can be the reasons a recent tree gets corrupted? I always thought that, with a controller, device and driver combination that honors fsync, BTRFS would either be in the new state or in the last known good state *anyway*. So where does the need to roll back arise from?
That said, all journalling filesystems have some sort of rollback as far as I understand: if the last journal entry is incomplete, they discard it on journal replay. So even there you lose the last seconds of write activity. But in case fsync() returns, the data needs to be safe on disk. I always thought BTRFS honors this under *any* circumstance. If some proposed autorollback breaks this guarantee, I think something is broken elsewhere.

And an fsync is an fsync is an fsync. Its semantics are clear as crystal. There is nothing, absolutely nothing to discuss about it. An fsync completes if the device itself reported "Yeah, I have the data on disk, all safe and cool to go." Anything else is a bug IMO.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
Re: What is the vision for btrfs fs repair?
Am Mittwoch, 8. Oktober 2014, 14:11:51 schrieb Eric Sandeen:

I was looking at Marc's post:
http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html
and it feels like there isn't exactly a cohesive, overarching vision for repair of a corrupted btrfs filesystem. In other words: I'm an admin cruising along, when the kernel throws some fs corruption error, or for whatever reason btrfs fails to mount. What should I do?

Marc lays out several steps, but to me this highlights that there seem to be a lot of disjoint mechanisms out there to deal with these problems; mostly from Marc's blog, with some bits of my own:

* btrfs scrub
  Errors are corrected along the way if possible (what *is* possible?)
* mount -o recovery
  Enable autorecovery attempts if a bad tree root is found at mount time.
* mount -o degraded
  Allow mounts to continue with missing devices.
  (This isn't really a way to recover from corruption, right?)
* btrfs-zero-log
  remove the log tree if log tree is corrupt
* btrfs rescue
  Recover a damaged btrfs filesystem
  chunk-recover
  super-recover
  How does this relate to btrfs check?
* btrfs check
  repair a btrfs filesystem
  --repair
  --init-csum-tree
  --init-extent-tree
  How does this relate to btrfs rescue?
* btrfs restore
  try to salvage files from a damaged filesystem
  (not really repair, it's disk-scraping)

What's the vision for, say, scrub vs. check vs. rescue? Should they repair the same errors, only online vs. offline? If not, what class of errors does one fix vs. the other? How would an admin know? Can btrfs check recover a bad tree root in the same way that mount -o recovery does? How would I know if I should use --init-*-tree, or chunk-recover, and what are the ramifications of using these options?

It feels like recovery tools have been badly splintered, and if there's an overarching design or vision for btrfs fs repair, I can't tell what it is. Can anyone help me?
How about taking one step back: What are the possible corruption cases these tools are meant to address? *Where* can BTRFS break and *why*? What of it can be folded into one command? Where can BTRFS be improved to either prevent a corruption from happening or correct it automatically? What actions can be determined automatically by the repair tool? What needs to be options for the user to choose from? And what guidance would the user need to decide?

I.e. really going back to what diagnosing and repairing BTRFS actually includes, and then, well… developing a vision of how this all can fit together, as you suggested.

As a minimum I suggest having all possible options as a main category in the btrfs command, no external commands whatsoever; so if btrfs-zero-log is still needed, add it to the btrfs command.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
Re: btrfs send and kernel 3.17
This weekend I finally had time to try btrfs send again on the newly created fs. Now I am running into another problem. btrfs send returns:

ERROR: send ioctl failed with -12: Cannot allocate memory

In dmesg I see only the following output:

parent transid verify failed on 21325004800 wanted 2620 found 8325

On 10/07/2014 10:46 PM, Chris Mason wrote:
On Tue, Oct 7, 2014 at 4:45 PM, David Arendt ad...@prnet.org wrote:
On 10/07/2014 03:19 PM, Chris Mason wrote:
On Tue, Oct 7, 2014 at 1:25 AM, David Arendt ad...@prnet.org wrote:

I did a revert of this commit. After creating a snapshot, the filesystem was no longer usable, even with kernel 3.16.3 (crashes 10 seconds after mount without error message). Maybe there was some previous damage that just appeared now. This evening, I will restore from backup and report back.

On October 7, 2014 12:22:11 AM CEST, Chris Mason c...@fb.com wrote:
On Mon, Oct 6, 2014 at 4:51 PM, David Arendt ad...@prnet.org wrote:

I just tried downgrading to 3.16.3 again. In 3.16.3 btrfs send is working without any problem. Afterwards I upgraded again to 3.17 and the problem reappeared. So the problem seems to be kernel version related.

[ backref errors during btrfs-send ]

Ok then, our list of suspects is pretty short. Can you easily build test kernels? I'd like to try reverting this commit: 51f395ad4058883e4273b02fdebe98072dbdc0d2

Oh no! Reverting this definitely should not have caused corruptions, so I think the problem was already there. Do you still have the filesystem image? Please let us know if you're missing files off the backup, we'll help pull them out.

Due to space constraints, it was not possible to take an image of the corrupted filesystem. As I do backups daily, and the problems occurred 5 hours after backup, no file was lost. Thanks for offering your help. In 4 days I will do some send tests on the newly created filesystem and report back.

Ok, if you have the kernel messages from the panic, please send them along.
-chris
Re: btrfs send and kernel 3.17
Hi. I just wanted to confirm David's story, so to speak :)

- kernel 3.17-rc7 (didn't bother to compile 3.17 as there weren't any btrfs fixes, I think)
- btrfs-progs 3.16.2 (also compiled from source, so no distribution-specific patches)
- fresh fs
- I get the same two errors David got (first I got the I/O error one and then the memory allocation one)
- plus, now when I ls -la the fs top volume, this is what I get:

drwxrwsr-x 1 root staff  30 Sep 11 16:15 home
d????????? ? ?    ?       ?            ? home-backup
drwxr-xr-x 1 root root  250 Oct 10 15:37 root
d????????? ? ?    ?       ?            ? root-backup
drwxr-xr-x 1 root root   88 Sep 15 16:02 vms
drwxr-xr-x 1 root root   88 Sep 15 16:02 vms-backup

Yes, the question marks on those two *-backup snapshots are really there. I can't access the snapshots, I can't delete them, I can't do anything with them.

- btrfs check segfaults
- the events that led to this situation are these:
1) btrfs su snap -r root root-backup
2) send|receive (the entire root-backup, not an incremental send): immediate I/O error
3) move on to home: btrfs su snap -r home home-backup
4) send|receive (again not an incremental send): everything goes well (!)
5) retry with root: btrfs su snap -r root root-backup
6) send|receive, and it goes seemingly well
7) apt-get dist-upgrade, just to modify root and try an incremental send
8) reboot after the dist-upgrade
9) ls -la the fs top volume: first I get the memory allocation error, and after that any ls -la gives the output I pasted above. (Notice that besides the ls -la, the two snapshots were not touched in any way since the two send|receive.)

A few final notes. I haven't tried send/receive in a while (they were unreliable), so I can't tell which is the last version in which they worked for me (well, no version actually :) ). I've never had any problem with just snapshots. I make them regularly, I use them, I modify them, and I've never had one problem (with 3.17 too; it's just send/receive that murders them).
Best regards
John
[PATCH] Test Commit
---
 Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index 18eb944..7cc7783 100644
--- a/Makefile
+++ b/Makefile
@@ -5,7 +5,7 @@ CC = gcc
 LN = ln
 AR = ar
 AM_CFLAGS = -Wall -D_FILE_OFFSET_BITS=64 -DBTRFS_FLAT_INCLUDES -fno-strict-aliasing -fPIC
-CFLAGS = -g -O1 -fno-strict-aliasing
+CFLAGS = -g -O1 -fno-strict-aliasing -rdynamic
 objects = ctree.o disk-io.o radix-tree.o extent-tree.o print-tree.o \
	  root-tree.o dir-item.o file-item.o inode-item.o inode-map.o \
	  extent-cache.o extent_io.o volumes.o utils.o repair.o \
-- 
1.9.1
TEST PING EOM
[PATCH] Btrfs-prog: uniform error handling for utils.c
---
 Makefile         |   4 +-
 btrfs-syscalls.c | 180 +
 btrfs-syscalls.h |  55 +++
 kerncompat.h     |   5 +-
 utils.c          | 200 +++
 5 files changed, 337 insertions(+), 107 deletions(-)
 create mode 100644 btrfs-syscalls.c
 create mode 100644 btrfs-syscalls.h

diff --git a/Makefile b/Makefile
index 18eb944..d738f20 100644
--- a/Makefile
+++ b/Makefile
@@ -5,7 +5,7 @@ CC = gcc
 LN = ln
 AR = ar
 AM_CFLAGS = -Wall -D_FILE_OFFSET_BITS=64 -DBTRFS_FLAT_INCLUDES -fno-strict-aliasing -fPIC
-CFLAGS = -g -O1 -fno-strict-aliasing
+CFLAGS = -g -O1 -fno-strict-aliasing -rdynamic
 objects = ctree.o disk-io.o radix-tree.o extent-tree.o print-tree.o \
	  root-tree.o dir-item.o file-item.o inode-item.o inode-map.o \
	  extent-cache.o extent_io.o volumes.o utils.o repair.o \
@@ -17,7 +17,7 @@ cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
	       cmds-restore.o cmds-rescue.o chunk-recover.o super-recover.o \
	       cmds-property.o
 libbtrfs_objects = send-stream.o send-utils.o rbtree.o btrfs-list.o crc32c.o \
-		   uuid-tree.o utils-lib.o
+		   uuid-tree.o utils-lib.o btrfs-syscalls.o
 libbtrfs_headers = send-stream.h send-utils.h send.h rbtree.h btrfs-list.h \
		   crc32c.h list.h kerncompat.h radix-tree.h extent-cache.h \
		   extent_io.h ioctl.h ctree.h btrfsck.h version.h
diff --git a/btrfs-syscalls.c b/btrfs-syscalls.c
new file mode 100644
index 000..b4d791b
--- /dev/null
+++ b/btrfs-syscalls.c
@@ -0,0 +1,180 @@
+/***
+ * File Name   : btrfs-syscalls.c
+ * Description : This file contains system call wrapper functions with
+ *               uniform error handling.
+ **/
+#include "btrfs-syscalls.h"
+
+#define BKTRACE_BUFFER_SIZE 1024
+
+int err_verbose = 0;
+static void *buf[BKTRACE_BUFFER_SIZE];
+
+void
+btrfs_backtrace(void)
+{
+	int i;
+	int nptrs;
+	char **entries;
+
+	fprintf(stderr, "Call trace:\n");
+	nptrs = backtrace(buf, BKTRACE_BUFFER_SIZE);
+	entries = backtrace_symbols(buf, nptrs);
+	if (entries == NULL) {
+		fprintf(stderr, "ERROR: backtrace_symbols\n");
+		exit(EXIT_FAILURE);
+	}
+	for (i = 0; i < nptrs; i++) {
+		if (strstr(entries[i], "btrfs_backtrace") == NULL)
+			fprintf(stderr, "\t%s\n", entries[i]);
+	}
+	free(entries);
+}
+
+int
+btrfs_open(const char *pathname, int flags)
+{
+	int ret;
+
+	if ((ret = open(pathname, flags)) < 0)
+		SYS_ERROR("open : %s", pathname);
+
+	return ret;
+}
+
+int
+btrfs_close(int fd)
+{
+	int ret;
+
+	if ((ret = close(fd)) < 0)
+		SYS_ERROR("close :");
+
+	return ret;
+}
+
+int
+btrfs_stat(const char *path, struct stat *buf)
+{
+	int ret;
+
+	if ((ret = stat(path, buf)) < 0)
+		SYS_ERROR("stat : %s", path);
+
+	return ret;
+}
+
+int
+btrfs_lstat(const char *path, struct stat *buf)
+{
+	int ret;
+
+	if ((ret = lstat(path, buf)) < 0) {
+		SYS_ERROR("lstat : %s", path);
+	}
+
+	return ret;
+}
+
+int
+btrfs_fstat(int fd, struct stat *buf)
+{
+	int ret;
+
+	if ((ret = fstat(fd, buf)) < 0)
+		SYS_ERROR("fstat :");
+
+	return ret;
+}
+
+void*
+btrfs_malloc(size_t size)
+{
+	void *p;
+
+	if ((p = malloc(size)) == NULL) {
+		if (size != 0)
+			SYS_ERROR("malloc :");
+	}
+
+	return p;
+}
+
+void*
+btrfs_calloc(size_t nmemb, size_t size)
+{
+	void *p;
+
+	if ((p = calloc(nmemb, size)) == NULL) {
+		if (size != 0)
+			SYS_ERROR("calloc :");
+	}
+
+	return p;
+}
+
+FILE*
+btrfs_fopen(const char *path, const char *mode)
+{
+	FILE *f;
+
+	if ((f = fopen(path, mode)) == NULL)
+		SYS_ERROR("fopen : %s", path);
+
+	return f;
+}
+
+DIR*
+btrfs_opendir(const char *name)
+{
+	DIR *d;
+
+	if ((d = opendir(name)) == NULL)
+		SYS_ERROR("opendir :");
+
+	return d;
+}
+
+int
+btrfs_dirfd(DIR *dirp)
+{
+	int fd;
+
+	if ((fd = dirfd(dirp)) < 0)
+		SYS_ERROR("dirfd :");
+
+	return fd;
+}
+
+int
+btrfs_closedir(DIR *dirp)
+{
+	int ret;
+
+	if ((ret = closedir(dirp)) < 0)
+		SYS_ERROR("closedir :");
+
+	return ret;
+}
+
+ssize_t
+btrfs_pwrite(int fd, const void *buf, size_t count, off_t offset)
+{
+	ssize_t ret;
+
+	if ((ret = pwrite(fd, buf, count, offset)) < 0)
+		SYS_ERROR("pwrite :");
+
+	return ret;
+}
+
+ssize_t
+btrfs_pread(int fd, const void *buf, size_t count, off_t offset)
+{
+	ssize_t ret;
+
+	if ((ret = pread(fd, buf, count, offset)) < 0)
+		SYS_ERROR("pread :");
+
+	return ret;
+}
diff --git a/btrfs-syscalls.h b/btrfs-syscalls.h
new file mode 100644
index 000..2c717bf
--- /dev/null
+++ b/btrfs-syscalls.h
@@ -0,0 +1,55 @@
+#ifndef __BTRFS_SYSCALLS_H__
Re: [PATCH] [PATCH]Btrfs-prog: uniform error handling for utils.c
Hi,

The following patch implements uniform error handling for utils.c in btrfs-progs.

On Sun, Oct 12, 2014 at 2:01 PM, neo ckn...@gmail.com wrote:
[patch quoted in full; snipped]
Re: btrfs send and kernel 3.17
Just to let you know, I just tried an ls -l on 2 machines running kernel 3.17 and btrfs-progs 3.16.2. Here is my ls -l output:

Machine 1:
ls: cannot access root.20141009.000503.backup: Cannot allocate memory
total 0
d????????? ? ?    ?      ?            ? root.20141009.000503.backup
drwxr-xr-x 1 root root 182 Oct  7 20:35 root.20141012.095526.backup
drwxr-xr-x 1 root root 182 Oct  7 20:35 root.20141012.000503.backup
drwxr-xr-x 1 root root 182 Oct  7 20:35 root.20141011.000502.backup
drwxr-xr-x 1 root root 182 Oct  7 20:35 root.20141010.000502.backup

root.20141009.000503.backup is not deletable.

Machine 2:
ls: cannot access root.20141006.003239.backup: Cannot allocate memory
ls: cannot access root.20141007.001616.backup: Cannot allocate memory
ls: cannot access root.20141008.000501.backup: Cannot allocate memory
ls: cannot access root.20141009.052436.backup: Cannot allocate memory
total 0
d????????? ? ?    ?      ?            ? root.20141009.052436.backup
d????????? ? ?    ?      ?            ? root.20141008.000501.backup
d????????? ? ?    ?      ?            ? root.20141007.001616.backup
d????????? ? ?    ?      ?            ? root.20141006.003239.backup
drwxr-xr-x 1 root root 232 Aug  3 15:00 root.20140925.001125.backup
drwxr-xr-x 1 root root 232 Aug  3 15:00 root.20140924.001017.backup
drwxr-xr-x 1 root root 232 Aug  3 15:00 root.20140923.001008.backup
drwxr-xr-x 1 root root 232 Aug  3 15:00 root.20140922.001836.backup
drwxr-xr-x 1 root root 232 Aug  3 15:00 root.20140921.001029.backup
drwxr-xr-x 1 root root 232 Aug  3 15:00 root.20140920.001020.backup

The "?" ones are also not deletable. Both machines are giving transid verify failed errors. I verified my logfiles and this problem was never there using previous kernel versions. On machine 1, it is also certain that it was not any previous corruption, as this filesystem was created with btrfs-progs 3.16.2 using kernel 3.17.

On 10/12/2014 05:24 PM, john terragon wrote:
Hi.
[John's message quoted in full; snipped]
Re: [REGRESSION?] Used+avail gives more than size of device
Martin Steigerwald posted on Sun, 12 Oct 2014 11:56:51 +0200 as excerpted:

On 3.17, i.e. since the size reporting changes, I get:

merkaba:~ LANG=C df -hT -t btrfs
Filesystem              Type   Size  Used Avail Use% Mounted on
/dev/mapper/sata-debian btrfs   30G   19G   21G  48% /
/dev/mapper/sata-debian btrfs   30G   19G   21G  48% /mnt/debian-zeit
/dev/mapper/msata-daten btrfs  200G  185G   15G  93% /daten
/dev/mapper/msata-home  btrfs  160G  135G   48G  75% /mnt/home-zeit
/dev/mapper/msata-home  btrfs  160G  135G   48G  75% /home

I wonder about used and avail not adding up to total size of filesystem: 19+21 = 40 GiB instead of 30 GiB for / and 135+48 = 183 GiB for /home. Only /daten seems to be correct.

That's standard df, not btrfs fi df. Due to the way btrfs works and the constraints of the printing format that standard df uses, it cannot and will not present a full picture of filesystem usage. Some compromises must be made in the choice of which available filesystem stats to present and the manner in which they are presented within the limited df format, and no matter which compromises are chosen, standard df output will always look a bit screwy for /some/ btrfs filesystem layouts.

/ and /home are RAID 1 spanning two SSDs. /daten is single. I wondered about compression taken into account? They use compress=lzo. [...] Any explanation for the discrepancy, can it just be due to the compression?

It's not compression, but FWIW, I believe I know what's going on...
merkaba:~ mount | grep btrfs
/dev/mapper/sata-debian on / type btrfs (rw,noatime,compress=lzo,ssd,space_cache)
/dev/mapper/sata-debian on /mnt/debian-zeit type btrfs (rw,noatime,compress=lzo,ssd,space_cache)
/dev/mapper/msata-daten on /daten type btrfs (rw,noatime,compress=lzo,ssd,space_cache)
/dev/mapper/msata-home on /mnt/home-zeit type btrfs (rw,noatime,compress=lzo,ssd,space_cache)
/dev/mapper/msata-home on /home type btrfs (rw,noatime,compress=lzo,ssd,space_cache)

merkaba:~ btrfs fi sh
Label: 'debian'  uuid: […]
        Total devices 2 FS bytes used 18.47GiB
        devid 1 size 30.00GiB used 30.00GiB path /dev/mapper/sata-debian
        devid 2 size 30.00GiB used 30.00GiB path /dev/mapper/msata-debian
Label: 'daten'  uuid: […]
        Total devices 1 FS bytes used 184.82GiB
        devid 1 size 200.00GiB used 188.02GiB path /dev/mapper/msata-daten
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 134.39GiB
        devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
        devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home

merkaba:~ btrfs fi df /
Data, RAID1: total=27.99GiB, used=17.84GiB
System, RAID1: total=8.00MiB, used=16.00KiB
Metadata, RAID1: total=2.00GiB, used=645.88MiB
unknown, single: total=224.00MiB, used=0.00

merkaba:~ btrfs fi df /home
Data, RAID1: total=154.97GiB, used=131.46GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=2.93GiB
unknown, single: total=512.00MiB, used=0.00

merkaba:~ btrfs fi df /daten
Data, single: total=187.01GiB, used=184.53GiB
System, single: total=4.00MiB, used=48.00KiB
Metadata, single: total=1.01GiB, used=292.31MiB
unknown, single: total=112.00MiB, used=0.00

Side observation: doesn't look like you have btrfs-progs 3.16.1 yet, since btrfs fi df is still reporting "unknown" for that last chunk-type instead of "global reserve".
While I didn't follow the (standard) df information presentation change discussion closely enough to know what the resolution was, looking at the numbers above I believe I know what's going on with df.

First, focus on used, using / as an example.

df (standard) used:               19 G
btrfs fi show (total line) used:  18.47 GiB
btrfs fi df (sum all types) used: 17.84 GiB + 646 MiB ~= 18.5 GiB

So the displayed usage for all three reports agrees, roughly 19 G used.

Compression? Only actual (used) data/metadata can compress; the left-over free space won't, it's left over. So any effects of compression would be seen in the above used numbers. The numbers above are close enough to each other that compression can't be playing a part.[1]

OK, so what's the deal with (standard) df, then? If the discrepancy isn't coming from used, where's it coming from?

Simple enough. What's the big difference between the filesystem that appears correct and the other two? Big hint: take a look at the second field of the btrfs fi df output. Hint #2: in btrfs fi show, count the number of devices.

Back to standard df, available:

/       21 GiB  /2  10.5 GiB    10.5 (avail) +  19 (used) ~  30 GiB
/daten  15 GiB  --  15 GiB      15 (avail)   + 185 (used) ~ 200 GiB
/home   48 GiB  /2  24 GiB      24 (avail)   + 135 (used) ~ 160 GiB

It's the raid-factor. =:^) Btrfs in the kernel is apparently accounting for raid-factor in used space in whatever function standard df is using, but not in available space, even where
what is the best way to monitor raid1 drive failures?
Hi, I am testing some disk failure scenarios in a 2 drive raid1 mirror. They are 4GB each, virtual SATA drives inside virtualbox. To simulate the failure, I detached one of the drives from the system. After that, I see no sign of a problem except for these errors:

Oct 12 15:37:14 rock-dev kernel: btrfs: bdev /dev/sdb errs: wr 0, rd 0, flush 1, corrupt 0, gen 0
Oct 12 15:37:14 rock-dev kernel: lost page write due to I/O error on /dev/sdb

/dev/sdb is gone from the system, but btrfs fi show still lists it.

Label: raid1pool  uuid: 4e5d8b43-1d34-4672-8057-99c51649b7c6
        Total devices 2 FS bytes used 1.46GiB
        devid 1 size 4.00GiB used 2.45GiB path /dev/sdb
        devid 2 size 4.00GiB used 2.43GiB path /dev/sdc

I am able to read and write just fine, but do see the above errors in dmesg. What is the best way to find out that one of the drives has gone bad?

Suman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: What is the vision for btrfs fs repair?
Martin Steigerwald posted on Sun, 12 Oct 2014 12:14:01 +0200 as excerpted:

> I always thought with a controller and device and driver combination
> that honors fsync, with BTRFS it would either be the new state or the
> last known good state *anyway*. So where does the need to rollback
> arise from?

My understanding here is...

With btrfs a full-tree commit is atomic. You should get either the old tree or the new tree. However, due to the cascading nature of updates on cow-based structures, these full-tree commits are done by default (there's a mount-option to adjust it) every 30 seconds. Between these atomic commits, partial updates may have occurred. The btrfs log (the one that btrfs-zero-log kills) is limited to between-commit updates, and thus to the up to 30 seconds (default) worth of changes since the last full-tree atomic commit.

In addition to that, there's a history of tree-root commits kept (with the superblocks pointing to the last one). Btrfs-find-tree-root can be used to list this history. The recovery mount option simply allows btrfs to fall back to this history, should the current root be corrupted. Btrfs restore can be used to list tree roots as well, and can be pointed at an appropriate one if necessary.

Fsync forces the file and its corresponding metadata update to the log and, barring hardware or software bugs, should not return until it's safely in the log, but I'm not sure whether it forces a full-tree commit. Either way the guarantees should be the same. If the log can be replayed or a full-tree commit has occurred since the fsync, the new copy should appear. If it can't, the rollback to the last atomic tree commit should return an intact copy of the file from that point. If the recovery mount option is used and a further rollback to an earlier full-tree commit is forced, provided it existed at the point of that full-tree commit, the intact file at that point should appear.
So if the current tree root is a good one, the log will replay the last up to 30 seconds of activity on top of that last atomic tree root. If the current root tree itself is corrupt, the recovery mount option will let an earlier one be used. Obviously in that case the log will be discarded, since it applies to a later root tree that itself has been discarded.

The debate is whether recovery should be automated so the admin doesn't have to care about it, or whether having to manually add that option serves as a necessary notifier to the admin that something /did/ go wrong, and that an earlier root is being used instead, so more than a few seconds worth of data may have disappeared.

As someone else has already suggested, I'd argue that as long as btrfs continues to be under the sort of development it's in now, keeping recovery as a non-default option is desired. Once it's optimized and considered stable, arguably recovery should be made the default, perhaps with a no-recovery option for those who prefer that in-the-face notification in the form of a mount error, if btrfs would otherwise fall back to an earlier tree root commit.

What worries me, however, is that IMO the recent warning stripping was premature. Btrfs is certainly NOT fully stable or optimized for normal use at this point. We're still using the even/odd PID balancing scheme for raid1 reads, for instance, and multi-device writes are still serialized when they could be parallelized to a much larger degree (tho keeping some serialization is arguably good for data safety). Arguably optimizing that now would be premature optimization since the code itself is still subject to change, so I'm not complaining, but by that very same token, it *IS* still subject to change, which by definition means it's *NOT* stable, so why are we removing all the warnings and giving the impression that it IS stable?
The decision wasn't mine to make and I don't know, but while it's a nice suggestion, making recovery-by-default a measure of when btrfs goes stable simply won't work: surely the same folks behind the warning stripping would then ensure that this indicator, too, said btrfs was stable, while the state of the code itself continues to say otherwise.

Meanwhile, if your distributed-transactions scenario doesn't account for crash and loss of data on one side with real-time backup/redundancy, such that loss of a few seconds worth of transactions on a single local filesystem is going to kill the entire scenario, I don't think too much of that scenario in the first place, and regardless, btrfs, certainly in its current state, is definitely NOT an appropriate base for it. Use appropriate tools for the task. Btrfs, at least at this point, is simply not an appropriate tool for that task.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: TEST PING EOM
On Sun, Oct 12, 2014 at 2:45 PM, royy walls ckn...@gmail.com wrote:

-- http://www.tux.org/lkml/#s3

Test messages are very, very inappropriate on the lkml or any other list, for that matter. If you want to know whether the subscribe succeeded, wait for a couple of hours after you get a reply from the mailing list software saying it did. You'll undoubtedly get a number of list messages. If you want to know whether you can post, you must have something important to say, right? After you have read the following paragraphs, compose a real letter, not a test message, in an editor, saving the body of the letter in the off chance your post doesn't succeed. Then post your letter to lkml. Please remember that there are quite a number of subscribers, and it will take a while for your letter to be reflected back to you. An hour is not too long to wait.
Re: TEST PING EOM
I apologize for this; I'm new to this and was not sure whether it was working or not.

On Sun, Oct 12, 2014 at 5:14 PM, cwillu cwi...@cwillu.com wrote:

> On Sun, Oct 12, 2014 at 2:45 PM, royy walls ckn...@gmail.com wrote:
>
> -- http://www.tux.org/lkml/#s3
>
> Test messages are very, very inappropriate on the lkml or any other
> list, for that matter. If you want to know whether the subscribe
> succeeded, wait for a couple of hours after you get a reply from the
> mailing list software saying it did. You'll undoubtedly get a number of
> list messages. If you want to know whether you can post, you must have
> something important to say, right? After you have read the following
> paragraphs, compose a real letter, not a test message, in an editor,
> saving the body of the letter in the off chance your post doesn't
> succeed. Then post your letter to lkml. Please remember that there are
> quite a number of subscribers, and it will take a while for your letter
> to be reflected back to you. An hour is not too long to wait.
Re: [PATCH] Btrfs: return failure if btrfs_dev_replace_finishing() failed
Guan

On Sat, 11 Oct 2014 14:45:29 +0800, Eryu Guan wrote:

> device replace could fail due to another running scrub process, but
> this failure doesn't get returned to userspace. The following steps
> could reproduce this issue
>
>   mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
>   mount /dev/sdb1 /mnt/btrfs
>   while true; do btrfs scrub start -B /mnt/btrfs >/dev/null 2>&1; done
>   btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs
>   # if this replace succeeded, do the following and repeat until
>   # you see this log in dmesg
>   # BTRFS: btrfs_scrub_dev(/dev/sdb2, 2, /dev/sdb3) failed -115
>   # btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs
>   # once you see the error log in dmesg, check return value of
>   # replace
>   echo $?
>
> Also only WARN_ON if the return code is not -EINPROGRESS.
>
> Signed-off-by: Eryu Guan guane...@gmail.com

Ping, any comments on this patch?

Thanks, Eryu

> ---
>  fs/btrfs/dev-replace.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index eea26e1..44d32ab 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -418,9 +418,11 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
>  			      &dev_replace->scrub_progress, 0, 1);
>  	ret = btrfs_dev_replace_finishing(root->fs_info, ret);
> -	WARN_ON(ret);
> +	/* don't warn if EINPROGRESS, someone else might be running scrub */
> +	if (ret != -EINPROGRESS)
> +		WARN_ON(ret);

Picky comment: I prefer WARN_ON(ret && ret != -EINPROGRESS).

Yes, this is simpler :)

> -	return 0;
> +	return ret;

Here we will return -EINPROGRESS if scrub is running. I think it better that we assign some special number to args->result, and then return 0, just like the case where the device replace is running.

Seems that requires a new result type, say

	#define BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS	3

and assign this result to args->result if btrfs_scrub_dev() returned -EINPROGRESS.

But I don't think returning 0 unconditionally is a good idea, since btrfs_dev_replace_finishing() could return other errors too; that way these errors would be lost, and userspace still wouldn't catch the errors ($? is 0).

Of course. Maybe the above explanation of mine was not so clear. In fact, I just talked about the EINPROGRESS case; for the other cases, returning the error code is better. What I'm thinking about is something like:

	ret = btrfs_scrub_dev(...);
	ret = btrfs_dev_replace_finishing(root->fs_info, ret);
	if (ret == -EINPROGRESS) {
		args->result = BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS;
		ret = 0;
	} else {
		WARN_ON(ret);
	}
	return ret;

What do you think? If no objection I'll work on v2.

I like it.

Thanks, Miao

Thanks for your review!

Eryu

> leave:
> 	dev_replace->srcdev = NULL;
> @@ -538,7 +540,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>  	btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
>  	mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
> -	return 0;
> +	return scrub_ret;
>  }
>  	printk_in_rcu(KERN_INFO
> --
> 1.8.3.1
Re: btrfs: kernel BUG at fs/btrfs/extent_io.c:676!
Ping? This BUG_ON()ing due to GFP_ATOMIC allocation failure is really silly :(

On 03/23/2014 09:26 PM, Sasha Levin wrote:

Hi all,

While fuzzing with trinity inside a KVM tools guest running the latest -next kernel I've stumbled on the following spew. This is a result of a failed allocation in alloc_extent_state_atomic() which triggers a BUG_ON when the return value is NULL. It's a bit weird that it BUGs on failed allocations, since it's obviously not a critical failure.

[  447.705167] kernel BUG at fs/btrfs/extent_io.c:676!
[  447.706201] invalid opcode: [#1] PREEMPT SMP DEBUG_PAGEALLOC
[  447.707732] Dumping ftrace buffer:
[  447.708473]    (ftrace buffer empty)
[  447.709684] Modules linked in:
[  447.710246] CPU: 17 PID: 4195 Comm: kswapd17 Tainted: G W 3.14.0-rc7-next-20140321-sasha-00018-g0516fe6-dirty #265
[  447.710253] task: 88066be9b000 ti: 88066be82000 task.ti: 88066be82000
[  447.710253] RIP: clear_extent_bit (fs/btrfs/extent_io.c:676)
[  447.710253] RSP: :88066be83768 EFLAGS: 00010246
[  447.710253] RAX: RBX: 00d00fff RCX: 0006
[  447.710253] RDX: 58e0 RSI: 88066be9bd60 RDI: 0286
[  447.710253] RBP: 88066be837e8 R08: R09:
[  447.710253] R10: 0001 R11: 454a4e495f544c55 R12: 01ff
[  447.710253] R13: R14: 88007b89fd08 R15: 00d0
[  447.710253] FS: () GS:8804acc0() knlGS:
[  447.710253] CS: 0010 DS: ES: CR0: 8005003b
[  447.710253] CR2: 02aec968 CR3: 05e29000 CR4: 06a0
[  447.710253] DR0: 00698000 DR1: 00698000 DR2:
[  447.710253] DR3: DR6: 0ff0 DR7: 0600
[  447.710253] Stack:
[  447.710253]  88066be83788 844fc4d5 8804ab4800e8
[  447.710253]  0001 8804ab4800c8 fbf7
[  447.710253]  88066be837c8 0006 ea0007aaf340
[  447.710253] Call Trace:
[  447.710253] ? _raw_spin_unlock (arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
[  447.710253] try_release_extent_mapping (fs/btrfs/extent_io.c:3998 fs/btrfs/extent_io.c:4058)
[  447.710253] __btrfs_releasepage (fs/btrfs/inode.c:7521)
[  447.710253] btrfs_releasepage (fs/btrfs/inode.c:7534)
[  447.710253] try_to_release_page (mm/filemap.c:2984)
[  447.710253] invalidate_inode_page (mm/truncate.c:165 mm/truncate.c:215)
[  447.710253] invalidate_mapping_pages (mm/truncate.c:517)
[  447.710253] inode_lru_isolate (arch/x86/include/asm/current.h:14 include/linux/swap.h:33 fs/inode.c:724)
[  447.710253] ? insert_inode_locked (fs/inode.c:687)
[  447.710253] list_lru_walk_node (mm/list_lru.c:89)
[  447.710253] prune_icache_sb (fs/inode.c:759)
[  447.710253] super_cache_scan (fs/super.c:96)
[  447.710253] shrink_slab_node (mm/vmscan.c:306)
[  447.710253] shrink_slab (mm/vmscan.c:381)
[  447.710253] kswapd_shrink_zone (mm/vmscan.c:2909)
[  447.710253] kswapd (mm/vmscan.c:3090 mm/vmscan.c:3296)
[  447.710253] ? mem_cgroup_shrink_node_zone (mm/vmscan.c:3213)
[  447.710253] kthread (kernel/kthread.c:219)
[  447.710253] ? __tick_nohz_task_switch (arch/x86/include/asm/paravirt.h:809 kernel/time/tick-sched.c:272)
[  447.710253] ? kthread_create_on_node (kernel/kthread.c:185)
[  447.710253] ret_from_fork (arch/x86/kernel/entry_64.S:555)
[  447.710253] ? kthread_create_on_node (kernel/kthread.c:185)
[  447.710253] Code: e9 a9 00 00 00 0f 1f 00 48 39 c3 0f 82 87 00 00 00 4c 39 e3 0f 83 7e 00 00 00 48 8b 7d a0 e8 45 ef ff ff 48 85 c0 49 89 c5 75 05 <0f> 0b 0f 1f 00 48 8b 7d b0 48 8d 4b 01 48 89 c2 4c 89 f6 e8 c5
[  447.710253] RIP clear_extent_bit (fs/btrfs/extent_io.c:676)
[  447.710253] RSP 88066be83768

Thanks, Sasha
Re: what is the best way to monitor raid1 drive failures?
Suman,

> To simulate the failure, I detached one of the drives from the system.
> After that, I see no sign of a problem except for these errors:

Are you physically pulling out the device? I wonder if lsblk or blkid shows the error?

The logic to report a missing device is in the progs (so use the latest), and it works provided a user-space tool such as blkid/lsblk also reports the problem. OR, for soft-detach tests, you could use devmgt at http://github.com/anajain/devmgt

I am also trying to get a device management framework into btrfs, with better device management and reporting.

Thanks, Anand

On 10/13/14 07:50, Suman C wrote:
> Hi, I am testing some disk failure scenarios in a 2 drive raid1 mirror.
> They are 4GB each, virtual SATA drives inside virtualbox. To simulate
> the failure, I detached one of the drives from the system. After that,
> I see no sign of a problem except for these errors:
>
> Oct 12 15:37:14 rock-dev kernel: btrfs: bdev /dev/sdb errs: wr 0, rd 0, flush 1, corrupt 0, gen 0
> Oct 12 15:37:14 rock-dev kernel: lost page write due to I/O error on /dev/sdb
>
> /dev/sdb is gone from the system, but btrfs fi show still lists it.
>
> Label: raid1pool  uuid: 4e5d8b43-1d34-4672-8057-99c51649b7c6
>         Total devices 2 FS bytes used 1.46GiB
>         devid 1 size 4.00GiB used 2.45GiB path /dev/sdb
>         devid 2 size 4.00GiB used 2.43GiB path /dev/sdc
>
> I am able to read and write just fine, but do see the above errors in
> dmesg. What is the best way to find out that one of the drives has gone
> bad?
>
> Suman
Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map
-------- Original Message --------
Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map
From: Filipe David Manana fdman...@gmail.com
To: Qu Wenruo quwen...@cn.fujitsu.com
Date: 2014-10-10 16:08

On Fri, Oct 10, 2014 at 3:39 AM, Qu Wenruo quwen...@cn.fujitsu.com wrote:

-------- Original Message --------
Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map
From: Filipe David Manana fdman...@gmail.com
To: Qu Wenruo quwen...@cn.fujitsu.com
Date: 2014-10-09 18:27

On Thu, Oct 9, 2014 at 1:28 AM, Qu Wenruo quwen...@cn.fujitsu.com wrote:

-------- Original Message --------
Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map
From: Filipe David Manana fdman...@gmail.com
To: Qu Wenruo quwen...@cn.fujitsu.com
Date: 2014-10-08 20:08

On Fri, Sep 19, 2014 at 1:31 AM, Qu Wenruo quwen...@cn.fujitsu.com wrote:

-------- Original Message --------
Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map
From: Filipe David Manana fdman...@gmail.com
To: Qu Wenruo quwen...@cn.fujitsu.com
Date: 2014-09-18 21:16

On Wed, Sep 17, 2014 at 4:53 AM, Qu Wenruo quwen...@cn.fujitsu.com wrote:

> The following commit enhanced merge_extent_mapping() to reduce
> fragmentation in the extent map tree, but it can't handle the case
> where the existing extent lies before map_start:
> 51f39 btrfs: Use right extent length when inserting overlap extent map.
>
> [BUG]
> When the existing extent map's start is before map_start, em->len will
> go negative, which will corrupt the extent map and fail to insert the
> new extent map. This will happen when someone gets a large extent map,
> but by the time it is going to be inserted into the extent map tree,
> someone else has already committed some writes and split the huge
> extent into small parts.

This sounds very deterministic to me. Any reason not to add tests to the sanity tests that exercise this/these case/cases?

Yes, thanks for the heads-up. Will add the test case for it soon.

Hi Qu,

Any progress on the test?
This is a very important one IMHO, not only because of the bad consequences of the bug (extent map corruption, leading to all sorts of chaos), but also because this problem was not found by the full xfstests suite on several developer machines.

thanks

Still trying to reproduce it under the xfstests framework.

That's the problem: it wasn't apparently reproducible (or detectable at least) by anyone with xfstests.

I'll try to build a C program that behaves the same as filebench and see if it works. At least with filebench, it can be triggered in 60s with 100% probability. But even following the FileBench randomrw behavior (1 thread doing random reads, 1 thread doing random writes on preallocated space), I still failed to reproduce it. Still investigating how to reproduce it. Worst case may be to add a new C program into the src dir of xfstests?

How about the sanity tests (fs/btrfs/tests/*.c)? Create an empty map tree, add some extent maps, then try to merge some new extent maps that used to fail before this fix. Seems simple, no?

thanks

Qu

It needs concurrent read and write (commit) to trigger it; I am not sure it can be reproduced in the sanity tests, since they don't seem to commit things and lack multithread facilities.

Hum? Why do concurrency or persistence matter? Let's review the problem. So you fixed the function inode.c:merge_extent_mapping(). That function merges a new extent map (not in the extent map tree) with an existing extent map (which is in the tree). The issue was that the merge was incorrect for some cases, producing a bad extent map (compared to the rest of the existing extent maps) that either overlaps existing ones or introduces incorrect gaps, etc; the reason doesn't really matter. Now, this function is run while holding the write lock of the inode's extent map tree. So why does concurrency (or persistence) matter here?

It is true that the patch only fixed the above merge problem. But the bug involves more.

1) the direct cause.
The existing extent map's start is smaller than map_start in merge_extent_mapping().

2) the root cause.

As described in my V2 patch, there is a window between btrfs_release_path() and write_lock(&em_tree->lock) in btrfs_get_extent(), and under concurrency, one may get a big extent map converted from an on-disk file extent; during the window, a commit happens and the original extent map is split into several small ones. So 1) will happen and cause the bug. At least the reporter's filebench reproducer can be explained like the above, and that's why concurrency is needed to trigger the bug under such circumstances.

Why can't we have a sanity test that simply reproduces a scenario where, immediately after attempting to merge extent maps, we get an (in-memory) extent map that is incorrect?

There is another situation triggering the bug (just like the mail below), but the above known circumstance needs concurrency to let the commit happen
Re: btrfs send and kernel 3.17
Some more info I thought of. For me, the corruption problem seems not to be send related but snapshot-creation related. On machine 2, send was never used. However, both filesystems are stored on SSDs (of different brands). Another filesystem stored on a normal HDD didn't experience the problem. Maybe this is pure coincidence and has nothing to do with the fact that it is on SSD or HDD. Another thing I noticed is that for me, the problem only seems to occur for root subvolumes with many small files. I have no root subvolumes on HDD, so it might not be SSD related.

On 10/12/2014 11:35 PM, David Arendt wrote:

Just to let you know, I just tried an ls -l on 2 machines running kernel 3.17 and btrfs-progs 3.16.2. Here is my ls -l output:

Machine 1:

ls: cannot access root.20141009.000503.backup: Cannot allocate memory
total 0
d????????? ? ?    ?    ?            ? root.20141009.000503.backup
drwxr-xr-x 1 root root 182 Oct  7 20:35 root.20141012.095526.backup
drwxr-xr-x 1 root root 182 Oct  7 20:35 root.20141012.000503.backup
drwxr-xr-x 1 root root 182 Oct  7 20:35 root.20141011.000502.backup
drwxr-xr-x 1 root root 182 Oct  7 20:35 root.20141010.000502.backup

root.20141009.000503.backup is not deletable.

Machine 2:

ls: cannot access root.20141006.003239.backup: Cannot allocate memory
ls: cannot access root.20141007.001616.backup: Cannot allocate memory
ls: cannot access root.20141008.000501.backup: Cannot allocate memory
ls: cannot access root.20141009.052436.backup: Cannot allocate memory
total 0
d????????? ? ?    ?    ?            ? root.20141009.052436.backup
d????????? ? ?    ?    ?            ? root.20141008.000501.backup
d????????? ? ?    ?    ?            ? root.20141007.001616.backup
d????????? ? ?    ?    ?            ? root.20141006.003239.backup
drwxr-xr-x 1 root root 232 Aug  3 15:00 root.20140925.001125.backup
drwxr-xr-x 1 root root 232 Aug  3 15:00 root.20140924.001017.backup
drwxr-xr-x 1 root root 232 Aug  3 15:00 root.20140923.001008.backup
drwxr-xr-x 1 root root 232 Aug  3 15:00 root.20140922.001836.backup
drwxr-xr-x 1 root root 232 Aug  3 15:00 root.20140921.001029.backup
drwxr-xr-x 1 root root 232 Aug  3 15:00 root.20140920.001020.backup

The ? ones are also not deletable. Both machines are giving transid verify failed errors. I verified my logfiles and this problem was never there using previous kernel versions. On machine 1, it is also certain that it was not any previous corruption, as this filesystem had also been created with btrfs-progs 3.16.2 using kernel 3.17.

On 10/12/2014 05:24 PM, john terragon wrote:

Hi. I just wanted to confirm David's story, so to speak :)

- kernel 3.17-rc7 (didn't bother to compile 3.17 as there weren't any btrfs fixes, I think)
- btrfs-progs 3.16.2 (also compiled from source, so no distribution-specific patches)
- fresh fs
- I get the same two errors David got (first I got the I/O error one and then the memory allocation one)
- plus, now when I ls -la the fs top volume, this is what I get:

drwxrwsr-x 1 root staff  30 Sep 11 16:15 home
d????????? ? ?    ?       ?            ? home-backup
drwxr-xr-x 1 root root  250 Oct 10 15:37 root
d????????? ? ?    ?       ?            ? root-backup
drwxr-xr-x 1 root root   88 Sep 15 16:02 vms
drwxr-xr-x 1 root root   88 Sep 15 16:02 vms-backup

Yes, the question marks on those two *-backup snapshots are really there. I can't access the snapshots, I can't delete them, I can't do anything with them.

- btrfs check segfaults
- the events that led to this situation are these:

1) btrfs su snap -r root root-backup
2) send|receive (the entire root-backup, not an incremental send): immediate I/O error
3) move on to home: btrfs su snap -r home home-backup
4) send|receive (again not an incremental send): everything goes well (!)
5) retry with root: btrfs su snap -r root root-backup
6) send|receive, and it goes seemingly well
7) apt-get dist-upgrade, just to modify root and try an incremental send
8) reboot after the dist-upgrade
9) ls -la the fs top volume: first I get the memory allocation error, and after that any ls -la gives the output I pasted above. (Notice that besides the ls -la, the two snapshots were not touched in any way since the two send|receive.)

A few final notes. I haven't tried send/receive in a while (they were unreliable), so I can't tell which is the last version in which they worked for me (well, no version actually :) ). I've never had any problem with just snapshots. I make them regularly, I use them, I modify them, and I've never had one problem (with 3.17 too; it's just send/receive that murders them).

Best regards
John
[PATCH v2] Btrfs: return failure if btrfs_dev_replace_finishing() failed
device replace could fail due to another running scrub process or any other errors btrfs_scrub_dev() may hit, but this failure doesn't get returned to userspace. The following steps could reproduce this issue

  mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
  mount /dev/sdb1 /mnt/btrfs
  while true; do btrfs scrub start -B /mnt/btrfs >/dev/null 2>&1; done
  btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs
  # if this replace succeeded, do the following and repeat until
  # you see this log in dmesg
  # BTRFS: btrfs_scrub_dev(/dev/sdb2, 2, /dev/sdb3) failed -115
  # btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs
  # once you see the error log in dmesg, check return value of
  # replace
  echo $?

Introduce a new dev replace result

  BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS

to catch -EINPROGRESS explicitly and return other errors directly to userspace.

Signed-off-by: Eryu Guan guane...@gmail.com
---
v2:
- set result to SCRUB_INPROGRESS if btrfs_scrub_dev returned -EINPROGRESS and return 0, as Miao Xie suggested

 fs/btrfs/dev-replace.c     | 12 +++++++++---
 include/uapi/linux/btrfs.h |  1 +
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index eea26e1..a141f8b 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -418,9 +418,15 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
 			      &dev_replace->scrub_progress, 0, 1);
 	ret = btrfs_dev_replace_finishing(root->fs_info, ret);
-	WARN_ON(ret);
+	/* don't warn if EINPROGRESS, someone else might be running scrub */
+	if (ret == -EINPROGRESS) {
+		args->result = BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS;
+		ret = 0;
+	} else {
+		WARN_ON(ret);
+	}
 
-	return 0;
+	return ret;
 
 leave:
 	dev_replace->srcdev = NULL;
@@ -538,7 +544,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 	btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
 	mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
 
-	return 0;
+	return scrub_ret;
 }
 
 	printk_in_rcu(KERN_INFO
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 2f47824..611e1c5 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -157,6 +157,7 @@ struct btrfs_ioctl_dev_replace_status_params {
 #define BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR		0
 #define BTRFS_IOCTL_DEV_REPLACE_RESULT_NOT_STARTED	1
 #define BTRFS_IOCTL_DEV_REPLACE_RESULT_ALREADY_STARTED	2
+#define BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS	3
 struct btrfs_ioctl_dev_replace_args {
 	__u64 cmd;	/* in */
 	__u64 result;	/* out */
--
1.8.3.1
[PATCH] btrfs-progs: add new dev replace result
A new dev replace result was introduced by kernel commit "Btrfs: return failure if btrfs_dev_replace_finishing() failed". Make userspace know about the new result too.

Signed-off-by: Eryu Guan guane...@gmail.com
---
 cmds-replace.c | 2 ++
 ioctl.h        | 1 +
 2 files changed, 3 insertions(+)

diff --git a/cmds-replace.c b/cmds-replace.c
index 9fe7ad8..7a45cef 100644
--- a/cmds-replace.c
+++ b/cmds-replace.c
@@ -53,6 +53,8 @@ static const char *replace_dev_result2string(__u64 result)
 		return "not started";
 	case BTRFS_IOCTL_DEV_REPLACE_RESULT_ALREADY_STARTED:
 		return "already started";
+	case BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS:
+		return "scrub is in progress";
 	default:
 		return "illegal result value";
 	}
diff --git a/ioctl.h b/ioctl.h
index f0fc060..0e02fae 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -144,6 +144,7 @@ struct btrfs_ioctl_dev_replace_status_params {
 #define BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR		0
 #define BTRFS_IOCTL_DEV_REPLACE_RESULT_NOT_STARTED	1
 #define BTRFS_IOCTL_DEV_REPLACE_RESULT_ALREADY_STARTED	2
+#define BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS	3
 struct btrfs_ioctl_dev_replace_args {
 	__u64 cmd;	/* in */
 	__u64 result;	/* out */
--
1.8.3.1