System Storage Manager
Hello everyone,

I would like to introduce to you a new tool called System Storage Manager (ssm). It is supposed to provide an easy-to-use command-line interface to manage your storage using various technologies like lvm, btrfs, encrypted volumes and possibly more.

Background
----------

In more sophisticated enterprise storage environments, management with Device Mapper (dm), Logical Volume Manager (LVM), or Multiple Devices (md) is becoming increasingly difficult. With file systems added to the mix, the number of tools needed to configure and manage storage has grown so large that it is simply not user friendly. With so many options for a system administrator to consider, the opportunity for errors and problems is large.

The btrfs administration tools have shown us that storage management can be simplified, and we are working to bring that ease of use to Linux filesystems in general. You can also find some more information in my presentation from LinuxCon Prague this year:

  http://people.redhat.com/lczerner/files/lczerner_fsm.pdf

The code is still under development and no release has been made yet, but I would like to share with you what I have done so far, since the progress has been a bit slower than I had expected. The project lives on sf.net:

  https://sourceforge.net/projects/storagemanager/

and you can grab the source files from the git repository here:

  https://sourceforge.net/p/storagemanager/code/ci/a1a5fd616d06030f94b9d2e80ee6ebcad09ad35f/tree/

More information can be found on the project's home page, or in the README file:

  https://sourceforge.net/p/storagemanager/home/Home/

Notes
-----

- It is written in python
- So far it supports the commands: check, resize, create, list, add, remove
- More commands to come: mirror, snapshot(!)
- It does not support raid yet, except raid 0 from lvm (striped volume)
- It has been tested with python 2.7, but I would like to make it work on python 2.6 as well
- It comes with some doctests, unittests and regression tests written in bash (although those are for lvm only so far)
- Its modular design should make it relatively simple to add support for back-ends other than lvm or btrfs

Things to be done before the actual release
-------------------------------------------

- Create btrfs bash tests
- Create btrfs unittests
- Extend python unittests to other backends
- Use wipefs -a before using the device in add()
- Consider using wipefs -a after removing the device
- Remove the physical volume after it is removed from the group
- Figure out how to create better pool names so they are unique within a system and between systems
- Add mirror support
- Add snapshot support
- Add raid support
- Use lsblk and blkid to get information
- Better table alignment when the output spans multiple lines
- Better error handling - not just a plain Exception, but rather named exceptions handled in main() in the ssm module
- Update the readme
- Add more documentation in the code

Of course any comments or ideas would be highly appreciated. Please report bugs directly to me.

Thanks!
-Lukas
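[Editor's note: as an illustration of the "too many tools" problem described in the announcement, here is a minimal sketch of the LVM-based workflow that ssm intends to fold into single commands. The LVM and ext4 commands are standard; the ssm invocations at the end are commented out because they are only a guess at the eventual syntax -- no release exists yet.]

#!/bin/sh
# Today's LVM workflow: four different tools just to get one mounted
# filesystem, and two more to grow it later. Device names are placeholders.
pvcreate /dev/sdb1 /dev/sdc1          # initialise physical volumes
vgcreate pool /dev/sdb1 /dev/sdc1     # group them into a volume group
lvcreate -L 100G -n data pool         # carve out a logical volume
mkfs.ext4 /dev/pool/data              # put a filesystem on it
mount /dev/pool/data /mnt/data

lvextend -L +50G /dev/pool/data       # growing it later is two more steps
resize2fs /dev/pool/data

# The announced ssm commands (create, list, resize, ...) aim to collapse
# the steps above; the exact syntax is not fixed yet, so these lines are
# only a sketch of the intended shape:
# ssm create /dev/sdb1 /dev/sdc1 /mnt/data
# ssm resize -s +50G /mnt/data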
Re: System Storage Manager
On 12/07/2011 10:20 AM, Lukas Czerner wrote:
> Hello everyone,
>
> I would like to introduce to you a new tool called System Storage
> Manager (ssm). It is supposed to provide an easy-to-use command-line
> interface to manage your storage using various technologies like lvm,
> btrfs, encrypted volumes and possibly more.
>
> Background
> ----------
>
> In more sophisticated enterprise storage environments, management with
> Device Mapper (dm), Logical Volume Manager (LVM), or Multiple Devices
> (md) is becoming increasingly difficult. With file systems added to the
> mix, the number of tools needed to configure and manage storage has
> grown so large that it is simply not user friendly. With so many options
> for a system administrator to consider, the opportunity for errors and
> problems is large.

This seems like a worthwhile project given the overlap in functionality of the various commands. It echoes some of the functionality provided by libguestfs for VM images, I think.

> The btrfs administration tools have shown us that storage management can
> be simplified, and we are working to bring that ease of use to Linux
> filesystems in general. You can also find some more information in my
> presentation from LinuxCon Prague this year:
>
>   http://people.redhat.com/lczerner/files/lczerner_fsm.pdf

This had the name 'fsm'. I do think 'ssm' is better; however, I would drop the '.py' from the command name. This is incidental and restrictive going forward. Note I checked the debian and freebsd man pages and 'ssm' is available.

> The code is still under development and no release has been made yet,
> but I would like to share with you what I have done so far, since the
> progress has been a bit slower than I had expected. The project lives
> on sf.net:
>
>   https://sourceforge.net/projects/storagemanager/

This is a much cleaner landing page:
http://sourceforge.net/p/storagemanager/home/Home/

cheers,
Pádraig.
Re: [PATCH 2/2] Btrfs: fix deadlock on sb-s_umount when doing umount
On Wed, Dec 07, 2011 at 10:31:35AM +0800, Miao Xie wrote:
On tue, 6 Dec 2011 16:36:11 -0500, Chris Mason wrote:
On Tue, Dec 06, 2011 at 06:23:23AM -0500, Christoph Hellwig wrote:
On Tue, Dec 06, 2011 at 07:06:40PM +0800, Miao Xie wrote:

I can't see why you need the writeout when the trylock fails. Umount needs to take care of writing out all pending file data anyway, so doing it from the cleaner thread in addition doesn't sound like it would help.

umount invokes sync_fs() and writes out all the dirty file data. For other file systems this is OK, because the file system does not introduce dirty pages by itself. But btrfs is different: its automatic defragment will make lots of dirty pages after sync_fs() and reserve lots of metadata space for those pages. And then the cleaner thread may find there is not enough space to reserve, so it must sync the dirty file data and release the reserved space that is held for it.

I think the safest way to fix this is to write out all dirty data again once the cleaner thread has been safely stopped. Said another way, we want to stop the autodefrag code before the unmount is ready to continue. We also want to stop balancing, scrub etc.

But there is no good interface to do it before umount gets the s_umount lock. I think trylock (in writeback_inodes_sb_nr_if_idle()) + dirty data flush can help us fix the bug perfectly.

But it won't fix the "umount while balancing" family of deadlocks (they are really of the same nature: vfs grabs the s_umount mutex and we need it to proceed). (Balance cancelling code is part of the restriper patches; it's just a hook in close_ctree() that waits until we are done relocating a chunk - very similar to the cleaner wait.)

One example would be that the balancing code, while dirtying pages, calls balance_dirty_pages_ratelimited() for each dirtied page, as it should. And if balance_dirty_pages() then decides to initiate writeback, we are stuck schedule()ing forever, because writeback can't proceed without read-taking the s_umount mutex, which is fully held by vfs - it just skips the relocation inode.

Thanks,

Ilya
[PATCH] xfstests: new check 276 to ensure btrfs backref integrity
This is a btrfs-specific scratch test checking the backref walker. It creates a file system with compressed and uncompressed data extents, picks files randomly and uses filefrag to get their extents. It then asks the btrfs utility (inspect-internal) to do the backref resolving from the fs-logical address (the one filefrag calls "physical") back to the inode number and file-logical offset, verifying the result.

Signed-off-by: Jan Schmidt <list.bt...@jan-o-sch.net>
---
 276           |  230 +
 276.out       |    4 +
 common.config |    1 +
 group         |    1 +
 4 files changed, 236 insertions(+), 0 deletions(-)
 create mode 100755 276
 create mode 100644 276.out

diff --git a/276 b/276
new file mode 100755
index 000..f22d089
--- /dev/null
+++ b/276
@@ -0,0 +1,230 @@
+#! /bin/bash
+
+# creator
+owner=list.bt...@jan-o-sch.net
+
+seq=`basename $0`
+echo "QA output created by $seq"
+
+here=`pwd`
+# 1=production, 0=avoid touching the scratch dev (no mount/umount, no writes)
+fresh=1
+tmp=/tmp/$$
+status=1
+FSTYP=btrfs
+
+_cleanup()
+{
+	if [ $fresh -ne 0 ]; then
+		echo "*** unmount"
+		umount $SCRATCH_MNT 2>/dev/null
+	fi
+	rm -f $tmp.*
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+
+if [ $fresh -ne 0 ]; then
+	_require_scratch
+fi
+
+_require_nobigloopfs
+
+[ -n "$BTRFS_UTIL_PROG" ] || _notrun "btrfs executable not found"
+$BTRFS_UTIL_PROG inspect-internal --help > /dev/null 2>&1
+[ $? -eq 0 ] || _notrun "btrfs executable too old"
+which filefrag > /dev/null 2>&1
+[ $? -eq 0 ] || _notrun "filefrag missing"
+
+
+rm -f $seq.full
+
+FILEFRAG_FILTER='if (/, blocksize (\d+)/) {$blocksize = $1; next} ($ext, '\
+'$logical, $physical, $expected, $length, $flags) = (/^\s*(\d+)\s+(\d+)'\
+'\s+(\d+)\s+(?:(\d+)\s+)?(\d+)\s+(.*)/) or next; $flags =~ '\
+'/(?:^|,)inline(?:,|$)/ and next; print $physical * $blocksize, "#", '\
+'$length * $blocksize, "#", $logical * $blocksize, " "'
+
+_filter_extents()
+{
+	tee -a $seq.full | $PERL_PROG -ne "$FILEFRAG_FILTER"
+}
+
+_check_file_extents()
+{
+	cmd="filefrag -vx $1"
+	echo "# $cmd" >> $seq.full
+	out=`$cmd | _filter_extents`
+	if [ -z "$out" ]; then
+		return 1
+	fi
+	echo "after filter: $out" >> $seq.full
+	echo $out
+	return 0
+}
+
+_btrfs_inspect_addr()
+{
+	mp=$1
+	addr=$2
+	expect_addr=$3
+	expect_inum=$4
+	file=$5
+	cmd="$BTRFS_UTIL_PROG inspect-internal logical-resolve -P $addr $mp"
+	echo "# $cmd" >> $seq.full
+	out=`$cmd`
+	echo "$out" >> $seq.full
+	grep_expr="inode $expect_inum offset $expect_addr root"
+	echo "$out" | grep "^$grep_expr 5$" >/dev/null
+	ret=$?
+	if [ $ret -eq 0 ]; then
+		# look for a root number that is not 5
+		echo "$out" | grep "^$grep_expr \([0-46-9][0-9]*\|5[0-9]\+\)$" \
+			>/dev/null
+		ret=$?
+	fi
+	if [ $ret -eq 0 ]; then
+		return 0
+	fi
+	echo "unexpected output from"
+	echo "	$cmd"
+	echo "expected inum: $expect_inum, expected address: $expect_addr,"\
+		"file: $file, got:"
+	echo "$out"
+	return 1
+}
+
+_btrfs_inspect_inum()
+{
+	file=$1
+	inum=$2
+	snap_name=$3
+	mp="$SCRATCH_MNT/$snap_name"
+	cmd="$BTRFS_UTIL_PROG inspect-internal inode-resolve $inum $mp"
+	echo "# $cmd" >> $seq.full
+	out=`$cmd`
+	echo "$out" >> $seq.full
+	grep_expr="^$file$"
+	cnt=`echo "$out" | grep "$grep_expr" | wc -l`
+	if [ $cnt -ge 1 ]; then
+		return 0
+	fi
+	echo "unexpected output from"
+	echo "	$cmd"
+	echo "expected path: $file, got:"
+	echo "$out"
+	return 1
+}
+
+_btrfs_inspect_check()
+{
+	file=$1
+	physical=$2
+	length=$3
+	logical=$4
+	snap_name=$5
+	cmd="stat -c %i $file"
+	echo "# $cmd" >> $seq.full
+	inum=`$cmd`
+	echo "$inum" >> $seq.full
+	_btrfs_inspect_addr $SCRATCH_MNT/$snap_name $physical $logical $inum \
+		$file
+	ret=$?
+	if [ $ret -eq 0 ]; then
+		_btrfs_inspect_inum $file $inum $snap_name
+		ret=$?
+	fi
+	return $?
+}
+
+run_check()
+{
+	echo "# $@" >> $seq.full 2>&1
+	"$@" >> $seq.full 2>&1 || _fail "failed: '$@'"
+}
+
+workout()
+{
+	fsz=$1
+	nfiles=$2
+	procs=$3
+	snap_name=$4
+
+	if [ $fresh -ne 0 ]; then
+		umount $SCRATCH_DEV >/dev/null 2>&1
+		echo "*** mkfs -dsize=$fsz" >>$seq.full
+		echo "" >>$seq.full
+		_scratch_mkfs_sized $fsz >>$seq.full 2>&1 \
+			|| _fail "size=$fsz mkfs failed"
+		_scratch_mount
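[Editor's note: for anyone who wants to try the new test once the patch is applied, the usual xfstests workflow is roughly the sketch below; device names and mount points are placeholders for a local setup, not part of the patch.]

# Running the new case against a throwaway scratch device with xfstests.
cd xfstests
make

# check picks these up from the environment (or from local.config)
export TEST_DEV=/dev/sdb1 TEST_DIR=/mnt/test
export SCRATCH_DEV=/dev/sdc1 SCRATCH_MNT=/mnt/scratch

# the test is btrfs-only, so the test device should carry btrfs
mkfs.btrfs /dev/sdb1
mount /dev/sdb1 /mnt/test

sudo -E ./check 276      # run just this test
less 276.full            # full log of every command the test executed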
Cloning a Btrfs partition
I've got a 6TB btrfs array (two 3TB drives in a RAID 0). It's about 2/3 full and has lots of snapshots. I've written a script that runs through the snapshots and copies the data efficiently (rsync --inplace --no-whole-file) from the main 6TB array to a backup array, creating snapshots on the backup array and then continuing on copying the next snapshot. Problem is, it looks like it will take weeks to finish.

I've tried simply using dd to clone the btrfs partition, which technically appears to work, but then it appears that the UUID between the arrays is identical, so I can only mount one or the other. This means I can't continue to simply update the backup array with the new snapshots created on the main array (my script is capable of catching up the backup array with the new snapshots, but if I can't mount both arrays...).

Any suggestions?

-BJ Quinn
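[Editor's note: a per-snapshot rsync loop of the kind described above usually looks something like this sketch. The directory layout, the snapshot naming and the pre-created "current" subvolume on the backup array are assumptions for illustration, not details taken from BJ's actual script.]

#!/bin/bash
# Replicate read-only snapshots from the main array to the backup array,
# assuming snapshots live in $SRC/snapshots and sort chronologically, and
# that $DST/current was created beforehand with "btrfs subvolume create".
SRC=/mnt/main
DST=/mnt/backup

for snap in $(ls "$SRC/snapshots" | sort); do
	# skip snapshots that have already been replicated
	[ -d "$DST/snapshots/$snap" ] && continue

	# update the writable copy in place so rsync only rewrites the
	# blocks that actually changed since the previous snapshot
	rsync -a --inplace --no-whole-file --delete \
		"$SRC/snapshots/$snap/" "$DST/current/"

	# freeze the result as a snapshot on the backup array
	btrfs subvolume snapshot "$DST/current" "$DST/snapshots/$snap"
done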
Re: Cloning a Btrfs partition
On Wed, Dec 7, 2011 at 10:35 AM, BJ Quinn <b...@placs.net> wrote:
> I've got a 6TB btrfs array (two 3TB drives in a RAID 0). It's about 2/3
> full and has lots of snapshots. I've written a script that runs through
> the snapshots and copies the data efficiently (rsync --inplace
> --no-whole-file) from the main 6TB array to a backup array, creating
> snapshots on the backup array and then continuing on copying the next
> snapshot. Problem is, it looks like it will take weeks to finish.
>
> I've tried simply using dd to clone the btrfs partition, which
> technically appears to work, but then it appears that the UUID between
> the arrays is identical, so I can only mount one or the other. This
> means I can't continue to simply update the backup array with the new
> snapshots created on the main array (my script is capable of catching up
> the backup array with the new snapshots, but if I can't mount both
> arrays...).
>
> Any suggestions?

Until an analog of zfs send is added to btrfs (and I believe there are some side projects ongoing to add something similar), your only option is the one you are currently using via rsync.

--
Freddie Cash
fjwc...@gmail.com
Re: Cloning a Btrfs partition
> Until an analog of zfs send is added to btrfs (and I believe there are
> some side projects ongoing to add something similar), your only option
> is the one you are currently using via rsync.

Well, I don't mind using the rsync script, it's just that it's so slow. I'd love to use my script to keep up the backup array, which only takes a couple of hours and is acceptable. But starting with a blank backup array, it takes weeks to get the backup array caught up, which isn't realistically possible.

What I need isn't really an equivalent of zfs send -- my script can do that. As I remember, zfs send was pretty slow too in a scenario like this. What I need is to be able to clone a btrfs array somehow -- dd would be nice, but as I said I end up with the identical UUID problem. Is there a way to change the UUID of an array?

-BJ Quinn
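[Editor's note: to make the UUID clash concrete, this is roughly the dd experiment being described; device names are placeholders, and as far as I know btrfs-progs at this point offers no supported way to rewrite a filesystem UUID after mkfs.]

# dd copies the filesystem bit-for-bit, including the fs UUID stored in
# the superblock, so both copies identify themselves identically afterwards.
dd if=/dev/sdX of=/dev/sdY bs=1M

# both devices now report the same filesystem UUID, which is why only one
# of them can be mounted as "that" filesystem at a time
blkid /dev/sdX /dev/sdY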
WARNING: at fs/btrfs/extent-tree.c:4754 followed by BUG: unable to handle kernel NULL pointer dereference at (null)
Hello btrfs!

Recently I upgraded to 3.2.0-rc4 due to instabilities with my btrfs filesystem in 3.1.1. While with 3.1.1 my system completely froze, with 3.2.0-rc4 it stays at least somewhat usable (for some strange reason my xorg screen turns black as soon as this happens; only ssh is working then).

Scrubbing reports 1 uncorrectable error. I have had this error since my system froze due to some xorg graphics driver instability (I was trying out SNA acceleration for sandybridge). The problematic file seems to be in /usr/portage, but scrubbing doesn't tell me the filename (I was under the impression 3.2.x adds a patch which should report filenames).

Every time I run emerge (it is a gentoo system) my screen goes black after a few seconds and I can only revert to using ssh. Problem is: as soon as this happens, some filesystem accesses block the process in disk-sleep state and it cannot be killed. This initiates some feedback loop: from now on any other process trying to access the FS freezes. I can only reisub now. It seems to be fine if data comes from cache instead of from disk.

Any chance to fix the filesystem or make the kernel not get stuck? I'd hate to recreate the fs from scratch again.

Using Linus' tree from git, tagged v3.2-rc4. Here's my dmesg output:

[172816.292951] parent transid verify failed on 622147694592 wanted 130733 found 134506
[172816.292957] parent transid verify failed on 622147694592 wanted 130733 found 134506
[172816.292960] parent transid verify failed on 622147694592 wanted 130733 found 134506
[172816.292963] parent transid verify failed on 622147694592 wanted 130733 found 134506
[172816.292965] parent transid verify failed on 622147694592 wanted 130733 found 134506
[172816.292967] [ cut here ]
[172816.292972] WARNING: at fs/btrfs/extent-tree.c:4754 __btrfs_free_extent+0x290/0x5c7()
[172816.292974] Hardware name: To Be Filled By O.E.M.
[172816.292975] Modules linked in: zram(C) af_packet fuse snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss nls_iso8859_15 nls_cp437 vfat fat reiserfs loop nfs tcp_cubic lockd auth_rpcgss nfs_acl sunrpc sg snd_usb_audio snd_hwdep snd_usbmidi_lib snd_rawmidi snd_seq_device gspca_sonixj gspca_main videodev usb_storage v4l2_compat_ioctl32 uas usbhid hid pcspkr evdev i2c_i801 unix [last unloaded: microcode]
[172816.293004] Pid: 6193, comm: btrfs-delayed-m Tainted: G C 3.2.0-rc4 #2
[172816.293005] Call Trace:
[172816.293010] [8103327e] ? warn_slowpath_common+0x78/0x8c
[172816.293012] [8111ea5b] ? __btrfs_free_extent+0x290/0x5c7
[172816.293014] [810b2490] ? __slab_free+0xd1/0x236
[172816.293016] [81121d68] ? run_clustered_refs+0x66c/0x6b8
[172816.293018] [81121e7d] ? btrfs_run_delayed_refs+0xc9/0x173
[172816.293021] [8112faf0] ? __btrfs_end_transaction+0x90/0x1dd
[172816.293024] [810273b0] ? should_resched+0x5/0x24
[172816.293027] [81166981] ? btrfs_async_run_delayed_node_done+0x16c/0x1ca
[172816.293029] [8114f20f] ? worker_loop+0x170/0x46d
[172816.293031] [8114f09f] ? btrfs_queue_worker+0x25b/0x25b
[172816.293033] [8114f09f] ? btrfs_queue_worker+0x25b/0x25b
[172816.293036] [8104883b] ? kthread+0x7a/0x82
[172816.293040] [81415af4] ? kernel_thread_helper+0x4/0x10
[172816.293042] [810487c1] ? kthread_worker_fn+0x135/0x135
[172816.293043] [81415af0] ? gs_change+0xb/0xb
[172816.293045] ---[ end trace 095cf6945c90cf63 ]---
[172816.293046] btrfs unable to find ref byte nr 1871181426688 parent 0 root 2 owner 0 offset 0
[172816.293050] BUG: unable to handle kernel NULL pointer dereference at (null)
[172816.293054] IP: [81148998] map_private_extent_buffer+0x9/0xde
[172816.293057] PGD 0
[172816.293058] Oops: [#1] SMP
[172816.293060] CPU 1
[172816.293061] Modules linked in: zram(C) af_packet fuse snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss nls_iso8859_15 nls_cp437 vfat fat reiserfs loop nfs tcp_cubic lockd auth_rpcgss nfs_acl sunrpc sg snd_usb_audio snd_hwdep snd_usbmidi_lib snd_rawmidi snd_seq_device gspca_sonixj gspca_main videodev usb_storage v4l2_compat_ioctl32 uas usbhid hid pcspkr evdev i2c_i801 unix [last unloaded: microcode]
[172816.293078]
[172816.293079] Pid: 6193, comm: btrfs-delayed-m Tainted: GWC 3.2.0-rc4 #2 To Be Filled By O.E.M. To Be Filled By O.E.M./Z68 Pro3
[172816.293083] RIP: 0010:[81148998] [81148998] map_private_extent_buffer+0x9/0xde
[172816.293086] RSP: 0018:8801bb847b00 EFLAGS: 00010286
[172816.293088] RAX: 0067 RBX: 8801bb847b40 RCX: 8801bb847b40
[172816.293090] RDX: 0004 RSI: 007a RDI:
[172816.293092] RBP: 0065 R08: 8801bb847b38 R09: 8801bb847b30
[172816.293103] R10: R11: 0009 R12: 007a
[172816.293105] R13: R14:
Re: [PATCH 02/20] Btrfs: initialize new bitmaps' list
2011/12/1 Christian Brunner <c...@muc.de>:
> 2011/12/1 Alexandre Oliva <ol...@lsd.ic.unicamp.br>:
>> On Nov 29, 2011, Christian Brunner <c...@muc.de> wrote:
>>
>>> When I'm doing heavy reading in our ceph cluster, the load and wait-io
>>> on the patched servers is higher than on the unpatched ones.
>>
>> That's unexpected.
>
> In the mean time I know that it's not related to the reads.
>
>> I suppose I could wave my hands while explaining that you're getting
>> higher data throughput, so it's natural that it would take up more
>> resources, but that explanation doesn't satisfy me. I suppose
>> allocation might have got slightly more CPU intensive in some cases, as
>> we now use bitmaps where before we'd only use the cheaper-to-allocate
>> extents. But that's unsatisfying as well.
>
> I must admit that I do not completely understand the difference between
> bitmaps and extents. From what I see on my servers, I can tell that the
> degradation over time is gone. (Rebooting the servers every day is no
> longer needed. This is a real plus.) But the performance compared to a
> freshly booted, unpatched server is much slower with my ceph workload.
>
> I wonder if it would make sense to initialize the list field only when
> the cluster setup fails? This would avoid the fallback to the much
> unclustered allocation and would give us the cheaper-to-allocate
> extents.

I've now tried various combinations of your patches and I can really nail it down to this one line. With this patch applied I get much higher write-io values than without it. Some of the other patches help to reduce the effect, but it's still significant.

iostat on an unpatched node is giving me:

Device:   rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda       105.90     0.37   15.42   14.48  2657.33   560.13   107.61     1.89   62.75   6.26  18.71

while on a node with this patch it's:

sda       128.20     0.97   11.10   57.15  3376.80   552.80    57.58    20.58  296.33   4.16  28.36

Also interesting is the fact that the average request size on the patched node is much smaller. Josef was telling me that this could be related to the number of bitmaps we write out, but I've no idea how to trace this.

I would be very happy if someone could give me a hint on what to do next, as this is one of the last remaining issues with our ceph cluster.

Thanks,
Christian
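[Editor's note: for reference, extended per-device statistics like the ones quoted above are what sysstat's iostat prints in extended mode; a typical invocation (the device name is a placeholder) is:]

# Extended device statistics, refreshed every 5 seconds; the first report
# covers the time since boot, later ones cover each 5-second interval.
iostat -x -d 5 sda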
Yet Another Newb Question...
(Asking this question on this list kinda makes me wonder if there shouldn't be a btrfs-users list where folks could ask questions just like this without pestering developers...)

Anyway -- I had a root partition with a /snapshots directory, in which I placed a bunch of snapshots. At one point, I goofed stuff up, and decided to revert my root (using btrfs sub set-default) to one of the snapshots. Rebooted, and it worked great -- just like I'd hoped.

But where'd the snapshots in /snapshots go? I mean, I still see them if I do a btrfs sub list, but how do I *get* to them for, say, deleting? (I can still mount them via -o subvolid, but that's not quite the same thing.)

Suggestions?

Thanks kindly!

-Ken
Re: Yet Another Newb Question...
On Wed, Dec 07, 2011 at 04:14:46PM -0500, Ken D'Ambrosio wrote:
> (Asking this question on this list kinda makes me wonder if there
> shouldn't be a btrfs-users list where folks could ask questions just
> like this without pestering developers...)
>
> Anyway -- I had a root partition with a /snapshots directory, in which I
> placed a bunch of snapshots. At one point, I goofed stuff up, and
> decided to revert my root (using btrfs sub set-default) to one of the
> snapshots. Rebooted, and it worked great -- just like I'd hoped. But
> where'd the snapshots in /snapshots go?

Where they always were -- it's just that you've mounted a different bit of the filesystem, so you can't see them. :)

> I mean, I still see them if I do a btrfs sub list, but how do I *get* to
> them for, say, deleting? (I can still mount them via -o subvolid, but
> that's not quite the same thing.)

If you've got subvolumes outside your mounted filesystem, then you can either reach them by mounting via subvolid, or by mounting the top-level subvolume with subvolid=0 (on, say, /media/btrfs-top) and then navigating through that to the subvolume you want. See (my) recommended filesystem structure on the wiki[1].

   Hugo.

[1] http://btrfs.ipv5.de/index.php?title=SysadminGuide#Managing_snapshots

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- There are three things you should never see being made: laws, ---
standards, and sausages.
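[Editor's note: a concrete version of Hugo's suggestion, assuming the snapshots were created in a /snapshots directory of the original top-level subvolume; the device name and snapshot name below are placeholders.]

# Reach (and delete) snapshots that live outside the currently mounted
# subvolume by mounting the top-level subvolume somewhere else.
mkdir -p /media/btrfs-top
mount -o subvolid=0 /dev/sda2 /media/btrfs-top

# the old /snapshots directory of the original root is visible again here
ls /media/btrfs-top/snapshots

# snapshots are subvolumes, so they are removed with subvolume delete
btrfs subvolume delete /media/btrfs-top/snapshots/root-2011-11-30

umount /media/btrfs-top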