Re: [RFC] btrfs auto snapshot
On Thu, Aug 18, 2011 at 12:38 AM, Matthias G. Eckermann m...@suse.com wrote: Ah, sure. Sorry. Packages for blocxx for: Fedora_14 Fedora_15 RHEL-5 RHEL-6 SLE_11_SP1 openSUSE_11.4 openSUSE_Factory are available in the openSUSE buildservice at: http://download.opensuse.org/repositories/home:/mge1512:/snapper/ Hi Matthias, I'm testing your packages on top of RHEL6 + kernel 3.2.7. A small suggestion, you should include /etc/sysconfig/snapper in the package (at least for RHEL6, haven't tested the other ones). Even if it just contains SNAPPER_CONFIGS= Thanks, Fajar -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3.3 restripe between different raid levels
On Wed, Feb 22, 2012 at 07:14:44PM +, Alex wrote:

[Referring to https://lkml.org/lkml/2012/1/17/381], and perhaps I'm a bit previous, but what is the command sequence to change the raid levels? Wouldn't mind being pointed to a git manual if better for you.

Look at http://article.gmane.org/gmane.comp.file-systems.btrfs/15211. The only syntax change that has been merged since then is that you can invoke balancing commands without the 'filesystem' prefix, so instead of 'btrfs fi balance' you can (and should) use 'btrfs balance'. To get the code you can pull the for-chris branch from my repo:

git://github.com/idryomov/btrfs-progs.git for-chris

I'll add proper man pages shortly.

Thanks,
Ilya
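For readers finding this thread later, a restripe invocation under the merged syntax might look like the sketch below. This is an assumption based on the balance-filter interface discussed above, not text from Ilya's mail; the mount point /mnt and the raid1 target profile are placeholders.

```shell
# Sketch: convert data and metadata block groups to a new RAID profile
# with balance filters. -dconvert applies to data chunks, -mconvert to
# metadata. Requires root and a mounted multi-device btrfs filesystem.
convert_to_raid1() {
    local mnt="$1"
    btrfs balance start -dconvert=raid1 -mconvert=raid1 "$mnt"
    btrfs balance status "$mnt"   # balance runs in-kernel; poll for progress
}
```

The function is only a sketch; run it as e.g. `convert_to_raid1 /mnt` on a filesystem with enough devices for the target profile.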
Re: Is there any data recovery tool?
OK. On Wed, Feb 22, 2012 at 8:58 PM, Duncan 1i5t5.dun...@cox.net wrote:

qasdfgtyuiop posted on Tue, 21 Feb 2012 20:11:06 +0800 as excerpted:

I'm using GNU/Linux with a btrfs root. My filesystem was created with the command mkfs.btrfs /dev/sda. Today I was trying to install Microsoft Windows 7 on /dev/sdb, a 16GB eSATA SSD. After the installation, I found that Windows had created a hidden NTFS partition called System Reserved on the first 100MB of my /dev/sda and that my btrfs filesystem was lost! I have searched Google for help but got no useful information. Are there any data recovery tools?

The btrfs kernel option says:

Btrfs filesystem (EXPERIMENTAL) Unstable disk format

Its description says in part:

Btrfs is highly experimental, and THE DISK FORMAT IS NOT YET FINALIZED. You should say N here unless you are interested in testing Btrfs with non-critical data. [...] If unsure, say N.

The front page and getting started pages of the wiki (see URL below) also heavily emphasize the development aspect and backups, and the source code section has this to say:

Warning: Btrfs evolves very quickly; do not test it unless:
- You have good backups and you have tested the restore capability
- You have a backup installation that you can switch to when something breaks
- You are willing to report any issues you find
- You can apply patches and compile the latest btrfs code against your kernel (quite easy with git and dkms, see below)
- You acknowledge that btrfs may eat your data
Backups! Backups! Backups!
Given all that, any data you store on btrfs is by definition not particularly important, either because you have it backed up in a more stable format elsewhere (which might be the net, or local), or because the data really /isn't/ particularly important to you in the first place, or you'd have made and tested backups before putting it on the, after all, still experimental and under heavy development btrfs in the first place. (Naturally, always test recovery from your backups, as an untested backup is worse than none, since it's likely to give you a false sense of security.)

Thus, you shouldn't need to worry about a data recovery tool, since you can either simply restore from backups (and since you tested recovery, you're already familiar with the recovery procedures), or the data was simply garbage you were using for testing and didn't care about losing anyway.

Nevertheless, yes, there's a recovery tool, naturally experimental just like the filesystem itself at this point, but there is one. Testing and suggestions for improvements, especially with patches, will be welcomed.

It seems you need to read up on the wiki, which covers this among other things. There's an older version on btrfs.wiki.kernel.org, but that's not updated ATM due to restrictions in place since the kernel.org break-in some months ago. The temporary (but six months and counting, I believe) replacement is at btrfs.ipv5.de:

http://btrfs.ipv5.de/index.php?title=Main_Page

The restore and find-root commands from btrfs-progs are specifically covered on this page:

http://btrfs.ipv5.de/index.php?title=Restore

If you wish to try a newer copy of btrfs-progs (after all, it's all still in development, and bugs are fixed all the time), you'll also want to read:

http://btrfs.ipv5.de/index.php?title=Getting_started#Compiling_Btrfs_from_sources

-- Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the program, he is your master.
Richard Stallman
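As a concrete sketch of the recovery flow Duncan points at: the wiki's restore tool reads files off a filesystem that no longer mounts, without writing to it. The command names follow the wiki page referenced above (older btrfs-progs ship them as standalone `restore` and `find-root` binaries, newer ones as `btrfs restore`); the device and destination paths here are hypothetical.

```shell
# Copy whatever is still readable off a damaged btrfs device into a
# destination directory on a healthy filesystem. If the current tree
# root is unusable, find-root can suggest older root locations to try.
rescue_files() {
    local dev="$1" dest="$2"   # e.g. /dev/sdc1 and /mnt/rescue (placeholders)
    mkdir -p "$dest"
    btrfs restore "$dev" "$dest"
}
```

Note that restore is read-only with respect to the damaged device, which is exactly what you want before experimenting further.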
Re: [PATCH 1/2] treewide: fix memory corruptions when TASK_COMM_LEN != 16
On Thursday 2012-02-23 10:57, Andrew Morton wrote:

But there's more,

24931 ?  S  0:00  \_ [btrfs-endio-met]
                  \_ [kconservative/5]
                  \_ [ext4-dio-unwrit]

[with a wondersome patch:]

$ grep Name /proc/{29431,29432}/stat*
/proc/29431/status:Name: btrfs-endio-meta-1
/proc/29432/status:Name: btrfs-endio-meta-write-1
Name: kconservative/512
Name: ext4-dio-unwritten

doh. The fix for that is to have less clueless btrfs developers.

And truncate their names to SUNWbtfs, ORCLintg and EXT4diou? I think not :)
Re: [RFC] btrfs auto snapshot
The autosnap code will be available either at the end of this week or early next week, and what you will notice is that autosnap snapshots are named using a uuid. The main reasons to drop time-stamp based names are:

- A test (clicking the Take-snapshot button) which took more than one snapshot per second was failing.
- A more descriptive creation time is available using a command line option, as in the example below:

# btrfs su list -t tag=@minute,parent=/btrfs/sv1 /btrfs
/btrfs/.autosnap/6c0dabfa-5ddb-11e1-a8c1-0800271feb99 Thu Feb 23 13:01:18 2012 /btrfs/sv1 @minute
/btrfs/.autosnap/5669613e-5ddd-11e1-a644-0800271feb99 Thu Feb 23 13:15:01 2012 /btrfs/sv1 @minute

As of now the code for time-stamp based autosnap snapshot names is commented out; if more people want time-stamp based names, I don't mind doing it that way. Please do let me know.

Thanks, Anand

On Thursday 23,February,2012 06:37 PM, Hubert Kario wrote:

On Wednesday 17 of August 2011 10:15:46 Anand Jain wrote:

btrfs auto snapshot feature will include: Initially: [snip] - snapshot destination will be subvol/.btrfs/snapshot@time and snapshot/.btrfs/snapshot@time for subvolume and snapshot respectively

Is there some reason not to use the format used by the shadow_copy2 overlay for Samba (the one providing Shadow Volume Copy functionality for Windows clients)? You get the current date in this format like this:

@GMT-`date -u '+%Y.%m.%d-%H.%M.%S'`

For example: @GMT-2012.02.23-10.34.32

This way, when the volume is exported using Samba, you can easily export past copies too, without creating links.

Regards,
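The shadow_copy2-style name Hubert describes is pure string formatting, so generating it costs nothing at snapshot time; a minimal sketch:

```shell
# Build a Samba shadow_copy2-compatible snapshot name from the current
# UTC time, e.g. @GMT-2012.02.23-10.34.32.
snapname="@GMT-$(date -u '+%Y.%m.%d-%H.%M.%S')"
echo "$snapname"
```

Because the name sorts lexicographically in time order, plain `ls` on the snapshot directory already lists snapshots chronologically.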
Re: 3.3 restripe between different raid levels
Thank you very much, Ilya.
Re: [RFC] btrfs auto snapshot
On Thursday 23 of February 2012 20:02:38 Anand Jain wrote:

The autosnap code will be available either at the end of this week or early next week, and what you will notice is that autosnap snapshots are named using a uuid. The main reasons to drop time-stamp based names are:

- A test (clicking the Take-snapshot button) which took more than one snapshot per second was failing.
- A more descriptive creation time is available using a command line option, as in the example below:

# btrfs su list -t tag=@minute,parent=/btrfs/sv1 /btrfs
/btrfs/.autosnap/6c0dabfa-5ddb-11e1-a8c1-0800271feb99 Thu Feb 23 13:01:18 2012 /btrfs/sv1 @minute
/btrfs/.autosnap/5669613e-5ddd-11e1-a644-0800271feb99 Thu Feb 23 13:15:01 2012 /btrfs/sv1 @minute

As of now the code for time-stamp based autosnap snapshot names is commented out; if more people want time-stamp based names, I don't mind doing it that way. Please do let me know.

I'd say that having it as a configuration option (Samba-style snapshot naming vs. uuid based) would be sufficient. The question remains what the default should be.

That being said, what use-case would require snapshots taken more often than every second? I doubt that you can actually do snapshots every second on a busy file system, let alone more often. On a lightly-used one they will be identical and just clutter the name-space.

Regards,
--
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl
Re: [PATCH] Btrfs: clear the extent uptodate bits during parent transid failures
On Thu, Feb 23, 2012 at 10:12:26AM +0800, Liu Bo wrote:

On 02/23/2012 01:43 AM, Chris Mason wrote:

Normally I just toss patches into git, but this one is pretty subtle and I wanted to send it around for extra review.

QA at Oracle did a test where they unplugged one drive of a btrfs raid1 mirror for a while and then plugged it back in. The end result is that we have a whole bunch of out-of-date blocks on the bad mirror. The btrfs parent transid pointers are supposed to detect these bad blocks, and then we're supposed to read from the good copy instead.

The good news is we did detect the bad blocks. The bad news is we didn't jump over to the good mirror. This patch explains why:

Author: Chris Mason chris.ma...@oracle.com
Date: Wed Feb 22 12:36:24 2012 -0500

Btrfs: clear the extent uptodate bits during parent transid failures

If btrfs reads a block and finds a parent transid mismatch, it clears the uptodate flags on the extent buffer and the pages inside it. But we only clear the uptodate bits in the state tree if the block straddles more than one page. This is from an old optimization to reduce contention on the extent state tree. But it is buggy because the code that retries a read from a different copy of the block is going to find the uptodate state bits set and skip the IO.

The end result of the bug is that we'll never actually read the good copy (if there is one).

The fix here is to always clear the uptodate state bits, which is safe because this code is only called when the parent transid fails.

Reviewed-by: Liu Bo liubo2...@cn.fujitsu.com

Thanks!
Or we can be safer:

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index fcf77e1..c1fe25d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3859,8 +3859,12 @@ int clear_extent_buffer_uptodate(struct extent_io_tree *tree,
 	}
 	for (i = 0; i < num_pages; i++) {
 		page = extent_buffer_page(eb, i);
-		if (page)
+		if (page) {
+			u64 start = (u64)page->index << PAGE_CACHE_SHIFT;
+			u64 end = start + PAGE_CACHE_SIZE - 1;
 			ClearPageUptodate(page);
+			clear_extent_uptodate(tree, start, end, NULL, GFP_NOFS);
+		}
 	}
 	return 0;
 }

Hmmm, I'm not sure this is safer. Our readpage trusts the extent uptodate bits unconditionally, so we should really clear them unconditionally as well.

-chris
Re: [PATCH 1/2] treewide: fix memory corruptions when TASK_COMM_LEN != 16
On Thu, 23 Feb 2012 12:19:28 +0100 (CET) Jan Engelhardt jeng...@medozas.de wrote:

On Thursday 2012-02-23 10:57, Andrew Morton wrote:

But there's more,

24931 ?  S  0:00  \_ [btrfs-endio-met]
                  \_ [kconservative/5]
                  \_ [ext4-dio-unwrit]

[with a wondersome patch:]

$ grep Name /proc/{29431,29432}/stat*
/proc/29431/status:Name: btrfs-endio-meta-1
/proc/29432/status:Name: btrfs-endio-meta-write-1
Name: kconservative/512
Name: ext4-dio-unwritten

doh. The fix for that is to have less clueless btrfs developers.

And truncate their names to SUNWbtfs, ORCLintg and EXT4diou? I think not :)

Teach ps(1) to look in /proc/pid/status for kernel threads?
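Andrew's suggestion is easy to prototype from userspace: the Name: field of /proc/<pid>/status is where the (patched, untruncated) name would come from. A minimal sketch, reading our own shell's entry as a stand-in for a kernel thread:

```shell
# Extract the task name from the Name: line of /proc/self/status.
# For a kernel thread you would substitute its pid for "self".
name=$(awk '/^Name:/ {print $2; exit}' /proc/self/status)
echo "$name"
```

A ps(1) patched along these lines would do the same lookup only for processes it already identifies as kernel threads, to avoid a per-process file read in the common case.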
Re: [PATCH 1/2] treewide: fix memory corruptions when TASK_COMM_LEN != 16
On Thursday 2012-02-23 18:30, Andrew Morton wrote:

Teach ps(1) to look in /proc/pid/status for kernel threads?

To what end? The name in /proc/pid/status was also limited to TASK_COMM_LEN.
Strange performance degradation when COW writes happen at fixed offsets
Hi, my kernel version is 32-bit 3.2.0-rc5 and I am using btrfs-tools 0.19.

I was having performance issues with BTRFS with fragmentation and HDDs, so I decided to switch to an SSD to see if these would go away. Performance was much better, but at times I would see a freeze happen which I can't really explain. The CPU would spike up to 100% at times. I decided to try to reproduce this. Though it may or may not be related, while testing BTRFS performance I encountered this interesting problem where performance would depend on whether a file is freshly copied onto a BTRFS filesystem or obtained via COW children. This is all happening on a Crucial M4 SSD, so something in the SSD firmware could be causing the issue, but I feel it's related to BTRFS metadata.

Here is the test:
1. Write a fresh large file to the file system, called A
2. Make a reflink of A (COW copy B)
3. Modify a set of random blocks on B
4. Remove A
5. Repeat 2-5 but use the newly produced B as the new A

Expected results: each step takes an equal amount of time to complete on an SSD, because there is no fragmentation involved and the system is in the same state at #2, since there's always only one file on the filesystem.

I used a 1GB file as my source. I repeated the tests using different algorithms for the write in step #3 above.

Algorithm 1 (random): Write 8 bytes randomly
Algorithm 2 (fixed): Write first 8 bytes and continue at 50k offsets
Algorithm 3 (incremental): Write first 8 bytes at offset = random(50k), then continue at 50k offsets

For each test, there were 40k writes total. The algorithm is in the Java code below.

The following is observed with each iteration ONLY when using algorithm #3:
1. Over time, the time to modify the file increases
2. Over time, the time to make the reflink copy increases
3. Over time, the time to remove the file increases
4. The first few writes take less than normal time to complete.
Data for 1st/5th/10th/15th/20th iteration:

Algorithm 1 and 2:
  Write: always 6s
  Copy: always 0.5s
  Remove: always 0.10s

Algorithm 3:
  Write: 2/6/9/10/11.5
  Copy: 0.5/3/4.5/5.5/6
  Remove: 0.1/1/2/2/2

As you can see, things degrade and taper off after the 10th iteration. This probably has to do with the 4k block size being near 50k/10. I don't think this has to do with SSD garbage collection, because I ran these tests multiple times.

To use this script, cd into an empty directory on a btrfs filesystem and run it with incremental as the argument. You can use the other modes to confirm expected behavior. Script used to reproduce the bug:

#!/bin/bash
mode=$1
if [ -z "$mode" ]; then
    echo "Usage $0 incremental|random|fixed"
    exit -1
fi
mode=$1
src=`pwd`/test/src
dst=`pwd`/test/dst
srcfile=$src/test.tar
dstfile=$dst/test.tar
mkdir -p $src
mkdir -p $dst
filesize=100MB
# build a 1GB file from a smaller download. You can tweak filesize and the
# loop below for lower bandwidth
if [ ! -f $srcfile ]; then
    cd $src
    if [ ! -f $srcfile.dl ]; then
        wget http://download.thinkbroadband.com/${filesize}.zip --output-document=$srcfile.dl
    fi
    rm -rf tarbase
    mkdir tarbase
    for i in {1..10}; do
        cp --reflink=always $srcfile.dl tarbase/$i.dl
    done
    tar -cvf $srcfile tarbase
    rm -rf tarbase
fi
cat << END > $src/FileTest.java
import java.io.IOException;
import java.io.RandomAccessFile;

public class FileTest {
    public static final int BLOCK_SIZE = 5;
    public static final int MAX_ITERATIONS = 4;

    public static void main(String args[]) throws IOException {
        String mode = args[0];
        RandomAccessFile f = new RandomAccessFile(args[1], "rw");
        //int offset = 0;
        int i;
        int offset = new java.util.Random().nextInt(BLOCK_SIZE); // initializer ONLY for incremental mode
        for (i = 0; i < MAX_ITERATIONS; i++) {
            try {
                int writeOffset;
                if (mode.equals("incremental")) {
                    writeOffset = new java.util.Random().nextInt(offset + i * BLOCK_SIZE);
                } else { // mode equals "random"
                    writeOffset = new java.util.Random().nextInt(((int)f.length() - 100));
                    offset = writeOffset; // for reporting it at the end
                }
                f.seek(writeOffset);
                f.writeBytes("DEADBEEF");
            } catch (java.io.IOException e) {
                System.out.println("EOF");
                break;
            }
        }
        System.out.print("Last offset=" + offset);
        System.out.println(". Made " + i + " random writes.");
        f.close();
    }
}
END
cd $src
javac FileTest.java
/usr/bin/time --format 'rm: %E' rm -rf $dst/*
cp --reflink=always $srcfile.dl $dst/1.tst
cd $dst
for i in {1..20}; do
    echo -n $i.
    i_plus=`expr $i + 1`
    /usr/bin/time --format 'write: %E' java -cp $src FileTest $mode $i.tst
    /usr/bin/time --format 'cp:%E' cp --reflink=always $i.tst $i_plus.tst
    /usr/bin/time --format
Re: Set nodatacow per file?
On 02/13/2012 04:17 PM, Ralf-Peter Rohbeck wrote:

Hello, is it possible to set nodatacow on a per-file basis? I couldn't find anything. If not, wouldn't that be a great feature to get around the performance issues with VM and database storage? Of course cloning should still cause COW.

Hello,

Going back to the original question from Ralf, I wanted to share my experience. Yesterday I set up KVM+qemu and set -z -C with David's 'fileflags' utility on the VM image file. I was very pleased with the results: a Red Hat 6 Minimal installation completed in 10 minutes, whereas it was taking 'forever' the last time I tried it some 4 months ago. Writes during installation were very moderate. Performance of the VM is excellent. Installing some big packages with yum inside the VM goes very quickly, with speed indistinguishable from that of bare-metal installs.

I am not quite sure whether this improvement should be attributed to the nocow and nocompress flags or to the overall improvement of btrfs (I am on a 3.3-rc4 kernel), but KVM is definitely more than usable on btrfs now. I am yet to test the install speed and performance without those flags set.

best
~dima
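For anyone without David's 'fileflags' tool: on reasonably current kernels the per-file NOCOW bit can also be set with stock chattr. This is a sketch under that assumption; note that the flag only takes effect reliably on files that have no data extents yet (or on a directory, so new files inherit it).

```shell
# Create a VM image with per-file nodatacow on btrfs. chattr +C must be
# applied while the file is still empty; truncate then sizes the image
# without writing any data, so the flag sticks.
make_nocow_image() {
    local img="$1" size="$2"   # e.g. /var/lib/libvirt/images/vm.raw 20G
    touch "$img"
    chattr +C "$img"
    truncate -s "$size" "$img"
}
```

Setting the attribute on the images directory instead makes every subsequently created image NOCOW without per-file bookkeeping.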
Re: git resources
(When we have this I shall update the btrfs wiki)

As promised, an article was posted here some time back:

Writing patch for btrfs
http://btrfs.ipv5.de/index.php?title=Writing_patch_for_btrfs

It sat in my mail drafts for a long time while kernel.org was down; sorry for the delay.

-Anand
Re: [RFC] btrfs auto snapshot
Thanks for the inputs. There is no clear winner as of now. Let me keep the uuid for now; if more sysadmins feel a timestamp is better, we could devise it that way.

-Anand
Re: [RFC] btrfs auto snapshot
I'd like to vote for timestamp/timestamp-uuid as a sysadmin. The timestamp allows for easy conversion from clients' wants to actual commands: "I need my data from two days ago" is easy when I have timestamps to use.

On 2/23/2012 10:05 PM, Anand Jain wrote:

Thanks for the inputs. There is no clear winner as of now. Let me keep the uuid for now; if more sysadmins feel a timestamp is better, we could devise it that way.

-Anand
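With timestamp names, the "data from two days ago" request really does reduce to string matching. A sketch with GNU date, assuming the Samba-style @GMT naming scheme proposed earlier in the thread (the snapshot directory path is a placeholder):

```shell
# Compute the name prefix of snapshots taken two days ago; globbing the
# snapshot directory against it then lists the candidate snapshots.
wanted="@GMT-$(date -u -d '2 days ago' '+%Y.%m.%d')"
echo "$wanted"
# e.g.: ls -d /btrfs/.autosnap/${wanted}-*
```

With uuid names, the same request needs a metadata lookup per snapshot (as in the `btrfs su list -t` example above) rather than a one-line glob.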
Re: Strange performance degradation when COW writes happen at fixed offsets
Nik Markovic posted on Thu, 23 Feb 2012 20:31:02 -0600 as excerpted:

I noticed a few errors in the script that I used. I corrected it and it seems that degradation is occurring even with fully random writes:

I don't have an SSD, but is it possible that you're simply seeing erase-block-related degradation due to multi-write-block-sized erase-blocks?

It seems to me that when originally written to the btrfs-on-SSD, the file will likely be written block-sequentially enough that the file as a whole takes up relatively few erase-blocks. As you COW-write individual blocks, they'll be written elsewhere, perhaps all the changed blocks to a new erase-block, perhaps each to a different erase-block. As you increase the successive COW generation count, the file's filesystem/write blocks will be spread through more and more erase-blocks (basically fragmentation, but of the SSD-critical type), thus affecting modification and removal time but not read time.

IIRC I saw a note about this on the wiki, in regard to the nodatacow mount-option. Let's see if I can find it again. Hmm... yes:

http://btrfs.ipv5.de/index.php?title=Getting_started#Mount_Options

In particular this (for nodatacow; read the rest, as there are additional implications):

Performance gain is usually 5% unless the workload is random writes to large database files, where the difference can become very large.

In addition to nodatacow, see the note on the autodefrag option. IOW, with the repeated generations of random writes to COW copies, you're apparently triggering a COW worst-case fragmentation situation.
It shouldn't affect read time much on an SSD, but it certainly will affect copy and erase time, as the data and metadata (which as you'll recall is 2X by default on btrfs) get written to more and more blocks that need updating at copy/erase time. That /might/ be the problem triggering the freezes you noted that set off the original investigation as well, if the SSD firmware is running out of erase-blocks and having to pause access while it rearranges data to allow operations to continue. Since your original issue on rotating-rust drives was fragmentation, rewriting would seem to be something you do quite a lot of, triggering different but similar-cause issues on SSDs as well.

FWIW, with that sort of database-style workload, large files constantly random-change rewritten, something like XFS might be more appropriate than btrfs. See the recent XFS presentations (were they at ScaleX or LinuxConf.au? both happened about the same time and were covered in the same LWN weekly edition) as covered a couple of weeks ago on LWN for more.

-- Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the program, he is your master.
Richard Stallman
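The wiki options Duncan quotes are mount-time knobs, so they can be tried on an existing filesystem without reformatting. A sketch under that assumption (the mount point is a placeholder; note that nodatacow at mount scope affects the whole filesystem, unlike a per-file flag):

```shell
# Turn on autodefrag, which detects small random writes and queues the
# affected files for background defragmentation. Requires root.
remount_autodefrag() {
    local mnt="$1"
    mount -o remount,autodefrag "$mnt"
}
# Equivalent persistent form in /etc/fstab (device/mount point are examples):
# /dev/sdb1  /mnt  btrfs  defaults,autodefrag  0  0
```

For the COW-aging test in this thread, rerunning the script on a filesystem remounted with autodefrag would show directly whether background defragmentation flattens the degradation curve.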