Re: Strange performance degradation when COW writes happen at fixed offsets

2012-02-27 Thread Christian Brunner
2012/2/24 Nik Markovic nmarkovi.nav...@gmail.com:
 To add... I also tried the nodatasum (only) and nodatacow options. I found
 somewhere that nodatacow doesn't really mean that COW is disabled.
 Test data is still the same - CPU spikes and times are the same.

 On Fri, Feb 24, 2012 at 2:38 PM, Nik Markovic nmarkovi.nav...@gmail.com 
 wrote:
 On Fri, Feb 24, 2012 at 12:38 AM, Duncan 1i5t5.dun...@cox.net wrote:
 Nik Markovic posted on Thu, 23 Feb 2012 20:31:02 -0600 as excerpted:

 I noticed a few errors in the script that I used. I corrected it and it
 seems that degradation is occurring even at fully random writes:

 I don't have an ssd, but is it possible that you're simply seeing erase-
 block related degradation due to multi-write-block sized erase-blocks?

 It seems to me that when originally written to the btrfs-on-ssd, the file
 will likely be written block-sequentially enough that the file as a whole
 takes up relatively few erase-blocks.  As you COW-write individual
 blocks, they'll be written elsewhere, perhaps all the changed blocks to a
 new erase-block, perhaps each to a different erase block.

 This is a very interesting insight. I wasn't even aware of the
 erase-block issue, so I did some reading up on it...


 As you increase the successive COW generation count, the file's file-
 system/write blocks will be spread thru more and more erase-blocks,
 basically fragmentation but of the SSD-critical type, into more and more
 erase blocks, thus affecting modification and removal time but not read
 time.

 OK, so time to write would increase due to fragmentation and writing,
 it now makes sense (though I don't see why small writes would affect
 this, but my concerns are not writes anyway), but why would cp
 --reflink time increase so much. Yes, new extents would be created,
 but btrfs doesn't write into data blocks, does it? I figured its
 metadata would be kept in one place. I figure the only thing BTRFS
 would do on cp --reflink=always:
 1. Take a collection of extents owned by source.
 2. Make the new copy use the same collection of extents.
 3. Write the collection of extents to the directory.
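To make the three steps above concrete, here is a toy model in Python. It is only a sketch of the mental model described in this message, not of btrfs internals: the ToyFS class, its method names and the integer extent ids are all invented for illustration.

```python
from collections import Counter


class ToyFS:
    """Toy model: a file is an ordered list of extent ids; extents are refcounted."""

    def __init__(self):
        self.files = {}        # file name -> list of extent ids
        self.refs = Counter()  # extent id -> number of files referencing it
        self._next_extent = 0

    def _alloc(self):
        extent = self._next_extent
        self._next_extent += 1
        return extent

    def create(self, name, n_extents):
        extents = [self._alloc() for _ in range(n_extents)]
        self.files[name] = extents
        self.refs.update(extents)

    def reflink(self, src, dst):
        # Steps 1-3 from the list: the copy shares the extent list, no data is copied.
        extents = list(self.files[src])
        self.files[dst] = extents
        self.refs.update(extents)

    def cow_write(self, name, index):
        # A write to a shared extent redirects only this file to a fresh extent.
        old = self.files[name][index]
        new = self._alloc()
        self.files[name][index] = new
        self.refs[old] -= 1
        self.refs[new] += 1
```

In this toy model the work done by reflink() grows with the number of extents in the source file rather than with its size in bytes, which is one reason the extent count of a heavily COW-modified file is worth looking at.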

 Now this process seems to be CPU intensive. When I remove or make a
 reflink copy, one core spikes to 100%, which tells me that there's a
 performance issue there, not an ssd issue. Also, only one CPU thread
 is being used for this. I figured that I can improve this by some
 setting. Maybe thread_pool mount option? Are there any updates in
 later kernels that I should possibly pick up?

 [...]

 Unless I am wrong, this would disable COW completely and reflink copy.
 Reflinks are a crucial component and the sole
 reason I picked BTRFS for the system that I am writing for my company.
 The autodefrag option addresses multiple writes. Writing is not the
 problem, but cp --reflink should be near-instant. That was the reason
 we chose BTRFS over ZFS, which seemed to be the only feasible
 alternative. ZFS snapshots complicate the design, and deduplication copy
 time is the same as (or not much better than) raw copy.

 [...]

 As I mentioned above, the COW is the crucial component of our system,
 XFS won't do. Our system does not do random writes. In fact it is
 mainly heavy on read operation. The system does occasional rotation
 of rust on large files in a way that version control system would
 (large files are modified and then used as a new baseline)

The symptoms you are reporting are quite similar to what I'm seeing in
our Ceph cluster:

http://comments.gmane.org/gmane.comp.file-systems.btrfs/15413

AFAIK, Chris and Josef are working on it, but you'll have to wait for
kernel 3.4, until this will be available in mainline. If you are
feeling adventurous, you could try the patches in Josef's git tree,
but I think it's still experimental.

Regards,
Christian
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: LABEL only 1 device

2012-02-27 Thread Hugo Mills
On Mon, Feb 27, 2012 at 07:44:00AM +0100, Helmut Hullen wrote:
 Hello, Hugo,
 
 You wrote on 26.02.12:
 
 mkfs.btrfs creates a new filesystem. The -L option sets the label
  for the newly-created FS. It *cannot* be used to change the label of
  an existing FS.
 
 The safest way may be deleting this option ... it seems to work as  
 expected only when I create a new FS on 1 disk/partition.

   I've said this several times: Your expectations are wrong. You
don't label partitions. You label filesystems. You are using the wrong
tool (filesystem labels) for the job (uniquely identifying
partitions).

  If you want to do that, use btrfs filesystem label.
 
 And that seems to work as I expected - fine.
 
 Adding a device works, deleting a device works. Fine!
 Now I'll try the job with my Terabyte disks.
 
 (Yes - I have backups ...)
 
 Best regards!
 Helmut

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- Be pure. Be vigilant. Behave. ---  




Re: btrfs-convert options

2012-02-27 Thread Hugo Mills
On Mon, Feb 27, 2012 at 08:52:00AM +0100, Helmut Hullen wrote:
 Hello, linux-btrfs,
 
 I want to change some TByte disks (at least one) from ext4 to btrfs. And  
 I want -d raid0 -m raid1. Is it possible to tell btrfs-convert  
 especially these options for data and metadata?
 
 Or have I to use mkfs.btrfs (and then copy the backup) when I want  
 these options?

   The other alternative is to get hold of 3.3-rcX and the latest
userspace tools (currently in the dangerdonteveruse branch, but
should be in master very soon), and use the restriper after you've
converted.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- Be pure. Be vigilant. Behave. ---  




Re: btrfs-convert options

2012-02-27 Thread Helmut Hullen
Hello, Hugo,

You wrote on 27.02.12:

 I want to change some TByte disks (at least one) from ext4 to btrfs.
 And I want -d raid0 -m raid1. Is it possible to tell btrfs-convert
 especially these options for data and metadata?

 Or have I to use mkfs.btrfs (and then copy the backup) when I want
 these options?

The other alternative is to get hold of 3.3-rcX and the latest
 userspace tools (currently in the dangerdonteveruse branch, but
 should be in master very soon), and use the restriper after you've
 converted.

Hmm - I don't yet dare to try these experiments with really big disks ...

Best regards!
Helmut


Re: LABEL only 1 device

2012-02-27 Thread Helmut Hullen
Hello, Hugo,

You wrote on 27.02.12:

mkfs.btrfs creates a new filesystem. The -L option sets the label
 for the newly-created FS. It *cannot* be used to change the label
 of an existing FS.

 The safest way may be deleting this option ... it seems to work as
 expected only when I create a new FS on 1 disk/partition.

I've said this several times: Your expectations are wrong. You
 don't label partitions.

Yes - now I know.
But I'm afraid other people also expect wrong - when I use mkfs.ext[234]  
then this option works (in another way than with mkfs.btrfs).

Best regards!
Helmut


Re: LABEL only 1 device

2012-02-27 Thread David Sterba
On Sun, Feb 26, 2012 at 04:44:00PM +, Hugo Mills wrote:
OK, the real problem you're seeing is that when btrfs removes a
 device from the filesystem, that device is not modified in any way.
 This means that the old superblock is left behind on it, containing
 the FS label information. What you need to do is, immediately after
 removing a device from the FS, zero the first part of the partition
 with dd and /dev/zero.

A correction here: if the device being removed is writable, the
superblock is cleared so it's not recognized as a part of any other fs:

int btrfs_rm_device(struct btrfs_root *root, char *device_path)
...
        /*
         * at this point, the device is zero sized.  We want to
         * remove it from the devices list and zero out the old super
         */
        if (clear_super) {
                /* make sure this device isn't detected as part of
                 * the FS anymore
                 */
                memset(&disk_super->magic, 0, sizeof(disk_super->magic));
                set_buffer_dirty(bh);
                sync_dirty_buffer(bh);
        }

Doing this manually means zeroing a 4k block at each of these offsets,
up to the partition size:

Superblock 0 offset 65536
Superblock 1 offset 67108864
Superblock 2 offset 274877906944
Superblock 3 offset 1125899906842624
Superblock 4 offset 4611686018427387904
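As a small illustration of "up to partition size", the following Python sketch picks out the offsets from the list above that actually fall inside a partition of a given size. The offsets and the 4 KiB block size are taken from this message; the function name is made up for the example, and nothing here touches a device.

```python
# Offsets copied from the list above; 4096 is the "4k block" from the message.
SUPERBLOCK_OFFSETS = [
    65536,
    67108864,
    274877906944,
    1125899906842624,
    4611686018427387904,
]
SUPERBLOCK_SIZE = 4096


def superblock_offsets_within(partition_size):
    """Return the listed offsets whose 4 KiB block fits entirely inside the partition."""
    return [
        offset
        for offset in SUPERBLOCK_OFFSETS
        if offset + SUPERBLOCK_SIZE <= partition_size
    ]
```

For a 1 TiB partition (2**40 bytes) this selects the first three offsets; the last two only matter on devices larger than 1 PiB and 4 EiB respectively.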


david


Re: LABEL only 1 device

2012-02-27 Thread Helmut Hullen
Hello, David,

You wrote on 27.02.12:

[deleting btrfs partition]

OK, the real problem you're seeing is that when btrfs removes a
 device from the filesystem, that device is not modified in any way.
 This means that the old superblock is left behind on it, containing
 the FS label information. What you need to do is, immediately after
 removing a device from the FS, zero the first part of the partition
 with dd and /dev/zero.

 A correction here: if the device being removed is writable, the
 superblock is cleared so it's not recognized as a part of any other
 fs:

[...]

 Doing this manually means zeroing 4k block at all offsets up to
 partition size:

 Superblock 0 offset 65536
 Superblock 1 offset 67108864
 Superblock 2 offset 274877906944
 Superblock 3 offset 1125899906842624
 Superblock 4 offset 4611686018427387904

My actual experiments:

  dd if=/dev/zero of=/dev/sdxn bs=16M count=1

seems to be enough. Perhaps deleting the first 16 MByte is too much.
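A quick arithmetic check of that dd command against the quoted offsets, as a Python sketch. It only compares numbers and assumes the three smallest offsets from the quoted list and a 4 KiB superblock; the function name is invented for the example.

```python
ZEROED_BYTES = 16 * 1024 * 1024  # dd bs=16M count=1
BLOCK = 4096                     # size of one superblock copy, per the quoted text

# The three smallest offsets from the quoted list, keyed by copy number.
OFFSETS = {0: 65536, 1: 67108864, 2: 274877906944}


def copies_zeroed(zeroed_bytes, offsets, block=BLOCK):
    """Return the copy numbers that lie completely inside the zeroed prefix."""
    return [n for n, off in sorted(offsets.items()) if off + block <= zeroed_bytes]
```

With these numbers, zeroing the first 16 MiB covers copy 0 at 64 KiB but does not reach copy 1 at 64 MiB.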

Best regards!
Helmut


Re: [PATCH] Btrfs: hold enough space for global_rsv

2012-02-27 Thread Johannes Hirte
On Tue, 17 Jan 2012 17:51:59 +0800,
Liu Bo liubo2...@cn.fujitsu.com wrote:

 I've kept hitting enospc warnings of global_rsv while running
 defragment on files:
 btrfs: block rsv returned -28
 WARNING: at fs/btrfs/extent-tree.c:5984
 btrfs_alloc_free_block+0x333/0x340 [btrfs]() ...
 
 I used a fio job to create a file with lots of fragments:
 $ filefrag /mnt/btrfs/foobar
 /mnt/btrfs/foobar: 66964 extents found
 
 and then btrfs fi defrag /mnt/btrfs/foobar && sync would pop the
 warnings.
 
 I found that the global_rsv size is just not enough for defragment,
 and didn't find any space leak in using global_rsv, so double it and
 go ahead.
 
 Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
 ---
  fs/btrfs/extent-tree.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)
 
 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index 8603ee4..77ea23c 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -3979,7 +3979,7 @@ static u64 calc_global_metadata_size(struct btrfs_fs_info *fs_info)
          num_bytes += div64_u64(data_used + meta_used, 50);
  
          if (num_bytes * 3 > meta_used)
 -                num_bytes = div64_u64(meta_used, 3);
 +                num_bytes = div64_u64(meta_used, 3) * 2;
  
          return ALIGN(num_bytes, fs_info->extent_root->leafsize << 10);
  }
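To see what the one-line change does to the numbers, here is a Python sketch of just the capping branch from the hunk above. It assumes the condition reads "num_bytes * 3 > meta_used" (the comparison operator appears to have been lost in the archive), takes the earlier estimate as an input, and leaves out the final ALIGN step; the function name is invented.

```python
def capped_reservation(estimate, meta_used, patched):
    """Mirror the if-branch of the hunk; `estimate` is num_bytes before the check."""
    if estimate * 3 > meta_used:
        cap = meta_used // 3  # div64_u64(meta_used, 3)
        return cap * 2 if patched else cap
    return estimate
```

So whenever the cap applies, the patched code reserves two thirds of the used metadata space for the global reserve instead of one third, and that space is then not available to other reservations.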

This patch breaks my system. With it applied, all services fail on
boot with "no space left" messages.


Re: Set nodatacow per file?

2012-02-27 Thread dima

Hello,

Since several people asked to post the results, here they are.
I tried a raw virtio disk with and without -z -C set, and also a qcow2 virtio 
disk without -z -C set, and did not notice any difference in performance 
at all - Redhat 6.2 Minimal installs in 10 minutes in each case. The 
abysmal performance I saw several months ago under the same conditions 
(like 10 minutes just for virtual disk formatting) is gone, 
at least on 3.3.0-rc5.


best
~dima


On 02/24/2012 02:22 PM, dima wrote:

On 02/13/2012 04:17 PM, Ralf-Peter Rohbeck wrote:

Hello,
is it possible to set nodatacow on a per-file basis? I couldn't find
anything.
If not, wouldn't that be a great feature to get around the performance
issues with VM and database storage? Of course cloning should still
cause COW.


Hello,
Going back to the original question from Ralf I wanted to share my
experience.

Yesterday I set up KVM+qemu and set -z -C with David's 'fileflags'
utility for the VM image file.
I was very pleased with results - Redhat 6 Minimal installation was
installed in 10 minutes whereas it was taking 'forever' the last time I
tried it some 4 months ago. Writes during installation were very
moderate. Performance of VM is excellent. Installing some big packages
with yum inside VM goes very quickly with the speed indistinguishable
from that of bare metal installs.

I am not quite sure whether this improvement should be attributed to the nocow
and nocompress flags or to the overall improvement of btrfs (I am on
3.3-rc4 kernel) but KVM is definitely more than usable on btrfs now.

I have yet to test the install speed and performance without those flags set.

best
~dima


Re: [BUG] Kernel Bug at fs/btrfs/volumes.c:3638

2012-02-27 Thread David Sterba
On Sun, Feb 26, 2012 at 11:42:58PM -0500, Jérôme Carretero wrote:
 At some point, I would appreciate some kind of thorough evaluation using a 
 fuzzer on small disk images.
 The btrfs developers could for instance:
 - provide a script to create a filesystem image with a known layout (known 
 corpus)
 - provide .config and reference to kernel sources to build the kernel
 - provide a minimal root filesystem to be run under qemu, it would run a 
 procedure on the other disk image at boot
   crashes wouldn't affect the host, which is good.
 - provide a way to retrieve the test parameters and results for every test 
 case
   in case of bug, the test can be reproduced by the developers since the 
 configuration is known
 - expect volunteers to run the scenarios (I know I would)
 The tricky part is of course the potentially super-costly procedure...
 Simplest case: flipping every bit / writing blocks with pseudo-random data, 
 even on meta-data only, as the outcome on data is supposed to be known.
 Smarter: flipping bits on every btrfs meta-data structure type at every 
 possible logical location.

There is a dangerdonteveruse(tm) utility, btrfs-corrupt-block, that can
target a specific metadata structure and corrupt it, with the fsck
counterpart for the rescue. I believe we'll see more updates in that
area.

The block checksums are supposed to catch bitflips after they were
written down to device (provided the data were correct up to the
checksum point).
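The idea can be shown in a few lines of Python. This is only an illustration of checksum-based bitflip detection: it uses zlib.crc32 from the standard library as a stand-in (btrfs uses crc32c, a different polynomial), and the block contents and helper names are invented.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    out = bytearray(data)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


def is_intact(block: bytes, stored_checksum: int) -> bool:
    """Recompute the checksum on read and compare it with the stored one."""
    return zlib.crc32(block) == stored_checksum


block = bytes(range(256)) * 16  # a 4 KiB stand-in for a metadata block
stored = zlib.crc32(block)      # checksum computed when the block was written
```

A flip that happens after the checksum was computed is caught on read; a flip that happens before it (bad data checksummed as-is) is not, which is the proviso in the paragraph above.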

If you're talking about random bitflips in metadata structures during
processing, that's very likely to crash in many ways of course. I think
some logic needs to be added to those corruptions and accompanied by the
fsck part.

 The kind of stuff that would help all this could be something like Python 
 bindings for a *btrfs library*.
 Helpful even for prototyping fsck stuff, making illustrations, etc.

We could see btrfsprogs turn into a library + tool, someday. Added to
project page.

 As of today, how are btrfs developers testing the filesystem implementation 
 (except with xfstests) ?

If there is a patch fixing a particular bug, I try to set up an environment
stressing exactly that bug (and sometimes find another one ...). The
xfstests suite is a must before any testing. There are common loads
that raise the chances of hitting a bug, like repeated snapshots (and deletions),
exhausting data/metadata space, 'fi defrag', 'fi sync', 'fi balance'.

Sometimes it's enough to run a specific xfstest in a loop. I have a set of
hackish scripts doing just these tasks or wrappers around xfstests to
create filesystem with desired raid levels (where applicable) and let
the suite run on top of it.
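A loop of that kind might look like the Python sketch below. It is not one of the scripts mentioned here; the function name is invented, and the command shown in the demo is a harmless stand-in. For a real run one would pass something like the xfstests check script with a test number, which is left to the reader.

```python
import subprocess
import sys


def run_until_failure(cmd, max_runs):
    """Run cmd repeatedly; return the 1-based iteration that failed, or None."""
    for iteration in range(1, max_runs + 1):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return iteration
    return None


# Demo with a stand-in command that always succeeds.
first_failure = run_until_failure([sys.executable, "-c", "pass"], max_runs=3)
```

Stopping at the first failure keeps the filesystem in the state that triggered the bug, so it can be inspected before the next run overwrites it.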

Another dimension of testing is mount options; there are some
combinations likely to exercise specific parts of code, or create files
in a way that may confuse different mount options (like nodatasum).

We've seen btrfs-specific tests added to xfstests, so it's mostly
changing the outer environment for the testsuite.


david


Re: Errors in dmesg, no crashes though.

2012-02-27 Thread David Sterba
On Sat, Feb 25, 2012 at 09:14:01AM +1030, Jordan Windsor wrote:
 I'm running Ubuntu under KVM, with btrfs on the host where the
 Qemu/KVM image is stored, the VM was also running at the time. I was
 going to check something unrelated in the dmesg output, as I did that
 I noticed some errors in it about btrfs here they are:
 
 
 [ 4294.431807] btrfs: block rsv returned -28
 [ 4294.431811] [ cut here ]
 [ 4294.431831] WARNING: at fs/btrfs/extent-tree.c:5985

This is a warning and it shows up from time to time, across recent
releases.

 This didn't cause any crashes or related issues, also unrelated (I
 think?) is that I get very low performance (about 5MBps according to
 iostat) in a guest under Qemu/KVM.
 3.2.7-1-ARCH x86_64 Arch Linux

Yes, the performance goes down as some pathological should-not-happen
code path is taken. I myself haven't seen this error recently during
testing, but at the time I did, it slowed down the machine for a while,
i.e. it's not an infinite loop. Seems there's a dark corner of
space reservations left for Josef.


david


Re: LABEL only 1 device

2012-02-27 Thread Duncan
Helmut Hullen posted on Mon, 27 Feb 2012 11:27:00 +0100 as excerpted:

 You wrote on 27.02.12:
 
mkfs.btrfs creates a new filesystem. The -L option sets the label
 for the newly-created FS.

 The safest way may be deleting this option ... it seems to work as
 expected only when I create a new FS on 1 disk/partition.
 
I've said this several times: Your expectations are wrong. You
 don't label partitions.
 
 Yes - now I know.
 But I'm afraid other people also expect wrong - when I use mkfs.ext[234]
 then this option works (in another way than with mkfs.btrfs).

AFAIK, it works in the same way... that is, it labels the, in that case, 
ext2/3/4 filesystem, in this case (mkfs.btrfs), btrfs filesystem.

From the manpages:

mkfs.btrfs (aka mkbtrfs):

   -L, --label name
  Specify a label for the filesystem.

mkfs.ext2/3/4 (aka mke2fs):

   -L new-volume-label
  Set  the  volume  label  for the filesystem to
  new-volume-label.  The maximum length of the
  volume label is 16 bytes.

e2label:

   e2label  will display or change the filesystem label on the
   ext2, ext3, or ext4 filesystem located on device.

mkreiserfs:

  -l | --label LABEL
  Sets   the   volume  label  of  the filesystem. LABEL
  can at most be 16 characters long; if it is longer than
  16 characters, mkreiserfs will truncate it.

reiserfstune:

   -l | --label LABEL
  Set  the  volume  label  of  the filesystem. LABEL can
  be at most 16 characters long; if it is longer than 16
  characters, reiserfstune will truncate it.

The mkswap manpage does make things a bit more confusing, until you 
realize that the device they're referencing is a swap device, which 
can be a file, not just a block device.

   mkswap sets up a Linux swap area on a device or in a file.

   [...]

   -L, --label label
  Specify a label for the device, to allow swapon by label.


fstab indicates the filesystem label:

   The first field (fs_spec).
  This field describes the block special device or remote
  filesystem to be mounted.

  For ordinary mounts it will hold (a link to) a block
  special device  node  (as  created  by mknod(8))  for
  the device to be mounted, like `/dev/cdrom' or
  `/dev/sdb7'.  [...]

  Instead of giving the device explicitly, one may
  indicate the (ext2 or xfs) filesystem that is to
  be  mounted  by its UUID or volume label (cf.
  e2label(8) or xfs_admin(8)), writing
  LABEL=label or UUID=uuid, e.g., `LABEL=Boot'[.]
  This  will  make  the  system more robust: adding
  or removing a SCSI disk changes the disk device name
  but not the filesystem volume label.
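To illustrate how the fs_spec field is read, here is a minimal Python sketch that classifies it as a label, a UUID or a plain device. It handles only the three forms quoted above (no quoting, no remote filesystems), and the function name is invented for the example.

```python
def parse_fs_spec(fs_spec):
    """Classify the first fstab field as ('label', v), ('uuid', v) or ('device', v)."""
    for prefix, kind in (("LABEL=", "label"), ("UUID=", "uuid")):
        if fs_spec.startswith(prefix):
            return kind, fs_spec[len(prefix):]
    return "device", fs_spec
```

Note that the LABEL= form names a filesystem, not a partition, so on a multi-device btrfs every member device carries the same label.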

mount seems to be confused, using label in both the filesystem and device 
context (it also discusses selinux labels, etc, which are of course 
different).  I'm not going to quote it here as the bits discussing label 
are dispersed and getting context clear on all of them would take a lot 
of space.  Searching the manpage for label (case insensitive search) 
works, tho, again noting that it uses label in selinux and other 
contexts as well.

In another post I mentioned that gpt partitions do have names, which 
/could/ function similarly to labels, tho Linux including the mount 
command generally ignores them at present.  From the gdisk (part of 
gptfdisk) manpage (the cgdisk and sgdisk manpages, same package, are 
similarly worded, including the note on the distinction between gpt 
partition name and filesystem label):

   c  Change the GPT name of a partition. This name is encoded
  as a UTF-16 string, but proper entry and display of
  anything beyond basic ASCII values requires suitable
  locale and font support. For the most part, Linux ignores
  the partition name, but it may be important in some OSes.
  GPT  fdisk sets a default name based on the partition type
  code. Note that the GPT partition name is different from
  the filesystem name, which is encoded in the filesystem's
  data structures.

Note especially that last sentence, above.


So a filesystem label is just that, a /filesystem/ label.  That there's 
normally a 1:1 correspondence between filesystem and the block device(s) 
it's on is simply an accident.  But it's NOT an accident when a btrfs 
filesystem label applies to ALL the devices that compose the filesystem, 
since it's a FILESYSTEM label, NOT a PARTITION label.  As the gptfdisk 
manpages make clear, partition names/labels, where they exist as in gpt 
based partitioning, are quite distinct from the filesystem names/labels.


However, the above manpage research does point out that while usage is 
generally quite 

Re: Errors in dmesg, no crashes though.

2012-02-27 Thread Nik Markovic
I've just seen this too on Fedora 16 while I was investigating an NFS issue.
I was trying to copy a file from an NFS mount to a btrfs partition.
The NFS transfers for large files were occurring in bursts for some
reason and I was aborting the copy at times. This NFS problem was not
related to btrfs (cat NFS file > /dev/null was also bursting and
slow).

Originally I ran  3.2.1-3.fc16, but just upgraded to 3.2.7-1.fc16. The
file system was formatted with 3.2.1 originally.

I can't say for sure what caused this - whether it was the NFS being
slow, the copy being interrupted, or btrfs itself.

Regards,
Nik

On Mon, Feb 27, 2012 at 9:03 AM, David Sterba d...@jikos.cz wrote:
 On Sat, Feb 25, 2012 at 09:14:01AM +1030, Jordan Windsor wrote:
 I'm running Ubuntu under KVM, with btrfs on the host where the
 Qemu/KVM image is stored, the VM was also running at the time. I was
 going to check something unrelated in the dmesg output, as I did that
 I noticed some errors in it about btrfs here they are:


 [ 4294.431807] btrfs: block rsv returned -28
 [ 4294.431811] [ cut here ]
 [ 4294.431831] WARNING: at fs/btrfs/extent-tree.c:5985

 This is a warning and it shows up from time to time, across recent
 releases.

 This didn't cause any crashes or related issues, also unrelated (I
 think?) is that I get very low performance (about 5MBps according to
 iostat) in a guest under Qemu/KVM.
 3.2.7-1-ARCH x86_64 Arch Linux

 Yes, the performance goes down as some pathological should-not-happen
 code path is taken. I myself haven't seen this error recently during
 testing, but at the time I did, it slowed down the machine for a while,
 i.e. it's not an infinite loop. Seems there's a dark corner of
 space reservations left for Josef.


 david


Re: LABEL only 1 device

2012-02-27 Thread Helmut Hullen
Hello, Duncan,

You wrote on 27.02.12:

I've said this several times: Your expectations are wrong. You
 don't label partitions.

 Yes - now I know.
 But I'm afraid other people also expect wrong - when I use
 mkfs.ext[234] then this option works (in another way than with
 mkfs.btrfs).

 AFAIK, it works in the same way... that is, it labels the, in that
 case, ext2/3/4 filesystem, in this case (mkfs.btrfs), btrfs
 filesystem.

 From the manpages:

 mkfs.btrfs (aka mkbtrfs):

-L, --label name
   Specify a label for the filesystem.

 mkfs.ext2/3/4 (aka mke2fs):

-L new-volume-label
   Set  the  volume  label  for the filesystem to
 new-volume-label.  The maximum length of the
   volume label is 16 bytes.

But there's a small difference:

mke2fs -L MyLabel /dev/sdn4

only sets/changes the label (ok - it tests the type of the partition and  
refuses labeling if the type doesn't fit).

mkfs.btrfs -L MyLabel /dev/sdn4

not only sets/changes the label but also (re-)creates a btrfs  
filesystem, using the default parameters.

I had to learn this difference ...

Best regards!
Helmut


[RFC PATCH 21/22] btrfs: add support for read_iter, write_iter, and direct_IO_bvec

2012-02-27 Thread Dave Kleikamp
Some helpers were broken out of btrfs_direct_IO() in order to avoid code
duplication in new bio_vec-based function.

Signed-off-by: Dave Kleikamp dave.kleik...@oracle.com
Cc: Zach Brown z...@zabbo.net
Cc: Chris Mason chris.ma...@oracle.com
Cc: linux-btrfs@vger.kernel.org
---
 fs/btrfs/file.c  |2 +
 fs/btrfs/inode.c |  116 +-
 2 files changed, 82 insertions(+), 36 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 859ba2d..7a2fbc0 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1880,6 +1880,8 @@ const struct file_operations btrfs_file_operations = {
.aio_read   = generic_file_aio_read,
.splice_read= generic_file_splice_read,
.aio_write  = btrfs_file_aio_write,
+   .read_iter  = generic_file_read_iter,
+   .write_iter = generic_file_write_iter,
.mmap   = btrfs_file_mmap,
.open   = generic_file_open,
.release= btrfs_release_file,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 32214fe..52199e7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6151,24 +6151,14 @@ static ssize_t check_direct_IO(struct btrfs_root *root, int rw, struct kiocb *io
 out:
return retval;
 }
-static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
-   const struct iovec *iov, loff_t offset,
-   unsigned long nr_segs)
+
+static ssize_t btrfs_pre_direct_IO(int writing,  loff_t offset, size_t count,
+  struct inode *inode, int *write_bits)
 {
-   struct file *file = iocb->ki_filp;
-   struct inode *inode = file->f_mapping->host;
struct btrfs_ordered_extent *ordered;
struct extent_state *cached_state = NULL;
u64 lockstart, lockend;
ssize_t ret;
-   int writing = rw & WRITE;
-   int write_bits = 0;
-   size_t count = iov_length(iov, nr_segs);
-
-   if (check_direct_IO(BTRFS_I(inode)->root, rw, iocb, iov,
-   offset, nr_segs)) {
-   return 0;
-   }
 
lockstart = offset;
lockend = offset + count - 1;
@@ -6176,7 +6166,7 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
if (writing) {
ret = btrfs_delalloc_reserve_space(inode, count);
if (ret)
-   goto out;
+   return ret;
}
 
while (1) {
@@ -6191,8 +6181,8 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
 lockend - lockstart + 1);
if (!ordered)
break;
-   unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, lockend,
-&cached_state, GFP_NOFS);
+   unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart,
+lockend, &cached_state, GFP_NOFS);
btrfs_start_ordered_extent(inode, ordered, 1);
btrfs_put_ordered_extent(ordered);
cond_resched();
@@ -6203,46 +6193,99 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
 * the dirty or uptodate bits
 */
if (writing) {
-   write_bits = EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING;
-   ret = set_extent_bit(&BTRFS_I(inode)->io_tree, lockstart, lockend,
-EXTENT_DELALLOC, 0, NULL, &cached_state,
-GFP_NOFS);
-   if (ret) {
+   *write_bits = EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING;
+   ret = set_extent_bit(&BTRFS_I(inode)->io_tree, lockstart,
+lockend, EXTENT_DELALLOC, 0, NULL,
+&cached_state, GFP_NOFS);
+   if (ret)
 clear_extent_bit(&BTRFS_I(inode)->io_tree, lockstart,
-lockend, EXTENT_LOCKED | write_bits,
+lockend, EXTENT_LOCKED | *write_bits,
 1, 0, &cached_state, GFP_NOFS);
-   goto out;
-   }
}
-
free_extent_state(cached_state);
-   cached_state = NULL;
 
-   ret = __blockdev_direct_IO(rw, iocb, inode,
-  BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev,
-  iov, offset, nr_segs, btrfs_get_blocks_direct, NULL,
-  btrfs_submit_direct, 0);
+   return ret;
+}
+
+static ssize_t btrfs_post_direct_IO(ssize_t ret, loff_t offset, size_t count,
+  struct inode *inode, int *write_bits)
+{
+   struct extent_state *cached_state = NULL;
 
 if (ret < 0 && ret != -EIOCBQUEUED) {
 clear_extent_bit(&BTRFS_I(inode)->io_tree, offset,
- offset + iov_length(iov, nr_segs) - 1,
-  

Re: LABEL only 1 device

2012-02-27 Thread Hugo Mills
On Mon, Feb 27, 2012 at 10:15:00PM +0100, Helmut Hullen wrote:
 You wrote on 27.02.12:
 
 I've said this several times: Your expectations are wrong. You
  don't label partitions.
 
  Yes - now I know.
  But I'm afraid other people also expect wrong - when I use
  mkfs.ext[234] then this option works (in another way than with
  mkfs.btrfs).
 
  AFAIK, it works in the same way... that is, it labels the, in that
  case, ext2/3/4 filesystem, in this case (mkfs.btrfs), btrfs
  filesystem.
 
  From the manpages:
 
  mkfs.btrfs (aka mkbtrfs):
 
 -L, --label name
Specify a label for the filesystem.
 
  mkfs.ext2/3/4 (aka mke2fs):
 
 -L new-volume-label
Set  the  volume  label  for the filesystem to
new-volume-label.  The maximum length of the
volume label is 16 bytes.
 
 But there's a small difference:
 
 mke2fs -L MyLabel /dev/sdn4
 
 only sets/changes the label (ok - it tests the type of the partition and  
 refuses labeling if the type doesn't fit).

   That feels really weird. It wouldn't ever occur to me to look at a
mkfs tool to relabel a filesystem without destroying the data on it. I
view this behaviour as a bug.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Great oxymorons of the world, no. 6: Mature Student ---   




Re: LABEL only 1 device

2012-02-27 Thread Felix Blanke

Hi Helmut,

are you sure that 'mkfs.ext2/3/4 -L label /dev/xxx' doesn't create a 
new fs?


AFAIK, to change the label of an existing (ext2/3/4) filesystem you should use
tune2fs.


I don't have a Linux system available right now, but this is what I would
expect and what would make a lot more sense than changing a label via
mkfs.ext2/3/4. If you are correct about that labeling behaviour, then the
btrfs way makes about 1000x more sense than the way ext2/3/4 does it.


mkfs should only be used for creating filesystems. For changing an existing
fs, tools like tune2fs, btrfs etc. should be used.
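
For illustration, the difference can be tried on a scratch image file rather than a real device. This is only a sketch: it assumes e2fsprogs (mke2fs, e2label, tune2fs) is installed, and the image file and label names are made up for the example. The btrfs line is left commented out because it needs a mounted btrfs filesystem.

```shell
# Sketch only: operate on a throwaway image file, never on a device with data.
img=$(mktemp)                      # hypothetical scratch file
truncate -s 8M "$img"

# mkfs creates a new, empty filesystem and sets its label at the same time.
mke2fs -q -F -L oldlabel "$img"

# Relabel the existing filesystem in place; nothing is recreated.
e2label "$img" newlabel            # equivalent: tune2fs -L newlabel "$img"

# Read the label back and print it.
label=$(e2label "$img")
echo "$label"

# btrfs counterpart (needs a mounted filesystem, so it is not run here):
# btrfs filesystem label /mnt newlabel

rm -f "$img"
```

Running mke2fs -L a second time on the same image would again produce an empty filesystem, which is the behaviour to avoid on anything holding data.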


Regards,
Felix


On 2/27/12 10:15 PM, Helmut Hullen wrote:

Hello, Duncan,

You wrote on 27.02.12:


I've said this several times: Your expectations are wrong. You
don't label partitions.



Yes - now I know.
But I'm afraid other people also expect wrong - when I use
mkfs.ext[234] then this option works (in another way than with
mkfs.btrfs).



AFAIK, it works in the same way... that is, it labels the, in that
case, ext2/3/4 filesystem, in this case (mkfs.btrfs), btrfs
filesystem.



 From the manpages:



mkfs.btrfs (aka mkbtrfs):



-L, --label name
   Specify a label for the filesystem.



mkfs.ext2/3/4 (aka mke2fs):



-L new-volume-label
   Set  the  volume  label  for the filesystem to
  new-volume-label.  The maximum length of the
   volume label is 16 bytes.


But there's a small difference:

 mke2fs -L MyLabel /dev/sdn4

only sets/changes the label (ok - it tests the type of the partition and
refuses labeling if the type doesn't fit).

 mkfs.btrfs -L MyLabel /dev/sdn4

not only sets/changes the label but also (re-)creates a btrfs
filesystem, using the default parameters.

I had to learn this difference ...

Best regards!
Helmut
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: LABEL only 1 device

2012-02-27 Thread Hugo Mills
On Mon, Feb 27, 2012 at 10:15:00PM +0100, Helmut Hullen wrote:
 You wrote on 27.02.12:
 
 I've said this several times: Your expectations are wrong. You
  don't label partitions.
 
  Yes - now I know.
  But I'm afraid other people also expect wrong - when I use
  mkfs.ext[234] then this option works (in another way than with
  mkfs.btrfs).
 
  AFAIK, it works in the same way... that is, it labels the, in that
  case, ext2/3/4 filesystem, in this case (mkfs.btrfs), btrfs
  filesystem.
 
  From the manpages:
 
  mkfs.btrfs (aka mkbtrfs):
 
 -L, --label name
Specify a label for the filesystem.
 
  mkfs.ext2/3/4 (aka mke2fs):
 
 -L new-volume-label
Set  the  volume  label  for the filesystem to
new-volume-label.  The maximum length of the
volume label is 16 bytes.
 
 But there's a small difference:
 
 mke2fs -L MyLabel /dev/sdn4
 
 only sets/changes the label (ok - it tests the type of the partition and  
 refuses labeling if the type doesn't fit).

   OK, I have just tried this out. It does set the filesystem label.
It also wipes the filesystem, as I expected it to. You clearly aren't
doing this on existing filesystems with data in them.

hrm@ruth:~ $ sudo mke2fs /dev/loop0 
mke2fs 1.42 (29-Nov-2011)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
Stride=0 blocks, Stripe width=0 blocks
25688 inodes, 102400 blocks
5120 blocks (5.00%) reserved for the super user
First data block=1
Maximum filesystem blocks=67371008
13 block groups
8192 blocks per group, 8192 fragments per group
1976 inodes per group
Superblock backups stored on blocks: 
   8193, 24577, 40961, 57345, 73729

Allocating group tables: done
Writing inode tables: done
Writing superblocks and filesystem accounting information: done 

hrm@ruth:~ $ sudo mount /dev/loop0 /mnt
hrm@ruth:~ $ sudo dd if=/dev/urandom of=/mnt/foo bs=1M count=5
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 1.74158 s, 3.0 MB/s
hrm@ruth:~ $ ls /mnt
foo  lost+found
hrm@ruth:~ $ sudo umount /mnt
hrm@ruth:~ $ sudo mke2fs -L newlabel /dev/loop0 
mke2fs 1.42 (29-Nov-2011)
Filesystem label=newlabel
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
Stride=0 blocks, Stripe width=0 blocks
25688 inodes, 102400 blocks
5120 blocks (5.00%) reserved for the super user
First data block=1
Maximum filesystem blocks=67371008
13 block groups
8192 blocks per group, 8192 fragments per group
1976 inodes per group
Superblock backups stored on blocks: 
   8193, 24577, 40961, 57345, 73729

Allocating group tables: done
Writing inode tables: done
Writing superblocks and filesystem accounting information: done 

hrm@ruth:~ $ sudo mount /dev/loop0 /mnt
hrm@ruth:~ $ ls /mnt
lost+found
hrm@ruth:~ $ 


   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Great oxymorons of the world, no. 6: Mature Student ---   




Re: LABEL only 1 device

2012-02-27 Thread Helmut Hullen
Hello, Hugo,

You wrote on 27.02.12:

 But there's a small difference:

 mke2fs -L MyLabel /dev/sdn4

 only sets/changes the label (ok - it tests the type of the partition
 and refuses labeling if the type doesn't fit).

OK, I have just tried this out. It does set the filesystem label.
 It also wipes the filesystem, as I expected it to. You clearly aren't
 doing this on existing filesystems with data in them.

I should have tested it ... sorry.
I've always labeled my ext[234] partitions with e2label.

And because I hadn't found such a simple command for btrfs, I used
mkfs.btrfs -L instead of btrfs fi label mountpoint.

Best regards!
Helmut


Re: Set nodatacow per file?

2012-02-27 Thread Chester
On Mon, Feb 27, 2012 at 7:54 AM, dima dole...@parallels.com wrote:
 Hello,

 Since several people asked to post the results, here they are.
 I tried raw virtio disk with and without -z -C set and also qcow2 virtio
 disk without -z -C set and did not notice any difference in performance at
 all - Redhat 6.2 Minimal installs in 10 minutes in each case. The abysmal
 performance seen several months ago (like 10 minutes just for
 virtual disk formatting) under the same conditions is gone, at least on
 3.3.0-rc5.

Just to make sure, this is a _new_ virtual disk right? I can barely
contain my excitement right now. This is amazing progress.


 best
 ~dima



 On 02/24/2012 02:22 PM, dima wrote:

 On 02/13/2012 04:17 PM, Ralf-Peter Rohbeck wrote:

 Hello,
 is it possible to set nodatacow on a per-file basis? I couldn't find
 anything.
 If not, wouldn't that be a great feature to get around the performance
 issues with VM and database storage? Of course cloning should still
 cause COW.


 Hello,
 Going back to the original question from Ralf I wanted to share my
 experience.

 Yesterday I set up KVM+qemu and set -z -C with David's 'fileflags'
 utility for the VM image file.
 I was very pleased with results - Redhat 6 Minimal installation was
 installed in 10 minutes whereas it was taking 'forever' the last time I
 tried it some 4 months ago. Writes during installation were very
 moderate. Performance of VM is excellent. Installing some big packages
 with yum inside VM goes very quickly with the speed indistinguishable
 from that of bare metal installs.

 I am not quite sure should this improvement be attributed to the nocow
 and nocompress flags or to the overall improvement of btrfs (I am on
 3.3-rc4 kernel) but KVM is definitely more than usable on btrfs now.

 I am yet to test the install speed and performance without those flags
 set.

 best
 ~dima


Re: SELinux inode size gotcha in btrfs.

2012-02-27 Thread Hugo Mills
On Mon, Feb 27, 2012 at 10:18:55PM +, Alex wrote:
 I've come across the 'gotcha' in XFS where the inode size defaults to 256 [1]
 whereas for SELinux the attributes play better when you initialise it at
 creation to 512.

   A btrfs inode structure is 136 bytes in size. xattrs and any inline
file data are separate from the inode structure, stored with
additional keys in the FS tree (which means that they're quite likely
to appear in the same page as the inode data, but that is not guaranteed).

 From my reading of the btrfs specs [2] it doesn't look like you'll get caught
 with that as the inodes will not contain embedded file data or extended
 attribute data. These things are stored in other item types.
 
 Have I read that right? I've seen xattr bugs patches etc but nothing that
 would hit the SE Linux domain.

   It's not clear from looking at the gentoo doc what the problem
actually is with different inode sizes... Without some kind of
indication what the issue really is, it's kind of hard to say how this
might affect btrfs.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- The enemy have elected for Death by Powerpoint. That's what ---   
 they shall get.  -- gdb 




Re: SELinux inode size gotcha in btrfs.

2012-02-27 Thread David Sterba
On Mon, Feb 27, 2012 at 10:18:55PM +, Alex wrote:
 From my reading of the btrfs specs [2] it doesn't look like you'll get caught
 with that as the inodes will not contain embedded file data or extended
 attribute data. These things are stored in other item types.

 Have I read that right? I've seen xattr bugs patches etc but nothing that
 would hit the SE Linux domain.

That's right. An inode, represented as a btrfs_inode_item, does not contain
any xattr fields; they are stored independently as a btrfs_dir_item of type
BTRFS_FT_XATTR. Due to the way the b-tree keys are built, the xattr item key
should be stored near the inode item key, at least as far as the tree search
is concerned. The xattr data are always stored inline in the b-tree leaf.
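
A rough way to see why the xattr item lands near the inode item: btrfs keys are (objectid, type, offset) triples compared in that order, so all items belonging to one inode sort together. The sketch below simply sorts a few made-up keys with sort(1). The type numbers (1 for the inode item, 12 for the inode ref, 24 for the xattr item, 108 for extent data) are what I believe ctree.h defines; the objectids and the xattr name hash used as the offset are invented for the example.

```shell
# Each line: objectid type offset description
# Objectids 257/258 and the offset 123456 (standing in for the xattr
# name hash) are hypothetical values.
keys='257 108 0 EXTENT_DATA
257 1 0 INODE_ITEM
258 1 0 INODE_ITEM (next inode)
257 24 123456 XATTR_ITEM
257 12 256 INODE_REF'

# Compare numerically on objectid, then type, then offset,
# mirroring the order in which the key fields are compared.
sorted=$(printf '%s\n' "$keys" | sort -n -k1,1 -k2,2 -k3,3)
printf '%s\n' "$sorted"
```

In the sorted output the XATTR_ITEM of inode 257 sits between that inode's INODE_REF and its EXTENT_DATA, ahead of anything belonging to inode 258.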


david


Re: Set nodatacow per file?

2012-02-27 Thread dima

On 02/28/2012 07:10 AM, Chester wrote:

On Mon, Feb 27, 2012 at 7:54 AM, dima dole...@parallels.com wrote:

Hello,

Since several people asked to post the results, here they are.
I tried raw virtio disk with and without -z -C set and also qcow2 virtio
disk without -z -C set and did not notice any difference in performance at
all - Redhat 6.2 Minimal installs in 10 minutes in each case. The abysmal
performance seen several months ago (like 10 minutes just for
virtual disk formatting) under the same conditions is gone, at least on
3.3.0-rc5.


Just to make sure, this is a _new_ virtual disk right? I can barely
contain my excitement right now. This is amazing progress.


Yes, it is a newly created virtual disk. By the way, one thing I left out
of my message: for raw I pre-allocated the entire image, but for qcow2 I
unchecked this box in virt-manager and the disk grew as the system was
installing. Nevertheless, I did not notice any performance degradation
during the install.


best
~dima


Re: [PATCH] Btrfs: hold enough space for global_rsv

2012-02-27 Thread Liu Bo
On 02/27/2012 09:29 PM, Johannes Hirte wrote:
 On Tue, 17 Jan 2012 17:51:59 +0800,
 Liu Bo liubo2...@cn.fujitsu.com wrote:
 
 I've kept hitting enospc warnings of global_rsv while running
 defragment on files:
 btrfs: block rsv returned -28
 WARNING: at fs/btrfs/extent-tree.c:5984
 btrfs_alloc_free_block+0x333/0x340 [btrfs]() ...

 I used a fio job to create a file with lots of fragments:
 $ filefrag /mnt/btrfs/foobar
 /mnt/btrfs/foobar: 66964 extents found

 and then btrfs fi defrag /mnt/btrfs/foobar && sync would pop the
 warnings.

 I found that the global_rsv size is just not enough for defragment,
 and didn't find any space leak in using global_rsv, so double it and
 go ahead.

 Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
 ---
  fs/btrfs/extent-tree.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index 8603ee4..77ea23c 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -3979,7 +3979,7 @@ static u64 calc_global_metadata_size(struct btrfs_fs_info *fs_info)
  	num_bytes += div64_u64(data_used + meta_used, 50);
 
  	if (num_bytes * 3 > meta_used)
 -		num_bytes = div64_u64(meta_used, 3);
 +		num_bytes = div64_u64(meta_used, 3) * 2;
 
  	return ALIGN(num_bytes, fs_info->extent_root->leafsize << 10);
  }
 
 This patch breaks my system. With this applied, all services fail on
 boot with "no space left" messages.
 

It's weird since this patch is just aiming to enlarge our metadata reservation 
count.
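
To make the effect of the one-line change concrete, here is a userspace sketch of just the capping step, written as shell arithmetic. It is not the kernel code: the input numbers are invented, the leafsize of 4096 is an assumption, and the part of the function that computes num_bytes before the cap is not reproduced.

```shell
# Sketch of the cap and alignment at the end of calc_global_metadata_size().
#   $1 = num_bytes before the cap (hypothetical input)
#   $2 = meta_used
#   $3 = multiplier: 1 for the old code, 2 for the patched code
global_rsv() {
    num_bytes=$1
    meta_used=$2
    mult=$3
    align=$((4096 << 10))          # leafsize << 10, assuming leafsize = 4096

    # Cap the reservation relative to the metadata actually in use.
    if [ $((num_bytes * 3)) -gt "$meta_used" ]; then
        num_bytes=$((meta_used / 3 * mult))
    fi

    # ALIGN(): round up to the next multiple of align.
    echo $(( (num_bytes + align - 1) / align * align ))
}

MiB=$((1024 * 1024))
global_rsv $((600 * MiB)) $((900 * MiB)) 1   # old cap:     314572800 (300 MiB)
global_rsv $((600 * MiB)) $((900 * MiB)) 2   # patched cap: 629145600 (600 MiB)
```

With these made-up inputs the ceiling goes from one third to two thirds of the used metadata, which is all the patch is meant to change.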

So you've tried a revert or a bisect, right? Can you show me the environment
or any log messages?

thanks,
liubo