Re: [PATCH 6/6] Btrfs: do aio_write instead of write

2010-06-01 Thread Dmitri Nikulin
On Tue, Jun 1, 2010 at 11:19 PM, Chris Mason chris.ma...@oracle.com wrote:
> > Is it ok not to unlock_extent if !ordered?
> > I don't know if you fixed this in a later version but it stuck out to me :)
>
> The construct is confusing.  Ordered extents track things that we have
> allocated on disk and need to write.  New ones can't be created while we
> have the extent range locked.  But we can't force old ones to disk with
> the lock held.
>
> So, we lock then lookup and if we find nothing we can safely do our
> operation because no io is in progress.  We unlock a little later on, or
> at endio time.
>
> If we find an ordered extent we drop the lock and wait for that IO to
> finish, then loop again.

Ok, that's fair enough. Maybe it's worth a comment in the code; I'm sure
I'm not the only one it surprised.

Thanks,

-- 
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/6] Btrfs: do aio_write instead of write

2010-05-30 Thread Dmitri Nikulin
On Sat, May 22, 2010 at 3:03 AM, Josef Bacik jo...@redhat.com wrote:
> +       while (1) {
> +               lock_extent(tree, start, end, GFP_NOFS);
> +               ordered = btrfs_lookup_ordered_extent(inode, start);
> +               if (!ordered)
> +                       break;
> +               unlock_extent(tree, start, end, GFP_NOFS);

Is it ok not to unlock_extent if !ordered?
I don't know if you fixed this in a later version but it stuck out to me :)

-- 
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia


Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Dmitri Nikulin
On Tue, May 5, 2009 at 5:11 AM, Heinz-Josef Claes hjcl...@web.de wrote:
> Hi, during the last half year I thought a little bit about doing dedup for
> my backup program: not only with fixed blocks (which is implemented), but
> with moving blocks (with all offsets in a file: 1 byte, 2 byte, ...). That
> means I have to do *lots* of comparisons (size of file - blocksize). Even
> though it's not quite the same, it must be very fast, so it's the same
> problem as the one discussed here.
>
> My solution (not yet implemented) is as follows (hopefully I remember well):
>
> I calculate a checksum of 24 bits. (there can be another size)
>
> This means I can have 2^24 different checksums.
>
> Therefore, I hold a bit vector of 0.5 GB in memory (I hope I remember well,
> I'm just in a hotel and have no calculator): one bit for each possibility.
> This vector is initialized with zeros.
>
> For each calculated checksum of a block, I set the according bit in the bit
> vector.
>
> It's very fast to check if a block with a special checksum exists in the
> filesystem (backup for me) by checking the appropriate bit in the bit
> vector.
>
> If it doesn't exist, it's a new block.
>
> If it exists, there needs to be a separate 'real' check if it's really the
> same block (which is slow, but that's happening 1% of the time).

Which means you have to refer to each block in some unique way from
the bit vector, making it a block pointer vector instead. That's only
64 times more expensive for a 64-bit offset...

Since the overwhelming majority of combinations will never appear in
practice, you are much better served by a self-sizing data structure
such as a hash map, a binary tree, or a hash map whose buckets are
binary trees, etc. You can use any size of hash and it won't affect
the number of nodes you have to store, and you can trade off CPU
against RAM as required just by selecting an appropriate data
structure. A bit vector, and especially a pointer vector, has terrible
RAM requirements in every case: even if you're deduplicating a mere
10 blocks, you're still allocating and initialising 2^24 entries. The
least you could do is adaptively switch to a more efficient data
structure when the number of blocks is low enough.

-- 
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia


Re: kernel bug in file-item.c

2009-04-30 Thread Dmitri Nikulin
On Thu, Apr 30, 2009 at 5:04 AM, Marc R. O'Connor
mrocon...@oel.state.nj.us wrote:
> > If you can stomach it, you can get a second opinion from the bootable
> > Windows memory testing iso:
> >
> >   http://oca.microsoft.com/en/windiag.asp
>
> It will be hard but I might just try it. Two versions of memtest86+ die
> in the middle of the scan. Ugh.

Also try memtester under Linux. Boot up as you normally do and give it
a fair chunk of your RAM to run on. Unlike memtest86+ it leaves all
the actual low-level work to the kernel; it doesn't even need root,
let alone its own boot media. At least then you can thoroughly rule
out a memtest86 bug.

-- 
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia


Re: kernel bug in file-item.c

2009-04-30 Thread Dmitri Nikulin
On Fri, May 1, 2009 at 2:54 AM, Tracy Reed tr...@ultraviolet.org wrote:
> Sorry, I was unclear: I meant manage in a public-relations sort of
> way. Not in a technical way. You are absolutely right that bad RAM or
> CPU means you are hosed.

Even so, it's a perfect opportunity not to make things worse by trying
to write data after fundamental assertions have already failed. My
most recent data-loss scenario with ext3 involved a little
kernel/hardware/whatever glitch that would have been harmless on its
own, but ext3 took it as a cue to completely mangle metadata. I found
XML config files with PNG blocks in them, etc.

-- 
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia


Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Dmitri Nikulin
On Wed, Apr 29, 2009 at 3:43 AM, Chris Mason chris.ma...@oracle.com wrote:
> So you need an extra index either way.  It makes sense to keep the
> crc32c csums for fast verification of the data read from disk and only
> use the expensive csums for dedup.

What about self-healing? With only a CRC32 to distinguish a good block
from a bad one, statistically you'll get an incorrectly healed block
about once every few billion corrupted blocks. That may never be your
machine, but it'll be somebody's, since the probability is far too
high for it not to happen to somebody. Even a 64-bit checksum would
drop the probability plenty, but I'd really only start at 128 bits.
NetApp uses 64 bits per 4k of data; ZFS uses 256 bits per block, and
its checksums chain back to the root like a highly dynamic Merkle
tree.

In the CRC case the only safe redundancy is one that keeps 3+ copies
of the block and compares the raw data itself, at which point you may
as well have just been using self-healing RAID1 without checksums.

-- 
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia


Re: LVM vs btrfs as a volume manager for SANs

2009-04-22 Thread Dmitri Nikulin
On Thu, Apr 23, 2009 at 5:20 AM, Tomasz Chmielewski man...@wpkg.org wrote:
> - what happens if the SAN machine crashes while the iSCSI file images were
> being written to; with LVM and its block devices, I'm somehow more confident
> it wouldn't cause more data loss than necessary

I would hope that COW applies to such a setting. It certainly does for
ZFS, making it an excellent backend for iSCSI. At least, having it on
btrfs shouldn't make it any less reliable than on LVM, as long as
btrfs does its job correctly.

-- 
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia


Re: Btrfs development plans

2009-04-21 Thread Dmitri Nikulin
On Tue, Apr 21, 2009 at 5:46 PM, Stephan von Krawczynski
sk...@ithnet.com wrote:
> On Mon, 20 Apr 2009 12:38:57 -0400
> Chris Mason chris.ma...@oracle.com wrote:
>
> > The short answer from my point of view is yes.  This doesn't really
> > change the motivations for working on btrfs or the problems we're trying
> > to solve.
>
> ... which sounds logical to me. From looking at the project for a while one
> can see you are trying to solve problems that are not really linux' ones...

Even so, I certainly hope that btrfs ends up at least as reliable and
feature-complete as ZFS, if ZFS itself cannot be merged into Linux.
That's a big ask, but now that ZFS's IP has been brought into Oracle,
perhaps a lot of the patent and copyright issues can be smoothed over,
giving btrfs a huge advantage relative to what it had before the
acquisition.

-- 
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia


Re: interesting use case for multiple devices and delayed raid?

2009-04-01 Thread Dmitri Nikulin
On Wed, Apr 1, 2009 at 8:17 PM, Brian J. Murrell br...@interlinx.bc.ca wrote:
> I have a use case that I wonder if anyone might find interesting
> involving multiple device support and delayed raid.
>
> Let's say I have a system with two disks of equal size (to make it easy)
> which has sporadic, heavy, write requirements.  At some points in time
> there will be multiple files being appended to simultaneously and at
> other times, there will be no activity at all.
>
> The write activity is time sensitive, however, so the filesystem must be
> able to provide guaranteed (only in a loose sense -- not looking for
> real QoS reservation semantics) bandwidths at times.  Let's say slightly
> (but within the realm of reality) less than the bandwidth of the two
> disks combined.

I assume you mean read bandwidth, since write bandwidth cannot be
increased by mirroring, only striping. If you intend to stripe first,
then mirror later as time permits, this is the kind of sophistication
you will need to write in the program code itself.

A filesystem is a handy abstraction, but you are by no means limited
to using it. If you have very special needs, you can get pretty far by
writing your own meta-filesystem to add semantics you don't have in
your kernel filesystem of choice. That's what every single database
application does. You can get even further by writing a complete
user-space filesystem as part of your program, or a shared daemon, and
the performance isn't really that bad.

> I also want both the metadata and file data mirrored between the two
> disks so that I can afford to lose one of the disks and not lose (most
> of) my data.  It is not a strict requirement that all data be
> immediately mirrored however.

This is handled by DragonFly BSD's HAMMER filesystem. A master gets
written to, and asynchronously updates a slave, even over a network.
It is transactionally consistent and virtually impossible to corrupt
as long as the disk media is stable. However as far as I know it won't
spread reads, so you'll still get the performance of one disk.

A more complete solution, that requires no software changes, would be
to have 3 or 4 disks. A stripe for really fast reads and writes, and
another disk (or another stripe) to act as a slave to the data being
written to the primary stripe. This seems to do what you want, at a
small price premium.

-- 
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia


Re: interesting use case for multiple devices and delayed raid?

2009-04-01 Thread Dmitri Nikulin
On Thu, Apr 2, 2009 at 8:04 AM, Brian J. Murrell br...@interlinx.bc.ca wrote:
> > A more complete solution, that requires no software changes, would be
> > to have 3 or 4 disks. A stripe for really fast reads and writes, and
> > another disk (or another stripe) to act as a slave to the data being
> > written to the primary stripe. This seems to do what you want, at a
> > small price premium.
>
> No.  That's not really what I am describing at all.

Well, you get the bandwidth of two disks when reading and writing, with
the data still mirrored to a second stripe as time permits. Kind of
like a delayed RAID10.

> I apologize if my original description was unclear.  Hopefully it is
> more so now.

Yes. It'll be up to the actual filesystem devs to weigh in on whether
it's worth implementing.

-- 
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia


Re: metadata copied/data not copied?

2009-03-16 Thread Dmitri Nikulin
On Tue, Mar 17, 2009 at 1:35 AM, CSights csig...@fastmail.fm wrote:
> Hi everyone,
>        I'm curious what would happen in btrfs if the following commands
> were issued:
>
> # cp file1 file2
> # chown newuser:newgroup file2
>
> Where file1 was owned by olduser:oldgroup.
>
>        If I understand copy-on-write correctly, the cp would merely create
> a new pointer (or whatever it is called :( ) containing the file's
> metadata, but the file contents would not actually be duplicated.

Certainly not. cp is a userspace program with no knowledge of
filesystem details like COW or, as you describe, deduplication.

The semantics you are looking for come with links, except that they
don't. Hard links are implemented at the filesystem level, but they
are not copy-on-write at the user-space level: if you write to the
linked file, the change appears via all the other links too. Soft
links are nothing more than a transparent shortcut that happens to
give most of the semantics people want from hard links in a much more
flexible way. But neither allows a file to have contradictory
ownership; you need ACLs to hack up access rights to mimic that. Hard
links and soft links make multiple paths refer to the same file, not
merely the same contents with different metadata. Otherwise you could
soft link /bin/sh into your home directory, setuid the link, and own
the machine. Clearly this is not the case, and it won't be with btrfs
either.

And if you're not talking about soft or hard links, I've no idea how
you thought that would work. It is possible for a filesystem to
detect identical blocks between files, but without more guidance it
would be very expensive to do so, and of questionable benefit.

-- 
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia


Re: metadata copied/data not copied?

2009-03-16 Thread Dmitri Nikulin
On Tue, Mar 17, 2009 at 8:14 AM, Dmitri Nikulin dniku...@gmail.com wrote:
> Otherwise you could soft link
> /bin/sh into your home directory, setuid the link, and own the machine.

Sorry, that was a terrible example; only root can setuid anyway. A
better example is linking to /bin/sh and making your link writable,
then using that to inject malicious code, which would work just as
well and would be possible with the semantics you described.

-- 
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia


Re: questions about GRUB and BTRFS

2009-02-24 Thread Dmitri Nikulin
On Wed, Feb 25, 2009 at 12:32 PM, Anthony Roberts
btrfs-de...@arbitraryconstant.com wrote:
> Hi,
>
> A quick googling turns up posts that GRUB support for BTRFS is planned. My
> curiosity is more towards how this will be managed, because the way this is
> currently implemented with software RAID/LVM is quite haphazard. I
> therefore have some questions about GRUB + BTRFS:

IANAGD (I Am Not A GRUB Developer), but I'll post some intuitive responses.

> -With GRUB booting, it's easy to think of awkward use cases and limitations
> unless it's capable of discovering BTRFS instances, and can boot by
> specifying BTRFS UUID + subvolume. That seems quite ambitious, but is this
> planned eventually?

I don't know how much filesystem code can be crammed into the
pre-/boot parts of GRUB, but I doubt it's enough to support btrfs'
advanced features like object-level striping.

For comparison, here is how the two major ZFS operating systems
support root on ZFS:

* Solaris (OpenSolaris, Nexenta, etc.) supports booting from ZFS using
GRUB, but ONLY plain or mirrored pools, not striped or raid-z. Not
sure about linear, or whether the kernel can be installed on anything
but the first vdev.

* FreeBSD unofficially supports / on ZFS very well, but you still need
a /boot partition to let the bootloader find the kernel and modules.
The kernel itself can be given a ZFS pool and path such as
zfs:pool/freebsd/root and it will find all of the ZFS metadata it
needs from the disk blocks and the small cache in /boot. In return for
this /boot you get the ability to boot right off RAID-Z or whatever
you like, because it's the kernel, with full driver and filesystem
code, doing the work instead of very limited bootloader code.

> -Might it be possible to tweak the userspace component of GRUB to install
> the bootloader to every member device? This seems necessary for reliable
> booting and rebuilding after a dead disk.

Even if you couldn't tweak GRUB, device-mapper already has an easy way
to mirror just the boot blocks per disk. However, GRUB would get
confused, since the virtual device does not map to a BIOS boot device;
legacy BIOS booting is a pain that way. You may as well just write a
shell script that invokes grub-install for each device individually.

> -64 kb at the beginning of the device is plenty for MBR + GRUB stage 1 +
> 1.5. Might this allow bootable BTRFS without partitions being used at all?
> The space used for partitioning is negligible, however we're on the cusp of
> disks that are too big to partition with MBR, and GPT booting doesn't seem
> well supported yet.

As far as I know, we don't even have a way to boot straight off LVM
(because GRUB doesn't support it, and the kernel and initrd need to
live on a supported partition), and btrfs would only be more difficult.

> There's obviously no point in getting worked up about this before
> production ready support is available in the first place. :) However, I am
> curious about what sort of implementation is planned.

Well before production-ready support is there, people will already
want to test btrfs as their / (which should be automagic, as it is for
FreeBSD ZFS) and their /boot (because they're difficult that way).
Long before reiser4 was even proposed for a mainline merge, it already
had GRUB support. Enthusiasts will always believe that even /boot
should be fortified with COW, checksums and snapshotting :)

Especially if btrfs is intended to be the next default Linux
filesystem, as quoted in many places, it will need /boot support in
some form. I'll personally keep an ext3 /boot for a long time, just
because recovery is easier that way.

-- 
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia


Re: ssd optimised mode

2009-02-23 Thread Dmitri Nikulin
On Tue, Feb 24, 2009 at 3:10 PM, Martin K. Petersen
martin.peter...@oracle.com wrote:
> Dmitri == Dmitri Nikulin dniku...@gmail.com writes:
>
> Dmitri If that's the case, why is it marketed for Windows Vista only,
> Dmitri and referring to filesystem features like marking unused blocks?
> Dmitri Surely if it was at the device level it would be OS-neutral, and
> Dmitri marketed as such.
>
> The article you posted references some benchmarketing numbers involving
> Vista.  That does not imply it's a Windows-only product.
>
>         http://en.wikipedia.org/wiki/ExtremeFFS

So it seems. I misread the marketing material, and I'm very relieved
that Linux will benefit just as much from the improvement, indeed
probably better. Thank you very much for correcting me, Martin and
Dongjun.


-- 
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia


Re: ssd optimised mode

2009-02-22 Thread Dmitri Nikulin
On Mon, Feb 23, 2009 at 12:22 PM, Dongjun Shin djshi...@gmail.com wrote:
> A well-designed SSD should survive power cycling and should provide
> atomicity of flush operations regardless of the underlying flash
> operations. I don't expect that users of SSDs have different
> requirements about atomicity.

Well, that's my point: it *should* provide atomicity, but is that the
case for consumer SSDs? It is certainly NOT the case for cheap
USB-based flash media, and AFAIK not for CF either.

-- 
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia


Re: ssd optimised mode

2009-02-22 Thread Dmitri Nikulin
On Mon, Feb 23, 2009 at 2:17 PM, Seth Huang seth...@gmail.com wrote:
> On Mon, Feb 23, 2009 at 12:22 PM, Dongjun Shin djshi...@gmail.com wrote:
> > A well-designed SSD should survive power cycling and should provide
> > atomicity of flush operations regardless of the underlying flash
> > operations. I don't expect that users of SSDs have different
> > requirements about atomicity.
>
> A reliable system should be based on the assumption that the underlying
> parts are unreliable. Therefore, we should do as much as possible to
> ensure reliability in our filesystem instead of leaning on the SSDs.

I generally agree with this approach; however, it would clearly carry
a performance penalty. If possible it should be optional, so that on
reliable media the hardware can do the hard work and the software can
perform well. But it might be too much to ask that btrfs support
mkfs/mount options for every distinct class of storage (rotating, bad
SSD, good SSD, USB flash, holographic cube, electron spin, etc.).

-- 
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia


Re: ssd optimised mode

2009-02-21 Thread Dmitri Nikulin
On Sat, Feb 21, 2009 at 3:30 AM, Chris Mason chris.ma...@oracle.com wrote:
> The short answer is that in ssd mode we don't try to avoid random reads.

In the ideal future where SSDs can be run without a flimsy hardware
FTL, and btrfs can use something like UBI directly, would SSD mode
also be able to enable more intelligent wear levelling and safer use
of eraseblocks?

I've read that one of the potentially crippling limitations of ZFS is
that even its reliability features depend largely on being able to
perform atomic writes, which are currently impossible (?) on flash
media, where a block has to be erased before it can be updated,
clearly not an atomic operation. Is there any solution to this that
doesn't depend on a battery backup? Clearly it's not something a
filesystem can practically solve.

--
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia


Re: btrfs and swap files on SSD's ?

2009-01-20 Thread Dmitri Nikulin
On Wed, Jan 21, 2009 at 12:02 AM, Chris Mason chris.ma...@oracle.com wrote:
> There are patches to support swap over NFS that might make it safe to
> use on btrfs. At any rate, it is a fixable problem.

FreeBSD has been able to run swap over NFS for as long as I can
remember; what is different in Linux that makes it especially
difficult?

I've read that swap over non-trivial filesystems is hazardous, as it
may lead to a situation in which a memory allocation fails in the very
swap/FS code that was meant to make allocation possible again.

If btrfs is to take on the role of a RAID and volume manager, it would
certainly be very useful to be able to run swap on it, since that
frees up other volumes from an administrative standpoint.

-- 
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia