Re: [ceph-users] CephFS overwrite/truncate performance hit

2019-02-19 Thread David Turner
If your client needs to be able to handle the writes like that on its own,
RBDs might be the more appropriate use case.  You lose the ability to have
multiple clients accessing the data as easily as with CephFS, but you would
gain the features you're looking for.
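
For what it's worth, a rough sketch of that approach might look like the
following (pool, image, and mount point names are made up, and the size is
illustrative):

# Create and map an RBD image, then put a local filesystem on it so the
# client can buffer and overwrite freely; only one client mounts it at a time.
rbd create mypool/flagdata --size 10G
rbd map mypool/flagdata            # maps to e.g. /dev/rbd0
mkfs.xfs /dev/rbd0
mount /dev/rbd0 /mnt/flagdata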

On Tue, Feb 12, 2019 at 1:43 PM Gregory Farnum wrote:

>
>
> On Tue, Feb 12, 2019 at 5:10 AM Hector Martin wrote:
>
>> On 12/02/2019 06:01, Gregory Farnum wrote:
>> > Right. Truncates and renames require sending messages to the MDS, and
>> > the MDS committing to RADOS (aka its disk) the change in status, before
>> > they can be completed. Creating new files will generally use a
>> > preallocated inode so it's just a network round-trip to the MDS.
>>
>> I see. Is there a fundamental reason why these kinds of metadata
>> operations cannot be buffered in the client, or is this just the current
>> way they're implemented?
>>
>
> It's pretty fundamental, at least to the consistency guarantees we hold
> ourselves to. What happens if the client has buffered an update like that,
> performs writes to the data with those updates in mind, and then fails
> before they're flushed to the MDS? A local FS doesn't need to worry about a
> different node having a different lifetime, and can control the write order
> of its metadata and data updates on belated flush a lot more precisely than
> we can. :(
> -Greg
>
>
>>
>> e.g. on a local FS these kinds of writes can just stick around in the
>> block cache unflushed. And of course for CephFS I assume file extension
>> also requires updating the file size in the MDS, yet that doesn't block
>> while truncation does.
>>
>> > Going back to your first email, if you do an overwrite that is confined
>> > to a single stripe unit in RADOS (by default, a stripe unit is the size
>> > of your objects which is 4MB and it's aligned from 0), it is guaranteed
>> > to be atomic. CephFS can only tear writes across objects, and only if
>> > your client fails before the data has been flushed.
>>
>> Great! I've implemented this in a backwards-compatible way, so that gets
>> rid of this bottleneck. It's just a 128-byte flag file (formerly
>> variable length, now I just pad it to the full 128 bytes and rewrite it
>> in-place). This is good information to know for optimizing things :-)
>>
>> --
>> Hector Martin (hec...@marcansoft.com)
>> Public Key: https://mrcn.st/pub


Re: [ceph-users] CephFS overwrite/truncate performance hit

2019-02-12 Thread Gregory Farnum
On Tue, Feb 12, 2019 at 5:10 AM Hector Martin wrote:

> On 12/02/2019 06:01, Gregory Farnum wrote:
> > Right. Truncates and renames require sending messages to the MDS, and
> > the MDS committing to RADOS (aka its disk) the change in status, before
> > they can be completed. Creating new files will generally use a
> > preallocated inode so it's just a network round-trip to the MDS.
>
> I see. Is there a fundamental reason why these kinds of metadata
> operations cannot be buffered in the client, or is this just the current
> way they're implemented?
>

It's pretty fundamental, at least to the consistency guarantees we hold
ourselves to. What happens if the client has buffered an update like that,
performs writes to the data with those updates in mind, and then fails
before they're flushed to the MDS? A local FS doesn't need to worry about a
different node having a different lifetime, and can control the write order
of its metadata and data updates on belated flush a lot more precisely than
we can. :(
-Greg


>
> e.g. on a local FS these kinds of writes can just stick around in the
> block cache unflushed. And of course for CephFS I assume file extension
> also requires updating the file size in the MDS, yet that doesn't block
> while truncation does.
>
> > Going back to your first email, if you do an overwrite that is confined
> > to a single stripe unit in RADOS (by default, a stripe unit is the size
> > of your objects which is 4MB and it's aligned from 0), it is guaranteed
> > to be atomic. CephFS can only tear writes across objects, and only if
> > your client fails before the data has been flushed.
>
> Great! I've implemented this in a backwards-compatible way, so that gets
> rid of this bottleneck. It's just a 128-byte flag file (formerly
> variable length, now I just pad it to the full 128 bytes and rewrite it
> in-place). This is good information to know for optimizing things :-)
>
> --
> Hector Martin (hec...@marcansoft.com)
> Public Key: https://mrcn.st/pub


Re: [ceph-users] CephFS overwrite/truncate performance hit

2019-02-12 Thread Hector Martin

On 12/02/2019 06:01, Gregory Farnum wrote:
> Right. Truncates and renames require sending messages to the MDS, and
> the MDS committing to RADOS (aka its disk) the change in status, before
> they can be completed. Creating new files will generally use a
> preallocated inode so it's just a network round-trip to the MDS.


I see. Is there a fundamental reason why these kinds of metadata 
operations cannot be buffered in the client, or is this just the current 
way they're implemented?


e.g. on a local FS these kinds of writes can just stick around in the 
block cache unflushed. And of course for CephFS I assume file extension 
also requires updating the file size in the MDS, yet that doesn't block 
while truncation does.


> Going back to your first email, if you do an overwrite that is confined
> to a single stripe unit in RADOS (by default, a stripe unit is the size
> of your objects which is 4MB and it's aligned from 0), it is guaranteed
> to be atomic. CephFS can only tear writes across objects, and only if
> your client fails before the data has been flushed.


Great! I've implemented this in a backwards-compatible way, so that gets 
rid of this bottleneck. It's just a 128-byte flag file (formerly 
variable length, now I just pad it to the full 128 bytes and rewrite it 
in-place). This is good information to know for optimizing things :-)
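
For illustration, an in-place rewrite along those lines might look something
like this (the file name and payload variable are placeholders; conv=sync pads
the block with NULs to the full 128 bytes, notrunc skips the truncate, and
fdatasync flushes at the end):

# Overwrite offset 0 with a fixed 128-byte block, without truncating,
# so the whole update stays inside a single RADOS stripe unit.
printf '%s' "$payload" | dd of=flagfile bs=128 count=1 \
    conv=notrunc,sync,fdatasync status=none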


--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub


Re: [ceph-users] CephFS overwrite/truncate performance hit

2019-02-11 Thread Gregory Farnum
On Thu, Feb 7, 2019 at 3:31 AM Hector Martin wrote:

> On 07/02/2019 19:47, Marc Roos wrote:
> >
> > Is this difference not related to caching? Are you filling up some
> > cache/queue at some point? If you do a sync after each write, do you
> > still have the same results?
>
> No, the slow operations are slow from the very beginning. It's not about
> filling a buffer/cache somewhere. I'm guessing the slow operations
> trigger several synchronous writes to the underlying OSDs, while the
> fast ones don't. But I'd like to know more about why exactly there is
> this significant performance hit to truncation operations vs. normal
> writes.
>
> To give some more numbers:
>
> echo test | dd of=b conv=notrunc
>
> This completes extremely quickly (microseconds). The data obviously
> remains in the client cache at this point. This is what I want.
>
> echo test | dd of=b conv=notrunc,fdatasync
>
> This runs quickly until the fdatasync(), then that takes ~12ms, which is
> about what I'd expect for a synchronous write to the underlying HDDs. Or
> maybe that's two writes?


It's certainly one write, and may be two overlapping ones if you've
extended the file and need to persist its new size (via the MDS journal).


>


> echo test | dd of=b
>
> This takes ~10ms in the best case for the open() call (sometimes 30-40
> or even more), and 6-8ms for the write() call.
>
> echo test | dd of=b conv=fdatasync
>
> This takes ~10ms for the open() call, ~8ms for the write() call, and
> ~18ms for the fdatasync() call.
>
> So it seems like truncating/recreating an existing file introduces
> several disk I/Os worth of latency and forces synchronous behavior
> somewhere down the stack, while merely creating a new file or writing to
> an existing one without truncation does not.
>

Right. Truncates and renames require sending messages to the MDS, and the
MDS committing to RADOS (aka its disk) the change in status, before they
can be completed. Creating new files will generally use a preallocated
inode so it's just a network round-trip to the MDS.

Going back to your first email, if you do an overwrite that is confined to
a single stripe unit in RADOS (by default, a stripe unit is the size of
your objects which is 4MB and it's aligned from 0), it is guaranteed to be
atomic. CephFS can only tear writes across objects, and only if your client
fails before the data has been flushed.
-Greg
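
As a quick way to confirm the layout for a given file (the path below is
hypothetical), CephFS exposes it as a virtual xattr:

# Prints stripe_unit, stripe_count, object_size and pool for the file;
# a write that stays within one stripe unit is atomic per the above.
getfattr -n ceph.file.layout /mnt/cephfs/path/to/flagfile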



Re: [ceph-users] CephFS overwrite/truncate performance hit

2019-02-07 Thread Hector Martin

On 07/02/2019 19:47, Marc Roos wrote:
> Is this difference not related to caching? Are you filling up some
> cache/queue at some point? If you do a sync after each write, do you
> still have the same results?


No, the slow operations are slow from the very beginning. It's not about 
filling a buffer/cache somewhere. I'm guessing the slow operations 
trigger several synchronous writes to the underlying OSDs, while the 
fast ones don't. But I'd like to know more about why exactly there is 
this significant performance hit to truncation operations vs. normal writes.


To give some more numbers:

echo test | dd of=b conv=notrunc

This completes extremely quickly (microseconds). The data obviously 
remains in the client cache at this point. This is what I want.


echo test | dd of=b conv=notrunc,fdatasync

This runs quickly until the fdatasync(), then that takes ~12ms, which is 
about what I'd expect for a synchronous write to the underlying HDDs. Or 
maybe that's two writes?


echo test | dd of=b

This takes ~10ms in the best case for the open() call (sometimes 30-40 
or even more), and 6-8ms for the write() call.


echo test | dd of=b conv=fdatasync

This takes ~10ms for the open() call, ~8ms for the write() call, and 
~18ms for the fdatasync() call.


So it seems like truncating/recreating an existing file introduces 
several disk I/Os worth of latency and forces synchronous behavior 
somewhere down the stack, while merely creating a new file or writing to 
an existing one without truncation does not.
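
For anyone reproducing these numbers, one rough way to attribute the latency
to the individual syscalls is to wrap the test in strace; -T prints the time
spent inside each call:

echo test | strace -T -e trace=open,openat,write,fdatasync,close \
    dd of=b conv=fdatasync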


--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub


Re: [ceph-users] CephFS overwrite/truncate performance hit

2019-02-07 Thread Marc Roos
 
Is this difference not related to caching? Are you filling up some
cache/queue at some point? If you do a sync after each write, do you
still have the same results?
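
For example, something like:

for i in $(seq 1 1000); do
echo $i; echo test > a.$i; sync
done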




-Original Message-
From: Hector Martin [mailto:hec...@marcansoft.com] 
Sent: 07 February 2019 06:51
To: ceph-users@lists.ceph.com
Subject: [ceph-users] CephFS overwrite/truncate performance hit

I'm seeing some interesting performance issues with file overwriting on 
CephFS.

Creating lots of files is fast:

for i in $(seq 1 1000); do
echo $i; echo test > a.$i
done

Deleting lots of files is fast:

rm a.*

As is creating them again.

However, repeatedly creating the same file over and over again is slow:

for i in $(seq 1 1000); do
echo $i; echo test > a
done

And it's still slow if the file is created with a new name and then 
moved over:

for i in $(seq 1 1000); do
echo $i; echo test > a.$i; mv a.$i a
done

While appending to a single file is really fast:

for i in $(seq 1 1000); do
echo $i; echo test >> a
done

As is repeatedly writing to offset 0:

for i in $(seq 1 1000); do
echo $i; echo $RANDOM | dd of=a bs=128 conv=notrunc
done

But truncating the file first slows it back down again:

for i in $(seq 1 1000); do
echo $i; truncate -s 0 a; echo test >> a
done

All of these things are reasonably fast on a local FS, of course. I'm 
using the kernel client (4.18) with Ceph 13.2.4, and the relevant CephFS 
data and metadata pools are rep-3 on HDDs. It seems to me that any 
operation that *reduces* a file's size for any given filename, or 
replaces it with another inode, has a large overhead.

I have an application that stores some flag data in a file, using the 
usual open/write/close/rename dance to atomically overwrite it, and this 
operation is currently the bottleneck (while doing a bunch of other 
processing on files on CephFS). I'm considering changing it to use an
xattr to store the data instead, which seems like it should be atomic
and performs a lot better:

for i in $(seq 1 1000); do
echo $i; setfattr -n user.foo -v "test$RANDOM" a
done
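
Reading it back for a quick check would then just be:

getfattr -n user.foo --only-values a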

Alternatively, is there a more CephFS-friendly atomic overwrite pattern 
than the usual open/write/close/rename? Can it e.g. guarantee that a 
write at offset 0 of less than the page size is atomic? I could easily 
make the writes equal-sized and thus avoid truncations and remove the 
rename dance, if I can guarantee they're atomic.
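
(For reference, the rename dance I mean is roughly the following, sketched in
shell; the temp file name is illustrative, and sync with a file argument
needs a reasonably recent coreutils:)

tmp=$(mktemp a.XXXXXX)        # temp file in the same directory as the target
echo "test $RANDOM" > "$tmp"
sync "$tmp"                   # flush the new contents
mv "$tmp" a                   # atomically replace the old name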

Is there any documentation on what write operations incur significant 
overhead on CephFS like this, and why? This particular issue isn't 
mentioned in http://docs.ceph.com/docs/master/cephfs/app-best-practices/
(which seems like it mostly deals with reads, not writes).

--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub

