Re: Tux3 Report: How fast can we fsync?

2015-05-12 Thread Daniel Phillips
Tux3 Report: How fast can we fail?

Tux3 now has a preliminary out of space handling algorithm. This might
sound like a small thing, but in fact handling out of space reliably
and efficiently is really hard, especially for Tux3. We developed an
original solution that has unusually low overhead in the common case and
is simple enough to prove correct. Reliability seems good so far. But not
to keep anyone in suspense: Tux3 does not fail very fast, but it fails
very reliably. We like to think that Tux3 is better at succeeding than
failing.

We identified the following quality metrics for this algorithm:

 1) Never fails to detect out of space in the front end.
 2) Always fills a volume to 100% before reporting out of space.
 3) Allows rm, rmdir and truncate even when a volume is full.
 4) Writing to a nearly full volume is not excessively slow.
 5) Overhead is insignificant when a volume is far from full.

Like every filesystem that does delayed allocation, Tux3 must guess how
much media space will be needed to commit any update it accepts into
cache. It must not guess low or the commit may fail and lose data. This
is especially tricky for Tux3 because it does not track individual
updates, but instead, partitions updates atomically into delta groups
and commits each delta as an atomic unit. A single delta can be as
large as writable cache, including thousands of individual updates.
This delta scheme ensures perfect namespace, metadata and data
consistency without complex tracking of relationships between thousands
of cache objects, and also does delayed allocation about as well as it
can be done. Given these benefits, it is not too hard to accept some
extra pain in out of space accounting.

Speaking of accounting, we borrow some of that terminology to talk
about the problem. Each delta has a budget and computes a balance
that declines each time a transaction cost is charged against it.
The budget is all of free space, plus some space that belongs to
the current disk image that we know will be released soon, and less
a reserve for taking care of certain backend duties. When the balance
goes negative, the transaction backs out its cost, triggers a delta
transition, and tries again. This has the effect of shrinking the delta
size as a volume approaches full. When the delta budget finally shrinks
to less than the transaction cost, the update fails with ENOSPC.
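
To make the accounting concrete, here is a minimal sketch of what the
front-end side of this scheme might look like in C. It is illustrative
only: the names (struct delta_acct, charge_transaction,
force_delta_transition) and the exact retry order are assumptions, not
the actual Tux3 code.

/*
 * Sketch: charge a worst-case cost against the current delta's balance.
 * If the balance goes negative, back the charge out, force a delta
 * transition so the backend can hand out a fresh (smaller) budget, and
 * retry. When even a fresh budget cannot cover the cost, return ENOSPC.
 */
#include <errno.h>
#include <stdatomic.h>

struct delta_acct {
        atomic_long balance;    /* blocks still available to this delta */
        long budget;            /* budget assigned at delta transition */
};

/* Provided elsewhere: commit the current delta and install a new budget. */
void force_delta_transition(struct delta_acct *acct);

int charge_transaction(struct delta_acct *acct, long cost)
{
        for (;;) {
                long old = atomic_fetch_sub(&acct->balance, cost);
                if (old - cost >= 0)
                        return 0;                       /* charge accepted */
                atomic_fetch_add(&acct->balance, cost); /* back out the cost */
                force_delta_transition(acct);           /* shrink the delta */
                if (acct->budget < cost)
                        return -ENOSPC;                 /* will never fit */
        }
}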

This is where the "how fast can we fail" question comes up. If our guess
at cost is way higher than actual blocks consumed, deltas take a long
time to shrink. Overestimating transaction cost by a factor of ten
can trigger over a hundred deltas before failing. Fortunately, deltas
are pretty fast, so we only keep the user waiting for a second or so
before delivering the bad news. We also slow down measurably, but not
horribly, when getting close to full. Ext4 by contrast flies along at
full speed right until it fills the volume, and stops on a dime exactly
at 100% full. I don't think that Tux3 will ever be as good at
failing as that, but we will try to get close.

Before I get into how Tux3's out of space behavior stacks up against
other filesystems, there are some interesting details to touch on about
how we go about things.

Tux3's front/back arrangement is lockless, which is great for
performance but turns into a problem when front and back need to
cooperate about something like free space accounting. If we were willing
to add a spinlock between front and back this would be easy, but we
don't want to do that. Not only are we jealously protective of our lockless
design, but if our fast path suddenly became slower because of adding
essential functionality we might need to revise some posted benchmark
results. Better that we should do it right and get our accounting
almost for free.

The world of lockless algorithms is an arcane one indeed, just ask Paul
McKenney about that. The solution we came up with needs just two atomic
adds per transaction, and we will eventually turn one of those into a
per-cpu counter. As mentioned above, a frontend transaction backs out
its cost when the delta balance goes negative, so from the backend's
point of view, the balance is going up and down unpredictably all the
time. Delta transition can happen at any time, and somehow, the backend
must assign the new front delta its budget exactly at transition.
Meanwhile, the front delta balance is still going up and down
unpredictably. See the problem? The issue is, delta transition is truly
asynchronous. We can't change that short of adding locks with the
contention and stalls that go along with them.
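
To illustrate, here is one way the backend side of the handoff could
look, leaning on the property described in the next paragraph: once the
front delta becomes the back delta, its charged total stops moving, and
only the backend touches volume free space. The names are again made up,
and where the real Tux3 backend must hand the new front delta its budget
exactly at transition, this simplified sketch only computes the numbers
after the flush, so treat it as a rough outline rather than the actual
scheme.

struct delta_acct;              /* from the sketch above */

struct volume_acct {
        long free_blocks;       /* on-media free space, backend only */
        long expected_release;  /* old-image blocks known to free soon */
        long reserve;           /* held back for backend duties */
};

/* Provided elsewhere: flush the back delta and update free_blocks. */
void flush_back_delta(struct volume_acct *vol);
void assign_front_budget(struct delta_acct *front, long budget);

void backend_commit_and_rebudget(struct volume_acct *vol,
                                 struct delta_acct *front)
{
        long before = vol->free_blocks;

        flush_back_delta(vol);  /* back delta charges are stable meanwhile */

        /* Actual space the back delta consumed: free space before the
         * flush minus free space after it. */
        long used = before - vol->free_blocks;
        (void)used;             /* compare against the charged total */

        /* New budget: all of free space, plus space that will be
         * released soon, less the backend reserve. */
        long budget = vol->free_blocks + vol->expected_release - vol->reserve;
        if (budget < 0)
                budget = 0;
        assign_front_budget(front, budget);
}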

Fortunately, one consequence of delta transition is that the total cost
charged to the delta instantly becomes stable when the front delta
becomes the back delta. Volume free space is also stable because only
the backend accesses it. The backend can easily measure the actual
space consumed by the back delta: it is the difference between free
space before and after flushing to media. Updating 

Re: Tux3 Report: How fast can we fsync?

2015-05-02 Thread Daniel Phillips

On Friday, May 1, 2015 6:07:48 PM PDT, David Lang wrote:

On Fri, 1 May 2015, Daniel Phillips wrote:

On Friday, May 1, 2015 8:38:55 AM PDT, Dave Chinner wrote:


Well, yes - I never claimed XFS is a general purpose filesystem.  It
is a high performance filesystem. It is also becoming more relevant
to general purpose systems as low cost storage gains capabilities
that used to be considered the domain of high performance storage...


OK. Well, Tux3 is general purpose and that means we care about single
spinning disk and small systems.


keep in mind that if you optimize only for the small systems 
you may not scale as well to the larger ones.


Tux3 is designed to scale, and it will when the time comes. I look 
forward to putting Shardmap through its billion file test in due course. 
However, right now it would be wise to stay focused on basic 
functionality suited to a workstation because volunteer devs tend to 
have those. After that, phones are a natural direction, where hard core 
ACID commit and really smooth file ops are particularly attractive.


per the ramdisk, but possibly not as relevant as you may think. 
This is why it's good to test on as many different systems as 
you can. As you run into different types of performance you can 
then pick ones to keep and test all the time.


I keep being surprised how well it works for things we never tested 
before.


Single spinning disk is interesting now, but will be less 
interesting later. Multiple spinning disks in an array of some 
sort is going to remain very interesting for quite a while.


The way to do md well is to integrate it into the block layer like 
FreeBSD does (GEOM) and expose a richer interface for the filesystem. 
That is how I think Tux3 should work with big iron raid. I hope to be
able to tackle that sometime before the stars start winking out.

now, some things take a lot more work to test than others. 
Getting time on a system with a high performance, high capacity 
RAID is hard, but getting hold of an SSD from Fry's is much 
easier. If it's a budget item, ping me directly and I can donate 
one for testing (the cost of a drive is within my unallocated 
budget and using that to improve Linux is worthwhile)


Thanks.

As I'm reading Dave's comments, he isn't attacking you the way 
you seem to think he is. He is pointing out that there are 
problems with your data, but he's also taking a lot of time to 
explain what's happening (and yes, some of this is probably 
because your simple tests with XFS made it look so bad)


I hope the lightening up is a trend.

the other filesystems don't use naive algorithms, they use 
something more complex, and while your current numbers are 
interesting, they are only preliminary until you add something 
to handle fragmentation. That can cause very significant 
problems.


Fsync is pretty much agnostic to fragmentation, so those results are 
unlikely to change substantially even if we happen to do a lousy job on 
allocation policy, which I naturally consider unlikely. In fact, Tux3 
fsync is going to get faster over time for a couple of reasons: the 
minimum blocks per commit will be reduced, and we will get rid of most 
of the seeks to the beginning of the volume that we currently suffer per commit.


Remember how fabulous btrfs looked in the initial 
reports? And then corner cases were found that caused real 
problems and as the algorithms have been changed to prevent 
those corner cases from being so easy to hit, the common case 
has suffered somewhat. This isn't an attack on Tux2 or btrfs, 
it's just a reality of programming. If you are not accounting 
for all the corner cases, everything is easier, and faster.



Mine is a lame i5 minitower with 4GB from Fry's. Yours is clearly way
more substantial, so I can't compare my numbers directly to yours.


If you are doing tests with a 4G ramdisk on a machine with only 
4G of RAM, it seems like you end up testing a lot more than just 
the filesystem. Testing in such low memory situations can 
identify significant issues, but it is questionable as a 'which 
filesystem is better' benchmark.


A 1.3 GB tmpfs, and sorry, the machine is 10 GB (the machine next to it is 4G). 
I am careful to ensure the test environment does not have spurious 
memory or cpu hogs. I will not claim that this is the most sterile test 
environment possible, but it is adequate for the task at hand. Nearly 
always, when I find big variations in the test numbers it turns out to 
be a quirk of one filesystem that is not exhibited by the others. 
Everything gets multiple runs and lands in a spreadsheet. Any fishy 
variance is investigated.


By the way, the low variance kings by far are Ext4 and Tux3, and of 
those two, guess which one is more consistent. XFS is usually steady, 
but can get emotional with lots of tasks, and Btrfs has regular wild 
mood swings whenever the stars change alignment. And while I'm making 
gross generalizations: XFS and Btrfs go OOM way before Ext4 and Tux3.


Just a 

Re: Tux3 Report: How fast can we fsync?

2015-05-01 Thread Daniel Phillips
On Friday, May 1, 2015 8:38:55 AM PDT, Dave Chinner wrote:

 Well, yes - I never claimed XFS is a general purpose filesystem.  It
 is a high performance filesystem. It is also becoming more relevant
 to general purpose systems as low cost storage gains capabilities
 that used to be considered the domain of high performance storage...

OK. Well, Tux3 is general purpose and that means we care about single
spinning disk and small systems.

 So, to demonstrate, I'll run the same tests but using a 256GB
 samsung 840 EVO SSD and show how much the picture changes.

 I will go you one better, I ran a series of fsync tests using
 tmpfs, and I now have a very clear picture of how the picture
 changes. The executive summary is: Tux3 is still way faster, and
 still scales way better to large numbers of tasks. I have every
 confidence that the same is true of SSD.

 /dev/ramX can't be compared to an SSD.  Yes, they both have low
 seek/IO latency but they have very different dispatch and IO
 concurrency models.  One is synchronous, the other is fully
 asynchronous.

I had ram available and no SSD handy to abuse. I was interested in
measuring the filesystem overhead with the device factored out. I
mounted loopback on a tmpfs file, which seems to be about the same as
/dev/ram, maybe slightly faster, but much easier to configure. I ran
some tests on a ramdisk just now and was mortified to find that I have
to reboot to empty the disk. It would take a compelling reason before
I do that again.

 This is an important distinction, as we'll see later on

I regard it as predictive of Tux3 performance on NVM.

 These trees:

 git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3.git
 git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3-test.git

 have not been updated for 11 months. I thought tux3 had died long
 ago.

 You should keep them up to date, and send patches for xfstests to
 support tux3, and then you'll get a lot more people running,
 testing and breaking tux3

People are starting to show up to do testing now, pretty much the first
time, so we must do some housecleaning. It is gratifying that Tux3 never
broke for Mike, but of course it will assert just by running out of
space at the moment. As you rightly point out, that fix is urgent and is
my current project.

 Running the same thing on tmpfs, Tux3 is significantly faster:

 Ext4:   1.40s
 XFS:    1.10s
 Btrfs:  1.56s
 Tux3:   1.07s

 3% is not significantly faster. It's within run to run variation!

You are right, XFS and Tux3 are within experimental error for single
syncs on the ram disk, while Ext4 and Btrfs are way slower:

   Ext4:   1.59s
   XFS:    1.11s
   Btrfs:  1.70s
   Tux3:   1.11s

A distinct performance gap appears between Tux3 and XFS as parallel
tasks increase.

 You wish. In fact, Tux3 is a lot faster. ...

 Yes, it's easy to be fast when you have simple, naive algorithms and
 an empty filesystem.

No it isn't, or the others would be fast too. In any case our algorithms
are far from naive, except for allocation. You can rest assured that
when allocation is brought up to a respectable standard in the fullness
of time, it will be competitive and will not harm our clean filesystem
performance at all.

There is no call for you to disparage our current achievements, which
are significant. I do not mind some healthy skepticism about the
allocation work, you know as well as anyone how hard it is. However, your
denial of our current result is irritating and creates the impression
that you have an agenda. If you want to complain about something real,
complain that our current code drop is not done yet. I will humbly
apologize, and the same for enospc.

 triple checked and reproducible:

Tasks:      10     100    1,000   10,000
Ext4:     0.05    0.14     1.53    26.56
XFS:      0.05    0.16     2.10    29.76
Btrfs:    0.08    0.37     3.18    34.54
Tux3:     0.02    0.05     0.18     2.16

 Yet I can't reproduce those XFS or ext4 numbers you are quoting
 there. eg. XFS on a 4GB ram disk:

 $ for i in 10 100 1000 10000; do rm /mnt/test/foo* ; time
 ./test-fsync /mnt/test/foo 10 $i; done

 real0m0.030s
 user0m0.000s
 sys 0m0.014s

 real0m0.031s
 user0m0.008s
 sys 0m0.157s

 real0m0.305s
 user0m0.029s
 sys 0m1.555s

 real0m3.624s
 user0m0.219s
 sys 0m17.631s
 $

 That's roughly 10x faster than your numbers. Can you describe your
 test setup in detail? e.g.  post the full log from block device
 creation to benchmark completion so I can reproduce what you are
 doing exactly?

Mine is a lame i5 minitower with 4GB from Fry's. Yours is clearly way
more substantial, so I can't compare my numbers directly to yours.

Clearly the curve is the same: your numbers increase 10x going from 100
to 1,000 tasks and 12x going from 1,000 to 10,000. The Tux3 curve is
significantly flatter and starts from a lower base, so it ends with a
really wide gap. You will 

Re: Tux3 Report: How fast can we fsync?

2015-04-30 Thread Daniel Phillips

On Wednesday, April 29, 2015 8:50:57 PM PDT, Mike Galbraith wrote:

On Wed, 2015-04-29 at 13:40 -0700, Daniel Phillips wrote:


That order of magnitude latency difference is striking. It sounds
good, but what does it mean? I see a smaller difference here, maybe
because of running under KVM.


That max_latency thing is flush.


Right, it is just the max run time of all operations, including flush
(dbench's name for fsync, I think), which would most probably be the longest
running one. I would like to know how we manage to pull that off. Now
that you mention it, I see a factor of two or so latency win here, not
the order of magnitude that you saw. Maybe KVM introduces some fuzz
for me.

I checked whether fsync = sync is the reason, and no. Well, that goes
on the back burner; we will no doubt figure it out in due course.

Regards,

Daniel

___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: Tux3 Report: How fast can we fsync?

2015-04-30 Thread Daniel Phillips

On Wednesday, April 29, 2015 6:46:16 PM PDT, Dave Chinner wrote:

I measured fsync performance using a 7200 RPM disk as a virtual
drive under KVM, configured with cache=none so that asynchronous
writes are cached and synchronous writes translate into direct
writes to the block device.


Yup, a slow single spindle, so fsync performance is determined by
seek latency of the filesystem. Hence the filesystem that wins
will be the filesystem that minimises fsync seek latency above all
other considerations.

http://www.spinics.net/lists/kernel/msg1978216.html


If you want to declare that XFS only works well on solid state disks 
and big storage arrays, that is your business. But if you do, you can no
longer call XFS a general purpose filesystem. And if you would rather 
disparage people who report genuine performance bugs than get down to
fixing them, that is your business too. Don't expect to be able to stop 
the bug reports by bluster.



So, to demonstrate, I'll run the same tests but using a 256GB
samsung 840 EVO SSD and show how much the picture changes.


I will go you one better, I ran a series of fsync tests using tmpfs,
and I now have a very clear picture of how the picture changes. The
executive summary is: Tux3 is still way faster, and still scales way
better to large numbers of tasks. I have every confidence that the same
is true of SSD.


I didn't test tux3, you don't make it easy to get or build.


There is no need to apologize for not testing Tux3; however, it is 
unseemly to throw mud at the same time. Remember, you are the person 
who put so much energy into blocking Tux3 from merging last summer. If
it now takes you a little extra work to build it, then it is hard to be 
really sympathetic. Mike apparently did not find it very hard.



To focus purely on fsync, I wrote a
small utility (at the end of this post) that forks a number of
tasks, each of which continuously appends to and fsyncs its own
file. For a single task doing 1,000 fsyncs of 1K each, we have:

   Ext4:  34.34s
   XFS:   23.63s
   Btrfs: 34.84s
   Tux3:  17.24s


   Ext4:   1.94s
   XFS:    2.06s
   Btrfs:  2.06s

All equally fast, so I can't see how tux3 would be much faster here.
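
The utility described above was attached to the end of Dave's original
post and is not reproduced in this archive. For anyone who wants to
repeat the numbers, a minimal stand-in could look like the following;
the argument order (basename, fsyncs per task, number of tasks), the 1K
block size and the foo0, foo1, ... file naming are guesses based on the
invocations quoted in this thread, not the original source.

/* test-fsync: fork <tasks> children, each appending <count> 1K blocks
 * to its own file and fsyncing after every append. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define BLOCKSIZE 1024

int main(int argc, char *argv[])
{
        if (argc != 4) {
                fprintf(stderr, "usage: %s <basename> <count> <tasks>\n", argv[0]);
                return 1;
        }
        int count = atoi(argv[2]);
        int tasks = atoi(argv[3]);
        char block[BLOCKSIZE];
        memset(block, 'x', sizeof block);

        for (int t = 0; t < tasks; t++) {
                if (fork() == 0) {
                        char name[4096];
                        snprintf(name, sizeof name, "%s%d", argv[1], t);
                        int fd = open(name, O_CREAT | O_WRONLY | O_APPEND, 0644);
                        if (fd < 0) {
                                perror(name);
                                _exit(1);
                        }
                        for (int i = 0; i < count; i++) {
                                if (write(fd, block, sizeof block) != sizeof block
                                    || fsync(fd) != 0) {
                                        perror(name);
                                        _exit(1);
                                }
                        }
                        close(fd);
                        _exit(0);
                }
        }
        while (wait(NULL) > 0)
                ;       /* reap all children */
        return 0;
}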


Running the same thing on tmpfs, Tux3 is significantly faster:

Ext4:   1.40s
XFS:    1.10s
Btrfs:  1.56s
Tux3:   1.07s


   Tasks:      10      100    1,000    10,000
   Ext4:    0.05s    0.12s    0.48s     3.99s
   XFS:     0.25s    0.41s    0.96s     4.07s
   Btrfs:   0.22s    0.50s    2.86s   161.04s
   (lower is better)

Ext4 and XFS are fast and show similar performance. Tux3 *can't* be
very much faster as most of the elapsed time in the test is from
forking the processes that do the IO and fsyncs.


You wish. In fact, Tux3 is a lot faster. You must have made a mistake in 
estimating your fork overhead. It is easy to check, just run syncs foo 
0 10000. I get 0.23 seconds to fork 10,000 processes, create the files 
and exit. Here are my results on tmpfs, triple checked and reproducible:


   Tasks:      10     100    1,000   10,000
   Ext4:     0.05    0.14     1.53    26.56
   XFS:      0.05    0.16     2.10    29.76
   Btrfs:    0.08    0.37     3.18    34.54
   Tux3:     0.02    0.05     0.18     2.16

Note: you should recheck your final number for Btrfs. I have seen Btrfs 
fall off the rails and take wildly longer on some tests just like that.
We know Btrfs has corner case issues, I don't think they deny it. 
Unlike you, Chris Mason is a gentleman when faced with issues. Instead 
of insulting his colleagues and hurling around the sort of abuse that 
has gained LKML its current unenviable reputation, he gets down to work 
and fixes things.


You should do that too, your own house is not in order. XFS has major 
issues. One easily reproducible one is a denial of service during the 
10,000 task test where it takes multiple seconds to cat small files. I 
saw XFS do this on both spinning disk and tmpfs, and I have seen it 
hang for minutes trying to list a directory. I looked a bit into it, and 
I see that you are blocking for aeons trying to acquire a lock in open.


Here is an example. While doing sync6 fs/foo 10 10000:

time cat fs/foo999
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!

real0m2.282s
user0m0.000s
sys 0m0.000s

You and I both know the truth: Ext4 is the only really reliable general 
purpose filesystem on Linux at the moment. XFS is definitely not, I 
have seen ample evidence with my own eyes. What you need is people 
helping you fix your issues instead of making your colleagues angry at 
you with your incessant attacks.



FWIW, btrfs shows its horrible fsync implementation here, burning
huge amounts of CPU to do bugger all IO. i.e. it burnt all 16p for 2
and a half minutes in that 10,000 fork test so wasn't IO bound at
all.


Btrfs is hot and cold. In my tmpfs tests, Btrfs beats XFS at high 
task counts. It is actually 

Re: Tux3 Report: How fast can we fsync?

2015-04-30 Thread Daniel Phillips

On Thursday, April 30, 2015 2:17:55 PM PDT, James Cloos wrote:

DP == Daniel Phillips dan...@phunq.net writes:


DP you build userspace tools from the hirofumi-user branch

In a fresh clone there is no hirofumi-user branch, only hirofumi and master:

  :; cat .git/packed-refs 
  # pack-refs with: peeled fully-peeled 
  028552773ced1c17cdbec2cda949b2ae94f55d30 refs/remotes/origin/hirofumi
  0dd55b3f5295f74c41e33e1962c79a0282603f5d refs/remotes/origin/master

-JimC


Git confuses me too. Try: git checkout hirofumi/hirofumi-user
This leaves you with a detached head, so you can do: git branch 
localname; git checkout localname.


Regards,

Daniel

___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3