Re: Tux3 Report: How fast can we fsync?
Tux3 Report: How fast can we fail?

Tux3 now has a preliminary out-of-space handling algorithm. This might sound like a small thing, but in fact handling out of space reliably and efficiently is really hard, especially for Tux3. We developed an original solution with unusually low overhead in the common case, and simple enough to prove correct. Reliability seems good so far. But not to keep anyone in suspense: Tux3 does not fail very fast, but it fails very reliably. We like to think that Tux3 is better at succeeding than failing.

We identified the following quality metrics for this algorithm:

1) Never fails to detect out of space in the front end.
2) Always fills a volume to 100% before reporting out of space.
3) Allows rm, rmdir and truncate even when a volume is full.
4) Writing to a nearly full volume is not excessively slow.
5) Overhead is insignificant when a volume is far from full.

Like every filesystem that does delayed allocation, Tux3 must guess how much media space will be needed to commit any update it accepts into cache. It must not guess low, or the commit may fail and lose data. This is especially tricky for Tux3 because it does not track individual updates; instead, it partitions updates atomically into delta groups and commits each delta as an atomic unit. A single delta can be as large as writable cache, including thousands of individual updates. This delta scheme ensures perfect namespace, metadata and data consistency without complex tracking of relationships between thousands of cache objects, and also does delayed allocation about as well as it can be done. Given these benefits, it is not too hard to accept some extra pain in out-of-space accounting.

Speaking of accounting, we borrow some of that terminology to talk about the problem. Each delta has a budget and computes a balance that declines each time a transaction cost is charged against it. The budget is all of free space, plus some space belonging to the current disk image that we know will be released soon, minus a reserve for certain backend duties. When the balance goes negative, the transaction backs out its cost, triggers a delta transition, and tries again. This has the effect of shrinking the delta size as a volume approaches full. When the delta budget finally shrinks to less than the transaction cost, the update fails with ENOSPC.

This is where the "how fast can we fail" question comes up. If our guess at cost is way higher than the actual blocks consumed, deltas take a long time to shrink. Overestimating transaction cost by a factor of ten can trigger over a hundred deltas before failing. Fortunately, deltas are pretty fast, so we only keep the user waiting for a second or so before delivering the bad news. We also slow down measurably, but not horribly, when getting close to full. Ext4, by contrast, flies along at full speed right until it fills the volume, and stops on a dime at exactly 100% full. I don't think that Tux3 will ever be as good at failing as that, but we will try to get close.

Before I get into how Tux3's out-of-space behavior stacks up against other filesystems, there are some interesting details to touch on about how we go about things. Tux3's front/back arrangement is lockless, which is great for performance but turns into a problem when front and back need to cooperate about something like free space accounting. If we were willing to add a spinlock between front and back this would be easy, but we don't want to do that.
Not only are we jealously protective of our lockless design, but if our fast path suddenly became slower because of adding essential functionality, we might need to revise some posted benchmark results. Better that we should do it right and get our accounting almost for free. The world of lockless algorithms is an arcane one indeed; just ask Paul McKenney about that. The solution we came up with needs just two atomic adds per transaction, and we will eventually turn one of those into a per-cpu counter.

As mentioned above, a frontend transaction backs out its cost when the delta balance goes negative, so from the backend's point of view, the balance is going up and down unpredictably all the time. Delta transition can happen at any time, and somehow, the backend must assign the new front delta its budget exactly at transition. Meanwhile, the front delta balance is still going up and down unpredictably. See the problem?

The issue is, delta transition is truly asynchronous. We can't change that short of adding locks, with the contention and stalls that go along with them. Fortunately, one consequence of delta transition is that the total cost charged to the delta instantly becomes stable when the front delta becomes the back delta. Volume free space is also stable, because only the backend accesses it. The backend can easily measure the actual space consumed by the back delta: it is the difference between free space before and after flushing to media. Updating
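[Editorial sketch] To make the charging scheme described above concrete, here is a minimal sketch in C of what the frontend accounting might look like. Every identifier here (struct delta_account, charge_transaction, request_delta_transition, soon_to_be_freed, backend_reserve) is hypothetical and invented for illustration; this is not the actual Tux3 code, which would live in the kernel and use kernel atomics and per-cpu counters rather than the C11 atomics shown, and the report above speaks of two atomic adds per transaction where this sketch collapses the bookkeeping into a single counter.

#include <stdatomic.h>
#include <errno.h>

struct delta_account {
	atomic_long charged;    /* total worst-case cost charged to this delta */
	atomic_long budget;     /* free space + soon-free space - backend reserve */
};

/* Hypothetical stand-ins for the real frontend/backend hooks. */
extern struct delta_account *request_delta_transition(void);
extern long soon_to_be_freed(void);
extern long backend_reserve(void);

/* Frontend: charge the worst-case cost of one transaction against the
 * current front delta.  One atomic add in the common case; if the balance
 * goes negative, back the cost out, force a delta transition, and retry
 * against the new (smaller) delta, failing with ENOSPC when even an empty
 * delta cannot cover the cost. */
int charge_transaction(struct delta_account *delta, long cost)
{
	for (;;) {
		long charged = atomic_fetch_add(&delta->charged, cost) + cost;

		if (charged <= atomic_load(&delta->budget))
			return 0;                        /* balance still non-negative: proceed */

		atomic_fetch_sub(&delta->charged, cost); /* back out the charge */

		if (atomic_load(&delta->budget) < cost)
			return -ENOSPC;                  /* budget has shrunk below one transaction */

		delta = request_delta_transition();      /* try again against the next delta */
	}
}

/* Backend, at delta transition: the back delta's charged total is now stable
 * and only the backend touches volume free space, so it can hand the new
 * front delta its budget and later measure actual consumption as the
 * difference in free space across the flush. */
void assign_front_budget(struct delta_account *front, long free_blocks)
{
	atomic_store(&front->budget,
		     free_blocks + soon_to_be_freed() - backend_reserve());
}

Even in this rough form, the shape of the design is visible: the common case is a single uncontended atomic add, and all of the awkward coordination is pushed into the comparatively rare delta transition path.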
Re: Tux3 Report: How fast can we fsync?
On Friday, May 1, 2015 6:07:48 PM PDT, David Lang wrote:

On Fri, 1 May 2015, Daniel Phillips wrote:

On Friday, May 1, 2015 8:38:55 AM PDT, Dave Chinner wrote: Well, yes - I never claimed XFS is a general purpose filesystem. It is a high performance filesystem. It is also becoming more relevant to general purpose systems as low cost storage gains capabilities that used to be considered the domain of high performance storage...

OK. Well, Tux3 is general purpose and that means we care about single spinning disk and small systems.

Keep in mind that if you optimize only for the small systems you may not scale as well to the larger ones.

Tux3 is designed to scale, and it will when the time comes. I look forward to putting Shardmap through its billion file test in due course. However, right now it would be wise to stay focused on basic functionality suited to a workstation, because volunteer devs tend to have those. After that, phones are a natural direction, where hard core ACID commit and really smooth file ops are particularly attractive.

per the ramdisk, but possibly not as relevant as you may think. This is why it's good to test on as many different systems as you can. As you run into different types of performance you can then pick ones to keep and test all the time.

I keep being surprised how well it works for things we never tested before. Single spinning disk is interesting now, but will be less interesting later.

Multiple spinning disks in an array of some sort is going to remain very interesting for quite a while.

The way to do md well is to integrate it into the block layer like FreeBSD does (GEOM) and expose a richer interface for the filesystem. That is how I think Tux3 should work with big iron RAID. I hope to be able to tackle that sometime before the stars start winking out.

Now, some things take a lot more work to test than others. Getting time on a system with a high performance, high capacity RAID is hard, but getting hold of an SSD from Fry's is much easier. If it's a budget item, ping me directly and I can donate one for testing (the cost of a drive is within my unallocated budget and using that to improve Linux is worthwhile).

Thanks.

As I'm reading Dave's comments, he isn't attacking you the way you seem to think he is. He is pointing out that there are problems with your data, but he's also taking a lot of time to explain what's happening (and yes, some of this is probably because your simple tests with XFS made it look so bad).

I hope the lightening up trend is a trend.

The other filesystems don't use naive algorithms, they use something more complex, and while your current numbers are interesting, they are only preliminary until you add something to handle fragmentation. That can cause very significant problems.

Fsync is pretty much agnostic to fragmentation, so those results are unlikely to change substantially even if we happen to do a lousy job on allocation policy, which I naturally consider unlikely. In fact, Tux3 fsync is going to get faster over time for a couple of reasons: the minimum blocks per commit will be reduced, and we will get rid of most of the seeks to the beginning of the volume that we currently suffer per commit.

Remember how fabulous btrfs looked in the initial reports? And then corner cases were found that caused real problems, and as the algorithms have been changed to prevent those corner cases from being so easy to hit, the common case has suffered somewhat. This isn't an attack on Tux3 or btrfs, it's just a reality of programming.
If you are not accounting for all the corner cases, everything is easier, and faster.

Mine is a lame i5 minitower with 4GB from Fry's. Yours is clearly way more substantial, so I can't compare my numbers directly to yours.

If you are doing tests with a 4G ramdisk on a machine with only 4G of RAM, it seems like you end up testing a lot more than just the filesystem. Testing in such low memory situations can identify significant issues, but it is questionable as a 'which filesystem is better' benchmark.

A 1.3 GB tmpfs, and sorry, it is 10 GB (the machine next to it is 4G). I am careful to ensure the test environment does not have spurious memory or cpu hogs. I will not claim that this is the most sterile test environment possible, but it is adequate for the task at hand. Nearly always, when I find big variations in the test numbers it turns out to be a quirk of one filesystem that is not exhibited by the others. Everything gets multiple runs and lands in a spreadsheet. Any fishy variance is investigated.

By the way, the low variance kings by far are Ext4 and Tux3, and of those two, guess which one is more consistent. XFS is usually steady, but can get emotional with lots of tasks, and Btrfs has regular wild mood swings whenever the stars change alignment. And while I'm making gross generalizations: XFS and Btrfs go OOM way before Ext4 and Tux3. Just a
Re: Tux3 Report: How fast can we fsync?
On Friday, May 1, 2015 8:38:55 AM PDT, Dave Chinner wrote: Well, yes - I never claimed XFS is a general purpose filesystem. It is a high performance filesystem. It is also becoming more relevant to general purpose systems as low cost storage gains capabilities that used to be considered the domain of high performance storage...

OK. Well, Tux3 is general purpose and that means we care about single spinning disk and small systems.

So, to demonstrate, I'll run the same tests but using a 256GB samsung 840 EVO SSD and show how much the picture changes.

I will go you one better, I ran a series of fsync tests using tmpfs, and I now have a very clear picture of how the picture changes. The executive summary is: Tux3 is still way faster, and still scales way better to large numbers of tasks. I have every confidence that the same is true of SSD.

/dev/ramX can't be compared to an SSD. Yes, they both have low seek/IO latency but they have very different dispatch and IO concurrency models. One is synchronous, the other is fully asynchronous.

I had ram available and no SSD handy to abuse. I was interested in measuring the filesystem overhead with the device factored out. I mounted loopback on a tmpfs file, which seems to be about the same as /dev/ram, maybe slightly faster, but much easier to configure. I ran some tests on a ramdisk just now and was mortified to find that I have to reboot to empty the disk. It would take a compelling reason before I do that again.

This is an important distinction, as we'll see later on

I regard it as predictive of Tux3 performance on NVM.

These trees:

git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3.git
git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3-test.git

have not been updated for 11 months. I thought tux3 had died long ago. You should keep them up to date, and send patches for xfstests to support tux3, and then you'll get a lot more people running, testing and breaking tux3.

People are starting to show up to do testing now, pretty much for the first time, so we must do some housecleaning. It is gratifying that Tux3 never broke for Mike, but of course it will assert just by running out of space at the moment. As you rightly point out, that fix is urgent and is my current project.

Running the same thing on tmpfs, Tux3 is significantly faster:

Ext4:   1.40s
XFS:    1.10s
Btrfs:  1.56s
Tux3:   1.07s

3% is not significantly faster. It's within run to run variation!

You are right, XFS and Tux3 are within experimental error for single syncs on the ram disk, while Ext4 and Btrfs are way slower:

Ext4:   1.59s
XFS:    1.11s
Btrfs:  1.70s
Tux3:   1.11s

A distinct performance gap appears between Tux3 and XFS as parallel tasks increase.

You wish. In fact, Tux3 is a lot faster. ...

Yes, it's easy to be fast when you have simple, naive algorithms and an empty filesystem.

No it isn't, or the others would be fast too. In any case our algorithms are far from naive, except for allocation. You can rest assured that when allocation is brought up to a respectable standard in the fullness of time, it will be competitive and will not harm our clean filesystem performance at all. There is no call for you to disparage our current achievements, which are significant. I do not mind some healthy skepticism about the allocation work, you know as well as anyone how hard it is. However your denial of our current result is irritating and creates the impression that you have an agenda. If you want to complain about something real, complain that our current code drop is not done yet.
I will humbly apologize, and the same for enospc.

triple checked and reproducible:

Tasks:    10     100    1,000   10,000
Ext4:   0.05    0.14     1.53    26.56
XFS:    0.05    0.16     2.10    29.76
Btrfs:  0.08    0.37     3.18    34.54
Tux3:   0.02    0.05     0.18     2.16

Yet I can't reproduce those XFS or ext4 numbers you are quoting there. e.g. XFS on a 4GB ram disk:

$ for i in 10 100 1000 10000; do rm /mnt/test/foo* ; time ./test-fsync /mnt/test/foo 10 $i; done

real    0m0.030s
user    0m0.000s
sys     0m0.014s

real    0m0.031s
user    0m0.008s
sys     0m0.157s

real    0m0.305s
user    0m0.029s
sys     0m1.555s

real    0m3.624s
user    0m0.219s
sys     0m17.631s
$

That's roughly 10x faster than your numbers. Can you describe your test setup in detail? e.g. post the full log from block device creation to benchmark completion so I can reproduce what you are doing exactly?

Mine is a lame i5 minitower with 4GB from Fry's. Yours is clearly way more substantial, so I can't compare my numbers directly to yours. Clearly the curve is the same: your numbers increase 10x going from 100 to 1,000 tasks and 12x going from 1,000 to 10,000. The Tux3 curve is significantly flatter and starts from a lower base, so it ends with a really wide gap. You will
Re: Tux3 Report: How fast can we fsync?
On Wednesday, April 29, 2015 8:50:57 PM PDT, Mike Galbraith wrote:

On Wed, 2015-04-29 at 13:40 -0700, Daniel Phillips wrote: That order of magnitude latency difference is striking. It sounds good, but what does it mean?

I see a smaller difference here, maybe because of running under KVM. That max_latency thing is flush.

Right, it is just the max run time of all operations, including flush (dbench's name for fsync, I think), which would most probably be the longest running one.

I would like to know how we manage to pull that off.

Now that you mention it, I see a factor of two or so latency win here, not the order of magnitude that you saw. Maybe KVM introduces some fuzz for me. I checked whether fsync = sync is the reason, and no. Well, that goes on the back burner, we will no doubt figure it out in due course.

Regards,

Daniel
Re: Tux3 Report: How fast can we fsync?
On Wednesday, April 29, 2015 6:46:16 PM PDT, Dave Chinner wrote:

I measured fsync performance using a 7200 RPM disk as a virtual drive under KVM, configured with cache=none so that asynchronous writes are cached and synchronous writes translate into direct writes to the block device.

Yup, a slow single spindle, so fsync performance is determined by seek latency of the filesystem. Hence the filesystem that wins will be the filesystem that minimises fsync seek latency above all other considerations.

http://www.spinics.net/lists/kernel/msg1978216.html

If you want to declare that XFS only works well on solid state disks and big storage arrays, that is your business. But if you do, you can no longer call XFS a general purpose filesystem. And if you would rather disparage people who report genuine performance bugs than get down to fixing them, that is your business too. Don't expect to be able to stop the bug reports by bluster.

So, to demonstrate, I'll run the same tests but using a 256GB samsung 840 EVO SSD and show how much the picture changes.

I will go you one better, I ran a series of fsync tests using tmpfs, and I now have a very clear picture of how the picture changes. The executive summary is: Tux3 is still way faster, and still scales way better to large numbers of tasks. I have every confidence that the same is true of SSD.

I didn't test tux3, you don't make it easy to get or build.

There is no need to apologize for not testing Tux3, however, it is unseemly to throw mud at the same time. Remember, you are the person who put so much energy into blocking Tux3 from merging last summer. If it now takes you a little extra work to build it then it is hard to be really sympathetic. Mike apparently did not find it very hard.

To focus purely on fsync, I wrote a small utility (at the end of this post) that forks a number of tasks, each of which continuously appends to and fsyncs its own file. For a single task doing 1,000 fsyncs of 1K each, we have:

Ext4:   34.34s
XFS:    23.63s
Btrfs:  34.84s
Tux3:   17.24s

Ext4:   1.94s
XFS:    2.06s
Btrfs:  2.06s

All equally fast, so I can't see how tux3 would be much faster here.

Running the same thing on tmpfs, Tux3 is significantly faster:

Ext4:   1.40s
XFS:    1.10s
Btrfs:  1.56s
Tux3:   1.07s

Tasks:    10      100     1,000    10,000
Ext4:   0.05s   0.12s    0.48s     3.99s
XFS:    0.25s   0.41s    0.96s     4.07s
Btrfs:  0.22s   0.50s    2.86s   161.04s
(lower is better)

Ext4 and XFS are fast and show similar performance. Tux3 *can't* be very much faster as most of the elapsed time in the test is from forking the processes that do the IO and fsyncs.

You wish. In fact, Tux3 is a lot faster. You must have made a mistake in estimating your fork overhead. It is easy to check, just run syncs foo 0 1. I get 0.23 seconds to fork 10,000 processes, create the files and exit. Here are my results on tmpfs, triple checked and reproducible:

Tasks:    10     100    1,000   10,000
Ext4:   0.05    0.14     1.53    26.56
XFS:    0.05    0.16     2.10    29.76
Btrfs:  0.08    0.37     3.18    34.54
Tux3:   0.02    0.05     0.18     2.16

Note: you should recheck your final number for Btrfs. I have seen Btrfs fall off the rails and take wildly longer on some tests just like that. We know Btrfs has corner case issues, I don't think they deny it. Unlike you, Chris Mason is a gentleman when faced with issues. Instead of insulting his colleagues and hurling around the sort of abuse that has gained LKML its current unenviable reputation, he gets down to work and fixes things. You should do that too, your own house is not in order. XFS has major issues.
One easily reproducible one is a denial of service during the 10,000 task test where it takes multiple seconds to cat small files. I saw XFS do this on both spinning disk and tmpfs, and I have seen it hang for minutes trying to list a directory. I looked a bit into it, and I see that you are blocking for aeons trying to acquire a lock in open.

Here is an example. While doing syncs fs/foo 10 1:

time cat fs/foo999
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!

real    0m2.282s
user    0m0.000s
sys     0m0.000s

You and I both know the truth: Ext4 is the only really reliable general purpose filesystem on Linux at the moment. XFS is definitely not, I have seen ample evidence with my own eyes. What you need is people helping you fix your issues instead of making your colleagues angry at you with your incessant attacks.

FWIW, btrfs shows its horrible fsync implementation here, burning huge amounts of CPU to do bugger all IO. i.e. it burnt all 16p for 2 and a half minutes in that 10,000 fork test so wasn't IO bound at all.

Btrfs is hot and cold. In my tmpfs tests, Btrfs beats XFS at high task counts. It is actually
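[Editorial sketch] The small utility referred to above as being "at the end of this post" is not included in this excerpt. For readers who want a rough idea of what such a test does, here is a minimal stand-in in C, not the original test-fsync or syncs source: it forks a number of tasks, each of which appends 1K blocks to its own file and calls fsync() after every append. The argument order, file naming and block size are assumptions made purely for illustration.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Usage: ./fsync-test <basepath> <writes> <tasks>
 * Forks <tasks> children; each appends <writes> 1K blocks to its own file,
 * fsyncing after every append. */
int main(int argc, char *argv[])
{
	if (argc != 4) {
		fprintf(stderr, "usage: %s <basepath> <writes> <tasks>\n", argv[0]);
		return 1;
	}
	const char *base = argv[1];
	int writes = atoi(argv[2]);
	int tasks = atoi(argv[3]);
	char block[1024];
	memset(block, 'x', sizeof block);

	for (int t = 0; t < tasks; t++) {
		if (fork() == 0) {
			char name[4096];
			snprintf(name, sizeof name, "%s%d", base, t);
			int fd = open(name, O_CREAT | O_WRONLY | O_APPEND, 0644);
			if (fd < 0) {
				perror(name);
				_exit(1);
			}
			for (int i = 0; i < writes; i++) {
				if (write(fd, block, sizeof block) != (ssize_t)sizeof block)
					_exit(1);
				fsync(fd);
			}
			close(fd);
			_exit(0);
		}
	}
	while (wait(NULL) > 0)	/* reap all children */
		;
	return 0;
}

Run under time with task counts of 10, 100, 1,000 and 10,000 against a mounted target filesystem, something of this shape reproduces the scaling curves being argued over in this thread, though the absolute numbers depend on the exact utility, the filesystem and the hardware.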
Re: Tux3 Report: How fast can we fsync?
On Thursday, April 30, 2015 2:17:55 PM PDT, James Cloos wrote:

DP == Daniel Phillips dan...@phunq.net writes:
DP you build userspace tools from the hirofumi-user branch

In a fresh clone there is no hirofumi-user branch, only hirofumi and master:

:; cat .git/packed-refs
# pack-refs with: peeled fully-peeled
028552773ced1c17cdbec2cda949b2ae94f55d30 refs/remotes/origin/hirofumi
0dd55b3f5295f74c41e33e1962c79a0282603f5d refs/remotes/origin/master

-JimC

Git confuses me too. Try:

git checkout hirofumi/hirofumi-user

This leaves you with a detached head, so you can do:

git branch localname; git checkout localname

Regards,

Daniel