Re: Tux3 Report: Initial fsck has landed

2013-03-19 Thread Daniel Phillips
On Tue, Mar 19, 2013 at 9:04 PM, David Lang da...@lang.hm wrote:
 On Wed, 20 Mar 2013, Martin Steigerwald wrote:

 On Tuesday, 29 January 2013, Daniel Phillips wrote:

 On Mon, Jan 28, 2013 at 5:40 PM, Theodore Ts'o ty...@mit.edu wrote:

 On Mon, Jan 28, 2013 at 04:20:11PM -0800, Darrick J. Wong wrote:

 On Mon, Jan 28, 2013 at 03:27:38PM -0800, David Lang wrote:

 The situation I'm thinking of is when dealing with VMs, you make a
 filesystem image once and clone it multiple times. Won't that end up
 with the same UUID in the superblock?

 Yes, but one ought to be able to change the UUID a la tune2fs
 -U.  Even still... so long as the VM images have a different UUID
 than the fs that they live on, it ought to be fine.

 ... and this is something most system administrators should be
 familiar with.  For example, it's one of those things that Norton
 Ghost does when it makes file system image copies (the equivalent of
 tune2fs -U random /dev/XXX)

 Hmm, maybe I missed something, but it does not seem like a good idea
 to use the volume UUID itself to generate unique-per-volume metadata
 hashes, if users expect to be able to change it. All the metadata hashes
 would need to be changed.

 I believe that is what BTRFS is doing.

 And yes, AFAIK there is no easy way to change the UUID of a BTRFS
 filesystem after it was created.

 In a world where systems are cloned, and many VMs are started from one
 master copy of a filesystem, a UUID is about as far from unique as anything
 you can generate.

 BTRFS may have this problem, but why should Tux3 copy the problem?

Tux3 won't copy that problem. We have enough real problems to deal with
as it is, without manufacturing new ones.

Regards,

Daniel



Re: Tux3 Report: Initial fsck has landed

2013-03-20 Thread Daniel Phillips
On Tue, Mar 19, 2013 at 11:54 PM, Rob Landley r...@landley.net wrote:
 I'm confused, http://tux3.org/ lists a bunch of dates from 5 years ago, then
 nothing. Is this project dead or not?

Not. We haven't done much about updating tux3.org lately, however you
will find plenty of activity here:

 https://github.com/OGAWAHirofumi/tux3/tree/master/user

You will also find fairly comprehensive updates on where we are and
where this is going, here:

 http://phunq.net/pipermail/tux3/

At the moment we're being pretty quiet because of being in the middle
of developing the next-gen directory index. Not such a small task, as
you might imagine.

Regards,

Daniel



Re: kernel merge

2013-03-22 Thread Daniel Phillips
On Tue, Mar 19, 2013 at 1:52 AM, Raymond Jennings shent...@gmail.com wrote:
 What I've heard so far about tux3 is very promising.

 When can consideration be given to merging it into the mainline linux kernel?

 For starters it's a great way to increase the testing base, and I'm
 actually confident enough in it to start using it on my desktop.

Hi Raymond,

At this point only a developer should use Tux3, and only on a test
machine or virtualized kernel, not on their desktop. That said, we are
generally in favour of merging and we agree that this will help
testing. But development is our bottleneck right now, and until
recently, there was at least one major design element unresolved
(directory index). Whether merging speeds this work up is an open
question. Up till now, being out of tree has just not been a major
logistical problem. The hope is that by merging, we will end up with
another developer or two on the team. Will it happen? We can be easily
convinced. Somebody just needs to stand up and say: yeah, I can hack
that code. If and only if it shows up in Linus's tree.

Regards,

Daniel



Re: Tux3 Report: Faster than tmpfs, what?

2013-05-11 Thread Daniel Phillips
Hi Dave,

Thanks for the catch - I should indeed have noted that modified
dbench was used for this benchmark, thus amplifying Tux3's advantage
in delete performance. This literary oversight does not make the
results any less interesting: we beat Tmpfs on that particular load.
Beating tmpfs at anything is worthy of note. Obviously, all three
filesystems ran the same load.

We agree that classic unadulterated dbench is an important Linux
benchmark for comparison with other filesystems. I think we should
implement a proper fsync for that one and not just use fsync = sync.
That isn't very far in the future, however our main focus right now is
optimizing spinning disk allocation. It probably makes logistical
sense to leave fsync as it is for now and concentrate on the more
important issues.

I do not agree with your assertion that the benchmark as run is
invalid, only that the modified load should have been described in
detail. I presume you would like to see a new bakeoff using classic
dbench. Patience please, this will certainly come down the pipe in due
course. We might not beat Tmpfs on that load but we certainly expect
to outperform some other filesystems.

Note that Tux3 ran this benchmark using its normal strong consistency
semantics, roughly similar to Ext4's data=journal. In that light, the
results are even more interesting.

 ...you've done that so the front end of tux3 won't
 encounter any blocking operations and so can offload 100% of
 operations.

Yes, that is the entire point of our front/back design: reduce
application latency for buffered filesystem transactions.

 It also explains the sync call every 4 seconds to keep
 tux3 back end writing out to disk so that a) all the offloaded work
 is done by the sync process and not measured by the benchmark, and
 b) so the front end doesn't overrun queues and throttle or run out
 of memory.

Entirely correct. That's really nice, don't you think? You nicely
described a central part of Tux3's design: our delta mechanism. We
expect to spend considerable effort tuning the details of our delta
transition behaviour as time goes by. However this is not an immediate
priority because the simplistic flush every 4 seconds hack already
works pretty well for a lot of loads.
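
To make the front/back split concrete, here is a minimal toy model of it
in C: the front end only touches in-memory state and returns immediately,
while a back-end thread retires a whole delta at a time. All names are
invented and this models the shape of the idea only, not our actual code.

/* deltatoy.c: toy model of a front/back delta pipeline.
 * build: c99 -Wall deltatoy.c -odeltatoy -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct op { int seq; struct op *next; };

struct delta {
	struct op *ops;		/* operations staged in this delta */
	int count;
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static struct delta *current;	/* delta the front end appends to */

/* Front end: record an operation in memory and return immediately. */
static void frontend_op(int seq)
{
	struct op *op = malloc(sizeof *op);
	op->seq = seq;
	pthread_mutex_lock(&lock);
	op->next = current->ops;
	current->ops = op;
	current->count++;
	pthread_mutex_unlock(&lock);
}

/* Back end: every few seconds, atomically start a new delta and
 * commit the old one (here, commit just prints and frees). */
static void *backend(void *arg)
{
	(void)arg;
	for (;;) {
		sleep(4);
		struct delta *fresh = calloc(1, sizeof *fresh);
		pthread_mutex_lock(&lock);
		struct delta *staging = current;
		current = fresh;	/* the delta transition */
		pthread_mutex_unlock(&lock);
		printf("committing delta with %i ops\n", staging->count);
		for (struct op *op = staging->ops; op;) {
			struct op *next = op->next;
			free(op);	/* real code would write to media here */
			op = next;
		}
		free(staging);
	}
	return NULL;
}

int main(void)
{
	pthread_t flusher;
	current = calloc(1, sizeof *current);
	pthread_create(&flusher, NULL, backend, NULL);
	for (int seq = 0; seq < 1000000; seq++) {
		frontend_op(seq);	/* never blocks on media */
		usleep(10);		/* pretend to be an application */
	}
	return 0;
}

The only thing the front end ever waits on is that brief lock during the
delta transition; everything that touches media happens behind it in the
back end.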

Thanks for your feedback,

Daniel



Tux3 Report: Meet Shardmap, the designated successor of HTree

2013-06-18 Thread Daniel Phillips
Greetings all,

From time to time, one may be fortunate enough to be blessed with a 
discovery in computer science that succeeds at improving all four of 
performance, scalability, reliability and simplicity. Of these normally 
conflicting goals, simplicity is usually the most elusive. It is 
therefore with considerable satisfaction that I present the results of 
our recent development work in directory indexing technology, which 
addresses some long-standing and vexing scalability problems exhibited 
by HTree, my previous contribution to the art of directory indexing. 
This new approach, Shardmap, will not only enhance Tux3 scalability, but 
also provide an upgrade path for Ext4 and Lustre. Shardmap is also 
likely to be interesting for high performance database design. Best of 
all, Shardmap is considerably simpler than the technology we expect it 
to replace.

The most interesting thing about Shardmap is that it remained 
undiscovered for so long. I expect that you will agree that this is 
particularly impressive, considering how obvious Shardmap is in 
retrospect. I can only speculate that the reason for not seeing this 
obvious solution is that we never asked the right question. The question 
should have been: how do we fix this write multiplication issue? Instead 
we spent ten years asking: what should we do about this cache 
thrashing? It turns out that an answer to the former is also an answer 
to the latter.

Now let us proceed without further ado to a brief tour of Shardmap, 
starting with the technology we expect it to replace.

The Problem with HTree

Occasionally we see LKML reports of performance issues in HTree at high 
scale, usually from people running scalability benchmarks. Lustre users 
have encountered these issues in real life. I always tended to shy away 
from those discussions because, frankly, I did not see any satisfactory 
answer, other than that HTree works perfectly well at the scale it was 
designed for and at which it is normally used. Recently I did learn the 
right answer: HTree is unfixable, and this is true of any media backed 
B-Tree index. Let me reiterate: contrary to popular opinion, a media 
backed B-Tree is an abysmally poor choice of information structure for 
any randomly updated indexing load.

But how can this be, doesn't everybody use B-Trees in just this way? 
Yes, and everybody is making a big mistake. Let me explain. The big 
issue is write multiplication. Any index that groups entries together in 
blocks will tend to have nearly every block dirty under a random update 
load. How do we transfer all those dirty blocks to cache incrementally, 
efficiently and atomically? We don't, it just cannot be done. In 
practice, we end up writing out most index blocks multiple times due to 
just a few small changes. For example, at the end of a mass update 
create we may find that each block has been written hundreds of times. 
Media transfer latency therefore dominates the operation.
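
To put rough numbers on that, here is a back-of-envelope model, under 
simplifying assumptions invented purely for illustration: uniformly 
random inserts, and every dirty index block written out at each commit.

/* writemult.c: crude model of B-Tree write multiplication.
 * build: c99 -Wall writemult.c -owritemult -lm */
#include <math.h>
#include <stdio.h>

int main(void)
{
	double entries = 1e6;		/* mass create: one million entries */
	double per_block = 250;		/* directory entries per index block */
	double blocks = entries / per_block;
	double per_commit = 4000;	/* inserts between commits */

	/* expected distinct dirty blocks per commit under random inserts */
	double dirty = blocks * (1 - pow(1 - 1 / blocks, per_commit));
	double commits = entries / per_commit;
	double written = commits * dirty;

	printf("index blocks:         %.0f\n", blocks);
	printf("dirty per commit:     %.0f\n", dirty);
	printf("blocks written:       %.0f\n", written);
	printf("write multiplication: %.0fx\n", written / blocks);
	return 0;
}

With those made-up but plausible inputs, each index block ends up being 
written on the order of 150 times during the create, which is exactly 
the effect described above.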

This obvious issue somehow escaped our attention over the entire time 
HTree has been in service. We have occasionally misattributed degraded 
HTree performance to inode table thrashing. To be sure, thrashing at 
high scale is a known problem with HTree, but it is not the biggest 
problem. That would be write multiplication. To fix this, we need to 
step back and adopt a completely different approach.

Dawning of the Light

I am kind of whacking myself on the forehead about this. For an entire 
decade I thought that HTree could be fixed by incremental improvements 
and consequently devoted considerable energy to that effort, the high 
water mark of which was my PHTree post earlier this year:

http://phunq.net/pipermail/tux3/2013-January/26.html

The PHTree design is a respectable if uninspired piece of work that 
fixes all the known issues with HTree except for write multiplication, 
which I expected to be pretty easy. Far from it. The issue is 
fundamental to the nature of B-Trees. Though not hitherto recognized in 
the Linux file system community, academics recognized this issue some 
time ago and have been busy hunting for a solution. During one of our 
sushi meetings in the wilds of Mountain View, Kent Overstreet of BCache 
fame pointed me at this work:

http://www.tokutek.com/2012/12/fractal-tree-indexing-overview/

Such attempts generally fail to get anywhere close to the efficiency 
levels we have become accustomed to with Ext4 and its ilk. But it got me 
thinking along productive lines. (Thank you Kent!) One day the answer 
just hit me like a slow rolling thunderbolt: instead of committing the 
actual B-Tree to disk we should leave it dirty in cache and just log the 
updates to it. This is obviously write-efficient and ACID friendly. It 
is also a poor solution because it sacrifices recovery latency. In the 
event of a crash we need to read the entire log to reconstruct the dirty 
B-Tree, which could take several minutes. During this time, even though 
the raw 

Design note: Simplified implementation of free block tags

2013-06-24 Thread Daniel Phillips
Free tags

Free tags in Tux3 will perform a similar function to Ext2's
unallocated block counts. In Ext2, an unallocated block count is a
16 bit field in the group descriptor for each block group. Tux3 does
not have group descriptors, but will use for a similar purpose a table
of free tags, where the table is just a large linear array at some
offset in a special inode.

Accounting data provided by  free tags is required by several
different allocators in Tux3 in order to scale effectively to large
allocation spaces. Our allocation algorithms need to be able to locate
available free objects in volumes theoretically as large as one
exabyte, and in practice, up to several terabytes for personal
workstation usage. We need to track three kinds of free objects:

 Free Blocks
 Free Inode numbers
 Free directory records

Each of these kinds of objects has its own unique allocation
characteristics. For example with free directory records we are
primarily interested in the availability of records of a given size,
while for free blocks we are mainly interested in the total free block
count in some given region of a volume. We will attempt to view these
three kinds of allocation space as similar in some important respects,
and handle them with similar data structures and algorithms.

For now, we will just concentrate on prototyping sufficient
functionality for the first of these, free blocks, to enable us to
proceed with work on high level block allocation algorithms.

Initial free block tracking

Our initial implementation of free block tracking will use a simple
persistent table of 16 bit counts, one per 4K bitmap block, recording
the number of allocated blocks in that bitmap block, starting at offset
zero in the table.

Note: the size of a bitmap block is probably going to change for
non-4K volumes. For now it is one filesystem block. But this does not
scale nicely as block size increases. We will eventually change to
fixed size bitmaps that do not change as the filesystem block size
changes, but have not yet done so. For now, each free tag map element
will always cover one block's worth of bitmap bits.

Note: we will be storing the allocated block count, not the free block
count, so that a block full of zeroes is interpreted as maximum
availability. A 4K bitmap block holds 2**12 bytes times 2**3 bits per
byte, so the relationship is simply:

 free_blocks = 2**(12 + 3) - allocated_blocks = 32768 - allocated_blocks

As with block bitmaps, we need to consider the possibility of
allocation recursion in this structure, that is, the case where
allocating a new block for the free block map itself causes a free
block tag to be updated. We will deal with this recursion in the same
way as the bitmap: we will log changes to the free block map instead
of directly writing out changed free block map blocks per delta.
Eventually, dirty free block map blocks will be written out as part of
a unify (rollup), just like bitmap blocks.

As an initial implementation, we will simply increment a free block
tag every time a bit changes from zero to one in the respective bitmap
block (a block is allocated), and decrement it every time a bit
changes from one to zero (a block is freed).
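
In code, the update rule is about as simple as it sounds. The following
is only a sketch with invented names, using a flat in-memory array to
stand in for the persistent table:

/* freetag.c: sketch of free tag maintenance, not actual Tux3 code.
 * build: c99 -Wall freetag.c -ofreetag */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

enum { blockbits = 12 + 3 };		/* bits in a 4K bitmap block */
enum { blocksize = 1 << blockbits };	/* 32768 bitmap bits */

/* One 16 bit allocated-block count per bitmap block. */
static uint16_t free_tag[1 << 20];

/* Called when a bitmap bit flips. The tag stores the allocated count,
 * so a tag of zero means maximum availability. */
static void update_tag(uint64_t block, int allocate)
{
	uint64_t group = block >> blockbits;	/* which bitmap block */
	if (allocate) {
		assert(free_tag[group] < blocksize);
		free_tag[group]++;		/* a bit went 0 -> 1 */
	} else {
		assert(free_tag[group] > 0);
		free_tag[group]--;		/* a bit went 1 -> 0 */
	}
	/* The real version also logs this change, just as bitmap
	 * changes are logged, to avoid allocation recursion. */
}

static unsigned free_blocks(uint64_t group)
{
	return blocksize - free_tag[group];
}

int main(void)
{
	update_tag(5, 1);
	update_tag(6, 1);
	update_tag(5, 0);
	printf("group 0: %u blocks free of %d\n", free_blocks(0), blocksize);
	return 0;
}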

Future Optimization: lazy updates

It is possible that the simplistic update strategy above may slow down
writes measurably by generating additional CPU work updating tags,
increasing the size of per-delta logs, and increasing the number of
blocks needing to be flushed in a unify (rollup). If this turns out to
be the case, then it is thought that such overhead can be reduced by
lazily updating the tag maps. This proposition may be tested
experimentally, given the simple implementation described here to test
it against.

Future Optimization: one byte tags

It is possible that exact 16 bit block counts are higher precision
than is actually required by effective volume layout optimization
algorithms. An approximate representation that can be stored in a
single byte might do just as well, and cut down the size of the
mapping tables by a factor of two. This can be tested experimentally,
later.

Future Optimization: hierarchical maps

For very large volumes, we may find that scanning the free tag map
becomes inefficient. Then it may be worth mapping the tag map itself,
with a higher level tag map so that we can rapidly locate free space
in extremely large volumes. Whether such an optimization is actually
needed for realistic volume sizes is an open question that we can
test, given an initial, simple implementation.



Re: [PATCH] Optimize wait_sb_inodes()

2013-06-26 Thread Daniel Phillips
Hi Dave,

On Wed, Jun 26, 2013 at 9:47 PM, Dave Chinner da...@fromorbit.com wrote:
 You have your own wait code, that doesn't make what the VFS does
 unnecessary. Quite frankly, I don't trust individual filesystems to
 get it right - there's a long history of filesystem specific data
 sync problems (including in XFS), and the best way to avoid that is
 to ensure the VFS gets it right for you.

I agree that some of the methods Tux3 uses to implement data integrity, sync
and friends may be worth lifting up to core, or better, to a library,
but we will
all be better served if such methods are given time to mature first. After all,
that basically describes the entire evolution of the VFS: new concepts start
in a filesystem, prove themselves useful, then may be lifted up to be shared.

It is important to get the order right: prove first, then lift.

Regards,

Daniel



Re: [RFC] Tux3 for review

2014-05-16 Thread Daniel Phillips

On Friday, May 16, 2014 10:09:50 PM PDT, Martin Steigerwald wrote:

Hi Daniel!

On Friday, 16 May 2014, 17:50:59, Daniel Phillips wrote:

We would like to offer Tux3 for review for mainline merge. We have
prepared a new repository suitable for pulling:


At long last!

Congrats for arriving at this point.

Ciao,


Hi Martin,

Thanks. Hirofumi is the one who deserves congratulations and recognition 
for providing more than half the code, including most of the hard parts, 
and for bringing Tux3 back to life.


Regards,

Daniel



Re: [RFC] Tux3 for review

2014-05-20 Thread Daniel Phillips

Hi Dave,

This is to address your concern about theoretical interaction between 
direct IO and Tux3 page fork.


On Monday, May 19, 2014 10:41:40 PM PDT, you wrote:

Except that Direct IO impacts on the design of the page forking code
(because of how things like get_user_pages() need to be aware of
page forking). So you need to have direct IO working to demonstrate
that the page forking design is sound.


Page fork only affects cache pages, so the only interaction with direct IO 
is when the direct IO is to/from a mmap. If a direct write races with a 
programmed write to cache that causes a fork, then get_user_pages may pick 
up the old or new version of a page. It is not defined which will be 
written to disk, which is not a surprise. If a direct read races with a 
programmed write to cache that causes a fork, then it might violate our 
strong ordering, but that is not a surprise. I do not see any theoretical 
oopses or life cycle issues.


So Tux3 may allow racy direct read to violate strong ordering, but strong 
ordering would still be available with proper application sequencing. For 
example, direct read to mmap followed by msync would be strongly ordered.
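
As a concrete illustration of that sequencing, the application side might 
look like the sketch below. It is plain POSIX, nothing Tux3-specific, and 
the file names are placeholders:

/* Direct read into a mapping of another file, then msync so the result
 * is durable in program order. Error handling is minimal. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

enum { size = 4096 };	/* O_DIRECT wants block aligned size and offset */

int main(void)
{
	int src = open("source.dat", O_RDONLY | O_DIRECT);
	int dst = open("dest.dat", O_RDWR | O_CREAT, 0644);
	if (src < 0 || dst < 0 || ftruncate(dst, size) != 0) {
		perror("setup");
		return 1;
	}
	/* mmap returns a page aligned buffer, which satisfies O_DIRECT. */
	char *map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, dst, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	if (read(src, map, size) != size) {	/* direct read lands in the mapping */
		perror("read");
		return 1;
	}
	if (msync(map, size, MS_SYNC) != 0) {	/* commit before depending on it */
		perror("msync");
		return 1;
	}
	return 0;
}

The msync after the direct read is what restores strong ordering from the 
application's point of view, whatever page version the racing paths 
happened to see.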


Regards,

Daniel



Re: [RFC] Tux3 for review

2014-05-23 Thread Daniel Phillips

Hi Dongsu,

On Thursday, May 22, 2014 2:52:27 AM PDT, Dongsu Park wrote:

First of all, thank you for trying to merge it to mainline.
Maybe I cannot say the code is clean enough, but basically
the filesystem seems to work at least.


Thank you for confirming that. We test Tux3 extensively so we know it works 
pretty well (short of enospc handling) but independent confirmation carries 
more weight than anything we could say. Our standard disclaimer: Tux3 is 
for developers right now, not for users.



...The files named *_hack were kept as close as
possible to the original core code to clarify exactly where core
needs to change in order to remove our workarounds. If you think we
should pretty up that code then we will happily do it. Or maybe we
can hammer out acceptable core patches right now, and include those

 ...

Looking up kallsyms is not only hacky, but also making the filesystem
unable to be mounted at all, when CONFIG_KALLSYMS_ALL is not defined.
I'll send out patches to fix that separately to tux3 mailing list.


Thank you for improving the hack. We are working on getting rid of that 
flusher hack completely. There is a patch under development to introduce a 
new super_operations.writeback() operation that allows a filesystem to 
flush its own inodes instead of letting core do it. This will allow Tux3 to 
enforce its strong ordering semantics efficiently without needing to 
reimplement part of fs-writeback.c.


Regards,

Daniel



Re: Tux3 influence

2014-07-24 Thread Daniel Phillips

Hi Sachar,

On Wednesday, July 23, 2014 4:43:35 AM PDT, you wrote:

In the past few months I developed Funex, a new (FUSE-based, GPL)
file-system. Although still in very early alpha stages, the basic
functionality seems to work fine. During the development process I
discovered again and again that many of the design notes on the tux3
mailing list have a lot of useful ideas as well as insight into
file-system internals. In particular, the idea of using
frontend/backend architecture influenced Funex internal design.
I would like to thank you all for sharing this knowledge.


You are most welcome. Perhaps you could tell us a little more
about how you adapted the frontend/backend concept for use with
FUSE.

Regards,

Daniel




Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-04-30 Thread Daniel Phillips

On Wednesday, April 29, 2015 5:20:08 PM PDT, Dave Chinner wrote:

It's easy to be fast on empty filesystems. XFS does not aim to be
fast in such situations - it aims to have consistent performance
across the life of the filesystem.

In this case, ext4, btrfs and tux3 have optimal allocation filling
from the outside of the disk, while XFS is spreading the files
across (at least) 4 separate regions of the whole disk. Hence XFS is
seeing seek times on read are much larger than the other filesystems
when the filesystem is empty as it is doing full disk seeks rather
than being confined to the outer edges of the spindle.

Thing is, once you've abused those filesystems for a couple of
months, the files in ext4, btrfs and tux3 are not going to be laid
out perfectly on the outer edge of the disk. They'll be spread all
over the place and so all the filesystems will be seeing large seeks
on read. The thing is, XFS will have roughly the same performance as
when the filesystem is empty because the spreading of the allocation
allows it to maintain better locality and separation and hence
doesn't fragment free space nearly as badly as the other filesystems.
Free space fragmentation is what leads to performance degradation in
filesystems, and all the other filesystem will have degraded to be
*much worse* than XFS.

Put simply: empty filesystem benchmarking does not show the real
performance of the filesystem under sustained production workloads.
Hence benchmarks like this - while interesting from a theoretical
point of view and are widely used for bragging about who's got the
fastest - are mostly irrelevant to determining how the filesystem
will perform in production environments.

We can also look at this algorithm in a different way: take a large
filesystem (say a few hundred TB) across a few tens of disks in a
linear concat.  ext4, btrfs and tux3 will only hit the first disk in
the concat, and so go no faster because they are still bound by
physical seek times.  XFS, however, will spread the load across many
(if not all) of the disks, and so effectively reduce the average
seek time by the number of disks doing concurrent IO. Then you'll
see that application level IO concurrency becomes the performance
limitation, not the physical seek time of the hardware.

IOWs, what you don't see here is that the XFS algorithms that make
your test slow will keep *lots* of disks busy. i.e. testing empty
filesystem performance on a single, slow disk demonstrates that an
algorithm designed for scalability isn't designed to achieve
physical seek distance minimisation.  Hence your storage makes XFS
look particularly poor in comparison to filesystems that are being
designed and optimised for the limitations of single slow spindles...

To further demonstrate that it is physical seek distance that is the
issue here, lets take the seek time out of the equation (e.g. use a
SSD).  Doing that will result in basically no difference in
performance between all 4 filesystems as performance will now be
determined by application level concurrency and that is the same for
all tests.


Lovely sounding argument, but it is wrong because Tux3 still beats XFS
even with seek time factored out of the equation.

Even with SSD, if you just go splattering files all over the disk you
will pay for it in latency and lifetime when the disk goes into
continuous erase and your messy layout causes write multiplication.
But of course you can design your filesystem any way you want. Tux3
is designed to be fast on the hardware that people actually have.

Regards,

Daniel




Re: Tux3 Report: How fast can we fsync?

2015-04-30 Thread Daniel Phillips

On Wednesday, April 29, 2015 8:50:57 PM PDT, Mike Galbraith wrote:

On Wed, 2015-04-29 at 13:40 -0700, Daniel Phillips wrote:


That order of magnitude latency difference is striking. It sounds
good, but what does it mean? I see a smaller difference here, maybe
because of running under KVM.


That max_latency thing is flush.


Right, it is just the max run time of all operations, including flush
(dbench's name for fsync I think) which would most probably be the longest
running one. I would like to know how we manage to pull that off. Now
that you mention it, I see a factor of two or so latency win here, not
the order of magnitude that you saw. Maybe KVM introduces some fuzz
for me.

I checked whether fsync = sync is the reason, and no. Well, that goes
on the back burner, we will no doubt figure it out in due course.

Regards,

Daniel



Re: Tux3 Report: How fast can we fsync?

2015-04-30 Thread Daniel Phillips

On Wednesday, April 29, 2015 6:46:16 PM PDT, Dave Chinner wrote:

I measured fsync performance using a 7200 RPM disk as a virtual
drive under KVM, configured with cache=none so that asynchronous
writes are cached and synchronous writes translate into direct
writes to the block device.


Yup, a slow single spindle, so fsync performance is determined by
seek latency of the filesystem. Hence the filesystem that wins
will be the filesystem that minimises fsync seek latency above all
other considerations.

http://www.spinics.net/lists/kernel/msg1978216.html


If you want to declare that XFS only works well on solid state disks 
and big storage arrays, that is your business. But if you do, you can no
longer call XFS a general purpose filesystem. And if you would rather 
disparage people who report genuine performance bugs than get down to
fixing them, that is your business too. Don't expect to be able to stop 
the bug reports by bluster.



So, to demonstrate, I'll run the same tests but using a 256GB
samsung 840 EVO SSD and show how much the picture changes.


I will go you one better, I ran a series of fsync tests using tmpfs,
and I now have a very clear picture of how the picture changes. The
executive summary is: Tux3 is still way faster, and still scales way
better to large numbers of tasks. I have every confidence that the same
is true of SSD.


I didn't test tux3, you don't make it easy to get or build.


There is no need to apologize for not testing Tux3, however, it is 
unseemly to throw mud at the same time. Remember, you are the person 
who put so much energy into blocking Tux3 from merging last summer. If
it now takes you a little extra work to build it then it is hard to be 
really sympathetic. Mike apparently did not find it very hard.



To focus purely on fsync, I wrote a
small utility (at the end of this post) that forks a number of
tasks, each of which continuously appends to and fsyncs its own
file. For a single task doing 1,000 fsyncs of 1K each, we have:

   Ext4:  34.34s
   XFS:   23.63s
   Btrfs: 34.84s
   Tux3:  17.24s


   Ext4:   1.94s
   XFS:    2.06s
   Btrfs:  2.06s

All equally fast, so I can't see how tux3 would be much faster here.


Running the same thing on tmpfs, Tux3 is significantly faster:

Ext4:   1.40s
XFS:    1.10s
Btrfs:  1.56s
Tux3:   1.07s


   Tasks:      10      100    1,000    10,000
   Ext4:     0.05s    0.12s    0.48s     3.99s
   XFS:      0.25s    0.41s    0.96s     4.07s
   Btrfs:    0.22s    0.50s    2.86s   161.04s
 (lower is better)

Ext4 and XFS are fast and show similar performance. Tux3 *can't* be
very much faster as most of the elapsed time in the test is from
forking the processes that do the IO and fsyncs.


You wish. In fact, Tux3 is a lot faster. You must have made a mistake in 
estimating your fork overhead. It is easy to check, just run "syncs foo 
0 10000". I get 0.23 seconds to fork 10,000 processes, create the files 
and exit. Here are my results on tmpfs, triple checked and reproducible:


   Tasks:     10     100    1,000   10,000
   Ext4:    0.05    0.14     1.53    26.56
   XFS:     0.05    0.16     2.10    29.76
   Btrfs:   0.08    0.37     3.18    34.54
   Tux3:    0.02    0.05     0.18     2.16

Note: you should recheck your final number for Btrfs. I have seen Btrfs 
fall off the rails and take wildly longer on some tests just like that.
We know Btrfs has corner case issues, I don't think they deny it. 
Unlike you, Chris Mason is a gentleman when faced with issues. Instead 
of insulting his colleagues and hurling around the sort of abuse that 
has gained LKML its current unenviable reputation, he gets down to work 
and fixes things.


You should do that too, your own house is not in order. XFS has major 
issues. One easily reproducible one is a denial of service during the 
10,000 task test where it takes multiple seconds to cat small files. I 
saw XFS do this on both spinning disk and tmpfs, and I have seen it 
hang for minutes trying to list a directory. I looked a bit into it, and 
I see that you are blocking for aeons trying to acquire a lock in open.


Here is an example. While doing "syncs fs/foo 10 10000":

time cat fs/foo999
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!

real0m2.282s
user0m0.000s
sys 0m0.000s

You and I both know the truth: Ext4 is the only really reliable general 
purpose filesystem on Linux at the moment. XFS is definitely not, I 
have seen ample evidence with my own eyes. What you need is people 
helping you fix your issues instead of making your colleagues angry at 
you with your incessant attacks.



 FWIW, btrfs shows its horrible fsync implementation here, burning
 huge amounts of CPU to do bugger all IO. i.e. it burnt all 16p for 2
 and a half minutes in that 10,000 fork test so wasn't IO bound at
all.


Btrfs is hot and cold. In my tmpfs tests, Btrfs beats XFS at high 
task counts. It is actually 

Re: Tux3 Report: How fast can we fsync?

2015-05-01 Thread Daniel Phillips
On Friday, May 1, 2015 8:38:55 AM PDT, Dave Chinner wrote:

 Well, yes - I never claimed XFS is a general purpose filesystem.  It
 is a high performance filesystem. It is also becoming more relevant
 to general purpose systems as low cost storage gains capabilities
 that used to be considered the domain of high performance storage...

OK. Well, Tux3 is general purpose and that means we care about single
spinning disk and small systems.

 So, to demonstrate, I'll run the same tests but using a 256GB
 samsung 840 EVO SSD and show how much the picture changes.

 I will go you one better, I ran a series of fsync tests using
 tmpfs, and I now have a very clear picture of how the picture
 changes. The executive summary is: Tux3 is still way faster, and
 still scales way better to large numbers of tasks. I have every
 confidence that the same is true of SSD.

 /dev/ramX can't be compared to an SSD.  Yes, they both have low
 seek/IO latency but they have very different dispatch and IO
 concurrency models.  One is synchronous, the other is fully
 asynchronous.

I had ram available and no SSD handy to abuse. I was interested in
measuring the filesystem overhead with the device factored out. I
mounted loopback on a tmpfs file, which seems to be about the same as
/dev/ram, maybe slightly faster, but much easier to configure. I ran
some tests on a ramdisk just now and was mortified to find that I have
to reboot to empty the disk. It would take a compelling reason before
I do that again.
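
For the record, by loopback on a tmpfs file I mean roughly this kind of
setup, with sizes and mount points chosen arbitrarily:

   mount -t tmpfs -o size=1400m tmpfs /mnt/ram
   truncate -s 1300m /mnt/ram/backing
   losetup -f --show /mnt/ram/backing   # prints the loop device, e.g. /dev/loop0
   # then mkfs the loop device with the filesystem under test and mount it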

 This is an important distinction, as we'll see later on

I regard it as predictive of Tux3 performance on NVM.

 These trees:

 git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3.git
 git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3-test.git

 have not been updated for 11 months. I thought tux3 had died long
 ago.

 You should keep them up to date, and send patches for xfstests to
 support tux3, and then you'll get a lot more people running,
 testing and breaking tux3

People are starting to show up to do testing now, pretty much the first
time, so we must do some housecleaning. It is gratifying that Tux3 never
broke for Mike, but of course it will assert just by running out of
space at the moment. As you rightly point out, that fix is urgent and is
my current project.

 Running the same thing on tmpfs, Tux3 is significantly faster:

 Ext4:   1.40s
 XFS:1.10s
 Btrfs:  1.56s
 Tux3:   1.07s

 3% is not significantly faster. It's within run to run variation!

You are right, XFS and Tux3 are within experimental error for single
syncs on the ram disk, while Ext4 and Btrfs are way slower:

   Ext4:   1.59s
   XFS:    1.11s
   Btrfs:  1.70s
   Tux3:   1.11s

A distinct performance gap appears between Tux3 and XFS as parallel
tasks increase.

 You wish. In fact, Tux3 is a lot faster. ...

 Yes, it's easy to be fast when you have simple, naive algorithms and
 an empty filesystem.

No it isn't or the others would be fast too. In any case our algorithms
are far from naive, except for allocation. You can rest assured that
when allocation is brought up to a respectable standard in the fullness
of time, it will be competitive and will not harm our clean filesystem
performance at all.

There is no call for you to disparage our current achievements, which
are significant. I do not mind some healthy skepticism about the
allocation work, you know as well as anyone how hard it is. However your
denial of our current result is irritating and creates the impression
that you have an agenda. If you want to complain about something real,
complain that our current code drop is not done yet. I will humbly
apologize, and the same for enospc.

 triple checked and reproducible:

Tasks:     10     100    1,000   10,000
Ext4:    0.05    0.14     1.53    26.56
XFS:     0.05    0.16     2.10    29.76
Btrfs:   0.08    0.37     3.18    34.54
Tux3:    0.02    0.05     0.18     2.16

 Yet I can't reproduce those XFS or ext4 numbers you are quoting
 there. eg. XFS on a 4GB ram disk:

 $ for i in 10 100 1000 10000; do rm /mnt/test/foo* ; time
 ./test-fsync /mnt/test/foo 10 $i; done

 real0m0.030s
 user0m0.000s
 sys 0m0.014s

 real0m0.031s
 user0m0.008s
 sys 0m0.157s

 real0m0.305s
 user0m0.029s
 sys 0m1.555s

 real0m3.624s
 user0m0.219s
 sys 0m17.631s
 $

 That's roughly 10x faster than your numbers. Can you describe your
 test setup in detail? e.g.  post the full log from block device
 creation to benchmark completion so I can reproduce what you are
 doing exactly?

Mine is a lame i5 minitower with 4GB from Fry's. Yours is clearly way
more substantial, so I can't compare my numbers directly to yours.

Clearly the curve is the same: your numbers increase 10x going from 100
to 1,000 tasks and 12x going from 1,000 to 10,000. The Tux3 curve is
significantly flatter and starts from a lower base, so it ends with a
really wide gap. You will 

Re: Tux3 Report: How fast can we fsync?

2015-05-02 Thread Daniel Phillips

On Friday, May 1, 2015 6:07:48 PM PDT, David Lang wrote:

On Fri, 1 May 2015, Daniel Phillips wrote:

On Friday, May 1, 2015 8:38:55 AM PDT, Dave Chinner wrote:


Well, yes - I never claimed XFS is a general purpose filesystem.  It
is a high performance filesystem. It is also becoming more relevant
to general purpose systems as low cost storage gains capabilities
that used to be considered the domain of high performance storage...


OK. Well, Tux3 is general purpose and that means we care about single
spinning disk and small systems.


keep in mind that if you optimize only for the small systems 
you may not scale as well to the larger ones.


Tux3 is designed to scale, and it will when the time comes. I look 
forward to putting Shardmap through its billion file test in due course. 
However, right now it would be wise to stay focused on basic 
functionality suited to a workstation because volunteer devs tend to 
have those. After that, phones are a natural direction, where hard core 
ACID commit and really smooth file ops are particularly attractive.


per the ramdisk, possibly not as relevant as you may think. 
This is why it's good to test on as many different systems as 
you can. As you run into different types of performance you can 
then pick ones to keep and test all the time.


I keep being surprised how well it works for things we never tested 
before.


Single spinning disk is interesting now, but will be less 
interesting later. multiple spinning disks in an array of some 
sort is going to remain very interesting for quite a while.


The way to do md well is to integrate it into the block layer like 
FreeBSD does (GEOM) and expose a richer interface for the filesystem. 
That is how I think Tux3 should work with big iron raid. I hope to be 
able to tackle that sometime before the stars start winking out.

now, some things take a lot more work to test than others. 
Getting time on a system with a high performance, high capacity 
RAID is hard, but getting hold of an SSD from Fry's is much 
easier. If it's a budget item, ping me directly and I can donate 
one for testing (the cost of a drive is within my unallocated 
budget and using that to improve Linux is worthwhile)


Thanks.

As I'm reading Dave's comments, he isn't attacking you the way 
you seem to think he is. He is pointing out that there are 
problems with your data, but he's also taking a lot of time to 
explain what's happening (and yes, some of this is probably 
because your simple tests with XFS made it look so bad)


I hope the lightening up trend is a trend.

the other filesystems don't use naive algorithms, they use 
something more complex, and while your current numbers are 
interesting, they are only preliminary until you add something 
to handle fragmentation. That can cause very significant 
problems.


Fsync is pretty much agnostic to fragmentation, so those results are 
unlikely to change substantially even if we happen to do a lousy job on 
allocation policy, which I naturally consider unlikely. In fact, Tux3 
fsync is going to get faster over time for a couple of reasons: the 
minimum blocks per commit will be reduced, and we will get rid of most 
of the seeks to beginning of volume that we currently suffer per commit.


Remember how fabulous btrfs looked in the initial 
reports? and then corner cases were found that caused real 
problems and as the algorithms have been changed to prevent 
those corner cases from being so easy to hit, the common case 
has suffered somewhat. This isn't an attack on Tux2 or btrfs, 
it's just a reality of programming. If you are not accounting 
for all the corner cases, everything is easier, and faster.



Mine is a lame i5 minitower with 4GB from Fry's. Yours is clearly way
more substantial, so I can't compare my numbers directly to yours.


If you are doing tests with a 4G ramdisk on a machine with only 
4G of RAM, it seems like you end up testing a lot more than just 
the filesystem. Testing in such low memory situations can 
identify significant issues, but it is questionable as a 'which 
filesystem is better' benchmark.


A 1.3 GB tmpfs, and sorry, it is 10 GB (the machine next to it is 4G). 
I am careful to ensure the test environment does not have spurious 
memory or cpu hogs. I will not claim that this is the most sterile test 
environment possible, but it is adequate for the task at hand. Nearly 
always, when I find big variations in the test numbers it turns out to 
be a quirk of one filesystem that is not exhibited by the others. 
Everything gets multiple runs and lands in a spreadsheet. Any fishy 
variance is investigated.


By the way, the low variance kings by far are Ext4 and Tux3, and of 
those two, guess which one is more consistent. XFS is usually steady, 
but can get emotional with lots of tasks, and Btrfs has regular wild 
mood swings whenever the stars change alignment. And while I'm making 
gross generalizations: XFS and Btrfs go OOM way before Ext4 and Tux3.


Just

Re: Tux3 Report: How fast can we fsync?

2015-04-30 Thread Daniel Phillips

On Thursday, April 30, 2015 2:17:55 PM PDT, James Cloos wrote:

DP == Daniel Phillips dan...@phunq.net writes:


DP you build userspace tools from the hirofumi-user branch

In a fresh clone there is no hirofumi-user branch, only hirofumi and master:

  :; cat .git/packed-refs 
  # pack-refs with: peeled fully-peeled 
  028552773ced1c17cdbec2cda949b2ae94f55d30 refs/remotes/origin/hirofumi

  0dd55b3f5295f74c41e33e1962c79a0282603f5d refs/remotes/origin/master

-JimC


Git confuses me too. Try: git checkout hirofumi/hirofumi-user
This leaves you with a detached head, so you can do: git branch 
localname; git checkout localname.


Regards,

Daniel



Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-11 Thread Daniel Phillips
Hi David,

On 05/11/2015 05:12 PM, David Lang wrote:
 On Mon, 11 May 2015, Daniel Phillips wrote:
 
 On 05/11/2015 03:12 PM, Pavel Machek wrote:
 It is a fact of life that when you change one aspect of an intimately 
 interconnected system,
 something else will change as well. You have naive/nonexistent free space 
 management now; when you
 design something workable there it is going to impact everything else 
 you've already done. It's an
 easy bet that the impact will be negative, the only question is to what 
 degree.

 You might lose that bet. For example, suppose we do strictly linear 
 allocation
 each delta, and just leave nice big gaps between the deltas for future
 expansion. Clearly, we run at similar or identical speed to the current 
 naive
 strategy until we must start filling in the gaps, and at that point our 
 layout
 is not any worse than XFS, which started bad and stayed that way.

 Umm, are you sure? If some areas of disk are faster than others is
 still true on today's harddrives, the gaps will decrease the
 performance (as you'll use up the fast areas more quickly).

 That's why I hedged my claim with similar or identical. The
 difference in media speed seems to be a relatively small effect
 compared to extra seeks. It seems that XFS puts big spaces between
 new directories, and suffers a lot of extra seeks because of it.
 I propose to batch new directories together initially, then change
 the allocation goal to a new, relatively empty area if a big batch
 of files lands on a directory in a crowded region. The big gaps
 would be on the order of delta size, so not really very big.
 
 This is an interesting idea, but what happens if the files don't arrive as a 
 big batch, but rather
 trickle in over time (think a logserver that is putting files into a bunch of 
 directories at a
 fairly modest rate per directory)

If files are trickling in then we can afford to spend a lot more time
finding nice places to tuck them in. Log server files are an especially
irksome problem for a redirect-on-write filesystem because the final
block tends to be rewritten many times and we must move it to a new
location each time, so every extent ends up as one block. Oh well. If
we just make sure to have some free space at the end of the file that
only that file can use (until everywhere else is full) then the long
term result will be slightly ravelled blocks that nonetheless tend to
be on the same track or flash block as their logically contiguous
neighbours. There will be just zero or one empty data blocks mixed
into the file tail as we commit the tail block over and over with the
same allocation goal. Sometimes there will be a block or two of
metadata as well, which will eventually bake themselves into the
middle of contiguous data and stop moving around.

Putting this together, we have:

  * At delta flush, break out all the log type files
  * Dedicate some block groups to append type files
  * Leave lots of space between files in those block groups
  * Peek at the last block of the file to set the allocation goal

Something like that. What we don't want is to throw those files into
the middle of a lot of rewrite-all files, messing up both kinds of file.
We don't care much about keeping these files near the parent directory
because one big seek per log file in a grep is acceptable, we just need
to avoid thousands of big seeks within the file, and not dribble single
blocks all over the disk.
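
In pseudo-C, the goal policy sketched above comes out something like
this, with every name invented for illustration:

/* Toy allocation goal policy for append-mostly (log type) files. */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t block_t;

struct toy_inode {
	bool append_type;	/* classified at delta flush time */
	block_t last_block;	/* physical location of the current tail */
	block_t dir_goal;	/* goal near the parent directory */
};

/* Pick where the next extent for this file should go. */
static block_t allocation_goal(struct toy_inode *inode)
{
	if (inode->append_type && inode->last_block)
		/* Rewrite the tail into the slack reserved just after it,
		 * so the file stays on the same track or flash block. */
		return inode->last_block + 1;

	/* Rewrite-all files stay near their parent directory. */
	return inode->dir_goal;
}

int main(void)
{
	struct toy_inode logfile = { .append_type = true, .last_block = 1000 };
	struct toy_inode config = { .dir_goal = 500 };
	return !(allocation_goal(&logfile) == 1001 &&
		 allocation_goal(&config) == 500);
}

The reserved slack after each tail is what lets the rewritten block land
beside its logical neighbours instead of wandering off across the volume.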

It would also be nice to merge together extents somehow as the final
block is rewritten. One idea is to retain the final block dirty until
the next delta, and write it again into a contiguous position, so the
final block is always flushed twice. We already have the opportunistic
merge logic, but the redirty behavior and making sure it only happens
to log files would be a bit fiddly.

We will also play the incremental defragmentation card at some point,
but first we should try hard to control fragmentation in the first
place. Tux3 is well suited to online defragmentation because the delta
commit model makes it easy to move things around efficiently and safely,
but it does generate extra IO, so as a basic mechanism it is not ideal.
When we get to piling on features, that will be high on the list,
because it is relatively easy, and having that fallback gives a certain
sense of security.

 And when you then decide that you have to move the directory/file info, 
 doesn't that create a
 potentially large amount of unexpected IO that could end up interfering with 
 what the user is trying
 to do?

Right, we don't like that and don't plan to rely on it. What we hope
for is behavior that, when you slowly stir the pot, tends to improve the
layout just as often as it degrades it. It may indeed become harder to
find ideal places to put things as time goes by, but we also gain more
information to base decisions on.

Regards,

Daniel


Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-12 Thread Daniel Phillips
On Monday, May 11, 2015 10:38:42 PM PDT, Dave Chinner wrote:
 I think Ted and I are on the same page here. Competitive
 benchmarks only matter to the people who are trying to sell
 something. You're trying to sell Tux3, but

By "same page", do you mean transparently obvious about
obstructing other projects?

 The "except page forking design" statement is your biggest hurdle
 for getting tux3 merged, not performance.

No, the "except page forking design" is there because the design is
already good and effective. The small adjustments needed in core
are well worth merging because the benefits are proved by benchmarks.
So benchmarks are key and will not stop just because you don't like
the attention they bring to XFS issues.

 Without page forking, tux3
 cannot be merged at all. But it's not filesystem developers you need
 to convince about the merits of the page forking design and
 implementation - it's the mm and core kernel developers that need to
 review and accept that code *before* we can consider merging tux3.

Please do not say "we" when you know that I am just as much a "we"
as you are. Merging Tux3 is not your decision. The people whose
decision it actually is are perfectly capable of recognizing your
agenda for what it is.

   http://www.phoronix.com/scan.php?page=news_item&px=MTA0NzM
   XFS Developer Takes Shots At Btrfs, EXT4

The real question is, has the Linux development process become
so political and toxic that worthwhile projects fail to benefit
from supposed grassroots community support. You are the poster
child for that.

 IOWs, you need to focus on the important things needed to achieve
 your stated goal of getting tux3 merged. New filesystems should be
 faster than those based on 20-25 year old designs, so you don't need
 to waste time trying to convince people that tux3, when complete,
 will be fast.

You know that Tux3 is already fast. Not just that of course. It
has a higher standard of data integrity than your metadata-only
journalling filesystem and a small enough code base that it can
be reasonably expected to reach the quality expected of an
enterprise class filesystem, quite possibly before XFS gets
there.

Regards,

Daniel



Re: Tux3 Report: How fast can we fsync?

2015-05-12 Thread Daniel Phillips
 lower, like 28%. It has its own definition of disk full in its own
utility, but that does not seem to be very accurate either. This part of
Btrfs needs major work. Even at this early stage, Tux3 is much better
than that.

One thing we can all rejoice over: nobody ever hit out of space while
trying to commit. At least, nobody ever admitted it. And nobody oopsed,
or asserted, though XFS did exhibit some denial of service issues where
the filesystem was unusable for tens of seconds.

Once again, in the full disclosure department: there are some known
holes remaining in Tux3's out of space handling. The unify suspend
algorithm is not yet implemented, without which we cannot guarantee
that out of space will never happen in commit. With the simple expedient
of a 100 block emergency reserve, it has never yet happened, but no
doubt some as yet untested load can make it happen. ENOSPC handling for
mmap is not yet implemented. Cost estimates for namespace operations
are too crude and ignore btree depth. Cost estimates could be tighter
than they are, to give better performance and report disk full more
promptly. The emergency reserve should be set each delta according to
delta budget. Big truncates need to be split over multiple commits
so they always free more blocks than they consume before commit. That
is about it. On the whole, I am really happy with the way this
has worked out.
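
For anyone curious about the shape of the frontend accounting, here is a
schematic model of charging worst-case cost estimates against free space
less an emergency reserve. The names and numbers are placeholders, not
our actual code:

/* enospctoy.c: schematic model of per-delta out of space accounting.
 * build: c99 -Wall enospctoy.c -oenospctoy */
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

enum { emergency_reserve = 100 };	/* blocks held back for commit */

struct toy_sb {
	int64_t free_blocks;		/* current free count */
	int64_t delta_charged;		/* worst case cost of this delta */
};

/* Front end: refuse an operation up front rather than discovering
 * out of space at commit time. */
static int charge(struct toy_sb *sb, int64_t cost)
{
	if (sb->delta_charged + cost > sb->free_blocks - emergency_reserve)
		return -ENOSPC;
	sb->delta_charged += cost;
	return 0;
}

/* At delta commit, replace the estimate with what was really used. */
static void commit_delta(struct toy_sb *sb, int64_t actually_used)
{
	sb->free_blocks -= actually_used;
	sb->delta_charged = 0;
}

int main(void)
{
	struct toy_sb sb = { .free_blocks = 150 };
	printf("charge 20: %d\n", charge(&sb, 20));	/* fits in the budget */
	printf("charge 40: %d\n", charge(&sb, 40));	/* -ENOSPC: only 30 left */
	commit_delta(&sb, 15);
	return 0;
}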

Well, that is that for today. Tux3 now has decent out of space handling
that appears to work well and has a good strong theoretical basis. It
needs more work, but is no longer a reason to block Tux3 from merging,
if it ever really was.

Regards,

Daniel

[1] Overhead of an uncontended bus locked add is about 6 nanoseconds on
my i5, and about ten times higher when contended.

/*
 * Blurt v0.0
 *
 * A trivial multitasking filesystem load generator
 *
 * Daniel Phillips, June 2015
 *
 * to build: c99 -Wall blurt.c -oblurt
 * to run: blurt basename steps tasks
 */

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <sys/wait.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>

enum { chunk = 1024, do_sync = 0 }; /* set do_sync = 1 for the fsync variant */

char text[chunk] = { "hello world!\n" };

int main(int argc, const char *argv[]) {
	const char *basename = argc < 2 ? "foo" : argv[1];
	char name[100];
	int steps = argc < 3 ? 1 : atoi(argv[2]);
	int tasks = argc < 4 ? 1 : atoi(argv[3]);
	int fd, status, errors = 0;

	/* Parent: fork one writer task per file, then reap them all. */
	for (int t = 0; t < tasks; t++) {
		snprintf(name, sizeof name, "%s%i", basename, t);
		if (!fork())
			goto child;
	}
	for (int t = 0; t < tasks; t++) {
		wait(&status);
		if (WIFEXITED(status) && WEXITSTATUS(status))
			errors++;
	}
	return !!errors;

child:
	/* Child: append fixed-size chunks, optionally fsyncing each one. */
	fd = creat(name, S_IRWXU);
	if (fd == -1)
		goto fail1;
	for (int i = 0; i < steps; i++) {
		int ret = write(fd, text, sizeof text);
		if (ret == -1)
			goto fail2;
		if (do_sync)
			fsync(fd);
	}
	return 0;
fail1:
	perror("create failed");
	return 1;
fail2:
	perror("write failed");
	return 1;
}



Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-13 Thread Daniel Phillips
On 05/13/2015 04:31 AM, Daniel Phillips wrote:
Let me be the first to catch that arithmetic error

 Let's say our delta size is 400MB (typical under load) and we leave
 a nice big gap of 112 MB after flushing each one. Let's say we do
 two thousand of those before deciding that we have enough information
 available to switch to some smarter strategy. We used one GB of a
 a 4TB disk, say. The media transfer rate decreased by a factor of:
 
 (1 - 2/1000) = .2%.

Ahem, no, we used 1/8th of the disk. The time/data rate increased
from unity to 1.125, for an average of 1.0625 across the region.
If we only use 1/10th of the disk instead, by not leaving gaps,
then the average time/data across the region is 1.05. The
difference is 1.0625 - 1.05 = .0125, so the gap strategy increases media
transfer time by 1.25%, which is not significant compared to the
performance deficit in question of 400%. So, same argument:
change in media transfer rate is just a distraction from the
original question.

In any case, we probably want to start using a smarter strategy
sooner than 1000 commits, maybe after ten or a hundred commits,
which would make the change in media transfer rate even less
relevant.

The thing is, when data first starts landing on media, we do not
have much information about what the long term load will be. So
just analyze the clues we have in the early commits and put those
early deltas onto disk in the most efficient format, which for
Tux3 seems to be linear per delta. There would be exceptions, but
that is the common case.

Then get smarter later. The intent is to get the best of both:
early efficiency, and long term nice aging behavior. I do not
accept the proposition that one must be sacrificed for the
other, I find that reasoning faulty.

 The performance deficit in question and the difference in media rate are
 three orders of magnitude apart; does that justify the term similar or
 identical?

Regards,

Daniel



Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-13 Thread Daniel Phillips
On 05/13/2015 06:08 AM, Mike Galbraith wrote:
 On Wed, 2015-05-13 at 04:31 -0700, Daniel Phillips wrote:
 Third possibility: build from our repository, as Mike did.
 
 Sorry about that folks.  I've lost all interest, it won't happen again.

Thanks for your valuable contribution. Now we are seeing a steady
stream of people heading to the repository, after you showed
it could be done.

Regards,

Daniel



Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-12 Thread Daniel Phillips
On 05/12/2015 11:39 AM, David Lang wrote:
 On Mon, 11 May 2015, Daniel Phillips wrote:
 ...it's the mm and core kernel developers that need to
 review and accept that code *before* we can consider merging tux3.

 Please do not say "we" when you know that I am just as much a "we"
 as you are. Merging Tux3 is not your decision. The people whose
 decision it actually is are perfectly capable of recognizing your
 agenda for what it is.

   http://www.phoronix.com/scan.php?page=news_item&px=MTA0NzM
   XFS Developer Takes Shots At Btrfs, EXT4
 
 umm, Phoronix has no input on what gets merged into the kernel. they also have 
 a reputation for
 trying to turn anything into click-bait by making it sound like a fight when 
 it isn't.

Perhaps you misunderstood. Linus decides what gets merged. Andrew
decides. Greg decides. Dave Chinner does not decide, he just does
his level best to create the impression that our project is unfit
to merge. Any chance there might be an agenda?

Phoronix published a headline that identifies Dave Chinner as
someone who takes shots at other projects. Seems pretty much on
the money to me, and it ought to be obvious why he does it.

 The real question is, has the Linux development process become
 so political and toxic that worthwhile projects fail to benefit
 from supposed grassroots community support. You are the poster
 child for that.
 
 The linux development process is making code available, responding to 
 concerns from the experts in
 the community, and letting the code talk for itself.

Nice idea, but it isn't working. Did you let the code talk to you?
Right, you let the code talk to Dave Chinner, then you listen to
what Dave Chinner has to say about it. Any chance that there might
be some creative licence acting somewhere in that chain?

 There have been many people pushing code for inclusion that has not gotten
 into the kernel, or has not been used by any distros after it's made it into
 the kernel, in spite of benchmarks being posted that seem to show how
 wonderful the new code is. ReiserFS was one of the first, and part of what
 tarnished its reputation with many people was how much they were pushing the
 benchmarks that were shown to be faulty (the one I remember most vividly was
 that the entire benchmark completed in 30 seconds, and they had the FS tuned
 to not start flushing data to disk for 30 seconds, so the entire 'benchmark'
 ran out of ram without ever touching the disk)

You know what to do about checking for faulty benchmarks.

 So when Ted and Dave point out problems with the benchmark (the difference
 in behavior between a single spinning disk, different partitions on the same
 disk, SSDs, and ramdisks), you would be better off acknowledging them and if
 you can't adjust and re-run the benchmarks, don't start attacking them as a
 result.

Ted and Dave failed to point out any actual problem with any
benchmark. They invented issues with benchmarks and promoted those
as FUD.

 As Dave says above, it's not the other filesystem people you have to
 convince, it's the core VFS and Memory Management folks you have to convince.
 You may need a little benchmarking to show that there is a real advantage to
 be gained, but the real discussion is going to be on the impact that page
 forking is going to have on everything else (both in complexity and in
 performance impact to other things)

Yet he clearly wrote we as if he believes he is part of it.

Now that ENOSPC is done to a standard way beyond what Btrfs had
when it was merged, the next item on the agenda is writeback. That
involves us and VFS people as you say, and not Dave Chinner, who
only intends to obstruct the process as much as he possibly can. He
should get back to work on his own project. Nobody will miss his
posts if he doesn't make them. They contribute nothing of value,
create a lot of bad blood, and just serve to further besmirch the
famously tarnished reputation of LKML.

 You know that Tux3 is already fast. Not just that of course. It
 has a higher standard of data integrity than your metadata-only
 journalling filesystem and a small enough code base that it can
 be reasonably expected to reach the quality expected of an
 enterprise class filesystem, quite possibly before XFS gets
 there.
 
 We wouldn't expect anyone developing a new filesystem to believe any 
 differently.

It is not a matter of belief, it is a matter of testable fact. For
example, you can count the lines. You can run the same benchmarks.

Proving the data consistency claims would be a little harder, you
need tools for that, and some of those aren't built yet. Or, if you
have technical ability, you can read the code and the copious design
material that has been posted and convince yourself that, yes, there
is something cool here, why didn't anybody do it that way before?
But of course that starts to sound like work. Debating nontechnical
issues and playing politics seems so much more like fun.

 If they didn't
 believe

Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-12 Thread Daniel Phillips
On 05/12/2015 03:35 PM, David Lang wrote:
 On Tue, 12 May 2015, Daniel Phillips wrote:
 On 05/12/2015 02:30 PM, David Lang wrote:
 You need to get out of the mindset that Ted and Dave are Enemies that you 
 need to overcome, they are
 friendly competitors, not Enemies.

 You are wrong about Dave. These are not the words of any friend:

   I don't think I'm alone in my suspicion that there was something
   stinky about your numbers. -- Dave Chinner
 
 you are looking for offense. That just means that something is wrong with
 them, not that they were deliberately falsified.

I am not mistaken. Dave made sure to eliminate any doubt about
what he meant. He said "Oh, so nicely contrived. But terribly
obvious now that I've found it", among other things.

Good work, Dave. Never mind that we did not hide it.

Let's look at some more of the story. Hirofumi ran the test and
I posted the results and explained the significance. I did not
even know that dbench had fsyncs at that time, since I had never
used it myself, nor that Hirofumi had taken them out in order to
test the things he was interested in. Which turned out to be very
interesting, don't you agree?

Anyway, Hirofumi followed up with a clear explanation, here:

   http://phunq.net/pipermail/tux3/2013-May/002022.html

Instead of accepting that, Dave chose to ride right over it and
carry on with his thinly veiled allegations of intellectual fraud,
using such words as "it's deceptive at best". Dave managed to
insult two people that day.

Dave dismissed the basic breakthrough we had made as silly
marketing fluff. By now I hope you understand that the result in
question was anything but silly marketing fluff. There are real,
technical reasons that Tux3 wins benchmarks, and the specific
detail that Dave attacked so ungraciously is one of them.

Are you beginning to see who the victim of this mugging was?

 Basically allegations of cheating. And wrong. Maybe Dave just
 lives in his own dreamworld where everybody is out to get him, so
 he has to attack people he views as competitors first.
 
 you are the one doing the attacking.

Defending, not attacking. There is a distinction.

 Please stop. Take a break if needed, and then get back to
 producing software rather than complaining about how everyone is out to get 
 you.

Dave is not everyone, and a shut up will not fix this.

What will fix this is a simple, professional statement that
an error was made, that there was no fraud or anything even
remotely resembling it, and that instead a technical
contribution was made. It is not even important that it come
from Dave. But it is important that the aspersions that were
cast be recognized for what they were.

By the way, do you remember the scene from Unforgiven where
the sheriff is kicking the guy on the ground and saying "I'm
not kicking you"? It feels like that.

As far as who should take a break goes, note that either of
us can stop the thread. Does it necessarily have to be me?

If you would prefer some light reading, you could read How fast
can we fail?, which I believe is relevant to the question of
whether Tux3 is mergeable or not.

   https://lkml.org/lkml/2015/5/12/663

Regards,

Daniel


___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: Tux3 Report: How fast can we fail?

2015-05-13 Thread Daniel Phillips
Addendum to that post...

On 05/12/2015 10:46 AM, I wrote:
 ...For example, we currently
 overestimate the cost of a rewrite because we would need to go poking
 around in btrees to do that more accurately. Fixing that will be quite
 a bit of work...

Ah no, I was wrong about that, it will not be a lot of work because
it does not need to be done.

Charging the full cost of a rewrite as if it is a new write is the
right thing to do because we need to be sure the commit can allocate
space to redirect the existing blocks before it frees the old ones.
So that means there is no need for the front end to go delving into
file metadata, ever, which is a relief because that would have been
expensive and messy.

Regards,

Daniel

___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: [FYI] tux3: Core changes

2015-05-15 Thread Daniel Phillips


On 05/15/2015 01:09 AM, Mel Gorman wrote:
 On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote:
 On 05/14/2015 08:06 PM, Daniel Phillips wrote:
 The issue is that things like ptrace, AIO, infiniband
 RDMA, and other direct memory access subsystems can take
 a reference to page A, which Tux3 clones into a new page B
 when the process writes it.

 However, while the process now points at page B, ptrace,
 AIO, infiniband, etc will still be pointing at page A.

 This causes the process and the other subsystem to each
 look at a different page, instead of at shared state,
 causing ptrace to do nothing, AIO and RDMA data to be
 invisible (or corrupted), etc...

 Is this a bit like page migration?

 Yes. Page migration will fail if there is an extra
 reference to the page that is not accounted for by
 the migration code.
 
 When I said it's not like page migration, I was referring to the fact
 that a COW on a pinned page for RDMA is a different problem to page
 migration. The COW of a pinned page can lead to lost writes or
 corruption depending on the ordering of events.

I see the lost writes case, but not the corruption case, Do you
mean corruption by changing a page already in writeout? If so,
don't all filesystems have that problem?

If RDMA to a mmapped file races with write(2) to the same file,
maybe it is reasonable and expected to lose some data.

 Page migration fails
 when there are unexpected problems to avoid this class of issue which is
 fine for page migration but may be a critical failure in a filesystem
 depending on exactly why the copy is required.

Regards,

Daniel

___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: Blurt code in Github

2015-05-13 Thread Daniel Phillips
On 05/13/2015 07:07 AM, Elifarley Callado Coelho Cruz wrote:
 Where can I see the torture test results ?

You mean, http://buildbot.tux3.org:8010/ ?

I am not as familiar with it as I should be.

 Maybe it would be good to use Wercker ( which is free - http://wercker.com/ ) 
 to make sure we always
 have binaries automatically built for every new commit.

Hirofumi already set that up. By the way, Hirofumi is an amazing sysadmin,
however he is an even more amazing developer. Sharing some of that load
would be a win.

Right now, we have an fsync patch set that needs review and torture
testing before it lands in the public repo, but I am not in any hurry.
It might possibly make sense to post it to tux3 list, or lkml even,
so at least people can see its real, or try it if they have the
interest.

One stupid issue is, ever since Kmail got busted for life, I have no
email client that I also like to use, that is capable of posting
patches reliably, without mangling them. I'm using Trojita because
it can handle lkml, but the composer is hopelessly awful. So I
compose in Thunderbird, which is also awful but at least usable.
Not sure what to do about that.

Regards,

Daniel

___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: [FYI] tux3: Core changes

2015-05-18 Thread Daniel Phillips
On 05/17/2015 07:20 PM, Rik van Riel wrote:
 On 05/17/2015 09:26 AM, Boaz Harrosh wrote:
 On 05/14/2015 03:59 PM, Rik van Riel wrote:
 The issue is that things like ptrace, AIO, infiniband
 RDMA, and other direct memory access subsystems can take
 a reference to page A, which Tux3 clones into a new page B
 when the process writes it.

 However, while the process now points at page B, ptrace,
 AIO, infiniband, etc will still be pointing at page A.

 All these problems can also happen with truncate+new-extending-write

 It is the responsibility of the application to take file/range locks
 to prevent these page-pinned problems.
 
 It is unreasonable to expect a process that is being ptraced
 (potentially without its knowledge) to take special measures
 to protect the ptraced memory from disappearing.
 
 It is impossible for the debugger to take those special measures
 for anonymous memory, or unlinked inodes.
 
 I don't think your requirement is workable or reasonable.

Hi Rik,

You are quite right to poke at this aggressively. Whether or not
there is an issue needing fixing, we want to know the details. We
really need to do a deep dive in ptrace and know exactly what it
does, and whether Tux3 creates any new kind of hole. I really know
very little about ptrace at the moment, I only have heard that it
is a horrible hack we inherited from some place far away and a time
long ago.

A little guidance from you would help. Somewhere ptrace must modify
the executable page. Unlike uprobes, which makes sense to me, I did
not find where ptrace actually does that on a quick inspection.
Perhaps you could provide a pointer?

Regards,

Daniel

___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

2015-05-12 Thread Daniel Phillips
On 05/12/2015 02:03 AM, Pavel Machek wrote:
 On Mon 2015-05-11 19:34:34, Daniel Phillips wrote:
 On 05/11/2015 04:17 PM, Theodore Ts'o wrote:
 and another way that people
 doing competitive benchmarking can screw up and produce misleading
 numbers.

 If you think we screwed up or produced misleading numbers, could you
 please be up front about it instead of making insinuations and
 continuing your tirade against benchmarking and those who do it.
 
 Aren't you a little harsh with Ted? He was polite.

Polite language does not include words like "screw up" and "misleading
numbers", those are combative words intended to undermine and disparage.
It is not clear how repeating the same words can be construed as less
polite than the original utterance.

 The ram disk removes seek overhead and greatly reduces media transfer
 overhead. This does not change things much: it confirms that Tux3 is
 significantly faster than the others at synchronous loads. This is
 apparently true independently of media type, though to be sure SSD
 remains to be tested.

 The really interesting result is how much difference there is between
 filesystems, even on a ram disk. Is it just CPU or is it synchronization
 strategy and lock contention? Does our asynchronous front/back design
 actually help a lot, instead of being a disadvantage as you predicted?

 It is too bad that fs_mark caps number of tasks at 64, because I am
 sure that some embarrassing behavior would emerge at high task counts,
 as with my tests on spinning disk.
 
 I'd call a system with 65 tasks doing heavy fsync load at the same time
 embarrassingly misconfigured :-). It is nice if your filesystem can
 stay fast in that case, but...

Well, Tux3 wins the fsync race now whether it is 1 task, 64 tasks or
10,000 tasks. At the high end, maybe it is just a curiosity, or maybe
it tells us something about how Tux3 will scale on the big machines
that XFS currently lays claim to. And Java programmers are busy doing
all kinds of wild and crazy things with lots of tasks. Java almost
makes them do it. If they need their data durable then they can easily
create loads like my test case.

Suppose you have a web server meant to serve 10,000 transactions
simultaneously and it needs to survive crashes without losing client
state. How will you do it? You could install an expensive, finicky
database, or you could write some Java code that happens to work well
because Linux has a scheduler and a filesystem that can handle it.
Oh wait, we don't have the second one yet, but maybe we soon will.

I will not claim that stupidly fast and scalable fsync is the main
reason that somebody should want Tux3, however, the lack of a high
performance fsync was in fact used as a means of spreading FUD about
Tux3, so I had some fun going way beyond the call of duty to answer
that. By the way, I am still waiting for the original source of the
FUD to concede the point politely, but maybe he is waiting for the
code to land, which it still has not as of today, so I guess that is
fair. Note that it would have landed quite some time ago if Tux3 was
already merged.

Historical note: didn't Java motivate the O(1) scheduler?

Regards,

Daniel

___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: [FYI] tux3: Core changes

2015-05-20 Thread Daniel Phillips


On 05/20/2015 07:44 AM, Jan Kara wrote:
 On Tue 19-05-15 13:33:31, David Lang wrote:
 On Tue, 19 May 2015, Daniel Phillips wrote:

 I understand that Tux3 may avoid these issues due to some other mechanisms
 it internally has but if page forking should get into mm subsystem, the
 above must work.

 It does work, and by example, it does not need a lot of code to make
 it work, but the changes are not trivial. Tux3's delta writeback model
 will not suit everyone, so you can't just lift our code and add it to
 Ext4. Using it in Ext4 would require a per-inode writeback model, which
 looks practical to me but far from a weekend project. Maybe something
 to consider for Ext5.

 It is the job of new designs like Tux3 to chase after that final drop
 of performance, not our trusty Ext4 workhorse. Though stranger things
 have happened - as I recall, Ext4 had O(n) directory operations at one
 time. Fixing that was not easy, but we did it because we had to. Fixing
 Ext4's write performance is not urgent by comparison, and the barrier
 is high, you would want jbd3 for one thing.

 I think the meta-question you are asking is, where is the second user
 for this new CoW functionality? With a possible implication that if
 there is no second user then Tux3 cannot be merged. Is that is the
 question?

 I don't think they are asking for a second user. What they are
 saying is that for this functionality to be accepted in the mm
 subsystem, these problem cases need to work reliably, not just work
 for Tux3 because of your implementation.

 So for things that you don't use, you need to make it an error if
 they get used on a page that's been forked (or not be an error and
 'do the right thing')

 For cases where it doesn't matter because Tux3 controls the
 writeback, and it's undefined in general what happens if writeback
 is triggered twice on the same page, you will need to figure out how
 to either prevent the second writeback from triggering if there's
 one in process, or define how the two writebacks are going to happen
 so that you can't end up with them re-ordered by some other
 filesystem.

 I think that that's what's meant by the top statement that I left in
 the quote. Even if your implementation details make it safe, these
 need to be safe even without your implementation details to be
 acceptable in the core kernel.
   Yeah, that's what I meant. If you create a function which manipulates
 page cache, you better make it work with other functions manipulating page
 cache. Otherwise it's a landmine waiting to be tripped by some unsuspecting
 developer. Sure you can document all the conditions under which the
 function is safe to use but a function that has several paragraphs in front
 of it explaining when it is safe to use isn't very good API...

Violent agreement, of course. To put it in concrete terms, each of
the page fork support functions must be examined and determined
sane. They are:

 * cow_replace_page_cache
 * cow_delete_from_page_cache
 * cow_clone_page
 * page_cow_one
 * page_cow_file

Would it be useful to drill down into those, starting from the top
of the list?

Regards,

Daniel

___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: Tux3 Report: How fast can we fail?

2015-05-28 Thread Daniel Phillips

On Thursday, May 28, 2015 5:55:18 AM PDT, Austin S Hemmelgarn wrote:

On 2015-05-27 18:46, Daniel Phillips wrote:


On 05/27/2015 02:39 PM, Pavel Machek wrote:

On Wed 2015-05-27 11:28:50, Daniel Phillips wrote: ...


I mentioned earlier, it seems to work pretty well in Tux3. But do user
applications really expect it to work? I do not know of any, perhaps
you do.
I don't know of any applications that do, although I do know of 
quite a few users who would expect it to work (myself included). 
 This kind of thing could (depending on how the system in 
question is configured) potentially be critical for recovering 
from such a situation.


Hi Austin,

Not doubting you, but how would overwriting files help you recover
from disk full?

Regards,

Daniel

___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: Tux3 Report: How fast can we fail?

2015-05-28 Thread Daniel Phillips

On Thursday, May 28, 2015 4:19:29 PM PDT, Andreas Karlsson wrote:

On 05/28/2015 07:27 PM, Daniel Phillips wrote:

Not doubting you, but how would overwriting files help you recover
from disk full?


One benefit I see is that monitoring tools based on round-robin 
databases like Munin and MRTG can continue writing monitoring 
data even when the disk is full.


OK, I'll buy that. In general, it's better if as much stuff works
as possible in a disk full state.

Another quality metric is, it should work or fail predictably, not
erratically, sometimes succeeding and sometimes failing with no
reason that makes sense to the user. Tux3 seems to satisfy that
now, given some slight improvements that aren't posted yet. But of
course it needs wider testing, another reason for merging sooner
rather than later.

Regards,

Daniel

___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: Tux3 Report: How fast can we fail?

2015-05-27 Thread Daniel Phillips


On 05/27/2015 02:39 PM, Pavel Machek wrote:
 On Wed 2015-05-27 11:28:50, Daniel Phillips wrote:
 On Tuesday, May 26, 2015 11:41:39 PM PDT, Mosis Tembo wrote:
 On Tue, May 26, 2015 at 6:03 PM, Pavel Machek pa...@ucw.cz wrote:


 We identified the following quality metrics for this algorithm:

 1) Never fails to detect out of space in the front end.
 2) Always fills a volume to 100% before reporting out of space.
 3) Allows rm, rmdir and truncate even when a volume is full.

 This is definitely nonsense. You can not rm, rmdir and truncate
 when the volume is full. You will need a free space on disk to perform
 such operations. Do you know why?

 Because some extra space needs to be on the volume in order to do the
 atomic commit. Specifically, there must be enough extra space to keep
 both old and new copies of any changed metadata, plus enough space for
 new data or metadata. You are almost right: we can't support rm, rmdir
 or truncate _with atomic commit_ unless some space is available on the
 volume. So we keep a small reserve to handle those operations, which
 only those operations can access. We define the volume as full when
 only the reserve remains. The reserve is not included in available
 blocks reported to statfs, so the volume appears to be 100% full when
 only the reserve remains.

 For Tux3, that reserve is variable - about 1% of free space, declining
 to a minimum of 10 blocks as free space runs out. Eventually, we will
 reduce the minimum a bit as we develop finer control over how free
 space is used in very low space conditions, but 10 blocks is not bad
 at all. With no journal and only 10 blocks of unusable space, we do
 pretty well with tiny volumes.
 
 Yeah. Filesystem that could not do rm on full filesystem would be
 braindead.
 
 Now, what about
 
 1) writing to already-allocated space in existing files?

I mentioned earlier, it seems to work pretty well in Tux3. But do user
applications really expect it to work? I do not know of any, perhaps
you do.

Incidentally, I have been torture testing this very property using a
32K filesystem consisting of 64 x 512 byte blocks, with repeated dd,
mknod, rm, etc. Just to show that we are serious about getting this
part right.

 2) writing to already-allocated space in existing files using mmap?

Not part of the preliminary nospace patch, but planned. I intend to
work on that detail after merge.

The problem is almost the same as write(2) in that the reserve must be
large enough to accommodate both old and new versions of all data
blocks, otherwise we lose our ACID, which we will go to great lengths
to avoid losing. The thing that makes this work nicely is the way the
delta shrinks as freespace runs out, which is the central point of our
new nospace algorithm.
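
To make the reserve sizing described above concrete (roughly 1% of free
space, declining to a floor of 10 blocks), here is a toy userspace sketch.
It only illustrates the arithmetic quoted earlier in this thread; it is not
the actual Tux3 kernel code, and the function name is made up:

#include <stdint.h>
#include <stdio.h>

/* Toy model of the reserve sizing described above, not Tux3 kernel code:
 * about 1% of free blocks, never less than 10 blocks. The reserve is not
 * reported to statfs, so the volume looks 100% full when only the
 * reserve remains. */
static uint64_t commit_reserve(uint64_t free_blocks)
{
        uint64_t reserve = free_blocks / 100;   /* ~1% of free space */
        if (reserve < 10)
                reserve = 10;                   /* floor of 10 blocks */
        return reserve;
}

int main(void)
{
        uint64_t samples[] = { 1000000, 5000, 64 };
        for (int i = 0; i < 3; i++)
                printf("free %llu -> reserve %llu\n",
                       (unsigned long long)samples[i],
                       (unsigned long long)commit_reserve(samples[i]));
        return 0;
}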

Regards,

Daniel

___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: [FYI] tux3: Core changes

2015-05-27 Thread Daniel Phillips


On 05/27/2015 02:37 PM, Pavel Machek wrote:
 On Wed 2015-05-27 11:09:25, Daniel Phillips wrote:
 On Wednesday, May 27, 2015 12:41:37 AM PDT, Pavel Machek wrote:
 On Fri 2015-05-15 02:38:33, Daniel Phillips wrote:
 On 05/14/2015 08:06 PM, Rik van Riel wrote: ...

 Umm. Why do you think it is only issue for executable files?

 I meant: files with code in them, that will be executed. Please excuse
 me for colliding with the chmod sense. I will say code files to avoid
 ambiguity.

 I'm free to mmap() any file, and then execute from it.

 /lib/ld-linux.so /path/to/binary

 is known way to exec programs that do not have x bit set.

 So... why would I write to a code file at the same time as stepping
 through it with ptrace? Should I expect ptrace to work perfectly if
 I do that? What would work perfectly mean, if the code is changing
 at the same time as being traced?
 
 Do you have any imagination at all?

[Non-collegial rhetoric alert, it would be helpful to avoid that.]

 Reasons I should expect ptrace to work perfectly if I'm writing to
 file:
 
 1) it used to work before
 
 2) it used to work before
 
 3) it used to work before and regressions are not allowed

Are you sure that ptrace will work perfectly on a file that you are
writing to at the same time as tracing? If so, it has magic that I
do not understand. Could you please explain.

 4) some kind of just in time compiler

A JIT that can tolerate being written to by a task it knows nothing
about, at the same time as it is generating code in the file? I do
not know of any such JIT.

 5) some kind of malware, playing tricks so that you have trouble
 analyzing it

By writing to a code file? Then it already has write access to the
code file, so it has already gotten inside your security perimeter
without needing help from page fork. That said, we should be alert
for any new holes that page fork might open. But if there are any,
they should be actual holes, not theoretical ones.

 and of course,
 
 6) it used to work before.

I look forward to your explanation of how.

Regards,

Daniel


___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: [FYI] tux3: Core changes

2015-05-26 Thread Daniel Phillips

On Tuesday, May 26, 2015 3:13:02 AM PDT, Pavel Machek wrote:

On Tue 2015-05-26 01:09:59, Daniel Phillips wrote:

On Monday, May 25, 2015 11:13:46 PM PDT, David Lang wrote:

I'm assuming that Rik is talking about whatever has the reference to the
page via one of the methods that he talked about.


This would be a good moment to provide specifics.


Hmm. This seems like a good moment for you to audit whole kernel, to
make sure it does not do stuff you don't expect it to.

You are changing core semantics, stuff that was allowed before is not
allowed now, so it looks like you should do the auditing...


No, we are not changing core semantics. Core semantics remain
unchanged. All these questions are about Tux3 semantics. So I
believe that I am correct in asking for specifics, rather than
general theory. Auditing the whole kernel seems an unreasonable
barrier to merging a filesystem.


You may want to start with video4linux, as Jan pointed out.


I pointed out that the video case Jan mentioned is a race. Is
that right or wrong? And what responsibility does Tux3 have,
if the use case is a race? I would think: do not oops, do not
corrupt the filesystem. Anything else?

Regards,

Daniel


___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: Tux3 Report: How fast can we fail?

2015-05-26 Thread Daniel Phillips

On Tuesday, May 26, 2015 3:03:26 AM PDT, Pavel Machek wrote:

We identified the following quality metrics for this algorithm:

 1) Never fails to detect out of space in the front end.
 2) Always fills a volume to 100% before reporting out of space.
 3) Allows rm, rmdir and truncate even when a volume is full.


Hmm. Can you also overwrite existing data in files when a volume is
full? I guess applications expect that to work..


That does seem to work pretty well, for example, repeated dd of
a size greater than the volume works. But I am not sure that
applications really expect that to work. I do not know of any
application like that myself.

Regards,

Daniel


___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: [FYI] tux3: Core changes

2015-05-26 Thread Daniel Phillips
On 05/26/2015 02:00 AM, Jan Kara wrote:
 On Tue 26-05-15 01:08:56, Daniel Phillips wrote:
 On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote:
  E.g. video drivers (or infiniband or direct IO for that matter) which
 have buffers in user memory (may be mmapped file), grab references to pages
 and hand out PFNs of those pages to the hardware to store data in them...
 If you fork a page after the driver has handed PFNs to the hardware, you've
 just lost all the writes hardware will do.

 Hi Jan,

 The page forked because somebody wrote to it with write(2) or mmap write at
 the same time as a video driver (or infiniband or direct IO) was doing io to
 it. Isn't the application trying hard to lose data in that case? It
 would not need page fork to lose data that way.
 
 So I can think of two valid uses:
 
 1) You setup IO to part of a page and modify from userspace a different
part of a page.

Suppose the use case is reading textures from video memory into a mmapped
file, and at the same time, the application is allowed to update the
textures in the file via mmap or write(2). Fork happens at mkwrite time.
If the page is already dirty, we do not fork it. The video API must have
made the page writable and dirty, so I do not see an issue.

 2) At least for video drivers there is one ioctl() which creates object
with buffers in memory and another ioctl() to actually ship it to hardware
(may be called repeatedly). So in theory app could validly dirty the pages
before it ships them to hardware. If this happens repeatedly and interacts
badly with background writeback, you will end up with a forked page in a
buffer and from that point on things are broken.

Writeback does not fork pages. An app may dirty a page that is in process
of being shipped to hardware (must be a distinct part of the page, or it is
a race) and the data being sent to hardware will not be disturbed. If there
is an issue here, I do not see it.
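
To make the mkwrite rule above concrete (fork only when the page is not
already dirty in the current delta), here is a self-contained toy model in
ordinary userspace C. It is a sketch of the decision only, not the Tux3
kernel code, and every name in it is made up for illustration:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy model of the page fork decision, not Tux3 kernel code. */
struct toy_page {
        bool dirty;            /* dirty in the current delta? */
        char data[4096];
};

/* Fork (copy) a clean page before its first modification in a delta. */
static struct toy_page *fork_page(struct toy_page *page)
{
        struct toy_page *copy = malloc(sizeof(*copy));
        if (!copy)
                return NULL;
        memcpy(copy, page, sizeof(*copy));
        copy->dirty = true;
        return copy;
}

/* mkwrite path: an already-dirty page is written in place, a clean one
 * is forked so the version in the previous delta stays stable. */
static struct toy_page *page_mkwrite(struct toy_page *page)
{
        return page->dirty ? page : fork_page(page);
}

int main(void)
{
        struct toy_page clean = { .dirty = false };
        struct toy_page dirty = { .dirty = true };
        struct toy_page *a = page_mkwrite(&clean);
        struct toy_page *b = page_mkwrite(&dirty);
        printf("clean page forked: %s\n", a != &clean ? "yes" : "no");
        printf("dirty page forked: %s\n", b != &dirty ? "yes" : "no");
        if (a && a != &clean)
                free(a);
        return 0;
}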

 So my opinion is: Don't fork the page if page_count is elevated. You can
 just wait for the IO if you need stable pages in that case. It's slow but
 it's safe and it should be pretty rare. Is there any problem with that?

That would be our fallback if anybody discovers a specific case where page
fork breaks something, which so far has not been demonstrated.

With a known fallback, it is hard to see why we should delay merging over
that. Perfection has never been a requirement for merging filesystems. On
the contrary, imperfection is a reason for merging, so that the many
eyeballs effect may prove its value.

Regards,

Daniel

___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: [FYI] tux3: Core changes

2015-05-26 Thread Daniel Phillips
Hi Sergey,

On 05/26/2015 03:22 AM, Sergey Senozhatsky wrote:
 
 Hello,
 
 is it possible to page-fork-bomb the system by some 'malicious' app?

Not in any new way. A page fork can happen either in the front end,
where it has to wait for memory like any other normal memory user,
or in the backend, where Tux3 may have privileged access to low
memory reserves and therefore must place bounds on its memory use
like any other user of low memory reserves.

This is not specific to page fork. We must place such bounds for
any memory that the backend uses. Fortunately, the backend does not
allocate memory extravagantly, for fork or anything else, so when
this does get to the top of our to-do list it should not be too
hard to deal with. We plan to attack that after merge, as we have
never observed a problem in practice. Rather, Tux3 already seems
to survive low memory situations pretty well compared to some other
filesystems.


Regards,

Daniel

___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: [FYI] tux3: Core changes

2015-07-31 Thread Daniel Phillips

On Friday, July 31, 2015 5:00:43 PM PDT, Daniel Phillips wrote:

Note: Hirofumi's email is clear, logical and speaks to the
question. This branch of the thread is largely pointless, though
it essentially says the same thing in non-technical terms. Perhaps
your next response should be to Hirofumi, and perhaps it should be
technical.


Now, let me try to lead the way, but being specific. RDMA was raised
as a potential failure case for Tux3 page forking. But the RDMA api
does not let you use memory mmaped by Tux3 as a source or destination
of IO. Instead, it sets up its own pages and hands them out to the
RDMA app from a pool. So no issue. One down, right?

Regards,

Daniel

___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: [FYI] tux3: Core changes

2015-07-31 Thread Daniel Phillips

On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote:

If you define this as losing our mojo, then yes we have.


A pity. There remains so much to do that simply will not get
done in the absence of mojo.

Regards,

Daniel


___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: [FYI] tux3: Core changes

2015-07-31 Thread Daniel Phillips

On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote:
We, the Linux Community have less tolerance for losing people's 
data and preventing them from operating than we used to when it 
was all tinkerer's personal data and secondary systems.


So rather than pushing optimizations out to everyone and seeing 
what breaks, we now do more testing and checking for failures 
before pushing things out.


By the way, I am curious about whose data you think will get lost
as a result of pushing out Tux3 with a possible theoretical bug
in a wildly improbable scenario that has not actually been
described with sufficient specificity to falsify, let alone
demonstrated.

Regards,

Daniel


___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: [FYI] tux3: Core changes

2015-07-31 Thread Daniel Phillips

On Friday, July 31, 2015 3:27:12 PM PDT, David Lang wrote:

On Fri, 31 Jul 2015, Daniel Phillips wrote:


On Friday, July 31, 2015 11:29:51 AM PDT, David Lang wrote: ...


you weren't asking about any particular feature of Tux, you 
were asking if we were still willing to push out stuff that 
breaks for users and fix it later.


I think you left a key word out of my ask: theoretical.

Especially for filesystems that can lose the data of whoever 
is using it, the answer seems to be a clear no.


there may be bugs in what's pushed out that we don't know 
about. But we don't push out potential data corruption bugs that 
we do know about (or think we do)


so if you think this should be pushed out with this known 
corner case that's not handled properly, you have to convince 
people that it's _so_ improbable that they shouldn't care about 
it.


There should also be an onus on the person posing the worry
to prove their case beyond a reasonable doubt, which has not been
done in the case we are discussing here. Note: that is a technical
assessment to which a technical response is appropriate.

I do think that we should put a cap on this fencing and make
a real effort to get Tux3 into mainline. We should at least
set a ground rule that a problem should be proved real before it
becomes a reason to derail a project in the way that our project
has been derailed. Otherwise, it's hard to see what interest is
served.

OK, lets get back to the program. I accept your assertion that
we should convince people that the issue is improbable. To do
that, I need a specific issue to address. So far, no such issue
has been provided with specificity. Do you see why this is
frustrating?

Please, community. Give us specific issues to address, or give us
some way out of this eternal limbo. Or better, lets go back to the
old way of doing things in Linux, which is what got us where we
are today. Not this.

Note: Hirofumi's email is clear, logical and speaks to the
question. This branch of the thread is largely pointless, though
it essentially says the same thing in non-technical terms. Perhaps
your next response should be to Hirofumi, and perhaps it should be
technical.

Regards,

Daniel

___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: Trying Tux3; What am I doing wrong?

2015-08-05 Thread Daniel Phillips
Hi Masoud,

Your md device is read-only for some reason. You need to fix that, then try
your mount. You can also make a tux3 volume on a file and mount it using
mount ... -oloop

Regards,

Daniel


On Wed, Aug 5, 2015 at 9:53 AM, Masoud Sharbiani masoud.sharbi...@gmail.com
 wrote:

 So, hirofumi branch does not have a fs/tux3/user/ directory:

 [root@localhost linux-tux3]# git branch
   hirofumi
   hirofumi-user
 * master
 [root@localhost linux-tux3]# git checkout hirofumi
 Checking out files: 100% (23904/23904), done.
 Switched to branch 'hirofumi'
 [root@localhost linux-tux3]# git log
 commit 6548a36e52cdee6bb6095f0c97f304a2e0866790
 Author: OGAWA Hirofumi hirof...@mail.parknet.co.jp
 Date:   Sat Jun 6 18:45:08 2015 +0900

 tux3: Cleanup deferred inode allocation in evict_inode() (temporary)
 
 [root@localhost linux-tux3]# ls fs/tux3/user
 ls: cannot access fs/tux3/user: No such file or directory

 So I switched to hirofumi-user, where a user/ exists; I used that to run
 mkfs, and then:

 [root@localhost tux3]# mount /dev/md127 /data
 mount: /dev/md127 is write-protected, mounting read-only
 [  956.051872] tux3_fill_super:775: s_blocksize 4096

 So effectively, I can see my empty file system, but not actually put
 files/directories there.

 Masoud



 On Tue, Aug 4, 2015 at 10:41 PM, OGAWA Hirofumi 
 hirof...@mail.parknet.co.jp wrote:

 Masoud Sharbiani masoud.sharbi...@gmail.com writes:

  Hi guys

 Hi,

  I've cloned the tree, and found the userspace tools in one of the
 branches as well.
  However after tux3 mkfs, and mounting the volume, I cannot mount the
 volume
  read/write to actually write to it.
 
  What gives?
 
  Masoud Sharbiani
 
  (Tree used was cloned from:
 https://github.com/OGAWAHirofumi/linux-tux3.git
  Branch used was: hirofumi-user)

 What error? Well, it might hit to version check. Can you try hirofumi
 branch for kernel? (master branch is old compared to tux3 mkfs,
 preparing to update now).

 Thanks.
 --
 OGAWA Hirofumi hirof...@mail.parknet.co.jp



 ___
 Tux3 mailing list
 Tux3@phunq.net
 http://phunq.net/mailman/listinfo/tux3


___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: poke?

2016-12-15 Thread Daniel Phillips
Thanks for the pokes guys. Tux3 was only resting. I am doing some work
on Shardmap at the moment, so we should be able to see some
interesting results for directory scalability from that. Hirofumi is
busy updating the tree to current upstream kernel.

Regards,

Daniel

On Tue, Dec 13, 2016 at 9:22 AM, Kyle Sanderson  wrote:
> XFS is our only hope, it looks like dedup support is coming soon with the
> new on-disk format.
>
> On 9 Dec 2016 1:19 pm, "Raymond Jennings"  wrote:
>>
>> Hey, is this project alive?  I've been a big fan ever since I heard of it
>> and we need something to beat the pants off of btrfs.
>>
>> ___
>> Tux3 mailing list
>> Tux3@phunq.net
>> http://phunq.net/mailman/listinfo/tux3
>>
>
> ___
> Tux3 mailing list
> Tux3@phunq.net
> http://phunq.net/mailman/listinfo/tux3
>

___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: Tux3 performance: FUSE vs kernel

2016-12-23 Thread Daniel Phillips

On 12/23/2016 03:19 AM, Raymond Jennings wrote:
Hey um, is there much performance difference between using tux3 on 
FUSE vs as a built-in kernel module?


Yes, a huge difference. Fuse imposes a large (huge) overhead on the 
async operations that a kernel filesystem uses, and in the case of our 
fuse implemention, completely serializes them. I haven't actually 
measured the difference, but expect performance reduction of orders of 
magnitude for many common operations. Basically, the only time fuse ever 
looks efficient is when there is some other, expensive operation going 
on, like a slow disk or a network. Put a SSD behind any fuse filesystem 
and the overhead will be easily visible. Not slagging fuse here, that's 
just the nature of the problem.


Regards,

Daniel

___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: online resize

2017-01-15 Thread Daniel Phillips

On 01/12/2017 06:54 AM, Raymond Jennings wrote:

Does/will tux3 support online resize, both shrink and grow?


Not yet, but it is straightforward.

Regards,

Daniel

___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: Backgrounded deletion, dirty writeback, security, and you

2017-03-24 Thread Daniel Phillips

Hi Raymond,

On 03/23/2017 01:01 PM, Raymond Jennings wrote:

Having recently tangled with this issue on ext4, I had a few questions:

1.  Does the backend that actually digests an unlink request know how 
to handle outstanding dirty blocks belonging to a now nonexisting file?


If a file has been deleted, has no outstanding descriptors referencing 
it, then in theory none of the data in it will be useful anymore.  
That probably includes any dirty data in page cache or whatever that 
was due to be written back to the disk.


Is tux3 intelligent enough to seek out and void any writeback against 
unlinked files and spare the disk from ostensibly moot writeback?


The Tux3 backend does know how to handle outstanding dirty blocks of 
deleted files, and discards them without writing to media. For orphans, 
that is, unlinked/deleted files that are still open, dirty data in the 
most recent delta is discarded when an orphan is deleted, without being 
written to media. Your assessment of the value of this is correct, and 
you could call Tux3 intelligent in that regard.

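To make that behavior concrete, here is a toy userspace model of the
decision: an inode that is both unlinked and no longer open contributes
nothing to the commit, so its dirty blocks are dropped rather than written.
This is only a sketch of the idea described above, not the Tux3 backend
code, and the structure and field names are invented for illustration:

#include <stdbool.h>
#include <stdio.h>

/* Toy model of the orphan rule described above, not Tux3 kernel code. */
struct toy_inode {
        int nlink;         /* directory links remaining */
        int openers;       /* open file descriptors */
        int dirty_blocks;  /* blocks dirtied in the current delta */
};

/* Backend decision at commit time: a deleted orphan's dirty blocks are
 * discarded instead of being queued for writeback to media. */
static int blocks_to_write(const struct toy_inode *inode)
{
        bool deleted_orphan = inode->nlink == 0 && inode->openers == 0;
        return deleted_orphan ? 0 : inode->dirty_blocks;
}

int main(void)
{
        struct toy_inode live   = { .nlink = 1, .openers = 0, .dirty_blocks = 8 };
        struct toy_inode orphan = { .nlink = 0, .openers = 0, .dirty_blocks = 8 };
        printf("live file: write %d blocks\n", blocks_to_write(&live));
        printf("deleted orphan: write %d blocks\n", blocks_to_write(&orphan));
        return 0;
}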

Usage case:  I have a process that went haywire and wrote a crapton of 
data...and then it crashed and made the entire output useless.  If I 
delete it, will tux3 notice and promptly discard any pending writeback 
against the file, since I'd rather the system just forget about 
flushing it after the file has been deleted?



Tux3 does what you want in that case.

2.  Exception to point 1, erasing a file

Is it possible for tux3 to explicitly zero out a file's data before 
releasing its blocks back to the free pool?


Yes, possible, and there are good reasons for doing that as an option, 
but we do not do it at the moment. I will make sure it is on the todo list.
My guess is that if 1 was done, 2 could be done by simply fsyncing the 
file before closing it.

You can fsync yourself if you like, or sync from the shell. The main 
reasons for zeroing out discarded data are: 1) security and 2) start 
early erase for SSD.


And maybe it would be a good feature if tux3 could be instructed to 
zero out blocks before they're released, perhaps as a mount-time 
default option, or perhaps explicitly.

Probably, our default will just do what you want when trim support gets 
to the top of the todo list.


Regards,

Daniel

___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: Alive?

2017-03-19 Thread Daniel Phillips

Hi Raymond,

Tux3.org is sufficient. Tux3.org is now hosted on phunq.net (my server).

Yes, still alive, and busy. There is a code drop coming from Hirofumi in 
the near future, to sync up with current mainline. I am currently 
working on a new distributed lock manager, which hopefully will play a 
role in Tux3's future evolution.


Pretty soon, I will return to work on Shardmap, Tux3's directory index, 
which I expect to scale considerably better than HTree, as well as being 
suitable for distribution across a cluster. Hmm, did I say the "cluster" 
word?


Regards,

Daniel

On 03/19/2017 07:56 PM, Raymond Jennings wrote:

Hey um...is tux3 still alive and kicking?

Also, I sent this to tux3 at both tux3.org  and 
phunq.net , are they both official mailing addresses?



___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Re: fallocate support ...

2017-03-22 Thread Daniel Phillips

Hi Lars,

On 03/21/2017 12:37 AM, Lars Segerlund wrote:


 Hi guys,

 I am doing some apps that preallocate files at known locations on 
disk, continuous, and the order (placement) is important, so I thought 
I'd ask if there have been any thoughts about this on tux3?


 It's really a killer app for streaming, since files at the end of the 
disk are slower, and files on the outer part of the platter are faster.


Ah, interesting, I did not realize that a factor of something less than 
two is so critical. And I guess you are telling us, spinning disks are 
not dead yet.


Yes, this is something Hirofumi and I have discussed in the past. Tux3 
is, in general, well suited to it. It would be required for any kind of 
real-time guarantee. We should have a specific plan for it, and see how 
the current code might be adapted for it.
 Preferably I'd like to have continuous files of a given size take up 
as much of the disk as I specify, from outer cylinders to the middle. 
ext4 has some support to set this up at filesystem creation time 
through a custom config to mkfs.


So there are two issues, one is large continuous file allocation, and 
the second is file placement.


 Any thoughts ?


What API do you propose, for the user to specify these constraints?

Regards,

Daniel

___
Tux3 mailing list
Tux3@phunq.net
http://phunq.net/mailman/listinfo/tux3


Tux3 update - Shardmap

2017-10-09 Thread Daniel Phillips

Hi all,

This is just a quick note to let everybody know that Tux3 development is 
active once again. There are a small number of areas we need to focus 
on, to make Tux3 ready for general (experimental!) use. One of those is 
directory scaling, that is, Shardmap.


Shardmap was conceived as a new-generation replacement for HTree, and as 
such, is aimed not just at Tux3, but Ext4 too. After all these years, 
Htree still remains arguably the best-performing directory index for 
moderate loads up to a few million files, but from there on it tends to 
hit thrashing issues. Shardmap is designed to fix those and scale up 
into the billions of files, while exhibiting similar performance at 
lower scale.


Shardmap is a scalable hash table design, as opposed to HTree, which is 
a hash-indexed btree. Htree thrashing comes from two main sources:


  1) Traversing a directory in hash order causes the inode table to be 
traversed in random order, so at high scale, the working set tends to 
become the entire inode table


 2) A basic issue with all btrees: heavy random update load (e.g., 
creates) will dirty almost every btree leaf, forcing almost the entire 
index to be written per commit


HTree stores directory entries in the btree leaves, with no separate 
index structure. That is one of the things that makes it really fast. 
Shardmap breaks away from this by storing directory entries in 
Ext2-style directory entry blocks, and has a separate index higher up in 
the directory file.


To make this work well, the Shardmap index is very compact, just 8 bytes 
per directory entry. Htree has far less index overhead than that, 
approaching zero bytes per directory entry (!) but it also has a 
significant amount of empty space in each leaf block, about 25%, due to 
its btree splitting strategy. So actually, Shardmap directories should 
turn out to be about the same total size as HTree. But Shardmap has a 
huge advantage because of storing its entries roughly in creation order: 
traversing the directory via readdir tends to access the inode table in 
storage order. The goal here is to have multiple cache hits to the same 
inode table block, close together in time. This is about as 
cache-friendly as you can get.


To address the second issue, Shardmap updates its hash shards using a 
logging style, per shard. That is, it only dirties the tail block of 
each shard. Compared to a btree, that would be a small fraction of dirty 
blocks per commit under heavy create/update load. Shardmap's dirty cache 
footprint therefore promises to be less than 1% of Htree's.

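Since the post above only states the 8 bytes per entry and the
append-to-tail logging, here is a toy sketch of how such an index entry
and its update path could look. The 32/32 split between name hash and
directory block number, and all of the names, are assumptions made for
illustration, not the real Shardmap layout:

#include <stdint.h>
#include <stdio.h>

/* Toy sketch of an 8-byte shard index entry and append-only updates;
 * the field split is an assumption, not the real Shardmap format. */
struct shard_entry {
        uint32_t hash;       /* hash of the file name */
        uint32_t dirblock;   /* directory entry block holding the name */
};

struct shard {
        struct shard_entry *tail;   /* next free slot in the tail block */
        struct shard_entry *end;    /* end of the current tail block */
};

/* Append-only update: only the shard's tail block ever becomes dirty,
 * so a heavy create load dirties a handful of blocks per commit. */
static int shard_append(struct shard *shard, uint32_t hash, uint32_t dirblock)
{
        if (shard->tail == shard->end)
                return -1;          /* caller starts a new tail block */
        *shard->tail++ = (struct shard_entry){ .hash = hash, .dirblock = dirblock };
        return 0;
}

int main(void)
{
        struct shard_entry block[4];
        struct shard shard = { .tail = block, .end = block + 4 };
        for (uint32_t i = 0; i < 5; i++)
                if (shard_append(&shard, 0x1000 + i, i) < 0)
                        printf("tail block full after %u entries\n", i);
        return 0;
}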

Shardmap's in-cache hash lookup should perform nearly identically to 
HTree's binary search, with similarly low CPU consumption.


Shardmap uses a fundamentally different approach to cache than HTree. 
With HTree, directory blocks are simply mapped into page cache, where 
both reads and updates are done. Shardmap splits the cache objects into 
two parts: in-memory shards, each of which is a small hash table, and 
media shards, which are append-only regions of the directory file, from 
which in-memory shards can be reconstructed as needed. So Shardmap uses 
one format that is well suited to media, in combination with another 
format that is well suited to memory. As a consequence, we must manage 
eviction of shards explicitly rather than relying on the default page 
cache eviction mechanism as Htree does. Fortunately, this is a 
well-known kernel meme - no new ground to break here!


One of the things that makes HTree simple is, it does not need to 
implement free space management within a directory file. HTree always 
knows where new entries will be created: in the leaf block indexed by 
the name hash. And HTree is grow-only - a directory file never shrinks. 
In practice, this has not proved to be a problem, because large 
directories tend to stay large, and the user has at least one effective 
way to shrink a directory: just delete it. Shardmap too, will be 
grow-only, but unlike HTree, it must track free space in order to reuse 
empty records in directory blocks left there by deletes. It turns out 
that a very simple free space management technique is possible, which I 
will not detail here, except to say that the space overhead is less than 
.1% of the directory size. A special shout-out here goes to Ted Ts'o for 
hating my initial free space management proposal, an opinion he 
expressed during a noisy party at LinuxCon New Orleans a few years back. 
Ted was right, and I accordingly came up with something simpler, more 
efficient and friendlier to Ext4's way of doing things.


As opposed to Htree, which loads its index into cache one directory 
block at a time, Shardmap will load a full hash shard per cache miss, 
typically 2-4MB. Is that good or bad? On spinning disk, reading 4MB 
instead of 4KB costs less than the associated seek, so it is clearly a 
good idea, but what about SSD? My tentative conclusion is: this 
resembles read-ahead, in that loads exist where the 

Re: Progress report?

2018-04-04 Thread Daniel Phillips

On 2018-04-03 12:30 AM, Raymond Jennings wrote:

Are you guys close to getting merged into mainline?

I think it's high time that btrfs got a healthy dose of competition



Hi Raymond,

For the time being we will continue to develop out-of-tree, while 
continuing to track Linus's latest mainline kernel.


Currently, I am busy fixing Tux3's lack of directory indexing, which 
becomes a performance bottleneck at more than a few hundred files per 
directory. We need to fix this before seriously putting Tux3 up 
against other general purpose file systems.


We could have gone with a hash-keyed B-tree indexing scheme like 
everybody else, but I felt we would be better off with a completely new 
approach based on scalable hash tables. I actually prototyped Shardmap 
back in 2012, to the point where I convinced myself that the technology 
was capable of meeting or beating B-tree performance at all scales, 
while not needing a huge hack to work around the basically impossible 
problem of doing readdir in hash order.


Evolving that prototype into usable code has kept me busy for a few 
months now. Problem number one was, a hash table does not scale 
naturally like a B-tree, instead the entire table needs to be expanded 
as directory size increases. A simple-minded implementation would cause 
huge latency for the create that happens to trigger the expand. Instead, 
Shardmap expands the hash table one shard at a time, where the latency 
of expanding a single shard is just a couple of milliseconds, appearing 
completely smooth to the user. The state of this incremental reshard, as 
I call it, needs to be recorded in the directory file so that the 
incremental re-shard will continue exactly where it left off if the 
directory is re-opened. After some effort, that settled down to a simple 
design where the index is represented as one or two "tiers" of hash 
tables, depending on whether whether a reshard is in progress or not. 
The lower tier merges incrementally into the upper tier until it 
disappears, so that the entire hash index moves higher up in the 
directory file over time, making room for a nice linear array of 
directory entry blocks below it.

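To make the one-or-two-tier picture above a little more concrete, here is a
self-contained toy model of the incremental reshard. Reducing each shard to
a single integer, and splitting each lower-tier shard into two upper-tier
shards, are simplifying assumptions for illustration only; this is not the
real Shardmap code:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

enum { LOWER_SHARDS = 4 };

/* Toy model of the two-tier incremental reshard described above. */
struct toy_index {
        int    upper[2 * LOWER_SHARDS];  /* upper tier, twice as many shards */
        int    lower[LOWER_SHARDS];      /* lower tier, present only during reshard */
        size_t migrated;                 /* lower shards merged so far (recorded on media) */
        bool   resharding;
};

/* One small step: merge a single lower-tier shard into the upper tier.
 * Each step is cheap, so no single create ever sees a big expand latency,
 * and the migrated count lets a reopened directory continue where it
 * left off. */
static void reshard_step(struct toy_index *x)
{
        if (!x->resharding)
                return;
        size_t i = x->migrated;
        x->upper[2 * i]     = x->lower[i];   /* toy "merge": one lower shard */
        x->upper[2 * i + 1] = x->lower[i];   /* feeds two upper shards */
        if (++x->migrated == LOWER_SHARDS) {
                x->resharding = false;       /* lower tier has disappeared */
                x->migrated = 0;
        }
}

int main(void)
{
        struct toy_index x = { .lower = { 1, 2, 3, 4 }, .resharding = true };
        while (x.resharding)
                reshard_step(&x);
        for (size_t i = 0; i < 2 * LOWER_SHARDS; i++)
                printf("upper[%zu] = %d\n", i, x.upper[i]);
        return 0;
}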

This linear array of directory entry blocks is one of the main points of 
Shardmap. It means that readdir can use a simple logical directory 
address for readdir position, which is really the only way to comply 
accurately with Posix readdir semantics that were originally defined 
with simple linear directory layout in mind. Linear directory layout 
also gives the fastest and most cache-efficient readdir, so you can walk 
through an arbitrarily large Shardmap directory at essentially media 
transfer speed. Finally, we avoid an issue that Htree has, where walking 
the directory in hash order means that the inode table is accessed in 
random order, causing increased cache pressure and (in the case of 
delete) increased write multiplication.

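Here is a small sketch of why the linear layout makes the readdir cursor
simple: the position can be an ordinary (block, slot) pair packed into the
standard offset cookie. The field names and the packing are assumptions
made for this illustration, not the actual Tux3 readdir code:

#include <stdint.h>
#include <stdio.h>

/* Toy sketch: a readdir position over a linear array of directory entry
 * blocks is just a logical block number and a slot within that block. */
struct readdir_pos {
        uint64_t block;      /* logical directory entry block */
        uint32_t slot;       /* entry slot within that block */
};

static uint64_t pos_encode(struct readdir_pos pos, uint32_t slots_per_block)
{
        return pos.block * slots_per_block + pos.slot;
}

static struct readdir_pos pos_decode(uint64_t cookie, uint32_t slots_per_block)
{
        return (struct readdir_pos){
                .block = cookie / slots_per_block,
                .slot  = (uint32_t)(cookie % slots_per_block),
        };
}

int main(void)
{
        struct readdir_pos pos = { .block = 12, .slot = 7 };
        uint64_t cookie = pos_encode(pos, 64);
        struct readdir_pos back = pos_decode(cookie, 64);
        printf("cookie %llu -> block %llu slot %u\n",
               (unsigned long long)cookie,
               (unsigned long long)back.block, back.slot);
        return 0;
}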

Our nice linear array of directory entry blocks brings up hard problem 
number two: how to keep track of free space in directory entry blocks 
due to deleted entries? HTree does not have that problem because it 
always creates a new entry in the B-tree leaf that corresponds to the 
entry's hash, and splits that block to create space if necessary. So 
Shardmap needs something like a malloc, but because Shardmap competes 
with Htree in performance, the cost of this has to be nearly zero. My 
solution is a new algorithm called Bigmap, that records the largest free 
entry per block with overhead of just one byte per block. Searching and 
updating adds so little extra overhead that it is hard to measure.

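Since the post only says that Bigmap records the largest free entry per
block in a single byte, here is a toy sketch of how such a map could be
searched and maintained. The unit of the byte and all of the names are
assumptions for illustration, not the real Bigmap code:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Toy sketch of a one-byte-per-block free space map: each byte holds the
 * size of the largest free record in that directory entry block. */
struct bigmap {
        uint8_t *largest_free;   /* one byte per directory entry block */
        size_t   nblocks;
};

/* Find a block able to hold a new entry of 'need' units, or return
 * nblocks to tell the caller to grow the directory by one block. */
static size_t bigmap_find(const struct bigmap *map, uint8_t need)
{
        for (size_t block = 0; block < map->nblocks; block++)
                if (map->largest_free[block] >= need)
                        return block;
        return map->nblocks;
}

/* After a create or delete, only this one byte of the map changes. */
static void bigmap_update(struct bigmap *map, size_t block, uint8_t largest)
{
        map->largest_free[block] = largest;
}

int main(void)
{
        uint8_t bytes[4] = { 0, 3, 12, 0 };  /* largest free record per block */
        struct bigmap map = { .largest_free = bytes, .nblocks = 4 };
        printf("entry needing 8 units goes in block %zu\n", bigmap_find(&map, 8));
        bigmap_update(&map, 2, 4);           /* block 2 now mostly used */
        printf("next entry of 8 units goes in block %zu\n", bigmap_find(&map, 8));
        return 0;
}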

Putting this all together, we got our reward: a directory index that 
scales efficiently to the billion file range while also handling smaller 
directories at least as efficiently as current B-tree schemes. Because a 
file system directory is really just a kind of specialized key-value 
store, we decided to compare Shardmap performance to standalone 
databases, and we found Shardmap outperforming them at create, delete 
and lookup for small data sets and large. This is by way of gaining 
confidence that we did not overlook some even better way to do things.


Please excuse me for going into this perhaps a little more deeply than I 
originally intended, but this should give you some idea where we are 
right now, and why we prioritized the current development work ahead of 
putting Tux3 up for LKML review once again. There is still more work to 
do on the Shardmap front: this code must now be ported from userspace to 
kernel,  work currently in progress. After that, there are some 
outstanding issues to take care of with seek optimization on spinning 
disk. That will bring us to the point where we are ready to make our 
case for mainline merge, without needing to explain away cases where we 
do not currently come out on top in file system benchmarks.


Regards,

Daniel


___
Tux3 mailing