Re: APFS: mmap page fault can take up to minutes after ftruncate/F_PREALLOCATE

2019-12-19 Thread Thomas Tempelmann via Filesystem-dev
Out of curiosity, I just ran the test on my 10.13.6 system:

On HFS+, the tool finishes in about 5 seconds, ending up with a 4 GB file.

However, on a freshly created APFS volume (no encryption, on the same SSD as
the HFS+ volume), it runs for about an hour! But there is no extra wait
around 2 GB.

My guess is that APFS's mmap was so slow on 10.13 that the Apple engineers
optimized it in 10.14, though still with a hiccup at 2 GB, and then fixed
that as well in 10.15.

Thomas



Re: APFS: mmap page fault can take up to minutes after ftruncate/F_PREALLOCATE

2019-12-19 Thread Ilia K via Filesystem-dev
UPDATE:
1. Interestingly, ftruncate() works fine on macOS Catalina. Were there some
fixes for this in Catalina?
2. I found an open issue for missing posix_fallocate():
https://openradar.appspot.com/32720223



APFS: mmap page fault can take up to minutes after ftruncate/F_PREALLOCATE

2019-12-19 Thread Ilia K via Filesystem-dev
Hi!

I am investigating performance issues with our test cases on a Mac mini
(2018, Core i7 3.2GHz, 16GB RAM) running macOS Mojave 10.14.6.

Our storage uses memory mappings backed by a file, and periodically, when it
gets too big, we increase the file size using the corresponding function:
CreateFileMapping on Windows, posix_fallocate on Linux. On macOS, we
emulate posix_fallocate() with a function that simply calls ftruncate().
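
The per-platform dispatch described above could look roughly like the sketch below. This is not the code from the attachment: the name grow_file() is hypothetical, and the macOS branch is only the plain ftruncate() fallback mentioned in this message.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: make sure the file behind fd is at least new_size
 * bytes long (new_size > 0). Returns 0 on success or an errno value,
 * following the posix_fallocate() convention. */
static int grow_file(int fd, off_t new_size) {
#if defined(__APPLE__)
    /* macOS has no posix_fallocate(); extend with ftruncate(), never shrink. */
    struct stat st;
    if (fstat(fd, &st) != 0) return errno;
    if (new_size <= st.st_size) return 0;
    return ftruncate(fd, new_size) == 0 ? 0 : errno;
#else
    /* Allocates blocks for [0, new_size) and extends the file if needed. */
    return posix_fallocate(fd, 0, new_size);
#endif
}
```

Note that the two branches are not equivalent: ftruncate() only changes the file size and may leave the new range sparse, while posix_fallocate() guarantees the blocks are allocated.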

It so happened that one of our test cases repeatedly allocates and mmap()'s
chunks of size >= 4KB, without reading/writing to them. (btw, page size and
block size are also 4K).

The problem is that sometimes we have unpredictable delays in page faults:
from tens of seconds to minutes. Usually it happens when accessing the
mmap()'ed addresses with offset ~2050-2090MB.

I tried implementing posix_fallocate() in several different ways:
* ftruncate() -- the easiest one, works both on Linux and on macOS with
HFS+. But on APFS a page fault takes about 23 seconds.
* fcntl(F_SETSIZE) -- the worst page fault time is lower than with
ftruncate() (only 11 seconds), but it requires root privileges.
* fcntl(F_PREALLOCATE) -- I found 3 ways of using it in various open source
projects: #1 & #2 seem wrong to me (see the comments in my demo for
details), and #3 can cause a page fault lasting 10 minutes.
* pwrite() -- slow, but without obscenely long page faults if the step
size is 4K. Otherwise, we can also wait in pwrite() for 12 seconds, or get a
13-second page fault.

Here is my posix_fallocate(), the full demo code is in the attachments
(pagefault_test.c):
```C
int posix_fallocate(int fd, off_t offset, off_t len) {
struct stat stat_buf;
if (flock(fd, LOCK_EX) != 0) return errno;

int err_code = fstat(fd, &stat_buf) == 0 ? 0 : errno;
if (err_code == 0 && offset + len > stat_buf.st_size) {
#if defined(IMPL_FTRUNCATE)
err_code = ftruncate(fd, offset + len) == 0 ? 0 : errno;
// btw, LLVM simply uses ftruncate() when posix_fallocate() is not available:
// https://github.com/llvm/llvm-project/blob/b462cdff05b82071190e8bfd1078a2c76933b19b/llvm/lib/Support/Unix/Path.inc#L559
#elif defined(IMPL_FCNTL_SETSIZE)
unsigned long long arg = offset + len;
err_code = fcntl(fd, F_SETSIZE, &arg) != -1 ? 0 : errno;
#elif defined(IMPL_FCNTL_PREALLOCATE)
// I found several ways to use F_PREALLOCATE (uncomment to try it):
// 1. Starting from a specific offset.  This way is used in Chromium
//    (https://chromium.googlesource.com/chromium/src/+/7ca4a2b489b1dd4b5c9b0046d55193b900da06ea/base/files/file_util_posix.cc#901)
//    and in the fallocate module for Python
//    (https://github.com/trbs/fallocate/blob/9d7aae312ad0d1de6c6451193748e8e8c7e8230d/fallocate/_fallocatemodule.c#L59),
//    but I get EINVAL if .fst_offset != 0.
//fstore_t store = { F_ALLOCATEALL, F_PEOFPOSMODE, offset, len };
//
// 2. Specifying the desired file size.  Examples: Mozilla
//    https://hg.mozilla.org/mozilla-central/file/3d846420a907/xpcom/glue/FileUtils.cpp#l61
//    (copies here:
//    https://github.com/mozilla/universal-search-gecko-dev/blob/33e34ae066dbdb35ff6889973e21a38792991f35/xpcom/glue/FileUtils.cpp,
//    https://github.com/mozilla/integration-mozilla-inbound/blob/0d01aa29ce350beca861f7d3b7b4df399b246ed0/xpcom/glue/FileUtils.cpp),
//    one poster at https://forums.developer.apple.com/thread/111312, and Rust fs2
//    https://docs.rs/crate/fs2/0.4.3/source/src/unix.rs.
//    But as far as I can see, this will allocate offset + len bytes per call,
//    regardless of how big the file is at the moment.
//fstore_t store = { F_ALLOCATEALL, F_PEOFPOSMODE, 0, offset + len };
//
// 3. Specifying the difference in file size.  The only example I found is Realm core:
//    https://github.com/realm/realm-core/blob/44152d283878473db8cbf90ac4083dcae44c1852/src/realm/util/file.cpp#L783
//    Unfortunately, in this case a page fault can take longer than with
//    ftruncate(): up to 611 seconds!!!
// ```
// $ ./a.out
// map 2179727360-2179792896
// page fault took 611276 milliseconds
// map 2181300224-2181365760
// page fault took 214747 milliseconds
// ...
// ```
fstore_t store = { F_ALLOCATEALL, F_PEOFPOSMODE, 0, offset + len - stat_buf.st_size };

err_code = fcntl(fd, F_PREALLOCATE, &store) != -1 && ftruncate(fd, offset + len) != -1 ? 0 : errno;
#elif defined(IMPL_WRITE)
// for 64K: pwrite() can take about 12 seconds, or we can get a 13-second page fault.
int step = 65536;
//int step = stat_buf.st_blksize;

assert(stat_buf.st_size % step == 0);  // precondition for this program
assert((offset + len) % step == 0);
printf("\n");
for (off_t ofs = stat_buf.st_size; ofs < offset + len; ofs += step) {
static const char pad = '\0';
// write one byte at the end of each step so the file grows step by step
fprintf(stdout, "writing %lld, step %d\r", (long long)(ofs + step - 1), step);
fflush(stdout);
if (pwrite(fd, &pad, 1, ofs + step - 1) == -1) {
err_code = errno;
break;