Re: APFS: mmap page fault can take up to minutes after ftruncate/F_PREALLOCATE
Out of curiosity, I just ran the test on my 10.13.6 system: On HFS+, the tool finishes in about 5 seconds, ending up with a 4 GB file. However, on a freshly created APFS volume (no encryption, on the same SSD as the HFS+ volume), it runs for about an hour! But there is no extra wait time around 2 GB.

I guess that APFS's mmap was so slow on 10.13 that the Apple engineers optimized it in 10.14, but still with a hiccup at 2 GB, and then fixed that as well in 10.15.

Thomas
Re: APFS: mmap page fault can take up to minutes after ftruncate/F_PREALLOCATE
UPDATE:

1. Interestingly, ftruncate() works fine on macOS Catalina. Were there some fixes for this in Catalina?
2. I found an open issue for the missing posix_fallocate(): https://openradar.appspot.com/32720223

On Thu, Dec 19, 2019 at 4:32 PM Ilia K wrote:
> [...]
APFS: mmap page fault can take up to minutes after ftruncate/F_PREALLOCATE
Hi!

I am investigating performance issues with our test cases on a Mac mini (2018, Core i7 3.2GHz, 16GB RAM) running macOS Mojave 10.14.6.

Our storage uses file-backed memory mappings, and periodically, when a mapping gets too big, we increase the file size using the corresponding function: CreateFileMapping on Windows, posix_fallocate on Linux. On macOS, we emulate posix_fallocate() with a function that simply calls ftruncate().

It so happened that one of our test cases repeatedly allocates and mmap()'s chunks of size >= 4KB, without reading/writing to them. (btw, page size and block size are also 4K).

The problem is that sometimes we have unpredictable delays in page faults: from tens of seconds to minutes. Usually it happens when accessing mmap()'ed addresses at file offsets of ~2050-2090MB.

Well, I tried implementing posix_fallocate() in different ways:
* ftruncate() -- the easiest one, works both on Linux and on macOS with HFS+. But on APFS a page fault takes about 23 seconds.
* fcntl(F_SETSIZE) -- the worst page fault time is lower than with ftruncate() (only 11 seconds), but it needs root privileges.
* fcntl(F_PREALLOCATE) -- I found 3 ways of using it in various open source projects: #1 & #2 seem wrong to me (see the comments in my demo for details), and #3 can cause a page fault lasting 10 minutes.
* pwrite() -- works slowly but without obscenely long page faults if the step size is 4K. Otherwise, we can also wait in pwrite() for 12 seconds, or get a 13-second page fault.

Here is my posix_fallocate(); the full demo code is in the attachments (pagefault_test.c):
```C
int posix_fallocate(int fd, off_t offset, off_t len) {
  struct stat stat_buf;
  if (flock(fd, LOCK_EX) != 0) return errno;

  int err_code = fstat(fd, &stat_buf) == 0 ? 0 : errno;
  if (err_code == 0 && offset + len > stat_buf.st_size) {
#if defined(IMPL_FTRUNCATE)
    err_code = ftruncate(fd, offset + len) == 0 ? 0 : errno;
    // btw, LLVM simply uses ftruncate when posix_fallocate is not available:
    // https://github.com/llvm/llvm-project/blob/b462cdff05b82071190e8bfd1078a2c76933b19b/llvm/lib/Support/Unix/Path.inc#L559
#elif defined(IMPL_FCNTL_SETSIZE)
    unsigned long long arg = offset + len;
    err_code = fcntl(fd, F_SETSIZE, &arg) != -1 ? 0 : errno;
#elif defined(IMPL_FCNTL_PREALLOCATE)
    // I found several ways to use F_PREALLOCATE (uncomment to try it):
    // 1. Starting from a specific offset. This way is used in Chromium
    //    (https://chromium.googlesource.com/chromium/src/+/7ca4a2b489b1dd4b5c9b0046d55193b900da06ea/base/files/file_util_posix.cc#901)
    //    and in the fallocate module for Python
    //    (https://github.com/trbs/fallocate/blob/9d7aae312ad0d1de6c6451193748e8e8c7e8230d/fallocate/_fallocatemodule.c#L59),
    //    but I get EINVAL if .fst_offset != 0.
    //fstore_t store = { F_ALLOCATEALL, F_PEOFPOSMODE, offset, len };
    //
    // 2. Specifying the desired file size. Examples:
    //    Mozilla https://hg.mozilla.org/mozilla-central/file/3d846420a907/xpcom/glue/FileUtils.cpp#l61
    //    (copies here:
    //    https://github.com/mozilla/universal-search-gecko-dev/blob/33e34ae066dbdb35ff6889973e21a38792991f35/xpcom/glue/FileUtils.cpp,
    //    https://github.com/mozilla/integration-mozilla-inbound/blob/0d01aa29ce350beca861f7d3b7b4df399b246ed0/xpcom/glue/FileUtils.cpp),
    //    a forum post at https://forums.developer.apple.com/thread/111312,
    //    Rust fs2 https://docs.rs/crate/fs2/0.4.3/source/src/unix.rs.
    //    But as far as I can see, this will allocate offset + len bytes per call,
    //    regardless of how big the file currently is.
    //fstore_t store = { F_ALLOCATEALL, F_PEOFPOSMODE, 0, offset + len };
    //
    // 3. Specifying the difference in file size. The only example I found is Realm core
    //    https://github.com/realm/realm-core/blob/44152d283878473db8cbf90ac4083dcae44c1852/src/realm/util/file.cpp#L783 .
    //    Unfortunately, in this case a page fault can take longer than with ftruncate(): up to 611 seconds!!!
    //    ```
    //    $ ./a.out
    //    map 2179727360-2179792896
    //    page fault took 611276 milliseconds
    //    map 2181300224-2181365760
    //    page fault took 214747 milliseconds
    //    ...
    //    ```
    fstore_t store = { F_ALLOCATEALL, F_PEOFPOSMODE, 0, offset + len - stat_buf.st_size };

    err_code = fcntl(fd, F_PREALLOCATE, &store) != -1 && ftruncate(fd, offset + len) != -1 ? 0 : errno;
#elif defined(IMPL_WRITE)
    // for 64K: pwrite() can take about 12 seconds, or we can get a 13-second page fault.
    int step = 65536;
    //int step = stat_buf.st_blksize;

    assert(stat_buf.st_size % step == 0); // precondition for this program
    assert((offset + len) % step == 0);

    printf("\n");
    for (off_t ofs = stat_buf.st_size; ofs < offset + len; ofs += step) {
      static const char pad = '\0';
      fprintf(stdout, "writing %lld, step %d\r", ofs + step - 1, step);
      fflush(stdout);
      if (pwrite(fd, &pad, 1, ofs + step - 1) == -1) {
        err_code = errno;
        break;