On 5/25/23 19:08, enh via Toybox wrote:
> so i finally enabled copy_file_range for the _host_ toybox because someone
> pointed out that we're copying 16GiB zip files around in the build, and even
> though obviously we should stop doing that, 30s seemed unreasonable, and
> coreutils cp "only" took 20s because of copy_file_range.
Hardlinking them is not an option? :)

> but toybox cp with copy_file_range still takes 25s. why?
>
>   if (bytes<0 || bytes>(1<<30)) len = (1<<30);
>
> the checkin comment being:
>
>   Update comments and add "sanity check" from kernel commit f16acc9d9b376.
>   (The kernel's been doing this since 2019, but older kernels may not, so...)

The problem being that _before_ that commit, too big a sendfile didn't work
right (returned an error from the kernel?). I suspect my range check was just
the largest power of 2 that fit in the constraint...

> what the kernel _actually_ does though is clamp to MAX_RW_COUNT. which is
> actually (INT_MAX & PAGE_MASK). which i'm assuming changes for a non-4KiB
> page kernel?

I don't think any of my test images have a PAGE_SHIFT other than 12? (Looks
like Alpha, OpenRISC, and 64-bit SPARC are the only three architectures that
CAN'T use a 4k page size, and none of those are exactly load bearing these
days.)

But I wouldn't have expected it to be that much slower given the block size
here is a megabyte, and the number of transactions being submitted... 16 gigs
done a megabyte at a time is 16k system calls, which is:

  $ cat hello2.c
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
    int i;

    for (i = 0; i<16384; i++) dprintf(1, " ");
  }
  $ gcc hello2.c
  $ strace ./a.out 2>&1 | grep write | wc -l
  16384
  $ time ./a.out | wc
        0       0   16384

  real    0m0.033s
  user    0m0.012s
  sys     0m0.043s

Halving the number of output system calls would theoretically save you around
0.015 seconds on a 10 year old laptop. So why does it have a ~20% impact on
the kernel's throughput? The kernel's cap isn't even cleanly a power of 2.

Maybe the kernel is using 2 megabyte huge pages internally in the disk cache,
and the smaller size is causing unnecessary copying? Is 1<<29 slower or faster
than 1<<30? I didn't think letting something else get in there and seek was a
big deal on SSD? Maybe a different hardware burst transaction size?
This isn't even "maybe zerocopy from a userspace buffer above a certain size
keeps the userspace process suspended so read and write never get to overlap"
territory: there's no userspace buffer. This is "give the kernel two
filehandles and a length and let it sort it out". We tell it what to do in
very abstract terms. In theory the ENTIRE COPY OPERATION could be deferred by
the filesystem, scheduling it as a big journal entry to update extents. On
something like btrfs it could be shared extents behind the scenes. What is
going ON here?

> sadly 2019 is only 4 years ago, so there's a decent chunk of the 7 year rule
> left to run out...

I'm happy to change it, but I'd like to understand what's going on? We can
switch to the kernel's exact size cap (assuming sysconf(_SC_PAGE_SIZE) is
reliable), but _why_ is that magic number we had to get by reading the kernel
source faster? We're handing this off to the kernel so it deals with the
details and _avoids_ this sort of thing...

(Why the kernel guys provided an API that can't handle O_LARGEFILE from 2001,
I couldn't tell you...)

Rob
_______________________________________________
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net