On 5/25/23 19:08, enh via Toybox wrote:
> so i finally enabled copy_file_range for the _host_ toybox because someone
> pointed out that we're copying 16GiB zip files around in the build, and even
> though obviously we should stop doing that, 30s seemed unreasonable, and
> coreutils cp "only" took 20s because of copy_file_range.

Hardlinking them is not an option? :)

> but toybox cp with copy_file_range still takes 25s. why?
> 
>       if (bytes<0 || bytes>(1<<30)) len = (1<<30);
> 
> the checkin comment being:
> 
> Update comments and add "sanity check" from kernel commit f16acc9d9b376.
> (The kernel's been doing this since 2019, but older kernels may not, so...)

The problem being that _before_ that commit, too big a sendfile didn't work
right (returned an error from the kernel?). I suspect my range check was just
the largest power of 2 that fit in the constraint...
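
For reference, the loop around that clamp is basically just "hand the kernel
chunks no bigger than the cap until we run out". Untested sketch, not the
actual toybox code, just the same shape:

#define _GNU_SOURCE
#include <sys/types.h>
#include <unistd.h>

// Copy "bytes" bytes (or until EOF if bytes<0) from in to out, handing the
// kernel at most 1<<30 bytes per copy_file_range() call, so a 16 gig file
// is at least 16 system calls.
static long long clamped_copy(int in, int out, long long bytes)
{
  long long total = 0;

  while (bytes<0 || total<bytes) {
    ssize_t len = 1<<30;

    if (bytes>=0 && bytes-total<len) len = bytes-total;
    len = copy_file_range(in, 0, out, 0, len, 0);
    if (len<1) break;
    total += len;
  }

  return total;
}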

> what the kernel _actually_ does though is clamp to MAX_RW_COUNT. which is
> actually (INT_MAX & PAGE_MASK). which i'm assuming changes for a non-4KiB page
> kernel?

I don't think any of my test images have a PAGE_SHIFT other than 12? (Looks like
Alpha, OpenRisc, and 64 bit Sparc are the only 3 architectures that CAN'T use a
4k page size, and none of those are exactly load bearing these days.)

But I wouldn't have expected it to be that much slower given the block size here
is a megabyte, meaning the number of transactions being submitted is already
small: 16 gigs done a megabyte at a time is 16k system calls, which is:

$ cat hello2.c
#include <stdio.h>

int main(int argc, char *argv[])
{
  int i;

  for (i = 0; i<16384; i++) dprintf(1, " ");
}
$ gcc hello2.c
$ strace ./a.out 2>&1 | grep write | wc -l
16384
$ time ./a.out | wc
      0       0   16384

real    0m0.033s
user    0m0.012s
sys     0m0.043s

Halving the number of output system calls would theoretically save you around
0.015 seconds on a 10 year old laptop.

So why does it have a ~20% impact on the kernel's throughput? The kernel's cap
isn't even cleanly a power of 2. Maybe the kernel is using 2 megabyte huge pages
internally in the disk cache, and the smaller size is causing unnecessary
copying? Is 1<<29 slower or faster than 1<<30? I didn't think letting something
else get in there and seek was a big deal on ssd? Maybe a different hardware
burst transaction size?
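
If somebody wants to poke at it, the obvious experiment is a dumb harness that
takes the chunk cap on the command line so 1<<20, 1<<29, 1<<30, and the
kernel's exact cap can be timed against each other on the same big file.
Untested sketch:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

// Copy argv[1] to argv[2] via copy_file_range() in chunks of at most argv[3]
// bytes, and print how long the copy took.
int main(int argc, char *argv[])
{
  struct timespec start, stop;
  int in, out;
  size_t cap;
  ssize_t len;

  if (argc!=4) return 1;
  in = open(argv[1], O_RDONLY);
  out = open(argv[2], O_WRONLY|O_CREAT|O_TRUNC, 0644);
  cap = strtoul(argv[3], 0, 0);
  if (in<0 || out<0 || !cap) return 1;

  clock_gettime(CLOCK_MONOTONIC, &start);
  while ((len = copy_file_range(in, 0, out, 0, cap, 0))>0);
  clock_gettime(CLOCK_MONOTONIC, &stop);
  if (len<0) perror("copy_file_range");

  printf("%.3f seconds\n", (stop.tv_sec-start.tv_sec)
    +(stop.tv_nsec-start.tv_nsec)/1e9);

  return 0;
}

Running each size a few times (and dropping the page cache in between with
"echo 3 > /proc/sys/vm/drop_caches" if you want cold numbers) should make the
answer fall out.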

This isn't even "maybe zerocopy from a userspace buffer above a certain size
keeps the userspace process suspended so read and write never get to overlap"
territory: there's no userspace buffer. This is "give the kernel two filehandles
and a length and let it sort it out". We tell it what to do in very abstract
terms. In theory the ENTIRE COPY OPERATION could be deferred by the filesystem,
scheduling it as a big journal entry to update extents. On something like btrfs
it could be shared extents behind the scenes. What is going ON here?

> sadly 2019 is only 4 years ago, so there's a decent chunk of the 7 year rule
> left to run out...

I'm happy to change it, but I'd like to understand what's going on? We can
switch to the kernel's exact size cap (assuming sysconf(_SC_PAGE_SIZE) is
reliable), but _why_ is that magic number we had to get by reading the kernel
source faster? We're handing this off to the kernel so it deals with the details
and _avoids_ this sort of thing...
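
Something like this if we do go that way, assuming sysconf's page size actually
matches the kernel's PAGE_MASK:

#include <limits.h>
#include <unistd.h>

// Userspace reconstruction of the kernel's MAX_RW_COUNT = INT_MAX & PAGE_MASK.
// With 4k pages that's 0x7ffff000, which is why the cap isn't a power of 2.
static size_t rw_count_cap(void)
{
  long pagesize = sysconf(_SC_PAGE_SIZE);

  if (pagesize<1) pagesize = 4096;

  return INT_MAX & ~((size_t)pagesize-1);
}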

(Why the kernel guys provided an API that can't handle O_LARGEFILE from 2001, I
couldn't tell you...)

Rob
_______________________________________________
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net
