On Fri, Apr 14, 2023 at 1:49 PM Rob Landley <r...@landley.net> wrote:

> On 4/11/23 19:00, enh wrote:
> > On Tue, Apr 11, 2023 at 12:10 PM Rob Landley
> >     > (unrelated, i've been meaning to ask whether we should make toybuf
> larger
> >     > anyway. 4KiB is really small for modern hardware, though at the
> same time
> >     > it does make it more likely that we test all the "toybuf too
> small, loop"
> >     > cases even with small test inputs...)
> >
> >     A) not a fan of asserts.
> >
> > i don't like assert(), but **static_assert** is really useful for things
> like
> > this where you want to say "this code makes an assumption that you can
> test at
> > compile time".
>
> Compile time is the time to care about that sort of thing, yes.
>
> Kinda wonder about portability on weirdness like qnx (or the guy in email
> who's
> asking about uclibc-ng). I suspect any such asserts would be a CFG_DEBUG
> option
> maybe in portability.c? Hmmm...
>

they haven't made other complaints about C11, so presumably static_assert
is fine too?


> >     B) it was only ever coincidentally page size, and huge pages are a
> thing even on
> >     x86.
> >
> >
> > well, huge pages are different from non-4KiB non-huge pages.
>
> There was a lot of talk a while back about getting the kernel to
> dynamically use
> them (false starts when I was reading about it), but I don't follow
> lwn.net or
> lkml nearly as closely in 2023 as I did in 2018. It just got too
> unpleasant even
> to check over the weekly web archive...
>

_transparent_ huge pages is yet another different thing :-)


> > i think it's only
> > arm64 where you're at all likely to actually have your page size not be
> 4KiB.
> > (all macs and iphones, for example. i _think_ all the linux distros that
> tried
> > to move gave up?)
>
> What _is_ Mel Gorman up to these days? He last updated his blog in 2016,
> and
> last tweeted in 2020...
>
> >     I never annotated toybuf or libbuf with any sort of alignment
> directive or tried
> >     to make it come first in its segment (toybuf and libbuf are the
> fifth and sixth
> >     globals defined in main.c), so they're both reasonably likely to
> straddle page
> >     boundaries anyway. Heck, I'm not even sure it's cache line aligned.
> The actual
> >     _guarantee_ is something like 4 bytes, except when it suddenly
> isn't. I fought
> >     with this in 2021 trying to get a simple "hello world" kernel out of
> gcc without
> >     needing a linker script:
> https://landley.net/notes-2021.html#12-04-2021
> >     <https://landley.net/notes-2021.html#12-04-2021>
> >
> > now you're on C11, you can easily say this:
> > https://en.cppreference.com/w/c/language/_Alignas
>
> Hmmm... Does it actually help to page align them, do you think? Not sure
> how to
> benchmark that...
>

(i was half expecting it to be page aligned for free, as the
joint-second-largest OBJECT in toybox, but readelf says it isn't, and
neither is `this`.)

i don't have good benchmarks either, or i'd have given you data. the build
team does keep track of where all the time goes in builds, but (other than
that pathological case we had in sed) toybox is unlikely to make their hit
list.


> >     The 4096 is just a convenient scratch pad size. I use sizeof(toybuf)
> in a bunch
> >     of places... and hardwire in the knowledge of its size in a bunch of
> others.
> >     Plus there's a bunch of implicit "toybuf and/or this slice of it is
> big enough
> >     to stick this struct in, so I can safely typecast the pointer"
> instances I
> >     checked at the time (and all of them had a big fudge factor in case
> of future
> >     glibc bloat).
> >
> >     It's really a "convenient granularity" thing. Copy loops doing
> byte-at-a-time
> >     stuff is known terrible because the library and syscall execution
> paths come to
> >     dominate, and grouping it into 4k blocks is 12 doublings of
> efficiency right
> >     there. Going to 64k is 1/16th as much syscalls, which is not as big
> a deal as
> >     1/4000th as many syscalls. And then raises the question "why not a
> megabyte
> >     then" which is something you don't just casually want to do on
> embedded devices
> >     without thinking about it (might as well malloc there)...
> >
> >     I could probably be talked into bumping it up to 64k if somebody
> measured
> >     numbers saying it would help something specific?
> >
> >
> > i think the time i noticed this was when i was looking into "where the
> time
> > went" and noticed that a 64KiB buffer was quite helpful, at least on the
> scale
> > of "an entire Android build" type of thing.
>
> Which counts as a good reason to increase it, but is that "bigger toybuf
> helps
> in general" or "some copy loops should use a malloc buffer instead of
> toybuf"?
>

(or, as you come to later, "some copy loops should just sendfile()
instead".)


> My todo items here tend to be about limitations, things like in ps.c
> get_ps()
> puts struct procpid in toybuf so the strings it's reading are cumulatively
> finite (seem the comment around line 878) which mostly impacts cmdline,
> but with
> modern kernel changes that's 10 megabytes per entry so it's probably gotta
> have
> SOME sort of limit it's reading. :)
>
> That case does a deferred malloc on ps.c line 1000 to copy the result out
> of
> toybuf, so in theory the initial pointer could be replaced by an
> xmalloc(65536)
> and then the current malloc would become a remalloc() to trim it down to
> what's
> actually used. The size calculations are mostly sizeof(toybuf) so could
> swap to
> the new size.
>
> Well, one exception: the 2048 on line 747 is sort of "half of toybuf" but
> it's
> really just that /proc/$PID/stat is never gonna be longer than this
> because wc
> /proc/$$/stat says 52 entries, one of which is the command name (which has
> to be
> a filename which means it's limited by the VFS maximum of 255 bytes plus
> the
> enclosing parentheses for 257 which I rounded to 260 because null
> terminator and
> 4 byte alignment) and then all the OTHER entries are 64 bit numbers
> printed out
> in decimal so 21 bytes each with the space between them, 52*21+260 is
> 1382, and
> I gave it room for future expansion because kernel guys append stuff. (Not
> that
> we'd parse the result if they did, because the for loop on line 764 is
> traversing from SLOT_ppid to SLOT_upticks so our iteration count is based
> on the
> array not the input we read.)
>

(i think for ps, the interesting question is "why does toybox top take 10%
cpu?". and although i don't know the answer to that, i don't think that's
toybuf's fault :-) )


> > is it _worth_ it? don't know. what's the _optimal_ size? don't know. (and
> > probably depends on the specific toy, and 4096 is clearly a sensible
> _lower_
> > bound...)
>
> Optimization is generally about a use case, you never know if you've
> actually
> HELPED until you can benchmark the result. I can see hardware having moved
> out
> from under us in the past 15 years, with block granularity switched to 4k,
> sending 4k pages that aren't guaranteed to be aligned could fairly
> regularly
> have read/edit/write cycles on two blocks it's straddling, although the
> page
> cache and I/O scheduler should really hide all that. (The cpu's also gonna
> be
> doing that under the covers but the _cpu_cache_ should hide that. It's
> really
> operating at cache line granularity and we should be way above that? I'd
> think
> burst read/write should be on the far side of L2 for any system you care
> about?)
>
> Last I checked the "new" (like, 10 years now) kernel pipe buffers are 128k
> and
> there are at most 32 of them? So if you're sending between processes maybe
> that
> would come up, but again... need more info. :)
>
> The place I'd think it would come up most would be sendfile, which is
> using a
> syscall these days and only falling BACK to a loop over libbuf...
>
> Oh, hang on:
>
>       if (bytes<0 || bytes>(1<<30)) len = (1<<30);
>       // glibc added this constant in git at the end of 2017, shipped
> 2018-02.
>       // Android's had the constant for years, but you'll get SIGSYS if
> you use
>       // this system call before Android U (2023's release).
> #if defined(__NR_copy_file_range) && !defined(__ANDROID__)
>       len = syscall(__NR_copy_file_range, in, 0, out, 0, len, 0);
> #else
>
> That might be worth looking into...
>

well, the _build_ performance is all on the host. (though until we switch
from host glibc to host musl, we probably don't have __NR_copy_file_range
defined anyway because of our ancient glibc.)

as for the device, we're caught between caring about the current OS version
and wanting the static binary to be able to run on old OS versions. we can
trivially remove this, but it will break any calling code for pre-U (which
is literally "everything currently in anyone's hands"). right now we don't
have any known users of the _device_ toybox static prebuilt, but changing
this would likely preclude that.

if we want to go this route, we'll probably have to go with something like
a signal handler for SIGSYS and a boolean. or a weak declaration of the
copy_file_range() *function*? no, because that's useless for the static
linking we're concerned about.

oh, wait... we should just be able to use android_get_device_api_level()...

i think something like this should work?
```
diff --git a/lib/portability.c b/lib/portability.c
index a9f28154..b205d7b7 100644
--- a/lib/portability.c
+++ b/lib/portability.c
@@ -626,10 +626,19 @@ long long sendfile_len(int in, int out, long long
bytes, long long *consumed)
     if (try_cfr) {
       if (bytes<0 || bytes>(1<<30)) len = (1<<30);
       // glibc added this constant in git at the end of 2017, shipped
2018-02.
-      // Android's had the constant for years, but you'll get SIGSYS if
you use
+#if defined(__NR_copy_file_range)
+  #if defined(__ANDROID__)
+      // Android's had the *constant* for years, but you'll get SIGSYS if
you use
       // this system call before Android U (2023's release).
-#if defined(__NR_copy_file_range) && !defined(__ANDROID__)
+      if (android_get_device_api_level() >= __ANDROID_API_U__) {
+        len = syscall(__NR_copy_file_range, in, 0, out, 0, len, 0);
+      } else {
+        errno = EINVAL;
+        len = -1;
+      }
+  #else
       len = syscall(__NR_copy_file_range, in, 0, out, 0, len, 0);
+  #endif
 #else
       errno = EINVAL;
       len = -1;
```


> Rob
>
_______________________________________________
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net

Reply via email to