On 4/11/23 19:00, enh wrote: > On Tue, Apr 11, 2023 at 12:10 PM Rob Landley > > (unrelated, i've been meaning to ask whether we should make toybuf > larger > > anyway. 4KiB is really small for modern hardware, though at the same > time > > it does make it more likely that we test all the "toybuf too small, > loop" > > cases even with small test inputs...) > > A) not a fan of asserts. > > i don't like assert(), but **static_assert** is really useful for things like > this where you want to say "this code makes an assumption that you can test at > compile time".
Compile time is the time to care about that sort of thing, yes. Kinda wonder about portability on weirdness like qnx (or the guy in email who's asking about uclibc-ng). I suspect any such asserts would be a CFG_DEBUG option maybe in portability.c? Hmmm... > B) it was only ever coincidentally page size, and huge pages are a thing > even on > x86. > > > well, huge pages are different from non-4KiB non-huge pages. There was a lot of talk a while back about getting the kernel to dynamically use them (false starts when I was reading about it), but I don't follow lwn.net or lkml nearly as closely in 2023 as I did in 2018. It just got too unpleasant even to check over the weekly web archive... > i think it's only > arm64 where you're at all likely to actually have your page size not be 4KiB. > (all macs and iphones, for example. i _think_ all the linux distros that tried > to move gave up?) What _is_ Mel Gorman up to these days? He last updated his blog in 2016, and last tweeted in 2020... > I never annotated toybuf or libbuf with any sort of alignment directive > or tried > to make it come first in its segment (toybuf and libbuf are the fifth and > sixth > globals defined in main.c), so they're both reasonably likely to straddle > page > boundaries anyway. Heck, I'm not even sure it's cache line aligned. The > actual > _guarantee_ is something like 4 bytes, except when it suddenly isn't. I > fought > with this in 2021 trying to get a simple "hello world" kernel out of gcc > without > needing a linker script: https://landley.net/notes-2021.html#12-04-2021 > <https://landley.net/notes-2021.html#12-04-2021> > > now you're on C11, you can easily say this: > https://en.cppreference.com/w/c/language/_Alignas Hmmm... Does it actually help to page align them, do you think? Not sure how to benchmark that... > The 4096 is just a convenient scratch pad size. I use sizeof(toybuf) in a > bunch > of places... and hardwire in the knowledge of its size in a bunch of > others. > Plus there's a bunch of implicit "toybuf and/or this slice of it is big > enough > to stick this struct in, so I can safely typecast the pointer" instances I > checked at the time (and all of them had a big fudge factor in case of > future > glibc bloat). > > It's really a "convenient granularity" thing. Copy loops doing > byte-at-a-time > stuff is known terrible because the library and syscall execution paths > come to > dominate, and grouping it into 4k blocks is 12 doublings of efficiency > right > there. Going to 64k is 1/16th as much syscalls, which is not as big a > deal as > 1/4000th as many syscalls. And then raises the question "why not a > megabyte > then" which is something you don't just casually want to do on embedded > devices > without thinking about it (might as well malloc there)... > > I could probably be talked into bumping it up to 64k if somebody measured > numbers saying it would help something specific? > > > i think the time i noticed this was when i was looking into "where the time > went" and noticed that a 64KiB buffer was quite helpful, at least on the scale > of "an entire Android build" type of thing. Which counts as a good reason to increase it, but is that "bigger toybuf helps in general" or "some copy loops should use a malloc buffer instead of toybuf"? My todo items here tend to be about limitations, things like in ps.c get_ps() puts struct procpid in toybuf so the strings it's reading are cumulatively finite (seem the comment around line 878) which mostly impacts cmdline, but with modern kernel changes that's 10 megabytes per entry so it's probably gotta have SOME sort of limit it's reading. :) That case does a deferred malloc on ps.c line 1000 to copy the result out of toybuf, so in theory the initial pointer could be replaced by an xmalloc(65536) and then the current malloc would become a remalloc() to trim it down to what's actually used. The size calculations are mostly sizeof(toybuf) so could swap to the new size. Well, one exception: the 2048 on line 747 is sort of "half of toybuf" but it's really just that /proc/$PID/stat is never gonna be longer than this because wc /proc/$$/stat says 52 entries, one of which is the command name (which has to be a filename which means it's limited by the VFS maximum of 255 bytes plus the enclosing parentheses for 257 which I rounded to 260 because null terminator and 4 byte alignment) and then all the OTHER entries are 64 bit numbers printed out in decimal so 21 bytes each with the space between them, 52*21+260 is 1382, and I gave it room for future expansion because kernel guys append stuff. (Not that we'd parse the result if they did, because the for loop on line 764 is traversing from SLOT_ppid to SLOT_upticks so our iteration count is based on the array not the input we read.) > is it _worth_ it? don't know. what's the _optimal_ size? don't know. (and > probably depends on the specific toy, and 4096 is clearly a sensible _lower_ > bound...) Optimization is generally about a use case, you never know if you've actually HELPED until you can benchmark the result. I can see hardware having moved out from under us in the past 15 years, with block granularity switched to 4k, sending 4k pages that aren't guaranteed to be aligned could fairly regularly have read/edit/write cycles on two blocks it's straddling, although the page cache and I/O scheduler should really hide all that. (The cpu's also gonna be doing that under the covers but the _cpu_cache_ should hide that. It's really operating at cache line granularity and we should be way above that? I'd think burst read/write should be on the far side of L2 for any system you care about?) Last I checked the "new" (like, 10 years now) kernel pipe buffers are 128k and there are at most 32 of them? So if you're sending between processes maybe that would come up, but again... need more info. :) The place I'd think it would come up most would be sendfile, which is using a syscall these days and only falling BACK to a loop over libbuf... Oh, hang on: if (bytes<0 || bytes>(1<<30)) len = (1<<30); // glibc added this constant in git at the end of 2017, shipped 2018-02. // Android's had the constant for years, but you'll get SIGSYS if you use // this system call before Android U (2023's release). #if defined(__NR_copy_file_range) && !defined(__ANDROID__) len = syscall(__NR_copy_file_range, in, 0, out, 0, len, 0); #else That might be worth looking into... Rob _______________________________________________ Toybox mailing list Toybox@lists.landley.net http://lists.landley.net/listinfo.cgi/toybox-landley.net