Re: [Toybox] Impact of global struct size

Rob Landley Wed, 03 Jan 2024 09:30:23 -0800

I note that I've written over a hundred lines of rant in response to his
previous email already. I should dig back through this and turn it into proper
documentation at some point. (Especially since Elliott knows more of this stuff
than I do so I'm likely to get corrected a lot here...)

On 1/2/24 20:54, enh wrote:
>> You can look at /proc/self/maps (and /proc/self/smaps, and
>> /proc/self/smaps_rollup) to see them for a running process (replace "self"
>> with any running PID, self is a symlink to your current PID). The six
>> sections are:
>>
>>   text - the executable functions: mmap(MAP_PRIVATE, PROT_READ|PROT_EXEC)
>>   rodata - const globals, string constants, etc: mmap(MAP_PRIVATE, PROT_READ)
>>   data - writeable data initialized to nonzero: mmap(MAP_PRIVATE, PROT_WRITE)
>>   bss - writeable data initialized to zero: mmap(MAP_ANON, PROT_WRITE)
>>   stack - function call stack, also contains environment data
>>   heap - backing store for malloc() and free()
> 
> (Android and modern linux distros require the relro section too.

I thought that was only needed for dynamic linking? Then again you don't allow a
lot of static stuff to run on the final system anyway...

(The line between PIE and dynamic linking confuses even me. How does static PIE
relocate itself? I _think_ I looked it up once, but "it statically links in a
dynamic linker in the pile of crt1.o and begin.o files" _can't_ be right...)

> interestingly, there _is_ an elf program header for the stack, to
> signal that you don't want an executable stack. iirc Android and [very
> very recently] modern linux distros won't let you start a process with
> an executable main stack, but afaik the code for the option no-one has
> wanted/needed for a very long time is still in the kernel.)

Cool.

These days there's also vdso and vvar, which are provided by the kernel at
runtime. The first is a .text section with magic functions you can call as an
alternative to syscalls, and the second is a magic .rodata section that provides
volatile variables the OS updates which you can just reach out and look at.

Between the two of them you can do things like check the current timestamp
without a system call. What they actually provide varies by OS (and then your
libc has to be taught to use each new capability out of there instead of making
the syscalls).

"cat /proc/self/maps" and they're the last two entries if present.

There is a "man 7 vdso" but I dunno how up to date it is. (Which gets us back to
Michael Kerrisk's retirement and the new guy NOT MAINTAINING A WEB COPY. Grrr.)

Maintaining backwards compatibility means keeping a lot of old stuff. I had a
talk with Rich Felker last night on IRC about what musl-libc's syscall
requirements actually _are_, and what it would take to repot it on top of a
posix-ish RTOS du jour. (Makes the trusting trust cleansing cycle smaller if you
can cross compile Linux from an RTOS...)

We didn't come to a conclusion, but I _did_ get permission from skarnet to use
his git://git.skarnet.org/mdevd under 0BSD. (Porting that to toybox seems easier
than bringing my old mdev code up to speed for all the
https://github.com/slashbeast/mdev-like-a-boss stuff it's grown since I handed
it off.)

>> The first three of those literally exist in the ELF file, as in it mmap()s a
>> block of data out of the file at a starting offset, and the memory is thus
>> automatically populated with data from the file. The text and rodata ones
>> don't really care if it's MAP_PRIVATE or MAP_SHARED because they can never
>> write anything back to the file, but the data one cares that it's
>> MAP_PRIVATE: any changes stay local and do NOT get written back to the file.
>> And the bss is an anonymous mapping so starts zeroed, the file doesn't bother
>> wasting space on a run of zeroes when the OS can just provide that on
>> request. (It stands for Block Starting Symbol which I assume meant something
>> useful 40 years ago on DEC hardware.)
> 
> (close, but it was IBM and the name was slightly different:
> https://en.wikipedia.org/wiki/.bss#Origin)

That says United Aircraft Corporation named it using IBM 704 hardware in an
assembler and then in fortran. (I only give wikipedia[citation needed] about an
80% chance to be accurate about any given fact, but am not root causing it right
now. :)

I like to track down magic acronyms, ala grep meaning "get regular expression".
I once emailed Dennis Ritchie to ask what "inode" meant:

https://lkml.iu.edu/hypermail/linux/kernel/0207.2/1182.html

But in this case I stopped paying attention once I confirmed it doesn't mean
anything of modern relevance.

The interesting part (to me) is that the name predates unix by almost 20 years
(mainframe legacy predating even the PDP-1), and predates ELF by 40 years. (The
first OS with ELF binaries was Solaris 2.0 released in 1992. Linux switched over
3-4 years later.)

If it wasn't a legacy acronym from shortly after world war II, it would probably
be called something like the "zero section" and we wouldn't have to memorize
what it means. :)

>> The stack is also set up by the kernel, and is funny in three ways:
>>
>> 1) it has environment data at the end (so all your inherited environment
>> variables, and your argv[] arguments, plus an array of pointers to the start
>> of each string which is what char *argv[] and char *environ[] actually point
>> to. The kernel's task struct also used to live there, but these days there's
>> a separate "kernel stack" and I'd have to look up where things physically are
>> now and what's user visible.
> 
> (plus the confusingly named "ELF aux values", which come from the
> kernel, and aren't really anything to do with ELF --- almost by
> definition, since they're things that the binary _can't_ know like
> "what's the actual page size of the system i'm _running_ on?" or
> "what's the l1d cache size of the system i'm _running_ on?".)

Are they in the stack? I know the pointer is passed to _start() (often not in a
proper argument, in a REGISTER), but hadn't tracked down where it actually
lived. Stack makes sense...

Sadly, I have had to care about the auxiliary vector on far too many occasions:

man 3 getauxval

>> 3) The stack generally has _two_ pointers, a "stack pointer" and a "base
>> pointer" which I always get confused. One of them points to the start of the
>> mapping (kinda important to keep track of where your mappings are), and the
>> other one moves (gets subtracted from and added to and offset to access local
>> variables).
> 
> (s/base pointer/frame pointer/ for everything except x86. and actually
> _both_ change. it's the "base" of the current stack _frame_, not the
> whole stack. for a concrete example: alloca() changes the stack
> pointer, but not the frame pointer. so local variables offsets
> relative to fp will be constant throughout the function, whereas
> offsets relative to sp can change. [stacked values of] fp is also what
> you're using when you're unwinding.)

I only implemented alloca() for my tinycc fork on 32-bit x86, and that was back
in 2008.

I'm hoping to sit on tonight's https://meet.jit.si/golug at 6pm about creating a
compiler with a recursive descent parser, and someday hope to read
https://norasandler.com/2017/11/29/Write-a-Compiler.html and the corresponding
https://nostarch.com/writing-c-compiler and https://github.com/nlsandler/nqcc
but right now restarting my https://landley.net/code/qcc is not even on the back
burner...

>> All this is ignoring dynamic linking, in which case EACH library has those
>> first four sections (plus a PLT and GOT which have to nest since the shared
>> libraries are THEMSELVES dynamically linked, which is why you need to run ldd
>> recursively when harvesting binaries, although what it does to them at
>> runtime I try not to examine too closely after eating). There should still
>> only be one stack and heap shared by each process though.
> 
> (one stack _per thread_ in the process. and the main thread stack is
> very different from thread stacks.)

A thread is a process with brain damage inherited from solaris' limitations, but
you're right. I just mentally gloss over threads as "process with training
wheels and 5x the debugging effort".

Even before the ~7 year period where I thought java was a good idea, I had to
use threading VERY EXTENSIVELY on OS/2. The "workplace shell" desktop was a
single process with many, many threads, so any desktop programming there meant
creating a shared library the workplace shell process would dlopen() and launch
threads for. I got very, very good at debugging thread issues, once upon a time.
(And I've debugged a lot of OTHER people's threading issues as a consultant. The
oil exploration company that bought three different programs and mushed them
together into a single highly threaded process that leaked like a sieve and
segfaulted randomly. The 2018 project that replaced WinCE with Linux when
microsoft end-of-lifed wince, resulting in an 80 thread application process,
half of which were C# code running in mono and the other half were linux native
code sharing the same address space, and the PROBLEM was on the ~200 mhz
deployment hardware they had a warehouse full of and wanted to keep selling,
fork() caused a 75 millisecond latency spike in EVERY OTHER THREAD because the
kernel took one look at that mess and locked the whole vma until fork() had
finished copying everything, which meant a thread spawning a child process
caused the token-ring-like bus to timeout and drop connection. Which meant I got
to do a real world use of vfork() on a system with an MMU, because that only
suspends the PARENT thread, not all the other threads in the process, and
vfork()/exec() isn't that much harder to program around than fork()/exec().)

My modern reaction to dealing with threads is...

https://www.youtube.com/watch?v=hlVwbpm4eHI

They're SOMETIMES the right tool for the job? Occasionally? Maybe?

>> If you launch dozens of instances of the same program, the read only sections
>> (text and rodata) are shared between all the instances. (This is why nommu
>> systems needed to invent fdpic: in conventional ELF everything uses absolute
>> addresses, which is fine when you've got an MMU because each process has its
>> own virtual address range starting at zero. (Generally libc or something will
>> mmap() about 64k of "cannot read, cannot write, cannot execute" memory there
>> so any attempt to dereference a NULL pointer segfaults, but other than
>> that...)
>>
>> But shared libraries need to move so they can fit around stuff. Back in the
>> a.out days each shared library was also linked at an absolute address (just
>> one well above zero, out of the way of most programs), meaning when putting
>> together a system you needed a registry of what addresses were used by each
>> library, and you'd have to supply an address range to each library you were
>> building as part of the compiler options (or linker script or however that
>> build did it). This sucked tremendously.
> 
> (funnily enough, this gets reinvented as an optimization every couple
> of decades. iirc macOS has "prelinking" again, but Android is
> currently in the no-prelinking phase of the cycle.)

The old line about how there are two hard problems in computer science: naming
things, cache invalidation, and fencepost errors. This falls under "cache
invalidation", which more generically is "object lifetime rules".

The really FUN one is the horrible trick people did on various embedded systems
for fast boot, or on OpenVZ as part of the live migration, where they'd
basically core dump a process, load it into a debugger, and resume. Thus
skipping all the setup! (Assuming NOTHING HAS CHANGED in the context the resumed
process expects around it. Luckily X11 has "detach and restart" plumbing that
lets it reopen a process's network pipe without killing the window or the
process, because network connections hanging and needing retry isn't a new
thing.)

Sigh, I did a whole rant about what would be involved in kernel upgrades without
reboots way back in 2002:

https://lkml.iu.edu/hypermail/linux/kernel/0206.2/0610.html
https://lkml.iu.edu/hypermail/linux/kernel/0206.2/0835.html
https://lkml.iu.edu/hypermail/linux/kernel/0206.2/1244.html

And I was just going "this is _hard_" but people tracked me down from that and
had me help IMPLEMENT some of that stuff over the years. The hard part was that
processes act in GROUPS: parent/child relationships and pipelines and so on, and
the kernel had no way to group processes. Enter "container" support, and me
helping the parallels/OpenVZ guys explain _why_ the kernel could benefit from
it. (The number of times I've been hired as a programmer and wound up spending
most of my energy as a combination tech writer and marketer...)

Sigh, I gotta go get on an airplane now, so stopping here for the moment...

Rob
_______________________________________________
Toybox mailing list
[email protected]
http://lists.landley.net/listinfo.cgi/toybox-landley.net
