Hi Martin,

Thank you for your answers! I'll try to answer both your emails here.

Sure, but we still need to arrive at some equation for determining a
sensible default stack size, while allowing for both small values of
"--mem" (e.g. 8MB, or even less?) and large values.

I agree. Ideally the (uni)kernel developer is aware of the absence of an unbounded stack and codes accordingly. I would assume that the required stack size scales only logarithmically with the heap size; given such a relationship, even a modest default quickly puts us on the safe side as memory grows. And with a separate stack region, the user will also be clearly notified of a stack overflow.

Practically, I wonder where very long call chains can occur. For example, in OCaml it seems List.map is not tail-recursive, to avoid reversing the list. What are they using as a limit? For reference, ulimit -s seems to be 8M on Linux x86_64, and the default pthread stack size seems to be 2M. For sufficiently large instances (> 16M) we could just fix 2M; for smaller machines (minimum 512K) we could somehow scale down to 64K?
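To make that concrete, here is a rough sketch of such a default (in C; the mem/8 factor is purely my assumption, chosen so that a 16M instance gets 2M and a 512K instance gets 64K, matching the numbers above):

    #include <stdint.h>

    /* Sketch only, not Solo5 code: default stack size scaled with total
     * memory and clamped to the floor/ceiling discussed in this thread. */
    #define STACK_MIN (64UL << 10)   /* 64K floor for tiny instances */
    #define STACK_MAX (2UL << 20)    /* 2M ceiling, like pthread's default */

    static uint64_t default_stack_size(uint64_t mem_size)
    {
        uint64_t size = mem_size / 8;
        if (size < STACK_MIN)
            size = STACK_MIN;
        if (size > STACK_MAX)
            size = STACK_MAX;
        return size;
    }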

In my mind the way to do this consistently for all targets is to make the
"desired stack size" a property attached to the *unikernel binary* and
initially set by the unikernel developer, with the libOS build system
determining the default. The *operator* should be able to override this.
By "property of the unikernel binary" I mean something that is
declarative(!) and forms part of a "manifest" that is embedded into the
binary as an ELF note.

This sounds like a good solution. Maybe fix a default desired size of 2M and allow the developer to override it in the manifest. The operator can also adjust it by tuning some option. The application should then refuse to run if the desired size cannot be provided, or if the heap/stack ratio becomes too small.
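For illustration, such a manifest entry could look roughly like this as an ELF note; the note name, type, and payload layout here are invented for the sketch and are not the actual Solo5 manifest format:

    #include <stdint.h>

    /* Hypothetical "desired stack size" manifest entry as an ELF note. */
    struct stack_note {
        uint32_t namesz;              /* standard note header (Elf64_Nhdr) */
        uint32_t descsz;
        uint32_t type;
        char     name[8];             /* "Solo5" + NUL, padded for alignment */
        uint64_t desired_stack_size;  /* the declarative manifest value */
    };

    static const struct stack_note note
        __attribute__((section(".note.solo5.stack"), aligned(4), used)) = {
        .namesz = 6,
        .descsz = sizeof(uint64_t),
        .type   = 1,
        .name   = "Solo5",
        .desired_stack_size = 2 << 20,   /* the 2M default from above */
    };

The tender (or the operator's tooling) could then read and, if permitted, override this value before laying out memory.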

I am not sure about the "dynamic part". I would either set the stack size from outside (as discussed above), or allow some kind of memconfig call which can be executed only once, right after application startup. Alternatively, you could provide munmap and munmap_done; applications that want to stay dynamic would just never call munmap_done...
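In terms of an interface, that alternative could look something like this (the signatures are assumptions for the sake of the sketch):

    #include <stddef.h>
    #include <stdint.h>

    /* Punch a hole (e.g. a guard page) into the initially mapped region. */
    int solo5_munmap(uintptr_t addr, size_t len);

    /* Permanently disable solo5_munmap(); a "static" unikernel calls this
     * right after startup, a "dynamic" one simply never does. */
    int solo5_munmap_done(void);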

I think we can forget about multiple stacks for now. Having thought about
it, even having a single separate memory region for the stack is tricky
enough:

I wonder what the difference is between multiple stacks and multiple separate memory regions. If multiple separate memory regions were available, the application could use them as stacks and switch between them. Whether that makes sense is a question for the unikernel developers.

I thought a bit more about this idea of adding a munmap call and I don't
really like it.
Maybe the goal should be extended to allow multiple different memory areas, randomly distributed in the address space to leverage ASLR. One of those [...]

ASLR is a separate topic in itself; see here for a rough plan of what needs
to be done:

https://github.com/Solo5/solo5/issues/304

In that issue you are talking about randomizing the memory location of the binary (code & data)? I agree that this would be nice to have, but is that not a separate issue from how the heap memory regions are organized? Ideally all of it should be randomized: code, data, and heap.

No, this has already been discussed before. Dynamic memory allocation is
not on the cards.

Most recently
https://github.com/Solo5/solo5/issues/335#issuecomment-472499246 and
earlier https://github.com/Solo5/solo5/issues/223.

I looked quickly at 223 and 335, so maybe I missed some things. In 223 you threw out the malloc implementation from solo5, which makes perfect sense, since malloc should be handled at the application level; fine-grained memory allocation should not be provided by solo5.

In 335 you refuse to add mmap. But there seems to be agreement that the application/unikernel has no control over the virtual memory layout; this has to be managed by solo5. Does your refusal to add mmap also apply, in a very restricted sense, to the configuration phase?

Providing just a "munmap()" and nothing else might be a simpler way to get guard pages, or it might not. Anyway, one thing at a time. As I mentioned in my other email, let's ignore multiple stacks for now and just concentrate on how a separate stack region could be done.

I agree; for me, stack/heap separation is also the more important issue for avoiding corruption.

Multiple memory regions or guarded regions are only nice to have. But instead of adding munmap calls to configure guard pages, I think it is better to add two calls, solo5_mem_alloc() and solo5_mem_lock(), which also allows randomization of the addresses returned by solo5_mem_alloc(). This is what I proposed in the GitHub PR. For now I just exposed the already existing bump allocation scheme, but this could be randomized at some point. I argue that this is NOT dynamic memory allocation, but rather configuration of the memory layout until solo5_mem_lock() is called.
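As a sketch of the shape of that API (the signatures here are my illustration of the idea, not necessarily what the PR implements):

    #include <stddef.h>
    #include <stdint.h>

    /* Hand out the next block of available memory. With page-table
     * support, the returned address may be randomized and the block
     * followed by an unmapped guard gap; without it, this is just the
     * existing bump allocator. */
    int solo5_mem_alloc(size_t size, uintptr_t *addr_out);

    /* Freeze the layout: afterwards solo5_mem_alloc() always fails, which
     * is why this is configuration rather than dynamic allocation. */
    int solo5_mem_lock(void);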

Compare option I (rough sketch in code below):

1. app is informed of heap_start, heap_size, stack_start, stack_size
2. app calls solo5_munmap on parts of (heap_start, heap_start+heap_size) for guarding
3. app calls solo5_munmap_done to finish initialization
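In code, option I might look like this (using the solo5_munmap signatures assumed earlier; heap_start/heap_size as provided at start of day):

    #include <stddef.h>
    #include <stdint.h>

    extern int solo5_munmap(uintptr_t addr, size_t len);
    extern int solo5_munmap_done(void);

    void setup_guards(uintptr_t heap_start, size_t heap_size)
    {
        /* Step 2: punch a 4K guard hole into the middle of the heap. */
        solo5_munmap(heap_start + heap_size / 2, 4096);
        /* Step 3: freeze the layout. */
        solo5_munmap_done();
    }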

versus option II (sketch below):

1. app is informed of stack_start, stack_size, mem_avail
2. app calls solo5_mem_alloc to allocate mem_avail until it is exhausted
3. app calls solo5_mem_lock to finish initialization
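And option II, with the API sketched above:

    #include <stddef.h>
    #include <stdint.h>

    extern int solo5_mem_alloc(size_t size, uintptr_t *addr_out);
    extern int solo5_mem_lock(void);

    void setup_memory(size_t mem_avail)
    {
        uintptr_t heap, aux;

        /* Step 2: carve up mem_avail; solo5 may randomize the addresses
         * and insert guard gaps between the blocks. */
        solo5_mem_alloc(mem_avail / 2, &heap);
        solo5_mem_alloc(mem_avail / 2, &aux);

        /* Step 3: freeze the layout. */
        solo5_mem_lock();
    }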

Basically both options are equivalent, but in option II more responsibility is pushed from the application to solo5. On primitive targets without the ability to manipulate the page table, there won't be any difference, and solo5_munmap would be a no-op.

However, on targets which support modifying the page table, solo5 could *automatically* guard the regions by adding gaps after each solo5_mem_alloc block. Even better, the memory block locations could also be randomized. This is not possible in option I.

You mentioned in 335 that clone+munmap allowed privilege escalation. But in both schemes, options I and II, you need mmap/munmap functionality. However, the memory configuration should be finished by calling solo5_mem_lock after initialization. You could even enforce that the API is used correctly by unlocking the network and block devices only after solo5_mem_lock has been called ;)
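That enforcement could be as simple as a flag checked by the device hypercalls; purely illustrative, with assumed names:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    static bool mem_locked = false;

    int solo5_mem_lock(void)
    {
        mem_locked = true;            /* layout is frozen from here on */
        return 0;
    }

    /* Every network/block hypercall would start with this check. */
    int hypercall_net_write(const uint8_t *buf, size_t size)
    {
        if (!mem_locked)
            return -1;                /* devices unlock only after mem_lock */
        (void)buf; (void)size;        /* real device I/O would go here */
        return 0;
    }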

Daniel
