Hi Norman,
thanks for your answer. Some thoughts below.

> it is interesting to learn more about the context of your work with Go.
A couple of years ago I had a project to implement Docker support on a new
microkernel OS.
As a starting point I had to try this on top of Genode (because the project
owners could not make the OS sources available to me at that time).
Initially I thought I would have to integrate partitioning with containers,
i.e. have a single container per OS partition. In the end I only had to
support a set of containers inside a single OS partition, using the Linux
emulation layer provided by the OS.
Later I found that the main problem was not fixing the kernel, drivers, etc. -
the problem was that all Docker support is implemented in Go.
So I had to port a couple of million LOC written in Go (AKA Docker support),
starting with the Go runtime itself (another 1.2M LOC of Go and C, which
touches ALL available syscalls/services/etc. of the underlying OS and requires
a good understanding of the subtle differences between OS APIs).
I got this work half-done for Genode, then switched back to the main OS (where
I later finished everything - the port of the runtime and the port of Docker
inside a single partition).

Now I have returned from that old project and want to finish the undone work
for Genode, as a testbed for the initial idea of integrating Docker and OS
partitions 1 <-> 1, probably using the libc port. I have no formal customers
for it, just my curiosity.

> You said that you are not a Go programmer yourself. But do you happen to
> have users of your Go runtime to get their feedback?
> 

About users - not sure; I published my patches only recently and have no
feedback yet. Go is actively used by developers, so I hope it will be easy to
bring some application software to Genode (e.g. different handling of HTTP
stuff). Anyway, the current lack of customers will not stop me from the
second part of my research.

I have to compile and run the Docker support code inside Genode - a couple of
million lines of Go that heavily use the OS system APIs, including POSIX and
its dialects.

So, I am considering making "go build" run natively inside Genode inside QEMU.
The first step was to get TCP support integrated with the Go runtime - done.
Next will be native (non-cross) gccgo support.
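
For reference, the kind of smoke test I have in mind for the TCP integration
is tiny - a loopback echo exercises the netpoller, goroutine scheduling and
sockets all at once (an illustrative sketch, not my actual test code):

  package main

  import (
      "bufio"
      "fmt"
      "net"
  )

  func main() {
      ln, err := net.Listen("tcp", "127.0.0.1:0") // any free port
      if err != nil {
          panic(err)
      }
      go func() {
          conn, _ := ln.Accept()
          defer conn.Close()
          line, _ := bufio.NewReader(conn).ReadString('\n')
          conn.Write([]byte(line)) // echo the line back
      }()
      conn, err := net.Dial("tcp", ln.Addr().String())
      if err != nil {
          panic(err)
      }
      fmt.Fprintln(conn, "hello from genode")
      reply, _ := bufio.NewReader(conn).ReadString('\n')
      fmt.Print(reply)
  }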

>> Like namespaces based isolation (read: ability to have same names/id’s/etc 
>> in different domains for objects and anything provided by the Genode to user 
>> apps, together with additional related API). At least for app snapshotting, 
>> migration and persistency this is «the must». They are not so necessary for 
>> containers themselves, there are support of some platforms without  it, as 
>> well without dedicated layered FS (unions and similar like 
>> auFS/btrfs/zfs/etc - while it is good to have it).
> 
> I think the two aspects OS-level virtualization and
> snapshotting/persistency should best be looked at separately.
> 
> Regarding OS-level virtualization, Genode's protection domains already
> provide the benefit of being light-weight - like namespaces when
> compared to virtual machines - while providing much stronger isolation.
> Each Genode component has its private capability space after all with no
> sharing by default. Hence, OS-level virtualization on Genode comes down
> to hosting two regular Genode sub systems side by side.

A general note.
When we created container-based OS virtualization at SWsoft/Virtuozzo/
Parallels (we called it a "virtual environment" back in 2000), we assumed 3
main pillars, keeping in mind that we wanted to use it as a base for hosting
in a hostile environment with open, unlimited access from the Internet to the
containers:

1. namespace virtualization, not only to isolate resources but to be able to
have the same pid and related resources in different containers (for Unix,
think at least of emulating an init process with the pre-defined pid 1) - a
Linux sketch follows after this list

2. file-system virtualization to allow COW and transparent sharing of the same
files (e.g. the apache executable among hundreds of container instances) to
preserve kernel memory and object space (as opposed to VMs, where you cannot
efficiently share files and data structures between instances) - the key to
the high scalability and performance of containers, and for Docker also the
key to the "encapsulation of changes" paradigm. Sharing a single kernel
instance is a broad paradigm - it allows optimizing kernel structure
allocation, resource sharing, a single instance of the memory allocator, etc.

3. ALL resource limits applied per container (we called this
"userbeancounters"), which prevents any attempt at a DoS attack from one
container against another or against the host.
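
To make pillar 1 concrete: on Linux the mechanism boils down to clone flags.
A minimal Go sketch (Linux-only, needs root; the shell is just a placeholder
payload):

  package main

  import (
      "os"
      "os/exec"
      "syscall"
  )

  func main() {
      cmd := exec.Command("/bin/sh") // placeholder container payload
      cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
      cmd.SysProcAttr = &syscall.SysProcAttr{
          // Fresh pid numbering, mount table and hostname: pid 1
          // inside the namespace maps to a different pid outside.
          Cloneflags: syscall.CLONE_NEWPID |
              syscall.CLONE_NEWNS |
              syscall.CLONE_NEWUTS,
      }
      if err := cmd.Run(); err != nil {
          panic(err)
      }
  }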

Every container was initially meant to be like a remotely accessible,
complete instance of Linux with root access and an init process, but without
the ability to load its own device drivers. We implemented this first for
Linux, later for FreeBSD/Solaris (partially) and Windows (based on a
hot-patching technique and their Terminal Server), and considered Mach/macOS
Darwin (just experiments).
For Linux and Windows it was a commercial-grade implementation and is still
used by millions of customers.

By now, all these features (maybe except the file systems, although
zfs/btrfs/overlayfs/etc. offer something similar) have become part of most
commercial OSes available on the mass market.
IMHO they could have been implemented cheaply from the very beginning of OS
kernel development - everything was in place, except the understanding of why
this is necessary outside of simple testing environments for developers.
Maybe it is also time for Genode to think in this direction?

Returning to Genode:

The reason for the existence of namespaces (ns) is not only isolation; it is
a bit broader. One aspect is the ability to disallow the «manual
construction» of object ids/handles/capabilities/references/etc. in order to
access something that should not be visible at all.

For example, in ns-isolated containers I should not be able to send a signal
to an arbitrary process by name (a pid in our case), even if it exists in the
kernel.
Or vice versa - use some pre-defined process ids to do something (e.g. Unix
likes to attach orphans to pid 1 and later tries to enumerate them; this has
to be emulated somehow when porting user-level software, and for Linux Docker
it is important). A sketch of both effects follows below.
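
Both effects are easy to observe on Linux. A hedged Go sketch (the re-exec
trick is standard; the "host pid" 12345 is made up for illustration):

  package main

  import (
      "fmt"
      "os"
      "os/exec"
      "syscall"
  )

  func main() {
      if os.Getenv("IN_NS") == "1" {
          // First process in the new PID namespace: we are pid 1,
          // the reparenting target for orphans.
          fmt.Println("pid inside the namespace:", os.Getpid())
          // A pid that exists only outside is simply not addressable.
          err := syscall.Kill(12345, syscall.SIGTERM) // hypothetical host pid
          fmt.Println("kill(12345):", err) // ESRCH: no such process
          return
      }
      cmd := exec.Command("/proc/self/exe") // re-exec ourselves
      cmd.Env = append(os.Environ(), "IN_NS=1")
      cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
      cmd.SysProcAttr = &syscall.SysProcAttr{Cloneflags: syscall.CLONE_NEWPID}
      if err := cmd.Run(); err != nil {
          panic(err)
      }
  }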

In the case of Genode, I can probably create and keep a capability (with a
data pointer inside) and perform operations with it, as long as I store it
somewhere.
If this capability were virtualized, we would have an additional level of
control over it (through the creation of pre-defined caps and an explicit
level of restriction, even if the capability is intentionally shared during
an initialization procedure that may be part of legacy software being ported
to Genode).

For better understanding, a use case:
imagine that you want to port an application that uses a third-party library
which opens some exotic file descriptors during initialization.
A good example is Docker itself - when you exec a process inside a Docker
container, you typically do not want it to inherit the descriptors opened by
your main process, including stdin/stdout (typically this is achieved via the
CLOEXEC flag - but consider its absence: you may simply not know that those
descriptors exist).
Technically this code runs inside your application, but it was initialized by
a third-party library linked into it, and you have no easy way to control it.
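
The usual per-descriptor fix looks like this in Go (a sketch; note that Go's
own runtime already opens everything with CLOEXEC, so in practice the problem
comes from descriptors handed over by C libraries):

  package main

  import (
      "os"
      "syscall"
  )

  func main() {
      // Stand-in for a descriptor opened by third-party code.
      f, err := os.Open("/etc/hostname")
      if err != nil {
          panic(err)
      }
      defer f.Close()

      // Without FD_CLOEXEC the descriptor survives exec() and leaks
      // into the child; with it, the kernel closes it on exec.
      _, _, errno := syscall.Syscall(syscall.SYS_FCNTL, f.Fd(),
          syscall.F_SETFD, syscall.FD_CLOEXEC)
      if errno != 0 {
          panic(errno)
      }
  }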

A ns implementation has simple rules to maintain group isolation, and it is
not considered unnecessary even in the Linux kernel, which has its own
capability set.
I think namespaces are a convenient way to handle these legacy-related
questions, and they are worth having at the Genode level, where you already
have wrappers around the native low-level kernel calls.

And for snapshotting (see the comments below) this is a must - I need to
re-create all objects with the same ids even if those ids already exist in
other threads/sessions/processes, because the ids could be stored in
«numerical form» inside user thread memory.

As for a file system like overlayfs - not sure; I assume it is possible to
port one of the known FSes to Genode, but it is not a first-priority task
(Docker on Windows does not have it).

As for resource accounting and limits - I have not tackled this topic for
Genode at all.

> 
> The snapshotting/persistency topic is not yet covered. But I see a rather
> clear path towards it, at least for applications based on Genode's libc.
> In fact, the libc already has the ability to replicate the state of its
> application as part of the fork mechanism. Right now, this mechanism is
> only used internally. But it could be taken as the basis for, e.g.,
> serializing the application state into snapshot file. Vice versa,
> similar to how a forked process obtains its state from the forking
> process, the libc could support the ability to import a snapshot file at
> startup. All this can be implemented in the libc without changing
> Genode's base framework.
> 
> That being said, there is an elephant in the room, namely how POSIX
> threads fit into the picture. How can the state of a multi-threaded
> application be serialized in a consistent way? That would be an
> interesting topic to research.

I think we can follow the ideas developed for the CRIU patch for Linux [1];
there is no need to invent something overly complex. Quoting their
description:
it can freeze a running container (or an individual application) and
checkpoint its state to disk. The data saved can be used to restore the
application and run it exactly as it was during the time of the freeze. Using
this functionality, application or container live migration, snapshots,
remote debugging, and many other things become possible.

In short, they utilize existing Linux kernel syscalls like ptrace and add a
very small set of missing ones for enumerating process-related objects [2].
This does not mean that you need ptrace as such - it is just used as a kind
of auxiliary interface for obtaining information about processes, and it
could be implemented in different ways.
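
The pattern is roughly this (a Go sketch of the idea, not CRIU's actual code;
the target pid is made up): ptrace only stops the task, and the state itself
is harvested from /proc:

  package main

  import (
      "fmt"
      "os"
      "syscall"
  )

  func main() {
      pid := 12345 // hypothetical target pid

      // Stop the task; ptrace here is only an auxiliary interface.
      if err := syscall.PtraceAttach(pid); err != nil {
          panic(err)
      }
      var ws syscall.WaitStatus
      if _, err := syscall.Wait4(pid, &ws, 0, nil); err != nil {
          panic(err)
      }

      // With the task quiesced, read its state from /proc.
      maps, _ := os.ReadFile(fmt.Sprintf("/proc/%d/maps", pid))
      fds, _ := os.ReadDir(fmt.Sprintf("/proc/%d/fd", pid))
      fmt.Printf("memory map: %d bytes of text, open fds: %d\n",
          len(maps), len(fds))

      syscall.PtraceDetach(pid)
  }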

To stop (freeze) a set of related processes (a tree), even with POSIX, they
use a feature known as cgroups [3] (which can be considered a part of ns
virtualization). Quoting the freezer documentation:
The freezer allows the checkpoint code to obtain a consistent
image of the tasks by attempting to force the tasks in a cgroup into a
quiescent state. Once the tasks are quiescent another task can
walk /proc or invoke a kernel interface to gather information about the
quiesced tasks. Checkpointed tasks can be restarted later should a
recoverable error occur. This also allows the checkpointed tasks to be
migrated between nodes in a cluster by copying the gathered information
to another node and restarting the tasks there.
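
Driving the cgroup-v1 freezer is just writes to a pseudo file system. A
sketch in Go (the group name and pid are made up, and the freezer hierarchy
must already be mounted):

  package main

  import (
      "fmt"
      "os"
      "path/filepath"
  )

  const grp = "/sys/fs/cgroup/freezer/ckpt-demo" // hypothetical group

  func main() {
      if err := os.MkdirAll(grp, 0755); err != nil {
          panic(err)
      }
      // Put the target task into the group (children follow on fork).
      if err := os.WriteFile(filepath.Join(grp, "cgroup.procs"),
          []byte("12345"), 0644); err != nil { // hypothetical pid
          panic(err)
      }
      // Ask the kernel to force the group into a quiescent state.
      if err := os.WriteFile(filepath.Join(grp, "freezer.state"),
          []byte("FROZEN"), 0644); err != nil {
          panic(err)
      }
      state, _ := os.ReadFile(filepath.Join(grp, "freezer.state"))
      fmt.Printf("freezer.state: %s", state) // FREEZING, then FROZEN
      // ... checkpoint via /proc here, then thaw:
      os.WriteFile(filepath.Join(grp, "freezer.state"), []byte("THAWED"), 0644)
  }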

It seems that features similar to ns and cgroups should be the first
part/base of a checkpoint/restore implementation.
Of course, the serialization part related to fork/libc that you mention could
also be another pillar.

In general, I think that to implement snapshotting we need to:
1. freeze a set of threads (or make them COW, e.g. for memory changes)
2. enumerate the threads
3. enumerate the related objects/states (e.g. file descriptors/pipes/etc.)
4. enumerate the virtual memory areas and the «shared resources» between
threads
5. enumerate the network stack/socket states (a slightly different beast)
6. dump everything (a sketch of steps 3, 4 and 6 follows below)
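
On Linux, steps 3, 4 and 6 can be approximated from user space alone. A
hedged Go sketch of what an image of a single (already frozen) task could
look like (pid, file name and format are all made up):

  package main

  import (
      "encoding/json"
      "fmt"
      "os"
  )

  type taskImage struct {
      Pid  int               `json:"pid"`
      Fds  map[string]string `json:"fds"`  // fd number -> target
      Maps string            `json:"maps"` // raw VMA list
  }

  func main() {
      img := taskImage{Pid: 12345, Fds: map[string]string{}} // hypothetical

      // Step 3: what does each descriptor point to?
      entries, err := os.ReadDir(fmt.Sprintf("/proc/%d/fd", img.Pid))
      if err != nil {
          panic(err)
      }
      for _, e := range entries {
          target, _ := os.Readlink(
              fmt.Sprintf("/proc/%d/fd/%s", img.Pid, e.Name()))
          img.Fds[e.Name()] = target // e.g. "socket:[1234]" or a file path
      }

      // Step 4: the virtual memory layout.
      maps, _ := os.ReadFile(fmt.Sprintf("/proc/%d/maps", img.Pid))
      img.Maps = string(maps)

      // Step 6: dump everything.
      out, _ := json.MarshalIndent(img, "", "  ")
      os.WriteFile("task-image.json", out, 0644) // hypothetical image file
  }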

For restore we need not only to create objects with the same numerical ids
(and even the same memory layout); we also need an API to force every object
into the same content/state with the related security/ns context, and to
restore the «sharing» and the parent/child relations of objects, if any,
between different threads/processes/sessions/etc.

Related to this topic:
we also need to be able to bring some drivers back to the same state, because
during checkpoint/restore we assume the external connections are in a known
state (e.g. imagine a video driver and an application that draws on the
screen: the content of the video memory is part of the application's state,
yet it is not stored in the application). This is probably related to the
restartable-drivers feature (and the related fault-tolerance questions).

Note: by the way, one of the key problems of the CRIU patch at the moment is
the inability to restore the graphical screen for X/etc. We can restore the
related sockets, but the protocol exchanges that would need to be replayed
are not known at the time you request the checkpoint. I think there are no
people available now who know the X protocol details necessary for this
operation... but that is a different story, not directly related to the
Genode questions.

[1] https://criu.org/Main_Page
[2] https://criu.org/Checkpoint/Restore
[3] https://www.kernel.org/doc/Documentation/cgroup-v1/freezer-subsystem.txt

Sincerely,
        Alexander


