Utrace and process (partial) virtualization

Renzo Davoli Wed, 04 Feb 2009 03:35:20 -0800

Dear Roland and dear utrace developers,

I am already having some problems with utrace, specifically with
the utrace interface for (partial) virtual machines and
(again) with support for nesting utrace engines.


I am writing my point of view here for a general discussion.

This is the summary:
1- Virtual Machines may need to change the system call

2- UTRACE_SYSCALL_ABORT: is it really useful as a return value for
report_syscall_entry?

3- Nesting: is it really useful to run all the reports in a row and
(possibly) stop at the end, waiting for all the engines?

4- report_syscall_entry engines evaluation order should be reversed

----
1- This is the simplest suggestion/request.
Sometimes virtual machine engines need to change the system call
(e.g. the process calls "creat", the kernel must run "open" instead).
I suggest adding some useful inline functions to arch/*/include/asm/syscall.h:
syscall_set_nr // to set the system call number
syscall_get_pc // to get/set the program counter
syscall_set_pc
syscall_get_sp // to get/set the stack pointer
syscall_set_sp
These inline calls would help to create architecture independent
virtual machine engines.
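
A minimal user-space sketch of what these accessors could look like, assuming
x86 register naming. struct mock_regs is a hypothetical stand-in for
struct pt_regs, and the DEMO_* constants and demo_redirect_creat are invented
for illustration only; real helpers would probably also take the task pointer,
like the accessors already present in asm/syscall.h.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pt_regs: only the fields used here. */
struct mock_regs {
	long orig_ax;      /* system call number saved at entry (x86 naming) */
	unsigned long ip;  /* program counter */
	unsigned long sp;  /* stack pointer */
};

/* Illustrative x86-64 system call numbers. */
enum { DEMO_NR_OPEN = 2, DEMO_NR_CREAT = 85 };

/* Proposed accessor: replace the system call number. */
static inline void syscall_set_nr(struct mock_regs *regs, long nr)
{
	assert(regs != NULL);
	regs->orig_ax = nr;
}

/* Proposed accessors: get/set the program counter. */
static inline unsigned long syscall_get_pc(const struct mock_regs *regs)
{
	return regs->ip;
}

static inline void syscall_set_pc(struct mock_regs *regs, unsigned long pc)
{
	regs->ip = pc;
}

/* Proposed accessors: get/set the stack pointer. */
static inline unsigned long syscall_get_sp(const struct mock_regs *regs)
{
	return regs->sp;
}

static inline void syscall_set_sp(struct mock_regs *regs, unsigned long sp)
{
	regs->sp = sp;
}

/*
 * Example engine step: redirect "creat" to "open". A real engine would
 * also have to rewrite the arguments; that part is omitted here.
 */
static inline void demo_redirect_creat(struct mock_regs *regs)
{
	if (regs->orig_ax == DEMO_NR_CREAT)
		syscall_set_nr(regs, DEMO_NR_OPEN);
}
```

With accessors like these, only the bodies would differ per architecture;
the engine code calling them would stay the same.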

Now the "hard" part:
2- What is the scenario for virtual machines based on utrace?

In my mind there are two or three actors.
K- At the lowest layer there is the kernel providing utrace
M- There is a module which uses utrace and virtualizes something.
   M can do all the virtualization at kernel level, but it may also use:
U- A userland Virtual Machine Monitor.

So we have K,M and U.

When a virtualized process does a syscall, K calls the report_syscall_entry 
function of M.
If M is entirely at kernel level it can decide whether to abort the syscall
(setting UTRACE_SYSCALL_ABORT) or not, but there is no (clean) way to forward
the request to U and wait for U's decision about the syscall.
SYSEMU can be implemented with the current utrace interface because it aborts
*all* the syscalls.
View-OS cannot use it: km-view is a userland VM which needs to
decide which system calls must be skipped and which executed.
This is not a problem for View-OS only:
whoever tries to implement similar features will run into the same problem.

Even VMMs implemented entirely in a kernel module may need to delay
the decision about the action. I think UTRACE_STOP has exactly this
meaning: in Roland's ptrace implementation UTRACE_STOP is used in this way.
User-mode Linux running on ptrace does change the registers of the
process while the process is in the STOP state.

I am currently trying to implement a new kmview module using UTRACE_STOP.
When I need to skip the syscall I change the syscall number (orig_ax on x86)
to -1 while the process is stopped.
Utrace believes that the syscall is *not* aborted, so it passes orig_ax
(return ret ?: regs->orig_ax; in arch/x86/kernel/ptrace.c)
to the "entry_{32,64}.S" layer, causing the syscall to be skipped.
This is a dirty workaround.
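
To make the workaround concrete, here is a small user-space model of it.
Everything prefixed mock_ or MOCK_ is hypothetical: mock_syscall_trace_enter
only mimics the quoted "return ret ?: regs->orig_ax;" line, and the range
check assumes that the entry code refuses to dispatch a number outside the
system call table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table size, chosen only for this sketch. */
enum { MOCK_NR_SYSCALLS = 512 };

/* Hypothetical stand-in for struct pt_regs. */
struct mock_regs {
	long orig_ax;  /* system call number (x86 naming) */
};

/*
 * Mimics the tail of the tracing hook quoted above. "ret" is nonzero
 * only when the syscall was aborted through utrace; otherwise the
 * (possibly rewritten) orig_ax is handed to the entry code.
 */
static inline long mock_syscall_trace_enter(const struct mock_regs *regs,
					    long ret)
{
	assert(regs != NULL);
	return ret ? ret : regs->orig_ax;  /* portable form of "ret ?: x" */
}

/* Assumed entry-code behaviour: out-of-range numbers are not dispatched. */
static inline int mock_nr_is_dispatched(long nr)
{
	return (unsigned long)nr < (unsigned long)MOCK_NR_SYSCALLS;
}

/* The workaround: while the task is stopped, overwrite the number. */
static inline void mock_skip_syscall(struct mock_regs *regs)
{
	regs->orig_ax = -1;
}
```

The cast to unsigned makes -1 a huge value, so it falls outside the table
and no system call handler runs, even though utrace never saw an abort.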

I think that the specific actions (for syscalls, signals) should also be
accepted by utrace_control(..., UTRACE_RESUME).
In this way:
** K calls report_syscall_entry
** M sends the request to U and returns UTRACE_STOP.
   (M can then process requests for many other processes and many userland VMMs)
** U receives the request, decides whether to abort or execute the syscall
** U sends its reply to M
** M calls utrace_control UTRACE_RESUME setting the action flag needed (e.g.
   UTRACE_SYSCALL_ABORT).
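
The steps above can be modelled in user space as follows. This is only a
sketch of the proposed behaviour, not of the current utrace API: every type
and function here (mock_task, m_resume_with_action, u_decide, ...) is
hypothetical, and the policy in u_decide is invented for the example.

```c
#include <assert.h>
#include <stddef.h>

enum mock_resume_action { MOCK_RESUME, MOCK_STOP };
enum mock_syscall_action { MOCK_SYSCALL_RUN, MOCK_SYSCALL_ABORT };

/* Illustrative number of a syscall that U wants to virtualize. */
enum { DEMO_NR_VIRTUALIZED = 85 };

/* Hypothetical per-task state kept by K and M. */
struct mock_task {
	int stopped;                              /* 1 while waiting for U */
	long pending_nr;                          /* request forwarded to U */
	enum mock_syscall_action syscall_action;  /* set at resume time */
};

/* M's report_syscall_entry: forward the request to U and ask for a stop. */
static inline enum mock_resume_action
m_report_syscall_entry(struct mock_task *t, long nr)
{
	assert(t != NULL);
	t->pending_nr = nr;
	t->stopped = 1;
	return MOCK_STOP;
}

/* U's policy (made up for this sketch): abort one syscall, run the rest. */
static inline enum mock_syscall_action u_decide(long nr)
{
	return nr == DEMO_NR_VIRTUALIZED ? MOCK_SYSCALL_ABORT
					 : MOCK_SYSCALL_RUN;
}

/* Proposed: the resume request carries the deferred syscall action. */
static inline int m_resume_with_action(struct mock_task *t,
				       enum mock_syscall_action action)
{
	if (!t->stopped)
		return -1;  /* nothing to resume */
	t->syscall_action = action;
	t->stopped = 0;
	return 0;
}

/* K, after the resume: returns 1 if the syscall is executed, 0 if skipped. */
static inline int k_syscall_runs(const struct mock_task *t)
{
	return !t->stopped && t->syscall_action == MOCK_SYSCALL_RUN;
}
```

The point of the design is that M returns immediately from the report
upcall and can serve other processes while U is still deciding.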

The same scenario applies to userland management of signals: the
VMM or debugger may need to delay the choice among the UTRACE_SIGNAL* cases,
and it is hard to keep the monitor waiting inside the report_signal
upcall before it returns a value. That would require the module to implement
its own process stop/quiescence mechanism.

3- Following the KMU schema above, let us now depict a scenario where
there are multiple M engines and multiple U VMMs on the same process.

If I have correctly understood the code, the current implementation
runs all the report upcalls in a row. If some of the report upcalls return
UTRACE_STOP, utrace waits for all the stopped engines to send a UTRACE_RESUME.
(from utrace.c:
If another engine is keeping @target stopped, then it remains stopped until 
all engines let it resume.)

All the M engines may try to change the status of the process concurrently,
as each engine thinks the process has been stopped for its own management.

Maybe we have two different ideas of the STOP state and of process
virtualization.
For me a process in STOP state is blocked for inspection. During the STOP
state a module M can change the process status.
By "virtualized process" I mean a process that "sees" an environment
different from that provided by the hosting kernel.
A user-mode linux process is a virtualized process.
In my mind several engines working on a process implement several layers
of virtualization.
The first engine provides the process a modified virtual world.
If a second engine gets loaded on the same process, the first engine
provides its modified world to the second engine, which implements a
further virtualization for the process, and so on.

In this perspective I think that the useful sequence (for kernel generated
events) is:
K calls the report upcall of the first engine
if M returns UTRACE_STOP wait for UTRACE_RESUME from the first engine
K calls the report upcall of the second engine
if M returns UTRACE_STOP wait for UTRACE_RESUME from the second engine
and so on.

In this way each engine can safely change the state (based on its virtual
perspective of the world, maybe provided by the previous engine) and notify its
action before the next engine starts working. The next engine "sees" the world
as it has been modified by the previous one.
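
The difference between the two orderings can be shown with a toy model.
This is a user-space sketch under strong simplifications: the whole process
state is reduced to one syscall number, a stop is modelled as "the engine
rewrites the number before resuming", and all names (mock_engine, MOCK_KEEP,
report_in_a_row, report_sequentially) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Marker meaning "this engine does not stop and changes nothing". */
enum { MOCK_KEEP = -2 };

struct mock_engine {
	long seen_nr;     /* syscall number observed in the report upcall */
	long rewrite_to;  /* number installed while stopped, or MOCK_KEEP */
};

/*
 * Model of the behaviour described in point 3: all reports run first,
 * then the stopped engines change the state while all of them are waiting.
 */
static inline void report_in_a_row(struct mock_engine *e, size_t n, long *nr)
{
	size_t i;

	assert(e != NULL && nr != NULL);
	for (i = 0; i < n; i++)
		e[i].seen_nr = *nr;            /* everyone sees the same state */
	for (i = 0; i < n; i++)
		if (e[i].rewrite_to != MOCK_KEEP)
			*nr = e[i].rewrite_to; /* order is arbitrary in reality */
}

/*
 * Model of the suggested sequence: report to one engine, wait for its
 * resume (its change is applied), and only then report to the next one.
 */
static inline void report_sequentially(struct mock_engine *e, size_t n,
				       long *nr)
{
	size_t i;

	assert(e != NULL && nr != NULL);
	for (i = 0; i < n; i++) {
		e[i].seen_nr = *nr;            /* report upcall */
		if (e[i].rewrite_to != MOCK_KEEP)
			*nr = e[i].rewrite_to; /* applied before its resume */
	}
}
```

In the sequential variant the second engine observes the number already
rewritten by the first; in the other variant it observes the original one.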

4- utrace_report_syscall_entry should scan the list of engines in reverse
order (it is the only event type which is process generated).

From the idea of nested virtualization it follows that the process request
to run a system call must be processed by the outer (latest) engine first
and then down to the inner/first.

Utrace uses "list_for_each_entry_safe" for the list scan.
"list_for_each_entry_safe_reverse" does exist, so maybe it can be used.
I haven't tested it yet.
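
As an illustration of the reverse scan, here is a user-space sketch with a
hand-written circular list instead of the kernel's list.h. It assumes that
engines are appended at the tail when attached, so the tail is the latest
(outer) engine; the mock_ names are hypothetical. The "safe" part is the
same idea as in the kernel macro: the predecessor is saved before the entry
is visited, so the visit could unlink that entry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical engine node on a circular list with a sentinel head. */
struct mock_engine {
	int id;
	struct mock_engine *prev, *next;
};

static inline void mock_list_init(struct mock_engine *head)
{
	head->prev = head->next = head;
}

/* Attach a new engine at the tail (assumed attach order: oldest first). */
static inline void mock_list_add_tail(struct mock_engine *head,
				      struct mock_engine *e)
{
	e->prev = head->prev;
	e->next = head;
	head->prev->next = e;
	head->prev = e;
}

/*
 * Visit engines from the latest to the first, recording their ids.
 * "prev" is fetched before the entry is used, so removing the current
 * entry during the visit would not break the walk.
 */
static inline size_t mock_report_reverse(struct mock_engine *head,
					 int *order, size_t max)
{
	struct mock_engine *e, *prev;
	size_t n = 0;

	assert(head != NULL && order != NULL);
	for (e = head->prev, prev = e->prev;
	     e != head && n < max;
	     e = prev, prev = e->prev)
		order[n++] = e->id;
	return n;
}
```

With three engines attached in the order 1, 2, 3, the syscall entry
reports would then be delivered to 3 first and to 1 last.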

Interested readers may refer also to my previous postings on the same subject.
(July 2008)
-------
Thank you if you have read up to here.
ciao
        renzo
