Re: [PATCH experiment 00/35] stackless coroutine backend
On Sun, Mar 13, 2022 at 04:18:40PM +0100, Paolo Bonzini wrote:
> On 3/11/22 13:17, Daniel P. Berrangé wrote:
> The question is what is easier to maintain, stack switching code that is
> becoming less and less portable (status quo with SafeStack, CET and the TLS
> issues that Stefan has worked on), a mixed C/C++ codebase (C++ coroutines),
> a custom source-to-source translator (this series). The third might be more
> fun, but it would be quite a large enterprise and the C++ compiler writers
> have already done the work.

Or a C-to-C++ translator to keep the code in C but still use C++
coroutines :). (I'm joking.)

Stefan
Re: [PATCH experiment 00/35] stackless coroutine backend
On 3/11/22 13:17, Daniel P. Berrangé wrote:
> > Only files that define or use a coroutine_fn (which includes callers of
> > qemu_coroutine_create) would have to be compiled as C++.
>
> Unless I'm misunderstanding what you mean, "define a coroutine_fn" is
> a very large number of functions/files
>
>   $ git grep coroutine_fn | wc -l
>   806
>   $ git grep -l coroutine_fn | wc -l
>   132
>
> Dominated by the block layer of course, but tentacles spreading out
> into a lot of other code.

The main other user is 9pfs, then there is:

    hw/remote/message.c
    io/channel.c
    job.c
    migration/savevm.c
    monitor/hmp-cmds.c
    monitor/monitor-internal.h
    monitor/qmp.c
    nbd/client-connection.c
    nbd/client.c
    nbd/server.c
    net/colo-compare.c
    net/filter-mirror.c
    scsi/pr-manager.c
    scsi/qemu-pr-helper.c
    ui/console.c
    util/vhost-user-server.c

> Feels like identifying all callers would be tedious/unpleasant enough,
> that practically speaking we would have to just compile all of QEMU
> as C++.

Yes, it's a large amount of code, but it's relatively self-contained. In
io/ for example only three functions would have to become C++
(qio_channel_readv_full_all_eof, qio_channel_writev_full_all,
qio_channel_yield), and it's easy to move them to a separate file
io/channel-coroutine.cc. Likewise for e.g. util/async.c or
util/thread-pool.c (one function each).

The block layer would almost entirely move to C++, that's for sure. The
monitor would be a bit more in the middle, but hardware emulation can
remain 100% C.
I haven't gotten the thing to compile or run yet, and I'm not sure how
much time I'll have this week, but the change for test-coroutine.c to run
should be in the ballpark of this:

 include/qemu/coroutine.h                                 |  26
 tests/unit/meson.build                                   |   6
 tests/unit/{test-coroutine.c => test-coroutine.cc}       | 106
 util/meson.build                                         |   4
 util/{qemu-coroutine-lock.c => qemu-coroutine-lock.cc}   |  65
 util/{qemu-coroutine-sleep.c => qemu-coroutine-sleep.cc} |  10

where the changes are for a good part mechanical: switching from
"x coroutine_fn" to CoroutineFn<x> is entirely so, while adding co_await
in front of coroutine calls is half mechanical. For non-void functions,
the compiler can identify all callers (because the old type "int" is not
compatible with the new type CoroutineFn<int>). For void functions one
could use warn_unused_result.

The question is what is easier to maintain, stack switching code that is
becoming less and less portable (status quo with SafeStack, CET and the
TLS issues that Stefan has worked on), a mixed C/C++ codebase (C++
coroutines), a custom source-to-source translator (this series). The
third might be more fun, but it would be quite a large enterprise and the
C++ compiler writers have already done the work.

A part of the changes is common in both cases, since you cannot have code
that can run both inside and outside a coroutine.

Paolo
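The warn_unused_result idea mentioned above can be sketched in plain C. This is only an illustration under stated assumptions: `CoResult`, `co_flush` and `flush_twice` are hypothetical names, not anything from QEMU or from the planned conversion, and the attribute shown is the GCC/Clang function attribute. The point is that a function which used to return void can return a small object instead, so that a caller which drops the result (the analogue of forgetting co_await) gets a compiler warning.

```c
#include <assert.h>

/* Hypothetical stand-in for the object a converted "void" coroutine
 * function would return. Not a QEMU type. */
typedef struct {
    int completed;
} CoResult;

/* GCC and Clang warn when a caller discards this return value. */
__attribute__((warn_unused_result))
static CoResult co_flush(int *flushes)
{
    (*flushes)++;
    return (CoResult){ .completed = 1 };
}

/* Hypothetical caller: it consumes both results, so no warning is emitted. */
static int flush_twice(void)
{
    int flushes = 0;
    CoResult r1 = co_flush(&flushes);
    CoResult r2 = co_flush(&flushes);

    /* Writing just "co_flush(&flushes);" here would trigger the warning. */
    assert(r1.completed && r2.completed);
    return flushes;
}
```

Calling `flush_twice()` runs both calls and returns the number of flushes performed.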
Re: [PATCH experiment 00/35] stackless coroutine backend
On Fri, Mar 11, 2022 at 01:04:33PM +0100, Paolo Bonzini wrote:
> On 3/11/22 10:27, Stefan Hajnoczi wrote:
> > > Not quite voluntarily, but I noticed I had to add one 0 to make them
> > > run for a decent amount of time. So yeah, it's much faster than
> > > siglongjmp.
> > That's a nice first indication that performance will be good. I guess
> > that deep coroutine_fn stacks could be less efficient with stackless
> > coroutines compared to ucontext, but the cost of switching between
> > coroutines (enter/yield) will be lower with stackless coroutines.
>
> Note that right now I'm not placing the coroutine_fn stack on the heap, it's
> still allocated from a contiguous area in virtual address space. The
> contiguous allocation is wrapped by coroutine_stack_alloc and
> coroutine_stack_free, so it's really easy to change them to malloc and free.
>
> I also do not have to walk up the whole call stack on coroutine_fn yields,
> because calls from one coroutine_fn to the next are tail calls; in exchange
> for that, I have more indirect calls than if the code did
>
>     if (next_call() == COROUTINE_YIELD) {
>         return COROUTINE_YIELD;
>     }
>
> For now the choice was again just the one that made the translation easiest.
>
> Today I also managed to implement a QEMU-like API on top of C++ coroutines:
>
>     CoroutineFn<int> return_int() {
>         co_await qemu_coroutine_yield();
>         co_return 30;
>     }
>
>     CoroutineFn<void> return_void() {
>         co_await qemu_coroutine_yield();
>     }
>
>     CoroutineFn<void> co(void *) {
>         co_await return_void();
>         printf("%d\n", co_await return_int());
>         co_await qemu_coroutine_yield();
>     }
>
>     int main() {
>         Coroutine *f = qemu_coroutine_create(co, NULL);
>         printf("--- 0\n");
>         qemu_coroutine_enter(f);
>         printf("--- 1\n");
>         qemu_coroutine_enter(f);
>         printf("--- 2\n");
>         qemu_coroutine_enter(f);
>         printf("--- 3\n");
>         qemu_coroutine_enter(f);
>         printf("--- 4\n");
>     }
>
> The runtime code is absurdly obscure; my favorite bit is
>
>     Yield qemu_coroutine_yield()
>     {
>         return Yield();
>     }
>
> :) However, at 200 lines of code it's certainly smaller than a
> source-to-source translator. It might be worth investigating a bit more.
> Only files that define or use a coroutine_fn (which includes callers of
> qemu_coroutine_create) would have to be compiled as C++.

Unless I'm misunderstanding what you mean, "define a coroutine_fn" is
a very large number of functions/files

  $ git grep coroutine_fn | wc -l
  806
  $ git grep -l coroutine_fn | wc -l
  132

Dominated by the block layer of course, but tentacles spreading out
into a lot of other code.

Feels like identifying all callers would be tedious/unpleasant enough,
that practically speaking we would have to just compile all of QEMU
as C++.

Regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
Re: [PATCH experiment 00/35] stackless coroutine backend
On 3/11/22 10:27, Stefan Hajnoczi wrote:
> > Not quite voluntarily, but I noticed I had to add one 0 to make them
> > run for a decent amount of time. So yeah, it's much faster than
> > siglongjmp.
> That's a nice first indication that performance will be good. I guess
> that deep coroutine_fn stacks could be less efficient with stackless
> coroutines compared to ucontext, but the cost of switching between
> coroutines (enter/yield) will be lower with stackless coroutines.

Note that right now I'm not placing the coroutine_fn stack on the heap,
it's still allocated from a contiguous area in virtual address space.
The contiguous allocation is wrapped by coroutine_stack_alloc and
coroutine_stack_free, so it's really easy to change them to malloc and
free.

I also do not have to walk up the whole call stack on coroutine_fn
yields, because calls from one coroutine_fn to the next are tail calls;
in exchange for that, I have more indirect calls than if the code did

    if (next_call() == COROUTINE_YIELD) {
        return COROUTINE_YIELD;
    }

For now the choice was again just the one that made the translation
easiest.

Today I also managed to implement a QEMU-like API on top of C++
coroutines:

    CoroutineFn<int> return_int() {
        co_await qemu_coroutine_yield();
        co_return 30;
    }

    CoroutineFn<void> return_void() {
        co_await qemu_coroutine_yield();
    }

    CoroutineFn<void> co(void *) {
        co_await return_void();
        printf("%d\n", co_await return_int());
        co_await qemu_coroutine_yield();
    }

    int main() {
        Coroutine *f = qemu_coroutine_create(co, NULL);
        printf("--- 0\n");
        qemu_coroutine_enter(f);
        printf("--- 1\n");
        qemu_coroutine_enter(f);
        printf("--- 2\n");
        qemu_coroutine_enter(f);
        printf("--- 3\n");
        qemu_coroutine_enter(f);
        printf("--- 4\n");
    }

The runtime code is absurdly obscure; my favorite bit is

    Yield qemu_coroutine_yield()
    {
        return Yield();
    }

:) However, at 200 lines of code it's certainly smaller than a
source-to-source translator. It might be worth investigating a bit more.
Only files that define or use a coroutine_fn (which includes callers of
qemu_coroutine_create) would have to be compiled as C++.

Paolo
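For readers who want to see what the non-tail-call alternative looks like, here is a minimal C sketch of the explicit "if (next_call() == COROUTINE_YIELD) return COROUTINE_YIELD;" style. It is not the code from the series: the frame structs, `COROUTINE_DONE`, `inner`, `outer` and `run_nested` are hypothetical names chosen for illustration. It shows the cost mentioned above: every yield returns through each nesting level, and every re-entry descends through each level again.

```c
#include <assert.h>

/* COROUTINE_YIELD is the name used in the mail; COROUTINE_DONE is assumed. */
typedef enum { COROUTINE_YIELD, COROUTINE_DONE } CoroutineAction;

/* Hypothetical frames, one per nesting level. Locals that must survive a
 * yield live here instead of on the C stack. */
typedef struct { int state; int value; } InnerFrame;
typedef struct { int state; int hops; InnerFrame inner; } OuterFrame;

static CoroutineAction inner(InnerFrame *f)
{
    switch (f->state) {
    case 0:
        f->value = 30;
        f->state = 1;
        return COROUTINE_YIELD;     /* suspend exactly once */
    case 1:
        f->state = 2;
        /* fall through */
    default:
        return COROUTINE_DONE;
    }
}

static CoroutineAction outer(OuterFrame *f)
{
    switch (f->state) {
    case 0:
        f->inner.state = 0;
        f->state = 1;
        /* fall through */
    case 1:
        f->hops++;                  /* every entry passes through this level */
        if (inner(&f->inner) == COROUTINE_YIELD) {
            return COROUTINE_YIELD; /* propagate the yield to our caller */
        }
        f->state = 2;
        /* fall through */
    default:
        return COROUTINE_DONE;
    }
}

/* Drives the coroutine to completion; *enters counts calls to outer(). */
static int run_nested(int *enters)
{
    OuterFrame f = { 0 };

    *enters = 1;
    while (outer(&f) == COROUTINE_YIELD) {
        (*enters)++;
    }
    return f.inner.value;
}
```

With one yield in `inner`, `run_nested` enters `outer` twice and returns the value computed by the inner level. A tail-call design would instead resume the innermost frame directly through a function pointer, trading the walk through the call chain for an indirect call.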
Re: [PATCH experiment 00/35] stackless coroutine backend
On Thu, Mar 10, 2022 at 09:14:07PM +0100, Paolo Bonzini wrote:
> On 3/10/22 18:42, Stefan Hajnoczi wrote:
> > There are a lot of details to decide on in the translator tool and
> > runtime to optimize the code. I think the way the stack frames are
> > organized in this patch series is probably for convenience rather than
> > performance.
>
> Yes, sometimes the optimizations are there but mostly because they made my
> job easier.
>
> > Out of curiosity, did you run the perf tests and compare against
> > ucontext?
>
> Not quite voluntarily, but I noticed I had to add one 0 to make them run for
> a decent amount of time. So yeah, it's much faster than siglongjmp.

That's a nice first indication that performance will be good. I guess
that deep coroutine_fn stacks could be less efficient with stackless
coroutines compared to ucontext, but the cost of switching between
coroutines (enter/yield) will be lower with stackless coroutines.

Stefan
Re: [PATCH experiment 00/35] stackless coroutine backend
On 3/10/22 18:42, Stefan Hajnoczi wrote:
> There are a lot of details to decide on in the translator tool and
> runtime to optimize the code. I think the way the stack frames are
> organized in this patch series is probably for convenience rather than
> performance.

Yes, sometimes the optimizations are there but mostly because they made
my job easier.

> Out of curiosity, did you run the perf tests and compare against
> ucontext?

Not quite voluntarily, but I noticed I had to add one 0 to make them run
for a decent amount of time. So yeah, it's much faster than siglongjmp.

Paolo
Re: [PATCH experiment 00/35] stackless coroutine backend
On Thu, Mar 10, 2022 at 01:43:38PM +0100, Paolo Bonzini wrote:
> Here is an experiment with using stackless coroutines in QEMU. It
> only compiles enough code to run tests/unit/test-coroutine, but at
> least it proves that it's possible to quickly test ideas in the
> area of coroutine runtimes. Another idea that could be toyed with
> in a similar manner could be (whoa) C++ coroutines.
>
> As expected, this also found some issues in existing code, so I
> plan to submit patches 1-5 separately.
>
> The new backend (which is the only one that works, due to the required
> code changes) is in patch 7. For the big description of what stackless
> coroutines are, please refer to that patch.
>
> Patches 8-11 do some initial conversions. Patch 12 introduces some
> preprocessor magic that greatly eases the rest of the work, and then
> the tests are converted one at a time, until patch 27 where the only
> ones missing are the CoRwlock tests.
>
> Therefore, patches 28-33 convert CoRwlock and patches 34-35 take care
> of the corresponding tests, thus concluding the experiment.

Nice, the transformation is clear. It's simpler than Continuation
Passing Style transform because the loops and if statements remain
unmodified. This is a big advantage with the Duff's device-style
approach.

There are a lot of details to decide on in the translator tool and
runtime to optimize the code. I think the way the stack frames are
organized in this patch series is probably for convenience rather than
performance.

Out of curiosity, did you run the perf tests and compare against
ucontext?

Stefan
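As a rough illustration of why the Duff's device-style approach leaves loops and if statements intact, here is a minimal C sketch. It is not taken from the series: `CounterFrame`, `counter_step`, `run_counter` and the `CO_YIELD`/`CO_DONE` values are hypothetical. The function keeps its ordinary `for` loop; the only additions are a `switch` on a saved state and a `case` label placed inside the loop body, so re-entering the function jumps straight back to the point after the yield.

```c
#include <assert.h>

enum { CO_YIELD, CO_DONE };

/* Hypothetical frame: locals that must survive a yield are stored here,
 * because the C stack frame is gone once the function returns. */
typedef struct {
    int state;  /* resume point; 0 means "start from the top" */
    int i;      /* loop counter, preserved across yields */
    int sum;    /* running total */
} CounterFrame;

/* Adds 0..n-1 into f->sum, yielding after each addition. */
static int counter_step(CounterFrame *f, int n)
{
    switch (f->state) {
    case 0:
        f->sum = 0;
        for (f->i = 0; f->i < n; f->i++) {
            f->sum += f->i;
            f->state = 1;
            return CO_YIELD;
    case 1:;    /* resume here, in the middle of the loop body */
        }
    }
    f->state = 2;   /* no matching case: later calls just report CO_DONE */
    return CO_DONE;
}

/* Drives the coroutine to completion and counts how often it yielded. */
static int run_counter(int n, int *yields)
{
    CounterFrame f = { 0 };

    *yields = 0;
    while (counter_step(&f, n) == CO_YIELD) {
        (*yields)++;
    }
    return f.sum;
}
```

For `n = 4` the driver observes four yields and a final sum of 6. A Continuation Passing Style transform would instead have to split the loop body into separate functions, which is the complexity the quoted remark refers to.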
[PATCH experiment 00/35] stackless coroutine backend
Here is an experiment with using stackless coroutines in QEMU. It
only compiles enough code to run tests/unit/test-coroutine, but at
least it proves that it's possible to quickly test ideas in the
area of coroutine runtimes. Another idea that could be toyed with
in a similar manner could be (whoa) C++ coroutines.

As expected, this also found some issues in existing code, so I
plan to submit patches 1-5 separately.

The new backend (which is the only one that works, due to the required
code changes) is in patch 7. For the big description of what stackless
coroutines are, please refer to that patch.

Patches 8-11 do some initial conversions. Patch 12 introduces some
preprocessor magic that greatly eases the rest of the work, and then
the tests are converted one at a time, until patch 27 where the only
ones missing are the CoRwlock tests.

Therefore, patches 28-33 convert CoRwlock and patches 34-35 take care
of the corresponding tests, thus concluding the experiment.

Paolo

Paolo Bonzini (35):
  coroutine: add missing coroutine_fn annotations for CoRwlock functions
  coroutine: qemu_coroutine_get_aio_context is not a coroutine_fn
  coroutine: introduce QemuCoLockable
  coroutine: introduce coroutine_only_fn
  coroutine: small code cleanup in qemu_co_rwlock_wrlock
  disable some code
  coroutine: introduce the "stackless coroutine" backend
  /basic/lifecycle
  convert qemu-coroutine-sleep.c to stackless coroutines
  enable tail call optimization of qemu_co_mutex_lock
  convert CoMutex to stackless coroutines
  define magic macros for stackless coroutines
  /basic/yield
  /basic/nesting
  /basic/self
  /basic/entered
  /basic/in_coroutine
  /basic/order
  /perf/lifecycle
  /perf/nesting
  /perf/yield
  /perf/function-call
  /perf/cost
  /basic/no-dangling-access
  /locking/co-mutex
  convert qemu_co_mutex_lock_slowpath to magic macros
  /locking/co-mutex/lockable
  qemu_co_rwlock_maybe_wake_one
  qemu_co_rwlock_rdlock
  qemu_co_rwlock_unlock
  qemu_co_rwlock_downgrade
  qemu_co_rwlock_wrlock
  qemu_co_rwlock_upgrade
  /locking/co-rwlock/upgrade
  /locking/co-rwlock/downgrade

 configure                    |  44 +---
 include/qemu/co-lockable.h   | 110 +
 include/qemu/coroutine.h     |  99 ++--
 include/qemu/coroutine_int.h |   6 -
 include/qemu/lockable.h      |  13 +-
 include/qemu/typedefs.h      |   1 +
 tests/unit/meson.build       |   2 +-
 tests/unit/test-coroutine.c  | 425 +--
 util/coroutine-stackless.c   | 159 +
 util/meson.build             |  10 +-
 util/qemu-coroutine-lock.c   | 215 ++
 util/qemu-coroutine-sleep.c  |  57 -
 util/qemu-coroutine.c        |  18 +-
 13 files changed, 932 insertions(+), 227 deletions(-)
 create mode 100644 include/qemu/co-lockable.h
 create mode 100644 util/coroutine-stackless.c

-- 
2.35.1