Re: Throughput questions....
Hi Richard,

On 11.11.24 at 14:15, Richard Clark wrote:
> Your comment to not use malloc is extremely confusing. I've also seen your
> response that using a lot of small malloc/free calls will slow down the kernel.

I'm sensing a misunderstanding here. There is the L4Re microkernel, a.k.a.
Fiasco.OC, and there is the l4re_kernel as in pkg/l4re-core/l4re_kernel. These
are two completely different things. The topic of the other thread was
pkg/l4re-core/l4re_kernel, which is a service running in each application.

Furthermore, I didn't want to suggest that the l4re_kernel attaches any
design-inherent slowdown to the use of malloc/free. The l4re_kernel is only
involved when new memory has to be mapped into your application, which happens
only when the memory chunks malloc already manages run out.

Cheers,
Philipp

-----Original Message-----
From: Adam Lackorzynski
Sent: Monday, November 11, 2024 5:29 AM
To: Richard Clark ; l4-hackers@os.inf.tu-dresden.de
Subject: Re: Throughput questions

Hi Richard,

for shared-memory-based communication I'd like to suggest using L4::Irqs
instead of IPC messages, especially ipc-calls, which have a back and forth.
Please also do not use malloc within a benchmark (or benchmark malloc
separately to get an understanding of how the time is split between L4 ops
and libc). On QEMU it should be OK when running with KVM, less so without KVM.

I do not have a recommendation for an AMD-based laptop.

Cheers,
Adam

On Thu Nov 07, 2024 at 13:36:06 +, Richard Clark wrote:
> Dear L4Re experts,
> [...]
> Richard H. Clark

--
philipp.epp...@kernkonzept.com - Tel. 0351-41 883 221
http://www.kernkonzept.com

Kernkonzept GmbH. Sitz: Dresden. Amtsgericht Dresden, HRB 31129.
Geschäftsführer: Dr.-Ing. Michael Hohmuth
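A minimal sketch of what this point implies in practice (not taken from the
thread; names and sizes below are made up): since malloc/free run purely in
user space and the per-task l4re_kernel service is only contacted when the
allocator has to grow its backing memory, pre-allocating and touching a
message pool once at startup keeps that occasional mapping cost out of the
measured loop.

  /* Hypothetical warm-up: reserve message buffers once at startup so the
   * benchmark loop never has to grow the heap (and thus never involves the
   * l4re_kernel / region mapping while measuring). Plain C, no L4 calls. */
  #include <stdlib.h>
  #include <string.h>

  enum { MSG_SIZE = 64, POOL_SLOTS = 1024 };

  static unsigned char *pool;          /* backing store, allocated once  */
  static void *free_list[POOL_SLOTS];  /* simple LIFO free list          */
  static unsigned free_top;

  static int pool_init(void)
  {
    pool = malloc((size_t)MSG_SIZE * POOL_SLOTS);
    if (!pool)
      return -1;
    /* touch the pages now, not in the hot path */
    memset(pool, 0, (size_t)MSG_SIZE * POOL_SLOTS);
    for (unsigned i = 0; i < POOL_SLOTS; ++i)
      free_list[free_top++] = pool + (size_t)i * MSG_SIZE;
    return 0;
  }

  static void *msg_alloc(void)  { return free_top ? free_list[--free_top] : NULL; }
  static void  msg_free(void *p) { free_list[free_top++] = p; }

With such a pool in place, steady-state message traffic never changes the
task's memory layout, so the l4re_kernel stays out of the picture entirely.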
Re: Throughput questions....
Hi Richard,

On 07.11.24 at 14:36, Richard Clark wrote:
> I seem to be getting only 4000 messages per second, or roughly 4 messages per
> millisecond. Now there are a couple of malloc() and free() and
> condition_wait() and condition_signal()s going on as the events and messages
> get passed through the sender and receiver threads, but nothing (IMHO) that
> should slow things down too much. Messages are very small, like 50 bytes, as
> I'm really just trying to get a handle on basic overhead. So pretty much,
> yes, I'm beating the context-switching mechanisms to death...

Please note that condition_wait / condition_signal might use an L4::Semaphore
to synchronize the threads, which is an additional microkernel interaction /
context switch.

> My questions:
> Is this normal(ish) throughput for a single-core x86_64 QEMU system?

Benchmarking on QEMU might give you a ballpark number, but QEMU performance
depends on several factors such as load on the host machine, Linux scheduling,
etc. The effect is that the number of instructions executed per time slice on
a QEMU CPU varies.

> And lastly...
> We are going to be signing up for training soon... do you have a
> recommendation for a big beefy AMD-based linux laptop?

If you plan on using virtualization in your setup, please be aware that the
AMD-SVM support is experimental and needs more work to be ready for a
production environment. Intel-VMX is supported.

Cheers,
Philipp

--
philipp.epp...@kernkonzept.com - Tel. 0351-41 883 221
http://www.kernkonzept.com

Kernkonzept GmbH. Sitz: Dresden. Amtsgericht Dresden, HRB 31129.
Geschäftsführer: Dr.-Ing. Michael Hohmuth
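One way to act on the remark about hidden kernel interactions is to time the
phases of the per-message path separately instead of only counting messages
per second. A rough sketch using clock_gettime; the phase comments are
placeholders for whatever the real sender does, and CLOCK_MONOTONIC is assumed
to be available in the libc in use.

  /* Sketch: attribute time to allocation, synchronization and IPC separately. */
  #include <stdio.h>
  #include <time.h>

  static inline long long now_ns(void)
  {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
  }

  static long long t_alloc, t_sync, t_ipc;

  void send_one_message(void)   /* placeholder for the real send path */
  {
    long long t0 = now_ns();
    /* ... malloc()/free() of the message ... */
    long long t1 = now_ns();
    /* ... condition_signal()/condition_wait() hand-off between threads ... */
    long long t2 = now_ns();
    /* ... IPC call to the receiver ... */
    long long t3 = now_ns();
    t_alloc += t1 - t0;
    t_sync  += t2 - t1;
    t_ipc   += t3 - t2;
  }

  void report(unsigned long n_msgs)
  {
    printf("per msg: alloc %lld ns, sync %lld ns, ipc %lld ns\n",
           t_alloc / n_msgs, t_sync / n_msgs, t_ipc / n_msgs);
  }

Such a split shows directly whether the 250 us per message go into the IPC
rendezvous itself or into the libc and thread-synchronization work around it.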
RE: Throughput questions....
Adam,

Your explanation needs a lot more detail, as it raises many more questions
than it answers.

I specifically did not use irq-based messaging because it does not provide the
handshaking that I need. Sending a signal that a message is ready, without the
ability to receive some sort of acknowledgement event in return, would force
the sender into a painfully slow and inefficient polling loop. The ipc_call
function is perfect for this purpose, as it not only provides the
acknowledgement that the receiver has processed the message but can return a
status as well. All event-driven, with no polling and no delays. The
event-driven handshake has to exist so that the sender knows when it is safe
to begin sending the next message... how does an irq do this? It is only a
one-way signal. Your irq messaging example can only send one message and then
has to poll shared memory to know when the receiver has gotten it. They all
use the same underlying ipc functions, just with different kernel object
types, so I don't understand why an ipc_call would be slow and an irq would be
faster. In all cases, the return handshake is required to avoid polling.

Your comment to not use malloc is extremely confusing. I've also seen your
response that using a lot of small malloc/free calls will slow down the
kernel. That just can't be correct. Malloc is one of the most used and abused
calls in the entire C library. If it is not extremely fast and efficient, then
something is seriously wrong with the underlying software. Please confirm that
this is the case, because if true, then I will have to allocate a few
megabytes up front in a large buffer and port over my own malloc to point to
it. Again, this just doesn't make sense. Can I not assign an individual heap
to each process? The kernel should only hold a map to the large heap space,
not to each individual small buffer that gets malloc'ed. The kernel should not
even be involved in a malloc at all.

I do need to benchmark my message passing exactly as is, with malloc and free,
and signals and waits and locks and all. I am not interested in individual
component performance; I need to know the performance when it is all put
together in exactly the form in which it will be used. If 3 or 4 messages per
millisecond is real, then something needs to get redesigned and fixed. I can't
use it at that speed. Our applications involve communications and message
passing. They are servers that run forever, not little web applications. We
need to process hundreds of messages per millisecond, not single digits. So
this is a huge concern for me.

I'll go break things up to find the slow parts, to test them one at a time,
but your help in identifying more possible issues would be greatly
appreciated.

Thanks!

Richard

-----Original Message-----
From: Adam Lackorzynski
Sent: Monday, November 11, 2024 5:29 AM
To: Richard Clark ; l4-hackers@os.inf.tu-dresden.de
Subject: Re: Throughput questions

Hi Richard,

for shared-memory-based communication I'd like to suggest using L4::Irqs
instead of IPC messages, especially ipc-calls, which have a back and forth.
Please also do not use malloc within a benchmark (or benchmark malloc
separately to get an understanding of how the time is split between L4 ops
and libc). On QEMU it should be OK when running with KVM, less so without KVM.

I do not have a recommendation for an AMD-based laptop.

Cheers,
Adam

On Thu Nov 07, 2024 at 13:36:06 +, Richard Clark wrote:
> Dear L4Re experts,
> [...]
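For reference, a minimal sketch of the call/receive/reply rendezvous described
above, written against the plain C IPC bindings. The capability name, the way
it is obtained, and the shared-memory setup are assumptions for illustration
only. It also makes the cost model visible: at least two kernel entries per
message (the call and the reply), before any condvar or malloc work is added.

  /* Sketch of the described rendezvous. Assumed setup: "channel" is an IPC
   * gate capability (e.g. obtained via l4re_env_get_cap("channel") and named
   * in the Lua/ned config); the payload lives in shared memory set up
   * elsewhere. */
  #include <l4/sys/ipc.h>

  /* Sender: announce "message is in shared memory" and block until the
   * receiver has processed it. One l4_ipc_call == send + closed receive. */
  int send_notify(l4_cap_idx_t channel)
  {
    l4_msgtag_t tag = l4_ipc_call(channel, l4_utcb(),
                                  l4_msgtag(0 /* protocol */, 0, 0, 0),
                                  L4_IPC_NEVER);
    return l4_ipc_error(tag, l4_utcb());   /* 0 on success */
  }

  /* Receiver loop: wait for a call, process the message from shared memory,
   * then reply (which unblocks the sender) and wait for the next call. */
  void receive_loop(void)
  {
    l4_umword_t label;
    l4_msgtag_t tag = l4_ipc_wait(l4_utcb(), &label, L4_IPC_NEVER);
    for (;;)
      {
        if (l4_ipc_error(tag, l4_utcb()))
          {
            /* no valid caller to reply to; just wait for the next call */
            tag = l4_ipc_wait(l4_utcb(), &label, L4_IPC_NEVER);
            continue;
          }
        /* ... fetch and process the message from shared memory ... */
        /* reply to the blocked caller and wait for the next call in one syscall */
        tag = l4_ipc_reply_and_wait(l4_utcb(), l4_msgtag(0, 0, 0, 0),
                                    &label, L4_IPC_NEVER);
      }
  }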
Re: Throughput questions....
Hi,

On Mon Nov 11, 2024 at 13:15:52 +, Richard Clark wrote:
> I specifically did not use irq-based messaging because it does not provide
> the handshaking that I need. Sending a signal that a message is ready,
> without the ability to receive some sort of acknowledgement event in return,
> would force the sender into a painfully slow and inefficient polling loop.
> The ipc_call function is perfect for this purpose, as it not only provides
> the acknowledgement that the receiver has processed the message but can
> return a status as well. All event-driven, with no polling and no delays.
> The event-driven handshake has to exist so that the sender knows when it is
> safe to begin sending the next message... how does an irq do this? It is
> only a one-way signal. Your irq messaging example can only send one message
> and then has to poll shared memory to know when the receiver has gotten it.
> They all use the same underlying ipc functions, just with different kernel
> object types, so I don't understand why an ipc_call would be slow and an irq
> would be faster. In all cases, the return handshake is required to avoid
> polling.

I don't know about your mechanism; I just know that shared-memory
communication can work with a shared buffer and a notification. For example,
Virtio uses exactly this: notifications (Irqs) are sent in both directions.
This is a rather asynchronous model. Other use cases might need other ways of
doing it. Of course, polling should not be used, except when sitting alone on
a core that is dedicated specifically to that.

> Your comment to not use malloc is extremely confusing. I've also seen your
> response that using a lot of small malloc/free calls will slow down the
> kernel. That just can't be correct. Malloc is one of the most used and
> abused calls in the entire C library. If it is not extremely fast and
> efficient, then something is seriously wrong with the underlying software.
> Please confirm that this is the case, because if true, then I will have to
> allocate a few megabytes up front in a large buffer and port over my own
> malloc to point to it. Again, this just doesn't make sense. Can I not assign
> an individual heap to each process? The kernel should only hold a map to the
> large heap space, not to each individual small buffer that gets malloc'ed.
> The kernel should not even be involved in a malloc at all.

Right, the kernel has no business with malloc and free (except for the
low-level mechanism of providing memory pages to the process). Malloc and free
are a pure user-level implementation which works on a chunk of memory. The
malloc is the one from uclibc, and it is as fast as it is.

> I do need to benchmark my message passing exactly as is, with malloc and
> free, and signals and waits and locks and all. I am not interested in
> individual component performance; I need to know the performance when it is
> all put together in exactly the form in which it will be used. If 3 or 4
> messages per millisecond is real, then something needs to get redesigned and
> fixed. I can't use it at that speed.

Sure, you need the overall performance; however, to understand what's going
on, looking into the individual phases can be a good thing. What do you do
with signals, waits and locks? Is your communication within one process or
among multiple processes? Or a mix of it?

> Our applications involve communications and message passing. They are
> servers that run forever, not little web applications. We need to process
> hundreds of messages per millisecond, not single digits. So this is a huge
> concern for me.

Understood.

> I'll go break things up to find the slow parts, to test them one at a time,
> but your help in identifying more possible issues would be greatly
> appreciated.

Thanks, will do my best.


Adam

> -----Original Message-----
> From: Adam Lackorzynski
> Sent: Monday, November 11, 2024 5:29 AM
> To: Richard Clark ; l4-hackers@os.inf.tu-dresden.de
> Subject: Re: Throughput questions
>
> [...]
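To make the Virtio comparison concrete, below is a rough sketch of the
asynchronous model: a shared ring carries the payload, and an Irq in each
direction signals only "new data" or "space freed", so neither side polls. The
ring layout, the capability handling, and the assumption that each side has
bound the Irq it waits on to its worker thread (so the trigger arrives as an
ordinary IPC) are illustrative, not taken from the thread.

  /* Sketch of a virtio-like one-directional channel with Irq notification.
   * Assumed setup (not shown): "ring" lives in a dataspace mapped into both
   * tasks; each side holds a capability to the Irq the other side waits on. */
  #include <l4/sys/ipc.h>
  #include <l4/sys/irq.h>
  #include <string.h>

  enum { RING_SLOTS = 256, MSG_SIZE = 64 };

  struct ring
  {
    volatile unsigned head;                  /* written by producer only */
    volatile unsigned tail;                  /* written by consumer only */
    unsigned char slot[RING_SLOTS][MSG_SIZE];
  };

  /* Producer: copy the message into the ring and poke the consumer's Irq.
   * No reply is waited for; flow control is the ring being full. */
  int produce(struct ring *r, l4_cap_idx_t consumer_irq,
              void const *msg, unsigned len)
  {
    unsigned head = r->head;
    if (len > MSG_SIZE || head - r->tail == RING_SLOTS)
      return -1;                             /* full: wait for "space freed" */
    memcpy(r->slot[head % RING_SLOTS], msg, len);
    r->head = head + 1;                      /* a real ring needs memory barriers */
    l4_irq_trigger(consumer_irq);            /* "new data" notification */
    return 0;
  }

  /* Consumer: block on its bound Irq, then drain everything that arrived. */
  void consume_loop(struct ring *r, l4_cap_idx_t producer_irq)
  {
    l4_umword_t label;
    for (;;)
      {
        l4_ipc_wait(l4_utcb(), &label, L4_IPC_NEVER);  /* wakes on the trigger */
        while (r->tail != r->head)
          {
            /* ... process r->slot[r->tail % RING_SLOTS] ... */
            r->tail = r->tail + 1;
          }
        l4_irq_trigger(producer_irq);        /* "space freed" notification back */
      }
  }

The throughput gain comes from the consumer draining many slots per wakeup, so
the number of kernel entries per message can drop well below the two that a
call/reply handshake needs.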
Re: Throughput questions....
Hi Richard,

for shared-memory-based communication I'd like to suggest using L4::Irqs
instead of IPC messages, especially ipc-calls, which have a back and forth.
Please also do not use malloc within a benchmark (or benchmark malloc
separately to get an understanding of how the time is split between L4 ops
and libc). On QEMU it should be OK when running with KVM, less so without KVM.

I do not have a recommendation for an AMD-based laptop.


Cheers,
Adam

On Thu Nov 07, 2024 at 13:36:06 +, Richard Clark wrote:
> Dear L4Re experts,
>
> We now have a couple of projects in which we are going to be utilizing your
> OS, so I've been implementing and testing some of the basic functionality
> that we will need, namely message passing. I've been using the Hello World
> QEMU example as my starting point and have created a number of processes
> that communicate via a pair of unidirectional channels with IPC and shared
> memory. One channel for messages coming in, one channel for messages going
> out. The sender does an IPC_CALL() when a message has been put into shared
> memory. The receiver completes an IPC_RECEIVE(), fetches the message, and
> then responds with the IPC_REPLY() to the original IPC_CALL(). It is all
> interrupt/event driven, no sleeping, no polling. It works. I've tested it
> for robustness and it behaves exactly as expected, with the exception of
> throughput.
>
> I seem to be getting only 4000 messages per second, or roughly 4 messages
> per millisecond. Now there are a couple of malloc() and free() and
> condition_wait() and condition_signal()s going on as the events and messages
> get passed through the sender and receiver threads, but nothing (IMHO) that
> should slow things down too much. Messages are very small, like 50 bytes, as
> I'm really just trying to get a handle on basic overhead. So pretty much,
> yes, I'm beating the context-switching mechanisms to death...
>
> My questions:
> Is this normal(ish) throughput for a single-core x86_64 QEMU system?
> Am I getting hit by a time-sliced scheduler issue and most of my CPU is
> being wasted?
> How do I switch to a different non-time-sliced scheduler?
> Thoughts on what I could try to improve throughput?
>
> And lastly...
> We are going to be signing up for training soon... do you have a
> recommendation for a big beefy AMD-based linux laptop?
>
>
> Thanks!
>
> Richard H. Clark
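On the scheduler questions at the end: Fiasco.OC's standard scheduler is
fixed-priority, with round-robin time-slicing only among threads of equal
priority, so the usual knob is not a different scheduler but the priority
assigned to a thread through the scheduler capability. A hedged sketch with
the C bindings follows; the header names, pthread_l4_cap(), and the
environment's scheduler field should be double-checked against the installed
tree, as they are recalled here rather than taken from the thread.

  /* Hedged sketch: raise the calling pthread's priority so the benchmark
   * threads are not time-sliced against other runnable threads. */
  #include <l4/sys/scheduler.h>
  #include <l4/sys/err.h>
  #include <l4/re/env.h>
  #include <pthread-l4.h>          /* pthread_l4_cap(); assumed available */

  static int raise_prio(unsigned prio)
  {
    l4_sched_param_t sp = l4_sched_param(prio, 0);  /* priority, default quantum */
    l4_msgtag_t tag = l4_scheduler_run_thread(l4re_env()->scheduler,
                                              pthread_l4_cap(pthread_self()),
                                              &sp);
    return l4_error(tag);                           /* 0 on success */
  }

Whether priorities are the limiting factor here is a separate question; on a
single QEMU vCPU the sender and receiver already alternate via the IPC
rendezvous, so the per-message cost is more likely dominated by the number of
kernel entries and the libc work discussed above.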