Re: Throughput questions....

2024-11-13 Thread Philipp Eppelt

Hi Richard,

Am 11.11.24 um 14:15 schrieb Richard Clark:

Your comment to not use malloc is extremely confusing. I've also seen your 
response that using a lot
of small malloc/free calls will slow down the kernel. 


I'm sensing a misunderstanding here. There is the L4Re-Microkernel aka Fiasco.OC 
and there is the l4re_kernel as in pkg/l4re-core/l4re_kernel. These are two 
completely different things. The topic of the other thread was 
pkg/l4re-core/l4re_kernel which is a service running in each application.


Furthermore, I didn't want to suggest that there is any design-inherent slowdown 
attached to the usage of malloc/free in l4re_kernel. The l4re_kernel is just 
involved when you map new memory to your application, which only happens if the 
memory below the memory chunks malloc uses run out.


Cheers,
Philipp




-Original Message-
From: Adam Lackorzynski 
Sent: Monday, November 11, 2024 5:29 AM
To: Richard Clark ; 
l4-hackers@os.inf.tu-dresden.de
Subject: Re: Throughput questions

Hi Richard,

for using shared memory based communication I'd like to suggest to use L4::Irqs 
instead of IPC messages, especially ipc-calls which have a back and forth. 
Please also do not use malloc within a benchmark (or benchmark malloc 
separately to get an understanding how the share between L4 ops and libc is 
split). On QEMU it should be ok when running with KVM, less so without KVM.

I do not have a recommendation for an AMD-based laptop.


Cheers,
Adam

On Thu Nov 07, 2024 at 13:36:06 +, Richard Clark wrote:

Dear L4Re experts,

We now have a couple projects in which we are going to be utilizing
your OS, so I've been implementing and testing some of the basic functionality 
that we will need. Namely that would be message passing
I've been using the Hello World QEMU example as my starting point and
have created a number of processes that communicate via a pair of
unidirectional channels with IPC and shared memory. One channel for
messages coming in, one channel for messages going out. The sender
does an IPC_CALL() when a message has been put into shared memory. The receiver 
completes an IPC_RECEIVE(), fetches the message, and then responds with the 
IPC_REPLY() to the original IPC_CALL(). It is all interrupt/event driven, no 
sleeping, no polling.
It works. I've tested it for robustness and it behaves exactly as expected, 
with the exception of throughput.

I seem to be getting only 4000 messages per second. Or roughly 4
messages per millisecond. Now there are a couple malloc() and free()
and condition_wait() and condition_signal()s going on as the events and 
messages get passed through the sender and receiver threads, but nothing (IMHO) 
that should slow things down too much.
Messages are very small, like 50 bytes, as I'm really just trying to
get a handle on basic overhead. So pretty much, yes, I'm beating the 
context-switching mechanisms to death...

My questions:
Is this normal(ish) throughput for a single-core x86_64 QEMU system?
Am I getting hit by a time-sliced scheduler issue and most of my CPU is being 
wasted?
How do I switch to a different non-time-sliced scheduler?
Thoughts on what I could try to improve throughput?

And lastly...
We are going to be signing up for training soon... do you have a recommendation 
for a big beefy AMD-based linux laptop?


Thanks!

Richard H. Clark

___
l4-hackers mailing list -- l4-hackers@os.inf.tu-dresden.de
To unsubscribe send an email to l4-hackers-le...@os.inf.tu-dresden.de


--
philipp.epp...@kernkonzept.com - Tel. 0351-41 883 221
http://www.kernkonzept.com

Kernkonzept GmbH.  Sitz: Dresden.  Amtsgericht Dresden, HRB 31129.
Geschäftsführer: Dr.-Ing. Michael Hohmuth


OpenPGP_signature.asc
Description: OpenPGP digital signature
___
l4-hackers mailing list -- l4-hackers@os.inf.tu-dresden.de
To unsubscribe send an email to l4-hackers-le...@os.inf.tu-dresden.de


Re: Throughput questions....

2024-11-13 Thread Philipp Eppelt

Hi Richard,

Am 07.11.24 um 14:36 schrieb Richard Clark:


I seem to be getting only 4000 messages per second. Or roughly 4 messages per 
millisecond. Now there are
a couple malloc() and free() and condition_wait() and condition_signal()s going 
on as the events and messages
get passed through the sender and receiver threads, but nothing (IMHO) that 
should slow things down too much.
Messages are very small, like 50 bytes, as I'm really just trying to get a 
handle on basic overhead. So pretty much,
yes, I'm beating the context-switching mechanisms to death...


Please note, that condition_wait / condition_signal might use L4::Semaphore to 
synchronize the threads, which is an additional microkernel interaction / 
context switch.




My questions:
Is this normal(ish) throughput for a single-core x86_64 QEMU system?
Benchmarking on Qemu might give you a ballpark number, but QEMU performance 
depends on several factors like load on the host machine, linux scheduling, etc. 
This has the effect that the number of instructions-per-cycle execute in a time 
slice on a QEMU CPU vary.




And lastly...
We are going to be signing up for training soon... do you have a recommendation 
for a big beefy AMD-based linux laptop?
If you plan on using virtualization in your setup, please be aware that the 
AMD-SVM support is experimental and needs more work to be ready for a production 
environment. Intel-VMX is supported.



Cheers,
Philipp

--
philipp.epp...@kernkonzept.com - Tel. 0351-41 883 221
http://www.kernkonzept.com

Kernkonzept GmbH.  Sitz: Dresden.  Amtsgericht Dresden, HRB 31129.
Geschäftsführer: Dr.-Ing. Michael Hohmuth


OpenPGP_signature.asc
Description: OpenPGP digital signature
___
l4-hackers mailing list -- l4-hackers@os.inf.tu-dresden.de
To unsubscribe send an email to l4-hackers-le...@os.inf.tu-dresden.de


RE: Throughput questions....

2024-11-12 Thread Richard Clark
Adam,

Your explanation needs a lot more detail as it raises many more questions than 
it answers.
I specifically did not use irq-based messaging because it does not provide the 
handshaking that I need.
Sending a signal that a message is ready, without the ability to receive some 
sort of acknowledgement event
in return would force the sender into a painfully slow and inefficient polling 
loop. The ipc_call
function is perfect for this purpose as it not only provides the 
acknowledgement that the receiver
has processed the message, but can return a status as well. All event-driven 
with no polling and no delays.
The event-driven handshake has to exist so that the sender knows when it is 
safe to begin
sending the next message... how does an irq do this? It is only a one-way 
signal. Your irq messaging
example can only send one message and then has to poll shared memory to know 
when the receiver
has gotten it. They all use the same underlying ipc functions, just with 
different kernel object types, so I
don't understand why an ipc_call would be slow and an irq would be faster. In 
all cases, the
return handshake is required to avoid polling.

Your comment to not use malloc is extremely confusing. I've also seen your 
response that using a lot
of small malloc/free calls will slow down the kernel. That just can't be 
correct. Malloc is one of the most
used and abused calls in the entire C library. If it is not extremely fast and 
efficient, then something
is seriously wrong with the underlying software. Please confirm that this is 
the case. Because if true,
then I will have to allocate a few megabytes up front in a large buffer and 
port over my own malloc to point to it.
Again, this just doesn't make sense. Can I not assign an individual heap to 
each process? The kernel should
only hold a map to the large heap space, not each individual small buffer that 
gets malloc'ed. The kernel
should not even be involved in a malloc at all. 

I do need to benchmark my message-passing exactly as is, with malloc and free, 
and signals and waits and locks and all.
I am not interested in individual component performance, but need to know the 
performance when it is
all put together in exactly the form that it will be used. If 3 or 4 messages 
per millisecond is real, then something
needs to get redesigned and fixed. I can't use it at that speed. 

Our applications involve communications and message passing. They are servers 
that run forever, not little
web applications. We need to process hundreds of messages per millisecond, not 
single digits. So this is a
huge concern for me.

I'll go break things up to find the slow parts, to test them one at a time, but 
your help in identifying more possible 
issues would be greatly appreciated.



Thanks!

Richard


-Original Message-
From: Adam Lackorzynski  
Sent: Monday, November 11, 2024 5:29 AM
To: Richard Clark ; 
l4-hackers@os.inf.tu-dresden.de
Subject: Re: Throughput questions

Hi Richard,

for using shared memory based communication I'd like to suggest to use L4::Irqs 
instead of IPC messages, especially ipc-calls which have a back and forth. 
Please also do not use malloc within a benchmark (or benchmark malloc 
separately to get an understanding how the share between L4 ops and libc is 
split). On QEMU it should be ok when running with KVM, less so without KVM.

I do not have a recommendation for an AMD-based laptop.


Cheers,
Adam

On Thu Nov 07, 2024 at 13:36:06 +, Richard Clark wrote:
> Dear L4Re experts,
> 
> We now have a couple projects in which we are going to be utilizing 
> your OS, so I've been implementing and testing some of the basic 
> functionality that we will need. Namely that would be message passing
> I've been using the Hello World QEMU example as my starting point and 
> have created a number of processes that communicate via a pair of 
> unidirectional channels with IPC and shared memory. One channel for 
> messages coming in, one channel for messages going out. The sender 
> does an IPC_CALL() when a message has been put into shared memory. The 
> receiver completes an IPC_RECEIVE(), fetches the message, and then responds 
> with the IPC_REPLY() to the original IPC_CALL(). It is all interrupt/event 
> driven, no sleeping, no polling.
> It works. I've tested it for robustness and it behaves exactly as expected, 
> with the exception of throughput.
> 
> I seem to be getting only 4000 messages per second. Or roughly 4 
> messages per millisecond. Now there are a couple malloc() and free() 
> and condition_wait() and condition_signal()s going on as the events and 
> messages get passed through the sender and receiver threads, but nothing 
> (IMHO) that should slow things down too much.
> Messages are very small, like 50 bytes, as I'm really just trying to 
> get a handle on basic overhead. So pretty

Re: Throughput questions....

2024-11-12 Thread Adam Lackorzynski
Hi,

On Mon Nov 11, 2024 at 13:15:52 +, Richard Clark wrote:
> Your explanation needs a lot more detail as it raises many more questions 
> than it answers.
> I specifically did not use irq-based messaging because it does not provide 
> the handshaking that I need.
> Sending a signal that a message is ready, without the ability to receive some 
> sort of acknowledgement event
> in return would force the sender into a painfully slow and inefficient 
> polling loop. The ipc_call
> function is perfect for this purpose as it not only provides the 
> acknowledgement that the receiver
> has processed the message, but can return a status as well. All event-driven 
> with no polling and no delays.
> The event-driven handshake has to exist so that the sender knows when it is 
> safe to begin
> sending the next message... how does an irq do this? It is only a one-way 
> signal. Your irq messaging
> example can only send one message and then has to poll shared memory to know 
> when the receiver
> has gotten it. They all use the same underlying ipc functions, just with 
> different kernel object types, so I
> don't understand why an ipc_call would be slow and an irq would be faster. In 
> all cases, the
> return handshake is required to avoid polling.

I don't know about your mechanism, I just know that shared memory
communication can work with shared memory and a notification. For
example, Virtio uses exactly this. Notifications (Irqs) are sent in both
directions. This is a rather asychronous model. Other use-cases might
need other ways of doing it.
Of course, polling should not be used, except when sitting alone on a
core and being specifically for that.

> Your comment to not use malloc is extremely confusing. I've also seen your 
> response that using a lot
> of small malloc/free calls will slow down the kernel. That just can't be 
> correct. Malloc is one of the most
> used and abused calls in the entire C library. If it is not extremely fast 
> and efficient, then something
> is seriously wrong with the underlying software. Please confirm that this is 
> the case. Because if true,
> then I will have to allocate a few megabytes up front in a large buffer and 
> port over my own malloc to point to it.
> Again, this just doesn't make sense. Can I not assign an individual heap to 
> each process? The kernel should
> only hold a map to the large heap space, not each individual small buffer 
> that gets malloc'ed. The kernel
> should not even be involved in a malloc at all. 

Right, the kernel has no business with malloc and free (except the really
downwards mechanisms of providing proper memory pages to the process).
Malloc and free are a pure user-level implementation which works on a
chunk of memory. The malloc is the one from uclibc, and is as fast as it is.

> I do need to benchmark my message-passing exactly as is, with malloc and 
> free, and signals and waits and locks and all.
> I am not interested in individual component performance, but need to know the 
> performance when it is
> all put together in exactly the form that it will be used. If 3 or 4 messages 
> per millisecond is real, then something
> needs to get redesigned and fixed. I can't use it at that speed. 

Sure you need the overall performance, however to understand what's
going on looking into the individual phases can be a good thing.
What do you do with signals, waits and locks?
Is your communication within one process or among multiple processes? Or
a mix of it?

> Our applications involve communications and message passing. They are servers 
> that run forever, not little
> web applications. We need to process hundreds of messages per millisecond, 
> not single digits. So this is a
> huge concern for me.

Understood.

> I'll go break things up to find the slow parts, to test them one at a time, 
> but your help in identifying more possible 
> issues would be greatly appreciated.

Thanks, will do my best.



Adam

> -Original Message-
> From: Adam Lackorzynski  
> Sent: Monday, November 11, 2024 5:29 AM
> To: Richard Clark ; 
> l4-hackers@os.inf.tu-dresden.de
> Subject: Re: Throughput questions
> 
> Hi Richard,
> 
> for using shared memory based communication I'd like to suggest to use 
> L4::Irqs instead of IPC messages, especially ipc-calls which have a back and 
> forth. Please also do not use malloc within a benchmark (or benchmark malloc 
> separately to get an understanding how the share between L4 ops and libc is 
> split). On QEMU it should be ok when running with KVM, less so without KVM.
> 
> I do not have a recommendation for an AMD-based laptop.
> 
> 
> Cheers,
> Adam
> 
> On Thu Nov 07, 2024 at 13:36:06 +, Richard Clark wrote:
> >

Re: Throughput questions....

2024-11-11 Thread Adam Lackorzynski
Hi Richard,

for using shared memory based communication I'd like to suggest to use
L4::Irqs instead of IPC messages, especially ipc-calls which have a back
and forth. Please also do not use malloc within a benchmark (or
benchmark malloc separately to get an understanding how the share
between L4 ops and libc is split). On QEMU it should be ok when running
with KVM, less so without KVM.

I do not have a recommendation for an AMD-based laptop.


Cheers,
Adam

On Thu Nov 07, 2024 at 13:36:06 +, Richard Clark wrote:
> Dear L4Re experts,
> 
> We now have a couple projects in which we are going to be utilizing your OS, 
> so I've been implementing and testing
> some of the basic functionality that we will need. Namely that would be 
> message passing
> I've been using the Hello World QEMU example as my starting point and have 
> created a number of processes
> that communicate via a pair of unidirectional channels with IPC and shared 
> memory. One channel for
> messages coming in, one channel for messages going out. The sender does an 
> IPC_CALL() when a message
> has been put into shared memory. The receiver completes an IPC_RECEIVE(), 
> fetches the message, and then
> responds with the IPC_REPLY() to the original IPC_CALL(). It is all 
> interrupt/event driven, no sleeping, no polling.
> It works. I've tested it for robustness and it behaves exactly as expected, 
> with the exception of throughput.
> 
> I seem to be getting only 4000 messages per second. Or roughly 4 messages per 
> millisecond. Now there are
> a couple malloc() and free() and condition_wait() and condition_signal()s 
> going on as the events and messages
> get passed through the sender and receiver threads, but nothing (IMHO) that 
> should slow things down too much.
> Messages are very small, like 50 bytes, as I'm really just trying to get a 
> handle on basic overhead. So pretty much,
> yes, I'm beating the context-switching mechanisms to death...
> 
> My questions:
> Is this normal(ish) throughput for a single-core x86_64 QEMU system?
> Am I getting hit by a time-sliced scheduler issue and most of my CPU is being 
> wasted?
> How do I switch to a different non-time-sliced scheduler?
> Thoughts on what I could try to improve throughput?
> 
> And lastly...
> We are going to be signing up for training soon... do you have a 
> recommendation for a big beefy AMD-based linux laptop?
> 
> 
> Thanks!
> 
> Richard H. Clark
___
l4-hackers mailing list -- l4-hackers@os.inf.tu-dresden.de
To unsubscribe send an email to l4-hackers-le...@os.inf.tu-dresden.de