Re: [RFC PATCH 01/17] shm-signal: shared-memory signals
Avi Kivity wrote:
> Gregory Haskins wrote:
>>>> +struct shm_signal_irq {
>>>> +	__u8 enabled;
>>>> +	__u8 pending;
>>>> +	__u8 dirty;
>>>> +};
>>> Some ABIs may choose to pad this, suggest explicit padding.
>> Yeah, good idea.
>>
>> What is the official way to do this these days?  Are GCC pragmas allowed?
> I just add a __u8 pad[5] in such cases.

Oh, duh.  Dumb question.  I was getting confused with pack, not pad. :)

>>>> +
>>>> +struct shm_signal;
>>>> +
>>>> +struct shm_signal_ops {
>>>> +	int (*inject)(struct shm_signal *s);
>>>> +	void (*fault)(struct shm_signal *s, const char *fmt, ...);
>>> Eww.  Must we involve strings and printf formats?
>> This is still somewhat of an immature part of the design.  It's supposed
>> to be used so that, by default, it's a panic.  But on the host side we
>> can do something like inject a machine-check instead.  That way
>> malicious/broken guests cannot (should not? ;) take down the host.
>>
>> Note that today I do not map this to anything other than the default
>> panic, so this needs some love.  But given the asynchronous nature of
>> the fault, I want to be sure we have decent accounting to avoid bug
>> reports like "silent MCE kills the guest" ;)  At least this way we can
>> log the fault string somewhere to get a clue.
> I see.  This raises a point I've been thinking of - the symmetrical
> nature of the API vs the asymmetrical nature of guest/host or
> user/kernel interfaces.  This is most pronounced in ->inject(); in the
> host->guest direction this is async (the host can continue processing
> while the guest is handling the interrupt), whereas in the guest->host
> direction it is synchronous (the guest is blocked while the host is
> processing the call, unless the host explicitly hands off work to a
> different thread).

Note that this is exactly what I do (though it is device specific):
venet-tap has an ioq_notifier registered on its rx ring (which is the
tx ring for the guest) that simply calls ioq_notify_disable() (which
calls shm_signal_disable() under the covers) and wakes its rx thread.
This all happens in the context of the hypercall, which then returns
and allows the vcpu to re-enter guest mode immediately.
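For illustration, a padded layout might look like the sketch below -
this is just one possibility, assuming the pad bytes are appended to
round the structure up to 8 bytes, with a compile-time check so no
toolchain can silently re-pad it (__u8 in the kernel; uint8_t here to
keep the sketch self-contained):

    #include <stdint.h>

    struct shm_signal_irq {
            uint8_t enabled;  /* masked/unmasked via shm_signal_enable/disable() */
            uint8_t pending;  /* a remote notification is already in flight */
            uint8_t dirty;    /* shared memory changed since the last signal */
            uint8_t pad[5];   /* explicit padding, per the review suggestion */
    };

    /* Fails to compile if the structure is not exactly 8 bytes. */
    _Static_assert(sizeof(struct shm_signal_irq) == 8,
                   "shm_signal_irq must be 8 bytes");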
Re: [RFC PATCH 01/17] shm-signal: shared-memory signals
Gregory Haskins wrote:
> Note that this is exactly what I do (though it is device specific):
> venet-tap has an ioq_notifier registered on its rx ring (which is the
> tx ring for the guest) that simply calls ioq_notify_disable() (which
> calls shm_signal_disable() under the covers) and wakes its rx thread.
> This all happens in the context of the hypercall, which then returns
> and allows the vcpu to re-enter guest mode immediately.

I think this is suboptimal.  The ring is likely to be cache hot on the
current cpu, and waking a thread will introduce scheduling latency +
IPI + cache-to-cache transfers.

On a benchmark setup, host resources are likely to exceed guest
requirements, so you can throw cpu at the problem and no one notices.
But I think the bits/cycle figure will decrease, even if bits/sec
increases.
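As a rough userspace analogue of the pattern being debated (not the
actual venet-tap code - all names below are invented, and the pthread
machinery stands in for the kernel's wait/wake primitives), the notify
path masks further signals and wakes a worker, then returns
immediately so the caller can resume its own work:

    #include <pthread.h>
    #include <stdbool.h>

    struct rx_worker {
            pthread_mutex_t lock;
            pthread_cond_t  wake;
            bool            kicked;
            bool            notify_enabled;
    };

    /* Runs in "hypercall" context: cheap and non-blocking. */
    static void rx_ring_signal(struct rx_worker *w)
    {
            pthread_mutex_lock(&w->lock);
            w->notify_enabled = false;      /* shm_signal_disable() analogue */
            w->kicked = true;
            pthread_cond_signal(&w->wake);  /* wake the rx thread */
            pthread_mutex_unlock(&w->lock);
    }

    /* The rx thread drains the ring, then re-enables notifications. */
    static void *rx_thread(void *arg)
    {
            struct rx_worker *w = arg;

            pthread_mutex_lock(&w->lock);
            for (;;) {
                    while (!w->kicked)
                            pthread_cond_wait(&w->wake, &w->lock);
                    w->kicked = false;
                    pthread_mutex_unlock(&w->lock);

                    /* ... drain the ring / process packets here ... */

                    pthread_mutex_lock(&w->lock);
                    w->notify_enabled = true;  /* shm_signal_enable() analogue */
            }
            return NULL;
    }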
Re: [RFC PATCH 01/17] shm-signal: shared-memory signals
Avi Kivity wrote:
> Gregory Haskins wrote:
>> Note that this is exactly what I do (though it is device specific):
>> venet-tap has an ioq_notifier registered on its rx ring (which is the
>> tx ring for the guest) that simply calls ioq_notify_disable() (which
>> calls shm_signal_disable() under the covers) and wakes its rx thread.
>> This all happens in the context of the hypercall, which then returns
>> and allows the vcpu to re-enter guest mode immediately.
> I think this is suboptimal.

Heh, yes, I know this is your (well documented) position, but I
respectfully disagree. :)  CPUs are not getting much faster, but they
are rapidly getting more cores.  If we want to continue to make
software run increasingly faster, we need to actually use those cores,
IMO.  Generally this means splitting workloads up into as many threads
as possible, as long as you can keep the pipelines filled.

> The ring is likely to be cache hot on the current cpu, and waking a
> thread will introduce scheduling latency + IPI

This part is a valid criticism, though note that Linux is very adept
at scheduling, so we are talking a mere ns/us range here, which is
dwarfed by the latency of something like your typical IO device
(e.g. 36us for a round-trip packet on 10GE bare metal, etc).  The
benefit, of course, is the potential for increased parallelism, which
I have plenty of data to show we are very much taking advantage of
here (I can saturate two cores almost completely according to LTT
traces: one doing vcpu work, and the other running my rx thread, which
schedules the packet on the hardware).

> + cache-to-cache transfers.

This one I take exception to.  While it is perfectly true that
splitting the work between two cores has a greater cache impact than
staying on one, you cannot look at this one metric alone and say "this
is bad".  It's also a function of how efficiently the second (or more)
cores are utilized.  There will be a point in the curve where the cost
of cache coherence is marginalized by the efficiency added by the
extra compute power.  Some workloads will invariably be on the bad end
of that curve, and therefore doing the work on one core is better.
However, we can't ignore that there will be others on the good end of
the spectrum either.  Otherwise, we risk performance stagnation on our
effectively uniprocessor box. ;)

In addition, the task scheduler will attempt to co-locate tasks that
are sharing data according to a best fit within the cache hierarchy,
so we will still share as much as possible (perhaps only L2, L3, or a
local NUMA domain, but this is still better than nothing).

The way I have been thinking about these issues is something I have
been calling "soft asics".  In the early days we had a simple
uniprocessor box with a simple, dumb ethernet card.  People figured
out that if you put more processing power into the NIC, you could
offload that work from the cpu and do more in parallel, so things like
checksum computation and segmentation duties were a good fit.  More
recently we see even more advanced hardware that can do L2 or even L4
packet classification right in the silicon, etc.  All of these things
are effectively parallel computation, and they occur in a completely
foreign cache domain!

So a lot of my research has been around the notion of using some of
our cpu cores to do work like the advanced asic-based offload engines
do.  The cores are often underutilized anyway, and this brings some of
the features of advanced silicon to commodity resources.  They also
have the added flexibility that it's just software, so you can change
or enhance the system at will.

So if you think about it, by using threads like this in venet-tap I am
effectively using other cores to do csum/segmentation (if the physical
hardware doesn't support it), layer 2 classification (linux bridging),
filtering (iptables in the bridge), queuing, etc., as if it were some
smart device out on the PCI bus.  The guest just queues up packets
independently in its own memory, while the device DMAs the data on its
own (after the initial kick), and the vcpu keeps the pipeline filled
on its side independently.

> On a benchmark setup, host resources are likely to exceed guest
> requirements, so you can throw cpu at the problem and no one notices.

Sure, but with the type of design I have presented this still sorts
itself out naturally even when the host doesn't have the resources.
For instance, if a large number of threads is competing for a small
number of cores, we will simply see things like the rx thread stalling
and going to sleep, or the vcpu thread backpressuring and going idle
(and therefore sleeping).  All of these things are self-throttling.
If you don't have enough resources to run a workload at a desirable
performance level, the system wasn't sized right to begin with. ;)

> But I think the bits/cycle figure will decrease, even if bits/sec
> increases.

Note that this isn't necessarily a bad thing.  I
Re: [RFC PATCH 01/17] shm-signal: shared-memory signals
Gregory Haskins wrote:
> This interface provides a bidirectional shared-memory based signaling
> mechanism.  It can be used by any entities which desire efficient
> communication via shared memory.  The implementation details of the
> signaling are abstracted so that they may transcend a wide variety of
> locale boundaries (e.g. userspace/kernel, guest/host, etc).
>
> The shm_signal mechanism supports event masking as well as spurious
> event delivery mitigation.
>
> +
> +/*
> + *---------------------------------------------------------------------
> + * The following structures represent data that is shared across
> + * boundaries which may be quite disparate from one another
> + * (e.g. Windows vs Linux, 32 vs 64 bit, etc).  Therefore, care has
> + * been taken to make sure they present data in a manner that is
> + * independent of the environment.
> + *---------------------------------------------------------------------
> + */
> +
> +#define SHM_SIGNAL_MAGIC 0x58fa39df
> +#define SHM_SIGNAL_VER   1
> +
> +struct shm_signal_irq {
> +	__u8 enabled;
> +	__u8 pending;
> +	__u8 dirty;
> +};

Some ABIs may choose to pad this, suggest explicit padding.

> +
> +enum shm_signal_locality {
> +	shm_locality_north,
> +	shm_locality_south,
> +};
> +
> +struct shm_signal_desc {
> +	__u32 magic;
> +	__u32 ver;
> +	struct shm_signal_irq irq[2];
> +};

Similarly, this should be padded to 0 (mod 8).

Instead of versions, I prefer feature flags which can be independently
enabled or disabled.

> +
> +/* --- END SHARED STRUCTURES --- */
> +
> +#ifdef __KERNEL__
> +
> +#include <linux/interrupt.h>
> +
> +struct shm_signal_notifier {
> +	void (*signal)(struct shm_signal_notifier *);
> +};

This means ->inject() has been called from the other side?

(reading below I see this is so. not used to reading well commented
code... :)

> +
> +struct shm_signal;
> +
> +struct shm_signal_ops {
> +	int (*inject)(struct shm_signal *s);
> +	void (*fault)(struct shm_signal *s, const char *fmt, ...);

Eww.  Must we involve strings and printf formats?

> +	void (*release)(struct shm_signal *s);
> +};
> +
> +/*
> + * signaling protocol:
> + *
> + *   each side of the shm_signal has an irq structure with the
> + *   following fields:
> + *
> + *    - enabled: controlled by shm_signal_enable/disable() to
> + *      mask/unmask the notification locally
> + *    - dirty: indicates if the shared-memory is dirty or clean.
> + *      This is updated regardless of the enabled/pending state so
> + *      that the state is always accurately tracked.
> + *    - pending: indicates if a signal is pending to the remote
> + *      locale.  This allows us to determine if a remote-notification
> + *      is already in flight to optimize spurious notifications away.
> + */

When you overlay a ring on top of this, won't the ring indexes convey
the same information as ->dirty?
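The comment block implies a sender-side fast path roughly like the
sketch below (my reading of the protocol, not code from the patch;
memory barriers are omitted for brevity, and inject_fn stands in for
the locale-specific ops->inject() hook):

    #include <stdint.h>

    struct shm_signal_irq {
            uint8_t enabled;
            uint8_t pending;
            uint8_t dirty;
    };

    struct shm_signal_desc {
            uint32_t magic;
            uint32_t ver;
            struct shm_signal_irq irq[2];  /* [0] = north, [1] = south */
    };

    /* Always mark the remote side dirty, but only inject a physical
     * signal when the remote has notifications enabled and none is
     * already in flight - this is the spurious-delivery mitigation.
     */
    static void shm_signal_notify(struct shm_signal_desc *desc, int remote,
                                  void (*inject_fn)(void))
    {
            struct shm_signal_irq *irq = &desc->irq[remote];

            irq->dirty = 1;  /* tracked regardless of masking state */

            if (irq->enabled && !irq->pending) {
                    irq->pending = 1;  /* suppress redundant signals */
                    inject_fn();       /* e.g. interrupt or hypercall */
            }
    }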
Re: [RFC PATCH 01/17] shm-signal: shared-memory signals
Avi Kivity wrote:
> Gregory Haskins wrote:
>> This interface provides a bidirectional shared-memory based signaling
>> mechanism.  It can be used by any entities which desire efficient
>> communication via shared memory.  The implementation details of the
>> signaling are abstracted so that they may transcend a wide variety of
>> locale boundaries (e.g. userspace/kernel, guest/host, etc).
>>
>> The shm_signal mechanism supports event masking as well as spurious
>> event delivery mitigation.
>>
>> +
>> +/*
>> + *-------------------------------------------------------------------
>> + * The following structures represent data that is shared across
>> + * boundaries which may be quite disparate from one another
>> + * (e.g. Windows vs Linux, 32 vs 64 bit, etc).  Therefore, care has
>> + * been taken to make sure they present data in a manner that is
>> + * independent of the environment.
>> + *-------------------------------------------------------------------
>> + */
>> +
>> +#define SHM_SIGNAL_MAGIC 0x58fa39df
>> +#define SHM_SIGNAL_VER   1
>> +
>> +struct shm_signal_irq {
>> +	__u8 enabled;
>> +	__u8 pending;
>> +	__u8 dirty;
>> +};
> Some ABIs may choose to pad this, suggest explicit padding.

Yeah, good idea.

What is the official way to do this these days?  Are GCC pragmas allowed?

>> +
>> +enum shm_signal_locality {
>> +	shm_locality_north,
>> +	shm_locality_south,
>> +};
>> +
>> +struct shm_signal_desc {
>> +	__u32 magic;
>> +	__u32 ver;
>> +	struct shm_signal_irq irq[2];
>> +};
> Similarly, this should be padded to 0 (mod 8).

Ack

> Instead of versions, I prefer feature flags which can be independently
> enabled or disabled.

Totally agreed.  If you look, most of the ABI has a type of NEGCAP
(negotiate capabilities) feature.  The version number is a contingency
plan in case I still have to break it for whatever reason.  I will
always opt for the feature bits over bumping the version when it's
feasible (especially if/when this is actually in the field).

>> +
>> +/* --- END SHARED STRUCTURES --- */
>> +
>> +#ifdef __KERNEL__
>> +
>> +#include <linux/interrupt.h>
>> +
>> +struct shm_signal_notifier {
>> +	void (*signal)(struct shm_signal_notifier *);
>> +};
> This means ->inject() has been called from the other side?

Yep.

> (reading below I see this is so. not used to reading well commented
> code... :)

:)

>> +
>> +struct shm_signal;
>> +
>> +struct shm_signal_ops {
>> +	int (*inject)(struct shm_signal *s);
>> +	void (*fault)(struct shm_signal *s, const char *fmt, ...);
> Eww.  Must we involve strings and printf formats?

This is still somewhat of an immature part of the design.  It's
supposed to be used so that, by default, it's a panic.  But on the host
side we can do something like inject a machine-check instead.  That way
malicious/broken guests cannot (should not? ;) take down the host.

Note that today I do not map this to anything other than the default
panic, so this needs some love.  But given the asynchronous nature of
the fault, I want to be sure we have decent accounting to avoid bug
reports like "silent MCE kills the guest" ;)  At least this way we can
log the fault string somewhere to get a clue.

>> +	void (*release)(struct shm_signal *s);
>> +};
>> +
>> +/*
>> + * signaling protocol:
>> + *
>> + *   each side of the shm_signal has an irq structure with the
>> + *   following fields:
>> + *
>> + *    - enabled: controlled by shm_signal_enable/disable() to
>> + *      mask/unmask the notification locally
>> + *    - dirty: indicates if the shared-memory is dirty or clean.
>> + *      This is updated regardless of the enabled/pending state so
>> + *      that the state is always accurately tracked.
>> + *    - pending: indicates if a signal is pending to the remote
>> + *      locale.  This allows us to determine if a remote-notification
>> + *      is already in flight to optimize spurious notifications away.
>> + */
> When you overlay a ring on top of this, won't the ring indexes convey
> the same information as ->dirty?

I agree that the information may be redundant with components of the
broader shm state.  However, we need this state at this level of scope
in order to function optimally, so I don't think it's a huge deal to
have it here as well.  After all, the shm_signal library can only
assess its internal state; we would have to teach it how to glean the
broader state through some other mechanism (a callback, perhaps), but I
don't think it's worth it.

-Greg
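A NEGCAP-style negotiation could be as simple as the hypothetical
sketch below; the flag names are invented for illustration and do not
come from the series:

    #include <stdint.h>

    /* Each side advertises a bitmask of the features it understands. */
    #define SHM_FEAT_EXPLICIT_PAD   (1u << 0)  /* hypothetical */
    #define SHM_FEAT_FAULT_STRINGS  (1u << 1)  /* hypothetical */

    /* The effective feature set is the intersection, so either side
     * can independently drop a feature without a version bump.
     */
    static uint32_t shm_negotiate(uint32_t ours, uint32_t theirs)
    {
            return ours & theirs;
    }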
Re: [RFC PATCH 01/17] shm-signal: shared-memory signals
Gregory Haskins wrote:
>>> +struct shm_signal_irq {
>>> +	__u8 enabled;
>>> +	__u8 pending;
>>> +	__u8 dirty;
>>> +};
>> Some ABIs may choose to pad this, suggest explicit padding.
> Yeah, good idea.
>
> What is the official way to do this these days?  Are GCC pragmas allowed?

I just add a __u8 pad[5] in such cases.

>>> +
>>> +struct shm_signal;
>>> +
>>> +struct shm_signal_ops {
>>> +	int (*inject)(struct shm_signal *s);
>>> +	void (*fault)(struct shm_signal *s, const char *fmt, ...);
>> Eww.  Must we involve strings and printf formats?
> This is still somewhat of an immature part of the design.  It's
> supposed to be used so that, by default, it's a panic.  But on the host
> side we can do something like inject a machine-check instead.  That way
> malicious/broken guests cannot (should not? ;) take down the host.
>
> Note that today I do not map this to anything other than the default
> panic, so this needs some love.  But given the asynchronous nature of
> the fault, I want to be sure we have decent accounting to avoid bug
> reports like "silent MCE kills the guest" ;)  At least this way we can
> log the fault string somewhere to get a clue.

I see.  This raises a point I've been thinking of - the symmetrical
nature of the API vs the asymmetrical nature of guest/host or
user/kernel interfaces.  This is most pronounced in ->inject(); in the
host->guest direction this is async (the host can continue processing
while the guest is handling the interrupt), whereas in the guest->host
direction it is synchronous (the guest is blocked while the host is
processing the call, unless the host explicitly hands off work to a
different thread).
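To make the asymmetry concrete, here is a hypothetical sketch of the
two ->inject() flavors (the function names and comments are
illustrative only; the real hooks are supplied per locale):

    struct shm_signal;

    /* Host -> guest: queue a virtual interrupt and return at once; the
     * host keeps running while the guest services it (asynchronous).
     */
    static int host_side_inject(struct shm_signal *s)
    {
            (void)s;
            /* e.g. request interrupt injection into the vcpu */
            return 0;
    }

    /* Guest -> host: trap into the host via a hypercall; the vcpu is
     * blocked for the full round trip unless the host hands the work
     * off to another thread (synchronous).
     */
    static int guest_side_inject(struct shm_signal *s)
    {
            (void)s;
            /* e.g. hypercall(SHM_SIGNAL_NOTIFY, ...) */
            return 0;
    }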