Re: [RFC PATCH 01/17] shm-signal: shared-memory signals

2009-04-01 Thread Gregory Haskins
Avi Kivity wrote:
 Gregory Haskins wrote:
 +struct shm_signal_irq {
 +__u8  enabled;
 +__u8  pending;
 +__u8  dirty;
 +};
 
 Some ABIs may choose to pad this, suggest explicit padding.
 

 Yeah, good idea.  What is the official way to do this these days?  Are
 GCC pragmas allowed?

   

 I just add a __u8 pad[5] in such cases.

Oh, duh.  Dumb question.  I was getting confused with pack, not pad.  :)
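So something like this, then (a sketch with the pad Avi suggested, keeping
the struct at an 8-byte layout on every ABI):

struct shm_signal_irq {
	__u8 enabled;	/* masked/unmasked via shm_signal_enable/disable() */
	__u8 pending;	/* a notification is already in flight */
	__u8 dirty;	/* shared memory has changed since last ack */
	__u8 pad[5];	/* explicit padding: same 8-byte layout everywhere */
};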


 +
 +struct shm_signal;
 +
 +struct shm_signal_ops {
 +int  (*inject)(struct shm_signal *s);
 +void (*fault)(struct shm_signal *s, const char *fmt, ...);
 
 Eww.  Must we involve strings and printf formats?
 

 This is still a somewhat immature part of the design.  It's supposed
 to be used so that, by default, it's a panic.  But on the host side, we
 can do something like inject a machine-check.  That way malicious/broken
 guests cannot (should not? ;) take down the host.  Note that today
 I do not map this to anything other than the default panic, so this
 needs some love.

 But given the asynchronous nature of the fault, I want to be sure we
 have decent accounting to avoid bug reports like "silent MCE kills the
 guest" ;)  At least this way, we can log the fault string somewhere to
 get a clue.
   

 I see.

 This raises a point I've been thinking of - the symmetrical nature of
 the API vs the asymmetrical nature of guest/host or user/kernel
 interfaces.  This is most pronounced in ->inject(); in the host->guest
 direction this is async (the host can continue processing while the
 guest is handling the interrupt), whereas in the guest->host direction
 it is synchronous (the guest is blocked while the host is processing
 the call, unless the host explicitly hands off work to a different thread).

Note that this is exactly what I do (though it is device specific).
venet-tap has an ioq_notifier registered on its rx ring (which is the
tx-ring for the guest) that simply calls ioq_notify_disable() (which
calls shm_signal_disable() under the covers) and wakes its
rx-thread.  This all happens in the context of the hypercall, which then
returns and allows the vcpu to re-enter guest mode immediately.
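In sketch form it looks roughly like the following; the callback and the
exact ioq_notify_disable() signature are illustrative, only the names
mentioned above come from the actual code:

/* runs in hypercall context when the guest kicks its tx-ring */
static void venet_tap_rx_notify(struct ioq_notifier *notifier)
{
	struct venet_tap *tap = container_of(notifier,
					     struct venet_tap, rx_notifier);

	ioq_notify_disable(tap->rxq);	 /* shm_signal_disable() underneath */
	wake_up_process(tap->rx_thread); /* defer the real work */

	/* returning here lets the vcpu re-enter guest mode immediately */
}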









Re: [RFC PATCH 01/17] shm-signal: shared-memory signals

2009-04-01 Thread Avi Kivity

Gregory Haskins wrote:
Note that this is exactly what I do (though it is device specific).
venet-tap has an ioq_notifier registered on its rx ring (which is the
tx-ring for the guest) that simply calls ioq_notify_disable() (which
calls shm_signal_disable() under the covers) and wakes its
rx-thread.  This all happens in the context of the hypercall, which then
returns and allows the vcpu to re-enter guest mode immediately.
I think this is suboptimal.  The ring is likely to be cache hot on the
current cpu; waking a thread will introduce scheduling latency + IPI
+ cache-to-cache transfers.


On a benchmark setup, host resources are likely to exceed guest 
requirements, so you can throw cpu at the problem and no one notices.  
But I think the bits/cycle figure will decrease, even if bits/sec increases.




Re: [RFC PATCH 01/17] shm-signal: shared-memory signals

2009-04-01 Thread Gregory Haskins
Avi Kivity wrote:
 Gregory Haskins wrote:
 Note that this is exactly what I do (though it is device specific).
 venet-tap has an ioq_notifier registered on its rx ring (which is the
 tx-ring for the guest) that simply calls ioq_notify_disable() (which
 calls shm_signal_disable() under the covers) and wakes its
 rx-thread.  This all happens in the context of the hypercall, which then
 returns and allows the vcpu to re-enter guest mode immediately.
 I think this is suboptimal.

Heh, yes I know this is your (well documented) position, but I
respectfully disagree. :)

CPUs are not getting much faster, but they are rapidly getting more
cores.  If we want software to keep getting faster, we need to actually
use those cores, IMO.  Generally this means splitting workloads up into
as many threads as possible, as long as you can keep the pipelines filled.

   The ring is likely to be cache hot on the current cpu; waking a
 thread will introduce scheduling latency + IPI
This part is a valid criticism, though note that Linux is very adept at
scheduling, so we are talking the mere ns/us range here, which is
dwarfed by the latency of something like your typical IO device (e.g.
36us for a round-trip packet on 10GbE bare metal, etc).  The benefit,
of course, is the potential for increased parallelism, which I have
plenty of data to show we are very much taking advantage of here (I can
saturate two cores almost completely according to LTT traces: one doing
vcpu work, and the other running my rx thread, which schedules the
packet on the hardware).

 + cache-to-cache transfers.
This one I take exception to.  While it is perfectly true that splitting
the work between two cores has a greater cache impact than staying on
one, you cannot look at this one metric alone and say "this is bad".
It's also a function of how efficiently the second (or more) cores are
utilized.  There will be a point in the curve where the cost of cache
coherence is marginalized by the efficiency added by the extra compute
power.  Some workloads will invariably be on the bad end of that curve,
and for those, doing the work on one core is better.  However, we can't
ignore that there will be others on the good end of this spectrum
either.  Otherwise, we risk performance stagnation on our effectively
uniprocessor box ;).  In addition, the task-scheduler will attempt to
co-locate tasks that are sharing data according to a best-fit within
the cache hierarchy.  Therefore, we will still be sharing as much as
possible (perhaps only L2, L3, or a local NUMA domain, but this is
still better than nothing).

The way I have been thinking about these issues is something I have been
calling "soft-asics".  In the early days, we had things like a simple
uniprocessor box with a dumb Ethernet card.  People figured out that
if you put more processing power into the NIC, you could offload that
work from the cpu and do more in parallel.  So things like checksum
computation and segmentation duties were a good fit.  More recently, we
see even more advanced hardware where you can do L2 or even L4 packet
classification right in the hardware, etc.  All of these things are
effectively parallel computation, and they occur in a completely foreign
cache domain!

So a lot of my research has been around the notion of trying to use some
of our cpu cores to do work the way the advanced asic-based offload
engines do.  The cores are often underutilized anyway, and this brings
some of the features of advanced silicon to commodity resources.  They
also have the added flexibility that it's just software, so you can
change or enhance the system at will.

So if you think about it, by using threads like this in venet-tap, I am
effectively using other cores to do csum/segmentation (if the physical
hardware doesn't support it), layer 2 classification (Linux bridging),
filtering (iptables in the bridge), queuing, etc., as if it were some
smart device out on the PCI bus.  The guest just queues up packets
independently in its own memory, while the device DMAs the data
on its own (after the initial kick).  The vcpu keeps the pipeline
filled on its side independently.


 On a benchmark setup, host resources are likely to exceed guest
 requirements, so you can throw cpu at the problem and no one notices.
Sure, but with the type of design I have presented, this still sorts
itself out naturally even if the host doesn't have the resources.  For
instance, if there is a large number of threads competing for a small
number of cores, we will simply see things like the rx-thread stalling
and going to sleep, or the vcpu thread backpressuring and going idle
(and therefore sleeping).  All of these things are self-throttling.  If
you don't have enough resources to run a workload at a desirable
performance level, the system wasn't sized right to begin with. ;)

   But I think the bits/cycle figure will decrease, even if bits/sec
 increases.

Note that this isn't necessarily a bad thing.  I 

Re: [RFC PATCH 01/17] shm-signal: shared-memory signals

2009-03-31 Thread Avi Kivity

Gregory Haskins wrote:

This interface provides a bidirectional shared-memory based signaling
mechanism.  It can be used by any entities which desire efficient
communication via shared memory.  The implementation details of the
signaling are abstracted so that they may transcend a wide variety
of locale boundaries (e.g. userspace/kernel, guest/host, etc).

The shm_signal mechanism supports event masking as well as spurious
event delivery mitigation.
+
+/*
+ *---------------------------------------------------------------------
+ * The following structures represent data that is shared across boundaries
+ * which may be quite disparate from one another (e.g. Windows vs Linux,
+ * 32 vs 64 bit, etc).  Therefore, care has been taken to make sure they
+ * present data in a manner that is independent of the environment.
+ *---------------------------------------------------------------------
+ */
+
+#define SHM_SIGNAL_MAGIC 0x58fa39df
+#define SHM_SIGNAL_VER   1
+
+struct shm_signal_irq {
+   __u8  enabled;
+   __u8  pending;
+   __u8  dirty;
+};
  


Some ABIs may choose to pad this, suggest explicit padding.


+
+enum shm_signal_locality {
+   shm_locality_north,
+   shm_locality_south,
+};
+
+struct shm_signal_desc {
+   __u32 magic;
+   __u32 ver;
+   struct shm_signal_irq irq[2];
+};
  


Similarly, this should be padded to 0 (mod 8).
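(e.g., a compile-time check along these lines would catch any drift -
just a sketch, not part of the patch:)

static inline void shm_signal_desc_check(void)
{
	/* any ABI-visible size change fails the build */
	BUILD_BUG_ON(sizeof(struct shm_signal_irq) != 8);
	BUILD_BUG_ON(sizeof(struct shm_signal_desc) % 8 != 0);
}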

Instead of versions, I prefer feature flags which can be independently 
enabled or disabled.



+
+/* --- END SHARED STRUCTURES --- */
+
+#ifdef __KERNEL__
+
+#include <linux/interrupt.h>
+
+struct shm_signal_notifier {
+   void (*signal)(struct shm_signal_notifier *);
+};
  


This means ->inject() has been called from the other side?

(reading below I see this is so.  not used to reading well commented 
code... :)
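
(So a consumer presumably wires it up something like this - names below
are made up for illustration:)

static void my_signal_cb(struct shm_signal_notifier *n)
{
	/* invoked locally after the remote side's ->inject() fires */
}

static struct shm_signal_notifier my_notifier = {
	.signal = my_signal_cb,
};

/* hookup is assumed, e.g. s->notifier = &my_notifier; */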



+
+struct shm_signal;
+
+struct shm_signal_ops {
+   int  (*inject)(struct shm_signal *s);
+   void (*fault)(struct shm_signal *s, const char *fmt, ...);
  


Eww.  Must we involve strings and printf formats?


+   void (*release)(struct shm_signal *s);
+};
+
+/*
+ * signaling protocol:
+ *
+ * each side of the shm_signal has an irq structure with the following
+ * fields:
+ *
+ *- enabled: controlled by shm_signal_enable/disable() to mask/unmask
+ *   the notification locally
+ *- dirty:   indicates if the shared-memory is dirty or clean.  This
+ *   is updated regardless of the enabled/pending state so that
+ *   the state is always accurately tracked.
+ *- pending: indicates if a signal is pending to the remote locale.
+ *   This allows us to determine if a remote-notification is
+ *   already in flight to optimize spurious notifications away.
+ */
  


When you overlay a ring on top of this, won't the ring indexes convey
the same information as ->dirty?





Re: [RFC PATCH 01/17] shm-signal: shared-memory signals

2009-03-31 Thread Gregory Haskins
Avi Kivity wrote:
 Gregory Haskins wrote:
 This interface provides a bidirectional shared-memory based signaling
 mechanism.  It can be used by any entities which desire efficient
 communication via shared memory.  The implementation details of the
 signaling are abstracted so that they may transcend a wide variety
 of locale boundaries (e.g. userspace/kernel, guest/host, etc).

 The shm_signal mechanism supports event masking as well as spurious
 event delivery mitigation.
 +
 +/*
 + *---------------------------------------------------------------------
 + * The following structures represent data that is shared across boundaries
 + * which may be quite disparate from one another (e.g. Windows vs Linux,
 + * 32 vs 64 bit, etc).  Therefore, care has been taken to make sure they
 + * present data in a manner that is independent of the environment.
 + *---------------------------------------------------------------------
 + */
 +
 +#define SHM_SIGNAL_MAGIC 0x58fa39df
 +#define SHM_SIGNAL_VER   1
 +
 +struct shm_signal_irq {
 +__u8  enabled;
 +__u8  pending;
 +__u8  dirty;
 +};
   

 Some ABIs may choose to pad this, suggest explicit padding.

Yeah, good idea.  What is the official way to do this these days?  Are
GCC pragmas allowed?


 +
 +enum shm_signal_locality {
 +shm_locality_north,
 +shm_locality_south,
 +};
 +
 +struct shm_signal_desc {
 +__u32 magic;
 +__u32 ver;
 +struct shm_signal_irq irq[2];
 +};
   

 Similarly, this should be padded to 0 (mod 8).
Ack


 Instead of versions, I prefer feature flags which can be independently
 enabled or disabled.

Totally agreed.  If you look, most of the ABI has a type of NEGCAP
(negotiate capabilities) feature.  The version number is a contingency
plan in case I still have to break it for whatever reason.  I will
always opt for the feature bits over bumping the version when it's
feasible (especially if/when this is actually in the field).
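(i.e., conceptually something like this - hypothetical names, not the
actual NEGCAP code:)

#define SHM_SIGNAL_FEAT_A	(1 << 0)	/* hypothetical feature bits */
#define SHM_SIGNAL_FEAT_B	(1 << 1)

/* each side advertises its features; both operate on the intersection */
static __u32 shm_signal_negotiate(__u32 local, __u32 remote)
{
	return local & remote;
}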


 +
 +/* --- END SHARED STRUCTURES --- */
 +
 +#ifdef __KERNEL__
 +
 +#include <linux/interrupt.h>
 +
 +struct shm_signal_notifier {
 +void (*signal)(struct shm_signal_notifier *);
 +};
   

 This means ->inject() has been called from the other side?

Yep

 (reading below I see this is so.  not used to reading well commented
 code... :)

:)


 +
 +struct shm_signal;
 +
 +struct shm_signal_ops {
 +int  (*inject)(struct shm_signal *s);
 +void (*fault)(struct shm_signal *s, const char *fmt, ...);
   

 Eww.  Must we involve strings and printf formats?

This is still a somewhat immature part of the design.  It's supposed
to be used so that, by default, it's a panic.  But on the host side, we
can do something like inject a machine-check.  That way malicious/broken
guests cannot (should not? ;) take down the host.  Note that today
I do not map this to anything other than the default panic, so this
needs some love.

But given the asynchronous nature of the fault, I want to be sure we
have decent accounting to avoid bug reports like "silent MCE kills the
guest" ;)  At least this way, we can log the fault string somewhere to
get a clue.
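
For illustration, the default path would look something along these
lines (the helper name and buffer size are hypothetical, not code from
this patch):

static void shm_signal_fault_panic(struct shm_signal *s,
				   const char *fmt, ...)
{
	va_list args;
	char buf[170];

	va_start(args, fmt);
	vsnprintf(buf, sizeof(buf), fmt, args);
	va_end(args);

	/*
	 * default policy: fatal.  A host-side backend could instead log
	 * buf and inject a machine-check into the offending guest.
	 */
	panic("shm_signal: %s", buf);
}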


 +void (*release)(struct shm_signal *s);
 +};
 +
 +/*
 + * signaling protocol:
 + *
 + * each side of the shm_signal has an irq structure with the following
 + * fields:
 + *
 + *    - enabled: controlled by shm_signal_enable/disable() to mask/unmask
 + *               the notification locally
 + *    - dirty:   indicates if the shared-memory is dirty or clean.  This
 + *               is updated regardless of the enabled/pending state so that
 + *               the state is always accurately tracked.
 + *    - pending: indicates if a signal is pending to the remote locale.
 + *               This allows us to determine if a remote-notification is
 + *               already in flight to optimize spurious notifications away.
 + */
   

 When you overlay a ring on top of this, won't the ring indexes convey
 the same information as ->dirty?

I agree that the information may be redundant with components of the
broader shm state.  However, we need this state at this level of scope
in order to function optimally, so I don't think it's a huge deal to
have this here as well.  After all, the shm_signal library can only
assess its internal state.  We would have to teach it how to glean the
broader state through some mechanism otherwise (a callback, perhaps),
but I don't think it's worth it.
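
To make the interplay of the three fields concrete, the send side is
roughly the following sketch; the struct fields and locale indexing are
inferred from the header above, so treat it as illustrative:

static int shm_signal_wakeup(struct shm_signal *s)
{
	/* signal the *other* locale's irq structure */
	struct shm_signal_irq *irq = &s->desc->irq[!s->locale];

	irq->dirty = 1;		/* always track that the shm changed... */

	/* ...but only inject when unmasked and not already in flight */
	if (irq->enabled && !irq->pending) {
		irq->pending = 1;
		return s->ops->inject(s);
	}

	return 0;
}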

-Greg








Re: [RFC PATCH 01/17] shm-signal: shared-memory signals

2009-03-31 Thread Avi Kivity

Gregory Haskins wrote:

+struct shm_signal_irq {
+__u8  enabled;
+__u8  pending;
+__u8  dirty;
+};
  
  

Some ABIs may choose to pad this, suggest explicit padding.



Yeah, good idea.  What is the official way to do this these days?  Are
GCC pragmas allowed?

  


I just add a __u8 pad[5] in such cases.


+
+struct shm_signal;
+
+struct shm_signal_ops {
+int  (*inject)(struct shm_signal *s);
+void (*fault)(struct shm_signal *s, const char *fmt, ...);
  
  

Eww.  Must we involve strings and printf formats?



This is still a somewhat immature part of the design.  It's supposed
to be used so that, by default, it's a panic.  But on the host side, we
can do something like inject a machine-check.  That way malicious/broken
guests cannot (should not? ;) take down the host.  Note that today
I do not map this to anything other than the default panic, so this
needs some love.

But given the asynchronous nature of the fault, I want to be sure we
have decent accounting to avoid bug reports like "silent MCE kills the
guest" ;)  At least this way, we can log the fault string somewhere to
get a clue.
  


I see.

This raises a point I've been thinking of - the symmetrical nature of
the API vs the asymmetrical nature of guest/host or user/kernel
interfaces.  This is most pronounced in ->inject(); in the host->guest
direction this is async (the host can continue processing while the
guest is handling the interrupt), whereas in the guest->host direction
it is synchronous (the guest is blocked while the host is processing
the call, unless the host explicitly hands off work to a different thread).


