Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-10 Thread Jamie Lokier
Paolo Bonzini wrote:
> Il 10/01/2013 16:25, Jamie Lokier ha scritto:
> >> > Perhaps it's a bug that the cache mode is not reset when the machine is
> >> > reset.  I haven't checked that, but it would be a valid complaint.
> > The question is, is cache=writeback/cache=writethrough an initial
> > setting of guest-visible WCE that the guest is allowed to change, or
> > is cache=writethrough a way of saying "don't have a write cache"
> > (which may or may not be reflected in the guest-visible disk id).
> 
> It used to be the latter (with reflection in the disk data), but now it
> is the former.

Interesting.  It could be worth a note in the manual.

> > I couldn't tell from QEMU documentation which is intended.  It would
> > be a bit silly if it means different things for different backend
> > storage.
> 
> It means the same thing for IDE, SCSI and virtio-blk.  Other backends,
> such as SD, do not even have flush, and are really slow with
> cache=writethrough because they write one sector at a time.  For this
> reason they cannot really be used in a "safe" manner.
> 
> > I have seen (obscure) guest code which toggled WCE to simulate FUA,
> 
> That's quite useless, since WCE=1->WCE=0 is documented to cause a flush
> (and it does).  Might as well send a real flush.

It was because the ATA spec seemed to permit the combination of WCE
with no flush command supported.  So WCE=1->WCE=0 was used to flush,
and kept at WCE=0 for the subsequent logging write-FUA(s), until a
non-FUA write was wanted.

-- Jamie



Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-10 Thread Jamie Lokier
Paolo Bonzini wrote:
> Il 09/01/2013 14:04, Liu Yuan ha scritto:
> > > >   2 The upper-layer software which relies on 'cache=xxx' to choose a
> > > > cache mode will find its assumptions broken by the new QEMU.
> > > 
> > > Which assumptions do you mean? As far as I can say the behaviour hasn't
> > > changed, except possibly for the performance.
> >
> > When users set 'cache=writethrough' to export only a writethrough cache
> > to the guest, with the new QEMU they will actually get a writeback cache
> > by default.
> 
> They get a writeback cache implementation-wise, but they get a
> writethrough cache safety-wise.  How the cache is implemented doesn't
> matter, as long as it "looks like" a writethrough cache.
> 
> In fact, consider a local disk that doesn't support FUA.  In old QEMU,
> images used to be opened with O_DSYNC and that splits each write into
> WRITE+FLUSH, just like new QEMU.  All that changes is _where_ the
> flushes are created.  Old QEMU changes it in the kernel, new QEMU
> changes it in userspace.
> 
> > We don't need to communicate to the guest. I think 'cache=xxx' means
> > what kind of cache the users *expect* to export to the guest OS. So if
> > cache=writethrough is set, the guest OS shouldn't be able to turn it into
> > a writeback cache magically. This is like buying a disk with a
> > 'writethrough' cache built in: I wouldn't expect it to turn out to be a
> > disk with a writeback cache under the hood, which could possibly lose
> > data when a power outage happens.
> 
> It's not by magic.  It's by explicitly requesting the disk to do this.
> 
> Perhaps it's a bug that the cache mode is not reset when the machine is
> reset.  I haven't checked that, but it would be a valid complaint.

The question is, is cache=writeback/cache=writethrough an initial
setting of guest-visible WCE that the guest is allowed to change, or
is cache=writethrough a way of saying "don't have a write cache"
(which may or may not be reflected in the guest-visible disk id).

I couldn't tell from QEMU documentation which is intended.  It would
be a bit silly if it means different things for different backend
storage.

I have seen (obscure) guest code which toggled WCE to simulate FUA,
and there is plenty of advice out there saying to set WCE=0 for
certain kinds of databases because of its presumed crash safety.  Even
very ancient guests on Linux and Windows can set WCE=0 with IDE and
SCSI.

So from a guest point of view, I think guest setting WCE=0 should mean
exactly the same as FUA every write, or flush after every write, until
guest setting WCE=1.

-- Jamie



Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-10 Thread Jamie Lokier
Kevin Wolf wrote:
> Am 08.01.2013 11:39, schrieb Liu Yuan:
> > On 01/08/2013 06:00 PM, Kevin Wolf wrote:
> >> Am 08.01.2013 10:45, schrieb Liu Yuan:
> >>> On 01/08/2013 05:40 PM, Stefan Hajnoczi wrote:
>  Otherwise use sheepdog writeback and let QEMU block.c decide when to
>  flush.  Never use sheepdog writethrough because it's redundant here.
> >>>
> >>> I don't get it. What do you mean by 'redundant'? If we use virtio &
> >>> sheepdog block driver, how can we specify writethrough mode for Sheepdog
> >>> cache? Here 'writethrough' means use a pure read cache, which doesn't
> >>> need flush at all.
> >>
> >> A writethrough cache is equivalent to a write-back cache where each
> >> write is followed by a flush. qemu makes sure to send these flushes, so
> >> there is no need to use Sheepdog's writethrough mode.
> > 
> > Implementing writethrough as writeback + flush will cause considerable
> > overhead for a network block device like Sheepdog: a single write request
> > will be executed as two requests: write + flush.
> 
> Yeah, maybe we should have some kind of a FUA flag with write requests
> instead of sending a separate flush.

Note that write+FUA has different semantics than write+flush, at least
with regular disks.

write+FUA commits just what was written, while write+flush commits
everything that was written before.
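
(To make the distinction concrete, here is a minimal sketch -- hypothetical
helper names, not QEMU's actual block API.  A backend with only a whole-cache
flush can still honour a per-write FUA flag by over-flushing; substituting a
single-request FUA for a requested flush would be wrong, because the flush
must cover every write completed before it.)

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct BackendState BackendState;               /* hypothetical */
    int do_write(BackendState *s, int64_t off, const void *buf, size_t len);
    int do_flush_all(BackendState *s);    /* commits every completed write */

    /* Emulating write+FUA on a backend that only has a full flush:
     * stronger than necessary, but safe. */
    static int backend_write(BackendState *s, int64_t off,
                             const void *buf, size_t len, bool fua)
    {
        int ret = do_write(s, off, buf, len);
        if (ret == 0 && fua)
            ret = do_flush_all(s);
        return ret;
    }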

-- Jamie



Re: [Qemu-devel] [PATCH v7 00/10] i8254, i8259 and running Microport UNIX (ca 1987)

2012-12-11 Thread Jamie Lokier
Matthew Ogilvie wrote:
> 2. Just fix it immediately, and don't worry about migration.  Squash
>the last few patches together.  A single missed periodic
>timer tick that only happens when migrating
>between versions of qemu is probably not a significant
>concern.  (Unless someone knows of an OS that actually runs
>the i8254 in single shot mode 4, where a missed interrupt
>could cause a hang or something?)

Hi Matthew,

Such as Linux?  0x38 looks like mode 4 to me.  I suspect it's used in
tickless mode when there isn't a better clock event source.

linux/drivers/clocksource/i8253.c:

#ifdef CONFIG_CLKEVT_I8253
        /* ... */

        case CLOCK_EVT_MODE_ONESHOT:
                /* One shot setup */
                outb_p(0x38, PIT_MODE);

        /* ... */

/*
 * Program the next event in oneshot mode
 *
 * Delta is given in PIT ticks
 */
static int pit_next_event(unsigned long delta, struct clock_event_device *evt)
{
        raw_spin_lock(&i8253_lock);
        outb_p(delta & 0xff , PIT_CH0); /* LSB */
        outb_p(delta >> 8 , PIT_CH0);   /* MSB */
        raw_spin_unlock(&i8253_lock);

        return 0;
}

        /* ... */
#endif

> 4. Support both old and fixed i8254 models, selectable at runtime
>with a command line option.  (Question: What should such an
>option look like?)  This may be the best way to actually
>change the 8254, but I'm not sure changes are even needed.
>It's certainly getting rather far afield from running Microport
>UNIX...

I can't see a reason to have the old behaviour, if every guest works
with the new one, except for this awkward cross-version migration
thing.

I guess ideally, device emulations would be versioned when their
behaviour changes, rather like shared libraries are, and the
appropriate old version kept around to be loaded for a particular
machine that's still running with it.  Sounds a bit complicated though.

Best,
-- Jamie



Re: [Qemu-devel] [PATCH] qemu-timer: Don't use RDTSC on 386s and 486s

2012-11-23 Thread Jamie Lokier
Peter Maydell wrote:
> On 23 November 2012 15:31, Jamie Lokier  wrote:
> > x86 instruction sets haven't followed a linear progression of features
> > for quite a while, especially including non-Intel chips, so it stopped
> > making sense for GCC to indicate the instruction set in that way.
> 
> If you're going to go down that route you need to start defining
> #defines for features then, so we could say defined(__rdtsc__)
> or defined(__cmov__) and so on. I don't see any of those either :-(

It does for some major architectural instruction groups like MMX,
different kinds of SSE, etc.  But not everything, and I don't see cmov
among them.  I agree it's unfortunate.

-- Jamie



Re: [Qemu-devel] [PATCH] qemu-timer: Don't use RDTSC on 386s and 486s

2012-11-23 Thread Jamie Lokier
Peter Maydell wrote:
> On 23 November 2012 15:17, Peter Maydell  wrote:
> > On 23 November 2012 15:15, Paolo Bonzini  wrote:
> >> You should at least test __i686__ too:
> >>
> >> $ gcc -m32 -dM -E -x c /dev/null |grep __i
> >> #define __i686 1
> >> #define __i686__ 1
> >> #define __i386 1
> >> #define __i386__ 1
> >
> > Yuck. I had assumed gcc would define everything from i386
> > on up when building for later cores.
> 
> ...and there's an enormous list of x86 cores too. This bites
> us already -- if you use '-march=native' to get "best for my
> cpu" then on a Core2, say, it will define __i386__ and __core2__
> but not __i686__, so TCG won't use cmov :-(
> 
> Anybody got any good ideas for how to say "is this at least
> a 586/686?" in a way that won't fail for any newly introduced
> x86 core types?

Fwiw, cmov doesn't work on some VIA "686" class CPUs.

Shouldn't TCG decide whether to use cmov at runtime anyway, using
cpuid?  For dynamically generated code it would seem not very
expensive to do that.
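
(For illustration, the runtime check is tiny -- a sketch, assuming an x86
host and a GCC-style inline asm dialect; CPUID leaf 1 reports CMOV in EDX
bit 15:)

    #include <stdint.h>

    /* Sketch: detect CMOV at runtime instead of trusting __i686__. */
    static int host_has_cmov(void)
    {
        uint32_t eax, ebx, ecx, edx;

        __asm__ volatile("cpuid"
                         : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                         : "a"(1), "c"(0));
        (void)eax; (void)ebx; (void)ecx;
        return (edx >> 15) & 1;            /* CPUID.01H:EDX bit 15 = CMOV */
    }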

Looking at GCC source, it has an internal flag to say whether the
target has cmov, but doesn't expose it in preprocessor conditionals.

-- Jamie



Re: [Qemu-devel] [PATCH] qemu-timer: Don't use RDTSC on 386s and 486s

2012-11-23 Thread Jamie Lokier
Peter Maydell wrote:
> On 23 November 2012 15:15, Paolo Bonzini  wrote:
> > Il 23/11/2012 16:12, Peter Maydell ha scritto:
> >> Adjust the conditional which guards the implementation of
> >>
> >> -#elif defined(__i386__)
> >> +#elif defined(__i586__)
> >>
> >>  static inline int64_t cpu_get_real_ticks(void)
> >>  {
> >>
> >
> > You should at least test __i686__ too:
> >
> > $ gcc -m32 -dM -E -x c /dev/null |grep __i
> > #define __i686 1
> > #define __i686__ 1
> > #define __i386 1
> > #define __i386__ 1
> 
> Yuck. I had assumed gcc would define everything from i386
> on up when building for later cores.

No, and it doesn't define __i686__ on all x86-32 targets after i686 either:

$ gcc -march=core2 -dM -E -x c /dev/null | grep __[0-9a-z] | sort
#define __core2 1
#define __core2__ 1
#define __gnu_linux__ 1
#define __i386 1
#define __i386__ 1
#define __linux 1
#define __linux__ 1
#define __tune_core2__ 1
#define __unix 1
#define __unix__ 1

x86 instruction sets haven't followed a linear progression of features
for quite a while, especially including non-Intel chips, so it stopped
making sense for GCC to indicate the instruction set in that way.

GCC 4.6.3 defines __i586__ only when the target arch is set by -march
(or default) to i586, pentium or pentium-mmx.

And it defines __i686__ only when -march= is set (or default) to c3-2,
i686, pentiumpro, pentium2, pentium3, pentium3m or pentium-m.

Otherwise it's just things like __athlon__, __corei7__, etc.

The only one that's consistent is __i386__ (and __i386).

-- Jamie



Re: [Qemu-devel] [PATCH 1/3] nbd: Only try to send flush/discard commands if connected to the NBD server

2012-10-25 Thread Jamie Lokier
Kevin Wolf wrote:
> Am 24.10.2012 16:32, schrieb Jamie Lokier:
> > Kevin Wolf wrote:
> >> Am 24.10.2012 14:16, schrieb Nicholas Thomas:
> >>> On Tue, 2012-10-23 at 16:02 +0100, Jamie Lokier wrote:
> >>>> Since the I/O _order_ before, and sometimes after, flush, is important
> >>>> for data integrity, this needs to be maintained when I/Os are queued in
> >>>> the disconnected state -- including those which were inflight at the
> >>>> time disconnect was detected and then retried on reconnect.
> >>>
> >>> Hmm, discussing this on IRC I was told that it wasn't necessary to
> >>> preserve order - although I forget the fine detail. Depending on the
> >>> implementation of qemu's coroutine mutexes, operations may not actually
> >>> be performed in order right now - it's not too easy to work out what's
> >>> happening.
> >>
> >> It's possible to reorder, but it must be consistent with the order in
> >> which completion is signalled to the guest. The semantics of flush is
> >> that at the point that the flush completes, all writes to the disk that
> >> already have completed successfully are stable. It doesn't say anything
> >> about writes that are still in flight, they may or may not be flushed to
> >> disk.
> > 
> > I admit I wasn't thinking clearly how much ordering NBD actually
> > guarantees (or if there's ordering the guest depends on implicitly
> > even if it's not guaranteed in specification), and how that is related
> > within QEMU to virtio/FUA/NCQ/TCQ/SCSI-ORDERED ordering guarantees
> > that the guest expects for various emulated devices and their settings.
> > 
> > The ordering (if any) needed from the NBD driver (or any backend) is
> > going to depend on the assumptions baked into the interface between
> > QEMU device emulation <-> backend.
> > 
> > E.g. if every device emulation waited for all outstanding writes to
> > complete before sending a flush, then it wouldn't matter how the
> > backend reordered its requests, even getting the completions out of
> > order.
> > 
> > Is that relationship documented (and conformed to)?
> 
> No, like so many other things in qemu it's not spelt out explicitly.
> However, as I understand it, it's the same behaviour as real hardware
> has, so device emulation at least for the common devices doesn't have to
> implement anything special for it. That is, if the hardware even supports
> parallel requests; otherwise it would automatically have only a single
> request in flight (like IDE).

That's why I mention virtio/FUA/NCQ/TCQ/SCSI-ORDERED, which are quite
common.

They are features of devices which support multiple parallel requests,
but with certain ordering constraints conveyed by or expected by the
guest, which has to be ensured when it's mapped onto a QEMU fully
asynchronous backend.

That means they are features of the hardware which device emulations
_do_ have to implement.  If they don't, the storage is unreliable on
things like host power removal and virtual power removal.

If the backends are allowed to explicitly have no coupling between
different request types (even flush/discard and write), and ordering
constraints are being enforced by the order in which device emulations
submit and wait, that's fine.

I mention this because POSIX aio_fsync() is _not_ fully decoupled
according to its specification.

So it might be that some device emulations are depending on the
semantics of aio_fsync() or the QEMU equivalent by now; and randomly
reordering in the NBD driver in unusual circumstances (or any other
backend), would break those semantics.

-- Jamie



Re: [Qemu-devel] [PATCH 1/3] nbd: Only try to send flush/discard commands if connected to the NBD server

2012-10-24 Thread Jamie Lokier
Kevin Wolf wrote:
> Am 24.10.2012 14:16, schrieb Nicholas Thomas:
> > On Tue, 2012-10-23 at 16:02 +0100, Jamie Lokier wrote:
> >> Since the I/O _order_ before, and sometimes after, flush, is important
> >> for data integrity, this needs to be maintained when I/Os are queued in
> >> the disconnected state -- including those which were inflight at the
> >> time disconnect was detected and then retried on reconnect.
> > 
> > Hmm, discussing this on IRC I was told that it wasn't necessary to
> > preserve order - although I forget the fine detail. Depending on the
> > implementation of qemu's coroutine mutexes, operations may not actually
> > be performed in order right now - it's not too easy to work out what's
> > happening.
> 
> It's possible to reorder, but it must be consistent with the order in
> which completion is signalled to the guest. The semantics of flush is
> that at the point that the flush completes, all writes to the disk that
> already have completed successfully are stable. It doesn't say anything
> about writes that are still in flight, they may or may not be flushed to
> disk.

I admit I wasn't thinking clearly how much ordering NBD actually
guarantees (or if there's ordering the guest depends on implicitly
even if it's not guaranteed in specification), and how that is related
within QEMU to virtio/FUA/NCQ/TCQ/SCSI-ORDERED ordering guarantees
that the guest expects for various emulated devices and their settings.

The ordering (if any) needed from the NBD driver (or any backend) is
going to depend on the assumptions baked into the interface between
QEMU device emulation <-> backend.

E.g. if every device emulation waited for all outstanding writes to
complete before sending a flush, then it wouldn't matter how the
backend reordered its requests, even getting the completions out of
order.

Is that relationship documented (and conformed to)?
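
(As a sketch of that discipline -- hypothetical names, not the real QEMU
device/backend interface -- the emulation tracks its own in-flight writes
and defers the flush until they have drained, after which any reordering
inside the backend no longer matters for flush semantics.)

    #include <stdbool.h>

    struct EmulatedDisk {
        int inflight_writes;   /* incremented on submit, decremented on completion */
        bool flush_pending;
    };

    void backend_flush(struct EmulatedDisk *d);   /* hypothetical backend call */

    static void disk_write_done(struct EmulatedDisk *d)
    {
        if (--d->inflight_writes == 0 && d->flush_pending) {
            d->flush_pending = false;
            backend_flush(d);             /* all earlier writes have completed */
        }
    }

    static void disk_flush(struct EmulatedDisk *d)
    {
        if (d->inflight_writes > 0)
            d->flush_pending = true;      /* wait for outstanding writes first */
        else
            backend_flush(d);
    }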

-- Jamie



Re: [Qemu-devel] [PATCH 1/3] nbd: Only try to send flush/discard commands if connected to the NBD server

2012-10-23 Thread Jamie Lokier
Nicholas Thomas wrote:
> On Tue, 2012-10-23 at 12:33 +0200, Kevin Wolf wrote:
> > Am 22.10.2012 13:09, schrieb n...@bytemark.co.uk:
> > > 
> > > This is unlikely to come up now, but is a necessary prerequisite for 
> > > reconnection
> > > behaviour.
> > > 
> > > Signed-off-by: Nick Thomas 
> > > ---
> > >  block/nbd.c |   13 +++--
> > >  1 files changed, 11 insertions(+), 2 deletions(-)
> > 
> > What's the real requirement here? Silently ignoring a flush and
> > returning success for it feels wrong. Why is it correct?
> > 
> > Kevin
> 
> I just needed to avoid socket operations while s->sock == -1, and
> extending the existing case of "can't do the command, so pretend I did
> it" to "can't do the command right now, so pretend..." seemed like an
> easy way out. 

Hi Nicholas,

Ignoring a flush is another way of saying "corrupt my data" in some
circumstances.  We have options in QEMU already to say whether flushes
are ignored on normal discs, but if someone's chosen the "I really
care about my database/filesystem" option, and verified that their NBD
setup really performs them (in normal circumstances), silently
dropping flushes from time to time isn't nice.

I would much rather the guest is forced to wait until reconnection and
then get a successful flush, if the problem is just that the server
was down briefly.  Or, if that is too hard, that the flush fails with an
error rather than being silently dropped.

Since the I/O _order_ before, and sometimes after, flush, is important
for data integrity, this needs to be maintained when I/Os are queued in
the disconnected state -- including those which were inflight at the
time disconnect was detected and then retried on reconnect.

Ignoring a discard is not too bad.  However, if discard is retried,
then I/O order is important in relation to those as well.

> In the Bytemark case, the NBD server always opens the file O_SYNC, so
> nbd_co_flush could check in_flight == 0 and return 0/1 based on that;
> but I'd be surprised if that's true for all NBD servers. Should we be
> returning 1 here for both "not supported" and "can't do it right now",
> instead?

When the server is opening the file O_SYNC, wouldn't it make sense to
tell QEMU -- and the guest -- that there's no need to send flushes at
all, as it's equivalent to a disk with no write-cache (or disabled)?

Best,
-- Jamie



Re: [Qemu-devel] Using PCI config space to indicate config location

2012-10-09 Thread Jamie Lokier
Rusty Russell wrote:
> I don't think it'll be that bad; reset clears the device to unknown,
> bar0 moves it from unknown->legacy mode, bar1/2/3 changes it from
> unknown->modern mode, and anything else is bad (I prefer being strict so
> we catch bad implementations from the beginning).

Will that work if a guest kernel that uses modern mode kexecs to an
older (but presumed reliable) kernel that only knows about legacy mode?

I.e. will the replacement kernel, or (ideally) replacement driver on
the rare occasion that is needed on a running kernel, be able to reset
the device hard enough?

-- Jamie



Re: [Qemu-devel] [PATCH V3 01/11] atomic: introduce atomic operations

2012-09-19 Thread Jamie Lokier
Peter Maydell wrote:
> On 19 September 2012 14:32, Jamie Lokier  wrote:
> > However, someone may run QEMU on a kernel before 2.6.32, which isn't
> > that old.  (E.g. my phone is running 2.6.28).
> 
> NB that ARM kernels that old have other amusing bugs, such
> as not saving the floating point registers when invoking
> signal handlers.

Hi Peter,

It's not that old (< 3 years).  Granted that's not a nice one, but I'm
under the impression it occurs only when the signal handler uses (VFP
hardware) floating point.  I.e. most programs don't do that, they keep
the signal handlers simple (probably including QEMU).

(I've read about other platforms that have similar issues using
floating point in signal handlers; best avoided.)

Anyway, people are running those kernels, someone will try to run QEMU
on it unless...

> I would be happy for QEMU to just say "your  kernel is too old!"...

I'd be quite happy with that as well, if you want to put a check in
and refuse to run (like Glibc does).

Less happy with obscure, rare failures of atomicity that are
practically undebuggable, and easily fixed.

Cheers,
-- Jamie



Re: [Qemu-devel] [PATCH V3 01/11] atomic: introduce atomic operations

2012-09-19 Thread Jamie Lokier
liu ping fan wrote:
> >> +static inline void atomic_set(Atomic *v, int i)
> >> +{
> >> +v->counter = i;
> >> +}

Hi,

When running on ARM Linux kernels prior to 2.6.32, userspace
atomic_set() needs to use "clrex" or "strex" too.

See Linux commit 200b812d, "Clear the exclusive monitor when returning
from an exception".

You can see ARM's atomic_set() used to use "strex", and warns it's
important.  The kernel patch allows atomic_set() to be simplified, and
that includes for userspace, by putting clrex/strex in the exception
return path instead.
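
(For reference, a userspace atomic_set() along the lines of the old ARM
kernel one would look roughly like the sketch below -- ARMv6+ with GCC-style
inline asm, using the Atomic type from your patch; not a claim that the
patch must do exactly this.)

    /* Sketch: store via ldrex/strex so any exclusive reservation held by
     * interrupted code is discarded, as arch/arm did before Linux 2.6.32. */
    static inline void atomic_set(Atomic *v, int i)
    {
        unsigned long tmp;

        __asm__ __volatile__(
            "1:     ldrex   %0, [%1]\n"
            "       strex   %0, %2, [%1]\n"
            "       teq     %0, #0\n"
            "       bne     1b"
            : "=&r" (tmp)
            : "r" (&v->counter), "r" (i)
            : "cc", "memory");
    }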

However, someone may run QEMU on a kernel before 2.6.32, which isn't
that old.  (E.g. my phone is running 2.6.28).

Otherwise you can have this situation:

Initially: a = 0.

Thread:
    atomic_inc(&a, 1)
    = ldrex, add, [strex interrupted]

        Interrupted by signal handler:
            atomic_set(&a, 3)
            = str
        Signal return

Resume thread:
    = strex (succeeds because CPU-local exclusive-flag still set)

Result: a = 1, should be impossible when the signal triggered, and
information about the signal is lost.

A more realistic example would use atomic_compare_exchange(), to
atomic-read-and-clear, atomic-read-and-dec-if-not-zero a variable set
in a signal handler, however I've used atomic_inc() to illustrate
because that's in your patch.

Best,
-- Jamie



Re: [Qemu-devel] [PATCH V3 01/11] atomic: introduce atomic operations

2012-09-19 Thread Jamie Lokier
Avi Kivity wrote:
> On 09/13/2012 09:54 AM, liu ping fan wrote:
> 
> >>> +typedef struct Atomic {
> >>> +int counter;
> >>> +} Atomic;
> >>
> >> Best to mark counter 'volatile'.
> >>
> >>> +
> >>> +static inline void atomic_set(Atomic *v, int i)
> >>> +{
> >>> +v->counter = i;
> >>> +}
> >>> +
> >>> +static inline int atomic_read(Atomic *v)
> >>> +{
> >>> +return v->counter;
> >>> +}
> >>>
> >>
> >> So these two operations don't get mangled by the optimizer.
> >>
> > Browsing Linux code and reading lkml, I found some similar material. But
> > they have moved volatile from ->counter to the function atomic_read().
> > As for atomic_read(), I think it is needed to prevent the optimizer
> > refetching issue, but as for atomic_set(), do we need it?
> 
> I think so, to prevent reordering.

Hi,

I don't think volatile makes any difference to reordering here.

The compiler is not going to move the atomic_set() store before or
after another instruction on the same atomic variable anyway, just
like it wouldn't do that for an ordinary assignment.

If you're concerned about ordering with respect to other memory, then
volatile wouldn't make much difference.  barrier() before and after would.

If you're copying Linux's semantics, Linux's atomic_set() doesn't
include any barriers, nor imply any.  atomic_read() uses volatile to
ensure that each call re-reads the value, for example in a loop.
(Same as ACCESS_ONCE().)  If there were a call to atomic_set() in a
loop, there is no guarantee the value would be written each time.
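
(Concretely, something along these lines -- modelled on Linux's
ACCESS_ONCE() and barrier(), as a sketch rather than a proposal for the
patch: volatile on the access forces the load or store to happen exactly
where written, and an explicit compiler barrier is what you add when
ordering against other memory actually matters.)

    /* Sketch, in the style of Linux's ACCESS_ONCE()/barrier(). */
    #define barrier()   __asm__ __volatile__("" ::: "memory")

    static inline int atomic_read(const Atomic *v)
    {
        return *(const volatile int *)&v->counter;  /* always re-read from memory */
    }

    static inline void atomic_set(Atomic *v, int i)
    {
        *(volatile int *)&v->counter = i;   /* forced store, no ordering implied */
    }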

-- Jamie



Re: [Qemu-devel] Get only TCG code without execution

2012-02-09 Thread Jamie Lokier
陳韋任 wrote:
> > As x86 doesn't use or need barrier instructions, when translating x86
> > to (say) run on ARM host, multi-threaded code that needs barriers
> > isn't easy to detect, so barriers may be required between every memory
> > access in the generated ARM code.
> 
>   Sounds awful to me. Regardless of current QEMU's support for
> multi-threaded applications, is it possible to emulate an architecture
> with a stronger memory model on a weaker one?

It's possible; unfortunately those barriers tend to be quite
expensive and they are needed often, so it would run slowly. Probably
a lot slower than using a single host thread with preemption to
simulate multiple guest CPUs. But someone should try it and find out.

It might be possible to do some deep analysis of the guest to work out
which memory accesses don't need barriers, but it's a hard research
problem with no guarantee of a good solution.

One strategy which comes to mind is simulated MESI or MOESI (cache
coherency protocols) at the page level, so independent guest threads
never have unsynchronised access to the same page. Or at finer
granularity, with more emulation overhead (but still maybe less than
barriers). Another is software transactional memory techniques.

Neither will run system software at great speed, but certain kinds of
mostly-independent processing, for example a guest running mainly
userspace number crunching in independent processes, might work
alright.

-- Jamie



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-09 Thread Jamie Lokier
Anthony Liguori wrote:
> >The new API will do away with the IOAPIC/PIC/PIT emulation and defer
> >them to userspace.
> 
> I'm a big fan of this.

I agree with getting rid of unnecessary emulations.
(Why were those things emulated in the first place?)

But it would be good to retain some way to "plugin" device emulations
in the kernel, separate from KVM core with a well-defined API boundary.

Then it wouldn't matter to the KVM core whether there's PIT emulation
or whatever; that would just be a separate module.  Perhaps even with
its own /dev device and maybe not tightly bound to KVM.

> >Note: this may cause a regression for older guests that don't
> >support MSI or kvmclock.  Device assignment will be done using
> >VFIO, that is, without direct kvm involvement.

I don't like the sound of regressions.

I tend to think of a VM as something that needs to have consistent
behaviour over a long time, for keeping working systems running for
years despite changing hardware, or reviving old systems to test
software and make patches for things in long-term maintenance etc.

But I haven't noticed problems from upgrading kernelspace-KVM yet,
only upgrading the userspace parts.  If a kernel upgrade is risky,
that makes upgrading host kernels difficult and "all or nothing" for
all the guests within.

However it looks like you mean only the performance characteristics
will change because of moving things back to userspace?

> >Local APICs will be mandatory, but it will be possible to hide them from
> >the guest.  This means that it will no longer be possible to emulate an
> >APIC in userspace, but it will be possible to virtualize an APIC-less
> >core - userspace will play with the LINT0/LINT1 inputs (configured as
> >EXITINT and NMI) to queue interrupts and NMIs.
> 
> I think this makes sense.  An interesting consequence of this is
> that it's no longer necessary to associate the VCPU context with an
> MMIO/PIO operation.  I'm not sure if there's an obvious benefit to
> that but it's interesting nonetheless.

Would that be useful for using VCPUs to run sandboxed userspace code
with ability to trap and control the whole environment (as opposed to
guest OSes, or ptrace which is rather incomplete and unsuitable for
sandboxing code meant for other OSes)?

Thanks,
-- Jamie



Re: [Qemu-devel] [PATCH] main-loop: For tools, initialize timers as part of qemu_init_main_loop()

2012-01-21 Thread Jamie Lokier
Michael Roth wrote:
> In some cases initializing the alarm timers can lead to non-negligible
> overhead from programs that link against qemu-tool.o. At least,
> setting a max-resolution WinMM alarm timer via mm_start_timer() (the
> current default for Windows) can increase the "tick rate" on Windows
> OSs and affect frequency scaling, and in the case of tools that run
> in guest OSs such has qemu-ga, the impact can be fairly dramatic
> (+20%/20% user/sys time on a core 2 processor was observed from an idle
> Windows XP guest).
> 
> This patch doesn't address the issue directly (not sure what a good
> solution would be for Windows, or what other situations it might be
> noticeable),

Is this a timer that need to fire soon after setting, every time?

I wonder if a different kind of Windows timer, lower-resolution, could
be used if the timeout is longer.  If it has insufficient resolution,
it could be set to trigger a little early, then set a high-resolution
timer at that point.

Maybe that could help for Linux CONFIG_NOHZ guests?

-- Jamie



Re: [Qemu-devel] Get only TCG code without execution

2012-01-20 Thread Jamie Lokier
陳韋任 wrote:
>   What's load/store exclusive implementation?

It's how some architectures do atomic operations, instead of having
atomic instructions like x86 does.

> And as a general emulator, QEMU shouldn't implement any
> architecture-specific memory model, right? What comes to my mind
> is that QEMU only needs to follow guest memory operations when it
> translates the guest binary to TCG ops. When translating TCG ops to
> host binary, it also has to be careful not to mess up the memory ordering.

The error occurs when emulating two or more guest CPUs in parallel
using two or more host CPUs for speed.  Then "not mess up the memory
ordering" may require barrier instructions in the host binary code,
depending on the guest and host architectures.  Without barrier
instructions, the CPUs reorder memory accesses even if the instruction
order is kept the same. This reordering done by the CPU is called the
memory model. TCG cannot currently produce these barrier instructions,
and it's not clear if it will ever be able to do so efficiently.

-- Jamie



Re: [Qemu-devel] Get only TCG code without execution

2012-01-20 Thread Jamie Lokier
Peter Maydell wrote:
> >  "guest binaries don't actually rely that much on the memory model."
> >
> > I think the reason is those guest binaries are single-threaded. Memory
> > model is important in the multi-threaded case. BTW, our binary translator
> > can now translate x86 binaries to ARM binaries, and ARM has a weaker
> > memory model than x86.
> 
> Yes. At the moment this works for QEMU on ARM hosts because in
> system mode QEMU itself is single-threaded so the nastier interactions
> between multiple guest CPUs don't occur (just about every memory model
> defines that memory interactions within a single thread of execution
> behave in the obvious manner).

> I also had in mind that guest binaries
> tend to make fairly stereotypical use of things like LDREX/STREX
> rather than relying on obscure details like their interaction with
> plain load/stores.

As x86 doesn't use or need barrier instructions, when translating x86
to (say) run on ARM host, multi-threaded code that needs barriers
isn't easy to detect, so barriers may be required between every memory
access in the generated ARM code.

-- Jamie



Re: [Qemu-devel] qemu-kvm upstreaming: Do we need -no-kvm-pit and -no-kvm-pit-reinjection semantics?

2012-01-20 Thread Jamie Lokier
Jan Kiszka wrote:
> Usability. Users should not have to care about individual tick-based
> clocks. They care about "my OS requires lost ticks compensation, yes or no".

Conceivably an OS may require lost ticks compensation depending on
boot options given to the OS telling it which clock sources to use.

However I like the idea of a global default, which you can set and all
the devices inherit it unless overridden in each device.

-- Jamie



Re: [Qemu-devel] [PATCH 2/2] qemu-ga: Add the guest-suspend command

2012-01-17 Thread Jamie Lokier
Michael Roth wrote:
> >STDIO is one of the major areas of code that is definitely not
> >async signal safe. Consider Thread A doing something like
> >fwrite(stderr, "Foo\n"), while another thread forks, and then
> >its child also does an fwrite(stderr, "Foo\n"). Given that
> >every stdio function will lock/unlock a mutex, you easily get
> >this sequence of events:
> >
> >1. Thread A: lock(stderr)
> >2. Thread A: write(stderr, "foo\n");
> >3. Thread B: fork() ->  Process B1
> >4. Thread A: unlock(stderr)
> >5.   Process B1: lock(stderr)
> >
> >When the child process is started at step 3, the FILE *stderr
> >object will be locked by thread A.  When Thread A does the
> >unlock in step 4, it has no effect on Process B1. So process
> >B1 hangs forever in step 5.
> 
> Ahh, thanks for the example. I missed that these issues were
> specifically WRT to code that was fork()'d from a multi-threaded
> application. Seemed pretty scary otherwise :)

The pthread_atfork() mechanism, or equivalent in libc, should be
sorting out those stdio locks, but I don't know for sure what Glibc does.

I do know it traverses a stdio list on fork() though:

   http://sourceware.org/bugzilla/show_bug.cgi?id=4737#c4

Which is why Glibc's fork() is not async-signal-safe even though it's
supposed to be.
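
(The mechanism itself is simple enough.  Here is a minimal sketch of how a
threaded library can keep one of its own locks consistent across fork() --
whether Glibc's stdio actually does this for its stream locks is exactly
the part I'm unsure about.)

    #include <pthread.h>

    /* prepare() runs in the parent just before fork(); parent() and child()
     * run just after, so neither process resumes with the mutex held by a
     * thread that doesn't exist on its side of the fork. */
    static pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER;

    static void atfork_prepare(void) { pthread_mutex_lock(&lib_lock); }
    static void atfork_parent(void)  { pthread_mutex_unlock(&lib_lock); }
    static void atfork_child(void)   { pthread_mutex_unlock(&lib_lock); }

    static void lib_init(void)
    {
        pthread_atfork(atfork_prepare, atfork_parent, atfork_child);
    }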

stdio in a fork() child is historical unix stuff; I expect there are
quite a lot of old applications that use stdio in a child process.
Not multi-threaded applications, but they can link to multi-threaded
libraries these without knowing.

Still there are bugs around (like Glibc's fork() not being
async-signal-safe).  It pays to be cautious.

-- Jamie



Re: [Qemu-devel] [PATCH 2/2] qemu-ga: Add the guest-suspend command

2012-01-17 Thread Jamie Lokier
Eric Blake wrote:
> On 01/16/2012 03:51 AM, Jamie Lokier wrote:
> > I'm not sure if it's relevant to the this code, but on Glibc fork() is
> > not async-signal-safe and has been known to crash in signal handlers.
> > This is why fork() was removed from SUS async-signal-safe functions.
> 
> fork() is still in the list of async-signal-safe functions [1];

You're right, but it looks like it may be removed in the next edition:

   https://www.opengroup.org/austin/docs/austin_446.txt

> it was only pthread_atfork() which was removed.

I didn't think pthread_atfork() ever was async-signal-safe.

> That is, fork() is _required_
> to be async-signal-safe (and usable from signal handlers), provided that
> the actions following the fork also follow safety rules.

Nonetheless, Glibc fork() isn't async-signal-safe even if it should be:

http://sourceware.org/bugzilla/show_bug.cgi?id=4737

> > In general, why is multithreadedness relevant to async-signal-safety here?
> 
> Because POSIX 2008 (SUS inherits from POSIX, so it has the same
> restriction) states that if a multithreaded app calls fork, the child
> can only portably use async-signal-safe functions up until a successful
> exec or _exit.  Even though the child is _not_ operating in a signal
> handler context, it _is_ operating in a context of a single thread where
> other threads from the parent may have been holding locks, and thus
> calling any unsafe function (that is, any function that tries to obtain
> a lock) may deadlock.

Somewhat confusing, when you have pthread_atfork() existing for the
entire purpose of allowing non-async-signal-safe functions, provided
the application isn't multithreaded, but libraries can be (I'm not
sure what the difference between application and library is in this
context).

http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_atfork.html

It is suggested that programs that use fork() call an exec function
very soon afterwards in the child process, thus resetting all
states. In the meantime, only a short list of async-signal-safe
library routines are promised to be available.

Unfortunately, this solution does not address the needs of
multi-threaded libraries. Application programs may not be aware that a
multi-threaded library is in use, and they feel free to call any
number of library routines between the fork() and exec calls, just as
they always have. Indeed, they may be extant single-threaded programs
and cannot, therefore, be expected to obey new restrictions imposed by
the threads library.

> I don't know if qemu-ga is intended to be a multi-threaded app, so I
> don't know if being paranoid about async-signal-safety matters in this
> particular case, but I _do_ know that libvirt has encountered issues
> with using non-safe functions prior to exec, which is why it always
> raises red flags when I see unsafe code between fork and exec.

Quite right, I agree. :-)

-- Jamie



Re: [Qemu-devel] [PATCH 2/2] qemu-ga: Add the guest-suspend command

2012-01-16 Thread Jamie Lokier
Eric Blake wrote:
> On 01/13/2012 12:15 PM, Luiz Capitulino wrote:
> > This might look complex, but the final code is quite simple. The
> > purpose of that approach is to allow qemu-ga to reap its children
> > (semi-)automatically from its SIGCHLD handler.
> 
> Yes, given your desire for the top-level qemu-ga signal handler to be
> simple, I can see why you did a double fork, so that the intermediate
> child can change the SIGCHLD behavior and actually do a blocking wait in
> the case where status should not be ignored.

An alternative is for SIGCHLD to write a byte to a non-blocking pipe
and do nothing else.  A main loop outside signal context reads from
the pipe, and on each read triggers a subloop of non-blocking
waitpid() getting child statuses until there are no more.  Because
it's outside signal context, it's safe to do anything with the child
statuses.

(A long time ago, on other unixes, this wasn't possible because
SIGCHLD would be retriggered until wait(), but it's not relevant on
anything modern.)
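
(A minimal sketch of that arrangement, with assumed names rather than
qemu-ga's actual code:)

    #include <errno.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int chld_pipe[2];   /* [0]: read end for the main loop, [1]: handler */

    /* Async-signal-safe: just poke the pipe and get out. */
    static void sigchld_handler(int sig)
    {
        int saved_errno = errno;
        char c = 0;

        (void)sig;
        (void)write(chld_pipe[1], &c, 1);  /* EAGAIN is fine: wakeup already queued */
        errno = saved_errno;
    }

    static void install_sigchld(void)
    {
        struct sigaction sa = { .sa_handler = sigchld_handler,
                                .sa_flags = SA_RESTART };

        pipe(chld_pipe);
        fcntl(chld_pipe[0], F_SETFL, O_NONBLOCK);
        fcntl(chld_pipe[1], F_SETFL, O_NONBLOCK);
        sigemptyset(&sa.sa_mask);
        sigaction(SIGCHLD, &sa, NULL);
    }

    /* Called from the main loop whenever chld_pipe[0] polls readable;
     * everything here runs outside signal context, so anything goes. */
    static void reap_children(void)
    {
        char buf[64];
        int status;
        pid_t pid;

        while (read(chld_pipe[0], buf, sizeof buf) > 0)
            ;                              /* drain the queued wakeups */
        while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
            /* handle (pid, status) */
        }
    }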

> > +execlp(pmutils_bin, pmutils_bin, arg, NULL);
> 
> Do we really want to be relying on a PATH lookup, or should we be using
> an absolute path in pmutils_bin?

Since you mention async-signal-safety, execlp() isn't
async-signal-safe!  Last time I checked, in Glibc execlp() could call
malloc().  Also reading PATH looks at the environment, which isn't
always thread-safe either, depending on what else is going on.

I'm not sure if it's relevant to the this code, but on Glibc fork() is
not async-signal-safe and has been known to crash in signal handlers.
This is why fork() was removed from SUS async-signal-safe functions.

> I didn't check whether slog() is async-signal safe (probably not, since
> even snprintf() is not async-signal safe, and you are passing a printf
> style format string).  But strerror() is not, so you shouldn't be using
> it in the child if qemu-ga is multithreaded.

In general, why is multithreadedness relevant to async-signal-safety here?

Thanks,
-- Jamie



Re: [Qemu-devel] converging around a single guest agent

2011-11-17 Thread Jamie Lokier
>>> On 11/16/2011 03:36 PM, Anthony Liguori wrote:
 We have another requirement. We need to embed the source for the guest
 agent in the QEMU release tarball. This is for GPL compliance since we
 want to include an ISO (eventually) that contains binaries.

Paolo Bonzini wrote:
> ovirt-guest-agent is licensed under GPLv3, so you do not need to;
> the options in GPLv3 include this one:
> 
> d) Convey the object code by offering access from a designated
> place (gratis or for a charge), and offer equivalent access to the
> Corresponding Source in the same way through the same place at no
> further charge.  You need not require recipients to copy the
> Corresponding Source along with the object code.  If the place to
> copy the object code is a network server, the Corresponding Source
> may be on a different server (operated by you or a third party)
> that supports equivalent copying facilities, provided you maintain
> clear directions next to the object code saying where to find the
> Corresponding Source.  Regardless of what server hosts the
> Corresponding Source, you remain obligated to ensure that it is
> available for as long as needed to satisfy these requirements.

Hi,

GPLv2 also has a clause similar to the above.  In GPLv2 it's not
enumerated, but says:

If distribution of executable or object code is made by offering
access to copy from a designated place, then offering equivalent
access to copy the source code from the same place counts as
distribution of the source code, even though third parties are not
compelled to copy the source along with the object code.

I'm not sure why "mere aggregation" (GPLv2) and "aggregate" (GPLv3)
aren't sufficient to allow shipping the different binaries together in
a single ISO regardless of where the source code lives or how it's licensed.

-- Jamie



Re: [Qemu-devel] [PATCH 5/6] Do constant folding for shift operations.

2011-05-27 Thread Jamie Lokier
Richard Henderson wrote:
> On 05/26/2011 01:25 PM, Blue Swirl wrote:
> >> I don't see the point.  The C99 implementation defined escape hatch
> >> exists for weird cpus.  Which we won't be supporting as a QEMU host.
> > 
> > Maybe not, but a compiler with this property could arrive. For
> > example, GCC developers could decide that since this weirdness is
> > allowed by the standard, it may be implemented as well.
> 
> If you like, you can write a configure test for it.  But, honestly,
> essentially every place in qemu that uses shifts on signed types
> would have to be audited.  Really.

I agree, the chance of qemu ever working, or needing to work, on a non
two's complement machine is pretty remote!

> The C99 hook exists to efficiently support targets that don't have
> arithmetic shift operations.  Honestly.

If you care, this should be portable without a configure test, as
constant folding should have the same behaviour:

(((int32_t)-3 >> 1 == (int32_t)-2)
 ? (int32_t)x >> (int32_t)y
 : long_winded_portable_shift_right(x, y))
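
(Here long_winded_portable_shift_right() might be something like the sketch
below: it never applies >> to a negative value, so the only assumption left
is two's complement representation.)

    #include <stdint.h>

    /* ~x is non-negative when x is negative, so the shift itself is always
     * well-defined; complementing again restores the sign and rounds toward
     * negative infinity, matching an arithmetic shift. */
    static inline int32_t long_winded_portable_shift_right(int32_t x, int32_t y)
    {
        return x < 0 ? ~(~x >> y) : x >> y;
    }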

-- Jamie



Re: [Qemu-devel] [0/25] Async threading for VirtFS using glib threads & coroutines.

2011-05-24 Thread Jamie Lokier
Venkateswararao Jujjuri wrote:
> This model makes the code simple and also in one shot we can convert
> all v9fs_do_syscalls into asynchronous threads. But as Aneesh raised
> will there be any additional overhead for the additional jumps?  We
> can quickly test it out too.

I'm not sure if this is exactly the right place (I haven't followed
the whole discussion), but there is a useful trick for getting rid of
one of the thread context switches:

Swizzle *which* thread is your "main" coroutine thread.

Instead of queuing up an item on the work queue, waking the worker
thread pool, and having a worker thread pick up the coroutine, you:

Declare the current thread to *be* a worker from this point on,
and queue the calling context for a worker thread to pick up.  When it
picks it up, *that* thread declares itself to be the main thread
coroutine thread.

So the coroutine entry step is just queuing a context for another
thread to pick up, and then diving into the blocking system call
(optimising out the enqueue/dequeue and thread switch).

In a sense, you make the "main" thread a long-lived work queue entry,
and have a symmetric pool, except that the main thread tends to behave
differently than the other work items.

This only works if the main thread's state is able to follow the
swizzling.  I don't know if KVM VCPUs will do that, for example, or if
there's other per-thread state that won't work.

If the main thread can't be swizzled, you can still use this trick
when doing the coroutine->syscall step starting from an existing
worker thread.

-- Jamie




Re: [Qemu-devel] [PATCH 1/2] coroutine: introduce coroutines

2011-05-24 Thread Jamie Lokier
Stefan Hajnoczi wrote:
> My current plan is to try using sigaltstack(2) instead of
> makecontext()/swapcontext() as a hack since OpenBSD doesn't have
> makecontext()/swapcontext().

sigaltstack() is just a system call to tell the system about an
alternative signal stack that you have allocated yourself using
malloc(), according to 'info libc "Signal Stack"'.  It won't help you
get a new stack by itself.

Maybe take a look at what GNU Pth does.  It has a similar matrix of
tested platforms using different strategies on each, though it is
slightly different because it obviously doesn't link with
libpthread.so (it provides it!), and it has to context switch from the
SIGALRM handler for pre-emption.

> TBH I'm almost at the stage where I think we should just use threads
> and/or async callbacks, as appropriate.  Hopefully I'll be able to cook
> up a reasonably portable implementation of coroutines though, because
> the prospect of having to go fully threaded or do async callbacks isn't
> attractive in many cases.

Another classic trick is just to call a function recursively which has
a large local array(*), setjmp() every M calls, and longjmp() back to
the start after M*N calls.  That gets you N setjmp() contexts to
switch between, all in the same larger stack so it's fine even with
old pthread implementations, providing the total stack used isn't too
big, and the individual stacks you've allocated aren't too small for
the program.

If the large local array insists on being optimised away, it's
probably better anyway to track the address of a local variable, and
split the stack whenever the address has changed by enough.  Try to
make sure the compiler doesn't optimise away the tail recursion :-)

It works better on non-threaded programs as per-thread stacks are more
likely to have limited size.  *But* the initial thread often has a
large growable stack, just like a single-threaded program.  So it's a
good idea to do the stack carving in the initial thread (doesn't
necessarily have to be at the start of the program).  You may be able
to add guard pages afterwards with mprotect() if you're paranoid :-)
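
(The basic carving step might look roughly like the sketch below -- assumed
slice size and count, and a hypothetical per-context entry() function.  Each
planted context, when resumed, grows down into the slice reserved by the
next level of recursion, and entry() is expected to switch away with
longjmp() rather than return; the whole trick relies on the carved stack
memory staying untouched, which is exactly the caveat above.)

    #include <setjmp.h>

    enum { N_CTX = 4, SLICE = 64 * 1024 };     /* assumed sizes */

    static jmp_buf ctx[N_CTX];                 /* one resumable context per slice */
    static jmp_buf main_ctx;

    void entry(int i);                         /* hypothetical coroutine body */

    static void carve(int i)
    {
        volatile char pad[SLICE];              /* reserves one slice of the stack */

        pad[0] = pad[SLICE - 1] = 0;           /* touch it so it can't be dropped */
        if (setjmp(ctx[i]))
            entry(i);                          /* resumed later via longjmp(ctx[i], 1) */
        else if (i + 1 < N_CTX)
            carve(i + 1);                      /* recurse one slice deeper */
        else
            longjmp(main_ctx, 1);              /* all contexts planted, unwind */
    }

    void carve_stacks(void)
    {
        if (!setjmp(main_ctx))
            carve(0);
        /* ctx[0..N_CTX-1] now sit SLICE bytes apart on the original stack. */
    }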

-- Jamie



Re: [Qemu-devel] [PATCH 1/2] coroutine: introduce coroutines

2011-05-24 Thread Jamie Lokier
Stefan Hajnoczi wrote:
> On Thu, May 12, 2011 at 10:51 AM, Jan Kiszka  wrote:
> > On 2011-05-11 12:15, Stefan Hajnoczi wrote:
> >> From: Kevin Wolf 
> >>
> >> Asynchronous code is becoming very complex.  At the same time
> >> synchronous code is growing because it is convenient to write.
> >> Sometimes duplicate code paths are even added, one synchronous and the
> >> other asynchronous.  This patch introduces coroutines which allow code
> >> that looks synchronous but is asynchronous under the covers.
> >>
> >> A coroutine has its own stack and is therefore able to preserve state
> >> across blocking operations, which traditionally require callback
> >> functions and manual marshalling of parameters.
> >>
> >> Creating and starting a coroutine is easy:
> >>
> >>   coroutine = qemu_coroutine_create(my_coroutine);
> >>   qemu_coroutine_enter(coroutine, my_data);
> >>
> >> The coroutine then executes until it returns or yields:
> >>
> >>   void coroutine_fn my_coroutine(void *opaque) {
> >>       MyData *my_data = opaque;
> >>
> >>       /* do some work */
> >>
> >>       qemu_coroutine_yield();
> >>
> >>       /* do some more work */
> >>   }
> >>
> >> Yielding switches control back to the caller of qemu_coroutine_enter().
> >> This is typically used to switch back to the main thread's event loop
> >> after issuing an asynchronous I/O request.  The request callback will
> >> then invoke qemu_coroutine_enter() once more to switch back to the
> >> coroutine.
> >>
> >> Note that coroutines never execute concurrently and should only be used
> >> from threads which hold the global mutex.  This restriction makes
> >> programming with coroutines easier than with threads.  Race conditions
> >> cannot occur since only one coroutine may be active at any time.  Other
> >> coroutines can only run across yield.
> >
> > Mmh, is there anything that conceptually prevents fixing this limitation
> > later on? I would really like to remove such dependency long-term as
> > well to have VCPUs operate truly independently on independent device models.
> 
> The use case that has motivated coroutines is the block layer.  It is
> synchronous in many places and definitely not thread-safe.  Coroutines
> is a step that solves the "synchronous" part of the problem but does
> not tackle the "not thread-safe" part.
> 
> It is possible to move from coroutines to threads but we need to
> remove single-thread assumptions from all the block layer code, which
> isn't a small task.  Coroutines does not prevent us from making the
> block layer thread-safe!

Keeping in mind that you may have to do some of the work even with
coroutines.  If the code is not thread safe, it may contain
assumptions that certain state does not change when it makes blocking
I/O calls, which stops being true once you have coroutines and replace
the I/O calls with async calls.  But at least the checking can be
confined to those places in the code.

It's quite similar to the Linux BKL - scheduling points have to be
checked but nowhere else does.  And, like the BKL, it could be "pushed
down" in stages over a long time period, to convert the coroutine code
over to concurrent threads over time, rather than in a single step.

By the end, even with full concurrency, there is still some potential
for coroutines, and/or async calls, to be useful for performance
balancing.

-- Jamie



Re: [Qemu-devel] [PATCH 1/2] coroutine: introduce coroutines

2011-05-24 Thread Jamie Lokier
Daniel P. Berrange wrote:
> On Wed, May 11, 2011 at 03:45:39PM +0200, Paolo Bonzini wrote:
> > On 05/11/2011 03:05 PM, Anthony Liguori wrote:
> > >>
> > >>A very slow way, too (on Windows at least if you use qemu_cond...).
> > >
> > >That doesn't mean you can't do a fiber implementation for Windows... but
> > >having a highly portable fallback is a good thing.
> > 
> > I agree but where would you place it, since QEMU is only portable to
> > POSIX and Windows?
> > 
> > osdep-$(CONFIG_POSIX) += coroutine-posix.c
> > osdep-$(CONFIG_WIN32) += coroutine-win32.c
> > osdep-??? += coroutine-fallback.c
> 
> NetBSD forbids the use of 'makecontext' in any application
> which also links to libpthread.so[1]. We used makecontext in
> GTK-VNC's coroutines and got random crashes in threaded
> apps running on NetBSD. So for NetBSD we tell people to use
> the thread based coroutines instead.

You have to use swapcontext(), no wait, you have to use setjmp(), no wait,
_setjmp(), no wait, threads...  Read on.

From Glibc's FAQ, setjmp/longjmp are not portable choices:

- UNIX provides no other (portable) way of effecting a synchronous
  context switch (also known as co-routine switch).  Some versions
  support this via setjmp()/longjmp() but this does not work
  universally.

So in principle you should use swapcontext() in portable code.

(By the way, Glibc goes on about how it won't support swapcontext()
from async signal handlers, i.e. preemption, on some architectures
(IA-64/S-390), and I know it has been very subtly broken from a signal
handler on ARM.  Fair enough, somehow disappointing, but doesn't
matter for QEMU coroutines.)

But swapcontext() etc. have been withdrawn from POSIX 2008:

- Functions to be deleted

  Legacy: Delete all legacy functions except utimes (which should not be
  legacy).
  OB: Default position is to delete all OB functions.

  XSI Functions to change state

  _setjmp and _longjmp. Should become obsolete.

  getcontext, setcontext, makecontext and swapcontext are already
  marked OB and should be withdrawn. And header file <ucontext.h>.

OB means obsolescent.  They were marked obsolescent a few versions
prior, with the rationale that you can use threads instead...

It's not surprising that NetBSD forbids makecontext() with
libpthread.so.  I suspect old versions of FreeBSD, OpenBSD, DragonFly
BSD, (and Mac OS X?), have the same restriction, because they have a
similar pthreads evolutionary history to LinuxThreads.  LinuxThreads
also breaks when using coroutines that switch stacks, because it uses
the stack pointer to know the current thread.

(LinuxThreads is old now, but that particular quirk still affects me
because some uCLinux platforms, on which I wish to use coroutines, still
don't have working NPTL - but they aren't likely to be running QEMU :-)

Finally, if you are using setjmp/longjmp, consider (from FreeBSD man page):

The setjmp()/longjmp() pairs save and restore the signal mask
while _setjmp()/_longjmp() pairs save and restore only the
register set and the stack.  (See sigprocmask(2).)

As setjmp/longjmp were chosen for performance, you may wish to use
_setjmp/_longjmp instead (when available), as swizzling the signal
mask on each switch may involve a system call and be rather slow.

-- Jamie



Re: [Qemu-devel] [PATCH] Add support for fd: protocol

2011-05-24 Thread Jamie Lokier
Stefan Hajnoczi wrote:
> On Mon, May 23, 2011 at 11:49 PM, Jamie Lokier  wrote:
> > Being able to override the backing file path would be useful anyway.
> >
> > I've already had problems when moving established qcow2 files between
> > systems, that for historical reasons contain either an absolute path
> > inside for the backing file, or some messy "../../whatever", or
> > "foo/bar/whatever", or "backing.img" (lots of different ones with the
> > same name), all of which are a pain to work around.
...
> Try the qemu-img rebase -f command:
> 
> qemu-img uses the unsafe mode if "-u" is specified. In this mode, only the
> backing file name and format of filename is changed without any checks on
> the file contents. The user must take care of specifying the correct new
> backing file, or the guest-visible content of the image will be corrupted.
> 
> This mode is useful for renaming or moving the backing file to somewhere
> else.  It can be used without an accessible old backing file, i.e. you can
> use it to fix an image whose backing file has already been moved/renamed.

Yes indeed.  That feature was added after the last time I dealt with this 
problem.

However, I have wanted to open *precious*, *read-only* qcow2 images,
for example with -snapshot or the explicit equivalent, and for those
precious images I am loath to let any tool write a single byte to
them.  The files are kept read-only, and often with the "immutable"
attribute on Linux, backed up and checksummed just to be sure.

I'd rather just override the value on the command line, so if that
feature may turn up for fd: related reasons, it'll be handy for the
read-only moved qcow2 file reason too.

-- Jamie



Re: [Qemu-devel] Use a hex string

2011-05-23 Thread Jamie Lokier
Anthony Liguori wrote:
> On 05/23/2011 06:02 PM, Jamie Lokier wrote:
> >Richard W.M. Jones wrote:
> >>The problem is to be able to send 64 bit memory and disk offsets
> >>faithfully.  This doesn't just fail to solve the problem, it's
> >>actually going to make it a whole lot worse.
> >
> >Such offsets would be so much more readable in hexadecimal.
> >
> >So why not use a string "0x80001234" instead?
> 
> This doesn't change the fundamental issue here.  Javascript's internal 
> representation for integers isn't 2's complement, but IEEE 754.  This 
> means the expectations about how truncation/overflow is handled are 
> fundamentally different.

No, the point is it's a string so Javascript numerics doesn't come
into it, no overflow, no truncation, no arithmetic.  Every program
that wants to handle them handles them as a *string-valued attribute*
externally, and whatever representation it needs for a particular
attribute internally.  (Just as enum values are represented with
strings too).

In the unlikely event that someone wants to do arithmetic on these
values *in actual Javascript*, it'll be tricky for them, but the
representation doesn't have much to do with that.

-- Jamie



[Qemu-devel] Use a hex string (was: [PATCH] qemu: json: Fix parsing of integers >= 0x8000000000000000)

2011-05-23 Thread Jamie Lokier
Richard W.M. Jones wrote:
> The problem is to be able to send 64 bit memory and disk offsets
> faithfully.  This doesn't just fail to solve the problem, it's
> actually going to make it a whole lot worse.

Such offsets would be so much more readable in hexadecimal.

So why not use a string "0x80001234" instead?

That is universally Javascript compatible as well as much more
convenient for humans.

Or at least, *accept* a hex string wherever a number is required by
QMP (just because hex is convenient anyway, no compatibility issue),
and *emit* a hex string where the number may be out of Javascript's
unambiguous range, or where a hex string would make more sense anyway.
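
As a rough illustration of how cheap this is on the consumer side
(a standalone C sketch, not QEMU code), emitting and parsing such a
string is just snprintf/strtoull:

    /* Illustrative sketch only: 64-bit offsets as "0x..." strings avoid
     * any JSON/Javascript number-precision question entirely. */
    #include <inttypes.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void emit_offset(char *buf, size_t len, uint64_t offset)
    {
        snprintf(buf, len, "0x%" PRIx64, offset); /* e.g. "0x8000000000001234" */
    }

    static uint64_t parse_offset(const char *s)
    {
        return strtoull(s, NULL, 0);  /* base 0 accepts "0x..." hex or decimal */
    }

    int main(void)
    {
        char buf[32];
        emit_offset(buf, sizeof(buf), UINT64_C(0x8000000000001234));
        printf("%s -> %" PRIu64 "\n", buf, parse_offset(buf));
        return 0;
    }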

-- Jamie



Re: [Qemu-devel] [PATCH] Add support for fd: protocol

2011-05-23 Thread Jamie Lokier
Markus Armbruster wrote:
> Anthony Liguori  writes:
> 
> > On 05/23/2011 05:30 AM, Daniel P. Berrange wrote:
> >> It feels to me that turning the current block driver code which just does
> >> open(2) on files, into something which issues events&  asynchronously
> >> waits for a file would potentially be quite complex.
> >>
> >> You also need to be much more careful from a security POV if the mgmt
> >> app is accepting requests to open arbitrary files from QEMU, to ensure
> >> the filenames are correctly/strictly validated before opening them and
> >> giving them back to QEMU. An architecture where the mgmt app decides
> >> what FDs to supply upfront, has less potential for security errors.
> >>
> >> To me the ideal would thus be that we can supply FDs for the backing
> >> store with -blockdev syntax, and that places where QEMU re-opens files
> >> would be enhanced to avoid that need. If there are things we can't do
> >> without closing&  re-opening the same file, then perhaps we need some
> >> new ioctl()/fcntl() calls to change those file attributes on the fly.
> >
> > I agree.  But my view of blockdev is that you wouldn't set an fd
> > attribute but rather the backing file name and use the fd protocol.
> > For instance:
> >
> > -blockdev id=foo-base,path=fd:4,format=raw
> > -blockdev id=foo,path=fd:3,format=qcow2,backing_file=foo
> 
> I guess you mean backing_file=foo-base.
> 
> If none is specified, use the backing file specification stored in the
> image.
> 
> Matches my current thinking.

Being able to override the backing file path would be useful anyway.

I've already had problems when moving established qcow2 files between
systems, that for historical reasons contain either an absolute path
inside for the backing file, or some messy "../../whatever", or
"foo/bar/whatever", or "backing.img" (lots of different ones with the
same name), all of which are a pain to work around.

(Imho, it would also make sense if qcow2 files contained a UUID for
their backing file to verify you've given the correct backing file,
and maybe help find it (much like Linux finds real disk devices and
filesystems when mounting these days).)

-- Jamie



Re: [Qemu-devel] [RFC] Memory API

2011-05-23 Thread Jamie Lokier
Gleb Natapov wrote:
> On Sun, May 22, 2011 at 10:50:22AM +0300, Avi Kivity wrote:
> > On 05/20/2011 02:25 PM, Gleb Natapov wrote:
> > >>
> > >>  A) Removing regions will change significantly. So far this is done by
> > >>  setting a region to IO_MEM_UNASSIGNED, keeping truncation. With the new
> > >>  API that will be a true removal which will additionally restore hidden
> > >>  regions.
> > >>
> > >And what problem do you expect may arise from that? Currently accessing
> > >such region after unassign will result in undefined behaviour, so this
> > >code is non working today, you can't make it worse.
> > >
> > 
> > If the conversion were perfect then yes.  However there is a
> > possibility that the conversion will not be perfect.
> > 
> > It's also good to have to have the code document its intentions.  If
> > you see _overlap() you know there is dynamic address decoding going
> > on, or something clever.
> > 
> > >>  B) Uncontrolled overlapping is a bug that should be caught by the core,
> > >>  and a new API is a perfect chance to do this.
> > >>
> > >Well, this will indeed introduce the difference in behaviour :) The guest
> > >that ran before will abort now. Are you actually aware of any such
> > >overlaps in the current code base?
> > 
> > Put a BAR over another BAR, then unmap it.
> > 
> _overlap will not help with that. PCI BARs can overlap, so _overlap will
> be used to register them. You do not want to abort qemu when the guest
> configures overlapping PCI BARs, do you?

I'd rather guests have no way to abort qemu, except by explicit
agreement... even if they program BARs randomly or do anything else.
Right now my virtual server provider won't let me run my own kernels
because they are paranoid that a non-approved kernel might crash KVM.
Which is reasonable.  Even so, it's possible to reprogram BARs from
guest userspace.

Hot-adding devices, including ones with MMIO or IO addresses that
overlap another existing device, shouldn't make qemu abort either.
Perhaps disable the device, perhaps respond with an error, that's all.

Even then, if hot-adding some ISA device overlaps an existing PCI BAR,
it would be preferable if the devices (probably both of them) simply
didn't receive any bus cycles until the BARs were moved elsewhere,
maybe triggering PCI bus errors or MCEs or something like that, rather
than introducing never-tested-in-practice management-visible state
such as a "disabled" or "refused" device.

I don't know if qemu has devices like this, but many real ISA devices
have software-configurable IO, MMIO and IRQ settings (ISAPNP) - it's
not just PCI.

I thoroughly approve of the plan to keep track of overlapping regions
so that adding/removing them has no side effect.  When they conflict
at equal priorities I suggest a good behaviour would be:

   - No access to the underlying device
   - MCE interrupt or equivalent, signalling a bus error

Then the order of registration doesn't make any difference, which is good.

-- Jamie



Re: [Qemu-devel] [PATCH 2/2 V7] qemu, qmp: add inject-nmi qmp command

2011-05-03 Thread Jamie Lokier
Blue Swirl wrote:
> On Fri, Apr 8, 2011 at 9:04 AM, Gleb Natapov  wrote:
> > On Thu, Apr 07, 2011 at 04:41:03PM -0500, Anthony Liguori wrote:
> >> On 04/07/2011 02:17 PM, Gleb Natapov wrote:
> >> >On Thu, Apr 07, 2011 at 10:04:00PM +0300, Blue Swirl wrote:
> >> >>On Thu, Apr 7, 2011 at 9:51 PM, Gleb Natapov  wrote:
> >> >>
> >> >>I'd prefer something more generic like these:
> >> >>raise /apic@fee0:l1int
> >> >>lower /i44FX-pcihost/e1000@03.0/pinD
> >> >>
> >> >>The clumsier syntax shouldn't be a problem, since this would be a
> >> >>system developer tool.
> >> >>
> >> >>Some kind of IRQ registration would be needed for this to work without
> >> >>lots of changes.
> >> >True. The ability to trigger any interrupt line is very useful for
> >> >debugging. I often re-implement it during debug.
> >>
> >> And it's a good thing to have, but exposing this as the only API to
> >> do something as simple as generating a guest crash dump is not the
> >> friendliest thing in the world to do to users.
> >>
> > Well, this is not intended to be used by regular users directly and
> > management can provide nicer interface for issuing NMI. But really,
> > my point is that NMI actually generates guest core dump in such rare
> > cases (only preconfigured Windows guests) that it doesn't warrant to
> > name command as such. Management is in much better position to implement
> > functionality with such name since it knows what type of guest it runs
> > and can tell agent to configure guest accordingly.
> 
> Does the management need to know about each and every debugging
> oriented interface? For example, "info regs",  "info mem", "info irq"
> and tracepoints?

Linux uses NMI for performance tracing, profiling, watchdog etc. so in
practice, NMI is very similar to the other IRQs.  I.e. highly guest
specific and depending on what's wired up to it.  Injecting NMI to all
CPUs at once does not make any sense for those Linux guests.

For Windows crash dumps, I think it makes sense to have a "button
wired to NMI" device, rather than inject-nmi directly, but I can see
that inject-nmi solves the intended problem quite neatly.

For Linux crash dumps, for example, there are other key combinations,
as well as watchdog devices, that can be used to trigger them.  A
virtual "button wired to GPIO/PCI-IRQ/etc." device might be quite
handy for debugging Linux guests, and would fit comfortably in a
management interface.

-- Jamie



Re: [Qemu-devel] [PATCH 2/2 V7] qemu, qmp: add inject-nmi qmp command

2011-05-03 Thread Jamie Lokier
Gleb Natapov wrote:
> On Thu, Apr 07, 2011 at 04:39:58PM -0500, Anthony Liguori wrote:
> > On 04/07/2011 01:51 PM, Gleb Natapov wrote:
> > >NMI does not have to generate crash dump on every guest we support.
> > >Actually even for windows guest it does not generate one without
> > >tweaking registry. For all I know there is a guest that checks mail when
> > >NMI arrives.
> > 
> > And for all we know, a guest can respond to an ACPI poweroff event
> > by tweeting the star spangled banner but we still call the
> > corresponding QMP command system_poweroff.
> > 
> Correct :) But at least system_poweroff implements ACPI poweroff as
> defined by ACPI spec. NMI is not designed as core dump event and is not
> used as such by majority of the guests.

Imho acpi_poweroff or poweroff_button would have been a clearer name.
Or even 'sendkey poweroff' - it's just a button somewhere on the
keyboard on a lot of systems anyway.  Next to the email button and what
looks, on my laptop, like the play-a-tune button :-)

I put system_poweroff into some QEMU-controlling scripts once, and was
disappointed when several guests ignored it.

But it's done now.

-- Jamie




Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%

2011-01-19 Thread Jamie Lokier
Chunqiang Tang wrote:
> > >> Moreover, using a host file system not only adds overhead, but
> > >> also introduces data integrity issues. Specifically, if I/Os uses 
> O_DSYNC,
> > >> it may be too slow. If I/Os use O_DIRECT, it cannot guarantee data
> > >> integrity in the event of a host crash. See
> > >> http://lwn.net/Articles/348739/ .
> > >
> > > You have the same issue with O_DIRECT when using a raw disk device
> > > too.  That is, O_DIRECT on a raw device does not guarantee integrity
> > > in the event of a host crash either, for mostly the same reasons.
> > 
> > QEMU has semantics that use O_DIRECT safely; there is no issue here.
> > When a drive is added with cache=none, QEMU not only uses O_DIRECT but
> > also advertises an enabled write cache to the guest.
> > 
> > The guest *must* flush the cache when it wants to ensure data is
> > stable.  In the event of a host crash, all, some, or none of the I/O
> > since the last flush may have made it to disk.  Each of these
> > possibilities is fair game since the guest may only depend on writes
> > being on disk if they completed and a successful flush was issued
> > afterwards.
> 
> Thank both of you for the explanation, which is very helpful to me. With 
> FVD's capability of eliminating the host file system and storing the image 
> on a logical volume, then perhaps we can always use O_DSYNC, because there 
> is little (or no?) LVM metadata that needs a flush on every write and 
> hence O_DSYNC  does not add overhead? I am not certain on this, and need 
> help for confirmation. If this is true, the guest does not need to flush 
> the cache. 

I think O_DSYNC does not work as you might expect on raw disk devices
and logical volumes.

That doesn't mean you don't need something for crash durability!
Instead, you need to issue the disk cache flushes in whatever way works.

It actually has a very *high* overhead.

The overhead isn't from metadata - it is from needing to flush the
disk cache after every write, which prevents the disk from reordering
writes.

If you don't issue the flushes, and the physical device has a volatile
write cache, then you cannot guarantee integrity in the event of a
host crash.

This can make a filesystem faster than a raw disk or logical volume in
some configurations, if the filesystem journals data writes to limit
the seeking needed to commit durably.
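
For what it's worth, here is a minimal sketch of the pattern I mean,
assuming a kernel where fdatasync() on a raw block device really does
translate into a device cache flush (recent Linux does this; older
kernels may not, in which case you need whatever flush mechanism works
for your setup):

    /* Sketch only: write, then explicitly ask for the disk cache to be
     * flushed.  The flush is the expensive part - it defeats write
     * reordering in the drive, which is exactly the overhead above. */
    #include <sys/types.h>
    #include <unistd.h>

    int durable_pwrite(int fd, const void *buf, size_t len, off_t off)
    {
        if (pwrite(fd, buf, len, off) != (ssize_t)len)
            return -1;
        return fdatasync(fd);   /* request a device write-cache flush */
    }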

-- Jamie



Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%

2011-01-18 Thread Jamie Lokier
Chunqiang Tang wrote:
> > Based on my limited understanding, I think FVD shares a 
> > lot in common with the COW format (block/cow.c).
> > 
> > But I think most of the advantages you mention could be considered as 
> > additions to either qcow2 or qed.  At any rate, the right way to have 
> > that discussion is in the form of patches on the ML.
> 
> FVD is much more advanced than block/cow.c. I would be happy to discuss 
> possible leverage, but setting aside the details of QCOW2, QED, and FVD, 
> let’s start with a discussion of what is needed for the next generation 
> image format. 

Thank you for the detailed description.

FVD looks quite good to me; it seems very simple yet performant at the
same time, due to its smart yet simple design.

> Moreover, using a host file system not only adds overhead, but 
> also introduces data integrity issues. Specifically, if I/Os uses O_DSYNC, 
> it may be too slow. If I/Os use O_DIRECT, it cannot guarantee data 
> integrity in the event of a host crash. See 
> http://lwn.net/Articles/348739/ . 

You have the same issue with O_DIRECT when using a raw disk device
too.  That is, O_DIRECT on a raw device does not guarantee integrity
in the event of a host crash either, for mostly the same reasons.

-- Jamie



Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%

2011-01-18 Thread Jamie Lokier
Chunqiang Tang wrote:
> Doing both fault injection and verification together introduces some 
> subtlety. For example, even under the random failure mode, two disk writes 
> triggered by one VM-issued write must either fail together or succeed 
> together. Otherwise, the truth image and the test image will diverge and 
> verification won't succeed. Currently, qemu-test carefully works with the 
> 'sim' driver to guarantee those conditions. Those conditions need be 
> retained after code restructure. 

If the real backend is a host system file or device, and AIO or
multi-threaded writes are used, you can't depend on two parallel disk
writes (triggered by one VM-issued write) failing together or
succeeding together.  All you can do is look at the error code after
each operation completes, and use it to prevent issuing later
operations.  You can't stop the other parallel operations that are
already in progress.
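
To make the point concrete, here is a sketch (illustrative names only,
not the FVD/sim driver API) of the most a backend can promise: remember
the first error as completions arrive and refuse to issue further
writes, while accepting that writes already in flight complete or fail
independently:

    /* Illustrative sketch: with parallel AIO, the only available
     * guarantee is "stop issuing new writes after the first error". */
    struct write_pair {
        int first_error;   /* 0 until some completion reports failure */
        int outstanding;   /* completions still in flight */
    };

    static void on_completion(struct write_pair *wp, int ret)
    {
        if (ret < 0 && wp->first_error == 0)
            wp->first_error = ret;   /* too late to stop the sibling write */
        wp->outstanding--;
    }

    static int may_issue_more(const struct write_pair *wp)
    {
        return wp->first_error == 0; /* gate later writes on earlier errors */
    }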

Is that an issue in your design assumptions?

Thanks,
-- Jamie



Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format

2010-09-10 Thread Jamie Lokier
Stefan Hajnoczi wrote:
> Since there is no ordering imposed between the data write and metadata
> update, the following scenarios may occur on crash:
> 1. Neither data write nor metadata update reach the disk.  This is
> fine, qed metadata has not been corrupted.
> 2. Data reaches disk but metadata update does not.  We have leaked a
> cluster but not corrupted metadata.  Leaked clusters can be detected
> with qemu-img check.
> 3. Metadata update reaches disk but data does not.  The interesting
> case!  The L2 table now points to a cluster which is beyond the last
> cluster in the image file.  Remember that file size is rounded down by
> cluster size, so partial data writes are discarded and this case
> applies.

Better add:

4. File size is extended fully, but the data didn't all reach the disk.
5. Metadata is partially updated.
6. (Nasty) Metadata partial write has clobbered neighbouring
   metadata which wasn't meant to be changed.  (This may happen up
   to a sector size on normal hard disks - data is hard to come by.
   This happens to a much larger file range on flash and RAIDs
   sometimes - I call it the "radius of destruction").

6 can also happen when doing the L1 update mentioned earlier, in
which case you might lose a much larger part of the guest image.

-- Jamie



Re: [Qemu-devel] Anyone seeing huge slowdown launching qemu with Linux 2.6.35?

2010-08-03 Thread Jamie Lokier
Richard W.M. Jones wrote:
> We could demand that OSes write device drivers for more qemu devices
> -- already OS vendors write thousands of device drivers for all sorts
> of obscure devices, so this isn't really much of a demand for them.
> In fact, they're already doing it.

Result: Most OSes not working with qemu?

Actually we seem to be going that way.  Recent qemus don't work with
older versions of Windows any more, so we have to use different
versions of qemu for different guests.

-- Jamie



Re: [Qemu-devel] [PATCH] move 'unsafe' to end of caching modes in help

2010-07-21 Thread Jamie Lokier
Anthony Liguori wrote:
> On 07/21/2010 04:58 PM, Daniel P. Berrange wrote:
> >>Yes there is.  Use the version number.
> >> 
> >The version number is not suitable, because features can be removed at
> >compile time and/or
> 
> I don't see any features that libvirt would need to know about that are 
> disabled at compile time that aren't disabled by platform features (i.e. 
> being on a Linux vs. Windows host).
> 
> >  added via patch backports.
> 
> If a distro backports a feature, it should change the QEMU version 
> string.  If it doesn't, that's a distro problem.

To what version?  It can't use the newer version if it only backports
a subset of features; it would have to use a distro-specific version
number or a version string that somehow encodes features independently of
the version number itself, by some agreed libvirt standard.  Which
isn't far off advertising features in the help string :-)

-- Jamie



Re: [Qemu-devel] Question about qemu firmware configuration (fw_cfg) device

2010-07-20 Thread Jamie Lokier
Gleb Natapov wrote:
> On Mon, Jul 19, 2010 at 09:40:18AM +0200, Alexander Graf wrote:
> > 
> > On 19.07.2010, at 09:33, Gleb Natapov wrote:
> > 
> > > On Mon, Jul 19, 2010 at 08:28:02AM +0100, Richard W.M. Jones wrote:
> > >> On Mon, Jul 19, 2010 at 09:23:56AM +0300, Gleb Natapov wrote:
> > >>> That's what I am worrying about too. If we are adding a device we have to be
> > >>> sure such device can actually exist on real hw too otherwise we may have
> > >>> problems later.
> > >> 
> > >> I don't understand why the constraints of real h/w have anything to do
> > >> with this.  Can you explain?
> > >> 
> > > Each time we do something not architectural it cause us troubles later.
> > > So constraints of real h/w is our constrains to.
> > > 
> > >>> Also 1 second on 100M file does not look like huge gain to me.
> > >> 
> > >> Every second counts.  We're trying to get libguestfs boot times down
> > >> from 8-12 seconds to 4-5 seconds.  For many cases it's an interactive
> > >> program.
> > >> 
> > > So what about making initrd smaller? I remember managing two
> > > distribution in 64M flash in embedded project.
> > 
> > Having a huge initrd basically helps in reusing a lot of existing code. We 
> > do the same - in general the initrd is just a subset of the applications of 
> > the host OS. And if you start putting perl or the likes into it, it becomes 
> > big.
> > 
> Why not provide small disk/cdrom with all those utilities installed?
> 
> > I guess the best thing for now really is to try and see which code paths 
> > insb goes along. It should really be coalesced.
> > 
> It is coalesced to a certain extent (reenter guest every 1024 bytes,
> read from userspace page at a time). You need to continue injecting
> interrupt into a guest during long string operation and checking
> exception condition on a page boundaries.

First obvious change is to make that 4k bytes (page size) when the I/O
port is the firmware port.  That'll make initrd 4 times faster straight away.

If that's not enough saving, it strikes me a cleaner approach than
inventing new kinds of DMA and/or new PCI devices, is to just detect
when the rep insb instruction is used for loading a firmware blob and
treat that as a different trap.

Is guest SeaBIOS in real mode at that point?  If yes, then it would be
best to trap this combination:

  rep insb is fetching a blob + CPU is in real mode

Because then it's safe to skip the exception check on page boundaries.

If no, the trap will need to be a bit smarter.

Advantages of this approach:

  - No need for new BIOS
  - Will work with older BIOSes using current method, and accelerate them
  - No need for distinct -initrd BIOS implementations for isapc and pc,
(compared with the PCI proposal)
  - Doesn't add any new "extra-architectural" behaviour

-- Jamie



Re: [Qemu-devel] [Bug 595117] Re: qemu-nbd slow and missing "writeback" cache option

2010-06-23 Thread Jamie Lokier
Serge Hallyn wrote:
> The default of qemu-img (of using O_SYNC) is not very sensible
> because anyway, the client (the kernel) uses caches (write-back),
> (and "qemu-nbd -d" doesn't flush those by the way). So if for
> instance qemu-nbd is killed, regardless of whether qemu-nbd uses
> O_SYNC, O_DIRECT or not, the data in the image will not be
> consistent anyway, unless "syncs" are done by the client (like fsync
> on the nbd device or sync mount option), and with qemu-nbd's O_SYNC
> mode, those "sync"s will be extremely slow.

Do the "client syncs" cause the nbd server to fsync or fdatasync the file?

> It appears it is because by default the disk image it serves is open
> with O_SYNC. The --nocache option, unintuitively, makes matters a
> bit better because it causes the image to be open with O_DIRECT
> instead of O_SYNC.
[...]
> --cache=off is the same as --nocache (that is use O_DIRECT),
> writethrough is using O_SYNC and is still the default so this patch
> doesn't change the functionality. writeback is none of those flags,
> so is the addition of this patch. The patch also does an fsync upon
> "qemu-nbd -d" to make sure data is flushed to the image before
> removing the nbd.

I really wish qemu's options didn't give the false impression
"nocache" does less caching than "writethrough".  O_DIRECT does
caching in the disk controller/hardware, while O_SYNC hopefully does
not, nowadays.

-- Jamie



Re: [Qemu-devel] Re: block: format vs. protocol, and how they stack

2010-06-22 Thread Jamie Lokier
Markus Armbruster wrote:
> A possible reason why we currently expose format and protocol at the
> user interface is to avoid stacking there.

Pragmatic solution?: A few generic flags in each stacking module
("format/protocol/transport"), which govern which other modules are
allowed to stack on top or underneath.

For example, vvfat may provide a blockdev-like abstraction, along with
flags STACK_ABOVE_ONLY_RAW | STACK_BELOW_ONLY_DIRECTORY, which means
"raw" and "blkdebug" are allowed above (of course ;-) but other things
like the image formats shouldn't be.  And below, it can't stack on a
blockdev-like abstraction, but needs a directory and uses filesystem
operations on it - the thing that Plan9fs needs.
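
Something like this, purely as a sketch (none of these names exist in
QEMU today, they are only here to illustrate the shape of the idea):

    /* Hypothetical stacking constraints, declared once per module. */
    enum stack_flags {
        STACK_ABOVE_ONLY_RAW       = 1 << 0, /* only "raw"/"blkdebug" on top  */
        STACK_BELOW_ONLY_BLOCK     = 1 << 1, /* needs a blockdev-like layer   */
        STACK_BELOW_ONLY_DIRECTORY = 1 << 2, /* needs a directory tree below  */
    };

    struct stacking_module {
        const char *name;
        unsigned    flags;
    };

    static const struct stacking_module vvfat_module = {
        .name  = "vvfat",
        .flags = STACK_ABOVE_ONLY_RAW | STACK_BELOW_ONLY_DIRECTORY,
    };

The core could then refuse a stacking chain whose flags don't match,
instead of each module hand-rolling its own checks.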

Btw, I think we expose "format" because "virtual disk image file
format" is a useful and meaningful concept to users.  When someone
needs to use a .VMDK file, they know it as a "VMDK format file", not
"I must use the VMDK protocol with this file".

-- Jamie



Re: [Qemu-devel] Re: block: format vs. protocol, and how they stack

2010-06-22 Thread Jamie Lokier
Kevin Wolf wrote:
> > The "protocol" parlance breaks down when we move away from the simple
> > stuff.  For instance, qcow2 needs two children: the block driver
> > providing the delta bits (in qcow2 format), and the block driver
> > providing the base bits (whose configuration happens to be stored in the
> > delta bits).  
> 
> Backing files are different. When talking about opening images (which is
> what we do here) the main difference is that they can be opened only
> after the image itself has been opened. I don't think we should include
> them in this discussion.

Imho, being unable to override the qcow2 backing file from the command
line / monitor is very annoying, if you've moved files from another
machine or just renamed them for tidiness.  It's especially bad if the
supplied qcow2 file has an absolute path in it, quite bad if it has
subdirectories or ".." components, annoying if you've been given
several qcow2 files all of which have the name "backing-file" stored
in them which are different images because they were originally on
different machines, and awful if it has the name of a block device in it.

So, imho, for the discussion of command line / QMP options, there
should be reserved a place for giving the name of the backing file
through command line/monitor/QMP, along with the backing file's
formats/protocols/transports/options, and so on recursively in a tree
structure of arbitrary depth.

There is also the matter of qcow2 files encoding the path, but not
necessarily all the blockdev options that you might want to use to
access the backing file, such as cache=.

In QMP it's obviously quite simple to accept a full child blockdev
specification object as a qcow2-specific parameter, thus not needing
any further discussion in this thread.  It's less obvious how to do it
on the command line or human monitor.

-- Jamie



Re: [Qemu-devel] Re: block: format vs. protocol, and how they stack

2010-06-22 Thread Jamie Lokier
Christoph Hellwig wrote:
> On Mon, Jun 21, 2010 at 09:51:23AM -0500, Anthony Liguori wrote:
> > I can appreciate the desire to keep protocols and formats as an internal 
> > distinction but as a user visible concept, I think your two examples 
> > highlight why exposing protocols as formats make sense.  A user doesn't 
> > necessarily care what's happening under the cover.  I think:
> > 
> > -blockdev format=qcow2,file=image.qcow2,id=blk1
> > 
> > and:
> > 
> > -blockdev protocol=vvfat,file=/tmp/dir,id=blk1
> > 
> > Would cause a bit of confusion.  It's not immediately clear why vvfat is 
> > a protocol and qcow2 isn't.  It's really an implementation detail that 
> > we implement qcow2 on top of a "protocol" called file.
> 
> Everything involving vvfat will end up in sheer confusion, and that's
> because vvfat is such a beast.  But it's a rather traditional example
> of a "protocol".  Unlike qcow2 / vmdk / vpc it can not be stacked on
> an arbitrary protocol (file/nbd/http), but rather accesses a directory
> tree.

There is no technical reason why vvfat couldn't be stacked on top of
FTP or HTTP-DAV or RSYNC or SCP, or even "wget -R".  Basically
anything with multiple files addressed by paths, and a way to retrieve
directories to find all the paths.

vvfat doesn't stack on top of "file-like protocols", it stacks
conceptually on top of "directory tree-like protocols", of which there
is currently one.  The arrival of Plan9fs may motivate the addition of
more.

You can't meaningfully stack "qcow2" or any other format than "raw" on
top of the virtual file image created by vvfat.  So that's another reason
it isn't the same as other "protocols".

-- Jamie



Re: [Qemu-devel] [Bug 596106] Re: kvm to emulate 64 bit cpu on 32 bit host

2010-06-20 Thread Jamie Lokier
Natalia Portillo wrote:
>You got the point wrong, I'm talking running WITH 64 bit hardware in a
>32 bit guest.
>This is done in Mac OS X Leopard (kernel is only 32 bit) and Mac OS X
>Snow Leopard (using 32 bit kernel not 64 bit one) by VMWare, Parallels
>and VirtualBox, as well as on Windows 32 bit using VMWare (dunno about
>VBox and Parallels, VirtualPC is unable to run 64 bit guests at all
>even on 64 bit hosts), just provided of course, the hardware is 64
>bit.

Ah yes, Mac OS X too.

Apart from breaking userspace, the other reason people stick with
32-bit host kernels on both Windows and Macs is the 64-bit device
drivers often don't work properly, or aren't available at all.  They
continue to improve, but still aren't as mature and dependable as
32-bit drivers.

This is also true of Linux 64-bit kernels - both bugs and unavailable
third party drivers/firmware.  (But less so than the other OSes.)
So even with Linux people cannot assume dropping in a 64-bit host
kernel is always free of kernel/driver issues.

Marking this feature request "won't fix" is just a statement that KVM
developers aren't inclined to support this feature.

But there's nothing to stop an interested contributor having a go.
I'm sure if it works and the code is clean enough it will be accepted.

>VirtualPC is unable to run 64 bit guests at all even on 64 bit
>hosts

Are you sure?  Microsoft provides numerous downloadable 64-bit guest
Windows images, and VirtualPC is Microsoft's; they must be running on
something.

-- Jamie



Re: [Qemu-devel] Re: [Bug 596106] Re: kvm to emulate 64 bit cpu on 32 bit host

2010-06-20 Thread Jamie Lokier
Paolo Bonzini wrote:
> On 06/19/2010 03:01 PM, Natalia Portillo wrote:
> >VMWare is able to do it, we should be able.
> 
> They do it like TCG does it, not like KVM.

I heard rumours VMWare use KVM-style chip virtualisation when running
a 64-bit guest on a 32-bit host kernel on 64-bit hardware.

If true, that makes particular sense for Windows host users, who can't
just drop in a 64-bit host kernel without breaking their userspace
thoroughly.  (If it was that easy, 64-bit Windows wouldn't use
a surreptitious VM to run 32-bit apps :-).

It seems like a good way for Windows users to run a single 64-bit app
on an otherwise 32-bit system that's working fine.

On Linux hosts I would expect you can drop in a 64-bit kernel, while
continuing to run a 32-bit userspace.  But I don't know if (a) that's
entirely true, and (b) if distro packaging blocks that sort of thing
from being easy.

Unfortunately even that doesn't help people who just want to run a
64-bit VM as an ordinary user and aren't permitted to change their
Linux host kernel, e.g. a shared system, or some rented servers.

-- Jamie



Re: [Qemu-devel] Re: [PATCH 09/10] pci: set PCI multi-function bit appropriately.

2010-06-18 Thread Jamie Lokier
Isaku Yamahata wrote:
> On Fri, Jun 18, 2010 at 03:44:04PM +0300, Michael S. Tsirkin wrote:
> > If we really want the ability to put unrelated devices
> > as functions in a single one, let's just add
> > a 'multifunction' qdev property, and validate that
> > it is set appropriately.
> 
> I think "unrelated" is policy. There is no obvious way to determine
> which functions can be in a same device.
> For example, popular chipset contains isa bridge, ide controller,
> usb controller, sound and modem in a single device as functions.
> It's up to hardware designer policy which functions are grouped into
> a device.

In hardware terms, quad-port ethernet controllers often present
themselves as a PCI bridge with four independent PCI ethernet
controllers, so they work with standard drivers.  Even though all four
are in a single chip.  Some USB devices do the same, presenting a
small bulk storage device to ship windows drivers [:roll-eyes: ;-)]
alongside the actual device.

-- Jamie



Re: [Qemu-devel] VLIW?

2010-06-17 Thread Jamie Lokier
Gibbons, Scott wrote:
> My architecture is an Interleaved Multithreading VLIW architecture.  One 
> bundle (packet) executes per processor cycle, rotating between threads (i.e., 
> thread 0 executes at time 0, thread 1 executes at time 1, then thread 0 
> executes at time 2, etc.).  Each thread has its own context (including a 
> program counter).  I'm not sure what kind of performance I would get in 
> translating a single bundle at a time (or maybe I'm misunderstanding).
> 
> I think I'll get basic single-thread operation working first, then attempt 
> multithreading when I have a spare month or so.

I know of another CPU architecture that has fine-grained hardware
threads and has working qemu emulation at a useful performance for
debugging kernels, but it's not public as far as I know, and I don't
know if it's ok to name it.  I don't think it's VLIW, only that it has
lots of hardware threads and a working qemu model.

-- Jamie



Re: [Qemu-devel] Re: [PATCH 1/2] qemu-io: retry fgets() when errno is EINTRg

2010-06-17 Thread Jamie Lokier
Kevin Wolf wrote:
> Am 16.06.2010 18:52, schrieb MORITA Kazutaka:
> > At Wed, 16 Jun 2010 13:04:47 +0200,
> > Kevin Wolf wrote:
> >>
> >> Am 15.06.2010 19:53, schrieb MORITA Kazutaka:
> >>> posix-aio-compat sends a signal in aio operations, so we should
> >>> consider that fgets() could be interrupted here.
> >>>
> >>> Signed-off-by: MORITA Kazutaka 
> >>> ---
> >>>  cmd.c |3 +++
> >>>  1 files changed, 3 insertions(+), 0 deletions(-)
> >>>
> >>> diff --git a/cmd.c b/cmd.c
> >>> index 2336334..460df92 100644
> >>> --- a/cmd.c
> >>> +++ b/cmd.c
> >>> @@ -272,7 +272,10 @@ fetchline(void)
> >>>   return NULL;
> >>>   printf("%s", get_prompt());
> >>>   fflush(stdout);
> >>> +again:
> >>>   if (!fgets(line, MAXREADLINESZ, stdin)) {
> >>> + if (errno == EINTR)
> >>> + goto again;
> >>>   free(line);
> >>>   return NULL;
> >>>   }
> >>
> >> This looks like a loop replaced by goto (and braces are missing). What
> >> about this instead?
> >>
> >> do {
> >> ret = fgets(...)
> >> } while (ret == NULL && errno == EINTR)
> >>
> >> if (ret == NULL) {
> >>fail
> >> }
> >>
> > 
> > I agree.
> > 
> > However, it seems that my second patch have already solved the
> > problem.  We register this readline routines as an aio handler now, so
> > fgets() does not block and cannot return with EINTR.
> > 
> > This patch looks no longer needed, sorry.
> 
> Good point. Thanks for having a look.

Anyway, are you sure stdio functions can be interrupted with EINTR?
Linus reminds us that some stdio functions have to retry internally
anyway:

http://comments.gmane.org/gmane.comp.version-control.git/18285

-- Jamie



Re: [Qemu-devel] Re: [CFR 6/10] cont command

2010-06-16 Thread Jamie Lokier
Anthony Liguori wrote:
> On 06/16/2010 11:17 AM, Juan Quintela wrote:
> >Consider the example that I showed you:
> >
> >(host A) (host B)
> >launch qemu launch qemu -incoming
> >migrate host B
> > .
> > do your things
> > exit/poweroff/...
> >
> >At this point you have a qemu launched on machine A, with nothing on
> >machine B.  running "cont" on machine A, have disastreus consecuences,
> >and there is no way to prevent it :(
> >   
> 
> If there was a reasonable belief that it wouldn't result in disaster, I 
> would fully support you.  However, I can't think of any rational reason 
> why someone would do this.  I can't think of a better analogy to 
> shooting yourself in the foot.

That looks like a useful way to fork a guest for testing, if host B is
launched with -snapshot, or a copy of the disk image, or a qcow2 child of it.

Does it work? :-)

-- Jamie



Re: [Qemu-devel] Re: [PATCH V4 2/3] qemu: Generic task offloading framework: threadlets

2010-06-16 Thread Jamie Lokier
Jamie Lokier wrote:
> Anthony Liguori wrote:
> > On 06/16/2010 09:29 AM, Paolo Bonzini wrote:
> > >On 06/16/2010 04:22 PM, Jamie Lokier wrote:
> > >>Paolo Bonzini wrote:
> > >>>These should be (at least for now) block-obj-$(CONFIG_POSIX).
> > >>>
> > >>>>+while (QTAILQ_EMPTY(&(queue->request_list))&&
> > >>>>+   (ret != ETIMEDOUT)) {
> > >>>>+ret = qemu_cond_timedwait(&(queue->cond),
> > >>>>+ &(queue->lock), 10*10);
> > >>>>+}
> > >>>
> > >>>Using qemu_cond_timedwait is a hack for not properly broadcasting the
> > >>>condvar in flush_threadlet_queue.
> > >>
> > >>Are you sure?  It looks like it also expires idle threads after a
> > >>fixed amount of idle time.
> > >
> > >Unnecessary idle threads are immediately expired as soon as the 
> >threadlet exits if necessary, since here
> > 
> > If a threadlet is waiting to consume more work, unless we do a 
> > pthread_cancel (I dislike cancellation) it will keep waiting until it 
> > gets more work (which would mean it's not actually idle)...
> 
> There's some mild abuse of the mutex/condvar going on.
> 
> As (queue->exit || queue->idle_threads > queue->min_threads) is a
> condition for breaking out of the loop, that condition ought to be
> checked in the mutex->cond_wait region, but it isn't.
> 
> It doesn't matter here because the queue is empty when queue->exit,
> and the idle > min_threads condition can't become true.

Sorry, thinko.  It does matter when queue->exit, precisely because the
queue is empty :-)

Even cond_broadcast after queue->exit is set isn't enough to remove
the need for the timed wait hack.

Putting the whole condition inside the mutex->cond_wait region, not
just empty queue test, will remove the need for timed wait.  Broadcast
is still needed, or alternatively a cond_signal from each exiting
thread will allow them to wake and close without a thundering herd.
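
Concretely, something along these lines (sketch only, reusing the queue
fields from the patch and assuming the qemu_mutex/qemu_cond wrappers
around pthreads):

    /* Check the whole exit condition inside the wait loop, so a plain
     * wakeup (broadcast on flush/exit, or one signal per exiting
     * thread) is enough and no timed wait is needed. */
    qemu_mutex_lock(&queue->lock);
    while (QTAILQ_EMPTY(&queue->request_list) &&
           !queue->exit &&
           queue->idle_threads <= queue->min_threads) {
        qemu_cond_wait(&queue->cond, &queue->lock);
    }
    /* ...dequeue work here, or clean up and exit if woken for exit... */
    qemu_mutex_unlock(&queue->lock);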

-- Jamie



Re: [Qemu-devel] Re: [PATCH V4 2/3] qemu: Generic task offloading framework: threadlets

2010-06-16 Thread Jamie Lokier
Anthony Liguori wrote:
> On 06/16/2010 09:29 AM, Paolo Bonzini wrote:
> >On 06/16/2010 04:22 PM, Jamie Lokier wrote:
> >>Paolo Bonzini wrote:
> >>>These should be (at least for now) block-obj-$(CONFIG_POSIX).
> >>>
> >>>>+while (QTAILQ_EMPTY(&(queue->request_list))&&
> >>>>+   (ret != ETIMEDOUT)) {
> >>>>+ret = qemu_cond_timedwait(&(queue->cond),
> >>>>+ &(queue->lock), 10*10);
> >>>>+}
> >>>
> >>>Using qemu_cond_timedwait is a hack for not properly broadcasting the
> >>>condvar in flush_threadlet_queue.
> >>
> >>Are you sure?  It looks like it also expires idle threads after a
> >>fixed amount of idle time.
> >
> >Unnecessary idle threads are immediately expired as soon as the 
> >threadlet exits if necessary, since here
> 
> If a threadlet is waiting to consume more work, unless we do a 
> pthread_cancel (I dislike cancellation) it will keep waiting until it 
> gets more work (which would mean it's not actually idle)...

There's some mild abuse of the mutex/condvar going on.

As (queue->exit || queue->idle_threads > queue->min_threads) is a
condition for breaking out of the loop, that condition ought to be
checked in the mutex->cond_wait region, but it isn't.

It doesn't matter here because the queue is empty when queue->exit,
and the idle > min_threads condition can't become true.

> >The min/max_threads parameters of the queue are currently immutable, 
> >so it can never happen that a thread has to be expired while it's 
> >waiting.  It may well become true in the future, in which case the 
> >condvar will have to be broadcast when min_threads changes.

Broadcasting when min_threads decreases wouldn't be enough, because
min_threads isn't checked inside the mutex->cond_wait region.

-- Jamie



Re: [Qemu-devel] Re: [PATCH V4 2/3] qemu: Generic task offloading framework: threadlets

2010-06-16 Thread Jamie Lokier
Paolo Bonzini wrote:
> These should be (at least for now) block-obj-$(CONFIG_POSIX).
> 
> >+while (QTAILQ_EMPTY(&(queue->request_list))&&
> >+   (ret != ETIMEDOUT)) {
> >+ret = qemu_cond_timedwait(&(queue->cond),
> >+&(queue->lock), 10*10);
> >+}
> 
> Using qemu_cond_timedwait is a hack for not properly broadcasting the 
> condvar in flush_threadlet_queue.

Are you sure?  It looks like it also expires idle threads after a
fixed amount of idle time.

-- Jamie



Re: [Qemu-devel] Re: [SeaBIOS] [PATCHv2] load hpet info for HPET ACPI table from qemu

2010-06-14 Thread Jamie Lokier
Gleb Natapov wrote:
> On Mon, Jun 14, 2010 at 09:54:25AM -0400, Kevin O'Connor wrote:
> > Could we just have qemu build the hpet tables and pass them through to
> > seabios?  Perhaps using the qemu_cfg_acpi_additional_tables() method.
> > 
> Possible, and I considered that. I personally prefer to pass minimum
> information required for seabios to discover underlying HW and leave
> ACPI table creation to seabios. That is how things done for HW that
> seabios can actually detect. If we will go your way pretty soon we will
> move creation of ACPI/SMBIOS/MP tables into qemu and IMHO this will be
> a step backwards.

Why would creation of all the tables in qemu be a bad thing or a step
in the wrong direction?

Crude argument in favour of doing it in qemu: They're both C code, so
sharing code between qemu and SeaBIOS may not be as traumatic as it
would be for an asm BIOS.  Doing it in SeaBIOS forces the existence of
another API, in effect, to pass the qemu-specific hardware
information, so doing it in qemu means one less interface API that
needs designing and maintaining.

Argument in favour of doing it all in SeaBIOS: I'm not sure, what is it?

Indeed, why not build all the tables outside qemu using a separate
tool ("qemu-build-acpi-tables < machine-description.txt"), invoked
from qemu when it starts up, making it easier to examine the tables
using ACPI tools?

-- Jamie



Re: [Qemu-devel] [PATCH] virtio-blk: Avoid zeroing every request structure

2010-05-29 Thread Jamie Lokier
Alexander Graf wrote:
> Anthony Liguori wrote:
> > I'd prefer to stick to bug fixes for stable releases.  Performance
> > improvements are a good motivation for people to upgrade to 0.13 :-)
> 
> In general I agree, but this one looks like a really simple one.

Besides, there are too many reported guest regressions at the moment
to upgrade if using any of them.

-- Jamie



Re: [Qemu-devel] Re: SVM emulation: EVENTINJ marked valid when a pagefault happens while issuing a software interrupt

2010-05-28 Thread Jamie Lokier
Roedel, Joerg wrote:
> On Fri, May 28, 2010 at 02:10:59AM -0400, Jan Kiszka wrote:
> > Erik van der Kouwe wrote:
> 
> > > In my experience, if I provide the -enable-kvm switch then the guest VMM
> > > never detects the presence of virtualization support. Does this only
> > > work on AMD hardware? Or do I need to supply some additional parameter
> > > to make it work?
> > 
> > Yes, forgot to mention: -enable-nesting, and you need qemu-kvm. This
> > feature hasn't been merged upstream yet.
> 
> And the svm-emulation is only available on AMD hardware.

I assume you mean nested SVM emulation in a KVM guest is only
available on real AMD hardware?

Is this due to something inherent, or just a limitation of the KVM
code not handling all the necessary traps in kvm-intel?

Thanks,
-- Jamie



Re: [Qemu-devel] [PATCH] Add QEMU DirectFB display driver

2010-05-19 Thread Jamie Lokier
Julian Pidancet wrote:
> So after all, why not implementing our own VT switching and using
> directly the fbdev interface.

It's a good idea.  VT switching isn't hard to track reliably.

Being able to tell qemu, through the monitor, to attach/detach from a
particular VT might be a nice easy bonus too.

> I just checked the linux fbdev code to
> find out if it provides with a blitting method that could perform
> the pixel color conversion automatically for Qemu.
> 
> Unfortunately, from what I have read from the
> drivers/video/cfbimgblt.c file in the linux tree, there's no such
> thing, and it also means that we cannot take advantage of any kind
> of hardware pixel format conversion.

I'm not sure if DirectFB provides that particular operation, but I
have the impression it's the sort of thing DirectFB is intended for: A
framebuffer, plus a variety of 2d acceleration methods (and other
things like multi-buffering, video and alpha channel overlay).

-- Jamie



Re: [Qemu-devel] [RFC] Bug Day - June 1st, 2010

2010-05-19 Thread Jamie Lokier
Michael Tokarev wrote:
> Anthony Liguori wrote:
> []
> > For the Bug Day, anything is interesting IMHO.  My main interest is to
> > get as many people involved in testing and bug fixing as possible.  If
> > folks are interested in testing specific things like unusual or older
> > OSes, I'm happy to see it!
> 
> Well, interesting or not, but I for one don't know what to do with the
> results.  There were a thread on kvm@ about sigsegv in cirrus code when
> running winNT. The issue has been identified and appears to be fixed,
> as in, kvm process does not SIGSEGV anymore, but it does not work anyway,
> now printing:
> 
>  BUG: kvm_dirty_pages_log_enable_slot: invalid parameters
> 
> with garbled guest display.  Thanks goes to Stefano Stabellini for
> finding the SIGSEGV case, but unfortunately his hard work isn't quite
> useful since the behavour isn't very much different from the previous
> version... ;)

A "BUG:" is good to see in a bug report: It gives you something
specific to analyse.  Good luck ;-)

Imho, it'd be quite handy to keep a timeline of working/non-working
guests in a table somewhere, and which qemu versions and options they
were observed to work or break with.

> Also, thanks to Andre Przywara, whole winNT thing works but it requires
> -cpu qemu64,level=1 (or level=2 or =3), -- _not_ with default CPU.  This
> is also testing, but it's not obvious what to do witht the result...

Doesn't WinNT work with qemu32 or kvm32?
It's a 32-bit OS after all.

- Jamie



Re: [Qemu-devel] [RFC] Bug Day - June 1st, 2010

2010-05-18 Thread Jamie Lokier
Natalia Portillo wrote:
> Hi,
> 
> > - We'll try to migrate as many confirmable bugs from the Source Forge 
> > tracker to Launchpad.
> I think that part of the bug day should also include retesting OSes that 
> appear in OS Support List as having bug and confirming if the bug is still 
> present and if it's in Launchpad or not.

There have been reports of several legacy OSes being unable to install
or boot in the newer qemu while working in the older one.  They're
probably not in the "OS Support List" though.  Are they effectively
uninteresting for the purpose of the 0.13 release?

Unfortunately I doubt I will have time to participate in the Bug Day.

Thanks,
-- Jamie




[Qemu-devel] Re: [PATCH 3/8] Add QBuffer

2010-05-17 Thread Jamie Lokier
Jan Kiszka wrote:
> Jamie Lokier wrote:
> > Anthony Liguori wrote:
> >> Instead of encoding just as a string, it would be a good idea to encode 
> >> it as something like:
> >>
> >> {'__class__': 'base64', 'data': ...}
> > 
> > Is there a benefit to the class indirection, over simply a keyword?:
> > 
> > {'__base64__': ...}
> > 
> > __class__ seems to suggest much more than it's being used for here.
> > 
> 
> Depending on how sophisticated your parser is, you could directly push
> the result into an object of the proper type. And we can add more
> complex objects in the future that do not only consists of a single data
> key. Note that this extension is not just about encoding, it is about
> typecasting (dict -> custom type).

Sure, if that's the plan.

Does it make sense to combine encoding and custom types in this way?
It looks like mixing syntax and semantics, which has consequences for
code using generic parsers with separate semantic layer, but I realise
there's no "correct" answer.

Back to the syntax: I'm under the impression from earlier discussion
that the '__*__' keyspace reserved, so even types could use the
compact syntax?

Or is there something Javascript-ish (and not merely JSON-ish) about
'__class__' in particular which makes it appropriate?

-- Jamie



Re: [Qemu-devel] Re: [PATCH] Add cache=volatile parameter to -drive

2010-05-17 Thread Jamie Lokier
Alexander Graf wrote:
> 
> On 17.05.2010, at 18:26, Anthony Liguori wrote:
> 
> > On 05/17/2010 11:23 AM, Paul Brook wrote:
>  I don't see a difference between the results. Apparently the barrier
>  option doesn't change a thing.
>    
> >>> Ok.  I don't like it, but I can see how it's compelling.  I'd like to
> >>> see the documentation improved though.  I also think a warning printed
> >>> on stdio about the safety of the option would be appropriate.
> >>> 
> >> I disagree with this last bit.
> >> 
> >> Errors should be issued if the user did something wrong.
> >> Warnings should be issued if qemu did (or will soon do) something other 
> >> than
> >> what the user requested, or otherwise made questionable decisions on the
> >> user's behalf.
> >> 
> >> In this case we're doing exactly what the user requested. The only 
> >> plausible
> >> failure case is where a user is blindly trying options that they clearly 
> >> don't
> >> understand or read the documentation for. I have zero sympathy for 
> >> complaints
> >> like "Someone on the Internet told me to use --breakme, and broke thinks".
> >>   
> > 
> > I see it as the equivalent to the Taint bit in Linux.  I want to make it 
> > clear to users up front that if you use this option, and you have data loss 
> > issues, don't complain.
> > 
> > Just putting something in qemu-doc.texi is not enough IMHO.  Few people 
> > actually read it.
> 
> But that's why it's no default and also called "volatile". If you prefer, we 
> can call it cache=destroys_your_image.

With that semantic, a future iteration of cache=volatile could even
avoid writing to the backing file at all, if that's yet faster.  I
wonder if that would be faster.  Anyone fancy doing a hack with the
whole guest image as a big malloc inside qemu?  I don't have enough RAM :-)

-- Jamie



[Qemu-devel] A20 line control (was Re: [PATCH 0/2] pckbd improvements)

2010-05-17 Thread Jamie Lokier
Blue Swirl wrote:
> On 5/16/10, Jamie Lokier  wrote:
> > Blue Swirl wrote:
> >  > On 5/16/10, Paolo Bonzini  wrote:
> >  > > On 05/15/2010 11:49 AM, Blue Swirl wrote:
> >  > >
> >  > > > In 2/2, A20 logic changes a bit but I doubt any guest would be broken
> >  > > > if A20 line written through I/O port 92 couldn't be read via i8042.
> >  > > > The reverse (write using i8042 and read port 92) will work.
> >  > > >
> >  > >
> >  > >  Why take the risk?
> >  >
> >  > The alternative is to route a signal from port 92 to i8042. Or maybe
> >  > port 92 should belong to i8042, that could make things simpler but
> >  > then the port would appear on non-PC architectures as well.
> >  >
> >  > But I doubt any OS would depend on such details, because the details
> >  > seem to be murky:
> >  > http://www.win.tue.nl/~aeb/linux/kbd/A20.html
> >
> >
> > It's not hard to imagine some DOS memory driver or 286/386 DOS
> >  extender expecting to read the bit, if that's normal on PCs.
> >
> >  The earlier PCs didn't have port 92h, so presumably older DOS software
> >  uses the keyboard controller exclusively.
> >
> >  The details are murky, but on the other hand, I remember back in day,
> >  A20 line was common knowledge amongst DOS hackers on 286s and 386s,
> >  and the time I remember it from, port 92h was not available, so it
> >  can't have been too murky to use the i8042.
> 
> Right, but with this patch, writing to and reading from i8042 would
> still work, likewise for writing to and reading from port 92. Even
> writing via i8042, but reading via port 92 would work. What would not
> work reliably (even then, 50% probability of being correct) is when
> port 92 is written, but reading happens with i8042.
> 
> >  i8042 emulation isn't the same on PC on a non-PC because of the
> >  PC-specific wiring (outside the chip), such as its ability to reset
> >  the motherboard.  I don't see that it makes sense for qemu to pretend
> >  there are no differences at all.  Or, perhaps it makes sense to imply
> >  different GPIO wiring, separate from the i8042 itself.
> >
> >  On the other hand, something which makes sense to me:
> >
> >  In a PC, are port 92h and i8042's outputs OR'd together or AND'd
> >  together to control A20 proper?  Then they'd be two independent
> >  signals, and shouldn't mirror each other.
> 
> That's exactly what I meant, how could also random OS designer trust
> that the signals are combined the same way on every PC? With logic
> circuits, i8042 would still see its own output only, not the combined
> signal. If instead the signals were wired together, with some
> combination of inputs the output would not match what QEMU generates.
> Currently QEMU does not implement any logic for A20 line, which
> obviously can't match real hardware (or maybe some kind of dual port
> memory).

http://www.openwatcom.org/index.php/A20_Line

According to that page, MS-DOS's HIMEM.SYS tries 17 different methods
to control the A20 line! :-) Meanwhile, DOS/4GW, a DOS extender (there
are lots of those) allows the method to be set manually.

But there are only two common ones that are still implemented in
modern PC hardware: The keyboard commands to read, modify and write
the output port, and port 92h.

The random DOS-extender designers had to try each method, by checking
if the address space was actually wrapped.

With port 92h being known as the "fast A20 gate", I'm pretty sure any
program which includes that method will try that one first.

According to the wiki page (actually, my interpretation of it), the
output signal from port 92h is usually OR'd with the output signal
from the keyboard controller port.  That is, they are independent signals.
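
In qemu terms that would amount to something like this hypothetical
sketch (not the current code): keep the two controls as independent
bits and OR them to drive the gate.

    #include <stdbool.h>

    /* Hypothetical sketch: port 92h and the i8042 output port each keep
     * their own A20 bit; the physical A20 gate is the OR of the two. */
    struct a20_state {
        bool port92_enable;   /* bit 1 of the "fast A20 gate", port 92h */
        bool kbd_enable;      /* A20 bit in the i8042 output port       */
    };

    static bool a20_line_high(const struct a20_state *s)
    {
        return s->port92_enable || s->kbd_enable;
    }

Then each port reads back only its own bit, which matches what the page
above describes.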

-- Jamie






Re: [Qemu-devel] Re: [PATCH 0/2] pckbd improvements

2010-05-16 Thread Jamie Lokier
Blue Swirl wrote:
> On 5/16/10, Paolo Bonzini  wrote:
> > On 05/15/2010 11:49 AM, Blue Swirl wrote:
> >
> > > In 2/2, A20 logic changes a bit but I doubt any guest would be broken
> > > if A20 line written through I/O port 92 couldn't be read via i8042.
> > > The reverse (write using i8042 and read port 92) will work.
> > >
> >
> >  Why take the risk?
> 
> The alternative is to route a signal from port 92 to i8042. Or maybe
> port 92 should belong to i8042, that could make things simpler but
> then the port would appear on non-PC architectures as well.
> 
> But I doubt any OS would depend on such details, because the details
> seem to be murky:
> http://www.win.tue.nl/~aeb/linux/kbd/A20.html

It's not hard to imagine some DOS memory driver or 286/386 DOS
extender expecting to read the bit, if that's normal on PCs.

The earlier PCs didn't have port 92h, so presumably older DOS software
uses the keyboard controller exclusively.

The details are murky, but on the other hand, I remember back in the day,
the A20 line was common knowledge amongst DOS hackers on 286s and 386s,
and at the time I remember it from, port 92h was not available, so it
can't have been too murky to use the i8042.

i8042 emulation isn't the same on a PC as on a non-PC because of the
PC-specific wiring (outside the chip), such as its ability to reset
the motherboard.  I don't see that it makes sense for qemu to pretend
there are no differences at all.  Or, perhaps it makes sense to imply
different GPIO wiring, separate from the i8042 itself.

On the other hand, something which makes sense to me:

In a PC, are port 92h and i8042's outputs OR'd together or AND'd
together to control A20 proper?  Then they'd be two independent
signals, and shouldn't mirror each other.

-- Jamie



Re: [Qemu-devel] [PATCH 3/8] Add QBuffer

2010-05-16 Thread Jamie Lokier
Anthony Liguori wrote:
> Instead of encoding just as a string, it would be a good idea to encode 
> it as something like:
> 
> {'__class__': 'base64', 'data': ...}

Is there a benefit to the class indirection, over simply a keyword?:

{'__base64__': ...}

__class__ seems to suggest much more than it's being used for here.

-- Jamie



Re: [Qemu-devel] [PATCH 2/4] Add support for execution from ROMs in IO device mode

2010-05-13 Thread Jamie Lokier
Jan Kiszka wrote:
> While IO_MEM_ROMD marks an I/O memory region as "read/execute from RAM,
> but write to I/O handler", there is no flag indicating that an I/O
> region which is fully managed by I/O handlers can still be hosting
> executable code. One use case for this are flash device models that
> switch to I/O mode during reprogramming. Not all reprogramming states
> modify to read data, thus practically allow to continue execution.
> Moreover, we need to avoid switching the modes too frequently for
> performance reasons which requires fetching opcodes while still in I/O
> device mode.

I like this change.

Does "fetching opcodes while still in I/O device mode" fetch opcodes
from the RAM backing, or via the I/O read handlers?

If the latter, I'm wondering how KVM would cope with that.

Thanks,
-- Jamie



Re: [Qemu-devel] Re: Another SIGFPE in display code, now in cirrus

2010-05-13 Thread Jamie Lokier
Stefano Stabellini wrote:
> > I think we need to consider only dstpitch for a full invalidate.  We 
> > might be copying an offscreen bitmap into the screen, and srcpitch is 
> > likely to be the bitmap width instead of the screen pitch.
> 
> Agreed.

Even when copying on-screen (or partially on-screen), the srcpitch
does not affect the invalidated area.  The source area might be
strange (parallelogram, single line repeated), but srcpitch should
only affect whether qemu_console_copy can be used, not the
invalidation.

-- Jamie



Re: [Qemu-devel] Re: Another SIGFPE in display code, now in cirrus

2010-05-12 Thread Jamie Lokier
Stefano Stabellini wrote:
> On Wed, 12 May 2010, Avi Kivity wrote:
> > It's useful if you have a one-line horizontal pattern you want to 
> > propagate all over.
>  
> It might be useful all right, but it is not entirely clear what the
> hardware should do in this situation from the documentation we have, and
> certainly the current state of the cirrus emulation code doesn't help.

It's quite a reasonable thing for hardware to do, even if not documented.
It would be surprising if the hardware didn't copy the one-line pattern.

-- Jamie



Re: [Qemu-devel] [PATCH 0/2] Enable qemu block layer to not flush

2010-05-12 Thread Jamie Lokier
Stefan Hajnoczi wrote:
> On Wed, May 12, 2010 at 10:42 AM, Jamie Lokier  wrote:
> > Stefan Hajnoczi wrote:
> >> Why add a nop AIO operation instead of setting
> >> BlockDriverState->enable_write_cache to zero?  In that case no write
> >> cache would be reported to the guest (just like cache=writethrough).
> >
> > Hmm.  If the guest sees write cache absent, that prevents changing the
> > cache policy on the host later (from not flushing to flushing), which
> > you might want to do after an OS install has finished and booted up.
> 
> Right.  There are 3 cases from the guest perspective:
> 
> 1. Disable write cache or no write cache.  Flushing not needed.
> 2. Disable flushing but leave write cache enabled.
> 3. Enable write cache and use flushing.
> 
> When we don't report a write cache at all, the guest is always stuck at 1.
> 
> If you're going to do this for installs and other temporary workloads,
> then enabling the write cache again isn't an issue.  After installing
> successfully, restart the guest with a sane cache= mode.

That only works if you're happy to reboot the guest after the process
finishes.  I guess that is usually fine, but it is a restriction.

Is it possible via QMP to request that the guest is paused when it
next reboots, so that QMP operations to change the cache= mode can be
done (as it's not safe to change the guest-visible disk write cache
availability when it's running, and probably a request to do so should
be denied).

-- Jamie



Re: [Qemu-devel] Re: QEMU-KVM and video performance

2010-05-12 Thread Jamie Lokier
Gerhard Wiesinger wrote:
> Can one switch to the old software vmm in VMWare?

Perhaps you can install a very old version of VMWare.
Maybe run it under KVM ;-)

> That was one of the reasons why I was looking for alternatives for 
> graphical DOS programs. Overall summary so far:
> 1.) QEMU without KVM: Problem with 286 DOS Extender instruction set, but 
> fast VGA
> 2.) QEMU with KVM: 286 DOS Extender apps ok, but slow VGA memory 
> performance
> 3.) VMWare Server 2.0 under Linux, application ok, but slow VGA memory 
> performance
> 4.) Virtual PC: Problems with 286 DOS Extender
> 5.) Bochs: Works well, but very slow.

I would be interested in the 286 DOS Extender issue, as I'd like to
use some 286 programs in QEMU at some point.

There were some changes to KVM in the kernel recently.  Were those
needed to get the 286 apps working?

> Looks like that VMWare Server and QEMU with KVM maybe have the same 
> architectural problems going through the whole slow chain from Guest OS to 
> virtualization layer for VGA writes.

They do have a similar architecture.

The VGA write speed is a bit surprising, as it should be fast in
256-colour non-modeX modes for both.  But maybe there's something
we've missed that makes it architecturally slow.  It will be
interesting to see what you find :-)

Thanks,
-- Jamie



Re: [Qemu-devel] Re: QEMU-KVM and video performance

2010-05-12 Thread Jamie Lokier
Gerhard Wiesinger wrote:
> On Wed, 21 Apr 2010, Jamie Lokier wrote:
> 
> >Gerhard Wiesinger wrote:
> >>Hmmm. I'm very new to QEMU and KVM but at least accessing the virtual HW
> >>of QEMU even from KVM must be possible (e.g. memory and port accesses are
> >>done on nearly every virtual device) and therefore I'm ending in C code in
> >>the QEMU hw/*.c directory. Therefore also the VGA memory area should be
> >>able to be accessable from KVM but with the specialized and fast memory
> >>access of QEMU.  Am I missing something?
> >
> >What you're missing is that when KVM calls out to QEMU to handle
> >hw/*.c traps, that call is very slow.  It's because the hardware-VM
> >support is a bit slow when the trap happens, and then the the call
> >from KVM in the kernel up to QEMU is a bit slow again.  Then all the
> >way back.  It adds up to a lot, for every I/O operation.
> 
> Isn't that then a general problem of KVM virtualization (or hardware
> virtualization) in general? Is this CPU dependent (AMD vs. Intel)?

Yes it is a general problem, but KVM emulates some time-critical
things in the kernel (like APIC and CPU instructions), so it's not too bad.

KVM is about 5x faster than TCG for most things, and slower for a few
things, so on balance it is usually faster.

The slow 256-colour mode writes sound like just a simple bug, though.
No need for complicated changes.

> >In 256-colour mode, KVM should be writing to the VGA memory at high
> >speed a lot like normal RAM, not trapping at the hardware-VM level,
> >and not calling up to the code in hw/*.c for every byte.
> 
> Yes, same picture to me: 256 color mode should be only a memory write (16 
> color mode is more difficult as pixel/byte mapping is not the same).
> But it looks like this isn't the case in this test scenario.
> 
> >You might double-check if your guest is using VGA "Mode X".  (See 
> >Wikipedia.)
> >
> >That was a way to accelerate VGA on real PCs, but it will be slow in
> >KVM for the same reasons as 16-colour mode.
> 
> Which way do you mean?

Look up Mode X on Wikipedia if you're interested, but it isn't
relevant to the problem you've reported.  Mode X cannot be enabled
with a BIOS call; it's a VGA hardware programming trick.  It would not
be useful in a VM environment.

-- Jamie



Re: [Qemu-devel] Re: [PATCH 2/2] Add flush=off parameter to -drive

2010-05-12 Thread Jamie Lokier
Paul Brook wrote:
> > > Paul Brook wrote:
> > > > cache=none:
> > > >   No host caching. Reads and writes both go directly to underlying
> > > >   storage.
> > > > 
> > > > Useful to avoid double-caching.
> > > > 
> > > > cache=writethrough
> > > > 
> > > >   Reads are cached. Writes go directly to underlying storage.  Useful
> > > >   for
> > > > 
> > > > broken guests that aren't aware of drive caches.
> > > 
> > > These are misleading descriptions - because cache=none does not push
> > > writes down to powerfail-safe storage, while cache=writethrough might.
> > 
> > If so, then this is a serious bug.
> 
> .. though it may be a kernel bug rather that a qemu bug, depending on the 
> exact details.

It's not a kernel bug.  cache=none uses O_DIRECT, and O_DIRECT must
not force writes to powerfail-safe storage.  If it did, it would be
unusably slow for applications using O_DIRECT as a performance
enhancer / memory saver.  They can call fsync/fdatasync when they need
to for integrity.  (There might be kernel bugs in the latter department.)

> Either way, I consider any mode that inhibits host filesystem write
> cache but not volatile drive cache to be pretty worthless.

On the contrary, it greatly reduces host memory consumption so that
guest data isn't cached twice (it's already cached in the guest), and
it may improve performance by relaxing the POSIX write-serialisation
constraint (not sure if Linux cares; Solaris does).

> Either we guaranteed data integrity on completion or we don't.

The problem with the description of cache=none is that it uses O_DIRECT,
which does not always push writes to powerfail-safe storage.

O_DIRECT is effectively a hint.  It requests less caching in kernel
memory, may reduce memory usage and copying, may invoke direct DMA.

O_DIRECT does not tell the disk hardware to commit to powerfail-safe
storage.  I.e. it doesn't issue barriers or disable disk write caching.
(However, depending on a host setup, it might have that effect if disk
write cache is disabled by the admin).

Also, it doesn't even always write to disk: It falls back to buffered
in some circumstances, even on filesystems which support it - see
recent patches for btrfs which use buffered I/O for O_DIRECT for some
parts of some files.  (Many non-Linux OSes fall back to buffered
when any other process holds a non-O_DIRECT file descriptor, or when
requests don't meet some criteria).

The POSIX thing to use for cache=none would be O_DSYNC|O_RSYNC, and
that should work on some hosts, but Linux doesn't implement real O_RSYNC.

A combination which ought to work is O_DSYNC|O_DIRECT.  O_DIRECT is
the performance hint; O_DSYNC provides the commit request.  Christoph
Hellwig has mentioned that combination elsewhere on this thread.
It makes sense to me for cache=none.

O_DIRECT by itself is a useful performance & memory hint, so there
does need to be some option which maps onto O_DIRECT alone.  But it
shouldn't be documented as stronger than cache=writethrough, because
it isn't.
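
To make the distinction concrete, here's a minimal sketch assuming a
Linux host (error handling omitted; the flag combinations are the
point, not the code):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* O_DIRECT alone: a caching/copy hint.  The application must
         * call fdatasync() itself when it needs a commit point. */
        int hint_fd = open("disk.img", O_RDWR | O_DIRECT);

        /* O_DIRECT|O_DSYNC: each completed write is also a commit
         * request - much closer to a writethrough guarantee. */
        int sync_fd = open("disk.img", O_RDWR | O_DIRECT | O_DSYNC);

        void *buf;
        posix_memalign(&buf, 4096, 4096);   /* O_DIRECT wants aligned I/O */
        memset(buf, 0, 4096);

        pwrite(hint_fd, buf, 4096, 0);
        fdatasync(hint_fd);                 /* explicit commit, hint-only fd */

        pwrite(sync_fd, buf, 4096, 4096);   /* commit implied by O_DSYNC */

        free(buf);
        close(hint_fd);
        close(sync_fd);
        return 0;
    }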

--  Jamie



Re: [Qemu-devel] [PATCH 0/2] Enable qemu block layer to not flush

2010-05-12 Thread Jamie Lokier
Stefan Hajnoczi wrote:
> Why add a nop AIO operation instead of setting
> BlockDriverState->enable_write_cache to zero?  In that case no write
> cache would be reported to the guest (just like cache=writethrough).

Hmm.  If the guest sees write cache absent, that prevents changing the
cache policy on the host later (from not flushing to flushing), which
you might want to do after an OS install has finished and booted up.

-- Jamie



Re: [Qemu-devel] [PATCH 07/22] qemu-error: Introduce get_errno_string()

2010-05-11 Thread Jamie Lokier
Anthony Liguori wrote:
> QMP should insulate users from underlying platform quirks.  We should 
> translate errnos to appropriate QMP error types.

Fair enough.  What should it do when the platform returns an errno
value that qemu doesn't know about, and wants to pass to the QMP caller?
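
For illustration, one possible answer - made-up error class names, not
qemu's real tables - where an unknown value is passed through rather
than silently dropped:

    #include <errno.h>
    #include <stdio.h>

    /* Map the errnos we know about to symbolic QMP error names. */
    static const char *qmp_error_name(int err)
    {
        switch (err) {
        case ENOENT: return "FileNotFound";      /* hypothetical names */
        case EACCES: return "PermissionDenied";
        case ENOSPC: return "NoSpaceLeft";
        default:     return NULL;                /* unknown to the mapping */
        }
    }

    static void report_errno(int err)
    {
        const char *name = qmp_error_name(err);
        if (name) {
            printf("{\"error\": {\"class\": \"%s\"}}\n", name);
        } else {
            /* Carry the raw number so nothing is lost on the way out. */
            printf("{\"error\": {\"class\": \"UndefinedError\","
                   " \"data\": {\"errno\": %d}}}\n", err);
        }
    }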

-- Jamie



Re: [Qemu-devel] [RFC] default mac address issue

2010-05-11 Thread Jamie Lokier
Anthony Liguori wrote:
> Hi Bruce,
> 
> On 05/10/2010 02:07 PM, Bruce Rogers wrote:
> >I know this behavior has worked this way all along, but I wanted to bring 
> >up the following concern and float a few ideas about possible solutions. 
> >Please provide your perspective, opinion, etc.
> >
> >qemu (or qemu-kvm) users can easily get into trouble when they don't 
> >specifying the mac address for their vm's nic and don't realize that 
> >multiple vm's running this way on the same network segment are colliding, 
> >since they all get a default mac address that is the same. They may be 
> >under the assumption that a random mac would be the default, as in many 
> >higher level tools for vm creation
> >   
> 
> This is certainly an important issue but it's one that's difficult to 
> resolve.
> 
> >Does it make sense to do any of the following:
> >
> >1) have qemu print a warning to stdout/stderr that the default mac address 
> >is being used and that it will interfere with other vms running the same 
> >way on a common network segment
> >   
> 
> This is definitely reasonable.
> 
> >2) what about changing the default behavior to randomizing the mac, and 
> >provide the legacy behavior with "-net nic,macaddr=default" or just 
> >"-use-default-mac"
> >
> >(or, as a flip side to #2):
> >
> >3) to at least make it easy for people to get around the problem, and just
> >use qem directly (without additional tools to launch qemu), add an option 
> >such as "-net nic,macaddr=randomize" or "-use-random-mac" which randomizes 
> >the mac for you
> >each time the machine is brought up, and hence avoids possible collisions.
> >   
> 
> A random mac address is almost always wrong.  If you run a guest twice 
> with this option, it's usually enough to trigger a new network detection 
> which which rename the network device to ethN + 1.  The result would be 
> broken networking for naive users since distros don't bother configuring 
> interfaces that weren't present during installation.

Yes, I've seen this when moving disk images between (real)
motherboards.  In the good old days it Just Worked.

Now, current distros using udev remember the MAC from the old board,
so the new board gets an interface called "eth1" instead of "eth0".

That's fine, but rather stupidly they've configured a useful default
for "eth0" which is DHCP, but the default for "eth1" etc. is to leave
it down.  Result: Disk moved to a replacement motherboard, and the
machine no longer responds to network connections.  Quite annoying if
it's a headless box, or one which boots up as a kiosk or something
with no console access.

Anyway, Anthony's right: Changing the MAC address of a guest each time
it is run (with the same disk image) is likely to be annoying.

It might be a good idea to store the chosen MAC in the qcow2 metadata,
if qcow2 is used?

For my Perl-managed qemu/kvm VMs, I find I need a small config file,
and a small state file which records run time state that survives
reboots (like the MAC address, and things like which CD and floppy
images were loaded).

(Perhaps in the search for a holy grail of a qemu config file format,
it might also be worth a mention that it's handy to store non-config
state somewhere too.)

-- Jamie



Re: [Qemu-devel] [PATCH 0/2] Enable qemu block layer to not flush

2010-05-11 Thread Jamie Lokier
Anthony Liguori wrote:
> There's got to be a better place to fix this.  Disable barriers in your 
> guests?

If only it were that easy.

OS installs are the thing that this feature would most help.  They
take ages, do a huge amount of writing with lots of seeking, and if
the host fails you're going to discard the image.

I'm not sure how I would change that setting for most OS install GUIs,
especially Windows, or if it's even possible.

It's usually much easier to change barrier settings after installing
and you've got a command line or registry editing tool.  But by then,
it's not useful any more.

Any other ideas?

-- Jamie



Re: [Qemu-devel] Re: [PATCH 2/2] Add flush=off parameter to -drive

2010-05-11 Thread Jamie Lokier
Paul Brook wrote:
> cache=none:
>   No host caching. Reads and writes both go directly to underlying storage. 
> Useful to avoid double-caching.
> 
> cache=writethrough
>   Reads are cached. Writes go directly to underlying storage.  Useful for 
> broken guests that aren't aware of drive caches.

These are misleading descriptions - because cache=none does not push
writes down to powerfail-safe storage, while cache=writethrough might.

> cache=always (or a more scary name like cache=lie to defend against idiots)
>   Reads and writes are cached. Guest flushes are ignored.  Useful for dumb 
> guests in non-critical environments.

cache=unsafe would tell it like it is.

Even non-idiots could be excused for getting the wrong impression from
cache=always.

-- Jamie



Re: [Qemu-devel] Re: [PATCH 2/2] Add flush=off parameter to -drive

2010-05-11 Thread Jamie Lokier
Anthony Liguori wrote:
> qemu-img create -f raw foo.img 10G
> mkfs.ext3 foo.img
> mount -oloop,rw,barrier=1 -t ext3 foo.img mnt
> 
> Works perfectly fine.

Hmm, interesting.  Didn't know loop propagated barriers.

So you're suggesting to use qemu with a loop device, and ext2 (bit
faster than ext3) and barrier=0 (well, that's implied if you use
ext2), and a raw image file on the ext2/3 filesystem, to provide the
effect of flush=off, because the loop device caches block writes on
the host, except for explicit barrier requests from the fs, which are
turned off?

That wasn't obvious the first time :-)

Does the loop device cache fs writes instead of propagating them
immediately to the underlying fs?  I guess it probably does.

Does the loop device allow the backing file to grow sparsely, to get
behaviour like qcow2?

That's ugly but it might just work.

> >2. barrier=0 does _not_ provide the cache=off behaviour.  It only
> >disables barriers; it does not prevent writing to the disk hardware.
> 
> The proposal has nothing to do with cache=off.

Sorry, I meant flush=off (the proposal).  Mounting the host filesystem
(i.e. not using a loop device anywhere) with barrier=0 doesn't have
even close to the same effect.

> >>The problem with options added for developers is that those options are
> >>very often accidentally used for production.
> >> 
> >We already have risky cache= options.  Also, do we call fdatasync
> >(with barrier) on _every_ write for guests which disable the
> >emulated disk cache?
> 
> None of our cache= options should result in data corruption on power 
> loss.  If they do, it's a bug.

(I might have the details below a bit off.)

If cache=none uses O_DIRECT without calling fdatasync for guest
barriers, then it will get data corruption on power loss.

If cache=none does call fdatasync for guest barriers, then it might
still get corruption on power loss; I am not sure if recent Linux host
behaviour of O_DIRECT+fdatasync (with no buffered writes to commit)
issues the necessary barriers.  I am quite sure that older kernels did not.

cache=writethrough will get data corruption on power loss with older
Linux host kernels.  O_DSYNC did not issue barriers.  I'm not sure if
the behaviour of O_DSYNC that was recently changed is now issuing
barriers after every write.

Provided all the cache= options call fdatasync/fsync when the guest
issues a cache flush, and call fdatasync/fsync following _every_ write
when the guest has disabled the emulated write cache, that should be
as good as Qemu can reasonably do.  It's up to the host from there.
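
Roughly this, in sketch form (hypothetical backend, not qemu's actual
block layer):

    #include <stdbool.h>
    #include <sys/types.h>
    #include <unistd.h>

    struct backend {
        int fd;
        bool guest_write_cache;   /* what the guest thinks the disk has */
    };

    static int backend_write(struct backend *b, const void *buf,
                             size_t len, off_t off)
    {
        if (pwrite(b->fd, buf, len, off) != (ssize_t)len) {
            return -1;
        }
        if (!b->guest_write_cache) {
            /* Writethrough semantics: commit before completing the request. */
            return fdatasync(b->fd);
        }
        return 0;
    }

    static int backend_flush(struct backend *b)
    {
        /* Guest cache-flush command always maps to fdatasync(). */
        return fdatasync(b->fd);
    }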

-- Jamie



Re: [Qemu-devel] Re: [PATCH 2/2] Add flush=off parameter to -drive

2010-05-11 Thread Jamie Lokier
Anthony Liguori wrote:
> On 05/11/2010 08:12 AM, Paul Brook wrote:
> >>>cache=always (or a more scary name like cache=lie to defend against
> >>>idiots)
> >>>
> >>>Reads and writes are cached. Guest flushes are ignored.  Useful for
> >>>dumb guests in non-critical environments.
> >>>   
> >>I really don't believe that we should support a cache=lie.  There are
> >>many other obtain the same results.  For instance, mount your guest
> >>filesystem with barrier=0.
> >> 
> >Ideally yes. However in practice I suspect this is still a useful option. 
> >Is
> >it even possible to disable barriers in all cases (e.g. NTFS under 
> >windows)?
> >
> >In a production environment it's probably not so useful - you're generally
> >dealing with long lived, custom configured guests.
> >
> >In a development environment the rules can be a bit different. For example 
> >if
> >you're testing an OS installer then you really don't want to be passing 
> >magic
> >mount options. If the host machine dies then you don't care about the 
> >state of
> >the guest because you're going to start from scratch anyway.
> >   
> 
> Then create a mount point on your host and mount the host file system 
> under that mount with barrier=0.

Two reasons that advice doesn't work:

1. It doesn't work in many environments.  You can't mount a filesystem
with barrier=0 at one mount point and barrier=1 at another, and
there's often only one host partition.

2. barrier=0 does _not_ provide the cache=off behaviour.  It only
disables barriers; it does not prevent writing to the disk hardware.

If you are doing a transient OS install, ideally you want an amount
equal to your free RAM not written to disk until the end.  barrier=0
does not achieve that.

> The problem with options added for developers is that those options are 
> very often accidentally used for production.

We already have risky cache= options.  Also, do we call fdatasync
(with barrier) on _every_ write for guests which disable the
emulated disk cache?

-- Jamie



Re: [Qemu-devel] QLicense chaos

2010-05-07 Thread Jamie Lokier
Jan Kiszka wrote:
> Moreover, some of the QObject files are LGPL, some GPL. I bet this was
> also not intended. But what was the idea behind the LGPL? Some libqmp which
> can be used by closed source apps?

I believe LGPL is needed for open source apps that have GPLv2-incompatible
licensing.  E.g. GPLv3, Apache license, OpenSSL?  (I'm not sure exactly.)

And for those who want to keep their own apps BSD-like.

-- Jamie




Re: [Qemu-devel] question on virtio

2010-05-06 Thread Jamie Lokier
Michael S. Tsirkin wrote:
> Hi!
> I see this in virtio_ring.c:
> 
> /* Put entry in available array (but don't update avail->idx *
>  until they do sync). */
> 
> Why is it done this way?
> It seems that updating the index straight away would be simpler, while
> this might allow the host to speculatively look up the buffer and handle
> it, without waiting for the kick.

Even better, if the host updates a location containing which index it
has seen recently, you can avoid the kick entirely during sustained
flows - just like your recent patch to avoid sending irqs to the
guest.
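
Something like this, as a rough sketch - the field names are
illustrative, not the virtio ring layout, and a real implementation
needs memory barriers and more careful race handling:

    #include <stdbool.h>
    #include <stdint.h>

    struct ring {
        uint16_t avail_idx;           /* bumped by the guest after adding buffers */
        volatile uint16_t host_seen;  /* written back by the host: the last
                                         avail index it has picked up */
    };

    /* Called by the guest after bumping avail_idx.  If the host was still
     * behind the old index it is actively working through earlier buffers
     * and will notice the new ones by itself; only if it had caught up
     * (and may have gone idle) is a kick/notification actually needed. */
    static bool need_kick(const struct ring *r, uint16_t old_avail_idx)
    {
        return r->host_seen == old_avail_idx;
    }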

-- Jamie




Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH

2010-05-06 Thread Jamie Lokier
Rusty Russell wrote:
> On Wed, 5 May 2010 05:47:05 am Jamie Lokier wrote:
> > Jens Axboe wrote:
> > > On Tue, May 04 2010, Rusty Russell wrote:
> > > > ISTR someone mentioning a desire for such an API years ago, so CC'ing 
> > > > the
> > > > usual I/O suspects...
> > > 
> > > It would be nice to have a more fuller API for this, but the reality is
> > > that only the flush approach is really workable. Even just strict
> > > ordering of requests could only be supported on SCSI, and even there the
> > > kernel still lacks proper guarantees on error handling to prevent
> > > reordering there.
> > 
> > There's a few I/O scheduling differences that might be useful:
> > 
> > 1. The I/O scheduler could freely move WRITEs before a FLUSH but not
> >before a BARRIER.  That might be useful for time-critical WRITEs,
> >and those issued by high I/O priority.
> 
> This is only because noone actually wants flushes or barriers, though
> I/O people seem to only offer that.  We really want "<this write> must
> occur before <that write>".  That offers maximum choice to the I/O subsystem
> and potentially to smart (virtual?) disks.

We do want flushes for the "D" in ACID - such things as after
receiving a mail, or blog update into a database file (could be TDB),
and confirming that to the sender, to have high confidence that the
update won't disappear on system crash or power failure.

Less obviously, it's also needed for the "C" in ACID when more than
one file is involved.  "C" is about differently updated things staying
consistent with each other.

For example, imagine you have a TDB file mapping Samba usernames to
passwords, and another mapping Samba usernames to local usernames.  (I
don't know if you do this; it's just an illustration).

To rename a Samba user involves updating both.  Let's ignore transient
transactional issues :-) and just think about what happens with
per-file barriers and no sync, when a crash happens long after the
updates, and before the system has written out all data and issued low
level cache flushes.

After restarting, due to lack of sync, the Samba username could be
present in one file and not the other.

> > 2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is
> >only for data belonging to a particular file (e.g. fdatasync with
> >no file size change, even on btrfs if O_DIRECT was used for the
> >writes being committed).  That would entail tagging FLUSHes and
> >WRITEs with a fs-specific identifier (such as inode number), opaque
> >to the scheduler which only checks equality.
> 
> This is closer.  In userspace I'd be happy with a "all prior writes to this
> struct file before all future writes".  Even if the original guarantees were
> stronger (ie. inode basis).  We currently implement transactions using 4 fsync
> /msync pairs.
> 
>   write_recovery_data(fd);
>   fsync(fd);
>   msync(mmap);
>   write_recovery_header(fd);
>   fsync(fd);
>   msync(mmap);
>   overwrite_with_new_data(fd);
>   fsync(fd);
>   msync(mmap);
>   remove_recovery_header(fd);
>   fsync(fd);
>   msync(mmap);
> 
> Yet we really only need ordering, not guarantees about it actually hitting
> disk before returning.
> 
> > In other words, FLUSH can be more relaxed than BARRIER inside the
> > kernel.  It's ironic that we think of fsync as stronger than
> > fbarrier outside the kernel :-)
> 
> It's an implementation detail; barrier has less flexibility because it has
> less information about what is required. I'm saying I want to give you as
> much information as I can, even if you don't use it yet.

I agree, and I've started a few threads about it over the last couple of years.

An fsync_range() system call would be very easy to use and
(most importantly) easy to understand.

With optional flags to weaken it (into fdatasync, barrier without sync,
sync without barrier, one-sided barrier, no lowlevel cache-flush, don't rush,
etc.), it would be very versatile, and still easy to understand.

With an AIO version, and another flag meaning don't rush, just return
when satisfied, and I suspect it would be useful for the most
demanding I/O apps.
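
To show what I mean, a hypothetical prototype - no such syscall exists,
the flags are invented here, and the stub just falls back to
fsync/fdatasync:

    #include <sys/types.h>
    #include <unistd.h>

    #define FSR_DATA_ONLY       0x01  /* fdatasync-like: skip inessential metadata */
    #define FSR_BARRIER_ONLY    0x02  /* order prior writes before later ones, no commit */
    #define FSR_NO_CACHE_FLUSH  0x04  /* skip the device-level cache flush */
    #define FSR_LAZY            0x08  /* "don't rush": complete when convenient */

    static int fsync_range(int fd, off_t offset, off_t length, unsigned flags)
    {
        (void)offset; (void)length;   /* a real kernel would use the range */
        /* Portable fallback: ignore the weakening flags and do the strong
         * thing, which is always safe, just slower than the real thing. */
        return (flags & FSR_DATA_ONLY) ? fdatasync(fd) : fsync(fd);
    }

    /* e.g. commit one record's data without forcing an eager disk flush:
     *   fsync_range(db_fd, rec_off, rec_len, FSR_DATA_ONLY | FSR_LAZY);   */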

-- Jamie




Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH

2010-05-06 Thread Jamie Lokier
Rusty Russell wrote:
> > Seems over-zealous.
> > If the recovery_header held a strong checksum of the recovery_data you would
> > not need the first fsync, and as long as you have two places to write 
> > recovery
> > data, you don't need the 3rd and 4th syncs.
> > Just:
> >   
> > write_internally_checksummed_recovery_data_and_header_to_unused_log_space()
> >   fsync / msync
> >   overwrite_with_new_data()
> > 
> > To recovery you choose the most recent log_space and replay the content.
> > That may be a redundant operation, but that is no loss.
> 
> I think you missed a checksum for the new data?  Otherwise we can't tell if
> the new data is completely written.

The data checksum can go in the recovery-data block.  If there's
enough slack in the log, by the time that recovery-data block is
overwritten, you can be sure that an fsync has been done for that
data (by a later commit).
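
Sketch of the kind of record layout that makes the single-fsync commit
work (field names illustrative):

    #include <stdint.h>

    /* Because the header carries checksums of both itself and the payload,
     * replay can tell a torn or partial record from a complete one, so no
     * fsync is needed between writing the data and the header - a single
     * fsync after the whole record is enough. */
    struct log_record {
        uint64_t sequence;     /* newest complete record wins at replay */
        uint32_t data_len;
        uint32_t data_crc;     /* checksum of the payload that follows */
        uint32_t header_crc;   /* checksum of the fields above */
        /* payload bytes follow */
    };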

> But yes, I will steal this scheme for TDB2, thanks!

Take a look at the filesystems.  I think ext4 did some optimisations
in this area, and that checksums had to be added anyway due to a
subtle replay-corruption problem that happens when the log is
partially corrupted, and followed by non-corrupt blocks.

Also, you can remove even more fsyncs by adding a bit of slack to the
data space and writing into unused/fresh areas some of the time -
i.e. a bit like btrfs/zfs or anything log-structured, but you don't
have to go all the way with that.

> In practice, it's the first sync which is glacial, the rest are pretty cheap.

The 3rd and 4th fsyncs imply a disk seek each, just because the
preceding writes are to different areas of the disk.  Seeks are quite
slow - but not as slow as ext3 fsyncs :-) What do you mean by cheap?
That it's only a couple of seeks, or that you don't see even that?

> 
> > Also cannot see the point of msync if you have already performed an fsync,
> > and if there is a point, I would expect you to call msync before
> > fsync... Maybe there is some subtlety there that I am not aware of.
> 
> I assume it's this from the msync man page:
> 
>msync()  flushes  changes  made  to the in-core copy of a file that was
>mapped into memory using mmap(2) back to disk.   Without  use  of  this
>call  there  is  no guarantee that changes are written back before mun‐
>map(2) is called. 

Historically, that means msync() ensures dirty mapping data is written
to the file as if with write(), and that mapping pages are removed or
refreshed to get the effect of read() (possibly a lazy one).  It's
more obvious in the early mmap implementations where mappings don't
share pages with the filesystem cache, so msync() has explicit
behaviour.

Like with write(), after calling msync() you would then call fsync()
to ensure the data is flushed to disk.
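
I.e. roughly this ordering (minimal sketch, error handling omitted,
assuming the file already exists and is at least a page long):

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("state.db", O_RDWR);
        char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);

        memcpy(map, "new data", 8);    /* modify through the mapping */

        msync(map, 4096, MS_SYNC);     /* mapping -> file */
        fsync(fd);                     /* file -> disk */

        munmap(map, 4096);
        close(fd);
        return 0;
    }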

If you've been calling fsync then msync, I guess that's another fine
example of how these functions are so hard to test, that they aren't.

Historically on Linux, msync has been iffy on some architectures, and
I'm still not sure it has the same semantics as other unixes.  fsync
as we know has also been iffy, and even now that fsync is tidier it
does not always issue a hardware-level cache commit.

But then historically writable mmap has been iffy on a boatload of
unixes.

> > > It's an implementation detail; barrier has less flexibility because it has
> > > less information about what is required. I'm saying I want to give you as
> > > much information as I can, even if you don't use it yet.
> > 
> > Only we know that approach doesn't work.
> > People will learn that they don't need to give the extra information to 
> > still
> > achieve the same result - just like they did with ext3 and fsync.
> > Then when we improve the implementation to only provide the guarantees that
> > you asked for, people will complain that they are getting empty files that
> > they didn't expect.
> 
> I think that's an oversimplification: IIUC that occurred to people *not*
> using fsync().  They weren't using it because it was too slow.  Providing
> a primitive which is as fast or faster and more specific doesn't have the
> same magnitude of social issues.

I agree with Rusty.  Let's make it perform well so there is no reason
to deliberately avoid using it, and let's make say what apps actually
want to request without being way too strong.

And please, if anyone has ideas on how we could make correct use of
these functions *testable* by app authors, I'm all ears.  Right now it
is quite difficult - pulling power on hard disks mid-transaction is
not a convenient method :)

> > The abstraction I would like to see is a simple 'barrier' that contains no
> > data and has a filesystem-wide effect.
> 
> I think you lack ambition ;)
> 
> Thinking about the single-file use case (eg. kvm guest or tdb), isn't that
> suboptimal for md?  Since you have to hand your barrier to every device
> whereas a file-wide primitive may theoretically only go to some.

Yes.

Note that database-like programs s

Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH

2010-05-05 Thread Jamie Lokier
Rusty Russell wrote:
> On Fri, 19 Feb 2010 08:52:20 am Michael S. Tsirkin wrote:
> > I took a stub at documenting CMD and FLUSH request types in virtio
> > block.  Christoph, could you look over this please?
> > 
> > I note that the interface seems full of warts to me,
> > this might be a first step to cleaning them.
> 
> ISTR Christoph had withdrawn some patches in this area, and was waiting
> for him to resubmit?
> 
> I've given up on figuring out the block device.  What seem to me to be sane
> semantics along the lines of memory barriers are foreign to disk people: they
> want (and depend on) flushing everywhere.
> 
> For example, tdb transactions do not require a flush, they only require what
> I would call a barrier: that prior data be written out before any future data.
> Surely that would be more efficient in general than a flush!  In fact, TDB
> wants only writes to *that file* (and metadata) written out first; it has no
> ordering issues with other I/O on the same device.

I've just posted elsewhere on this thread, that an I/O level flush can
be more efficient than an I/O level barrier (implemented using a
cache-flush really), because the barrier has stricter ordering
requirements at the I/O scheduling level.

By the time you work up to tdb, another way to think of it is
distinguishing "eager fsync" from "fsync but I'm not in a hurry -
delay as long as is convenient".  The latter makes much more sense
with AIO.

> A generic I/O interface would allow you to specify "this request
> depends on these outstanding requests" and leave it at that.  It
> might have some sync flush command for dumb applications and OSes.

For filesystems, it would probably be easy to label in-place
overwrites and fdatasync data flushes (when there's no file extension)
with an opaque per-file identifier for certain operations.  Typically
over-writing in place and fdatasync would match up and wouldn't need
ordering against anything else.  Other operations would tend to get
labelled as ordered against everything including these.

-- Jamie




Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH

2010-05-05 Thread Jamie Lokier
Jens Axboe wrote:
> On Tue, May 04 2010, Rusty Russell wrote:
> > ISTR someone mentioning a desire for such an API years ago, so CC'ing the
> > usual I/O suspects...
> 
> It would be nice to have a more fuller API for this, but the reality is
> that only the flush approach is really workable. Even just strict
> ordering of requests could only be supported on SCSI, and even there the
> kernel still lacks proper guarantees on error handling to prevent
> reordering there.

There's a few I/O scheduling differences that might be useful:

1. The I/O scheduler could freely move WRITEs before a FLUSH but not
   before a BARRIER.  That might be useful for time-critical WRITEs,
   and those issued by high I/O priority.

2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is
   only for data belonging to a particular file (e.g. fdatasync with
   no file size change, even on btrfs if O_DIRECT was used for the
   writes being committed).  That would entail tagging FLUSHes and
   WRITEs with a fs-specific identifier (such as inode number), opaque
   to the scheduler which only checks equality.

3. By delaying FLUSHes through reordering as above, the I/O scheduler
   could merge multiple FLUSHes into a single command.

4. On MD/RAID, BARRIER requires every backing device to quiesce before
   sending the low-level cache-flush, and all of those to finish
   before resuming each backing device.  FLUSH doesn't require as much
   synchronising.  (With per-file FLUSH; see 2; it could even avoid
   FLUSH altogether to some backing devices for small files).

In other words, FLUSH can be more relaxed than BARRIER inside the
kernel.  It's ironic that we think of fsync as stronger than
fbarrier outside the kernel :-)

-- Jamie




Re: [libvirt] [Qemu-devel] Re: Libvirt debug API

2010-04-26 Thread Jamie Lokier
Daniel P. Berrange wrote:
> > Much better to exact a commitment from libvirt to track all QMP (and 
> > command line) capabilities.  Instead of adding cleverness to QMP, add 
> > APIs to libvirt.
> 
> Agreed. Despite adding this monitor / XML passthrough capability, we still
> do not want apps to be using this at all. If there is some capability
> missing that apps need then the default mode of operation is to add the
> neccessary bits of libvirt. The monitor/XML pasthrough is just a short
> term quick workaround until the official support is done. As such I do
> not really think we need to put huge amounts of effort in the wierd 
> complex racey edge cases. The effort is better spent on getting the 
> features in libvirt.

All the features?  The qemu API is quite large already (look at all
the command line options and monitor commands).  I'll be very
surprised if libvirt provides all of it that obscure apps may use.

I'm thinking of features which are relatively obscure but nonetheless
useful to a small number of deployments.  Probably not enough to
justify the effort building data models, specifying the XML and remote
protocol and so on in libvirt.

(Unless that becomes so easily mapped to qemu's API that it's almost an
automatic thing... Which sounds like QMP, doesn't it?)

Is libvirt ever likely to go to the effort of providing all the
easily-usable API, or hooks, for:

- sending keys to a guest, driven by a timed host script?

- rebooting the guest while switching between USB touchpad and
  mouse devices, because one of them is needed during an OS
  install and the other is needed after?

- changing the amount of RAM available to the guest at the next
  reboot, for OS install needing more memory than run time, in a
  scripted fashion when building new VMs from install disk images?

- switching the guest between qemu mode and kvm mode on the next
  guest reset, because qemu is faster for some things (VGA
  updates) and kvm is faster for other things, so the best choice
  depends on which app you need to run on that guest

- pausing a VM, making a copy, and resuming it, so as to fork it
  into two VMs (literally fork)?

- setting up the host network container and NAT IP forwarding, on
  demand as guests are stopped and started, so that it works in
  the above scenario despite clashing IP addresses?

- running a copy of the same guest, or perhaps an entire OS
  install process (scripted), many times for different qemu and
  qemu-kvm versions, different BIOSes, and different
  almost-equivalent hardware emulations (i.e. different NIC types,
  SMP count, CPU features, disk controller type, AIO/cache type) -
  for testing guests and apps on them - with some paralellism?

None of those, except perhaps the first, are what I think of as typical
virtualisation workloads, and they all seem obscure things probably
outside libvirt's remit.  Probably not many users either :-)

Yet you can do them all today with qemu and scripting the monitor, and
it's getting easier with QMP.

Which is fine, qemu works, but it would be great to be able to see
those guests and interact in the basic ways through the libvirt-based
GUIs?

QMP pass-through or QMP multiple monitors seems to provide most of
that, although I can see libvirt getting a bit confused about which
devices and how much RAM the guest has installed at different times.

The bit about forking guests, I'm not sure how complicated it is to
tie in to libvirt's notion of which disk images are being used, and
hooking into it's network configuration to handle the clashing
addresses.

If those things are considered to be entirely outside libvirt's remit,
that's fine with me.  Fair enough: I will continue to live with ssh
and vinagre.

I'm just raising my hand as a potential user who might like to monitor
a bunch of active and inactive guests, remotely, see how much memory
they report using, etc. launch VNC viewer from the GUI, even choose
the target host based on load and migrate on demand, while also
needing a fair bit of non-standardness and qemu-level scripting too.

Imho, that probably comes under the heading of apps using pass-through
or multiple QMP monitors, which use features that probably won't and
probably shouldn't ever be handled by libvirt itself.

-- Jamie




Re: [Qemu-devel] Re: Bug#573439: qemu-kvm: fail to set hdd serial number

2010-04-26 Thread Jamie Lokier
Michael Tokarev wrote:
> 24.04.2010 17:05, Andreas Färber wrote:
> >Am 22.04.2010 um 11:40 schrieb Michael Tokarev:
> >
> >>11.03.2010 18:34, Michael Tokarev wrote:
> >>[]
> On version 0.12.3, -drive serial=XXX option does not work.
> Below patch fixes it. 'serial' is pointer, not array.
> 
> 
> --- qemu-kvm-0.12.3+dfsg/vl.c 2010-02-26 11:34:00.0 +0900
> +++ qemu-kvm-0.12.3+dfsg.old/vl.c 2010-03-11 02:26:00.134217787 +0900
> >[...]
> >
> >>Folks, can we please add this trivial one-liner to -stable or something?
> >>It has been one and a half months since it has been fixed in debian...
> >
> >Try submitting it as a proper Git patch with summary so that it can be
> >applied to master first; if it's already in master, post the commit id
> >so it can be cherry-picked. Also, mark the subject as [STABLE] or [0.12]
> >or something for Anthony to find it.
> 
> Well, It's not that difficult to carry it in the debian package.
> Hopefully other distros will follow (the ones who are not already),
> so that support requests in #...@freenode wont mention that again.

It would be nice for such trivial fixes to be committed to the stable
branch for those of us compiling stable versions from source.

Especially with so many guest regressions lately, so that keeping
multiple qemu versions around is an unfortunate necessity for the time
being.

-- Jamie




Re: [Qemu-devel] Atomicity of i386 guest atomic instructions

2010-04-23 Thread Jamie Lokier
Alexander Graf wrote:
> They should be atomic. TCG SMP swaps between different vCPUs only
> after translation blocks are done. In fact, the only way I'm aware
> of to stop the execution of a TB mid-way is a page fault.

A page fault would interrupt it if the atomic is implemented as
a read followed by a write, and the write faults.

> You can as always check things with the -d parameter.

-- Jamie




Re: [Qemu-devel] [PATCH 2/2] VirtIO RNG

2010-04-23 Thread Jamie Lokier
Ian Molton wrote:
> Jamie Lokier wrote:
> > First of all: Why are your egd daemons with open connections dying
> > anyway?  Buggy egd?
> 
> No, they aren't buggy. but occasionally, the sysadmin of said server
> farms may want to, you know, update the daemon?

Many daemons don't kill active connections on upgrade.  For example
sshd, telnetd, ftpd, rsyncd...  Only new connections get the new daemon.

But let's approach this from a different angle:

What do _other_ long-lived EGD clients do?  Is it:

   1. When egd is upgraded, the clients break.
   2. Active connections aren't killed on egd upgrade.
   3. They keep trying to reconnect, as you've implemented in qemu.
   4. Whenever they want entropy, they're expected to open a
      connection, request what they want, read it, and close.  Each time.

Whatever other long-lived clients do, that's probably best for qemu
too.

4 is interesting because it's an alternative approach to rate-limiting
the byte stream: Instead, fetch a batch of bytes in a single
open/read/close transaction when needed.  Rate-limit _that_, and you
don't need separate reconnection code.
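
A sketch of option 4, assuming the usual EGD command byte 0x02 for a
blocking read of N bytes - one short transaction per batch, nothing
long-lived to reconnect:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    static int egd_fetch(const char *path, unsigned char *buf, unsigned char n)
    {
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        unsigned char req[2] = { 0x02, n };   /* blocking read of n bytes */
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        ssize_t got = 0;

        strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
        if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            if (fd >= 0) close(fd);
            return -1;
        }
        write(fd, req, sizeof(req));
        while (got < n) {
            ssize_t r = read(fd, buf + got, n - got);
            if (r <= 0) break;
            got += r;
        }
        close(fd);                            /* transaction done */
        return (int)got;
    }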

So I trying checking if egd kills connections when upgraded, and found...

No 'egd' package for my Debian and Ubuntu systems, nor anything which
looks obvious.  There are several other approaches to gathering
entropy from hardware sources, for example rng-tools, haveged, ekeyd, and
ekeyd-egd-linux (aha... it's a client).

All of those have in common: they fetch entropy from something, and
write it to the kernel's /dev/random pool.  Applications are expected
to read from that pool.

In particular, if you do have a hardware or network EGD entropy
source, you can run ekeyd-egd-linux which is an EGD client, which
transfers from EGD -> the kernel, so that applications can read from
/dev/random.

That means, on Debian & Ubuntu Linux at least, there is no need for
applications to talk EGD protocol themselves, even to get network or
hardware entropy - it's better left to egd-linux, rng-tools etc. to
manage.

But the situation is no doubt different on non-Linux hosts.

By the way, ekeyd-egd-linux is a bit thoughtful: For example it has a
"shannons-per-byte" option, and it doesn't drain the EGD server at all
when the local pool is sufficiently full.

Does your EGD client + virtio-rng support do that - avoid draining the
source when the guest's pool is full enough?

> > If guests need a _reliable_ source of data for security, silently not
> > complaining when it's gone away and hoping it comes back isn't good
> > enough.
> 
> Why? its not like the guest:
> 
> a) Has a choice in the matter
> b) Would carry on without the entropy (it knows it has no entropy)

Because one might prefer a big red light, a halted machine removed
from the cluster which can resume its work when ready, and an email to
warn you that the machine isn't able to operate normally _without_
having to configure each guest's email, rather than a working machine
with increasing numbers of stuck crypto processes waiting on
/dev/random which runs out of memory and after getting into swap hell,
you have to reboot it, losing the other work that it was in the
middle of doing.

Well, you personally might not prefer that.  But that's why we
separate policy from mechanism...

> > But then it would need to sync with the guest on reconnection, so that
> > the guest can restart whatever protocol it's using over the byte
> > stream.
> 
> Er, why? we're not talking about port forwarding here, we're talking
> about emulation of device hardware.

virtio-serial isn't emulating a normal serial port.  It supports apps
like "send machine status blobs regularly", without having to be
robust against half a blob being delivered.

You can design packets so that doesn't matter, but virtio-serial
supports not needing to do that, making the apps simpler.

> > I don't think it'll happen.  I think egd is a rather unusual
> > If another backend ever needs it, it's easy to move code around.
> 
> *bangs head on wall*
> 
> That was the exact same argument I made about the rate limiting code.
> Why is that apparently only valid if its not me that says it?

Because you're talking to multiple people who hold different opinions,
and opinions change as more is learned and thought about.  It's
iterative, and I, for one, am not in a position to make merging
decisions, only give my view on it.  Can't speak for the others.

> > I'm not convinced there's a need for it even for egd.
> 
> So what? I'm not convinced theres a need for about 90% of whats out
> there,

Ah, that's not quite what I meant.  I meant I wasn't convinced it is
needed for egd, not I don&

Re: [Qemu-devel] [PATCH 2/2] VirtIO RNG

2010-04-23 Thread Jamie Lokier
Ian Molton wrote:
> >You can configure any chardev to be a tcp client. I never do that though
> >as I find it much more convenient to configure it as server.
> 
> Perhaps thats because chardev clients are nearly useless right now 
> because they just die if the connection drops...

Which is why drops/missing server should be a QMP event + action, same
as other triggers like disk full and watchdog trigger.

I do not want my guests to continue running if they are configured to
depend on Qemu entropy and it's not available.

> Or are you suggesting that we create another type of chardev, thats 
> nearly like a socket, but speaks egd and can reconnect? That seems 
> hideous to me.

Why hideous?

An egd chardev is a good thing because you can then trivially
use it as a random byte source for virtio-serial, isa-serial,
pci-serial, custom-soc-serial, debug-port even :-), and anything
else which might want random bytes as Gerd said.

That's way more useful than restricting to virtio-rng, because most
guests don't support virtio at all, but they can probably all take
entropy from a serial-like device.

Similarly the ability to connect to /dev/urandom directly, with the
rate-limiting but no auto-reconnection, looking like a chardev in
the same way, would make sense.  Reconnection is not needed in this
case - missing device should be an error at startup.

Your idea for an 'egd line discipline' would need to look exactly like
a chardev internally, or all the devices which might find it useful
would have to be changed to know about line disciplines, or it just
wouldn't be available as a random byte source to everything that uses
a chardev - unnecessary limiting.

There's nothing wrong with the egd chardev actually _being
implemented_ like a line discipline on top of another chardev, with a
chardev interface so everything can use it.

In which case it's quite natural to expose the options as a
user-visible chardev 'egd', defined to return random bytes on input
and ignore output, which takes all the same options as 'socket' and
actually uses a 'socket' chardev (passing along the options).

(Is there any actual point in supporting egd over non-sockets?)

I think rate-limiting is more generically useful as a 'line
discipline'-like feature, to work with any chardev type.  But it
should then have properties governing incoming and outgoing rate
limiting separately, which won't get much testing for the only
imminent user which is input-only.

> That way it wouldn't matter if it were a socket or anything else that 
> the data came in via, which is the case with the patch as I wrote it - 
> you can feed in EGD from a file, a socket, anything, and it just works.

What's the point in feeding egd protocol from a file?
If you want entropy from a file, it should probably be raw, not egd protocol.

-- Jamie




Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-23 Thread Jamie Lokier
Yoshiaki Tamura wrote:
> Jamie Lokier wrote:
> >Yoshiaki Tamura wrote:
> >>Dor Laor wrote:
> >>>On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
> >>>>Event tapping is the core component of Kemari, and it decides on which
> >>>>event the
> >>>>primary should synchronize with the secondary. The basic assumption
> >>>>here is
> >>>>that outgoing I/O operations are idempotent, which is usually true for
> >>>>disk I/O
> >>>>and reliable network protocols such as TCP.
> >>>
> >>>IMO any type of network even should be stalled too. What if the VM runs
> >>>non tcp protocol and the packet that the master node sent reached some
> >>>remote client and before the sync to the slave the master failed?
> >>
> >>In current implementation, it is actually stalling any type of network
> >>that goes through virtio-net.
> >>
> >>However, if the application was using unreliable protocols, it should have
> >>its own recovering mechanism, or it should be completely stateless.
> >
> >Even with unreliable protocols, if slave takeover causes the receiver
> >to have received a packet that the sender _does not think it has ever
> >sent_, expect some protocols to break.
> >
> >If the slave replaying master's behaviour since the last sync means it
> >will definitely get into the same state of having sent the packet,
> >that works out.
> 
> That's something we're expecting now.
> 
> >But you still have to be careful that the other end's responses to
> >that packet are not seen by the slave too early during that replay.
> >Otherwise, for example, the slave may observe a TCP ACK to a packet
> >that it hasn't yet sent, which is an error.
> 
> Even current implementation syncs just before network output, what you 
> pointed out could happen.  In this case, would the connection going to be 
> lost, or would client/server recover from it?  If latter, it would be fine, 
> otherwise I wonder how people doing similar things are handling this 
> situation.

In the case of TCP in a "synchronised state", I think it will recover
according to the rules in RFC793.  In an "unsynchronised state"
(during connection), I'm not sure if it recovers or if it looks like a
"Connection reset" error.  I suspect it does recover but I'm not certain.

But that's TCP.  Other protocols, such as over UDP, may behave
differently, because this is not an anticipated behaviour of a
network.

> >However there is one respect in which they're not idempotent:
> >
> >The TTL field should be decreased if packets are delayed.  Packets
> >should not appear to live in the network for longer than TTL seconds.
> >If they do, some protocols (like TCP) can react to the delayed ones
> >differently, such as sending a RST packet and breaking a connection.
> >
> >It is acceptable to reduce TTL faster than the minimum.  After all, it
> >is reduced by 1 on every forwarding hop, in addition to time delays.
> 
> So the problem is, when the slave takes over, it sends a packet with same 
> TTL which client may have received.

Yes.  I guess this is a general problem with time-based protocols and
virtual machines getting stopped for 1 minute (say), without knowing
that real time has moved on for the other nodes.

Some application transaction, caching and locking protocols will give
wrong results when their time assumptions are discontinuous to such a
large degree.  It's a bit nasty to impose that on them after they
worked so hard on their reliability :-)

However, I think such implementations _could_ be made safe if those
programs can arrange to definitely be interrupted with a signal when
the discontinuity happens.  Of course, only if they're aware they may
be running on a Kemari system...

I have an intuitive idea that there is a solution to that, but each
time I try to write the next paragraph explaining it, some little
complication crops up and it needs more thought.  Something about
concurrent, asynchronous transactions to keep the master running while
recording the minimum states that replay needs to be safe, while
slewing the replaying slave's virtual clock back to real time quickly
during recovery mode.

-- Jamie




Re: [Qemu-devel] [PATCH 2/2] VirtIO RNG

2010-04-22 Thread Jamie Lokier
Ian Molton wrote:
> > It might make sense to have the reconnect logic in the egd chardev
> > backend then, thereby obsoleting the socket reconnect patch.
> 
> Im not sure I agree there... surely there are other things which would
> benefit from generic socket reconnection support (virtio-rng can't be the
> only driver that might want to rely on a reliable source of data via a
> socket in a server-farm type situation?)

First of all: Why are your egd daemons with open connections dying
anyway?  Buggy egd?

Secondly: why isn't egd death an event reported over QMP, with a
monitor command to reconnect manually?

If guests need a _reliable_ source of data for security, silently not
complaining when it's gone away and hoping it comes back isn't good
enough.  It should be an error condition known to management, which
can halt the guest until egd is fixed or restarts if running without
entropy isn't acceptable in its policy.

Thirdly, which other things do you think would use it?

Maybe some virtio-serial apps would like it.

But then it would need to sync with the guest on reconnection, so that
the guest can restart whatever protocol it's using over the byte
stream.

In which case, it's better to tell the guest that the connection died,
and give the guest a way to request a new one when it's ready.

Reconnecting and resuming in the middle of the byte stram would be bad
(even for egd protocol?).  Pure /dev/urandom fetching is quite unusual in not
caring about this, but you shouldn't need to reconnect to that.

> Do we really want to re-implement reconnection (and reconnection retry
> anti-flood limiting) in every single backend?

I don't think it'll happen.  I think egd is a rather unusual

If another backend ever needs it, it's easy to move code around.

I'm not convinced there's a need for it even for egd.  Either egd
shouldn't be killing open connections (and is buggy if it is), or this
is normal egd behavior and so it's part of the egd protocol to
repeatedly reconnect, and therefore can go in the egd client code.

Meanwhile, because the egd might not return, it should be reported as
an error condition over QMP for management to do what it deems
appropriate.  In which case, management could tell it to reconnect
when it thinks is a good time, or do other things like switch the
randomness source to something else, or stop the guest, or warn the
admin that a guest is running without entropy.

-- Jamie




Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-22 Thread Jamie Lokier
Yoshiaki Tamura wrote:
> Dor Laor wrote:
> >On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
> >>Event tapping is the core component of Kemari, and it decides on which
> >>event the
> >>primary should synchronize with the secondary. The basic assumption
> >>here is
> >>that outgoing I/O operations are idempotent, which is usually true for
> >>disk I/O
> >>and reliable network protocols such as TCP.
> >
> >IMO any type of network even should be stalled too. What if the VM runs
> >non tcp protocol and the packet that the master node sent reached some
> >remote client and before the sync to the slave the master failed?
> 
> In current implementation, it is actually stalling any type of network 
> that goes through virtio-net.
> 
> However, if the application was using unreliable protocols, it should have 
> its own recovering mechanism, or it should be completely stateless.

Even with unreliable protocols, if slave takeover causes the receiver
to have received a packet that the sender _does not think it has ever
sent_, expect some protocols to break.

If the slave replaying master's behaviour since the last sync means it
will definitely get into the same state of having sent the packet,
that works out.

But you still have to be careful that the other end's responses to
that packet are not seen by the slave too early during that replay.
Otherwise, for example, the slave may observe a TCP ACK to a packet
that it hasn't yet sent, which is an error.

About IP idempotency:

In general, IP packets are allowed to be lost or duplicated in the
network.  All IP protocols should be prepared for that; it is a basic
property.

However there is one respect in which they're not idempotent:

The TTL field should be decreased if packets are delayed.  Packets
should not appear to live in the network for longer than TTL seconds.
If they do, some protocols (like TCP) can react to the delayed ones
differently, such as sending a RST packet and breaking a connection.

It is acceptable to reduce TTL faster than the minimum.  After all, it
is reduced by 1 on every forwarding hop, in addition to time delays.

> I currently don't have good numbers that I can share right now.
> Snapshots/sec depends on what kind of workload is running, and if the 
> guest was almost idle, there will be no snapshots in 5sec.  On the other 
> hand, if the guest was running I/O intensive workloads (netperf, iozone 
> for example), there will be about 50 snapshots/sec.

That is a really satisfying number, thank you :-)

Without this work I wouldn't have imagined that synchronised machines
could work with such a low transaction rate.

-- Jamie




Re: [Qemu-devel] Re: QEMU-KVM and video performance

2010-04-21 Thread Jamie Lokier
Gerhard Wiesinger wrote:
> Hmmm. I'm very new to QEMU and KVM but at least accessing the virtual HW 
> of QEMU even from KVM must be possible (e.g. memory and port accesses are 
> done on nearly every virtual device) and therefore I end up in C code in
> the QEMU hw/*.c directory. Therefore the VGA memory area should also be
> accessible from KVM, but with the specialized and fast memory
> access of QEMU.  Am I missing something?

What you're missing is that when KVM calls out to QEMU to handle
hw/*.c traps, that call is very slow.  It's because the hardware-VM
support is a bit slow when the trap happens, and then the call
from KVM in the kernel up to QEMU is a bit slow again.  Then all the
way back.  It adds up to a lot, for every I/O operation.

When QEMU does the same thing, it's fast because it's inside the same
process; it's just a function call.

That's why the most often called devices are emulated separately in
KVM's kernel code, things like the interrupt controller, timer chip
etc.  It's also why individual instructions that need help are
emulated in KVM's kernel code, instead of passing control up to QEMU
just for one instruction.
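
To make the round trip concrete, here is a hedged sketch of the
userspace half of it -- not QEMU's actual code.  vcpu_fd and run (the
mmap'ed struct kvm_run) are assumed to be set up already, and
handle_mmio() is a hypothetical stand-in for the hw/*.c emulation:

    /* Every MMIO/PIO access the kernel cannot emulate itself makes
     * KVM_RUN return, runs a handler here, and needs another ioctl()
     * to re-enter the guest -- that is where the cost adds up. */
    #include <linux/kvm.h>
    #include <sys/ioctl.h>
    #include <stdint.h>

    extern void handle_mmio(uint64_t addr, unsigned char *data,
                            unsigned len, int is_write);  /* hypothetical */

    void vcpu_exit_loop(int vcpu_fd, struct kvm_run *run)
    {
        for (;;) {
            ioctl(vcpu_fd, KVM_RUN, 0);          /* enter the guest */

            switch (run->exit_reason) {          /* ...back in userspace */
            case KVM_EXIT_MMIO:
                handle_mmio(run->mmio.phys_addr, run->mmio.data,
                            run->mmio.len, run->mmio.is_write);
                break;
            case KVM_EXIT_IO:
                /* port I/O data sits at (char *)run + run->io.data_offset */
                break;
            default:
                return;                          /* anything else: give up */
            }
        }
    }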

> BTW: Still not clear why performance is low with KVM since there are 
> no window changes in the testcase involved which could cause a (slow) page 
> fault.

It sounds like a bug.  Avi gave suggestions about what to look for.
If it fixes my OS install speeds too, I'll be very happy :-)

In 256-colour mode, KVM should be writing to the VGA memory at high
speed a lot like normal RAM, not trapping at the hardware-VM level,
and not calling up to the code in hw/*.c for every byte.

You might double-check if your guest is using VGA "Mode X".  (See Wikipedia.)

That was a way to accelerate VGA on real PCs, but it will be slow in
KVM for the same reasons as 16-colour mode.
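
If you want to check, here is a sketch of how a test run inside an x86
guest might tell Mode X (unchained 256-colour) from plain mode 13h --
purely illustrative, needs ioperm()/root:

    /* Hedged sketch: read the VGA Sequencer's Memory Mode register;
     * bit 3 is the Chain-4 bit, and Mode X clears it. */
    #include <sys/io.h>

    int vga_is_mode_x(void)
    {
        outb(0x04, 0x3C4);               /* Sequencer index: Memory Mode */
        return !(inb(0x3C5) & 0x08);     /* Chain-4 clear => unchained */
    }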

-- Jamie




Re: [Qemu-devel] Re: QEMU-KVM and video performance

2010-04-21 Thread Jamie Lokier
Avi Kivity wrote:
> On 04/21/2010 09:39 PM, Jamie Lokier wrote:
> >Avi Kivity wrote:
> >   
> >>Writes to vga in 16-color mode don't just set a memory location to a
> >>value; instead they change multiple memory locations.
> >> 
> >While code is just writing to the VGA memory, not reading(*) and not
> >touching the VGA I/O registers that control the write latches, is it
> >possible in principle to swizzle the format around in memory to make
> >regular writes work?
> >   
> 
> Not in software.  We can map pages, not cross address lines.

Hence "swizzle".  You rearrange the data inside the page for the
crossed address lines, and undo the swizzle later on demand.  That
doesn't work for other VGA magic though.
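
For what it's worth, the forward half of such a swizzle is easy enough
to sketch (assumptions: one 4KB page per plane, pixel 0 in the high
nibble; this is illustration only and ignores the latch behaviour
entirely):

    /* Hedged sketch: repack one page of 4-plane 16-colour data into one
     * packed 4-bit pixel per nibble, so ordinary memory writes could in
     * principle operate on it.  Plane p contributes bit p of the colour;
     * pixel x lives at plane byte x/8, bit 7-(x%8). */
    #include <stdint.h>
    #include <string.h>

    void swizzle_planar_page(const uint8_t plane[4][4096],
                             uint8_t packed[16384])
    {
        memset(packed, 0, 16384);
        for (int byte = 0; byte < 4096; byte++) {
            for (int bit = 0; bit < 8; bit++) {
                uint8_t pix = 0;
                for (int p = 0; p < 4; p++)
                    pix |= ((plane[p][byte] >> (7 - bit)) & 1) << p;
                int n = byte * 8 + bit;                  /* pixel index */
                packed[n / 2] |= pix << ((n & 1) ? 0 : 4);
            }
        }
    }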

> Guests that use 16 color vga are usually of little interest.

Fair enough.  We can move on :-)

It's been said that the super-slow VGA writes triggering this thread
are in 256-colour mode, so there's a different problem.  That should
be fast, shouldn't it?

I vaguely recall extremely slow OS installs I've seen in KVM, which
were fast in QEMU (and fast in KVM after installing), were using text
mode.  Possibly it was Windows 2000, or Windows Server 2003.  Text
mode should be fast too, shouldn't it?  I suppose it's possible that
it just looked like text mode and was really 16-colour mode.

-- Jamie




Re: [Qemu-devel] Re: QEMU-KVM and video performance

2010-04-21 Thread Jamie Lokier
Gerhard Wiesinger wrote:
> >>Would it be possible to handle these writes through QEMU directly
> >>(without KVM), since performance is very good there (looking at the code,
> >>there is just some pointer arithmetic and a memory write)?
> >
> >I've noticed extremely slow VGA performance too, when installing OSes.
> >It makes the difference between installing in a few minutes, and
> >installing taking hours - just because of the slow VGA.
> >
> >So generally I use qemu for installing old versions of Windows, then
> >change to KVM to run them after installing.
> >
> >Switching between KVM and qemu automatically based on guest code
> >behaviour, and making both memory models and device models compatible
> >at run time, is a difficult thing.  I guess it's not worth the
> >difficulty just to speed up VGA.
> 
> I think this is very easy to distinguish:
> 1.) VGA Segment A000 is legacy and should be handled through QEMU 
> and not through KVM (because it is much faster). Also 16 color modes 
> should be fast enough there.
> 2.) All other flat PCI memory accesses should be handled through KVM 
> (there is a specialized driver loaded for that PCI device in the non 
> legacy OS).
> 
> Is that easily possible?

No it isn't.  Distinguishing addresses is trivial.  You've ignored the
hard part, which is switching between different virtualisation
architectures...
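
To be fair, the first half really is a one-liner; a hedged sketch of
just that trivial part:

    /* Classifying a guest physical address as the legacy VGA segment
     * A000 is one comparison.  The hard half -- switching execution
     * engines at run time -- is not something a few lines can show. */
    #include <stdint.h>

    static inline int is_vga_segment_a000(uint64_t gpa)
    {
        return gpa >= 0xA0000 && gpa < 0xB0000;    /* 64 KiB window */
    }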

-- Jamie




Re: [Qemu-devel] Re: QEMU-KVM and video performance

2010-04-21 Thread Jamie Lokier
Avi Kivity wrote:
> Writes to vga in 16-color mode don't just set a memory location to a
> value; instead they change multiple memory locations.

While code is just writing to the VGA memory, not reading(*) and not
touching the VGA I/O registers that control the write latches, is it
possible in principle to swizzle the format around in memory to make
regular writes work?

(*) Reading should be ok for some settings of the write latches, I
think.

I wonder if guests of interest behave like that.

> >Is this a case where TCG would run significantly faster for code blocks
> >that have been detected to access the VGA memory?
> 
> Yes.

$ date
Wed Apr 21 19:37:38 2015
$ modprobe ktcg
;-)

-- Jamie




Re: [Qemu-devel] Re: QEMU-KVM and video performance

2010-04-21 Thread Jamie Lokier
Gerhard Wiesinger wrote:
> I'm using VESA mode 0x101 (640x480, 256 colors), but performance there
> is very low (~1MB/s). The test is also WITHOUT any vga window change, so
> there isn't any page switching overhead involved in this test case.
> 
> >>Any ideas for improvement?
> >
> >Currently when the physical memory map changes (which is what happens 
> >when the vga window is updated), kvm drops the entire shadow cache.  It's 
> >possible to do this only for vga memory, but not easy.
> 
> I don't think changing VGA window is a problem because there are 
> 500.000-1Mio changes/s possible.

1MB/s and 500k-1M changes/s. Coincidence?  At a byte or two per write,
those numbers line up almost exactly.  Is it taking a page fault
or trap on every write?

> Would it be possible to handle these writes through QEMU directly (without
> KVM), since performance is very good there (looking at the code, there
> is just some pointer arithmetic and a memory write)?

I've noticed extremely slow VGA performance too, when installing OSes.
It makes the difference between installing in a few minutes, and
installing taking hours - just because of the slow VGA.

So generally I use qemu for installing old versions of Windows, then
change to KVM to run them after installing.

Switching between KVM and qemu automatically based on guest code
behaviour, and making both memory models and device models compatible
at run time, is a difficult thing.  I guess it's not worth the
difficulty just to speed up VGA.

-- Jamie



