Re: [Qemu-devel] Using PCI config space to indicate config location
Rusty Russell wrote: I don't think it'll be that bad; reset clears the device to unknown, bar0 moves it from unknown to legacy mode, bar1/2/3 moves it from unknown to modern mode, and anything else is bad (I prefer being strict so we catch bad implementations from the beginning). Will that work if a guest kernel that uses modern mode kexecs to an older (but presumed reliable) kernel that only knows about legacy mode? I.e. will the replacement kernel, or (ideally) a replacement driver on the rare occasion that is needed on a running kernel, be able to reset the device hard enough? -- Jamie -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
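The mode state machine Rusty describes is small enough to model directly. Here is a toy sketch (the class, names, and exact error behaviour are my assumptions, not anything from the spec); it also shows why the kexec question hinges entirely on reset putting the device back to unknown:

```python
class VirtioDevMode:
    """Toy model: reset -> unknown; a BAR0 access selects legacy mode;
    a BAR1/2/3 access selects modern mode; mixing the two is an error
    (the strict behaviour Rusty prefers)."""
    def __init__(self):
        self.mode = "unknown"

    def reset(self):
        self.mode = "unknown"

    def access_bar(self, bar):
        want = "legacy" if bar == 0 else "modern"
        if self.mode == "unknown":
            self.mode = want
        elif self.mode != want:
            raise RuntimeError("mixed legacy/modern access")

d = VirtioDevMode()
d.access_bar(1)          # modern driver binds first
assert d.mode == "modern"
d.reset()                # e.g. on kexec into an older kernel
d.access_bar(0)          # the legacy-only driver works after a hard reset
assert d.mode == "legacy"
```

If the reset is not hard enough, the kexec'd legacy kernel would hit the mixed-access error path instead.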
Re: [Qemu-devel] [RFC] Next gen kvm api
Anthony Liguori wrote: The new API will do away with the IOAPIC/PIC/PIT emulation and defer them to userspace. I'm a big fan of this. I agree with getting rid of unnecessary emulations. (Why were those things emulated in the first place?) But it would be good to retain some way to plug in device emulations in the kernel, separate from the KVM core with a well-defined API boundary. Then it wouldn't matter to the KVM core whether there's PIT emulation or whatever; that would just be a separate module. Perhaps even with its own /dev device, and maybe not tightly bound to KVM. Note: this may cause a regression for older guests that don't support MSI or kvmclock. Device assignment will be done using VFIO, that is, without direct kvm involvement. I don't like the sound of regressions. I tend to think of a VM as something that needs to have consistent behaviour over a long time, for keeping working systems running for years despite changing hardware, or reviving old systems to test software and make patches for things in long-term maintenance, etc. But I haven't noticed problems from upgrading kernelspace KVM yet, only upgrading the userspace parts. If a kernel upgrade is risky, that makes upgrading host kernels difficult, and all or nothing for all the guests within. However, it looks like you mean only the performance characteristics will change because of moving things back to userspace? Local APICs will be mandatory, but it will be possible to hide them from the guest. This means that it will no longer be possible to emulate an APIC in userspace, but it will be possible to virtualize an APIC-less core - userspace will play with the LINT0/LINT1 inputs (configured as EXTINT and NMI) to queue interrupts and NMIs. I think this makes sense. An interesting consequence of this is that it's no longer necessary to associate the VCPU context with an MMIO/PIO operation. I'm not sure if there's an obvious benefit to that but it's interesting nonetheless. 
Would that be useful for using VCPUs to run sandboxed userspace code with ability to trap and control the whole environment (as opposed to guest OSes, or ptrace which is rather incomplete and unsuitable for sandboxing code meant for other OSes)? Thanks, -- Jamie
Re: [Qemu-devel] qemu-kvm upstreaming: Do we need -no-kvm-pit and -no-kvm-pit-reinjection semantics?
Jan Kiszka wrote: Usability. Users should not have to care about individual tick-based clocks. They care about "my OS requires lost-ticks compensation", yes or no. Conceivably an OS may require lost-ticks compensation depending on boot options given to the OS telling it which clock sources to use. However, I like the idea of a global default, which you can set, and which all the devices inherit unless overridden per device. -- Jamie
Re: [PATCH 0/4] Avoid soft lockup message when KVM is stopped by host
Marcelo Tosatti wrote: In case the VM stops for whatever reason, the host system is not supposed to adjust time-related hardware state to compensate, in an attempt to present apparently continuous time. If you save a VM and then restore it later, it is the guest's responsibility to adjust its time representation. If the guest doesn't know it's been stopped, then its time representation will be wrong until it finds out, e.g. after a few minutes with NTP, and even a few seconds can be too long. That is sad when it happens, because it breaks the coherence of any timed-lease caching the guest is involved in. I.e. where the guest acquires a lock on some data object (like a file in NFS) that it can then efficiently access without network round trips (similar to MESI), with all nodes having agreed that it's coherent for, say, 5 seconds before renewing or breaking. (It's just a way to reduce latency.) But we can't trust CLOCK_MONOTONIC when a VM is involved; it's just one of those facts of life. So instead the effort is to try and detect when a VM is involved and then distrust the clock. (Non-VM) suspend/resume is similar, but there's usually a way to be notified about that as it happens. -- Jamie
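To make the timed-lease problem concrete: the guest measures the lease lifetime on its own monotonic clock, so a pause it never observes simply doesn't count against the lease from its point of view, while the other nodes' clocks keep running. A minimal sketch (the 5-second figure and the API are purely illustrative):

```python
import time

class Lease:
    """A cache lease valid for a fixed duration after acquisition.

    The holder trusts its monotonic clock; if the VM is paused and
    resumed without the guest noticing, is_valid() can wrongly return
    True long after the other nodes consider the lease broken."""
    def __init__(self, duration_s=5.0):
        self.duration_s = duration_s
        self.acquired_at = time.monotonic()

    def is_valid(self):
        return (time.monotonic() - self.acquired_at) < self.duration_s

lease = Lease(duration_s=0.05)
assert lease.is_valid()       # fresh lease, holder may skip the round trip
time.sleep(0.06)              # stands in for elapsed (or VM-stolen) time
assert not lease.is_valid()   # expired from the holder's own view
```

In the failure Jamie describes, the sleep happens without the guest's clock advancing, so the first branch is taken when the second should be.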
Re: [Qemu-devel] [PATCH 2/2 V7] qemu,qmp: add inject-nmi qmp command
Blue Swirl wrote: On Fri, Apr 8, 2011 at 9:04 AM, Gleb Natapov g...@redhat.com wrote: On Thu, Apr 07, 2011 at 04:41:03PM -0500, Anthony Liguori wrote: On 04/07/2011 02:17 PM, Gleb Natapov wrote: On Thu, Apr 07, 2011 at 10:04:00PM +0300, Blue Swirl wrote: On Thu, Apr 7, 2011 at 9:51 PM, Gleb Natapov g...@redhat.com wrote: I'd prefer something more generic like these: raise /apic@fee0:l1int lower /i44FX-pcihost/e1000@03.0/pinD The clumsier syntax shouldn't be a problem, since this would be a system developer tool. Some kind of IRQ registration would be needed for this to work without lots of changes. True. The ability to trigger any interrupt line is very useful for debugging. I often re-implement it during debugging. And it's a good thing to have, but exposing this as the only API to do something as simple as generating a guest crash dump is not the friendliest thing in the world to do to users. Well, this is not intended to be used by regular users directly, and management can provide a nicer interface for issuing NMI. But really, my point is that NMI actually generates a guest core dump in such rare cases (only preconfigured Windows guests) that it doesn't warrant naming the command as such. Management is in a much better position to implement functionality with such a name, since it knows what type of guest it runs and can tell the agent to configure the guest accordingly. Does the management need to know about each and every debugging-oriented interface? For example, info regs, info mem, info irq and tracepoints? Linux uses NMI for performance tracing, profiling, watchdog etc., so in practice NMI is very similar to the other IRQs. I.e. highly guest-specific and depending on what's wired up to it. Injecting NMI to all CPUs at once does not make any sense for those Linux guests. For Windows crash dumps, I think it makes sense to have a button wired to an NMI device, rather than inject-nmi directly, but I can see that inject-nmi solves the intended problem quite neatly. 
For Linux crash dumps, for example, there are other key combinations, as well as watchdog devices, that can be used to trigger them. A virtual button wired to GPIO/PCI-IRQ/etc. device might be quite handy for debugging Linux guests, and would fit comfortably in a management interface. -- Jamie
Re: [Qemu-devel] [PATCH 2/2 V7] qemu,qmp: add inject-nmi qmp command
Gleb Natapov wrote: On Thu, Apr 07, 2011 at 04:39:58PM -0500, Anthony Liguori wrote: On 04/07/2011 01:51 PM, Gleb Natapov wrote: NMI does not have to generate a crash dump on every guest we support. Actually even for a Windows guest it does not generate one without tweaking the registry. For all I know there is a guest that checks mail when NMI arrives. And for all we know, a guest can respond to an ACPI poweroff event by tweeting the star spangled banner, but we still call the corresponding QMP command system_poweroff. Correct :) But at least system_poweroff implements ACPI poweroff as defined by the ACPI spec. NMI is not designed as a core dump event and is not used as such by the majority of guests. Imho acpi_poweroff or poweroff_button would have been a clearer name. Or even 'sendkey poweroff' - it's just a button on the keyboard of a lot of systems anyway. Next to the email button and what looks, on my laptop, like the play-a-tune button :-) I put system_poweroff into some QEMU-controlling scripts once, and was disappointed when several guests ignored it. But it's done now. -- Jamie
Re: [Qemu-devel] Anyone seeing huge slowdown launching qemu with Linux 2.6.35?
Richard W.M. Jones wrote: We could demand that OSes write device drivers for more qemu devices -- already OS vendors write thousands of device drivers for all sorts of obscure devices, so this isn't really much of a demand for them. In fact, they're already doing it. Result: Most OSes not working with qemu? Actually we seem to be going that way. Recent qemus don't work with older versions of Windows any more, so we have to use different versions of qemu for different guests. -- Jamie
Re: [Qemu-devel] [RFC] Bug Day - June 1st, 2010
Michael Tokarev wrote: Anthony Liguori wrote: [] For the Bug Day, anything is interesting IMHO. My main interest is to get as many people involved in testing and bug fixing as possible. If folks are interested in testing specific things like unusual or older OSes, I'm happy to see it! Well, interesting or not, but I for one don't know what to do with the results. There was a thread on kvm@ about a SIGSEGV in cirrus code when running winNT. The issue has been identified and appears to be fixed, as in, the kvm process does not SIGSEGV anymore, but it does not work anyway, now printing: BUG: kvm_dirty_pages_log_enable_slot: invalid parameters with a garbled guest display. Thanks go to Stefano Stabellini for finding the SIGSEGV case, but unfortunately his hard work isn't quite useful since the behaviour isn't very much different from the previous version... ;) A BUG: is good to see in a bug report: it gives you something specific to analyse. Good luck ;-) Imho, it'd be quite handy to keep a timeline of working/non-working guests in a table somewhere, and which qemu versions and options they were observed to work or break with. Also, thanks to Andre Przywara, the whole winNT thing works, but it requires -cpu qemu64,level=1 (or level=2 or =3) -- _not_ the default CPU. This is also testing, but it's not obvious what to do with the result... Doesn't WinNT work with qemu32 or kvm32? It's a 32-bit OS after all. - Jamie
Re: [Qemu-devel] [RFC] Bug Day - June 1st, 2010
Natalia Portillo wrote: Hi, - We'll try to migrate as many confirmable bugs from the Source Forge tracker to Launchpad. I think that part of the bug day should also include retesting OSes that appear in the OS Support List as having a bug, and confirming if the bug is still present and if it's in Launchpad or not. There have been reports of several legacy OSes being unable to install or boot in the newer qemu while working in the older one. They're probably not in the OS Support List though. Are they effectively uninteresting for the purpose of the 0.13 release? Unfortunately I doubt I will have time to participate in the Bug Day. Thanks, -- Jamie
Re: [Qemu-devel] Re: Another SIGFPE in display code, now in cirrus
Stefano Stabellini wrote: I think we need to consider only dstpitch for a full invalidate. We might be copying an offscreen bitmap into the screen, and srcpitch is likely to be the bitmap width instead of the screen pitch. Agreed. Even when copying on-screen (or partially on-screen), the srcpitch does not affect the invalidated area. The source area might be strange (parallelogram, single line repeated), but srcpitch should only affect whether qemu_console_copy can be used, not the invalidation. -- Jamie
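The point can be stated as: the invalidated region is a pure function of the destination parameters, never of srcpitch. A sketch of that observation, using byte offsets into a linear framebuffer (the function and its signature are mine, for illustration only):

```python
def invalidated_span(dst_x, dst_y, width, height, dstpitch):
    """Byte range of the framebuffer touched by a blit: depends only on
    the destination rectangle and dstpitch. srcpitch describes where
    the pixels come from, not which screen bytes change."""
    first = dst_y * dstpitch + dst_x
    last = (dst_y + height - 1) * dstpitch + dst_x + width
    return first, last

# Same destination rectangle, so the same bytes are invalidated no
# matter how the (off-screen or on-screen) source was laid out.
span = invalidated_span(dst_x=10, dst_y=2, width=100, height=4, dstpitch=1024)
assert span == (2 * 1024 + 10, 5 * 1024 + 110)
```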
Re: [Qemu-devel] Re: QEMU-KVM and video performance
Gerhard Wiesinger wrote: On Wed, 21 Apr 2010, Jamie Lokier wrote: Gerhard Wiesinger wrote: Hmmm. I'm very new to QEMU and KVM, but at least accessing the virtual HW of QEMU even from KVM must be possible (e.g. memory and port accesses are done on nearly every virtual device) and therefore I'm ending in C code in the QEMU hw/*.c directory. Therefore also the VGA memory area should be accessible from KVM, but with the specialized and fast memory access of QEMU. Am I missing something? What you're missing is that when KVM calls out to QEMU to handle hw/*.c traps, that call is very slow. It's because the hardware-VM support is a bit slow when the trap happens, and then the call from KVM in the kernel up to QEMU is a bit slow again. Then all the way back. It adds up to a lot, for every I/O operation. Isn't that then a general problem of KVM virtualization (or hardware virtualization in general)? Is this CPU dependent (AMD vs. Intel)? Yes, it is a general problem, but KVM emulates some time-critical things in the kernel (like the APIC and CPU instructions), so it's not too bad. KVM is about 5x faster than TCG for most things, and slower for a few things, so on balance it is usually faster. The slow 256-colour mode writes sound like just a simple bug, though. No need for complicated changes. In 256-colour mode, KVM should be writing to the VGA memory at high speed a lot like normal RAM, not trapping at the hardware-VM level, and not calling up to the code in hw/*.c for every byte. Yes, same picture to me: 256-colour mode should be only a memory write (16-colour mode is more difficult as the pixel/byte mapping is not the same). But it looks like this isn't the case in this test scenario. You might double-check if your guest is using VGA Mode X. (See Wikipedia.) That was a way to accelerate VGA on real PCs, but it will be slow in KVM for the same reasons as 16-colour mode. Which way do you mean? 
Look up Mode X on Wikipedia if you're interested, but it isn't relevant to the problem you've reported. Mode X cannot be enabled with a BIOS call; it's a VGA hardware programming trick. It would not be useful in a VM environment. -- Jamie
Re: [Qemu-devel] Re: QEMU-KVM and video performance
Gerhard Wiesinger wrote: Can one switch to the old software VMM in VMware? Perhaps you can install a very old version of VMware. Maybe run it under KVM ;-) That was one of the reasons why I was looking for alternatives for graphical DOS programs. Overall summary so far: 1.) QEMU without KVM: Problem with 286 DOS extender instruction set, but fast VGA 2.) QEMU with KVM: 286 DOS extender apps ok, but slow VGA memory performance 3.) VMware Server 2.0 under Linux: application ok, but slow VGA memory performance 4.) Virtual PC: Problems with 286 DOS extender 5.) Bochs: Works well, but very slow. I would be interested in the 286 DOS extender issue, as I'd like to use some 286 programs in QEMU at some point. There were some changes to KVM in the kernel recently. Were those needed to get the 286 apps working? It looks like VMware Server and QEMU with KVM may have the same architectural problem of going through the whole slow chain from guest OS to virtualization layer for VGA writes. They do have a similar architecture. The VGA write speed is a bit surprising, as it should be fast in 256-colour non-Mode-X modes for both. But maybe there's something we've missed that makes it architecturally slow. It will be interesting to see what you find :-) Thanks, -- Jamie
Re: [Qemu-devel] Re: Another SIGFPE in display code, now in cirrus
Stefano Stabellini wrote: On Wed, 12 May 2010, Avi Kivity wrote: It's useful if you have a one-line horizontal pattern you want to propagate all over. It might be useful all right, but it is not entirely clear what the hardware should do in this situation from the documentation we have, and certainly the current state of the cirrus emulation code doesn't help. It's quite a reasonable thing for hardware to do, even if not documented. It would be surprising if the hardware didn't copy the one-line pattern. -- Jamie
Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
Rusty Russell wrote: Seems over-zealous. If the recovery_header held a strong checksum of the recovery_data you would not need the first fsync, and as long as you have two places to write recovery data, you don't need the 3rd and 4th syncs. Just: write_internally_checksummed_recovery_data_and_header_to_unused_log_space() fsync / msync overwrite_with_new_data() To recover, you choose the most recent log_space and replay the content. That may be a redundant operation, but that is no loss. I think you missed a checksum for the new data? Otherwise we can't tell if the new data is completely written. The data checksum can go in the recovery-data block. If there's enough slack in the log, by the time that recovery-data block is overwritten, you can be sure that an fsync has been done for that data (by a later commit). But yes, I will steal this scheme for TDB2, thanks! Take a look at the filesystems. I think ext4 did some optimisations in this area, and checksums had to be added anyway due to a subtle replay-corruption problem that happens when the log is partially corrupted, followed by non-corrupt blocks. Also, you can remove even more fsyncs by adding a bit of slack to the data space and writing into unused/fresh areas some of the time - i.e. a bit like btrfs/zfs or anything log-structured, but you don't have to go all the way with that. In practice, it's the first sync which is glacial; the rest are pretty cheap. The 3rd and 4th fsyncs imply a disk seek each, just because the preceding writes are to different areas of the disk. Seeks are quite slow - but not as slow as ext3 fsyncs :-) What do you mean by cheap? That it's only a couple of seeks, or that you don't see even that? Also I cannot see the point of msync if you have already performed an fsync, and if there is a point, I would expect you to call msync before fsync... Maybe there is some subtlety there that I am not aware of. 
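Rusty's reduced-sync scheme can be sketched concretely. Here is a toy two-slot version (the 4 KiB slot size, CRC32, and record layout are my choices for illustration, not TDB2's): each transaction writes one self-checksummed record into the unused slot and needs only a single fsync before overwriting data in place, and recovery simply picks the most recent slot whose checksum verifies, so a torn write is detected rather than replayed:

```python
import os, struct, tempfile, zlib

SLOT = 4096  # assumed log-slot size

def write_record(log, slot, seq, payload):
    """Write one self-checksummed recovery record into slot 0 or 1."""
    body = struct.pack("<Q", seq) + payload
    rec = struct.pack("<II", zlib.crc32(body), len(body)) + body
    log.seek(slot * SLOT)
    log.write(rec.ljust(SLOT, b"\0"))
    log.flush()
    os.fsync(log.fileno())       # the single sync per transaction

def read_record(log, slot):
    log.seek(slot * SLOT)
    crc, n = struct.unpack("<II", log.read(8))
    body = log.read(n) if n <= SLOT - 8 else b""
    if n < 8 or len(body) != n or zlib.crc32(body) != crc:
        return None              # torn, stale or empty record
    seq, = struct.unpack("<Q", body[:8])
    return seq, body[8:]

def recover(log):
    """Pick the most recent slot that checksums correctly."""
    best = None
    for slot in (0, 1):
        rec = read_record(log, slot)
        if rec and (best is None or rec[0] > best[0]):
            best = rec
    return best

with tempfile.TemporaryFile() as log:
    log.truncate(2 * SLOT)
    write_record(log, 0, seq=1, payload=b"old recovery data")
    write_record(log, 1, seq=2, payload=b"new recovery data")
    assert recover(log) == (2, b"new recovery data")
```

The checksum covering the whole record is what lets the first fsync be dropped: a half-written record simply fails verification and the older slot is used instead.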
I assume it's this from the msync man page: msync() flushes changes made to the in-core copy of a file that was mapped into memory using mmap(2) back to disk. Without use of this call there is no guarantee that changes are written back before munmap(2) is called. Historically, that means msync() ensures dirty mapping data is written to the file as if with write(), and that mapping pages are removed or refreshed to get the effect of read() (possibly a lazy one). It's more obvious in the early mmap implementations where mappings don't share pages with the filesystem cache, so msync() has explicit behaviour. Like with write(), after calling msync() you would then call fsync() to ensure the data is flushed to disk. If you've been calling fsync then msync, I guess that's another fine example of how these functions are so hard to test that they aren't. Historically on Linux, msync has been iffy on some architectures, and I'm still not sure it has the same semantics as other unixes. fsync as we know has also been iffy, and even now that fsync is tidier it does not always issue a hardware-level cache commit. But then historically writable mmap has been iffy on a boatload of unixes. It's an implementation detail; barrier has less flexibility because it has less information about what is required. I'm saying I want to give you as much information as I can, even if you don't use it yet. Only we know that approach doesn't work. People will learn that they don't need to give the extra information to still achieve the same result - just like they did with ext3 and fsync. Then when we improve the implementation to only provide the guarantees that you asked for, people will complain that they are getting empty files that they didn't expect. I think that's an oversimplification: IIUC that occurred to people *not* using fsync(). They weren't using it because it was too slow. 
Providing a primitive which is as fast or faster and more specific doesn't have the same magnitude of social issues. I agree with Rusty. Let's make it perform well so there is no reason to deliberately avoid using it, and let's make it say what apps actually want to request without being way too strong. And please, if anyone has ideas on how we could make correct use of these functions *testable* by app authors, I'm all ears. Right now it is quite difficult - pulling power on hard disks mid-transaction is not a convenient method :) The abstraction I would like to see is a simple 'barrier' that contains no data and has a filesystem-wide effect. I think you lack ambition ;) Thinking about the single-file use case (eg. kvm guest or tdb), isn't that suboptimal for md? Since you have to hand your barrier to every device, whereas a file-wide primitive may theoretically only go to some. Yes. Note that database-like programs still need fsync-like behaviour *sometimes*: the D in ACID depends on it, and the C
Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
Rusty Russell wrote: On Wed, 5 May 2010 05:47:05 am Jamie Lokier wrote: Jens Axboe wrote: On Tue, May 04 2010, Rusty Russell wrote: ISTR someone mentioning a desire for such an API years ago, so CC'ing the usual I/O suspects... It would be nice to have a fuller API for this, but the reality is that only the flush approach is really workable. Even just strict ordering of requests could only be supported on SCSI, and even there the kernel still lacks proper guarantees on error handling to prevent reordering. There's a few I/O scheduling differences that might be useful: 1. The I/O scheduler could freely move WRITEs before a FLUSH but not before a BARRIER. That might be useful for time-critical WRITEs, and those issued with high I/O priority. This is only because no one actually wants flushes or barriers, though I/O people seem to only offer that. We really want "these writes must occur before this write". That offers maximum choice to the I/O subsystem and potentially to smart (virtual?) disks. We do want flushes for the D in ACID - such things as after receiving a mail, or a blog update into a database file (could be TDB), and confirming that to the sender, to have high confidence that the update won't disappear on system crash or power failure. Less obviously, it's also needed for the C in ACID when more than one file is involved. C is about differently updated things staying consistent with each other. For example, imagine you have a TDB file mapping Samba usernames to passwords, and another mapping Samba usernames to local usernames. (I don't know if you do this; it's just an illustration.) To rename a Samba user involves updating both. Let's ignore transient transactional issues :-) and just think about what happens with per-file barriers and no sync, when a crash happens long after the updates, and before the system has written out all data and issued low-level cache flushes. 
After restarting, due to lack of sync, the Samba username could be present in one file and not the other. 2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is only for data belonging to a particular file (e.g. fdatasync with no file size change, even on btrfs if O_DIRECT was used for the writes being committed). That would entail tagging FLUSHes and WRITEs with a fs-specific identifier (such as inode number), opaque to the scheduler, which only checks equality. This is closer. In userspace I'd be happy with "all prior writes to this struct file before all future writes". Even if the original guarantees were stronger (ie. inode basis). We currently implement transactions using 4 fsync/msync pairs: write_recovery_data(fd); fsync(fd); msync(mmap); write_recovery_header(fd); fsync(fd); msync(mmap); overwrite_with_new_data(fd); fsync(fd); msync(mmap); remove_recovery_header(fd); fsync(fd); msync(mmap); Yet we really only need ordering, not guarantees about it actually hitting disk before returning. In other words, FLUSH can be more relaxed than BARRIER inside the kernel. It's ironic that we think of fsync as stronger than fbarrier outside the kernel :-) It's an implementation detail; barrier has less flexibility because it has less information about what is required. I'm saying I want to give you as much information as I can, even if you don't use it yet. I agree, and I've started a few threads about it over the last couple of years. An fsync_range() system call would be very easy to use and (most importantly) easy to understand. With optional flags to weaken it (into fdatasync, barrier without sync, sync without barrier, one-sided barrier, no low-level cache flush, don't rush, etc.), it would be very versatile, and still easy to understand. With an AIO version, and another flag meaning "don't rush, just return when satisfied", I suspect it would be useful for the most demanding I/O apps. 
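The four-phase transaction quoted above can be written out with real msync/fsync pairs. A runnable sketch (the page layout and marker bytes are mine, for illustration): each sync() call is purely an ordering point between phases, which is exactly why a weaker "barrier, don't flush" primitive would suffice here:

```python
import mmap, os, tempfile

PAGE = mmap.PAGESIZE

def sync(fd, mm):
    # One fsync/msync pair: an ordering point - everything written so
    # far reaches the file and stable storage before the next phase.
    mm.flush()        # msync the mapping
    os.fsync(fd)      # fsync the file

def transact(path, new):
    """Overwrite the first 4 bytes of the data page using the 4-phase
    recovery protocol from the message above: recovery data, recovery
    header, new data, remove header - each phase ordered by sync()."""
    with open(path, "r+b") as f:
        fd = f.fileno()
        mm = mmap.mmap(fd, 2 * PAGE)
        mm[PAGE + 8:PAGE + 12] = mm[0:4]   # 1. write recovery data (old bytes)
        sync(fd, mm)
        mm[PAGE:PAGE + 8] = b"RECOVER!"    # 2. write recovery header
        sync(fd, mm)
        mm[0:4] = new                      # 3. overwrite with new data
        sync(fd, mm)
        mm[PAGE:PAGE + 8] = b"\0" * 8      # 4. remove recovery header
        sync(fd, mm)
        out = bytes(mm[0:4])
        mm.close()
        return out

with tempfile.NamedTemporaryFile() as f:
    f.truncate(2 * PAGE)
    assert transact(f.name, b"new!") == b"new!"
```

A crash between phases 2 and 4 leaves a valid header plus old bytes in the recovery page, so a restart can roll back; the syncs only exist to keep the phases from being reordered by the kernel or the disk.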
-- Jamie
Re: [Qemu-devel] question on virtio
Michael S. Tsirkin wrote: Hi! I see this in virtio_ring.c: /* Put entry in available array (but don't update avail->idx * until they do sync). */ Why is it done this way? It seems that updating the index straight away would be simpler, while this might allow the host to speculatively look up the buffer and handle it, without waiting for the kick. Even better, if the host updates a location containing which index it has seen recently, you can avoid the kick entirely during sustained flows - just like your recent patch to avoid sending irqs to the guest. -- Jamie
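The suggestion in the last paragraph can be modelled with a toy ring. This sketch ignores memory barriers and real concurrency, and the field names are mine (virtio later standardised a similar event-index mechanism): the host publishes the last avail index it has seen, and the guest skips the expensive kick whenever the host is still catching up:

```python
class Ring:
    """Toy model of kick suppression: the host publishes the avail
    index it has observed most recently; the guest only kicks when the
    host has gone idle, i.e. has already seen everything queued."""
    def __init__(self):
        self.avail_idx = 0   # written by guest: buffers made available
        self.seen_idx = 0    # written by host: last avail_idx it observed
        self.kicks = 0

    def guest_add_buffer(self):
        need_kick = (self.seen_idx == self.avail_idx)  # host idle?
        self.avail_idx += 1
        if need_kick:
            self.kicks += 1  # notify the host (a vmexit in real life)

    def host_poll(self):
        self.seen_idx = self.avail_idx

r = Ring()
r.guest_add_buffer()   # host idle -> must kick
r.guest_add_buffer()   # host is behind, it will see this buffer anyway
r.guest_add_buffer()
assert r.kicks == 1    # a sustained flow costs only the first kick
r.host_poll()
r.guest_add_buffer()   # host caught up and went idle -> kick again
assert r.kicks == 2
```

The real mechanism additionally needs barriers (the guest must publish the buffer before checking seen_idx) and a race-free "host about to sleep" handshake, which this model leaves out.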
Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
Jens Axboe wrote: On Tue, May 04 2010, Rusty Russell wrote: ISTR someone mentioning a desire for such an API years ago, so CC'ing the usual I/O suspects... It would be nice to have a fuller API for this, but the reality is that only the flush approach is really workable. Even just strict ordering of requests could only be supported on SCSI, and even there the kernel still lacks proper guarantees on error handling to prevent reordering. There's a few I/O scheduling differences that might be useful: 1. The I/O scheduler could freely move WRITEs before a FLUSH but not before a BARRIER. That might be useful for time-critical WRITEs, and those issued with high I/O priority. 2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is only for data belonging to a particular file (e.g. fdatasync with no file size change, even on btrfs if O_DIRECT was used for the writes being committed). That would entail tagging FLUSHes and WRITEs with a fs-specific identifier (such as inode number), opaque to the scheduler, which only checks equality. 3. By delaying FLUSHes through reordering as above, the I/O scheduler could merge multiple FLUSHes into a single command. 4. On MD/RAID, BARRIER requires every backing device to quiesce before sending the low-level cache flush, and all of those to finish before resuming each backing device. FLUSH doesn't require as much synchronising. (With per-file FLUSH, see 2, it could even avoid FLUSH altogether on some backing devices for small files.) In other words, FLUSH can be more relaxed than BARRIER inside the kernel. It's ironic that we think of fsync as stronger than fbarrier outside the kernel :-) -- Jamie
Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
Rusty Russell wrote: On Fri, 19 Feb 2010 08:52:20 am Michael S. Tsirkin wrote: I took a stab at documenting CMD and FLUSH request types in virtio block. Christoph, could you look over this please? I note that the interface seems full of warts to me; this might be a first step to cleaning them up. ISTR Christoph had withdrawn some patches in this area, and was waiting for him to resubmit? I've given up on figuring out the block device. What seem to me to be sane semantics along the lines of memory barriers are foreign to disk people: they want (and depend on) flushing everywhere. For example, tdb transactions do not require a flush, they only require what I would call a barrier: that prior data be written out before any future data. Surely that would be more efficient in general than a flush! In fact, TDB wants only writes to *that file* (and metadata) written out first; it has no ordering issues with other I/O on the same device. I've just posted elsewhere on this thread that an I/O-level flush can be more efficient than an I/O-level barrier (implemented using a cache flush really), because the barrier has stricter ordering requirements at the I/O scheduling level. By the time you work up to tdb, another way to think of it is distinguishing "eager fsync" from "fsync but I'm not in a hurry - delay as long as is convenient". The latter makes much more sense with AIO. A generic I/O interface would allow you to specify "this request depends on these outstanding requests" and leave it at that. It might have some sync flush command for dumb applications and OSes. For filesystems, it would probably be easy to label in-place overwrites and fdatasync data flushes, when there's no file extension, with an opaque per-file identifier for certain operations. Typically overwriting in place and fdatasync would match up and wouldn't need ordering against anything else. Other operations would tend to get labelled as ordered against everything, including these. 
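The "this request depends on these outstanding requests" interface might look like this toy scheduler (entirely illustrative; no kernel interface like this exists, and the names are mine). The point is that the scheduler keeps full freedom to reorder or merge anything except the explicitly named dependencies:

```python
class Scheduler:
    """Toy queue where each request names the requests it depends on,
    and completion may happen in any dependency-respecting order
    (here: a greedy pick among requests whose deps are all done)."""
    def __init__(self):
        self.pending = {}  # request id -> set of unfinished dep ids

    def submit(self, rid, deps=()):
        # Only dependencies that are still outstanding constrain us.
        self.pending[rid] = set(deps) & set(self.pending)

    def run(self):
        done = []
        while self.pending:
            ready = [r for r, d in self.pending.items() if not d]
            rid = ready[0]          # free to choose any ready request
            done.append(rid)
            del self.pending[rid]
            for d in self.pending.values():
                d.discard(rid)
        return done

s = Scheduler()
s.submit("journal-write")
s.submit("data-write")  # independent of the journal: any order is fine
s.submit("commit", deps=["journal-write", "data-write"])
order = s.run()
assert order.index("commit") > order.index("journal-write")
assert order.index("commit") > order.index("data-write")
```

A dumb sync-flush command then degenerates to "depends on everything outstanding", while the TDB case tags only its own file's writes.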
-- Jamie -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
Yoshiaki Tamura wrote: Jamie Lokier wrote: Yoshiaki Tamura wrote: Dor Laor wrote: On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote: Event tapping is the core component of Kemari, and it decides on which event the primary should synchronize with the secondary. The basic assumption here is that outgoing I/O operations are idempotent, which is usually true for disk I/O and reliable network protocols such as TCP.

IMO any type of network event should be stalled too. What if the VM runs a non-TCP protocol and the packet that the master node sent reached some remote client, and before the sync to the slave the master failed?

In the current implementation, it is actually stalling any type of network that goes through virtio-net. However, if the application was using unreliable protocols, it should have its own recovery mechanism, or it should be completely stateless.

Even with unreliable protocols, if slave takeover causes the receiver to have received a packet that the sender _does not think it has ever sent_, expect some protocols to break. If the slave replaying the master's behaviour since the last sync means it will definitely get into the same state of having sent the packet, that works out.

That's something we're expecting now.

But you still have to be careful that the other end's responses to that packet are not seen by the slave too early during that replay. Otherwise, for example, the slave may observe a TCP ACK to a packet that it hasn't yet sent, which is an error.

Even though the current implementation syncs just before network output, what you pointed out could happen. In this case, would the connection be lost, or would client/server recover from it? If the latter, it would be fine; otherwise I wonder how people doing similar things are handling this situation.

In the case of TCP in a synchronised state, I think it will recover according to the rules in RFC 793.
In an unsynchronised state (during connection), I'm not sure if it recovers or if it looks like a "Connection reset" error. I suspect it does recover, but I'm not certain.

But that's TCP. Other protocols, such as over UDP, may behave differently, because this is not an anticipated behaviour of a network.

However, there is one respect in which they're not idempotent: The TTL field should be decreased if packets are delayed. Packets should not appear to live in the network for longer than TTL seconds. If they do, some protocols (like TCP) can react to the delayed ones differently, such as sending a RST packet and breaking a connection. It is acceptable to reduce TTL faster than the minimum. After all, it is reduced by 1 on every forwarding hop, in addition to time delays.

So the problem is, when the slave takes over, it sends a packet with the same TTL, which the client may already have received.

Yes. I guess this is a general problem with time-based protocols and virtual machines getting stopped for 1 minute (say), without knowing that real time has moved on for the other nodes. Some application transaction, caching and locking protocols will give wrong results when their time assumptions are discontinuous to such a large degree. It's a bit nasty to impose that on them after they worked so hard on their reliability :-) However, I think such implementations _could_ be made safe if those programs can arrange to definitely be interrupted with a signal when the discontinuity happens. Of course, only if they're aware they may be running on a Kemari system...

I have an intuitive idea that there is a solution to that, but each time I try to write the next paragraph explaining it, some little complication crops up and it needs more thought. Something about concurrent, asynchronous transactions to keep the master running while recording the minimum states that replay needs to be safe, while slewing the replaying slave's virtual clock back to real time quickly during recovery mode.
-- Jamie
Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
Yoshiaki Tamura wrote: Dor Laor wrote: On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote: Event tapping is the core component of Kemari, and it decides on which event the primary should synchronize with the secondary. The basic assumption here is that outgoing I/O operations are idempotent, which is usually true for disk I/O and reliable network protocols such as TCP.

IMO any type of network event should be stalled too. What if the VM runs a non-TCP protocol and the packet that the master node sent reached some remote client, and before the sync to the slave the master failed?

In the current implementation, it is actually stalling any type of network that goes through virtio-net. However, if the application was using unreliable protocols, it should have its own recovery mechanism, or it should be completely stateless.

Even with unreliable protocols, if slave takeover causes the receiver to have received a packet that the sender _does not think it has ever sent_, expect some protocols to break. If the slave replaying the master's behaviour since the last sync means it will definitely get into the same state of having sent the packet, that works out. But you still have to be careful that the other end's responses to that packet are not seen by the slave too early during that replay. Otherwise, for example, the slave may observe a TCP ACK to a packet that it hasn't yet sent, which is an error.

About IP idempotency: In general, IP packets are allowed to be lost or duplicated in the network. All IP protocols should be prepared for that; it is a basic property. However, there is one respect in which they're not idempotent: The TTL field should be decreased if packets are delayed. Packets should not appear to live in the network for longer than TTL seconds. If they do, some protocols (like TCP) can react to the delayed ones differently, such as sending a RST packet and breaking a connection. It is acceptable to reduce TTL faster than the minimum.
After all, it is reduced by 1 on every forwarding hop, in addition to time delays. I currently don't have good numbers that I can share right now. Snapshots/sec depends on what kind of workload is running, and if the guest was almost idle, there will be no snapshots in 5sec. On the other hand, if the guest was running I/O intensive workloads (netperf, iozone for example), there will be about 50 snapshots/sec. That is a really satisfying number, thank you :-) Without this work I wouldn't have imagined that synchronised machines could work with such a low transaction rate. -- Jamie
Re: [Qemu-devel] Re: QEMU-KVM and video performance
Avi Kivity wrote: On 04/19/2010 10:14 PM, Gerhard Wiesinger wrote: Hello, Finally I got QEMU-KVM to work, but video performance under DOS is very low (QEMU 0.12.3 stable and the QEMU GIT master branch are fast, QEMU-KVM is slow). I'm measuring 2 performance-critical video performance parameters: 1.) INT 10h, function AX=4F05h (set same window/set window/get window) 2.) Memory performance to segment page A000h

So BIOS performance (which might be port performance to the VGA index/value port) is about a factor 5 slower, memory performance is about a factor 100 slower. QEMU 0.12.3 and QEMU GIT performance is the same (within the measurement tolerance) and listed only once; QEMU-KVM is much slower (details see below). Test programs can be provided, source code will be released soon. Any ideas why KVM is so slow?

16-color vga is slow because kvm cannot map the framebuffer to the guest (writes are not interpreted as RAM writes). 256+-color vga should be fast, except when switching the vga window. Note it's only fast on average; the first write into a page will be slow as kvm maps it in.

I don't understand: why is 256+-colour mappable and 16-colour not mappable?

Is this a case where TCG would run significantly faster for code blocks that have been detected to access the VGA memory? Which mode are you using? Any ideas for improvement?

Currently when the physical memory map changes (which is what happens when the vga window is updated), kvm drops the entire shadow cache. It's possible to do this only for vga memory, but not easy.

If it's a page fault handled in the kernel, I would expect it to be about as fast as those old VGA DOS-extender drivers which provide the illusion of a single flat mapping, and bank switch on page faults - multiplied by the speed of modern CPUs compared with then. For many graphics things those DOS-extender drivers worked perfectly well. If it's a trap out to qemu on every vga window change, perhaps not quite so well.
-- Jamie
Re: [Qemu-devel] Re: QEMU-KVM and video performance
Avi Kivity wrote: Writes to vga in 16-color mode don't simply set a memory location to a value; instead they change multiple memory locations.

While code is just writing to the VGA memory, not reading(*) and not touching the VGA I/O registers that control the write latches, is it possible in principle to swizzle the format around in memory to make regular writes work? (*) Reading should be ok for some settings of the write latches, I think. I wonder if guests of interest behave like that.

Is this a case where TCG would run significantly faster for code blocks that have been detected to access the VGA memory?

Yes. $ date Wed Apr 21 19:37:38 2015 $ modprobe ktcg ;-)

-- Jamie
Re: [Qemu-devel] Re: QEMU-KVM and video performance
Gerhard Wiesinger wrote: Would it be possible to handle these writes through QEMU directly (without KVM), because performance there is very good (looking at the code there is some pointer arithmetic and some memory writes done)?

I've noticed extremely slow VGA performance too, when installing OSes. It makes the difference between installing in a few minutes, and installing taking hours - just because of the slow VGA. So generally I use qemu for installing old versions of Windows, then change to KVM to run them after installing. Switching between KVM and qemu automatically based on guest code behaviour, and making both memory models and device models compatible at run time, is a difficult thing. I guess it's not worth the difficulty just to speed up VGA.

I think this is very easy to distinguish: 1.) VGA segment A000 is legacy and should be handled through QEMU and not through KVM (because it is much faster). Also 16-color modes should be fast enough there. 2.) All other flat PCI memory accesses should be handled through KVM (there is a specialized driver loaded for that PCI device in the non-legacy OS). Is that easily possible?

No it isn't. Distinguishing addresses is trivial. You've ignored the hard part, which is switching between different virtualisation architectures...

-- Jamie
Re: [Qemu-devel] Re: QEMU-KVM and video performance
Avi Kivity wrote: On 04/21/2010 09:39 PM, Jamie Lokier wrote: Avi Kivity wrote: Writes to vga in 16-color mode don't simply set a memory location to a value; instead they change multiple memory locations.

While code is just writing to the VGA memory, not reading(*) and not touching the VGA I/O registers that control the write latches, is it possible in principle to swizzle the format around in memory to make regular writes work?

Not in software. We can map pages, not cross address lines.

Hence swizzle. You rearrange the data inside the page for the crossed address lines, and undo the swizzle later on demand. That doesn't work for other VGA magic though.

Guests that use 16 color vga are usually of little interest.

Fair enough. We can move on :-) It's been said that the super-slow VGA writes triggering this thread are in 256-colour mode, so there's a different problem. That should be fast, shouldn't it? I vaguely recall extremely slow OS installs I've seen in KVM, which were fast in QEMU (and fast in KVM after installing), were using text mode. Possibly it was Windows 2000, or Windows Server 2003. Text mode should be fast too, shouldn't it? I suppose it's possible that it just looked like text mode and was really 16-colour mode.

-- Jamie
Re: [Qemu-devel] Re: QEMU-KVM and video performance
Gerhard Wiesinger wrote: Hmmm. I'm very new to QEMU and KVM, but at least accessing the virtual HW of QEMU even from KVM must be possible (e.g. memory and port accesses are done on nearly every virtual device) and therefore I'm ending in C code in the QEMU hw/*.c directory. Therefore also the VGA memory area should be accessible from KVM, but with the specialized and fast memory access of QEMU. Am I missing something?

What you're missing is that when KVM calls out to QEMU to handle hw/*.c traps, that call is very slow. It's because the hardware-VM support is a bit slow when the trap happens, and then the call from KVM in the kernel up to QEMU is a bit slow again. Then all the way back. It adds up to a lot, for every I/O operation. When QEMU does the same thing, it's fast because it's inside the same process; it's just a function call. That's why the most often called devices are emulated separately in KVM's kernel code, things like the interrupt controller, timer chip etc. It's also why individual instructions that need help are emulated in KVM's kernel code, instead of passing control up to QEMU just for one instruction.

BTW: Still not clear why performance is low with KVM, since there are no window changes in the testcase involved which could cause a (slow) page fault.

It sounds like a bug. Avi gave suggestions about what to look for. If it fixes my OS install speeds too, I'll be very happy :-) In 256-colour mode, KVM should be writing to the VGA memory at high speed a lot like normal RAM, not trapping at the hardware-VM level, and not calling up to the code in hw/*.c for every byte. You might double-check if your guest is using VGA Mode X. (See Wikipedia.) That was a way to accelerate VGA on real PCs, but it will be slow in KVM for the same reasons as 16-colour mode.
-- Jamie
Re: [Qemu-devel] [PATCH] virtio-spec: document block CMD and FLUSH
Michael S. Tsirkin wrote: I took a stub at documenting CMD and FLUSH request types in virtio block. Christoph, could you look over this please? I note that the interface seems full of warts to me, this might be a first step to cleaning them. One issue I struggled with especially is how type field mixes bits and non-bit values. I ended up simply defining all legal values, so that we have CMD = 2, CMD_OUT = 3 and so on. I also avoided instroducing inhdr/outhdr structures that virtio blk driver in linux uses, I was concerned that nesting tables will confuse the reader. Comments welcome. Signed-off-by: Michael S. Tsirkin m...@redhat.com -- diff --git a/virtio-spec.lyx b/virtio-spec.lyx index d16104a..ed35893 100644 --- a/virtio-spec.lyx +++ b/virtio-spec.lyx @@ -67,7 +67,11 @@ IBM Corporation \end_layout \begin_layout Standard + +\change_deleted 0 1266531118 FIXME: virtio block scsi passthrough section +\change_unchanged + \end_layout \begin_layout Standard @@ -4376,7 +4380,7 @@ struct virtio_net_ctrl_mac { The device can filter incoming packets by any number of destination MAC addresses. \begin_inset Foot -status open +status collapsed \begin_layout Plain Layout Since there are no guarentees, it can use a hash filter orsilently switch @@ -4549,6 +4553,22 @@ blk_size \end_inset . +\change_inserted 0 1266444580 + +\end_layout + +\begin_layout Description + +\change_inserted 0 1266471229 +VIRTIO_BLK_F_SCSI (7) Device supports scsi packet commands. +\end_layout + +\begin_layout Description + +\change_inserted 0 1266444605 +VIRTIO_BLK_F_FLUSH (9) Cache flush command support. 
+\change_unchanged + \end_layout \begin_layout Description @@ -4700,17 +4720,25 @@ struct virtio_blk_req { \begin_layout Plain Layout +\change_deleted 0 1266472188 + #define VIRTIO_BLK_T_IN 0 \end_layout \begin_layout Plain Layout +\change_deleted 0 1266472188 + #define VIRTIO_BLK_T_OUT 1 \end_layout \begin_layout Plain Layout +\change_deleted 0 1266472188 + #define VIRTIO_BLK_T_BARRIER 0x8000 +\change_unchanged + \end_layout \begin_layout Plain Layout @@ -4735,11 +4763,15 @@ struct virtio_blk_req { \begin_layout Plain Layout +\change_deleted 0 1266472204 + #define VIRTIO_BLK_S_OK0 \end_layout \begin_layout Plain Layout +\change_deleted 0 1266472204 + #define VIRTIO_BLK_S_IOERR 1 \end_layout @@ -4759,32 +4791,481 @@ struct virtio_blk_req { \end_layout \begin_layout Standard -The type of the request is either a read (VIRTIO_BLK_T_IN) or a write (VIRTIO_BL -K_T_OUT); the high bit indicates that this request acts as a barrier and - that all preceeding requests must be complete before this one, and all - following requests must not be started until this is complete. 
+ +\change_inserted 0 1266472490 +If the device has VIRTIO_BLK_F_SCSI feature, it can also support scsi packet + command requests, each of these requests is of form: +\begin_inset listings +inline false +status open + +\begin_layout Plain Layout + +\change_inserted 0 1266472395 + +struct virtio_scsi_pc_req { +\end_layout + +\begin_layout Plain Layout + +\change_inserted 0 1266472375 + + u32 type; +\end_layout + +\begin_layout Plain Layout + +\change_inserted 0 1266472375 + + u32 ioprio; +\end_layout + +\begin_layout Plain Layout + +\change_inserted 0 1266474298 + + u64 sector; +\end_layout + +\begin_layout Plain Layout + +\change_inserted 0 1266474308 + +char cmd[]; +\end_layout + +\begin_layout Plain Layout + +\change_inserted 0 1266505809 + + char data[][512]; +\end_layout + +\begin_layout Plain Layout + +\change_inserted 0 1266505825 + +#define SCSI_SENSE_BUFFERSIZE 96 +\end_layout + +\begin_layout Plain Layout + +\change_inserted 0 1266505848 + +u8 sense[SCSI_SENSE_BUFFERSIZE]; +\end_layout + +\begin_layout Plain Layout + +\change_inserted 0 1266472969 + +u32 errors; +\end_layout + +\begin_layout Plain Layout + +\change_inserted 0 1266472979 + +u32 data_len; +\end_layout + +\begin_layout Plain Layout + +\change_inserted 0 1266472984 + +u32 sense_len; +\end_layout + +\begin_layout Plain Layout + +\change_inserted 0 1266472987 + +u32 residual; +\end_layout + +\begin_layout Plain Layout + +\change_inserted 0 1266472375 + + u8 status; +\end_layout + +\begin_layout Plain Layout + +\change_inserted 0 1266472375 + +}; +\end_layout + +\end_inset + + +\change_unchanged + \end_layout \begin_layout Standard -The ioprio field is a hint about the relative priorities of requests to - the device: higher numbers indicate more important requests. +The +\emph on +type +\emph default + of the request is either a read (VIRTIO_BLK_T_IN) +\change_inserted 0 1266495815 +, +\change_unchanged +
Re: [Qemu-devel] [GSoC 2010] Pass-through filesystem support.
Mohammed Gamal wrote: On Mon, Apr 12, 2010 at 12:29 AM, Jamie Lokier ja...@shareable.org wrote: Javier Guerra Giraldez wrote: On Sat, Apr 10, 2010 at 7:42 AM, Mohammed Gamal m.gamal...@gmail.com wrote: On Sat, Apr 10, 2010 at 2:12 PM, Jamie Lokier ja...@shareable.org wrote: To throw a spanner in, the most widely supported filesystem across operating systems is probably NFS, version 2 :-) Remember that Windows usage on a VM is not some rare use case, and it'd be a little bit of a pain from a user's perspective to have to install a third party NFS client for every VM they use. Having something supported on the VM out of the box is a better option IMO. i don't think virtio-CIFS has any more support out of the box (on any system) than virtio-9P. It doesn't, but at least network-CIFS tends to work ok and is the method of choice for Windows VMs - when you can setup Samba on the host (which as previously noted you cannot always do non-disruptively with current Sambas). -- Jamie I think having support for both 9p and CIFS would be the best option. In that case the user will have the option to use either one, depending on how their guests support these filesystems. In that case I'd prefer to work on CIFS support while the 9p effort can still go on. I don't think both efforts are mutually exclusive. What do the rest of you guys think? I only noted NFS because most old OSes do not support CIFS or 9P - especially all the old unixes. I don't think old versions of MS-DOS and Windows (95, 98, ME, Nt4?) even support current CIFS. They need extra server settings to work - such as setting passwords on the server to non-encrypted and other quirks. Meanwhile Windows Vista/2008/7 works better with SMB2, not CIFS, to properly see symlinks and hard links. So there is no really nice out of the box file service which works easily with all guest OSes. I'm guessing that out of all the filesystems, CIFS is the most widely supported in recent OSes (released in the last 10 years). 
But I'm not really sure what the state of CIFS is for non-Windows, non-Linux, non-BSD guests. I'm not sure why 9P is being pursued. Does anything much support it, or do all OSes except quite recent Linux need a custom driver for 9P? Even Linux only got the first commit in the kernel 5 years ago, so probably it was only about 3 years ago that it will have begun appearing in stable distros, if at all. Filesystem passthrough to Linux guests installed in the last couple of years is a useful feature, and I know that for many people that is their only use of KVM, but compared with CIFS' broad support it seems like quite a narrow goal. -- Jamie
Re: [Qemu-devel] [GSoC 2010] Pass-through filesystem support.
Alexander Graf wrote: Also since -net user does support samba exporting already,

This I'm interested in. Last time I tried to use it, the smb= option didn't work because Samba refused to run when launched with qemu's mini config file and launched as a regular user. It needed access to various hard-coded root-owned directories, and spewed lots of errors about it into its logs. On closer inspection, those directories were hard-coded and could not be changed by the config file, nor could the features they were for be disabled. Even if I gave it permission to write to those, by running kvm/qemu as root, there was plenty of reason to worry that each instance of qemu-spawned Samba might interfere with the others and with the host's own, starting with errors spewed into log files by both Sambas. So I had to give up on -net user,smb= completely :-( Is this something that you at SuSE have fixed, or simply never encountered? My problems were with Debian and Ubuntu installations. I suspect they might have fixed some Samba problems by patching in different problems. -- Jamie
Re: [Qemu-devel] [GSoC 2010] Pass-through filesystem support.
Javier Guerra Giraldez wrote: On Sat, Apr 10, 2010 at 7:42 AM, Mohammed Gamal m.gamal...@gmail.com wrote: On Sat, Apr 10, 2010 at 2:12 PM, Jamie Lokier ja...@shareable.org wrote: To throw a spanner in, the most widely supported filesystem across operating systems is probably NFS, version 2 :-) Remember that Windows usage on a VM is not some rare use case, and it'd be a little bit of a pain from a user's perspective to have to install a third party NFS client for every VM they use. Having something supported on the VM out of the box is a better option IMO. i don't think virtio-CIFS has any more support out of the box (on any system) than virtio-9P. It doesn't, but at least network-CIFS tends to work ok and is the method of choice for Windows VMs - when you can setup Samba on the host (which as previously noted you cannot always do non-disruptively with current Sambas). -- Jamie
Re: [Qemu-devel] [GSoC 2010] Pass-through filesystem support.
Mohammed Gamal wrote: Hi Javier, Thanks for the link. However, I'm still concerned with interoperability with other operating systems, including non-Windows ones. I am not sure of how many operating systems actually support 9p, but I'm almost certain that CIFS would be more widely-supported. I am still a newbie as far as all this is concerned, so if anyone has any arguments as to which approach should be taken, I'd be enlightened to hear them. To throw a spanner in, the most widely supported filesystem across operating systems is probably NFS, version 2 :-) -- Jamie
Re: [Qemu-devel] [GSoC 2010] Pass-through filesystem support.
Mohammed Gamal wrote: 2- With respect to CIFS. I wonder how the shares are supposed to be exposed to the guest. Should the Samba server be modified to be able to use unix domain sockets instead of TCP ports and then QEMU communicating on these sockets. With that approach, how should the guest be able to see the exposed share? And what is the problem of using Samba with TCP ports?

One problem with TCP ports is it only works when the guest's network is up :) You can't boot from that. It also makes things fragile or difficult if the guest work you are doing involves fiddling with the network settings. Doing it over virtio-serial would have many benefits. On the other hand, Samba+TCP+CIFS does have the advantage of working with virtually all guest OSes, including Linux / BSDs / Windows / MacOSX / Solaris etc. 9P only works with Linux as far as I know.

A big problem with Samba at the moment is it's not possible to instantiate multiple instances of Samba any more, and not as a non-root user. That's because it contains some hard-coded paths to directories of run-time state, at least on Debian/Ubuntu hosts where I have tried and failed to use qemu's smb option, and there is no config file option to disable that or even change all the paths. Patching Samba to make per-user instantiations possible again would go a long way to making it useful for filesystem passthrough. Patching it so you can turn off all the fancy features and have it _just_ serve a filesystem with the most basic necessary authentication would be even better.

-- Jamie
Re: [Qemu-devel] Re: [PATCH v3 1/1] Shared memory uio_pci driver
Avi Kivity wrote: ioctls encode the buffer size into the ioctl number, so in theory strace doesn't need to be taught about an ioctl to show its buffer. Unfortunately ioctl numbers don't always follow that rule :-( But maybe that's just awful proprietary drivers that I've seen. Anyway, strace should be taught how to read kernel headers to get ioctl types ;-) -- Jamie
Re: [Qemu-devel] Re: [PATCH v3 0/2] Inter-VM shared memory PCI device
Cam Macdonell wrote: An irqfd can only trigger a single vector in a guest. Right now I only have one eventfd per guest. So ioeventfd/irqfd restricts the current implementation to a single vector that a guest can trigger. Without irqfd, eventfds can be used like registers - a guest writes the number of the vector it wants to trigger - but as you point out it is racy.

It's not racy if you use a pipe instead of eventfd. :-) Actually, why not? A byte pipe between guests would be more versatile. Could it even integrate with virtio-serial, somehow?

-- Jamie
Re: [Qemu-devel] Re: Completing big real mode emulation
Avi Kivity wrote: Either way - then we should make the goal of the project to support those old boot loaders. IMHO it should contain visibility. Doing theoretical stuff is just less fun for all parties. Or does that stuff work already? Mostly those old guests aged beyond usefulness. They are still broken, but nobody installs new images. Old images installed via workarounds work. Hey :) I still install old OSes occasionally, so that I can build and test code that will run on other people's still-running old machines. -- Jamie
Re: [Qemu-devel] Re: Ideas wiki for GSoC 2010
Avi Kivity wrote: On 03/10/2010 11:30 PM, Luiz Capitulino wrote: 2. Do we have kvm-specific projects? Can they be part of the QEMU project or do we need a different mentoring organization for it? Something really interesting is kvm-assisted tcg. I'm afraid it's a bit too complicated for GSoC. Is this simpler: kvm-assisted user-mode emulation (no TCG involved)? -- Jamie
Re: [Qemu-devel] [PATCH] Inter-VM shared memory PCI device
Paul Brook wrote: In a cross environment that becomes extremely hairy. For example the x86 architecture effectively has an implicit write barrier before every store, and an implicit read barrier before every load.

Btw, x86 doesn't have any implicit barriers due to ordinary loads. Only stores and atomics have implicit barriers, afaik.

As of March 2009[*] Intel guarantees that memory reads occur in order (they may only be reordered relative to writes). It appears AMD do not provide this guarantee, which could be an interesting problem for heterogeneous migration...

(Summary: At least on AMD64, it does too, for normal accesses to naturally aligned addresses in write-back cacheable memory.)

Oh, that's interesting. Way back when I guess we knew writes were in order and it wasn't explicit that reads were, hence smp_rmb() using a locked atomic. Here is a post by Nick Piggin from 2007 with links to Intel _and_ AMD documents asserting that reads to cacheable memory are in program order: http://lkml.org/lkml/2007/9/28/212 Subject: [patch] x86: improved memory barrier implementation Links to documents: http://developer.intel.com/products/processor/manuals/318147.pdf http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf The Intel link doesn't work any more, but the AMD one does. Nick asserts both manufacturers are committed to in-order loads from cacheable memory for the x86 architecture.

I have just read the AMD document, and it is in there (but not completely obviously), in section 7.2. The implicit load-load and store-store barriers are only guaranteed for normal cacheable accesses on naturally aligned boundaries to WB [write-back cacheable] memory. There are also implicit load-store barriers but not store-load. Note that the document covers AMD64; it does not say anything about their (now old) 32-bit processors.

[*] The most recent docs I have handy. Up to and including Core-2 Duo.
Are you sure the read ordering applies to 32-bit Intel and AMD CPUs too? Many years ago, before 64-bit x86 existed, I recall discussions on LKML where it was made clear that stores were performed in program order. If it were known at the time that loads were performed in program order on 32-bit x86s, I would have expected that to have been mentioned by someone.

-- Jamie
Re: [Qemu-devel] [PATCH] Inter-VM shared memory PCI device
Paul Brook wrote: However, coherence could be made host-type-independent by the host mapping and unmapping pages, so that each page is only mapped into one guest (or guest CPU) at a time. Just like some clustering filesystems do to maintain coherence. You're assuming that a TLB flush implies a write barrier, and a TLB miss implies a read barrier. I'd be surprised if this were true in general.

The host driver itself can issue full barriers at the same time as it maps pages on TLB miss, and would probably have to interrupt the guest's SMP KVM threads to insert a full barrier when broadcasting a TLB flush on unmap.

-- Jamie
Re: [Qemu-devel] [PATCH] Inter-VM shared memory PCI device
Avi Kivity wrote: On 03/08/2010 03:03 PM, Paul Brook wrote: On 03/08/2010 12:53 AM, Paul Brook wrote: Support an inter-vm shared memory device that maps a shared-memory object as a PCI device in the guest. This patch also supports interrupts between guests by communicating over a unix domain socket. This patch applies to the qemu-kvm repository. No. All new devices should be fully qdev based. I suspect you've also ignored a load of coherency issues, especially when not using KVM. As soon as you have shared memory in more than one host thread/process you have to worry about memory barriers. Shouldn't it be sufficient to require the guest to issue barriers (and to ensure tcg honours the barriers, if someone wants this with tcg)? In a cross environment that becomes extremely hairy. For example the x86 architecture effectively has an implicit write barrier before every store, and an implicit read barrier before every load. Ah yes. For cross tcg environments you can map the memory using mmio callbacks instead of directly, and issue the appropriate barriers there.

That makes sense. It will force an mmio callback for every access to the shared memory, which is ok for correctness but vastly slower when running in TCG compared with KVM. But it's hard to see what else could be done - those implicit write barriers on x86 have to be emulated somehow. For TCG without inter-vm shared memory, those barriers aren't a problem.

Non-random-corruption guest behaviour is paramount, so I hope the inter-vm device will add those mmio callbacks for the cross-arch case before it sees much action. (Strictly, it isn't cross-arch, but host-has-more-relaxed-implicit-memory-model-than-guest. I'm assuming TCG doesn't reorder memory instructions.)

-- Jamie
Re: [Qemu-devel] [PATCH] Inter-VM shared memory PCI device
Paul Brook wrote: On 03/08/2010 12:53 AM, Paul Brook wrote: Support an inter-vm shared memory device that maps a shared-memory object as a PCI device in the guest. This patch also supports interrupts between guests by communicating over a unix domain socket. This patch applies to the qemu-kvm repository. No. All new devices should be fully qdev based. I suspect you've also ignored a load of coherency issues, especially when not using KVM. As soon as you have shared memory in more than one host thread/process you have to worry about memory barriers. Shouldn't it be sufficient to require the guest to issue barriers (and to ensure tcg honours the barriers, if someone wants this with tcg)? In a cross environment that becomes extremely hairy. For example the x86 architecture effectively has an implicit write barrier before every store, and an implicit read barrier before every load.

Btw, x86 doesn't have any implicit barriers due to ordinary loads. Only stores and atomics have implicit barriers, afaik.

-- Jamie
Re: [Qemu-devel] [PATCH] Inter-VM shared memory PCI device
Alexander Graf wrote: Or we could put in some code that tells the guest the host shm architecture and only accept x86 on x86 for now. If anyone cares for other combinations, they're free to implement them. Seriously, we're looking at an interface designed for kvm here. Let's please keep it as simple and fast as possible for the actual use case, not some theoretically possible ones. The concern is that a perfectly working guest image running on kvm, the guest being some OS or app that uses this facility (_not_ a kvm-only guest driver), is later run on qemu on a different host, and then mostly works except for some silent data corruption. That is not a theoretical scenario. Well, the bit with this driver is theoretical, obviously :-) But not the bit about moving to a different host.

-- Jamie
Re: [Qemu-devel] [PATCH] Inter-VM shared memory PCI device
Paul Brook wrote: Support an inter-vm shared memory device that maps a shared-memory object as a PCI device in the guest. This patch also supports interrupts between guests by communicating over a unix domain socket. This patch applies to the qemu-kvm repository. No. All new devices should be fully qdev based. I suspect you've also ignored a load of coherency issues, especially when not using KVM. As soon as you have shared memory in more than one host thread/process you have to worry about memory barriers.

Yes. Guest-observable behaviour is likely to be quite different on different hosts, especially between x86 and non-x86 hosts, which is not good at all for emulation. Memory barriers performed by the guest would help, but would not remove the fact that behaviour would vary between different host types if a guest doesn't call them. I.e. you could accidentally have some guests working fine for years on x86 hosts, which gain subtle memory corruption as soon as you run them on a different host. This is acceptable when recompiling code for different architectures, but it's asking for trouble with binary guest images which aren't supposed to depend on host architecture.

However, coherence could be made host-type-independent by the host mapping and unmapping pages, so that each page is only mapped into one guest (or guest CPU) at a time. Just like some clustering filesystems do to maintain coherence.

-- Jamie
Re: [Qemu-devel] Re: Slowdowns comparing qemu-kvm.git to qemu.git: vcpu/thread scheduling differences
Anthony Liguori wrote: No, basically, the problem will boil down to, the IO thread is select()'d waiting for an event to occur. However, you've done something in the VCPU thread that requires the IO thread to run its main loop. You need to use qemu_notify_event() to force the IO thread to break out of select(). Debugging these problems is very difficult and the complexity here is the main reason the IO thread still hasn't been enabled by default in qemu.git.

It is difficult. One approach to debugging them, in general, is to have a special debugging mode where the select() loop wakes up repeatedly at high speed and checks all the conditions it's supposed to be waiting for that _should_ have triggered a wakeup, and if any of them are asserted for two timed wakeups (or some sufficient duration, checked by gettimeofday), emit a bug message because it probably is one. I don't know if that could be applied in qemu's event loop.

-- Jamie
Re: [Qemu-devel] [PATCH] Add definitions for current cpu models..
Dor Laor wrote:

x86 qemu64
x86 phenom
x86 core2duo
x86 kvm64
x86 qemu32
x86 coreduo
x86 486
x86 pentium
x86 pentium2
x86 pentium3
x86 athlon
x86 n270

I wonder if kvm32 would be good, for symmetry if nothing else.

-- Jamie
Re: [Qemu-devel] [PATCH] Add definitions for current cpu models..
john cooper wrote: kvm itself can modify flags exported from qemu to a guest. I would hope for an option to request that qemu doesn't run if the guest won't get the cpuid flags requested on the command line.

-- Jamie
Re: [Qemu-devel] [PATCH] Add definitions for current cpu models..
john cooper wrote: I foresee wanting to iterate over the models and pick the latest one which a host supports - on the grounds that you have done the hard work of ensuring it is a reasonably good performer, while probably working on another host of similar capability when a new host is made available. That's a fairly close use case to that of safe migration which was one of the primary motivations to identify the models being discussed. Although presentation and administration of such was considered the domain of management tools.

My hypothetical script which iterates over models in that way is a management tool, and would use qemu to help do its job. Do you mean that more powerful management tools to support safe migration will maintain _their own_ processor model tables, and perform their calculations using their own tables instead of querying qemu, and therefore not have any need of qemu's built in table? If so, I favour more strongly Anthony's suggestion that the processor model table lives in a config file (eventually), as that file could be shared between management tools and qemu itself without duplication.

-- Jamie
Re: [Qemu-devel] [PATCH] Add definitions for current cpu models..
john cooper wrote: I can appreciate the argument above, however the goal was choosing names with some basis in reality. These were recommended by our contacts within Intel, are used by VmWare to describe their similar cpu models, and arguably have fallen to defacto usage as evidenced by such sources as: http://en.wikipedia.org/wiki/Conroe_(microprocessor) http://en.wikipedia.org/wiki/Penryn_(microprocessor) http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)

(Aside: I can confirm they haven't fallen into de facto usage anywhere in my vicinity :-) I wonder if the contacts within Intel are living in a bit of a bubble where these names are more familiar than the outside world.)

I think we can all agree that there is no point looking for a familiar -cpu naming scheme because there aren't any familiar and meaningful names these days.

used by VmWare to describe their similar cpu models

If the same names are being used, I see some merit in qemu's list matching VMware's cpu models *exactly* (in capabilities, not id strings), to aid migration from VMware. Is that feasible? Do they match already?

I suspect whatever we choose of reasonable length as a model tag for -cpu some further detail is going to be required. That was the motivation to augment the table as above with an instance of a LCD for that associated class. I'm not a typical user: I know quite a lot about x86 architecture; I just haven't kept up to date enough to know the code/model names. Typical users will know less about them. Understood. One thought I had to further clarify what is going on under the hood was to dump the cpuid flags for each model as part of (or in addition to) the above table. But this seems a bit extreme and kvm itself can modify flags exported from qemu to a guest.

Here's another idea. It would be nice if qemu could tell the user which of the built-in -cpu choices is the most featureful subset of their own host. With -cpu host implemented, finding that is probably quite easy.
Users with multiple hosts will get a better feel for what the -cpu names mean that way, probably better than any documentation would give them, because they probably have not much idea what CPU families they have anyway. (cat /proc/cpuinfo doesn't clarify, as I found). And it would give a simple, effective, quick indication of what they must choose if they want a VM image that runs on more than one of their hosts without a management tool.

-- Jamie
Re: [Qemu-devel] [PATCH] Add definitions for current cpu models..
Anthony Liguori wrote: On 01/18/2010 10:45 AM, john cooper wrote:

x86 Conroe Intel Celeron_4x0 (Conroe/Merom Class Core 2)
x86 Penryn Intel Core 2 Duo P9xxx (Penryn Class Core 2)
x86 Nehalem Intel Core i7 9xx (Nehalem Class Core i7)
x86 Opteron_G1 AMD Opteron 240 (Gen 1 Class Opteron)
x86 Opteron_G2 AMD Opteron 22xx (Gen 2 Class Opteron)
x86 Opteron_G3 AMD Opteron 23xx (Gen 3 Class Opteron)

I'm very much against having -cpu Nehalem. The whole point of this is to make things easier for a user and for most of the users I've encountered, -cpu Nehalem is just as obscure as -cpu qemu64,-sse3,+vmx,...

When I saw that table just now, I had no idea whether Nehalem is newer and more advanced than Penryn, or the other way around. I also have no idea if Core i7 is newer than Core 2 Duo or not. I'm not a typical user: I know quite a lot about x86 architecture; I just haven't kept up to date enough to know the code/model names. Typical users will know less about them.

It's only from seeing the G1/G2/G3 order that I guess they are listed in ascending order of functionality. Naturally, if I were choosing one, I'd want to choose the one with the most capabilities that works on whatever my host hardware provides.

-- Jamie
Re: [Qemu-devel] [PATCH] Add definitions for current cpu models..
Chris Wright wrote: * Anthony Liguori (anth...@codemonkey.ws) wrote: I'm very much against having -cpu Nehalem. The whole point of this is to make things easier for a user and for most of the users I've encountered, -cpu Nehalem is just as obscure as -cpu qemu64,-sse3,+vmx,... What name will these users know? FWIW, it makes sense to me as it is. 2001, 2005, 2008, 2010 :-)

-- Jamie
Re: [Qemu-devel] [PATCH] Add definitions for current cpu models..
john cooper wrote: As before a cpu feature 'check' option is added which warns when feature flags (either implicit in a cpu model or explicit on the command line) would have otherwise been quietly unavailable to a guest:

# qemu-system-x86_64 ... -cpu Nehalem,check
warning: host cpuid 0000_0001 lacks requested flag 'sse4.2' [0x00100000]
warning: host cpuid 0000_0001 lacks requested flag 'popcnt' [0x00800000]

That's a nice feature. Can we have a 'checkfail' option which refuses to run if a requested capability isn't available? Thanks.

I foresee wanting to iterate over the models and pick the latest one which a host supports - on the grounds that you have done the hard work of ensuring it is a reasonably good performer, while probably working on another host of similar capability when a new host is made available.

-- Jamie
Re: [Qemu-devel] [PATCH] Add definitions for current cpu models..
Anthony Liguori wrote: On 01/19/2010 02:03 PM, Chris Wright wrote: * Anthony Liguori (anth...@codemonkey.ws) wrote: I'm very much against having -cpu Nehalem. The whole point of this is to make things easier for a user and for most of the users I've encountered, -cpu Nehalem is just as obscure as -cpu qemu64,-sse3,+vmx,... What name will these users know? FWIW, it makes sense to me as it is. Whatever is in /proc/cpuinfo.

There is no mention of Nehalem in /proc/cpuinfo. My 5 /proc/cpuinfos say:

Genuine Intel(R) CPU T2500 @ 2.00GHz
Intel(R) Xeon(TM) CPU 3.00GHz
Intel(R) Xeon(R) CPU E5335 @ 2.00GHz
Intel(R) Xeon(TM) CPU 2.80GHz
Intel(R) Xeon(R) CPU X5482 @ 3.20GHz

I'm not sure if that's any more helpful :-) Especially the first one. I don't think of my laptop as having a T2500. I think of it as having a 32-bit Core Duo. And I have no idea what the different types of Xeon are. But then, I couldn't tell you whether they are Nehalems or Penryns either, and I'm quite sure the owners couldn't either.

$ grep name /proc/cpuinfo
model name : QEMU Virtual CPU version 0.9.1

If only they were all so clear :-)

-- Jamie
Re: [Qemu-devel] Re: [PATCH v2] virtio-blk physical block size
Avi Kivity wrote: On 01/05/2010 02:56 PM, Rusty Russell wrote: Those should be the same for any sane interface. They are for classical disk devices with larger block sizes (MO, s390 dasd) and also for the now appearing 4k sector scsi disks. But in the ide world people are concerned about dos/windows legacy compatibility so they came up with a nasty hack:

- there is a physical block size as used by the disk internally (4k initially)
- all the interfaces to the operating system still happen in the traditional 512 byte blocks to not break any existing assumptions
- to make sure modern operating systems can optimize for the larger physical sectors the disks expose this size, too.
- even worse disks can also have alignment hacks for the traditional DOS partition tables, so that the 512 byte block zero might even have an offset into the first larger physical block. This is also exposed in the ATA identify information.

All in all I don't think this mess is a good idea to replicate in virtio. Virtio by definition requires virtualization aware guests, so we should just follow the SCSI way of larger real block sizes here. Yes. The current VIRTIO_BLK_F_BLK_SIZE says please use this block size. We haven't actually specified what happens if the guest doesn't, but the spec says must, and the Linux implementation does so AFAICT. If we want a soft size, we could add that as a separate feature. No - I agree with Christoph, there's no reason to use a 512/4096 monstrosity with virtio.

It would be good if virtio relayed the backing device's basic topology hints, so:

- If the backing dev is a real disk with 512-byte sectors, virtio should indicate 512-byte blocks to the guest.
- If the backing dev is a real disk with 4096-byte sectors, virtio should indicate 4096-byte blocks to the guest.
With databases and filesystems, if you care about data integrity:

- If the backing dev is a real disk with 4096-byte sectors, or a file whose access is through a 4096-byte-per-page cache, virtio must indicate 4096-byte blocks, otherwise guest journalling is not host-powerfail safe.

You get the idea. If there is only one parameter, it really should be at least as large as the smallest unit which may be corrupted by writes when errors occur.

-- Jamie
Re: [Qemu-devel] Re: [PATCH v2] virtio-blk physical block size
Avi Kivity wrote: Physical block size is what the logical block size would have been if software didn't suck. In theory they should be the same, but since compatibility reasons clamp the logical block size to 512, they have to differ. A disk may have a physical block size of 4096 and emulate logical block size of 512 on top of that using read-modify-write. Or so I understand it.

I think that's right, but a side effect is that if you get a power failure during the read-modify-write, bytes anywhere in the 4096-byte sector may be incorrect, so journalling (etc.) needs to use 4096 byte blocks for data integrity, even though the drive emulates smaller writes.

-- Jamie
Re: [Qemu-devel] [PATCH] Add definitions for current cpu models..
john cooper wrote:

{
+.name = "Merom",
+.level = 2,
+.vendor1 = CPUID_VENDOR_INTEL_1,
+.vendor2 = CPUID_VENDOR_INTEL_2,
+.vendor3 = CPUID_VENDOR_INTEL_3,
+.family = 6, /* P6 */
+.model = 2,
+.stepping = 3,
+.features = PPRO_FEATURES |
+/* these features are needed for Win64 and aren't fully implemented */
+CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA |
+/* this feature is needed for Solaris and isn't fully implemented */
+CPUID_PSE36,
+.ext_features = CPUID_EXT_SSE3, /* from qemu64 */

Isn't SSE3 a generic feature on these Intel CPUs, so this comment is unnecessary? Or is SSE3 not present on a real Merom? If so, wouldn't it be better to omit it?

+.ext2_features = (PPRO_FEATURES & 0x0183F3FF) |

Could we have a meaningful name for the magic number, please? Maybe even a:

#define PPRO_EXT2_FEATURES (PPRO_FEATURES & PPRO_EXT2_MASK)
#define PPRO_EXT2_MASK (CPUID_... | CPUID_... | ...) /* Fill in. */

+CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX,
+.ext3_features = CPUID_EXT3_SVM, /* from qemu64 */
+.xlevel = 0x8000000A,
+.model_id = "Intel Merom Core 2",
+},

Does this mean requesting an Intel Merom will give the guest AMD's SVM capability? That's handy for virtualisation, but not an accurate CPU model. It seems inappropriate to name it Merom, with model_id "Intel Merom Core 2", if it's adding extra qemu-specific capabilities. I would think few guests are likely to need the nested-SVM capability, so it could be omitted when Merom is requested, and added as an additional feature on request from the command line, just like other cpuid features can be added.
+{
+.name = "Penryn",
+.level = 2,
+.vendor1 = CPUID_VENDOR_INTEL_1,
+.vendor2 = CPUID_VENDOR_INTEL_2,
+.vendor3 = CPUID_VENDOR_INTEL_3,
+.family = 6, /* P6 */
+.model = 2,
+.stepping = 3,
+.features = PPRO_FEATURES |
+/* these features are needed for Win64 and aren't fully implemented */
+CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA |
+/* this feature is needed for Solaris and isn't fully implemented */
+CPUID_PSE36,
+.ext_features = CPUID_EXT_SSE3 |
+CPUID_EXT_CX16 | CPUID_EXT_SSSE3 | CPUID_EXT_SSE41,
+.ext2_features = (PPRO_FEATURES & 0x0183F3FF) |
+CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX,
+.ext3_features = CPUID_EXT3_SVM,
+.xlevel = 0x8000000A,
+.model_id = "Intel Penryn Core 2",
+},

Same comments as above for Merom about SVM and the PPRO_FEATURES mask. You don't include the "from qemu64" comments this time. Is there a reason?

+{
+.name = "Nehalem",
+.level = 2,
+.vendor1 = CPUID_VENDOR_INTEL_1,
+.vendor2 = CPUID_VENDOR_INTEL_2,
+.vendor3 = CPUID_VENDOR_INTEL_3,
+.family = 6, /* P6 */
+.model = 2,
+.stepping = 3,
+.features = PPRO_FEATURES |
+/* these features are needed for Win64 and aren't fully implemented */
+CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA |
+/* this feature is needed for Solaris and isn't fully implemented */
+CPUID_PSE36,
+.ext_features = CPUID_EXT_SSE3 |
+CPUID_EXT_CX16 | CPUID_EXT_SSSE3 | CPUID_EXT_SSE41 |
+CPUID_EXT_SSE42 | CPUID_EXT_POPCNT,
+.ext2_features = (PPRO_FEATURES & 0x0183F3FF) |
+CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX,
+.ext3_features = CPUID_EXT3_SVM,
+.xlevel = 0x8000000A,
+.model_id = "Intel Nehalem Core i7",
+},

Same as previous.

+{
+.name = "Opteron_G1",
+.level = 5,
+.vendor1 = CPUID_VENDOR_INTEL_1,
+.vendor2 = CPUID_VENDOR_INTEL_2,
+.vendor3 = CPUID_VENDOR_INTEL_3,

Someone else has already enquired - why Intel vendor id?
+.family = 15,
+.model = 6,
+.stepping = 1,
+.features = PPRO_FEATURES |
+/* these features are needed for Win64 and aren't fully implemented */
+CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA |
+/* this feature is needed for Solaris and isn't fully implemented */
+CPUID_PSE36,
+.ext_features = CPUID_EXT_SSE3 | CPUID_EXT_MONITOR,
+.ext2_features = (PPRO_FEATURES & 0x0183F3FF) |
+CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX,
+.ext3_features = CPUID_EXT3_SVM,
+.xlevel = 0x80000008,
+.model_id = "AMD Opteron G1",
+},

Why do the AMD models have CPUID_EXT_MONITOR but the Intel ones don't? Is it correct for the CPU models? Even a lowly 32-bit Intel Core has MONITOR. Is it omitted for performance? In that case shouldn't it be omitted for AMD too?
Re: [Qemu-devel] Re: [PATCH] Add VirtIO Frame Buffer Support
Avi Kivity wrote: On 11/03/2009 12:09 AM, Alexander Graf wrote: When we want to create a full VirtIO based machine, we're still missing graphics output. Fortunately, Linux provides us with most of the frameworks to render text and everything, we only need to implement a transport. So this is a frame buffer backend written for VirtIO. Using this and my patch to qemu, you can use paravirtualized graphics. What does this do that cirrus and/or vmware-vga don't?

*This* virtio-fb doesn't, but one feature I think a lot of users (including me) would like is:

- Option to resize the guest desktop when the host desktop / host window / VNC client resizes.
- Tell the guest to provide multiple desktops when the host has multiple desktops, so things like twin monitors work nicely with guests.
- Relay EDID/Xrandr information and updates from host to guest, and generally handle hotplugging host monitors nicely.

Are there any real hardware standards worth emulating which do that?

-- Jamie
Re: [Qemu-devel] Re: [PATCH] whitelist host virtio networking features [was Re: qemu-kvm-0.11 regression, crashes on older ...]
Michael Tokarev wrote: If you want kvm to behave like this, wrap it into a trivial shell script that restarts the guest. True, kvm has enough crash-bugs elsewhere that I already have to deal with that. It'd be nice to distinguish kvm/qemu bugs from guest bugs, though :-) kvm/qemu also has lock-up bugs, where it's spinning at 100% and the guest seems to be stuck (even though the VNC server continues to work). That's a bit harder to fix with a wrapper script.

-- Jamie
Re: [Qemu-devel] Re: Release plan for 0.12.0
Michael S. Tsirkin wrote: On Wed, Oct 14, 2009 at 09:17:15AM -0500, Anthony Liguori wrote: Michael S. Tsirkin wrote: Looks like Or has abandoned it. I have an updated version which works with new APIs, etc. Let me post it and we'll go from there. I'm generally inclined to oppose the functionality as I don't think it offers any advantages over the existing backends. I patch it in and use it all the time. It's much easier to setup on a random machine than a bridged config. Having two things that do the same thing is just going to lead to user confusion. They do not do the same thing. With raw socket you can use windows update without a bridge in the host, with tap you can't. On the other hand, with raw socket, guest Windows can't access files on the host's Samba share can it? So it's not that useful even for Windows guests. If the problem is tap is too hard to setup, we should try to simplify tap configuration. The problem is bridge is too hard to setup. Simplifying that is a good idea, but outside the scope of the qemu project. I venture it's important enough for qemu that it's worth working on that. Something that looks like the raw socket but behaves like an automatically instantiated bridge attached to the bound interface would be a useful interface. I don't have much time, but I'll help anybody who wants to do that.

-- Jamie
Re: [PATCH v2 3/9] provide in-kernel ioapic
Anthony Liguori wrote: We already have the single device model implementation and the limitations are well known. The best way to move forward is for someone to send out patches implementing separate device models. At that point, it becomes a discussion of two concrete pieces of code versus hand waving.

Out of curiosity now, what _are_ the behavioural differences between the in-kernel irqchip and the qemu one? Are the differences significant to guests, such that it might be necessary to disable the in-kernel irqchip for some guests, or conversely, necessary to use KVM for some guests? Thanks,

-- Jamie
Re: [Qemu-devel] Re: [PATCH 2/3] qemu: make cirrus init value pci spec compliant
Gleb Natapov wrote: But KVM doesn't support it (memory is always writable). That seems like something which could be fixed.

-- Jamie
Re: [Qemu-devel] Re: [PATCH v2 3/9] provide in-kernel ioapic
Gleb Natapov wrote: On Thu, Oct 08, 2009 at 06:42:07PM +0200, Avi Kivity wrote: On 10/08/2009 06:34 PM, Gleb Natapov wrote: So suppose I have simple watchdog device that required to be poked every second, otherwise it resets a computer. On migration we have to migrate time elapsed since last poke, but if device doesn't expose it to software in any way you are saying we can recreate is some other way? The time is exposed (you can measure it by poking the device and The time yes, not its internal representation. What if one implementation stores how much time passed and another how much time left. One counts in wall clock another only when guest runs. etc... and this is a device with only one write only register.

In that case you can decide between calling it two different devices (which have the same guest-visible behaviour but are not interchangeable), or calling them different implementations of one device - by adding a little more code to save state in a common format. (Although they may have to be different devices for qemu configuration, it's ok for them to have the same PCI IDs and for the guest not to know the difference.)

For your watchdog example, it's not hard to pick a saved state which works for both. ioapic will be harder to find a useful common saved state, and there might need to be an *optional hint* section (e.g. for selecting the next CPU to get an interrupt), but I think it would be worth it in this case. Being able to load a KVM image into TCG and vice versa is quite useful sometimes. E.g. I've had to do some OS installs using TCG at first, then switch to KVM later for performance.

-- Jamie
Re: [Qemu-devel] Re: [PATCH v2 4/9] provide in-kernel apic
Glauber Costa wrote: It ensures the two models are compatible. Since they're the same device from the point of view of the guest, there's no reason for them to have different representations or to be incompatible. live migration between something that has in-kernel irqchip and something that has not is likely to be a completely untested thing. And this is the only reason we might think of it as the same device. I don't see any value in supporting this combination Not just live migration. ACPI sleep + savevm + loadvm + ACPI resume, for example. -- Jamie
Re: [Qemu-devel] Re: [PATCH 2/3] qemu: make cirrus init value pci spec compliant
Michael S. Tsirkin wrote: More long-term, we have duplication between reset and init routines. Maybe devices really should have init and cleanup, and on reset we'd clean up all devices and then init them again? It sounds like a good idea to me. That is, after all, what hardware reset often does :-) -- Jamie
Re: [Qemu-devel] Re: [PATCH v2 4/9] provide in-kernel apic
Glauber Costa wrote: On Fri, Oct 09, 2009 at 11:06:41AM +0100, Jamie Lokier wrote: Glauber Costa wrote: It ensures the two models are compatible. Since they're the same device from the point of view of the guest, there's no reason for them to have different representations or to be incompatible. live migration between something that has in-kernel irqchip and something that has not is likely to be a completely untested thing. And this is the only reason we might think of it as the same device. I don't see any value in supporting this combination Not just live migration. ACPI sleep + savevm + loadvm + ACPI resume, for example. Yes, but in this case too, I'd expect the irqchipness of qemu not to change. If I've just been sent an image produced by someone running KVM, and my machine is not KVM-capable, or I cannot upgrade the KVM kernel module because it's in use by other VMs (had this problem a few times), there's no choice but to change the irqchipness. -- Jamie
Re: [Qemu-devel] Re: [PATCH v2 3/9] provide in-kernel ioapic
Glauber Costa wrote: On Thu, Oct 08, 2009 at 06:22:48PM +0200, Gleb Natapov wrote: On Thu, Oct 08, 2009 at 06:17:57PM +0200, Avi Kivity wrote: On 10/08/2009 06:07 PM, Jamie Lokier wrote: Haven't we already confirmed that it *isn't* just an ioapic accelerator because you can't migrate between the in-kernel ioapic and qemu's ioapic? We haven't confirmed it. Both implement the same spec, and if you can't migrate between them, one of them has a bug (for example, qemu ioapic doesn't implement polarity - but it's still just a bug). Are you saying that a HW spec (that only describes software visible behavior) completely defines the implementation? No other internal state is needed that may be done differently by different implementations? Most specifications leave a lot as implementation specific. It's not hard to imagine a case in which both devices will follow the spec correctly (no bugs involved), and yet differ in the implementation. Avi's not saying the implementations won't differ. I believe he's saying that implementation-specific states don't need to be saved if they have no effect on guest visible behaviour. -- Jamie
Re: [Qemu-devel] Re: [PATCH v2 3/9] provide in-kernel ioapic
Avi Kivity wrote: On 10/08/2009 03:49 PM, Anthony Liguori wrote: Glauber Costa wrote: This patch provides kvm with an in-kernel ioapic. We are currently not enabling it. The code is heavily based on what's in qemu-kvm.git. It really ought to be its own file and own device model. Having the code mixed in with ioapic.c is confusing because it's unclear what code is in use when the in-kernel model is used. I disagree. It's the same device with the same guest-visible interface and the same host-visible interface (save/restore, 'info ioapic' if we write one). Splitting it into two files will only result in code duplication. Think of it as an ioapic accelerator. Haven't we already confirmed that it *isn't* just an ioapic accelerator because you can't migrate between the in-kernel ioapic and qemu's ioapic? Imho, if they cannot be swapped transparently, they are different device emulations. Of course there's nothing wrong with sharing lots of code. Maybe ioapic.c and ioapic-kvm.c, with shared code in ioapic-common.c? -- Jamie
Re: [Qemu-devel] Re: [PATCH 2/3] qemu: make cirrus init value pci spec compliant
Avi Kivity wrote: On 10/08/2009 06:06 PM, Michael S. Tsirkin wrote: On Thu, Oct 08, 2009 at 05:29:29PM +0200, Avi Kivity wrote: On 10/08/2009 04:52 PM, Michael S. Tsirkin wrote: PCI memory should be disabled at reset, otherwise we might claim transactions at address 0. I/O should also be disabled, although for cirrus it is harmless to enable it as we do not have an I/O bar. Note: need bios fix for this patch to work: currently pc-bios incorrectly assumes that it does not need to enable i/o unless the device has an i/o bar. Signed-off-by: Michael S. Tsirkin m...@redhat.com This needs to be conditional on the machine type. Old machines must retain this code for live migration to work (we need to live migrate the bios, so we can't assume the bios fix is in during live migration from older qemus). No, if you migrate from older qemu you will be fine as command is enabled there on init. Right. No, I think Avi was right the first time. Migrating from an older qemu will be fine at first, but at the next reset _following_ migration, it'll be running the old BIOS on a new qemu and fail. -- Jamie
Re: [Qemu-devel] Re: Extending virtio_console to support multiple ports
Alan Cox wrote: - Then, are we certain that there's no case where the tty layer will call us with some lock held or in an atomic context ? To be honest, I've totally lost track of the locking rules in tty land lately so it might well be ok, but something to verify. Some of the less well behaved line disciplines do this and always have done. I had a backtrace in my kernel log recently which looked like that, while doing PPP over Bluetooth RFCOMM. It resulted in AppArmor complaining that its hook was being called in irq context. -- Jamie
Re: [Qemu-devel] Re: Notes on block I/O data integrity
Christoph Hellwig wrote: On Wed, Aug 26, 2009 at 07:57:55PM +0100, Jamie Lokier wrote: Christoph Hellwig wrote: what about LVM? I've read somewhere that it used to just eat barriers used by XFS, making it less safe than simple partitions. Oh, any additional layers open up other cans of worms. On Linux until very recently using LVM or software raid means only disabled write caches are safe. I believe that's still true except if there's more than one backing drive, so software RAID still isn't safe. Did that change? Yes, it did change. I will recommend to keep doing what people caring for their data have been doing since these volatile write caches came up: turn them off. Unfortunately I tried that on a batch of 1000 or so embedded thingies with ext3, and the write performance plummeted. They are the same thingies where I observed lack of barriers resulting in filesystem corruption after power failure. We really need barriers with ATA disks to get decent write performance. It's a good recommendation though. That being said, with the amount of bugs in filesystems related to write barriers my expectation for the RAID and device mapper code is not too high. Turning off the volatile write cache does not provide commit integrity with RAID. RAID needs barriers to plug, drain and unplug the queues across all backing devices in a coordinated manner quite apart from the volatile write cache. And then there's still that pesky problem of writes which reach one disk and not its parity disk. Unfortunately turning off the volatile write caches could actually make the timing window for failure worse, in the case of system crash without power failure. -- Jamie
Re: [Qemu-devel] Notes on block I/O data integrity
Christoph Hellwig wrote: As various people wanted to know how the various data integrity patches I've sent out recently play together, here's a small writeup on what issues we have in QEMU and how to fix them: Thanks for taking this on. Both this email and the one on linux-fsdevel about Linux behaviour are wonderfully clear summaries of the issues. Action plan for QEMU: - IDE needs to set the write cache enabled bit - virtio needs to implement a cache flush command and advertise it (also needs a small change to the host driver) With IDE and SCSI, and perhaps virtio-blk, guests should also be able to disable the write cache enabled bit, and that should be equivalent to the guest issuing a cache flush command after every write. At the host it could be implemented as if every write were followed by flush, or by switching to O_DSYNC (cache=writethrough) in response. The other way around: for guests where integrity isn't required (e.g. disposable guests for testing - or speed during guest OS installs), you might want an option to ignore cache flush commands - just let the guest *think* it's committing to disk, but don't waste time doing that on the host. For disks using volatile write caches, the cache flush is implemented by a protocol-specific request, and the barrier requests are implemented by performing cache flushes before and after the barrier request, in addition to the draining mentioned above. The second cache flush can be replaced by setting the Force Unit Access bit on the barrier request on modern disks. For fdatasync (etc), you've probably noticed that it only needs one cache flush by itself, no second request or FUA write. Less obviously, there are opportunities to merge and reorder around non-barrier flush requests in the elevator, and to eliminate redundant flush requests. Also you don't need flushes to reach every backing drive on RAID, but knowing which ones to leave out is tricky and needs more hints from the filesystem. 
I agree with the whole of your general plan, both in QEMU and in Linux as a host. Spot on! -- Jamie
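The equivalence described above - a guest disabling the write cache enabled bit behaving like a cache flush after every write - might look like this on the host side (a hypothetical sketch, not QEMU's actual block layer):

```python
# Hypothetical sketch: when the guest clears the write cache enable
# (WCE) bit, the host commits each write immediately, as if the guest
# had issued a cache flush command after every write.
import os
import tempfile

def guest_write(fd, offset, data, write_cache_enabled):
    os.pwrite(fd, data, offset)
    if not write_cache_enabled:
        os.fsync(fd)  # WCE=0: behave as write-through

tmp_fd, path = tempfile.mkstemp()
os.close(tmp_fd)
fd = os.open(path, os.O_RDWR)
guest_write(fd, 0, b"sector0", write_cache_enabled=False)
data = os.pread(fd, 7, 0)
os.close(fd)
os.remove(path)
```

The alternative mentioned in the email - reopening the image O_DSYNC (cache=writethrough) - achieves the same ordering guarantee without the explicit per-write fsync.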
Re: [Qemu-devel] Re: Notes on block I/O data integrity
Nikola Ciprich wrote: clustered LVM SHOULD not have problems with it, as we're using just striped volumes, Note that LVM does not implement barriers at all, except for simple cases of a single backing device (I'm not sure if that includes dm-crypt). So your striped volumes may not offer this level of integrity. -- Jamie
Re: [Qemu-devel] Re: Notes on block I/O data integrity
Christoph Hellwig wrote: what about LVM? I've read somewhere that it used to just eat barriers used by XFS, making it less safe than simple partitions. Oh, any additional layers open up other cans of worms. On Linux until very recently using LVM or software raid means only disabled write caches are safe. I believe that's still true except if there's more than one backing drive, so software RAID still isn't safe. Did that change? But even with barriers, software RAID may have a consistency problem if one stripe is updated and the system fails before the matching parity stripe is updated. I've been told that some hardware RAID implementations implement a kind of journalling to deal with this, but Linux software RAID does not. -- Jamie
Re: [PATCH] introduce kvm64 CPU
Avi Kivity wrote: On 08/22/2009 12:59 AM, Andre Przywara wrote: Typically users will want more specialized greatest common denominator cpu types; if a site has standardized on recent hardware they will want the features of that hardware exposed. Sure, but this was not the purpose of this patch. Currently KVM guests see a CPU type which is TCG dependent, so I just wanted to get rid of this. Features of TCG and features of the host processor are totally uncorrelated. This new type should be KVM's default, leaving -cpu host as the alternative for the non-migration case. That does make sense. Note we can call it '-cpu kvm' since qemu will strip away long mode if it is not supported by the cpu or by the kernel. I thought the point was to provide a lowest common denominator for migration, while acknowledging that 64-bit is too useful to exclude? So if you start running on a 64-bit host, but know you have 32-bit hosts in your pool, you'll need '-cpu kvm32'. And if you start on a 32-bit host, but want to migrate to a 64-bit host, will that work if the destination has a different cpuid than the source? -- Jamie
Re: [PATCH] introduce kvm64 CPU
Andre Przywara wrote: In addition to the TCG based qemu64 type let's introduce a kvm64 CPU type, which is the least common denominator of all KVM-capable x86-CPUs (based on Intel Pentium 4 Prescott). It can be used as a base type for migration. The idea is nice and the name is right, but the description is wrong. It obviously isn't the least common denominator of all KVM-capable x86-CPUs, as my KVM-capable Core Duo (32-bit) cannot run it. A kvm32 would be nice for symmetry. -- Jamie
Re: [PATCH] introduce kvm64 CPU
Avi Kivity wrote: On 08/21/2009 12:34 AM, Andre Przywara wrote: In addition to the TCG based qemu64 type let's introduce a kvm64 CPU type, which is the least common denominator of all KVM-capable x86-CPUs (based on Intel Pentium 4 Prescott). It can be used as a base type for migration. Typically users will want more specialized greatest common denominator cpu types; if a site has standardized on recent hardware they will want the features of that hardware exposed. My experience is of sites which don't standardise on hardware in that way. They standardise on suppliers, and buy whatever is available when more hardware is needed, or reuse existing hardware which is made redundant from some other purpose. kvm64 is a good compromise for sites like that, because it should work with everything that's 64-bit and capable of running KVM. I expect all server machines which can run KVM are 64-bit, and it's only laptops which have 32-bit KVM-capable chips (but I'm not sure). -- Jamie
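The "greatest common denominator" idea discussed above amounts to intersecting feature flags across the host pool; a small sketch with made-up feature lists (not real cpuid dumps):

```python
# Hypothetical sketch: a migration-safe CPU model may only expose the
# intersection of feature flags across every host in the pool.
# "lm" stands in for the long mode (64-bit capability) flag.
pool = [
    {"fpu", "sse", "sse2", "nx", "lm"},  # 64-bit server
    {"fpu", "sse", "sse2", "nx"},        # 32-bit KVM-capable laptop
]
common = set.intersection(*pool)
# With a 32-bit host in the pool, long mode drops out of the
# intersection, so only a kvm32-style model is migration-safe.
kvm64_possible = "lm" in common
```

This is why starting a guest with '-cpu kvm64' on the 64-bit host and later trying to migrate it to the 32-bit host cannot work: the guest was promised a feature the destination lacks.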
Re: [PATCH] introduce kvm64 CPU
Andre Przywara wrote: If you happen to be stuck with 32bit (pity you!) then I agree that a kvm32 would be nice to have. Will think about it... I know that 32-bit is a bit slower for some things due to register pressure (but it's a bit faster for some things due to less memory needed for pointers), and its RAM is limited to about 3GB in practice, which affects some things but is plenty for others. I know it's a pain for KVM developers to support 32-bit hosts. And yes, it would be nice to run a 64-bit guest from time to time. But apart from being a bit slower, is there anything wrong with 32-bit x86s compared with 64-bit that justifies pity? The 32-bitness doesn't seem to be a handicap, only perhaps the expected amount of slowness for a laptop that's 2-3 years old, or a current netbook, compared with current desktops and servers. So I'm having a hard time understanding why 32-bitness is considered bad for KVM - why pity? Does it have any other real problems than not being able to emulate 64-bit guests that I should know about, or is it just a matter of distaste? -- Jamie
Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication
Amit Shah wrote: I think strings are better than numbers for identifying protocols as you can work without a central registry for the numbers then. I like the way assigned numbers work: it's simpler to code, needs a bitmap for all the ports that fits in nicely in the config space and udev rules / scripts can point /dev/vmch02 to /dev/console. How would a third party go about assigning themselves a number? For the sake of example, imagine they develop a simple service like guesttop which lets the host get a listing of guest processes. They'll have to distribute app-specific udev rule patches for every guest distro, which sounds like a lot of work. The app itself is probably a very simple C program; the hardest part of making it portable across distros would be the udev rules, which is silly. Anyway, every other device has a name or uuid these days. You can still use /dev/sda1 to refer to your boot partition, but LABEL=boot is also available if you prefer. Isn't that the ethos these days? Why not both? /dev/vmch05 if you prefer, plus symlink /dev/vmch-guesttop -> /dev/vmch05 if name=guesttop was given to QEMU. If you do stay with numbers only, note that it's not like TCP/UDP port numbers because the number space is far smaller. Picking a random number that you hope nobody else uses is harder. ... Back to technical bits. If config space is tight, use a channel! Dedicate channel 0 to control, used to fetch the name (if there is one) for each number. -- Jamie
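The control-channel suggestion at the end of that message can be sketched as follows (a hypothetical illustration - the registry contents, function names and symlink convention are invented, not part of any virtio-serial spec):

```python
# Hypothetical sketch: channel 0 as a control channel that maps port
# numbers to names, so guest udev rules could create stable symlinks
# like /dev/vmch-guesttop -> /dev/vmch05 without a central registry.
registry = {2: "console", 5: "guesttop"}  # host side, from QEMU options

def control_query(port_number):
    # Guest-side request over channel 0: "what is port N called?"
    # Returns None for ports that were given no name.
    return registry.get(port_number)

def symlink_name(port_number):
    name = control_query(port_number)
    return "/dev/vmch-%s" % name if name else None
```

A single generic udev rule could then call something like control_query at hotplug time, instead of each application shipping its own per-number rules.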
qcow2 corruption - seems to be fixed in kvm-85 and later :-)
Hi, Sometimes it's nice to find a mail with good news. A while back I reported corruption with qcow2, with the subject qcow2 corruption observed, fixed by reverting old change. I'd noticed that kvm-83 was corrupting a Windows 2k disk image, which was failing to boot, blue screening quite early. I found there was a qcow2 bug introduced from kvm-72 to kvm-73, and another from kvm-76 to kvm-77. Reverting both fixed this symptom. In order to check the bug later, I kept a copy of the disk image which blue screened. It still crashes with kvm-84. The release notes indicate there were some qcow2 fixes in that version; they were not enough to fix this problem. There were more qcow2 fixes in kvm-85. Happily, I can now report that kvm-85 and kvm-88 both boot this image with no apparent problems, and I will be deleting this junk disk image now that I'm confident no further testing is required :-) Thanks! -- Jamie
Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication
Amit Shah wrote: On (Thu) Aug 06 2009 [08:58:01], Anthony Liguori wrote: Amit Shah wrote: On (Thu) Aug 06 2009 [08:29:40], Anthony Liguori wrote: Amit Shah wrote: Sure; but there's been no resistance from anyone from including the virtio-serial device driver so maybe we don't need to discuss that. There certainly is from me. The userspace interface is not reasonable for guest applications to use. One example that would readily come to mind is dbus. A daemon running on the guest that reads data off the port and interacts with the desktop by appropriate dbus commands. All that's needed is a stream of bytes and virtio-serial provides just that. dbus runs as an unprivileged user, how does dbus know which virtio-serial port to open and who sets the permissions on that port? The permission part can be handled by package maintainers and sysadmins via udev policies. So all data destined for dbus consumption gets to a daemon and that daemon then sends it over to dbus. virtio-serial is nice, easy, simple and versatile. We like that; it should stay that way. dbus isn't a good match for this. dbus is not intended for communication between hosts, by design. It depends on per-app configuration files in /etc/dbus/{session,system}.d/, which are expected to match the installed services. For this, the guest's files in /etc/dbus/ would have to match the QEMU host services in detail. dbus doesn't have a good mechanism for coping with version skew between the two of them, because normally everything resides on the same machine and the config and service are updated at the same time. This is hard to guarantee with a VM. Apart from dbus, hard-coded meanings of small N in /dev/vmchN are asking for trouble. It is bound to break when widely deployed and guest/host configs don't match. It also doesn't fit comfortably when you have, say, bob and alice both logged in with desktops on separate VTs. 
Clashes are inevitable, as third-party apps pick N values for themselves then get distributed - unless N values can be large (/dev/vmch44324 == kernelstats...). Sysadmins shouldn't have to hand-configure each app, and shouldn't have to repair clashes in defaults. Just Work is better. virtio-serial is nice. The only ugly part is _finding_ the right /dev/vmchN. Fortunately, _any_ out-of-band id string or id number makes it perfect. An option to specify PCI vendor/product ids in the QEMU host configuration is good enough. An option to specify one or more id strings is nicer. Finally, Anthony hit on an interesting idea with USB. Emulating USB sucks. But USB's _descriptors_ are quite effective, and the USB basic protocol is quite reasonable too. Descriptors are just a binary blob in a particular format, which describe a device and also say what it supports, and what standard interfaces can be used with it too. Bluetooth is similar; they might even use the same byte format, I'm not sure. All the code for parsing USB descriptors is already present in guest kernels, and the code for making appropriate device nodes and launching apps is already in udev. libusb also allows devices to be used without a kernel driver, and is cross-platform. There are plenty of examples of creating USB descriptors in QEMU, and maybe the code can be reused. The only down side of USB is that emulating it sucks :-) That's mainly due to the host controllers, and the way interrupts use polling. So here's a couple of ideas: - virtio-usb, using virtio instead of a hardware USB host controller. That would provide all the features of USB naturally, like hotplug, device binding, access from userspace, but with high performance, low overhead, and no interrupt polling. You'd even have the option of cross-platform guest apps, as well as working on all Linux versions, by emulating a host controller when the guest doesn't have virtio-usb. As a bonus, existing USB support would be accelerated. 
- virtio-serial providing a binary id blob, whose format is the same as USB descriptors. Reuse the guest's USB parsing and binding to find and identify, but the actual device functionality would just be a byte pipe. That might be simple, as all it involves is a blob passed to the guest from QEMU. QEMU would build the id blob, maybe reusing existing USB code, and the guest would parse the blob as it already does for USB devices, with udev creating devices as it already does. -- Jamie
Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication
Anthony Liguori wrote: Richard W.M. Jones wrote: Have you considered using a usb serial device? Something attractive about it is that a productid/vendorid can be specified which means that you can use that as a method of enumerating devices. Hot add/remove is supported automagically. The same applies to PCI: productid/vendorid (and subids); PCI hotplug is possible though not as native as USB. Here's another idea: Many devices these days have a serial number or id string. E.g. USB storage, ATA drives, media cards, etc. Linux these days creates alias device nodes which include the id string in the device name. E.g. /dev/disks/by-id/ata-FUJITSU_MHV2100BH_NWAQT662615H So in addition to (or instead of) /dev/vmch0, /dev/vmch1 etc., Linux guests could easily generate: /dev/vmchannel/by-role/clipboard-0 /dev/vmchannel/by-role/gueststats-0 /dev/vmchannel/by-role/vmmanager-0 It's not necessary to do this at the beginning. All that is needed is to provide enough id information that will appear in /sys/..., so that a udev policy for naming devices can be created at some later date. -- Jamie
Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication
Daniel P. Berrange wrote: I expect the first problem you'll run into is that the copy/paste daemon has to run as an unprivileged user but /dev/vmch3 is going to be owned by root. You could set udev rules for /dev/vmch3 but that's pretty terrible IMHO. I don't think that's too bad; for example, with fast-user-switching between multiple X servers and/or text consoles, there's already support code that deals with chown'ing things like /dev/snd/* devices to match the active console session. Doing the same with the /dev/vmch3 device so that it is only ever accessible to the current logged in user actually fits into that scheme quite well. With multiple X servers, there can be more than one currently logged in user. Same with multiple text consoles - that's more familiar. Which one owns /dev/vmch3? -- Jamie
Re: [Qemu-devel] [PATCH] rev3: support colon in filenames
Kevin Wolf wrote: Can we at least allow \, instead of ,, in parameter parsing, so that the backslash has the practical benefit of being a single universal escape character? Is there a good reason why we cannot simply use \char to escape _any_ character, in every context where a user-supplied string/name/path/file is used? I'm thinking of consistency here. Instead of special cases for filenames, why not a standard scheme for all the places in command lines _and_ the monitor where a name/path/file is needed? There are many examples where it would be useful if unusual characters didn't break things, they simply worked. Examples: -vnc unix: path, -net port: device path, -net script path, -net sock= path, -net group= groupname, tap and bt device names. \char is an obvious scheme to standardise on given QEMU's unix shell heritage. It would work equally well for command line options (which are often comma-separated) and for monitor commands (which are often space-separated). It would have the nice property of being easy for management programs/scripts to quote, without them having a special list of characters to quote, and without needing to update them if QEMU needs to quote more characters in future for some reason. Now, I see one significant hurdle with that: it's quite inconvenient for Windows users, typing paths like c:\path\to\dir\file, if those backslashes are stripped. So I propose this as a universal quoting scheme: \char where char is not ASCII alphanumeric. Shell quoting is easy:

qfile=$(printf %s "$file" | sed 's/[^0-9a-zA-Z]/\\&/g')
qemu -drive file=$qfile,if=scsi,media=disk

The same quoting applies when sending the monitor a command to change a CD-ROM file or add a USB disk, for example. -- Jamie
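The proposed \char scheme is trivial to implement and to invert; here is a small round-trip sketch (in Python for brevity, though the email's example uses shell - the function names are illustrative, not QEMU API):

```python
import re

def qemu_quote(s):
    # Escape every character that is not ASCII alphanumeric with a
    # backslash, per the proposed universal quoting scheme.
    return re.sub(r'[^0-9a-zA-Z]', lambda m: '\\' + m.group(0), s)

def qemu_unquote(s):
    # Inverse: a backslash always means "take the next char literally".
    return re.sub(r'\\(.)', r'\1', s)

# Round-trips for a comma-separated option value and a Windows path:
assert qemu_unquote(qemu_quote('file=a,b c')) == 'file=a,b c'
assert qemu_unquote(qemu_quote('c:\\path\\to\\file')) == 'c:\\path\\to\\file'
```

Note the parser only needs the qemu_unquote direction; commas and spaces survive because they arrive escaped, so the option splitter never sees them as separators.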
Re: [Qemu-devel] Re: [PATCH] rev5: support colon in filenames
Ram Pai wrote: I have verified with relative paths and it works. After analyzing the code, I came to the conclusion that the call to realpath() adds no real value. The logic in bdrv_open2() is something like this:

bdrv_open2()
{
    if (snapshot) {
        backup = realpath(filename);
        filename = generate_a_temp_file();
    }
    drv = parse_and_get_bdrv(filename);
    drv->bdrv_open(filename);
    if (backup) {
        bdrv_open2(backup);
    }
}

In the above function, the call to realpath() would have been useful had it changed the current working directory before calling bdrv_open2(backup). It does not. If any function within drv->bdrv_open changes the cwd, then I expect it to restore the cwd before returning. Also drv->bdrv_open() can anyway handle relative paths. Hence I conclude that the call to realpath() adds no value. Do you see a flaw in this logic? I don't know about snapshot, but when a qcow2 file contains a relative path to its backing file, QEMU cannot simply open using that relative path, because it's relative to the directory containing the qcow2 file, not QEMU's current directory. (That said, I find it quite annoying when renaming qcow2 files that there's no easy way to rename their backing files, and it's even worse when moving qcow2 files which refer to backing files in another directory, and _especially_ when the qcow2 file contains an absolute path to the backing file and you're asked to move it to another system which doesn't have those directories.) -- Jamie
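The relative-path rule described above - a backing file reference is resolved against the directory containing the qcow2 image, not QEMU's current directory - can be sketched like this (a hypothetical helper, not QEMU's actual code):

```python
import os

def resolve_backing(image_path, backing_ref):
    # Absolute backing references are used as-is; relative ones must
    # be resolved against the image's directory, never against the
    # process's current working directory.
    if os.path.isabs(backing_ref):
        return backing_ref
    image_dir = os.path.dirname(os.path.abspath(image_path))
    return os.path.normpath(os.path.join(image_dir, backing_ref))
```

This also shows why a bare realpath() on the backing reference alone cannot work: without the image's own path as an anchor, a relative reference has nothing meaningful to be relative to.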
qcow2 relative paths (was: [PATCH] rev5: support colon in filenames)
Ram Pai wrote: I have successfully verified qcow2 files. But then I may not be trying out the exact thing that you are talking about. Can you give me a test case that I can verify? Commands tried with qemu-0.10.0-1ubuntu1:

$ mkdir unlikely_subdir
$ cd unlikely_subdir
$ qemu-img create -f qcow2 backing.img 10
Formatting 'backing.img', fmt=qcow2, size=10 kB
$ qemu-img create -f qcow2 -b ../unlikely_subdir/backing.img main.img 10
Formatting 'main.img', fmt=qcow2, backing_file=../unlikely_subdir/backing.img, size=10 kB
$ cd ..
$ qemu-img info unlikely_subdir/main.img
image: unlikely_subdir/main.img
file format: qcow2
virtual size: 10K (10240 bytes)
disk size: 16K
cluster_size: 4096
highest_alloc: 16384
backing file: ../unlikely_subdir/backing.img (actual path: unlikely_subdir/../unlikely_subdir/backing.img)

See especially the actual path line.

$ mv unlikely_subdir other_subdir
$ ls -l other_subdir
total 32
-rw-r--r-- 1 jamie jamie 16384 2009-07-15 21:59 backing.img
-rw-r--r-- 1 jamie jamie 16384 2009-07-15 21:59 main.img
$ qemu-img info other_subdir/main.img
qemu-img: Could not open 'other_subdir/main.img'

What an unhelpful error message... There isn't even a way to find out the backing file path which the tool is looking for. And one other thing. Let me know if there is a test-suite that I can try for regressions. Sorry, I don't know anything about any QEMU test suites. -- Jamie
Re: [Qemu-devel] [PATCH] rev3: support colon in filenames
Jan Kiszka wrote: Now, I see one significant hurdle with that: it's quite inconvenient for Windows users, typing paths like c:\path\to\dir\file, if those backslashes are stripped. We could exclude Windows from this (I think to remember that filenames are more restricted there anyway) or define a different, Windows-only escape character.

I think both of those are bad ideas, because the same management scripts can run on Windows, and for consistency it's not just file names. Even Windows has block devices and network devices :-)

Fortunately, \char where char is not ASCII alphanumeric solves the practical cases where the user types an ordinary pathname. Or the user can type forward slashes just like they do in unix. So I propose this as a universal quoting scheme: \char where char is not ASCII alphanumeric.

Shell quoting is easy:

    qfile=`printf %s "$file" | sed 's/[^0-9a-zA-Z]/\\&/g'`
    qemu -drive file=$qfile,if=scsi,media=disk

I forgot a very obscure corner case, where the last character of the filename is a newline character. To do the right thing (with Bash at least), it should say '%s\n' instead of %s. Sue me :-)

The same quoting applies when sending the monitor a command to change a CD-ROM file or add a USB disk, for example.

To me this direction looks more promising than any other proposal so far.

I wondered if it was just me...

-- Jamie
Re: [Qemu-devel] [PATCH] rev3: support colon in filenames
Jan Kiszka wrote: Jamie Lokier wrote: Jan Kiszka wrote: Now, I see one significant hurdle with that: it's quite inconvenient for Windows users, typing paths like c:\path\to\dir\file, if those backslashes are stripped. We could exclude Windows from this (I think to remember that filenames are more restricted there anyway) or define a different, Windows-only escape character. I think both of those are bad ideas, because the same management scripts can run on Windows, and for consistency it's not just file names. Even Windows has block devices and network devices :-) I'm not sure if there is actually so much portability/reusability between Windows and the rest of the universe, but I'm surely not an expert in this.

In my experience, shell scripts and Perl scripts tend to work either with no changes, or very small changes.

Fortunately, \char where char is not ASCII alphanumeric solves the practical cases where the user types an ordinary pathname. Or the user can type forward slashes just like they do in unix. We would still have to deal with the fact that so far '\' had no special meaning on Windows - except that it was the well-known path separator. So redefining its meaning would break a bit...

The point is that paths tend to have alphanumeric characters at the start of each component, so it doesn't matter in most cases that it's redefined. People won't notice because c:\path\to\file will continue to work, whether it's by itself or part of a multi-option option. Exceptions are \\host\path and \\.\device, where the error will be so obvious they'll learn quickly.

We could find a more complex scheme where \\ is unaffected, but complex is not good and will be wrongly implemented by other programs. Whereas \char is very common, well known and easy to get right, even when people guess how it's done, like they do when working out how to quote paths for rsync and ssh.

Oh, I'm suddenly thinking that '.' should be included in alphanumeric :-)

-- Jamie
Re: [Qemu-devel] [PATCH] rev3: support colon in filenames
Anthony Liguori wrote: Jan Kiszka wrote: We would still have to deal with the fact that so far '\' had no special meaning on Windows - except that it was the well-known path separator. So redefining its meaning would break a bit... That's the problem. You will break existing Windows users. I know this goes against the current momentum in qemu, but overloading one option with a bunch of parameters seems absolutely silly to me. IMHO, -drive file=foo.img,if=virtio,cache=off should have always been at least three parameters.

That's fine for command lines. I don't necessarily disagree with you. But how do you propose to handle paths in monitor commands, when the path contains a space/quote/whatever as it often does on Windows (My Documents, Program Files)?

-- Jamie
Re: [Qemu-devel] [PATCH] rev3: support colon in filenames
Anthony Liguori wrote: Jamie Lokier wrote: Anthony Liguori wrote: Jan Kiszka wrote: We would still have to deal with the fact that so far '\' had no special meaning on Windows - except that it was the well-known path separator. So redefining its meaning would break a bit... That's the problem. You will break existing Windows users. I know this goes against the current momentum in qemu, but overloading one option with a bunch of parameters seems absolutely silly to me. IMHO, -drive file=foo.img,if=virtio,cache=off should have always been at least three parameters. That's fine for command lines. I don't necessarily disagree with you. But how do you propose to handle paths in monitor commands, when the path contains a space/quote/whatever as it often does on Windows (My Documents, Program Files)? Same basic rules apply. The monitor should use shell-style quoting.

So instead of consistency, you like the idea of using different quoting rules for the monitor than for command line arguments?

-- Jamie
Re: [Qemu-devel] Re: [RFC] allow multi-core guests: introduce cores= option to -cpu
Andre Przywara wrote: So what about: -smp 4,cores=2,threads=2[,sockets=1] to inject 4 vCPUs in one package (automatically determined if omitted) with two cores and two threads/core? All parameters except the number of vCPUs would be optional.

Why is the number of vCPUs required at all? -smp cores=2,threads=2 The 4 is redundant.

-- Jamie
Re: [Qemu-devel] virtio-serial: A guest - host interface for simple communication
Amit Shah wrote: On (Wed) Jun 24 2009 [17:40:49], Jamie Lokier wrote: Amit Shah wrote: A few sample uses for a vmchannel are to share the host and guest clipboards (to allow copy/paste between a host and a guest), to lock the screen of the guest session when the vnc viewer is closed, to find out which applications are installed on a guest OS even when the guest is powered down (using virt-inspector) and so on. Those all look like useful features. Can you run an application to provide those features on a guest which _doesn't_ have vmchannel/virtio-serial support in the kernel? Or will it be restricted only to guests which have QEMU-specific support in their kernel? libguestfs currently uses the -net user based vmchannel interface that exists in current qemu. That doesn't need a kernel with support for virtio-serial.

That's great! If that works fine, and guest apps/libraries are using that as a fallback anyway, what benefit do they get from switching to virtio-serial when they detect that instead, given they still have code for the -net method? Is the plan to remove -net user based support from libguestfs? Is virtio-serial significantly simpler to use?

-- Jamie
Re: [Qemu-devel] Re: [PATCH 1/2] allow hypervisor CPUID bit to be overriden
Avi Kivity wrote: On 06/23/2009 02:31 PM, Paul Brook wrote: On Tuesday 23 June 2009, Avi Kivity wrote: On 06/23/2009 12:47 AM, Andre Przywara wrote: KVM defaults to the hypervisor CPUID bit to be set, whereas pure QEMU clears it. On some occasions one wants to set or clear it the other way round (for instance to get HyperV running inside a guest). Allow the default to be overridden on the command line and fix some whitespace damage on the way. It makes sense for qemu to set the hypervisor bit unconditionally. A guest running under qemu is not bare metal. I see no reason why a guest has to be told that it's running inside a VM. In principle an appropriately configured qemu should be indistinguishable from real hardware. In practice it's technically infeasible to cover absolutely everything, but if we set this bit we're not even trying. I have no objection to the bit being set by default for the QEMU CPU types. I agree it's pointless, but it is a Microsoft requirement for passing their SVVP tests. Enabling it by default makes life a little easier for users who wish to validate their hypervisor and has no drawbacks.

Hold on. Do the SVVP tests fail on a real (non-virtual) machine then? Or is QEMU's machine emulation insufficiently accurate?

I see a drawback in setting the bit by default. Something I expect from an emulator is that it behaves like a real machine to the extent possible. In particular, guest code which attempts to check if it's running on a real machine should get the answer yes. Unfriendly guest code which pops up a message like "Sorry, I refuse to work for you after 100 hours because you are attempting to run me in a virtual machine, and don't even think of trying to hide this from me now you know I look for it" should never do so.

-- Jamie
Re: Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]
Avi Kivity wrote: On 06/16/2009 09:32 PM, Jamie Lokier wrote: Avi Kivity wrote: Another issue is enumeration. Guests will present their devices in the order they find them on the pci bus (of course enumeration is guest specific). So if I have 2 virtio controllers the only way I can distinguish between them is using their pci slots. virtio controllers really should have a user-suppliable string or UUID to identify them to the guest. Don't they? virtio controllers don't exist. When they do, they may have a UUID or not, but in either case guest infrastructure is in place for reporting the PCI slot, not the UUID. virtio disks do have a UUID. I don't think older versions of Windows will use it though, so if you reorder your slots you'll see your drive letters change. Same with Linux if you don't use udev by-uuid rules. I guess I meant virtio disks, so that's ok. -- Jamie -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]
Avi Kivity wrote: If management apps need to hard-code which slots are available on different targets and different qemu versions, or restrictions on which devices can use which slots, or knowledge that some devices can be multi-function, or ... anything like that is just lame. You can't abstract these things away. If you can't put a NIC in slot 4, and you have 7 slots, then you cannot have 7 NICs. Having qemu allocate the slot numbers does not absolve management from knowing this limitation and preventing the user from creating a machine with 7 slots. Likewise, management will have to know which devices are multi-function, since that affects their hotpluggability. Ditto if some slot is faster than others; if you want to make use of this information you have to let the upper layers know. It could be done using an elaborate machine description that qemu exposes to management coupled with a constraint solver that optimizes the machine layout according to user specifications and hardware limitations. Or we could take the view that real life is not perfect (especially where computers are involved), add some machine specific knowledge, and spend the rest of the summer at the beach.

To be honest, an elaborate machine description is probably fine... A fancy constraint solver is not required. A simple one strikes me as about as simple as what you'd hard-code anyway, but with fewer special cases. Note that the result can fail due to things like insufficient address space for all the device BARs even when they _are_ in the right slots. Especially if there are lots of slots, or bridges which can provide unlimited slots. That is arcane: device-dependent, CPU-dependent, machine-dependent, RAM-size dependent (in a non-linear way), device-option-dependent and probably QEMU-version-dependent too.
It would be nice if libvirt (et al) would prevent the user from creating a VM with insufficient BAR space for that machine, but I'm not sure how to do it sanely, without arcane knowledge getting about. Maybe that idea of a .so shared by qemu and libvirt, to manipulate device configurations, is a sane one after all. -- Jamie -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]
Avi Kivity wrote: Another issue is enumeration. Guests will present their devices in the order they find them on the pci bus (of course enumeration is guest specific). So if I have 2 virtio controllers the only way I can distinguish between them is using their pci slots. virtio controllers really should have a user-suppliable string or UUID to identify them to the guest. Don't they? -- Jamie -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]
Mark McLoughlin wrote: After libvirt has done -drive file=foo... it should dump the machine config and use that from then on. Right - libvirt then wouldn't be able to avoid the complexity of merging any future changes into the dumped machine config. As long as qemu can accept a machine config _and_ -drive file=foo (and monitor commands to add/remove devices), libvirt could merge by simply calling qemu with whatever additional command line options or monitor commands modify the config, then dump the new config. That way, virtio would not have to deal with that complexity. It would be written in one place: qemu. Or better, a utility: qemu-machine-config. -- Jamie -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]
Mark McLoughlin wrote: Worst case we hardcode those numbers (gasp, faint). Maybe we can just add the open slots to the -help output. That'd be nice and clean. Make them part of the machine configuration. After all, they are part of the machine configuration, and ACPI, BIOS etc. need to know about all the machine slots anyway. Having said that, I prefer the idea that slot allocation is handled either in Qemu, or in a separate utility called qemu-machine-config (for working with machine configs), or in a library libqemu-machine-config.so. I particularly don't like the idea of arcane machine-dependent slot allocation knowledge living in libvirt, because it needs to be in Qemu anyway for non-libvirt users. No point in having two implementations of something tricky and likely to have machine quirks, if one will do. -- Jamie -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities
Paul Brook wrote: caps can be anywhere, but we don't expect it to change during machine execution lifetime. Or am I just confused by the name pci_device_load? Right. So I want to load an image and it has capability X at offset Y. wmask has to match. I don't want to assume that we never change Y for the device without breaking old images, so I clear wmask here and set it up again after looking up capabilities that I loaded. We should not be loading state into a different device (or a similar device with a different set of capabilities). If you want to provide backwards compatibility then you should do that by creating a device that is the same as the original.

As I mentioned in my earlier mail, loading a snapshot should never do anything that can not be achieved through normal operation. If you can create a machine by restoring a snapshot which you can't create by normally starting QEMU, then you'll soon have guests which work fine from their snapshots, but which cannot be booted without a snapshot because there's no way to boot the right machine for the guest. Someone might even have guests like that for years without noticing, because they always save and restore guest state using snapshots, then one day they simply want to boot the guest from its disk image and find there's no way to do it with any QEMU which runs on their host platform.

I think the right long term answer to all this is a way to get QEMU to dump its current machine configuration in glorious detail as a file which can be reloaded as a machine configuration.

-- Jamie
Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities
Michael S. Tsirkin wrote: I think the right long term answer to all this is a way to get QEMU to dump its current machine configuration in glorious detail as a file which can be reloaded as a machine configuration. And then we'll have the same set of problems there.

We will, and the solution will be the same: options to create devices as they were in older versions of QEMU. It only needs to cover device features which matter to guests, not every bug fix. However, with a machine configuration which is generated by QEMU, there's less worry about proliferation of obscure options, compared with the command line. You don't necessarily have to document every backward-compatibility option in any detail; you just have to make sure it's written and read properly, which is much the same thing as the snapshot code does.

-- Jamie
Re: [Qemu-devel] Re: [PATCH] virtio-blk: add SGI_IO passthru support
Christoph Hellwig wrote: On Thu, Apr 30, 2009 at 10:49:19PM +0100, Paul Brook wrote: Only if you emulate a crufty old parallel scsi bus, and that's just silly. One of the nice things about scsi is it separates the command set from the transport layer. cf. USB mass-storage, SAS, SBP2 (FireWire), and probably several others I've forgotten. It has nothing to do with an SPI bus. Everything that resembles a SAM architecture can have multiple LUs per target, and multiple targets per initiator port, so we need all the complex queuing code, and we need error handling and and and.

If you're using virtio-block to connect to lots of LUNs on lots of targets (i.e. lots of block devices), don't you need similar queuing code and error handling for all that too?

-- Jamie
Re: [libvirt] Re: [Qemu-devel] Changing the QEMU svn VERSION string
Paul Brook wrote: I'm extremely sceptical of anything that claims to need a fine grained version number. In practice version numbers for open source projects are fairly arbitrary and meaningless because almost everyone has their own set of patches and backported fixes anyway.

I find it's needed only when you need to interact with a program and work around bugs or temporarily broken features, and also when the program gives no other way to determine its features. For some reason, I find kernels are the main thing this matters for...

If the help text, some other output, or an API gives enough information for interacting programs to know what to do, that's much better and works with arbitrary patches etc.

-- Jamie
Re: [libvirt] Re: [Qemu-devel] Changing the QEMU svn VERSION string
Anthony Liguori wrote: I still think libvirt should work with versions of QEMU/KVM built from svn/git though. I think the only way to do that is for libvirt to relax their version checks to accommodate suffixes in the form major.minor.stable-foo.

Ok, but try to stick to a well-defined rule about whether a suffix means later or earlier. In package managers, 1.2.3-rc1 is typically seen as a later version than 1.2.3 purely due to syntax. If you consistently mean that 0.11.0-rc1 is earlier than 0.11.0 (final), that might need to be encoded in libvirt and other wrappers, if they have any fine-grained version sensitivity such as command line changes or bug workarounds.

The Linux kernel was guilty of mixing up later and earlier version suffixes like this. With Linux this is a bit more important because it changes a lot between versions, so some apps do need fine-grained version checks to work around bugs or avoid buggy features. Maybe that won't even happen with QEMU and libvirt working together.

-- Jamie