Re: [PATCH v2] io_uring: fix short read slow path
On 7/5/22 7:28 AM, Stefan Hajnoczi wrote:
> On Fri, Jul 01, 2022 at 07:52:31AM +0900, Dominique Martinet wrote:
>> Stefano Garzarella wrote on Thu, Jun 30, 2022 at 05:49:21PM +0200:
>>>> so when we ask for more we issue an extra short read, making sure we
>>>> go through the two-short-reads path.
>>>> (Unfortunately I wasn't quite sure what to fiddle with to issue short
>>>> reads in the first place. I tried cutting one of the iovs short in
>>>> luring_do_submit() but I must not have been doing it properly, as I
>>>> ended up with 0 return values, which are handled by filling in with 0
>>>> (reads after EOF), and that didn't work well.)
>>>
>>> Do you remember the kernel version where you first saw these problems?
>>
>> Since you're quoting my paragraph about testing two short reads: I've
>> never seen any that I know of, but there's also no reason these
>> couldn't happen.
>>
>> Single short reads have been happening for me with O_DIRECT
>> (cache=none) on btrfs for a while, but unfortunately I cannot remember
>> which was the first kernel I saw this on -- I think rather than a
>> kernel update it was due to file manipulations that made the file
>> eligible for short reads in the first place (I started running
>> deduplication on the backing file).
>>
>> The oldest kernel I have installed right now is 5.16 and that can
>> reproduce it -- I'll give my laptop some work over the weekend to test
>> still-maintained stable branches if that's useful.
>
> Hi Dominique,
> Linux 5.16 contains commit 9d93a3f5a0c ("io_uring: punt short reads to
> async context"). The comment above QEMU's luring_resubmit_short_read()
> claims that short reads are a bug that was fixed by Linux commit
> 9d93a3f5a0c.
>
> If the comment is inaccurate it needs to be fixed. Maybe short writes
> need to be handled too.
>
> I have CCed Jens and the io_uring mailing list to clarify:
> 1. Are short IORING_OP_READV reads possible on files/block devices?
> 2. Are short IORING_OP_WRITEV writes possible on files/block devices?

In general we try very hard to avoid them, but if e.g. we get a short
read or write from blocking context (e.g. io-wq), then io_uring does
return that. There's really not much we can do here; it seems futile to
retry IO which was issued just as it would have been from a normal
blocking syscall, yet still came up short.

-- Jens Axboe
Re: io_uring possibly the culprit for qemu hang (linux-5.4.y)
On 10/17/20 8:29 AM, Ju Hyung Park wrote:
> Hi Jens.
>
> On Sat, Oct 17, 2020 at 3:07 AM Jens Axboe wrote:
>>
>> Would be great if you could try 5.4.71 and see if that helps for your
>> issue.
>
> Oh wow, yeah, it did fix the issue.
>
> I'm able to reliably turn off and start the VM multiple times in a row.
> Double-checked by confirming QEMU is dynamically linked to liburing.so.1.
>
> Looks like those 4 io_uring fixes helped.

Awesome, thanks for testing!

-- Jens Axboe
Re: io_uring possibly the culprit for qemu hang (linux-5.4.y)
On 10/16/20 12:04 PM, Ju Hyung Park wrote:
> A small update:
>
> As per Stefano's suggestion, disabling io_uring support from QEMU from
> the configuration step did fix the problem and I'm no longer having
> hangs.
>
> Looks like it __is__ an io_uring issue :(

Would be great if you could try 5.4.71 and see if that helps for your
issue.

-- Jens Axboe
Re: [Qemu-devel] virtio_blk: fix defaults for max_hw_sectors and max_segment_size
On 11/26/2014 01:51 PM, Mike Snitzer wrote:
> On Wed, Nov 26 2014 at 2:48pm -0500, Jens Axboe <ax...@kernel.dk> wrote:
>> On 11/21/2014 08:49 AM, Mike Snitzer wrote:
>>> On Fri, Nov 21 2014 at 4:54am -0500, Christoph Hellwig
>>> <h...@infradead.org> wrote:
>>>> On Thu, Nov 20, 2014 at 02:00:59PM -0500, Mike Snitzer wrote:
>>>>> virtio_blk incorrectly established -1U as the default for these
>>>>> queue_limits. Set these limits to sane default values to avoid
>>>>> crashing the kernel. But the virtio-blk protocol should probably be
>>>>> extended to allow proper stacking of the disk's limits from the
>>>>> host.
>>>>>
>>>>> This change fixes a crash that was reported when virtio-blk was used
>>>>> to test linux-dm.git commit 604ea90641b4 ("dm thin: adjust
>>>>> max_sectors_kb based on thinp blocksize") that will initially set
>>>>> max_sectors to max_hw_sectors and then rounddown to the first
>>>>> power-of-2 factor of the DM thin-pool's blocksize. Basically that
>>>>> commit assumes drivers don't suck when establishing max_hw_sectors,
>>>>> so it acted like a canary in the coal mine.
>>>>
>>>> Is that a crash in the host or guest? What kind of mishandling did
>>>> you see? Unless the recent virtio standard changed anything the host
>>>> should be able to handle our arbitrary limits, and even if it doesn't
>>>> that's something we need to hash out with qemu and the virtio
>>>> standards folks.
>>>
>>> Some good news: this guest crash isn't an issue with recent kernels
>>> (so upstream, fedora 20, RHEL7, etc aren't impacted -- Jens feel free
>>> to drop my virtio_blk patch; even though some of its limits are
>>> clearly broken I'll defer to the virtio_blk developers on the best way
>>> forward -- sorry for the noise!).
>>>
>>> The BUG I saw only seems to impact RHEL6 kernels so far (note to self:
>>> actually _test_ on upstream before reporting a crash against
>>> upstream!)
>>>
>>> [root@RHEL-6 ~]# echo 1073741824 > /sys/block/vdc/queue/max_sectors_kb
>>> [root@RHEL-6 ~]# lvs
>>>
>>> Message from syslogd@RHEL-6 at Nov 21 15:32:15 ...
>>> kernel:Kernel panic - not syncing: Fatal exception
>>
>> That code isn't even in mainline, as far as I can tell...
>
> Here is the RHEL6 guest crash, just for full disclosure:
>
> kernel BUG at fs/direct-io.c:696!
> invalid opcode: [#1] SMP
> last sysfs file: /sys/devices/virtual/block/dm-4/dev
> CPU 0
> Modules linked in: nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipv6
> ext2 dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio libcrc32c
> dm_mirror dm_region_hash dm_log dm_mod microcode virtio_balloon
> i2c_piix4 i2c_core virtio_net ext4 jbd2 mbcache virtio_blk virtio_pci
> virtio_ring virtio pata_acpi ata_generic ata_piix [last unloaded:
> speedstep_lib]
> Pid: 1679, comm: lvs Not tainted 2.6.32 #6 Bochs Bochs
> RIP: 0010:[811ce336] [811ce336] __blockdev_direct_IO_newtrunc+0x986/0x1270
> RSP: 0018:88011a11ba48 EFLAGS: 00010287
> RAX: RBX: 8801192fbd28 RCX: 1000
> RDX: ea0003b3d218 RSI: 88011aac4300 RDI: 880118572378
> RBP: 88011a11bbe8 R08: R09: R10: R11: R12: 8801192fbd00
> R13: R14: 880118c3cac0 R15:
> FS: 7fde78bc37a0() GS:88002820() knlGS:
> CS: 0010 DS: ES: CR0: 80050033
> CR2: 012706f0 CR3: 00011a432000 CR4: 000407f0
> DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400
> Process lvs (pid: 1679, threadinfo 88011a11a000, task 8801185a4aa0)
> Stack:
>  88011a11bb48 88011a11baa8 8801000c 88011a11bb18 d 88011a11bdc8
>  88011a11beb8 d 000c1a11baa8 880118c3cb98 18c3ccb8
> Call Trace:
>  [811c9e90] ? blkdev_get_block+0x0/0x20
>  [811cec97] __blockdev_direct_IO+0x77/0xe0
>  [811c9e90] ? blkdev_get_block+0x0/0x20
>  [811caf17] blkdev_direct_IO+0x57/0x60
>  [811c9e90] ? blkdev_get_block+0x0/0x20
>  [8112619b] generic_file_aio_read+0x6bb/0x700
>  [811cba60] ? blkdev_get+0x10/0x20
>  [811cba70] ? blkdev_open+0x0/0xc0
>  [8118af4f] ? __dentry_open+0x23f/0x360
>  [811ca2d1] blkdev_aio_read+0x51/0x80
>  [8118dc6a] do_sync_read+0xfa/0x140
>  [8109eaf0] ? autoremove_wake_function+0x0/0x40
>  [811ca22c] ? block_ioctl+0x3c/0x40
>  [811a34c2] ? vfs_ioctl+0x22/0xa0
>  [811a3664] ? do_vfs_ioctl+0x84/0x580
>  [8122cee6] ? security_file_permission+0x16/0x20
>  [8118e625] vfs_read+0xb5/0x1a0
>  [8118e761] sys_read+0x51/0x90
>  [810e5aae] ? __audit_syscall_exit+0x25e/0x290
>  [8100b072] system_call_fastpath+0x16/0x1b
> Code: fe ff ff c7 85 fc fe ff ff 00 00 00 00 48 89 95 10 ff ff ff 8b 95
> 34 ff ff ff e8 46 ac ff ff 3b 85 34 ff ff ff 0f 84 fc 02 00 00 0f
> 0b eb fe 8b 9d 34 ff ff ff 8b 85 30 ff ff ff 01 d8 85 c0 0f
> RIP [811ce336] __blockdev_direct_IO_newtrunc+0x986/0x1270
>  RSP 88011a11ba48
> ---[ end trace 73be5dcaf8939399
Re: [Qemu-devel] virtio_blk: fix defaults for max_hw_sectors and max_segment_size
On 11/26/2014 02:51 PM, Mike Snitzer wrote:
> On Wed, Nov 26 2014 at 3:54pm -0500, Jens Axboe <ax...@kernel.dk> wrote:
>> On 11/26/2014 01:51 PM, Mike Snitzer wrote:
>>> On Wed, Nov 26 2014 at 2:48pm -0500, Jens Axboe <ax...@kernel.dk> wrote:
>>>> That code isn't even in mainline, as far as I can tell...
>>>
>>> Right, it is old RHEL6 code. But I've yet to determine what changed
>>> upstream that enables this to just work with a really large
>>> max_sectors (I haven't been looking either).
>>
>> Kind of hard for the rest of us to say, since it's triggering a BUG in
>> code we don't have :-)
>
> I never asked you or others to weigh in on old RHEL6 code. Once I
> realized upstream worked even if max_sectors is _really_ high I said
> sorry for the noise.
>
> But while you're here, I wouldn't mind getting your take on virtio-blk
> setting max_hw_sectors to -1U. As I said in my original reply to mst:
> it only makes sense to set a really high initial upper bound like that
> in a driver if that driver goes on to stack an underlying device's
> limit.

-1U should just work, IMHO; there's no reason we should need to cap it
at some synthetic value. That said, it seems it should be one of those
parameters that should be negotiated up and set appropriately.

-- Jens Axboe
Re: [Qemu-devel] Linux multiqueue block layer thoughts
On Wed, Nov 27 2013, Stefan Hajnoczi wrote:
> I finally got around to reading the Linux multiqueue block layer paper
> and wanted to share some thoughts about how it relates to QEMU and
> dataplane/QContext:
>
> http://kernel.dk/blk-mq.pdf
>
> I think Jens has virtio-blk multiqueue patches. So let's imagine that
> the virtio-blk device has multiple virtqueues. (virtio-scsi is already
> multiqueue, BTW.)
>
> The paper focuses on two queue mappings: 1 queue per core and 1 queue
> per node. In both cases the idea is to keep the block I/O code path
> localized. This makes block I/O scale as the number of CPUs increases.
>
> In QEMU we'd want to set up a mapping for the virtio-blk mq device:
> each guest vcpu or guest node has a virtio-blk virtqueue which is
> serviced by a dataplane/QContext thread. QEMU would then process
> requests across these queues in parallel, although currently
> BlockDriverState is not thread-safe. At least for raw we should be
> able to submit requests in parallel from QEMU.
>
> Unfortunately there are some complications in the QEMU block layer:
> QEMU's own accounting, request tracking, and throttling features are
> global. We'd need to eventually do something similar to the multiqueue
> block layer changes in the kernel to detangle this state.
>
> Doing multiqueue for image formats is much more challenging - we'd
> have to tackle thread-safety in qcow2 and friends. For network block
> drivers like Gluster or NBD it's also not 100% clear what the best
> approach is. But I think the target here is local SSDs that are
> capable of high IOPS together with an SMP guest.
>
> At the end of all this we'd arrive at the following architecture:
>
> 1. Guest virtio device has multiple queues (1 per node or vcpu).
> 2. QEMU has multiple dataplane/QContext threads that process virtqueue
>    kicks; they are bound to host CPUs/nodes.
> 3. Linux kernel has multiqueue block I/O.

I think that sounds very reasonable. Let me know if there's anything
you need help or advice with.

> Jens: when experimenting with multiqueue virtio-blk, how far did you
> modify QEMU to eliminate global request processing state from block.c?

I did very little scaling testing on virtio-blk; it was more a demo
case for conversion than anything else. So probably not of much use to
what you are looking for...

-- Jens Axboe
[Qemu-devel] Re: [PATCH RFC] virtio_blk: Use blk-iopoll for host-guest notify
On Tue, May 18 2010, Stefan Hajnoczi wrote:
> On Fri, May 14, 2010 at 05:30:56PM -0500, Brian Jackson wrote:
>> Any preliminary numbers? Latency, throughput, cpu use? What about
>> comparing different weights?
>
> I am running benchmarks and will report results when they are in.

I'm very interested as well; I have been hoping for some more adoption
of this. I have mptsas and mpt2sas patches pending as well. I have not
done a fully exhaustive weight analysis, so note me down for wanting
such an analysis on virtio_blk as well.

-- Jens Axboe
[Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
On Tue, May 04 2010, Rusty Russell wrote:
> On Fri, 19 Feb 2010 08:52:20 am Michael S. Tsirkin wrote:
>> I took a stab at documenting CMD and FLUSH request types in virtio
>> block. Christoph, could you look over this please?
>>
>> I note that the interface seems full of warts to me; this might be a
>> first step to cleaning them up.
>
> ISTR Christoph had withdrawn some patches in this area, and was
> waiting for him to resubmit?
>
> I've given up on figuring out the block device. What seem to me to be
> sane semantics along the lines of memory barriers are foreign to disk
> people: they want (and depend on) flushing everywhere.
>
> For example, tdb transactions do not require a flush, they only
> require what I would call a barrier: that prior data be written out
> before any future data. Surely that would be more efficient in general
> than a flush! In fact, TDB wants only writes to *that file* (and
> metadata) written out first; it has no ordering issues with other I/O
> on the same device.
>
> A generic I/O interface would allow you to specify "this request
> depends on these outstanding requests" and leave it at that. It might
> have some sync flush command for dumb applications and OSes. The
> userspace API might not be as precise and only allow such a barrier
> against all prior writes on this fd.
>
> ISTR someone mentioning a desire for such an API years ago, so CC'ing
> the usual I/O suspects...

It would be nice to have a fuller API for this, but the reality is that
only the flush approach is really workable. Even just strict ordering
of requests could only be supported on SCSI, and even there the kernel
still lacks proper guarantees on error handling to prevent reordering.

-- Jens Axboe
Re: [Qemu-devel] cdrom disc type - is this patch correct? (unbreaks recent FreeBSD guest's -cdrom access)
> ...                       0x01
> #define MST_SEP_MUTE      0x02
>
>     u_int16_t max_read_speed;   /* max raw data rate in bytes/1000 */
>     u_int16_t max_vol_levels;   /* number of discrete volume levels */
>     u_int16_t buf_size;         /* internal buffer size in bytes/1024 */
>     u_int16_t cur_read_speed;   /* current data rate in bytes/1000 */
>     u_int8_t  reserved3;
>     u_int8_t  misc;
>     u_int16_t max_write_speed;  /* max raw data rate in bytes/1000 */
>     u_int16_t cur_write_speed;  /* current data rate in bytes/1000 */
>     u_int16_t copy_protect_rev;
>     u_int16_t reserved4;
> };
> [...]
>
> and in
> http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/ata/atapi-cd.c?rev=1.193.2.1;content-type=text%2Fx-cvsweb-markup
> a check is done like this:
>
> [...]
> static int
> acd_geom_access(struct g_provider *pp, int dr, int dw, int de)
> {
>     device_t dev = pp->geom->softc;
>     struct acd_softc *cdp = device_get_ivars(dev);
>     int timeout = 60, track;
>
>     /* check for media present, waiting for loading medium just in case */
>     while (timeout--) {
>         if (!acd_mode_sense(dev, ATAPI_CDROM_CAP_PAGE,
>                             (caddr_t)&cdp->cap, sizeof(cdp->cap)) &&
>             cdp->cap.page_code == ATAPI_CDROM_CAP_PAGE) {
>             if ((cdp->cap.medium_type == MST_FMT_NONE) ||
>                 (cdp->cap.medium_type == MST_NO_DISC) ||
>                 (cdp->cap.medium_type == MST_DOOR_OPEN) ||
>                 (cdp->cap.medium_type == MST_FMT_ERROR))
>                 return EIO;
>             else
>                 break;
>         }
>         pause("acdld", hz / 2);
>     }
> [...]
>
> There have been reports of this also being broken on real hw though,
> like
> http://lists.freebsd.org/pipermail/freebsd-current/2007-November/079760.html
> so I'm not sure what to make of this...

Well, if you ask me (I used to maintain the Linux ATAPI driver), the
FreeBSD driver suffers from a classic case of "but the spec says so!"
syndrome. In this case it's even ancient documentation. Drivers should
never try to be 100% spec-oriented; they also need a bit of real-life
sensibility. The code you quote right above this text is clearly too
anal.

-- Jens Axboe
Re: [Qemu-devel] cdrom disc type - is this patch correct? (unbreaks recent FreeBSD guest's -cdrom access)
On Tue, Nov 13 2007, Juergen Lock wrote:
> Hi!
>
> Yesterday I learned that FreeBSD 7.0-BETA2 guests will no longer read
> from the emulated cd drive, apparently because of this commit:
> http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/ata/atapi-cd.c.diff?r1=1.193;r2=1.193.2.1
>
> The following patch file added to the qemu-devel port fixes the issue
> for me; is it also correct? (It makes the guest see a dvd in the drive
> when one is inserted; previously it saw the drive as empty.) The
> second hunk is already in qemu cvs so remove it if you want to test on
> that.
>
> ISO used for testing:
> ftp://ftp.freebsd.org:/pub/FreeBSD/ISO-IMAGES-i386/7.0/7.0-BETA2-i386-disc1.iso
> (test by either selecting fixit->cdrom or by trying to install; just
> booting it will always work because that goes thru the bios.)
>
> Index: qemu/hw/ide.c
> @@ -1339,6 +1341,8 @@
>      case 0x2a:
>          cpu_to_ube16(buf[0], 28 + 6);
>          buf[2] = 0x70;
> +        if (bdrv_is_inserted(s->bs))
> +            buf[2] = 0x40;

The medium type code has been obsolete since at least 1999. Looking
back at even older docs, 0x70 is 'door closed, no disc present'. 0x40
is a reserved value though, so I would not suggest using that. Given
that freebsd breaks, my suggested change would be the below - keep the
0x70 for when no disc is really inserted, but don't set anything if
there is.

diff --git a/hw/ide.c b/hw/ide.c
index 5f76c27..52d4c78 100644
--- a/hw/ide.c
+++ b/hw/ide.c
@@ -1344,7 +1344,10 @@ static void ide_atapi_cmd(IDEState *s)
         break;
     case 0x2a:
         cpu_to_ube16(buf[0], 28 + 6);
-        buf[2] = 0x70;
+        if (!bdrv_is_inserted(s->bs))
+            buf[2] = 0x70;
+        else
+            buf[2] = 0;
         buf[3] = 0;
         buf[4] = 0;
         buf[5] = 0;

-- Jens Axboe
Re: [Qemu-devel] qemu-i386 segfaults running hello world.
On Sun, Jun 24 2007, Rob Landley wrote:
> On Saturday 23 June 2007 07:00:03 Jens Axboe wrote:
>>> I realize releases are a bit out of fashion, but is there any way to
>>> go through cvs to track down which checkin broke this stuff? I can
>>> do it in git, mercurial, or subversion. But cvs isn't really set up
>>> for this sort of thing...
>>
>> git clone git://git.kernel.dk/data/git/qemu.git
>>
>> and bisect on that then. It's a continued git import of the cvs repo,
>> gets updated every night.
>
> Oh _cool_. Any way to get a mention of that on the qemu web page?

I don't mind, it's already been mentioned on some Japanese qemu-win
page for quite some time.

-- Jens Axboe
Re: [Qemu-devel] qemu-i386 segfaults running hello world.
On Sat, Jun 23 2007, Rob Landley wrote:
> On Friday 22 June 2007 18:31:20 Rob Landley wrote:
>> Ok, it's a more fundamental problem:
>>
>> [EMAIL PROTECTED]:/sys$ qemu-i386
>> Segmentation fault (core dumped)
>>
>> Nothing to do with the program it's trying to run, it segfaults with
>> no arguments. Is anybody else seeing this?
>>
>> Rob
>
> So I'm vaguely suspecting that some of the dynamic linker magic this
> thing's doing is contributing to the screw-up (or at least the
> complexity of debugging it), so I thought I'd statically link. If I
> ./configure --static the result doesn't build; it dies during linking.
> Is this expected? (Do I need to install .a versions of all the alsa
> and x11 libraries to make that work?)
>
> I realize releases are a bit out of fashion, but is there any way to
> go through cvs to track down which checkin broke this stuff? I can do
> it in git, mercurial, or subversion. But cvs isn't really set up for
> this sort of thing...

git clone git://git.kernel.dk/data/git/qemu.git

and bisect on that then. It's a continued git import of the cvs repo,
gets updated every night.

-- Jens Axboe
Re: [Qemu-devel] qemu/hw ide.c
On Mon, Feb 19 2007, Thiemo Seufer wrote:
> Thiemo Seufer wrote:
> [snip]
>>>> Why is nsector uint32_t to begin with?
>>>
>>> Because nobody sent a patch to fix it, I figure. Actually I seem to
>>> recall it's because it's being overloaded for requests that are 256
>>> sectors. It would be a good cleanup to get rid of that and turn
>>> nsector into a proper uint8_t.
>>
>> It appears to use 16k bits in some cases. I won't fiddle with it
>> myself.
>
> 16 bits, or 64k, that is.

Yeah, it's for larger requests. It would be nice to track that
elsewhere, though. I'll take a look at it.

-- Jens Axboe
_______________________________________________
Qemu-devel mailing list
Qemu-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] Ensuring data is written to disk
On Tue, Aug 01 2006, Jamie Lokier wrote:
>> Of course, guessing the disk drive write buffer size and trying not
>> to kill system I/O performance with all these writes is another
>> question entirely ... sigh !!!
>
> If you just want to evict all data from the drive's cache, and don't
> actually have other data to write, there is a CACHEFLUSH command you
> can send to the drive which will be more dependable than writing as
> much data as the cache size.

Exactly, and this is what the OS fsync() should do once the drive has
acknowledged that the data has been written (to cache). At least
reiserfs w/barriers on Linux does this. Random write tricks are
worthless, as you cannot make any assumptions about what the drive
firmware will do.

-- Jens Axboe
Re: [Qemu-devel] Ensuring data is written to disk
On Tue, Aug 01 2006, Jamie Lokier wrote:
> Jens Axboe wrote:
>> On Tue, Aug 01 2006, Jamie Lokier wrote:
>>> Of course, guessing the disk drive write buffer size and trying not
>>> to kill system I/O performance with all these writes is another
>>> question entirely ... sigh !!!
>>>
>>> If you just want to evict all data from the drive's cache, and don't
>>> actually have other data to write, there is a CACHEFLUSH command you
>>> can send to the drive which will be more dependable than writing as
>>> much data as the cache size.
>>
>> Exactly, and this is what the OS fsync() should do once the drive has
>> acknowledged that the data has been written (to cache). At least
>> reiserfs w/barriers on Linux does this.
>
> 1. Are you sure this happens, w/ reiserfs on Linux, even if the disk
>    is an SATA or SCSI type that supports ordered tagged commands? My
>    understanding is that barriers force an ordering between write
>    commands, and that CACHEFLUSH is used only with disks that don't
>    have more sophisticated write ordering commands. Is the data still
>    committed to the disk platter before fsync() returns on those?

No SATA drive supports ordered tags; that is a SCSI-only property. The
barrier writes are a separate thing; probably reiser ties the two
together because it needs to know if the flush cache command works as
expected. Drives are funny sometimes... For SATA you always need at
least one cache flush (you need one if you have the FUA/Forced Unit
Access write available, two if not).

> 2. Do you know if ext3 (in ordered mode) w/barriers on Linux does it
>    too, for in-place writes which don't modify the inode and therefore
>    don't have a journal entry?

I don't think that it does, however it may have changed. A quick grep
would seem to indicate that it has not changed.

> On Darwin, fsync() does not issue CACHEFLUSH to the drive. Instead, it
> has an fcntl F_FULLFSYNC which does that, which is documented in
> Darwin's fsync() page as working with all Darwin's filesystems,
> provided the hardware honours CACHEFLUSH or the equivalent.

That seems somewhat strange to me; I'd much rather be able to say that
fsync() itself is safe. An added fcntl hack doesn't really help the
applications that already rely on the correct behaviour.

> From what little documentation I've found, on Linux it appears to be
> much less predictable. It seems that some filesystems, with some
> kernel versions, and some mount options, on some types of disk, with
> some drive settings, will commit data to a platter before fsync()
> returns, and others won't. And an application calling fsync() has no
> easy way to find out. Have I got this wrong?

Nope, I'm afraid that is pretty much true... reiser and (it looks like,
I just grepped) XFS have the best support for this. Unfortunately I
don't think the user can actually tell if the OS does the right thing,
outside of running a blktrace and verifying that it actually sends a
flush cache down the queue.

> ps. (An aside question): do you happen to know of a good patch which
> implements IDE barriers w/ ext3 on 2.4 kernels? I found a patch by
> googling, but it seemed that the ext3 parts might not be finished, so
> I don't trust it. I've found turning off the IDE write cache makes
> writes safe, but with a huge performance cost.

The hard part (the IDE code) can be grabbed from the latest SLES8
kernels; I developed and tested the code there. That also has the ext3
bits, IIRC.

-- Jens Axboe
Re: [Qemu-devel] Re: [RFC][PATCH] make sure disk writes actually hit disk
On Fri, Jul 28 2006, Rik van Riel wrote:
> Anthony Liguori wrote:
>> Right now Fabrice is working on rewriting the block API to be
>> asynchronous. There's been quite a lot of discussion about why using
>> threads isn't a good idea for this.
>
> Agreed, AIO is the way to go in the long run.
>
>> With a proper async API, is there any reason why we would want this
>> to be tunable? I don't think there's much of a benefit to prematurely
>> claiming a write is complete, especially once the SCSI emulation can
>> support multiple simultaneous requests.
>
> You're right. This O_SYNC bandaid should probably stay in place to
> prevent data corruption, until the AIO framework is ready to be used.

O_SYNC is horrible; it'll totally kill performance. QEMU is basically
just a write-cache-enabled disk, and it supports disk flushes as well.
So essentially it's the OS on top of QEMU that needs to take care of
flushing data out, like using barriers on the file system and
propagating fsync() properly down.

-- Jens Axboe
Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
On Sat, Jul 29 2006, Paul Brook wrote:
>> Easy to do with the fsync infrastructure, but probably not worth
>> doing since people are working on the AIO I/O backend, which would
>> allow multiple outstanding writes from a guest. That, in turn, means
>> I/O completion in the guest can be done when the data really hits
>> disk, but without a performance impact.
>
> Not entirely true. That only works if you allow multiple guest IO
> requests in parallel, ie. some form of tagged command queueing. This
> requires either improving the SCSI emulation, or implementing SATA
> emulation. AFAIK parallel IDE doesn't support command queueing.

Parallel IDE does support queueing, but it never gained widespread
support, and the standard is quite broken as well (which is probably
_why_ it never got much adoption). It was also quite suboptimal from a
CPU efficiency POV.

> Besides, async completion in itself is not enough; QEMU still needs to
> honor ordered writes (barriers) and cache flushes. My impression was
> that the initial AIO implementation is just straight serial async
> operation. IO wouldn't actually go any faster, it just means the guest
> can do something else while it's waiting.

Depends on the app; if the IO workload is parallel then you should see
a nice speedup as well (as QEMU is then no longer the serializing
bottleneck).

-- Jens Axboe
Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
On Sat, Jul 29 2006, Rik van Riel wrote:
> Fabrice Bellard wrote:
>> Hi,
>>
>> Using O_SYNC for disk image access is not acceptable: QEMU relies on
>> the host OS to ensure that the data is written correctly.
>
> This means that write ordering is not preserved, and on a power
> failure any data written by qemu (or Xen fully-virt) guests may not be
> preserved.
>
> Applications running on the host can count on fsync doing the right
> thing, meaning that if they call fsync, the data *will* have made it
> to disk. Applications running inside a guest have no guarantees that
> their data is actually going to make it anywhere when fsync returns...

Then the guest OS is broken. Applications issuing an fsync() should
cause a flush (or write-through); the guest OS should propagate this
knowledge through its IO stack, and the QEMU hard drive should get
notified. If the guest OS isn't doing what it's supposed to, QEMU can't
help you. And, in fact, running your app on the same host OS with
writeback caching would screw you as well. The timing window will
probably be larger with QEMU, but the problem is essentially the same.

-- Jens Axboe
Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
On Mon, Jul 31 2006, Jonas Maebe wrote:
> On 31 jul 2006, at 09:08, Jens Axboe wrote:
>>> Applications running on the host can count on fsync doing the right
>>> thing, meaning that if they call fsync, the data *will* have made it
>>> to disk. Applications running inside a guest have no guarantees that
>>> their data is actually going to make it anywhere when fsync
>>> returns...
>>
>> Then the guest OS is broken.
>
> The problem is that supposedly many OSes are broken in this way. See
> http://lists.apple.com/archives/darwin-dev/2005/Feb/msg00072.html

Well, as others have written here as well, then their OS is broken on
real hardware, too.

> I wouldn't be averse to a QEMU work-around, but O_SYNC is clearly not
> a viable alternative!

We could make QEMU behave more like a real hard drive when it has AIO
support, flushing dirty cache out in a manner more closely mimicking
what a drive would do, instead of relying on the page cache writeout
deciding to write it out.

-- Jens Axboe
Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
On Mon, Jul 31 2006, andrzej zaborowski wrote:
> On 30/07/06, Jamie Lokier <[EMAIL PROTECTED]> wrote:
>> Rik van Riel wrote:
>>> This may look like hair splitting, but so far I've lost a (test)
>>> postgresql database to this 3 times already. Not getting the guest
>>> application's data to disk when the application calls fsync is a
>>> recipe for disaster.
>>
>> Exactly the same thing happens with real IDE disks if IDE write
>> caching (on the drive itself) is enabled, which it is by default. It
>> is rarer, but it happens.
>
> The little difference with QEMU is that there are two caches above it:
> the host OS's software cache and the IDE hardware cache. When a guest
> OS flushes its own software cache, its precious data goes to the
> host's software cache while the guest thinks it's already in the IDE
> cache. This is of course of less importance because data in both
> caches (hard- and software) is lost when the power is cut off.

But the drive cache does not let the dirty data linger for as long as
the OS page/buffer cache.

> IMHO what really makes IO unreliable in QEMU is that IO errors on the
> host are not reported to the guest by the IDE emulation, and there's
> an exact place in hw/ide.c where they are arrogantly ignored.

Send a patch, I'm pretty sure nobody would disagree :-)

-- Jens Axboe
Re: [Qemu-devel] Re: high CPU load / async IO?
On Tue, Jul 25 2006, Fabrice Bellard wrote:
> Jens Axboe wrote:
>> On Tue, Jul 25 2006, Sven Köhler wrote:
>>>> So the current thread-based async dma patch is really just the
>>>> wrong long-term solution. A more long-term solution is likely in
>>>> the works. It requires quite a bit of code modification though.
>>>
>>> I see. So in other words: don't ask for simple async I/O now. The
>>> more complex and flexible solution will follow soon.
>>
>> Yes, hopefully really soon.
>>
>>> So i will wait patiently :-)
>>
>> Is anyone actively working on this, or is it just speculation? I'd
>> greatly prefer (and might do, if no one is working on it and Fabrice
>> would take it) a libaio version, since that'll for sure perform the
>> best on Linux. But a posix aio version might be saner, as that should
>> work on other operating systems as well. Fabrice, can you let people
>> know what you would prefer?
>
> I am working on an implementation and the first version will use posix
> aio and possibly the Windows ReadFile/WriteFile overlapped I/Os.
> Anthony Liguori got a pre-version of the code, but it is not
> committable yet.

Sounds good, so at least it's on its way :-) It's one of those big
items left on the TODO, so it will be good to see it go in. Then one
should implement an AHCI host controller for queued command support
next...

-- Jens Axboe
Re: [Qemu-devel] Re: high CPU load / async IO?
On Wed, Jul 26 2006, Paul Brook wrote: Sounds good, so at least it's on its way :-) It's one of those big items left on the TODO, so it will be good to see go in. Then one should implement an ahci host controller for queued command support next... Or use the scsi emulation :-) Ah, did not know that queueing was fully implemented there yet! -- Jens Axboe ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] Re: high CPU load / async IO?
On Wed, Jul 26 2006, Paul Brook wrote: On Wednesday 26 July 2006 13:23, Jens Axboe wrote: On Wed, Jul 26 2006, Paul Brook wrote: Sounds good, so at least it's on its way :-) It's one of those big items left on the TODO, so it will be good to see go in. Then one should implement an ahci host controller for queued command support next... Or use the scsi emulation :-) Ah, did not know that queueing was fully implemented there yet! It isn't, but it's nearer than the SATA emulation! ahci wouldn't be too much work, but definitely more so than finishing the scsi bits! -- Jens Axboe ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] Re: high CPU load / async IO?
On Wed, Jul 26 2006, Sven Köhler wrote: Sounds good, so at least it's on its way :-) It's one of those big items left on the TODO, so it will be good to see go in. Then one should implement an ahci host controller for queued command support next... Or use the scsi emulation :-) Ah, did not know that queueing was fully implemented there yet! It isn't, but it's nearer than the SATA emulation! ahci wouldn't be too much work, but definitely more so than finishing the scsi bits! That sounds great! I feel like my dreams are coming true. BTW: Fabrice said he will use POSIX AIO (I guess he means http://www.bullopensource.org/posix/ in case of Linux, right?) Well, I would assume that he would just use the glibc posix aio, which is suboptimal but at least the code can be reused. The bull project looks like it's trying to mimic posix aio on top of linux aio, so (if they got the details right) that should be faster. I didn't check their sources, though. You should be able to use the bull stuff with qemu; it would most likely work by overloading the glibc functions for posix aio. Which other OS do also support the POSIX AIO API? No idea really, but I would guess any unixy OS out there. -- Jens Axboe ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] Re: high CPU load / async IO?
On Tue, Jul 25 2006, Sven Köhler wrote: So the current thread-based async dma patch is really just the wrong long term solution. A more long term solution is likely in the works. It requires quite a bit of code modification though. I see. So in other words: don't ask for simple async I/O now. The more complex and flexible solution will follow soon. Yes, hopefully really soon. So I will wait patiently :-) Is anyone actively working on this, or is it just speculation? I'd greatly prefer (and might do, if no one is working on it and Fabrice would take it) to do a libaio version, since that'll for sure perform the best on Linux. But a posixaio version might be saner, as that should work on other operating systems as well. Fabrice, can you let people know what you would prefer? -- Jens Axboe ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] Pentium D with guest Ubuntu 6.06 server kernel panic with kqemu
On Fri, Jul 07 2006, Joachim Henke wrote: Yes, this patch was included, but it doesn't solve that problem. As this message [http://www.mail-archive.com/qemu-devel@nongnu.org/ msg03972.html] states, the 'monitor' and the 'mwait' instructions have not been added. But your guest OS assumes them to be present, because your host cpu has the MONITOR flag set in CPUID. Jo. R. Armiento wrote: The error looks very similar to the one reported here: http://www.mail-archive.com/qemu-devel@nongnu.org/msg03964.html But I believe that reported issue should not appear in recent qemu, since SSE3 is now emulated (right?). (At least the patch in the end of that thread seems to already be included in the sources?) So, my hypothesis is that there is some other feature that appears in my host CPUID, which the booting linux image tries to make use of, but which qemu does not emulate. Until that gets fixed up, you can boot with idle=halt. -- Jens Axboe ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: RE : [Qemu-devel] cvttps2dq, movdq2q, movq2dq incorrect behaviour
On Tue, Jun 20 2006, malc wrote: On Tue, 20 Jun 2006, Sylvain Petreolle wrote: --- Julian Seward [EMAIL PROTECTED] a écrit : The SSE2 instructions cvttps2dq, movdq2q, movq2dq do not behave correctly, as shown by the attached program. It should print cvttps2dq_1 ... ok cvttps2dq_2 ... ok movdq2q_1 ... ok movq2dq_1 ... ok I tried your program on my linux station : CPU: AMD Athlon(tm) XP 1600+ stepping 02 [EMAIL PROTECTED] qemu]$ gcc --version gcc (GCC) 4.1.1 20060525 (Red Hat 4.1.1-1) [EMAIL PROTECTED] qemu]$ gcc -msse2 sse2test.c -o sse2test [EMAIL PROTECTED] qemu]$ ./sse2test cvttps2dq_1 ... failed cvttps2dq_2 ... failed movdq2q_1 ... failed movq2dq_1 ... failed what am I doing wrong here? Running it on a CPU without SSE2, if I'm allowed to venture a guess. Doesn't work for me, either: [EMAIL PROTECTED]:/home/axboe $ ./a cvttps2dq_1 ... not ok result0.sd[0] = 0 (expected 12) result0.sd[1] = 0 (expected 56) result0.sd[2] = 0 (expected 43) result0.sd[3] = 0 (expected 87) cvttps2dq_2 ... not ok result0.sd[0] = 0 (expected 12) result0.sd[1] = 0 (expected 56) result0.sd[2] = 0 (expected 43) result0.sd[3] = 0 (expected 87) movdq2q_1 ... not ok result0.uq[0] = 240518168588 (expected 5124095577148911) movq2dq_1 ... not ok result0.uq[0] = 0 (expected 5124095577148911) result0.uq[1] = 0 (expected 0) [EMAIL PROTECTED]:/home/axboe $ ./a Segmentation fault Varies between the two. Compiling without -O2 makes the last two succeed, the others still not. This CPU has sse2. -- Jens Axboe ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: RE : [Qemu-devel] cvttps2dq, movdq2q, movq2dq incorrect behaviour
On Tue, Jun 20 2006, Jens Axboe wrote: On Tue, Jun 20 2006, malc wrote: On Tue, 20 Jun 2006, Sylvain Petreolle wrote: --- Julian Seward [EMAIL PROTECTED] a écrit : The SSE2 instructions cvttps2dq, movdq2q, movq2dq do not behave correctly, as shown by the attached program. It should print cvttps2dq_1 ... ok cvttps2dq_2 ... ok movdq2q_1 ... ok movq2dq_1 ... ok I tried your program on my linux station : CPU: AMD Athlon(tm) XP 1600+ stepping 02 [EMAIL PROTECTED] qemu]$ gcc --version gcc (GCC) 4.1.1 20060525 (Red Hat 4.1.1-1) [EMAIL PROTECTED] qemu]$ gcc -msse2 sse2test.c -o sse2test [EMAIL PROTECTED] qemu]$ ./sse2test cvttps2dq_1 ... failed cvttps2dq_2 ... failed movdq2q_1 ... failed movq2dq_1 ... failed what am I doing wrong here? Running it on a CPU without SSE2, if I'm allowed to venture a guess. Doesn't work for me, either: [EMAIL PROTECTED]:/home/axboe $ ./a cvttps2dq_1 ... not ok result0.sd[0] = 0 (expected 12) result0.sd[1] = 0 (expected 56) result0.sd[2] = 0 (expected 43) result0.sd[3] = 0 (expected 87) cvttps2dq_2 ... not ok result0.sd[0] = 0 (expected 12) result0.sd[1] = 0 (expected 56) result0.sd[2] = 0 (expected 43) result0.sd[3] = 0 (expected 87) movdq2q_1 ... not ok result0.uq[0] = 240518168588 (expected 5124095577148911) movq2dq_1 ... not ok result0.uq[0] = 0 (expected 5124095577148911) result0.uq[1] = 0 (expected 0) [EMAIL PROTECTED]:/home/axboe $ ./a Segmentation fault Varies between the two. Compiling without -O2 makes the last two succeed, the others still not. This CPU has sse2. 32-bit version works, as intended I guess. -- Jens Axboe ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] kqemu version 1.3.0pre5
On Tue, Mar 28 2006, Ed Swierk wrote: I'm still getting a kernel panic running a Linux guest kernel with -kernel-qemu. I'm using kqemu-1.3.0pre5 and qemu-snapshot-2006-03-27_23. The guest kernel is a precompiled Fedora Core 4 kernel, version 2.6.14-1.1656_FC4. It works fine with kqemu in non-kernel-kqemu mode. Any hints for how to track this problem down? [snip] monitor/mwait feature present. using mwait in idle threads. [snip] invalid operand: [#1] Modules linked in: CPU:0 EIP:0060:[c0101147]Not tainted VLI EFLAGS: 00010246 (2.6.14-1.1656_FC4) EIP is at mwait_idle+0x2f/0x41 I don't think qemu supports PNI, which includes the monitor/mwait additions. I wonder why Linux detects that. You can probably get around it for now by either passing idle=poll as a boot parameter, or compile your kernel for plain i586 for instance. -- Jens Axboe ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] smp support and ide lba48
On Mon, Mar 13 2006, Mario Goppold wrote: On Saturday, 11 March 2006 13:31, Jens Axboe wrote: On Fri, Mar 10 2006, Mario Goppold wrote: Hi, I am trying to install SuSE92-64 on a 400G HD but it fails: hda: max request size: 128KiB hda: cannot use LBA48 - full capacity 838860800 sectors (429496 MB) hda: 268435456 sectors (137438 MB) w/256KiB Cache, CHS=65535/16/63, (U)DMA hda:4hda: lost interrupt hda: lost interrupt ... If I switch to 32bit (in grub) it works. Here is my Env: Qemu: snapshot20060304 (gcc version 3.3.6) KQemu: kqemu-1.3.0pre3 (gcc version 4.0.2, SuSE10.0, 2.6.13-15.8-smp) qemu-img create test.img 400G qemu-system-x86_64 -m 512 -k de -localtime -smp 2 \ -net nic,vlan=0,macaddr=00:01:02:03:04:05 -net tap,vlan=0 \ -hda test.img -cdrom /dev/dvd -boot d If I reduce the image size it doesn't get better. Just now I tried it without -smp 2 and saw what I wanted: unknown partition table ... So my question is: is lba48 not smp safe, or is smp support broken (or incomplete)? lba48 support is not committed yet; read the linux message - it says it cannot use lba48, because the drive (qemu) doesn't support it. Find my latest posting on this list, it should get you going. Oh, I overlooked that the patches are not committed yet. Now I have adapted the patches to snapshot_2006-03-12 (patch 2/3 and 3/3 of your mail from 4.1.2006) and applied them, but with no success: hda: max request size: 128KiB hda: 838860800 sectors (429496 MB) w/256KiB Cache, CHS=52216/255/63, (U)DMA hda: lost interrupt hda: lost interrupt hda: lost interrupt hda: lost interrupt hda:4hda: dma_timer_expiry: dma status == 0x24 hda: DMA interrupt recovery hda: lost interrupt ... So now you see the full drive size and linux can use it; however, there seems to be an unrelated problem with interrupt delivery in smp mode. I can't say what causes that, other 'devices' will likely show the same problem. -- Jens Axboe ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] smp support and ide lba48
On Fri, Mar 10 2006, Mario Goppold wrote: Hi, I am trying to install SuSE92-64 on a 400G HD but it fails: hda: max request size: 128KiB hda: cannot use LBA48 - full capacity 838860800 sectors (429496 MB) hda: 268435456 sectors (137438 MB) w/256KiB Cache, CHS=65535/16/63, (U)DMA hda:4hda: lost interrupt hda: lost interrupt ... If I switch to 32bit (in grub) it works. Here is my Env: Qemu: snapshot20060304 (gcc version 3.3.6) KQemu: kqemu-1.3.0pre3 (gcc version 4.0.2, SuSE10.0, 2.6.13-15.8-smp) qemu-img create test.img 400G qemu-system-x86_64 -m 512 -k de -localtime -smp 2 \ -net nic,vlan=0,macaddr=00:01:02:03:04:05 -net tap,vlan=0 \ -hda test.img -cdrom /dev/dvd -boot d If I reduce the image size it doesn't get better. Just now I tried it without -smp 2 and saw what I wanted: unknown partition table ... So my question is: is lba48 not smp safe, or is smp support broken (or incomplete)? lba48 support is not committed yet; read the linux message - it says it cannot use lba48, because the drive (qemu) doesn't support it. Find my latest posting on this list, it should get you going. -- Jens Axboe ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] [PATCH] Fix Harddisk initialization
On Tue, Feb 21 2006, Thiemo Seufer wrote: Hello All, this fixes harddisk initialization (s->nsector is initially 0x100, which is supposed to get handled as zero). Thiemo

Index: qemu-work/hw/ide.c
===================================================================
--- qemu-work.orig/hw/ide.c	2006-02-18 22:12:56.0 +0000
+++ qemu-work/hw/ide.c	2006-02-19 02:34:13.0 +0000
@@ -1550,12 +1550,12 @@
         ide_set_irq(s);
         break;
     case WIN_SETMULT:
-        if (s->nsector > MAX_MULT_SECTORS ||
+        if ((s->nsector & 0xFF) > MAX_MULT_SECTORS ||
             s->nsector == 0 ||
             (s->nsector & (s->nsector - 1)) != 0) {
             ide_abort_command(s);
         } else {
-            s->mult_sectors = s->nsector;
+            s->mult_sectors = s->nsector & 0xFF;
             s->status = READY_STAT;
         }
         ide_set_irq(s);

I think the much better patch would be to fix qemu not to put 256 unconditionally in ->nsector if it is written as zero. It's really a special case for the read/write commands only, not a general fixup. I'd suggest adding an nsector_internal field to fix this up internally in the read/write path, so all registers correctly reflect what was actually written by the OS. -- Jens Axboe ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] [PATCH 2/3] ide lba48 support
On Wed, Feb 01 2006, Fabrice Bellard wrote: Jens Axboe wrote: Subject: [PATCH] Add lba48 support to ide From: Jens Axboe [EMAIL PROTECTED] Date: 1136376117 +0100 Add lba48 support for the ide code. Read back of hob registers isn't there yet, though. Do you have a more recent patch? In your latest patch, the lba48 field is never reset and the nsector may be broken. The lba48 setting did look a little odd, should be corrected now. I guess that is what would affect the nsector stuff; it looks correct to me now.

From nobody Mon Sep 17 00:00:00 2001
From: Jens Axboe [EMAIL PROTECTED]
Date: Thu Feb 2 10:51:20 2006 +0100
Subject: [PATCH] Add lba48 support to ide

Enables qemu to support ide disk images > 2^28 * 512 bytes.

---
 hw/ide.c | 157 ++
 1 files changed, 137 insertions(+), 20 deletions(-)
b67eb122b5646ddcfd13d45563bbe6aa5309e9c0
diff --git a/hw/ide.c b/hw/ide.c
index 50b8e63..01b10e1 100644
--- a/hw/ide.c
+++ b/hw/ide.c
@@ -307,14 +307,24 @@ typedef struct IDEState {
     /* ide regs */
     uint8_t feature;
     uint8_t error;
-    uint16_t nsector; /* 0 is 256 to ease computations */
+    uint32_t nsector;
     uint8_t sector;
     uint8_t lcyl;
     uint8_t hcyl;
+    /* other part of tf for lba48 support */
+    uint8_t hob_feature;
+    uint8_t hob_nsector;
+    uint8_t hob_sector;
+    uint8_t hob_lcyl;
+    uint8_t hob_hcyl;
+
     uint8_t select;
     uint8_t status;
+
     /* 0x3f6 command, only meaningful for drive 0 */
     uint8_t cmd;
+    /* set for lba48 access */
+    uint8_t lba48;
     /* depends on bit 4 in select, only meaningful for drive 0 */
     struct IDEState *cur_drive;
     BlockDriverState *bs;
@@ -462,13 +472,19 @@ static void ide_identify(IDEState *s)
     put_le16(p + 80, 0xf0); /* ata3 - ata6 supported */
     put_le16(p + 81, 0x16); /* conforms to ata5 */
     put_le16(p + 82, (1 << 14));
-    put_le16(p + 83, (1 << 14));
+    /* 13=flush_cache_ext,12=flush_cache,10=lba48 */
+    put_le16(p + 83, (1 << 14) | (1 << 13) | (1 << 12) | (1 << 10));
     put_le16(p + 84, (1 << 14));
     put_le16(p + 85, (1 << 14));
-    put_le16(p + 86, 0);
+    /* 13=flush_cache_ext,12=flush_cache,10=lba48 */
+    put_le16(p + 86, (1 << 14) | (1 << 13) | (1 << 12) | (1 << 10));
     put_le16(p + 87, (1 << 14));
     put_le16(p + 88, 0x3f | (1 << 13)); /* udma5 set and supported */
     put_le16(p + 93, 1 | (1 << 14) | 0x2000);
+    put_le16(p + 100, s->nb_sectors);
+    put_le16(p + 101, s->nb_sectors >> 16);
+    put_le16(p + 102, s->nb_sectors >> 32);
+    put_le16(p + 103, s->nb_sectors >> 48);

     memcpy(s->identify_data, p, sizeof(s->identify_data));
     s->identify_set = 1;
@@ -572,12 +588,19 @@ static int64_t ide_get_sector(IDEState *s)
     int64_t sector_num;
     if (s->select & 0x40) {
         /* lba */
-        sector_num = ((s->select & 0x0f) << 24) | (s->hcyl << 16) |
-            (s->lcyl << 8) | s->sector;
+        if (!s->lba48) {
+            sector_num = ((s->select & 0x0f) << 24) | (s->hcyl << 16) |
+                (s->lcyl << 8) | s->sector;
+        } else {
+            sector_num = ((int64_t)s->hob_hcyl << 40) |
+                ((int64_t) s->hob_lcyl << 32) |
+                ((int64_t) s->hob_sector << 24) |
+                ((int64_t) s->hcyl << 16) |
+                ((int64_t) s->lcyl << 8) | s->sector;
+        }
     } else {
         sector_num = ((s->hcyl << 8) | s->lcyl) * s->heads * s->sectors +
-            (s->select & 0x0f) * s->sectors +
-            (s->sector - 1);
+            (s->select & 0x0f) * s->sectors + (s->sector - 1);
     }
     return sector_num;
 }
@@ -586,10 +609,19 @@ static void ide_set_sector(IDEState *s, int64_t sector_num)
 {
     unsigned int cyl, r;
     if (s->select & 0x40) {
-        s->select = (s->select & 0xf0) | (sector_num >> 24);
-        s->hcyl = (sector_num >> 16);
-        s->lcyl = (sector_num >> 8);
-        s->sector = (sector_num);
+        if (!s->lba48) {
+            s->select = (s->select & 0xf0) | (sector_num >> 24);
+            s->hcyl = (sector_num >> 16);
+            s->lcyl = (sector_num >> 8);
+            s->sector = (sector_num);
+        } else {
+            s->sector = sector_num;
+            s->lcyl = sector_num >> 8;
+            s->hcyl = sector_num >> 16;
+            s->hob_sector = sector_num >> 24;
+            s->hob_lcyl = sector_num >> 32;
+            s->hob_hcyl = sector_num >> 40;
+        }
     } else {
         cyl = sector_num / (s->heads * s->sectors);
         r = sector_num % (s->heads * s->sectors);
@@ -1475,43 +1507,89 @@ static void cdrom_change_cb(void *opaque)
     s->nb_sectors = nb_sectors;
 }

+static void ide_cmd_lba48_transform(IDEState *s, int lba48)
+{
+    s->lba48 = lba48;
+
+    /* handle the 'magic' 0 nsector count conversion here. to avoid
+     * fiddling with the rest of the read logic, we just store the
+     * full sector count in ->nsector and ignore ->hob_nsector from now
+     */
+    if (!s->lba48) {
+        if (!s->nsector)
+            s->nsector = 256;
+    } else {
+        if (!s->nsector && !s->hob_nsector)
+            s->nsector = 65536;
+        else
Re: [Qemu-devel] [PATCH 3/3] proper support of FLUSH_CACHE and FLUSH_CACHE_EXT
On Wed, Jan 04 2006, Johannes Schindelin wrote: Hi, On Wed, 4 Jan 2006, Jens Axboe wrote: 1.0.GIT Using git for QEmu development? Welcome to the club. ;-) Yes, I just imported the repo into git; cvs isn't really my cup of tea and it isn't very handy for patch series. git isn't very tailored for that either, but at least it allows me to just do a 'format-patch' against the old master and get the patch series. And with a devel branch, it's pretty easy to pull the new updates and rebase the devel branch as needed. Are you using a persistent git repo for qemu (ie continually importing new changes)? I've considered setting one up :-) Regarding your patches: as far as I understand them, I like 'em. Thanks! -- Jens Axboe ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] [PATCH 3/3] proper support of FLUSH_CACHE and FLUSH_CACHE_EXT
On Thu, Jan 05 2006, Jens Axboe wrote: Are you using a persistent git repo for qemu (ie continually importing new changes)? I've considered setting one up :-) I set up such a gateway; it should be updated every night from Fabrice's cvs repository. The web interface is here: http://brick.kernel.dk/git/?p=qemu.git;a=summary and you can pull from the following git url: git://brick.kernel.dk/data/git/cvsdata/qemu I've added the 'ide' branch, with the patches posted here. If there's an interest in this (the git repo, not the ide patches :), I can push it to kernel.org as well. -- Jens Axboe ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
[Qemu-devel] [PATCH 0/3] qemu ide updates
Hi, Here's the set of 3 patches I currently have for the qemu ide/block code. 1/3: The ide id updates 2/3: lba48 support 3/3: Proper support of the flush cache command -- Jens Axboe ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
[Qemu-devel] [PATCH] ide id updates
(s->identify_data + 88, 0x3f);
+            break;
+        case 0x04: /* mdma mode */
+            put_le16(s->identify_data + 63, 0x07 | (1 << (val + 8)));
+            put_le16(s->identify_data + 88, 0x3f);
+            break;
+        case 0x08: /* udma mode */
+            put_le16(s->identify_data + 63, 0x07);
+            put_le16(s->identify_data + 88, 0x3f | (1 << (val + 8)));
+            break;
+        default:
+            goto abort_cmd;
+        }
+        s->status = READY_STAT | SEEK_STAT;
+        ide_set_irq(s);
+        break;
+    }
     default:
         goto abort_cmd;
     }
-- Jens Axboe ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] [PATCH] lba48 support
On Fri, Dec 30 2005, Fabrice Bellard wrote: Jens Axboe wrote: Saw the posts on this the other day and had a few spare hours to play with this. Works for me, with and without DMA (didn't test mult mode, but that should work fine too). Test with caution though, it's changing the ide code so it could eat your data if there's a bug there... Most clever OS's don't use lba48 even for lba48 capable drives, unless the device is > 2^28 sectors and the current request is past that (but they could be taking advantage of the larger transfer size possible, in which case lba48 will be used even for low sectors...). Thank you for the patch ! At least two details should be corrected before I can apply it: 1) Each duplicated IDE register acts as a 2 byte FIFO, so the logic you added in the write function should be modified (the regs_written field is not needed). 2) The read back logic should be implemented (HOB bit in the device control register). Updated patch below. The read back logic doesn't work right now, since we always set bits 5-7 (the obsolete bits) in device select. But I've dropped the regs_written hack; the hob registers are now (as intended) always the previous value. That makes it LIFO, which I suppose is what you meant?
Index: hw/ide.c === RCS file: /sources/qemu/qemu/hw/ide.c,v retrieving revision 1.38 diff -u -r1.38 ide.c --- hw/ide.c6 Aug 2005 09:14:32 - 1.38 +++ hw/ide.c2 Jan 2006 12:58:15 - @@ -305,14 +305,24 @@ /* ide regs */ uint8_t feature; uint8_t error; -uint16_t nsector; /* 0 is 256 to ease computations */ +uint32_t nsector; uint8_t sector; uint8_t lcyl; uint8_t hcyl; +/* other part of tf for lba48 support */ +uint8_t hob_feature; +uint8_t hob_nsector; +uint8_t hob_sector; +uint8_t hob_lcyl; +uint8_t hob_hcyl; + uint8_t select; uint8_t status; + /* 0x3f6 command, only meaningful for drive 0 */ uint8_t cmd; +/* set for lba48 access */ +uint8_t lba48; /* depends on bit 4 in select, only meaningful for drive 0 */ struct IDEState *cur_drive; BlockDriverState *bs; @@ -449,13 +459,17 @@ put_le16(p + 61, s-nb_sectors 16); put_le16(p + 80, (1 1) | (1 2)); put_le16(p + 82, (1 14)); -put_le16(p + 83, (1 14)); +put_le16(p + 83, (1 14) | (1 10)); /* lba48 supported */ put_le16(p + 84, (1 14)); put_le16(p + 85, (1 14)); -put_le16(p + 86, 0); +put_le16(p + 86, (1 14) | (1 10)); /* lba48 supported */ put_le16(p + 87, (1 14)); put_le16(p + 88, 0x1f | (1 13)); put_le16(p + 93, 1 | (1 14) | 0x2000 | 0x4000); +put_le16(p + 100, s-nb_sectors); +put_le16(p + 101, s-nb_sectors 16); +put_le16(p + 102, s-nb_sectors 32); +put_le16(p + 103, s-nb_sectors 48); } static void ide_atapi_identify(IDEState *s) @@ -548,12 +562,18 @@ int64_t sector_num; if (s-select 0x40) { /* lba */ -sector_num = ((s-select 0x0f) 24) | (s-hcyl 16) | -(s-lcyl 8) | s-sector; + if (!s-lba48) { + sector_num = ((s-select 0x0f) 24) | (s-hcyl 16) | + (s-lcyl 8) | s-sector; + } else { + sector_num = ((int64_t)s-hcyl 40) | + ((int64_t) s-lcyl 32) | + (s-sector 24) | (s-hob_hcyl 16) | + (s-hob_lcyl 8) | s-hob_sector; + } } else { sector_num = ((s-hcyl 8) | s-lcyl) * s-heads * s-sectors + -(s-select 0x0f) * s-sectors + -(s-sector - 1); +(s-select 0x0f) * s-sectors + (s-sector - 1); } return sector_num; } @@ -562,10 +582,19 @@ { 
unsigned int cyl, r; if (s-select 0x40) { -s-select = (s-select 0xf0) | (sector_num 24); -s-hcyl = (sector_num 16); -s-lcyl = (sector_num 8); -s-sector = (sector_num); + if (!s-lba48) { +s-select = (s-select 0xf0) | (sector_num 24); +s-hcyl = (sector_num 16); +s-lcyl = (sector_num 8); +s-sector = (sector_num); + } else { + s-hob_sector = sector_num; + s-hob_lcyl = sector_num 8; + s-hob_hcyl = sector_num 16; + s-sector = sector_num 24; + s-lcyl = sector_num 32; + s-hcyl = sector_num 40; + } } else { cyl = sector_num / (s-heads * s-sectors); r = sector_num % (s-heads * s-sectors); @@ -1451,43 +1480,65 @@ s-nb_sectors = nb_sectors; } +static void ide_clear_hob(IDEState *ide_if) +{ +/* any write clears HOB high bit of device control register */ +ide_if[0].select = ~(1 7); +ide_if[1].select = ~(1 7); +} + static void ide_ioport_write(void *opaque, uint32_t addr, uint32_t val) { IDEState *ide_if = opaque; IDEState *s; -int unit, n; +int unit, n, lba48_cmd = 0
Re: [Qemu-devel] [PATCH] lba48 support
On Fri, Dec 30 2005, Fabrice Bellard wrote: Jens Axboe wrote: Saw the posts on this the other day and had a few spare hours to play with this. Works for me, with and without DMA (didn't test mult mode, but that should work fine too). Test with caution though, it's changing the ide code so it could eat your data if there's a bug there... Most clever OS's don't use lba48 even for lba48 capable drives, unless the device is > 2^28 sectors and the current request is past that (but they could be taking advantage of the larger transfer size possible, in which case lba48 will be used even for low sectors...). Thank you for the patch ! At least two details should be corrected before I can apply it: 1) Each duplicated IDE register acts as a 2 byte FIFO, so the logic you added in the write function should be modified (the regs_written field is not needed). Perfect, I wasn't very fond of that approach either (it seemed fragile). 2) The read back logic should be implemented (HOB bit in the device control register). Indeed. I'll get these things fixed up, won't be before Monday though. -- Jens Axboe ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
[Qemu-devel] [PATCH] lba48 support
{ +ide_if[0].hob_nsector = val; +ide_if[1].hob_nsector = val; + } break; case 3: -ide_if[0].sector = val; -ide_if[1].sector = val; + if (!hob) { +ide_if[0].sector = val; +ide_if[1].sector = val; + } else { +ide_if[0].hob_sector = val; +ide_if[1].hob_sector = val; + } break; case 4: -ide_if[0].lcyl = val; -ide_if[1].lcyl = val; + if (!hob) { +ide_if[0].lcyl = val; +ide_if[1].lcyl = val; + } else { +ide_if[0].hob_lcyl = val; +ide_if[1].hob_lcyl = val; + } break; case 5: -ide_if[0].hcyl = val; -ide_if[1].hcyl = val; + if (!hob) { +ide_if[0].hcyl = val; +ide_if[1].hcyl = val; + } else { +ide_if[0].hob_hcyl = val; +ide_if[1].hob_hcyl = val; + } break; case 6: ide_if[0].select = (val ~0x10) | 0xa0; @@ -1501,10 +1559,34 @@ #if defined(DEBUG_IDE) printf(ide: CMD=%02x\n, val); #endif + /* clear regs written when we see any command */ + ide_if[0].regs_written = 0; + s = ide_if-cur_drive; /* ignore commands to non existant slave */ if (s != ide_if !s-bs) break; + + s-lba48 = lba48_cmd; + + /* handle the 'magic' 0 nsector count conversion here. to avoid +* fiddling with the rest of the read logic, we just store the +* full sector count in -nsector and ignore -hob_nsector from now +*/ + if (!s-lba48) { + if (!s-nsector) + s-nsector = 256; + } else { + if (!s-nsector !s-hob_nsector) + s-nsector = 65536; + else { + int lo = s-hob_nsector; + int hi = s-nsector; + + s-nsector = (hi 8) | lo; + } + } + switch(val) { case WIN_IDENTIFY: if (s-bs !s-is_cdrom) { @@ -1536,12 +1618,16 @@ } ide_set_irq(s); break; +case WIN_VERIFY_EXT: + lba48_cmd = 1; case WIN_VERIFY: case WIN_VERIFY_ONCE: /* do sector number check ? 
*/ s-status = READY_STAT; ide_set_irq(s); break; + case WIN_READ_EXT: + lba48_cmd = 1; case WIN_READ: case WIN_READ_ONCE: if (!s-bs) @@ -1549,6 +1635,8 @@ s-req_nb_sectors = 1; ide_sector_read(s); break; + case WIN_WRITE_EXT: + lba48_cmd = 1; case WIN_WRITE: case WIN_WRITE_ONCE: s-error = 0; @@ -1556,12 +1644,16 @@ s-req_nb_sectors = 1; ide_transfer_start(s, s-io_buffer, 512, ide_sector_write); break; + case WIN_MULTREAD_EXT: + lba48_cmd = 1; case WIN_MULTREAD: if (!s-mult_sectors) goto abort_cmd; s-req_nb_sectors = s-mult_sectors; ide_sector_read(s); break; +case WIN_MULTWRITE_EXT: + lba48_cmd = 1; case WIN_MULTWRITE: if (!s-mult_sectors) goto abort_cmd; @@ -1573,18 +1665,24 @@ n = s-req_nb_sectors; ide_transfer_start(s, s-io_buffer, 512 * n, ide_sector_write); break; + case WIN_READDMA_EXT: + lba48_cmd = 1; case WIN_READDMA: case WIN_READDMA_ONCE: if (!s-bs) goto abort_cmd; ide_sector_read_dma(s); break; + case WIN_WRITEDMA_EXT: + lba48_cmd = 1; case WIN_WRITEDMA: case WIN_WRITEDMA_ONCE: if (!s-bs) goto abort_cmd; ide_sector_write_dma(s); break; +case WIN_READ_NATIVE_MAX_EXT: + lba48_cmd = 1; case WIN_READ_NATIVE_MAX: ide_set_sector(s, s-nb_sectors - 1); s-status = READY_STAT; @@ -1615,6 +1713,7 @@ case WIN_STANDBYNOW1: case WIN_IDLEIMMEDIATE: case WIN_FLUSH_CACHE: +case WIN_FLUSH_CACHE_EXT: s-status = READY_STAT; ide_set_irq(s); break; -- Jens Axboe ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] Audio cd's in guest OS
On Sat, Nov 05 2005, Oliver Gerlich wrote: Lars Roland wrote: On 11/4/05, Mike Swanson [EMAIL PROTECTED] wrote: I've found on systems where traditional rippers don't work (eg, cdparanoia), CDFS has a greater chance of ripping the CDs (by default into WAV, but you can enable an option to rip it in the pure CDDA format if you want). Thanks - I should have known that someone had made a file system for this. However I still think it would be great to be able to pass the actual /dev/cdrom on to the guest OS, but I must admit that I have not grasped the complexity yet on doing this, so I am going to do some Qemu code reading before continuing - I am not even sure if it can be done in VMWare although I seem to remember that Windows as a host OS running VMWare allows the guest access to an audio cdrom. Not sure how VMware does that; but actually I didn't even succeed accessing /dev/cdrom on the host when an audio cd is inserted: dd if=/dev/hdc of=/dev/null bs=2352 count=1 dd: reading `/dev/hdc': Input/output error 0+0 records in 0+0 records out 0 bytes transferred in 0.077570 seconds (0 bytes/sec) I used a blocksize of 2352 because I've read that's the size for audio cds... It didn't work with bs=1 either. While the block size you gave is correct for cdda frames, you cannot read them this way. The commands you use for reading data from a data track vary, and the CDROM driver will always use the READ_10 command for io originating from the file system layer. You would also need to put some effort into the page cache to allow non-power-of-2 block sizes for this to work. So it's not trivial :-) For reading audio tracks, you can use either some pass through command mechanism like CDROM_SEND_PACKET or SG_IO. Or the CDROMREADAUDIO ioctl, which is the easiest to use since it doesn't require an understanding of the command set. -- Jens Axboe ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] Audio cd's in guest OS
On Sat, Nov 05 2005, Fabrice Bellard wrote: Lars Roland wrote: On 11/4/05, Mike Swanson [EMAIL PROTECTED] wrote: I've found on systems where traditional rippers don't work (eg, cdparanoia), CDFS has a greater chance of ripping the CDs (by default into WAV, but you can enable an option to rip it in the pure CDDA format if you want). Thanks - I should have known that someone had made a file system for this. However I still think it would be great to be able to pass the actual /dev/cdrom on to the guest OS, but I must admit that I have not grasped the complexity yet on doing this, so I am going to do some Qemu code reading before continuing - I am not even sure if it can be done in VMWare although I seem to remember that Windows as a host OS running VMWare allows the guest access to an audio cdrom. QEMU does not currently support reading raw CD tracks, but it is definitely possible to add it (along with play audio features and even CD recording). I actually implemented the commands needed for recording some months ago, but never really wrapped it up and submitted it. If there's any interest in this, I'll dust it off when I have some spare time. -- Jens Axboe ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] [patch] non-blocking disk IO
On Tue, Oct 04 2005, Troy Benjegerdes wrote: What we want is to be able to have the guest OS request some DMA I/O operation, and have qemu be able to use AIO so that the actual disk hardware can dump the data directly in the pages the userspace process on the guest OS ends up wanting it in, avoiding several expensive memcopy and context switch operations. That should be easy enough to do already, with or without the nonblocking patch. Just make sure to open the files O_DIRECT and align the io buffers and lengths. With a 2.6 host, you can usually get away with 512-byte alignment; on 2.4 you may have to ensure 1k/4k alignment. -- Jens Axboe ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel
Re: [Qemu-devel] [patch] non-blocking disk IO
On Mon, Oct 03 2005, John Coiner wrote: Non-blocking disk IO now works for any type of disk image, not just raw format. There is no longer any format-specific code in the patch: http://people.brandeis.edu/~jcoiner/qemu_idedma/qemu_dma_patch.html You might want this patch if: * you run a multitasking guest OS, * you access a disk sometimes, and * you wouldn't mind if QEMU ran a little faster. Why I have not got feedback in droves I do not understand ;) Why not use aio for this instead, seems like a better fit than spawning a thread per block device? That would still require a thread for handling completions, but you could easily just use a single completion thread for all devices for this as it would not need to do any real work. -- Jens Axboe ___ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel