Re: [Qemu-devel] [PATCH] PPC: Fail configure when libfdt is not available
On 10/18/2011 02:18 AM, Alexander Graf wrote: We have several targets in the PPC tree now that basically require libfdt to function properly, namely the pseries and the e500 targets. This dependency will rather increase than decrease in the future, so I want to make sure that people building shiny new 1.0 actually have libfdt installed, to get rid of a few ifdefs in the code. Warning: This patch will likely make configure fail for people who don't select their own --target-list but don't have libfdt development packages installed. However, we really need this new dependency to move on. Signed-off-by: Alexander Graf ag...@suse.de F15 has them, and they install also on older Fedoras if you pick them from Koji. Acked-by: Paolo Bonzini pbonz...@redhat.com Paolo
Re: [Qemu-devel] [PATCH v8 1/4] block: add the block queue support
On Mon, Oct 17, 2011 at 6:17 PM, Paolo Bonzini pbonz...@redhat.com wrote: On 10/17/2011 12:17 PM, Kevin Wolf wrote: + +static int qemu_block_queue_handler(BlockQueueAIOCB *request) +{ + int ret; + BlockDriverAIOCB *res; + + res = request->handler(request->common.bs, request->sector_num, + request->qiov, request->nb_sectors, + qemu_block_queue_callback, request); + if (res) { + request->real_acb = res; + } + + ret = (res == NULL) ? 0 : 1; + + return ret; You mean return (res != NULL); and want to have bool as the return value of this function. Yeah, thanks. I will modify as below: ret = (res == NULL) ? false : true; ret = (res != NULL) is really more readable. return (res != NULL); is even nicer! :) Great, thanks Paolo -- Regards, Zhi Yong Wu
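The simplification being discussed can be sketched in isolation (a hypothetical handler stub with an invented DummyAIOCB type, not the actual QEMU code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for BlockDriverAIOCB; only used to illustrate
 * the boolean-return idiom discussed above. */
typedef struct DummyAIOCB { int dummy; } DummyAIOCB;

/* Verbose form from the original patch. */
static int handler_verbose(DummyAIOCB *res)
{
    int ret;
    ret = (res == NULL) ? 0 : 1;
    return ret;
}

/* Preferred form: the comparison itself already yields the boolean. */
static bool handler_concise(DummyAIOCB *res)
{
    return res != NULL;
}
```

Both forms are equivalent; the second drops the temporary and makes the boolean nature of the result explicit in the return type.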
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
Hi, Jan Kiszka. At 10/10/2011 05:34 PM, Jan Kiszka wrote: On 2011-10-10 11:02, Daniel P. Berrange wrote: On Mon, Oct 10, 2011 at 08:52:08AM +0200, Jan Kiszka wrote: Run gdb with set debug remote 1 and watch the communication, it is not that complex. But a dump command is probably simpler for those scenarios, I agree. I have implemented the command dump and reused migration's code. But I met a problem when I tested it. My qemu-kvm tree had not been updated for about 2 months, because kernel.org is down, and I forgot to pull from github. After I pulled it from github, I found the following changes: @@ -1523,9 +1523,7 @@ static void assigned_dev_unregister_msix_mmio(AssignedDevice *dev) static const VMStateDescription vmstate_assigned_device = { .name = "pci-assign", -.fields = (VMStateField []) { -VMSTATE_END_OF_LIST() -} +.unmigratable = 1, }; static void reset_assigned_device(DeviceState *dev) Why do you remove fields from vmstate_assigned_device? It is useful for dump because dump does not check unmigratable. If vmstate_assigned_device does not contain .fields, qemu will crash in vmstate_save_state(). Thanks Wen Congyang Jan
[Qemu-devel] [Bug 877155] [NEW] VT-D NIC doesn't work in windows guest
Public bug reported: [This should be a kvm kernel bug. I met this issue with qemu and kvm for virtualization. Since https://bugzilla.kernel.org is not available yet, I file this bug here just for tracking.] Environment: Host OS (ia32/ia32e/IA64): All Guest OS (ia32/ia32e/IA64): ia32e Guest OS Type (Linux/Windows): Linux kvm.git Commit: 211978fb6c666f824ae8221baea7ee28efc1 qemu.git Commit: edbb7c0de56692868e6126c7ff7e8bf917f8d0e7 Host Kernel Version: 3.1.0-rc9+ Hardware: Westmere-EP SandyBridge platform Bug detailed description: Using qemu and kvm upstream, VT-D NIC and SR-IOV VF don't work in a Windows guest. I tried Windows XP, Vista, Win7 and Win2k8 as guests. I used an 82572 NIC and an 82576 Virtual Function to assign to the Windows guest, and got the same result. Reproduce steps: 1. Start a Windows guest with a NIC assigned: qemu-system-x86_64 -m 1024 -smp 2 -device pci-assign,host=08:00.0 -net none -hda Windows.img 2. Check the network in the Windows guest. (You'll find the assigned NIC doesn't work in the guest.) Current result: NIC doesn't work in Windows guest. Expected result: NIC will work. ** Affects: qemu Importance: Undecided Status: New -- You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/877155 Title: VT-D NIC doesn't work in windows guest Status in QEMU: New
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
On 2011-10-18 09:15, Wen Congyang wrote: Hi, Jan Kiszka. At 10/10/2011 05:34 PM, Jan Kiszka wrote: On 2011-10-10 11:02, Daniel P. Berrange wrote: On Mon, Oct 10, 2011 at 08:52:08AM +0200, Jan Kiszka wrote: Run gdb with set debug remote 1 and watch the communication, it is not that complex. But a dump command is probably simpler for those scenarios, I agree. I have implemented the command dump and reused migration's code. But I met a problem when I tested it. Using migration code for dump is most probably the wrong approach, as you saw through that conflict. All you need are the register states and the RAM. Reuse gdbstub services for this. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
At 10/18/2011 03:52 PM, Jan Kiszka wrote: On 2011-10-18 09:15, Wen Congyang wrote: Hi, Jan Kiszka. At 10/10/2011 05:34 PM, Jan Kiszka wrote: On 2011-10-10 11:02, Daniel P. Berrange wrote: On Mon, Oct 10, 2011 at 08:52:08AM +0200, Jan Kiszka wrote: Run gdb with set debug remote 1 and watch the communication, it is not that complex. But a dump command is probably simpler for those scenarios, I agree. I have implemented the command dump and reused migration's code. But I met a problem when I tested it. Using migration code for dump is most probably the wrong approach, as you saw through that conflict. All you need are the register states and the RAM. Reuse gdbstub services for this. Hmm, if the migration code cannot be reused, I think we should define a new qemu vmcore format, and add some code to crash to support such a format. I will read the gdbstub services' code now. Thanks Wen Congyang Jan
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
On Tue, Oct 18, 2011 at 03:15:29PM +0800, Wen Congyang wrote: Hi, Jan Kiszka. At 10/10/2011 05:34 PM, Jan Kiszka wrote: On 2011-10-10 11:02, Daniel P. Berrange wrote: On Mon, Oct 10, 2011 at 08:52:08AM +0200, Jan Kiszka wrote: Run gdb with set debug remote 1 and watch the communication, it is not that complex. But a dump command is probably simpler for those scenarios, I agree. I have implemented the command dump and reused migration's code. But I met a problem when I tested it. My qemu-kvm tree had not been updated for about 2 months, because kernel.org is down, and I forgot to pull from github. After I pulled it from github, I found the following changes: @@ -1523,9 +1523,7 @@ static void assigned_dev_unregister_msix_mmio(AssignedDevice *dev) static const VMStateDescription vmstate_assigned_device = { .name = "pci-assign", -.fields = (VMStateField []) { -VMSTATE_END_OF_LIST() -} +.unmigratable = 1, }; static void reset_assigned_device(DeviceState *dev) Why do you remove fields from vmstate_assigned_device? It is useful for dump because dump does not check unmigratable. If vmstate_assigned_device does not contain .fields, qemu will crash in vmstate_save_state(). Given that '.fields' is allowed to be NULL for some devices, I'd say that even for normal migration, QEMU should be checking for NULL in the vmstate_save_state() code. This would prevent QEMU crashes in the case where someone removed the .unmigratable member but forgot to add back a .fields member. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
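Daniel's suggestion amounts to an early guard before walking the field list. A minimal sketch (the struct layouts and the helper name are simplified stand-ins modeled loosely on QEMU's VMState types, not the real API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for QEMU's VMState types, for illustration only. */
typedef struct VMStateField { const char *name; } VMStateField;
typedef struct VMStateDescription {
    const char *name;
    bool unmigratable;
    const VMStateField *fields;
} VMStateDescription;

/* Returns false instead of letting callers dereference a NULL fields
 * array; this covers both .unmigratable devices and buggy descriptions
 * where someone dropped .unmigratable without restoring .fields. */
static bool vmstate_fields_valid(const VMStateDescription *vmsd)
{
    return vmsd != NULL && !vmsd->unmigratable && vmsd->fields != NULL;
}
```

A save path would call this check first and fail the operation gracefully rather than crash, which is the defensive behaviour Daniel argues for.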
Re: [Qemu-devel] GPLv3 troubles
Avi Kivity a...@redhat.com writes: On 10/17/2011 07:46 PM, Stefan Weil wrote: So let's start. For any of my contributions, I agree to GPL v2 or later. Later generations should have the possibility to replace GPL v2 by something which matches future requirements. I expect Red Hat contributions can be relicensed to v2+ as well. Plenty of precedent for that.
Re: [Qemu-devel] [PATCH v8 1/4] block: add the block queue support
On Mon, Oct 17, 2011 at 6:17 PM, Kevin Wolf kw...@redhat.com wrote: Am 26.09.2011 10:01, schrieb Zhi Yong Wu: On Fri, Sep 23, 2011 at 11:32 PM, Kevin Wolf kw...@redhat.com wrote: Am 08.09.2011 12:11, schrieb Zhi Yong Wu: Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com --- Makefile.objs | 2 +- block/blk-queue.c | 201 + block/blk-queue.h | 59 block_int.h | 27 +++ 4 files changed, 288 insertions(+), 1 deletions(-) create mode 100644 block/blk-queue.c create mode 100644 block/blk-queue.h diff --git a/Makefile.objs b/Makefile.objs index 26b885b..5dcf456 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -33,7 +33,7 @@ block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vv block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o block-nested-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o block-nested-y += qed-check.o -block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o +block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o blk-queue.o block-nested-$(CONFIG_WIN32) += raw-win32.o block-nested-$(CONFIG_POSIX) += raw-posix.o block-nested-$(CONFIG_CURL) += curl.o diff --git a/block/blk-queue.c b/block/blk-queue.c new file mode 100644 index 000..adef497 --- /dev/null +++ b/block/blk-queue.c @@ -0,0 +1,201 @@ +/* + * QEMU System Emulator queue definition for block layer + * + * Copyright (c) IBM, Corp. 
2011 + * + * Authors: + * Zhi Yong Wu wu...@linux.vnet.ibm.com + * Stefan Hajnoczi stefa...@linux.vnet.ibm.com + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the "Software"), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. + */ + +#include "block_int.h" +#include "block/blk-queue.h" +#include "qemu-common.h" + +/* The APIs for block request queue on qemu block layer. + */ + +struct BlockQueueAIOCB { + BlockDriverAIOCB common; + QTAILQ_ENTRY(BlockQueueAIOCB) entry; + BlockRequestHandler *handler; + BlockDriverAIOCB *real_acb; + + int64_t sector_num; + QEMUIOVector *qiov; + int nb_sectors; +}; The idea is that each request is first queued on the QTAILQ, and at some point it's removed from the queue and gets a real_acb. But it never has both at the same time. Correct? No. If block I/O throttling is enabled and the I/O rate at runtime exceeds the limits, the request will be enqueued. It represents the whole lifecycle of one enqueued request.
What are the conditions under which the request will still be enqueued but have a real_acb at the same time? When the request has not been serviced and still needs to be enqueued, a real_acb will exist at the same time. Thanks for your good catch. :) When one request is allocated for the first time, we should save the returned acb to real_acb in qemu_block_queue_enqueue(): request->real_acb = acb; + +typedef struct BlockQueueAIOCB BlockQueueAIOCB; + +struct BlockQueue { + QTAILQ_HEAD(requests, BlockQueueAIOCB) requests; + bool req_failed; + bool flushing; +}; I find req_failed pretty confusing. Needs documentation at least, but most probably also a better name. OK. request_has_failed? No, that doesn't describe what it's really doing. You set req_failed = true by default and then on some obscure condition clear it or not. It's tracking something, but I'm not sure what meaning it has during the whole process. In qemu_block_queue_flush, when bdrv_aio_readv/writev returns NULL: if req_failed has changed to false, it indicates that the request exceeds the limits again; if req_failed is still true, it indicates that an I/O error took place, and at that moment qemu_block_queue_callback(request, -EIO) needs to be called to report this to the upper layer. + +static void qemu_block_queue_dequeue(BlockQueue *queue, +
Re: [Qemu-devel] [PATCH v8 2/4] block: add the command line support
On Mon, Oct 17, 2011 at 6:19 PM, Kevin Wolf kw...@redhat.com wrote: Am 26.09.2011 08:15, schrieb Zhi Yong Wu: On Fri, Sep 23, 2011 at 11:54 PM, Kevin Wolf kw...@redhat.com wrote: +} + +static void bdrv_block_timer(void *opaque) +{ + BlockDriverState *bs = opaque; + BlockQueue *queue = bs->block_queue; + + qemu_block_queue_flush(queue); Hm, didn't really notice it while reading patch 1, but qemu_block_queue_flush() is misleading. It's really something like Why do you say this is misleading? qemu_block_queue_submit(). Right. It will resubmit all enqueued I/O requests. For me, flush sounds as if it waits for completion of all requests. The current code does that, except on an I/O error. But I think that we should not take that action, right? I'm not sure whether we should keep all the enqueued requests in order. Kevin -- Regards, Zhi Yong Wu
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
On 2011-10-18 09:58, Daniel P. Berrange wrote: On Tue, Oct 18, 2011 at 03:15:29PM +0800, Wen Congyang wrote: Hi, Jan Kiszka. At 10/10/2011 05:34 PM, Jan Kiszka wrote: On 2011-10-10 11:02, Daniel P. Berrange wrote: On Mon, Oct 10, 2011 at 08:52:08AM +0200, Jan Kiszka wrote: Run gdb with set debug remote 1 and watch the communication, it is not that complex. But a dump command is probably simpler for those scenarios, I agree. I have implemented the command dump and reused migration's code. But I met a problem when I tested it. My qemu-kvm tree had not been updated for about 2 months, because kernel.org is down, and I forgot to pull from github. After I pulled it from github, I found the following changes: @@ -1523,9 +1523,7 @@ static void assigned_dev_unregister_msix_mmio(AssignedDevice *dev) static const VMStateDescription vmstate_assigned_device = { .name = "pci-assign", -.fields = (VMStateField []) { -VMSTATE_END_OF_LIST() -} +.unmigratable = 1, }; static void reset_assigned_device(DeviceState *dev) Why do you remove fields from vmstate_assigned_device? It is useful for dump because dump does not check unmigratable. If vmstate_assigned_device does not contain .fields, qemu will crash in vmstate_save_state(). Given that '.fields' is allowed to be NULL for some devices, I'd say that even for normal migration, QEMU should be checking for NULL in the vmstate_save_state() code. This would prevent QEMU crashes in the case where someone removed the .unmigratable member but forgot to add back a .fields member. That crash wouldn't be bad because removing unmigratable without adding proper fields is almost always a bug. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
On 2011-10-18 09:58, Wen Congyang wrote: At 10/18/2011 03:52 PM, Jan Kiszka Write: On 2011-10-18 09:15, Wen Congyang wrote: Hi, Jan Kiszka At 10/10/2011 05:34 PM, Jan Kiszka Write: On 2011-10-10 11:02, Daniel P. Berrange wrote: On Mon, Oct 10, 2011 at 08:52:08AM +0200, Jan Kiszka wrote: Run gdb with set debug remote 1 and watch the communication, it is not that complex. But a dump command is probably simpler for those scenarios, I agree. I have implemented the command dump and reuse migration's code. But I meet a problem when I test it. Using migration code for dump is most probably the wrong approach as you saw through that conflict. All you need are the register states and the RAM. Reuse gdbstub services for this. Hmm, if the migration code can not be reused, I think we should define a new qemu's vmcore format, and add some codes into crash to support such format. Please try to avoid defining something new. Unless there is a striking reason, standard gdb core files should be generated so that you can load the dump directly into gdb for analysis. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
[Qemu-devel] hw/nand hw/onenand and read-only
These guys have been converted to qdev relatively recently. I wonder what happens when they use a drive defined with -drive if=none,readonly. If that's not a valid configuration, we better reject read-only drives, like ide_init_drive() does.
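A guard along the lines of what ide_init_drive() does might look like the sketch below (the FlashDrive type and flash_init_drive name are invented for illustration; the real qdev init hooks and drive-state accessors differ):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Minimal stand-in for a drive's backend state; invented for illustration. */
typedef struct FlashDrive {
    const char *name;
    bool read_only;
} FlashDrive;

/* Reject a read-only backend at device init time instead of failing
 * obscurely later, mirroring the ide_init_drive() behaviour mentioned
 * above. Returns 0 on success, -1 to make qdev init fail. */
static int flash_init_drive(FlashDrive *drv)
{
    if (drv->read_only) {
        fprintf(stderr, "Device '%s' can't use a read-only drive\n",
                drv->name);
        return -1;
    }
    return 0;
}
```

The point is simply that an invalid -drive if=none,readonly configuration is diagnosed up front with a clear error rather than accepted silently.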
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
At 10/18/2011 04:19 PM, Jan Kiszka wrote: On 2011-10-18 09:58, Wen Congyang wrote: At 10/18/2011 03:52 PM, Jan Kiszka wrote: On 2011-10-18 09:15, Wen Congyang wrote: Hi, Jan Kiszka. At 10/10/2011 05:34 PM, Jan Kiszka wrote: On 2011-10-10 11:02, Daniel P. Berrange wrote: On Mon, Oct 10, 2011 at 08:52:08AM +0200, Jan Kiszka wrote: Run gdb with set debug remote 1 and watch the communication, it is not that complex. But a dump command is probably simpler for those scenarios, I agree. I have implemented the command dump and reused migration's code. But I met a problem when I tested it. Using migration code for dump is most probably the wrong approach, as you saw through that conflict. All you need are the register states and the RAM. Reuse gdbstub services for this. Hmm, if the migration code cannot be reused, I think we should define a new qemu vmcore format, and add some code to crash to support such a format. Please try to avoid defining something new. Unless there is a striking reason, standard gdb core files should be generated so that you can load the dump directly into gdb for analysis. I am not sure whether the standard gdb core files can be analyzed by crash. If not, I think we should define something new, because it's easier to use crash than gdb to analyze the core files. Thanks Wen Congyang Jan
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
On 2011-10-18 10:17, Jan Kiszka wrote: On 2011-10-18 09:58, Daniel P. Berrange wrote: On Tue, Oct 18, 2011 at 03:15:29PM +0800, Wen Congyang wrote: Hi, Jan Kiszka. At 10/10/2011 05:34 PM, Jan Kiszka wrote: On 2011-10-10 11:02, Daniel P. Berrange wrote: On Mon, Oct 10, 2011 at 08:52:08AM +0200, Jan Kiszka wrote: Run gdb with set debug remote 1 and watch the communication, it is not that complex. But a dump command is probably simpler for those scenarios, I agree. I have implemented the command dump and reused migration's code. But I met a problem when I tested it. My qemu-kvm tree had not been updated for about 2 months, because kernel.org is down, and I forgot to pull from github. After I pulled it from github, I found the following changes: @@ -1523,9 +1523,7 @@ static void assigned_dev_unregister_msix_mmio(AssignedDevice *dev) static const VMStateDescription vmstate_assigned_device = { .name = "pci-assign", -.fields = (VMStateField []) { -VMSTATE_END_OF_LIST() -} +.unmigratable = 1, }; static void reset_assigned_device(DeviceState *dev) Why do you remove fields from vmstate_assigned_device? It is useful for dump because dump does not check unmigratable. If vmstate_assigned_device does not contain .fields, qemu will crash in vmstate_save_state(). Given that '.fields' is allowed to be NULL for some devices, I'd say that even for normal migration, QEMU should be checking for NULL in the vmstate_save_state() code. This would prevent QEMU crashes in the case where someone removed the .unmigratable member but forgot to add back a .fields member. That crash wouldn't be bad because removing unmigratable without adding proper fields is almost always a bug. Err, s/almost//. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
On Tue, Oct 18, 2011 at 10:17:27AM +0200, Jan Kiszka wrote: On 2011-10-18 09:58, Daniel P. Berrange wrote: On Tue, Oct 18, 2011 at 03:15:29PM +0800, Wen Congyang wrote: Hi, Jan Kiszka. At 10/10/2011 05:34 PM, Jan Kiszka wrote: On 2011-10-10 11:02, Daniel P. Berrange wrote: On Mon, Oct 10, 2011 at 08:52:08AM +0200, Jan Kiszka wrote: Run gdb with set debug remote 1 and watch the communication, it is not that complex. But a dump command is probably simpler for those scenarios, I agree. I have implemented the command dump and reused migration's code. But I met a problem when I tested it. My qemu-kvm tree had not been updated for about 2 months, because kernel.org is down, and I forgot to pull from github. After I pulled it from github, I found the following changes: @@ -1523,9 +1523,7 @@ static void assigned_dev_unregister_msix_mmio(AssignedDevice *dev) static const VMStateDescription vmstate_assigned_device = { .name = "pci-assign", -.fields = (VMStateField []) { -VMSTATE_END_OF_LIST() -} +.unmigratable = 1, }; static void reset_assigned_device(DeviceState *dev) Why do you remove fields from vmstate_assigned_device? It is useful for dump because dump does not check unmigratable. If vmstate_assigned_device does not contain .fields, qemu will crash in vmstate_save_state(). Given that '.fields' is allowed to be NULL for some devices, I'd say that even for normal migration, QEMU should be checking for NULL in the vmstate_save_state() code. This would prevent QEMU crashes in the case where someone removed the .unmigratable member but forgot to add back a .fields member. That crash wouldn't be bad because removing unmigratable without adding proper fields is almost always a bug. This is an example of why QEMU is too crashy in general. We should be defensive against such bugs by checking that preconditions are valid in the code, even if we don't expect them to happen often.
Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
On 2011-10-18 10:25, Wen Congyang wrote: At 10/18/2011 04:19 PM, Jan Kiszka wrote: On 2011-10-18 09:58, Wen Congyang wrote: At 10/18/2011 03:52 PM, Jan Kiszka wrote: On 2011-10-18 09:15, Wen Congyang wrote: Hi, Jan Kiszka. At 10/10/2011 05:34 PM, Jan Kiszka wrote: On 2011-10-10 11:02, Daniel P. Berrange wrote: On Mon, Oct 10, 2011 at 08:52:08AM +0200, Jan Kiszka wrote: Run gdb with set debug remote 1 and watch the communication, it is not that complex. But a dump command is probably simpler for those scenarios, I agree. I have implemented the command dump and reused migration's code. But I met a problem when I tested it. Using migration code for dump is most probably the wrong approach, as you saw through that conflict. All you need are the register states and the RAM. Reuse gdbstub services for this. Hmm, if the migration code cannot be reused, I think we should define a new qemu vmcore format, and add some code to crash to support such a format. Please try to avoid defining something new. Unless there is a striking reason, standard gdb core files should be generated so that you can load the dump directly into gdb for analysis. I am not sure whether the standard gdb core files can be analyzed by crash. If not, I think we should define something new, because it's easier to use crash than gdb to analyze the core files. gdb allows you to walk up the frames and print variables (globals and locals) etc. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH v8 3/4] block: add block timer and throttling algorithm
On Mon, Oct 17, 2011 at 11:54 PM, Stefan Hajnoczi stefa...@gmail.com wrote: On Mon, Oct 17, 2011 at 11:26 AM, Kevin Wolf kw...@redhat.com wrote: Am 26.09.2011 09:24, schrieb Zhi Yong Wu: On Sat, Sep 24, 2011 at 12:19 AM, Kevin Wolf kw...@redhat.com wrote: Am 08.09.2011 12:11, schrieb Zhi Yong Wu: Note: 1.) When bps/iops limits are specified as a small value such as 511 bytes/s, this VM will hang up. We are considering how to handle this scenario. 2.) When a dd command is issued in the guest, if its option bs is set to a large value such as bs=1024K, the resulting speed will be slightly bigger than the limits. For these problems, if you have a nice thought, please let us know. :) Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com --- block.c | 259 --- block.h | 1 - 2 files changed, 248 insertions(+), 12 deletions(-) One general comment: What about synchronous and/or coroutine I/O operations? Do you think they are just not important enough to consider here, or were they forgotten? For sync ops, we assume that they will be converted into async mode at some point in the future, right? Coroutine I/O is introduced in the image driver layer, behind bdrv_aio_readv/writev. I think that we need not consider it, right? Meanwhile the block layer has been changed to handle all requests in terms of coroutines. So you would best move your intercepting code into the coroutine functions. Some additional info: the advantage of handling all requests in coroutines is that there is now a single place where you can put I/O throttling. It will work for bdrv_read(), bdrv_co_readv(), and bdrv_aio_readv(). There is no code duplication; just put the I/O throttling logic in bdrv_co_do_readv(). Got it, thanks. Stefan -- Regards, Zhi Yong Wu
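Stefan's point about a single interception site can be sketched abstractly. In the toy model below (all names invented; this is not the actual bdrv_co_do_readv() code), every public read entry point funnels through one core function, so one throttle check covers sync, coroutine and AIO paths alike:

```c
#include <assert.h>
#include <stdbool.h>

/* Counts how many times the throttle check runs, to show that every
 * entry point passes through it exactly once. */
static int throttle_checks;

static bool io_limits_exceeded(void)
{
    throttle_checks++;
    return false; /* pretend we are always under the limit */
}

/* Single interception point, analogous to bdrv_co_do_readv(). */
static int co_do_readv(void)
{
    if (io_limits_exceeded()) {
        return -1; /* would be queued for later resubmission */
    }
    return 0; /* request proceeds to the driver */
}

/* All three public entry points delegate to the same core, so the
 * throttling logic is written exactly once. */
static int bdrv_read_sync(void) { return co_do_readv(); }
static int bdrv_co_readv(void)  { return co_do_readv(); }
static int bdrv_aio_readv(void) { return co_do_readv(); }
```

The design benefit is the one Stefan names: no duplicated throttling code across the three request flavours.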
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
At 10/18/2011 04:26 PM, Jan Kiszka wrote: On 2011-10-18 10:25, Wen Congyang wrote: At 10/18/2011 04:19 PM, Jan Kiszka wrote: On 2011-10-18 09:58, Wen Congyang wrote: At 10/18/2011 03:52 PM, Jan Kiszka wrote: On 2011-10-18 09:15, Wen Congyang wrote: Hi, Jan Kiszka. At 10/10/2011 05:34 PM, Jan Kiszka wrote: On 2011-10-10 11:02, Daniel P. Berrange wrote: On Mon, Oct 10, 2011 at 08:52:08AM +0200, Jan Kiszka wrote: Run gdb with set debug remote 1 and watch the communication, it is not that complex. But a dump command is probably simpler for those scenarios, I agree. I have implemented the command dump and reused migration's code. But I met a problem when I tested it. Using migration code for dump is most probably the wrong approach, as you saw through that conflict. All you need are the register states and the RAM. Reuse gdbstub services for this. Hmm, if the migration code cannot be reused, I think we should define a new qemu vmcore format, and add some code to crash to support such a format. Please try to avoid defining something new. Unless there is a striking reason, standard gdb core files should be generated so that you can load the dump directly into gdb for analysis. I am not sure whether the standard gdb core files can be analyzed by crash. If not, I think we should define something new, because it's easier to use crash than gdb to analyze the core files. gdb allows you to walk up the frames and print variables (globals and locals) etc. Crash uses gdb to provide common functionality, and you can also use all the gdb commands in crash. Thanks Wen Congyang Jan
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
On 2011-10-18 10:25, Daniel P. Berrange wrote: On Tue, Oct 18, 2011 at 10:17:27AM +0200, Jan Kiszka wrote: On 2011-10-18 09:58, Daniel P. Berrange wrote: On Tue, Oct 18, 2011 at 03:15:29PM +0800, Wen Congyang wrote: Hi, Jan Kiszka. At 10/10/2011 05:34 PM, Jan Kiszka wrote: On 2011-10-10 11:02, Daniel P. Berrange wrote: On Mon, Oct 10, 2011 at 08:52:08AM +0200, Jan Kiszka wrote: Run gdb with set debug remote 1 and watch the communication, it is not that complex. But a dump command is probably simpler for those scenarios, I agree. I have implemented the command dump and reused migration's code. But I met a problem when I tested it. My qemu-kvm tree had not been updated for about 2 months, because kernel.org is down, and I forgot to pull from github. After I pulled it from github, I found the following changes: @@ -1523,9 +1523,7 @@ static void assigned_dev_unregister_msix_mmio(AssignedDevice *dev) static const VMStateDescription vmstate_assigned_device = { .name = "pci-assign", -.fields = (VMStateField []) { -VMSTATE_END_OF_LIST() -} +.unmigratable = 1, }; static void reset_assigned_device(DeviceState *dev) Why do you remove fields from vmstate_assigned_device? It is useful for dump because dump does not check unmigratable. If vmstate_assigned_device does not contain .fields, qemu will crash in vmstate_save_state(). Given that '.fields' is allowed to be NULL for some devices, I'd say that even for normal migration, QEMU should be checking for NULL in the vmstate_save_state() code. This would prevent QEMU crashes in the case where someone removed the .unmigratable member but forgot to add back a .fields member. That crash wouldn't be bad because removing unmigratable without adding proper fields is almost always a bug. This is an example of why QEMU is too crashy in general. We should be defensive against such bugs by checking that preconditions are valid in the code, even if we don't expect them to happen often. A serious bug in a device model is tricky to handle gracefully.
Given that this case is fairly clear (specifically for the code reviewer), I'm not sure what defensive measures you want to apply and what benefit they would provide. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
On 2011-10-18 10:31, Wen Congyang wrote: At 10/18/2011 04:26 PM, Jan Kiszka wrote: On 2011-10-18 10:25, Wen Congyang wrote: At 10/18/2011 04:19 PM, Jan Kiszka wrote: On 2011-10-18 09:58, Wen Congyang wrote: At 10/18/2011 03:52 PM, Jan Kiszka wrote: On 2011-10-18 09:15, Wen Congyang wrote: Hi, Jan Kiszka. At 10/10/2011 05:34 PM, Jan Kiszka wrote: On 2011-10-10 11:02, Daniel P. Berrange wrote: On Mon, Oct 10, 2011 at 08:52:08AM +0200, Jan Kiszka wrote: Run gdb with set debug remote 1 and watch the communication, it is not that complex. But a dump command is probably simpler for those scenarios, I agree. I have implemented the command dump and reused migration's code. But I met a problem when I tested it. Using migration code for dump is most probably the wrong approach, as you saw through that conflict. All you need are the register states and the RAM. Reuse gdbstub services for this. Hmm, if the migration code cannot be reused, I think we should define a new qemu vmcore format, and add some code to crash to support such a format. Please try to avoid defining something new. Unless there is a striking reason, standard gdb core files should be generated so that you can load the dump directly into gdb for analysis. I am not sure whether the standard gdb core files can be analyzed by crash. If not, I think we should define something new, because it's easier to use crash than gdb to analyze the core files. gdb allows you to walk up the frames and print variables (globals and locals) etc. Crash uses gdb to provide common functionality, and you can also use all the gdb commands in crash. Then what's the added value here when I can use gdb directly? Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH v8 1/4] block: add the block queue support
Am 18.10.2011 10:07, schrieb Zhi Yong Wu: On Mon, Oct 17, 2011 at 6:17 PM, Kevin Wolf kw...@redhat.com wrote: + +typedef struct BlockQueueAIOCB BlockQueueAIOCB; + +struct BlockQueue { +QTAILQ_HEAD(requests, BlockQueueAIOCB) requests; +bool req_failed; +bool flushing; +}; I find req_failed pretty confusing. Needs documentation at least, but most probably also a better name. OK. request_has_failed? No, that doesn't describe what it's really doing. You set req_failed = true by default and then on some obscure condition clear it or not. It's tracking something, but I'm not sure what meaning it has during the whole process. In qemu_block_queue_flush, when bdrv_aio_readv/writev return NULL, if req_failed is changed to false, it indicates that the request exceeds the limits again; if req_failed is still true, it indicates that one I/O error takes place; at the moment, qemu_block_queue_callback(request, -EIO) needs to be called to report this to the upper layer. Okay, this makes some more sense now. How about reversing the logic and maintaining a limit_exceeded flag instead of a req_failed? I think this would be clearer. +void qemu_del_block_queue(BlockQueue *queue) +{ +BlockQueueAIOCB *request, *next; + +QTAILQ_FOREACH_SAFE(request, &queue->requests, entry, next) { +QTAILQ_REMOVE(&queue->requests, request, entry); +qemu_aio_release(request); +} + +g_free(queue); +} Can we be sure that no AIO requests are in flight that still use the now released AIOCB? Yeah, since qemu core code is serially performed, i think that when qemu_del_block_queue is performed, no requests are in flight. Right? Patch 2 has this code: +void bdrv_io_limits_disable(BlockDriverState *bs) +{ +bs->io_limits_enabled = false; + +if (bs->block_queue) { +qemu_block_queue_flush(bs->block_queue); +qemu_del_block_queue(bs->block_queue); +bs->block_queue = NULL; +} Does this mean that you can't disable I/O limits while the VM is running? No, you can, even while the VM is running.
Okay, in general qemu_block_queue_flush() empties the queue so that there are no requests left that qemu_del_block_queue() could drop from the queue. So in the common case it doesn't even enter the FOREACH loop. I think problems start when requests have failed or exceeded the limit again, then you have requests queued even after qemu_block_queue_flush(). You must be aware of this, otherwise the code in qemu_del_block_queue() wouldn't exist. But you can't release the ACBs without having called their callback, otherwise the caller would still assume that its ACB pointer is valid. Maybe calling the callback before releasing the ACB would be enough. However, for failed requests see below. + +BlockDriverAIOCB *qemu_block_queue_enqueue(BlockQueue *queue, +BlockDriverState *bs, +BlockRequestHandler *handler, +int64_t sector_num, +QEMUIOVector *qiov, +int nb_sectors, +BlockDriverCompletionFunc *cb, +void *opaque) +{ +BlockDriverAIOCB *acb; +BlockQueueAIOCB *request; + +if (queue->flushing) { +queue->req_failed = false; +return NULL; +} else { +acb = qemu_aio_get(&block_queue_pool, bs, + cb, opaque); +request = container_of(acb, BlockQueueAIOCB, common); +request->handler = handler; +request->sector_num = sector_num; +request->qiov = qiov; +request->nb_sectors = nb_sectors; +request->real_acb = NULL; +QTAILQ_INSERT_TAIL(&queue->requests, request, entry); +} + +return acb; +} + +static int qemu_block_queue_handler(BlockQueueAIOCB *request) +{ +int ret; +BlockDriverAIOCB *res; + +res = request->handler(request->common.bs, request->sector_num, + request->qiov, request->nb_sectors, + qemu_block_queue_callback, request); +if (res) { +request->real_acb = res; +} + +ret = (res == NULL) ? 0 : 1; + +return ret; You mean return (res != NULL); and want to have bool as the return value of this function. Yeah, thanks. i will modify as below: ret = (res == NULL) ? false : true; ret = (res != NULL) is really more readable. I have adopted Paolo's suggestion.
and static bool qemu_block_queue_handler() +} + +void qemu_block_queue_flush(BlockQueue *queue) +{ +queue->flushing = true; +while (!QTAILQ_EMPTY(&queue->requests)) { +BlockQueueAIOCB *request = NULL; +int ret = 0; + +request = QTAILQ_FIRST(&queue->requests); +QTAILQ_REMOVE(&queue->requests, request, entry); + +queue->req_failed = true; +ret = qemu_block_queue_handler(request); +if (ret == 0) { +
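The lifecycle Kevin describes above (flush may leave failed requests queued; deletion must complete them before releasing their ACBs, and the req_failed flag reads better inverted as limit_exceeded) can be modeled in a few lines. This is a hedged sketch with invented names, not the QEMU block-queue API:

```c
/* Minimal model of the queue teardown discussed above: every request
 * still queued at deletion time gets its completion callback (with -EIO)
 * before its memory is released, so no caller is left holding a dangling
 * ACB pointer.  The limit_exceeded field mirrors Kevin's suggested
 * inversion of req_failed; all names here are illustrative only. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

typedef void CompletionFunc(void *opaque, int ret);

typedef struct Req {
    CompletionFunc *cb;
    void *opaque;
    struct Req *next;
} Req;

typedef struct Queue {
    Req *head;
    bool limit_exceeded;  /* set while flushing when a request is re-queued */
    bool flushing;
} Queue;

static void enqueue(Queue *q, CompletionFunc *cb, void *opaque)
{
    Req *r = malloc(sizeof(*r));
    r->cb = cb;
    r->opaque = opaque;
    r->next = q->head;
    q->head = r;
}

/* Deleting the queue must not leak completions: call each callback
 * with an error before freeing the request. */
static void delete_queue(Queue *q)
{
    while (q->head) {
        Req *r = q->head;
        q->head = r->next;
        r->cb(r->opaque, -EIO);
        free(r);
    }
}

static int completions;
static void count_cb(void *opaque, int ret)
{
    (void)opaque;
    if (ret < 0) {
        completions++;
    }
}
```

The point of the sketch is only the ordering: callback first, release second, which is what the review asks the real qemu_del_block_queue() to guarantee.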
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
On Tue, Oct 18, 2011 at 10:31:10AM +0200, Jan Kiszka wrote: On 2011-10-18 10:31, Wen Congyang wrote: At 10/18/2011 04:26 PM, Jan Kiszka Write: On 2011-10-18 10:25, Wen Congyang wrote: At 10/18/2011 04:19 PM, Jan Kiszka Write: On 2011-10-18 09:58, Wen Congyang wrote: At 10/18/2011 03:52 PM, Jan Kiszka Write: On 2011-10-18 09:15, Wen Congyang wrote: Hi, Jan Kiszka At 10/10/2011 05:34 PM, Jan Kiszka Write: On 2011-10-10 11:02, Daniel P. Berrange wrote: On Mon, Oct 10, 2011 at 08:52:08AM +0200, Jan Kiszka wrote: Run gdb with set debug remote 1 and watch the communication, it is not that complex. But a dump command is probably simpler for those scenarios, I agree. I have implemented the command dump and reuse migration's code. But I meet a problem when I test it. Using migration code for dump is most probably the wrong approach as you saw through that conflict. All you need are the register states and the RAM. Reuse gdbstub services for this. Hmm, if the migration code can not be reused, I think we should define a new qemu's vmcore format, and add some codes into crash to support such format. Please try to avoid defining something new. Unless there is a striking reason, standard gdb core files should be generated so that you can load the dump directly into gdb for analysis. I am not sure whehter the standard gdb core files can not be analyzed by crash. If not, I think we should define something new because it's easier to use crash than gdb to analyze the core files. gdb allows you to walk up the frame and print variables (globals local) etc. Crash uses gdb to provide common function, and you can also use all the gdb commands in crash. That what's the added value here when I can use gdb directly? Crash can decode kernel structures, providing for example process listings and debugging into userspace processes. Crash actually reuses gdb's code too, so it's kind of a superset. Rich. 
-- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming blog: http://rwmj.wordpress.com Fedora now supports 80 OCaml packages (the OPEN alternative to F#) http://cocan.org/getting_started_with_ocaml_on_red_hat_and_fedora
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
On 2011-10-18 10:34, Richard W.M. Jones wrote: On Tue, Oct 18, 2011 at 10:31:10AM +0200, Jan Kiszka wrote: On 2011-10-18 10:31, Wen Congyang wrote: At 10/18/2011 04:26 PM, Jan Kiszka Write: On 2011-10-18 10:25, Wen Congyang wrote: At 10/18/2011 04:19 PM, Jan Kiszka Write: On 2011-10-18 09:58, Wen Congyang wrote: At 10/18/2011 03:52 PM, Jan Kiszka Write: On 2011-10-18 09:15, Wen Congyang wrote: Hi, Jan Kiszka At 10/10/2011 05:34 PM, Jan Kiszka Write: On 2011-10-10 11:02, Daniel P. Berrange wrote: On Mon, Oct 10, 2011 at 08:52:08AM +0200, Jan Kiszka wrote: Run gdb with set debug remote 1 and watch the communication, it is not that complex. But a dump command is probably simpler for those scenarios, I agree. I have implemented the command dump and reuse migration's code. But I meet a problem when I test it. Using migration code for dump is most probably the wrong approach as you saw through that conflict. All you need are the register states and the RAM. Reuse gdbstub services for this. Hmm, if the migration code can not be reused, I think we should define a new qemu's vmcore format, and add some codes into crash to support such format. Please try to avoid defining something new. Unless there is a striking reason, standard gdb core files should be generated so that you can load the dump directly into gdb for analysis. I am not sure whehter the standard gdb core files can not be analyzed by crash. If not, I think we should define something new because it's easier to use crash than gdb to analyze the core files. gdb allows you to walk up the frame and print variables (globals local) etc. Crash uses gdb to provide common function, and you can also use all the gdb commands in crash. That what's the added value here when I can use gdb directly? Crash can decode kernel structures, providing for example process listings and debugging into userspace processes. Crash actually reuses gdb's code too, so it's kind of a superset. Yeah, I see. 
Could also be solved via gdb scripts, but crash is already there. But let's see if the formats actually differ. In the end, crash is just parsing the same information that also gdb sees. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
On 2011-10-18 10:36, Jan Kiszka wrote: Crash can decode kernel structures, providing for example process listings and debugging into userspace processes. Crash actually reuses gdb's code too, so it's kind of a superset. Yeah, I see. Could also be solved via gdb scripts, but crash is already there. [ BTW, crash is for the dead. But having those features in form of gdb scripts would also allow live system analysis. Looks like it's worth obsoleting crash on the long run. ] Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH v8 3/4] block: add block timer and throttling algorithm
On Mon, Oct 17, 2011 at 6:26 PM, Kevin Wolf kw...@redhat.com wrote: Am 26.09.2011 09:24, schrieb Zhi Yong Wu: On Sat, Sep 24, 2011 at 12:19 AM, Kevin Wolf kw...@redhat.com wrote: Am 08.09.2011 12:11, schrieb Zhi Yong Wu: Note: 1.) When bps/iops limits are specified to a small value such as 511 bytes/s, this VM will hang up. We are considering how to handle this scenario. 2.) When the dd command is issued in the guest, if its option bs is set to a large value such as bs=1024K, the resulting speed will be slightly bigger than the limits. For these problems, if you have a nice thought, pls let us know.:) Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com --- block.c | 259 --- block.h | 1 - 2 files changed, 248 insertions(+), 12 deletions(-) One general comment: What about synchronous and/or coroutine I/O operations? Do you think they are just not important enough to consider here or were they forgotten? For sync ops, we assume that they will be converted into async mode at some point in the future, right? For coroutine I/O, it is introduced in the image driver layer, and behind bdrv_aio_readv/writev. I think that we need not consider them, right? Meanwhile the block layer has been changed to handle all requests in terms of coroutines. So you would best move your intercepting code into the coroutine functions. OK. I will. Also, do I understand correctly that you're always submitting the whole Right, when the block timer fires, it will flush the whole request queue. queue at once? Does this effectively enforce the limit all the time or will it lead to some peaks and then no requests at all for a while until In fact, it only tries to submit those enqueued requests one by one. If one fails to pass the limit, this request will be enqueued again. Right, I missed this. Makes sense. the average is right again? Yeah, it is possible. Do you have a better idea? Maybe some documentation on how it all works from a high level perspective would be helpful.
+ /* throttling disk read I/O */ + if (bs->io_limits_enabled) { + if (bdrv_exceed_io_limits(bs, nb_sectors, false, &wait_time)) { + ret = qemu_block_queue_enqueue(bs->block_queue, bs, bdrv_aio_readv, + sector_num, qiov, nb_sectors, cb, opaque); + printf("wait_time=%ld\n", wait_time); + if (wait_time != -1) { + printf("reset block timer\n"); + qemu_mod_timer(bs->block_timer, + wait_time + qemu_get_clock_ns(vm_clock)); + } + + if (ret) { + printf("ori ret is not null\n"); + } else { + printf("ori ret is null\n"); + } + + return ret; + } + } - return drv->bdrv_aio_readv(bs, sector_num, qiov, nb_sectors, + ret = drv->bdrv_aio_readv(bs, sector_num, qiov, nb_sectors, cb, opaque); + if (ret) { + if (bs->io_limits_enabled) { + bs->io_disps.bytes[BLOCK_IO_LIMIT_READ] += + (unsigned) nb_sectors * BDRV_SECTOR_SIZE; + bs->io_disps.ios[BLOCK_IO_LIMIT_READ]++; + } I wonder if you can't reuse bs->nr_bytes/nr_ops instead of introducing a second counting mechanism. Would have the advantage that numbers are NO, our counting variables will be reset to ZERO if the current slice time (0.1 ms) is used up. Instead of setting the counter to zero you could remember the base value and calculate the difference when you need it. The advantage is that we can share infrastructure instead of introducing several subtly different ways of I/O accounting. Good idea, let me try it. actually consistent (your metric counts slightly differently than the existing info blockstats one). Yeah, i notice this, and don't think there's anything wrong with it. And you? It's not really user friendly if a number that is called the same means this in one place and in another place that. OK Kevin -- Regards, Zhi Yong Wu
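Kevin's suggestion (keep the monotonic counters that info blockstats already uses and remember their value at the slice boundary, instead of zeroing a second set of counters) can be sketched as follows. Field names are invented for illustration; this is not the actual patch code:

```c
/* Sketch of slice accounting by base value rather than by reset:
 * total_bytes only ever grows (so it can double as the stats counter),
 * and the amount dispatched in the current slice is the difference
 * from the value recorded when the slice started.  Names are assumptions. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct ThrottleState {
    uint64_t total_bytes;      /* monotonically increasing, shared with stats */
    uint64_t slice_base;       /* snapshot of total_bytes at slice start */
    uint64_t bps_limit_slice;  /* bytes allowed per slice */
} ThrottleState;

/* Begin a new slice: remember the base instead of zeroing the counter. */
static void slice_start(ThrottleState *ts)
{
    ts->slice_base = ts->total_bytes;
}

/* Would dispatching new_bytes now exceed this slice's budget? */
static bool exceeds_limit(const ThrottleState *ts, uint64_t new_bytes)
{
    return ts->total_bytes - ts->slice_base + new_bytes > ts->bps_limit_slice;
}

/* Account a dispatched request. */
static void account(ThrottleState *ts, uint64_t bytes)
{
    ts->total_bytes += bytes;
}
```

The design point is that total_bytes never loses information, so the same counter can serve both throttling and the existing blockstats reporting without the two drifting apart.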
Re: [Qemu-devel] hw/nand hw/onenand and read-only
On 18.10.11 11:23 , ext Markus Armbruster arm...@redhat.com wrote: These guys have been converted to qdev relatively recently. I wonder what happens when they use a drive defined with -drive if=none,readonly. If that's not a valid configuration, we better reject read-only drives, like ide_init_drive() does. I'm not an expert with QEMU command line options; how do you pass such a drive to a NAND device? With if=mtd I get the following, which I guess is the expected result: $ qemu-system-arm -M beagle -drive if=mtd,file=nand.img,readonly qemu: readonly flag not supported for drive with this interface Regards, Juha
Re: [Qemu-devel] [PATCH] PPC: Fail configure when libfdt is not available
Am 18.10.2011 02:18, schrieb Alexander Graf: We have several targets in the PPC tree now that basically require libfdt to function properly, namely the pseries and the e500 targets. This dependency will rather increase than decrease in the future, so I want to make sure that people building shiny new 1.0 actually have libfdt installed to get rid of a few ifdefs in the code. Warning: This patch will likely make configure fail for people who don't select their own --target-list, but don't have libfdt development packages installed. However, we really need this new dependency to move on. Signed-off-by: Alexander Graf ag...@suse.de openSUSE 12.1 has libfdt1-devel, but you should set up a submodule and working build rules for Darwin, Haiku, etc. `make` doesn't fully work so I used custom scripts to build the right parts and to manually install the resulting binary and headers. Andreas -- SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746, AG Nürnberg
Re: [Qemu-devel] [PATCH] PPC: Fail configure when libfdt is not available
Am 18.10.2011 um 10:55 schrieb Andreas Färber afaer...@suse.de: Am 18.10.2011 02:18, schrieb Alexander Graf: We have several targets in the PPC tree now that basically require libfdt to function properly, namely the pseries and the e500 targets. This dependency will rather increase than decrease in the future, so I want to make sure that people building shiny new 1.0 actually have libfdt installed to get rid of a few ifdefs in the code. Warning: This patch will likely make configure fail for people who don't select their own --target-list, but don't have libfdt development packages installed. However, we really need this new dependency to move on. Signed-off-by: Alexander Graf ag...@suse.de openSUSE 12.1 has libfdt1-devel, but you should set up a submodule and working build rules for Darwin, Haiku, etc. `make` doesn't fully work so I used custom scripts to build the right parts and to manually install the resulting binary and headers. I don't fully understand. It's a build dependency, so whoever maintains libfdt / is interested in running ppc targets on those OSs needs to fix libfdt to build there. It's really the same as a dependency on glib or sdl or ... :). It's just less well known (and less active as a project). Alex
Re: [Qemu-devel] Patch for ui/cocoa.m
Hello Juan, Am 18.10.2011 03:22, schrieb Juan Pineda: This is my first time submitting a patch, so please let me know if I'm not following the correct protocol. Please see http://wiki.qemu.org/Contribute/SubmitAPatch Using the git send-email tool assures that the patch format comes out right. Then, this patch despite its small size is not trivial, so instead of qemu-triv...@nongnu.org please cc andreas.faer...@web.de for Cocoa. Thanks. Under OSX Lion the boot volume dialog is not closed and it permanently obscures the emulator window since under Lion the dialog cannot be repositioned. The fix only adds a single line to close the dialog. I don't have access to v10.7 so please describe the problem in more detail: What command line do you use? If using the right arguments you shouldn't see a window at all. Are you maybe using -drive instead of -hda and that is not yet handled correctly? Apart from this issue, is it working correctly for you? Regards, Andreas Signed-off-by: Juan Pineda j...@logician.com Thanks! -Juan git diff ui/cocoa.m diff --git a/ui/cocoa.m b/ui/cocoa.m index d9e4e3d..4b42462 100644 --- a/ui/cocoa.m +++ b/ui/cocoa.m @@ -811,6 +811,8 @@ QemuCocoaView *cocoaView; char **argv = (char**)malloc( sizeof(char*)*3 ); + [sheet close]; + asprintf(&argv[0], "%s", bin); asprintf(&argv[1], "-hda"); asprintf(&argv[2], "%s", img); -- SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746, AG Nürnberg
Re: [Qemu-devel] [PATCH v8 1/4] block: add the block queue support
On Tue, Oct 18, 2011 at 4:36 PM, Kevin Wolf kw...@redhat.com wrote: Am 18.10.2011 10:07, schrieb Zhi Yong Wu: On Mon, Oct 17, 2011 at 6:17 PM, Kevin Wolf kw...@redhat.com wrote: + +typedef struct BlockQueueAIOCB BlockQueueAIOCB; + +struct BlockQueue { + QTAILQ_HEAD(requests, BlockQueueAIOCB) requests; + bool req_failed; + bool flushing; +}; I find req_failed pretty confusing. Needs documentation at least, but most probably also a better name. OK. request_has_failed? No, that doesn't describe what it's really doing. You set req_failed = true by default and then on some obscure condition clear it or not. It's tracking something, but I'm not sure what meaning it has during the whole process. In qemu_block_queue_flush, when bdrv_aio_readv/writev return NULL, if req_failed is changed to false, it indicates that the request exceeds the limits again; if req_failed is still true, it indicates that one I/O error takes place; at the moment, qemu_block_queue_callback(request, -EIO) needs to be called to report this to the upper layer. Okay, this makes some more sense now. How about reversing the logic and maintaining a limit_exceeded flag instead of a req_failed? I think this would be clearer. OK +void qemu_del_block_queue(BlockQueue *queue) +{ + BlockQueueAIOCB *request, *next; + + QTAILQ_FOREACH_SAFE(request, &queue->requests, entry, next) { + QTAILQ_REMOVE(&queue->requests, request, entry); + qemu_aio_release(request); + } + + g_free(queue); +} Can we be sure that no AIO requests are in flight that still use the now released AIOCB? Yeah, since qemu core code is serially performed, i think that when qemu_del_block_queue is performed, no requests are in flight. Right? Patch 2 has this code: +void bdrv_io_limits_disable(BlockDriverState *bs) +{ + bs->io_limits_enabled = false; + + if (bs->block_queue) { + qemu_block_queue_flush(bs->block_queue); + qemu_del_block_queue(bs->block_queue); + bs->block_queue = NULL; + } Does this mean that you can't disable I/O limits while the VM is running?
No, you can, even while the VM is running. Okay, in general qemu_block_queue_flush() empties the queue so that there are no requests left that qemu_del_block_queue() could drop from the queue. So in the common case it doesn't even enter the FOREACH loop. I think that we should adopt !QTAILQ_EMPTY(&queue->requests), not QTAILQ_FOREACH_SAFE in qemu_del_block_queue(), right? I think problems start when requests have failed or exceeded the limit again, then you have requests queued even after qemu_block_queue_flush(). You must be aware of this, otherwise the code in qemu_del_block_queue() wouldn't exist. But you can't release the ACBs without having called their callback, otherwise the caller would still assume that its ACB pointer is valid. Maybe calling the callback before releasing the ACB would be enough. Good, thanks. However, for failed requests see below. + +BlockDriverAIOCB *qemu_block_queue_enqueue(BlockQueue *queue, + BlockDriverState *bs, + BlockRequestHandler *handler, + int64_t sector_num, + QEMUIOVector *qiov, + int nb_sectors, + BlockDriverCompletionFunc *cb, + void *opaque) +{ + BlockDriverAIOCB *acb; + BlockQueueAIOCB *request; + + if (queue->flushing) { + queue->req_failed = false; + return NULL; + } else { + acb = qemu_aio_get(&block_queue_pool, bs, + cb, opaque); + request = container_of(acb, BlockQueueAIOCB, common); + request->handler = handler; + request->sector_num = sector_num; + request->qiov = qiov; + request->nb_sectors = nb_sectors; + request->real_acb = NULL; + QTAILQ_INSERT_TAIL(&queue->requests, request, entry); + } + + return acb; +} + +static int qemu_block_queue_handler(BlockQueueAIOCB *request) +{ + int ret; + BlockDriverAIOCB *res; + + res = request->handler(request->common.bs, request->sector_num, + request->qiov, request->nb_sectors, + qemu_block_queue_callback, request); + if (res) { + request->real_acb = res; + } + + ret = (res == NULL) ? 0 : 1; + + return ret; You mean return (res != NULL); and want to have bool as the return value of this function.
Yeah, thanks. i will modify as below: ret = (res == NULL) ? false : true; ret = (res != NULL) is really more readable. I have adopted Paolo's suggestion. and static bool qemu_block_queue_handler() +} + +void qemu_block_queue_flush(BlockQueue *queue) +{ + queue->flushing = true; + while (!QTAILQ_EMPTY(&queue->requests)) { + BlockQueueAIOCB *request = NULL; + int ret = 0; + +
Re: [Qemu-devel] [PATCH] PPC: Bump qemu-system-ppc to 64-bit physical address space
Am 18.10.2011 01:52, schrieb Alexander Graf: Some 32-bit PPC CPUs can use up to 36 bit of physicall address space. Treat them accordingly in the qemu-system-ppc binary type. physical Signed-off-by: Alexander Graf ag...@suse.de --- configure|2 +- target-ppc/cpu.h |2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/configure b/configure index 9b4fe34..3bdb556 100755 --- a/configure +++ b/configure @@ -3276,7 +3276,7 @@ case $target_arch2 in ;; ppc) gdb_xml_files=power-core.xml power-fpu.xml power-altivec.xml power-spe.xml -target_phys_bits=32 +target_phys_bits=64 target_nptl=yes target_libs_softmmu=$fdt_libs ;; I've found this very unintuitive for the new targets I'm preparing. Could we use the real target_phys_bits=36 here and later on ceil this to 32/64 before using it for libhw${target_phys_bits}? Andreas diff --git a/target-ppc/cpu.h b/target-ppc/cpu.h index 8e5c85c..f36f375 100644 --- a/target-ppc/cpu.h +++ b/target-ppc/cpu.h @@ -66,7 +66,7 @@ #define TARGET_PAGE_BITS 12 #endif /* defined(TARGET_PPCEMB) */ -#define TARGET_PHYS_ADDR_SPACE_BITS 32 +#define TARGET_PHYS_ADDR_SPACE_BITS 36 #define TARGET_VIRT_ADDR_SPACE_BITS 32 #endif /* defined (TARGET_PPC64) */ -- SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746, AG Nürnberg
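The rounding Andreas proposes (targets declare their real physical address width, such as 36 bits here, and the build rounds it up to the 32/64 granularity that libhw actually provides) is a one-liner. A hedged sketch, with an invented function name:

```c
/* Sketch of ceiling a target's real physical address width (e.g. 36 for
 * these 32-bit PPC CPUs) to the 32/64 libhw build granularity discussed
 * above.  The function name is illustrative, not QEMU's. */
#include <assert.h>

static int libhw_phys_bits(int target_phys_bits)
{
    return target_phys_bits <= 32 ? 32 : 64;
}
```

With this, configure could keep the honest target_phys_bits=36 while still selecting libhw64 for the link.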
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
At 10/18/2011 04:36 PM, Jan Kiszka Write: On 2011-10-18 10:34, Richard W.M. Jones wrote: Yeah, I see. Could also be solved via gdb scripts, but crash is already there. But let's see if the formats actually differ. In the end, crash is just parsing the same information that also gdb sees. I think the format can be similar to diskdump/kdump/netdump: dump_header: 1 block sub header: n blocks (n is stored in dump_header) bitmap: m blocks (2m is stored in dump_header) dumpable bitmap: m blocks memory data (we can know whether a page is stored in the core by the bitmap and dumpable bitmap) The format of the dump header (it's like kdump/diskdump): struct disk_dump_header { char signature[SIG_LEN]; /* = "QEMU" */ int header_version; /* Dump header version */ struct new_utsname utsname; /* copy of system_utsname */ struct timeval timestamp; /* Time stamp */ unsigned int status; int block_size; /* Size of a block in byte */ int sub_hdr_size; /* Size of arch dependent header in blocks */ unsigned int bitmap_blocks; /* Size of Memory bitmap in block */ unsigned int max_mapnr; /* = max_mapnr */ unsigned int total_ram_blocks; /* Number of blocks should be written */ unsigned int device_blocks; /* Number of total blocks in * the dump device */ unsigned int written_blocks; /* Number of written blocks */ unsigned int current_cpu; /* CPU# which handles dump */ int nr_cpus; /* Number of CPUs */ }; The sub header can contain all registers' values on each vcpu, and other information, for example: struct qemu_sub_header { unsigned long start_pfn; unsigned long end_pfn; off_t offset_note; unsigned long size_note; }; Thanks Wen Congyang Jan
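The two bitmaps in the proposed layout can be sketched from a reader's perspective: a page's data is present in the core only if its bit is set in both the memory bitmap and the dumpable bitmap. This is a hedged illustration of the idea, not the final on-disk format:

```c
/* Sketch of how a consumer of the proposed dump format could test the
 * bitmap and the dumpable bitmap for a given page frame number.  The
 * bit ordering (LSB-first within each byte) is an assumption here. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool bit_set(const uint8_t *bitmap, uint64_t pfn)
{
    return (bitmap[pfn / 8] >> (pfn % 8)) & 1;
}

/* A page is stored in the core iff both bitmaps mark it. */
static bool page_in_core(const uint8_t *bitmap, const uint8_t *dumpable,
                         uint64_t pfn)
{
    return bit_set(bitmap, pfn) && bit_set(dumpable, pfn);
}
```

The offset of a stored page in the memory-data section would then be derived from the count of set bits preceding its pfn, as kdump/diskdump readers do.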
Re: [Qemu-devel] [PATCH] PPC: Fail configure when libfdt is not available
On 18 October 2011 01:18, Alexander Graf ag...@suse.de wrote: +if test "$fdt" != "yes" -a \( "$target_arch2" = "ppc" -o \ + "$target_arch2" = "ppc64" -o "$target_arch2" = "ppcemb" \); then + echo + echo "Error: libfdt missing" + echo "The PowerPC target requires libfdt to work properly." + echo "Please make sure to have it and its development packages installed" + echo + exit 1 +fi if test "$target_arch2" = "ppc64" -a "$fdt" = "yes"; then echo "CONFIG_PSERIES=y" >> $config_target_mak fi The test -a, -o, ( and ) operators are deprecated (see the 'application usage section' of the POSIX 'test' documentation: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/test.html ) -- better not to use them in new code. Since target_arch2 == ppc64 and fdt != yes is now an impossible combination, you could remove the fdt check from the condition below which sets CONFIG_PSERIES. Also, missing '.' after 'installed'. -- PMM
Re: [Qemu-devel] [PATCH] Memory API bugfix - abolish addrrrange_end()
On 10/18/2011 03:38 AM, David Gibson wrote: On Mon, Oct 17, 2011 at 12:34:19PM +0200, Avi Kivity wrote: On 10/17/2011 07:31 AM, David Gibson wrote: In terms of how the code looks, it's seriously more ugly (see the patches I sent out). Conceptually it's cleaner, since we're not dodging the issue that we need to deal with a full 64-bit domain. We don't have to dodge that issue. I know how to remove the requirement for intermediate negative values, I just haven't made up a patch yet. With that we can change to uint64 and cover the full 64 bit range. In fact I think I can make it so that size==0 represents size=2^64 and even handle the full 64-bit, inclusive range properly. That means you can't do a real size == 0. Yeah... a memory range with size 0 has no effect by definition, I think we can do without it. How do we make sure all callers know this? But my main concern is maintainability. The 64-bit blanket is too short; if we keep pulling it in various directions we'll just expose ourselves in new ways. Nonsense, dealing with full X-bit range calculations in X-bit types is a fairly standard problem. The kernel does it in VMA handling for one. It just requires thinking about overflow cases. We discovered three bugs already (you found two, and I had one during development). Even if it can probably be done with extreme care, but is it worth spending all that development time on? I'm not sure there is a parallel with vmas, since we're offsetting in both the positive and negative directions. I think the so-called negative offsetting is just an artifact of our implementation. I don't see that it's any different from having a VMA whose file offset is larger than its memory address. Consider the vga window at 0xa0000 pointing into the framebuffer at alias_offset 0x1a0000. To the system, it looks like a copy of the framebuffer starts at -0x100000, becomes visible 0x1a0000 bytes later (at 0xa0000), then becomes invisible again at 0xa8000.
Yes, it's an artifact, but I don't want to spend too much time worrying about it, if I can throw a few more bits at the problem. The API is too central to make a case by case analysis of where things can go wrong, it needs to be robust. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
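The convention floated above (plain uint64_t for start and size, with size == 0 encoding the full 2^64 span, and all arithmetic done modulo 2^64 so no intermediate value needs extra bits) can be sketched as a membership test. Names are illustrative, not the actual memory API:

```c
/* Sketch of full-range address arithmetic in unsigned 64-bit types:
 * size == 0 stands for 2^64, and the subtraction wraps modulo 2^64,
 * so the test also covers ranges whose start + size would overflow
 * (the "negative offset" artifact from the vga-window example). */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct AddrRange {
    uint64_t start;
    uint64_t size;   /* 0 means the whole 2^64 space */
} AddrRange;

static bool range_contains(AddrRange r, uint64_t addr)
{
    /* addr - r.start is computed modulo 2^64; for a wrapping range the
     * wrapped difference is still smaller than size exactly when addr
     * lies inside the window. */
    return r.size == 0 || addr - r.start < r.size;
}
```

This is only the membership half of the problem; intersections and offset composition need the same modular care, which is where the three bugs mentioned above came from.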
Re: [Qemu-devel] [PATCH v8 1/4] block: add the block queue support
Am 18.10.2011 11:29, schrieb Zhi Yong Wu: +void qemu_del_block_queue(BlockQueue *queue) +{ +BlockQueueAIOCB *request, *next; + +QTAILQ_FOREACH_SAFE(request, &queue->requests, entry, next) { +QTAILQ_REMOVE(&queue->requests, request, entry); +qemu_aio_release(request); +} + +g_free(queue); +} Can we be sure that no AIO requests are in flight that still use the now released AIOCB? Yeah, since qemu core code is serially performed, i think that when qemu_del_block_queue is performed, no requests are in flight. Right? Patch 2 has this code: +void bdrv_io_limits_disable(BlockDriverState *bs) +{ +bs->io_limits_enabled = false; + +if (bs->block_queue) { +qemu_block_queue_flush(bs->block_queue); +qemu_del_block_queue(bs->block_queue); +bs->block_queue = NULL; +} Does this mean that you can't disable I/O limits while the VM is running? No, you can, even while the VM is running. Okay, in general qemu_block_queue_flush() empties the queue so that there are no requests left that qemu_del_block_queue() could drop from the queue. So in the common case it doesn't even enter the FOREACH loop. I think that we should adopt !QTAILQ_EMPTY(&queue->requests), not QTAILQ_FOREACH_SAFE in qemu_del_block_queue(), right? I think QTAILQ_FOREACH_SAFE is fine. I think problems start when requests have failed or exceeded the limit again, then you have requests queued even after qemu_block_queue_flush(). You must be aware of this, otherwise the code in qemu_del_block_queue() wouldn't exist. But you can't release the ACBs without having called their callback, otherwise the caller would still assume that its ACB pointer is valid. Maybe calling the callback before releasing the ACB would be enough. Good, thanks. +} +} +} + +queue->req_failed = true; +queue->flushing = false; +} + +bool qemu_block_queue_has_pending(BlockQueue *queue) +{ +return !queue->flushing && !QTAILQ_EMPTY(&queue->requests); +} Why doesn't the queue have pending requests in the middle of a flush operation?
(That is, the flush hasn't completed yet) It is possible for the queue to have pending requests. If yes, how about? Sorry, can't parse this. I don't understand why the !queue->flushing part is correct. What about this? When bdrv_aio_readv/writev handles one request, it will determine whether the block queue is not being flushed and isn't NULL; if yes, it assumes that this request is one new request from the upper layer, so it won't determine if the I/O rate at runtime has exceeded the limits, but immediately inserts it into the block queue. Hm, I think I understand what you're saying, but only after looking at patch 3. This is not really implementing a has_pending(), but has_pending_and_caller_wasnt_called_during_flush(). I think it would be better to handle the queue->flushing condition in the caller where its use is more obvious. Kevin
Re: [Qemu-devel] [RFC][PATCH 00/45] qemu-kvm: MSI layer rework for in-kernel irqchip support
On 2011-10-17 17:57, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 11:27:34AM +0200, Jan Kiszka wrote: As previously indicated, I was working for quite a while on a major refactoring of the MSI additions we have in qemu-kvm to support in-kernel irqchip, vhost and device assignment. This is now the outcome. I'm quite happy with it, things are still working (apparently), and the invasiveness of KVM hooks into the MSI layer is significantly reduced. Moreover, I was able to port the device assignment code over generic MSI support, reducing the size of that file a bit further. Some further highlights: - fix for HPET MSI support with in-kernel irqchip - fully configurable MSI-X (allows 1:1 mapping for assigned devices) - refactored KVM core API for device assignment and IRQ routing I'm sending the whole series in one chunk so that you can see what the result will be. It's RFC as I bet that there are regressions included and maybe still room left for improvements. Once all is fine (can be broken up into multiple chunks for the merge), I would suggest patching qemu-kvm first and then start with porting things over to upstream. Comments & review welcome. CC: Alexander Graf ag...@suse.de CC: Gerd Hoffmann kra...@redhat.com CC: Isaku Yamahata yamah...@valinux.co.jp So the scheme where we lazily update the kvm gsi table on interrupts is interesting. There seems to be very little point in it for virtio, and it does seem to make it impossible to detect lack of resources (at the moment we let the guest know if we run out of GSIs and linux guests can fall back to regular interrupts). Are we really at that limit already? Then I think it's rather time to lift it at the kernel side. I am guessing the idea is to use it for device assignment where it does make sense as there is no standard way to track which vectors are actually used? But how does it work there? kvm does not propagate unmapped interrupts from an assigned device to qemu, does it?
No, device assignment and irqfd (vhost, vfio, whatever-will-come) belong to the static group. They need to hand out a static gsi to the irq source, thus they cannot participate in lazy updating. But they can benefit from the generic config notifiers: whenever the guest changes some route (or disables it), they just need to inform the core about this. So there is no need to track used vectors separately anymore. Jan
Re: [Qemu-devel] [V2 PATCH] rtl8139: check the buffer availability
Hi Jason, Could you please elaborate on what problem you are trying to resolve with this patch? And do you think we need to notify the I/O thread to re-poll the tap fd when a receive descriptor becomes available? Otherwise, tap read polling will stay disabled until the I/O handlers are updated for some other reason. Am I right? On 10/17/2011 10:55 AM, Jason Wang wrote: Reduce spurious packet drops on RX ring empty when in c+ mode by verifying that we have at least 1 buffer ahead of time. Change from v1: Fix style comments from Stefan. Signed-off-by: Jason Wang jasow...@redhat.com --- hw/rtl8139.c | 44 ++-- 1 files changed, 30 insertions(+), 14 deletions(-)
diff --git a/hw/rtl8139.c b/hw/rtl8139.c
index 3753950..bcbc5e3 100644
--- a/hw/rtl8139.c
+++ b/hw/rtl8139.c
@@ -84,6 +84,19 @@
 #define VLAN_TCI_LEN 2
 #define VLAN_HLEN (ETHER_TYPE_LEN + VLAN_TCI_LEN)

+/* w0 ownership flag */
+#define CP_RX_OWN (1 << 31)
+/* w0 end of ring flag */
+#define CP_RX_EOR (1 << 30)
+/* w0 bits 0...12 : buffer size */
+#define CP_RX_BUFFER_SIZE_MASK ((1 << 13) - 1)
+/* w1 tag available flag */
+#define CP_RX_TAVA (1 << 16)
+/* w1 bits 0...15 : VLAN tag */
+#define CP_RX_VLAN_TAG_MASK ((1 << 16) - 1)
+/* w2 low 32bit of Rx buffer ptr */
+/* w3 high 32bit of Rx buffer ptr */
+
 #if defined (DEBUG_RTL8139)
 # define DPRINTF(fmt, ...) \
     do { fprintf(stderr, "RTL8139: " fmt, ## __VA_ARGS__); } while (0)
@@ -805,6 +818,22 @@ static inline target_phys_addr_t rtl8139_addr64(uint32_t low, uint32_t high)
 #endif
 }

+/* Verify that we have at least one available rx buffer */
+static int rtl8139_cp_has_rxbuf(RTL8139State *s)
+{
+    uint32_t val, rxdw0;
+    target_phys_addr_t cplus_rx_ring_desc = rtl8139_addr64(s->RxRingAddrLO,
+                                                           s->RxRingAddrHI);
+    cplus_rx_ring_desc += 16 * s->currCPlusRxDesc;
+    cpu_physical_memory_read(cplus_rx_ring_desc, &val, 4);
+    rxdw0 = le32_to_cpu(val);
+    if (rxdw0 & CP_RX_OWN) {
+        return 1;
+    } else {
+        return 0;
+    }
+}
+
 static int rtl8139_can_receive(VLANClientState *nc)
 {
     RTL8139State *s = DO_UPCAST(NICState, nc, nc)->opaque;
@@ -819,7 +848,7 @@ static int rtl8139_can_receive(VLANClientState *nc)
     if (rtl8139_cp_receiver_enabled(s)) {
         /* ??? Flow control not implemented in c+ mode.
            This is a hack to work around slirp deficiencies anyway. */
-        return 1;
+        return rtl8139_cp_has_rxbuf(s);
     } else {
         avail = MOD2(s->RxBufferSize + s->RxBufPtr - s->RxBufAddr,
                      s->RxBufferSize);
@@ -965,19 +994,6 @@ static ssize_t rtl8139_do_receive(VLANClientState *nc, const uint8_t *buf, size_

 /* begin C+ receiver mode */

-/* w0 ownership flag */
-#define CP_RX_OWN (1 << 31)
-/* w0 end of ring flag */
-#define CP_RX_EOR (1 << 30)
-/* w0 bits 0...12 : buffer size */
-#define CP_RX_BUFFER_SIZE_MASK ((1 << 13) - 1)
-/* w1 tag available flag */
-#define CP_RX_TAVA (1 << 16)
-/* w1 bits 0...15 : VLAN tag */
-#define CP_RX_VLAN_TAG_MASK ((1 << 16) - 1)
-/* w2 low 32bit of Rx buffer ptr */
-/* w3 high 32bit of Rx buffer ptr */
-
     int descriptor = s->currCPlusRxDesc;
     target_phys_addr_t cplus_rx_ring_desc;
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
On 2011-10-18 11:43, Wen Congyang wrote: At 10/18/2011 04:36 PM, Jan Kiszka wrote: On 2011-10-18 10:34, Richard W.M. Jones wrote: Yeah, I see. Could also be solved via gdb scripts, but crash is already there. But let's see if the formats actually differ. In the end, crash is just parsing the same information that also gdb sees. I think the format can be similar to diskdump/kdump/netdump:
dump_header: 1 block
sub header: n blocks (n is stored in dump_header)
bitmap: m blocks (2m is stored in dump_header)
dumpable bitmap: m blocks
memory data (we can know whether a page is stored in the core from the bitmap and the dumpable bitmap)
The format of the dump header (it's like kdump/diskdump):
struct disk_dump_header {
    char                signature[SIG_LEN];   /* = "QEMU" */
    int                 header_version;       /* Dump header version */
    struct new_utsname  utsname;              /* copy of system_utsname */
    struct timeval      timestamp;            /* Time stamp */
    unsigned int        status;
    int                 block_size;           /* Size of a block in byte */
    int                 sub_hdr_size;         /* Size of arch dependent header in blocks */
    unsigned int        bitmap_blocks;        /* Size of Memory bitmap in block */
    unsigned int        max_mapnr;            /* = max_mapnr */
    unsigned int        total_ram_blocks;     /* Number of blocks that should be written */
    unsigned int        device_blocks;        /* Number of total blocks in the dump device */
    unsigned int        written_blocks;       /* Number of written blocks */
    unsigned int        current_cpu;          /* CPU# which handles dump */
    int                 nr_cpus;              /* Number of CPUs */
};
The sub header can contain all registers' values on each vcpu, and other information, for example:
struct qemu_sub_header {
    unsigned long       start_pfn;
    unsigned long       end_pfn;
    off_t               offset_note;
    unsigned long       size_note;
};
So is this a standard format or only similar to something? Which tools will understand it out-of-the-box? If it's not standard, why? Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
[Qemu-devel] [Bug 874038] Re: ARM thumb2 does not propogate carry flag properly
Sorry for the noise, it appears I've checked on a not-clean 0.15.0 ** Changed in: qemu Status: New => Invalid -- You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/874038 Title: ARM thumb2 does not propogate carry flag properly Status in QEMU: Invalid Bug description: information on carry flag is lost if gen_set_CF_bit31(t1) is called after a logic operation. To manage notifications about this bug go to: https://bugs.launchpad.net/qemu/+bug/874038/+subscriptions
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
On 10/18/2011 10:39 AM, Jan Kiszka wrote: Yeah, I see. Could also be solved via gdb scripts, but crash is already there. [ BTW, crash is for the dead. But having those features in form of gdb scripts would also allow live system analysis. Looks like it's worth obsoleting crash on the long run. ] Crash can also do live system analysis. :) Paolo
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
On Tue, Oct 18, 2011 at 12:41:10PM +0200, Paolo Bonzini wrote: On 10/18/2011 10:39 AM, Jan Kiszka wrote: Yeah, I see. Could also be solved via gdb scripts, but crash is already there. [ BTW, crash is for the dead. But having those features in form of gdb scripts would also allow live system analysis. Looks like it's worth obsoleting crash on the long run. ] Crash can also do live system analysis. :) crash in fact is gdb - with a lot of additional commands.
Re: [Qemu-devel] [RFC][PATCH 12/45] msi: Introduce MSIRoutingCache
On 2011-10-17 17:37, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 01:19:56PM +0200, Jan Kiszka wrote: On 2011-10-17 13:06, Avi Kivity wrote: On 10/17/2011 11:27 AM, Jan Kiszka wrote: This cache will help us implement KVM in-kernel irqchip support without spreading hooks all over the place. KVM requires us to register it first and then deliver it by raising a pseudo IRQ line returned on registration. While this could be changed for QEMU-originated MSI messages by adding direct MSI injection, we will still need this translation for irqfd-originated messages. The MSIRoutingCache will allow us to track those registrations and update them lazily before the actual delivery. This avoids having to track MSI vectors at device level (like qemu-kvm currently does).
+typedef enum {
+    MSI_ROUTE_NONE = 0,
+    MSI_ROUTE_STATIC,
+} MSIRouteType;
+
+struct MSIRoutingCache {
+    MSIMessage msg;
+    MSIRouteType type;
+    int kvm_gsi;
+    int kvm_irqfd;
+};
+
diff --git a/hw/pci.h b/hw/pci.h
index 329ab32..5b5d2fd 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -197,6 +197,10 @@ struct PCIDevice {
     MemoryRegion rom;
     uint32_t rom_bar;

+    /* MSI routing caches */
+    MSIRoutingCache *msi_cache;
+    MSIRoutingCache *msix_cache;
+
     /* MSI entries */
     int msi_entries_nr;
     struct KVMMsiMessage *msi_irq_entries;
IMO this needlessly leaks kvm information into core qemu. The cache should be completely hidden in kvm code. I think msi_deliver() can hide the use of the cache completely. For pre-registered events like kvm's irqfd, you can use something like qemu_irq qemu_msi_irq(MSIMessage msg) for non-kvm, it simply returns a qemu_irq that triggers a stl_phys(); for kvm, it allocates an irqfd and a permanent entry in the cache and returns a qemu_irq that triggers the irqfd. See my previous mail: you want to track the life-cycle of an MSI source to avoid generating routes for identical sources. A message is not a source. Two identical messages can come from different sources.
Since MSI messages are edge triggered, I don't see how this would work without losing interrupts. And AFAIK, existing guests do not use the same message for different sources. Just like we handle shared edge-triggered line-based IRQs, shared MSIs are in principle feasible as well. Jan
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
On 2011-10-18 12:41, Paolo Bonzini wrote: On 10/18/2011 10:39 AM, Jan Kiszka wrote: Yeah, I see. Could also be solved via gdb scripts, but crash is already there. [ BTW, crash is for the dead. But having those features in form of gdb scripts would also allow live system analysis. Looks like it's worth obsoleting crash in the long run. ] Crash can also do live system analysis. :) Does it work via the remote interface? And it's still the wrong way around: you have to append gdb to the majority of commands when doing debugging instead of issuing additional analysis commands from the gdb shell. That's understandable given the limited scripting power of old gdbs. But that's now history thanks to its python interface. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [RFC][PATCH 42/45] msix: Introduce msix_init_simple
On Mon, Oct 17, 2011 at 09:21:34PM +0200, Jan Kiszka wrote: On 2011-10-17 16:28, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 01:27:31PM +0200, Jan Kiszka wrote: On 2011-10-17 13:22, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 11:28:16AM +0200, Jan Kiszka wrote: Device models are usually not interested in specifying MSI-X configuration details beyond the number of vectors to provide and the BAR number to use. Layout of an exclusively used BAR and its registration can also be handled centrally. This is the purpose of msix_init_simple. It provides handy services to the existing users. Future users like device assignment may require more detailed setup specification. For them we will (re-)introduce msix_init with the full list of configuration options (in contrast to the current code). Signed-off-by: Jan Kiszka jan.kis...@siemens.com Well, this seems a bit of a code churn then, doesn't it? We are also discussing using a memory BAR for virtio-pci for other stuff besides MSI-X, so the last user of the _simple variant will be ivshmem then? We will surely see more MSI-X users over time. Not sure if they all mix their MSI-X BARs with other stuff. But e.g. the e1000 variant I have here does not. So there should be users in the future. Jan Question is, how hard is it to pass in the BAR and the offset? That is trivial. But have a look at the final simple implementation. It also manages the container memory region for table and PBA and registers/unregisters that container as BAR. So there is measurable added-value. Jan Yes, I agree. In particular it's not very nice that the user has to know the size of the bar to create. But the API is very unfortunate IMO. I am also more interested in solutions that help all devices and not just those that have a dedicated bar for msix + pba. We should probably pass in the size of the memory region allocated to the msix table, and verify that the table fits there.
We can also avoid passing in the bar number, like this:
diff --git a/hw/pci.c b/hw/pci.c
index 749e8d8..d0d893e 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -903,6 +903,17 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num,
                                  : pci_dev->bus->address_space_mem;
 }

+int pci_get_bar_nr(PCIDevice *pci_dev, MemoryRegion *bar)
+{
+    int region_num;
+    for (region_num = 0; region_num < PCI_NUM_REGIONS; ++region_num) {
+        if (pci_dev->io_regions[region_num].memory == bar) {
+            return region_num;
+        }
+    }
+    return -1;
+}
+
 pcibus_t pci_get_bar_addr(PCIDevice *pci_dev, int region_num)
 {
     return pci_dev->io_regions[region_num].addr;
-- MST
Re: [Qemu-devel] [RFC][PATCH 42/45] msix: Introduce msix_init_simple
On 2011-10-18 12:52, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 09:21:34PM +0200, Jan Kiszka wrote: On 2011-10-17 16:28, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 01:27:31PM +0200, Jan Kiszka wrote: On 2011-10-17 13:22, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 11:28:16AM +0200, Jan Kiszka wrote: Devices models are usually not interested in specifying MSI-X configuration details beyond the number of vectors to provide and the BAR number to use. Layout of an exclusively used BAR and its registration can also be handled centrally. This is the purpose of msix_init_simple. It provides handy services to the existing users. Future users like device assignment may require more detailed setup specification. For them we will (re-)introduce msix_init with the full list of configuration option (in contrast to the current code). Signed-off-by: Jan Kiszka jan.kis...@siemens.com Well, this seems a bit of a code churn then, doesn't it? We are also discussing using memory BAR for virtio-pci for other stuff besides MSI-X, so the last user of the _simple variant will be ivshmem then? We will surely see more MSI-X users over the time. Not sure if they all mix their MSIX-X BARs with other stuff. But e.g. the e1000 variant I have here does not. So there should be users in the future. Jan Question is, how hard is to pass in the BAR and the offset? That is trivial. But have a look at the final simple implementation. It also manages the container memory region for table and PBA and registers/unregisters that container as BAR. So there is measurable added-value. Jan Yes, I agree. In particular it's not very nice that the user has to know the size of the bar to create. But the API is very unfortunate IMO. I'm open to see the prototypes of a different one. I am also more interested in solutions that help all devices and not just those that have a dedicated bar for msix + pba. That's what patch 43 is for. 
We should probably pass in the size of the memory region allocated to the msix table, and verify that the table fits there. Well, if you specify table and PBA offsets explicitly, I think it is fair to expect the caller to have done the math. There are some checks built into the core already, e.g. against table and PBA overlap. We could add one for checking BAR limits. I'm not sure, though, if requesting table and PBA sizes solves the issue that the user may have done the calculation wrong. We can also avoid passing in the bar number, like this:
diff --git a/hw/pci.c b/hw/pci.c
index 749e8d8..d0d893e 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -903,6 +903,17 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num,
                                  : pci_dev->bus->address_space_mem;
 }

+int pci_get_bar_nr(PCIDevice *pci_dev, MemoryRegion *bar)
+{
+    int region_num;
+    for (region_num = 0; region_num < PCI_NUM_REGIONS; ++region_num) {
+        if (pci_dev->io_regions[region_num].memory == bar) {
+            return region_num;
+        }
+    }
+    return -1;
+}
+
 pcibus_t pci_get_bar_addr(PCIDevice *pci_dev, int region_num)
 {
     return pci_dev->io_regions[region_num].addr;
That enforces a specific registration order. If you call msix_init before doing the BAR registration, the function above will not help. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 6/7] stellaris: convert adc to memory API
2011/10/17 Benoît Canet benoit.ca...@gmail.com: Signed-off-by: Benoit Canet benoit.ca...@gmail.com --- hw/stellaris.c | 30 -- 1 files changed, 12 insertions(+), 18 deletions(-) Reviewed-by: Peter Maydell peter.mayd...@linaro.org
Re: [Qemu-devel] [PATCH 4/7] stellaris: convert sys to memory API
2011/10/17 Benoît Canet benoit.ca...@gmail.com: Signed-off-by: Benoit Canet benoit.ca...@gmail.com --- hw/stellaris.c | 28 +++- 1 files changed, 11 insertions(+), 17 deletions(-) Reviewed-by: Peter Maydell peter.mayd...@linaro.org
Re: [Qemu-devel] [PATCH 2/7] integratorcp: convert icp pic to memory API
2011/10/17 Benoît Canet benoit.ca...@gmail.com: Signed-off-by: Benoit Canet benoit.ca...@gmail.com --- hw/integratorcp.c | 27 ++- 1 files changed, 10 insertions(+), 17 deletions(-) Reviewed-by: Peter Maydell peter.mayd...@linaro.org
Re: [Qemu-devel] [RFC][PATCH 11/45] msi: Factor out delivery hook
On 2011-10-17 15:48, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 03:41:44PM +0200, Avi Kivity wrote: On 10/17/2011 03:41 PM, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 01:15:56PM +0200, Jan Kiszka wrote: On 2011-10-17 12:56, Avi Kivity wrote: On 10/17/2011 11:27 AM, Jan Kiszka wrote: So far we deliver MSI messages by writing them into the target MMIO area. This reflects what happens on hardware, but imposes some limitations on the emulation when introducing KVM in-kernel irqchip models. For those we will need to track the message origin. Why do we need to track the message origin? Emulated interrupt remapping? The origin holds the routing cache, which we need to track if the message already has a route (and that without searching long lists) and to update that route instead of adding another one. Hmm, yes, but if the device does stl_phys or something like this, it won't work with irqchip, will it? And it should, ideally. Why not? it will fall back to the apic path, and use the local routing cache entry there. Does it still work with irqchip enabled? I didn't realize ... Yep, as MSI requests that land in the APIC MMIO page are also fed into msi_deliver and will take the normal path from there on. Jan
Re: [Qemu-devel] [PATCH 3/7] integratorcp: convert control to memory API
2011/10/17 Benoît Canet benoit.ca...@gmail.com:
-static void icp_control_init(uint32_t base)
+static void icp_control_init(target_phys_addr_t base)
 {
-    int iomemtype;
+    MemoryRegion *io;
-    iomemtype = cpu_register_io_memory(icp_control_readfn,
-                                       icp_control_writefn, NULL,
-                                       DEVICE_NATIVE_ENDIAN);
-    cpu_register_physical_memory(base, 0x0080, iomemtype);
+    io = (MemoryRegion *)g_malloc0(sizeof(MemoryRegion));
+    memory_region_init_io(io, &icp_control_ops, NULL,
+                          "control", 0x0080);
+    memory_region_add_subregion(get_system_memory(), base, io);
     /* ??? Save/restore. */
 }
I didn't spot this the first time round, but this is wrong. We shouldn't be g_malloc0()ing the MemoryRegion -- it should be an element in some suitable device struct. I think the right thing to do here is probably first to do the (fairly trivial) conversion of the icp_control code to be a sysbus device, then do the memory region conversion on top of that. -- PMM
Re: [Qemu-devel] [RFC][PATCH 12/45] msi: Introduce MSIRoutingCache
On 2011-10-17 17:43, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 11:27:46AM +0200, Jan Kiszka wrote: This cache will help us implement KVM in-kernel irqchip support without spreading hooks all over the place. KVM requires us to register it first and then deliver it by raising a pseudo IRQ line returned on registration. While this could be changed for QEMU-originated MSI messages by adding direct MSI injection, we will still need this translation for irqfd-originated messages. The MSIRoutingCache will allow us to track those registrations and update them lazily before the actual delivery. This avoids having to track MSI vectors at device level (like qemu-kvm currently does). Signed-off-by: Jan Kiszka jan.kis...@siemens.com So if many devices are added, exhausting the number of GSIs supported, we get terrible performance instead of simply failing outright. To me, this looks more like a bug than a feature ... If that ever turns out to be a bottleneck, failing looks like the worst we can do. Reporting excessive cache flushes would make some sense and could still be added. Jan
Re: [Qemu-devel] [RFC][PATCH 28/45] qemu-kvm: msix: Drop tracking of used vectors
On Mon, Oct 17, 2011 at 09:28:12PM +0200, Jan Kiszka wrote: On 2011-10-17 17:48, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 11:28:02AM +0200, Jan Kiszka wrote: This optimization was only required to keep KVM route usage low. Now that we solve that problem via lazy updates, we can drop the field. We still need interfaces to clear pending vectors, though (and we have to make use of them more broadly - but that's unrelated to this patch). Signed-off-by: Jan Kiszka jan.kis...@siemens.com Lazy updates should be an implementation detail. IMO resource tracking of vectors makes sense as an API. Making devices deal with pending vectors as a concept, IMO, does not. There is really no use for tracking the vector lifecycle once we have lazy updates (except for static routes). It's a way too invasive concept, and it's not needed for anything but KVM. I think it's needed. The PCI spec states that when the device does not need an interrupt anymore, it should clear the pending bit. The use/unuse is IMO a decent API for this, because it uses a familiar resource tracking concept. Exposing this knowledge of msix to devices seems like a worse API. If you want an example, check http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/70915 and compare it to the changes done to hpet in this series. Jan This seems to be a general argument that lazy updates are good? I have no real problem with them, besides the fact that we need an API to reserve space in the routing table so that device setup can fail upfront. -- MST
Re: [Qemu-devel] [PATCH v2 00/13] allow tools to use the QEMU main loop
On 10/17/2011 10:19 PM, Paolo Bonzini wrote: On 10/17/2011 10:11 PM, Anthony Liguori wrote: Leftover from RFC. Right now, this is just a chainsaw cleanup. However, Kevin has plans to use the main loop everywhere (making qemu-io and qemu-img one huge coroutine). Can you add something to docs that explains how to use the new main loop? Yes, of course. Is it okay to just move more functions to main-loop.h (bottom halves, lock/unlock iothread, child watch) and add doc comments there? If so, I can send the 2 updated patches (main-loop.h/main-loop.c) for review, and push again to github. Paolo
Re: [Qemu-devel] [RFC][PATCH 11/45] msi: Factor out delivery hook
On Mon, Oct 17, 2011 at 09:15:47PM +0200, Jan Kiszka wrote: On 2011-10-17 15:43, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 11:27:45AM +0200, Jan Kiszka wrote: diff --git a/hw/msi.c b/hw/msi.c index 3c7ebc3..9055155 100644 --- a/hw/msi.c +++ b/hw/msi.c @@ -40,6 +40,14 @@ /* Flag for interrupt controller to declare MSI/MSI-X support */ bool msi_supported; +static void msi_unsupported(MSIMessage *msg) +{ +/* If we get here, the board failed to register a delivery handler. */ +abort(); +} + +void (*msi_deliver)(MSIMessage *msg) = msi_unsupported; + How about we set this to NULL, and check it instead of the bool flag? Yeah. I will introduce bool msi_supported(void) { return msi_deliver != msi_unsupported; } OK? Jan Looks a bit weird ... NULL is a pretty standard value for an invalid pointer, isn't it? -- MST
Re: [Qemu-devel] [RFC][PATCH 08/45] Introduce MSIMessage structure
On 2011-10-17 15:01, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 02:09:46PM +0200, Jan Kiszka wrote: On 2011-10-17 14:04, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 01:51:00PM +0200, Jan Kiszka wrote: On 2011-10-17 13:46, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 11:27:42AM +0200, Jan Kiszka wrote: Will be used for generating and distributing MSI messages, both in emulation mode and under KVM. Signed-off-by: Jan Kiszka jan.kis...@siemens.com I would add uint64_t msix_get_address(dev, vector) uint64_t msix_get_data(dev, vector) and same for msi. this would minimise the changes while still making it possible to avoid code duplication in kvm. I'm introducing msi[x]_message_from_vector for that purpose later on. Or what do you mean? Jan It does not look like everyone actually wants the structure, users seem to put it on stack and then immediately unwrap it to get at the address/data. So two accessorts get_data + get_address instead of one, will remove the need to rework all code to use the structure. The idea of this patch is to start handling MSI messages as a single blob. There should be no need to ask a device for parts of that blobs this way. There should be no need to look at the message at all. devices really only care about vector numbers. So we are left with msix.c msi.c and kvm as the only users. kvm has a cache of messages so it needs a struct of these, msix/msi don't. MSIMessages is primarily designed for path from the device to the interrupt controller. And there are multiple stops, already in the path after this patch set. Interrupt remapping, e.g., would add another stop. If you see use cases in this series, though, let me know. Jan Yes, I see them. msix_notify is one example. msi_notify is another. E.g. msi_notify would IMO look nicer as: stl_le_phys(msi_get_address(dev, vector), msi_get_data(dev, vector)); This line does not exist anymore at the end of this series. See pc_msi_deliver. 
Jan
Re: [Qemu-devel] [RFC][PATCH 28/45] qemu-kvm: msix: Drop tracking of used vectors
On 2011-10-18 13:58, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 09:28:12PM +0200, Jan Kiszka wrote: On 2011-10-17 17:48, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 11:28:02AM +0200, Jan Kiszka wrote: This optimization was only required to keep KVM route usage low. Now that we solve that problem via lazy updates, we can drop the field. We still need interfaces to clear pending vectors, though (and we have to make use of them more broadly - but that's unrelated to this patch). Signed-off-by: Jan Kiszka jan.kis...@siemens.com Lazy updates should be an implementation detail. IMO resource tracking of vectors makes sense as an API. Making devices deal with pending vectors as a concept, IMO, does not. There is really no use for tracking the vector lifecycle once we have lazy updates (except for static routes). It's a way too invasive concept, and it's not needed for anything but KVM. I think it's needed. The PCI spec states that when the device does not need an interrupt anymore, it should clear the pending bit. That should be done explicitly if it is required outside existing clearing points. We already have that service, it's called msix_clear_vector. That alone does not justify msix_vector_use and all the state and logic behind it IMHO. The use/unuse is IMO a decent API for this, because it uses a familiar resource tracking concept. Exposing this knowledge of msix to devices seems like a worse API. If you want an example, check http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/70915 and compare it to the changes done to hpet in this series. Jan This seems to be a general argument that lazy updates are good? I have no real problem with them, besides the fact that we need an API to reserve space in the routing table so that device setup can fail upfront. That's not possible, even with used vectors, as devices change their vector usage depending on how the guest configures the devices. 
If you (pre-)allocate all possible vectors, you may run out of resources earlier than needed actually. That's also why we do those data == 0 checks to skip used but unconfigured vectors. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [RFC][PATCH 12/45] msi: Introduce MSIRoutingCache
On Mon, Oct 17, 2011 at 09:19:34PM +0200, Jan Kiszka wrote: On 2011-10-17 17:37, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 01:19:56PM +0200, Jan Kiszka wrote: On 2011-10-17 13:06, Avi Kivity wrote: On 10/17/2011 11:27 AM, Jan Kiszka wrote: This cache will help us implement KVM in-kernel irqchip support without spreading hooks all over the place. KVM requires us to register it first and then deliver it by raising a pseudo IRQ line returned on registration. While this could be changed for QEMU-originated MSI messages by adding direct MSI injection, we will still need this translation for irqfd-originated messages. The MSIRoutingCache will allow us to track those registrations and update them lazily before the actual delivery. This avoids having to track MSI vectors at device level (like qemu-kvm currently does).
+typedef enum {
+    MSI_ROUTE_NONE = 0,
+    MSI_ROUTE_STATIC,
+} MSIRouteType;
+
+struct MSIRoutingCache {
+    MSIMessage msg;
+    MSIRouteType type;
+    int kvm_gsi;
+    int kvm_irqfd;
+};
+
diff --git a/hw/pci.h b/hw/pci.h
index 329ab32..5b5d2fd 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -197,6 +197,10 @@ struct PCIDevice {
     MemoryRegion rom;
     uint32_t rom_bar;

+    /* MSI routing caches */
+    MSIRoutingCache *msi_cache;
+    MSIRoutingCache *msix_cache;
+
     /* MSI entries */
     int msi_entries_nr;
     struct KVMMsiMessage *msi_irq_entries;
IMO this needlessly leaks kvm information into core qemu. The cache should be completely hidden in kvm code. I think msi_deliver() can hide the use of the cache completely. For pre-registered events like kvm's irqfd, you can use something like qemu_irq qemu_msi_irq(MSIMessage msg) for non-kvm, it simply returns a qemu_irq that triggers a stl_phys(); for kvm, it allocates an irqfd and a permanent entry in the cache and returns a qemu_irq that triggers the irqfd. See my previous mail: you want to track the life-cycle of an MSI source to avoid generating routes for identical sources. A message is not a source. Two identical messages can come from different sources. Since MSI messages are edge triggered, I don't see how this would work without losing interrupts. And AFAIK, existing guests do not use the same message for different sources. Just like we handle shared edge-triggered line-based IRQs, shared MSIs are in principle feasible as well. Jan For this case it seems quite harmless to use multiple routes for identical sources. Yes it would use more resources but it never happens in practice. So what Avi said originally is still true. -- MST
Re: [Qemu-devel] [Qemu-ppc] [PATCH] PPC: Fail configure when libfdt is not available
On Tue, Oct 18, 2011 at 10:55:01AM +0200, Andreas Färber wrote: On 18.10.2011 02:18, Alexander Graf wrote: We have several targets in the PPC tree now that basically require libfdt to function properly, namely the pseries and the e500 targets. This dependency will rather increase than decrease in the future, so I want to make sure that people building shiny new 1.0 actually have libfdt installed to get rid of a few ifdefs in the code. Warning: This patch will likely make configure fail for people who don't select their own --target-list, but don't have libfdt development packages installed. However, we really need this new dependency to move on. Signed-off-by: Alexander Graf ag...@suse.de openSUSE 12.1 has libfdt1-devel, but you should set up a submodule and working build rules for Darwin, Haiku, etc. `make` doesn't fully work so I used custom scripts to build the right parts and to manually install the resulting binary and headers. If there are build problems with libfdt on any platform let me know about them. I would like it to build cleanly as widely as possible, but I don't have that great a diversity of build environments, so I have to rely on bug reports. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [Qemu-devel] [RFC][PATCH 11/45] msi: Factor out delivery hook
On 2011-10-18 14:05, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 09:15:47PM +0200, Jan Kiszka wrote: On 2011-10-17 15:43, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 11:27:45AM +0200, Jan Kiszka wrote: diff --git a/hw/msi.c b/hw/msi.c index 3c7ebc3..9055155 100644 --- a/hw/msi.c +++ b/hw/msi.c @@ -40,6 +40,14 @@ /* Flag for interrupt controller to declare MSI/MSI-X support */ bool msi_supported; +static void msi_unsupported(MSIMessage *msg) +{ +/* If we get here, the board failed to register a delivery handler. */ +abort(); +} + +void (*msi_deliver)(MSIMessage *msg) = msi_unsupported; + How about we set this to NULL, and check it instead of the bool flag? Yeah. I will introduce bool msi_supported(void) { return msi_deliver != msi_unsupported; } OK? Jan Looks a bit weird ... NULL is a pretty standard value for an invalid pointer, isn't it? Save us the runtime check and is equally expressive and readable IMHO. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [RFC][PATCH 12/45] msi: Introduce MSIRoutingCache
On 2011-10-18 14:17, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 09:19:34PM +0200, Jan Kiszka wrote: On 2011-10-17 17:37, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 01:19:56PM +0200, Jan Kiszka wrote: On 2011-10-17 13:06, Avi Kivity wrote: On 10/17/2011 11:27 AM, Jan Kiszka wrote: This cache will help us implement KVM in-kernel irqchip support without spreading hooks all over the place. KVM requires us to register it first and then deliver it by raising a pseudo IRQ line returned on registration. While this could be changed for QEMU-originated MSI messages by adding direct MSI injection, we will still need this translation for irqfd-originated messages. The MSIRoutingCache will allow us to track those registrations and update them lazily before the actual delivery. This avoids having to track MSI vectors at device level (like qemu-kvm currently does). +typedef enum { +MSI_ROUTE_NONE = 0, +MSI_ROUTE_STATIC, +} MSIRouteType; + +struct MSIRoutingCache { +MSIMessage msg; +MSIRouteType type; +int kvm_gsi; +int kvm_irqfd; +}; + diff --git a/hw/pci.h b/hw/pci.h index 329ab32..5b5d2fd 100644 --- a/hw/pci.h +++ b/hw/pci.h @@ -197,6 +197,10 @@ struct PCIDevice { MemoryRegion rom; uint32_t rom_bar; +/* MSI routing caches */ +MSIRoutingCache *msi_cache; +MSIRoutingCache *msix_cache; + /* MSI entries */ int msi_entries_nr; struct KVMMsiMessage *msi_irq_entries; IMO this needlessly leaks kvm information into core qemu. The cache should be completely hidden in kvm code. I think msi_deliver() can hide the use of the cache completely. For pre-registered events like kvm's irqfd, you can use something like qemu_irq qemu_msi_irq(MSIMessage msg) for non-kvm, it simply returns a qemu_irq that triggers a stl_phys(); for kvm, it allocates an irqfd and a permanent entry in the cache and returns a qemu_irq that triggers the irqfd. See my previous mail: you want to track the life-cycle of an MSI source to avoid generating routes for identical sources. 
A message is not a source. Two identical messages can come from different sources. Since MSI messages are edge triggered, I don't see how this would work without losing interrupts. And AFAIK, existing guests do not use the same message for different sources. Just like we handle shared edge-triggered line-based IRQs, shared MSIs are in principle feasible as well. Jan For this case it seems quite harmless to use multiple routes for identical sources. Unless we track the source (via the MSIRoutingCache abstraction), there can be no multiple routes. The core cannot differentiate between identical messages, thus will not create multiple routes. But that's actually a corner case, and we could probably live with it. The real question is if we want to search for MSI routes on each message delivery. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
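Jan's objection can be modeled concretely: a core that keys routes purely on message contents collapses two sources with identical messages onto one route, while per-source cache objects keep them apart. The following Python sketch only models that allocation behaviour; all class and function names are made up and this is not QEMU's KVM routing code.

```python
# Sketch: routes keyed by message contents vs. per-source cache objects.
# Hypothetical names; models the discussion, not QEMU internals.

class MessageKeyedRouter:
    """Allocates one route (GSI) per distinct (addr, data) message."""
    def __init__(self):
        self.routes = {}
        self.next_gsi = 0

    def route_for(self, addr, data):
        key = (addr, data)
        if key not in self.routes:
            self.routes[key] = self.next_gsi
            self.next_gsi += 1
        return self.routes[key]

class SourceKeyedRouter:
    """Hands out a fresh route per source, regardless of message value."""
    def __init__(self):
        self.next_gsi = 0

    def alloc_gsi(self):
        gsi = self.next_gsi
        self.next_gsi += 1
        return gsi

class SourceCache:
    """Per-source routing cache: each source owns its route slot."""
    def __init__(self, router):
        self.gsi = router.alloc_gsi()

# Two devices happen to program identical MSI messages.
mk = MessageKeyedRouter()
gsi_a = mk.route_for(0xFEE00000, 0x4041)
gsi_b = mk.route_for(0xFEE00000, 0x4041)
print(gsi_a == gsi_b)  # True: message-keyed core collapses them

sk = SourceKeyedRouter()
cache_a = SourceCache(sk)
cache_b = SourceCache(sk)
print(cache_a.gsi != cache_b.gsi)  # True: per-source caches stay distinct
```

The trade-off Jan raises follows directly: without the per-source object, shared-message sources cannot get separate routes, but with message keying alone every delivery needs a table search.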
Re: [Qemu-devel] [PATCH 3/4] qapi-types.py: Fail gracefully if out dir is not specified
On Mon, 17 Oct 2011 16:27:34 -0500 Michael Roth mdr...@linux.vnet.ibm.com wrote: On Mon, 17 Oct 2011 13:29:36 -0200, Luiz Capitulino lcapitul...@redhat.com wrote: Otherwise we get: Traceback (most recent call last): File "./scripts/qapi-types.py", line 183, in <module> os.makedirs(output_dir) File "/usr/lib64/python2.7/os.py", line 157, in makedirs mkdir(name, mode) OSError: [Errno 2] No such file or directory: '' Signed-off-by: Luiz Capitulino lcapitul...@redhat.com --- scripts/qapi-types.py |6 +- 1 files changed, 5 insertions(+), 1 deletions(-) diff --git a/scripts/qapi-types.py b/scripts/qapi-types.py index 8df4b72..4a2ddc4 100644 --- a/scripts/qapi-types.py +++ b/scripts/qapi-types.py @@ -178,7 +178,11 @@ if __name__ == '__main__': prefix = a elif o in ("-o", "--output-dir"): output_dir = a - + +if not output_dir: +sys.stdout.write("output directory was not specified\n") +sys.exit(1) + We should probably just set output_dir to os.getcwd() here. Fair enough. c_file = os.path.join(output_dir, prefix + c_file) h_file = os.path.join(output_dir, prefix + h_file) -- 1.7.7.rc3
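Michael Roth's suggestion — defaulting to the current working directory rather than exiting — would look roughly like this. This is a sketch of the reviewed behaviour, not the code that was eventually committed; the helper name is made up.

```python
import os

def resolve_output_dir(output_dir):
    """Fall back to the current working directory when -o was not given.

    Sketch of the fallback suggested in review for qapi-types.py:
    os.makedirs() then receives a valid path instead of ''.
    """
    if not output_dir:
        output_dir = os.getcwd()
    return output_dir
```

With this, `os.path.join(output_dir, prefix + c_file)` works whether or not the user passed `--output-dir`.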
Re: [Qemu-devel] [RFC][PATCH 28/45] qemu-kvm: msix: Drop tracking of used vectors
On Tue, Oct 18, 2011 at 02:08:59PM +0200, Jan Kiszka wrote: On 2011-10-18 13:58, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 09:28:12PM +0200, Jan Kiszka wrote: On 2011-10-17 17:48, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 11:28:02AM +0200, Jan Kiszka wrote: This optimization was only required to keep KVM route usage low. Now that we solve that problem via lazy updates, we can drop the field. We still need interfaces to clear pending vectors, though (and we have to make use of them more broadly - but that's unrelated to this patch). Signed-off-by: Jan Kiszka jan.kis...@siemens.com Lazy updates should be an implementation detail. IMO resource tracking of vectors makes sense as an API. Making devices deal with pending vectors as a concept, IMO, does not. There is really no use for tracking the vector lifecycle once we have lazy updates (except for static routes). It's a way too invasive concept, and it's not needed for anything but KVM. I think it's needed. The PCI spec states that when the device does not need an interrupt anymore, it should clear the pending bit. That should be done explicitly if it is required outside existing clearing points. We already have that service, it's called msix_clear_vector. We do? I don't seem to see it upstream... That alone does not justify msix_vector_use and all the state and logic behind it IMHO. To me it looks like an abstraction that solves both this problem and the resource allocation problem. Resources are actually limited BTW, this is not just a KVM thing. qemu.git currently lets guests decide what to do with them, but it might turn out to be beneficial to warn the management application that it is shooting itself in the foot. The use/unuse is IMO a decent API for this, because it uses a familiar resource tracking concept. Exposing this knowledge of msix to devices seems like a worse API. 
If you want an example, check http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/70915 and compare it to the changes done to hpet in this series. Jan This seems to be a general argument that lazy updates are good? I have no real problem with them, besides the fact that we need an API to reserve space in the routing table so that device setup can fail upfront. That's not possible, even with used vectors, as devices change their vector usage depending on how the guest configures the devices. If you (pre-)allocate all possible vectors, you may run out of resources earlier than needed actually. This really depends, but please do look at how with virtio we report resource shortage to guest and let it fall back to level interrupts. You seem to remove that capability. I actually would not mind preallocating everything upfront which is much easier. But with your patch we get a silent failure or a drastic slowdown which is much more painful IMO. That's also why we do those data == 0 checks to skip used but unconfigured vectors. Jan These checks work more or less by luck BTW. It's a hack which I hope lazy allocation will replace. -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
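The setup-time reservation MST argues for can be sketched abstractly: if routing-table slots are reserved when the device is configured, shortage surfaces as a setup failure the guest can react to (e.g. by falling back to legacy interrupts), instead of a silent failure at delivery time. This is a hypothetical model of that behaviour, not QEMU or virtio code.

```python
# Sketch: reserve routing-table capacity at device setup so shortage is
# detected upfront. Hypothetical names; models the review argument only.

class RouteTable:
    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def reserve(self, n):
        """Try to claim n route slots; report failure at setup time."""
        if self.used + n > self.capacity:
            return False   # caller can fall back to legacy interrupts
        self.used += n
        return True

    def release(self, n):
        self.used -= n

rt = RouteTable(capacity=4)
print(rt.reserve(3))  # True: first device gets its MSI-X vectors
print(rt.reserve(2))  # False: shortage reported at setup, not delivery
```

Jan's counterpoint maps onto the same model: reserving the maximum possible vectors per device exhausts `capacity` earlier than actual guest usage would.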
Re: [Qemu-devel] [PATCH] RUN_STATE_POSTMIGRATE can be transited from RUN_STATE_PAUSED
On Tue, 18 Oct 2011 12:47:27 +0800 Wen Congyang we...@cn.fujitsu.com wrote: The user can run the stop command before migration. So the guest's run state can transition from RUN_STATE_PAUSED to RUN_STATE_POSTMIGRATE. Signed-off-by: Wen Congyang we...@cn.fujitsu.com I've already submitted a pull request with the exact same patch included. Thanks. --- vl.c |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/vl.c b/vl.c index 2dce3ae..efae19c 100644 --- a/vl.c +++ b/vl.c @@ -341,6 +341,7 @@ static const RunStateTransition runstate_transitions_def[] = { { RUN_STATE_IO_ERROR, RUN_STATE_RUNNING }, { RUN_STATE_PAUSED, RUN_STATE_RUNNING }, +{ RUN_STATE_PAUSED, RUN_STATE_POSTMIGRATE }, { RUN_STATE_POSTMIGRATE, RUN_STATE_RUNNING },
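The patched table is a whitelist of allowed (from, to) pairs; a transition not listed there is rejected. A minimal Python model of that lookup (state names mirror vl.c, but the structure here is illustrative only):

```python
# Sketch of the runstate transition whitelist behind vl.c's
# runstate_transitions_def[]; not QEMU code.
RUNSTATE_TRANSITIONS = {
    ("io-error", "running"),
    ("paused", "running"),
    ("paused", "postmigrate"),   # the pair this patch adds
    ("postmigrate", "running"),
}

def runstate_transition_valid(cur, new):
    return (cur, new) in RUNSTATE_TRANSITIONS

# Before the patch, stopping the guest and then migrating would hit a
# paused -> postmigrate transition absent from the whitelist.
print(runstate_transition_valid("paused", "postmigrate"))  # True
```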
Re: [Qemu-devel] [RFC][PATCH 11/45] msi: Factor out delivery hook
On Tue, Oct 18, 2011 at 02:23:29PM +0200, Jan Kiszka wrote: On 2011-10-18 14:05, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 09:15:47PM +0200, Jan Kiszka wrote: On 2011-10-17 15:43, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 11:27:45AM +0200, Jan Kiszka wrote: diff --git a/hw/msi.c b/hw/msi.c index 3c7ebc3..9055155 100644 --- a/hw/msi.c +++ b/hw/msi.c @@ -40,6 +40,14 @@ /* Flag for interrupt controller to declare MSI/MSI-X support */ bool msi_supported; +static void msi_unsupported(MSIMessage *msg) +{ +/* If we get here, the board failed to register a delivery handler. */ +abort(); +} + +void (*msi_deliver)(MSIMessage *msg) = msi_unsupported; + How about we set this to NULL, and check it instead of the bool flag? Yeah. I will introduce bool msi_supported(void) { return msi_deliver != msi_unsupported; } OK? Jan Looks a bit weird ... NULL is a pretty standard value for an invalid pointer, isn't it? Save us the runtime check and is equally expressive and readable IMHO. Jan Do we need to check? NULL dereference leads to a crash just as surely... -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [RFC][PATCH 28/45] qemu-kvm: msix: Drop tracking of used vectors
On 2011-10-18 14:33, Michael S. Tsirkin wrote: On Tue, Oct 18, 2011 at 02:08:59PM +0200, Jan Kiszka wrote: On 2011-10-18 13:58, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 09:28:12PM +0200, Jan Kiszka wrote: On 2011-10-17 17:48, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 11:28:02AM +0200, Jan Kiszka wrote: This optimization was only required to keep KVM route usage low. Now that we solve that problem via lazy updates, we can drop the field. We still need interfaces to clear pending vectors, though (and we have to make use of them more broadly - but that's unrelated to this patch). Signed-off-by: Jan Kiszka jan.kis...@siemens.com Lazy updates should be an implementation detail. IMO resource tracking of vectors makes sense as an API. Making devices deal with pending vectors as a concept, IMO, does not. There is really no use for tracking the vector lifecycle once we have lazy updates (except for static routes). It's a way too invasive concept, and it's not needed for anything but KVM. I think it's needed. The PCI spec states that when the device does not need an interrupt anymore, it should clear the pending bit. That should be done explicitly if it is required outside existing clearing points. We already have that service, it's called msix_clear_vector. We do? I don't seem to see it upstream... True. From the device's POV, MSI-X (and also MSI!) vectors are actually level-triggered. So we should communicate the level to the MSI core and not just the edge. Needs more fixing. That alone does not justify msix_vector_use and all the state and logic behind it IMHO. To me it looks like an abstraction that solves both this problem and the resource allocation problem. Resources are actually limited BTW, this is not just a KVM thing. 
The use/unuse is IMO a decent API for this, because it uses a familiar resource tracking concept. Exposing this knowledge of msix to devices seems like a worse API. If you want an example, check http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/70915 and compare it to the changes done to hpet in this series. Jan This seems to be a general argument that lazy updates are good? I have no real problem with them, besides the fact that we need an API to reserve space in the routing table so that device setup can fail upfront. That's not possible, even with used vectors, as devices change their vector usage depending on how the guest configures the devices. If you (pre-)allocate all possible vectors, you may run out of resources earlier than needed actually. This really depends, but please do look at how with virtio we report resource shortage to guest and let it fall back to level interrupts. You seem to remove that capability. To my understanding, virtio will be the exception as no other device will have a chance to react to resource shortage while sending(!) an MSI message. I actually would not mind preallocating everything upfront which is much easier. But with your patch we get a silent failure or a drastic slowdown which is much more painful IMO. Again: did we already see that limit? And where does it come from if not from KVM? That's also why we do those data == 0 checks to skip used but unconfigured vectors. Jan These checks work more or less by luck BTW. It's a hack which I hope lazy allocation will replace. The check is still valid (for x86) when we have to use static routes (device assignment, vhost). For lazy updates, it's obsolete, that's true. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [RFC][PATCH 22/45] qemu-kvm: msix: Fire mask notifier on global mask changes
On Mon, Oct 17, 2011 at 09:00:12PM +0200, Jan Kiszka wrote: On 2011-10-17 14:16, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 11:27:56AM +0200, Jan Kiszka wrote: Also invoke the mask notifier if the global MSI-X mask is modified. For this purpose, we push the notifier call from the per-vector mask update to the central msix_handle_mask_update. Signed-off-by: Jan Kiszka jan.kis...@siemens.com This is a bugfix, isn't it? If yes it should be separated and put on -stable. Yep, will pull this to the front. I'll apply this to qemu.git, no need to mix bugfixes with features ... --- hw/msix.c | 16 +--- 1 files changed, 9 insertions(+), 7 deletions(-) diff --git a/hw/msix.c b/hw/msix.c index 739b56f..247b255 100644 --- a/hw/msix.c +++ b/hw/msix.c @@ -221,7 +221,15 @@ static bool msix_is_masked(PCIDevice *dev, int vector) static void msix_handle_mask_update(PCIDevice *dev, int vector) { -if (!msix_is_masked(dev, vector) && msix_is_pending(dev, vector)) { +bool masked = msix_is_masked(dev, vector); +int ret; + +if (dev->msix_mask_notifier) { +ret = dev->msix_mask_notifier(dev, vector, + msix_is_masked(dev, vector)); Use 'masked' value here as well? Yes. Jan
Re: [Qemu-devel] [RFC][PATCH 11/45] msi: Factor out delivery hook
On 2011-10-18 14:38, Michael S. Tsirkin wrote: On Tue, Oct 18, 2011 at 02:23:29PM +0200, Jan Kiszka wrote: On 2011-10-18 14:05, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 09:15:47PM +0200, Jan Kiszka wrote: On 2011-10-17 15:43, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 11:27:45AM +0200, Jan Kiszka wrote: diff --git a/hw/msi.c b/hw/msi.c index 3c7ebc3..9055155 100644 --- a/hw/msi.c +++ b/hw/msi.c @@ -40,6 +40,14 @@ /* Flag for interrupt controller to declare MSI/MSI-X support */ bool msi_supported; +static void msi_unsupported(MSIMessage *msg) +{ +/* If we get here, the board failed to register a delivery handler. */ +abort(); +} + +void (*msi_deliver)(MSIMessage *msg) = msi_unsupported; + How about we set this to NULL, and check it instead of the bool flag? Yeah. I will introduce bool msi_supported(void) { return msi_deliver != msi_unsupported; } OK? Jan Looks a bit weird ... NULL is a pretty standard value for an invalid pointer, isn't it? Save us the runtime check and is equally expressive and readable IMHO. Jan Do we need to check? NULL dereference leads to a crash just as surely... There is no NULL state of msi_deliver. A) it would execute msi_unsupported if all goes wrong (which will abort) and B) msi_supported() is supposed to protect us in the absence of bugs from ever executing msi_deliver() if it points to msi_unsupported. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
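The sentinel-handler pattern Jan defends — a default function that aborts, plus a support predicate that compares against it, so the pointer never has a NULL state — can be sketched outside C. Names mirror the patch (`msi_deliver`, `msi_unsupported`, `msi_supported`), but this Python model is illustrative only, not QEMU code.

```python
# Sketch of the sentinel-handler pattern from the msi_deliver discussion.

def msi_unsupported(msg):
    # Reached only if the board never registered a delivery handler;
    # the C version calls abort() here.
    raise RuntimeError("MSI delivery handler not registered")

msi_deliver = msi_unsupported   # always callable, never None

def msi_supported():
    # Support is expressed as "not the sentinel", so callers get a
    # clean predicate and buggy paths still fail loudly via the sentinel.
    return msi_deliver is not msi_unsupported

def msi_register_handler(handler):
    global msi_deliver
    msi_deliver = handler

print(msi_supported())          # False before a board registers
delivered = []
msi_register_handler(delivered.append)
print(msi_supported())          # True afterwards
msi_deliver("msg0")
print(delivered == ["msg0"])    # True
```

The NULL alternative MST proposes would drop the predicate indirection but turn an unregistered delivery into a plain null-pointer crash, which (as malc notes) is not a guaranteed trap on every platform.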
Re: [Qemu-devel] [RFC][PATCH 11/45] msi: Factor out delivery hook
On Tue, 18 Oct 2011, Michael S. Tsirkin wrote: On Tue, Oct 18, 2011 at 02:23:29PM +0200, Jan Kiszka wrote: On 2011-10-18 14:05, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 09:15:47PM +0200, Jan Kiszka wrote: On 2011-10-17 15:43, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 11:27:45AM +0200, Jan Kiszka wrote: diff --git a/hw/msi.c b/hw/msi.c index 3c7ebc3..9055155 100644 --- a/hw/msi.c +++ b/hw/msi.c @@ -40,6 +40,14 @@ /* Flag for interrupt controller to declare MSI/MSI-X support */ bool msi_supported; +static void msi_unsupported(MSIMessage *msg) +{ +/* If we get here, the board failed to register a delivery handler. */ +abort(); +} + +void (*msi_deliver)(MSIMessage *msg) = msi_unsupported; + How about we set this to NULL, and check it instead of the bool flag? Yeah. I will introduce bool msi_supported(void) { return msi_deliver != msi_unsupported; } OK? Jan Looks a bit weird ... NULL is a pretty standard value for an invalid pointer, isn't it? Save us the runtime check and is equally expressive and readable IMHO. Jan Do we need to check? NULL dereference leads to a crash just as surely... Not universally (not on AIX for instance (read)). -- mailto:av1...@comtv.ru
Re: [Qemu-devel] [RFC][PATCH 22/45] qemu-kvm: msix: Fire mask notifier on global mask changes
On 2011-10-18 14:40, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 09:00:12PM +0200, Jan Kiszka wrote: On 2011-10-17 14:16, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 11:27:56AM +0200, Jan Kiszka wrote: Also invoke the mask notifier if the global MSI-X mask is modified. For this purpose, we push the notifier call from the per-vector mask update to the central msix_handle_mask_update. Signed-off-by: Jan Kiszka jan.kis...@siemens.com This is a bugfix, isn't it? If yes it should be separated and put on -stable. Yep, will pull this to the front. I'll apply this to qemu.git, no need to mix bugfixes with features ... This doesn't apply to qemu.git (there are no notifiers upstream). Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [RFC][PATCH 28/45] qemu-kvm: msix: Drop tracking of used vectors
On Tue, Oct 18, 2011 at 02:38:36PM +0200, Jan Kiszka wrote: On 2011-10-18 14:33, Michael S. Tsirkin wrote: On Tue, Oct 18, 2011 at 02:08:59PM +0200, Jan Kiszka wrote: On 2011-10-18 13:58, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 09:28:12PM +0200, Jan Kiszka wrote: On 2011-10-17 17:48, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 11:28:02AM +0200, Jan Kiszka wrote: This optimization was only required to keep KVM route usage low. Now that we solve that problem via lazy updates, we can drop the field. We still need interfaces to clear pending vectors, though (and we have to make use of them more broadly - but that's unrelated to this patch). Signed-off-by: Jan Kiszka jan.kis...@siemens.com Lazy updates should be an implementation detail. IMO resource tracking of vectors makes sense as an API. Making devices deal with pending vectors as a concept, IMO, does not. There is really no use for tracking the vector lifecycle once we have lazy updates (except for static routes). It's a way too invasive concept, and it's not needed for anything but KVM. I think it's needed. The PCI spec states that when the device does not need an interrupt anymore, it should clear the pending bit. That should be done explicitly if it is required outside existing clearing points. We already have that service, it's called msix_clear_vector. We do? I don't seem to see it upstream... True. From the device's POV, MSI-X (and also MSI!) vectors are actually level-triggered. This definitely takes adjusting to. So we should communicate the level to the MSI core and not just the edge. Needs more fixing That alone does not justify msix_vector_use and all the state and logic behind it IMHO. To me it looks like an abstraction that solves both this problem and the resource allocation problem. Resources are actually limited BTW, this is not just a KVM thing. 
qemu.git currently lets guests decide what to do with them, but it might turn out to be beneficial to warn the management application that it is shooting itself in the foot. The use/unuse is IMO a decent API for this, because it uses a familiar resource tracking concept. Exposing this knowledge of msix to devices seems like a worse API. If you want an example, check http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/70915 and compare it to the changes done to hpet in this series. Jan This seems to be a general argument that lazy updates are good? I have no real problem with them, besides the fact that we need an API to reserve space in the routing table so that device setup can fail upfront. That's not possible, even with used vectors, as devices change their vector usage depending on how the guest configures the devices. If you (pre-)allocate all possible vectors, you may run out of resources earlier than needed actually. This really depends, but please do look at how with virtio we report resource shortage to guest and let it fall back to level interrupts. You seem to remove that capability. To my understanding, virtio will be the exception as no other device will have a chance to react to resource shortage while sending(!) an MSI message. Hmm, are you familiar with that spec? This is not what virtio does, resource shortage is detected during setup. This is exactly the problem with lazy registration as you don't allocate until it's too late. I actually would not mind preallocating everything upfront which is much easier. But with your patch we get a silent failure or a drastic slowdown which is much more painful IMO. Again: did we already see that limit? And where does it come from if not from KVM? It's a hardware limitation of Intel APICs. The interrupt vector is encoded in an 8-bit field of the MSI data. So you can have at most 256 of these. That's also why we do those data == 0 checks to skip used but unconfigured vectors. 
Jan These checks work more or less by luck BTW. It's a hack which I hope lazy allocation will replace. The check is still valid (for x86) when we have to use static routes (device assignment, vhost). It's not valid at all - we are just lucky that Linux and Windows guests seem to zero out the vector when it's not in use. They do not have to do that. For lazy updates, it's obsolete, that's true. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
On 2011-10-18 12:42, Christoph Hellwig wrote: On Tue, Oct 18, 2011 at 12:41:10PM +0200, Paolo Bonzini wrote: On 10/18/2011 10:39 AM, Jan Kiszka wrote: Yeah, I see. Could also be solved via gdb scripts, but crash is already there. [ BTW, crash is for the dead. But having those features in form of gdb scripts would also allow live system analysis. Looks like it's worth obsoleting crash on the long run. ] Crash can also do live system analysis. :) crash in fact is gdb - with a lot of additional commands. Is there a trick to make it work against a gdbserver? Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [RFC][PATCH 11/45] msi: Factor out delivery hook
On Tue, Oct 18, 2011 at 04:44:47PM +0400, malc wrote: On Tue, 18 Oct 2011, Michael S. Tsirkin wrote: On Tue, Oct 18, 2011 at 02:23:29PM +0200, Jan Kiszka wrote: On 2011-10-18 14:05, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 09:15:47PM +0200, Jan Kiszka wrote: On 2011-10-17 15:43, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 11:27:45AM +0200, Jan Kiszka wrote: diff --git a/hw/msi.c b/hw/msi.c index 3c7ebc3..9055155 100644 --- a/hw/msi.c +++ b/hw/msi.c @@ -40,6 +40,14 @@ /* Flag for interrupt controller to declare MSI/MSI-X support */ bool msi_supported; +static void msi_unsupported(MSIMessage *msg) +{ +/* If we get here, the board failed to register a delivery handler. */ +abort(); +} + +void (*msi_deliver)(MSIMessage *msg) = msi_unsupported; + How about we set this to NULL, and check it instead of the bool flag? Yeah. I will introduce bool msi_supported(void) { return msi_deliver != msi_unsupported; } OK? Jan Looks a bit weird ... NULL is a pretty standard value for an invalid pointer, isn't it? Save us the runtime check and is equally expressive and readable IMHO. Jan Do we need to check? NULL dereference leads to a crash just as surely... Not universally (not on AIX for instance (read)). This is a NULL function call though :) Anyway, this was just nitpicking. Do it any way you like. -- mailto:av1...@comtv.ru
Re: [Qemu-devel] [RFC][PATCH 28/45] qemu-kvm: msix: Drop tracking of used vectors
On 2011-10-17 17:48, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 11:28:02AM +0200, Jan Kiszka wrote: This optimization was only required to keep KVM route usage low. Now that we solve that problem via lazy updates, we can drop the field. We still need interfaces to clear pending vectors, though (and we have to make use of them more broadly - but that's unrelated to this patch). Signed-off-by: Jan Kiszka jan.kis...@siemens.com Lazy updates should be an implementation detail. IMO resource tracking of vectors makes sense as an API. Making devices deal with pending vectors as a concept, IMO, does not. There is really no use for tracking the vector lifecycle once we have lazy updates (except for static routes). It's a way too invasive concept, and it's not needed for anything but KVM. If you want an example, check http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/70915 and compare it to the changes done to hpet in this series. Jan
Re: [Qemu-devel] [RFC][PATCH 22/45] qemu-kvm: msix: Fire mask notifier on global mask changes
On Tue, Oct 18, 2011 at 02:45:58PM +0200, Jan Kiszka wrote: On 2011-10-18 14:40, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 09:00:12PM +0200, Jan Kiszka wrote: On 2011-10-17 14:16, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 11:27:56AM +0200, Jan Kiszka wrote: Also invoke the mask notifier if the global MSI-X mask is modified. For this purpose, we push the notifier call from the per-vector mask update to the central msix_handle_mask_update. Signed-off-by: Jan Kiszka jan.kis...@siemens.com This is a bugfix, isn't it? If yes it should be separated and put on -stable. Yep, will pull this to the front. I'll apply this to qemu.git, no need to mix bugfixes with features ... This doesn't apply to qemu.git (there are no notifiers upstream). Jan Right, thanks for the reminder. Anyway, I'll take care of this, don't worry :) -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [RFC][PATCH 28/45] qemu-kvm: msix: Drop tracking of used vectors
On 2011-10-18 14:48, Michael S. Tsirkin wrote: To my understanding, virtio will be the exception as no other device will have a chance to react to resource shortage while sending(!) an MSI message. Hmm, are you familiar with that spec? Not by heart. This is not what virtio does, resource shortage is detected during setup. This is exactly the problem with lazy registration as you don't allocate until it's too late. When is that setup phase? Does it actually come after every change to an MSI vector? I doubt so. Thus virtio can only estimate the guest usage as well (a guest may or may not actually write a non-null data into a vector and unmask it). I actually would not mind preallocating everything upfront which is much easier. But with your patch we get a silent failure or a drastic slowdown which is much more painful IMO. Again: did we already see that limit? And where does it come from if not from KVM? It's a hardware limitation of Intel APICs. The interrupt vector is encoded in an 8-bit field of the MSI data. So you can have at most 256 of these. There should be no such limitation with pseudo GSIs we use for MSI injection. They end up as MSI messages again, so actually 256 (-reserved vectors) * number-of-cpus (on x86). That's also why we do those data == 0 checks to skip used but unconfigured vectors. 
Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
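The 256-vector ceiling discussed above follows from the architectural layout of an x86 MSI message: the vector occupies an 8-bit field in the data value, while the APIC destination ID sits in address bits 12-19 — hence Jan's per-CPU multiplication. A small decoding sketch of that standard layout (the function name is made up):

```python
def decode_x86_msi(addr, data):
    """Split an x86 MSI message into its architectural fields.

    Standard x86 layout: address bits 12-19 carry the APIC destination
    ID, data bits 0-7 the interrupt vector (hence at most 256 vectors
    per destination), data bits 8-10 the delivery mode.
    """
    return {
        "dest_id": (addr >> 12) & 0xFF,
        "redir_hint": (addr >> 3) & 1,
        "dest_mode": (addr >> 2) & 1,
        "vector": data & 0xFF,
        "delivery_mode": (data >> 8) & 0x7,
    }

msg = decode_x86_msi(0xFEE01000, 0x4041)
print(msg["dest_id"])  # 1
print(msg["vector"])   # 65 (0x41)
```

This is also why a `data == 0` heuristic is only a heuristic: vector 0 in the data field is architecturally reserved on x86, but nothing forces a guest to leave unused table entries zeroed.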
Re: [Qemu-devel] GPLv3 troubles
On 10/18/2011 03:01 AM, Markus Armbruster wrote: Avi Kivity a...@redhat.com writes: On 10/17/2011 07:46 PM, Stefan Weil wrote: So let's start. For any of my contributions, I agree to GPL v2 or later. Later generations should have the possibility to replace GPL v2 by something which matches future requirements. I expect Red Hat contributions can be relicensed to v2+ as well. Plenty of precedence for that. Okay, let's get serious about it. I set up the following wiki page for coordination: http://wiki.qemu.org/Relicensing Please get the appropriate approval at Red Hat, and follow the instructions on the wiki. I'm in the process of getting approval at IBM. Between RH, IBM, and Blue, there's a pretty large chunk of files that can be relicensed immediately. It's probably best to tackle this incrementally. Regards, Anthony Liguori
[Qemu-devel] In soft freeze, hard freeze begins on November 1st
Hi, Just a friendly reminder that we are now in the soft feature freeze. Maintainers should exercise discretion about committing any new feature that would be extremely disruptive at this point. The remaining schedule follows below. November 1st will be the hard freeze of master. Please note that unlike previous releases, we will not be forking off the 1.1 development branch until after the 1.0 release. The goal is to make sure that we all focus on making 1.0 our highest quality release to date.
| 2011-11-01 | Freeze master |
| 2011-11-07 | Tag qemu-1.0-rc1 |
| 2011-11-14 | Tag qemu-1.0-rc2 |
| 2011-11-21 | Tag qemu-1.0-rc3 |
| 2011-11-28 | Tag qemu-1.0-rc4 |
| 2011-12-01 | Tag qemu-1.0 |
Regards, Anthony Liguori
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
On Tue, Oct 18, 2011 at 02:47:39PM +0200, Jan Kiszka wrote: Crash can also do live system analysis. :) crash in fact is gdb - with a lot of additional commands. Is there a trick to make it work against a gdbserver? No idea. I just noticed that crash was built around a gdb source tree and additional files when I had to hack it for a new kernel version a while ago. The optimum would be if all the crash changes were merged back into upstream gdb, but I'm not sure the gdb community would like to see all that very Linux-kernel-specific code (especially given that it's very version dependent).
Re: [Qemu-devel] [RFC][PATCH 22/45] qemu-kvm: msix: Fire mask notifier on global mask changes
On 2011-10-17 14:16, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 11:27:56AM +0200, Jan Kiszka wrote: Also invoke the mask notifier if the global MSI-X mask is modified. For this purpose, we push the notifier call from the per-vector mask update to the central msix_handle_mask_update. Signed-off-by: Jan Kiszka jan.kis...@siemens.com This is a bugfix, isn't it? If yes it should be separated and put on -stable. Yep, will pull this to the front. --- hw/msix.c | 16 +--- 1 files changed, 9 insertions(+), 7 deletions(-) diff --git a/hw/msix.c b/hw/msix.c index 739b56f..247b255 100644 --- a/hw/msix.c +++ b/hw/msix.c @@ -221,7 +221,15 @@ static bool msix_is_masked(PCIDevice *dev, int vector) static void msix_handle_mask_update(PCIDevice *dev, int vector) { -if (!msix_is_masked(dev, vector) && msix_is_pending(dev, vector)) { +bool masked = msix_is_masked(dev, vector); +int ret; + +if (dev->msix_mask_notifier) { +ret = dev->msix_mask_notifier(dev, vector, + msix_is_masked(dev, vector)); Use 'masked' value here as well? Yes. Jan signature.asc Description: OpenPGP digital signature
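For readers following the hunk above, the shape of the handler with the reviewer's comment applied (reusing the cached `masked` value) can be sketched as a self-contained toy. The `PCIDevice` fields and helper bodies here are simplified stand-ins, not QEMU's actual definitions:

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for QEMU's PCIDevice and mask-notifier hook. */
typedef struct PCIDevice PCIDevice;
typedef int (*msix_mask_notifier_fn)(PCIDevice *dev, int vector, bool masked);

struct PCIDevice {
    bool vector_masked[8];              /* per-vector mask bits */
    bool vector_pending[8];             /* pending bits (PBA) */
    msix_mask_notifier_fn msix_mask_notifier; /* may be NULL */
};

static bool msix_is_masked(PCIDevice *dev, int vector)
{
    return dev->vector_masked[vector];
}

static bool msix_is_pending(PCIDevice *dev, int vector)
{
    return dev->vector_pending[vector];
}

static void msix_notify(PCIDevice *dev, int vector)
{
    dev->vector_pending[vector] = false; /* deliver and clear pending */
}

/* Compute 'masked' once and reuse it both for the notifier call and
 * the pending-delivery check, per the review comment. */
static void msix_handle_mask_update(PCIDevice *dev, int vector)
{
    bool masked = msix_is_masked(dev, vector);

    if (dev->msix_mask_notifier) {
        dev->msix_mask_notifier(dev, vector, masked);
    }
    if (!masked && msix_is_pending(dev, vector)) {
        msix_notify(dev, vector);
    }
}
```

A masked vector keeps its pending bit; unmasking it triggers delivery.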
Re: [Qemu-devel] [RFC][PATCH 42/45] msix: Introduce msix_init_simple
On 2011-10-17 16:28, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 01:27:31PM +0200, Jan Kiszka wrote: On 2011-10-17 13:22, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 11:28:16AM +0200, Jan Kiszka wrote: Device models are usually not interested in specifying MSI-X configuration details beyond the number of vectors to provide and the BAR number to use. Layout of an exclusively used BAR and its registration can also be handled centrally. This is the purpose of msix_init_simple. It provides handy services to the existing users. Future users like device assignment may require more detailed setup specification. For them we will (re-)introduce msix_init with the full list of configuration options (in contrast to the current code). Signed-off-by: Jan Kiszka jan.kis...@siemens.com Well, this seems a bit of a code churn then, doesn't it? We are also discussing using memory BAR for virtio-pci for other stuff besides MSI-X, so the last user of the _simple variant will be ivshmem then? We will surely see more MSI-X users over time. Not sure if they all mix their MSI-X BARs with other stuff. But e.g. the e1000 variant I have here does not. So there should be users in the future. Jan Question is, how hard is it to pass in the BAR and the offset? That is trivial. But have a look at the final simple implementation. It also manages the container memory region for table and PBA and registers/unregisters that container as BAR. So there is measurable added-value. Jan
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
On 2011-10-18 15:08, Christoph Hellwig wrote: On Tue, Oct 18, 2011 at 02:47:39PM +0200, Jan Kiszka wrote: Crash can also do live system analysis. :) crash in fact is gdb - with a lot of additional commands. Is there a trick to make it work against a gdbserver? No idea. I just noticed that crash was built around a gdb source tree and additional files when I had to hack it for a new kernel version a while ago. The optimum would be if all the crash changes were merged back into upstream gdb, but I'm not sure the gdb community would like to see all that very linux kernel specific code (especially given that it's very version dependent) That's why I'm advocating gdb scripts, maybe maintained inside the kernel tree. But I need to look closer as well to see if vanilla gdb is already able to provide all registers used by crash so far (surely not all required for perfect debugging, but that's a long-known issue). Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH v8 1/4] block: add the block queue support
On Tue, Oct 18, 2011 at 5:56 PM, Kevin Wolf kw...@redhat.com wrote: Am 18.10.2011 11:29, schrieb Zhi Yong Wu: +void qemu_del_block_queue(BlockQueue *queue) +{ + BlockQueueAIOCB *request, *next; + + QTAILQ_FOREACH_SAFE(request, &queue->requests, entry, next) { + QTAILQ_REMOVE(&queue->requests, request, entry); + qemu_aio_release(request); + } + + g_free(queue); +} Can we be sure that no AIO requests are in flight that still use the now released AIOCB? Yeah, since qemu core code is serially performed, I think that when qemu_del_block_queue is performed, no requests are in flight. Right? Patch 2 has this code: +void bdrv_io_limits_disable(BlockDriverState *bs) +{ + bs->io_limits_enabled = false; + + if (bs->block_queue) { + qemu_block_queue_flush(bs->block_queue); + qemu_del_block_queue(bs->block_queue); + bs->block_queue = NULL; + } Does this mean that you can't disable I/O limits while the VM is running? No, you can, even while the VM is running. Okay, in general qemu_block_queue_flush() empties the queue so that there are no requests left that qemu_del_block_queue() could drop from the queue. So in the common case it doesn't even enter the FOREACH loop. I think that we should adopt !QTAILQ_EMPTY(&queue->requests), not QTAILQ_FOREACH_SAFE in qemu_del_block_queue(), right? I think QTAILQ_FOREACH_SAFE is fine. I think problems start when requests have failed or exceeded the limit again, then you have requests queued even after qemu_block_queue_flush(). You must be aware of this, otherwise the code in qemu_del_block_queue() wouldn't exist. But you can't release the ACBs without having called their callback, otherwise the caller would still assume that its ACB pointer is valid. Maybe calling the callback before releasing the ACB would be enough. Good, thanks. 
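Kevin's suggestion - complete each leftover request through its callback before releasing it - can be sketched roughly as follows. This is a hypothetical, self-contained rendering: a plain singly linked list stands in for QTAILQ, free() for qemu_aio_release(), and count_cancelled is an illustrative callback, not a QEMU function:

```c
#include <errno.h>
#include <stdlib.h>

typedef void BlockCompletionFunc(void *opaque, int ret);

/* Simplified stand-in for QEMU's BlockQueueAIOCB. */
typedef struct BlockQueueAIOCB {
    BlockCompletionFunc *cb;
    void *opaque;
    struct BlockQueueAIOCB *next;
} BlockQueueAIOCB;

typedef struct BlockQueue {
    BlockQueueAIOCB *requests;  /* head of the pending-request list */
} BlockQueue;

/* Fail every still-queued request through its callback before freeing
 * it, so no caller is left holding a dangling ACB pointer. */
static void qemu_del_block_queue(BlockQueue *queue)
{
    BlockQueueAIOCB *request = queue->requests;

    while (request) {
        BlockQueueAIOCB *next = request->next;
        request->cb(request->opaque, -ECANCELED);
        free(request);
        request = next;
    }
    free(queue);
}

/* Illustrative completion callback: counts cancelled requests. */
static void count_cancelled(void *opaque, int ret)
{
    if (ret == -ECANCELED) {
        ++*(int *)opaque;
    }
}
```

With this ordering, a caller whose request never ran still observes a completion (with an error code) before its ACB memory goes away.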
+ } + } + } + + queue->req_failed = true; + queue->flushing = false; +} + +bool qemu_block_queue_has_pending(BlockQueue *queue) +{ + return !queue->flushing && !QTAILQ_EMPTY(&queue->requests); +} Why doesn't the queue have pending requests in the middle of a flush operation? (That is, the flush hasn't completed yet) It is possible for the queue to have pending requests. If yes, how about? Sorry, can't parse this. I don't understand why the !queue->flushing part is correct. What about this? When bdrv_aio_readv/writev handle one request, they will determine if the block queue is not being flushed and isn't NULL; if yes, they assume that this request is one new request from the upper layer, so they won't determine if the I/O rate at runtime has exceeded the limits, but immediately insert it into the block queue. Hm, I think I understand what you're saying, but only after looking at patch 3. This is not really implementing a has_pending(), but has_pending_and_caller_wasnt_called_during_flush(). I think it would be Correct. better to handle the queue->flushing condition in the caller where its use is more obvious. OK. I will do as this. Kevin -- Regards, Zhi Yong Wu
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
On Tue, Oct 18, 2011 at 12:47:23PM +0200, Jan Kiszka wrote: On 2011-10-18 12:41, Paolo Bonzini wrote: On 10/18/2011 10:39 AM, Jan Kiszka wrote: Yeah, I see. Could also be solved via gdb scripts, but crash is already there. [ BTW, crash is for the dead. But having those features in form of gdb scripts would also allow live system analysis. Looks like it's worth obsoleting crash on the long run. ] Crash can also do live system analysis. :) Does it work via remote interface? And it's still the wrong way around: You have to append gdb to the majority of commands when doing debugging instead of issuing additional analysis commands from the gdb shell. That's understandable given the limited scripting power of old gdbs. But that's now history thanks to its python interface. Before this conversation heads any further into the realm of absurdity, let's get back to the key requirements for production systems: (A) No development tools can be installed. (B) The core file must be removed from the production system quickly and analyzed on another machine (indeed, often by experts from another company entirely). Anything that involves live debugging using GDB is simply not acceptable for production systems, whether you think that's right or not. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones virt-top is 'top' for virtual machines. Tiny program with many powerful monitoring features, net stats, disk stats, logging, etc. http://et.redhat.com/~rjones/virt-top
Re: [Qemu-devel] KVM call agenda for October 18
Juan Quintela quint...@redhat.com wrote: Hi Please send in any agenda items you are interested in covering. Thanks, Juan. No agenda - no call Happy hacking, Juan.
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
On 2011-10-18 15:30, Richard W.M. Jones wrote: On Tue, Oct 18, 2011 at 12:47:23PM +0200, Jan Kiszka wrote: On 2011-10-18 12:41, Paolo Bonzini wrote: On 10/18/2011 10:39 AM, Jan Kiszka wrote: Yeah, I see. Could also be solved via gdb scripts, but crash is already there. [ BTW, crash is for the dead. But having those features in form of gdb scripts would also allow live system analysis. Looks like it's worth obsoleting crash on the long run. ] Crash can also do live system analysis. :) Does it work via remote interface? And it's still the wrong way around: You have to append gdb to the majority of commands when doing debugging instead of issuing additional analysis commands from the gdb shell. That's understandable given the limited scripting power of old gdbs. But that's now history thanks to its python interface. Before this conversation heads any further into the realm of absurdity, let's get back to the key requirements for production systems: (A) No development tools can be installed. (B) The core file must be removed from the production system quickly and analyzed on another machine (indeed, often by experts from another company entirely). Anything that involves live debugging using GDB is simply not acceptable for production systems, whether you think that's right or not. My memory is usually that bad, but not in this case. :) This discussion should also work out a bigger picture than just your (valid) scenario. We should look at meeting those requirements but also see what would be a useful foundation for future extensions like having post-crash analysis and live debugging at a similar feature level with very similar or (much better) identical tools. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [RFC][PATCH 28/45] qemu-kvm: msix: Drop tracking of used vectors
On Tue, Oct 18, 2011 at 03:00:29PM +0200, Jan Kiszka wrote: On 2011-10-18 14:48, Michael S. Tsirkin wrote: To my understanding, virtio will be the exception as no other device will have a chance to react on resource shortage while sending(!) an MSI message. Hmm, are you familiar with that spec? Not by heart. This is not what virtio does, resource shortage is detected during setup. This is exactly the problem with lazy registration as you don't allocate until it's too late. When is that setup phase? Does it actually come after every change to an MSI vector? I doubt so. No. During setup, driver requests vectors from the OS, and then tells the device which vector should each VQ use. It then checks that the assignment was successful. If not, it retries with less vectors. Other devices can do this during initialization, and signal resource availability to guest using msix vector number field. Thus virtio can only estimate the guest usage as well At some level, this is fundamental: some guest operations have no failure mode. So we must preallocate some resources to make sure they won't fail. (a guest may or may not actually write a non-null data into a vector and unmask it). Please, forget the non-NULL thing. virtio driver knows exactly how many vectors we use and communicates this info to the device. This is not uncommon at all. I actually would not mind preallocating everything upfront which is much easier. But with your patch we get a silent failure or a drastic slowdown which is much more painful IMO. Again: did we already saw that limit? And where does it come from if not from KVM? It's a hardware limitation of intel APICs. interrupt vector is encoded in an 8 bit field in msi address. So you can have at most 256 of these. There should be no such limitation with pseudo GSIs we use for MSI injection. They end up as MSI messages again, so actually 256 (-reserved vectors) * number-of-cpus (on x86). This limits which CPUs can get the interrupt though. 
Linux seems to have a global pool as it wants to be able to freely balance vectors between CPUs. Or, consider a guest with a single CPU :) Anyway, why argue - there is a limitation, and it's not coming from KVM, right? That's also why we do those data == 0 checks to skip used but unconfigured vectors. Jan These checks work more or less by luck BTW. It's a hack which I hope lazy allocation will replace. The check is still valid (for x86) when we have to use static routes (device assignment, vhost). It's not valid at all - we are just lucky that linux and windows guests seem to zero out the vector when it's not in use. They do not have to do that. It is valid as it is just an optimization. If an unused vector has a non-null data field, we just redundantly register a route where we do not actually have to. Well, the only reason we even have this code is because it was claimed that some devices declare support for a huge number of vectors which then go unused. So if the guest does not do this we'll run out of vectors ... But we do need to be prepared And ATM, we aren't, and probably can't be without kernel changes, right? for potentially arriving messages on that virtual GSI, either via irqfd or kvm device assignment. Jan Why irqfd? Device assignment is ATM the only place where we use these ugly hacks. -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
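The 256-vector ceiling discussed above follows from the x86 MSI message format: the interrupt vector is an 8-bit field (strictly, bits 7:0 of the MSI data register, with the destination APIC ID in bits 19:12 of the message address). A minimal sketch of that decoding, assuming the standard x86 layout:

```c
#include <stdint.h>

/* x86 MSI layout: at most 2^8 = 256 distinct vectors per destination,
 * since the vector lives in the low byte of the data register. */
static uint8_t msi_vector(uint32_t data)
{
    return (uint8_t)(data & 0xff);
}

/* The destination APIC ID rides in bits 19:12 of the message address
 * (the 0xFEExxxxx window). */
static uint8_t msi_dest_id(uint64_t addr)
{
    return (uint8_t)((addr >> 12) & 0xff);
}
```

This is also why the per-CPU multiplication in the thread works: each destination APIC ID gets its own 256-entry vector space, minus reserved vectors.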
Re: [Qemu-devel] hw/nand hw/onenand and read-only
juha.riihim...@nokia.com writes: On 18.10.11 11:23 , ext Markus Armbruster arm...@redhat.com wrote: These guys have been converted to qdev relatively recently. I wonder what happens when they use a drive defined with -drive if=none,readonly. If that's not a valid configuration, we better reject read-only drives, like ide_init_drive() does. I'm not an expert with QEMU command line options; how do you pass such a drive to a NAND device? With if=mtd I get the following which I guess is expected result: $ qemu-system-arm -M beagle -drive if=mtd,file=nand.img,readonly qemu: readonly flag not supported for drive with this interface Yes, that way works: $ qemu-system-arm -drive if=mtd,file=tmp.qcow2,readonly qemu-system-arm: -drive if=mtd,file=tmp.qcow2,readonly: readonly not supported by this bus type But try this way: $ qemu-system-arm -drive if=none,file=tmp.qcow2,readonly,id=foo -device nand,drive=foo Kernel image must be specified Grr, force it: $ qemu-system-arm -drive if=none,file=tmp.qcow2,readonly,id=foo -device nand,drive=foo -kernel /dev/null qemu: hardware error: nand_device_init: Unsupported NAND block size. [...] Note that I didn't bother supplying a drive with sensible size. If I did, I guess I'd get nand coupled to a read-only drive. Easy enough for you to try :)
Re: [Qemu-devel] [RFC][PATCH 23/45] qemu-kvm: Rework MSI-X mask notifier to generic MSI config notifiers
On Mon, Oct 17, 2011 at 09:08:58PM +0200, Jan Kiszka wrote: On 2011-10-17 14:39, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 01:45:04PM +0200, Jan Kiszka wrote: On 2011-10-17 13:40, Michael S. Tsirkin wrote: On Mon, Oct 17, 2011 at 11:27:57AM +0200, Jan Kiszka wrote: MSI config notifiers are supposed to be triggered on every relevant configuration change of MSI vectors or if MSI is enabled/disabled. Two notifiers are established, one for vector changes and one for general enabling. The former notifier additionally passes the currently active MSI message. This will allow to update potential in-kernel IRQ routes on changes. The latter notifier is optional and will only be used by a subset of clients. These notifiers are currently only available for MSI-X but will be extended to legacy MSI as well. Signed-off-by: Jan Kiszka jan.kis...@siemens.com Passing message, always, does not seem to make sense: message is only valid if it is unmasked. If we go from unmasked to masked, the consumer could just ignore the message. Why don't we let the consumer get the message if it needs it? Because most consumer will need and because I want to keep the API simple. The API seems to get more complex, as you are passing in fields which are only valid sometimes. Further, IIRC the spec requires any changes to be done while message is masked. So mask notifier makes more sense to me: it does the same thing using one notifier that you do using two notifiers. That's in fact a possible optimization (only invoke the callback on mask transitions). Further, it is one that is already implemented. So I would prefer not to add work by removing it :) Generalization to cover MSI requires some changes. Unneeded behavioral changes back and forth should and will of course be avoided. I will rework this. Not sure if that applies to MSI as well, probably not. Probably not. However, if per vector masking is supported, and while vector is masked, the address/ data values might not make any sense. 
So I think even msi users need to know about masked state. Yes, and they get this information via the config notifier. To have common types, I would prefer to stay with vector config notifiers as name then. Jan So we pass in nonsense values and ask all users to know about MSIX rules. Ugh. I do realize msi might change the vector without masking. We can either artificially call mask before value change and unmask after, or use 3 notifiers: mask, unmask, config. Add a comment that config is invoked when configuration for an unmasked vector is changed, and that it can only happen for msi, not msix. I see no need in complicating the API like this. MSI-X still needs the config information on unmask, so let's just consistently pass it via the unified config notifier instead of forcing the consumers to create yet two more handlers. I really do not see the benefit for the consumer. Jan The benefit is a clearer API, where all parameters you get are valid, so you do not need to go read the spec to see what is OK to use. Generally, encoding events in flags is more error prone than using different notifiers for different events. E.g. _unmask and _mask make it obvious that they are called on mask and on unmask respectively. OTOH _config_change(bool mask) is unclear: is 'mask' the new state or the old state? It might be just my taste, but I usually prefer multiple functions doing one thing each rather than a single function doing multiple things. It shouldn't be too hard ... -- MST
Re: [Qemu-devel] [RFC][PATCH 28/45] qemu-kvm: msix: Drop tracking of used vectors
On 2011-10-18 15:37, Michael S. Tsirkin wrote: On Tue, Oct 18, 2011 at 03:00:29PM +0200, Jan Kiszka wrote: On 2011-10-18 14:48, Michael S. Tsirkin wrote: To my understanding, virtio will be the exception as no other device will have a chance to react on resource shortage while sending(!) an MSI message. Hmm, are you familiar with that spec? Not by heart. This is not what virtio does, resource shortage is detected during setup. This is exactly the problem with lazy registration as you don't allocate until it's too late. When is that setup phase? Does it actually come after every change to an MSI vector? I doubt so. No. During setup, driver requests vectors from the OS, and then tells the device which vector should each VQ use. It then checks that the assignment was successful. If not, it retries with less vectors. Other devices can do this during initialization, and signal resource availability to guest using msix vector number field. Thus virtio can only estimate the guest usage as well At some level, this is fundamental: some guest operations have no failure mode. So we must preallocate some resources to make sure they won't fail. We can still track the expected maximum number of active vectors at core level, collect them from the KVM layer, and warn if we expect conflicts. Anxious MSI users could then refrain from using this feature, others might be fine with risking a slow-down on conflicts. (a guest may or may not actually write a non-null data into a vector and unmask it). Please, forget the non-NULL thing. virtio driver knows exactly how many vectors we use and communicates this info to the device. This is not uncommon at all. I actually would not mind preallocating everything upfront which is much easier. But with your patch we get a silent failure or a drastic slowdown which is much more painful IMO. Again: did we already saw that limit? And where does it come from if not from KVM? It's a hardware limitation of intel APICs. 
interrupt vector is encoded in an 8 bit field in msi address. So you can have at most 256 of these. There should be no such limitation with pseudo GSIs we use for MSI injection. They end up as MSI messages again, so actually 256 (-reserved vectors) * number-of-cpus (on x86). This limits which CPUs can get the interrupt though. Linux seems to have a global pool as it wants to be able to freely balance vectors between CPUs. Or, consider a guest with a single CPU :) Anyway, why argue - there is a limitation, and it's not coming from KVM, right? No, our limit we hit with MSI message routing are first of all KVM GSIs, and there only pseudo GSIs that do not go to any interrupt controller with limited pins. That could easily be lifted in the kernel if we run into shortages in practice. That's also why we do those data == 0 checks to skip used but unconfigured vectors. Jan These checks work more or less by luck BTW. It's a hack which I hope lazy allocation will replace. The check is still valid (for x86) when we have to use static routes (device assignment, vhost). It's not valid at all - we are just lucky that linux and windows guests seem to zero out the vector when it's not in use. They do not have to do that. It is valid as it is just an optimization. If an unused vector has a non-null data field, we just redundantly register a route where we do not actually have to. Well, the only reason we even have this code is because it was claimed that some devices declare support for a huge number of vectors which then go unused. So if the guest does not do this we'll run out of vectors ... But we do need to be prepared And ATM, we aren't, and probably can't be without kernel changes, right? for potentially arriving messages on that virtual GSI, either via irqfd or kvm device assignment. Jan Why irqfd? Device assignment is ATM the only place where we use these ugly hacks. vfio will use irqfds. 
And that virtio is partly out of the picture is only because we know much more about virtio internals (specifically: will not advertise more vectors than guests will want to use). Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [RFC 3/6] block: wait for overlapping requests
On Mon, Oct 17, 2011 at 04:47:29PM +0100, Stefan Hajnoczi wrote: When copy-on-read is enabled it is necessary to wait for overlapping requests before issuing new requests. This prevents races between the copy-on-read and a write request. Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com --- block.c | 39 +++ 1 files changed, 39 insertions(+), 0 deletions(-) diff --git a/block.c b/block.c index e624ac3..cc3202c 100644 --- a/block.c +++ b/block.c @@ -1001,6 +1001,7 @@ struct BdrvTrackedRequest { int nb_sectors; bool is_write; QLIST_ENTRY(BdrvTrackedRequest) list; +CoQueue wait_queue; /* coroutines blocked on this request */ }; /** @@ -1014,6 +1015,12 @@ static void tracked_request_remove(BdrvTrackedRequest *req) { if (req) { QLIST_REMOVE(req, list); + +/* Wake up all coroutines blocked on this request */ +while (qemu_co_queue_next(&req->wait_queue)) { +/* Do nothing */ +} + g_free(req); } } @@ -1038,12 +1045,36 @@ static BdrvTrackedRequest *tracked_request_add(BlockDriverState *bs, req->sector_num = sector_num; req->nb_sectors = nb_sectors; req->is_write = is_write; +qemu_co_queue_init(&req->wait_queue); QLIST_INSERT_HEAD(&bs->tracked_requests, req, list); return req; } +static bool tracked_request_overlaps(BdrvTrackedRequest *req, + int64_t sector_num, int nb_sectors) { +return false; /* not yet implemented */ +} + +static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs, +int64_t sector_num, int nb_sectors) +{ +BdrvTrackedRequest *req; +bool retry; + +do { +retry = false; +QLIST_FOREACH(req, &bs->tracked_requests, list) { +if (tracked_request_overlaps(req, sector_num, nb_sectors)) { +qemu_co_queue_wait(&req->wait_queue); +retry = true; What prevents overlapping requests (from the waiter's criteria) from being inserted into the queue while there are waiters again? That is, why is it not possible for a waiter to wait indefinitely?
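The stubbed tracked_request_overlaps() above ultimately needs a standard half-open interval test; a self-contained sketch of just that predicate, detached from the QEMU types, might read:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sector ranges [a, a+an) and [b, b+bn) overlap iff each one starts
 * before the other one ends. Adjacent ranges do not overlap. */
static bool sectors_overlap(int64_t a, int an, int64_t b, int bn)
{
    return a < b + bn && b < a + an;
}
```

Note the predicate itself says nothing about fairness: as the review question points out, whether a waiter can starve depends on how newly arriving overlapping requests are ordered relative to the coroutines woken in tracked_request_remove().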
Re: [Qemu-devel] [RFC][PATCH 23/45] qemu-kvm: Rework MSI-X mask notifier to generic MSI config notifiers
On 2011-10-18 15:46, Michael S. Tsirkin wrote: To have common types, I would prefer to stay with vector config notifiers as name then. Jan So we pass in nonsense values and ask all users to know about MSIX rules. Ugh. I do realize msi might change the vector without masking. We can either artificially call mask before value change and unmask after, or use 3 notifiers: mask,unmask,config. Add a comment that config is invoked when configuration for an unmasked vector is changed, and that it can only happen for msi, not msix. I see no need in complicating the API like this. MSI-X still needs the config information on unmask, so let's just consistently pass it via the unified config notifier instead of forcing the consumers to create yet two more handlers. I really do not see the benefit for the consumer. Jan The benefit is a clearer API, where all parameters you get are valid, so you do not need to go read the spec to see what is OK to use. Generally, encoding events in flags is more error prone than using different notifiers for different events. E.g. _unmask and _mask make it obvious that they are called on mask and on unmask respectively. OTOH _config_change(bool mask) is unclear: is 'mask' the new state or the old state? It might be just my taste, but I usually prefer multiple functions doing one thing each rather than a single function doing multiple things. It shouldn't be too hard ... The impact on the user side (device models) will be larger while the work in those different handlers will be widely the same (check my series, e.g. the virtio handler). But it's still just a guess of mine as well. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
On 10/18/2011 06:28 PM, Jan Kiszka wrote: On 2011-10-18 11:43, Wen Congyang wrote: At 10/18/2011 04:36 PM, Jan Kiszka wrote: On 2011-10-18 10:34, Richard W.M. Jones wrote: Yeah, I see. Could also be solved via gdb scripts, but crash is already there. But let's see if the formats actually differ. In the end, crash is just parsing the same information that also gdb sees. I think the format can be similar to diskdump/kdump/netdump:

dump_header: 1 block
sub header: n blocks (n is stored in dump_header)
bitmap: m blocks (2m is stored in dump_header)
dumpable bitmap: m blocks
memory data (we can know whether a page is stored in the core from the bitmap and dumpable bitmap)

The format of the dump header (it's like kdump/diskdump):

struct disk_dump_header {
    char signature[SIG_LEN];        /* = QEMU */
    int header_version;             /* Dump header version */
    struct new_utsname utsname;     /* copy of system_utsname */
    struct timeval timestamp;       /* Time stamp */
    unsigned int status;
    int block_size;                 /* Size of a block in byte */
    int sub_hdr_size;               /* Size of arch dependent header in blocks */
    unsigned int bitmap_blocks;     /* Size of Memory bitmap in block */
    unsigned int max_mapnr;         /* = max_mapnr */
    unsigned int total_ram_blocks;  /* Number of blocks should be written */
    unsigned int device_blocks;     /* Number of total blocks in the dump device */
    unsigned int written_blocks;    /* Number of written blocks */
    unsigned int current_cpu;       /* CPU# which handles dump */
    int nr_cpus;                    /* Number of CPUs */
};

The sub header can contain all registers' values on each vcpu, and other information, for example:

struct qemu_sub_header {
    unsigned long start_pfn;
    unsigned long end_pfn;
    off_t offset_note;
    unsigned long size_note;
};

So is this a standard format or only similar to something? Which tools will understand it out-of-the-box? If it's not standard, why? Only similar to something, and we can add a little code into crash to support such a format. If you have a better format, please tell me. Thanks Wen Congyang Jan
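To make the bitmap sizing in the proposed layout concrete: with one bit per page frame and two bitmaps (page-present plus dumpable, the "2m" above), the header's bitmap_blocks would follow diskdump-style round-up arithmetic. A hypothetical sketch, with the field semantics assumed from the diskdump convention rather than from any finalized QEMU spec:

```c
/* Round-up integer division, as used by diskdump-style tools. */
static unsigned int divide_up(unsigned long n, unsigned long d)
{
    return (unsigned int)((n + d - 1) / d);
}

/* Two bitmaps, each max_mapnr bits wide, packed into block_size-byte
 * blocks (8 * block_size bits per block); the header stores the total. */
static unsigned int bitmap_blocks(unsigned long max_mapnr, int block_size)
{
    return 2 * divide_up(max_mapnr, 8UL * (unsigned long)block_size);
}
```

For example, a guest with 2^20 page frames and 4 KiB blocks needs 32 blocks per bitmap, so the header field would hold 64.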
Re: [Qemu-devel] [Question] dump memory when host pci device is used by guest
On 2011-10-18 15:51, Wen Congyang wrote: On 10/18/2011 06:28 PM, Jan Kiszka wrote: On 2011-10-18 11:43, Wen Congyang wrote: At 10/18/2011 04:36 PM, Jan Kiszka wrote: On 2011-10-18 10:34, Richard W.M. Jones wrote: Yeah, I see. Could also be solved via gdb scripts, but crash is already there. But let's see if the formats actually differ. In the end, crash is just parsing the same information that also gdb sees. I think the format can be similar to diskdump/kdump/netdump:

dump_header: 1 block
sub header: n blocks (n is stored in dump_header)
bitmap: m blocks (2m is stored in dump_header)
dumpable bitmap: m blocks
memory data (we can know whether a page is stored in the core from the bitmap and dumpable bitmap)

The format of the dump header (it's like kdump/diskdump):

struct disk_dump_header {
    char signature[SIG_LEN];        /* = QEMU */
    int header_version;             /* Dump header version */
    struct new_utsname utsname;     /* copy of system_utsname */
    struct timeval timestamp;       /* Time stamp */
    unsigned int status;
    int block_size;                 /* Size of a block in byte */
    int sub_hdr_size;               /* Size of arch dependent header in blocks */
    unsigned int bitmap_blocks;     /* Size of Memory bitmap in block */
    unsigned int max_mapnr;         /* = max_mapnr */
    unsigned int total_ram_blocks;  /* Number of blocks should be written */
    unsigned int device_blocks;     /* Number of total blocks in the dump device */
    unsigned int written_blocks;    /* Number of written blocks */
    unsigned int current_cpu;       /* CPU# which handles dump */
    int nr_cpus;                    /* Number of CPUs */
};

The sub header can contain all registers' values on each vcpu, and other information, for example:

struct qemu_sub_header {
    unsigned long start_pfn;
    unsigned long end_pfn;
    off_t offset_note;
    unsigned long size_note;
};

So is this a standard format or only similar to something? Which tools will understand it out-of-the-box? If it's not standard, why? Only similar to something, and we can add a little code into crash to support such a format. If you have a better format, please tell me. 
The format crash already processes? What do you need in addition? That should be discussed first, not the format details. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [RFC][PATCH 28/45] qemu-kvm: msix: Drop tracking of used vectors
On Tue, Oct 18, 2011 at 03:46:06PM +0200, Jan Kiszka wrote: On 2011-10-18 15:37, Michael S. Tsirkin wrote: On Tue, Oct 18, 2011 at 03:00:29PM +0200, Jan Kiszka wrote: On 2011-10-18 14:48, Michael S. Tsirkin wrote: To my understanding, virtio will be the exception as no other device will have a chance to react on resource shortage while sending(!) an MSI message. Hmm, are you familiar with that spec? Not by heart. This is not what virtio does, resource shortage is detected during setup. This is exactly the problem with lazy registration as you don't allocate until it's too late. When is that setup phase? Does it actually come after every change to an MSI vector? I doubt so. No. During setup, driver requests vectors from the OS, and then tells the device which vector should each VQ use. It then checks that the assignment was successful. If not, it retries with less vectors. Other devices can do this during initialization, and signal resource availability to guest using msix vector number field. Thus virtio can only estimate the guest usage as well At some level, this is fundamental: some guest operations have no failure mode. So we must preallocate some resources to make sure they won't fail. We can still track the expected maximum number of active vectors at core level, collect them from the KVM layer, and warn if we expect conflicts. Anxious MSI users could then refrain from using this feature, others might be fine with risking a slow-down on conflicts. It seems like a nice feature until you have to debug it in the field :). If you really think it's worthwhile, let's add a 'force' flag so that advanced users at least can declare that they know what they are doing. (a guest may or may not actually write a non-null data into a vector and unmask it). Please, forget the non-NULL thing. virtio driver knows exactly how many vectors we use and communicates this info to the device. This is not uncommon at all. 
I actually would not mind preallocating everything upfront, which is much easier. But with your patch we get a silent failure or a drastic slowdown, which is much more painful IMO.

Again: did we already see that limit? And where does it come from, if not from KVM?

It's a hardware limitation of Intel APICs: the interrupt vector is encoded in an 8-bit field of the MSI message, so you can have at most 256 of these.

There should be no such limitation with the pseudo GSIs we use for MSI injection. They end up as MSI messages again, so actually 256 (minus reserved vectors) * number-of-CPUs (on x86).

This limits which CPUs can get the interrupt, though. Linux seems to have a global pool, as it wants to be able to freely balance vectors between CPUs. Or, consider a guest with a single CPU :) Anyway, why argue - there is a limitation, and it's not coming from KVM, right?

No, the limit we hit with MSI message routing is first of all KVM GSIs, and there only pseudo GSIs that do not go to any interrupt controller with limited pins.

I see KVM_MAX_IRQ_ROUTES = 1024. This is > 256, so KVM does not seem to be the problem.

That could easily be lifted in the kernel if we run into shortages in practice.

What I was saying is that resources are limited even without KVM. That's also why we do those data == 0 checks to skip used but unconfigured vectors.

Jan

These checks work more or less by luck, BTW. It's a hack which I hope lazy allocation will replace.

The check is still valid (for x86) when we have to use static routes (device assignment, vhost).

It's not valid at all - we are just lucky that Linux and Windows guests seem to zero out the vector when it's not in use. They do not have to do that.

It is valid, as it is just an optimization. If an unused vector has a non-null data field, we just redundantly register a route where we do not actually have to.

Well, the only reason we even have this code is because it was claimed that some devices declare support for a huge number of vectors which then go unused.
So if the guest does not do this, we'll run out of vectors ...

But we do need to be prepared

And ATM, we aren't, and probably can't be without kernel changes, right?

for potentially arriving messages on that virtual GSI, either via irqfd or kvm device assignment.

Jan

Why irqfd? Device assignment is ATM the only place where we use these ugly hacks.

vfio will use irqfds.

And that virtio is partly out of the picture is only because we know much more about virtio internals (specifically: it will not advertise more vectors than guests will want to use).

Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
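The retry-with-fewer-vectors behavior Michael describes can be reduced to a toy model. This is a hedged illustration only: `os_alloc_vectors` and `virtio_negotiate_vectors` are made-up names, not real virtio or QEMU APIs. The driver first asks for one vector per VQ plus one for configuration changes, then falls back to two shared vectors, then to none (INTx).

```c
#include <assert.h>

/* Vectors the simulated OS allocator can still hand out (set by the caller). */
int os_available;

/* All-or-nothing allocator standing in for the guest OS; returns the
 * number of vectors granted: either 'want' or 0. */
static int os_alloc_vectors(int want)
{
    if (want <= os_available) {
        os_available -= want;
        return want;
    }
    return 0;
}

/* Mirror of the fallback sequence described above: one vector per VQ
 * plus a config vector, then two shared vectors, then none at all. */
int virtio_negotiate_vectors(int nvqs)
{
    int tries[] = { nvqs + 1, 2, 0 };
    int i;

    for (i = 0; i < 3; i++) {
        if (tries[i] == 0 || os_alloc_vectors(tries[i]) == tries[i]) {
            return tries[i];
        }
    }
    return 0;
}
```

With 16 free vectors a 4-VQ device ends up with 5; with only 2 free it falls back to the shared pair; with 1 it gets none and the device signals this through the vector number field, as described above.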
Re: [Qemu-devel] hw/nand hw/onenand and read-only
On 18 October 2011 14:38, Markus Armbruster arm...@redhat.com wrote:

Yes, that way works:

$ qemu-system-arm -drive if=mtd,file=tmp.qcow2,readonly
qemu-system-arm: -drive if=mtd,file=tmp.qcow2,readonly: readonly not supported by this bus type

But try this way:

$ qemu-system-arm -drive if=none,file=tmp.qcow2,readonly,id=foo -device nand,drive=foo
Kernel image must be specified

Grr, force it:

$ qemu-system-arm -drive if=none,file=tmp.qcow2,readonly,id=foo -device nand,drive=foo -kernel /dev/null
qemu: hardware error: nand_device_init: Unsupported NAND block size.
[...]

Note that I didn't bother supplying a drive with a sensible size. If I did, I guess I'd get nand coupled to a read-only drive. Easy enough for you to try :)

Isn't this a block layer problem? It seems to be blockdev.c that emits the "readonly not supported" message for MTD devices, so presumably it ought to do so however the user sets up the device...

-- PMM
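The bus-type check being discussed can be modeled as below. This is an illustrative toy only: `drive_check_readonly` and the exact set of accepting interfaces are assumptions, not the real blockdev.c code. The point is that the legacy -drive parser validates readonly per interface type, which is exactly why the if=none + -device path sidesteps the check.

```c
#include <assert.h>

/* Toy model of the legacy -drive readonly validation discussed above. */
typedef enum { IF_NONE, IF_SCSI, IF_VIRTIO, IF_FLOPPY, IF_MTD, IF_SD } BlockIf;

/* Returns 0 if the combination is accepted, -1 for the
 * "readonly not supported by this bus type" case. */
int drive_check_readonly(BlockIf type, int readonly)
{
    if (!readonly) {
        return 0;
    }
    switch (type) {
    case IF_NONE:
    case IF_SCSI:
    case IF_VIRTIO:
    case IF_FLOPPY:
        return 0;   /* readonly accepted on these interfaces */
    default:
        return -1;  /* rejected at -drive parse time */
    }
}
```

In this model if=mtd with readonly is rejected while if=none passes, matching the transcript above; moving the check to the point where a drive is attached to a device would close the gap Peter points out.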
Re: [Qemu-devel] [RFC 5/6] block: core copy-on-read logic
On Mon, Oct 17, 2011 at 04:47:31PM +0100, Stefan Hajnoczi wrote:

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
 block.c      | 69 ++
 trace-events |  1 +
 2 files changed, 70 insertions(+), 0 deletions(-)

diff --git a/block.c b/block.c
index 0c22741..2aec6b4 100644
--- a/block.c
+++ b/block.c
@@ -1409,6 +1409,55 @@ int bdrv_pwrite_sync(BlockDriverState *bs, int64_t offset,
     return 0;
 }

+static int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
+    int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
+{
+    void *bounce_buffer;
+    struct iovec iov;
+    QEMUIOVector bounce_qiov;
+    int64_t cluster_sector_num;
+    int cluster_nb_sectors;
+    size_t skip_bytes;
+    int ret;
+
+    /* Cover entire cluster so no additional backing file I/O is required when
+     * allocating cluster in the image file.
+     */
+    round_to_clusters(bs, sector_num, nb_sectors,
+                      &cluster_sector_num, &cluster_nb_sectors);
+
+    trace_bdrv_co_copy_on_readv(bs, sector_num, nb_sectors,
+                                cluster_sector_num, cluster_nb_sectors);
+
+    iov.iov_len = cluster_nb_sectors * BDRV_SECTOR_SIZE;
+    iov.iov_base = bounce_buffer = qemu_blockalign(bs, iov.iov_len);
+    qemu_iovec_init_external(&bounce_qiov, &iov, 1);
+
+    ret = bs->drv->bdrv_co_readv(bs, cluster_sector_num, cluster_nb_sectors,
+                                 &bounce_qiov);
+    if (ret < 0) {
+        goto err;
+    }
+
+    ret = bs->drv->bdrv_co_writev(bs, cluster_sector_num, cluster_nb_sectors,
+                                  &bounce_qiov);
+    if (ret < 0) {
+        /* It might be okay to ignore write errors for guest requests.  If this
+         * is a deliberate copy-on-read then we don't want to ignore the error.
+         * Simply report it in all cases.
+         */
+        goto err;
+    }
+
+    skip_bytes = (sector_num - cluster_sector_num) * BDRV_SECTOR_SIZE;
+    qemu_iovec_from_buffer(qiov, bounce_buffer + skip_bytes,
+                           nb_sectors * BDRV_SECTOR_SIZE);
+
+err:
+    qemu_vfree(bounce_buffer);
+    return ret;
+}
+
 /*
  * Handle a read request in coroutine context
  */
@@ -1431,7 +1480,27 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
     }

     req = tracked_request_add(bs, sector_num, nb_sectors, false);
+
+    if (bs->copy_on_read) {
+        int pnum;
+
+        /* TODO it is not safe to call bdrv_is_allocated() in coroutine context
+         * because it's a synchronous interface.  We probably want a
+         * bdrv_co_is_allocated(). */
+        ret = bdrv_is_allocated(bs, sector_num, nb_sectors, &pnum);
+        if (ret < 0) {
+            goto out;
+        }

This lacks shared base image support, BTW (as in: the copy should be performed only if the cluster is not in the destination chain). Could be added later.
Re: [Qemu-devel] [PATCH 1/2] Move graphic-related coalesced MMIO flushes to affected device models
On 2011-10-18 16:00, Avi Kivity wrote: On 09/30/2011 01:31 PM, Jan Kiszka wrote:

This is conceptually cleaner and will allow us to drop the nographic timer. Moreover, it will be mandatory to fully exploit future per-device coalesced MMIO rings.

Appears to break winxp installation - the guest enters an infinite bitblt loop. Trying to find out why.

Hmm, maybe there are side effects in certain modes that actually disallow coalescing.

Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 3/7] integratorcp: convert control to memory API
On 10/18/2011 01:44 PM, Peter Maydell wrote: 2011/10/17 Benoît Canet benoit.ca...@gmail.com:

-static void icp_control_init(uint32_t base)
+static void icp_control_init(target_phys_addr_t base)
 {
-    int iomemtype;
+    MemoryRegion *io;

-    iomemtype = cpu_register_io_memory(icp_control_readfn,
-                                       icp_control_writefn, NULL,
-                                       DEVICE_NATIVE_ENDIAN);
-    cpu_register_physical_memory(base, 0x0080, iomemtype);
+    io = (MemoryRegion *)g_malloc0(sizeof(MemoryRegion));
+    memory_region_init_io(io, &icp_control_ops, NULL,
+                          "control", 0x0080);
+    memory_region_add_subregion(get_system_memory(), base, io);
     /* ??? Save/restore.  */
 }

I didn't spot this the first time round, but this is wrong. We shouldn't be g_malloc0()ing the MemoryRegion -- it should be an element in some suitable device struct. I think the right thing to do here is probably first to do the (fairly trivial) conversion of the icp_control code to be a sysbus device, then do the memory region conversion on top of that.

It's not any less wrong than the original code, which also leaked memory. I'll merge it and let any further conversion take place on top. (Though g_malloc0() is better replaced by g_new().)

-- error compiling committee.c: too many arguments to function
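Peter's suggested direction — embedding the MemoryRegion in a device state struct via a sysbus conversion — would look roughly like this. This is an untested sketch against the QEMU API of that era; `ICPCtrlState`, the function name, and the init details are assumptions for illustration, not the patch that was actually merged:

```c
/* Sketch: the MemoryRegion becomes part of the device state, so it is
 * owned by the device and nothing is leaked. */
typedef struct ICPCtrlState {
    SysBusDevice busdev;
    MemoryRegion iomem;
} ICPCtrlState;

static int icp_control_sysbus_init(SysBusDevice *dev)
{
    ICPCtrlState *s = FROM_SYSBUS(ICPCtrlState, dev);

    /* The region lives in *s, not in a g_malloc0()ed blob. */
    memory_region_init_io(&s->iomem, &icp_control_ops, s,
                          "control", 0x0080);
    sysbus_init_mmio_region(dev, &s->iomem);
    return 0;
}
```

The size and ops name follow the quoted patch; the sysbus registration step is where save/restore support could later hang off the device as well.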