Re: [Qemu-devel] [PATCH v8] kvm: notify host when the guest is panicked
On Wed, Aug 08, 2012 at 10:43:01AM +0800, Wen Congyang wrote: > diff --git a/Documentation/virtual/kvm/pv_event.txt > b/Documentation/virtual/kvm/pv_event.txt > new file mode 100644 > index 000..0ebc890 > --- /dev/null > +++ b/Documentation/virtual/kvm/pv_event.txt > @@ -0,0 +1,32 @@ > +The KVM paravirtual event interface > += > + > +Initializing the paravirtual event interface > +== > +kvm_pv_event_init() > +Argiments: > + None > + > +Return Value: > + 0 : The guest kernel can't use paravirtual event interface. > + -1: The guest kernel can use paravirtual event interface. > + This documentation has the can and can't backwards.
Re: [Qemu-devel] [PATCH v2 2/2] ivshmem: use irqfd to interrupt among VMs
- Original Message - > On Tue, Dec 4, 2012 at 4:10 AM, Andrew Jones > wrote: > > > > > > - Original Message - > >> On Thu, Nov 29, 2012 at 1:34 AM, liu ping fan > >> wrote: > >> > On Thu, Nov 29, 2012 at 12:42 PM, Cam Macdonell > >> > wrote: > >> >> On Tue, Nov 27, 2012 at 7:53 PM, liu ping fan > >> >> > >> >> wrote: > >> >>> On Wed, Nov 28, 2012 at 5:48 AM, Cam Macdonell > >> >>> wrote: > >> >>>> On Sat, Nov 24, 2012 at 8:51 PM, Liu Ping Fan > >> >>>> wrote: > >> >>>>> From: Liu Ping Fan > >> >>>>> > >> >>>>> Using irqfd, so we can avoid switch between kernel and user > >> >>>>> when > >> >>>>> VMs interrupts each other. > >> >>>> > >> >>>> Nice work. Due to a hardware failure, there will be a small > >> >>>> delay in > >> >>>> me being able to test this. I'll follow up as soon as I can. > >> >>>> > >> >>> BTW where can I find the latest guest code for test? > >> >>> I got the guest code from > >> >>> git://gitorious.org/nahanni/guest-code.git. > >> >>> But it seems outdated, after fixing the unlocked_ioctl, and > >> >>> vm-id > >> >>> bits > >> >>> position (the codes conflict with ivshmem spec), it works (I > >> >>> have > >> >>> tested for V1). > >> >> > >> >> Hello, > >> >> > >> >> Which device driver are you using? > >> >> > >> > guest-code/kernel_module/standard/kvm_ivshmem.c > >> > >> The uio driver is the recommended one, however if you want to use > >> the > >> kvm_ivshmem one and have it working, then feel free to continue. > > > > If the uio driver is the recommended one, then can you please post > > it > > to lkml? It should be integrated into drivers/virt with an > > appropriate > > Kconfig update. > > > > Sure. Should it go under drivers/virt or drivers/uio? It seems the > uio drivers all get grouped together. Good point. That is the current practice. As there's still only a handful of uio drivers, then I guess it doesn't make sense to try and change that at this point. It seems that it would make more sense to use drivers/uio just for generic uio drivers though, and then the other uio drivers would go under drivers//uio, e.g. drivers/virt/uio Drew > > Thanks, > Cam > > > Thanks, > > Drew > > > >> > >> I had deleted it form the repo, but some users had based solutions > >> off > >> it, so I added it back. > >> > >> btw, my hardware issue has been resolved, so I'll get to testing > >> your > >> patch soon. > >> > >> Sincerely, > >> Cam > >> > >> > > >> >> Cam > >> >> > >> >>> > >> >>> Regards, > >> >>> Pingfan > >> >>> > >> >>>>> > >> >>>>> Signed-off-by: Liu Ping Fan > >> >>>>> --- > >> >>>>> hw/ivshmem.c | 54 > >> >>>>> +- > >> >>>>> 1 files changed, 53 insertions(+), 1 deletions(-) > >> >>>>> > >> >>>>> diff --git a/hw/ivshmem.c b/hw/ivshmem.c > >> >>>>> index 7c8630c..5709e89 100644 > >> >>>>> --- a/hw/ivshmem.c > >> >>>>> +++ b/hw/ivshmem.c > >> >>>>> @@ -19,6 +19,7 @@ > >> >>>>> #include "hw.h" > >> >>>>> #include "pc.h" > >> >>>>> #include "pci.h" > >> >>>>> +#include "msi.h" > >> >>>>> #include "msix.h" > >> >>>>> #include "kvm.h" > >> >>>>> #include "migration.h" > >> >>>>> @@ -83,6 +84,7 @@ typedef struct IVShmemState { > >> >>>>> uint32_t vectors; > >> >>>>> uint32_t features; > >> >>>>> EventfdEntry *eventfd_table; > >> >>>>> +i
Re: [Qemu-devel] [PATCH 1/3] qga: introduce guest-get-vcpus / guest-set-vcpus with stubs
- Original Message - > On 03/05/13 22:08, Eric Blake wrote: > > On 03/04/2013 03:19 PM, Laszlo Ersek wrote: > >> Signed-off-by: Laszlo Ersek > >> --- > > > >> +# @guest-set-vcpus: > >> +# > >> +# Attempt to reconfigure (currently: enable/disable) logical > >> processors inside > >> +# the guest. > >> +# > >> +# The input list is processed node by node in order. In each node > >> @logical-id > >> +# is used to look up the guest VCPU, for which @online specifies > >> the requested > >> +# state. The set of distinct @logical-id's is only required to be > >> a subset of > >> +# guest-supported identifiers. There's no restriction on list > >> length or on > >> +# repeating the same @logical-id (with possibly different @online > >> field). > >> +# Preferably the input list should describe a modified subset of > >> +# @guest-get-vcpus' return value. > >> +# > >> +# If part or whole of the requested operation can't be carried > >> out, the guest > >> +# VCPU state will be unspecified. > > > > Completely unspecified? > > Yes. "Unspecified" means "valid" (ie. at least one VCPU will be > online, > the guest won't be "dead"), but no further info will be returned at > once. > > > Or is it guaranteed that a subsequent > > successful guest-get-vcpus will still be reliably to learn after > > the > > fact what happened? > > Yes, that is both the intent and implied by "unspecified" (as opposed > to > "undefined"). > > > Would it make any more sense to have only a > > guest-set-vcpu, which attempts to set the state of a single vcpu, > > instead of an open-ended array of successive vcpu modifications in > > guest-set-vcpus? > > The current interface can be special-cased into that type of call, > however I wanted to provide a batch interface (flipping 100 VCPUs > shouldn't take 100 round trips). > > > The interface seems relatively sane, though, and it looks like > > something > > that libvirt would be able to use without having to add any new > > APIs > > (just a new flag value to the existing virDomainSetVcpusFlags() > > function). > > Oh. > > virDomainSetVcpusFlags() [src/libvirt.c] > qemuDomainSetVcpusFlags() [src/qemu/qemu_driver.c] > qemuDomainHotplugVcpus() > qemuMonitorSetCPU() [src/qemu/qemu_monitor.c] > qemuMonitorTextSetCPU() > "cpu_set %d %s" > > Does this work? I can't find any trace of the "cpu_set" (or the > "set_cpu") monitor command in upstream qemu. > > The relevant libvirt commits are: > - e8d6c289 Support VCPU hotplug in QEMU guests > ("NB, currently untested since QEMU segvs when running this!") > - a980d123 Fix CPU hotplug command names > > If this works and I'm just not seeing something then I have no reason > to > pursue this series. > > ... Ah I understand now. "cpu_set" *is* supported by the qemu-kvm > project at -- > and > by RHEL-6 qemu-kvm --, via ACPI. > > I'll have to test this in RHEL-6. If it doesn't work, I should check > why. If it does, I'll have to figure out if I should continue to work > on > this. cpu hotplug works for rhel6 and Igor is also pushing it to upstream qemu now. But unplug doesn't. We need this alternative solution to support both plug and unplug while unplug gets worked out. > > I wonder why doesn't have "cpu_set". > > Thanks > Laszlo >
Re: [Qemu-devel] [PATCH V8 07/11] NUMA: set guest numa nodes memory policy
- Original Message - > Set the guest numa nodes memory policies using the mbind(2) > system call node by node. > After this patch, we are able to set guest nodes memory policies > through the QEMU options, this arms to solve the guest cross > nodes memory access performance issue. > And as you all know, if PCI-passthrough is used, > direct-attached-device uses DMA transfer between device and qemu process. > All pages of the guest will be pinned by get_user_pages(). > > KVM_ASSIGN_PCI_DEVICE ioctl > kvm_vm_ioctl_assign_device() > =>kvm_assign_device() > => kvm_iommu_map_memslots() > => kvm_iommu_map_pages() >=> kvm_pin_pages() > > So, with direct-attached-device, all guest page's page count will be +1 and > any page migration will not work. AutoNUMA won't too. > > So, we should set the guest nodes memory allocation policies before > the pages are really mapped. > > Signed-off-by: Andre Przywara > Signed-off-by: Wanlong Gao > --- > numa.c | 89 > ++ > 1 file changed, 89 insertions(+) > > diff --git a/numa.c b/numa.c > index 436b8e0..b2c0048 100644 > --- a/numa.c > +++ b/numa.c > @@ -28,6 +28,16 @@ > #include "qapi-visit.h" > #include "qapi/opts-visitor.h" > #include "qapi/dealloc-visitor.h" > +#include "exec/memory.h" > + > +#ifdef CONFIG_NUMA > +#include > +#include > +#ifndef MPOL_F_RELATIVE_NODES > +#define MPOL_F_RELATIVE_NODES (1 << 14) > +#define MPOL_F_STATIC_NODES (1 << 15) > +#endif > +#endif > > QemuOptsList qemu_numa_opts = { > .name = "numa", > @@ -209,6 +219,78 @@ void set_numa_nodes(void) > } > } > > +#ifdef CONFIG_NUMA > +static int node_parse_bind_mode(unsigned int nodeid) > +{ > +int bind_mode; > + > +switch (numa_info[nodeid].policy) { > +case NUMA_NODE_POLICY_MEMBIND: > +bind_mode = MPOL_BIND; > +break; > +case NUMA_NODE_POLICY_INTERLEAVE: > +bind_mode = MPOL_INTERLEAVE; > +break; > +case NUMA_NODE_POLICY_PREFERRED: > +bind_mode = MPOL_PREFERRED; > +break; > +case NUMA_NODE_POLICY_DEFAULT: > +default: > +bind_mode = MPOL_DEFAULT; > +return bind_mode; > +} > + > +bind_mode |= numa_info[nodeid].relative ? > +MPOL_F_RELATIVE_NODES : MPOL_F_STATIC_NODES; > + > +return bind_mode; > +} > +#endif > + > +static int set_node_mem_policy(int nodeid) > +{ > +#ifdef CONFIG_NUMA > +void *ram_ptr; > +RAMBlock *block; > +ram_addr_t len, ram_offset = 0; > +int bind_mode; > +int i; > + > +QTAILQ_FOREACH(block, &ram_list.blocks, next) { > +if (!strcmp(block->mr->name, "pc.ram")) { > +break; > +} > +} > + > +if (block->host == NULL) { > +return -1; > +} > + > +ram_ptr = block->host; > +for (i = 0; i < nodeid; i++) { > +len = numa_info[i].node_mem; > +ram_offset += len; > +} > + > +len = numa_info[i].node_mem; > +bind_mode = node_parse_bind_mode(i); > + > +/* This is a workaround for a long standing bug in Linux' > + * mbind implementation, which cuts off the last specified > + * node. To stay compatible should this bug be fixed, we > + * specify one more node and zero this one out. > + */ > +clear_bit(numa_num_configured_nodes() + 1, numa_info[i].host_mem); > +if (mbind(ram_ptr + ram_offset, len, bind_mode, > +numa_info[i].host_mem, numa_num_configured_nodes() + 1, 0)) { > +perror("mbind"); > +return -1; > +} >From my quick read of this patch series, I think these two calls of numa_num_configured_nodes() are the only places that libnuma is used. Is it really worth the new dependency? Actually libnuma will only calculate what it returns from numa_num_configured_nodes() once, because it simply counts bits in a bitmask that it only initializes at library load time. So it would be more robust wrt to node onlining/offlining to avoid libnuma and to just fetch information from sysfs as needed anyway. In this particular code though, I think replacing numa_num_configured_nodes() with a maxnode, where unsigned long maxnode = find_last_bit(numa_info[i].host_mem, MAX_CPUMASK_BITS) would work the best. Another comment I have on this function is that I'd prefer to see something like unsigned long *nodes = numa_info[nodeid].host_mem; at the top, and then use that for a shorter name, rather than abusing the fact that i == nodeid after the loop, presumably just to keep the name short. drew > +#endif > + > +return 0; > +} > + > void set_numa_modes(void) > { > CPUState *cpu; > @@ -221,4 +303,11 @@ void set_numa_modes(void) > } > } > } > + > +for (i = 0; i < nb_numa_nodes; i++) { > +if (set_node_mem_policy(i) == -1) { > +fprintf(stderr, > +"qemu: can not set host memory policy for node%d\n", i); > +} > +} > } > --
Re: [Qemu-devel] [PATCH V8 07/11] NUMA: set guest numa nodes memory policy
- Original Message - > On 08/20/2013 09:41 PM, Andrew Jones wrote: > >> + > >> +/* This is a workaround for a long standing bug in Linux' > >> + * mbind implementation, which cuts off the last specified > >> + * node. To stay compatible should this bug be fixed, we > >> + * specify one more node and zero this one out. > >> + */ > >> +clear_bit(numa_num_configured_nodes() + 1, numa_info[i].host_mem); > >> +if (mbind(ram_ptr + ram_offset, len, bind_mode, > >> +numa_info[i].host_mem, numa_num_configured_nodes() + 1, 0)) { > >> +perror("mbind"); > >> +return -1; > >> +} > > > >>From my quick read of this patch series, I think these two calls of > > numa_num_configured_nodes() are the only places that libnuma is used. > > Is it really worth the new dependency? Actually libnuma will only calculate > > what it returns from numa_num_configured_nodes() once, because it simply > > counts bits in a bitmask that it only initializes at library load time. So > > it would be more robust wrt to node onlining/offlining to avoid libnuma and > > to just fetch information from sysfs as needed anyway. In this particular > > code though, I think replacing numa_num_configured_nodes() with a maxnode, > > where > > > > unsigned long maxnode = find_last_bit(numa_info[i].host_mem, > > MAX_CPUMASK_BITS) > > Sorry I can't understand this since numa_numa_configured_nodes() is for host, > but why could we find the last bit of guest setting to replace it? > You're not using numa_numa_configured_nodes() to index _the_ host's nodemask, you're using it to find the highest possible bit set in _a_ nodemask, numa_info[i].host_mem. mbind doesn't need its 'maxnode' param to be the highest possible host node bit, but rather just the highest possible bit set in the nodemask passed to it. find_last_bit will find that bit. You still need to add 1 to it as you do with numa_numa_configured_nodes() though, due to the kernel decrementing it by one erroneously as you've pointed out in your comment. drew
[Qemu-devel] [PATCH] kvm: warn if num cpus is greater than num recommended
The comment in kvm_max_vcpus() states that it's using the recommended procedure from the kernel API documentation to get the max number of vcpus that kvm supports. It is, but by always returning the maximum number supported. The maximum number should only be used for development purposes. qemu should check KVM_CAP_NR_VCPUS for the recommended number of vcpus. This patch adds a warning if a user specifies a number of cpus between the recommended and max. Signed-off-by: Andrew Jones --- kvm-all.c | 45 +++-- 1 file changed, 27 insertions(+), 18 deletions(-) diff --git a/kvm-all.c b/kvm-all.c index 716860f617455..9092e13ae60ea 100644 --- a/kvm-all.c +++ b/kvm-all.c @@ -1313,24 +1313,24 @@ static int kvm_irqchip_create(KVMState *s) return 0; } -static int kvm_max_vcpus(KVMState *s) +/* Find number of supported CPUs using the recommended + * procedure from the kernel API documentation to cope with + * older kernels that may be missing capabilities. + */ +static int kvm_recommended_vcpus(KVMState *s) { int ret; -/* Find number of supported CPUs using the recommended - * procedure from the kernel API documentation to cope with - * older kernels that may be missing capabilities. - */ -ret = kvm_check_extension(s, KVM_CAP_MAX_VCPUS); -if (ret) { -return ret; -} ret = kvm_check_extension(s, KVM_CAP_NR_VCPUS); -if (ret) { -return ret; -} +return (ret) ? ret : 4; +} -return 4; +static int kvm_max_vcpus(KVMState *s) +{ +int ret; + +ret = kvm_check_extension(s, KVM_CAP_MAX_VCPUS); +return (ret) ? ret : kvm_recommended_vcpus(s); } int kvm_init(void) @@ -1383,12 +1383,21 @@ int kvm_init(void) goto err; } -max_vcpus = kvm_max_vcpus(s); +max_vcpus = kvm_recommended_vcpus(s); if (smp_cpus > max_vcpus) { -ret = -EINVAL; -fprintf(stderr, "Number of SMP cpus requested (%d) exceeds max cpus " -"supported by KVM (%d)\n", smp_cpus, max_vcpus); -goto err; +fprintf(stderr, +"Warning: Number of SMP cpus requested (%d) exceeds " +"recommended cpus supported by KVM (%d)\n", +smp_cpus, max_vcpus); + +max_vcpus = kvm_max_vcpus(s); +if (smp_cpus > max_vcpus) { +ret = -EINVAL; +fprintf(stderr, "Number of SMP cpus requested (%d) exceeds " +"max cpus supported by KVM (%d)\n", +smp_cpus, max_vcpus); +goto err; +} } s->vmfd = kvm_ioctl(s, KVM_CREATE_VM, 0); -- 1.8.1.4
Re: [Qemu-devel] [PATCH V9 06/12] NUMA: Add Linux libnuma detection
- Original Message - > Add detection of libnuma (mostly contained in the numactl package) > to the configure script. Can be enabled or disabled on the command line, > default is use if available. > > Signed-off-by: Andre Przywara > Signed-off-by: Wanlong Gao Is this patch still necessary? I thought that dropping the numa_num_configured_nodes() calls from patch 8/12 got rid of the need for this library. Maybe I missed other uses? drew
Re: [Qemu-devel] [PATCH V9 08/12] NUMA: set guest numa nodes memory policy
- Original Message - > Set the guest numa nodes memory policies using the mbind(2) > system call node by node. > After this patch, we are able to set guest nodes memory policies > through the QEMU options, this arms to solve the guest cross > nodes memory access performance issue. > And as you all know, if PCI-passthrough is used, > direct-attached-device uses DMA transfer between device and qemu process. > All pages of the guest will be pinned by get_user_pages(). > > KVM_ASSIGN_PCI_DEVICE ioctl > kvm_vm_ioctl_assign_device() > =>kvm_assign_device() > => kvm_iommu_map_memslots() > => kvm_iommu_map_pages() >=> kvm_pin_pages() > > So, with direct-attached-device, all guest page's page count will be +1 and > any page migration will not work. AutoNUMA won't too. > > So, we should set the guest nodes memory allocation policies before > the pages are really mapped. > > Signed-off-by: Andre Przywara > Signed-off-by: Wanlong Gao > --- > numa.c | 90 > ++ > 1 file changed, 90 insertions(+) > > diff --git a/numa.c b/numa.c > index 4ccc6cb..4a9c368 100644 > --- a/numa.c > +++ b/numa.c > @@ -28,6 +28,16 @@ > #include "qapi-visit.h" > #include "qapi/opts-visitor.h" > #include "qapi/dealloc-visitor.h" > +#include "exec/memory.h" > + > +#ifdef CONFIG_NUMA > +#include > +#include > +#ifndef MPOL_F_RELATIVE_NODES > +#define MPOL_F_RELATIVE_NODES (1 << 14) > +#define MPOL_F_STATIC_NODES (1 << 15) > +#endif > +#endif > > QemuOptsList qemu_numa_opts = { > .name = "numa", > @@ -219,6 +229,79 @@ void set_numa_nodes(void) > } > } > > +#ifdef CONFIG_NUMA > +static int node_parse_bind_mode(unsigned int nodeid) > +{ > +int bind_mode; > + > +switch (numa_info[nodeid].policy) { > +case NUMA_NODE_POLICY_MEMBIND: > +bind_mode = MPOL_BIND; > +break; > +case NUMA_NODE_POLICY_INTERLEAVE: > +bind_mode = MPOL_INTERLEAVE; > +break; > +case NUMA_NODE_POLICY_PREFERRED: > +bind_mode = MPOL_PREFERRED; > +break; > +case NUMA_NODE_POLICY_DEFAULT: > +default: > +bind_mode = MPOL_DEFAULT; > +return bind_mode; > +} > + > +bind_mode |= numa_info[nodeid].relative ? > +MPOL_F_RELATIVE_NODES : MPOL_F_STATIC_NODES; > + > +return bind_mode; > +} > +#endif > + > +static int set_node_mem_policy(int nodeid) > +{ > +#ifdef CONFIG_NUMA > +void *ram_ptr; > +RAMBlock *block; > +ram_addr_t len, ram_offset = 0; > +int bind_mode; > +int i; > + > +QTAILQ_FOREACH(block, &ram_list.blocks, next) { > +if (!strcmp(block->mr->name, "pc.ram")) { > +break; > +} > +} > + > +if (block->host == NULL) { > +return -1; > +} > + > +ram_ptr = block->host; > +for (i = 0; i < nodeid; i++) { > +len = numa_info[i].node_mem; > +ram_offset += len; > +} > + > +len = numa_info[nodeid].node_mem; > +bind_mode = node_parse_bind_mode(nodeid); > +unsigned long *nodes = numa_info[nodeid].host_mem; > + > +/* This is a workaround for a long standing bug in Linux' > + * mbind implementation, which cuts off the last specified > + * node. To stay compatible should this bug be fixed, we > + * specify one more node and zero this one out. > + */ > +unsigned long maxnode = find_last_bit(nodes, MAX_CPUMASK_BITS); > +clear_bit(maxnode + 1, nodes); This clear_bit() isn't necessary. We know that maxnode+1 is certainly already clear, because find_last_bit() just returned maxnode as the last "non-clear" bit. > +if (mbind(ram_ptr + ram_offset, len, bind_mode, nodes, maxnode + 1, 0)) > { > +perror("mbind"); > +return -1; > +} > +#endif > + > +return 0; > +} > + > void set_numa_modes(void) > { > CPUState *cpu; > @@ -231,4 +314,11 @@ void set_numa_modes(void) > } > } > } > + > +for (i = 0; i < nb_numa_nodes; i++) { > +if (set_node_mem_policy(i) == -1) { > +fprintf(stderr, > +"qemu: can not set host memory policy for node%d\n", i); > +} > +} > } > -- > 1.8.4.rc4 > > >
Re: [Qemu-devel] [PATCH] kvm: warn if num cpus is greater than num recommended
- Original Message - > Am 22.08.2013 18:12, schrieb Eduardo Habkost: > > > > On 22/08/2013, at 12:39, Andrew Jones wrote: > > > >> The comment in kvm_max_vcpus() states that it's using the recommended > >> procedure from the kernel API documentation to get the max number > >> of vcpus that kvm supports. It is, but by always returning the > >> maximum number supported. The maximum number should only be used > >> for development purposes. qemu should check KVM_CAP_NR_VCPUS for > >> the recommended number of vcpus. This patch adds a warning if a user > >> specifies a number of cpus between the recommended and max. > >> > >> Signed-off-by: Andrew Jones > > > > CCing libvir-list. It is probably interesting for libvirt to expose or warn > > about the recommended VCPU limit somehow, and in this case a simple > > warning on stderr won't be enough. > > > >> --- > >> kvm-all.c | 45 +++-- > >> 1 file changed, 27 insertions(+), 18 deletions(-) > >> > >> diff --git a/kvm-all.c b/kvm-all.c > >> index 716860f617455..9092e13ae60ea 100644 > >> --- a/kvm-all.c > >> +++ b/kvm-all.c > >> @@ -1313,24 +1313,24 @@ static int kvm_irqchip_create(KVMState *s) > >> return 0; > >> } > >> > >> -static int kvm_max_vcpus(KVMState *s) > >> +/* Find number of supported CPUs using the recommended > >> + * procedure from the kernel API documentation to cope with > >> + * older kernels that may be missing capabilities. > >> + */ > >> +static int kvm_recommended_vcpus(KVMState *s) > >> { > >> int ret; > >> > >> -/* Find number of supported CPUs using the recommended > >> - * procedure from the kernel API documentation to cope with > >> - * older kernels that may be missing capabilities. > >> - */ > >> -ret = kvm_check_extension(s, KVM_CAP_MAX_VCPUS); > >> -if (ret) { > >> -return ret; > >> -} > >> ret = kvm_check_extension(s, KVM_CAP_NR_VCPUS); > >> -if (ret) { > >> -return ret; > >> -} > >> +return (ret) ? ret : 4; > >> +} > >> > >> -return 4; > >> +static int kvm_max_vcpus(KVMState *s) > >> +{ > >> +int ret; > >> + > >> +ret = kvm_check_extension(s, KVM_CAP_MAX_VCPUS); > >> +return (ret) ? ret : kvm_recommended_vcpus(s); > >> } > >> > >> int kvm_init(void) > >> @@ -1383,12 +1383,21 @@ int kvm_init(void) > >> goto err; > >> } > >> > >> -max_vcpus = kvm_max_vcpus(s); > >> +max_vcpus = kvm_recommended_vcpus(s); > >> if (smp_cpus > max_vcpus) { > >> -ret = -EINVAL; > >> -fprintf(stderr, "Number of SMP cpus requested (%d) exceeds max > >> cpus " > >> -"supported by KVM (%d)\n", smp_cpus, max_vcpus); > >> -goto err; > >> +fprintf(stderr, > >> +"Warning: Number of SMP cpus requested (%d) exceeds " > >> +"recommended cpus supported by KVM (%d)\n", > >> +smp_cpus, max_vcpus); > >> + > >> +max_vcpus = kvm_max_vcpus(s); > >> +if (smp_cpus > max_vcpus) { > >> +ret = -EINVAL; > >> +fprintf(stderr, "Number of SMP cpus requested (%d) exceeds " > >> +"max cpus supported by KVM (%d)\n", > >> +smp_cpus, max_vcpus); > >> +goto err; > >> +} > > Should at least the fatal one use the new error_report()? So far kvm-all.c doesn't use error_report(). I'm inclined to leave it that way for now, for the scope of this patch anyway. Maybe we should convert all of kvm-all.c at some point though? > > >> } > >> > >> s->vmfd = kvm_ioctl(s, KVM_CREATE_VM, 0); > > I notice that only checks in kvm_init() based on smp_cpus are touched > herein. Should we add similar checks to CPU hot-add code and thus > possibly move that into some per-vCPU code path? > That's a better question for hot-plug folk. Does smp_cpus map to the current number of cpus, or to the number of possible cpus? If it maps to the number of possible cpus, then this is the right place. If the former, then I guess it'll take more thought. I'ved added Igor (still on vacation) to this reply, but regardless I vote we worry about hot-plug limit checking in different patch. thanks, drew > Regards, > Andreas > > -- > SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany > GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg >
Re: [Qemu-devel] [PATCH] kvm: warn if num cpus is greater than num recommended
- Original Message - > Il 22/08/2013 17:39, Andrew Jones ha scritto: > > The comment in kvm_max_vcpus() states that it's using the recommended > > procedure from the kernel API documentation to get the max number > > of vcpus that kvm supports. It is, but by always returning the > > maximum number supported. The maximum number should only be used > > for development purposes. qemu should check KVM_CAP_NR_VCPUS for > > the recommended number of vcpus. This patch adds a warning if a user > > specifies a number of cpus between the recommended and max. > > > > Signed-off-by: Andrew Jones > > --- > > kvm-all.c | 45 +++-- > > 1 file changed, 27 insertions(+), 18 deletions(-) > > > > diff --git a/kvm-all.c b/kvm-all.c > > index 716860f617455..9092e13ae60ea 100644 > > --- a/kvm-all.c > > +++ b/kvm-all.c > > @@ -1313,24 +1313,24 @@ static int kvm_irqchip_create(KVMState *s) > > return 0; > > } > > > > -static int kvm_max_vcpus(KVMState *s) > > +/* Find number of supported CPUs using the recommended > > + * procedure from the kernel API documentation to cope with > > + * older kernels that may be missing capabilities. > > + */ > > +static int kvm_recommended_vcpus(KVMState *s) > > { > > int ret; > > > > -/* Find number of supported CPUs using the recommended > > - * procedure from the kernel API documentation to cope with > > - * older kernels that may be missing capabilities. > > - */ > > -ret = kvm_check_extension(s, KVM_CAP_MAX_VCPUS); > > -if (ret) { > > -return ret; > > -} > > ret = kvm_check_extension(s, KVM_CAP_NR_VCPUS); > > -if (ret) { > > -return ret; > > -} > > +return (ret) ? ret : 4; > > +} > > > > -return 4; > > +static int kvm_max_vcpus(KVMState *s) > > +{ > > +int ret; > > + > > +ret = kvm_check_extension(s, KVM_CAP_MAX_VCPUS); > > +return (ret) ? ret : kvm_recommended_vcpus(s); > > } > > > > int kvm_init(void) > > @@ -1383,12 +1383,21 @@ int kvm_init(void) > > goto err; > > } > > > > -max_vcpus = kvm_max_vcpus(s); > > +max_vcpus = kvm_recommended_vcpus(s); > > if (smp_cpus > max_vcpus) { > > -ret = -EINVAL; > > -fprintf(stderr, "Number of SMP cpus requested (%d) exceeds max > > cpus " > > -"supported by KVM (%d)\n", smp_cpus, max_vcpus); > > -goto err; > > +fprintf(stderr, > > +"Warning: Number of SMP cpus requested (%d) exceeds " > > +"recommended cpus supported by KVM (%d)\n", > > +smp_cpus, max_vcpus); > > + > > +max_vcpus = kvm_max_vcpus(s); > > +if (smp_cpus > max_vcpus) { > > +ret = -EINVAL; > > +fprintf(stderr, "Number of SMP cpus requested (%d) exceeds " > > +"max cpus supported by KVM (%d)\n", > > +smp_cpus, max_vcpus); > > +goto err; > > +} > > You print both error messages when smp_cpus is greater than the max cpus > supported; is it intentional? > Yup. This way we can inform the user not only that they're greater than hard-max, but also what the soft-max is. This allows the user to choose how much lower to adjust when they correct for the hard-max, possibly going down low enough to avoid blowing the soft-max as well. > Apart from this, the concept looks good. However, please over > qemu-kvm.git's uq/master branch, where we already have Marcelo's patch > to check max_cpus too against kvm_max_vcpus(s). > OK, will respin on uq/master. thanks, drew
[Qemu-devel] [PATCH v2] kvm: warn if num cpus is greater than num recommended
The comment in kvm_max_vcpus() states that it's using the recommended procedure from the kernel API documentation to get the max number of vcpus that kvm supports. It is, but by always returning the maximum number supported. The maximum number should only be used for development purposes. qemu should check KVM_CAP_NR_VCPUS for the recommended number of vcpus. This patch adds a warning if a user specifies a number of cpus between the recommended and max. v2: Incorporate tests for max_cpus, which specifies the maximum number of hotpluggable cpus. An additional note is that the message for the fail case was slightly changed, 'exceeds max cpus' to 'exceeds the maximum cpus'. If this is unacceptable change for users like libvirt, then I'll need to spin a v3. Signed-off-by: Andrew Jones --- kvm-all.c | 69 --- 1 file changed, 40 insertions(+), 29 deletions(-) diff --git a/kvm-all.c b/kvm-all.c index a2d49786365e3..021f5f47e53da 100644 --- a/kvm-all.c +++ b/kvm-all.c @@ -1322,24 +1322,20 @@ static int kvm_irqchip_create(KVMState *s) return 0; } -static int kvm_max_vcpus(KVMState *s) +/* Find number of supported CPUs using the recommended + * procedure from the kernel API documentation to cope with + * older kernels that may be missing capabilities. + */ +static int kvm_recommended_vcpus(KVMState *s) { -int ret; - -/* Find number of supported CPUs using the recommended - * procedure from the kernel API documentation to cope with - * older kernels that may be missing capabilities. - */ -ret = kvm_check_extension(s, KVM_CAP_MAX_VCPUS); -if (ret) { -return ret; -} -ret = kvm_check_extension(s, KVM_CAP_NR_VCPUS); -if (ret) { -return ret; -} +int ret = kvm_check_extension(s, KVM_CAP_NR_VCPUS); +return (ret) ? ret : 4; +} -return 4; +static int kvm_max_vcpus(KVMState *s) +{ +int ret = kvm_check_extension(s, KVM_CAP_MAX_VCPUS); +return (ret) ? ret : kvm_recommended_vcpus(s); } int kvm_init(void) @@ -1347,11 +1343,19 @@ int kvm_init(void) static const char upgrade_note[] = "Please upgrade to at least kernel 2.6.29 or recent kvm-kmod\n" "(see http://sourceforge.net/projects/kvm).\n"; +struct { +const char *name; +int num; +} num_cpus[] = { +{ "SMP", smp_cpus }, +{ "hotpluggable", max_cpus }, +{ NULL, } +}, *nc = num_cpus; +int soft_vcpus_limit, hard_vcpus_limit; KVMState *s; const KVMCapabilityInfo *missing_cap; int ret; int i; -int max_vcpus; s = g_malloc0(sizeof(KVMState)); @@ -1392,19 +1396,26 @@ int kvm_init(void) goto err; } -max_vcpus = kvm_max_vcpus(s); -if (smp_cpus > max_vcpus) { -ret = -EINVAL; -fprintf(stderr, "Number of SMP cpus requested (%d) exceeds max cpus " -"supported by KVM (%d)\n", smp_cpus, max_vcpus); -goto err; -} +/* check the vcpu limits */ +soft_vcpus_limit = kvm_recommended_vcpus(s); +hard_vcpus_limit = kvm_max_vcpus(s); -if (max_cpus > max_vcpus) { -ret = -EINVAL; -fprintf(stderr, "Number of hotpluggable cpus requested (%d) exceeds max cpus " -"supported by KVM (%d)\n", max_cpus, max_vcpus); -goto err; +while (nc->name) { +if (nc->num > soft_vcpus_limit) { +fprintf(stderr, +"Warning: Number of %s cpus requested (%d) exceeds " +"the recommended cpus supported by KVM (%d)\n", +nc->name, nc->num, soft_vcpus_limit); + +if (nc->num > hard_vcpus_limit) { +ret = -EINVAL; +fprintf(stderr, "Number of %s cpus requested (%d) exceeds " +"the maximum cpus supported by KVM (%d)\n", +nc->name, nc->num, hard_vcpus_limit); +goto err; +} +} +nc++; } s->vmfd = kvm_ioctl(s, KVM_CREATE_VM, 0); -- 1.8.1.4
Re: [Qemu-devel] [PATCH V9 07/12] NUMA: parse guest numa nodes memory policy
- Original Message - > The memory policy setting format is like: > > policy={default|membind|interleave|preferred}[,relative=true],host-nodes=N-N > And we are adding this setting as a suboption of "-numa mem,", > the memory policy then can be set like following: > -numa node,nodeid=0,cpus=0 \ > -numa node,nodeid=1,cpus=1 \ > -numa mem,nodeid=0,size=1G,policy=membind,host-nodes=0-1 \ > -numa mem,nodeid=1,size=1G,policy=interleave,relative=true,host-nodes=1 > > Signed-off-by: Wanlong Gao > --- > include/sysemu/sysemu.h | 3 +++ > numa.c | 13 + > qapi-schema.json| 31 +-- > vl.c| 3 +++ > 4 files changed, 48 insertions(+), 2 deletions(-) > > diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h > index b683d08..81d16a5 100644 > --- a/include/sysemu/sysemu.h > +++ b/include/sysemu/sysemu.h > @@ -134,6 +134,9 @@ extern int nb_numa_mem_nodes; > typedef struct node_info { > uint64_t node_mem; > DECLARE_BITMAP(node_cpu, MAX_CPUMASK_BITS); > +DECLARE_BITMAP(host_mem, MAX_CPUMASK_BITS); > +NumaNodePolicy policy; > +bool relative; > } NodeInfo; > extern NodeInfo numa_info[MAX_NODES]; > extern QemuOptsList qemu_numa_opts; > diff --git a/numa.c b/numa.c > index 3e2dfc1..4ccc6cb 100644 > --- a/numa.c > +++ b/numa.c > @@ -74,6 +74,7 @@ static int numa_mem_parse(NumaMemOptions *opts) > { > uint16_t nodenr; > uint64_t mem_size; > +uint16List *nodes; > > if (opts->has_nodeid) { > nodenr = opts->nodeid; > @@ -91,6 +92,18 @@ static int numa_mem_parse(NumaMemOptions *opts) > numa_info[nodenr].node_mem = mem_size; > } > > +if (opts->has_policy) { > +numa_info[nodenr].policy = opts->policy; > +} > + > +if (opts->has_relative) { > +numa_info[nodenr].relative = opts->relative; > +} > + > +for (nodes = opts->host_nodes; nodes; nodes = nodes->next) { > +bitmap_set(numa_info[nodenr].host_mem, nodes->value, 1); > +} > + > return 0; > } > > diff --git a/qapi-schema.json b/qapi-schema.json > index 11851a1..650741f 100644 > --- a/qapi-schema.json > +++ b/qapi-schema.json > @@ -3806,6 +3806,24 @@ > '*mem':'str' }} > > ## > +# @NumaNodePolicy > +# > +# NUMA node policy types > +# > +# @default: restore default policy, remove any nondefault policy > +# > +# @membind: a strict policy that restricts memory allocation to the > +# nodes specified > +# > +# @interleave: the page allocations is interleaved across the set > +# of nodes specified > +# > +# @preferred: set the preferred node for allocation > +## > +{ 'enum': 'NumaNodePolicy', > + 'data': [ 'default', 'membind', 'interleave', 'preferred' ] } > + > +## > # @NumaMemOptions > # > # Set memory information of guest NUMA node. (for OptsVisitor) > @@ -3814,9 +3832,18 @@ > # > # @size: #optional memory size of this node > # > +# @policy: #optional memory policy of this node > +# > +# @relative: #optional if the nodes specified are relative > +# > +# @host-nodes: #optional host nodes for its memory policy > +# > # Since 1.7 > ## > { 'type': 'NumaMemOptions', >'data': { > - '*nodeid': 'uint16', > - '*size': 'size' }} > + '*nodeid': 'uint16', > + '*size': 'size', > + '*policy': 'NumaNodePolicy', > + '*relative': 'bool', > + '*host-nodes': ['uint16'] }} > diff --git a/vl.c b/vl.c > index 2377b67..91b0d76 100644 > --- a/vl.c > +++ b/vl.c > @@ -2888,6 +2888,9 @@ int main(int argc, char **argv, char **envp) > for (i = 0; i < MAX_NODES; i++) { > numa_info[i].node_mem = 0; > bitmap_zero(numa_info[i].node_cpu, MAX_CPUMASK_BITS); > +bitmap_zero(numa_info[i].host_mem, MAX_CPUMASK_BITS); Shouldn't the bitmap size of host_mem be MAX_NODES? If so, and you change it, then make sure the find_last_bit() call also gets updated in patch 8/12, and anywhere else needed. drew > +numa_info[i].policy = NUMA_NODE_POLICY_DEFAULT; > +numa_info[i].relative = false; > } > > nb_numa_nodes = 0; > -- > 1.8.4.rc4 > > >
Re: [Qemu-devel] [PATCH V9 07/12] NUMA: parse guest numa nodes memory policy
- Original Message - > On 08/23/2013 10:11 PM, Andrew Jones wrote: > > > > > > - Original Message - > >> The memory policy setting format is like: > >> > >> policy={default|membind|interleave|preferred}[,relative=true],host-nodes=N-N > >> And we are adding this setting as a suboption of "-numa mem,", > >> the memory policy then can be set like following: > >> -numa node,nodeid=0,cpus=0 \ > >> -numa node,nodeid=1,cpus=1 \ > >> -numa mem,nodeid=0,size=1G,policy=membind,host-nodes=0-1 \ > >> -numa > >> mem,nodeid=1,size=1G,policy=interleave,relative=true,host-nodes=1 > >> > >> Signed-off-by: Wanlong Gao > >> --- > >> include/sysemu/sysemu.h | 3 +++ > >> numa.c | 13 + > >> qapi-schema.json| 31 +-- > >> vl.c| 3 +++ > >> 4 files changed, 48 insertions(+), 2 deletions(-) > >> > >> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h > >> index b683d08..81d16a5 100644 > >> --- a/include/sysemu/sysemu.h > >> +++ b/include/sysemu/sysemu.h > >> @@ -134,6 +134,9 @@ extern int nb_numa_mem_nodes; > >> typedef struct node_info { > >> uint64_t node_mem; > >> DECLARE_BITMAP(node_cpu, MAX_CPUMASK_BITS); > >> +DECLARE_BITMAP(host_mem, MAX_CPUMASK_BITS); > >> +NumaNodePolicy policy; > >> +bool relative; > >> } NodeInfo; > >> extern NodeInfo numa_info[MAX_NODES]; > >> extern QemuOptsList qemu_numa_opts; > >> diff --git a/numa.c b/numa.c > >> index 3e2dfc1..4ccc6cb 100644 > >> --- a/numa.c > >> +++ b/numa.c > >> @@ -74,6 +74,7 @@ static int numa_mem_parse(NumaMemOptions *opts) > >> { > >> uint16_t nodenr; > >> uint64_t mem_size; > >> +uint16List *nodes; > >> > >> if (opts->has_nodeid) { > >> nodenr = opts->nodeid; > >> @@ -91,6 +92,18 @@ static int numa_mem_parse(NumaMemOptions *opts) > >> numa_info[nodenr].node_mem = mem_size; > >> } > >> > >> +if (opts->has_policy) { > >> +numa_info[nodenr].policy = opts->policy; > >> +} > >> + > >> +if (opts->has_relative) { > >> +numa_info[nodenr].relative = opts->relative; > >> +} > >> + > >> +for (nodes = opts->host_nodes; nodes; nodes = nodes->next) { > >> +bitmap_set(numa_info[nodenr].host_mem, nodes->value, 1); > >> +} > >> + > >> return 0; > >> } > >> > >> diff --git a/qapi-schema.json b/qapi-schema.json > >> index 11851a1..650741f 100644 > >> --- a/qapi-schema.json > >> +++ b/qapi-schema.json > >> @@ -3806,6 +3806,24 @@ > >> '*mem':'str' }} > >> > >> ## > >> +# @NumaNodePolicy > >> +# > >> +# NUMA node policy types > >> +# > >> +# @default: restore default policy, remove any nondefault policy > >> +# > >> +# @membind: a strict policy that restricts memory allocation to the > >> +# nodes specified > >> +# > >> +# @interleave: the page allocations is interleaved across the set > >> +# of nodes specified > >> +# > >> +# @preferred: set the preferred node for allocation > >> +## > >> +{ 'enum': 'NumaNodePolicy', > >> + 'data': [ 'default', 'membind', 'interleave', 'preferred' ] } > >> + > >> +## > >> # @NumaMemOptions > >> # > >> # Set memory information of guest NUMA node. (for OptsVisitor) > >> @@ -3814,9 +3832,18 @@ > >> # > >> # @size: #optional memory size of this node > >> # > >> +# @policy: #optional memory policy of this node > >> +# > >> +# @relative: #optional if the nodes specified are relative > >> +# > >> +# @host-nodes: #optional host nodes for its memory policy > >> +# > >> # Since 1.7 > >> ## > >> { 'type': 'NumaMemOptions', > >>'data': { > >> - '*nodeid': 'uint16', > >> - '*size': 'size' }} > >> + '*nodeid': 'uint16', > >>
Re: [Qemu-devel] [PATCH V9 06/12] NUMA: Add Linux libnuma detection
- Original Message - > On 08/23/2013 04:40 PM, Andrew Jones wrote: > > > > > > - Original Message - > >> Add detection of libnuma (mostly contained in the numactl package) > >> to the configure script. Can be enabled or disabled on the command line, > >> default is use if available. > >> > >> Signed-off-by: Andre Przywara > >> Signed-off-by: Wanlong Gao > > > > Is this patch still necessary? I thought that dropping the > > numa_num_configured_nodes() calls from patch 8/12 got rid > > of the need for this library. Maybe I missed other uses? > > Yes, in 08/12 we also use mbind(), You don't need a whole library for mbind(), it's a syscall. See syscall(2). > and in 09/12 we use max_numa_node(). Really? I didn't see it there. And anyway, that goes back to our discussion about setting qemu's MAX_NODES to whatever we think qemu should support, and then just checking that we don't blow that limit whenever reading host node info, i.e. maxnode = 0; while (host_nodes[maxnode] && maxnode < MAX_NODES) node_read(&info[maxnode++]); type of a thing. And, if there's a place you really need to know the current online number of host nodes, then, like I said earlier, you should just go to sysfs yourself. libnuma:numa_max_node() returns an int that it only initializes at library load time, so it's not going to adapt to onlining/offlining. drew > > Thanks, > Wanlong Gao > > > > > drew > > > > >
Re: [Qemu-devel] [PATCH] kvm: warn if num cpus is greater than num recommended
- Original Message - > Il 23/08/2013 13:33, Andrew Jones ha scritto: > > Does smp_cpus map to the current > > number of cpus, or to the number of possible cpus? If it maps to the number > > of possible cpus, then this is the right place. If the former, then I guess > > it'll take more thought. I'ved added Igor (still on vacation) to this > > reply, > > but regardless I vote we worry about hot-plug limit checking in different > > patch. > > smp_cpus is the initial number, max_cpus is the number of possible cpus. > Yeah, I noticed that this issue is at least partially addressed already with my v2, which incorporates Marcelo's check against the number of hotpluggable cpus (max_cpus). drew
Re: [Qemu-devel] [PATCH V9 06/12] NUMA: Add Linux libnuma detection
- Original Message - > On 08/26/2013 03:46 PM, Andrew Jones wrote: > >>> Is this patch still necessary? I thought that dropping the > >>> > > numa_num_configured_nodes() calls from patch 8/12 got rid > >>> > > of the need for this library. Maybe I missed other uses? > >> > > >> > Yes, in 08/12 we also use mbind(), > > You don't need a whole library for mbind(), it's a syscall. See syscall(2). > > > >> > and in 09/12 we use max_numa_node(). > > Really? I didn't see it there. And anyway, that goes back to our discussion > > about setting qemu's MAX_NODES to whatever we think qemu should support, > > and then just checking that we don't blow that limit whenever reading > > host node info, i.e. > > > > maxnode = 0; > > while (host_nodes[maxnode] && maxnode < MAX_NODES) > > node_read(&info[maxnode++]); > > > > type of a thing. > > > > And, if there's a place you really need to know the current online number > > of host nodes, then, like I said earlier, you should just go to sysfs > > yourself. libnuma:numa_max_node() returns an int that it only initializes > > at library load time, so it's not going to adapt to onlining/offlining. > > OK, thank you. > Then I should define MPOL_* macros in QEMU and use mbind(2) syscall directly, > right? Hmm, yeah, that's too bad that numaif.h is part of libnuma, and not a more general lib. Whether or not we want to redefine those symbols within qemu, in order to avoid the dependency on installing numactl-devel, isn't something I can answer. That's a better question for Anthony. Anthony? Paolo, any opinions? Maybe we should pick up uapi/linux/mempolicy.h with the linux-header synch script? thanks, drew > > Thanks, > Wanlong Gao > > > > > drew > > > >
Re: [Qemu-devel] [libvirt] [PATCH v2] kvm: warn if num cpus is greater than num recommended
- Original Message - > On 08/23/2013 07:24 AM, Andrew Jones wrote: > > The comment in kvm_max_vcpus() states that it's using the recommended > > procedure from the kernel API documentation to get the max number > > of vcpus that kvm supports. It is, but by always returning the > > maximum number supported. The maximum number should only be used > > for development purposes. qemu should check KVM_CAP_NR_VCPUS for > > the recommended number of vcpus. This patch adds a warning if a user > > specifies a number of cpus between the recommended and max. > > > > v2: > > Incorporate tests for max_cpus, which specifies the maximum number > > of hotpluggable cpus. An additional note is that the message for > > the fail case was slightly changed, 'exceeds max cpus' to > > 'exceeds the maximum cpus'. If this is unacceptable change for > > users like libvirt, then I'll need to spin a v3. > > A quick grep of libvirt does not show any dependence on the particular > wording "exceeds max cpus", so you are probably fine changing that. > > What I'm more worried about is what number is libvirt supposed to show > to the end user, and should libvirt enforce the lower recommended max, > or the larger kernel absolute max? Which of the two values does the QMP > 'MachineInfo' type return in its 'cpu-max' field during the > 'query-machines' command? Should we be modifying QMP to return both > values, so that libvirt can also expose the logic to the end user of > allowing a recommended vs. larger development max? > Machine definitions maintain yet another 'max_cpus'. And it appears that qmp would return that value. It would probably be best if it returned max(qemu_machine.max_cpus, kvm_max_cpus) though. I'm starting to think that we should just keep things simple for most of the virt stack by sticking to enforcing the larger developer max. And then on a production kernel we should just compile KVM_MAX_VCPUS = KVM_SOFT_MAX_VCPUS and be done with it. With that thought, this patch could be dropped too. The alternative seems to be supporting a run-time selectable experimental mode throughout the whole virt stack. drew
Re: [Qemu-devel] [PATCH V9 06/12] NUMA: Add Linux libnuma detection
- Original Message - > On 08/28/2013 09:44 PM, Paolo Bonzini wrote: > > Il 26/08/2013 10:43, Andrew Jones ha scritto: > >> > >> - Original Message - > >>>> On 08/26/2013 03:46 PM, Andrew Jones wrote: > >>>>>>>>>> Is this patch still necessary? I thought that dropping the > >>>>>>>>>>>>>> numa_num_configured_nodes() calls from patch 8/12 got rid > >>>>>>>>>>>>>> of the need for this library. Maybe I missed other uses? > >>>>>>>>>> > >>>>>>>>>> Yes, in 08/12 we also use mbind(), > >>>>>> You don't need a whole library for mbind(), it's a syscall. See > >>>>>> syscall(2). > >>>>>> > >>>>>>>>>> and in 09/12 we use max_numa_node(). > >>>>>> Really? I didn't see it there. And anyway, that goes back to our > >>>>>> discussion > >>>>>> about setting qemu's MAX_NODES to whatever we think qemu should > >>>>>> support, > >>>>>> and then just checking that we don't blow that limit whenever reading > >>>>>> host node info, i.e. > >>>>>> > >>>>>> maxnode = 0; > >>>>>> while (host_nodes[maxnode] && maxnode < MAX_NODES) > >>>>>> node_read(&info[maxnode++]); > >>>>>> > >>>>>> type of a thing. > >>>>>> > >>>>>> And, if there's a place you really need to know the current online > >>>>>> number > >>>>>> of host nodes, then, like I said earlier, you should just go to sysfs > >>>>>> yourself. libnuma:numa_max_node() returns an int that it only > >>>>>> initializes > >>>>>> at library load time, so it's not going to adapt to > >>>>>> onlining/offlining. > >>>> > >>>> OK, thank you. > >>>> Then I should define MPOL_* macros in QEMU and use mbind(2) syscall > >>>> directly, > >>>> right? > >> Hmm, yeah, that's too bad that numaif.h is part of libnuma, and not a more > >> general lib. Whether or not we want to redefine those symbols within > >> qemu, in order to avoid the dependency on installing numactl-devel, isn't > >> something I can answer. That's a better question for Anthony. Anthony? > >> Paolo, > >> any opinions? Maybe we should pick up uapi/linux/mempolicy.h with the > >> linux-header synch script? > >> > > > > I think using libnuma is fine. In principle this could be used on other > > OSes than Linux, I think? > > But seems that mbind(2) is Linux-specific syscall, right? > You would need to avoid directly calling mbind, i.e. use libnuma for all numa related calls. Then, if libnuma were to support more OSes, qemu would automatically (wrt to numa) as well. Your mbind() with libnuma would look like this numa_set_bind_policy(strict) numa_tonodemask_memory(addr, size, nodemask) The problem is that set_bind_policy only takes a bool, and thus only allows two of the four possibly policies MPOL_BINDstrict == 1 MPOL_PREFERRED strict == 0 So, due to libnuma's policy setting limitations, and the fact it doesn't currently support more OSes than Linux, then I prefer your current series version that drops libnuma. If qemu will need to support NUMA on another OS, then we can cross this bridge when we get there. drew
Re: [Qemu-devel] [PATCH V9 06/12] NUMA: Add Linux libnuma detection
- Original Message - > > > - Original Message - > > On 08/28/2013 09:44 PM, Paolo Bonzini wrote: > > > Il 26/08/2013 10:43, Andrew Jones ha scritto: > > >> > > >> - Original Message ----- > > >>>> On 08/26/2013 03:46 PM, Andrew Jones wrote: > > >>>>>>>>>> Is this patch still necessary? I thought that dropping the > > >>>>>>>>>>>>>> numa_num_configured_nodes() calls from patch 8/12 got rid > > >>>>>>>>>>>>>> of the need for this library. Maybe I missed other uses? > > >>>>>>>>>> > > >>>>>>>>>> Yes, in 08/12 we also use mbind(), > > >>>>>> You don't need a whole library for mbind(), it's a syscall. See > > >>>>>> syscall(2). > > >>>>>> > > >>>>>>>>>> and in 09/12 we use max_numa_node(). > > >>>>>> Really? I didn't see it there. And anyway, that goes back to our > > >>>>>> discussion > > >>>>>> about setting qemu's MAX_NODES to whatever we think qemu should > > >>>>>> support, > > >>>>>> and then just checking that we don't blow that limit whenever > > >>>>>> reading > > >>>>>> host node info, i.e. > > >>>>>> > > >>>>>> maxnode = 0; > > >>>>>> while (host_nodes[maxnode] && maxnode < MAX_NODES) > > >>>>>> node_read(&info[maxnode++]); > > >>>>>> > > >>>>>> type of a thing. > > >>>>>> > > >>>>>> And, if there's a place you really need to know the current online > > >>>>>> number > > >>>>>> of host nodes, then, like I said earlier, you should just go to > > >>>>>> sysfs > > >>>>>> yourself. libnuma:numa_max_node() returns an int that it only > > >>>>>> initializes > > >>>>>> at library load time, so it's not going to adapt to > > >>>>>> onlining/offlining. > > >>>> > > >>>> OK, thank you. > > >>>> Then I should define MPOL_* macros in QEMU and use mbind(2) syscall > > >>>> directly, > > >>>> right? > > >> Hmm, yeah, that's too bad that numaif.h is part of libnuma, and not a > > >> more > > >> general lib. Whether or not we want to redefine those symbols within > > >> qemu, in order to avoid the dependency on installing numactl-devel, > > >> isn't > > >> something I can answer. That's a better question for Anthony. Anthony? > > >> Paolo, > > >> any opinions? Maybe we should pick up uapi/linux/mempolicy.h with the > > >> linux-header synch script? > > >> > > > > > > I think using libnuma is fine. In principle this could be used on other > > > OSes than Linux, I think? > > > > But seems that mbind(2) is Linux-specific syscall, right? > > > > You would need to avoid directly calling mbind, i.e. use libnuma for all > numa related calls. Then, if libnuma were to support more OSes, qemu would > automatically (wrt to numa) as well. Your mbind() with libnuma would look > like this > > numa_set_bind_policy(strict) > numa_tonodemask_memory(addr, size, nodemask) > > The problem is that set_bind_policy only takes a bool, and thus only > allows two of the four possibly policies > > MPOL_BINDstrict == 1 > MPOL_PREFERRED strict == 0 > Ah, there is a way to get interleave policy if (policy == interleave) { numa_interleave_memory(addr, size, nodemask) } else { numa_set_bind_policy(strict) numa_tonodemask_memory(addr, size, nodemask) } a bit clunky. And I still don't see a way to select MPOL_DEFAULT, nor a way to use any additional flags, such as MPOL_F_RELATIVE_NODES. > So, due to libnuma's policy setting limitations, and the fact it doesn't > currently support more OSes than Linux, then I prefer your current > series version that drops libnuma. If qemu will need to support NUMA on > another OS, then we can cross this bridge when we get there.
[Qemu-devel] [PATCH] ivshmem: use error_report
Replace all the fprintf(stderr, ...) calls with error_report. Signed-off-by: Andrew Jones --- hw/misc/ivshmem.c | 27 +-- 1 file changed, 13 insertions(+), 14 deletions(-) diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c index bf585b7691998..d285df7d65a9f 100644 --- a/hw/misc/ivshmem.c +++ b/hw/misc/ivshmem.c @@ -300,7 +300,7 @@ static CharDriverState* create_eventfd_chr_device(void * opaque, EventNotifier * chr = qemu_chr_open_eventfd(eventfd); if (chr == NULL) { -fprintf(stderr, "creating eventfd for eventfd %d failed\n", eventfd); +error_report("creating eventfd for eventfd %d failed\n", eventfd); exit(-1); } qemu_chr_fe_claim_no_fail(chr); @@ -328,14 +328,13 @@ static int check_shm_size(IVShmemState *s, int fd) { struct stat buf; if (fstat(fd, &buf) < 0) { -fprintf(stderr, "ivshmem: exiting: fstat on fd %d failed: %s\n", +error_report("exiting: fstat on fd %d failed: %s\n", fd, strerror(errno)); return -1; } if (s->ivshmem_size > buf.st_size) { -fprintf(stderr, -"IVSHMEM ERROR: Requested memory size greater" +error_report("Requested memory size greater" " than shared object size (%" PRIu64 " > %" PRIu64")\n", s->ivshmem_size, (uint64_t)buf.st_size); return -1; @@ -510,7 +509,7 @@ static void ivshmem_read(void *opaque, const uint8_t *buf, int size) incoming_fd = dup(tmp_fd); if (incoming_fd == -1) { -fprintf(stderr, "could not allocate file descriptor %s\n", +error_report("could not allocate file descriptor %s\n", strerror(errno)); close(tmp_fd); return; @@ -618,13 +617,13 @@ static uint64_t ivshmem_get_size(IVShmemState * s) { value <<= 30; break; default: -fprintf(stderr, "qemu: invalid ram size: %s\n", s->sizearg); +error_report("invalid ram size: %s\n", s->sizearg); exit(1); } /* BARs must be a power of 2 */ if (!is_power_of_two(value)) { -fprintf(stderr, "ivshmem: size must be power of 2\n"); +error_report("size must be power of 2\n"); exit(1); } @@ -676,7 +675,7 @@ static int ivshmem_load(QEMUFile* f, void *opaque, int version_id) } if (proxy->role_val == IVSHMEM_PEER) { -fprintf(stderr, "ivshmem: 'peer' devices are not migratable\n"); +error_report("'peer' devices are not migratable\n"); return -EINVAL; } @@ -722,7 +721,7 @@ static int pci_ivshmem_init(PCIDevice *dev) /* IRQFD requires MSI */ if (ivshmem_has_feature(s, IVSHMEM_IOEVENTFD) && !ivshmem_has_feature(s, IVSHMEM_MSI)) { -fprintf(stderr, "ivshmem: ioeventfd/irqfd requires MSI\n"); +error_report("ioeventfd/irqfd requires MSI\n"); exit(1); } @@ -733,7 +732,7 @@ static int pci_ivshmem_init(PCIDevice *dev) } else if (strncmp(s->role, "master", 7) == 0) { s->role_val = IVSHMEM_MASTER; } else { -fprintf(stderr, "ivshmem: 'role' must be 'peer' or 'master'\n"); +error_report("'role' must be 'peer' or 'master'\n"); exit(1); } } else { @@ -773,7 +772,7 @@ static int pci_ivshmem_init(PCIDevice *dev) * to the ivshmem server to receive the memory region */ if (s->shmobj != NULL) { -fprintf(stderr, "WARNING: do not specify both 'chardev' " +error_report("WARNING: do not specify both 'chardev' " "and 'shm' with ivshmem\n"); } @@ -802,7 +801,7 @@ static int pci_ivshmem_init(PCIDevice *dev) int fd; if (s->shmobj == NULL) { -fprintf(stderr, "Must specify 'chardev' or 'shm' to ivshmem\n"); +error_report("Must specify 'chardev' or 'shm' to ivshmem\n"); exit(1); } @@ -814,12 +813,12 @@ static int pci_ivshmem_init(PCIDevice *dev) S_IRWXU|S_IRWXG|S_IRWXO)) > 0) { /* truncate file to length PCI device's memory */ if (ftruncate(fd, s->ivshmem_size) != 0) { -fprintf(stderr, "ivshmem: could not truncate shared file\n"); +error_report("could not truncate shared file\n"); } } else if ((fd = shm_open(s->shmobj, O_CREAT|O_RDWR, S_IRWXU|S_IRWXG|S_IRWXO)) < 0) { -fprintf(stderr, "ivshmem: could not open shared file\n"); +error_report("could not open shared file\n"); exit(-1); } -- 1.9.3
Re: [Qemu-devel] [PATCH] ivshmem: use error_report
On Thu, Sep 18, 2014 at 04:46:55PM -0600, Eric Blake wrote: > On 09/18/2014 04:39 PM, Andrew Jones wrote: > > Replace all the fprintf(stderr, ...) calls with error_report. > > > > Signed-off-by: Andrew Jones > > --- > > hw/misc/ivshmem.c | 27 +-- > > 1 file changed, 13 insertions(+), 14 deletions(-) > > > > diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c > > index bf585b7691998..d285df7d65a9f 100644 > > --- a/hw/misc/ivshmem.c > > +++ b/hw/misc/ivshmem.c > > @@ -300,7 +300,7 @@ static CharDriverState* create_eventfd_chr_device(void > > * opaque, EventNotifier * > > chr = qemu_chr_open_eventfd(eventfd); > > > > if (chr == NULL) { > > -fprintf(stderr, "creating eventfd for eventfd %d failed\n", > > eventfd); > > +error_report("creating eventfd for eventfd %d failed\n", eventfd); > > The conversion to error_report() should also drop trailing \n. Thanks for the quick review. I'll send a v2 with this change. > > > exit(-1); > > Another bug (but probably worth cleaning up in a separate patch) - > exit(-1) is the same as exit(255), which is not a usual exit status > (although it DOES make xargs behave differently). I'll go ahead an squeeze a s/-1/1/ exit code change into this patch too. I think it's a similar enough cleanup. > > > > /* BARs must be a power of 2 */ > > if (!is_power_of_two(value)) { > > -fprintf(stderr, "ivshmem: size must be power of 2\n"); > > +error_report("size must be power of 2\n"); > > exit(1); > > But seeing as how much of this file uses the more typical exit(1), we > should consistently use 1 in all places where we exit early. > > > } else if ((fd = shm_open(s->shmobj, O_CREAT|O_RDWR, > > S_IRWXU|S_IRWXG|S_IRWXO)) < 0) { > > -fprintf(stderr, "ivshmem: could not open shared file\n"); > > +error_report("could not open shared file\n"); > > exit(-1); > > Another weird use of exit(-1). > > -- > Eric Blake eblake redhat com+1-919-301-3266 > Libvirt virtualization library http://libvirt.org >
[Qemu-devel] [PATCH v2] ivshmem: use error_report
Replace all the fprintf(stderr, ...) calls with error_report. Also make sure exit() consistently uses the error code 1. A few calls used -1. Signed-off-by: Andrew Jones --- hw/misc/ivshmem.c | 39 +++ 1 file changed, 19 insertions(+), 20 deletions(-) diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c index bf585b7691998..b3983296f58fa 100644 --- a/hw/misc/ivshmem.c +++ b/hw/misc/ivshmem.c @@ -300,8 +300,8 @@ static CharDriverState* create_eventfd_chr_device(void * opaque, EventNotifier * chr = qemu_chr_open_eventfd(eventfd); if (chr == NULL) { -fprintf(stderr, "creating eventfd for eventfd %d failed\n", eventfd); -exit(-1); +error_report("creating eventfd for eventfd %d failed", eventfd); +exit(1); } qemu_chr_fe_claim_no_fail(chr); @@ -328,15 +328,14 @@ static int check_shm_size(IVShmemState *s, int fd) { struct stat buf; if (fstat(fd, &buf) < 0) { -fprintf(stderr, "ivshmem: exiting: fstat on fd %d failed: %s\n", +error_report("exiting: fstat on fd %d failed: %s", fd, strerror(errno)); return -1; } if (s->ivshmem_size > buf.st_size) { -fprintf(stderr, -"IVSHMEM ERROR: Requested memory size greater" -" than shared object size (%" PRIu64 " > %" PRIu64")\n", +error_report("Requested memory size greater" +" than shared object size (%" PRIu64 " > %" PRIu64")", s->ivshmem_size, (uint64_t)buf.st_size); return -1; } else { @@ -510,7 +509,7 @@ static void ivshmem_read(void *opaque, const uint8_t *buf, int size) incoming_fd = dup(tmp_fd); if (incoming_fd == -1) { -fprintf(stderr, "could not allocate file descriptor %s\n", +error_report("could not allocate file descriptor %s", strerror(errno)); close(tmp_fd); return; @@ -524,7 +523,7 @@ static void ivshmem_read(void *opaque, const uint8_t *buf, int size) s->max_peer = 0; if (check_shm_size(s, incoming_fd) == -1) { -exit(-1); +exit(1); } /* mmap the region and map into the BAR2 */ @@ -618,13 +617,13 @@ static uint64_t ivshmem_get_size(IVShmemState * s) { value <<= 30; break; default: -fprintf(stderr, "qemu: invalid ram size: %s\n", s->sizearg); +error_report("invalid ram size: %s", s->sizearg); exit(1); } /* BARs must be a power of 2 */ if (!is_power_of_two(value)) { -fprintf(stderr, "ivshmem: size must be power of 2\n"); +error_report("size must be power of 2"); exit(1); } @@ -676,7 +675,7 @@ static int ivshmem_load(QEMUFile* f, void *opaque, int version_id) } if (proxy->role_val == IVSHMEM_PEER) { -fprintf(stderr, "ivshmem: 'peer' devices are not migratable\n"); +error_report("'peer' devices are not migratable"); return -EINVAL; } @@ -722,7 +721,7 @@ static int pci_ivshmem_init(PCIDevice *dev) /* IRQFD requires MSI */ if (ivshmem_has_feature(s, IVSHMEM_IOEVENTFD) && !ivshmem_has_feature(s, IVSHMEM_MSI)) { -fprintf(stderr, "ivshmem: ioeventfd/irqfd requires MSI\n"); +error_report("ioeventfd/irqfd requires MSI"); exit(1); } @@ -733,7 +732,7 @@ static int pci_ivshmem_init(PCIDevice *dev) } else if (strncmp(s->role, "master", 7) == 0) { s->role_val = IVSHMEM_MASTER; } else { -fprintf(stderr, "ivshmem: 'role' must be 'peer' or 'master'\n"); +error_report("'role' must be 'peer' or 'master'"); exit(1); } } else { @@ -773,8 +772,8 @@ static int pci_ivshmem_init(PCIDevice *dev) * to the ivshmem server to receive the memory region */ if (s->shmobj != NULL) { -fprintf(stderr, "WARNING: do not specify both 'chardev' " -"and 'shm' with ivshmem\n"); +error_report("WARNING: do not specify both 'chardev' " +"and 'shm' with ivshmem"); } IVSHMEM_DPRINTF("using shared memory server (socket = %s)\n", @@ -802,7 +801,7 @@ static int pci_ivshmem_init(PCIDevice *dev) int fd; if (s->shmobj == NULL) { -fprintf(st
Re: [Qemu-devel] [PATCH v2] ivshmem: use error_report
On Mon, Sep 22, 2014 at 02:20:09PM +0300, Michael S. Tsirkin wrote: > On Fri, Sep 19, 2014 at 04:42:59PM +0800, zhanghailiang wrote: > > On 2014/9/19 15:34, zhanghailiang wrote: > > >On 2014/9/19 7:17, Andrew Jones wrote: > > >>Replace all the fprintf(stderr, ...) calls with error_report. > > >>Also make sure exit() consistently uses the error code 1. A few calls > > >>used -1. > > >> > > >>Signed-off-by: Andrew Jones > > >>--- > > >> hw/misc/ivshmem.c | 39 +++ > > >> 1 file changed, 19 insertions(+), 20 deletions(-) > > >> > > >>diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c > > >>index bf585b7691998..b3983296f58fa 100644 > > >>--- a/hw/misc/ivshmem.c > > >>+++ b/hw/misc/ivshmem.c > > >>@@ -300,8 +300,8 @@ static CharDriverState* > > >>create_eventfd_chr_device(void * opaque, EventNotifier * > > >> chr = qemu_chr_open_eventfd(eventfd); > > >> > > >> if (chr == NULL) { > > >>-fprintf(stderr, "creating eventfd for eventfd %d failed\n", > > >>eventfd); > > >>-exit(-1); > > >>+error_report("creating eventfd for eventfd %d failed", eventfd); > > >>+exit(1); > > >> } > > >> qemu_chr_fe_claim_no_fail(chr); > > >> > > >>@@ -328,15 +328,14 @@ static int check_shm_size(IVShmemState *s, int fd) { > > >> struct stat buf; > > >> > > >> if (fstat(fd, &buf) < 0) { > > >>-fprintf(stderr, "ivshmem: exiting: fstat on fd %d failed: %s\n", > > >>+error_report("exiting: fstat on fd %d failed: %s", > > >> fd, strerror(errno)); > > > > > >The indentation looks weird, better to fix it.;) > > >More of the same elsewhere. > > > > > > > Er, actually, maybe for print like function, it is no need to indent like > > other function. > > So just ignore this comment, Sorry.;) > > > No, I agree with the original comment. > Please indent continuation lines to the right of the opening (, > here and elsewhere. OK, sending a v3. drew > > > > >> return -1; > > >> } > > >> > > >> if (s->ivshmem_size > buf.st_size) { > > >>-fprintf(stderr, > > >>-"IVSHMEM ERROR: Requested memory size greater" > > >>-" than shared object size (%" PRIu64 " > %" PRIu64")\n", > > >>+error_report("Requested memory size greater" > > >>+" than shared object size (%" PRIu64 " > %" PRIu64")", > > >> s->ivshmem_size, (uint64_t)buf.st_size); > > >> return -1; > > >> } else { > > >>@@ -510,7 +509,7 @@ static void ivshmem_read(void *opaque, const uint8_t > > >>*buf, int size) > > >> incoming_fd = dup(tmp_fd); > > >> > > >> if (incoming_fd == -1) { > > >>-fprintf(stderr, "could not allocate file descriptor %s\n", > > >>+error_report("could not allocate file descriptor %s", > > >> > > >> strerror(errno)); > > >> close(tmp_fd); > > >> return; > > >>@@ -524,7 +523,7 @@ static void ivshmem_read(void *opaque, const uint8_t > > >>*buf, int size) > > >> s->max_peer = 0; > > >> > > >> if (check_shm_size(s, incoming_fd) == -1) { > > >>-exit(-1); > > >>+exit(1); > > >> } > > >> > > >> /* mmap the region and map into the BAR2 */ > > >>@@ -618,13 +617,13 @@ static uint64_t ivshmem_get_size(IVShmemState * s) { > > >> value <<= 30; > > >> break; > > >> default: > > >>-fprintf(stderr, "qemu: invalid ram size: %s\n", s->sizearg); > > >>+error_report("invalid ram size: %s", s->sizearg); > > >> exit(1); > > >> } > > >> > > >> /* BARs must be a power of 2 */ > > >> if (!is_power_of_two
[Qemu-devel] [PATCH v3] ivshmem: use error_report
Replace all the fprintf(stderr, ...) calls with error_report. Also make sure exit() consistently uses the error code 1. A few calls used -1. While at it cleanup some indentation in the printf argument lists. Signed-off-by: Andrew Jones --- v3: fix indentation of printf argument lists [Hailiang Zhang] v2: error_report shouldn't use '\n' [Eric Blake] Applies on "ivshmem security fixes" (http://lists.nongnu.org/archive/html/qemu-devel/2014-09/msg03056.html) --- hw/misc/ivshmem.c | 52 +--- 1 file changed, 25 insertions(+), 27 deletions(-) diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c index bf585b7691998..5d272c84e9c7d 100644 --- a/hw/misc/ivshmem.c +++ b/hw/misc/ivshmem.c @@ -133,7 +133,7 @@ static void ivshmem_update_irq(IVShmemState *s, int val) /* don't print ISR resets */ if (isr) { IVSHMEM_DPRINTF("Set IRQ to %d (%04x %04x)\n", - isr ? 1 : 0, s->intrstatus, s->intrmask); +isr ? 1 : 0, s->intrstatus, s->intrmask); } pci_set_irq(d, (isr != 0)); @@ -300,8 +300,8 @@ static CharDriverState* create_eventfd_chr_device(void * opaque, EventNotifier * chr = qemu_chr_open_eventfd(eventfd); if (chr == NULL) { -fprintf(stderr, "creating eventfd for eventfd %d failed\n", eventfd); -exit(-1); +error_report("creating eventfd for eventfd %d failed", eventfd); +exit(1); } qemu_chr_fe_claim_no_fail(chr); @@ -328,16 +328,15 @@ static int check_shm_size(IVShmemState *s, int fd) { struct stat buf; if (fstat(fd, &buf) < 0) { -fprintf(stderr, "ivshmem: exiting: fstat on fd %d failed: %s\n", -fd, strerror(errno)); +error_report("exiting: fstat on fd %d failed: %s", + fd, strerror(errno)); return -1; } if (s->ivshmem_size > buf.st_size) { -fprintf(stderr, -"IVSHMEM ERROR: Requested memory size greater" -" than shared object size (%" PRIu64 " > %" PRIu64")\n", -s->ivshmem_size, (uint64_t)buf.st_size); +error_report("Requested memory size greater" + " than shared object size (%" PRIu64 " > %" PRIu64")", + s->ivshmem_size, (uint64_t)buf.st_size); return -1; } else { return 0; @@ -510,8 +509,7 @@ static void ivshmem_read(void *opaque, const uint8_t *buf, int size) incoming_fd = dup(tmp_fd); if (incoming_fd == -1) { -fprintf(stderr, "could not allocate file descriptor %s\n", -strerror(errno)); +error_report("could not allocate file descriptor %s", strerror(errno)); close(tmp_fd); return; } @@ -524,7 +522,7 @@ static void ivshmem_read(void *opaque, const uint8_t *buf, int size) s->max_peer = 0; if (check_shm_size(s, incoming_fd) == -1) { -exit(-1); +exit(1); } /* mmap the region and map into the BAR2 */ @@ -535,7 +533,7 @@ static void ivshmem_read(void *opaque, const uint8_t *buf, int size) vmstate_register_ram(&s->ivshmem, DEVICE(s)); IVSHMEM_DPRINTF("guest h/w addr = %p, size = %" PRIu64 "\n", - map_ptr, s->ivshmem_size); +map_ptr, s->ivshmem_size); memory_region_add_subregion(&s->bar, 0, &s->ivshmem); @@ -556,7 +554,7 @@ static void ivshmem_read(void *opaque, const uint8_t *buf, int size) /* this is an eventfd for a particular guest VM */ IVSHMEM_DPRINTF("eventfds[%ld][%d] = %d\n", incoming_posn, -guest_max_eventfd, incoming_fd); +guest_max_eventfd, incoming_fd); event_notifier_init_fd(&s->peers[incoming_posn].eventfds[guest_max_eventfd], incoming_fd); @@ -618,13 +616,13 @@ static uint64_t ivshmem_get_size(IVShmemState * s) { value <<= 30; break; default: -fprintf(stderr, "qemu: invalid ram size: %s\n", s->sizearg); +error_report("invalid ram size: %s", s->sizearg); exit(1); } /* BARs must be a power of 2 */ if (!is_power_of_two(value)) { -fprintf(stderr, "ivshmem: size must be power of 2\n"); +error_report("size must be power of 2"); exit(1); } @@ -676,7 +674,7 @@ static int ivshmem_load(QEMUFile* f, void *opaque, int version_id) } if (proxy->role_val == IVSHMEM_PEER) { -fprintf(stderr, "ivshmem:
Re: [Qemu-devel] [PATCH 2/6] libqemustub: add more stubs for qemu-char
On Thu, Jun 19, 2014 at 06:07:40PM +0300, Nikolay Nikolaev wrote: > Additional stubs: > - chr_baum_init > - qemu_chr_open_spice_vmc > - qemu_chr_open_spice_port > > Signed-off-by: Nikolay Nikolaev > --- > stubs/Makefile.objs |2 ++ > stubs/chr-baum-init.c |7 +++ > stubs/qemu-chr-open-spice.c | 12 > 3 files changed, 21 insertions(+) > create mode 100644 stubs/chr-baum-init.c > create mode 100644 stubs/qemu-chr-open-spice.c > > diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs > index 997d68d..2312076 100644 > --- a/stubs/Makefile.objs > +++ b/stubs/Makefile.objs > @@ -1,5 +1,6 @@ > stub-obj-y += arch-query-cpu-def.o > stub-obj-y += bdrv-commit-all.o > +stub-obj-y += chr-baum-init.o > stub-obj-y += chr-msmouse.o > stub-obj-y += clock-warp.o > stub-obj-y += cpu-get-clock.o > @@ -24,6 +25,7 @@ stub-obj-y += mon-set-error.o > stub-obj-y += monitor-init.o > stub-obj-y += notify-event.o > stub-obj-y += pci-drive-hot-add.o > +stub-obj-y += qemu-chr-open-spice.o > stub-obj-y += qtest.o > stub-obj-y += reset.o > stub-obj-y += runstate-check.o > diff --git a/stubs/chr-baum-init.c b/stubs/chr-baum-init.c > new file mode 100644 > index 000..f5cc6ce > --- /dev/null > +++ b/stubs/chr-baum-init.c > @@ -0,0 +1,7 @@ > +#include "qemu-common.h" > +#include "sysemu/char.h" > + > +CharDriverState *chr_baum_init(void) > +{ > +return NULL; > +} > diff --git a/stubs/qemu-chr-open-spice.c b/stubs/qemu-chr-open-spice.c > new file mode 100644 > index 000..40a29a9 > --- /dev/null > +++ b/stubs/qemu-chr-open-spice.c > @@ -0,0 +1,12 @@ > +#include "qemu-common.h" > +#include "ui/qemu-spice.h" > + > +CharDriverState *qemu_chr_open_spice_vmc(const char *type) > +{ > +return NULL; > +} > + > +CharDriverState *qemu_chr_open_spice_port(const char *name) > +{ > +return NULL; > +} > > The build breaks for '--disable-spice' configs now. You get .../stubs/qemu-chr-open-spice.c:4:18: error: no previous prototype for 'qemu_chr_open_spice_vmc' [-Werror=missing-prototypes] CharDriverState *qemu_chr_open_spice_vmc(const char *type) ^ .../stubs/qemu-chr-open-spice.c:9:18: error: no previous prototype for 'qemu_chr_open_spice_port' [-Werror=missing-prototypes] CharDriverState *qemu_chr_open_spice_port(const char *name) ^ I believe you need something like the following instead drew #include "qemu-common.h" #include "ui/qemu-spice.h" #ifdef CONFIG_SPICE CharDriverState *qemu_chr_open_spice_vmc(const char *type) { return NULL; } #if SPICE_SERVER_VERSION >= 0x000c02 CharDriverState *qemu_chr_open_spice_port(const char *name) { return NULL; } #endif #endif
[Qemu-devel] [PATCH] hw/arm/virt: fix pl031 addr typo
pl031's base address should be 0x9001000, 0x9001. While in there also add some spacing and zeros to make it easier to read the map. Signed-off-by: Andrew Jones --- hw/arm/virt.c | 16 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 405c61d39c1e9..f817820972475 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -98,17 +98,17 @@ typedef struct VirtBoardInfo { */ static const MemMapEntry a15memmap[] = { /* Space up to 0x800 is reserved for a boot ROM */ -[VIRT_FLASH] = { 0, 0x800 }, -[VIRT_CPUPERIPHS] = { 0x800, 0x2 }, +[VIRT_FLASH] = { 0, 0x0800 }, +[VIRT_CPUPERIPHS] = { 0x0800, 0x0002 }, /* GIC distributor and CPU interfaces sit inside the CPU peripheral space */ -[VIRT_GIC_DIST] = { 0x800, 0x1 }, -[VIRT_GIC_CPU] = { 0x801, 0x1 }, -[VIRT_UART] = { 0x900, 0x1000 }, -[VIRT_RTC] = { 0x9001, 0x1000 }, -[VIRT_MMIO] = { 0xa00, 0x200 }, +[VIRT_GIC_DIST] = { 0x0800, 0x0001 }, +[VIRT_GIC_CPU] ={ 0x0801, 0x0001 }, +[VIRT_UART] = { 0x0900, 0x1000 }, +[VIRT_RTC] ={ 0x09001000, 0x1000 }, +[VIRT_MMIO] = { 0x0a00, 0x0200 }, /* ...repeating for a total of NUM_VIRTIO_TRANSPORTS, each of that size */ /* 0x1000 .. 0x4000 reserved for PCI */ -[VIRT_MEM] = { 0x4000, 30ULL * 1024 * 1024 * 1024 }, +[VIRT_MEM] ={ 0x4000, 30ULL * 1024 * 1024 * 1024 }, }; static const int a15irqmap[] = { -- 1.9.3
Re: [Qemu-devel] [PATCH] hw/arm/virt: fix pl031 addr typo
On Tue, Jul 29, 2014 at 04:58:44PM +0100, Peter Maydell wrote: > On 29 July 2014 16:44, Andrew Jones wrote: > > pl031's base address should be 0x9001000, 0x9001. While in there ^ meant to type 'not' here, but guess that was obvious > > also add some spacing and zeros to make it easier to read the map. I can send two separate patches for the fix and the formatting, but you'd still have to check the formatting patch closely to make sure nothing else changed... > > > > Signed-off-by: Andrew Jones > > -[VIRT_RTC] = { 0x9001, 0x1000 }, > > +[VIRT_RTC] ={ 0x09001000, 0x1000 }, > > ...and assuming from the commit message that this is the > only actual change, the alignment to 64K is deliberate, > for the benefit of guests with 64K pages. 0K, so it needs to be 0x0901, which is still not what it is. As it is right now it's sitting in RAM, when configuring a guest to have greater than 1G. drew > > thanks > -- PMM >
[Qemu-devel] [PATCH v2] hw/arm/virt: fix pl031 addr typo
pl031's base address should be 0x901, not 0x9001, otherwise it sits in ram when configuring a guest with greater than 1G. Signed-off-by: Andrew Jones --- v2: - pl031 needs 64K alignment - don't change the formatting, will send another patch, which may get ignored :-) --- hw/arm/virt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 405c61d39c1e9..89532bd786436 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -104,7 +104,7 @@ static const MemMapEntry a15memmap[] = { [VIRT_GIC_DIST] = { 0x800, 0x1 }, [VIRT_GIC_CPU] = { 0x801, 0x1 }, [VIRT_UART] = { 0x900, 0x1000 }, -[VIRT_RTC] = { 0x9001, 0x1000 }, +[VIRT_RTC] = { 0x901, 0x1000 }, [VIRT_MMIO] = { 0xa00, 0x200 }, /* ...repeating for a total of NUM_VIRTIO_TRANSPORTS, each of that size */ /* 0x1000 .. 0x4000 reserved for PCI */ -- 1.9.3
[Qemu-devel] [PATCH] hw/arm/virt: formatting: memory map
Add some spacing and zeros to make it easier to read and modify the map. This patch has no functional changes. The review looks ugly, but it's actually pretty easy to confirm all the addresses are as they should be - thanks to the new formatting ;-) Applies on top of 'v2.1.0-rc5'. Signed-off-by: Andrew Jones --- hw/arm/virt.c | 16 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 89532bd786436..ba9429806 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -98,17 +98,17 @@ typedef struct VirtBoardInfo { */ static const MemMapEntry a15memmap[] = { /* Space up to 0x800 is reserved for a boot ROM */ -[VIRT_FLASH] = { 0, 0x800 }, -[VIRT_CPUPERIPHS] = { 0x800, 0x2 }, +[VIRT_FLASH] = { 0, 0x0800 }, +[VIRT_CPUPERIPHS] = { 0x0800, 0x0002 }, /* GIC distributor and CPU interfaces sit inside the CPU peripheral space */ -[VIRT_GIC_DIST] = { 0x800, 0x1 }, -[VIRT_GIC_CPU] = { 0x801, 0x1 }, -[VIRT_UART] = { 0x900, 0x1000 }, -[VIRT_RTC] = { 0x901, 0x1000 }, -[VIRT_MMIO] = { 0xa00, 0x200 }, +[VIRT_GIC_DIST] = { 0x0800, 0x0001 }, +[VIRT_GIC_CPU] ={ 0x0801, 0x0001 }, +[VIRT_UART] = { 0x0900, 0x1000 }, +[VIRT_RTC] ={ 0x0901, 0x1000 }, +[VIRT_MMIO] = { 0x0a00, 0x0200 }, /* ...repeating for a total of NUM_VIRTIO_TRANSPORTS, each of that size */ /* 0x1000 .. 0x4000 reserved for PCI */ -[VIRT_MEM] = { 0x4000, 30ULL * 1024 * 1024 * 1024 }, +[VIRT_MEM] ={ 0x4000, 30ULL * 1024 * 1024 * 1024 }, }; static const int a15irqmap[] = { -- 1.9.3
Re: [Qemu-devel] [PATCH 1/4] mm: madvise MADV_USERFAULT
On Mon, May 06, 2013 at 09:56:58PM +0200, Andrea Arcangeli wrote: > +++ b/mm/madvise.c > @@ -93,6 +93,21 @@ static long madvise_behavior(struct vm_area_struct * vma, > if (error) > goto out; > break; > + case MADV_USERFAULT: > + if (vma->vm_ops) { > + error = -EINVAL; > + goto out; > + } > + new_flags |= VM_USERFAULT; > + break; > + case MADV_NOUSERFAULT: > + if (vma->vm_ops) { > + WARN_ON(new_flags & VM_USERFAULT); > + error = -EINVAL; > + goto out; > + } > + new_flags &= ~VM_USERFAULT; > + break; > } > > if (new_flags == vma->vm_flags) { > @@ -405,6 +420,7 @@ madvise_behavior_valid(int behavior) > case MADV_HUGEPAGE: > case MADV_NOHUGEPAGE: > #endif > + case MADV_USERFAULT: Missing MADV_NOUSERFAULT: here > case MADV_DONTDUMP: > case MADV_DODUMP: > return 1;
Re: [Qemu-devel] [PATCH 0/4] madvise(MADV_USERFAULT) & sys_remap_anon_pages()
On Mon, May 06, 2013 at 09:56:57PM +0200, Andrea Arcangeli wrote: > > The current behavior of remap_anon_pages is very strict to avoid any > chance of memory corruption going unnoticed, and it will return > -EFAULT at the first sign of something unexpected (like a page already > mapped in the destination pmd/pte, potentially signaling an userland > thread race condition with two threads userfaulting on the same > destination address). mremap is not strict like that: it would drop > the destination range silently and it would succeed in such a > condition. So on the API side, I wonder if I should add a flag to > remap_anon_pages to provide non-strict behavior more similar to > mremap. OTOH not providing the permissive mremap behavior may actually > be better to force userland to be strict and be sure it knows what it > is doing (otherwise it should use mremap in the first place?). > What about instead of adding a new syscall (remap_anon_pages) to instead extend mremap with new flags giving it a strict mode? drew
[Qemu-devel] [PATCH v3] virtio: Introduce virtio-testdev
This is a virtio version of hw/misc/debugexit and should evolve into a virtio version of pc-testdev. pc-testdev uses the PC's ISA bus, whereas this testdev can be plugged into a virtio-mmio transport, which is needed for kvm-unit-tests/arm. virtio-testdev uses the virtio device config space as a communication channel, and implements an RTAS-like protocol through it allowing guests to execute commands. Only three commands are currently implemented; 1) VERSION: for version compatibility checks 2) CLEAR: set all the config space back to zero 3) EXIT:exit() from qemu with a status code --- v3: Convert to QOM realize v2: - No real functional changes, just added some comments and changed register bank accesses to use byte offsets, allowing the driver implementation to better mirror this device's implementation. Signed-off-by: Andrew Jones --- hw/virtio/Makefile.objs| 1 + hw/virtio/virtio-testdev.c | 154 + 2 files changed, 155 insertions(+) create mode 100644 hw/virtio/virtio-testdev.c diff --git a/hw/virtio/Makefile.objs b/hw/virtio/Makefile.objs index 1ba53d9cc316c..b63fea4a175f0 100644 --- a/hw/virtio/Makefile.objs +++ b/hw/virtio/Makefile.objs @@ -3,6 +3,7 @@ common-obj-$(CONFIG_VIRTIO_PCI) += virtio-pci.o common-obj-y += virtio-bus.o common-obj-y += virtio-mmio.o common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += dataplane/ +common-obj-y += virtio-testdev.o obj-y += virtio.o virtio-balloon.o obj-$(CONFIG_LINUX) += vhost.o diff --git a/hw/virtio/virtio-testdev.c b/hw/virtio/virtio-testdev.c new file mode 100644 index 0..495639a66d3bc --- /dev/null +++ b/hw/virtio/virtio-testdev.c @@ -0,0 +1,154 @@ +/* + * Virtio Test Device + * + * Copyright (C) 2013 Red Hat, Inc. + * Copyright (C) 2013 Andrew Jones + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + * + */ +#include "hw/virtio/virtio-bus.h" + +#define VIRTIO_ID_TESTDEV 0x + +#define TYPE_VIRTIO_TESTDEV "virtio-testdev" +#define VIRTIO_TESTDEV(obj) \ +OBJECT_CHECK(VirtIOTestdev, (obj), TYPE_VIRTIO_TESTDEV) + +#define TESTDEV_MAJOR_VER 1 +#define TESTDEV_MINOR_VER 1 + +/* + * Size of the register bank in bytes. The max is determined + * by the device's config space. For virtio-mmio devices we + * only have 256 bytes, so that's our max, but we don't need + * that much anyway. 64 bytes gives us token, nargs, nrets, + * and an additional 13 4-byte registers for input/output. + * That'll do. + */ +#define CONFIG_SIZE 64 + +enum { +VERSION = 1, +CLEAR, +EXIT, +}; + +#define TOKEN_OFFSET0x0 +#define NARGS_OFFSET0x4 +#define NRETS_OFFSET0x8 +#define ARG_OFFSET(n) (0xc + (n) * 4) +#define __RET_OFFSET(nargs, n) (ARG_OFFSET(nargs) + (n) * 4) +#define RET_OFFSET(d, n)__RET_OFFSET(config_readl(d, NARGS_OFFSET), n) + +typedef struct VirtIOTestdev { +VirtIODevice parent_obj; +uint8_t config[CONFIG_SIZE]; +} VirtIOTestdev; + +static uint32_t config_readl(VirtIOTestdev *dev, unsigned offset) +{ +uint32_t *config = (uint32_t *)&dev->config[0]; + +if (offset > (CONFIG_SIZE - 4)) { +return 0; +} + +return le32_to_cpu(config[offset / 4]); +} + +static void config_writel(VirtIOTestdev *dev, unsigned offset, uint32_t val) +{ +uint32_t *config = (uint32_t *)&dev->config[0]; + +if (offset <= (CONFIG_SIZE - 4)) { +config[offset / 4] = cpu_to_le32(val); +} +} + +static void virtio_testdev_get_config(VirtIODevice *vdev, uint8_t *config_data) +{ +VirtIOTestdev *dev = VIRTIO_TESTDEV(vdev); +memcpy(config_data, &dev->config[0], CONFIG_SIZE); +} + +static void virtio_testdev_set_config(VirtIODevice *vdev, + const uint8_t *config_data) +{ +VirtIOTestdev *dev = VIRTIO_TESTDEV(vdev); +uint32_t token; + +memcpy(&dev->config[0], config_data, CONFIG_SIZE); + +token = config_readl(dev, TOKEN_OFFSET); + +/* + * The token register must always remain zero in order to + * avoid re-executing operations while new operation + * arguments are being filled in. + */ +config_writel(dev, TOKEN_OFFSET, 0); + +switch (token) { +case 0: +/* + * No token yet, so we were just filling in arguments, and + * now there's nothing left to do. + */ +return; +case VERSION: +config_writel(dev, RET_OFFSET(dev, 0), +(TESTDEV_MAJOR_VER << 16) | TESTDEV_MINOR_VER); +break; +case EXIT: +exit((config_readl(dev, ARG_OFFSET(0)) << 1) | 1); +case CLEAR: +default: +/* The CLEAR op and unknown ops reset all registers */ +memset(&dev->config[0], 0, CONFIG_SIZE); +} +} + +static uint32_t virtio_testdev_get_features(VirtIODevice
Re: [Qemu-devel] [PATCH v3] virtio: Introduce virtio-testdev
On Thu, Jan 30, 2014 at 07:44:59AM -0500, Mike Day wrote: > > Andrew Jones writes: > > > This is a virtio version of hw/misc/debugexit and should evolve into a > > virtio version of pc-testdev. pc-testdev uses the PC's ISA bus, whereas > > this testdev can be plugged into a virtio-mmio transport, which is > > needed for kvm-unit-tests/arm. virtio-testdev uses the virtio device > > config space as a communication channel, and implements an RTAS-like > > protocol through it allowing guests to execute commands. Only three > > commands are currently implemented; > > 1) VERSION: for version compatibility checks > > 2) CLEAR: set all the config space back to zero > > 3) EXIT:exit() from qemu with a status code > > > +static uint32_t virtio_testdev_get_features(VirtIODevice *vdev, uint32_t f) > > +{ > > +return f; > > +} > > + > > Is this meant to be a stub currently? > Something like that. *_get_features() must be supplied by all virtio devices. Just returning the requested features, f, rather than zero, is how virtio-rng does it. So I went that way too. drew
[Qemu-devel] [PATCH v2] virtio: Introduce virtio-testdev
This is a virtio version of hw/misc/debugexit and should evolve into a virtio version of pc-testdev. pc-testdev uses the PC's ISA bus, whereas this testdev can be plugged into a virtio-mmio transport, which is needed for kvm-unit-tests/arm. virtio-testdev uses the virtio device config space as a communication channel, and implements an RTAS-like protocol through it allowing guests to execute commands. Only three commands are currently implemented; 1) VERSION: for version compatibility checks 2) CLEAR: set all the config space back to zero 3) EXIT:exit() from qemu with a status code --- v2: - No real functional changes, just added some comments and changed register bank accesses to use byte offsets, allowing the driver implementation to better mirror this device's implementation. Signed-off-by: Andrew Jones --- hw/virtio/Makefile.objs| 1 + hw/virtio/virtio-testdev.c | 154 + 2 files changed, 155 insertions(+) create mode 100644 hw/virtio/virtio-testdev.c diff --git a/hw/virtio/Makefile.objs b/hw/virtio/Makefile.objs index 1ba53d9cc316c..b63fea4a175f0 100644 --- a/hw/virtio/Makefile.objs +++ b/hw/virtio/Makefile.objs @@ -3,6 +3,7 @@ common-obj-$(CONFIG_VIRTIO_PCI) += virtio-pci.o common-obj-y += virtio-bus.o common-obj-y += virtio-mmio.o common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += dataplane/ +common-obj-y += virtio-testdev.o obj-y += virtio.o virtio-balloon.o obj-$(CONFIG_LINUX) += vhost.o diff --git a/hw/virtio/virtio-testdev.c b/hw/virtio/virtio-testdev.c new file mode 100644 index 0..6f6a16e42006c --- /dev/null +++ b/hw/virtio/virtio-testdev.c @@ -0,0 +1,154 @@ +/* + * Virtio Test Device + * + * Copyright (C) 2013 Red Hat, Inc. + * Copyright (C) 2013 Andrew Jones + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + * + */ +#include "hw/virtio/virtio-bus.h" + +#define VIRTIO_ID_TESTDEV 0x + +#define TYPE_VIRTIO_TESTDEV "virtio-testdev" +#define VIRTIO_TESTDEV(obj) \ +OBJECT_CHECK(VirtIOTestdev, (obj), TYPE_VIRTIO_TESTDEV) + +#define TESTDEV_MAJOR_VER 1 +#define TESTDEV_MINOR_VER 1 + +/* + * Size of the register bank in bytes. The max is determined + * by the device's config space. For virtio-mmio devices we + * only have 256 bytes, so that's our max, but we don't need + * that much anyway. 64 bytes gives us token, nargs, nrets, + * and an additional 13 4-byte registers for input/output. + * That'll do. + */ +#define CONFIG_SIZE 64 + +enum { +VERSION = 1, +CLEAR, +EXIT, +}; + +#define TOKEN_OFFSET0x0 +#define NARGS_OFFSET0x4 +#define NRETS_OFFSET0x8 +#define ARG_OFFSET(n) (0xc + (n) * 4) +#define __RET_OFFSET(nargs, n) (ARG_OFFSET(nargs) + (n) * 4) +#define RET_OFFSET(d, n)__RET_OFFSET(config_readl(d, NARGS_OFFSET), n) + +typedef struct VirtIOTestdev { +VirtIODevice parent_obj; +uint8_t config[CONFIG_SIZE]; +} VirtIOTestdev; + +static uint32_t config_readl(VirtIOTestdev *dev, unsigned offset) +{ +uint32_t *config = (uint32_t *)&dev->config[0]; + +if (offset > (CONFIG_SIZE - 4)) { +return 0; +} + +return le32_to_cpu(config[offset / 4]); +} + +static void config_writel(VirtIOTestdev *dev, unsigned offset, uint32_t val) +{ +uint32_t *config = (uint32_t *)&dev->config[0]; + +if (offset <= (CONFIG_SIZE - 4)) { +config[offset / 4] = cpu_to_le32(val); +} +} + +static void virtio_testdev_get_config(VirtIODevice *vdev, uint8_t *config_data) +{ +VirtIOTestdev *dev = VIRTIO_TESTDEV(vdev); +memcpy(config_data, &dev->config[0], CONFIG_SIZE); +} + +static void virtio_testdev_set_config(VirtIODevice *vdev, + const uint8_t *config_data) +{ +VirtIOTestdev *dev = VIRTIO_TESTDEV(vdev); +uint32_t token; + +memcpy(&dev->config[0], config_data, CONFIG_SIZE); + +token = config_readl(dev, TOKEN_OFFSET); + +/* + * The token register must always remain zero in order to + * avoid re-executing operations while new operation + * arguments are being filled in. + */ +config_writel(dev, TOKEN_OFFSET, 0); + +switch (token) { +case 0: +/* + * No token yet, so we were just filling in arguments, and + * now there's nothing left to do. + */ +return; +case VERSION: +config_writel(dev, RET_OFFSET(dev, 0), +(TESTDEV_MAJOR_VER << 16) | TESTDEV_MINOR_VER); +break; +case EXIT: +exit((config_readl(dev, ARG_OFFSET(0)) << 1) | 1); +case CLEAR: +default: +/* The CLEAR op and unknown ops reset all registers */ +memset(&dev->config[0], 0, CONFIG_SIZE); +} +} + +static uint32_t virtio_testdev_get_features(VirtIODevice *vdev, uint32_t f) +{ +
Re: [Qemu-devel] [PATCH V16 09/11] NUMA: set guest numa nodes memory policy
On Wed, Nov 27, 2013 at 03:35:53PM +0100, Paolo Bonzini wrote: > > + > > +len = numa_info[nodeid].node_mem; > > +bind_mode = node_parse_bind_mode(nodeid); > > +unsigned long *nodes = numa_info[nodeid].host_mem; > > + > > +/* This is a workaround for a long standing bug in Linux' > > + * mbind implementation, which cuts off the last specified > > + * node. To stay compatible should this bug be fixed, we > > + * specify one more node and zero this one out. > > + */ > > +unsigned long maxnode = find_last_bit(nodes, MAX_NODES); > > +if (syscall(SYS_mbind, ram_ptr + ram_offset, len, bind_mode, > > +nodes, maxnode + 2, 0)) { > > +perror("mbind"); > > +return -1; > > +} > > Also, it's still not clear to me why we're not using libnuma. > There only seem to be two reasons to use it right now; 1) it provides a wrapper around mbind 2) it exists. IMHO libnuma needs some work before it would be suitable for qemu to marry it. Its interfaces don't buy a lot of elegance, it only reads system information on link time (i.e. no good if you want to consider hotplug of cpus and nodes), and it's currently linux specific anyway. It's a good idea, but is it worth yet another qemu build dependency? drew
[Qemu-devel] [PATCH] virtio: Introduce virtio-testdev
This is a virtio version of hw/misc/debugexit and should evolve into a virtio version of pc-testdev. pc-testdev uses the PC's ISA bus, whereas this testdev can be plugged into a virtio-mmio transport, which is needed for kvm-unit-tests/arm. virtio-testdev uses the virtio device config space as a communication channel, and implements an RTAS-like protocol through it allowing guests to execute commands. Only three commands are currently implemented; 1) VERSION: for version compatibility checks 2) CLEAR: set all the config space back to zero 3) EXIT:exit() from qemu with a status code Note, the protocol also requires all data passing through the config space to be in little-endian. See [1] for an example of a driver for this device. [1] https://github.com/rhdrjones/kvm-unit-tests/blob/ff8df5378ffccfbdf25fe79241837e61eb2258e1/lib/virtio-testdev.c Signed-off-by: Andrew Jones --- default-configs/arm-softmmu.mak | 2 + hw/virtio/Makefile.objs | 1 + hw/virtio/virtio-testdev.c | 117 3 files changed, 120 insertions(+) create mode 100644 hw/virtio/virtio-testdev.c diff --git a/default-configs/arm-softmmu.mak b/default-configs/arm-softmmu.mak index ac0815d66310f..56f8086e61974 100644 --- a/default-configs/arm-softmmu.mak +++ b/default-configs/arm-softmmu.mak @@ -80,3 +80,5 @@ CONFIG_VERSATILE_PCI=y CONFIG_VERSATILE_I2C=y CONFIG_SDHCI=y + +CONFIG_VIRTIO_TESTDEV=y diff --git a/hw/virtio/Makefile.objs b/hw/virtio/Makefile.objs index 1ba53d9cc316c..b3d16d522f54b 100644 --- a/hw/virtio/Makefile.objs +++ b/hw/virtio/Makefile.objs @@ -3,6 +3,7 @@ common-obj-$(CONFIG_VIRTIO_PCI) += virtio-pci.o common-obj-y += virtio-bus.o common-obj-y += virtio-mmio.o common-obj-$(CONFIG_VIRTIO_BLK_DATA_PLANE) += dataplane/ +common-obj-$(CONFIG_VIRTIO_TESTDEV) += virtio-testdev.o obj-y += virtio.o virtio-balloon.o obj-$(CONFIG_LINUX) += vhost.o diff --git a/hw/virtio/virtio-testdev.c b/hw/virtio/virtio-testdev.c new file mode 100644 index 0..d6852d563702e --- /dev/null +++ b/hw/virtio/virtio-testdev.c @@ -0,0 +1,117 @@ +#include "hw/virtio/virtio-bus.h" + +#define VIRTIO_ID_TESTDEV 0x + +#define TYPE_VIRTIO_TESTDEV "virtio-testdev" +#define VIRTIO_TESTDEV(obj) \ +OBJECT_CHECK(VirtIOTestdev, (obj), TYPE_VIRTIO_TESTDEV) + +#define TESTDEV_MAJOR_VER 1 +#define TESTDEV_MINOR_VER 1 + +#define CONFIG_SIZE 0x100 + +enum { +VERSION = 1, +CLEAR, +EXIT, +}; + +enum { TOKEN, NARGS, NRETS, ARG1, ARG2, ARG3, ARG4, }; + +#define RET1(nargs) (ARG1 + (nargs) + 0) +#define RET2(nargs) (ARG1 + (nargs) + 1) +#define RET3(nargs) (ARG1 + (nargs) + 2) +#define RET4(nargs) (ARG1 + (nargs) + 3) + +#define calc_len(nargs, nrets) ((3 + (nargs) + (nrets)) * 4) + +typedef struct VirtIOTestdev { +VirtIODevice parent_obj; +uint8_t config[CONFIG_SIZE]; +size_t len; /* currently used bytes */ +} VirtIOTestdev; + +static void virtio_testdev_get_config(VirtIODevice *vdev, uint8_t *config_data) +{ +VirtIOTestdev *dev = VIRTIO_TESTDEV(vdev); +memcpy(config_data, &dev->config[0], dev->len); +} + +static void virtio_testdev_set_config(VirtIODevice *vdev, + const uint8_t *config_data) +{ +VirtIOTestdev *dev = VIRTIO_TESTDEV(vdev); +uint32_t *c32 = (uint32_t *)&dev->config[0]; +uint32_t token, nargs, nrets; + +memcpy(c32, config_data, 32); /* assume write is in first 32 bytes, + we can grab more later, if needed */ +token = le32_to_cpu(c32[TOKEN]); +nargs = le32_to_cpu(c32[NARGS]); +nrets = le32_to_cpu(c32[NRETS]); + +if (!token) { +return; +} + +switch (token) { +case VERSION: +c32[RET1(nargs)] = +cpu_to_le32((TESTDEV_MAJOR_VER << 16) | TESTDEV_MINOR_VER); +break; +case CLEAR: +memset(c32, 0, CONFIG_SIZE); +break; +case EXIT: +exit((le32_to_cpu(c32[ARG1]) << 1) | 1); +default: +break; +} + +c32[TOKEN] = 0; +dev->len = calc_len(nargs, nrets); +} + +static uint32_t virtio_testdev_get_features(VirtIODevice *vdev, uint32_t f) +{ +return f; +} + +static int virtio_testdev_init(VirtIODevice *vdev) +{ +virtio_init(vdev, "virtio-testdev", VIRTIO_ID_TESTDEV, CONFIG_SIZE); +return 0; +} + +static int virtio_testdev_exit(DeviceState *qdev) +{ +virtio_cleanup(VIRTIO_DEVICE(qdev)); +return 0; +} + +static void virtio_testdev_class_init(ObjectClass *klass, void *data) +{ +DeviceClass *dc = DEVICE_CLASS(klass); +VirtioDeviceClass *vdc = VIRTIO_DEVICE_CLASS(klass); +dc->exit = virtio_testdev_exit; +set_bit(DEVICE_CATEGORY_MISC, dc->categories); +vdc->init = virtio_testdev_init; +vdc->get_config = virtio_testdev_get_config; +vdc->set_config = virtio_testdev_set_config; +vdc->get_features = virtio
Re: [Qemu-devel] [PATCH] virtio: Introduce virtio-testdev
On Tue, Oct 15, 2013 at 12:26:10PM +0530, Anup Patel wrote: > Hi Andrew, > > On Mon, Oct 14, 2013 at 9:29 PM, Andrew Jones wrote: > > This is a virtio version of hw/misc/debugexit and should evolve into a > > virtio version of pc-testdev. pc-testdev uses the PC's ISA bus, whereas > > this testdev can be plugged into a virtio-mmio transport, which is > > needed for kvm-unit-tests/arm. virtio-testdev uses the virtio device > > config space as a communication channel, and implements an RTAS-like > > protocol through it allowing guests to execute commands. Only three > > commands are currently implemented; > > 1) VERSION: for version compatibility checks > > 2) CLEAR: set all the config space back to zero > > 3) EXIT:exit() from qemu with a status code > > How about adding RESET command to reset the VM? > Hi Anup, I'm not opposed to it, but at the moment I'm not sure how we would utilize it within kvm-unit-tests. Maybe it would be useful for another application though? So maybe we can add it as an add-on patch at the time we come up with its use case? Thanks for the review! drew
Re: [Qemu-devel] [PATCH] import kvm-unittest in QEMU source tree
On Wed, Oct 16, 2013 at 10:03:37PM +0300, Michael S. Tsirkin wrote: > This simply imports kvm-unittest git into qemu source tree. > > We can next work on making make check run it > automatically. > > Squashed 'kvm-unittest/' content from commit 2bc0e29 > > git-subtree-dir: kvm-unittest > git-subtree-split: 2bc0e29ee4447bebcd3b90053881f59265306adc > > Signed-off-by: Michael S. Tsirkin > > --- > > Gleb, Paolo, any objections to this? I really want a small guest for > running ACPI tests during make check, and kvm-unittest seems to fit the > bill. > > Ability to test e.g. PCI with this in-tree would be very benefitial. > Sorry for slow reply, I just noticed this mail now. So far qemu is a dependency for kvm-unit-tests (both x86 and arm use it), but at some point that could change (if another arch doesn't want to use qemu). So if that happens, then it won't make much sense for the tests to live in qemu. Thus I'm not 100% in favor of moving them. However, if the consensus is to move them, then I have two comments on this patch. 1) There are a couple pendings patches on the kvm list that tidy up the kvm-unit-tests repo - removing lots of the files. That should be committed first to avoid importing a bunch of files right before deleting them. 2) The name shouldn't change from kvm-unit-tests to kvm-unittest. drew
[Qemu-devel] [PATCH 0/5] tcg-arm: LPAE: fix and extend xn control
This series fixes and extends the determination of whether or not an address is executable for LPAE translations. The main patch is 4/5, and describes the details in its commit message. Patches 1-3 prepare for the main patch, and patch 5/5 is a code cleanup made possible by introducing a new helper function with the main patch. This is actually a second version of [1], now based on Peter's recent translation regime support. However, as the series has changed substantially, I'm not calling this 'v2'. I believe I've addressed all of Peter's comments on that first posting in this refresh though. Tested by booting Linux on mach-virt with cortex-a15 and cortex-a57 (just up to 'Unable to mount root fs'), and also with a kvm-unit-tests test. The curious can check out the unit test here [2]. Thanks in advance for reviews! drew [1] http://lists.gnu.org/archive/html/qemu-devel/2015-01/msg01433.html [2] https://github.com/rhdrjones/kvm-unit-tests/commit/ee553e4bb795b0150e31c794bf8953dfb08d619a Andrew Jones (5): target-arm: convert check_ap to get_rw_prot target-arm: enable get_rw_prot to take simple AP target-arm: add an is_user param to get_rw_prot target-arm: get_phys_addr_lpae: more xn control target-arm: apply get_S1prot to get_phys_addr_v6 target-arm/helper.c | 193 ++-- 1 file changed, 126 insertions(+), 67 deletions(-) -- 1.9.3
[Qemu-devel] [PATCH 3/5] target-arm: add an is_user param to get_rw_prot
Give callers the ability to get page protection flags for PL0, even if not currently operating in PL0. This puts the burden of determining 'is_user' on the caller (again - it was this way before regime_is_user was introduced), but will be useful in a following patch. Signed-off-by: Andrew Jones --- target-arm/helper.c | 12 +++- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/target-arm/helper.c b/target-arm/helper.c index b63ec7b7979f9..df78f481e92f4 100644 --- a/target-arm/helper.c +++ b/target-arm/helper.c @@ -4696,12 +4696,11 @@ static inline bool regime_is_user(CPUARMState *env, ARMMMUIdx mmu_idx) * R/W protection flags */ static inline int get_rw_prot(CPUARMState *env, ARMMMUIdx mmu_idx, - int ap, int domain_prot) + bool is_user, int ap, int domain_prot) { bool simple_ap = regime_using_lpae_format(env, mmu_idx) || (regime_sctlr(env, mmu_idx) & SCTLR_AFE); bool domain_prot_valid = !regime_using_lpae_format(env, mmu_idx); -bool is_user = regime_is_user(env, mmu_idx); if (domain_prot_valid && domain_prot == 3) { return PAGE_READ | PAGE_WRITE; @@ -4881,7 +4880,8 @@ static int get_phys_addr_v5(CPUARMState *env, uint32_t address, int access_type, } code = 15; } -*prot = get_rw_prot(env, mmu_idx, ap, domain_prot); +*prot = get_rw_prot(env, mmu_idx, regime_is_user(env, mmu_idx), +ap, domain_prot); *prot |= *prot ? PAGE_EXEC : 0; if (!(*prot & (1 << access_type))) { /* Access permission fault. */ @@ -4989,7 +4989,9 @@ static int get_phys_addr_v6(CPUARMState *env, uint32_t address, int access_type, if (domain_prot == 3) { *prot = PAGE_READ | PAGE_WRITE | PAGE_EXEC; } else { -if (pxn && !regime_is_user(env, mmu_idx)) { +bool is_user = regime_is_user(env, mmu_idx); + +if (pxn && !is_user) { xn = 1; } if (xn && access_type == 2) @@ -5002,7 +5004,7 @@ static int get_phys_addr_v6(CPUARMState *env, uint32_t address, int access_type, code = (code == 15) ? 6 : 3; goto do_fault; } -*prot = get_rw_prot(env, mmu_idx, ap, domain_prot); +*prot = get_rw_prot(env, mmu_idx, is_user, ap, domain_prot); *prot |= *prot && !xn ? PAGE_EXEC : 0; if (!(*prot & (1 << access_type))) { /* Access permission fault. */ -- 1.9.3
[Qemu-devel] [PATCH 2/5] target-arm: enable get_rw_prot to take simple AP
Teach get_rw_prot about the simple AP format AP[2:1]. An additional switch was added, as opposed to converting ap := AP[2:1] to AP[2:0] with a simple shift - and then modifying cases 0,2,4,6, because the resulting code is easier to read with the switch. Signed-off-by: Andrew Jones --- target-arm/helper.c | 22 +- 1 file changed, 21 insertions(+), 1 deletion(-) diff --git a/target-arm/helper.c b/target-arm/helper.c index 610f305c4d661..b63ec7b7979f9 100644 --- a/target-arm/helper.c +++ b/target-arm/helper.c @@ -4698,12 +4698,32 @@ static inline bool regime_is_user(CPUARMState *env, ARMMMUIdx mmu_idx) static inline int get_rw_prot(CPUARMState *env, ARMMMUIdx mmu_idx, int ap, int domain_prot) { +bool simple_ap = regime_using_lpae_format(env, mmu_idx) + || (regime_sctlr(env, mmu_idx) & SCTLR_AFE); +bool domain_prot_valid = !regime_using_lpae_format(env, mmu_idx); bool is_user = regime_is_user(env, mmu_idx); -if (domain_prot == 3) { +if (domain_prot_valid && domain_prot == 3) { return PAGE_READ | PAGE_WRITE; } +/* ap is AP[2:1] */ +if (simple_ap) { +switch (ap) { +case 0: +return is_user ? 0 : PAGE_READ | PAGE_WRITE; +case 1: +return PAGE_READ | PAGE_WRITE; +case 2: +return is_user ? 0 : PAGE_READ; +case 3: +return PAGE_READ; +default: +g_assert_not_reached(); +} +} + +/* ap is AP[2:0] */ switch (ap) { case 0: if (arm_feature(env, ARM_FEATURE_V7)) { -- 1.9.3
[Qemu-devel] [PATCH 5/5] target-arm: apply get_S1prot to get_phys_addr_v6
Now that we have get_S1prot, we can apply it to get_phys_addr_v6 for a minor code cleanup. Signed-off-by: Andrew Jones --- target-arm/helper.c | 27 --- 1 file changed, 8 insertions(+), 19 deletions(-) diff --git a/target-arm/helper.c b/target-arm/helper.c index 20e5753bd216d..c41305e7e2bdf 100644 --- a/target-arm/helper.c +++ b/target-arm/helper.c @@ -5064,30 +5064,19 @@ static int get_phys_addr_v6(CPUARMState *env, uint32_t address, int access_type, } code = 15; } -if (domain_prot == 3) { -*prot = PAGE_READ | PAGE_WRITE | PAGE_EXEC; -} else { -bool is_user = regime_is_user(env, mmu_idx); - -if (pxn && !is_user) { -xn = 1; -} -if (xn && access_type == 2) -goto do_fault; - +if (regime_sctlr(env, mmu_idx) & SCTLR_AFE) { /* The simplified model uses AP[0] as an access control bit. */ -if ((regime_sctlr(env, mmu_idx) & SCTLR_AFE) -&& (ap & 1) == 0) { +if ((ap & 1) == 0) { /* Access flag fault. */ code = (code == 15) ? 6 : 3; goto do_fault; } -*prot = get_rw_prot(env, mmu_idx, is_user, ap, domain_prot); -*prot |= *prot && !xn ? PAGE_EXEC : 0; -if (!(*prot & (1 << access_type))) { -/* Access permission fault. */ -goto do_fault; -} +ap >>= 1; +} +*prot = get_S1prot(env, mmu_idx, false, ap, domain_prot, 0, xn, pxn); +if (!(*prot & (1 << access_type))) { +/* Access permission fault. */ +goto do_fault; } *phys_ptr = phys_addr; return 0; -- 1.9.3
[Qemu-devel] [PATCH 1/5] target-arm: convert check_ap to get_rw_prot
Instead of mixing access permission checking with access permissions to page protection flags translation, just do the translation, and leave it to the caller to check the protection flags against the access type. As this function only considers READ/WRITE, not EXEC, then name it accordingly. Signed-off-by: Andrew Jones --- target-arm/helper.c | 47 +-- 1 file changed, 17 insertions(+), 30 deletions(-) diff --git a/target-arm/helper.c b/target-arm/helper.c index 1a1a00577e780..610f305c4d661 100644 --- a/target-arm/helper.c +++ b/target-arm/helper.c @@ -4692,34 +4692,23 @@ static inline bool regime_is_user(CPUARMState *env, ARMMMUIdx mmu_idx) } } -/* Check section/page access permissions. - Returns the page protection flags, or zero if the access is not - permitted. */ -static inline int check_ap(CPUARMState *env, ARMMMUIdx mmu_idx, - int ap, int domain_prot, - int access_type) -{ -int prot_ro; +/* Translate section/page access permissions to page + * R/W protection flags + */ +static inline int get_rw_prot(CPUARMState *env, ARMMMUIdx mmu_idx, + int ap, int domain_prot) +{ bool is_user = regime_is_user(env, mmu_idx); if (domain_prot == 3) { return PAGE_READ | PAGE_WRITE; } -if (access_type == 1) { -prot_ro = 0; -} else { -prot_ro = PAGE_READ; -} - switch (ap) { case 0: if (arm_feature(env, ARM_FEATURE_V7)) { return 0; } -if (access_type == 1) { -return 0; -} switch (regime_sctlr(env, mmu_idx) & (SCTLR_S | SCTLR_R)) { case SCTLR_S: return is_user ? 0 : PAGE_READ; @@ -4732,7 +4721,7 @@ static inline int check_ap(CPUARMState *env, ARMMMUIdx mmu_idx, return is_user ? 0 : PAGE_READ | PAGE_WRITE; case 2: if (is_user) { -return prot_ro; +return PAGE_READ; } else { return PAGE_READ | PAGE_WRITE; } @@ -4741,16 +4730,16 @@ static inline int check_ap(CPUARMState *env, ARMMMUIdx mmu_idx, case 4: /* Reserved. */ return 0; case 5: -return is_user ? 0 : prot_ro; +return is_user ? 0 : PAGE_READ; case 6: -return prot_ro; +return PAGE_READ; case 7: if (!arm_feature(env, ARM_FEATURE_V6K)) { return 0; } -return prot_ro; +return PAGE_READ; default: -abort(); +g_assert_not_reached(); } } @@ -4872,12 +4861,12 @@ static int get_phys_addr_v5(CPUARMState *env, uint32_t address, int access_type, } code = 15; } -*prot = check_ap(env, mmu_idx, ap, domain_prot, access_type); -if (!*prot) { +*prot = get_rw_prot(env, mmu_idx, ap, domain_prot); +*prot |= *prot ? PAGE_EXEC : 0; +if (!(*prot & (1 << access_type))) { /* Access permission fault. */ goto do_fault; } -*prot |= PAGE_EXEC; *phys_ptr = phys_addr; return 0; do_fault: @@ -4993,14 +4982,12 @@ static int get_phys_addr_v6(CPUARMState *env, uint32_t address, int access_type, code = (code == 15) ? 6 : 3; goto do_fault; } -*prot = check_ap(env, mmu_idx, ap, domain_prot, access_type); -if (!*prot) { +*prot = get_rw_prot(env, mmu_idx, ap, domain_prot); +*prot |= *prot && !xn ? PAGE_EXEC : 0; +if (!(*prot & (1 << access_type))) { /* Access permission fault. */ goto do_fault; } -if (!xn) { -*prot |= PAGE_EXEC; -} } *phys_ptr = phys_addr; return 0; -- 1.9.3
[Qemu-devel] [PATCH 4/5] target-arm: get_phys_addr_lpae: more xn control
This patch makes the following changes to the determination of whether an address is executable, when translating addresses using LPAE. 1. No longer assumes that PL0 can't execute when it can't read. It can in AArch64, a difference from AArch32. 2. Use va_size == 64 to determine we're in AArch64, rather than arm_feature(env, ARM_FEATURE_V8), which is insufficient. 3. Add additional XN determinants - NS && is_secure && (SCR & SCR_SIF) - WXN && (prot & PAGE_WRITE) - AArch64: (prot_PL0 & PAGE_WRITE) - AArch32: UWXN && (prot_PL0 & PAGE_WRITE) - XN determination should also work in secure mode (untested) - XN may even work in EL2 (currently impossible to test) 4. Cleans up the bloated PAGE_EXEC condition - by removing it. The helper get_S1prot is introduced, which also handles short- descriptors and v6 XN determination. It may even work in EL2, when support for that comes, but, as the function name implies, it only works for stage 1 translations. Signed-off-by: Andrew Jones --- target-arm/helper.c | 111 1 file changed, 86 insertions(+), 25 deletions(-) diff --git a/target-arm/helper.c b/target-arm/helper.c index df78f481e92f4..20e5753bd216d 100644 --- a/target-arm/helper.c +++ b/target-arm/helper.c @@ -4695,8 +4695,8 @@ static inline bool regime_is_user(CPUARMState *env, ARMMMUIdx mmu_idx) /* Translate section/page access permissions to page * R/W protection flags */ -static inline int get_rw_prot(CPUARMState *env, ARMMMUIdx mmu_idx, - bool is_user, int ap, int domain_prot) +static int get_rw_prot(CPUARMState *env, ARMMMUIdx mmu_idx, + bool is_user, int ap, int domain_prot) { bool simple_ap = regime_using_lpae_format(env, mmu_idx) || (regime_sctlr(env, mmu_idx) & SCTLR_AFE); @@ -4762,6 +4762,84 @@ static inline int get_rw_prot(CPUARMState *env, ARMMMUIdx mmu_idx, } } +/* Translate section/page access permissions to protection flags */ +static int get_S1prot(CPUARMState *env, ARMMMUIdx mmu_idx, bool is_aa64, + int ap, int domain_prot, int ns, int xn, int pxn) +{ +bool domain_prot_valid = !regime_using_lpae_format(env, mmu_idx); +bool is_user = regime_is_user(env, mmu_idx); +bool have_wxn; +int prot_rw, user_rw; +int wxn = 0; + +assert(mmu_idx != ARMMMUIdx_S2NS); + +if (domain_prot_valid && domain_prot == 3) { +return PAGE_READ | PAGE_WRITE | PAGE_EXEC; +} + +user_rw = get_rw_prot(env, mmu_idx, true, ap, domain_prot); +if (is_user) { +prot_rw = user_rw; +} else { +prot_rw = get_rw_prot(env, mmu_idx, false, ap, domain_prot); +} + +if (ns && arm_is_secure(env) && (env->cp15.scr_el3 & SCR_SIF)) { +return prot_rw; +} + +/* TODO have_wxn should be replaced with arm_feature(env, ARM_FEATURE_EL2), + * when ARM_FEATURE_EL2 starts getting set. For now we assume all v7 + * compatible processors have EL2, which is required for [U]WXN. + */ +have_wxn = arm_feature(env, ARM_FEATURE_V7); + +if (have_wxn) { +wxn = regime_sctlr(env, mmu_idx) & SCTLR_WXN; +} + +if (is_aa64) { +assert(arm_feature(env, ARM_FEATURE_V8)); +switch (regime_el(env, mmu_idx)) { +case 1: +if (is_user && !user_rw) { +wxn = 0; +} else if (!is_user) { +xn = pxn || (user_rw & PAGE_WRITE); +} +break; +case 2: +case 3: +break; +} +} else if (arm_feature(env, ARM_FEATURE_V6K)) { +switch (regime_el(env, mmu_idx)) { +case 1: +case 3: +if (is_user) { +xn = xn || !user_rw; +} else { +int uwxn = 0; +if (have_wxn) { +uwxn = regime_sctlr(env, mmu_idx) & SCTLR_UWXN; +} +xn = xn || !prot_rw || pxn || (uwxn && (user_rw & PAGE_WRITE)); +} +break; +case 2: +break; +} +} else { +xn = wxn = 0; +} + +if (xn || (wxn && (prot_rw & PAGE_WRITE))) { +return prot_rw; +} +return prot_rw | PAGE_EXEC; +} + static bool get_level1_table_address(CPUARMState *env, ARMMMUIdx mmu_idx, uint32_t *table, uint32_t address) { @@ -5047,7 +5125,6 @@ static int get_phys_addr_lpae(CPUARMState *env, target_ulong address, int32_t granule_sz = 9; int32_t va_size = 32; int32_t tbi = 0; -bool is_user; TCR *tcr = regime_tcr(env, mmu_idx); /* TODO: @@ -5220,31 +5297,15 @@ static int get_phys_addr_lpae(CPUARMState *env, target_ulong address, /* Access flag */ goto do_fault;
Re: [Qemu-devel] [PATCH 5/5] target-arm: apply get_S1prot to get_phys_addr_v6
On Thu, Feb 12, 2015 at 04:05:07PM +0100, Andrew Jones wrote: > Now that we have get_S1prot, we can apply it to get_phys_addr_v6 > for a minor code cleanup. Actually, I should point out that this isn't just a cleanup, but also a fix. See below. > > Signed-off-by: Andrew Jones > --- > target-arm/helper.c | 27 --- > 1 file changed, 8 insertions(+), 19 deletions(-) > > diff --git a/target-arm/helper.c b/target-arm/helper.c > index 20e5753bd216d..c41305e7e2bdf 100644 > --- a/target-arm/helper.c > +++ b/target-arm/helper.c > @@ -5064,30 +5064,19 @@ static int get_phys_addr_v6(CPUARMState *env, > uint32_t address, int access_type, > } > code = 15; > } > -if (domain_prot == 3) { > -*prot = PAGE_READ | PAGE_WRITE | PAGE_EXEC; > -} else { > -bool is_user = regime_is_user(env, mmu_idx); > - > -if (pxn && !is_user) { > -xn = 1; > -} > -if (xn && access_type == 2) > -goto do_fault; > - > +if (regime_sctlr(env, mmu_idx) & SCTLR_AFE) { > /* The simplified model uses AP[0] as an access control bit. */ > -if ((regime_sctlr(env, mmu_idx) & SCTLR_AFE) > -&& (ap & 1) == 0) { > +if ((ap & 1) == 0) { > /* Access flag fault. */ > code = (code == 15) ? 6 : 3; > goto do_fault; > } > -*prot = get_rw_prot(env, mmu_idx, is_user, ap, domain_prot); > -*prot |= *prot && !xn ? PAGE_EXEC : 0; > -if (!(*prot & (1 << access_type))) { > -/* Access permission fault. */ > -goto do_fault; > -} > +ap >>= 1; The original code didn't take into account that it may be calling check_ap with a simple AP, AP[2:1]. The code should have always been similar to the above, i.e. if (regime_sctlr(env, mmu_idx) & SCTLR_AFE) { /* The simplified model uses AP[0] as an access control bit. */ if ((ap & 1) == 0) { /* Access flag fault. */ code = (code == 15) ? 6 : 3; goto do_fault; } *prot = ; } else { *prot = check_ap(env, mmu_idx, ap, domain_prot, access_type); } if (!*prot) { /* Access permission fault. */ goto do_fault; } if (!xn) { *prot |= PAGE_EXEC; } As a simple AP wouldn't be properly translated to protection flags with check_ap (except for case 6), then I think this should have caused some problems. Maybe this path just hasn't been tested? I don't see CR_AFE getting used by Linux, so possibly not. I should update the commit message to point this fix out. Or, actually, I should probably add another patch to the series (3/6), which addresses just this issue, and builds it on patch 2 "...to take simple AP". Peter, please let me know your preference. Thanks, drew > +} > +*prot = get_S1prot(env, mmu_idx, false, ap, domain_prot, 0, xn, pxn); > +if (!(*prot & (1 << access_type))) { > +/* Access permission fault. */ > +goto do_fault; > } > *phys_ptr = phys_addr; > return 0; > -- > 1.9.3 > > >
Re: [Qemu-devel] [PATCH 4/5] target-arm: get_phys_addr_lpae: more xn control
On Thu, Feb 12, 2015 at 04:05:06PM +0100, Andrew Jones wrote: > This patch makes the following changes to the determination of > whether an address is executable, when translating addresses > using LPAE. > > 1. No longer assumes that PL0 can't execute when it can't read. >It can in AArch64, a difference from AArch32. > 2. Use va_size == 64 to determine we're in AArch64, rather than >arm_feature(env, ARM_FEATURE_V8), which is insufficient. > 3. Add additional XN determinants >- NS && is_secure && (SCR & SCR_SIF) >- WXN && (prot & PAGE_WRITE) >- AArch64: (prot_PL0 & PAGE_WRITE) >- AArch32: UWXN && (prot_PL0 & PAGE_WRITE) >- XN determination should also work in secure mode (untested) >- XN may even work in EL2 (currently impossible to test) > 4. Cleans up the bloated PAGE_EXEC condition - by removing it. > > The helper get_S1prot is introduced, which also handles short- > descriptors and v6 XN determination. It may even work in EL2, > when support for that comes, but, as the function name implies, > it only works for stage 1 translations. Just because I like replying to myself so much... re: feature checking > > Signed-off-by: Andrew Jones > --- > target-arm/helper.c | 111 > > 1 file changed, 86 insertions(+), 25 deletions(-) > > diff --git a/target-arm/helper.c b/target-arm/helper.c > index df78f481e92f4..20e5753bd216d 100644 > --- a/target-arm/helper.c > +++ b/target-arm/helper.c > @@ -4695,8 +4695,8 @@ static inline bool regime_is_user(CPUARMState *env, > ARMMMUIdx mmu_idx) > /* Translate section/page access permissions to page > * R/W protection flags > */ > -static inline int get_rw_prot(CPUARMState *env, ARMMMUIdx mmu_idx, > - bool is_user, int ap, int domain_prot) > +static int get_rw_prot(CPUARMState *env, ARMMMUIdx mmu_idx, > + bool is_user, int ap, int domain_prot) > { > bool simple_ap = regime_using_lpae_format(env, mmu_idx) > || (regime_sctlr(env, mmu_idx) & SCTLR_AFE); > @@ -4762,6 +4762,84 @@ static inline int get_rw_prot(CPUARMState *env, > ARMMMUIdx mmu_idx, > } > } > > +/* Translate section/page access permissions to protection flags */ > +static int get_S1prot(CPUARMState *env, ARMMMUIdx mmu_idx, bool is_aa64, > + int ap, int domain_prot, int ns, int xn, int pxn) > +{ > +bool domain_prot_valid = !regime_using_lpae_format(env, mmu_idx); > +bool is_user = regime_is_user(env, mmu_idx); > +bool have_wxn; > +int prot_rw, user_rw; > +int wxn = 0; > + > +assert(mmu_idx != ARMMMUIdx_S2NS); > + > +if (domain_prot_valid && domain_prot == 3) { > +return PAGE_READ | PAGE_WRITE | PAGE_EXEC; > +} > + > +user_rw = get_rw_prot(env, mmu_idx, true, ap, domain_prot); > +if (is_user) { > +prot_rw = user_rw; > +} else { > +prot_rw = get_rw_prot(env, mmu_idx, false, ap, domain_prot); > +} > + > +if (ns && arm_is_secure(env) && (env->cp15.scr_el3 & SCR_SIF)) { > +return prot_rw; > +} > + > +/* TODO have_wxn should be replaced with arm_feature(env, > ARM_FEATURE_EL2), > + * when ARM_FEATURE_EL2 starts getting set. For now we assume all v7 > + * compatible processors have EL2, which is required for [U]WXN. > + */ > +have_wxn = arm_feature(env, ARM_FEATURE_V7); I probably should have used ARM_FEATURE_LPAE here instead to be a bit closer to the truth. > + > +if (have_wxn) { > +wxn = regime_sctlr(env, mmu_idx) & SCTLR_WXN; > +} > + > +if (is_aa64) { > +assert(arm_feature(env, ARM_FEATURE_V8)); > +switch (regime_el(env, mmu_idx)) { > +case 1: > +if (is_user && !user_rw) { > +wxn = 0; > +} else if (!is_user) { > +xn = pxn || (user_rw & PAGE_WRITE); > +} > +break; > +case 2: > +case 3: > +break; > +} > +} else if (arm_feature(env, ARM_FEATURE_V6K)) { > +switch (regime_el(env, mmu_idx)) { > +case 1: > +case 3: > +if (is_user) { > +xn = xn || !user_rw; > +} else { > +int uwxn = 0; > +if (have_wxn) { > +uwxn = regime_sctlr(env, mmu_idx) & SCTLR_UWXN; > +} > +xn = xn || !prot_rw || pxn || (uwxn && (user_rw & > PAGE_WRITE)); I thi
Re: [Qemu-devel] [PATCH 0/5] tcg-arm: LPAE: fix and extend xn control
On Thu, Feb 12, 2015 at 04:05:02PM +0100, Andrew Jones wrote: > This series fixes and extends the determination of whether or > not an address is executable for LPAE translations. The main > patch is 4/5, and describes the details in its commit message. > Patches 1-3 prepare for the main patch, and patch 5/5 is a > code cleanup made possible by introducing a new helper function > with the main patch. > > This is actually a second version of [1], now based on Peter's > recent translation regime support. However, as the series > has changed substantially, I'm not calling this 'v2'. I believe > I've addressed all of Peter's comments on that first posting in > this refresh though. > > Tested by booting Linux on mach-virt with cortex-a15 and cortex-a57 > (just up to 'Unable to mount root fs'), and also with a kvm-unit-tests > test. The curious can check out the unit test here [2]. > > Thanks in advance for reviews! > > drew > > [1] http://lists.gnu.org/archive/html/qemu-devel/2015-01/msg01433.html > [2] > https://github.com/rhdrjones/kvm-unit-tests/commit/ee553e4bb795b0150e31c794bf8953dfb08d619a > > Andrew Jones (5): > target-arm: convert check_ap to get_rw_prot > target-arm: enable get_rw_prot to take simple AP > target-arm: add an is_user param to get_rw_prot > target-arm: get_phys_addr_lpae: more xn control > target-arm: apply get_S1prot to get_phys_addr_v6 > > target-arm/helper.c | 193 > ++-- > 1 file changed, 126 insertions(+), 67 deletions(-) > > -- > 1.9.3 > > > Ping? This isn't a huge priority, but I wouldn't want it to get lost either. Thanks, drew
Re: [Qemu-devel] [PATCH 0/5] tcg-arm: LPAE: fix and extend xn control
On Wed, Feb 25, 2015 at 12:08:16AM +0900, Peter Maydell wrote: > On 25 February 2015 at 00:06, Andrew Jones wrote: > > Ping? This isn't a huge priority, but I wouldn't want it to get lost either. > > I'm on holiday :-) Easy stuff is getting done but things that > take time (read: code review) are postponed til mid-March. > You can wait til then, or find somebody else on list to review... > > -- PMM > As much activity as I've seen from you lately, I had no idea you were on vacation :-) Enjoy! I'm OK waiting until mid-March (of course other reviewers are welcome to jump in too, I just don't have anybody specific in mind to cc atm.) Thanks, drew
Re: [Qemu-devel] [PATCH v2 01/47] acpi: introduce AML composer aml_append()
On Mon, Jan 26, 2015 at 04:09:20PM +0100, Igor Mammedov wrote: > On Mon, 26 Jan 2015 10:57:21 +0100 > Igor Mammedov wrote: > > > On Sat, 24 Jan 2015 18:33:50 +0200 > > "Michael S. Tsirkin" wrote: > > > > > On Fri, Jan 23, 2015 at 06:56:20PM +0100, Igor Mammedov wrote: > > > > On Fri, 23 Jan 2015 15:55:11 +0200 > > > > "Michael S. Tsirkin" wrote: > > > > > [...] > > > > I refuse to give up on cleaner and simpler API yet :) > > > > > > > > > > > > > > > > > > > Your patches are almost there, they are pretty clean, the only issue I > > > > > think is this passing of AcpiAml by value, sometimes freeing buffer in > > > > > the process, sometimes not. > > > > Currently buffer is allocated by API and is always freed whenever > > > > it's passed to another API function. > > > > That's why it makes user not to care about memory mgmt. > > > > > > > > The only limitation of it is if you store AcpiAml return value into some > > > > variable you are responsible to use it only once for passing to another > > > > API > > > > function. Reusing this variable's value (pass it to API function second > > > > time) > > > > would cause cause use-after-free and freeing-freed bugs. > > > > Like this: > > > > AcpiAml table = acpi_definition_block("SSDT",...); > > > > AcpiAml scope = acpi_scope("PCI0"); > > > > aml_append(&table, scope); // <- here scope becomes invalid > > > > // a bug > > > > aml_append(&table, scope); // use-after-free + freeing-freed bugs > > > > > > > > There are several approaches to look for resolving above issues: > > > > 1. Adopt and use memory mgmt model used by GTK+ > > > >in nutshell: > > > > http://www.cs.hunter.cuny.edu/~sweiss/course_materials/csci493.70/lecture_notes/GTK_memory_mngmt.pdf > > > >In particular adopt behavior of GInitiallyUnowned usage model > > > > > > > >that will allow to keep convenient chained call style and if > > > > necessary > > > >reuse objects returned by API by explicitly referencing/dereferencing > > > >them if needed. > > > > > > Hmm, it's still easy to misuse. I think I prefer option 2 below. > > That's basically what we have/use in QOM with object_new(FOO) + > > object_unref() > > I have no idea why we invented our own Object infrastructure > > when we could just use GObject one from already used glib. > > > > > > > > > 2. It's possible to drop freeing inside API completely and > > > >record(store in list) every new object inside a table context. > > > >When table is constructed, list of created objects could be > > > >safely freed. > > > >With that it would be safe to reuse every AcpiAml object > > > >and avoid free-after-use issues with limitation that created > > > >AcpiAml objects shouldn't be used after table was closed. > > > >It should cover all practical use of API, i.e. no cross > > > >table AcpiAml objects. > > > > > > So each aml_alloc function gets pointer to this list, > > > and adds the new element there. > > > Eventually we do free_all to free all elements, > > > so there isn't even an aml_free to mis-use. > > I'm thinking a little bit different about implementation though. > > I still don't like the use of explicit alloc/free being called > > by API user since it doesn't allow chained API calls and > > I think it's unnecessary complication see below why. > > > > Here is what's true about current API and a I'd like to with it: > > > > 1. Every API call (except aml_append) makes aml_alloc(), it's just > > like a wrapper about object_new(FOO). (current + new impl.) > > > > 2 Every API call that takes AML type as input argument > > 2.1 consumes (frees) it (current impl.) > > (it's easy to fix use after free concern too, > >just pass AML by pointer and zero-out memory before it's freed > >and assert whenever one of input arguments is not correct, > >i.e. it was reused second time) > > There is no need for following steps after this one. > > 2.2 takes ownership of GInitiallyUnowned and adds it to its list > > of its children. > > 3. Free children when AML object is destroyed (i.e. ref count zero) > > That way when toplevel table object (definition block in 42/47) > > is added to ACPI blob we can unref it, which will cause > > its whole children tree freed, except for AML objects where > > API user explicitly took extra reference (i.e. wanted them > > to reuse in another table) > > > > I'd prefer: > > * 2.1 way to address your current concern of use-after-free > > as the most simplest one (no reuse is possible however) > > or > > * follow already used by QEMU QOM/GObject pattern of > >implicit alloc/free > > > > since they allow to construct AML in a more simple/manageable way i.e. > > > > aml_append(method, > > aml_store(aml_string("foo"), aml_local(0))) > > ); > > > > v.s. explicit headache of alloc/free, which doesn't fix > > use-after-free anyway and just adds more boiler
Re: [Qemu-devel] [PATCH v2 01/47] acpi: introduce AML composer aml_append()
On Wed, Jan 28, 2015 at 09:56:26AM +0200, Michael S. Tsirkin wrote: > > I've tried redo series with passing alloc list as first argument, > > looks ugly as hell > > I tried too. Not too bad at all. See below: I'm not so sure. Looking at the version below, I find the acpi_arg1(p) the most distracting. That API call creates the simplest object, so should be the simplest looking. Actually, you suggested that acpi_arg1(), a wrapper to make things look even simpler, wasn't necessary, acpi_arg(1) would be fine. I agree with that, but now we'd have acpi_arg(p, 1), which is really starting to clutter an AML composition built with many such calls. > > diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c > index f66da5d..820504a 100644 > --- a/hw/i386/acpi-build.c > +++ b/hw/i386/acpi-build.c > @@ -491,14 +491,14 @@ static void acpi_set_pci_info(void) > } > } > > -static void build_append_pcihp_notify_entry(AcpiAml *method, int slot) > +static void build_append_pcihp_notify_entry(AmlPool *p, AcpiAml *method, int > slot) > { > -AcpiAml if_ctx; > +AcpiAml *if_ctx; > int32_t devfn = PCI_DEVFN(slot, 0); > > -if_ctx = acpi_if(acpi_and(acpi_arg0(), acpi_int(0x1U << slot))); > -aml_append(&if_ctx, acpi_notify(acpi_name("S%.02X", devfn), > acpi_arg1())); > -aml_append(method, if_ctx); > +if_ctx = acpi_if(p, acpi_and(p, acpi_arg0(), acpi_int(p, 0x1U << slot))); ^ forgot your p here > +aml_append(p, if_ctx, acpi_notify(p, acpi_name(p, "S%.02X", devfn), > acpi_arg1(p))); > +aml_append(p, method, if_ctx); > } > > static void build_append_pci_bus_devices(AcpiAml *parent_scope, PCIBus *bus, > > What exactly is the problem? A tiny bit more verbose but the lifetime > of all objects is now explicit. It's probably just a personal preference thing. Igor and I prefer the sequence of AML composing calls to appear as simple as possible, i.e. develop the cleanest API as possible. To do this we need to find ways to hide the memory management, which comes at a cost of using a model that supports garbage collection, or adding a global variable to hide the pool. Your preference appears to be to keep memory management as simple and explicit as possible, at the expense of peppering each AML build function with a bunch of 'p's. I agree with Igor that we should get votes from the initial consumers of this API. Thanks, drew
Re: [Qemu-devel] [PATCH 00/13] convert AML API to QOM
On Wed, Jan 28, 2015 at 10:03:24AM +, Igor Mammedov wrote: > > Example demonstrates use of QOM object for building AML > objects tree and freeing it explicitly when requested. > > Top level ACPI tables blob object is only partially > switched to object model now but I hope it demostrates > the point of absracting code and hiding parts of code > that could be done without user intervention. > > Pathes 1/2/8 are just a convertion boiler plate which > will go away on respin. > > > Igor Mammedov (13): > convert to passing AcpiAml by pointers > make toplevel ACPI tables blob a pointer > qom: add support for weak referenced object: aka UnownedObject > acpi: make AcpiAml an OQM object > acpi: use TYPE_AML_OBJECT inside of AML API > acpi: use TYPE_AML_OBJECT for toplevel ACPI tables blob > acpi: make toplevel ACPI tables blob a dedicated object > i386: acpi: hack not yet converted tables calls to deal with > table_data being a pointer > acpi: add aml_blob() helper > i386: acpi: add DSDT table using AML API > acpi: acpi_add_table() to common cross target file > acpi: prepare for API internal collection of RSDT entries > i386: acpi: mark SSDT as RSDT entry so API would add entry to RSDT > automatically > > hw/acpi/acpi-build-utils.c | 537 > + > hw/i386/acpi-build.c | 388 +-- > include/hw/acpi/acpi-build-utils.h | 119 +--- > include/qom/object.h | 20 ++ > qom/object.c | 20 +- > 5 files changed, 620 insertions(+), 464 deletions(-) > > -- > 1.8.3.1 > Thanks for doing this. It's satisfied my curiosity as to how much boilerplate we'd need for the conversion. I looked mainly at the first half of this series, as it appears the second half is an add-on, not directly relevant to the explicit memory management vs. object model memory management discussion. The additional code, and need for QOM knowledge, appears pretty low. So, as this would allow developers focused on ACPI table construction to almost completely avoid doing any memory management themselves, then, FWIW, it looks like a good trade off to me. drew
Re: [Qemu-devel] the arm cache coherency cluster
On Fri, Mar 06, 2015 at 01:49:40PM -0500, Andrew Jones wrote: > In reply to this message I'll send two series' one for KVM and > one for QEMU. The two series' are their respective component > complements, and attempt to implement cache coherency for arm > guests using emulated devices, where the emulator (qemu) uses > cached memory for the device memory, but the guest uses > uncached - as device memory is generally used. Right now I've > just focused on VGA vram. > > This approach starts as the "add a new memslot flag" approach, > and then turns into the "make qemu do some cache maintenance" > approach with the final patch of each series (6/6). It stops > short of the "add syscalls..." approach. Below is a summary of > all the approaches discussed so far, to my knowledge. > > "MAIR manipulating" > Posted[1] by Ard. Works. No performance degradation. Potential > issues with device assignment and the guest getting confused. > > "add a new memslot flag" > This posting (not counting patches 6/6). Works. Huge performance > degradation. > > "make qemu do some cache maintenance" > This posting (patches 6/6). We can only do so much in qemu > without syscalls. This series does what it can. Almost works, > probably could work, after playing 'find the missing flush'. > This approach still requires the new memslot flag, as userspace > can't invalidate the cache, only clean, or clean+invalidate. > No noticeable performance degradation. > > "add syscalls to make qemu do all cache maintenance" > Variant 1: implement as kvm ioctls - to avoid trying to get >syscalls into the general kernel > Variant 2: add real syscalls, or maybe just ARM private SWIs >like __ARM_NR_cacheflush > This approach should work, and if we add an invalidate syscall, > then we shouldn't need any kvm changes at all, i.e. no need for > the memslot flag. I haven't experimented with this yet, but I'm > starting to like the idea of variant 2, with a private SWI, so > will try to pull something together soon for that. > > "describe the problematic memory as cached to the guest" > Not an ideal solution for virt. Could maybe be workable as a > quirk for a specific device though. > > re: $SUBJECT; Here 'cluster' is defined by the urban dictionary. > > [1] http://thread.gmane.org/gmane.comp.emulators.kvm.arm.devel/34/ I'm going to send another pair of series'. A "v2", as, IMO, the new approach supersedes the pair implemented here. After poking around in qemu, looking for the best places to do cache maintenance, I decided I didn't really like doing it there at all, and opted to try another approach, that I'd forgotten to mention in this mail. That approach is the "MADV_UNCACHED" type that Paolo suggested. This type of approach could also be described as "make userspace's memory access type match the expected access type of the guest", and Mario has suggested using a memory driver, which could have the same result. The series I'll send is inspired by both Paolo's and Mario's suggestions, but it uses a kvm memslot flag, rather than an madvise flag, and thus for the memory driver, it's just KVM. drew
[Qemu-devel] the arm cache coherency cluster "v2"
In reply to this message I'll send two series' one for KVM and one for QEMU. The two series' are their respective component complements, and attempt to implement cache coherency for arm guests using emulated devices, where the emulator (qemu) uses cached memory for the device memory, but the guest uses uncached - as device memory is generally used. Right now I've just focused on VGA vram. This approach is the "MADV_UNCACHED" type that Paolo suggested. This type of approach could also be described as "make userspace's memory access type match the expected access type of the guest", and Mario has suggested using a memory driver, which could have the same result. The coming series' is inspired by both Paolo's and Mario's suggestions, but it uses a kvm memslot flag, rather than an madvise flag, and thus for the memory driver, it's just KVM. See the thread https://lists.gnu.org/archive/html/qemu-devel/2015-03/msg01254.html for some more background. Thanks in advance for comments. drew
[Qemu-devel] [RFC PATCH 0/3] KVM: Introduce KVM_MEM_UNCACHED
Introduce a new memory region flag, KVM_MEM_UNCACHED, which is needed by ARM. This flag informs KVM that the given memory region is typically mapped by the guest as uncached. KVM for ARM then maps that region as uncached for userspace as well, in order to keep coherency. Andrew Jones (3): KVM: promote KVM_MEMSLOT_INCOHERENT to uapi arm/arm64: KVM: decouple READONLY and UNCACHED arm/arm64: KVM: implement KVM_MEM_UNCACHED Documentation/virtual/kvm/api.txt | 16 --- arch/arm/include/asm/kvm_mmu.h| 9 arch/arm/include/uapi/asm/kvm.h | 2 + arch/arm/kvm/arm.c| 1 + arch/arm/kvm/mmu.c| 90 ++- arch/arm64/include/asm/kvm_mmu.h | 9 arch/arm64/include/uapi/asm/kvm.h | 2 + include/linux/kvm_host.h | 1 - include/uapi/linux/kvm.h | 2 + virt/kvm/kvm_main.c | 7 ++- 10 files changed, 121 insertions(+), 18 deletions(-) -- 1.8.3.1
[Qemu-devel] [RFC PATCH 3/3] arm/arm64: KVM: implement KVM_MEM_UNCACHED
When userspace tells us a memory region is uncached, then we need to pin all its pages and set them all to be uncached. Signed-off-by: Andrew Jones --- arch/arm/include/asm/kvm_mmu.h| 9 + arch/arm/include/uapi/asm/kvm.h | 1 + arch/arm/kvm/mmu.c| 71 +++ arch/arm64/include/asm/kvm_mmu.h | 9 + arch/arm64/include/uapi/asm/kvm.h | 1 + 5 files changed, 91 insertions(+) diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h index 37ca2a4c6f094..6802f6adc12bf 100644 --- a/arch/arm/include/asm/kvm_mmu.h +++ b/arch/arm/include/asm/kvm_mmu.h @@ -265,6 +265,15 @@ static inline void __kvm_flush_dcache_pud(pud_t pud) { } +static inline void __set_page_uncached(pte_t *ptep) +{ + pte_t pte = *ptep; + + pte = clear_pte_bit(pte, L_PTE_MT_MASK); + pte = set_pte_bit(pte, L_PTE_MT_UNCACHED); + set_pte_ext(ptep, pte, 0); +} + #define kvm_virt_to_phys(x)virt_to_idmap((unsigned long)(x)) void kvm_set_way_flush(struct kvm_vcpu *vcpu); diff --git a/arch/arm/include/uapi/asm/kvm.h b/arch/arm/include/uapi/asm/kvm.h index 9d6fc19acf8a2..cdd456f591882 100644 --- a/arch/arm/include/uapi/asm/kvm.h +++ b/arch/arm/include/uapi/asm/kvm.h @@ -109,6 +109,7 @@ struct kvm_sync_regs { }; struct kvm_arch_memory_slot { + struct page **pages; }; /* If you need to interpret the index values, here is the key: */ diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c index 59af5ad779eb6..d4e47572d3a5d 100644 --- a/arch/arm/kvm/mmu.c +++ b/arch/arm/kvm/mmu.c @@ -1697,6 +1697,54 @@ void kvm_arch_commit_memory_region(struct kvm *kvm, kvm_mmu_wp_memory_region(kvm, mem->slot); } +static int set_page_uncached(pte_t *ptep, pgtable_t token, +unsigned long addr, void *data) +{ + __set_page_uncached(ptep); + kvm_flush_dcache_pte(*ptep); + return 0; +} + +static int vma_range_pin_and_set_uncached(struct vm_area_struct *vma, + hva_t start, long nr_pages, + struct page **pages) +{ + unsigned long size = nr_pages * PAGE_SIZE; + int ret; + + down_read(&vma->vm_mm->mmap_sem); + ret = get_user_pages(NULL, vma->vm_mm, start, nr_pages, +true, true, pages, NULL); + up_read(&vma->vm_mm->mmap_sem); + + if (ret < 0) + return ret; + + if (ret == nr_pages) { + ret = apply_to_page_range(vma->vm_mm, start, size, + set_page_uncached, NULL); + flush_tlb_kernel_range(start, start + size); + return ret; + } + + return -EFAULT; +} + +static void unpin_pages(struct kvm_memory_slot *memslot) +{ + int i; + + if (!memslot->arch.pages) + return; + + for (i = 0; i < memslot->npages; ++i) { + if (memslot->arch.pages[i]) + put_page(memslot->arch.pages[i]); + } + kfree(memslot->arch.pages); + memslot->arch.pages = NULL; +} + int kvm_arch_prepare_memory_region(struct kvm *kvm, struct kvm_memory_slot *memslot, struct kvm_userspace_memory_region *mem, @@ -1705,6 +1753,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm, hva_t hva = mem->userspace_addr; hva_t reg_end = hva + mem->memory_size; bool writable = !(mem->flags & KVM_MEM_READONLY); + struct page **pages = memslot->arch.pages; int ret = 0; if (change != KVM_MR_CREATE && change != KVM_MR_MOVE && @@ -1768,6 +1817,26 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm, writable); if (ret) break; + } else if ((change != KVM_MR_FLAGS_ONLY) + && (memslot->flags & KVM_MEM_UNCACHED)) { + + long nr_pages = (vm_end - vm_start)/PAGE_SIZE; + + if (!pages) { + pages = kzalloc(memslot->npages * + sizeof(struct page *), GFP_KERNEL); + if (!pages) + return -ENOMEM; + memslot->arch.pages = pages; + } + + ret = vma_range_pin_and_set_uncached(vma, vm_start, + nr_pages, pages); + if (ret) { + unpin_pages(memslot); + break; + } + pages += nr_pages; } hva = vm_end;
[Qemu-devel] [RFC PATCH 2/3] arm/arm64: KVM: decouple READONLY and UNCACHED
KVM_MEM_UNCACHED memory will no longer need caches to be flushed for memory as it's faulted in. Just use READONLY directly, in that case, now. Signed-off-by: Andrew Jones --- arch/arm/kvm/mmu.c | 19 +-- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c index 719931e83c468..59af5ad779eb6 100644 --- a/arch/arm/kvm/mmu.c +++ b/arch/arm/kvm/mmu.c @@ -1260,7 +1260,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, if (!hugetlb && !force_pte) hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa); - fault_ipa_uncached = memslot->flags & KVM_MEM_UNCACHED; + /* +* Readonly memslots are not incoherent with the caches by definition, +* but in practice, they are used mostly to emulate ROMs or NOR flashes +* that the guest may consider devices and hence map as uncached. +* To prevent incoherency issues in these cases, force dcache flushes +* for all pages in the region as they're faulted in. +*/ + fault_ipa_uncached = (memslot->flags & KVM_MEM_READONLY) && + !(memslot->flags & KVM_MEM_UNCACHED); if (hugetlb) { pmd_t new_pmd = pfn_pmd(pfn, mem_type); @@ -1784,15 +1792,6 @@ void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *free, int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot, unsigned long npages) { - /* -* Readonly memslots are not incoherent with the caches by definition, -* but in practice, they are used mostly to emulate ROMs or NOR flashes -* that the guest may consider devices and hence map as uncached. -* To prevent incoherency issues in these cases, tag all readonly -* regions as incoherent. -*/ - if (slot->flags & KVM_MEM_READONLY) - slot->flags |= KVM_MEM_UNCACHED; return 0; } -- 1.8.3.1
[Qemu-devel] [RFC PATCH 1/4] kvm-all: put kvm_mem_flags to more work
Currently kvm_mem_flags just translates bools to bits, let's make it also determine the bools first. This avoids its parameter list growing each time we add a flag. Signed-off-by: Andrew Jones --- kvm-all.c | 25 +++-- 1 file changed, 15 insertions(+), 10 deletions(-) diff --git a/kvm-all.c b/kvm-all.c index 55025cc366992..a1bc05a8d5c5c 100644 --- a/kvm-all.c +++ b/kvm-all.c @@ -295,10 +295,14 @@ err: * dirty pages logging control */ -static int kvm_mem_flags(KVMState *s, bool log_dirty, bool readonly) +static int kvm_mem_flags(MemoryRegion *mr) { +bool readonly = mr->readonly || memory_region_is_romd(mr); int flags = 0; -flags = log_dirty ? KVM_MEM_LOG_DIRTY_PAGES : 0; + +if (memory_region_is_logging(mr)) { +flags |= KVM_MEM_LOG_DIRTY_PAGES; +} if (readonly && kvm_readonly_mem_allowed) { flags |= KVM_MEM_READONLY; } @@ -313,7 +317,10 @@ static int kvm_slot_dirty_pages_log_change(KVMSlot *mem, bool log_dirty) old_flags = mem->flags; -flags = (mem->flags & ~mask) | kvm_mem_flags(s, log_dirty, false); +flags = mem->flags & ~mask; +if (log_dirty) { +flags |= KVM_MEM_LOG_DIRTY_PAGES; +} mem->flags = flags; /* If nothing changed effectively, no need to issue ioctl */ @@ -643,9 +650,7 @@ static void kvm_set_phys_mem(MemoryRegionSection *section, bool add) KVMSlot *mem, old; int err; MemoryRegion *mr = section->mr; -bool log_dirty = memory_region_is_logging(mr); bool writeable = !mr->readonly && !mr->rom_device; -bool readonly_flag = mr->readonly || memory_region_is_romd(mr); hwaddr start_addr = section->offset_within_address_space; ram_addr_t size = int128_get64(section->size); void *ram = NULL; @@ -689,7 +694,7 @@ static void kvm_set_phys_mem(MemoryRegionSection *section, bool add) (ram - start_addr == mem->ram - mem->start_addr)) { /* The new slot fits into the existing one and comes with * identical parameters - update flags and done. */ -kvm_slot_dirty_pages_log_change(mem, log_dirty); +kvm_slot_dirty_pages_log_change(mem, memory_region_is_logging(mr)); return; } @@ -722,7 +727,7 @@ static void kvm_set_phys_mem(MemoryRegionSection *section, bool add) mem->memory_size = old.memory_size; mem->start_addr = old.start_addr; mem->ram = old.ram; -mem->flags = kvm_mem_flags(s, log_dirty, readonly_flag); +mem->flags = kvm_mem_flags(mr); err = kvm_set_user_memory_region(s, mem); if (err) { @@ -743,7 +748,7 @@ static void kvm_set_phys_mem(MemoryRegionSection *section, bool add) mem->memory_size = start_addr - old.start_addr; mem->start_addr = old.start_addr; mem->ram = old.ram; -mem->flags = kvm_mem_flags(s, log_dirty, readonly_flag); +mem->flags = kvm_mem_flags(mr); err = kvm_set_user_memory_region(s, mem); if (err) { @@ -767,7 +772,7 @@ static void kvm_set_phys_mem(MemoryRegionSection *section, bool add) size_delta = mem->start_addr - old.start_addr; mem->memory_size = old.memory_size - size_delta; mem->ram = old.ram + size_delta; -mem->flags = kvm_mem_flags(s, log_dirty, readonly_flag); +mem->flags = kvm_mem_flags(mr); err = kvm_set_user_memory_region(s, mem); if (err) { @@ -789,7 +794,7 @@ static void kvm_set_phys_mem(MemoryRegionSection *section, bool add) mem->memory_size = size; mem->start_addr = start_addr; mem->ram = ram; -mem->flags = kvm_mem_flags(s, log_dirty, readonly_flag); +mem->flags = kvm_mem_flags(mr); err = kvm_set_user_memory_region(s, mem); if (err) { -- 1.8.3.1
[Qemu-devel] [RFC PATCH 1/3] KVM: promote KVM_MEMSLOT_INCOHERENT to uapi
Also rename to KVM_MEM_UNCACHED. Signed-off-by: Andrew Jones --- Documentation/virtual/kvm/api.txt | 16 ++-- arch/arm/include/uapi/asm/kvm.h | 1 + arch/arm/kvm/arm.c| 1 + arch/arm/kvm/mmu.c| 4 ++-- arch/arm64/include/uapi/asm/kvm.h | 1 + include/linux/kvm_host.h | 1 - include/uapi/linux/kvm.h | 2 ++ virt/kvm/kvm_main.c | 7 ++- 8 files changed, 23 insertions(+), 10 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 0007fef4ed814..a5a51403a7937 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -902,6 +902,7 @@ struct kvm_userspace_memory_region { /* for kvm_memory_region::flags */ #define KVM_MEM_LOG_DIRTY_PAGES(1UL << 0) #define KVM_MEM_READONLY (1UL << 1) +#define KVM_MEM_UNCACHED (1UL << 2) This ioctl allows the user to create or modify a guest physical memory slot. When changing an existing slot, it may be moved in the guest @@ -917,12 +918,15 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr be identical. This allows large pages in the guest to be backed by large pages in the host. -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and -KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of -writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to -use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it, -to make a new slot read-only. In this case, writes to this memory will be -posted to userspace as KVM_EXIT_MMIO exits. +The flags field supports three flags: KVM_MEM_LOG_DIRTY_PAGES, +KVM_MEM_READONLY, and KVM_MEM_UNCACHED. The first can be set to instruct +KVM to keep track of writes to memory within the slot. See KVM_GET_DIRTY_LOG +ioctl to know how to use it. The second can be set, if KVM_CAP_READONLY_MEM +capability allows it, to make a new slot read-only. In this case, writes to +this memory will be posted to userspace as KVM_EXIT_MMIO exits. The third can +be set, if the KVM_CAP_UNCACHED_MEM capability allows it. This remaps the +memory as uncached, i.e. userspace will always directly read/write RAM for +this memory region. When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of the memory region are automatically reflected into the guest. For example, an diff --git a/arch/arm/include/uapi/asm/kvm.h b/arch/arm/include/uapi/asm/kvm.h index 09ee408c1a676..9d6fc19acf8a2 100644 --- a/arch/arm/include/uapi/asm/kvm.h +++ b/arch/arm/include/uapi/asm/kvm.h @@ -26,6 +26,7 @@ #define __KVM_HAVE_GUEST_DEBUG #define __KVM_HAVE_IRQ_LINE #define __KVM_HAVE_READONLY_MEM +#define __KVM_HAVE_UNCACHED_MEM #define KVM_REG_SIZE(id) \ (1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT)) diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index c9e6ef1f7403a..8d4c08f238cf0 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -180,6 +180,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_ARM_PSCI: case KVM_CAP_ARM_PSCI_0_2: case KVM_CAP_READONLY_MEM: + case KVM_CAP_UNCACHED_MEM: r = 1; break; case KVM_CAP_COALESCED_MMIO: diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c index bcc1b3ad2adce..719931e83c468 100644 --- a/arch/arm/kvm/mmu.c +++ b/arch/arm/kvm/mmu.c @@ -1260,7 +1260,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, if (!hugetlb && !force_pte) hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa); - fault_ipa_uncached = memslot->flags & KVM_MEMSLOT_INCOHERENT; + fault_ipa_uncached = memslot->flags & KVM_MEM_UNCACHED; if (hugetlb) { pmd_t new_pmd = pfn_pmd(pfn, mem_type); @@ -1792,7 +1792,7 @@ int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot, * regions as incoherent. */ if (slot->flags & KVM_MEM_READONLY) - slot->flags |= KVM_MEMSLOT_INCOHERENT; + slot->flags |= KVM_MEM_UNCACHED; return 0; } diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h index 8e38878c87c61..5553d112e405b 100644 --- a/arch/arm64/include/uapi/asm/kvm.h +++ b/arch/arm64/include/uapi/asm/kvm.h @@ -38,6 +38,7 @@ #define __KVM_HAVE_GUEST_DEBUG #define __KVM_HAVE_IRQ_LINE #define __KVM_HAVE_READONLY_MEM +#define __KVM_HAVE_UNCACHED_MEM #define KVM_REG_SIZE(id) \ (1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT)) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 3b934cc94cc83..9dfb519c51e5b 100644 --- a/include/linux/kvm_h
[Qemu-devel] [RFC PATCH 3/4] memory: add uncached flag
Add an 'uncached' flag, which will result in the KVM_MEM_UNCACHED flag getting set on KVM's corresponding memory slot. Signed-off-by: Andrew Jones --- include/exec/memory.h | 25 + kvm-all.c | 9 + memory.c | 15 +++ 3 files changed, 49 insertions(+) diff --git a/include/exec/memory.h b/include/exec/memory.h index 06ffa1d185b93..2a0d016f5fe6d 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -158,6 +158,7 @@ struct MemoryRegion { bool rom_device; bool warning_printed; /* For reservations */ bool flush_coalesced_mmio; +bool uncached; MemoryRegion *alias; hwaddr alias_offset; int32_t priority; @@ -778,6 +779,30 @@ void memory_region_set_flush_coalesced(MemoryRegion *mr); void memory_region_clear_flush_coalesced(MemoryRegion *mr); /** + * memory_region_set_uncached: Flag this memory region (for e.g. kvm) as + * typically being mapped uncached. + * + * @mr: the memory region to be updated. + */ +void memory_region_set_uncached(MemoryRegion *mr); + +/** + * memory_region_clear_uncached: Remove the 'uncached' flag. + * + * @mr: the memory region to be updated. + */ +void memory_region_clear_uncached(MemoryRegion *mr); + +/** + * memory_region_may_map_uncached: Return the 'uncached' flag, which + * indicates the region is typically + * mapped uncached. + * + * @mr: the memory region to check. + */ +bool memory_region_may_map_uncached(MemoryRegion *mr); + +/** * memory_region_add_eventfd: Request an eventfd to be triggered when a word *is written to a location. * diff --git a/kvm-all.c b/kvm-all.c index a1bc05a8d5c5c..4c826d0eab37b 100644 --- a/kvm-all.c +++ b/kvm-all.c @@ -126,6 +126,7 @@ bool kvm_gsi_routing_allowed; bool kvm_gsi_direct_mapping; bool kvm_allowed; bool kvm_readonly_mem_allowed; +bool kvm_uncached_mem_allowed; bool kvm_vm_attributes_allowed; static const KVMCapabilityInfo kvm_required_capabilites[] = { @@ -306,6 +307,9 @@ static int kvm_mem_flags(MemoryRegion *mr) if (readonly && kvm_readonly_mem_allowed) { flags |= KVM_MEM_READONLY; } +if (memory_region_may_map_uncached(mr) && kvm_uncached_mem_allowed) { +flags |= KVM_MEM_UNCACHED; +} return flags; } @@ -1595,6 +1599,11 @@ static int kvm_init(MachineState *ms) (kvm_check_extension(s, KVM_CAP_READONLY_MEM) > 0); #endif +#ifdef KVM_CAP_UNCACHED_MEM +kvm_uncached_mem_allowed = +(kvm_check_extension(s, KVM_CAP_UNCACHED_MEM) > 0); +#endif + kvm_eventfds_allowed = (kvm_check_extension(s, KVM_CAP_IOEVENTFD) > 0); diff --git a/memory.c b/memory.c index ee3f2a8a954ef..495cec1b6650c 100644 --- a/memory.c +++ b/memory.c @@ -1549,6 +1549,21 @@ void memory_region_clear_flush_coalesced(MemoryRegion *mr) } } +void memory_region_set_uncached(MemoryRegion *mr) +{ +mr->uncached = true; +} + +void memory_region_clear_uncached(MemoryRegion *mr) +{ +mr->uncached = false; +} + +bool memory_region_may_map_uncached(MemoryRegion *mr) +{ +return mr->uncached; +} + void memory_region_add_eventfd(MemoryRegion *mr, hwaddr addr, unsigned size, -- 1.8.3.1
[Qemu-devel] [RFC PATCH 0/4] support KVM_MEM_UNCACHED
Add support for the new KVM_MEM_UNCACHED flag, and flag appropriate memory. (Only flags vga/vram for now.) Andrew Jones (4): kvm-all: put kvm_mem_flags to more work HACK: linux header update memory: add uncached flag vga: flag vram as uncached hw/display/vga.c | 1 + include/exec/memory.h | 25 + kvm-all.c | 34 -- linux-headers/linux/kvm.h | 2 ++ memory.c | 15 +++ 5 files changed, 67 insertions(+), 10 deletions(-) -- 1.8.3.1
[Qemu-devel] [RFC PATCH 2/4] HACK: linux header update
Should do a proper update-linux-headers.sh update. Signed-off-by: Andrew Jones --- linux-headers/linux/kvm.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h index 60a54c82a3b76..34c03cd80aa69 100644 --- a/linux-headers/linux/kvm.h +++ b/linux-headers/linux/kvm.h @@ -108,6 +108,7 @@ struct kvm_userspace_memory_region { */ #define KVM_MEM_LOG_DIRTY_PAGES(1UL << 0) #define KVM_MEM_READONLY (1UL << 1) +#define KVM_MEM_UNCACHED (1UL << 2) /* for KVM_IRQ_LINE */ struct kvm_irq_level { @@ -760,6 +761,7 @@ struct kvm_ppc_smmu_info { #define KVM_CAP_PPC_ENABLE_HCALL 104 #define KVM_CAP_CHECK_EXTENSION_VM 105 #define KVM_CAP_S390_USER_SIGP 106 +#define KVM_CAP_UNCACHED_MEM 107 #ifdef KVM_CAP_IRQ_ROUTING -- 1.8.3.1
[Qemu-devel] [RFC PATCH 4/4] vga: flag vram as uncached
Video RAM is typically mapped as uncached by guests. Flag it as such. Signed-off-by: Andrew Jones --- hw/display/vga.c | 1 + 1 file changed, 1 insertion(+) diff --git a/hw/display/vga.c b/hw/display/vga.c index c0f7b343bbab5..5c8c249a8d780 100644 --- a/hw/display/vga.c +++ b/hw/display/vga.c @@ -2143,6 +2143,7 @@ void vga_common_init(VGACommonState *s, Object *obj, bool global_vmstate) s->is_vbe_vmstate = 1; memory_region_init_ram(&s->vram, obj, "vga.vram", s->vram_size, &error_abort); +memory_region_set_uncached(&s->vram); vmstate_register_ram(&s->vram, global_vmstate ? NULL : DEVICE(obj)); xen_register_framebuffer(&s->vram); s->vram_ptr = memory_region_get_ram_ptr(&s->vram); -- 1.8.3.1
Re: [Qemu-devel] the arm cache coherency cluster "v2"
On Wed, Mar 18, 2015 at 03:08:20PM -0400, Andrew Jones wrote: > In reply to this message I'll send two series' one for KVM and > one for QEMU. The two series' are their respective component > complements, and attempt to implement cache coherency for arm > guests using emulated devices, where the emulator (qemu) uses > cached memory for the device memory, but the guest uses > uncached - as device memory is generally used. Right now I've > just focused on VGA vram. > > This approach is the "MADV_UNCACHED" type that Paolo suggested. > This type of approach could also be described as "make userspace's > memory access type match the expected access type of the guest", > and Mario has suggested using a memory driver, which could have > the same result. > > The coming series' is inspired by both Paolo's and Mario's > suggestions, but it uses a kvm memslot flag, rather than an > madvise flag, and thus for the memory driver, it's just KVM. > > See the thread > > https://lists.gnu.org/archive/html/qemu-devel/2015-03/msg01254.html > > for some more background. > > Thanks in advance for comments. > > drew I forgot to mention that I've done some light testing with this. It seems to work, and without (to eye) noticeable performance degradation. Thanks, drew
Re: [Qemu-devel] [RFC PATCH 0/3] KVM: Introduce KVM_MEM_UNCACHED
On Thu, Mar 19, 2015 at 05:56:20PM +0100, Paolo Bonzini wrote: > > > On 18/03/2015 20:10, Andrew Jones wrote: > > Introduce a new memory region flag, KVM_MEM_UNCACHED, which > > is needed by ARM. This flag informs KVM that the given memory > > region is typically mapped by the guest as uncached. KVM for > > ARM then maps that region as uncached for userspace as well, > > in order to keep coherency. > > > > Andrew Jones (3): > > KVM: promote KVM_MEMSLOT_INCOHERENT to uapi > > arm/arm64: KVM: decouple READONLY and UNCACHED > > arm/arm64: KVM: implement KVM_MEM_UNCACHED > > > > Documentation/virtual/kvm/api.txt | 16 --- > > arch/arm/include/asm/kvm_mmu.h| 9 > > arch/arm/include/uapi/asm/kvm.h | 2 + > > arch/arm/kvm/arm.c| 1 + > > arch/arm/kvm/mmu.c| 90 > > ++- > > arch/arm64/include/asm/kvm_mmu.h | 9 > > arch/arm64/include/uapi/asm/kvm.h | 2 + > > include/linux/kvm_host.h | 1 - > > include/uapi/linux/kvm.h | 2 + > > virt/kvm/kvm_main.c | 7 ++- > > 10 files changed, 121 insertions(+), 18 deletions(-) > > Hi Paolo, Thanks for the comments! > > I think the pinning breaks running KVM as non-root (the default ulimit > -l is just 64 KiB). Can you do something about it with the MMU > notifiers instead? I'll look into this as soon as possible. I'm headed out 25 minutes ago for some vacation time though, so it'll be week or so before I can. > > It looks very clean. > > The fly in the ointment is that some kind of quirk is probably needed in > the future for ivshmem---it's definitely not acceptable performance-wise > to mark it uncached, neither in the guests not in the host. On the > other hand that quirk would be necessary only in the guest, not in the > firmware, so we can live with a mixture of the two approaches. Yes, quirking ivshmem makes sense. ivshmem users (guest drivers) should ensure they map the memory as cached, as they know it's normal memory on the host. drew
[Qemu-devel] [PATCH] tcg-aarch64: handle additional PXN case
D4.5.1 "Memory access control:Access permissions for instruction execution" states "... In addition: * For the EL1&0 translation regime, if the value of the AP[2:1] bits is 0b01, permitting write access from EL0, then the PXN bit is treated as if it has the value 1, regardless of its actual value. ..." Signed-off-by: Andrew Jones --- target-arm/helper.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/target-arm/helper.c b/target-arm/helper.c index 3ef0f1f38eda5..96275194a 100644 --- a/target-arm/helper.c +++ b/target-arm/helper.c @@ -4960,6 +4960,8 @@ static int get_phys_addr_lpae(CPUARMState *env, target_ulong address, *prot = PAGE_READ | PAGE_WRITE | PAGE_EXEC; if ((arm_feature(env, ARM_FEATURE_V8) && is_user && (attrs & (1 << 12))) || (!arm_feature(env, ARM_FEATURE_V8) && (attrs & (1 << 12))) || +(arm_feature(env, ARM_FEATURE_V8) && !is_user && + ((attrs & (3 << 4)) == (1 << 4) /* AP[2:1] == 0b01 */)) || (!is_user && (attrs & (1 << 11 { /* XN/UXN or PXN. Since we only implement EL0/EL1 we unconditionally * treat XN/UXN as UXN for v8. -- 1.9.3
Re: [Qemu-devel] [PATCH] tcg-aarch64: handle additional PXN case
On Mon, Jan 05, 2015 at 11:54:17AM +, Peter Maydell wrote: > On 2 January 2015 at 17:33, Andrew Jones wrote: > > D4.5.1 "Memory access control:Access permissions for instruction > > execution" states > > "... > > In addition: > > * For the EL1&0 translation regime, if the value of the AP[2:1] bits > > is 0b01, permitting write access from EL0, then the PXN bit is > > treated as if it has the value 1, regardless of its actual value. > > ..." > > As far as I can see this only applies to 64-bit translations > (there is no equivalent wording in the 32-bit VMSA section of > the ARM ARM), so I think the condition should be on va_size == 64, > not on ARM_FEATURE_V8. Ah yes, using ARM_FEATURE_V8 is a mistake. I don't see anything like this in the AArch32 section either (just looked now). > > > @@ -4960,6 +4960,8 @@ static int get_phys_addr_lpae(CPUARMState *env, > > target_ulong address, > > *prot = PAGE_READ | PAGE_WRITE | PAGE_EXEC; > > if ((arm_feature(env, ARM_FEATURE_V8) && is_user && (attrs & (1 << > > 12))) || > > (!arm_feature(env, ARM_FEATURE_V8) && (attrs & (1 << 12))) || > > +(arm_feature(env, ARM_FEATURE_V8) && !is_user && > > + ((attrs & (3 << 4)) == (1 << 4) /* AP[2:1] == 0b01 */)) || > > (!is_user && (attrs & (1 << 11 { > > /* XN/UXN or PXN. Since we only implement EL0/EL1 we > > unconditionally > > * treat XN/UXN as UXN for v8. > > This condition is becoming pretty badly overweight. I think that > rather than just add another clause to it (especially one which > needs an embedded /* comment */ !) we should split it up somehow. > (Consider also that as per the comment we're going to need to > distinguish UXN from XN shortly for EL2/EL3.) I can take a stab at cleaning this up. The thought had crossed my mind as well. > > We don't implement the SCTLR.UWXN/WXN bits either -- don't know > if you care about those. I care in the sense that I'd like tcg-aarch64 to be as accurate as possible, but I haven't bumped into a need for WXN support yet, as I have with this PXN condition. I can throw support into the new 'prot = check_xn(...)' function that we'll create for the cleanup. Thanks, drew
[Qemu-devel] [PATCH] tcg-arm: more instruction execution control
Cleanup XN/PXN handling in get_phys_addr_lpae, and implement all but EL2 support of the following ARMv8 sections D4.5.1 Memory access control: Access permissions for instruction execution G4.7.2 Execute-never restrictions on instruction fetching G4.7.2 matches the ARMv7 section B3.7.2 when long-descriptors are used. Signed-off-by: Andrew Jones --- I didn't test this with secure mode (I didn't even check if that's currently possible), but I did test all other EL1&0 XN controls with both tcg-arm (cortex-a15) and tcg-aarch64 (cortex-a57) by hacking up some tests with kvm-unit-tests/arm[64]. I also booted Linux (just up to looking for a rootfs) with both. --- target-arm/helper.c | 99 ++--- 1 file changed, 79 insertions(+), 20 deletions(-) diff --git a/target-arm/helper.c b/target-arm/helper.c index 3ef0f1f38eda5..48a83cff957fb 100644 --- a/target-arm/helper.c +++ b/target-arm/helper.c @@ -4516,6 +4516,69 @@ static inline int check_ap(CPUARMState *env, int ap, int domain_prot, } } +/* Check section/page execution permission. + * + * Returns PAGE_EXEC if execution is permitted, otherwise zero. + * + * Supports AArch64, AArch32, and v7-LPAE. Doesn't yet support + * EL2. Actually, for v7-LPAE, we'd also need to emulate Virt + * Extensions to truly have WXN/UWXN. We don't support checking + * for that yet, so we just assume we have them. + * + * @env : CPUARMState + * @is_user : 0 for privileged access, 1 for user + * @ap : Access permissions (AP[2:1]) from descriptor + * @ns : (NSTable || NS) from descriptors + * @xn : ([U]XNTable || [U]XN) descriptors + * @pxn : (PXNTable || PXN) descriptors + */ +static int check_xn_lpae(CPUARMState *env, int is_user, int ap, + int ns, int xn, int pxn) +{ +int wxn; + +switch (arm_current_el(env)) { +case 0: +case 1: +if (is_user && !(ap & 1)) { +return 0; +} +wxn = env->cp15.sctlr_el[1] & SCTLR_WXN; +if (arm_el_is_aa64(env, 1)) { +pxn = pxn || ((ap & 1) && !(ap & 2)); +xn = is_user ? xn : pxn; +} else { +int uwxn = env->cp15.sctlr_el[1] & SCTLR_UWXN; +pxn = pxn || (uwxn && (ap & 1) && !(ap & 2)); +xn = xn || (!is_user && pxn); +} +if (arm_is_secure(env)) { +xn = xn || (ns && (env->cp15.scr_el3 & SCR_SIF)); +} +break; +case 2: +/* TODO actually support EL2 */ +assert(true); + +if (arm_el_is_aa64(env, 2)) { +wxn = env->cp15.sctlr_el[2] & SCTLR_WXN; +} else { +/* wxn = HSCTLR.WXN */ +} +break; +case 3: +if (arm_el_is_aa64(env, 3)) { +wxn = env->cp15.sctlr_el[3] & SCTLR_WXN; +} else { +wxn = 0; +} +xn = xn || (ns && (env->cp15.scr_el3 & SCR_SIF)); +break; +} + +return xn || (!(ap & 2) && wxn) ? 0 : PAGE_EXEC; +} + static bool get_level1_table_address(CPUARMState *env, uint32_t *table, uint32_t address) { @@ -4787,7 +4850,7 @@ static int get_phys_addr_lpae(CPUARMState *env, target_ulong address, hwaddr descaddr, descmask; uint32_t tableattrs; target_ulong page_size; -uint32_t attrs; +uint32_t attrs, ap; int32_t granule_sz = 9; int32_t va_size = 32; int32_t tbi = 0; @@ -4941,7 +5004,7 @@ static int get_phys_addr_lpae(CPUARMState *env, target_ulong address, if (extract32(tableattrs, 2, 1)) { attrs &= ~(1 << 4); } -/* Since we're always in the Non-secure state, NSTable is ignored. */ +attrs |= extract32(tableattrs, 4, 1) << 3; /* NS */ break; } /* Here descaddr is the final physical address, and attributes @@ -4952,29 +5015,25 @@ static int get_phys_addr_lpae(CPUARMState *env, target_ulong address, /* Access flag */ goto do_fault; } + fault_type = permission_fault; -if (is_user && !(attrs & (1 << 4))) { +ap = extract32(attrs, 4, 2); /* AP[2:1] */ + +if (is_user && !(ap & 1)) { /* Unprivileged access not enabled */ goto do_fault; } -*prot = PAGE_READ | PAGE_WRITE | PAGE_EXEC; -if ((arm_feature(env, ARM_FEATURE_V8) && is_user && (attrs & (1 << 12))) || -(!arm_feature(env, ARM_FEATURE_V8) && (attrs & (1 << 12))) || -(!is_user && (attrs & (1 << 11 { -/* XN/UXN or PXN. Since we only implement EL0/EL1 we unconditionally - * treat XN/UXN as UXN for v8. - */ -if (access_type ==
Re: [Qemu-devel] [PATCH] tcg-arm: more instruction execution control
On Fri, Jan 09, 2015 at 06:06:33PM +0100, Andrew Jones wrote: > Cleanup XN/PXN handling in get_phys_addr_lpae, and implement all but > EL2 support of the following ARMv8 sections > > D4.5.1 Memory access control: Access permissions for instruction > execution > G4.7.2 Execute-never restrictions on instruction fetching > > G4.7.2 matches the ARMv7 section B3.7.2 when long-descriptors are used. > > Signed-off-by: Andrew Jones > > --- > I didn't test this with secure mode (I didn't even check if that's > currently possible), but I did test all other EL1&0 XN controls with > both tcg-arm (cortex-a15) and tcg-aarch64 (cortex-a57) by hacking up > some tests with kvm-unit-tests/arm[64]. I also booted Linux (just > up to looking for a rootfs) with both. > --- > target-arm/helper.c | 99 > ++--- > 1 file changed, 79 insertions(+), 20 deletions(-) > > diff --git a/target-arm/helper.c b/target-arm/helper.c > index 3ef0f1f38eda5..48a83cff957fb 100644 > --- a/target-arm/helper.c > +++ b/target-arm/helper.c > @@ -4516,6 +4516,69 @@ static inline int check_ap(CPUARMState *env, int ap, > int domain_prot, >} > } > > +/* Check section/page execution permission. > + * > + * Returns PAGE_EXEC if execution is permitted, otherwise zero. > + * > + * Supports AArch64, AArch32, and v7-LPAE. Doesn't yet support > + * EL2. Actually, for v7-LPAE, we'd also need to emulate Virt > + * Extensions to truly have WXN/UWXN. We don't support checking > + * for that yet, so we just assume we have them. > + * > + * @env : CPUARMState > + * @is_user : 0 for privileged access, 1 for user > + * @ap : Access permissions (AP[2:1]) from descriptor > + * @ns : (NSTable || NS) from descriptors > + * @xn : ([U]XNTable || [U]XN) descriptors > + * @pxn : (PXNTable || PXN) descriptors Just realized I'm missing 'from' in the last two lines above. > + */ > +static int check_xn_lpae(CPUARMState *env, int is_user, int ap, > + int ns, int xn, int pxn) > +{ > +int wxn; > + > +switch (arm_current_el(env)) { > +case 0: > +case 1: > +if (is_user && !(ap & 1)) { > +return 0; > +} > +wxn = env->cp15.sctlr_el[1] & SCTLR_WXN; > +if (arm_el_is_aa64(env, 1)) { > +pxn = pxn || ((ap & 1) && !(ap & 2)); > +xn = is_user ? xn : pxn; > +} else { > +int uwxn = env->cp15.sctlr_el[1] & SCTLR_UWXN; > +pxn = pxn || (uwxn && (ap & 1) && !(ap & 2)); > +xn = xn || (!is_user && pxn); > +} > +if (arm_is_secure(env)) { > +xn = xn || (ns && (env->cp15.scr_el3 & SCR_SIF)); > +} > +break; > +case 2: > +/* TODO actually support EL2 */ > +assert(true); And, argh... brain wires got crossed here. Of course this should be 'assert(false)'. I'll send a v2. > + > +if (arm_el_is_aa64(env, 2)) { > +wxn = env->cp15.sctlr_el[2] & SCTLR_WXN; > +} else { > +/* wxn = HSCTLR.WXN */ > +} > +break; > +case 3: > +if (arm_el_is_aa64(env, 3)) { > +wxn = env->cp15.sctlr_el[3] & SCTLR_WXN; > +} else { > +wxn = 0; > +} > +xn = xn || (ns && (env->cp15.scr_el3 & SCR_SIF)); > +break; > +} > + > +return xn || (!(ap & 2) && wxn) ? 0 : PAGE_EXEC; > +} > + > static bool get_level1_table_address(CPUARMState *env, uint32_t *table, > uint32_t address) > { > @@ -4787,7 +4850,7 @@ static int get_phys_addr_lpae(CPUARMState *env, > target_ulong address, > hwaddr descaddr, descmask; > uint32_t tableattrs; > target_ulong page_size; > -uint32_t attrs; > +uint32_t attrs, ap; > int32_t granule_sz = 9; > int32_t va_size = 32; > int32_t tbi = 0; > @@ -4941,7 +5004,7 @@ static int get_phys_addr_lpae(CPUARMState *env, > target_ulong address, > if (extract32(tableattrs, 2, 1)) { > attrs &= ~(1 << 4); > } > -/* Since we're always in the Non-secure state, NSTable is ignored. */ > +attrs |= extract32(tableattrs, 4, 1) << 3; /* NS */ > break; > } > /* Here descaddr is the final physical address, and attributes > @@ -4952,29 +5015,25 @@ static int get_phys_addr_lpae(C
[Qemu-devel] [PATCH v2] tcg-arm: more instruction execution control
Cleanup XN/PXN handling in get_phys_addr_lpae, and implement all but EL2 support of the following ARMv8 sections D4.5.1 Memory access control: Access permissions for instruction execution G4.7.2 Execute-never restrictions on instruction fetching G4.7.2 matches the ARMv7 section B3.7.2 when long-descriptors are used. Signed-off-by: Andrew Jones --- I didn't test this with secure mode (I didn't even check if that's currently possible), but I did test all other EL1&0 XN controls with both tcg-arm (cortex-a15) and tcg-aarch64 (cortex-a57) by hacking up some tests with kvm-unit-tests/arm[64]. I also booted Linux (just up to looking for a rootfs) with both. --- target-arm/helper.c | 99 ++--- 1 file changed, 79 insertions(+), 20 deletions(-) diff --git a/target-arm/helper.c b/target-arm/helper.c index 3ef0f1f38eda5..94f7631f0a125 100644 --- a/target-arm/helper.c +++ b/target-arm/helper.c @@ -4516,6 +4516,69 @@ static inline int check_ap(CPUARMState *env, int ap, int domain_prot, } } +/* Check section/page execution permission. + * + * Returns PAGE_EXEC if execution is permitted, otherwise zero. + * + * Supports AArch64, AArch32, and v7-LPAE. Doesn't yet support + * EL2. Actually, for v7-LPAE, we'd also need to emulate Virt + * Extensions to truly have WXN/UWXN. We don't support checking + * for that yet, so we just assume we have them. + * + * @env : CPUARMState + * @is_user : 0 for privileged access, 1 for user + * @ap : Access permissions (AP[2:1]) from descriptor + * @ns : (NSTable || NS) from descriptors + * @xn : ([U]XNTable || [U]XN) from descriptors + * @pxn : (PXNTable || PXN) from descriptors + */ +static int check_xn_lpae(CPUARMState *env, int is_user, int ap, + int ns, int xn, int pxn) +{ +int wxn; + +switch (arm_current_el(env)) { +case 0: +case 1: +if (is_user && !(ap & 1)) { +return 0; +} +wxn = env->cp15.sctlr_el[1] & SCTLR_WXN; +if (arm_el_is_aa64(env, 1)) { +pxn = pxn || ((ap & 1) && !(ap & 2)); +xn = is_user ? xn : pxn; +} else { +int uwxn = env->cp15.sctlr_el[1] & SCTLR_UWXN; +pxn = pxn || (uwxn && (ap & 1) && !(ap & 2)); +xn = xn || (!is_user && pxn); +} +if (arm_is_secure(env)) { +xn = xn || (ns && (env->cp15.scr_el3 & SCR_SIF)); +} +break; +case 2: +/* TODO actually support EL2 */ +assert(false); + +if (arm_el_is_aa64(env, 2)) { +wxn = env->cp15.sctlr_el[2] & SCTLR_WXN; +} else { +/* wxn = HSCTLR.WXN */ +} +break; +case 3: +if (arm_el_is_aa64(env, 3)) { +wxn = env->cp15.sctlr_el[3] & SCTLR_WXN; +} else { +wxn = 0; +} +xn = xn || (ns && (env->cp15.scr_el3 & SCR_SIF)); +break; +} + +return xn || (!(ap & 2) && wxn) ? 0 : PAGE_EXEC; +} + static bool get_level1_table_address(CPUARMState *env, uint32_t *table, uint32_t address) { @@ -4787,7 +4850,7 @@ static int get_phys_addr_lpae(CPUARMState *env, target_ulong address, hwaddr descaddr, descmask; uint32_t tableattrs; target_ulong page_size; -uint32_t attrs; +uint32_t attrs, ap; int32_t granule_sz = 9; int32_t va_size = 32; int32_t tbi = 0; @@ -4941,7 +5004,7 @@ static int get_phys_addr_lpae(CPUARMState *env, target_ulong address, if (extract32(tableattrs, 2, 1)) { attrs &= ~(1 << 4); } -/* Since we're always in the Non-secure state, NSTable is ignored. */ +attrs |= extract32(tableattrs, 4, 1) << 3; /* NS */ break; } /* Here descaddr is the final physical address, and attributes @@ -4952,29 +5015,25 @@ static int get_phys_addr_lpae(CPUARMState *env, target_ulong address, /* Access flag */ goto do_fault; } + fault_type = permission_fault; -if (is_user && !(attrs & (1 << 4))) { +ap = extract32(attrs, 4, 2); /* AP[2:1] */ + +if (is_user && !(ap & 1)) { /* Unprivileged access not enabled */ goto do_fault; } -*prot = PAGE_READ | PAGE_WRITE | PAGE_EXEC; -if ((arm_feature(env, ARM_FEATURE_V8) && is_user && (attrs & (1 << 12))) || -(!arm_feature(env, ARM_FEATURE_V8) && (attrs & (1 << 12))) || -(!is_user && (attrs & (1 << 11 { -/* XN/UXN or PXN. Since we only implement EL0/EL1 we unconditionally - * treat XN/UXN as UXN for v8. - */ -if (access_type ==
[Qemu-devel] [PATCH 0/2] tcg-arm: fix and extend instruction execution control
We're currently assuming EL1 can execute code it shouldn't, and that EL0 shouldn't execute code it can. Fix those cases, and also extend instruction execution control to handle WXN and more. The first patch addresses EL0 faulting when it should be allowed to execute. The second patch addresses EL1 not faulting when it should, as well as adds in additional execution control. Andrew Jones (2): tcg-aarch64: user doesn't need R/W access to exec tcg-arm: more instruction execution control target-arm/helper.c | 103 1 file changed, 80 insertions(+), 23 deletions(-) -- 1.9.3
[Qemu-devel] [PATCH 1/2] tcg-aarch64: user doesn't need R/W access to exec
Table D4-32 shows that execute access from EL0 doesn't depend on AP[1]. Signed-off-by: Andrew Jones --- target-arm/helper.c | 27 --- 1 file changed, 16 insertions(+), 11 deletions(-) diff --git a/target-arm/helper.c b/target-arm/helper.c index 3ef0f1f38eda5..7c30a2669a0f2 100644 --- a/target-arm/helper.c +++ b/target-arm/helper.c @@ -4787,7 +4787,7 @@ static int get_phys_addr_lpae(CPUARMState *env, target_ulong address, hwaddr descaddr, descmask; uint32_t tableattrs; target_ulong page_size; -uint32_t attrs; +uint32_t attrs, ap; int32_t granule_sz = 9; int32_t va_size = 32; int32_t tbi = 0; @@ -4952,14 +4952,20 @@ static int get_phys_addr_lpae(CPUARMState *env, target_ulong address, /* Access flag */ goto do_fault; } + fault_type = permission_fault; -if (is_user && !(attrs & (1 << 4))) { -/* Unprivileged access not enabled */ -goto do_fault; +ap = extract32(attrs, 4, 2); /* AP[2:1] */ + +*prot = 0; +if (!is_user || (ap & 1)) { +*prot |= PAGE_READ; +*prot |= !(ap & 2) ? PAGE_WRITE : 0; } -*prot = PAGE_READ | PAGE_WRITE | PAGE_EXEC; + +*prot |= PAGE_EXEC; if ((arm_feature(env, ARM_FEATURE_V8) && is_user && (attrs & (1 << 12))) || (!arm_feature(env, ARM_FEATURE_V8) && (attrs & (1 << 12))) || +(!arm_el_is_aa64(env, 1) && is_user && !(ap & 1)) || (!is_user && (attrs & (1 << 11 { /* XN/UXN or PXN. Since we only implement EL0/EL1 we unconditionally * treat XN/UXN as UXN for v8. @@ -4969,12 +4975,11 @@ static int get_phys_addr_lpae(CPUARMState *env, target_ulong address, } *prot &= ~PAGE_EXEC; } -if (attrs & (1 << 5)) { -/* Write access forbidden */ -if (access_type == 1) { -goto do_fault; -} -*prot &= ~PAGE_WRITE; + +if ((*prot == 0) +|| (!(*prot & PAGE_WRITE) && access_type == 1) +|| (!(*prot & PAGE_EXEC) && access_type == 2)) { +goto do_fault; } *phys_ptr = descaddr; -- 1.9.3
[Qemu-devel] [PATCH 2/2] tcg-arm: more instruction execution control
Cleanup XN/PXN handling in get_phys_addr_lpae, and implement all but EL2 support of the following ARMv8 sections D4.5.1 Memory access control: Access permissions for instruction execution G4.7.2 Execute-never restrictions on instruction fetching G4.7.2 matches the ARMv7 section B3.7.2 when long-descriptors are used. Signed-off-by: Andrew Jones --- v3: - correct logic for (is_user && !(ap & 1)) case - base on "user doesn't need R/W access to exec" patch v2: correct assert in el2 case I didn't test this with secure mode (I didn't even check if that's currently possible), but I did test all other EL1&0 XN controls with both tcg-arm (cortex-a15) and tcg-aarch64 (cortex-a57) by hacking up some tests with kvm-unit-tests/arm[64]. I also booted Linux (just up to looking for a rootfs) with both. target-arm/helper.c | 80 +++-- 1 file changed, 66 insertions(+), 14 deletions(-) diff --git a/target-arm/helper.c b/target-arm/helper.c index 7c30a2669a0f2..715e0f47bec7d 100644 --- a/target-arm/helper.c +++ b/target-arm/helper.c @@ -4516,6 +4516,69 @@ static inline int check_ap(CPUARMState *env, int ap, int domain_prot, } } +/* Check section/page execution permission. + * + * Returns PAGE_EXEC if execution is permitted, otherwise zero. + * + * Supports AArch64, AArch32, and v7-LPAE. Doesn't yet support + * EL2. Actually, for v7-LPAE, we'd also need to emulate Virt + * Extensions to truly have WXN/UWXN. We don't support checking + * for that yet, so we just assume we have them. + * + * @env : CPUARMState + * @is_user : 0 for privileged access, 1 for user + * @ap : Access permissions (AP[2:1]) from descriptor + * @ns : (NSTable || NS) from descriptors + * @xn : ([U]XNTable || [U]XN) from descriptors + * @pxn : (PXNTable || PXN) from descriptors + */ +static int check_xn_lpae(CPUARMState *env, int is_user, int ap, + int ns, int xn, int pxn) +{ +int wxn; + +switch (arm_current_el(env)) { +case 0: +case 1: +wxn = env->cp15.sctlr_el[1] & SCTLR_WXN; +if (arm_el_is_aa64(env, 1)) { +if (is_user && !(ap & 1)) { +wxn = 0; +} +pxn = pxn || ((ap & 1) && !(ap & 2)); +xn = is_user ? xn : pxn; +} else { +int uwxn = env->cp15.sctlr_el[1] & SCTLR_UWXN; +pxn = pxn || (uwxn && (ap & 1) && !(ap & 2)); +xn = xn || (!is_user && pxn) || (is_user && !(ap & 1)); +} +if (arm_is_secure(env)) { +xn = xn || (ns && (env->cp15.scr_el3 & SCR_SIF)); +} +break; +case 2: +/* TODO actually support EL2 */ +assert(false); + +if (arm_el_is_aa64(env, 2)) { +wxn = env->cp15.sctlr_el[2] & SCTLR_WXN; +} else { +/* wxn = HSCTLR.WXN */ +} +break; +case 3: +if (arm_el_is_aa64(env, 3)) { +wxn = env->cp15.sctlr_el[3] & SCTLR_WXN; +} else { +wxn = 0; +} +xn = xn || (ns && (env->cp15.scr_el3 & SCR_SIF)); +break; +} + +return xn || (!(ap & 2) && wxn) ? 0 : PAGE_EXEC; +} + static bool get_level1_table_address(CPUARMState *env, uint32_t *table, uint32_t address) { @@ -4941,7 +5004,7 @@ static int get_phys_addr_lpae(CPUARMState *env, target_ulong address, if (extract32(tableattrs, 2, 1)) { attrs &= ~(1 << 4); } -/* Since we're always in the Non-secure state, NSTable is ignored. */ +attrs |= extract32(tableattrs, 4, 1) << 3; /* NS */ break; } /* Here descaddr is the final physical address, and attributes @@ -4962,19 +5025,8 @@ static int get_phys_addr_lpae(CPUARMState *env, target_ulong address, *prot |= !(ap & 2) ? PAGE_WRITE : 0; } -*prot |= PAGE_EXEC; -if ((arm_feature(env, ARM_FEATURE_V8) && is_user && (attrs & (1 << 12))) || -(!arm_feature(env, ARM_FEATURE_V8) && (attrs & (1 << 12))) || -(!arm_el_is_aa64(env, 1) && is_user && !(ap & 1)) || -(!is_user && (attrs & (1 << 11 { -/* XN/UXN or PXN. Since we only implement EL0/EL1 we unconditionally - * treat XN/UXN as UXN for v8. - */ -if (access_type == 2) { -goto do_fault; -} -*prot &= ~PAGE_EXEC; -} +*prot |= check_xn_lpae(env, is_user, ap, extract32(attrs, 3, 1), + extract32(attrs, 12, 1), extract32(attrs, 11, 1)); if ((*prot == 0) || (!(*prot & PAGE_WRITE) && access_type == 1) -- 1.9.3
Re: [Qemu-devel] [PATCH v2] tcg-arm: more instruction execution control
On Mon, Jan 12, 2015 at 01:46:47PM +0100, Andrew Jones wrote: > Cleanup XN/PXN handling in get_phys_addr_lpae, and implement all but > EL2 support of the following ARMv8 sections > > D4.5.1 Memory access control: Access permissions for instruction > execution > G4.7.2 Execute-never restrictions on instruction fetching > > G4.7.2 matches the ARMv7 section B3.7.2 when long-descriptors are used. > > Signed-off-by: Andrew Jones > While confirming the documentation wasn't wrong (it wasn't), I see I missed another issue with qemu's instruction execution control. For AArch64, EL0 can execute code even if it doesn't have R/W access, i.e. AP[1]=0. To make this fix more clear I've done it in a separate patch, and then rebased this patch on that. Thus, please drop this patch, as I'll send a 2-patch patch series now that replaces it. drew
Re: [Qemu-devel] [PATCH 1/2] tcg-aarch64: user doesn't need R/W access to exec
On Fri, Jan 16, 2015 at 02:52:21PM +, Peter Maydell wrote: > On 13 January 2015 at 15:48, Andrew Jones wrote: > > Table D4-32 shows that execute access from EL0 doesn't depend > > on AP[1]. > > This commit message is a bit sparse, which confused me > for a bit. It would be worth beefing it up a bit: > > target-arm: 64-bit EL0 code can execute from unreadable pages > > In AArch64 mode, a page can be executable even if it is not > readable (a difference from AArch32). Instead of bailing out > early if the page is not readable, just add "32 bit and > page not readable" to the list of conditions that make a > page non-executable, and check whether the protections and > the access type are compatible once at the end of the function. OK > > > Signed-off-by: Andrew Jones > > --- > > target-arm/helper.c | 27 --- > > 1 file changed, 16 insertions(+), 11 deletions(-) > > > > diff --git a/target-arm/helper.c b/target-arm/helper.c > > index 3ef0f1f38eda5..7c30a2669a0f2 100644 > > --- a/target-arm/helper.c > > +++ b/target-arm/helper.c > > @@ -4787,7 +4787,7 @@ static int get_phys_addr_lpae(CPUARMState *env, > > target_ulong address, > > hwaddr descaddr, descmask; > > uint32_t tableattrs; > > target_ulong page_size; > > -uint32_t attrs; > > +uint32_t attrs, ap; > > int32_t granule_sz = 9; > > int32_t va_size = 32; > > int32_t tbi = 0; > > @@ -4952,14 +4952,20 @@ static int get_phys_addr_lpae(CPUARMState *env, > > target_ulong address, > > /* Access flag */ > > goto do_fault; > > } > > + > > fault_type = permission_fault; > > -if (is_user && !(attrs & (1 << 4))) { > > -/* Unprivileged access not enabled */ > > -goto do_fault; > > +ap = extract32(attrs, 4, 2); /* AP[2:1] */ > > + > > +*prot = 0; > > +if (!is_user || (ap & 1)) { > > +*prot |= PAGE_READ; > > +*prot |= !(ap & 2) ? PAGE_WRITE : 0; > > Personally I would find > if (!(ap & 2)) { > *prot |= PAGE_WRITE; > } > > clearer. OK > > > } > > -*prot = PAGE_READ | PAGE_WRITE | PAGE_EXEC; > > + > > +*prot |= PAGE_EXEC; > > if ((arm_feature(env, ARM_FEATURE_V8) && is_user && (attrs & (1 << > > 12))) || > > (!arm_feature(env, ARM_FEATURE_V8) && (attrs & (1 << 12))) || > > +(!arm_el_is_aa64(env, 1) && is_user && !(ap & 1)) || > > (!is_user && (attrs & (1 << 11 { > > /* XN/UXN or PXN. Since we only implement EL0/EL1 we > > unconditionally > > * treat XN/UXN as UXN for v8. > > @@ -4969,12 +4975,11 @@ static int get_phys_addr_lpae(CPUARMState *env, > > target_ulong address, > > } > > There is a "if access_type == 2 goto do_fault" check just > above this hunk which you can delete, because we're now > doing that check in the code you add below. Right. Will do, alternatively I should have brought the PAGE_EXEC handling below in with patch 2/2, which was my plan, but forgot to split it out. > > > *prot &= ~PAGE_EXEC; > > } > > -if (attrs & (1 << 5)) { > > -/* Write access forbidden */ > > -if (access_type == 1) { > > -goto do_fault; > > -} > > -*prot &= ~PAGE_WRITE; > > + > > +if ((*prot == 0) > > +|| (!(*prot & PAGE_WRITE) && access_type == 1) > > +|| (!(*prot & PAGE_EXEC) && access_type == 2)) { > > +goto do_fault; > > Why isn't this just > if (!(*prot & (1 << access_type))) { yeah, that would be better > > ? (Or at least, why doesn't it treat PAGE_READ the same way > as the other two bits?) As it is I think we'll treat a page > that is marked exec-not-readable as if it were readable. Oh yes, we should check PAGE_READ as well Thanks for the review. I see from another mail that you'll be sending some patches I should base the next version on. So I'll hold off on sending a revised patch until I see that. drew
Re: [Qemu-devel] [PATCH 2/2] tcg-arm: more instruction execution control
On Fri, Jan 16, 2015 at 04:16:02PM +, Peter Maydell wrote: > On 13 January 2015 at 15:48, Andrew Jones wrote: > > Cleanup XN/PXN handling in get_phys_addr_lpae, and implement all but > > EL2 support of the following ARMv8 sections > > > > D4.5.1 Memory access control: Access permissions for instruction > > execution > > G4.7.2 Execute-never restrictions on instruction fetching > > > > G4.7.2 matches the ARMv7 section B3.7.2 when long-descriptors are used. > > > > Signed-off-by: Andrew Jones > > > > --- > > v3: > > - correct logic for (is_user && !(ap & 1)) case > > - base on "user doesn't need R/W access to exec" patch > > v2: correct assert in el2 case > > > > I didn't test this with secure mode (I didn't even check if that's > > currently possible), but I did test all other EL1&0 XN controls with > > both tcg-arm (cortex-a15) and tcg-aarch64 (cortex-a57) by hacking up > > some tests with kvm-unit-tests/arm[64]. I also booted Linux (just > > up to looking for a rootfs) with both. > > > > target-arm/helper.c | 80 > > +++-- > > 1 file changed, 66 insertions(+), 14 deletions(-) > > > > diff --git a/target-arm/helper.c b/target-arm/helper.c > > index 7c30a2669a0f2..715e0f47bec7d 100644 > > --- a/target-arm/helper.c > > +++ b/target-arm/helper.c > > @@ -4516,6 +4516,69 @@ static inline int check_ap(CPUARMState *env, int ap, > > int domain_prot, > >} > > } > > > > +/* Check section/page execution permission. > > + * > > + * Returns PAGE_EXEC if execution is permitted, otherwise zero. > > + * > > + * Supports AArch64, AArch32, and v7-LPAE. Doesn't yet support > > + * EL2. Actually, for v7-LPAE, we'd also need to emulate Virt > > + * Extensions to truly have WXN/UWXN. We don't support checking > > + * for that yet, so we just assume we have them. > > If you want to know if the Virtualization Extensions are > present, check feature bit ARM_FEATURE_EL2. OK > > In general this code seems to be rather short on feature > bit checks -- notice that the code it is replacing was > doing checks on the V8 feature bit, for instance. The code it's replacing wasn't using the V8 feature bit correctly. It was using it to determine AArch64 vs. AArch32, not just V8 vs. V7. > > > + * > > + * @env : CPUARMState > > + * @is_user : 0 for privileged access, 1 for user > > + * @ap : Access permissions (AP[2:1]) from descriptor > > + * @ns : (NSTable || NS) from descriptors > > + * @xn : ([U]XNTable || [U]XN) from descriptors > > + * @pxn : (PXNTable || PXN) from descriptors > > + */ > > +static int check_xn_lpae(CPUARMState *env, int is_user, int ap, > > + int ns, int xn, int pxn) > > +{ > > +int wxn; > > + > > +switch (arm_current_el(env)) { > > +case 0: > > If arm_current_el() is 0 but EL3 is 32 bit then the > controlling environment here is the Secure PL1, which > is EL3, and falling through to look at sctlr_el[1] is > wrong. (SCTLR(S) is sctlr_el[3].) > > (Also get_phys_addr() can be called via the ats_write() > functions, which means that "arm_current_el()" and "the > address translation regime we're trying to determine the > answer for" aren't necessarily the same...) > > I think we're very soon going to need to bite the bullet and > make this code have a concept of the current "translation > regime", as the ARM ARM terms it... > > If you wish to avoid that (and I wouldn't object, given > this patchset is snowballing in scope already) then do what > get_phys_addr() does to get the relevant SCTLR: > uint32_t sctlr = A32_BANKED_CURRENT_REG_GET(env, sctlr); > This will work for everything we currently support: > * AArch64 EL1 (with an EL0 either AArch32 or AArch64) > * AArch32 EL1 > * TZ for the limited case of EL3 == AArch32 (and so no EL1) Thanks. I see from another mail that you've opted to get the translation regime worked up sooner than later. I'll wait for that to base a revision of this patch on. > > > +case 1: > > +wxn = env->cp15.sctlr_el[1] & SCTLR_WXN; > > +if (arm_el_is_aa64(env, 1)) { > > +if (is_user && !(ap & 1)) { > > +wxn = 0; > > +} > > I don't understand what we're doing here: as far as I can see > the ARM ARM simply define
Re: [Qemu-devel] [PATCH 2/2] tcg-arm: more instruction execution control
On Fri, Jan 16, 2015 at 07:02:55PM +, Peter Maydell wrote: > On 16 January 2015 at 16:16, Peter Maydell wrote: > > I think we're very soon going to need to bite the bullet and > > make this code have a concept of the current "translation > > regime", as the ARM ARM terms it... > > In fact I think we need to do it now. I'll try to > write some code to at least get a conceptual framework > in place and send it out next week sometime. Sounds good. Thanks! drew
[Qemu-devel] [PATCH] vl: rework smp_parse
smp_parse has a couple problems. First, it should use max_cpus, not smp_cpus when calculating missing topology information. Conversely, if maxcpus is not input, then the topology should dictate max_cpus, as the topology may support more than the input smp_cpus number. Second, smp_parse shouldn't silently adjust the number of threads a user provides, which it currently does in order to ensure the topology supports the number of cpus provided. smp_parse should rather complain and exit when input isn't consistent. This patch fixes those issues and attempts to clarify the code a bit too. I don't believe we need to consider compatibility with the old behaviors. Specifying something like -smp 4,sockets=1,cores=1,threads=1 is wrong, even though it currently works (the number of threads is silently adjusted to 4). And, specifying something like -smp 4,sockets=1,cores=4,threads=1,maxcpus=8 is also wrong, as there's no way to hotplug the additional 4 cpus with this topology. So, any users doing these things should be corrected. The new error message this patch provides should help them do that. Below are some examples with before/after results // topology should support up to max_cpus < -smp 4,sockets=2,maxcpus=8sockets=2 cores=2 threads=1 smp_cpus=4 max_cpus=8 > -smp 4,sockets=2,maxcpus=8sockets=2 cores=4 threads=1 > smp_cpus=4 max_cpus=8 // topology supports more than smp_cpus, max_cpus should be higher < -smp 4,sockets=4,cores=2 sockets=4 cores=2 threads=1 smp_cpus=4 max_cpus=4 > -smp 4,sockets=4,cores=2 sockets=4 cores=2 threads=1 > smp_cpus=4 max_cpus=8 // shouldn't silently adjust threads < -smp 4,sockets=1,cores=2,threads=1sockets=1 cores=2 threads=2 smp_cpus=4 max_cpus=4 > -smp 4,sockets=1,cores=2,threads=1"maxcpus must be equal to or > greater than smp" < -smp 4,sockets=2,cores=2,threads=4,maxcpus=8 sockets=2 cores=2 threads=1 smp_cpus=4 max_cpus=8 > -smp 4,sockets=2,cores=2,threads=4,maxcpus=8 "cpu topology: sockets (2) * > cores (2) * threads (4) != max_cpus (8)" Signed-off-by: Andrew Jones --- vl.c | 59 ++- 1 file changed, 38 insertions(+), 21 deletions(-) diff --git a/vl.c b/vl.c index f4a6e5e05bce2..23b21a5fbca50 100644 --- a/vl.c +++ b/vl.c @@ -1270,35 +1270,52 @@ static QemuOptsList qemu_smp_opts = { static void smp_parse(QemuOpts *opts) { if (opts) { - -unsigned cpus= qemu_opt_get_number(opts, "cpus", 0); unsigned sockets = qemu_opt_get_number(opts, "sockets", 0); unsigned cores = qemu_opt_get_number(opts, "cores", 0); unsigned threads = qemu_opt_get_number(opts, "threads", 0); +unsigned cpus; + +smp_cpus = qemu_opt_get_number(opts, "cpus", 0); +max_cpus = qemu_opt_get_number(opts, "maxcpus", 0); + +cpus = max_cpus ? max_cpus : smp_cpus; /* compute missing values, prefer sockets over cores over threads */ -if (cpus == 0 || sockets == 0) { -sockets = sockets > 0 ? sockets : 1; -cores = cores > 0 ? cores : 1; -threads = threads > 0 ? threads : 1; -if (cpus == 0) { -cpus = cores * threads * sockets; -} -} else { -if (cores == 0) { -threads = threads > 0 ? threads : 1; -cores = cpus / (sockets * threads); -} else { -threads = cpus / (cores * sockets); -} +if (cpus == 0) { +cpus = sockets ? sockets : 1; +cpus = cpus * cores ? cpus * cores : cpus; +cpus = cpus * threads ? cpus * threads : cpus; } -max_cpus = qemu_opt_get_number(opts, "maxcpus", 0); +if (sockets == 0) { +sockets = 1; +} -smp_cpus = cpus; -smp_cores = cores > 0 ? cores : 1; -smp_threads = threads > 0 ? threads : 1; +if (cores == 0) { +threads = threads ? threads : 1; +cores = cpus / (sockets * threads); +cores = cores ? cores : 1; +} + +if (threads == 0) { +threads = cpus / (sockets * cores); +threads = threads ? threads : 1; +} + +if (max_cpus == 0) { +max_cpus = sockets * cores * threads; +} +if (sockets * cores * threads != max_cpus) { +fprintf(stderr, "cpu topology: " +"sockets (%u) * cores (%u) * threads (%u) != max_cpus (%u)\n", +sockets, cores, threads, max_cpus); +exit(1); +} + +smp_cpus = smp_cpus ? smp_cpus : max_cpus; +smp_cores = cores; +smp_thr
Re: [Qemu-devel] [PATCH] vl: rework smp_parse
On Thu, Nov 06, 2014 at 05:17:44PM -0200, Eduardo Habkost wrote: > On Thu, Nov 06, 2014 at 05:09:35PM +0100, Andrew Jones wrote: > > smp_parse has a couple problems. First, it should use max_cpus, > > not smp_cpus when calculating missing topology information. > > Conversely, if maxcpus is not input, then the topology should > > dictate max_cpus, as the topology may support more than the > > input smp_cpus number. Second, smp_parse shouldn't silently > > adjust the number of threads a user provides, which it currently > > does in order to ensure the topology supports the number > > of cpus provided. smp_parse should rather complain and exit > > when input isn't consistent. This patch fixes those issues and > > attempts to clarify the code a bit too. > > > > I don't believe we need to consider compatibility with the old > > behaviors. Specifying something like > > > > -smp 4,sockets=1,cores=1,threads=1 > > > > is wrong, even though it currently works (the number of threads > > is silently adjusted to 4). > > Agreed, in this case. > > > And, specifying something like > > > > -smp 4,sockets=1,cores=4,threads=1,maxcpus=8 > > > > is also wrong, as there's no way to hotplug the additional 4 cpus > > with this topology. So, any users doing these things should be > > corrected. The new error message this patch provides should help > > them do that. > > In this case, you are proposing a new meaning for "sockets", and it > makes sense (I as suppose people understand "sockets" to mean all > sockets, even the empty ones). But it looks like it will break > compatility on cases we don't want to. > > For example, the case below is currently valid, and probably can be > generated by libvirt today. But you are now rejecting it: > > -smp 30,sockets=5,cores=3,threads=2,maxcpus=60 "cpu topology: sockets (5) > * cores (3) * threads (2) != max_cpus (60)" > > This is currently valid because "sockets" today means "online sockets", > not "all sockets, even the empty ones". Will hotplug still work correctly with this meaning? I expect that we need holes to fill. > > > It looks like the patch also breaks automatic socket count calculation, > which (AFAIK) works today: > > I get: > -smp 30,cores=3,threads=2"maxcpus must be equal to or greater than > smp" > but I expect: > -smp 30,cores=3,threads=2sockets=5 cores=3 threads=2 smp_cpus=30 > max_cpus=30 True. I should have preserved that behavior. The current code actually ends up with sockets=1 cores=3 threads=2 cpus=30 maxcpus=30 in smp_parse(), which is nonsense, but as the sockets count isn't saved (it's always derivable from smp_cpus, cores, threads), then in practice it doesn't matter. > > (And the error message is confusing: I am not even setting max_cpus, why > is it saying maxcpus is wrong?) > I should have added an underscore to maxcpus in that error message, as max_cpus is either maxcpus or the result of the topology now. > > > > > Below are some examples with before/after results > > > > // topology should support up to max_cpus > > < -smp 4,sockets=2,maxcpus=8sockets=2 cores=2 threads=1 > > smp_cpus=4 max_cpus=8 > > > -smp 4,sockets=2,maxcpus=8sockets=2 cores=4 threads=1 > > > smp_cpus=4 max_cpus=8 > > So, like in my first example above, we are breaking compatibility here > because the meaning of "sockets" changed. Do we want to? > I guess the main questions are: - do we need to make this change for hotplug? - do we need to make this change to make the command line more intuitive? > > (Unless otherwise noted in my other comments below, I am using the new > meaning for "sockets", not the old one.) > > > > > // topology supports more than smp_cpus, max_cpus should be higher > > < -smp 4,sockets=4,cores=2 sockets=4 cores=2 threads=1 > > smp_cpus=4 max_cpus=4 > > > -smp 4,sockets=4,cores=2 sockets=4 cores=2 threads=1 > > > smp_cpus=4 max_cpus=8 > > There are also existing cases where the user could simply expect > smp_threads to be calculated automatically. You seem to cover that, but > explicit testing would be nice. e.g., to make sure we won't break stuff > like: > > -smp 8,sockets=2,cores=2 sockets=2 cores=2 threads=2 > smp_cpus=8 max_cpus=8 > -smp 12,sockets=2,cores=2 sockets=2 cores=2 threads=3 > smp_cpus=12 max_cpus=12 > yeah, these w
Re: [Qemu-devel] [PATCH] vl: rework smp_parse
On Thu, Nov 06, 2014 at 11:11:30PM +0100, Paolo Bonzini wrote: > > > On 06/11/2014 17:09, Andrew Jones wrote: > > +if (sockets * cores * threads != max_cpus) { > > +fprintf(stderr, "cpu topology: " > > +"sockets (%u) * cores (%u) * threads (%u) != max_cpus > > (%u)\n", > > +sockets, cores, threads, max_cpus); > > +exit(1); > > +} > > I think this would cause too many failures in the wild. Perhaps error > out if it is lower, and warn if sockets * cores * threads > max_cpus > since we actually allow hot-plug a thread at a time? We'd still have more failures if we choose to error out when it's lower, since we currently silently adjust threads in some of those cases, or just don't care that the topology doesn't support up to maxcpus in other. I'm not sure how best to go about modifying the command line semantics in a backwards compatible way, other than to just create a new "-smp" option. I'm open to all opinions and suggestions. Thanks, drew
Re: [Qemu-devel] [PATCH] vl: rework smp_parse
On Fri, Nov 07, 2014 at 10:40:14AM +0100, Paolo Bonzini wrote: > > > On 07/11/2014 10:29, Andrew Jones wrote: > >> > I think this would cause too many failures in the wild. Perhaps error > >> > out if it is lower, and warn if sockets * cores * threads > max_cpus > >> > since we actually allow hot-plug a thread at a time? > > We'd still have more failures if we choose to error out when it's lower, > > since we currently silently adjust threads in some of those cases, or > > just don't care that the topology doesn't support up to maxcpus in other. > > So I guess we need a decent fallback if it doesn't match. Something > like (based also on the reply from Eduardo): > > 1) always warn if max_cpus % (cores*threads) != 0 || smp_cpus % > (cores*threads) != 0 > > 2) if sockets*cpus*threads < max_cpus, adjust sockets to > DIV_ROUND_UP(max_cpus, cores*threads). If we didn't warn in step 1, do > it now. Give a different, less harsh warning if the cmdline > sockets*cpus*threads did match smp_cpus. In the latter case, the user > _almost_ knows what he was doing. > > Not perfect, but it could be something to start from. Adjusting sockets > is better than adjusting threads. OK. I can whip up a v2 that is less harsh (more warnings, less aborts). I'll also address the other issue I mentioned in the bottom of my reply to Eduardo, which is to make sure we consider machine_class->max_cpus. drew > > Paolo > > > I'm not sure how best to go about modifying the command line semantics > > in a backwards compatible way, other than to just create a new "-smp" > > option. I'm open to all opinions and suggestions.
Re: [Qemu-devel] [PATCH] vl: rework smp_parse
On Fri, Nov 07, 2014 at 10:52:31AM +0100, Andrew Jones wrote: > On Fri, Nov 07, 2014 at 10:40:14AM +0100, Paolo Bonzini wrote: > > > > > > On 07/11/2014 10:29, Andrew Jones wrote: > > >> > I think this would cause too many failures in the wild. Perhaps error > > >> > out if it is lower, and warn if sockets * cores * threads > max_cpus > > >> > since we actually allow hot-plug a thread at a time? > > > We'd still have more failures if we choose to error out when it's lower, > > > since we currently silently adjust threads in some of those cases, or > > > just don't care that the topology doesn't support up to maxcpus in other. > > > > So I guess we need a decent fallback if it doesn't match. Something > > like (based also on the reply from Eduardo): > > > > 1) always warn if max_cpus % (cores*threads) != 0 || smp_cpus % > > (cores*threads) != 0 > > > > 2) if sockets*cpus*threads < max_cpus, adjust sockets to > > DIV_ROUND_UP(max_cpus, cores*threads). If we didn't warn in step 1, do > > it now. Give a different, less harsh warning if the cmdline > > sockets*cpus*threads did match smp_cpus. In the latter case, the user > > _almost_ knows what he was doing. > > > > Not perfect, but it could be something to start from. Adjusting sockets > > is better than adjusting threads. > > OK. I can whip up a v2 that is less harsh (more warnings, less aborts). > I'll also address the other issue I mentioned in the bottom of my reply > to Eduardo, which is to make sure we consider machine_class->max_cpus. After talking with Igor, it seems like the better approach is to get smp parameters converted over to machine properties, allowing us to use the old parser for old machine types, and the new for new. I'll look into it. drew
Re: [Qemu-devel] [PATCH] vl: rework smp_parse
On Fri, Nov 07, 2014 at 10:22:39AM +0100, Andrew Jones wrote: > On Thu, Nov 06, 2014 at 05:17:44PM -0200, Eduardo Habkost wrote: > > On Thu, Nov 06, 2014 at 05:09:35PM +0100, Andrew Jones wrote: > > > smp_parse has a couple problems. First, it should use max_cpus, > > > not smp_cpus when calculating missing topology information. > > > Conversely, if maxcpus is not input, then the topology should > > > dictate max_cpus, as the topology may support more than the > > > input smp_cpus number. Second, smp_parse shouldn't silently > > > adjust the number of threads a user provides, which it currently > > > does in order to ensure the topology supports the number > > > of cpus provided. smp_parse should rather complain and exit > > > when input isn't consistent. This patch fixes those issues and > > > attempts to clarify the code a bit too. > > > > > > I don't believe we need to consider compatibility with the old > > > behaviors. Specifying something like > > > > > > -smp 4,sockets=1,cores=1,threads=1 > > > > > > is wrong, even though it currently works (the number of threads > > > is silently adjusted to 4). > > > > Agreed, in this case. > > > > > And, specifying something like > > > > > > -smp 4,sockets=1,cores=4,threads=1,maxcpus=8 > > > > > > is also wrong, as there's no way to hotplug the additional 4 cpus > > > with this topology. So, any users doing these things should be > > > corrected. The new error message this patch provides should help > > > them do that. > > > > In this case, you are proposing a new meaning for "sockets", and it > > makes sense (I as suppose people understand "sockets" to mean all > > sockets, even the empty ones). But it looks like it will break > > compatility on cases we don't want to. > > > > For example, the case below is currently valid, and probably can be > > generated by libvirt today. But you are now rejecting it: > > > > -smp 30,sockets=5,cores=3,threads=2,maxcpus=60 "cpu topology: sockets > > (5) * cores (3) * threads (2) != max_cpus (60)" > > > > This is currently valid because "sockets" today means "online sockets", > > not "all sockets, even the empty ones". > > Will hotplug still work correctly with this meaning? I expect that > we need holes to fill. > > > > > > > It looks like the patch also breaks automatic socket count calculation, > > which (AFAIK) works today: > > > > I get: > > -smp 30,cores=3,threads=2"maxcpus must be equal to or greater than > > smp" > > but I expect: > > -smp 30,cores=3,threads=2sockets=5 cores=3 threads=2 smp_cpus=30 > > max_cpus=30 > > True. I should have preserved that behavior. > > The current code actually ends up with > > sockets=1 cores=3 threads=2 cpus=30 maxcpus=30 > > in smp_parse(), which is nonsense, but as the sockets count isn't saved > (it's always derivable from smp_cpus, cores, threads), then in practice > it doesn't matter. > > > > > (And the error message is confusing: I am not even setting max_cpus, why > > is it saying maxcpus is wrong?) > > > > I should have added an underscore to maxcpus in that error message, as > max_cpus is either maxcpus or the result of the topology now. > > > > > > > > > Below are some examples with before/after results > > > > > > // topology should support up to max_cpus > > > < -smp 4,sockets=2,maxcpus=8sockets=2 cores=2 > > > threads=1 smp_cpus=4 max_cpus=8 > > > > -smp 4,sockets=2,maxcpus=8sockets=2 cores=4 > > > > threads=1 smp_cpus=4 max_cpus=8 > > > > So, like in my first example above, we are breaking compatibility here > > because the meaning of "sockets" changed. Do we want to? > > > > I guess the main questions are: > - do we need to make this change for hotplug? > - do we need to make this change to make the command line more > intuitive? > > > > > (Unless otherwise noted in my other comments below, I am using the new > > meaning for "sockets", not the old one.) > > > > > > > > // topology supports more than smp_cpus, max_cpus should be higher > > > < -smp 4,sockets=4,cores=2 sockets=4 cores=2 > > > threads=1 smp_cpus=4 max_cpu
Re: [Qemu-devel] [PATCH] vl: rework smp_parse
On Fri, Nov 07, 2014 at 10:16:06AM -0200, Eduardo Habkost wrote: > On Fri, Nov 07, 2014 at 12:21:26PM +0100, Andrew Jones wrote: > > On Fri, Nov 07, 2014 at 10:52:31AM +0100, Andrew Jones wrote: > > > On Fri, Nov 07, 2014 at 10:40:14AM +0100, Paolo Bonzini wrote: > > > > > > > > > > > > On 07/11/2014 10:29, Andrew Jones wrote: > > > > >> > I think this would cause too many failures in the wild. Perhaps > > > > >> > error > > > > >> > out if it is lower, and warn if sockets * cores * threads > > > > > >> > max_cpus > > > > >> > since we actually allow hot-plug a thread at a time? > > > > > We'd still have more failures if we choose to error out when it's > > > > > lower, > > > > > since we currently silently adjust threads in some of those cases, or > > > > > just don't care that the topology doesn't support up to maxcpus in > > > > > other. > > > > > > > > So I guess we need a decent fallback if it doesn't match. Something > > > > like (based also on the reply from Eduardo): > > > > > > > > 1) always warn if max_cpus % (cores*threads) != 0 || smp_cpus % > > > > (cores*threads) != 0 > > > > > > > > 2) if sockets*cpus*threads < max_cpus, adjust sockets to > > > > DIV_ROUND_UP(max_cpus, cores*threads). If we didn't warn in step 1, do > > > > it now. Give a different, less harsh warning if the cmdline > > > > sockets*cpus*threads did match smp_cpus. In the latter case, the user > > > > _almost_ knows what he was doing. > > > > > > > > Not perfect, but it could be something to start from. Adjusting sockets > > > > is better than adjusting threads. > > > > > > OK. I can whip up a v2 that is less harsh (more warnings, less aborts). > > > I'll also address the other issue I mentioned in the bottom of my reply > > > to Eduardo, which is to make sure we consider machine_class->max_cpus. > > > > After talking with Igor, it seems like the better approach is to get > > smp parameters converted over to machine properties, allowing us to > > use the old parser for old machine types, and the new for new. I'll > > look into it. > > While we work on that, I think we could at least change the existing > code to abort if sockets*cores*threads < smp_cpus (when all 4 options > are present in the command-line), and never adjust any option silently. But then we'll end up with 3 different behaviors. If we do anything for the short-term, then maybe it should only be to add warnings. The machine property version will then convert those warnings to aborts for new machines types. drew
Re: [Qemu-devel] [PATCH] vl: rework smp_parse
On Thu, Nov 06, 2014 at 05:09:35PM +0100, Andrew Jones wrote: > smp_parse has a couple problems. First, it should use max_cpus, > not smp_cpus when calculating missing topology information. > Conversely, if maxcpus is not input, then the topology should > dictate max_cpus, as the topology may support more than the > input smp_cpus number. Second, smp_parse shouldn't silently > adjust the number of threads a user provides, which it currently > does in order to ensure the topology supports the number > of cpus provided. smp_parse should rather complain and exit > when input isn't consistent. This patch fixes those issues and > attempts to clarify the code a bit too. > > I don't believe we need to consider compatibility with the old > behaviors. Specifying something like > > -smp 4,sockets=1,cores=1,threads=1 > > is wrong, even though it currently works (the number of threads > is silently adjusted to 4). And, specifying something like > > -smp 4,sockets=1,cores=4,threads=1,maxcpus=8 > > is also wrong, as there's no way to hotplug the additional 4 cpus > with this topology. So, any users doing these things should be > corrected. The new error message this patch provides should help > them do that. > > Below are some examples with before/after results > > // topology should support up to max_cpus > < -smp 4,sockets=2,maxcpus=8sockets=2 cores=2 threads=1 > smp_cpus=4 max_cpus=8 > > -smp 4,sockets=2,maxcpus=8sockets=2 cores=4 threads=1 > > smp_cpus=4 max_cpus=8 > > // topology supports more than smp_cpus, max_cpus should be higher > < -smp 4,sockets=4,cores=2 sockets=4 cores=2 threads=1 > smp_cpus=4 max_cpus=4 > > -smp 4,sockets=4,cores=2 sockets=4 cores=2 threads=1 > > smp_cpus=4 max_cpus=8 > > // shouldn't silently adjust threads > < -smp 4,sockets=1,cores=2,threads=1sockets=1 cores=2 threads=2 > smp_cpus=4 max_cpus=4 > > -smp 4,sockets=1,cores=2,threads=1"maxcpus must be equal to or > > greater than smp" > < -smp 4,sockets=2,cores=2,threads=4,maxcpus=8 sockets=2 cores=2 threads=1 > smp_cpus=4 max_cpus=8 > > -smp 4,sockets=2,cores=2,threads=4,maxcpus=8 "cpu topology: sockets (2) * > > cores (2) * threads (4) != max_cpus (8)" > > Signed-off-by: Andrew Jones > --- > vl.c | 59 ++- > 1 file changed, 38 insertions(+), 21 deletions(-) > > diff --git a/vl.c b/vl.c > index f4a6e5e05bce2..23b21a5fbca50 100644 > --- a/vl.c > +++ b/vl.c > @@ -1270,35 +1270,52 @@ static QemuOptsList qemu_smp_opts = { > static void smp_parse(QemuOpts *opts) > { > if (opts) { > - > -unsigned cpus= qemu_opt_get_number(opts, "cpus", 0); > unsigned sockets = qemu_opt_get_number(opts, "sockets", 0); > unsigned cores = qemu_opt_get_number(opts, "cores", 0); > unsigned threads = qemu_opt_get_number(opts, "threads", 0); > +unsigned cpus; > + > +smp_cpus = qemu_opt_get_number(opts, "cpus", 0); > +max_cpus = qemu_opt_get_number(opts, "maxcpus", 0); > + > +cpus = max_cpus ? max_cpus : smp_cpus; > > /* compute missing values, prefer sockets over cores over threads */ > -if (cpus == 0 || sockets == 0) { > -sockets = sockets > 0 ? sockets : 1; > -cores = cores > 0 ? cores : 1; > -threads = threads > 0 ? threads : 1; > -if (cpus == 0) { > -cpus = cores * threads * sockets; > -} > -} else { > -if (cores == 0) { > -threads = threads > 0 ? threads : 1; > -cores = cpus / (sockets * threads); > -} else { > -threads = cpus / (cores * sockets); > -} > +if (cpus == 0) { > +cpus = sockets ? sockets : 1; > +cpus = cpus * cores ? cpus * cores : cpus; > +cpus = cpus * threads ? cpus * threads : cpus; > } > > -max_cpus = qemu_opt_get_number(opts, "maxcpus", 0); > +if (sockets == 0) { > +sockets = 1; > +} > > -smp_cpus = cpus; > -smp_cores = cores > 0 ? cores : 1; > -smp_threads = threads > 0 ? threads : 1; > +if (cores == 0) { > +threads = threads ? threads : 1; > +cores = cpus / (sockets * threads); > +cores = cores ? cores : 1; > +} > + > +if (threa
[Qemu-devel] [PATCH 0/3] vl: smp_parse sanity checks
See individual patches. Andrew Jones (3): vl: fix max_cpus check vl: sanity check cpu topology vl: warn on topology <-> maxcpus mismatch vl.c | 38 -- 1 file changed, 32 insertions(+), 6 deletions(-) -- 1.9.3
[Qemu-devel] [PATCH 1/3] vl: fix max_cpus check
We should confirm max_cpus, which is >= smp_cpus, is <= the machine's true max_cpus, not just smp_cpus. Signed-off-by: Andrew Jones --- vl.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/vl.c b/vl.c index f4a6e5e05bce2..9d9855092ab4a 100644 --- a/vl.c +++ b/vl.c @@ -3903,9 +3903,9 @@ int main(int argc, char **argv, char **envp) smp_parse(qemu_opts_find(qemu_find_opts("smp-opts"), NULL)); machine_class->max_cpus = machine_class->max_cpus ?: 1; /* Default to UP */ -if (smp_cpus > machine_class->max_cpus) { +if (max_cpus > machine_class->max_cpus) { fprintf(stderr, "Number of SMP cpus requested (%d), exceeds max cpus " -"supported by machine `%s' (%d)\n", smp_cpus, +"supported by machine `%s' (%d)\n", max_cpus, machine_class->name, machine_class->max_cpus); exit(1); } -- 1.9.3
[Qemu-devel] [PATCH 2/3] vl: sanity check cpu topology
smp_parse allows partial or complete cpu topology to be given. In either case there may be inconsistencies in the input which are currently not sounding any alarms. In some cases the input is even being silently corrected. We shouldn't do this. Add warnings when input isn't adding up right, and even abort when the complete cpu topology has been input, but isn't correct. Signed-off-by: Andrew Jones --- vl.c | 27 +++ 1 file changed, 23 insertions(+), 4 deletions(-) diff --git a/vl.c b/vl.c index 9d9855092ab4a..c62fe29aa8075 100644 --- a/vl.c +++ b/vl.c @@ -1288,16 +1288,35 @@ static void smp_parse(QemuOpts *opts) if (cores == 0) { threads = threads > 0 ? threads : 1; cores = cpus / (sockets * threads); -} else { -threads = cpus / (cores * sockets); +if (cpus % (sockets * threads)) { +fprintf(stderr, "cpu topology: warning: " +"Calculation results in fractional cores number. " +"Adjusting.\n"); +cores += 1; +} +} else if (threads == 0) { +threads = cpus / (sockets * cores); +if (cpus % (sockets * cores)) { +fprintf(stderr, "cpu topology: warning: " +"Calculation results in fractional threads number. " +"Adjusting.\n"); +threads += 1; +} } } +if (sockets * cores * threads < cpus) { +fprintf(stderr, "cpu topology: error: " +"sockets (%u) * cores (%u) * threads (%u) < smp_cpus (%u)\n", +sockets, cores, threads, cpus); +exit(1); +} + max_cpus = qemu_opt_get_number(opts, "maxcpus", 0); smp_cpus = cpus; -smp_cores = cores > 0 ? cores : 1; -smp_threads = threads > 0 ? threads : 1; +smp_cores = cores; +smp_threads = threads; } -- 1.9.3
[Qemu-devel] [PATCH 3/3] vl: warn on topology <-> maxcpus mismatch
Start guiding users towards making sure their topology supports the maximum number of cpus they wish to support. A future patch series will enforce this for new machine types. Signed-off-by: Andrew Jones --- vl.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/vl.c b/vl.c index c62fe29aa8075..72ffbffd858c5 100644 --- a/vl.c +++ b/vl.c @@ -1313,6 +1313,13 @@ static void smp_parse(QemuOpts *opts) } max_cpus = qemu_opt_get_number(opts, "maxcpus", 0); +max_cpus = max_cpus ?: cpus; + +if (sockets * cores * threads != max_cpus) { +fprintf(stderr, "cpu topology: warning: " +"sockets (%u) * cores (%u) * threads (%u) != max_cpus (%u)\n", +sockets, cores, threads, max_cpus); +} smp_cpus = cpus; smp_cores = cores; -- 1.9.3
Re: [Qemu-devel] [PATCH 2/3] vl: sanity check cpu topology
On Tue, Nov 11, 2014 at 10:41:00AM -0200, Eduardo Habkost wrote: > On Fri, Nov 07, 2014 at 05:04:39PM +0100, Andrew Jones wrote: > > smp_parse allows partial or complete cpu topology to be given. > > In either case there may be inconsistencies in the input which > > are currently not sounding any alarms. In some cases the input > > is even being silently corrected. We shouldn't do this. Add > > warnings when input isn't adding up right, and even abort when > > the complete cpu topology has been input, but isn't correct. > > > > Signed-off-by: Andrew Jones > > So, we are fixing bugs and changing behavior on two different cases > here: > > 1) when all options are provided and they aren't enough for smp_cpus; > 2) when one option was missing, but the existing calculation was >incorrect because of division truncation. 3) when threads were provided, but incorrect, we silently changed it. I thought you wanted to fix this one right now too. > > I don't think we need to keep compatibility on (1) because the user is > obviously providing an invalid configuration. That's why I suggested we > implemented it in 2.2. And it is safer because we won't be silently > changing behavior: QEMU is going to abort and the mistake will be easily > detected. > > But (2) is fixing a QEMU bug, not user error. The user may be unaware of > the bug, and will get a silent ABI change once upgrading to a newer > QEMU. We can keep it rounding down, unless the result is zero, as the current code does. How about keeping the new warning? Nah, let's drop it. Who actually cares about warnings anyway... > > I suggest fixing only (1) by now and keeping the behavior for (2) on > QEMU 2.2. Something like: > > if (sockets == 0) { > /* keep existing code for sockets == 0 */ > } else if (cores == 0) { > /* keep existing code for cores == 0 */ > } else if (threads == 0) { > /* keep existing code for threads == 0 */ This doesn't exist with current code. Adding an 'if (threads == 0)' case is fix (3). > } else { > /* new code: */ > if (sockets * cores * threads < cpus) { > fprintf(stderr, "cpu topology: error: " > "sockets (%u) * cores (%u) * threads (%u) < smp_cpus > (%u)\n", > sockets, cores, threads, cpus); > exit(1); > } > } > > Below is a v2 I can post if it looks good to you. From: Andrew Jones Date: Fri, 7 Nov 2014 15:45:07 +0100 Subject: [PATCH v2] vl: sanity check cpu topology smp_parse allows partial or complete cpu topology to be given. In either case there may be inconsistencies in the input which are currently not sounding any alarms. In some cases the input is even being silently corrected. Stop silently adjusting input and abort when the complete cpu topology has been input, but isn't correct. Signed-off-by: Andrew Jones --- vl.c | 31 +++ 1 file changed, 27 insertions(+), 4 deletions(-) diff --git a/vl.c b/vl.c index 9d9855092ab4a..e686fd21e266f 100644 --- a/vl.c +++ b/vl.c @@ -1288,16 +1288,39 @@ static void smp_parse(QemuOpts *opts) if (cores == 0) { threads = threads > 0 ? threads : 1; cores = cpus / (sockets * threads); -} else { -threads = cpus / (cores * sockets); +if (cpus % (sockets * threads)) { +/* The calculation resulted in a fractional number, so we + * need to adjust it. The below adjustment is wrong, it + * should be '+= 1', but we need to keep it this way for + * compatibility. + */ +cores = cores > 0 ? cores : 1; +} +} else if (threads == 0) { +threads = cpus / (sockets * cores); +if (cpus % (sockets * cores)) { +/* The calculation resulted in a fractional number, so we + * need to adjust it. The below adjustment is wrong, it + * should be '+= 1', but we need to keep it this way for + * compatibility. + */ +threads = threads > 0 ? threads : 1; +} } } +if (sockets * cores * threads < cpus) { +fprintf(stderr, "cpu topology: error: " +"sockets (%u) * cores (%u) * threads (%u) < smp_cpus (%u)\n", +sockets, cores, threads, cpus); +exit(1); +} + max_cpus = qemu_opt_get_number(opts, "maxcpus", 0); smp_cpus = cpus; -smp_cores = cores > 0 ? cores : 1; -smp_threads = threads > 0 ? threads : 1; +smp_cores = cores; +smp_threads = threads; } -- 1.9.3
Re: [Qemu-devel] [PATCH 2/3] vl: sanity check cpu topology
On Tue, Nov 11, 2014 at 01:48:00PM -0200, Eduardo Habkost wrote: > On Tue, Nov 11, 2014 at 03:37:11PM +0100, Andrew Jones wrote: > > On Tue, Nov 11, 2014 at 10:41:00AM -0200, Eduardo Habkost wrote: > > > On Fri, Nov 07, 2014 at 05:04:39PM +0100, Andrew Jones wrote: > > > > smp_parse allows partial or complete cpu topology to be given. > > > > In either case there may be inconsistencies in the input which > > > > are currently not sounding any alarms. In some cases the input > > > > is even being silently corrected. We shouldn't do this. Add > > > > warnings when input isn't adding up right, and even abort when > > > > the complete cpu topology has been input, but isn't correct. > > > > > > > > Signed-off-by: Andrew Jones > > > > > > So, we are fixing bugs and changing behavior on two different cases > > > here: > > > > > > 1) when all options are provided and they aren't enough for smp_cpus; > > > 2) when one option was missing, but the existing calculation was > > >incorrect because of division truncation. > > > > 3) when threads were provided, but incorrect, we silently changed > > it. I thought you wanted to fix this one right now too. > > True. When I described (1) I was seeing (3) as a subset of it, as > threads is silently changed only if core and sockets were also provided > and cores*sockets*threads != smp_cpus. > > So, yes, I want to fix (1) and (3) on 2.2. I am not sure about (2). > > > > > > > > > I don't think we need to keep compatibility on (1) because the user is > > > obviously providing an invalid configuration. That's why I suggested we > > > implemented it in 2.2. And it is safer because we won't be silently > > > changing behavior: QEMU is going to abort and the mistake will be easily > > > detected. > > > > > > But (2) is fixing a QEMU bug, not user error. The user may be unaware of > > > the bug, and will get a silent ABI change once upgrading to a newer > > > QEMU. > > > > We can keep it rounding down, unless the result is zero, as the current > > code does. How about keeping the new warning? Nah, let's drop it. Who > > actually cares about warnings anyway... > > A warning would be interesting for (2), maybe. I just would like to have > it addressed in a separate patch, so we can fix (3) and (1) in 2.2 > before we decide about (2). > > > > > > > > > I suggest fixing only (1) by now and keeping the behavior for (2) on > > > QEMU 2.2. Something like: > > > > > > if (sockets == 0) { > > > /* keep existing code for sockets == 0 */ > > > } else if (cores == 0) { > > > /* keep existing code for cores == 0 */ > > > } else if (threads == 0) { > > > /* keep existing code for threads == 0 */ > > > > This doesn't exist with current code. Adding an 'if (threads == 0)' > > case is fix (3). > > Yes, I was talking about the existing code that's inside the "else" > branch (and would change threads silently even if it was already set), > that would fix (3) too. > > > > > > } else { > > > /* new code: */ > > > if (sockets * cores * threads < cpus) { > > > fprintf(stderr, "cpu topology: error: " > > > "sockets (%u) * cores (%u) * threads (%u) < smp_cpus > > > (%u)\n", > > > sockets, cores, threads, cpus); > > > exit(1); > > > } > > > } > > > > > > > > > > Below is a v2 I can post if it looks good to you. > > > > From: Andrew Jones > > Date: Fri, 7 Nov 2014 15:45:07 +0100 > > Subject: [PATCH v2] vl: sanity check cpu topology > > > > smp_parse allows partial or complete cpu topology to be given. > > In either case there may be inconsistencies in the input which > > are currently not sounding any alarms. In some cases the input > > is even being silently corrected. Stop silently adjusting input > > and abort when the complete cpu topology has been input, but > > isn't correct. > > I don't think that's accurate: you are not aborting only when the > complete CPU topology has been input, but also when QEMU automatic > calculation is incorrect due to truncation. OK, let's modify the commit message, but keep the patch. The truncation problem we'd abort for is still directly due to user input. We wouldn't get a fractional cores or threads number if the input was appropriate. Anyway, if we don't need to worry about sockets*cores*threads < cpus for the truncation case, then why worry about it for the full input case? Well, other than informing the user that the input she's been using all this time with the silent threads adjustment was wrong, which is why I suspect you want it. However, that argument should hold for the truncation case too, i.e. informing a user that e.g. 5,sockets=2,threads=1 has always been wrong. drew
Re: [Qemu-devel] [PATCH 2/3] vl: sanity check cpu topology
On Tue, Nov 11, 2014 at 04:31:24PM -0200, Eduardo Habkost wrote: > On Tue, Nov 11, 2014 at 03:37:11PM +0100, Andrew Jones wrote: > [...] > > Below is a v2 I can post if it looks good to you. > > > > From: Andrew Jones > > Date: Fri, 7 Nov 2014 15:45:07 +0100 > > Subject: [PATCH v2] vl: sanity check cpu topology > > > > smp_parse allows partial or complete cpu topology to be given. > > In either case there may be inconsistencies in the input which > > are currently not sounding any alarms. In some cases the input > > is even being silently corrected. Stop silently adjusting input > > and abort when the complete cpu topology has been input, but > > isn't correct. > > > > Signed-off-by: Andrew Jones > > After applying this patch: > > $ ./install/bin/qemu-system-x86_64 -smp 12 > cpu topology: error: sockets (1) * cores (1) * threads (1) < smp_cpus (12) > > That is why I wanted to address the most obvious (and less risky) issues first > (aborting only if all options were explicitly set), and touch automatic > calculation later. > Oh right. I fixed that once, but then lost the change when trying to produce these half fixes. drew
Re: [Qemu-devel] [PATCH v2 1/3] vl: Avoid unnecessary 'if' nesting
On Tue, Nov 11, 2014 at 04:50:54PM -0200, Eduardo Habkost wrote: > Just a coding style change, to make other changes easier to review. > > Signed-off-by: Eduardo Habkost > --- > vl.c | 10 -- > 1 file changed, 4 insertions(+), 6 deletions(-) > > diff --git a/vl.c b/vl.c > index f4a6e5e..c1bfc9b 100644 > --- a/vl.c > +++ b/vl.c > @@ -1284,13 +1284,11 @@ static void smp_parse(QemuOpts *opts) > if (cpus == 0) { > cpus = cores * threads * sockets; > } > +} else if (cores == 0) { > +threads = threads > 0 ? threads : 1; > +cores = cpus / (sockets * threads); > } else { > -if (cores == 0) { > -threads = threads > 0 ? threads : 1; > -cores = cpus / (sockets * threads); > -} else { > -threads = cpus / (cores * sockets); > -} > +threads = cpus / (cores * sockets); > } > > max_cpus = qemu_opt_get_number(opts, "maxcpus", 0); > -- > 1.9.3 > Reviewed-by: Andrew Jones
Re: [Qemu-devel] [PATCH v2 3/3] vl: Don't silently change topology when all -smp options were set
On Tue, Nov 11, 2014 at 04:50:56PM -0200, Eduardo Habkost wrote: > QEMU tries to change the "threads" option even if it was explicitly set > in the command-line, and it shouldn't do that. > > The right thing to do when all options (cpus, sockets, cores, threds) > are explicitly set is to sanity check them and abort in case they don't > make sense (i.e. when sockets*cores*threads < cpus). > > Signed-off-by: Eduardo Habkost > --- > vl.c | 8 +++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/vl.c b/vl.c > index 2ed8b07..8880a4e 100644 > --- a/vl.c > +++ b/vl.c > @@ -1287,8 +1287,14 @@ static void smp_parse(QemuOpts *opts) > } else if (cores == 0) { > threads = threads > 0 ? threads : 1; > cores = cpus / (sockets * threads); > -} else { > +} else if (threads == 0) { > threads = cpus / (cores * sockets); > +} else if (sockets * cores * threads < cpus) { > +fprintf(stderr, "cpu topology: error: " > +"sockets (%u) * cores (%u) * threads (%u) < " > +"smp_cpus (%u)\n", > +sockets, cores, threads, cpus); > +exit(1); > } > > max_cpus = qemu_opt_get_number(opts, "maxcpus", 0); > -- > 1.9.3 > Reviewed-by: Andrew Jones
Re: [Qemu-devel] [qemu PATCH 2/2] arm: add fw_cfg to "virt" board
On Fri, Nov 28, 2014 at 12:18:27AM +0100, Laszlo Ersek wrote: > fw_cfg already supports exposure over MMIO (used in ppc/mac_newworld.c, > ppc/mac_oldworld.c, sparc/sun4m.c); we can easily add it to the "virt" > board. > > The mmio register block of fw_cfg is advertized in the device tree. As > base address we pick 0x0902, which conforms to the comment preceding > "a15memmap": it falls in the miscellaneous device I/O range 128MB..256MB, > and it is aligned at 64KB. > > fw_cfg automatically exports a number of files to the guest; for example, > "bootorder" (see fw_cfg_machine_reset()). > > Signed-off-by: Laszlo Ersek > --- > hw/arm/virt.c | 21 + > 1 file changed, 21 insertions(+) > > diff --git a/hw/arm/virt.c b/hw/arm/virt.c > index 314e55b..070bd34 100644 > --- a/hw/arm/virt.c > +++ b/hw/arm/virt.c > @@ -68,6 +68,7 @@ enum { > VIRT_UART, > VIRT_MMIO, > VIRT_RTC, > +VIRT_FW_CFG, > }; > > typedef struct MemMapEntry { > @@ -107,6 +108,7 @@ static const MemMapEntry a15memmap[] = { > [VIRT_GIC_CPU] ={ 0x0801, 0x0001 }, > [VIRT_UART] = { 0x0900, 0x1000 }, > [VIRT_RTC] ={ 0x0901, 0x1000 }, > +[VIRT_FW_CFG] = { 0x0902, FW_CFG_SIZE + FW_CFG_DATA_SIZE }, > [VIRT_MMIO] = { 0x0a00, 0x0200 }, > /* ...repeating for a total of NUM_VIRTIO_TRANSPORTS, each of that size > */ > /* 0x1000 .. 0x4000 reserved for PCI */ > @@ -519,6 +521,23 @@ static void create_flash(const VirtBoardInfo *vbi) > g_free(nodename); > } > > +static void create_fw_cfg(const VirtBoardInfo *vbi) > +{ > +hwaddr base = vbi->memmap[VIRT_FW_CFG].base; > +char *nodename; > + > +fw_cfg_init(0, 0, base, base + FW_CFG_SIZE); > + > +nodename = g_strdup_printf("/fw-cfg@%" PRIx64, base); > +qemu_fdt_add_subnode(vbi->fdt, nodename); > +qemu_fdt_setprop_string(vbi->fdt, nodename, > +"compatible", "fw-cfg,mmio"); > +qemu_fdt_setprop_sized_cells(vbi->fdt, nodename, "reg", > + 2, base, 2, FW_CFG_SIZE, > + 2, base + FW_CFG_SIZE, 2, FW_CFG_DATA_SIZE); Overkill suggestion alert, but how about defining something like #define FW_CFG_SIZE_ALIGNED \ MIN(QEMU_ALIGN_UP(FW_CFG_SIZE, FW_CFG_DATA_SIZE), \ QEMU_ALIGN_UP(FW_CFG_SIZE, 4)) and then using that in your memmap size calculation and fw-cfg-data base address calculation. The only reason I suggest this is because it's hard to tell that fw-cfg-data's address will be naturally aligned without hunting down the definition of FW_CFG_DATA_SIZE. And, if it were to change (which it probably never will), then it may not be. > +g_free(nodename); > +} > + > static void *machvirt_dtb(const struct arm_boot_info *binfo, int *fdt_size) > { > const VirtBoardInfo *board = (const VirtBoardInfo *)binfo; > @@ -604,6 +623,8 @@ static void machvirt_init(MachineState *machine) > */ > create_virtio_devices(vbi, pic); > > +create_fw_cfg(vbi); > + > vbi->bootinfo.ram_size = machine->ram_size; > vbi->bootinfo.kernel_filename = machine->kernel_filename; > vbi->bootinfo.kernel_cmdline = machine->kernel_cmdline; > -- > 1.8.3.1 >
Re: [Qemu-devel] [qemu PATCH 2/2] arm: add fw_cfg to "virt" board
On Fri, Nov 28, 2014 at 11:43:32AM +0100, Laszlo Ersek wrote: > On 11/28/14 11:38, Andrew Jones wrote: > > On Fri, Nov 28, 2014 at 12:18:27AM +0100, Laszlo Ersek wrote: > >> fw_cfg already supports exposure over MMIO (used in ppc/mac_newworld.c, > >> ppc/mac_oldworld.c, sparc/sun4m.c); we can easily add it to the "virt" > >> board. > >> > >> The mmio register block of fw_cfg is advertized in the device tree. As > >> base address we pick 0x0902, which conforms to the comment preceding > >> "a15memmap": it falls in the miscellaneous device I/O range 128MB..256MB, > >> and it is aligned at 64KB. > >> > >> fw_cfg automatically exports a number of files to the guest; for example, > >> "bootorder" (see fw_cfg_machine_reset()). > >> > >> Signed-off-by: Laszlo Ersek > >> --- > >> hw/arm/virt.c | 21 + > >> 1 file changed, 21 insertions(+) > >> > >> diff --git a/hw/arm/virt.c b/hw/arm/virt.c > >> index 314e55b..070bd34 100644 > >> --- a/hw/arm/virt.c > >> +++ b/hw/arm/virt.c > >> @@ -68,6 +68,7 @@ enum { > >> VIRT_UART, > >> VIRT_MMIO, > >> VIRT_RTC, > >> +VIRT_FW_CFG, > >> }; > >> > >> typedef struct MemMapEntry { > >> @@ -107,6 +108,7 @@ static const MemMapEntry a15memmap[] = { > >> [VIRT_GIC_CPU] ={ 0x0801, 0x0001 }, > >> [VIRT_UART] = { 0x0900, 0x1000 }, > >> [VIRT_RTC] ={ 0x0901, 0x1000 }, > >> +[VIRT_FW_CFG] = { 0x0902, FW_CFG_SIZE + FW_CFG_DATA_SIZE }, > >> [VIRT_MMIO] = { 0x0a00, 0x0200 }, > >> /* ...repeating for a total of NUM_VIRTIO_TRANSPORTS, each of that > >> size */ > >> /* 0x1000 .. 0x4000 reserved for PCI */ > >> @@ -519,6 +521,23 @@ static void create_flash(const VirtBoardInfo *vbi) > >> g_free(nodename); > >> } > >> > >> +static void create_fw_cfg(const VirtBoardInfo *vbi) > >> +{ > >> +hwaddr base = vbi->memmap[VIRT_FW_CFG].base; > >> +char *nodename; > >> + > >> +fw_cfg_init(0, 0, base, base + FW_CFG_SIZE); > >> + > >> +nodename = g_strdup_printf("/fw-cfg@%" PRIx64, base); > >> +qemu_fdt_add_subnode(vbi->fdt, nodename); > >> +qemu_fdt_setprop_string(vbi->fdt, nodename, > >> +"compatible", "fw-cfg,mmio"); > >> +qemu_fdt_setprop_sized_cells(vbi->fdt, nodename, "reg", > >> + 2, base, 2, FW_CFG_SIZE, > >> + 2, base + FW_CFG_SIZE, 2, > >> FW_CFG_DATA_SIZE); > > > > Overkill suggestion alert, but how about defining something like > > > > #define FW_CFG_SIZE_ALIGNED \ > > MIN(QEMU_ALIGN_UP(FW_CFG_SIZE, FW_CFG_DATA_SIZE), \ > > QEMU_ALIGN_UP(FW_CFG_SIZE, 4)) > > > > and then using that in your memmap size calculation and fw-cfg-data base > > address calculation. The only reason I suggest this is because it's hard > > to tell that fw-cfg-data's address will be naturally aligned without > > hunting down the definition of FW_CFG_DATA_SIZE. And, if it were to change > > (which it probably never will), then it may not be. > > Why does it need to be aligned? Natural alignment is more efficient. > > The selector register is aligned at a 64KB boundary (for independent, > strict reasons). > > The data register is not aligned at all, and -- AFAICS -- it need not > be, because it's 1 byte wide. (In fact the ARM-specific > Mmio(Read|Write)XX functions in edk2 enforce natural alignment, and the > above layout passes without problems.) Right. As FW_CFG_DATA_SIZE is currently 1 byte, it's already naturally aligned, and the macro definition I have above actually doesn't change anything (which is why I gave the overkill alert). However if FW_CFG_DATA_SIZE was to change, then the natural alignment could be lost. > > The full register block is 3 bytes wide. Is that a problem? No, it's fine as is, and the FW_CFG_SIZE_ALIGNED would leave it 3 bytes wide too. FW_CFG_SIZE_ALIGNED only adds future-proofing. However it really is overkill as the chance that FW_CFG_DATA_SIZE will change is nil. > > Thanks > Laszlo > > > > >> +g_free(nodename); > >> +} > >> + > >> static void *machvirt_dtb(const struct arm_boot_info *binfo, int > >> *fdt_size) > >> { > >> const VirtBoardInfo *board = (const VirtBoardInfo *)binfo; > >> @@ -604,6 +623,8 @@ static void machvirt_init(MachineState *machine) > >> */ > >> create_virtio_devices(vbi, pic); > >> > >> +create_fw_cfg(vbi); > >> + > >> vbi->bootinfo.ram_size = machine->ram_size; > >> vbi->bootinfo.kernel_filename = machine->kernel_filename; > >> vbi->bootinfo.kernel_cmdline = machine->kernel_cmdline; > >> -- > >> 1.8.3.1 > >> >
Re: [Qemu-devel] [qemu PATCH 2/2] arm: add fw_cfg to "virt" board
On Fri, Nov 28, 2014 at 11:49:48AM +0100, Laszlo Ersek wrote: > On 11/28/14 11:43, Laszlo Ersek wrote: > > On 11/28/14 11:38, Andrew Jones wrote: > >> On Fri, Nov 28, 2014 at 12:18:27AM +0100, Laszlo Ersek wrote: > >>> fw_cfg already supports exposure over MMIO (used in ppc/mac_newworld.c, > >>> ppc/mac_oldworld.c, sparc/sun4m.c); we can easily add it to the "virt" > >>> board. > >>> > >>> The mmio register block of fw_cfg is advertized in the device tree. As > >>> base address we pick 0x0902, which conforms to the comment preceding > >>> "a15memmap": it falls in the miscellaneous device I/O range 128MB..256MB, > >>> and it is aligned at 64KB. > >>> > >>> fw_cfg automatically exports a number of files to the guest; for example, > >>> "bootorder" (see fw_cfg_machine_reset()). > >>> > >>> Signed-off-by: Laszlo Ersek > >>> --- > >>> hw/arm/virt.c | 21 + > >>> 1 file changed, 21 insertions(+) > >>> > >>> diff --git a/hw/arm/virt.c b/hw/arm/virt.c > >>> index 314e55b..070bd34 100644 > >>> --- a/hw/arm/virt.c > >>> +++ b/hw/arm/virt.c > >>> @@ -68,6 +68,7 @@ enum { > >>> VIRT_UART, > >>> VIRT_MMIO, > >>> VIRT_RTC, > >>> +VIRT_FW_CFG, > >>> }; > >>> > >>> typedef struct MemMapEntry { > >>> @@ -107,6 +108,7 @@ static const MemMapEntry a15memmap[] = { > >>> [VIRT_GIC_CPU] ={ 0x0801, 0x0001 }, > >>> [VIRT_UART] = { 0x0900, 0x1000 }, > >>> [VIRT_RTC] ={ 0x0901, 0x1000 }, > >>> +[VIRT_FW_CFG] = { 0x0902, FW_CFG_SIZE + FW_CFG_DATA_SIZE }, > >>> [VIRT_MMIO] = { 0x0a00, 0x0200 }, > >>> /* ...repeating for a total of NUM_VIRTIO_TRANSPORTS, each of that > >>> size */ > >>> /* 0x1000 .. 0x4000 reserved for PCI */ > >>> @@ -519,6 +521,23 @@ static void create_flash(const VirtBoardInfo *vbi) > >>> g_free(nodename); > >>> } > >>> > >>> +static void create_fw_cfg(const VirtBoardInfo *vbi) > >>> +{ > >>> +hwaddr base = vbi->memmap[VIRT_FW_CFG].base; > >>> +char *nodename; > >>> + > >>> +fw_cfg_init(0, 0, base, base + FW_CFG_SIZE); > >>> + > >>> +nodename = g_strdup_printf("/fw-cfg@%" PRIx64, base); > >>> +qemu_fdt_add_subnode(vbi->fdt, nodename); > >>> +qemu_fdt_setprop_string(vbi->fdt, nodename, > >>> +"compatible", "fw-cfg,mmio"); > >>> +qemu_fdt_setprop_sized_cells(vbi->fdt, nodename, "reg", > >>> + 2, base, 2, FW_CFG_SIZE, > >>> + 2, base + FW_CFG_SIZE, 2, > >>> FW_CFG_DATA_SIZE); > >> > >> Overkill suggestion alert, but how about defining something like > >> > >> #define FW_CFG_SIZE_ALIGNED \ > >> MIN(QEMU_ALIGN_UP(FW_CFG_SIZE, FW_CFG_DATA_SIZE), \ > >> QEMU_ALIGN_UP(FW_CFG_SIZE, 4)) > >> > >> and then using that in your memmap size calculation and fw-cfg-data base > >> address calculation. The only reason I suggest this is because it's hard > >> to tell that fw-cfg-data's address will be naturally aligned without > >> hunting down the definition of FW_CFG_DATA_SIZE. And, if it were to change > >> (which it probably never will), then it may not be. > > > > Why does it need to be aligned? > > > > The selector register is aligned at a 64KB boundary (for independent, > > strict reasons). > > > > The data register is not aligned at all, and -- AFAICS -- it need not > > be, because it's 1 byte wide. (In fact the ARM-specific > > Mmio(Read|Write)XX functions in edk2 enforce natural alignment, and the > > above layout passes without problems.) > > > > The full register block is 3 bytes wide. Is that a problem? > > Hm, I think I get it now. If FW_CFG_DATA_SIZE were to increase, then its > alignment would have to increase as well, and whatever alignment > FW_CFG_SIZE provides might not suffice. So, you'd calculate the natural > alignment, but wouldn't increase it beyond 4. > > I do think this is a bit overkill :) but I can do it. Let's wait for > more review comments first. Actually, on second thought, completely scratch my overkill suggestion. It's actually wrong to be concerned with it anyway. FW_CFG_DATA_SIZE doesn't dictate how we should access the data port, the fw-cfg protocol does, and that says we should access exactly one byte. So, regardless of the fw-cfg-data size, we'll never have to worry about the data port's alignment, as we'll never access more than one byte from it. > > Thanks! > Laszlo
Re: [Qemu-devel] [PATCH v2] arm: add fw_cfg to "virt" board
On Wed, Dec 10, 2014 at 12:13:34PM +0100, Paolo Bonzini wrote: > On 09/12/2014 16:54, Richard W.M. Jones wrote: > > On a slightly related topic, virtio-mmio traps are slow: > > > > https://bugzilla.redhat.com/show_bug.cgi?id=1123555 > > I suspect it's not virtio-mmio, but just KVM on ARM. > > It would be nice to have a timing test for vmexits in kvm-unit-tests, > similar to what is already there for x86. > Christoffer had some for his initial kernel selftests, and at one point he was porting them to an early version of kvm-unit-tests/arm. We can resurrect that shortly, as I'll be posting kvm-unit-tests/arm64 today. drew
Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg
On Fri, Dec 12, 2014 at 04:58:45PM +0100, Laszlo Ersek wrote: > Make it clear that the maximum access size to the MMIO data register > determines the full size of the memory region. > > Currently the max access size is 1. Ensure that if a larger size were used > in "fw_cfg_data_mem_ops.valid.max_access_size", the memory subsystem would > split the access to byte-sized accesses internally, in increasing address > order. > > fw_cfg_data_mem_read() and fw_cfg_data_mem_write() don't care about > "address" or "size"; they just call the sequential fw_cfg_read() and > fw_cfg_write() functions, correspondingly. Therefore the automatic > splitting is just right. (The endianness of "fw_cfg_data_mem_ops" is > native.) This 'is native' caught my eye. Laszlo's Documentation/devicetree/bindings/arm/fw-cfg.txt patch states that the selector register is LE and " The data register allows accesses with 8, 16, 32 and 64-bit width. Accesses larger than a byte are interpreted as arrays, bundled together only for better performance. The bytes constituting such a word, in increasing address order, correspond to the bytes that would have been transferred by byte-wide accesses in chronological order. " I chatted with Laszlo to make sure the host-is-BE case was considered. It looks like there may be an issue there that Laszlo promised to look into. Laszlo had already noticed that the selector was defined as native in qemu as well, but should be LE. Now that we support > 1 byte reads and writes from the data port, then maybe we should look into changing that as well. drew > > This patch doesn't change behavior. > > Signed-off-by: Laszlo Ersek > --- > > Notes: > v4: > - unchanged > > v3: > - new in v3 [Drew Jones] > > hw/nvram/fw_cfg.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/hw/nvram/fw_cfg.c b/hw/nvram/fw_cfg.c > index c4b78ed..7f6031c 100644 > --- a/hw/nvram/fw_cfg.c > +++ b/hw/nvram/fw_cfg.c > @@ -30,9 +30,8 @@ > #include "qemu/error-report.h" > #include "qemu/config-file.h" > > #define FW_CFG_SIZE 2 > -#define FW_CFG_DATA_SIZE 1 > #define TYPE_FW_CFG "fw_cfg" > #define FW_CFG_NAME "fw_cfg" > #define FW_CFG_PATH "/machine/" FW_CFG_NAME > #define FW_CFG(obj) OBJECT_CHECK(FWCfgState, (obj), TYPE_FW_CFG) > @@ -323,8 +322,9 @@ static const MemoryRegionOps fw_cfg_data_mem_ops = { > .valid = { > .min_access_size = 1, > .max_access_size = 1, > }, > +.impl.max_access_size = 1, > }; > > static const MemoryRegionOps fw_cfg_comb_mem_ops = { > .read = fw_cfg_comb_read, > @@ -608,9 +608,10 @@ static void fw_cfg_initfn(Object *obj) > memory_region_init_io(&s->ctl_iomem, OBJECT(s), &fw_cfg_ctl_mem_ops, s, >"fwcfg.ctl", FW_CFG_SIZE); > sysbus_init_mmio(sbd, &s->ctl_iomem); > memory_region_init_io(&s->data_iomem, OBJECT(s), &fw_cfg_data_mem_ops, s, > - "fwcfg.data", FW_CFG_DATA_SIZE); > + "fwcfg.data", > + fw_cfg_data_mem_ops.valid.max_access_size); > sysbus_init_mmio(sbd, &s->data_iomem); > /* In case ctl and data overlap: */ > memory_region_init_io(&s->comb_iomem, OBJECT(s), &fw_cfg_comb_mem_ops, s, >"fwcfg", FW_CFG_SIZE); > -- > 1.8.3.1 > >
Re: [Qemu-devel] [RFC PATCH 0/3] KVM: Introduce KVM_MEM_UNCACHED
On Wed, Apr 29, 2015 at 10:19:55AM +0100, Peter Maydell wrote: > On 29 April 2015 at 10:03, Alexander Graf wrote: > > > > > > On 18.03.15 20:10, Andrew Jones wrote: > >> Introduce a new memory region flag, KVM_MEM_UNCACHED, which > >> is needed by ARM. This flag informs KVM that the given memory > >> region is typically mapped by the guest as uncached. KVM for > >> ARM then maps that region as uncached for userspace as well, > >> in order to keep coherency. > > > > I finally managed to give this a spin and immediately ran into an > > unaligned access trap: Thanks! And err.. sorry it broke. > > > > [ 116.509976] Unhandled fault: alignment fault (0x9261) at > > 0x03ffb130 > > > > Program received signal SIGBUS, Bus error. > > [Switching to Thread 0x3ffb317ecb0 (LWP 1956)] > > 0x03ffb685ed68 in memset () from /lib64/libc.so.6 > > (gdb) bt > > #0 0x03ffb685ed68 in memset () from /lib64/libc.so.6 > > #1 0x02c013ec in memset (__len=, __ch=0, > > __dest=) at /usr/include/bits/string3.h:90 > > #2 vbe_ioport_write_data (opaque=0x2aaabbd3600, addr=, > > val=65) at /usr/src/debug/qemu-2.3.0-rc4/hw/display/vga.c:739 > > #3 0x02be07dc in memory_region_write_accessor (mr= > out>, addr=, value=, size=, > > shift=, mask=) at > > /usr/src/debug/qemu-2.3.0-rc4/memory.c:430 > > [...] > > This appears to be because the KVM_MEM_UNCACHED flag > confusingly isn't marking the memory as Normal-Noncacheable > but as Device-nGnRnE (aka Strongly Ordered). You can't unalignedly > access Device memory (and so you can't use the usual userspace > memcpy, memset, etc). > > Did we really want Device-nGnRnE memory here? If we did can > we have a less confusing name for the flag? I hadn't considered the other side-effects of going with device vs. normal memory. I'll change patch 3/3 with 31c31 < + pte = set_pte_bit(pte, L_PTE_MT_UNCACHED); --- > + pte = set_pte_bit(pte, L_PTE_MT_WRITEBACK); 166c166 < + pte = set_pte_bit(pte, PTE_ATTRINDX(MT_DEVICE_nGnRnE)); --- > + pte = set_pte_bit(pte, PTE_ATTRINDX(MT_NORMAL_NC)); I've left this on the backburner too long, but I can pick it back up on Monday. I still haven't looked at Paolo's mmu notifier suggestion yet either. I'll start looking at that as well. Thanks, drew > > [for the non-ARM folk, nGnRnE == no gathering of accesses, > no reordering, no early-write-acknowledgement.] > > thanks > -- PMM > >
Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
On Thu, May 14, 2015 at 11:37:46AM +0100, Peter Maydell wrote: > On 14 May 2015 at 11:31, Andrew Jones wrote: > > Forgot to (4): switch from setting userspace's mapping to > > device memory to normal, non-cacheable. Using device memory > > caused a problem that Alex Graf found, and Peter Maydell suggested > > using normal, non-cacheable instead. > > Did you check that non-cacheable is definitely the correct > kind of Normal memory attribute we want? (ie not write-through). I was concerned that write-through wouldn't be sufficient. If the guest writes to its non-cached memory, and QEMU needs to see what it wrote, then won't write-through fail to work? Unless we some how invalidate the cache first? drew