Re: [Xen-devel] [PATCH for-4.5] xen: vnuma: expose vnode_to_pnode to guest
On 10.11.14 at 11:51, dario.faggi...@citrix.com wrote: I'm 100% ok to re-start that discussion here and now... however, how stable should this interface be? Can't we deal with this when actually implementing NUMA aware ballooning and add stuff at than point, if necessary? Wei's desire to have a stable interface for 4.5 and later is quite reasonable, as we can't change the interface once a release got shipped with it. If we were to discover additional needs later, only backward compatible changes to the existing interface would be allowed (and e.g. adding another field to the interface structure would be out of scope). Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [TestDay] VMX test report for Xen 4.5.0-rc1
On 12.11.14 at 07:58, robert...@intel.com wrote: 2. Failed to hotplug a VT-d device with XEN4.5-RC1 http://bugzilla-archived.xenproject.org/bugzilla/show_bug.cgi?id=1894 First of all I'm not sure it is really useful to use the old, discontinued bugzilla to report bugs. I think it would be much easier if you reported them directly (and individually) to the mailing list, with or without triggering the creation of bugs in the new tracking system. As to the specific bug above - nothing is being said on whether a driver was still attached to the device at the time it was being hot-unplugged; if there was, I'm not sure what the expected behavior would be (i.e. on real hardware). 3. Fail to hot add multi devices to guest http://bugzilla-archived.xenproject.org/bugzilla/show_bug.cgi?id=1895 The description in contradicting itself - title and parts of it say the operation failed, but Bug detailed description says Checking inside guest VM, the VT-D device is attached to guest VM actually, and works fine. 4. Not all PFs are available if assign multi VT-d devices to Wndows guest VM http://bugzilla-archived.xenproject.org/bugzilla/show_bug.cgi?id=1896 I think we were told the other day that pass-through of PFs is not being supported by Intel. What's the purpose of reporting bugs there? 7. Error msg occurred after attaching multi SR-IOV VF device to running guest http://bugzilla-archived.xenproject.org/bugzilla/show_bug.cgi?id=1899 This reads (except for the title) the same as 1895. Am I overlooking something here? Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
On 12.11.14 at 10:13, tiejun.c...@intel.com wrote: On 2014/11/12 17:02, Jan Beulich wrote: On 12.11.14 at 09:45, tiejun.c...@intel.com wrote: #2 flags field in each specific device of new domctl would control whether this device need to check/reserve its own RMRR range. But its not dependent on current device assignment domctl, so the user can use them to control which devices need to work as hotplug later, separately. And this could be left as a second step, in order for what needs to be done now to not get more complicated that necessary. Do you mean currently we still rely on the device assignment domctl to provide SBDF? So looks nothing should be changed in our policy. I can't connect your question to what I said. What I tried to tell you Something is misunderstanding to me. was that I don't currently see a need to make this overly complicated: Having the option to punch holes for all devices and (by default) dealing with just the devices assigned at boot may be sufficient as a first step. Yet (repeating just to avoid any misunderstanding) that makes things easier only if we decide to require device assignment to happen before memory getting populated (since in that case there's Here what do you mean, 'if we decide to require device assignment to happen before memory getting populated'? Because -quote- In the present the device assignment is always after memory population. And I also mentioned previously I double checked this sequence with printk. Or you already plan or deciede to change this sequence? So it is now the 3rd time that I'm telling you that part of your decision making as to which route to follow should be to re-consider whether the current sequence of operations shouldn't be changed. Please also consult with the VT-d maintainers (hint to them: participating in this discussion publicly would be really nice) on _all_ decisions to be made here. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH] Adjust number of domains in cpupools when destroying domain
On 12.11.14 at 11:46, andrew.coop...@citrix.com wrote: On 12/11/14 10:40, Juergen Gross wrote: --- a/xen/common/cpupool.c +++ b/xen/common/cpupool.c @@ -225,6 +225,35 @@ static int cpupool_destroy(struct cpupool *c) } /* + * Move domain to another cpupool + */ +static int cpupool_move_domain_unlocked(struct domain *d, struct cpupool *c) This isn't an unlocked function. It is strictly called with the cpupool_lock held. Per prevailing style, it should be named __cpupool_move_domain(). I generally disagree to this, even if this is the prevailing style. Double-underscore prefixed names shouldn't be used at all in our code, as they're being reserved by the C library standard (and the compiler is free to introduce library calls named such). But the question of course is valid why the function name says unlocked when it's always being called with the lock held - locked would seem more natural in this case. But in the end Jürgen is the maintainer of that code, so he decides. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] RFC: vNUMA project
On 12.11.14 at 14:45, wei.l...@citrix.com wrote: On Wed, Nov 12, 2014 at 09:35:01AM +, Jan Beulich wrote: On 11.11.14 at 19:03, david.vra...@citrix.com wrote: On 11/11/14 17:36, Wei Liu wrote: Option #1 requires less modification to guest, because guest won't need to switch to new hypercall. It's unclear at this point if a guest asks to populate a gpfn that doesn't belong to any vnode, what Xen should do about it. Should it be permissive or strict? There are XENMEMF flags to request exact node or not -- leave it up to the balloon driver. The Linux balloon driver could try exact on all nodes before falling back to permissive or just always try inexact. Perhaps a XENMEMF_vnode bit to indicate the node is virtual? Yes. The only bad thing here is that we don't currently check in the hypervisor that unknown bits are zero, i.e. code using the new flag will need to have a separate means to find out whether the bit is supported. Not a big deal of course. If this new bit is set and domain has vnuma, then it's valid (supported); otherwise it's not. To not break existing guests, we can fall back to non-vnuma hinted allocation when the new bit is set and vnuma is not available. While this is valid, none of this was my point - I was talking about a new guest running on an older hypervisor. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] RFC: vNUMA project
On 12.11.14 at 15:40, wei.l...@citrix.com wrote: So what's the usual technique in Linux to make sure if a specific Xen feature is present? Jan, is it suitable to use a XENFEAT_* bit for this? Yes, that would be the canonical way. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 3/3] x86/xen: use the maximum MFN to calculate the required DMA mask
On 12.11.14 at 16:25, david.vra...@citrix.com wrote: +u64 +xen_swiotlb_get_required_mask(struct device *dev) +{ + u64 max_mfn; + + max_mfn = HYPERVISOR_memory_op(XENMEM_maximum_ram_page, NULL); + + return DMA_BIT_MASK(fls64(max_mfn PAGE_SHIFT) + 1); +} The value the hypercall returns is exclusive and an unsigned long. Hence I'd suggest u64 xen_swiotlb_get_required_mask(struct device *dev) { unsigned long max_mfn; max_mfn = HYPERVISOR_memory_op(XENMEM_maximum_ram_page, NULL); return DMA_BIT_MASK(__fls(max_mfn - 1) + 1 + PAGE_SHIFT); } Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH] libxc: Turn on the pdpe1gb cpuid flag by default
On 13.11.14 at 06:31, liang.z...@intel.com wrote: Apart from the subject being kind of wrong (you're not turning it on by default, but only expose it if the hardware supports it), ... --- a/tools/libxc/xc_cpuid_x86.c +++ b/tools/libxc/xc_cpuid_x86.c @@ -192,6 +192,7 @@ static void intel_xc_cpuid_policy( bitmaskof(X86_FEATURE_ABM)); regs[3] = (bitmaskof(X86_FEATURE_NX) | bitmaskof(X86_FEATURE_LM) | +bitmaskof(X86_FEATURE_PAGE1GB) | bitmaskof(X86_FEATURE_SYSCALL) | bitmaskof(X86_FEATURE_RDTSCP)); break; ... this is incomplete: The AMD function would seem to need the same change, and the !PAE special casing in xc_cpuid_hvm_policy() should imo disable the flag. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [TestDay]Xen-4.5-RC1: Dom0 failed to resume from S3 power state with operations on '/sys/power/state'
On 13.11.14 at 03:32, robert...@intel.com wrote: Hi, This is a separated report for bug http://bugzilla-archived.xenproject.org/bugzilla/show_bug.cgi?id=1897 Environment: Service Arch (ia32/ia32e/IA64): ia32e Guest Arch (ia32/ia32e/IA64): Guest OS Type (Linux/Windows): Change Set: Xen4.5-RC1 (git:1b068a5b485062c862c20d8ba7674faa97ee3714) Hardware: Ivybridge-EP, Grantley-EP Other: Bug detailed description: -- Boot into Dom0 (kernel 3.17.0), execute the command echo mem /sys/power/state, to make Dom0 system get into S3 sleep state, then press any key or power button to resume from S3 state, but failed and system hangs up with black screen. Reproduce steps: 1. Boot into Dom0, execute the command echo mem /sys/power/state to make Dom0 get into S3 state. 2. Press any key or press power button to resume from S3,no resume to dom0 Current result: Dom0 system cannot resume from S3 power state. Considering the Konrad confirmed this to work on other hardware it's likely going to be difficult to analyze this for anyone not in possession of a machine where things don't work as expected. I.e. again presumably a case where at least initial investigation would ideally be done by your engineers. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] About sharing pages between Xen and guest kernel
On 13.11.14 at 00:52, am...@gatech.edu wrote: Hello, I am trying to set up a shared page between the hypervisor and a Linux guest kernel. In Xen I am doing: void *ptr = alloc_xenheap_page(); share_xen_page_with_guest(virt_to_page(ptr), current-domain, XENSHARE_writable); unsigned int mfn = virt_to_mfn(ptr); And my plan was to pass the mfn to the guest kernel and retrieve the pfn with mfn_to_pfn(). However, this returned ~0. So I took a look at share_xen_page_with_guest() and I noticed it calls set_gpfn_from_mfn(mfn, INVALID_M2P_ENTRY), where INVALID_M2P_ENTRY = ~0. My questions are: 1. Do I have to call set_gpfn_from_mfn() after calling share_xen_page_with_guest() in the hypervisor? 2. If yes, what gpfn should I use? I am thinking I could allocate a page in the guest, and retrieve the pfn with page_to_pfn(), and then pass that to the hypervisor. But I don’t know if I am overcomplicating things. No, that wouldn't be correct (as is implied by 2). Instead you want to map the page based on its MFN (i.e. potentially without even establishing a PFN for it). But I suppose you looked at other examples in the source code, and you should have realized that this isn't really meant to be used for arbitrary memory sharing. Perhaps xentrace's map_tbufs() would be the closest reference to use. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] delaying 4.4.2 and 4.3.4
All, while the 3 month period since the previous stable releases would expire at the beginning of December, looking at the number of changes in the stable trees I don't think starting preparations right now would make much sense. Unless I hear severe objections to this plan, and unless 4.5 gets significantly delayed, I think we should look at RC1s for them once 4.5 is (almost) out the door. That'll also minimize collisions between testing needs 4.5 and the stable releases have. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 2/3] x86: allow dma_get_required_mask() to be overridden
On 13.11.14 at 11:25, jbeul...@suse.com wrote: On 12.11.14 at 16:25, david.vra...@citrix.com wrote: --- a/arch/x86/include/asm/device.h +++ b/arch/x86/include/asm/device.h @@ -13,4 +13,6 @@ struct dev_archdata { struct pdev_archdata { }; +#define ARCH_HAS_DMA_GET_REQUIRED_MASK Is there a particular reason you put this here rather than in dma-mapping.h? Never mind - I see you need it that way so that struct dma_map_ops gets the get_required_mask field defined, other than e.g. ia64 which handles through another abstraction layer. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v2] xen: remove Xencomm
On 13.11.14 at 17:03, t...@xen.org wrote: Being a feature that has only been used by ia64 and/or ppc it doesn't seem like we need to keep it any longer in the tree. So remove it. Signed-off-by: Boris Ostrovsky boris.ostrov...@oracle.com Signed-off-by: Tim Deegan t...@xen.org Acked-by: Jan Beulich jbeul...@suse.com ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH RFC] EFI: allow retry of ExitBootServices() call
The specification is kind of vague under what conditions ExitBootServices() may legitimately fail, requiring the OS loader to retry: If MapKey value is incorrect, ExitBootServices() returns EFI_INVALID_PARAMETER and GetMemoryMap() with ExitBootServices() must be called again. Firmware implementation may choose to do a partial shutdown of the boot services during the first call to ExitBootServices(). EFI OS loader should not make calls to any boot service function other then GetMemoryMap() after the first call to ExitBootServices(). While our code guarantees the map key to be valid, there are systems where a firmware internal notification sent while processing ExitBootServices() reportedly results in changes to the memory map. In that case, make a best effort second try: Avoid any boot service calls other than the two named above, with the possible exception of error paths. Those aren't a problem, since if we end up needing to retry, we're hosed when something goes wrong as much as if we didn't make the retry attempt. For x86, a minimal adjustment to efi_arch_process_memory_map() is needed for it to cope with potentially being called a second time. For arm64, while efi_process_memory_map_bootinfo() is easy to verify that it can safely be called more than once without violating spec constraints, it's not so obvious for fdt_add_uefi_nodes(), hence a step by step approach: - deletion of memory nodes and memory reserve map entries: the 2nd pass shouldn't find any as the 1st one deleted them all, - a chosen node should be found as it got added in the 1st pass, - the various linux,uefi-* nodes all got added during the 1st pass and hence only their contents may get updated. Signed-off-by: Jan Beulich jbeul...@suse.com --- a/xen/arch/x86/efi/efi-boot.h +++ b/xen/arch/x86/efi/efi-boot.h @@ -140,7 +140,7 @@ static void __init efi_arch_process_memo /* Populate E820 table and check trampoline area availability. */ e = e820map - 1; -for ( i = 0; i map_size; i += desc_size ) +for ( e820nr = i = 0; i map_size; i += desc_size ) { EFI_MEMORY_DESCRIPTOR *desc = map + i; u64 len = desc-NumberOfPages EFI_PAGE_SHIFT; --- a/xen/common/efi/boot.c +++ b/xen/common/efi/boot.c @@ -703,7 +703,7 @@ efi_start(EFI_HANDLE ImageHandle, EFI_SY EFI_GRAPHICS_OUTPUT_PROTOCOL *gop = NULL; EFI_GRAPHICS_OUTPUT_MODE_INFORMATION *mode_info; union string section = { NULL }, name; -bool_t base_video = 0; +bool_t base_video = 0, retry; char *option_str; bool_t use_cfg_file; @@ -1051,17 +1051,23 @@ efi_start(EFI_HANDLE ImageHandle, EFI_SY if ( !efi_memmap ) blexit(LUnable to allocate memory for EFI memory map); -status = efi_bs-GetMemoryMap(efi_memmap_size, efi_memmap, map_key, - efi_mdesc_size, mdesc_ver); -if ( EFI_ERROR(status) ) -PrintErrMesg(LCannot obtain memory map, status); +for ( retry = 0; ; retry = 1 ) +{ +status = efi_bs-GetMemoryMap(efi_memmap_size, efi_memmap, map_key, + efi_mdesc_size, mdesc_ver); +if ( EFI_ERROR(status) ) +PrintErrMesg(LCannot obtain memory map, status); -efi_arch_process_memory_map(SystemTable, efi_memmap, efi_memmap_size, -efi_mdesc_size, mdesc_ver); +efi_arch_process_memory_map(SystemTable, efi_memmap, efi_memmap_size, +efi_mdesc_size, mdesc_ver); -efi_arch_pre_exit_boot(); +efi_arch_pre_exit_boot(); + +status = efi_bs-ExitBootServices(ImageHandle, map_key); +if ( status != EFI_INVALID_PARAMETER || retry ) +break; +} -status = efi_bs-ExitBootServices(ImageHandle, map_key); if ( EFI_ERROR(status) ) PrintErrMesg(LCannot exit boot services, status); EFI: allow retry of ExitBootServices() call The specification is kind of vague under what conditions ExitBootServices() may legitimately fail, requiring the OS loader to retry: If MapKey value is incorrect, ExitBootServices() returns EFI_INVALID_PARAMETER and GetMemoryMap() with ExitBootServices() must be called again. Firmware implementation may choose to do a partial shutdown of the boot services during the first call to ExitBootServices(). EFI OS loader should not make calls to any boot service function other then GetMemoryMap() after the first call to ExitBootServices(). While our code guarantees the map key to be valid, there are systems where a firmware internal notification sent while processing ExitBootServices() reportedly results in changes to the memory map. In that case, make a best effort second try: Avoid any boot service calls other than the two named above, with the possible exception of error paths. Those aren't a problem, since if we end up needing to retry, we're hosed when something goes wrong as much as if we didn't make the retry attempt. For x86, a minimal
Re: [Xen-devel] Xen-unstable: xen panic RIP: dpci_softirq
On 14.11.14 at 14:11, li...@eikelenboom.it wrote: (XEN) [2014-11-14 13:00:06.810] [ Xen-4.5.0-rc x86_64 debug=y Not tainted ] (XEN) [2014-11-14 13:00:06.810] CPU:3 (XEN) [2014-11-14 13:00:06.810] RIP:e008:[82d080148f14] dpci_softirq+0x9c/0x231 (XEN) [2014-11-14 13:00:06.810] RFLAGS: 00010283 CONTEXT: hypervisor (XEN) [2014-11-14 13:00:06.811] rax: 00100100 rbx: 830515c86d90 rcx: 0001 This and ... (XEN) [2014-11-14 13:00:06.811] Pagetable walk from 00100108: (XEN) [2014-11-14 13:00:06.811] L4[0x000] = 00055d09a063 (XEN) [2014-11-14 13:00:06.811] L3[0x000] = 00055d099063 (XEN) [2014-11-14 13:00:06.811] L2[0x000] = 00055d098063 ... this suggest it hit a poisoned list entry, due to #define LIST_POISON1 ((void *) 0x00100100) But without you helping out with associating the crash location with a source line (or allowing us to do so by making xen-syms available somewhere) this is going to remain guesswork. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] Xen-unstable: xen panic RIP: dpci_softirq
On 14.11.14 at 16:20, li...@eikelenboom.it wrote: If it still helps i could try Andrews suggestion and try out with only commit aeeea485 .. Yes, even if it's pretty certain it's the second of the commits, verifying this would be helpful (or if the assumption is wrong, the pattern it's dying with would change and hence perhaps provide further clues). Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
On 17.11.14 at 08:57, tiejun.c...@intel.com wrote: --- a/xen/common/memory.c +++ b/xen/common/memory.c @@ -698,10 +698,13 @@ struct get_reserved_device_memory { unsigned int used_entries; }; -static int get_reserved_device_memory(xen_pfn_t start, - xen_ulong_t nr, void *ctxt) +static int get_reserved_device_memory(xen_pfn_t start, xen_ulong_t nr, u16 seg, + u16 *ids, int cnt, void *ctxt) While the approach is a lot better than what you did previously, I still don't like you adding 3 new parameters when one would do (calling the callback for each SBDF individually): That way you avoid introducing a hidden dependency on how the VT-d code manages its internal data. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH RFC] EFI: allow retry of ExitBootServices() call
On 14.11.14 at 16:32, konrad.w...@oracle.com wrote: On Fri, Nov 14, 2014 at 12:37:30PM +, Jan Beulich wrote: The specification is kind of vague under what conditions ExitBootServices() may legitimately fail, requiring the OS loader to retry: If MapKey value is incorrect, ExitBootServices() returns EFI_INVALID_PARAMETER and GetMemoryMap() with ExitBootServices() must be called again. Firmware implementation may choose to do a partial shutdown of the boot services during the first call to ExitBootServices(). EFI OS loader should not make calls to any boot service function other then GetMemoryMap() after the first call to ExitBootServices(). While our code guarantees the map key to be valid, there are systems where a firmware internal notification sent while processing ExitBootServices() reportedly results in changes to the memory map. s/reportedly/in fact/ In that case, make a best effort second try: Avoid any boot service calls other than the two named above, with the possible exception of error paths. Those aren't a problem, since if we end up needing to retry, we're hosed when something goes wrong as much as if we didn't make the retry attempt. For x86, a minimal adjustment to efi_arch_process_memory_map() is needed for it to cope with potentially being called a second time. Wow. Talk about timing. We saw this and were going to see doing something similar. So what are your thoughts then regarding this patch going into 4.5? And for the LIST_POISON* override one? Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH V2] Decouple SnadyBridge quirk form VTd timeout
On 17.11.14 at 14:32, donald.d.dug...@intel.com wrote: Hmm, good ideas. How about I change the `snb_igd_quirk' parameter to be: snb_igd_quirk = current behavior, enable quirk with 1 sec timeout snb_igd_quirk=default = enable quirk with theoretical max timeout of 670 msec snb_igd_quirk=n = enable quirk with timeout of `n' msec Please retain the current boolean behavior (i.e. =yes, =no, etc should all continue to be permitted). And the =yes case then would become what you propose above as =default. Plus I see no reason for the lack of a specified value to mean 1s - if 670ms is the theoretical may, there's no need to support the 1s one by means other than the use specifying that value on the command line. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH V2] Decouple SnadyBridge quirk form VTd timeout
On 17.11.14 at 15:51, donald.d.dug...@intel.com wrote: My primary goal is to decouple the IGD quirk code from the VTd timeout value, the two are unrelated and shouldn't be using the same define. In the process we can clean up the IGD code so that, if a user wants, the user can specify a more appropriate timeout value for the quirk code. There's no need to change the current behavior of the flag, just give the user more options. It the VT-d maintainers agree with that, I certainly won't stand in the way. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH] libxc: Expose the pdpe1gb cpuid flag to guest
On 17.11.14 at 16:39, ian.jack...@eu.citrix.com wrote: Liang Li writes ([PATCH] libxc: Expose the pdpe1gb cpuid flag to guest): If hardware support the pdpe1gb flag, expose it to guest by default. Users don't have to use a 'cpuid= ' option in config file to turn it on. I don't understand what this flag does. I guess from the name it turns on 1G pages. I guess these are supported ? I would like to see comment from an x86 hypervisor maintainer. Yes, we support 1Gb pages. The purpose of the patch is to not unconditionally hide the flag from guests. (I had commented on v1, but sadly this one wasn't tagged as v2, nor was I included on the Cc list, hence I didn't spot the new version.) The one thing I'm not certain about is shadow mode: Only 2Mb pages may be supported there. Tim? Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH RFC 0/7] xen: Clean-up of mem_event subsystem
On 17.11.14 at 17:43, rcojoc...@bitdefender.com wrote: On 11/17/2014 06:37 PM, Jan Beulich wrote: On 12.11.14 at 16:48, andrew.coop...@citrix.com wrote: On 12/11/14 15:31, Tamas K Lengyel wrote: xen/include/public/domctl.h | 44 +-- xen/include/public/hvm/params.h | 2 +- xen/include/public/mem_event.h | 134 --- xen/include/public/memory.h | 6 +- xen/include/public/vm_event.h | 179 + While in principle I think this series is a very good thing, there is a problem with editing the pubic header files. The contents of mem_event.h is not currently hidden behind #ifdef __XEN_TOOLS__ As a result, it is strictly speaking part of the VM-visible public API/ABI and not permitted to change in a backwards incompatible manor. Having said that, it is currently only usable by privileged domains, so there is an argument to be made for declaring that it should have been hidden behind __XEN_TOOLS__ in the first place, making it permittable to change. I'm not sure I agree - the meaning of tools here would seem quite a bit different than e.g. in domctl.h. Looking at patch 1, I can't see how an old consumer (remember that for many of these we have at best fake consumers in the tree) would deal with the now differently arranged data. I don't see any versioning of the interface, and hence I can't see how tools would know which of the formats to expect. In the initial patch I've sent Tamas I had arranged things as follows, (so that the layout would stay compatible) but I think we ended up discussing it and deciding it would look cleaner to just re-do the whole thing: +struct mem_event_ept_data { +uint64_t gfn; +uint64_t offset; +uint64_t gla; /* if gla_valid */ +}; + +struct mem_event_cr_data { +uint64_t new_value; +uint64_t _pad; +uint64_t old_value; +}; + +struct mem_event_int3_data { +uint64_t gfn; +uint64_t _pad; +uint64_t eip; +}; + +struct mem_event_singlestep_data { +uint64_t gfn; +uint64_t _pad; +uint64_t eip; +}; + +struct mem_event_msr_data { +uint64_t msr; +uint64_t old_value; +uint64_t new_value; +}; + typedef struct mem_event_st { uint32_t flags; uint32_t vcpu_id; -uint64_t gfn; -uint64_t offset; -uint64_t gla; /* if gla_valid */ +union { +struct mem_event_ept_dataept_event; +struct mem_event_cr_data cr_event; +struct mem_event_int3_data int3_event; +struct mem_event_singlestep_data singlestep_event; +struct mem_event_msr_datamsr_event; +}; uint32_t p2mt; Would something like this be more along the right lines? Yes, afaic. But tool stack and mm maintainers need to be on the same page before you should take this as the final route to follow. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH RFC 0/7] xen: Clean-up of mem_event subsystem
On 17.11.14 at 18:06, tamas.leng...@zentific.com wrote: On Mon, Nov 17, 2014 at 5:37 PM, Jan Beulich jbeul...@suse.com wrote: On 12.11.14 at 16:48, andrew.coop...@citrix.com wrote: On 12/11/14 15:31, Tamas K Lengyel wrote: xen/include/public/domctl.h | 44 +-- xen/include/public/hvm/params.h | 2 +- xen/include/public/mem_event.h | 134 --- xen/include/public/memory.h | 6 +- xen/include/public/vm_event.h | 179 + While in principle I think this series is a very good thing, there is a problem with editing the pubic header files. The contents of mem_event.h is not currently hidden behind #ifdef __XEN_TOOLS__ As a result, it is strictly speaking part of the VM-visible public API/ABI and not permitted to change in a backwards incompatible manor. Having said that, it is currently only usable by privileged domains, so there is an argument to be made for declaring that it should have been hidden behind __XEN_TOOLS__ in the first place, making it permittable to change. I'm not sure I agree - the meaning of tools here would seem quite a bit different than e.g. in domctl.h. Looking at patch 1, I can't see how an old consumer (remember that for many of these we have at best fake consumers in the tree) would deal with the now differently arranged data. I don't see any versioning of the interface, and hence I can't see how tools would know which of the formats to expect. The lack of versioning is a real concern which I have aired during the 4.5 development process. For example, when we switched from HVMEM_access_* to XENMEM_access_* a customer had to do a bunch of manual configure checks to determine what is supported and what isn't. Furthermore, many of the related APIs have changed quite radically between Xen 4.1-4.5, some being abandoned midway just to resurface later in a different context. Going forward having a clear VM_EVENT_VERSION definition would be very helpful and would be the cleanest solution IMHO. That's a concern different from mine - source compatibility may be acceptable to get broken. Undetectable binary incompatibilities are what worry me more. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] Xen-unstable: xen panic RIP: dpci_softirq
On 17.11.14 at 23:40, li...@eikelenboom.it wrote: OK, i still don't get why the output of debug-key 'i' reports +47 as pirq here instead of the guest value (g_gsi of for this legacy interrupt which is 40 ?), like it does when it's a MSI with the PIRQ ? No - as you said yourself, it's a matter of who uses which numbers. Xen can't know the IRQ number the guest OS choses internally. It can only tell you the number the hypervisor uses internally and the one it told the guest to use to refer to it (the former is what we commonly refer to as IRQ, while the latter is the pIRQ). Pin-based (i.e. IO-APIC) interrupts should always use the same number for both; MSI ones could only happen to have them match up, but would use distinct numbers normally. I am puzzled by the driver binding twice to the same interrupt, but perhaps that is just a buggy driver. Doesn't that happen more often like with integrated USB controllers ? 17: 4 0 0 0 0 0 xen-pirq-ioapic-level ehci_hcd:usb1, ehci_hcd:usb2, ehci_hcd:usb3 18: 4385 0 0 0 0 0 xen-pirq-ioapic-level ohci_hcd:usb4, ohci_hcd:usb5, ohci_hcd:usb6, ohci_hcd:usb7 No, these are all distinct devices - if you look at you lspci output, you should find multiple EHCI and OHCI controller instances. It's slightly odd that each kind gets grouped onto a single IO-APIC pin, but that's a (legal) BIOS decision. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] Xen-unstable: xen panic RIP: dpci_softirq
On 17.11.14 at 19:01, li...@eikelenboom.it wrote: (XEN) [2014-11-17 17:54:18.695] CPU00: (XEN) [2014-11-17 17:54:18.705] CPU01: (XEN) [2014-11-17 17:54:18.716] d16 OK-softirq 62msec ago, state:1, 2628 count, [prev:83054ef57e70, next:83054ef57e70] 83051b904428NULL MAPPED_SHIFT GUEST_MSI_SHIFT PIRQ:87 (XEN) [2014-11-17 17:54:18.765] d16 OK-raise 112msec ago, state:1, 2628 count, [prev:00200200, next:00100100] 83051b904428NULL MAPPED_SHIFT GUEST_MSI_SHIFT PIRQ:87 (XEN) [2014-11-17 17:54:18.815] CPU02: (XEN) [2014-11-17 17:54:18.825] d17 OK-softirq 500msec ago, state:1, 3439 count, [prev:83054ef47e70, next:83054ef47e70] 83051a1c8c28NULL MAPPED_SHIFT GUEST_MSI_SHIFT PIRQ:87 (XEN) [2014-11-17 17:54:18.875] d17 OK-raise 549msec ago, state:1, 3439 count, [prev:00200200, next:00100100] 83051a1c8c28NULL MAPPED_SHIFT GUEST_MSI_SHIFT PIRQ:87 (XEN) [2014-11-17 17:54:18.924] CPU03: (XEN) [2014-11-17 17:54:18.935] d16 OK-softirq 313msec ago, state:1, 3533 count, [prev:83054ef37e70, next:83054ef37e70] 83051b904428NULL MAPPED_SHIFT GUEST_MSI_SHIFT PIRQ:87 (XEN) [2014-11-17 17:54:18.984] d16 OK-raise 363msec ago, state:1, 3533 count, [prev:00200200, next:00100100] 83051b904428NULL MAPPED_SHIFT GUEST_MSI_SHIFT PIRQ:87 (XEN) [2014-11-17 17:54:19.034] CPU04: (XEN) [2014-11-17 17:54:19.044] d16 OK-softirq 359msec ago, state:1, 3691 count, [prev:83054ef27e88, next:83054ef27e88] 83051b904428NULL MAPPED_SHIFT GUEST_MSI_SHIFT PIRQ:87 (XEN) [2014-11-17 17:54:19.094] d16 OK-raise 408msec ago, state:1, 3691 count, [prev:00200200, next:00100100] 83051b904428NULL MAPPED_SHIFT GUEST_MSI_SHIFT PIRQ:87 (XEN) [2014-11-17 17:54:19.143] CPU05: (XEN) [2014-11-17 17:54:19.154] d16 OK-softirq 458msec ago, state:1, 52039 count, [prev:83054ef283e0, next:83054ef283e0] 83051b95fd28MACH_PCI_SHIFT MAPPED_SHIFT GUEST_PCI_SHIFT PIRQ:0 (XEN) [2014-11-17 17:54:19.205] d16 OK-raise 489msec ago, state:1, 52049 count, [prev:00200200, next:00100100] 83051b95fd28MACH_PCI_SHIFT MAPPED_SHIFT GUEST_PCI_SHIFT PIRQ:0 (XEN) [2014-11-17 17:54:19.257] d16 ERR-poison 561msec ago, state:0, 1 count, [prev:00200200, next:00100100] 83051b95fd28MACH_PCI_SHIFT MAPPED_SHIFT GUEST_PCI_SHIFT PIRQ:0 (XEN) [2014-11-17 17:54:19.307] d16 Z-softirq 731msec ago, state:3, 3 count, [prev:83054ef283e0, next:83054ef283e0] 83051b95fd28MACH_PCI_SHIFT MAPPED_SHIFT GUEST_PCI_SHIFT PIRQ:0 (XEN) [2014-11-17 17:54:19.356] domain_crash called from io.c:938 (XEN) [2014-11-17 17:54:19.356] Domain 16 reported crashed by domain 32767 on cpu#5: I think what would help establishing the sequence of events would be to at the very least calculate the times printed above relative to a single NOW() invocation done in dump_debug() rather than dump_record(). That may, however, yield meaningless values when done at millisecond granularity. Hence, either using nanosecond granularity or not using time values but a simple sequence counter might be a desirable approach. What puzzles me additionally is that for list_del() to encounter an already removed entry, I'd expect the entry to (mistakenly) have got added twice. Yet there's no sign of that (the most recent OK-raise entry shows the list entry still having poisoned pointers). Or it would need to have got inserted a second time on another CPU, but the track record thereof having got overwritten. Perhaps now that we suspect the legacy IRQ to be the problematic one the patch could be adjusted to track only operations on non-MSI IRQs (or separately all three PT_IRQ_TYPE_*)? Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
On 18.11.14 at 09:16, tiejun.c...@intel.com wrote: On 2014/11/18 16:01, Jan Beulich wrote: On 18.11.14 at 04:08, tiejun.c...@intel.com wrote: Here I tried to implement what you want. Note just pick two key fragments since others have no big deal. #1: @@ -898,14 +898,25 @@ int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt) { struct acpi_rmrr_unit *rmrr; int rc = 0; +unsigned int i; +u32 id; +u16 bdf; list_for_each_entry(rmrr, acpi_rmrr_units, list) { -rc = func(PFN_DOWN(rmrr-base_address), - PFN_UP(rmrr-end_address) - PFN_DOWN(rmrr-base_address), - ctxt); -if ( rc ) -break; +for (i = 0; (bdf = rmrr-scope.devices[i]) +i rmrr-scope.devices_cnt !rc; i++) +{ +id = PCI_SBDF(rmrr-segment, bdf); +rc = func(PFN_DOWN(rmrr-base_address), + PFN_UP(rmrr-end_address) - +PFN_DOWN(rmrr-base_address), + id, + ctxt); +if ( rc 0 ) +return rc; +} +rc = 0; Getting close - the main issue is that (as previously mentioned) you should avoid open-coding for_each_rmrr_device(). It also doesn't Sorry, are you saying these lines? +for (i = 0; (bdf = rmrr-scope.devices[i]) +i rmrr-scope.devices_cnt !rc; i++) So without lookuping devices[i], how can we call func() for each sbdf as you mentioned? You've got both rmrr and bdf in the body of for_each_rmrr_device(). After all - as I said - you just open-coded it. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH for Xen 4.5] xen/arm: Add support for GICv3 for domU
On 18.11.14 at 16:00, julien.gr...@linaro.org wrote: On 10/31/2014 09:02 AM, Jan Beulich wrote: On 30.10.14 at 19:51, julien.gr...@linaro.org wrote: The naming suggests that the #if really should be around just the gic_version field (with a dummy field in the #else case to be C89 compatible, e.g. a zero width unnamed bitfield) and the corresponding #define-s above, ... Not really related to this patch... but the way to improve it (via extending createdomain). I need to create an empty structure. Is the dummy field really needed? If so, did you meant? struct { int :0; } Yes. The C spec declare this kind of structure as undefined. I can't find anything saying so. Would an empty structure and used it be better? Empty structures (and unions) aren't valid in standard C afaics, up to and including C11. That was the whole point of suggesting the above alternative, with me (maybe wrongly) believing that this would be valid. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] Problems accessing passthrough PCI device
On 18.11.14 at 17:24, furryfutt...@gmail.com wrote: Hello Jan, Friday, November 14, 2014, 5:27:36 AM, you wrote: I implied your earlier statement to mean that. But - did you also verify that the three flags actually end up set (ideally from both DomU and Dom0 perspective)? The PCI backend may be screwing up things... Yes I do verify the write. How do I check this from Dom0? Just use lspci. I've just checked this with lspci. I see that the IO is being enabled. Memory you mean. Any other idea on why I might be reading back 0xff for all PCI memory area reads? The lspci output follows. Since this isn't behind a bridge - no, not really. Did you try this with any other device for comparison purposes? Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
On 19.11.14 at 02:26, tiejun.c...@intel.com wrote: So without lookuping devices[i], how can we call func() for each sbdf as you mentioned? You've got both rmrr and bdf in the body of for_each_rmrr_device(). After all - as I said - you just open-coded it. Yeah, so change this again, int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt) { struct acpi_rmrr_unit *rmrr; int rc = 0; unsigned int i; u16 bdf; for_each_rmrr_device ( rmrr, bdf, i ) { rc = func(PFN_DOWN(rmrr-base_address), PFN_UP(rmrr-end_address) - PFN_DOWN(rmrr-base_address), PCI_SBDF(rmrr-segment, bdf), ctxt); /* Hit this entry so just go next. */ if ( rc == 1 ) i = rmrr-scope.devices_cnt; else if ( rc 0 ) return rc; } return rc; } Better. Another improvement would be make it not depend on the internal workings of for_each_rmrr_device()... And in any case you should not special case 1 - just return when rc is negative and skip the rest of the current RMRR when it's positive. And of course make the function's final return value predictable. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] Regression, host crash with 4.5rc1
On 20.11.14 at 02:23, sfl...@ihonk.com wrote: On 11/17/2014 23:54, Jan Beulich wrote: Another thing - now that serial logging appears to be working for you, did you try whether the host, once hung, still reacts to serial input (perhaps force input to go to Xen right at boot via the conswitch= option)? If so, 'd' debug-key output would likely be the piece of most interest. Here you go. Performed with a checkout of 9a727a81 (because it was handy), let me know if you'd rather see the results from 4.5-rc2 or any other Xen debugging info: Interesting - all CPUs are executing stuff from Dom1, which be itself is not indicating a problem, but may (considering the host hang) hint at guest vCPU-s not getting de-scheduled anymore on one or more of the CPUs. The 'a' debug key output would hopefully give us enough information to know whether that's the case. Ideally do 'd' and 'a' a couple of times each (alternating, and with a few seconds in between). Also, for double checking purposes, could you make the xen-syms of a build you observed the problem with available somewhere? Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v10 for-xen-4.5 2/2] dpci: Replace tasklet with an softirq
On 19.11.14 at 17:44, konrad.w...@oracle.com wrote: On Fri, Nov 14, 2014 at 11:11:46AM -0500, Konrad Rzeszutek Wilk wrote: On Fri, Nov 14, 2014 at 03:13:42PM +, Jan Beulich wrote: On 12.11.14 at 03:23, konrad.w...@oracle.com wrote: +static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci) +{ +struct domain *d = pirq_dpci-dom; + +ASSERT(spin_is_locked(d-event_lock)); + +switch ( cmpxchg(pirq_dpci-state, 1 STATE_SCHED, 0) ) +{ +case (1 STATE_SCHED): +/* + * We are going to try to de-schedule the softirq before it goes in + * STATE_RUN. Whoever clears STATE_SCHED MUST refcount the 'dom'. + */ +put_domain(d); +/* fallthrough. */ Considering Sander's report, the only suspicious place I find is this one: When the STATE_SCHED flag is set, pirq_dpci is on some CPU's list. What guarantees it to get removed from that list before getting inserted on another one? None. The moment that STATE_SCHED is cleared, 'raise_softirq_for' is free to manipulate the list. I was too quick to say this. A bit more inspection shows that while 'raise_softirq_for' is free to manipulate the list - it won't be called. The reason is that the pt_pirq_softirq_reset is called _after_ the IRQ action handler are removed for this IRQ. That means we will not receive any interrupts for it and call 'raise_softirq_for'. At least until 'pt_irq_create_bind' is called. And said function has a check for this too: 42 * A crude 'while' loop with us dropping the spinlock and giving 243 * the softirq_dpci a chance to run. 244 * We MUST check for this condition as the softirq could be scheduled 245 * and hasn't run yet. Note that this code replaced tasklet_kill which 246 * would have spun forever and would do the same thing (wait to flush out 247 * outstanding hvm_dirq_assist calls. 248 */ 249 if ( pt_pirq_softirq_active(pirq_dpci) ) With that you correct your earlier reply, but I don't see how this answers my original question. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] Problems accessing passthrough PCI device
On 19.11.14 at 16:12, furryfutt...@gmail.com wrote: This is getting more interesting. It seems that something is overwriting the pci-back configuration data. Starting from a fresh reboot I checked the Dom0 pci configuration and got this: root@smartin-xen:~# lspci -s 00:19.0 -x 00:19.0 Ethernet controller: Intel Corporation Device 1559 (rev 04) 00: 86 80 59 15 00 00 10 00 04 00 00 02 00 00 00 00 10: 00 00 d0 f7 00 c0 d3 f7 81 f0 00 00 00 00 00 00 20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 54 20 30: 00 00 00 00 c8 00 00 00 00 00 00 00 05 01 00 00 I then start/stop my DomU and checked the Dom0 pci configuration again and got this: root@smartin-xen:~# lspci -s 00:19.0 -x 00:19.0 Ethernet controller: Intel Corporation Device 1559 (rev 04) 00: 86 80 59 15 00 00 10 00 04 00 00 02 00 00 00 00 10: 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 54 20 30: 00 00 00 00 c8 00 00 00 00 00 00 00 05 01 00 00 I.e. the BARs got zapped. Not something that should happen. And certainly explaining why you would not be able to use the device a second time. Inside my DomU I added code to print the PCI configuration registers and what I get after restarting the DomU is: (d18) 14:57:04.042 src/e1000e.c@00150: 00: 86 80 59 15 00 00 10 00 04 00 00 02 00 00 00 00 (d18) 14:57:04.042 src/e1000e.c@00150: 10: 00 00 d0 f7 00 c0 d3 f7 81 f0 00 00 00 00 00 00 (d18) 14:57:04.042 src/e1000e.c@00150: 20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 54 20 (d18) 14:57:04.043 src/e1000e.c@00150: 30: 00 00 00 00 c8 00 00 00 00 00 00 00 14 01 00 00 (d18) 14:57:04.043 src/e1000e.c@00324: Enable PCI Memory Access (d18) 14:57:05.043 src/e1000e.c@00150: 00: 86 80 59 15 03 00 10 00 04 00 00 02 00 00 00 00 (d18) 14:57:05.044 src/e1000e.c@00150: 10: 00 00 d0 f7 00 c0 d3 f7 81 f0 00 00 00 00 00 00 (d18) 14:57:05.044 src/e1000e.c@00150: 20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 54 20 (d18) 14:57:05.045 src/e1000e.c@00150: 30: 00 00 00 00 c8 00 00 00 00 00 00 00 14 01 00 00 As you can see the pci configuration read from the pci-back driver by my DomU is different to the data in the Dom0 pci configuration! The only byte that is different afaics is the one at 0x3c, and that's fine. Plus comparing with what Dom0 sees before the guest was started or after the guest was stopped isn't really meaningful - instead you'd want to compare against what Dom0 sees while the guest is running. But in the end your problem may have to do with the various warnings issued regarding unrecognized DMAR table data, all relating to the use of namespaces that our code doesn't have support for. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH 0/3] x86: XSA-109/110 follow-up (to be considered for 4.5)
1: tighten page table owner checking in do_mmu_update() 2: don't ignore foreigndom input on various MMUEXT ops 3: HVM: don't crash guest upon problems occurring in user mode Reason to request considering this for 4.5: Tightened argument checking (as done in the first two patches) reduces the chances of them getting abused. Not unduly crashing the guest (as done in the third one) may avoid future security issues of guest user mode affecting the guest kernel. Signed-off-by: Jan Beulich jbeul...@suse.com ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH 1/3] x86: tighten page table owner checking in do_mmu_update()
MMU_MACHPHYS_UPDATE, not manipulating page tables, shouldn't ignore a bad page table domain being specified. Also pt_owner can't be NULL when reaching the out label, so the respective check can be dropped. Signed-off-by: Jan Beulich jbeul...@suse.com Acked-by: Tim Deegan t...@xen.org --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -3618,6 +3618,11 @@ long do_mmu_update( break; case MMU_MACHPHYS_UPDATE: +if ( unlikely(d != pt_owner) ) +{ +rc = -EPERM; +break; +} mfn = req.ptr PAGE_SHIFT; gpfn = req.val; @@ -3694,7 +3699,7 @@ long do_mmu_update( perfc_add(num_page_updates, i); out: -if ( pt_owner (pt_owner != d) ) +if ( pt_owner != d ) rcu_unlock_domain(pt_owner); /* Add incremental work we have done to the @done output parameter. */ x86: tighten page table owner checking in do_mmu_update() MMU_MACHPHYS_UPDATE, not manipulating page tables, shouldn't ignore a bad page table domain being specified. Also pt_owner can't be NULL when reaching the out label, so the respective check can be dropped. Signed-off-by: Jan Beulich jbeul...@suse.com Acked-by: Tim Deegan t...@xen.org --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -3618,6 +3618,11 @@ long do_mmu_update( break; case MMU_MACHPHYS_UPDATE: +if ( unlikely(d != pt_owner) ) +{ +rc = -EPERM; +break; +} mfn = req.ptr PAGE_SHIFT; gpfn = req.val; @@ -3694,7 +3699,7 @@ long do_mmu_update( perfc_add(num_page_updates, i); out: -if ( pt_owner (pt_owner != d) ) +if ( pt_owner != d ) rcu_unlock_domain(pt_owner); /* Add incremental work we have done to the @done output parameter. */ ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH 2/3] x86: don't ignore foreigndom input on various MMUEXT ops
Instead properly fail requests that shouldn't be issued on foreign domains or - for MMUEXT_{CLEAR,COPY}_PAGE - extend the existing operation to work that way. In the course of doing this the need to always clear okay even when wanting an error code other than -EINVAL became unwieldy, so the respective logic is being adjusted at once, together with a little other related cleanup. Signed-off-by: Jan Beulich jbeul...@suse.com --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -3062,23 +3062,23 @@ long do_mmuext_op( } case MMUEXT_NEW_BASEPTR: -if ( paging_mode_translate(d) ) -okay = 0; +if ( unlikely(d != pg_owner) ) +rc = -EPERM; +else if ( unlikely(paging_mode_translate(d)) ) +rc = -EINVAL; else -{ rc = new_guest_cr3(op.arg1.mfn); -okay = !rc; -} break; case MMUEXT_NEW_USER_BASEPTR: { unsigned long old_mfn; -if ( paging_mode_translate(current-domain) ) -{ -okay = 0; +if ( unlikely(d != pg_owner) ) +rc = -EPERM; +else if ( unlikely(paging_mode_translate(d)) ) +rc = -EINVAL; +if ( unlikely(rc) ) break; -} old_mfn = pagetable_get_pfn(curr-arch.guest_table_user); /* @@ -3136,12 +3136,17 @@ long do_mmuext_op( } case MMUEXT_TLB_FLUSH_LOCAL: -flush_tlb_local(); +if ( likely(d == pg_owner) ) +flush_tlb_local(); +else +rc = -EPERM; break; case MMUEXT_INVLPG_LOCAL: -if ( !paging_mode_enabled(d) - || paging_invlpg(curr, op.arg1.linear_addr) != 0 ) +if ( unlikely(d != pg_owner) ) +rc = -EPERM; +else if ( !paging_mode_enabled(d) || + paging_invlpg(curr, op.arg1.linear_addr) != 0 ) flush_tlb_one_local(op.arg1.linear_addr); break; @@ -3150,13 +3155,16 @@ long do_mmuext_op( { cpumask_t pmask; -if ( unlikely(vcpumask_to_pcpumask(d, -guest_handle_to_param(op.arg2.vcpumask, const_void), -pmask)) ) -{ -okay = 0; +if ( unlikely(d != pg_owner) ) +rc = -EPERM; +else if ( unlikely(vcpumask_to_pcpumask(d, + guest_handle_to_param(op.arg2.vcpumask, + const_void), + pmask)) ) +rc = -EINVAL; +if ( unlikely(rc) ) break; -} + if ( op.cmd == MMUEXT_TLB_FLUSH_MULTI ) flush_tlb_mask(pmask); else @@ -3165,18 +3173,26 @@ long do_mmuext_op( } case MMUEXT_TLB_FLUSH_ALL: -flush_tlb_mask(d-domain_dirty_cpumask); +if ( likely(d == pg_owner) ) +flush_tlb_mask(d-domain_dirty_cpumask); +else +rc = -EPERM; break; case MMUEXT_INVLPG_ALL: -flush_tlb_one_mask(d-domain_dirty_cpumask, op.arg1.linear_addr); +if ( likely(d == pg_owner) ) +flush_tlb_one_mask(d-domain_dirty_cpumask, op.arg1.linear_addr); +else +rc = -EPERM; break; case MMUEXT_FLUSH_CACHE: -if ( unlikely(!cache_flush_permitted(d)) ) +if ( unlikely(d != pg_owner) ) +rc = -EPERM; +else if ( unlikely(!cache_flush_permitted(d)) ) { MEM_LOG(Non-physdev domain tried to FLUSH_CACHE.); -okay = 0; +rc = -EACCES; } else { @@ -3185,8 +3201,8 @@ long do_mmuext_op( break; case MMUEXT_FLUSH_CACHE_GLOBAL: -if ( unlikely(foreigndom != DOMID_SELF) ) -okay = 0; +if ( unlikely(d != pg_owner) ) +rc = -EPERM; else if ( likely(cache_flush_permitted(d)) ) { unsigned int cpu; @@ -3211,7 +3227,9 @@ long do_mmuext_op( unsigned long ptr = op.arg1.linear_addr; unsigned long ents = op.arg2.nr_ents; -if ( paging_mode_external(d) ) +if ( unlikely(d != pg_owner) ) +rc = -EPERM; +else if ( paging_mode_external(d) ) { MEM_LOG(ignoring SET_LDT hypercall from external domain); okay = 0; @@ -3237,7 +3255,7 @@ long do_mmuext_op( case MMUEXT_CLEAR_PAGE: { struct page_info *page; -page = get_page_from_gfn
Re: [Xen-devel] [for-xen-4.5 PATCH v2 1/2] dpci: Fix list corruption if INTx device is used and an IRQ timeout is invoked.
On 19.11.14 at 23:21, konrad.w...@oracle.com wrote: @@ -97,13 +97,15 @@ bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci *pirq_dpci) } /* - * Reset the pirq_dpci-dom parameter to NULL. + * Cancels an outstanding pirq_dpci (if scheduled). Also if clear is set, + * reset pirq_dpci-dom parameter to NULL (used for teardown). * * This function checks the different states to make sure it can do it * at the right time. If it unschedules the 'hvm_dirq_assist' from running * it also refcounts (which is what the softirq would have done) properly. */ -static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci) +static void pt_pirq_softirq_cancel(struct hvm_pirq_dpci *pirq_dpci, + unsigned int clear) Looks reasonable apart from this wanting to be bool_t, and apart from the fact that iiuc it still doesn't eliminate the observed crashes (in which case the fix for that may better be folded into here). I'm not really convinced that this fix is what you currently have as patch 2 (as the previously mentioned problem of _reset() [now _cancel()] clearing STATE_SCHED without removing the list entry from the list is now becoming even more obvious), but I'll also comment there. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [for-xen-4.5 PATCH v2 2/2] dpci: Add ZOMBIE state to allow the softirq to finish with the dpci_pirq.
On 19.11.14 at 23:21, konrad.w...@oracle.com wrote: Leaving aside the question of whether this is the right approach, in case it is a couple of comments: @@ -85,7 +91,7 @@ static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci) */ bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci *pirq_dpci) { -if ( pirq_dpci-state ((1 STATE_RUN) | (1 STATE_SCHED)) ) +if ( pirq_dpci-state ((1 STATE_RUN) | (1 STATE_SCHED) | (1 STATE_ZOMBIE) ) ) This is becoming unwieldy - perhaps better just if ( pirq_dpci-state )? @@ -122,6 +128,7 @@ static void pt_pirq_softirq_cancel(struct hvm_pirq_dpci *pirq_dpci, /* fallthrough. */ case (1 STATE_RUN): case (1 STATE_RUN) | (1 STATE_SCHED): +case (1 STATE_RUN) | (1 STATE_SCHED) | (1 STATE_ZOMBIE): How would we get into this state? Afaict STATE_ZOMBIE can't be set at the same time as STATE_SCHED. @@ -786,6 +793,7 @@ unlock: static void dpci_softirq(void) { unsigned int cpu = smp_processor_id(); +unsigned int reset = 0; bool_t (and would better be inside the loop, avoiding the pointless re-init on the last iteration). But taking it as a whole - aren't we now at risk of losing an interrupt instance the guest expects, due to raise_softirq_for() bailing on zombie entries, and dpci_softirq() being the only place where the zombie state gets cleared? Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 4/4] x86/xen: use the maximum MFN to calculate the required DMA mask
On 19.11.14 at 17:02, david.vra...@citrix.com wrote: On a Xen PV guest the DMA addresses and physical addresses are not 1:1 (such as Xen PV guests) and the generic dma_get_required_mask() does not return the correct mask (since it uses max_pfn). Some device drivers (such as mptsas, mpt2sas) use dma_get_required_mask() to set the device's DMA mask to allow them to use only 32-bit DMA addresses in hardware structures. This results in unnecessary use of the SWIOTLB if DMA addresses are more than 32-bits, impacting performance significantly. Provide a get_required_mask op that uses the maximum MFN to calculate the DMA mask. Signed-off-by: David Vrabel david.vra...@citrix.com Reviewed-by: Jan Beulich jbeul...@suse.com ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 3/3] x86/HVM: don't crash guest upon problems occurring in user mode
On 20.11.14 at 12:34, t...@xen.org wrote: At 11:13 +0100 on 20 Nov (1416478386), Jan Beulich wrote: This extends commit 5283b310 (x86/HVM: only kill guest when unknown VM exit occurred in guest kernel mode) to further cases, including the failed VM entry one that XSA-110 was needed to be issued for. Signed-off-by: Jan Beulich jbeul...@suse.com This seems like a good idea in general, but I'm not sure it's appropriate for _all_ of these. Unhandled exit types and overlong instruction decode seem obviously good. hvm_hap_nested_page_fault() returns 0: seems only to happen for pvh guests that write to read-only memory (?). That seems like a different class of failure. I don't think our response should be different based on the privilege level here, although domain_crash() does seem harsh. (I presume this is to avoid emulating an instruction in PVH mode?) If we're changing this, I think it should be to #GP rather than #UD. I dropped those for now. We should re-evaluate this once we have #VE support on VMX (i.e. we may want to inject that one there instead). p2m_pt_handle_deferred_changes() returns 0: AFAICS this is basically ENOMEM when trying to update p2m tables. It's so unlikely to be caused by userspace activity that disguising it with #UD is probably just unhelpful. It turns a clean failure into an undebuggable intermittent glitch. Dropped too. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [RFC for 4.6] xen: Extend DOMCTL createdomain to support arch configuration
On 18.11.14 at 20:20, julien.gr...@linaro.org wrote: --- a/xen/arch/x86/setup.c +++ b/xen/arch/x86/setup.c @@ -540,6 +540,7 @@ void __init noreturn __start_xen(unsigned long mbi_p) int i, j, e820_warn = 0, bytes = 0; bool_t acpi_boot_table_init_done = 0; struct domain *dom0; +struct arch_domainconfig config; struct ns16550_defaults ns16550 = { .data_bits = 8, .parity= 'n', @@ -1347,7 +1348,8 @@ void __init noreturn __start_xen(unsigned long mbi_p) domcr_flags |= DOMCRF_pvh | DOMCRF_hap; /* Create initial domain 0. */ -dom0 = domain_create(0, domcr_flags, 0); +memset(config, 0, sizeof(config)); +dom0 = domain_create(0, domcr_flags, 0, config); Why not NULL like almost everywhere else? --- a/xen/include/public/arch-x86/xen.h +++ b/xen/include/public/arch-x86/xen.h @@ -228,6 +228,11 @@ struct arch_shared_info { }; typedef struct arch_shared_info arch_shared_info_t; +struct arch_domainconfig { +char dummy; +}; +typedef struct arch_domainconfig arch_domainconfig_t; I think we decided that, regardless of all bad precedents, we'd stop adding things without xen_ prefix to the public interface. I'm also rather uncertain about all these typedef-s - I think we could very well get away without adding any new ones (if internally to the hypervisor or tools they turn out to make the code much better readable, each such component could still introduce one on its own). Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
On 20.11.14 at 15:40, kevin.t...@intel.com wrote: From: Tian, Kevin Sent: Thursday, November 20, 2014 4:51 PM From: Jan Beulich [mailto:jbeul...@suse.com] Sent: Thursday, November 20, 2014 4:04 PM On 20.11.14 at 08:45, kevin.t...@intel.com wrote: Current option sounds a reasonable one, i.e. passing a list of BDFs assigned to this VM before populating p2m, and then having hypervisor to filter out reserved regions associated with those BDFs. This way libxc teaches Xen to create reserved regions once, and then later the filtered info is returned upon query. Reasonable, but partly redundant. The positive aspect being that it permits this list and the list of actually being assigned devices to be different, i.e. allowing holes to be set up for devices that only _may_ get assigned at some point. redundant if we think the list only exactly matching the statically assigned devices. but that's just current point. reasonable if we think there may be other policies impacting the list (e.g. if hotplugable device may have a config option and then we have a potential list larger than static assigned devices). From this angle I think the new interface actually makes sense for this very purpose. Jan are you OK with this? I can live with it, yes. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [for-xen-4.5 PATCH v2 2/2] dpci: Add ZOMBIE state to allow the softirq to finish with the dpci_pirq.
On 20.11.14 at 20:51, konrad.w...@oracle.com wrote: @@ -669,7 +670,7 @@ static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci) ASSERT(d-arch.hvm_domain.irq.dpci); spin_lock(d-event_lock); -if ( pirq_dpci-state ) +if ( test_and_clear_bool(pirq_dpci-masked) ) { struct pirq *pirq = dpci_pirq(pirq_dpci); const struct dev_intx_gsi_link *digl; So this now guards solely against the timeout enforced EOI? Why do you no longer need to guard against cancellation (i.e. why isn't this looking at both -state and -masked)? Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps
On 21.11.14 at 08:43, kevin.t...@intel.com wrote: From: Chen, Tiejun Sent: Friday, November 21, 2014 2:26 PM On 2014/11/3 18:02, Jan Beulich wrote: On 03.11.14 at 10:55, tiejun.c...@intel.com wrote: On 2014/11/3 17:45, Jan Beulich wrote: On 03.11.14 at 10:32, tiejun.c...@intel.com wrote: On 2014/11/3 16:53, Jan Beulich wrote: On 03.11.14 at 03:22, tiejun.c...@intel.com wrote: On 2014/10/31 16:14, Jan Beulich wrote: On 31.10.14 at 03:20, tiejun.c...@intel.com wrote: On 2014/10/30 17:13, Jan Beulich wrote: On 30.10.14 at 06:55, tiejun.c...@intel.com wrote: On 2014/10/29 17:05, Jan Beulich wrote: On 29.10.14 at 07:54, tiejun.c...@intel.com wrote: Looks I can remove those stuff from util.h and just add 'extern' to them when we really need them. Please stop thinking this way. Declarations for things defined in .c files are to be present in headers, and the defining .c file has to include that header (making sure declaration and definition are and remain in sync). I hate having to again repeat my remark that you shouldn't forget it's not application code that you're modifying. Robust and maintainable code are a requirement in the hypervisor (and, as said it being an extension of it, hvmloader). Which - just to avoid any misunderstanding - isn't to say that this shouldn't also apply to application code. It's just that in the hypervisor and kernel (and certain other code system components) the consequences of being lax are much more severe. Okay. But currently, the pci.c file already include 'util.h' and 'xen/memory.h, #include util.h ... #include xen/memory.h We can't redefine struct xen_reserved_device_memory in util.h. Redefine? I said forward declare. Seems we just need to declare hvm_get_reserved_device_memory_map() in the head file, tools/firmware/hvmloader/util.h, unsigned int hvm_get_reserved_device_memory_map(void); To me this looks very much like poor programming style, even if in the context of hvmloader communicating information via global variables rather than function arguments and return values is Do you mean you don't like a global variable? But it can be use to get RDM without more hypercall or function call in the context of hvmloader. This argument which you brought up before, and which we commented on before, is pretty pointless. We don't really care much about doing one or two more hypercalls from hvmloader, unless these would be long-running ones. Another benefit to use a global variable is that we wouldn't allocate xen_reserved_device_memory * N each time, and reduce some duplicated codes, unless you mean I should define that as static inside in local. Now this reason is indeed worth a consideration. How many times is the data being needed/retrieved? Currently there are two cases in tools/hvmloader, setup pci and build e820 table. Each time, as you know we don't know how may entries we should require, so we always allocate one instance then according to the return value to allocate the proper instances to get that. Hmm, two uses isn't really that bad, i.e. I'd then still be in favor of a more normal interface. Just go back here since I realize we always use mem_alloc(), which is pick from RESERVED_MEMORY, to allocate all buffer inside this hypercall caller in hvmloader, but unfortunately we have no any associated free function implementation in hvmloader, so if we call this multiple times this means it really waster more memory in RESERVED_MEMORY. So I still think one global variable should be fine. it's natural to get reserved information once, and then saved for either one-time or multiple-time references. Not really natural, but possible. Yet if done this way, then the interface shouldn't give the appearance of retrieving it every time, i.e. the global should be set up separately and the users of the data should access the data rather than calling a (fake) function. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] Regression, host crash with 4.5rc1
On 20.11.14 at 21:07, sfl...@ihonk.com wrote: Running with mwait-idle=0 solves (hides?) the problem. Next step is to fiddle with the C states? So this also prompted me to go over the list of errata. Just to confirm - your CPU is family 6 model 44? What stepping? And what nominal frequency? There are a couple potentially relevant errata (BC36, BC38, BC54, BC77, BC110). To exclude BC36, a boot log with apic-verbosity=debug and debug key 'i' output would be necessary. BC38 should not affect us since we don't enter C states from ISRs. BC54 is probably irrelevant since we meanwhile know that your system doesn't really hang hard. For BC77 it would be worth trying to disable turbo mode instead of disabling the mwait-idle driver (xenpm disable-turbo-mode right after boot). And BC110 would be relevant only if without the mwait-idle driver there would be no use of C3. Plus anyway this would more likely end up in a hard hang too. And then, considering that my system with a family 6 model 44 CPU has never shown anything similar (albeit that doesn't mean all that much since our workloads are likely very different), you're not over-clocking? And did you disable hyper-threading on purpose (if so could you check whether enabling it makes a difference)? Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] Buggy interaction of live migration and p2m updates
On 20.11.14 at 19:28, andrew.coop...@citrix.com wrote: Should the guest change the p2m structure during live migration, the toolstack ends up with a stale p2m with a non-p2m frame in the middle, resulting in bogus cross-referencing. Should the guest change an entry in the p2m, the p2m frame itself will be resent as it would be marked as dirty in the logdirty bitmap, but the target pfn will remain unsent and probably stale on the receiving side. MMU_MACHPHYS_UPDATE processing marks the page being changed as dirty. Perhaps guest_physmap_{add,remove}_page() (or certain callers thereof) should do so too? Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 1/4] expand x86 arch_shared_info to support linear p2m list
On 21.11.14 at 14:37, jgr...@suse.com wrote: On 11/21/2014 02:26 PM, Andrew Cooper wrote: On 21/11/14 12:57, Juergen Gross wrote: On 11/21/2014 01:23 PM, Jan Beulich wrote: On 14.11.14 at 10:37, jgr...@suse.com.non-mime.internet wrote: --- a/xen/include/public/arch-x86/xen.h +++ b/xen/include/public/arch-x86/xen.h @@ -224,7 +224,12 @@ struct arch_shared_info { /* Frame containing list of mfns containing list of mfns containing p2m. */ xen_pfn_t pfn_to_mfn_frame_list_list; unsigned long nmi_reason; -uint64_t pad[32]; +/* + * Following two fields are valid if pfn_to_mfn_frame_list_list contains + * ~0UL. + */ +unsigned long p2m_vaddr;/* virtual address of the p2m list */ +unsigned long p2m_as_root; /* mfn of the top level page table */ xen_pfn_t please. And what does the as in the name stand for? as is address space. I can rename it to e.g. p2m_pgd_mfn. That is a linuxism in naming, which is also not accurate. From my understanding, this frame must be an L4 table for 64bit guests, and an L3 table for 32bit guests. I.e. it is effectively a cr3 with which to use the p2m_vaddr field above? Okay, so p2m_cr3 then? I'm okay with that, but would think p2m_pt_root would be more generic. Considering PAE, wouldn't this need to be a 64-bit value then even in the 32-bit interface? Or alternatively be represented as MFN instead? Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] PCI-passthrough for 32 bit guests and high MMIO addresses
On 21.11.14 at 15:39, jgr...@suse.com wrote: Trying to do PCI-passthrough with a 32-bit pv-domain I passed the wrong device to the domain. The MMIO address was too large for a MFN of a 32-bit system (it was 38000320-3800036f). Instead of rejecting the operation Xen tried to perform it resulting in a (quite understandable) failure in the domU. I think either the hypervisor or the tools should refuse to do PCI-passthrough in this case. What's wrong with this large an address? 32-bit PV uses PAE, i.e. can map them. If the kernel isn't capable of that that's not something to make Xen (or the tools) refuse such assignments. I would only see an issue if a hypercall interface involved here isn't using wide enough fields (but these addresses should be read from the BARs, i.e. no hypercall involved). Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] PCI-passthrough for 32 bit guests and high MMIO addresses
On 21.11.14 at 16:01, andrew.coop...@citrix.com wrote: On 21/11/14 14:54, Jan Beulich wrote: On 21.11.14 at 15:39, jgr...@suse.com wrote: Trying to do PCI-passthrough with a 32-bit pv-domain I passed the wrong device to the domain. The MMIO address was too large for a MFN of a 32-bit system (it was 38000320-3800036f). Instead of rejecting the operation Xen tried to perform it resulting in a (quite understandable) failure in the domU. I think either the hypervisor or the tools should refuse to do PCI-passthrough in this case. What's wrong with this large an address? It is wider than 44 bits, so doesn't fit in a 32bit pfn for p2m/m2p update operations. MMIO regions don't go into these tables. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] PCI-passthrough for 32 bit guests and high MMIO addresses
On 21.11.14 at 16:17, jgr...@suse.com wrote: On 11/21/2014 03:54 PM, Jan Beulich wrote: On 21.11.14 at 15:39, jgr...@suse.com wrote: Trying to do PCI-passthrough with a 32-bit pv-domain I passed the wrong device to the domain. The MMIO address was too large for a MFN of a 32-bit system (it was 38000320-3800036f). Instead of rejecting the operation Xen tried to perform it resulting in a (quite understandable) failure in the domU. I think either the hypervisor or the tools should refuse to do PCI-passthrough in this case. What's wrong with this large an address? 32-bit PV uses PAE, i.e. can map them. If the kernel isn't capable of that that's not something to make Xen (or the tools) refuse such assignments. I would only see an issue if a hypercall interface involved here isn't using wide enough fields (but these addresses should be read from the BARs, i.e. no hypercall involved). The MFN format is part of the pv-ABI. And a MFN of a 32-bit pv-guest is only 32 bits (even if don't take the invalid bit into account). Should a pv-guest really be capable to map an address outside it's accessible MFN-range? For MMIO, why not? All that counts there is the page table entry format. Are the tools capable of processing such a mapping in case of saving the domain? MMIO isn't subject to saving/restoring. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [for-xen-4.5 PATCH v2 2/2] dpci: Add ZOMBIE state to allow the softirq to finish with the dpci_pirq.
On 21.11.14 at 16:14, konrad.w...@oracle.com wrote: On Fri, Nov 21, 2014 at 01:03:43PM +0100, Jan Beulich wrote: On 21.11.14 at 12:50, konrad.w...@oracle.com wrote: On November 21, 2014 2:51:33 AM EST, Jan Beulich jbeul...@suse.com wrote: On 20.11.14 at 20:51, konrad.w...@oracle.com wrote: @@ -669,7 +670,7 @@ static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci) ASSERT(d-arch.hvm_domain.irq.dpci); spin_lock(d-event_lock); -if ( pirq_dpci-state ) +if ( test_and_clear_bool(pirq_dpci-masked) ) { struct pirq *pirq = dpci_pirq(pirq_dpci); const struct dev_intx_gsi_link *digl; So this now guards solely against the timeout enforced EOI? Why do you no longer need to guard against cancellation (i.e. why isn't this looking at both -state and -masked)? The previous state check was superfluous as the dpci_softirq would check for the valid STATE_ before calling hvm_dirq_assist (and deal with cancellation). I thought so first too, but then how is the guarding against cancellation working now? Since there are two forms of cancellation, lets consider each one seperatly: 1). pt_irq_create_bind and pt_irq_destroy_bind. Each of them calling pt_pirq_softirq_reset in the 'error' case or in the normal path of destruction. When that happens the action handler for the IRQ is removed the moment we call pirq_guest_unbind. As such we only have to deal with the potential outstanding pirq_dpci waiting to be serviced. Since we did the 'pt_pirq_softirq_reset' we have changed the state from STATE_SCHED to zero. That means dpci_softirq will NOT go in: if ( test_and_clear_bit(STATE_SCHED, pirq_dpci-state) ) Unless the flag got set again in the meantime. Since the event lock is being acquired right before the line quoted above, _that_ is what needs closely looking at (and why I was asking about the deletion of the [at the first glance unnecessary] checking of -state in hvm_dirq_assist()). Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 01/19] xen: dump vNUMA information with debug key u
On 21.11.14 at 16:06, wei.l...@citrix.com wrote: Signed-off-by: Elena Ufimsteva ufimts...@gmail.com Signed-off-by: Wei Liu wei.l...@citrix.com Cc: Jan Beulich jbeul...@suse.com --- xen/arch/x86/numa.c | 46 +- 1 file changed, 45 insertions(+), 1 deletion(-) diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c index 628a40a..d27c30f 100644 --- a/xen/arch/x86/numa.c +++ b/xen/arch/x86/numa.c @@ -363,10 +363,12 @@ EXPORT_SYMBOL(node_data); static void dump_numa(unsigned char key) { s_time_t now = NOW(); -int i; +int i, j, err, n; unsigned int please for all but err. @@ -408,6 +410,48 @@ static void dump_numa(unsigned char key) for_each_online_node ( i ) printk(Node %u: %u\n, i, page_num_node[i]); + +if ( !d-vnuma ) +continue; + +vnuma = d-vnuma; +printk( %u vnodes, %u vcpus\n, vnuma-nr_vnodes, d-max_vcpus); +for ( i = 0; i vnuma-nr_vnodes; i++ ) +{ +err = snprintf(keyhandler_scratch, 12, %u, +vnuma-vnode_to_pnode[i]); +if ( err 0 || vnuma-vnode_to_pnode[i] == NUMA_NO_NODE ) +snprintf(keyhandler_scratch, 3, ???); strcpy() would be much cheaper here. + +printk( vnode %3u - pnode %s\n, i, keyhandler_scratch); +for ( j = 0; j vnuma-nr_vmemranges; j++ ) +{ +if ( vnuma-vmemrange[j].nid == i ) +{ +mem = vnuma-vmemrange[j].end - vnuma-vmemrange[j].start; +printk(%PRIu64 MB: , mem 20); +printk( 0x%PRIx64 - 0x%PRIx64\n, + vnuma-vmemrange[j].start, + vnuma-vmemrange[j].end); Where possible please don't split printk()s of a single output line. Also %# instead of 0x% please (but maybe padding the values so the align properly would be the better change; that would also eliminate the need for explicit leading spaces). +} +} + +printk(vcpus: ); +for ( j = 0, n = 0; j d-max_vcpus; j++ ) +{ +if ( vnuma-vcpu_to_vnode[j] == i ) +{ +if ( (n + 1) % 8 == 0 ) +printk(%d\n, j); +else if ( !(n % 8) n != 0 ) +printk(%d , j); +else +printk(%d , j); Same here - left padding them to make them aligned will likely make the result quite a bit easier to grok especially for large guests. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 00/19] Virtual NUMA for PV and HVM
On 21.11.14 at 17:35, wei.l...@citrix.com wrote: On Fri, Nov 21, 2014 at 04:25:34PM +, Jan Beulich wrote: On 21.11.14 at 16:06, wei.l...@citrix.com wrote: vnuma_vdistances = [10, 30] # optional Being optional, would the real distances be used instead? And what Default value of [10, 20] will be used. That's bad. Would it be very difficult to use the host values? meaning does this apparently one dimensional array here have for the actually two dimensional SLIT? (Read: An example with more than two nodes would be useful.) The first element of [X, Y] is local distance, the second element is remote distance. For a 4 node system: 0123 0XYYY 1YXYY 2YYXY 3YYYX That may match up with how most current NUMA systems look like, but do we really want to bake in an oversimplification like this? Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 00/19] Virtual NUMA for PV and HVM
On 21.11.14 at 17:55, wei.l...@citrix.com wrote: Nonetheless I'm all for having a configuration option that would meet both present and future need. Do you have anything in mind? Are you suggesting we should allow specifying every element in SLIT in xl? I think that would be desirable. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH] xen-pciback: drop SR-IOV VFs when PF driver unloads
On 21.11.14 at 23:03, konrad.w...@oracle.com wrote: I rewrote it a bit to be more in the style of pciback: [...] [v2: Removed the switch statement, moved it about] What you don't mention here is that you also removed the outer loop, yet that breaks functionality afaict: There can (and I suppose normally will) be multiple VFs needing device_release_driver() called when a single PF goes away. Also I'm not really happy for a patch with my S-o-b underneath to introduce goto-s without real need. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] Regression, host crash with 4.5rc1
On 23.11.14 at 02:28, sfl...@ihonk.com wrote: With mwait-idle=0: (XEN) 'c' pressed - printing ACPI Cx structures (XEN) ==cpu0== (XEN) active state: C0 (XEN) max_cstate: C7 (XEN) states: (XEN) C1: type[C1] latency[001] usage[] method[ FFH] duration[0] (XEN) C2: type[C0] latency[000] usage[] method[ NONE] duration[0] (XEN) C3: type[C3] latency[064] usage[] method[ FFH] duration[0] (XEN) C4: type[C3] latency[096] usage[] method[ FFH] duration[0] (XEN)*C0: usage[] duration[46930624784] (XEN) PC2[0] PC3[0] PC6[0] PC7[0] (XEN) CC3[0] CC6[0] CC7[0] [...] Very interesting - the hypervisor has C-state information, but never entered any of them. That certainly explains the difference between using/not using the ,wait-idle driver, but puts us back to there being a more general issue with C-state use on this CPU model. Possibly related to C2 having entry method NONE, but then again I can't see how such a state could get entered into the table the first place: set_cx() bails upon check_cx() returning an error, and hence its switch()'s default statement should never be reached. Plus even if an array entry was set to NONE, it should simply be ignored when looking for a state to enter. I'll probably need to put together a debugging patch to figure out what's going on here. In any event C2 being set to NONE and that information presumably coming from firmware is an indication that there's a problem with C2 (note that the numbering doesn't really match up with what the document says, this likely really is C1E) on that CPU. Which gets us back to ... CPU information for one of the cores, 2.8 GHz is nominal, stepping is 2. Not sure how to translate that stepping number into Intel's format: processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 44 model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz stepping: 2 [...] There are a couple potentially relevant errata (BC36, BC38, BC54, BC77, BC110). To exclude BC36, a boot log with apic-verbosity=debug and debug key 'i' output would be necessary. Done, see the very end of the email. BC38 should not affect us since we don't enter C states from ISRs. BC54 is probably irrelevant since we meanwhile know that your system doesn't really hang hard. For BC77 it would be worth trying to disable turbo mode instead of disabling the mwait-idle driver (xenpm disable-turbo-mode right after boot). I looked up BC77 but as a result found this document[1], which seems to relate to the i7. Would this[2] not be the relevant document? [1] http://www.intel.com/content/dam/www/public/us/en/documents/specification-upd ates/core-i7-900-ee-and-desktop-processor-series-32nm-spec-update.pdf [2] http://www.intel.com/content/dam/www/public/us/en/documents/specification-upd ates/xeon-5600-specification-update.pdf Indeed. I wasn't aware that there are family/model/stepping tuples that can be both Xeon and desktop CPUs. As promised, below is the apic-verbosity=debug log, with 'i'. Thanks! I'm sorry, I misspelled the option, it's really apic_verbosity=debug. The 'i' output at least already confirms that there are no ExtINT entries among the IO-APIC RTEs. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [for-xen-4.5 PATCH v2 2/2] dpci: Add ZOMBIE state to allow the softirq to finish with the dpci_pirq.
On 21.11.14 at 17:45, konrad.w...@oracle.com wrote: From 90d00db0949a8e796d7f812134753a54b2fe3d51 Mon Sep 17 00:00:00 2001 From: Konrad Rzeszutek Wilk konrad.w...@oracle.com Date: Thu, 20 Nov 2014 14:28:11 -0500 Subject: [PATCH] dpci: Add 'masked' as an gate for hvm_dirq_assist to process. So am I right in understanding that this is then the only thing that needs to go into the tree to fix Sander's problem? No zombie state anymore? Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 14/19] hvmloader: disallow memory relocation when vNUMA is enabled
On 21.11.14 at 20:56, konrad.w...@oracle.com wrote: On Fri, Nov 21, 2014 at 03:06:56PM +, Wei Liu wrote: --- a/tools/firmware/hvmloader/pci.c +++ b/tools/firmware/hvmloader/pci.c @@ -88,6 +88,19 @@ void pci_setup(void) printf(Relocating guest memory for lowmem MMIO space %s\n, allow_memory_relocate?enabled:disabled); +/* Disallow low memory relocation when vNUMA is enabled, because + * relocated memory ends up off node. Further more, even if we + * dynamically expand node coverage in hvmloader, low memory and + * high memory may reside in different physical nodes, blindly + * relocates low memory to high memory gives us a sub-optimal + * configuration. And this is done in hvmloader, so the toolstack has no inkling that we need to relocate memory to make space for the PCI. In such case I would not have this check here. Instead put it in libxl and disallow vNUMA with PCI passthrough. And then the fix is to take the logic that is in hvmloader for PCI BAR size relocation and move it in libxl. Then it can construct the proper vNUMA topology and also fix an outstanding QEMU-xen bug. The problem then being that two code pieces pretty far apart from one another need to be kept in perfect sync. Not really nice in terms of maintainability. I'd really prefer hvmloader to re-write the vNUMA info (on real hardware firmware also needs to take care of the memory holes - there's no magic external entity there either). Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 14/19] hvmloader: disallow memory relocation when vNUMA is enabled
On 21.11.14 at 16:06, wei.l...@citrix.com wrote: Signed-off-by: Wei Liu wei.l...@citrix.com So this is the fourth patch now without any description whatsoever. --- a/tools/firmware/hvmloader/pci.c +++ b/tools/firmware/hvmloader/pci.c @@ -88,6 +88,19 @@ void pci_setup(void) printf(Relocating guest memory for lowmem MMIO space %s\n, allow_memory_relocate?enabled:disabled); +/* Disallow low memory relocation when vNUMA is enabled, because + * relocated memory ends up off node. Further more, even if we + * dynamically expand node coverage in hvmloader, low memory and + * high memory may reside in different physical nodes, blindly + * relocates low memory to high memory gives us a sub-optimal + * configuration. + */ +if ( hvm_info-nr_nodes != 0 allow_memory_relocate ) +{ +allow_memory_relocate = false; +printf(vNUMA enabled, relocating guest memory for lowmem MMIO space disabled\n); +} Apart from the comment violating our coding style, as already indicated in the reply to Konrad's comment I don't think this is the right approach. If it is meant to be a temporary measure, the comment should say so (and perhaps have a TBD or similar grep- able mark in it). Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] Regression, host crash with 4.5rc1
On 24.11.14 at 10:08, sfl...@ihonk.com wrote: On Nov 24, 2014, at 00:45, Jan Beulich jbeul...@suse.com wrote: On 23.11.14 at 02:28, sfl...@ihonk.com wrote: With mwait-idle=0: (XEN) 'c' pressed - printing ACPI Cx structures (XEN) ==cpu0== (XEN) active state: C0 (XEN) max_cstate: C7 (XEN) states: (XEN) C1: type[C1] latency[001] usage[] method[ FFH] duration[0] (XEN) C2: type[C0] latency[000] usage[] method[ NONE] duration[0] (XEN) C3: type[C3] latency[064] usage[] method[ FFH] duration[0] (XEN) C4: type[C3] latency[096] usage[] method[ FFH] duration[0] (XEN)*C0: usage[] duration[46930624784] (XEN) PC2[0] PC3[0] PC6[0] PC7[0] (XEN) CC3[0] CC6[0] CC7[0] [...] Very interesting - the hypervisor has C-state information, but never entered any of them. That certainly explains the difference between using/not using the ,wait-idle driver, but puts us back to there being a more general issue with C-state use on this CPU model. Possibly related to C2 having entry method NONE, but then again I can't see how such a state could get entered into the table the first place: set_cx() bails upon check_cx() returning an error, and hence its switch()'s default statement should never be reached. Plus even if an array entry was set to NONE, it should simply be ignored when looking for a state to enter. I'll probably need to put together a debugging patch to figure out what's going on here. Okay, happy to give it a go whenever you have the time to put something together. While putting this together I found the reason for the strange C2: line, and the attached debugging patch already has the fix for it (which I'll also submit separately, and hence you may need to drop that specific hunk should you end up applying it on a tree which already has that fix). You'll need to again run with mwait-idle=0, and it's the boot messages along with the 'c' debug key output that's of interest. Thanks, Jan --- unstable.orig/xen/arch/x86/acpi/cpu_idle.c +++ unstable/xen/arch/x86/acpi/cpu_idle.c @@ -58,7 +58,7 @@ #include xen/notifier.h #include xen/cpu.h -/*#define DEBUG_PM_CX*/ +#define DEBUG_PM_CX #define GET_HW_RES_IN_NS(msr, val) \ do { rdmsrl(msr, val); val = tsc_ticks2ns(val); } while( 0 ) @@ -238,6 +238,9 @@ static char* acpi_cstate_method_name[] = HALT }; +struct reasons { unsigned long max, pwr, urg, nxt; };//temp +static DEFINE_PER_CPU(struct reasons, reasons);//temp + static void print_acpi_power(uint32_t cpu, struct acpi_processor_power *power) { uint32_t i, idle_usage = 0; @@ -273,6 +276,8 @@ static void print_acpi_power(uint32_t cp printk((last_state_idx == 0) ?* : ); printk(C0:\tusage[%08d] duration[%PRId64]\n, idle_usage, NOW() - idle_res); +printk(max=%lx pwr=%lx urg=%lx nxt=%lx\n,//temp + per_cpu(reasons.max, cpu), per_cpu(reasons.pwr, cpu), per_cpu(reasons.urg, cpu), per_cpu(reasons.nxt, cpu)); print_hw_residencies(cpu); } @@ -501,6 +506,7 @@ static void acpi_processor_idle(void) u32 exp = 0, pred = 0; u32 irq_traced[4] = { 0 }; +next_state = 1;//temp if ( max_cstate 0 power !sched_has_urgent_vcpu() (next_state = cpuidle_current_governor-select(power)) 0 ) { @@ -519,6 +525,10 @@ static void acpi_processor_idle(void) } if ( !cx ) { +this_cpu(reasons.max) += max_cstate = 0;//temp +this_cpu(reasons.pwr) += !power;//temp +this_cpu(reasons.urg) += !!sched_has_urgent_vcpu();//temp +this_cpu(reasons.nxt) += next_state = 0;//temp if ( pm_idle_save ) pm_idle_save(); else @@ -1007,6 +1017,7 @@ static void set_cx( cx-entry_method = ACPI_CSTATE_EM_SYSIO; break; default: +printk(CPU%u: C%u space %x?\n, acpi_power-cpu, xen_cx-type, xen_cx-reg.space_id);//temp cx-entry_method = ACPI_CSTATE_EM_NONE; break; } @@ -1015,7 +1026,7 @@ static void set_cx( cx-target_residency = cx-latency * latency_factor; smp_wmb(); -acpi_power-count++; +acpi_power-count += (cx-type != ACPI_STATE_C1); if ( cx-type == ACPI_STATE_C1 || cx-type == ACPI_STATE_C2 ) acpi_power-safe_state = cx; } @@ -1141,6 +1152,7 @@ long set_cx_pminfo(uint32_t cpu, struct /* FIXME: C-state dependency is not supported by far */ +printk(CPU%u: %pS, %pS\n, cpu, pm_idle_save, pm_idle);//temp if ( cpu_id == 0 ) { if ( pm_idle_save == NULL ) ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 0/3] x86: XSA-109/110 follow-up (to be considered for 4.5)
On 20.11.14 at 11:04, jbeul...@suse.com wrote: 1: tighten page table owner checking in do_mmu_update() 2: don't ignore foreigndom input on various MMUEXT ops 3: HVM: don't crash guest upon problems occurring in user mode Reason to request considering this for 4.5: Tightened argument checking (as done in the first two patches) reduces the chances of them getting abused. Not unduly crashing the guest (as done in the third one) may avoid future security issues of guest user mode affecting the guest kernel. Signed-off-by: Jan Beulich jbeul...@suse.com Hi Konrad - looks like I forgot to Cc you... Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 2/3] x86: don't ignore foreigndom input on various MMUEXT ops
On 24.11.14 at 13:43, dunl...@umich.edu wrote: On Thu, Nov 20, 2014 at 10:12 AM, Jan Beulich jbeul...@suse.com wrote: Instead properly fail requests that shouldn't be issued on foreign domains or - for MMUEXT_{CLEAR,COPY}_PAGE - extend the existing operation to work that way. I take it this is for 4.6? Not really, as said in the cover letter. I've looked through it and everything looks OK. But I agree with Tim, that having so many different changes all at the same time makes the patch hard to review. In particular, I'd rather start with a patch to get rid of okay entirely; then make MMUEXT_{CLEAR,COPY}_PAGE use foreingndom instead of current; then have a patch which returns -EPERM for the other ones; then a patch to get rid of spage in MMUEXT_[UN]MARK_SUPER. Yes, that's how I would have done it following Tim's comments if there wasn't the desire for backporting - that'll become quite a bit more involved if I do the cleanup patch removing okay first. And doing the -EPERM one first would mean that the backport needs to carefully add properly setting okay. Splitting the clear/copy page change from the -EPERM one doesn't make much sense to me at all considering the title (and hence purpose) of the patch. Regarding MMUEXT_{CLEAR,COPY}_PAGE: This is effectively changing the interface. Are we sure there are no callers which just expect them to work on current, and don't set foreigndom properly? As much or as little as all of the sudden returning -EPERM in such a case for other sub-ops. They never were meant to be used that way, and the original omission of those checks shouldn't preclude us from adding them. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] Failure on make clean
On 25.11.14 at 08:44, jgr...@suse.com wrote: Hi, make clean in xen-unstable is failing: make[2]: Entering directory '/home/gross/xen/tools' set -e; if test -d qemu-xen-traditional-dir/.; then \ make -C qemu-xen-traditional-dir clean; \ fi make[3]: Entering directory '/home/gross/xen/tools/qemu-xen-traditional-dir-remote' Makefile:3: config-host.mak: No such file or directory SRC_PATH gets defined there. This file missing makes me wonder whether you ran make clean without a prior make. If so, perhaps the test in tools/Makefile should be altered: subdir-clean-qemu-xen-traditional-dir: set -e; if test -e qemu-xen-traditional-dir/config-host.mak; then \ $(MAKE) -C qemu-xen-traditional-dir clean; \ fi (possibly in a similar way for subdir-clean-qemu-xen-dir), albeit that may end up leaving an unclean tree when having interrupted qemu's configure process (not sure in which order it creates files). Jan Makefile:4: /rules.mak: No such file or directory make[3]: *** No rule to make target '/rules.mak'. Stop. make[3]: Leaving directory '/home/gross/xen/tools/qemu-xen-traditional-dir-remote' Makefile:181: recipe for target 'subdir-clean-qemu-xen-traditional-dir' failed make[2]: *** [subdir-clean-qemu-xen-traditional-dir] Error 2 make[2]: Leaving directory '/home/gross/xen/tools' /home/gross/xen/tools/../tools/Rules.mk:111: recipe for target 'subdirs-clean' failed make[1]: *** [subdirs-clean] Error 2 make[1]: Leaving directory '/home/gross/xen/tools' Makefile:139: recipe for target 'clean' failed make: *** [clean] Error 2 I think SRC_PATH is somehow undefined... Juergen ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH for-4.5] x86/HVM: Partial revert of 28b4baacd5
On 25.11.14 at 11:08, andrew.coop...@citrix.com wrote: A failed vmentry is overwhelmingly likely to be caused by corrupt VMCS state. As a result, injecting a fault and retrying the the vmentry is likely to fail in the same way. That's not all that unlikely - remember that the change was prompted by the XSA-110 fix. There CS pieces being in a bad state would get corrected by the exception injection. One other alternative, which I would pursue if we were not already in -rc2 would be to add some extra logic to detect repeated vmentry failure and allow one attempt to shoot userspace before giving up and crashing the domain. That's not even needed afaict (and if it really is, it can't be all that difficult/intrusive): Did you observe what you attempt to fix here in practice, or is this just from theoretical considerations? I ask because I don't think it can actually happen, as the second time we get here the guest ought to be in kernel mode (due to the exception injection) and hence would get crashed anyway. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] Regression, host crash with 4.5rc1
On 25.11.14 at 10:38, sfl...@ihonk.com wrote: On 11/25/2014 12:16 AM, Jan Beulich wrote: Interesting, so other than for me (perhaps due to other patches I have in my tree) the change resulted in C states now being used again despite mwait-idle=0, which is good. Question now is - with this being the case, did the hangs re-occur? Unfortunately they did. (Happened unusually quick this time, though I doubt the statistical significance.) Not sure what the desirable output is, so I did a couple of 'a' and 'd' requests, capped off by a 'c': Okay, so it's not really the mwait-idle driver causing the regression, but it is C-state related. Hence we're now down to seeing whether all or just the deeper C states are affected, i.e. I now need to ask you to play with max_cstate=. For that you'll have to remember that the option's effect differs between the ACPI and the MWAIT idle drivers. In the spirit of bisection I'd suggest using max_cstate=2 first no matter which of the two scenarios you pick. If that still hangs, max_cstate=1 obviously is the only other thing to try. Should that not hang (and you left out mwait-idle=0), trying max_cstate=3 in that same scenario would be the other case to check. No need for 'd' and 'a' output for the time being, but 'c' output would be much appreciated for all cases where you observe hangs. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH for-4.5] x86/HVM: Partial revert of 28b4baacd5
On 25.11.14 at 11:46, andrew.coop...@citrix.com wrote: On 25/11/14 10:42, Jan Beulich wrote: On 25.11.14 at 11:08, andrew.coop...@citrix.com wrote: A failed vmentry is overwhelmingly likely to be caused by corrupt VMCS state. As a result, injecting a fault and retrying the the vmentry is likely to fail in the same way. That's not all that unlikely - remember that the change was prompted by the XSA-110 fix. There CS pieces being in a bad state would get corrected by the exception injection. One other alternative, which I would pursue if we were not already in -rc2 would be to add some extra logic to detect repeated vmentry failure and allow one attempt to shoot userspace before giving up and crashing the domain. That's not even needed afaict (and if it really is, it can't be all that difficult/intrusive): Did you observe what you attempt to fix here in practice, or is this just from theoretical considerations? I ask because I don't think it can actually happen, as the second time we get here the guest ought to be in kernel mode (due to the exception injection) and hence would get crashed anyway. Only from theoretical considerations. A bad CS (and possibly SS) would be fixed by this, but there are many others which wouldn't But that doesn't eliminate the fact that in the second pass we'd find the guest in kernel mode, and hence crash it. Yet your reply sounds as if you still think your patch is needed. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH for-4.5] x86/HVM: Partial revert of 28b4baacd5
On 25.11.14 at 11:58, andrew.coop...@citrix.com wrote: On 25/11/14 10:46, Andrew Cooper wrote: On 25/11/14 10:42, Jan Beulich wrote: On 25.11.14 at 11:08, andrew.coop...@citrix.com wrote: A failed vmentry is overwhelmingly likely to be caused by corrupt VMCS state. As a result, injecting a fault and retrying the the vmentry is likely to fail in the same way. That's not all that unlikely - remember that the change was prompted by the XSA-110 fix. There CS pieces being in a bad state would get corrected by the exception injection. One other alternative, which I would pursue if we were not already in -rc2 would be to add some extra logic to detect repeated vmentry failure and allow one attempt to shoot userspace before giving up and crashing the domain. That's not even needed afaict (and if it really is, it can't be all that difficult/intrusive): Did you observe what you attempt to fix here in practice, or is this just from theoretical considerations? I ask because I don't think it can actually happen, as the second time we get here the guest ought to be in kernel mode (due to the exception injection) and hence would get crashed anyway. Only from theoretical considerations. A bad CS (and possibly SS) would be fixed by this, but there are many others which wouldn't Actually, as Tim correctly points out, a bad CS/SS won't be fixed by this without emulating the event injection. Per the XSA-106 followup, we only ever emulate enough of event injection to cover the dpl checks on software events for older generation SVM. We never actually emulate the context switch itself. Which suggests that rather than doing the partial revert as you propose we might better extend the check to become kernel mode or event injection pending. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH for-4.5] x86/HVM: Partial revert of 28b4baacd5
On 25.11.14 at 13:10, andrew.coop...@citrix.com wrote: On 25/11/14 11:31, Jan Beulich wrote: On 25.11.14 at 11:58, andrew.coop...@citrix.com wrote: On 25/11/14 10:46, Andrew Cooper wrote: On 25/11/14 10:42, Jan Beulich wrote: On 25.11.14 at 11:08, andrew.coop...@citrix.com wrote: A failed vmentry is overwhelmingly likely to be caused by corrupt VMCS state. As a result, injecting a fault and retrying the the vmentry is likely to fail in the same way. That's not all that unlikely - remember that the change was prompted by the XSA-110 fix. There CS pieces being in a bad state would get corrected by the exception injection. One other alternative, which I would pursue if we were not already in -rc2 would be to add some extra logic to detect repeated vmentry failure and allow one attempt to shoot userspace before giving up and crashing the domain. That's not even needed afaict (and if it really is, it can't be all that difficult/intrusive): Did you observe what you attempt to fix here in practice, or is this just from theoretical considerations? I ask because I don't think it can actually happen, as the second time we get here the guest ought to be in kernel mode (due to the exception injection) and hence would get crashed anyway. Only from theoretical considerations. A bad CS (and possibly SS) would be fixed by this, but there are many others which wouldn't Actually, as Tim correctly points out, a bad CS/SS won't be fixed by this without emulating the event injection. Per the XSA-106 followup, we only ever emulate enough of event injection to cover the dpl checks on software events for older generation SVM. We never actually emulate the context switch itself. Which suggests that rather than doing the partial revert as you propose we might better extend the check to become kernel mode or event injection pending. At that point, it is safer just to unconditionally crash on a repeated vmentry failure, rather than gain a list of conditions which we hope wont leave us spinning in a loop. Crashing on _repeated_ VM entry failure is certainly fine with me. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH] vNUMA: rename interface structures
No-one (including me) paid attention during review that these structures don't adhere to the naming requirements of the public interface: Consistently use xen_ prefixes at least for all new additions. Signed-off-by: Jan Beulich jbeul...@suse.com --- a/tools/libxc/include/xenctrl.h +++ b/tools/libxc/include/xenctrl.h @@ -1264,7 +1264,7 @@ int xc_domain_setvnuma(xc_interface *xch uint32_t nr_vnodes, uint32_t nr_regions, uint32_t nr_vcpus, -vmemrange_t *vmemrange, +xen_vmemrange_t *vmemrange, unsigned int *vdistance, unsigned int *vcpu_to_vnode, unsigned int *vnode_to_pnode); --- a/tools/libxc/xc_domain.c +++ b/tools/libxc/xc_domain.c @@ -2171,7 +2171,7 @@ int xc_domain_setvnuma(xc_interface *xch uint32_t nr_vnodes, uint32_t nr_vmemranges, uint32_t nr_vcpus, - vmemrange_t *vmemrange, + xen_vmemrange_t *vmemrange, unsigned int *vdistance, unsigned int *vcpu_to_vnode, unsigned int *vnode_to_pnode) --- a/xen/common/domctl.c +++ b/xen/common/domctl.c @@ -345,7 +345,7 @@ static struct vnuma_info *vnuma_alloc(un vnuma-vdistance = xmalloc_array(unsigned int, nr_vnodes * nr_vnodes); vnuma-vcpu_to_vnode = xmalloc_array(unsigned int, nr_vcpus); vnuma-vnode_to_pnode = xmalloc_array(unsigned int, nr_vnodes); -vnuma-vmemrange = xmalloc_array(vmemrange_t, nr_ranges); +vnuma-vmemrange = xmalloc_array(xen_vmemrange_t, nr_ranges); if ( vnuma-vdistance == NULL || vnuma-vmemrange == NULL || vnuma-vcpu_to_vnode == NULL || vnuma-vnode_to_pnode == NULL ) --- a/xen/common/memory.c +++ b/xen/common/memory.c @@ -972,7 +972,7 @@ long do_memory_op(unsigned long cmd, XEN case XENMEM_get_vnumainfo: { -struct vnuma_topology_info topology; +struct xen_vnuma_topology_info topology; struct domain *d; unsigned int dom_vnodes, dom_vranges, dom_vcpus; struct vnuma_info tmp; @@ -1033,7 +1033,7 @@ long do_memory_op(unsigned long cmd, XEN read_unlock(d-vnuma_rwlock); tmp.vdistance = xmalloc_array(unsigned int, dom_vnodes * dom_vnodes); -tmp.vmemrange = xmalloc_array(vmemrange_t, dom_vranges); +tmp.vmemrange = xmalloc_array(xen_vmemrange_t, dom_vranges); tmp.vcpu_to_vnode = xmalloc_array(unsigned int, dom_vcpus); if ( tmp.vdistance == NULL || --- a/xen/include/public/domctl.h +++ b/xen/include/public/domctl.h @@ -980,7 +980,7 @@ struct xen_domctl_vnuma { /* * memory rages for each vNUMA node */ -XEN_GUEST_HANDLE_64(vmemrange_t) vmemrange; +XEN_GUEST_HANDLE_64(xen_vmemrange_t) vmemrange; }; typedef struct xen_domctl_vnuma xen_domctl_vnuma_t; DEFINE_XEN_GUEST_HANDLE(xen_domctl_vnuma_t); --- a/xen/include/public/memory.h +++ b/xen/include/public/memory.h @@ -530,14 +530,13 @@ DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_ #define XENMEM_get_vnumainfo26 /* vNUMA node memory ranges */ -struct vmemrange { +struct xen_vmemrange { uint64_t start, end; unsigned int flags; unsigned int nid; }; - -typedef struct vmemrange vmemrange_t; -DEFINE_XEN_GUEST_HANDLE(vmemrange_t); +typedef struct xen_vmemrange xen_vmemrange_t; +DEFINE_XEN_GUEST_HANDLE(xen_vmemrange_t); /* * vNUMA topology specifies vNUMA node number, distance table, @@ -548,7 +547,7 @@ DEFINE_XEN_GUEST_HANDLE(vmemrange_t); * copied back to guest. Domain returns expected values of nr_vnodes, * nr_vmemranges and nr_vcpus to guest if the values where incorrect. */ -struct vnuma_topology_info { +struct xen_vnuma_topology_info { /* IN */ domid_t domid; uint16_t pad; @@ -566,12 +565,12 @@ struct vnuma_topology_info { uint64_t pad; } vcpu_to_vnode; union { -XEN_GUEST_HANDLE(vmemrange_t) h; +XEN_GUEST_HANDLE(xen_vmemrange_t) h; uint64_t pad; } vmemrange; }; -typedef struct vnuma_topology_info vnuma_topology_info_t; -DEFINE_XEN_GUEST_HANDLE(vnuma_topology_info_t); +typedef struct xen_vnuma_topology_info xen_vnuma_topology_info_t; +DEFINE_XEN_GUEST_HANDLE(xen_vnuma_topology_info_t); /* Next available subop number is 27 */ --- a/xen/include/xen/domain.h +++ b/xen/include/xen/domain.h @@ -100,7 +100,7 @@ struct vnuma_info { unsigned int *vdistance; unsigned int *vcpu_to_vnode; unsigned int *vnode_to_pnode; -struct vmemrange *vmemrange; +struct xen_vmemrange *vmemrange; }; void vnuma_destroy(struct vnuma_info *vnuma); vNUMA: rename interface structures No-one (including me) paid attention during review that these structures don't adhere to the naming requirements of the public interface: Consistently use xen_ prefixes
Re: [Xen-devel] [PATCH v15 09/21] x86/VPMU: Add public xenpmu.h
On 17.11.14 at 00:07, boris.ostrov...@oracle.com wrote: --- a/xen/include/xlat.lst +++ b/xen/include/xlat.lst @@ -103,6 +103,10 @@ !vcpu_set_singleshot_timer vcpu.h ?xenoprof_init xenoprof.h ?xenoprof_passivexenoprof.h +?pmu_intel_ctxt arch-x86/pmu.h +?pmu_amd_ctxtarch-x86/pmu.h +?pmu_cntr_pair arch-x86/pmu.h +?pmu_regsarch-x86/pmu.h If another revision turns out necessary, please get these additions sorted alphabetically. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v15 11/21] x86/VPMU: Interface for setting PMU mode and flags
On 17.11.14 at 00:07, boris.ostrov...@oracle.com wrote: @@ -278,3 +290,139 @@ void vpmu_dump(struct vcpu *v) vpmu-arch_vpmu_ops-arch_vpmu_dump(v); } +static s_time_t vpmu_start_ctxt_switch; Blank line here please. +static long vpmu_sched_checkin(void *arg) +{ +int cpu = cpumask_next(smp_processor_id(), cpu_online_map); unsigned int. +int ret; + +/* Mode change shouldn't take more than a few (say, 5) seconds. */ +if ( NOW() vpmu_start_ctxt_switch + SECONDS(5) ) +{ +ret = -EBUSY; +goto fail; +} So what does this guard against? Holding xenpmu_mode_lock for 5 seconds is not a problem, plus I don't think you actually need a lock at all. Just using a global, atomically updated flag ought to be sufficient (the way you use the lock is really nothing else afaict). + +if ( cpu nr_cpu_ids ) +{ +ret = continue_hypercall_on_cpu(cpu, vpmu_sched_checkin, arg); +if ( ret ) +goto fail; +} +else +vpmu_start_ctxt_switch = 0; + +return 0; + + fail: +vpmu_mode = (uint64_t)arg; Even if we don't support x86-32 anymore, please avoid giving bad examples: Casts between pointers and integers should always use (unsigned) long on the integer side. +vpmu_start_ctxt_switch = 0; +return ret; +} + +static int vpmu_force_context_switch(uint64_t old_mode) Same comment as above regarding the type. +{ +int ret; + +vpmu_start_ctxt_switch = NOW(); + +ret = continue_hypercall_on_cpu(cpumask_first(cpu_online_map), +vpmu_sched_checkin, (void *)old_mode); +if ( ret ) +vpmu_start_ctxt_switch = 0; + +return ret; +} I don't think it is guaranteed (independent of implementation details of continue_hypercall_on_cpu()) that this way you went through the context switch path once on each CPU if cpumask_first(cpu_online_map) == smp_processor_id() here. In particular it looks like there being a problem calling vcpu_sleep_sync() in the tasklet handler when v == current (which would be the case if you started on the correct CPU and the tasklet softirq got handled before the scheduler one). I think you instead want to use cpumask_cycle() here and above, and finish the whole operation once you're back on the CPU you started on (the single-pCPU case may need some extra consideration). I realize that you simply follow what microcode_update() does, but I think we should rather correct that one than copy the latent issue it has elsewhere. +long do_xenpmu_op(int op, XEN_GUEST_HANDLE_PARAM(xen_pmu_params_t) arg) +{ +int ret; +struct xen_pmu_params pmu_params; + +ret = xsm_pmu_op(XSM_OTHER, current-domain, op); +if ( ret ) +return ret; + +switch ( op ) +{ +case XENPMU_mode_set: +{ +uint64_t old_mode; Irrespective of the earlier comments I don't see why this isn't just unsigned int (or typeof(vpmu_mode)). Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v15 11/21] x86/VPMU: Interface for setting PMU mode and flags
On 17.11.14 at 00:07, boris.ostrov...@oracle.com wrote: @@ -244,19 +256,19 @@ void vpmu_initialise(struct vcpu *v) switch ( vendor ) { case X86_VENDOR_AMD: -if ( svm_vpmu_initialise(v, opt_vpmu_enabled) != 0 ) -opt_vpmu_enabled = 0; +if ( svm_vpmu_initialise(v) != 0 ) +vpmu_mode = XENPMU_MODE_OFF; break; case X86_VENDOR_INTEL: -if ( vmx_vpmu_initialise(v, opt_vpmu_enabled) != 0 ) -opt_vpmu_enabled = 0; +if ( vmx_vpmu_initialise(v) != 0 ) +vpmu_mode = XENPMU_MODE_OFF; break; So this turns off the vPMU globally upon failure of initializing some random vCPU. Why is that? I see this was the case even before your entire series, but shouldn't this be fixed _before_ enhancing the whole thing to support PV/PVH? Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v15 13/21] x86/VPMU: Initialize PMU for PV(H) guests
On 17.11.14 at 00:07, boris.ostrov...@oracle.com wrote: --- a/xen/arch/x86/hvm/vmx/vpmu_core2.c +++ b/xen/arch/x86/hvm/vmx/vpmu_core2.c @@ -362,24 +362,34 @@ static int core2_vpmu_alloc_resource(struct vcpu *v) struct xen_pmu_intel_ctxt *core2_vpmu_cxt = NULL; uint64_t *p = NULL; -if ( !acquire_pmu_ownership(PMU_OWNER_HVM) ) -return 0; - -wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0); -if ( vmx_add_host_load_msr(MSR_CORE_PERF_GLOBAL_CTRL) ) +p = xzalloc_bytes(sizeof(uint64_t)); xzalloc(uint64_t)? +if ( !p ) goto out_err; -if ( vmx_add_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL) ) -goto out_err; -vmx_write_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL, 0); - -core2_vpmu_cxt = xzalloc_bytes(sizeof(*core2_vpmu_cxt) + - sizeof(uint64_t) * fixed_pmc_cnt + - sizeof(struct xen_pmu_cntr_pair) * - arch_pmc_cnt); -p = xzalloc(uint64_t); At least it was that way before... With that fixed, feel free to add Acked-by: Jan Beulich jbeul...@suse.com Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v15 17/21] x86/VPMU: Handle PMU interrupts for PV guests
On 17.11.14 at 00:07, boris.ostrov...@oracle.com wrote: +if ( (vpmu_mode XENPMU_MODE_SELF) ) +cur_regs = guest_cpu_user_regs(); +else if ( !guest_mode(regs) is_hardware_domain(sampling-domain) ) +{ +cur_regs = regs; +domid = DOMID_XEN; +} +else +cur_regs = guest_cpu_user_regs(); Of course it would be nice to do this with a single if/else pair. +else +{ +struct segment_register seg; + +hvm_get_segment_register(sampled, x86_seg_cs, seg); +r-cs = seg.sel; +hvm_get_segment_register(sampled, x86_seg_ss, seg); +r-ss = seg.sel; +if ( seg.attr.fields.dpl != 0 ) +*flags |= PMU_SAMPLE_USER; Is that how hardware treats it (CPL != 0 meaning user, rather than CPL == 3)? Maybe you should surface CPL instead of a boolean flag? Anyway, with or without the adjustments (unless you go a 3rd way) Acked-by: Jan Beulich jbeul...@suse.com Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH RFCv3 3/8] xen: introduce XEN_DOMCTL_set_recipient
On 07.10.14 at 15:10, vkuzn...@redhat.com wrote: New operation sets the 'recipient' domain which will recieve all memory pages from a particular domain when these pages are freed. --- a/xen/common/domctl.c +++ b/xen/common/domctl.c @@ -1152,6 +1152,33 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) } break; +case XEN_DOMCTL_set_recipient: +{ +struct domain *recipient_dom; + +if ( op-u.recipient.recipient == DOMID_INVALID ) +{ +if ( d-recipient ) +{ +put_domain(d-recipient); +} Please drop pointless braces like these and ... +d-recipient = NULL; +break; +} + +recipient_dom = get_domain_by_id(op-u.recipient.recipient); +if ( recipient_dom == NULL ) +{ +ret = -ESRCH; +break; +} +else +{ +d-recipient = recipient_dom; +} ... the pointless else-s/break-s like here. --- a/xen/common/page_alloc.c +++ b/xen/common/page_alloc.c @@ -1707,6 +1707,7 @@ void free_domheap_pages(struct page_info *pg, unsigned int order) { struct domain *d = page_get_owner(pg); unsigned int i; +unsigned long mfn, gmfn; bool_t drop_dom_ref; ASSERT(!in_irq()); @@ -1764,11 +1765,28 @@ void free_domheap_pages(struct page_info *pg, unsigned int order) scrub = 1; } -if ( unlikely(scrub) ) -for ( i = 0; i (1 order); i++ ) -scrub_one_page(pg[i]); +if ( !d || !d-recipient || d-recipient-is_dying ) +{ +if ( unlikely(scrub) ) +for ( i = 0; i (1 order); i++ ) +scrub_one_page(pg[i]); -free_heap_pages(pg, order); +free_heap_pages(pg, order); +} +else +{ +mfn = page_to_mfn(pg); +gmfn = mfn_to_gmfn(d, mfn); + +page_set_owner(pg, NULL); +assign_pages(d-recipient, pg, order, 0); This can fail. + +if ( guest_physmap_add_page(d-recipient, gmfn, mfn, order) ) +{ +gdprintk(XENLOG_INFO, Failed to add page to domain's physmap + mfn: %lx\n, mfn); The current domain/vCPU don't matter here afaict, hence no need for gdprintk(). The source and/or recipient domain do matter though, hence printing either or both would seem desirable. The gMFN may be relevant for diagnostic purposes too. Hence e.g. printk(XENLOG_G_INFO Failed to add MFN %lx (GFN %lx) to Dom%d's physmap\n mfn, gmfn, d-recipient-domain_id); What's worse though is that you don't guard against XEN_DOMCTL_set_recipient zapping d-recipient. If that really is necessary to be supported, some synchronization will be needed. And finally - what's the intended protocol for the tool stack to know that all pages got transferred (and hence the new domain can be launched)? --- a/xen/include/public/domctl.h +++ b/xen/include/public/domctl.h @@ -965,6 +965,23 @@ struct xen_domctl_vnuma { typedef struct xen_domctl_vnuma xen_domctl_vnuma_t; DEFINE_XEN_GUEST_HANDLE(xen_domctl_vnuma_t); +/* + * XEN_DOMCTL_set_recipient - sets the 'recipient' domain which will recieve + * all the domain's memory after its death or when and memory page from + * domain's heap is being freed. So this specifically allows for this to happen on a non-dying domain. Why, i.e. what's the intended usage scenario? If not really needed, please remove this and verify in the handling that the source domain is dying. --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -366,6 +366,7 @@ struct domain bool_t is_privileged; /* Which guest this guest has privileges on */ struct domain *target; +struct domain *recipient; Please add a brief comment similar to the one for target. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH for-xen-4.5] x86/pvh/vpmu: Disable VPMU for PVH guests
On 25.11.14 at 15:38, boris.ostrov...@oracle.com wrote: On 11/25/2014 03:45 AM, Jan Beulich wrote: @@ -1429,6 +1429,12 @@ int vlapic_init(struct vcpu *v) HVM_DBG_LOG(DBG_LEVEL_VLAPIC, %d, v-vcpu_id); +if ( is_pvh_vcpu(v) ) +{ +vlapic-hw.disabled = VLAPIC_HW_DISABLED; I did consider doing that but I thought that this flag is meant to be set when the guest clears MSR_IA32_APICBASE_ENABLE to disable APIC and therefore I'd be overloading it (the flag) in a way. There's no overloading here - we're then simply treating all PVH vCPU-s as having permanently hardware-disabled LAPICs (reflecting current reality). Regardless, do you think that disabling VPMU for PVH is worth anyway? That depends on what (bad) consequences not doing so has. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH RFCv3 3/8] xen: introduce XEN_DOMCTL_set_recipient
On 25.11.14 at 16:41, vkuzn...@redhat.com wrote: Jan Beulich jbeul...@suse.com writes: On 07.10.14 at 15:10, vkuzn...@redhat.com wrote: @@ -1764,11 +1765,28 @@ void free_domheap_pages(struct page_info *pg, unsigned int order) scrub = 1; } -if ( unlikely(scrub) ) -for ( i = 0; i (1 order); i++ ) -scrub_one_page(pg[i]); +if ( !d || !d-recipient || d-recipient-is_dying ) +{ +if ( unlikely(scrub) ) +for ( i = 0; i (1 order); i++ ) +scrub_one_page(pg[i]); -free_heap_pages(pg, order); +free_heap_pages(pg, order); +} +else +{ +mfn = page_to_mfn(pg); +gmfn = mfn_to_gmfn(d, mfn); + +page_set_owner(pg, NULL); +assign_pages(d-recipient, pg, order, 0); This can fail. .. if the recipient is dying or we're over-allocating. Shouldn't happen (if toolstack does its job right) but I'll add a check. You must not rely on the tool stack doing things right (see XSA-77). What's worse though is that you don't guard against XEN_DOMCTL_set_recipient zapping d-recipient. If that really is necessary to be supported, some synchronization will be needed. And finally - what's the intended protocol for the tool stack to know that all pages got transferred (and hence the new domain can be launched)? (hope this answers both questions above) Now the protocol is: 1) Toolstack queries for all page frames (with xc_get_pfn_type_batch()) for the old domain. 2) Toolstack issues XEN_DOMCTL_set_recipient with the recipient domain 3) Toolstack kills the source domain 4) Toolstack waits for awhile (loop keeps checking while we see new pages arriving + some timeout). 5) Toolstack issues XEN_DOMCTL_set_recipient with DOMID_INVALID resetting recipient. 6) Toolstack checks if all pages were transferred 7) If some pages are missing (e.g. source domain became zombie holding something) request new empty pages instead. 8) Toolstack starts new domain So we don't actually need XEN_DOMCTL_set_recipient to switch from one recipient to the other, we just need a way to reset it. No, this doesn't address either question. For the first one, you again assume the tool stack behaves correctly, which is wrong nowadays. And for the second you give a description, but that's not really a dependable protocol. Nor do I really follow why resetting the recipient is necessary. Can't the tools e.g. wait for the final death of the domain if there's no other notification? --- a/xen/include/public/domctl.h +++ b/xen/include/public/domctl.h @@ -965,6 +965,23 @@ struct xen_domctl_vnuma { typedef struct xen_domctl_vnuma xen_domctl_vnuma_t; DEFINE_XEN_GUEST_HANDLE(xen_domctl_vnuma_t); +/* + * XEN_DOMCTL_set_recipient - sets the 'recipient' domain which will recieve + * all the domain's memory after its death or when and memory page from + * domain's heap is being freed. So this specifically allows for this to happen on a non-dying domain. Why, i.e. what's the intended usage scenario? If not really needed, please remove this and verify in the handling that the source domain is dying. Sorry if I didn't get this comment right. We need a way to tell which domain will recieve memory and XEN_DOMCTL_set_recipient sets (and resets) this target domain. We call it from toolstack before we start killing old domain so it is not dying yet. We can't do it hypervistor-side when our source domain exits doing SHUTDOWN_soft_reset as there is no recipient domain created yet. From a general (hypervisor) point of view we don't actually care if the domain is dying or not. We just want to recieve all freed pages from this domain (so they go to some other domain instead of trash). We do care I think, primarily because we want a correct GMFN - MFN mapping. Seeing multiple pages for the same GMFN is for example going to cause the whole process in its current form to fail. And considering the need for such a correct mapping - is it always the case that the mapping gets updated after a page got removed from a guest? (I can see that the mapping doesn't change anymore for a dying guest, but you explicitly say that you want/need this to work before the guest is actually marked dying.) Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH for-4.5 RFC v2] x86/HVM: Unconditionally crash guests on repeated vmentry failures
On 25.11.14 at 17:54, andrew.coop...@citrix.com wrote: This is RFC as there is still a niggle. I tested this via a partial revert of the XSA-110 fix but the result is quite chatty given a double VMCB dump and domain crash. However, I am not sure we want to make any vmentry failure quite, as any vmentry failure constitues a Xen bug. I think that double printing would be tolerable, but I've had yet another idea: Couldn't we make the second exception a #DF, thus having the guest killed via triple fault in the worst case at the third recurring failure (via hvm_combine_hw_exceptions())? Also your test results point out that we're delivering such an exception with wrong context to the guest: Machine state should match that before the results from the emulation got committed. But doing so would be a pretty significant change for almost no benefit. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH for-4.5 RFC v2] x86/HVM: Unconditionally crash guests on repeated vmentry failures
On 26.11.14 at 18:43, andrew.coop...@citrix.com wrote: My v1 patch only fixes the VMX side of things. SVM doesn't explicitly identify a failed vmentry and lets it fall into the default case of the vmexit handler. As such, with v1, the infinite loop still affects AMD hardware. Right; I should have said something along the lines of v1. An SVM part is needed, but that should probably extend beyond what you proposed in v2: There are a number of goto exit_and_crash statements ahead of where you place your addition. I think they all need to be treated similarly. I therefore think we should revert the VMX part of 28b4baacd5 and make SVM behavior consistent with what results for VMX: Crash the guest unconditionally on VMEXIT_INVALID, without looking for recurring VM entry failures. See below/attached. Jan x86/HVM: prevent infinite VM entry retries This reverts the VMX side of commit 28b4baac (x86/HVM: don't crash guest upon problems occurring in user mode) and gets SVM in line with the resulting VMX behavior. This is because Andrew validly says A failed vmentry is overwhelmingly likely to be caused by corrupt VMC[SB] state. As a result, injecting a fault and retrying the the vmentry is likely to fail in the same way. Reported-by: Andrew Cooper andrew.coop...@citrix.com Signed-off-by: Jan Beulich jbeul...@suse.com --- a/xen/arch/x86/hvm/svm/svm.c +++ b/xen/arch/x86/hvm/svm/svm.c @@ -2338,6 +2338,7 @@ void svm_vmexit_handler(struct cpu_user_ struct nestedvcpu *nv = vcpu_nestedhvm(v); struct vmcb_struct *ns_vmcb = nv-nv_vvmcx; uint64_t exitinfo1, exitinfo2; +bool_t crash = 0; paging_update_nestedmode(v); @@ -2371,13 +2372,16 @@ void svm_vmexit_handler(struct cpu_user_ goto out; case NESTEDHVM_VMEXIT_FATALERROR: gdprintk(XENLOG_ERR, unexpected nestedsvm_vmexit() error\n); -goto exit_and_crash; - +crash = 1; +break; default: BUG(); case NESTEDHVM_VMEXIT_ERROR: break; } +if ( crash ) +break; +/* fall through */ case NESTEDHVM_VMEXIT_ERROR: gdprintk(XENLOG_ERR, nestedsvm_check_intercepts() returned NESTEDHVM_VMEXIT_ERROR\n); @@ -2385,18 +2389,25 @@ void svm_vmexit_handler(struct cpu_user_ case NESTEDHVM_VMEXIT_FATALERROR: gdprintk(XENLOG_ERR, unexpected nestedsvm_check_intercepts() error\n); -goto exit_and_crash; +crash = 1; +break; default: gdprintk(XENLOG_INFO, nestedsvm_check_intercepts() returned %i\n, nsret); -goto exit_and_crash; +crash = 1; +break; } + +if ( unlikely(crash) exit_reason != VMEXIT_INVALID ) +goto exit_and_crash; } if ( unlikely(exit_reason == VMEXIT_INVALID) ) { +gdprintk(XENLOG_ERR, invalid VMCB state:\n); svm_vmcb_dump(__func__, vmcb); -goto exit_and_crash; +domain_crash(v-domain); +return; } perfc_incra(svmexits, exit_reason); --- a/xen/arch/x86/hvm/vmx/vmx.c +++ b/xen/arch/x86/hvm/vmx/vmx.c @@ -134,18 +134,6 @@ static void vmx_vcpu_destroy(struct vcpu passive_domain_destroy(v); } -/* Only crash the guest if the problem originates in kernel mode. */ -static void vmx_crash_or_fault(struct vcpu *v) -{ -struct segment_register ss; - -vmx_get_segment_register(v, x86_seg_ss, ss); -if ( ss.attr.fields.dpl ) -hvm_inject_hw_exception(TRAP_invalid_op, HVM_DELIVER_NO_ERROR_CODE); -else -domain_crash(v-domain); -} - static DEFINE_PER_CPU(struct vmx_msr_state, host_msr_state); static const u32 msr_index[] = @@ -2520,7 +2508,7 @@ static void vmx_failed_vmentry(unsigned vmcs_dump_vcpu(curr); printk(**\n); -vmx_crash_or_fault(curr); +domain_crash(curr-domain); } void vmx_enter_realmode(struct cpu_user_regs *regs) @@ -3173,8 +3161,19 @@ void vmx_vmexit_handler(struct cpu_user_ /* fall through */ default: exit_and_crash: -gdprintk(XENLOG_WARNING, Bad vmexit (reason %#lx)\n, exit_reason); -vmx_crash_or_fault(v); +{ +struct segment_register ss; + +gdprintk(XENLOG_WARNING, Bad vmexit (reason %#lx)\n, + exit_reason); + +vmx_get_segment_register(v, x86_seg_ss, ss); +if ( ss.attr.fields.dpl ) +hvm_inject_hw_exception(TRAP_invalid_op, +HVM_DELIVER_NO_ERROR_CODE); +else +domain_crash(v-domain); +} break; } x86/HVM: prevent infinite VM entry retries This reverts the VMX side of commit 28b4baac (x86/HVM: don't crash guest upon problems occurring
Re: [Xen-devel] [PATCH RFCv3 3/8] xen: introduce XEN_DOMCTL_set_recipient
On 25.11.14 at 18:10, vkuzn...@redhat.com wrote: Jan Beulich jbeul...@suse.com writes: On 25.11.14 at 16:41, vkuzn...@redhat.com wrote: Jan Beulich jbeul...@suse.com writes: On 07.10.14 at 15:10, vkuzn...@redhat.com wrote: @@ -1764,11 +1765,28 @@ void free_domheap_pages(struct page_info *pg, unsigned int order) scrub = 1; } -if ( unlikely(scrub) ) -for ( i = 0; i (1 order); i++ ) -scrub_one_page(pg[i]); +if ( !d || !d-recipient || d-recipient-is_dying ) +{ +if ( unlikely(scrub) ) +for ( i = 0; i (1 order); i++ ) +scrub_one_page(pg[i]); -free_heap_pages(pg, order); +free_heap_pages(pg, order); +} +else +{ +mfn = page_to_mfn(pg); +gmfn = mfn_to_gmfn(d, mfn); + +page_set_owner(pg, NULL); +assign_pages(d-recipient, pg, order, 0); This can fail. .. if the recipient is dying or we're over-allocating. Shouldn't happen (if toolstack does its job right) but I'll add a check. You must not rely on the tool stack doing things right (see XSA-77). What's worse though is that you don't guard against XEN_DOMCTL_set_recipient zapping d-recipient. If that really is necessary to be supported, some synchronization will be needed. And finally - what's the intended protocol for the tool stack to know that all pages got transferred (and hence the new domain can be launched)? (hope this answers both questions above) Now the protocol is: 1) Toolstack queries for all page frames (with xc_get_pfn_type_batch()) for the old domain. 2) Toolstack issues XEN_DOMCTL_set_recipient with the recipient domain 3) Toolstack kills the source domain 4) Toolstack waits for awhile (loop keeps checking while we see new pages arriving + some timeout). 5) Toolstack issues XEN_DOMCTL_set_recipient with DOMID_INVALID resetting recipient. 6) Toolstack checks if all pages were transferred 7) If some pages are missing (e.g. source domain became zombie holding something) request new empty pages instead. 8) Toolstack starts new domain So we don't actually need XEN_DOMCTL_set_recipient to switch from one recipient to the other, we just need a way to reset it. No, this doesn't address either question. For the first one, you again assume the tool stack behaves correctly, which is wrong nowadays. And for the second you give a description, but that's not really a dependable protocol. Nor do I really follow why resetting the recipient is necessary. Can't the tools e.g. wait for the final death of the domain if there's no other notification? Yes, it's possible and should happen in all normal cases. However my idea was that it's possible to start new domain even if the old one fails to die holding several (presumably special) pages and we're fine with replacing those with empty pages. In case we go for 'always wait for the original domain to die' solution resetting XEN_DOMCTL_set_recipient is not necessary (it is necessary in case we don't as we can recieve a page after we already requested a new one instead). I think this always wait is imo the only reasonable solution: How would replacing some pages with empty ones fit the kexec/kdump purposes? However, it may not be necessary to wait for the domain to completely disappear, it may be sufficient to wait for its d-tot_pages to become zero. From a general (hypervisor) point of view we don't actually care if the domain is dying or not. We just want to recieve all freed pages from this domain (so they go to some other domain instead of trash). We do care I think, primarily because we want a correct GMFN - MFN mapping. Seeing multiple pages for the same GMFN is for example going to cause the whole process in its current form to fail. Can adding a check that nothing is mapped to the GMFN before mapping new MFN there be a solution? Not checking for this is presumably the better approach - the new domain would get what the old one had last at a given GMFN. What it may not get are pages that intermediately got moved around. But again, all that is relevant only if the old domain can still alter its memory layout while the transfer is already in progress, which (as said before) I think should be avoided. And considering the need for such a correct mapping - is it always the case that the mapping gets updated after a page got removed from a guest? (I can see that the mapping doesn't change anymore for a dying guest, but you explicitly say that you want/need this to work before the guest is actually marked dying.) Actual reassignment here happens for a dying guest only as XEN_DOMCTL_set_recipient does nothing. If you think it's safer to depend on the fact that dying flag is always set we can couple XEN_DOMCTL_set_recipient with XEN_DOMCTL_destroydomain in one call. It is possible if we go
Re: [Xen-devel] [PATCH v15 11/21] x86/VPMU: Interface for setting PMU mode and flags
On 26.11.14 at 15:32, boris.ostrov...@oracle.com wrote: On 11/25/2014 08:49 AM, Jan Beulich wrote: On 17.11.14 at 00:07, boris.ostrov...@oracle.com wrote: @@ -244,19 +256,19 @@ void vpmu_initialise(struct vcpu *v) switch ( vendor ) { case X86_VENDOR_AMD: -if ( svm_vpmu_initialise(v, opt_vpmu_enabled) != 0 ) -opt_vpmu_enabled = 0; +if ( svm_vpmu_initialise(v) != 0 ) +vpmu_mode = XENPMU_MODE_OFF; break; case X86_VENDOR_INTEL: -if ( vmx_vpmu_initialise(v, opt_vpmu_enabled) != 0 ) -opt_vpmu_enabled = 0; +if ( vmx_vpmu_initialise(v) != 0 ) +vpmu_mode = XENPMU_MODE_OFF; break; So this turns off the vPMU globally upon failure of initializing some random vCPU. Why is that? I see this was the case even before your entire series, but shouldn't this be fixed _before_ enhancing the whole thing to support PV/PVH? Yes, that's probably too strong. Do you want to fix this as an early patch (before PV(H)) support gets in? I'd rather do it in the patch that moves things into initcalls. Yes, I think this should be fixed in a prereq patch, thus allowing it to be easily backported if so desired. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v15 17/21] x86/VPMU: Handle PMU interrupts for PV guests
On 26.11.14 at 15:39, boris.ostrov...@oracle.com wrote: On 11/25/2014 09:28 AM, Jan Beulich wrote: +else +{ +struct segment_register seg; + +hvm_get_segment_register(sampled, x86_seg_cs, seg); +r-cs = seg.sel; +hvm_get_segment_register(sampled, x86_seg_ss, seg); +r-ss = seg.sel; +if ( seg.attr.fields.dpl != 0 ) +*flags |= PMU_SAMPLE_USER; Is that how hardware treats it (CPL != 0 meaning user, rather than CPL == 3)? You mean how *software* (e.g. Linux kernel) treats it? If yes, then for 32-bit user_mode() checks for (CS == 3) and for 64-bit it's !!(CS 3). No, I meant hardware. There CPL qualified PMU aspects, and it was those I had in mind to use as reference here. Maybe you should surface CPL instead of a boolean flag? Am I not already doing it by passing SS and CS to the guest? No, neither SS.RPL nor CS.RPL formally represent CPL. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] Regression, host crash with 4.5rc1
On 27.11.14 at 06:29, sfl...@ihonk.com wrote: On 11/25/2014 03:00 AM, Jan Beulich wrote: Okay, so it's not really the mwait-idle driver causing the regression, but it is C-state related. Hence we're now down to seeing whether all or just the deeper C states are affected, i.e. I now need to ask you to play with max_cstate=. For that you'll have to remember that the option's effect differs between the ACPI and the MWAIT idle drivers. In the spirit of bisection I'd suggest using max_cstate=2 first no matter which of the two scenarios you pick. If that still hangs, max_cstate=1 obviously is the only other thing to try. Should that not hang (and you left out mwait-idle=0), trying max_cstate=3 in that same scenario would be the other case to check. No need for 'd' and 'a' output for the time being, but 'c' output would be much appreciated for all cases where you observe hangs. Okay, working through that now. I tried max_cstate=2 and got no hangs, whether with or without mwait-idle=0. However, I was puzzled by this: (XEN) 'c' pressed - printing ACPI Cx structures (XEN) ==cpu0== (XEN) active state: C0 (XEN) max_cstate: C2 (XEN) states: (XEN) C1: type[C1] latency[003] usage[12219860] method[ FFH] duration[1190961948551] (XEN) C2: type[C1] latency[010] usage[10205554] method[ FFH] duration[2015393965907] (XEN) C3: type[C2] latency[020] usage[50926286] method[ FFH] duration[30527997858148] (XEN)*C0: usage[73351700] duration[9974627547595] (XEN) max=0 pwr=0 urg=0 nxt=0 (XEN) PC2[0] PC3[8589642315848] PC6[0] PC7[0] (XEN) CC3[28794734145697] CC6[0] CC7[0] (XEN) ==cpu1== (XEN) active state: C3 (XEN) max_cstate: C2 (XEN) states: (XEN) C1: type[C1] latency[003] usage[10699950] method[ FFH] duration[1141422044112] (XEN) C2: type[C1] latency[010] usage[06382904] method[ FFH] duration[1329739264322] (XEN)*C3: type[C2] latency[020] usage[44630764] method[ FFH] duration[31676618425954] (XEN) C0: usage[61713618] duration[9561201640320] (XEN) max=0 pwr=0 urg=0 nxt=0 (XEN) PC2[0] PC3[8589642315848] PC6[0] PC7[0] (XEN) CC3[30066495105056] CC6[0] CC7[0] [...] Why would some of the cores be in C3 even though they list max_cstate as C2? This was precisely the reason why I told you that the numbering differs (and is confusing and has nothing to do with actual C state numbers): What max_cstate refers to in the mwait-idle driver is what above is listed as type[Cx], i.e. the state at index 1 is C1, at 2 we've got C1E, and at 3 we've got C2. And those still aren't in line with the numbering the CPU documentation uses, it's rather kind of meant to refer to the ACPI numbering (but probably also not fully matching up). So max_cstate=2 working suggests a problem with what the CPU calls C6, which presumably isn't all that surprising considering the many errata (BD35, BD38, BD40, BD59, BD87, and BD104). Not sure how to proceed from here - I suppose you already made sure you run with the latest available BIOS. And with 6 errata documented it's not all that unlikely that there's a 7th one with MONITOR/MWAIT behavior. The commit you bisected to (and which you had verified to be the culprit by just forcing arch_skip_send_event_check() to always return false) could be reasonably assumed to be broken only when MWAIT use for all C states didn't work. Don, Jun - is there anything known but not yet publicly documented for Family 6 Model 44 Xeons? Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH] xsm/flask: improve unknown permission handling
On 25.11.14 at 19:05, dgde...@tycho.nsa.gov wrote: --- a/xen/xsm/flask/hooks.c +++ b/xen/xsm/flask/hooks.c @@ -135,6 +135,19 @@ static int get_irq_sid(int irq, u32 *sid, struct avc_audit_data *ad) return 0; } +static int avc_unknown_permission(const char* name, int id) const char *name +{ +/* A guest making an invalid hypercall can trigger this message, so it can't + * be an ASSERT or BUG_ON, but normally it is caused by a missing case in + * one of the switch statements below. + */ +printk(XENLOG_G_ERR FLASK: Unknown %s: %d.\n, name, id); I think this ought to be XENLOG_G_WARNING when not returning an error. E.g. switch printing and return code determination, use the return code to select the correct log level, and return after logging the message. Jan +if ( !flask_enforcing || security_get_allow_unknown() ) +return 0; +else +return -EPERM; +} + static int flask_domain_alloc_security(struct domain *d) { struct domain_security_struct *dsec; ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH for-4.5 RFC v2] x86/HVM: Unconditionally crash guests on repeated vmentry failures
On 27.11.14 at 11:26, t...@xen.org wrote: At 08:42 + on 27 Nov (1417074133), Jan Beulich wrote: On 26.11.14 at 18:43, andrew.coop...@citrix.com wrote: My v1 patch only fixes the VMX side of things. SVM doesn't explicitly identify a failed vmentry and lets it fall into the default case of the vmexit handler. As such, with v1, the infinite loop still affects AMD hardware. Right; I should have said something along the lines of v1. An SVM part is needed, but that should probably extend beyond what you proposed in v2: There are a number of goto exit_and_crash statements ahead of where you place your addition. I think they all need to be treated similarly. I therefore think we should revert the VMX part of 28b4baacd5 and make SVM behavior consistent with what results for VMX: Crash the guest unconditionally on VMEXIT_INVALID, without looking for recurring VM entry failures. See below/attached. Jan x86/HVM: prevent infinite VM entry retries This reverts the VMX side of commit 28b4baac (x86/HVM: don't crash guest upon problems occurring in user mode) and gets SVM in line with the resulting VMX behavior. This is because Andrew validly says A failed vmentry is overwhelmingly likely to be caused by corrupt VMC[SB] state. As a result, injecting a fault and retrying the the vmentry is likely to fail in the same way. Reported-by: Andrew Cooper andrew.coop...@citrix.com Signed-off-by: Jan Beulich jbeul...@suse.com Looking at the SVM side, AFAICT you're trying to filter out VMEXIT_INVALID exits that couldn't be handled by nested SVM, but I think it's fine just to always crash on nested-SVM failures: we know the guest wasn't in user mode because it successfully executed VMRUN. And looking at it, the other users of that label are for unexpected debugging exits, which can't be caused by the guest userspace either. So how about this for the SVM side, reverting to crashing for everything except new, unsupported exit types? Generally a good idea, but there are two paths to exit_and_crash (for VMEXIT_EXCEPTION_DB and VMEXIT_EXCEPTION_BP) which I think would better crash only conditionally. --- a/xen/arch/x86/hvm/svm/svm.c +++ b/xen/arch/x86/hvm/svm/svm.c @@ -2675,16 +2675,18 @@ void svm_vmexit_handler(struct cpu_user_regs *regs) break; default: -exit_and_crash: gdprintk(XENLOG_ERR, unexpected VMEXIT: exit reason = %#PRIx64, exitinfo1 = %#PRIx64, exitinfo2 = %#PRIx64\n, exit_reason, (u64)vmcb-exitinfo1, (u64)vmcb-exitinfo2); -if ( vmcb_get_cpl(vmcb) ) +if ( vmcb_get_cpl(vmcb) ) { hvm_inject_hw_exception(TRAP_invalid_op, HVM_DELIVER_NO_ERROR_CODE); -else -domain_crash(v-domain); +break; +} +/* else fall through */ +exit_and_crash: +domain_crash(v-domain); break; } Additionally this re-arrangement would make some domain_crash() invocations silent (i.e. no other accompanying message), but that's of course easy to fix. And finally, if doing it that way we might better remove the exit_and_crash label altogether and call domain_crash() directly in the places we mean it to be called. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH for-4.5 RFC v2] x86/HVM: Unconditionally crash guests on repeated vmentry failures
On 27.11.14 at 12:29, t...@xen.org wrote: At 11:16 + on 27 Nov (1417083414), Jan Beulich wrote: On 27.11.14 at 11:26, t...@xen.org wrote: At 08:42 + on 27 Nov (1417074133), Jan Beulich wrote: On 26.11.14 at 18:43, andrew.coop...@citrix.com wrote: My v1 patch only fixes the VMX side of things. SVM doesn't explicitly identify a failed vmentry and lets it fall into the default case of the vmexit handler. As such, with v1, the infinite loop still affects AMD hardware. Right; I should have said something along the lines of v1. An SVM part is needed, but that should probably extend beyond what you proposed in v2: There are a number of goto exit_and_crash statements ahead of where you place your addition. I think they all need to be treated similarly. I therefore think we should revert the VMX part of 28b4baacd5 and make SVM behavior consistent with what results for VMX: Crash the guest unconditionally on VMEXIT_INVALID, without looking for recurring VM entry failures. See below/attached. Jan x86/HVM: prevent infinite VM entry retries This reverts the VMX side of commit 28b4baac (x86/HVM: don't crash guest upon problems occurring in user mode) and gets SVM in line with the resulting VMX behavior. This is because Andrew validly says A failed vmentry is overwhelmingly likely to be caused by corrupt VMC[SB] state. As a result, injecting a fault and retrying the the vmentry is likely to fail in the same way. Reported-by: Andrew Cooper andrew.coop...@citrix.com Signed-off-by: Jan Beulich jbeul...@suse.com Looking at the SVM side, AFAICT you're trying to filter out VMEXIT_INVALID exits that couldn't be handled by nested SVM, but I think it's fine just to always crash on nested-SVM failures: we know the guest wasn't in user mode because it successfully executed VMRUN. And looking at it, the other users of that label are for unexpected debugging exits, which can't be caused by the guest userspace either. So how about this for the SVM side, reverting to crashing for everything except new, unsupported exit types? Generally a good idea, but there are two paths to exit_and_crash (for VMEXIT_EXCEPTION_DB and VMEXIT_EXCEPTION_BP) which I think would better crash only conditionally. Those are catching Xen bugs -- we don't (or at least shouldn't) enable those exit types when the debugger is not attached. I think that, like with the p2m ENOMEM case, turning them into #UD is not really helpful. But if you prefer, those could be made into 'goto default' to preserve the current behaviour, which would also retain the debugging output. And finally, if doing it that way we might better remove the exit_and_crash label altogether and call domain_crash() directly in the places we mean it to be called. Indeed. How's this, then? Looks good - if you don't mind I'll replace the SVM part of the earlier patch with this, add your S-o-b alongside mine, and send again for final review. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH] x86/HVM: prevent infinite VM entry retries
On 27.11.14 at 13:46, andrew.coop...@citrix.com wrote: On 27/11/14 12:39, Jan Beulich wrote: if ( unlikely(exit_reason == VMEXIT_INVALID) ) { I think it would be good to retain the printk here from previous versions of the patch, specifically identifies the below vmcb dump as caused by a failed vmentry. As it stands, it is a vmcb dump with no other context. Either way, Reviewed-by: Andrew Cooper andrew.coop...@citrix.com I meant to do so and then forgot. Added to the to be committed version. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [v3] libxc: Expose the 1GB pages cpuid flag to guest
On 28.11.14 at 04:28, liang.z...@intel.com wrote: --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -4287,7 +4287,7 @@ void hvm_cpuid(unsigned int input, unsigned int *eax, unsigned int *ebx, !host_tsc_is_safe() ) *edx = ~cpufeat_mask(X86_FEATURE_RDTSCP); /* Hide 1GB-superpage feature if we can't emulate it. */ -if (!hvm_pse1gb_supported(d)) +if (!hvm_pse1gb_supported(d) || paging_mode_shadow(d)) *edx = ~cpufeat_mask(X86_FEATURE_PAGE1GB); With #define hvm_pse1gb_supported(d) \ (cpu_has_page1gb paging_mode_hap(d)) the change above is pointless. While, considering this, comments on v2 may have been misleading, you should have simply updated the patch description instead to clarify why the v2 change was okay even for the shadow mode case. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [xen-4.4-testing test] 31882: regressions - trouble: blocked/broken/fail/pass
On 28.11.14 at 10:07, ian.jack...@eu.citrix.com wrote: flight 31882 xen-4.4-testing real [real] http://www.chiark.greenend.org.uk/~xensrcts/logs/31882/ Regressions :-( Tests which did not succeed and are blocking, including tests which could not be run: test-amd64-i386-pair 17 guest-migrate/src_host/dst_host fail REGR. vs. 31781 Wasn't the swiotlb problem supposed to be dealt with by now? swiotlb=65536 looks to be in place here, yet that still appears to not be big enough... Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v4] x86: add p2m_mmio_write_dm
On 28.11.14 at 08:59, yu.c.zh...@linux.intel.com wrote: --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -2838,7 +2838,8 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla, * to the mmio handler. */ if ( (p2mt == p2m_mmio_dm) || - (npfec.write_access (p2mt == p2m_ram_ro)) ) + (npfec.write_access + ((p2mt == p2m_ram_ro) || (p2mt == p2m_mmio_write_dm))) ) Why are you corrupting indentation here? Furthermore the code you modify here suggests that p2m_ram_ro already has the needed semantics - writes get passed to the DM. None of the other changes you make, and none of the other uses of p2m_ram_ro appear to be in conflict with your intentions, so you'd really need to explain better why you need the new type. @@ -5922,6 +5923,8 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg) a.mem_type = HVMMEM_ram_rw; else if ( p2m_is_grant(t) ) a.mem_type = HVMMEM_ram_rw; +else if ( t == p2m_mmio_write_dm ) +a.mem_type = HVMMEM_mmio_write_dm; else a.mem_type = HVMMEM_mmio_dm; rc = __copy_to_guest(arg, a, 1) ? -EFAULT : 0; @@ -5941,7 +5944,8 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg) static const p2m_type_t memtype[] = { [HVMMEM_ram_rw] = p2m_ram_rw, [HVMMEM_ram_ro] = p2m_ram_ro, -[HVMMEM_mmio_dm] = p2m_mmio_dm +[HVMMEM_mmio_dm] = p2m_mmio_dm, +[HVMMEM_mmio_write_dm] = p2m_mmio_write_dm }; if ( copy_from_guest(a, arg, 1) ) @@ -5987,14 +5991,17 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg) rc = -EAGAIN; goto param_fail4; } + Stray addition of a blank line? if ( !p2m_is_ram(t) - (!p2m_is_hole(t) || a.hvmmem_type != HVMMEM_mmio_dm) ) + (!p2m_is_hole(t) || a.hvmmem_type != HVMMEM_mmio_dm) + t != p2m_mmio_write_dm ) Do you really want to permit e.g. transitions between mmio_dm and mmio_write_dm? We should be as restrictive as possible here to not open up paths to security problems. { put_gfn(d, pfn); goto param_fail4; } rc = p2m_change_type_one(d, pfn, t, memtype[a.hvmmem_type]); + put_gfn(d, pfn); if ( rc ) goto param_fail4; Another stray newline addition. --- a/xen/include/asm-x86/p2m.h +++ b/xen/include/asm-x86/p2m.h @@ -72,6 +72,7 @@ typedef enum { p2m_ram_shared = 12, /* Shared or sharable memory */ p2m_ram_broken = 13, /* Broken page, access cause domain crash */ p2m_map_foreign = 14,/* ram pages from foreign domain */ +p2m_mmio_write_dm = 15, /* Read-only; writes go to the device model */ } p2m_type_t; /* Modifiers to the query */ If the new type is really needed, shouldn't this get added to P2M_RO_TYPES? Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [v3] libxc: Expose the 1GB pages cpuid flag to guest
On 28.11.14 at 11:29, liang.z...@intel.com wrote: -if (!hvm_pse1gb_supported(d)) +if (!hvm_pse1gb_supported(d) || paging_mode_shadow(d)) *edx = ~cpufeat_mask(X86_FEATURE_PAGE1GB); With #define hvm_pse1gb_supported(d) \ (cpu_has_page1gb paging_mode_hap(d)) the change above is pointless. While, considering this, comments on v2 may have been misleading, you should have simply updated the patch description instead to clarify why the v2 change was okay even for the shadow mode case. I checked the code and found that for a normal guest can only be in hap mode or shadow mode before I sending the patch, but I am not share if it's true for dom0. The CPUID code in libxc doesn't apply to Dom0 at all, and CPUID handling is also special cased in the hypervisor for Dom0. Plus finally Dom0 only possibly being PV or PVH, PVH requiring HAP and PV generally not allowing large pages anyway, your concern is unnecessary. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [v4] libxc: Expose the 1GB pages cpuid flag to guest
On 28.11.14 at 11:52, liang.z...@intel.com wrote: If hardware support the 1GB pages, expose the feature to guest by default. Users don't have to use a 'cpuid= ' option in config fil e to turn it on. If guest use shadow mode, the 1GB pages feature will be hidden from guest, this is done in the function hvm_cpuid(). So the change is okay for shadow mode case. Signed-off-by: Liang Li liang.z...@intel.com Signed-off-by: Yang Zhang yang.z.zh...@intel.com Reviewed-by: Jan Beulich jbeul...@suse.com ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] Xen-4.5 HVMOP ABI issues
On 28.11.14 at 12:36, andrew.coop...@citrix.com wrote: However, it occurs to me that HVMOP_op_mask absolutely must live in the public header files, as it is guest visible, and forms part of the ABI. Consider a userspace task which falls into a continuation, is preempted and paused by the guest kernel, the entire VM migrated, and the task unpaused later. If HVMOP_op_mask has changed in that time, the continuation will become erroneous. In particular, all hypercall continuation strategies we use *must* be forward compatible when moving to newer versions of Xen, because Xen cannot possibly guarantee that a continuation which starts will finish on the same hypervisor. Hmm, I see the issue, but surfacing such implementation details would not only be unfortunate, but eventually prevent us from making future changes. Just recall the mem-op case where we had to widen the continuation encoding mask at some point. Hence while I can't immediately see a way to avoid the situation you describe (and it doesn't even take a user space task in a preemptible kernel), I think we should allow ourselves a little more time to find possible solutions other than locking down these masks (really they don't need to be exposed in the public headers, but we would need them to not change if no other solution can be found). One thing to note is that the _introduction_ of such a mask (as has happened for hvm-op between 4.4 and 4.5) is not a problem as long as the respective bits all being zero has no special meaning. What I considered for mem-op a while ago though (and then forgot again) was to refuse non-zero start_extent values for any operations not making use of that mechanism. The same would of course be applicable to gnttab-op and hvm-op. What might additionally make this not that urgent an issue for 4.5 is that only XSM_DM_PRIV guarded operations are affected, and I suppose a stubdom gets re-created on the target host (rather than migrated) when its client migrates. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] Xen-4.5 HVMOP ABI issues
On 28.11.14 at 14:55, andrew.coop...@citrix.com wrote: The problem is with continuations which reuse the upper bits of the input register, not with this HVMOP_op_mask specifically; the HVMOP_op_mask simply adds to an existing problem. This is something which needs considering urgently because, as you identify above, we have already suffered an accidental ABI breakage with the mem-op widening. Since we can't retroactively fix the mem-op widening, I still don't see what's urgent here: As long as we don't change any of these masks, nothing bad is going to happen. Of course one thing to consider with this aspect in mind is whether to change the hvm-op or gnttab-op masks again _before_ 4.5 goes out, based on whether we feel they're wide enough for the (un)foreseeable future. 32bit HVM guests reliably form a continuation on every single iteration of a continuable hypercall (e.g. decrease reservation, which is the base of my XSA-111 PoC), making it trivial to construct a migrateable HVM domain which exposes the issue. Hmm, I can't offhand see why that would be, but what you write reads like you determined the reason? I ask because an unavoidable use of continuations is certainly nice for making sure those code paths get tested, but would be quite desirable to get eliminated at least for non-debug builds. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v4] x86: add p2m_mmio_write_dm
On 01.12.14 at 09:49, yu.c.zh...@linux.intel.com wrote: On 11/28/2014 5:57 PM, Jan Beulich wrote: On 28.11.14 at 08:59, yu.c.zh...@linux.intel.com wrote: --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -2838,7 +2838,8 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla, * to the mmio handler. */ if ( (p2mt == p2m_mmio_dm) || - (npfec.write_access (p2mt == p2m_ram_ro)) ) + (npfec.write_access + ((p2mt == p2m_ram_ro) || (p2mt == p2m_mmio_write_dm))) ) Why are you corrupting indentation here? Thanks for your comments, Jan. The indentation here is to make sure the ((p2mt == p2m_ram_ro) || (p2mt == p2m_mmio_write_dm)) are grouped together. But I am not sure if this is correct according to xen coding style. I may have misunderstood your previous comments on Sep 3, which said the indentation would need adjustment in reply to [Xen-devel] [PATCH v3 1/2] x86: add p2m_mmio_write_dm Getting the alignment right is needed, but no via using hard tabs. Furthermore the code you modify here suggests that p2m_ram_ro already has the needed semantics - writes get passed to the DM. None of the other changes you make, and none of the other uses of p2m_ram_ro appear to be in conflict with your intentions, so you'd really need to explain better why you need the new type. Thanks Jan. To my understanding, pages with p2m_ram_ro are not supposed to be modified by guest. So in __hvm_copy(), when p2m type of a page is p2m_ram_rom, no copy will occur. However, to our usage, we just wanna this page to be write protected, so that our device model can be triggered to do some emulation. The content written to this page is supposed not to be dropped. This way, if sequentially a read operation is performed by guest to this page, the guest will still see its anticipated value. __hvm_copy() is only a helper function, and doesn't write to mmio_dm space either; instead its (indirect) callers would invoke hvmemul_do_mmio() upon seeing HVMCOPY_bad_gfn_to_mfn returns. The question hence is about the apparent inconsistency resulting from writes to ram_ro being dropped here but getting passed to the DM by hvm_hap_nested_page_fault(). Tim - is that really intentional? Maybe I need to update my commit message to explain this more clearly. :) At the very least, yes. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [Patch V2] expand x86 arch_shared_info to support linear p2m list
On 01.12.14 at 10:29, jgr...@suse.com wrote: The x86 struct arch_shared_info field pfn_to_mfn_frame_list_list currently contains the mfn of the top level page frame of the 3 level p2m tree, which is used by the Xen tools during saving and restoring (and live migration) of pv domains and for crash dump analysis. With three levels of the p2m tree it is possible to support up to 512 GB of RAM for a 64 bit pv domain. A 32 bit pv domain can support more, as each memory page can hold 1024 instead of 512 entries, leading to a limit of 4 TB. To be able to support more RAM on x86-64 switch to a virtual mapped p2m list. This patch expands struct arch_shared_info with a new p2m list virtual address, the root of the page table root and a p2m generation count. The new information is indicated by the domain to be valid by storing ~0UL into pfn_to_mfn_frame_list_list. The hypervisor indicates usability of this feature by a new flag XENFEAT_virtual_p2m. Right now XENFEAT_virtual_p2m will not be set. This will change when the Xen tools support the virtual mapped p2m list. This seems wrong: XENFEAT_* only reflect hypervisor capabilities. I.e. the availability of the new functionality may need to be advertised another way - xenstore perhaps? Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v4] x86: add p2m_mmio_write_dm
On 01.12.14 at 11:30, t...@xen.org wrote: At 09:32 + on 01 Dec (1417422746), Jan Beulich wrote: On 01.12.14 at 09:49, yu.c.zh...@linux.intel.com wrote: To my understanding, pages with p2m_ram_ro are not supposed to be modified by guest. So in __hvm_copy(), when p2m type of a page is p2m_ram_rom, no copy will occur. However, to our usage, we just wanna this page to be write protected, so that our device model can be triggered to do some emulation. The content written to this page is supposed not to be dropped. This way, if sequentially a read operation is performed by guest to this page, the guest will still see its anticipated value. __hvm_copy() is only a helper function, and doesn't write to mmio_dm space either; instead its (indirect) callers would invoke hvmemul_do_mmio() upon seeing HVMCOPY_bad_gfn_to_mfn returns. The question hence is about the apparent inconsistency resulting from writes to ram_ro being dropped here but getting passed to the DM by hvm_hap_nested_page_fault(). Tim - is that really intentional? No - and AFAICT it shouldn't be happening. It _is_ how it was implemented originally, because it involved fewer moving parts and didn't need to be efficient (and after all, writes to entirely missing addresses go to the device model too). But the code was later updated to log and discard writes to read-only memory (see 4d8aa29 from Trolle Selander). Early version of p2m_ram_ro were documented in the internal headers as sending the writes to the DM, but the public interface (HVMMEM_ram_ro) has always said that writes are discarded. Hmm, so which way do you recommend resolving the inconsistency then - match what the public interface says or what the apparent original intention for the internal type was? Presumably we need to follow the public interface mandated model, and hence require the new type to be introduced. During this bit of archaeology I realised that either this new type should _not_ be made part of P2M_RO_TYPES, or, better, we need a new class of P2M types (P2M_DISCARD_WRITE_TYPES, say) that should be used for these paths in emulate_gva_to_mfn() and __hvm_copy(), containing just p2m_ram_ro and p2m_grant_map_ro. And I suppose in that latter case the new type could be made part of P2M_RO_TYPES()? Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [Patch V2] expand x86 arch_shared_info to support linear p2m list
On 01.12.14 at 12:19, david.vra...@citrix.com wrote: On 01/12/14 10:15, Jan Beulich wrote: On 01.12.14 at 10:29, jgr...@suse.com wrote: The x86 struct arch_shared_info field pfn_to_mfn_frame_list_list currently contains the mfn of the top level page frame of the 3 level p2m tree, which is used by the Xen tools during saving and restoring (and live migration) of pv domains and for crash dump analysis. With three levels of the p2m tree it is possible to support up to 512 GB of RAM for a 64 bit pv domain. A 32 bit pv domain can support more, as each memory page can hold 1024 instead of 512 entries, leading to a limit of 4 TB. To be able to support more RAM on x86-64 switch to a virtual mapped p2m list. This patch expands struct arch_shared_info with a new p2m list virtual address, the root of the page table root and a p2m generation count. The new information is indicated by the domain to be valid by storing ~0UL into pfn_to_mfn_frame_list_list. The hypervisor indicates usability of this feature by a new flag XENFEAT_virtual_p2m. Right now XENFEAT_virtual_p2m will not be set. This will change when the Xen tools support the virtual mapped p2m list. This seems wrong: XENFEAT_* only reflect hypervisor capabilities. I.e. the availability of the new functionality may need to be advertised another way - xenstore perhaps? Xenstore doesn't work for dom0. Shouldn't this be something the guest kernel reports using a ELF note bit? When building a domain (either in Xen for dom0 or in the tools), the builder may provide a linear p2m iff supported by the guest kernel and then (and only then) can it provide a guest with 512 GiB. Yes, surely this flag could act as a kernel capability indicator (via the XEN_ELFNOTE_SUPPORTED_FEATURES note), like e.g. XENFEAT_dom0 already does. Jürgen's final statement, however, suggested to me that this is meant to be only consumed by kernels. Otoh the P2M provided by both Dom0 and DomU builders have always been linear anyway; it's the pv-ops kernel that constructs a tree as replacement. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v4] x86: add p2m_mmio_write_dm
On 01.12.14 at 13:13, t...@xen.org wrote: At 11:17 + on 01 Dec (1417429027), Jan Beulich wrote: On 01.12.14 at 11:30, t...@xen.org wrote: At 09:32 + on 01 Dec (1417422746), Jan Beulich wrote: On 01.12.14 at 09:49, yu.c.zh...@linux.intel.com wrote: To my understanding, pages with p2m_ram_ro are not supposed to be modified by guest. So in __hvm_copy(), when p2m type of a page is p2m_ram_rom, no copy will occur. However, to our usage, we just wanna this page to be write protected, so that our device model can be triggered to do some emulation. The content written to this page is supposed not to be dropped. This way, if sequentially a read operation is performed by guest to this page, the guest will still see its anticipated value. __hvm_copy() is only a helper function, and doesn't write to mmio_dm space either; instead its (indirect) callers would invoke hvmemul_do_mmio() upon seeing HVMCOPY_bad_gfn_to_mfn returns. The question hence is about the apparent inconsistency resulting from writes to ram_ro being dropped here but getting passed to the DM by hvm_hap_nested_page_fault(). Tim - is that really intentional? No - and AFAICT it shouldn't be happening. It _is_ how it was implemented originally, because it involved fewer moving parts and didn't need to be efficient (and after all, writes to entirely missing addresses go to the device model too). But the code was later updated to log and discard writes to read-only memory (see 4d8aa29 from Trolle Selander). Early version of p2m_ram_ro were documented in the internal headers as sending the writes to the DM, but the public interface (HVMMEM_ram_ro) has always said that writes are discarded. Hmm, so which way do you recommend resolving the inconsistency then - match what the public interface says or what the apparent original intention for the internal type was? Presumably we need to follow the public interface mandated model, and hence require the new type to be introduced. Sorry, I was unclear -- there isn't an inconsistency; both internal and public headers currently say that writes are discarded and AFAICT that is what the code does. Not for hvm_hap_nested_page_fault() afaict - the forwarding to DM there contradicts the writes are discarded model that other code paths follow. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v4] x86: add p2m_mmio_write_dm
On 02.12.14 at 08:48, yu.c.zh...@linux.intel.com wrote: On 12/1/2014 8:31 PM, Jan Beulich wrote: On 01.12.14 at 13:13, t...@xen.org wrote: At 11:17 + on 01 Dec (1417429027), Jan Beulich wrote: On 01.12.14 at 11:30, t...@xen.org wrote: At 09:32 + on 01 Dec (1417422746), Jan Beulich wrote: On 01.12.14 at 09:49, yu.c.zh...@linux.intel.com wrote: To my understanding, pages with p2m_ram_ro are not supposed to be modified by guest. So in __hvm_copy(), when p2m type of a page is p2m_ram_rom, no copy will occur. However, to our usage, we just wanna this page to be write protected, so that our device model can be triggered to do some emulation. The content written to this page is supposed not to be dropped. This way, if sequentially a read operation is performed by guest to this page, the guest will still see its anticipated value. __hvm_copy() is only a helper function, and doesn't write to mmio_dm space either; instead its (indirect) callers would invoke hvmemul_do_mmio() upon seeing HVMCOPY_bad_gfn_to_mfn returns. The question hence is about the apparent inconsistency resulting from writes to ram_ro being dropped here but getting passed to the DM by hvm_hap_nested_page_fault(). Tim - is that really intentional? No - and AFAICT it shouldn't be happening. It _is_ how it was implemented originally, because it involved fewer moving parts and didn't need to be efficient (and after all, writes to entirely missing addresses go to the device model too). But the code was later updated to log and discard writes to read-only memory (see 4d8aa29 from Trolle Selander). Early version of p2m_ram_ro were documented in the internal headers as sending the writes to the DM, but the public interface (HVMMEM_ram_ro) has always said that writes are discarded. Hmm, so which way do you recommend resolving the inconsistency then - match what the public interface says or what the apparent original intention for the internal type was? Presumably we need to follow the public interface mandated model, and hence require the new type to be introduced. Sorry, I was unclear -- there isn't an inconsistency; both internal and public headers currently say that writes are discarded and AFAICT that is what the code does. Not for hvm_hap_nested_page_fault() afaict - the forwarding to DM there contradicts the writes are discarded model that other code paths follow. Thanks, Jan. By inconsistency, do you mean the p2m_ram_ro shall not trigger the handle_mmio_with_translation() in hvm_hap_nested_page_fault()? Yes, pending Tim's agreement. I'm also a bit confused with the writes are discarded/dropped comments in the code. Does this mean writes to the p2m_ram_ro pages should be abandoned without going to the dm, or going to the dm and ignored later? The code seems to be the second one. And it would seem to me that it should be the former. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH for-4.5] call xc_domain_irq_permission before xc_domain_irq_permission
On 01.12.14 at 20:22, stefano.stabell...@eu.citrix.com wrote: xc_physdev_unmap_pirq might revoke the permission to map the irq from the domain causing the following xc_domain_irq_permission call to fail and return error (domain_pirq_to_irq returns 0). Apart from the patch title being rather confusing, isn't the root problem then xc_physdev_unmap_pirq() removes permissions? At least after the strict splitting of permission granting and mapping for MMIO (commit 0561e1f0) it would seem that if the underlying hypercall here behaves similarly to the original behavior there, it ought to be fixed in an analogous way. Jan Call xc_domain_irq_permission first to prevent this from happening (xc_physdev_unmap_pirq calls irq_deny_access, that doesn't fail if the permission is already removed). This is a simple bug fix and I think should go in 4.5. Signed-off-by: Stefano Stabellini stefano.stabell...@eu.citrix.com diff --git a/tools/libxl/libxl_pci.c b/tools/libxl/libxl_pci.c index 316643c..904d255 100644 --- a/tools/libxl/libxl_pci.c +++ b/tools/libxl/libxl_pci.c @@ -1288,14 +1290,14 @@ skip1: goto out; } if ((fscanf(f, %u, irq) == 1) irq) { -rc = xc_physdev_unmap_pirq(ctx-xch, domid, irq); -if (rc 0) { -LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, xc_physdev_unmap_pirq irq=%d, irq); -} rc = xc_domain_irq_permission(ctx-xch, domid, irq, 0); if (rc 0) { LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, xc_domain_irq_permission irq=%d, irq); } +rc = xc_physdev_unmap_pirq(ctx-xch, domid, irq); +if (rc 0) { +LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, xc_physdev_unmap_pirq irq=%d, irq); +} } fclose(f); } ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel