Re: [Xen-devel] [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
>>> On 10.07.15 at 08:21, wrote:
>
>> -----Original Message-----
>> From: Jan Beulich [mailto:jbeul...@suse.com]
>> Sent: Thursday, July 09, 2015 3:26 PM
>> To: Wu, Feng; Tian, Kevin
>> Cc: Andrew Cooper; george.dun...@eu.citrix.com; Zhang, Yang Z;
>> xen-devel@lists.xen.org; k...@xen.org
>> Subject: RE: [Xen-devel] [v3 12/15] vmx: posted-interrupt handling when vCPU
>> is blocked
>>
>> >>> On 09.07.15 at 00:49, wrote:
>> >> From: Andrew Cooper [mailto:andrew.coop...@citrix.com]
>> >> Sent: Wednesday, July 08, 2015 9:09 PM
>> >> On 08/07/2015 13:46, Jan Beulich wrote:
>> >> On 08.07.15 at 13:00, wrote:
>> >> >>> @@ -1848,6 +1869,33 @@ static struct hvm_function_table __initdata
>> >> >>> vmx_function_table = {
>> >> >>>     .enable_msr_exit_interception = vmx_enable_msr_exit_interception,
>> >> >>> };
>> >> >>>
>> >> >>> +/*
>> >> >>> + * Handle VT-d posted-interrupt when VCPU is blocked.
>> >> >>> + */
>> >> >>> +static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
>> >> >>> +{
>> >> >>> +    struct arch_vmx_struct *vmx;
>> >> >>> +    unsigned int cpu = smp_processor_id();
>> >> >>> +
>> >> >>> +    spin_lock(&per_cpu(pi_blocked_vcpu_lock, cpu));
>> >> >>> +
>> >> >>> +    /*
>> >> >>> +     * FIXME: The length of the list depends on how many
>> >> >>> +     * vCPU is current blocked on this specific pCPU.
>> >> >>> +     * This may hurt the interrupt latency if the list
>> >> >>> +     * grows to too many entries.
>> >> >>> +     */
>> >> >> let's go with this linked list first until a real issue is identified.
>> >> > This is exactly the way of thinking I dislike when it comes to code
>> >> > that isn't intended to be experimental only: We shouldn't wait
>> >> > for problems to surface when we already can see them. I.e. if
>> >> > there are no plans to deal with this, I'd ask for the feature to be
>> >> > off by default and be properly marked experimental in the
>> >> > command line option documentation (so people know to stay
>> >> > away from it).
>> >>
>> >> And in this specific case, there is no balancing of vcpus across the
>> >> pcpus lists.
>> >>
>> >> One can construct a pathological case using pinning and pausing to get
>> >> almost every vcpu on a single pcpu list, and vcpus receiving fewer
>> >> interrupts will exacerbate the problem by staying on the list for longer
>> >> periods of time.
>> >
>> > In that extreme case I believe many contentions in other code paths will
>> > be much larger than the overhead caused by this structure limitation.
>>
>> Examples?
>>
>> >> IMO, the PI feature cannot be declared as done/supported with this bug
>> >> remaining. OTOH, it is fine to be experimental, and disabled by default
>> >> for people who wish to experiment.
>> >
>> > Again, I don't expect to see it disabled as experimental. For a good
>> > production environment where vcpus are well balanced and interrupt
>> > latency is sensitive, the linked list should be efficient here. For a bad
>> > environment like the extreme case you raised, I don't know whether it
>> > really matters to just tune the interrupt path.
>>
>> Can you _guarantee_ that everything potentially leading to such a
>> pathological situation is covered by XSA-77? And even if it is now,
>> removing elements from the waiver list would become significantly
>> more difficult if disconnected behavior like this one would need to
>> be taken into account.
>>
>> Please understand that history has told us to be rather more careful
>> than might seem necessary with this: ATS originally having been
>> enabled by default is one bold example, and the recent flood of MSI
>> related XSAs is another; I suppose I could find more. All affecting
>> code originating from Intel, apparently written with only functionality
>> in mind, while having left out (other than basic) security considerations.
>>
>> IOW, with my committer role hat on, the feature is going to be
>> experimental (and hence default off) unless the issue here gets
>> addressed.
>> And no, I cannot immediately suggest a good approach,
>> and with all of the rush before the feature freeze I also can't justify
>> taking a lot of time to think of options.
>
> Is it acceptable to you if I only add the blocked vcpus that have
> assigned devices to the list? I think that should shorten the
> length of the list.

I actually implied this to be the case already, i.e.
- if it's not, this needs to be fixed anyway,
- it's not going to eliminate the concern (just think of a couple of
  many-vCPU guests all having devices assigned).

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
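The filter Feng proposes above (only queue blocked vCPUs that can actually receive a posted interrupt, i.e. those with a device assigned) can be sketched as follows. This is a simplified illustrative model, not Xen's actual code: `struct vcpu`, `has_assigned_device`, and `pi_block_vcpu()` are hypothetical names standing in for the real per-pCPU blocked-vCPU list machinery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of the per-pCPU blocked-vCPU list under
 * discussion.  The point of the filter: pi_wakeup_interrupt() only needs
 * to wake vCPUs that a device can post an interrupt to, so vCPUs without
 * assigned devices are never queued, shortening the list. */
struct vcpu {
    bool has_assigned_device;
    struct vcpu *next;          /* singly linked per-pCPU blocking list */
};

struct pcpu_pi_state {
    struct vcpu *blocked_list;  /* head of this pCPU's blocked-vCPU list */
};

/* Queue v on the pCPU's list only when a wakeup interrupt can target it.
 * Returns true if the vCPU was queued. */
static bool pi_block_vcpu(struct pcpu_pi_state *p, struct vcpu *v)
{
    if (!v->has_assigned_device)
        return false;           /* nothing will ever post to this vCPU */
    v->next = p->blocked_list;
    p->blocked_list = v;
    return true;
}

/* Length of the list, the quantity Jan's latency concern is about. */
static size_t pi_list_len(const struct pcpu_pi_state *p)
{
    size_t n = 0;
    for (const struct vcpu *v = p->blocked_list; v; v = v->next)
        n++;
    return n;
}
```

As Jan notes, this filter shortens the list but does not bound it: several many-vCPU guests with assigned devices can still pile up on one pCPU.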
Re: [Xen-devel] Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
>>> On 10.07.15 at 07:59, wrote:
> If you agree with doing all this in a central place, maybe we can create
> an arch hook for 'struct scheduler' to do this and call it in all the places
> vcpu_runstate_change() gets called. What is your opinion about this?

Doing this in a central place is certainly the right approach, but adding
an arch hook that needs to be called everywhere vcpu_runstate_change() is
called wouldn't serve that purpose. Instead we'd need to replace all
current vcpu_runstate_change() calls with calls to a new function calling
both this and the to-be-added arch hook.

But please wait for George's / Dario's feedback, because they seem to be
even less convinced than me about your model of tying the updates to
runstate changes.

Jan
Re: [Xen-devel] [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
> -----Original Message-----
> From: Jan Beulich [mailto:jbeul...@suse.com]
> Sent: Thursday, July 09, 2015 3:26 PM
> To: Wu, Feng; Tian, Kevin
> Cc: Andrew Cooper; george.dun...@eu.citrix.com; Zhang, Yang Z;
> xen-devel@lists.xen.org; k...@xen.org
> Subject: RE: [Xen-devel] [v3 12/15] vmx: posted-interrupt handling when vCPU
> is blocked
>
> >>> On 09.07.15 at 00:49, wrote:
> >> From: Andrew Cooper [mailto:andrew.coop...@citrix.com]
> >> Sent: Wednesday, July 08, 2015 9:09 PM
> >> On 08/07/2015 13:46, Jan Beulich wrote:
> >> On 08.07.15 at 13:00, wrote:
> >> >>> @@ -1848,6 +1869,33 @@ static struct hvm_function_table __initdata
> >> >>> vmx_function_table = {
> >> >>>     .enable_msr_exit_interception = vmx_enable_msr_exit_interception,
> >> >>> };
> >> >>>
> >> >>> +/*
> >> >>> + * Handle VT-d posted-interrupt when VCPU is blocked.
> >> >>> + */
> >> >>> +static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
> >> >>> +{
> >> >>> +    struct arch_vmx_struct *vmx;
> >> >>> +    unsigned int cpu = smp_processor_id();
> >> >>> +
> >> >>> +    spin_lock(&per_cpu(pi_blocked_vcpu_lock, cpu));
> >> >>> +
> >> >>> +    /*
> >> >>> +     * FIXME: The length of the list depends on how many
> >> >>> +     * vCPU is current blocked on this specific pCPU.
> >> >>> +     * This may hurt the interrupt latency if the list
> >> >>> +     * grows to too many entries.
> >> >>> +     */
> >> >> let's go with this linked list first until a real issue is identified.
> >> > This is exactly the way of thinking I dislike when it comes to code
> >> > that isn't intended to be experimental only: We shouldn't wait
> >> > for problems to surface when we already can see them. I.e. if
> >> > there are no plans to deal with this, I'd ask for the feature to be
> >> > off by default and be properly marked experimental in the
> >> > command line option documentation (so people know to stay
> >> > away from it).
> >>
> >> And in this specific case, there is no balancing of vcpus across the
> >> pcpus lists.
> >>
> >> One can construct a pathological case using pinning and pausing to get
> >> almost every vcpu on a single pcpu list, and vcpus receiving fewer
> >> interrupts will exacerbate the problem by staying on the list for longer
> >> periods of time.
> >
> > In that extreme case I believe many contentions in other code paths will
> > be much larger than the overhead caused by this structure limitation.
>
> Examples?
>
> >> IMO, the PI feature cannot be declared as done/supported with this bug
> >> remaining. OTOH, it is fine to be experimental, and disabled by default
> >> for people who wish to experiment.
> >
> > Again, I don't expect to see it disabled as experimental. For a good
> > production environment where vcpus are well balanced and interrupt
> > latency is sensitive, the linked list should be efficient here. For a bad
> > environment like the extreme case you raised, I don't know whether it
> > really matters to just tune the interrupt path.
>
> Can you _guarantee_ that everything potentially leading to such a
> pathological situation is covered by XSA-77? And even if it is now,
> removing elements from the waiver list would become significantly
> more difficult if disconnected behavior like this one would need to
> be taken into account.
>
> Please understand that history has told us to be rather more careful
> than might seem necessary with this: ATS originally having been
> enabled by default is one bold example, and the recent flood of MSI
> related XSAs is another; I suppose I could find more. All affecting
> code originating from Intel, apparently written with only functionality
> in mind, while having left out (other than basic) security considerations.
>
> IOW, with my committer role hat on, the feature is going to be
> experimental (and hence default off) unless the issue here gets
> addressed.
> And no, I cannot immediately suggest a good approach,
> and with all of the rush before the feature freeze I also can't justify
> taking a lot of time to think of options.

Is it acceptable to you if I only add the blocked vcpus that have
assigned devices to the list? I think that should shorten the
length of the list.

Thanks,
Feng

> Jan
Re: [Xen-devel] PCI Pass-through in Xen ARM - Draft 2.
On Thu, Jul 9, 2015 at 7:27 PM, Julien Grall wrote:
>
> On 09/07/2015 12:30, Manish Jaggi wrote:
>>
>> On Thursday 09 July 2015 01:38 PM, Julien Grall wrote:
>>>
>>> On 09/07/2015 08:13, Manish Jaggi wrote:
>>>>>
>>>>> If this was a domctl there might be scope for accepting an
>>>>> implementation which made assumptions such as sbdf == deviceid. However
>>>>> I'd still like to see this topic given proper treatment in the design
>>>>> and not just glossed over with "this is how ThunderX does things".
>>>>
>>>> I got your point.
>>>>
>>>>> Or maybe the solution is simple and we should just do it now -- i.e. can
>>>>> we add a new field to the PHYSDEVOP_pci_host_bridge_add argument struct
>>>>> which contains the base deviceid for that bridge
>>>>
>>>> deviceId would be the same as sbdf, as we don't have a way to translate
>>>> sbdf to deviceID.
>>>
>>> I think we have to be clear in this design document about the
>>> different meanings.
>>>
>>> When the Device Tree is used, it's assumed that the deviceID will be
>>> equal to the requester ID and not the sbdf.
>>
>> Does SMMUv2 have a concept of requesterID?
>> I see the requesterID term in SMMUv3.
>
> The requester ID is part of the PCI spec and not the SMMU.
>
> The version of the SMMUv2 spec doesn't mention anything about PCI. I
> suspect this is because the spec was written before PCI support was
> introduced, and therefore this is implementation defined.
>
>>> Linux provides a function (pci_for_each_dma_alias) which will return a
>>> requester ID for a given PCI device. It appears that the BDF (the 's'
>>> of sBDF is only internal to Linux and not part of the hardware) is
>>> equal to the requester ID on your platform, but we can't assume it for
>>> anyone else.
>>
>> So you mean requesterID = pci_for_each_dma_alias(sbdf)?
>
> Yes.
>
>>> When we have a PCI device in hand, we have to find the requester ID
>>> for this device.
>>
>> That is the question: how to map a requesterID to an sbdf.
>
> See above.
>
>>> On
>>
>> Once?
>
> Yes.
>
>>> we have it we can deduce the streamID and the deviceID. The way to do
>>> it will depend on whether we use device tree or ACPI:
>>>   - For device tree, the streamID and deviceID will be equal to the
>>>     requester ID
>>
>> What do you think should be the streamID when a device is a PCI EP and
>> is enumerated? Also, per the ARM SMMU 2.0 spec, StreamID is
>> implementation specific. As per the SMMUv3 spec:
>> "For PCI, it is intended that StreamID is generated from the PCI
>> RequesterID. The generation function may be 1:1 where one Root Complex
>> is hosted by one SMMU"
>
> I think my sentence "For device tree, the streamID, and deviceID will be
> equal to the requester ID" is pretty clear. FWIW, this is the solution
> chosen for Linux:
>
> "Assume Stream ID == Requester ID for now. We need a way to describe the
> ID mappings in FDT" (see arm_smmu_add_pci_device in
> drivers/iommu/arm-smmu.c).
>
> You can refer to my point below about ACPI tables. The solution would be
> exactly the same. If we have a requester ID in hand we can do pretty
> much everything.

There is already one proposal by Mark Rutland on this topic about
describing the StreamID to Requester ID mapping in DT:
http://lists.infradead.org/pipermail/linux-arm-kernel/2015-March/333199.html

Probably, until it gets finalized, assuming RequesterID == StreamID is
the only way, since deriving the StreamID from the PCIe RequesterID will
vary from one vendor to another.

Thanks,
Pranav

> The whole point of my previous email is to give insight about what we
> need and what we can deduce based on firmware tables. I didn't cover
> any implementation details.
>
> Regards,
>
> --
> Julien Grall
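The ID derivation discussed above, under the device-tree assumption streamID == deviceID == requester ID, can be sketched like this. All names here (`pci_requester_id`, `stream_id_from_rid`, `device_id_from_rid`) are hypothetical illustrations, not functions from the Xen or Linux trees; the requester ID itself is the standard PCI BDF encoding.

```c
#include <assert.h>
#include <stdint.h>

/* PCI requester ID: bus (8 bits) : device (5 bits) : function (3 bits).
 * This is what Linux's pci_for_each_dma_alias() ultimately hands back
 * when there are no bridges introducing aliases. */
static inline uint16_t pci_requester_id(uint8_t bus, uint8_t dev, uint8_t fn)
{
    return (uint16_t)((bus << 8) | ((dev & 0x1f) << 3) | (fn & 0x7));
}

/* With no firmware description of the mapping yet (see Mark Rutland's DT
 * proposal referenced above), both IDs are taken to be the requester ID
 * verbatim, i.e. the 1:1 generation function the SMMUv3 spec mentions. */
static inline uint32_t stream_id_from_rid(uint16_t rid) { return rid; }
static inline uint32_t device_id_from_rid(uint16_t rid) { return rid; }
```

A platform whose root complex applies a different generation function would replace the two identity mappings with its vendor-specific translation.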
Re: [Xen-devel] [v7][PATCH 16/16] tools: parse to enable new rdm policy parameters
> The first issue (which would really be relevant to the documentation
> patch) is that the documentation is in a separate commit. There are
> sometimes valid reasons for doing this. I'm not sure if they apply,

Wei suggested we should organize/split all patches according to libxl,
libxc, xc and xl.

> but if they do this should be explained in one of the commit messages.
> If this was done I'm afraid I have missed it.

In this patch's head description, maybe I can change it to something like
this: "This patch parses to enable user configurable parameters to specify
RDM resource and according policies which are defined previously".

> +        }else if ( !strcmp(optkey, "rdm_policy") ) {
> +            if ( !strcmp(tok, "strict") ) {
> +                pcidev->rdm_policy = LIBXL_RDM_RESERVE_POLICY_STRICT;
> +            } else if ( !strcmp(tok, "relaxed") ) {
> +                pcidev->rdm_policy = LIBXL_RDM_RESERVE_POLICY_RELAXED;
> +            } else {
> +                XLU__PCI_ERR(cfg, "%s is not an valid PCI RDM property"
> +                                  " policy: 'strict' or 'relaxed'.",
> +                             tok);
> +                goto parse_error;
> +            }
>
> This section has coding style (whitespace) problems and long lines. If
> you need to respin, please fix them.

Are you saying this?

    } else if ( -> }else if (
    } else { -> }else {

Additionally, I couldn't find which line is over 80 characters.

> +    for (tok = ptr, end = ptr + strlen(ptr) + 1; ptr < end; ptr++) {
> +        switch(state) {
> +        case STATE_TYPE:
> +            if (*ptr == '=') {
> +                state = STATE_RDM_STRATEGY;
> +                *ptr = '\0';
> +                if (strcmp(tok, "strategy")) {
> +                    XLU__PCI_ERR(cfg, "Unknown RDM state option: %s", tok);
> +                    goto parse_error;
> +                }
> +                tok = ptr + 1;
> +            }
>
> This code is extremely repetitive.

I just refer to xlu_pci_parse_bdf():

    switch(state) {
    case STATE_DOMAIN:
        if ( *ptr == ':' ) {
            state = STATE_BUS;
            *ptr = '\0';
            if ( hex_convert(tok, &dom, 0x) )
                goto parse_error;
            tok = ptr + 1;
        }
        break;

> Really I would prefer that this parsing was done with a miniature flex
> parser, rather than ad-hoc pointer arithmetic and use of strtok.

Sorry, could you show this explicitly?
Thanks,
Tiejun
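Short of the miniature flex parser the reviewer asks for, the chained `strcmp()` dispatch quoted above could also be made table-driven, which removes the repetition without a new build dependency. This is a hedged sketch with hypothetical names; the real code would map onto libxl's `LIBXL_RDM_RESERVE_POLICY_*` constants and `XLU__PCI_ERR` reporting.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins for libxl's rdm policy values (illustrative only). */
enum { RDM_POLICY_INVALID = -1, RDM_POLICY_STRICT = 0, RDM_POLICY_RELAXED = 1 };

/* Keyword -> policy table: adding a policy means adding a row,
 * not another else-if branch. */
static const struct {
    const char *name;
    int policy;
} rdm_policies[] = {
    { "strict",  RDM_POLICY_STRICT },
    { "relaxed", RDM_POLICY_RELAXED },
};

/* Returns the policy for tok, or RDM_POLICY_INVALID so the caller can
 * report the parse error (goto parse_error in the quoted code). */
static int parse_rdm_policy(const char *tok)
{
    for (size_t i = 0; i < sizeof(rdm_policies) / sizeof(rdm_policies[0]); i++)
        if (!strcmp(tok, rdm_policies[i].name))
            return rdm_policies[i].policy;
    return RDM_POLICY_INVALID;
}
```

The same table shape works for the `strategy=` keywords, so one generic lookup helper could serve both option classes.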
Re: [Xen-devel] Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
> -----Original Message-----
> From: George Dunlap [mailto:george.dun...@eu.citrix.com]
> Sent: Thursday, July 09, 2015 10:28 PM
> To: Dario Faggioli; Jan Beulich
> Cc: Tian, Kevin; Wu, Feng; andrew.coop...@citrix.com; xen-devel; Zhang,
> Yang Z; k...@xen.org
> Subject: Re: [Xen-devel] Fwd: [v3 14/15] Update Posted-Interrupts
> Descriptor during vCPU scheduling
>
> On 07/09/2015 03:18 PM, Dario Faggioli wrote:
> > On Thu, 2015-07-09 at 14:44 +0100, Jan Beulich wrote:
> > On 09.07.15 at 14:53, wrote:
> >
> >>> Consider the following scenario:
> >>> - v1 blocks on pcpu 0.
> >>> - vcpu_runstate_change() will do everything necessary for v1 on p0.
> >>> - The scheduler does load balancing and moves v1 to p1, calling
> >>>   vcpu_migrate(). Because the vcpu is still blocked,
> >>>   vcpu_runstate_change() is not called.
> >>> - A device interrupt is generated.
> >>>
> >>> What happens to the interrupt? Does everything still work properly, or
> >>> will the device wake-up interrupt go to the wrong pcpu (p0 rather than
> >>> p1)?
> >>
> >> I think much of this was discussed before, since I also disliked the
> >> hooking into vcpu_runstate_change(). What I remember having
> >> been told is that it really only matters which pCPU's list a vCPU is
> >> on, not what v->processor says.
> >
> > Right.
> >
> > But, as far as I could understand from the patches I've seen, a vcpu
> > ends up in a list when it blocks, and when it blocks there will be a
> > context switch, and hence we can deal with the queueing during the
> > context switch itself (which is, in part, an arch specific operation
> > already).
> >
> > What am I missing?
>
> I think what you're missing is that Jan is answering my question about
> migrating a blocked vcpu, not arguing that vcpu_runstate_change() is the
> right way to go. At least that's how I understood him. :-)
>
> But regarding context_switch: I think the reason we need more hooks than
> that is that context_switch only changes into and out of running state.
> There are also changes that need to happen when you change from blocked
> to offline, offline to blocked, blocked to runnable, &c; these don't go
> through context_switch. That's why I was suggesting some architectural
> equivalents to the SCHED_OP() callbacks to be added to vcpu_wake &c.
>
> vcpu_runstate_change() is at the moment a nice quiet cul-de-sac that
> just does a little bit of accounting; I'd rather not have it suddenly
> become a major thoroughfare for runstate change hooks, if we can avoid
> it. :-)

So in my understanding, vcpu_runstate_change() is a central place to do
this, which is good. However, this function was originally designed to
serve only for accounting; it is a little intrusive to make it so
important after adding the hooks in it. If you agree with doing all this
in a central place, maybe we can create an arch hook for 'struct
scheduler' to do this and call it in all the places vcpu_runstate_change()
gets called. What is your opinion about this?

Thanks,
Feng

> -George
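Jan's counter-proposal in the sibling thread (a new function wrapping both the accounting call and an arch hook, replacing every current call site) can be sketched as below. Everything here is an illustrative model with hypothetical names, not Xen's actual interfaces; the counters exist only to make the call flow observable.

```c
#include <assert.h>

/* Toy model of the "central place" idea: call sites stop calling
 * vcpu_runstate_change() directly and instead call a wrapper that does
 * the accounting *and* invokes an arch hook (e.g. the VT-d PI descriptor
 * update on x86).  Names are illustrative, not the real Xen ones. */
enum runstate { RUNSTATE_RUNNING, RUNSTATE_RUNNABLE,
                RUNSTATE_BLOCKED, RUNSTATE_OFFLINE };

struct vcpu {
    enum runstate state;
    int accounting_updates;     /* times the accounting path ran */
    int arch_hook_calls;        /* times the arch hook ran */
};

/* The existing function: accounting only, left untouched. */
static void vcpu_runstate_change(struct vcpu *v, enum runstate new_state)
{
    v->state = new_state;
    v->accounting_updates++;
}

/* The to-be-added arch hook, covering transitions (blocked -> offline,
 * blocked -> runnable, ...) that never pass through context_switch. */
static void arch_vcpu_runstate_hook(struct vcpu *v, enum runstate new_state)
{
    (void)new_state;
    v->arch_hook_calls++;
}

/* The new central function every current call site would switch to. */
static void vcpu_change_state(struct vcpu *v, enum runstate new_state)
{
    vcpu_runstate_change(v, new_state);
    arch_vcpu_runstate_hook(v, new_state);
}
```

This keeps `vcpu_runstate_change()` the "quiet cul-de-sac" George wants, while still giving the architecture one choke point for every runstate transition.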
Re: [Xen-devel] [v7][PATCH 13/16] libxl: construct e820 map with RDM information for HVM guest
>  tools/libxl/libxl_dom.c      |  5 +++
>  tools/libxl/libxl_internal.h | 24 +
>  tools/libxl/libxl_x86.c      | 83 ...
>
> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> index 62ef120..41da479 100644
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -1004,6 +1004,11 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
>          goto out;
>      }
>
> +    if (libxl__domain_construct_e820(gc, d_config, domid, &args)) {
> +        LOG(ERROR, "setting domain memory map failed");
> +        goto out;
> +    }
>
> This is platform-independent code, isn't it ? In which case this will
> break the build on ARM, I think. Would an ARM maintainer please confirm.

I think you're right. I should make this arch-specific, since here we're
talking about e820. So I tried to refactor this patch:

diff --git a/tools/libxl/libxl_arch.h b/tools/libxl/libxl_arch.h
index d04871c..939178a 100644
--- a/tools/libxl/libxl_arch.h
+++ b/tools/libxl/libxl_arch.h
@@ -49,4 +49,11 @@ int libxl__arch_vnuma_build_vmemrange(libxl__gc *gc,
 _hidden
 int libxl__arch_domain_map_irq(libxl__gc *gc, uint32_t domid, int irq);

+/* arch specific to construct memory mapping function */
+_hidden
+int libxl__arch_domain_construct_memmap(libxl__gc *gc,
+                                        libxl_domain_config *d_config,
+                                        uint32_t domid,
+                                        struct xc_hvm_build_args *args);
+
 #endif
diff --git a/tools/libxl/libxl_arm.c b/tools/libxl/libxl_arm.c
index f09c860..1526467 100644
--- a/tools/libxl/libxl_arm.c
+++ b/tools/libxl/libxl_arm.c
@@ -926,6 +926,14 @@ int libxl__arch_domain_map_irq(libxl__gc *gc, uint32_t domid, int irq)
     return xc_domain_bind_pt_spi_irq(CTX->xch, domid, irq, irq);
 }

+int libxl__arch_domain_construct_memmap(libxl__gc *gc,
+                                        libxl_domain_config *d_config,
+                                        uint32_t domid,
+                                        struct xc_hvm_build_args *args)
+{
+    return 0;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 62ef120..691c1f6 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -1004,6 +1004,11 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
         goto out;
     }
+    if (libxl__arch_domain_construct_memmap(gc, d_config, domid, &args)) {
+        LOG(ERROR, "setting domain memory map failed");
+        goto out;
+    }
+
     ret = hvm_build_set_params(ctx->xch, domid, info, state->store_port,
                                &state->store_mfn, state->console_port,
                                &state->console_mfn, state->store_domid,
diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
index ed2bd38..66b3d7f 100644
--- a/tools/libxl/libxl_x86.c
+++ b/tools/libxl/libxl_x86.c
@@ -438,6 +438,89 @@ int libxl__arch_domain_map_irq(libxl__gc *gc, uint32_t domid, int irq)
 }

 /*
+ * Here we're just trying to set these kinds of e820 mappings:
+ *
+ * #1. Low memory region
+ *
+ * Low RAM starts at least from 1M to make sure all standard regions
+ * of the PC memory map, like BIOS, VGA memory-mapped I/O and vgabios,
+ * have enough space.
+ * Note: the stuff below 1M is still constructed with multiple
+ * e820 entries by hvmloader. At this point we don't change anything.
+ *
+ * #2. RDM region if it exists
+ *
+ * #3. High memory region if it exists
+ *
+ * Note: these regions are not overlapping since we already check
+ * to adjust them. Please refer to libxl__domain_device_construct_rdm().
+ */
+#define GUEST_LOW_MEM_START_DEFAULT 0x10
+int libxl__arch_domain_construct_memmap(libxl__gc *gc,
+                                        libxl_domain_config *d_config,
+                                        uint32_t domid,
+                                        struct xc_hvm_build_args *args)
+{
...

> Aside from that I have no issues with this patch.

But if you think I should resend this, let me know.

Thanks,
Tiejun
[Xen-devel] [linux-3.4 test] 59267: regressions - FAIL
flight 59267 linux-3.4 real [real]
http://logs.test-lab.xenproject.org/osstest/logs/59267/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-amd64-amd64-xl-qemut-win7-amd64  6 xen-boot  fail REGR. vs. 30511

Tests which are failing intermittently (not blocking):
 test-amd64-amd64-xl-sedf-pin  6 xen-boot  fail in 58831 pass in 58798
 test-amd64-i386-xl-qemuu-win7-amd64  9 windows-install  fail in 58831 pass in 59267
 test-amd64-amd64-pair  10 xen-boot/dst_host  fail pass in 58798
 test-amd64-amd64-pair  9 xen-boot/src_host  fail pass in 58798
 test-amd64-i386-pair  10 xen-boot/dst_host  fail pass in 58831
 test-amd64-i386-pair  9 xen-boot/src_host  fail pass in 58831

Regressions which are regarded as allowable (not blocking):
 test-amd64-amd64-xl-qemut-debianhvm-amd64-xsm  6 xen-boot  fail baseline untested
 test-amd64-i386-xl-qemut-debianhvm-amd64-xsm  6 xen-boot  fail baseline untested
 test-amd64-i386-xl-qemuu-debianhvm-amd64-xsm  6 xen-boot  fail baseline untested
 test-amd64-amd64-xl-multivcpu  6 xen-boot  fail baseline untested
 test-amd64-amd64-xl-qemuu-debianhvm-amd64-xsm  6 xen-boot  fail baseline untested
 test-amd64-amd64-libvirt-xsm  6 xen-boot  fail baseline untested
 test-amd64-i386-libvirt-xsm  6 xen-boot  fail baseline untested
 test-amd64-amd64-xl-credit2  6 xen-boot  fail baseline untested
 test-amd64-amd64-xl-xsm  6 xen-boot  fail baseline untested
 test-amd64-i386-xl-xsm  6 xen-boot  fail baseline untested
 test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm  12 guest-localmigrate  fail baseline untested
 test-amd64-amd64-xl-rtds  6 xen-boot  fail baseline untested
 test-amd64-amd64-xl-sedf  6 xen-boot  fail in 58831 like 30406
 test-amd64-i386-libvirt  11 guest-start  fail like 30511
 test-amd64-amd64-libvirt  11 guest-start  fail like 30511
 test-amd64-amd64-xl-qemuu-win7-amd64  16 guest-stop  fail like 30511
 test-amd64-i386-xl-qemut-win7-amd64  16 guest-stop  fail like 30511
 test-amd64-amd64-xl-qemuu-ovmf-amd64  6 xen-boot  fail like 53709-bisect
 test-amd64-i386-xl  6 xen-boot  fail like 53725-bisect
 test-amd64-i386-freebsd10-amd64  6 xen-boot  fail like 58780-bisect
 test-amd64-i386-xl-qemuu-winxpsp3  6 xen-boot  fail like 58786-bisect
 test-amd64-i386-qemut-rhel6hvm-intel  6 xen-boot  fail like 58788-bisect
 test-amd64-i386-rumpuserxen-i386  6 xen-boot  fail like 58799-bisect
 test-amd64-i386-xl-qemuu-winxpsp3-vcpus1  6 xen-boot  fail like 58801-bisect
 test-amd64-amd64-xl-qemuu-debianhvm-amd64  6 xen-boot  fail like 58803-bisect
 test-amd64-amd64-xl-qemut-winxpsp3  6 xen-boot  fail like 58804-bisect
 test-amd64-i386-freebsd10-i386  6 xen-boot  fail like 58805-bisect
 test-amd64-i386-xl-qemuu-ovmf-amd64  6 xen-boot  fail like 58806-bisect
 test-amd64-amd64-xl-qemuu-winxpsp3  6 xen-boot  fail like 58807-bisect
 test-amd64-i386-xl-qemut-winxpsp3  6 xen-boot  fail like 58808-bisect
 test-amd64-i386-xl-qemut-winxpsp3-vcpus1  6 xen-boot  fail like 58809-bisect
 test-amd64-amd64-rumpuserxen-amd64  6 xen-boot  fail like 58810-bisect
 test-amd64-i386-xl-qemuu-debianhvm-amd64  6 xen-boot  fail like 58811-bisect
 test-amd64-amd64-xl-qemut-debianhvm-amd64  6 xen-boot  fail like 58813-bisect
 test-amd64-i386-qemuu-rhel6hvm-intel  6 xen-boot  fail like 58814-bisect
 test-amd64-i386-xl-qemut-debianhvm-amd64  6 xen-boot  fail like 58815-bisect

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-libvirt-xsm  12 migrate-support-check  fail in 58831 never pass
 test-amd64-i386-libvirt  12 migrate-support-check  fail in 58831 never pass
 test-amd64-amd64-libvirt  12 migrate-support-check  fail in 58831 never pass
 test-amd64-amd64-xl-pvh-amd  11 guest-start  fail never pass
 test-amd64-amd64-xl-pvh-intel  11 guest-start  fail never pass
 test-amd64-i386-xl-qemuu-win7-amd64  16 guest-stop  fail never pass

version targeted for testing:
 linux  cf1b3dad6c5699b977273276bada8597636ef3e2
baseline version:
 linux  bb4a05a0400ed6d2f1e13d1f82f289ff74300a70

Last test of basis   30511  2014-09-29 16:37:46 Z  283 days
Failing since        32004  2014-12-02 04:10:03 Z  219 days  169 attempts
Testing same since   58781  2015-06-20 14:15:50 Z   19 days   23 attempts

500 people touched revisions under test, not listing them all

jobs:
 build-amd64-xsm  pass
 build-i386-xsm   pass
 build-amd64
Re: [Xen-devel] [v7][PATCH 11/16] tools/libxl: detect and avoid conflicts with RDM
> I have found a few things in this patch which I would like to see
> improved. See below. Given how late I am with this review, I do not feel
> that I should be nacking it at this time. You have a tools ack from Wei,
> so my comments are not a blocker for this series. But if you need to
> respin, please take these comments into account, and consider which are
> feasible to fix in the time available. If you are respinning this series
> targeting Xen 4.7 or later, please address all of the points I make
> below.

Thanks for your comments; it looks like I should address them now.
Thanks.

> +int libxl__domain_device_construct_rdm(libxl__gc *gc,
> +                                       libxl_domain_config *d_config,
> +                                       uint64_t rdm_mem_boundary,
> +                                       struct xc_hvm_build_args *args)
> ...
> +    uint64_t highmem_end = args->highmem_end ? args->highmem_end : (1ull\
> <<32);
> ...
> +    if (strategy == LIBXL_RDM_RESERVE_STRATEGY_IGNORE && !d_config->num_\
> pcidevs)
>
> There are quite a few of these long lines, which should be wrapped. See
> tools/libxl/CODING_STYLE.

Sorry, I can't find any case like the ones you're talking about. So are
you saying this?

    if (strategy == LIBXL_RDM_RESERVE_STRATEGY_IGNORE && !d_config->num_pcidevs)

Or:

@@ -143,6 +143,15 @@ static bool overlaps_rdm(uint64_t start, uint64_t memsize,
 }

 /*
+ * Check whether any rdm should be exposed.
+ * Returns true if needed, else returns false.
+ */
+static bool exposes_rdm(uint32_t strategy, int num_pcidevs)
+{
+    return strategy == LIBXL_RDM_RESERVE_STRATEGY_IGNORE && !num_pcidevs;
+}
+
+/*
  * Check reported RDM regions and handle potential gfn conflicts according
  * to user preferred policy.
  *
@@ -182,7 +191,7 @@ int libxl__domain_device_construct_rdm(libxl__gc *gc,
     uint64_t highmem_end = args->highmem_end ? args->highmem_end : (1ull<<32);

     /* Might not expose rdm.
      */
-    if (strategy == LIBXL_RDM_RESERVE_STRATEGY_IGNORE && !d_config->num_pcidevs)
+    if (exposes_rdm(strategy, d_config->num_pcidevs))
         return 0;

     /* Query all RDM entries in this platform */

> +        d_config->num_rdms = nr_entries;
> +        d_config->rdms = libxl__realloc(NOGC, d_config->rdms,
> +                                d_config->num_rdms * sizeof(libxl_device_rdm));
>
> This code is remarkably similar to a function later on which adds an
> rdm. Please can you factor it out.

Do you mean I should merge them into one as far as possible? But that
seems not to be possible, because we have several combinations of these
two conditions: strategy == LIBXL_RDM_RESERVE_STRATEGY_HOST, and whether
one or more pci devices are also passed through.

#1. strategy == LIBXL_RDM_RESERVE_STRATEGY_HOST but without any devices

So it appropriately needs this libxl__realloc() here. The second
libxl__realloc() doesn't take any effect.

#2. strategy == LIBXL_RDM_RESERVE_STRATEGY_HOST with one or more devices

Actually we don't need the second libxl__realloc(). This is the same
as #1.

#3. strategy != LIBXL_RDM_RESERVE_STRATEGY_HOST, also with one or more
devices

So we just need the second libxl__realloc() later. The first
libxl__realloc() isn't called at all. Especially, it's very possible
we're going to handle fewer RDMs, compared to
LIBXL_RDM_RESERVE_STRATEGY_HOST.

> +    } else
> +        d_config->num_rdms = 0;
>
> Please can you put { } around the else block too. I don't think this
> mixed style is good.

Fixed.

> +        for (j = 0; j < d_config->num_rdms; j++) {
> +            if (d_config->rdms[j].start ==
> +                (uint64_t)xrdm[0].start_pfn << XC_PAGE_SHIFT)
>
> This construct (uint64_t)some_pfn << XC_PAGE_SHIFT appears an awful
> lot. I would prefer it if it were done in an inline function (or maybe
> a macro).

Like this?

    #define (x) ((uint64_t)x << XC_PAGE_SHIFT)

Sorry, I can't figure out a good name here :) Any suggestions?

> +    libxl_domain_build_info *const info = &d_config->b_info;
> +    /*
> +     * Currently we fix this as 2G to guarantte how to handle
>                                        ^ Should read "guarantee".

Fixed.
> +    ret = libxl__domain_device_construct_rdm(gc, d_config,
> +                                             rdm_mem_boundary,
> +                                             &args);
> +    if (ret) {
> +        LOG(ERROR, "checking reserved device memory failed");
> +        goto out;
> +    }
>
> `rc' should be used here rather than `ret'. (It is unfortunate that this
> function has poor style already, but it would be best not to make it
> worse.)

I can do this, but according to tools/libxl/CODING_STYLE, it looks like
I should first post a separate patch to fix this code style issue, and
then rebase my patch, right?

Thanks,
Tiejun
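One possible shape for the pfn-shift helper requested in the review above, sketched here with a hypothetical name (`pfn_to_paddr`) and a local stand-in for `XC_PAGE_SHIFT`. The point of centralizing it is the cast: the pfn must be widened to 64 bits before shifting, or frame numbers at or above 1M (addresses at or above 4G with 4K pages) overflow a 32-bit intermediate.

```c
#include <assert.h>
#include <stdint.h>

#define XC_PAGE_SHIFT_SKETCH 12   /* 4K pages; stand-in for XC_PAGE_SHIFT */

/* Convert a page frame number to a physical address, widening first so
 * the shift cannot overflow a 32-bit intermediate. */
static inline uint64_t pfn_to_paddr(uint32_t pfn)
{
    return (uint64_t)pfn << XC_PAGE_SHIFT_SKETCH;
}
```

Every `(uint64_t)some_pfn << XC_PAGE_SHIFT` occurrence in the quoted hunks would then become a `pfn_to_paddr(some_pfn)` call, with no way to forget the cast.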
Re: [Xen-devel] [PATCH v2 00/27] Libxl migration v2
On 07/10/2015 02:26 AM, Andrew Cooper wrote:
> This series adds support for the libxl migration v2 stream, and
> untangles the existing layering violations of the toolstack and qemu
> records.
>
> It can be found on the branch "libxl-migv2-v2"
>   git://xenbits.xen.org/people/andrewcoop/xen.git
>   http://xenbits.xen.org/git-http/people/andrewcoop/xen.git
>
> Major changes in v2 are being rebased over the libxl AO-abort series,
> and a redesign of the internal logic to support Remus/COLO buffering
> and failover.

Great! I've read the series; the redesigned logic seems correct to me. I
will rebase the COLO series and test to see whether it works, thanks!

> At the end of the series, legacy migration is no longer used.
>
> The Remus code is untested by me. All other combinations of
> suspend/migrate/resume have been tested with PV and HVM guests
> (qemu-trad and qemu-upstream), including 32 -> 64 bit migration (which
> was the underlying bug causing us to write migration v2 in the first
> place).
>
> Anyway, thoughts/comments welcome. Please test!
~Andrew

Summary of Acks/Modified/New from v1

N  bsd-sys-queue-h-seddery: Massage `offsetof'
A  tools/libxc: Always compile the compat qemu variables into xc_sr_context
A  tools/libxl: Introduce ROUNDUP()
N  tools/libxl: Introduce libxl__kill()
AM tools/libxl: Stash all restore parameters in domain_create_state
N  tools/libxl: Split libxl__domain_create_state.restore_fd in two
M  tools/libxl: Extra management APIs for the save helper
AM tools/xl: Mandatory flag indicating the format of the migration stream
   docs: Libxl migration v2 stream specification
AM tools/python: Libxc migration v2 infrastructure
AM tools/python: Libxl migration v2 infrastructure
N  tools/python: Other migration infrastructure
AM tools/python: Verification utility for v2 stream spec compliance
AM tools/python: Conversion utility for legacy migration streams
M  tools/libxl: Migration v2 stream format
M  tools/libxl: Infrastructure for reading a libxl migration v2 stream
M  tools/libxl: Support converting a legacy stream to a v2 stream
M  tools/libxl: Convert a legacy stream if needed
M  tools/libxc+libxl+xl: Restore v2 streams
M  tools/libxl: Infrastructure for writing a v2 stream
M  tools/libxc+libxl+xl: Save v2 streams
AM docs/libxl: Introduce CHECKPOINT_END to support migration v2 remus streams
M  tools/libxl: Write checkpoint records into the stream
M  tools/libx{c,l}: Introduce restore_callbacks.checkpoint()
M  tools/libxl: Handle checkpoint records in a libxl migration v2 stream
A  tools/libxc: Drop all XG_LIBXL_HVM_COMPAT code from libxc
A  tools/libxl: Drop all knowledge of toolstack callbacks

Andrew Cooper (23):
  tools/libxc: Always compile the compat qemu variables into xc_sr_context
  tools/libxl: Introduce ROUNDUP()
  tools/libxl: Introduce libxl__kill()
  tools/libxl: Stash all restore parameters in domain_create_state
  tools/libxl: Split libxl__domain_create_state.restore_fd in two
  tools/libxl: Extra management APIs for the save helper
  tools/xl: Mandatory flag indicating the format of the migration stream
  docs: Libxl migration v2 stream specification
  tools/python: Libxc migration v2 infrastructure
  tools/python: Libxl migration v2 infrastructure
  tools/python: Other migration infrastructure
  tools/python: Verification utility for v2 stream spec compliance
  tools/python: Conversion utility for legacy migration streams
  tools/libxl: Support converting a legacy stream to a v2 stream
  tools/libxl: Convert a legacy stream if needed
  tools/libxc+libxl+xl: Restore v2 streams
  tools/libxc+libxl+xl: Save v2 streams
  docs/libxl: Introduce CHECKPOINT_END to support migration v2 remus streams
  tools/libxl: Write checkpoint records into the stream
  tools/libx{c,l}: Introduce restore_callbacks.checkpoint()
  tools/libxl: Handle checkpoint records in a libxl migration v2 stream
  tools/libxc: Drop all XG_LIBXL_HVM_COMPAT code from libxc
  tools/libxl: Drop all knowledge of toolstack callbacks

Ian Jackson (1):
  bsd-sys-queue-h-seddery: Massage `offsetof'

Ross Lagerwall (3):
  tools/libxl: Migration v2 stream format
  tools/libxl: Infrastructure for reading a libxl migration v2 stream
  tools/libxl: Infrastructure for writing a v2 stream

 docs/specs/libxl-migration-stream.pandoc           | 217 ++
 tools/include/xen-external/bsd-sys-queue-h-seddery |   2 +
 tools/libxc/Makefile                               |   2 -
 tools/libxc/include/xenguest.h                     |   9 +
 tools/libxc/xc_sr_common.h                         |  12 +-
 tools/libxc/xc_sr_restore.c                        |  71 +-
 tools/libxc/xc_sr_restore_x86_hvm.c                | 124
 tools/libxc/xc_sr_save_x86_hvm.c                   |  36 -
 tools/libxl/Makefile                               |   2 +
 tools/libxl/libxl.h                                |  19 +
 tools/libxl/libxl_aoutils.c                        |  15 +
 tools/li
[Xen-devel] Question about separating request and response rings in PV network
Hi, all:

I am trying to improve the performance of netfront/netback, and I found that there was some discussion about PV network performance improvement on the devel mailing list ([1]). The proposals mentioned in [1] are helpful, such as multi-page rings, multi-queue, etc., and some of them have already been merged upstream. However, I am wondering: if we already use multi-page rings, why is "separate request and response rings", as listed in [1], also helpful for performance? I have read the implementation of the vmxnet3 driver; it does have separate request and response rings, but I cannot understand what the advantage of this method is. Thanks.

[1] http://lists.xen.org/archives/html/xen-devel/2013-05/msg01904.html

Best Regards
Lui
[Xen-devel] [linux-4.1 test] 59275: tolerable FAIL - PUSHED
flight 59275 linux-4.1 real [real]
http://logs.test-lab.xenproject.org/osstest/logs/59275/

Failures :-/ but no regressions.

Tests which are failing intermittently (not blocking):
 test-amd64-amd64-xl-qemut-win7-amd64 16 guest-stop fail pass in 59143

Regressions which are regarded as allowable (not blocking):
 test-amd64-i386-libvirt 11 guest-start fail REGR. vs. 59031
 test-armhf-armhf-xl-credit2 15 guest-start/debian.repeat fail blocked in 59031
 test-amd64-amd64-xl-qemuu-win7-amd64 16 guest-stop fail in 59143 like 59031
 test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm 15 guest-localmigrate/x10 fail in 59143 like 59260-bisect
 test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm 12 guest-localmigrate fail like 59257-bisect

Tests which did not succeed, but are not blocking:
 test-amd64-i386-freebsd10-amd64 9 freebsd-install fail never pass
 test-amd64-i386-freebsd10-i386 9 freebsd-install fail never pass
 test-amd64-amd64-xl-pvh-intel 13 guest-saverestore fail never pass
 test-amd64-amd64-xl-pvh-amd 11 guest-start fail never pass
 test-amd64-i386-libvirt-xsm 11 guest-start fail never pass
 test-amd64-amd64-libvirt-xsm 12 migrate-support-check fail never pass
 test-amd64-amd64-libvirt 12 migrate-support-check fail never pass
 test-armhf-armhf-xl-rtds 11 guest-start fail never pass
 test-armhf-armhf-xl-arndale 12 migrate-support-check fail never pass
 test-armhf-armhf-libvirt-xsm 12 migrate-support-check fail never pass
 test-armhf-armhf-xl-xsm 12 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit2 12 migrate-support-check fail never pass
 test-armhf-armhf-xl-cubietruck 12 migrate-support-check fail never pass
 test-armhf-armhf-libvirt 12 migrate-support-check fail never pass
 test-amd64-i386-xl-qemuu-win7-amd64 16 guest-stop fail never pass
 test-armhf-armhf-xl 12 migrate-support-check fail never pass
 test-amd64-i386-xl-qemut-win7-amd64 16 guest-stop fail never pass
 test-armhf-armhf-xl-multivcpu 12 migrate-support-check fail never pass

version targeted for testing:
linux  6a010c0abd49388a49af3d5a5bfc00e0d5767607
baseline version:
linux  b953c0d234bc72e8489d3bf51a276c5c4ec85345

Last test of basis  59031  2015-07-02 23:39:59 Z  7 days
Testing same since  59054  2015-07-05 10:20:43 Z  4 days  5 attempts

People who touched revisions under test:
 Alexander Shishkin
 Alexey Sokolov
 Andi Kleen
 Arnaldo Carvalho de Melo
 Borislav Petkov
 Dmitry Tunin
 Greg Kroah-Hartman
 Imre Palik
 Ingo Molnar
 Jiri Olsa
 Kalle Valo
 Lukas Wunner
 Marcel Holtmann
 Oleg Nesterov
 Palik, Imre
 Peter Zijlstra (Intel)
 Rafał Miłecki

jobs:
 build-amd64-xsm                                       pass
 build-armhf-xsm                                       pass
 build-i386-xsm                                        pass
 build-amd64                                           pass
 build-armhf                                           pass
 build-i386                                            pass
 build-amd64-libvirt                                   pass
 build-armhf-libvirt                                   pass
 build-i386-libvirt                                    pass
 build-amd64-pvops                                     pass
 build-armhf-pvops                                     pass
 build-i386-pvops                                      pass
 build-amd64-rumpuserxen                               pass
 build-i386-rumpuserxen                                pass
 test-amd64-amd64-xl                                   pass
 test-armhf-armhf-xl                                   pass
 test-amd64-i386-xl                                    pass
 test-amd64-amd64-xl-qemut-debianhvm-amd64-xsm         pass
 test-amd64-i386-xl-qemut-debianhvm-amd64-xsm          pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64-xsm         pass
 test-amd64-i386-xl-qemuu-debianhvm-amd64-xsm          pass
 test-amd64-amd64-xl-qemut-stubdom-debianhvm-amd64-xsm pass
 test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm  fail
 test-amd64-amd64-libvirt-xsm                          pass
 test-armhf-armhf-libvirt-xsm                          pass
 test-amd64-i386-libvirt-xsm                           fail
 test-amd64-amd64-xl-xsm                               pass
 test-armhf-armhf-xl-xsm                               pass
 test-amd64-i386-xl-xsm                                pass
 test-amd64-amd64-xl-pvh-amd
Re: [Xen-devel] [PATCH v4 15/15] tools/xen-access: altp2m testcases
> @@ -546,6 +652,23 @@ int main(int argc, char *argv[]) > } > > break; > +case VM_EVENT_REASON_SINGLESTEP: > +printf("Singlestep: rip=%016"PRIx64", vcpu %d\n", > + req.regs.x86.rip, > + req.vcpu_id); > + > +if ( altp2m ) > +{ > +printf("\tSwitching altp2m to view %u!\n", > altp2m_view_id); > + > +rsp.reason = VM_EVENT_REASON_MEM_ACCESS;

So this was a workaround for v3 of the series that is no longer necessary; it's probably cleaner to set the same reason on the response as on the request. It's not against any rule, so the code is still correct and works; it's just not best practice. So, in case there is another round on the series, it could be fixed then.

Tamas
Re: [Xen-devel] [PATCH v2 04/27] tools/libxl: Introduce libxl__kill()
On 07/10/2015 02:26 AM, Andrew Cooper wrote:
> as a wrapper to kill(2), and use it in preference to sendig in
> libxl_save_callout.c.

s/sendig/sendsig/

Signed-off-by: Andrew Cooper CC: Ian Campbell CC: Ian Jackson CC: Wei Liu --- Logically new in v2 - split out from a v1 change which was itself a cherrypick-and-modify from the AO Abort series --- tools/libxl/libxl_aoutils.c | 15 +++ tools/libxl/libxl_internal.h |2 ++ tools/libxl/libxl_save_callout.c | 10 ++ 3 files changed, 19 insertions(+), 8 deletions(-) diff --git a/tools/libxl/libxl_aoutils.c b/tools/libxl/libxl_aoutils.c index 0931eee..274ef39 100644 --- a/tools/libxl/libxl_aoutils.c +++ b/tools/libxl/libxl_aoutils.c @@ -621,3 +621,18 @@ bool libxl__async_exec_inuse(const libxl__async_exec_state *aes) assert(time_inuse == child_inuse); return child_inuse; } + +void libxl__kill(libxl__gc *gc, pid_t pid, int sig, const char *what) +{ +int r = kill(pid, sig); +if (r) LOGE(WARN, "failed to kill() %s [%lu] (signal %d)", +what, (unsigned long)pid, sig); +} + +/* + * Local variables: + * mode: C + * c-basic-offset: 4 + * indent-tabs-mode: nil + * End: + */ diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h index 19fc425..9147de1 100644 --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -2244,6 +2244,8 @@ struct libxl__async_exec_state { int libxl__async_exec_start(libxl__async_exec_state *aes); bool libxl__async_exec_inuse(const libxl__async_exec_state *aes); +void libxl__kill(libxl__gc *gc, pid_t pid, int sig, const char *what); + /*- device addition/removal -*/ typedef struct libxl__ao_device libxl__ao_device; diff --git a/tools/libxl/libxl_save_callout.c b/tools/libxl/libxl_save_callout.c index 087c2d5..b82a5c1 100644 --- a/tools/libxl/libxl_save_callout.c +++ b/tools/libxl/libxl_save_callout.c @@ -244,12 +244,6 @@ static void run_helper(libxl__egc *egc, libxl__save_helper_state *shs, libxl__carefd_close(childs_pipes[1]); helper_failed(egc, shs, rc);; } -static void
sendsig(libxl__gc *gc, libxl__save_helper_state *shs, int sig) -{ -int r = kill(shs->child.pid, sig); -if (r) LOGE(WARN, "failed to kill save/restore helper [%lu] (signal %d)", -(unsigned long)shs->child.pid, sig); -} static void helper_failed(libxl__egc *egc, libxl__save_helper_state *shs, int rc) @@ -266,7 +260,7 @@ static void helper_failed(libxl__egc *egc, libxl__save_helper_state *shs, return; } -sendsig(gc, shs, SIGKILL); +libxl__kill(gc, shs->child.pid, SIGKILL, "save/restore helper"); } static void helper_stop(libxl__egc *egc, libxl__ao_abortable *abrt, int rc) @@ -282,7 +276,7 @@ static void helper_stop(libxl__egc *egc, libxl__ao_abortable *abrt, int rc) if (!shs->rc) shs->rc = rc; -sendsig(gc, shs, SIGTERM); +libxl__kill(gc, shs->child.pid, SIGTERM, "save/restore helper"); } static void helper_stdout_readable(libxl__egc *egc, libxl__ev_fd *ev, -- Thanks, Yang.
Re: [Xen-devel] [PATCH v4 09/15] x86/altp2m: alternate p2m memory events.
> diff --git a/xen/include/public/vm_event.h b/xen/include/public/vm_event.h > index 577e971..6dfa9db 100644 > --- a/xen/include/public/vm_event.h > +++ b/xen/include/public/vm_event.h > @@ -47,6 +47,16 @@ > #define VM_EVENT_FLAG_VCPU_PAUSED (1 << 0) > /* Flags to aid debugging mem_event */ > #define VM_EVENT_FLAG_FOREIGN (1 << 1) > +/* > + * This flag can be set in a request or a response > + * > + * On a request, indicates that the event occurred in the alternate p2m > specified by > + * the altp2m_idx request field. > + * > + * On a response, indicates that the VCPU should resume in the alternate > p2m specified > + * by the altp2m_idx response field if possible. > + */ > +#define VM_EVENT_FLAG_ALTERNATE_P2M (1 << 2) > This will now collide with staging following Razvan's patch that recently got merged. Otherwise: Acked-by: Tamas K Lengyel
[Xen-devel] [PATCH v4 15/15] tools/xen-access: altp2m testcases
From: Tamas K Lengyel Working altp2m test-case. Extended the test tool to support singlestepping to better highlight the core feature of altp2m view switching. Signed-off-by: Tamas K Lengyel Signed-off-by: Ed White --- tools/tests/xen-access/xen-access.c | 173 ++-- 1 file changed, 148 insertions(+), 25 deletions(-) diff --git a/tools/tests/xen-access/xen-access.c b/tools/tests/xen-access/xen-access.c index 12ab921..6daa408 100644 --- a/tools/tests/xen-access/xen-access.c +++ b/tools/tests/xen-access/xen-access.c @@ -275,6 +275,19 @@ xenaccess_t *xenaccess_init(xc_interface **xch_r, domid_t domain_id) return NULL; } +static inline +int control_singlestep( +xc_interface *xch, +domid_t domain_id, +unsigned long vcpu, +bool enable) +{ +uint32_t op = enable ? +XEN_DOMCTL_DEBUG_OP_SINGLE_STEP_ON : XEN_DOMCTL_DEBUG_OP_SINGLE_STEP_OFF; + +return xc_domain_debug_control(xch, domain_id, op, vcpu); +} + /* * Note that this function is not thread safe. */ @@ -317,13 +330,15 @@ static void put_response(vm_event_t *vm_event, vm_event_response_t *rsp) void usage(char* progname) { -fprintf(stderr, -"Usage: %s [-m] write|exec|breakpoint\n" +fprintf(stderr, "Usage: %s [-m] write|exec", progname); +#if defined(__i386__) || defined(__x86_64__) +fprintf(stderr, "|breakpoint|altp2m_write|altp2m_exec"); +#endif +fprintf(stderr, "\n" "Logs first page writes, execs, or breakpoint traps that occur on the domain.\n" "\n" -"-m requires this program to run, or else the domain may pause\n", -progname); +"-m requires this program to run, or else the domain may pause\n"); } int main(int argc, char *argv[]) @@ -341,6 +356,8 @@ int main(int argc, char *argv[]) int required = 0; int breakpoint = 0; int shutting_down = 0; +int altp2m = 0; +uint16_t altp2m_view_id = 0; char* progname = argv[0]; argv++; @@ -379,10 +396,22 @@ int main(int argc, char *argv[]) default_access = XENMEM_access_rw; after_first_access = XENMEM_access_rwx; } +#if defined(__i386__) || defined(__x86_64__) else if ( 
!strcmp(argv[0], "breakpoint") ) { breakpoint = 1; } +else if ( !strcmp(argv[0], "altp2m_write") ) +{ +default_access = XENMEM_access_rx; +altp2m = 1; +} +else if ( !strcmp(argv[0], "altp2m_exec") ) +{ +default_access = XENMEM_access_rw; +altp2m = 1; +} +#endif else { usage(argv[0]); @@ -415,22 +444,73 @@ int main(int argc, char *argv[]) goto exit; } -/* Set the default access type and convert all pages to it */ -rc = xc_set_mem_access(xch, domain_id, default_access, ~0ull, 0); -if ( rc < 0 ) +/* With altp2m we just create a new, restricted view of the memory */ +if ( altp2m ) { -ERROR("Error %d setting default mem access type\n", rc); -goto exit; -} +xen_pfn_t gfn = 0; +unsigned long perm_set = 0; + +rc = xc_altp2m_set_domain_state( xch, domain_id, 1 ); +if ( rc < 0 ) +{ +ERROR("Error %d enabling altp2m on domain!\n", rc); +goto exit; +} + +rc = xc_altp2m_create_view( xch, domain_id, default_access, &altp2m_view_id ); +if ( rc < 0 ) +{ +ERROR("Error %d creating altp2m view!\n", rc); +goto exit; +} -rc = xc_set_mem_access(xch, domain_id, default_access, START_PFN, - (xenaccess->max_gpfn - START_PFN) ); +DPRINTF("altp2m view created with id %u\n", altp2m_view_id); +DPRINTF("Setting altp2m mem_access permissions.. "); -if ( rc < 0 ) +for(; gfn < xenaccess->max_gpfn; ++gfn) +{ +rc = xc_altp2m_set_mem_access( xch, domain_id, altp2m_view_id, gfn, + default_access); +if ( !rc ) +perm_set++; +} + +DPRINTF("done! 
Permissions set on %lu pages.\n", perm_set); + +rc = xc_altp2m_switch_to_view( xch, domain_id, altp2m_view_id ); +if ( rc < 0 ) +{ +ERROR("Error %d switching to altp2m view!\n", rc); +goto exit; +} + +rc = xc_monitor_singlestep( xch, domain_id, 1 ); +if ( rc < 0 ) +{ +ERROR("Error %d failed to enable singlestep monitoring!\n", rc); +goto exit; +} +} + +if ( !altp2m ) { -ERROR("Error %d setting all memory to access type %d\n", rc, - default_access); -goto exit; +/* Set the default access type and convert all pages to it */ +rc = xc_set_mem_access(xch, domain_id, default_access, ~0ull, 0); +if ( rc < 0 ) +{ +ERROR("Error %d setting default mem acc
[Xen-devel] [PATCH v4 14/15] tools/libxc: add support to altp2m hvmops
From: Tamas K Lengyel Wrappers to issue altp2m hvmops. Signed-off-by: Tamas K Lengyel Signed-off-by: Ravi Sahita --- tools/libxc/Makefile | 1 + tools/libxc/include/xenctrl.h | 21 tools/libxc/xc_altp2m.c | 237 ++ 3 files changed, 259 insertions(+) create mode 100644 tools/libxc/xc_altp2m.c diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile index 153b79e..c2c2b1c 100644 --- a/tools/libxc/Makefile +++ b/tools/libxc/Makefile @@ -10,6 +10,7 @@ override CONFIG_MIGRATE := n endif CTRL_SRCS-y := +CTRL_SRCS-y += xc_altp2m.c CTRL_SRCS-y += xc_core.c CTRL_SRCS-$(CONFIG_X86) += xc_core_x86.c CTRL_SRCS-$(CONFIG_ARM) += xc_core_arm.c diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h index d1d2ab3..ecddf28 100644 --- a/tools/libxc/include/xenctrl.h +++ b/tools/libxc/include/xenctrl.h @@ -2316,6 +2316,27 @@ void xc_tmem_save_done(xc_interface *xch, int dom); int xc_tmem_restore(xc_interface *xch, int dom, int fd); int xc_tmem_restore_extra(xc_interface *xch, int dom, int fd); +/** + * altp2m operations + */ + +int xc_altp2m_get_domain_state(xc_interface *handle, domid_t dom, bool *state); +int xc_altp2m_set_domain_state(xc_interface *handle, domid_t dom, bool state); +int xc_altp2m_set_vcpu_enable_notify(xc_interface *handle, xen_pfn_t gfn); +int xc_altp2m_create_view(xc_interface *handle, domid_t domid, + xenmem_access_t default_access, uint16_t *view_id); +int xc_altp2m_destroy_view(xc_interface *handle, domid_t domid, + uint16_t view_id); +/* Switch all vCPUs of the domain to the specified altp2m view */ +int xc_altp2m_switch_to_view(xc_interface *handle, domid_t domid, + uint16_t view_id); +int xc_altp2m_set_mem_access(xc_interface *handle, domid_t domid, + uint16_t view_id, xen_pfn_t gfn, + xenmem_access_t access); +int xc_altp2m_change_gfn(xc_interface *handle, domid_t domid, + uint16_t view_id, xen_pfn_t old_gfn, + xen_pfn_t new_gfn); + /** * Mem paging operations. 
* Paging is supported only on the x86 architecture in 64 bit mode, with diff --git a/tools/libxc/xc_altp2m.c b/tools/libxc/xc_altp2m.c new file mode 100644 index 000..a4be36b --- /dev/null +++ b/tools/libxc/xc_altp2m.c @@ -0,0 +1,237 @@ +/** + * + * xc_altp2m.c + * + * Interface to altp2m related HVMOPs + * + * Copyright (c) 2015 Tamas K Lengyel (ta...@tklengyel.com) + * + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * This library is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this library; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "xc_private.h" +#include +#include + +int xc_altp2m_get_domain_state(xc_interface *handle, domid_t dom, bool *state) +{ +int rc; +DECLARE_HYPERCALL; +DECLARE_HYPERCALL_BUFFER(xen_hvm_altp2m_op_t, arg); + +arg = xc_hypercall_buffer_alloc(handle, arg, sizeof(*arg)); +if ( arg == NULL ) +return -1; + +hypercall.op = __HYPERVISOR_hvm_op; +hypercall.arg[0] = HVMOP_altp2m; +hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(arg); + +arg->cmd = HVMOP_altp2m_get_domain_state; +arg->domain = dom; + +rc = do_xen_hypercall(handle, &hypercall); + +if ( !rc ) +*state = arg->u.domain_state.state; + +xc_hypercall_buffer_free(handle, arg); +return rc; +} + +int xc_altp2m_set_domain_state(xc_interface *handle, domid_t dom, bool state) +{ +int rc; +DECLARE_HYPERCALL; +DECLARE_HYPERCALL_BUFFER(xen_hvm_altp2m_op_t, arg); + +arg = xc_hypercall_buffer_alloc(handle, arg, 
sizeof(*arg)); +if ( arg == NULL ) +return -1; + +hypercall.op = __HYPERVISOR_hvm_op; +hypercall.arg[0] = HVMOP_altp2m; +hypercall.arg[1] = HYPERCALL_BUFFER_AS_ARG(arg); + +arg->cmd = HVMOP_altp2m_set_domain_state; +arg->domain = dom; +arg->u.domain_state.state = state; + +rc = do_xen_hypercall(handle, &hypercall); + +xc_hypercall_buffer_free(handle, arg); +return rc; +} + +/* This is a bit odd to me that it acts on current.. */ +int xc_altp2m_set_vcpu_enable_notify(xc_interface *handle, xen_pfn_t gfn) +{ +int rc; +DECLARE_HYPERCALL; +DECLARE_H
[Xen-devel] [PATCH v4 13/15] x86/altp2m: XSM hooks for altp2m HVM ops
From: Ravi Sahita Signed-off-by: Ravi Sahita Acked-by: Daniel De Graaf --- tools/flask/policy/policy/modules/xen/xen.if | 4 ++-- xen/arch/x86/hvm/hvm.c | 6 ++ xen/include/xsm/dummy.h | 12 xen/include/xsm/xsm.h| 12 xen/xsm/dummy.c | 2 ++ xen/xsm/flask/hooks.c| 12 xen/xsm/flask/policy/access_vectors | 7 +++ 7 files changed, 53 insertions(+), 2 deletions(-) diff --git a/tools/flask/policy/policy/modules/xen/xen.if b/tools/flask/policy/policy/modules/xen/xen.if index f4cde11..6177fe9 100644 --- a/tools/flask/policy/policy/modules/xen/xen.if +++ b/tools/flask/policy/policy/modules/xen/xen.if @@ -8,7 +8,7 @@ define(`declare_domain_common', ` allow $1 $2:grant { query setup }; allow $1 $2:mmu { adjust physmap map_read map_write stat pinpage updatemp mmuext_op }; - allow $1 $2:hvm { getparam setparam }; + allow $1 $2:hvm { getparam setparam altp2mhvm_op }; allow $1 $2:domain2 get_vnumainfo; ') @@ -58,7 +58,7 @@ define(`create_domain_common', ` allow $1 $2:mmu { map_read map_write adjust memorymap physmap pinpage mmuext_op updatemp }; allow $1 $2:grant setup; allow $1 $2:hvm { cacheattr getparam hvmctl irqlevel pciroute sethvmc - setparam pcilevel trackdirtyvram nested }; + setparam pcilevel trackdirtyvram nested altp2mhvm altp2mhvm_op }; ') # create_domain(priv, target) diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c index 6e59e68..7c82e89 100644 --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -5887,6 +5887,9 @@ static int hvmop_set_param( nestedhvm_vcpu_destroy(v); break; case HVM_PARAM_ALTP2MHVM: +rc = xsm_hvm_param_altp2mhvm(XSM_PRIV, d); +if ( rc ) +break; if ( a.value > 1 ) rc = -EINVAL; if ( a.value && @@ -6490,6 +6493,9 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg) } if ( !rc ) +rc = xsm_hvm_altp2mhvm_op(XSM_TARGET, d ? 
d : current->domain); + +if ( !rc ) { switch ( a.cmd ) { diff --git a/xen/include/xsm/dummy.h b/xen/include/xsm/dummy.h index f044c0f..e0b561d 100644 --- a/xen/include/xsm/dummy.h +++ b/xen/include/xsm/dummy.h @@ -548,6 +548,18 @@ static XSM_INLINE int xsm_hvm_param_nested(XSM_DEFAULT_ARG struct domain *d) return xsm_default_action(action, current->domain, d); } +static XSM_INLINE int xsm_hvm_param_altp2mhvm(XSM_DEFAULT_ARG struct domain *d) +{ +XSM_ASSERT_ACTION(XSM_PRIV); +return xsm_default_action(action, current->domain, d); +} + +static XSM_INLINE int xsm_hvm_altp2mhvm_op(XSM_DEFAULT_ARG struct domain *d) +{ +XSM_ASSERT_ACTION(XSM_TARGET); +return xsm_default_action(action, current->domain, d); +} + static XSM_INLINE int xsm_vm_event_control(XSM_DEFAULT_ARG struct domain *d, int mode, int op) { XSM_ASSERT_ACTION(XSM_PRIV); diff --git a/xen/include/xsm/xsm.h b/xen/include/xsm/xsm.h index c872d44..dc48d23 100644 --- a/xen/include/xsm/xsm.h +++ b/xen/include/xsm/xsm.h @@ -147,6 +147,8 @@ struct xsm_operations { int (*hvm_param) (struct domain *d, unsigned long op); int (*hvm_control) (struct domain *d, unsigned long op); int (*hvm_param_nested) (struct domain *d); +int (*hvm_param_altp2mhvm) (struct domain *d); +int (*hvm_altp2mhvm_op) (struct domain *d); int (*get_vnumainfo) (struct domain *d); int (*vm_event_control) (struct domain *d, int mode, int op); @@ -586,6 +588,16 @@ static inline int xsm_hvm_param_nested (xsm_default_t def, struct domain *d) return xsm_ops->hvm_param_nested(d); } +static inline int xsm_hvm_param_altp2mhvm (xsm_default_t def, struct domain *d) +{ +return xsm_ops->hvm_param_altp2mhvm(d); +} + +static inline int xsm_hvm_altp2mhvm_op (xsm_default_t def, struct domain *d) +{ +return xsm_ops->hvm_altp2mhvm_op(d); +} + static inline int xsm_get_vnumainfo (xsm_default_t def, struct domain *d) { return xsm_ops->get_vnumainfo(d); diff --git a/xen/xsm/dummy.c b/xen/xsm/dummy.c index e84b0e4..3461d4f 100644 --- a/xen/xsm/dummy.c +++ 
b/xen/xsm/dummy.c @@ -116,6 +116,8 @@ void xsm_fixup_ops (struct xsm_operations *ops) set_to_dummy_if_null(ops, hvm_param); set_to_dummy_if_null(ops, hvm_control); set_to_dummy_if_null(ops, hvm_param_nested); +set_to_dummy_if_null(ops, hvm_param_altp2mhvm); +set_to_dummy_if_null(ops, hvm_altp2mhvm_op); set_to_dummy_if_null(ops, do_xsm_op); #ifdef CONFIG_COMPAT diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c index 6e37d29..2b998c9 100644 --- a/xen/xsm/flask/hooks.c +++ b/xen/xsm/flask/hooks.c @@ -1170,6 +1170,16 @@ static int flask_hvm_param_nested(struct domain *d) return current_has_perm(d, S
[Xen-devel] [PATCH v4 12/15] x86/altp2m: Add altp2mhvm HVM domain parameter.
The altp2mhvm and nestedhvm parameters are mutually exclusive and cannot be set together. Signed-off-by: Ed White Reviewed-by: Andrew Cooper for the hypervisor bits. --- docs/man/xl.cfg.pod.5 | 12 tools/libxl/libxl.h | 6 ++ tools/libxl/libxl_create.c | 1 + tools/libxl/libxl_dom.c | 2 ++ tools/libxl/libxl_types.idl | 1 + tools/libxl/xl_cmdimpl.c| 10 ++ xen/arch/x86/hvm/hvm.c | 23 +-- xen/include/public/hvm/params.h | 5 - 8 files changed, 57 insertions(+), 3 deletions(-) diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5 index a3e0e2e..18afd46 100644 --- a/docs/man/xl.cfg.pod.5 +++ b/docs/man/xl.cfg.pod.5 @@ -1035,6 +1035,18 @@ enabled by default and you should usually omit it. It may be necessary to disable the HPET in order to improve compatibility with guest Operating Systems (X86 only) +=item B + +Enables or disables hvm guest access to alternate-p2m capability. +Alternate-p2m allows a guest to manage multiple p2m guest physical +"memory views" (as opposed to a single p2m). This option is +disabled by default and is available only to hvm domains. +You may want this option if you want to access-control/isolate +access to specific guest physical memory pages accessed by +the guest, e.g. for HVM domain memory introspection or +for isolation/access-control of memory between components within +a single guest hvm domain. + =item B Enable or disables guest access to hardware virtualisation features, diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h index a1c5d15..17222e7 100644 --- a/tools/libxl/libxl.h +++ b/tools/libxl/libxl.h @@ -745,6 +745,12 @@ typedef struct libxl__ctx libxl_ctx; #define LIBXL_HAVE_BUILDINFO_SERIAL_LIST 1 /* + * LIBXL_HAVE_ALTP2M + * If this is defined, then libxl supports alternate p2m functionality. + */ +#define LIBXL_HAVE_ALTP2M 1 + +/* * LIBXL_HAVE_REMUS * If this is defined, then libxl supports remus. 
*/ diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c index f366a09..418deee 100644 --- a/tools/libxl/libxl_create.c +++ b/tools/libxl/libxl_create.c @@ -329,6 +329,7 @@ int libxl__domain_build_info_setdefault(libxl__gc *gc, libxl_defbool_setdefault(&b_info->u.hvm.hpet, true); libxl_defbool_setdefault(&b_info->u.hvm.vpt_align, true); libxl_defbool_setdefault(&b_info->u.hvm.nested_hvm, false); +libxl_defbool_setdefault(&b_info->u.hvm.altp2m, false); libxl_defbool_setdefault(&b_info->u.hvm.usb,false); libxl_defbool_setdefault(&b_info->u.hvm.xen_platform_pci, true); diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c index bdc0465..2f1200e 100644 --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -300,6 +300,8 @@ static void hvm_set_conf_params(xc_interface *handle, uint32_t domid, libxl_defbool_val(info->u.hvm.vpt_align)); xc_hvm_param_set(handle, domid, HVM_PARAM_NESTEDHVM, libxl_defbool_val(info->u.hvm.nested_hvm)); +xc_hvm_param_set(handle, domid, HVM_PARAM_ALTP2MHVM, +libxl_defbool_val(info->u.hvm.altp2m)); } int libxl__build_pre(libxl__gc *gc, uint32_t domid, diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl index e1632fa..fb641fe 100644 --- a/tools/libxl/libxl_types.idl +++ b/tools/libxl/libxl_types.idl @@ -440,6 +440,7 @@ libxl_domain_build_info = Struct("domain_build_info",[ ("mmio_hole_memkb", MemKB), ("timer_mode", libxl_timer_mode), ("nested_hvm", libxl_defbool), + ("altp2m", libxl_defbool), ("smbios_firmware", string), ("acpi_firmware",string), ("nographic",libxl_defbool), diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c index c858068..43cf6bf 100644 --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -1500,6 +1500,16 @@ static void parse_config_data(const char *config_source, xlu_cfg_get_defbool(config, "nestedhvm", &b_info->u.hvm.nested_hvm, 0); +xlu_cfg_get_defbool(config, "altp2mhvm", &b_info->u.hvm.altp2m, 0); + +if 
(!libxl_defbool_is_default(b_info->u.hvm.nested_hvm) && +libxl_defbool_val(b_info->u.hvm.nested_hvm) && +!libxl_defbool_is_default(b_info->u.hvm.altp2m) && +libxl_defbool_val(b_info->u.hvm.altp2m)) { +fprintf(stderr, "ERROR: nestedhvm and altp2mhvm cannot be used together\n"); +exit(1); +} + xlu_cfg_replace_string(config, "smbios_firmware", &b_info->u.hvm.smbios_firmware, 0);
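The parsing code above means `altp2mhvm` and `nestedhvm` cannot both be enabled in a guest config. A minimal illustrative config fragment (the domain name and memory value are made up for the example):

```
# HVM guest with alternate-p2m enabled.  nestedhvm must stay off:
# xl rejects altp2mhvm=1 together with nestedhvm=1.
builder = "hvm"
name = "altp2m-guest"
memory = 2048
altp2mhvm = 1
```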
[Xen-devel] [PATCH v4 11/15] x86/altp2m: define and implement alternate p2m HVMOP types.
Signed-off-by: Ed White --- xen/arch/x86/hvm/hvm.c | 138 xen/include/public/hvm/hvm_op.h | 82 2 files changed, 220 insertions(+) diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c index bda6c1e..23cd507 100644 --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -6443,6 +6443,144 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg) break; } +case HVMOP_altp2m: +{ +struct xen_hvm_altp2m_op a; +struct domain *d = NULL; + +if ( copy_from_guest(&a, arg, 1) ) +return -EFAULT; + +switch ( a.cmd ) +{ +case HVMOP_altp2m_get_domain_state: +case HVMOP_altp2m_set_domain_state: +case HVMOP_altp2m_create_p2m: +case HVMOP_altp2m_destroy_p2m: +case HVMOP_altp2m_switch_p2m: +case HVMOP_altp2m_set_mem_access: +case HVMOP_altp2m_change_gfn: +d = rcu_lock_domain_by_any_id(a.domain); +if ( d == NULL ) +return -ESRCH; + +if ( !is_hvm_domain(d) || !hvm_altp2m_supported() ) +rc = -EINVAL; + +break; +case HVMOP_altp2m_vcpu_enable_notify: + +break; +default: +return -ENOSYS; + +break; +} + +if ( !rc ) +{ +switch ( a.cmd ) +{ +case HVMOP_altp2m_get_domain_state: +a.u.domain_state.state = altp2m_active(d); +rc = __copy_to_guest(arg, &a, 1) ? -EFAULT : 0; + +break; +case HVMOP_altp2m_set_domain_state: +{ +struct vcpu *v; +bool_t ostate; + +if ( nestedhvm_enabled(d) ) +{ +rc = -EINVAL; +break; +} + +ostate = d->arch.altp2m_active; +d->arch.altp2m_active = !!a.u.domain_state.state; + +/* If the alternate p2m state has changed, handle appropriately */ +if ( d->arch.altp2m_active != ostate && + (ostate || !(rc = p2m_init_altp2m_by_id(d, 0))) ) +{ +for_each_vcpu( d, v ) +{ +if ( !ostate ) +altp2m_vcpu_initialise(v); +else +altp2m_vcpu_destroy(v); +} + +if ( ostate ) +p2m_flush_altp2m(d); +} + +break; +} +default: +{ +if ( !(d ? 
d : current->domain)->arch.altp2m_active ) +{ +rc = -EINVAL; +break; +} + +switch ( a.cmd ) +{ +case HVMOP_altp2m_vcpu_enable_notify: +{ +struct vcpu *curr = current; +p2m_type_t p2mt; + +if ( (gfn_x(vcpu_altp2m(curr).veinfo_gfn) != INVALID_GFN) || + (mfn_x(get_gfn_query_unlocked(curr->domain, +a.u.enable_notify.gfn, &p2mt)) == INVALID_MFN) ) +return -EINVAL; + +vcpu_altp2m(curr).veinfo_gfn = _gfn(a.u.enable_notify.gfn); +ap2m_vcpu_update_vmfunc_ve(curr); + +break; +} +case HVMOP_altp2m_create_p2m: +if ( !(rc = p2m_init_next_altp2m(d, &a.u.view.view)) ) +rc = __copy_to_guest(arg, &a, 1) ? -EFAULT : 0; + +break; +case HVMOP_altp2m_destroy_p2m: +rc = p2m_destroy_altp2m_by_id(d, a.u.view.view); + +break; +case HVMOP_altp2m_switch_p2m: +rc = p2m_switch_domain_altp2m_by_id(d, a.u.view.view); + +break; +case HVMOP_altp2m_set_mem_access: +rc = p2m_set_altp2m_mem_access(d, a.u.set_mem_access.view, +_gfn(a.u.set_mem_access.gfn), +a.u.set_mem_access.hvmmem_access); + +break; +case HVMOP_altp2m_change_gfn: +rc = p2m_change_altp2m_gfn(d, a.u.change_gfn.view, +_gfn(a.u.change_gfn.old_gfn), +_gfn(a.u.change_gfn.new_gfn)); + +break; +} + +break; +} +} +} + +if ( d ) +rcu_unlock_domain(d); + +break; +} + default: { gdprintk(XENLOG_DEBUG, "Bad HVM op %ld.\n", op); diff --git a/xen/include/public/hvm/hvm_op.h b/xen/include/public/hvm/hvm_op.h
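The HVMOP_altp2m_set_domain_state handler above only does per-vCPU work when the active flag actually changes: enabling initialises every vCPU's altp2m context, disabling destroys it and then flushes the alternate p2m's. A simplified model of that state machine (assumed semantics distilled from the hunk; the counters stand in for the real init/destroy/flush calls):

```c
#include <stdbool.h>

/* Counters standing in for altp2m_vcpu_initialise()/destroy() and
 * p2m_flush_altp2m() in the real handler. */
struct altp2m_model {
    bool active;
    int  vcpu_inits;
    int  vcpu_destroys;
    int  flushes;
};

/* Mirrors the control flow of HVMOP_altp2m_set_domain_state for a
 * domain with nr_vcpus vCPUs: work happens only on a real change. */
static void set_domain_state(struct altp2m_model *m, int nr_vcpus, bool state)
{
    bool ostate = m->active;

    m->active = state;
    if ( m->active == ostate )
        return;                     /* no change: nothing to do */

    for ( int i = 0; i < nr_vcpus; i++ )
    {
        if ( !ostate )
            m->vcpu_inits++;        /* enabling */
        else
            m->vcpu_destroys++;     /* disabling */
    }

    if ( ostate )
        m->flushes++;               /* disabling also flushes altp2m's */
}
```

Repeating the same state is a no-op, which is why a second "enable" from the toolstack is harmless.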
[Xen-devel] [PATCH v4 10/15] x86/altp2m: add remaining support routines.
Add the remaining routines required to support enabling the alternate p2m functionality. Signed-off-by: Ed White Reviewed-by: Andrew Cooper --- xen/arch/x86/hvm/hvm.c | 58 +- xen/arch/x86/mm/hap/Makefile | 1 + xen/arch/x86/mm/hap/altp2m_hap.c | 98 ++ xen/arch/x86/mm/p2m-ept.c| 3 + xen/arch/x86/mm/p2m.c| 385 +++ xen/include/asm-x86/hvm/altp2m.h | 4 + xen/include/asm-x86/p2m.h| 33 7 files changed, 576 insertions(+), 6 deletions(-) create mode 100644 xen/arch/x86/mm/hap/altp2m_hap.c diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c index dbb4696..bda6c1e 100644 --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -2802,10 +2802,11 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla, mfn_t mfn; struct vcpu *curr = current; struct domain *currd = curr->domain; -struct p2m_domain *p2m; +struct p2m_domain *p2m, *hostp2m; int rc, fall_through = 0, paged = 0; int sharing_enomem = 0; vm_event_request_t *req_ptr = NULL; +bool_t ap2m_active = 0; /* On Nested Virtualization, walk the guest page table. * If this succeeds, all is fine. @@ -2865,11 +2866,31 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla, goto out; } -p2m = p2m_get_hostp2m(currd); -mfn = get_gfn_type_access(p2m, gfn, &p2mt, &p2ma, +ap2m_active = altp2m_active(currd); + +/* Take a lock on the host p2m speculatively, to avoid potential + * locking order problems later and to handle unshare etc. + */ +hostp2m = p2m_get_hostp2m(currd); +mfn = get_gfn_type_access(hostp2m, gfn, &p2mt, &p2ma, P2M_ALLOC | (npfec.write_access ? 
P2M_UNSHARE : 0), NULL); +if ( ap2m_active ) +{ +if ( altp2m_hap_nested_page_fault(curr, gpa, gla, npfec, &p2m) == 1 ) +{ +/* entry was lazily copied from host -- retry */ +__put_gfn(hostp2m, gfn); +rc = 1; +goto out; +} + +mfn = get_gfn_type_access(p2m, gfn, &p2mt, &p2ma, 0, NULL); +} +else +p2m = hostp2m; + /* Check access permissions first, then handle faults */ if ( mfn_x(mfn) != INVALID_MFN ) { @@ -2909,6 +2930,20 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla, if ( violation ) { +/* Should #VE be emulated for this fault? */ +if ( p2m_is_altp2m(p2m) && !cpu_has_vmx_virt_exceptions ) +{ +bool_t sve; + +p2m->get_entry(p2m, gfn, &p2mt, &p2ma, 0, NULL, &sve); + +if ( !sve && ap2m_vcpu_emulate_ve(curr) ) +{ +rc = 1; +goto out_put_gfn; +} +} + if ( p2m_mem_access_check(gpa, gla, npfec, &req_ptr) ) { fall_through = 1; @@ -2928,7 +2963,9 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla, (npfec.write_access && (p2m_is_discard_write(p2mt) || (p2mt == p2m_mmio_write_dm))) ) { -put_gfn(currd, gfn); +__put_gfn(p2m, gfn); +if ( ap2m_active ) +__put_gfn(hostp2m, gfn); rc = 0; if ( unlikely(is_pvh_domain(currd)) ) @@ -2957,6 +2994,7 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla, /* Spurious fault? PoD and log-dirty also take this path. */ if ( p2m_is_ram(p2mt) ) { +rc = 1; /* * Page log dirty is always done with order 0. If this mfn resides in * a large page, we do not change other pages type within that large @@ -2965,9 +3003,15 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla, if ( npfec.write_access ) { paging_mark_dirty(currd, mfn_x(mfn)); +/* If p2m is really an altp2m, unlock here to avoid lock ordering + * violation when the change below is propagated from host p2m */ +if ( ap2m_active ) +__put_gfn(p2m, gfn); p2m_change_type_one(currd, gfn, p2m_ram_logdirty, p2m_ram_rw); +__put_gfn(ap2m_active ? 
hostp2m : p2m, gfn); + +goto out; } -rc = 1; goto out_put_gfn; } @@ -2977,7 +3021,9 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla, rc = fall_through; out_put_gfn: -put_gfn(currd, gfn); +__put_gfn(p2m, gfn); +if ( ap2m_active ) +__put_gfn(hostp2m, gfn); out: /* All of these are delayed until we exit, since we might * sleep on event ring wait queues, and we must not hold diff --git a/xen/arch/x86/mm/hap/Makefile b/xen/arch/x86/mm/hap/Makefile index 68f2bb5..216cd90 100644 --- a/xen/arch/x86/mm/hap/Makefile +++ b/xen/arch/x86/mm/hap/Makefile @@ -4,6 +4,7 @@ obj-y += guest_wal
[Xen-devel] [PATCH v4 09/15] x86/altp2m: alternate p2m memory events.
Add a flag to indicate that a memory event occurred in an alternate p2m and a field containing the p2m index. Allow any event response to switch to a different alternate p2m using the same flag and field. Modify p2m_mem_access_check() to handle alternate p2m's. Signed-off-by: Ed White Acked-by: Andrew Cooper for the x86 bits. Acked-by: George Dunlap --- xen/arch/x86/mm/p2m.c | 19 ++- xen/common/vm_event.c | 4 xen/include/asm-arm/p2m.h | 6 ++ xen/include/asm-x86/p2m.h | 3 +++ xen/include/public/vm_event.h | 11 +++ 5 files changed, 42 insertions(+), 1 deletion(-) diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c index 561a83c..d4d1ba1 100644 --- a/xen/arch/x86/mm/p2m.c +++ b/xen/arch/x86/mm/p2m.c @@ -1514,6 +1514,12 @@ void p2m_mem_access_emulate_check(struct vcpu *v, } } +void p2m_altp2m_check(struct vcpu *v, uint16_t idx) +{ +if ( altp2m_active(v->domain) ) +p2m_switch_vcpu_altp2m_by_id(v, idx); +} + bool_t p2m_mem_access_check(paddr_t gpa, unsigned long gla, struct npfec npfec, vm_event_request_t **req_ptr) @@ -1521,7 +1527,7 @@ bool_t p2m_mem_access_check(paddr_t gpa, unsigned long gla, struct vcpu *v = current; unsigned long gfn = gpa >> PAGE_SHIFT; struct domain *d = v->domain; -struct p2m_domain* p2m = p2m_get_hostp2m(d); +struct p2m_domain *p2m = NULL; mfn_t mfn; p2m_type_t p2mt; p2m_access_t p2ma; @@ -1530,6 +1536,11 @@ bool_t p2m_mem_access_check(paddr_t gpa, unsigned long gla, unsigned long eip = guest_cpu_user_regs()->eip; bool_t sve; +if ( altp2m_active(d) ) +p2m = p2m_get_altp2m(v); +if ( !p2m ) +p2m = p2m_get_hostp2m(d); + /* First, handle rx2rw conversion automatically. * These calls to p2m->set_entry() must succeed: we have the gfn * locked and just did a successful get_entry(). 
*/ @@ -1636,6 +1647,12 @@ bool_t p2m_mem_access_check(paddr_t gpa, unsigned long gla, req->vcpu_id = v->vcpu_id; p2m_vm_event_fill_regs(req); + +if ( altp2m_active(v->domain) ) +{ +req->flags |= VM_EVENT_FLAG_ALTERNATE_P2M; +req->altp2m_idx = vcpu_altp2m(v).p2midx; +} } /* Pause the current VCPU */ diff --git a/xen/common/vm_event.c b/xen/common/vm_event.c index 120a78a..13224e2 100644 --- a/xen/common/vm_event.c +++ b/xen/common/vm_event.c @@ -399,6 +399,10 @@ void vm_event_resume(struct domain *d, struct vm_event_domain *ved) }; +/* Check for altp2m switch */ +if ( rsp.flags & VM_EVENT_FLAG_ALTERNATE_P2M ) +p2m_altp2m_check(v, rsp.altp2m_idx); + /* Unpause domain. */ if ( rsp.flags & VM_EVENT_FLAG_VCPU_PAUSED ) vm_event_vcpu_unpause(v); diff --git a/xen/include/asm-arm/p2m.h b/xen/include/asm-arm/p2m.h index 63748ef..08bdce3 100644 --- a/xen/include/asm-arm/p2m.h +++ b/xen/include/asm-arm/p2m.h @@ -109,6 +109,12 @@ void p2m_mem_access_emulate_check(struct vcpu *v, /* Not supported on ARM. */ } +static inline +void p2m_altp2m_check(struct vcpu *v, uint16_t idx) +{ +/* Not supported on ARM. */ +} + #define p2m_is_foreign(_t) ((_t) == p2m_map_foreign) #define p2m_is_ram(_t) ((_t) == p2m_ram_rw || (_t) == p2m_ram_ro) diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h index 0a172e0..722e54c 100644 --- a/xen/include/asm-x86/p2m.h +++ b/xen/include/asm-x86/p2m.h @@ -751,6 +751,9 @@ uint16_t p2m_find_altp2m_by_eptp(struct domain *d, uint64_t eptp); /* Switch alternate p2m for a single vcpu */ bool_t p2m_switch_vcpu_altp2m_by_id(struct vcpu *v, uint16_t idx); +/* Check to see if vcpu should be switched to a different p2m. 
*/ +void p2m_altp2m_check(struct vcpu *v, uint16_t idx); + /* * p2m type to IOMMU flags */ diff --git a/xen/include/public/vm_event.h b/xen/include/public/vm_event.h index 577e971..6dfa9db 100644 --- a/xen/include/public/vm_event.h +++ b/xen/include/public/vm_event.h @@ -47,6 +47,16 @@ #define VM_EVENT_FLAG_VCPU_PAUSED (1 << 0) /* Flags to aid debugging mem_event */ #define VM_EVENT_FLAG_FOREIGN (1 << 1) +/* + * This flag can be set in a request or a response + * + * On a request, indicates that the event occurred in the alternate p2m specified by + * the altp2m_idx request field. + * + * On a response, indicates that the VCPU should resume in the alternate p2m specified + * by the altp2m_idx response field if possible. + */ +#define VM_EVENT_FLAG_ALTERNATE_P2M (1 << 2) /* * Reasons for the vm event request @@ -194,6 +204,7 @@ typedef struct vm_event_st { uint32_t flags; /* VM_EVENT_FLAG_* */ uint32_t reason;/* VM_EVENT_REASON_* */ uint32_t vcpu_id; +uint16_t altp2m_idx; /* may be used during request and response */ union { struct vm_event_pagingmem_paging; -- 1.9.1
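A monitor consuming these events requests a view switch by setting VM_EVENT_FLAG_ALTERNATE_P2M together with altp2m_idx in its response, which vm_event_resume() then feeds to p2m_altp2m_check(). A minimal sketch of building such a response, using the flag values from the header above and a trimmed-down stand-in for vm_event_st:

```c
#include <stdint.h>

#define VM_EVENT_FLAG_VCPU_PAUSED   (1 << 0)
#define VM_EVENT_FLAG_FOREIGN       (1 << 1)
#define VM_EVENT_FLAG_ALTERNATE_P2M (1 << 2)

/* Only the response fields relevant to the altp2m switch. */
struct vm_event_rsp {
    uint32_t flags;
    uint16_t altp2m_idx;
};

/* Build a response that unpauses the vCPU and resumes it in
 * alternate view `idx`. */
static struct vm_event_rsp make_switch_response(uint16_t idx)
{
    struct vm_event_rsp rsp = { 0 };

    rsp.flags = VM_EVENT_FLAG_VCPU_PAUSED | VM_EVENT_FLAG_ALTERNATE_P2M;
    rsp.altp2m_idx = idx;
    return rsp;
}
```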
[Xen-devel] [PATCH v4 08/15] x86/altp2m: add control of suppress_ve.
From: George Dunlap The existing ept_set_entry() and ept_get_entry() routines are extended to optionally set/get suppress_ve. Passing -1 will set suppress_ve on new p2m entries, or retain suppress_ve flag on existing entries. Signed-off-by: George Dunlap --- xen/arch/x86/mm/mem_sharing.c | 5 +++-- xen/arch/x86/mm/p2m-ept.c | 18 xen/arch/x86/mm/p2m-pod.c | 12 +-- xen/arch/x86/mm/p2m-pt.c | 10 +++-- xen/arch/x86/mm/p2m.c | 50 ++- xen/include/asm-x86/p2m.h | 24 +++-- 6 files changed, 70 insertions(+), 49 deletions(-) diff --git a/xen/arch/x86/mm/mem_sharing.c b/xen/arch/x86/mm/mem_sharing.c index 16e329e..5780a26 100644 --- a/xen/arch/x86/mm/mem_sharing.c +++ b/xen/arch/x86/mm/mem_sharing.c @@ -1257,10 +1257,11 @@ int relinquish_shared_pages(struct domain *d) p2m_type_t t; mfn_t mfn; int set_rc; +bool_t sve; if ( atomic_read(&d->shr_pages) == 0 ) break; -mfn = p2m->get_entry(p2m, gfn, &t, &a, 0, NULL); +mfn = p2m->get_entry(p2m, gfn, &t, &a, 0, NULL, &sve); if ( mfn_valid(mfn) && (t == p2m_ram_shared) ) { /* Does not fail with ENOMEM given the DESTROY flag */ @@ -1270,7 +1271,7 @@ int relinquish_shared_pages(struct domain *d) * unshare. Must succeed: we just read the old entry and * we hold the p2m lock. 
*/ set_rc = p2m->set_entry(p2m, gfn, _mfn(0), PAGE_ORDER_4K, -p2m_invalid, p2m_access_rwx); +p2m_invalid, p2m_access_rwx, sve); ASSERT(set_rc == 0); count += 0x10; } diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c index 4111795..1106235 100644 --- a/xen/arch/x86/mm/p2m-ept.c +++ b/xen/arch/x86/mm/p2m-ept.c @@ -657,7 +657,8 @@ bool_t ept_handle_misconfig(uint64_t gpa) */ static int ept_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn, - unsigned int order, p2m_type_t p2mt, p2m_access_t p2ma) + unsigned int order, p2m_type_t p2mt, p2m_access_t p2ma, + int sve) { ept_entry_t *table, *ept_entry = NULL; unsigned long gfn_remainder = gfn; @@ -803,7 +804,11 @@ ept_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn, ept_p2m_type_to_flags(p2m, &new_entry, p2mt, p2ma); } -new_entry.suppress_ve = 1; +if ( sve != -1 ) +new_entry.suppress_ve = !!sve; +else +new_entry.suppress_ve = is_epte_valid(&old_entry) ? +old_entry.suppress_ve : 1; rc = atomic_write_ept_entry(ept_entry, new_entry, target); if ( unlikely(rc) ) @@ -850,8 +855,9 @@ out: /* Read ept p2m entries */ static mfn_t ept_get_entry(struct p2m_domain *p2m, - unsigned long gfn, p2m_type_t *t, p2m_access_t* a, - p2m_query_t q, unsigned int *page_order) +unsigned long gfn, p2m_type_t *t, p2m_access_t* a, +p2m_query_t q, unsigned int *page_order, +bool_t *sve) { ept_entry_t *table = map_domain_page(pagetable_get_pfn(p2m_get_pagetable(p2m))); unsigned long gfn_remainder = gfn; @@ -865,6 +871,8 @@ static mfn_t ept_get_entry(struct p2m_domain *p2m, *t = p2m_mmio_dm; *a = p2m_access_n; +if ( sve ) +*sve = 1; /* This pfn is higher than the highest the p2m map currently holds */ if ( gfn > p2m->max_mapped_pfn ) @@ -930,6 +938,8 @@ static mfn_t ept_get_entry(struct p2m_domain *p2m, else *t = ept_entry->sa_p2mt; *a = ept_entry->access; +if ( sve ) +*sve = ept_entry->suppress_ve; mfn = _mfn(ept_entry->mfn); if ( i ) diff --git a/xen/arch/x86/mm/p2m-pod.c b/xen/arch/x86/mm/p2m-pod.c index 
0679f00..a2f6d02 100644 --- a/xen/arch/x86/mm/p2m-pod.c +++ b/xen/arch/x86/mm/p2m-pod.c @@ -536,7 +536,7 @@ recount: p2m_access_t a; p2m_type_t t; -(void)p2m->get_entry(p2m, gpfn + i, &t, &a, 0, NULL); +(void)p2m->get_entry(p2m, gpfn + i, &t, &a, 0, NULL, NULL); if ( t == p2m_populate_on_demand ) pod++; @@ -587,7 +587,7 @@ recount: p2m_type_t t; p2m_access_t a; -mfn = p2m->get_entry(p2m, gpfn + i, &t, &a, 0, NULL); +mfn = p2m->get_entry(p2m, gpfn + i, &t, &a, 0, NULL, NULL); if ( t == p2m_populate_on_demand ) { p2m_set_entry(p2m, gpfn + i, _mfn(INVALID_MFN), 0, p2m_invalid, @@ -676,7 +676,7 @@ p2m_pod_zero_check_superpage(struct p2m_domain *p2m, unsigned long gfn) for ( i=0; iget_entry(p2m, gfn + i, &type, &a, 0, NULL); +mfn = p2m->get_entry(p2m, gfn + i, &type, &a, 0, NULL, NULL); if ( i == 0 ) { @@ -808,7 +808,7 @@ p2m
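The new `sve` parameter to ept_set_entry() is a tri-state: 0 or 1 set suppress_ve explicitly, while -1 means "keep the existing bit, defaulting to 1 for entries that were not previously valid". That rule in isolation, as read from the hunk above:

```c
#include <stdbool.h>

/* Compute the suppress_ve bit for an EPT entry being (re)written.
 * sve: -1 = preserve the old value (or 1 if the old entry was
 * invalid); 0/1 = set explicitly. */
static bool new_suppress_ve(int sve, bool old_valid, bool old_sve)
{
    if ( sve != -1 )
        return !!sve;
    return old_valid ? old_sve : true;
}
```

Callers that don't care (e.g. the PoD paths above, which pass -1 via the default) therefore never clobber a suppress_ve setting made by the altp2m code.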
[Xen-devel] [PATCH v4 07/15] VMX: add VMFUNC leaf 0 (EPTP switching) to emulator.
From: Ravi Sahita Signed-off-by: Ravi Sahita --- xen/arch/x86/hvm/emulate.c | 19 +-- xen/arch/x86/hvm/vmx/vmx.c | 29 + xen/arch/x86/x86_emulate/x86_emulate.c | 20 +++- xen/arch/x86/x86_emulate/x86_emulate.h | 4 xen/include/asm-x86/hvm/hvm.h | 2 ++ 5 files changed, 67 insertions(+), 7 deletions(-) diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c index fe5661d..1c90832 100644 --- a/xen/arch/x86/hvm/emulate.c +++ b/xen/arch/x86/hvm/emulate.c @@ -1436,6 +1436,19 @@ static int hvmemul_invlpg( return rc; } +static int hvmemul_vmfunc( +struct x86_emulate_ctxt *ctxt) +{ +int rc; + +rc = hvm_funcs.ap2m_vcpu_emulate_vmfunc(ctxt->regs); +if ( rc != X86EMUL_OKAY ) +{ +hvmemul_inject_hw_exception(TRAP_invalid_op, 0, ctxt); +} +return rc; +} + static const struct x86_emulate_ops hvm_emulate_ops = { .read = hvmemul_read, .insn_fetch= hvmemul_insn_fetch, @@ -1459,7 +1472,8 @@ static const struct x86_emulate_ops hvm_emulate_ops = { .inject_sw_interrupt = hvmemul_inject_sw_interrupt, .get_fpu = hvmemul_get_fpu, .put_fpu = hvmemul_put_fpu, -.invlpg= hvmemul_invlpg +.invlpg= hvmemul_invlpg, +.vmfunc= hvmemul_vmfunc, }; static const struct x86_emulate_ops hvm_emulate_ops_no_write = { @@ -1485,7 +1499,8 @@ static const struct x86_emulate_ops hvm_emulate_ops_no_write = { .inject_sw_interrupt = hvmemul_inject_sw_interrupt, .get_fpu = hvmemul_get_fpu, .put_fpu = hvmemul_put_fpu, -.invlpg= hvmemul_invlpg +.invlpg= hvmemul_invlpg, +.vmfunc= hvmemul_vmfunc, }; static int _hvm_emulate_one(struct hvm_emulate_ctxt *hvmemul_ctxt, diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c index 28afdaa..2664673 100644 --- a/xen/arch/x86/hvm/vmx/vmx.c +++ b/xen/arch/x86/hvm/vmx/vmx.c @@ -82,6 +82,7 @@ static void vmx_fpu_dirty_intercept(void); static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content); static int vmx_msr_write_intercept(unsigned int msr, uint64_t msr_content); static void vmx_invlpg_intercept(unsigned long vaddr); +static int 
vmx_vmfunc_intercept(struct cpu_user_regs *regs); uint8_t __read_mostly posted_intr_vector; @@ -1830,6 +1831,19 @@ static void vmx_vcpu_update_vmfunc_ve(struct vcpu *v) vmx_vmcs_exit(v); } +static int vmx_vcpu_emulate_vmfunc(struct cpu_user_regs *regs) +{ +int rc = X86EMUL_EXCEPTION; +struct vcpu *curr = current; + +if ( !cpu_has_vmx_vmfunc && altp2m_active(curr->domain) && + regs->eax == 0 && + p2m_switch_vcpu_altp2m_by_id(curr, (uint16_t)regs->ecx) ) +rc = X86EMUL_OKAY; + +return rc; +} + static bool_t vmx_vcpu_emulate_ve(struct vcpu *v) { bool_t rc = 0; @@ -1898,6 +1912,7 @@ static struct hvm_function_table __initdata vmx_function_table = { .msr_read_intercept = vmx_msr_read_intercept, .msr_write_intercept = vmx_msr_write_intercept, .invlpg_intercept = vmx_invlpg_intercept, +.vmfunc_intercept = vmx_vmfunc_intercept, .handle_cd= vmx_handle_cd, .set_info_guest = vmx_set_info_guest, .set_rdtsc_exiting= vmx_set_rdtsc_exiting, @@ -1924,6 +1939,7 @@ static struct hvm_function_table __initdata vmx_function_table = { .ap2m_vcpu_update_eptp = vmx_vcpu_update_eptp, .ap2m_vcpu_update_vmfunc_ve = vmx_vcpu_update_vmfunc_ve, .ap2m_vcpu_emulate_ve = vmx_vcpu_emulate_ve, +.ap2m_vcpu_emulate_vmfunc = vmx_vcpu_emulate_vmfunc, }; const struct hvm_function_table * __init start_vmx(void) @@ -2095,6 +2111,12 @@ static void vmx_invlpg_intercept(unsigned long vaddr) vpid_sync_vcpu_gva(curr, vaddr); } +static int vmx_vmfunc_intercept(struct cpu_user_regs *regs) +{ +gdprintk(XENLOG_ERR, "Failed guest VMFUNC execution\n"); +return X86EMUL_EXCEPTION; +} + static int vmx_cr_access(unsigned long exit_qualification) { struct vcpu *curr = current; @@ -3234,6 +3256,13 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs) update_guest_eip(); break; +case EXIT_REASON_VMFUNC: +if ( vmx_vmfunc_intercept(regs) == X86EMUL_EXCEPTION ) +hvm_inject_hw_exception(TRAP_invalid_op, HVM_DELIVER_NO_ERROR_CODE); +else +update_guest_eip(); +break; + case EXIT_REASON_MWAIT_INSTRUCTION: case 
EXIT_REASON_MONITOR_INSTRUCTION: case EXIT_REASON_GETSEC: diff --git a/xen/arch/x86/x86_emulate/x86_emulate.c b/xen/arch/x86/x86_emulate/x86_emulate.c index c017c69..d941771 100644 --- a/xen/arch/x86/x86_emulate/x86_emulate.c +++ b/xen/arch/x86/x86_emulate/x86_emulate.c @@ -3816,8 +3816,9 @@ x86_emulate( struct segment_register reg; unsigned long base, limit, cr0, cr0w; -if ( modrm == 0xdf ) /* invlpga */ +
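On hardware without VMFUNC support the emulator falls back to vmx_vcpu_emulate_vmfunc(), which succeeds only for leaf 0 (EPTP switching) with a valid view index. The accept/reject condition can be modeled like this (a sketch of the predicate only; `switch_ok` stands in for the p2m_switch_vcpu_altp2m_by_id() call):

```c
#include <stdbool.h>
#include <stdint.h>

/* Decide whether an emulated VMFUNC should return X86EMUL_OKAY,
 * mirroring the checks in vmx_vcpu_emulate_vmfunc(): hardware VMFUNC
 * absent, altp2m active, leaf 0 selected in eax, and the view switch
 * (keyed by ecx in the real code) succeeding. */
static bool emulate_vmfunc_ok(bool cpu_has_vmfunc, bool altp2m_active,
                              uint64_t eax, bool switch_ok)
{
    return !cpu_has_vmfunc && altp2m_active && eax == 0 && switch_ok;
}
```

Any other combination takes the X86EMUL_EXCEPTION path and injects #UD, matching hvmemul_vmfunc() above.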
[Xen-devel] [PATCH v4 05/15] x86/altp2m: basic data structures and support routines.
Add the basic data structures needed to support alternate p2m's and the functions to initialise them and tear them down. Although Intel hardware can handle 512 EPTP's per hardware thread concurrently, only 10 per domain are supported in this patch for performance reasons. The iterator in hap_enable() does need to handle 512, so that is now uint16_t. This change also splits the p2m lock into one lock type for altp2m's and another type for all other p2m's. The purpose of this is to place the altp2m list lock between the types, so the list lock can be acquired whilst holding the host p2m lock. Signed-off-by: Ed White Reviewed-by: Andrew Cooper --- xen/arch/x86/hvm/Makefile| 1 + xen/arch/x86/hvm/altp2m.c| 92 + xen/arch/x86/hvm/hvm.c | 21 + xen/arch/x86/mm/hap/hap.c| 32 - xen/arch/x86/mm/mm-locks.h | 38 +++- xen/arch/x86/mm/p2m.c| 98 xen/include/asm-x86/domain.h | 10 xen/include/asm-x86/hvm/altp2m.h | 38 xen/include/asm-x86/hvm/hvm.h| 17 +++ xen/include/asm-x86/hvm/vcpu.h | 9 xen/include/asm-x86/p2m.h| 30 +++- 11 files changed, 382 insertions(+), 4 deletions(-) create mode 100644 xen/arch/x86/hvm/altp2m.c create mode 100644 xen/include/asm-x86/hvm/altp2m.h diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile index 69af47f..eb1a37b 100644 --- a/xen/arch/x86/hvm/Makefile +++ b/xen/arch/x86/hvm/Makefile @@ -1,6 +1,7 @@ subdir-y += svm subdir-y += vmx +obj-y += altp2m.o obj-y += asid.o obj-y += emulate.o obj-y += event.o diff --git a/xen/arch/x86/hvm/altp2m.c b/xen/arch/x86/hvm/altp2m.c new file mode 100644 index 000..f98a38d --- /dev/null +++ b/xen/arch/x86/hvm/altp2m.c @@ -0,0 +1,92 @@ +/* + * Alternate p2m HVM + * Copyright (c) 2014, Intel Corporation. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. 
+ * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., 59 Temple + * Place - Suite 330, Boston, MA 02111-1307 USA. + */ + +#include +#include +#include +#include + +void +altp2m_vcpu_reset(struct vcpu *v) +{ +struct altp2mvcpu *av = &vcpu_altp2m(v); + +av->p2midx = INVALID_ALTP2M; +av->veinfo_gfn = _gfn(INVALID_GFN); + +if ( hvm_funcs.ap2m_vcpu_reset ) +hvm_funcs.ap2m_vcpu_reset(v); +} + +int +altp2m_vcpu_initialise(struct vcpu *v) +{ +int rc = -EOPNOTSUPP; + +if ( v != current ) +vcpu_pause(v); + +if ( !hvm_funcs.ap2m_vcpu_initialise || + (hvm_funcs.ap2m_vcpu_initialise(v) == 0) ) +{ +rc = 0; +altp2m_vcpu_reset(v); +vcpu_altp2m(v).p2midx = 0; +atomic_inc(&p2m_get_altp2m(v)->active_vcpus); + +ap2m_vcpu_update_eptp(v); +} + +if ( v != current ) +vcpu_unpause(v); + +return rc; +} + +void +altp2m_vcpu_destroy(struct vcpu *v) +{ +struct p2m_domain *p2m; + +if ( v != current ) +vcpu_pause(v); + +if ( hvm_funcs.ap2m_vcpu_destroy ) +hvm_funcs.ap2m_vcpu_destroy(v); + +if ( (p2m = p2m_get_altp2m(v)) ) +atomic_dec(&p2m->active_vcpus); + +altp2m_vcpu_reset(v); + +ap2m_vcpu_update_eptp(v); +ap2m_vcpu_update_vmfunc_ve(v); + +if ( v != current ) +vcpu_unpause(v); +} + +/* + * Local variables: + * mode: C + * c-file-style: "BSD" + * c-basic-offset: 4 + * tab-width: 4 + * indent-tabs-mode: nil + * End: + */ diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c index 4019658..dbb4696 100644 --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -58,6 +58,7 @@ #include #include #include +#include #include #include #include @@ -2380,6 +2381,7 @@ void hvm_vcpu_destroy(struct vcpu *v) { hvm_all_ioreq_servers_remove_vcpu(v->domain, 
v); +altp2m_vcpu_destroy(v); nestedhvm_vcpu_destroy(v); free_compat_arg_xlat(v); @@ -6498,6 +6500,25 @@ enum hvm_intblk nhvm_interrupt_blocked(struct vcpu *v) return hvm_funcs.nhvm_intr_blocked(v); } +void ap2m_vcpu_update_eptp(struct vcpu *v) +{ +if ( hvm_funcs.ap2m_vcpu_update_eptp ) +hvm_funcs.ap2m_vcpu_update_eptp(v); +} + +void ap2m_vcpu_update_vmfunc_ve(struct vcpu *v) +{ +if ( hvm_funcs.ap2m_vcpu_update_vmfunc_ve ) +hvm_funcs.ap2m_vcpu_update_vmfunc_ve(v); +} + +bool_t ap2m_vcpu_emulate_ve(struct vcpu *v) +{ +if ( hvm_funcs.ap2m_vcpu_emulate_ve ) +return hvm_funcs.ap
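altp2m_vcpu_initialise()/altp2m_vcpu_destroy() above maintain a per-view active_vcpus count and use INVALID_ALTP2M as the "not attached" marker. The reference-counting shape, reduced to its essentials (a model with stand-in structs, not the Xen types):

```c
#include <stdint.h>

#define MAX_ALTP2M     10u
#define INVALID_ALTP2M 0xffffu

struct altp2m_view { int active_vcpus; };
struct vcpu_model  { uint16_t p2midx; };

/* altp2m_vcpu_initialise(): attach the vCPU to view 0 and count it. */
static void vcpu_init(struct vcpu_model *v, struct altp2m_view *views)
{
    v->p2midx = 0;
    views[v->p2midx].active_vcpus++;
}

/* altp2m_vcpu_destroy(): drop the view reference, then invalidate. */
static void vcpu_destroy(struct vcpu_model *v, struct altp2m_view *views)
{
    if ( v->p2midx != INVALID_ALTP2M )
        views[v->p2midx].active_vcpus--;
    v->p2midx = INVALID_ALTP2M;
}
```

Keeping destroy idempotent against an already-invalid index is what makes the reset-then-destroy ordering in the real code safe.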
[Xen-devel] [PATCH v4 06/15] VMX/altp2m: add code to support EPTP switching and #VE.
Implement and hook up the code to enable VMX support of VMFUNC and #VE. VMFUNC leaf 0 (EPTP switching) emulation is added in a later patch. Signed-off-by: Ed White Reviewed-by: Andrew Cooper Acked-by: Jun Nakajima --- xen/arch/x86/hvm/vmx/vmx.c | 138 + 1 file changed, 138 insertions(+) diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c index 07527dd..28afdaa 100644 --- a/xen/arch/x86/hvm/vmx/vmx.c +++ b/xen/arch/x86/hvm/vmx/vmx.c @@ -56,6 +56,7 @@ #include #include #include +#include #include #include #include @@ -1763,6 +1764,104 @@ static void vmx_enable_msr_exit_interception(struct domain *d) MSR_TYPE_W); } +static void vmx_vcpu_update_eptp(struct vcpu *v) +{ +struct domain *d = v->domain; +struct p2m_domain *p2m = NULL; +struct ept_data *ept; + +if ( altp2m_active(d) ) +p2m = p2m_get_altp2m(v); +if ( !p2m ) +p2m = p2m_get_hostp2m(d); + +ept = &p2m->ept; +ept->asr = pagetable_get_pfn(p2m_get_pagetable(p2m)); + +vmx_vmcs_enter(v); + +__vmwrite(EPT_POINTER, ept_get_eptp(ept)); + +if ( v->arch.hvm_vmx.secondary_exec_control & +SECONDARY_EXEC_ENABLE_VIRT_EXCEPTIONS ) +__vmwrite(EPTP_INDEX, vcpu_altp2m(v).p2midx); + +vmx_vmcs_exit(v); +} + +static void vmx_vcpu_update_vmfunc_ve(struct vcpu *v) +{ +struct domain *d = v->domain; +u32 mask = SECONDARY_EXEC_ENABLE_VM_FUNCTIONS; + +if ( !cpu_has_vmx_vmfunc ) +return; + +if ( cpu_has_vmx_virt_exceptions ) +mask |= SECONDARY_EXEC_ENABLE_VIRT_EXCEPTIONS; + +vmx_vmcs_enter(v); + +if ( !d->is_dying && altp2m_active(d) ) +{ +v->arch.hvm_vmx.secondary_exec_control |= mask; +__vmwrite(VM_FUNCTION_CONTROL, VMX_VMFUNC_EPTP_SWITCHING); +__vmwrite(EPTP_LIST_ADDR, virt_to_maddr(d->arch.altp2m_eptp)); + +if ( cpu_has_vmx_virt_exceptions ) +{ +p2m_type_t t; +mfn_t mfn; + +mfn = get_gfn_query_unlocked(d, gfn_x(vcpu_altp2m(v).veinfo_gfn), &t); + +if ( mfn_x(mfn) != INVALID_MFN ) +__vmwrite(VIRT_EXCEPTION_INFO, mfn_x(mfn) << PAGE_SHIFT); +else +mask &= ~SECONDARY_EXEC_ENABLE_VIRT_EXCEPTIONS; +} +} +else 
+v->arch.hvm_vmx.secondary_exec_control &= ~mask; + +__vmwrite(SECONDARY_VM_EXEC_CONTROL, +v->arch.hvm_vmx.secondary_exec_control); + +vmx_vmcs_exit(v); +} + +static bool_t vmx_vcpu_emulate_ve(struct vcpu *v) +{ +bool_t rc = 0; +ve_info_t *veinfo = gfn_x(vcpu_altp2m(v).veinfo_gfn) != INVALID_GFN ? +hvm_map_guest_frame_rw(gfn_x(vcpu_altp2m(v).veinfo_gfn), 0) : NULL; + +if ( !veinfo ) +return 0; + +if ( veinfo->semaphore != 0 ) +goto out; + +rc = 1; + +veinfo->exit_reason = EXIT_REASON_EPT_VIOLATION; +veinfo->semaphore = ~0l; +veinfo->eptp_index = vcpu_altp2m(v).p2midx; + +vmx_vmcs_enter(v); +__vmread(EXIT_QUALIFICATION, &veinfo->exit_qualification); +__vmread(GUEST_LINEAR_ADDRESS, &veinfo->gla); +__vmread(GUEST_PHYSICAL_ADDRESS, &veinfo->gpa); +vmx_vmcs_exit(v); + +hvm_inject_hw_exception(TRAP_virtualisation, +HVM_DELIVER_NO_ERROR_CODE); + +out: +hvm_unmap_guest_frame(veinfo, 0); +return rc; +} + static struct hvm_function_table __initdata vmx_function_table = { .name = "VMX", .cpu_up_prepare = vmx_cpu_up_prepare, @@ -1822,6 +1921,9 @@ static struct hvm_function_table __initdata vmx_function_table = { .nhvm_hap_walk_L1_p2m = nvmx_hap_walk_L1_p2m, .hypervisor_cpuid_leaf = vmx_hypervisor_cpuid_leaf, .enable_msr_exit_interception = vmx_enable_msr_exit_interception, +.ap2m_vcpu_update_eptp = vmx_vcpu_update_eptp, +.ap2m_vcpu_update_vmfunc_ve = vmx_vcpu_update_vmfunc_ve, +.ap2m_vcpu_emulate_ve = vmx_vcpu_emulate_ve, }; const struct hvm_function_table * __init start_vmx(void) @@ -2743,6 +2845,42 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs) /* Now enable interrupts so it's safe to take locks. */ local_irq_enable(); + +/* + * If the guest has the ability to switch EPTP without an exit, + * figure out whether it has done so and update the altp2m data. 
+ */ +if ( altp2m_active(v->domain) && +(v->arch.hvm_vmx.secondary_exec_control & +SECONDARY_EXEC_ENABLE_VM_FUNCTIONS) ) +{ +unsigned long idx; + +if ( v->arch.hvm_vmx.secondary_exec_control & +SECONDARY_EXEC_ENABLE_VIRT_EXCEPTIONS ) +__vmread(EPTP_INDEX, &idx); +else +{ +unsigned long eptp; + +__vmread(EPT_POINTER, &eptp); + +if ( (idx = p2m_find_altp2m_by_eptp(v->domain, eptp)) == + INVALID_ALTP2M ) +{ +gdprintk(XENLOG_ERR, "EPTP not found in alternate p2
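When #VE/EPTP_INDEX support is absent, the exit handler above reads EPT_POINTER and searches the domain's EPTP list to discover which view the guest switched to via VMFUNC. A simplified version of that lookup (array shape assumed; the real p2m_find_altp2m_by_eptp() operates on d->arch.altp2m_eptp):

```c
#include <stdint.h>

#define MAX_ALTP2M     10u
#define INVALID_ALTP2M 0xffffu
#define INVALID_EPTP   (~0ull)

/* Linear search of the per-domain EPTP array; returns the view index
 * whose EPTP matches, or INVALID_ALTP2M if the pointer is unknown. */
static unsigned int find_altp2m_by_eptp(const uint64_t eptp_list[MAX_ALTP2M],
                                        uint64_t eptp)
{
    for ( unsigned int i = 0; i < MAX_ALTP2M; i++ )
        if ( eptp_list[i] != INVALID_EPTP && eptp_list[i] == eptp )
            return i;
    return INVALID_ALTP2M;
}
```

A miss is the error case logged by the handler: the guest loaded an EPTP Xen never put in the list.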
[Xen-devel] [PATCH v4 04/15] x86/HVM: Hardware alternate p2m support detection.
As implemented here, only supported on platforms with VMX HAP. By default this functionality is force-disabled, it can be enabled by specifying altp2m=1 on the Xen command line. Signed-off-by: Ed White Reviewed-by: Andrew Cooper --- docs/misc/xen-command-line.markdown | 7 +++ xen/arch/x86/hvm/hvm.c | 7 +++ xen/arch/x86/hvm/vmx/vmx.c | 1 + xen/include/asm-x86/hvm/hvm.h | 9 + 4 files changed, 24 insertions(+) diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown index aa684c0..3391c66 100644 --- a/docs/misc/xen-command-line.markdown +++ b/docs/misc/xen-command-line.markdown @@ -139,6 +139,13 @@ mode during S3 resume. > Default: `true` Permit Xen to use superpages when performing memory management. + +### altp2m (Intel) +> `= ` + ++> Default: `false` + +Permit multiple copies of host p2m. ### apic > `= bigsmp | default` diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c index 535d622..4019658 100644 --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -94,6 +94,10 @@ bool_t opt_hvm_fep; boolean_param("hvm_fep", opt_hvm_fep); #endif +/* Xen command-line option to enable altp2m */ +static bool_t __initdata opt_altp2m_enabled = 0; +boolean_param("altp2m", opt_altp2m_enabled); + static int cpu_callback( struct notifier_block *nfb, unsigned long action, void *hcpu) { @@ -160,6 +164,9 @@ static int __init hvm_enable(void) if ( !fns->pvh_supported ) printk(XENLOG_INFO "HVM: PVH mode not supported on this platform\n"); +if ( !opt_altp2m_enabled ) +hvm_funcs.altp2m_supported = 0; + /* * Allow direct access to the PC debug ports 0x80 and 0xed (they are * often used for I/O delays, but the vmexits simply slow things down). 
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c index fc29b89..07527dd 100644 --- a/xen/arch/x86/hvm/vmx/vmx.c +++ b/xen/arch/x86/hvm/vmx/vmx.c @@ -1841,6 +1841,7 @@ const struct hvm_function_table * __init start_vmx(void) if ( cpu_has_vmx_ept && (cpu_has_vmx_pat || opt_force_ept) ) { vmx_function_table.hap_supported = 1; +vmx_function_table.altp2m_supported = 1; vmx_function_table.hap_capabilities = 0; diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h index 57f9605..c61cfe7 100644 --- a/xen/include/asm-x86/hvm/hvm.h +++ b/xen/include/asm-x86/hvm/hvm.h @@ -94,6 +94,9 @@ struct hvm_function_table { /* Necessary hardware support for PVH mode? */ int pvh_supported; +/* Necessary hardware support for alternate p2m's? */ +bool_t altp2m_supported; + /* Indicate HAP capabilities. */ int hap_capabilities; @@ -509,6 +512,12 @@ bool_t nhvm_vmcx_hap_enabled(struct vcpu *v); /* interrupt */ enum hvm_intblk nhvm_interrupt_blocked(struct vcpu *v); +/* returns true if hardware supports alternate p2m's */ +static inline bool_t hvm_altp2m_supported(void) +{ +return hvm_funcs.altp2m_supported; +} + #ifndef NDEBUG /* Permit use of the Forced Emulation Prefix in HVM guests */ extern bool_t opt_hvm_fep; -- 1.9.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v4 02/15] VMX: VMFUNC and #VE definitions and detection.
Currently, neither is enabled globally but may be enabled on a per-VCPU basis by the altp2m code. Remove the check for EPTE bit 63 == zero in ept_split_super_page(), as that bit is now hardware-defined. Signed-off-by: Ed White Reviewed-by: Andrew Cooper Acked-by: George Dunlap Acked-by: Jun Nakajima --- xen/arch/x86/hvm/vmx/vmcs.c| 42 +++--- xen/arch/x86/mm/p2m-ept.c | 1 - xen/include/asm-x86/hvm/vmx/vmcs.h | 14 +++-- xen/include/asm-x86/hvm/vmx/vmx.h | 13 +++- xen/include/asm-x86/msr-index.h| 1 + 5 files changed, 64 insertions(+), 7 deletions(-) diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c index 4c5ceb5..bc1cabd 100644 --- a/xen/arch/x86/hvm/vmx/vmcs.c +++ b/xen/arch/x86/hvm/vmx/vmcs.c @@ -101,6 +101,8 @@ u32 vmx_secondary_exec_control __read_mostly; u32 vmx_vmexit_control __read_mostly; u32 vmx_vmentry_control __read_mostly; u64 vmx_ept_vpid_cap __read_mostly; +u64 vmx_vmfunc __read_mostly; +bool_t vmx_virt_exception __read_mostly; const u32 vmx_introspection_force_enabled_msrs[] = { MSR_IA32_SYSENTER_EIP, @@ -140,6 +142,8 @@ static void __init vmx_display_features(void) P(cpu_has_vmx_virtual_intr_delivery, "Virtual Interrupt Delivery"); P(cpu_has_vmx_posted_intr_processing, "Posted Interrupt Processing"); P(cpu_has_vmx_vmcs_shadowing, "VMCS shadowing"); +P(cpu_has_vmx_vmfunc, "VM Functions"); +P(cpu_has_vmx_virt_exceptions, "Virtualisation Exceptions"); P(cpu_has_vmx_pml, "Page Modification Logging"); #undef P @@ -185,6 +189,7 @@ static int vmx_init_vmcs_config(void) u64 _vmx_misc_cap = 0; u32 _vmx_vmexit_control; u32 _vmx_vmentry_control; +u64 _vmx_vmfunc = 0; bool_t mismatch = 0; rdmsr(MSR_IA32_VMX_BASIC, vmx_basic_msr_low, vmx_basic_msr_high); @@ -230,7 +235,9 @@ static int vmx_init_vmcs_config(void) SECONDARY_EXEC_ENABLE_EPT | SECONDARY_EXEC_ENABLE_RDTSCP | SECONDARY_EXEC_PAUSE_LOOP_EXITING | - SECONDARY_EXEC_ENABLE_INVPCID); + SECONDARY_EXEC_ENABLE_INVPCID | + SECONDARY_EXEC_ENABLE_VM_FUNCTIONS | + 
SECONDARY_EXEC_ENABLE_VIRT_EXCEPTIONS); rdmsrl(MSR_IA32_VMX_MISC, _vmx_misc_cap); if ( _vmx_misc_cap & VMX_MISC_VMWRITE_ALL ) opt |= SECONDARY_EXEC_ENABLE_VMCS_SHADOWING; @@ -341,6 +348,24 @@ static int vmx_init_vmcs_config(void) || !(_vmx_vmexit_control & VM_EXIT_ACK_INTR_ON_EXIT) ) _vmx_pin_based_exec_control &= ~ PIN_BASED_POSTED_INTERRUPT; +/* The IA32_VMX_VMFUNC MSR exists only when VMFUNC is available */ +if ( _vmx_secondary_exec_control & SECONDARY_EXEC_ENABLE_VM_FUNCTIONS ) +{ +rdmsrl(MSR_IA32_VMX_VMFUNC, _vmx_vmfunc); + +/* + * VMFUNC leaf 0 (EPTP switching) must be supported. + * + * Or we just don't use VMFUNC. + */ +if ( !(_vmx_vmfunc & VMX_VMFUNC_EPTP_SWITCHING) ) +_vmx_secondary_exec_control &= ~SECONDARY_EXEC_ENABLE_VM_FUNCTIONS; +} + +/* Virtualization exceptions are only enabled if VMFUNC is enabled */ +if ( !(_vmx_secondary_exec_control & SECONDARY_EXEC_ENABLE_VM_FUNCTIONS) ) +_vmx_secondary_exec_control &= ~SECONDARY_EXEC_ENABLE_VIRT_EXCEPTIONS; + min = 0; opt = VM_ENTRY_LOAD_GUEST_PAT | VM_ENTRY_LOAD_BNDCFGS; _vmx_vmentry_control = adjust_vmx_controls( @@ -361,6 +386,9 @@ static int vmx_init_vmcs_config(void) vmx_vmentry_control= _vmx_vmentry_control; vmx_basic_msr = ((u64)vmx_basic_msr_high << 32) | vmx_basic_msr_low; +vmx_vmfunc = _vmx_vmfunc; +vmx_virt_exception = !!(_vmx_secondary_exec_control & + SECONDARY_EXEC_ENABLE_VIRT_EXCEPTIONS); vmx_display_features(); /* IA-32 SDM Vol 3B: VMCS size is never greater than 4kB. 
*/ @@ -397,6 +425,9 @@ static int vmx_init_vmcs_config(void) mismatch |= cap_check( "EPT and VPID Capability", vmx_ept_vpid_cap, _vmx_ept_vpid_cap); +mismatch |= cap_check( +"VMFUNC Capability", +vmx_vmfunc, _vmx_vmfunc); if ( cpu_has_vmx_ins_outs_instr_info != !!(vmx_basic_msr_high & (VMX_BASIC_INS_OUT_INFO >> 32)) ) { @@ -967,6 +998,11 @@ static int construct_vmcs(struct vcpu *v) /* Do not enable Monitor Trap Flag unless start single step debug */ v->arch.hvm_vmx.exec_control &= ~CPU_BASED_MONITOR_TRAP_FLAG; +/* Disable VMFUNC and #VE for now: they may be enabled later by altp2m. */ +v->arch.hvm_vmx.secondary_exec_control &= +~(SECONDARY_EXEC_ENABLE_VM_FUNCTIONS | + SECONDARY_EXEC_ENABLE_VIRT_EXCEPTIONS); + if ( is_pvh_domain(d) ) { /* Disable virtual apics, TPR */ @@ -1790,9 +1826,9 @@ void v
[Xen-devel] [PATCH v4 00/15] Alternate p2m: support multiple copies of host p2m
This set of patches adds support to hvm domains for EPTP switching by creating multiple copies of the host p2m (currently limited to 10 copies). The primary use of this capability is expected to be in scenarios where access to memory needs to be monitored and/or restricted below the level at which the guest OS page tables operate. Two examples that were discussed at the 2014 Xen developer summit are:

VM introspection:
http://www.slideshare.net/xen_com_mgr/zero-footprint-guest-memory-introspection-from-xen

Secure inter-VM communication:
http://www.slideshare.net/xen_com_mgr/nakajima-nvf

A more detailed design specification can be found at:
http://lists.xenproject.org/archives/html/xen-devel/2015-06/msg01319.html

Each p2m copy is populated lazily on EPT violations. Permissions for pages in alternate p2m's can be changed in a similar way to the existing memory access interface, and gfn->mfn mappings can be changed. All this is done through extra HVMOP types.

The cross-domain HVMOP code has been compile-tested only. Also, the cross-domain code is hypervisor-only, the toolstack has not been modified. The intra-domain code has been tested. Violation notifications can only be received for pages that have been modified (access permissions and/or gfn->mfn mapping) intra-domain, and only on VCPU's that have enabled notification.

VMFUNC and #VE will both be emulated on hardware without native support.

This code is not compatible with nested hvm functionality and will refuse to work with nested hvm active. It is also not compatible with migration. It should be considered experimental.

Changes since v3: Major changes are:
- Replaced patch 8.
- Refactored patch 11 to use a single HVMOP with subcodes.
- Addressed feedback in patch 7, and some other patches.
- Added two tools/test patches from Tamas. Both are optional.
- Added various ack's and reviewed-by's.
- Rebased.

Ravi Sahita will now be the point of contact for this series.
Changes since v2: Addressed all v2 feedback *except*:

In patch 5, the per-domain EPTP list page is still allocated from the Xen heap. If allocated from the domain heap Xen panics - IIRC on Haswell hardware when walking the EPTP list during exit processing in patch 6.

HVM_ops are not merged. Tamas suggested merging the memory access ops, but in practice they are not as similar as they appear on the surface. Razvan suggested merging the implementation code in p2m.c, but that is also not as common as it appears on the surface. Andrew suggested merging all altp2m ops into one with a subop code in the input structure. His point that only 255 ops can be defined is well taken, but altp2m uses only 2 more ops than the recently introduced ioreq ops, and <15% of the available ops have been defined. Since we don't know how to implement XSM hooks and policy with the subop model, we have not adopted this suggestion.

The p2m set/get interface is not modified. The altp2m code needs to write suppress_ve in 2 places and read it in 1 place. The original patch series managed this by coupling the state of suppress_ve to the p2m memory type, which Tim disliked. In v2 of the series, special set/get interfaces were added to access suppress_ve only when required. Jan has suggested changing the existing interfaces, but we feel this is inappropriate for this experimental patch series. Changing the existing interfaces would require a design agreement to be reached and would impact a large amount of existing code.

Andrew kindly added some reviewed-by's to v2. I have not carried his reviewed-by of the memory event patch forward because Tamas requested significant changes to the patch.
Changes since v1: Many changes since v1 in response to maintainer feedback, including:
- Suppress_ve state is now decoupled from memory type
- VMFUNC emulation handled in x86 emulator
- Lazy-copy algorithm copies any page where mfn != INVALID_MFN
- All nested page fault handling except lazy-copy is now in top-level (hvm.c) nested page fault handler
- Split p2m lock type (as suggested by Tim) to avoid lock order violations
- XSM hooks
- Xen parameter to globally enable altp2m (default disabled) and HVM parameter
- Altp2m reference counting no longer uses dirty_cpu bitmap
- Remapped page tracking to invalidate altp2m's where needed to protect Xen
- Many other minor changes

The altp2m invalidation is implemented to a level that I believe satisfies the requirements of protecting Xen. Invalidation notification is not yet implemented, and there may be other cases where invalidation is warranted to protect the integrity of the restrictions placed through altp2m. We may add further patches in this area.

Testability is still a potential issue. We have offered to make our internal Windows test binaries available for intra-domain testing. Tamas
[Xen-devel] [PATCH v4 03/15] VMX: implement suppress #VE.
In preparation for selectively enabling #VE in a later patch, set suppress #VE on all EPTE's. Suppress #VE should always be the default condition for two reasons: it is generally not safe to deliver #VE into a guest unless that guest has been modified to receive it; and even then for most EPT violations only the hypervisor is able to handle the violation. Signed-off-by: Ed White Reviewed-by: Andrew Cooper Reviewed-by: George Dunlap Acked-by: Jun Nakajima --- xen/arch/x86/mm/p2m-ept.c | 26 +- 1 file changed, 25 insertions(+), 1 deletion(-) diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c index a6c9adf..4111795 100644 --- a/xen/arch/x86/mm/p2m-ept.c +++ b/xen/arch/x86/mm/p2m-ept.c @@ -41,7 +41,8 @@ #define is_epte_superpage(ept_entry)((ept_entry)->sp) static inline bool_t is_epte_valid(ept_entry_t *e) { -return (e->epte != 0 && e->sa_p2mt != p2m_invalid); +/* suppress_ve alone is not considered valid, so mask it off */ +return ((e->epte & ~(1ul << 63)) != 0 && e->sa_p2mt != p2m_invalid); } /* returns : 0 for success, -errno otherwise */ @@ -219,6 +220,8 @@ static void ept_p2m_type_to_flags(struct p2m_domain *p2m, ept_entry_t *entry, static int ept_set_middle_entry(struct p2m_domain *p2m, ept_entry_t *ept_entry) { struct page_info *pg; +ept_entry_t *table; +unsigned int i; pg = p2m_alloc_ptp(p2m, 0); if ( pg == NULL ) @@ -232,6 +235,15 @@ static int ept_set_middle_entry(struct p2m_domain *p2m, ept_entry_t *ept_entry) /* Manually set A bit to avoid overhead of MMU having to write it later. 
*/ ept_entry->a = 1; +ept_entry->suppress_ve = 1; + +table = __map_domain_page(pg); + +for ( i = 0; i < EPT_PAGETABLE_ENTRIES; i++ ) +table[i].suppress_ve = 1; + +unmap_domain_page(table); + return 1; } @@ -281,6 +293,7 @@ static int ept_split_super_page(struct p2m_domain *p2m, ept_entry_t *ept_entry, epte->sp = (level > 1); epte->mfn += i * trunk; epte->snp = (iommu_enabled && iommu_snoop); +epte->suppress_ve = 1; ept_p2m_type_to_flags(p2m, epte, epte->sa_p2mt, epte->access); @@ -790,6 +803,8 @@ ept_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn, ept_p2m_type_to_flags(p2m, &new_entry, p2mt, p2ma); } +new_entry.suppress_ve = 1; + rc = atomic_write_ept_entry(ept_entry, new_entry, target); if ( unlikely(rc) ) old_entry.epte = 0; @@ -,6 +1126,8 @@ static void ept_flush_pml_buffers(struct p2m_domain *p2m) int ept_p2m_init(struct p2m_domain *p2m) { struct ept_data *ept = &p2m->ept; +ept_entry_t *table; +unsigned int i; p2m->set_entry = ept_set_entry; p2m->get_entry = ept_get_entry; @@ -1134,6 +1151,13 @@ int ept_p2m_init(struct p2m_domain *p2m) p2m->flush_hardware_cached_dirty = ept_flush_pml_buffers; } +table = map_domain_page(pagetable_get_pfn(p2m_get_pagetable(p2m))); + +for ( i = 0; i < EPT_PAGETABLE_ENTRIES; i++ ) +table[i].suppress_ve = 1; + +unmap_domain_page(table); + if ( !zalloc_cpumask_var(&ept->synced_mask) ) return -ENOMEM; -- 1.9.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v4 01/15] common/domain: Helpers to pause a domain while in context
From: Andrew Cooper For use on codepaths which would need to use domain_pause() but might be in the target domain's context. In the case that the target domain is in context, all other vcpus are paused. Signed-off-by: Andrew Cooper --- xen/common/domain.c | 28 xen/include/xen/sched.h | 5 + 2 files changed, 33 insertions(+) diff --git a/xen/common/domain.c b/xen/common/domain.c index 3bc52e6..1bb24ae 100644 --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -1010,6 +1010,34 @@ int domain_unpause_by_systemcontroller(struct domain *d) return 0; } +void domain_pause_except_self(struct domain *d) +{ +struct vcpu *v, *curr = current; + +if ( curr->domain == d ) +{ +for_each_vcpu( d, v ) +if ( likely(v != curr) ) +vcpu_pause(v); +} +else +domain_pause(d); +} + +void domain_unpause_except_self(struct domain *d) +{ +struct vcpu *v, *curr = current; + +if ( curr->domain == d ) +{ +for_each_vcpu( d, v ) +if ( likely(v != curr) ) +vcpu_unpause(v); +} +else +domain_unpause(d); +} + int vcpu_reset(struct vcpu *v) { struct domain *d = v->domain; diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h index b29d9e7..73d3bc8 100644 --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -804,6 +804,11 @@ static inline int domain_pause_by_systemcontroller_nosync(struct domain *d) { return __domain_pause_by_systemcontroller(d, domain_pause_nosync); } + +/* domain_pause() but safe against trying to pause current. */ +void domain_pause_except_self(struct domain *d); +void domain_unpause_except_self(struct domain *d); + void cpu_init(void); struct scheduler; -- 1.9.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [v7][PATCH 10/16] tools: introduce some new parameters to set rdm policy
int libxl__device_pci_setdefault(libxl__gc *gc, libxl_device_pci *pci) { +/* We'd like to force reserve rdm specific to a device by default.*/ +if ( pci->rdm_policy == LIBXL_RDM_RESERVE_POLICY_INVALID) ^ I have just spotted that spurious whitespace. However I won't block this for that. Sorry, this is my typo. Acked-by: Ian Jackson (actually). I would appreciate it if you could ensure that this is fixed in any repost. You may retain my ack if you do that. Committers should feel free to fix it on commit. I fixed this in my tree. Thanks Tiejun Thanks, Ian. ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
> -Original Message-
> From: George Dunlap [mailto:george.dun...@eu.citrix.com]
> Sent: Thursday, July 09, 2015 8:53 PM
> To: Wu, Feng
> Cc: Dario Faggioli; Tian, Kevin; k...@xen.org; andrew.coop...@citrix.com;
> xen-devel; jbeul...@suse.com; Zhang, Yang Z
> Subject: Re: [Xen-devel] Fwd: [v3 14/15] Update Posted-Interrupts Descriptor
> during vCPU scheduling
>
> On 07/09/2015 12:38 PM, Wu, Feng wrote:
> >
> >> -Original Message-
> >> From: dunl...@gmail.com [mailto:dunl...@gmail.com] On Behalf Of George
> >> Dunlap
> >> Sent: Thursday, July 09, 2015 7:20 PM
> >> To: Wu, Feng
> >> Cc: Dario Faggioli; Tian, Kevin; k...@xen.org; andrew.coop...@citrix.com;
> >> xen-devel; jbeul...@suse.com; Zhang, Yang Z
> >> Subject: Re: [Xen-devel] Fwd: [v3 14/15] Update Posted-Interrupts
> >> Descriptor during vCPU scheduling
> >>
> >> On Thu, Jul 9, 2015 at 4:09 AM, Wu, Feng wrote:
> That does not necessarily mean "we need to do something" in
> vcpu_runstate_change(). Actually, that's exactly what I'm asking: can
> you check whether this thing that you need doing can be done somewhere
> else than in vcpu_runstate_change() ?
> >>>
> >>> Why do you think vcpu_runstate_change() is not the right place to do
> >>> this?
> >>
> >> Because what the vcpu_runstate_change() function does at the moment is
> >> *update the vcpu runstate variable*. It doesn't actually change the
> >> runstate -- the runstate is changed in the various bits of code that
> >> call it; and it's not designed to be a generic place to put hooks on
> >> the runstate changing.
> >>
> >> I haven't done a thorough review of this yet, but at least looking
> >> through this patch, and skimming the titles, I don't see anywhere you
> >> handle migration -- what happens if a vcpu that's blocked / offline /
> >> runnable migrates from one cpu to another? Is the information
> >> updated?
> >
> > Thanks for your review!
>
> And I'd like to say -- sorry that I didn't notice this issue sooner; I
> know you've had your series posted for quite a while, but I didn't
> realize until last week that it actually involved the scheduler. It's
> really my fault for not paying closer attention -- you did CC me in v2
> back in June.
>
> > The migration is handled in arch_pi_desc_update() which is called
> > by vcpu_runstate_change().
>
> Well as far as I can tell from looking at the code,
> vcpu_runstate_change() will not be called when migrating a vcpu which is
> already blocked.
>
> Consider the following scenario:
> - v1 blocks on pcpu 0.
> - vcpu_runstate_change() will do everything necessary for v1 on p0.
> - The scheduler does load balancing and moves v1 to p1, calling
>   vcpu_migrate(). Because the vcpu is still blocked,
>   vcpu_runstate_change() is not called.
> - A device interrupt is generated.
>
> What happens to the interrupt? Does everything still work properly, or
> will the device wake-up interrupt go to the wrong pcpu (p0 rather than p1)?

I think it works correctly. Before blocking, we save v->processor and add
the vCPU to that pCPU's per-cpu list. Even if the vCPU is later migrated to
another pCPU, the wakeup notification event will still go to the original
one (p0 in this case), which is exactly what I want: in p0's list we can
find and unblock the blocked vCPU. That is the point.

Thanks,
Feng

> > > or to add a set of architectural hooks (similar to
> >> the SCHED_OP() hooks) in the various places you need them.
> >
> > I don't have a clear picture of this method, but from your comments it
> > seems we would need to put the logic in many different places and be
> > very careful not to miss any. I think the above method is clearer and
> > more straightforward, since we have a central place to handle all the
> > cases. Anyway, if you prefer this one, it would be highly appreciated
> > if you could give a more detailed solution! Thank you!
> > Well you can check to make sure you've caught at least all the places > you had before by searching for vcpu_runstate_change(). :-) > > Using the callback method also can help prompt you to think about other > times you may need to do something. For instance, you might still > consider searching for SCHED_OP() everywhere in schedule.c and seeing if > that's a place you need to do something (similar to the migration thing > above). > > Anyway, the most detailed thing I can say at this time is to look at > SCHED_OP() and see if doing something like that, but for architectural > callbacks, makes sense. > > I'll come back and take a closer look a bit later. > > -George > ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
> -Original Message-
> From: Dario Faggioli [mailto:dario.faggi...@citrix.com]
> Sent: Thursday, July 09, 2015 8:42 PM
> To: Wu, Feng
> Cc: George Dunlap; Tian, Kevin; k...@xen.org; andrew.coop...@citrix.com;
> xen-devel; jbeul...@suse.com; Zhang, Yang Z
> Subject: Re: [Xen-devel] Fwd: [v3 14/15] Update Posted-Interrupts Descriptor
> during vCPU scheduling
>
> On Thu, 2015-07-09 at 11:38 +, Wu, Feng wrote:
> >
> > > -Original Message-
> > > From: dunl...@gmail.com [mailto:dunl...@gmail.com] On Behalf Of George
> > > Dunlap
> > >
> > > Why do you think vcpu_runstate_change() is not the right place to do
> > > this?
> > >
> > > Because what the vcpu_runstate_change() function does at the moment is
> > > *update the vcpu runstate variable*. It doesn't actually change the
> > > runstate -- the runstate is changed in the various bits of code that
> > > call it; and it's not designed to be a generic place to put hooks on
> > > the runstate changing.
> > >
> > > I haven't done a thorough review of this yet, but at least looking
> > > through this patch, and skimming the titles, I don't see anywhere you
> > > handle migration -- what happens if a vcpu that's blocked / offline /
> > > runnable migrates from one cpu to another? Is the information
> > > updated?
> >
> > Thanks for your review!
> >
> > The migration is handled in arch_pi_desc_update() which is called
> > by vcpu_runstate_change().
> >
> > > The right thing to do in this situation is either to change
> > > vcpu_runstate_change() so that it is the central place to make all (or
> > > most) hooks happen;
> >
> > Yes, this is my implementation. I think vcpu_runstate_change()
> > is the _central_ place to do things when vCPU state is changed. This
> > makes things clear and simple. I call an arch hook to update the
> > posted-interrupt descriptor in this function.
> > > Perhaps, one way to double check this line of reasoning (the fact that
> you think this needs to lay on top of runstates, and more specifically
> in that function), would be to come up with some kind of "list of
> requirements", not taking runstates into account.
>
> I know there is a design document for this series (and I also know I
> could have commented on it earlier, sorry for that), but that itself
> mentions runstates, which does not help.
>
> What I mean is, can you describe when each specific operation needs to
> happen? Something like "descriptor needs to be updated like this upon
> migration", "notification should be disabled when vcpu starts running",
> "notification method should be changed that other way when vcpu is
> preempted", etc.

I cannot see the difference; I think the requirements are clearly listed
in the design doc and in the comments of this patch.

> This would help a lot, IMO, figuring out the actual functional
> requirements that need to be satisfied for things to work well. Once
> that is done, we can go check in the code where is the best place to put
> each call, hook, or whatever.
>
> Note that I've already tried to infer the above, by looking at the
> patches, and that is making me think that it would be possible to
> implement things in another way. But maybe I'm missing something. So it
> would be really valuable if you, with all your knowledge of how PI
> should work, could do it.

I have been describing how PI works, what the purpose of the two vectors
is, and how special they are, from the beginning.

Thanks,
Feng

> Thanks and Regards,
> Dario
> --
> <> (Raistlin Majere)
> -
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v6] run QEMU as non-root
On 07/09/2015 04:34 AM, Ian Campbell wrote:
> On Wed, 2015-07-01 at 15:03 -0600, Jim Fehlig wrote:
> > Perhaps. But thanks for providing a way (b_info->device_model_user) for
> > apps to override the libxl policy.
>
> You mentioned in v5 that libvirt supports setting both the user and the
> group and that the qemu driver supports that. How does that work? AFAICT
> qemu's -runas option only takes a user and it takes that user's primary
> group and uses that with no configurability. I think that's a fine way to
> do things, but you implied greater configurability in libvirt and I'm now
> curious...

The libvirt qemu driver doesn't use qemu's -runas option. It calls
setregid()/setreuid() in the child after fork()'ing but before exec()'ing
qemu.

Regards,
Jim

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
[Xen-devel] [linux-linus test] 59254: tolerable FAIL - PUSHED
flight 59254 linux-linus real [real]
http://logs.test-lab.xenproject.org/osstest/logs/59254/

Failures :-/ but no regressions.

Regressions which are regarded as allowable (not blocking):
 test-amd64-i386-libvirt              11 guest-start               fail like 59086
 test-amd64-amd64-libvirt             11 guest-start               fail like 59086
 test-amd64-amd64-xl-qemuu-win7-amd64 16 guest-stop                fail like 59086
 test-amd64-i386-xl-qemuu-win7-amd64  16 guest-stop                fail like 59086

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-xl-pvh-amd          11 guest-start               fail never pass
 test-amd64-amd64-xl-pvh-intel        13 guest-saverestore         fail never pass
 test-amd64-i386-libvirt-xsm          12 migrate-support-check     fail never pass
 test-amd64-amd64-libvirt-xsm         12 migrate-support-check     fail never pass
 test-armhf-armhf-xl-xsm              12 migrate-support-check     fail never pass
 test-armhf-armhf-xl                  12 migrate-support-check     fail never pass
 test-armhf-armhf-xl-arndale          12 migrate-support-check     fail never pass
 test-armhf-armhf-xl-cubietruck       12 migrate-support-check     fail never pass
 test-armhf-armhf-xl-credit2          12 migrate-support-check     fail never pass
 test-armhf-armhf-libvirt             12 migrate-support-check     fail never pass
 test-amd64-i386-xl-qemut-win7-amd64  16 guest-stop                fail never pass
 test-amd64-amd64-xl-qemut-win7-amd64 16 guest-stop                fail never pass
 test-armhf-armhf-xl-rtds             12 migrate-support-check     fail never pass
 test-armhf-armhf-xl-rtds             15 guest-start/debian.repeat fail never pass
 test-armhf-armhf-xl-multivcpu        12 migrate-support-check     fail never pass
 test-armhf-armhf-libvirt-xsm         12 migrate-support-check     fail never pass

version targeted for testing:
 linux                45820c294fe1b1a9df495d57f40585ef2d069a39
baseline version:
 linux                1c4c7159ed2468f3ac4ce5a7f08d79663d381a93

Last test of basis    59086  2015-07-06 04:24:57 Z    3 days
Failing since         59130  2015-07-07 04:25:30 Z    2 days    3 attempts
Testing same since    59254  2015-07-09 04:20:48 Z    0 days    1 attempts

People who touched revisions under test:
  Arnaldo Carvalho de Melo
  Greg Kroah-Hartman
  Ingo Molnar
  John Stultz
  Laura Abbott
  Laura Abbott
  Linus Torvalds
  Mark Rutland
  Nathan Lynch
  Nicolas Pitre
  Peter Zijlstra (Intel)
  Peter Zijlstra
  Russell King
  Santosh Shilimkar
  Stephen Boyd
  Steven Rostedt
  Szabolcs Nagy
  Tomas Winkler
  Vitaly Andrianov
  Will Deacon
  Wolfram Sang
  Yann Droneaud

jobs:
 build-amd64-xsm                                       pass
 build-armhf-xsm                                       pass
 build-i386-xsm                                        pass
 build-amd64                                           pass
 build-armhf                                           pass
 build-i386                                            pass
 build-amd64-libvirt                                   pass
 build-armhf-libvirt                                   pass
 build-i386-libvirt                                    pass
 build-amd64-pvops                                     pass
 build-armhf-pvops                                     pass
 build-i386-pvops                                      pass
 build-amd64-rumpuserxen                               pass
 build-i386-rumpuserxen                                pass
 test-amd64-amd64-xl                                   pass
 test-armhf-armhf-xl                                   pass
 test-amd64-i386-xl                                    pass
 test-amd64-amd64-xl-qemut-debianhvm-amd64-xsm         pass
 test-amd64-i386-xl-qemut-debianhvm-amd64-xsm          pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64-xsm         pass
 test-amd64-i386-xl-qemuu-debianhvm-amd64-xsm          pass
 test-amd64-amd64-xl-qemut-stubdom-debianhvm-amd64-xsm pass
 test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm  pass
 test-amd64-amd64-libvirt-xsm                          pass
 test-armhf-armhf-libvirt-xsm                          pass
 test-amd64-i386-libvirt-xsm                           pass
 test-amd64-amd64-xl-xsm                               pass
 test-armhf-armhf-xl-xsm                               pass
 test-amd64-i386-xl-xsm                                pass
 test-amd64-amd64-xl-pvh-amd                           fail
 test-amd64-i386-qemut-rhel6hvm-amd                    pass
 test-amd64-i386-qemuu-rhel6hvm-amd                    pass
[Xen-devel] [PATCH v2 05/20] block/xen-blkfront: Split blkif_queue_request in 2
Currently, blkif_queue_request has 2 distinct execution path: - Send a discard request - Send a read/write request The function is also allocating grants to use for generating the request. Although, this is only used for read/write request. Rather than having a function with 2 distinct execution path, separate the function in 2. This will also remove one level of tabulation. Signed-off-by: Julien Grall Cc: Konrad Rzeszutek Wilk Cc: Roger Pau Monné Cc: Boris Ostrovsky Cc: David Vrabel --- Changes in v2: - Patch added --- drivers/block/xen-blkfront.c | 280 +++ 1 file changed, 153 insertions(+), 127 deletions(-) diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c index 6d89ed3..7107d58 100644 --- a/drivers/block/xen-blkfront.c +++ b/drivers/block/xen-blkfront.c @@ -392,13 +392,35 @@ static int blkif_ioctl(struct block_device *bdev, fmode_t mode, return 0; } -/* - * Generate a Xen blkfront IO request from a blk layer request. Reads - * and writes are handled as expected. - * - * @req: a request struct - */ -static int blkif_queue_request(struct request *req) +static int blkif_queue_discard_req(struct request *req) +{ + struct blkfront_info *info = req->rq_disk->private_data; + struct blkif_request *ring_req; + unsigned long id; + + /* Fill out a communications ring structure. */ + ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt); + id = get_id_from_freelist(info); + info->shadow[id].request = req; + + ring_req->operation = BLKIF_OP_DISCARD; + ring_req->u.discard.nr_sectors = blk_rq_sectors(req); + ring_req->u.discard.id = id; + ring_req->u.discard.sector_number = (blkif_sector_t)blk_rq_pos(req); + if ((req->cmd_flags & REQ_SECURE) && info->feature_secdiscard) + ring_req->u.discard.flag = BLKIF_DISCARD_SECURE; + else + ring_req->u.discard.flag = 0; + + info->ring.req_prod_pvt++; + + /* Keep a private copy so we can reissue requests when recovering. 
*/ + info->shadow[id].req = *ring_req; + + return 0; +} + +static int blkif_queue_rw_req(struct request *req) { struct blkfront_info *info = req->rq_disk->private_data; struct blkif_request *ring_req; @@ -418,9 +440,6 @@ static int blkif_queue_request(struct request *req) struct scatterlist *sg; int nseg, max_grefs; - if (unlikely(info->connected != BLKIF_STATE_CONNECTED)) - return 1; - max_grefs = req->nr_phys_segments; if (max_grefs > BLKIF_MAX_SEGMENTS_PER_REQUEST) /* @@ -450,139 +469,128 @@ static int blkif_queue_request(struct request *req) id = get_id_from_freelist(info); info->shadow[id].request = req; - if (unlikely(req->cmd_flags & (REQ_DISCARD | REQ_SECURE))) { - ring_req->operation = BLKIF_OP_DISCARD; - ring_req->u.discard.nr_sectors = blk_rq_sectors(req); - ring_req->u.discard.id = id; - ring_req->u.discard.sector_number = (blkif_sector_t)blk_rq_pos(req); - if ((req->cmd_flags & REQ_SECURE) && info->feature_secdiscard) - ring_req->u.discard.flag = BLKIF_DISCARD_SECURE; - else - ring_req->u.discard.flag = 0; + BUG_ON(info->max_indirect_segments == 0 && + req->nr_phys_segments > BLKIF_MAX_SEGMENTS_PER_REQUEST); + BUG_ON(info->max_indirect_segments && + req->nr_phys_segments > info->max_indirect_segments); + nseg = blk_rq_map_sg(req->q, req, info->shadow[id].sg); + ring_req->u.rw.id = id; + if (nseg > BLKIF_MAX_SEGMENTS_PER_REQUEST) { + /* +* The indirect operation can only be a BLKIF_OP_READ or +* BLKIF_OP_WRITE +*/ + BUG_ON(req->cmd_flags & (REQ_FLUSH | REQ_FUA)); + ring_req->operation = BLKIF_OP_INDIRECT; + ring_req->u.indirect.indirect_op = rq_data_dir(req) ? 
+ BLKIF_OP_WRITE : BLKIF_OP_READ; + ring_req->u.indirect.sector_number = (blkif_sector_t)blk_rq_pos(req); + ring_req->u.indirect.handle = info->handle; + ring_req->u.indirect.nr_segments = nseg; } else { - BUG_ON(info->max_indirect_segments == 0 && - req->nr_phys_segments > BLKIF_MAX_SEGMENTS_PER_REQUEST); - BUG_ON(info->max_indirect_segments && - req->nr_phys_segments > info->max_indirect_segments); - nseg = blk_rq_map_sg(req->q, req, info->shadow[id].sg); - ring_req->u.rw.id = id; - if (nseg > BLKIF_MAX_SEGMENTS_PER_REQUEST) { + ring_req->u.rw.sector_number = (blkif_sector_t)blk_rq_pos(req); + ring_req->u.rw.handle = info->handle; + ri
[Xen-devel] [PATCH v2 14/20] xen/grant-table: Make it running on 64KB granularity
The Xen interface is using 4KB page granularity. This means that each grant is 4KB. The current implementation allocates a Linux page per grant. On Linux using 64KB page granularity, only the first 4KB of the page will be used. We could decrease the memory wasted by sharing the page with multiple grant. It will require some care with the {Set,Clear}ForeignPage macro. Note that no changes has been made in the x86 code because both Linux and Xen will only use 4KB page granularity. Signed-off-by: Julien Grall Reviewed-by: David Vrabel Cc: Stefano Stabellini Cc: Russell King Cc: Konrad Rzeszutek Wilk Cc: Boris Ostrovsky --- Changes in v2 - Add David's reviewed-by --- arch/arm/xen/p2m.c| 6 +++--- drivers/xen/grant-table.c | 6 +++--- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/arm/xen/p2m.c b/arch/arm/xen/p2m.c index 887596c..0ed01f2 100644 --- a/arch/arm/xen/p2m.c +++ b/arch/arm/xen/p2m.c @@ -93,8 +93,8 @@ int set_foreign_p2m_mapping(struct gnttab_map_grant_ref *map_ops, for (i = 0; i < count; i++) { if (map_ops[i].status) continue; - set_phys_to_machine(map_ops[i].host_addr >> PAGE_SHIFT, - map_ops[i].dev_bus_addr >> PAGE_SHIFT); + set_phys_to_machine(map_ops[i].host_addr >> XEN_PAGE_SHIFT, + map_ops[i].dev_bus_addr >> XEN_PAGE_SHIFT); } return 0; @@ -108,7 +108,7 @@ int clear_foreign_p2m_mapping(struct gnttab_unmap_grant_ref *unmap_ops, int i; for (i = 0; i < count; i++) { - set_phys_to_machine(unmap_ops[i].host_addr >> PAGE_SHIFT, + set_phys_to_machine(unmap_ops[i].host_addr >> XEN_PAGE_SHIFT, INVALID_P2M_ENTRY); } diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c index 3679293..0a1f903 100644 --- a/drivers/xen/grant-table.c +++ b/drivers/xen/grant-table.c @@ -668,7 +668,7 @@ int gnttab_setup_auto_xlat_frames(phys_addr_t addr) if (xen_auto_xlat_grant_frames.count) return -EINVAL; - vaddr = xen_remap(addr, PAGE_SIZE * max_nr_gframes); + vaddr = xen_remap(addr, XEN_PAGE_SIZE * max_nr_gframes); if (vaddr == NULL) { pr_warn("Failed 
to ioremap gnttab share frames (addr=%pa)!\n", &addr); @@ -680,7 +680,7 @@ int gnttab_setup_auto_xlat_frames(phys_addr_t addr) return -ENOMEM; } for (i = 0; i < max_nr_gframes; i++) - pfn[i] = PFN_DOWN(addr) + i; + pfn[i] = XEN_PFN_DOWN(addr) + i; xen_auto_xlat_grant_frames.vaddr = vaddr; xen_auto_xlat_grant_frames.pfn = pfn; @@ -1004,7 +1004,7 @@ static void gnttab_request_version(void) { /* Only version 1 is used, which will always be available. */ grant_table_version = 1; - grefs_per_grant_frame = PAGE_SIZE / sizeof(struct grant_entry_v1); + grefs_per_grant_frame = XEN_PAGE_SIZE / sizeof(struct grant_entry_v1); gnttab_interface = &gnttab_v1_ops; pr_info("Grant tables using version %d layout\n", grant_table_version); -- 2.1.4 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v2 10/20] xen/xenbus: Use Xen page definition
All the ring (xenstore, and PV rings) are always based on the page granularity of Xen. Signed-off-by: Julien Grall Cc: Konrad Rzeszutek Wilk Cc: Boris Ostrovsky Cc: David Vrabel --- Changes in v2: - Also update the ring mapping function --- drivers/xen/xenbus/xenbus_client.c | 6 +++--- drivers/xen/xenbus/xenbus_probe.c | 4 ++-- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/xen/xenbus/xenbus_client.c b/drivers/xen/xenbus/xenbus_client.c index 9ad3272..80272f6 100644 --- a/drivers/xen/xenbus/xenbus_client.c +++ b/drivers/xen/xenbus/xenbus_client.c @@ -388,7 +388,7 @@ int xenbus_grant_ring(struct xenbus_device *dev, void *vaddr, } grefs[i] = err; - vaddr = vaddr + PAGE_SIZE; + vaddr = vaddr + XEN_PAGE_SIZE; } return 0; @@ -555,7 +555,7 @@ static int xenbus_map_ring_valloc_pv(struct xenbus_device *dev, if (!node) return -ENOMEM; - area = alloc_vm_area(PAGE_SIZE * nr_grefs, ptes); + area = alloc_vm_area(XEN_PAGE_SIZE * nr_grefs, ptes); if (!area) { kfree(node); return -ENOMEM; @@ -750,7 +750,7 @@ static int xenbus_unmap_ring_vfree_pv(struct xenbus_device *dev, void *vaddr) unsigned long addr; memset(&unmap[i], 0, sizeof(unmap[i])); - addr = (unsigned long)vaddr + (PAGE_SIZE * i); + addr = (unsigned long)vaddr + (XEN_PAGE_SIZE * i); unmap[i].host_addr = arbitrary_virt_to_machine( lookup_address(addr, &level)).maddr; unmap[i].dev_bus_addr = 0; diff --git a/drivers/xen/xenbus/xenbus_probe.c b/drivers/xen/xenbus/xenbus_probe.c index 4308fb3..c67e5ba 100644 --- a/drivers/xen/xenbus/xenbus_probe.c +++ b/drivers/xen/xenbus/xenbus_probe.c @@ -713,7 +713,7 @@ static int __init xenstored_local_init(void) xen_store_mfn = xen_start_info->store_mfn = pfn_to_mfn(virt_to_phys((void *)page) >> - PAGE_SHIFT); + XEN_PAGE_SHIFT); /* Next allocate a local port which xenstored can bind to */ alloc_unbound.dom= DOMID_SELF; @@ -804,7 +804,7 @@ static int __init xenbus_init(void) goto out_error; xen_store_mfn = (unsigned long)v; xen_store_interface = - 
xen_remap(xen_store_mfn << PAGE_SHIFT, PAGE_SIZE); + xen_remap(xen_store_mfn << XEN_PAGE_SHIFT, XEN_PAGE_SIZE); break; default: pr_warn("Xenstore state unknown\n"); -- 2.1.4 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
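In the patched `xenbus_grant_ring()` above, the ring is granted in 4KB units: the i-th grant covers the 4KB slice at offset `i * XEN_PAGE_SIZE`, and the whole mapping occupies `nr_grefs * XEN_PAGE_SIZE` bytes whatever the kernel page size. A small sketch of that bookkeeping (the function names here are ours, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

#define XEN_PAGE_SIZE 4096UL

/* Byte offset of the i-th 4KB grant inside a multi-grant ring mapping;
 * mirrors the "vaddr = vaddr + XEN_PAGE_SIZE" stepping in xenbus_grant_ring(). */
static size_t ring_grant_offset(unsigned int i)
{
	return (size_t)i * XEN_PAGE_SIZE;
}

/* Total bytes to pass to alloc_vm_area()/xen_remap() for nr_grefs grants. */
static size_t ring_map_size(unsigned int nr_grefs)
{
	return (size_t)nr_grefs * XEN_PAGE_SIZE;
}
```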
[Xen-devel] [PATCH v2 15/20] block/xen-blkfront: Make it running on 64KB page granularity
From: Julien Grall The PV block protocol is using 4KB page granularity. The goal of this patch is to allow a Linux using 64KB page granularity using block device on a non-modified Xen. The block API is using segment which should at least be the size of a Linux page. Therefore, the driver will have to break the page in chunk of 4K before giving the page to the backend. Breaking a 64KB segment in 4KB chunk will result to have some chunk with no data. As the PV protocol always require to have data in the chunk, we have to count the number of Xen page which will be in use and avoid to sent empty chunk. Note that, a pre-defined number of grant is reserved before preparing the request. This pre-defined number is based on the number and the maximum size of the segments. If each segment contain a very small amount of data, the driver may reserve too much grant (16 grant is reserved per segment with 64KB page granularity). Futhermore, in the case of persistent grant we allocate one Linux page per grant although only the 4KB of the page will be effectively use. This could be improved by share the page with multiple grants. Signed-off-by: Julien Grall Cc: Konrad Rzeszutek Wilk Cc: Roger Pau Monné Cc: Boris Ostrovsky Cc: David Vrabel --- Improvement such as support 64KB grant is not taken into consideration in this patch because we have the requirement to run a Linux using 64KB page on a non-modified Xen. 
Changes in v2: - Use gnttab_foreach_grant to split a Linux page into grant --- drivers/block/xen-blkfront.c | 304 --- 1 file changed, 198 insertions(+), 106 deletions(-) diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c index 95fd067..644ba76 100644 --- a/drivers/block/xen-blkfront.c +++ b/drivers/block/xen-blkfront.c @@ -77,6 +77,7 @@ struct blk_shadow { struct grant **grants_used; struct grant **indirect_grants; struct scatterlist *sg; + unsigned int num_sg; }; struct split_bio { @@ -106,8 +107,8 @@ static unsigned int xen_blkif_max_ring_order; module_param_named(max_ring_page_order, xen_blkif_max_ring_order, int, S_IRUGO); MODULE_PARM_DESC(max_ring_page_order, "Maximum order of pages to be used for the shared ring"); -#define BLK_RING_SIZE(info) __CONST_RING_SIZE(blkif, PAGE_SIZE * (info)->nr_ring_pages) -#define BLK_MAX_RING_SIZE __CONST_RING_SIZE(blkif, PAGE_SIZE * XENBUS_MAX_RING_PAGES) +#define BLK_RING_SIZE(info) __CONST_RING_SIZE(blkif, XEN_PAGE_SIZE * (info)->nr_ring_pages) +#define BLK_MAX_RING_SIZE __CONST_RING_SIZE(blkif, XEN_PAGE_SIZE * XENBUS_MAX_RING_PAGES) /* * ring-ref%i i=(-1UL) would take 11 characters + 'ring-ref' is 8, so 19 * characters are enough. Define to 20 to keep consist with backend. @@ -146,6 +147,7 @@ struct blkfront_info unsigned int discard_granularity; unsigned int discard_alignment; unsigned int feature_persistent:1; + /* Number of 4K segment handled */ unsigned int max_indirect_segments; int is_ready; }; @@ -173,10 +175,19 @@ static DEFINE_SPINLOCK(minor_lock); #define DEV_NAME "xvd" /* name in /dev */ -#define SEGS_PER_INDIRECT_FRAME \ - (PAGE_SIZE/sizeof(struct blkif_request_segment)) -#define INDIRECT_GREFS(_segs) \ - ((_segs + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME) +/* + * Xen use 4K pages. 
The guest may use different page size (4K or 64K) + * Number of Xen pages per segment + */ +#define XEN_PAGES_PER_SEGMENT (PAGE_SIZE / XEN_PAGE_SIZE) + +#define SEGS_PER_INDIRECT_FRAME\ + (XEN_PAGE_SIZE/sizeof(struct blkif_request_segment) / XEN_PAGES_PER_SEGMENT) +#define XEN_PAGES_PER_INDIRECT_FRAME \ + (XEN_PAGE_SIZE/sizeof(struct blkif_request_segment)) + +#define INDIRECT_GREFS(_pages) \ + ((_pages + XEN_PAGES_PER_INDIRECT_FRAME - 1)/XEN_PAGES_PER_INDIRECT_FRAME) static int blkfront_setup_indirect(struct blkfront_info *info); @@ -463,14 +474,100 @@ static int blkif_queue_discard_req(struct request *req) return 0; } +struct setup_rw_req { + unsigned int grant_idx; + struct blkif_request_segment *segments; + struct blkfront_info *info; + struct blkif_request *ring_req; + grant_ref_t gref_head; + unsigned int id; + /* Only used when persistent grant is used and it's a read request */ + bool need_copy; + unsigned int bvec_off; + char *bvec_data; +}; + +static void blkif_setup_rw_req_grant(unsigned long mfn, unsigned int offset, +unsigned int *len, void *data) +{ + struct setup_rw_req *setup = data; + int n, ref; + struct grant *gnt_list_entry; + unsigned int fsect, lsect; + /* Convenient aliases */ + unsigned int grant_idx = setup->grant_idx; + struct blkif_request *ring_req = setup->ring_req; + struct blkfront_info *info = setup->info; + struct blk_shadow *shadow = &info->
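The indirect-descriptor macros above reduce to simple arithmetic. A standalone sketch with a 64KB kernel page, assuming `sizeof(struct blkif_request_segment) == 8` (the actual struct size is not shown in this excerpt):

```c
#include <assert.h>

#define XEN_PAGE_SIZE 4096UL
#define PAGE_SIZE     65536UL  /* assume a 64KB kernel */
#define SEG_ENTRY_SZ  8UL      /* assumed sizeof(struct blkif_request_segment) */

/* Number of 4KB Xen pages per Linux-page-sized segment. */
#define XEN_PAGES_PER_SEGMENT        (PAGE_SIZE / XEN_PAGE_SIZE)
/* 4KB grant entries that fit in one indirect frame... */
#define XEN_PAGES_PER_INDIRECT_FRAME (XEN_PAGE_SIZE / SEG_ENTRY_SZ)
/* ...which translates to this many Linux-sized segments per frame. */
#define SEGS_PER_INDIRECT_FRAME \
	(XEN_PAGES_PER_INDIRECT_FRAME / XEN_PAGES_PER_SEGMENT)
/* Indirect grant frames needed to describe a given number of Xen pages. */
#define INDIRECT_GREFS(pages) \
	(((pages) + XEN_PAGES_PER_INDIRECT_FRAME - 1) / XEN_PAGES_PER_INDIRECT_FRAME)
```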
[Xen-devel] [PATCH v2 13/20] xen/events: fifo: Make it run with 64KB granularity
Only use the first 4KB of the page to store the event channel info. This means we waste 60KB every time we allocate a page for:
 * the control block: a page is allocated per CPU
 * the event array: a page is allocated every time we need to expand it

I think we can reduce the memory waste in both areas by:
 * control block: sharing the page between multiple vCPUs. This will require some bookkeeping so the page is not freed when a CPU goes offline while other CPUs still share it
 * event array: always extending the event array by 64KB (i.e. 16 4KB chunks). That would require more care when we fail to expand the event channel.

Signed-off-by: Julien Grall
Cc: Konrad Rzeszutek Wilk
Cc: Boris Ostrovsky
Cc: David Vrabel
---
 drivers/xen/events/events_base.c | 2 +-
 drivers/xen/events/events_fifo.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 96093ae..858d2f6 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -40,11 +40,11 @@
 #include
 #include
 #include
-#include
 #endif
 #include
 #include
 #include
+#include
 #include
 #include

diff --git a/drivers/xen/events/events_fifo.c b/drivers/xen/events/events_fifo.c
index ed673e1..d53c297 100644
--- a/drivers/xen/events/events_fifo.c
+++ b/drivers/xen/events/events_fifo.c
@@ -54,7 +54,7 @@
 #include "events_internal.h"

-#define EVENT_WORDS_PER_PAGE (PAGE_SIZE / sizeof(event_word_t))
+#define EVENT_WORDS_PER_PAGE (XEN_PAGE_SIZE / sizeof(event_word_t))
 #define MAX_EVENT_ARRAY_PAGES (EVTCHN_FIFO_NR_CHANNELS / EVENT_WORDS_PER_PAGE)

 struct evtchn_fifo_queue {
--
2.1.4
___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
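The one functional change in this patch packs fewer event words per allocation unit. A sketch of the arithmetic, assuming 32-bit event words and a 2^17 channel limit in the FIFO ABI (both are assumptions; the excerpt does not show them):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t event_word_t;               /* assumed 32-bit FIFO event words */
#define XEN_PAGE_SIZE           4096UL
#define EVTCHN_FIFO_NR_CHANNELS (1UL << 17)  /* assumed ABI channel limit */

/* Event words stored in the first 4KB of each allocated page. */
#define EVENT_WORDS_PER_PAGE  (XEN_PAGE_SIZE / sizeof(event_word_t))
#define MAX_EVENT_ARRAY_PAGES (EVTCHN_FIFO_NR_CHANNELS / EVENT_WORDS_PER_PAGE)
```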
[Xen-devel] [PATCH v2 18/20] net/xen-netback: Make it run with 64KB page granularity
The PV network protocol is using 4KB page granularity. The goal of this patch is to allow a Linux using 64KB page granularity working as a network backend on a non-modified Xen. It's only necessary to adapt the ring size and break skb data in small chunk of 4KB. The rest of the code is relying on the grant table code. Signed-off-by: Julien Grall Cc: Ian Campbell Cc: Wei Liu Cc: net...@vger.kernel.org --- Improvement such as support of 64KB grant is not taken into consideration in this patch because we have the requirement to run a Linux using 64KB pages on a non-modified Xen. Changes in v2: - Correctly set MAX_GRANT_COPY_OPS and XEN_NETBK_RX_SLOTS_MAX - Don't use XEN_PAGE_SIZE in handle_frag_list as we coalesce fragment into a new skb - Use gnntab_foreach_grant to split a Linux page into grant --- drivers/net/xen-netback/common.h | 15 +++-- drivers/net/xen-netback/netback.c | 138 +++--- 2 files changed, 93 insertions(+), 60 deletions(-) diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h index 8a495b3..bb68211 100644 --- a/drivers/net/xen-netback/common.h +++ b/drivers/net/xen-netback/common.h @@ -44,6 +44,7 @@ #include #include #include +#include #include typedef unsigned int pending_ring_idx_t; @@ -64,8 +65,8 @@ struct pending_tx_info { struct ubuf_info callback_struct; }; -#define XEN_NETIF_TX_RING_SIZE __CONST_RING_SIZE(xen_netif_tx, PAGE_SIZE) -#define XEN_NETIF_RX_RING_SIZE __CONST_RING_SIZE(xen_netif_rx, PAGE_SIZE) +#define XEN_NETIF_TX_RING_SIZE __CONST_RING_SIZE(xen_netif_tx, XEN_PAGE_SIZE) +#define XEN_NETIF_RX_RING_SIZE __CONST_RING_SIZE(xen_netif_rx, XEN_PAGE_SIZE) struct xenvif_rx_meta { int id; @@ -80,16 +81,18 @@ struct xenvif_rx_meta { /* Discriminate from any valid pending_idx value. 
*/ #define INVALID_PENDING_IDX 0x -#define MAX_BUFFER_OFFSET PAGE_SIZE +#define MAX_BUFFER_OFFSET XEN_PAGE_SIZE #define MAX_PENDING_REQS XEN_NETIF_TX_RING_SIZE +#define MAX_XEN_SKB_FRAGS (65536 / XEN_PAGE_SIZE + 1) + /* It's possible for an skb to have a maximal number of frags * but still be less than MAX_BUFFER_OFFSET in size. Thus the - * worst-case number of copy operations is MAX_SKB_FRAGS per + * worst-case number of copy operations is MAX_XEN_SKB_FRAGS per * ring slot. */ -#define MAX_GRANT_COPY_OPS (MAX_SKB_FRAGS * XEN_NETIF_RX_RING_SIZE) +#define MAX_GRANT_COPY_OPS (MAX_XEN_SKB_FRAGS * XEN_NETIF_RX_RING_SIZE) #define NETBACK_INVALID_HANDLE -1 @@ -203,7 +206,7 @@ struct xenvif_queue { /* Per-queue data for xenvif */ /* Maximum number of Rx slots a to-guest packet may use, including the * slot needed for GSO meta-data. */ -#define XEN_NETBK_RX_SLOTS_MAX (MAX_SKB_FRAGS + 1) +#define XEN_NETBK_RX_SLOTS_MAX ((MAX_XEN_SKB_FRAGS + 1)) enum state_bit_shift { /* This bit marks that the vif is connected */ diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c index 3f77030..828085b 100644 --- a/drivers/net/xen-netback/netback.c +++ b/drivers/net/xen-netback/netback.c @@ -263,6 +263,65 @@ static struct xenvif_rx_meta *get_next_rx_buffer(struct xenvif_queue *queue, return meta; } +struct gop_frag_copy +{ + struct xenvif_queue *queue; + struct netrx_pending_operations *npo; + struct xenvif_rx_meta *meta; + int head; + int gso_type; + + struct page *page; +}; + +static void xenvif_gop_frag_copy_grant(unsigned long mfn, unsigned int offset, + unsigned int *len, void *data) +{ + struct gop_frag_copy *info = data; + struct gnttab_copy *copy_gop; + struct xen_page_foreign *foreign; + /* Convenient aliases */ + struct xenvif_queue *queue = info->queue; + struct netrx_pending_operations *npo = info->npo; + struct page *page = info->page; + + BUG_ON(npo->copy_off > MAX_BUFFER_OFFSET); + + if (npo->copy_off == MAX_BUFFER_OFFSET) + info->meta = 
get_next_rx_buffer(queue, npo); + + if (npo->copy_off + *len > MAX_BUFFER_OFFSET) + *len = MAX_BUFFER_OFFSET - npo->copy_off; + + copy_gop = npo->copy + npo->copy_prod++; + copy_gop->flags = GNTCOPY_dest_gref; + copy_gop->len = *len; + + foreign = xen_page_foreign(page); + if (foreign) { + copy_gop->source.domid = foreign->domid; + copy_gop->source.u.ref = foreign->gref; + copy_gop->flags |= GNTCOPY_source_gref; + } else { + copy_gop->source.domid = DOMID_SELF; + copy_gop->source.u.gmfn = mfn; + } + copy_gop->source.offset = offset; + + copy_gop->dest.domid = queue->vif->domid; + copy_gop->dest.offset = npo->copy_off; + copy_gop->dest.u.ref = npo->copy_gref; + + npo->copy_off += *len; + info->meta->size += *len; + + /* Leave a gap for the GSO descripto
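The changelog mentions `gnttab_foreach_grant` splitting a Linux page into grants; the helper itself is not in this excerpt, so the iteration below is a guess at its shape (4KB-aligned chunking with a per-chunk callback; the real helper also passes the chunk's mfn). The `MAX_XEN_SKB_FRAGS` bound from the patch is reproduced for the test:

```c
#include <assert.h>
#include <stddef.h>

#define XEN_PAGE_SIZE 4096UL
/* A maximally-sized (64KB) skb frag can span 16 grants, plus one for
 * misalignment, as in the patched common.h above. */
#define MAX_XEN_SKB_FRAGS (65536 / XEN_PAGE_SIZE + 1)

typedef void (*grant_fn_t)(unsigned long off, unsigned int len, void *data);

/* Walk [offset, offset + len) in 4KB-aligned chunks, invoking fn once
 * per grant; returns the number of chunks (i.e. grant-copy ops needed). */
static unsigned int foreach_grant(unsigned long offset, unsigned long len,
				  grant_fn_t fn, void *data)
{
	unsigned int chunks = 0;

	while (len) {
		unsigned long room = XEN_PAGE_SIZE - (offset & (XEN_PAGE_SIZE - 1));
		unsigned int seg = len < room ? (unsigned int)len
					      : (unsigned int)room;

		if (fn)
			fn(offset, seg, data);
		offset += seg;
		len -= seg;
		chunks++;
	}
	return chunks;
}
```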
[Xen-devel] [PATCH v2 12/20] xen/balloon: Don't assume the page granularity is the same for Xen and Linux
For ARM64 guests, Linux is able to support either 64K or 4K page granularity. Although, the hypercall interface is always based on 4K page granularity. With 64K page granuliarty, a single page will be spread over multiple Xen frame. When a driver request/free a balloon page, the balloon driver will have to split the Linux page in 4K chunk before asking Xen to add/remove the frame from the guest. Note that this can work on any page granularity assuming it's a multiple of 4K. Signed-off-by: Julien Grall Cc: Konrad Rzeszutek Wilk Cc: Boris Ostrovsky Cc: David Vrabel Cc: Wei Liu --- Changes in v2: - Use xen_apply_to_page to split a page in 4K chunk - It's not necessary to have a smaller frame list. Re-use PAGE_SIZE - Convert reserve_additional_memory to use XEN_... macro --- drivers/xen/balloon.c | 147 +++--- 1 file changed, 105 insertions(+), 42 deletions(-) diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c index fd93369..19a72b1 100644 --- a/drivers/xen/balloon.c +++ b/drivers/xen/balloon.c @@ -230,6 +230,7 @@ static enum bp_state reserve_additional_memory(long credit) nid = memory_add_physaddr_to_nid(hotplug_start_paddr); #ifdef CONFIG_XEN_HAVE_PVMMU + /* TODO */ /* * add_memory() will build page tables for the new memory so * the p2m must contain invalid entries so the correct @@ -242,8 +243,8 @@ static enum bp_state reserve_additional_memory(long credit) if (!xen_feature(XENFEAT_auto_translated_physmap)) { unsigned long pfn, i; - pfn = PFN_DOWN(hotplug_start_paddr); - for (i = 0; i < balloon_hotplug; i++) { + pfn = XEN_PFN_DOWN(hotplug_start_paddr); + for (i = 0; i < (balloon_hotplug * XEN_PFN_PER_PAGE); i++) { if (!set_phys_to_machine(pfn + i, INVALID_P2M_ENTRY)) { pr_warn("set_phys_to_machine() failed, no memory added\n"); return BP_ECANCELED; @@ -323,10 +324,72 @@ static enum bp_state reserve_additional_memory(long credit) } #endif /* CONFIG_XEN_BALLOON_MEMORY_HOTPLUG */ +static int set_frame(struct page *page, unsigned long pfn, void *data) +{ + 
unsigned long *index = data; + + frame_list[(*index)++] = pfn; + + return 0; +} + +#ifdef CONFIG_XEN_HAVE_PVMMU +static int pvmmu_update_mapping(struct page *page, unsigned long pfn, + void *data) +{ + unsigned long *index = data; + xen_pfn_t frame = frame_list[*index]; + + set_phys_to_machine(pfn, frame); + /* Link back into the page tables if not highmem. */ + if (!PageHighMem(page)) { + int ret; + ret = HYPERVISOR_update_va_mapping( + (unsigned long)__va(pfn << XEN_PAGE_SHIFT), + mfn_pte(frame, PAGE_KERNEL), + 0); + BUG_ON(ret); + } + + (*index)++; + + return 0; +} +#endif + +static int balloon_remove_mapping(struct page *page, unsigned long pfn, + void *data) +{ + unsigned long *index = data; + + /* We expect the frame_list to contain the same pfn */ + BUG_ON(pfn != frame_list[*index]); + + frame_list[*index] = pfn_to_mfn(pfn); + +#ifdef CONFIG_XEN_HAVE_PVMMU + if (!xen_feature(XENFEAT_auto_translated_physmap)) { + if (!PageHighMem(page)) { + int ret; + + ret = HYPERVISOR_update_va_mapping( + (unsigned long)__va(pfn << XEN_PAGE_SHIFT), + __pte_ma(0), 0); + BUG_ON(ret); + } + __set_phys_to_machine(pfn, INVALID_P2M_ENTRY); + } +#endif + + (*index)++; + + return 0; +} + static enum bp_state increase_reservation(unsigned long nr_pages) { int rc; - unsigned long pfn, i; + unsigned long i, frame_idx; struct page *page; struct xen_memory_reservation reservation = { .address_bits = 0, @@ -343,44 +406,43 @@ static enum bp_state increase_reservation(unsigned long nr_pages) } #endif - if (nr_pages > ARRAY_SIZE(frame_list)) - nr_pages = ARRAY_SIZE(frame_list); + if (nr_pages > (ARRAY_SIZE(frame_list) / XEN_PFN_PER_PAGE)) + nr_pages = ARRAY_SIZE(frame_list) / XEN_PFN_PER_PAGE; + frame_idx = 0; page = list_first_entry_or_null(&ballooned_pages, struct page, lru); for (i = 0; i < nr_pages; i++) { if (!page) { nr_pages = i; break; } - frame_list[i] = page_to_pfn(page); + + rc = xen_apply_to_page(page, set_frame, &frame_idx); +
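With 64KB pages each ballooned Linux page contributes `XEN_PFN_PER_PAGE` entries to `frame_list`, which is why the `nr_pages` clamp above shrinks by that factor. A sketch of the two pieces of arithmetic (the `frame_list` capacity value here is hypothetical):

```c
#include <assert.h>

#define XEN_PAGE_SHIFT     12
#define PAGE_SHIFT         16      /* assume a 64KB kernel */
#define XEN_PFN_PER_PAGE   (1UL << (PAGE_SHIFT - XEN_PAGE_SHIFT))
#define FRAME_LIST_ENTRIES 1024UL  /* hypothetical frame_list capacity */

/* Linux pages that fit in one hypercall batch once every page needs
 * XEN_PFN_PER_PAGE slots (mirrors the nr_pages clamp above). */
static unsigned long balloon_batch_limit(void)
{
	return FRAME_LIST_ENTRIES / XEN_PFN_PER_PAGE;
}

/* The i-th 4KB Xen frame backing a Linux page whose first 4KB frame
 * number is linux_pfn * XEN_PFN_PER_PAGE (what set_frame() enumerates). */
static unsigned long xen_pfn_of(unsigned long linux_pfn, unsigned int i)
{
	return linux_pfn * XEN_PFN_PER_PAGE + i;
}
```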
[Xen-devel] [PATCH v2 16/20] block/xen-blkback: Make it run with 64KB page granularity
The PV block protocol is using 4KB page granularity. The goal of this patch is to allow a Linux using 64KB page granularity behaving as a block backend on a non-modified Xen. It's only necessary to adapt the ring size and the number of request per indirect frames. The rest of the code is relying on the grant table code. Note that the grant table code is allocating a Linux page per grant which will result to waste 6OKB for every grant when Linux is using 64KB page granularity. This could be improved by sharing the page between multiple grants. Signed-off-by: Julien Grall Cc: Konrad Rzeszutek Wilk Cc: "Roger Pau Monné" Cc: Boris Ostrovsky Cc: David Vrabel --- Improvement such as support of 64KB grant is not taken into consideration in this patch because we have the requirement to run a Linux using 64KB pages on a non-modified Xen. This has been tested only with a loop device. I plan to test passing hard drive partition but I didn't yet convert the swiotlb code. --- drivers/block/xen-blkback/blkback.c | 5 +++-- drivers/block/xen-blkback/common.h | 16 +--- drivers/block/xen-blkback/xenbus.c | 9 ++--- 3 files changed, 22 insertions(+), 8 deletions(-) diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c index ced9677..d5cce8c 100644 --- a/drivers/block/xen-blkback/blkback.c +++ b/drivers/block/xen-blkback/blkback.c @@ -961,7 +961,7 @@ static int xen_blkbk_parse_indirect(struct blkif_request *req, seg[n].nsec = segments[i].last_sect - segments[i].first_sect + 1; seg[n].offset = (segments[i].first_sect << 9); - if ((segments[i].last_sect >= (PAGE_SIZE >> 9)) || + if ((segments[i].last_sect >= (XEN_PAGE_SIZE >> 9)) || (segments[i].last_sect < segments[i].first_sect)) { rc = -EINVAL; goto unmap; @@ -1210,6 +1210,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif, req_operation = req->operation == BLKIF_OP_INDIRECT ? 
req->u.indirect.indirect_op : req->operation; + if ((req->operation == BLKIF_OP_INDIRECT) && (req_operation != BLKIF_OP_READ) && (req_operation != BLKIF_OP_WRITE)) { @@ -1268,7 +1269,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif, seg[i].nsec = req->u.rw.seg[i].last_sect - req->u.rw.seg[i].first_sect + 1; seg[i].offset = (req->u.rw.seg[i].first_sect << 9); - if ((req->u.rw.seg[i].last_sect >= (PAGE_SIZE >> 9)) || + if ((req->u.rw.seg[i].last_sect >= (XEN_PAGE_SIZE >> 9)) || (req->u.rw.seg[i].last_sect < req->u.rw.seg[i].first_sect)) goto fail_response; diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h index 45a044a..33836bb 100644 --- a/drivers/block/xen-blkback/common.h +++ b/drivers/block/xen-blkback/common.h @@ -39,6 +39,7 @@ #include #include #include +#include #include #include #include @@ -51,12 +52,21 @@ extern unsigned int xen_blkif_max_ring_order; */ #define MAX_INDIRECT_SEGMENTS 256 -#define SEGS_PER_INDIRECT_FRAME \ - (PAGE_SIZE/sizeof(struct blkif_request_segment)) +/* + * Xen use 4K pages. The guest may use different page size (4K or 64K) + * Number of Xen pages per segment + */ +#define XEN_PAGES_PER_SEGMENT (PAGE_SIZE / XEN_PAGE_SIZE) + +#define SEGS_PER_INDIRECT_FRAME\ + (XEN_PAGE_SIZE/sizeof(struct blkif_request_segment) / XEN_PAGES_PER_SEGMENT) +#define XEN_PAGES_PER_INDIRECT_FRAME \ + (XEN_PAGE_SIZE/sizeof(struct blkif_request_segment)) + #define MAX_INDIRECT_PAGES \ ((MAX_INDIRECT_SEGMENTS + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME) #define INDIRECT_PAGES(_segs) \ - ((_segs + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME) + ((_segs + XEN_PAGES_PER_INDIRECT_FRAME - 1)/XEN_PAGES_PER_INDIRECT_FRAME) /* Not a real protocol. Used to generate ring structs which contain * the elements common to all protocols only. 
This way we get a diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c index deb3f00..edd27e4 100644 --- a/drivers/block/xen-blkback/xenbus.c +++ b/drivers/block/xen-blkback/xenbus.c @@ -176,21 +176,24 @@ static int xen_blkif_map(struct xen_blkif *blkif, grant_ref_t *gref, { struct blkif_sring *sring; sring = (struct blkif_sring *)blkif->blk_ring; - BACK_RING_INIT(&blkif->blk_rings.native, sring, PAGE_SIZE * nr_grefs); + BACK_RING_INIT(&blkif->blk_rings.native, sring, + XEN_PAGE_SIZE * nr_grefs); break; } case BLKIF_PROTOCOL_X86_32: { struct blkif_x8
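The sector-range check changed in this patch enforces that a segment stays inside one 4KB Xen page: with 512-byte sectors that is at most 8 sectors per segment. A sketch of the validity test:

```c
#include <assert.h>
#include <stdbool.h>

#define XEN_PAGE_SIZE 4096U
#define SECTOR_SHIFT  9    /* 512-byte sectors */

/* first_sect/last_sect index 512-byte sectors within one 4KB Xen page,
 * so last_sect must stay below XEN_PAGE_SIZE >> 9 == 8, and the range
 * must be non-empty (mirrors the checks in xen_blkbk_parse_indirect()
 * and dispatch_rw_block_io() above). */
static bool seg_valid(unsigned int first_sect, unsigned int last_sect)
{
	return last_sect < (XEN_PAGE_SIZE >> SECTOR_SHIFT) &&
	       first_sect <= last_sect;
}
```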
[Xen-devel] [PATCH v2 19/20] xen/privcmd: Add support for Linux 64KB page granularity
The hypercall interface (as well as the toolstack) is always using 4KB page granularity. When the toolstack is asking for mapping a series of guest PFN in a batch, it expects to have the page map contiguously in its virtual memory. When Linux is using 64KB page granularity, the privcmd driver will have to map multiple Xen PFN in a single Linux page. Note that this solution works on page granularity which is a multiple of 4KB. Signed-off-by: Julien Grall Cc: Konrad Rzeszutek Wilk Cc: Boris Ostrovsky Cc: David Vrabel --- Changes in v2: - Use xen_apply_to_page --- drivers/xen/privcmd.c | 8 +-- drivers/xen/xlate_mmu.c | 127 +--- 2 files changed, 92 insertions(+), 43 deletions(-) diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c index 5a29616..e8714b4 100644 --- a/drivers/xen/privcmd.c +++ b/drivers/xen/privcmd.c @@ -446,7 +446,7 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version) return -EINVAL; } - nr_pages = m.num; + nr_pages = DIV_ROUND_UP_ULL(m.num, PAGE_SIZE / XEN_PAGE_SIZE); if ((m.num <= 0) || (nr_pages > (LONG_MAX >> PAGE_SHIFT))) return -EINVAL; @@ -494,7 +494,7 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version) goto out_unlock; } if (xen_feature(XENFEAT_auto_translated_physmap)) { - ret = alloc_empty_pages(vma, m.num); + ret = alloc_empty_pages(vma, nr_pages); if (ret < 0) goto out_unlock; } else @@ -518,6 +518,7 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version) state.global_error = 0; state.version = version; + BUILD_BUG_ON(((PAGE_SIZE / sizeof(xen_pfn_t)) % XEN_PFN_PER_PAGE) != 0); /* mmap_batch_fn guarantees ret == 0 */ BUG_ON(traverse_pages_block(m.num, sizeof(xen_pfn_t), &pagelist, mmap_batch_fn, &state)); @@ -582,12 +583,13 @@ static void privcmd_close(struct vm_area_struct *vma) { struct page **pages = vma->vm_private_data; int numpgs = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT; + int nr_pfn = (vma->vm_end - vma->vm_start) >> XEN_PAGE_SHIFT; int rc; if 
(!xen_feature(XENFEAT_auto_translated_physmap) || !numpgs || !pages) return; - rc = xen_unmap_domain_mfn_range(vma, numpgs, pages); + rc = xen_unmap_domain_mfn_range(vma, nr_pfn, pages); if (rc == 0) free_xenballooned_pages(numpgs, pages); else diff --git a/drivers/xen/xlate_mmu.c b/drivers/xen/xlate_mmu.c index 58a5389..1fac17c 100644 --- a/drivers/xen/xlate_mmu.c +++ b/drivers/xen/xlate_mmu.c @@ -38,31 +38,9 @@ #include #include -/* map fgmfn of domid to lpfn in the current domain */ -static int map_foreign_page(unsigned long lpfn, unsigned long fgmfn, - unsigned int domid) -{ - int rc; - struct xen_add_to_physmap_range xatp = { - .domid = DOMID_SELF, - .foreign_domid = domid, - .size = 1, - .space = XENMAPSPACE_gmfn_foreign, - }; - xen_ulong_t idx = fgmfn; - xen_pfn_t gpfn = lpfn; - int err = 0; - - set_xen_guest_handle(xatp.idxs, &idx); - set_xen_guest_handle(xatp.gpfns, &gpfn); - set_xen_guest_handle(xatp.errs, &err); - - rc = HYPERVISOR_memory_op(XENMEM_add_to_physmap_range, &xatp); - return rc < 0 ? 
rc : err; -} - struct remap_data { xen_pfn_t *fgmfn; /* foreign domain's gmfn */ + xen_pfn_t *efgmfn; /* pointer to the end of the fgmfn array */ pgprot_t prot; domid_t domid; struct vm_area_struct *vma; @@ -71,24 +49,75 @@ struct remap_data { struct xen_remap_mfn_info *info; int *err_ptr; int mapped; + + /* Hypercall parameters */ + int h_errs[XEN_PFN_PER_PAGE]; + xen_ulong_t h_idxs[XEN_PFN_PER_PAGE]; + xen_pfn_t h_gpfns[XEN_PFN_PER_PAGE]; + + int h_iter; /* Iterator */ }; +static int setup_hparams(struct page *page, unsigned long pfn, void *data) +{ + struct remap_data *info = data; + + /* We may not have enough domain's gmfn to fill a Linux Page */ + if (info->fgmfn == info->efgmfn) + return 0; + + info->h_idxs[info->h_iter] = *info->fgmfn; + info->h_gpfns[info->h_iter] = pfn; + info->h_errs[info->h_iter] = 0; + info->h_iter++; + + info->fgmfn++; + + return 0; +} + static int remap_pte_fn(pte_t *ptep, pgtable_t token, unsigned long addr, void *data) { struct remap_data *info = data; struct page *page = info->pages[info->index++]; - unsigned long pfn = page_to_pfn(page); - pte_t pte = pte_mkspecial(pfn_pte(pfn, info->
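The `nr_pages` computation in the patched `privcmd_ioctl_mmap_batch()` rounds the number of requested 4KB Xen frames up to whole Linux pages; each Linux page then supplies up to `XEN_PFN_PER_PAGE` hypercall slots. The rounding in isolation:

```c
#include <assert.h>

#define XEN_PFN_PER_PAGE 16UL  /* assume a 64KB kernel: 65536 / 4096 */

/* Linux pages needed to host nr_xen_pfns worth of 4KB mappings, i.e.
 * the DIV_ROUND_UP_ULL(m.num, PAGE_SIZE / XEN_PAGE_SIZE) above. */
static unsigned long linux_pages_for(unsigned long nr_xen_pfns)
{
	return (nr_xen_pfns + XEN_PFN_PER_PAGE - 1) / XEN_PFN_PER_PAGE;
}
```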
[Xen-devel] [PATCH v2 11/20] tty/hvc: xen: Use xen page definition
The console ring is always based on the page granularity of Xen.

Signed-off-by: Julien Grall
Cc: Greg Kroah-Hartman
Cc: Jiri Slaby
Cc: David Vrabel
Cc: Stefano Stabellini
Cc: Boris Ostrovsky
Cc: linuxppc-...@lists.ozlabs.org
---
 drivers/tty/hvc/hvc_xen.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/tty/hvc/hvc_xen.c b/drivers/tty/hvc/hvc_xen.c
index a9d837f..2135944 100644
--- a/drivers/tty/hvc/hvc_xen.c
+++ b/drivers/tty/hvc/hvc_xen.c
@@ -230,7 +230,7 @@ static int xen_hvm_console_init(void)
 	if (r < 0 || v == 0)
 		goto err;
 	mfn = v;
-	info->intf = xen_remap(mfn << PAGE_SHIFT, PAGE_SIZE);
+	info->intf = xen_remap(mfn << XEN_PAGE_SHIFT, XEN_PAGE_SIZE);
 	if (info->intf == NULL)
 		goto err;
 	info->vtermno = HVC_COOKIE;
@@ -392,7 +392,7 @@ static int xencons_connect_backend(struct xenbus_device *dev,
 	if (xen_pv_domain())
 		mfn = virt_to_mfn(info->intf);
 	else
-		mfn = __pa(info->intf) >> PAGE_SHIFT;
+		mfn = __pa(info->intf) >> XEN_PAGE_SHIFT;
 	ret = gnttab_alloc_grant_references(1, &gref_head);
 	if (ret < 0)
 		return ret;
@@ -476,7 +476,7 @@ static int xencons_resume(struct xenbus_device *dev)
 	struct xencons_info *info = dev_get_drvdata(&dev->dev);

 	xencons_disconnect_backend(info);
-	memset(info->intf, 0, PAGE_SIZE);
+	memset(info->intf, 0, XEN_PAGE_SIZE);

 	return xencons_connect_backend(dev, info);
 }
--
2.1.4
___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v2 20/20] arm/xen: Add support for 64KB page granularity
The hypercall interface is always using 4KB page granularity. This is requiring to use xen page definition macro when we deal with hypercall. Note that pfn_to_mfn is working with a Xen pfn (i.e 4KB). We may want to rename pfn_mfn to make this explicit. We also allocate a 64KB page for the shared page even though only the first 4KB is used. I don't think this is really important for now as it helps to have the pointer 4KB aligned (XENMEM_add_to_physmap is taking a Xen PFN). Signed-off-by: Julien Grall Reviewed-by: Stefano Stabellini Cc: Russell King --- Changes in v2 - Add Stefano's reviewed-by --- arch/arm/include/asm/xen/page.h | 12 ++-- arch/arm/xen/enlighten.c| 6 +++--- 2 files changed, 9 insertions(+), 9 deletions(-) diff --git a/arch/arm/include/asm/xen/page.h b/arch/arm/include/asm/xen/page.h index 1bee8ca..ab6eb9a 100644 --- a/arch/arm/include/asm/xen/page.h +++ b/arch/arm/include/asm/xen/page.h @@ -56,19 +56,19 @@ static inline unsigned long mfn_to_pfn(unsigned long mfn) static inline xmaddr_t phys_to_machine(xpaddr_t phys) { - unsigned offset = phys.paddr & ~PAGE_MASK; - return XMADDR(PFN_PHYS(pfn_to_mfn(PFN_DOWN(phys.paddr))) | offset); + unsigned offset = phys.paddr & ~XEN_PAGE_MASK; + return XMADDR(XEN_PFN_PHYS(pfn_to_mfn(XEN_PFN_DOWN(phys.paddr))) | offset); } static inline xpaddr_t machine_to_phys(xmaddr_t machine) { - unsigned offset = machine.maddr & ~PAGE_MASK; - return XPADDR(PFN_PHYS(mfn_to_pfn(PFN_DOWN(machine.maddr))) | offset); + unsigned offset = machine.maddr & ~XEN_PAGE_MASK; + return XPADDR(XEN_PFN_PHYS(mfn_to_pfn(XEN_PFN_DOWN(machine.maddr))) | offset); } /* VIRT <-> MACHINE conversion */ #define virt_to_machine(v) (phys_to_machine(XPADDR(__pa(v -#define virt_to_mfn(v) (pfn_to_mfn(virt_to_pfn(v))) -#define mfn_to_virt(m) (__va(mfn_to_pfn(m) << PAGE_SHIFT)) +#define virt_to_mfn(v) (pfn_to_mfn(virt_to_phys(v) >> XEN_PAGE_SHIFT)) +#define mfn_to_virt(m) (__va(mfn_to_pfn(m) << XEN_PAGE_SHIFT)) static inline xmaddr_t 
arbitrary_virt_to_machine(void *vaddr) { diff --git a/arch/arm/xen/enlighten.c b/arch/arm/xen/enlighten.c index 6c09cc4..c7d32af 100644 --- a/arch/arm/xen/enlighten.c +++ b/arch/arm/xen/enlighten.c @@ -96,8 +96,8 @@ static void xen_percpu_init(void) pr_info("Xen: initializing cpu%d\n", cpu); vcpup = per_cpu_ptr(xen_vcpu_info, cpu); - info.mfn = __pa(vcpup) >> PAGE_SHIFT; - info.offset = offset_in_page(vcpup); + info.mfn = __pa(vcpup) >> XEN_PAGE_SHIFT; + info.offset = xen_offset_in_page(vcpup); err = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_info, cpu, &info); BUG_ON(err); @@ -220,7 +220,7 @@ static int __init xen_guest_init(void) xatp.domid = DOMID_SELF; xatp.idx = 0; xatp.space = XENMAPSPACE_shared_info; - xatp.gpfn = __pa(shared_info_page) >> PAGE_SHIFT; + xatp.gpfn = __pa(shared_info_page) >> XEN_PAGE_SHIFT; if (HYPERVISOR_memory_op(XENMEM_add_to_physmap, &xatp)) BUG(); -- 2.1.4 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
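The patched `phys_to_machine()` above translates at 4KB granularity and re-attaches the sub-4KB offset. A standalone sketch, using an identity p2m as on auto-translated ARM guests (where pfn == mfn):

```c
#include <assert.h>
#include <stdint.h>

#define XEN_PAGE_SHIFT  12
#define XEN_PAGE_SIZE   (1UL << XEN_PAGE_SHIFT)
#define XEN_PAGE_MASK   (~(XEN_PAGE_SIZE - 1))
#define XEN_PFN_DOWN(x) ((x) >> XEN_PAGE_SHIFT)
#define XEN_PFN_PHYS(p) ((uint64_t)(p) << XEN_PAGE_SHIFT)

/* Identity p2m stand-in: auto-translated ARM guests have pfn == mfn. */
static unsigned long pfn_to_mfn(unsigned long pfn) { return pfn; }

/* Mirror of the patched helper: translate the 4KB frame number, then
 * restore the offset within the 4KB frame (not within the 64KB page). */
static uint64_t phys_to_machine(uint64_t phys)
{
	uint64_t offset = phys & ~XEN_PAGE_MASK;

	return XEN_PFN_PHYS(pfn_to_mfn(XEN_PFN_DOWN(phys))) | offset;
}
```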
[Xen-devel] [PATCH v2 17/20] net/xen-netfront: Make it run with 64KB page granularity
The PV network protocol is using 4KB page granularity. The goal of this patch is to allow a Linux using 64KB page granularity using network device on a non-modified Xen. It's only necessary to adapt the ring size and break skb data in small chunk of 4KB. The rest of the code is relying on the grant table code. Note that we allocate a Linux page for each rx skb but only the first 4KB is used. We may improve the memory usage by extending the size of the rx skb. Signed-off-by: Julien Grall Cc: Konrad Rzeszutek Wilk Cc: Boris Ostrovsky Cc: David Vrabel Cc: net...@vger.kernel.org --- Improvement such as support of 64KB grant is not taken into consideration in this patch because we have the requirement to run a Linux using 64KB pages on a non-modified Xen. Tested with workload such as ping, ssh, wget, git... I would happy if someone give details how to test all the path. Changes in v2: - Use gnttab_foreach_grant to split a Linux page in grant - Fix count slots --- drivers/net/xen-netfront.c | 121 - 1 file changed, 87 insertions(+), 34 deletions(-) diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c index f948c46..7233b09 100644 --- a/drivers/net/xen-netfront.c +++ b/drivers/net/xen-netfront.c @@ -74,8 +74,8 @@ struct netfront_cb { #define GRANT_INVALID_REF 0 -#define NET_TX_RING_SIZE __CONST_RING_SIZE(xen_netif_tx, PAGE_SIZE) -#define NET_RX_RING_SIZE __CONST_RING_SIZE(xen_netif_rx, PAGE_SIZE) +#define NET_TX_RING_SIZE __CONST_RING_SIZE(xen_netif_tx, XEN_PAGE_SIZE) +#define NET_RX_RING_SIZE __CONST_RING_SIZE(xen_netif_rx, XEN_PAGE_SIZE) /* Minimum number of Rx slots (includes slot for GSO metadata). 
*/ #define NET_RX_SLOTS_MIN (XEN_NETIF_NR_SLOTS_MIN + 1) @@ -291,7 +291,7 @@ static void xennet_alloc_rx_buffers(struct netfront_queue *queue) struct sk_buff *skb; unsigned short id; grant_ref_t ref; - unsigned long pfn; + struct page *page; struct xen_netif_rx_request *req; skb = xennet_alloc_one_rx_buffer(queue); @@ -307,14 +307,13 @@ static void xennet_alloc_rx_buffers(struct netfront_queue *queue) BUG_ON((signed short)ref < 0); queue->grant_rx_ref[id] = ref; - pfn = page_to_pfn(skb_frag_page(&skb_shinfo(skb)->frags[0])); + page = skb_frag_page(&skb_shinfo(skb)->frags[0]); req = RING_GET_REQUEST(&queue->rx, req_prod); - gnttab_grant_foreign_access_ref(ref, - queue->info->xbdev->otherend_id, - pfn_to_mfn(pfn), - 0); - + gnttab_page_grant_foreign_access_ref(ref, + queue->info->xbdev->otherend_id, +page, +0); req->id = id; req->gref = ref; } @@ -415,15 +414,26 @@ static void xennet_tx_buf_gc(struct netfront_queue *queue) xennet_maybe_wake_tx(queue); } -static struct xen_netif_tx_request *xennet_make_one_txreq( - struct netfront_queue *queue, struct sk_buff *skb, - struct page *page, unsigned int offset, unsigned int len) +struct xennet_gnttab_make_txreq +{ + struct netfront_queue *queue; + struct sk_buff *skb; + struct page *page; + struct xen_netif_tx_request *tx; /* Last request */ + unsigned int size; +}; + +static void xennet_tx_setup_grant(unsigned long mfn, unsigned int offset, + unsigned int *len, void *data) { + struct xennet_gnttab_make_txreq *info = data; unsigned int id; struct xen_netif_tx_request *tx; grant_ref_t ref; - - len = min_t(unsigned int, PAGE_SIZE - offset, len); + /* convenient aliases */ + struct page *page = info->page; + struct netfront_queue *queue = info->queue; + struct sk_buff *skb = info->skb; id = get_id_from_freelist(&queue->tx_skb_freelist, queue->tx_skbs); tx = RING_GET_REQUEST(&queue->tx, queue->tx.req_prod_pvt++); @@ -431,7 +441,7 @@ static struct xen_netif_tx_request *xennet_make_one_txreq( BUG_ON((signed short)ref < 0); 
gnttab_grant_foreign_access_ref(ref, queue->info->xbdev->otherend_id, - page_to_mfn(page), GNTMAP_readonly); + mfn, GNTMAP_readonly); queue->tx_skbs[id].skb = skb; queue->grant_tx_page[id] = page; @@ -440,10 +450,37 @@ static struct xen_netif_tx_request *xennet_make_one_txreq( tx->id = id; tx->gref = ref; tx->offset = offset; - tx->size = len; + tx->size = *len; tx->flags = 0; - return tx; +
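The "count slots" fix above hinges on how many 4KB grants a buffer needs. A minimal standalone sketch of that arithmetic, mirroring the gnttab_count_grant helper introduced earlier in the series (the function name count_grants and the constants are re-declared here for illustration, not taken from the kernel tree):

```c
#include <assert.h>

/* Xen always uses 4KB granularity for grants. */
#define XEN_PAGE_SHIFT 12
#define XEN_PAGE_SIZE  (1UL << XEN_PAGE_SHIFT)
#define XEN_PAGE_MASK  (~(XEN_PAGE_SIZE - 1))
#define XEN_PFN_UP(x)  (((x) + XEN_PAGE_SIZE - 1) >> XEN_PAGE_SHIFT)

/* Number of 4KB grants needed to cover len bytes starting at offset;
 * the offset within the first 4KB frame counts against the first grant. */
static unsigned int count_grants(unsigned long offset, unsigned int len)
{
    return XEN_PFN_UP((offset & ~XEN_PAGE_MASK) + len);
}
```

A buffer that starts mid-frame needs one extra slot, and a full 64KB Linux page always splits into sixteen 4KB grants.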
[Xen-devel] [PATCH v2 00/20] xen/arm64: Add support for 64KB page
Hi all, ARM64 Linux supports both 4KB and 64KB page granularity, while the Xen hypercall interface and PV protocol are always based on 4KB page granularity. Any attempt to boot a Linux guest with 64KB pages enabled currently results in a guest crash. This series is a first attempt to allow such a Linux kernel to run with the current hypercall interface and PV protocol. This solution was chosen because we want to run 64KB Linux on released Xen ARM versions and/or platforms using an old version of Linux DOM0. There is room for improvement, such as support for 64KB grants or modification of the PV protocol to support different page sizes; these will be explored in a separate patch series later. For this new version, new helpers have been added in order to split a Linux page into multiple grants. This requires providing callbacks. So far I've only done a quick network performance test using iperf. The server lives in DOM0 and the client in the guest. Average over 10 iperf runs:

DOM0     Guest    Result
4KB-mod  64KB     3.176 Gbits/sec
4KB-mod  4KB-mod  3.245 Gbits/sec
4KB-mod  4KB      3.258 Gbits/sec
4KB      4KB      3.292 Gbits/sec
4KB      4KB-mod  3.265 Gbits/sec
4KB      64KB     3.189 Gbits/sec

4KB-mod: Linux with the 64KB patch series
4KB: linux/master

The network performance is slightly worse with this series (-0.15%). I suspect this is because of the indirection used to set up the grants, which is necessary to ensure the grants are correctly sized regardless of the Linux page granularity. The same mechanism could be used later to support bigger grants.

TODO list:
- swiotlb not yet converted to 64KB pages
- it may be possible to move some common defines between netback/netfront and blkfront/blkback into a header

Note that the patches have only been build-tested on x86. It is also possible that they do not compile one by one; I will take a look for the next version. 
A branch based on the latest linux/master can be found here: git://xenbits.xen.org/people/julieng/linux-arm.git branch xen-64k-v2 Comments, suggestions are welcomed. Sincerely yours, Cc: david.vra...@citrix.com Cc: konrad.w...@oracle.com Cc: boris.ostrov...@oracle.com Cc: wei.l...@citrix.com Cc: roger@citrix.com Julien Grall (20): xen: Add Xen specific page definition xen: Introduce a function to split a Linux page into Xen page xen/grant: Introduce helpers to split a page into grant xen/grant: Add helper gnttab_page_grant_foreign_access_ref block/xen-blkfront: Split blkif_queue_request in 2 block/xen-blkfront: Store a page rather a pfn in the grant structure block/xen-blkfront: split get_grant in 2 net/xen-netback: xenvif_gop_frag_copy: move GSO check out of the loop xen/biomerge: Don't allow biovec to be merge when Linux is not using 4KB page xen/xenbus: Use Xen page definition tty/hvc: xen: Use xen page definition xen/balloon: Don't rely on the page granularity is the same for Xen and Linux xen/events: fifo: Make it running on 64KB granularity xen/grant-table: Make it running on 64KB granularity block/xen-blkfront: Make it running on 64KB page granularity block/xen-blkback: Make it running on 64KB page granularity net/xen-netfront: Make it running on 64KB page granularity net/xen-netback: Make it running on 64KB page granularity xen/privcmd: Add support for Linux 64KB page granularity arm/xen: Add support for 64KB page granularity arch/arm/include/asm/xen/page.h | 12 +- arch/arm/xen/enlighten.c| 6 +- arch/arm/xen/p2m.c | 6 +- arch/x86/include/asm/xen/page.h | 2 +- drivers/block/xen-blkback/blkback.c | 5 +- drivers/block/xen-blkback/common.h | 16 +- drivers/block/xen-blkback/xenbus.c | 9 +- drivers/block/xen-blkfront.c| 536 +++- drivers/net/xen-netback/common.h| 15 +- drivers/net/xen-netback/netback.c | 148 ++ drivers/net/xen-netfront.c | 121 +--- drivers/tty/hvc/hvc_xen.c | 6 +- drivers/xen/balloon.c | 147 +++--- drivers/xen/biomerge.c | 7 + 
drivers/xen/events/events_base.c| 2 +- drivers/xen/events/events_fifo.c| 2 +- drivers/xen/grant-table.c | 32 ++- drivers/xen/privcmd.c | 8 +- drivers/xen/xenbus/xenbus_client.c | 6 +- drivers/xen/xenbus/xenbus_probe.c | 4 +- drivers/xen/xlate_mmu.c | 127 ++--- include/xen/grant_table.h | 50 include/xen/page.h | 41 ++- 23 files changed, 896 insertions(+), 412 deletions(-) -- 2.1.4 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v2 02/20] xen: Introduce a function to split a Linux page into Xen page
The Xen interface always uses 4KB pages. This means that a Linux page may be split across multiple Xen pages when the page granularity differs. This helper breaks a Linux page down into 4KB chunks and calls fn on each of them.

Signed-off-by: Julien Grall
Cc: Konrad Rzeszutek Wilk
Cc: Boris Ostrovsky
Cc: David Vrabel
---
Changes in v2:
- Patch added
---
 include/xen/page.h | 20
 1 file changed, 20 insertions(+)

diff --git a/include/xen/page.h b/include/xen/page.h
index 8ebd37b..b1f7722 100644
--- a/include/xen/page.h
+++ b/include/xen/page.h
@@ -39,4 +39,24 @@
 struct xen_memory_region xen_extra_mem[XEN_EXTRA_MEM_MAX_REGIONS];
 extern unsigned long xen_released_pages;
 
+typedef int (*xen_pfn_fn_t)(struct page *page, unsigned long pfn, void *data);
+
+/* Break down the page in 4KB granularity and call fn foreach xen pfn */
+static inline int xen_apply_to_page(struct page *page, xen_pfn_fn_t fn,
+				    void *data)
+{
+	unsigned long pfn = xen_page_to_pfn(page);
+	int i;
+	int ret;
+
+	for (i = 0; i < XEN_PFN_PER_PAGE; i++, pfn++) {
+		ret = fn(page, pfn, data);
+		if (ret)
+			return ret;
+	}
+
+	return ret;
+}
+
 #endif /* _XEN_PAGE_H */
-- 
2.1.4
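As posted, the helper's final `return ret;` returns the last callback result, which is fine when the loop runs at least once, but returning 0 explicitly is clearer and avoids a maybe-uninitialized warning. A standalone sketch of the iteration with that adjustment (plain pfns instead of struct page, and a 64KB PAGE_SHIFT assumed purely for illustration):

```c
#include <assert.h>

#define PAGE_SHIFT       16 /* assume a 64KB Linux page for illustration */
#define XEN_PAGE_SHIFT   12
#define XEN_PFN_PER_PAGE (1UL << (PAGE_SHIFT - XEN_PAGE_SHIFT))

typedef int (*xen_pfn_fn_t)(unsigned long pfn, void *data);

/* Call fn once per 4KB Xen pfn backing one Linux page, stopping on the
 * first error; returns 0 explicitly on full success. */
static int xen_apply_to_page(unsigned long first_pfn, xen_pfn_fn_t fn,
                             void *data)
{
    unsigned long i, pfn = first_pfn;
    int ret;

    for (i = 0; i < XEN_PFN_PER_PAGE; i++, pfn++) {
        ret = fn(pfn, data);
        if (ret)
            return ret;
    }
    return 0;
}

/* Example callback: count invocations. */
static int count_cb(unsigned long pfn, void *data)
{
    (void)pfn;
    (*(int *)data)++;
    return 0;
}
```

With 64KB Linux pages the callback fires sixteen times per page, once per 4KB Xen pfn.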
[Xen-devel] [PATCH v2 04/20] xen/grant: Add helper gnttab_page_grant_foreign_access_ref
Many PV drivers contain the idiom: pfn = page_to_mfn(...) /* Or similar */ gnttab_grant_foreign_access_ref(...). Replace it with a new helper. Note that when Linux uses a different page granularity than Xen, the helper only gives access to the first 4KB grant. This is useful where drivers allocate a full Linux page for each grant. Also include xen/interface/grant_table.h rather than xen/grant_table.h in asm/page.h for x86 to fix a compilation issue [1]; only the former is needed in order to get the structure definition. [1] Interdependency between asm/page.h and xen/grant_table.h, which results in page_to_mfn not being defined when necessary. Signed-off-by: Julien Grall Cc: Konrad Rzeszutek Wilk Cc: Boris Ostrovsky Cc: David Vrabel --- Changes in v2: - Patch added --- arch/x86/include/asm/xen/page.h | 2 +- include/xen/grant_table.h | 9 + 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/xen/page.h b/arch/x86/include/asm/xen/page.h index c44a5d5..fb2e037 100644 --- a/arch/x86/include/asm/xen/page.h +++ b/arch/x86/include/asm/xen/page.h @@ -12,7 +12,7 @@ #include #include -#include +#include #include /* Xen machine address */ diff --git a/include/xen/grant_table.h b/include/xen/grant_table.h index 6f77378..6a1ef86 100644 --- a/include/xen/grant_table.h +++ b/include/xen/grant_table.h @@ -131,6 +131,15 @@ void gnttab_cancel_free_callback(struct gnttab_free_callback *callback); void gnttab_grant_foreign_access_ref(grant_ref_t ref, domid_t domid, unsigned long frame, int readonly); +/* Give access to the first 4K of the page */ +static inline void gnttab_page_grant_foreign_access_ref( + grant_ref_t ref, domid_t domid, + struct page *page, int readonly) +{ + gnttab_grant_foreign_access_ref(ref, domid, page_to_mfn(page), + readonly); +} + void gnttab_grant_foreign_transfer_ref(grant_ref_t, domid_t domid, unsigned long pfn); -- 2.1.4
[Xen-devel] [PATCH v2 07/20] block/xen-blkfront: split get_grant in 2
Prepare the code to support 64KB page granularity. The first implementation will use a full Linux page per indirect and persistent grant. When non-persistent grant is used, each page of a bio request may be split in multiple grant. Furthermore, the field page of the grant structure is only used to copy data from persistent grant or indirect grant. Avoid to set it for other use case as it will have no meaning given the page will be split in multiple grant. Provide 2 functions, to setup indirect grant, the other for bio page. Signed-off-by: Julien Grall Cc: Konrad Rzeszutek Wilk Cc: Roger Pau Monné Cc: Boris Ostrovsky Cc: David Vrabel --- Changes in v2: - Patch added --- drivers/block/xen-blkfront.c | 85 ++-- 1 file changed, 59 insertions(+), 26 deletions(-) diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c index 7b81d23..95fd067 100644 --- a/drivers/block/xen-blkfront.c +++ b/drivers/block/xen-blkfront.c @@ -242,34 +242,77 @@ out_of_memory: return -ENOMEM; } -static struct grant *get_grant(grant_ref_t *gref_head, - struct page *page, - struct blkfront_info *info) +static struct grant *get_free_grant(struct blkfront_info *info) { struct grant *gnt_list_entry; - unsigned long buffer_mfn; BUG_ON(list_empty(&info->grants)); gnt_list_entry = list_first_entry(&info->grants, struct grant, - node); + node); list_del(&gnt_list_entry->node); - if (gnt_list_entry->gref != GRANT_INVALID_REF) { + if (gnt_list_entry->gref != GRANT_INVALID_REF) info->persistent_gnts_c--; + + return gnt_list_entry; +} + +static void grant_foreign_access(const struct grant *gnt_list_entry, +const struct blkfront_info *info) +{ + gnttab_page_grant_foreign_access_ref(gnt_list_entry->gref, +info->xbdev->otherend_id, +gnt_list_entry->page, +0); +} + +static struct grant *get_grant(grant_ref_t *gref_head, + unsigned long mfn, + struct blkfront_info *info) +{ + struct grant *gnt_list_entry = get_free_grant(info); + + if (gnt_list_entry->gref != GRANT_INVALID_REF) return 
gnt_list_entry; + + /* Assign a gref to this page */ + gnt_list_entry->gref = gnttab_claim_grant_reference(gref_head); + BUG_ON(gnt_list_entry->gref == -ENOSPC); + if (info->feature_persistent) + grant_foreign_access(gnt_list_entry, info); + else { + /* Grant access to the MFN passed by the caller */ + gnttab_grant_foreign_access_ref(gnt_list_entry->gref, + info->xbdev->otherend_id, + mfn, 0); } + return gnt_list_entry; +} + +static struct grant *get_indirect_grant(grant_ref_t *gref_head, + struct blkfront_info *info) +{ + struct grant *gnt_list_entry = get_free_grant(info); + + if (gnt_list_entry->gref != GRANT_INVALID_REF) + return gnt_list_entry; + /* Assign a gref to this page */ gnt_list_entry->gref = gnttab_claim_grant_reference(gref_head); BUG_ON(gnt_list_entry->gref == -ENOSPC); if (!info->feature_persistent) { - BUG_ON(!page); - gnt_list_entry->page = page; + struct page *indirect_page; + + /* Fetch a pre-allocated page to use for indirect grefs */ + BUG_ON(list_empty(&info->indirect_pages)); + indirect_page = list_first_entry(&info->indirect_pages, +struct page, lru); + list_del(&indirect_page->lru); + gnt_list_entry->page = indirect_page; } - buffer_mfn = page_to_mfn(gnt_list_entry->page); - gnttab_grant_foreign_access_ref(gnt_list_entry->gref, - info->xbdev->otherend_id, - buffer_mfn, 0); + grant_foreign_access(gnt_list_entry, info); + return gnt_list_entry; } @@ -522,29 +565,19 @@ static int blkif_queue_rw_req(struct request *req) if ((ring_req->operation == BLKIF_OP_INDIRECT) && (i % SEGS_PER_INDIRECT_FRAME == 0)) { - struct page *uninitialized_var(page); - if (segments) kunmap_atomic(segments); n = i / SEGS_PER_INDIRECT_FRAME; - if (!info->feature_persistent) { - struct page *ind
[Xen-devel] [PATCH v2 09/20] xen/biomerge: Don't allow biovec to be merge when Linux is not using 4KB page
When Linux uses 64K page granularity, every page will be split into multiple non-contiguous 4K MFNs (the page granularity of Xen). I'm not sure how to handle the check to know whether we can merge 2 biovecs efficiently in such a case. So for now, always report that biovecs are not mergeable.

Signed-off-by: Julien Grall
Cc: Konrad Rzeszutek Wilk
Cc: Boris Ostrovsky
Cc: David Vrabel
---
Changes in v2:
- Remove the workaround and check if the Linux page granularity is the same as Xen or not
---
 drivers/xen/biomerge.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/xen/biomerge.c b/drivers/xen/biomerge.c
index 0edb91c..571567c 100644
--- a/drivers/xen/biomerge.c
+++ b/drivers/xen/biomerge.c
@@ -6,10 +6,17 @@
 bool xen_biovec_phys_mergeable(const struct bio_vec *vec1,
 			       const struct bio_vec *vec2)
 {
+#if XEN_PAGE_SIZE == PAGE_SIZE
 	unsigned long mfn1 = pfn_to_mfn(page_to_pfn(vec1->bv_page));
 	unsigned long mfn2 = pfn_to_mfn(page_to_pfn(vec2->bv_page));
 
 	return __BIOVEC_PHYS_MERGEABLE(vec1, vec2) &&
 		((mfn1 == mfn2) || ((mfn1+1) == mfn2));
+#else
+	/* XXX: bio_vec are not mergeable when using different page size in
+	 * Xen and Linux
+	 */
+	return 0;
+#endif
 }
 EXPORT_SYMBOL(xen_biovec_phys_mergeable);
-- 
2.1.4
[Xen-devel] [PATCH v2 06/20] block/xen-blkfront: Store a page rather a pfn in the grant structure
All the usage of the field pfn are done using the same idiom: pfn_to_page(grant->pfn) This will return always the same page. Store directly the page in the grant to clean up the code. Signed-off-by: Julien Grall Cc: Konrad Rzeszutek Wilk Cc: Roger Pau Monné Cc: Boris Ostrovsky Cc: David Vrabel --- Changes in v2: - Patch added --- drivers/block/xen-blkfront.c | 37 ++--- 1 file changed, 18 insertions(+), 19 deletions(-) diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c index 7107d58..7b81d23 100644 --- a/drivers/block/xen-blkfront.c +++ b/drivers/block/xen-blkfront.c @@ -67,7 +67,7 @@ enum blkif_state { struct grant { grant_ref_t gref; - unsigned long pfn; + struct page *page; struct list_head node; }; @@ -219,7 +219,7 @@ static int fill_grant_buffer(struct blkfront_info *info, int num) kfree(gnt_list_entry); goto out_of_memory; } - gnt_list_entry->pfn = page_to_pfn(granted_page); + gnt_list_entry->page = granted_page; } gnt_list_entry->gref = GRANT_INVALID_REF; @@ -234,7 +234,7 @@ out_of_memory: &info->grants, node) { list_del(&gnt_list_entry->node); if (info->feature_persistent) - __free_page(pfn_to_page(gnt_list_entry->pfn)); + __free_page(gnt_list_entry->page); kfree(gnt_list_entry); i--; } @@ -243,7 +243,7 @@ out_of_memory: } static struct grant *get_grant(grant_ref_t *gref_head, - unsigned long pfn, + struct page *page, struct blkfront_info *info) { struct grant *gnt_list_entry; @@ -263,10 +263,10 @@ static struct grant *get_grant(grant_ref_t *gref_head, gnt_list_entry->gref = gnttab_claim_grant_reference(gref_head); BUG_ON(gnt_list_entry->gref == -ENOSPC); if (!info->feature_persistent) { - BUG_ON(!pfn); - gnt_list_entry->pfn = pfn; + BUG_ON(!page); + gnt_list_entry->page = page; } - buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn); + buffer_mfn = page_to_mfn(gnt_list_entry->page); gnttab_grant_foreign_access_ref(gnt_list_entry->gref, info->xbdev->otherend_id, buffer_mfn, 0); @@ -522,7 +522,7 @@ static int blkif_queue_rw_req(struct request 
*req) if ((ring_req->operation == BLKIF_OP_INDIRECT) && (i % SEGS_PER_INDIRECT_FRAME == 0)) { - unsigned long uninitialized_var(pfn); + struct page *uninitialized_var(page); if (segments) kunmap_atomic(segments); @@ -536,15 +536,15 @@ static int blkif_queue_rw_req(struct request *req) indirect_page = list_first_entry(&info->indirect_pages, struct page, lru); list_del(&indirect_page->lru); - pfn = page_to_pfn(indirect_page); + page = indirect_page; } - gnt_list_entry = get_grant(&gref_head, pfn, info); + gnt_list_entry = get_grant(&gref_head, page, info); info->shadow[id].indirect_grants[n] = gnt_list_entry; - segments = kmap_atomic(pfn_to_page(gnt_list_entry->pfn)); + segments = kmap_atomic(gnt_list_entry->page); ring_req->u.indirect.indirect_grefs[n] = gnt_list_entry->gref; } - gnt_list_entry = get_grant(&gref_head, page_to_pfn(sg_page(sg)), info); + gnt_list_entry = get_grant(&gref_head, sg_page(sg), info); ref = gnt_list_entry->gref; info->shadow[id].grants_used[i] = gnt_list_entry; @@ -555,7 +555,7 @@ static int blkif_queue_rw_req(struct request *req) BUG_ON(sg->offset + sg->length > PAGE_SIZE); - shared_data = kmap_atomic(pfn_to_page(gnt_list_entry->pfn)); + shared_data = kmap_atomic(gnt_list_entry->page); bvec_data = kmap_atomic(sg_page(sg)); /* @@ -1002,7 +1002,7 @@ static void blkif_free(struct blkfront_info *info, int suspend) info->persistent_gnts_c--; } if (info->feature_persistent) - __free_page(pfn_to_page(persistent_gnt->pfn)); +
[Xen-devel] [PATCH v2 01/20] xen: Add Xen specific page definition
The Xen hypercall interface always uses 4K page granularity on the ARM and x86 architectures. With the incoming support of 64K page granularity for ARM64 guests, it won't be possible to re-use the Linux page definition in Xen drivers. Introduce Xen page definition helpers based on the Linux page definition. They have exactly the same names but carry a XEN_/xen_ prefix. Also modify page_to_mfn to use the new Xen page definition.

Signed-off-by: Julien Grall
Cc: Konrad Rzeszutek Wilk
Cc: Boris Ostrovsky
Cc: David Vrabel
---
I'm wondering if we should drop page_to_mfn, as the macro will likely be misused when Linux is using 64KB page granularity.

Changes in v2:
- Add XEN_PFN_UP
- Add a comment describing the behavior of page_to_mfn
---
 include/xen/page.h | 21
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/include/xen/page.h b/include/xen/page.h
index c5ed20b..8ebd37b 100644
--- a/include/xen/page.h
+++ b/include/xen/page.h
@@ -1,11 +1,30 @@
 #ifndef _XEN_PAGE_H
 #define _XEN_PAGE_H
 
+#include
+
+/* The hypercall interface supports only 4KB page */
+#define XEN_PAGE_SHIFT	12
+#define XEN_PAGE_SIZE	(_AC(1,UL) << XEN_PAGE_SHIFT)
+#define XEN_PAGE_MASK	(~(XEN_PAGE_SIZE-1))
+#define xen_offset_in_page(p)	((unsigned long)(p) & ~XEN_PAGE_MASK)
+#define xen_pfn_to_page(pfn)	\
+	((pfn_to_page(((unsigned long)(pfn) << XEN_PAGE_SHIFT) >> PAGE_SHIFT)))
+#define xen_page_to_pfn(page)	\
+	(((page_to_pfn(page)) << PAGE_SHIFT) >> XEN_PAGE_SHIFT)
+
+#define XEN_PFN_PER_PAGE	(PAGE_SIZE / XEN_PAGE_SIZE)
+
+#define XEN_PFN_DOWN(x)	((x) >> XEN_PAGE_SHIFT)
+#define XEN_PFN_UP(x)	(((x) + XEN_PAGE_SIZE-1) >> XEN_PAGE_SHIFT)
+#define XEN_PFN_PHYS(x)	((phys_addr_t)(x) << XEN_PAGE_SHIFT)
+
 #include
 
+/* Return the MFN associated to the first 4KB of the page */
 static inline unsigned long page_to_mfn(struct page *page)
 {
-	return pfn_to_mfn(page_to_pfn(page));
+	return pfn_to_mfn(xen_page_to_pfn(page));
 }
 
 struct xen_memory_region {
-- 
2.1.4
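The conversion macros above are pure shift arithmetic between the two granularities. A standalone sketch of the key ones with concrete numbers (the 64KB PAGE_SHIFT and the xen_pfn_from_linux_pfn name are assumptions for illustration; the real xen_page_to_pfn takes a struct page):

```c
#include <assert.h>

#define PAGE_SHIFT 16 /* assume 64KB Linux pages for illustration */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

#define XEN_PAGE_SHIFT 12
#define XEN_PAGE_SIZE  (1UL << XEN_PAGE_SHIFT)
#define XEN_PAGE_MASK  (~(XEN_PAGE_SIZE - 1))
#define xen_offset_in_page(p) ((unsigned long)(p) & ~XEN_PAGE_MASK)

#define XEN_PFN_PER_PAGE (PAGE_SIZE / XEN_PAGE_SIZE)

/* A Linux pfn counts 64KB frames; a Xen pfn counts 4KB frames.
 * Converting goes through the byte address, as in xen_page_to_pfn. */
#define xen_pfn_from_linux_pfn(pfn) (((pfn) << PAGE_SHIFT) >> XEN_PAGE_SHIFT)
```

Linux pfn 1 (byte offset 64KB) therefore maps to Xen pfn 16, which is exactly why page_to_mfn can only ever name the first 4KB of a 64KB page.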
[Xen-devel] [PATCH v2 03/20] xen/grant: Introduce helpers to split a page into grant
Currently, a grant is always based on the Xen page granularity (i.e 4KB). When Linux is using a different page granularity, a single page will be split between multiple grants. The new helpers will be in charge to split the Linux page into grant and call a function given by the caller on each grant. In order to help some PV drivers, the callback is allowed to use less data and must update the resulting length. This is useful for netback. Also provide and helper to count the number of grants within a given contiguous region. Signed-off-by: Julien Grall Cc: Konrad Rzeszutek Wilk Cc: Boris Ostrovsky Cc: David Vrabel --- Changes in v2: - Patch added --- drivers/xen/grant-table.c | 26 ++ include/xen/grant_table.h | 41 + 2 files changed, 67 insertions(+) diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c index 62f591f..3679293 100644 --- a/drivers/xen/grant-table.c +++ b/drivers/xen/grant-table.c @@ -296,6 +296,32 @@ int gnttab_end_foreign_access_ref(grant_ref_t ref, int readonly) } EXPORT_SYMBOL_GPL(gnttab_end_foreign_access_ref); +void gnttab_foreach_grant(struct page *page, unsigned int offset, + unsigned int len, xen_grant_fn_t fn, + void *data) +{ + unsigned int goffset; + unsigned int glen; + unsigned long pfn; + + len = min_t(unsigned int, PAGE_SIZE - offset, len); + goffset = offset & ~XEN_PAGE_MASK; + + pfn = xen_page_to_pfn(page) + (offset >> XEN_PAGE_SHIFT); + + while (len) { + glen = min_t(unsigned int, XEN_PAGE_SIZE - goffset, len); + fn(pfn_to_mfn(pfn), goffset, &glen, data); + + goffset += glen; + if (goffset == XEN_PAGE_SIZE) { + goffset = 0; + pfn++; + } + len -= glen; + } +} + struct deferred_entry { struct list_head list; grant_ref_t ref; diff --git a/include/xen/grant_table.h b/include/xen/grant_table.h index 4478f4b..6f77378 100644 --- a/include/xen/grant_table.h +++ b/include/xen/grant_table.h @@ -45,8 +45,10 @@ #include #include +#include #include #include +#include #define GNTTAB_RESERVED_XENSTORE 1 @@ -224,4 +226,43 @@ static 
inline struct xen_page_foreign *xen_page_foreign(struct page *page) #endif } +/* Split Linux page in chunk of the size of the grant and call fn + * + * Parameters of fn: + * mfn: machine frame number based on grant granularity + * offset: offset in the grant + * len: length of the data in the grant. If fn decides to use less data, + * it must update len. + * data: internal information + */ +typedef void (*xen_grant_fn_t)(unsigned long mfn, unsigned int offset, + unsigned int *len, void *data); + +void gnttab_foreach_grant(struct page *page, unsigned int offset, + unsigned int len, xen_grant_fn_t fn, + void *data); + +/* Helper to get to call fn only on the first "grant chunk" */ +static inline void gnttab_one_grant(struct page *page, unsigned int offset, + unsigned len, xen_grant_fn_t fn, + void *data) +{ + /* The first request is limited to the size of one grant */ + len = min_t(unsigned int, XEN_PAGE_SIZE - (offset & ~XEN_PAGE_MASK), + len); + + gnttab_foreach_grant(page, offset, len, fn, data); +} + +/* Get the number of grant in a specified region + * + * offset: Offset in the first page + * len: total length of data (can cross multiple page) + */ +static inline unsigned int gnttab_count_grant(unsigned int offset, + unsigned int len) +{ + return (XEN_PFN_UP((offset & ~XEN_PAGE_MASK) + len)); +} + #endif /* __ASM_GNTTAB_H__ */ -- 2.1.4
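The core of the patch is the chunking loop in gnttab_foreach_grant. A simplified standalone rendition (gpfn stands in for the real mfn translation, the pfn advance is unconditional here which is equivalent for the chunks actually emitted, and the fixed-len callback drops the posted version's update-len-in-place convention):

```c
#include <assert.h>

#define XEN_PAGE_SHIFT 12
#define XEN_PAGE_SIZE  (1UL << XEN_PAGE_SHIFT)
#define XEN_PAGE_MASK  (~(XEN_PAGE_SIZE - 1))

typedef void (*grant_fn_t)(unsigned long gpfn, unsigned int offset,
                           unsigned int len, void *data);

/* Walk an (offset, len) region of a Linux page in grant-sized (4KB)
 * chunks, invoking fn once per chunk. */
static void foreach_grant(unsigned long base_pfn, unsigned int offset,
                          unsigned int len, grant_fn_t fn, void *data)
{
    unsigned int goffset = offset & ~XEN_PAGE_MASK;
    unsigned long pfn = base_pfn + (offset >> XEN_PAGE_SHIFT);

    while (len) {
        unsigned int glen = XEN_PAGE_SIZE - goffset;

        if (glen > len)
            glen = len;
        fn(pfn, goffset, glen, data);
        goffset = 0; /* every chunk after the first starts at a 4KB boundary */
        pfn++;
        len -= glen;
    }
}

/* Example callback: accumulate total bytes and chunk count. */
static unsigned int g_total, g_chunks;
static void acc_cb(unsigned long pfn, unsigned int off, unsigned int len,
                   void *data)
{
    (void)pfn; (void)off; (void)data;
    g_total += len;
    g_chunks++;
}
```

For 8192 bytes starting at offset 100 this yields three chunks (3996, 4096, 100 bytes), which is the ragged-first-chunk shape netfront and netback must handle.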
[Xen-devel] [PATCH v2 08/20] net/xen-netback: xenvif_gop_frag_copy: move GSO check out of the loop
The skb doesn't change within the function. Therefore it's only necessary to check if we need GSO once at the beginning. Signed-off-by: Julien Grall Cc: Ian Campbell Cc: Wei Liu Cc: net...@vger.kernel.org --- Changes in v2: - Patch added --- drivers/net/xen-netback/netback.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c index 880d0d6..3f77030 100644 --- a/drivers/net/xen-netback/netback.c +++ b/drivers/net/xen-netback/netback.c @@ -277,6 +277,13 @@ static void xenvif_gop_frag_copy(struct xenvif_queue *queue, struct sk_buff *skb unsigned long bytes; int gso_type = XEN_NETIF_GSO_TYPE_NONE; + if (skb_is_gso(skb)) { + if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4) + gso_type = XEN_NETIF_GSO_TYPE_TCPV4; + else if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6) + gso_type = XEN_NETIF_GSO_TYPE_TCPV6; + } + /* Data must not cross a page boundary. */ BUG_ON(size + offset > PAGE_SIZEgso_type & SKB_GSO_TCPV6) - gso_type = XEN_NETIF_GSO_TYPE_TCPV6; - } - if (*head && ((1 << gso_type) & queue->vif->gso_mask)) queue->rx.req_cons++; -- 2.1.4 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [libvirt test] 59256: regressions - FAIL
flight 59256 libvirt real [real] http://logs.test-lab.xenproject.org/osstest/logs/59256/ Regressions :-( Tests which did not succeed and are blocking, including tests which could not be run: test-amd64-i386-libvirt-xsm 11 guest-start fail REGR. vs. 58842 Regressions which are regarded as allowable (not blocking): test-amd64-i386-libvirt 11 guest-start fail like 58842 test-amd64-amd64-libvirt 11 guest-start fail like 58842 Tests which did not succeed, but are not blocking: test-amd64-amd64-libvirt-xsm 12 migrate-support-checkfail never pass test-armhf-armhf-libvirt 12 migrate-support-checkfail never pass test-armhf-armhf-libvirt-xsm 12 migrate-support-checkfail never pass version targeted for testing: libvirt e9c2734441af0065c69fc1317965a6dd6c7f14e3 baseline version: libvirt d10a5f58c75e7eb5943b44cc36a1e768adb2cdb0 Last test of basis58842 2015-06-23 04:23:54 Z 16 days Failing since 58870 2015-06-24 04:20:11 Z 15 days 13 attempts Testing same since59256 2015-07-09 04:20:20 Z0 days1 attempts People who touched revisions under test: Andrea Bolognani Boris Fiuczynski Daniel Veillard Dmitry Guryanov Eric Blake Erik Skultety Jim Fehlig Jiri Denemark John Ferlan Ján Tomko Laine Stump Luyao Huang Martin Kletzander Maxim Nestratov Michal Dubiel Michal Privoznik Mikhail Feoktistov Nikolay Shirokovskiy Nikolay Shirokovskiy Pavel Fedin Pavel Hrdina Peter Krempa Prerna Saxena Serge Hallyn jobs: build-amd64-xsm pass build-armhf-xsm pass build-i386-xsm pass build-amd64 pass build-armhf pass build-i386 pass build-amd64-libvirt pass build-armhf-libvirt pass build-i386-libvirt pass build-amd64-pvopspass build-armhf-pvopspass build-i386-pvops pass test-amd64-amd64-libvirt-xsm pass test-armhf-armhf-libvirt-xsm pass test-amd64-i386-libvirt-xsm fail test-amd64-amd64-libvirt fail test-armhf-armhf-libvirt pass test-amd64-i386-libvirt fail sg-report-flight on osstest.test-lab.xenproject.org logs: /home/logs/logs images: /home/logs/images Logs, config files, etc. 
are available at http://logs.test-lab.xenproject.org/osstest/logs Explanation of these reports, and of osstest in general, is at http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master Test harness code can be found at http://xenbits.xen.org/gitweb?p=osstest.git;a=summary Not pushing. (No revision log; it would be 2139 lines long.) ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [linux-3.18 test] 59222: regressions - FAIL
flight 59222 linux-3.18 real [real] http://logs.test-lab.xenproject.org/osstest/logs/59222/ Regressions :-( Tests which did not succeed and are blocking, including tests which could not be run: test-amd64-amd64-xl-pvh-intel 11 guest-start fail REGR. vs. 58581 Tests which are failing intermittently (not blocking): test-armhf-armhf-libvirt 6 xen-bootfail pass in 59001 Regressions which are regarded as allowable (not blocking): test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm 12 guest-localmigrate fail baseline untested test-armhf-armhf-xl-rtds 11 guest-startfail in 59001 baseline untested test-amd64-i386-libvirt-xsm 11 guest-start fail like 58558 test-amd64-amd64-libvirt 11 guest-start fail like 58558 test-amd64-i386-libvirt 11 guest-start fail like 58581 test-armhf-armhf-xl-multivcpu 6 xen-boot fail like 58581 test-armhf-armhf-xl 6 xen-boot fail like 58581 test-armhf-armhf-xl-credit2 6 xen-boot fail like 58581 test-armhf-armhf-xl-xsm 6 xen-boot fail like 58581 test-armhf-armhf-libvirt-xsm 6 xen-boot fail like 58581 test-amd64-amd64-xl-qemut-win7-amd64 16 guest-stop fail like 58581 test-amd64-i386-xl-qemuu-win7-amd64 16 guest-stop fail like 58581 test-amd64-amd64-xl-qemuu-win7-amd64 16 guest-stop fail like 58581 Tests which did not succeed, but are not blocking: test-amd64-i386-libvirt-xsm 12 migrate-support-check fail in 59001 never pass test-amd64-amd64-libvirt 12 migrate-support-check fail in 59001 never pass test-armhf-armhf-libvirt 12 migrate-support-check fail in 59001 never pass test-amd64-amd64-xl-pvh-amd 11 guest-start fail never pass test-amd64-i386-freebsd10-i386 9 freebsd-install fail never pass test-amd64-i386-freebsd10-amd64 9 freebsd-install fail never pass test-armhf-armhf-xl-cubietruck 6 xen-boot fail never pass test-amd64-amd64-libvirt-xsm 12 migrate-support-checkfail never pass test-amd64-i386-xl-qemut-win7-amd64 16 guest-stop fail never pass test-armhf-armhf-xl-arndale 12 migrate-support-checkfail never pass test-armhf-armhf-xl-rtds 14 
 guest-start.2            fail never pass
 test-armhf-armhf-xl-rtds                 12 migrate-support-check    fail never pass

version targeted for testing:
 linux                ea5dd38e93b3bec3427e5d3eef000bbf5d637e76
baseline version:
 linux                d048c068d00da7d4cfa5ea7651933b99026958cf

Last test of basis    58581  2015-06-15 09:42:22 Z   24 days
Testing same since    58976  2015-06-29 19:43:23 Z    9 days   11 attempts

People who touched revisions under test:
 "Eric W. Biederman"  Aaron Lu  Alexander Duyck  Alexei Starovoitov  Andrew de los Reyes  Andrew Morton  Andy Lutomirski  Anna Schumaker  Aravind Gopalakrishnan  Ard Biesheuvel  Arnd Bergmann  Baruch Siach  Ben Serebrin  Benjamin Tissoires  Bjørn Mork  Borislav Petkov  Chuck Lever  Cong Wang  Daniel Borkmann  Darren Hart  Darren Salt  David Daney  David S. Miller  David Woodhouse  Devesh Sharma  Doug Ledford  Eric Dumazet  Eric W. Biederman  Felipe Balbi  Feng Kan  Florent Fourcot  Florian Fainelli  Geert Uytterhoeven  Grant Likely  Greg Kroah-Hartman  Hannes Frederic Sowa  Henning Rogge  Honggang Li  Ian Campbell  Ilya Dryomov  Ingo Molnar  Jakub Sitnicki  Jason Gunthorpe  Jiada Wang  Jiri Kosina  Jiri Pirko  Joerg Roedel  Jovi Zhangwei  Ken Xue  Kevin Hilman  Konstantin Khlebnikov  Laura Abbott  Laurent Pinchart  Linus Lüssing  Linus Torvalds  Linus Walleij  Mark Brown  Mark Salyzyn  Matt Fleming  Matthew Garrett  Mauro Carvalho Chehab  Max Filippov  Meghana Cheripady  Mel Gorman  Mika Westerberg  Milan Plzik  Neal Cardwell  Neil Horman  Nicolas Dichtel  Nicolas Iooss  Nicolas Pitre  Nikolay Aleksandrov  Oliver Neukum  Oliver Neukum oli...@neukum.org  Olof Johansson  Pablo Neira Ayuso  Paolo Bonzini  Pelle Nilsson  Peter Zijlstra (Intel)  Rafael J. Wysocki  Raphael Assenat  Richard Cochran  Rik Theys  Ross Lagerwall  Roy Franz  Russell King  Sasha Levin  Sean Young  Shawn Bohrer  Simon Horman  Sowmini Varadhan  Sriharsha Basavapatna  Steffen Klassert  Thadeu Lima de Souza Cascardo  Theodore Ts'o  Uwe Kleine-König  Veaceslav Falico  Veeresh U. Kokatnur  Vijay Subramanian  Vinod Koul  Vittorio Gambaletta  Vlad Yasevich  Vladislav Yasevich  WANG Cong  Wei Liu  Yao Xiwei  Yoshihiro Shimoda  Yuchung Cheng

jobs:
 build-amd64-xsm                                              pass
 build-armh
Re: [Xen-devel] [RFC Patch V1 06/12] PCI: Use for_pci_msi_entry() to access MSI device list
On Thu, Jul 09, 2015 at 04:00:41PM +0800, Jiang Liu wrote: > Use accessor for_pci_msi_entry() to access MSI device list, so we could easily > move msi_list from struct pci_dev into struct device later. > > Signed-off-by: Jiang Liu > --- > drivers/pci/msi.c | 39 --- > drivers/pci/xen-pcifront.c |2 +- Acked-by on the Xen bits. > 2 files changed, 21 insertions(+), 20 deletions(-) > > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c > index 7b4c20c9f9ca..d09afa78d7a1 100644 > --- a/drivers/pci/msi.c > +++ b/drivers/pci/msi.c > @@ -131,7 +131,7 @@ int __weak arch_setup_msi_irqs(struct pci_dev *dev, int > nvec, int type) > if (type == PCI_CAP_ID_MSI && nvec > 1) > return 1; > > - list_for_each_entry(entry, &dev->msi_list, list) { > + for_each_pci_msi_entry(entry, dev) { > ret = arch_setup_msi_irq(dev, entry); > if (ret < 0) > return ret; > @@ -151,7 +151,7 @@ void default_teardown_msi_irqs(struct pci_dev *dev) > int i; > struct msi_desc *entry; > > - list_for_each_entry(entry, &dev->msi_list, list) > + for_each_pci_msi_entry(entry, dev) > if (entry->irq) > for (i = 0; i < entry->nvec_used; i++) > arch_teardown_msi_irq(entry->irq + i); > @@ -168,7 +168,7 @@ static void default_restore_msi_irq(struct pci_dev *dev, > int irq) > > entry = NULL; > if (dev->msix_enabled) { > - list_for_each_entry(entry, &dev->msi_list, list) { > + for_each_pci_msi_entry(entry, dev) { > if (irq == entry->irq) > break; > } > @@ -282,7 +282,7 @@ void default_restore_msi_irqs(struct pci_dev *dev) > { > struct msi_desc *entry; > > - list_for_each_entry(entry, &dev->msi_list, list) > + for_each_pci_msi_entry(entry, dev) > default_restore_msi_irq(dev, entry->irq); > } > > @@ -363,21 +363,22 @@ EXPORT_SYMBOL_GPL(pci_write_msi_msg); > > static void free_msi_irqs(struct pci_dev *dev) > { > + struct list_head *msi_list = dev_to_msi_list(&dev->dev); > struct msi_desc *entry, *tmp; > struct attribute **msi_attrs; > struct device_attribute *dev_attr; > int i, count = 0; > > - list_for_each_entry(entry, 
&dev->msi_list, list) > + for_each_pci_msi_entry(entry, dev) > if (entry->irq) > for (i = 0; i < entry->nvec_used; i++) > BUG_ON(irq_has_action(entry->irq + i)); > > pci_msi_teardown_msi_irqs(dev); > > - list_for_each_entry_safe(entry, tmp, &dev->msi_list, list) { > + list_for_each_entry_safe(entry, tmp, msi_list, list) { > if (entry->msi_attrib.is_msix) { > - if (list_is_last(&entry->list, &dev->msi_list)) > + if (list_is_last(&entry->list, msi_list)) > iounmap(entry->mask_base); > } > > @@ -448,7 +449,7 @@ static void __pci_restore_msix_state(struct pci_dev *dev) > > if (!dev->msix_enabled) > return; > - BUG_ON(list_empty(&dev->msi_list)); > + BUG_ON(list_empty(dev_to_msi_list(&dev->dev))); > > /* route the table */ > pci_intx_for_msi(dev, 0); > @@ -456,7 +457,7 @@ static void __pci_restore_msix_state(struct pci_dev *dev) > PCI_MSIX_FLAGS_ENABLE | PCI_MSIX_FLAGS_MASKALL); > > arch_restore_msi_irqs(dev); > - list_for_each_entry(entry, &dev->msi_list, list) > + for_each_pci_msi_entry(entry, dev) > msix_mask_irq(entry, entry->masked); > > pci_msix_clear_and_set_ctrl(dev, PCI_MSIX_FLAGS_MASKALL, 0); > @@ -501,7 +502,7 @@ static int populate_msi_sysfs(struct pci_dev *pdev) > int count = 0; > > /* Determine how many msi entries we have */ > - list_for_each_entry(entry, &pdev->msi_list, list) > + for_each_pci_msi_entry(entry, pdev) > ++num_msi; > if (!num_msi) > return 0; > @@ -510,7 +511,7 @@ static int populate_msi_sysfs(struct pci_dev *pdev) > msi_attrs = kzalloc(sizeof(void *) * (num_msi + 1), GFP_KERNEL); > if (!msi_attrs) > return -ENOMEM; > - list_for_each_entry(entry, &pdev->msi_list, list) { > + for_each_pci_msi_entry(entry, pdev) { > msi_dev_attr = kzalloc(sizeof(*msi_dev_attr), GFP_KERNEL); > if (!msi_dev_attr) > goto error_attrs; > @@ -599,7 +600,7 @@ static int msi_verify_entries(struct pci_dev *dev) > { > struct msi_desc *entry; > > - list_for_each_entry(entry, &dev->msi_list, list) { > + for_each_pci_msi_entry(entry, dev) { > if (!dev->no_64bit_msi || 
!entry->msg.address_hi) > continue; > dev_err(&dev->dev, "Device has broken 64-bit MSI but arc
Re: [Xen-devel] [RFC Patch V1 05/12] x86, PCI: Use for_pci_msi_entry() to access MSI device list
On Thu, Jul 09, 2015 at 04:00:40PM +0800, Jiang Liu wrote: > Use accessor for_pci_msi_entry() to access MSI device list, so we could > easily move msi_list from struct pci_dev into struct device later. > > Signed-off-by: Jiang Liu Looks pretty simple. Acked- by: Konrad Rzeszutek Wilk > --- > arch/x86/pci/xen.c |8 > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c > index d22f4b5bbc04..ff31ab464213 100644 > --- a/arch/x86/pci/xen.c > +++ b/arch/x86/pci/xen.c > @@ -179,7 +179,7 @@ static int xen_setup_msi_irqs(struct pci_dev *dev, int > nvec, int type) > if (ret) > goto error; > i = 0; > - list_for_each_entry(msidesc, &dev->msi_list, list) { > + for_each_pci_msi_entry(msidesc, dev) { > irq = xen_bind_pirq_msi_to_irq(dev, msidesc, v[i], > (type == PCI_CAP_ID_MSI) ? nvec > : 1, > (type == PCI_CAP_ID_MSIX) ? > @@ -230,7 +230,7 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, > int nvec, int type) > if (type == PCI_CAP_ID_MSI && nvec > 1) > return 1; > > - list_for_each_entry(msidesc, &dev->msi_list, list) { > + for_each_pci_msi_entry(msidesc, dev) { > __pci_read_msi_msg(msidesc, &msg); > pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) | > ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff); > @@ -274,7 +274,7 @@ static int xen_initdom_setup_msi_irqs(struct pci_dev > *dev, int nvec, int type) > int ret = 0; > struct msi_desc *msidesc; > > - list_for_each_entry(msidesc, &dev->msi_list, list) { > + for_each_pci_msi_entry(msidesc, dev) { > struct physdev_map_pirq map_irq; > domid_t domid; > > @@ -386,7 +386,7 @@ static void xen_teardown_msi_irqs(struct pci_dev *dev) > { > struct msi_desc *msidesc; > > - msidesc = list_entry(dev->msi_list.next, struct msi_desc, list); > + msidesc = first_pci_msi_entry(dev); > if (msidesc->msi_attrib.is_msix) > xen_pci_frontend_disable_msix(dev); > else > -- > 1.7.10.4 > ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
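The point of the `for_each_pci_msi_entry()` conversion in these two patches is that call sites stop naming `dev->msi_list` directly, so the list can later move from `struct pci_dev` into `struct device` by changing only the accessor. A minimal Python sketch of the same refactoring pattern — the `Device`/`PciDev` classes here are illustrative stand-ins, not kernel APIs:

```python
# Accessor-based iteration: one helper knows where the MSI list is stored,
# so moving the storage only requires changing dev_to_msi_list().

class Device:
    """Stand-in for 'struct device'; owns msi_list after the planned move."""
    def __init__(self):
        self.msi_list = []

class PciDev:
    """Stand-in for 'struct pci_dev', which embeds a struct device."""
    def __init__(self):
        self.dev = Device()

def dev_to_msi_list(dev):
    # The single place that knows the list's location.
    return dev.msi_list

def for_each_pci_msi_entry(pdev):
    # Call sites iterate via the accessor instead of touching fields directly.
    return iter(dev_to_msi_list(pdev.dev))

pdev = PciDev()
pdev.dev.msi_list.extend(["msi0", "msi1"])
entries = list(for_each_pci_msi_entry(pdev))
```

If `msi_list` later moves, only `dev_to_msi_list()` changes; every loop built on the accessor keeps working unmodified, which is exactly the property the patches are after.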
Re: [Xen-devel] [PATCH v9 4/4] iommu: add rmrr Xen command line option for extra rmrrs
On Wed, Jul 08, 2015 at 09:50:26PM -0400, elena.ufimts...@oracle.com wrote: > From: Elena Ufimtseva > > On some platforms RMRR regions may be not specified > in ACPI and thus will not be mapped 1:1 in dom0. This > causes IO Page Faults and prevents dom0 from booting > in PVH mode. > New Xen command line option rmrr allows to specify > such devices and memory regions. These regions are added > to the list of RMRR defined in ACPI if the device > is present in system. As a result, additional RMRRs will > be mapped 1:1 in dom0 with correct permissions. > > Mentioned above problems were discovered during PVH work with > ThinkCentre M and Dell 5600T. No official documentation > was found so far in regards to what devices and why cause this. > Experiments show that ThinkCentre M USB devices with enabled > debug port generate DMA read transactions to the regions of > memory marked reserved in host e820 map. > For Dell 5600T the device and faulting addresses are not found yet. > > For detailed history of the discussion please check following threads: > http://lists.Xen.org/archives/html/xen-devel/2015-02/msg01724.html > http://lists.Xen.org/archives/html/xen-devel/2015-01/msg02513.html > > Format for rmrr Xen command line option: > rmrr=start<-end>=[s1]bdf1[,[s1]bdf2[,...]];start<-end>=[s2]bdf1[,[s2]bdf2[,...]] > If grub2 used and multiple ranges are specified, ';' should be > quoted/escaped, refer to grub2 manual for more information. > Reviewed-by: Konrad Rzeszutek Wilk Thanks! > Signed-off-by: Elena Ufimtseva > --- > docs/misc/xen-command-line.markdown | 13 +++ > xen/drivers/passthrough/vtd/dmar.c | 209 > +++- > 2 files changed, 221 insertions(+), 1 deletion(-) > > diff --git a/docs/misc/xen-command-line.markdown > b/docs/misc/xen-command-line.markdown > index aa684c0..f307f3d 100644 > --- a/docs/misc/xen-command-line.markdown > +++ b/docs/misc/xen-command-line.markdown > @@ -1197,6 +1197,19 @@ Specify the host reboot method. 
> 'efi' instructs Xen to reboot using the EFI reboot call (in EFI mode by > default it will use that method first). > > +### rmrr > +> '= > start<-end>=[s1]bdf1[,[s1]bdf2[,...]];start<-end>=[s2]bdf1[,[s2]bdf2[,...]] > + > +Define RMRR units that are missing from ACPI table along with device they > +belong to and use them for 1:1 mapping. End addresses can be omitted and one > +page will be mapped. The ranges are inclusive when start and end are > specified. > +If segment of the first device is not specified, segment zero will be used. > +If other segments are not specified, first device segment will be used. > +If a segment is specified for other than the first device and it does not > match > +the one specified for the first one, an error will be reported. > +Note: grub2 requires to escape or use quotations if special characters are > used, > +namely ';', refer to the grub2 documentation if multiple ranges are > specified. > + > ### ro-hpet > > `= ` > > diff --git a/xen/drivers/passthrough/vtd/dmar.c > b/xen/drivers/passthrough/vtd/dmar.c > index a8e1e5d..f62fb02 100644 > --- a/xen/drivers/passthrough/vtd/dmar.c > +++ b/xen/drivers/passthrough/vtd/dmar.c > @@ -869,6 +869,145 @@ out: > return ret; > } > > +#define MAX_EXTRA_RMRR_PAGES 16 > +#define MAX_EXTRA_RMRR 10 > + > +/* RMRR units derived from command line rmrr option. */ > +#define MAX_EXTRA_RMRR_DEV 20 > +struct extra_rmrr_unit { > +struct list_head list; > +unsigned long base_pfn, end_pfn; > +unsigned int dev_count; > +u32sbdf[MAX_EXTRA_RMRR_DEV]; > +}; > +static __initdata unsigned int nr_rmrr; > +static struct __initdata extra_rmrr_unit extra_rmrr_units[MAX_EXTRA_RMRR]; > + > +/* Macro for RMRR inclusive range formatting. 
*/ > +#define PRI_RMRR(s,e) "[%lx-%lx]" > + > +static void __init add_extra_rmrr(void) > +{ > +struct acpi_rmrr_unit *acpi_rmrr; > +struct acpi_rmrr_unit *rmrru; > +unsigned int dev, seg, i, j; > +unsigned long pfn; > +bool_t overlap; > + > +for ( i = 0; i < nr_rmrr; i++ ) > +{ > +if ( extra_rmrr_units[i].base_pfn > extra_rmrr_units[i].end_pfn ) > +{ > +printk(XENLOG_ERR VTDPREFIX > + "Invalid RMRR Range "PRI_RMRR(s,e)"\n", > + extra_rmrr_units[i].base_pfn, > extra_rmrr_units[i].end_pfn); > +continue; > +} > + > +if ( extra_rmrr_units[i].end_pfn - extra_rmrr_units[i].base_pfn >= > + MAX_EXTRA_RMRR_PAGES ) > +{ > +printk(XENLOG_ERR VTDPREFIX > + "RMRR range "PRI_RMRR(s,e)" exceeds > "__stringify(MAX_EXTRA_RMRR_PAGES)" pages\n", > + extra_rmrr_units[i].base_pfn, > extra_rmrr_units[i].end_pfn); > +continue; > +} > + > +for ( j = 0; j < nr_rmrr; j++ ) > +{ > +if ( i != j && > + extra_rmrr_units[i].base_pfn <= extra_rmrr_units[j].end_pfn > && > + extra_rmrr_u
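The documented `rmrr=` syntax above (inclusive hex ranges, a one-page default when the end address is omitted, `;`-separated units with `,`-separated devices) can be sketched as a simplified parser. This is an illustration only: device tokens are kept opaque, and the segment-defaulting/validation rules from the real parser are not reproduced.

```python
PAGE_SHIFT = 12

def parse_rmrr_option(opt):
    """Parse 'start<-end>=dev1[,dev2...];...' into (base_pfn, end_pfn, devs).

    Simplified sketch of the documented option format: addresses are hex,
    ranges are inclusive, and when the end address is omitted exactly one
    page is mapped.  Device tokens (e.g. '[seg]bdf') are treated as opaque
    strings here.
    """
    units = []
    for unit in opt.split(';'):
        addrs, _, devs = unit.partition('=')
        start, _, end = addrs.partition('-')
        base = int(start, 16)
        last = int(end, 16) if end else base   # end omitted -> one page
        units.append((base >> PAGE_SHIFT, last >> PAGE_SHIFT,
                      devs.split(',')))
    return units

units = parse_rmrr_option("e8000-e8fff=0:1a.0;d4000=0:1d.0,0:1d.1")
```

Note that with grub2 the `;` between units must be quoted or escaped, as the patch's documentation points out.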
Re: [Xen-devel] [PATCH v9 2/4] iommu VT-d: separate rmrr addition function
On Wed, Jul 08, 2015 at 09:50:24PM -0400, elena.ufimts...@oracle.com wrote: > From: Elena Ufimtseva > > In preparation for auxiliary RMRR data provided on Xen > command line, make RMRR adding a separate function. > Also free memery for rmrr device scope in error path. s/memery/memory/ > > Signed-off-by: Elena Ufimtseva Reviewed-by: Konrad Rzeszutek Wilk > --- > xen/drivers/passthrough/vtd/dmar.c | 126 > +++-- > 1 file changed, 65 insertions(+), 61 deletions(-) > > diff --git a/xen/drivers/passthrough/vtd/dmar.c > b/xen/drivers/passthrough/vtd/dmar.c > index 77ef708..a8e1e5d 100644 > --- a/xen/drivers/passthrough/vtd/dmar.c > +++ b/xen/drivers/passthrough/vtd/dmar.c > @@ -585,6 +585,68 @@ out: > return ret; > } > > +static int register_one_rmrr(struct acpi_rmrr_unit *rmrru) > +{ > +bool_t ignore = 0; > +unsigned int i = 0; > +int ret = 0; > + > +/* Skip checking if segment is not accessible yet. */ > +if ( !pci_known_segment(rmrru->segment) ) > +i = UINT_MAX; > + > +for ( ; i < rmrru->scope.devices_cnt; i++ ) > +{ > +u8 b = PCI_BUS(rmrru->scope.devices[i]); > +u8 d = PCI_SLOT(rmrru->scope.devices[i]); > +u8 f = PCI_FUNC(rmrru->scope.devices[i]); > + > +if ( pci_device_detect(rmrru->segment, b, d, f) == 0 ) > +{ > +dprintk(XENLOG_WARNING VTDPREFIX, > +" Non-existent device (%04x:%02x:%02x.%u) is reported" > +" in RMRR (%"PRIx64", %"PRIx64")'s scope!\n", > +rmrru->segment, b, d, f, > +rmrru->base_address, rmrru->end_address); > +ignore = 1; > +} > +else > +{ > +ignore = 0; > +break; > +} > +} > + > +if ( ignore ) > +{ > +dprintk(XENLOG_WARNING VTDPREFIX, > +" Ignore the RMRR (%"PRIx64", %"PRIx64") due to " > +"devices under its scope are not PCI discoverable!\n", > +rmrru->base_address, rmrru->end_address); > +scope_devices_free(&rmrru->scope); > +xfree(rmrru); > +} > +else if ( rmrru->base_address > rmrru->end_address ) > +{ > +dprintk(XENLOG_WARNING VTDPREFIX, > +" The RMRR (%"PRIx64", %"PRIx64") is incorrect!\n", > +rmrru->base_address, rmrru->end_address); > 
+scope_devices_free(&rmrru->scope); > +xfree(rmrru); > +ret = -EFAULT; > +} > +else > +{ > +if ( iommu_verbose ) > +dprintk(VTDPREFIX, > +" RMRR region: base_addr %"PRIx64" end_address > %"PRIx64"\n", > +rmrru->base_address, rmrru->end_address); > +acpi_register_rmrr_unit(rmrru); > +} > + > +return ret; > +} > + > static int __init > acpi_parse_one_rmrr(struct acpi_dmar_header *header) > { > @@ -635,68 +697,10 @@ acpi_parse_one_rmrr(struct acpi_dmar_header *header) > ret = acpi_parse_dev_scope(dev_scope_start, dev_scope_end, > &rmrru->scope, RMRR_TYPE, rmrr->segment); > > -if ( ret || (rmrru->scope.devices_cnt == 0) ) > -xfree(rmrru); > +if ( !ret && (rmrru->scope.devices_cnt != 0) ) > +register_one_rmrr(rmrru); > else > -{ > -u8 b, d, f; > -bool_t ignore = 0; > -unsigned int i = 0; > - > -/* Skip checking if segment is not accessible yet. */ > -if ( !pci_known_segment(rmrr->segment) ) > -i = UINT_MAX; > - > -for ( ; i < rmrru->scope.devices_cnt; i++ ) > -{ > -b = PCI_BUS(rmrru->scope.devices[i]); > -d = PCI_SLOT(rmrru->scope.devices[i]); > -f = PCI_FUNC(rmrru->scope.devices[i]); > - > -if ( !pci_device_detect(rmrr->segment, b, d, f) ) > -{ > -dprintk(XENLOG_WARNING VTDPREFIX, > -" Non-existent device (%04x:%02x:%02x.%u) is > reported" > -" in RMRR (%"PRIx64", %"PRIx64")'s scope!\n", > -rmrr->segment, b, d, f, > -rmrru->base_address, rmrru->end_address); > -ignore = 1; > -} > -else > -{ > -ignore = 0; > -break; > -} > -} > - > -if ( ignore ) > -{ > -dprintk(XENLOG_WARNING VTDPREFIX, > -" Ignore the RMRR (%"PRIx64", %"PRIx64") due to " > -"devices under its scope are not PCI discoverable!\n", > -rmrru->base_address, rmrru->end_address); > -scope_devices_free(&rmrru->scope); > -xfree(rmrru); > -} > -else if ( base_addr > end_addr ) > -{ > -dprintk(XENLOG_WARNING VTDPREFIX, > -" The RMRR (%"PRIx64", %"PRIx64") is incorrect!\n", > -rmrru->base_addres
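One detail worth noting in `register_one_rmrr()` is the `i = UINT_MAX` trick: when the PCI segment is not yet accessible, the per-device existence scan is skipped entirely by starting the loop index past the end, and the unit is accepted. A Python sketch of just that acceptance logic (function and parameter names are illustrative):

```python
UINT_MAX = 2**32 - 1

def rmrr_acceptable(segment_known, devices, device_exists):
    """Sketch of register_one_rmrr()'s scan: if the segment isn't known yet,
    skip the device check (index starts past the end, as with i = UINT_MAX
    in the C code); otherwise ignore the RMRR unless at least one device in
    its scope is discoverable."""
    i = 0 if segment_known else UINT_MAX
    ignore = False
    while i < len(devices):
        if not device_exists(devices[i]):
            ignore = True          # keep scanning for any present device
        else:
            ignore = False
            break
        i += 1
    return not ignore

kept = rmrr_acceptable(True, ["a", "b"], lambda d: d == "b")
deferred = rmrr_acceptable(False, ["a"], lambda d: False)   # segment unknown
rejected = rmrr_acceptable(True, ["a"], lambda d: False)
```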
[Xen-devel] [PATCH v2 17/27] tools/libxl: Support converting a legacy stream to a v2 stream
When a legacy stream is found, it needs to be converted to a v2 stream for the reading logic. This is done by exec()ing the python conversion utility. One complication is that the caller of this interface needs to assume ownership of the output fd, to prevent it being closed while still in use in a datacopier. Signed-off-by: Andrew Cooper CC: Ian Campbell CC: Ian Jackson CC: Wei Liu --- No major differences since v1. It has gained a new ao_abrt as a result of rebasing over the AO ABORT series, and has been made x86-specific. --- tools/libxl/Makefile|1 + tools/libxl/libxl_convert_callout.c | 172 +++ tools/libxl/libxl_internal.h| 48 ++ 3 files changed, 221 insertions(+) create mode 100644 tools/libxl/libxl_convert_callout.c diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile index c71c5fe..0ebc35a 100644 --- a/tools/libxl/Makefile +++ b/tools/libxl/Makefile @@ -59,6 +59,7 @@ endif LIBXL_OBJS-y += libxl_remus_device.o libxl_remus_disk_drbd.o LIBXL_OBJS-$(CONFIG_X86) += libxl_cpuid.o libxl_x86.o libxl_psr.o +LIBXL_OBJS-$(CONFIG_X86) += libxl_convert_callout.o LIBXL_OBJS-$(CONFIG_ARM) += libxl_nocpuid.o libxl_arm.o libxl_libfdt_compat.o ifeq ($(CONFIG_NetBSD),y) diff --git a/tools/libxl/libxl_convert_callout.c b/tools/libxl/libxl_convert_callout.c new file mode 100644 index 000..02f3b82 --- /dev/null +++ b/tools/libxl/libxl_convert_callout.c @@ -0,0 +1,172 @@ +/* + * Copyright (C) 2014 Citrix Ltd. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as published + * by the Free Software Foundation; version 2.1 only. with the special + * exception on linking described in file LICENSE. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. 
+ */ + +#include "libxl_osdeps.h" + +#include "libxl_internal.h" + +/* + * Infrastructure for converting a legacy migration stream into a libxl v2 + * stream. + * + * This is done by fork()ing the python conversion script, which takes in a + * legacy stream, and puts out a suitably-formatted v2 stream. + */ + +static void helper_failed(libxl__egc *egc, + libxl__conversion_helper_state *chs, int rc); +static void helper_stop(libxl__egc *egc, libxl__ao_abortable*, int rc); +static void helper_exited(libxl__egc *egc, libxl__ev_child *ch, + pid_t pid, int status); +static void helper_done(libxl__egc *egc, +libxl__conversion_helper_state *chs); + +void libxl__conversion_helper_init(libxl__conversion_helper_state *chs) +{ +libxl__ao_abortable_init(&chs->abrt); +libxl__ev_child_init(&chs->child); +} + +void libxl__convert_legacy_stream(libxl__egc *egc, + libxl__conversion_helper_state *chs) +{ +STATE_AO_GC(chs->ao); +libxl__carefd *child_in = NULL, *child_out = NULL; +int ret = 0; + +chs->rc = 0; +libxl__conversion_helper_init(chs); + +chs->abrt.ao = chs->ao; +chs->abrt.callback = helper_stop; +ret = libxl__ao_abortable_register(&chs->abrt); +if (ret) goto err; + +libxl__carefd_begin(); +int fds[2]; +if (libxl_pipe(CTX, fds)) { +ret = ERROR_FAIL; +libxl__carefd_unlock(); +goto err; +} +child_out = libxl__carefd_record(CTX, fds[0]); +child_in = libxl__carefd_record(CTX, fds[1]); +libxl__carefd_unlock(); + +pid_t pid = libxl__ev_child_fork(gc, &chs->child, helper_exited); +if (!pid) { +char * const args[] = +{ +getenv("LIBXL_CONVERT_HELPER") ?: +LIBEXEC_BIN "/convert-legacy-stream", +"--in", GCSPRINTF("%d", chs->legacy_fd), +"--out",GCSPRINTF("%d", fds[1]), +"--width", +#ifdef __i386__ +"32", +#else +"64", +#endif +"--guest", chs->hvm ? 
"hvm" : "pv", +"--format", "libxl", +/* "--verbose", */ +NULL, +}; + +libxl_fd_set_cloexec(CTX, chs->legacy_fd, 0); +libxl_fd_set_cloexec(CTX, libxl__carefd_fd(child_in), 0); + +libxl__exec(gc, +-1, -1, -1, +args[0], args, NULL); +} + +libxl__carefd_close(child_in); +chs->v2_carefd = child_out; + +assert(!ret); +return; + + err: +assert(ret); +helper_failed(egc, chs, ret); +} + +void libxl__convert_legacy_stream_abort(libxl__egc *egc, +libxl__conversion_helper_state *chs, +int rc) +{ +helper_stop(egc, &chs->abrt, rc); +} + +static void helper_failed(libxl__egc *egc, + libxl__conversion_he
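The structure of the callout above — fork/exec a converter with the legacy stream as its input, give it the write end of a pipe, and hand ownership of the read end back to the caller — can be sketched outside libxl. In this illustration `tr` stands in for `convert-legacy-stream`; the fd-ownership handoff mirrors how `chs->v2_carefd` is passed to the datacopier.

```python
import os
import subprocess

def convert_stream(legacy_fd):
    """Spawn a converter child reading legacy_fd; return (read_fd, child).

    The caller assumes ownership of read_fd, mirroring libxl's v2_carefd.
    'tr a-z A-Z' is only a stand-in for LIBEXEC_BIN/convert-legacy-stream.
    """
    r, w = os.pipe()
    child = subprocess.Popen(["tr", "a-z", "A-Z"],
                             stdin=legacy_fd, stdout=w, close_fds=True)
    os.close(w)                  # parent keeps only the read end
    return r, child

lr, lw = os.pipe()               # fake "legacy stream" source
os.write(lw, b"legacy stream")
os.close(lw)

out_fd, child = convert_stream(lr)
os.close(lr)                     # child holds its own copy

converted = b""
while True:
    chunk = os.read(out_fd, 4096)
    if not chunk:
        break
    converted += chunk
os.close(out_fd)
child.wait()
```

Closing the parent's copies of the write ends is the important part: the reader only sees EOF once every duplicate of the pipe's write end is gone, which is also why the libxl code is careful with carefd record/close pairs.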
[Xen-devel] [PATCH v2 12/27] tools/python: Other migration infrastructure
Contains: * Reverse-engineered notes of the legacy format from xg_save_restore.h * Python implementation of the legacy format * Public HVM Params used in the legacy stream * XL header format Signed-off-by: Andrew Cooper CC: Ian Campbell CC: Ian Jackson CC: Wei Liu --- New in v2 - removes various many magic numbers from subsequent scripts --- tools/python/xen/migration/legacy.py | 279 ++ tools/python/xen/migration/public.py | 21 +++ tools/python/xen/migration/xl.py | 12 ++ 3 files changed, 312 insertions(+) create mode 100644 tools/python/xen/migration/legacy.py create mode 100644 tools/python/xen/migration/public.py create mode 100644 tools/python/xen/migration/xl.py diff --git a/tools/python/xen/migration/legacy.py b/tools/python/xen/migration/legacy.py new file mode 100644 index 000..2f2240a --- /dev/null +++ b/tools/python/xen/migration/legacy.py @@ -0,0 +1,279 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +""" +Libxc legacy migration streams + +Documentation and record structures for legacy migration +""" + +""" +SAVE/RESTORE/MIGRATE PROTOCOL += + +The general form of a stream of chunks is a header followed by a +body consisting of a variable number of chunks (terminated by a +chunk with type 0) followed by a trailer. + +For a rolling/checkpoint (e.g. remus) migration then the body and +trailer phases can be repeated until an external event +(e.g. failure) causes the process to terminate and commit to the +most recent complete checkpoint. + +HEADER +-- + +unsigned long: p2m_size + +extended-info (PV-only, optional): + + If first unsigned long == ~0UL then extended info is present, + otherwise unsigned long is part of p2m. Note that p2m_size above + does not include the length of the extended info. 
+ + extended-info: + +unsigned long: signature == ~0UL +uint32_t : number of bytes remaining in extended-info + +1 or more extended-info blocks of form: +char[4] : block identifier +uint32_t : block data size +bytes: block data + +defined extended-info blocks: +"vcpu" : VCPU context info containing vcpu_guest_context_t. + The precise variant of the context structure + (e.g. 32 vs 64 bit) is distinguished by + the block size. +"extv" : Presence indicates use of extended VCPU context in + tail, data size is 0. + +p2m (PV-only): + + consists of p2m_size bytes comprising an array of xen_pfn_t sized entries. + +BODY PHASE - Format A (for live migration or Remus without compression) +-- + +A series of chunks with a common header: + int : chunk type + +If the chunk type is +ve then chunk contains guest memory data, and the +type contains the number of pages in the batch: + +unsigned long[] : PFN array, length == number of pages in batch + Each entry consists of XEN_DOMCTL_PFINFO_* + in bits 31-28 and the PFN number in bits 27-0. +page data: PAGE_SIZE bytes for each page marked present in PFN + array + +If the chunk type is -ve then chunk consists of one of a number of +metadata types. See definitions of XC_SAVE_ID_* below. + +If chunk type is 0 then body phase is complete. + + +BODY PHASE - Format B (for Remus with compression) +-- + +A series of chunks with a common header: + int : chunk type + +If the chunk type is +ve then chunk contains array of PFNs corresponding +to guest memory and type contains the number of PFNs in the batch: + +unsigned long[] : PFN array, length == number of pages in batch + Each entry consists of XEN_DOMCTL_PFINFO_* + in bits 31-28 and the PFN number in bits 27-0. + +If the chunk type is -ve then chunk consists of one of a number of +metadata types. See definitions of XC_SAVE_ID_* below. 
+ +If the chunk type is -ve and equals XC_SAVE_ID_COMPRESSED_DATA, then the +chunk consists of compressed page data, in the following format: + +unsigned long: Size of the compressed chunk to follow +compressed data : variable length data of size indicated above. + This chunk consists of compressed page data. + The number of pages in one chunk depends on + the amount of space available in the sender's + output buffer. + +Format of compressed data: + compressed_data = * + delta = + marker = (RUNFLAG|SKIPFLAG) bitwise-or RUNLEN [1 byte marker] + RUNFLAG = 0 + SKIPFLAG= 1 << 7 + RUNLEN = 7-bit unsigned value indicating number of WORDS in the run + run = string of bytes of length sizeof(WORD) * RUNLEN + + If marker contains RUNFLAG, then RUNLEN *
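A decoder for the marker/run delta encoding described above can be sketched as follows. The description is cut off mid-sentence here, so this sketch completes it on the usual assumption: with `RUNFLAG` the run's bytes follow and are copied into the page, with `SKIPFLAG` no data follows and the run of words is left unchanged. The 4-byte `WORD` size is also an assumption for illustration.

```python
SKIPFLAG = 1 << 7     # marker bit: run is skipped, no data follows
WORD = 4              # assumed sizeof(WORD) on the sender
PAGE_SIZE = 4096

def apply_deltas(page, deltas):
    """Apply one compressed-page delta stream to a bytearray page.

    Each delta is a marker byte (low 7 bits = RUNLEN in words) optionally
    followed by RUNLEN*WORD bytes of page data when SKIPFLAG is clear.
    """
    off = i = 0
    while i < len(deltas):
        marker = deltas[i]
        i += 1
        runlen = (marker & 0x7f) * WORD
        if marker & SKIPFLAG:
            off += runlen                       # unchanged words: skip
        else:
            page[off:off + runlen] = deltas[i:i + runlen]
            off += runlen
            i += runlen
    return page

page = bytearray(PAGE_SIZE)
# Skip 2 words, then write 1 word of 0xff at the resulting offset.
deltas = bytes([SKIPFLAG | 2]) + bytes([1]) + b"\xff" * WORD
apply_deltas(page, deltas)
```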
[Xen-devel] [PATCH v2 27/27] tools/libxl: Drop all knowledge of toolstack callbacks
Libxl has now been fully adjusted not to need them. Signed-off-by: Andrew Cooper Acked-by: Ian Campbell CC: Ian Jackson CC: Wei Liu --- tools/libxl/libxl_dom.c|1 - tools/libxl/libxl_internal.h |2 -- tools/libxl/libxl_save_callout.c | 39 +--- tools/libxl/libxl_save_helper.c| 29 --- tools/libxl/libxl_save_msgs_gen.pl |7 ++- 5 files changed, 3 insertions(+), 75 deletions(-) diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c index 9f5ddc9..3c765f4 100644 --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -2115,7 +2115,6 @@ void libxl__domain_suspend(libxl__egc *egc, libxl__domain_suspend_state *dss) callbacks->suspend = libxl__domain_suspend_callback; callbacks->switch_qemu_logdirty = libxl__domain_suspend_common_switch_qemu_logdirty; -dss->shs.callbacks.save.toolstack_save = libxl__toolstack_save; dss->sws.fd = dss->fd; dss->sws.ao = dss->ao; diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h index 1b62f25..13e2493 100644 --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -2699,8 +2699,6 @@ _hidden void libxl__datacopier_prefixdata(libxl__egc*, libxl__datacopier_state*, typedef struct libxl__srm_save_callbacks { libxl__srm_save_autogen_callbacks a; -int (*toolstack_save)(uint32_t domid, uint8_t **buf, - uint32_t *len, void *data); } libxl__srm_save_callbacks; typedef struct libxl__srm_restore_callbacks { diff --git a/tools/libxl/libxl_save_callout.c b/tools/libxl/libxl_save_callout.c index cd18cd2..2a6662f 100644 --- a/tools/libxl/libxl_save_callout.c +++ b/tools/libxl/libxl_save_callout.c @@ -78,41 +78,12 @@ void libxl__xc_domain_restore(libxl__egc *egc, libxl__domain_create_state *dcs, void libxl__xc_domain_save(libxl__egc *egc, libxl__domain_suspend_state *dss) { STATE_AO_GC(dss->ao); -int r, rc, toolstack_data_fd = -1; -uint32_t toolstack_data_len = 0; - -/* Resources we need to free */ -uint8_t *toolstack_data_buf = 0; unsigned cbflags = libxl__srm_callout_enumcallbacks_save 
(&dss->shs.callbacks.save.a); -if (dss->shs.callbacks.save.toolstack_save) { -r = dss->shs.callbacks.save.toolstack_save -(dss->domid, &toolstack_data_buf, &toolstack_data_len, dss); -if (r) { rc = ERROR_FAIL; goto out; } - -dss->shs.toolstack_data_file = tmpfile(); -if (!dss->shs.toolstack_data_file) { -LOGE(ERROR, "cannot create toolstack data tmpfile"); -rc = ERROR_FAIL; -goto out; -} -toolstack_data_fd = fileno(dss->shs.toolstack_data_file); - -r = libxl_write_exactly(CTX, toolstack_data_fd, -toolstack_data_buf, toolstack_data_len, -"toolstack data tmpfile", 0); -if (r) { rc = ERROR_FAIL; goto out; } - -/* file position must be reset before passing to libxl-save-helper. */ -r = lseek(toolstack_data_fd, 0, SEEK_SET); -if (r) { rc = ERROR_FAIL; goto out; } -} - const unsigned long argnums[] = { dss->domid, 0, 0, dss->xcflags, dss->hvm, -toolstack_data_fd, toolstack_data_len, cbflags, }; @@ -123,18 +94,10 @@ void libxl__xc_domain_save(libxl__egc *egc, libxl__domain_suspend_state *dss) dss->shs.caller_state = dss; dss->shs.need_results = 0; -free(toolstack_data_buf); - run_helper(egc, &dss->shs, "--save-domain", dss->fd, - &toolstack_data_fd, 1, + NULL, 0, argnums, ARRAY_SIZE(argnums)); return; - - out: -free(toolstack_data_buf); -if (dss->shs.toolstack_data_file) fclose(dss->shs.toolstack_data_file); - -libxl__xc_domain_save_done(egc, dss, rc, 0, 0); } diff --git a/tools/libxl/libxl_save_helper.c b/tools/libxl/libxl_save_helper.c index f196786..1622bb7 100644 --- a/tools/libxl/libxl_save_helper.c +++ b/tools/libxl/libxl_save_helper.c @@ -213,32 +213,8 @@ int helper_getreply(void *user) /*- other callbacks -*/ -static int toolstack_save_fd; -static uint32_t toolstack_save_len; static struct save_callbacks helper_save_callbacks; -static int toolstack_save_cb(uint32_t domid, uint8_t **buf, - uint32_t *len, void *data) -{ -int r; - -assert(toolstack_save_fd > 0); - -/* This is a hack for remus */ -if (helper_save_callbacks.checkpoint) { -r = lseek(toolstack_save_fd, 
0, SEEK_SET); -if (r) fail(errno,"rewind toolstack data tmpfile"); -} - -*buf = xmalloc(toolstack_save_len); -r = read_exactly(toolstack_save_fd, *buf, toolstack_save_len); -if (r<0) fail(errno,"read toolstack data"); -if (r==0) fail(0,"read toolstack data eof"); - -*len = toolstack_save_len; -return 0; -} - static void startup(const char *op) { x
[Xen-devel] [PATCH v2 21/27] tools/libxc+libxl+xl: Save v2 streams
This is a complicated set of changes which must be done together for bisectability. * libxl-save-helper is updated to unconditionally use libxc migration v2. * libxl compatibility workarounds in libxc are disabled for save operations. * libxl__stream_write_start() is logically spliced into the event location where libxl__xc_domain_save() used to reside. * xl is updated to indicate that the stream is now v2 Signed-off-by: Andrew Cooper CC: Ian Campbell CC: Ian Jackson CC: Wei Liu --- tools/libxc/Makefile |2 -- tools/libxl/libxl.h |2 ++ tools/libxl/libxl_dom.c | 46 +- tools/libxl/libxl_save_helper.c |2 +- tools/libxl/libxl_stream_write.c | 35 +++-- tools/libxl/xl_cmdimpl.c |1 + 6 files changed, 47 insertions(+), 41 deletions(-) diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile index 2cd0b1a..1aec848 100644 --- a/tools/libxc/Makefile +++ b/tools/libxc/Makefile @@ -64,8 +64,6 @@ GUEST_SRCS-$(CONFIG_X86) += xc_sr_save_x86_hvm.c GUEST_SRCS-y += xc_sr_restore.c GUEST_SRCS-y += xc_sr_save.c GUEST_SRCS-y += xc_offline_page.c xc_compression.c -xc_sr_save_x86_hvm.o: CFLAGS += -DXG_LIBXL_HVM_COMPAT -xc_sr_save_x86_hvm.opic: CFLAGS += -DXG_LIBXL_HVM_COMPAT else GUEST_SRCS-y += xc_nomigrate.c endif diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h index e64a606..4f24e5f 100644 --- a/tools/libxl/libxl.h +++ b/tools/libxl/libxl.h @@ -812,6 +812,8 @@ * * If this is defined, then the libxl_domain_create_restore() interface takes * a "stream_version" parameter and supports a value of 2. + * + * libxl_domain_suspend() will produce a v2 stream. 
*/ #define LIBXL_HAVE_STREAM_V2 1 diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c index 8642192..de05124 100644 --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -1153,6 +1153,8 @@ int libxl__toolstack_restore(uint32_t domid, const uint8_t *ptr, /* Domain suspend (save) */ +static void stream_done(libxl__egc *egc, +libxl__stream_write_state *sws, int rc); static void domain_suspend_done(libxl__egc *egc, libxl__domain_suspend_state *dss, int rc); static void domain_suspend_callback_common_done(libxl__egc *egc, @@ -2117,50 +2119,22 @@ void libxl__domain_suspend(libxl__egc *egc, libxl__domain_suspend_state *dss) callbacks->switch_qemu_logdirty = libxl__domain_suspend_common_switch_qemu_logdirty; dss->shs.callbacks.save.toolstack_save = libxl__toolstack_save; -libxl__xc_domain_save(egc, dss); +dss->sws.fd = dss->fd; +dss->sws.ao = dss->ao; +dss->sws.completion_callback = stream_done; + +libxl__stream_write_start(egc, &dss->sws); return; out: domain_suspend_done(egc, dss, rc); } -void libxl__xc_domain_save_done(libxl__egc *egc, void *dss_void, -int rc, int retval, int errnoval) +static void stream_done(libxl__egc *egc, +libxl__stream_write_state *sws, int rc) { -libxl__domain_suspend_state *dss = dss_void; -STATE_AO_GC(dss->ao); - -/* Convenience aliases */ -const libxl_domain_type type = dss->type; - -if (rc) -goto out; - -if (retval) { -LOGEV(ERROR, errnoval, "saving domain: %s", - dss->guest_responded ? 
- "domain responded to suspend request" : - "domain did not respond to suspend request"); -if ( !dss->guest_responded ) -rc = ERROR_GUEST_TIMEDOUT; -else if (dss->rc) -rc = dss->rc; -else -rc = ERROR_FAIL; -goto out; -} - -if (type == LIBXL_DOMAIN_TYPE_HVM) { -rc = libxl__domain_suspend_device_model(gc, dss); -if (rc) goto out; +libxl__domain_suspend_state *dss = CONTAINER_OF(sws, *dss, sws); -libxl__domain_save_device_model(egc, dss, domain_suspend_done); -return; -} - -rc = 0; - -out: domain_suspend_done(egc, dss, rc); } diff --git a/tools/libxl/libxl_save_helper.c b/tools/libxl/libxl_save_helper.c index efbe2eb..f196786 100644 --- a/tools/libxl/libxl_save_helper.c +++ b/tools/libxl/libxl_save_helper.c @@ -286,7 +286,7 @@ int main(int argc, char **argv) startup("save"); setup_signals(save_signal_handler); -r = xc_domain_save(xch, io_fd, dom, max_iters, max_factor, flags, +r = xc_domain_save2(xch, io_fd, dom, max_iters, max_factor, flags, &helper_save_callbacks, hvm); complete(r); diff --git a/tools/libxl/libxl_stream_write.c b/tools/libxl/libxl_stream_write.c index bf568ad..331173f 100644 --- a/tools/libxl/libxl_stream_write.c +++ b/tools/libxl/libxl_stream_write.c @@ -276,8 +276,39 @@ static void libxc_header_done(libxl__egc *egc, libxl__xc_domain_save(egc, dss); } -sta
[Xen-devel] [PATCH v2 24/27] tools/libx{c, l}: Introduce restore_callbacks.checkpoint()
And call it when a checkpoint record is found in the libxc stream. Some parts of this patch have been based on patches from the COLO series. Signed-off-by: Wen Congyang Signed-off-by: Yang Hongyang Signed-off-by: Andrew Cooper CC: Ian Campbell CC: Ian Jackson CC: Wei Liu --- v2: Borrow sufficient fragments from several COLO patches to get BROKEN_CHANNEL and checkpoint failover to function. --- tools/libxc/include/xenguest.h |9 ++ tools/libxc/xc_sr_common.h |7 +++-- tools/libxc/xc_sr_restore.c| 53 ++-- tools/libxl/libxl_save_msgs_gen.pl |2 +- 4 files changed, 53 insertions(+), 18 deletions(-) diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h index 7581263..8799c9f 100644 --- a/tools/libxc/include/xenguest.h +++ b/tools/libxc/include/xenguest.h @@ -102,6 +102,15 @@ struct restore_callbacks { int (*toolstack_restore)(uint32_t domid, const uint8_t *buf, uint32_t size, void* data); +/* A checkpoint record has been found in the stream. + * + * returns: + * 0: (error)terminate processing + * 1: (success) continue normally + * 2: (failover) failover and resume VM + */ +int (*checkpoint)(void* data); + /* to be provided as the last argument to each callback function */ void* data; }; diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h index 08c66db..1f4d4e4 100644 --- a/tools/libxc/xc_sr_common.h +++ b/tools/libxc/xc_sr_common.h @@ -130,10 +130,13 @@ struct xc_sr_restore_ops * Process an individual record from the stream. The caller shall take * care of processing common records (e.g. END, PAGE_DATA). * - * @return 0 for success, -1 for failure, or the sentinel value - * RECORD_NOT_PROCESSED. + * @return 0 for success, -1 for failure, or the following sentinels: + * - RECORD_NOT_PROCESSED + * - BROKEN_CHANNEL: under Remus/COLO, this means master may be dead, and + *a failover is needed. 
*/ #define RECORD_NOT_PROCESSED 1 +#define BROKEN_CHANNEL 2 int (*process_record)(struct xc_sr_context *ctx, struct xc_sr_record *rec); /** diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c index 9e27dba..c9b5213 100644 --- a/tools/libxc/xc_sr_restore.c +++ b/tools/libxc/xc_sr_restore.c @@ -1,5 +1,7 @@ #include +#include + #include "xc_sr_common.h" /* @@ -472,7 +474,7 @@ static int handle_page_data(struct xc_sr_context *ctx, struct xc_sr_record *rec) static int handle_checkpoint(struct xc_sr_context *ctx) { xc_interface *xch = ctx->xch; -int rc = 0; +int rc = 0, ret; unsigned i; if ( !ctx->restore.checkpointed ) @@ -482,6 +484,21 @@ static int handle_checkpoint(struct xc_sr_context *ctx) goto err; } +ret = ctx->restore.callbacks->checkpoint(ctx->restore.callbacks->data); +switch ( ret ) +{ +case 1: /* Success */ +break; + +case 2: /* Failover */ +rc = BROKEN_CHANNEL; +goto err; + +default: /* Other fatal error */ +rc = -1; +goto err; +} + if ( ctx->restore.buffer_all_records ) { IPRINTF("All records buffered"); @@ -560,19 +577,6 @@ static int process_record(struct xc_sr_context *ctx, struct xc_sr_record *rec) free(rec->data); rec->data = NULL; -if ( rc == RECORD_NOT_PROCESSED ) -{ -if ( rec->type & REC_TYPE_OPTIONAL ) -DPRINTF("Ignoring optional record %#x (%s)", -rec->type, rec_type_to_str(rec->type)); -else -{ -ERROR("Mandatory record %#x (%s) not handled", - rec->type, rec_type_to_str(rec->type)); -rc = -1; -} -} - return rc; } @@ -678,7 +682,22 @@ static int restore(struct xc_sr_context *ctx) else { rc = process_record(ctx, &rec); -if ( rc ) +if ( rc == RECORD_NOT_PROCESSED ) +{ +if ( rec.type & REC_TYPE_OPTIONAL ) +DPRINTF("Ignoring optional record %#x (%s)", +rec.type, rec_type_to_str(rec.type)); +else +{ +ERROR("Mandatory record %#x (%s) not handled", + rec.type, rec_type_to_str(rec.type)); +rc = -1; +goto err; +} +} +else if ( rc == BROKEN_CHANNEL ) +goto remus_failover; +else if ( rc ) goto err; } @@ -735,6 +754,10 @@ int 
xc_domain_restore2(xc_interface *xch, int io_fd, uint32_t dom, ctx.restore.checkpointed = checkpointed_stream; ctx.restore.callbacks = callbacks; +/* Sanity checks for callbacks. */ +if ( checkpointed_stream ) +assert(callbacks->checkpoint); + IPRINTF("In experimental %s", __fun
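For illustration only (not part of the patch): the three-way return convention of the new checkpoint() callback, and how handle_checkpoint() maps it onto the sentinels seen by the restore loop, can be modelled in a few lines of Python. The names mirror the C code, but this is a sketch, not the implementation.

```python
# Return-code convention of restore_callbacks.checkpoint():
#   0 -> (error)    terminate processing
#   1 -> (success)  continue normally
#   2 -> (failover) fail over and resume the VM
BROKEN_CHANNEL = 2  # sentinel propagated to the restore loop, as in xc_sr_common.h

def handle_checkpoint(checkpoint_cb):
    """Model of the dispatch added to xc_sr_restore.c's handle_checkpoint()."""
    ret = checkpoint_cb()
    if ret == 1:                 # success: keep processing records
        return 0
    elif ret == 2:               # failover: restore loop jumps to remus_failover
        return BROKEN_CHANNEL
    else:                        # any other value is a fatal error
        return -1

assert handle_checkpoint(lambda: 1) == 0
assert handle_checkpoint(lambda: 2) == BROKEN_CHANNEL
assert handle_checkpoint(lambda: 0) == -1
```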
[Xen-devel] [PATCH v2 26/27] tools/libxc: Drop all XG_LIBXL_HVM_COMPAT code from libxc
Libxl has now been fully adjusted not to need it. Signed-off-by: Andrew Cooper Acked-by: Ian Campbell CC: Ian Jackson CC: Wei Liu --- tools/libxc/xc_sr_common.h |5 -- tools/libxc/xc_sr_restore.c | 18 - tools/libxc/xc_sr_restore_x86_hvm.c | 124 --- tools/libxc/xc_sr_save_x86_hvm.c| 36 -- 4 files changed, 183 deletions(-) diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h index 1f4d4e4..64f6082 100644 --- a/tools/libxc/xc_sr_common.h +++ b/tools/libxc/xc_sr_common.h @@ -309,11 +309,6 @@ struct xc_sr_context /* HVM context blob. */ void *context; size_t contextsz; - -/* #ifdef XG_LIBXL_HVM_COMPAT */ -uint32_t qlen; -void *qbuf; -/* #endif */ } restore; }; } x86_hvm; diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c index c9b5213..2f6a763 100644 --- a/tools/libxc/xc_sr_restore.c +++ b/tools/libxc/xc_sr_restore.c @@ -627,9 +627,6 @@ static void cleanup(struct xc_sr_context *ctx) PERROR("Failed to clean up"); } -#ifdef XG_LIBXL_HVM_COMPAT -extern int read_qemu(struct xc_sr_context *ctx); -#endif /* * Restore a domain. 
*/ @@ -656,21 +653,6 @@ static int restore(struct xc_sr_context *ctx) goto err; } -#ifdef XG_LIBXL_HVM_COMPAT -if ( ctx->dominfo.hvm && - (rec.type == REC_TYPE_END || rec.type == REC_TYPE_CHECKPOINT) ) -{ -rc = read_qemu(ctx); -if ( rc ) -{ -if ( ctx->restore.buffer_all_records ) -goto remus_failover; -else -goto err; -} -} -#endif - if ( ctx->restore.buffer_all_records && rec.type != REC_TYPE_END && rec.type != REC_TYPE_CHECKPOINT ) diff --git a/tools/libxc/xc_sr_restore_x86_hvm.c b/tools/libxc/xc_sr_restore_x86_hvm.c index 6f5af0e..49d22c7 100644 --- a/tools/libxc/xc_sr_restore_x86_hvm.c +++ b/tools/libxc/xc_sr_restore_x86_hvm.c @@ -3,24 +3,6 @@ #include "xc_sr_common_x86.h" -#ifdef XG_LIBXL_HVM_COMPAT -static int handle_toolstack(struct xc_sr_context *ctx, struct xc_sr_record *rec) -{ -xc_interface *xch = ctx->xch; -int rc; - -if ( !ctx->restore.callbacks || !ctx->restore.callbacks->toolstack_restore ) -return 0; - -rc = ctx->restore.callbacks->toolstack_restore( -ctx->domid, rec->data, rec->length, ctx->restore.callbacks->data); - -if ( rc < 0 ) -PERROR("restoring toolstack"); -return rc; -} -#endif - /* * Process an HVM_CONTEXT record from the stream. 
*/ @@ -93,98 +75,6 @@ static int handle_hvm_params(struct xc_sr_context *ctx, return 0; } -#ifdef XG_LIBXL_HVM_COMPAT -int read_qemu(struct xc_sr_context *ctx); -int read_qemu(struct xc_sr_context *ctx) -{ -xc_interface *xch = ctx->xch; -char qemusig[21]; -uint32_t qlen; -void *qbuf = NULL; -int rc = -1; - -if ( read_exact(ctx->fd, qemusig, sizeof(qemusig)) ) -{ -PERROR("Error reading QEMU signature"); -goto out; -} - -if ( !memcmp(qemusig, "DeviceModelRecord0002", sizeof(qemusig)) ) -{ -if ( read_exact(ctx->fd, &qlen, sizeof(qlen)) ) -{ -PERROR("Error reading QEMU record length"); -goto out; -} - -qbuf = malloc(qlen); -if ( !qbuf ) -{ -PERROR("no memory for device model state"); -goto out; -} - -if ( read_exact(ctx->fd, qbuf, qlen) ) -{ -PERROR("Error reading device model state"); -goto out; -} -} -else -{ -ERROR("Invalid device model state signature '%*.*s'", - (int)sizeof(qemusig), (int)sizeof(qemusig), qemusig); -goto out; -} - -/* With Remus, this could be read many times */ -if ( ctx->x86_hvm.restore.qbuf ) -free(ctx->x86_hvm.restore.qbuf); -ctx->x86_hvm.restore.qbuf = qbuf; -ctx->x86_hvm.restore.qlen = qlen; -rc = 0; - -out: -if (rc) -free(qbuf); -return rc; -} - -static int handle_qemu(struct xc_sr_context *ctx) -{ -xc_interface *xch = ctx->xch; -char path[256]; -uint32_t qlen = ctx->x86_hvm.restore.qlen; -void *qbuf = ctx->x86_hvm.restore.qbuf; -int rc = -1; -FILE *fp = NULL; - -sprintf(path, XC_DEVICE_MODEL_RESTORE_FILE".%u", ctx->domid); -fp = fopen(path, "wb"); -if ( !fp ) -{ -PERROR("Failed to open '%s' for writing", path); -goto out; -} - -DPRINTF("Writing %u bytes of QEMU data", qlen); -if ( fwrite(qbuf, 1, qlen, fp) != qlen ) -{ -PERROR("Failed to write %u bytes of QEMU data", qlen); -goto out; -} - -rc = 0; - - out: -if ( fp ) -fclose(fp); -free(qbuf); - -return
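For reference, the legacy on-the-wire layout that the deleted read_qemu() handled — a 21-byte "DeviceModelRecord0002" signature, a native-endian uint32 length, then the device-model blob — can be sketched in Python. This is a hedged reconstruction of the removed compat path, not replacement code; read_qemu_blob() is a name invented for this example.

```python
import io
import struct

def read_qemu_blob(stream):
    """Parse the legacy device-model layout removed above:
       21-byte signature, native-endian uint32 length, then the blob."""
    sig = stream.read(21)
    if sig != b"DeviceModelRecord0002":
        raise ValueError("Invalid device model state signature %r" % sig)
    (qlen,) = struct.unpack("I", stream.read(4))
    blob = stream.read(qlen)
    if len(blob) != qlen:
        raise IOError("Stream truncated reading device model state")
    return blob

payload = b"\x01\x02\x03"
raw = b"DeviceModelRecord0002" + struct.pack("I", len(payload)) + payload
assert read_qemu_blob(io.BytesIO(raw)) == payload
```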
[Xen-devel] [PATCH v2 23/27] tools/libxl: Write checkpoint records into the stream
when signalled to do so by libxl__remus_domain_checkpoint_callback() Signed-off-by: Andrew Cooper CC: Ian Campbell CC: Ian Jackson CC: Wei Liu --- This patch has changed substantially in v2 as a result of changes earlier in the series. No behavioural difference from v1. --- tools/libxl/libxl_dom.c | 18 - tools/libxl/libxl_internal.h |7 tools/libxl/libxl_stream_write.c | 80 -- 3 files changed, 91 insertions(+), 14 deletions(-) diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c index de05124..9f5ddc9 100644 --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -1937,8 +1937,8 @@ static void remus_devices_preresume_cb(libxl__egc *egc, /*- remus asynchronous checkpoint callback -*/ -static void remus_checkpoint_dm_saved(libxl__egc *egc, - libxl__domain_suspend_state *dss, int rc); +static void remus_checkpoint_stream_written( +libxl__egc *egc, libxl__stream_write_state *sws, int rc); static void remus_devices_commit_cb(libxl__egc *egc, libxl__remus_devices_state *rds, int rc); @@ -1953,17 +1953,14 @@ static void libxl__remus_domain_checkpoint_callback(void *data) libxl__egc *egc = dss->shs.egc; STATE_AO_GC(dss->ao); -/* This would go into tailbuf. 
*/ -if (dss->hvm) { -libxl__domain_save_device_model(egc, dss, remus_checkpoint_dm_saved); -} else { -remus_checkpoint_dm_saved(egc, dss, 0); -} +libxl__stream_write_start_checkpoint(egc, &dss->sws); } -static void remus_checkpoint_dm_saved(libxl__egc *egc, - libxl__domain_suspend_state *dss, int rc) +static void remus_checkpoint_stream_written( +libxl__egc *egc, libxl__stream_write_state *sws, int rc) { +libxl__domain_suspend_state *dss = CONTAINER_OF(sws, *dss, sws); + /* Convenience aliases */ libxl__remus_devices_state *const rds = &dss->rds; @@ -2113,6 +2110,7 @@ void libxl__domain_suspend(libxl__egc *egc, libxl__domain_suspend_state *dss) callbacks->suspend = libxl__remus_domain_suspend_callback; callbacks->postcopy = libxl__remus_domain_resume_callback; callbacks->checkpoint = libxl__remus_domain_checkpoint_callback; +dss->sws.checkpoint_callback = remus_checkpoint_stream_written; } else callbacks->suspend = libxl__domain_suspend_callback; diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h index 2beb534..84e22c2 100644 --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -2987,9 +2987,13 @@ struct libxl__stream_write_state { void (*completion_callback)(libxl__egc *egc, libxl__stream_write_state *sws, int rc); +void (*checkpoint_callback)(libxl__egc *egc, +libxl__stream_write_state *sws, +int rc); /* Private */ int rc; bool running; +bool in_checkpoint; /* Active-stuff handling */ int joined_rc; @@ -3009,6 +3013,9 @@ struct libxl__stream_write_state { _hidden void libxl__stream_write_start(libxl__egc *egc, libxl__stream_write_state *stream); +_hidden void libxl__stream_write_start_checkpoint( +libxl__egc *egc, libxl__stream_write_state *stream); + _hidden void libxl__stream_write_abort(libxl__egc *egc, libxl__stream_write_state *stream, int rc); diff --git a/tools/libxl/libxl_stream_write.c b/tools/libxl/libxl_stream_write.c index 331173f..0f5216b 100644 --- a/tools/libxl/libxl_stream_write.c +++ 
b/tools/libxl/libxl_stream_write.c @@ -23,6 +23,9 @@ * - libxl__stream_write_start() * - Start writing a stream from the start. * + * - libxl__stream_write_start_checkpoint() + * - Write the records which form a checkpoint into a stream. + * * In normal operation, there are two tasks running at once; this stream * processing, and the libxl-save-helper. check_stream_finished() is used to * join all the tasks in both success and error cases. @@ -39,6 +42,12 @@ * - Toolstack record * - if (hvm), Qemu record * - End record + * + * For a checkpointed stream, there is a second loop which is triggered by a + * save-helper checkpoint callback. It writes: + * - Toolstack record + * - if (hvm), Qemu record + * - Checkpoint end record */ static void stream_success(libxl__egc *egc, @@ -73,6 +82,15 @@ static void emulator_record_done(libxl__egc *egc, static void write_end_record(libxl__egc *egc, libxl__stream_write_state *stream); +/* Event callbacks unique to checkpointed streams. */ +static void checkpoint_done(libxl__egc *egc, +libxl__stream_write_state *stream, +
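The per-checkpoint record ordering described in the comment above (toolstack record, qemu record for HVM guests, checkpoint end record) can be stated as a tiny model. The record-name strings below are illustrative labels only, not the wire-format type values.

```python
def checkpoint_records(hvm):
    """Record order for one checkpoint iteration, per the comment in
       libxl_stream_write.c; strings are labels, not wire type values."""
    recs = ["TOOLSTACK"]
    if hvm:
        recs.append("EMULATOR_CONTEXT")   # the "Qemu record"
    recs.append("CHECKPOINT_END")
    return recs

assert checkpoint_records(hvm=True) == ["TOOLSTACK", "EMULATOR_CONTEXT", "CHECKPOINT_END"]
assert checkpoint_records(hvm=False) == ["TOOLSTACK", "CHECKPOINT_END"]
```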
[Xen-devel] [PATCH v2 14/27] tools/python: Conversion utility for legacy migration streams
This utility will take a legacy stream as input and produce a v2 stream as output. It is exec()'d by libxl to provide backwards compatibility. Signed-off-by: Andrew Cooper Acked-by: Ian Campbell CC: Ian Jackson CC: Wei Liu --- tools/python/Makefile |4 + tools/python/scripts/convert-legacy-stream | 678 2 files changed, 682 insertions(+) create mode 100755 tools/python/scripts/convert-legacy-stream diff --git a/tools/python/Makefile b/tools/python/Makefile index e933be8..df942a7 100644 --- a/tools/python/Makefile +++ b/tools/python/Makefile @@ -17,9 +17,13 @@ build: genwrap.py $(XEN_ROOT)/tools/libxl/libxl_types.idl \ .PHONY: install install: + $(INSTALL_DIR) $(DESTDIR)$(PRIVATE_BINDIR) + CC="$(CC)" CFLAGS="$(PY_CFLAGS)" $(PYTHON) setup.py install \ $(PYTHON_PREFIX_ARG) --root="$(DESTDIR)" --force + $(INSTALL_PROG) scripts/convert-legacy-stream $(DESTDIR)$(PRIVATE_BINDIR) + .PHONY: test test: export LD_LIBRARY_PATH=$$(readlink -f ../libxc):$$(readlink -f ../xenstore); $(PYTHON) test.py -b -u diff --git a/tools/python/scripts/convert-legacy-stream b/tools/python/scripts/convert-legacy-stream new file mode 100755 index 000..d54fa22 --- /dev/null +++ b/tools/python/scripts/convert-legacy-stream @@ -0,0 +1,678 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +""" +Convert a legacy migration stream to a v2 stream. +""" + +import sys +import os, os.path +import syslog +import traceback + +from struct import calcsize, unpack, pack + +from xen.migration import legacy, public, libxc, libxl, xl + +__version__ = 1 + +fin = None # Input file/fd +fout = None# Output file/fd +twidth = 0 # Legacy toolstack bitness (32 or 64) +pv = None # Boolean (pv or hvm) +qemu = True# Boolean - process qemu record? +log_to_syslog = False # Boolean - Log to syslog instead of stdout/err?
+verbose = False# Boolean - Summarise stream contents + +def stream_read(_ = None): +"""Read from the input""" +return fin.read(_) + +def stream_write(_): +"""Write to the output""" +return fout.write(_) + +def info(msg): +"""Info message, routed to appropriate destination""" +if verbose: +if log_to_syslog: +for line in msg.split("\n"): +syslog.syslog(syslog.LOG_INFO, line) +else: +print msg + +def err(msg): +"""Error message, routed to appropriate destination""" +if log_to_syslog: +for line in msg.split("\n"): +syslog.syslog(syslog.LOG_ERR, line) +print >> sys.stderr, msg + +class StreamError(StandardError): +"""Error with the incoming migration stream""" +pass + +class VM(object): +"""Container of VM parameters""" + +def __init__(self, fmt): +# Common +self.p2m_size = 0 + +# PV +self.max_vcpu_id = 0 +self.online_vcpu_map = [] +self.width = 0 +self.levels = 0 +self.basic_len = 0 +self.extd = False +self.xsave_len = 0 + +# libxl +self.libxl = fmt == "libxl" +self.xenstore = [] # Deferred "toolstack" records + +def write_libxc_ihdr(): +stream_write(pack(libxc.IHDR_FORMAT, + libxc.IHDR_MARKER, # Marker + libxc.IHDR_IDENT, # Ident + libxc.IHDR_VERSION, # Version + libxc.IHDR_OPT_LE, # Options + 0, 0)) # Reserved + +def write_libxc_dhdr(): +if pv: +dtype = libxc.DHDR_TYPE_x86_pv +else: +dtype = libxc.DHDR_TYPE_x86_hvm + +stream_write(pack(libxc.DHDR_FORMAT, + dtype,# Type + 12, # Page size + 0,# Reserved + 0,# Xen major (converted) + __version__)) # Xen minor (converted) + +def write_libxl_hdr(): +stream_write(pack(libxl.HDR_FORMAT, + libxl.HDR_IDENT, # Ident + libxl.HDR_VERSION, # Version 2 + libxl.HDR_OPT_LE | # Options + libxl.HDR_OPT_LEGACY # Little Endian and Legacy + )) + +def write_record(rt, *argl): +alldata = ''.join(argl) +length = len(alldata) + +record = pack(libxc.RH_FORMAT, rt, length) + alldata +plen = (8 - (length & 7)) & 7 +record += '\x00' * plen + +stream_write(record) + +def write_libxc_pv_info(vm): +write_record(libxc.REC_TYPE_x86_pv_info, + 
pack(libxc.X86_PV_INFO_FORMAT, + vm.width, vm.levels, 0, 0)) + +def write_libxc_pv_p2m_frames(vm, pfns): +write_record(libxc.REC_TYPE_x86_pv_p2m_frames, + pack(libxc.X86_PV_P2M_FRAMES_FORMAT, + 0, vm.p2m_size - 1), + pack("Q" * len(pfns), *pf
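The padding rule used by write_record() above — records are zero-padded so each body ends on an 8-octet boundary, via `(8 - (length & 7)) & 7` — is easy to check in isolation. The sketch below restates it in self-contained Python 3 form; write_record_bytes() is a name invented for this example.

```python
from struct import pack

RH_FORMAT = "II"   # record header: type, body length (length excludes padding)

def write_record_bytes(rec_type, body):
    """Serialise one record: 8-byte header, body, zero padding to 8 octets."""
    length = len(body)
    plen = (8 - (length & 7)) & 7          # padding rule from write_record()
    return pack(RH_FORMAT, rec_type, length) + body + b"\x00" * plen

rec = write_record_bytes(0x000d, b"hello")  # REC_TYPE_verify, 5-byte body
assert len(rec) == 8 + 5 + 3                # header + body + padding
assert len(rec) % 8 == 0
```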
[Xen-devel] [PATCH v2 20/27] tools/libxl: Infrastructure for writing a v2 stream
From: Ross Lagerwall This contains the event machinery and state machines to write a non-checkpointed migration v2 stream (with the exception of the xc_domain_save() handling which is spliced later in a bisectable way). Signed-off-by: Ross Lagerwall Signed-off-by: Andrew Cooper CC: Ian Campbell CC: Ian Jackson CC: Wei Liu --- As with the read side of things, this has undergone substantial changes in v2. --- tools/libxl/Makefile |2 +- tools/libxl/libxl_internal.h | 47 tools/libxl/libxl_stream_write.c | 451 ++ 3 files changed, 499 insertions(+), 1 deletion(-) create mode 100644 tools/libxl/libxl_stream_write.c diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile index 0ebc35a..7d44483 100644 --- a/tools/libxl/Makefile +++ b/tools/libxl/Makefile @@ -95,7 +95,7 @@ LIBXL_OBJS = flexarray.o libxl.o libxl_create.o libxl_dm.o libxl_pci.o \ libxl_dom.o libxl_exec.o libxl_xshelp.o libxl_device.o \ libxl_internal.o libxl_utils.o libxl_uuid.o \ libxl_json.o libxl_aoutils.o libxl_numa.o libxl_vnuma.o \ - libxl_stream_read.o \ + libxl_stream_read.o libxl_stream_write.o \ libxl_save_callout.o _libxl_save_msgs_callout.o \ libxl_qmp.o libxl_event.o libxl_fork.o $(LIBXL_OBJS-y) LIBXL_OBJS += libxl_genid.o diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h index 1cf1884..2beb534 100644 --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -2973,6 +2973,52 @@ typedef void libxl__domain_suspend_cb(libxl__egc*, typedef void libxl__save_device_model_cb(libxl__egc*, libxl__domain_suspend_state*, int rc); +/* State for writing a libxl migration v2 stream */ +typedef struct libxl__stream_write_state libxl__stream_write_state; + +typedef void (*sws_record_done_cb)(libxl__egc *egc, + libxl__stream_write_state *sws); + +struct libxl__stream_write_state { +/* filled by the user */ +libxl__ao *ao; +int fd; +uint32_t domid; +void (*completion_callback)(libxl__egc *egc, +libxl__stream_write_state *sws, +int rc); +/* Private */ +int rc; +bool running; +
+/* Active-stuff handling */ +int joined_rc; + +/* Main stream-writing data */ +size_t padding; +libxl__datacopier_state dc; +sws_record_done_cb record_done_callback; + +/* Emulator blob handling */ +libxl__datacopier_state emu_dc; +libxl__carefd *emu_carefd; +libxl__sr_rec_hdr emu_rec_hdr; +void *emu_body; +}; + +_hidden void libxl__stream_write_start(libxl__egc *egc, + libxl__stream_write_state *stream); + +_hidden void libxl__stream_write_abort(libxl__egc *egc, + libxl__stream_write_state *stream, + int rc); + +static inline bool libxl__stream_write_inuse( +const libxl__stream_write_state *stream) +{ +return stream->running; +} + typedef struct libxl__logdirty_switch { const char *cmd; const char *cmd_path; @@ -3013,6 +3059,7 @@ struct libxl__domain_suspend_state { /* private for libxl__domain_save_device_model */ libxl__save_device_model_cb *save_dm_callback; libxl__datacopier_state save_dm_datacopier; +libxl__stream_write_state sws; }; diff --git a/tools/libxl/libxl_stream_write.c b/tools/libxl/libxl_stream_write.c new file mode 100644 index 000..bf568ad --- /dev/null +++ b/tools/libxl/libxl_stream_write.c @@ -0,0 +1,451 @@ +/* + * Copyright (C) 2015 Citrix Ltd. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as published + * by the Free Software Foundation; version 2.1 only. with the special + * exception on linking described in file LICENSE. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + */ + +#include "libxl_osdeps.h" /* must come before any other headers */ + +#include "libxl_internal.h" + +/* + * Infrastructure for writing a domain to a libxl migration v2 stream. 
+ * + * Entry points from outside: + * - libxl__stream_write_start() + * - Start writing a stream from the start. + * + * In normal operation, there are two tasks running at once; this stream + * processing, and the libxl-save-helper. check_stream_finished() is used to + * join all the tasks in both success and error cases. + * + * Nomenclature for event callbacks: + * - $FOO_done(): Completion callback for $FOO + * - write_$FOO(): Set up the datacop
[Xen-devel] [PATCH v2 10/27] tools/python: Libxc migration v2 infrastructure
Contains: * Python implementation of the libxc migration v2 records * Verification code for spec compliance * Unit tests Signed-off-by: Andrew Cooper Acked-by: Ian Campbell CC: Ian Jackson CC: Wei Liu --- tools/python/setup.py|1 + tools/python/xen/migration/libxc.py | 446 ++ tools/python/xen/migration/tests.py | 41 tools/python/xen/migration/verify.py | 37 +++ 4 files changed, 525 insertions(+) create mode 100644 tools/python/xen/migration/__init__.py create mode 100644 tools/python/xen/migration/libxc.py create mode 100644 tools/python/xen/migration/tests.py create mode 100644 tools/python/xen/migration/verify.py diff --git a/tools/python/setup.py b/tools/python/setup.py index 439c429..5bf81be 100644 --- a/tools/python/setup.py +++ b/tools/python/setup.py @@ -43,6 +43,7 @@ setup(name= 'xen', version = '3.0', description = 'Xen', packages= ['xen', + 'xen.migration', 'xen.lowlevel', ], ext_package = "xen.lowlevel", diff --git a/tools/python/xen/migration/__init__.py b/tools/python/xen/migration/__init__.py new file mode 100644 index 000..e69de29 diff --git a/tools/python/xen/migration/libxc.py b/tools/python/xen/migration/libxc.py new file mode 100644 index 000..b0255ac --- /dev/null +++ b/tools/python/xen/migration/libxc.py @@ -0,0 +1,446 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +""" +Libxc Migration v2 streams + +Record structures as per docs/specs/libxc-migration-stream.pandoc, and +verification routines. 
+""" + +import sys + +from struct import calcsize, unpack + +from xen.migration.verify import StreamError, RecordError, VerifyBase + +# Image Header +IHDR_FORMAT = "!QIIHHI" + +IHDR_MARKER = 0x +IHDR_IDENT = 0x58454E46 # "XENF" in ASCII +IHDR_VERSION = 2 + +IHDR_OPT_BIT_ENDIAN = 0 +IHDR_OPT_LE = (0 << IHDR_OPT_BIT_ENDIAN) +IHDR_OPT_BE = (1 << IHDR_OPT_BIT_ENDIAN) + +IHDR_OPT_RESZ_MASK = 0xfffe + +# Domain Header +DHDR_FORMAT = "IHHII" + +DHDR_TYPE_x86_pv = 0x0001 +DHDR_TYPE_x86_hvm = 0x0002 +DHDR_TYPE_x86_pvh = 0x0003 +DHDR_TYPE_arm = 0x0004 + +dhdr_type_to_str = { +DHDR_TYPE_x86_pv : "x86 PV", +DHDR_TYPE_x86_hvm : "x86 HVM", +DHDR_TYPE_x86_pvh : "x86 PVH", +DHDR_TYPE_arm : "ARM", +} + +# Records +RH_FORMAT = "II" + +REC_TYPE_end = 0x +REC_TYPE_page_data= 0x0001 +REC_TYPE_x86_pv_info = 0x0002 +REC_TYPE_x86_pv_p2m_frames= 0x0003 +REC_TYPE_x86_pv_vcpu_basic= 0x0004 +REC_TYPE_x86_pv_vcpu_extended = 0x0005 +REC_TYPE_x86_pv_vcpu_xsave= 0x0006 +REC_TYPE_shared_info = 0x0007 +REC_TYPE_tsc_info = 0x0008 +REC_TYPE_hvm_context = 0x0009 +REC_TYPE_hvm_params = 0x000a +REC_TYPE_toolstack= 0x000b +REC_TYPE_x86_pv_vcpu_msrs = 0x000c +REC_TYPE_verify = 0x000d +REC_TYPE_checkpoint = 0x000e + +rec_type_to_str = { +REC_TYPE_end : "End", +REC_TYPE_page_data: "Page data", +REC_TYPE_x86_pv_info : "x86 PV info", +REC_TYPE_x86_pv_p2m_frames: "x86 PV P2M frames", +REC_TYPE_x86_pv_vcpu_basic: "x86 PV vcpu basic", +REC_TYPE_x86_pv_vcpu_extended : "x86 PV vcpu extended", +REC_TYPE_x86_pv_vcpu_xsave: "x86 PV vcpu xsave", +REC_TYPE_shared_info : "Shared info", +REC_TYPE_tsc_info : "TSC info", +REC_TYPE_hvm_context : "HVM context", +REC_TYPE_hvm_params : "HVM params", +REC_TYPE_toolstack: "Toolstack", +REC_TYPE_x86_pv_vcpu_msrs : "x86 PV vcpu msrs", +REC_TYPE_verify : "Verify", +REC_TYPE_checkpoint : "Checkpoint", +} + +# page_data +PAGE_DATA_FORMAT = "II" +PAGE_DATA_PFN_MASK = (1L << 52) - 1 +PAGE_DATA_PFN_RESZ_MASK = ((1L << 60) - 1) & ~((1L << 52) - 1) + +# flags from xen/public/domctl.h: 
XEN_DOMCTL_PFINFO_* shifted by 32 bits +PAGE_DATA_TYPE_SHIFT = 60 +PAGE_DATA_TYPE_LTABTYPE_MASK = (0x7L << PAGE_DATA_TYPE_SHIFT) +PAGE_DATA_TYPE_LTAB_MASK = (0xfL << PAGE_DATA_TYPE_SHIFT) +PAGE_DATA_TYPE_LPINTAB = (0x8L << PAGE_DATA_TYPE_SHIFT) # Pinned pagetable + +PAGE_DATA_TYPE_NOTAB = (0x0L << PAGE_DATA_TYPE_SHIFT) # Regular page +PAGE_DATA_TYPE_L1TAB = (0x1L << PAGE_DATA_TYPE_SHIFT) # L1 pagetable +PAGE_DATA_TYPE_L2TAB = (0x2L << PAGE_DATA_TYPE_SHIFT) # L2 pagetable +PAGE_DATA_TYPE_L3TAB = (0x3L << PAGE_DATA_TYPE_SHIFT) # L3 pagetable +PAGE_DATA_TYPE_L4TAB = (0x4L << PAGE_DATA_TYPE_SHIFT) # L4 pagetable +PAGE_DATA_TYPE_BROKEN= (0xdL << PAGE_DATA_TYPE_SHIFT) # Broken +PAGE_DATA_TYPE_XALLOC= (0xeL << PAGE_DATA_TYPE_SHIFT) # Allocate-only +PAGE_DATA_TYPE_XTAB = (0x
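As a quick sanity check of the constants above: the image header format "!QIIHHI" is a fixed 24 bytes, and the ident field decodes to the ASCII string "XENF". A standalone Python 3 snippet (the zero marker value is a placeholder for this sketch only, since the real constant is truncated above):

```python
from struct import calcsize, pack, unpack

IHDR_FORMAT = "!QIIHHI"      # big-endian: marker, ident, version, options, res1, res2
IHDR_IDENT = 0x58454E46      # "XENF" in ASCII
IHDR_VERSION = 2
IHDR_OPT_LE = 0

marker = 0                   # placeholder value for illustration only

hdr = pack(IHDR_FORMAT, marker, IHDR_IDENT, IHDR_VERSION, IHDR_OPT_LE, 0, 0)
assert calcsize(IHDR_FORMAT) == 24          # fixed 24-byte image header

_, ident, version, _, _, _ = unpack(IHDR_FORMAT, hdr)
assert pack("!I", ident) == b"XENF"
assert version == IHDR_VERSION
```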
[Xen-devel] [PATCH v2 18/27] tools/libxl: Convert a legacy stream if needed
For backwards compatibility, a legacy stream needs converting before it can be read by the v2 stream logic. This causes the v2 stream logic to need to juggle two parallel tasks. check_stream_finished() is introduced for the purpose of joining the tasks in both success and error cases. Signed-off-by: Andrew Cooper CC: Ian Campbell CC: Ian Jackson CC: Wei Liu --- tools/libxl/libxl_internal.h|7 +++ tools/libxl/libxl_stream_read.c | 98 ++- 2 files changed, 104 insertions(+), 1 deletion(-) diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h index 68e7f02..1cf1884 100644 --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -3274,6 +3274,7 @@ struct libxl__stream_read_state { /* filled by the user */ libxl__ao *ao; int fd; +bool legacy; void (*completion_callback)(libxl__egc *egc, libxl__stream_read_state *srs, int rc); @@ -3281,6 +3282,12 @@ struct libxl__stream_read_state { int rc; bool running; +/* Active-stuff handling */ +int joined_rc; + +/* Conversion helper */ +libxl__carefd *v2_carefd; + /* Main stream-reading data */ libxl__datacopier_state dc; libxl__sr_hdr hdr; diff --git a/tools/libxl/libxl_stream_read.c b/tools/libxl/libxl_stream_read.c index 6f5d572..3011820 100644 --- a/tools/libxl/libxl_stream_read.c +++ b/tools/libxl/libxl_stream_read.c @@ -57,6 +57,10 @@ *- stream_write_emulator() *- stream_write_emulator_done() *- stream_continue() + * + * Depending on the contents of the stream, there are likely to be several + * parallel tasks being managed. check_stream_finished() is used to join all + * tasks in both success and error cases. */ /* Stream error/success handling. */ @@ -67,6 +71,16 @@ static void stream_failed(libxl__egc *egc, static void stream_done(libxl__egc *egc, libxl__stream_read_state *stream); +/* Stream other-active-stuff handling. */ +/* libxl__xc_domain_restore_done() is logically grouped here. 
*/ +#if defined(__x86_64__) || defined(__i386__) +static void conversion_done(libxl__egc *egc, +libxl__conversion_helper_state *chs, int rc); +#endif +static void check_stream_finished(libxl__egc *egc, + libxl__stream_read_state *stream, + int rc, const char *what); + /* Event callbacks for main reading loop. */ static void stream_header_done(libxl__egc *egc, libxl__datacopier_state *dc, @@ -112,12 +126,40 @@ static int setup_read(libxl__stream_read_state *stream, void libxl__stream_read_start(libxl__egc *egc, libxl__stream_read_state *stream) { +libxl__domain_create_state *dcs = CONTAINER_OF(stream, *dcs, srs); libxl__datacopier_state *dc = &stream->dc; +STATE_AO_GC(stream->ao); int ret = 0; /* State initialisation. */ assert(!stream->running); +/* + * Initialise other moving parts so check_stream_finished() can correctly + * work out whether to tear them down. + */ +libxl__conversion_helper_init(&dcs->chs); + +#if defined(__x86_64__) || defined(__i386__) +if (stream->legacy) { +/* Convert a legacy stream, if needed. */ +dcs->chs.ao = stream->ao; +dcs->chs.legacy_fd = stream->fd; +dcs->chs.hvm = +(dcs->guest_config->b_info.type == LIBXL_DOMAIN_TYPE_HVM); +dcs->chs.v2_carefd = NULL; +dcs->chs.completion_callback = conversion_done; + +libxl__convert_legacy_stream(egc, &dcs->chs); + +assert(dcs->chs.v2_carefd); +stream->v2_carefd = dcs->chs.v2_carefd; +dcs->libxc_fd = stream->fd = libxl__carefd_fd(dcs->chs.v2_carefd); +} +#endif + +/* stream->fd is now guaranteed to be a v2 stream.
*/ + memset(dc, 0, sizeof(*dc)); dc->ao = stream->ao; dc->readfd = stream->fd; @@ -183,7 +225,50 @@ static void stream_done(libxl__egc *egc, free(rec); } -stream->completion_callback(egc, stream, stream->rc); +if (stream->v2_carefd) +libxl__carefd_close(stream->v2_carefd); + +check_stream_finished(egc, stream, stream->rc, "stream"); +} + +static void check_stream_finished(libxl__egc *egc, + libxl__stream_read_state *stream, + int rc, const char *what) +{ +libxl__domain_create_state *dcs = CONTAINER_OF(stream, *dcs, srs); +STATE_AO_GC(stream->ao); + +LOG(DEBUG, "Task '%s' joining (rc %d)", what, rc); + +if (rc && !stream->joined_rc) { +bool skip = false; +/* First reported failure from joining tasks. Tear everything down */ +stream->joined_rc = rc; + +if (libxl__stream_read_inuse(&dcs->srs)) { +skip = true; +libxl__stream_read_abort(egc, &dcs
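The join logic of check_stream_finished() — latch the first failure, wait for every parallel task to finish, then fire the completion callback exactly once — can be modelled compactly. StreamJoin is an invented name, and the real code also aborts still-running tasks on first failure, which this sketch elides.

```python
class StreamJoin(object):
    """Toy model of check_stream_finished(): parallel tasks each report
       completion; the first failure is latched in joined_rc, and the
       completion callback fires once, after the last task finishes."""

    def __init__(self, tasks, completion_callback):
        self.running = set(tasks)
        self.joined_rc = 0
        self.completion_callback = completion_callback

    def task_done(self, task, rc):
        self.running.discard(task)
        if rc and not self.joined_rc:
            self.joined_rc = rc            # first reported failure wins
        if not self.running:
            self.completion_callback(self.joined_rc)

results = []
join = StreamJoin({"stream", "conversion-helper"}, results.append)
join.task_done("conversion-helper", 0)
assert results == []                        # stream task still outstanding
join.task_done("stream", -3)
assert results == [-3]                      # joined once, with the failure rc
```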
[Xen-devel] [PATCH v2 15/27] tools/libxl: Migration v2 stream format
From: Ross Lagerwall C structures describing the Libxl migration v2 stream format Signed-off-by: Ross Lagerwall Signed-off-by: Andrew Cooper CC: Ian Campbell CC: Ian Jackson CC: Wei Liu --- v2: Move into libxl__ namespace --- tools/libxl/libxl_sr_stream_format.h | 57 ++ 1 file changed, 57 insertions(+) create mode 100644 tools/libxl/libxl_sr_stream_format.h diff --git a/tools/libxl/libxl_sr_stream_format.h b/tools/libxl/libxl_sr_stream_format.h new file mode 100644 index 000..f4f790b --- /dev/null +++ b/tools/libxl/libxl_sr_stream_format.h @@ -0,0 +1,57 @@ +#ifndef LIBXL__SR_STREAM_FORMAT_H +#define LIBXL__SR_STREAM_FORMAT_H + +/* + * C structures for the Migration v2 stream format. + * See docs/specs/libxl-migration-stream.pandoc + */ + +#include + +typedef struct libxl__sr_hdr +{ +uint64_t ident; +uint32_t version; +uint32_t options; +} libxl__sr_hdr; + +#define RESTORE_STREAM_IDENT 0x4c6962786c466d74UL +#define RESTORE_STREAM_VERSION 0x0002U + +#define RESTORE_OPT_BIG_ENDIAN (1u << 0) +#define RESTORE_OPT_LEGACY (1u << 1) + + +typedef struct libxl__sr_rec_hdr +{ +uint32_t type; +uint32_t length; +} libxl__sr_rec_hdr; + +/* All records must be aligned up to an 8 octet boundary */ +#define REC_ALIGN_ORDER 3U + +#define REC_TYPE_END 0xU +#define REC_TYPE_LIBXC_CONTEXT 0x0001U +#define REC_TYPE_XENSTORE_DATA 0x0002U +#define REC_TYPE_EMULATOR_CONTEXT0x0003U + +typedef struct libxl__sr_emulator_hdr +{ +uint32_t id; +uint32_t index; +} libxl__sr_emulator_hdr; + +#define EMULATOR_UNKNOWN 0xU +#define EMULATOR_QEMU_TRADITIONAL0x0001U +#define EMULATOR_QEMU_UPSTREAM 0x0002U + +#endif /* LIBXL__SR_STREAM_FORMAT_H */ + +/* + * Local variables: + * mode: C + * c-basic-offset: 4 + * indent-tabs-mode: nil + * End: + */ -- 1.7.10.4 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
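Two properties of the format above are worth spelling out: the 64-bit stream ident is the ASCII string "LibxlFmt", and REC_ALIGN_ORDER of 3 means every record is padded up to an 8-octet boundary. A small Python 3 check (pad_length() is a helper invented for this sketch):

```python
import struct

RESTORE_STREAM_IDENT = 0x4c6962786c466d74
REC_ALIGN_ORDER = 3

# The 64-bit ident spells "LibxlFmt" when read as big-endian ASCII.
assert struct.pack(">Q", RESTORE_STREAM_IDENT) == b"LibxlFmt"

def pad_length(length):
    """Round a record length up to the 8-octet boundary the format requires."""
    align = 1 << REC_ALIGN_ORDER
    return (length + align - 1) & ~(align - 1)

assert pad_length(5) == 8
assert pad_length(16) == 16
```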
[Xen-devel] [PATCH v2 13/27] tools/python: Verification utility for v2 stream spec compliance
Signed-off-by: Andrew Cooper Acked-by: Ian Campbell CC: Ian Jackson CC: Wei Liu --- This is exceedingly useful for development, but not of practical use being installed into a production dom0. --- tools/python/scripts/verify-stream-v2 | 174 + 1 file changed, 174 insertions(+) create mode 100755 tools/python/scripts/verify-stream-v2 diff --git a/tools/python/scripts/verify-stream-v2 b/tools/python/scripts/verify-stream-v2 new file mode 100755 index 000..3daf257 --- /dev/null +++ b/tools/python/scripts/verify-stream-v2 @@ -0,0 +1,174 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +""" Verify a v2 format migration stream """ + +import sys +import struct +import os, os.path +import syslog +import traceback + +from xen.migration.verify import StreamError, RecordError +from xen.migration.libxc import VerifyLibxc +from xen.migration.libxl import VerifyLibxl + +fin = None # Input file/fd +log_to_syslog = False # Boolean - Log to syslog instead of stdout/err? +verbose = False# Boolean - Summarise stream contents +quiet = False # Boolean - Suppress error printing + +def info(msg): +"""Info message, routed to appropriate destination""" +if not quiet and verbose: +if log_to_syslog: +for line in msg.split("\n"): +syslog.syslog(syslog.LOG_INFO, line) +else: +print msg + +def err(msg): +"""Error message, routed to appropriate destination""" +if not quiet: +if log_to_syslog: +for line in msg.split("\n"): +syslog.syslog(syslog.LOG_ERR, line) +print >> sys.stderr, msg + +def stream_read(_ = None): +"""Read from input""" +return fin.read(_) + +def rdexact(nr_bytes): +"""Read exactly nr_bytes from fin""" +_ = stream_read(nr_bytes) +if len(_) != nr_bytes: +raise IOError("Stream truncated") +return _ + +def unpack_exact(fmt): +"""Unpack a format from fin""" +sz = struct.calcsize(fmt) +return struct.unpack(fmt, rdexact(sz)) + + +def skip_xl_header(): +"""Skip over an xl header in the stream""" + +hdr = rdexact(32) +if hdr != "Xen saved domain, xl format\n \0 \r": +raise 
StreamError("No xl header") + +_, mflags, _, optlen = unpack_exact("=") +_ = rdexact(optlen) + +info("Processed xl header") + +if mflags & 2: # XL_MANDATORY_FLAG_STREAMv2 +return "libxl" +else: +return "libxc" + +def read_stream(fmt): +""" Read an entire stream """ + +try: +if fmt == "xl": +fmt = skip_xl_header() + +if fmt == "libxc": +VerifyLibxc(info, stream_read).verify() +else: +VerifyLibxl(info, stream_read).verify() + +except (IOError, StreamError, RecordError): +err("Stream Error:") +err(traceback.format_exc()) +return 1 + +except StandardError: +err("Script Error:") +err(traceback.format_exc()) +err("Please fix me") +return 2 + +return 0 + +def open_file_or_fd(val, mode, buffering): +""" +If 'val' looks like a decimal integer, open it as an fd. If not, try to +open it as a regular file. +""" + +fd = -1 +try: +# Does it look like an integer? +try: +fd = int(val, 10) +except ValueError: +pass + +# Try to open it... +if fd != -1: +return os.fdopen(fd, mode, buffering) +else: +return open(val, mode, buffering) + +except StandardError, e: +if fd != -1: +err("Unable to open fd %d: %s: %s" % +(fd, e.__class__.__name__, e)) +else: +err("Unable to open file '%s': %s: %s" % +(val, e.__class__.__name__, e)) + +raise SystemExit(2) + +def main(): +""" main """ +from optparse import OptionParser +global fin, quiet, verbose + +# Change stdout to be line-buffered. 
+sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 1) + +parser = OptionParser(usage = "%prog [options]", + description = + "Verify a stream according to the v2 spec") + +# Optional options +parser.add_option("-i", "--in", dest = "fin", metavar = "", + default = "0", + help = "Stream to verify (defaults to stdin)") +parser.add_option("-v", "--verbose", action = "store_true", default = False, + help = "Summarise stream contents") +parser.add_option("-q", "--quiet", action = "store_true", default = False, + help = "Suppress all logging/errors") +parser.add_option("-f", "--format", dest = "format", + metavar = "", default = "libxc", + choices = ["libxc", "libxl", "xl"], + help = "Format of the incoming stream (defaults to libxc)") +parser.add_op
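The verification script above is built around two small exact-read helpers (`rdexact`/`unpack_exact`): a short read is an error, and unpacking always consumes precisely the number of bytes the struct format requires. A self-contained Python 3 sketch of that pattern (the script itself is Python 2; names mirror the helpers but the file-object plumbing here is illustrative):

```python
import io
import struct

def rdexact(fin, nr_bytes):
    """Read exactly nr_bytes from fin, raising on a truncated stream."""
    data = fin.read(nr_bytes)
    if len(data) != nr_bytes:
        raise IOError("Stream truncated")
    return data

def unpack_exact(fin, fmt):
    """Unpack one instance of struct format fmt from fin."""
    sz = struct.calcsize(fmt)
    return struct.unpack(fmt, rdexact(fin, sz))

# An 8-byte record header (two native uint32s) round-trips cleanly:
stream = io.BytesIO(struct.pack("II", 0x4, 0))
rtype, length = unpack_exact(stream, "II")
assert (rtype, length) == (4, 0)
```

Reading via `unpack_exact` rather than bare `read()` is what lets the verifier report "Stream truncated" instead of misinterpreting a partial header.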
[Xen-devel] [PATCH v2 25/27] tools/libxl: Handle checkpoint records in a libxl migration v2 stream
This is the final bit of untangling for Remus. Signed-off-by: Andrew Cooper CC: Ian Campbell CC: Ian Jackson CC: Wei Liu --- As before, Remus functionality is untested, but the new logic here should handle failovers correctly. The patch has changed greatly from v1, both in a functional sense, and because of the knock-on effects from earlier changes. --- tools/libxl/libxl_create.c | 27 +++ tools/libxl/libxl_internal.h|8 tools/libxl/libxl_stream_read.c | 97 +++ 3 files changed, 132 insertions(+) diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c index 2a0063a..0325bf1 100644 --- a/tools/libxl/libxl_create.c +++ b/tools/libxl/libxl_create.c @@ -672,6 +672,29 @@ static int store_libxl_entry(libxl__gc *gc, uint32_t domid, libxl_device_model_version_to_string(b_info->device_model_version)); } +/*- remus asynchronous checkpoint callback -*/ + +static void remus_checkpoint_stream_done( +libxl__egc *egc, libxl__stream_read_state *srs, int rc); + +static void libxl__remus_domain_checkpoint_callback(void *data) +{ +libxl__save_helper_state *shs = data; +libxl__domain_create_state *dcs = CONTAINER_OF(shs, *dcs, shs); +libxl__egc *egc = dcs->shs.egc; +STATE_AO_GC(dcs->ao); + +libxl__stream_read_start_checkpoint(egc, &dcs->srs); +} + +static void remus_checkpoint_stream_done( +libxl__egc *egc, libxl__stream_read_state *srs, int rc) +{ +libxl__domain_create_state *dcs = CONTAINER_OF(srs, *dcs, srs); + +libxl__xc_domain_saverestore_async_callback_done(egc, &dcs->shs, rc); +} + /*- main domain creation -*/ /* We have a linear control flow; only one event callback is @@ -939,6 +962,8 @@ static void domcreate_bootloader_done(libxl__egc *egc, libxl_domain_config *const d_config = dcs->guest_config; const int restore_fd = dcs->restore_fd; libxl__domain_build_state *const state = &dcs->build_state; +libxl__srm_restore_autogen_callbacks *const callbacks = +&dcs->shs.callbacks.restore.a; if (rc) { domcreate_rebuild_done(egc, dcs, rc); @@ -966,6 +991,7 @@ static void
domcreate_bootloader_done(libxl__egc *egc, } /* Restore */ +callbacks->checkpoint = libxl__remus_domain_checkpoint_callback; rc = libxl__build_pre(gc, domid, d_config, state); if (rc) @@ -975,6 +1001,7 @@ static void domcreate_bootloader_done(libxl__egc *egc, dcs->srs.fd = restore_fd; dcs->srs.legacy = (dcs->restore_params.stream_version == 1); dcs->srs.completion_callback = domcreate_stream_done; +dcs->srs.checkpoint_callback = remus_checkpoint_stream_done; libxl__stream_read_start(egc, &dcs->srs); return; diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h index 84e22c2..1b62f25 100644 --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -3332,9 +3332,13 @@ struct libxl__stream_read_state { void (*completion_callback)(libxl__egc *egc, libxl__stream_read_state *srs, int rc); +void (*checkpoint_callback)(libxl__egc *egc, +libxl__stream_read_state *srs, +int rc); /* Private */ int rc; bool running; +bool in_checkpoint; /* Active-stuff handling */ int joined_rc; @@ -3349,6 +3353,8 @@ struct libxl__stream_read_state { LIBXL_STAILQ_HEAD(, libxl__sr_record_buf) record_queue; enum { SRS_PHASE_NORMAL, +SRS_PHASE_BUFFERING, +SRS_PHASE_UNBUFFERING, } phase; bool recursion_guard; @@ -3362,6 +3368,8 @@ _hidden void libxl__stream_read_start(libxl__egc *egc, _hidden void libxl__stream_read_continue(libxl__egc *egc, libxl__stream_read_state *stream); +_hidden void libxl__stream_read_start_checkpoint( +libxl__egc *egc, libxl__stream_read_state *stream); _hidden void libxl__stream_read_abort(libxl__egc *egc, libxl__stream_read_state *stream, int rc); diff --git a/tools/libxl/libxl_stream_read.c b/tools/libxl/libxl_stream_read.c index 81095cd..6cfa05c 100644 --- a/tools/libxl/libxl_stream_read.c +++ b/tools/libxl/libxl_stream_read.c @@ -123,6 +123,10 @@ static int setup_read(libxl__stream_read_state *stream, return libxl__datacopier_start(dc); } +/* Error handling for checkpoint mini-loop. 
*/ +static void checkpoint_done(libxl__egc *egc, +libxl__stream_read_state *stream, int rc); + void libxl__stream_read_start(libxl__egc *egc, libxl__stream_read_state *stream) { @@ -186,6 +190,18 @@ void libxl__stream_read_start(libxl__egc *egc, stream_failed(egc, stream, ret); } +void libxl__stream_read_start_checkpoint(libxl__egc *egc, + libxl__stream_read_state *stream) +{ +
[Xen-devel] [PATCH v2 16/27] tools/libxl: Infrastructure for reading a libxl migration v2 stream
From: Ross Lagerwall This contains the event machinery and state machines to read and act on a non-checkpointed migration v2 stream (with the exception of the xc_domain_restore() handling which is spliced later in a bisectable way). It also contains some boilerplate to help support checkpointed streams, which shall be introduced in a later patch. Signed-off-by: Ross Lagerwall Signed-off-by: Andrew Cooper CC: Ian Campbell CC: Ian Jackson CC: Wei Liu --- Large quantities of the logic here are completely overhauled since v1, mostly as part of fixing the checkpoint buffering bug which was the cause of the broken Remus failover. The result is actually simpler overall; all records are buffered in memory (there is no splicing of the emulator records any more), with normal streams having exactly 0 or 1 records currently in the buffer, before processing. Remus support later will allow multiple buffered records. --- tools/libxl/Makefile|1 + tools/libxl/libxl_internal.h| 55 + tools/libxl/libxl_stream_read.c | 504 +++ 3 files changed, 560 insertions(+) create mode 100644 tools/libxl/libxl_stream_read.c diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile index cc9c152..c71c5fe 100644 --- a/tools/libxl/Makefile +++ b/tools/libxl/Makefile @@ -94,6 +94,7 @@ LIBXL_OBJS = flexarray.o libxl.o libxl_create.o libxl_dm.o libxl_pci.o \ libxl_dom.o libxl_exec.o libxl_xshelp.o libxl_device.o \ libxl_internal.o libxl_utils.o libxl_uuid.o \ libxl_json.o libxl_aoutils.o libxl_numa.o libxl_vnuma.o \ + libxl_stream_read.o \ libxl_save_callout.o _libxl_save_msgs_callout.o \ libxl_qmp.o libxl_event.o libxl_fork.o $(LIBXL_OBJS-y) LIBXL_OBJS += libxl_genid.o diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h index 4bd6ea1..0f17e7c 100644 --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -19,6 +19,8 @@ #include "libxl_osdeps.h" /* must come before any other headers */ +#include "libxl_sr_stream_format.h" + #include #include #include @@ -3211,6
+3213,58 @@ typedef void libxl__domain_create_cb(libxl__egc *egc, libxl__domain_create_state*, int rc, uint32_t domid); +/* State for manipulating a libxl migration v2 stream */ +typedef struct libxl__stream_read_state libxl__stream_read_state; + +typedef struct libxl__sr_record_buf { +/* private to stream read helper */ +LIBXL_STAILQ_ENTRY(struct libxl__sr_record_buf) entry; +libxl__sr_rec_hdr hdr; +void *body; +} libxl__sr_record_buf; + +struct libxl__stream_read_state { +/* filled by the user */ +libxl__ao *ao; +int fd; +void (*completion_callback)(libxl__egc *egc, +libxl__stream_read_state *srs, +int rc); +/* Private */ +int rc; +bool running; + +/* Main stream-reading data */ +libxl__datacopier_state dc; +libxl__sr_hdr hdr; +libxl__sr_record_buf *incoming_record; +LIBXL_STAILQ_HEAD(, libxl__sr_record_buf) record_queue; +enum { +SRS_PHASE_NORMAL, +} phase; +bool recursion_guard; + +/* Emulator blob handling */ +libxl__datacopier_state emu_dc; +libxl__carefd *emu_carefd; +}; + +_hidden void libxl__stream_read_start(libxl__egc *egc, + libxl__stream_read_state *stream); + +_hidden void libxl__stream_read_continue(libxl__egc *egc, + libxl__stream_read_state *stream); + +_hidden void libxl__stream_read_abort(libxl__egc *egc, + libxl__stream_read_state *stream, int rc); + +static inline bool libxl__stream_read_inuse( +const libxl__stream_read_state *stream) +{ +return stream->running; +} + + struct libxl__domain_create_state { /* filled in by user */ libxl__ao *ao; @@ -3227,6 +3281,7 @@ struct libxl__domain_create_state { libxl__stub_dm_spawn_state dmss; /* If we're not doing stubdom, we use only dmss.dm, * for the non-stubdom device model. 
*/ +libxl__stream_read_state srs; libxl__save_helper_state shs; /* necessary if the domain creation failed and we have to destroy it */ libxl__domain_destroy_state dds; diff --git a/tools/libxl/libxl_stream_read.c b/tools/libxl/libxl_stream_read.c new file mode 100644 index 000..6f5d572 --- /dev/null +++ b/tools/libxl/libxl_stream_read.c @@ -0,0 +1,504 @@ +/* + * Copyright (C) 2015 Citrix Ltd. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as published + * by the Free Software Foundation; version 2.1 only. with the special + * exception on linking described in file LICENSE.
[Xen-devel] [PATCH v2 22/27] docs/libxl: Introduce CHECKPOINT_END to support migration v2 remus streams
In a Remus scenario, libxc will write a CHECKPOINT record, then hand ownership of the fd to libxl. Libxl then writes any records required and finishes with a CHECKPOINT_END record, then hands ownership of the fd back to libxc. Signed-off-by: Andrew Cooper CC: Ian Campbell CC: Ian Jackson CC: Wei Liu --- docs/specs/libxl-migration-stream.pandoc | 14 +- tools/libxl/libxl_sr_stream_format.h |1 + tools/python/xen/migration/libxl.py | 11 +++ 3 files changed, 25 insertions(+), 1 deletion(-) diff --git a/docs/specs/libxl-migration-stream.pandoc b/docs/specs/libxl-migration-stream.pandoc index 7235317..674bfdd 100644 --- a/docs/specs/libxl-migration-stream.pandoc +++ b/docs/specs/libxl-migration-stream.pandoc @@ -119,7 +119,9 @@ type 0x: END 0x0003: EMULATOR_CONTEXT - 0x0004 - 0x7FFF: Reserved for future _mandatory_ + 0x0004: CHECKPOINT_END + + 0x0005 - 0x7FFF: Reserved for future _mandatory_ records. 0x8000 - 0x: Reserved for future _optional_ @@ -203,3 +205,13 @@ indexIndex of this emulator for the domain, if multiple emulator_ctx Emulator context blob. + +CHECKPOINT\_END +--- + +A checkpoint end record marks the end of a checkpoint in the image. + + 0 1 2 3 4 5 6 7 octet ++-+ + +The end record contains no fields; its body_length is 0.
diff --git a/tools/libxl/libxl_sr_stream_format.h b/tools/libxl/libxl_sr_stream_format.h index f4f790b..3f3c497 100644 --- a/tools/libxl/libxl_sr_stream_format.h +++ b/tools/libxl/libxl_sr_stream_format.h @@ -35,6 +35,7 @@ #define REC_TYPE_LIBXC_CONTEXT 0x0001U #define REC_TYPE_XENSTORE_DATA 0x0002U #define REC_TYPE_EMULATOR_CONTEXT0x0003U +#define REC_TYPE_CHECKPOINT_END 0x0004U typedef struct libxl__sr_emulator_hdr { diff --git a/tools/python/xen/migration/libxl.py b/tools/python/xen/migration/libxl.py index 4e1f4f8..415502e 100644 --- a/tools/python/xen/migration/libxl.py +++ b/tools/python/xen/migration/libxl.py @@ -36,12 +36,14 @@ REC_TYPE_end = 0x REC_TYPE_libxc_context= 0x0001 REC_TYPE_xenstore_data= 0x0002 REC_TYPE_emulator_context = 0x0003 +REC_TYPE_checkpoint_end = 0x0004 rec_type_to_str = { REC_TYPE_end : "End", REC_TYPE_libxc_context: "Libxc context", REC_TYPE_xenstore_data: "Xenstore data", REC_TYPE_emulator_context : "Emulator context", +REC_TYPE_checkpoint_end : "Checkpoint end", } # emulator_context @@ -176,6 +178,13 @@ class VerifyLibxl(VerifyBase): self.info(" Index %d, type %s" % (emu_idx, emulator_id_to_str[emu_id])) +def verify_record_checkpoint_end(self, content): +""" Checkpoint end record """ + +if len(content) != 0: +raise RecordError("Checkpoint end record with non-zero length") + + record_verifiers = { REC_TYPE_end: VerifyLibxl.verify_record_end, @@ -185,4 +194,6 @@ record_verifiers = { VerifyLibxl.verify_record_xenstore_data, REC_TYPE_emulator_context: VerifyLibxl.verify_record_emulator_context, +REC_TYPE_checkpoint_end: +VerifyLibxl.verify_record_checkpoint_end, } -- 1.7.10.4 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
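The CHECKPOINT_END record described above is pure framing: type 0x00000004 with a zero-length body. A minimal, self-contained sketch (assuming the spec's record header of two native-endian uint32s, type then body_length; the function names are illustrative, not from the patch):

```python
import struct

REC_TYPE_CHECKPOINT_END = 0x00000004

def write_checkpoint_end():
    """Emit a CHECKPOINT_END record: 8-byte header, empty body."""
    return struct.pack("=II", REC_TYPE_CHECKPOINT_END, 0)

def is_checkpoint_end(rec):
    """True if rec starts with a well-formed CHECKPOINT_END header."""
    rtype, length = struct.unpack("=II", rec[:8])
    if rtype != REC_TYPE_CHECKPOINT_END:
        return False
    if length != 0:
        # Mirrors the verifier's RecordError for a non-zero body.
        raise ValueError("Checkpoint end record with non-zero length")
    return True

assert is_checkpoint_end(write_checkpoint_end())
```

Since the body is empty, no padding is required and the record is exactly one 8-byte header on the wire.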
[Xen-devel] [PATCH v2 19/27] tools/libxc+libxl+xl: Restore v2 streams
This is a complicated set of changes which must be done together for bisectability. * libxl-save-helper is updated to unconditionally use libxc migration v2. * libxl compatibility workarounds in libxc are disabled for restore operations. * libxl__stream_read_start() is logically spliced into the event location where libxl__xc_domain_restore() used to reside. The parameters 'hvm', 'pae', and 'superpages' were previously superfluous, and are completely unused in migration v2. callbacks->toolstack_restore is handled via a migration v2 record now, rather than via a callback from libxc. NB: this change breaks Remus. Further untangling needs to happen before Remus will function. Signed-off-by: Andrew Cooper CC: Ian Campbell CC: Ian Jackson CC: Wei Liu --- Since v1: * Drop "legacy_width" from the IDL * Gain a LIBXL_HAVE_ to signify support of migration v2 streams --- tools/libxc/Makefile|4 ++-- tools/libxl/libxl.h | 17 ++ tools/libxl/libxl_create.c | 48 --- tools/libxl/libxl_save_helper.c |2 +- tools/libxl/libxl_stream_read.c | 34 +++ tools/libxl/libxl_types.idl |1 + tools/libxl/xl_cmdimpl.c|7 +- 7 files changed, 76 insertions(+), 37 deletions(-) diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile index b659df4..2cd0b1a 100644 --- a/tools/libxc/Makefile +++ b/tools/libxc/Makefile @@ -64,8 +64,8 @@ GUEST_SRCS-$(CONFIG_X86) += xc_sr_save_x86_hvm.c GUEST_SRCS-y += xc_sr_restore.c GUEST_SRCS-y += xc_sr_save.c GUEST_SRCS-y += xc_offline_page.c xc_compression.c -$(patsubst %.c,%.o,$(GUEST_SRCS-y)): CFLAGS += -DXG_LIBXL_HVM_COMPAT -$(patsubst %.c,%.opic,$(GUEST_SRCS-y)): CFLAGS += -DXG_LIBXL_HVM_COMPAT +xc_sr_save_x86_hvm.o: CFLAGS += -DXG_LIBXL_HVM_COMPAT +xc_sr_save_x86_hvm.opic: CFLAGS += -DXG_LIBXL_HVM_COMPAT else GUEST_SRCS-y += xc_nomigrate.c endif diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h index e9d63c9..e64a606 100644 --- a/tools/libxl/libxl.h +++ b/tools/libxl/libxl.h @@ -807,6 +807,23 @@ */ #define LIBXL_HAVE_SOCKET_BITMAP_ALLOC 1 +/* + * 
LIBXL_HAVE_STREAM_V2 + * + * If this is defined, then the libxl_domain_create_restore() interface takes + * a "stream_version" parameter and supports a value of 2. + */ +#define LIBXL_HAVE_STREAM_V2 1 + +/* + * LIBXL_HAVE_STREAM_V1 + * + * In the case that LIBXL_HAVE_STREAM_V2 is set, LIBXL_HAVE_STREAM_V1 + * indicates that libxl_domain_create_restore() can handle a "stream_version" + * parameter of 1, and convert the stream format automatically. + */ +#define LIBXL_HAVE_STREAM_V1 1 + typedef char **libxl_string_list; void libxl_string_list_dispose(libxl_string_list *sl); int libxl_string_list_length(const libxl_string_list *sl); diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c index be13204..2a0063a 100644 --- a/tools/libxl/libxl_create.c +++ b/tools/libxl/libxl_create.c @@ -704,6 +704,10 @@ static void domcreate_attach_dtdev(libxl__egc *egc, static void domcreate_console_available(libxl__egc *egc, libxl__domain_create_state *dcs); +static void domcreate_stream_done(libxl__egc *egc, + libxl__stream_read_state *srs, + int ret); + static void domcreate_rebuild_done(libxl__egc *egc, libxl__domain_create_state *dcs, int ret); @@ -933,11 +937,8 @@ static void domcreate_bootloader_done(libxl__egc *egc, /* convenience aliases */ const uint32_t domid = dcs->guest_domid; libxl_domain_config *const d_config = dcs->guest_config; -libxl_domain_build_info *const info = &d_config->b_info; const int restore_fd = dcs->restore_fd; libxl__domain_build_state *const state = &dcs->build_state; -libxl__srm_restore_autogen_callbacks *const callbacks = -&dcs->shs.callbacks.restore.a; if (rc) { domcreate_rebuild_done(egc, dcs, rc); @@ -970,30 +971,16 @@ static void domcreate_bootloader_done(libxl__egc *egc, if (rc) goto out; -/* read signature */ -int hvm, pae, superpages; -switch (info->type) { -case LIBXL_DOMAIN_TYPE_HVM: -hvm = 1; -superpages = 1; -pae = libxl_defbool_val(info->u.hvm.pae); -callbacks->toolstack_restore = libxl__toolstack_restore; -break; -case 
LIBXL_DOMAIN_TYPE_PV: -hvm = 0; -superpages = 0; -pae = 1; -break; -default: -rc = ERROR_INVAL; -goto out; -} -libxl__xc_domain_restore(egc, dcs, - hvm, pae, superpages); +dcs->srs.ao = ao; +dcs->srs.fd = restore_fd; +dcs->srs.legacy = (dcs->restore_params.stream_version == 1); +dcs->srs.completion_callback = domcreate_stream_done; + +libxl__stream_read_start(egc, &dcs->srs); return; out: -libxl__xc
[Xen-devel] [PATCH v2 11/27] tools/python: Libxl migration v2 infrastructure
Contains: * Python implementation of the libxl migration v2 records * Verification code for spec compliance * Unit tests Signed-off-by: Andrew Cooper Acked-by: Ian Campbell CC: Ian Jackson CC: Wei Liu --- tools/python/xen/migration/libxl.py | 188 +++ tools/python/xen/migration/tests.py | 13 +++ 2 files changed, 201 insertions(+) create mode 100644 tools/python/xen/migration/libxl.py diff --git a/tools/python/xen/migration/libxl.py b/tools/python/xen/migration/libxl.py new file mode 100644 index 000..4e1f4f8 --- /dev/null +++ b/tools/python/xen/migration/libxl.py @@ -0,0 +1,188 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +""" +Libxl Migration v2 streams + +Record structures as per docs/specs/libxl-migration-stream.pandoc, and +verification routines. +""" + +import sys + +from struct import calcsize, unpack +from xen.migration.verify import StreamError, RecordError, VerifyBase +from xen.migration.libxc import VerifyLibxc + +# Header +HDR_FORMAT = "!QII" + +HDR_IDENT = 0x4c6962786c466d74 # "LibxlFmt" in ASCII +HDR_VERSION = 2 + +HDR_OPT_BIT_ENDIAN = 0 +HDR_OPT_BIT_LEGACY = 1 + +HDR_OPT_LE = (0 << HDR_OPT_BIT_ENDIAN) +HDR_OPT_BE = (1 << HDR_OPT_BIT_ENDIAN) +HDR_OPT_LEGACY = (1 << HDR_OPT_BIT_LEGACY) + +HDR_OPT_RESZ_MASK = 0xfffc + +# Records +RH_FORMAT = "II" + +REC_TYPE_end = 0x +REC_TYPE_libxc_context= 0x0001 +REC_TYPE_xenstore_data= 0x0002 +REC_TYPE_emulator_context = 0x0003 + +rec_type_to_str = { +REC_TYPE_end : "End", +REC_TYPE_libxc_context: "Libxc context", +REC_TYPE_xenstore_data: "Xenstore data", +REC_TYPE_emulator_context : "Emulator context", +} + +# emulator_context +EMULATOR_CONTEXT_FORMAT = "II" + +EMULATOR_ID_unknown = 0x +EMULATOR_ID_qemu_trad = 0x0001 +EMULATOR_ID_qemu_upstream = 0x0002 + +emulator_id_to_str = { +EMULATOR_ID_unknown : "Unknown", +EMULATOR_ID_qemu_trad : "Qemu Traditional", +EMULATOR_ID_qemu_upstream : "Qemu Upstream", +} + + +# +# libxl format +# + +LIBXL_QEMU_SIGNATURE = "DeviceModelRecord0002" +LIBXL_QEMU_RECORD_HDR = 
"=%dsI" % (len(LIBXL_QEMU_SIGNATURE), ) + +class VerifyLibxl(VerifyBase): +""" Verify a Libxl v2 stream """ + +def __init__(self, info, read): +VerifyBase.__init__(self, info, read) + + +def verify(self): +""" Verity a libxl stream """ + +self.verify_hdr() + +while self.verify_record() != REC_TYPE_end: +pass + + +def verify_hdr(self): +""" Verify a Header """ +ident, version, options = self.unpack_exact(HDR_FORMAT) + +if ident != HDR_IDENT: +raise StreamError("Bad image id: Expected 0x%x, got 0x%x" + % (HDR_IDENT, ident)) + +if version != HDR_VERSION: +raise StreamError("Unknown image version: Expected %d, got %d" + % (HDR_VERSION, version)) + +if options & HDR_OPT_RESZ_MASK: +raise StreamError("Reserved bits set in image options field: 0x%x" + % (options & HDR_OPT_RESZ_MASK)) + +if ( (sys.byteorder == "little") and + ((options & HDR_OPT_BIT_ENDIAN) != HDR_OPT_LE) ): +raise StreamError( +"Stream is not native endianess - unable to validate") + +endian = ["little", "big"][options & HDR_OPT_LE] + +if options & HDR_OPT_LEGACY: +self.info("Libxl Header: %s endian, legacy converted" % (endian, )) +else: +self.info("Libxl Header: %s endian" % (endian, )) + + +def verify_record(self): +""" Verify an individual record """ +rtype, length = self.unpack_exact(RH_FORMAT) + +if rtype not in rec_type_to_str: +raise StreamError("Unrecognised record type %x" % (rtype, )) + +self.info("Libxl Record: %s, length %d" + % (rec_type_to_str[rtype], length)) + +contentsz = (length + 7) & ~7 +content = self.rdexact(contentsz) + +padding = content[length:] +if padding != "\x00" * len(padding): +raise StreamError("Padding containing non0 bytes found") + +if rtype not in record_verifiers: +raise RuntimeError("No verification function for libxl record '%s'" + % rec_type_to_str[rtype]) +else: +record_verifiers[rtype](self, content[:length]) + +return rtype + + +def verify_record_end(self, content): +""" End record """ + +if len(content) != 0: +raise RecordError("End record with non-zero 
length") + + +def verify_record_libxc_context(self, content): +""" Libxc context record """ + +if len(content) != 0: +raise RecordError("Libxc context record with non-zero length") + +# Verify the libxc stream, as we can't seek forwards through it +
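The record framing verified above pads every body out to an 8-byte boundary, computing the padded size as `(length + 7) & ~7` and requiring the pad bytes to be zero. A standalone illustration of that arithmetic (helper names are illustrative):

```python
def padded_size(length):
    """Round a record body length up to the next multiple of 8."""
    return (length + 7) & ~7

# 0 stays 0; 1..8 round to 8; 9..16 round to 16, and so on.
assert [padded_size(n) for n in (0, 1, 7, 8, 9, 16)] == [0, 8, 8, 8, 16, 16]

def check_padding(content, length):
    """content is the padded body; bytes past 'length' must all be zero."""
    padding = content[length:]
    if padding != b"\x00" * len(padding):
        raise ValueError("Padding contains non-zero bytes")

check_padding(b"abc\x00\x00\x00\x00\x00", 3)  # 3-byte body, 5 zero pad bytes
```

Keeping every record 8-byte aligned means a reader can always fetch the next record header with a single fixed-size read, without per-record alignment fixups.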
[Xen-devel] [PATCH v2 02/27] tools/libxc: Always compile the compat qemu variables into xc_sr_context
This is safe (as the variables will simply be unused), and is required for correct compilation when midway through untangling the libxc/libxl interaction. The #define is left in place to highlight that the variables can be removed once the untangling is complete. Signed-off-by: Andrew Cooper Acked-by: Ian Campbell CC: Ian Jackson CC: Wei Liu --- tools/libxc/xc_sr_common.h |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h index 565c5da..08c66db 100644 --- a/tools/libxc/xc_sr_common.h +++ b/tools/libxc/xc_sr_common.h @@ -307,10 +307,10 @@ struct xc_sr_context void *context; size_t contextsz; -#ifdef XG_LIBXL_HVM_COMPAT +/* #ifdef XG_LIBXL_HVM_COMPAT */ uint32_t qlen; void *qbuf; -#endif +/* #endif */ } restore; }; } x86_hvm; -- 1.7.10.4
[Xen-devel] [PATCH v2 00/27] Libxl migration v2
This series adds support for the libxl migration v2 stream, and untangles the existing layering violations of the toolstack and qemu records. It can be found on the branch "libxl-migv2-v2" git://xenbits.xen.org/people/andrewcoop/xen.git http://xenbits.xen.org/git-http/people/andrewcoop/xen.git Major changes in v2 are being rebased over the libxl AO-abort series, and a redesign of the internal logic to support Remus/COLO buffering and failover. At the end of the series, legacy migration is no longer used. The Remus code is untested by me. All other combinations of suspend/migrate/resume have been tested with PV and HVM guests (qemu-trad and qemu-upstream), including 32 -> 64 bit migration (which was the underlying bug causing us to write migration v2 in the first place). Anyway, thoughts/comments welcome. Please test! ~Andrew Summary of Acks/Modified/New from v1 N bsd-sys-queue-h-seddery: Massage `offsetof' A tools/libxc: Always compile the compat qemu variables into xc_sr_context A tools/libxl: Introduce ROUNDUP() N tools/libxl: Introduce libxl__kill() AM tools/libxl: Stash all restore parameters in domain_create_state N tools/libxl: Split libxl__domain_create_state.restore_fd in two M tools/libxl: Extra management APIs for the save helper AM tools/xl: Mandatory flag indicating the format of the migration stream docs: Libxl migration v2 stream specification AM tools/python: Libxc migration v2 infrastructure AM tools/python: Libxl migration v2 infrastructure N tools/python: Other migration infrastructure AM tools/python: Verification utility for v2 stream spec compliance AM tools/python: Conversion utility for legacy migration streams M tools/libxl: Migration v2 stream format M tools/libxl: Infrastructure for reading a libxl migration v2 stream M tools/libxl: Support converting a legacy stream to a v2 stream M tools/libxl: Convert a legacy stream if needed M tools/libxc+libxl+xl: Restore v2 streams M tools/libxl: Infrastructure for writing a v2 stream M 
tools/libxc+libxl+xl: Save v2 streams AM docs/libxl: Introduce CHECKPOINT_END to support migration v2 remus streams M tools/libxl: Write checkpoint records into the stream M tools/libx{c,l}: Introduce restore_callbacks.checkpoint() M tools/libxl: Handle checkpoint records in a libxl migration v2 stream A tools/libxc: Drop all XG_LIBXL_HVM_COMPAT code from libxc A tools/libxl: Drop all knowledge of toolstack callbacks Andrew Cooper (23): tools/libxc: Always compile the compat qemu variables into xc_sr_context tools/libxl: Introduce ROUNDUP() tools/libxl: Introduce libxl__kill() tools/libxl: Stash all restore parameters in domain_create_state tools/libxl: Split libxl__domain_create_state.restore_fd in two tools/libxl: Extra management APIs for the save helper tools/xl: Mandatory flag indicating the format of the migration stream docs: Libxl migration v2 stream specification tools/python: Libxc migration v2 infrastructure tools/python: Libxl migration v2 infrastructure tools/python: Other migration infrastructure tools/python: Verification utility for v2 stream spec compliance tools/python: Conversion utility for legacy migration streams tools/libxl: Support converting a legacy stream to a v2 stream tools/libxl: Convert a legacy stream if needed tools/libxc+libxl+xl: Restore v2 streams tools/libxc+libxl+xl: Save v2 streams docs/libxl: Introduce CHECKPOINT_END to support migration v2 remus streams tools/libxl: Write checkpoint records into the stream tools/libx{c,l}: Introduce restore_callbacks.checkpoint() tools/libxl: Handle checkpoint records in a libxl migration v2 stream tools/libxc: Drop all XG_LIBXL_HVM_COMPAT code from libxc tools/libxl: Drop all knowledge of toolstack callbacks Ian Jackson (1): bsd-sys-queue-h-seddery: Massage `offsetof' Ross Lagerwall (3): tools/libxl: Migration v2 stream format tools/libxl: Infrastructure for reading a libxl migration v2 stream tools/libxl: Infrastructure for writing a v2 stream docs/specs/libxl-migration-stream.pandoc | 217 
++ tools/include/xen-external/bsd-sys-queue-h-seddery |2 + tools/libxc/Makefile |2 - tools/libxc/include/xenguest.h |9 + tools/libxc/xc_sr_common.h | 12 +- tools/libxc/xc_sr_restore.c| 71 +- tools/libxc/xc_sr_restore_x86_hvm.c| 124 tools/libxc/xc_sr_save_x86_hvm.c | 36 - tools/libxl/Makefile |2 + tools/libxl/libxl.h| 19 + tools/libxl/libxl_aoutils.c| 15 + tools/libxl/libxl_convert_callout.c| 172 + tools/libxl/libxl_create.c | 86 ++- tools/libxl/libxl_dom.c| 65 +- tools/libxl/libxl_internal.h | 192 -
[Xen-devel] [PATCH v2 01/27] bsd-sys-queue-h-seddery: Massage `offsetof'
From: Ian Jackson For some reason BSD's queue.h uses `__offsetof'. It expects it to work just like offsetof. So use offsetof. Reported-by: Andrew Cooper Signed-off-by: Ian Jackson --- tools/include/xen-external/bsd-sys-queue-h-seddery |2 ++ 1 file changed, 2 insertions(+) diff --git a/tools/include/xen-external/bsd-sys-queue-h-seddery b/tools/include/xen-external/bsd-sys-queue-h-seddery index 7a957e3..3f8716d 100755 --- a/tools/include/xen-external/bsd-sys-queue-h-seddery +++ b/tools/include/xen-external/bsd-sys-queue-h-seddery @@ -69,4 +69,6 @@ s/\b struct \s+ type \b/type/xg; s,^\#include.*sys/cdefs.*,/* $& */,xg; +s,\b __offsetof \b ,offsetof,xg; + s/\b( NULL )/0/xg; -- 1.7.10.4
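The added seddery rule is a whole-word substitution of `__offsetof` with `offsetof`. For readers less familiar with the /x-flagged sed syntax, the same substitution can be sketched in Python (illustrative only; the real tool remains the sed script):

```python
import re

def massage_offsetof(text):
    """Rewrite BSD's __offsetof to the standard offsetof, whole words only."""
    return re.sub(r"\b__offsetof\b", "offsetof", text)

assert massage_offsetof("__offsetof(struct s, field)") == "offsetof(struct s, field)"
```

The `\b` word boundaries matter: an identifier that merely contains `__offsetof` as a substring (underscores count as word characters) is left untouched, matching the spaced-out `\b ... \b` in the sed rule.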
[Xen-devel] [PATCH v2 04/27] tools/libxl: Introduce libxl__kill()
as a wrapper to kill(2), and use it in preference to sendsig() in libxl_save_callout.c. Signed-off-by: Andrew Cooper CC: Ian Campbell CC: Ian Jackson CC: Wei Liu --- Logically new in v2 - split out from a v1 change which was itself a cherrypick-and-modify from the AO Abort series --- tools/libxl/libxl_aoutils.c | 15 +++ tools/libxl/libxl_internal.h |2 ++ tools/libxl/libxl_save_callout.c | 10 ++ 3 files changed, 19 insertions(+), 8 deletions(-) diff --git a/tools/libxl/libxl_aoutils.c b/tools/libxl/libxl_aoutils.c index 0931eee..274ef39 100644 --- a/tools/libxl/libxl_aoutils.c +++ b/tools/libxl/libxl_aoutils.c @@ -621,3 +621,18 @@ bool libxl__async_exec_inuse(const libxl__async_exec_state *aes) assert(time_inuse == child_inuse); return child_inuse; } + +void libxl__kill(libxl__gc *gc, pid_t pid, int sig, const char *what) +{ +int r = kill(pid, sig); +if (r) LOGE(WARN, "failed to kill() %s [%lu] (signal %d)", +what, (unsigned long)pid, sig); +} + +/* + * Local variables: + * mode: C + * c-basic-offset: 4 + * indent-tabs-mode: nil + * End: + */ diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h index 19fc425..9147de1 100644 --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -2244,6 +2244,8 @@ struct libxl__async_exec_state { int libxl__async_exec_start(libxl__async_exec_state *aes); bool libxl__async_exec_inuse(const libxl__async_exec_state *aes); +void libxl__kill(libxl__gc *gc, pid_t pid, int sig, const char *what); + /*- device addition/removal -*/ typedef struct libxl__ao_device libxl__ao_device; diff --git a/tools/libxl/libxl_save_callout.c b/tools/libxl/libxl_save_callout.c index 087c2d5..b82a5c1 100644 --- a/tools/libxl/libxl_save_callout.c +++ b/tools/libxl/libxl_save_callout.c @@ -244,12 +244,6 @@ static void run_helper(libxl__egc *egc, libxl__save_helper_state *shs, libxl__carefd_close(childs_pipes[1]); helper_failed(egc, shs, rc);; } -static void sendsig(libxl__gc *gc, libxl__save_helper_state *shs, int sig) -{ -int
r = kill(shs->child.pid, sig); -if (r) LOGE(WARN, "failed to kill save/restore helper [%lu] (signal %d)", -(unsigned long)shs->child.pid, sig); -} static void helper_failed(libxl__egc *egc, libxl__save_helper_state *shs, int rc) @@ -266,7 +260,7 @@ static void helper_failed(libxl__egc *egc, libxl__save_helper_state *shs, return; } -sendsig(gc, shs, SIGKILL); +libxl__kill(gc, shs->child.pid, SIGKILL, "save/restore helper"); } static void helper_stop(libxl__egc *egc, libxl__ao_abortable *abrt, int rc) @@ -282,7 +276,7 @@ static void helper_stop(libxl__egc *egc, libxl__ao_abortable *abrt, int rc) if (!shs->rc) shs->rc = rc; -sendsig(gc, shs, SIGTERM); +libxl__kill(gc, shs->child.pid, SIGTERM, "save/restore helper"); } static void helper_stdout_readable(libxl__egc *egc, libxl__ev_fd *ev, -- 1.7.10.4 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v2 05/27] tools/libxl: Stash all restore parameters in domain_create_state
Shortly more parameters will appear, and this saves unboxing each one. libxl_domain_restore_params is mandatory for restore streams, and ignored for plain creation. The old 'checkpointed_stream' was incorrectly identified as a private parameter when it was in fact public. No functional change. Signed-off-by: Andrew Cooper Acked-by: Ian Campbell Reviewed-by: Yang Hongyang CC: Ian Jackson CC: Wei Liu --- Since v1: * Gate validity on restore_fd being valid. --- tools/libxl/libxl_create.c | 13 +++-- tools/libxl/libxl_internal.h |2 +- tools/libxl/libxl_save_callout.c |2 +- 3 files changed, 9 insertions(+), 8 deletions(-) diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c index b785ddd..61515da 100644 --- a/tools/libxl/libxl_create.c +++ b/tools/libxl/libxl_create.c @@ -1512,8 +1512,8 @@ static void domain_create_cb(libxl__egc *egc, int rc, uint32_t domid); static int do_domain_create(libxl_ctx *ctx, libxl_domain_config *d_config, -uint32_t *domid, -int restore_fd, int checkpointed_stream, +uint32_t *domid, int restore_fd, +const libxl_domain_restore_params *params, const libxl_asyncop_how *ao_how, const libxl_asyncprogress_how *aop_console_how) { @@ -1526,8 +1526,9 @@ static int do_domain_create(libxl_ctx *ctx, libxl_domain_config *d_config, libxl_domain_config_init(&cdcs->dcs.guest_config_saved); libxl_domain_config_copy(ctx, &cdcs->dcs.guest_config_saved, d_config); cdcs->dcs.restore_fd = restore_fd; +if (restore_fd > -1) +cdcs->dcs.restore_params = *params; cdcs->dcs.callback = domain_create_cb; -cdcs->dcs.checkpointed_stream = checkpointed_stream; libxl__ao_progress_gethow(&cdcs->dcs.aop_console_how, aop_console_how); cdcs->domid_out = domid; @@ -1553,7 +1554,7 @@ int libxl_domain_create_new(libxl_ctx *ctx, libxl_domain_config *d_config, const libxl_asyncop_how *ao_how, const libxl_asyncprogress_how *aop_console_how) { -return do_domain_create(ctx, d_config, domid, -1, 0, +return do_domain_create(ctx, d_config, domid, -1, NULL, ao_how, 
aop_console_how); } @@ -1563,8 +1564,8 @@ int libxl_domain_create_restore(libxl_ctx *ctx, libxl_domain_config *d_config, const libxl_asyncop_how *ao_how, const libxl_asyncprogress_how *aop_console_how) { -return do_domain_create(ctx, d_config, domid, restore_fd, -params->checkpointed_stream, ao_how, aop_console_how); +return do_domain_create(ctx, d_config, domid, restore_fd, params, +ao_how, aop_console_how); } /* diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h index 9147de1..5ab945a 100644 --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -3217,11 +3217,11 @@ struct libxl__domain_create_state { libxl_domain_config *guest_config; libxl_domain_config guest_config_saved; /* vanilla config */ int restore_fd; +libxl_domain_restore_params restore_params; libxl__domain_create_cb *callback; libxl_asyncprogress_how aop_console_how; /* private to domain_create */ int guest_domid; -int checkpointed_stream; libxl__domain_build_state build_state; libxl__bootloader_state bl; libxl__stub_dm_spawn_state dmss; diff --git a/tools/libxl/libxl_save_callout.c b/tools/libxl/libxl_save_callout.c index b82a5c1..80aae1b 100644 --- a/tools/libxl/libxl_save_callout.c +++ b/tools/libxl/libxl_save_callout.c @@ -60,7 +60,7 @@ void libxl__xc_domain_restore(libxl__egc *egc, libxl__domain_create_state *dcs, state->store_domid, state->console_port, state->console_domid, hvm, pae, superpages, -cbflags, dcs->checkpointed_stream, +cbflags, dcs->restore_params.checkpointed_stream, }; dcs->shs.ao = ao; -- 1.7.10.4 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v2 03/27] tools/libxl: Introduce ROUNDUP()
This is the same as is used by libxc. Signed-off-by: Andrew Cooper Acked-by: Ian Campbell CC: Ian Jackson CC: Wei Liu --- tools/libxl/libxl_internal.h |3 +++ 1 file changed, 3 insertions(+) diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h index 5235d25..19fc425 100644 --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -110,6 +110,9 @@ #define ARRAY_SIZE(a) (sizeof(a) / sizeof(a[0])) +#define ROUNDUP(_val, _order) \ +(((unsigned long)(_val)+(1UL<<(_order))-1) & ~((1UL<<(_order))-1)) + #define min(X, Y) ({ \ const typeof (X) _x = (X); \ const typeof (Y) _y = (Y); \ -- 1.7.10.4 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v2 08/27] tools/xl: Mandatory flag indicating the format of the migration stream
Introduced at this point so the python stream conversion code has a concrete ABI to use. Later when libxl itself starts supporting a v2 stream, it will be added to XL_MANDATORY_FLAG_ALL. Signed-off-by: Andrew Cooper Acked-by: Ian Campbell CC: Ian Jackson CC: Wei Liu --- v2: Expand commit message --- tools/libxl/xl_cmdimpl.c |1 + 1 file changed, 1 insertion(+) diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c index 971209c..26b1e7d 100644 --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -109,6 +109,7 @@ */ #define XL_MANDATORY_FLAG_JSON (1U << 0) /* config data is in JSON format */ +#define XL_MANDATORY_FLAG_STREAMv2 (1U << 1) /* stream is v2 */ #define XL_MANDATORY_FLAG_ALL (XL_MANDATORY_FLAG_JSON) struct save_file_header { char magic[32]; /* savefileheader_magic */ -- 1.7.10.4 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v2 06/27] tools/libxl: Split libxl__domain_create_state.restore_fd in two
In a future patch, we shall support automatically converting a legacy stream to a v2 stream, in which case libxc needs to read from a different fd. Simply overwriting restore_fd does not work; the two fd's have different circumstances. The restore_fd needs to be returned to its original state before libxl_domain_create_restore() returns, while in the converted case, the fd needs allocating and deallocating appropriately. No functional change. Signed-off-by: Andrew Cooper CC: Ian Campbell CC: Ian Jackson CC: Wei Liu --- New in v2 --- tools/libxl/libxl_create.c |2 +- tools/libxl/libxl_internal.h |2 +- tools/libxl/libxl_save_callout.c |2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c index 61515da..be13204 100644 --- a/tools/libxl/libxl_create.c +++ b/tools/libxl/libxl_create.c @@ -1525,7 +1525,7 @@ static int do_domain_create(libxl_ctx *ctx, libxl_domain_config *d_config, cdcs->dcs.guest_config = d_config; libxl_domain_config_init(&cdcs->dcs.guest_config_saved); libxl_domain_config_copy(ctx, &cdcs->dcs.guest_config_saved, d_config); -cdcs->dcs.restore_fd = restore_fd; +cdcs->dcs.restore_fd = cdcs->dcs.libxc_fd = restore_fd; if (restore_fd > -1) cdcs->dcs.restore_params = *params; cdcs->dcs.callback = domain_create_cb; diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h index 5ab945a..588cfb8 100644 --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -3216,7 +3216,7 @@ struct libxl__domain_create_state { libxl__ao *ao; libxl_domain_config *guest_config; libxl_domain_config guest_config_saved; /* vanilla config */ -int restore_fd; +int restore_fd, libxc_fd; libxl_domain_restore_params restore_params; libxl__domain_create_cb *callback; libxl_asyncprogress_how aop_console_how; diff --git a/tools/libxl/libxl_save_callout.c b/tools/libxl/libxl_save_callout.c index 80aae1b..1136b79 100644 --- a/tools/libxl/libxl_save_callout.c +++ 
b/tools/libxl/libxl_save_callout.c @@ -48,7 +48,7 @@ void libxl__xc_domain_restore(libxl__egc *egc, libxl__domain_create_state *dcs, /* Convenience aliases */ const uint32_t domid = dcs->guest_domid; -const int restore_fd = dcs->restore_fd; +const int restore_fd = dcs->libxc_fd; libxl__domain_build_state *const state = &dcs->build_state; unsigned cbflags = libxl__srm_callout_enumcallbacks_restore -- 1.7.10.4 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v2 07/27] tools/libxl: Extra management APIs for the save helper
With migration v2, there are several moving parts needing to be juggled at once. This requires the error handling logic to be able to query the state of each moving part, possibly before they have been started, and be able to cancel them. Signed-off-by: Andrew Cooper CC: Ian Campbell CC: Ian Jackson CC: Wei Liu --- Since v1: * Add an _init() function which allows _inuse() to be safe to call even before the save helper has started. --- tools/libxl/libxl_internal.h |9 + tools/libxl/libxl_save_callout.c | 17 ++--- 2 files changed, 23 insertions(+), 3 deletions(-) diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h index 588cfb8..4bd6ea1 100644 --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -3272,6 +3272,15 @@ _hidden void libxl__xc_domain_restore(libxl__egc *egc, _hidden void libxl__xc_domain_restore_done(libxl__egc *egc, void *dcs_void, int rc, int retval, int errnoval); +_hidden void libxl__save_helper_init(libxl__save_helper_state *shs); +_hidden void libxl__save_helper_abort(libxl__egc *egc, + libxl__save_helper_state *shs); + +static inline bool libxl__save_helper_inuse(const libxl__save_helper_state *shs) +{ +return libxl__ev_child_inuse(&shs->child); +} + /* Each time the dm needs to be saved, we must call suspend and then save */ _hidden int libxl__domain_suspend_device_model(libxl__gc *gc, libxl__domain_suspend_state *dss); diff --git a/tools/libxl/libxl_save_callout.c b/tools/libxl/libxl_save_callout.c index 1136b79..cd18cd2 100644 --- a/tools/libxl/libxl_save_callout.c +++ b/tools/libxl/libxl_save_callout.c @@ -146,6 +146,13 @@ void libxl__xc_domain_saverestore_async_callback_done(libxl__egc *egc, shs->egc = 0; } +void libxl__save_helper_init(libxl__save_helper_state *shs) +{ +libxl__ao_abortable_init(&shs->abrt); +libxl__ev_fd_init(&shs->readable); +libxl__ev_child_init(&shs->child); +} + /*- helper execution -*/ static void run_helper(libxl__egc *egc, libxl__save_helper_state *shs, @@ -167,9 +174,7 @@ static 
void run_helper(libxl__egc *egc, libxl__save_helper_state *shs, shs->rc = 0; shs->completed = 0; shs->pipes[0] = shs->pipes[1] = 0; -libxl__ao_abortable_init(&shs->abrt); -libxl__ev_fd_init(&shs->readable); -libxl__ev_child_init(&shs->child); +libxl__save_helper_init(shs); shs->abrt.ao = shs->ao; shs->abrt.callback = helper_stop; @@ -279,6 +284,12 @@ static void helper_stop(libxl__egc *egc, libxl__ao_abortable *abrt, int rc) libxl__kill(gc, shs->child.pid, SIGTERM, "save/restore helper"); } +void libxl__save_helper_abort(libxl__egc *egc, + libxl__save_helper_state *shs) +{ +helper_stop(egc, &shs->abrt, ERROR_FAIL); +} + static void helper_stdout_readable(libxl__egc *egc, libxl__ev_fd *ev, int fd, short events, short revents) { -- 1.7.10.4 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v2 09/27] docs: Libxl migration v2 stream specification
Signed-off-by: Andrew Cooper CC: Ian Campbell CC: Ian Jackson CC: Wei Liu --- docs/specs/libxl-migration-stream.pandoc | 205 ++ 1 file changed, 205 insertions(+) create mode 100644 docs/specs/libxl-migration-stream.pandoc diff --git a/docs/specs/libxl-migration-stream.pandoc b/docs/specs/libxl-migration-stream.pandoc new file mode 100644 index 000..7235317 --- /dev/null +++ b/docs/specs/libxl-migration-stream.pandoc @@ -0,0 +1,205 @@ +% LibXenLight Domain Image Format +% Andrew Cooper <> +% Draft B + +Introduction + + +For the purposes of this document, `xl` is used as a representation of any +implementer of the `libxl` API. `xl` should be considered completely +interchangeable with alternates, such as `libvirt` or `xenopsd-xl`. + +Purpose +--- + +The _domain image format_ is the context of a running domain used for +snapshots of a domain or for transferring domains between hosts during +migration. + +There are a number of problems with the domain image format used in Xen 4.5 +and earlier (the _legacy format_) + +* There is no `libxl` context information. `xl` is required to send certain + pieces of `libxl` context itself. + +* The contents of the stream is passed directly through `libxl` to `libxc`. + The legacy `libxc` format contained some information which belonged at the + `libxl` level, resulting in awkward layer violation to return the + information back to `libxl`. + +* The legacy `libxc` format was inextensible, causing inextensibility in the + legacy `libxl` handling. + +This design addresses the above points, allowing for a completely +self-contained, extensible stream with each layer responsibile for its own +appropriate information. + + +Not Yet Included + + +The following features are not yet fully specified and will be +included in a future draft. + +* Remus + +* ARM + + +Overview + + +The image format consists of a _Header_, followed by 1 or more _Records_. +Each record consists of a type and length field, followed by any type-specific +data. 
+ +\clearpage + +Header +== + +The header identifies the stream as a `libxl` stream, including the version of +this specification that it complies with. + +All fields in this header shall be in _big-endian_ byte order, regardless of +the setting of the endianness bit. + + 0 1 2 3 4 5 6 7 octet ++-+ +| ident | ++---+-+ +| version | options | ++---+-+ + + +Field Description +--- +ident 0x4c6962786c466d74 ("LibxlFmt" in ASCII). + +version 0x0002. The version of this specification. + +options bit 0: Endianness.0 = little-endian, 1 = big-endian. + +bit 1: Legacy Format. If set, this stream was created by + the legacy conversion tool. + +bits 2-31: Reserved. + + +The endianness shall be 0 (little-endian) for images generated on an +i386, x86_64, or arm host. + +\clearpage + + +Records +=== + +A record has a record header, type specific data and a trailing footer. If +`length` is not a multiple of 8, the body is padded with zeroes to align the +end of the record on an 8 octet boundary. + + 0 1 2 3 4 5 6 7 octet ++---+-+ +| type | body_length | ++---+---+-+ +| body... | +... +| | padding (0 to 7 octets) | ++---+-+ + + +FieldDescription +--- --- +type 0x: END + + 0x0001: LIBXC_CONTEXT + + 0x0002: XENSTORE_DATA + + 0x0003: EMULATOR_CONTEXT + + 0x0004 - 0x7FFF: Reserved for future _mandatory_ + records. + + 0x8000 - 0x: Reserved for future _optional_ + records. + +body_length Length in octets of the record body. + +body Content of the record. + +padding 0 to 7 octets of zeros to pad the whole record to a multiple + of 8 octets. + + +\clearpage + +END + + +A end record marks the end of the image, and shall be the final record +in the stream. + + 0 1 2 3 4 5 6 7 octet
Re: [Xen-devel] [v7][PATCH 16/16] tools: parse to enable new rdm policy parameters
Tiejun Chen writes ("[v7][PATCH 16/16] tools: parse to enable new rdm policy parameters"): > This patch parses to enable user configurable parameters to specify > RDM resource and according policies, > > Global RDM parameter: > rdm = "strategy=host,policy=strict/relaxed" > Per-device RDM parameter: > pci = [ 'sbdf, rdm_policy=strict/relaxed' ] > > Default per-device RDM policy is same as default global RDM policy as being > 'relaxed'. And the per-device policy would override the global policy like > others. Thanks for this. I have found a couple of things in this patch which I would like to see improved. See below. Again, given how late I am, I do not feel that I should be nacking it at this time. You have a tools ack from Wei, so my comments are not a blocker for this series. But if you need to respin, please take these comments into account, and consider which are feasible to fix in the time available. If you are respinning this series targeting Xen 4.7 or later, please address all of the points I make below. Thanks. The first issue (which would really be relevant to the documentation patch) is that the documentation is in a separate commit. There are sometimes valid reasons for doing this. I'm not sure if they apply, but if they do this should be explained in one of the commit messages. If this was done I'm afraid I have missed it. > +}else if ( !strcmp(optkey, "rdm_policy") ) { > +if ( !strcmp(tok, "strict") ) { > +pcidev->rdm_policy = LIBXL_RDM_RESERVE_POLICY_STRICT; > +} else if ( !strcmp(tok, "relaxed") ) { > +pcidev->rdm_policy = > LIBXL_RDM_RESERVE_POLICY_RELAXED; > +} else { > +XLU__PCI_ERR(cfg, "%s is not an valid PCI RDM > property" > + " policy: 'strict' or 'relaxed'.", > + tok); > +goto parse_error; > +} This section has coding style (whitespace) problems and long lines. If you need to respin, please fix them. 
> +for (tok = ptr, end = ptr + strlen(ptr) + 1; ptr < end; ptr++) { > +switch(state) { > +case STATE_TYPE: > +if (*ptr == '=') { > +state = STATE_RDM_STRATEGY; > +*ptr = '\0'; > +if (strcmp(tok, "strategy")) { > +XLU__PCI_ERR(cfg, "Unknown RDM state option: %s", tok); > +goto parse_error; > +} > +tok = ptr + 1; > +} This code is extremely repetitive. Really I would prefer that this parsing was done with a miniature flex parser, rather than ad-hoc pointer arithmetic and use of strtok. Thanks, Ian. ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [v7][PATCH 13/16] libxl: construct e820 map with RDM information for HVM guest
Tiejun Chen writes ("[v7][PATCH 13/16] libxl: construct e820 map with RDM information for HVM guest"): > Here we'll construct a basic guest e820 table via > XENMEM_set_memory_map. This table includes lowmem, highmem > and RDMs if they exist, and hvmloader would need this info > later. > > Note this guest e820 table would be same as before if the > platform has no any RDM or we disable RDM (by default). ... > tools/libxl/libxl_dom.c | 5 +++ > tools/libxl/libxl_internal.h | 24 + > tools/libxl/libxl_x86.c | 83 > ... > diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c > index 62ef120..41da479 100644 > --- a/tools/libxl/libxl_dom.c > +++ b/tools/libxl/libxl_dom.c > @@ -1004,6 +1004,11 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid, > goto out; > } > > +if (libxl__domain_construct_e820(gc, d_config, domid, &args)) { > +LOG(ERROR, "setting domain memory map failed"); > +goto out; > +} This is platform-independent code, isn't it ? In which case this will break the build on ARM, I think. Would an ARM maintainer please confirm. Aside from that I have no issues with this patch. Ian. ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [v7][PATCH 12/16] tools: introduce a new parameter to set a predefined rdm boundary
Tiejun Chen writes ("[v7][PATCH 12/16] tools: introduce a new parameter to set a predefined rdm boundary"): > Previously we always fix that predefined boundary as 2G to handle > conflict between memory and rdm, but now this predefined boundary > can be changed with the parameter "rdm_mem_boundary" in .cfg file. > > CC: Ian Jackson > CC: Stefano Stabellini > CC: Ian Campbell > CC: Wei Liu > Acked-by: Wei Liu > Signed-off-by: Tiejun Chen Acked-by: Ian Jackson ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [v7][PATCH 11/16] tools/libxl: detect and avoid conflicts with RDM
Tiejun Chen writes ("[v7][PATCH 11/16] tools/libxl: detect and avoid conflicts with RDM"): > While building a VM, HVM domain builder provides struct hvm_info_table{} > to help hvmloader. Currently it includes two fields to construct guest > e820 table by hvmloader, low_mem_pgend and high_mem_pgend. So we should > check them to fix any conflict with RDM. ... > CC: Ian Jackson > CC: Stefano Stabellini > CC: Ian Campbell > CC: Wei Liu > Acked-by: Wei Liu > Signed-off-by: Tiejun Chen > Reviewed-by: Kevin Tian I have found a few things in this patch which I would like to see improved. See below. Given how late I am with this review, I do not feel that I should be nacking it at this time. You have a tools ack from Wei, so my comments are not a blocker for this series. But if you need to respin, please take these comments into account, and consider which are feasible to fix in the time available. If you are respinning this series targeting Xen 4.7 or later, please address all of the points I make below. Thanks. > +int libxl__domain_device_construct_rdm(libxl__gc *gc, > + libxl_domain_config *d_config, > + uint64_t rdm_mem_boundary, > + struct xc_hvm_build_args *args) ... > +uint64_t highmem_end = args->highmem_end ? args->highmem_end : (1ull\ <<32); ... > +if (strategy == LIBXL_RDM_RESERVE_STRATEGY_IGNORE && !d_config->num_\ pcidevs) There are quite a few of these long lines, which should be wrapped. See tools/libxl/CODING_STYLE. > +d_config->num_rdms = nr_entries; > +d_config->rdms = libxl__realloc(NOGC, d_config->rdms, > +d_config->num_rdms * sizeof(libxl_device_rdm)); This code is remarkably similar to a function later on which adds an rdm. Please can you factor it out. > +} else > +d_config->num_rdms = 0; Please can you put { } around the else block too. I don't think this mixed style is good. 
> +for (j = 0; j < d_config->num_rdms; j++) { > +if (d_config->rdms[j].start == > + (uint64_t)xrdm[0].start_pfn << XC_PAGE_SHIFT) This construct (uint64_t)some_pfn << XC_PAGE_SHIFT appears an awful lot. I would prefer it if it were done in an inline function (or maybe a macro). > +libxl_domain_build_info *const info = &d_config->b_info; > +/* > + * Currently we fix this as 2G to guarantte how to handle ^ Should read "guarantee". > +ret = libxl__domain_device_construct_rdm(gc, d_config, > + rdm_mem_boundary, > + &args); > +if (ret) { > +LOG(ERROR, "checking reserved device memory failed"); > +goto out; > +} `rc' should be used here rather than `ret'. (It is unfortunate that this function has poor style already, but it would be best not to make it worse.) Thanks, Ian. ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v8 1/4] pci: add PCI_SBDF and PCI_SEG macros
- wei.l...@citrix.com wrote: > On Thu, Jul 09, 2015 at 05:00:45PM +0100, Jan Beulich wrote: > > >>> On 09.07.15 at 17:53, wrote: > > > - jbeul...@suse.com wrote: > > >> >>> On 09.07.15 at 14:07, wrote: > > >> > You are right, it needs to be rebased. I can post later rebased > on > > >> memory > > >> > leak fix version, if you thin its a way to go. > > >> > > >> I didn't look at v9 yet, and can't predict when I will be able > to. > > > > > > Would you like me to post v10 with memory leak patch included in > the > > > patchset before you start looking at v9? > > > > If there is a dependency on the changes in the leak fix v6, then > > this would be a good idea. If not, you can keep things as they are > > now. I view the entire set more as a bug fix than a feature anyway, > > and hence see no reason not to get this in after the freeze. But > I'm > > adding Wei just in case... > > > Thanks Jan. The dependency exists on memory leak patch, so I will add it to this series and squash the first patch from v9. > I just looked at v9. The first three patches are quite mechanical. > The > fourth patch is relatively bigger but it's also quite straightforward > (mostly parsing input). All in all, this series itself is > self-contained. > > I'm don't think OSSTest is able to test that, so it would not cause > visible regression on our side. > > I also agree it's a bug fix. Preferably this series should be applied > before first RC. > > Wei. Thank you Wei. > > > Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [v7][PATCH 10/16] tools: introduce some new parameters to set rdm policy
Tiejun Chen writes ("[v7][PATCH 10/16] tools: introduce some new parameters to set rdm policy"): > This patch introduces user configurable parameters to specify RDM > resource and according policies, ... > int libxl__device_pci_setdefault(libxl__gc *gc, libxl_device_pci *pci) > { > +/* We'd like to force reserve rdm specific to a device by default.*/ > +if ( pci->rdm_policy == LIBXL_RDM_RESERVE_POLICY_INVALID) ^ I have just spotted that spurious whitespace. However I won't block this for that. Acked-by: Ian Jackson (actually). I would appreciate it if you could ensure that this is fixed in any repost. You may retain my ack if you do that. Committers should feel free to fix it on commit. Thanks, Ian. ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH 4/9] libxl: event tests: Improve Makefile doc comment
Including the explanation of how to run these tests. Signed-off-by: Ian Jackson --- tools/libxl/Makefile |8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile index cc9c152..44a4da7 100644 --- a/tools/libxl/Makefile +++ b/tools/libxl/Makefile @@ -109,7 +109,13 @@ LIBXL_TESTS += timedereg # "outside libxl" file is compiled exactly like a piece of application # code. They must share information via explicit libxl entrypoints. # Unlike proper parts of libxl, it is permissible for libxl_test_FOO.c -# to use private global variables for its state. +# to use private global variables for its state. Note that all the +# "inside" parts are compiled into a single test library, so their +# symbol names must be unique. +# +# To run these tests, either use LD_PRELOAD to get libxenlight_test.so +# loaded, or rename it to libxenlight.so so it is the target of the +# appropriate symlinks. LIBXL_TEST_OBJS += $(foreach t, $(LIBXL_TESTS),libxl_test_$t.o) TEST_PROG_OBJS += $(foreach t, $(LIBXL_TESTS),test_$t.o) test_common.o -- 1.7.10.4 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH 6/9] libxl: event tests: Provide libxl_test_fdevent
We are going to use this shortly. But, it is nicely self-contained. Signed-off-by: Ian Jackson --- tools/libxl/Makefile |2 +- tools/libxl/libxl_test_fdevent.c | 79 ++ tools/libxl/libxl_test_fdevent.h | 12 ++ 3 files changed, 92 insertions(+), 1 deletion(-) create mode 100644 tools/libxl/libxl_test_fdevent.c create mode 100644 tools/libxl/libxl_test_fdevent.h diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile index 512b0e1..b92809c 100644 --- a/tools/libxl/Makefile +++ b/tools/libxl/Makefile @@ -101,7 +101,7 @@ LIBXL_OBJS += _libxl_types.o libxl_flask.o _libxl_types_internal.o LIBXL_TESTS += timedereg LIBXL_TESTS_PROGS = $(LIBXL_TESTS) -LIBXL_TESTS_INSIDE = $(LIBXL_TESTS) +LIBXL_TESTS_INSIDE = $(LIBXL_TESTS) fdevent # Each entry FOO in LIBXL_TESTS has two main .c files: # libxl_test_FOO.c "inside libxl" code to support the test case diff --git a/tools/libxl/libxl_test_fdevent.c b/tools/libxl/libxl_test_fdevent.c new file mode 100644 index 000..2d875d9 --- /dev/null +++ b/tools/libxl/libxl_test_fdevent.c @@ -0,0 +1,79 @@ +/* + * fdevent test helpr for the libxl event system + */ + +#include "libxl_internal.h" + +#include "libxl_test_fdevent.h" + +typedef struct { +libxl__ao *ao; +libxl__ev_fd fd; +libxl__ao_abortable abrt; +} libxl__test_fdevent; + +static void fdevent_complete(libxl__egc *egc, libxl__test_fdevent *tfe, + int rc); + +static void tfe_init(libxl__test_fdevent *tfe, libxl__ao *ao) +{ +tfe->ao = ao; +libxl__ev_fd_init(&tfe->fd); +libxl__ao_abortable_init(&tfe->abrt); +} + +static void tfe_cleanup(libxl__gc *gc, libxl__test_fdevent *tfe) +{ +libxl__ev_fd_deregister(gc, &tfe->fd); +libxl__ao_abortable_deregister(&tfe->abrt); +} + +static void tfe_fd_cb(libxl__egc *egc, libxl__ev_fd *ev, + int fd, short events, short revents) +{ +libxl__test_fdevent *tfe = CONTAINER_OF(ev,*tfe,fd); +STATE_AO_GC(tfe->ao); +fdevent_complete(egc, tfe, 0); +} + +static void tfe_abrt_cb(libxl__egc *egc, libxl__ao_abortable *abrt, +int rc) +{ +libxl__test_fdevent *tfe = 
CONTAINER_OF(abrt,*tfe,abrt); +STATE_AO_GC(tfe->ao); +fdevent_complete(egc, tfe, rc); +} + +static void fdevent_complete(libxl__egc *egc, libxl__test_fdevent *tfe, + int rc) +{ +STATE_AO_GC(tfe->ao); +tfe_cleanup(gc, tfe); +libxl__ao_complete(egc, ao, rc); +} + +int libxl_test_fdevent(libxl_ctx *ctx, int fd, short events, + libxl_asyncop_how *ao_how) +{ +int rc; +libxl__test_fdevent *tfe; + +AO_CREATE(ctx, 0, ao_how); +GCNEW(tfe); + +tfe_init(tfe, ao); + +rc = libxl__ev_fd_register(gc, &tfe->fd, tfe_fd_cb, fd, events); +if (rc) goto out; + +tfe->abrt.ao = ao; +tfe->abrt.callback = tfe_abrt_cb; +rc = libxl__ao_abortable_register(&tfe->abrt); +if (rc) goto out; + +return AO_INPROGRESS; + + out: +tfe_cleanup(gc, tfe); +return AO_CREATE_FAIL(rc); +} diff --git a/tools/libxl/libxl_test_fdevent.h b/tools/libxl/libxl_test_fdevent.h new file mode 100644 index 000..82a307e --- /dev/null +++ b/tools/libxl/libxl_test_fdevent.h @@ -0,0 +1,12 @@ +#ifndef TEST_FDEVENT_H +#define TEST_FDEVENT_H + +#include + +int libxl_test_fdevent(libxl_ctx *ctx, int fd, short events, + libxl_asyncop_how *ao_how) + LIBXL_EXTERNAL_CALLERS_ONLY; +/* This operation waits for one of the poll events to occur on fd, and + * then completes successfully. (Or, it can be aborted.) */ + +#endif /*TEST_FDEVENT_H*/ -- 1.7.10.4 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH 9/9] libxl: event tests: test_timedereg: Fix rc handling
In 31c836f4 "libxl: events: Permit timeouts to signal ao abort", timeout callbacks take an extra rc argument. In that patch the wrong assertion is made about the rc in test_timedereg's `occurs' callback. Fix this to make the test pass again. Signed-off-by: Ian Jackson --- tools/libxl/libxl_test_timedereg.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/libxl/libxl_test_timedereg.c b/tools/libxl/libxl_test_timedereg.c index c464663..a567db6 100644 --- a/tools/libxl/libxl_test_timedereg.c +++ b/tools/libxl/libxl_test_timedereg.c @@ -67,7 +67,7 @@ static void occurs(libxl__egc *egc, libxl__ev_time *ev, int off = ev - &et[0][0]; LOG(DEBUG,"occurs[%d][%d] seq=%d rc=%d", off/NTIMES, off%NTIMES, seq, rc); -assert(!rc); +assert(rc == ERROR_TIMEDOUT); switch (seq) { case 0: -- 1.7.10.4 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel