Re: [Qemu-devel] [PATCH] PPC/KVM: early validation of vcpu id
On Tue, Apr 26, 2016 at 03:41:04PM +0200, Greg Kurz wrote: > The KVM API restricts vcpu ids to be < KVM_CAP_MAX_VCPUS. On PowerPC > targets, depending on the number of threads per core in the host and > in the guest, some topologies do generate higher vcpu ids actually. > When this happens, QEMU bails out with the following error: > > kvm_init_vcpu failed: Invalid argument > > The KVM_CREATE_VCPU ioctl has several EINVAL return paths, so it is > not possible to fully disambiguate. > > This patch adds a check in the code that computes vcpu ids, so that > we can detect the error earlier, and print a friendlier message instead > of calling KVM_CREATE_VCPU with an obviously bogus vcpu id. > > Signed-off-by: Greg KurzApplied to ppc-for-2.7, thanks. I'm still kicking myself for doing the SMT / cpu ids that way way back when. Should have just had a "SET_SMT" ioctl() and allocated the cpu ids sequentially. Too clever by half :( > --- > include/sysemu/kvm.h|2 ++ > kvm-all.c |6 ++ > target-ppc/translate_init.c |8 > 3 files changed, 16 insertions(+) > > diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h > index 0e18f15c9493..27bf50ef379e 100644 > --- a/include/sysemu/kvm.h > +++ b/include/sysemu/kvm.h > @@ -344,6 +344,8 @@ int kvm_arch_init(MachineState *ms, KVMState *s); > > int kvm_arch_init_vcpu(CPUState *cpu); > > +bool kvm_vcpu_id_is_valid(int vcpu_id); > + > /* Returns VCPU ID to be used on KVM_CREATE_VCPU ioctl() */ > unsigned long kvm_arch_vcpu_id(CPUState *cpu); > > diff --git a/kvm-all.c b/kvm-all.c > index e7b66df197fd..3625a3e07457 100644 > --- a/kvm-all.c > +++ b/kvm-all.c > @@ -1459,6 +1459,12 @@ static int kvm_max_vcpus(KVMState *s) > return (ret) ? ret : kvm_recommended_vcpus(s); > } > > +bool kvm_vcpu_id_is_valid(int vcpu_id) > +{ > +KVMState *s = KVM_STATE(current_machine->accelerator); > +return vcpu_id >= 0 && vcpu_id < kvm_max_vcpus(s); > +} > + > static int kvm_init(MachineState *ms) > { > MachineClass *mc = MACHINE_GET_CLASS(ms); > diff --git a/target-ppc/translate_init.c b/target-ppc/translate_init.c > index f51572552bc2..6c89b18a05f9 100644 > --- a/target-ppc/translate_init.c > +++ b/target-ppc/translate_init.c > @@ -9247,6 +9247,14 @@ static void ppc_cpu_realizefn(DeviceState *dev, Error > **errp) > #if !defined(CONFIG_USER_ONLY) > cpu->cpu_dt_id = (cs->cpu_index / smp_threads) * max_smt > + (cs->cpu_index % smp_threads); > + > +if (kvm_enabled() && !kvm_vcpu_id_is_valid(cpu->cpu_dt_id)) { > +error_setg(errp, "Can't create CPU with id %d in KVM", > cpu->cpu_dt_id); > +error_append_hint(errp, "Adjust the number of cpus to %d " > + "or try to raise the number of threads per core\n", > + cpu->cpu_dt_id * smp_threads / max_smt); > +return; > +} > #endif > > if (tcg_enabled()) { > -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson signature.asc Description: PGP signature
Re: [Qemu-devel] [PATCH v2] hw/net/opencores_eth: Allocating Large sized arrays to heap
On Wed, 2016-04-27 at 06:44 +0300, Max Filippov wrote: > Hi Wei, > > On Wed, Apr 27, 2016 at 03:27:47AM +, Wei, Jiangang wrote: > > On Wed, 2016-04-27 at 10:07 +0800, Zhou Jie wrote: > > > static void open_eth_start_xmit(OpenEthState *s, desc *tx) > > > { > > > -uint8_t buf[65536]; > > > +uint8_t *buf = NULL; > > > +uint8_t buffer[0x600]; > > Hi, > > > > I'm curious about 0x600. > > How do you determine this size? > > IMO, Max's suggestion looks more reasonable. > > (1536 bytes, maximal frame length when HUGEN bit is not set in MODER) > > This is the same value. Opencores 10/100 ethernet spec uses both > decimal and hexadecimal notation. I got it Thanks for your reply. Wei
Re: [Qemu-devel] [PATCH v2] hw/net/opencores_eth: Allocating Large sized arrays to heap
Hi Wei, On Wed, Apr 27, 2016 at 03:27:47AM +, Wei, Jiangang wrote: > On Wed, 2016-04-27 at 10:07 +0800, Zhou Jie wrote: > > static void open_eth_start_xmit(OpenEthState *s, desc *tx) > > { > > -uint8_t buf[65536]; > > +uint8_t *buf = NULL; > > +uint8_t buffer[0x600]; > Hi, > > I'm curious about 0x600. > How do you determine this size? > IMO, Max's suggestion looks more reasonable. > (1536 bytes, maximal frame length when HUGEN bit is not set in MODER) This is the same value. Opencores 10/100 ethernet spec uses both decimal and hexadecimal notation. -- Thanks. -- Max
Re: [Qemu-devel] [PATCH v2] hw/net/opencores_eth: Allocating Large sized arrays to heap
On Wed, 2016-04-27 at 10:07 +0800, Zhou Jie wrote: > open_eth_start_xmit has a huge stack usage of 65536 bytes approx. > Moving large arrays to heap to reduce stack usage. > > Signed-off-by: Zhou Jie> --- > hw/net/opencores_eth.c | 11 ++- > 1 file changed, 10 insertions(+), 1 deletion(-) > > diff --git a/hw/net/opencores_eth.c b/hw/net/opencores_eth.c > index c6094fb..fa0a4e7 100644 > --- a/hw/net/opencores_eth.c > +++ b/hw/net/opencores_eth.c > @@ -483,7 +483,8 @@ static NetClientInfo net_open_eth_info = { > > static void open_eth_start_xmit(OpenEthState *s, desc *tx) > { > -uint8_t buf[65536]; > +uint8_t *buf = NULL; > +uint8_t buffer[0x600]; Hi, I'm curious about 0x600. How do you determine this size? IMO, Max's suggestion looks more reasonable. (1536 bytes, maximal frame length when HUGEN bit is not set in MODER) Regards, wei > unsigned len = GET_FIELD(tx->len_flags, TXD_LEN); > unsigned tx_len = len; > > @@ -498,6 +499,11 @@ static void open_eth_start_xmit(OpenEthState *s, desc > *tx) > > trace_open_eth_start_xmit(tx->buf_ptr, len, tx_len); > > +if (tx_len > 0x600) { > +buf = g_new(uint8_t, tx_len); > +} else { > +buf = buffer; > +} > if (len > tx_len) { > len = tx_len; > } > @@ -506,6 +512,9 @@ static void open_eth_start_xmit(OpenEthState *s, desc *tx) > memset(buf + len, 0, tx_len - len); > } > qemu_send_packet(qemu_get_queue(s->nic), buf, tx_len); > +if (tx_len > 0x600) { > +g_free(buf); > +} > > if (tx->len_flags & TXD_WR) { > s->tx_desc = 0;
[Qemu-devel] 'tcg fatal error' with qemu v2.6.0-rc3 (bisected)
Hi, when running qemu version 2.6.0-rc3, I get the following error message. /opt/buildbot/qemu/qemu/tcg/tcg.c:1769: tcg fatal error qemu command line is as follows. qemu-system-ppc64 -M mpc8544ds \ -cpu e5500 \ -m 1024 -kernel arch/powerpc/boot/uImage -initrd busybox-ppc.cpio \ -nographic -vga none -monitor null -no-reboot \ --append "rdinit=/sbin/init console=tty console=ttyS0" \ -machine "dt_compatible=fsl,,P5020DS" Files are from my test suite at https://github.com/groeck/linux-build-test. Bisect points to commit 40ae5c62ebaaf7d9d3b93b88c2d32bf6342f7889 ("tcg: Introduce temp_load"). Bisect log is attached. Guenter --- # bad: [f419a626c76bcb26697883af702862e8623056f9] usb/uhci: move pid check # good: [a8c40fa2d667e585382080db36ac44e216b37a1c] Update version for v2.5.0 release git bisect start 'HEAD' 'v2.5.0' # bad: [253785e3b96f48c52568c312cec0a5ec596c527f] scripts/feature_to_c.sh: Include qemu/osdep.h rather than config.h git bisect bad 253785e3b96f48c52568c312cec0a5ec596c527f # good: [f4109dba216f2df61a6098fdd7a6f2d2be4ac848] scripts/kvm/kvm_stat: Moved DebugfsProvider git bisect good f4109dba216f2df61a6098fdd7a6f2d2be4ac848 # good: [8983b670f62ab5e5e8dd2690bf8304123651bfe5] block: qemu-iotests - add test for snapshot, commit, snapshot bug git bisect good 8983b670f62ab5e5e8dd2690bf8304123651bfe5 # good: [d7bea75d35a44023efc9d481d3a1a2600677b2ef] qapi: Avoid use of misnamed DO_UPCAST() git bisect good d7bea75d35a44023efc9d481d3a1a2600677b2ef # bad: [c9f19dff101e2c2cf3fa3967eceec2833e845e40] Merge remote-tracking branch 'remotes/bonzini/tags/for-upstream' into staging git bisect bad c9f19dff101e2c2cf3fa3967eceec2833e845e40 # bad: [977a82ab56daac83623d730174f47d5a7edd73c9] configure: sanity check the glib library that pkg-config finds git bisect bad 977a82ab56daac83623d730174f47d5a7edd73c9 # good: [a86156401559cb4401cf9ecc704faeab6fc8bb19] qmp: Fix reference-counting of qnull on empty output visit git bisect good a86156401559cb4401cf9ecc704faeab6fc8bb19 # good: [12b9b11a2743002232098afb41810f1c0cb211a0] tcg: Change temp_sync argument to TCGTemp git bisect good 12b9b11a2743002232098afb41810f1c0cb211a0 # bad: [ac1be2ae6b2995b99430c48329eb971b0281acf1] Merge remote-tracking branch 'remotes/armbru/tags/pull-qapi-2016-02-09' into staging git bisect bad ac1be2ae6b2995b99430c48329eb971b0281acf1 # good: [423aeaf219890e8a7311dbeef1a925020027c2ea] qapi: Add missing JSON files in build dependencies git bisect good 423aeaf219890e8a7311dbeef1a925020027c2ea # bad: [40ae5c62ebaaf7d9d3b93b88c2d32bf6342f7889] tcg: Introduce temp_load git bisect bad 40ae5c62ebaaf7d9d3b93b88c2d32bf6342f7889 # good: [b13eb728d33deaa53efc0dcef557da998e6ec40e] tcg: Change temp_save argument to TCGTemp git bisect good b13eb728d33deaa53efc0dcef557da998e6ec40e # first bad commit: [40ae5c62ebaaf7d9d3b93b88c2d32bf6342f7889] tcg: Introduce temp_load
Re: [Qemu-devel] [PATCH v2] hw/net/opencores_eth: Allocating Large sized arrays to heap
On 2016/4/27 10:46, Max Filippov wrote: Hi Zhou, On Wed, Apr 27, 2016 at 10:18:56AM +0800, Zhou Jie wrote: When I committed another patch which named as "hw/net/virtio-net: Allocating Large sized arrays to heap" . Christian Borntraeger said that 16k is usually perfectly fine for a userspace stack and doing allocations in a hot path might actually hurt performance. Although the size is 65536 bytes here, I think open_eth_start_xmit is in a hot path. So, it is OK, if you think that this patch should not be applied. With Linux as guest OS we shouldn't see any allocations as it doesn't send huge packets, so I think this patch is fine. I can take it through the xtensa tree if you don't have other plan. OK, Thanks Sincerely, Zhou Jie
Re: [Qemu-devel] [PATCH v2] hw/net/opencores_eth: Allocating Large sized arrays to heap
Hi Zhou, On Wed, Apr 27, 2016 at 10:18:56AM +0800, Zhou Jie wrote: >When I committed another patch which named as >"hw/net/virtio-net: Allocating Large sized arrays to heap" . > >Christian Borntraeger said that 16k is usually perfectly fine >for a userspace stack and doing allocations in a hot path >might actually hurt performance. > >Although the size is 65536 bytes here, >I think open_eth_start_xmit is in a hot path. >So, it is OK, if you think that this patch should not be applied. With Linux as guest OS we shouldn't see any allocations as it doesn't send huge packets, so I think this patch is fine. I can take it through the xtensa tree if you don't have other plan. -- Thanks. -- Max
Re: [Qemu-devel] [PATCH v3 11/11] translate-all: add tb hash bucket info to 'info jit' dump
On Sun, Apr 24, 2016 at 18:06:51 -0400, Emilio G. Cota wrote: > On Sun, Apr 24, 2016 at 12:46:08 -0700, Richard Henderson wrote: > > On 04/22/2016 04:57 PM, Emilio G. Cota wrote: > > >On Fri, Apr 22, 2016 at 12:59:52 -0700, Richard Henderson wrote: > > >>FWIW, so that I could get an idea of how the stats change as we improve > > >>the > > >>hashing, I inserted the attachment 1 patch between patches 5 and 6, and > > >>with > > >>attachment 2 attempting to fix the accounting for patches 9 and 10. > > > > > >For qht, I dislike the approach of reporting "avg chain" per-element, > > >instead of per-bucket. Performance for a bucket whose entries are > > >all valid is virtually the same as that of a bucket that only > > >has one valid element; thus, with per-bucket reporting, we'd say that > > >the chain lenght is 1 in both cases, i.e. "perfect". With per-element > > >reporting, we'd report 4 (on a 64-bit host, since that's the value of > > >QHT_BUCKET_ENTRIES) when the bucket is full, which IMO gives the > > >wrong idea (users would think they're in trouble, when they're not). > > > > But otherwise you have no way of knowing how full the buckets are. The > > bucket size is just something that you have to keep in mind. > > I'll make some changes in v4 that I think will address both your and > my concerns: > - Report the number of empty buckets > - Do not count empty buckets when reporting avg bucket chain length > - Report average bucket occupancy (in %, so that QHT_BUCKET_ENTRIES > does not have to be reported.) How does the following look? Example with good hashing, i.e. func5(phys_pc, pc, flags): TB count704242/1342156 [...] TB hash buckets 386484/524288 (73.72% used) TB hash occupancy 32.57% avg chain occupancy. Histogram: 0-10%▆▁█▁▁▅▁▃▁▁90-100% TB hash avg chain 1.02 buckets. Histogram: 1█▁▁3 Example with bad hashing, i.e. func5(phys_pc, 0, 0): TB count710748/1342156 [...] TB hash buckets 113569/524288 (21.66% used) TB hash occupancy 10.24% avg chain occupancy. Histogram: 0-10%█▁90-100% TB hash avg chain 2.11 buckets. Histogram: 1▇▁93 Note that: - "TB hash avg chain" does _not_ count empty buckets. This gives an idea of how many buckets a typical hit goes through. - "TB hash occupancy" _does_ count empty buckets. It is called "avg chain occupancy" and not "avg occupancy" because the counts are only valid per-chain due to the seqlock protecting each chain. Thanks, Emilio
Re: [Qemu-devel] [PATCH v2] hw/net/opencores_eth: Allocating Large sized arrays to heap
On Wed, Apr 27, 2016 at 10:07:48AM +0800, Zhou Jie wrote: > open_eth_start_xmit has a huge stack usage of 65536 bytes approx. > Moving large arrays to heap to reduce stack usage. > > Signed-off-by: Zhou Jie> --- > hw/net/opencores_eth.c | 11 ++- > 1 file changed, 10 insertions(+), 1 deletion(-) Acked-by: Max Filippov -- Thanks. -- Max
Re: [Qemu-devel] [PATCH v2] hw/net/opencores_eth: Allocating Large sized arrays to heap
Hi Max, When I committed another patch which named as "hw/net/virtio-net: Allocating Large sized arrays to heap" . Christian Borntraeger said that 16k is usually perfectly fine for a userspace stack and doing allocations in a hot path might actually hurt performance. Although the size is 65536 bytes here, I think open_eth_start_xmit is in a hot path. So, it is OK, if you think that this patch should not be applied. Sincerely, Zhou Jie On 2016/4/27 10:07, Zhou Jie wrote: open_eth_start_xmit has a huge stack usage of 65536 bytes approx. Moving large arrays to heap to reduce stack usage. Signed-off-by: Zhou Jie--- hw/net/opencores_eth.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/hw/net/opencores_eth.c b/hw/net/opencores_eth.c index c6094fb..fa0a4e7 100644 --- a/hw/net/opencores_eth.c +++ b/hw/net/opencores_eth.c @@ -483,7 +483,8 @@ static NetClientInfo net_open_eth_info = { static void open_eth_start_xmit(OpenEthState *s, desc *tx) { -uint8_t buf[65536]; +uint8_t *buf = NULL; +uint8_t buffer[0x600]; unsigned len = GET_FIELD(tx->len_flags, TXD_LEN); unsigned tx_len = len; @@ -498,6 +499,11 @@ static void open_eth_start_xmit(OpenEthState *s, desc *tx) trace_open_eth_start_xmit(tx->buf_ptr, len, tx_len); +if (tx_len > 0x600) { +buf = g_new(uint8_t, tx_len); +} else { +buf = buffer; +} if (len > tx_len) { len = tx_len; } @@ -506,6 +512,9 @@ static void open_eth_start_xmit(OpenEthState *s, desc *tx) memset(buf + len, 0, tx_len - len); } qemu_send_packet(qemu_get_queue(s->nic), buf, tx_len); +if (tx_len > 0x600) { +g_free(buf); +} if (tx->len_flags & TXD_WR) { s->tx_desc = 0; -- 周潔 Dept 1 No. 6 Wenzhu Road, Nanjing, 210012, China TEL:+86+25-86630566-8557 FUJITSU INTERNAL:7998-8557 E-Mail:zhoujie2...@cn.fujitsu.com
[Qemu-devel] [PATCH v2] hw/net/opencores_eth: Allocating Large sized arrays to heap
open_eth_start_xmit has a huge stack usage of 65536 bytes approx. Moving large arrays to heap to reduce stack usage. Signed-off-by: Zhou Jie--- hw/net/opencores_eth.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/hw/net/opencores_eth.c b/hw/net/opencores_eth.c index c6094fb..fa0a4e7 100644 --- a/hw/net/opencores_eth.c +++ b/hw/net/opencores_eth.c @@ -483,7 +483,8 @@ static NetClientInfo net_open_eth_info = { static void open_eth_start_xmit(OpenEthState *s, desc *tx) { -uint8_t buf[65536]; +uint8_t *buf = NULL; +uint8_t buffer[0x600]; unsigned len = GET_FIELD(tx->len_flags, TXD_LEN); unsigned tx_len = len; @@ -498,6 +499,11 @@ static void open_eth_start_xmit(OpenEthState *s, desc *tx) trace_open_eth_start_xmit(tx->buf_ptr, len, tx_len); +if (tx_len > 0x600) { +buf = g_new(uint8_t, tx_len); +} else { +buf = buffer; +} if (len > tx_len) { len = tx_len; } @@ -506,6 +512,9 @@ static void open_eth_start_xmit(OpenEthState *s, desc *tx) memset(buf + len, 0, tx_len - len); } qemu_send_packet(qemu_get_queue(s->nic), buf, tx_len); +if (tx_len > 0x600) { +g_free(buf); +} if (tx->len_flags & TXD_WR) { s->tx_desc = 0; -- 2.5.5
Re: [Qemu-devel] [PATCH v2] vl: change runstate only if new state is different from current state
ping... Thanks Li Zhijian On 04/14/2016 11:25 AM, Li Zhijian wrote: Previously, qemu will abort at following scenario: (qemu) stop (qemu) system_reset (qemu) system_reset (qemu) 2016-04-13T20:54:38.979158Z qemu-system-x86_64: invalid runstate transition: 'prelaunch' -> 'prelaunch' Signed-off-by: Li ZhijianAcked-by: Paolo Bonzini --- from v1 - fix patch title typo 'chanage' -> 'change' - coding sytle: 'return ;' -> 'return;' - add Acked tag vl.c | 4 1 file changed, 4 insertions(+) diff --git a/vl.c b/vl.c index 9df534f..039a353 100644 --- a/vl.c +++ b/vl.c @@ -692,6 +692,10 @@ void runstate_set(RunState new_state) { assert(new_state < RUN_STATE__MAX); +if (current_run_state == new_state) { +return; +} + if (!runstate_valid_transitions[current_run_state][new_state]) { error_report("invalid runstate transition: '%s' -> '%s'", RunState_lookup[current_run_state],
Re: [Qemu-devel] [PATCH v14 13/19] qmp: Tighten output visitor rules
On 04/15/2016 03:02 AM, Markus Armbruster wrote: > Eric Blakewrites: > >> Add a new qmp_output_visitor_reset(), which must be called before >> reusing an exising QmpOutputVisitor on a new root object. Tighten >> assertions to require that qmp_output_get_qobject() can only be >> called after pairing a visit_end_* for every visit_start_* (rather >> than allowing it to return a partially built object), and that it >> must not be called unless at least one visit_type_* or visit_start/ >> visit_end pair has occurred since creation/reset (the accidental >> return of NULL fixed by commit ab8bf1d7 would have been much >> easier to diagnose). >> >> Also, check that we are encountering the expected object or list >> type (both during visit_end*, and also by validating whether 'name' >> was NULL - although the latter may need to change later if we >> improve error messages by always passing in a sensible name). >> This provides protection against mismatched push(struct)/pop(list) >> or push(list)/pop(struct), similar to the qmp-input protection >> added in commit bdd8e6b5. >> >> Signed-off-by: Eric Blake > > As written, the commit message makes me wonder why we add > qmp_output_visitor_reset() in the same commit. I think the reason is > the tightened rules make it necessary. The commit message could make > that clearer by explaining the rule changes first, then point out we > need a reset to comply with the rules. I think I'll try splitting the addition of qmp_output_visitor_reset() into a separate patch. >> @@ -93,6 +92,9 @@ static void qmp_output_add_obj(QmpOutputVisitor *qov, >> const char *name, >> qdict_put_obj(qobject_to_qdict(cur), name, value); >> break; >> case QTYPE_QLIST: >> +/* FIXME: assertion needs adjustment if we fix visit-core >> + * to pass "name.0" style name during lists. */ > > visit-core merely passes through whatever name it gets from the client. > Thus, saying we 'fix visit-core to pass "name.0"' is a bit misleading. > What we'd do is change it to require "name.0", then update its users to > comply. Maybe it's not too inaccurate - the only callers are the generated visit_type_FOOList() functions, but having a common "name.%d" generator in the core may be easier than bloating each generated visit_type_FOOList. > > Moreover, this is a note, not a FIXME: nothing is broken here. The > closest we get to "broken" are the bad error messages, but they're > elsewhere. > > I'd simply drop the comment. But this solution nicely sidesteps the "how will we fix error messages", so I've dropped the comment. > >> +assert(!name); > > PATCH 08 made this part of the contract. It also added a bunch of > contract-checking assertions. Should this one be in PATCH 08, too? Well, it's only weakly part of the contract unless (until?) we fix callers/core to pass in "name.0", and then the assert would trigger. However, checking the assertion in patch 8 is harder, without making the core track whether it is currently in a list or struct visit (that is, the only place where we know whether 'name' should be NULL or not is where we've tracked a stack of our current visit_start_* calls; but the core is not tracking a stack because that would be redundant with the stacks in the qmp visitors). So for now I think I'll just keep it here. >> +++ b/tests/test-qmp-output-visitor.c >> @@ -139,6 +139,7 @@ static void test_visitor_out_enum(TestOutputVisitorData >> *data, >> @@ -455,6 +460,7 @@ static void >> test_visitor_out_alternate(TestOutputVisitorData *data, >> qapi_free_UserDefAlternate(tmp); >> qobject_decref(arg); >> >> +qmp_output_visitor_reset(data->qov); >> tmp = g_new0(UserDefAlternate, 1); >> tmp->type = QTYPE_QDICT; >> tmp->u.udfu.integer = 1; > > How did you find the places that now need qmp_output_visitor_reset()? Ran the test, found what asserted, and added a reset() to make the test pass again. -- Eric Blake eblake redhat com+1-919-301-3266 Libvirt virtualization library http://libvirt.org signature.asc Description: OpenPGP digital signature
Re: [Qemu-devel] [PATCH] hw/net/opencores_eth: Allocating Large sized arrays to heap
On 2016/4/26 12:12, Max Filippov wrote: Hi Zhou, On Tue, Apr 26, 2016 at 6:35 AM, Zhou Jiewrote: open_eth_start_xmit has a huge stack usage of 65536 bytes approx. Moving large arrays to heap to reduce stack usage. It's an exception, not the rule when full 65536 byte long buffer might be needed. Can we do a little better change and not allocate and free this buffer every time unconditionally, but instead make buf smaller (1536 bytes, maximal frame length when HUGEN bit is not set in MODER) and only do allocation when that's not enough? Thank you for your suggestion. I will modify this patch. Sincerely, Zhou Jie -- 周潔 Dept 1 No. 6 Wenzhu Road, Nanjing, 210012, China TEL:+86+25-86630566-8557 FUJITSU INTERNAL:7998-8557 E-Mail:zhoujie2...@cn.fujitsu.com
Re: [Qemu-devel] [PATCH] hw/net/virtio-net: Allocating Large sized arrays to heap
On 2016/4/26 20:42, Michael S. Tsirkin wrote: On Tue, Apr 26, 2016 at 04:05:24PM +0800, Zhou Jie wrote: virtio_net_flush_tx has a huge stack usage of 16392 bytes approx. Moving large arrays to heap to reduce stack usage. Signed-off-by: Zhou JieI don't think it's appropriate for trivial. Also - what's the point, exactly? I think functions should not have very large stack frames. For 64bit machine it will be 32k. But according what Christian Borntraeger said, it doesn't matter. So, considerate that virtio_net_flush_tx is in a hot path, I agree to not change this. Sincerely, Zhou Jie --- hw/net/virtio-net.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c index 5798f87..cab7bbc 100644 --- a/hw/net/virtio-net.c +++ b/hw/net/virtio-net.c @@ -1213,6 +1213,7 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q) VirtIONet *n = q->n; VirtIODevice *vdev = VIRTIO_DEVICE(n); VirtQueueElement *elem; +struct iovec *sg = NULL, *sg2 = NULL; int32_t num_packets = 0; int queue_index = vq2q(virtio_get_queue_index(q->tx_vq)); if (!(vdev->status & VIRTIO_CONFIG_S_DRIVER_OK)) { @@ -1224,10 +1225,12 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q) return num_packets; } +sg = g_new(struct iovec, VIRTQUEUE_MAX_SIZE); +sg2 = g_new(struct iovec, VIRTQUEUE_MAX_SIZE + 1); for (;;) { ssize_t ret; unsigned int out_num; -struct iovec sg[VIRTQUEUE_MAX_SIZE], sg2[VIRTQUEUE_MAX_SIZE + 1], *out_sg; +struct iovec *out_sg; struct virtio_net_hdr_mrg_rxbuf mhdr; elem = virtqueue_pop(q->tx_vq, sizeof(VirtQueueElement)); @@ -1252,7 +1255,7 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q) virtio_net_hdr_swap(vdev, (void *) ); sg2[0].iov_base = sg2[0].iov_len = n->guest_hdr_len; -out_num = iov_copy([1], ARRAY_SIZE(sg2) - 1, +out_num = iov_copy([1], VIRTQUEUE_MAX_SIZE, out_sg, out_num, n->guest_hdr_len, -1); if (out_num == VIRTQUEUE_MAX_SIZE) { @@ -1269,10 +1272,10 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q) */ assert(n->host_hdr_len <= n->guest_hdr_len); if (n->host_hdr_len != n->guest_hdr_len) { -unsigned sg_num = iov_copy(sg, ARRAY_SIZE(sg), +unsigned sg_num = iov_copy(sg, VIRTQUEUE_MAX_SIZE, out_sg, out_num, 0, n->host_hdr_len); -sg_num += iov_copy(sg + sg_num, ARRAY_SIZE(sg) - sg_num, +sg_num += iov_copy(sg + sg_num, VIRTQUEUE_MAX_SIZE - sg_num, out_sg, out_num, n->guest_hdr_len, -1); out_num = sg_num; @@ -1284,6 +1287,8 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q) if (ret == 0) { virtio_queue_set_notification(q->tx_vq, 0); q->async_tx.elem = elem; +g_free(sg); +g_free(sg2); return -EBUSY; } @@ -1296,6 +1301,8 @@ drop: break; } } +g_free(sg); +g_free(sg2); return num_packets; } -- 2.5.5 . -- 周潔 Dept 1 No. 6 Wenzhu Road, Nanjing, 210012, China TEL:+86+25-86630566-8557 FUJITSU INTERNAL:7998-8557 E-Mail:zhoujie2...@cn.fujitsu.com
Re: [Qemu-devel] [PATCH 02/19] block: Add BDS.backing_overridden
On 04/26/2016 03:32 PM, Max Reitz wrote: > If the backing file is overridden, this most probably does change the > guest-visible data of a BDS. Therefore, we will need to consider this in > bdrv_refresh_filename(). > > Adding a new field to the BDS is not nice, but it is very simple and > exactly keeps track of whether the backing file has been overridden. > > Signed-off-by: Max Reitz> --- > block.c | 2 ++ > include/block/block_int.h | 1 + > 2 files changed, 3 insertions(+) Reviewed-by: Eric Blake -- Eric Blake eblake redhat com+1-919-301-3266 Libvirt virtualization library http://libvirt.org signature.asc Description: OpenPGP digital signature
Re: [Qemu-devel] [PATCH 01/19] block: Use children list in bdrv_refresh_filename
On 04/26/2016 03:32 PM, Max Reitz wrote: > bdrv_refresh_filename() should invoke itself recursively on all > children, not just on file. > > With that change, we can remove the manual invocations in blkverify and > quorum. > > Signed-off-by: Max Reitz> --- > block.c | 9 + > block/blkverify.c | 3 --- > block/quorum.c| 1 - > 3 files changed, 5 insertions(+), 8 deletions(-) Reviewed-by: Eric Blake -- Eric Blake eblake redhat com+1-919-301-3266 Libvirt virtualization library http://libvirt.org signature.asc Description: OpenPGP digital signature
Re: [Qemu-devel] [PATCH for-2.7 v2 07/17] rbd: Implement image locking
On Tue, 04/26 10:42, Jason Dillaman wrote: > On Sun, Apr 24, 2016 at 7:42 PM, Fam Zhengwrote: > > On Fri, 04/22 21:57, Jason Dillaman wrote: > >> Since this cannot automatically recover from a crashed QEMU client with an > >> RBD image, perhaps this RBD locking should not default to enabled. > >> Additionally, this will conflict with the "exclusive-lock" feature > >> available since the Ceph Hammer-release since both utilize the same locking > >> construct. > >> > >> As a quick background, the optional exclusive-lock feature can be > >> enabled/disabled on image and safely/automatically handles the case of > >> recovery from a crashed client. Under normal conditions, the RBD > >> exclusive-lock feature automatically acquires the lock upon the first > >> attempt to write to the image and transparently transitions ownership of > >> the lock between two or more clients -- used for QEMU live-migration. > > > > Is it enabled by default? > > > > Starting with the Jewel release of Ceph it is enabled by default. OK, then I'll leave rbd in this QEMU series for now. > > >> > >> I'd be happy to add a new librbd API method to explicitly expose acquiring > >> and releasing the RBD exclusive lock since it certainly solves a couple > >> compromises in our current QEMU integration. > > > > Does the API do enable/disable or acquire/relase? Could you explain the > > differences between it and rbd_lock_exclusive? > > There is already an API for enabling/disabling the exclusive-lock > feature (and associated CLI tooling). This would be a new API method > to explicitly acquire / release the exclusive-lock (instead of > implicitly acquiring the lock upon first write request as it currently > does in order to support QEMU live-migration). > > The rbd_lock_exclusive/rbd_lock_shared are essentially advisory locks. > Nothing stops a client from accessing the image if it doesn't attempt > to acquire the lock (even if another client already has the image > locked exclusively). It also doesn't support automatic recovery from > failed clients. I see, thanks for the explanation! Fam
Re: [Qemu-devel] Updating documentation at http://wiki.qemu.org/download/qemu-doc.html
On 26/04/16 13:37, Jeff Cody wrote: > On Tue, Apr 26, 2016 at 11:02:45AM +0100, Stefan Hajnoczi wrote: >> On Tue, Apr 26, 2016 at 07:50:31AM +0100, Mark Cave-Ayland wrote: >>> Does anyone know if it's possible to update the documentation at the >>> above URL? It appears as a link at http://wiki.qemu-project.org/Manual >>> and seems to be the first hit for most people looking for information on >>> QEMU SPARC emulation, which is frustrating as the information there is >>> well out of date. Presumably this is because the documentation is hosted >>> on the qemu.org domain itself and so ranks higher? >> >> Hi Mark, >> I've CCed Jeff Cody, who is taking over as system administrator for >> QEMU. The http://wiki.qemu.org/download/qemu-doc.html link is just a >> static page from 2010. >> >> It should be possible to rebuild the HTML docs and replace the file. >> Even better would be to update it from a cron job. >> >> Stefan > > Thanks. > > I've update the qemu-doc.html at the above link with the latest output of > "make qemu-doc.html" from git master. I'll set up a daily cronjob to take > care > of that automatically, as well. > > Jeff Hi Jeff, Thanks for getting all this set up - the new documentation update looks good here. Shall I also update the link on http://wiki.qemu-project.org/Manual to reflect the fact that the updates are now generated regularly? "Older version of the above" doesn't quite seem appropriate any more... ATB, Mark.
Re: [Qemu-devel] [PATCH v14 08/19] qapi: Document visitor interfaces, add assertions
On 04/14/2016 09:22 AM, Markus Armbruster wrote: > Eric Blakewrites: > >> The visitor interface for mapping between QObject/QemuOpts/string >> and QAPI is scandalously under-documented, making changes to visitor >> core, individual visitors, and users of visitors difficult to >> coordinate. Among other questions: when is it safe to pass NULL, >> vs. when a string must be provided; which visitors implement which >> callbacks; the difference between concrete and virtual visits. >> >> Correct this by retrofitting proper contracts, and document where some >> of the interface warts remain (for example, we may want to modify >> visit_end_* to require the same 'obj' as the visit_start counterpart, >> so the dealloc visitor can be simplified). Later patches in this >> series will tackle some, but not all, of these warts. >> >> Add assertions to (partially) enforce the contract. Some of these >> were only made possible by recent cleanup commits. >> >> +/* >> + * The QAPI schema defines both a set of C data types, and a QMP wire >> + * format. A QAPI object is formed as a directed acyclic graph of >> + * QAPI values. > > I understand what you're trying to say, but I find the value / object > dichotomy odd. For me, A QAPI object isn't a DAG, it's a node in a DAG. > Perhaps: "QAPI objects can contain references to other QAPI objects, > resulting in a directed acyclic graph." Thanks for a lot of good comments. I'm replying only to the questions you left amidst all the good review. >> +++ b/qapi/qapi-visit-core.c >> @@ -23,6 +23,10 @@ >> void visit_start_struct(Visitor *v, const char *name, void **obj, >> size_t size, Error **errp) >> { >> +if (obj) { >> +assert(size); > > Yes, because the generator puts a dummy member into empty structs. > >> +assert(v->type != VISITOR_OUTPUT || *obj); > > Can you point me to the spot in the contract that requires this? Translation of the assert: If you are using an output visitor, and not doing a virtual walk (obj is non-NULL), then the object must be completely built (*obj is non-NULL). For an input visitor, *obj is NULL on entry (we're allocating it, after all); for the dealloc visitor, *obj may or may not be NULL (since we handle cleanup of partial allocation). In the text, "output visitors (QMP and string) take a completed QAPI graph", but maybe I can further clarify that a completed object means that *obj is non-NULL and all 'has_member' and 'member' members are complete. > > Unlike last time, my remarks are pretty much only about how to say > things, not about what to say. Progress! Yay! Hopefully I'll post v15 soon and we can get this in at the start of the 2.7 cycle. -- Eric Blake eblake redhat com+1-919-301-3266 Libvirt virtualization library http://libvirt.org signature.asc Description: OpenPGP digital signature
Re: [Qemu-devel] [Qemu-arm] [PATCH] target-arm: Fix descriptor address masking in ARM address translation
26.04.2016, 19:35, "Tom Hanson":On 03/21/2016 09:56 AM, Sergey Sorokin wrote: 17.03.2016, 18:24, "Peter Maydell" : On 17 March 2016 at 15:21, Sergey Sorokin wrote:17.03.2016, 14:40, "Peter Maydell" :On 13 March 2016 at 18:28, Sergey Sorokin wrote:If you want to implement the AddressSize checks that's fine,but otherwise please leave this bit of the code alone. You said me that my code is not correct, I have proved that it conforms to the documentation. It's a bit obfuscating when the doc explicitly says to take bits up to 39 from the descriptor, but in QEMU we take bits up to 47 relying on the check in another part of the code, even if both ways are correct.The way the code in QEMU is structured is that we extract thedescriptor field in one go and then will operate on it(checking for need to AddressSize fault, etc) as a secondaction. The field descriptors themselves are the sizes I said.Well, may be it's enough just to change this comment as you intend:- /* The address field in the descriptor goes up to bit 39 for ARMv7- * but up to bit 47 for ARMv8.+ /* The address field in the descriptor goes up to bit 39 for AArch32+ * but up to bit 47 for AArch64. */ The comment is correct as it stands. thanks -- PMM I mean in the patch. We need to fix lower bits in descaddrmask anyway. So: I could describe in the comment, that the descriptor field is up to bit 47 for ARMv8 (as long as you want it), but we use the descaddrmask up to bit 39 for AArch32, because we don't need other bits in that case to construct next descriptor address. It is clearly described in the ARM pseudo-code. Why should we keep in the mask bits from 40 up to 47 if we don't need them? Even if they are all zeroes. It is a bit obfuscating, as I said. I agree with Peter. The original comment is correct.Looking at the TLBRecord AArch32.TranslationTableWalkLD pseudocode, it is treating the AArch32 address as 48 bits long. For example: if !IsZero(baseregister<47:40>) then level = 0; result.addrdesc.fault = AArch32.AddressSizeFault(ipaddress, domain, level, acctype, iswrite, secondstage, s2fs1walk); return result;This requires that an AArch32 address have specific values up through bit 47. There is a newer version of the patch. I'm sorry, I forgot to report here about it.
[Qemu-devel] [PATCH 19/19] iotests: Add quorum case to test 110
Test 110 tests relative backing filenames for complex BDS trees. Add quorum as an example that can never work automatically (without special-casing if all child nodes have the same base directory), and an example on how to make it work manually (using the base-directory option). Signed-off-by: Max Reitz--- tests/qemu-iotests/110 | 48 ++ tests/qemu-iotests/110.out | 9 + 2 files changed, 57 insertions(+) diff --git a/tests/qemu-iotests/110 b/tests/qemu-iotests/110 index ba1b3c6..f1b7b08 100755 --- a/tests/qemu-iotests/110 +++ b/tests/qemu-iotests/110 @@ -30,6 +30,7 @@ status=1 # failure is the default! _cleanup() { _cleanup_test_img +rm -f "$TEST_IMG.copy" } trap "_cleanup; exit \$status" 0 1 2 3 15 @@ -87,6 +88,53 @@ echo # omit the image size; it should work anyway _make_test_img -b "$TEST_IMG_REL.base" +echo +echo '=== Nodes without a common directory ===' +echo + +cp "$TEST_IMG" "$TEST_IMG.copy" + +# Should not work +TEST_IMG="json:{ +'driver': '$IMGFMT', +'file': { +'driver': 'quorum', +'vote-threshold': 1, +'children': [ +{ +'driver': 'file', +'filename': '$TEST_IMG' +}, +{ +'driver': 'file', +'filename': '$TEST_IMG.copy' +} +] +} +}" _img_info | _filter_img_info + +echo + +# Should work +TEST_IMG="json:{ +'driver': '$IMGFMT', +'file': { +'driver': 'quorum', +'base-directory': '$TEST_DIR/', +'vote-threshold': 1, +'children': [ +{ +'driver': 'file', +'filename': '$TEST_IMG' +}, +{ +'driver': 'file', +'filename': '$TEST_IMG.copy' +} +] +} +}" _img_info | _filter_img_info + # success, all done echo '*** done' diff --git a/tests/qemu-iotests/110.out b/tests/qemu-iotests/110.out index 5370bc1..e586538 100644 --- a/tests/qemu-iotests/110.out +++ b/tests/qemu-iotests/110.out @@ -19,4 +19,13 @@ backing file: t.IMGFMT.base (actual path: TEST_DIR/t.IMGFMT.base) === Backing name is always relative to the backed image === Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 backing_file=t.IMGFMT.base + +=== Nodes without a common directory === + +qemu-img: Cannot generate a base directory for quorum nodes + +image: json:{"driver": "IMGFMT", "file": {"children": [{"driver": "file", "filename": "TEST_DIR/t.IMGFMT"}, {"driver": "file", "filename": "TEST_DIR/t.IMGFMT.copy"}], "driver": "quorum", "blkverify": false, "rewrite-corrupted": false, "vote-threshold": 1}} +file format: IMGFMT +virtual size: 64M (67108864 bytes) +backing file: t.IMGFMT.base (actual path: TEST_DIR/t.IMGFMT.base) *** done -- 2.8.0
[Qemu-devel] [PATCH 17/19] block: Use bdrv_dirname() for relative filenames
bdrv_get_full_backing_filename_from_filename() breaks down when it comes to JSON filenames. Using bdrv_dirname() as the basis is better because since we have BDS, we can descend through the BDS tree to the protocol layer, which gives us a greater probability of finding a non-JSON name; also, bdrv_dirname() is more correct as it allows block drivers to override the generation of that directory name in a protocol-specific way. We still need to keep bdrv_get_full_backing_filename_from_filename(), though, because it has valid callers which need it during image creation when no BDS is available yet. This makes a test case in qemu-iotest 110, which was supposed to fail, work. That is actually good, but we need to change the reference output (and the comment in 110) accordingly. Signed-off-by: Max Reitz--- block.c| 20 +++- tests/qemu-iotests/110 | 3 ++- tests/qemu-iotests/110.out | 2 +- 3 files changed, 18 insertions(+), 7 deletions(-) diff --git a/block.c b/block.c index 23962ce..a909b7a 100644 --- a/block.c +++ b/block.c @@ -212,12 +212,22 @@ char *bdrv_get_full_backing_filename_from_filename(const char *backed, static char *bdrv_make_absolute_filename(BlockDriverState *relative_to, const char *filename, Error **errp) { -char *bs_filename = relative_to->exact_filename[0] -? relative_to->exact_filename -: relative_to->filename; +char *dir, *full_name; -return bdrv_get_full_backing_filename_from_filename(bs_filename, filename, -errp); +if (filename[0] == '\0' || path_has_protocol(filename) || +path_is_absolute(filename)) +{ +return g_strdup(filename); +} + +dir = bdrv_dirname(relative_to, errp); +if (!dir) { +return NULL; +} + +full_name = g_strconcat(dir, filename, NULL); +g_free(dir); +return full_name; } char *bdrv_get_full_backing_filename(BlockDriverState *bs, Error **errp) diff --git a/tests/qemu-iotests/110 b/tests/qemu-iotests/110 index 9de7369..ba1b3c6 100755 --- a/tests/qemu-iotests/110 +++ b/tests/qemu-iotests/110 @@ -61,7 +61,8 @@ echo '=== Non-reconstructable filename ===' echo # Across blkdebug without a config file, you cannot reconstruct filenames, so -# qemu is incapable of knowing the directory of the top image +# qemu is incapable of knowing the directory of the top image from the filename +# alone. However, using bdrv_dirname(), it should still work. TEST_IMG="json:{ 'driver': '$IMGFMT', 'file': { diff --git a/tests/qemu-iotests/110.out b/tests/qemu-iotests/110.out index b3584ff..5370bc1 100644 --- a/tests/qemu-iotests/110.out +++ b/tests/qemu-iotests/110.out @@ -14,7 +14,7 @@ backing file: t.IMGFMT.base (actual path: TEST_DIR/t.IMGFMT.base) image: json:{"driver": "IMGFMT", "file": {"set-state.0.event": "read_aio", "image": {"driver": "file", "filename": "TEST_DIR/t.IMGFMT"}, "driver": "blkdebug", "set-state.0.new_state": 42}} file format: IMGFMT virtual size: 64M (67108864 bytes) -backing file: t.IMGFMT.base (cannot determine actual path) +backing file: t.IMGFMT.base (actual path: TEST_DIR/t.IMGFMT.base) === Backing name is always relative to the backed image === -- 2.8.0
[Qemu-devel] [PATCH 14/19] blkverify: Make bdrv_dirname() return NULL
blkverify's BDSs have a file BDS, but we do not want this to be preferred over the raw node. There is no way to decide between the two (and not really a reason to, either), so just return NULL in blkverify's implementation of bdrv_dirname(). Signed-off-by: Max Reitz--- block/blkverify.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/block/blkverify.c b/block/blkverify.c index f445e3e..35d080f 100644 --- a/block/blkverify.c +++ b/block/blkverify.c @@ -338,6 +338,15 @@ static void blkverify_refresh_filename(BlockDriverState *bs, QDict *options) } } +static char *blkverify_dirname(BlockDriverState *bs, Error **errp) +{ +/* In general, there are two BDSs with different dirnames below this one; + * so there is no unique dirname we could return (unless both are equal by + * chance). Therefore, to be consistent, just always return NULL. */ +error_setg(errp, "Cannot generate a base directory for blkverify nodes"); +return NULL; +} + static BlockDriver bdrv_blkverify = { .format_name = "blkverify", .protocol_name= "blkverify", @@ -348,6 +357,7 @@ static BlockDriver bdrv_blkverify = { .bdrv_close = blkverify_close, .bdrv_getlength = blkverify_getlength, .bdrv_refresh_filename= blkverify_refresh_filename, +.bdrv_dirname = blkverify_dirname, .bdrv_aio_readv = blkverify_aio_readv, .bdrv_aio_writev = blkverify_aio_writev, -- 2.8.0
[Qemu-devel] [PATCH 12/19] block: Fix bdrv_find_backing_image()
bdrv_find_backing_image() should use bdrv_get_full_backing_filename() or bdrv_make_absolute_filename() instead of trying to do what those functions do by itself. path_combine_deprecated() can now be dropped, so let's do that. Signed-off-by: Max Reitz--- block.c | 28 +--- 1 file changed, 13 insertions(+), 15 deletions(-) diff --git a/block.c b/block.c index 49dc2cb..6e89abd 100644 --- a/block.c +++ b/block.c @@ -192,15 +192,6 @@ char *path_combine(const char *base_path, const char *filename) return result; } -static void path_combine_deprecated(char *dest, int dest_size, -const char *base_path, -const char *filename) -{ -char *combined = path_combine(base_path, filename); -pstrcpy(dest, dest_size, combined); -g_free(combined); -} - char *bdrv_get_full_backing_filename_from_filename(const char *backed, const char *backing, Error **errp) @@ -3147,7 +3138,6 @@ BlockDriverState *bdrv_find_backing_image(BlockDriverState *bs, filename_full = g_malloc(PATH_MAX); backing_file_full = g_malloc(PATH_MAX); -filename_tmp = g_malloc(PATH_MAX); is_protocol = path_has_protocol(backing_file); @@ -3163,22 +3153,31 @@ BlockDriverState *bdrv_find_backing_image(BlockDriverState *bs, } else { /* If not an absolute filename path, make it relative to the current * image's filename path */ -path_combine_deprecated(filename_tmp, PATH_MAX, curr_bs->filename, -backing_file); +filename_tmp = bdrv_make_absolute_filename(curr_bs, backing_file, + NULL); +if (!filename_tmp) { +continue; +} /* We are going to compare absolute pathnames */ if (!realpath(filename_tmp, filename_full)) { +g_free(filename_tmp); continue; } +g_free(filename_tmp); /* We need to make sure the backing filename we are comparing against * is relative to the current image filename (or absolute) */ -path_combine_deprecated(filename_tmp, PATH_MAX, curr_bs->filename, -curr_bs->backing_file); +filename_tmp = bdrv_get_full_backing_filename(curr_bs, NULL); +if (!filename_tmp) { +continue; +} if (!realpath(filename_tmp, backing_file_full)) { +g_free(filename_tmp); continue; } +g_free(filename_tmp); if (strcmp(backing_file_full, filename_full) == 0) { retval = curr_bs->backing->bs; @@ -3189,7 +3188,6 @@ BlockDriverState *bdrv_find_backing_image(BlockDriverState *bs, g_free(filename_full); g_free(backing_file_full); -g_free(filename_tmp); return retval; } -- 2.8.0
[Qemu-devel] [PATCH 11/19] block: Add bdrv_make_absolute_filename()
This is a general function for making a filename that is relative to a certain BDS absolute. It calls bdrv_get_full_backing_filename_from_filename() for now, but that will be changed in a follow-up patch. Signed-off-by: Max Reitz--- block.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/block.c b/block.c index 083f4ae..49dc2cb 100644 --- a/block.c +++ b/block.c @@ -218,15 +218,22 @@ char *bdrv_get_full_backing_filename_from_filename(const char *backed, } } -char *bdrv_get_full_backing_filename(BlockDriverState *bs, Error **errp) +static char *bdrv_make_absolute_filename(BlockDriverState *relative_to, + const char *filename, Error **errp) { -char *backed = bs->exact_filename[0] ? bs->exact_filename : bs->filename; +char *bs_filename = relative_to->exact_filename[0] +? relative_to->exact_filename +: relative_to->filename; -return bdrv_get_full_backing_filename_from_filename(backed, -bs->backing_file, +return bdrv_get_full_backing_filename_from_filename(bs_filename, filename, errp); } +char *bdrv_get_full_backing_filename(BlockDriverState *bs, Error **errp) +{ +return bdrv_make_absolute_filename(bs, bs->backing_file, errp); +} + void bdrv_register(BlockDriver *bdrv) { bdrv_setup_io_funcs(bdrv); -- 2.8.0
[Qemu-devel] [PATCH 18/19] block: Add 'base-directory' BDS option
Using this option, one can directly override what bdrv_dirname() will return. This is useful if one uses e.g. qcow2 on top of quorum (with only protocol BDSs under the quorum BDS) and wants to be able to use relative backing filenames. Signed-off-by: Max Reitz--- block.c | 13 + include/block/block_int.h | 2 ++ qapi/block-core.json | 9 + 3 files changed, 24 insertions(+) diff --git a/block.c b/block.c index a909b7a..2086c6d 100644 --- a/block.c +++ b/block.c @@ -889,6 +889,11 @@ static QemuOptsList bdrv_runtime_opts = { .type = QEMU_OPT_BOOL, .help = "Ignore flush requests", }, +{ +.name = "base-directory", +.type = QEMU_OPT_STRING, +.help = "String to be prepended to filenames relative to this BDS to make them absolute", +}, { /* end of list */ } }, }; @@ -948,6 +953,8 @@ static int bdrv_open_common(BlockDriverState *bs, BdrvChild *file, goto fail_opts; } +bs->dirname = g_strdup(qemu_opt_get(opts, "base-directory")); + bs->request_alignment = 512; bs->zero_beyond_eof = true; bs->read_only = !(bs->open_flags & BDRV_O_RDWR); @@ -2359,6 +2366,8 @@ static void bdrv_delete(BlockDriverState *bs) bdrv_close(bs); +g_free(bs->dirname); + /* remove from list, if necessary */ if (bs->node_name[0] != '\0') { QTAILQ_REMOVE(_bdrv_states, bs, node_list); @@ -4043,6 +4052,10 @@ char *bdrv_dirname(BlockDriverState *bs, Error **errp) { BlockDriver *drv = bs->drv; +if (bs->dirname) { +return g_strdup(bs->dirname); +} + if (!drv) { error_setg(errp, "Node '%s' is ejected", bs->node_name); return NULL; diff --git a/include/block/block_int.h b/include/block/block_int.h index 087d9ac..863caf2 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -415,6 +415,8 @@ struct BlockDriverState { QDict *full_open_options; char exact_filename[PATH_MAX]; +char *dirname; + BdrvChild *backing; BdrvChild *file; diff --git a/qapi/block-core.json b/qapi/block-core.json index 1d09079..85b8e06 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -2045,6 +2045,14 @@ # @node-name: #optional the name of a block driver state node (Since 2.0). # This option is required on the top level of blockdev-add if # the @id option is not given there. +# @base-directory #optional May be specified for any node. Normally, whenever a +# filename is specified which is supposed to be relative to this +# node (such as relative backing filenames), the base directory +# to be used is the directory the image file of this node is in, +# which is simply prepended to the relative filename. Using this +# option, the string which is prepended (i.e. the base +# directory) can be overridden. +# (Since 2.7) # @discard: #optional discard-related options (default: ignore) # @cache: #optional cache-related options # @aio: #optional AIO backend (default: threads) @@ -2073,6 +2081,7 @@ 'base': { 'driver': 'BlockdevDriver', '*id': 'str', '*node-name': 'str', +'*base-directory': 'str', '*discard': 'BlockdevDiscardOptions', '*cache': 'BlockdevCacheOptions', '*aio': 'BlockdevAioOptions', -- 2.8.0
[Qemu-devel] [PATCH 08/19] block: Make path_combine() return the path
Besides being safe for arbitrary path lengths, after some follow-up patches all callers will want a freshly allocated buffer anyway. In the meantime, path_combine_deprecated() is added which has the same interface as path_combine() had before this patch. All callers to that function will be converted in follow-up patches. Signed-off-by: Max Reitz--- block.c | 81 +-- block/vmdk.c | 3 +- include/block/block.h | 4 +-- 3 files changed, 48 insertions(+), 40 deletions(-) diff --git a/block.c b/block.c index b3ff08f..41f3c92 100644 --- a/block.c +++ b/block.c @@ -146,48 +146,59 @@ int path_is_absolute(const char *path) #endif } -/* if filename is absolute, just copy it to dest. Otherwise, build a +/* if filename is absolute, just return its duplicate. Otherwise, build a path to it by considering it is relative to base_path. URL are supported. */ -void path_combine(char *dest, int dest_size, - const char *base_path, - const char *filename) +char *path_combine(const char *base_path, const char *filename) { const char *p, *p1; +char *result; int len; -if (dest_size <= 0) -return; if (path_is_absolute(filename)) { -pstrcpy(dest, dest_size, filename); +return g_strdup(filename); +} + +p = strchr(base_path, ':'); +if (p) { +p++; } else { -p = strchr(base_path, ':'); -if (p) -p++; -else -p = base_path; -p1 = strrchr(base_path, '/'); +p = base_path; +} +p1 = strrchr(base_path, '/'); #ifdef _WIN32 -{ -const char *p2; -p2 = strrchr(base_path, '\\'); -if (!p1 || p2 > p1) -p1 = p2; +{ +const char *p2; +p2 = strrchr(base_path, '\\'); +if (!p1 || p2 > p1) { +p1 = p2; } +} #endif -if (p1) -p1++; -else -p1 = base_path; -if (p1 > p) -p = p1; -len = p - base_path; -if (len > dest_size - 1) -len = dest_size - 1; -memcpy(dest, base_path, len); -dest[len] = '\0'; -pstrcat(dest, dest_size, filename); +if (p1) { +p1++; +} else { +p1 = base_path; +} +if (p1 > p) { +p = p1; } +len = p - base_path; + +result = g_malloc(len + strlen(filename) + 1); +memcpy(result, base_path, len); +strcpy(result + len, filename); + +return result; +} + +static void path_combine_deprecated(char *dest, int dest_size, +const char *base_path, +const char *filename) +{ +char *combined = path_combine(base_path, filename); +pstrcpy(dest, dest_size, combined); +g_free(combined); } void bdrv_get_full_backing_filename_from_filename(const char *backed, @@ -203,7 +214,7 @@ void bdrv_get_full_backing_filename_from_filename(const char *backed, error_setg(errp, "Cannot use relative backing file names for '%s'", backed); } else { -path_combine(dest, sz, backed, backing); +path_combine_deprecated(dest, sz, backed, backing); } } @@ -3147,8 +3158,8 @@ BlockDriverState *bdrv_find_backing_image(BlockDriverState *bs, } else { /* If not an absolute filename path, make it relative to the current * image's filename path */ -path_combine(filename_tmp, PATH_MAX, curr_bs->filename, - backing_file); +path_combine_deprecated(filename_tmp, PATH_MAX, curr_bs->filename, +backing_file); /* We are going to compare absolute pathnames */ if (!realpath(filename_tmp, filename_full)) { @@ -3157,8 +3168,8 @@ BlockDriverState *bdrv_find_backing_image(BlockDriverState *bs, /* We need to make sure the backing filename we are comparing against * is relative to the current image filename (or absolute) */ -path_combine(filename_tmp, PATH_MAX, curr_bs->filename, - curr_bs->backing_file); +path_combine_deprecated(filename_tmp, PATH_MAX, curr_bs->filename, +curr_bs->backing_file); if (!realpath(filename_tmp, backing_file_full)) { continue; diff --git a/block/vmdk.c b/block/vmdk.c index 45f9d3c..142de20 100644 --- a/block/vmdk.c +++ b/block/vmdk.c @@ -847,8 +847,7 @@ static int vmdk_parse_extents(const char *desc, BlockDriverState *bs, return -EINVAL; } -extent_path = g_malloc0(PATH_MAX); -path_combine(extent_path, PATH_MAX, desc_file_path, fname); +extent_path = path_combine(desc_file_path, fname); ret = snprintf(extent_opt_prefix, 32,
[Qemu-devel] [PATCH 15/19] quorum: Make bdrv_dirname() return NULL
While the common implementation for bdrv_dirname() should return NULL for quorum BDSs already (because they do not have a file node and their exact_filename field should be empty), there is no reason not to make that explicit. Signed-off-by: Max Reitz--- block/quorum.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/block/quorum.c b/block/quorum.c index 98c6588..abe2148 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -1053,6 +1053,16 @@ static void quorum_refresh_filename(BlockDriverState *bs, QDict *options) bs->full_open_options = opts; } +static char *quorum_dirname(BlockDriverState *bs, Error **errp) +{ +/* In general, there are multiple BDSs with different dirnames below this + * one; so there is no unique dirname we could return (unless all are equal + * by chance, or there is only one). Therefore, to be consistent, just + * always return NULL. */ +error_setg(errp, "Cannot generate a base directory for quorum nodes"); +return NULL; +} + static BlockDriver bdrv_quorum = { .format_name= "quorum", .protocol_name = "quorum", @@ -1062,6 +1072,7 @@ static BlockDriver bdrv_quorum = { .bdrv_file_open = quorum_open, .bdrv_close = quorum_close, .bdrv_refresh_filename = quorum_refresh_filename, +.bdrv_dirname = quorum_dirname, .bdrv_co_flush_to_disk = quorum_co_flush, -- 2.8.0
[Qemu-devel] [PATCH 07/19] qcow2: Implement bdrv_refresh_filename()
Implement this function by invoking bdrv_default_refresh_format_filename(bs, false). None of the qcow2 runtime options change the guest-visible state of a BDS. Signed-off-by: Max Reitz--- block/qcow2.c | 8 1 file changed, 8 insertions(+) diff --git a/block/qcow2.c b/block/qcow2.c index 470734b..bcbf94e 100644 --- a/block/qcow2.c +++ b/block/qcow2.c @@ -3256,6 +3256,13 @@ void qcow2_signal_corruption(BlockDriverState *bs, bool fatal, int64_t offset, s->signaled_corruption = true; } +static void qcow2_refresh_filename(BlockDriverState *bs, QDict *opts) +{ +(void)opts; + +bdrv_default_refresh_format_filename(bs, false); +} + static QemuOptsList qcow2_create_opts = { .name = "qcow2-create-opts", .head = QTAILQ_HEAD_INITIALIZER(qcow2_create_opts.head), @@ -3346,6 +3353,7 @@ BlockDriver bdrv_qcow2 = { .bdrv_snapshot_load_tmp = qcow2_snapshot_load_tmp, .bdrv_get_info = qcow2_get_info, .bdrv_get_specific_info = qcow2_get_specific_info, +.bdrv_refresh_filename = qcow2_refresh_filename, .bdrv_save_vmstate= qcow2_save_vmstate, .bdrv_load_vmstate= qcow2_load_vmstate, -- 2.8.0
[Qemu-devel] [PATCH 05/19] block: Add bdrv_default_refresh_protocol_filename
Split off the default code for protocol BDS from bdrv_refresh_filename() into an own function, just as it has been done for format BDS. Signed-off-by: Max Reitz--- block.c | 58 +- 1 file changed, 33 insertions(+), 25 deletions(-) diff --git a/block.c b/block.c index 447468c..511a0b1 100644 --- a/block.c +++ b/block.c @@ -3922,6 +3922,38 @@ static void bdrv_default_refresh_format_filename(BlockDriverState *bs) } } +static void bdrv_default_refresh_protocol_filename(BlockDriverState *bs) +{ +BlockDriver *drv = bs->drv; +QDict *opts; + +/* There is no underlying file BDS (at least referenced by BDS.file), + * so the full options QDict should be equal to the options given + * specifically for this block device when it was opened (plus the + * driver specification). + * Because those options don't change, there is no need to update + * full_open_options when it's already set. */ + +opts = qdict_new(); +append_open_options(opts, bs); +qdict_put_obj(opts, "driver", + QOBJECT(qstring_from_str(drv->format_name))); + +if (bs->exact_filename[0]) { +/* This may not work for all block protocol drivers (some may + * require this filename to be parsed), but we have to find some + * default solution here, so just include it. If some block driver + * does not support pure options without any filename at all or + * needs some special format of the options QDict, it needs to + * implement the driver-specific bdrv_refresh_filename() function. + */ +qdict_put_obj(opts, "filename", + QOBJECT(qstring_from_str(bs->exact_filename))); +} + +bs->full_open_options = opts; +} + /* Updates the following BDS fields: * - exact_filename: A filename which may be used for opening a block device *which (mostly) equals the given BDS (even without any @@ -3967,31 +3999,7 @@ void bdrv_refresh_filename(BlockDriverState *bs) /* Try to reconstruct valid information from the underlying file */ bdrv_default_refresh_format_filename(bs); } else if (!bs->full_open_options && qdict_size(bs->options)) { -/* There is no underlying file BDS (at least referenced by BDS.file), - * so the full options QDict should be equal to the options given - * specifically for this block device when it was opened (plus the - * driver specification). - * Because those options don't change, there is no need to update - * full_open_options when it's already set. */ - -opts = qdict_new(); -append_open_options(opts, bs); -qdict_put_obj(opts, "driver", - QOBJECT(qstring_from_str(drv->format_name))); - -if (bs->exact_filename[0]) { -/* This may not work for all block protocol drivers (some may - * require this filename to be parsed), but we have to find some - * default solution here, so just include it. If some block driver - * does not support pure options without any filename at all or - * needs some special format of the options QDict, it needs to - * implement the driver-specific bdrv_refresh_filename() function. - */ -qdict_put_obj(opts, "filename", - QOBJECT(qstring_from_str(bs->exact_filename))); -} - -bs->full_open_options = opts; +bdrv_default_refresh_protocol_filename(bs); } if (bs->exact_filename[0]) { -- 2.8.0
[Qemu-devel] [PATCH 16/19] block/nbd: Implement bdrv_dirname()
The idea behind this implementation is that the export name might be interpreted as a path (which is the only apparent interpretation of relative filenames for NBD paths). The default implementation of bdrv_dirname() would handle that just fine for nbd+tcp, but not for nbd+unix, because in that case, the last element of the path is the Unix socket path and not the export name. Therefore, we need to implement an own bdrv_dirname() which uses the legacy NBD URL which has the export name at the end. Signed-off-by: Max Reitz--- block/nbd.c | 29 + 1 file changed, 29 insertions(+) diff --git a/block/nbd.c b/block/nbd.c index f7ea3b3..64534bb 100644 --- a/block/nbd.c +++ b/block/nbd.c @@ -469,6 +469,32 @@ static void nbd_refresh_filename(BlockDriverState *bs, QDict *options) bs->full_open_options = opts; } +static char *nbd_dirname(BlockDriverState *bs, Error **errp) +{ +const char *path = qdict_get_try_str(bs->options, "path"); +const char *host = qdict_get_try_str(bs->options, "host"); +const char *port = qdict_get_try_str(bs->options, "port"); +const char *export = qdict_get_try_str(bs->options, "export"); + +if (path && export) { +return g_strdup_printf("nbd:unix:%s:exportname=%s/", path, export); +} else if (path && !export) { +return g_strdup_printf("nbd:unix:%s:exportname=", path); +} else if (!path && export && port) { +return g_strdup_printf("nbd://%s:%s/%s/", host, port, export); +} else if (!path && export && !port) { +return g_strdup_printf("nbd://%s/%s/", host, export); +} else if (!path && !export && port) { +return g_strdup_printf("nbd://%s:%s/", host, port); +} else if (!path && !export && !port) { +return g_strdup_printf("nbd://%s/", host); +} + +error_setg(errp, "Cannot generate a base directory for NBD node '%s'", + bs->filename); +return NULL; +} + static BlockDriver bdrv_nbd = { .format_name= "nbd", .protocol_name = "nbd", @@ -487,6 +513,7 @@ static BlockDriver bdrv_nbd = { .bdrv_detach_aio_context= nbd_detach_aio_context, .bdrv_attach_aio_context= nbd_attach_aio_context, .bdrv_refresh_filename = nbd_refresh_filename, +.bdrv_dirname = nbd_dirname, }; static BlockDriver bdrv_nbd_tcp = { @@ -507,6 +534,7 @@ static BlockDriver bdrv_nbd_tcp = { .bdrv_detach_aio_context= nbd_detach_aio_context, .bdrv_attach_aio_context= nbd_attach_aio_context, .bdrv_refresh_filename = nbd_refresh_filename, +.bdrv_dirname = nbd_dirname, }; static BlockDriver bdrv_nbd_unix = { @@ -527,6 +555,7 @@ static BlockDriver bdrv_nbd_unix = { .bdrv_detach_aio_context= nbd_detach_aio_context, .bdrv_attach_aio_context= nbd_attach_aio_context, .bdrv_refresh_filename = nbd_refresh_filename, +.bdrv_dirname = nbd_dirname, }; static void bdrv_nbd_init(void) -- 2.8.0
[Qemu-devel] [PATCH 13/19] block: Add bdrv_dirname()
This function may be implemented by block drivers to derive a directory name from a BDS. Concatenating this g_free()-able string with a relative filename must result in a valid (not necessarily existing) filename, so this is a function that should generally be not implemented by format drivers, because this is protocol-specific. If a BDS's driver does not implement this function, bdrv_dirname() will fall through to the BDS's file if it exists. If it does not, the exact_filename field will be used to generate a directory name. Signed-off-by: Max Reitz--- block.c | 26 ++ include/block/block.h | 1 + include/block/block_int.h | 1 + 3 files changed, 28 insertions(+) diff --git a/block.c b/block.c index 6e89abd..23962ce 100644 --- a/block.c +++ b/block.c @@ -4028,3 +4028,29 @@ void bdrv_refresh_filename(BlockDriverState *bs) QDECREF(json); } } + +char *bdrv_dirname(BlockDriverState *bs, Error **errp) +{ +BlockDriver *drv = bs->drv; + +if (!drv) { +error_setg(errp, "Node '%s' is ejected", bs->node_name); +return NULL; +} + +if (drv->bdrv_dirname) { +return drv->bdrv_dirname(bs, errp); +} + +if (bs->file) { +return bdrv_dirname(bs->file->bs, errp); +} + +if (bs->exact_filename[0] != '\0') { +return path_combine(bs->exact_filename, ""); +} + +error_setg(errp, "Cannot generate a base directory for %s nodes", + drv->format_name); +return NULL; +} diff --git a/include/block/block.h b/include/block/block.h index 83f9f31..b0fe5dc 100644 --- a/include/block/block.h +++ b/include/block/block.h @@ -437,6 +437,7 @@ char *bdrv_get_full_backing_filename(BlockDriverState *bs, Error **errp); char *bdrv_get_full_backing_filename_from_filename(const char *backed, const char *backing, Error **errp); +char *bdrv_dirname(BlockDriverState *bs, Error **errp); int bdrv_is_snapshot(BlockDriverState *bs); int path_has_protocol(const char *path); diff --git a/include/block/block_int.h b/include/block/block_int.h index eb3665a..087d9ac 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -137,6 +137,7 @@ struct BlockDriver { int (*bdrv_make_empty)(BlockDriverState *bs); void (*bdrv_refresh_filename)(BlockDriverState *bs, QDict *options); +char *(*bdrv_dirname)(BlockDriverState *bs, Error **errp); /* aio */ BlockAIOCB *(*bdrv_aio_readv)(BlockDriverState *bs, -- 2.8.0
[Qemu-devel] [PATCH 06/19] block: Make bdrv_default_refresh_format_filename public
In order to allow block drivers to use that function, it needs to be public. In order to be useful, it needs to take a parameter which allows the caller to specify whether the runtime options allowed by the block driver are actually significant for the guest-visible BDS content. Signed-off-by: Max Reitz--- block.c | 9 ++--- include/block/block_int.h | 3 +++ 2 files changed, 9 insertions(+), 3 deletions(-) diff --git a/block.c b/block.c index 511a0b1..b3ff08f 100644 --- a/block.c +++ b/block.c @@ -3872,7 +3872,10 @@ static bool append_open_options(QDict *d, BlockDriverState *bs) return found_any; } -static void bdrv_default_refresh_format_filename(BlockDriverState *bs) +/* @options_significant must be true if any of the driver-specific runtime + * options may change the guest-visible content of the BDS */ +void bdrv_default_refresh_format_filename(BlockDriverState *bs, + bool options_significant) { BlockDriver *drv = bs->drv; QDict *opts; @@ -3885,7 +3888,7 @@ static void bdrv_default_refresh_format_filename(BlockDriverState *bs) } opts = qdict_new(); -has_open_options = append_open_options(opts, bs); +has_open_options = append_open_options(opts, bs) && options_significant; has_open_options |= bs->backing_overridden; /* If no specific options have been given for this BDS, the filename of @@ -3997,7 +4000,7 @@ void bdrv_refresh_filename(BlockDriverState *bs) QDECREF(opts); } else if (bs->file) { /* Try to reconstruct valid information from the underlying file */ -bdrv_default_refresh_format_filename(bs); +bdrv_default_refresh_format_filename(bs, true); } else if (!bs->full_open_options && qdict_size(bs->options)) { bdrv_default_refresh_protocol_filename(bs); } diff --git a/include/block/block_int.h b/include/block/block_int.h index d73e9ce..eb3665a 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -532,6 +532,9 @@ BlockDriver *bdrv_probe_all(const uint8_t *buf, int buf_size, void bdrv_set_io_limits(BlockDriverState *bs, ThrottleConfig *cfg); +void bdrv_default_refresh_format_filename(BlockDriverState *bs, + bool options_significant); + /** * bdrv_add_before_write_notifier: -- 2.8.0
[Qemu-devel] [PATCH 09/19] block: bdrv_get_full_backing_filename_from_...'s ret. val.
Make bdrv_get_full_backing_filename_from_filename() return an allocated string instead of placing the result in a caller-provided buffer. Signed-off-by: Max Reitz--- block.c | 32 +++- block/vmdk.c | 8 +++- include/block/block.h | 7 +++ 3 files changed, 25 insertions(+), 22 deletions(-) diff --git a/block.c b/block.c index 41f3c92..a12b1d3 100644 --- a/block.c +++ b/block.c @@ -201,20 +201,20 @@ static void path_combine_deprecated(char *dest, int dest_size, g_free(combined); } -void bdrv_get_full_backing_filename_from_filename(const char *backed, - const char *backing, - char *dest, size_t sz, - Error **errp) +char *bdrv_get_full_backing_filename_from_filename(const char *backed, + const char *backing, + Error **errp) { if (backing[0] == '\0' || path_has_protocol(backing) || path_is_absolute(backing)) { -pstrcpy(dest, sz, backing); +return g_strdup(backing); } else if (backed[0] == '\0' || strstart(backed, "json:", NULL)) { error_setg(errp, "Cannot use relative backing file names for '%s'", backed); +return NULL; } else { -path_combine_deprecated(dest, sz, backed, backing); +return path_combine(backed, backing); } } @@ -222,9 +222,15 @@ void bdrv_get_full_backing_filename(BlockDriverState *bs, char *dest, size_t sz, Error **errp) { char *backed = bs->exact_filename[0] ? bs->exact_filename : bs->filename; +char *full_name; -bdrv_get_full_backing_filename_from_filename(backed, bs->backing_file, - dest, sz, errp); +full_name = bdrv_get_full_backing_filename_from_filename(backed, + bs->backing_file, + errp); +if (full_name) { +pstrcpy(dest, sz, full_name); +g_free(full_name); +} } void bdrv_register(BlockDriver *bdrv) @@ -3551,16 +3557,16 @@ void bdrv_img_create(const char *filename, const char *fmt, if (size == -1) { if (backing_file) { BlockDriverState *bs; -char *full_backing = g_new0(char, PATH_MAX); +char *full_backing; int64_t size; int back_flags; QDict *backing_options = NULL; -bdrv_get_full_backing_filename_from_filename(filename, backing_file, - full_backing, PATH_MAX, - _err); +full_backing = +bdrv_get_full_backing_filename_from_filename(filename, + backing_file, + _err); if (local_err) { -g_free(full_backing); goto out; } diff --git a/block/vmdk.c b/block/vmdk.c index 142de20..58fe380 100644 --- a/block/vmdk.c +++ b/block/vmdk.c @@ -1934,12 +1934,10 @@ static int vmdk_create(const char *filename, QemuOpts *opts, Error **errp) } if (backing_file) { BlockBackend *blk; -char *full_backing = g_new0(char, PATH_MAX); -bdrv_get_full_backing_filename_from_filename(filename, backing_file, - full_backing, PATH_MAX, - _err); +char *full_backing = +bdrv_get_full_backing_filename_from_filename(filename, backing_file, + _err); if (local_err) { -g_free(full_backing); error_propagate(errp, local_err); ret = -ENOENT; goto exit; diff --git a/include/block/block.h b/include/block/block.h index 4604465..c6c00dd 100644 --- a/include/block/block.h +++ b/include/block/block.h @@ -435,10 +435,9 @@ void bdrv_get_backing_filename(BlockDriverState *bs, char *filename, int filename_size); void bdrv_get_full_backing_filename(BlockDriverState *bs, char *dest, size_t sz, Error **errp); -void bdrv_get_full_backing_filename_from_filename(const char *backed, - const char *backing, - char *dest, size_t sz, - Error **errp); +char *bdrv_get_full_backing_filename_from_filename(const char *backed, +
[Qemu-devel] [PATCH 10/19] block: bdrv_get_full_backing_filename's ret. val.
Make bdrv_get_full_backing_filename() return an allocated string instead of placing the result in a caller-provided buffer. Signed-off-by: Max Reitz--- block.c | 24 block/qapi.c | 12 ++-- include/block/block.h | 3 +-- 3 files changed, 11 insertions(+), 28 deletions(-) diff --git a/block.c b/block.c index a12b1d3..083f4ae 100644 --- a/block.c +++ b/block.c @@ -218,19 +218,13 @@ char *bdrv_get_full_backing_filename_from_filename(const char *backed, } } -void bdrv_get_full_backing_filename(BlockDriverState *bs, char *dest, size_t sz, -Error **errp) +char *bdrv_get_full_backing_filename(BlockDriverState *bs, Error **errp) { char *backed = bs->exact_filename[0] ? bs->exact_filename : bs->filename; -char *full_name; -full_name = bdrv_get_full_backing_filename_from_filename(backed, - bs->backing_file, - errp); -if (full_name) { -pstrcpy(dest, sz, full_name); -g_free(full_name); -} +return bdrv_get_full_backing_filename_from_filename(backed, +bs->backing_file, +errp); } void bdrv_register(BlockDriver *bdrv) @@ -1289,7 +1283,7 @@ out: int bdrv_open_backing_file(BlockDriverState *bs, QDict *parent_options, const char *bdref_key, Error **errp) { -char *backing_filename = g_malloc0(PATH_MAX); +char *backing_filename = NULL; char *bdref_key_dot; const char *reference = NULL; int ret = 0; @@ -1317,13 +1311,12 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *parent_options, reference = qdict_get_try_str(parent_options, bdref_key); if (reference || qdict_haskey(options, "file.filename")) { bs->backing_overridden = true; -backing_filename[0] = '\0'; +backing_filename = NULL; } else if (bs->backing_file[0] == '\0' && qdict_size(options) == 0) { QDECREF(options); goto free_exit; } else { -bdrv_get_full_backing_filename(bs, backing_filename, PATH_MAX, - _err); +backing_filename = bdrv_get_full_backing_filename(bs, _err); if (local_err) { ret = -EINVAL; error_propagate(errp, local_err); @@ -1344,8 +1337,7 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *parent_options, } backing_hd = NULL; -ret = bdrv_open_inherit(_hd, -*backing_filename ? backing_filename : NULL, +ret = bdrv_open_inherit(_hd, backing_filename, reference, options, 0, bs, _backing, errp); if (ret < 0) { diff --git a/block/qapi.c b/block/qapi.c index c5f6ba6..a0fb7fb 100644 --- a/block/qapi.c +++ b/block/qapi.c @@ -265,18 +265,10 @@ void bdrv_query_image_info(BlockDriverState *bs, backing_filename = bs->backing_file; if (backing_filename[0] != '\0') { -char *backing_filename2 = g_malloc0(PATH_MAX); +char *backing_filename2; info->backing_filename = g_strdup(backing_filename); info->has_backing_filename = true; -bdrv_get_full_backing_filename(bs, backing_filename2, PATH_MAX, ); -if (err) { -/* Can't reconstruct the full backing filename, so we must omit - * this field and apply a Best Effort to this query. */ -g_free(backing_filename2); -backing_filename2 = NULL; -error_free(err); -err = NULL; -} +backing_filename2 = bdrv_get_full_backing_filename(bs, NULL); /* Always report the full_backing_filename if present, even if it's the * same as backing_filename. That they are same is useful info. */ diff --git a/include/block/block.h b/include/block/block.h index c6c00dd..83f9f31 100644 --- a/include/block/block.h +++ b/include/block/block.h @@ -433,8 +433,7 @@ void bdrv_round_to_clusters(BlockDriverState *bs, const char *bdrv_get_encrypted_filename(BlockDriverState *bs); void bdrv_get_backing_filename(BlockDriverState *bs, char *filename, int filename_size); -void bdrv_get_full_backing_filename(BlockDriverState *bs, -char *dest, size_t sz, Error **errp); +char *bdrv_get_full_backing_filename(BlockDriverState *bs, Error **errp); char *bdrv_get_full_backing_filename_from_filename(const char *backed, const char *backing, Error **errp); -- 2.8.0
[Qemu-devel] [PATCH 01/19] block: Use children list in bdrv_refresh_filename
bdrv_refresh_filename() should invoke itself recursively on all children, not just on file. With that change, we can remove the manual invocations in blkverify and quorum. Signed-off-by: Max Reitz--- block.c | 9 + block/blkverify.c | 3 --- block/quorum.c| 1 - 3 files changed, 5 insertions(+), 8 deletions(-) diff --git a/block.c b/block.c index d4939b4..e349e83 100644 --- a/block.c +++ b/block.c @@ -3885,16 +3885,17 @@ static bool append_open_options(QDict *d, BlockDriverState *bs) void bdrv_refresh_filename(BlockDriverState *bs) { BlockDriver *drv = bs->drv; +BdrvChild *child; QDict *opts; if (!drv) { return; } -/* This BDS's file name will most probably depend on its file's name, so - * refresh that first */ -if (bs->file) { -bdrv_refresh_filename(bs->file->bs); +/* This BDS's file name may depend on any of its children's file names, so + * refresh those first */ +QLIST_FOREACH(child, >children, next) { +bdrv_refresh_filename(child->bs); } if (drv->bdrv_refresh_filename) { diff --git a/block/blkverify.c b/block/blkverify.c index 9414b7a..f445e3e 100644 --- a/block/blkverify.c +++ b/block/blkverify.c @@ -313,9 +313,6 @@ static void blkverify_refresh_filename(BlockDriverState *bs, QDict *options) { BDRVBlkverifyState *s = bs->opaque; -/* bs->file->bs has already been refreshed */ -bdrv_refresh_filename(s->test_file->bs); - if (bs->file->bs->full_open_options && s->test_file->bs->full_open_options) { diff --git a/block/quorum.c b/block/quorum.c index da15465..98c6588 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -1028,7 +1028,6 @@ static void quorum_refresh_filename(BlockDriverState *bs, QDict *options) int i; for (i = 0; i < s->num_children; i++) { -bdrv_refresh_filename(s->children[i]->bs); if (!s->children[i]->bs->full_open_options) { return; } -- 2.8.0
[Qemu-devel] [PATCH 00/19] block: Fix some filename generation issues
(Fun fact: This series has been lying around on my disk since last November. I guess I forgot to send it because I still wanted to review it before sending it out, and that it what I forgot to do. Well, and now Berto noticed that we ought to fix something.) There are some issues regarding filename generation right now: - You always get a JSON filename if you set even a single qcow2-specific runtime options (as long as it does not have a dot in it, which is a bug, too, but here it is working in our favor...). That is not nice and actually breaks the usage of backing files with relative filenames with such qcow2 BDS. - As hinted above, you cannot use relative backing filenames with BDS that have a JSON filename only, even though qemu might be able to obtain the directory name by walking through the BDS graph to the protocol level. - Overriding the backing file at runtime should invalidate the filename because it actually changes the BDS's data. Therefore, we need to force a JSON filename in that case, containing the backing file override. - Much of our code assumes paths never to exceed PATH_MAX in length. This is wrong, at least because of JSON filenames. This should be fixed wherever the opportunity arises. This series addresses the first issue by splitting off code from bdrv_refresh_filename() into bdrv_default_refresh_format_filename() and bdrv_default_refresh_protocol_filename(). The former will take a boolean signifying whether the driver-specific runtime options are actually significant for the BDS data (guest-visible content). For qcow2, they never are, therefore, the qcow2 driver has to implement bdrv_refresh_filename() by invoking bdrv_default_refresh_format_filename() with the boolean set appropriately (deviating from the actual default implementation which has to assume that all driver-specific runtime options are significant). [Patches 4 -- 7] The second issue is addressed by introducing bdrv_dirname() which returns the directory of a specific BDS. By default, this is obtained by walking through the BDS graph to the protocol level and processing that BDS's filename (exact_filename, to be exact). This behavior can be overridden by any block driver along the path implementing bdrv_dirname() itself, or by the user explicitly specifying the 'base-directory' option for any BDS node. [Patches 11, 13 -- 18] The third issue is addressed by paying respect to the backing file options, and noting whether any of the backing file's options have been overridden in a way that make opening the backing file implicitly (using the filename specified in the overlay file) impossible. [Patches 2 -- 3] The fourth issue has been addressed as far as it made sense to do it along fixing the second one. [Patch 8 -- 10] Furthermore, there are two fixes to code touched in this series. [Patches 1 and 12] Max Reitz (19): block: Use children list in bdrv_refresh_filename block: Add BDS.backing_overridden block: Respect backing bs in bdrv_refresh_filename block: Add bdrv_default_refresh_format_filename block: Add bdrv_default_refresh_protocol_filename block: Make bdrv_default_refresh_format_filename public qcow2: Implement bdrv_refresh_filename() block: Make path_combine() return the path block: bdrv_get_full_backing_filename_from_...'s ret. val. block: bdrv_get_full_backing_filename's ret. val. block: Add bdrv_make_absolute_filename() block: Fix bdrv_find_backing_image() block: Add bdrv_dirname() blkverify: Make bdrv_dirname() return NULL quorum: Make bdrv_dirname() return NULL block/nbd: Implement bdrv_dirname() block: Use bdrv_dirname() for relative filenames block: Add 'base-directory' BDS option iotests: Add quorum case to test 110 block.c | 340 +++--- block/blkverify.c | 13 +- block/nbd.c | 29 block/qapi.c | 12 +- block/qcow2.c | 8 + block/quorum.c| 12 +- block/vmdk.c | 11 +- include/block/block.h | 15 +- include/block/block_int.h | 7 + qapi/block-core.json | 9 ++ tests/qemu-iotests/051.out| 8 +- tests/qemu-iotests/051.pc.out | 8 +- tests/qemu-iotests/110| 51 ++- tests/qemu-iotests/110.out| 11 +- 14 files changed, 372 insertions(+), 162 deletions(-) -- 2.8.0
[Qemu-devel] [PATCH 04/19] block: Add bdrv_default_refresh_format_filename
Split off the default code for format BDS from bdrv_refresh_filename() into an own function, first because it is nicer this way, and second because this will allow block drivers to reuse this function in their own specific implementation of bdrv_refresh_filename(). Signed-off-by: Max Reitz--- block.c | 95 +++-- 1 file changed, 51 insertions(+), 44 deletions(-) diff --git a/block.c b/block.c index aff9ea3..447468c 100644 --- a/block.c +++ b/block.c @@ -3872,6 +3872,56 @@ static bool append_open_options(QDict *d, BlockDriverState *bs) return found_any; } +static void bdrv_default_refresh_format_filename(BlockDriverState *bs) +{ +BlockDriver *drv = bs->drv; +QDict *opts; +bool has_open_options; + +bs->exact_filename[0] = '\0'; +if (bs->full_open_options) { +QDECREF(bs->full_open_options); +bs->full_open_options = NULL; +} + +opts = qdict_new(); +has_open_options = append_open_options(opts, bs); +has_open_options |= bs->backing_overridden; + +/* If no specific options have been given for this BDS, the filename of + * the underlying file should suffice for this one as well */ +if (bs->file->bs->exact_filename[0] && !has_open_options) { +strcpy(bs->exact_filename, bs->file->bs->exact_filename); +} +/* Reconstructing the full options QDict is simple for most format block + * drivers, as long as the full options are known for the underlying + * file BDS. The full options QDict of that file BDS should somehow + * contain a representation of the filename, therefore the following + * suffices without querying the (exact_)filename of this BDS. */ +if (bs->file->bs->full_open_options && +(!bs->backing || bs->backing->bs->full_open_options)) +{ +qdict_put_obj(opts, "driver", + QOBJECT(qstring_from_str(drv->format_name))); + +QINCREF(bs->file->bs->full_open_options); +qdict_put_obj(opts, "file", + QOBJECT(bs->file->bs->full_open_options)); + +if (bs->backing) { +QINCREF(bs->backing->bs->full_open_options); +qdict_put_obj(opts, "backing", + QOBJECT(bs->backing->bs->full_open_options)); +} else if (bs->backing_overridden && !bs->backing) { +qdict_put_obj(opts, "backing", QOBJECT(qstring_new())); +} + +bs->full_open_options = opts; +} else { +QDECREF(opts); +} +} + /* Updates the following BDS fields: * - exact_filename: A filename which may be used for opening a block device *which (mostly) equals the given BDS (even without any @@ -3915,50 +3965,7 @@ void bdrv_refresh_filename(BlockDriverState *bs) QDECREF(opts); } else if (bs->file) { /* Try to reconstruct valid information from the underlying file */ -bool has_open_options; - -bs->exact_filename[0] = '\0'; -if (bs->full_open_options) { -QDECREF(bs->full_open_options); -bs->full_open_options = NULL; -} - -opts = qdict_new(); -has_open_options = append_open_options(opts, bs); -has_open_options |= bs->backing_overridden; - -/* If no specific options have been given for this BDS, the filename of - * the underlying file should suffice for this one as well */ -if (bs->file->bs->exact_filename[0] && !has_open_options) { -strcpy(bs->exact_filename, bs->file->bs->exact_filename); -} -/* Reconstructing the full options QDict is simple for most format block - * drivers, as long as the full options are known for the underlying - * file BDS. The full options QDict of that file BDS should somehow - * contain a representation of the filename, therefore the following - * suffices without querying the (exact_)filename of this BDS. */ -if (bs->file->bs->full_open_options && -(!bs->backing || bs->backing->bs->full_open_options)) -{ -qdict_put_obj(opts, "driver", - QOBJECT(qstring_from_str(drv->format_name))); - -QINCREF(bs->file->bs->full_open_options); -qdict_put_obj(opts, "file", - QOBJECT(bs->file->bs->full_open_options)); - -if (bs->backing) { -QINCREF(bs->backing->bs->full_open_options); -qdict_put_obj(opts, "backing", - QOBJECT(bs->backing->bs->full_open_options)); -} else if (bs->backing_overridden && !bs->backing) { -qdict_put_obj(opts, "backing", QOBJECT(qstring_new())); -} - -bs->full_open_options = opts; -} else { -QDECREF(opts); -} +bdrv_default_refresh_format_filename(bs); } else if (!bs->full_open_options &&
[Qemu-devel] [PATCH 02/19] block: Add BDS.backing_overridden
If the backing file is overridden, this most probably does change the guest-visible data of a BDS. Therefore, we will need to consider this in bdrv_refresh_filename(). Adding a new field to the BDS is not nice, but it is very simple and exactly keeps track of whether the backing file has been overridden. Signed-off-by: Max Reitz--- block.c | 2 ++ include/block/block_int.h | 1 + 2 files changed, 3 insertions(+) diff --git a/block.c b/block.c index e349e83..e178488 100644 --- a/block.c +++ b/block.c @@ -1299,6 +1299,7 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *parent_options, reference = qdict_get_try_str(parent_options, bdref_key); if (reference || qdict_haskey(options, "file.filename")) { +bs->backing_overridden = true; backing_filename[0] = '\0'; } else if (bs->backing_file[0] == '\0' && qdict_size(options) == 0) { QDECREF(options); @@ -1589,6 +1590,7 @@ static int bdrv_open_inherit(BlockDriverState **pbs, const char *filename, backing = qdict_get_try_str(options, "backing"); if (backing && *backing == '\0') { flags |= BDRV_O_NO_BACKING; +bs->backing_overridden = true; qdict_del(options, "backing"); } diff --git a/include/block/block_int.h b/include/block/block_int.h index 10d8759..d73e9ce 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -409,6 +409,7 @@ struct BlockDriverState { char backing_file[PATH_MAX]; /* if non zero, the image is a diff of this file image */ char backing_format[16]; /* if non-zero and backing_file exists */ +bool backing_overridden; /* backing file has been specified by the user */ QDict *full_open_options; char exact_filename[PATH_MAX]; -- 2.8.0
[Qemu-devel] [PATCH 03/19] block: Respect backing bs in bdrv_refresh_filename
Basically, bdrv_refresh_filename() should respect all children of a BlockDriverState. However, generally those children are driver-specific, so this function cannot handle the general case. On the other hand, there are only few drivers which use other children than @file and @backing (that being vmdk, quorum, and blkverify). Most block drivers only use @file and/or @backing (if they use any children at all). Both can be implemented directly in bdrv_refresh_filename. The user overriding the file's filename is already handled, however, the user overriding the backing file is not. If this is done, opening the BDS with the plain filename of its file will not be correct, so we may not set bs->exact_filename in that case. iotest 051 contains test cases for overwriting the backing file, and so its output changes with this patch applied (which I consider a good thing). Signed-off-by: Max Reitz--- block.c | 14 +- tests/qemu-iotests/051.out| 8 tests/qemu-iotests/051.pc.out | 8 3 files changed, 21 insertions(+), 9 deletions(-) diff --git a/block.c b/block.c index e178488..aff9ea3 100644 --- a/block.c +++ b/block.c @@ -3925,6 +3925,7 @@ void bdrv_refresh_filename(BlockDriverState *bs) opts = qdict_new(); has_open_options = append_open_options(opts, bs); +has_open_options |= bs->backing_overridden; /* If no specific options have been given for this BDS, the filename of * the underlying file should suffice for this one as well */ @@ -3936,13 +3937,24 @@ void bdrv_refresh_filename(BlockDriverState *bs) * file BDS. The full options QDict of that file BDS should somehow * contain a representation of the filename, therefore the following * suffices without querying the (exact_)filename of this BDS. */ -if (bs->file->bs->full_open_options) { +if (bs->file->bs->full_open_options && +(!bs->backing || bs->backing->bs->full_open_options)) +{ qdict_put_obj(opts, "driver", QOBJECT(qstring_from_str(drv->format_name))); + QINCREF(bs->file->bs->full_open_options); qdict_put_obj(opts, "file", QOBJECT(bs->file->bs->full_open_options)); +if (bs->backing) { +QINCREF(bs->backing->bs->full_open_options); +qdict_put_obj(opts, "backing", + QOBJECT(bs->backing->bs->full_open_options)); +} else if (bs->backing_overridden && !bs->backing) { +qdict_put_obj(opts, "backing", QOBJECT(qstring_new())); +} + bs->full_open_options = opts; } else { QDECREF(opts); diff --git a/tests/qemu-iotests/051.out b/tests/qemu-iotests/051.out index 408d613..d936798 100644 --- a/tests/qemu-iotests/051.out +++ b/tests/qemu-iotests/051.out @@ -59,7 +59,7 @@ QEMU X.Y.Z monitor - type 'help' for more information Testing: -drive file=TEST_DIR/t.qcow2,driver=qcow2,backing.file.filename=TEST_DIR/t.qcow2.orig,if=none,id=drive0 -nodefaults QEMU X.Y.Z monitor - type 'help' for more information (qemu) i[K[Din[K[D[Dinf[K[D[D[Dinfo[K[D[D[D[Dinfo [K[D[D[D[D[Dinfo b[K[D[D[D[D[D[Dinfo bl[K[D[D[D[D[D[D[Dinfo blo[K[D[D[D[D[D[D[D[Dinfo bloc[K[D[D[D[D[D[D[D[D[Dinfo block[K -drive0 (NODE_NAME): TEST_DIR/t.qcow2 (qcow2) +drive0 (NODE_NAME): json:{"backing": {"driver": "qcow2", "file": {"driver": "file", "filename": "TEST_DIR/t.qcow2.orig"}}, "driver": "qcow2", "file": {"driver": "file", "filename": "TEST_DIR/t.qcow2"}} (qcow2) Removable device: not locked, tray closed Cache mode: writeback Backing file: TEST_DIR/t.qcow2.orig (chain depth: 1) @@ -148,7 +148,7 @@ QEMU_PROG: -drive driver=null-co,cache=invalid_value: invalid cache option Testing: -drive file=TEST_DIR/t.qcow2,cache=writeback,backing.file.filename=TEST_DIR/t.qcow2.base,backing.cache.no-flush=on,backing.node-name=backing,backing.file.node-name=backing-file,file.node-name=file,if=none,id=drive0 -nodefaults QEMU X.Y.Z monitor - type 'help' for more information (qemu) i[K[Din[K[D[Dinf[K[D[D[Dinfo[K[D[D[D[Dinfo [K[D[D[D[D[Dinfo b[K[D[D[D[D[D[Dinfo bl[K[D[D[D[D[D[D[Dinfo blo[K[D[D[D[D[D[D[D[Dinfo bloc[K[D[D[D[D[D[D[D[D[Dinfo block[K -drive0 (NODE_NAME): TEST_DIR/t.qcow2 (qcow2) +drive0 (NODE_NAME): json:{"backing": {"driver": "qcow2", "file": {"driver": "file", "filename": "TEST_DIR/t.qcow2.base"}}, "driver": "qcow2", "file": {"driver": "file", "filename": "TEST_DIR/t.qcow2"}} (qcow2) Removable device: not locked, tray closed Cache mode: writeback Backing file: TEST_DIR/t.qcow2.base (chain depth: 1) @@ -168,7 +168,7 @@ backing-file: TEST_DIR/t.qcow2.base (file, read-only) Testing: -drive
Re: [Qemu-devel] [PATCH] usb-mtp: fix usb_mtp_get_device_info so that libmtp on the guest doesn't complain
Stefan Hajnocziwrites: > On Sun, Apr 17, 2016 at 04:29:53AM -0700, Isaac Lozano wrote: >> If an application uses libmtp on the guest system, >> it will complain with the warning message: >> LIBMTP WARNING: VendorExtensionID: >> LIBMTP WARNING: VendorExtensionDesc: (null) >> LIBMTP WARNING: this typically means the device is PTP (i.e. a camera) but >> not a MTP device at all. Trying to continue anyway. >> >> This is because libmtp expects a MTP Vendor Extension ID of 0x0006 and a >> MTP Version of 0x0064. These numbers are taken from Microsoft's MTP Vendor >> Extension Identification Message page and are what most physical devices >> show. >> >> Signed-off-by: Isaac Lozano <109loza...@gmail.com> >> --- >> hw/usb/dev-mtp.c | 4 ++-- >> 1 file changed, 2 insertions(+), 2 deletions(-) > > Welcome to QEMU! > > Link to MTP Vendor Extension Identification Message page for other > reviewers: > https://msdn.microsoft.com/en-us/library/ff632482.aspx I am confused why the MTP spec on usb.org specifies in 5.1.1.2 that this value should be 0x for MTP devices. Is it just not updated or there are similar MS specific differences to watch out for ? Either way, it seems libmtp will always be a good reference when we are stuck with such discrepencies. > Reviewed-by: Stefan Hajnoczi
Re: [Qemu-devel] [RFC PATCH v2 2/2] spapr: Memory hot-unplug support
Quoting Igor Mammedov (2016-04-26 02:52:36) > On Tue, 26 Apr 2016 10:39:23 +0530 > Bharata B Raowrote: > > > On Mon, Apr 25, 2016 at 11:20:50AM +0200, Igor Mammedov wrote: > > > On Wed, 16 Mar 2016 10:11:54 +0530 > > > Bharata B Rao wrote: > > > > > > > On Wed, Mar 16, 2016 at 12:36:05PM +1100, David Gibson wrote: > > > > > On Tue, Mar 15, 2016 at 10:08:56AM +0530, Bharata B Rao wrote: > > > > > > Add support to hot remove pc-dimm memory devices. > > > > > > > > > > > > Signed-off-by: Bharata B Rao > > > > > > > > > > Reviewed-by: David Gibson > > > > > > > > > > Looks correct, but again, needs to wait on the PAPR change. > > > [...] > > > > > > > > While we are here, I would also like to get some opinion on the real > > > > need for memory unplug. Is there anything that memory unplug gives us > > > > which memory ballooning (shrinking mem via ballooning) can't give ? > > > Sure ballooning can complement memory hotplug but turning it on would > > > effectively reduce hotplug to balloning as it would enable overcommit > > > capability instead of hard partitioning pc-dimms provides. So one > > > could just use ballooning only and not bother with hotplug at all. > > > > > > On the other hand memory hotplug/unplug (at least on x86) tries > > > to model real hardware, thus removing need in paravirt ballooning > > > solution in favor of native guest support. > > > > Thanks for your views. > > > > > > > > PS: > > > Guest wise, currently hot-unplug is not well supported in linux, > > > i.e. it's not guarantied that guest will honor unplug request > > > as it may pin dimm by using it as a non migratable memory. So > > > there is something to work on guest side to make unplug more > > > reliable/guarantied. > > > > In the above scenario where the guest doesn't allow removal of certain > > parts of DIMM memory, what is the expected behaviour as far as QEMU > > DIMM device is concerned ? I seem to be running into this situation > > very often with PowerPC mem unplug where I am left with a DIMM device > > that has only some memory blocks released. In this situation, I would like > > to block further unplug requests on the same device, but QEMU seems > > to allow more such unplug requests to come in via the monitor. So > > qdev won't help me here ? Should I detect such condition from the > > machine unplug() handler and take required action ? > I think offlining is a guests task along with recovering from > inability to offline (i.e. offline all + eject or restore original state). > QUEM does it's job by notifying guest what dimm it wants to remove > and removes it when guest asks it (at least in x86 world). In the case of pseries, the DIMM abstraction isn't really exposed to the guest, but rather the memory blocks we use to make the backing memdev memory available to the guest. During unplug, the guest completely releases these blocks back to QEMU, and if it can only release a subset of what's requested it does not attempt to recover. We can potentially change that behavior on the guest side, since partially-freed DIMMs aren't currently useful on the host-side... But, in the case of pseries, I wonder if it makes sense to maybe go ahead and MADV_DONTNEED the ranges backing these released blocks so the host can at least partially reclaim the memory from a partially unplugged DIMM? > > > > > On x86, if some pages are offlined and subsequently other pages couldn't > > be offlined, then I see the full DIMM memory size remaining > > with the guest. So I infer that on x86, QEMU memory unplug either > > removes full DIMM or nothing. Is that understanding correct ? > I wouldn't bet that it's guarantied behavior but it should be this way. > > > > > Regards, > > Bharata. > > >
Re: [Qemu-devel] Is anyone able to load a web page from a guest operating system?
On Apr 26, 2016, at 4:12 PM, Thomas Huth wrote: > On 26.04.2016 21:25, Programmingkid wrote: >> >> On Apr 26, 2016, at 3:00 PM, Dr. David Alan Gilbert wrote: >> >>> * Programmingkid (programmingk...@gmail.com) wrote: My three guest operating systems can't load a web page. I think this is a bug with QEMU. Is there anyone who has the latest revision of QEMU that can access the web from a guest? Or are you experiencing the same problem? >>> >>> Works here. >> So you are using a very recent version of QEMU? Maybe something that was >> pulled in the last day or so? >> >>> Now - how about some basic debug! >>> Does your guest see a network card? >> yes >> >>> Does DNS work? >> Doesn't look like it. If I use just an ip address it still doesn't work. >> >>> Does ping work? >> I can ping the virtual router at 10.0.2.2. Any other ip address fails. > > That's normal for user-mode / slirp networking. You can't ping external > hosts with this mode. > >>> Does a telnet/ssh from the guest work? >> telnet www.google.com 80 failed >> Didn't have an address to use ssh on. >> >>> What's the qemu command line you use? >> qemu-system-ppc -hda -hdb -m 512 -boot c -M mac99 -netdev >> user,id=mynet0 -device usb-net,netdev=mynet0 -cpu 750 -prom-env boot-args=-v >> -device ich9-usb-uhci1,id=newusb -device usb-audio,bus=newusb.0 >> >> and >> >> qemu-system-ppc -hda -hdb -m 512 -boot c -M mac99 -netdev >> user,id=mynet0 -device rtl8139,netdev=mynet0 -cpu 750 -prom-env boot-args=-v >> -device ich9-usb-uhci1,id=newusb -device usb-audio,bus=newusb.0 > > Ok, that means you're using user-mode / slirp networking. > I just tried it with a pseries guest, and it seems to be working fine > for me with the current git version of QEMU (f419a626c76bcb266). So you are saying you can view web pages on your guest? > > Now, what kind of host do you use? Mac OS X? Yes. Mac OS 10.6. > Also can you determine a revision when it was still working fine for > you? (and then maybe even bisect the problem?) I will see what I can find out.
Re: [Qemu-devel] Is anyone able to load a web page from a guest operating system?
On 26.04.2016 21:25, Programmingkid wrote: > > On Apr 26, 2016, at 3:00 PM, Dr. David Alan Gilbert wrote: > >> * Programmingkid (programmingk...@gmail.com) wrote: >>> My three guest operating systems can't load a web page. I think this is a >>> bug with QEMU. Is there anyone who has the latest revision of QEMU that can >>> access the web from a guest? Or are you experiencing the same problem? >> >> Works here. > So you are using a very recent version of QEMU? Maybe something that was > pulled in the last day or so? > >> Now - how about some basic debug! >> Does your guest see a network card? > yes > >> Does DNS work? > Doesn't look like it. If I use just an ip address it still doesn't work. > >> Does ping work? > I can ping the virtual router at 10.0.2.2. Any other ip address fails. That's normal for user-mode / slirp networking. You can't ping external hosts with this mode. >> Does a telnet/ssh from the guest work? > telnet www.google.com 80 failed > Didn't have an address to use ssh on. > >> What's the qemu command line you use? > qemu-system-ppc -hda -hdb -m 512 -boot c -M mac99 -netdev > user,id=mynet0 -device usb-net,netdev=mynet0 -cpu 750 -prom-env boot-args=-v > -device ich9-usb-uhci1,id=newusb -device usb-audio,bus=newusb.0 > > and > > qemu-system-ppc -hda -hdb -m 512 -boot c -M mac99 -netdev > user,id=mynet0 -device rtl8139,netdev=mynet0 -cpu 750 -prom-env boot-args=-v > -device ich9-usb-uhci1,id=newusb -device usb-audio,bus=newusb.0 Ok, that means you're using user-mode / slirp networking. I just tried it with a pseries guest, and it seems to be working fine for me with the current git version of QEMU (f419a626c76bcb266). Now, what kind of host do you use? Mac OS X? Also can you determine a revision when it was still working fine for you? (and then maybe even bisect the problem?) Thomas
Re: [Qemu-devel] Is anyone able to load a web page from a guest operating system?
On Apr 26, 2016, at 3:00 PM, Dr. David Alan Gilbert wrote: > * Programmingkid (programmingk...@gmail.com) wrote: >> My three guest operating systems can't load a web page. I think this is a >> bug with QEMU. Is there anyone who has the latest revision of QEMU that can >> access the web from a guest? Or are you experiencing the same problem? > > Works here. So you are using a very recent version of QEMU? Maybe something that was pulled in the last day or so? > Now - how about some basic debug! > Does your guest see a network card? yes > Does DNS work? Doesn't look like it. If I use just an ip address it still doesn't work. > Does ping work? I can ping the virtual router at 10.0.2.2. Any other ip address fails. > Does a telnet/ssh from the guest work? telnet www.google.com 80 failed Didn't have an address to use ssh on. > What's the qemu command line you use? qemu-system-ppc -hda -hdb -m 512 -boot c -M mac99 -netdev user,id=mynet0 -device usb-net,netdev=mynet0 -cpu 750 -prom-env boot-args=-v -device ich9-usb-uhci1,id=newusb -device usb-audio,bus=newusb.0 and qemu-system-ppc -hda -hdb -m 512 -boot c -M mac99 -netdev user,id=mynet0 -device rtl8139,netdev=mynet0 -cpu 750 -prom-env boot-args=-v -device ich9-usb-uhci1,id=newusb -device usb-audio,bus=newusb.0 This and my other guest did access the internet before. Thanks for helping.
Re: [Qemu-devel] Is anyone able to load a web page from a guest operating system?
* Programmingkid (programmingk...@gmail.com) wrote: > My three guest operating systems can't load a web page. I think this is a bug > with QEMU. Is there anyone who has the latest revision of QEMU that can > access the web from a guest? Or are you experiencing the same problem? Works here. Now - how about some basic debug! Does your guest see a network card? Does DNS work? Does ping work? Does a telnet/ssh from the guest work? What's the qemu command line you use? Dave -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [Qemu-devel] [RFC for-2.7 05/11] pseries: Build device tree only at reset time
On 20.04.2016 04:33, David Gibson wrote: > Currently the pseries code builds a "skeleton" device tree at machine init > time, then adds a bunch of stuff to it at reset. Over time, more and more > logic has had to be moved from init to reset time, and there's really no > advantage to doing any of it at init time. > > This patch removes the init time spapr_create_fdt_skel() and moves its > logic into the reset time spapr_build_fdt(). There's still a fairly > pointless divide between the "skeleton" logic (using libfdt serial-write > functions) and the "tweak" logic (using libfdt random access functions) > but at least it all happens at the same time, making further consolidation > easier. > > Signed-off-by: David Gibson> --- > hw/ppc/spapr.c | 398 > - > include/hw/ppc/spapr.h | 1 - > 2 files changed, 192 insertions(+), 207 deletions(-) > > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c > index da10136..6e1192f 100644 > --- a/hw/ppc/spapr.c > +++ b/hw/ppc/spapr.c ... > @@ -901,7 +702,7 @@ static void *spapr_build_fdt(sPAPRMachineState *spapr, > hwaddr rtas_addr, > hwaddr rtas_size) > { > -MachineState *machine = MACHINE(qdev_get_machine()); > +MachineState *machine = MACHINE(spapr); > sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine); > const char *boot_device = machine->boot_order; > int ret, i; > @@ -909,11 +710,200 @@ static void *spapr_build_fdt(sPAPRMachineState *spapr, > char *bootlist; > void *fdt; > sPAPRPHBState *phb; > +uint32_t start_prop = cpu_to_be32(spapr->initrd_base); > +uint32_t end_prop = cpu_to_be32(spapr->initrd_base + spapr->initrd_size); > +GString *hypertas = g_string_sized_new(256); > +GString *qemu_hypertas = g_string_sized_new(256); > +uint32_t refpoints[] = {cpu_to_be32(0x4), cpu_to_be32(0x4)}; > +uint32_t interrupt_server_ranges_prop[] = {0, cpu_to_be32(max_cpus)}; > +unsigned char vec5[] = {0x0, 0x0, 0x0, 0x0, 0x0, 0x80}; > +char *buf; > + > +add_str(hypertas, "hcall-pft"); > +add_str(hypertas, "hcall-term"); > +add_str(hypertas, "hcall-dabr"); > +add_str(hypertas, "hcall-interrupt"); > +add_str(hypertas, "hcall-tce"); > +add_str(hypertas, "hcall-vio"); > +add_str(hypertas, "hcall-splpar"); > +add_str(hypertas, "hcall-bulk"); > +add_str(hypertas, "hcall-set-mode"); > +add_str(qemu_hypertas, "hcall-memop1"); > + > +fdt = g_malloc0(FDT_MAX_SIZE); > +_FDT((fdt_create(fdt, FDT_MAX_SIZE))); > + > +if (spapr->kernel_size) { > +_FDT((fdt_add_reservemap_entry(fdt, KERNEL_LOAD_ADDR, > + spapr->kernel_size))); > +} > +if (spapr->initrd_size) { > +_FDT((fdt_add_reservemap_entry(fdt, spapr->initrd_base, > + spapr->initrd_size))); > +} > +_FDT((fdt_finish_reservemap(fdt))); > + > +/* Root node */ > +_FDT((fdt_begin_node(fdt, ""))); > +_FDT((fdt_property_string(fdt, "device_type", "chrp"))); > +_FDT((fdt_property_string(fdt, "model", "IBM pSeries (emulated by > qemu)"))); > +_FDT((fdt_property_string(fdt, "compatible", "qemu,pseries"))); > + > +/* > + * Add info to guest to indentify which host is it being run on > + * and what is the uuid of the guest > + */ > +if (kvmppc_get_host_model()) { > +_FDT((fdt_property_string(fdt, "host-model", buf))); > +g_free(buf); > +} > +if (kvmppc_get_host_serial()) { > +_FDT((fdt_property_string(fdt, "host-serial", buf))); > +g_free(buf); > +} > > -fdt = g_malloc(FDT_MAX_SIZE); > +buf = g_strdup_printf(UUID_FMT, qemu_uuid[0], qemu_uuid[1], > + qemu_uuid[2], qemu_uuid[3], qemu_uuid[4], > + qemu_uuid[5], qemu_uuid[6], qemu_uuid[7], > + qemu_uuid[8], qemu_uuid[9], qemu_uuid[10], > + qemu_uuid[11], qemu_uuid[12], qemu_uuid[13], > + qemu_uuid[14], qemu_uuid[15]); > + > +_FDT((fdt_property_string(fdt, "vm,uuid", buf))); > +if (qemu_uuid_set) { > +_FDT((fdt_property_string(fdt, "system-id", buf))); > +} > +g_free(buf); > + > +if (qemu_get_vm_name()) { > +_FDT((fdt_property_string(fdt, "ibm,partition-name", > + qemu_get_vm_name(; > +} > + > +_FDT((fdt_property_cell(fdt, "#address-cells", 0x2))); > +_FDT((fdt_property_cell(fdt, "#size-cells", 0x2))); > + > +/* /chosen */ > +_FDT((fdt_begin_node(fdt, "chosen"))); > + > +/* Set Form1_affinity */ > +_FDT((fdt_property(fdt, "ibm,architecture-vec-5", vec5, sizeof(vec5; > + > +_FDT((fdt_property_string(fdt, "bootargs", machine->kernel_cmdline))); > +_FDT((fdt_property(fdt, "linux,initrd-start", > +
Re: [Qemu-devel] KVM Forum 2016: Call For Participation
On 10 March 2016 at 18:09, Paolo Bonziniwrote: > = > KVM Forum 2016: Call For Participation > August 24-26, 2016 - Westin Harbor Castle - Toronto, Canada > > (All submissions must be received before midnight May 1, 2016) > = > > KVM Forum is an annual event that presents a rare opportunity > for developers and users to meet, discuss the state of Linux > virtualization technology, and plan for the challenges ahead. > We invite you to lead part of the discussion by submitting a speaking > proposal for KVM Forum 2016. Hi; this is just a followup to remind people that the KVM Forum CFP deadline of May 1st is rapidly approaching, so if you were thinking of submitting a proposal now is the time. All the CFP information is here: http://events.linuxfoundation.org/events/kvm-forum/program/cfp thanks -- PMM (on behalf of the KVM Forum 2016 Program Committee)
Re: [Qemu-devel] [PATCH v4 00/16] IOMMU: Enable interrupt remapping for Intel IOMMU
On 2016-04-26 18:07, Radim Krčmář wrote: > 2016-04-26 17:28+0200, Jan Kiszka: >> On 2016-04-26 16:59, Radim Krčmář wrote: >>> 2016-04-26 16:24+0200, Jan Kiszka: On 2016-04-26 13:40, Peter Xu wrote: > Currently, all the interrupts will be translated into one MSI in > vtd_generate_msi_message(), in which only 8 bits of dest_id is used > (msg.dest = irq->dest). We may possibly need to use the high 32 bits > of MSI address to store the rest dest[31:8]? Don't know whether this > would be enough though. Yes, I ran into this topic as well as I hacked up those line. Currently, KVM does not support more than 254 vCPUs, so 8 bits of those 32 are actually fine, and piggy-backing them in an MSI message works. Once KVM supports more CPUs, it has to come up with a new userspace interface to inject APIC events for more than 255 CPUs. Maybe the existing direct MSI inject with its unused flags could be "bended", maybe there are already better ideas (Radim?). >>> >>> Adding a flag to msi_msg and taking 3-4 bytes from padding to express >>> x2APIC addresses is reasonable. (It is what my prototype did. :]) >>> >>> The conceptually different idea is forcing all userspace interrupts >>> through irqfd routes, which would obsolete the ad-host inject. >> >> irqfd for userspace sources is a bit clumsy from the API POV. On the >> other hand, we need to tweak the routing API anyway to achieve the same >> address extension there, too. > > I agree, both of them should change. KVM_SIGNAL_MSI was added just > because the route-based injection in kvm_irqchip_send_msi() sucked; > maybe we could find a good solution now, but the direct interface is > already here ... We won't get away without irq routes because we need them for sources that only issue binary signals for interrupt events (in-kernel, eventfd from userspace). In fact, those sources should not know more than that about their irqs. But we also do not want them to route everything through userspace for on-the-fly translation and reinjection to kvm. Jan
Re: [Qemu-devel] [RFC for-2.7 04/11] pseries: Make spapr_create_fdt_skel() get information from machine state
On 20.04.2016 04:33, David Gibson wrote: > Currently spapr_create_fdt_skel() takes a bunch of individual parameters > for various things it will put in the device tree. Some of these can > already be taken directly from sPAPRMachineState. This patch alters it so > that all of them can be taken from there, which will allow this code to > be moved away from its current caller in future. > > Signed-off-by: David Gibson> --- > hw/ppc/spapr.c | 81 > ++ > include/hw/ppc/spapr.h | 4 +++ > 2 files changed, 40 insertions(+), 45 deletions(-) Reviewed-by: Thomas Huth
Re: [Qemu-devel] how to enable nvme device in Qemu?
On 04/26/16 11:18, Stefan Hajnoczi wrote: > On Tue, Apr 19, 2016 at 08:37:42AM +, Qingtao Sun wrote: >> I want to boot OS from NVMe device in qemu-2.5.1. But I didn't find >> the bootable NVMe device in guest BIOS. How to enable this feature? > > I don't see an NVMe driver so I guess SeaBIOS does not support boot from > NVMe: > https://code.coreboot.org/p/seabios/source/tree/master/src/hw > > I have CCed the SeaBIOS mailing list in case they have any advice. Removing the SeaBIOS list before I achieve "generally hated" status there, for peddling OVMF :) Qingtao Sun: you can boot off of NVMe devices with recent OVMF. https://github.com/tianocore/edk2/issues/48 https://www.kraxel.org/repos/ As far as QEMU is concerned, you will need a binary that has commit a907ec52cc1a (v2.6.0-rc3, for example -- v2.5.* don't have it, directly or via cherry-pick). Thanks Laszlo
[Qemu-devel] Working on AF_VSOCK packet capture
Hi all, My name is Gerard Garcia and I have been funded by GSOC16 to work on a device driver to allow capturing host/guest traffic through AF_VSOCK sockets: http://qemu-project.org/Features/VirtioVsock I'll mostly work on the Linux kernel codebase but the device driver is closely related to QEMU so I felt like introducing myself here! My IRC nick is gerardgarcia. I'll try to be around! Gerard
Re: [Qemu-devel] [Qemu-arm] [PATCH] target-arm: Fix descriptor address masking in ARM address translation
On 03/21/2016 09:56 AM, Sergey Sorokin wrote: 17.03.2016, 18:24, "Peter Maydell": On 17 March 2016 at 15:21, Sergey Sorokin wrote: 17.03.2016, 14:40, "Peter Maydell" : On 13 March 2016 at 18:28, Sergey Sorokin wrote: If you want to implement the AddressSize checks that's fine, but otherwise please leave this bit of the code alone. You said me that my code is not correct, I have proved that it conforms to the documentation. It's a bit obfuscating when the doc explicitly says to take bits up to 39 from the descriptor, but in QEMU we take bits up to 47 relying on the check in another part of the code, even if both ways are correct. The way the code in QEMU is structured is that we extract the descriptor field in one go and then will operate on it (checking for need to AddressSize fault, etc) as a second action. The field descriptors themselves are the sizes I said. Well, may be it's enough just to change this comment as you intend: - /* The address field in the descriptor goes up to bit 39 for ARMv7 - * but up to bit 47 for ARMv8. + /* The address field in the descriptor goes up to bit 39 for AArch32 + * but up to bit 47 for AArch64. */ The comment is correct as it stands. thanks -- PMM I mean in the patch. We need to fix lower bits in descaddrmask anyway. So: I could describe in the comment, that the descriptor field is up to bit 47 for ARMv8 (as long as you want it), but we use the descaddrmask up to bit 39 for AArch32, because we don't need other bits in that case to construct next descriptor address. It is clearly described in the ARM pseudo-code. Why should we keep in the mask bits from 40 up to 47 if we don't need them? Even if they are all zeroes. It is a bit obfuscating, as I said. I agree with Peter. The original comment is correct. Looking at the TLBRecord AArch32.TranslationTableWalkLD pseudocode, it is treating the AArch32 address as 48 bits long. For example: if !IsZero(baseregister<47:40>) then level = 0; result.addrdesc.fault = AArch32.AddressSizeFault(ipaddress, domain, level, acctype, iswrite, secondstage, s2fs1walk); return result; This requires that an AArch32 address have specific values up through bit 47.
[Qemu-devel] Is anyone able to load a web page from a guest operating system?
My three guest operating systems can't load a web page. I think this is a bug with QEMU. Is there anyone who has the latest revision of QEMU that can access the web from a guest? Or are you experiencing the same problem?
Re: [Qemu-devel] [PATCH v4 00/16] IOMMU: Enable interrupt remapping for Intel IOMMU
2016-04-26 17:28+0200, Jan Kiszka: > On 2016-04-26 16:59, Radim Krčmář wrote: >> 2016-04-26 16:24+0200, Jan Kiszka: >>> On 2016-04-26 13:40, Peter Xu wrote: Currently, all the interrupts will be translated into one MSI in vtd_generate_msi_message(), in which only 8 bits of dest_id is used (msg.dest = irq->dest). We may possibly need to use the high 32 bits of MSI address to store the rest dest[31:8]? Don't know whether this would be enough though. >>> >>> Yes, I ran into this topic as well as I hacked up those line. Currently, >>> KVM does not support more than 254 vCPUs, so 8 bits of those 32 are >>> actually fine, and piggy-backing them in an MSI message works. >>> >>> Once KVM supports more CPUs, it has to come up with a new userspace >>> interface to inject APIC events for more than 255 CPUs. Maybe the >>> existing direct MSI inject with its unused flags could be "bended", >>> maybe there are already better ideas (Radim?). >> >> Adding a flag to msi_msg and taking 3-4 bytes from padding to express >> x2APIC addresses is reasonable. (It is what my prototype did. :]) >> >> The conceptually different idea is forcing all userspace interrupts >> through irqfd routes, which would obsolete the ad-host inject. > > irqfd for userspace sources is a bit clumsy from the API POV. On the > other hand, we need to tweak the routing API anyway to achieve the same > address extension there, too. I agree, both of them should change. KVM_SIGNAL_MSI was added just because the route-based injection in kvm_irqchip_send_msi() sucked; maybe we could find a good solution now, but the direct interface is already here ...
Re: [Qemu-devel] [PATCH for-2.7 v2 07/17] rbd: Implement image locking
On Sun, Apr 24, 2016 at 7:42 PM, Fam Zhengwrote: > On Fri, 04/22 21:57, Jason Dillaman wrote: >> Since this cannot automatically recover from a crashed QEMU client with an >> RBD image, perhaps this RBD locking should not default to enabled. >> Additionally, this will conflict with the "exclusive-lock" feature >> available since the Ceph Hammer-release since both utilize the same locking >> construct. >> >> As a quick background, the optional exclusive-lock feature can be >> enabled/disabled on image and safely/automatically handles the case of >> recovery from a crashed client. Under normal conditions, the RBD >> exclusive-lock feature automatically acquires the lock upon the first >> attempt to write to the image and transparently transitions ownership of >> the lock between two or more clients -- used for QEMU live-migration. > > Is it enabled by default? > Starting with the Jewel release of Ceph it is enabled by default. >> >> I'd be happy to add a new librbd API method to explicitly expose acquiring >> and releasing the RBD exclusive lock since it certainly solves a couple >> compromises in our current QEMU integration. > > Does the API do enable/disable or acquire/relase? Could you explain the > differences between it and rbd_lock_exclusive? There is already an API for enabling/disabling the exclusive-lock feature (and associated CLI tooling). This would be a new API method to explicitly acquire / release the exclusive-lock (instead of implicitly acquiring the lock upon first write request as it currently does in order to support QEMU live-migration). The rbd_lock_exclusive/rbd_lock_shared are essentially advisory locks. Nothing stops a client from accessing the image if it doesn't attempt to acquire the lock (even if another client already has the image locked exclusively). It also doesn't support automatic recovery from failed clients. -- Jason
Re: [Qemu-devel] [RFC v3] translate-all: protect code_gen_buffer with RCU
On 04/25/2016 11:35 PM, Alex Bennée wrote: > > Richard Hendersonwrites: > >> On 04/25/2016 04:46 PM, Emilio G. Cota wrote: >>> +/* >>> + * write the prologue into buf2. This is safe because we'll later call >>> + * tcg_prologue_init on buf1, from which we'll start execution. >>> + */ >>> +tcg_ctx.code_gen_buffer = code_gen_buf2; >>> +tcg_prologue_init(_ctx); >>> + >> >> Ah, no. Write only one prologue, not one per buffer. >> >> If they're sufficiently close (i.e. one allocation under the max size), >> then the same one can be used for both halves. >> >> The global variables that you didn't see in this revision are: >> >> aarch64/tcg-target.inc.c:static tcg_insn_unit *tb_ret_addr; >> arm/tcg-target.inc.c:static tcg_insn_unit *tb_ret_addr; >> i386/tcg-target.inc.c:static tcg_insn_unit *tb_ret_addr; >> ia64/tcg-target.inc.c:static tcg_insn_unit *tb_ret_addr; >> ia64/tcg-target.inc.c:tcg_insn_unit *thunks[8] = { }; >> mips/tcg-target.inc.c:static tcg_insn_unit *tb_ret_addr; >> ppc/tcg-target.inc.c:static tcg_insn_unit *tb_ret_addr; >> s390/tcg-target.inc.c:static tcg_insn_unit *tb_ret_addr; >> sparc/tcg-target.inc.c:static tcg_insn_unit *qemu_ld_trampoline[16]; >> sparc/tcg-target.inc.c:static tcg_insn_unit *qemu_st_trampoline[16]; > > Aside from the existing code structure is there any reason to have only > one prologue? Well, there's also the gdb jit unwind info. But aside from those, no. > It doesn't seem to be a large amount of code and in the > case of having smaller translation regions I would posit having a > "local" prologue/epilogue would make the jumps cheaper. Not really. The jumps are generally in range already, based on the restriction in max buffer size. Really only arm32 (and ppc32, post direct jump atomicity patchset) are the only ones that require a tiny (less than 64MB) buffer. Anything bigger than 64MB, I don't see any reason to create two independent buffers. The other consideration not yet mentioned is that you'd like to put on the entire buffer, in the case of x86_64 and some others, within 2GB of the main executable, so that helper calls can use a direct call insn. r~
Re: [Qemu-devel] [PATCH v4 00/16] IOMMU: Enable interrupt remapping for Intel IOMMU
On 2016-04-26 16:59, Radim Krčmář wrote: > 2016-04-26 16:24+0200, Jan Kiszka: >> On 2016-04-26 13:40, Peter Xu wrote: >>> Currently, all the interrupts will be translated into one MSI in >>> vtd_generate_msi_message(), in which only 8 bits of dest_id is used >>> (msg.dest = irq->dest). We may possibly need to use the high 32 bits >>> of MSI address to store the rest dest[31:8]? Don't know whether this >>> would be enough though. >> >> Yes, I ran into this topic as well as I hacked up those line. Currently, >> KVM does not support more than 254 vCPUs, so 8 bits of those 32 are >> actually fine, and piggy-backing them in an MSI message works. >> >> Once KVM supports more CPUs, it has to come up with a new userspace >> interface to inject APIC events for more than 255 CPUs. Maybe the >> existing direct MSI inject with its unused flags could be "bended", >> maybe there are already better ideas (Radim?). > > Adding a flag to msi_msg and taking 3-4 bytes from padding to express > x2APIC addresses is reasonable. (It is what my prototype did. :]) > > The conceptually different idea is forcing all userspace interrupts > through irqfd routes, which would obsolete the ad-host inject. irqfd for userspace sources is a bit clumsy from the API POV. On the other hand, we need to tweak the routing API anyway to achieve the same address extension there, too. Jan > >> In any case, the KVM layer in userspace will then have to pick up all 32 >> destination bits from the IRTE and deliver them via that new interface >> to the in-kernel APICs. > > I'll keep that in mind, thanks. >
Re: [Qemu-devel] [PATCH v4 00/16] IOMMU: Enable interrupt remapping for Intel IOMMU
2016-04-26 16:24+0200, Jan Kiszka: > On 2016-04-26 13:40, Peter Xu wrote: >> Currently, all the interrupts will be translated into one MSI in >> vtd_generate_msi_message(), in which only 8 bits of dest_id is used >> (msg.dest = irq->dest). We may possibly need to use the high 32 bits >> of MSI address to store the rest dest[31:8]? Don't know whether this >> would be enough though. > > Yes, I ran into this topic as well as I hacked up those line. Currently, > KVM does not support more than 254 vCPUs, so 8 bits of those 32 are > actually fine, and piggy-backing them in an MSI message works. > > Once KVM supports more CPUs, it has to come up with a new userspace > interface to inject APIC events for more than 255 CPUs. Maybe the > existing direct MSI inject with its unused flags could be "bended", > maybe there are already better ideas (Radim?). Adding a flag to msi_msg and taking 3-4 bytes from padding to express x2APIC addresses is reasonable. (It is what my prototype did. :]) The conceptually different idea is forcing all userspace interrupts through irqfd routes, which would obsolete the ad-host inject. > In any case, the KVM layer in userspace will then have to pick up all 32 > destination bits from the IRTE and deliver them via that new interface > to the in-kernel APICs. I'll keep that in mind, thanks.
Re: [Qemu-devel] [patch v6 11/12] vfio: register aer resume notification handler for aer resume
On Tue, 26 Apr 2016 11:39:02 +0800 Chen Fanwrote: > On 04/14/2016 09:02 AM, Chen Fan wrote: > > > > On 04/12/2016 05:38 AM, Alex Williamson wrote: > >> On Tue, 5 Apr 2016 19:42:02 +0800 > >> Cao jin wrote: > >> > >>> From: Chen Fan > >>> > >>> for supporting aer recovery, host and guest would run the same aer > >>> recovery code, that would do the secondary bus reset if the error > >>> is fatal, the aer recovery process: > >>>1. error_detected > >>>2. reset_link (if fatal) > >>>3. slot_reset/mmio_enabled > >>>4. resume > >>> > >>> it indicates that host will do secondary bus reset to reset > >>> the physical devices under bus in step 2, that would cause > >>> devices in D3 status in a short time. but in qemu, we register > >>> an error detected handler, that would be invoked as host broadcasts > >>> the error-detected event in step 1, in order to avoid guest do > >>> reset_link when host do reset_link simultaneously. it may cause > >>> fatal error. we introduce a resmue notifier to assure host reset > >>> completely. then do guest aer injection. > >> Why is it safe to continue running the VM between the error detected > >> notification and the resume notification? We're just pushing back the > >> point at which we inject the AER into the guest, potentially negating > >> any benefit by allowing the VM to consume bad data. Shouldn't we > >> instead be immediately notifying the VM on error detected, but stalling > >> any access to the device until resume is signaled? How do we know that > >> resume will ever be signaled? We have both the problem that we may be > >> running on an older kernel that won't support a resume notification and > >> the problem that seeing a resume notification depends on the host being > >> able to successfully complete a link reset after fatal error. We can > >> detect support for resume notification, but we still need a strategy > >> for never receiving it. Thanks, > > That's make sense, but I haven't came up with a good idea. do you have > > any idea, Alex? I don't know that there are any good solutions here. We need to respond to the current error notifier interrupt and not regress from our support there. I think that means that if we want to switch from a simple halt-on-error to a mechanism for the guest to handle recovery, we need to disable access to the device between being notified that the error occurred and being notified to resume. We can do that by disabling mmaps to the device and preventing access via the slow path handlers. I don't know what the best solution is for preventing access, do we block and pause the VM or do we drop writes and return -1 for reads, that's something that needs to be determined. We also need to inject the AER into the VM at the point we're notified of an error because the VM needs to know as soon as possible to stop using the device or trusting any data from it. The next coordination point would be something like the resume notifier that you've added and there are numerous questions around the interaction of that with the guest handling. Clearly we can't do a guest directed bus reset until we get the resume notifier, so do we block that execution path in QEMU until the resume notification is received? What happens if we don't get that notification? Is there any way that we can rely on the host having done a bus reset to the point where we don't need to act on the guest directed reset? These are all things that need to be figured out. Thanks, Alex
Re: [Qemu-devel] [PATCH] iotests: fix the redirection order in 083
On 04/26/2016 04:13 AM, Wei Jiangang wrote: > It should redirect stdout to /dev/null first, > then redirect stderr to whatever stdout currently points at. > > Signed-off-by: Wei Jiangang> --- > tests/qemu-iotests/083 | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) Reviewed-by: Eric Blake -- Eric Blake eblake redhat com+1-919-301-3266 Libvirt virtualization library http://libvirt.org signature.asc Description: OpenPGP digital signature
Re: [Qemu-devel] [PATCH v4 00/16] IOMMU: Enable interrupt remapping for Intel IOMMU
On 2016-04-26 13:40, Peter Xu wrote: > On Tue, Apr 26, 2016 at 12:51:44PM +0200, Jan Kiszka wrote: >> On 2016-04-26 12:38, Peter Xu wrote: >> Hi, Jan, >> >> The above issue should be caused by EOI missing of level-triggered >> interrupts. Before that, I was always using edge-triggered >> interrupts for test, so didn't encounter this one. Would you please >> help try below patch? It can be applied directly onto the series, >> and should solve the issue (it works on my test vm, and I'll take it >> in v5 as well if it also works for you): >> > > Works here as well. I even made EIM working with some hack, though > Jailhouse spits out strange warnings, despite it works fine (x2apic > mode, split irqchip). Corrections: the warnings are issued by qemu, not Jailhouse, e.g. qemu-system-x86_64: VT-d Failed to remap interrupt for gsi 22. I suspect that comes from the hand-over phase of Jailhouse, when it mutes all interrupts in the system while reconfiguring IR and IOAPIC. Please convert this error (in kvm_arch_fixup_msi_route) into a trace point. It shall not annoy the host. Also check if you have more of such guest-triggerable error messages. >>> >>> Okay. This should be the only one. I can use trace instead. >>> >>> Meanwhile, I still suppose we should not seen it even with >>> error_report().. Would this happen when boot e.g. generic kernels? >> >> No, this is - most probably, still need to validate in details - an >> effect due to the special case with Jailhouse that interrupt and VT-d >> management is handed over from a running OS to a hypervisor. >> >> Jan >> >> PS: Current quick hack for EIM support (lacks compat mode blocking at >> least): >> >> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c >> index 1082ab5..709a92c 100644 >> --- a/hw/i386/intel_iommu.c >> +++ b/hw/i386/intel_iommu.c >> @@ -907,6 +907,7 @@ static void >> vtd_interrupt_remap_table_setup(IntelIOMMUState *s) >> value = vtd_get_quad_raw(s, DMAR_IRTA_REG); >> s->intr_size = 1UL << ((value & VTD_IRTA_SIZE_MASK) + 1); >> s->intr_root = value & VTD_IRTA_ADDR_MASK; >> +s->intr_eime = value & VTD_IRTA_EIME; >> >> QLIST_FOREACH(notifier, >iec_notifiers, list) { >> if (notifier->iec_notify) { >> @@ -2052,11 +2053,13 @@ static int vtd_remap_irq_get(IntelIOMMUState *iommu, >> uint16_t index, VTDIrq *irq >> irq->trigger_mode = irte.trigger_mode; >> irq->vector = irte.vector; >> irq->delivery_mode = irte.delivery_mode; >> -/* Not support EIM yet: please refer to vt-d 9.10 DST bits */ >> +irq->dest = irte.dest_id; >> +if (!iommu->intr_eime) { >> #define VTD_IR_APIC_DEST_MASK (0xff00ULL) >> #define VTD_IR_APIC_DEST_SHIFT(8) >> -irq->dest = (irte.dest_id & VTD_IR_APIC_DEST_MASK) >> \ >> -VTD_IR_APIC_DEST_SHIFT; >> +irq->dest = (irq->dest & VTD_IR_APIC_DEST_MASK) >> >> +VTD_IR_APIC_DEST_SHIFT; >> +} >> irq->dest_mode = irte.dest_mode; >> irq->redir_hint = irte.redir_hint; >> >> @@ -2316,7 +2319,7 @@ static void vtd_init(IntelIOMMUState *s) >> s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO; >> >> if (ms->iommu_intr) { >> -s->ecap |= VTD_ECAP_IR; >> +s->ecap |= VTD_ECAP_IR | VTD_ECAP_EIM; >> } >> >> vtd_reset_context_cache(s); >> @@ -2370,10 +2373,9 @@ static void vtd_init(IntelIOMMUState *s) >> vtd_define_quad(s, DMAR_FRCD_REG_0_2, 0, 0, 0x8000ULL); >> >> /* >> - * Interrupt remapping registers, not support extended interrupt >> - * mode for now. >> + * Interrupt remapping registers. >> */ >> -vtd_define_quad(s, DMAR_IRTA_REG, 0, 0xf00fULL, 0); >> +vtd_define_quad(s, DMAR_IRTA_REG, 0, 0xf80fULL, 0); >> } >> >> /* Should not reset address_spaces when reset because devices will still use >> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h >> index 10c20fe..72b0114 100644 >> --- a/hw/i386/intel_iommu_internal.h >> +++ b/hw/i386/intel_iommu_internal.h >> @@ -176,6 +176,7 @@ >> >> /* IRTA_REG */ >> #define VTD_IRTA_ADDR_MASK (VTD_HAW_MASK ^ 0xfffULL) >> +#define VTD_IRTA_EIME (1ULL << 11) >> #define VTD_IRTA_SIZE_MASK (0xfULL) >> >> /* ECAP_REG */ >> @@ -184,6 +185,7 @@ >> #define VTD_ECAP_QI (1ULL << 1) >> /* Interrupt Remapping support */ >> #define VTD_ECAP_IR (1ULL << 3) >> +#define VTD_ECAP_EIM(1ULL << 4) >> >> /* CAP_REG */ >> /* (offset >> 4) << 24 */ >> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h >> index d7613af..c6c53c6 100644 >> --- a/include/hw/i386/intel_iommu.h >> +++ b/include/hw/i386/intel_iommu.h >> @@ -261,6 +261,7 @@ struct IntelIOMMUState { >> bool intr_enabled; /* Whether guest enabled IR */ >> dma_addr_t intr_root;
Re: [Qemu-devel] [PATCH v4 00/16] IOMMU: Enable interrupt remapping for Intel IOMMU
2016-04-26 15:34+0800, Peter Xu: > Hi, Jan, > > The above issue should be caused by EOI missing of level-triggered > interrupts. Before that, I was always using edge-triggered > interrupts for test, so didn't encounter this one. Would you please > help try below patch? It can be applied directly onto the series, > and should solve the issue (it works on my test vm, and I'll take it > in v5 as well if it also works for you): > > - > > diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c > @@ -281,6 +281,36 @@ ioapic_mem_read(void *opaque, hwaddr addr, unsigned int > size) > +/* > + * This is to satisfy the hack in Linux kernel. One hack of it is to > + * simulate clearing the Remote IRR bit of IOAPIC entry using the > + * following: > + * > + * "For IO-APIC's with EOI register, we use that to do an explicit EOI. > + * Otherwise, we simulate the EOI message manually by changing the trigger > + * mode to edge and then back to level, with RTE being masked during > + * this." > + * > + * (See linux kernel __eoi_ioapic_pin() comment in commit c0205701) > + * > + * This is based on the assumption that, Remote IRR bit will be > + * cleared by IOAPIC hardware for edge-triggered interrupts (I > + * believe that's what the IOAPIC version 0x1X hardware does). I thought that Linux doesn't use explicit "EOI" to IO-APIC, but relies on EOI broadcast from LAPIC -- does that change with IR? > + * So > + * if we are emulating it, we'd better do it the same here, so that > + * the guest kernel hack will work as well on QEMU. Totally. > + * Without this, level-triggered interrupts in IR mode might fail to > + * work correctly. (I don't really understand why it worked before.) > + */ > +static inline void > +ioapic_fix_edge_remote_irr(uint64_t *entry) > +{ > +if (*entry & IOAPIC_LVT_TRIGGER_MODE) { > +/* Level triggered interrupts, make sure remote IRR is zero */ > +*entry &= ~((uint64_t)IOAPIC_LVT_REMOTE_IRR); (You can just unconditionally zero it, edge doesn't care.) > +} > +} > + > @@ -314,6 +344,7 @@ ioapic_mem_write(void *opaque, hwaddr addr, uint64_t val, > s->ioredtbl[index] &= ~0xULL; > s->ioredtbl[index] |= val; > } > +ioapic_fix_edge_remote_irr(>ioredtbl[index]); I think this can be done only in the else branch of (s->ioregsel & 1). (If the guest kernel does level->edge->level, then remote_irr probably should be cleared only on edge->level transition and not on level->level, but I haven't seen that in the spec ...) > ioapic_service(s); > > > I am still looking into guest part codes. Although the above patch > should solve the issue, there are still issues in guest codes when > IR is enabled: > > - mismatched "vector" in IOAPIC entry and IRTE entry (this is > required in vt-d spec 5.1.5.1, and required to correctly deliver > EOI broadcast I guess). See intel_irq_remapping_prepare_irte(): "required" is a way of saying that the opposite is undefined. No need to think about it in IOMMU. > - I encountered that level-triggered entries in IOAPIC is marked as > edge-triggered interrupt in APIC (which is strange)... What/where do you mean? (The only difference I know of is that level triggered vectors in LAPIC have their respective TMR bit set while edge do not.) Thanks.
Re: [Qemu-devel] Simultaneous vhostfds= and ifname= options
Hi all, Any thoughts to share on that? Thanks, Michal On 15 April 2016 at 18:50, Michał Dubielwrote: > Hi guys, > > Qemu does not allow currently to use vhostfds= and ifname= options of > -netdev tap... at the same time. In order to use vhostfds=, you have to > also use fds= instead. May ask you what is rationale behind that? > > I am asking as we need this for the OpenContrail vRouter multiqueue > support. We are passing name of previously created tap, and vhost-net > device descriptors (which are supplied by the libvirt). > > Actually I have internally a very simple patch that allows for > simultaneous vhostfds= and ifname=. All seems to work fine with this, but > perhaps there is something I do not know and this combination of options > indeed should not be allowed. > > If there is no issues with allowing for that, I could contribute it. > > Regards, > Michal >
[Qemu-devel] [PATCH] PPC/KVM: early validation of vcpu id
The KVM API restricts vcpu ids to be < KVM_CAP_MAX_VCPUS. On PowerPC targets, depending on the number of threads per core in the host and in the guest, some topologies do generate higher vcpu ids actually. When this happens, QEMU bails out with the following error: kvm_init_vcpu failed: Invalid argument The KVM_CREATE_VCPU ioctl has several EINVAL return paths, so it is not possible to fully disambiguate. This patch adds a check in the code that computes vcpu ids, so that we can detect the error earlier, and print a friendlier message instead of calling KVM_CREATE_VCPU with an obviously bogus vcpu id. Signed-off-by: Greg Kurz--- include/sysemu/kvm.h|2 ++ kvm-all.c |6 ++ target-ppc/translate_init.c |8 3 files changed, 16 insertions(+) diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h index 0e18f15c9493..27bf50ef379e 100644 --- a/include/sysemu/kvm.h +++ b/include/sysemu/kvm.h @@ -344,6 +344,8 @@ int kvm_arch_init(MachineState *ms, KVMState *s); int kvm_arch_init_vcpu(CPUState *cpu); +bool kvm_vcpu_id_is_valid(int vcpu_id); + /* Returns VCPU ID to be used on KVM_CREATE_VCPU ioctl() */ unsigned long kvm_arch_vcpu_id(CPUState *cpu); diff --git a/kvm-all.c b/kvm-all.c index e7b66df197fd..3625a3e07457 100644 --- a/kvm-all.c +++ b/kvm-all.c @@ -1459,6 +1459,12 @@ static int kvm_max_vcpus(KVMState *s) return (ret) ? ret : kvm_recommended_vcpus(s); } +bool kvm_vcpu_id_is_valid(int vcpu_id) +{ +KVMState *s = KVM_STATE(current_machine->accelerator); +return vcpu_id >= 0 && vcpu_id < kvm_max_vcpus(s); +} + static int kvm_init(MachineState *ms) { MachineClass *mc = MACHINE_GET_CLASS(ms); diff --git a/target-ppc/translate_init.c b/target-ppc/translate_init.c index f51572552bc2..6c89b18a05f9 100644 --- a/target-ppc/translate_init.c +++ b/target-ppc/translate_init.c @@ -9247,6 +9247,14 @@ static void ppc_cpu_realizefn(DeviceState *dev, Error **errp) #if !defined(CONFIG_USER_ONLY) cpu->cpu_dt_id = (cs->cpu_index / smp_threads) * max_smt + (cs->cpu_index % smp_threads); + +if (kvm_enabled() && !kvm_vcpu_id_is_valid(cpu->cpu_dt_id)) { +error_setg(errp, "Can't create CPU with id %d in KVM", cpu->cpu_dt_id); +error_append_hint(errp, "Adjust the number of cpus to %d " + "or try to raise the number of threads per core\n", + cpu->cpu_dt_id * smp_threads / max_smt); +return; +} #endif if (tcg_enabled()) {
Re: [Qemu-devel] [RFC PATCH resend 00/11] Make CoMutex/CoQueue/CoRwlock thread-safe
On Fri, Apr 15, 2016 at 01:31:55PM +0200, Paolo Bonzini wrote: > [this time including the mailing list] > > This is yet another tiny bit of the multiqueue work, this time affecting > the synchronization infrastructure for coroutines. Currently, coroutines > synchronize between the main I/O thread and the dataplane iothread through > the AioContext lock. However, for multiqueue a single BDS will be used > by multiple iothreads and hence multiple AioContexts. This calls for > a different approach to coroutine synchronization. This series is my > first attempt at it. Besides multiqueue, I think throttling groups > (which can already span multiple AioContexts) could also take advantage > of the new CoMutexes. > > The series has two main parts: > > - it makes CoMutex bind coroutines to an AioContexts. It is of course > still possible to move coroutines around with explicit yield and > qemu_coroutine_enter, but a CoMutex will now reenter the coroutine > where it was running before, rather than in the AioContext of > the previous owner. To do this, a new function aio_co_schedule is > introduced to run a coroutine on a given iothread. I think this could > be used in other places too; for now it lets a CoMutex protect stuff > across multiple AioContexts without them moving around(*). Of course > currently a CoMutex is generally not used across multiple iothreads, > because you have to acquire/release the AioContext around CoMutex > critical sections. However... > > - ... the second change is exactly to make CoMutex thread-safe and remove > the need for a "thread-based" mutex around it. The new CoMutex is > exactly the same as a mutex implementation that you'd find in an > operating system. iothreads are the moral equivalent of CPUs in > a kernel, while coroutines resemble kernel threads running without > preemption on a CPU. Even if you have to take concurrency into account, > the lack of preemption while running coroutines or bottom halves > keeps the complexity at bay. For example, it is easy to synchronize > between qemu_co_mutex_lock's yield and the qemu_coroutine_enter in > aio_co_schedule's bottom half. > > Same as before, CoMutex puts coroutines to sleep with > qemu_coroutine_yield and wake them up with the new aio_co_schedule. > I could have wrapped CoMutex's CoQueue with a "regular" thread mutex or > spinlock. The resulting code would have looked a lot like RFifoLock > (with CoQueue replacing RFifoLock's condition variable). Rather, > inspired by the parallel between coroutines and non-preemptive kernel > threads, I chose to adopt the same lock-free mutex algorithm as OSv. > The algorithm only needs two to four atomic ops for a lock-unlock pair > (two when uncontended). To cover CoQueue, each CoQueue is made to > depend on a CoMutex, similar to condition variables. Most CoQueues > already have a corresponding CoMutex so this is not a big deal; > converting the others is left for a future series. I did this because > CoQueue looks a lot like OSv's waitqueues; so if necessary, we could > even take OSv's support for wait morphing (which avoids the thundering > herd problem) and add it to CoMutex and CoQueue. This may be useful > when making tracked_requests thread-safe. > > Kevin: this has nothing to do with my old plan from Brno, and it's > possibly a lot closer to what you wanted. Your idea was to automatically > release the "thread mutex" when a coroutine yields, I think you should > be fine with not having a thread mutex at all! > > This will need some changes in the formats because, for multiqueue, > CoMutexes would need to be used like "real" thread mutexes. Code like > this: > > ... > qemu_co_mutex_unlock() > ... /* still access shared data, but don't yield */ > qemu_coroutine_yield() > > might be required to use this other pattern: > > ... /* access shared data, but don't yield */ > qemu_co_mutex_unlock() > qemu_coroutine_yield() > > because "adding a second CPU" is already introducing concurrency that > wasn't there before. The "non-preemptive multitasking" reference only > applies to things that access AioContext-local data. This includes the > synchronization primitives implemented in this series, the thread pool, > the Linux AIO support, but not much else. It still simplifies _those_ > though. :) > > Anyhow, we'll always have some BlockDriver that do not support multiqueue, > such as the network protocols. Thus it would be possible to handle the > formats one at a time. raw-posix, raw and qcow2 would already form a > pretty good set, and the first two do not use CoMutex at all. > > The patch has quite a lot of new code, but about half of it is testcases. > The new test covers correctness and (when run with --verbose) also takes a > stab at measuring performance; the results is that performance of CoMutex > is comparable to pthread mutexes. The only snag is
Re: [Qemu-devel] [PATCH] hw/intc/arm_gic: add tracepoints
On Thu, Apr 21, 2016 at 08:24:41AM -0700, Hollis Blanchard wrote: > These are obviously critical to understanding interrupt delivery: > gic_enable_irq > gic_disable_irq > gic_set_irq (inbound irq from device models) > gic_update_set_irq (outbound irq to CPU) > gic_acknowledge_irq > > The only one that I think might raise eyebrows is gic_update_bestirq, but I've > (sadly) debugged problems that ended up being caused by unexpected priorities. > Knowing that the GIC has an irq ready, but doesn't deliver to the CPU due to > priority, has also proven important. > > Signed-off-by: Hollis Blanchard> --- > hw/intc/arm_gic.c | 12 > trace-events | 8 > 2 files changed, 20 insertions(+) Thanks, applied to my tracing tree (for QEMU 2.7): https://github.com/stefanha/qemu/commits/tracing Stefan signature.asc Description: PGP signature
Re: [Qemu-devel] [PATCH v14 15/19] qapi-commands: Wrap argument visit in visit_start_struct
On 04/15/2016 05:42 AM, Markus Armbruster wrote: > Eric Blakewrites: > >> The qmp-input visitor was playing rather fast and loose: when > > I guess (some of) its *users* are playing fast and loose, and the > visitor itself lets them. The patch's title suggests "some of its > users" == qapi-commands.py. > >> visiting a QDict, you could grab members of the root dictionary >> without first pushing into the dict. But we are about to tighten >> the input visitor, at which point the generated marshal code >> MUST follow the same paradigms as everyone else, of pushing into >> the struct before grabbing its keys, because the value of 'name' >> should be ignored on the top-level visit. >> >> Generated code grows as follows: >> >> |@@ -515,7 +695,15 @@ void qmp_marshal_blockdev_backup(QDict * >> | BlockdevBackup arg = {0}; >> | >> | v = qmp_input_get_visitor(qiv); >> |+visit_start_struct(v, NULL, NULL, 0, ); >> |+if (err) { >> |+goto out; >> |+} >> | visit_type_BlockdevBackup_members(v, , ); >> |+if (!err) { >> |+visit_check_struct(v, ); >> |+} > > Does this find errors that weren't found before? All that this could find is excess input, but we are already checking for excess input prior to calling into the generated marshaling code, so it doesn't find anything new. >> >> Note that this change could also make it possible for the >> marshalling code to automatically detect excess input at the top >> level, and not just in nested dictionaries. However, that checking >> is not currently useful (and we rely on the manual checking in >> monitor.c:qmp_check_client_args() instead) as long as qmp-commands.hx >> uses .args_type, and as long as we have 'name:O' as an arg-type that >> explicitly allows unknown top-level keys because we haven't yet >> converted 'device_add' and 'netdev_add' to introspectible use of >> 'any'. > > Hmm, that explains why finding these additional errors wouldn't be > useful. Good to know, but doesn't quite answer my question. I guess what it really translates to is we are now doing redundant checking, and I should do a followup patch to simplify monitor.c. >> @@ -150,7 +158,9 @@ out: >> qmp_input_visitor_cleanup(qiv); >> qdv = qapi_dealloc_visitor_new(); >> v = qapi_dealloc_get_visitor(qdv); >> +visit_start_struct(v, NULL, NULL, 0, NULL); >> visit_type_%(c_name)s_members(v, , NULL); >> +visit_end_struct(v); >> qapi_dealloc_visitor_cleanup(qdv); >> ''', >> c_name=arg_type.c_name()) > > No visit_check_struct() here. I think its contract should explicitly > state that you may omit it when you're not interested in errors. Indeed, calling visit_check_struct(, NULL) can't report any errors, so skipping it should be documented as safe. -- Eric Blake eblake redhat com+1-919-301-3266 Libvirt virtualization library http://libvirt.org signature.asc Description: OpenPGP digital signature
[Qemu-devel] how should a device implement an array of link(?) properties?
Hi; I have what seems like a fairly straightforward requirement for a QOM device but no idea how to implement it, so I'm looking for advice on the right "modern" way to do it... Specifically, this is the GICv3 device. It would like to have a pointer to every CPU object it needs to be connected to. My guess is that this should be an array of QOM link properties of some kind, so that the board level code can do: * create the gicv3 device * set the 'num-cpu' property * for each cpu: set the cpu[i] property to that CPU * realize the device But how should I implement this in the device? There are qdev array properties, but there's no link qdev property type. Or should I not be using a link property at all? Suggestions for the right approach welcome. thanks -- PMM
Re: [Qemu-devel] [PATCH] hw/net/virtio-net: Allocating Large sized arrays to heap
On Tue, Apr 26, 2016 at 04:05:24PM +0800, Zhou Jie wrote: > virtio_net_flush_tx has a huge stack usage of 16392 bytes approx. > Moving large arrays to heap to reduce stack usage. > > Signed-off-by: Zhou JieI don't think it's appropriate for trivial. Also - what's the point, exactly? > --- > hw/net/virtio-net.c | 15 +++ > 1 file changed, 11 insertions(+), 4 deletions(-) > > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c > index 5798f87..cab7bbc 100644 > --- a/hw/net/virtio-net.c > +++ b/hw/net/virtio-net.c > @@ -1213,6 +1213,7 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q) > VirtIONet *n = q->n; > VirtIODevice *vdev = VIRTIO_DEVICE(n); > VirtQueueElement *elem; > +struct iovec *sg = NULL, *sg2 = NULL; > int32_t num_packets = 0; > int queue_index = vq2q(virtio_get_queue_index(q->tx_vq)); > if (!(vdev->status & VIRTIO_CONFIG_S_DRIVER_OK)) { > @@ -1224,10 +1225,12 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q) > return num_packets; > } > > +sg = g_new(struct iovec, VIRTQUEUE_MAX_SIZE); > +sg2 = g_new(struct iovec, VIRTQUEUE_MAX_SIZE + 1); > for (;;) { > ssize_t ret; > unsigned int out_num; > -struct iovec sg[VIRTQUEUE_MAX_SIZE], sg2[VIRTQUEUE_MAX_SIZE + 1], > *out_sg; > +struct iovec *out_sg; > struct virtio_net_hdr_mrg_rxbuf mhdr; > > elem = virtqueue_pop(q->tx_vq, sizeof(VirtQueueElement)); > @@ -1252,7 +1255,7 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q) > virtio_net_hdr_swap(vdev, (void *) ); > sg2[0].iov_base = > sg2[0].iov_len = n->guest_hdr_len; > -out_num = iov_copy([1], ARRAY_SIZE(sg2) - 1, > +out_num = iov_copy([1], VIRTQUEUE_MAX_SIZE, > out_sg, out_num, > n->guest_hdr_len, -1); > if (out_num == VIRTQUEUE_MAX_SIZE) { > @@ -1269,10 +1272,10 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q) > */ > assert(n->host_hdr_len <= n->guest_hdr_len); > if (n->host_hdr_len != n->guest_hdr_len) { > -unsigned sg_num = iov_copy(sg, ARRAY_SIZE(sg), > +unsigned sg_num = iov_copy(sg, VIRTQUEUE_MAX_SIZE, > out_sg, out_num, > 0, n->host_hdr_len); > -sg_num += iov_copy(sg + sg_num, ARRAY_SIZE(sg) - sg_num, > +sg_num += iov_copy(sg + sg_num, VIRTQUEUE_MAX_SIZE - sg_num, > out_sg, out_num, > n->guest_hdr_len, -1); > out_num = sg_num; > @@ -1284,6 +1287,8 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q) > if (ret == 0) { > virtio_queue_set_notification(q->tx_vq, 0); > q->async_tx.elem = elem; > +g_free(sg); > +g_free(sg2); > return -EBUSY; > } > > @@ -1296,6 +1301,8 @@ drop: > break; > } > } > +g_free(sg); > +g_free(sg2); > return num_packets; > } > > -- > 2.5.5 > >
Re: [Qemu-devel] [RFC 00/13] Multiple fd migration support
* Juan Quintela (quint...@redhat.com) wrote: > "Dr. David Alan Gilbert"wrote: > > * Juan Quintela (quint...@redhat.com) wrote: > >> Hi > >> > >> This patch series is "an" initial implementation of multiple fd migration. > >> This is to get something out for others to comment, it is not finished at > >> all. > > > > I've had a quick skim: > > a) I think mst is right about the risk of getting stale pages out of > > order. > > I have been thinking about this. We just need to send a "we have > finish" this round packet. And reception has to wait for all threads to > finish before continue. It is easier and not expensive. We never > resend the same page during the same round. Yes. > > b) Since you don't change the URI at all, it's a bit restricted; for > > example, > > it means I can't run separate sessions over different NICs unless I've > > done something clever at the routing/or bonded them. > > One thing I liked the sound of multi-fd for is NUMA; get a BIG box > > and give each numa node a separate NIC and run a separate thread on > > each > > node. > > If we want this _how_ we want to configure it. This was part of the > reason to post the patch. It works only for tcp, I don't even try the > others, just to see what people want. I was thinking this would work even for TCP; you'd just need a way to pass different URIs (with address/port) for each connection. > > c) Hmm we do still have a single thread doing all the bitmap syncing and > > scanning, > > we'll have to watch out if that is the bottleneck at all. > > Yeap. My idea here was to still maintain the bitmap scanning on the > main thread, but send work to the "worker threads" in batches, not in > single pages. But I haven't really profiled how long we spend there. Yeh, it would be interesting to see what this profile looked like; if we suddenly found that main thread had spare cycles perhaps we could do some more interesting types of scanning. > > d) All the zero testing is still done in the main thread which we know is > > expensive. > > Not trivial if we don't want to send control information over the > "other" channels. One solution would be split the main memory in > different "main" threads. No performance profiles. Yes, and it's tricky because the order is: 1) Send control information 2) Farm it out to individual thread It's too late for '2' to say 'it's zero'. > > e) Do we need to do something for security with having multiple ports? How > > do we check that nothing snuck in on one of our extra ports, have we > > got > > sanity checks to make sure it's actually the right stream. > > > We only have a single port. We opened it several times. It shouldn't > require changes in either libvirt/firewall. (Famous last words) True I guess. > > > f) You're handing out pages to the sending threads on the basis of which > > one > > is free (in the same way as the multi threaded compression); but I > > think > > it needs some sanity adding to only hand out whole host pages - it > > feels > > like receiving all the chunks of one host page down separate FDs would > > be horrible. > > Trivial optimization would be to send _whole_ huge pages in one go. I > wanted comments about what people wanted here. My idea was really to > add multipage or several pages in one go. Would reduce synchronization > a lot. I do to the 1st that becomes free because .. I don't know > how long a specific transmission is going to take. TCP for you :-( Sending huge pages would be very nice; the tricky thing is you don't want to send a huge page unless it's all marked dirty. > > g) I think you might be able to combine the compression into the same > > threads; > > so that if multi-fd + multi-threaded-compresison is set you don't end > > up with 2 sets of threads and it might be the simplest way to make them > > work together. > > Yeap, I thought that. But I didn't want to merge them in a first > stage. It makes much more sense to _not_ send the compressed data > through the main channel. But that would be v2 (or 3, or 4 ...) Right. > > h) You've used the last free RAM_SAVE_FLAG! And the person who takes the > > last > > slice^Wbit has to get some more. > > Since arm, ppc, and 68k have variants that have TARGET_PAGE_BITS 10 > > that > > means we're full; I suggest what you do is use that flag to mean that > > we > > send another 64bit word; and in that word you use the bottom 7 bits for > > the fd index and bit 7 is set to indicate it's fd. The other bits are > > sent > > as zero and available for the next use. > > Either that or start combining with some other flags. > > (I may have a use for some more bits in mind!) > > Ok. I can looke at that. > > > i) Is this safe for xbzrle - what happens to the cache (or is it all > > still the main
Re: [Qemu-devel] Updating documentation at http://wiki.qemu.org/download/qemu-doc.html
On Tue, Apr 26, 2016 at 11:02:45AM +0100, Stefan Hajnoczi wrote: > On Tue, Apr 26, 2016 at 07:50:31AM +0100, Mark Cave-Ayland wrote: > > Does anyone know if it's possible to update the documentation at the > > above URL? It appears as a link at http://wiki.qemu-project.org/Manual > > and seems to be the first hit for most people looking for information on > > QEMU SPARC emulation, which is frustrating as the information there is > > well out of date. Presumably this is because the documentation is hosted > > on the qemu.org domain itself and so ranks higher? > > Hi Mark, > I've CCed Jeff Cody, who is taking over as system administrator for > QEMU. The http://wiki.qemu.org/download/qemu-doc.html link is just a > static page from 2010. > > It should be possible to rebuild the HTML docs and replace the file. > Even better would be to update it from a cron job. > > Stefan Thanks. I've update the qemu-doc.html at the above link with the latest output of "make qemu-doc.html" from git master. I'll set up a daily cronjob to take care of that automatically, as well. Jeff
Re: [Qemu-devel] [Qemu-discuss] iolimits for virtio-9p
On Tue 19 Apr 2016 02:09:24 PM CEST, Pradeep Kiruvale wrote: > We are planning to implement the io-limits for the virtio-9p driver > i.e for fsdev devices. > So, I am looking into the code base and how it has done for the block > io devices. > > I would like to know how difficult is this and is there some one out > there who has any plan to do this? Hi, as Stefan said already, the common API is in throttle.h. It should be generic enough to be used in other parts of QEMU, but tell me if you feel that it needs changes. Once you configure the throttling settings you essentially only need to call throttle_schedule_timer() to see if a request needs to be throttled, and afterwards throttle_account() to register the I/O that has been peformed. You'll see that there's also throttle-group.c, but that's specific to the block layer and not meant to be generic. Berto
Re: [Qemu-devel] [PATCH v4 2/3] hw/arm/virt: Add PMU node for virt machine
On Tue, Apr 26, 2016 at 07:40:45PM +0800, Shannon Zhao wrote: > From: Shannon Zhao> > Add a virtual PMU device for virt machine while use PPI 7 for PMU > overflow interrupt number. > > Signed-off-by: Shannon Zhao > --- > hw/arm/virt.c | 33 + > include/hw/arm/virt.h | 4 > include/sysemu/kvm.h | 1 + > stubs/kvm.c | 5 + > target-arm/kvm64.c| 41 + > 5 files changed, 84 insertions(+) Reviewed-by: Andrew Jones > > diff --git a/hw/arm/virt.c b/hw/arm/virt.c > index 56d35c7..376cb87 100644 > --- a/hw/arm/virt.c > +++ b/hw/arm/virt.c > @@ -428,6 +428,37 @@ static void fdt_add_gic_node(VirtBoardInfo *vbi, int > type) > qemu_fdt_setprop_cell(vbi->fdt, "/intc", "phandle", vbi->gic_phandle); > } > > +static void fdt_add_pmu_nodes(const VirtBoardInfo *vbi, int gictype) > +{ > +CPUState *cpu; > +ARMCPU *armcpu; > +uint32_t irqflags = GIC_FDT_IRQ_FLAGS_LEVEL_HI; > + > +CPU_FOREACH(cpu) { > +armcpu = ARM_CPU(cpu); > +if (!armcpu->has_pmu || > +!kvm_arm_pmu_create(cpu, PPI(VIRTUAL_PMU_IRQ))) { > +return; > +} > +} > + > +if (gictype == 2) { > +irqflags = deposit32(irqflags, GIC_FDT_IRQ_PPI_CPU_START, > + GIC_FDT_IRQ_PPI_CPU_WIDTH, > + (1 << vbi->smp_cpus) - 1); > +} > + > +armcpu = ARM_CPU(qemu_get_cpu(0)); > +qemu_fdt_add_subnode(vbi->fdt, "/pmu"); > +if (arm_feature(>env, ARM_FEATURE_V8)) { > +const char compat[] = "arm,armv8-pmuv3"; > +qemu_fdt_setprop(vbi->fdt, "/pmu", "compatible", > + compat, sizeof(compat)); > +qemu_fdt_setprop_cells(vbi->fdt, "/pmu", "interrupts", > + GIC_FDT_IRQ_TYPE_PPI, VIRTUAL_PMU_IRQ, > irqflags); > +} > +} > + > static void create_v2m(VirtBoardInfo *vbi, qemu_irq *pic) > { > int i; > @@ -1246,6 +1277,8 @@ static void machvirt_init(MachineState *machine) > > create_gic(vbi, pic, gic_version, vms->secure); > > +fdt_add_pmu_nodes(vbi, gic_version); > + > create_uart(vbi, pic, VIRT_UART, sysmem); > > if (vms->secure) { > diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h > index ecd8589..b50f095 100644 > --- a/include/hw/arm/virt.h > +++ b/include/hw/arm/virt.h > @@ -40,6 +40,10 @@ > #define ARCH_TIMER_NS_EL1_IRQ 14 > #define ARCH_TIMER_NS_EL2_IRQ 10 > > +#define VIRTUAL_PMU_IRQ 7 > + > +#define PPI(irq) ((irq) + 16) > + > enum { > VIRT_FLASH, > VIRT_MEM, > diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h > index 0e18f15..4522043 100644 > --- a/include/sysemu/kvm.h > +++ b/include/sysemu/kvm.h > @@ -523,4 +523,5 @@ int kvm_set_one_reg(CPUState *cs, uint64_t id, void > *source); > * Returns: 0 on success, or a negative errno on failure. > */ > int kvm_get_one_reg(CPUState *cs, uint64_t id, void *target); > +int kvm_arm_pmu_create(CPUState *cs, int irq); > #endif > diff --git a/stubs/kvm.c b/stubs/kvm.c > index ddd6204..667e269 100644 > --- a/stubs/kvm.c > +++ b/stubs/kvm.c > @@ -6,3 +6,8 @@ int kvm_arch_irqchip_create(MachineState *ms, KVMState *s) > { > return 0; > } > + > +int kvm_arm_pmu_create(CPUState *cs, int irq) > +{ > +return 0; > +} > diff --git a/target-arm/kvm64.c b/target-arm/kvm64.c > index b364789..893f983 100644 > --- a/target-arm/kvm64.c > +++ b/target-arm/kvm64.c > @@ -382,6 +382,47 @@ static CPUWatchpoint *find_hw_watchpoint(CPUState *cpu, > target_ulong addr) > return NULL; > } > > +static bool kvm_arm_pmu_support_ctrl(CPUState *cs, struct kvm_device_attr > *attr) > +{ > +return kvm_vcpu_ioctl(cs, KVM_HAS_DEVICE_ATTR, attr) == 0; > +} > + > +int kvm_arm_pmu_create(CPUState *cs, int irq) > +{ > +int err; > + > +struct kvm_device_attr attr = { > +.group = KVM_ARM_VCPU_PMU_V3_CTRL, > +.addr = (intptr_t), > +.attr = KVM_ARM_VCPU_PMU_V3_IRQ, > +.flags = 0, > +}; > + > +if (!kvm_arm_pmu_support_ctrl(cs, )) { > +return 0; > +} > + > +err = kvm_vcpu_ioctl(cs, KVM_SET_DEVICE_ATTR, ); > +if (err < 0) { > +fprintf(stderr, "KVM_SET_DEVICE_ATTR failed: %s\n", > +strerror(-err)); > +abort(); > +} > + > +attr.group = KVM_ARM_VCPU_PMU_V3_CTRL; > +attr.attr = KVM_ARM_VCPU_PMU_V3_INIT; > +attr.addr = 0; > +attr.flags = 0; > + > +err = kvm_vcpu_ioctl(cs, KVM_SET_DEVICE_ATTR, ); > +if (err < 0) { > +fprintf(stderr, "KVM_SET_DEVICE_ATTR failed: %s\n", > +strerror(-err)); > +abort(); > +} > + > +return 1; > +} > > static inline void set_feature(uint64_t *features, int feature) > { > -- > 2.0.4 > > >
[Qemu-devel] [PATCH v4 0/3] Add guest PMU in machine virt
From: Shannon ZhaoKVM-ARM64 supports guest PMU now. This series add the support in machine virt so that guest could use PMU. The ACPI part is tested with below guest kernel patches. https://lkml.org/lkml/2016/4/12/755 Changes since v3: * if kvm_arm_pmu_create returns a failure, don't create pmu dts node for guest Changes since v2: * address Andrew's comments on PATCH 2, thanks Changes since v1: * rebase on master * Address Andrew's comments, add a macro PPI, fix code style, add cpu_to_le32() Shannon Zhao (3): target-arm: kvm64: set guest PMUv3 feature bit if supported hw/arm/virt: Add PMU node for virt machine hw/arm/virt-acpi-build: Add PMU IRQ number in ACPI table hw/arm/virt-acpi-build.c | 3 +++ hw/arm/virt.c| 33 + include/hw/arm/virt.h| 4 include/sysemu/kvm.h | 1 + stubs/kvm.c | 5 + target-arm/cpu-qom.h | 2 ++ target-arm/kvm64.c | 46 ++ 7 files changed, 94 insertions(+) -- 2.0.4
[Qemu-devel] [PATCH v4 1/3] target-arm: kvm64: set guest PMUv3 feature bit if supported
From: Shannon ZhaoCheck if kvm supports guest PMUv3. If so, set the corresponding feature bit for vcpu. Signed-off-by: Shannon Zhao Reviewed-by: Andrew Jones --- target-arm/cpu-qom.h | 2 ++ target-arm/kvm64.c | 5 + 2 files changed, 7 insertions(+) diff --git a/target-arm/cpu-qom.h b/target-arm/cpu-qom.h index 1061c08..93aa6a4 100644 --- a/target-arm/cpu-qom.h +++ b/target-arm/cpu-qom.h @@ -105,6 +105,8 @@ typedef struct ARMCPU { bool powered_off; /* CPU has security extension */ bool has_el3; +/* CPU has PMU (Performance Monitor Unit) */ +bool has_pmu; /* CPU has memory protection unit */ bool has_mpu; diff --git a/target-arm/kvm64.c b/target-arm/kvm64.c index e8527bf..b364789 100644 --- a/target-arm/kvm64.c +++ b/target-arm/kvm64.c @@ -461,6 +461,11 @@ int kvm_arch_init_vcpu(CPUState *cs) if (!arm_feature(>env, ARM_FEATURE_AARCH64)) { cpu->kvm_init_features[0] |= 1 << KVM_ARM_VCPU_EL1_32BIT; } +if (kvm_irqchip_in_kernel() && +kvm_check_extension(cs->kvm_state, KVM_CAP_ARM_PMU_V3)) { +cpu->has_pmu = true; +cpu->kvm_init_features[0] |= 1 << KVM_ARM_VCPU_PMU_V3; +} /* Do KVM_ARM_VCPU_INIT ioctl */ ret = kvm_arm_vcpu_init(cs); -- 2.0.4
[Qemu-devel] [PATCH v4 2/3] hw/arm/virt: Add PMU node for virt machine
From: Shannon ZhaoAdd a virtual PMU device for virt machine while use PPI 7 for PMU overflow interrupt number. Signed-off-by: Shannon Zhao --- hw/arm/virt.c | 33 + include/hw/arm/virt.h | 4 include/sysemu/kvm.h | 1 + stubs/kvm.c | 5 + target-arm/kvm64.c| 41 + 5 files changed, 84 insertions(+) diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 56d35c7..376cb87 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -428,6 +428,37 @@ static void fdt_add_gic_node(VirtBoardInfo *vbi, int type) qemu_fdt_setprop_cell(vbi->fdt, "/intc", "phandle", vbi->gic_phandle); } +static void fdt_add_pmu_nodes(const VirtBoardInfo *vbi, int gictype) +{ +CPUState *cpu; +ARMCPU *armcpu; +uint32_t irqflags = GIC_FDT_IRQ_FLAGS_LEVEL_HI; + +CPU_FOREACH(cpu) { +armcpu = ARM_CPU(cpu); +if (!armcpu->has_pmu || +!kvm_arm_pmu_create(cpu, PPI(VIRTUAL_PMU_IRQ))) { +return; +} +} + +if (gictype == 2) { +irqflags = deposit32(irqflags, GIC_FDT_IRQ_PPI_CPU_START, + GIC_FDT_IRQ_PPI_CPU_WIDTH, + (1 << vbi->smp_cpus) - 1); +} + +armcpu = ARM_CPU(qemu_get_cpu(0)); +qemu_fdt_add_subnode(vbi->fdt, "/pmu"); +if (arm_feature(>env, ARM_FEATURE_V8)) { +const char compat[] = "arm,armv8-pmuv3"; +qemu_fdt_setprop(vbi->fdt, "/pmu", "compatible", + compat, sizeof(compat)); +qemu_fdt_setprop_cells(vbi->fdt, "/pmu", "interrupts", + GIC_FDT_IRQ_TYPE_PPI, VIRTUAL_PMU_IRQ, irqflags); +} +} + static void create_v2m(VirtBoardInfo *vbi, qemu_irq *pic) { int i; @@ -1246,6 +1277,8 @@ static void machvirt_init(MachineState *machine) create_gic(vbi, pic, gic_version, vms->secure); +fdt_add_pmu_nodes(vbi, gic_version); + create_uart(vbi, pic, VIRT_UART, sysmem); if (vms->secure) { diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h index ecd8589..b50f095 100644 --- a/include/hw/arm/virt.h +++ b/include/hw/arm/virt.h @@ -40,6 +40,10 @@ #define ARCH_TIMER_NS_EL1_IRQ 14 #define ARCH_TIMER_NS_EL2_IRQ 10 +#define VIRTUAL_PMU_IRQ 7 + +#define PPI(irq) ((irq) + 16) + enum { VIRT_FLASH, VIRT_MEM, diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h index 0e18f15..4522043 100644 --- a/include/sysemu/kvm.h +++ b/include/sysemu/kvm.h @@ -523,4 +523,5 @@ int kvm_set_one_reg(CPUState *cs, uint64_t id, void *source); * Returns: 0 on success, or a negative errno on failure. */ int kvm_get_one_reg(CPUState *cs, uint64_t id, void *target); +int kvm_arm_pmu_create(CPUState *cs, int irq); #endif diff --git a/stubs/kvm.c b/stubs/kvm.c index ddd6204..667e269 100644 --- a/stubs/kvm.c +++ b/stubs/kvm.c @@ -6,3 +6,8 @@ int kvm_arch_irqchip_create(MachineState *ms, KVMState *s) { return 0; } + +int kvm_arm_pmu_create(CPUState *cs, int irq) +{ +return 0; +} diff --git a/target-arm/kvm64.c b/target-arm/kvm64.c index b364789..893f983 100644 --- a/target-arm/kvm64.c +++ b/target-arm/kvm64.c @@ -382,6 +382,47 @@ static CPUWatchpoint *find_hw_watchpoint(CPUState *cpu, target_ulong addr) return NULL; } +static bool kvm_arm_pmu_support_ctrl(CPUState *cs, struct kvm_device_attr *attr) +{ +return kvm_vcpu_ioctl(cs, KVM_HAS_DEVICE_ATTR, attr) == 0; +} + +int kvm_arm_pmu_create(CPUState *cs, int irq) +{ +int err; + +struct kvm_device_attr attr = { +.group = KVM_ARM_VCPU_PMU_V3_CTRL, +.addr = (intptr_t), +.attr = KVM_ARM_VCPU_PMU_V3_IRQ, +.flags = 0, +}; + +if (!kvm_arm_pmu_support_ctrl(cs, )) { +return 0; +} + +err = kvm_vcpu_ioctl(cs, KVM_SET_DEVICE_ATTR, ); +if (err < 0) { +fprintf(stderr, "KVM_SET_DEVICE_ATTR failed: %s\n", +strerror(-err)); +abort(); +} + +attr.group = KVM_ARM_VCPU_PMU_V3_CTRL; +attr.attr = KVM_ARM_VCPU_PMU_V3_INIT; +attr.addr = 0; +attr.flags = 0; + +err = kvm_vcpu_ioctl(cs, KVM_SET_DEVICE_ATTR, ); +if (err < 0) { +fprintf(stderr, "KVM_SET_DEVICE_ATTR failed: %s\n", +strerror(-err)); +abort(); +} + +return 1; +} static inline void set_feature(uint64_t *features, int feature) { -- 2.0.4
[Qemu-devel] [PATCH v4 3/3] hw/arm/virt-acpi-build: Add PMU IRQ number in ACPI table
From: Shannon ZhaoAdd PMU IRQ number in ACPI table, then we can use PMU in guest through ACPI. Signed-off-by: Shannon Zhao Reviewed-by: Andrew Jones --- hw/arm/virt-acpi-build.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c index f51fe39..5031232 100644 --- a/hw/arm/virt-acpi-build.c +++ b/hw/arm/virt-acpi-build.c @@ -491,6 +491,9 @@ build_madt(GArray *table_data, GArray *linker, VirtGuestInfo *guest_info) gicc->arm_mpidr = armcpu->mp_affinity; gicc->uid = i; gicc->flags = cpu_to_le32(ACPI_GICC_ENABLED); + +if (armcpu->has_pmu) +gicc->performance_interrupt = cpu_to_le32(PPI(VIRTUAL_PMU_IRQ)); } if (guest_info->gic_version == 3) { -- 2.0.4
Re: [Qemu-devel] [PATCH v4 00/16] IOMMU: Enable interrupt remapping for Intel IOMMU
On Tue, Apr 26, 2016 at 12:51:44PM +0200, Jan Kiszka wrote: > On 2016-04-26 12:38, Peter Xu wrote: > Hi, Jan, > > The above issue should be caused by EOI missing of level-triggered > interrupts. Before that, I was always using edge-triggered > interrupts for test, so didn't encounter this one. Would you please > help try below patch? It can be applied directly onto the series, > and should solve the issue (it works on my test vm, and I'll take it > in v5 as well if it also works for you): > > >>> > >>> Works here as well. I even made EIM working with some hack, though > >>> Jailhouse spits out strange warnings, despite it works fine (x2apic > >>> mode, split irqchip). > >> > >> Corrections: the warnings are issued by qemu, not Jailhouse, e.g. > >> > >> qemu-system-x86_64: VT-d Failed to remap interrupt for gsi 22. > >> > >> I suspect that comes from the hand-over phase of Jailhouse, when it > >> mutes all interrupts in the system while reconfiguring IR and IOAPIC. > >> > >> Please convert this error (in kvm_arch_fixup_msi_route) into a trace > >> point. It shall not annoy the host. Also check if you have more of such > >> guest-triggerable error messages. > > > > Okay. This should be the only one. I can use trace instead. > > > > Meanwhile, I still suppose we should not seen it even with > > error_report().. Would this happen when boot e.g. generic kernels? > > No, this is - most probably, still need to validate in details - an > effect due to the special case with Jailhouse that interrupt and VT-d > management is handed over from a running OS to a hypervisor. > > Jan > > PS: Current quick hack for EIM support (lacks compat mode blocking at > least): > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c > index 1082ab5..709a92c 100644 > --- a/hw/i386/intel_iommu.c > +++ b/hw/i386/intel_iommu.c > @@ -907,6 +907,7 @@ static void > vtd_interrupt_remap_table_setup(IntelIOMMUState *s) > value = vtd_get_quad_raw(s, DMAR_IRTA_REG); > s->intr_size = 1UL << ((value & VTD_IRTA_SIZE_MASK) + 1); > s->intr_root = value & VTD_IRTA_ADDR_MASK; > +s->intr_eime = value & VTD_IRTA_EIME; > > QLIST_FOREACH(notifier, >iec_notifiers, list) { > if (notifier->iec_notify) { > @@ -2052,11 +2053,13 @@ static int vtd_remap_irq_get(IntelIOMMUState *iommu, > uint16_t index, VTDIrq *irq > irq->trigger_mode = irte.trigger_mode; > irq->vector = irte.vector; > irq->delivery_mode = irte.delivery_mode; > -/* Not support EIM yet: please refer to vt-d 9.10 DST bits */ > +irq->dest = irte.dest_id; > +if (!iommu->intr_eime) { > #define VTD_IR_APIC_DEST_MASK (0xff00ULL) > #define VTD_IR_APIC_DEST_SHIFT(8) > -irq->dest = (irte.dest_id & VTD_IR_APIC_DEST_MASK) >> \ > -VTD_IR_APIC_DEST_SHIFT; > +irq->dest = (irq->dest & VTD_IR_APIC_DEST_MASK) >> > +VTD_IR_APIC_DEST_SHIFT; > +} > irq->dest_mode = irte.dest_mode; > irq->redir_hint = irte.redir_hint; > > @@ -2316,7 +2319,7 @@ static void vtd_init(IntelIOMMUState *s) > s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO; > > if (ms->iommu_intr) { > -s->ecap |= VTD_ECAP_IR; > +s->ecap |= VTD_ECAP_IR | VTD_ECAP_EIM; > } > > vtd_reset_context_cache(s); > @@ -2370,10 +2373,9 @@ static void vtd_init(IntelIOMMUState *s) > vtd_define_quad(s, DMAR_FRCD_REG_0_2, 0, 0, 0x8000ULL); > > /* > - * Interrupt remapping registers, not support extended interrupt > - * mode for now. > + * Interrupt remapping registers. > */ > -vtd_define_quad(s, DMAR_IRTA_REG, 0, 0xf00fULL, 0); > +vtd_define_quad(s, DMAR_IRTA_REG, 0, 0xf80fULL, 0); > } > > /* Should not reset address_spaces when reset because devices will still use > diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h > index 10c20fe..72b0114 100644 > --- a/hw/i386/intel_iommu_internal.h > +++ b/hw/i386/intel_iommu_internal.h > @@ -176,6 +176,7 @@ > > /* IRTA_REG */ > #define VTD_IRTA_ADDR_MASK (VTD_HAW_MASK ^ 0xfffULL) > +#define VTD_IRTA_EIME (1ULL << 11) > #define VTD_IRTA_SIZE_MASK (0xfULL) > > /* ECAP_REG */ > @@ -184,6 +185,7 @@ > #define VTD_ECAP_QI (1ULL << 1) > /* Interrupt Remapping support */ > #define VTD_ECAP_IR (1ULL << 3) > +#define VTD_ECAP_EIM(1ULL << 4) > > /* CAP_REG */ > /* (offset >> 4) << 24 */ > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h > index d7613af..c6c53c6 100644 > --- a/include/hw/i386/intel_iommu.h > +++ b/include/hw/i386/intel_iommu.h > @@ -261,6 +261,7 @@ struct IntelIOMMUState { > bool intr_enabled; /* Whether guest enabled IR */ > dma_addr_t intr_root; /* Interrupt remapping table pointer */ > uint32_t intr_size; /* Number of IR table entries */ > +
Re: [Qemu-devel] [PATCH v8 0/5] ARM: Add NUMA support for machine virt
On Tue, Apr 26, 2016 at 06:40:24PM +0800, Shannon Zhao wrote: > From: Shannon Zhao> > Add NUMA support for machine virt. Tested successfully running a guest > Linux kernel with the following patch applied: > > - [PATCH v16 0/6] arm64, numa: Add numa support for arm64 platforms > https://lkml.org/lkml/2016/4/8/571 > - [PATCH v5 00/14] ACPI NUMA support for ARM64 > https://lkml.org/lkml/2016/4/19/852 This series looks good to me, but I guess we still need to wait for the DT spec[*] to be merged first. Hopefully that'll happen around the time that the 2.7 dev window opens. Thanks, drew [*] https://lkml.org/lkml/2016/4/8/572 > > Example qemu command line: > qemu-system-aarch64 \ > -enable-kvm -smp 4\ > -kernel Image \ > -m 512 -machine virt,kernel_irqchip=on \ > -initrd guestfs.cpio.gz \ > -cpu host -nographic \ > -numa node,mem=256M,cpus=0-1,nodeid=0 \ > -numa node,mem=256M,cpus=2-3,nodeid=1 \ > -append "console=ttyAMA0 root=/dev/ram" > > Changes since v7: > * fix code style suggested by Marcel > * rename acpi_build_srat_memory to build_srat_memory > > Changes since v6: > * squash first two patches of previous series together > * fix the definition of proximity in AcpiSratMemoryAffinity > * rename acpi_build_srat_memory to build_acpi_srat_memory > > Changes since v5: > * don't generate /distance-map node since it's optional > * improve the /memory node name > * move acpi_build_srat_memory to common place then reuse it to generate > SRAT table > > Changes since v4: > * rebased on new kernel driver and device bindings, especially the > compatible string "numa-distance-map-v1" of /distance-map node > * set the numa-node-id for first /memory node > > Changes since v3: > * based on new kernel driver and device bindings > * add ACPI part > > Changes since v2: > * update to use NUMA node property arm,associativity. > > Changes since v1: > Take into account Peter's comments: > * rename virt_memory_init to arm_generate_memory_dtb > * move arm_generate_memory_dtb to boot.c and make it a common func > * use a struct numa_map to generate numa dtb > > Shannon Zhao (5): > ARM: Virt: Set numa-node-id for cpu and memory nodes > ACPI: Add GICC Affinity Structure > ACPI: Fix the definition of proximity in AcpiSratMemoryAffinity > ACPI: move acpi_build_srat_memory to common place > ACPI: Virt: Generate SRAT table > > hw/acpi/aml-build.c | 11 ++ > hw/arm/boot.c | 43 +++-- > hw/arm/virt-acpi-build.c| 52 > + > hw/arm/virt.c | 8 +++ > hw/i386/acpi-build.c| 41 +-- > include/hw/acpi/acpi-defs.h | 17 +-- > include/hw/acpi/aml-build.h | 10 + > 7 files changed, 143 insertions(+), 39 deletions(-) > > -- > 2.0.4 > > >
Re: [Qemu-devel] [PATCH v3 3/3] hw/arm/virt-acpi-build: Add PMU IRQ number in ACPI table
On Tue, Apr 26, 2016 at 07:08:25PM +0800, Shannon Zhao wrote: > > > On 2016/4/25 21:49, Shannon Zhao wrote: > > On 2016年04月25日 20:42, Andrew Jones wrote: > >> > On Mon, Apr 25, 2016 at 03:11:46PM +0800, Shannon Zhao wrote: > >> > From: Shannon Zhao> >> > > >> > Add PMU IRQ number in ACPI table, then we can use PMU in guest > >> > through > >> > ACPI. > >> > > >> > Signed-off-by: Shannon Zhao > >> > Reviewed-by: Andrew Jones > >> > --- > >> > hw/arm/virt-acpi-build.c | 3 +++ > >> > 1 file changed, 3 insertions(+) > >> > Question: for testing of this, did you use > >> > https://lkml.org/lkml/2016/4/12/755 ? > > Actually I didn't test PMU with ACPI since I didn't notice the patch > > series you point out and it's very obvious for QEMU changes(adding irq > > flag and irq number in GICC). I will have a test later. > > I have test this patch series using ACPI and guest with the patches you > point out. Guest prints below logs and perf works well in guest. > > [0.00] ACPI-PMU: Assign CPU 0 girq 23 level 0 > [0.00] ACPI-PMU: Assign CPU 1 girq 23 level 0 > [0.00] ACPI-PMU: Assign CPU 2 girq 23 level 0 > [0.00] ACPI-PMU: Assign CPU 3 girq 23 level 0 > [...] > [0.094782] ACPI-PMU: Setting up 4 PMUs for CPU type D07 Thanks for the additional testing Shannon! drew
Re: [Qemu-devel] [PATCH v3 3/3] hw/arm/virt-acpi-build: Add PMU IRQ number in ACPI table
On 2016/4/25 21:49, Shannon Zhao wrote: > On 2016年04月25日 20:42, Andrew Jones wrote: >> > On Mon, Apr 25, 2016 at 03:11:46PM +0800, Shannon Zhao wrote: >> > From: Shannon Zhao>> > >> > Add PMU IRQ number in ACPI table, then we can use PMU in guest through >> > ACPI. >> > >> > Signed-off-by: Shannon Zhao >> > Reviewed-by: Andrew Jones >> > --- >> > hw/arm/virt-acpi-build.c | 3 +++ >> > 1 file changed, 3 insertions(+) >> > Question: for testing of this, did you use >> > https://lkml.org/lkml/2016/4/12/755 ? > Actually I didn't test PMU with ACPI since I didn't notice the patch > series you point out and it's very obvious for QEMU changes(adding irq > flag and irq number in GICC). I will have a test later. I have test this patch series using ACPI and guest with the patches you point out. Guest prints below logs and perf works well in guest. [0.00] ACPI-PMU: Assign CPU 0 girq 23 level 0 [0.00] ACPI-PMU: Assign CPU 1 girq 23 level 0 [0.00] ACPI-PMU: Assign CPU 2 girq 23 level 0 [0.00] ACPI-PMU: Assign CPU 3 girq 23 level 0 [...] [0.094782] ACPI-PMU: Setting up 4 PMUs for CPU type D07 Thanks, -- Shannon
[Qemu-devel] [PATCH v8 3/5] ACPI: Fix the definition of proximity in AcpiSratMemoryAffinity
From: Shannon ZhaoACPI spec says that Proximity Domain is an "Integer that represents the proximity domain to which the processor belongs". So define it as a uint32_t. Cc: Michael S. Tsirkin Cc: Igor Mammedov Signed-off-by: Shannon Zhao Reviewed-by: Andrew Jones --- hw/i386/acpi-build.c| 3 +-- include/hw/acpi/acpi-defs.h | 2 +- 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c index 9ae4c0d..3c031aa 100644 --- a/hw/i386/acpi-build.c +++ b/hw/i386/acpi-build.c @@ -2440,8 +2440,7 @@ acpi_build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base, { numamem->type = ACPI_SRAT_MEMORY; numamem->length = sizeof(*numamem); -memset(numamem->proximity, 0, 4); -numamem->proximity[0] = node; +numamem->proximity = cpu_to_le32(node); numamem->flags = cpu_to_le32(flags); numamem->base_addr = cpu_to_le64(base); numamem->range_length = cpu_to_le64(len); diff --git a/include/hw/acpi/acpi-defs.h b/include/hw/acpi/acpi-defs.h index bcf5c3f..850a962 100644 --- a/include/hw/acpi/acpi-defs.h +++ b/include/hw/acpi/acpi-defs.h @@ -475,7 +475,7 @@ typedef struct AcpiSratProcessorAffinity AcpiSratProcessorAffinity; struct AcpiSratMemoryAffinity { ACPI_SUB_HEADER_DEF -uint8_t proximity[4]; +uint32_tproximity; uint16_treserved1; uint64_tbase_addr; uint64_trange_length; -- 2.0.4
[Qemu-devel] [PATCH v8 5/5] ACPI: Virt: Generate SRAT table
From: Shannon ZhaoTo support NUMA, it needs to generate SRAT ACPI table. Signed-off-by: Shannon Zhao Reviewed-by: Andrew Jones --- hw/arm/virt-acpi-build.c | 52 1 file changed, 52 insertions(+) diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c index f51fe39..26a7bac 100644 --- a/hw/arm/virt-acpi-build.c +++ b/hw/arm/virt-acpi-build.c @@ -43,6 +43,7 @@ #include "hw/acpi/aml-build.h" #include "hw/pci/pcie_host.h" #include "hw/pci/pci.h" +#include "sysemu/numa.h" #define ARM_SPI_BASE 32 #define ACPI_POWER_BUTTON_DEVICE "PWRB" @@ -414,6 +415,52 @@ build_spcr(GArray *table_data, GArray *linker, VirtGuestInfo *guest_info) } static void +build_srat(GArray *table_data, GArray *linker, VirtGuestInfo *guest_info) +{ +AcpiSystemResourceAffinityTable *srat; +AcpiSratProcessorGiccAffinity *core; +AcpiSratMemoryAffinity *numamem; +int i, j, srat_start; +uint64_t mem_base; +uint32_t *cpu_node = g_malloc0(guest_info->smp_cpus * sizeof(uint32_t)); + +for (i = 0; i < guest_info->smp_cpus; i++) { +for (j = 0; j < nb_numa_nodes; j++) { +if (test_bit(i, numa_info[j].node_cpu)) { +cpu_node[i] = j; +break; +} +} +} + +srat_start = table_data->len; +srat = acpi_data_push(table_data, sizeof(*srat)); +srat->reserved1 = cpu_to_le32(1); + +for (i = 0; i < guest_info->smp_cpus; ++i) { +core = acpi_data_push(table_data, sizeof(*core)); +core->type = ACPI_SRAT_PROCESSOR_GICC; +core->length = sizeof(*core); +core->proximity = cpu_to_le32(cpu_node[i]); +core->acpi_processor_uid = cpu_to_le32(i); +core->flags = cpu_to_le32(1); +} +g_free(cpu_node); + +mem_base = guest_info->memmap[VIRT_MEM].base; +for (i = 0; i < nb_numa_nodes; ++i) { +numamem = acpi_data_push(table_data, sizeof(*numamem)); +build_srat_memory(numamem, mem_base, numa_info[i].node_mem, i, + MEM_AFFINITY_ENABLED); +mem_base += numa_info[i].node_mem; +} + +build_header(linker, table_data, + (void *)(table_data->data + srat_start), "SRAT", + table_data->len - srat_start, 3, NULL, NULL); +} + +static void build_mcfg(GArray *table_data, GArray *linker, VirtGuestInfo *guest_info) { AcpiTableMcfg *mcfg; @@ -638,6 +685,11 @@ void virt_acpi_build(VirtGuestInfo *guest_info, AcpiBuildTables *tables) acpi_add_table(table_offsets, tables_blob); build_spcr(tables_blob, tables->linker, guest_info); +if (nb_numa_nodes > 0) { +acpi_add_table(table_offsets, tables_blob); +build_srat(tables_blob, tables->linker, guest_info); +} + /* RSDT is pointed to by RSDP */ rsdt = tables_blob->len; build_rsdt(tables_blob, tables->linker, table_offsets, NULL, NULL); -- 2.0.4
Re: [Qemu-devel] [RFC for-2.7 01/11] qdt: IEEE1275-style device tree utility code
On 20.04.2016 04:33, David Gibson wrote: ... > This patch introduces a new utility library "qdt" for runtime > manipulation of device trees. The intention is that machine type code > can use these routines to construct the device tree conveniently, > using a pointer-based representation doesn't have the limitations > above. They can then use qdt_flatten() to convert the completed tree > to fdt format as a single O(n) operation to pass to the guest. Good idea, the FDT format itself is really not very well suited for dynamic manipulations... ... > diff --git a/util/qdt.c b/util/qdt.c > new file mode 100644 > index 000..e3a449a > --- /dev/null > +++ b/util/qdt.c > @@ -0,0 +1,262 @@ > +/* > + * Functions for manipulating IEEE1275 (Open Firmware) style device > + * trees. What does QDT stand for? Maybe add that in the description here. > + * Copyright David Gibson, Red Hat Inc. 2016 > + * > + * This work is licensed unter the GNU GPL version 2 or (at your > + * option) any later version. > + */ > + > +#include > +#include > + > +#include "qemu/osdep.h" > +#include "qapi/error.h" > +#include "qemu/qdt.h" > +#include "qemu/error-report.h" > + > +/* > + * Node functions > + */ > + > +QDTNode *qdt_new_node(const gchar *name) > +{ > +QDTNode *node = g_new0(QDTNode, 1); > + > +g_assert(!strchr(name, '/')); > + > +node->name = g_strdup(name); > +QTAILQ_INIT(>properties); > +QTAILQ_INIT(>children); > + > +return node; > +} > + > +static QDTNode *get_subnode(QDTNode *parent, const gchar *name, size_t > namelen) > +{ > +QDTNode *child; > + > +g_assert(!memchr(name, '/', namelen)); > + > +QTAILQ_FOREACH(child, >children, sibling) { > +if ((strlen(child->name) == namelen) > +&& (memcmp(child->name, name, namelen) == 0)) { Too many parenthesis for my taste ;-) > +return child; > +} > +} > + > +return NULL; > +} > + > +QDTNode *qdt_get_node_relative(QDTNode *node, const gchar *path) > +{ > +const gchar *slash; > +gsize seglen; > + > +do { > +slash = strchr(path, '/'); > +seglen = slash ? slash - path : strlen(path); > + > +node = get_subnode(node, path, seglen); > +path += seglen + 1; > +} while (node && slash); > + > +return node; > +} > + > +QDTNode *qdt_get_node(QDTNode *root, const gchar *path) > +{ > +g_assert(!root->parent); > +g_assert(path[0] == '/'); > +return qdt_get_node_relative(root, path + 1); > +} > + > +QDTNode *qdt_add_subnode(QDTNode *parent, const gchar *name) > +{ > +QDTNode *new = qdt_new_node(name); > + > +new->parent = parent; > +QTAILQ_INSERT_TAIL(>children, new, sibling); > +return new; > +} In case somebody ever tries to compile this with a C++ compiler ... it's maybe better avoid using "new" as name for a variable. > +/* > + * Property functions > + */ > + > +QDTProperty *qdt_new_property(const gchar *name, gconstpointer val, gsize > len) > +{ > +QDTProperty *prop = g_malloc0(sizeof(*prop) + len); > + > +prop->name = g_strdup(name); > +prop->len = len; > +memcpy(prop->val, val, len); > +return prop; > +} > + > +static QDTProperty *getprop_(const QDTNode *node, const gchar *name) Underscore at the end looks somewhat strange ... can't you simply drop that? > +{ > +QDTProperty *prop; > + > +QTAILQ_FOREACH(prop, >properties, list) { > +if (strcmp(prop->name, name) == 0) { > +return prop; > +} > +} > +return NULL; > +} > + > +const QDTProperty *qdt_getprop(const QDTNode *node, const gchar *name) > +{ > +return getprop_(node, name); > +} > + > +void qdt_delprop(QDTNode *node, const gchar *name) > +{ > +QDTProperty *prop = getprop_(node, name); > + > +if (prop) { > +QTAILQ_REMOVE(>properties, prop, list); > +g_free(prop->name); > +g_free(prop); > +} > +} > + > +const QDTProperty *qdt_setprop(QDTNode *node, const gchar *name, > + gconstpointer val, gsize len) > +{ > +QDTProperty *prop; > + > +qdt_delprop(node, name); > + > +prop = g_malloc0(sizeof(*prop) + len); > +prop->name = g_strdup(name); > +prop->len = len; > +memcpy(prop->val, val, len); Can you replace the above four lines with qdt_new_property ? > +QTAILQ_INSERT_TAIL(>properties, prop, list); > +return prop; > +} > + > +const QDTProperty *qdt_setprop_string(QDTNode *node, const gchar *name, > + const gchar *val) > +{ > +return qdt_setprop(node, name, val, strlen(val) + 1); > +} > + > +const QDTProperty *qdt_setprop_cells_(QDTNode *node, const gchar *name, > + const uint32_t *val, gsize len) > +{ > +uint32_t swapval[len]; > +gsize i; > + > +for (i = 0; i < len; i++) { > +swapval[i] = cpu_to_fdt32(val[i]); > +} > +return qdt_setprop(node, name, swapval, sizeof(swapval)); > +} > + > +const
Re: [Qemu-devel] [PATCH v4 00/16] IOMMU: Enable interrupt remapping for Intel IOMMU
On 2016-04-26 12:38, Peter Xu wrote: Hi, Jan, The above issue should be caused by EOI missing of level-triggered interrupts. Before that, I was always using edge-triggered interrupts for test, so didn't encounter this one. Would you please help try below patch? It can be applied directly onto the series, and should solve the issue (it works on my test vm, and I'll take it in v5 as well if it also works for you): >>> >>> Works here as well. I even made EIM working with some hack, though >>> Jailhouse spits out strange warnings, despite it works fine (x2apic >>> mode, split irqchip). >> >> Corrections: the warnings are issued by qemu, not Jailhouse, e.g. >> >> qemu-system-x86_64: VT-d Failed to remap interrupt for gsi 22. >> >> I suspect that comes from the hand-over phase of Jailhouse, when it >> mutes all interrupts in the system while reconfiguring IR and IOAPIC. >> >> Please convert this error (in kvm_arch_fixup_msi_route) into a trace >> point. It shall not annoy the host. Also check if you have more of such >> guest-triggerable error messages. > > Okay. This should be the only one. I can use trace instead. > > Meanwhile, I still suppose we should not seen it even with > error_report().. Would this happen when boot e.g. generic kernels? No, this is - most probably, still need to validate in details - an effect due to the special case with Jailhouse that interrupt and VT-d management is handed over from a running OS to a hypervisor. Jan PS: Current quick hack for EIM support (lacks compat mode blocking at least): diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c index 1082ab5..709a92c 100644 --- a/hw/i386/intel_iommu.c +++ b/hw/i386/intel_iommu.c @@ -907,6 +907,7 @@ static void vtd_interrupt_remap_table_setup(IntelIOMMUState *s) value = vtd_get_quad_raw(s, DMAR_IRTA_REG); s->intr_size = 1UL << ((value & VTD_IRTA_SIZE_MASK) + 1); s->intr_root = value & VTD_IRTA_ADDR_MASK; +s->intr_eime = value & VTD_IRTA_EIME; QLIST_FOREACH(notifier, >iec_notifiers, list) { if (notifier->iec_notify) { @@ -2052,11 +2053,13 @@ static int vtd_remap_irq_get(IntelIOMMUState *iommu, uint16_t index, VTDIrq *irq irq->trigger_mode = irte.trigger_mode; irq->vector = irte.vector; irq->delivery_mode = irte.delivery_mode; -/* Not support EIM yet: please refer to vt-d 9.10 DST bits */ +irq->dest = irte.dest_id; +if (!iommu->intr_eime) { #define VTD_IR_APIC_DEST_MASK (0xff00ULL) #define VTD_IR_APIC_DEST_SHIFT(8) -irq->dest = (irte.dest_id & VTD_IR_APIC_DEST_MASK) >> \ -VTD_IR_APIC_DEST_SHIFT; +irq->dest = (irq->dest & VTD_IR_APIC_DEST_MASK) >> +VTD_IR_APIC_DEST_SHIFT; +} irq->dest_mode = irte.dest_mode; irq->redir_hint = irte.redir_hint; @@ -2316,7 +2319,7 @@ static void vtd_init(IntelIOMMUState *s) s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO; if (ms->iommu_intr) { -s->ecap |= VTD_ECAP_IR; +s->ecap |= VTD_ECAP_IR | VTD_ECAP_EIM; } vtd_reset_context_cache(s); @@ -2370,10 +2373,9 @@ static void vtd_init(IntelIOMMUState *s) vtd_define_quad(s, DMAR_FRCD_REG_0_2, 0, 0, 0x8000ULL); /* - * Interrupt remapping registers, not support extended interrupt - * mode for now. + * Interrupt remapping registers. */ -vtd_define_quad(s, DMAR_IRTA_REG, 0, 0xf00fULL, 0); +vtd_define_quad(s, DMAR_IRTA_REG, 0, 0xf80fULL, 0); } /* Should not reset address_spaces when reset because devices will still use diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h index 10c20fe..72b0114 100644 --- a/hw/i386/intel_iommu_internal.h +++ b/hw/i386/intel_iommu_internal.h @@ -176,6 +176,7 @@ /* IRTA_REG */ #define VTD_IRTA_ADDR_MASK (VTD_HAW_MASK ^ 0xfffULL) +#define VTD_IRTA_EIME (1ULL << 11) #define VTD_IRTA_SIZE_MASK (0xfULL) /* ECAP_REG */ @@ -184,6 +185,7 @@ #define VTD_ECAP_QI (1ULL << 1) /* Interrupt Remapping support */ #define VTD_ECAP_IR (1ULL << 3) +#define VTD_ECAP_EIM(1ULL << 4) /* CAP_REG */ /* (offset >> 4) << 24 */ diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h index d7613af..c6c53c6 100644 --- a/include/hw/i386/intel_iommu.h +++ b/include/hw/i386/intel_iommu.h @@ -261,6 +261,7 @@ struct IntelIOMMUState { bool intr_enabled; /* Whether guest enabled IR */ dma_addr_t intr_root; /* Interrupt remapping table pointer */ uint32_t intr_size; /* Number of IR table entries */ +bool intr_eime; /* Extended interrupt mode enabled */ QLIST_HEAD(, VTD_IEC_Notifier) iec_notifiers; /* IEC notify list */ };
[Qemu-devel] [PATCH v8 0/5] ARM: Add NUMA support for machine virt
From: Shannon ZhaoAdd NUMA support for machine virt. Tested successfully running a guest Linux kernel with the following patch applied: - [PATCH v16 0/6] arm64, numa: Add numa support for arm64 platforms https://lkml.org/lkml/2016/4/8/571 - [PATCH v5 00/14] ACPI NUMA support for ARM64 https://lkml.org/lkml/2016/4/19/852 Example qemu command line: qemu-system-aarch64 \ -enable-kvm -smp 4\ -kernel Image \ -m 512 -machine virt,kernel_irqchip=on \ -initrd guestfs.cpio.gz \ -cpu host -nographic \ -numa node,mem=256M,cpus=0-1,nodeid=0 \ -numa node,mem=256M,cpus=2-3,nodeid=1 \ -append "console=ttyAMA0 root=/dev/ram" Changes since v7: * fix code style suggested by Marcel * rename acpi_build_srat_memory to build_srat_memory Changes since v6: * squash first two patches of previous series together * fix the definition of proximity in AcpiSratMemoryAffinity * rename acpi_build_srat_memory to build_acpi_srat_memory Changes since v5: * don't generate /distance-map node since it's optional * improve the /memory node name * move acpi_build_srat_memory to common place then reuse it to generate SRAT table Changes since v4: * rebased on new kernel driver and device bindings, especially the compatible string "numa-distance-map-v1" of /distance-map node * set the numa-node-id for first /memory node Changes since v3: * based on new kernel driver and device bindings * add ACPI part Changes since v2: * update to use NUMA node property arm,associativity. Changes since v1: Take into account Peter's comments: * rename virt_memory_init to arm_generate_memory_dtb * move arm_generate_memory_dtb to boot.c and make it a common func * use a struct numa_map to generate numa dtb Shannon Zhao (5): ARM: Virt: Set numa-node-id for cpu and memory nodes ACPI: Add GICC Affinity Structure ACPI: Fix the definition of proximity in AcpiSratMemoryAffinity ACPI: move acpi_build_srat_memory to common place ACPI: Virt: Generate SRAT table hw/acpi/aml-build.c | 11 ++ hw/arm/boot.c | 43 +++-- hw/arm/virt-acpi-build.c| 52 + hw/arm/virt.c | 8 +++ hw/i386/acpi-build.c| 41 +-- include/hw/acpi/acpi-defs.h | 17 +-- include/hw/acpi/aml-build.h | 10 + 7 files changed, 143 insertions(+), 39 deletions(-) -- 2.0.4
[Qemu-devel] [PATCH v8 1/5] ARM: Virt: Set numa-node-id for cpu and memory nodes
From: Shannon ZhaoGenerate memory nodes according to NUMA topology. Set numa-node-id property for cpu and memory nodes. Signed-off-by: Shannon Zhao Reviewed-by: Andrew Jones --- hw/arm/boot.c | 43 +-- hw/arm/virt.c | 8 2 files changed, 45 insertions(+), 6 deletions(-) diff --git a/hw/arm/boot.c b/hw/arm/boot.c index 5876945..1b913a4 100644 --- a/hw/arm/boot.c +++ b/hw/arm/boot.c @@ -14,6 +14,7 @@ #include "hw/arm/linux-boot-if.h" #include "sysemu/kvm.h" #include "sysemu/sysemu.h" +#include "sysemu/numa.h" #include "hw/boards.h" #include "hw/loader.h" #include "elf.h" @@ -405,6 +406,9 @@ static int load_dtb(hwaddr addr, const struct arm_boot_info *binfo, void *fdt = NULL; int size, rc; uint32_t acells, scells; +char *nodename; +unsigned int i; +hwaddr mem_base, mem_len; if (binfo->dtb_filename) { char *filename; @@ -456,12 +460,39 @@ static int load_dtb(hwaddr addr, const struct arm_boot_info *binfo, goto fail; } -rc = qemu_fdt_setprop_sized_cells(fdt, "/memory", "reg", - acells, binfo->loader_start, - scells, binfo->ram_size); -if (rc < 0) { -fprintf(stderr, "couldn't set /memory/reg\n"); -goto fail; +if (nb_numa_nodes > 0) { +/* + * Turn the /memory node created before into a NOP node, then create + * /memory@addr nodes for all numa nodes respectively. + */ +qemu_fdt_nop_node(fdt, "/memory"); +mem_base = binfo->loader_start; +for (i = 0; i < nb_numa_nodes; i++) { +mem_len = numa_info[i].node_mem; +nodename = g_strdup_printf("/memory@%" PRIx64, mem_base); +qemu_fdt_add_subnode(fdt, nodename); +qemu_fdt_setprop_string(fdt, nodename, "device_type", "memory"); +rc = qemu_fdt_setprop_sized_cells(fdt, nodename, "reg", + acells, mem_base, + scells, mem_len); +if (rc < 0) { +fprintf(stderr, "couldn't set %s/reg for node %d\n", nodename, +i); +goto fail; +} + +qemu_fdt_setprop_cell(fdt, nodename, "numa-node-id", i); +mem_base += mem_len; +g_free(nodename); +} +} else { +rc = qemu_fdt_setprop_sized_cells(fdt, "/memory", "reg", + acells, binfo->loader_start, + scells, binfo->ram_size); +if (rc < 0) { +fprintf(stderr, "couldn't set /memory/reg\n"); +goto fail; +} } if (binfo->kernel_cmdline && *binfo->kernel_cmdline) { diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 56d35c7..fe6b11d 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -38,6 +38,7 @@ #include "net/net.h" #include "sysemu/block-backend.h" #include "sysemu/device_tree.h" +#include "sysemu/numa.h" #include "sysemu/sysemu.h" #include "sysemu/kvm.h" #include "hw/boards.h" @@ -329,6 +330,7 @@ static void fdt_add_cpu_nodes(const VirtBoardInfo *vbi) { int cpu; int addr_cells = 1; +unsigned int i; /* * From Documentation/devicetree/bindings/arm/cpus.txt @@ -378,6 +380,12 @@ static void fdt_add_cpu_nodes(const VirtBoardInfo *vbi) armcpu->mp_affinity); } +for (i = 0; i < nb_numa_nodes; i++) { +if (test_bit(cpu, numa_info[i].node_cpu)) { +qemu_fdt_setprop_cell(vbi->fdt, nodename, "numa-node-id", i); +} +} + g_free(nodename); } } -- 2.0.4
[Qemu-devel] [PATCH v5 4/6] tcg: Clean up from 'next_tb'
From: Sergey FedorovThe value returned from tcg_qemu_tb_exec() is the value passed to the corresponding tcg_gen_exit_tb() at translation time of the last TB attempted to execute. It is a little confusing to store it in a variable named 'next_tb'. In fact, it is a combination of 4-byte aligned pointer and additional information in its two least significant bits. Break it down right away into two variables named 'last_tb' and 'tb_exit' which are a pointer to the last TB attempted to execute and the TB exit reason, correspondingly. This simplifies the code and improves its readability. Correct a misleading documentation comment for tcg_qemu_tb_exec() and fix logging in cpu_tb_exec(). Also rename a misleading 'next_tb' in another couple of places. Signed-off-by: Sergey Fedorov Signed-off-by: Sergey Fedorov --- Changes in v5: * Shortcut a non-null test cpu-exec.c | 59 --- tcg/tcg.h| 19 ++- tci.c| 6 +++--- trace-events | 2 +- 4 files changed, 46 insertions(+), 40 deletions(-) diff --git a/cpu-exec.c b/cpu-exec.c index 36942340d7e3..6ad5fbdf1f8e 100644 --- a/cpu-exec.c +++ b/cpu-exec.c @@ -136,7 +136,9 @@ static void init_delay_params(SyncClocks *sc, const CPUState *cpu) static inline tcg_target_ulong cpu_tb_exec(CPUState *cpu, TranslationBlock *itb) { CPUArchState *env = cpu->env_ptr; -uintptr_t next_tb; +uintptr_t ret; +TranslationBlock *last_tb; +int tb_exit; uint8_t *tb_ptr = itb->tc_ptr; qemu_log_mask_and_addr(CPU_LOG_EXEC, itb->pc, @@ -160,36 +162,37 @@ static inline tcg_target_ulong cpu_tb_exec(CPUState *cpu, TranslationBlock *itb) #endif /* DEBUG_DISAS */ cpu->can_do_io = !use_icount; -next_tb = tcg_qemu_tb_exec(env, tb_ptr); +ret = tcg_qemu_tb_exec(env, tb_ptr); cpu->can_do_io = 1; -trace_exec_tb_exit((void *) (next_tb & ~TB_EXIT_MASK), - next_tb & TB_EXIT_MASK); +last_tb = (TranslationBlock *)(ret & ~TB_EXIT_MASK); +tb_exit = ret & TB_EXIT_MASK; +trace_exec_tb_exit(last_tb, tb_exit); -if ((next_tb & TB_EXIT_MASK) > TB_EXIT_IDX1) { +if (tb_exit > TB_EXIT_IDX1) { /* We didn't start executing this TB (eg because the instruction * counter hit zero); we must restore the guest PC to the address * of the start of the TB. */ CPUClass *cc = CPU_GET_CLASS(cpu); -TranslationBlock *tb = (TranslationBlock *)(next_tb & ~TB_EXIT_MASK); -qemu_log_mask_and_addr(CPU_LOG_EXEC, itb->pc, +qemu_log_mask_and_addr(CPU_LOG_EXEC, last_tb->pc, "Stopped execution of TB chain before %p [" TARGET_FMT_lx "] %s\n", - itb->tc_ptr, itb->pc, lookup_symbol(itb->pc)); + last_tb->tc_ptr, last_tb->pc, + lookup_symbol(last_tb->pc)); if (cc->synchronize_from_tb) { -cc->synchronize_from_tb(cpu, tb); +cc->synchronize_from_tb(cpu, last_tb); } else { assert(cc->set_pc); -cc->set_pc(cpu, tb->pc); +cc->set_pc(cpu, last_tb->pc); } } -if ((next_tb & TB_EXIT_MASK) == TB_EXIT_REQUESTED) { +if (tb_exit == TB_EXIT_REQUESTED) { /* We were asked to stop executing TBs (probably a pending * interrupt. We've now stopped, so clear the flag. */ cpu->tcg_exit_req = 0; } -return next_tb; +return ret; } #ifndef CONFIG_USER_ONLY @@ -358,8 +361,8 @@ int cpu_exec(CPUState *cpu) CPUArchState *env = _cpu->env; #endif int ret, interrupt_request; -TranslationBlock *tb; -uintptr_t next_tb; +TranslationBlock *tb, *last_tb; +int tb_exit = 0; SyncClocks sc; /* replay_interrupt may need current_cpu */ @@ -442,7 +445,7 @@ int cpu_exec(CPUState *cpu) #endif } -next_tb = 0; /* force lookup of first TB */ +last_tb = NULL; /* forget the last executed TB after exception */ for(;;) { interrupt_request = cpu->interrupt_request; if (unlikely(interrupt_request)) { @@ -487,7 +490,7 @@ int cpu_exec(CPUState *cpu) else { replay_interrupt(); if (cc->cpu_exec_interrupt(cpu, interrupt_request)) { -next_tb = 0; +last_tb = NULL; } } /* Don't use the cached interrupt_request value, @@ -496,7 +499,7 @@ int cpu_exec(CPUState *cpu) cpu->interrupt_request &= ~CPU_INTERRUPT_EXITTB; /* ensure that no TB jump will be modified as the program flow was changed */
[Qemu-devel] [PATCH v5 6/6] cpu-exec: Move TB chaining into tb_find_fast()
From: Sergey FedorovMove tb_add_jump() call and surrounding code from cpu_exec() into tb_find_fast(). That simplifies cpu_exec() a little by hiding the direct chaining optimization details into tb_find_fast(). It also allows to move tb_lock()/tb_unlock() pair into tb_find_fast(), putting it closer to tb_find_slow() which also manipulates the lock. Suggested-by: Alex Bennée Signed-off-by: Sergey Fedorov Signed-off-by: Sergey Fedorov --- cpu-exec.c | 41 ++--- 1 file changed, 22 insertions(+), 19 deletions(-) diff --git a/cpu-exec.c b/cpu-exec.c index 8ba899b47a23..44dae04ae819 100644 --- a/cpu-exec.c +++ b/cpu-exec.c @@ -320,7 +320,9 @@ found: return tb; } -static inline TranslationBlock *tb_find_fast(CPUState *cpu) +static inline TranslationBlock *tb_find_fast(CPUState *cpu, + TranslationBlock **last_tb, + int tb_exit) { CPUArchState *env = (CPUArchState *)cpu->env_ptr; TranslationBlock *tb; @@ -331,11 +333,27 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu) always be the same before a given translated block is executed. */ cpu_get_tb_cpu_state(env, , _base, ); +tb_lock(); tb = cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)]; if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base || tb->flags != flags)) { tb = tb_find_slow(cpu, pc, cs_base, flags); } +if (cpu->tb_flushed) { +/* Ensure that no TB jump will be modified as the + * translation buffer has been flushed. + */ +*last_tb = NULL; +cpu->tb_flushed = false; +} +/* see if we can patch the calling TB. When the TB + spans two pages, we cannot safely do a direct + jump. */ +if (*last_tb && tb->page_addr[1] == -1 && +!qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) { +tb_add_jump(*last_tb, tb_exit, tb); +} +tb_unlock(); return tb; } @@ -441,7 +459,8 @@ int cpu_exec(CPUState *cpu) } else if (replay_has_exception() && cpu->icount_decr.u16.low + cpu->icount_extra == 0) { /* try to cause an exception pending in the log */ -cpu_exec_nocache(cpu, 1, tb_find_fast(cpu), true); +last_tb = NULL; /* Avoid chaining TBs */ +cpu_exec_nocache(cpu, 1, tb_find_fast(cpu, _tb, 0), true); ret = -1; break; #endif @@ -511,23 +530,7 @@ int cpu_exec(CPUState *cpu) cpu->exception_index = EXCP_INTERRUPT; cpu_loop_exit(cpu); } -tb_lock(); -tb = tb_find_fast(cpu); -if (cpu->tb_flushed) { -/* Ensure that no TB jump will be modified as the - * translation buffer has been flushed. - */ -last_tb = NULL; -cpu->tb_flushed = false; -} -/* see if we can patch the calling TB. When the TB - spans two pages, we cannot safely do a direct - jump. */ -if (last_tb && tb->page_addr[1] == -1 -&& !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) { -tb_add_jump(last_tb, tb_exit, tb); -} -tb_unlock(); +tb = tb_find_fast(cpu, _tb, tb_exit); if (likely(!cpu->exit_request)) { uintptr_t ret; trace_exec_tb(tb, tb->pc); -- 2.8.1
[Qemu-devel] [PATCH v8 2/5] ACPI: Add GICC Affinity Structure
From: Shannon ZhaoCc: Michael S. Tsirkin Cc: Igor Mammedov Signed-off-by: Shannon Zhao Reviewed-by: Andrew Jones --- hw/i386/acpi-build.c| 2 +- include/hw/acpi/acpi-defs.h | 15 ++- 2 files changed, 15 insertions(+), 2 deletions(-) diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c index 6477003..9ae4c0d 100644 --- a/hw/i386/acpi-build.c +++ b/hw/i386/acpi-build.c @@ -2474,7 +2474,7 @@ build_srat(GArray *table_data, GArray *linker, MachineState *machine) int apic_id = apic_ids->cpus[i].arch_id; core = acpi_data_push(table_data, sizeof *core); -core->type = ACPI_SRAT_PROCESSOR; +core->type = ACPI_SRAT_PROCESSOR_APIC; core->length = sizeof(*core); core->local_apic_id = apic_id; curnode = pcms->node_cpu[apic_id]; diff --git a/include/hw/acpi/acpi-defs.h b/include/hw/acpi/acpi-defs.h index c7a03d4..bcf5c3f 100644 --- a/include/hw/acpi/acpi-defs.h +++ b/include/hw/acpi/acpi-defs.h @@ -455,8 +455,10 @@ struct AcpiSystemResourceAffinityTable } QEMU_PACKED; typedef struct AcpiSystemResourceAffinityTable AcpiSystemResourceAffinityTable; -#define ACPI_SRAT_PROCESSOR 0 +#define ACPI_SRAT_PROCESSOR_APIC 0 #define ACPI_SRAT_MEMORY 1 +#define ACPI_SRAT_PROCESSOR_x2APIC 2 +#define ACPI_SRAT_PROCESSOR_GICC 3 struct AcpiSratProcessorAffinity { @@ -483,6 +485,17 @@ struct AcpiSratMemoryAffinity } QEMU_PACKED; typedef struct AcpiSratMemoryAffinity AcpiSratMemoryAffinity; +struct AcpiSratProcessorGiccAffinity +{ +ACPI_SUB_HEADER_DEF +uint32_tproximity; +uint32_tacpi_processor_uid; +uint32_tflags; +uint32_tclock_domain; +} QEMU_PACKED; + +typedef struct AcpiSratProcessorGiccAffinity AcpiSratProcessorGiccAffinity; + /* PCI fw r3.0 MCFG table. */ /* Subtable */ struct AcpiMcfgAllocation { -- 2.0.4
Re: [Qemu-devel] update status -- vTPM for HVM virtual machine
On April 26, 2016 6:25 PM, Wei Liuwrote: > For avoidance of doubt, this is purely status update, no action is needed on > my part. Yes. > Let me know if I misunderstood. > Quan
[Qemu-devel] [PATCH v5 3/6] cpu-exec: elide more icount code if CONFIG_USER_ONLY
From: Paolo BonziniSigned-off-by: Paolo Bonzini [Alex Bennée: #ifndef replay code to match elided functions] Signed-off-by: Alex Bennée Signed-off-by: Sergey Fedorov --- cpu-exec.c | 8 1 file changed, 8 insertions(+) diff --git a/cpu-exec.c b/cpu-exec.c index 4cba4efc92b2..36942340d7e3 100644 --- a/cpu-exec.c +++ b/cpu-exec.c @@ -192,6 +192,7 @@ static inline tcg_target_ulong cpu_tb_exec(CPUState *cpu, TranslationBlock *itb) return next_tb; } +#ifndef CONFIG_USER_ONLY /* Execute the code without caching the generated code. An interpreter could be used if available. */ static void cpu_exec_nocache(CPUState *cpu, int max_cycles, @@ -216,6 +217,7 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles, tb_phys_invalidate(tb, -1); tb_free(tb); } +#endif static TranslationBlock *tb_find_physical(CPUState *cpu, target_ulong pc, @@ -430,12 +432,14 @@ int cpu_exec(CPUState *cpu) } #endif } +#ifndef CONFIG_USER_ONLY } else if (replay_has_exception() && cpu->icount_decr.u16.low + cpu->icount_extra == 0) { /* try to cause an exception pending in the log */ cpu_exec_nocache(cpu, 1, tb_find_fast(cpu), true); ret = -1; break; +#endif } next_tb = 0; /* force lookup of first TB */ @@ -545,6 +549,9 @@ int cpu_exec(CPUState *cpu) case TB_EXIT_ICOUNT_EXPIRED: { /* Instruction counter expired. */ +#ifdef CONFIG_USER_ONLY +abort(); +#else int insns_left = cpu->icount_decr.u32; if (cpu->icount_extra && insns_left >= 0) { /* Refill decrementer and continue execution. */ @@ -564,6 +571,7 @@ int cpu_exec(CPUState *cpu) cpu_loop_exit(cpu); } break; +#endif } default: break; -- 2.8.1
[Qemu-devel] [PATCH v5 1/6] tcg: code_bitmap is not used by user-mode emulation
From: Paolo BonziniSigned-off-by: Paolo Bonzini [Sergey Fedorov: eliminate the field entirely in user-mode] Signed-off-by: Sergey Fedorov Reviewed-by: Richard Henderson Reviewed-by: Alex Bennée --- Changes in v2: * The field is eliminated entirely in user-mode translate-all.c | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/translate-all.c b/translate-all.c index 8329ea60eeda..0d5d9449dc6b 100644 --- a/translate-all.c +++ b/translate-all.c @@ -75,8 +75,9 @@ typedef struct PageDesc { /* in order to optimize self modifying code, we count the number of lookups we do to a given page to use a bitmap */ unsigned int code_write_count; +#ifdef CONFIG_SOFTMMU unsigned long *code_bitmap; -#if defined(CONFIG_USER_ONLY) +#else unsigned long flags; #endif } PageDesc; @@ -784,8 +785,10 @@ void tb_free(TranslationBlock *tb) static inline void invalidate_page_bitmap(PageDesc *p) { +#ifdef CONFIG_SOFTMMU g_free(p->code_bitmap); p->code_bitmap = NULL; +#endif p->code_write_count = 0; } @@ -1019,6 +1022,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr) tcg_ctx.tb_ctx.tb_phys_invalidate_count++; } +#ifdef CONFIG_SOFTMMU static void build_page_bitmap(PageDesc *p) { int n, tb_start, tb_end; @@ -1047,6 +1051,7 @@ static void build_page_bitmap(PageDesc *p) tb = tb->page_next[n]; } } +#endif /* Called with mmap_lock held for user mode emulation. */ TranslationBlock *tb_gen_code(CPUState *cpu, @@ -1296,6 +1301,7 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end, #endif } +#ifdef CONFIG_SOFTMMU /* len must be <= 8 and start must be a multiple of len */ void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len) { @@ -1333,8 +1339,7 @@ void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len) tb_invalidate_phys_page_range(start, start + len, 1); } } - -#if !defined(CONFIG_SOFTMMU) +#else /* Called with mmap_lock held. */ static void tb_invalidate_phys_page(tb_page_addr_t addr, uintptr_t pc, void *puc, -- 2.8.1
[Qemu-devel] [PATCH v8 4/5] ACPI: move acpi_build_srat_memory to common place
From: Shannon ZhaoMove acpi_build_srat_memory to common place so that it could be reused by ARM. Rename it to build_srat_memory. Cc: Michael S. Tsirkin Cc: Igor Mammedov Signed-off-by: Shannon Zhao Reviewed-by: Andrew Jones --- hw/acpi/aml-build.c | 11 +++ hw/i386/acpi-build.c| 38 +- include/hw/acpi/aml-build.h | 10 ++ 3 files changed, 30 insertions(+), 29 deletions(-) diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c index ab89ca6..cedb74e 100644 --- a/hw/acpi/aml-build.c +++ b/hw/acpi/aml-build.c @@ -1563,3 +1563,14 @@ build_rsdt(GArray *table_data, GArray *linker, GArray *table_offsets, build_header(linker, table_data, (void *)rsdt, "RSDT", rsdt_len, 1, oem_id, oem_table_id); } + +void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base, + uint64_t len, int node, MemoryAffinityFlags flags) +{ +numamem->type = ACPI_SRAT_MEMORY; +numamem->length = sizeof(*numamem); +numamem->proximity = cpu_to_le32(node); +numamem->flags = cpu_to_le32(flags); +numamem->base_addr = cpu_to_le64(base); +numamem->range_length = cpu_to_le64(len); +} diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c index 3c031aa..279f0d7 100644 --- a/hw/i386/acpi-build.c +++ b/hw/i386/acpi-build.c @@ -2427,25 +2427,6 @@ build_tpm2(GArray *table_data, GArray *linker) (void *)tpm2_ptr, "TPM2", sizeof(*tpm2_ptr), 4, NULL, NULL); } -typedef enum { -MEM_AFFINITY_NOFLAGS = 0, -MEM_AFFINITY_ENABLED = (1 << 0), -MEM_AFFINITY_HOTPLUGGABLE = (1 << 1), -MEM_AFFINITY_NON_VOLATILE = (1 << 2), -} MemoryAffinityFlags; - -static void -acpi_build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base, - uint64_t len, int node, MemoryAffinityFlags flags) -{ -numamem->type = ACPI_SRAT_MEMORY; -numamem->length = sizeof(*numamem); -numamem->proximity = cpu_to_le32(node); -numamem->flags = cpu_to_le32(flags); -numamem->base_addr = cpu_to_le64(base); -numamem->range_length = cpu_to_le64(len); -} - static void build_srat(GArray *table_data, GArray *linker, MachineState *machine) { @@ -2491,7 +2472,7 @@ build_srat(GArray *table_data, GArray *linker, MachineState *machine) numa_start = table_data->len; numamem = acpi_data_push(table_data, sizeof *numamem); -acpi_build_srat_memory(numamem, 0, 640*1024, 0, MEM_AFFINITY_ENABLED); +build_srat_memory(numamem, 0, 640 * 1024, 0, MEM_AFFINITY_ENABLED); next_base = 1024 * 1024; for (i = 1; i < pcms->numa_nodes + 1; ++i) { mem_base = next_base; @@ -2507,21 +2488,21 @@ build_srat(GArray *table_data, GArray *linker, MachineState *machine) mem_len -= next_base - pcms->below_4g_mem_size; if (mem_len > 0) { numamem = acpi_data_push(table_data, sizeof *numamem); -acpi_build_srat_memory(numamem, mem_base, mem_len, i - 1, - MEM_AFFINITY_ENABLED); +build_srat_memory(numamem, mem_base, mem_len, i - 1, + MEM_AFFINITY_ENABLED); } mem_base = 1ULL << 32; mem_len = next_base - pcms->below_4g_mem_size; next_base += (1ULL << 32) - pcms->below_4g_mem_size; } numamem = acpi_data_push(table_data, sizeof *numamem); -acpi_build_srat_memory(numamem, mem_base, mem_len, i - 1, - MEM_AFFINITY_ENABLED); +build_srat_memory(numamem, mem_base, mem_len, i - 1, + MEM_AFFINITY_ENABLED); } slots = (table_data->len - numa_start) / sizeof *numamem; for (; slots < pcms->numa_nodes + 2; slots++) { numamem = acpi_data_push(table_data, sizeof *numamem); -acpi_build_srat_memory(numamem, 0, 0, 0, MEM_AFFINITY_NOFLAGS); +build_srat_memory(numamem, 0, 0, 0, MEM_AFFINITY_NOFLAGS); } /* @@ -2531,10 +2512,9 @@ build_srat(GArray *table_data, GArray *linker, MachineState *machine) */ if (hotplugabble_address_space_size) { numamem = acpi_data_push(table_data, sizeof *numamem); -acpi_build_srat_memory(numamem, pcms->hotplug_memory.base, - hotplugabble_address_space_size, 0, - MEM_AFFINITY_HOTPLUGGABLE | - MEM_AFFINITY_ENABLED); +build_srat_memory(numamem, pcms->hotplug_memory.base, + hotplugabble_address_space_size, 0, + MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED); } build_header(linker, table_data, diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h index 2c994b3..7eb51c7 100644 --- a/include/hw/acpi/aml-build.h +++
[Qemu-devel] [PATCH v5 0/6] tcg: Misc clean-up patches
From: Sergey FedorovThis patch series consists of various general TCG clean-up patches extracted from Paolo's MTTCG tree [1] and Alex's MTTCG base enablement tree [2]. I also add here a patch from myself to rework tb_invalidated_flag based on the Paolo's "tcg: move tb_invalidated_flag to CPUState" patch from the original verions of this series. Another patch of mine cleans up from a misleading 'next_tb' variable. The main idea is to review and merge these patches separately from the MTTCG series to cut the latter and make it easier to review. The series' tree can be found in a public git repository [3]. [1] https://github.com/bonzini/qemu/tree/mttcg [2] https://github.com/stsquad/qemu/tree/mttcg/base-patches-v2 [3] https://github.com/sergefdrv/qemu/tree/tcg-cleanup-v5 Summary of changes: v5: * Shortcut non-null tests in [PATCH v5 4/6] * Add a patch to move TB chaining to tb_find_fast() [PATCH v5 6/6] v4: * Add a patch to clean up from 'next_tb' [PATCH v4 4/6] v3: * Add a patch to rework tb_invalidated_flag from myself [PATCH v3 4/4] v2: * Complete code_bitmap elimination [PATCH v2 1/3] * Take Alex's version of tb_find_physical() reorganization [PATCH v2 2/3] * Drop [PATCH 3/5] completely * Drop [PATCH 4/5] and [PATCH 5/5] for separate series * Take Alex's rebase of Paolo's icount code eliding [PATCH v2 3/3] Alex Bennée (1): tcg: reorganize tb_find_physical loop Paolo Bonzini (2): tcg: code_bitmap is not used by user-mode emulation cpu-exec: elide more icount code if CONFIG_USER_ONLY Sergey Fedorov (3): tcg: Clean up from 'next_tb' tcg: Rework tb_invalidated_flag cpu-exec: Move TB chaining into tb_find_fast() cpu-exec.c | 157 +++- include/exec/exec-all.h | 2 - include/qom/cpu.h | 2 + tcg/tcg.h | 19 +++--- tci.c | 6 +- trace-events| 2 +- translate-all.c | 16 ++--- 7 files changed, 114 insertions(+), 90 deletions(-) -- 2.8.1
[Qemu-devel] [PATCH v5 5/6] tcg: Rework tb_invalidated_flag
From: Sergey Fedorov'tb_invalidated_flag' was meant to catch two events: * some TB has been invalidated by tb_phys_invalidate(); * the whole translation buffer has been flushed by tb_flush(). Then it was checked: * in cpu_exec() to ensure that the last executed TB can be safely linked to directly call the next one; * in cpu_exec_nocache() to decide if the original TB should be provided for further possible invalidation along with the temporarily generated TB. It is always safe to patch an invalidated TB since it is not going to be used anyway. It is also safe to call tb_phys_invalidate() for an already invalidated TB. Thus, setting this flag in tb_phys_invalidate() is simply unnecessary. Moreover, it can prevent from pretty proper linking of TBs, if any arbitrary TB has been invalidated. So just don't touch it in tb_phys_invalidate(). If this flag is only used to catch whether tb_flush() has been called then rename it to 'tb_flushed'. Declare it as 'bool' and stick to using only 'true' and 'false' to set its value. Also, instead of setting it in tb_gen_code(), just after tb_flush() has been called, do it right inside of tb_flush(). In cpu_exec(), this flag is used to track if tb_flush() has been called and have made 'next_tb' (a reference to the last executed TB) invalid for linking it to directly call the next TB. tb_flush() can be called during the CPU execution loop from tb_gen_code(), during TB execution or by another thread while 'tb_lock' is released. Catch for translation buffer flush reliably by resetting this flag once before first TB lookup and each time we find it set before trying to add a direct jump. Don't touch in in tb_find_physical(). Each vCPU has its own execution loop in multithreaded mode and thus should have its own copy of the flag to be able to reset it with its own 'next_tb' and don't affect any other vCPU execution thread. So make this flag per-vCPU and move it to CPUState. In cpu_exec_nocache(), we only need to check if tb_flush() has been called from tb_gen_code() called by cpu_exec_nocache() itself. To do this reliably, preserve the old value of the flag, reset it before calling tb_gen_code(), check afterwards, and combine the saved value back to the flag. This patch is based on the patch "tcg: move tb_invalidated_flag to CPUState" from Paolo Bonzini . Signed-off-by: Sergey Fedorov Signed-off-by: Sergey Fedorov --- Changes in v4: * Rebased on top of the previous patch cpu-exec.c | 21 +++-- include/exec/exec-all.h | 2 -- include/qom/cpu.h | 2 ++ translate-all.c | 5 + 4 files changed, 14 insertions(+), 16 deletions(-) diff --git a/cpu-exec.c b/cpu-exec.c index 6ad5fbdf1f8e..8ba899b47a23 100644 --- a/cpu-exec.c +++ b/cpu-exec.c @@ -202,16 +202,20 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles, TranslationBlock *orig_tb, bool ignore_icount) { TranslationBlock *tb; +bool old_tb_flushed; /* Should never happen. We only end up here when an existing TB is too long. */ if (max_cycles > CF_COUNT_MASK) max_cycles = CF_COUNT_MASK; +old_tb_flushed = cpu->tb_flushed; +cpu->tb_flushed = false; tb = tb_gen_code(cpu, orig_tb->pc, orig_tb->cs_base, orig_tb->flags, max_cycles | CF_NOCACHE | (ignore_icount ? CF_IGNORE_ICOUNT : 0)); -tb->orig_tb = tcg_ctx.tb_ctx.tb_invalidated_flag ? NULL : orig_tb; +tb->orig_tb = cpu->tb_flushed ? NULL : orig_tb; +cpu->tb_flushed |= old_tb_flushed; cpu->current_tb = tb; /* execute the generated code */ trace_exec_tb_nocache(tb, tb->pc); @@ -232,8 +236,6 @@ static TranslationBlock *tb_find_physical(CPUState *cpu, unsigned int h; tb_page_addr_t phys_pc, phys_page1; -tcg_ctx.tb_ctx.tb_invalidated_flag = 0; - /* find translated block using physical mappings */ phys_pc = get_page_addr_code(env, pc); phys_page1 = phys_pc & TARGET_PAGE_MASK; @@ -446,6 +448,7 @@ int cpu_exec(CPUState *cpu) } last_tb = NULL; /* forget the last executed TB after exception */ +cpu->tb_flushed = false; /* reset before first TB lookup */ for(;;) { interrupt_request = cpu->interrupt_request; if (unlikely(interrupt_request)) { @@ -510,14 +513,12 @@ int cpu_exec(CPUState *cpu) } tb_lock(); tb = tb_find_fast(cpu); -/* Note: we do it here to avoid a gcc bug on Mac OS X when - doing it in tb_find_slow */ -if (tcg_ctx.tb_ctx.tb_invalidated_flag) { -/* as some TB could have been invalidated because - of memory exceptions while generating the code, we - must recompute the hash
[Qemu-devel] [PATCH v5 2/6] tcg: reorganize tb_find_physical loop
From: Alex BennéePut some comments and improve code structure. This should help reading the code. Signed-off-by: Alex Bennée [Sergey Fedorov: provide commit message; bring back resetting of tb_invalidated_flag] Signed-off-by: Sergey Fedorov Reviewed-by: Richard Henderson --- cpu-exec.c | 44 1 file changed, 24 insertions(+), 20 deletions(-) diff --git a/cpu-exec.c b/cpu-exec.c index bbfcbfb54385..4cba4efc92b2 100644 --- a/cpu-exec.c +++ b/cpu-exec.c @@ -223,10 +223,9 @@ static TranslationBlock *tb_find_physical(CPUState *cpu, uint64_t flags) { CPUArchState *env = (CPUArchState *)cpu->env_ptr; -TranslationBlock *tb, **ptb1; +TranslationBlock *tb, **tb_hash_head, **ptb1; unsigned int h; tb_page_addr_t phys_pc, phys_page1; -target_ulong virt_page2; tcg_ctx.tb_ctx.tb_invalidated_flag = 0; @@ -234,37 +233,42 @@ static TranslationBlock *tb_find_physical(CPUState *cpu, phys_pc = get_page_addr_code(env, pc); phys_page1 = phys_pc & TARGET_PAGE_MASK; h = tb_phys_hash_func(phys_pc); -ptb1 = _ctx.tb_ctx.tb_phys_hash[h]; -for(;;) { -tb = *ptb1; -if (!tb) { -return NULL; -} + +/* Start at head of the hash entry */ +ptb1 = tb_hash_head = _ctx.tb_ctx.tb_phys_hash[h]; +tb = *ptb1; + +while (tb) { if (tb->pc == pc && tb->page_addr[0] == phys_page1 && tb->cs_base == cs_base && tb->flags == flags) { -/* check next page if needed */ -if (tb->page_addr[1] != -1) { -tb_page_addr_t phys_page2; -virt_page2 = (pc & TARGET_PAGE_MASK) + -TARGET_PAGE_SIZE; -phys_page2 = get_page_addr_code(env, virt_page2); +if (tb->page_addr[1] == -1) { +/* done, we have a match */ +break; +} else { +/* check next page if needed */ +target_ulong virt_page2 = (pc & TARGET_PAGE_MASK) + + TARGET_PAGE_SIZE; +tb_page_addr_t phys_page2 = get_page_addr_code(env, virt_page2); + if (tb->page_addr[1] == phys_page2) { break; } -} else { -break; } } + ptb1 = >phys_hash_next; +tb = *ptb1; } -/* Move the TB to the head of the list */ -*ptb1 = tb->phys_hash_next; -tb->phys_hash_next = tcg_ctx.tb_ctx.tb_phys_hash[h]; -tcg_ctx.tb_ctx.tb_phys_hash[h] = tb; +if (tb) { +/* Move the TB to the head of the list */ +*ptb1 = tb->phys_hash_next; +tb->phys_hash_next = *tb_hash_head; +*tb_hash_head = tb; +} return tb; } -- 2.8.1
Re: [Qemu-devel] [PATCH v4 00/16] IOMMU: Enable interrupt remapping for Intel IOMMU
On Tue, Apr 26, 2016 at 10:15:46AM +0200, Jan Kiszka wrote: > On 2016-04-26 09:57, Jan Kiszka wrote: > > On 2016-04-26 09:34, Peter Xu wrote: > >> On Mon, Apr 25, 2016 at 09:24:12AM +0200, Jan Kiszka wrote: > >>> On 2016-04-25 09:18, Peter Xu wrote: > On Mon, Apr 25, 2016 at 07:16:19AM +0200, Jan Kiszka wrote: > > On 2016-04-19 10:38, Peter Xu wrote: > > [...] > > >> By default, IR is disabled to be better compatible with current > >> QEMU. To enable IR, we can using the following command to boot a > >> IR-supported VM with virtio-net device with vhost (still do not > >> support kvm-ioapic, so we need to specify kernel-irqchip={split|off} > >> here): > >> > >> $ qemu-system-x86_64 -M q35,iommu=on,intr=on,kernel-irqchip=split \ > > > > "intr" sounds a bit too much like "interrupt", not "interrupt > > remapping". Why not use the kernel's form, "intremap"? > > Sure. It sounds nice to be aligned with the kernel one. Let me take > it in v5. > > > > >> -enable-kvm -m 1024 \ > >> -netdev tap,id=net0,vhost=on \ > >> -device virtio-net-pci,netdev=user.0 \ > >> -monitor telnet::,server,nowait \ > >> /var/lib/libvirt/images/vm1.qcow2 > >> > >> When guest boots, we can verify whether IR enabled by grepping the > >> dmesg like: > >> > >> [root@localhost ~]# journalctl -k | grep "DMAR-IR" > >> Feb 19 11:21:23 localhost.localdomain kernel: DMAR-IR: IOAPIC id 0 > >> under DRHD base 0xfed9 IOMMU 0 > >> Feb 19 11:21:23 localhost.localdomain kernel: DMAR-IR: Enabled IRQ > >> remapping in xapic mode > >> > >> Currently supported devices: > >> > >> - Emulated/Splitted irqchip > >> - Generic PCI Devices > >> - vhost devices > >> - pass through device support? Not tested, but suppose it should work. > > > > I've tested this series against my Jailhouse setup, and it works pretty > > well! Actually considering to move my test setup over this branch. > > This is really encouraging feedback! Btw, thanks for all kinds of > help on this patchset. :-) > > > > > However, split irqchip still has some issues: When I boot a q35 machine > > with Linux, the e1000 network adapter only gets a single IRQ delivered. > > Interestingly, other IOAPIC IRQs like the keyboard work all the time. I > > didn't debug this in details yet. > > I reproduced this problem. It seems that it fails even with > kernel-irqchip=off. Will try to dig it out. > >>> > >>> Very good. Hope it can be easily fixed. > >> > >> Hi, Jan, > >> > >> The above issue should be caused by EOI missing of level-triggered > >> interrupts. Before that, I was always using edge-triggered > >> interrupts for test, so didn't encounter this one. Would you please > >> help try below patch? It can be applied directly onto the series, > >> and should solve the issue (it works on my test vm, and I'll take it > >> in v5 as well if it also works for you): > >> > > > > Works here as well. I even made EIM working with some hack, though > > Jailhouse spits out strange warnings, despite it works fine (x2apic > > mode, split irqchip). > > Corrections: the warnings are issued by qemu, not Jailhouse, e.g. > > qemu-system-x86_64: VT-d Failed to remap interrupt for gsi 22. > > I suspect that comes from the hand-over phase of Jailhouse, when it > mutes all interrupts in the system while reconfiguring IR and IOAPIC. > > Please convert this error (in kvm_arch_fixup_msi_route) into a trace > point. It shall not annoy the host. Also check if you have more of such > guest-triggerable error messages. Okay. This should be the only one. I can use trace instead. Meanwhile, I still suppose we should not seen it even with error_report().. Would this happen when boot e.g. generic kernels? -- peterx