Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs
On 01/28/2014 01:27 AM, Benjamin Herrenschmidt wrote: On Wed, 2014-01-22 at 17:29 +, Peter Maydell wrote: Basically, if it were on a real bus: get the byte value that corresponds to address phys_addr + 0 and place it into data[0], get the byte value that corresponds to address phys_addr + 1 and place it into data[1], etc. This just isn't how real buses work. Actually it can be :-) There is no address + 1, address + 2. There is a single address for the memory transaction and a set of data on data lines and some separate size information. How the device at the far end of the bus chooses to respond to 32-bit accesses to address X versus 8-bit accesses to addresses X through X+3 is entirely its own business and unrelated to the CPU. However, the bus has a definition of which byte lane is the lowest in address order. Byte-order invariance is an important function of all buses. I think that trying to treat it as anything other than an address-ordered series of bytes is going to turn into a complete and inextricable mess. I agree. The two options are: (address, byte array, length) and (address, value, word size, endianness). The first is the KVM ABI, the second is how MemoryRegions work. Both are valid, but the first is more general (supports the 3-byte accesses sometimes generated on x86). (It would be perfectly possible to have a device which, when you read from address X as 32 bits, returned 0x12345678; when you read from address X as 16 bits, returned 0x9abc; returned 0x42 for an 8-bit read from X+1; and so on. Having byte reads from X..X+3 return values corresponding to parts of the 32-bit access is purely a convention.) Right, it's possible. It's also stupid and not how most modern devices and buses work. Besides, there is no reason why that can't be implemented with Victor's proposal anyway. Right.
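The two representations Paolo contrasts can be sketched as C structs, together with the conversion between them. This is illustrative only: the field and function names are hypothetical, not taken from any actual KVM or QEMU header.

```c
#include <stdint.h>

/* (address, byte array, length) - the KVM-style form; bytes in
 * ascending address order, any length (e.g. a 3-byte x86 access). */
struct mmio_byte_array {
    uint64_t phys_addr;
    uint8_t  data[8];
    uint32_t len;
};

/* (address, value, word size, endianness) - the MemoryRegion-style
 * form; only power-of-two sizes make sense here. */
struct mmio_value {
    uint64_t phys_addr;
    uint64_t value;
    uint32_t size;        /* 1, 2, 4 or 8 */
    int      big_endian;
};

/* Converting the value form into the byte-array form: serialize the
 * value into address order according to the declared endianness. */
static void value_to_bytes(uint8_t *data, uint64_t value,
                           uint32_t size, int big_endian)
{
    for (uint32_t i = 0; i < size; i++) {
        uint32_t shift = big_endian ? 8 * (size - 1 - i) : 8 * i;
        data[i] = (uint8_t)(value >> shift);
    }
}
```

The conversion is lossless in this direction, which is one way of seeing why the byte-array form is the more general of the two.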
Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs
On Tue, Jan 28, 2014 at 03:47:32PM +1100, Benjamin Herrenschmidt wrote: On Mon, 2014-01-27 at 16:44 -0800, Christoffer Dall wrote: I'm losing track of this discussion, Ben, can you explain a bit? You wrote: Having a byte array coming in that represents what the CPU does in its current byte order means you do *NOT* need to query the endianness of the guest CPU from userspace. What does a byte array that represents what the CPU does in its current byte order mean in this context? Do you mean the VCPU or the physical CPU when you say CPU? It doesn't matter once it's a byte array in address order. Again, this is the *right* abstraction for the kernel ABI, because you do not care about the endianness of either side, guest or host. It makes no sense to treat a modern CPU data bus as having an MSB and an LSB (even if they have it sometimes on the block diagram). Only when *interpreting a value* on that bus, such as an *address*, does the endianness become of use. Treat the bus instead as an ordered sequence of bytes in ascending address order and most of the complexity goes away. From there, for a given device, it all depends which bytes *that device* chooses to consider as being the MSB vs. the LSB. It's not even a bus thing, though of course some buses suggest an endianness, and some, like PCI, mandate it for configuration space. But it remains a device-side choice. I read your text as saying just do a store of the register into the data pointer and don't worry about endianness, but somebody, somewhere, has to check the VCPU endianness setting. I'm probably wrong, and you are probably the right person to clear this up, but can you formulate exactly what you think the KVM ABI is and how you would put it in Documentation/virtual/kvm/api.txt? My point of view is that it is KVM that needs to do this, and it should emulate the CPU by performing a byteswap in the case where the CPU E-bit is set on ARM, but this is an ARM-centric way of looking at things.
The ABI going to qemu should be (and inside qemu from TCG to the emulation) that the CPU did an access of N bytes wide at address A whose value is the byte array data[] in ascending address order. OK, I've sent a v3 of the ABI clarification patch following the wording from you and Scott. I think we all agree what the format should look like at this point and hopefully we can quickly agree about a text to describe that. Thanks, -Christoffer
Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs
On Wed, 2014-01-22 at 17:29 +, Peter Maydell wrote: Basically, if it were on a real bus: get the byte value that corresponds to address phys_addr + 0 and place it into data[0], get the byte value that corresponds to address phys_addr + 1 and place it into data[1], etc. This just isn't how real buses work. Actually it can be :-) There is no address + 1, address + 2. There is a single address for the memory transaction and a set of data on data lines and some separate size information. How the device at the far end of the bus chooses to respond to 32-bit accesses to address X versus 8-bit accesses to addresses X through X+3 is entirely its own business and unrelated to the CPU. However, the bus has a definition of which byte lane is the lowest in address order. Byte-order invariance is an important function of all buses. I think that trying to treat it as anything other than an address-ordered series of bytes is going to turn into a complete and inextricable mess. (It would be perfectly possible to have a device which, when you read from address X as 32 bits, returned 0x12345678; when you read from address X as 16 bits, returned 0x9abc; returned 0x42 for an 8-bit read from X+1; and so on. Having byte reads from X..X+3 return values corresponding to parts of the 32-bit access is purely a convention.) Right, it's possible. It's also stupid and not how most modern devices and buses work. Besides, there is no reason why that can't be implemented with Victor's proposal anyway. Ben.
Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs
On Wed, 2014-01-22 at 11:29 -0800, Victor Kamensky wrote: I don't see why you are so attached to the desire to describe the data part of a memory transaction as just one of the int types. If we are talking about a bunch of hypothetical cases, imagine a bus that allows transactions with a size of 6 bytes. How do you describe such data in your ints speak? What endianness can you assign to a sequence of 6 bytes? Note that describing such a transaction as a set of 6 byte values at address $whatever makes perfect sense. Absolutely. For example, the real bus out of a POWER8 core is something like 128 bytes wide, though I wouldn't be surprised if it was serialized; I don't actually know the details, it's all inside the chip. The interconnect between chips is a multi-lane elastic interface whose width has nothing to do with the payload size. Same goes for PCIe. The only thing that can more or less sanely represent all of these things is a series of bytes ordered by address, with attributes such as the access size (or byte enables if that makes more sense or we want to emulate really funky stuff) and possibly other decoration that some architectures might want to have (such as caching/combining attributes etc... which *might* be useful under some circumstances). Cheers, Ben.
Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs
On Wed, 2014-01-22 at 20:02 +, Peter Maydell wrote: Defining it as being always guest-order would mean that userspace had to continually look at the guest CPU endianness bit, which is annoying and awkward. Defining it as always host-endian order is the most reasonable option available. It also happens to work for the current QEMU code, which is nice. No. Having a byte array coming in that represents what the CPU does in its current byte order means you do *NOT* need to query the endianness of the guest CPU from userspace. Ben.
Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs
On 27 January 2014 23:34, Benjamin Herrenschmidt b...@kernel.crashing.org wrote: On Wed, 2014-01-22 at 20:02 +, Peter Maydell wrote: Defining it as being always guest-order would mean that userspace had to continually look at the guest CPU endianness bit, which is annoying and awkward. Defining it as always host-endian order is the most reasonable option available. It also happens to work for the current QEMU code, which is nice. No. Having a byte array coming in that represents what the CPU does in its current byte order means you do *NOT* need to query the endianness of the guest CPU from userspace. Er, what? If we make the array be guest's current order then by definition userspace has to look at the guest's current endianness. I agree that would be bad. Either of the two current proposals (host kernel order; guest CPU's native/natural/default byte order) avoids it, though. -- PMM
Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs
On Thu, 2014-01-23 at 15:33 +, Peter Maydell wrote: (4) PPC CPUs in BE mode and ARM CPUs in BE mode are not the same, because in the ARM case it is doing an internal-to-CPU byteswap, and in the PPC case it is not. Aren't they both byte-order invariant? In that case they are the same. Ben.
Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs
On Tue, 2014-01-28 at 11:07 +1100, Benjamin Herrenschmidt wrote: On Thu, 2014-01-23 at 15:33 +, Peter Maydell wrote: (4) PPC CPUs in BE mode and ARM CPUs in BE mode are not the same, because in the ARM case it is doing an internal-to-CPU byteswap, and in the PPC case it is not. Aren't they both byte-order invariant? I meant byte-address invariant... In that case they are the same. Ben.
Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs
The point is simple, and Peter has made it over and over: any consumer of a memory operation sees value, len, address. This is what KVM_EXIT_MMIO emulates. So just by knowing the ABI definition and having a pointer to the structure you need to be able to tell me value, len, address. But that's useless, because it doesn't tell you the byte order of the value, which is critical for emulation, unless you *defined* in your ABI the byte order of the value, and thus include an artificial swap when the guest is in a different endian mode than the host. My understanding is that ARM is byte-address invariant, as is powerpc, so it makes a LOT more sense to carry a sequence of address-ordered bytes instead, which will correspond to what the guest code thinks it's writing, and have the device respond appropriately based on the endianness of the bus it sits on or of the device itself. (2) the API between kernel and userspace needs to define the semantics of mmio.data, i.e. how to map between an x-byte-wide transaction with value v and the array, and that is primarily what this conversation is about (3) the only choice which is both (a) sensible and (b) not breaking existing usage is to say the array is in host-kernel byte order (4) PPC CPUs in BE mode and ARM CPUs in BE mode are not the same, because in the ARM case it is doing an internal-to-CPU byteswap, and in the PPC case it is not. I very much doubt that there is a difference here; I'm not sure about that business with the internal byteswap. The order in which the bytes of a word are presented on the bus changes depending on the core endianness. If that's what you call a byteswap then both ARM and PPC do it when they are in their other endian, but that confusion comes from assuming that a data bus has an endianness at all to begin with. I was hoping that by 2014, such ideas were things of the past. That is one of the key disconnects. I'll go find real examples in the ARM LE, ARM BE, and PPC BE Linux kernels.
Just for everybody's sake, here is a summary of the disconnect: if we have the same h/w connected to the memory bus in ARM and PPC systems, and we have the following three pieces of code that work with r0 holding the same device register address:
1. ARM LE word write of 0x04030201: setend le; mov r1, #0x04030201; str r1, [r0]
2. ARM BE word write of 0x01020304: setend be; mov r1, #0x01020304; str r1, [r0]
3. PPC BE word write of 0x01020304: lis r1,0x102; ori r1,r1,0x304; stw r1,0(r0)
I claim that the h/w will see the same data on the bus lines in all three cases, and the h/w would act the same in all three cases. Peter says that in the ARM BE and PPC BE cases the h/w would act differently. If anyone else can offer an opinion on that while I am looking for real examples, that would be great. I really don't think listing all these examples helps. You need to focus on the key points that Peter listed in his previous mail. I tried in our chat to ask you this question: vcpu_data_host_to_guest() is handling a read from an emulated device. All the info you have is: (1) len of memory access (2) mmio.data pointer (3) destination register (4) host CPU endianness (5) guest CPU endianness. Based on this information alone, you need to decide whether you do a byteswap or not before loading the hardware register upon returning to the guest. You will find it impossible to answer, because you don't know the layout of mmio.data, and that is the thing we are trying to solve. If you cannot reply to this point in less than 50 lines or mention anything about devices being LE or BE or come with examples, I am probably not going to read your reply, sorry. -Christoffer
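Victor's claim about the three stores can be checked directly in C by modeling what each core puts on the bus in address order. This is a minimal sketch of the argument, not kernel code: an LE core writes the LSB at the lowest address, a BE core writes the MSB there.

```c
#include <stdint.h>
#include <string.h>

/* Model of an LE core's word store: LSB goes to the lowest address. */
static void store32_le(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* Model of a BE core's word store: MSB goes to the lowest address. */
static void store32_be(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}
```

An LE store of 0x04030201 and a BE store of 0x01020304 both produce the byte sequence 01 02 03 04 in address order, which is exactly the "same data on the bus lines" point Victor is making for cases 1-3.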
Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs
On Thu, 2014-01-23 at 20:11 -0800, Victor Kamensky wrote: I would take 50 byteswaps with a clear ABI any day over an obscure standard that can avoid a single hardware-on-register instruction. This is about designing a clean software interface, not about building an optimized integrated stack. Unfortunately, this is going nowhere, so I think we need to stop this thread. As you can see I have sent a patch as a clarification to the ABI; if it's merged we can move on with more important tasks. OK, that is fine. I still believe it is not the best choice, but I agree that we need to move on. I will respin my V7 KVM BE patches according to these new semantics, integrate the comments that you (thanks!) and others gave me over the mailing list, and post my series again when it is ready. Right, the whole host-endian thing is a horrible choice from every way you look at it, but I'm afraid it's unfixable since it's already ABI :-( Ben.
Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs
On Mon, 2014-01-27 at 23:49 +, Peter Maydell wrote: Er, what? If we make the array be guest's current order then by definition userspace has to look at the guest's current endianness. I agree that would be bad. Either of the two current proposals (host kernel order; guest CPU's native/natural/default byte order) avoids it, though. No, this has nothing to do with the guest endianness, and everything to do with the (hopefully) byte-address-invariant bus we have on the processor. Anyway, the existing crap is ABI, so I suspect we have to stick with it, just maybe document it better. Ben.
Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs
On Tue, Jan 28, 2014 at 11:32:41AM +1100, Benjamin Herrenschmidt wrote: On Thu, 2014-01-23 at 20:11 -0800, Victor Kamensky wrote: I would take 50 byteswaps with a clear ABI any day over an obscure standard that can avoid a single hardware-on-register instruction. This is about designing a clean software interface, not about building an optimized integrated stack. Unfortunately, this is going nowhere, so I think we need to stop this thread. As you can see I have sent a patch as a clarification to the ABI; if it's merged we can move on with more important tasks. OK, that is fine. I still believe it is not the best choice, but I agree that we need to move on. I will respin my V7 KVM BE patches according to these new semantics, integrate the comments that you (thanks!) and others gave me over the mailing list, and post my series again when it is ready. Right, the whole host-endian thing is a horrible choice from every way you look at it, but I'm afraid it's unfixable since it's already ABI :-( Why is it a horrible choice? I don't think it's actually ABI at this point; it's undefined. The only thing fixed is PPC BE host and ARM LE host, and in both cases we currently perform a byteswap in KVM if the guest is a different endianness. Honestly, I don't care which way it's defined, as long as it's defined somehow, and I have not yet seen anyone formulate how the ABI specification should be worded so people clearly understand what's going on. If you take a look at the v2 patch "KVM: Specify byte order for KVM_EXIT_MMIO", that's where it ended up. If you can formulate something, with your experience in endianness, that makes this clear, it would be extremely helpful. -Christoffer
Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs
On Tue, Jan 28, 2014 at 11:36:13AM +1100, Benjamin Herrenschmidt wrote: On Mon, 2014-01-27 at 23:49 +, Peter Maydell wrote: Er, what? If we make the array be guest's current order then by definition userspace has to look at the guest's current endianness. I agree that would be bad. Either of the two current proposals (host kernel order; guest CPU's native/natural/default byte order) avoids it, though. No, this has nothing to do with the guest endianness, and everything to do with the (hopefully) byte-address-invariant bus we have on the processor. Anyway, the existing crap is ABI, so I suspect we have to stick with it, just maybe document it better. I'm losing track of this discussion, Ben, can you explain a bit? You wrote: Having a byte array coming in that represents what the CPU does in its current byte order means you do *NOT* need to query the endianness of the guest CPU from userspace. What does a byte array that represents what the CPU does in its current byte order mean in this context? Do you mean the VCPU or the physical CPU when you say CPU? I read your text as saying just do a store of the register into the data pointer and don't worry about endianness, but somebody, somewhere, has to check the VCPU endianness setting. I'm probably wrong, and you are probably the right person to clear this up, but can you formulate exactly what you think the KVM ABI is and how you would put it in Documentation/virtual/kvm/api.txt? My point of view is that it is KVM that needs to do this, and it should emulate the CPU by performing a byteswap in the case where the CPU E-bit is set on ARM, but this is an ARM-centric way of looking at things. Thanks, -Christoffer
Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs
On Mon, 2014-01-27 at 16:44 -0800, Christoffer Dall wrote: I'm losing track of this discussion, Ben, can you explain a bit? You wrote: Having a byte array coming in that represents what the CPU does in its current byte order means you do *NOT* need to query the endianness of the guest CPU from userspace. What does a byte array that represents what the CPU does in its current byte order mean in this context? Do you mean the VCPU or the physical CPU when you say CPU? It doesn't matter once it's a byte array in address order. Again, this is the *right* abstraction for the kernel ABI, because you do not care about the endianness of either side, guest or host. It makes no sense to treat a modern CPU data bus as having an MSB and an LSB (even if they have it sometimes on the block diagram). Only when *interpreting a value* on that bus, such as an *address*, does the endianness become of use. Treat the bus instead as an ordered sequence of bytes in ascending address order and most of the complexity goes away. From there, for a given device, it all depends which bytes *that device* chooses to consider as being the MSB vs. the LSB. It's not even a bus thing, though of course some buses suggest an endianness, and some, like PCI, mandate it for configuration space. But it remains a device-side choice. I read your text as saying just do a store of the register into the data pointer and don't worry about endianness, but somebody, somewhere, has to check the VCPU endianness setting. I'm probably wrong, and you are probably the right person to clear this up, but can you formulate exactly what you think the KVM ABI is and how you would put it in Documentation/virtual/kvm/api.txt? My point of view is that it is KVM that needs to do this, and it should emulate the CPU by performing a byteswap in the case where the CPU E-bit is set on ARM, but this is an ARM-centric way of looking at things.
The ABI going to qemu should be (and inside qemu from TCG to the emulation) that the CPU did an access of N bytes wide at address A whose value is the byte array data[] in ascending address order. Ben.
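Under Ben's formulation, the userspace side never asks about anyone's endianness: it just reassembles the address-ordered bytes according to what the *device* declares. A minimal sketch of such helpers (names are illustrative, not from QEMU):

```c
#include <stdint.h>
#include <stddef.h>

/* Interpret mmio.data[] (bytes in ascending address order) as the
 * value of a little-endian device register: data[0] is the LSB.
 * Assembling byte-by-byte makes the result independent of the
 * endianness of the host running this code. */
static uint64_t mmio_data_to_le_value(const uint8_t *data, size_t len)
{
    uint64_t v = 0;
    for (size_t i = 0; i < len; i++)
        v |= (uint64_t)data[i] << (8 * i);
    return v;
}

/* Same, for a big-endian device register: data[0] is the MSB. */
static uint64_t mmio_data_to_be_value(const uint8_t *data, size_t len)
{
    uint64_t v = 0;
    for (size_t i = 0; i < len; i++)
        v = (v << 8) | data[i];
    return v;
}
```

The device-side choice of MSB vs. LSB lives entirely in which helper the device model calls; neither guest nor host endianness appears anywhere.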
Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs
On 23.01.2014, at 05:25, Victor Kamensky victor.kamen...@linaro.org wrote: Hi Alex, Sorry for the delayed reply, I was focusing on the discussion with Peter. Hope you and other folks may get something out of it :). Please see responses inline. On 22 January 2014 02:52, Alexander Graf ag...@suse.de wrote: On 22.01.2014, at 08:26, Victor Kamensky victor.kamen...@linaro.org wrote: On 21 January 2014 22:41, Alexander Graf ag...@suse.de wrote: Native endian really is just a shortcut for target endian, which is LE for ARM and BE for PPC. There shouldn't be a qemu-system-armeb or qemu-system-ppc64le. I disagree. A fully functional ARM BE system is what we've been working on for the last few months. 'We' is the Linaro Networking Group, Endian subteam, and some other guys in ARM and across the community. Why we do that is a bit beyond this discussion. ARM BE patches for both V7 and V8 are already in the mainline kernel. But the ARM BE KVM host is broken now. It is a known deficiency that I am trying to fix. Please look at [1]. Patches for V7 BE KVM were proposed and are currently under active discussion. Currently I work on ARM V8 BE KVM changes. So native endian on ARM is the value of the CPSR register E bit. If it is off, native endian is LE; if it is on, it is BE. Once and if we agree on the ARM BE KVM host changes, the next step would be patches in qemu, one of which introduces qemu-system-armeb. Please see [2]. I think we're facing an ideology conflict here. Yes, there should be a qemu-system-arm that is BE capable. Maybe it is not an ideology conflict but rather a terminology clarity issue :). I am not sure what you mean by a qemu-system-arm that is BE capable. In the qemu build system there is just the target name 'arm', which is an ARM V7 cpu in LE mode, and an 'armeb' target, which is an ARM V7 cpu in BE mode. That is true for a lot of open source packages. You could check the [1] patch that introduces the armeb target into qemu.
A build for the arm target produces a qemu-system-arm executable that is marked 'ELF 32-bit LSB executable' and can run on a traditional LE ARM Linux. A build for the armeb target produces a qemu-system-armeb executable that is marked 'ELF 32-bit MSB executable' and can run on a BE ARM Linux. armeb is nothing special here, just a build option for qemu that should run on BE ARM Linux. But why should it be called armeb then? What actual difference does the model have compared to the qemu-system-arm model? Both qemu-system-arm and qemu-system-armeb should be BE/LE capable, i.e. either of them along with KVM could run either an LE or a BE guest. MarcZ demonstrated that this is possible. I've tested both LE and BE guests with qemu-system-arm running on traditional LE ARM Linux, effectively repeating Marc's setup but with qemu. And I did test with my patches both BE and LE guests with qemu-system-armeb running on BE ARM Linux. There should also be a qemu-system-ppc64 that is LE capable. But there is no point in changing the default endianness for the virtual CPUs that we plug in there. Both CPUs are perfectly capable of running in LE or BE mode; the question is just what we declare the default. I am not sure what you mean by default? Is it the initial setting of the CPSR E bit and the 'cp15 c1, c0, 0' EE bit? Yes, the way it is currently implemented by the committed qemu-system-arm, and the proposed qemu-system-armeb patches, they are both off. I.e. even qemu-system-armeb starts running the vcpu in LE mode, for a reason very similar to the one described in your next paragraph: qemu-system-armeb has a tiny bootloader that starts in LE mode and jumps to the kernel; the kernel switches the cpu to run in BE mode with 'setend be', and the EE bit is set just before the mmu is enabled. You're proving my point even more. If both targets are LE/BE capable and both targets start execution in LE mode, then why do we need a qemu-system-armeb at all? Just use qemu-system-arm. Think about the PPC bootstrap.
We start off with a BE firmware, then boot into the Linux kernel, which calls a hypercall to set the LE bit on every interrupt. We have a very similar situation with BE ARM Linux. When we run ARM BE Linux we start with a bootloader, which is LE, and then the CPU issues 'setend be' very soon as it starts executing kernel code; all secondary CPUs issue 'setend be' when they come out of the reset pen or bootmonitor sleep. But there's no reason this little-endian kernel couldn't theoretically have a big-endian user space running with access to emulated device registers. I don't want to go there, it is very very messy ... -- Just a side note: -- Interestingly, half a year before I joined Linaro, at Cisco a colleague and I implemented a kernel patch that allowed running BE user-space processes as a sort of separate personality on top of an LE ARM kernel ... treated as a kind of multi-ABI system. Effectively we had to do byteswaps on all non-trivial system calls and ioctls inside the kernel. We converted around 30 system calls and around 10
Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs
On Wed, 22 Jan 2014 20:25:05 -0800 Victor Kamensky victor.kamen...@linaro.org wrote: Hi Alex, Sorry for the delayed reply, I was focusing on the discussion with Peter. Hope you and other folks may get something out of it :). Please see responses inline. On 22 January 2014 02:52, Alexander Graf ag...@suse.de wrote: On 22.01.2014, at 08:26, Victor Kamensky victor.kamen...@linaro.org wrote: On 21 January 2014 22:41, Alexander Graf ag...@suse.de wrote: Native endian really is just a shortcut for target endian, which is LE for ARM and BE for PPC. There shouldn't be a qemu-system-armeb or qemu-system-ppc64le. I disagree. A fully functional ARM BE system is what we've been working on for the last few months. 'We' is the Linaro Networking Group, Endian subteam, and some other guys in ARM and across the community. Why we do that is a bit beyond this discussion. ARM BE patches for both V7 and V8 are already in the mainline kernel. But the ARM BE KVM host is broken now. It is a known deficiency that I am trying to fix. Please look at [1]. Patches for V7 BE KVM were proposed and are currently under active discussion. Currently I work on ARM V8 BE KVM changes. So native endian on ARM is the value of the CPSR register E bit. If it is off, native endian is LE; if it is on, it is BE. Once and if we agree on the ARM BE KVM host changes, the next step would be patches in qemu, one of which introduces qemu-system-armeb. Please see [2]. I think we're facing an ideology conflict here. Yes, there should be a qemu-system-arm that is BE capable. Maybe it is not an ideology conflict but rather a terminology clarity issue :). I am not sure what you mean by a qemu-system-arm that is BE capable. In the qemu build system there is just the target name 'arm', which is an ARM V7 cpu in LE mode, and an 'armeb' target, which is an ARM V7 cpu in BE mode. That is true for a lot of open source packages. You could check the [1] patch that introduces the armeb target into qemu.
A build for the arm target produces a qemu-system-arm executable that is marked 'ELF 32-bit LSB executable' and can run on a traditional LE ARM Linux. A build for the armeb target produces a qemu-system-armeb executable that is marked 'ELF 32-bit MSB executable' and can run on a BE ARM Linux. armeb is nothing special here, just a build option for qemu that should run on BE ARM Linux. Hmmm... it looks like there is a confusion about the qemu command naming. The -target suffix in qemu-system-target has nothing to do with the ELF information of the command itself. [greg@bahia ~]$ file `which qemu-system-arm` /bin/qemu-system-arm: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=0xbcb974847daa8159c17ed74906cd5351387d4097, stripped It is valid to create a new target if it is substantially different from existing ones (ppc64 versus ppc, for example). This is not the case with ARM, since it is the very same CPU that can switch endianness with the 'setend' instruction (which needs to be emulated anyway when running in TCG mode). qemu-system-arm is THE command that should be able to emulate an ARM cpu, whether the guest does 'setend le' or 'setend be'. Both qemu-system-arm and qemu-system-armeb should be BE/LE capable, i.e. either of them along with KVM could run either an LE or a BE guest. MarcZ demonstrated that this is possible. I've tested both LE and BE guests with qemu-system-arm running on traditional LE ARM Linux, effectively repeating Marc's setup but with qemu. And I did test with my patches both BE and LE guests with qemu-system-armeb running on BE ARM Linux. There should also be a qemu-system-ppc64 that is LE capable. But there is no point in changing the default endianness for the virtual CPUs that we plug in there. Both CPUs are perfectly capable of running in LE or BE mode; the question is just what we declare the default. I am not sure what you mean by default?
Is it the initial setting of the CPSR E bit and the 'cp15 c1, c0, 0' EE bit? Yes, the way it is currently implemented by the committed qemu-system-arm, and the proposed qemu-system-armeb patches, they are both off. I.e. even qemu-system-armeb starts running the vcpu in LE mode, for a reason very similar to the one described in your next paragraph: qemu-system-armeb has a tiny bootloader that starts in LE mode and jumps to the kernel; the kernel switches the cpu to run in BE mode with 'setend be', and the EE bit is set just before the mmu is enabled. Think about the PPC bootstrap. We start off with a BE firmware, then boot into the Linux kernel, which calls a hypercall to set the LE bit on every interrupt. We have a very similar situation with BE ARM Linux. When we run ARM BE Linux we start with a bootloader, which is LE, and then the CPU issues 'setend be' very soon as it starts executing kernel code; all secondary CPUs issue 'setend be' when they come out of the reset pen or bootmonitor sleep. But there's no reason this little endian kernel couldn't
Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs
On Wed, Jan 22, 2014 at 02:27:29PM +0530, Anup Patel wrote: [...] Thanks for the info on QEMU-side handling of MMIO data. I was not aware that we would only have target endian = LE for ARM/ARM64 in QEMU. I think Marc Z had mentioned a similar thing about MMIO in our previous discussions on his patches. (Please refer, http://www.spinics.net/lists/arm-kernel/msg283313.html) This clearly means MMIO data passed to user space (QEMU) has to be of host endianness so that QEMU can take care of the bus-device endian map. Hmmm, I'm not sure what you mean exactly, but the fact remains that we simply need to decide on a layout of mmio.data that (1) doesn't break existing userspace and (2) is clearly defined for mixed-mmio use cases. The current vcpu_data_guest_to_host() and vcpu_data_host_to_guest() do not perform endianness conversion of MMIO data to LE when we are running an LE guest on a BE host, so we do need Victor's patch for fixing vcpu_data_guest_to_host() and vcpu_data_host_to_guest(). (Already reported a long time back by me, http://www.spinics.net/lists/arm-kernel/msg283308.html) The problem is that we cannot decide on how the patch should look before the endianness of mmio.data is decided. Alex, Peter, and I agree that it should be that of the host endianness and represent what the architecture in question would put on the memory bus. In the case of ARM, it's the register value when the VCPU E-bit is clear, and it's the byteswapped register value when the VCPU E-bit is set. Therefore, the patch needs to do an unconditional byteswap when the VCPU E-bit is set, instead of the beXX_to_cpu and cpu_to_beXX. I'm sending out a patch to clarify the KVM API so we can move on. -Christoffer
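The rule Christoffer states - byteswap iff the VCPU E-bit is set, regardless of host endianness - can be sketched for a 4-byte access. This is an illustrative model, not the actual arch/arm KVM code; the function names are hypothetical.

```c
#include <stdint.h>

/* Plain 32-bit byte reversal. */
static uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Value read from mmio.data, to be loaded into the guest register on
 * return from a read: swapped only when the VCPU E-bit is set. Host
 * endianness never enters the decision. Assumes len == 4 for brevity. */
static uint32_t data_host_to_guest(uint32_t host_val, int vcpu_e_bit)
{
    return vcpu_e_bit ? bswap32(host_val) : host_val;
}
```

This replaces the conditional beXX_to_cpu/cpu_to_beXX pattern the thread criticizes: the swap depends on one guest-visible bit, not on a host/guest endianness comparison.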
Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs
Hi Alex, On Wed, Jan 22, 2014 at 12:11 PM, Alexander Graf ag...@suse.de wrote: Am 22.01.2014 um 07:31 schrieb Anup Patel a...@brainfault.org: On Wed, Jan 22, 2014 at 11:09 AM, Victor Kamensky victor.kamen...@linaro.org wrote: Hi Guys, Christoffer and I had a bit of a heated chat :) on this subject last night. Christoffer, I really appreciate your time! We did not really reach agreement during the chat and Christoffer asked me to follow up on this thread. Here it goes. Sorry, it is a very long email. I don't believe we can assign any endianness to the mmio.data[] byte array. I believe mmio.data[] and mmio.len act just like memcpy and that is all. As memcpy does not imply any endianness of the underlying data, mmio.data[] should not either. Here is my definition: mmio.data[] is an array of bytes that contains memory bytes in such a form that, for the read case, if those bytes were placed in guest memory and the guest executed the same read access instruction with an address pointing to this memory, the result would be the same as for a real h/w device memory access. The rest of the KVM host and hypervisor code should take care of the mmio.data[] memory so it is delivered to vcpu registers, and restored by the hypervisor part, in such a way that the guest CPU register value is the same as it would be for a real, non-emulated h/w read access (that is the emulation part). The same goes for write access: if the guest writes into memory and those bytes are just copied to the emulated h/w register, it has the same effect as a real mapped h/w register write. In shorter form, i.e. for a len=4 access: the endianness of the integer at the mmio.data[0] address should match the endianness of the emulated h/w device behind the phys_addr address, regardless of the endianness of the emulator, KVM host, hypervisor, and guest. Examples that illustrate my definition -- 1) LE guest (E bit is off in ARM speak) reads an integer (4 bytes) from a mapped h/w LE device register - mmio.data[3] contains the MSB, mmio.data[0] contains the LSB. 
2) BE guest (E bit is on in ARM speak) reads an integer from a mapped h/w LE device register - mmio.data[3] contains the MSB, mmio.data[0] contains the LSB. Note that if the mmio.data[0] memory were placed in the guest address space and the instruction restarted with the new address, it would meet the BE guest's expectations - the guest knows that it reads LE h/w, so it will byteswap the register before processing it further. This is the BE guest ARM case (regardless of what the KVM host endianness is). 3) BE guest reads an integer from a mapped h/w BE device register - mmio.data[0] contains the MSB, mmio.data[3] contains the LSB. Note that if the mmio.data[0] memory were placed in the guest address space and the instruction restarted with the new address, it would meet the BE guest's expectation - the guest knows that it reads BE h/w, so it will proceed further without any other work. I guess this is the BE ppc case. Arguments in favor of memcpy semantics of mmio.data[] -- x) What are the possible values of 'len'? Previous discussions imply that it is always a power of 2. Why is that? Maybe there will be a CPU that needs to do a 5-byte or 6-byte mmio access. How do you assign endianness to such a case? A 'len' of 5 or 6, or anything else, works fine with memcpy semantics. I admit it is a hypothetical case, but IMHO it tests how clean the ABI definition is. x) A byte array does not have endianness because it does not have any structure. If one wanted to imply structure, why is mmio not defined in such a way that the structure is reflected in the mmio definition? Something like:

  /* KVM_EXIT_MMIO */
  struct {
          __u64 phys_addr;
          union {
                  __u8  byte;
                  __u16 hword;
                  __u32 word;
                  __u64 dword;
          } data;
          __u32 len;
          __u8  is_write;
  } mmio;

where len really serves as the union discriminator and the only allowed len values are 1, 2, 4, 8. In this case, I agree, the endianness of the integer types should be defined. I believe the use of a byte array strongly implies that the original intent was to have the semantics of a byte-stream copy, just like memcpy. 
x) Note there is nothing wrong with a user-kernel ABI using just a byte stream as a parameter. There are already precedents, like the 'read' and 'write' system calls :). x) Consider the case when KVM works with emulated memory-mapped h/w devices where some devices operate in LE mode and others operate in BE mode. Which one a device is, is defined by the semantics of the real h/w device, and should be emulated by the emulator and KVM given all other context. As far as the mmio.data[] array is concerned, if the same integer value is read from the registers of these two devices, mmio.data[] memory should contain the integer in opposite endianness for the two cases, i.e. the MSB is data[0] in one case and data[3] in the other.
Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs
On 22.01.2014, at 08:26, Victor Kamensky victor.kamen...@linaro.org wrote: On 21 January 2014 22:41, Alexander Graf ag...@suse.de wrote: Native endian really is just a shortcut for target endian, which is LE for ARM and BE for PPC. There shouldn't be a qemu-system-armeb or qemu-system-ppc64le. I disagree. A fully functional ARM BE system is what we've been working on for the last few months. 'We' is the Linaro Networking Group, Endian subteam and some other guys in ARM and across the community. Why we do that is a bit beyond this discussion. ARM BE patches for both V7 and V8 are already in the mainline kernel. But the ARM BE KVM host is broken now. It is a known deficiency that I am trying to fix. Please look at [1]. Patches for V7 BE KVM were proposed and are currently under active discussion. Currently I work on ARM V8 BE KVM changes. So native endian on ARM is the value of the CPSR register E bit. If it is off, native endian is LE; if it is on, it is BE. Once and if we agree on the ARM BE KVM host changes, the next step would be patches in qemu, one of which introduces qemu-system-armeb. Please see [2]. I think we're facing an ideology conflict here. Yes, there should be a qemu-system-arm that is BE capable. There should also be a qemu-system-ppc64 that is LE capable. But there is no point in changing the default endianness for the virtual CPUs that we plug in there. Both CPUs are perfectly capable of running in LE or BE mode; the question is just what we declare the default to be. Think about the PPC bootstrap. We start off with BE firmware, then boot into the Linux kernel, which calls a hypercall to set the LE bit on every interrupt. But there's no reason this little endian kernel couldn't theoretically have big endian user space running with access to emulated device registers. As Peter already pointed out, the actual breakage behind this is that we have a default endianness at all. But that's a very difficult thing to resolve and I don't think it should be our primary goal. 
Just live with the fact that we declare ARM little endian in QEMU and swap things accordingly - then everyone's happy. This really only ever becomes a problem if you have devices that are aware of the CPU's endian mode. The only one on PPC that I'm aware of that falls into this category is virtio, and there are patches pending to solve that. I don't know if there are any QEMU-emulated devices outside of virtio with this issue on ARM, but you'll have to make the emulation code for those look at the CPU state then. QEMU emulates everything that comes after the CPU, so imagine the ioctl struct as a bus package. Your bus doesn't care what endianness the CPU is in - it just gets data from the CPU. I am not sure that I follow the above. Suppose I have

  mov r1, #1
  str r1, [r0]

where r0 is a device address. Now depending on the CPSR E bit value, the device address will receive 1 as an integer either in LE order or in BE order. That is how an ARM v7 CPU works, regardless of whether it is emulated or not. So if the E bit is off (LE case), after str is executed:

  byte at r0 address gets 1
  byte at r0 + 1 address gets 0
  byte at r0 + 2 address gets 0
  byte at r0 + 3 address gets 0

If the E bit is on (BE case), after str is executed:

  byte at r0 address gets 0
  byte at r0 + 1 address gets 0
  byte at r0 + 2 address gets 0
  byte at r0 + 3 address gets 1

My point is that mmio.data[] just carries the bytes for phys_addr: mmio.data[0] would be the value of the byte at phys_addr, mmio.data[1] the value of the byte at phys_addr + 1, and so on. What we get is an instruction that traps because it wants to write r1 (which has value=1) into address x. So at that point we get the register value. Then we need to take a look at the E bit to see whether the write was supposed to be in non-host endianness, because we need to emulate exactly the LE/BE difference you're indicating above. The way we implement this on PPC is that we simply byteswap the register value when guest_endian != host_endian. 
With this in place, QEMU can just memcpy() the value into a local register and feed it into its emulation code, which expects a register value as if the CPU was running in native endianness as a parameter - with native meaning little endian for qemu-system-arm. Device emulation code doesn't know what to do with a byte array. Take a look at QEMU's MMIO handler:

  case KVM_EXIT_MMIO:
      DPRINTF("handle_mmio\n");
      cpu_physical_memory_rw(run->mmio.phys_addr,
                             run->mmio.data,
                             run->mmio.len,
                             run->mmio.is_write);
      ret = 0;
      break;

which translates to

  switch (l) {
  case 8:
      /* 64 bit write access */
      val = ldq_p(buf);
      error |= io_mem_write(mr, addr1, val, 8);
      break;
  case 4:
Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs
Hi Alex, Sorry for the delayed reply; I was focusing on the discussion with Peter. Hope you and other folks may get something out of it :). Please see responses inline. On 22 January 2014 02:52, Alexander Graf ag...@suse.de wrote: On 22.01.2014, at 08:26, Victor Kamensky victor.kamen...@linaro.org wrote: On 21 January 2014 22:41, Alexander Graf ag...@suse.de wrote: Native endian really is just a shortcut for target endian, which is LE for ARM and BE for PPC. There shouldn't be a qemu-system-armeb or qemu-system-ppc64le. I disagree. A fully functional ARM BE system is what we've been working on for the last few months. 'We' is the Linaro Networking Group, Endian subteam and some other guys in ARM and across the community. Why we do that is a bit beyond this discussion. ARM BE patches for both V7 and V8 are already in the mainline kernel. But the ARM BE KVM host is broken now. It is a known deficiency that I am trying to fix. Please look at [1]. Patches for V7 BE KVM were proposed and are currently under active discussion. Currently I work on ARM V8 BE KVM changes. So native endian on ARM is the value of the CPSR register E bit. If it is off, native endian is LE; if it is on, it is BE. Once and if we agree on the ARM BE KVM host changes, the next step would be patches in qemu, one of which introduces qemu-system-armeb. Please see [2]. I think we're facing an ideology conflict here. Yes, there should be a qemu-system-arm that is BE capable. Maybe it is not an ideology conflict but rather a terminology clarity issue :). I am not sure what you mean by a qemu-system-arm that is BE capable. In the qemu build system there is just the target name 'arm', which is an ARM V7 cpu in LE mode, and the 'armeb' target, which is an ARM V7 cpu in BE mode. That is true for a lot of open source packages. You could check [1], the patch that introduces the armeb target into qemu. A build for the arm target produces a qemu-system-arm executable that is marked 'ELF 32-bit LSB executable' and can run on traditional LE ARM Linux. 
A build for the armeb target produces a qemu-system-armeb executable that is marked 'ELF 32-bit MSB executable' and can run on BE ARM Linux. armeb is nothing special here, just a build option for qemu so that it runs on BE ARM Linux. Both qemu-system-arm and qemu-system-armeb should be BE/LE capable. I.e. either of them, along with KVM, could run either an LE or a BE guest. MarcZ demonstrated that this is possible. I've tested both LE and BE guests with qemu-system-arm running on traditional LE ARM Linux, effectively repeating Marc's setup but with qemu. And I did test with my patches both BE and LE guests with qemu-system-armeb running on BE ARM Linux. There should also be a qemu-system-ppc64 that is LE capable. But there is no point in changing the default endianness for the virtual CPUs that we plug in there. Both CPUs are perfectly capable of running in LE or BE mode; the question is just what we declare the default to be. I am not sure what you mean by default. Is it the initial setting of the CPSR E bit and the 'cp15 c1, c0, 0' EE bit? Yes, the way it is currently implemented by the committed qemu-system-arm, and the proposed qemu-system-armeb patches, they are both off. I.e. even qemu-system-armeb starts running the vcpu in LE mode, for a reason very similar to the one described in your next paragraph: qemu-system-armeb has a tiny bootloader that starts in LE mode and jumps to the kernel; the kernel switches the cpu to run in BE mode ('setend be'), and the EE bit is set just before the mmu is enabled. Think about the PPC bootstrap. We start off with BE firmware, then boot into the Linux kernel, which calls a hypercall to set the LE bit on every interrupt. We have a very similar situation with BE ARM Linux. When we run ARM BE Linux we start with a bootloader which is LE, and then the CPU issues 'setend be' very soon after it starts executing kernel code; all secondary CPUs issue 'setend be' when they come out of the reset pen or bootmonitor sleep. 
But there's no reason this little endian kernel couldn't theoretically have big endian user space running with access to emulated device registers. I don't want to go there; it is very, very messy ... -- Just a side note: -- Interestingly, half a year before I joined Linaro, a colleague and I at Cisco implemented a kernel patch that allowed BE user-space processes to run as a sort of separate personality on top of an LE ARM kernel ... treated as a kind of multi-ABI system. Effectively we had to do byteswaps on all non-trivial system calls and ioctls inside the kernel. We converted around 30 system calls and around 10 ioctls. Our target process used just those and it was working, but the patch was very intrusive and unnatural. I think in Linaro there was some public version of my presentation circulated that explained all this mess. I don't want to seriously consider it. The only robust mixed mode, as MarcZ demonstrated, can be done on VM boundaries only. I.e. an LE host can run a BE guest fine, and a BE host can run an LE guest fine. Everything else would be a huge mess. If we want to