Re: [PATCH] x86: fix q35 kernel measurements broken due to rng seeding
On February 2, 2023 7:17:01 AM PST, James Bottomley wrote: >On Thu, 2023-02-02 at 07:03 -0800, H. Peter Anvin wrote: >[...] >> NAK. We need to fix the actual problem of the kernel stomping on >> memory it shouldn't, not paper around it. > >This is a first boot situation, not kexec (I just updated kexec because >it should use any new mechanism we propose). Unlike kexec, for first >boot we're very constrained by the amount of extra space QEMU has to do >this. The boot_params are the first page of the kernel load, but the >kernel proper begins directly after it, so we can't expand it. The two >schemes tried: loading after the kernel and loading after the command >line both tamper with integrity protected files, so we shouldn't use >this mechanism. This is the essence of the problem: If we add this >area at boot, it has to go in an existing memory location; we can't >steal random guest areas. All current config parameters are passed >through as fw_config files, so we can only use that mechanism *if* we >know where the area ends up in the loaded kernel *and* the file isn't >integrity protected (this latter is expanding over time). > >If we could wind back time, I'd have added the 32 byte random seed to >boot_params properly not coded it as a setup_data addition, but now >we're stuck with coping with existing behaviour, which is why I thought >the retro fit to boot_params would be the better path forward, but if >you have any alternatives, I'm sure we could look at them. > >James > One option that you do have that should be backwards compatible, even, is to mark that memory as reserved in the memory map, basically doing the job that the kernel decompressor should have done in the first place. The downside is that existing kennels will never reclaim that memory, but since it is such a small amount, it shouldn't really matter. We could even reserve a memory type code for it; that way a newer kernel could know to reclaim that memory at the very end of the boot process, when it would deallocate other setup_data entries. Existing kernels will, for obvious reasons, treat unknown memory types as equivalent to type 2 – permanent keep out.
Re: [PATCH] x86: fix q35 kernel measurements broken due to rng seeding
On February 2, 2023 7:17:01 AM PST, James Bottomley wrote: >On Thu, 2023-02-02 at 07:03 -0800, H. Peter Anvin wrote: >[...] >> NAK. We need to fix the actual problem of the kernel stomping on >> memory it shouldn't, not paper around it. > >This is a first boot situation, not kexec (I just updated kexec because >it should use any new mechanism we propose). Unlike kexec, for first >boot we're very constrained by the amount of extra space QEMU has to do >this. The boot_params are the first page of the kernel load, but the >kernel proper begins directly after it, so we can't expand it. The two >schemes tried: loading after the kernel and loading after the command >line both tamper with integrity protected files, so we shouldn't use >this mechanism. This is the essence of the problem: If we add this >area at boot, it has to go in an existing memory location; we can't >steal random guest areas. All current config parameters are passed >through as fw_config files, so we can only use that mechanism *if* we >know where the area ends up in the loaded kernel *and* the file isn't >integrity protected (this latter is expanding over time). > >If we could wind back time, I'd have added the 32 byte random seed to >boot_params properly not coded it as a setup_data addition, but now >we're stuck with coping with existing behaviour, which is why I thought >the retro fit to boot_params would be the better path forward, but if >you have any alternatives, I'm sure we could look at them. > >James > Somewhat aside... There are other issues with passing the entropy seed in memory, of course; in order to avoid accidental or malicious reuse it would be far better if the entropy was actively pulled at use time via virtual I/O, a hypercall, or RDRAND/RDSEED emulation, but I do realize that that is not always an option.
Re: [PATCH] x86: fix q35 kernel measurements broken due to rng seeding
On February 2, 2023 7:17:01 AM PST, James Bottomley wrote: >On Thu, 2023-02-02 at 07:03 -0800, H. Peter Anvin wrote: >[...] >> NAK. We need to fix the actual problem of the kernel stomping on >> memory it shouldn't, not paper around it. > >This is a first boot situation, not kexec (I just updated kexec because >it should use any new mechanism we propose). Unlike kexec, for first >boot we're very constrained by the amount of extra space QEMU has to do >this. The boot_params are the first page of the kernel load, but the >kernel proper begins directly after it, so we can't expand it. The two >schemes tried: loading after the kernel and loading after the command >line both tamper with integrity protected files, so we shouldn't use >this mechanism. This is the essence of the problem: If we add this >area at boot, it has to go in an existing memory location; we can't >steal random guest areas. All current config parameters are passed >through as fw_config files, so we can only use that mechanism *if* we >know where the area ends up in the loaded kernel *and* the file isn't >integrity protected (this latter is expanding over time). > >If we could wind back time, I'd have added the 32 byte random seed to >boot_params properly not coded it as a setup_data addition, but now >we're stuck with coping with existing behaviour, which is why I thought >the retro fit to boot_params would be the better path forward, but if >you have any alternatives, I'm sure we could look at them. > >James > The right thing to do is to fix the kernel so that it doesn't stomp on this memory, just as it cannot stomp on boot_params, initrd, or the command line. The kernel boot protocol defines a keep-out area, but physical kaslr violates it and so the kaslr code in the decompressor is responsible for keeping track of the keepout areas, and apparently noone every did. Adding it to boot_params and bumping the version number is a hack that doesn't solve the backwards compatibility problem, so we should just fix the bug instead. Adding it to boot_params and adding a setup_data pointer MAY be backwards compatible, but it also enables an absolutely catastrophic failure mode: an unaware loader may end up relocating boot_params without knowing that that data has a secondary pointer into it, with the result that we now have a bogus pointer in a linked list. Not good. Fixing the bug properly is also the only way forward for future setup_data users, and we are running low on space in the fixed-size structures.
Re: [PATCH] x86: fix q35 kernel measurements broken due to rng seeding
On February 2, 2023 6:38:12 AM PST, James Bottomley wrote: >On Wed, 2023-02-01 at 15:48 -0500, Jason A. Donenfeld wrote: >[...] >> But it sounds like you might now have a concrete suggestion on >> something even better. I'm CCing hpa, as this is his wheelhouse, and >> maybe you two can divise the next step while I'm away. Maybe the pad9 >> thing you mentioned is the super nice solution we've been searching >> for this whole time. When I'm home in 10 days and have internet >> again, I'll take a look at where thing's are out and try to figure >> out how I can be productive again with it. > >OK, so just FYI HPA, this is the patch I'm thinking of sending to >linux-kernel to reserve space in struct boot_params for this. If you >could take a look and advise on the location before I send the final >patch, I'd be grateful. I took space in _pad9 because that's the >standard method (add on to end), but it does strike me we could also >use all of _pad8 for the (the addition is only 48 bytes) or even _pad3 >+ hd0_info + hd1_info. > >James > >--- > >diff --git a/arch/x86/boot/header.S b/arch/x86/boot/header.S >index 9338c68e7413..0120ab77dac9 100644 >--- a/arch/x86/boot/header.S >+++ b/arch/x86/boot/header.S >@@ -308,7 +308,7 @@ _start: > # Part 2 of the header, from the old setup.S > > .ascii "HdrS" # header signature >- .word 0x020f # header version number (>= 0x0105) >+ .word 0x0210 # header version number (>= 0x0105) > # or else old loadlin-1.5 will fail) > .globl realmode_swtch > realmode_swtch: .word 0, 0# default_switch, SETUPSEG >diff --git a/arch/x86/include/uapi/asm/bootparam.h >b/arch/x86/include/uapi/asm/bootparam.h >index 01d19fc22346..c614ff0755f2 100644 >--- a/arch/x86/include/uapi/asm/bootparam.h >+++ b/arch/x86/include/uapi/asm/bootparam.h >@@ -181,6 +181,19 @@ struct ima_setup_data { > __u64 size; > } __attribute__((packed)); > >+/* >+ * Define a boot_param area for the RNG seed which can be used via the >+ * setup_data mechanism (so must have a setup_data header) but which >+ * is embedded in boot_params because qemu has been unable to find >+ * a safe data space for it. The value RNG_SEED_LENGTH must not >+ * change (pad length dependent on it) and must match the value in QEMU >+ */ >+#define RNG_SEED_LENGTH 32 >+struct random_seed_data { >+ struct setup_data s; >+ __u8data[RNG_SEED_LENGTH]; >+} __attribute__((packed)); >+ > /* The so-called "zeropage" */ > struct boot_params { > struct screen_info screen_info; /* 0x000 */ >@@ -228,7 +241,8 @@ struct boot_params { > struct boot_e820_entry e820_table[E820_MAX_ENTRIES_ZEROPAGE]; /* 0x2d0 > */ > __u8 _pad8[48];/* 0xcd0 */ > struct edd_info eddbuf[EDDMAXNR]; /* 0xd00 */ >- __u8 _pad9[276]; /* 0xeec */ >+ struct random_seed_data random_seed;/* 0xeec */ >+ __u8 _pad9[228]; /* 0xf1c */ > } __attribute__((packed)); > > /** >diff --git a/arch/x86/kernel/kexec-bzimage64.c >b/arch/x86/kernel/kexec-bzimage64.c >index 6b58610a1552..fb719682579d 100644 >--- a/arch/x86/kernel/kexec-bzimage64.c >+++ b/arch/x86/kernel/kexec-bzimage64.c >@@ -110,13 +110,10 @@ static int setup_e820_entries(struct boot_params *params) > return 0; > } > >-enum { RNG_SEED_LENGTH = 32 }; >- > static void >-setup_rng_seed(struct boot_params *params, unsigned long params_load_addr, >- unsigned int rng_seed_setup_data_offset) >+setup_rng_seed(struct boot_params *params, unsigned long params_load_addr) > { >- struct setup_data *sd = (void *)params + rng_seed_setup_data_offset; >+ struct setup_data *sd = >random_seed.s; > unsigned long setup_data_phys; > > if (!rng_is_initialized()) >@@ -125,7 +122,8 @@ setup_rng_seed(struct boot_params *params, unsigned long >params_load_addr, > sd->type = SETUP_RNG_SEED; > sd->len = RNG_SEED_LENGTH; > get_random_bytes(sd->data, RNG_SEED_LENGTH); >- setup_data_phys = params_load_addr + rng_seed_setup_data_offset; >+ setup_data_phys = params_load_addr + offsetof(struct boot_params, >+random_seed); > sd->next = params->hdr.setup_data; > params->hdr.setup_data = setup_data_phys; > } >@@ -306,7 +304,7 @@ setup_boot_parameters(struct kimage *image, struct >boot_params *params, > } > > /* Setup RNG seed */ >- setup_rng_seed(params, params_load_addr, setup_data_offset); >+ setup_rng_seed(params, params_load_addr); > > /* Setup EDD info */ > memcpy(params->eddbuf, boot_params.eddbuf, > NAK. We need to fix the actual problem of the kernel stomping on memory it shouldn't, not paper around it.
Re: [PULL 10/56] x86: don't let decompressed kernel image clobber setup_data
On January 31, 2023 1:22:43 PM PST, "Jason A. Donenfeld" wrote: >On Tue, Jan 31, 2023, 15:55 H. Peter Anvin wrote: > >> On January 30, 2023 12:19:14 PM PST, "Michael S. Tsirkin" >> wrote: >> >From: "Jason A. Donenfeld" >> > >> >The setup_data links are appended to the compressed kernel image. Since >> >the kernel image is typically loaded at 0x10, setup_data lives at >> >`0x10 + compressed_size`, which does not get relocated during the >> >kernel's boot process. >> > >> >The kernel typically decompresses the image starting at address >> >0x100 (note: there's one more zero there than the compressed image >> >above). This usually is fine for most kernels. >> > >> >However, if the compressed image is actually quite large, then >> >setup_data will live at a `0x10 + compressed_size` that extends into >> >the decompressed zone at 0x100. In other words, if compressed_size >> >is larger than `0x100 - 0x10`, then the decompression step will >> >clobber setup_data, resulting in crashes. >> > >> >Visually, what happens now is that QEMU appends setup_data to the kernel >> >image: >> > >> > kernel imagesetup_data >> > |--||| >> >0x10 0x10+l1 0x10+l1+l2 >> > >> >The problem is that this decompresses to 0x100 (one more zero). So >> >if l1 is > (0x100-0x10), then this winds up looking like: >> > >> > kernel imagesetup_data >> > |--||| >> >0x10 0x10+l1 0x10+l1+l2 >> > >> > d e c o m p r e s s e d k e r n e l >> > >> |-| >> >0x100 >> 0x100+l3 >> > >> >The decompressed kernel seemingly overwriting the compressed kernel >> >image isn't a problem, because that gets relocated to a higher address >> >early on in the boot process, at the end of startup_64. setup_data, >> >however, stays in the same place, since those links are self referential >> >and nothing fixes them up. So the decompressed kernel clobbers it. >> > >> >Fix this by appending setup_data to the cmdline blob rather than the >> >kernel image blob, which remains at a lower address that won't get >> >clobbered. >> > >> >This could have been done by overwriting the initrd blob instead, but >> >that poses big difficulties, such as no longer being able to use memory >> >mapped files for initrd, hurting performance, and, more importantly, the >> >initrd address calculation is hard coded in qboot, and it always grows >> >down rather than up, which means lots of brittle semantics would have to >> >be changed around, incurring more complexity. In contrast, using cmdline >> >is simple and doesn't interfere with anything. >> > >> >The microvm machine has a gross hack where it fiddles with fw_cfg data >> >after the fact. So this hack is updated to account for this appending, >> >by reserving some bytes. >> > >> >Fixup-by: Michael S. Tsirkin >> >Cc: x...@kernel.org >> >Cc: Philippe Mathieu-Daudé >> >Cc: H. Peter Anvin >> >Cc: Borislav Petkov >> >Cc: Eric Biggers >> >Signed-off-by: Jason A. Donenfeld >> >Message-Id: <20221230220725.618763-1-ja...@zx2c4.com> >> >Message-ID: <20230128061015-mutt-send-email-...@kernel.org> >> >Reviewed-by: Michael S. Tsirkin >> >Signed-off-by: Michael S. Tsirkin >> >Tested-by: Eric Biggers >> >Tested-by: Mathias Krause >> >--- >> > include/hw/i386/microvm.h | 5 ++-- >> > include/hw/nvram/fw_cfg.h | 9 +++ >> > hw/i386/microvm.c | 15 +++ >> > hw/i386/x86.c | 52 +-- >> > hw/nvram/fw_cfg.c | 9 +++ >> > 5 files changed, 59 insertions(+), 31 deletions(-) >> > >> >diff --git a/include/hw/i386/microvm.h b/include/hw/i386/microvm.h >> >index fad97a891d..e8af61f194 100644 >> >--- a/include/hw/i386/microvm.h >> >+++ b/include/hw/i386/microvm.h >> >@@ -50,8 +50,9 @@ >> > */ >> > >> > /* Platform virtio definitions */ >> >-#define VIRTIO_MMIO_BASE 0xfeb0
Re: [PULL 10/56] x86: don't let decompressed kernel image clobber setup_data
On January 30, 2023 12:19:14 PM PST, "Michael S. Tsirkin" wrote: >From: "Jason A. Donenfeld" > >The setup_data links are appended to the compressed kernel image. Since >the kernel image is typically loaded at 0x10, setup_data lives at >`0x10 + compressed_size`, which does not get relocated during the >kernel's boot process. > >The kernel typically decompresses the image starting at address >0x100 (note: there's one more zero there than the compressed image >above). This usually is fine for most kernels. > >However, if the compressed image is actually quite large, then >setup_data will live at a `0x10 + compressed_size` that extends into >the decompressed zone at 0x100. In other words, if compressed_size >is larger than `0x100 - 0x10`, then the decompression step will >clobber setup_data, resulting in crashes. > >Visually, what happens now is that QEMU appends setup_data to the kernel >image: > > kernel imagesetup_data > |--||| >0x10 0x10+l1 0x10+l1+l2 > >The problem is that this decompresses to 0x100 (one more zero). So >if l1 is > (0x100-0x10), then this winds up looking like: > > kernel imagesetup_data > |--||| >0x10 0x10+l1 0x10+l1+l2 > > d e c o m p r e s s e d k e r n e l > > |-| >0x100 > 0x100+l3 > >The decompressed kernel seemingly overwriting the compressed kernel >image isn't a problem, because that gets relocated to a higher address >early on in the boot process, at the end of startup_64. setup_data, >however, stays in the same place, since those links are self referential >and nothing fixes them up. So the decompressed kernel clobbers it. > >Fix this by appending setup_data to the cmdline blob rather than the >kernel image blob, which remains at a lower address that won't get >clobbered. > >This could have been done by overwriting the initrd blob instead, but >that poses big difficulties, such as no longer being able to use memory >mapped files for initrd, hurting performance, and, more importantly, the >initrd address calculation is hard coded in qboot, and it always grows >down rather than up, which means lots of brittle semantics would have to >be changed around, incurring more complexity. In contrast, using cmdline >is simple and doesn't interfere with anything. > >The microvm machine has a gross hack where it fiddles with fw_cfg data >after the fact. So this hack is updated to account for this appending, >by reserving some bytes. > >Fixup-by: Michael S. Tsirkin >Cc: x...@kernel.org >Cc: Philippe Mathieu-Daudé >Cc: H. Peter Anvin >Cc: Borislav Petkov >Cc: Eric Biggers >Signed-off-by: Jason A. Donenfeld >Message-Id: <20221230220725.618763-1-ja...@zx2c4.com> >Message-ID: <20230128061015-mutt-send-email-...@kernel.org> >Reviewed-by: Michael S. Tsirkin >Signed-off-by: Michael S. Tsirkin >Tested-by: Eric Biggers >Tested-by: Mathias Krause >--- > include/hw/i386/microvm.h | 5 ++-- > include/hw/nvram/fw_cfg.h | 9 +++ > hw/i386/microvm.c | 15 +++ > hw/i386/x86.c | 52 +-- > hw/nvram/fw_cfg.c | 9 +++ > 5 files changed, 59 insertions(+), 31 deletions(-) > >diff --git a/include/hw/i386/microvm.h b/include/hw/i386/microvm.h >index fad97a891d..e8af61f194 100644 >--- a/include/hw/i386/microvm.h >+++ b/include/hw/i386/microvm.h >@@ -50,8 +50,9 @@ > */ > > /* Platform virtio definitions */ >-#define VIRTIO_MMIO_BASE 0xfeb0 >-#define VIRTIO_CMDLINE_MAXLEN 64 >+#define VIRTIO_MMIO_BASE0xfeb0 >+#define VIRTIO_CMDLINE_MAXLEN 64 >+#define VIRTIO_CMDLINE_TOTAL_MAX_LEN((VIRTIO_CMDLINE_MAXLEN + 1) * 16) > > #define GED_MMIO_BASE 0xfea0 > #define GED_MMIO_BASE_MEMHP (GED_MMIO_BASE + 0x100) >diff --git a/include/hw/nvram/fw_cfg.h b/include/hw/nvram/fw_cfg.h >index 2e503904dc..990dcdbb2e 100644 >--- a/include/hw/nvram/fw_cfg.h >+++ b/include/hw/nvram/fw_cfg.h >@@ -139,6 +139,15 @@ void fw_cfg_add_bytes_callback(FWCfgState *s, uint16_t >key, >void *data, size_t len, >bool read_only); > >+/** >+ * fw_cfg_read_bytes_ptr: >+ * @s: fw_cfg device being modified >+ * @key: selector key value for new fw_cfg item >+ * >+ * Reads
Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data
On 12/31/22 20:55, Mika Penttilä wrote: If decompression does clobber the data, then we *also* need to figure out why that is. There are basically three possibilities: 1. If physical KASLR is NOT used: a. The boot loader doesn't honor the kernel safe area properly; b. Somewhere in the process a bug in the calculation of the kernel safe area has crept in. 2. If physical KASLR IS used: The decompressor doesn't correctly keep track of nor relocate all the keep-out zones before picking a target address. Seems setup_data is not included in those mem_avoid regions. [facepalm] One is a bootloader bug, two is a kernel bugs. My guess is (2) is the culprit, but (1b) should be checked, too. Correction: two are kernel bugs, i.e. (1b) and (2) are both kernel bugs. -hpa
Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data
On 12/31/22 10:22, Jason A. Donenfeld wrote: On Sat, Dec 31, 2022 at 03:24:32PM +0100, Borislav Petkov wrote: On Sat, Dec 31, 2022 at 02:51:28PM +0100, Jason A. Donenfeld wrote: That failure is unrelated to the ident mapping issue Peter and I discussed. The original failure is described in the commit message: decompression clobbers the data, so sd->next points to garbage. Right So with that understanding confirmed, I'm confused at your surprise that hpa's unrelated fix to the different issue didn't fix this issue. If decompression does clobber the data, then we *also* need to figure out why that is. There are basically three possibilities: 1. If physical KASLR is NOT used: a. The boot loader doesn't honor the kernel safe area properly; b. Somewhere in the process a bug in the calculation of the kernel safe area has crept in. 2. If physical KASLR IS used: The decompressor doesn't correctly keep track of nor relocate all the keep-out zones before picking a target address. One is a bootloader bug, two is a kernel bug. My guess is (2) is the culprit, but (1b) should be checked, too. -hpa
Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data
On 12/31/22 19:21, H. Peter Anvin wrote: Alternatively, setup_data could be relocated, the boot param protocol could be bumped, and then QEMU could conditionalized it's use of setup_data based on that protocol version. That'd work, but seems a bit more involved. I think this is at least worth considering because the kernel overwriting setup_data after having been told where that setup_data is located is not really nice. Well not explicitly at least - it gets the pointer to the first element and something needs to traverse that list to know which addresses are live. But still, that info is there so perhaps we need to take setup_data into consideration too before decompressing... It would probably be a good idea to add a "maximum physical address for initrd/setup_data/cmdline" field to struct kernel_info, though. It appears right now that those fields are being identity-mapped in the decompressor, and that means that if 48-bit addressing is used, physical memory may extend past the addressable range. -hpa
Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data
On 12/31/22 11:00, Borislav Petkov wrote: On Sat, Dec 31, 2022 at 07:22:47PM +0100, Jason A. Donenfeld wrote: So with that understanding confirmed, I'm confused at your surprise that hpa's unrelated fix to the different issue didn't fix this issue. No surprise there - I used a qemu variant without your patch to prevent the setup_data clobbering so hpa's fix can't help there. But since the kernel doesn't do this now, and the 62MiB bug also seems to apply to existing kernels, for the purposes of QEMU for now, I think the v3 patch is probably best, since it'll handle existing kernels. Right, we can't fix all those guests which are out there. Alternatively, setup_data could be relocated, the boot param protocol could be bumped, and then QEMU could conditionalized it's use of setup_data based on that protocol version. That'd work, but seems a bit more involved. I think this is at least worth considering because the kernel overwriting setup_data after having been told where that setup_data is located is not really nice. Well not explicitly at least - it gets the pointer to the first element and something needs to traverse that list to know which addresses are live. But still, that info is there so perhaps we need to take setup_data into consideration too before decompressing... As far as the decompression itself goes, it should only a problem if we are using physical KASLR since otherwise the kernel has a guaranteed safe zone already allocated by the boot loader. However, if physical KASLR is in use, then the decompressor needs to know everything there is to know about the memory map. However, there also seems to be some kind of interaction with AMD SEV-SNP. The bug appears to have been introduced by: b57feed2cc2622ae14b2fa62f19e973e5e0a60cf x86/compressed/64: Add identity mappings for setup_data entries https://lore.kernel.org/r/tycpr01mb694815cd815e98945f63c99183...@tycpr01mb6948.jpnprd01.prod.outlook.com ... which was included in version 5.19, so it is relatively recent. For a small amount of setup_data, the solution of just putting it next to the command line makes a lot of sense, and should be safe indefinitely. -hpa
Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data
On 12/30/22 17:06, H. Peter Anvin wrote TThe 62 MB limit mentioned in boot.rst is unrelated, and only applies to very, very old kernels that used INT 15h, AH=88h to probe memory. I am 88% sure this was fixed long before setup_data was created, as it was created originally to carry e820 info for more than 128(!) memory segments. However, as we see here, it is never certain that bugs didn't creep in in the meantime... -hpa
Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data
On 12/30/22 14:10, Jason A. Donenfeld wrote: On Fri, Dec 30, 2022 at 01:58:39PM -0800, H. Peter Anvin wrote: See the other thread fork. They have identified the problem already. Not sure I follow. Is there another thread where somebody worked out why this 62meg limit was happening? Note that I sent v2/v3, to fix the original problem in a different way, and if that looks good to the QEMU maintainers, then we can all be happy with that. But I *haven't* addressed and still don't fully understand why the 62meg limit applied to my v1 in the way it does. Did you find a bug there to fix? If so, please do CC me. Yes, you yourself posted the problem: Then build qemu. Run it with `-kernel bzImage`, based on the kernel built with the .config I attached. You'll see that the CPU triple faults when hitting this line: sd = (struct setup_data *)boot_params->hdr.setup_data; while (sd) { unsigned long sd_addr = (unsigned long)sd; kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd) + sd->len); < sd = (struct setup_data *)sd->next; } , because it dereferences *sd. This does not happen if the decompressed size of the kernel is < 62 megs. So that's the "big and pretty serious" bug that might be worthy of investigation. This needs to be something like: kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd)); kernel_add_identity_map(sd_addr + sizeof(*sd), sd_addr + sizeof(*sd) + sd->len); TThe 62 MB limit mentioned in boot.rst is unrelated, and only applies to very, very old kernels that used INT 15h, AH=88h to probe memory. -hpa
Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data
On December 30, 2022 11:54:11 AM PST, Borislav Petkov wrote: >On Fri, Dec 30, 2022 at 06:07:24PM +0100, Jason A. Donenfeld wrote: >> Look closer at the boot process. The compressed image is initially at >> 0x10, but it gets relocated to a safer area at the end of >> startup_64: > >That is the address we're executing here from, rip here looks like 0x100xxx. > >> /* >> * Copy the compressed kernel to the end of our buffer >> * where decompression in place becomes safe. >> */ >> pushq %rsi >> leaq(_bss-8)(%rip), %rsi >> leaqrva(_bss-8)(%rbx), %rdi > >when you get to here, it looks something like this: > >leaq(_bss-8)(%rip), %rsi # 0x9e7ff8 >leaqrva(_bss-8)(%rbx), %rdi# 0xc6eeff8 > >so the source address is that _bss thing and we copy... > >> movl$(_bss - startup_32), %ecx >> shrl$3, %ecx >> std > >... backwards since DF=1. > >Up to: > ># rsi = 0x8 ># rdi = 0xbe06ff8 > >Ok, so the source address is 0x10. Good. > >> HOWEVER, qemu currently appends setup_data to the end of the >> compressed kernel image, > >Yeah, you mean the kernel which starts executing at 0x10, i.e., that part >which is compressed/head_64.S and which does the above and the relocation etc. > >> and this part isn't moved, and setup_data links aren't walked/relocated. So >> that means the original address remains, of 0x10. > >See above: when it starts copying the kernel image backwards to a higher >address, that last byte is at 0x9e7ff8 so I'm guessing qemu has put setup_data >*after* that address. And that doesn't get copied ofc. > >So far, so good. > >Now later, we extract the compressed kernel created with the mkpiggy magic: > >input_data: >.incbin "arch/x86/boot/compressed/vmlinux.bin.gz" >input_data_end: > >by doing > >/* > * Do the extraction, and jump to the new kernel.. > */ > >pushq %rsi/* Save the real mode argument */ > 0x13d00 >movq%rsi, %rdi /* real mode address */ > 0x13d00 >leaqboot_heap(%rip), %rsi /* malloc area for uncompression */ > 0xc6ef000 >leaqinput_data(%rip), %rdx /* input_data */ > 0xbe073a8 >movlinput_len(%rip), %ecx /* input_len */ > 0x8cfe13 >movq%rbp, %r8 /* output target address */ > 0x100 >movloutput_len(%rip), %r9d /* decompressed length, end of relocs > */ >callextract_kernel /* returns kernel location in %rax */ >popq%rsi > >(actual addresses at the end.) > >Now, when you say you triplefault somewhere in initialize_identity_maps() when >trying to access setup_data, then if you look a couple of lines before that >call >we do > > call load_stage2_idt > >which sets up a boottime #PF handler do_boot_page_fault() and it actually does >call kernel_add_identity_map() so *actually* it should map any unmapped >setup_data addresses. > >So why doesn't it do that and why do you triplefault? > >Hmmm. > See the other thread fork. They have identified the problem already.
Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data
On December 30, 2022 7:59:30 AM PST, "Jason A. Donenfeld" wrote: >Hi, > >On Wed, Dec 28, 2022 at 11:31:34PM -0800, H. Peter Anvin wrote: >> On December 28, 2022 6:31:07 PM PST, "Jason A. Donenfeld" >> wrote: >> >Hi, >> > >> >Read this message in a fixed width text editor with a lot of columns. >> > >> >On Wed, Dec 28, 2022 at 03:58:12PM -0800, H. Peter Anvin wrote: >> >> Glad you asked. >> >> >> >> So the kernel load addresses are parameterized in the kernel image >> >> setup header. One of the things that are so parameterized are the size >> >> and possible realignment of the kernel image in memory. >> >> >> >> I'm very confused where you are getting the 64 MB number from. There >> >> should not be any such limitation. >> > >> >Currently, QEMU appends it to the kernel image, not to the initramfs as >> >you suggest below. So, that winds up looking, currently, like: >> > >> > kernel imagesetup_data >> > |--||| >> >0x10 0x10+l1 0x10+l1+l2 >> > >> >The problem is that this decompresses to 0x100 (one more zero). So >> >if l1 is > (0x100-0x10), then this winds up looking like: >> > >> > kernel imagesetup_data >> > |--||| >> >0x10 0x10+l1 0x10+l1+l2 >> > >> > d e c o m p r e s s e d k e r n e l >> > >> > |-| >> >0x100 >> > 0x100+l3 >> > >> >The decompressed kernel seemingly overwriting the compressed kernel >> >image isn't a problem, because that gets relocated to a higher address >> >early on in the boot process. setup_data, however, stays in the same >> >place, since those links are self referential and nothing fixes them up. >> >So the decompressed kernel clobbers it. >> > >> >The solution in this commit adds a bunch of padding between the kernel >> >image and setup_data to avoid this. That looks like this: >> > >> > kernel imagepadding >> > setup_data >> > >> > |--||---||| >> >0x10 0x10+l1 >> > 0x100+l3 0x100+l3+l2 >> > >> > d e c o m p r e s s e d k e r n e l >> > >> > |-| >> >0x100 >> > 0x100+l3 >> > >> >This way, the decompressed kernel doesn't clobber setup_data. >> > >> >The problem is that if 0x100+l3-0x10 is around 62 megabytes, >> >then the bootloader crashes when trying to dereference setup_data's >> >->len param at the end of initialize_identity_maps() in ident_map_64.c. >> >I don't know why it does this. If I could remove the 62 megabyte >> >restriction, then I could keep with this technique and all would be >> >well. >> > >> >> In general, setup_data should be able to go anywhere the initrd can >> >> go, and so is subject to the same address cap (896 MB for old kernels, >> >> 4 GB on newer ones; this address too is enumerated in the header.) >> > >> >It would be theoretically possible to attach it to the initrd image >> >instead of to the kernel image. As a last resort, I guess I can look >> >into doing that. However, that's going to require some serious rework >> >and plumbing of a lot of different components. So if I can make it work >> >as is, that'd be ideal. However, I need to figure out this weird 62 meg >> >limitation. >> > >> >Any ideas on that? >> > >> >Jason >> >> As far as a crash... that sounds like a big and a pretty serious one at that. >> >> Could you let me know what kernel you are using and how *exactly* you are >> booting it? > >I'll attach a .config file. Apply the patch at the top of this thread to >qemu, except make one modificat
Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data
On December 28, 2022 6:31:07 PM PST, "Jason A. Donenfeld" wrote: >Hi, > >Read this message in a fixed width text editor with a lot of columns. > >On Wed, Dec 28, 2022 at 03:58:12PM -0800, H. Peter Anvin wrote: >> Glad you asked. >> >> So the kernel load addresses are parameterized in the kernel image >> setup header. One of the things that are so parameterized are the size >> and possible realignment of the kernel image in memory. >> >> I'm very confused where you are getting the 64 MB number from. There >> should not be any such limitation. > >Currently, QEMU appends it to the kernel image, not to the initramfs as >you suggest below. So, that winds up looking, currently, like: > > kernel imagesetup_data > |--||| >0x10 0x10+l1 0x10+l1+l2 > >The problem is that this decompresses to 0x100 (one more zero). So >if l1 is > (0x100-0x10), then this winds up looking like: > > kernel imagesetup_data > |--||| >0x10 0x10+l1 0x10+l1+l2 > > d e c o m p r e s s e d k e r n e l > > |-| >0x100 > 0x100+l3 > >The decompressed kernel seemingly overwriting the compressed kernel >image isn't a problem, because that gets relocated to a higher address >early on in the boot process. setup_data, however, stays in the same >place, since those links are self referential and nothing fixes them up. >So the decompressed kernel clobbers it. > >The solution in this commit adds a bunch of padding between the kernel >image and setup_data to avoid this. That looks like this: > > kernel imagepadding > setup_data > > |--||---||| >0x10 0x10+l1 >0x100+l3 0x100+l3+l2 > > d e c o m p r e s s e d k e r n e l > > |-| >0x100 > 0x100+l3 > >This way, the decompressed kernel doesn't clobber setup_data. > >The problem is that if 0x100+l3-0x10 is around 62 megabytes, >then the bootloader crashes when trying to dereference setup_data's >->len param at the end of initialize_identity_maps() in ident_map_64.c. >I don't know why it does this. If I could remove the 62 megabyte >restriction, then I could keep with this technique and all would be >well. > >> In general, setup_data should be able to go anywhere the initrd can >> go, and so is subject to the same address cap (896 MB for old kernels, >> 4 GB on newer ones; this address too is enumerated in the header.) > >It would be theoretically possible to attach it to the initrd image >instead of to the kernel image. As a last resort, I guess I can look >into doing that. However, that's going to require some serious rework >and plumbing of a lot of different components. So if I can make it work >as is, that'd be ideal. However, I need to figure out this weird 62 meg >limitation. > >Any ideas on that? > >Jason As far as a crash... that sounds like a big and a pretty serious one at that. Could you let me know what kernel you are using and how *exactly* you are booting it?
Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data
On December 28, 2022 6:31:07 PM PST, "Jason A. Donenfeld" wrote: >Hi, > >Read this message in a fixed width text editor with a lot of columns. > >On Wed, Dec 28, 2022 at 03:58:12PM -0800, H. Peter Anvin wrote: >> Glad you asked. >> >> So the kernel load addresses are parameterized in the kernel image >> setup header. One of the things that are so parameterized are the size >> and possible realignment of the kernel image in memory. >> >> I'm very confused where you are getting the 64 MB number from. There >> should not be any such limitation. > >Currently, QEMU appends it to the kernel image, not to the initramfs as >you suggest below. So, that winds up looking, currently, like: > > kernel imagesetup_data > |--||| >0x10 0x10+l1 0x10+l1+l2 > >The problem is that this decompresses to 0x100 (one more zero). So >if l1 is > (0x100-0x10), then this winds up looking like: > > kernel imagesetup_data > |--||| >0x10 0x10+l1 0x10+l1+l2 > > d e c o m p r e s s e d k e r n e l > > |-| >0x100 > 0x100+l3 > >The decompressed kernel seemingly overwriting the compressed kernel >image isn't a problem, because that gets relocated to a higher address >early on in the boot process. setup_data, however, stays in the same >place, since those links are self referential and nothing fixes them up. >So the decompressed kernel clobbers it. > >The solution in this commit adds a bunch of padding between the kernel >image and setup_data to avoid this. That looks like this: > > kernel imagepadding > setup_data > > |--||---||| >0x10 0x10+l1 >0x100+l3 0x100+l3+l2 > > d e c o m p r e s s e d k e r n e l > > |-| >0x100 > 0x100+l3 > >This way, the decompressed kernel doesn't clobber setup_data. > >The problem is that if 0x100+l3-0x10 is around 62 megabytes, >then the bootloader crashes when trying to dereference setup_data's >->len param at the end of initialize_identity_maps() in ident_map_64.c. >I don't know why it does this. If I could remove the 62 megabyte >restriction, then I could keep with this technique and all would be >well. > >> In general, setup_data should be able to go anywhere the initrd can >> go, and so is subject to the same address cap (896 MB for old kernels, >> 4 GB on newer ones; this address too is enumerated in the header.) > >It would be theoretically possible to attach it to the initrd image >instead of to the kernel image. As a last resort, I guess I can look >into doing that. However, that's going to require some serious rework >and plumbing of a lot of different components. So if I can make it work >as is, that'd be ideal. However, I need to figure out this weird 62 meg >limitation. > >Any ideas on that? > >Jason Ok, the code I sent will figure out the minimum amount of padding that you need (min_initrd_addr) as well.
Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data
On 12/28/22 15:58, H. Peter Anvin wrote: On December 28, 2022 8:57:54 AM PST, "Jason A. Donenfeld" wrote: HELLO H. PETER ANVIN, E L L O On Wed, Dec 28, 2022 at 05:30:30PM +0100, Jason A. Donenfeld wrote: Fix looks good, glad you figured out the problem. I mean, kind of. The solution here sucks, especially given that in the worst case, setup_data just gets dropped. I'm half inclined to consider this a kernel bug instead, and add some code to relocate setup_data prior to decompression, and then fix up all the links. It seems like this would be a lot more robust. I just wish the people who wrote this stuff would chime in. I've had x...@kernel.org CC'd but so far, no input from them. Apparently you are the x86 boot guru. What do you want to happen here? Your input would be very instrumental. Jason Hi! Glad you asked. So the kernel load addresses are parameterized in the kernel image setup header. One of the things that are so parameterized are the size and possible realignment of the kernel image in memory. I'm very confused where you are getting the 64 MB number from. There should not be any such limitation. In general, setup_data should be able to go anywhere the initrd can go, and so is subject to the same address cap (896 MB for old kernels, 4 GB on newer ones; this address too is enumerated in the header.) If you want to put setup_data above 4 GB, it *should* be ok if and only if the kernel supports loading the initrd high, too (again, enumerated in the header. TL;DR: put setup_data where you put the initrd (before or after doesn't matter.) To be maximally conservative, link the setup_data list in order from lowest to highest address; currently there is no such item of relevance, but in the future there may be setup_data items needed by the BIOS part of the bootstrap in which case they would have to be < 1 MB and precede any items > 1 MB for obvious reasons. That being said, with BIOS dying it is not all that likely that such entries will ever be needed. So let me try for an algorithm. Attached as a text file to avoid line break damage. -hpaHere is an attempted description with pseudo-C code: First of all, take a 4K page of memory and *initialize it to zero*. { #include /* From the uapi kernel sources */ /* Allocated somewhere in your code... */ extern unsigned char *kernel_image; /* Kernel file */ extern struct boot_params *boot_params; /* 4K buffer */ extern uint32_t kernel_image_size; /* Size of kernel file */ /* Callbacks into your code */ extern bool is_bios_boot(void); extern uint32_t end_of_low_memory(void); /* For BIOS boot */ /* * This MUST return an alignment address between start_address * and max_address... */ extern uint64_t maybe_relocate_kernel(uint64_t start_address, uint64_t max_address, uint32_t alignment); /* * Convenience pointer into the kernel image; modifications * done here should be reflected in the loaded kernel image */ struct setup_header * const kernel_setup_header = (struct setup_header *)(kernel_image + 0x1f1); /* Initialize boot_params to zero!!! */ memset(boot_params, 0, sizeof *boot_params); } Copy the setup header starting at file offset 0x1f1 to offset 0x1f1 into that page: { int setup_length = kernel_setup_header->header == 0x53726448 ? (kernel_setup_header->jump >> 8) + 17 : 15; memcpy(_params->hdr, kernel_setup_header, setup_length); } Now you can compute values including ones are omitted by older kernels: { /* * Split between the part of the kernel to be loaded into * low memory (for 16-bit boot, otherwise it can be safely * omitted) and the part to be loaded into high memory. */ if (!boot_params->hdr.setup_sects) boot_param->hdr.setup_sects = 4; int high_kernel_start = (boot_param->hdr.setup_sects+1) << 9; /* * Highest permitted address for the high part of the kernel image, * initrd, command line (*except for 16-bit boot*), and setup_data * * max_initrd_addr here is exclusive */ uint64_t max_initrd_addr = (uint64_t)boot_params->hdr.initrd_addr_max + 1; if (boot_params->hdr.version < 0x0200) max_initrd_addr = 0;/* No initrd supported */ else if (boot_params->hdr.version < 0x0203) max_initrd_addr = 0x3800; else if (boot_params->hdr.version >= 0x020c && (boot_params->hdr.xloadflags & XLF_CAN_BE_LOADED_ABOVE_4G)) max_initrd_addr = (uint64_t)1 << 52; /* Architecture-imposed limit */ /* * Maximum command line size *including terminating null* */ unsigned int cmdline_size; if (boot_params->hdr.version < 0x0200) cmdline_size = 0; /* No command line supported */ else if (boot_params->hdr.version < 0x0
Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data
On December 28, 2022 8:57:54 AM PST, "Jason A. Donenfeld" wrote: >HELLO H. PETER ANVIN, >E >L >L >O > >On Wed, Dec 28, 2022 at 05:30:30PM +0100, Jason A. Donenfeld wrote: >> > Fix looks good, glad you figured out the problem. >> >> I mean, kind of. The solution here sucks, especially given that in the >> worst case, setup_data just gets dropped. I'm half inclined to consider >> this a kernel bug instead, and add some code to relocate setup_data >> prior to decompression, and then fix up all the links. It seems like >> this would be a lot more robust. >> >> I just wish the people who wrote this stuff would chime in. I've had >> x...@kernel.org CC'd but so far, no input from them. > >Apparently you are the x86 boot guru. What do you want to happen here? >Your input would be very instrumental. > >Jason Hi! Glad you asked. So the kernel load addresses are parameterized in the kernel image setup header. One of the things that are so parameterized are the size and possible realignment of the kernel image in memory. I'm very confused where you are getting the 64 MB number from. There should not be any such limitation. In general, setup_data should be able to go anywhere the initrd can go, and so is subject to the same address cap (896 MB for old kernels, 4 GB on newer ones; this address too is enumerated in the header.) If you want to put setup_data above 4 GB, it *should* be ok if and only if the kernel supports loading the initrd high, too (again, enumerated in the header. TL;DR: put setup_data where you put the initrd (before or after doesn't matter.) To be maximally conservative, link the setup_data list in order from lowest to highest address; currently there is no such item of relevance, but in the future there may be setup_data items needed by the BIOS part of the bootstrap in which case they would have to be < 1 MB and precede any items > 1 MB for obvious reasons. That being said, with BIOS dying it is not all that likely that such entries will ever be needed.
Re: [Qemu-devel] Qemu baseline requirements/portability?
On 6/5/19 12:55 PM, H. Peter Anvin wrote: > Hi, > > I am writing some code I'm hoping will be able to make it into Qemu, but I > can't seem to find what the baseline portability requirements are. I'm > specifically wondering about newer POSIX features like openat(), which seems > to be used in the 9p filesystem and nowhere else, and what version of glib one > can rely on? > Specifically, I'm trying to satisfy a 10-year-old request by me and others to support composite initrd during Linux boot. -hpa
[Qemu-devel] Qemu baseline requirements/portability?
Hi, I am writing some code I'm hoping will be able to make it into Qemu, but I can't seem to find what the baseline portability requirements are. I'm specifically wondering about newer POSIX features like openat(), which seems to be used in the 9p filesystem and nowhere else, and what version of glib one can rely on? Thanks, -hpa
Re: [Qemu-devel] [RFC/PoC PATCH 1/3] i386: set initrd_max to 4G - 1 to allow up to 4G initrd
On 11/11/18 10:19 PM, Ingo Molnar wrote: > >> In part as a result of this exchange I have spent some time thinking >> about the boot protocol and its dependencies, and there is, in fact, a >> much more serious problem that needs to be addressed: it is not >> currently possible in a forward-compatible way to map all data areas >> that may be occupied by bootloader-provided data. The kernel proper has >> an advantage here, in that the kernel will by definition always be the >> "owner of the protocol" (anything the kernel doesn't know how to map >> won't be used by the kernel anyway), but it really isn't a good >> situation. So I'm currently trying to think up a way to make that >> possible. > > I might be a bit dense early in the morning, but could you elaborate? > What do you mean by mapping all data areas? Alright, awake now... As it sits right now, the protocol contains a number of data structures with pointers, pointing to a variety of memory areas that can be set up by the bootloader. Now, consider something like KASLR or a secondary boot loader where we need to allocate memory in between the primary bootloader and the kernel to be run. With the kernel proper, in the absence of KASLR, we have solved this by marking out exactly how much memory the kernel may need before it has its own memory manager up and running, but KASLR needs to move it outside this range, and a secondary boot loader shim of some sort may need to allocate additional data structures. In the particular case of an UEFI system where we do the right thing (which Grub2 doesn't, by default) and enter via the kernel UEFI stub we are okay, but for other boot scenarios we are in trouble: even if we know where all the pointers are and how to determine the size of various data structures, once the protocol is updated with new information then that is no longer valid. The setup_data linked list solves that under certain circumstances, but in others it has turned out to not be adequate. There are a couple of options: a) Not allow any new pointers to memory areas in what is considered system RAM. Such data structures *must* have a setup_data linked list header. Pointers into E820 table reserved areas are still acceptable. b) Create a new E820 table memory type for "boot data", similar to what UEFI already has, and encourage boot loaders to mark any allocated memory structures that way. The main problem with that is that the poor quality of boot loaders may mean that that fails to happen, and because it wouldn't "fail hard" it is likely that they will get it wrong. The difference from the RESERVED memory type is that the kernel can reclaim that memory after the data has been recovered. c) This might be the preferred option: 1. Just like (a), do not allow new pointers to memory areas in system RAM in struct boot_params. 2. Create a subrange of struct setup_data (e.g. bit 30 = 1) explicitly containing pointers to other data structures, including sizes, in a way that can be parsed by generic code. 3. Encourage boot loaders to make sure the setup_data list is in order of ascending address (and WARN if it is not.) 4. Add (b) as an option, for responsible boot loaders ;) to provide an extra level of protection.
Re: [Qemu-devel] [RFC/PoC PATCH 1/3] i386: set initrd_max to 4G - 1 to allow up to 4G initrd
On 11/11/18 10:19 PM, Ingo Molnar wrote: > > I might be a bit dense early in the morning, but could you elaborate? > What do you mean by mapping all data areas? > Heh. I need to pack for LPC and get some sleep before my flight lest I'll be denser than depleted uranium; I'll write an explanation tomorrow. -hpa
Re: [Qemu-devel] [RFC/PoC PATCH 1/3] i386: set initrd_max to 4G - 1 to allow up to 4G initrd
On 11/11/18 8:56 PM, Ingo Molnar wrote: > >> Also note that the ext_ramdisk_image and ext_ramdisk_size are part of >> struct boot_params as opposed to struct setup_header, which means that >> they are not supported when entering via the 16-bit BIOS entry point, >> and I am willing to bet that there will be, ahem, "strangeness" if >> entered via the 32-bit entry point if at least the command line is >> loaded above the 4 GB mark; the initrd should be fine, though. >> >> This is obviosly not an issue in EFI environments, where we enter >> through the EFI handover entry point. >> >> The main reason these were not added to struct setup_header is that >> there are only 24 bytes left in that header and so space is highly >> precious. One way to deal with that if we really, really need to would >> be to add an initrd/initramfs type of setup_data. > > Is there no way to extend that header by making an extended header part > of the payload? > > IIRC that header is small and fixed size to be part of a single sector at > the very beginning of boot images, but accessing any extended header bits > from the payload section shouldn't really be an issue for a modern > bootloader to handle, right? > > Such an extended header could use a more modern (self-extending) ABI as > well. > Yes, although I don't really think it is as much of an issue as it seems at this point. The limit comes from having used a one-byte jump instruction at the beginning; however, these days that limit is functionally walled. It is of course possible to address this if it should become necessary, however, the current protocol has lasted for 23 years so far and we haven't run out yet, even with occasional missteps. As such, I don't think we are in a huge hurry to address this particular aspect. In part as a result of this exchange I have spent some time thinking about the boot protocol and its dependencies, and there is, in fact, a much more serious problem that needs to be addressed: it is not currently possible in a forward-compatible way to map all data areas that may be occupied by bootloader-provided data. The kernel proper has an advantage here, in that the kernel will by definition always be the "owner of the protocol" (anything the kernel doesn't know how to map won't be used by the kernel anyway), but it really isn't a good situation. So I'm currently trying to think up a way to make that possible. -hpa
Re: [Qemu-devel] [RFC/PoC PATCH 1/3] i386: set initrd_max to 4G - 1 to allow up to 4G initrd
On 11/9/18 5:40 AM, Li Zhijian wrote: > Just noticed that there is a field xloadflags at recent protocol > 60 Protocol 2.12: (Kernel 3.8) Added the xloadflags field and > extension fields > 61 to struct boot_params for loading bzImage and ramdisk > 62 above 4G in 64bit. > [snip] > 617 Field name: xloadflags > 618 Type: read > 619 Offset/size: 0x236/2 > 620 Protocol: 2.12+ > 621 > 622 This field is a bitmask. > 623 > 624 Bit 0 (read): XLF_KERNEL_64 > 625 - If 1, this kernel has the legacy 64-bit entry point at > 0x200. > 626 > 627 Bit 1 (read): XLF_CAN_BE_LOADED_ABOVE_4G > 628 - If 1, kernel/boot_params/cmdline/ramdisk can be above 4G. > 629 > > maybe we can reuse this field and append a new Bit 5 > XLF_INITRD_MAX_SIZE_4G and increase header version. > For the old protocol version 2.12+, if XLF_CAN_BE_LOADED_ABOVE_4G is > set, we can also realize ~4GB initrd is allowed. > > bootloader side: > if protocol >= 2.15 > if XLF_INITRD_LOAD_BELOW_4G > support ~4G initrd > fi > else if protocol >=2.12 > if XLF_CAN_BE_LOADED_ABOVE_4G > support ~4G initrd > fi > fi > The two are equivalent. Obviously you have to load above 4 GB if you have more than 4 GB of initrd. If XLF_CAN_BE_LOADED_ABOVE_4G is not set, then you most likely are on a 32-bit kernel and there are more fundamental limits (even if you were to load it above the 2 GB mark, you would be limited by the size of kernel memory.) So, in case you are wondering: the bootloader that broke when setting the initrd_max field above 2 GB was, of course, Grub. So just use XLF_CAN_BE_LOADED_ABOVE_4G. There is no need for a new flag or new field. Also note that the ext_ramdisk_image and ext_ramdisk_size are part of struct boot_params as opposed to struct setup_header, which means that they are not supported when entering via the 16-bit BIOS entry point, and I am willing to bet that there will be, ahem, "strangeness" if entered via the 32-bit entry point if at least the command line is loaded above the 4 GB mark; the initrd should be fine, though. This is obviosly not an issue in EFI environments, where we enter through the EFI handover entry point. The main reason these were not added to struct setup_header is that there are only 24 bytes left in that header and so space is highly precious. One way to deal with that if we really, really need to would be to add an initrd/initramfs type of setup_data. -hpa
Re: [Qemu-devel] [PATCH] slirp: support dynamic block size for TFTP transfers
On 11/21/16 11:45, Hervé Poussineau wrote: > The blocksize option is defined in RFC 1783. > We now support block sizes between 1 and 1432 bytes, instead of 512 only. It ought to be 1476: Ethernet MTU = 1500, minus a minimum of 20 bytes for an IPv4 header and 4 for a TFTP header. Some bootloaders including Syslinux sandbag this value to avoid creating fragmented packets in case there is a VPN or something similar in the network path. -hpa
Re: [Qemu-devel] RFC: virtio-rng and /dev/urandom
On April 18, 2016 4:26:24 AM PDT, "Daniel P. Berrange" <berra...@redhat.com> wrote: >On Mon, Apr 18, 2016 at 01:07:40PM +0200, Hubert Kario wrote: >> On Monday 18 April 2016 02:46:19 H. Peter Anvin wrote: >> > Another thing that really needs to be addressed, but is a separate >> > issue: invalidating and reseeding the entropy pool after a snapshot >> > event. >> >> definitely agreed >> >> though just reseeding would be sufficient - the goal is to make the >> output unpredictable and unique between multiple machines starting >from >> the same snapshot, feeding enough random data to make the entropy >pool >> unique again is sufficient to achieve that > >If you're spawning multiple machines from the same base snapshot, >the seeding of RNG is just one of many many things that need >dealing with. eg new /etc/machine-id, new ssh host keys, changing >MAC address of NICs with corresponding guest config file changes, >many other application specific identifiers / keys intended to >be unique per machine. As such, libvirt explicitly tries to >prevent you spawning multiple machines from the same snapshot. > >That all said, Microsoft HyperV has defined a concept of a >"Virtual Machine Generation ID" and specified various hypervisor >operations which should result in this value changing[1]. For example >restoring from a snapshot should always change the genid, as would >restoring from backup, or cloned from another image, or failed over >during disaster recovery. > >This vm genid is exposed to the guest via ACPI and there's an >notification whenever it changes. > >There are patches for QEMU[2] to support this feature in a manner that >is compatible with the hyperv spec, but they are sadly still not >merged :-( > >So it would be possible for the Linux kernel to re-initialize its >RNG after snapshot by hooking into the vm-genid ACPI notification. > > >Regards, >Daniel > >[1] >https://lists.nongnu.org/archive/html/qemu-devel/2014-10/msg00489.html >[2] https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg05599.html There are multiple machines, and there are snapshots restored. -- Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.
Re: [Qemu-devel] RFC: virtio-rng and /dev/urandom
On April 18, 2016 2:28:42 AM PDT, "Daniel P. Berrange" <berra...@redhat.com> wrote: >On Fri, Apr 15, 2016 at 08:56:59AM -0700, H. Peter Anvin wrote: >> On April 15, 2016 3:41:34 AM PDT, Cole Robinson <crobi...@redhat.com> >wrote: >> >Libvirt currently rejects using host /dev/urandom as an input source >> >for a >> >virtio-rng device. The only accepted sources are /dev/random and >> >/dev/hwrng. >> >This is the result of discussions on qemu-devel around when the >feature >> >was >> >first added (2013). Examples: >> > >> >http://lists.gnu.org/archive/html/qemu-devel/2012-09/msg02387.html >> >>https://lists.gnu.org/archive/html/qemu-devel/2013-03/threads.html#00023 >> > >> >libvirt's rejection of /dev/urandom has generated some complaints >from >> >users: >> > >> >https://bugzilla.redhat.com/show_bug.cgi?id=1074464 >> >* cited: http://www.2uo.de/myths-about-urandom/ >> >http://www.redhat.com/archives/libvir-list/2016-March/msg01062.html >> >http://www.redhat.com/archives/libvir-list/2016-April/msg00186.html >> > >> >I think it's worth having another discussion about this, at least >with >> >a >> >recent argument in one place so we can put it to bed. I'm CCing a >bunch >> >of >> >people. I think the questions are: >> > >> >1) is the original recommendation to never use >virtio-rng+/dev/urandom >> >correct? >> > >> >2) regardless of #1, should we continue to reject that config in >> >libvirt? >> > >> >Thanks, >> >Cole >> >> Using /dev/urandom for virtio-rng, *except* perhaps for a small seed, >> it a complete waste of cycles. There is absolutely no reason to have >> one prng feed another. > >Regardless of the performance aspect, the key question we need the >answer to is whether it *cryptographically safe* to use /dev/urandom >on the host to feed virtio-rng. The original discussion said it was >/unsafe/ to use /dev/urandom, hence why we do not allow it. If the >only downside is wasted performance, then it is reasonable to allow >the user to use /dev/urandom if they so wish. > >Regards, >Daniel Perhaps. What we do know is that it at least used to be a fairly common misconfiguration; up there with people who would feed /dev/urandom to rngd! Probably there ought to be a backend which knows to use urandom for a seed and then fall back to random, rather than simply relying on a file name. Another thing that really needs to be addressed, but is a separate issue: invalidating and reseeding the entropy pool after a snapshot event. -- Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.
Re: [Qemu-devel] RFC: virtio-rng and /dev/urandom
On 04/16/16 01:31, Paolo Bonzini wrote: > > Right, but there's always the point about people that use heterogeneous > hosts and cannot pass rdrand/rdseed to the guest. For these, we should > add a QEMU driver that uses rdrand/rdseed, and thus decouples virtio-rng > from the host /dev/* completely. > > From the libvirt POV there are various possibilities: > > - Libvirt can have a libvirt.conf parameter that says "ignore whatever is > specified in the guest XML if rdrand/rdseed is available, and instead use > rdrand/rdseed". > > - Libvirt can allow specifying rdrand/rdseed _and_ an additional backend, > like this: > > > /dev/random > > and fallback to the second if rdrand/rdseed are not available. > The other thing, and this is one area where there is some legitimacy to the /dev/urandom argument: on a fresh boot, it would be highly desirable to get a seed value from virtio-rng even if that is "entropyless". The backwards-compatible way would be to provide, say, 64 bytes of /dev/urandom before switching to /dev/random, but it might be desirable to give the guest OS some way to cause that to reset, explicitly requesting a new seed after an in-VM guest reboot, kexec et al. This also ties into the proposed MSR to support kASLR in the guest in the absence of rdrand/rdseed. Using virtio in that phase of bootup is generally not feasible. -hpa
Re: [Qemu-devel] RFC: virtio-rng and /dev/urandom
On 04/16/16 01:31, Paolo Bonzini wrote: > > Right, but there's always the point about people that use heterogeneous > hosts and cannot pass rdrand/rdseed to the guest. For these, we should > add a QEMU driver that uses rdrand/rdseed, and thus decouples virtio-rng > from the host /dev/* completely. > You should be able to crib the code from rngd (rng-tools) pretty much verbatim. -hpa
Re: [Qemu-devel] RFC: virtio-rng and /dev/urandom
On April 15, 2016 9:10:44 AM PDT, Hubert Kariowrote: >On Friday 15 April 2016 09:47:51 Eric Blake wrote: >> On 04/15/2016 04:41 AM, Cole Robinson wrote: >> > Libvirt currently rejects using host /dev/urandom as an input >source >> > for a virtio-rng device. The only accepted sources are /dev/random >> > and /dev/hwrng. This is the result of discussions on qemu-devel >> > around when the feature was first added (2013). Examples: >> > >> > http://lists.gnu.org/archive/html/qemu-devel/2012-09/msg02387.html >> > >https://lists.gnu.org/archive/html/qemu-devel/2013-03/threads.html#0 >> > 0023 >> > >> > libvirt's rejection of /dev/urandom has generated some complaints >> > from users: >> > >> > https://bugzilla.redhat.com/show_bug.cgi?id=1074464 >> > * cited: http://www.2uo.de/myths-about-urandom/ >> > http://www.redhat.com/archives/libvir-list/2016-March/msg01062.html >> > http://www.redhat.com/archives/libvir-list/2016-April/msg00186.html >> > >> > I think it's worth having another discussion about this, at least >> > with a recent argument in one place so we can put it to bed. I'm >> > CCing a bunch of people. I think the questions are: >> > >> > 1) is the original recommendation to never use >> > virtio-rng+/dev/urandom correct? >> That I'm not sure about - and the answer may be context-dependent >(for >> example a FIPS user may care more than an ordinary user) > >/dev/urandom use is FIPS compliant, no FIPS-validated protocol or >cryptographic primitive requires the "fresh" entropy provided by >/dev/random. All primitives are designed to work with weaker entropy >guarantees than what /dev/urandom provides. So: using urandom for a seed makes sense, but "unplugging the drain" is a huge waste of resources and provides absolutely zero value. Also, I do not believe /dev/urandom is FIPS compliant. Finally, the refill policy is different, so it is not really true the algorithm is the same. All in all, other than a seed value it really doesn't make any sense. Of course, none of this matters on newer Intel hardware ;) -- Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.
Re: [Qemu-devel] RFC: virtio-rng and /dev/urandom
On April 15, 2016 9:10:44 AM PDT, Hubert Kariowrote: >On Friday 15 April 2016 09:47:51 Eric Blake wrote: >> On 04/15/2016 04:41 AM, Cole Robinson wrote: >> > Libvirt currently rejects using host /dev/urandom as an input >source >> > for a virtio-rng device. The only accepted sources are /dev/random >> > and /dev/hwrng. This is the result of discussions on qemu-devel >> > around when the feature was first added (2013). Examples: >> > >> > http://lists.gnu.org/archive/html/qemu-devel/2012-09/msg02387.html >> > >https://lists.gnu.org/archive/html/qemu-devel/2013-03/threads.html#0 >> > 0023 >> > >> > libvirt's rejection of /dev/urandom has generated some complaints >> > from users: >> > >> > https://bugzilla.redhat.com/show_bug.cgi?id=1074464 >> > * cited: http://www.2uo.de/myths-about-urandom/ >> > http://www.redhat.com/archives/libvir-list/2016-March/msg01062.html >> > http://www.redhat.com/archives/libvir-list/2016-April/msg00186.html >> > >> > I think it's worth having another discussion about this, at least >> > with a recent argument in one place so we can put it to bed. I'm >> > CCing a bunch of people. I think the questions are: >> > >> > 1) is the original recommendation to never use >> > virtio-rng+/dev/urandom correct? >> That I'm not sure about - and the answer may be context-dependent >(for >> example a FIPS user may care more than an ordinary user) > >/dev/urandom use is FIPS compliant, no FIPS-validated protocol or >cryptographic primitive requires the "fresh" entropy provided by >/dev/random. All primitives are designed to work with weaker entropy >guarantees than what /dev/urandom provides. That's not the point. The point is that it is a complete waste of resources when the PRNG can simply be run in the guest -- Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.
Re: [Qemu-devel] RFC: virtio-rng and /dev/urandom
On April 15, 2016 3:41:34 AM PDT, Cole Robinsonwrote: >Libvirt currently rejects using host /dev/urandom as an input source >for a >virtio-rng device. The only accepted sources are /dev/random and >/dev/hwrng. >This is the result of discussions on qemu-devel around when the feature >was >first added (2013). Examples: > >http://lists.gnu.org/archive/html/qemu-devel/2012-09/msg02387.html >https://lists.gnu.org/archive/html/qemu-devel/2013-03/threads.html#00023 > >libvirt's rejection of /dev/urandom has generated some complaints from >users: > >https://bugzilla.redhat.com/show_bug.cgi?id=1074464 >* cited: http://www.2uo.de/myths-about-urandom/ >http://www.redhat.com/archives/libvir-list/2016-March/msg01062.html >http://www.redhat.com/archives/libvir-list/2016-April/msg00186.html > >I think it's worth having another discussion about this, at least with >a >recent argument in one place so we can put it to bed. I'm CCing a bunch >of >people. I think the questions are: > >1) is the original recommendation to never use virtio-rng+/dev/urandom >correct? > >2) regardless of #1, should we continue to reject that config in >libvirt? > >Thanks, >Cole Using /dev/urandom for virtio-rng, *except* perhaps for a small seed, it a complete waste of cycles. There is absolutely no reason to have one prng feed another. -- Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.
Re: [Qemu-devel] [PATCH v3 1/4] KVM/X86: Fix xsave cpuid exposing bug
On 01/22/2014 02:21 AM, Paolo Bonzini wrote: Il 21/01/2014 19:59, Liu, Jinsong ha scritto: From 3155a190ce6ebb213e6c724240f4e6620ba67a9d Mon Sep 17 00:00:00 2001 From: Liu Jinsong jinsong@intel.com Date: Fri, 13 Dec 2013 02:32:03 +0800 Subject: [PATCH v3 1/4] KVM/X86: Fix xsave cpuid exposing bug EBX of cpuid(0xD, 0) is dynamic per XCR0 features enable/disable. Bit 63 of XCR0 is reserved for future expansion. Signed-off-by: Liu Jinsong jinsong@intel.com Peter, can I have your acked-by on this? Yes. Acked-by: H. Peter Anvin h...@linux.intel.com
Re: [Qemu-devel] [RFC PATCH 1/3] target-i386: Add 486sx, old486, and old486sx CPU models
On 03/25/2013 01:56 PM, Eduardo Habkost wrote: It needs to be possible to fix bugs It is possible to fix them today: just write a compat function or add a global variable that is handled by cpu_x86_init(), and call it from the pc-1.3 machine-type init function. See disable_kvm_pv_eoi() and enable_compat_apic_id_mode(), for example. The difference is that this will be much easier once we introduce the static properties: then you just need to add the compat property values to the compat_props field on the machine-type struct. Hi guys, Any movement on this in the past year? -hpa
Re: [Qemu-devel] [PATCH 2/3] X86, mpx: Intel MPX definition
No... we always ask for cpufeature.h patches separately because they sometimes cause conflicts between branches. Borislav Petkov b...@alien8.de wrote: On Sat, Dec 07, 2013 at 02:52:55AM +0800, Qiaowei Ren wrote: Signed-off-by: Qiaowei Ren qiaowei@intel.com Signed-off-by: Xudong Hao xudong@intel.com Signed-off-by: Liu Jinsong jinsong@intel.com --- arch/x86/include/asm/cpufeature.h |2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) This patch should probably be merged with the next one... diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h index d3f5c63..6c2738d 100644 --- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -216,6 +216,7 @@ #define X86_FEATURE_ERMS(9*32+ 9) /* Enhanced REP MOVSB/STOSB */ #define X86_FEATURE_INVPCID (9*32+10) /* Invalidate Processor Context ID */ #define X86_FEATURE_RTM (9*32+11) /* Restricted Transactional Memory */ +#define X86_FEATURE_MPX (9*32+14) /* Memory Protection Extension */ #define X86_FEATURE_RDSEED (9*32+18) /* The RDSEED instruction */ #define X86_FEATURE_ADX (9*32+19) /* The ADCX and ADOX instructions */ #define X86_FEATURE_SMAP(9*32+20) /* Supervisor Mode Access Prevention */ @@ -330,6 +331,7 @@ extern const char * const x86_power_flags[32]; #define cpu_has_perfctr_l2 boot_cpu_has(X86_FEATURE_PERFCTR_L2) #define cpu_has_cx8 boot_cpu_has(X86_FEATURE_CX8) #define cpu_has_cx16boot_cpu_has(X86_FEATURE_CX16) +#define cpu_has_mpx boot_cpu_has(X86_FEATURE_MPX) ... and we're trying to not have more of those macros so people should be simply using boot_cpu_has(X86_FEATURE_YYY). -- Sent from my mobile phone. Please pardon brevity and lack of formatting.
Re: [Qemu-devel] [PATCH v3 0/2] Intel MPX feature support at Qemu
On 12/06/2013 08:27 AM, Liu, Jinsong wrote: Eric Blake wrote: On 12/06/2013 07:06 AM, Liu, Jinsong wrote: Intel has released Memory Protection Extensions (MPX) recently. Please refer to http://download-software.intel.com/sites/default/files/319433-015.pdf These 2 patches are version2 to support Intel MPX at qemu side. You still aren't threading correctly, which makes it hard to track your series. Please review http://wiki.qemu.org/Contribute/SubmitAPatch and make sure your 'git send-email' settings allow for proper threading; a good way to test that is to first send the patch series to yourself to ensure your environment is set up correctly. Thanks Blake! will take care and learn using git send-email when I send patches later (i.e. kvm mpx patches). Not to mention that Linux kernel patches should be Cc:'d to linux-ker...@vger.kernel.org. -hpa
Re: [Qemu-devel] [PATCH 3/3] X86, mpx: Intel MPX xstate feature definition
On 12/06/2013 05:46 AM, Borislav Petkov wrote: I'm guessing this and the struct lwp_struct above is being added so that you can have the LWP XSAVE area size? If so, you don't need it: LWP XSAVE area is 128 bytes at offset 832 according to my manuals so I'd guess having a u8 lwp_area[128] should be fine. Sure, but any reason to *not* document the internal structure? +struct bndregs_struct bndregs; +struct bndcsr_struct bndcsr; /* new processor state extensions will go here */ } __attribute__ ((packed, aligned (64))); diff --git a/arch/x86/include/asm/xsave.h b/arch/x86/include/asm/xsave.h index 0415cda..5cd9de3 100644 --- a/arch/x86/include/asm/xsave.h +++ b/arch/x86/include/asm/xsave.h @@ -9,6 +9,8 @@ #define XSTATE_FP 0x1 #define XSTATE_SSE 0x2 #define XSTATE_YMM 0x4 +#define XSTATE_BNDREGS 0x8 +#define XSTATE_BNDCSR 0x10 #define XSTATE_FPSSE(XSTATE_FP | XSTATE_SSE) @@ -20,10 +22,12 @@ #define XSAVE_YMM_SIZE 256 #define XSAVE_YMM_OFFSET(XSAVE_HDR_SIZE + XSAVE_HDR_OFFSET) +#define XSTATE_FLEXIBLE (XSTATE_FP | XSTATE_SSE | XSTATE_YMM) What's the use of that macro if it is used only once? Documentation seems good enough. Explicitly separating out the features which MUST be eagerly saved seems like a good thing. +#define XSTATE_EAGER(XSTATE_BNDREGS | XSTATE_BNDCSR) /* * These are the features that the OS can handle currently. */ -#define XCNTXT_MASK (XSTATE_FP | XSTATE_SSE | XSTATE_YMM) +#define XCNTXT_MASK (XSTATE_FLEXIBLE | XSTATE_EAGER) -hpa
Re: [Qemu-devel] [PATCH v2 3/3] X86, mpx: Intel MPX xstate feature definition
On 12/06/2013 09:35 AM, Paolo Bonzini wrote: Sorry for the back-and-forth, but I think this and the removal of XSTATE_FLEXIBLE (perhaps XSTATE_LAZY?) makes your v2 worse than v1. Since Peter already said the same, please undo these changes. Also, how is XSTATE_EAGER used? Should MPX be disabled when xsaveopt is disabled on the kernel command line? (Liu, how would this affect the KVM patches, too?) There are two options: we could disable MPX etc. or we could force eager saving (using xsave) even if xsaveopt is disabled. It is a hard call to make, but I guess I'm leaning towards the latter; we could add an lazyxsave option to explicitly disable all eager features if there is use for that. -hpa
Re: [Qemu-devel] [PATCH v2 3/3] X86, mpx: Intel MPX xstate feature definition
On 12/06/2013 12:05 PM, Liu, Jinsong wrote: Since Peter already said the same, please undo these changes. Also, how is XSTATE_EAGER used? Should MPX be disabled when xsaveopt is disabled on the kernel command line? (Liu, how would this affect the KVM patches, too?) Paolo Currently seems no, and if needed we can add a new patch at kvm side accordingly when native mpx patches checked in. We need to either disable these features in lazy mode, or we need to force eager mode if these features are to be supported. The problem with the latter is that it means forcing eager mode regardless of if anything actually *uses* these features. A third option would be to require applications to use a prctl() or similar to enable eager-save features. Thoughts? -hpa
Re: [Qemu-devel] [PATCH v2 3/3] X86, mpx: Intel MPX xstate feature definition
On 12/06/2013 05:16 PM, Ren, Qiaowei wrote: Jinsong think that both kvm and host depend on these feature definition header file, so we firstly submit these files depended on. Yes, but we can't turn on the feature without proper protection. Either way, they are now in tip:x86/cpufeature. -hpa
Re: [Qemu-devel] [PATCH 1/4] X86: Intel MPX definiation
On 12/05/2013 08:08 AM, Paolo Bonzini wrote: Il 02/12/2013 17:43, Liu, Jinsong ha scritto: From fbfa537f690eca139a96c6b2636ab5130bf57716 Mon Sep 17 00:00:00 2001 From: Liu Jinsong jinsong@intel.com Date: Fri, 29 Nov 2013 01:27:00 +0800 Subject: [PATCH 1/4] X86: Intel MPX definiation Signed-off-by: Xudong Hao xudong@intel.com Signed-off-by: Liu Jinsong jinsong@intel.com --- arch/x86/include/asm/cpufeature.h |2 ++ arch/x86/include/asm/xsave.h |5 - 2 files changed, 6 insertions(+), 1 deletions(-) hpa/Ingo/Thomas, can you give your Acked-by for this patch? I'm not sure of the consequences of changing XCNTXT_MASK. This series (which was submitted with the wrong threading) wants it so that KVM can use fpu_save_init and fpu_restore_checking to save and restore the MPX state of the guest. Hi, I'm currently reviewing internally another set of patches for MPX support which would at least in part conflict with these. I don't see the rest of the series -- where was it posted? Either way: 1. asm/cpufeatures.h patches should always be separate, as we put those into a special branch into the -tip tree since they touch so many other things. 2. Enabling MPX is only safe with XSTATE_EAGER, which Qiaowei's patchset has done correctly. -hpa
Re: [Qemu-devel] Would virtio support 64 bit address for vring virtqueue?
On 08/28/2013 10:35 AM, Anthony Liguori wrote: Yes, it was originally designed with the 16TB limit in mind. PCI doesn't support 64-bit PIO operations so it would have required a high/low register and additional magic. s/PCI/x86/ Additional magic is needed only if atomic changes are necessary, but in this case this is stuff that is set up during early configuration and so isn't an issue. The additional magic needed in that case is latch logic - the write to the low part doesn't actually take effect until the high part is written. -hpa
Re: [Qemu-devel] [RFC] sanitize memory on system reset
Only on a real 286. At least since 486 the legacy switch has been INIT, not RESET. Alexander Graf ag...@suse.de wrote: Am 14.06.2013 um 08:56 schrieb Christian Borntraeger borntrae...@de.ibm.com: On 13/06/13 13:56, Anthony Liguori wrote: Markus Armbruster arm...@redhat.com writes: Peter Lieven p...@kamp.de writes: On 13.06.2013 10:40, Stefan Hajnoczi wrote: On Thu, Jun 13, 2013 at 08:09:09AM +0200, Peter Lieven wrote: I was thinking if it would be a good idea to zeroize all memory resources on system reset and madvise dontneed them afterwards. This would avoid system reset attacks in case the attacker has only access to the console of a vServer but not on the physical host and it would shrink RSS size of the vServer siginificantly. I wonder if you'll hit weird OS installers or PXE clients that rely on stashing stuff in memory across reset. One point: Wouldn't a memory test which some systems do at startup break these as well? Systems that distinguish between warm and cold boot (such as PCs) generally run POST only on cold boot. I'm not saying triggering warm reboot and expecting memory contents to survive is a good idea, but it has been done. Doesn't kexec do a warm reboot stashing the new kernel somewhere in memory? It does something like that on s390. There is a diagnose instruction to the machine, that resets all subsystems and cpus in a defined state, but lets the memory untouched. So we want to be able to define the components of the system which we are going to reset and have two cases: 1. reset everything and clear the memory 2. just reset the cpus and devices, but leave the memory untouched For case 2 we basically want to avoid memory clearing AND bios reloading Legacy 286 protected mode to real mode switching also happens through the CPU reset PIN, so there certainly is a need to distinguish. Alex -- Sent from my mobile phone. Please excuse brevity and lack of formatting.
Re: [Qemu-devel] [Patch] fix /proc/self/maps output
On 03/29/2013 09:01 AM, Christophe Lyon wrote: Add a space at end of line when there is no filename to print, to conform to linux kernel format. Signed-off-by: Christophe Lyon christophe.l...@linaro.org --- linux-user/syscall.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/linux-user/syscall.c b/linux-user/syscall.c index a148d9f..3b0ca86 100644 --- a/linux-user/syscall.c +++ b/linux-user/syscall.c @@ -5016,7 +5016,7 @@ static int open_self_maps(void *cpu_env, int fd) %c%c%c%c %08 PRIx64 %02x:%02x %d%s%s\n, h2g(min), h2g(max), flag_r, flag_w, flag_x, flag_p, offset, dev_maj, dev_min, inode, -path[0] ?: , path); +path[0] ?: , path); } } Please move the fixed space into the format. Perhaps even use a %-123s type format even... -hpa
Re: [Qemu-devel] [RFC PATCH 1/3] target-i386: Add 486sx, old486, and old486sx CPU models
On 03/28/2013 12:15 PM, Aurelien Jarno wrote: This really looks like Linux kernel specific. I haven't been able to test on a real machine, but the documentation I have found suggest that without and x87 FPU, the FPU instructions are simply ignored. The common way to detect an FPU is therefore to initialize registers to a given value, run fnstsw and fnstcw instructions with the register in arguments and see if they have been modified. The Linux kernel indeed set the initial value of these registers to 0x, but I am not sure all codes are doing the same. For me it looks like better to skip such instructions directly in translate.c. As a bonus it seems easy to do that for all FPU instructions. At least *some* real-life CPUs returned 0x, at the very least the machines with external buses did (e.g. 386 without 387). I don't have access to a live 486SX so I can test if 486SX behaved differenly. -hpa
Re: [Qemu-devel] [RFC PATCH 1/3] target-i386: Add 486sx, old486, and old486sx CPU models
On 03/28/2013 12:15 PM, Aurelien Jarno wrote: This really looks like Linux kernel specific. I haven't been able to test on a real machine, but the documentation I have found suggest that without and x87 FPU, the FPU instructions are simply ignored. The common way to detect an FPU is therefore to initialize registers to a given value, run fnstsw and fnstcw instructions with the register in arguments and see if they have been modified. The Linux kernel indeed set the initial value of these registers to 0x, but I am not sure all codes are doing the same. For me it looks like better to skip such instructions directly in translate.c. As a bonus it seems easy to do that for all FPU instructions. It might have been (and this is from memory, so don't take it for anything) that the register form receives 0x, but the memory form is ignored. -hpa
Re: [Qemu-devel] [RFC PATCH 1/3] target-i386: Add 486sx, old486, and old486sx CPU models
Qemu is absolutely horrid at modeling corner cases. Rob Landley r...@landley.net wrote: On 03/28/2013 03:12:11 PM, H. Peter Anvin wrote: On 03/28/2013 12:15 PM, Aurelien Jarno wrote: This really looks like Linux kernel specific. I haven't been able to test on a real machine, but the documentation I have found suggest that without and x87 FPU, the FPU instructions are simply ignored. The common way to detect an FPU is therefore to initialize registers to a given value, run fnstsw and fnstcw instructions with the register in arguments and see if they have been modified. The Linux kernel indeed set the initial value of these registers to 0x, but I am not sure all codes are doing the same. For me it looks like better to skip such instructions directly in translate.c. As a bonus it seems easy to do that for all FPU instructions. It might have been (and this is from memory, so don't take it for anything) that the register form receives 0x, but the memory form is ignored. Speaking of which, Solar Designer recently found a bug where pentium 3 silently ignores the 66 prefix that later became SSE2, and thus the code ran but produced the wrong result: https://twitter.com/solardiz/status/316204216962142209 https://twitter.com/solardiz/status/316207184134410240 But this isn't what QEMU does: https://twitter.com/solardiz/status/316944417871245313 Rob -- Sent from my mobile phone. Please excuse brevity and lack of formatting.
Re: [Qemu-devel] [RFC PATCH 3/3] mc146818rtc: export the timezone information
On 03/25/2013 02:05 AM, Paolo Bonzini wrote: Interesting, do you have SeaBIOS and/or OVMF patches for this? Not at this point. -hpa
Re: [Qemu-devel] [RFC PATCH 1/3] target-i386: Add 486sx, old486, and old486sx CPU models
On 03/25/2013 08:15 AM, Igor Mammedov wrote: Such changes have been rejected in the past (e.g., n270 Atom). I personally wouldn't object to 486 changes, but I guess it should rather be handled via Igor's CPU static properties that I have in my review queue: The .model value would be set to 8 but the PC machine would be changed alongside to set model = 0 for pc-1.4 and earlier. It doesn't relates to property refactoring nor to slim CPU sub-classes conversion either. So it could go in independently. But is this change safe from migration POV? Well, given that the CPU model presented is actually closer to a model 8 than a model 0 it probably is... but the real question is what would cause someone to do migration of a 486 CPU model. The n270 issue is problematic, because right now n270 can't actually run software compiled for N270... -hpa
Re: [Qemu-devel] [RFC PATCH 1/3] target-i386: Add 486sx, old486, and old486sx CPU models
On 03/25/2013 12:05 PM, Eduardo Habkost wrote: On Mon, Mar 25, 2013 at 11:44:30AM -0700, H. Peter Anvin wrote: On 03/25/2013 08:15 AM, Igor Mammedov wrote: Such changes have been rejected in the past (e.g., n270 Atom). I personally wouldn't object to 486 changes, but I guess it should rather be handled via Igor's CPU static properties that I have in my review queue: The .model value would be set to 8 but the PC machine would be changed alongside to set model = 0 for pc-1.4 and earlier. It doesn't relates to property refactoring nor to slim CPU sub-classes conversion either. So it could go in independently. But is this change safe from migration POV? Well, given that the CPU model presented is actually closer to a model 8 than a model 0 it probably is... but the real question is what would cause someone to do migration of a 486 CPU model. The n270 issue is problematic, because right now n270 can't actually run software compiled for N270... FWIW, I wouldn't mind too much if the maintainers decide to document 486 and n270 as migration-unsafe and then knowingly break live-migration of those CPU models between qemu = 1.3 and qemu = 1.4. It's up to the maintainers to choose which way to go. The right thing, of course (and I believe that's where things are going) is to unwind these descriptions at the time the VM is created; the migration should implement the machine as it was launched. If that isn't practical, then the right thing to do is probably to have some kind of machine description conversion (so, say, 486 can be converted to 486-1.3 containing the legacy description), but telling people that -cpu n270 is something other than a real N270 that can't run N270 software is user-hostile in the extreme. It needs to be possible to fix bugs -hpa
Re: [Qemu-devel] [RFC PATCH 1/3] target-i386: Add 486sx, old486, and old486sx CPU models
Now, all this said... if the only objection is the change of model number for 486 then I suggest just dropping that. -hpa
Re: [Qemu-devel] [RFC PATCH 1/3] target-i386: Add 486sx, old486, and old486sx CPU models
On 03/25/2013 01:56 PM, Eduardo Habkost wrote: It needs to be possible to fix bugs It is possible to fix them today: just write a compat function or add a global variable that is handled by cpu_x86_init(), and call it from the pc-1.3 machine-type init function. See disable_kvm_pv_eoi() and enable_compat_apic_id_mode(), for example. The difference is that this will be much easier once we introduce the static properties: then you just need to add the compat property values to the compat_props field on the machine-type struct. It sounds like this is underway, and since the priority for the 486 bit is very low it is better for that bit to just wait. -hpa
Re: [Qemu-devel] [RFC PATCH 1/3] target-i386: Add 486sx, old486, and old486sx CPU models
Low priority ping on this patchset...? -hpa
Re: [Qemu-devel] virtio-rng and fd passing
On 03/02/2013 04:23 AM, Paolo Bonzini wrote: Il 02/03/2013 04:13, Anthony Liguori ha scritto: There is no valid use-case of rng-random other than using /dev/random. In fact, it was probably a mistake to even allow a filename to be specified because it lets people do silly things (like /dev/urandom). If you want anything other than /dev/random, you should use rng-egd. /dev/hwrng makes sense too. Only if the host isn't using it, which it almost certainly should if there is something there. On the other hand, yes, it is cryptographically sound (since it presents itself as /dev/hwrng in the guest!) and it does make sense for a very thin host. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf.
Re: [Qemu-devel] virtio-rng and fd passing
On 02/28/2013 04:36 PM, Eric Blake wrote: Stefan Berger and I discovered on IRC that virtio-rng is unable to support fd passing. We attempted: qemu-system-x86_64 ... -add-fd set=4,fd=34,opaque=RDONLY:/dev/urandom -object rng-random,id=rng0,filename=/dev/fdset/4 -device virtio-rng-pci,rng=rng0,bus=pci.0,addr=0x6 Unrelated, but you really, really, really don't want to pass /dev/urandom there, use /dev/random. -hpa
Re: [Qemu-devel] 9pfs segfaults on chmod(special)
On 02/28/2013 04:24 AM, M. Mohan Kumar wrote: By default 9p.u is used, you can override by that mount -t 9p -otrans=virtio,version=9p2000.L tag /mnt Shouldn't we change that default? -hpa
Re: [Qemu-devel] virtio-rng and fd passing
The guest kernel already provides the PRNG itself. We have been over this... Stefan Berger stef...@linux.vnet.ibm.com wrote: On 03/01/2013 02:37 PM, H. Peter Anvin wrote: On 02/28/2013 04:36 PM, Eric Blake wrote: Stefan Berger and I discovered on IRC that virtio-rng is unable to support fd passing. We attempted: qemu-system-x86_64 ... -add-fd set=4,fd=34,opaque=RDONLY:/dev/urandom -object rng-random,id=rng0,filename=/dev/fdset/4 -device virtio-rng-pci,rng=rng0,bus=pci.0,addr=0x6 Unrelated, but you really, really, really don't want to pass /dev/urandom there, use /dev/random. From what I am reading about /dev/random is that it will start blocking once not enough entropy is available anymore. Sounds like this could be abused if multiple VMs were using this device and one drains the entropy.. An alternative may be to pick go through a crypto library that seeds itself with entropy and implements random number generators following NIST 800-90 for example. Freebl would offer at least one such implementation: http://dxr.mozilla.org/mozilla-central/security/nss/lib/freebl/drbg.c.html - search for 'NIST' there Stefan -- Sent from my mobile phone. Please excuse brevity and lack of formatting.
Re: [Qemu-devel] [SeaBIOS] [RFC PATCH] Distinguish between reset types
On 02/20/2013 07:37 AM, David Woodhouse wrote: On Wed, 2013-02-20 at 16:34 +0100, Paolo Bonzini wrote: Il 20/02/2013 16:18, Laszlo Ersek ha scritto: I'm beginning to wish I'd just ignored the fact that we need a properly working soft reset to get back from 286 protected mode to real mode, and wired up the damn PAM reset unconditionally. I'm not convinced that the protected-real mode transition will work for anyone anyway. I believe currently we must be somewhere between soft reset hard reset. I estimate getting closer to a well-emulated hard reset is more important than getting closer to a soft one. If you were to extend the i440FX reset handler so that it unconditionally resets the PAMs, I'd give my Rb. (Take it for what it's worth :)) It would actually make sense. The right way to do soft reset has nothing to do with qemu_system_reset_request(). I've posted that version of the patch, with a suitable comment. Right... the soft reset described above is really INIT, which isn't even a reset in modern CPUs (it couldn't be, it has to preserve caches) but more of a special interrupt. It is also used during multiprocessor init. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf.
[Qemu-devel] [RFC PATCH 2/3] target-i386: Raise #UD on accessing non-existent control registers
From: H. Peter Anvin h...@zytor.com If we touch control registers that don't exist, either read or write, raise the #UD exception (undefined opcode). This is useful for testing booting on old CPUs. CR4 is assumed to exist if and only if there are CPU features other than the FPU defined (typically at least VME). Signed-off-by: H. Peter Anvin h...@zytor.com --- target-i386/misc_helper.c | 25 ++--- 1 file changed, 22 insertions(+), 3 deletions(-) diff --git a/target-i386/misc_helper.c b/target-i386/misc_helper.c index 1ff25d1..6da3f32 100644 --- a/target-i386/misc_helper.c +++ b/target-i386/misc_helper.c @@ -154,9 +154,18 @@ target_ulong helper_read_crN(CPUX86State *env, int reg) cpu_svm_check_intercept_param(env, SVM_EXIT_READ_CR0 + reg, 0); switch (reg) { -default: +case 0: +case 2: +case 3: val = env-cr[reg]; break; +case 4: +if (env-cpuid_features = CPUID_FP87) { +raise_exception_err(env, EXCP06_ILLOP, 0); +} else { +val = env-cr[reg]; +} +break; case 8: if (!(env-hflags2 HF2_VINTR_MASK)) { val = cpu_get_apic_tpr(env-apic_state); @@ -164,6 +173,9 @@ target_ulong helper_read_crN(CPUX86State *env, int reg) val = env-v_tpr; } break; +default: +raise_exception_err(env, EXCP06_ILLOP, 0); +break; } return val; } @@ -175,11 +187,18 @@ void helper_write_crN(CPUX86State *env, int reg, target_ulong t0) case 0: cpu_x86_update_cr0(env, t0); break; +case 2: +env-cr[reg] = t0; +break; case 3: cpu_x86_update_cr3(env, t0); break; case 4: -cpu_x86_update_cr4(env, t0); +if (env-cpuid_features = CPUID_FP87) { +raise_exception_err(env, EXCP06_ILLOP, 0); +} else { +cpu_x86_update_cr4(env, t0); +} break; case 8: if (!(env-hflags2 HF2_VINTR_MASK)) { @@ -188,7 +207,7 @@ void helper_write_crN(CPUX86State *env, int reg, target_ulong t0) env-v_tpr = t0 0x0f; break; default: -env-cr[reg] = t0; +raise_exception_err(env, EXCP06_ILLOP, 0); break; } } -- 1.7.11.7
[Qemu-devel] [RFC PATCH 3/3] mc146818rtc: export the timezone information
From: H. Peter Anvin h...@zytor.com There is no standard method for storing timezone information associated with the classic PC/AT RTC, however, there are standard methods in ACPI (Time and Alarm Device) and EFI (GetTime/SetTime) for getting this information. Since these are abstract methods, it is qreally firmware-specific how it is stored, however, since Qemu initializes the RTC in the virtual environment that information needs to come from Qemu in the first place. Non-PC platforms that use the MC146181 RTC may have their own firmware-specific methods as well. The most logical place to stash this information is in the RTC CMOS; not only is it logically co-located with the relevant information, but it is also very easy to access from ACPI bytecode. Thus, save the timezone information in two bytes in CMOS that have no known standard definition, but are yet within the 64 bytes that even the most basic RTC CMOS implementations including the original MC146181 support. Note: all timezones currently in use in the world are on 15-minutes boundaries, which would allow this information to be stored in a single signed byte. However, both EFI and ACPI use a minute-granular interface (specified as -1440 to +1440 with 2047 used to mean unknown, this requires a minimum of 12 bits to represent); this follows that model. Signed-off-by: H. Peter Anvin h...@zytor.com Cc: Kevin O'Connor ke...@koconnor.net Cc: David Woodhouse dw...@infradead.org --- hw/mc146818rtc.c | 6 ++ hw/mc146818rtc_regs.h | 2 ++ 2 files changed, 8 insertions(+) diff --git a/hw/mc146818rtc.c b/hw/mc146818rtc.c index 2fb11f6..72541dd 100644 --- a/hw/mc146818rtc.c +++ b/hw/mc146818rtc.c @@ -681,6 +681,7 @@ static void rtc_set_date_from_host(ISADevice *dev) { RTCState *s = DO_UPCAST(RTCState, dev, dev); struct tm tm; +int minuteseast; qemu_get_timedate(tm, 0); @@ -690,6 +691,11 @@ static void rtc_set_date_from_host(ISADevice *dev) /* set the CMOS date */ rtc_set_cmos(s, tm); + +/* Set the timezone information as a signed 16-bit number of minutes */ +minuteseast = ((int64_t)s-base_rtc - (int64_t)mktime(tm)) / 60; +s-cmos_data[RTC_TIMEZONE_L] = (uint8_t)(minuteseast); +s-cmos_data[RTC_TIMEZONE_H] = (uint8_t)(minuteseast 8); } static int rtc_post_load(void *opaque, int version_id) diff --git a/hw/mc146818rtc_regs.h b/hw/mc146818rtc_regs.h index ccdee42..7dd5e0d 100644 --- a/hw/mc146818rtc_regs.h +++ b/hw/mc146818rtc_regs.h @@ -47,6 +47,8 @@ /* PC cmos mappings */ #define RTC_CENTURY 0x32 #define RTC_IBM_PS2_CENTURY_BYTE 0x37 +#define RTC_TIMEZONE_L 0x3e +#define RTC_TIMEZONE_H 0x3f #define REG_A_UIP 0x80 -- 1.7.11.7
[Qemu-devel] [RFC PATCH 1/3] target-i386: Add 486sx, old486, and old486sx CPU models
From: H. Peter Anvin h...@zytor.com Add models for 486SX, and pre-CPUID versions of the 486 (DX SX). Change the model number for the standard 486DX to a model which actually had CPUID. Note: these models are fairly vestigial, for example most of the FPU operations still work; only F*ST[CS]W have been modified to appear as through there is no FPU. This also changes the classic 486 model number to 8 (DX4) which matches the feature set presented. Signed-off-by: H. Peter Anvin h...@zytor.com --- target-i386/cpu.c | 39 ++--- target-i386/fpu_helper.c | 12 +++-- target-i386/misc_helper.c | 15 target-i386/translate.c | 62 +++ 4 files changed, 75 insertions(+), 53 deletions(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index aab35c7..a5aad19 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -365,8 +365,11 @@ typedef struct x86_def_t { uint32_t cpuid_7_0_ebx_features; } x86_def_t; -#define I486_FEATURES (CPUID_FP87 | CPUID_VME | CPUID_PSE) -#define PENTIUM_FEATURES (I486_FEATURES | CPUID_DE | CPUID_TSC | \ +#define OLD_I486SX_FEATURES 0 +#define OLD_I486_FEATURES CPUID_FP87 +#define I486SX_FEATURES CPUID_VME /* SX2+ */ +#define I486_FEATURES (CPUID_FP87 | CPUID_VME) /* DX4 and some DX2 */ +#define PENTIUM_FEATURES (I486_FEATURES | CPUID_PSE | CPUID_DE | CPUID_TSC | \ CPUID_MSR | CPUID_MCE | CPUID_CX8 | CPUID_MMX | CPUID_APIC) #define PENTIUM2_FEATURES (PENTIUM_FEATURES | CPUID_PAE | CPUID_SEP | \ CPUID_MTRR | CPUID_PGE | CPUID_MCA | CPUID_CMOV | CPUID_PAT | \ @@ -535,16 +538,46 @@ static x86_def_t builtin_x86_defs[] = { .model_id = Genuine Intel(R) CPU T2600 @ 2.16GHz, }, { +.name = old486, +.level = 0, +.vendor = CPUID_VENDOR_INTEL, +.family = 4, +.model = 1, +.stepping = 0, +.features = OLD_I486_FEATURES, +.xlevel = 0, +}, +{ +.name = old486sx, +.level = 0, +.vendor = CPUID_VENDOR_INTEL, +.family = 4, +.model = 2, +.stepping = 0, +.features = OLD_I486SX_FEATURES, +.xlevel = 0, +}, +{ .name = 486, .level = 1, .vendor = CPUID_VENDOR_INTEL, .family = 4, -.model = 0, +.model = 8, .stepping = 0, .features = I486_FEATURES, .xlevel = 0, }, { +.name = 486sx, +.level = 1, +.vendor = CPUID_VENDOR_INTEL, +.family = 4, +.model = 5, +.stepping = 0, +.features = I486SX_FEATURES, +.xlevel = 0, +}, +{ .name = pentium, .level = 1, .vendor = CPUID_VENDOR_INTEL, diff --git a/target-i386/fpu_helper.c b/target-i386/fpu_helper.c index 44f3d27..c4c2724 100644 --- a/target-i386/fpu_helper.c +++ b/target-i386/fpu_helper.c @@ -530,12 +530,20 @@ void helper_fldz_FT0(CPUX86State *env) uint32_t helper_fnstsw(CPUX86State *env) { -return (env-fpus ~0x3800) | (env-fpstt 0x7) 11; +if (!(env-cpuid_features CPUID_FP87)) { +return 0x; +} else { +return (env-fpus ~0x3800) | (env-fpstt 0x7) 11; +} } uint32_t helper_fnstcw(CPUX86State *env) { -return env-fpuc; +if (!(env-cpuid_features CPUID_FP87)) { +return 0x; +} else { +return env-fpuc; +} } static void update_fp_status(CPUX86State *env) diff --git a/target-i386/misc_helper.c b/target-i386/misc_helper.c index b6d5740..1ff25d1 100644 --- a/target-i386/misc_helper.c +++ b/target-i386/misc_helper.c @@ -122,11 +122,16 @@ void helper_cpuid(CPUX86State *env) cpu_svm_check_intercept_param(env, SVM_EXIT_CPUID, 0); -cpu_x86_cpuid(env, (uint32_t)EAX, (uint32_t)ECX, eax, ebx, ecx, edx); -EAX = eax; -EBX = ebx; -ECX = ecx; -EDX = edx; +if (!env-cpuid_level) { +raise_exception_err(env, EXCP06_ILLOP, 0); +} else { +cpu_x86_cpuid(env, (uint32_t)EAX, (uint32_t)ECX, + eax, ebx, ecx, edx); +EAX = eax; +EBX = ebx; +ECX = ecx; +EDX = edx; +} } #if defined(CONFIG_USER_ONLY) diff --git a/target-i386/translate.c b/target-i386/translate.c index 112c310..6d8abff 100644 --- a/target-i386/translate.c +++ b/target-i386/translate.c @@ -103,6 +103,8 @@ typedef struct DisasContext { struct TranslationBlock *tb; int popl_esp_hack; /* for correct popl with esp base handling */ int rip_offset; /* only used in x86_64, but left for simplicity */ +int cpuid_family; +int cpuid_level; int cpuid_features; int cpuid_ext_features; int cpuid_ext2_features; @@ -6513,52 +6515,24 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s, if (s-vm86 s-iopl != 3) { gen_exception(s, EXCP0D_GPF, pc_start - s-cs_base); } else { +uint32_t mask
Re: [Qemu-devel] [edk2] (PAM stuff) reset doesn't work on OVMF + SeaBIOS CSM
On 02/14/2013 12:41 PM, Laszlo Ersek wrote: ). cpu_reset() [target-i386/helper.c] sets CS:IP to f000:fff0, which is the exact address of... reset_vector() in SeaBIOS. This would be a bug, but it isn't quite true. If you look at x86_cpu_reset() you will note that it sets the code segment base to 0x, not 0xf as one could expect from the above. This is also true of a physical x86. As such, the *real* reset vector is at 0xfff0 as opposed to the SeaBIOS vector at 0x0 -- this is a backwards compatibility vector which typically just issues a real reset. Now, if Qemu doesn't handle the distinction here correctly, that is a bug. -hpa
Re: [Qemu-devel] [SeaBIOS] [edk2] (PAM stuff) reset doesn't work on OVMF + SeaBIOS CSM
On 02/14/2013 01:27 PM, David Woodhouse wrote: So it *is* jumping to 0xfff0 but the memory at that location isn't what we expect? Do the PAM registers affect *that* too, or only the region from 0xc-0xf? Surely the contents at 4GiB-δ should be unchanged by *anything* we do with the PAM registers? Or maybe not... after also downloading the i440fx data sheet, I'm even more confused. There's some aliasing with... not the region at 1MiB-δ but the region at 16MiB-δ: (From §4.1 System Address Map): 2. High BIOS Area (FFE0_h−− _h) The top 2 Mbytes of the Extended Memory Region is reserved for System BIOS (High BIOS), extended BIOS for PCI devices, and the A20 alias of the system BIOS. The CPU begins execution from the High BIOS after reset. This region is mapped to the PCI so that the upper subset of this region is aliased to 16 Mbytes minus 256-Kbyte range. That is presumably a 286 compatibility hack -- the 286 had 24 address lines. I doubt anyone gives a hoot about it, and neither EDK2 nor SeaBIOS should care. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf.
Re: [Qemu-devel] [PATCH] target-i386: n270 can MOVBE
This is problematic since the n270 qemu model won't boot a real kernel compiled for that box. Eduardo Habkost ehabk...@redhat.com wrote: On Fri, Feb 08, 2013 at 10:30:02AM +0100, Borislav Petkov wrote: From: Borislav Petkov b...@suse.de The Atom core (cpu name n270 in QEMU speak) supports MOVBE. This is needed when booting 3.8 and later linux kernels built with the MATOM target because we require MOVBE in order to boot properly now. Cc: H. Peter Anvin h...@zytor.com Cc: Richard Henderson r...@twiddle.net Signed-off-by: Borislav Petkov b...@suse.de --- Right, so I was playing with booting MATOM kernels in QEMU. As it turns out, QEMU's n270 model doesn't advertize MOVBE although the real hardware supports it. Quick search pointed me to http://lists.nongnu.org/archive/html/qemu-devel/2013-01/msg04317.html which adds that support, among other things. I've merged Richard's patchset with qemu's current master and after applying this patch below, I can report success booting an MATOM kernel with QEMU. The same kernel boots on the real n270 hardware, btw. target-i386/cpu.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index 9f38e4435e53..83816edd8410 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -610,7 +610,8 @@ static x86_def_t builtin_x86_defs[] = { CPUID_ACPI | CPUID_SS | CPUID_HT | CPUID_TM | CPUID_PBE, /* Some CPUs got no CPUID_SEP */ .ext_features = CPUID_EXT_SSE3 | CPUID_EXT_MONITOR | CPUID_EXT_SSSE3 | -CPUID_EXT_DSCPL | CPUID_EXT_EST | CPUID_EXT_TM2 | CPUID_EXT_XTPR, +CPUID_EXT_DSCPL | CPUID_EXT_EST | CPUID_EXT_TM2 | CPUID_EXT_XTPR | +CPUID_EXT_MOVBE, .ext2_features = (PPRO_FEATURES CPUID_EXT2_AMD_ALIASES) | -machine pc-1.3 -cpu n270 (and older machine-types) needs to keep MOVBE disabled, or you will break live migration. Personally I wouldn't mind declaring n270 as a non-migratable CPU model. But libvirt supports n270, so it already expects n270 to expose a stable ABI on each machine-type. CPUID_EXT2_NX, .ext3_features = CPUID_EXT3_LAHF_LM, -- 1.8.1.2.422.g08c0e7f -- Sent from my mobile phone. Please excuse brevity and lack of formatting.
Re: [Qemu-devel] [PATCH 0/6] add paravirtualization hwrng support
On 10/31/2012 12:29 AM, Paolo Bonzini wrote: Related to this, rdrand's entropy content in the worst case will be only 1/255th of the data it produces: Intel documents that one 256-bit seed will result in up to 1022 64-bit random numbers. Yet, it is good enough to drive rngd. Would it make sense for QEMU to implement the same kind of stretching of /dev/random data, to avoid depleting the host's entropy pool too fast? Absolutely not; in fact, we have to do data reduction in rngd for exactly this reason (and a Qemu backend would have to do the same). There is a new RDSEED instruction in newer CPUs to correct this. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf.
Re: [Qemu-devel] [PATCH 0/6] add paravirtualization hwrng support
On 10/30/2012 02:05 AM, Paolo Bonzini wrote: Either you're not reading what I wrote, or you're confusing me with someone else. My apologies, you are indeed correct. I misinterpreted your emails, probably because I got you confused with someone else. I *never* mentioned passing /dev/urandom, and in fact I explained to Anthony that it is wrong. Please take a look at http://permalink.gmane.org/gmane.comp.emulators.qemu/178123 What I said that passing /dev/hwrng or rdrand would: - not make /dev/random with virtio-rng-pci worse than without It wouldn't, but it would make virtio-rng-pci a potential noop on a system where it could genuinely do better. - make migration working - avoiding denial of service for the host's /dev/random However, it means that if there is an rngd-readable source on the host (e.g. TPM, DRNG) then the guest cannot take advantage of it; if it accesses /dev/random then it would be able to. This is particularly toxic if you turn off DRNG to the host in the name of migration; the DRNG is a very high bandwidth source which is processed directly by rngd since there is no point in doing a detour via /dev/hwrng in the kernel. As such, with your proposed version you would take one of the best possible situations and turn it into the worst possible situation. Furthermore, you are in many ways still causing a DoS on the host, since you are eating up entropy that would otherwise be fed into /dev/random. So there are cases where the situation is much worse with /dev/hwrng than with /dev/random. -hpa
Re: [Qemu-devel] [PATCH 0/8] add paravirtualization hwrng support (v2)
On 10/30/2012 04:02 PM, Anthony Liguori wrote: My take away from all of the various discussions on what the Right Way to use virtio-rng is: 1) /dev/random should always be used as the entropy source (I've left it configurable though) 2) I think the Right Way to configure virtio-rng is to figure out what the available entropy is on the host, and then decide how to allocate that to each guest. As such, I've implemented rate limiting. I think QEMU is the right place to do this because this is a property of specific virtual machines. I can imagine a cloud provider wanting to guarantee a certain level of entropy for different classes of VMs. Even if rngd could be used to do this, configuring it differently for different guests would be cumbersome. rngd is not where this should happen, it should be in the /dev/random implementation in the (host) kernel. That way it is applicable to all types of clients, not just Qemu. 3) `qemu -device virtio-rng-pci` will Just Work but risks exhausting host entropy. This means we can't make it the default for machines. But for most command line users, I think this is the behavior they want. It's a bit unfortunate, but I'm not going to push on that point. Given the migration issue I'll write up an implementation of a DRNG (RDRAND/RDSEED) backend once this is upstream. If RDRAND is disabled in the guest, but available in the host, this would be the one to use. If RDRAND is available in the guest it should be used directly if rngd is new enough, but since virtio-rng has been in the kernel since 2008 there still might be some guests which could use such an implementation without having been RDRAND-enabled. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf.
Re: [Qemu-devel] [PATCH 0/6] add paravirtualization hwrng support
On 10/28/2012 11:23 PM, Amit Shah wrote: One solution could be to feed host's /dev/urandom to readers of guests' /dev/urandom. We could then pass the rare true entropy bits from host's /dev/hwrng or /dev/random to the guest via virtio-rng-pci's /dev/hwrng interface in the guest. If this is a valid idea (host /dev/urandom goes directly to guest's /dev/urandom), we would need some guest-side surgery, but it shouldn't be huge work, and would remove several bottlenecks. Is this a very crazy idea? It's not crazy, it's just pointless. You're doing a completely unnecessary hypercall to run the PRNG in host space. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf.
Re: [Qemu-devel] [PATCH 0/6] add paravirtualization hwrng support
On 10/29/2012 01:45 AM, Paolo Bonzini wrote: First, hwrng is only one of the sources used by rngd. It can also (currently) use RDRAND or TPM; additional sources are likely to be added in the future. Second, the harvesting of environmental noise -- timings -- is not as good in a VM as on plain hardware, so for the no-hwrng case it is better for this to be done in the host than in the VM. Neither of these make /dev/random with virtio-rng-pci worse than without (as would be the case if you fed /dev/urandom). And migration works. This, and avoiding denial of service for the host's /dev/random, is all I care about at this time. There is always time to change defaults to something better. Your logic are roughly on the level with the people who caused the Debian bug. You are proposing something utterly reckless. Sorry, please stop. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf.
Re: [Qemu-devel] [PATCH 0/6] add paravirtualization hwrng support
On 10/29/2012 01:45 AM, Paolo Bonzini wrote: Il 26/10/2012 22:29, H. Peter Anvin ha scritto: This is surreal. Output from /dev/hwrng turns into output for /dev/random... it us guaranteed worse; period, end of story. Isn't that exactly what happens in bare-metal? hwrng - rngd - random. Instead here we'd have, host hwrng - virtio-rng-pci - guest hwrng - guest rngd - guest random. The only difference is that you paravirtualize access to the host hwrng to a) distribute entropy to multiple guests; b) support migration across hosts with different CPUs and hardware. First, hwrng is only one of the sources used by rngd. It can also (currently) use RDRAND or TPM; additional sources are likely to be added in the future. Second, the harvesting of environmental noise -- timings -- is not as good in a VM as on plain hardware, so for the no-hwrng case it is better for this to be done in the host than in the VM. Neither of these make /dev/random with virtio-rng-pci worse than without (as would be the case if you fed /dev/urandom). And migration works. This, and avoiding denial of service for the host's /dev/random, is all I care about at this time. There is always time to change defaults to something better. Let me be more specific. First of all, feeding /dev/urandom to the guest is dangerous -- you are feeding it PRNG contents but telling it that it is real entropy. This is a security hole. Second of all, you're doing something pointless: you are still exhausting the entropy pool on the host at the same rate, and all you end up with is something that isn't what you want. You still have the same DoS on the host /dev/random that you're worried about. Third, you're doing something inefficient: you're running a PRNG in the host which could be run more efficiently in guest space. From an Intel perspective I guess I should be happy, as it functionally would mean that unless you have RDRAND in the host you're insecure, but I'd much rather see the Right Thing done. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf.
Re: [Qemu-devel] [PATCH 0/6] add paravirtualization hwrng support
On 10/26/2012 08:42 AM, Anthony Liguori wrote: Is /dev/random even appropriate to feed rngd? rngd needs _a lot_ of entropy to even start working. Its randomness test works in groups of 2 bits. On a system without an hardware RNG, /dev/random can hardly produce 4000 bits/minute. This means a guest will not get any entropy boost for 5 minutes after it's started, even if we allow it to exhaust the parent's entropy. I don't know, but rng-random is a non-blocking backend so it can handle /dev/random, /dev/urandom, or /dev/hwrng. /dev/urandom is just plain *wrong*... it is feeding a PRNG into a PRNG which can best be described as masturbation and at worst as a cryptographic usage violation. /dev/hwrng is reasonable, in some ways; after all, the guest itself is expected to use rngd. There are, however, at least two problems: a) it means that the guest *has* to run rngd or a similar engine; if you have control over the guests it might be more efficient to run rngd in skip-test mode (I don't think that is currently implemented but it could/should be) and centralize all testing to the host. A skip-test mode would also allow rngd to forward-feed shorter blocks than 2500 bytes. b) if the host has no physical hwrng, /dev/hwrng will output nothing at all, which is worse than /dev/random in that situation. Stefan Berger suggested a backend that uses a PRNG in FreeBL. That's probably the best default since it punts to a userspace library to deal with ensuring there's adequate whitening/entropy to start with. We SHOULD NOT expose a PRNG here! It is the same fail as using /dev/urandom (but worse) The whole point is to inject actual entropy... a PRNG can (and typically will) just run in guest space. Maybe rdrand, but that's just a chardev---so why isn't this enough: -chardev file,source=on,path=/dev/hwrng,id=chr0 -device virtio-rng-pci,file=chr0 -chardev rdrand,id=chr0 -device virtio-rng-pci,file=chr0 -chardev socket,host=localhost,port=1024,id=chr0 -device virtio-rng-pci,rng=chr0,egd=on (which I suggested in my reply to Amit)? If you have rdrand you might just use it in the guest directly, unless you have a strong reason (migration?) not to do that. Either way, for rdrand you need whitening similar to what rngd is doing (for *rdseed* you do not, but rdseed is not shipping yet.) The startup issue is an interesting problem. If you have full control over the guest it might be best to simply inject some entropy into the guest on startup via the initramfs or a disk image; that has its own awkwardness too, of course. The one bit that could potentially be solved in Qemu would be an option to don't start the guest until X bytes of entropy have been gathered. Overall, I want to emphasize that we don't want to try solve generic problems in virtualization space... resource constraints on /dev/random is a generic host OS issue for example. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf.
Re: [Qemu-devel] [PATCH 0/6] add paravirtualization hwrng support
PRNGs don't create entropy. Period. The guest will run its own PRNG. Anthony Liguori aligu...@us.ibm.com wrote: H. Peter Anvin h...@zytor.com writes: On 10/26/2012 08:42 AM, Anthony Liguori wrote: Is /dev/random even appropriate to feed rngd? rngd needs _a lot_ of entropy to even start working. Its randomness test works in groups of 2 bits. On a system without an hardware RNG, /dev/random can hardly produce 4000 bits/minute. This means a guest will not get any entropy boost for 5 minutes after it's started, even if we allow it to exhaust the parent's entropy. I don't know, but rng-random is a non-blocking backend so it can handle /dev/random, /dev/urandom, or /dev/hwrng. /dev/urandom is just plain *wrong*... it is feeding a PRNG into a PRNG which can best be described as masturbation and at worst as a cryptographic usage violation. I don't understand your logic here. From the discussions I've had, the quality of the randomness from a *well seeded* PRNG ought to be good enough to act as an entropy source within the guest. What qualifies as well seeded is a bit difficult to pin down with more specificity than kilobytes of data. I stayed away from /dev/urandom primarily because it's impossible to determine if it's well seeded or not making urandom dangerous to use. But using a PRNG makes sense to me when dealing with multiple guests. If you have a finite source of entropy in the host, using a PRNG to create unique entropy for each guest is certainly better than duplicating entropy. Adding Ted T'so and a few others to CC in hopes that they can chime in here too. FWIW, none of this should affect this series being merged as it can use a variety of different inputs. But I would like to have a strong recommendation for what people should use (and make that default) so I'd really like to get a clear answer here. Regards, Anthony Liguori /dev/hwrng is reasonable, in some ways; after all, the guest itself is expected to use rngd. There are, however, at least two problems: a) it means that the guest *has* to run rngd or a similar engine; if you have control over the guests it might be more efficient to run rngd in skip-test mode (I don't think that is currently implemented but it could/should be) and centralize all testing to the host. A skip-test mode would also allow rngd to forward-feed shorter blocks than 2500 bytes. b) if the host has no physical hwrng, /dev/hwrng will output nothing at all, which is worse than /dev/random in that situation. Stefan Berger suggested a backend that uses a PRNG in FreeBL. That's probably the best default since it punts to a userspace library to deal with ensuring there's adequate whitening/entropy to start with. We SHOULD NOT expose a PRNG here! It is the same fail as using /dev/urandom (but worse) The whole point is to inject actual entropy... a PRNG can (and typically will) just run in guest space. Maybe rdrand, but that's just a chardev---so why isn't this enough: -chardev file,source=on,path=/dev/hwrng,id=chr0 -device virtio-rng-pci,file=chr0 -chardev rdrand,id=chr0 -device virtio-rng-pci,file=chr0 -chardev socket,host=localhost,port=1024,id=chr0 -device virtio-rng-pci,rng=chr0,egd=on (which I suggested in my reply to Amit)? If you have rdrand you might just use it in the guest directly, unless you have a strong reason (migration?) not to do that. Either way, for rdrand you need whitening similar to what rngd is doing (for *rdseed* you do not, but rdseed is not shipping yet.) The startup issue is an interesting problem. If you have full control over the guest it might be best to simply inject some entropy into the guest on startup via the initramfs or a disk image; that has its own awkwardness too, of course. The one bit that could potentially be solved in Qemu would be an option to don't start the guest until X bytes of entropy have been gathered. Overall, I want to emphasize that we don't want to try solve generic problems in virtualization space... resource constraints on /dev/random is a generic host OS issue for example. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. -- Sent from my mobile phone. Please excuse brevity and lack of formatting.
Re: [Qemu-devel] [PATCH 0/6] add paravirtualization hwrng support
This is surreal. Output from /dev/hwrng turns into output for /dev/random... it us guaranteed worse; period, end of story. I don't know who the agreement is with, but it is ridiculous in this case. As far as the need whitening issue, I discussed that with Ted at Kernel Summit and we came up with an outline for what we can do to improve the current situation. It is challenging to deal with the patanoia crowd at the same time. Paolo Bonzini pbonz...@redhat.com wrote: Il 26/10/2012 18:09, H. Peter Anvin ha scritto: a) it means that the guest *has* to run rngd or a similar engine; if you have control over the guests it might be more efficient to run rngd in skip-test mode (I don't think that is currently implemented but it could/should be) and centralize all testing to the host. A skip-test mode would also allow rngd to forward-feed shorter blocks than 2500 bytes. That would be something we can communicate via virtio-rng-pci feature bits. Perhaps /dev/hwrng could also expose a need-whitening property in sysfs. b) if the host has no physical hwrng, /dev/hwrng will output nothing at all, which is worse than /dev/random in that situation. Not really, it would just mean that the guest takes more time to gather entropy, just like if you had no virtio-rng-pci at all. If you have rdrand you might just use it in the guest directly, unless you have a strong reason (migration?) not to do that. Either way, for rdrand you need whitening similar to what rngd is doing (for *rdseed* you do not, but rdseed is not shipping yet.) Yes, migration is a good reason. The startup issue is an interesting problem. If you have full control over the guest it might be best to simply inject some entropy into the guest on startup via the initramfs or a disk image; that has its own awkwardness too, of course. The one bit that could potentially be solved in Qemu would be an option to don't start the guest until X bytes of entropy have been gathered. Overall, I want to emphasize that we don't want to try solve generic problems in virtualization space... resource constraints on /dev/random is a generic host OS issue for example. Yes, but the agreed solution right now is that programs should not ask for more than 32 bytes or so from /dev/random. Which means /dev/random is not a suitable backend for virtio-rng-pci as things stand now. Paolo -- Sent from my mobile phone. Please excuse brevity and lack of formatting.
Re: [Qemu-devel] [PATCH 0/6] add paravirtualization hwrng support
That statement is pretty toxic... I wonder where it came from. It is at best horribly misleading and actively encourages dangerous behaviours even for the cases where it isn't actively wrong. Paolo Bonzini pbonz...@redhat.com wrote: Il 26/10/2012 21:07, H. Peter Anvin ha scritto: This is surreal. Output from /dev/hwrng turns into output for /dev/random... it us guaranteed worse; period, end of story. Isn't that exactly what happens in bare-metal? hwrng - rngd - random. Instead here we'd have, host hwrng - virtio-rng-pci - guest hwrng - guest rngd - guest random. The only difference is that you paravirtualize access to the host hwrng to a) distribute entropy to multiple guests; b) support migration across hosts with different CPUs and hardware. I don't know who the agreement is with, but it is ridiculous in this case. man 4 random: While some safety margin above that minimum is reasonable, as a guard against flaws in the CPRNG algorithm, no cryptographic primitive available today can hope to promise more than 256 bits of security, so if any program reads more than 256 bits (32 bytes) from the kernel random pool per invocation, or per reasonable reseed interval (not less than one minute), that should be taken as a sign that its cryptography is not skilfully implemented. Paolo -- Sent from my mobile phone. Please excuse brevity and lack of formatting.
Re: [Qemu-devel] [PATCH 0/6] add paravirtualization hwrng support
On 10/26/2012 12:51 PM, Paolo Bonzini wrote: Il 26/10/2012 21:07, H. Peter Anvin ha scritto: This is surreal. Output from /dev/hwrng turns into output for /dev/random... it us guaranteed worse; period, end of story. Isn't that exactly what happens in bare-metal? hwrng - rngd - random. Instead here we'd have, host hwrng - virtio-rng-pci - guest hwrng - guest rngd - guest random. The only difference is that you paravirtualize access to the host hwrng to a) distribute entropy to multiple guests; b) support migration across hosts with different CPUs and hardware. First, hwrng is only one of the sources used by rngd. It can also (currently) use RDRAND or TPM; additional sources are likely to be added in the future. Second, the harvesting of environmental noise -- timings -- is not as good in a VM as on plain hardware, so for the no-hwrng case it is better for this to be done in the host than in the VM. -hpa
[Qemu-devel] [PATCH] x86: Implement SMEP and SMAP
From: H. Peter Anvin h...@linux.intel.com This patch implements Supervisor Mode Execution Prevention (SMEP) and Supervisor Mode Access Prevention (SMAP) for x86. The purpose of the patch, obviously, is to help kernel developers debug the support for those features. A fair bit of the code relates to the handling of CPUID features. The CPUID code probably would get greatly simplified if all the feature bit words were unified into a single vector object, but in the interest of producing a minimal patch for SMEP/SMAP, and because I had very limited time for this project, I followed the existing style. Signed-off-by: H. Peter Anvin h...@linux.intel.com --- target-i386/cc_helper.c | 10 ++ target-i386/cpu.c | 33 - target-i386/cpu.h | 33 + target-i386/helper.c| 95 + target-i386/helper.h| 2 ++ target-i386/translate.c | 28 +++ 6 files changed, 164 insertions(+), 37 deletions(-) diff --git a/target-i386/cc_helper.c b/target-i386/cc_helper.c index 07892f9..9422003 100644 --- a/target-i386/cc_helper.c +++ b/target-i386/cc_helper.c @@ -353,6 +353,16 @@ void helper_sti(CPUX86State *env) env-eflags |= IF_MASK; } +void helper_clac(CPUX86State *env) +{ +env-eflags = ~AC_MASK; +} + +void helper_stac(CPUX86State *env) +{ +env-eflags |= AC_MASK; +} + #if 0 /* vm86plus instructions */ void helper_cli_vm(CPUX86State *env) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index 423e009..56d0977 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -100,6 +100,13 @@ static const char *svm_feature_name[] = { NULL, NULL, NULL, NULL, }; +static const char *cpuid_7_0_ebx_feature_name[] = { +NULL, NULL, NULL, NULL, NULL, NULL, NULL, smep, +NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, +NULL, NULL, NULL, NULL, smap, NULL, NULL, NULL, +NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, +}; + /* collects per-function cpuid data */ typedef struct model_features_t { @@ -215,14 +222,16 @@ static void add_flagname_to_bitmaps(const char *flagname, uint32_t *features, uint32_t *ext2_features, uint32_t *ext3_features, uint32_t *kvm_features, -uint32_t *svm_features) +uint32_t *svm_features, +uint32_t *cpuid_7_0_ebx_features) { if (!lookup_feature(features, flagname, NULL, feature_name) !lookup_feature(ext_features, flagname, NULL, ext_feature_name) !lookup_feature(ext2_features, flagname, NULL, ext2_feature_name) !lookup_feature(ext3_features, flagname, NULL, ext3_feature_name) !lookup_feature(kvm_features, flagname, NULL, kvm_feature_name) -!lookup_feature(svm_features, flagname, NULL, svm_feature_name)) +!lookup_feature(svm_features, flagname, NULL, svm_feature_name) +!lookup_feature(cpuid_7_0_ebx_features, flagname, NULL, cpuid_7_0_ebx_feature_name)) fprintf(stderr, CPU feature %s not found\n, flagname); } @@ -285,6 +294,7 @@ typedef struct x86_def_t { #define TCG_EXT3_FEATURES (CPUID_EXT3_LAHF_LM | CPUID_EXT3_SVM | \ CPUID_EXT3_CR8LEG | CPUID_EXT3_ABM | CPUID_EXT3_SSE4A) #define TCG_SVM_FEATURES 0 +#define TCG_7_0_EBX_FEATURES (CPUID_7_0_EBX_SMEP | CPUID_7_0_EBX_SMAP) /* maintains list of cpu model definitions */ @@ -295,7 +305,7 @@ static x86_def_t *x86_defs = {NULL}; static x86_def_t builtin_x86_defs[] = { { .name = qemu64, -.level = 4, +.level = 7, .vendor1 = CPUID_VENDOR_AMD_1, .vendor2 = CPUID_VENDOR_AMD_2, .vendor3 = CPUID_VENDOR_AMD_3, @@ -310,6 +320,7 @@ static x86_def_t builtin_x86_defs[] = { CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX, .ext3_features = CPUID_EXT3_LAHF_LM | CPUID_EXT3_SVM | CPUID_EXT3_ABM | CPUID_EXT3_SSE4A, +.cpuid_7_0_ebx_features = CPUID_7_0_EBX_SMEP | CPUID_7_0_EBX_SMAP, .xlevel = 0x800A, }, { @@ -873,10 +884,12 @@ static int cpu_x86_find_by_name(x86_def_t *x86_cpu_def, const char *cpu_model) uint32_t plus_features = 0, plus_ext_features = 0; uint32_t plus_ext2_features = 0, plus_ext3_features = 0; uint32_t plus_kvm_features = 0, plus_svm_features = 0; +uint32_t plus_7_0_ebx_features = 0; /* Features to be removed */ uint32_t minus_features = 0, minus_ext_features = 0; uint32_t minus_ext2_features = 0, minus_ext3_features = 0; uint32_t minus_kvm_features = 0, minus_svm_features = 0; +uint32_t minus_7_0_ebx_features = 0; uint32_t numvalue; for (def = x86_defs; def; def = def-next) @@ -903,8 +916,8 @@ static int cpu_x86_find_by_name(x86_def_t *x86_cpu_def, const char *cpu_model) #endif add_flagname_to_bitmaps(hypervisor
Re: [Qemu-devel] [PATCH] x86: Implement SMEP and SMAP
On 09/26/2012 12:50 PM, Anthony Liguori wrote: The patch looks good except for these two chunks. This would break live migration from a new QEMU to an old one because CPUs are currently not versioned. If you just remove these two chunks, the patch can be applied and you can still test with: qemu-system-x86_64 -cpu qemu64,+smep,+smap We're working vcpu versioning and will hopefully have it in place for 1.3. If we get there, we can enable these features unconditionally and add the appropriate backwards compatibility code. OK. I had to add a chunk to up the minimum CPUID level to 7 if any of the level 7 features were enabled, however. scripts/checkpatch.pl will complain about lack of curly braces with ifs. I know this file does not use them consistently. Since you have to respin anyway, please run checkpatch and fixup the errors. Actually, several of the files are very consistent about *not* using them. I'll respin the patch, but it ain't pretty. -hpa
[Qemu-devel] [PATCH v2] x86: Implement SMEP and SMAP
From: H. Peter Anvin h...@linux.intel.com This patch implements Supervisor Mode Execution Prevention (SMEP) and Supervisor Mode Access Prevention (SMAP) for x86. The purpose of the patch, obviously, is to help kernel developers debug the support for those features. A fair bit of the code relates to the handling of CPUID features. The CPUID code probably would get greatly simplified if all the feature bit words were unified into a single vector object, but in the interest of producing a minimal patch for SMEP/SMAP, and because I had very limited time for this project, I followed the existing style. [ v2: don't change the definition of the qemu64 CPU shorthand, since that breaks loading old snapshots. Per Anthony Liguori this can be fixed once the CPU feature set is snapshot. Change the coding style slightly to conform to checkpatch.pl. ] Signed-off-by: H. Peter Anvin h...@linux.intel.com --- target-i386/cc_helper.c | 10 +++ target-i386/cpu.c | 34 --- target-i386/cpu.h | 33 -- target-i386/helper.c| 150 ++- target-i386/helper.h|2 + target-i386/translate.c | 27 +++-- 6 files changed, 207 insertions(+), 49 deletions(-) diff --git a/target-i386/cc_helper.c b/target-i386/cc_helper.c index 07892f9..9422003 100644 --- a/target-i386/cc_helper.c +++ b/target-i386/cc_helper.c @@ -353,6 +353,16 @@ void helper_sti(CPUX86State *env) env-eflags |= IF_MASK; } +void helper_clac(CPUX86State *env) +{ +env-eflags = ~AC_MASK; +} + +void helper_stac(CPUX86State *env) +{ +env-eflags |= AC_MASK; +} + #if 0 /* vm86plus instructions */ void helper_cli_vm(CPUX86State *env) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index fd4fe28..f186439 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -100,6 +100,13 @@ static const char *svm_feature_name[] = { NULL, NULL, NULL, NULL, }; +static const char *cpuid_7_0_ebx_feature_name[] = { +NULL, NULL, NULL, NULL, NULL, NULL, NULL, smep, +NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, +NULL, NULL, NULL, NULL, smap, NULL, NULL, NULL, +NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, +}; + /* collects per-function cpuid data */ typedef struct model_features_t { @@ -215,14 +222,17 @@ static void add_flagname_to_bitmaps(const char *flagname, uint32_t *features, uint32_t *ext2_features, uint32_t *ext3_features, uint32_t *kvm_features, -uint32_t *svm_features) +uint32_t *svm_features, +uint32_t *cpuid_7_0_ebx_features) { if (!lookup_feature(features, flagname, NULL, feature_name) !lookup_feature(ext_features, flagname, NULL, ext_feature_name) !lookup_feature(ext2_features, flagname, NULL, ext2_feature_name) !lookup_feature(ext3_features, flagname, NULL, ext3_feature_name) !lookup_feature(kvm_features, flagname, NULL, kvm_feature_name) -!lookup_feature(svm_features, flagname, NULL, svm_feature_name)) +!lookup_feature(svm_features, flagname, NULL, svm_feature_name) +!lookup_feature(cpuid_7_0_ebx_features, flagname, NULL, +cpuid_7_0_ebx_feature_name)) fprintf(stderr, CPU feature %s not found\n, flagname); } @@ -284,6 +294,7 @@ typedef struct x86_def_t { #define TCG_EXT3_FEATURES (CPUID_EXT3_LAHF_LM | CPUID_EXT3_SVM | \ CPUID_EXT3_CR8LEG | CPUID_EXT3_ABM | CPUID_EXT3_SSE4A) #define TCG_SVM_FEATURES 0 +#define TCG_7_0_EBX_FEATURES (CPUID_7_0_EBX_SMEP | CPUID_7_0_EBX_SMAP) /* maintains list of cpu model definitions */ @@ -1091,10 +1102,12 @@ static int cpu_x86_find_by_name(x86_def_t *x86_cpu_def, const char *cpu_model) uint32_t plus_features = 0, plus_ext_features = 0; uint32_t plus_ext2_features = 0, plus_ext3_features = 0; uint32_t plus_kvm_features = 0, plus_svm_features = 0; +uint32_t plus_7_0_ebx_features = 0; /* Features to be removed */ uint32_t minus_features = 0, minus_ext_features = 0; uint32_t minus_ext2_features = 0, minus_ext3_features = 0; uint32_t minus_kvm_features = 0, minus_svm_features = 0; +uint32_t minus_7_0_ebx_features = 0; uint32_t numvalue; for (def = x86_defs; def; def = def-next) @@ -1121,8 +1134,8 @@ static int cpu_x86_find_by_name(x86_def_t *x86_cpu_def, const char *cpu_model) #endif add_flagname_to_bitmaps(hypervisor, plus_features, -plus_ext_features, plus_ext2_features, plus_ext3_features, -plus_kvm_features, plus_svm_features); +plus_ext_features, plus_ext2_features, plus_ext3_features, +plus_kvm_features, plus_svm_features, plus_7_0_ebx_features); featurestr = strtok(NULL, ,); @@ -1132,12 +1145,12 @@ static int cpu_x86_find_by_name(x86_def_t
Re: [Qemu-devel] [PATCH 07/22] target-i386: convert cpuid features into properties
On 09/26/2012 01:32 PM, Igor Mammedov wrote: + +static void x86_cpuid_get_feature(Object *obj, Visitor *v, void *opaque, + const char *name, Error **errp) +{ +X86CPU *cpu = X86_CPU(obj); +CPUX86State *env = cpu-env; +bool value = true; + +if (!is_feature_set(name, env-cpuid_features, feature_name) + !is_feature_set(name, env-cpuid_ext_features, ext_feature_name) + !is_feature_set(name, env-cpuid_ext2_features, ext2_feature_name) + !is_feature_set(name, env-cpuid_ext3_features, ext3_feature_name) + !is_feature_set(name, env-cpuid_kvm_features, kvm_feature_name) + !is_feature_set(name, env-cpuid_svm_features, svm_feature_name)) { +value = false; +} + If you're going to do a full-blown restructuring of the CPUID handling, how about actually turning this into a composite object, specifically an array, instead of this insane collection of arguments (which needs to be augmented)? -hpa
Re: [Qemu-devel] [PATCH v3 0/1] virtio-rng: hardware random number generator
On 06/19/2012 11:59 PM, Amit Shah wrote: Hello, Here's the 3rd iteration of the virtio-rng device. This update just rebases the patch on top of current master. Details on the patch in the commit message. Hi everyone... I just stumbled on this patchset after realizing that the virtio-rng support still is not even in the Qemu git tree even though the kernel side has been there for four years(!) It seems this support has been stuck in overengineering hell for years. I have to admit to having a bit of a confusion about what is so hard about reading from a file descriptor, which may return partial reads. I understand the fairness argument, but I would argue that if it is an actual problem should be solved in the kernel and therefore benefit all types of applications rather than at the application level (which Qemu, effectively, is.) However, one key error I spotted was using /dev/urandom. Using /dev/urandom is a very serious cryptographic error, as well as completely pointless. The whole point of this is to provided distilled entropy to the guest, so that it can seed true entropy into *its* entropy pool. As such, using /dev/urandom is: a) needlessly slow. It will churn the host kernel PRNG needlessly, and not provide any backpressure when the host pool is already drained. Using /dev/random instead will indicate that the host pool is drained, and not pass a bunch of PRNG noise across an expensive channel when the PRNG in the *guest* could do the PRNG expansion just as well. b) cryptographically incorrect. This will tell the guest that it is being provided with entropy that it doesn't actually have, which will mean the guest severely overestimates the entropy that it has available to it. To put it differently: /dev/random in the guest will behave like (a very expensive version of) /dev/urandom from a cryptographic point of view. The right interface to the host kernel, therefore, is /dev/random. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf.
Re: [Qemu-devel] [PATCH v3 0/1] virtio-rng: hardware random number generator
On 09/16/2012 04:23 PM, Anthony Liguori wrote: This is *exactly* what the problem is. If using /dev/urandom is pointless--and so far, many people have made compelling arguments that it is--then using /dev/random is seemingly impossible to do fairly. It is not merely pointless, it is a security hole. The virtio-rng interface doesn't have any notion of guarantee of entropy availability. The guest just keeps requesting entropy with no indication to the hypervisor of what it should and shouldn't give. With the latest fixes to rngd this is no longer true. This *was* a bug in rngd 4; it would always claim to be entropy-starved (and so a guest running rngd 4 will have this pathology, but no distro is known to run rngd 4 by default due to a number of other problems with these versions of rngd). This has been fixed. If that backpressure isn't propagated through the virtio-rng driver then that does need to be fixed, but from at least a cursory look I think it does. Since /dev/random is a finite pool, it's quite possible that one guest could quickly exhaust /dev/random for all other guests. How is this not a clear denial of service? Well, by that definition the fact that the service hasn't been provided at all until now is a bigger denial of service... This is why supporting EGD is so important IMHO. Something else needs to deal with handing out entropy in a responsible way. No, you're missing the key point. Fixing this in applications (and from the host kernel's perspective, this is an application) is broken, because it is not just Qemu that has that property. If this is an actual problem (and it's not clear to me that it is in anything but theory, because although the kernel doesn't do round robin it will at least provide amount of stochastic fairness) then it is in the kernel that the fix belongs. Throwing an extra daemon -- one which doesn't even claim to be designed for this purpose -- into the middle is just silly. Either which way, it can easily be handled via a sane fairness daemon which doesn't need a complicated protocol. Anyway, this is on my list for 1.3. The series just needs some light dusting before resubmission. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf.
Re: [Qemu-devel] [PATCH v3 0/1] virtio-rng: hardware random number generator
Right, obviously. However, under no circumstances should /dev/urandom be used! Amit Shah amit.s...@redhat.com wrote: On (Sun) 16 Sep 2012 [13:42:46], H. Peter Anvin wrote: On 06/19/2012 11:59 PM, Amit Shah wrote: Hello, Here's the 3rd iteration of the virtio-rng device. This update just rebases the patch on top of current master. Details on the patch in the commit message. Hi everyone... I just stumbled on this patchset after realizing that the virtio-rng support still is not even in the Qemu git tree even though the kernel side has been there for four years(!) It seems this support has been stuck in overengineering hell for years. I have to admit to having a bit of a confusion about what is so hard about reading from a file descriptor, which may return partial reads. I understand the fairness argument, but I would argue that if it is an actual problem should be solved in the kernel and therefore benefit all types of applications rather than at the application level (which Qemu, effectively, is.) However, one key error I spotted was using /dev/urandom. Using /dev/urandom is a very serious cryptographic error, as well as completely pointless. The whole point of this is to provided distilled entropy to the guest, so that it can seed true entropy into *its* entropy pool. As such, using /dev/urandom is: a) needlessly slow. It will churn the host kernel PRNG needlessly, and not provide any backpressure when the host pool is already drained. Using /dev/random instead will indicate that the host pool is drained, and not pass a bunch of PRNG noise across an expensive channel when the PRNG in the *guest* could do the PRNG expansion just as well. b) cryptographically incorrect. This will tell the guest that it is being provided with entropy that it doesn't actually have, which will mean the guest severely overestimates the entropy that it has available to it. To put it differently: /dev/random in the guest will behave like (a very expensive version of) /dev/urandom from a cryptographic point of view. The right interface to the host kernel, therefore, is /dev/random. Agreed with your points -- /dev/random on the host itself could be fed in via a real hwrng. Also, for the fairness arguments, several guests doing IO also contribute to the random pool. The ideal interface, though, in qemu should be sourcing entropy from a file descriptor, and the admin chooses what he wants to source entropy from - /dev/random, /dev/urandom, or even /dev/hwrng. Amit -- Sent from my mobile phone. Please excuse brevity and lack of formatting.
Re: [Qemu-devel] x86, nops settings result in kernel crash
On 08/16/2012 11:53 AM, Alan Cox wrote: Yes, if I remove the break statement (introduced by this commit), it works fine. What version of qemu is this - do we have qemu bug here I wonder. Also, is it 32 or 64 bits? -hpa
[Qemu-devel] Re: [PATCH 5/6] Add support for a USB audio device model
On 10/28/2010 1:01 AM, Gerd Hoffmann wrote: On 10/27/10 19:13, H. Peter Anvin wrote: On 10/27/2010 9:04 AM, Gerd Hoffmann wrote: This brings a usb audio device to qemu. Output only, fixed at 16bit stereo @ 48 Hz. Based on a patch from H. Peter Anvinh...@linux.intel.com Please don't apply this. This patch is outdated and has several glaring bugs. I can send a better patch later (I'm in an all-day meeting with insanely bad networking.) Oh, I thought you've abandoned the project ... Note that I've update the patch too and fixed at least some of those bugs. Nevertheless I'll be happy to see your latest bits and try to merge the best of both versions. I have a git tree on kernel.org with the latest version. However, the rate-matching code needs to be ripped out since it just plain doesn't work, and tends to produce worse results than no rate matching: http://git.kernel.org/?p=virt/qemu/hpa/qemu-usb-audio-wip.git;a=shortlog;h=refs/heads/usb-audio -hpa
[Qemu-devel] Re: [PATCH 5/6] Add support for a USB audio device model
On 10/28/2010 1:01 AM, Gerd Hoffmann wrote: On 10/27/10 19:13, H. Peter Anvin wrote: On 10/27/2010 9:04 AM, Gerd Hoffmann wrote: This brings a usb audio device to qemu. Output only, fixed at 16bit stereo @ 48 Hz. Based on a patch from H. Peter Anvinh...@linux.intel.com Please don't apply this. This patch is outdated and has several glaring bugs. I can send a better patch later (I'm in an all-day meeting with insanely bad networking.) Oh, I thought you've abandoned the project ... Note that I've update the patch too and fixed at least some of those bugs. Nevertheless I'll be happy to see your latest bits and try to merge the best of both versions. cheers, Gerd Grmf ... looks like I have multiple pieces of code in multiple places... I need to reconcile the best version I have by hand first. -hpa
[Qemu-devel] Re: [PATCH 5/6] Add support for a USB audio device model
On 10/28/2010 1:01 AM, Gerd Hoffmann wrote: On 10/27/10 19:13, H. Peter Anvin wrote: On 10/27/2010 9:04 AM, Gerd Hoffmann wrote: This brings a usb audio device to qemu. Output only, fixed at 16bit stereo @ 48 Hz. Based on a patch from H. Peter Anvinh...@linux.intel.com Please don't apply this. This patch is outdated and has several glaring bugs. I can send a better patch later (I'm in an all-day meeting with insanely bad networking.) Oh, I thought you've abandoned the project ... Note that I've update the patch too and fixed at least some of those bugs. Nevertheless I'll be happy to see your latest bits and try to merge the best of both versions. I'm going to have to look at this later... I'm in a F2F meeting most of the day today. -hpa
Re: [Qemu-devel] Hitting 29 NIC limit
On 10/14/2010 05:57 AM, Anthony Liguori wrote: I've always been sceptical of this. When physical systems have a large number of NICs, it's via multiple functions, not a bunch of PCI bridges. Actually a lot of multiport PCI cards are in fact single or dual NICs behind PCI bridges. -hpa
[Qemu-devel] Re: [PATCH] [RFC] Add support for a USB audio device model
On 10/14/2010 06:51 AM, Mike Snitzer wrote: Was just wondering if you've been able to put some time to the rate-matching issues? Has this usb-audio patch evolved and I'm just missing it? Thanks for doing this work! Mike The sad result really is: it doesn't work, and it probably will never work. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf.
Re: [SeaBIOS] [Qemu-devel] [RFC] Passing boot order from qemu to seabios
On 10/12/2010 12:06 PM, Gleb Natapov wrote: This is true to some extent -- there is some standard content, and some further can be described via ACPI tables. However, my point was mostly that it is an existing model for nonvolatile storage which also works on hardware (and is vastly simpler albeit smaller in size than ESCD). And my point is why would we want nonvolatile storage for BIOS settings in qemu. It doesn't provide anything that can't be done through command line and configured nicely by virt-manager and it introduces one more file to carry around with your VM. And if the idea is to create it on the fly then it is no longer nonvolatile and is not better then fw_cfg. If we want nonvolatile storage for some reason I would agree that CMOS is good candidate for that except of its size. How much can you fit into 128 byte? Less then one tweet. 128 bytes isn't a hard limit; the original PC/AT actually had only 64 bytes, but the standard interface allows 128 bytes. However, there is a semi-standard (common extension) interface using ports 72/73 which allows 256 bytes, and quite a few chipsets have added further extensions. The ACPI specification recognizes three interfaces as standard: PC/AT (64 bytes, even though 128 bytes is available on a lot of platforms), PIIX4 (256 bytes), and Dallas Semiconductor (256 bytes or more). The interface for the latter isn't well cited in the ACPI spec, but I'm guessing this is referring to the DS17885 series of chips, which can have up to 8K CMOS using a bank-switched scheme which presents 128 bytes at a time (thus accessible via only the standard 70/71 ports.) -hpa
Re: [SeaBIOS] [Qemu-devel] [RFC] Passing boot order from qemu to seabios
On 10/13/2010 12:17 PM, H. Peter Anvin wrote: The ACPI specification recognizes three interfaces as standard: PC/AT (64 bytes, even though 128 bytes is available on a lot of platforms), PIIX4 (256 bytes), and Dallas Semiconductor (256 bytes or more). The interface for the latter isn't well cited in the ACPI spec, but I'm guessing this is referring to the DS17885 series of chips, which can have up to 8K CMOS using a bank-switched scheme which presents 128 bytes at a time (thus accessible via only the standard 70/71 ports.) FWIW, the DS17885 scheme actually allows addressing up to 64K; 8K is the maximum that DS produced with this particular interface as far as I know, but there are 16 address bits available. -hpa
Re: [SeaBIOS] [Qemu-devel] [RFC] Passing boot order from qemu to seabios
On 10/13/2010 01:00 PM, H. Peter Anvin wrote: On 10/13/2010 12:17 PM, H. Peter Anvin wrote: The ACPI specification recognizes three interfaces as standard: PC/AT (64 bytes, even though 128 bytes is available on a lot of platforms), PIIX4 (256 bytes), and Dallas Semiconductor (256 bytes or more). The interface for the latter isn't well cited in the ACPI spec, but I'm guessing this is referring to the DS17885 series of chips, which can have up to 8K CMOS using a bank-switched scheme which presents 128 bytes at a time (thus accessible via only the standard 70/71 ports.) FWIW, the DS17885 scheme actually allows addressing up to 64K; 8K is the maximum that DS produced with this particular interface as far as I know, but there are 16 address bits available. -hpa I think this is the relevant application note: http://www.maxim-ic.com/app-notes/index.mvp/id/77 -hpa
Re: [SeaBIOS] [Qemu-devel] [RFC] Passing boot order from qemu to seabios
On 10/12/2010 01:01 AM, Gleb Natapov wrote: On Mon, Oct 11, 2010 at 02:15:26PM -0700, H. Peter Anvin wrote: I don't disagree. I think the best thing to do is to let SeaBIOS create a boot order table that contains descriptive information and then advertise that to QEMU. QEMU can then try to associate the list of bootable devices with it's own set of devices and select a preferred order that it can then give back to SeaBIOS. SeaBIOS can then present that list to the user for additional refinement. Really, this kind of comes down to having a data structure that anything (Qemu, SeaBIOS and if needed the guest OS) can read and modify as needed. But then QEMU and seabios will have to have shared storage they can both write too. And this shared storage is part of VM now so you need to carry it around when you move your VM elsewhere. Yes, and it's part of real hardware, too. It's usually called the CMOS, short for CMOS RAM. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf.
Re: [SeaBIOS] [Qemu-devel] [RFC] Passing boot order from qemu to seabios
On real hardware it is shared between BIOS and the OS, actually. Gleb Natapov g...@redhat.com wrote: On Tue, Oct 12, 2010 at 09:33:16AM -0700, H. Peter Anvin wrote: On 10/12/2010 01:01 AM, Gleb Natapov wrote: On Mon, Oct 11, 2010 at 02:15:26PM -0700, H. Peter Anvin wrote: I don't disagree. I think the best thing to do is to let SeaBIOS create a boot order table that contains descriptive information and then advertise that to QEMU. QEMU can then try to associate the list of bootable devices with it's own set of devices and select a preferred order that it can then give back to SeaBIOS. SeaBIOS can then present that list to the user for additional refinement. Really, this kind of comes down to having a data structure that anything (Qemu, SeaBIOS and if needed the guest OS) can read and modify as needed. But then QEMU and seabios will have to have shared storage they can both write too. And this shared storage is part of VM now so you need to carry it around when you move your VM elsewhere. Yes, and it's part of real hardware, too. It's usually called the CMOS, short for CMOS RAM. On real hardware it is not shared between HW and bios. It is written/read only by BIOS. In qemu it is not persistent and generated for each qemu invocation. Previously it was used to pass config params from qemu to a bios (and some legacy params are still passed that way), but we moved to better interface for that (firmware config). -- Gleb. -- Sent from my mobile phone. Please pardon any lack of formatting.
Re: [SeaBIOS] [Qemu-devel] [RFC] Passing boot order from qemu to seabios
On 10/12/2010 10:41 AM, Gleb Natapov wrote: On Tue, Oct 12, 2010 at 10:35:51AM -0700, H. Peter Anvin wrote: On real hardware it is shared between BIOS and the OS, actually. Guest OS can write in qemu CMOS too. But what is it useful for? Most of its content is not standard AFAIK. This is true to some extent -- there is some standard content, and some further can be described via ACPI tables. However, my point was mostly that it is an existing model for nonvolatile storage which also works on hardware (and is vastly simpler albeit smaller in size than ESCD). -hpa
Re: [SeaBIOS] [Qemu-devel] [RFC] Passing boot order from qemu to seabios
On 10/11/2010 12:51 PM, Anthony Liguori wrote: -kernel hijacks int19 so it cannot participate in any kind of boot order. It's either present (and therefore the bootable disk) or not present. That's a misdesign, though: it should be able to participate in BBS as a BEV. -hpa
Re: [SeaBIOS] [Qemu-devel] [RFC] Passing boot order from qemu to seabios
On 10/11/2010 01:30 PM, Anthony Liguori wrote: On 10/11/2010 02:59 PM, Gleb Natapov wrote: No boot rom should do that. extboot wreaks havoc when it is used. And since virtio is now supported by bios there is no reason to use it. You don't really have a choice. You could be doing hardware passthrough and the ROM on the card may hijack int19. The BBS standard actually documents how to deal with that -- it pretty much works out to let the card initialize, then see if it mucked with int19, and then put int19 back... if we want to run that card, then we invoke the int19 that the card set up. Whoever needs scsi boot should add it to seabios too. I don't disagree. I think the best thing to do is to let SeaBIOS create a boot order table that contains descriptive information and then advertise that to QEMU. QEMU can then try to associate the list of bootable devices with it's own set of devices and select a preferred order that it can then give back to SeaBIOS. SeaBIOS can then present that list to the user for additional refinement. Really, this kind of comes down to having a data structure that anything (Qemu, SeaBIOS and if needed the guest OS) can read and modify as needed. -hpa
Re: [SeaBIOS] [Qemu-devel] [RFC] Passing boot order from qemu to seabios
On 10/11/2010 02:41 PM, Sebastian Herbszt wrote: H. Peter Anvin wrote: On 10/11/2010 01:30 PM, Anthony Liguori wrote: On 10/11/2010 02:59 PM, Gleb Natapov wrote: No boot rom should do that. extboot wreaks havoc when it is used. And since virtio is now supported by bios there is no reason to use it. You don't really have a choice. You could be doing hardware passthrough and the ROM on the card may hijack int19. The BBS standard actually documents how to deal with that -- it pretty much works out to let the card initialize, then see if it mucked with int19, and then put int19 back... if we want to run that card, then we invoke the int19 that the card set up. The BIOS Boot Specification, Version 1.01 from January 11, 1996 seems not to recommend this: 3.4 Legacy IPL Devices Legacy IPL devices will be allowed to take control of the system (via hooking interrupts) in both Legacy and PnP systems. The Plug and Play BIOS specification recommends that Legacy devices that hook a bootstrap interrupt such as INT 19h, 18h, or 13h have the interrupt re-captured by the BIOS. This is not done because grabbing an interrupt vector back after a device has hooked it can produce unpredictable results. Further, by allowing the card to take control, the behavior of these Legacy cards will be the same on both PnP and Legacy machines. 6.8 Notes on the POST Process The Plug and Play BIOS Specification says that if a Legacy IPL device's option ROM captures INT 18h or INT 19h, the BIOS should save this vector and then restore the original one put there by the BIOS. The BIOS Boot Specification deviates from this in that these vectors are not recaptured after each Legacy option ROM returns from initialization. That would be considered unsafe. Sorry, you're right -- I confused the PNPBIOS spec with the BBS spec (and compounded the error by correctly remembering that BBS overrides PNPBIOS). -hpa
Re: [Qemu-devel] [PATCH] add 40-48 bit RAM range to seabios
On 09/17/2010 06:11 AM, Kevin O'Connor wrote: On Fri, Sep 17, 2010 at 07:53:12AM -0500, Anthony Liguori wrote: On 09/16/2010 08:47 PM, Kevin O'Connor wrote: On Wed, Sep 15, 2010 at 07:15:28PM +0200, Andrea Arcangeli wrote: This uses a new cmos port at 0x5e that shall read zero to be backwards compatible. It looks okay to me. Can you provide an Acked-by from one of the qemu or kvm maintainers? Is CMOS the best place to communicate this (as opposed to fw_cfg)? I know we currently expose memory size via CMOS but perhaps it's better to switch to a 64-bit fw_cfg value. I'd personally prefer fw_cfg. Also, another possibility would be to stop sending an absolute max and to instead send a map of memory. The latter would allow for non-contiguous memory. That would be highly useful for testing purposes. -hpa
[Qemu-devel] Re: [PATCH] [RFC] Add support for a USB audio device model
On 09/13/2010 01:53 PM, Amos Kong wrote: # patch -p1 /tmp/usb-audio.patch # ./configure ... ... preadv supportyes fdatasync yes uuid support no vhost-net support no Trace backend nop Trace output file trace-pid ./configure: 2276: Bad substitution What shell is your /bin/sh? -hpa
[Qemu-devel] Re: [PATCH] [RFC] Add support for a USB audio device model
On 09/13/2010 01:53 PM, Amos Kong wrote: # patch -p1 /tmp/usb-audio.patch # ./configure ... ... preadv supportyes fdatasync yes uuid support no vhost-net support no Trace backend nop Trace output file trace-pid ./configure: 2276: Bad substitution diff --git a/create_config b/create_config index 0098e68..1caa25b 100755 --- a/create_config +++ b/create_config @@ -25,7 +25,7 @@ case $line in CONFIG_AUDIO_DRIVERS=*) drivers=${line#*=} echo #define CONFIG_AUDIO_DRIVERS \\ -for drv in $drivers; do +for drv in ${drivers//-/_}; do echo ${drv}_audio_driver,\\ done echo @@ -39,10 +39,12 @@ case $line in ;; CONFIG_*=y) # configuration name=${line%=*} +name=${name//-/_} echo #define $name 1 ;; CONFIG_*=*) # configuration name=${line%=*} +name=${name//-/_} value=${line#*=} echo #define $name $value ;; Looks like ${.../...} is a bashism. One can replace it with: name=`echo $name | tr '-' '_'` and for drv in `echo $drivers | tr '-' '_'`; do -hpa