Re: [PATCH RFC 6/7] i386: make the bzImage payload an ELF file
Rob Landley wrote:
> On Wednesday 06 June 2007 7:41 pm, H. Peter Anvin wrote:
>> This makes vmlinux (normally stripped) recoverable from the bzImage file
>> and so anything that is currently booting vmlinux would be serviced by
>> this scheme.
>
> Would this make it sane to strip the initramfs image out of vmlinux with
> objdump and replace it with another one, or are there offsets resolved
> during the build that stop that for vmlinux?

There probably are offsets resolved during the build.  However, that
wouldn't be all that hard to fix.  Still, one can argue whether or not it
is sane under any definition to do this kind of unpacking-repacking of
ELF files.

	-hpa

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH RFC 6/7] i386: make the bzImage payload an ELF file
On Wednesday 06 June 2007 7:41 pm, H. Peter Anvin wrote:
> This makes vmlinux (normally stripped) recoverable from the bzImage file
> and so anything that is currently booting vmlinux would be serviced by
> this scheme.

Would this make it sane to strip the initramfs image out of vmlinux with
objdump and replace it with another one, or are there offsets resolved
during the build that stop that for vmlinux?

Rob
-- 
The Google cluster became self-aware at 2:14am EDT August 29, 2007...
Re: [PATCH RFC 6/7] i386: make the bzImage payload an ELF file
H. Peter Anvin wrote:
> It doesn't if we simply declare that a certain chunk of memory is
> available to it, for the case where it runs in the native configuration.
> Since it doesn't have to support *any* ELF file, just the kernel one,
> that's an option.

I suppose.  But given that it's always built at the same time as - and
linked to - the kernel itself, it can have private knowledge about the
kernel.

> On the other hand, I guess with the decompressor/ELF parser being PIC,
> one would simply look for the highest used address, and relocate itself
> above that point.  It's not really all that different from what the
> decompressor does today, except that it knows the address a priori.

Yes, it would have to decompress the ELF file into a temp buffer, and
then rearrange itself and the decompressed ELF file to make space for
the ELF file's final location.  Seems a bit more complex because it has
to be done in the middle of execution rather than at start of day.  But
perhaps that doesn't matter very much.

>> I was thinking of making the ELF file entirely descriptive, since it's
>> just a set of ELF headers inserted into the existing bzImage structure,
>> and it still relies on the bzImage being built properly in the first
>> place.
>
> Again, it's an option.  The downside is that you don't get the automatic
> test coverage of having it be exercised as often as possible.

I don't follow your argument at all.  I'm proposing the kernel take the
same code path regardless of how it's booted, with only two variations:

   1. boot all the way up from 16-bit mode, or
   2. start directly in 32-bit mode

which is essentially the current situation (setup vs code32_start).  All
I'm adding is a bit more metadata for the domain builder to work with.
The code will get exercised on every boot in every environment, and the
metadata will be tested by whichever environment cares about it.
You're proposing that we add a third booting variation, where the
bootloader takes on the responsibility for decompressing and loading the
kernel's ELF image.  In addition, you're proposing changing the existing
32-bit portion of the boot to perform the same job as the third method,
but in a way which is not reusable by a paravirtual domain builder.
This means that the boot path is unique for each boot environment, and
so will overall get less coverage.

Given that one axis of the test matrix - "number of subarchitectures" -
is the same in both cases, and the other axis - "number of ways of
booting" - is larger in your proposal, it seems to me that yours has the
higher testing burden.

Anyway, I added an extra pointer in the boot_params so that you can
implement it that way if you really want (no real reason you can't have
ELF within ELF within bzImage, but it starts to look a bit
engineering-by-compromise at that point).  It isn't, however, the
approach I want to take with Xen.

    J
Re: [PATCH RFC 6/7] i386: make the bzImage payload an ELF file
Jeremy Fitzhardinge wrote:
>
> Certainly, but much harder to implement.  The ELF parser needs to be
> prepared to move itself around to get out of the way of the ELF file.
> It's a fairly large change from how it works now.

It doesn't if we simply declare that a certain chunk of memory is
available to it, for the case where it runs in the native configuration.
Since it doesn't have to support *any* ELF file, just the kernel one,
that's an option.

On the other hand, I guess with the decompressor/ELF parser being PIC,
one would simply look for the highest used address, and relocate itself
above that point.  It's not really all that different from what the
decompressor does today, except that it knows the address a priori.

> I was thinking of making the ELF file entirely descriptive, since it's
> just a set of ELF headers inserted into the existing bzImage structure,
> and it still relies on the bzImage being built properly in the first
> place.

Again, it's an option.  The downside is that you don't get the automatic
test coverage of having it be exercised as often as possible.

	-hpa
Re: [PATCH RFC 6/7] i386: make the bzImage payload an ELF file
H. Peter Anvin wrote:
> I was thinking prescriptive, having the decompressor read the output
> stream and interpret it as ELF.  I guess a descriptive approach could
> be made to work, too (I haven't really thought about that avenue of
> approach), but the prescriptive model seems more powerful, at least to
> me.

Certainly, but much harder to implement.  The ELF parser needs to be
prepared to move itself around to get out of the way of the ELF file.
It's a fairly large change from how it works now.

I was thinking of making the ELF file entirely descriptive, since it's
just a set of ELF headers inserted into the existing bzImage structure,
and it still relies on the bzImage being built properly in the first
place.

    J
Re: [PATCH RFC 6/7] i386: make the bzImage payload an ELF file
Jeremy Fitzhardinge wrote:
>
> I'm not sure I fully understand the mechanism you're proposing.  You
> have the 16-bit setup code, the 32-bit decompressor, and an ELF.gz.
> Once the decompressor has extracted the actual ELF file, are you
> proposing that it properly parse the ELF file and follow its
> instructions to put the segments in the appropriate places, or are you
> assuming that the decompressor can just skip that part and plonk the
> ELF file where it wants?
>
> In other words, do you see the Phdrs as being descriptive or
> prescriptive?

I was thinking prescriptive, having the decompressor read the output
stream and interpret it as ELF.  I guess a descriptive approach could be
made to work, too (I haven't really thought about that avenue of
approach), but the prescriptive model seems more powerful, at least to
me.

	-hpa
Re: [PATCH RFC 6/7] i386: make the bzImage payload an ELF file
H. Peter Anvin wrote:
> I still believe that we should provide, in effect, vmlinux as a
> (compressed) ELF file rather than provide the intermediate stage.  It
> would reduce the complexity of testing (all information provided about
> a stage has to be both guaranteed to even make sense in the future as
> well as be tested to conform to such information

I'm not sure I follow you.  Sure, you're right that the Phdr info
contained within the bzImage needs to be tested for correctness.  This
wouldn't normally happen when booting native, but when booting under the
most constrained environment - Xen - it will be tested (and I intend
making the Xen loader as strict as possible).  Of course, it won't help
if the Phdrs overmap too much, but I don't think that matters too much,
so long as the mappings are not excessively large.

I'm not sure what you mean about "make sense in the future".  If you're
booting the kernel in a new paravirtualized environment, you've
presumably modified the kernel to understand that environment, and
perhaps had to update the boot image format a bit to deal with its
requirements.  I agree that updating the bzImage format may require
retesting in all the other environments, but I think that's probably
true for your scheme as well.  After all, you're assuming that the
vmlinux itself provides all necessary information to be loaded in any
environment, which is not necessarily true (it may need extra ELF notes,
for example).  But if there are any major structural changes needed in
the vmlinux, then that will be equally problematic for both directly
using vmlinux and using ELF-in-bzImage.  So I don't think your argument
convincingly sways in any particular direction.

> ) as well as cover a
> larger number of environments -- any environment where injecting data
> into memory is cheaper than execution is quite unhappy about the
> current system.
> Such environments include heterogeneous embedded systems (think a slow
> CPU on a PCI card where the host CPU has direct access to the memory
> on the card) as well as simulators/emulators.

Well, nothing in this scheme precludes the ELF file from being a plain
uncompressed kernel image.  If that's what these environments want, it's
easy to provide with a small update to the Makefiles.

> For environments where doing so is appropriate it would even be
> possible to run the setup, invoke the code32_setup hook to do the
> decompression (and relocation, if appropriate) in host space.

Well, that's what we currently have, and we can't break backwards
compatibility.

> This makes vmlinux (normally stripped) recoverable from the bzImage
> file and so anything that is currently booting vmlinux would be
> serviced by this scheme.

I'm not sure I fully understand the mechanism you're proposing.  You
have the 16-bit setup code, the 32-bit decompressor, and an ELF.gz.
Once the decompressor has extracted the actual ELF file, are you
proposing that it properly parse the ELF file and follow its
instructions to put the segments in the appropriate places, or are you
assuming that the decompressor can just skip that part and plonk the ELF
file where it wants?

In other words, do you see the Phdrs as being descriptive or
prescriptive?

    J
Re: [PATCH RFC 6/7] i386: make the bzImage payload an ELF file
Jeremy Fitzhardinge wrote:
> This patch makes the payload of the bzImage file an ELF file.  In
> other words, the bzImage is structured as follows:
>  - boot sector
>  - 16bit setup code
>  - ELF header
>  - decompressor
>  - compressed kernel
>
> A bootloader may find the start of the ELF file by looking at the
> setup_size entry in the boot params, and using that to find the offset
> of the ELF header.  The ELF Phdrs contain all the mapped memory
> required to decompress and start booting the kernel.
>
> One slightly complex part of this is that the bzImage boot_params need
> to know about the internal structure of the ELF file, at least to the
> extent of being able to point the core32_start entry at the ELF file's
> entrypoint, so that loaders which use this field will still work.
>
> Similarly, the ELF header needs to know how big the kernel vmlinux's
> bss segment is, in order to make sure it is mapped properly.
>
> To handle these two cases, we generate abstracted versions of the
> object files which only contain the symbols we care about (generated
> with objcopy --strip-all --keep-symbol=X), and then include those
> symbol tables with ld -R.

I still believe that we should provide, in effect, vmlinux as a
(compressed) ELF file rather than provide the intermediate stage.  It
would reduce the complexity of testing (all information provided about a
stage has to be both guaranteed to even make sense in the future as well
as be tested to conform to such information) as well as cover a larger
number of environments -- any environment where injecting data into
memory is cheaper than execution is quite unhappy about the current
system.  Such environments include heterogeneous embedded systems (think
a slow CPU on a PCI card where the host CPU has direct access to the
memory on the card) as well as simulators/emulators.
For environments where doing so is appropriate it would even be possible
to run the setup, invoke the code32_setup hook to do the decompression
(and relocation, if appropriate) in host space.

This makes vmlinux (normally stripped) recoverable from the bzImage file
and so anything that is currently booting vmlinux would be serviced by
this scheme.

	-hpa
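Concretely, the payload hpa wants to make recoverable sits right after the setup sectors, whose count is the setup_sects byte at offset 0x1f1 (per the boot protocol). A sketch of that offset arithmetic against a fabricated image - with a real bzImage the recovered bytes would still need decompressing to get back to vmlinux, and all the file names here are invented:

```shell
# Fabricate a bzImage-shaped file: a boot sector plus 4 setup sectors,
# with setup_sects (the byte at offset 0x1f1 = 497) set to 4, followed
# by the payload.
printf 'PAYLOAD-ELF-GOES-HERE' > payload.bin
dd if=/dev/zero of=fake.bz bs=512 count=5 2>/dev/null
printf '\004' | dd of=fake.bz bs=1 seek=497 conv=notrunc 2>/dev/null
cat payload.bin >> fake.bz

# Recover the payload: read setup_sects, skip (setup_sects + 1) * 512.
sects=$(od -An -tu1 -j497 -N1 fake.bz | tr -d ' ')
dd if=fake.bz bs=1 skip=$(( (sects + 1) * 512 )) 2>/dev/null > recovered.bin
cmp recovered.bin payload.bin && echo recovered
```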
[PATCH RFC 6/7] i386: make the bzImage payload an ELF file
This patch makes the payload of the bzImage file an ELF file.  In other
words, the bzImage is structured as follows:
 - boot sector
 - 16bit setup code
 - ELF header
 - decompressor
 - compressed kernel

A bootloader may find the start of the ELF file by looking at the
setup_size entry in the boot params, and using that to find the offset
of the ELF header.  The ELF Phdrs contain all the mapped memory required
to decompress and start booting the kernel.

One slightly complex part of this is that the bzImage boot_params need
to know about the internal structure of the ELF file, at least to the
extent of being able to point the core32_start entry at the ELF file's
entrypoint, so that loaders which use this field will still work.

Similarly, the ELF header needs to know how big the kernel vmlinux's bss
segment is, in order to make sure it is mapped properly.

To handle these two cases, we generate abstracted versions of the object
files which only contain the symbols we care about (generated with
objcopy --strip-all --keep-symbol=X), and then include those symbol
tables with ld -R.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: "Eric W. Biederman" <[EMAIL PROTECTED]>
Cc: H.
Peter Anvin <[EMAIL PROTECTED]> Cc: Vivek Goyal <[EMAIL PROTECTED]> Cc: Rusty Russell <[EMAIL PROTECTED]> --- arch/i386/boot/Makefile | 11 -- arch/i386/boot/compressed/Makefile| 29 +-- arch/i386/boot/compressed/elfhdr.S| 60 + arch/i386/boot/compressed/head.S |9 ++-- arch/i386/boot/compressed/notes.S |7 +++ arch/i386/boot/compressed/vmlinux.lds | 24 ++--- arch/i386/boot/header.S |7 --- arch/i386/boot/setup.ld |5 ++ arch/i386/kernel/head.S |1 arch/i386/kernel/vmlinux.lds.S|1 10 files changed, 131 insertions(+), 23 deletions(-) === --- a/arch/i386/boot/Makefile +++ b/arch/i386/boot/Makefile @@ -72,14 +72,19 @@ AFLAGS := $(CFLAGS) -D__ASSEMBLY__ SETUP_OBJS = $(addprefix $(obj)/,$(setup-y)) -LDFLAGS_setup.elf := -T -$(obj)/setup.elf: $(src)/setup.ld $(SETUP_OBJS) FORCE +$(obj)/zImage $(obj)/bzImage: \ + LDFLAGS := \ + -R $(obj)/compressed/blob-syms \ + --defsym IMAGE_OFFSET=$(IMAGE_OFFSET) -T + +$(obj)/setup.elf: $(src)/setup.ld $(SETUP_OBJS)\ + $(obj)/compressed/blob-syms FORCE $(call if_changed,ld) $(obj)/payload.o: EXTRA_AFLAGS := -Wa,-I$(obj) $(obj)/payload.o: $(src)/payload.S $(obj)/blob.bin -$(obj)/compressed/blob: FORCE +$(obj)/compressed/blob $(obj)/compressed/blob-syms: FORCE $(Q)$(MAKE) $(build)=$(obj)/compressed IMAGE_OFFSET=$(IMAGE_OFFSET) $@ # Set this if you want to pass append arguments to the zdisk/fdimage/isoimage kernel === --- a/arch/i386/boot/compressed/Makefile +++ b/arch/i386/boot/compressed/Makefile @@ -4,21 +4,42 @@ # create a compressed vmlinux image from the original vmlinux # -targets:= blob vmlinux.bin vmlinux.bin.gz head.o misc.o piggy.o \ +targets:= blob vmlinux.bin vmlinux.bin.gz \ + elfhdr.o head.o misc.o notes.o piggy.o \ vmlinux.bin.all vmlinux.relocs -LDFLAGS_blob := -T hostprogs-y:= relocs CFLAGS := -m32 -D__KERNEL__ $(LINUX_INCLUDE) -O2 \ -fno-strict-aliasing -fPIC \ $(call cc-option,-ffreestanding) \ $(call cc-option,-fno-stack-protector) -LDFLAGS := -m elf_i386 +LDFLAGS := -R $(obj)/vmlinux-syms --defsym 
IMAGE_OFFSET=$(IMAGE_OFFSET) -T -$(obj)/blob: $(src)/vmlinux.lds $(obj)/head.o $(obj)/misc.o $(obj)/piggy.o FORCE +OBJS=$(addprefix $(obj)/,elfhdr.o head.o misc.o notes.o piggy.o) + +$(obj)/blob: $(src)/vmlinux.lds $(obj)/vmlinux-syms $(OBJS) FORCE $(call if_changed,ld) @: + +# Generate a stripped-down object including only the symbols needed +# so that we can get them with ld -R. Direct stderr to /dev/null to +# shut useless warning up. +quiet_cmd_symextract = SYMEXT $@ + cmd_symextract = objcopy -S \ + $(addprefix -j,$(EXTRACTSECTS)) \ + $(addprefix -K,$(EXTRACTSYMS)) \ + $< $@ 2>/dev/null + +$(obj)/blob-syms: EXTRACTSYMS := blob_entry blob_payload +$(obj)/blob-syms: EXTRACTSECTS := .text.head .data.compressed +$(obj)/blob-syms: $(obj)/blob FORCE + $(call if_changed,symextract) + +$(obj)/vmlinux-syms: EXTRACTSYMS := __reserved_end +$(obj)/vmlinux-syms: EXTRACTSECTS := .bss +$(obj)/vmlinux-syms: vmlinux FORCE + $(call if_changed,symextract) $(obj)/vmlinux.bin: vmlinux FORCE $(call if_changed,objcopy) === --
[PATCH RFC 4/7] define ELF notes for adding to a boot image
Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Vivek Goyal <[EMAIL PROTECTED]>
---
 include/linux/elf_boot.h | 15 +++
 1 file changed, 15 insertions(+)

===
--- /dev/null
+++ b/include/linux/elf_boot.h
@@ -0,0 +1,15 @@
+#ifndef ELF_BOOT_H
+#define ELF_BOOT_H
+
+/* Elf notes to help bootloaders identify what program they are booting.
+ */
+
+/* Standardized Elf image notes for booting... The name for all of these is ELFBoot */
+#define ELF_NOTE_BOOT		ELFBoot
+
+#define EIN_PROGRAM_NAME	1	/* The program in this ELF file */
+#define EIN_PROGRAM_VERSION	2	/* The version of the program in this ELF file */
+#define EIN_PROGRAM_CHECKSUM	3	/* ip style checksum of the memory image. */
+#define EIN_ARGUMENT_STYLE	4	/* String identifying argument passing style */
+
+#endif /* ELF_BOOT_H */
--
[PATCH RFC 3/7] allow linux/elf.h to be included in assembler
Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]> --- include/linux/elf.h | 24 +++- 1 file changed, 19 insertions(+), 5 deletions(-) === --- a/include/linux/elf.h +++ b/include/linux/elf.h @@ -1,9 +1,10 @@ #ifndef _LINUX_ELF_H #define _LINUX_ELF_H +#include +#ifndef __ASSEMBLY__ #include #include -#include #include struct file; @@ -31,6 +32,7 @@ typedef __u32 Elf64_Word; typedef __u32 Elf64_Word; typedef __u64 Elf64_Xword; typedef __s64 Elf64_Sxword; +#endif /* __ASSEMBLY__ */ /* These constants are for the segment types stored in the image headers */ #define PT_NULL0 @@ -123,6 +125,7 @@ typedef __s64 Elf64_Sxword; #define ELF64_ST_BIND(x) ELF_ST_BIND(x) #define ELF64_ST_TYPE(x) ELF_ST_TYPE(x) +#ifndef __ASSEMBLY__ typedef struct dynamic{ Elf32_Sword d_tag; union{ @@ -138,6 +141,7 @@ typedef struct { Elf64_Addr d_ptr; } d_un; } Elf64_Dyn; +#endif /* __ASSEMBLY__ */ /* The following are used with relocations */ #define ELF32_R_SYM(x) ((x) >> 8) @@ -146,6 +150,7 @@ typedef struct { #define ELF64_R_SYM(i) ((i) >> 32) #define ELF64_R_TYPE(i)((i) & 0x) +#ifndef __ASSEMBLY__ typedef struct elf32_rel { Elf32_Addr r_offset; Elf32_Word r_info; @@ -185,11 +190,12 @@ typedef struct elf64_sym { Elf64_Addr st_value; /* Value of the symbol */ Elf64_Xword st_size; /* Associated symbol size */ } Elf64_Sym; - +#endif /* __ASSEMBLY__ */ #define EI_NIDENT 16 -typedef struct elf32_hdr{ +#ifndef __ASSEMBLY__ +typedef struct elf32_hdr { unsigned chare_ident[EI_NIDENT]; Elf32_Half e_type; Elf32_Half e_machine; @@ -222,6 +228,7 @@ typedef struct elf64_hdr { Elf64_Half e_shnum; Elf64_Half e_shstrndx; } Elf64_Ehdr; +#endif /* __ASSEMBLY__ */ /* These constants define the permissions on sections in the program header, p_flags. 
*/ @@ -229,7 +236,8 @@ typedef struct elf64_hdr { #define PF_W 0x2 #define PF_X 0x1 -typedef struct elf32_phdr{ +#ifndef __ASSEMBLY__ +typedef struct elf32_phdr { Elf32_Word p_type; Elf32_Offp_offset; Elf32_Addr p_vaddr; @@ -250,6 +258,7 @@ typedef struct elf64_phdr { Elf64_Xword p_memsz; /* Segment size in memory */ Elf64_Xword p_align; /* Segment alignment, file & memory */ } Elf64_Phdr; +#endif /* __ASSEMBLY__ */ /* sh_type */ #define SHT_NULL 0 @@ -284,7 +293,8 @@ typedef struct elf64_phdr { #define SHN_ABS0xfff1 #define SHN_COMMON 0xfff2 #define SHN_HIRESERVE 0x - + +#ifndef __ASSEMBLY__ typedef struct { Elf32_Word sh_name; Elf32_Word sh_type; @@ -310,6 +320,7 @@ typedef struct elf64_shdr { Elf64_Xword sh_addralign;/* Section alignment */ Elf64_Xword sh_entsize; /* Entry size if section holds table */ } Elf64_Shdr; +#endif /* __ASSEMBLY__ */ #defineEI_MAG0 0 /* e_ident[] indexes */ #defineEI_MAG1 1 @@ -343,6 +354,7 @@ typedef struct elf64_shdr { #define ELFOSABI_NONE 0 #define ELFOSABI_LINUX 3 +#define ELFOSABI_STANDALONE255 #ifndef ELF_OSABI #define ELF_OSABI ELFOSABI_NONE @@ -357,6 +369,7 @@ typedef struct elf64_shdr { #define NT_PRXFPREG 0x46e62b7f /* copied from gdb5.1/include/elf/common.h */ +#ifndef __ASSEMBLY__ /* Note header in a PT_NOTE section */ typedef struct elf32_note { Elf32_Word n_namesz; /* Name size */ @@ -396,5 +409,6 @@ static inline void arch_write_notes(stru #define ELF_CORE_EXTRA_NOTES_SIZE arch_notes_size() #define ELF_CORE_WRITE_EXTRA_NOTES arch_write_notes(file) #endif /* ARCH_HAVE_EXTRA_ELF_NOTES */ +#endif /* __ASSEMBLY__ */ #endif /* _LINUX_ELF_H */ -- ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH RFC 7/7] i386: paravirt boot sequence
This patch uses the updated boot protocol to do paravirtualized boot.
If the boot version is >= 2.07, then it will do two things:

 1. Check the bootparams loadflags to see if we should reload the
    segment registers and clear interrupts.  This is appropriate for
    normal native boot and some paravirtualized environments, but
    inappropriate for others.

 2. Check the hardware architecture, and dispatch to the appropriate
    kernel entrypoint.  If the bootloader doesn't set this, then we
    simply do the normal boot sequence.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: "Eric W. Biederman" <[EMAIL PROTECTED]>
Cc: H. Peter Anvin <[EMAIL PROTECTED]>
Cc: Vivek Goyal <[EMAIL PROTECTED]>
Cc: Rusty Russell <[EMAIL PROTECTED]>
---
 arch/i386/boot/header.S |    9 -
 arch/i386/kernel/head.S |   47 +++
 2 files changed, 51 insertions(+), 5 deletions(-)

===
--- a/arch/i386/boot/header.S
+++ b/arch/i386/boot/header.S
@@ -119,7 +119,7 @@ 1:
 # Part 2 of the header, from the old setup.S
 		.ascii	"HdrS"		# header signature
-		.word	0x0206		# header version number (>= 0x0105)
+		.word	0x0207		# header version number (>= 0x0105)
 					# or else old loadlin-1.5 will fail)
 		.globl	realmode_swtch
 realmode_swtch:	.word	0, 0		# default_switch, SETUPSEG
@@ -209,6 +209,13 @@ cmdline_size:	.long	COMMAND_LINE_SIZ
 					#added with boot protocol
 					#version 2.06
 
+hardware_subarch:	.long 0		# subarchitecture, added with 2.07
+					# default to 0 for normal x86 PC
+
+hardware_subarch_data:	.quad 0
+
+kernel_payload:	.long blob_payload	# raw kernel data
+
 # End of setup header #
 
 	.section ".inittext", "ax"
===
--- a/arch/i386/kernel/head.S
+++ b/arch/i386/kernel/head.S
@@ -71,28 +71,37 @@ INIT_MAP_BEYOND_END = BOOTBITMAP_SIZE +
  */
 .section .text.head,"ax",@progbits
 ENTRY(startup_32)
+	/* check to see if KEEP_SEGMENTS flag is meaningful */
+	cmpw $0x207, BP_version(%esi)
+	jb 1f
+
+	/* test KEEP_SEGMENTS flag to see if the bootloader is asking
+	   us to not reload segments */
+	testb $(1<<6), BP_loadflags(%esi)
+	jnz 2f
 /*
  * Set segments to
known values. */ - cld - lgdt boot_gdt_descr - __PAGE_OFFSET +1: lgdt boot_gdt_descr - __PAGE_OFFSET movl $(__BOOT_DS),%eax movl %eax,%ds movl %eax,%es movl %eax,%fs movl %eax,%gs +2: /* * Clear BSS first so that there are no surprises... - * No need to cld as DF is already clear from cld above... - */ + */ + cld xorl %eax,%eax movl $__bss_start - __PAGE_OFFSET,%edi movl $__bss_stop - __PAGE_OFFSET,%ecx subl %edi,%ecx shrl $2,%ecx rep ; stosl + /* * Copy bootup parameters out of the way. * Note: %esi still has the pointer to the real-mode data. @@ -120,6 +129,35 @@ 2: movsl 1: +#ifdef CONFIG_PARAVIRT + cmpw $0x207, (boot_params + BP_version - __PAGE_OFFSET) + jb default_entry + + /* Paravirt-compatible boot parameters. Look to see what architecture + we're booting under. */ + movl (boot_params + BP_hardware_subarch - __PAGE_OFFSET), %eax + cmpl $num_subarch_entries, %eax + jae bad_subarch + + movl subarch_entries - __PAGE_OFFSET(,%eax,4), %eax + subl $__PAGE_OFFSET, %eax + jmp *%eax + +bad_subarch: +WEAK(lguest_entry) +WEAK(xen_entry) + /* Unknown implementation; there's really + nothing we can do at this point. */ + ud2a +.data +subarch_entries: + .long default_entry /* normal x86/PC */ + .long lguest_entry /* lguest hypervisor */ + .long xen_entry /* Xen hypervisor */ +num_subarch_entries = (. - subarch_entries) / 4 +.previous +#endif /* CONFIG_PARAVIRT */ + /* * Initialize page tables. This creates a PDE and a set of page * tables, which are located immediately beyond _end. The variable @@ -132,6 +170,7 @@ 1: */ page_pde_offset = (__PAGE_OFFSET >> 20); +default_entry: movl $(pg0 - __PAGE_OFFSET), %edi movl $(swapper_pg_dir - __PAGE_OFFSET), %edx movl $0x007, %eax /* 0x007 = PRESENT+RW+USER */ -- ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH RFC 5/7] i386: clean up bzImage generation
This patch cleans up image generation in several ways:

 - Firstly, it removes tools/build, and uses binutils to do all the
   final construction of the bzImage.  This removes a chunk of code and
   makes the image generation more flexible, since we can compute
   various numbers rather than be forced to use fixed constants.

 - Rename compressed/vmlinux to compressed/blob, to make it a bit
   clearer that it's the compressed kernel image + decompressor (now
   all the files named "vmlinux*" are directly derived from the kernel
   vmlinux).

 - Rather than using objcopy to wrap the compressed kernel into an
   object file, simply use the assembler: payload.S does a .incbin of
   the blob.bin file, which allows us to easily place it into a
   section, and it makes the Makefile dependency a little clearer.

 - Similarly, use the same technique to create compressed/piggy.o,
   which cleans things up even more, since the .S file can also set the
   input and output_size symbols without further linker script hackery;
   it also removes a complete linker script.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: "Eric W. Biederman" <[EMAIL PROTECTED]>
Cc: H.
Peter Anvin <[EMAIL PROTECTED]> Cc: Vivek Goyal <[EMAIL PROTECTED]> Cc: Rusty Russell <[EMAIL PROTECTED]> --- arch/i386/boot/Makefile | 31 +- arch/i386/boot/compressed/Makefile| 13 -- arch/i386/boot/compressed/piggy.S | 10 + arch/i386/boot/compressed/vmlinux.scr | 10 - arch/i386/boot/header.S |6 - arch/i386/boot/payload.S |3 arch/i386/boot/setup.ld | 39 --- arch/i386/boot/tools/.gitignore |1 arch/i386/boot/tools/build.c | 168 - 9 files changed, 56 insertions(+), 225 deletions(-) === --- a/arch/i386/boot/Makefile +++ b/arch/i386/boot/Makefile @@ -25,12 +25,13 @@ SVGA_MODE := -DSVGA_MODE=NORMAL_VGA #RAMDISK := -DRAMDISK=512 -targets:= vmlinux.bin setup.bin setup.elf zImage bzImage +targets:= blob.bin setup.elf zImage bzImage subdir-:= compressed setup-y+= a20.o apm.o cmdline.o copy.o cpu.o cpucheck.o edd.o -setup-y+= header.o main.o mca.o memory.o pm.o pmjump.o -setup-y+= printf.o string.o tty.o video.o version.o voyager.o +setup-y+= header.o main.o mca.o memory.o payload.o pm.o +setup-y+= pmjump.o printf.o string.o tty.o video.o version.o +setup-y+= voyager.o # The link order of the video-*.o modules can matter. In particular, # video-vga.o *must* be listed first, followed by video-vesa.o. 
@@ -39,10 +40,6 @@ setup-y += video-vga.o setup-y+= video-vga.o setup-y+= video-vesa.o setup-y+= video-bios.o - -hostprogs-y:= tools/build - -HOSTCFLAGS_build.o := $(LINUXINCLUDE) # --- @@ -65,18 +62,12 @@ AFLAGS := $(CFLAGS) -D__ASSEMBLY__ $(obj)/bzImage: IMAGE_OFFSET := 0x10 $(obj)/bzImage: EXTRA_CFLAGS := -D__BIG_KERNEL__ $(obj)/bzImage: EXTRA_AFLAGS := $(SVGA_MODE) $(RAMDISK) -D__BIG_KERNEL__ -$(obj)/bzImage: BUILDFLAGS := -b -quiet_cmd_image = BUILD $@ -cmd_image = $(obj)/tools/build $(BUILDFLAGS) $(obj)/setup.bin \ - $(obj)/vmlinux.bin $(ROOT_DEV) > $@ - -$(obj)/zImage $(obj)/bzImage: $(obj)/setup.bin \ - $(obj)/vmlinux.bin $(obj)/tools/build FORCE - $(call if_changed,image) +$(obj)/zImage $(obj)/bzImage: $(obj)/setup.elf FORCE + $(call if_changed,objcopy) @echo 'Kernel: $@ is ready' ' (#'`cat .version`')' -$(obj)/vmlinux.bin: $(obj)/compressed/vmlinux FORCE +$(obj)/blob.bin: $(obj)/compressed/blob FORCE $(call if_changed,objcopy) SETUP_OBJS = $(addprefix $(obj)/,$(setup-y)) @@ -85,12 +76,10 @@ LDFLAGS_setup.elf := -T $(obj)/setup.elf: $(src)/setup.ld $(SETUP_OBJS) FORCE $(call if_changed,ld) -OBJCOPYFLAGS_setup.bin := -O binary +$(obj)/payload.o: EXTRA_AFLAGS := -Wa,-I$(obj) +$(obj)/payload.o: $(src)/payload.S $(obj)/blob.bin -$(obj)/setup.bin: $(obj)/setup.elf FORCE - $(call if_changed,objcopy) - -$(obj)/compressed/vmlinux: FORCE +$(obj)/compressed/blob: FORCE $(Q)$(MAKE) $(build)=$(obj)/compressed IMAGE_OFFSET=$(IMAGE_OFFSET) $@ # Set this if you want to pass append arguments to the zdisk/fdimage/isoimage kernel === --- a/arch/i386/boot/compressed/Makefile +++ b/arch/i386/boot/compressed/Makefile @@ -4,11 +4,10 @@ # create a compressed vmlinux image from the original vmlinux # -targets:= vmlinux vmlinux.bin vmlinux.bin.gz head.o misc.o piggy.o \ +targets:= blob vmlinux.bin vmlinux.bin.gz head.o misc.o piggy.o \
[PATCH RFC 1/7] update boot spec to 2.07
Proposed updates for version 2.07 of the boot protocol.  This includes:

 load_flags.KEEP_SEGMENTS - flag to request/inhibit segment reloads
 hardware_subarch         - what subarchitecture we're booting under
 hardware_subarch_data    - per-architecture data
 kernel_payload           - address of the raw kernel blob

The intention of these changes is to make booting a paravirtualized
kernel work via the normal Linux boot protocol.

The intention is that the bzImage payload can be a properly formed ELF
file, so that the bootloader can use its ELF notes and Phdrs to get more
metadata about the kernel and its requirements.  The ELF file could be
the uncompressed kernel vmlinux itself; it would only take small
buildsystem changes to implement this.

kernel_payload was added so that a bootloader can just get to the raw
bits of the kernel, so that it can do its own decompression/relocation
if it wishes.  This is not particularly well-defined yet; I just added
it with the hope that it keeps HPA happy.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: "Eric W. Biederman" <[EMAIL PROTECTED]>
Cc: H. Peter Anvin <[EMAIL PROTECTED]>
Cc: Vivek Goyal <[EMAIL PROTECTED]>
Cc: Rusty Russell <[EMAIL PROTECTED]>
---
 Documentation/i386/boot.txt    |   43 +++-
 arch/i386/kernel/asm-offsets.c |    7 ++
 include/asm-i386/bootparam.h   |   10 +++--
 3 files changed, 57 insertions(+), 3 deletions(-)

===
--- a/Documentation/i386/boot.txt
+++ b/Documentation/i386/boot.txt
@@ -168,6 +168,9 @@ 0234/1	2.05+	relocatable_kernel Whether
 0234/1	2.05+	relocatable_kernel	Whether kernel is relocatable or not
 0235/3	N/A	pad2			Unused
 0238/4	2.06+	cmdline_size		Maximum size of the kernel command line
+023C/4	2.07+	hardware_subarch	Hardware subarchitecture
+0240/8	2.07+	hardware_subarch_data	Subarchitecture-specific data
+0248/4	2.07+	kernel_payload		Pointer to raw kernel data
 
 (1) For backwards compatibility, if the setup_sects field contains 0, the
     real value is 4.
@@ -204,7 +207,7 @@ boot loaders can ignore those fields.
The byte order of all fields is littleendian (this is x86, after all.) -Field name:setup_secs +Field name:setup_sects Type: read Offset/size: 0x1f1/1 Protocol: ALL @@ -356,6 +359,13 @@ Protocol: 2.00+ - If 0, the protected-mode code is loaded at 0x10000. - If 1, the protected-mode code is loaded at 0x100000. + Bit 6 (write): KEEP_SEGMENTS + Protocol: 2.07+ + - if 0, reload the segment registers in the 32bit entry point. + - if 1, do not reload the segment registers in the 32bit entry point. + Assume that %cs %ds %ss %es are all set to flat segments with + a base of 0 (or the equivalent for their environment). + Bit 7 (write): CAN_USE_HEAP Set this bit to 1 to indicate that the value entered in the heap_end_ptr is valid. If this field is clear, some setup code @@ -479,6 +489,37 @@ Protocol: 2.06+ zero. This means that the command line can contain at most cmdline_size characters. With protocol version 2.05 and earlier, the maximum size was 255. + +Field name:hardware_subarch +Type: write +Offset/size: 0x23c/4 +Protocol: 2.07+ + + In a paravirtualized environment the hardware low-level architectural + pieces such as interrupt handling, page table handling, and + accessing processor control registers need to be done differently. + + This field allows the bootloader to inform the kernel we are in + one of those environments. + + 0x00000000 The default x86/PC environment + 0x00000001 lguest + 0x00000002 Xen + +Field name:hardware_subarch_data +Type: write +Offset/size: 0x240/8 +Protocol: 2.07+ + + A pointer to data that is specific to the hardware subarch + +Field name:kernel_payload +Type: read +Offset/size: 0x248/4 +Protocol: 2.07+ + + The relocated pointer to the actual kernel payload, in whatever form + it exists in (normally a gzip image). 
THE KERNEL COMMAND LINE === --- a/arch/i386/kernel/asm-offsets.c +++ b/arch/i386/kernel/asm-offsets.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include @@ -143,4 +144,10 @@ void foo(void) OFFSET(LGUEST_PAGES_regs_errcode, lguest_pages, regs.errcode); OFFSET(LGUEST_PAGES_regs, lguest_pages, regs); #endif + + BLANK(); + OFFSET(BP_scratch, boot_params, scratch); + OFFSET(BP_loadflags, boot_params, hdr.loadflags); + OFFSET(BP_hardware_subarch, boot_params, hdr.hardware_subarch); + OFFSET(BP_version, boot_params, hdr.version); } === --- a/include/asm-i386/bootparam.h +++ b/include/asm-i386/bootparam.h @@ -24,8 +24,9 @@ struct
[PATCH RFC 2/7] add WEAK() for creating weak asm labels
Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]> --- include/linux/linkage.h |6 ++ 1 file changed, 6 insertions(+) === --- a/include/linux/linkage.h +++ b/include/linux/linkage.h @@ -34,6 +34,12 @@ name: #endif +#ifndef WEAK +#define WEAK(name)\ + .weak name;\ + name: +#endif + #define KPROBE_ENTRY(name) \ .pushsection .kprobes.text, "ax"; \ ENTRY(name) -- ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH RFC 0/7] proposed updates to boot protocol and paravirt booting
This series:

1. Updates the boot protocol to version 2.07.

2. Cleans up the existing build process, to get rid of tools/build and make the linker do more heavy lifting.

3. Makes the bzImage payload an ELF file. The bootloader can extract this as a naked ELF file by skipping over boot_params.setup_sects worth of 16-bit setup code.

4. Updates the boot_params to 2.07, and updates the kernel's head.S to jump to the appropriate subarch-specific kernel entrypoint. The very earliest code is common (copy boot_params, clear bss); the split happens just before the initial pagetable setup.

Plus random little changes to make it all hang together.

This boots native for me, so everything basically works. But I haven't tested it end-to-end yet, because I haven't done the Xen bits yet. Perhaps Rusty can do the lguest version to verify that it's all sound in principle (hint hint ;).

So, how does it look? J
[PATCH] paravirt: fix error handling in paravirt_disable_iospace
Make sure everything is backed out if it fails: if reserving the iomem resource fails, release the ioport reservation so nothing is left half-claimed. Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]> --- arch/i386/kernel/paravirt.c |5 - 1 file changed, 4 insertions(+), 1 deletion(-) === --- a/arch/i386/kernel/paravirt.c +++ b/arch/i386/kernel/paravirt.c @@ -254,8 +254,11 @@ int paravirt_disable_iospace(void) int ret; ret = request_resource(&ioport_resource, &reserve_ioports); - if (ret == 0) + if (ret == 0) { ret = request_resource(&iomem_resource, &reserve_iomem); + if (ret) + release_resource(&reserve_ioports); + } return ret; }
[PATCH] xen: fix xen-smp.patch: setup_runstate_info
Somehow an smp_processor_id() survived the transition to passing the cpu around. Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]> Cc: Jan Beulich <[EMAIL PROTECTED]> --- arch/i386/xen/time.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) === --- a/arch/i386/xen/time.c +++ b/arch/i386/xen/time.c @@ -110,7 +110,7 @@ static void setup_runstate_info(int cpu) area.addr.v = &per_cpu(runstate, cpu); if (HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area, - smp_processor_id(), &area)) + cpu, &area)) BUG(); }
[PATCH UPDATE] xen: use iret directly where possible
[ Expand, correct and clarify comments; minor code change. ] Most of the time we can simply use the iret instruction to exit the kernel, rather than having to use the iret hypercall - the only exception is if we're returning into vm86 mode, or from delivering an NMI (which we don't support yet). When running native, iret has the behaviour of testing for a pending interrupt atomically with re-enabling interrupts. Unfortunately there's no way to do this with Xen, so there's a window in which we could get a recursive exception after enabling events but before actually returning to userspace. This causes a problem: if the nested interrupt causes one of the task's TIF_WORK_MASK flags to be set, they will not be checked again before returning to userspace. This means that pending work may be left pending indefinitely, until the process enters and leaves the kernel again. The net effect is that a pending signal or reschedule event could be delayed for an unbounded amount of time. To deal with this, the xen event upcall handler checks to see whether the EIP is within the critical section of the iret code, which runs from the point where events are (potentially) re-enabled up to the iret itself. If it's within this range, it calls the iret critical section fixup, which adjusts the stack to deal with any unrestored registers, and then shifts the stack frame up to replace the previous invocation. 
Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]> --- arch/i386/kernel/asm-offsets.c |1 arch/i386/kernel/entry.S | 16 +++ arch/i386/xen/enlighten.c |1 arch/i386/xen/xen-asm.S| 185 +++- arch/i386/xen/xen-ops.h|1 5 files changed, 199 insertions(+), 5 deletions(-) === --- a/arch/i386/kernel/asm-offsets.c +++ b/arch/i386/kernel/asm-offsets.c @@ -65,6 +65,7 @@ void foo(void) OFFSET(TI_addr_limit, thread_info, addr_limit); OFFSET(TI_restart_block, thread_info, restart_block); OFFSET(TI_sysenter_return, thread_info, sysenter_return); + OFFSET(TI_cpu, thread_info, cpu); BLANK(); OFFSET(GDS_size, Xgt_desc_struct, size); === --- a/arch/i386/kernel/entry.S +++ b/arch/i386/kernel/entry.S @@ -1030,7 +1030,21 @@ ENTRY(xen_hypervisor_callback) CFI_ADJUST_CFA_OFFSET 4 SAVE_ALL TRACE_IRQS_OFF - mov %esp, %eax + + /* Check to see if we got the event in the critical + region in xen_iret_direct, after we've reenabled + events and checked for pending events. This simulates + iret instruction's behaviour where it delivers a + pending interrupt when enabling interrupts. */ + movl PT_EIP(%esp),%eax + cmpl $xen_iret_start_crit,%eax + jb 1f + cmpl $xen_iret_end_crit,%eax + jae 1f + + call xen_iret_crit_fixup + +1: mov %esp, %eax call xen_evtchn_do_upcall jmp ret_from_intr CFI_ENDPROC === --- a/arch/i386/xen/enlighten.c +++ b/arch/i386/xen/enlighten.c @@ -827,6 +827,7 @@ void __init xen_setup_vcpu_info_placemen paravirt_ops.irq_disable = xen_irq_disable_direct; paravirt_ops.irq_enable = xen_irq_enable_direct; paravirt_ops.read_cr2 = xen_read_cr2_direct; + paravirt_ops.iret = xen_iret_direct; } } === --- a/arch/i386/xen/xen-asm.S +++ b/arch/i386/xen/xen-asm.S @@ -12,14 +12,20 @@ */ #include + #include #include #include -#include #include +#include + +#include #define RELOC(x, v).globl x##_reloc; x##_reloc=v #define ENDPATCH(x).globl x##_end; x##_end=. + +/* Pseudo-flag used for virtual NMI, which we don't implement yet */ +#define XEN_EFLAGS_NMI 0x8000 /* Enable events. 
This clears the event mask and tests the pending @@ -81,13 +87,12 @@ ENDPATCH(xen_save_fl_direct) */ ENTRY(xen_restore_fl_direct) testb $X86_EFLAGS_IF>>8, %ah - setz %al - movb %al, PER_CPU_VAR(xen_vcpu_info)+XEN_vcpu_info_mask + setz PER_CPU_VAR(xen_vcpu_info)+XEN_vcpu_info_mask /* Preempt here doesn't matter because that will deal with any pending interrupts. The pending check may end up being run on the wrong CPU, but that doesn't hurt. */ - /* check for pending but unmasked */ + /* check for unmasked and pending */ cmpw $0x0001, PER_CPU_VAR(xen_vcpu_info)+XEN_vcpu_info_pending jz 1f 2: call check_events @@ -97,6 +102,178 @@ ENDPATCH(xen_restore_fl_direct) ENDPROC(xen_restore_fl_direct) RELOC(xen_restore_fl_direct, 2b+1) +/* + This is run where a normal iret would be run, with the same stack setup: + 8: eflags + 4: cs +
Re: [Xen-devel] [patch 14/33] xen: xen time implementation
Andi Kleen wrote: >> I once had some code in there to do that, implemented in a very boneheaded >> way with a spinlock to protect the "last time returned" variable. I >> expect there's a better way to implement it. >> > > But any per CPU setup likely needs this to avoid non monotonicity Yeah. The point I didn't quite make was that the clock core should handle this, rather than dealing with it in every clocksource. J
Re: [Xen-devel] [patch 14/33] xen: xen time implementation
On 6/6/07 13:46, "Jan Beulich" <[EMAIL PROTECTED]> wrote: >> On the other hand some timing issues on throttling are probably >> the smallest of the users' problems when it really happens. > > Not if this results in your box hanging - I think throttling is exactly > intended > to keep the box alive as long as possible (and I've seen throttling in action, > with the box happily recovering from the situation - after having seen it a > few times I checked and found the fan covered with dust). Clamping to the last returned timestamp value would be fine here. Time would go moderately haywire for a while (lurch forwards and then stop for a while), but wouldn't go backwards and should recover reasonably within the timescale of the thermal event itself. I don't see an issue with just implementing this clamping if it is needed. -- Keir
Re: [Xen-devel] [patch 14/33] xen: xen time implementation
On Wednesday 06 June 2007 14:46:59 Jan Beulich wrote: > >>> Andi Kleen <[EMAIL PROTECTED]> 06.06.07 14:18 >>> > > > >> > >> Yes, this could be an issue. Is there any way to get an interrupt or MCE > >> when thermal throttling occurs? > > > >Yes, you can get a thermal interrupt from the local APIC. See the Linux > >kernel source. Of course there would still be a race window. > > > >On the other hand some timing issues on throttling are probably > >the smallest of the users' problems when it really happens. > > Not if this results in your box hanging Yes, it shouldn't hang. I'm just saying that some non-monotonicity in the returned values under this abnormal condition is probably not the end of the world. -Andi
Re: [Xen-devel] [patch 14/33] xen: xen time implementation
>>> Andi Kleen <[EMAIL PROTECTED]> 06.06.07 14:18 >>> > >> >> Yes, this could be an issue. Is there any way to get an interrupt or MCE >> when thermal throttling occurs? > >Yes, you can get a thermal interrupt from the local APIC. See the Linux >kernel source. Of course there would still be a race window. > >On the other hand some timing issues on throttling are probably >the smallest of the users' problems when it really happens. Not if this results in your box hanging - I think throttling is exactly intended to keep the box alive as long as possible (and I've seen throttling in action, with the box happily recovering from the situation - after having seen it a few times I checked and found the fan covered with dust). >Standard Linux just ignores it. Jan
Re: [Xen-devel] [patch 14/33] xen: xen time implementation
> > Yes, this could be an issue. Is there any way to get an interrupt or MCE > when thermal throttling occurs? Yes, you can get a thermal interrupt from the local APIC. See the Linux kernel source. Of course there would still be a race window. On the other hand some timing issues on throttling are probably the smallest of the users' problems when it really happens. Standard Linux just ignores it. -Andi
Re: [Xen-devel] [patch 14/33] xen: xen time implementation
On 6/6/07 12:00, "Jan Beulich" <[EMAIL PROTECTED]> wrote: >> If the error across CPUS is +/- just a few microseconds at worst then having >> the clocksource clamp to no less than the last timestamp returned seems a >> reasonable fix. Time won't 'stop' for longer than the cross-CPU error, and >> that should always be a tiny value. > > Are you sure this is also true when e.g. a CPU gets throttled due to thermal > conditions? It is my understanding that both the duty cycle adjustment and > the frequency reduction would yield a reduced rate TSC, which would be > accounted for only the next time the local clock gets calibrated. Otherwise, > immediate calibration (and vcpu update) would need to be forced out of the > thermal interrupt. Yes, this could be an issue. Is there any way to get an interrupt or MCE when thermal throttling occurs? -- Keir
Re: [Xen-devel] [patch 14/33] xen: xen time implementation
>>> Keir Fraser <[EMAIL PROTECTED]> 06.06.07 11:56 >>> >On 6/6/07 10:30, "Jan Beulich" <[EMAIL PROTECTED]> wrote: > >>> If you have an ACPI PM timer in your system (and if you have SMM then your >>> system is almost certainly modern enough to have one) then surely the >>> problem is fixed for all practical purposes? The problem was overflow of a >>> fixed-width platform counter. The PIT wraps every ~50ms, but the ACPI PM >>> timer will wrap only every ~4s. It would be quite unreasonable for SMM to >>> take the CPU away for multiple seconds, even as a one-time boot operation. >> >> No, I don't think the problem's gone with the PM timer - it is just much less >> likely. Since you depend on the TSC (which must generally be assumed to be >> unsynchronized across CPUs) and on the error correction factor (which shows >> non-zero values every few seconds), getting the interpolated times on two >> CPUs out of sync is still possible, and given the way the time keeping code >> works even being off by just a single nanosecond may be fatal. > >If the error across CPUs is +/- just a few microseconds at worst then having >the clocksource clamp to no less than the last timestamp returned seems a >reasonable fix. Time won't 'stop' for longer than the cross-CPU error, and >that should always be a tiny value. Are you sure this is also true when e.g. a CPU gets throttled due to thermal conditions? It is my understanding that both the duty cycle adjustment and the frequency reduction would yield a reduced-rate TSC, which would be accounted for only the next time the local clock gets calibrated. Otherwise, immediate calibration (and vcpu update) would need to be forced out of the thermal interrupt. Jan
Re: [Xen-devel] [patch 14/33] xen: xen time implementation
On Wednesday 06 June 2007 12:05:22 Jeremy Fitzhardinge wrote: > Jan Beulich wrote: > > Xen itself knows to deal with this (by using an error correction factor to > > slow down the local [TSC-based] clock), but for the kernel such a situation > > may be fatal: If clocksource->cycle_last was most recently set on a CPU > > with shadow->tsc_to_nsec_mul sufficiently different from that where > > getnstimeofday() is being used, timekeeping.c's __get_nsec_offset() will > > calculate a huge nanosecond value (due to cyc2ns() doing unsigned > > operations), worth abut 4000s. This value may then be used to set a > > timeout that was intended to be a few milliseconds, effectively yielding > > a hung app (and perhaps system). > > > > Hm. I had a similar situation in the stolen time code, and I ended up > using signed values so I could clamp at zero. Though that might have > been another bug; either way, the clamp is still there. > > I wonder if cyc2ns might not be better using signed operations? Or > perhaps better, the time code should endevour to do things on a > completely per-cpu basis (haven't really given this any thought). This is being worked on. > > Unfortunately so far I haven't been able to think of a reasonable solution > > to this - a simplistic approach like making xen_clocksource_read() check > > the value it is about to return against the last value it returned doesn't > > seem to be a good idea (time might appear to have stopped over some > > period of time otherwise), nor does attempting to adjust the shadowed > > tsc_to_nsec_mul values (because the kernel can't know whether it should > > boost the lagging CPU or throttle the rushing one). > > I once had some code in there to do that, implemented in very boneheaded > way with a spinlock to protect the "last time returned" variable. I > expect there's a better way to implement it. 
But any per-CPU setup likely needs this to avoid non-monotonicity. -Andi
Re: [Xen-devel] [patch 14/33] xen: xen time implementation
>> But I think that a clock source can be expected to be >> monotonic anyway, which Xen's interpolation mechanism doesn't >> guarantee across multiple CPUs. (I'm actually beginning to think that >> this might also be the reason for certain test suites occasionally reporting >> timeouts to fire early.) >> > >Does the kernel expect the tsc clocksource to be completely monotonic >across cpus? Any form of cpu-local clocksource is going to have this >problem; I wonder if clocksources can really only be useful if they're >always referring to a single system-wide time reference - seems like a >bit of a limitation. I suppose so - the clock source's rating gets set to zero if it can be predicted that the TSCs aren't all synchronized, which should pretty much exclude the use of this clock source for anything other than fallback if there's really nothing else available. Jan
Re: [Xen-devel] [patch 14/33] xen: xen time implementation
Jan Beulich wrote: > Xen itself knows to deal with this (by using an error correction factor to > slow down the local [TSC-based] clock), but for the kernel such a situation > may be fatal: If clocksource->cycle_last was most recently set on a CPU > with shadow->tsc_to_nsec_mul sufficiently different from that where > getnstimeofday() is being used, timekeeping.c's __get_nsec_offset() will > calculate a huge nanosecond value (due to cyc2ns() doing unsigned > operations), worth about 4000s. This value may then be used to set a > timeout that was intended to be a few milliseconds, effectively yielding > a hung app (and perhaps system). > Hm. I had a similar situation in the stolen time code, and I ended up using signed values so I could clamp at zero. Though that might have been another bug; either way, the clamp is still there. I wonder if cyc2ns might not be better using signed operations? Or perhaps better, the time code should endeavour to do things on a completely per-cpu basis (haven't really given this any thought). > I'm sure the time keeping code can't deal with negative values returned > from __get_nsec_offset() (timespec_add_ns() is an example, used in > __get_realtime_clock_ts()), otherwise a potential solution might have > been to set the clock source's multiplier and shift to one and zero > respectively. I don't quite follow you here, but wouldn't setting the multiplier/shift to 1/0 preclude being able to warp the clocksource with ntp? > But I think that a clock source can be expected to be > monotonic anyway, which Xen's interpolation mechanism doesn't > guarantee across multiple CPUs. (I'm actually beginning to think that > this might also be the reason for certain test suites occasionally reporting > timeouts to fire early.) > Does the kernel expect the tsc clocksource to be completely monotonic across cpus? 
Any form of cpu-local clocksource is going to have this problem; I wonder if clocksources can really only be useful if they're always referring to a single system-wide time reference - seems like a bit of a limitation. > Unfortunately so far I haven't been able to think of a reasonable solution > to this - a simplistic approach like making xen_clocksource_read() check > the value it is about to return against the last value it returned doesn't > seem to be a good idea (time might appear to have stopped over some > period of time otherwise), nor does attempting to adjust the shadowed > tsc_to_nsec_mul values (because the kernel can't know whether it should > boost the lagging CPU or throttle the rushing one). I once had some code in there to do that, implemented in a very boneheaded way with a spinlock to protect the "last time returned" variable. I expect there's a better way to implement it. J
Re: [Xen-devel] [patch 14/33] xen: xen time implementation
On 6/6/07 10:30, "Jan Beulich" <[EMAIL PROTECTED]> wrote: >> If you have an ACPI PM timer in your system (and if you have SMM then your >> system is almost certainly modern enough to have one) then surely the >> problem is fixed for all practical purposes? The problem was overflow of a >> fixed-width platform counter. The PIT wraps every ~50ms, but the ACPI PM >> timer will wrap only every ~4s. It would be quite unreasonable for SMM to >> take the CPU away for multiple seconds, even as a one-time boot operation. > > No, I don't think the problem's gone with the PM timer - it is just much less > likely. Since you depend on the TSC (which must generally be assumed to be > unsynchronized across CPUs) and on the error correction factor (which shows > non-zero values every few seconds), getting the interpolated times on two > CPUs out of sync is still possible, and given the way the time keeping code > works even being off by just a single nanosecond may be fatal. If the error across CPUs is +/- just a few microseconds at worst, then having the clocksource clamp to no less than the last timestamp returned seems a reasonable fix. Time won't 'stop' for longer than the cross-CPU error, and that should always be a tiny value. -- Keir
Re: [Xen-devel] [patch 14/33] xen: xen time implementation
>>> Keir Fraser <[EMAIL PROTECTED]> 06.06.07 10:54 >>> >On 6/6/07 09:39, "Jan Beulich" <[EMAIL PROTECTED]> wrote: > >> The issue is >> that on that system, transition into ACPI mode takes over 600ms (SMM >> execution, and hence no interrupts delivered during that time), and with >> Xen using the PIT (PM timer support was added by Keir as a result of this, >> but that doesn't cure the problem here, it just reduces the likelihood it'll >> be encountered) platform time and local time got pretty much out of sync. > >If you have an ACPI PM timer in your system (and if you have SMM then your >system is almost certainly modern enough to have one) then surely the >problem is fixed for all practical purposes? The problem was overflow of a >fixed-width platform counter. The PIT wraps every ~50ms, but the ACPI PM >timer will wrap only every ~4s. It would be quite unreasonable for SMM to >take the CPU away for multiple seconds, even as a one-time boot operation. No, I don't think the problem's gone with the PM timer - it is just much less likely. Since you depend on the TSC (which must generally be assumed to be unsynchronized across CPUs) and on the error correction factor (which shows non-zero values every few seconds), getting the interpolated times on two CPUs out of sync is still possible, and given the way the time keeping code works, even being off by just a single nanosecond may be fatal. Jan
Re: [Xen-devel] [patch 14/33] xen: xen time implementation
On 6/6/07 09:39, "Jan Beulich" <[EMAIL PROTECTED]> wrote: > The issue is > that on that system, transition into ACPI mode takes over 600ms (SMM > execution, and hence no interrupts delivered during that time), and with > Xen using the PIT (PM timer support was added by Keir as a result of this, > but that doesn't cure the problem here, it just reduces the likelihood it'll > be encountered) platform time and local time got pretty much out of sync. If you have an ACPI PM timer in your system (and if you have SMM then your system is almost certainly modern enough to have one) then surely the problem is fixed for all practical purposes? The problem was overflow of a fixed-width platform counter. The PIT wraps every ~50ms, but the ACPI PM timer will wrap only every ~4s. It would be quite unreasonable for SMM to take the CPU away for multiple seconds, even as a one-time boot operation. -- Keir
Re: [Xen-devel] [patch 14/33] xen: xen time implementation
>+cycle_t xen_clocksource_read(void) >+{ >+ struct shadow_time_info *shadow = &get_cpu_var(shadow_time); >+ cycle_t ret; >+ >+ get_time_values_from_xen(); >+ >+ ret = shadow->system_timestamp + get_nsec_offset(shadow); >+ >+ put_cpu_var(shadow_time); >+ >+ return ret; >+} I'm afraid this mechanism is pretty unreliable on SMP: getnstimeofday() and do_gettimeofday() both use the difference between the last snapshot taken and the current value read from the clock source. Since I added this clocksource code to our kernel, I have had reproducible hangs on one of the systems I regularly work with (you may have seen the respective thread on xen-devel), which recently I finally found time to look into. The issue is that on that system, transition into ACPI mode takes over 600ms (SMM execution, and hence no interrupts delivered during that time), and with Xen using the PIT (PM timer support was added by Keir as a result of this, but that doesn't cure the problem here, it just reduces the likelihood it'll be encountered) platform time and local time got pretty much out of sync. Xen itself knows to deal with this (by using an error correction factor to slow down the local [TSC-based] clock), but for the kernel such a situation may be fatal: If clocksource->cycle_last was most recently set on a CPU with shadow->tsc_to_nsec_mul sufficiently different from that where getnstimeofday() is being used, timekeeping.c's __get_nsec_offset() will calculate a huge nanosecond value (due to cyc2ns() doing unsigned operations), worth about 4000s. This value may then be used to set a timeout that was intended to be a few milliseconds, effectively yielding a hung app (and perhaps system). I'm sure the time keeping code can't deal with negative values returned from __get_nsec_offset() (timespec_add_ns() is an example, used in __get_realtime_clock_ts()), otherwise a potential solution might have been to set the clock source's multiplier and shift to one and zero respectively. 
But I think that a clock source can be expected to be monotonic anyway, which Xen's interpolation mechanism doesn't guarantee across multiple CPUs. (I'm actually beginning to think that this might also be the reason for certain test suites occasionally reporting timeouts to fire early.) Unfortunately, so far I haven't been able to think of a reasonable solution to this - a simplistic approach like making xen_clocksource_read() check the value it is about to return against the last value it returned doesn't seem to be a good idea (time might appear to have stopped over some period of time otherwise), nor does attempting to adjust the shadowed tsc_to_nsec_mul values (because the kernel can't know whether it should boost the lagging CPU or throttle the rushing one). Jan
Re: [Xen-devel] [patch 21/33] xen: Xen SMP guest support
Jan Beulich wrote: >> --- a/arch/i386/xen/time.c >> +++ b/arch/i386/xen/time.c >> @@ -105,17 +105,15 @@ static void get_runstate_snapshot(struct >> preempt_enable(); >> } >> >> -static void setup_runstate_info(void) >> +static void setup_runstate_info(int cpu) >> { >> struct vcpu_register_runstate_memory_area area; >> >> -area.addr.v = &__get_cpu_var(runstate); >> +area.addr.v = &per_cpu(runstate, cpu); >> >> if (HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area, >> smp_processor_id(), &area)) >> > > Shouldn't this be 'cpu' rather than 'smp_processor_id()'? > Yes. I'd noticed that, thought it got fixed later in the series, and it looks like I ultimately got confused. Not sure when this crept in; it has been correct in the past. J
Re: [Xen-devel] [patch 21/33] xen: Xen SMP guest support
>--- a/arch/i386/xen/time.c >+++ b/arch/i386/xen/time.c >@@ -105,17 +105,15 @@ static void get_runstate_snapshot(struct > preempt_enable(); > } > >-static void setup_runstate_info(void) >+static void setup_runstate_info(int cpu) > { > struct vcpu_register_runstate_memory_area area; > >- area.addr.v = &__get_cpu_var(runstate); >+ area.addr.v = &per_cpu(runstate, cpu); > > if (HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area, > smp_processor_id(), &area)) Shouldn't this be 'cpu' rather than 'smp_processor_id()'? > BUG(); >- >- get_runstate_snapshot(&__get_cpu_var(runstate_snapshot)); > } > > static void do_stolen_accounting(void) Jan