Re: polling SSL kerberos and srp support
Am Montag, 28. April 2014, 21:40:30 schrieb Ted Unangst: Also note that I'm not really interested in rumors or whispers. You don't need to tell me that it's possible somebody else uses Kerberos. I know it's possible, that's why I'm asking. I'd like to know who. One data point: Apache HTTPD 2.4 supports SRP.
Re: 9p
On Thursday 29 May 2014 19:17:25, M Farkas-Dyck wrote: On 29/05/2014, Ted Unangst t...@tedunangst.com wrote: The first question is why not use fuse? I think it's better to have one userland filesystem interface than two. We already have 2: fuse, nfs. • 9p can operate over arbitrary transports, such as virtio and tcp. Fuse can't. • To my knowledge one can't have root as fuse. • Fuse requires a copy to/from a memory buffer on read/write. I have usage cases in which fuse is the performance-limiting factor. I agree that 9p support would be nice, especially with virtio support. I don't think I will have time to help with it in the near future, though. I have noticed that there is also o9fs [1]. I haven't tried it, and I don't know what the plans of its author are. But in case you haven't, you may want to look at it and decide if you want to merge your projects, cherry-pick a few things, or just ignore it. Cheers, Stefan [1] https://bitbucket.org/iru/o9fs/overview
Paravirtualized optimizations for KVM
Hi, I have been trying to increase fork performance of openbsd/amd64 on KVM. It turns out that when I increase the number of CPUs of a VM from 1 to 3, a fork+exit micro benchmark is slowed down by a factor of 7. The main reason for this seems to be a very large number of cross-CPU TLB flushes (about 4 per fork+exit). Each IPI causes several VM exits which are expensive. To reduce this, I have been trying to use paravirtualized interfaces provided by KVM and optimize some other things. These changes are mostly activated by a new pseudo device paravirt (which has the advantage that one can use UKC to tweak things without recompiling). However, some changes will remain if not running on a hypervisor (or paravirt is disabled). For example, x86_ipi() will use a pointer to dispatch to the appropriate implementation. Is this the way to go forward? Or would you rather prefer a compile time option and maybe ship a bsd.mp.paravirt kernel in addition to bsd+bsd.mp? The attached patch speeds up the fork+exit micro benchmark by a factor of 3 on a 3 CPU system. And the time to build a kernel with -j4 on a 4 CPU system is also reduced by about 20%: current: real1m50.089s user4m46.240s sys 1m29.510s current+paravirt: real1m29.313s user4m54.720s sys 0m45.100s BTW, why does amd64 use the APTE mapping/unmapping dance in pmap despite the memory being available in the direct map area all the time? Cheers, Stefan diff --git a/sys/arch/amd64/amd64/cpu.c b/sys/arch/amd64/amd64/cpu.c index 88725f7..248ebb8 100644 --- a/sys/arch/amd64/amd64/cpu.c +++ b/sys/arch/amd64/amd64/cpu.c @@ -83,6 +83,7 @@ #include machine/pmap.h #include machine/vmparam.h #include machine/mpbiosvar.h +#include machine/paravirtvar.h #include machine/pcb.h #include machine/specialreg.h #include machine/segments.h @@ -569,6 +570,19 @@ cpu_init(struct cpu_info *ci) ci-ci_flags |= CPUF_RUNNING; tlbflushg(); #endif +#if NPARAVIRT 0 + if (kvm_pv_eoi_enabled) { + paddr_t pa; + ci-ci_kvm_pv_eoi = 0; + if (pmap_extract(pmap_kernel(), (vaddr_t)ci-ci_kvm_pv_eoi, pa) + ((uint64_t)pa 0x3) == 0) { + wrmsr(MSR_KVM_EOI_EN, (1 | (uint64_t)pa) ); + } else { + printf(could not get phys addr for MSR_KVM_EOI_EN, disabling pv_eoi\n); + kvm_pv_eoi_enabled = 0; + } + } +#endif } diff --git a/sys/arch/amd64/amd64/genassym.cf b/sys/arch/amd64/amd64/genassym.cf index e13a477..ab20329 100644 --- a/sys/arch/amd64/amd64/genassym.cf +++ b/sys/arch/amd64/amd64/genassym.cf @@ -114,6 +114,9 @@ member CPU_INFO_MUTEX_LEVELci_mutex_level endif member CPU_INFO_GDTci_gdt member CPU_INFO_TSSci_tss +if NPARAVIRT 0 +member CPU_INFO_KVM_PV_EOI ci_kvm_pv_eoi +endif struct intrsource member is_recurse diff --git a/sys/arch/amd64/amd64/lapic.c b/sys/arch/amd64/amd64/lapic.c index d09e3fc..857af4b 100644 --- a/sys/arch/amd64/amd64/lapic.c +++ b/sys/arch/amd64/amd64/lapic.c @@ -45,6 +45,7 @@ #include machine/pmap.h #include machine/vmparam.h #include machine/mpbiosvar.h +#include machine/paravirtvar.h #include machine/pcb.h #include machine/specialreg.h #include machine/segments.h @@ -235,20 +236,42 @@ lapic_boot_init(paddr_t lapic_base) lapic_map(lapic_base); #ifdef MULTIPROCESSOR - idt_allocmap[LAPIC_IPI_VECTOR] = 1; - idt_vec_set(LAPIC_IPI_VECTOR, Xintr_lapic_ipi); - idt_allocmap[LAPIC_IPI_INVLTLB] = 1; - idt_vec_set(LAPIC_IPI_INVLTLB, Xipi_invltlb); - idt_allocmap[LAPIC_IPI_INVLPG] = 1; - idt_vec_set(LAPIC_IPI_INVLPG, Xipi_invlpg); - idt_allocmap[LAPIC_IPI_INVLRANGE] = 1; - idt_vec_set(LAPIC_IPI_INVLRANGE, Xipi_invlrange); +#if NPARAVIRT 0 + if (kvm_pv_eoi_enabled) { + idt_allocmap[LAPIC_IPI_VECTOR] = 1; + idt_vec_set(LAPIC_IPI_VECTOR, Xintr_lapic_ipi_kvm_pv_eoi); + idt_allocmap[LAPIC_IPI_INVLTLB] = 1; + idt_vec_set(LAPIC_IPI_INVLTLB, Xipi_invltlb_kvm_pv_eoi); + idt_allocmap[LAPIC_IPI_INVLPG] = 1; + idt_vec_set(LAPIC_IPI_INVLPG, Xipi_invlpg_kvm_pv_eoi); + idt_allocmap[LAPIC_IPI_INVLRANGE] = 1; + idt_vec_set(LAPIC_IPI_INVLRANGE, Xipi_invlrange_kvm_pv_eoi); + } + else +#endif + { + idt_allocmap[LAPIC_IPI_VECTOR] = 1; + idt_vec_set(LAPIC_IPI_VECTOR, Xintr_lapic_ipi); + idt_allocmap[LAPIC_IPI_INVLTLB] = 1; + idt_vec_set(LAPIC_IPI_INVLTLB, Xipi_invltlb); + idt_allocmap[LAPIC_IPI_INVLPG] = 1; + idt_vec_set(LAPIC_IPI_INVLPG, Xipi_invlpg); + idt_allocmap[LAPIC_IPI_INVLRANGE] = 1; + idt_vec_set(LAPIC_IPI_INVLRANGE, Xipi_invlrange); + } #endif idt_allocmap[LAPIC_SPURIOUS_VECTOR] = 1;
Re: Paravirtualized optimizations for KVM
On Tuesday 08 July 2014 23:53:21, Mark Kettenis wrote: Are these paravirtualization APIs stable? Are they (properly) documented somewhere? Mostly. So far, I am using three things: 1) the paravirtualized EOI. Documented in Documentation/virtual/kvm/msr.txt in linux source. 2) the MSR to write to the lapic ICR register. This is a Hyper-V interface that is also implemented by KVM. It's documented in the Microsoft Hyper-V docs (don't have a pointer right now). 3) the fact that doing IPIs does not require waiting for the BUSY bit to clear in ICR. This is really an implementation detail in KVM. Unless there is some way to detect this, it's not something that one can depend on by default but it could be enabled by some UKC flag with default off. A different approach would be to add support for x2apic mode, which would take care of 2)+3). But this cannot be mixed with normal accesses to the memory mapped apic registers (xapic mode) and would therefore be a lot more intrusive to implement. If we're serious about supporting OpenBSD on (KVM) hypervisors, something like this makes sense. We tend to try and have a single kernel that runs on the widest range of hardware that is possible. For example the OpenBSD/sparc64 kernel runs on both sun4u and sun4v hardware, and the sun4v platforms has written paravirtualization all over it. There I successfully made use of code patching techniques. That might help on x86 as well. Yes, code patching may be useful. I haven't noticed it used in openbsd before, but I will take a look at sparc64. Can't say I'm happy with making the interrupt handling code even more complicated though... Do you think that putting all lapic operations as function into a apic_ops struct would be preferable? This would make the code much easier to read/maintain at the cost of some indirection and a few function calls in interrupt paths.
Re: [portable] OpenPGP signatures on release checksums (#12)
On Monday 14 July 2014 12:45:35, Bob Beck wrote: $ wc -l *.c 29 crypto_api.c 143 mod_ed25519.c 327 mod_ge25519.c 806 signify.c 1305 total Signify is 1305 *lines* of C code. and it's included in our development platform. It is not that difficult to install, and if you can't install it, you could always run OpenBSD in a vm to verify a signature, it comes with openbsd. Signify uses some openssh .c files: $ wc -l *.c *.data 29 crypto_api.c 335 fe25519.c 143 mod_ed25519.c 327 mod_ge25519.c 306 sc25519.c 806 signify.c 265 smult_curve25519_ref.c 858 ge25519_base.data 3069 total And it uses quite a few openbsd specific functions which makes compiling it on non-openbsd annoying. Because of the coupling to the openssh source, maybe it would make sense to include it in the openssh portable release?
Re: Paravirtualized optimizations for KVM
On Saturday 16 August 2014 23:51:46, Philip Guenther wrote: Yes, code patching may be useful. I haven't noticed it used in openbsd before, but I will take a look at sparc64. Code patching is used currently on i386 and amd64 for the SMAP support. Grep for _copyin_stac, for example. Thanks. I have already noticed that. Recently, I have however worked on improving pmap instead. Basically instead of making IPIs faster on KVM, do much fewer IPIs. I will write about that in a new thread, though. For paravirtualization, it seems it is not always obvious which features to use. If the host machine is very new and has certain VMX hardware features (e.g. Virtual interrupt delivery and APIC register emulation), some of paravirtualizaion techniques that have significant positive effect on older hardware may instead cause a slowdown. I am not sure that KVM always provides enough information to make the best choice automatically. I fear this will require quite a few knobs for tuning.
Improving amd64 pmap for MP
Hi, I found some patch by Art from 2008 that removes the APTE stuff. The patch was for i386 but I have adapted it to amd64. Instead of mapping the ptes of an inactive pmap in the APTE range, and then doing an unconditional remote TLB-flush on all CPUs, we just switch to the other pmap locally. This only causes a local TLB-flush. I have also reordered some stuff so that we will usually send the TLB-shootdown-IPIs first, then do some local stuff, and then wait for the remote TLB-shootdown to finish. The patch can definitely be optimized further (e.g. use the direct mapping in some or all cases). But maybe people want to take a look already. I would be interested in results on AMD hardware. On some real Intel hardware with 4 cores, I get (with an older version of the patch, but I don't think it makes much difference): - no significant difference for kernel builds - doing 'make index' in ports on mfs: w/o patch: 6m29.27s real 1m21.96s user 3m18.99s system with patch: 6m12.54s real 1m23.65s user 3m3.43s system - forktest n=4 (fork+exit micro benchmark) w/o patch: 0m8.75s real 0m0.36s user 0m7.51s system with patch: 0m7.43s real 0m0.18s user 0m7.79s system I expect the speed up to be larger with more CPUs. On KVM the speed up is much larger than on bare metal, in the same range as with the paravirt patches (of course doing both is even better). I don't have any exact numbers right now, though. We have been running the patch on one of our build-VMs with 6 CPUs for over a week without any problems. Cheers, Stefan = forktest.c == #include stdio.h #include sys/types.h #include unistd.h #include sys/wait.h #include stdlib.h int main(int ac, char **av) { int n = 4; int active = 0; while (n-- 0) { int status; while (active 100) { while (wait4(WAIT_ANY, status, WNOHANG, NULL) 0) active--; if (active 100) usleep(5000); } pid_t pid = fork(); if (pid 0) { perror(fork failed); exit(0); } else if (pid == 0) { // child exit(0); } else { active++; } } return 0; } === And here is the diff: diff --git a/sys/arch/amd64/amd64/pmap.c b/sys/arch/amd64/amd64/pmap.c index 40b64e5..293dfc9 100644 --- a/sys/arch/amd64/amd64/pmap.c +++ b/sys/arch/amd64/amd64/pmap.c @@ -203,7 +203,6 @@ long nkptp[] = NKPTP_INITIALIZER; long nkptpmax[] = NKPTPMAX_INITIALIZER; long nbpd[] = NBPD_INITIALIZER; pd_entry_t *normal_pdes[] = PDES_INITIALIZER; -pd_entry_t *alternate_pdes[] = APDES_INITIALIZER; /* int nkpde = NKPTP; */ @@ -289,7 +288,7 @@ void pmap_free_ptp(struct pmap *, struct vm_page *, vaddr_t, pt_entry_t *, pd_entry_t **, struct pg_to_free *); void pmap_freepage(struct pmap *, struct vm_page *, int, struct pg_to_free *); static boolean_t pmap_is_active(struct pmap *, int); -void pmap_map_ptes(struct pmap *, pt_entry_t **, pd_entry_t ***); +void pmap_map_ptes(struct pmap *, pt_entry_t **, pd_entry_t ***, paddr_t *); struct pv_entry *pmap_remove_pv(struct vm_page *, struct pmap *, vaddr_t); void pmap_do_remove(struct pmap *, vaddr_t, vaddr_t, int); boolean_t pmap_remove_pte(struct pmap *, struct vm_page *, pt_entry_t *, @@ -299,14 +298,23 @@ void pmap_remove_ptes(struct pmap *, struct vm_page *, vaddr_t, #define PMAP_REMOVE_ALL0 /* remove all mappings */ #define PMAP_REMOVE_SKIPWIRED 1 /* skip wired mappings */ -void pmap_unmap_ptes(struct pmap *); +void pmap_unmap_ptes(struct pmap *, paddr_t); boolean_t pmap_get_physpage(vaddr_t, int, paddr_t *); boolean_t pmap_pdes_valid(vaddr_t, pd_entry_t **, pd_entry_t *); void pmap_alloc_level(pd_entry_t **, vaddr_t, int, long *); -void pmap_apte_flush(struct pmap *pmap); void pmap_sync_flags_pte(struct vm_page *, u_long); +void pmap_tlb_shootpage(struct pmap *, vaddr_t, int); +void pmap_tlb_shootrange(struct pmap *, vaddr_t, vaddr_t, int); +void pmap_tlb_shoottlb(struct pmap *, int); +#ifdef MULTIPROCESSOR +void pmap_tlb_shootwait(void); +#else +#definepmap_tlb_shootwait() +#endif + + /* * p m a p i n l i n e h e l p e r f u n c t i o n s */ @@ -349,55 +357,44 @@ pmap_sync_flags_pte(struct vm_page *pg, u_long pte) } } -void -pmap_apte_flush(struct pmap *pmap) -{ - pmap_tlb_shoottlb(); - pmap_tlb_shootwait(); -} - /* * pmap_map_ptes: map a pmap's PTEs into KVM - * - * = we lock enough pmaps to keep things locked in - * = must be undone with pmap_unmap_ptes before returning */ void -pmap_map_ptes(struct pmap *pmap, pt_entry_t **ptepp,
minphys woes
Hi, while debugging some problems on a virtio host that supports only few dma segments per request, I noticed that the minphys function is not called in all code paths. It is only used for physio but not for other (cached) disk operations. An example trace is below. Am I missing something or is this as broken as I think it is? I thought the minphys function is a way for a driver to say that it does not support larger transfers and that transfers are split accordingly. What's the purpose of the minphys mechanism if it does not affect all code paths? Cheers, Stefan panic() at 0x1027bcf5 panic+0x1a5/0x1061006f0 vioblk_scsi_cmd() at 0x103ecccf vioblk_scsi_cmd+0x8ff/0x106100770 sdstart() at 0x103bc4d6 sdstart+0x176/0x1061007b0 scsi_ioh_runqueue() at 0x103b3c05 scsi_ioh_runqueue+0x55/0x1061007e0 scsi_xsh_runqueue() at 0x103b3d72 scsi_xsh_runqueue+0x132/0x106100800 sdstrategy() at 0x103bc6bf sdstrategy+0x10f/0x106100830 spec_strategy() at 0x102a8b3b spec_strategy+0x5b/0x106100850 VOP_STRATEGY() at 0x102a75be VOP_STRATEGY+0x2e/0x106100870 bio_doread() at 0x102991fc bio_doread+0xbc/0x1061008a0 bread() at 0x102992c3 bread+0x13/0x1061008c0 fillinusemap() at 0x102ad6d9 fillinusemap+0xb9/0x106100920 msdosfs_mountfs() at 0x102af78d msdosfs_mountfs+0x59d/0x106100990 msdosfs_mount() at 0x102afcf6 msdosfs_mount+0x3e6/0x106100b70 sys_mount() at 0x102a6a3c sys_mount+0x36c/0x106100d90
Re: minphys woes
On Friday 29 August 2014 10:34:20, David Gwynne wrote: are you expecting minphys to appear in that backtrace? No. In that trace minphys should have limited the transfer size and I had put an KASSERT() into vioblk_scsi_cmd() which verified that xs- datalen is not larger than that limit. A call to bread() causes the KASSERT to trigger.
Re: minphys woes
No, also happens on ffs. If the root FS is on an affected device, it looks like this: root on sd0a (9e7002f8647a2403.a) swap on sd0b dump on sd0b clock: unknown CMOS layout panic: kernel diagnostic assertion xs-datalen = MINPHYS failed: file ../../../../dev/pci/vioblk.c, line 405 Stopped at Debugger+0x9: leave RUN AT LEAST 'trace' AND 'ps' AND INCLUDE OUTPUT WHEN REPORTING THIS PANIC! IF RUNNING SMP, USE 'mach ddbcpu #' AND 'trace' ON OTHER PROCESSORS, TOO. DO NOT EVEN BOTHER REPORTING THIS WITHOUT INCLUDING THAT INFORMATION! ddb{0} trace Debugger() at Debugger+0x9 panic() at panic+0xfe __assert() at __assert+0x25 vioblk_scsi_cmd() at vioblk_scsi_cmd+0x565 scsi_xs_exec() at scsi_xs_exec+0x35 sdstart() at sdstart+0x16e scsi_iopool_run() at scsi_iopool_run+0x5d scsi_xsh_runqueue() at scsi_xsh_runqueue+0x13d scsi_xsh_add() at scsi_xsh_add+0x98 sdstrategy() at sdstrategy+0x102 spec_strategy() at spec_strategy+0x53 ufs_strategy() at ufs_strategy+0x82 VOP_STRATEGY() at VOP_STRATEGY+0x3b bread_cluster() at bread_cluster+0x2a4 ffs_read() at ffs_read+0x250 VOP_READ() at VOP_READ+0x3f uvn_io() at uvn_io+0x2e2 uvn_get() at uvn_get+0x1f7 uvm_fault() at uvm_fault+0xdf5 trap() at trap+0x62f --- trap (number 6) --- end of kernel end trace frame: 0x7f7dd0c0, count: -20 acpi_pdirpa+0x43ae00: Reproducer patch is below. I am sure it can be reproduced with any other disk driver, but with ahci a reproducer would be a bit more complicated because of the atascsi layer. Maybe VOP_STRATEGY() should do a similar minphys+loop thing as physio() does? --- a/sys/dev/pci/vioblk.c +++ b/sys/dev/pci/vioblk.c @@ -63,6 +63,7 @@ #include scsi/scsi_disk.h #include scsi/scsiconf.h +#define MINPHYS(20*1024) #define VIOBLK_DONE-1 struct virtio_feature_name vioblk_feature_names[] = { @@ -157,6 +158,8 @@ vioblk_minphys(struct buf *bp, struct scsi_link *sl) struct vioblk_softc *sc = sl-adapter_softc; if (bp-b_bcount sc-sc_xfer_max) bp-b_bcount = sc-sc_xfer_max; + if (bp-b_bcount MINPHYS) + bp-b_bcount = MINPHYS; } void @@ -399,6 +402,8 @@ vioblk_scsi_cmd(struct scsi_xfer *xs) int timeout; int slot, ret, nsegs; + KASSERT(xs-datalen = MINPHYS); + s = splbio(); ret = virtio_enqueue_prep(vq, slot); if (ret) { On Fri, 29 Aug 2014, David Gwynne wrote: is it something msdos isnt doing that ffs does? On 29 Aug 2014, at 15:55, Stefan Fritsch s...@sfritsch.de wrote: On Friday 29 August 2014 10:34:20, David Gwynne wrote: are you expecting minphys to appear in that backtrace? No. In that trace minphys should have limited the transfer size and I had put an KASSERT() into vioblk_scsi_cmd() which verified that xs- datalen is not larger than that limit. A call to bread() causes the KASSERT to trigger.
Re: minphys woes
On Fri, 29 Aug 2014, Miod Vallat wrote: +++ b/sys/dev/pci/vioblk.c @@ -63,6 +63,7 @@ #include scsi/scsi_disk.h #include scsi/scsiconf.h +#define MINPHYS(20*1024) This makes no sense. If you want to override MINPHYS, you need to override for the whole kernel, not only here. MINPHYS is just a name I have chosen. It is not used anywhere else in the kernel. You are confusing this with MAXPHYS. @@ -157,6 +158,8 @@ vioblk_minphys(struct buf *bp, struct scsi_link *sl) struct vioblk_softc *sc = sl-adapter_softc; if (bp-b_bcount sc-sc_xfer_max) bp-b_bcount = sc-sc_xfer_max; + if (bp-b_bcount MINPHYS) + bp-b_bcount = MINPHYS; This is useless. sc-sc_xfer_max is computed so that it will always be = MINPHYS. sc-sc_xfer_max is computed according to the host's capabilities. What I want to simulate with this diff is a host adapter that can only cope with transfers 64k == MAXPHYS.
Re: minphys woes
On Fri, 29 Aug 2014, Miod Vallat wrote: sc-sc_xfer_max is computed according to the host's capabilities. What I want to simulate with this diff is a host adapter that can only cope with transfers 64k == MAXPHYS. Back to your original problem, you might want to print the sc_link struct as well the scsi_adapter struct it points to, when you detect a transfer larger than MAXPHYS. It has likely been overriden or reset to NULL by mistake at some point. OK, I will try that. Will take a bit until I have time, though. But I have also read the code and I could not find a place in the path from bread() to the scsi adapter cmd function where the minphys function is called.
Re: minphys woes
On Fri, 29 Aug 2014, Mike Belopuhov wrote: On 29 August 2014 11:26, Stefan Fritsch s...@sfritsch.de wrote: On Fri, 29 Aug 2014, Miod Vallat wrote: sc-sc_xfer_max is computed according to the host's capabilities. What I want to simulate with this diff is a host adapter that can only cope with transfers 64k == MAXPHYS. Back to your original problem, you might want to print the sc_link struct as well the scsi_adapter struct it points to, when you detect a transfer larger than MAXPHYS. It has likely been overriden or reset to NULL by mistake at some point. OK, I will try that. Will take a bit until I have time, though. That's not the problem. I have added an additional KASSERT(xs-sc_link-adapter-scsi_minphys == vioblk_minphys) before the other KASSERT and it did not trigger. But I have also read the code and I could not find a place in the path from bread() to the scsi adapter cmd function where the minphys function is called. correct me if i'm wrong, but what happens is that bread being a block read reads up to MAXBSIZE which is conveniently set to 64k and you can't create a filesystem with a larger block size. physio (raw device io) however doesn't go through bread and need to know how split the provided buffer in separate transactions hence minphys. Yes, that seems to be what happens. But if every adapter needs to support transfers of MAXBSIZE == MAXPHYS anyway, there would be no need for the adapter to be able to override the default minphys function with its own. And adapters that only support smaller transfers would need to have logic in their driver to be able to split the transfer into smaller chunks. I think it makes more sense to have that logic in one place to be used by all drivers.
Re: minphys woes
On Mon, 1 Sep 2014, Mike Belopuhov wrote: On 29 August 2014 22:39, Stefan Fritsch s...@sfritsch.de wrote: Yes, that seems to be what happens. But if every adapter needs to support transfers of MAXBSIZE == MAXPHYS anyway, there would be no need for the adapter to be able to override the default minphys function with its own. And adapters that only support smaller transfers would need to have logic in their driver to be able to split the transfer into smaller chunks. i believe that if you start digging you realise that (at least at some point) the least common denominator is (or was) 64k meaning that even the shittiest controller on vax can do 64k. which means that we don't have code for a filesystem or buffercache to probe the controller for a supported transfer size. That's possible. But what is really strange is, why does openbsd then have an infrastructure to set different max transfer sizes for physio on a per-adapter basis? This makes no sense. Either the drivers have to support 64k transfers, in which case most of the minphys infrastructure is useless, or they don't have to. In the latter case the minphys infrastructure would need to be used in all code paths. I think it makes more sense to have that logic in one place to be used by all drivers. perhaps, but unless the filesystem will issue breads of larger blocks the only benefit would be physio itself which doesn't really justify the change. You lost me there. Why should this depend on how many requests some filesystem makes with larger blocks? If there is the possibility that filesystems do such requests, someone needs to do the splitting of the requests. The question is who. The driver or the block layer. BTW, for my use case I don't actually want to limit the block size, but rather the number of DMA segments. But the most reasonable way to do that seemed to set minphys to max_segments * pagesize. If we change how these things work, we could take the number of DMA segments into account.
Re: minphys woes
On Mon, 1 Sep 2014, David Gwynne wrote: Yes, that seems to be what happens. But if every adapter needs to support transfers of MAXBSIZE == MAXPHYS anyway, there would be no need for the adapter to be able to override the default minphys function with its own. And adapters that only support smaller transfers would need to have logic in their driver to be able to split the transfer into smaller chunks. I think it makes more sense to have that logic in one place to be used by all drivers. this has blown my mind. how do the stupid ata cf card things that only support single sector IO work? It seems _wdc_ata_bio_start in ata_wdc.c does the splitting in this case.
Re: minphys woes
On Monday 01 September 2014 22:24:16, David Gwynne wrote: i haven't found a controller that does less than MAXPHYS. perhaps they meant to improve the situation but stopped short. if we wanted to raise MAXPHYS, we'd have to support older controllers that cant do greater than 64k with some mechanism. So the summary of this thread is that - drivers have to support 64K transfers - the minphys mechanism is useless at the moment Not quite what I hoped, but ok. Do you know a good place where this should be documented?
Re: minphys woes
On Tuesday 02 September 2014 13:15:19, Philip Guenther wrote: On Tue, Sep 2, 2014 at 12:51 PM, Stefan Fritsch s...@sfritsch.de wrote: On Monday 01 September 2014 22:24:16, David Gwynne wrote: i haven't found a controller that does less than MAXPHYS. perhaps they meant to improve the situation but stopped short. if we wanted to raise MAXPHYS, we'd have to support older controllers that cant do greater than 64k with some mechanism. So the summary of this thread is that - drivers have to support 64K transfers - the minphys mechanism is useless at the moment Useless? It works just fine for physio, which is what it was designed for, no? Not quite what I hoped, but ok. Do you know a good place where this should be documented? minphys is currently documented on physio(9) From physio(9): minphys A device specific routine called to determine the maximum transfer size that the device's strategy routine can handle. Since we have seen that the driver must be able to handle 64k blocks in any case, the fact that minphys is device specific is useless, isn't it?
Re: minphys woes
On Tuesday 02 September 2014 15:22:16, Philip Guenther wrote: From physio(9): minphys A device specific routine called to determine the maximum transfer size that the device's strategy routine can handle. Since we have seen that the driver must be able to handle 64k blocks in any case, the fact that minphys is device specific is useless, isn't it? physio() is used by character device access. Looks to me like sdminphys() will change the chunking behavior of this: dd if=/dev/zero of=/dev/rsd0a bs=100M count=1 depending on whether sd0 is a SCSI-I device, no? Yes, but that does not make it any less useless. File systems will call the very same strategy routine and expect it to deal with 64K blocks. The statement in the man page gives the misleading impression that the strategy routine could be used to avoid that the strategy routine has to handle 64K blocks, which is not true. So, maybe the minphys mechanism is useful for block devices that are not used for file systems. Are there any such devices? Maybe we could add a sentence to the man page like following? Beware that a block device's strategy routine that is used for file systems must always be able to accept transfers sizes up to MAXPHYS.
Re: minphys woes
On Wednesday 03 September 2014 15:23:13, Philip Guenther wrote: IMO, the problem that you're hitting with your vioblk device isn't a problem with MAXPHYS, physio(), or minphys, but rather with MAXBSIZE. Back in the 1990's, the fact that you want to be able to configure a device with minphys 20kB on your arch means that the MD MAXBSIZE define for your arch should be 20kB. If MAXBSIZE could be set/limited on a per-device basis (i.e., the code using it queried the device) then your problem would be resolved, no? Thank you very much for this write up. It was very enlightening and I agree that the problem I encountered is that MAXBSIZE is not a per- device value. Another thing this has made clear to me is that I don't want to deal with openbsd's lower level block layer. I will just close the lid and hope the smell goes away. Cheers, Stefan
Re: sparc64: fledgling QEMU support
On Tuesday 09 September 2014 21:27:37, Mark Cave-Ayland wrote: Could the PCI virtio stuff be adapted to non-x86 architectures? QEMU already has a virtio PCI device that can be plugged into qemu-system-sparc64 (see Artyom's blog at http://tyom.blogspot.co.uk/2013/03/debiansparc64-wheezy-under-qemu-h ow-to.html for an example of how to do this with Linux). This could be an amusing project; in theory it would be possible to work on an x86 laptop to test/debug big-endian virtio support with the help of QEMU's virtual hardware. You can do this by plugging in a standard virtual cdrom/hd along with an additional virtio hd/nic, booting from the standard devices and then testing the drivers accessing the extra devices as required. From the openbsd side, virtio should work with sparc. But since nobody has tested it on big endian so far, there will be bugs. And it needs to be enabled in the config, of course. If you look at generic PCI network adapters, I would recommend trying e1000 if possible. Last time I tried it (on x86), qemu's rtl8139 emulation did not work with openbsd's driver, and I think there were some problems with the ne2k emulation, too. Or maybe ne2k just had terrible performance. Cheers, Stefan
Speeding up openbsd on amd64 MP
Hi, I have improved my amd64 pmap patches a bit and finally got around to doing some benchmarks with them. Systems: - core-i7 is an Intel Core i7-860 with 4 cores / 8 threads - core2duo is an Intel Core 2 Duo T8100 with 2 cores - KVM is a VM with 4 vcpus running on qemu 2.1/Linux 3.16 on the above core-i7 SP/MP means single/multi processor kernel Benchmarks: kernel: compilation of the kernel on tmpfs makeindex: 'make index' in /usr/ports on tmpfs forktest: the fork/exit micro benchmark I have posted at http://permalink.gmane.org/gmane.os.openbsd.tech/38172 Speed up is wall clock time without patch divided by time with patch. system kernel benchmark speed up core-i7 MP kernel -j8 1.02 core-i7 MP makeindex 1.11 core-i7 MP forktest1.52 core-i7 SP kernel -j1 1.00 core-i7 SP forktest1.03 core2duo MP kernel -j2 1.17 core2duo MP makeindex 1.12 core2duo MP forktest1.11 KVM MP kernel -j4 1.20 KVM MP forktest3.26 As you see there is no test case where the new pmap is slower than the old one. On bare metal, there is some speed up, depending on system and workload. On virtualization, the speed difference very significant. I didn't run the makeindex test on KVM because it is so slow with the old pmap. I would like people to test the diffs on other machines. In particular on non-Intel CPUs. The only AMD system I could get hold of did not run reliably with openbsd even without the pmap diff. I am posting the patch in two parts as separate mails. The first part contains some simple cleanups and the second part does the actual optimizations. Compared to the patch I have posted on August 18th, the current patch uses the direct mapping in some cases to avoid more TLB flushes (some bits of this come from a different old patch by art@ from 2005). Also, many comments have been updated. Apart from testing, I would of course also appreciate reviews. Compared to art's patch from 2005 which tried to use the direct mapping for everything, the current patch is much less intrusive. Therefore I hope that it is easier to review/debug than his patch, which unfortunately never got integrated. Cheers, Stefan
Speeding up openbsd on amd64 MP - patch 1/2
Simple cleanups for amd64 pmap - use __func__ in panics/printfs (fixes some out of sync function names) - tell the compiler that code paths where we print diagnostics are unlikely - use pmap_valid_entry() in some places - KERNSPACE is not used anywhere diff --git a/sys/arch/amd64/amd64/pmap.c b/sys/arch/amd64/amd64/pmap.c index cd395e1..05cbd3e 100644 --- a/sys/arch/amd64/amd64/pmap.c +++ b/sys/arch/amd64/amd64/pmap.c @@ -436,7 +436,7 @@ pmap_kenter_pa(vaddr_t va, paddr_t pa, vm_prot_t prot) #ifdef LARGEPAGES /* XXX For now... */ if (opte PG_PS) - panic(pmap_kenter_pa: PG_PS); + panic(%s: PG_PS, __func__); #endif if (pmap_valid_entry(opte)) { if (pa PMAP_NOCACHE (opte PG_N) == 0) @@ -899,7 +899,7 @@ pmap_get_ptp(struct pmap *pmap, vaddr_t va, pd_entry_t **pdes) pptp = pmap_find_ptp(pmap, va, ppa, i); #ifdef DIAGNOSTIC if (pptp == NULL) - panic(pde page disappeared); + panic(%s: pde page disappeared, __func__); #endif pptp-wire_count++; } @@ -915,7 +915,7 @@ pmap_get_ptp(struct pmap *pmap, vaddr_t va, pd_entry_t **pdes) if (ptp == NULL) { printf(va %lx ppa %lx\n, (unsigned long)va, (unsigned long)ppa); - panic(pmap_get_ptp: unmanaged user PTP); + panic(%s: unmanaged user PTP, __func__); } #endif } @@ -1031,8 +1031,8 @@ pmap_destroy(struct pmap *pmap) */ #ifdef DIAGNOSTIC - if (pmap-pm_cpus != 0) - printf(pmap_destroy: pmap %p cpus=0x%llx\n, + if (__predict_false(pmap-pm_cpus != 0)) + printf(%s: pmap %p cpus=0x%llx\n, __func__, (void *)pmap, pmap-pm_cpus); #endif @@ -1132,7 +1132,7 @@ pmap_pdes_valid(vaddr_t va, pd_entry_t **pdes, pd_entry_t *lastpde) for (i = PTP_LEVELS; i 1; i--) { index = pl_i(va, i); pde = pdes[i - 2][index]; - if ((pde PG_V) == 0) + if (!pmap_valid_entry(pde)) return FALSE; } if (lastpde != NULL) @@ -1171,7 +1171,7 @@ pmap_extract(struct pmap *pmap, vaddr_t va, paddr_t *pap) pte = ptes[pl1_i(va)]; pmap_unmap_ptes(pmap); - if (__predict_true((pte PG_V) != 0)) { + if (__predict_true(pmap_valid_entry(pte))) { if (pap != NULL) *pap = (pte PG_FRAME) | (va 0xfff); return (TRUE); @@ -1321,16 +1321,16 @@ pmap_remove_ptes(struct pmap *pmap, struct vm_page *ptp, vaddr_t ptpva, if ((opte PG_PVLIST) == 0) { #ifdef DIAGNOSTIC if (pg != NULL) - panic(pmap_remove_ptes: managed page without - PG_PVLIST for 0x%lx, startva); + panic(%s: managed page without PG_PVLIST + for 0x%lx, __func__, startva); #endif continue; } #ifdef DIAGNOSTIC if (pg == NULL) - panic(pmap_remove_ptes: unmanaged page marked - PG_PVLIST, va = 0x%lx, pa = 0x%lx, + panic(%s: unmanaged page marked PG_PVLIST, + va = 0x%lx, pa = 0x%lx, __func__, startva, (u_long)(opte PG_FRAME)); #endif @@ -1388,17 +1388,16 @@ pmap_remove_pte(struct pmap *pmap, struct vm_page *ptp, pt_entry_t *pte, if ((opte PG_PVLIST) == 0) { #ifdef DIAGNOSTIC if (pg != NULL) - panic(pmap_remove_pte: managed page without - PG_PVLIST for 0x%lx, va); + panic(%s: managed page without PG_PVLIST for 0x%lx, + __func__, va); #endif return(TRUE); } #ifdef DIAGNOSTIC if (pg == NULL) - panic(pmap_remove_pte: unmanaged page marked - PG_PVLIST, va = 0x%lx, pa = 0x%lx, va, - (u_long)(opte PG_FRAME)); + panic(%s: unmanaged page marked PG_PVLIST, va = 0x%lx, + pa = 0x%lx, __func__, va, (u_long)(opte PG_FRAME)); #endif /* sync R/M bits */ @@ -1464,8 +1463,8 @@ pmap_do_remove(struct pmap *pmap, vaddr_t sva, vaddr_t eva, int flags) ptp = pmap_find_ptp(pmap, sva, ptppa, 1); #ifdef DIAGNOSTIC if (ptp == NULL) - panic(pmap_remove: unmanaged - PTP detected); + panic(%s: unmanaged PTP detected, +
Speeding up openbsd on amd64 MP - patch 2/2
Optimize pmap on amd64 based on a patch for i386 by Art from 2008 that removes the APTE stuff. Some additional bits were taken from a patch by Art for amd64 from 2005. Instead of mapping the ptes of an inactive pmap in the APTE range, and then doing an unconditional remote TLB-flush on all CPUs, we just switch to the other pmap locally. This only causes a local TLB-flush. In cases where we only need to access a single PTE, walk the page tables manually in the direct mapping. This saves some more TLB flushes. I have also reordered some stuff so that we will usually send the TLB-shootdown-IPIs first, then do some local stuff, and then wait for the remote TLB-shootdown to finish. diff --git a/sys/arch/amd64/amd64/pmap.c b/sys/arch/amd64/amd64/pmap.c index 05cbd3e..c137412 100644 --- a/sys/arch/amd64/amd64/pmap.c +++ b/sys/arch/amd64/amd64/pmap.c @@ -203,7 +203,6 @@ long nkptp[] = NKPTP_INITIALIZER; long nkptpmax[] = NKPTPMAX_INITIALIZER; long nbpd[] = NBPD_INITIALIZER; pd_entry_t *normal_pdes[] = PDES_INITIALIZER; -pd_entry_t *alternate_pdes[] = APDES_INITIALIZER; /* int nkpde = NKPTP; */ @@ -285,11 +284,12 @@ void pmap_enter_pv(struct vm_page *, struct pv_entry *, struct pmap *, vaddr_t, struct vm_page *); struct vm_page *pmap_get_ptp(struct pmap *, vaddr_t, pd_entry_t **); struct vm_page *pmap_find_ptp(struct pmap *, vaddr_t, paddr_t, int); +int pmap_find_pte_direct(struct pmap *pm, vaddr_t va, pt_entry_t **pd, int *offs); void pmap_free_ptp(struct pmap *, struct vm_page *, vaddr_t, pt_entry_t *, pd_entry_t **, struct pg_to_free *); void pmap_freepage(struct pmap *, struct vm_page *, int, struct pg_to_free *); static boolean_t pmap_is_active(struct pmap *, int); -void pmap_map_ptes(struct pmap *, pt_entry_t **, pd_entry_t ***); +void pmap_map_ptes(struct pmap *, pt_entry_t **, pd_entry_t ***, paddr_t *); struct pv_entry *pmap_remove_pv(struct vm_page *, struct pmap *, vaddr_t); void pmap_do_remove(struct pmap *, vaddr_t, vaddr_t, int); boolean_t pmap_remove_pte(struct pmap *, struct vm_page *, pt_entry_t *, @@ -299,14 +299,23 @@ void pmap_remove_ptes(struct pmap *, struct vm_page *, vaddr_t, #define PMAP_REMOVE_ALL0 /* remove all mappings */ #define PMAP_REMOVE_SKIPWIRED 1 /* skip wired mappings */ -void pmap_unmap_ptes(struct pmap *); +void pmap_unmap_ptes(struct pmap *, paddr_t); boolean_t pmap_get_physpage(vaddr_t, int, paddr_t *); boolean_t pmap_pdes_valid(vaddr_t, pd_entry_t **, pd_entry_t *); void pmap_alloc_level(pd_entry_t **, vaddr_t, int, long *); -void pmap_apte_flush(struct pmap *pmap); void pmap_sync_flags_pte(struct vm_page *, u_long); +void pmap_tlb_shootpage(struct pmap *, vaddr_t, int); +void pmap_tlb_shootrange(struct pmap *, vaddr_t, vaddr_t, int); +void pmap_tlb_shoottlb(struct pmap *, int); +#ifdef MULTIPROCESSOR +void pmap_tlb_shootwait(void); +#else +#definepmap_tlb_shootwait() +#endif + + /* * p m a p i n l i n e h e l p e r f u n c t i o n s */ @@ -349,57 +358,75 @@ pmap_sync_flags_pte(struct vm_page *pg, u_long pte) } } -void -pmap_apte_flush(struct pmap *pmap) -{ - pmap_tlb_shoottlb(); - pmap_tlb_shootwait(); -} - /* * pmap_map_ptes: map a pmap's PTEs into KVM - * - * = we lock enough pmaps to keep things locked in - * = must be undone with pmap_unmap_ptes before returning */ void -pmap_map_ptes(struct pmap *pmap, pt_entry_t **ptepp, pd_entry_t ***pdeppp) +pmap_map_ptes(struct pmap *pmap, pt_entry_t **ptepp, pd_entry_t ***pdeppp, paddr_t *save_cr3) { - pd_entry_t opde, npde; + paddr_t cr3 = rcr3(); - /* if curpmap then we are always mapped */ - if (pmap_is_curpmap(pmap)) { - *ptepp = PTE_BASE; - *pdeppp = normal_pdes; - return; - } + /* the kernel's pmap is always accessible */ + if (pmap == pmap_kernel() || pmap-pm_pdirpa == cr3) { + *save_cr3 = 0; + } else { + *save_cr3 = cr3; - /* need to load a new alternate pt space into curpmap? */ - opde = *APDP_PDE; - if (!pmap_valid_entry(opde) || (opde PG_FRAME) != pmap-pm_pdirpa) { - npde = (pd_entry_t) (pmap-pm_pdirpa | PG_RW | PG_V); - *APDP_PDE = npde; - if (pmap_valid_entry(opde)) - pmap_apte_flush(curpcb-pcb_pmap); + /* +* Not sure if we need this, but better be safe. +* We don't have the current pmap in order to unset its +* active bit, but this just means that we may receive +* an unneccessary cross-CPU TLB flush now and then. +*/ + x86_atomic_setbits_u64(pmap-pm_cpus, (1ULL cpu_number())); + + lcr3(pmap-pm_pdirpa); } - *ptepp = APTE_BASE; - *pdeppp = alternate_pdes; + + *ptepp = PTE_BASE; + *pdeppp = normal_pdes; + return; }
Re: Speeding up openbsd on amd64 MP
On Sun, 14 Sep 2014, Job Snijders wrote: On Sun, Sep 14, 2014 at 02:54:54PM +0200, Stefan Fritsch wrote: I would like people to test the diffs on other machines. In particular on non-Intel CPUs. The only AMD system I could get hold of did not run reliably with openbsd even without the pmap diff. I tested MP 'kernel -j 2' on a PC Engines apu1c4 (AMD G series T40E, 1 GHz dual Bobcat core with 64 bit support) and saw no speedup. Thanks. But no slowdown either? If you have time, it would be nice to see the result for the fork test, which is a lot more sensitive to the pmap handling. The source is at http://permalink.gmane.org/gmane.os.openbsd.tech/38172 Also, it would be welcome if people tested the patch for stability. I have been running it on a build VM for a few weeks, but that is hardly a complete stability test.
Re: Speeding up openbsd on amd64 MP
On Sunday 14 September 2014 12:13:44, Brent Cook wrote: Results for an old Athlon (hmm, I don't remember it running at 10.8 Ghz before) cpu0: AMD Athlon(tm) 64 X2 Dual Core Processor 6000+, 10823.06 MHz forktest 0m19.23s real 0m0.72s user 0m22.46s system forktest+0m16.33s real 0m0.50s user 0m18.22s system kernel -j2 3m0.68s real 5m10.95s user0m41.58s system kernel -j2+3m0.45s real 5m10.14s user0m41.53s system On Sunday 14 September 2014 19:16:41, Job Snijders wrote: The speedup kernel seems to perform slightly better: [root@speedup ~]# time ./forktest real0m33.134s user0m1.230s sys 0m53.710s [root@5.5stable ~]# time ./forktest real0m37.600s user0m1.200s sys 0m59.530s Thanks to both of you for testing. These results are more in line with my Core i7 and I expected the effect to be larger the large the number of CPUs is. Just my 6-year- old Core2 Duo does not fit in there and gives a much larger speedup. Maybe it is particularily bad at something the current pmap code does.
Re: Speeding up openbsd on amd64 MP - patch 2/2
This is now in -current. If anyone notices problems, give noise. Cheers, Stefan
Removing -Wno-format from kernel makefiles, 1/16
add support for %td for ptrdiff_t in kernel this also adds support in gcc 4.x kprintf --- gnu/gcc/gcc/c-format.c |7 --- sys/kern/subr_prf.c|6 ++ 2 files changed, 10 insertions(+), 3 deletions(-) diff --git gnu/gcc/gcc/c-format.c gnu/gcc/gcc/c-format.c index b9eecee..1b1734b 100644 --- gnu/gcc/gcc/c-format.c +++ gnu/gcc/gcc/c-format.c @@ -325,6 +325,7 @@ static const format_length_info kprintf_length_specs[] = { l, FMT_LEN_l, STD_C89, ll, FMT_LEN_ll, STD_C9L }, { q, FMT_LEN_ll, STD_EXT, NULL, 0, 0 }, { z, FMT_LEN_z, STD_C99, NULL, 0, 0 }, + { t, FMT_LEN_t, STD_C99, NULL, 0, 0 }, { NULL, 0, 0, NULL, 0, 0 } }; @@ -552,9 +553,9 @@ static const format_char_info asm_fprintf_char_table[] = static const format_char_info kprint_char_table[] = { /* C89 conversion specifiers. */ - { di, 0, STD_C89, { T89_I, BADLEN, T89_S, T89_L, T9L_LL, BADLEN, T99_SST, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0 +'I, i, NULL }, - { oxX, 0, STD_C89, { T89_UI, BADLEN, T89_US, T89_UL, T9L_ULL, BADLEN, T99_ST, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0#,i, NULL }, - { u, 0, STD_C89, { T89_UI, BADLEN, T89_US, T89_UL, T9L_ULL, BADLEN, T99_ST, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0'I, i, NULL }, + { di, 0, STD_C89, { T89_I, BADLEN, T89_S, T89_L, T9L_LL, BADLEN, T99_SST, T99_PD, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0 +'I, i, NULL }, + { oxX, 0, STD_C89, { T89_UI, BADLEN, T89_US, T89_UL, T9L_ULL, BADLEN, T99_ST, T99_UPD, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0#,i, NULL }, + { u, 0, STD_C89, { T89_UI, BADLEN, T89_US, T89_UL, T9L_ULL, BADLEN, T99_ST, T99_UPD, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0'I, i, NULL }, { c, 0, STD_C89, { T89_I, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -w, , NULL }, { s, 1, STD_C89, { T89_C, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -wp, cR, NULL }, { p, 1, STD_C89, { T89_V, BADLEN, BADLEN, T89_UL, T9L_LL, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0, c, NULL }, diff --git sys/kern/subr_prf.c sys/kern/subr_prf.c index 768d164..c940141 100644 --- sys/kern/subr_prf.c +++ sys/kern/subr_prf.c @@ -842,6 +842,12 @@ reswitch: switch (ch) { size = 1; sign = '\0'; break; + case 't': + { + /* assume ptrdiff_t is long */ + CTASSERT(sizeof(fmt - fmt0) == sizeof(long)); + } + /* FALLTHROUGH */ case 'D': flags |= LONGINT; /*FALLTHROUGH*/
Removing -Wno-format from kernel makefiles, 06/16
Use %z* for size_t while there, fix a few %d into %u --- sys/arch/i386/pci/pcibios.c |2 +- sys/dev/ic/bwi.c|4 ++-- sys/dev/ic/pgt.c|6 +++--- sys/dev/ic/ti.c |4 ++-- sys/dev/pci/amdiic.c|4 ++-- sys/dev/pci/amdpm.c |2 +- sys/dev/pci/if_iwi.c|4 ++-- sys/dev/pci/if_iwn.c| 10 +- sys/dev/pci/if_lge.c|4 ++-- sys/dev/pci/if_nge.c|4 ++-- sys/dev/pci/if_tl.c |2 +- sys/dev/pci/if_wb.c |2 +- sys/dev/pci/if_wpi.c|4 ++-- sys/dev/pci/piixpm.c|2 +- sys/dev/pci/yds.c |2 +- sys/dev/usb/if_urndis.c | 12 ++-- sys/netinet/ip_output.c |2 +- sys/netinet6/ip6_forward.c |2 +- sys/netinet6/ip6_output.c |2 +- sys/nfs/nfs_socket.c|4 ++-- 20 files changed, 39 insertions(+), 39 deletions(-) diff --git sys/arch/i386/pci/pcibios.c sys/arch/i386/pci/pcibios.c index 1b34391..372ac51 100644 --- sys/arch/i386/pci/pcibios.c +++ sys/arch/i386/pci/pcibios.c @@ -260,7 +260,7 @@ pcibios_pir_init(struct pcibios_softc *sc) cksum += p[i]; printf(%s: PCI IRQ Routing Table rev %d.%d @ 0x%lx/%d - (%d entries)\n, sc-sc_dev.dv_xname, + (%zd entries)\n, sc-sc_dev.dv_xname, pirh-version 8, pirh-version 0xff, pa, pirh-tablesize, (pirh-tablesize - sizeof(*pirh)) / 16); diff --git sys/dev/ic/bwi.c sys/dev/ic/bwi.c index 6295172..934a6ac 100644 --- sys/dev/ic/bwi.c +++ sys/dev/ic/bwi.c @@ -1673,7 +1673,7 @@ bwi_fwimage_is_valid(struct bwi_softc *sc, uint8_t *fw, size_t fw_len, const struct bwi_fwhdr *hdr; if (fw_len sizeof(*hdr)) { - printf(%s: invalid firmware (%s): invalid size %u\n, + printf(%s: invalid firmware (%s): invalid size %zu\n, sc-sc_dev.dv_xname, fw_name, fw_len); return (1); } @@ -1686,7 +1686,7 @@ bwi_fwimage_is_valid(struct bwi_softc *sc, uint8_t *fw, size_t fw_len, */ if (betoh32(hdr-fw_size) != fw_len - sizeof(*hdr)) { printf(%s: invalid firmware (%s): size mismatch, - fw %u, real %u\n, + fw %u, real %zu\n, sc-sc_dev.dv_xname, fw_name, betoh32(hdr-fw_size), diff --git sys/dev/ic/pgt.c sys/dev/ic/pgt.c index 692b72c..6f6a5d6 100644 --- sys/dev/ic/pgt.c +++ sys/dev/ic/pgt.c @@ -3193,7 +3193,7 @@ pgt_dma_alloc_queue(struct pgt_softc *sc, enum pgt_queue pq) error = bus_dmamem_alloc(sc-sc_dmat, PGT_FRAG_SIZE, PAGE_SIZE, 0, pd-pd_dmas, 1, nsegs, BUS_DMA_WAITOK); if (error != 0) { - printf(%s: error alloc frag %u on queue %u\n, + printf(%s: error alloc frag %zu on queue %u\n, sc-sc_dev.dv_xname, i, pq); free(pd, M_DEVBUF); break; @@ -3202,7 +3202,7 @@ pgt_dma_alloc_queue(struct pgt_softc *sc, enum pgt_queue pq) error = bus_dmamem_map(sc-sc_dmat, pd-pd_dmas, nsegs, PGT_FRAG_SIZE, (caddr_t *)pd-pd_mem, BUS_DMA_WAITOK); if (error != 0) { - printf(%s: error map frag %u on queue %u\n, + printf(%s: error map frag %zu on queue %u\n, sc-sc_dev.dv_xname, i, pq); free(pd, M_DEVBUF); break; @@ -3212,7 +3212,7 @@ pgt_dma_alloc_queue(struct pgt_softc *sc, enum pgt_queue pq) error = bus_dmamap_load(sc-sc_dmat, pd-pd_dmam, pd-pd_mem, PGT_FRAG_SIZE, NULL, BUS_DMA_NOWAIT); if (error != 0) { - printf(%s: error load frag %u on queue %u\n, + printf(%s: error load frag %zu on queue %u\n, sc-sc_dev.dv_xname, i, pq); bus_dmamem_free(sc-sc_dmat, pd-pd_dmas, nsegs); diff --git sys/dev/ic/ti.c sys/dev/ic/ti.c index 85e3a2a..902591f 100644 --- sys/dev/ic/ti.c +++ sys/dev/ic/ti.c @@ -612,7 +612,7 @@ ti_alloc_jumbo_mem(struct ti_softc *sc) state = 1; if (bus_dmamem_map(sc-sc_dmatag, seg, rseg, TI_JMEM, kva, BUS_DMA_NOWAIT)) { - printf(%s: can't map dma buffers (%d bytes)\n, + printf(%s: can't map dma buffers (%zu bytes)\n, sc-sc_dv.dv_xname, TI_JMEM); error = ENOBUFS; goto out; @@ -1600,7 +1600,7 @@ ti_attach(struct ti_softc *sc) } if (bus_dmamem_map(sc-sc_dmatag, seg, rseg, sizeof(struct ti_ring_data), kva,
Re: Removing -Wno-format from kernel makefiles, 3/16
On Wed, 3 Jul 2013, Mark Kettenis wrote: diff --git sys/arch/i386/i386/db_interface.c sys/arch/i386/i386/db_interface.c index 85c1ff5..c75fd89 100644 --- sys/arch/i386/i386/db_interface.c +++ sys/arch/i386/i386/db_interface.c @@ -197,11 +197,11 @@ db_sysregs_cmd(db_expr_t addr, int have_addr, db_expr_t count, char *modif) uint16_t ldtr, tr; __asm__ __volatile__(sidt %0 : =m (idtr)); - db_printf(idtr: 0x%08x/%04x\n, + db_printf(idtr: 0x%08x/%04llx\n, (unsigned int)(idtr 16), idtr 0x); __asm__ __volatile__(sgdt %0 : =m (gdtr)); - db_printf(gdtr: 0x%08x/%04x\n, + db_printf(gdtr: 0x%08x/%04llx\n, (unsigned int)(gdtr 16), gdtr 0x); This is a tad bit inconsistent. I'd either use %llx for both values and get rid of the cast, or use %x and use a cast in both cases. Like this? --- sys/arch/i386/i386/db_interface.c +++ sys/arch/i386/i386/db_interface.c @@ -197,12 +197,10 @@ db_sysregs_cmd(db_expr_t addr, int have_addr, db_expr_t count, char *modif) uint16_t ldtr, tr; __asm__ __volatile__(sidt %0 : =m (idtr)); - db_printf(idtr: 0x%08x/%04x\n, - (unsigned int)(idtr 16), idtr 0x); + db_printf(idtr: 0x%08llx/%04llx\n, idtr 16, idtr 0x); __asm__ __volatile__(sgdt %0 : =m (gdtr)); - db_printf(gdtr: 0x%08x/%04x\n, - (unsigned int)(gdtr 16), gdtr 0x); + db_printf(gdtr: 0x%08llx/%04llx\n, gdtr 16, gdtr 0x); __asm__ __volatile__(sldt %0 : =g (ldtr)); db_printf(ldtr: 0x%04x\n, ldtr);
Re: Removing -Wno-format from kernel makefiles, 4/16
On Wed, 3 Jul 2013, Mark Kettenis wrote: diff --git sys/arch/i386/i386/esm.c sys/arch/i386/i386/esm.c index c90b2c4..3dff69e 100644 --- sys/arch/i386/i386/esm.c +++ sys/arch/i386/i386/esm.c @@ -880,7 +880,7 @@ esm_make_sensors(struct esm_softc *sc, struct esm_devmap *devmap, } for (j = 0; j nsensors; j++) { - snprintf(s[j].desc, sizeof(s[j].desc), %s %d, + snprintf(s[j].desc, sizeof(s[j].desc), %s %ld, sensor_map[i].name, sensor_map[i].arg + j); } break; Looking at this one, it makes more sense to make the arg member of struct esm_sensor_map an int. That will result in some space savings if we'd ever bring this driver to amd64. --- sys/arch/i386/i386/esm.c +++ sys/arch/i386/i386/esm.c @@ -87,7 +87,7 @@ enum sensor_type esm_typemap[] = { struct esm_sensor_map { enum esm_sensor_typetype; - longarg; + int arg; const char *name; };
Removing -Wno-format from kernel makefiles, 07/16
fix: %x instead of %p for int --- sys/dev/pci/musycc_obsd.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git sys/dev/pci/musycc_obsd.c sys/dev/pci/musycc_obsd.c index 25a58d8..0844136 100644 --- sys/dev/pci/musycc_obsd.c +++ sys/dev/pci/musycc_obsd.c @@ -215,7 +215,7 @@ musycc_ebus_attach(struct device *parent, struct musycc_softc *esc, #endif if (ebus_attach_device(rom, sc, 0, 0x400) != 0) { - printf(: failed to map rom @ %05p\n, 0); + printf(: failed to map rom @ %05x\n, 0); goto failed; } -- 1.7.6
Removing -Wno-format from kernel makefiles, 10/16
fix: pointer to long --- sys/arch/i386/pci/pci_addr_fixup.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git sys/arch/i386/pci/pci_addr_fixup.c sys/arch/i386/pci/pci_addr_fixup.c index 59880eb..40f7918 100644 --- sys/arch/i386/pci/pci_addr_fixup.c +++ sys/arch/i386/pci/pci_addr_fixup.c @@ -354,7 +354,7 @@ pciaddr_do_resource_reserve_disabled(struct pcibios_softc *sc, (val PCI_COMMAND_IO_ENABLE) == PCI_COMMAND_IO_ENABLE) return (0); - PCIBIOS_PRINTV((disabled %s space at addr 0x%x size 0x%x\n, + PCIBIOS_PRINTV((disabled %s space at addr 0x%lx size 0x%x\n, type == PCI_MAPREG_TYPE_MEM ? mem : io, *addr, size)); error = extent_alloc_region(ex, *addr, size, EX_NOWAIT | EX_MALLOCOK); -- 1.7.6
Removing -Wno-format from kernel makefiles, 08/16
%hu/%hd for uint16_t, %u/%d/%x for uint32_t despite the name, ntohl returns uint32_t also fix some %d into %u --- sys/arch/i386/i386/bios.c|4 ++-- sys/arch/i386/i386/ioapic.c |2 +- sys/arch/i386/pci/pcibios.c |4 ++-- sys/kern/vfs_cluster.c |2 +- sys/msdosfs/msdosfs_conv.c |2 +- sys/msdosfs/msdosfs_denode.c |2 +- sys/msdosfs/msdosfs_vnops.c |2 +- sys/net/if_spppsubr.c|2 +- 8 files changed, 10 insertions(+), 10 deletions(-) diff --git sys/arch/i386/i386/bios.c sys/arch/i386/i386/bios.c index 6e46e6a..75dab38 100644 --- sys/arch/i386/i386/bios.c +++ sys/arch/i386/i386/bios.c @@ -238,7 +238,7 @@ biosattach(struct device *parent, struct device *self, void *aux) bios32_entry.segment = GSEL(GCODE_SEL, SEL_KPL); bios32_entry.offset = (u_int32_t)ISA_HOLE_VADDR(h-entry); - printf(, BIOS32 rev. %d @ 0x%lx, h-rev, h-entry); + printf(, BIOS32 rev. %d @ 0x%x, h-rev, h-entry); break; } } @@ -285,7 +285,7 @@ biosattach(struct device *parent, struct device *self, void *aux) for (; pa end; pa+= NBPG, eva+= NBPG) pmap_kenter_pa(eva, pa, VM_PROT_READ); - printf(, SMBIOS rev. %d.%d @ 0x%lx (%d entries), + printf(, SMBIOS rev. %d.%d @ 0x%x (%hd entries), sh-majrev, sh-minrev, sh-addr, sh-count); /* * Unbelievably the SMBIOS version number diff --git sys/arch/i386/i386/ioapic.c sys/arch/i386/i386/ioapic.c index 0b0c6b8..4197553 100644 --- sys/arch/i386/i386/ioapic.c +++ sys/arch/i386/i386/ioapic.c @@ -300,7 +300,7 @@ ioapic_attach(struct device *parent, struct device *self, void *aux) ioapic_add(sc); - printf( pa 0x%lx, aaa-apic_address); + printf( pa 0x%x, aaa-apic_address); if (bus_mem_add_mapping(aaa-apic_address, PAGE_SIZE, 0, bh) != 0) { printf(, map failed\n); diff --git sys/arch/i386/pci/pcibios.c sys/arch/i386/pci/pcibios.c index 372ac51..bc8a848 100644 --- sys/arch/i386/pci/pcibios.c +++ sys/arch/i386/pci/pcibios.c @@ -151,7 +151,7 @@ pcibiosprobe(struct device *parent, void *match, void *aux) rv = bios32_service(PCIBIOS_SIGNATURE, pcibios_entry, pcibios_entry_info); - PCIBIOS_PRINTV((pcibiosprobe: 0x%lx:0x%lx at 0x%lx[0x%lx]\n, + PCIBIOS_PRINTV((pcibiosprobe: 0x%hx:0x%x at 0x%x[0x%x]\n, pcibios_entry.segment, pcibios_entry.offset, pcibios_entry_info.bei_base, pcibios_entry_info.bei_size)); @@ -172,7 +172,7 @@ pcibiosattach(struct device *parent, struct device *self, void *aux) rev_min, mech1, mech2, scmech1, scmech2, sc-max_bus); - printf(: rev %d.%d @ 0x%lx/0x%lx\n, + printf(: rev %d.%d @ 0x%x/0x%x\n, rev_maj, rev_min 4, pcibios_entry_info.bei_base, pcibios_entry_info.bei_size); diff --git sys/kern/vfs_cluster.c sys/kern/vfs_cluster.c index f9df40e..118a411 100644 --- sys/kern/vfs_cluster.c +++ sys/kern/vfs_cluster.c @@ -175,7 +175,7 @@ cluster_wbuild(struct vnode *vp, struct buf *last_bp, long size, #ifdef DIAGNOSTIC if (size != vp-v_mount-mnt_stat.f_iosize) - panic(cluster_wbuild: size %ld != filesize %ld, + panic(cluster_wbuild: size %ld != filesize %u, size, vp-v_mount-mnt_stat.f_iosize); #endif redo: diff --git sys/msdosfs/msdosfs_conv.c sys/msdosfs/msdosfs_conv.c index 727acd2..538aa1c 100644 --- sys/msdosfs/msdosfs_conv.c +++ sys/msdosfs/msdosfs_conv.c @@ -209,7 +209,7 @@ dos2unixtime(u_int dd, u_int dt, u_int dh, struct timespec *tsp) */ month = (dd DD_MONTH_MASK) DD_MONTH_SHIFT; if (month == 0) { - printf(dos2unixtime(): month value out of range (%ld)\n, + printf(dos2unixtime(): month value out of range (%u)\n, month); month = 1; } diff --git sys/msdosfs/msdosfs_denode.c sys/msdosfs/msdosfs_denode.c index d384a8b..29f2e57 100644 --- sys/msdosfs/msdosfs_denode.c +++ sys/msdosfs/msdosfs_denode.c @@ -387,7 +387,7 @@ detrunc(struct denode *dep, uint32_t length, int flags, struct ucred *cred, * directory's life. */ if ((DETOV(dep)-v_flag VROOT) !FAT32(pmp)) { - printf(detrunc(): can't truncate root directory, clust %ld, offset %ld\n, + printf(detrunc(): can't truncate root directory, clust %u, offset %u\n, dep-de_dirclust, dep-de_diroffset); return (EINVAL); } diff --git sys/msdosfs/msdosfs_vnops.c sys/msdosfs/msdosfs_vnops.c index 79f611c..25d7030 100644 --- sys/msdosfs/msdosfs_vnops.c +++ sys/msdosfs/msdosfs_vnops.c @@
Re: Removing -Wno-format from kernel makefiles, V2
On Wednesday 03 July 2013, Stefan Fritsch wrote: I haven't done any signedness fixes so far, only the size fixes. But I will look over the patches again and try to fix the signedness issues in the touched code lines, too. There are quite a few I guess. For people prefering to review by source files, an updated patch is here: http://www.sfritsch.de/~stf/openbsd-format-string-v3/combined.patch
Re: virtio MSI
Am Mittwoch, 18. September 2013, 07:19:30 schrieb Chris Cappuccio: Here's a simple and obvious change that would be necessary to support virtio under bhyve. But it is only acceptable if other virtio implementations either 1. don't claim to support MSI or 2. also work with MSI. Qemu does not advertise MSI support, so no problem there. Committed.
Re: virtio MSI
Out of interest, does openbsd run on bhyve with this patch, or are there other problems? On Sun, 22 Sep 2013, Stefan Fritsch wrote: Am Mittwoch, 18. September 2013, 07:19:30 schrieb Chris Cappuccio: Here's a simple and obvious change that would be necessary to support virtio under bhyve. But it is only acceptable if other virtio implementations either 1. don't claim to support MSI or 2. also work with MSI. Qemu does not advertise MSI support, so no problem there. Committed.
Snapshot install54.iso does not boot under qemu/kvm
Hi, the 2013-12-31 snapshot iso images don't work for me under qemu 1.7.0 / linux 3.12.6. They get this far and then reboot: loadrandom: cd0a:/etc/random.seed cannot open cd0a:/etc/random.seed loadrandom: error -1 booting cd0a:/5.4/i386/bsd.rd: 6038744+426300 [72+233616+222178]=0x699c6c entry point at 0x210073 I have tried both i386 and amd64. sha256: bdb45f421d4108ebf8da6896f23a1b23e81c968782e41da5529bce4d3146e5fa install54-amd64-snapshot-2013-12-31.iso bbc58d620aff0d5f6c97c6d193c1646754c5479296b97a2ebb468729e8aa0c30 install54-i386-snapshot-2013-12-31.iso I still have snapshots isos from 2013-09-22 lying around, and those continue to work. Therefore I believe it is not a qemu/kvm problem. The last things happening before the reboot seem to be these (amd64 iso): kvm_emulate_insn: 40120:1168:67 66 0f 01 15 88 0f 00 00 (real) kvm_emulate_insn: 40120:1171:0f 20 c0 (real) kvm_emulate_insn: 40120:1174:66 83 c8 01 (real) kvm_emulate_insn: 40120:1178:0f 22 c0 (real) kvm_emulate_insn: 40120:117b:66 ea a3 12 04 00 08 00 (prot16) kvm_emulate_insn: 0:412a3:b8 10 00 00 00 (prot32) kvm_emulate_insn: 0:412a8:8e d8 (prot32) kvm_emulate_insn: 0:412aa:8e d0 (prot32) kvm_emulate_insn: 0:412ac:8e c0 (prot32) kvm_exit: reason EPT_VIOLATION rip 0x1010149 info 182 0 kvm_page_fault: address 15e8050 error_code 182 kvm_exit: reason CPUID rip 0x10101ae info 0 0 kvm_cpuid: func 0 rax 4 rbx 756e6547 rcx 6c65746e rdx 49656e69 kvm_exit: reason CPUID rip 0x10101cf info 0 0 kvm_cpuid: func 1 rax 623 rbx 800 rcx 80802001 rdx 78bfbfd kvm_exit: reason CPUID rip 0x10101e7 info 0 0 kvm_cpuid: func a rax 0 rbx 0 rcx 0 rdx 0 kvm_exit: reason CPUID rip 0x10101ff info 0 0 kvm_cpuid: func 8001 rax 623 rbx 0 rcx 1 rdx 2191abfd kvm_exit: reason CPUID rip 0x1010212 info 0 0 kvm_cpuid: func 8007 rax 0 rbx 0 rcx 0 rdx 0 kvm_exit: reason CR_ACCESS rip 0x101042b info 4 0 kvm_cr: cr_write 4 = 0x6b0 kvm_exit: reason MSR_READ rip 0x1010433 info 0 0 kvm_msr: msr_read c080 = 0x0 kvm_exit: reason MSR_WRITE rip 0x101043c info 0 0 kvm_msr: msr_write c080 = 0x101 kvm_exit: reason CR_ACCESS rip 0x1010440 info 3 0 kvm_cr: cr_write 3 = 0x1b2b000 kvm_exit: reason CR_ACCESS rip 0x101044b info 0 0 kvm_cr: cr_write 0 = 0x8001003b kvm_exit: reason TRIPLE_FAULT rip 0x101045d info 0 0 kvm_userspace_exit: reason KVM_EXIT_SHUTDOWN (8) kvm_userspace_exit: reason restart (4) Not sure if this is related to the actual problem or just the reboot after the problem happened earlier. Cheers, Stefan
Virtio drivers for OpenBSD
Hi, I have been working on porting NetBSD's virtio drivers to OpenBSD. I am not finished yet, but in order to prevent duplicate work, I thought I'd publish the current state (attached as diff to OpenBSD 5.1). It adds a virtio block device driver (viod) and a virtio network interface (if_vioif). It is stable enough to run make -j 2 in /usr/src on a viod disk. Comments are welcome. BTW: Which device numbers should I choose for viod? Use the first unused number or just add virtio at the end? Cheers, Stefan [demime 1.01d removed an attachment of type TEXT/x-diff which had a name of openbsd-virtio-v1.diff]
Re: Virtio drivers for OpenBSD
On Wed, 11 Jul 2012, Otto Moerbeek wrote: On Wed, Jul 11, 2012 at 02:07:00PM +0200, Stefan Fritsch wrote: [demime 1.01d removed an attachment of type TEXT/x-diff which had a name of openbsd-virtio-v1.diff] Diff got stripped. I have put it at http://www.sfritsch.de/~stf/openbsd-virtio/openbsd-virtio-v1.diff Can you give us some examples of the use of this driver? Virtio disks and NICs are paravirtualized devices offered by kvm/qemu and virtualbox. Using them should give somewhat higher performance and/or lower CPU usage than using emulated hardware like IDE and e1000. But I haven't done any detailed performance tests, yet.
Re: Virtio drivers for OpenBSD
On Wed, 11 Jul 2012, Gleydson Soares wrote: could you resubmit a new diff against -current ? please, attach it inline here. ok, will do (but it will take a bit until I have my machine set up)
Re: Virtio drivers for OpenBSD
On Wed, 11 Jul 2012, Ted Unangst wrote: We are consolidating and reducing the number of disk types. I think this would be better if it attached as scsi disks, which is what all the cool virtual disks do these days. There is a virtio-scsi device, too, but this is only supported in very recent versions of qemu. To attach the simpler virtio-block device as scsi, the driver would have to emulate the scsi commands. Is there some generic infrastructure to do that? Or which driver would you recommend to take the emulation code from? sys/dev/ata/atascsi.c seems rather complex (it is twice the size of sys/dev/pci/viod.c).
Re: Virtio drivers for OpenBSD
On Wed, 11 Jul 2012, Theo de Raadt wrote: There are quite a few pretendy-SCSI drivers in the tree, but most of them deal with fairly complicated hardware so they're not good reference points. sparc64's vdsk(4) driver is probably the best reference point, since it's also for a virtual disk driver (see sys/arch/sparc64/dev/vdsk.c). Oh yes, vdsk(4) is also a good suggestion. ok, thanks. I will take a look at that one.
Re: Virtio drivers for OpenBSD
On Thu, 12 Jul 2012, Jiri B wrote: On Thu, Jul 12, 2012 at 11:02:54AM +0200, Stefan Fritsch wrote: On Wed, 11 Jul 2012, Matthew Dempsky wrote: There are quite a few pretendy-SCSI drivers in the tree, but most of them deal with fairly complicated hardware so they're not good reference points. sparc64's vdsk(4) driver is probably the best reference point, since it's also for a virtual disk driver (see sys/arch/sparc64/dev/vdsk.c). If you would like to dig into vmware pvscsi support, I'm more than happy to do testing :) http://lxr.linux.no/#linux+v3.4.4/drivers/scsi/vmw_pvscsi.c I will leave that to other people. I am not interested in vmware ATM.
Re: [Patch] Virtio drivers for OpenBSD V5
On Monday 13 August 2012 17:07:41 you wrote: * Note: the i386 does not currently require barriers, but we must * provide the flags to MI code. This is not correct for virtio. We need a memory barrier. sure, copy it from amd64. OK. A slight complication: sfence/mfence/lfence do not exist on all i386 CPUs. sfence was introduced with SSE, [lm]fence with SSE2. This is not really a problem with the virtio driver because the virtualization extensions were introduced much later. But it may be a problem with the rest of the i386 code. How do you normally handle this? Check during boot and set a pointer to the function to be used and call that function pointer from bus_space_barrier()? -- genua Gesellschaft fuer Netzwerk- und Unix-Administration mbH Domagkstrasse 7, 85551 Kirchheim bei Muenchen tel +49 89 991950-0, fax -999, www.genua.de Geschaeftsfuehrer: Dr. Magnus Harlander, Dr. Michaela Harlander, Bernhard Schneck. Amtsgericht Muenchen HRB 98238
Re: [Patch] Virtio drivers for OpenBSD V5
On Monday 13 August 2012 17:00:56 Mike Belopuhov wrote: you're calling MCLGETI and provide an interface pointer but you forget to set watermarks via m_clsetwms. OK. The MCLGETI man page should probably get a pointer to m_clsetwms(). -- genua Gesellschaft fuer Netzwerk- und Unix-Administration mbH Domagkstrasse 7, 85551 Kirchheim bei Muenchen tel +49 89 991950-0, fax -999, www.genua.de Geschaeftsfuehrer: Dr. Magnus Harlander, Dr. Michaela Harlander, Bernhard Schneck. Amtsgericht Muenchen HRB 98238
Re: Virtio network interface driver
I have been working on porting virtio devices too. Actually for now only networking driver complete The main difference between Stefen's version is virtio devices attaches directly to pci bus. I don't think this is a good idea. There are other transports for virtio than pci, e.g. virtio-mmio. And virtio-mmio behaves differently enough from virtio-pci that supporting both in all virtio device drivers would be a PITA. But I think your name of vio for the network devices is better than NetBSD's vioif. I think I will rename it, too.
Re: [Patch] Virtio drivers for OpenBSD V6
On Fri, 17 Aug 2012, Mike Belopuhov wrote: On Fri, Aug 17, 2012 at 1:17 PM, Stefan Fritsch s...@sfritsch.de wrote: On Thursday 16 August 2012, Mike Belopuhov wrote: could you please tell me if you're using tx interrupt or not? if yes, why do you need to have a watchdog code implement a txeof path and not the actual reset? In normal operation, tx interrupts are not used and the sent descriptors are cleaned up either by rx interrupts or by the watchdog. The sole exception is when the tx queue is full. exactly, so please don't do it. use a tx interrupt for txeof and use a watchdog to reset a controller. Always using the tx interrupt decreases performance significantly. On my test system (in MBytes/s): Sending UDP from 75 to 55 Sending TCP from 34 to 25 I dont' think we want to do this.
Re: [Patch] Virtio drivers for OpenBSD V6
On Mon, 20 Aug 2012, Mike Belopuhov wrote: Always using the tx interrupt decreases performance significantly. On my test system (in MBytes/s): Sending UDP from 75 to 55 Sending TCP from 34 to 25 Why? Because the interrupt load increases by approx. 7000 per second. I think the problem is that the backend sends packets faster than openbsd can enqueue them. This means that the tx queue will be empty often, and each time an tx interrupt will be triggered. I dont' think we want to do this. Perhaps you got your OACTIVE handling wrong and don't call a start routing from the appropriate places. Have you considered that? I don't see how this could help, but I will look at it.
Re: [Patch] Virtio drivers for OpenBSD V6
On Mon, 20 Aug 2012, Mike Belopuhov wrote: On Mon, Aug 20, 2012 at 1:42 PM, Stefan Fritsch s...@sfritsch.de wrote: Because the interrupt load increases by approx. 7000 per second. I think the problem is that the backend sends packets faster than openbsd can enqueue them. This means that the tx queue will be empty often, and each time an tx interrupt will be triggered. this doesn't sound right as it's the driver that decides how much to transmit. How so? It simply transmits all packets that are in ifp-if_snd. btw is there a interrupt moderation mechanism available? Unfortunately not one that is time based. It is only possible to defer the interrupt for a certain amount of packets (with the RingEventIdx feature) or until the last descriptor is removed from the ring (with the NotifyOnEmpty feature). But the latter does not help much and the former is not usable unless you have some timer that cleans up resources if no more packets are sent. For the rx interrupts it is even worse. I don't see any sensible way to reduce interrupts without switching to polling when there are lots of inbound packets. Are there already NICs using polling in OpenBSD?
Re: [Patch] Virtio drivers for OpenBSD V6
On Mon, 20 Aug 2012, Mike Belopuhov wrote: Because the interrupt load increases by approx. 7000 per second. I think the problem is that the backend sends packets faster than openbsd can enqueue them. This means that the tx queue will be empty often, and each time an tx interrupt will be triggered. this doesn't sound right as it's the driver that decides how much to transmit. How so? It simply transmits all packets that are in ifp-if_snd. only if OACTIVE is not set. Which does not help at all if I want to send more packets in one go, not fewer. btw is there a interrupt moderation mechanism available? Unfortunately not one that is time based. It is only possible to defer the interrupt for a certain amount of packets (with the RingEventIdx feature) or until the last descriptor is removed from the ring (with the NotifyOnEmpty feature). But the latter does not help much and the former is not usable unless you have some timer that cleans up resources if no more packets are sent. For the rx interrupts it is even worse. I don't see any sensible way to reduce interrupts without switching to polling when there are lots of inbound packets. that sucks, but that doesn't mean you have to abuse watchdog. What do yo suggest then? Use a separate timeout? Accept the lower performance? Resetting the device in the watchdog has another unexpected problem: In order to restore the MAC filters and promisc state, vioif_iff needs to do asynchronous requests on the control queue. But in the watchdog, I cannot sleep. FWIW, if_vic does not have a watchdog at all. Not having a watchdog looks rather attractive for the virtio driver, too. Cheers, Stefan
Small memcpy optimization
On x86, the xchg operation between reg and mem has an implicit lock prefix, i.e. it is a relatively expensive atomic operation. This is not needed here. --- a/sys/arch/i386/i386/locore.s +++ b/sys/arch/i386/i386/locore.s @@ -802,8 +802,9 @@ ENTRY(bcopy) */ ENTRY(memcpy) movl4(%esp),%ecx - xchg8(%esp),%ecx - movl%ecx,4(%esp) + movl8(%esp),%eax + movl%ecx,8(%esp) + movl%eax,4(%esp) jmp _C_LABEL(bcopy)
Re: [Patch] Virtio drivers for OpenBSD V8
On Saturday 01 September 2012, LEVAI Daniel wrote: I've just started to test this with an OpenBSD guest on a Linux host. vio0 NIC and the virtio disk work like a charm. I have a slightly (~10%) better performance with the ide emulation than with virtio, but nevertheless it works. Thanks for testing. How did you test the performance? Well, weird thing is that some time after I've sent this mail, a few messages appeared in the logs while updating /usr/ports on the guest: virtio timeout vioblk0: sc_queued 5 vq_num 128 vq_avail_idx: 27965 vq_avail-idx: 27965 vq_avail-flags: 1 vq_used_idx: 27960 vq_used-idx: 27960 vq_used-flags: 0 Are these errors or just some debug information for the devs? These are just debugging leftovers. I will remove them in the next version.
Reduce false cache line sharing in page tables
Since the Pentium 4, cachelines are 64 byte on x86, not 32 byte. The pmap optimization should be adjusted accordingly: --- sys/arch/i386/i386/pmapae.c +++ sys/arch/i386/i386/pmapae.c @@ -516,10 +516,10 @@ typedef u_int64_t pd_entry_t; /* PDE */ typedef u_int64_t pt_entry_t; /* PTE */ /* - * Number of PTE's per cache line. 8 byte pte, 32-byte cache line + * Number of PTE's per cache line. 8 byte pte, 64-byte cache line * Used to avoid false sharing of cache lines. */ -#defineNPTECL 4 +#defineNPTECL 8 /* * other data structures --- sys/arch/i386/include/pmap.h +++ sys/arch/i386/include/pmap.h @@ -229,10 +229,10 @@ #definePG_XPG_AVAIL3 /* executable mapping */ /* - * Number of PTE's per cache line. 4 byte pte, 32-byte cache line + * Number of PTE's per cache line. 4 byte pte, 64-byte cache line * Used to avoid false sharing of cache lines. */ -#define NPTECL 8 +#define NPTECL 16 #ifdef _KERNEL /*
Possible bug in timecounter code
tc_windup() calls ntp_update_second() up to 200 times: /* * A large step happens on boot. This constant detects such steps. * It is relatively small so that ntp_update_second gets called enough * in the typical 'missed a couple of seconds' case, but doesn't loop * forever when the time step is large. */ #define LARGE_STEP 200 ntp_update_second() subtracts some amount of time from adjtimedelta and sets th-th_adjustment accordingly. However, if ntp_update_second is called more than once in a row, only the last value of th-th_adjustment is actually applied. Shouldn't the adjustments from all ntp_update_second calls be added instead? I don't know how often this is actually hit in practice, the normal case is that ntp_update_second is called only once. diff --git sys/kern/kern_tc.c sys/kern/kern_tc.c index 59b31dc..60ea119 100644 --- sys/kern/kern_tc.c +++ sys/kern/kern_tc.c @@ -425,8 +426,12 @@ tc_windup(void) i = bt.sec - tho-th_microtime.tv_sec; if (i LARGE_STEP) i = 2; - for (; i 0; i--) - ntp_update_second(th-th_adjustment, bt.sec); + th-th_adjustment = 0; + for (; i 0; i--) { + uint64_t adj; + ntp_update_second(adj, bt.sec); + th-th_adjustment += adj; + } /* Update the UTC timestamps used by the get*() functions. */ /* XXX shouldn't do this here. Should force non-`get' versions. */
Re: Possible bug in timecounter code
On Sunday 02 September 2012, s...@sfritsch.de wrote: But it may still be a problem the other way round. If adjtimedelta would go to zero during the lost ticks, there will be some over-compensation that is not detected. After some more hints from Mark (thanks), I think this is very unlikely to cause a problem in practice, possibly except if running on a virtualization host that is overloaded/swapping.
Re: Possible bug in timecounter code
On Monday 03 September 2012, Stefan Fritsch wrote: On Sunday 02 September 2012, s...@sfritsch.de wrote: But it may still be a problem the other way round. If adjtimedelta would go to zero during the lost ticks, there will be some over-compensation that is not detected. After some more hints from Mark (thanks), I think this is very unlikely to cause a problem in practice, possibly except if running on a virtualization host that is overloaded/swapping. There is another issue: If the previous call to tc_windup was more than 1s ago, there is an integer overflow, all complete seconds are lost and only the fractional part is added to th_offset. FreeBSD has a fix for this one: http://svnweb.freebsd.org/base?view=revisionrevision=215665
Re: [Patch] Virtio drivers for OpenBSD V9
On Tuesday 04 September 2012, LEVAI Daniel wrote: On v, szept 02, 2012 at 12:13:33 +0200, Stefan Fritsch wrote: On Saturday 01 September 2012, LEVAI Daniel wrote: I've just started to test this with an OpenBSD guest on a Linux host. vio0 NIC and the virtio disk work like a charm. I have a slightly (~10%) better performance with the ide emulation than with virtio, but nevertheless it works. Thanks for testing. How did you test the performance? I've just used dd(1) to transfer files on the same disk and partition. On my system, the behavior is not really consistent: Configuring the drive in kvm with cache=writeback, virtio is significantly faster at writing and ide is significantly faster at reading. With cache=none, it's the other way round and with smaller differences (5%). Ide is slightly faster at writing and virtio is slightly faster at reading. There is obviously room for optimization... These are just debugging leftovers. I will remove them in the next version. At http://www.sfritsch.de/~stf/openbsd-virtio/openbsd-virtio-v9.diff , there is a new version without the debugging output in vioblk. There are no other differences to v8. Cheers, Stefan
Re: Scheduler improvements, take 1001, Patch 2/5
On Monday 15 October 2012, Stuart Henderson wrote: Best to send the diff, with the accompanying text, in the same email, and make sure they still all apply, I tried testing some of these but didn't manage to get them to apply (either some conflicting change, or they were in the wrong order and stacked up on top of each other or something). As you mentioned using git in another mail, I would recommend to use git's format-patch to create the mails. Maybe with the --no-prefix option, because that seems to be the preferred patch format here. Also, interactive rebase (git rebase -i) is nice for adjusting patches and commit messages (in case you didn't know that one). Cheers, Stefan
vio(4) Large MTU support
Hi, here is a patch to support large MTUs in if_vio. It adds support for the VIRTIO_NET_F_MRG_RXBUF feature, i.e. allows to chain several rx buffers when receiving large packets. To make this work, it has to put the rx meta data headers at the beginning of the mbuf cluster instead of dedicated buffers. Should it take care that the payload gets some particular alignment? ATM, the ethernet header starts with offset 10 or 12 bytes into the cluster, depending on VIRTIO_NET_F_MRG_RXBUF. Comments / OKs ? Thanks. Stefan --- share/man/man4/vio.4 +++ share/man/man4/vio.4 @@ -50,6 +50,3 @@ It is based on the .Nm vioif driver by .An Minoura Makoto . -.Sh BUGS -.Nm -currently does not support jumbo frames. diff --git a/sys/dev/pci/if_vio.c b/sys/dev/pci/if_vio.c index 2f94db7..5678edc 100644 --- sys/dev/pci/if_vio.c +++ sys/dev/pci/if_vio.c @@ -123,9 +123,9 @@ struct virtio_net_hdr { uint16_tgso_size; uint16_tcsum_start; uint16_tcsum_offset; -#if 0 - uint16_tnum_buffers; /* if VIRTIO_NET_F_MRG_RXBUF enabled */ -#endif + + /* only present if VIRTIO_NET_F_MRG_RXBUF is negotiated */ + uint16_tnum_buffers; } __packed; #define VIRTIO_NET_HDR_F_NEEDS_CSUM1 /* flags */ @@ -199,7 +199,7 @@ struct vio_softc { size_t sc_dma_size; caddr_t sc_dma_kva; - struct virtio_net_hdr *sc_rx_hdrs; + int sc_hdr_size; struct virtio_net_hdr *sc_tx_hdrs; struct virtio_net_ctrl_cmd *sc_ctrl_cmd; struct virtio_net_ctrl_status *sc_ctrl_status; @@ -227,6 +227,8 @@ struct vio_softc { #define VIO_DMAMEM_ENQUEUE(sc, vq, slot, p, size, write) \ virtio_enqueue_p((vq), (slot), (sc)-sc_dma_map, \ VIO_DMAMEM_OFFSET((sc), (p)), (size), (write)) +#define VIO_HAVE_MRG_RXBUF(sc) \ + ((sc)-sc_hdr_size == sizeof(struct virtio_net_hdr)) #define VIRTIO_NET_TX_MAXNSEGS 16 /* for larger chains, defrag */ #define VIRTIO_NET_CTRL_MAC_MAXENTRIES 64 /* for more entries, use ALLMULTI */ @@ -343,7 +345,6 @@ vio_free_dmamem(struct vio_softc *sc) /* allocate memory */ /* * dma memory is used for: - * sc_rx_hdrs[slot]: metadata array for recieved frames (READ) * sc_tx_hdrs[slot]: metadata array for frames to be sent (WRITE) * sc_ctrl_cmd: command to be sent via ctrl vq (WRITE) * sc_ctrl_status:return value for a command via ctrl vq (READ) @@ -355,6 +356,9 @@ vio_free_dmamem(struct vio_softc *sc) * class command (WRITE) * sc_ctrl_* structures are allocated only one each; they are protected by * sc_ctrl_inuse, which must only be accessed at splnet + * + * metadata headers for received frames are stored at the start of the + * rx mbufs. */ /* * dynamically allocated memory is used for: @@ -367,7 +371,8 @@ int vio_alloc_mem(struct vio_softc *sc) { struct virtio_softc *vsc = sc-sc_virtio; - int allocsize, r, i; + struct ifnet *ifp = sc-sc_ac.ac_if; + int allocsize, r, i, txsize; unsigned int offset = 0; int rxqsize, txqsize; caddr_t kva; @@ -375,8 +380,13 @@ vio_alloc_mem(struct vio_softc *sc) rxqsize = vsc-sc_vqs[0].vq_num; txqsize = vsc-sc_vqs[1].vq_num; - allocsize = sizeof(struct virtio_net_hdr) * rxqsize; - allocsize += sizeof(struct virtio_net_hdr) * txqsize; + /* +* For simplicity, we always allocate the full virtio_net_hdr size +* even if VIRTIO_NET_F_MRG_RXBUF is not negotiated and +* only a part of the memory is ever used. +*/ + allocsize = sizeof(struct virtio_net_hdr) * txqsize; + if (vsc-sc_nvqs == 3) { allocsize += sizeof(struct virtio_net_ctrl_cmd) * 1; allocsize += sizeof(struct virtio_net_ctrl_status) * 1; @@ -393,8 +403,6 @@ vio_alloc_mem(struct vio_softc *sc) } kva = sc-sc_dma_kva; - sc-sc_rx_hdrs = (struct virtio_net_hdr*)kva; - offset += sizeof(struct virtio_net_hdr) * rxqsize; sc-sc_tx_hdrs = (struct virtio_net_hdr*)(kva + offset); offset += sizeof(struct virtio_net_hdr) * txqsize; if (vsc-sc_nvqs == 3) { @@ -429,9 +437,10 @@ vio_alloc_mem(struct vio_softc *sc) goto err_reqs; } + txsize = ifp-if_hardmtu + sc-sc_hdr_size + ETHER_HDR_LEN; for (i = 0; i txqsize; i++) { - r = bus_dmamap_create(vsc-sc_dmat, ETHER_MAX_LEN, - VIRTIO_NET_TX_MAXNSEGS, ETHER_MAX_LEN, 0, + r = bus_dmamap_create(vsc-sc_dmat, txsize, + VIRTIO_NET_TX_MAXNSEGS, txsize, 0, BUS_DMA_NOWAIT|BUS_DMA_ALLOCNOW, sc-sc_tx_dmamaps[i]); if (r != 0) @@ -502,7 +511,8 @@ vio_attach(struct device *parent, struct device *self, void *aux) vsc-sc_intrhand = virtio_vq_intr;
Re: Small memcpy optimization
On Tuesday 21 August 2012, Stefan Fritsch wrote: On x86, the xchg operation between reg and mem has an implicit lock prefix, i.e. it is a relatively expensive atomic operation. This is not needed here. OKs, anyone? --- a/sys/arch/i386/i386/locore.s +++ b/sys/arch/i386/i386/locore.s @@ -802,8 +802,9 @@ ENTRY(bcopy) */ ENTRY(memcpy) movl4(%esp),%ecx - xchg8(%esp),%ecx - movl%ecx,4(%esp) + movl8(%esp),%eax + movl%ecx,8(%esp) + movl%eax,4(%esp) jmp _C_LABEL(bcopy)
Re: Small memcpy optimization
On Thu, 8 Nov 2012, Mark Kettenis wrote: On Tuesday 21 August 2012, Stefan Fritsch wrote: On x86, the xchg operation between reg and mem has an implicit lock prefix, i.e. it is a relatively expensive atomic operation. This is not needed here. OKs, anyone? What you say makes sense, although it might matter only on MP (capable) systems. True, but MP is the norm nowadays. If you really want to make things faster, I suppose you could change the code into something like pushl %esi pushl %edi movl12(%esp),%edi movl16(%esp),%esi That's true. Like this (suggestions for a better label name are welcome): --- locore.s +++ locore.s @@ -789,7 +789,7 @@ ENTRY(bcopy) pushl %edi movl12(%esp),%esi movl16(%esp),%edi - movl20(%esp),%ecx +bcopy2:movl20(%esp),%ecx movl%edi,%eax subl%esi,%eax cmpl%ecx,%eax # overlapping? @@ -827,13 +827,15 @@ ENTRY(bcopy) ret /* - * Emulate memcpy() by swapping the first two arguments and calling bcopy() + * Emulate memcpy() by loading the first two arguments in reverse order + * and jumping into bcopy() */ ENTRY(memcpy) - movl4(%esp),%ecx - xchg8(%esp),%ecx - movl%ecx,4(%esp) - jmp _C_LABEL(bcopy) + pushl %esi + pushl %edi + movl12(%esp),%edi + movl16(%esp),%esi + jmp bcopy2 /*/
Re: ##@!#@# gnu tools
On Thursday 15 November 2012, Reyk Floeter wrote: On Thu, Nov 15, 2012 at 5:11 PM, Marc Espie es...@nerim.net wrote: external people regularly ask but why you don't want to use GNU/m4 GNU/make GNU/whatever ? External people seem to ask weird questions. I just had to dig into autoconf/auto* because it seems to be a must have for a portable project. Yuck! That's all a matter of perspective. If you work on sane platforms like Linux and BSDs, you always think what do I need this autotools/libtool crap for, it would be much easier without them. But once you work on really weird platforms like AIX and do non-trivial tasks (like building shared libraries :-o ), autotools/libtool are sent from heaven and are really *so* much easier to use than what is available natively. I guess another example is Mac OS with its universal binaries that may contain both 32 and 64 bit, and both intel and powerpc code in the same file. Also, for non-trivial projects, the advice don't write portable make files, use a portable make instead is very much true. Posix compatible make lacks even the most basic features. Stefan
[PATCH] Support for virtio random device
Hi, qemu 1.3 has added a virtio entropy device. Here is a driver for it. Comments? OKs? As the entropy reserve of the host may not be unlimited, the OpenBSD guest should only ask for entropy when it actually needs it. Would it make sense to add an API that allows a driver to determine how full the entropy queue is? This could also be used by some hardware drivers to avoid polling for entropy if not necessary. E.g. amdpm does the polling in every timer tick, which is wasteful. Cheers, Stefan diff --git share/man/man4/Makefile share/man/man4/Makefile index ab6d258..3e36766 100644 --- share/man/man4/Makefile +++ share/man/man4/Makefile @@ -66,9 +66,9 @@ MAN= aac.4 ac97.4 acphy.4 \ uthum.4 uticom.4 utwitch.4 utrh.4 uts.4 uvideo.4 uvisor.4 uvscom.4 \ uyap.4 \ vether.4 vga.4 vgafb.4 vge.4 \ - viapm.4 viasio.4 vic.4 video.4 vio.4 vioblk.4 viomb.4 virtio.4 vlan.4 \ - vmt.4 vnd.4 vr.4 \ - vscsi.4 vte.4 \ + viapm.4 viasio.4 vic.4 video.4 \ + vio.4 vioblk.4 viomb.4 viornd.4 virtio.4 \ + vlan.4 vmt.4 vnd.4 vr.4 vscsi.4 vte.4 \ watchdog.4 wb.4 wbenv.4 wbng.4 wbsd.4 wbsio.4 wd.4 wdc.4 we.4 \ wi.4 wpi.4 wscons.4 wsdisplay.4 wskbd.4 wsmouse.4 wsmux.4 \ xe.4 xf86.4 xge.4 xl.4 xmphy.4 yds.4 ym.4 zero.4 zyd.4 diff --git share/man/man4/viornd.4 share/man/man4/viornd.4 new file mode 100644 index 000..09ff465 --- /dev/null +++ share/man/man4/viornd.4 @@ -0,0 +1,45 @@ +.\ $OpenBSD: $ +.\ +.\ Copyright (c) 2013 Stefan Fritsch s...@sfritsch.de +.\ +.\ Permission to use, copy, modify, and distribute this software for any +.\ purpose with or without fee is hereby granted, provided that the above +.\ copyright notice and this permission notice appear in all copies. +.\ +.\ THE SOFTWARE IS PROVIDED AS IS AND THE AUTHOR DISCLAIMS ALL WARRANTIES +.\ WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF +.\ MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR +.\ ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES +.\ WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN +.\ ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF +.\ OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. +.\ +.Dd $Mdocdate: January 25 2013 $ +.Dt VIORND 4 +.Os +.Sh NAME +.Nm viornd +.Nd VirtIO random number device +.Sh SYNOPSIS +.Cd viornd* at virtio? +.Sh DESCRIPTION +The +.Nm +driver provides a virtual random number generator using a +.Xr virtio 4 +entropy device provided by QEMU 1.3 and later, and possibly by other +hypervisors. +.Sh SEE ALSO +.Xr intro 4 , +.Xr virtio 4 +.Sh HISTORY +The +.Nm +driver first appeared in +.Ox 5.3 . +.Sh AUTHORS +.An -nosplit +The +.Nm +driver was written by +.An Stefan Fritsch Aq s...@sfritsch.de . diff --git share/man/man4/virtio.4 share/man/man4/virtio.4 index 16ad4ba..33ce729 100644 --- share/man/man4/virtio.4 +++ share/man/man4/virtio.4 @@ -40,6 +40,8 @@ VirtIO network device VirtIO disk .It Xr viomb 4 VirtIO memory ballooning driver +.It Xr viornd 4 +VirtIO random number device .El .Sh SEE ALSO .Xr intro 4 diff --git sys/arch/amd64/conf/GENERIC sys/arch/amd64/conf/GENERIC index ca0a4b2..8f41464 100644 --- sys/arch/amd64/conf/GENERIC +++ sys/arch/amd64/conf/GENERIC @@ -612,3 +612,4 @@ virtio* at pci? # Virtio PCI device vioblk*at virtio? # Virtio block device vio* at virtio? # Virtio network device viomb* at virtio? # Virtio memory ballooning device +viornd*at virtio? # Virtio entropy device diff --git sys/arch/i386/conf/GENERIC sys/arch/i386/conf/GENERIC index 190677e..9c0417d 100644 --- sys/arch/i386/conf/GENERIC +++ sys/arch/i386/conf/GENERIC @@ -802,3 +802,4 @@ virtio* at pci? # Virtio PCI device vioblk*at virtio? # Virtio block device vio* at virtio? # Virtio network device viomb* at virtio? # Virtio memory ballooning device +viornd*at virtio? # Virtio entropy device diff --git sys/dev/pci/files.pci sys/dev/pci/files.pci index aebacb8..3155a7f 100644 --- sys/dev/pci/files.pci +++ sys/dev/pci/files.pci @@ -844,3 +844,7 @@ filedev/pci/vioblk.cvioblk device viomb attach viomb at virtio file dev/pci/viomb.c viomb + +device viornd +attach viornd at virtio +file dev/pci/viornd.cviornd diff --git sys/dev/pci/viornd.c sys/dev/pci/viornd.c new file mode 100644 index 000..3669df3 --- /dev/null +++ sys/dev/pci/viornd.c @@ -0,0 +1,168 @@ +/* $OpenBSD: $ */ + +/* + * Copyright (c) 2013 Stefan Fritsch s...@sfritsch.de + * + * Permission to use, copy, modify, and distribute this software for any + * purpose with or without fee is hereby granted, provided that the above + * copyright notice and this permission notice appear in all copies. + * + * THE SOFTWARE IS PROVIDED
Re: [PATCH] Support for virtio random device
On Saturday 26 January 2013, Mark Kettenis wrote: As the entropy reserve of the host may not be unlimited, the OpenBSD guest should only ask for entropy when it actually needs it. If qemu really allows the guest to consume the complete entropy reserve of the host then that's a bug in qemu and/or the host OS. That's a red herring. Entropy is a sparse resource on most systems. If the guests don't make any effort to only use as much as they need, the host has no chance to dispense the existing entropy where it is actually needed. Would it make sense to add an API that allows a driver to determine how full the entropy queue is? This could also be used by some hardware drivers to avoid polling for entropy if not necessary. E.g. amdpm does the polling in every timer tick, which is wasteful. Wasteful in what sense? I doubt that reading the hardware registers once every 10ms noticibly affects performance of the system. But many small improvements may give a noticable effect on performance in the end.
Re: [PATCH] Support for virtio random device
On Sun, 27 Jan 2013, Damien Miller wrote: On Fri, 25 Jan 2013, Stefan Fritsch wrote: Hi, qemu 1.3 has added a virtio entropy device. Here is a driver for it. Comments? OKs? As the entropy reserve of the host may not be unlimited, the OpenBSD guest should only ask for entropy when it actually needs it. Would it make sense to add an API that allows a driver to determine how full the entropy queue is? This could also be used by some hardware drivers to avoid polling for entropy if not necessary. E.g. amdpm does the polling in every timer tick, which is wasteful. I don't think such an API is necessary. For this particular case, I agree with Kettenis that if qemu allows exhaustion of the hypervisor's PRNG using this call then the bug lies with qemu. Qemu can do rate limiting, so it can prevent the host's entropy from being drained. But then guests that actually have need of lots of good entropy won't be able to get it. That being said, if you want to avoid this bug then reading a largeish seed once at system boot and stirring it into the PRNG would be sufficient to ensure the PRNG pool is unpredictable to an adversary who tries to guess its state. That's more or less what the driver I posted does. But with the API to query the random queue's state it could do better than that.
[Patch] Fix subr_disk.c to allow compiling with -Wno-error=format
The log(pri, ) triggers a 'zero-length kprintf format string' warning which causes gcc to error out despite -Wno-error=format. OK to commit this patch? --- sys/kern/subr_disk.c +++ sys/kern/subr_disk.c @@ -761,13 +761,14 @@ diskerr(struct buf *bp, char *dname, char *what, int pri, int blkdone, daddr64_t sn; if (pri != LOG_PRINTF) { - static const char fmt[] = ; - log(pri, fmt); + log(pri, %s%d%c: %s %sing fsbn , dname, unit, partname, what, + bp-b_flags B_READ ? read : writ); pr = addlog; - } else + } else { + printf(%s%d%c: %s %sing fsbn , dname, unit, partname, what, + bp-b_flags B_READ ? read : writ); pr = printf; - (*pr)(%s%d%c: %s %sing fsbn , dname, unit, partname, what, - bp-b_flags B_READ ? read : writ); + } sn = bp-b_blkno; if (bp-b_bcount = DEV_BSIZE) (*pr)(%lld, sn);
Re: Removing -Wno-format from kernel makefiles
On Saturday 06 April 2013, Stefan Fritsch wrote: That wasn't clear from Miod's response. So the policy is to assume that char/short/int/long long are 8/16/32/64 bits and that intptr_t is long? Should that be added to style(9), then? I assume that the lack of a response means yes. The patches at http://www.sfritsch.de/~stf/openbsd-format-string/ fix all warnings on i386 and do not cause new warnings on amd64. Since the kernel's printf does not support %td for ptrdiff_t, I have used %ld instead. %zd would also work. Is there a preferred way?
Re: Removing -Wno-format from kernel makefiles
On Sun, 14 Apr 2013, Theo de Raadt wrote: That wasn't clear from Miod's response. So the policy is to assume that char/short/int/long long are 8/16/32/64 bits and that intptr_t is long? We only run on C8S16I32L32P32 and C8S16I32L64P64 architectures. Short names ILP32 and I32LP64. The addition of any other sizing would require us to fix thousands of subtle bugs, and crank the ABI on every architecture. Should that be added to style(9), then? I assume that the lack of a response means yes. Perhaps. Thing is, we don't want people to worsen the current situation for decades down the line in case someone tries to create a 'popular' architecture with I64. Or LL128 ... But if such architectures come around, all kinds of things will need to be changed. This should not prevent anyone from documenting the current best practices. Since the kernel's printf does not support %td for ptrdiff_t, I have used %ld instead. %zd would also work. Is there a preferred way? Or try to add %td support to kernel printf? That would also require gcc to be taught that %td is now allowed. I don't know how difficult that is, but I will take a look when I have some time.
Re: Removing -Wno-format from kernel makefiles
On Sun, 14 Apr 2013, Theo de Raadt wrote: Since the kernel's printf does not support %td for ptrdiff_t, I have used %ld instead. %zd would also work. Is there a preferred way? Or try to add %td support to kernel printf? The patch below seems to do the trick for gcc 4.2. I can't see that we would add an arch where sizeof(ptrdiff_t) != sizeof(size_t). Therefore I don't see any benefit in adding code that deals with this case and which would probably be just slightly to complicated for the compiler to optimize away on sane archs. Checking sizeof(ptrdiff_t) == sizeof(size_t) there would be a good use for a compile time assert macro (like _Static_assert in C11). Do we have that and I didn't find it? Or do we want to add one? Adding ptrdiff_t to stdint.h is not strictly necessary, but I see no reason not to do it. --- a/gnu/gcc/gcc/c-format.c +++ b/gnu/gcc/gcc/c-format.c @@ -325,6 +325,7 @@ static const format_length_info kprintf_length_specs[] = { l, FMT_LEN_l, STD_C89, ll, FMT_LEN_ll, STD_C9L }, { q, FMT_LEN_ll, STD_EXT, NULL, 0, 0 }, { z, FMT_LEN_z, STD_C99, NULL, 0, 0 }, + { t, FMT_LEN_t, STD_C99, NULL, 0, 0 }, { NULL, 0, 0, NULL, 0, 0 } }; @@ -552,9 +553,9 @@ static const format_char_info asm_fprintf_char_table[] = static const format_char_info kprint_char_table[] = { /* C89 conversion specifiers. */ - { di, 0, STD_C89, { T89_I, BADLEN, T89_S, T89_L, T9L_LL, BADLEN, T99_SST, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0 +'I, i, NULL }, - { oxX, 0, STD_C89, { T89_UI, BADLEN, T89_US, T89_UL, T9L_ULL, BADLEN, T99_ST, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0#,i, NULL }, - { u, 0, STD_C89, { T89_UI, BADLEN, T89_US, T89_UL, T9L_ULL, BADLEN, T99_ST, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0'I, i, NULL }, + { di, 0, STD_C89, { T89_I, BADLEN, T89_S, T89_L, T9L_LL, BADLEN, T99_SST, T99_PD, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0 +'I, i, NULL }, + { oxX, 0, STD_C89, { T89_UI, BADLEN, T89_US, T89_UL, T9L_ULL, BADLEN, T99_ST, T99_UPD, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0#,i, NULL }, + { u, 0, STD_C89, { T89_UI, BADLEN, T89_US, T89_UL, T9L_ULL, BADLEN, T99_ST, T99_UPD, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0'I, i, NULL }, { c, 0, STD_C89, { T89_I, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -w, , NULL }, { s, 1, STD_C89, { T89_C, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -wp, cR, NULL }, { p, 1, STD_C89, { T89_V, BADLEN, BADLEN, T89_UL, T9L_LL, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0, c, NULL }, --- a/sys/kern/subr_prf.c +++ b/sys/kern/subr_prf.c @@ -834,6 +834,9 @@ reswitch: switch (ch) { case 'q': flags |= QUADINT; goto rflag; + case 't': + /* for us, ptrdiff_t (C) is equal to ssize_t (POSIX) */ + /* FALLTHROUGH */ case 'z': flags |= SIZEINT; goto rflag; --- a/sys/sys/stdint.h +++ b/sys/sys/stdint.h @@ -95,6 +95,9 @@ typedef __uintptr_t uintptr_t; typedef__intmax_t intmax_t; typedef__uintmax_t uintmax_t; +/* Integer type of the result of substracting two pointers */ +typedef__ptrdiff_t ptrdiff_t; + #if !defined(__cplusplus) || defined(__STDC_LIMIT_MACROS) /* * 7.18.2 Limits of specified-width integer types. For older gcc, it seems more difficult to add support for %t. At least, the code is quite different. Maybe this patch does the trick but it is untested. And it again assumes that sizeof(size_t) == sizeof(ptrdiff_t). IMHO simply keeping -Wno-format on architectures that use old gcc versions should be ok, too. --- a/gnu/egcs/gcc/c-common.c +++ b/gnu/egcs/gcc/c-common.c @@ -2497,7 +2497,7 @@ check_format_info (info, params) if (pedantic) warning (ANSI C does not support the `Z' length modifier); } - else if (*format_chars == 'z') + else if (*format_chars == 'z' || *format_chars == 't') { length_char = 'Z', format_chars++; /* FIXME: Is allowed in ISO C 9x. */
Re: Removing -Wno-format from kernel makefiles
Moved ptrdiff_t definition to types.h with proper guards. Assume ptrdiff_t is long in subr_prf.c OK? --- a/gnu/gcc/gcc/c-format.c +++ b/gnu/gcc/gcc/c-format.c @@ -325,6 +325,7 @@ static const format_length_info kprintf_length_specs[] = { l, FMT_LEN_l, STD_C89, ll, FMT_LEN_ll, STD_C9L }, { q, FMT_LEN_ll, STD_EXT, NULL, 0, 0 }, { z, FMT_LEN_z, STD_C99, NULL, 0, 0 }, + { t, FMT_LEN_t, STD_C99, NULL, 0, 0 }, { NULL, 0, 0, NULL, 0, 0 } }; @@ -552,9 +553,9 @@ static const format_char_info asm_fprintf_char_table[] = static const format_char_info kprint_char_table[] = { /* C89 conversion specifiers. */ - { di, 0, STD_C89, { T89_I, BADLEN, T89_S, T89_L, T9L_LL, BADLEN, T99_SST, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0 +'I, i, NULL }, - { oxX, 0, STD_C89, { T89_UI, BADLEN, T89_US, T89_UL, T9L_ULL, BADLEN, T99_ST, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0#,i, NULL }, - { u, 0, STD_C89, { T89_UI, BADLEN, T89_US, T89_UL, T9L_ULL, BADLEN, T99_ST, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0'I, i, NULL }, + { di, 0, STD_C89, { T89_I, BADLEN, T89_S, T89_L, T9L_LL, BADLEN, T99_SST, T99_PD, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0 +'I, i, NULL }, + { oxX, 0, STD_C89, { T89_UI, BADLEN, T89_US, T89_UL, T9L_ULL, BADLEN, T99_ST, T99_UPD, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0#,i, NULL }, + { u, 0, STD_C89, { T89_UI, BADLEN, T89_US, T89_UL, T9L_ULL, BADLEN, T99_ST, T99_UPD, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0'I, i, NULL }, { c, 0, STD_C89, { T89_I, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -w, , NULL }, { s, 1, STD_C89, { T89_C, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -wp, cR, NULL }, { p, 1, STD_C89, { T89_V, BADLEN, BADLEN, T89_UL, T9L_LL, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0, c, NULL }, --- a/sys/kern/subr_prf.c +++ b/sys/kern/subr_prf.c @@ -842,6 +842,9 @@ reswitch: switch (ch) { size = 1; sign = '\0'; break; + case 't': + /* assume ptrdiff_t is long */ + /* FALLTHROUGH */ case 'D': flags |= LONGINT; /*FALLTHROUGH*/ --- a/sys/sys/types.h +++ b/sys/sys/types.h @@ -177,6 +177,11 @@ typedef__clockid_t clockid_t; typedef__size_tsize_t; #endif +#ifndef_PTRDIFF_T_DEFINED_ +#define_PTRDIFF_T_DEFINED_ +typedef__ptrdiff_t ptrdiff_t; +#endif + #ifndef_SSIZE_T_DEFINED_ #define_SSIZE_T_DEFINED_ typedef__ssize_t ssize_t;
Re: Removing -Wno-format from kernel makefiles
On Sunday 21 April 2013, Mark Kettenis wrote: +++ b/sys/sys/types.h @@ -177,6 +177,11 @@ typedef__clockid_t clockid_t; typedef __size_tsize_t; #endif +#ifndef_PTRDIFF_T_DEFINED_ +#define_PTRDIFF_T_DEFINED_ +typedef__ptrdiff_t ptrdiff_t; +#endif You can't put ptrdiff_t in sys/types.h either. Why not? POSIX only speaks about at least the following types in sys/types.h, so there is no reason why it may not define more types. And the ifndef guard will prevent conflicts with stddef.h (which is the canonical place for ptrdiff_t according to C99). I lost track a bit what you're trying to achieve here (been a tad bit busy). But I don't think we'd want to use ptrdiff_t in the kernel. The starting point was that the compiler produces ptrdiff_t when subtracting two pointers, and Theo preferred using the standard compliant %td format string for it over %ld (or %zd). And if we have a printf format string for ptrdiff_t, it makes sense to be able to cast to it. But it's not strictly necessary, of course.
Re: Removing -Wno-format from kernel makefiles
On Sunday 21 April 2013, Philip Guenther wrote: On Sun, Apr 21, 2013 at 12:15 PM, Stefan Fritsch s...@sfritsch.de wrote: On Sunday 21 April 2013, Mark Kettenis wrote: ... You can't put ptrdiff_t in sys/types.h either. Why not? POSIX only speaks about at least the following types in sys/types.h, so there is no reason why it may not define more types. Well, it can define others as long as it stays within the namespace restrictions. It turns out that ptrdiff_t *is* within the permitted namespace (because it ends with _t). However, we have generally tried to avoid defining more than is required to be present so that devs don't accidentally rely on a non-portable assumption like that. Maybe this should be handled the bool bits in sys/types.h, wrapping them in #ifdef _KERNEL ? #ifdef _KERNEL is fine with me. Alternatively, if subr_prf.c is the only kernel file which will actually need the ptrdiff_t type then perhaps it should just use __ptrdiff_t directly. Actually it's not needed, yet. We could postpone adding it until someone actually needs it. But my guess is that this would lead to that someone simply choosing some other type.
Re: Removing -Wno-format from kernel makefiles
On Sun, 21 Apr 2013, Stefan Fritsch wrote: On Sunday 21 April 2013, Philip Guenther wrote: Maybe this should be handled the bool bits in sys/types.h, wrapping them in #ifdef _KERNEL ? #ifdef _KERNEL is fine with me. OK for the patch below? --- a/gnu/gcc/gcc/c-format.c +++ b/gnu/gcc/gcc/c-format.c @@ -325,6 +325,7 @@ static const format_length_info kprintf_length_specs[] = { l, FMT_LEN_l, STD_C89, ll, FMT_LEN_ll, STD_C9L }, { q, FMT_LEN_ll, STD_EXT, NULL, 0, 0 }, { z, FMT_LEN_z, STD_C99, NULL, 0, 0 }, + { t, FMT_LEN_t, STD_C99, NULL, 0, 0 }, { NULL, 0, 0, NULL, 0, 0 } }; @@ -552,9 +553,9 @@ static const format_char_info asm_fprintf_char_table[] = static const format_char_info kprint_char_table[] = { /* C89 conversion specifiers. */ - { di, 0, STD_C89, { T89_I, BADLEN, T89_S, T89_L, T9L_LL, BADLEN, T99_SST, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0 +'I, i, NULL }, - { oxX, 0, STD_C89, { T89_UI, BADLEN, T89_US, T89_UL, T9L_ULL, BADLEN, T99_ST, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0#,i, NULL }, - { u, 0, STD_C89, { T89_UI, BADLEN, T89_US, T89_UL, T9L_ULL, BADLEN, T99_ST, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0'I, i, NULL }, + { di, 0, STD_C89, { T89_I, BADLEN, T89_S, T89_L, T9L_LL, BADLEN, T99_SST, T99_PD, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0 +'I, i, NULL }, + { oxX, 0, STD_C89, { T89_UI, BADLEN, T89_US, T89_UL, T9L_ULL, BADLEN, T99_ST, T99_UPD, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0#,i, NULL }, + { u, 0, STD_C89, { T89_UI, BADLEN, T89_US, T89_UL, T9L_ULL, BADLEN, T99_ST, T99_UPD, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0'I, i, NULL }, { c, 0, STD_C89, { T89_I, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -w, , NULL }, { s, 1, STD_C89, { T89_C, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -wp, cR, NULL }, { p, 1, STD_C89, { T89_V, BADLEN, BADLEN, T89_UL, T9L_LL, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0, c, NULL }, --- a/sys/kern/subr_prf.c +++ b/sys/kern/subr_prf.c @@ -842,6 +842,9 @@ reswitch: switch (ch) { size = 1; sign = '\0'; break; + case 't': + /* assume ptrdiff_t is long */ + /* FALLTHROUGH */ case 'D': flags |= LONGINT; /*FALLTHROUGH*/ --- a/sys/sys/types.h +++ b/sys/sys/types.h @@ -177,6 +177,13 @@ typedef__clockid_t clockid_t; typedef__size_tsize_t; #endif +#ifdef _KERNEL +#ifndef_PTRDIFF_T_DEFINED_ +#define_PTRDIFF_T_DEFINED_ +typedef__ptrdiff_t ptrdiff_t; +#endif +#endif + #ifndef_SSIZE_T_DEFINED_ #define_SSIZE_T_DEFINED_ typedef__ssize_t ssize_t;
Re: Removing -Wno-format from kernel makefiles
On Sun, 28 Apr 2013, Stefan Fritsch wrote: On Sun, 21 Apr 2013, Stefan Fritsch wrote: On Sunday 21 April 2013, Philip Guenther wrote: Maybe this should be handled the bool bits in sys/types.h, wrapping them in #ifdef _KERNEL ? #ifdef _KERNEL is fine with me. OK for the patch below? FWIW, re-sending with unbroken MUA settings. --- a/gnu/gcc/gcc/c-format.c +++ b/gnu/gcc/gcc/c-format.c @@ -325,6 +325,7 @@ static const format_length_info kprintf_length_specs[] = { l, FMT_LEN_l, STD_C89, ll, FMT_LEN_ll, STD_C9L }, { q, FMT_LEN_ll, STD_EXT, NULL, 0, 0 }, { z, FMT_LEN_z, STD_C99, NULL, 0, 0 }, + { t, FMT_LEN_t, STD_C99, NULL, 0, 0 }, { NULL, 0, 0, NULL, 0, 0 } }; @@ -552,9 +553,9 @@ static const format_char_info asm_fprintf_char_table[] = static const format_char_info kprint_char_table[] = { /* C89 conversion specifiers. */ - { di, 0, STD_C89, { T89_I, BADLEN, T89_S, T89_L, T9L_LL, BADLEN, T99_SST, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0 +'I, i, NULL }, - { oxX, 0, STD_C89, { T89_UI, BADLEN, T89_US, T89_UL, T9L_ULL, BADLEN, T99_ST, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0#,i, NULL }, - { u, 0, STD_C89, { T89_UI, BADLEN, T89_US, T89_UL, T9L_ULL, BADLEN, T99_ST, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0'I, i, NULL }, + { di, 0, STD_C89, { T89_I, BADLEN, T89_S, T89_L, T9L_LL, BADLEN, T99_SST, T99_PD, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0 +'I, i, NULL }, + { oxX, 0, STD_C89, { T89_UI, BADLEN, T89_US, T89_UL, T9L_ULL, BADLEN, T99_ST, T99_UPD, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0#,i, NULL }, + { u, 0, STD_C89, { T89_UI, BADLEN, T89_US, T89_UL, T9L_ULL, BADLEN, T99_ST, T99_UPD, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0'I, i, NULL }, { c, 0, STD_C89, { T89_I, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -w, , NULL }, { s, 1, STD_C89, { T89_C, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -wp, cR, NULL }, { p, 1, STD_C89, { T89_V, BADLEN, BADLEN, T89_UL, T9L_LL, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN, BADLEN }, -wp0, c, NULL }, --- a/sys/kern/subr_prf.c +++ b/sys/kern/subr_prf.c @@ -842,6 +842,9 @@ reswitch: switch (ch) { size = 1; sign = '\0'; break; + case 't': + /* assume ptrdiff_t is long */ + /* FALLTHROUGH */ case 'D': flags |= LONGINT; /*FALLTHROUGH*/ --- a/sys/sys/types.h +++ b/sys/sys/types.h @@ -177,6 +177,13 @@ typedef__clockid_t clockid_t; typedef__size_tsize_t; #endif +#ifdef _KERNEL +#ifndef_PTRDIFF_T_DEFINED_ +#define_PTRDIFF_T_DEFINED_ +typedef__ptrdiff_t ptrdiff_t; +#endif +#endif + #ifndef_SSIZE_T_DEFINED_ #define_SSIZE_T_DEFINED_ typedef__ssize_t ssize_t;
Compile time assert macro
Hi, I think a CTASSERT macro like in FreeBSD would be nice. It could be used to verify assumptions about type and struct sizes. This should work on gcc 2.9.5, too, but I haven't tested it. --- sys/lib/libkern/libkern.h +++ sys/lib/libkern/libkern.h @@ -138,6 +138,9 @@ abs(int j) #endif #endif +#defineCTASSERT(x) extern char _ctassert[(x) ? 1 : -1 ] \ + __attribute__((__unused__)) + /* Prototypes for non-quad routines. */ void__assert(const char *, const char *, int, const char *) __attribute__ ((__noreturn__));
Re: Compile time assert macro
On Fri, 3 May 2013, Stefan Fritsch wrote: I think a CTASSERT macro like in FreeBSD would be nice. It could be used to verify assumptions about type and struct sizes. Ariane suggested making the syntax compatible to the C11 variant. OKs for this version? --- a/sys/lib/libkern/libkern.h +++ b/sys/lib/libkern/libkern.h @@ -138,6 +138,12 @@ abs(int j) #endif #endif +#if __STDC_VERSION__ 201112L +#define_Static_assert(predicate, msg) \ + extern char __static_assert[(predicate) ? 1 : -1 ] \ + __attribute__((__unused__)) +#endif + /* Prototypes for non-quad routines. */ void__assert(const char *, const char *, int, const char *) __attribute__ ((__noreturn__)); --- a/share/man/man9/kern.9 +++ b/share/man/man9/kern.9 @@ -108,6 +108,20 @@ enabled. tests are only included if the kernel has .Dv DEBUG enabled. +.Pp +.nr nS 1 +.Fn _Static_assert CONDITION MESSAGE +.nr nS 0 +.Pp +This is a standard feature in C11 and is implemented as a macro if compiling +with a pre-C11 compiler. +It causes a compile time error if the given condition evaluates to +false. +Its main purpose is to verify assertions about type and struct sizes that +would otherwise make the code fail at run time. +.Fn _Static_assert +can be used in global scope or at the start of blocks, where variable +declarations are allowed. .Sh BYTE STRINGS .nr nS 1 .Ft int
Binary code patching and paravirtualization
Hi, in summer, I posted some paravirt patches for amd64. In response to the comments I received then, I have created some infrastructure to binary patch kernel code during boot. In order to get some feedback, I am posting the whole paravirt code patching diff here. Also, KVM users may be interested in trying it (hi Antoine ;). For the codepatching part, the most interesting files are codepatch.c and codepatch.h. In copy.S and cpu.c, I convert the code patching already done for SMAP to the new infrastructure (if someone has a machine with SMAP support, testing would be nice). The basic approach is to create macros that surround a to-be-patched code part and store pointer, length, and a tag in a dedicated section. Then there are functions to replace all code parts with a specific tag. Does this part of the diff look sensible? The rest of the diff is paravirtualization for KVM that partially also uses the code patching infrastructure. The paravirtualization stuff seems to work fine and gives some nice speed-up, but it still needs plenty of cleanup before it can be committed. There are some paravirtualization bits that are not specific to KVM. For example, I introduce a running_on_hypervisor() function and skip some delays during shutdown if it returns true. I think this should work for most (all?) hypervisors, though it still needs some detection logic. Making it depend on the CPUIDECX_HYPERV bit should catch the most common hypervisors. Is this something we could commit already in order to see if it causes any problems? Another question: Most of the infrastructure and the pointers to the code that needs patching are only used at boot. Is there already a way to put that into a dedicated section that is freed and reused after boot? Cheers, Stefan === amd64/codepatch.c | 183 ++ amd64/copy.S | 40 -- amd64/cpu.c | 63 - amd64/genassym.cf |3 amd64/identcpu.c |1 amd64/lapic.c | 64 +++-- amd64/machdep.c |9 +- amd64/mainbus.c |8 ++ amd64/paravirt.c | 161 +++ amd64/vector.S| 139 + conf/GENERIC |2 conf/files.amd64 |6 + include/codepatch.h | 53 ++ include/cpu.h |4 + include/cpuvar.h | 12 +++ include/i82093reg.h | 26 ++- include/i82489var.h | 13 +++ include/paravirtvar.h | 72 +++ include/specialreg.h |1 19 files changed, 776 insertions(+), 84 deletions(-) === diff --git a/sys/arch/amd64/amd64/codepatch.c b/sys/arch/amd64/amd64/codepatch.c new file mode 100644 index 000..3f5eba0 --- /dev/null +++ b/sys/arch/amd64/amd64/codepatch.c @@ -0,0 +1,183 @@ +/* + * Copyright (c) 2014 Stefan Fritsch s...@sfritsch.de + * + * Permission to use, copy, modify, and distribute this software for any + * purpose with or without fee is hereby granted, provided that the above + * copyright notice and this permission notice appear in all copies. + * + * THE SOFTWARE IS PROVIDED AS IS AND THE AUTHOR DISCLAIMS ALL WARRANTIES + * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF + * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR + * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES + * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN + * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF + * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. + */ + +#include sys/param.h +#include machine/paravirtvar.h +#include machine/codepatch.h +#include uvm/uvm_extern.h + +#if 1 +#define DBGPRINT(fmt, args...) printf(%s: fmt \n, __func__, ## args) +#else +#define DBGPRINT(fmt, args...) do {} while (0) +#endif + +struct codepatch { + uint32_t offset; + uint16_t len; + uint16_t tag; +}; + +extern struct codepatch codepatch_begin; +extern struct codepatch codepatch_end; + +static const int nop_len_min = 2; +static const int nop_len_max = 9; +static const unsigned char nops[][9] = { + { 0x66, 0x90 }, + { 0x0F, 0x1F, 0x00 }, + { 0x0F, 0x1F, 0x40, 0x00 }, + { 0x0F, 0x1F, 0x44, 0x00, 0x00 }, + { 0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00 }, + { 0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00 }, + { 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 }, + { 0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00}, +}; + +void +codepatch_fill_nop(void *caddr, uint16_t len) +{ + unsigned char *addr = caddr; + uint16_t nop_len; + + while (len 0) { + if (len nop_len_min) { + panic(%s: can't patch NOP with len 1 at %p, + __func__, caddr); + } + if (len = nop_len_max
Re: Binary code patching and paravirtualization
On Wednesday 10 December 2014 00:52:44, Antoine Jacoutot wrote: I assume we still need this: printf 'change paravirt0\ny\n\n0x8\nq\n' | config -ef /bsd under Illumos? Correct
Re: Binary code patching and paravirtualization
On Thursday 11 December 2014 22:20:06, Jonathan Gray wrote: On Wed, Dec 10, 2014 at 12:32:02AM +0100, Stefan Fritsch wrote: For the codepatching part, the most interesting files are codepatch.c and codepatch.h. In copy.S and cpu.c, I convert the code patching already done for SMAP to the new infrastructure (if someone has a machine with SMAP support, testing would be nice). There aren't so many machines with SMAP commercially released only a handful of 2 in 1 devices with Core M processors. You should be able to test with the qemu support added by Intel however. ie qemu-system-x86_64 -cpu qemu64,+smap or qemu-system-x86_64 -cpu Broadwell thanks for the info. I will try that. The basic approach is to create macros that surround a to-be-patched code part and store pointer, length, and a tag in a dedicated section. Then there are functions to replace all code parts with a specific tag. Does this part of the diff look sensible? What about i386? Though not included in your diff we can't assume the various multi byte nop sequences are valid there. The Intel instruction set reference says: The multi-byte NOP instruction performs no operation on supported processors and generates undefined opcode exception on processors that do not support the multi-byte NOP instruction. It seems in practice they are supported on everything after 686/pentium pro? Though the code doing the patching could do a cpuid check. As first step, I would only commit amd64 support. If that works, we can split it into MI and MD parts and add support for other architectures. But the functions to replace code with NOPs or a function call will always be MD. When those are implemented for i386, one can include appropriate cpuid checks. But as there is no place to put MD code that is used by both amd64 and i386, there will probably be two similar but independent implementations for those two archs. Or do you see a better way?
Re: Binary code patching and paravirtualization
On Thu, 11 Dec 2014, Mark Kettenis wrote: From: Alexey Suslikov alexey.susli...@gmail.com Date: Thu, 11 Dec 2014 20:51:14 + (UTC) Stefan Fritsch sf at sfritsch.de writes: --- a/sys/arch/amd64/include/specialreg.h +++ b/sys/arch/amd64/include/specialreg.h at at -158,6 +158,7 at at #define CPUIDECX_AVX0x1000 /* Advanced Vector Extensions */ #define CPUIDECX_F16C 0x2000 /* 16bit fp conversion */ #define CPUIDECX_RDRAND 0x4000 /* RDRAND instruction */ +#define CPUIDECX_HYPERV 0x8000 /* Hypervisor present */ Is this flag standardized? Last time I have tried to push this, there was an objection based on reserved for future use status of this flag. See http://marc.info/?l=openbsd-bugsm=136907278229145w=2 Well, that thread started out with a questionable workaround for a hypervisor bug. That may have have influenced the debate about the flag a bit. You can be almost certain that Intel and AMD will not use that reserved bit for anything else. The Linux KVM virtualization business is too important for them. And if Microsoft Hyper-V or VMWare ESX sets that bit as well, this becomes an absolute certainty. The intel manual says Not Used, Always returns 0 which is different from reserved, which is stated for other bits. FTR, jasper@ checked that vmware sets the bit while virtual box does not. So, many but not all hypervisors set it. I prefer the CPUIDECX_HV name used in the diff you posted in: OK? diff --git a/sys/arch/amd64/amd64/identcpu.c b/sys/arch/amd64/amd64/identcpu.c --- a/sys/arch/amd64/amd64/identcpu.c +++ b/sys/arch/amd64/amd64/identcpu.c @@ -129,6 +129,7 @@ const struct { { CPUIDECX_AVX, AVX }, { CPUIDECX_F16C,F16C }, { CPUIDECX_RDRAND, RDRAND }, + { CPUIDECX_HV, HV }, }, cpu_ecpuid_ecxfeatures[] = { { CPUIDECX_LAHF,LAHF }, { CPUIDECX_CMPLEG, CMPLEG }, diff --git a/sys/arch/amd64/include/specialreg.h b/sys/arch/amd64/include/specialreg.h --- a/sys/arch/amd64/include/specialreg.h +++ b/sys/arch/amd64/include/specialreg.h @@ -158,6 +158,7 @@ #defineCPUIDECX_AVX0x1000 /* Advanced Vector Extensions */ #defineCPUIDECX_F16C 0x2000 /* 16bit fp conversion */ #defineCPUIDECX_RDRAND 0x4000 /* RDRAND instruction */ +#defineCPUIDECX_HV 0x8000 /* Running on hypervisor */ /* * Structured Extended Feature Flags Parameters (CPUID function 0x7, leaf 0)
Re: Binary code patching and paravirtualization
On Mon, 15 Dec 2014, Ted Unangst wrote: On Wed, Dec 10, 2014 at 00:32, Stefan Fritsch wrote: --- a/sys/arch/amd64/conf/GENERIC +++ b/sys/arch/amd64/conf/GENERIC @@ -39,6 +39,8 @@ isa0 at amdpcib? isa0at tcpcib? pci*at mainbus0 +paravirt0 at mainbus0 + acpi0 at bios0 acpitimer* at acpi? acpihpet* at acpi? diff --git a/sys/arch/amd64/conf/files.amd64 b/sys/arch/amd64/conf/files.amd64 index 103b653..a73d5d8 100644 --- a/sys/arch/amd64/conf/files.amd64 +++ b/sys/arch/amd64/conf/files.amd64 @@ -80,6 +80,12 @@ device mainbus: isabus, pcibus, mainbus attach mainbus at root filearch/amd64/amd64/mainbus.c mainbus +device paravirt +attach paravirt at mainbus +file arch/amd64/amd64/paravirt.c paravirt needs-flag + +file arch/amd64/amd64/codepatch.c Can you explain why this needs to be a device? It doesn't appear to be doing any device like things. Only in order to get a flags field that can be tweaked with config(8). And to allow disable via config(8), though that could also be achieved with a flag. Tweaking the behavior with a flags value is necessary because hypervisors not always announce what they are capable of. One reason for this is that qemu is responsible for setting most of the cpuid flags but the kvm kernel module offers some interfaces in all cases. Another reason seems to be simple oversight, for example Illuminos KVM forgot to add the cpuid stuff from Linux KVM. And if there is some bug it may be a good idea to have a simple way to check if it is related to the paravirt stuff. I have just noticed that there is support in config(8) to set non-devices related int values. But this does not seem to be supported in the in-kernel UKC. And I can't see right now how config finds the valid int values in the kernel. Or is it just necessary to add the name of the variable prepended by an underscore to the config tool? Would this be preferred over introducing the paravirt device? But being able to set this from the in-kernel UKC would be nice, too.
Re: Binary code patching and paravirtualization
On Mon, 15 Dec 2014, Ted Unangst wrote: On Mon, Dec 15, 2014 at 23:55, Stefan Fritsch wrote: Only in order to get a flags field that can be tweaked with config(8). And to allow disable via config(8), though that could also be achieved with a flag. Tweaking the behavior with a flags value is necessary because hypervisors not always announce what they are capable of. One reason for this is that qemu is responsible for setting most of the cpuid flags but the kvm kernel module offers some interfaces in all cases. Another reason seems to be simple oversight, for example Illuminos KVM forgot to add the cpuid stuff from Linux KVM. And if there is some bug it may be a good idea to have a simple way to check if it is related to the paravirt stuff. I have just noticed that there is support in config(8) to set non-devices related int values. But this does not seem to be supported in the in-kernel UKC. And I can't see right now how config finds the valid int values in the kernel. Or is it just necessary to add the name of the variable prepended by an underscore to the config tool? Would this be preferred over introducing the paravirt device? But being able to set this from the in-kernel UKC would be nice, too. What I have forgotten is that just because a hypervisor supports a specific interface, it does not always mean that it is a good idea to use it. For example, recent Intel CPUs with a recent Linux/KVM have better hardware support for APIC virtualization and using the Hyper-V APIC access MSRs actually slow things down. I think it would be better to avoid fake device proliferation, but others may have other opinions. So the problem is that some hypervisors are broken and don't identify as such? Perhaps we ignore them to start? Then we can add a mechanism to force paravirt code patching. I think the introduction of paravirt code patching and the mechanism used to enable or disable are separate issues. It can start as a normal option PARAVIRT, and then we discuss what else needs to be done? Yes, we can first start with some things that are unproblematic WRT detection. AFAIK, this is the case for the PV EOI optimization. Unfortunately, the bit that gives the largest performance boost at the moment, the APIC access MSRs, is not so easy: - KVM has supported it for a long time - but it is only announced if qemu is started with a special command line option - Illumos KVM supports it too, but there is no way to announce it Also, old hypervisors tend to stay around for a long time and users can't always influence the configuration when they rent a vserver. On the other hand, the same positive effect as the APIC access MSRs should also be achievable by adding x2apic support. kettenis wanted to check how much effort that would be. So, I think it would be best to wait a bit with this piece.
Re: add m_defrag to pcn driver
Hi, I have committed the vio diff with some tweaks to remove some more now unused code. But I think this diff is actually incorrect: On Tue, 31 Mar 2015, Kimberley Manning wrote: Index: if_pcn.c === RCS file: /cvs/src/sys/dev/pci/if_pcn.c,v retrieving revision 1.36 diff -u -p -r1.36 if_pcn.c --- if_pcn.c 14 Mar 2015 03:38:48 - 1.36 +++ if_pcn.c 27 Mar 2015 12:17:24 - @@ -851,25 +851,23 @@ pcn_start(struct ifnet *ifp) * were short on resources. In this case, we'll copy * and try again. */ - if (bus_dmamap_load_mbuf(sc-sc_dmat, dmamap, m0, - BUS_DMA_WRITE|BUS_DMA_NOWAIT) != 0) { - MGETHDR(m, M_DONTWAIT, MT_DATA); - if (m == NULL) - break; - if (m0-m_pkthdr.len MHLEN) { - MCLGET(m, M_DONTWAIT); - if ((m-m_flags M_EXT) == 0) { - m_freem(m); - break; - } - } - m_copydata(m0, 0, m0-m_pkthdr.len, mtod(m, caddr_t)); - m-m_pkthdr.len = m-m_len = m0-m_pkthdr.len; - error = bus_dmamap_load_mbuf(sc-sc_dmat, dmamap, - m, BUS_DMA_WRITE|BUS_DMA_NOWAIT); - if (error) - break; - } +error = bus_dmamap_load_mbuf(sc-sc_dmat, dmamap, m0, +BUS_DMA_WRITE|BUS_DMA_NOWAIT); +switch (error) { +case 0: +break; +case EFBIG: +if ((error = m_defrag(m0, M_DONTWAIT)) == 0 +(error = bus_dmamap_load_mbuf(sc-sc_dmat, dmamap, + m0, BUS_DMA_WRITE|BUS_DMA_NOWAIT)) == 0) +break; + +/* FALLTHROUGH */ +default: + IFQ_DEQUEUE(ifp-if_snd, m0); + m_freem(m); m is always NULL. Therefore you leak m0 in the error path. You should remove the m variable which is no longer necessary, and possibly rename m0 to m. But the rename should probably be done in a second diff for easier review. + continue; +} Other drivers don't dequeue mbufs in out of mem situations. I think you should just set IFF_OACTIVE and break out of the for loop (not shown in the diff). The old code just does the break, but I think setting IFF_OACTIVE would be correct here. /* * Ensure we have enough descriptors free to describe
Re: add m_defrag to pcn driver
On Wednesday 08 April 2015 09:48:14, David Gwynne wrote: +continue; +} Other drivers don't dequeue mbufs in out of mem situations. I think you should just set IFF_OACTIVE and break out of the for loop (not shown in the diff). The old code just does the break, but I think setting IFF_OACTIVE would be correct here. OACTIVE means the tx ring is full. it (generally) gets cleared by a driver when the tx ring is emptied. if the tx ring is empty but we set it when the system runs out of resources (ie, memory shortages, dma mappings, iommu slot exhaustion, solar flares, etc), then there's no event to clear OACTIVE with. if we cant map the mbuf we should drop it and move on. Sounds reasonable but many drivers (e.g. em, ix, bge) don't do that but set OACTIVE in this case. Is there a reason for that? From a casual glance at the source it is not clear to me that they have some other way to recover. Some have a watchdog that will reset the device after some time.
Re: add m_defrag to pcn driver
On Wednesday 01 April 2015 12:46:47, Kimberley Manning wrote: This diff makes the pcn driver use m_defrag for fragmented mbuf chains, I like this kind of cleanups. As for vio(4) could you try the diff or are you looking for testers? Thanks. I've tried both the vio and pcn diffs and they worked for me, so dlg told me to write here. I'm not sure what happens to diffs after this point to be honest. I can take a look at the vio diff and commit it in a few days. BTW, is there a way to trigger heavily fragmented mbufs in order to test the code path?
Re: Possible fix for i217 problem
Thanks to everyone for the testing. The patch is now committed.
Possible fix for i217 problem
Hi, someone mentioned to me the i217-LM problems that were reported on misc end of May. It is possible that the patch below helps. For us, it fixed a problem on a laptop with i217-LM (pci id 8086:153a) where the receiving of packets would stop until the battery of the laptop was removed (or until linux or freebsd were booted, which also have this workaround). A normal reboot or power-cycle without removing the battery did not help. Interestingly, not even the Intel PXE BIOS has the workaround. The problem would happen if the LAN cable was plugged in after the card had already been initialized. If the LAN cable was always plugged in when the laptop was powered on, the problem would not appear. The workaround is part of the e1000_lv_jumbo_workaround_ich8lan() function in e1000_ich8lan.c in freebsd, but only the part that is used if jumbo packets are *not* configured. Linux has the same fix as b20a774495671f037e7160ea2ce87 and da1e2046e5f5ab268e55d30d6b74099ade0aeb6f with some more info in the commit messages. This probably has quite some potential to cause regressions with other boards, so i am not sure if it should go in before 5.8 release. Cheers, Stefan --- a/sys/dev/pci/if_em_hw.c +++ b/sys/dev/pci/if_em_hw.c @@ -91,6 +91,7 @@ static int32_tem_id_led_init(struct em_hw *); static int32_t em_init_lcd_from_nvm_config_region(struct em_hw *, uint32_t, uint32_t); static int32_t em_init_lcd_from_nvm(struct em_hw *); +static int32_t em_phy_no_cable_workaround(struct em_hw *); static voidem_init_rx_addrs(struct em_hw *); static voidem_initialize_hardware_bits(struct em_hw *); static boolean_t em_is_onboard_nvm_eeprom(struct em_hw *); @@ -7018,6 +7019,96 @@ em_read_mac_addr(struct em_hw *hw) } /** + * Explicitly disables jumbo frames and resets some PHY registers back to hw- + * defaults. This is necessary in case the ethernet cable was inserted AFTER + * the firmware initialized the PHY. Otherwise it is left in a state where + * it is possible to transmit but not receive packets. Observed on I217-LM and + * fixed in FreeBSD's sys/dev/e1000/e1000_ich8lan.c. + * + * hw - Struct containing variables accessed by shared code + */ +STATIC int32_t +em_phy_no_cable_workaround(struct em_hw *hw) { + int32_t ret_val, dft_ret_val; + uint32_t mac_reg; + uint16_t data, phy_reg; + + /* disable Rx path while enabling workaround */ + em_read_phy_reg(hw, I2_DFT_CTRL, phy_reg); + ret_val = em_write_phy_reg(hw, I2_DFT_CTRL, phy_reg | (1 14)); + if (ret_val) + return ret_val; + + /* Write MAC register values back to h/w defaults */ + mac_reg = E1000_READ_REG(hw, FFLT_DBG); + mac_reg = ~(0xF 14); + E1000_WRITE_REG(hw, FFLT_DBG, mac_reg); + + mac_reg = E1000_READ_REG(hw, RCTL); + mac_reg = ~E1000_RCTL_SECRC; + E1000_WRITE_REG(hw, RCTL, mac_reg); + + ret_val = em_read_kmrn_reg(hw, E1000_KUMCTRLSTA_OFFSET_CTRL, data); + if (ret_val) + goto out; + ret_val = em_write_kmrn_reg(hw, E1000_KUMCTRLSTA_OFFSET_CTRL, + data ~(1 0)); + if (ret_val) + goto out; + + ret_val = em_read_kmrn_reg(hw, E1000_KUMCTRLSTA_OFFSET_HD_CTRL, data); + if (ret_val) + goto out; + + data = ~(0xF 8); + data |= (0xB 8); + ret_val = em_write_kmrn_reg(hw, E1000_KUMCTRLSTA_OFFSET_HD_CTRL, data); + if (ret_val) + goto out; + + /* Write PHY register values back to h/w defaults */ + em_read_phy_reg(hw, I2_SMBUS_CTRL, data); + data = ~(0x7F 5); + ret_val = em_write_phy_reg(hw, I2_SMBUS_CTRL, data); + if (ret_val) + goto out; + + em_read_phy_reg(hw, I2_MODE_CTRL, data); + data |= (1 13); + ret_val = em_write_phy_reg(hw, I2_MODE_CTRL, data); + if (ret_val) + goto out; + + /* +* 776.20 and 776.23 are not documented in +* i217-ethernet-controller-datasheet.pdf... +*/ + em_read_phy_reg(hw, PHY_REG(776, 20), data); + data = ~(0x3FF 2); + data |= (0x8 2); + ret_val = em_write_phy_reg(hw, PHY_REG(776, 20), data); + if (ret_val) + goto out; + + ret_val = em_write_phy_reg(hw, PHY_REG(776, 23), 0x7E00); + if (ret_val) + goto out; + + em_read_phy_reg(hw, I2_PCIE_POWER_CTRL, data); + ret_val = em_write_phy_reg(hw, I2_PCIE_POWER_CTRL, data ~(1 10)); + if (ret_val) + goto out; + +out: + /* re-enable Rx path after enabling workaround */ + dft_ret_val = em_write_phy_reg(hw, I2_DFT_CTRL, phy_reg ~(1 14)); + if (ret_val) + return ret_val; + else + return dft_ret_val; +} +
Call for testing for tty diff
While writing a virtio-console driver, I have found a bug in ttwrite() that can cause hangs. Below is a fix and after talking to Theo, I would like to know if the patch causes regressions for anyone, for example hangs in pty input/output, serial ports, etc. Thanks in advance. Cheers, Stefan - Introduce new defines TTHIWATMINSPACE, TTMINHIWAT for some magic values that are used in tty.c. - Remove hiwat adjustments in ttwrite(). This fixes the missing spltty(). - The above alone causs deadlocks with ptys. Change ttysetwater() to keep at least TTHIWATMINSPACE space above the high water mark. This makes it consistent with ttycheckoutq() and seems to fix the pty deadlocks. --- sys/kern/tty.c +++ sys/kern/tty.c @@ -1688,7 +1688,7 @@ ttycheckoutq(struct tty *tp, int wait) hiwat = tp-t_hiwat; s = spltty(); oldsig = wait ? curproc-p_siglist : 0; - if (tp-t_outq.c_cc hiwat + 200) + if (tp-t_outq.c_cc hiwat + TTHIWATMINSPACE) while (tp-t_outq.c_cc hiwat) { ttstart(tp); if (wait == 0 || curproc-p_siglist != oldsig) { @@ -1823,7 +1823,7 @@ loop: tp-t_rocount = 0; if (ttyoutput(*cp, tp) = 0) { /* out of space */ - goto overfull; + goto ovhiwat; } cp++; cc--; @@ -1849,7 +1849,7 @@ loop: tp-t_outcc += ce; if (i 0) { /* out of space */ - goto overfull; + goto ovhiwat; } if (ISSET(tp-t_lflag, FLUSHO) || tp-t_outq.c_cc hiwat) @@ -1869,15 +1869,6 @@ done: explicit_bzero(obuf, obufcc); return (error); -overfull: - /* -* Since we are using ring buffers, if we can't insert any more into -* the output queue, we can assume the ring is full and that someone -* forgot to set the high water mark correctly. We set it and then -* proceed as normal. -*/ - hiwat = tp-t_outq.c_cc - 1; - ovhiwat: ttstart(tp); s = spltty(); @@ -2114,7 +2105,7 @@ ttsetwater(struct tty *tp) cps = tp-t_ospeed / 10; tp-t_lowat = x = CLAMP(cps / 2, TTMAXLOWAT, TTMINLOWAT); x += cps; - tp-t_hiwat = CLAMP(x, tp-t_outq.c_cn, 100); + tp-t_hiwat = CLAMP(x, tp-t_outq.c_cn - TTHIWATMINSPACE, TTMINHIWAT); #undef CLAMP } --- sys/sys/tty.h +++ sys/sys/tty.h @@ -171,6 +171,8 @@ struct itty { #ifdef _KERNEL #defineTTMAXLOWAT 256 #defineTTMINLOWAT 32 +#defineTTMINHIWAT 100 +#defineTTHIWATMINSPACE 200 /* Min space above hiwat */ #endif /* These flags are kept in t_state. */
Re: usb hang related to xhci
On Sat, 18 Jul 2015, David Hill wrote: Whenever I plug a device into my USB ports, my machine locks hard. I have the Intel Series 7 / C216 chip, so xhci attempts to route the port from ehci to xhci. The following diff is from FreeBSD and makes my USB devices work again. https://github.com/freebsd/freebsd/blob/e79c62ff68fc74d88cb6f479859f6fae9baa5101/sys/dev/usb/controller/xhci_pci.c#L153-L176 I didn't have the lockup but my Intel Series 9 laptop does not recognice a USB 3.0 storage device if I insert it after a hot boot. If I insert it after a cold boot, it works ok. This behavior does not change with your patch. Cheers, Stefan Index: sys/dev/pci/xhci_pci.c === RCS file: /cvs/src/sys/dev/pci/xhci_pci.c,v retrieving revision 1.6 diff -u -p -r1.6 xhci_pci.c --- sys/dev/pci/xhci_pci.c22 Jun 2015 08:43:27 - 1.6 +++ sys/dev/pci/xhci_pci.c19 Jul 2015 02:20:06 - @@ -92,33 +92,45 @@ xhci_pci_match(struct device *parent, vo static int xhci_pci_port_route(struct xhci_pci_softc *psc) { - pcireg_t val; + pcireg_t val, usb2_mask, usb3_mask; - /* - * Check USB3 Port Routing Mask register that indicates the ports - * can be changed from OS, and turn on by USB3 Port SS Enable register. - */ - val = pci_conf_read(psc-sc_pc, psc-sc_tag, PCI_XHCI_INTEL_USB3PRM); - DPRINTF((%s: USB3PRM / USB3.0 configurable ports: 0x%08x\n, - psc-sc.sc_bus.bdev.dv_xname, val)); +/* + * Check USB3 Port Routing Mask register that indicates the ports + * can be changed from OS, and turn on by USB3 Port SS Enable register. + */ +usb3_mask = pci_conf_read(psc-sc_pc, psc-sc_tag, + PCI_XHCI_INTEL_USB3PRM); +DPRINTF((%s: USB3PRM / USB3.0 configurable ports: 0x%08x\n, +psc-sc.sc_bus.bdev.dv_xname, usb3_mask)); + +/* + * Check USB2 Port Routing Mask register that indicates the USB2.0 + * ports to be controlled by xHCI HC, and switch them to xHCI HC. + */ +usb2_mask = pci_conf_read(psc-sc_pc, psc-sc_tag, + PCI_XHCI_INTEL_XUSB2PRM); +DPRINTF((%s: XUSB2PRM / USB2.0 ports can switch from EHCI to xHCI: +0x%08x\n, psc-sc.sc_bus.bdev.dv_xname, val)); + + val = pci_conf_read(psc-sc_pc, psc-sc_tag, PCI_XHCI_INTEL_USB3_PSSEN) | + pci_conf_read(psc-sc_pc, psc-sc_tag, PCI_XHCI_INTEL_XUSB2PR); - pci_conf_write(psc-sc_pc, psc-sc_tag, PCI_XHCI_INTEL_USB3_PSSEN, val); + + pci_conf_write(psc-sc_pc, psc-sc_tag, PCI_XHCI_INTEL_USB3_PSSEN, + val usb3_mask); +#ifdef XHCI_DEBUG val = pci_conf_read(psc-sc_pc, psc-sc_tag, PCI_XHCI_INTEL_USB3_PSSEN); DPRINTF((%s: USB3_PSSEN / Enabled USB3.0 ports under xHCI: 0x%08x\n, psc-sc.sc_bus.bdev.dv_xname, val)); +#endif - /* - * Check USB2 Port Routing Mask register that indicates the USB2.0 - * ports to be controlled by xHCI HC, and switch them to xHCI HC. - */ - val = pci_conf_read(psc-sc_pc, psc-sc_tag, PCI_XHCI_INTEL_XUSB2PRM); - DPRINTF((%s: XUSB2PRM / USB2.0 ports can switch from EHCI to xHCI: - 0x%08x\n, psc-sc.sc_bus.bdev.dv_xname, val)); - - pci_conf_write(psc-sc_pc, psc-sc_tag, PCI_XHCI_INTEL_XUSB2PR, val); + pci_conf_write(psc-sc_pc, psc-sc_tag, PCI_XHCI_INTEL_XUSB2PR, + val usb2_mask); +#ifdef XHCI_DEBUG val = pci_conf_read(psc-sc_pc, psc-sc_tag, PCI_XHCI_INTEL_XUSB2PR); DPRINTF((%s: XUSB2PR / USB2.0 ports under xHCI: 0x%08x\n, psc-sc.sc_bus.bdev.dv_xname, val)); +#endif return (0); }
Looking for testers for amd64 suspend diff
Hi, the diff below is necessary to make suspend/resume work when x2apic is enabled (i.e. on qemu/kvm/...). While I don't expect problems, it would be nice if I could get some reports that this doesn't break suspend/resume on real machines, especially older ones (like core 2 duo and older) and ones with an amd cpu. Thanks in advance. Cheers, Stefan --- a/sys/arch/amd64/amd64/acpi_wakecode.S +++ b/sys/arch/amd64/amd64/acpi_wakecode.S @@ -52,6 +52,7 @@ #include #include #include +#include "lapic.h" #define _ACPI_TRMP_LABEL(a) a = . - _C_LABEL(acpi_real_mode_resume) + \ ACPI_TRAMPOLINE @@ -245,6 +246,13 @@ _C_LABEL(acpi_long_mode_resume): movw%ax, %gs /* Restore registers - start with the MSRs */ +#if NLAPIC > 0 + movl$MSR_APICBASE, %ecx + movlacpi_saved_apicbase, %eax + movlacpi_saved_apicbase+4, %edx + wrmsr +#endif + movl$MSR_STAR, %ecx movlacpi_saved_star, %eax movlacpi_saved_star+4, %edx @@ -623,6 +631,10 @@ _ACPI_TRMP_DATA_LABEL(acpi_saved_cstar) .quad 0 _ACPI_TRMP_DATA_LABEL(acpi_saved_sfmask) .quad 0 +#if NLAPIC > 0 +_ACPI_TRMP_DATA_LABEL(acpi_saved_apicbase) + .quad 0 +#endif .align 4 _ACPI_TRMP_DATA_LABEL(acpi_pdirpa) @@ -682,6 +694,13 @@ NENTRY(acpi_savecpu) pushq %rcx pushq %rdx +#if NLAPIC > 0 + movl$MSR_APICBASE, %ecx + rdmsr + movl%eax, acpi_saved_apicbase + movl%edx, acpi_saved_apicbase+4 +#endif + movl$MSR_STAR, %ecx rdmsr movl%eax, acpi_saved_star
Re: Looking for testers for amd64 suspend diff
On Monday 07 September 2015 13:23:03, Alexander Bluhm wrote: > Suspend OpenBSD in a fully emulated qemu running on OpenBSD works. > After resume, typing in the shell of the virtualized machine works. > But at disk access, all processes hang in biowait. > > It is the same behavior with and without your diff. If you use virtio disk this is expected, because the virtio drivers still lack the necessary suspend/resume logic.
Re: Looking for testers for amd64 suspend diff
On Sunday 06 September 2015 11:44:11, Stefan Fritsch wrote: > the diff below is necessary to make suspend/resume work when x2apic > is enabled (i.e. on qemu/kvm/...). While I don't expect problems, > it would be nice if I could get some reports that this doesn't > break suspend/resume on real machines, especially older ones (like > core 2 duo and older) and ones with an amd cpu. Committed now. Thanks again to everyone who tested it.
Re: Checking MAC address of incoming unicast packets
On Sun, 3 Jan 2016, Theo de Raadt wrote: > >On Friday 01 January 2016 16:25:59, Theo de Raadt wrote: > >> dlg writes: > >> > should we just do it unconditionally? is there a downside to that? > > > >It may decrease performance a tiny bit. Since such bits tend to add > >up, I would be hesitant to enable the check for drivers that don't > >need it. OTOH, freebsd and linux seem to always do the check. > > How does it decrease performance? I am talking about this diff in ether_input(): diff --git sys/net/if_ethersubr.c sys/net/if_ethersubr.c --- sys/net/if_ethersubr.c +++ sys/net/if_ethersubr.c @@ -353,8 +353,7 @@ ether_input(struct ifnet *ifp, struct mbuf *m, void *cookie) * If packet is unicast and we're in promiscuous mode, make sure it * is for us. Drop otherwise. */ - if ((m->m_flags & (M_BCAST|M_MCAST)) == 0 && - (ifp->if_flags & IFF_PROMISC)) { + if ((m->m_flags & (M_BCAST|M_MCAST)) == 0) { if (memcmp(ac->ac_enaddr, eh->ether_dhost, ETHER_ADDR_LEN)) { m_freem(m); return (1); For other drivers than vio this means an additional 6 byte memcmp where one operand is already in the cache, and the other operand may or may not be in the cache. 'tiny performance impact' seems about right. > > I tried to read the rest of what you wrote. It makes no sense. Since > these mbufs are coming off a ring, aren't they already linear; aren't > these mbufs already normalzed ie. "pulled up". The performance cost > seems to be one condition in a mbuf macro which decides to do no work. > > Maybe you should show the diff which does the check in the driver > itself, then then it will be easier to see the performance cost. Right > now I can't perceive the problem. It's not very expensive. But it's more code duplication than I like: diff --git sys/dev/pci/if_vio.c sys/dev/pci/if_vio.c --- sys/dev/pci/if_vio.c +++ sys/dev/pci/if_vio.c @@ -1005,6 +1005,9 @@ vio_rxeof(struct vio_softc *sc) int r = 0; int slot, len, bufs_left; struct virtio_net_hdr *hdr; + int promisc = (ifp->if_flags & IFF_PROMISC); + u_int8_t *enaddr = sc->sc_ac.ac_enaddr; + struct ether_header *eh; while (virtio_dequeue(vsc, vq, , ) == 0) { r = 1; @@ -1035,10 +1038,33 @@ vio_rxeof(struct vio_softc *sc) bufs_left--; } - if (bufs_left == 0) { - ml_enqueue(, m0); - m0 = NULL; + if (bufs_left > 0) + continue; + + /* +* Unfortunately, mac filtering is only best effort +* in virtio-net. Unwanted packets may still arrive. +* If we are not promiscous, we have to check if the +* packet is actually for us. +*/ + if (!promisc) { + m0 = m_pullup(m0, ETHER_HDR_LEN); + /* +* no need to check m0 == NULL, we know m0 has enough +* space +*/ + + eh = mtod(m0, struct ether_header *); + if (!ETHER_IS_MULTICAST(eh->ether_dhost) && + memcmp(enaddr, eh->ether_dhost, ETHER_ADDR_LEN) != 0) { + m_freem(m0); + m0 = NULL; + continue; + } } + + ml_enqueue(, m0); + m0 = NULL; } if (m0 != NULL) { DBGPRINT("expected %d buffers, got %d", (int)hdr->num_buffers,
Re: Checking MAC address of incoming unicast packets
On Friday 01 January 2016 16:25:59, Theo de Raadt wrote: > dlg writes: > > should we just do it unconditionally? is there a downside to that? It may decrease performance a tiny bit. Since such bits tend to add up, I would be hesitant to enable the check for drivers that don't need it. OTOH, freebsd and linux seem to always do the check. > > That is a very good question. What are the downsides against having > the driver do this filtering itself, like all real hardware does? The driver would need to do some mbuf operations to get the destination address. The same operations are then done again inside ether_input(). This is ugly. > Why risk sending packets of the wrong form further upwards into the > network stack, and then having to ensure the checks exist up there? > > This flag is requesting special service to encourage a layer > violation. > > What is the reasoning to encourage such a flag (which someone will > one day see, misuse, and try to expand for their bizzare use > case...) To me, it seems as much a layer violation as hardware checksum offloading features or the IFF_SIMPLEX flag. So no change there. On Saturday 02 January 2016 10:57:41, Martin Pieuchot wrote: > If it's acceptable performance-wise to do the check unconditionally > I believe that's the way to go. If not I'm a bit afraid of > introducing a flag/capability for a single driver. Do you know if > any other driver could use it? I haven't found specs of xen or vmware network drivers, therefore I can't say if those would have the same issue. One point for doing the check unconditionally is that it may guard against buggy hardware or buggy hypervisors (in cases the spec does not allow this). Since the observable error in our case was that sometimes TCP connections of other, unrelated VM hosts were aborted with TCP RSTs, such bugs are very difficult to find if they are not triggered often. I don't have a setup myself where I could do meaningful performance tests. If anyone has a setup where he can easily do such tests, it would be nice if he could test if removing the "(ifp->if_flags & IFF_PROMISC)" check in ether_input() makes any difference. Otherwise, I can try to do some performance tests at work, but that may take a while.
Checking MAC address of incoming unicast packets
Hi, by default, the ether_input() checks the destination MAC address of incoming unicast packets only if the interface is in promiscous mode. If not, it is assumed that the NIC filters unicast packets reliably. Unfortunately, for virtio-net this is not the case. There, unicast filtering is only best effort, and (depending on configuration) if the bridge on the VM host does unicast flodding, unicast packets that are not for the VM guest may still be delivered to the VM guest. This is a rather annoying problem because it can cause pf to send RST packets to foreign connections. (Kudos to mpf@ for debugging this). There are two possible approaches to fix this problem. Either make the vio(4) driver filter out unicast packets that are not for the local MAC, which would involve duplicating quite a bit of code from ether_input() in vio(4). Or, and I would prefer this, allow the driver to tell ether_input() that it needs to check the MAC always, and not only if the interface is in promiscous mode. This could be done with a new flag. There seem to be three possible places where this flag could be put: * ifnet.if_flags This is a short and there is no free bit. But the IFF_NOTRAILERS bit has become unused recently and could be recycled. * ifnet.if_xflags An int, lots of free bits. But comment says 'extra softnet flags' * if_data.ifi_capabilities An u_int32_t, lots of free bits. In the diff below, I went with the first choice because the new IFF_NOMACFILTER is somewhat similar to IFF_SIMPLEX and because the the check can then be nicely folded into the existing check for IFF_PROMISC. I would welcome any comments, suggestions for a better flag name, OKs, ... Cheers, Stefan diff --git sys/dev/pci/if_vio.c sys/dev/pci/if_vio.c index 4cd80d5..22fd7cf 100644 --- sys/dev/pci/if_vio.c +++ sys/dev/pci/if_vio.c @@ -582,21 +582,21 @@ vio_attach(struct device *parent, struct device *self, void *aux) virtio_start_vq_intr(vsc, >sc_vq[VQCTL]); vsc->sc_nvqs = 3; } } if (vio_alloc_mem(sc) < 0) goto err; strlcpy(ifp->if_xname, self->dv_xname, IFNAMSIZ); ifp->if_softc = sc; - ifp->if_flags = IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST; + ifp->if_flags = IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST | IFF_NOMACFILTER; ifp->if_start = vio_start; ifp->if_ioctl = vio_ioctl; ifp->if_capabilities = IFCAP_VLAN_MTU; if (features & VIRTIO_NET_F_CSUM) ifp->if_capabilities |= IFCAP_CSUM_TCPv4|IFCAP_CSUM_UDPv4; IFQ_SET_MAXLEN(>if_snd, vsc->sc_vqs[1].vq_num - 1); IFQ_SET_READY(>if_snd); ifmedia_init(>sc_media, 0, vio_media_change, vio_media_status); ifmedia_add(>sc_media, IFM_ETHER | IFM_AUTO, 0, NULL); ifmedia_set(>sc_media, IFM_ETHER | IFM_AUTO); diff --git sys/net/if.h sys/net/if.h index 8d7e390..91c6d18 100644 --- sys/net/if.h +++ sys/net/if.h @@ -182,36 +182,37 @@ struct if_status_description { /* * Length of interface description, including terminating '\0'. */ #defineIFDESCRSIZE 64 #defineIFF_UP 0x1 /* interface is up */ #defineIFF_BROADCAST 0x2 /* broadcast address valid */ #defineIFF_DEBUG 0x4 /* turn on debugging */ #defineIFF_LOOPBACK0x8 /* is a loopback net */ #defineIFF_POINTOPOINT 0x10/* interface is point-to-point link */ -#defineIFF_NOTRAILERS 0x20/* avoid use of trailers */ +#defineIFF_NOMACFILTER 0x20/* Does not reliably filter unicast MACs */ #defineIFF_RUNNING 0x40/* resources allocated */ #defineIFF_NOARP 0x80/* no address resolution protocol */ #defineIFF_PROMISC 0x100 /* receive all packets */ #defineIFF_ALLMULTI0x200 /* receive all multicast packets */ #defineIFF_OACTIVE 0x400 /* transmission in progress */ #defineIFF_SIMPLEX 0x800 /* can't hear own transmissions */ #defineIFF_LINK0 0x1000 /* per link layer defined bit */ #defineIFF_LINK1 0x2000 /* per link layer defined bit */ #defineIFF_LINK2 0x4000 /* per link layer defined bit */ #defineIFF_MULTICAST 0x8000 /* supports multicast */ + /* flags set internally only: */ #defineIFF_CANTCHANGE \ (IFF_BROADCAST|IFF_POINTOPOINT|IFF_RUNNING|IFF_OACTIVE|\ - IFF_SIMPLEX|IFF_MULTICAST|IFF_ALLMULTI) + IFF_SIMPLEX|IFF_MULTICAST|IFF_ALLMULTI|IFF_NOMACFILTER) #define IFXF_MPSAFE0x1 /* if_start is mpsafe */ #defineIFXF_INET6_NOPRIVACY0x4 /* don't autoconf privacy */ #defineIFXF_MPLS 0x8 /*
Re: Xen virtual network (Netfront) driver
On Wed, 6 Jan 2016, Mike Belopuhov wrote: > There's still stuff to do, but it receives and transmits reliably > (at least on modern Xen) so I'd like to get it in. Man page will > follow. I only had a quick glance at the code, but I have one comment about your use of memory barriers. The membar_* macros are pure compiler barriers when the openbsd kernel is compiled for UP. But since the host machine and xen may use SMP even in this case, I suspect the that you need hardware memory barriers even if MULTIPROCESSOR is not defined. This does not seem relevant for x86 because you don't use membar_sync, but it may become relevant for arm, which is also supported by xen. I had the same problem in virtio and introduced the virtio_membar_* macros for this purpose. Maybe they should be renamed to a more generic name and you should use them, too? > + if (!(ifp->if_flags & IFF_RUNNING) || ifq_is_oactive(>if_snd)) > + return; > + > + prod = txr->txr_prod; > + membar_consumer(); > + > + for (;;) { > + m = ifq_deq_begin(>if_snd); > + if (m == NULL) > + break; > + > + error = xnf_encap(sc, m, ); > + if (error == ENOENT) { > + /* transient */ > + ifq_deq_rollback(>if_snd, m); > + ifq_set_oactive(>if_snd); > + break; > + } else if (error) { > + /* the chain is too large */ > + ifq_deq_commit(>if_snd, m); > + m_freem(m); > + continue; > + } > + ifq_deq_commit(>if_snd, m); > + > +#if NBPFILTER > 0 > + if (ifp->if_bpf) > + bpf_mtap_ether(ifp->if_bpf, m, BPF_DIRECTION_OUT); > +#endif > + pkts++; > + } > + if (pkts > 0) { > + txr->txr_prod = prod; > + xen_intr_signal(sc->sc_xih); > + } > +} > + > +int > +xnf_encap(struct xnf_softc *sc, struct mbuf *m, uint32_t *prod) > +{ > + struct ifnet *ifp = >sc_ac.ac_if; > + struct xnf_tx_ring *txr = sc->sc_tx_ring; > + union xnf_tx_desc *txd; > + bus_dmamap_t dmap; > + int error, i, n = 0; > + > + if (((txr->txr_cons - *prod - 1) & (XNF_TX_DESC - 1)) < XNF_TX_FRAG) { > + error = ENOENT; > + goto errout; > + } > + > + i = *prod & (XNF_TX_DESC - 1); > + dmap = sc->sc_tx_dmap[i]; > + > + error = bus_dmamap_load_mbuf(sc->sc_dmat, dmap, m, BUS_DMA_WRITE | > + BUS_DMA_NOWAIT); > + if (error == EFBIG) { > + if (m_defrag(m, M_DONTWAIT) || > + bus_dmamap_load_mbuf(sc->sc_dmat, dmap, m, BUS_DMA_WRITE | > + BUS_DMA_NOWAIT)) > + goto errout; > + } else if (error) > + goto errout; > + > + for (n = 0; n < dmap->dm_nsegs; n++, (*prod)++) { > + i = *prod & (XNF_TX_DESC - 1); > + if (sc->sc_tx_buf[i]) > + panic("%s: save vs spell: %d\n", ifp->if_xname, i); > + txd = >txr_desc[i]; > + if (n == 0) { > + sc->sc_tx_buf[i] = m; > + if (0 && m->m_pkthdr.csum_flags & M_IPV4_CSUM_OUT) > + txd->txd_req.txq_flags = XNF_TXF_CSUM | > + XNF_TXF_VALID; > + txd->txd_req.txq_size = m->m_pkthdr.len; > + } else > + txd->txd_req.txq_size = dmap->dm_segs[n].ds_len; > + if (n != dmap->dm_nsegs - 1) > + txd->txd_req.txq_flags |= XNF_TXF_CHUNK; > + txd->txd_req.txq_ref = dmap->dm_segs[n].ds_addr; > + txd->txd_req.txq_offset = dmap->dm_segs[n].ds_offset; > + } > + > + ifp->if_opackets++; > + return (0); > + > + errout: > + ifp->if_oerrors++; > + return (error); > +} > + > +void > +xnf_intr(void *arg) > +{ > + struct xnf_softc *sc = arg; > + struct ifnet *ifp = >sc_ac.ac_if; > + > + if (ifp->if_flags & IFF_RUNNING) { > + xnf_rxeof(sc); > + xnf_txeof(sc); > + } > +} > + > +int > +xnf_txeof(struct xnf_softc *sc) > +{ > + struct ifnet *ifp = >sc_ac.ac_if; > + struct xnf_tx_ring *txr = sc->sc_tx_ring; > + union xnf_tx_desc *txd; > + struct mbuf *m; > + bus_dmamap_t dmap; > + volatile uint32_t r; > + uint32_t cons; > + int i, id, pkts = 0; > + > + do { > + for (cons = sc->sc_tx_cons; cons != txr->txr_cons; cons++) { > + membar_consumer(); > + i = cons & (XNF_TX_DESC - 1); > + txd = >txr_desc[i]; > + id = txd->txd_rsp.txp_id; > + memset(txd, 0, sizeof(*txd)); > + txd->txd_req.txq_id = id; > + membar_producer(); > + if (sc->sc_tx_buf[i]) { > +
Re: Checking MAC address of incoming unicast packets
On Mon, 4 Jan 2016, Claudio Jeker wrote: > On Sat, Jan 02, 2016 at 04:04:33PM +0100, Mark Kettenis wrote: > > > Date: Sat, 2 Jan 2016 10:57:41 +0100 > > > From: Martin Pieuchot> > > > > > If it's acceptable performance-wise to do the check unconditionally I > > > believe that's the way to go. If not I'm a bit afraid of introducing > > > a flag/capability for a single driver. Do you know if any other driver > > > could use it? > > > > I suppose any paravirtualized network driver would need this; i.e. my > > sun4v vnet(4) driver and the xen stuff mikeb@ is working on. > > That really depends on the device. Not all need to have the same weird semantics as virtio. But I wouldn't rule it out, either. > Shouldn't all these interfaces have the PROMISC flag set all the time > instead? They seem to be always promiscous since they don't have a HW mac > address filter. Doing that would solve the problem in a simple way > (special ioctl handling per driver). No, it's not the same as always promiscous. The virtio specification says "there is a filter, but unwanted packets may still arrive". The reliability of the HW mac address filter depends on the implementation. Some have a reliable filter (e.g. the one in qemu), some rely on the software bridge (e.g. the vhost-net implementation in the linux kernel). In order to reliably get all packets, the driver has to explicitly tell the device to switch PROMISC mode on.
Re: pppd and ioctl(TIOCSETD) (was: CVS: cvs.openbsd.org: src)
On Tue, 22 Dec 2015, David Coppa wrote: > I suspect this broke my 3G connection: Oops. Sorry. Try this diff (already committed): --- sys/kern/tty_conf.c +++ sys/kern/tty_conf.c @@ -42,6 +42,11 @@ #include #include +#include "ppp.h" +#include "nmea.h" +#include "msts.h" +#include "endrun.h" + #definettynodisc ((int (*)(dev_t, struct tty *, struct proc *))enodev) #definettyerrclose ((int (*)(struct tty *, int flags, struct proc *))enodev) #definettyerrio ((int (*)(struct tty *, struct uio *, int))enodev)
Re: Xen virtual network (Netfront) driver
On Thu, 7 Jan 2016, Mike Belopuhov wrote: > On 7 January 2016 at 13:17, Mark Kettenis <mark.kette...@xs4all.nl> wrote: > >> From: Mike Belopuhov <m...@belopuhov.com> > >> Date: Thu, 7 Jan 2016 12:02:23 +0100 > >> > >> On 6 January 2016 at 17:58, Stefan Fritsch <s...@sfritsch.de> wrote: > >> > On Wed, 6 Jan 2016, Mike Belopuhov wrote: > >> > > >> >> There's still stuff to do, but it receives and transmits reliably > >> >> (at least on modern Xen) so I'd like to get it in. Man page will > >> >> follow. > >> > > >> > I only had a quick glance at the code, but I have one comment about your > >> > use of memory barriers. The membar_* macros are pure compiler barriers > >> > when the openbsd kernel is compiled for UP. But since the host machine > >> > and > >> > xen may use SMP even in this case, I suspect the that you need hardware > >> > memory barriers even if MULTIPROCESSOR is not defined. This does not seem > >> > relevant for x86 because you don't use membar_sync, but it may become > >> > relevant for arm, which is also supported by xen. > >> > > >> > >> membar_{producer,consumer} are defined on arm to perform store and > >> load memory barriers. Our arm code currently does not distinguish > >> between an MP case and non-MP case regarding the definition of these > >> macros, so I'm not entirely certain what are you trying to say. I didn't check arm's implementation but new that it had non-empty membar_{producer,consumer}. So, if it does not distinguish between an MP case and non-MP case, then there is no problem there. But maybe you should document somewhere which assumptions about the architecture you make, so that they can be checked when adding a new architecture. I guess arm64 will come sooner or later and I don't know if it has exactly the same memory model as 32bit arm. > > > > Not sure ARM is a good example to look at. > > > > The only architectures that Xen dom0 is implemented for are i386, > adm64 and arm, so there's no real need to look at anything else. > > > In principle I think that the membar_xxx() interfaces could be simple > > compiler barriers on all our architectures, at least as long as the > > CPU will observe its own stores in the same order as they were > > emitted. But I think all sane CPU architectures make those > > guarantees. At least for "normal" memory. However, we treat that as > > an optimization. And we haven't done that for all our architectures. > > > > The problem with virtualization is of course that even a non-MP kernel > > is actually running in an MP environment. If data structures are > > shared with the hypervisor or another domain running on a different > > CPU, proper memory barriers must be used to guarantee the other side > > sees our stores in the right order. The typical case would be > > populating a descriptor with some sort of validity bit. There you > > want to make sure the other side doesn't see the valid bit set until > > all the other parts of the descriptor have been filled in and are > > visible. In that case a simple compiler barrier may not be enough. Yes. With intel it's the "Reads may be reordered with older writes to different locations but not with older writes to the same location" bit from the memory model that is causing problems. So you have to check if xen hits this case. virtio does (and removing the membarriers causes observable hangs). > > That's what I was referring to in my example below. > > > This is why the virtio_membar_xxx() primitives were introduced. > > > > Any idea why wasn't store and load barriers implemented separately? No idea. virtio_membar_xxx() was modeled after the existing membar_xxx(). But AIUI membar_consumer() plus membar_producer() is not equivalent to membar_sync() (which also prevents read vs. write reordering).
Re: Checking MAC address of incoming unicast packets
On Mon, 4 Jan 2016, Stefan Fritsch wrote: > On Sun, 3 Jan 2016, Theo de Raadt wrote: > > >> dlg writes: > > >> > should we just do it unconditionally? is there a downside to that? > > > > > >It may decrease performance a tiny bit. Since such bits tend to add > > >up, I would be hesitant to enable the check for drivers that don't > > >need it. OTOH, freebsd and linux seem to always do the check. > > > > How does it decrease performance? > > I am talking about this diff in ether_input(): > > diff --git sys/net/if_ethersubr.c sys/net/if_ethersubr.c > --- sys/net/if_ethersubr.c > +++ sys/net/if_ethersubr.c > @@ -353,8 +353,7 @@ ether_input(struct ifnet *ifp, struct mbuf *m, void > *cookie) >* If packet is unicast and we're in promiscuous mode, make sure it >* is for us. Drop otherwise. >*/ > - if ((m->m_flags & (M_BCAST|M_MCAST)) == 0 && > - (ifp->if_flags & IFF_PROMISC)) { > + if ((m->m_flags & (M_BCAST|M_MCAST)) == 0) { > if (memcmp(ac->ac_enaddr, eh->ether_dhost, ETHER_ADDR_LEN)) { > m_freem(m); > return (1); > > For other drivers than vio this means an additional 6 byte memcmp where > one operand is already in the cache, and the other operand may or may not > be in the cache. 'tiny performance impact' seems about right. Hrvoje Popovski was very helpful and did some performance tests with the patch above. There is only measurable difference at very high package rates, but I must admit that there the difference is a bit larger than I expected: > i'm sending 64byte packets from host connected to ix2 and counting then > on host connected to ix3 > without your patch sending receiving 400kpps 400kpps 600kpps 600kpps 800kpps 690kpps 1Mpps 610kpps 1.4Mpps 580kpps 12Mpps 580kpps > with your patch sending receiving 400kpps 400kpps - 0% 600kpps 600kpps - 0% 800kpps 680kpps - 1.4% 1Mpps 600kpps - 1.6% 1.4Mpps 560kpps - 3.4% 12Mpps 560kpps - 3.4% So, which variant should I commit? - do the check unconditionally - add some flag so that ether_input() does the check for vio(4) unconditionally - do the check in vio(4) mpi fixed a similar bug in trunk(4) in July, but that does not use ether_input() and would not profit from the flag approach. > > Maybe you should show the diff which does the check in the driver > > itself, then then it will be easier to see the performance cost. Right > > now I can't perceive the problem. > > It's not very expensive. But it's more code duplication than I like: > > diff --git sys/dev/pci/if_vio.c sys/dev/pci/if_vio.c > --- sys/dev/pci/if_vio.c > +++ sys/dev/pci/if_vio.c > @@ -1005,6 +1005,9 @@ vio_rxeof(struct vio_softc *sc) > int r = 0; > int slot, len, bufs_left; > struct virtio_net_hdr *hdr; > + int promisc = (ifp->if_flags & IFF_PROMISC); > + u_int8_t *enaddr = sc->sc_ac.ac_enaddr; > + struct ether_header *eh; > > while (virtio_dequeue(vsc, vq, , ) == 0) { > r = 1; > @@ -1035,10 +1038,33 @@ vio_rxeof(struct vio_softc *sc) > bufs_left--; > } > > - if (bufs_left == 0) { > - ml_enqueue(, m0); > - m0 = NULL; > + if (bufs_left > 0) > + continue; > + > + /* > + * Unfortunately, mac filtering is only best effort > + * in virtio-net. Unwanted packets may still arrive. > + * If we are not promiscous, we have to check if the > + * packet is actually for us. > + */ > + if (!promisc) { > + m0 = m_pullup(m0, ETHER_HDR_LEN); > + /* > + * no need to check m0 == NULL, we know m0 has enough > + * space > + */ > + > + eh = mtod(m0, struct ether_header *); > + if (!ETHER_IS_MULTICAST(eh->ether_dhost) && > + memcmp(enaddr, eh->ether_dhost, ETHER_ADDR_LEN) != > 0) { > + m_freem(m0); > + m0 = NULL; > + continue; > + } > } > + > + ml_enqueue(, m0); > + m0 = NULL; > } > if (m0 != NULL) { > DBGPRINT("expected %d buffers, got %d", (int)hdr->num_buffers, > >