from:"Stefan Fritsch"

Re: polling SSL kerberos and srp support

2014-04-29 Thread Stefan Fritsch

Am Montag, 28. April 2014, 21:40:30 schrieb Ted Unangst:
 Also note that I'm not really interested in rumors or whispers. You
 don't need to tell me that it's possible somebody else uses
 Kerberos. I know it's possible, that's why I'm asking. I'd like to
 know who.

One data point: Apache HTTPD 2.4 supports SRP.

Re: 9p

2014-05-30 Thread Stefan Fritsch

On Thursday 29 May 2014 19:17:25, M Farkas-Dyck wrote:
 On 29/05/2014, Ted Unangst t...@tedunangst.com wrote:
  The first question is why not use fuse? I think it's better to
  have one userland filesystem interface than two.
 
 We already have 2: fuse, nfs.
 
 • 9p can operate over arbitrary transports, such as virtio and tcp.
 Fuse can't.
 • To my knowledge one can't have root as fuse.
 • Fuse requires a copy to/from a memory buffer on read/write. I have
 usage cases in which fuse is the performance-limiting factor.

I agree that 9p support would be nice, especially with virtio support. 
I don't think I will have time to help with it in the near future, 
though.

I have noticed that there is also o9fs [1]. I haven't tried it, and I 
don't know what the plans of its author are. But in case you haven't, 
you may want to look at it and decide if you want to merge your 
projects, cherry-pick a few things, or just ignore it.

Cheers,
Stefan


[1] https://bitbucket.org/iru/o9fs/overview

Paravirtualized optimizations for KVM

2014-07-08 Thread Stefan Fritsch

Hi,

I have been trying to increase fork performance of openbsd/amd64 on KVM. 
It turns out that when I increase the number of CPUs of a VM from 1 to 3, 
a fork+exit micro benchmark is slowed down by a factor of 7.

The main reason for this seems to be a very large number of cross-CPU TLB 
flushes (about 4 per fork+exit). Each IPI causes several VM exits which 
are expensive. To reduce this, I have been trying to use paravirtualized 
interfaces provided by KVM and optimize some other things. These changes 
are mostly activated by a new pseudo device paravirt (which has the 
advantage that one can use UKC to tweak things without recompiling). 
However, some changes will remain if not running on a hypervisor (or 
paravirt is disabled). For example, x86_ipi() will use a pointer to 
dispatch to the appropriate implementation.

Is this the way to go forward? Or would you rather prefer a compile time 
option and maybe ship a bsd.mp.paravirt kernel in addition to bsd+bsd.mp?


The attached patch speeds up the fork+exit micro benchmark by a factor of 
3 on a 3 CPU system. And the time to build a kernel with -j4 on a 4 CPU 
system is also reduced by about 20%:

current:
real1m50.089s
user4m46.240s
sys 1m29.510s

current+paravirt:
real1m29.313s
user4m54.720s
sys 0m45.100s



BTW, why does amd64 use the APTE mapping/unmapping dance in pmap despite 
the memory being available in the direct map area all the time?


Cheers,
Stefan


diff --git a/sys/arch/amd64/amd64/cpu.c b/sys/arch/amd64/amd64/cpu.c
index 88725f7..248ebb8 100644
--- a/sys/arch/amd64/amd64/cpu.c
+++ b/sys/arch/amd64/amd64/cpu.c
@@ -83,6 +83,7 @@
 #include machine/pmap.h
 #include machine/vmparam.h
 #include machine/mpbiosvar.h
+#include machine/paravirtvar.h
 #include machine/pcb.h
 #include machine/specialreg.h
 #include machine/segments.h
@@ -569,6 +570,19 @@ cpu_init(struct cpu_info *ci)
ci-ci_flags |= CPUF_RUNNING;
tlbflushg();
 #endif
+#if NPARAVIRT  0
+   if (kvm_pv_eoi_enabled) {
+   paddr_t pa;
+   ci-ci_kvm_pv_eoi = 0;
+   if (pmap_extract(pmap_kernel(), (vaddr_t)ci-ci_kvm_pv_eoi, 
pa) 
+   ((uint64_t)pa  0x3) == 0) {
+   wrmsr(MSR_KVM_EOI_EN, (1 | (uint64_t)pa) );
+   } else {
+   printf(could not get phys addr for MSR_KVM_EOI_EN, 
disabling pv_eoi\n);
+   kvm_pv_eoi_enabled = 0;
+   }
+   }
+#endif
 }
 
 
diff --git a/sys/arch/amd64/amd64/genassym.cf b/sys/arch/amd64/amd64/genassym.cf
index e13a477..ab20329 100644
--- a/sys/arch/amd64/amd64/genassym.cf
+++ b/sys/arch/amd64/amd64/genassym.cf
@@ -114,6 +114,9 @@ member  CPU_INFO_MUTEX_LEVELci_mutex_level
 endif
 member CPU_INFO_GDTci_gdt
 member CPU_INFO_TSSci_tss
+if NPARAVIRT  0
+member CPU_INFO_KVM_PV_EOI ci_kvm_pv_eoi
+endif
 
 struct intrsource
 member is_recurse
diff --git a/sys/arch/amd64/amd64/lapic.c b/sys/arch/amd64/amd64/lapic.c
index d09e3fc..857af4b 100644
--- a/sys/arch/amd64/amd64/lapic.c
+++ b/sys/arch/amd64/amd64/lapic.c
@@ -45,6 +45,7 @@
 #include machine/pmap.h
 #include machine/vmparam.h
 #include machine/mpbiosvar.h
+#include machine/paravirtvar.h
 #include machine/pcb.h
 #include machine/specialreg.h
 #include machine/segments.h
@@ -235,20 +236,42 @@ lapic_boot_init(paddr_t lapic_base)
lapic_map(lapic_base);
 
 #ifdef MULTIPROCESSOR
-   idt_allocmap[LAPIC_IPI_VECTOR] = 1;
-   idt_vec_set(LAPIC_IPI_VECTOR, Xintr_lapic_ipi);
-   idt_allocmap[LAPIC_IPI_INVLTLB] = 1;
-   idt_vec_set(LAPIC_IPI_INVLTLB, Xipi_invltlb);
-   idt_allocmap[LAPIC_IPI_INVLPG] = 1;
-   idt_vec_set(LAPIC_IPI_INVLPG, Xipi_invlpg);
-   idt_allocmap[LAPIC_IPI_INVLRANGE] = 1;
-   idt_vec_set(LAPIC_IPI_INVLRANGE, Xipi_invlrange);
+#if NPARAVIRT  0
+   if (kvm_pv_eoi_enabled) {
+   idt_allocmap[LAPIC_IPI_VECTOR] = 1;
+   idt_vec_set(LAPIC_IPI_VECTOR, Xintr_lapic_ipi_kvm_pv_eoi);
+   idt_allocmap[LAPIC_IPI_INVLTLB] = 1;
+   idt_vec_set(LAPIC_IPI_INVLTLB, Xipi_invltlb_kvm_pv_eoi);
+   idt_allocmap[LAPIC_IPI_INVLPG] = 1;
+   idt_vec_set(LAPIC_IPI_INVLPG, Xipi_invlpg_kvm_pv_eoi);
+   idt_allocmap[LAPIC_IPI_INVLRANGE] = 1;
+   idt_vec_set(LAPIC_IPI_INVLRANGE, Xipi_invlrange_kvm_pv_eoi);
+   }
+   else
+#endif
+   {
+   idt_allocmap[LAPIC_IPI_VECTOR] = 1;
+   idt_vec_set(LAPIC_IPI_VECTOR, Xintr_lapic_ipi);
+   idt_allocmap[LAPIC_IPI_INVLTLB] = 1;
+   idt_vec_set(LAPIC_IPI_INVLTLB, Xipi_invltlb);
+   idt_allocmap[LAPIC_IPI_INVLPG] = 1;
+   idt_vec_set(LAPIC_IPI_INVLPG, Xipi_invlpg);
+   idt_allocmap[LAPIC_IPI_INVLRANGE] = 1;
+   idt_vec_set(LAPIC_IPI_INVLRANGE, Xipi_invlrange);
+   }
 #endif
idt_allocmap[LAPIC_SPURIOUS_VECTOR] = 1;

Re: Paravirtualized optimizations for KVM

2014-07-09 Thread Stefan Fritsch

On Tuesday 08 July 2014 23:53:21, Mark Kettenis wrote:
 Are these paravirtualization APIs stable?  Are they (properly)
 documented somewhere?

Mostly. So far, I am using three things:

1) the paravirtualized EOI. Documented in 
Documentation/virtual/kvm/msr.txt in linux source.
2) the MSR to write to the lapic ICR register. This is a Hyper-V 
interface that is also implemented by KVM. It's documented in the 
Microsoft Hyper-V docs (don't have a pointer right now).
3) the fact that doing IPIs does not require waiting for the BUSY bit 
to clear in ICR. This is really an implementation detail in KVM. 
Unless there is some way to detect this, it's not something that one 
can depend on by default but it could be enabled by some UKC flag with 
default off.

A different approach would be to add support for x2apic mode, which 
would take care of 2)+3). But this cannot be mixed with normal 
accesses to the memory mapped apic registers (xapic mode) and would 
therefore be a lot more intrusive to implement.

 If we're serious about supporting OpenBSD on (KVM) hypervisors,
 something like this makes sense.  We tend to try and have a single
 kernel that runs on the widest range of hardware that is possible.
 For example the OpenBSD/sparc64 kernel runs on both sun4u and sun4v
 hardware, and the sun4v platforms has written paravirtualization all
 over it.  There I successfully made use of code patching
 techniques. That might help on x86 as well.

Yes, code patching may be useful. I haven't noticed it used in openbsd 
before, but I will take a look at sparc64.

 Can't say I'm happy with making the interrupt handling code even
 more complicated though...

Do you think that putting all lapic operations as function into a 
apic_ops struct would be preferable? This would make the code much 
easier to read/maintain at the cost of some indirection and a few 
function calls in interrupt paths.

Re: [portable] OpenPGP signatures on release checksums (#12)

2014-07-20 Thread Stefan Fritsch

On Monday 14 July 2014 12:45:35, Bob Beck wrote:
 $ wc -l *.c
   29 crypto_api.c
  143 mod_ed25519.c
  327 mod_ge25519.c
  806 signify.c
 1305 total
 
 Signify is 1305 *lines* of C code. and it's included in our
 development platform. It is not that difficult to install, and
 if you can't install it, you could always run OpenBSD in a vm to
 verify a signature, it comes with openbsd.

Signify uses some openssh .c files:

$ wc -l *.c *.data
29 crypto_api.c
   335 fe25519.c
   143 mod_ed25519.c
   327 mod_ge25519.c
   306 sc25519.c
   806 signify.c
   265 smult_curve25519_ref.c
   858 ge25519_base.data
  3069 total

And it uses quite a few openbsd specific functions which makes 
compiling it on non-openbsd annoying. Because of the coupling to the 
openssh source, maybe it would make sense to include it in the openssh 
portable release?

Re: Paravirtualized optimizations for KVM

2014-08-17 Thread Stefan Fritsch

On Saturday 16 August 2014 23:51:46, Philip Guenther wrote:
  Yes, code patching may be useful. I haven't noticed it used in
  openbsd before, but I will take a look at sparc64.
 
 Code patching is used currently on i386 and amd64 for the SMAP
 support. Grep for _copyin_stac, for example.

Thanks. I have already noticed that.

Recently, I have however worked on improving pmap instead. Basically 
instead of making IPIs faster on KVM, do much fewer IPIs. I will write 
about that in a new thread, though.

For paravirtualization, it seems it is not always obvious which 
features to use. If the host machine is very new and has certain VMX 
hardware features (e.g. Virtual interrupt delivery and APIC 
register emulation), some of paravirtualizaion techniques that have 
significant positive effect on older hardware may instead cause a 
slowdown. I am not sure that KVM always provides enough information to 
make the best choice automatically. I fear this will require quite a 
few knobs for tuning.

Improving amd64 pmap for MP

2014-08-17 Thread Stefan Fritsch

Hi,

I found some patch by Art from 2008 that removes the APTE stuff.  The 
patch was for i386 but I have adapted it to amd64.

Instead of mapping the ptes of an inactive pmap in the APTE range, and 
then doing an unconditional remote TLB-flush on all CPUs, we just switch 
to the other pmap locally. This only causes a local TLB-flush.  I have 
also reordered some stuff so that we will usually send the 
TLB-shootdown-IPIs first, then do some local stuff, and then wait for the 
remote TLB-shootdown to finish.

The patch can definitely be optimized further (e.g. use the direct mapping 
in some or all cases). But maybe people want to take a look already. I 
would be interested in results on AMD hardware.

On some real Intel hardware with 4 cores, I get (with an older version of 
the patch, but I don't think it makes much difference):

- no significant difference for kernel builds
- doing 'make index' in ports on mfs:
  w/o patch:  6m29.27s real 1m21.96s user 3m18.99s system
  with patch: 6m12.54s real 1m23.65s user 3m3.43s system
- forktest n=4 (fork+exit micro benchmark)
  w/o patch:  0m8.75s real 0m0.36s user 0m7.51s system
  with patch: 0m7.43s real 0m0.18s user 0m7.79s system

I expect the speed up to be larger with more CPUs.

On KVM the speed up is much larger than on bare metal, in the same range 
as with the paravirt patches (of course doing both is even better). I 
don't have any exact numbers right now, though. We have been running the 
patch on one of our build-VMs with 6 CPUs for over a week without any 
problems.

Cheers,
Stefan

= forktest.c == 
#include stdio.h
#include sys/types.h
#include unistd.h
#include sys/wait.h
#include stdlib.h


int main(int ac, char **av)
{
int n = 4;
int active = 0;
while (n--  0) {
int status;
while (active  100) {
while (wait4(WAIT_ANY, status, WNOHANG, NULL)  0)
active--;
if (active  100)
usleep(5000);
}
pid_t pid = fork();
if (pid  0) {
perror(fork failed);
exit(0);
} else if (pid == 0) {
// child
exit(0);
} else {
active++;
}
}
return 0;
}
=== 

And here is the diff:


diff --git a/sys/arch/amd64/amd64/pmap.c b/sys/arch/amd64/amd64/pmap.c
index 40b64e5..293dfc9 100644
--- a/sys/arch/amd64/amd64/pmap.c
+++ b/sys/arch/amd64/amd64/pmap.c
@@ -203,7 +203,6 @@ long nkptp[] = NKPTP_INITIALIZER;
 long nkptpmax[] = NKPTPMAX_INITIALIZER;
 long nbpd[] = NBPD_INITIALIZER;
 pd_entry_t *normal_pdes[] = PDES_INITIALIZER;
-pd_entry_t *alternate_pdes[] = APDES_INITIALIZER;
 
 /* int nkpde = NKPTP; */
 
@@ -289,7 +288,7 @@ void pmap_free_ptp(struct pmap *, struct vm_page *,
 vaddr_t, pt_entry_t *, pd_entry_t **, struct pg_to_free *);
 void pmap_freepage(struct pmap *, struct vm_page *, int, struct pg_to_free *);
 static boolean_t pmap_is_active(struct pmap *, int);
-void pmap_map_ptes(struct pmap *, pt_entry_t **, pd_entry_t ***);
+void pmap_map_ptes(struct pmap *, pt_entry_t **, pd_entry_t ***, paddr_t *);
 struct pv_entry *pmap_remove_pv(struct vm_page *, struct pmap *, vaddr_t);
 void pmap_do_remove(struct pmap *, vaddr_t, vaddr_t, int);
 boolean_t pmap_remove_pte(struct pmap *, struct vm_page *, pt_entry_t *,
@@ -299,14 +298,23 @@ void pmap_remove_ptes(struct pmap *, struct vm_page *, 
vaddr_t,
 #define PMAP_REMOVE_ALL0   /* remove all mappings */
 #define PMAP_REMOVE_SKIPWIRED  1   /* skip wired mappings */
 
-void pmap_unmap_ptes(struct pmap *);
+void pmap_unmap_ptes(struct pmap *, paddr_t);
 boolean_t pmap_get_physpage(vaddr_t, int, paddr_t *);
 boolean_t pmap_pdes_valid(vaddr_t, pd_entry_t **, pd_entry_t *);
 void pmap_alloc_level(pd_entry_t **, vaddr_t, int, long *);
-void pmap_apte_flush(struct pmap *pmap);
 
 void pmap_sync_flags_pte(struct vm_page *, u_long);
 
+void   pmap_tlb_shootpage(struct pmap *, vaddr_t, int);
+void   pmap_tlb_shootrange(struct pmap *, vaddr_t, vaddr_t, int);
+void   pmap_tlb_shoottlb(struct pmap *, int);
+#ifdef MULTIPROCESSOR
+void   pmap_tlb_shootwait(void);
+#else
+#definepmap_tlb_shootwait()
+#endif
+
+
 /*
  * p m a p   i n l i n e   h e l p e r   f u n c t i o n s
  */
@@ -349,55 +357,44 @@ pmap_sync_flags_pte(struct vm_page *pg, u_long pte)
}
 }
 
-void
-pmap_apte_flush(struct pmap *pmap)
-{
-   pmap_tlb_shoottlb();
-   pmap_tlb_shootwait();
-}
-
 /*
  * pmap_map_ptes: map a pmap's PTEs into KVM
- *
- * = we lock enough pmaps to keep things locked in
- * = must be undone with pmap_unmap_ptes before returning
  */
 
 void
-pmap_map_ptes(struct pmap *pmap, pt_entry_t **ptepp,

minphys woes

2014-08-28 Thread Stefan Fritsch

Hi,

while debugging some problems on a virtio host that supports only few dma 
segments per request, I noticed that the minphys function is not called in 
all code paths. It is only used for physio but not for other (cached) disk 
operations. An example trace is below.

Am I missing something or is this as broken as I think it is? I thought 
the minphys function is a way for a driver to say that it does not support 
larger transfers and that transfers are split accordingly. What's the 
purpose of the minphys mechanism if it does not affect all code paths?

Cheers,
Stefan


panic() at 0x1027bcf5 panic+0x1a5/0x1061006f0
vioblk_scsi_cmd() at 0x103ecccf vioblk_scsi_cmd+0x8ff/0x106100770
sdstart() at 0x103bc4d6 sdstart+0x176/0x1061007b0
scsi_ioh_runqueue() at 0x103b3c05 scsi_ioh_runqueue+0x55/0x1061007e0
scsi_xsh_runqueue() at 0x103b3d72 scsi_xsh_runqueue+0x132/0x106100800
sdstrategy() at 0x103bc6bf sdstrategy+0x10f/0x106100830
spec_strategy() at 0x102a8b3b spec_strategy+0x5b/0x106100850
VOP_STRATEGY() at 0x102a75be VOP_STRATEGY+0x2e/0x106100870
bio_doread() at 0x102991fc bio_doread+0xbc/0x1061008a0
bread() at 0x102992c3 bread+0x13/0x1061008c0
fillinusemap() at 0x102ad6d9 fillinusemap+0xb9/0x106100920
msdosfs_mountfs() at 0x102af78d msdosfs_mountfs+0x59d/0x106100990
msdosfs_mount() at 0x102afcf6 msdosfs_mount+0x3e6/0x106100b70
sys_mount() at 0x102a6a3c sys_mount+0x36c/0x106100d90

Re: minphys woes

2014-08-28 Thread Stefan Fritsch

On Friday 29 August 2014 10:34:20, David Gwynne wrote:
 are you expecting minphys to appear in that backtrace?


No. In that trace minphys should have limited the transfer size and I 
had put an KASSERT() into vioblk_scsi_cmd() which verified that xs-
datalen is not larger than that limit. A call to bread() causes the 
KASSERT to trigger.

Re: minphys woes

2014-08-29 Thread Stefan Fritsch

No, also happens on ffs. If the root FS is on an affected device, it looks 
like this:


root on sd0a (9e7002f8647a2403.a) swap on sd0b dump on sd0b
clock: unknown CMOS layout
panic: kernel diagnostic assertion xs-datalen = MINPHYS failed: file 
../../../../dev/pci/vioblk.c, line 405
Stopped at  Debugger+0x9:   leave
RUN AT LEAST 'trace' AND 'ps' AND INCLUDE OUTPUT WHEN REPORTING THIS 
PANIC!
IF RUNNING SMP, USE 'mach ddbcpu #' AND 'trace' ON OTHER PROCESSORS, 
TOO.
DO NOT EVEN BOTHER REPORTING THIS WITHOUT INCLUDING THAT INFORMATION!
ddb{0} trace  
Debugger() at Debugger+0x9
panic() at panic+0xfe
__assert() at __assert+0x25
vioblk_scsi_cmd() at vioblk_scsi_cmd+0x565
scsi_xs_exec() at scsi_xs_exec+0x35
sdstart() at sdstart+0x16e
scsi_iopool_run() at scsi_iopool_run+0x5d
scsi_xsh_runqueue() at scsi_xsh_runqueue+0x13d
scsi_xsh_add() at scsi_xsh_add+0x98
sdstrategy() at sdstrategy+0x102
spec_strategy() at spec_strategy+0x53
ufs_strategy() at ufs_strategy+0x82
VOP_STRATEGY() at VOP_STRATEGY+0x3b
bread_cluster() at bread_cluster+0x2a4
ffs_read() at ffs_read+0x250
VOP_READ() at VOP_READ+0x3f
uvn_io() at uvn_io+0x2e2
uvn_get() at uvn_get+0x1f7
uvm_fault() at uvm_fault+0xdf5
trap() at trap+0x62f
--- trap (number 6) ---
end of kernel
end trace frame: 0x7f7dd0c0, count: -20
acpi_pdirpa+0x43ae00:




Reproducer patch is below. I am sure it can be reproduced with any 
other disk driver, but with ahci a reproducer would be a bit more 
complicated because of the atascsi layer.

Maybe VOP_STRATEGY() should do a similar minphys+loop thing as physio() 
does?



--- a/sys/dev/pci/vioblk.c
+++ b/sys/dev/pci/vioblk.c
@@ -63,6 +63,7 @@
 #include scsi/scsi_disk.h
 #include scsi/scsiconf.h
 
+#define MINPHYS(20*1024)
 #define VIOBLK_DONE-1
 
 struct virtio_feature_name vioblk_feature_names[] = {
@@ -157,6 +158,8 @@ vioblk_minphys(struct buf *bp, struct scsi_link *sl)
struct vioblk_softc *sc = sl-adapter_softc;
if (bp-b_bcount  sc-sc_xfer_max)
bp-b_bcount = sc-sc_xfer_max;
+   if (bp-b_bcount  MINPHYS)
+   bp-b_bcount = MINPHYS;
 }
 
 void
@@ -399,6 +402,8 @@ vioblk_scsi_cmd(struct scsi_xfer *xs)
int timeout;
int slot, ret, nsegs;
 
+   KASSERT(xs-datalen = MINPHYS);
+
s = splbio();
ret = virtio_enqueue_prep(vq, slot);
if (ret) {






On Fri, 29 Aug 2014, David Gwynne wrote:

 is it something msdos isnt doing that ffs does?
 
 On 29 Aug 2014, at 15:55, Stefan Fritsch s...@sfritsch.de wrote:
 
  On Friday 29 August 2014 10:34:20, David Gwynne wrote:
  are you expecting minphys to appear in that backtrace?
  
  
  No. In that trace minphys should have limited the transfer size and I 
  had put an KASSERT() into vioblk_scsi_cmd() which verified that xs-
  datalen is not larger than that limit. A call to bread() causes the 
  KASSERT to trigger.

Re: minphys woes

2014-08-29 Thread Stefan Fritsch

On Fri, 29 Aug 2014, Miod Vallat wrote:

  +++ b/sys/dev/pci/vioblk.c
  @@ -63,6 +63,7 @@
   #include scsi/scsi_disk.h
   #include scsi/scsiconf.h
   
  +#define MINPHYS(20*1024)
 
 This makes no sense. If you want to override MINPHYS, you need to
 override for the whole kernel, not only here.

MINPHYS is just a name I have chosen. It is not used anywhere else in the 
kernel. You are confusing this with MAXPHYS.

 
  @@ -157,6 +158,8 @@ vioblk_minphys(struct buf *bp, struct scsi_link *sl)
  struct vioblk_softc *sc = sl-adapter_softc;
  if (bp-b_bcount  sc-sc_xfer_max)
  bp-b_bcount = sc-sc_xfer_max;
  +   if (bp-b_bcount  MINPHYS)
  +   bp-b_bcount = MINPHYS;
 
 This is useless. sc-sc_xfer_max is computed so that it will always be
 = MINPHYS.

sc-sc_xfer_max is computed according to the host's capabilities. What I 
want to simulate with this diff is a host adapter that can only cope with 
transfers  64k == MAXPHYS.

Re: minphys woes

2014-08-29 Thread Stefan Fritsch

On Fri, 29 Aug 2014, Miod Vallat wrote:
  sc-sc_xfer_max is computed according to the host's capabilities. What I 
  want to simulate with this diff is a host adapter that can only cope with 
  transfers  64k == MAXPHYS.
 
 Back to your original problem, you might want to print the sc_link
 struct as well the scsi_adapter struct it points to, when you detect a
 transfer larger than MAXPHYS. It has likely been overriden or reset to
 NULL by mistake at some point.

OK, I will try that. Will take a bit until I have time, though.

But I have also read the code and I could not find a place in the path 
from bread() to the scsi adapter cmd function where the minphys function 
is called.

Re: minphys woes

2014-08-29 Thread Stefan Fritsch

On Fri, 29 Aug 2014, Mike Belopuhov wrote:

 On 29 August 2014 11:26, Stefan Fritsch s...@sfritsch.de wrote:
  On Fri, 29 Aug 2014, Miod Vallat wrote:
   sc-sc_xfer_max is computed according to the host's capabilities. What I
   want to simulate with this diff is a host adapter that can only cope with
   transfers  64k == MAXPHYS.
 
  Back to your original problem, you might want to print the sc_link
  struct as well the scsi_adapter struct it points to, when you detect a
  transfer larger than MAXPHYS. It has likely been overriden or reset to
  NULL by mistake at some point.
 
  OK, I will try that. Will take a bit until I have time, though.

That's not the problem. I have added an additional

 KASSERT(xs-sc_link-adapter-scsi_minphys == vioblk_minphys)

before the other KASSERT and it did not trigger.

  But I have also read the code and I could not find a place in the path
  from bread() to the scsi adapter cmd function where the minphys function
  is called.
 
 
 correct me if i'm wrong, but what happens is that bread being a block
 read reads up to MAXBSIZE which is conveniently set to 64k and you can't
 create a filesystem with a larger block size.
 
 physio (raw device io) however doesn't go through bread and need to know
 how split the provided buffer in separate transactions hence minphys.

Yes, that seems to be what happens. But if every adapter needs to support 
transfers of MAXBSIZE == MAXPHYS anyway, there would be no need for the 
adapter to be able to override the default minphys function with its own. 
And adapters that only support smaller transfers would need to have logic 
in their driver to be able to split the transfer into smaller chunks.

I think it makes more sense to have that logic in one place to be used by 
all drivers.

Re: minphys woes

2014-09-01 Thread Stefan Fritsch

On Mon, 1 Sep 2014, Mike Belopuhov wrote:

 On 29 August 2014 22:39, Stefan Fritsch s...@sfritsch.de wrote:
  Yes, that seems to be what happens. But if every adapter needs to support
  transfers of MAXBSIZE == MAXPHYS anyway, there would be no need for the
  adapter to be able to override the default minphys function with its own.
  And adapters that only support smaller transfers would need to have logic
  in their driver to be able to split the transfer into smaller chunks.
 
 
 i believe that if you start digging you realise that (at least at some point)
 the least common denominator is (or was) 64k meaning that even the shittiest
 controller on vax can do 64k.  which means that we don't have code for a
 filesystem or buffercache to probe the controller for a supported transfer
 size.

That's possible. But what is really strange is, why does openbsd then have 
an infrastructure to set different max transfer sizes for physio on a 
per-adapter basis? This makes no sense. Either the drivers have to support 
64k transfers, in which case most of the minphys infrastructure is 
useless, or they don't have to. In the latter case the minphys 
infrastructure would need to be used in all code paths.

  I think it makes more sense to have that logic in one place to be used by
  all drivers.
 
 perhaps, but unless the filesystem will issue breads of larger blocks the
 only benefit would be physio itself which doesn't really justify the change.

You lost me there. Why should this depend on how many requests some 
filesystem makes with larger blocks? If there is the possibility that 
filesystems do such requests, someone needs to do the splitting of the 
requests. The question is who. The driver or the block layer.


BTW, for my use case I don't actually want to limit the block size, but 
rather the number of DMA segments. But the most reasonable way to do that 
seemed to set minphys to max_segments * pagesize. If we change how these 
things work, we could take the number of DMA segments into account.

Re: minphys woes

2014-09-01 Thread Stefan Fritsch

On Mon, 1 Sep 2014, David Gwynne wrote:
  Yes, that seems to be what happens. But if every adapter needs to support 
  transfers of MAXBSIZE == MAXPHYS anyway, there would be no need for the 
  adapter to be able to override the default minphys function with its own. 
  And adapters that only support smaller transfers would need to have logic 
  in their driver to be able to split the transfer into smaller chunks.
  
  I think it makes more sense to have that logic in one place to be used by 
  all drivers.
 
 this has blown my mind. how do the stupid ata cf card things that only 
 support single sector IO work?

It seems _wdc_ata_bio_start in ata_wdc.c does the splitting in this case.

Re: minphys woes

2014-09-02 Thread Stefan Fritsch

On Monday 01 September 2014 22:24:16, David Gwynne wrote:
  i haven't found a controller that does less than MAXPHYS.
  perhaps they meant to improve the situation but stopped short.
 
 if we wanted to raise MAXPHYS, we'd have to support older
 controllers that cant do greater than 64k with some mechanism.

So the summary of this thread is that

- drivers have to support 64K transfers
- the minphys mechanism is useless at the moment

Not quite what I hoped, but ok. Do you know a good place where this 
should be documented?

Re: minphys woes

2014-09-02 Thread Stefan Fritsch

On Tuesday 02 September 2014 13:15:19, Philip Guenther wrote:
 On Tue, Sep 2, 2014 at 12:51 PM, Stefan Fritsch s...@sfritsch.de 
wrote:
  On Monday 01 September 2014 22:24:16, David Gwynne wrote:
i haven't found a controller that does less than MAXPHYS.
perhaps they meant to improve the situation but stopped short.
   
   if we wanted to raise MAXPHYS, we'd have to support older
   controllers that cant do greater than 64k with some mechanism.
  
  So the summary of this thread is that
  
  - drivers have to support 64K transfers
  - the minphys mechanism is useless at the moment
 
 Useless?  It works just fine for physio, which is what it was
 designed for, no?
  Not quite what I hoped, but ok. Do you know a good place where
  this
  should be documented?
 
 minphys is currently documented on physio(9)


From physio(9):

 minphys
 A device specific routine called to determine the maximum
 transfer size that the device's strategy routine can
 handle.

Since we have seen that the driver must be able to handle 64k blocks 
in any case, the fact that minphys is device specific is useless, 
isn't it?

Re: minphys woes

2014-09-03 Thread Stefan Fritsch

On Tuesday 02 September 2014 15:22:16, Philip Guenther wrote:
  From physio(9):
   minphys
   A device specific routine called to determine the
 maximum transfer size that the device's strategy routine can
 handle.
 
  Since we have seen that the driver must be able to handle 64k
  blocks in any case, the fact that minphys is device specific is
  useless, isn't it?
 
 physio() is used by character device access.  Looks to me like
 sdminphys() will change the chunking behavior of this:
 dd if=/dev/zero of=/dev/rsd0a bs=100M count=1
 
 depending on whether sd0 is a SCSI-I device, no?

Yes, but that does not make it any less useless. File systems will 
call the very same strategy routine and expect it to deal with 64K 
blocks. The statement in the man page gives the misleading impression 
that the strategy routine could be used to avoid that the strategy 
routine has to handle 64K blocks, which is not true.

So, maybe the minphys mechanism is useful for block devices that are 
not used for file systems. Are there any such devices?

Maybe we could add a sentence to the man page like following?

Beware that a block device's strategy routine that is used for file 
systems must always be able to accept transfers sizes up to MAXPHYS.

Re: minphys woes

2014-09-05 Thread Stefan Fritsch

On Wednesday 03 September 2014 15:23:13, Philip Guenther wrote:
 IMO, the problem that you're hitting with your vioblk device isn't a
 problem with MAXPHYS, physio(), or minphys, but rather with
 MAXBSIZE. Back in the 1990's, the fact that you want to be able to
 configure a device with minphys 20kB on your arch means that the MD
 MAXBSIZE define for your arch should be 20kB.
 
 If MAXBSIZE could be set/limited on a per-device basis (i.e., the
 code using it queried the device) then your problem would be
 resolved, no?

Thank you very much for this write up. It was very enlightening and I 
agree that the problem I encountered is that MAXBSIZE is not a per-
device value.

Another thing this has made clear to me is that I don't want to deal 
with openbsd's lower level block layer. I will just close the lid and 
hope the smell goes away.

Cheers,
Stefan

Re: sparc64: fledgling QEMU support

2014-09-09 Thread Stefan Fritsch

On Tuesday 09 September 2014 21:27:37, Mark Cave-Ayland wrote:
  Could the PCI virtio stuff be adapted to non-x86 architectures?
 
 QEMU already has a virtio PCI device that can be plugged into 
 qemu-system-sparc64 (see Artyom's blog at 
 http://tyom.blogspot.co.uk/2013/03/debiansparc64-wheezy-under-qemu-h
 ow-to.html  for an example of how to do this with Linux).
 
 This could be an amusing project; in theory it would be possible to
 work  on an x86 laptop to test/debug big-endian virtio support with
 the help of QEMU's virtual hardware.  You can do this by plugging
 in a standard virtual cdrom/hd along with an additional virtio
 hd/nic, booting from the standard devices and then testing the
 drivers accessing the extra devices as required.

From the openbsd side, virtio should work with sparc. But since nobody 
has tested it on big endian so far, there will be bugs. And it needs 
to be enabled in the config, of course.

If you look at generic PCI network adapters, I would recommend trying 
e1000 if possible. Last time I tried it (on x86), qemu's rtl8139 
emulation did not work with openbsd's driver, and I think there were 
some problems with the ne2k emulation, too. Or maybe ne2k just had 
terrible performance.

Cheers,
Stefan

Speeding up openbsd on amd64 MP

2014-09-14 Thread Stefan Fritsch

Hi,

I have improved my amd64 pmap patches a bit and finally got around to 
doing some benchmarks with them.

Systems:
- core-i7 is an Intel Core i7-860 with 4 cores / 8 threads
- core2duo is an Intel Core 2 Duo T8100 with 2 cores
- KVM is a VM with 4 vcpus running on qemu 2.1/Linux 3.16 on the
  above core-i7

SP/MP means single/multi processor kernel

Benchmarks:
kernel: compilation of the kernel on tmpfs
makeindex: 'make index' in /usr/ports on tmpfs
forktest: the fork/exit micro benchmark I have posted at
http://permalink.gmane.org/gmane.os.openbsd.tech/38172

Speed up is wall clock time without patch divided by time with patch.

system   kernel benchmark   speed up
core-i7  MP kernel -j8  1.02
core-i7  MP makeindex   1.11
core-i7  MP forktest1.52
core-i7  SP kernel -j1  1.00
core-i7  SP forktest1.03
core2duo MP kernel -j2  1.17
core2duo MP makeindex   1.12
core2duo MP forktest1.11
KVM  MP kernel -j4  1.20
KVM  MP forktest3.26

As you see there is no test case where the new pmap is slower than the old 
one. On bare metal, there is some speed up, depending on system and 
workload. On virtualization, the speed difference very significant. I 
didn't run the makeindex test on KVM because it is so slow with the old 
pmap.

I would like people to test the diffs on other machines. In particular on 
non-Intel CPUs. The only AMD system I could get hold of did not run 
reliably with openbsd even without the pmap diff.


I am posting the patch in two parts as separate mails. The first part 
contains some simple cleanups and the second part does the actual 
optimizations. Compared to the patch I have posted on August 18th, the 
current patch uses the direct mapping in some cases to avoid more TLB 
flushes (some bits of this come from a different old patch by art@ from 
2005). Also, many comments have been updated.

Apart from testing, I would of course also appreciate reviews. Compared to 
art's patch from 2005 which tried to use the direct mapping for 
everything, the current patch is much less intrusive. Therefore I hope 
that it is easier to review/debug than his patch, which unfortunately 
never got integrated.

Cheers,
Stefan

Speeding up openbsd on amd64 MP - patch 1/2

2014-09-14 Thread Stefan Fritsch

Simple cleanups for amd64 pmap

- use __func__ in panics/printfs (fixes some out of sync function names)

- tell the compiler that code paths where we print diagnostics are 
unlikely

- use pmap_valid_entry() in some places

- KERNSPACE is not used anywhere


diff --git a/sys/arch/amd64/amd64/pmap.c b/sys/arch/amd64/amd64/pmap.c
index cd395e1..05cbd3e 100644
--- a/sys/arch/amd64/amd64/pmap.c
+++ b/sys/arch/amd64/amd64/pmap.c
@@ -436,7 +436,7 @@ pmap_kenter_pa(vaddr_t va, paddr_t pa, vm_prot_t prot)
 #ifdef LARGEPAGES
/* XXX For now... */
if (opte  PG_PS)
-   panic(pmap_kenter_pa: PG_PS);
+   panic(%s: PG_PS, __func__);
 #endif
if (pmap_valid_entry(opte)) {
if (pa  PMAP_NOCACHE  (opte  PG_N) == 0)
@@ -899,7 +899,7 @@ pmap_get_ptp(struct pmap *pmap, vaddr_t va, pd_entry_t 
**pdes)
pptp = pmap_find_ptp(pmap, va, ppa, i);
 #ifdef DIAGNOSTIC
if (pptp == NULL)
-   panic(pde page disappeared);
+   panic(%s: pde page disappeared, __func__);
 #endif
pptp-wire_count++;
}
@@ -915,7 +915,7 @@ pmap_get_ptp(struct pmap *pmap, vaddr_t va, pd_entry_t 
**pdes)
if (ptp == NULL) {
printf(va %lx ppa %lx\n, (unsigned long)va,
(unsigned long)ppa);
-   panic(pmap_get_ptp: unmanaged user PTP);
+   panic(%s: unmanaged user PTP, __func__);
}
 #endif
}
@@ -1031,8 +1031,8 @@ pmap_destroy(struct pmap *pmap)
 */
 
 #ifdef DIAGNOSTIC
-   if (pmap-pm_cpus != 0)
-   printf(pmap_destroy: pmap %p cpus=0x%llx\n,
+   if (__predict_false(pmap-pm_cpus != 0))
+   printf(%s: pmap %p cpus=0x%llx\n, __func__,
(void *)pmap, pmap-pm_cpus);
 #endif
 
@@ -1132,7 +1132,7 @@ pmap_pdes_valid(vaddr_t va, pd_entry_t **pdes, pd_entry_t 
*lastpde)
for (i = PTP_LEVELS; i  1; i--) {
index = pl_i(va, i);
pde = pdes[i - 2][index];
-   if ((pde  PG_V) == 0)
+   if (!pmap_valid_entry(pde))
return FALSE;
}
if (lastpde != NULL)
@@ -1171,7 +1171,7 @@ pmap_extract(struct pmap *pmap, vaddr_t va, paddr_t *pap)
pte = ptes[pl1_i(va)];
pmap_unmap_ptes(pmap);
 
-   if (__predict_true((pte  PG_V) != 0)) {
+   if (__predict_true(pmap_valid_entry(pte))) {
if (pap != NULL)
*pap = (pte  PG_FRAME) | (va  0xfff);
return (TRUE);
@@ -1321,16 +1321,16 @@ pmap_remove_ptes(struct pmap *pmap, struct vm_page 
*ptp, vaddr_t ptpva,
if ((opte  PG_PVLIST) == 0) {
 #ifdef DIAGNOSTIC
if (pg != NULL)
-   panic(pmap_remove_ptes: managed page without 
- PG_PVLIST for 0x%lx, startva);
+   panic(%s: managed page without PG_PVLIST 
+ for 0x%lx, __func__, startva);
 #endif
continue;
}
 
 #ifdef DIAGNOSTIC
if (pg == NULL)
-   panic(pmap_remove_ptes: unmanaged page marked 
- PG_PVLIST, va = 0x%lx, pa = 0x%lx,
+   panic(%s: unmanaged page marked PG_PVLIST, 
+ va = 0x%lx, pa = 0x%lx, __func__,
  startva, (u_long)(opte  PG_FRAME));
 #endif
 
@@ -1388,17 +1388,16 @@ pmap_remove_pte(struct pmap *pmap, struct vm_page *ptp, 
pt_entry_t *pte,
if ((opte  PG_PVLIST) == 0) {
 #ifdef DIAGNOSTIC
if (pg != NULL)
-   panic(pmap_remove_pte: managed page without 
- PG_PVLIST for 0x%lx, va);
+   panic(%s: managed page without PG_PVLIST for 0x%lx,
+ __func__, va);
 #endif
return(TRUE);
}
 
 #ifdef DIAGNOSTIC
if (pg == NULL)
-   panic(pmap_remove_pte: unmanaged page marked 
-   PG_PVLIST, va = 0x%lx, pa = 0x%lx, va,
-   (u_long)(opte  PG_FRAME));
+   panic(%s: unmanaged page marked PG_PVLIST, va = 0x%lx, 
+ pa = 0x%lx, __func__, va, (u_long)(opte  PG_FRAME));
 #endif
 
/* sync R/M bits */
@@ -1464,8 +1463,8 @@ pmap_do_remove(struct pmap *pmap, vaddr_t sva, vaddr_t 
eva, int flags)
ptp = pmap_find_ptp(pmap, sva, ptppa, 1);
 #ifdef DIAGNOSTIC
if (ptp == NULL)
-   panic(pmap_remove: unmanaged 
- PTP detected);
+   panic(%s: unmanaged PTP detected,
+

Speeding up openbsd on amd64 MP - patch 2/2

2014-09-14 Thread Stefan Fritsch

Optimize pmap on amd64

based on a patch for i386 by Art from 2008 that removes the APTE stuff.  
Some additional bits were taken from a patch by Art for amd64 from 2005.

Instead of mapping the ptes of an inactive pmap in the APTE range, and 
then doing an unconditional remote TLB-flush on all CPUs, we just switch 
to the other pmap locally. This only causes a local TLB-flush.

In cases where we only need to access a single PTE, walk the page tables 
manually in the direct mapping. This saves some more TLB flushes.

I have also reordered some stuff so that we will usually send the 
TLB-shootdown-IPIs first, then do some local stuff, and then wait for the 
remote TLB-shootdown to finish.


diff --git a/sys/arch/amd64/amd64/pmap.c b/sys/arch/amd64/amd64/pmap.c
index 05cbd3e..c137412 100644
--- a/sys/arch/amd64/amd64/pmap.c
+++ b/sys/arch/amd64/amd64/pmap.c
@@ -203,7 +203,6 @@ long nkptp[] = NKPTP_INITIALIZER;
 long nkptpmax[] = NKPTPMAX_INITIALIZER;
 long nbpd[] = NBPD_INITIALIZER;
 pd_entry_t *normal_pdes[] = PDES_INITIALIZER;
-pd_entry_t *alternate_pdes[] = APDES_INITIALIZER;
 
 /* int nkpde = NKPTP; */
 
@@ -285,11 +284,12 @@ void  pmap_enter_pv(struct vm_page *, struct pv_entry *, 
struct pmap *,
 vaddr_t, struct vm_page *);
 struct vm_page *pmap_get_ptp(struct pmap *, vaddr_t, pd_entry_t **);
 struct vm_page *pmap_find_ptp(struct pmap *, vaddr_t, paddr_t, int);
+int pmap_find_pte_direct(struct pmap *pm, vaddr_t va, pt_entry_t **pd, int 
*offs);
 void pmap_free_ptp(struct pmap *, struct vm_page *,
 vaddr_t, pt_entry_t *, pd_entry_t **, struct pg_to_free *);
 void pmap_freepage(struct pmap *, struct vm_page *, int, struct pg_to_free *);
 static boolean_t pmap_is_active(struct pmap *, int);
-void pmap_map_ptes(struct pmap *, pt_entry_t **, pd_entry_t ***);
+void pmap_map_ptes(struct pmap *, pt_entry_t **, pd_entry_t ***, paddr_t *);
 struct pv_entry *pmap_remove_pv(struct vm_page *, struct pmap *, vaddr_t);
 void pmap_do_remove(struct pmap *, vaddr_t, vaddr_t, int);
 boolean_t pmap_remove_pte(struct pmap *, struct vm_page *, pt_entry_t *,
@@ -299,14 +299,23 @@ void pmap_remove_ptes(struct pmap *, struct vm_page *, 
vaddr_t,
 #define PMAP_REMOVE_ALL0   /* remove all mappings */
 #define PMAP_REMOVE_SKIPWIRED  1   /* skip wired mappings */
 
-void pmap_unmap_ptes(struct pmap *);
+void pmap_unmap_ptes(struct pmap *, paddr_t);
 boolean_t pmap_get_physpage(vaddr_t, int, paddr_t *);
 boolean_t pmap_pdes_valid(vaddr_t, pd_entry_t **, pd_entry_t *);
 void pmap_alloc_level(pd_entry_t **, vaddr_t, int, long *);
-void pmap_apte_flush(struct pmap *pmap);
 
 void pmap_sync_flags_pte(struct vm_page *, u_long);
 
+void   pmap_tlb_shootpage(struct pmap *, vaddr_t, int);
+void   pmap_tlb_shootrange(struct pmap *, vaddr_t, vaddr_t, int);
+void   pmap_tlb_shoottlb(struct pmap *, int);
+#ifdef MULTIPROCESSOR
+void   pmap_tlb_shootwait(void);
+#else
+#definepmap_tlb_shootwait()
+#endif
+
+
 /*
  * p m a p   i n l i n e   h e l p e r   f u n c t i o n s
  */
@@ -349,57 +358,75 @@ pmap_sync_flags_pte(struct vm_page *pg, u_long pte)
}
 }
 
-void
-pmap_apte_flush(struct pmap *pmap)
-{
-   pmap_tlb_shoottlb();
-   pmap_tlb_shootwait();
-}
-
 /*
  * pmap_map_ptes: map a pmap's PTEs into KVM
- *
- * = we lock enough pmaps to keep things locked in
- * = must be undone with pmap_unmap_ptes before returning
  */
 
 void
-pmap_map_ptes(struct pmap *pmap, pt_entry_t **ptepp, pd_entry_t ***pdeppp)
+pmap_map_ptes(struct pmap *pmap, pt_entry_t **ptepp, pd_entry_t ***pdeppp, 
paddr_t *save_cr3)
 {
-   pd_entry_t opde, npde;
+   paddr_t cr3 = rcr3();
 
-   /* if curpmap then we are always mapped */
-   if (pmap_is_curpmap(pmap)) {
-   *ptepp = PTE_BASE;
-   *pdeppp = normal_pdes;
-   return;
-   }
+   /* the kernel's pmap is always accessible */
+   if (pmap == pmap_kernel() || pmap-pm_pdirpa == cr3) {
+   *save_cr3 = 0;
+   } else {
+   *save_cr3 = cr3;
 
-   /* need to load a new alternate pt space into curpmap? */
-   opde = *APDP_PDE;
-   if (!pmap_valid_entry(opde) || (opde  PG_FRAME) != pmap-pm_pdirpa) {
-   npde = (pd_entry_t) (pmap-pm_pdirpa | PG_RW | PG_V);
-   *APDP_PDE = npde;
-   if (pmap_valid_entry(opde))
-   pmap_apte_flush(curpcb-pcb_pmap);
+   /*
+* Not sure if we need this, but better be safe.
+* We don't have the current pmap in order to unset its
+* active bit, but this just means that we may receive
+* an unneccessary cross-CPU TLB flush now and then.
+*/
+   x86_atomic_setbits_u64(pmap-pm_cpus, (1ULL  cpu_number()));
+
+   lcr3(pmap-pm_pdirpa);
}
-   *ptepp = APTE_BASE;
-   *pdeppp = alternate_pdes;
+
+   *ptepp = PTE_BASE;
+   *pdeppp = normal_pdes;
+   return;
 }

Re: Speeding up openbsd on amd64 MP

2014-09-14 Thread Stefan Fritsch

On Sun, 14 Sep 2014, Job Snijders wrote:

 On Sun, Sep 14, 2014 at 02:54:54PM +0200, Stefan Fritsch wrote:
  I would like people to test the diffs on other machines. In particular on 
  non-Intel CPUs. The only AMD system I could get hold of did not run 
  reliably with openbsd even without the pmap diff.
 
 I tested MP 'kernel -j 2' on a PC Engines apu1c4 (AMD G series T40E, 1
 GHz dual Bobcat core with 64 bit support) and saw no speedup.

Thanks. But no slowdown either? If you have time, it would be nice to see 
the result for the fork test, which is a lot more sensitive to the pmap 
handling. The source is at 
http://permalink.gmane.org/gmane.os.openbsd.tech/38172

Also, it would be welcome if people tested the patch for stability. I have 
been running it on a build VM for a few weeks, but that is hardly a 
complete stability test.

Re: Speeding up openbsd on amd64 MP

2014-09-14 Thread Stefan Fritsch

On Sunday 14 September 2014 12:13:44, Brent Cook wrote:
 Results for an old Athlon (hmm, I don't remember it running at 10.8
 Ghz before)
 
 cpu0: AMD Athlon(tm) 64 X2 Dual Core Processor 6000+, 10823.06 MHz
 
 forktest 0m19.23s real 0m0.72s user 0m22.46s system
 forktest+0m16.33s real 0m0.50s user 0m18.22s system
 
 kernel -j2 3m0.68s real  5m10.95s user0m41.58s system
 kernel -j2+3m0.45s real  5m10.14s user0m41.53s system

On Sunday 14 September 2014 19:16:41, Job Snijders wrote:
 The speedup kernel seems to perform slightly better:
 
 [root@speedup ~]# time ./forktest
 
 real0m33.134s
 user0m1.230s
 sys 0m53.710s
 
 [root@5.5stable ~]# time ./forktest
 
 real0m37.600s
 user0m1.200s
 sys 0m59.530s

Thanks to both of you for testing.

These results are more in line with my Core i7 and I expected the 
effect to be larger the large the number of CPUs is. Just my 6-year-
old Core2 Duo does not fit in there and gives a much larger speedup. 
Maybe it is particularily bad at something the current pmap code does.

Re: Speeding up openbsd on amd64 MP - patch 2/2

2014-10-06 Thread Stefan Fritsch

This is now in -current. If anyone notices problems, give noise.

Cheers,
Stefan

Removing -Wno-format from kernel makefiles, 1/16

2013-07-03 Thread Stefan Fritsch

add support for %td for ptrdiff_t in kernel

this also adds support in gcc 4.x kprintf
---
 gnu/gcc/gcc/c-format.c |7 ---
 sys/kern/subr_prf.c|6 ++
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git gnu/gcc/gcc/c-format.c gnu/gcc/gcc/c-format.c
index b9eecee..1b1734b 100644
--- gnu/gcc/gcc/c-format.c
+++ gnu/gcc/gcc/c-format.c
@@ -325,6 +325,7 @@ static const format_length_info kprintf_length_specs[] =
   { l, FMT_LEN_l, STD_C89, ll, FMT_LEN_ll, STD_C9L },
   { q, FMT_LEN_ll, STD_EXT, NULL, 0, 0 },
   { z, FMT_LEN_z, STD_C99, NULL, 0, 0 },
+  { t, FMT_LEN_t, STD_C99, NULL, 0, 0 },
   { NULL, 0, 0, NULL, 0, 0 }
 };
 
@@ -552,9 +553,9 @@ static const format_char_info asm_fprintf_char_table[] =
 static const format_char_info kprint_char_table[] = 
 { 
   /* C89 conversion specifiers.  */ 
-  { di,  0, STD_C89, { T89_I,   BADLEN, T89_S,   T89_L,   T9L_LL,  BADLEN,  
T99_SST, BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0 +'I, i, NULL }, 
-  { oxX, 0, STD_C89, { T89_UI,  BADLEN, T89_US,  T89_UL,  T9L_ULL, BADLEN,  
T99_ST,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0#,i, NULL }, 
-  { u,   0, STD_C89, { T89_UI,  BADLEN, T89_US,  T89_UL,  T9L_ULL, BADLEN,  
T99_ST,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0'I,   i, NULL }, 
+  { di,  0, STD_C89, { T89_I,   BADLEN, T89_S,   T89_L,   T9L_LL,  BADLEN,  
T99_SST, T99_PD,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0 +'I, i, NULL }, 
+  { oxX, 0, STD_C89, { T89_UI,  BADLEN, T89_US,  T89_UL,  T9L_ULL, BADLEN,  
T99_ST,  T99_UPD,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0#,i, NULL 
}, 
+  { u,   0, STD_C89, { T89_UI,  BADLEN, T89_US,  T89_UL,  T9L_ULL, BADLEN,  
T99_ST,  T99_UPD,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0'I,   i, NULL 
}, 
   { c,   0, STD_C89, { T89_I,   BADLEN, BADLEN,  BADLEN,  BADLEN,  BADLEN,  
BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -w,   , NULL }, 
   { s,   1, STD_C89, { T89_C,   BADLEN, BADLEN,  BADLEN,  BADLEN,  BADLEN,  
BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp,  cR, NULL 
}, 
   { p,   1, STD_C89, { T89_V,   BADLEN, BADLEN,  T89_UL,  T9L_LL,  BADLEN,  
BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0, c, NULL }, 
diff --git sys/kern/subr_prf.c sys/kern/subr_prf.c
index 768d164..c940141 100644
--- sys/kern/subr_prf.c
+++ sys/kern/subr_prf.c
@@ -842,6 +842,12 @@ reswitch:  switch (ch) {
size = 1;
sign = '\0';
break;
+   case 't':
+   {
+   /* assume ptrdiff_t is long */
+   CTASSERT(sizeof(fmt - fmt0) == sizeof(long));
+   }
+   /* FALLTHROUGH */
case 'D':
flags |= LONGINT;
/*FALLTHROUGH*/

Removing -Wno-format from kernel makefiles, 06/16

2013-07-04 Thread Stefan Fritsch

 Use %z* for size_t

while there, fix a few %d into %u
---
 sys/arch/i386/pci/pcibios.c |2 +-
 sys/dev/ic/bwi.c|4 ++--
 sys/dev/ic/pgt.c|6 +++---
 sys/dev/ic/ti.c |4 ++--
 sys/dev/pci/amdiic.c|4 ++--
 sys/dev/pci/amdpm.c |2 +-
 sys/dev/pci/if_iwi.c|4 ++--
 sys/dev/pci/if_iwn.c|   10 +-
 sys/dev/pci/if_lge.c|4 ++--
 sys/dev/pci/if_nge.c|4 ++--
 sys/dev/pci/if_tl.c |2 +-
 sys/dev/pci/if_wb.c |2 +-
 sys/dev/pci/if_wpi.c|4 ++--
 sys/dev/pci/piixpm.c|2 +-
 sys/dev/pci/yds.c   |2 +-
 sys/dev/usb/if_urndis.c |   12 ++--
 sys/netinet/ip_output.c |2 +-
 sys/netinet6/ip6_forward.c  |2 +-
 sys/netinet6/ip6_output.c   |2 +-
 sys/nfs/nfs_socket.c|4 ++--
 20 files changed, 39 insertions(+), 39 deletions(-)

diff --git sys/arch/i386/pci/pcibios.c sys/arch/i386/pci/pcibios.c
index 1b34391..372ac51 100644
--- sys/arch/i386/pci/pcibios.c
+++ sys/arch/i386/pci/pcibios.c
@@ -260,7 +260,7 @@ pcibios_pir_init(struct pcibios_softc *sc)
cksum += p[i];
 
printf(%s: PCI IRQ Routing Table rev %d.%d @ 0x%lx/%d 
-   (%d entries)\n, sc-sc_dev.dv_xname,
+   (%zd entries)\n, sc-sc_dev.dv_xname,
pirh-version  8, pirh-version  0xff, pa,
pirh-tablesize, (pirh-tablesize - sizeof(*pirh)) / 16);
 
diff --git sys/dev/ic/bwi.c sys/dev/ic/bwi.c
index 6295172..934a6ac 100644
--- sys/dev/ic/bwi.c
+++ sys/dev/ic/bwi.c
@@ -1673,7 +1673,7 @@ bwi_fwimage_is_valid(struct bwi_softc *sc, uint8_t *fw, 
size_t fw_len,
const struct bwi_fwhdr *hdr;
 
if (fw_len  sizeof(*hdr)) {
-   printf(%s: invalid firmware (%s): invalid size %u\n,
+   printf(%s: invalid firmware (%s): invalid size %zu\n,
sc-sc_dev.dv_xname, fw_name, fw_len);
return (1);
}
@@ -1686,7 +1686,7 @@ bwi_fwimage_is_valid(struct bwi_softc *sc, uint8_t *fw, 
size_t fw_len,
 */
if (betoh32(hdr-fw_size) != fw_len - sizeof(*hdr)) {
printf(%s: invalid firmware (%s): size mismatch, 
-   fw %u, real %u\n,
+   fw %u, real %zu\n,
sc-sc_dev.dv_xname,
fw_name,
betoh32(hdr-fw_size),
diff --git sys/dev/ic/pgt.c sys/dev/ic/pgt.c
index 692b72c..6f6a5d6 100644
--- sys/dev/ic/pgt.c
+++ sys/dev/ic/pgt.c
@@ -3193,7 +3193,7 @@ pgt_dma_alloc_queue(struct pgt_softc *sc, enum pgt_queue 
pq)
error = bus_dmamem_alloc(sc-sc_dmat, PGT_FRAG_SIZE, PAGE_SIZE,
0, pd-pd_dmas, 1, nsegs, BUS_DMA_WAITOK);
if (error != 0) {
-   printf(%s: error alloc frag %u on queue %u\n,
+   printf(%s: error alloc frag %zu on queue %u\n,
sc-sc_dev.dv_xname, i, pq);
free(pd, M_DEVBUF);
break;
@@ -3202,7 +3202,7 @@ pgt_dma_alloc_queue(struct pgt_softc *sc, enum pgt_queue 
pq)
error = bus_dmamem_map(sc-sc_dmat, pd-pd_dmas, nsegs,
PGT_FRAG_SIZE, (caddr_t *)pd-pd_mem, BUS_DMA_WAITOK);
if (error != 0) {
-   printf(%s: error map frag %u on queue %u\n,
+   printf(%s: error map frag %zu on queue %u\n,
sc-sc_dev.dv_xname, i, pq);
free(pd, M_DEVBUF);
break;
@@ -3212,7 +3212,7 @@ pgt_dma_alloc_queue(struct pgt_softc *sc, enum pgt_queue 
pq)
error = bus_dmamap_load(sc-sc_dmat, pd-pd_dmam,
pd-pd_mem, PGT_FRAG_SIZE, NULL, BUS_DMA_NOWAIT);
if (error != 0) {
-   printf(%s: error load frag %u on queue %u\n,
+   printf(%s: error load frag %zu on queue %u\n,
sc-sc_dev.dv_xname, i, pq);
bus_dmamem_free(sc-sc_dmat, pd-pd_dmas,
nsegs);
diff --git sys/dev/ic/ti.c sys/dev/ic/ti.c
index 85e3a2a..902591f 100644
--- sys/dev/ic/ti.c
+++ sys/dev/ic/ti.c
@@ -612,7 +612,7 @@ ti_alloc_jumbo_mem(struct ti_softc *sc)
state = 1;
if (bus_dmamem_map(sc-sc_dmatag, seg, rseg, TI_JMEM, kva,
BUS_DMA_NOWAIT)) {
-   printf(%s: can't map dma buffers (%d bytes)\n,
+   printf(%s: can't map dma buffers (%zu bytes)\n,
sc-sc_dv.dv_xname, TI_JMEM);
error = ENOBUFS;
goto out;
@@ -1600,7 +1600,7 @@ ti_attach(struct ti_softc *sc)
}
if (bus_dmamem_map(sc-sc_dmatag, seg, rseg,
sizeof(struct ti_ring_data), kva,

Re: Removing -Wno-format from kernel makefiles, 3/16

2013-07-04 Thread Stefan Fritsch

On Wed, 3 Jul 2013, Mark Kettenis wrote:
  diff --git sys/arch/i386/i386/db_interface.c 
  sys/arch/i386/i386/db_interface.c
  index 85c1ff5..c75fd89 100644
  --- sys/arch/i386/i386/db_interface.c
  +++ sys/arch/i386/i386/db_interface.c
  @@ -197,11 +197,11 @@ db_sysregs_cmd(db_expr_t addr, int have_addr, 
  db_expr_t count, char *modif)
  uint16_t ldtr, tr;
   
  __asm__ __volatile__(sidt %0 : =m (idtr));
  -   db_printf(idtr:   0x%08x/%04x\n,
  +   db_printf(idtr:   0x%08x/%04llx\n,
  (unsigned int)(idtr  16), idtr  0x);
   
  __asm__ __volatile__(sgdt %0 : =m (gdtr));
  -   db_printf(gdtr:   0x%08x/%04x\n,
  +   db_printf(gdtr:   0x%08x/%04llx\n,
  (unsigned int)(gdtr  16), gdtr  0x);
 
 This is a tad bit inconsistent.  I'd either use %llx for both values
 and get rid of the cast, or use %x and use a cast in both cases.

Like this?

--- sys/arch/i386/i386/db_interface.c
+++ sys/arch/i386/i386/db_interface.c
@@ -197,12 +197,10 @@ db_sysregs_cmd(db_expr_t addr, int have_addr, db_expr_t 
count, char *modif)
uint16_t ldtr, tr;
 
__asm__ __volatile__(sidt %0 : =m (idtr));
-   db_printf(idtr:   0x%08x/%04x\n,
-   (unsigned int)(idtr  16), idtr  0x);
+   db_printf(idtr:   0x%08llx/%04llx\n, idtr  16, idtr  0x);
 
__asm__ __volatile__(sgdt %0 : =m (gdtr));
-   db_printf(gdtr:   0x%08x/%04x\n,
-   (unsigned int)(gdtr  16), gdtr  0x);
+   db_printf(gdtr:   0x%08llx/%04llx\n, gdtr  16, gdtr  0x);
 
__asm__ __volatile__(sldt %0 : =g (ldtr));
db_printf(ldtr:   0x%04x\n, ldtr);

Re: Removing -Wno-format from kernel makefiles, 4/16

2013-07-04 Thread Stefan Fritsch

On Wed, 3 Jul 2013, Mark Kettenis wrote:
  diff --git sys/arch/i386/i386/esm.c sys/arch/i386/i386/esm.c
  index c90b2c4..3dff69e 100644
  --- sys/arch/i386/i386/esm.c
  +++ sys/arch/i386/i386/esm.c
  @@ -880,7 +880,7 @@ esm_make_sensors(struct esm_softc *sc, struct 
  esm_devmap *devmap,
  }
   
  for (j = 0; j  nsensors; j++) {
  -   snprintf(s[j].desc, sizeof(s[j].desc), %s %d,
  +   snprintf(s[j].desc, sizeof(s[j].desc), %s %ld,
  sensor_map[i].name, sensor_map[i].arg + j);
  }
  break;
 
 Looking at this one, it makes more sense to make the arg member of
 struct esm_sensor_map an int.  That will result in some space
 savings if we'd ever bring this driver to amd64.


--- sys/arch/i386/i386/esm.c
+++ sys/arch/i386/i386/esm.c
@@ -87,7 +87,7 @@ enum sensor_type esm_typemap[] = {
 
 struct esm_sensor_map {
enum esm_sensor_typetype;
-   longarg;
+   int arg;
const char  *name;
 };

Removing -Wno-format from kernel makefiles, 07/16

2013-07-04 Thread Stefan Fritsch

 fix: %x instead of %p for int

---
 sys/dev/pci/musycc_obsd.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git sys/dev/pci/musycc_obsd.c sys/dev/pci/musycc_obsd.c
index 25a58d8..0844136 100644
--- sys/dev/pci/musycc_obsd.c
+++ sys/dev/pci/musycc_obsd.c
@@ -215,7 +215,7 @@ musycc_ebus_attach(struct device *parent, struct 
musycc_softc *esc,
 #endif
 
if (ebus_attach_device(rom, sc, 0, 0x400) != 0) {
-   printf(: failed to map rom @ %05p\n, 0);
+   printf(: failed to map rom @ %05x\n, 0);
goto failed;
}
 
-- 
1.7.6

Removing -Wno-format from kernel makefiles, 10/16

2013-07-04 Thread Stefan Fritsch

 fix: pointer to long

---
 sys/arch/i386/pci/pci_addr_fixup.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git sys/arch/i386/pci/pci_addr_fixup.c sys/arch/i386/pci/pci_addr_fixup.c
index 59880eb..40f7918 100644
--- sys/arch/i386/pci/pci_addr_fixup.c
+++ sys/arch/i386/pci/pci_addr_fixup.c
@@ -354,7 +354,7 @@ pciaddr_do_resource_reserve_disabled(struct pcibios_softc 
*sc,
(val  PCI_COMMAND_IO_ENABLE) == PCI_COMMAND_IO_ENABLE)
return (0);
 
-   PCIBIOS_PRINTV((disabled %s space at addr 0x%x size 0x%x\n,
+   PCIBIOS_PRINTV((disabled %s space at addr 0x%lx size 0x%x\n,
type == PCI_MAPREG_TYPE_MEM ? mem : io, *addr, size));
 
error = extent_alloc_region(ex, *addr, size, EX_NOWAIT | EX_MALLOCOK);
-- 
1.7.6

Removing -Wno-format from kernel makefiles, 08/16

2013-07-04 Thread Stefan Fritsch

 %hu/%hd for uint16_t, %u/%d/%x for uint32_t

despite the name, ntohl returns uint32_t
also fix some %d into %u
---
 sys/arch/i386/i386/bios.c|4 ++--
 sys/arch/i386/i386/ioapic.c  |2 +-
 sys/arch/i386/pci/pcibios.c  |4 ++--
 sys/kern/vfs_cluster.c   |2 +-
 sys/msdosfs/msdosfs_conv.c   |2 +-
 sys/msdosfs/msdosfs_denode.c |2 +-
 sys/msdosfs/msdosfs_vnops.c  |2 +-
 sys/net/if_spppsubr.c|2 +-
 8 files changed, 10 insertions(+), 10 deletions(-)

diff --git sys/arch/i386/i386/bios.c sys/arch/i386/i386/bios.c
index 6e46e6a..75dab38 100644
--- sys/arch/i386/i386/bios.c
+++ sys/arch/i386/i386/bios.c
@@ -238,7 +238,7 @@ biosattach(struct device *parent, struct device *self, void 
*aux)
 
bios32_entry.segment = GSEL(GCODE_SEL, SEL_KPL);
bios32_entry.offset = 
(u_int32_t)ISA_HOLE_VADDR(h-entry);
-   printf(, BIOS32 rev. %d @ 0x%lx, h-rev, h-entry);
+   printf(, BIOS32 rev. %d @ 0x%x, h-rev, h-entry);
break;
}
}
@@ -285,7 +285,7 @@ biosattach(struct device *parent, struct device *self, void 
*aux)
for (; pa  end; pa+= NBPG, eva+= NBPG)
pmap_kenter_pa(eva, pa, VM_PROT_READ);
 
-   printf(, SMBIOS rev. %d.%d @ 0x%lx (%d entries),
+   printf(, SMBIOS rev. %d.%d @ 0x%x (%hd entries),
sh-majrev, sh-minrev, sh-addr, sh-count);
/*
 * Unbelievably the SMBIOS version number
diff --git sys/arch/i386/i386/ioapic.c sys/arch/i386/i386/ioapic.c
index 0b0c6b8..4197553 100644
--- sys/arch/i386/i386/ioapic.c
+++ sys/arch/i386/i386/ioapic.c
@@ -300,7 +300,7 @@ ioapic_attach(struct device *parent, struct device *self, 
void *aux)
 
ioapic_add(sc);
 
-   printf( pa 0x%lx, aaa-apic_address);
+   printf( pa 0x%x, aaa-apic_address);
 
if (bus_mem_add_mapping(aaa-apic_address, PAGE_SIZE, 0, bh) != 0) {
printf(, map failed\n);
diff --git sys/arch/i386/pci/pcibios.c sys/arch/i386/pci/pcibios.c
index 372ac51..bc8a848 100644
--- sys/arch/i386/pci/pcibios.c
+++ sys/arch/i386/pci/pcibios.c
@@ -151,7 +151,7 @@ pcibiosprobe(struct device *parent, void *match, void *aux)
rv = bios32_service(PCIBIOS_SIGNATURE, pcibios_entry,
pcibios_entry_info);
 
-   PCIBIOS_PRINTV((pcibiosprobe: 0x%lx:0x%lx at 0x%lx[0x%lx]\n,
+   PCIBIOS_PRINTV((pcibiosprobe: 0x%hx:0x%x at 0x%x[0x%x]\n,
pcibios_entry.segment, pcibios_entry.offset,
pcibios_entry_info.bei_base, pcibios_entry_info.bei_size));
 
@@ -172,7 +172,7 @@ pcibiosattach(struct device *parent, struct device *self, 
void *aux)
rev_min, mech1, mech2,
scmech1, scmech2, sc-max_bus);
 
-   printf(: rev %d.%d @ 0x%lx/0x%lx\n,
+   printf(: rev %d.%d @ 0x%x/0x%x\n,
rev_maj, rev_min  4, pcibios_entry_info.bei_base,
pcibios_entry_info.bei_size);
 
diff --git sys/kern/vfs_cluster.c sys/kern/vfs_cluster.c
index f9df40e..118a411 100644
--- sys/kern/vfs_cluster.c
+++ sys/kern/vfs_cluster.c
@@ -175,7 +175,7 @@ cluster_wbuild(struct vnode *vp, struct buf *last_bp, long 
size,
 
 #ifdef DIAGNOSTIC
if (size != vp-v_mount-mnt_stat.f_iosize)
-   panic(cluster_wbuild: size %ld != filesize %ld,
+   panic(cluster_wbuild: size %ld != filesize %u,
size, vp-v_mount-mnt_stat.f_iosize);
 #endif
 redo:
diff --git sys/msdosfs/msdosfs_conv.c sys/msdosfs/msdosfs_conv.c
index 727acd2..538aa1c 100644
--- sys/msdosfs/msdosfs_conv.c
+++ sys/msdosfs/msdosfs_conv.c
@@ -209,7 +209,7 @@ dos2unixtime(u_int dd, u_int dt, u_int dh, struct timespec 
*tsp)
 */
month = (dd  DD_MONTH_MASK)  DD_MONTH_SHIFT;
if (month == 0) {
-   printf(dos2unixtime(): month value out of range 
(%ld)\n,
+   printf(dos2unixtime(): month value out of range 
(%u)\n,
month);
month = 1;
}
diff --git sys/msdosfs/msdosfs_denode.c sys/msdosfs/msdosfs_denode.c
index d384a8b..29f2e57 100644
--- sys/msdosfs/msdosfs_denode.c
+++ sys/msdosfs/msdosfs_denode.c
@@ -387,7 +387,7 @@ detrunc(struct denode *dep, uint32_t length, int flags, 
struct ucred *cred,
 * directory's life.
 */
if ((DETOV(dep)-v_flag  VROOT)  !FAT32(pmp)) {
-   printf(detrunc(): can't truncate root directory, clust %ld, 
offset %ld\n,
+   printf(detrunc(): can't truncate root directory, clust %u, 
offset %u\n,
dep-de_dirclust, dep-de_diroffset);
return (EINVAL);
}
diff --git sys/msdosfs/msdosfs_vnops.c sys/msdosfs/msdosfs_vnops.c
index 79f611c..25d7030 100644
--- sys/msdosfs/msdosfs_vnops.c
+++ sys/msdosfs/msdosfs_vnops.c
@@

Re: Removing -Wno-format from kernel makefiles, V2

2013-07-04 Thread Stefan Fritsch

On Wednesday 03 July 2013, Stefan Fritsch wrote:
 I haven't done any signedness fixes so far, only the size fixes.
 But I will look over the patches again and try to fix the
 signedness issues in the touched code lines, too. There are quite
 a few I guess.

For people prefering to review by source files, an updated patch is 
here:

http://www.sfritsch.de/~stf/openbsd-format-string-v3/combined.patch

Re: virtio MSI

2013-09-22 Thread Stefan Fritsch

Am Mittwoch, 18. September 2013, 07:19:30 schrieb Chris Cappuccio:
 Here's a simple and obvious change that would be necessary to
 support virtio under bhyve. But it is only acceptable if other
 virtio implementations either 1. don't claim to support MSI or
 2. also work with MSI.

Qemu does not advertise MSI support, so no problem there. Committed.

Re: virtio MSI

2013-09-23 Thread Stefan Fritsch

Out of interest, does openbsd run on bhyve with this patch, or are there 
other problems?

On Sun, 22 Sep 2013, Stefan Fritsch wrote:
 Am Mittwoch, 18. September 2013, 07:19:30 schrieb Chris Cappuccio:
  Here's a simple and obvious change that would be necessary to
  support virtio under bhyve. But it is only acceptable if other
  virtio implementations either 1. don't claim to support MSI or
  2. also work with MSI.
 
 Qemu does not advertise MSI support, so no problem there. Committed.

Snapshot install54.iso does not boot under qemu/kvm

2014-01-02 Thread Stefan Fritsch

Hi,

the 2013-12-31 snapshot iso images don't work for me under qemu 1.7.0 
/ linux 3.12.6. They get this far and then reboot:

loadrandom: cd0a:/etc/random.seed
cannot open cd0a:/etc/random.seed
loadrandom: error -1
booting cd0a:/5.4/i386/bsd.rd: 6038744+426300 
[72+233616+222178]=0x699c6c
entry point at 0x210073


I have tried both i386 and amd64. sha256:
bdb45f421d4108ebf8da6896f23a1b23e81c968782e41da5529bce4d3146e5fa  
install54-amd64-snapshot-2013-12-31.iso
bbc58d620aff0d5f6c97c6d193c1646754c5479296b97a2ebb468729e8aa0c30  
install54-i386-snapshot-2013-12-31.iso

I still have snapshots isos from 2013-09-22 lying around, and those 
continue to work. Therefore I believe it is not a qemu/kvm problem.

The last things happening before the reboot seem to be these (amd64 
iso):

kvm_emulate_insn: 40120:1168:67 66 0f 01 15 88 0f 00 00 (real)  
 
kvm_emulate_insn: 40120:1171:0f 20 c0 (real)
 
kvm_emulate_insn: 40120:1174:66 83 c8 01 (real) 
 
kvm_emulate_insn: 40120:1178:0f 22 c0 (real)
 
kvm_emulate_insn: 40120:117b:66 ea a3 12 04 00 08 00 (prot16)   
 
kvm_emulate_insn: 0:412a3:b8 10 00 00 00 (prot32)   
 
kvm_emulate_insn: 0:412a8:8e d8 (prot32)
 
kvm_emulate_insn: 0:412aa:8e d0 (prot32)
 
kvm_emulate_insn: 0:412ac:8e c0 (prot32)
 
kvm_exit: reason EPT_VIOLATION rip 0x1010149 info 182 0 
 
kvm_page_fault: address 15e8050 error_code 182  
 
kvm_exit: reason CPUID rip 0x10101ae info 0 0
kvm_cpuid: func 0 rax 4 rbx 756e6547 rcx 6c65746e rdx 49656e69
kvm_exit: reason CPUID rip 0x10101cf info 0 0
kvm_cpuid: func 1 rax 623 rbx 800 rcx 80802001 rdx 78bfbfd
kvm_exit: reason CPUID rip 0x10101e7 info 0 0
kvm_cpuid: func a rax 0 rbx 0 rcx 0 rdx 0
kvm_exit: reason CPUID rip 0x10101ff info 0 0
kvm_cpuid: func 8001 rax 623 rbx 0 rcx 1 rdx 2191abfd
kvm_exit: reason CPUID rip 0x1010212 info 0 0
kvm_cpuid: func 8007 rax 0 rbx 0 rcx 0 rdx 0
kvm_exit: reason CR_ACCESS rip 0x101042b info 4 0
kvm_cr: cr_write 4 = 0x6b0
kvm_exit: reason MSR_READ rip 0x1010433 info 0 0
kvm_msr: msr_read c080 = 0x0
kvm_exit: reason MSR_WRITE rip 0x101043c info 0 0
kvm_msr: msr_write c080 = 0x101
kvm_exit: reason CR_ACCESS rip 0x1010440 info 3 0
kvm_cr: cr_write 3 = 0x1b2b000
kvm_exit: reason CR_ACCESS rip 0x101044b info 0 0
kvm_cr: cr_write 0 = 0x8001003b
kvm_exit: reason TRIPLE_FAULT rip 0x101045d info 0 0
kvm_userspace_exit: reason KVM_EXIT_SHUTDOWN (8)
kvm_userspace_exit: reason restart (4)

Not sure if this is related to the actual problem or just the reboot 
after the problem happened earlier.


Cheers,
Stefan

Virtio drivers for OpenBSD

2012-07-11 Thread Stefan Fritsch

Hi,

I have been working on porting NetBSD's virtio drivers to OpenBSD.  I am 
not finished yet, but in order to prevent duplicate work, I thought I'd 
publish the current state (attached as diff to OpenBSD 5.1). It adds a 
virtio block device driver (viod) and a virtio network interface 
(if_vioif). It is stable enough to run make -j 2 in /usr/src on a viod 
disk.

Comments are welcome.

BTW: Which device numbers should I choose for viod? Use the first unused 
number or just add virtio at the end?

Cheers,
Stefan

[demime 1.01d removed an attachment of type TEXT/x-diff which had a name of 
openbsd-virtio-v1.diff]

Re: Virtio drivers for OpenBSD

2012-07-11 Thread Stefan Fritsch


On Wed, 11 Jul 2012, Otto Moerbeek wrote:


On Wed, Jul 11, 2012 at 02:07:00PM +0200, Stefan Fritsch wrote:

[demime 1.01d removed an attachment of type TEXT/x-diff which had a name of 
openbsd-virtio-v1.diff]


Diff got stripped.


I have put it at 
http://www.sfritsch.de/~stf/openbsd-virtio/openbsd-virtio-v1.diff



Can you give us some examples of the use of this driver?


Virtio disks and NICs are paravirtualized devices offered by kvm/qemu and 
virtualbox. Using them should give somewhat higher performance and/or 
lower CPU usage than using emulated hardware like IDE and e1000. But I 
haven't done any detailed performance tests, yet.

Re: Virtio drivers for OpenBSD

2012-07-11 Thread Stefan Fritsch


On Wed, 11 Jul 2012, Gleydson Soares wrote:

could you resubmit a new diff against -current ?
please, attach it inline here.


ok, will do (but it will take a bit until I have my machine set up)

Re: Virtio drivers for OpenBSD

2012-07-11 Thread Stefan Fritsch


On Wed, 11 Jul 2012, Ted Unangst wrote:

We are consolidating and reducing the number of disk types.  I think
this would be better if it attached as scsi disks, which is what all
the cool virtual disks do these days.


There is a virtio-scsi device, too, but this is only supported in very 
recent versions of qemu. To attach the simpler virtio-block device as 
scsi, the driver would have to emulate the scsi commands. Is there some 
generic infrastructure to do that? Or which driver would you recommend to 
take the emulation code from? sys/dev/ata/atascsi.c seems rather complex 
(it is twice the size of sys/dev/pci/viod.c).

Re: Virtio drivers for OpenBSD

2012-07-11 Thread Stefan Fritsch


On Wed, 11 Jul 2012, Theo de Raadt wrote:

There are quite a few pretendy-SCSI drivers in the tree, but most of
them deal with fairly complicated hardware so they're not good
reference points.

sparc64's vdsk(4) driver is probably the best reference point, since
it's also for a virtual disk driver (see sys/arch/sparc64/dev/vdsk.c).


Oh yes, vdsk(4) is also a good suggestion.


ok, thanks. I will take a look at that one.

Re: Virtio drivers for OpenBSD

2012-07-12 Thread Stefan Fritsch


On Thu, 12 Jul 2012, Jiri B wrote:


On Thu, Jul 12, 2012 at 11:02:54AM +0200, Stefan Fritsch wrote:

On Wed, 11 Jul 2012, Matthew Dempsky wrote:

There are quite a few pretendy-SCSI drivers in the tree, but most of
them deal with fairly complicated hardware so they're not good
reference points.

sparc64's vdsk(4) driver is probably the best reference point, since
it's also for a virtual disk driver (see sys/arch/sparc64/dev/vdsk.c).


If you would like to dig into vmware pvscsi support, I'm more than happy
to do testing :)

http://lxr.linux.no/#linux+v3.4.4/drivers/scsi/vmw_pvscsi.c


I will leave that to other people. I am not interested in vmware ATM.

Re: [Patch] Virtio drivers for OpenBSD V5

2012-08-13 Thread Stefan Fritsch

On Monday 13 August 2012 17:07:41 you wrote:
   * Note: the i386 does not currently require barriers, but we must
   * provide the flags to MI code.
 
  This is not correct for virtio. We need a memory barrier.

 sure, copy it from amd64.

OK. A slight complication:

sfence/mfence/lfence do not exist on all i386 CPUs. sfence was introduced with
SSE, [lm]fence with SSE2. This is not really a problem with the virtio driver
because the virtualization extensions were introduced much later. But it may
be a problem with the rest of the i386 code.

How do you normally handle this? Check during boot and set a pointer to the
function to be used and call that function pointer from bus_space_barrier()?

--
genua
Gesellschaft fuer Netzwerk- und Unix-Administration mbH
Domagkstrasse 7, 85551 Kirchheim bei Muenchen
tel +49 89 991950-0, fax -999, www.genua.de
Geschaeftsfuehrer: Dr. Magnus Harlander, Dr. Michaela Harlander,
Bernhard Schneck. Amtsgericht Muenchen HRB 98238

Re: [Patch] Virtio drivers for OpenBSD V5

2012-08-14 Thread Stefan Fritsch

On Monday 13 August 2012 17:00:56 Mike Belopuhov wrote:
 you're calling MCLGETI and provide an interface pointer but
 you forget to set watermarks via m_clsetwms.

OK. The MCLGETI man page should probably get a pointer to m_clsetwms().

-- 
genua
Gesellschaft fuer Netzwerk- und Unix-Administration mbH
Domagkstrasse 7, 85551 Kirchheim bei Muenchen
tel +49 89 991950-0, fax -999, www.genua.de
Geschaeftsfuehrer: Dr. Magnus Harlander, Dr. Michaela Harlander,
Bernhard Schneck. Amtsgericht Muenchen HRB 98238

Re: Virtio network interface driver

2012-08-20 Thread Stefan Fritsch


I have been working on porting virtio devices too.
Actually for now only networking driver complete
The main difference between Stefen's version is virtio devices
attaches directly to pci bus.


I don't think this is a good idea. There are other transports for virtio 
than pci, e.g. virtio-mmio. And virtio-mmio behaves differently enough 
from virtio-pci that supporting both in all virtio device drivers would be 
a PITA.


But I think your name of vio for the network devices is better than 
NetBSD's vioif. I think I will rename it, too.

Re: [Patch] Virtio drivers for OpenBSD V6

2012-08-20 Thread Stefan Fritsch


On Fri, 17 Aug 2012, Mike Belopuhov wrote:

On Fri, Aug 17, 2012 at 1:17 PM, Stefan Fritsch s...@sfritsch.de wrote:

On Thursday 16 August 2012, Mike Belopuhov wrote:

could you please tell me if you're using tx interrupt or not?
if yes, why do you need to have a watchdog code implement
a txeof path and not the actual reset?


In normal operation, tx interrupts are not used and the sent
descriptors are cleaned up either by rx interrupts or by the watchdog.
The sole exception is when the tx queue is full.


exactly, so please don't do it.  use a tx interrupt for txeof and use a
watchdog to reset a controller.


Always using the tx interrupt decreases performance significantly. On my 
test system (in MBytes/s):


Sending UDP from 75 to 55
Sending TCP from 34 to 25

I dont' think we want to do this.

Re: [Patch] Virtio drivers for OpenBSD V6

2012-08-20 Thread Stefan Fritsch


On Mon, 20 Aug 2012, Mike Belopuhov wrote:

Always using the tx interrupt decreases performance significantly. On my
test system (in MBytes/s):

Sending UDP from 75 to 55
Sending TCP from 34 to 25



Why?


Because the interrupt load increases by approx. 7000 per second. I think 
the problem is that the backend sends packets faster than openbsd can 
enqueue them. This means that the tx queue will be empty often, and each 
time an tx interrupt will be triggered.



I dont' think we want to do this.


Perhaps you got your OACTIVE handling wrong and don't call a start
routing from the appropriate places.  Have you considered that?


I don't see how this could help, but I will look at it.

Re: [Patch] Virtio drivers for OpenBSD V6

2012-08-20 Thread Stefan Fritsch


On Mon, 20 Aug 2012, Mike Belopuhov wrote:

On Mon, Aug 20, 2012 at 1:42 PM, Stefan Fritsch s...@sfritsch.de wrote:

Because the interrupt load increases by approx. 7000 per second. I think the
problem is that the backend sends packets faster than openbsd can enqueue
them. This means that the tx queue will be empty often, and each time an tx
interrupt will be triggered.



this doesn't sound right as it's the driver that decides how much
to transmit.


How so? It simply transmits all packets that are in ifp-if_snd.


btw is there a interrupt moderation mechanism available?


Unfortunately not one that is time based. It is only possible to defer
the interrupt for a certain amount of packets (with the RingEventIdx 
feature) or until the last descriptor is removed from the ring (with the 
NotifyOnEmpty feature). But the latter does not help much and the former 
is not usable unless you have some timer that cleans up resources if no 
more packets are sent.


For the rx interrupts it is even worse. I don't see any sensible way to 
reduce interrupts without switching to polling when there are lots of 
inbound packets. Are there already NICs using polling in OpenBSD?

Re: [Patch] Virtio drivers for OpenBSD V6

2012-08-21 Thread Stefan Fritsch


On Mon, 20 Aug 2012, Mike Belopuhov wrote:

Because the interrupt load increases by approx. 7000 per second. I think
the
problem is that the backend sends packets faster than openbsd can enqueue
them. This means that the tx queue will be empty often, and each time an
tx
interrupt will be triggered.



this doesn't sound right as it's the driver that decides how much
to transmit.



How so? It simply transmits all packets that are in ifp-if_snd.



only if OACTIVE is not set.


Which does not help at all if I want to send more packets in one go, not 
fewer.



btw is there a interrupt moderation mechanism available?


Unfortunately not one that is time based. It is only possible to defer
the interrupt for a certain amount of packets (with the RingEventIdx
feature) or until the last descriptor is removed from the ring (with the
NotifyOnEmpty feature). But the latter does not help much and the former is
not usable unless you have some timer that cleans up resources if no more
packets are sent.

For the rx interrupts it is even worse. I don't see any sensible way to
reduce interrupts without switching to polling when there are lots of
inbound packets.


that sucks, but that doesn't mean you have to abuse watchdog.


What do yo suggest then? Use a separate timeout? Accept the lower 
performance?


Resetting the device in the watchdog has another unexpected problem: In 
order to restore the MAC filters and promisc state, vioif_iff needs to do 
asynchronous requests on the control queue. But in the watchdog, I cannot 
sleep. FWIW, if_vic does not have a watchdog at all. Not having a watchdog 
looks rather attractive for the virtio driver, too.


Cheers,
Stefan

Small memcpy optimization

2012-08-21 Thread Stefan Fritsch

On x86, the xchg operation between reg and mem has an implicit lock 
prefix, i.e. it is a relatively expensive atomic operation. This is not 
needed here.


--- a/sys/arch/i386/i386/locore.s
+++ b/sys/arch/i386/i386/locore.s
@@ -802,8 +802,9 @@ ENTRY(bcopy)
  */
 ENTRY(memcpy)
movl4(%esp),%ecx
-   xchg8(%esp),%ecx
-   movl%ecx,4(%esp)
+   movl8(%esp),%eax
+   movl%ecx,8(%esp)
+   movl%eax,4(%esp)
jmp _C_LABEL(bcopy)

Re: [Patch] Virtio drivers for OpenBSD V8

2012-09-02 Thread Stefan Fritsch

On Saturday 01 September 2012, LEVAI Daniel wrote:

  I've just started to test this with an OpenBSD guest on a Linux
  host. vio0 NIC and the virtio disk work like a charm.
  I have a slightly (~10%) better performance with the ide
  emulation than with virtio, but nevertheless it works.

Thanks for testing. How did you test the performance? 

 Well, weird thing is that some time after I've sent this mail, a
 few messages appeared in the logs while updating /usr/ports on the
 guest:
 
 virtio timeout vioblk0: sc_queued 5 vq_num 128
 vq_avail_idx: 27965 vq_avail-idx: 27965 vq_avail-flags: 1
 vq_used_idx:  27960 vq_used-idx:  27960 vq_used-flags:  0
 
 
 Are these errors or just some debug information for the devs?

These are just debugging leftovers. I will remove them in the next 
version.

Reduce false cache line sharing in page tables

2012-09-02 Thread Stefan Fritsch

Since the Pentium 4, cachelines are 64 byte on x86, not 32 byte. The pmap 
optimization should be adjusted accordingly:


--- sys/arch/i386/i386/pmapae.c
+++ sys/arch/i386/i386/pmapae.c
@@ -516,10 +516,10 @@ typedef u_int64_t pd_entry_t; /* PDE */
 typedef u_int64_t pt_entry_t;  /* PTE */

 /*
- * Number of PTE's per cache line. 8 byte pte, 32-byte cache line
+ * Number of PTE's per cache line. 8 byte pte, 64-byte cache line
  * Used to avoid false sharing of cache lines.
  */
-#defineNPTECL  4
+#defineNPTECL  8

 /*
  * other data structures
--- sys/arch/i386/include/pmap.h
+++ sys/arch/i386/include/pmap.h
@@ -229,10 +229,10 @@
 #definePG_XPG_AVAIL3   /* executable mapping */

 /*
- * Number of PTE's per cache line.  4 byte pte, 32-byte cache line
+ * Number of PTE's per cache line.  4 byte pte, 64-byte cache line
  * Used to avoid false sharing of cache lines.
  */
-#define NPTECL 8
+#define NPTECL 16

 #ifdef _KERNEL
 /*

Possible bug in timecounter code

2012-09-02 Thread Stefan Fritsch


tc_windup() calls ntp_update_second() up to 200 times:


/*
 * A large step happens on boot.  This constant detects such steps.
 * It is relatively small so that ntp_update_second gets called enough
 * in the typical 'missed a couple of seconds' case, but doesn't loop
 * forever when the time step is large.
 */
#define LARGE_STEP  200


ntp_update_second() subtracts some amount of time from adjtimedelta and 
sets th-th_adjustment accordingly. However, if ntp_update_second is 
called more than once in a row, only the last value of th-th_adjustment 
is actually applied. Shouldn't the adjustments from all ntp_update_second 
calls be added instead? I don't know how often this is actually hit in 
practice, the normal case is that ntp_update_second is called only once.


diff --git sys/kern/kern_tc.c sys/kern/kern_tc.c
index 59b31dc..60ea119 100644
--- sys/kern/kern_tc.c
+++ sys/kern/kern_tc.c
@@ -425,8 +426,12 @@ tc_windup(void)
i = bt.sec - tho-th_microtime.tv_sec;
if (i  LARGE_STEP)
i = 2;
-   for (; i  0; i--)
-   ntp_update_second(th-th_adjustment, bt.sec);
+   th-th_adjustment = 0;
+   for (; i  0; i--) {
+   uint64_t adj;
+   ntp_update_second(adj, bt.sec);
+   th-th_adjustment += adj;
+   }

/* Update the UTC timestamps used by the get*() functions. */
/* XXX shouldn't do this here.  Should force non-`get' versions. */

Re: Possible bug in timecounter code

2012-09-03 Thread Stefan Fritsch

On Sunday 02 September 2012, s...@sfritsch.de wrote:
 But it may still be a problem the other way round. If adjtimedelta
 would  go to zero during the lost ticks, there will be some
 over-compensation that is not detected.

After some more hints from Mark (thanks), I think this is very 
unlikely to cause a problem in practice, possibly except if running on 
a virtualization host that is overloaded/swapping.

Re: Possible bug in timecounter code

2012-09-04 Thread Stefan Fritsch

On Monday 03 September 2012, Stefan Fritsch wrote:
 On Sunday 02 September 2012, s...@sfritsch.de wrote:
  But it may still be a problem the other way round. If
  adjtimedelta would  go to zero during the lost ticks, there will
  be some over-compensation that is not detected.
 
 After some more hints from Mark (thanks), I think this is very
 unlikely to cause a problem in practice, possibly except if running
 on a virtualization host that is overloaded/swapping.

There is another issue: If the previous call to tc_windup was more 
than 1s ago, there is an integer overflow, all complete seconds are 
lost and only the fractional part is added to th_offset. FreeBSD has a 
fix for this one:

http://svnweb.freebsd.org/base?view=revisionrevision=215665

Re: [Patch] Virtio drivers for OpenBSD V9

2012-09-05 Thread Stefan Fritsch

On Tuesday 04 September 2012, LEVAI Daniel wrote:
 On v, szept 02, 2012 at 12:13:33 +0200, Stefan Fritsch wrote:
  On Saturday 01 September 2012, LEVAI Daniel wrote:
I've just started to test this with an OpenBSD guest on a
Linux host. vio0 NIC and the virtio disk work like a charm.
I have a slightly (~10%) better performance with the ide
emulation than with virtio, but nevertheless it works.
  
  Thanks for testing. How did you test the performance?
 
 I've just used dd(1) to transfer files on the same disk and
 partition.

On my system, the behavior is not really consistent: Configuring the 
drive in kvm with cache=writeback, virtio is significantly faster at 
writing and ide is significantly faster at reading. With cache=none, 
it's the other way round and with smaller differences (5%). Ide is 
slightly faster at writing and virtio is slightly faster at reading.
There is obviously room for optimization...

  These are just debugging leftovers. I will remove them in the
  next version.

At http://www.sfritsch.de/~stf/openbsd-virtio/openbsd-virtio-v9.diff , 
there is a new version without the debugging output in vioblk. There 
are no other differences to v8.

Cheers,
Stefan

Re: Scheduler improvements, take 1001, Patch 2/5

2012-10-15 Thread Stefan Fritsch

On Monday 15 October 2012, Stuart Henderson wrote:
 Best to send the diff, with the accompanying text, in the same
 email, and make sure they still all apply, I tried testing some of
 these but didn't manage to get them to apply (either some
 conflicting change, or they were in the wrong order and stacked up
 on top of each other or something).

As you mentioned using git in another mail, I would recommend to use 
git's format-patch to create the mails. Maybe with the --no-prefix 
option,  because that seems to be the preferred patch format here. 
Also, interactive rebase (git rebase -i) is nice for adjusting patches 
and commit messages (in case you didn't know that one).

Cheers,
Stefan

vio(4) Large MTU support

2012-11-01 Thread Stefan Fritsch


Hi,

here is a patch to support large MTUs in if_vio. It adds support
for the VIRTIO_NET_F_MRG_RXBUF feature, i.e. allows to chain
several rx buffers when receiving large packets.

To make this work, it has to put the rx meta data headers at the
beginning of the mbuf cluster instead of dedicated buffers. Should
it take care that the payload gets some particular alignment? ATM,
the ethernet header starts with offset 10 or 12 bytes into the
cluster, depending on VIRTIO_NET_F_MRG_RXBUF.

Comments / OKs ? Thanks.

Stefan

--- share/man/man4/vio.4
+++ share/man/man4/vio.4
@@ -50,6 +50,3 @@ It is based on the
 .Nm vioif
 driver by
 .An Minoura Makoto .
-.Sh BUGS
-.Nm
-currently does not support jumbo frames.
diff --git a/sys/dev/pci/if_vio.c b/sys/dev/pci/if_vio.c
index 2f94db7..5678edc 100644
--- sys/dev/pci/if_vio.c
+++ sys/dev/pci/if_vio.c
@@ -123,9 +123,9 @@ struct virtio_net_hdr {
uint16_tgso_size;
uint16_tcsum_start;
uint16_tcsum_offset;
-#if 0
-   uint16_tnum_buffers; /* if VIRTIO_NET_F_MRG_RXBUF enabled */
-#endif
+
+   /* only present if VIRTIO_NET_F_MRG_RXBUF is negotiated */
+   uint16_tnum_buffers;
 } __packed;

 #define VIRTIO_NET_HDR_F_NEEDS_CSUM1 /* flags */
@@ -199,7 +199,7 @@ struct vio_softc {
size_t  sc_dma_size;
caddr_t sc_dma_kva;

-   struct virtio_net_hdr   *sc_rx_hdrs;
+   int sc_hdr_size;
struct virtio_net_hdr   *sc_tx_hdrs;
struct virtio_net_ctrl_cmd *sc_ctrl_cmd;
struct virtio_net_ctrl_status *sc_ctrl_status;
@@ -227,6 +227,8 @@ struct vio_softc {
 #define VIO_DMAMEM_ENQUEUE(sc, vq, slot, p, size, write)   \
virtio_enqueue_p((vq), (slot), (sc)-sc_dma_map, \
VIO_DMAMEM_OFFSET((sc), (p)), (size), (write))
+#define VIO_HAVE_MRG_RXBUF(sc) \
+   ((sc)-sc_hdr_size == sizeof(struct virtio_net_hdr))

 #define VIRTIO_NET_TX_MAXNSEGS 16 /* for larger chains, defrag */
 #define VIRTIO_NET_CTRL_MAC_MAXENTRIES 64 /* for more entries, use ALLMULTI */
@@ -343,7 +345,6 @@ vio_free_dmamem(struct vio_softc *sc)
 /* allocate memory */
 /*
  * dma memory is used for:
- *   sc_rx_hdrs[slot]:  metadata array for recieved frames (READ)
  *   sc_tx_hdrs[slot]:  metadata array for frames to be sent (WRITE)
  *   sc_ctrl_cmd:   command to be sent via ctrl vq (WRITE)
  *   sc_ctrl_status:return value for a command via ctrl vq (READ)
@@ -355,6 +356,9 @@ vio_free_dmamem(struct vio_softc *sc)
  *  class command (WRITE)
  * sc_ctrl_* structures are allocated only one each; they are protected by
  * sc_ctrl_inuse, which must only be accessed at splnet
+ *
+ * metadata headers for received frames are stored at the start of the
+ * rx mbufs.
  */
 /*
  * dynamically allocated memory is used for:
@@ -367,7 +371,8 @@ int
 vio_alloc_mem(struct vio_softc *sc)
 {
struct virtio_softc *vsc = sc-sc_virtio;
-   int allocsize, r, i;
+   struct ifnet *ifp = sc-sc_ac.ac_if;
+   int allocsize, r, i, txsize;
unsigned int offset = 0;
int rxqsize, txqsize;
caddr_t kva;
@@ -375,8 +380,13 @@ vio_alloc_mem(struct vio_softc *sc)
rxqsize = vsc-sc_vqs[0].vq_num;
txqsize = vsc-sc_vqs[1].vq_num;

-   allocsize = sizeof(struct virtio_net_hdr) * rxqsize;
-   allocsize += sizeof(struct virtio_net_hdr) * txqsize;
+   /*
+* For simplicity, we always allocate the full virtio_net_hdr size
+* even if VIRTIO_NET_F_MRG_RXBUF is not negotiated and
+* only a part of the memory is ever used.
+*/
+   allocsize = sizeof(struct virtio_net_hdr) * txqsize;
+
if (vsc-sc_nvqs == 3) {
allocsize += sizeof(struct virtio_net_ctrl_cmd) * 1;
allocsize += sizeof(struct virtio_net_ctrl_status) * 1;
@@ -393,8 +403,6 @@ vio_alloc_mem(struct vio_softc *sc)
}

kva = sc-sc_dma_kva;
-   sc-sc_rx_hdrs = (struct virtio_net_hdr*)kva;
-   offset += sizeof(struct virtio_net_hdr) * rxqsize;
sc-sc_tx_hdrs = (struct virtio_net_hdr*)(kva + offset);
offset += sizeof(struct virtio_net_hdr) * txqsize;
if (vsc-sc_nvqs == 3) {
@@ -429,9 +437,10 @@ vio_alloc_mem(struct vio_softc *sc)
goto err_reqs;
}

+   txsize = ifp-if_hardmtu + sc-sc_hdr_size + ETHER_HDR_LEN;
for (i = 0; i  txqsize; i++) {
-   r = bus_dmamap_create(vsc-sc_dmat, ETHER_MAX_LEN,
-   VIRTIO_NET_TX_MAXNSEGS, ETHER_MAX_LEN, 0,
+   r = bus_dmamap_create(vsc-sc_dmat, txsize,
+   VIRTIO_NET_TX_MAXNSEGS, txsize, 0,
BUS_DMA_NOWAIT|BUS_DMA_ALLOCNOW,
sc-sc_tx_dmamaps[i]);
if (r != 0)
@@ -502,7 +511,8 @@ vio_attach(struct device *parent, struct device *self, void 
*aux)
vsc-sc_intrhand = virtio_vq_intr;

Re: Small memcpy optimization

2012-11-01 Thread Stefan Fritsch

On Tuesday 21 August 2012, Stefan Fritsch wrote:
 On x86, the xchg operation between reg and mem has an implicit lock
 prefix, i.e. it is a relatively expensive atomic operation. This is
 not needed here.

OKs, anyone?

 --- a/sys/arch/i386/i386/locore.s
 +++ b/sys/arch/i386/i386/locore.s
 @@ -802,8 +802,9 @@ ENTRY(bcopy)
*/
   ENTRY(memcpy)
  movl4(%esp),%ecx
 -   xchg8(%esp),%ecx
 -   movl%ecx,4(%esp)
 +   movl8(%esp),%eax
 +   movl%ecx,8(%esp)
 +   movl%eax,4(%esp)
  jmp _C_LABEL(bcopy)

Re: Small memcpy optimization

2012-11-10 Thread Stefan Fritsch


On Thu, 8 Nov 2012, Mark Kettenis wrote:

On Tuesday 21 August 2012, Stefan Fritsch wrote:

On x86, the xchg operation between reg and mem has an implicit lock
prefix, i.e. it is a relatively expensive atomic operation. This is
not needed here.


OKs, anyone?


What you say makes sense, although it might matter only on MP
(capable) systems.


True, but MP is the norm nowadays.


If you really want to make things faster, I
suppose you could change the code into something like

   pushl   %esi
   pushl   %edi
   movl12(%esp),%edi
   movl16(%esp),%esi


That's true. Like this (suggestions for a better label name are 
welcome):


--- locore.s
+++ locore.s
@@ -789,7 +789,7 @@ ENTRY(bcopy)
pushl   %edi
movl12(%esp),%esi
movl16(%esp),%edi
-   movl20(%esp),%ecx
+bcopy2:movl20(%esp),%ecx
movl%edi,%eax
subl%esi,%eax
cmpl%ecx,%eax   # overlapping?
@@ -827,13 +827,15 @@ ENTRY(bcopy)
ret

 /*
- * Emulate memcpy() by swapping the first two arguments and calling bcopy()
+ * Emulate memcpy() by loading the first two arguments in reverse order
+ * and jumping into bcopy()
  */
 ENTRY(memcpy)
-   movl4(%esp),%ecx
-   xchg8(%esp),%ecx
-   movl%ecx,4(%esp)
-   jmp _C_LABEL(bcopy)
+   pushl   %esi
+   pushl   %edi
+   movl12(%esp),%edi
+   movl16(%esp),%esi
+   jmp bcopy2

 /*/

Re: ##@!#@# gnu tools

2012-11-16 Thread Stefan Fritsch

On Thursday 15 November 2012, Reyk Floeter wrote:
 On Thu, Nov 15, 2012 at 5:11 PM, Marc Espie es...@nerim.net wrote:
  external people regularly ask but why you don't want  to use
  GNU/m4 GNU/make GNU/whatever ?
 
 External people seem to ask weird questions.
 
 I just had to dig into autoconf/auto* because it seems to be a
 must have for a portable project. Yuck! 

That's all a matter of perspective. If you work on sane platforms like 
Linux and BSDs, you always think what do I need this autotools/libtool 
crap for, it would be much easier without them. But once you work on 
really weird platforms like AIX and do non-trivial tasks (like 
building shared libraries :-o ), autotools/libtool are sent from 
heaven and are really *so* much easier to use than what is available 
natively. I guess another example is Mac OS with its universal 
binaries that may contain both 32 and 64 bit, and both intel and 
powerpc code in the same file.

Also, for non-trivial projects, the advice don't write portable make 
files, use a portable make instead is very much true. Posix 
compatible make lacks even the most basic features.

Stefan

[PATCH] Support for virtio random device

2013-01-25 Thread Stefan Fritsch


Hi,

qemu 1.3 has added a virtio entropy device. Here is a driver for it. 
Comments? OKs?


As the entropy reserve of the host may not be unlimited, the OpenBSD guest 
should only ask for entropy when it actually needs it. Would it make sense 
to add an API that allows a driver to determine how full the entropy queue 
is? This could also be used by some hardware drivers to avoid polling for 
entropy if not necessary. E.g. amdpm does the polling in every timer tick, 
which is wasteful.


Cheers,
Stefan


diff --git share/man/man4/Makefile share/man/man4/Makefile
index ab6d258..3e36766 100644
--- share/man/man4/Makefile
+++ share/man/man4/Makefile
@@ -66,9 +66,9 @@ MAN=  aac.4 ac97.4 acphy.4 \
uthum.4 uticom.4 utwitch.4 utrh.4 uts.4 uvideo.4 uvisor.4 uvscom.4 \
uyap.4 \
vether.4 vga.4 vgafb.4 vge.4 \
-   viapm.4 viasio.4 vic.4 video.4 vio.4 vioblk.4 viomb.4 virtio.4 vlan.4 \
-   vmt.4 vnd.4 vr.4 \
-   vscsi.4 vte.4 \
+   viapm.4 viasio.4 vic.4 video.4 \
+   vio.4 vioblk.4 viomb.4 viornd.4 virtio.4 \
+   vlan.4 vmt.4 vnd.4 vr.4 vscsi.4 vte.4 \
watchdog.4 wb.4 wbenv.4 wbng.4 wbsd.4 wbsio.4 wd.4 wdc.4 we.4 \
wi.4 wpi.4 wscons.4 wsdisplay.4 wskbd.4 wsmouse.4 wsmux.4 \
xe.4 xf86.4 xge.4 xl.4 xmphy.4 yds.4 ym.4 zero.4 zyd.4
diff --git share/man/man4/viornd.4 share/man/man4/viornd.4
new file mode 100644
index 000..09ff465
--- /dev/null
+++ share/man/man4/viornd.4
@@ -0,0 +1,45 @@
+.\ $OpenBSD: $
+.\
+.\ Copyright (c) 2013 Stefan Fritsch s...@sfritsch.de
+.\
+.\ Permission to use, copy, modify, and distribute this software for any
+.\ purpose with or without fee is hereby granted, provided that the above
+.\ copyright notice and this permission notice appear in all copies.
+.\
+.\ THE SOFTWARE IS PROVIDED AS IS AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+.\ WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+.\ MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+.\ ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+.\ WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+.\ ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+.\ OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+.\
+.Dd $Mdocdate: January 25 2013 $
+.Dt VIORND 4
+.Os
+.Sh NAME
+.Nm viornd
+.Nd VirtIO random number device
+.Sh SYNOPSIS
+.Cd viornd* at virtio?
+.Sh DESCRIPTION
+The
+.Nm
+driver provides a virtual random number generator using a
+.Xr virtio 4
+entropy device provided by QEMU 1.3 and later, and possibly by other
+hypervisors.
+.Sh SEE ALSO
+.Xr intro 4 ,
+.Xr virtio 4
+.Sh HISTORY
+The
+.Nm
+driver first appeared in
+.Ox 5.3 .
+.Sh AUTHORS
+.An -nosplit
+The
+.Nm
+driver was written by
+.An Stefan Fritsch Aq s...@sfritsch.de .
diff --git share/man/man4/virtio.4 share/man/man4/virtio.4
index 16ad4ba..33ce729 100644
--- share/man/man4/virtio.4
+++ share/man/man4/virtio.4
@@ -40,6 +40,8 @@ VirtIO network device
 VirtIO disk
 .It Xr viomb 4
 VirtIO memory ballooning driver
+.It Xr viornd 4
+VirtIO random number device
 .El
 .Sh SEE ALSO
 .Xr intro 4
diff --git sys/arch/amd64/conf/GENERIC sys/arch/amd64/conf/GENERIC
index ca0a4b2..8f41464 100644
--- sys/arch/amd64/conf/GENERIC
+++ sys/arch/amd64/conf/GENERIC
@@ -612,3 +612,4 @@ virtio* at pci? # Virtio PCI device
 vioblk*at virtio?  # Virtio block device
 vio*   at virtio?  # Virtio network device
 viomb* at virtio?  # Virtio memory ballooning device
+viornd*at virtio?  # Virtio entropy device
diff --git sys/arch/i386/conf/GENERIC sys/arch/i386/conf/GENERIC
index 190677e..9c0417d 100644
--- sys/arch/i386/conf/GENERIC
+++ sys/arch/i386/conf/GENERIC
@@ -802,3 +802,4 @@ virtio* at pci? # Virtio PCI device
 vioblk*at virtio?  # Virtio block device
 vio*   at virtio?  # Virtio network device
 viomb* at virtio?  # Virtio memory ballooning device
+viornd*at virtio?  # Virtio entropy device
diff --git sys/dev/pci/files.pci sys/dev/pci/files.pci
index aebacb8..3155a7f 100644
--- sys/dev/pci/files.pci
+++ sys/dev/pci/files.pci
@@ -844,3 +844,7 @@ filedev/pci/vioblk.cvioblk
 device viomb
 attach viomb at virtio
 file   dev/pci/viomb.c viomb
+
+device viornd
+attach viornd at virtio
+file   dev/pci/viornd.cviornd
diff --git sys/dev/pci/viornd.c sys/dev/pci/viornd.c
new file mode 100644
index 000..3669df3
--- /dev/null
+++ sys/dev/pci/viornd.c
@@ -0,0 +1,168 @@
+/* $OpenBSD: $ */
+
+/*
+ * Copyright (c) 2013 Stefan Fritsch s...@sfritsch.de
+ *
+ * Permission to use, copy, modify, and distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED

Re: [PATCH] Support for virtio random device

2013-01-26 Thread Stefan Fritsch

On Saturday 26 January 2013, Mark Kettenis wrote:
  As the entropy reserve of the host may not be unlimited, the
  OpenBSD guest should only ask for entropy when it actually needs
  it.
 
 If qemu really allows the guest to consume the complete entropy
 reserve of the host then that's a bug in qemu and/or the host OS.

That's a red herring. Entropy is a sparse resource on most systems. If 
the guests don't make any effort to only use as much as they need, the 
host has no chance to dispense the existing entropy where it is 
actually needed.

 
  Would it make sense to add an API that allows a driver to
  determine how full the entropy queue is? This could also be used
  by some hardware drivers to avoid polling for entropy if not
  necessary. E.g. amdpm does the polling in every timer tick, which
  is wasteful.
 
 Wasteful in what sense?  I doubt that reading the hardware
 registers once every 10ms noticibly affects performance of the
 system.

But many small improvements may give a noticable effect on performance 
in the end.

Re: [PATCH] Support for virtio random device

2013-01-28 Thread Stefan Fritsch


On Sun, 27 Jan 2013, Damien Miller wrote:


On Fri, 25 Jan 2013, Stefan Fritsch wrote:


Hi,

qemu 1.3 has added a virtio entropy device. Here is a driver for it.
Comments?
OKs?

As the entropy reserve of the host may not be unlimited, the OpenBSD guest
should only ask for entropy when it actually needs it. Would it make sense to
add an API that allows a driver to determine how full the entropy queue is?
This could also be used by some hardware drivers to avoid polling for entropy
if not necessary. E.g. amdpm does the polling in every timer tick, which is
wasteful.


I don't think such an API is necessary.

For this particular case, I agree with Kettenis that if qemu allows exhaustion
of the hypervisor's PRNG using this call then the bug lies with qemu.


Qemu can do rate limiting, so it can prevent the host's entropy from being 
drained. But then guests that actually have need of lots of good entropy 
won't be able to get it.



That being said, if you want to avoid this bug then reading a largeish
seed once at system boot and stirring it into the PRNG would be sufficient
to ensure the PRNG pool is unpredictable to an adversary who tries to guess
its state.


That's more or less what the driver I posted does. But with the API to 
query the random queue's state it could do better than that.

[Patch] Fix subr_disk.c to allow compiling with -Wno-error=format

2013-04-06 Thread Stefan Fritsch

The log(pri, ) triggers a 'zero-length kprintf format string' warning 
which causes gcc to error out despite -Wno-error=format.


OK to commit this patch?


--- sys/kern/subr_disk.c
+++ sys/kern/subr_disk.c
@@ -761,13 +761,14 @@ diskerr(struct buf *bp, char *dname, char *what, int pri, 
int blkdone,
daddr64_t sn;

if (pri != LOG_PRINTF) {
-   static const char fmt[] = ;
-   log(pri, fmt);
+   log(pri, %s%d%c: %s %sing fsbn , dname, unit, partname, what,
+   bp-b_flags  B_READ ? read : writ);
pr = addlog;
-   } else
+   } else {
+   printf(%s%d%c: %s %sing fsbn , dname, unit, partname, what,
+   bp-b_flags  B_READ ? read : writ);
pr = printf;
-   (*pr)(%s%d%c: %s %sing fsbn , dname, unit, partname, what,
-   bp-b_flags  B_READ ? read : writ);
+   }
sn = bp-b_blkno;
if (bp-b_bcount = DEV_BSIZE)
(*pr)(%lld, sn);

Re: Removing -Wno-format from kernel makefiles

2013-04-14 Thread Stefan Fritsch

On Saturday 06 April 2013, Stefan Fritsch wrote:
 That wasn't clear from Miod's response. So the policy is to assume 
 that char/short/int/long long are 8/16/32/64 bits and that intptr_t
 is  long? Should that be added to style(9), then?

I assume that the lack of a response means yes.

The patches at http://www.sfritsch.de/~stf/openbsd-format-string/ fix 
all warnings on i386 and do not cause new warnings on amd64.

Since the kernel's printf does not support %td for ptrdiff_t, I have 
used %ld instead. %zd would also work. Is there a preferred way?

Re: Removing -Wno-format from kernel makefiles

2013-04-15 Thread Stefan Fritsch


On Sun, 14 Apr 2013, Theo de Raadt wrote:

That wasn't clear from Miod's response. So the policy is to assume
that char/short/int/long long are 8/16/32/64 bits and that intptr_t
is  long?


We only run on C8S16I32L32P32 and C8S16I32L64P64 architectures.  Short
names ILP32 and I32LP64.  The addition of any other sizing would
require us to fix thousands of subtle bugs, and crank the ABI on every
architecture.


Should that be added to style(9), then?
I assume that the lack of a response means yes.


Perhaps.  Thing is, we don't want people to worsen the current
situation for decades down the line in case someone tries to create
a 'popular' architecture with I64.


Or LL128 ...
But if such architectures come around, all kinds of things will need to be 
changed. This should not prevent anyone from documenting the current best 
practices.



Since the kernel's printf does not support %td for ptrdiff_t, I have
used %ld instead. %zd would also work. Is there a preferred way?


Or try to add %td support to kernel printf?


That would also require gcc to be taught that %td is now allowed. I don't 
know how difficult that is, but I will take a look when I have some time.

Re: Removing -Wno-format from kernel makefiles

2013-04-20 Thread Stefan Fritsch


On Sun, 14 Apr 2013, Theo de Raadt wrote:

Since the kernel's printf does not support %td for ptrdiff_t, I have
used %ld instead. %zd would also work. Is there a preferred way?


Or try to add %td support to kernel printf?


The patch below seems to do the trick for gcc 4.2.

I can't see that we would add an arch where sizeof(ptrdiff_t) != 
sizeof(size_t). Therefore I don't see any benefit in adding code that 
deals with this case and which would probably be just slightly to 
complicated for the compiler to optimize away on sane archs.


Checking sizeof(ptrdiff_t) == sizeof(size_t) there would be a good use for 
a compile time assert macro (like _Static_assert in C11). Do we have that 
and I didn't find it? Or do we want to add one?


Adding ptrdiff_t to stdint.h is not strictly necessary, but I see no 
reason not to do it.



--- a/gnu/gcc/gcc/c-format.c
+++ b/gnu/gcc/gcc/c-format.c
@@ -325,6 +325,7 @@ static const format_length_info kprintf_length_specs[] =
   { l, FMT_LEN_l, STD_C89, ll, FMT_LEN_ll, STD_C9L },
   { q, FMT_LEN_ll, STD_EXT, NULL, 0, 0 },
   { z, FMT_LEN_z, STD_C99, NULL, 0, 0 },
+  { t, FMT_LEN_t, STD_C99, NULL, 0, 0 },
   { NULL, 0, 0, NULL, 0, 0 }
 };

@@ -552,9 +553,9 @@ static const format_char_info asm_fprintf_char_table[] =
 static const format_char_info kprint_char_table[] =
 {
   /* C89 conversion specifiers.  */ 
-  { di,  0, STD_C89, { T89_I,   BADLEN, T89_S,   T89_L,   T9L_LL,  BADLEN,  T99_SST, BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0 +'I, i, NULL }, 
-  { oxX, 0, STD_C89, { T89_UI,  BADLEN, T89_US,  T89_UL,  T9L_ULL, BADLEN,  T99_ST,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0#,i, NULL }, 
-  { u,   0, STD_C89, { T89_UI,  BADLEN, T89_US,  T89_UL,  T9L_ULL, BADLEN,  T99_ST,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0'I,   i, NULL }, 
+  { di,  0, STD_C89, { T89_I,   BADLEN, T89_S,   T89_L,   T9L_LL,  BADLEN,  T99_SST, T99_PD,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0 +'I, i, NULL }, 
+  { oxX, 0, STD_C89, { T89_UI,  BADLEN, T89_US,  T89_UL,  T9L_ULL, BADLEN,  T99_ST,  T99_UPD,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0#,i, NULL }, 
+  { u,   0, STD_C89, { T89_UI,  BADLEN, T89_US,  T89_UL,  T9L_ULL, BADLEN,  T99_ST,  T99_UPD,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0'I,   i, NULL },

   { c,   0, STD_C89, { T89_I,   BADLEN, BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  
BADLEN,  BADLEN,  BADLEN,  BADLEN }, -w,   , NULL },
   { s,   1, STD_C89, { T89_C,   BADLEN, BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  
BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp,  cR, NULL },
   { p,   1, STD_C89, { T89_V,   BADLEN, BADLEN,  T89_UL,  T9L_LL,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0, c, NULL }, 
--- a/sys/kern/subr_prf.c

+++ b/sys/kern/subr_prf.c
@@ -834,6 +834,9 @@ reswitch:   switch (ch) {
case 'q':
flags |= QUADINT;
goto rflag;
+   case 't':
+   /* for us, ptrdiff_t (C) is equal to ssize_t (POSIX) */
+   /* FALLTHROUGH */
case 'z':
flags |= SIZEINT;
goto rflag;
--- a/sys/sys/stdint.h
+++ b/sys/sys/stdint.h
@@ -95,6 +95,9 @@ typedef   __uintptr_t uintptr_t;
 typedef__intmax_t  intmax_t;
 typedef__uintmax_t uintmax_t;

+/* Integer type of the result of substracting two pointers */
+typedef__ptrdiff_t ptrdiff_t;
+
 #if !defined(__cplusplus) || defined(__STDC_LIMIT_MACROS)
 /*
  * 7.18.2 Limits of specified-width integer types.




For older gcc, it seems more difficult to add support for %t. At least, 
the code is quite different. Maybe this patch does the trick but it is 
untested. And it again assumes that sizeof(size_t) == sizeof(ptrdiff_t). 
IMHO simply keeping -Wno-format on architectures that use old gcc versions 
should be ok, too.


--- a/gnu/egcs/gcc/c-common.c
+++ b/gnu/egcs/gcc/c-common.c
@@ -2497,7 +2497,7 @@ check_format_info (info, params)
  if (pedantic)
warning (ANSI C does not support the `Z' length modifier);
}
- else if (*format_chars == 'z')
+ else if (*format_chars == 'z' || *format_chars == 't')
{
  length_char = 'Z', format_chars++;
  /* FIXME: Is allowed in ISO C 9x.  */

Re: Removing -Wno-format from kernel makefiles

2013-04-21 Thread Stefan Fritsch

Moved ptrdiff_t definition to types.h with proper guards. Assume 
ptrdiff_t is long in subr_prf.c


OK?

--- a/gnu/gcc/gcc/c-format.c
+++ b/gnu/gcc/gcc/c-format.c
@@ -325,6 +325,7 @@ static const format_length_info kprintf_length_specs[] =
   { l, FMT_LEN_l, STD_C89, ll, FMT_LEN_ll, STD_C9L },
   { q, FMT_LEN_ll, STD_EXT, NULL, 0, 0 },
   { z, FMT_LEN_z, STD_C99, NULL, 0, 0 },
+  { t, FMT_LEN_t, STD_C99, NULL, 0, 0 },
   { NULL, 0, 0, NULL, 0, 0 }
 };

@@ -552,9 +553,9 @@ static const format_char_info asm_fprintf_char_table[] =
 static const format_char_info kprint_char_table[] =
 {
   /* C89 conversion specifiers.  */ 
-  { di,  0, STD_C89, { T89_I,   BADLEN, T89_S,   T89_L,   T9L_LL,  BADLEN,  T99_SST, BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0 +'I, i, NULL }, 
-  { oxX, 0, STD_C89, { T89_UI,  BADLEN, T89_US,  T89_UL,  T9L_ULL, BADLEN,  T99_ST,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0#,i, NULL }, 
-  { u,   0, STD_C89, { T89_UI,  BADLEN, T89_US,  T89_UL,  T9L_ULL, BADLEN,  T99_ST,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0'I,   i, NULL }, 
+  { di,  0, STD_C89, { T89_I,   BADLEN, T89_S,   T89_L,   T9L_LL,  BADLEN,  T99_SST, T99_PD,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0 +'I, i, NULL }, 
+  { oxX, 0, STD_C89, { T89_UI,  BADLEN, T89_US,  T89_UL,  T9L_ULL, BADLEN,  T99_ST,  T99_UPD,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0#,i, NULL }, 
+  { u,   0, STD_C89, { T89_UI,  BADLEN, T89_US,  T89_UL,  T9L_ULL, BADLEN,  T99_ST,  T99_UPD,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0'I,   i, NULL },

   { c,   0, STD_C89, { T89_I,   BADLEN, BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  
BADLEN,  BADLEN,  BADLEN,  BADLEN }, -w,   , NULL },
   { s,   1, STD_C89, { T89_C,   BADLEN, BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  
BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp,  cR, NULL },
   { p,   1, STD_C89, { T89_V,   BADLEN, BADLEN,  T89_UL,  T9L_LL,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0, c, NULL }, 
--- a/sys/kern/subr_prf.c

+++ b/sys/kern/subr_prf.c
@@ -842,6 +842,9 @@ reswitch:   switch (ch) {
size = 1;
sign = '\0';
break;
+   case 't':
+   /* assume ptrdiff_t is long */
+   /* FALLTHROUGH */
case 'D':
flags |= LONGINT;
/*FALLTHROUGH*/
--- a/sys/sys/types.h
+++ b/sys/sys/types.h
@@ -177,6 +177,11 @@ typedef__clockid_t clockid_t;
 typedef__size_tsize_t;
 #endif

+#ifndef_PTRDIFF_T_DEFINED_
+#define_PTRDIFF_T_DEFINED_
+typedef__ptrdiff_t ptrdiff_t;
+#endif
+
 #ifndef_SSIZE_T_DEFINED_
 #define_SSIZE_T_DEFINED_
 typedef__ssize_t   ssize_t;

Re: Removing -Wno-format from kernel makefiles

2013-04-21 Thread Stefan Fritsch

On Sunday 21 April 2013, Mark Kettenis wrote:

  +++ b/sys/sys/types.h
  @@ -177,6 +177,11 @@ typedef__clockid_t clockid_t;
  
typedef   __size_tsize_t;
#endif
  
  +#ifndef_PTRDIFF_T_DEFINED_
  +#define_PTRDIFF_T_DEFINED_
  +typedef__ptrdiff_t ptrdiff_t;
  +#endif
 
 You can't put ptrdiff_t in sys/types.h either.

Why not? POSIX only speaks about at least the following types in 
sys/types.h, so there is no reason why it may not define more types. 
And the ifndef guard will prevent conflicts with stddef.h (which is 
the canonical place for ptrdiff_t according to C99). 

 I lost track a bit what you're trying to achieve here (been a tad
 bit busy).  But I don't think we'd want to use ptrdiff_t in the
 kernel.

The starting point was that the compiler produces ptrdiff_t when 
subtracting two pointers, and Theo preferred using the standard 
compliant %td format string for it over %ld (or %zd).

And if we have a printf format string for ptrdiff_t, it makes sense to 
be able to cast to it. But it's not strictly necessary, of course.

Re: Removing -Wno-format from kernel makefiles

2013-04-21 Thread Stefan Fritsch

On Sunday 21 April 2013, Philip Guenther wrote:
 On Sun, Apr 21, 2013 at 12:15 PM, Stefan Fritsch s...@sfritsch.de 
wrote:
  On Sunday 21 April 2013, Mark Kettenis wrote:
 ...
 
  You can't put ptrdiff_t in sys/types.h either.
  
  Why not? POSIX only speaks about at least the following types
  in sys/types.h, so there is no reason why it may not define more
  types.
 
 Well, it can define others as long as it stays within the namespace
 restrictions.  It turns out that ptrdiff_t *is* within the
 permitted namespace (because it ends with _t).  However, we have
 generally tried to avoid defining more than is required to be
 present so that devs don't accidentally rely on a non-portable
 assumption like that.
 
 Maybe this should be handled the bool bits in sys/types.h,
 wrapping them in #ifdef _KERNEL ?

#ifdef _KERNEL is fine with me.

 Alternatively, if subr_prf.c is the only kernel file which will
 actually need the ptrdiff_t type then perhaps it should just use
 __ptrdiff_t directly.

Actually it's not needed, yet. We could postpone adding it until 
someone actually needs it. But my guess is that this would lead to 
that someone simply choosing some other type.

Re: Removing -Wno-format from kernel makefiles

2013-04-28 Thread Stefan Fritsch


On Sun, 21 Apr 2013, Stefan Fritsch wrote:

On Sunday 21 April 2013, Philip Guenther wrote:

Maybe this should be handled the bool bits in sys/types.h,
wrapping them in #ifdef _KERNEL ?


#ifdef _KERNEL is fine with me.


OK for the patch below?

--- a/gnu/gcc/gcc/c-format.c
+++ b/gnu/gcc/gcc/c-format.c
@@ -325,6 +325,7 @@ static const format_length_info kprintf_length_specs[] =
   { l, FMT_LEN_l, STD_C89, ll, FMT_LEN_ll, STD_C9L },
   { q, FMT_LEN_ll, STD_EXT, NULL, 0, 0 },
   { z, FMT_LEN_z, STD_C99, NULL, 0, 0 },
+  { t, FMT_LEN_t, STD_C99, NULL, 0, 0 },
   { NULL, 0, 0, NULL, 0, 0 }
 };

@@ -552,9 +553,9 @@ static const format_char_info asm_fprintf_char_table[] =
 static const format_char_info kprint_char_table[] =
 {
   /* C89 conversion specifiers.  */ 
-  { di,  0, STD_C89, { T89_I,   BADLEN, T89_S,   T89_L,   T9L_LL,  BADLEN,  T99_SST, BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0 +'I, i, NULL }, 
-  { oxX, 0, STD_C89, { T89_UI,  BADLEN, T89_US,  T89_UL,  T9L_ULL, BADLEN,  T99_ST,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0#,i, NULL }, 
-  { u,   0, STD_C89, { T89_UI,  BADLEN, T89_US,  T89_UL,  T9L_ULL, BADLEN,  T99_ST,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0'I,   i, NULL }, 
+  { di,  0, STD_C89, { T89_I,   BADLEN, T89_S,   T89_L,   T9L_LL,  BADLEN,  T99_SST, T99_PD,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0 +'I, i, NULL }, 
+  { oxX, 0, STD_C89, { T89_UI,  BADLEN, T89_US,  T89_UL,  T9L_ULL, BADLEN,  T99_ST,  T99_UPD,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0#,i, NULL }, 
+  { u,   0, STD_C89, { T89_UI,  BADLEN, T89_US,  T89_UL,  T9L_ULL, BADLEN,  T99_ST,  T99_UPD,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0'I,   i, NULL },

   { c,   0, STD_C89, { T89_I,   BADLEN, BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  
BADLEN,  BADLEN,  BADLEN,  BADLEN }, -w,   , NULL },
   { s,   1, STD_C89, { T89_C,   BADLEN, BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  
BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp,  cR, NULL },
   { p,   1, STD_C89, { T89_V,   BADLEN, BADLEN,  T89_UL,  T9L_LL,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0, c, NULL }, 
--- a/sys/kern/subr_prf.c

+++ b/sys/kern/subr_prf.c
@@ -842,6 +842,9 @@ reswitch:   switch (ch) {
size = 1;
sign = '\0';
break;
+   case 't':
+   /* assume ptrdiff_t is long */
+   /* FALLTHROUGH */
case 'D':
flags |= LONGINT;
/*FALLTHROUGH*/
--- a/sys/sys/types.h
+++ b/sys/sys/types.h
@@ -177,6 +177,13 @@ typedef__clockid_t clockid_t;
 typedef__size_tsize_t;
 #endif

+#ifdef _KERNEL
+#ifndef_PTRDIFF_T_DEFINED_
+#define_PTRDIFF_T_DEFINED_
+typedef__ptrdiff_t ptrdiff_t;
+#endif
+#endif
+
 #ifndef_SSIZE_T_DEFINED_
 #define_SSIZE_T_DEFINED_
 typedef__ssize_t   ssize_t;

Re: Removing -Wno-format from kernel makefiles

2013-04-28 Thread Stefan Fritsch

On Sun, 28 Apr 2013, Stefan Fritsch wrote:

 On Sun, 21 Apr 2013, Stefan Fritsch wrote:
  On Sunday 21 April 2013, Philip Guenther wrote:
   Maybe this should be handled the bool bits in sys/types.h,
   wrapping them in #ifdef _KERNEL ?
  
  #ifdef _KERNEL is fine with me.
 
 OK for the patch below?

FWIW, re-sending with unbroken MUA settings.

--- a/gnu/gcc/gcc/c-format.c
+++ b/gnu/gcc/gcc/c-format.c
@@ -325,6 +325,7 @@ static const format_length_info kprintf_length_specs[] =
   { l, FMT_LEN_l, STD_C89, ll, FMT_LEN_ll, STD_C9L },
   { q, FMT_LEN_ll, STD_EXT, NULL, 0, 0 },
   { z, FMT_LEN_z, STD_C99, NULL, 0, 0 },
+  { t, FMT_LEN_t, STD_C99, NULL, 0, 0 },
   { NULL, 0, 0, NULL, 0, 0 }
 };
 
@@ -552,9 +553,9 @@ static const format_char_info asm_fprintf_char_table[] =
 static const format_char_info kprint_char_table[] = 
 { 
   /* C89 conversion specifiers.  */ 
-  { di,  0, STD_C89, { T89_I,   BADLEN, T89_S,   T89_L,   T9L_LL,  BADLEN,  
T99_SST, BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0 +'I, i, NULL }, 
-  { oxX, 0, STD_C89, { T89_UI,  BADLEN, T89_US,  T89_UL,  T9L_ULL, BADLEN,  
T99_ST,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0#,i, NULL }, 
-  { u,   0, STD_C89, { T89_UI,  BADLEN, T89_US,  T89_UL,  T9L_ULL, BADLEN,  
T99_ST,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0'I,   i, NULL }, 
+  { di,  0, STD_C89, { T89_I,   BADLEN, T89_S,   T89_L,   T9L_LL,  BADLEN,  
T99_SST, T99_PD,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0 +'I, i, NULL }, 
+  { oxX, 0, STD_C89, { T89_UI,  BADLEN, T89_US,  T89_UL,  T9L_ULL, BADLEN,  
T99_ST,  T99_UPD,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0#,i, NULL 
}, 
+  { u,   0, STD_C89, { T89_UI,  BADLEN, T89_US,  T89_UL,  T9L_ULL, BADLEN,  
T99_ST,  T99_UPD,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0'I,   i, NULL 
}, 
   { c,   0, STD_C89, { T89_I,   BADLEN, BADLEN,  BADLEN,  BADLEN,  BADLEN,  
BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -w,   , NULL }, 
   { s,   1, STD_C89, { T89_C,   BADLEN, BADLEN,  BADLEN,  BADLEN,  BADLEN,  
BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp,  cR, NULL 
}, 
   { p,   1, STD_C89, { T89_V,   BADLEN, BADLEN,  T89_UL,  T9L_LL,  BADLEN,  
BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN,  BADLEN }, -wp0, c, NULL }, 
--- a/sys/kern/subr_prf.c
+++ b/sys/kern/subr_prf.c
@@ -842,6 +842,9 @@ reswitch:   switch (ch) {
size = 1;
sign = '\0';
break;
+   case 't':
+   /* assume ptrdiff_t is long */
+   /* FALLTHROUGH */
case 'D':
flags |= LONGINT;
/*FALLTHROUGH*/
--- a/sys/sys/types.h
+++ b/sys/sys/types.h
@@ -177,6 +177,13 @@ typedef__clockid_t clockid_t;
 typedef__size_tsize_t;
 #endif
 
+#ifdef _KERNEL
+#ifndef_PTRDIFF_T_DEFINED_
+#define_PTRDIFF_T_DEFINED_
+typedef__ptrdiff_t ptrdiff_t;
+#endif
+#endif
+
 #ifndef_SSIZE_T_DEFINED_
 #define_SSIZE_T_DEFINED_
 typedef__ssize_t   ssize_t;

Compile time assert macro

2013-05-03 Thread Stefan Fritsch

Hi,

I think a CTASSERT macro like in FreeBSD would be nice. It could be used 
to verify assumptions about type and struct sizes.

This should work on gcc 2.9.5, too, but I haven't tested it.

--- sys/lib/libkern/libkern.h
+++ sys/lib/libkern/libkern.h
@@ -138,6 +138,9 @@ abs(int j)
 #endif
 #endif
 
+#defineCTASSERT(x) extern char  _ctassert[(x) ? 1 : -1 ]   \
+   __attribute__((__unused__))
+
 /* Prototypes for non-quad routines. */
 void__assert(const char *, const char *, int, const char *)
__attribute__ ((__noreturn__));

Re: Compile time assert macro

2013-05-17 Thread Stefan Fritsch

On Fri, 3 May 2013, Stefan Fritsch wrote:
 I think a CTASSERT macro like in FreeBSD would be nice. It could be used 
 to verify assumptions about type and struct sizes.

Ariane suggested making the syntax compatible to the C11 variant.

OKs for this version?

--- a/sys/lib/libkern/libkern.h
+++ b/sys/lib/libkern/libkern.h
@@ -138,6 +138,12 @@ abs(int j)
 #endif
 #endif
 
+#if __STDC_VERSION__  201112L
+#define_Static_assert(predicate, msg)  \
+   extern char __static_assert[(predicate) ? 1 : -1 ]  \
+   __attribute__((__unused__))
+#endif
+
 /* Prototypes for non-quad routines. */
 void__assert(const char *, const char *, int, const char *)
__attribute__ ((__noreturn__));
--- a/share/man/man9/kern.9
+++ b/share/man/man9/kern.9
@@ -108,6 +108,20 @@ enabled.
 tests are only included if the kernel has
 .Dv DEBUG
 enabled.
+.Pp
+.nr nS 1
+.Fn _Static_assert CONDITION MESSAGE
+.nr nS 0
+.Pp
+This is a standard feature in C11 and is implemented as a macro if compiling
+with a pre-C11 compiler.
+It causes a compile time error if the given condition evaluates to
+false.
+Its main purpose is to verify assertions about type and struct sizes that
+would otherwise make the code fail at run time.
+.Fn _Static_assert
+can be used in global scope or at the start of blocks, where variable
+declarations are allowed.
 .Sh BYTE STRINGS
 .nr nS 1
 .Ft int

Binary code patching and paravirtualization

2014-12-09 Thread Stefan Fritsch

Hi,

in summer, I posted some paravirt patches for amd64. In response to the 
comments I received then, I have created some infrastructure to binary 
patch kernel code during boot. In order to get some feedback, I am posting 
the whole paravirt  code patching diff here. Also, KVM users may be 
interested in trying it (hi Antoine ;).

For the codepatching part, the most interesting files are codepatch.c and 
codepatch.h.  In copy.S and cpu.c, I convert the code patching already 
done for SMAP to the new infrastructure (if someone has a machine with 
SMAP support, testing would be nice).

The basic approach is to create macros that surround a to-be-patched code 
part and store pointer, length, and a tag in a dedicated section. Then 
there are functions to replace all code parts with a specific tag. Does 
this part of the diff look sensible?

The rest of the diff is paravirtualization for KVM that partially also 
uses the code patching infrastructure. The paravirtualization stuff seems 
to work fine and gives some nice speed-up, but it still needs plenty of 
cleanup before it can be committed.

There are some paravirtualization bits that are not specific to KVM. For 
example, I introduce a running_on_hypervisor() function and skip some 
delays during shutdown if it returns true. I think this should work for 
most (all?) hypervisors, though it still needs some detection logic. 
Making it depend on the CPUIDECX_HYPERV bit should catch the most common 
hypervisors. Is this something we could commit already in order to see if 
it causes any problems?

Another question: Most of the infrastructure and the pointers to the code 
that needs patching are only used at boot. Is there already a way to put 
that into a dedicated section that is freed and reused after boot?

Cheers,
Stefan

===

 amd64/codepatch.c |  183 ++
 amd64/copy.S  |   40 --
 amd64/cpu.c   |   63 -
 amd64/genassym.cf |3 
 amd64/identcpu.c  |1 
 amd64/lapic.c |   64 +++--
 amd64/machdep.c   |9 +-
 amd64/mainbus.c   |8 ++
 amd64/paravirt.c  |  161 +++
 amd64/vector.S|  139 +
 conf/GENERIC  |2 
 conf/files.amd64  |6 +
 include/codepatch.h   |   53 ++
 include/cpu.h |4 +
 include/cpuvar.h  |   12 +++
 include/i82093reg.h   |   26 ++-
 include/i82489var.h   |   13 +++
 include/paravirtvar.h |   72 +++
 include/specialreg.h  |1 
 19 files changed, 776 insertions(+), 84 deletions(-)

===

diff --git a/sys/arch/amd64/amd64/codepatch.c b/sys/arch/amd64/amd64/codepatch.c
new file mode 100644
index 000..3f5eba0
--- /dev/null
+++ b/sys/arch/amd64/amd64/codepatch.c
@@ -0,0 +1,183 @@
+/*
+ * Copyright (c) 2014 Stefan Fritsch s...@sfritsch.de
+ *
+ * Permission to use, copy, modify, and distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED AS IS AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+ * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+ */
+
+#include sys/param.h
+#include machine/paravirtvar.h
+#include machine/codepatch.h
+#include uvm/uvm_extern.h
+
+#if 1
+#define DBGPRINT(fmt, args...) printf(%s:  fmt \n, __func__, ## args)
+#else
+#define DBGPRINT(fmt, args...) do {} while (0)
+#endif
+
+struct codepatch {
+   uint32_t offset;
+   uint16_t len;
+   uint16_t tag;
+};
+
+extern struct codepatch codepatch_begin;
+extern struct codepatch codepatch_end;
+
+static const int nop_len_min = 2;
+static const int nop_len_max = 9;
+static const unsigned char nops[][9] = {
+   { 0x66, 0x90 },
+   { 0x0F, 0x1F, 0x00 },
+   { 0x0F, 0x1F, 0x40, 0x00 },
+   { 0x0F, 0x1F, 0x44, 0x00, 0x00 },
+   { 0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00 },
+   { 0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00 },
+   { 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 },
+   { 0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00},
+};
+
+void
+codepatch_fill_nop(void *caddr, uint16_t len)
+{
+   unsigned char *addr = caddr;
+   uint16_t nop_len;
+
+   while (len  0) {
+   if (len  nop_len_min) {
+   panic(%s: can't patch NOP with len 1 at %p,
+   __func__, caddr);
+   }
+   if (len = nop_len_max

Re: Binary code patching and paravirtualization

2014-12-09 Thread Stefan Fritsch

On Wednesday 10 December 2014 00:52:44, Antoine Jacoutot wrote:
 I assume we still need this:
 printf 'change paravirt0\ny\n\n0x8\nq\n'  | config -ef /bsd
 under Illumos?

Correct

Re: Binary code patching and paravirtualization

2014-12-11 Thread Stefan Fritsch

On Thursday 11 December 2014 22:20:06, Jonathan Gray wrote:
 On Wed, Dec 10, 2014 at 12:32:02AM +0100, Stefan Fritsch wrote:
  For the codepatching part, the most interesting files are
  codepatch.c and codepatch.h.  In copy.S and cpu.c, I convert the
  code patching already done for SMAP to the new infrastructure (if
  someone has a machine with SMAP support, testing would be nice).
 
 There aren't so many machines with SMAP commercially released only
 a handful of 2 in 1 devices with Core M processors.
 
 You should be able to test with the qemu support added by Intel
 however. ie
 qemu-system-x86_64 -cpu qemu64,+smap
 or
 qemu-system-x86_64 -cpu Broadwell

thanks for the info. I will try that.

 
  The basic approach is to create macros that surround a
  to-be-patched code part and store pointer, length, and a tag in a
  dedicated section. Then there are functions to replace all code
  parts with a specific tag. Does this part of the diff look
  sensible?
 
 What about i386?  Though not included in your diff we can't assume
 the various multi byte nop sequences are valid there.
 
 The Intel instruction set reference says:
 The multi-byte NOP instruction performs no operation on supported
 processors and generates undefined opcode exception on processors
 that do not support the multi-byte NOP instruction.
 
 It seems in practice they are supported on everything after
 686/pentium pro?
 
 Though the code doing the patching could do a cpuid check.

As first step, I would only commit amd64 support. If that works, we 
can split it into MI and MD parts and add support for other 
architectures. But the functions to replace code with NOPs or a 
function call will always be MD. When those are implemented for i386, 
one can include appropriate cpuid checks. But as there is no place to 
put MD code that is used by both amd64 and i386, there will probably 
be two similar but independent implementations for those two archs. Or 
do you see a better way?

Re: Binary code patching and paravirtualization

2014-12-15 Thread Stefan Fritsch

On Thu, 11 Dec 2014, Mark Kettenis wrote:

  From: Alexey Suslikov alexey.susli...@gmail.com
  Date: Thu, 11 Dec 2014 20:51:14 + (UTC)

  Stefan Fritsch sf at sfritsch.de writes:

   --- a/sys/arch/amd64/include/specialreg.h
   +++ b/sys/arch/amd64/include/specialreg.h
at  at  -158,6 +158,7  at  at 
#define  CPUIDECX_AVX0x1000  /* Advanced Vector Extensions 
  */
#define  CPUIDECX_F16C   0x2000  /* 16bit fp conversion  */
#define  CPUIDECX_RDRAND 0x4000  /* RDRAND instruction  */
   +#define  CPUIDECX_HYPERV 0x8000  /* Hypervisor present */

  Is this flag standardized? Last time I have tried to push this, there
  was an objection based on reserved for future use status of this flag.

  See http://marc.info/?l=openbsd-bugsm=136907278229145w=2

 Well, that thread started out with a questionable workaround for a
 hypervisor bug.  That may have have influenced the debate about the
 flag a bit.

 You can be almost certain that Intel and AMD will not use that
 reserved bit for anything else.  The Linux KVM virtualization business
 is too important for them.  And if Microsoft Hyper-V or VMWare ESX
 sets that bit as well, this becomes an absolute certainty.

The intel manual says Not Used, Always returns 0 which is different from 
reserved, which is stated for other bits.

FTR, jasper@ checked that vmware sets the bit while virtual box does not. 
So, many but not all hypervisors set it.

 I prefer the CPUIDECX_HV name used in the diff you posted in:

OK?

diff --git a/sys/arch/amd64/amd64/identcpu.c b/sys/arch/amd64/amd64/identcpu.c
--- a/sys/arch/amd64/amd64/identcpu.c
+++ b/sys/arch/amd64/amd64/identcpu.c
@@ -129,6 +129,7 @@ const struct {
{ CPUIDECX_AVX, AVX },
{ CPUIDECX_F16C,F16C },
{ CPUIDECX_RDRAND,  RDRAND },
+   { CPUIDECX_HV,  HV },
 }, cpu_ecpuid_ecxfeatures[] = {
{ CPUIDECX_LAHF,LAHF },
{ CPUIDECX_CMPLEG,  CMPLEG },
diff --git a/sys/arch/amd64/include/specialreg.h 
b/sys/arch/amd64/include/specialreg.h
--- a/sys/arch/amd64/include/specialreg.h
+++ b/sys/arch/amd64/include/specialreg.h
@@ -158,6 +158,7 @@
 #defineCPUIDECX_AVX0x1000  /* Advanced Vector Extensions */
 #defineCPUIDECX_F16C   0x2000  /* 16bit fp conversion  */
 #defineCPUIDECX_RDRAND 0x4000  /* RDRAND instruction  */
+#defineCPUIDECX_HV 0x8000  /* Running on hypervisor */

 /*
  * Structured Extended Feature Flags Parameters (CPUID function 0x7, leaf 0)

Re: Binary code patching and paravirtualization

2014-12-15 Thread Stefan Fritsch

On Mon, 15 Dec 2014, Ted Unangst wrote:

 On Wed, Dec 10, 2014 at 00:32, Stefan Fritsch wrote:
 
  --- a/sys/arch/amd64/conf/GENERIC
  +++ b/sys/arch/amd64/conf/GENERIC
  @@ -39,6 +39,8 @@ isa0  at amdpcib?
  isa0at tcpcib?
  pci*at mainbus0
  
  +paravirt0 at mainbus0
  +
  acpi0   at bios0
  acpitimer*  at acpi?
  acpihpet*   at acpi?
  diff --git a/sys/arch/amd64/conf/files.amd64
  b/sys/arch/amd64/conf/files.amd64
  index 103b653..a73d5d8 100644
  --- a/sys/arch/amd64/conf/files.amd64
  +++ b/sys/arch/amd64/conf/files.amd64
  @@ -80,6 +80,12 @@ device   mainbus: isabus, pcibus, mainbus
  attach  mainbus at root
  filearch/amd64/amd64/mainbus.c  mainbus
  
  +device paravirt
  +attach paravirt at mainbus
  +file   arch/amd64/amd64/paravirt.c paravirt needs-flag
  +
  +file   arch/amd64/amd64/codepatch.c
 
 Can you explain why this needs to be a device? It doesn't appear to be
 doing any device like things.
 

Only in order to get a flags field that can be tweaked with config(8). And 
to allow disable via config(8), though that could also be achieved with a 
flag.

Tweaking the behavior with a flags value is necessary because hypervisors 
not always announce what they are capable of. One reason for this is that 
qemu is responsible for setting most of the cpuid flags but the kvm kernel 
module offers some interfaces in all cases. Another reason seems to be 
simple oversight, for example Illuminos KVM forgot to add the cpuid stuff 
from Linux KVM. And if there is some bug it may be a good idea to have a 
simple way to check if it is related to the paravirt stuff.

I have just noticed that there is support in config(8) to set non-devices 
related int values. But this does not seem to be supported in the 
in-kernel UKC. And I can't see right now how config finds the valid int 
values in the kernel. Or is it just necessary to add the name of the 
variable prepended by an underscore to the config tool? Would this be 
preferred over introducing the paravirt device? But being able to set this 
from the in-kernel UKC would be nice, too.

Re: Binary code patching and paravirtualization

2014-12-16 Thread Stefan Fritsch

On Mon, 15 Dec 2014, Ted Unangst wrote:

 On Mon, Dec 15, 2014 at 23:55, Stefan Fritsch wrote:
  
  Only in order to get a flags field that can be tweaked with config(8). And
  to allow disable via config(8), though that could also be achieved with a
  flag.
  
  Tweaking the behavior with a flags value is necessary because hypervisors
  not always announce what they are capable of. One reason for this is that
  qemu is responsible for setting most of the cpuid flags but the kvm kernel
  module offers some interfaces in all cases. Another reason seems to be
  simple oversight, for example Illuminos KVM forgot to add the cpuid stuff
  from Linux KVM. And if there is some bug it may be a good idea to have a
  simple way to check if it is related to the paravirt stuff.
  
  I have just noticed that there is support in config(8) to set non-devices
  related int values. But this does not seem to be supported in the
  in-kernel UKC. And I can't see right now how config finds the valid int
  values in the kernel. Or is it just necessary to add the name of the
  variable prepended by an underscore to the config tool? Would this be
  preferred over introducing the paravirt device? But being able to set this
  from the in-kernel UKC would be nice, too.

What I have forgotten is that just because a hypervisor supports a 
specific interface, it does not always mean that it is a good idea to use 
it. For example, recent Intel CPUs with a recent Linux/KVM have better 
hardware support for APIC virtualization and using the Hyper-V APIC access 
MSRs actually slow things down.

 I think it would be better to avoid fake device proliferation, but
 others may have other opinions.
 
 So the problem is that some hypervisors are broken and don't
 identify as such? Perhaps we ignore them to start? Then we can add a
 mechanism to force paravirt code patching.

 I think the introduction of paravirt code patching and the mechanism
 used to enable or disable are separate issues. It can start as a
 normal option PARAVIRT, and then we discuss what else needs to be done?

Yes, we can first start with some things that are unproblematic 
WRT detection. AFAIK, this is the case for the PV EOI optimization.


Unfortunately, the bit that gives the largest performance boost at the 
moment, the APIC access MSRs, is not so easy:
- KVM has supported it for a long time
- but it is only announced if qemu is started with a special command line 
option
- Illumos KVM supports it too, but there is no way to announce it

Also, old hypervisors tend to stay around for a long time and users can't 
always influence the configuration when they rent a vserver.

On the other hand, the same positive effect as the APIC access MSRs should 
also be achievable by adding x2apic support. kettenis wanted to check how 
much effort that would be. So, I think it would be best to wait a bit with 
this piece.

Re: add m_defrag to pcn driver

2015-04-07 Thread Stefan Fritsch

Hi,

I have committed the vio diff with some tweaks to remove some more now 
unused code. But I think this diff is actually incorrect:

On Tue, 31 Mar 2015, Kimberley Manning wrote:
 Index: if_pcn.c
 ===
 RCS file: /cvs/src/sys/dev/pci/if_pcn.c,v
 retrieving revision 1.36
 diff -u -p -r1.36 if_pcn.c
 --- if_pcn.c  14 Mar 2015 03:38:48 -  1.36
 +++ if_pcn.c  27 Mar 2015 12:17:24 -
 @@ -851,25 +851,23 @@ pcn_start(struct ifnet *ifp)
* were short on resources.  In this case, we'll copy
* and try again.
*/
 - if (bus_dmamap_load_mbuf(sc-sc_dmat, dmamap, m0,
 - BUS_DMA_WRITE|BUS_DMA_NOWAIT) != 0) {
 - MGETHDR(m, M_DONTWAIT, MT_DATA);
 - if (m == NULL)
 - break;
 - if (m0-m_pkthdr.len  MHLEN) {
 - MCLGET(m, M_DONTWAIT);
 - if ((m-m_flags  M_EXT) == 0) {
 - m_freem(m);
 - break;
 - }
 - }
 - m_copydata(m0, 0, m0-m_pkthdr.len, mtod(m, caddr_t));
 - m-m_pkthdr.len = m-m_len = m0-m_pkthdr.len;
 - error = bus_dmamap_load_mbuf(sc-sc_dmat, dmamap,
 - m, BUS_DMA_WRITE|BUS_DMA_NOWAIT);
 - if (error)
 - break;
 - }
 +error = bus_dmamap_load_mbuf(sc-sc_dmat, dmamap, m0,
 +BUS_DMA_WRITE|BUS_DMA_NOWAIT);
 +switch (error) {
 +case 0:
 +break; 
 +case EFBIG:
 +if ((error = m_defrag(m0, M_DONTWAIT)) == 0 
 +(error = bus_dmamap_load_mbuf(sc-sc_dmat, 
 dmamap,
 + m0, BUS_DMA_WRITE|BUS_DMA_NOWAIT)) == 0)
 +break;
 +
 +/* FALLTHROUGH */
 +default:
 + IFQ_DEQUEUE(ifp-if_snd, m0);
 + m_freem(m);

m is always NULL. Therefore you leak m0 in the error path. You should 
remove the m variable which is no longer necessary, and possibly rename m0 
to m. But the rename should probably be done in a second diff for easier 
review.

 + continue;
 +}

Other drivers don't dequeue mbufs in out of mem situations. I think you 
should just set IFF_OACTIVE and break out of the for loop (not shown in 
the diff). The old code just does the break, but I think setting 
IFF_OACTIVE would be correct here.


  
   /*
* Ensure we have enough descriptors free to describe

Re: add m_defrag to pcn driver

2015-04-08 Thread Stefan Fritsch

On Wednesday 08 April 2015 09:48:14, David Gwynne wrote:
  +continue;
  +}
 
  
 
  Other drivers don't dequeue mbufs in out of mem situations. I
  think you  should just set IFF_OACTIVE and break out of the for
  loop (not shown in the diff). The old code just does the break,
  but I think setting IFF_OACTIVE would be correct here.
 
 OACTIVE means the tx ring is full. it (generally) gets cleared by a
 driver when the tx ring is emptied.
 
 if the tx ring is empty but we set it when the system runs out of
 resources (ie, memory shortages, dma mappings, iommu slot
 exhaustion, solar flares, etc), then there's no event to clear
 OACTIVE with. if we cant map the mbuf we should drop it and move
 on.

Sounds reasonable but many drivers (e.g. em, ix, bge) don't do that  
but set OACTIVE in this case. Is there a reason for that? From a 
casual glance at the source it is not clear to me that they have some 
other way to recover. Some have a watchdog that will reset the device 
after some time.

Re: add m_defrag to pcn driver

2015-04-01 Thread Stefan Fritsch

On Wednesday 01 April 2015 12:46:47, Kimberley Manning wrote:
   This diff makes the pcn driver use m_defrag for fragmented mbuf
   chains,
  
 
  I like this kind of cleanups.  As for vio(4) could you try the
  diff or are you looking for testers?
 
 Thanks. I've tried both the vio and pcn diffs and they worked for
 me, so dlg told me to write here. I'm not sure what happens to
 diffs after this point to be honest.

I can take a look at the vio diff and commit it in a few days.

BTW, is there a way to trigger heavily fragmented mbufs in order to 
test the code path?

Re: Possible fix for i217 problem

2015-08-05 Thread Stefan Fritsch

Thanks to everyone for the testing. The patch is now committed.

Possible fix for i217 problem

2015-08-04 Thread Stefan Fritsch

Hi,

someone mentioned to me the i217-LM problems that were reported on misc 
end of May. It is possible that the patch below helps.

For us, it fixed a problem on a laptop with i217-LM (pci id 8086:153a) 
where the receiving of packets would stop until the battery of the laptop 
was removed (or until linux or freebsd were booted, which also have this 
workaround). A normal reboot or power-cycle without removing the battery 
did not help. Interestingly, not even the Intel PXE BIOS has the 
workaround.

The problem would happen if the LAN cable was plugged in after the card 
had already been initialized. If the LAN cable was always plugged in when 
the laptop was powered on, the problem would not appear.

The workaround is part of the e1000_lv_jumbo_workaround_ich8lan() function 
in e1000_ich8lan.c in freebsd, but only the part that is used if jumbo 
packets are *not* configured. Linux has the same fix as 
b20a774495671f037e7160ea2ce87 and da1e2046e5f5ab268e55d30d6b74099ade0aeb6f 
with some more info in the commit messages.

This probably has quite some potential to cause regressions with other 
boards, so i am not sure if it should go in before 5.8 release.

Cheers,
Stefan


--- a/sys/dev/pci/if_em_hw.c
+++ b/sys/dev/pci/if_em_hw.c
@@ -91,6 +91,7 @@ static int32_tem_id_led_init(struct em_hw *);
 static int32_t em_init_lcd_from_nvm_config_region(struct em_hw *,  uint32_t,
uint32_t);
 static int32_t em_init_lcd_from_nvm(struct em_hw *);
+static int32_t em_phy_no_cable_workaround(struct em_hw *);
 static voidem_init_rx_addrs(struct em_hw *);
 static voidem_initialize_hardware_bits(struct em_hw *);
 static boolean_t em_is_onboard_nvm_eeprom(struct em_hw *);
@@ -7018,6 +7019,96 @@ em_read_mac_addr(struct em_hw *hw)
 }
 
 /**
+ * Explicitly disables jumbo frames and resets some PHY registers back to hw-
+ * defaults. This is necessary in case the ethernet cable was inserted AFTER
+ * the firmware initialized the PHY. Otherwise it is left in a state where
+ * it is possible to transmit but not receive packets. Observed on I217-LM and
+ * fixed in FreeBSD's sys/dev/e1000/e1000_ich8lan.c.
+ *
+ * hw - Struct containing variables accessed by shared code
+ */
+STATIC int32_t
+em_phy_no_cable_workaround(struct em_hw *hw) {
+   int32_t ret_val, dft_ret_val;
+   uint32_t mac_reg;
+   uint16_t data, phy_reg;
+
+   /* disable Rx path while enabling workaround */
+   em_read_phy_reg(hw, I2_DFT_CTRL, phy_reg);
+   ret_val = em_write_phy_reg(hw, I2_DFT_CTRL, phy_reg | (1  14));
+   if (ret_val)
+   return ret_val;
+
+   /* Write MAC register values back to h/w defaults */
+   mac_reg = E1000_READ_REG(hw, FFLT_DBG);
+   mac_reg = ~(0xF  14);
+   E1000_WRITE_REG(hw, FFLT_DBG, mac_reg);
+
+   mac_reg = E1000_READ_REG(hw, RCTL);
+   mac_reg = ~E1000_RCTL_SECRC;
+   E1000_WRITE_REG(hw, RCTL, mac_reg);
+
+   ret_val = em_read_kmrn_reg(hw, E1000_KUMCTRLSTA_OFFSET_CTRL, data);
+   if (ret_val)
+   goto out;
+   ret_val = em_write_kmrn_reg(hw, E1000_KUMCTRLSTA_OFFSET_CTRL,
+   data  ~(1  0));
+   if (ret_val)
+   goto out;
+
+   ret_val = em_read_kmrn_reg(hw, E1000_KUMCTRLSTA_OFFSET_HD_CTRL, data);
+   if (ret_val)
+   goto out;
+
+   data = ~(0xF  8);
+   data |= (0xB  8);
+   ret_val = em_write_kmrn_reg(hw, E1000_KUMCTRLSTA_OFFSET_HD_CTRL, data);
+   if (ret_val)
+   goto out;
+
+   /* Write PHY register values back to h/w defaults */
+   em_read_phy_reg(hw, I2_SMBUS_CTRL, data);
+   data = ~(0x7F  5);
+   ret_val = em_write_phy_reg(hw, I2_SMBUS_CTRL, data);
+   if (ret_val)
+   goto out;
+
+   em_read_phy_reg(hw, I2_MODE_CTRL, data);
+   data |= (1  13);
+   ret_val = em_write_phy_reg(hw, I2_MODE_CTRL, data);
+   if (ret_val)
+   goto out;
+
+   /*
+* 776.20 and 776.23 are not documented in
+* i217-ethernet-controller-datasheet.pdf...
+*/
+   em_read_phy_reg(hw, PHY_REG(776, 20), data);
+   data = ~(0x3FF  2);
+   data |= (0x8  2);
+   ret_val = em_write_phy_reg(hw, PHY_REG(776, 20), data);
+   if (ret_val)
+   goto out;
+
+   ret_val = em_write_phy_reg(hw, PHY_REG(776, 23), 0x7E00);
+   if (ret_val)
+   goto out;
+
+   em_read_phy_reg(hw, I2_PCIE_POWER_CTRL, data);
+   ret_val = em_write_phy_reg(hw, I2_PCIE_POWER_CTRL, data  ~(1  10));
+   if (ret_val)
+   goto out;
+
+out:
+   /* re-enable Rx path after enabling workaround */
+   dft_ret_val = em_write_phy_reg(hw, I2_DFT_CTRL, phy_reg  ~(1  14));
+   if (ret_val)
+   return ret_val;
+   else
+   return dft_ret_val;
+}
+

Call for testing for tty diff

2015-07-17 Thread Stefan Fritsch

While writing a virtio-console driver, I have found a bug in ttwrite() 
that can cause hangs. Below is a fix and after talking to Theo, I would 
like to know if the patch causes regressions for anyone, for example hangs 
in pty input/output, serial ports, etc. Thanks in advance.

Cheers,
Stefan


- Introduce new defines TTHIWATMINSPACE, TTMINHIWAT for some magic values
  that are used in tty.c.
- Remove hiwat adjustments in ttwrite(). This fixes the missing spltty().
- The above alone causs deadlocks with ptys. Change ttysetwater() to keep 
  at least TTHIWATMINSPACE space above the high water mark. This makes it
  consistent with ttycheckoutq() and seems to fix the pty deadlocks.


--- sys/kern/tty.c
+++ sys/kern/tty.c
@@ -1688,7 +1688,7 @@ ttycheckoutq(struct tty *tp, int wait)
hiwat = tp-t_hiwat;
s = spltty();
oldsig = wait ? curproc-p_siglist : 0;
-   if (tp-t_outq.c_cc  hiwat + 200)
+   if (tp-t_outq.c_cc  hiwat + TTHIWATMINSPACE)
while (tp-t_outq.c_cc  hiwat) {
ttstart(tp);
if (wait == 0 || curproc-p_siglist != oldsig) {
@@ -1823,7 +1823,7 @@ loop:
tp-t_rocount = 0;
if (ttyoutput(*cp, tp) = 0) {
/* out of space */
-   goto overfull;
+   goto ovhiwat;
}
cp++;
cc--;
@@ -1849,7 +1849,7 @@ loop:
tp-t_outcc += ce;
if (i  0) {
/* out of space */
-   goto overfull;
+   goto ovhiwat;
}
if (ISSET(tp-t_lflag, FLUSHO) ||
tp-t_outq.c_cc  hiwat)
@@ -1869,15 +1869,6 @@ done:
explicit_bzero(obuf, obufcc);
return (error);
 
-overfull:
-   /*
-* Since we are using ring buffers, if we can't insert any more into
-* the output queue, we can assume the ring is full and that someone
-* forgot to set the high water mark correctly.  We set it and then
-* proceed as normal.
-*/
-   hiwat = tp-t_outq.c_cc - 1;
-
 ovhiwat:
ttstart(tp);
s = spltty();
@@ -2114,7 +2105,7 @@ ttsetwater(struct tty *tp)
cps = tp-t_ospeed / 10;
tp-t_lowat = x = CLAMP(cps / 2, TTMAXLOWAT, TTMINLOWAT);
x += cps;
-   tp-t_hiwat = CLAMP(x, tp-t_outq.c_cn, 100);
+   tp-t_hiwat = CLAMP(x, tp-t_outq.c_cn - TTHIWATMINSPACE, TTMINHIWAT);
 #undef CLAMP
 }
 
--- sys/sys/tty.h
+++ sys/sys/tty.h
@@ -171,6 +171,8 @@ struct itty {
 #ifdef _KERNEL
 #defineTTMAXLOWAT  256
 #defineTTMINLOWAT  32
+#defineTTMINHIWAT  100
+#defineTTHIWATMINSPACE 200 /* Min space above hiwat */
 #endif
 
 /* These flags are kept in t_state. */

Re: usb hang related to xhci

2015-07-19 Thread Stefan Fritsch

On Sat, 18 Jul 2015, David Hill wrote:
 Whenever I plug a device into my USB ports, my machine locks hard.  I
 have the Intel Series 7 / C216 chip, so xhci attempts to route the port
 from ehci to xhci. 
 
 The following diff is from FreeBSD and makes my USB devices work again.
 https://github.com/freebsd/freebsd/blob/e79c62ff68fc74d88cb6f479859f6fae9baa5101/sys/dev/usb/controller/xhci_pci.c#L153-L176

I didn't have the lockup but my Intel Series 9 laptop does not recognice a 
USB 3.0 storage device if I insert it after a hot boot. If I insert it 
after a cold boot, it works ok.

This behavior does not change with your patch.

Cheers,
Stefan


 
 
 Index: sys/dev/pci/xhci_pci.c
 ===
 RCS file: /cvs/src/sys/dev/pci/xhci_pci.c,v
 retrieving revision 1.6
 diff -u -p -r1.6 xhci_pci.c
 --- sys/dev/pci/xhci_pci.c22 Jun 2015 08:43:27 -  1.6
 +++ sys/dev/pci/xhci_pci.c19 Jul 2015 02:20:06 -
 @@ -92,33 +92,45 @@ xhci_pci_match(struct device *parent, vo
  static int
  xhci_pci_port_route(struct xhci_pci_softc *psc)
  {
 - pcireg_t val;
 + pcireg_t val, usb2_mask, usb3_mask;
  
 - /*
 -  * Check USB3 Port Routing Mask register that indicates the ports
 -  * can be changed from OS, and turn on by USB3 Port SS Enable register.
 -  */
 - val = pci_conf_read(psc-sc_pc, psc-sc_tag, PCI_XHCI_INTEL_USB3PRM);
 - DPRINTF((%s: USB3PRM / USB3.0 configurable ports: 0x%08x\n,
 - psc-sc.sc_bus.bdev.dv_xname, val));
 +/*
 + * Check USB3 Port Routing Mask register that indicates the ports
 + * can be changed from OS, and turn on by USB3 Port SS Enable 
 register.
 + */
 +usb3_mask = pci_conf_read(psc-sc_pc, psc-sc_tag,
 + PCI_XHCI_INTEL_USB3PRM);
 +DPRINTF((%s: USB3PRM / USB3.0 configurable ports: 0x%08x\n,
 +psc-sc.sc_bus.bdev.dv_xname, usb3_mask));
 +
 +/*
 + * Check USB2 Port Routing Mask register that indicates the USB2.0
 + * ports to be controlled by xHCI HC, and switch them to xHCI HC.
 + */
 +usb2_mask = pci_conf_read(psc-sc_pc, psc-sc_tag,
 + PCI_XHCI_INTEL_XUSB2PRM);
 +DPRINTF((%s: XUSB2PRM / USB2.0 ports can switch from EHCI to xHCI:
 +0x%08x\n, psc-sc.sc_bus.bdev.dv_xname, val));
 +
 + val = pci_conf_read(psc-sc_pc, psc-sc_tag, PCI_XHCI_INTEL_USB3_PSSEN) 
 |
 + pci_conf_read(psc-sc_pc, psc-sc_tag, PCI_XHCI_INTEL_XUSB2PR);
  
 - pci_conf_write(psc-sc_pc, psc-sc_tag, PCI_XHCI_INTEL_USB3_PSSEN, val);
 +
 + pci_conf_write(psc-sc_pc, psc-sc_tag, PCI_XHCI_INTEL_USB3_PSSEN,
 + val  usb3_mask);
 +#ifdef XHCI_DEBUG
   val = pci_conf_read(psc-sc_pc, psc-sc_tag, PCI_XHCI_INTEL_USB3_PSSEN);
   DPRINTF((%s: USB3_PSSEN / Enabled USB3.0 ports under xHCI: 0x%08x\n,
   psc-sc.sc_bus.bdev.dv_xname, val));
 +#endif
  
 - /*
 -  * Check USB2 Port Routing Mask register that indicates the USB2.0
 -  * ports to be controlled by xHCI HC, and switch them to xHCI HC.
 -  */
 - val = pci_conf_read(psc-sc_pc, psc-sc_tag, PCI_XHCI_INTEL_XUSB2PRM);
 - DPRINTF((%s: XUSB2PRM / USB2.0 ports can switch from EHCI to xHCI:
 - 0x%08x\n, psc-sc.sc_bus.bdev.dv_xname, val));
 -
 - pci_conf_write(psc-sc_pc, psc-sc_tag, PCI_XHCI_INTEL_XUSB2PR, val);
 + pci_conf_write(psc-sc_pc, psc-sc_tag, PCI_XHCI_INTEL_XUSB2PR,
 + val  usb2_mask);
 +#ifdef XHCI_DEBUG
   val = pci_conf_read(psc-sc_pc, psc-sc_tag, PCI_XHCI_INTEL_XUSB2PR);
   DPRINTF((%s: XUSB2PR / USB2.0 ports under xHCI: 0x%08x\n,
   psc-sc.sc_bus.bdev.dv_xname, val));
 +#endif
  
   return (0);
  }

Looking for testers for amd64 suspend diff

2015-09-06 Thread Stefan Fritsch

Hi,

the diff below is necessary to make suspend/resume work when x2apic is
enabled (i.e. on qemu/kvm/...). While I don't expect problems, it would be 
nice if I could get some reports that this doesn't break suspend/resume on 
real machines, especially older ones (like core 2 duo and older) and ones 
with an amd cpu.

Thanks in advance.

Cheers,
Stefan

--- a/sys/arch/amd64/amd64/acpi_wakecode.S
+++ b/sys/arch/amd64/amd64/acpi_wakecode.S
@@ -52,6 +52,7 @@
 #include 
 #include 
 #include 
+#include "lapic.h"
 
 #define _ACPI_TRMP_LABEL(a) a = . - _C_LABEL(acpi_real_mode_resume) + \
ACPI_TRAMPOLINE
@@ -245,6 +246,13 @@ _C_LABEL(acpi_long_mode_resume):
movw%ax, %gs
 
/* Restore registers - start with the MSRs */
+#if NLAPIC > 0
+   movl$MSR_APICBASE, %ecx
+   movlacpi_saved_apicbase, %eax
+   movlacpi_saved_apicbase+4, %edx
+   wrmsr
+#endif
+
movl$MSR_STAR, %ecx
movlacpi_saved_star, %eax
movlacpi_saved_star+4, %edx
@@ -623,6 +631,10 @@ _ACPI_TRMP_DATA_LABEL(acpi_saved_cstar)
.quad 0
 _ACPI_TRMP_DATA_LABEL(acpi_saved_sfmask)
.quad 0
+#if NLAPIC > 0
+_ACPI_TRMP_DATA_LABEL(acpi_saved_apicbase)
+   .quad 0
+#endif
 
.align 4
 _ACPI_TRMP_DATA_LABEL(acpi_pdirpa)
@@ -682,6 +694,13 @@ NENTRY(acpi_savecpu)

pushq   %rcx
pushq   %rdx
+#if NLAPIC > 0
+   movl$MSR_APICBASE, %ecx
+   rdmsr
+   movl%eax, acpi_saved_apicbase
+   movl%edx, acpi_saved_apicbase+4
+#endif
+
movl$MSR_STAR, %ecx
rdmsr
movl%eax, acpi_saved_star

Re: Looking for testers for amd64 suspend diff

2015-09-07 Thread Stefan Fritsch

On Monday 07 September 2015 13:23:03, Alexander Bluhm wrote:
> Suspend OpenBSD in a fully emulated qemu running on OpenBSD works.
> After resume, typing in the shell of the virtualized machine works.
> But at disk access, all processes hang in biowait.
> 
> It is the same behavior with and without your diff.

If you use virtio disk this is expected, because the virtio drivers 
still lack the necessary suspend/resume logic.

Re: Looking for testers for amd64 suspend diff

2015-09-11 Thread Stefan Fritsch

On Sunday 06 September 2015 11:44:11, Stefan Fritsch wrote:
> the diff below is necessary to make suspend/resume work when x2apic
> is enabled (i.e. on qemu/kvm/...). While I don't expect problems,
> it would be nice if I could get some reports that this doesn't
> break suspend/resume on real machines, especially older ones (like
> core 2 duo and older) and ones with an amd cpu.

Committed now. Thanks again to everyone who tested it.

Re: Checking MAC address of incoming unicast packets

2016-01-04 Thread Stefan Fritsch

On Sun, 3 Jan 2016, Theo de Raadt wrote:

> >On Friday 01 January 2016 16:25:59, Theo de Raadt wrote:
> >> dlg writes:
> >> > should we just do it unconditionally? is there a downside to that?
> >
> >It may decrease performance a tiny bit. Since such bits tend to add 
> >up, I would be hesitant to enable the check for drivers that don't 
> >need it. OTOH, freebsd and linux seem to always do the check.
> 
> How does it decrease performance?

I am talking about this diff in ether_input():

diff --git sys/net/if_ethersubr.c sys/net/if_ethersubr.c
--- sys/net/if_ethersubr.c
+++ sys/net/if_ethersubr.c
@@ -353,8 +353,7 @@ ether_input(struct ifnet *ifp, struct mbuf *m, void *cookie)
 * If packet is unicast and we're in promiscuous mode, make sure it
 * is for us.  Drop otherwise.
 */
-   if ((m->m_flags & (M_BCAST|M_MCAST)) == 0 &&
-   (ifp->if_flags & IFF_PROMISC)) {
+   if ((m->m_flags & (M_BCAST|M_MCAST)) == 0) {
if (memcmp(ac->ac_enaddr, eh->ether_dhost, ETHER_ADDR_LEN)) {
m_freem(m);
return (1);

For other drivers than vio this means an additional 6 byte memcmp where 
one operand is already in the cache, and the other operand may or may not 
be in the cache. 'tiny performance impact' seems about right.

> 
> I tried to read the rest of what you wrote.  It makes no sense.  Since
> these mbufs are coming off a ring, aren't they already linear; aren't
> these mbufs already normalzed ie. "pulled up".  The performance cost
> seems to be one condition in a mbuf macro which decides to do no work.
> 
> Maybe you should show the diff which does the check in the driver
> itself, then then it will be easier to see the performance cost.  Right
> now I can't perceive the problem.

It's not very expensive. But it's more code duplication than I like:

diff --git sys/dev/pci/if_vio.c sys/dev/pci/if_vio.c
--- sys/dev/pci/if_vio.c
+++ sys/dev/pci/if_vio.c
@@ -1005,6 +1005,9 @@ vio_rxeof(struct vio_softc *sc)
int r = 0;
int slot, len, bufs_left;
struct virtio_net_hdr *hdr;
+   int promisc = (ifp->if_flags & IFF_PROMISC);
+   u_int8_t *enaddr = sc->sc_ac.ac_enaddr;
+   struct ether_header *eh;
 
while (virtio_dequeue(vsc, vq, , ) == 0) {
r = 1;
@@ -1035,10 +1038,33 @@ vio_rxeof(struct vio_softc *sc)
bufs_left--;
}
 
-   if (bufs_left == 0) {
-   ml_enqueue(, m0);
-   m0 = NULL;
+   if (bufs_left > 0)
+   continue;
+
+   /*
+* Unfortunately, mac filtering is only best effort
+* in virtio-net. Unwanted packets may still arrive.
+* If we are not promiscous, we have to check if the
+* packet is actually for us.
+*/
+   if (!promisc) {
+   m0 = m_pullup(m0, ETHER_HDR_LEN);
+   /*
+* no need to check m0 == NULL, we know m0 has enough
+* space
+*/
+
+   eh = mtod(m0, struct ether_header *);
+   if (!ETHER_IS_MULTICAST(eh->ether_dhost) &&
+   memcmp(enaddr, eh->ether_dhost, ETHER_ADDR_LEN) != 
0) {
+   m_freem(m0);
+   m0 = NULL;
+   continue;
+   }
}
+
+   ml_enqueue(, m0);
+   m0 = NULL;
}
if (m0 != NULL) {
DBGPRINT("expected %d buffers, got %d", (int)hdr->num_buffers,

Re: Checking MAC address of incoming unicast packets

2016-01-02 Thread Stefan Fritsch

On Friday 01 January 2016 16:25:59, Theo de Raadt wrote:
> dlg writes:
> > should we just do it unconditionally? is there a downside to that?

It may decrease performance a tiny bit. Since such bits tend to add 
up, I would be hesitant to enable the check for drivers that don't 
need it. OTOH, freebsd and linux seem to always do the check.

> 
> That is a very good question.  What are the downsides against having
> the driver do this filtering itself, like all real hardware does?

The driver would need to do some mbuf operations to get the 
destination address. The same operations are then done again inside 
ether_input(). This is ugly.

> Why risk sending packets of the wrong form further upwards into the
> network stack, and then having to ensure the checks exist up there?
> 
> This flag is requesting special service to encourage a layer
> violation.
> 
> What is the reasoning to encourage such a flag (which someone will
> one day see, misuse, and try to expand for their bizzare use
> case...)

To me, it seems as much a layer violation as hardware checksum 
offloading features or the IFF_SIMPLEX flag. So no change there.

On Saturday 02 January 2016 10:57:41, Martin Pieuchot wrote:
> If it's acceptable performance-wise to do the check unconditionally
> I believe that's the way to go.  If not I'm a bit afraid of
> introducing a flag/capability for a single driver.  Do you know if
> any other driver could use it?

I haven't found specs of xen or vmware network drivers, therefore I 
can't say if those would have the same issue.

One point for doing the check unconditionally is that it may guard 
against buggy hardware or buggy hypervisors (in cases the spec does 
not allow this). Since the observable error in our case was that 
sometimes TCP connections of other, unrelated VM hosts were aborted 
with TCP RSTs, such bugs are very difficult to find if they are not 
triggered often.

I don't have a setup myself where I could do meaningful performance 
tests. If anyone has a setup where he can easily do such tests, it 
would be nice if he could test if removing the "(ifp->if_flags & 
IFF_PROMISC)" check in ether_input() makes any difference. Otherwise, 
I can try to do some performance tests at work, but that may take a 
while.

Checking MAC address of incoming unicast packets

2016-01-01 Thread Stefan Fritsch

Hi,

by default, the ether_input() checks the destination MAC address of 
incoming unicast packets only if the interface is in promiscous mode. If 
not, it is assumed that the NIC filters unicast packets reliably. 
Unfortunately, for virtio-net this is not the case. There, unicast 
filtering is only best effort, and (depending on configuration) if the 
bridge on the VM host does unicast flodding, unicast packets that are not 
for the VM guest may still be delivered to the VM guest. This is a rather 
annoying problem because it can cause pf to send RST packets to foreign 
connections. (Kudos to mpf@ for debugging this).

There are two possible approaches to fix this problem. Either make the 
vio(4) driver filter out unicast packets that are not for the local MAC, 
which would involve duplicating quite a bit of code from ether_input() in 
vio(4). Or, and I would prefer this, allow the driver to tell 
ether_input() that it needs to check the MAC always, and not only if the 
interface is in promiscous mode.

This could be done with a new flag. There seem to be three possible places 
where this flag could be put:

* ifnet.if_flags
  This is a short and there is no free bit. But the IFF_NOTRAILERS bit has
  become unused recently and could be recycled.

* ifnet.if_xflags
  An int, lots of free bits. But comment says 'extra softnet flags'

* if_data.ifi_capabilities
  An u_int32_t, lots of free bits. 


In the diff below, I went with the first choice because the new IFF_NOMACFILTER
is somewhat similar to IFF_SIMPLEX and because the the check can then be 
nicely folded into the existing check for IFF_PROMISC.

I would welcome any comments, suggestions for a better flag name, OKs, ...

Cheers,
Stefan



diff --git sys/dev/pci/if_vio.c sys/dev/pci/if_vio.c
index 4cd80d5..22fd7cf 100644
--- sys/dev/pci/if_vio.c
+++ sys/dev/pci/if_vio.c
@@ -582,21 +582,21 @@ vio_attach(struct device *parent, struct device *self, 
void *aux)
virtio_start_vq_intr(vsc, >sc_vq[VQCTL]);
vsc->sc_nvqs = 3;
}
}
 
if (vio_alloc_mem(sc) < 0)
goto err;
 
strlcpy(ifp->if_xname, self->dv_xname, IFNAMSIZ);
ifp->if_softc = sc;
-   ifp->if_flags = IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST;
+   ifp->if_flags = IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST | 
IFF_NOMACFILTER;
ifp->if_start = vio_start;
ifp->if_ioctl = vio_ioctl;
ifp->if_capabilities = IFCAP_VLAN_MTU;
if (features & VIRTIO_NET_F_CSUM)
ifp->if_capabilities |= IFCAP_CSUM_TCPv4|IFCAP_CSUM_UDPv4;
IFQ_SET_MAXLEN(>if_snd, vsc->sc_vqs[1].vq_num - 1);
IFQ_SET_READY(>if_snd);
ifmedia_init(>sc_media, 0, vio_media_change, vio_media_status);
ifmedia_add(>sc_media, IFM_ETHER | IFM_AUTO, 0, NULL);
ifmedia_set(>sc_media, IFM_ETHER | IFM_AUTO);
diff --git sys/net/if.h sys/net/if.h
index 8d7e390..91c6d18 100644
--- sys/net/if.h
+++ sys/net/if.h
@@ -182,36 +182,37 @@ struct if_status_description {
 /*
  * Length of interface description, including terminating '\0'.
  */
 #defineIFDESCRSIZE 64
 
 #defineIFF_UP  0x1 /* interface is up */
 #defineIFF_BROADCAST   0x2 /* broadcast address valid */
 #defineIFF_DEBUG   0x4 /* turn on debugging */
 #defineIFF_LOOPBACK0x8 /* is a loopback net */
 #defineIFF_POINTOPOINT 0x10/* interface is point-to-point 
link */
-#defineIFF_NOTRAILERS  0x20/* avoid use of trailers */
+#defineIFF_NOMACFILTER 0x20/* Does not reliably filter 
unicast MACs */
 #defineIFF_RUNNING 0x40/* resources allocated */
 #defineIFF_NOARP   0x80/* no address resolution 
protocol */
 #defineIFF_PROMISC 0x100   /* receive all packets */
 #defineIFF_ALLMULTI0x200   /* receive all multicast 
packets */
 #defineIFF_OACTIVE 0x400   /* transmission in progress */
 #defineIFF_SIMPLEX 0x800   /* can't hear own transmissions 
*/
 #defineIFF_LINK0   0x1000  /* per link layer defined bit */
 #defineIFF_LINK1   0x2000  /* per link layer defined bit */
 #defineIFF_LINK2   0x4000  /* per link layer defined bit */
 #defineIFF_MULTICAST   0x8000  /* supports multicast */
 
+
 /* flags set internally only: */
 #defineIFF_CANTCHANGE \
(IFF_BROADCAST|IFF_POINTOPOINT|IFF_RUNNING|IFF_OACTIVE|\
-   IFF_SIMPLEX|IFF_MULTICAST|IFF_ALLMULTI)
+   IFF_SIMPLEX|IFF_MULTICAST|IFF_ALLMULTI|IFF_NOMACFILTER)
 
 #define IFXF_MPSAFE0x1 /* if_start is mpsafe */
 #defineIFXF_INET6_NOPRIVACY0x4 /* don't autoconf 
privacy */
 #defineIFXF_MPLS   0x8 /*

Re: Xen virtual network (Netfront) driver

2016-01-06 Thread Stefan Fritsch

On Wed, 6 Jan 2016, Mike Belopuhov wrote:

> There's still stuff to do, but it receives and transmits reliably
> (at least on modern Xen) so I'd like to get it in.  Man page will
> follow.

I only had a quick glance at the code, but I have one comment about your 
use of memory barriers. The membar_* macros are pure compiler barriers 
when the openbsd kernel is compiled for UP. But since the host machine and 
xen may use SMP even in this case, I suspect the that you need hardware 
memory barriers even if MULTIPROCESSOR is not defined. This does not seem 
relevant for x86 because you don't use membar_sync, but it may become 
relevant for arm, which is also supported by xen.

I had the same problem in virtio and introduced the virtio_membar_* macros 
for this purpose. Maybe they should be renamed to a more generic name and 
you should use them, too?


> + if (!(ifp->if_flags & IFF_RUNNING) || ifq_is_oactive(>if_snd))
> + return;
> +
> + prod = txr->txr_prod;
> + membar_consumer();
> +
> + for (;;) {
> + m = ifq_deq_begin(>if_snd);
> + if (m == NULL)
> + break;
> +
> + error = xnf_encap(sc, m, );
> + if (error == ENOENT) {
> + /* transient */
> + ifq_deq_rollback(>if_snd, m);
> + ifq_set_oactive(>if_snd);
> + break;
> + } else if (error) {
> + /* the chain is too large */
> + ifq_deq_commit(>if_snd, m);
> + m_freem(m);
> + continue;
> + }
> + ifq_deq_commit(>if_snd, m);
> +
> +#if NBPFILTER > 0
> + if (ifp->if_bpf)
> + bpf_mtap_ether(ifp->if_bpf, m, BPF_DIRECTION_OUT);
> +#endif
> + pkts++;
> + }
> + if (pkts > 0) {
> + txr->txr_prod = prod;
> + xen_intr_signal(sc->sc_xih);
> + }
> +}
> +
> +int
> +xnf_encap(struct xnf_softc *sc, struct mbuf *m, uint32_t *prod)
> +{
> + struct ifnet *ifp = >sc_ac.ac_if;
> + struct xnf_tx_ring *txr = sc->sc_tx_ring;
> + union xnf_tx_desc *txd;
> + bus_dmamap_t dmap;
> + int error, i, n = 0;
> +
> + if (((txr->txr_cons - *prod - 1) & (XNF_TX_DESC - 1)) < XNF_TX_FRAG) {
> + error = ENOENT;
> + goto errout;
> + }
> +
> + i = *prod & (XNF_TX_DESC - 1);
> + dmap = sc->sc_tx_dmap[i];
> +
> + error = bus_dmamap_load_mbuf(sc->sc_dmat, dmap, m, BUS_DMA_WRITE |
> + BUS_DMA_NOWAIT);
> + if (error == EFBIG) {
> + if (m_defrag(m, M_DONTWAIT) ||
> + bus_dmamap_load_mbuf(sc->sc_dmat, dmap, m, BUS_DMA_WRITE |
> +  BUS_DMA_NOWAIT))
> + goto errout;
> + } else if (error)
> + goto errout;
> +
> + for (n = 0; n < dmap->dm_nsegs; n++, (*prod)++) {
> + i = *prod & (XNF_TX_DESC - 1);
> + if (sc->sc_tx_buf[i])
> + panic("%s: save vs spell: %d\n", ifp->if_xname, i);
> + txd = >txr_desc[i];
> + if (n == 0) {
> + sc->sc_tx_buf[i] = m;
> + if (0 && m->m_pkthdr.csum_flags & M_IPV4_CSUM_OUT)
> + txd->txd_req.txq_flags = XNF_TXF_CSUM |
> + XNF_TXF_VALID;
> + txd->txd_req.txq_size = m->m_pkthdr.len;
> + } else
> + txd->txd_req.txq_size = dmap->dm_segs[n].ds_len;
> + if (n != dmap->dm_nsegs - 1)
> + txd->txd_req.txq_flags |= XNF_TXF_CHUNK;
> + txd->txd_req.txq_ref = dmap->dm_segs[n].ds_addr;
> + txd->txd_req.txq_offset = dmap->dm_segs[n].ds_offset;
> + }
> +
> + ifp->if_opackets++;
> + return (0);
> +
> + errout:
> + ifp->if_oerrors++;
> + return (error);
> +}
> +
> +void
> +xnf_intr(void *arg)
> +{
> + struct xnf_softc *sc = arg;
> + struct ifnet *ifp = >sc_ac.ac_if;
> +
> + if (ifp->if_flags & IFF_RUNNING) {
> + xnf_rxeof(sc);
> + xnf_txeof(sc);
> + }
> +}
> +
> +int
> +xnf_txeof(struct xnf_softc *sc)
> +{
> + struct ifnet *ifp = >sc_ac.ac_if;
> + struct xnf_tx_ring *txr = sc->sc_tx_ring;
> + union xnf_tx_desc *txd;
> + struct mbuf *m;
> + bus_dmamap_t dmap;
> + volatile uint32_t r;
> + uint32_t cons;
> + int i, id, pkts = 0;
> +
> + do {
> + for (cons = sc->sc_tx_cons; cons != txr->txr_cons; cons++) {
> + membar_consumer();
> + i = cons & (XNF_TX_DESC - 1);
> + txd = >txr_desc[i];
> + id = txd->txd_rsp.txp_id;
> + memset(txd, 0, sizeof(*txd));
> + txd->txd_req.txq_id = id;
> + membar_producer();
> + if (sc->sc_tx_buf[i]) {
> +

Re: Checking MAC address of incoming unicast packets

2016-01-04 Thread Stefan Fritsch

On Mon, 4 Jan 2016, Claudio Jeker wrote:

> On Sat, Jan 02, 2016 at 04:04:33PM +0100, Mark Kettenis wrote:
> > > Date: Sat, 2 Jan 2016 10:57:41 +0100
> > > From: Martin Pieuchot 
> > > 
> > > If it's acceptable performance-wise to do the check unconditionally I
> > > believe that's the way to go.  If not I'm a bit afraid of introducing
> > > a flag/capability for a single driver.  Do you know if any other driver
> > > could use it?
> > 
> > I suppose any paravirtualized network driver would need this; i.e. my
> > sun4v vnet(4) driver and the xen stuff mikeb@ is working on.
> > 

That really depends on the device. Not all need to have the same weird 
semantics as virtio. But I wouldn't rule it out, either.

> Shouldn't all these interfaces have the PROMISC flag set all the time
> instead? They seem to be always promiscous since they don't have a HW mac
> address filter. Doing that would solve the problem in a simple way
> (special ioctl handling per driver).

No, it's not the same as always promiscous. The virtio specification says 
"there is a filter, but unwanted packets may still arrive". The 
reliability of the HW mac address filter depends on the implementation. 
Some have a reliable filter (e.g. the one in qemu), some rely on the 
software bridge (e.g. the vhost-net implementation in the linux kernel). 
In order to reliably get all packets, the driver has to explicitly tell 
the device to switch PROMISC mode on.

Re: pppd and ioctl(TIOCSETD) (was: CVS: cvs.openbsd.org: src)

2015-12-22 Thread Stefan Fritsch

On Tue, 22 Dec 2015, David Coppa wrote:
> I suspect this broke my 3G connection:

Oops. Sorry. Try this diff (already committed):

--- sys/kern/tty_conf.c
+++ sys/kern/tty_conf.c
@@ -42,6 +42,11 @@
 #include 
 #include 
 
+#include "ppp.h"
+#include "nmea.h"
+#include "msts.h"
+#include "endrun.h"
+
 #definettynodisc ((int (*)(dev_t, struct tty *, struct proc *))enodev)
 #definettyerrclose ((int (*)(struct tty *, int flags, struct proc 
*))enodev)
 #definettyerrio ((int (*)(struct tty *, struct uio *, int))enodev)

Re: Xen virtual network (Netfront) driver

2016-01-12 Thread Stefan Fritsch

On Thu, 7 Jan 2016, Mike Belopuhov wrote:

> On 7 January 2016 at 13:17, Mark Kettenis <mark.kette...@xs4all.nl> wrote:
> >> From: Mike Belopuhov <m...@belopuhov.com>
> >> Date: Thu, 7 Jan 2016 12:02:23 +0100
> >>
> >> On 6 January 2016 at 17:58, Stefan Fritsch <s...@sfritsch.de> wrote:
> >> > On Wed, 6 Jan 2016, Mike Belopuhov wrote:
> >> >
> >> >> There's still stuff to do, but it receives and transmits reliably
> >> >> (at least on modern Xen) so I'd like to get it in.  Man page will
> >> >> follow.
> >> >
> >> > I only had a quick glance at the code, but I have one comment about your
> >> > use of memory barriers. The membar_* macros are pure compiler barriers
> >> > when the openbsd kernel is compiled for UP. But since the host machine 
> >> > and
> >> > xen may use SMP even in this case, I suspect the that you need hardware
> >> > memory barriers even if MULTIPROCESSOR is not defined. This does not seem
> >> > relevant for x86 because you don't use membar_sync, but it may become
> >> > relevant for arm, which is also supported by xen.
> >> >
> >>
> >> membar_{producer,consumer} are defined on arm to perform store and
> >> load memory barriers.  Our arm code currently does not distinguish
> >> between an MP case and non-MP case regarding the definition of these
> >> macros, so I'm not entirely certain what are you trying to say.

I didn't check arm's implementation but new that it had non-empty 
membar_{producer,consumer}. So, if it does not distinguish between an MP 
case and non-MP case, then there is no problem there. But maybe you should 
document somewhere which assumptions about the architecture you make, so 
that they can be checked when adding a new architecture. I guess arm64 
will come sooner or later and I don't know if it has exactly the same 
memory model as 32bit arm.

> >
> > Not sure ARM is a good example to look at.
> >
> 
> The only architectures that Xen dom0 is implemented for are i386,
> adm64 and arm, so there's no real need to look at anything else.
> 
> > In principle I think that the membar_xxx() interfaces could be simple
> > compiler barriers on all our architectures, at least as long as the
> > CPU will observe its own stores in the same order as they were
> > emitted.  But I think all sane CPU architectures make those
> > guarantees.  At least for "normal" memory.  However, we treat that as
> > an optimization.  And we haven't done that for all our architectures.
> >
> > The problem with virtualization is of course that even a non-MP kernel
> > is actually running in an MP environment.  If data structures are
> > shared with the hypervisor or another domain running on a different
> > CPU, proper memory barriers must be used to guarantee the other side
> > sees our stores in the right order.  The typical case would be
> > populating a descriptor with some sort of validity bit.  There you
> > want to make sure the other side doesn't see the valid bit set until
> > all the other parts of the descriptor have been filled in and are
> > visible.  In that case a simple compiler barrier may not be enough.

Yes. With intel it's the "Reads may be reordered with older writes to 
different locations but not with older writes to the same location" bit 
from the memory model that is causing problems. So you have to check if 
xen hits this case. virtio does (and removing the membarriers causes 
observable hangs).

> 
> That's what I was referring to in my example below.
> 
> > This is why the virtio_membar_xxx() primitives were introduced.
> >
> 
> Any idea why wasn't store and load barriers implemented separately?

No idea. virtio_membar_xxx() was modeled after the existing membar_xxx(). 
But AIUI membar_consumer() plus membar_producer() is not equivalent to 
membar_sync() (which also prevents read vs. write reordering).

Re: Checking MAC address of incoming unicast packets

2016-01-14 Thread Stefan Fritsch

On Mon, 4 Jan 2016, Stefan Fritsch wrote:
> On Sun, 3 Jan 2016, Theo de Raadt wrote:
> > >> dlg writes:
> > >> > should we just do it unconditionally? is there a downside to that?
> > >
> > >It may decrease performance a tiny bit. Since such bits tend to add 
> > >up, I would be hesitant to enable the check for drivers that don't 
> > >need it. OTOH, freebsd and linux seem to always do the check.
> > 
> > How does it decrease performance?
> 
> I am talking about this diff in ether_input():
> 
> diff --git sys/net/if_ethersubr.c sys/net/if_ethersubr.c
> --- sys/net/if_ethersubr.c
> +++ sys/net/if_ethersubr.c
> @@ -353,8 +353,7 @@ ether_input(struct ifnet *ifp, struct mbuf *m, void 
> *cookie)
>* If packet is unicast and we're in promiscuous mode, make sure it
>* is for us.  Drop otherwise.
>*/
> - if ((m->m_flags & (M_BCAST|M_MCAST)) == 0 &&
> - (ifp->if_flags & IFF_PROMISC)) {
> + if ((m->m_flags & (M_BCAST|M_MCAST)) == 0) {
>   if (memcmp(ac->ac_enaddr, eh->ether_dhost, ETHER_ADDR_LEN)) {
>   m_freem(m);
>   return (1);
> 
> For other drivers than vio this means an additional 6 byte memcmp where 
> one operand is already in the cache, and the other operand may or may not 
> be in the cache. 'tiny performance impact' seems about right.

Hrvoje Popovski was very helpful and did some performance tests with the 
patch above. There is only measurable difference at very high package 
rates, but I must admit that there the difference is a bit larger than I 
expected:

> i'm sending 64byte packets from host connected to ix2 and counting then
> on host connected to ix3

> without your patch

sending receiving
400kpps 400kpps
600kpps 600kpps
800kpps 690kpps
1Mpps   610kpps
1.4Mpps 580kpps
12Mpps  580kpps


> with your patch

sending receiving
400kpps 400kpps -   0%
600kpps 600kpps -   0%
800kpps 680kpps - 1.4%
1Mpps   600kpps - 1.6%
1.4Mpps 560kpps - 3.4%
12Mpps  560kpps - 3.4%



So, which variant should I commit?

- do the check unconditionally
- add some flag so that ether_input() does the check for vio(4) 
unconditionally
- do the check in vio(4)

mpi fixed a similar bug in trunk(4) in July, but that does not use 
ether_input() and would not profit from the flag approach.



> > Maybe you should show the diff which does the check in the driver
> > itself, then then it will be easier to see the performance cost.  Right
> > now I can't perceive the problem.
> 
> It's not very expensive. But it's more code duplication than I like:
> 
> diff --git sys/dev/pci/if_vio.c sys/dev/pci/if_vio.c
> --- sys/dev/pci/if_vio.c
> +++ sys/dev/pci/if_vio.c
> @@ -1005,6 +1005,9 @@ vio_rxeof(struct vio_softc *sc)
>   int r = 0;
>   int slot, len, bufs_left;
>   struct virtio_net_hdr *hdr;
> + int promisc = (ifp->if_flags & IFF_PROMISC);
> + u_int8_t *enaddr = sc->sc_ac.ac_enaddr;
> + struct ether_header *eh;
>  
>   while (virtio_dequeue(vsc, vq, , ) == 0) {
>   r = 1;
> @@ -1035,10 +1038,33 @@ vio_rxeof(struct vio_softc *sc)
>   bufs_left--;
>   }
>  
> - if (bufs_left == 0) {
> - ml_enqueue(, m0);
> - m0 = NULL;
> + if (bufs_left > 0)
> + continue;
> +
> + /*
> +  * Unfortunately, mac filtering is only best effort
> +  * in virtio-net. Unwanted packets may still arrive.
> +  * If we are not promiscous, we have to check if the
> +  * packet is actually for us.
> +  */
> + if (!promisc) {
> + m0 = m_pullup(m0, ETHER_HDR_LEN);
> + /*
> +  * no need to check m0 == NULL, we know m0 has enough
> +  * space
> +  */
> +
> + eh = mtod(m0, struct ether_header *);
> + if (!ETHER_IS_MULTICAST(eh->ether_dhost) &&
> + memcmp(enaddr, eh->ether_dhost, ETHER_ADDR_LEN) != 
> 0) {
> + m_freem(m0);
> + m0 = NULL;
> + continue;
> + }
>   }
> +
> + ml_enqueue(, m0);
> + m0 = NULL;
>   }
>   if (m0 != NULL) {
>   DBGPRINT("expected %d buffers, got %d", (int)hdr->num_buffers,
> 
>

1 2 >

1 - 100 of 187 matches

Mail list logo