Re: hvn(4): don't input mbufs if interface is not running

2021-06-11 Thread Mike Belopuhov
On 12/05/2021 15:15, Patrick Wildt wrote:
> Hi,
> 
> when hvn(4) attaches it sends commands and waits for replies to come
> back in, hence the interrupt function is being polled.  Unfortunately
> it seems that the 'receive pipe' has both command completion and data
> packets.  As it turns out, while hvn(4) is just setting up the pipes,
> it can already receive packets, which I have seen happening on Hyper-V.
> 
> This essentially means that if_input() is being called *before* the
> card is set up (or UP).  This seems wrong.  Apparently on drivers like
> em(4) we only read packets if IFF_RUNNING is set.  I think in the case
> of hvn(4), we should drop packets unless IFF_RUNNING is set.
> 
> Opinions?
> 

Hi Patrick,

You're right that hvn needs to have the receive path set up to exchange
commands with the hypervisor. This diff LGTM and should be committed if
it hasn't been already.

Cheers,
Mike

> Patrick
> 
> diff --git a/sys/dev/pv/if_hvn.c b/sys/dev/pv/if_hvn.c
> index f12e2f935ca..4306f717baf 100644
> --- a/sys/dev/pv/if_hvn.c
> +++ b/sys/dev/pv/if_hvn.c
> @@ -1470,7 +1470,10 @@ hvn_rndis_input(struct hvn_softc *sc, uint64_t tid, 
> void *arg)
>   }
>   hvn_nvs_ack(sc, tid);
>  
> - if_input(ifp, );
> + if (ifp->if_flags & IFF_RUNNING)
> + if_input(ifp, );
> + else
> + ml_purge();
>  }
>  
>  static inline struct mbuf *
> 
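
The guard in the diff can be tried outside the kernel. What follows is a minimal user-space model, not the hvn(4) code: `struct fake_ifnet`, `struct pkt_list`, `FAKE_IFF_RUNNING` and `rx_deliver()` are hypothetical stand-ins, and only the "deliver if running, otherwise purge" decision mirrors the patch.

```c
#include <stddef.h>

#define FAKE_IFF_RUNNING 0x40	/* illustrative flag bit for this model */

struct pkt_list {
	size_t len;		/* number of queued packets */
};

struct fake_ifnet {
	unsigned int if_flags;
	size_t delivered;	/* packets handed to the stack */
	size_t dropped;		/* packets purged before the interface was up */
};

/* Hand the batch to the stack only if the interface is running. */
static void
rx_deliver(struct fake_ifnet *ifp, struct pkt_list *pl)
{
	if (ifp->if_flags & FAKE_IFF_RUNNING)
		ifp->delivered += pl->len;	/* stands in for if_input() */
	else
		ifp->dropped += pl->len;	/* stands in for ml_purge() */
	pl->len = 0;				/* the list is consumed either way */
}
```

Either branch empties the list, so packets that arrive while the channel is only being set up are freed rather than leaked or passed up the stack.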



Re: XCP-ng, OpenBSD and network interface changes

2021-02-01 Thread Mike Belopuhov
On Sun, Jan 31, 2021 at 2:59 PM Denis Fondras  wrote:

> I am using XCP-ng with the latest OpenBSD snapshot.
>
> Whenever I make a hardware change in networking on the VM (connect or
> disconnect an interface, change associated network), the VM panics :
>
> openbsd# panic: grant table reference 5912 is held by domain 0: frame
> 0x1f1a4 flags 0x19
> Stopped at   db_enter+0x10: popq %rbp
> TID   PID  UIDPRFLAGS   PFLAGS CPU COMMAND
> *349758 6557900x14000   0x200   0 xenwatch
> db_enter() at db_enter+0x10
> panic(81da7541) at panic+0x12a
> xen_bus_dmamap_unload(820ede50,800e9380) at
> xen_bus_dmamap_unload+0x138
> xnf_tx_ring_destroy(80162000) at xnf_tx_ring_destroy+0x104
> xnf_detach(80162000,0) at xnf_detach+0x55
> config_detach(80162000,0) at config_detach+0x140
> xen_hotplug(8012e200) at xen_hotplug+0x181
> taskq_thread(800dde00) at taskq_thread+0x66
> end trace frame: 0x0, count: 7
> https://www.openbsd.org/ddb.html describes the minimum info required in
> bug reports. Insufficient info makes it difficult to find and fix bugs.
> ddb>
>
> If I apply the following patch, it obviously does not panic and seems to
> work
> correctly :
>
>
Hi Denis,

This is not a real fix, unfortunately; you're just ignoring the issue.
Somehow the grant table reference is not released when we perform the
detach.  You can try increasing the number of iterations to 1 (or more),
for example, and see if this is a timing issue.

Cheers,
Mike
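
On the atomic_cas_uint() question: it is a compare-and-swap that stores a new value only if the memory still holds the expected one, and returns the value it found. The sketch below models the bounded retry loop from xen_grant_table_remove() in portable C11. It is an illustration under assumptions, not the driver code: `cas_uint()`, `release_entry()`, the `busy_mask` parameter and the numeric flag values used to exercise it are all hypothetical.

```c
#include <stdatomic.h>

/*
 * Compare-and-swap: if *p equals expected, store desired.
 * Returns the value that was in *p before the call.
 */
static unsigned int
cas_uint(atomic_uint *p, unsigned int expected, unsigned int desired)
{
	atomic_compare_exchange_strong(p, &expected, desired);
	return expected;	/* on failure, C11 writes the observed value here */
}

/*
 * Try to invalidate an entry, retrying a bounded number of times.
 * The swap only succeeds once no "in use" bits are set.
 * Returns 0 on success, -1 if the entry stayed busy for all tries.
 */
static int
release_entry(atomic_uint *entry, unsigned int busy_mask,
    unsigned int invalid, int max_tries)
{
	for (int i = 0; i < max_tries; i++) {
		/* Expect the entry with its "in use" bits cleared. */
		unsigned int expected = atomic_load(entry) & ~busy_mask;

		if (cas_uint(entry, expected, invalid) == expected)
			return 0;
	}
	return -1;
}
```

If the other side never clears its bits, no iteration count helps, which is why replacing panic() with printf() only hides the problem. If the bits clear after a short delay, a larger bound would let the loop succeed, which would point to a timing issue.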


> Index: xen.c
> ===
> RCS file: /cvs/src/sys/dev/pv/xen.c,v
> retrieving revision 1.97
> diff -u -p -r1.97 xen.c
> --- xen.c   29 Jun 2020 06:50:52 -  1.97
> +++ xen.c   31 Jan 2021 13:13:07 -
> @@ -1204,7 +1204,7 @@ xen_grant_table_remove(struct xen_softc
> loop = 0;
> while (atomic_cas_uint(ptr, flags, GTF_invalid) != flags) {
> if (loop++ > 10) {
> -   panic("grant table reference %u is held "
> +   printf("grant table reference %u is held "
> "by domain %d: frame %#x flags %#x",
> ref + ge->ge_start, ge->ge_table[ref].domid,
> ge->ge_table[ref].frame,
> ge->ge_table[ref].flags);
>
> Can someone give me a clue on what _atomic_cas_uint() is ?
>
> Thank you in advance.
>
> Denis
>
> OpenBSD 6.8-current (GENERIC) #9: Sun Jan 31 14:08:42 CET 2021
> r...@openbsd.lab.ledeuns.net:/sys/arch/amd64/compile/GENERIC
> real mem = 1052770304 (1004MB)
> avail mem = 1005694976 (959MB)
> random: good seed from bootblocks
> mpath0 at root
> scsibus0 at mpath0: 256 targets
> mainbus0 at root
> bios0 at mainbus0: SMBIOS rev. 2.4 @ 0xeb01f (11 entries)
> bios0: vendor Xen version "4.13" date 01/21/2021
> bios0: Xen HVM domU
> acpi0 at bios0: ACPI 4.0
> acpi0: sleep states S5
> acpi0: tables DSDT FACP APIC HPET WAET
> acpi0: wakeup devices
> acpitimer0 at acpi0: 3579545 Hz, 32 bits
> acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
> ioapic0 at mainbus0: apid 1 pa 0xfec0, version 11, 48 pins, remapped
> cpu0 at mainbus0: apid 0 (boot processor)
> cpu0: Intel(R) Xeon(R) CPU E5-2407 v2 @ 2.40GHz, 2394.83 MHz, 06-3e-04
> cpu0:
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,ACPI,MMX,FXSR,SSE,SSE2,SS,SSE3,PCLMUL,SSSE3,CX16,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,HV,NXE,PAGE1GB,RDTSCP,LONG,LAHF,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,XSAVEOPT,MELTDOWN
> cpu0: 256KB 64b/line 8-way L2 cache
> cpu0: smt 0, core 0, package 0
> mtrr: Pentium Pro MTRR support, 8 var ranges, 88 fixed ranges
> cpu0: apic clock running at 100MHz
> acpihpet0 at acpi0: 6250 Hz
> acpiprt0 at acpi0: bus 0 (PCI0)
> acpipci0 at acpi0 PCI0
> acpicmos0 at acpi0
> "ACPI0007" at acpi0 not configured
> acpicpu0 at acpi0: C1(@1 halt!)
> cpu0: using VERW MDS workaround (except on vmm entry)
> pvbus0 at mainbus0: Hyper-V 0.0, Xen 4.13
> xen0 at pvbus0: features 0x2705, 64 grant table frames, event channel 2
> xbf0 at xen0 backend 0 channel 6: disk
> scsibus1 at xbf0: 1 targets
> sd0 at scsibus1 targ 0 lun 0: 
> sd0: 10240MB, 512 bytes/sector, 20971520 sectors
> xbf1 at xen0 backend 0 channel 7: cdrom
> xbf1: timed out waiting for backend to connect
> xnf0 at xen0 backend 0 channel 7: address 76:88:23:28:25:f4
> xnf1 at xen0 backend 0 channel 8: address 62:36:ed:68:46:3c
> xnf2 at xen0 backend 0 channel 9: address be:04:e2:f3:7d:75
> pci0 at mainbus0 bus 0
> pchb0 at pci0 dev 0 function 0 "Intel 82441FX" rev 0x02
> pcib0 at pci0 dev 1 function 0 "Intel 82371SB ISA" rev 0x00
> pciide0 at pci0 dev 1 function 1 "Intel 82371SB IDE" rev 0x00: DMA,
> channel 0 wired to compatibility, channel 1 wired to compatibility
> pciide0: channel 0 disabled (no drives)
> atapiscsi0 at pciide0 channel 1 drive 1
> scsibus2 at atapiscsi0: 2 targets
> cd0 at scsibus2 targ 0 lun 

Re: xbf(4): tsleep(9) -> tsleep_nsec(9)

2020-01-21 Thread Mike Belopuhov


Scott Cheloha writes:

> Given the SCSI_NOSLEEP split here I think the simplest thing we can do
> is ask to sleep as much as we delay(9).
>
> The question is: if you *could* poll in 10us intervals here with
> tsleep_nsec(9), would you want to?  If so, then this works.  If
> not, what is a more appropriate interval?
>

Hi,

I believe it would be fine to use the same value as in the delay,
"1" was just the smallest available for the tsleep.

OK mikeb for the change.

Cheers,
Mike

> Index: pv/xbf.c
> ===
> RCS file: /cvs/src/sys/dev/pv/xbf.c,v
> retrieving revision 1.32
> diff -u -p -r1.32 xbf.c
> --- pv/xbf.c  17 Jul 2017 10:30:03 -  1.32
> +++ pv/xbf.c  15 Jan 2020 06:20:25 -
> @@ -738,7 +738,7 @@ xbf_poll_cmd(struct scsi_xfer *xs)
>   if (ISSET(xs->flags, SCSI_NOSLEEP))
>   delay(10);
>   else
> - tsleep(xs, PRIBIO, "xbfpoll", 1);
> + tsleep_nsec(xs, PRIBIO, "xbfpoll", USEC_TO_NSEC(10));
>   xbf_intr(xs->sc_link->adapter_softc);
>   } while(--timo > 0);
>  
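
To put the two intervals side by side: a one-tick tsleep(9) timeout lasts as long as one clock tick, while the new call asks for 10 microseconds. The sketch below does that arithmetic in user space. `usec_to_nsec()` is a local stand-in for the kernel's USEC_TO_NSEC(), `tick_nsec()` is a hypothetical helper, and hz = 100 is an assumed tick rate rather than something stated in the thread.

```c
#include <stdint.h>

/* Local stand-in for the kernel's USEC_TO_NSEC() conversion. */
static inline uint64_t
usec_to_nsec(uint64_t usec)
{
	return usec * 1000ULL;
}

/* Length of one clock tick in nanoseconds for a given hz. */
static inline uint64_t
tick_nsec(unsigned int hz)
{
	return 1000000000ULL / hz;
}
```

With hz = 100, one tick is 10,000,000 ns, so the requested 10,000 ns is a thousand times shorter. How short the sleep turns out in practice depends on the kernel's timeout resolution, which this sketch does not model.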



Re: sparc64: find root device on hardware RAID

2019-12-27 Thread Mike Belopuhov


Klemens Nanni writes:

> On Thu, Dec 26, 2019 at 07:49:06PM +0100, Mark Kettenis wrote:
>> Well, there's your problem.  The mpii(4) doesn't fill in the WWNs for
>> the logical volume so there is nothing that can be matched to the WWN
>> from the bootpath.
> Obvious now that you mention it.
>
>> > See below a diff for debug printf() I use to look at those values.
>> > Complete console log from OBP prompt to multiuser follows to show the
>> > boot process and debug output for all devices.
>> > 
>> > What I find odd is how 0aa32290d5dcd16c is the WWID of the RAID volume,
>> > and yet all devices attaching to scsibus* including those not being part
>> > of the RAID show the very same bp->val[0] of 3aa32290d5dcd16c.
>> 
>> bp->val[0] comes from the boot path; there is only one.
> Ha, sure that.  I confused myself with printing it for every device
> passing that code path where it is used as target, hence debug printfs
> showing the same value for multiple devices.
>
>> As you can see, the WWNs are filled in for the other disks (sd1, cd0)
>> that attach to the controller.  So you probably need some additional
>> code in mpii(4) to fill in the WWNs for logical volumes.  I recommend
>> talking to dlg@ and jmatthew@ directly about that.
> That makes sense, I didn't look toward mpii(4) yet.
>
> Thank you for pointing things out and asking such questions, this is
> very very helpful guidance.  I'm looking further into the controller
> driver now.


Looks like WWID for the RAID volume can be read from the RAID Volume
Page 1 (mpii_cfg_raid_vol_pg1).

Cheers,
Mike



Re: pfctl: Do not optimize empty rulesets

2019-12-12 Thread Mike Belopuhov


Klemens Nanni writes:

> On Wed, Nov 27, 2019 at 08:04:47PM +0100, Klemens Nanni wrote:
>> If an anchor/ruleset contains no rules, there is no point in creating
>> a temporary copy, optimizing and replacing it.
>> 
>> Regress passes on amd64.
>> 
>> Feedback? OK?
> Anyone?
>

FWIW, it looks good to me. Ok mikeb

> All optimizations work on actual rules;  if there are none, we don't
> need to look further, especially not in "profile" mode where existing
> rules are read from the kernel as feedback: an empty ruleset will stay
> empty after optimization is done.
>
> This also does not affect `set' or `table' lines in any way, e.g.
>
>   # echo 'table ' | pfctl -o basic -d -nf-
>
> still is an empty ruleset.
>
>
> I came across this while debugging anchors, but with -DOPT_DEBUG as well
> this time, where `-d' output for multiple anchors wouldn't really be helpful:
>
>   $ pfctl -dnf test.pf
>   pfctl_optimize_ruleset: optimizing ruleset
>   pfctl_optimize_ruleset: optimizing ruleset
>   pfctl_optimize_ruleset: optimizing ruleset
>
> So below is an updated diff that also prints the anchor path, letting
> developers know which anchor is being optimized in what order:
>
>   pfctl_optimize_ruleset: optimizing ruleset ""
>   pfctl_optimize_ruleset: optimizing ruleset "a1"
>   pfctl_optimize_ruleset: optimizing ruleset "_1/a2"
>
> Yes, the main anchor prints as "" but all that is behind compile time
> -DOPT_DEBUG so regular users won't deal with it anyway, so keep the code
> simple instead of adding logging around `rs->anchor->path'.
>
> OK?
>
>
> Index: pfctl_optimize.c
> ===
> RCS file: /cvs/src/sbin/pfctl/pfctl_optimize.c,v
> retrieving revision 1.42
> diff -u -p -r1.42 pfctl_optimize.c
> --- pfctl_optimize.c  28 Jun 2019 13:32:45 -  1.42
> +++ pfctl_optimize.c  12 Dec 2019 20:06:15 -
> @@ -270,7 +270,10 @@ pfctl_optimize_ruleset(struct pfctl *pf,
>   struct pf_rule *r;
>   struct pf_rulequeue *old_rules;
>  
> - DEBUG("optimizing ruleset");
> + if (TAILQ_EMPTY(rs->rules.active.ptr))
> + return (0);
> +
> + DEBUG("optimizing ruleset \"%s\"", rs->anchor->path);
>   memset(&table_buffer, 0, sizeof(table_buffer));
>   skip_init();
>   TAILQ_INIT(_queue);
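
The early return is easy to try with the queue(3) macros. This is a minimal sketch under assumptions, not the pfctl code: `struct rule`, `struct rulequeue` and `optimize_rules()` are hypothetical names, and the loop body is a placeholder for the real optimization passes.

```c
#include <stddef.h>
#include <sys/queue.h>

struct rule {
	int id;
	TAILQ_ENTRY(rule) entries;
};
TAILQ_HEAD(rulequeue, rule);

/*
 * Returns the number of rules an optimizer pass would visit,
 * or 0 immediately when the queue is empty.
 */
static int
optimize_rules(struct rulequeue *rq)
{
	struct rule *r;
	int visited = 0;

	if (TAILQ_EMPTY(rq))
		return 0;	/* nothing to copy, optimize, or replace */

	TAILQ_FOREACH(r, rq, entries)
		visited++;	/* placeholder for real optimization work */
	return visited;
}
```

Checking before any setup work means an empty anchor costs one pointer comparison and produces no debug output.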



Re: iked(8): fix error handling in msg_send

2019-11-15 Thread Mike Belopuhov


Tobias Heider writes:

> On Thu, Nov 14, 2019 at 09:57:27AM -0700, Theo de Raadt wrote:
>> > 
>> > The problem here is that log_warn can change errno,
>> 
>> No, it specifically avoids touching errno.
>> 
>> log_warn(const char *emsg, ...)
>> {
>> char*nfmt;
>> va_list  ap;
>> int  saved_errno = errno;
>> ...
>> errno = saved_errno;
>> }
>> 
>
> Good to know, thanks! In that case I really prefer Mike's diff.
> Here is an update with msg->msg_sa used consistently. We can also do it
> the other way around, but I would prefer to use either sa or msg_sa.

I'm sorry, it was a bit irresponsible of me to put msg_sa everywhere
in my diff.  In fact, almost all of the code in this file uses 'sa'
so it would be consistent with other functions if 'sa' would be used
instead of 'msg_sa'.  If you don't mind changing it to 'sa', I would
appreciate it.  OK mikeb either way.
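
Calling log_warn() before the errno test is only safe because the logger restores errno, as Theo's excerpt shows. Here is a minimal user-space sketch of the same save/restore pattern. `warn_keep_errno()` is a hypothetical name and writes to stderr; it is not iked's log_warn().

```c
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Print "fmt: strerror(errno)" to stderr without disturbing errno. */
static void
warn_keep_errno(const char *fmt, ...)
{
	va_list ap;
	int saved_errno = errno;

	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);
	va_end(ap);
	fprintf(stderr, ": %s\n", strerror(saved_errno));

	errno = saved_errno;	/* stdio calls above may have changed it */
}
```

A caller can therefore log first and still compare errno against EADDRNOTAVAIL afterwards.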

>
> Index: ikev2_msg.c
> ===
> RCS file: /cvs/src/sbin/iked/ikev2_msg.c,v
> retrieving revision 1.58
> diff -u -p -r1.58 ikev2_msg.c
> --- ikev2_msg.c   13 Nov 2019 12:24:40 -  1.58
> +++ ikev2_msg.c   14 Nov 2019 17:15:42 -
> @@ -303,7 +303,6 @@ ikev2_msg_valid_ike_sa(struct iked *env,
>  int
>  ikev2_msg_send(struct iked *env, struct iked_message *msg)
>  {
> - struct iked_sa  *sa = msg->msg_sa;
>   struct ibuf *buf = msg->msg_data;
>   uint32_t natt = 0x;
>   int  isnatt = 0;
> @@ -338,7 +337,8 @@ ikev2_msg_send(struct iked *env, struct 
>   if (sendtofrom(msg->msg_fd, ibuf_data(buf), ibuf_size(buf), 0,
>   (struct sockaddr *)&msg->msg_peer, msg->msg_peerlen,
>   (struct sockaddr *)&msg->msg_local, msg->msg_locallen) == -1) {
> - if (errno == EADDRNOTAVAIL) {
> + log_warn("%s: sendtofrom", __func__);
> + if (msg->msg_sa != NULL && errno == EADDRNOTAVAIL) {
>   sa_state(env, msg->msg_sa, IKEV2_STATE_CLOSING);
>   timer_del(env, &msg->msg_sa->sa_timer);
>   timer_set(env, &msg->msg_sa->sa_timer,
> @@ -346,11 +346,11 @@ ikev2_msg_send(struct iked *env, struct 
>   timer_add(env, &msg->msg_sa->sa_timer,
>   IKED_IKE_SA_DELETE_TIMEOUT);
>   }
> - log_warn("%s: sendtofrom", __func__);
> - return (-1);
> + if (msg->msg_sa != NULL)
> + return (-1);
>   }
>  
> - if (!sa)
> + if (msg->msg_sa == NULL)
>   return (0);
>  
>   if ((m = ikev2_msg_copy(env, msg)) == NULL) {
> @@ -360,11 +360,11 @@ ikev2_msg_send(struct iked *env, struct 
>   m->msg_exchange = exchange;
>  
>   if (flags & IKEV2_FLAG_RESPONSE) {
> - TAILQ_INSERT_TAIL(&sa->sa_responses, m, msg_entry);
> + TAILQ_INSERT_TAIL(&msg->msg_sa->sa_responses, m, msg_entry);
>   timer_set(env, &m->msg_timer, ikev2_msg_response_timeout, m);
>   timer_add(env, &m->msg_timer, IKED_RESPONSE_TIMEOUT);
>   } else {
> - TAILQ_INSERT_TAIL(&sa->sa_requests, m, msg_entry);
> + TAILQ_INSERT_TAIL(&msg->msg_sa->sa_requests, m, msg_entry);
>   timer_set(env, &m->msg_timer, ikev2_msg_retransmit_timeout, m);
>   timer_add(env, &m->msg_timer, IKED_RETRANSMIT_TIMEOUT);
>   }



Re: iked(8): fix error handling in msg_send

2019-11-14 Thread Mike Belopuhov


Tobias Heider writes:

> Hi,
>
> in the error case, ikev2_msg_send accesses the sa before checking for
> NULL. The diff adds explicit checks in those cases.
> If sendtofrom fails for any other reason than EADDRNOTAVAIL and sa is not NULL
> we should continue instead of returning (-1) so that the error is handled with
> retransmission.
>
> ok?
>

Hi Tobias,

you can write a simpler diff w/o repeating log_warn:

diff --git a/sbin/iked/ikev2_msg.c b/sbin/iked/ikev2_msg.c
index 2baea5f5508..396fea88c16 100644
--- a/sbin/iked/ikev2_msg.c
+++ b/sbin/iked/ikev2_msg.c
@@ -338,7 +338,8 @@ ikev2_msg_send(struct iked *env, struct iked_message *msg)
if (sendtofrom(msg->msg_fd, ibuf_data(buf), ibuf_size(buf), 0,
(struct sockaddr *)&msg->msg_peer, msg->msg_peerlen,
(struct sockaddr *)&msg->msg_local, msg->msg_locallen) == -1) {
-   if (errno == EADDRNOTAVAIL) {
+   log_warn("%s: sendtofrom", __func__);
+   if (msg->msg_sa != NULL && errno == EADDRNOTAVAIL) {
sa_state(env, msg->msg_sa, IKEV2_STATE_CLOSING);
timer_del(env, &msg->msg_sa->sa_timer);
timer_set(env, &msg->msg_sa->sa_timer,
@@ -346,8 +347,8 @@ ikev2_msg_send(struct iked *env, struct iked_message *msg)
timer_add(env, &msg->msg_sa->sa_timer,
IKED_IKE_SA_DELETE_TIMEOUT);
}
-   log_warn("%s: sendtofrom", __func__);
-   return (-1);
+   if (msg->msg_sa != NULL)
+   return (-1);
}
 
if (!sa)


Regards,
Mike


> Index: ikev2_msg.c
> ===
> RCS file: /mount/openbsd/cvs/src/sbin/iked/ikev2_msg.c,v
> retrieving revision 1.58
> diff -u -p -r1.58 ikev2_msg.c
> --- ikev2_msg.c   13 Nov 2019 12:24:40 -  1.58
> +++ ikev2_msg.c   14 Nov 2019 15:37:11 -
> @@ -339,15 +339,20 @@ ikev2_msg_send(struct iked *env, struct 
>   (struct sockaddr *)&msg->msg_peer, msg->msg_peerlen,
>   (struct sockaddr *)&msg->msg_local, msg->msg_locallen) == -1) {
>   if (errno == EADDRNOTAVAIL) {
> - sa_state(env, msg->msg_sa, IKEV2_STATE_CLOSING);
> - timer_del(env, &msg->msg_sa->sa_timer);
> - timer_set(env, &msg->msg_sa->sa_timer,
> - ikev2_ike_sa_timeout, msg->msg_sa);
> - timer_add(env, &msg->msg_sa->sa_timer,
> - IKED_IKE_SA_DELETE_TIMEOUT);
> + if (sa != NULL) {
> + sa_state(env, sa, IKEV2_STATE_CLOSING);
> + timer_del(env, &sa->sa_timer);
> + timer_set(env, &sa->sa_timer,
> + ikev2_ike_sa_timeout, sa);
> + timer_add(env, &sa->sa_timer,
> + IKED_IKE_SA_DELETE_TIMEOUT);
> + }
> + log_warn("%s: sendtofrom", __func__);
> + return (-1);
>   }
>   log_warn("%s: sendtofrom", __func__);
> - return (-1);
> + if (!sa)
> + return (-1);
>   }
>  
>   if (!sa)



Re: iked(8): add configuration option for esn

2019-11-12 Thread Mike Belopuhov
On Tue, 12 Nov 2019 at 16:08, Tobias Heider  wrote:

> On Tue, Nov 12, 2019 at 09:57:31AM +0100, Mike Belopuhov wrote:
> > Hi Tobias,
> >
> > I see, however, I don't think iked would negotiate an SA
> > without ESN support if the other side supports ESN, so I'm
> > not sure how "enforcing" changes that.
>
> It doesn't, but if I have an iked on both sides one will have to
> make the decision. I have another case where I actually can not
> use ESN, with two ikeds this can not be configured currently.
>
> > In any case, I'm not opposed to adding a toggle if you guys
> > need it, but could you please adjust the grammar so that "esn"
> > and "no esn" are used instead of "on" and "off" since that's
> > what we're normally doing.  "on" and "off" are crutches for
> > simple file formats, parse.y allows you to make it a bit nicer.
>
> Makes sense. Here is the updated diff including a fix for bluhm's
> comment.
>
>
While I meant "no esn" with a space, I see that you and Patrick have
been adding things like "nofragmentation" and "nomobike".  These
should be written with a space as well.

Nevertheless ok mikeb, hopefully you'll get around to fixing the grammar
later on.


Re: iked(8): add configuration option for esn

2019-11-12 Thread Mike Belopuhov
Hi Tobias,

I see, however, I don't think iked would negotiate an SA
without ESN support if the other side supports ESN, so I'm
not sure how "enforcing" changes that.

In any case, I'm not opposed to adding a toggle if you guys
need it, but could you please adjust the grammar so that "esn"
and "no esn" are used instead of "on" and "off" since that's
what we're normally doing.  "on" and "off" are crutches for
simple file formats, parse.y allows you to make it a bit nicer.

Regards,
Mike

On Mon, 11 Nov 2019 at 16:38, Tobias Heider  wrote:

> Sure, I have a crypto device that only supports SAs with ESN.
> For it to be used I have to force iked to only negotiate SAs with ESN
> support.
> Another one is high-speed network cards:
> Accepting a policy with ESN disabled can throttle my throughput because it
> exhausts the sequence number space forcing me to rekey more often than I
> would
> like.
>
> On Mon, Nov 11, 2019 at 04:15:32PM +0100, Mike Belopuhov wrote:
> > On Mon, 11 Nov 2019 at 16:08, Tobias Heider 
> wrote:
> >
> > > Hi Mike,
> > >
> > > the default behaviour is the same as before. I ran into cases where it
> is
> > > necessary for me to enforce ESN to be enabled/disabled, which is not
> > > possible
> > > currently.
> > >
> >
> > Can you please describe those cases where you had to enforce it?
>
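
The throughput argument comes down to how quickly a 32-bit sequence space of 2^32 packets is used up at a given packet rate. The sketch below does that division. `seq32_lifetime_sec()` is a hypothetical helper and the packet rates are illustrative assumptions, not measurements from this thread.

```c
#include <stdint.h>

/*
 * Seconds until a 32-bit sequence space (2^32 packets) is used up.
 * Returns UINT64_MAX for a rate of zero, i.e. "never".
 */
static uint64_t
seq32_lifetime_sec(uint64_t packets_per_sec)
{
	if (packets_per_sec == 0)
		return UINT64_MAX;
	return (UINT64_C(1) << 32) / packets_per_sec;
}
```

At an assumed one million packets per second the space lasts roughly 4294 seconds, a little over an hour, and at ten million it lasts about seven minutes. A 64-bit extended sequence number removes this limit as a practical reason to rekey.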


Re: iked(8): add configuration option for esn

2019-11-11 Thread Mike Belopuhov
On Mon, 11 Nov 2019 at 16:08, Tobias Heider  wrote:

> Hi Mike,
>
> the default behaviour is the same as before. I ran into cases where it is
> necessary for me to enforce ESN to be enabled/disabled, which is not
> possible
> currently.
>

Can you please describe those cases where you had to enforce it?


Re: iked(8): add configuration option for esn

2019-11-11 Thread Mike Belopuhov
On Mon, 11 Nov 2019 at 15:47, Tobias Heider  wrote:

> Currently iked does not provide an option to configure extended sequence
> numbers
> (ESN) for child SAs, but always proposes/accepts both options.
> This diff adds a new optional "esn on/off" config option to explicitly
> enable
> or disable esn.
>
> ok?
>
>
Hi Tobias,

What's wrong with the current behavior?  Does your patch retain it?

Regards,
Mike


Re: Attach Hyper-V guest services to VMBus 4.0

2019-10-05 Thread Mike Belopuhov


Remi Locherer writes:

> On Tue, Oct 01, 2019 at 12:25:35AM +0200, Mike Belopuhov wrote:
>> 
>> 
>> Hi,
>> 
>> I've got a verbal report that Hyper-V guest services aren't attached
>> on modern Windows 10 systems so I believe we should get this one-liner
>> in before 6.6.
>> 
>> FreeBSD revision 349856 adds another define for VMBus 5.0 but AFAICT
>> it doesn't attempt to use it in version negotiations.
>> 
>> Unfortunately, I can't test this myself at the moment.
>> 
>> I've got another two fixes for Hyper-V but can't test them either, so
>> if somebody is willing to test, please take a look at http://ix.io/1X2V
>
> With the diff from this link I'm getting the following dmesg. The VM
> seems to work fine.
>

Hi Remi,

Thanks for testing.

Does it work with a plain OpenBSD-current w/o any diffs?

I feel confident we should get the attached one-liner in,
any OKs?


> Cheers,
> Remi
>
>
> OpenBSD 6.6 (GENERIC.MP) #17: Sat Oct  5 11:52:48 CEST 2019
> r...@typhoon.relo.ch:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> real mem = 1056899072 (1007MB)
> avail mem = 1012211712 (965MB)
> mpath0 at root
> scsibus0 at mpath0: 256 targets
> mainbus0 at root
> bios0 at mainbus0: SMBIOS rev. 2.3 @ 0xf93d0 (338 entries)
> bios0: vendor American Megatrends Inc. version "090008" date 12/07/2018
> bios0: Microsoft Corporation Virtual Machine
> acpi0 at bios0: ACPI 2.0
> acpi0: sleep states S0 S5
> acpi0: tables DSDT FACP WAET SLIC OEM0 SRAT APIC OEMB
> acpi0: wakeup devices
> acpitimer0 at acpi0: 3579545 Hz, 32 bits
> acpihve0 at acpi0
> acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
> ioapic0 at mainbus0: apid 0 pa 0xfec0, version 11, 24 pins, remapped
> cpu0 at mainbus0: apid 0 (boot processor)
> cpu0: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz, 1399.64 MHz, 06-8e-0a
> cpu0: 
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SS,SSE3,PCLMUL,SSSE3,FMA3,CX16,PCID,SSE4.1,SSE4.2,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,RDRAND,HV,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,FSGSBASE,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,IBRS,IBPB,STIBP,L1DF,SSBD,XSAVEOPT,XSAVEC,XGETBV1,XSAVES,MELTDOWN
> cpu0: 256KB 64b/line 8-way L2 cache
> tsc_timecounter_init: TSC skew=0 observed drift=0
> cpu0: smt 0, core 0, package 0
> mtrr: Pentium Pro MTRR support, 8 var ranges, 88 fixed ranges
> cpu0: apic clock running at 159MHz
> acpiprt0 at acpi0: bus 0 (PCI0)
> acpicpu0 at acpi0: C1(@1 halt!)
> acpipci0 at acpi0 PCI0: _OSC failed
> acpicmos0 at acpi0
> "VMBus" at acpi0 not configured
> "Hyper_V_Gen_Counter_V1" at acpi0 not configured
> cpu0: using Skylake AVX MDS workaround
> pvbus0 at mainbus0: Hyper-V 10.0
> hyperv0 at pvbus0: protocol 5.0, features 0x2e7f
> hyperv0: heartbeat, kvp, shutdown, timesync
> hvs0 at hyperv0 channel 2: ide, protocol 6.2
> scsibus1 at hvs0: 2 targets
> sd0 at scsibus1 targ 0 lun 0:  
> naa.60022480c6c46e45fe9338343c3f1c08
> sd0: 20480MB, 512 bytes/sector, 41943040 sectors, thin
> hvs1 at hyperv0 channel 15: scsi, protocol 6.2
> scsibus2 at hvs1: 2 targets
> hvn0 at hyperv0 channel 14: NVS 5.0 NDIS 6.30, address 00:15:5d:b6:9f:19
> pci0 at mainbus0 bus 0
> pchb0 at pci0 dev 0 function 0 "Intel 82443BX" rev 0x03
> pcib0 at pci0 dev 7 function 0 "Intel 82371AB PIIX4 ISA" rev 0x01
> pciide0 at pci0 dev 7 function 1 "Intel 82371AB IDE" rev 0x01: DMA, channel 0 
> wired to compatibility, channel 1 wired to compatibility
> pciide0: channel 0 disabled (no drives)
> atapiscsi0 at pciide0 channel 1 drive 0
> scsibus3 at atapiscsi0: 2 targets
> cd0 at scsibus3 targ 0 lun 0:  removable
> cd0(pciide0:1:0): using PIO mode 4, DMA mode 2
> piixpm0 at pci0 dev 7 function 3 "Intel 82371AB Power" rev 0x02: SMBus 
> disabled
> vga1 at pci0 dev 8 function 0 "Microsoft VGA" rev 0x00
> wsdisplay0 at vga1 mux 1: console (80x25, vt100 emulation)
> wsdisplay0: screen 1-5 added (80x25, vt100 emulation)
> isa0 at pcib0
> isadma0 at isa0
> fdc0 at isa0 port 0x3f0/6 irq 6 drq 2
> com0 at isa0 port 0x3f8/8 irq 4: ns16550a, 16 byte fifo
> com1 at isa0 port 0x2f8/8 irq 3: ns16550a, 16 byte fifo
> pckbc0 at isa0 port 0x60/5 irq 1 irq 12
> pckbd0 at pckbc0 (kbd slot)
> wskbd0 at pckbd0: console keyboard, using wsdisplay0
> pms0 at pckbc0 (aux slot)
> wsmouse0 at pms0 mux 0
> pcppi0 at isa0 port 0x61
> spkr0 at pcppi0
> vscsi0 at root
> scsibus4 at vscsi0: 256 targets
> softraid0 at root
> scsibus5 at softraid0: 256 targets
> root on sd0a (d3de7339e9421b70.a) swap on sd0b dump on sd0b
> fd0 at fdc0 drive 0: 1.44MB 80 cyl, 2 head, 18 sec
> fd1 at fdc0 drive 1: den

Re: Attach Hyper-V guest services to VMBus 4.0

2019-10-04 Thread Mike Belopuhov


Andre Stoebe writes:

> On 03.10.2019 02:13, Mike Belopuhov wrote:
>> And what about OpenBSD-current or an attached patch as opposed
>> to the linked one?
>> 
>> Please don't go half the way if you're willing to help us out,
>> we'd like to make OpenBSD 6.6-release work in these setups
>> especially since we believe that all it takes is a one-line
>> diff.
>
> Hi Mike,
>
> you're right, I should know better by now to include all information.
>
> Here are the three dmesgs (-current, attached patch, linked patch) with
> all integration services enabled.
>

Thanks a lot for the dmesgs.  Do I understand correctly that the
paravirtualized disk and network work fine for you regardless of whether
OpenBSD-current, the attached one-liner diff or the linked larger diff is used?

With best regards,
Mike

> Regards,
> Andre
>
> dmesg (-current):
>
> OpenBSD 6.6 (GENERIC.MP) #344: Wed Oct  2 11:48:47 MDT 2019
> dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> real mem = 8573091840 (8175MB)
> avail mem = 8300560384 (7916MB)
> mpath0 at root
> scsibus0 at mpath0: 256 targets
> mainbus0 at root
> bios0 at mainbus0: SMBIOS rev. 2.3 @ 0xf93d0 (338 entries)
> bios0: vendor American Megatrends Inc. version "090008" date 12/07/2018
> bios0: Microsoft Corporation Virtual Machine
> acpi0 at bios0: ACPI 2.0
> acpi0: sleep states S0 S5
> acpi0: tables DSDT FACP WAET SLIC OEM0 SRAT APIC OEMB
> acpi0: wakeup devices
> acpitimer0 at acpi0: 3579545 Hz, 32 bits
> acpihve0 at acpi0
> acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
> ioapic0 at mainbus0: apid 0 pa 0xfec0, version 11, 24 pins, remapped
> cpu0 at mainbus0: apid 0 (boot processor)
> cpu0: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz, 2811.42 MHz, 06-5e-03
> cpu0: 
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SS,HTT,SSE3,PCLMUL,SSSE3,FMA3,CX16,PCID,SSE4.1,SSE4.2,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,RDRAND,HV,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,FSGSBASE,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,XSAVEOPT,XSAVEC,XGETBV1,XSAVES,MELTDOWN
> cpu0: 256KB 64b/line 8-way L2 cache
> tsc_timecounter_init: TSC skew=0 observed drift=0
> cpu0: smt 0, core 0, package 0
> mtrr: Pentium Pro MTRR support, 8 var ranges, 88 fixed ranges
> cpu0: apic clock running at 163MHz
> cpu1 at mainbus0: apid 1 (application processor)
> TSC skew=-9
> cpu1: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz, 2855.87 MHz, 06-5e-03
> cpu1: 
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SS,HTT,SSE3,PCLMUL,SSSE3,FMA3,CX16,PCID,SSE4.1,SSE4.2,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,RDRAND,HV,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,FSGSBASE,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,XSAVEOPT,XSAVEC,XGETBV1,XSAVES,MELTDOWN
> cpu1: 256KB 64b/line 8-way L2 cache
> tsc_timecounter_init: TSC skew=-9 observed drift=0
> cpu1: smt 0, core 1, package 0
> cpu2 at mainbus0: apid 2 (application processor)
> TSC skew=8
> cpu2: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz, 2855.90 MHz, 06-5e-03
> cpu2: 
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SS,HTT,SSE3,PCLMUL,SSSE3,FMA3,CX16,PCID,SSE4.1,SSE4.2,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,RDRAND,HV,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,FSGSBASE,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,XSAVEOPT,XSAVEC,XGETBV1,XSAVES,MELTDOWN
> cpu2: 256KB 64b/line 8-way L2 cache
> tsc_timecounter_init: TSC skew=8 observed drift=0
> cpu2: smt 0, core 2, package 0
> cpu3 at mainbus0: apid 3 (application processor)
> TSC skew=-20
> cpu3: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz, 2855.89 MHz, 06-5e-03
> cpu3: 
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SS,HTT,SSE3,PCLMUL,SSSE3,FMA3,CX16,PCID,SSE4.1,SSE4.2,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,RDRAND,HV,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,FSGSBASE,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,XSAVEOPT,XSAVEC,XGETBV1,XSAVES,MELTDOWN
> cpu3: 256KB 64b/line 8-way L2 cache
> tsc_timecounter_init: TSC skew=-20 observed drift=0
> cpu3: smt 0, core 3, package 0
> acpiprt0 at acpi0: bus 0 (PCI0)
> acpicpu0 at acpi0: C1(@1 halt!)
> acpicpu1 at acpi0: C1(@1 halt!)
> acpicpu2 at acpi0: C1(@1 halt!)
> acpicpu3 at acpi0: C1(@1 halt!)
> acpipci0 at acpi0 PCI0: _OSC failed
> acpicmos0 at acpi0
> "VMBus" at acpi0 not configured
> "Hyper_V_Gen_Counter_V1" at acpi0 not configured
> cpu0: using VERW MDS workaround (excep

Re: Attach Hyper-V guest services to VMBus 4.0

2019-10-02 Thread Mike Belopuhov


Andre Stoebe writes:

> On 01.10.2019 00:25, Mike Belopuhov wrote:
>> 
>> 
>> Hi,
>> 
>> I've got a verbal report that Hyper-V guest services aren't attached
>> on modern Windows 10 systems so I believe we should get this one-liner
>> in before 6.6.
>> 
>> FreeBSD revision 349856 adds another define for VMBus 5.0 but AFAICT
>> it doesn't attempt to use it in version negotiations.
>> 
>> Unfortunately, I can't test this myself at the moment.
>> 
>> I've got another two fixes for Hyper-V but can't test them either, so
>> if somebody is willing to test, please take a look at http://ix.io/1X2V
>
> Hi,
>
> I tested both diffs on Win 10 1903 and didn't see any immediate problems.
>
> But I don't use and have disabled most of the "integration services"
> besides shutdown (which worked already), so I can't really comment on
> the guest services. Maybe someone knows a way using PowerShell to show
> the status, so I could compare that to an older OpenBSD VM?
>

Why have you disabled integration services?
The whole point of these diffs is to make them work on recent Windows
versions.

> With your linked patch, dmesg shows:
> hyperv0 at pvbus0: protocol 5.0, features 0x2e7f
>

And what about OpenBSD-current or an attached patch as opposed
to the linked one?

Please don't go half the way if you're willing to help us out,
we'd like to make OpenBSD 6.6-release work in these setups
especially since we believe that all it takes is a one-line
diff.

> Andre


Cheers,
Mike



Attach Hyper-V guest services to VMBus 4.0

2019-09-30 Thread Mike Belopuhov



Hi,

I've got a verbal report that Hyper-V guest services aren't attached
on modern Windows 10 systems so I believe we should get this one-liner
in before 6.6.

FreeBSD revision 349856 adds another define for VMBus 5.0 but AFAICT
it doesn't attempt to use it in version negotiations.

Unfortunately, I can't test this myself at the moment.

I've got another two fixes for Hyper-V but can't test them either, so
if somebody is willing to test, please take a look at http://ix.io/1X2V


Cheers,
Mike


diff --git sys/dev/pv/hyperv.c sys/dev/pv/hyperv.c
index a75276335d6..3ab2ae22831 100644
--- sys/dev/pv/hyperv.c
+++ sys/dev/pv/hyperv.c
@@ -803,10 +803,11 @@ hv_channel_delivered(struct hv_softc *sc, struct 
vmbus_chanmsg_hdr *hdr)
 
 int
 hv_vmbus_connect(struct hv_softc *sc)
 {
const uint32_t versions[] = {
+   VMBUS_VERSION_WIN10,
VMBUS_VERSION_WIN8_1, VMBUS_VERSION_WIN8,
VMBUS_VERSION_WIN7, VMBUS_VERSION_WS2008
};
struct vmbus_chanmsg_connect cmd;
struct vmbus_chanmsg_connect_resp rsp;
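
The one-liner above only matters because of the order of the list: the newest version goes first, and presumably the driver offers each entry in turn until the host accepts one. A minimal sketch of that idea follows. The EX_* version codes, ex_negotiate() and the host stub are illustrative names and values, not the contents of hyperv.c or its headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative version codes (major << 16 | minor); assumed, not copied. */
#define EX_VERSION_WS2008	((0u << 16) | 13u)
#define EX_VERSION_WIN7		((1u << 16) | 1u)
#define EX_VERSION_WIN8		((2u << 16) | 4u)
#define EX_VERSION_WIN8_1	((3u << 16) | 0u)
#define EX_VERSION_WIN10	((4u << 16) | 0u)

/* Callback that offers one version to the host; nonzero means accepted. */
typedef int (*ex_offer_fn)(uint32_t version, void *arg);

/*
 * Walk the list front to back and return the first version the host
 * accepts, or 0 if none is accepted.  Newest-first ordering means the
 * best mutually supported version wins.
 */
static uint32_t
ex_negotiate(const uint32_t *versions, size_t n, ex_offer_fn offer, void *arg)
{
	size_t i;

	assert(versions != NULL && offer != NULL);
	for (i = 0; i < n; i++) {
		if (offer(versions[i], arg))
			return (versions[i]);
	}
	return (0);
}

/* Stub host: accepts any version up to the maximum that arg points to. */
static int
ex_host_accepts_up_to(uint32_t version, void *arg)
{
	const uint32_t *max = arg;

	return (version <= *max);
}
```

With the newest entry missing from the list, a host that supports it would still settle on the next older version, which is the situation the diff is meant to address.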



Re: relayd: fix filter rules with forward to statement

2019-05-13 Thread Mike Belopuhov


Reyk Floeter writes:

> Hi,
>
> the attached diff fixes filter rules with "forward to" statement in
> persistent (keep-alive) connections.  See the XXX comment below.
>
> ```relayd.conf
> log connection
> table  {
> 127.0.0.1
> }
> table  {
> 127.0.0.1
> }
> table  {
> 127.0.0.1
> }
> http protocol pathfwd {
> return error
>
>   # XXX The following workaround is not needed anymore:
>   #match header set "Connection" value "close"
>
> pass path "/a/*" forward to 
> pass path "/b/*" forward to 
> #match request path log "*"
> }
> relay pathfwd {
> listen on 0.0.0.0 port 80
> protocol pathfwd
> forward to  port 8082
> forward to  port 8080
> forward to  port 8081
> }
> ```
>
> OK?
>

Works great for us. FWIW, OK mikeb


> reyk
>
> Index: usr.sbin/relayd/relay.c
> ===
> RCS file: /cvs/src/usr.sbin/relayd/relay.c,v
> retrieving revision 1.242
> diff -u -p -u -p -r1.242 relay.c
> --- usr.sbin/relayd/relay.c   4 Mar 2019 21:25:03 -   1.242
> +++ usr.sbin/relayd/relay.c   8 May 2019 14:26:40 -
> @@ -76,11 +76,14 @@ intrelay_tls_ctx_create(struct relay 
>  void  relay_tls_transaction(struct rsession *,
>   struct ctl_relay_event *);
>  void  relay_tls_handshake(int, short, void *);
> -void  relay_connect_retry(int, short, void *);
>  void  relay_tls_connected(struct ctl_relay_event *);
>  void  relay_tls_readcb(int, short, void *);
>  void  relay_tls_writecb(int, short, void *);
>  
> +void  relay_connect_retry(int, short, void *);
> +void  relay_connect_state(struct rsession *,
> + struct ctl_relay_event *, enum relay_state);
> +
>  extern void   bufferevent_read_pressure_cb(struct evbuffer *, size_t,
>   size_t, void *);
>  
> @@ -654,6 +657,7 @@ relay_socket_listen(struct sockaddr_stor
>  void
>  relay_connected(int fd, short sig, void *arg)
>  {
> + char obuf[128];
>   struct rsession *con = arg;
>   struct relay*rlay = con->se_relay;
>   struct protocol *proto = rlay->rl_proto;
> @@ -696,6 +700,22 @@ relay_connected(int fd, short sig, void 
>  
>   DPRINTF("%s: session %d: successful", __func__, con->se_id);
>  
> + /* Log destination if it was changed in a keep-alive connection */
> + if ((con->se_table != con->se_table0) &&
> + (env->sc_conf.opts & (RELAYD_OPT_LOGCON|RELAYD_OPT_LOGCONERR))) {
> + con->se_table0 = con->se_table;
> + memset(&obuf, 0, sizeof(obuf));
> + (void)print_host(&con->se_out.ss, obuf, sizeof(obuf));
> + if (asprintf(&msg, " -> %s:%d",
> + obuf, ntohs(con->se_out.port)) == -1) {
> + relay_abort_http(con, 500,
> + "connection changed and asprintf failed", 0);
> + return;
> + }
> + relay_log(con, msg);
> + free(msg);
> + }
> +
>   switch (rlay->rl_proto->type) {
>   case RELAY_PROTO_HTTP:
>   if (relay_httpdesc_init(out) == -1) {
> @@ -1465,6 +1485,17 @@ relay_bindany(int fd, short event, void 
>  }
>  
>  void
> +relay_connect_state(struct rsession *con, struct ctl_relay_event *cre,
> +enum relay_state new)
> +{
> + DPRINTF("%s: session %d: %s state %s -> %s",
> + __func__, con->se_id,
> + cre->dir == RELAY_DIR_REQUEST ? "accept" : "connect",
> + relay_state(cre->state), relay_state(new));
> + cre->state = new;
> +}
> +
> +void
>  relay_connect_retry(int fd, short sig, void *arg)
>  {
>   struct timeval   evtpause = { 1, 0 };
> @@ -1533,9 +1564,9 @@ relay_connect_retry(int fd, short sig, v
>   }
>  
>   if (rlay->rl_conf.flags & F_TLSINSPECT)
> - con->se_out.state = STATE_PRECONNECT;
> + relay_connect_state(con, &con->se_out, STATE_PRECONNECT);
>   else
> - con->se_out.state = STATE_CONNECTED;
> + relay_connect_state(con, &con->se_out, STATE_CONNECTED);
>   relay_inflight--;
>   DPRINTF("%s: inflight decremented, now %d",__func__, relay_inflight);
>  
> @@ -1560,7 +1591,7 @@ relay_preconnect(struct rsession *con)
>   con->se_id, privsep_process);
>   rv = relay_connect(con);
>   if (con->se_out.state == STATE_CONNECTED)
> - con->se_out.state = STATE_PRECONNECT;
> + relay_connect_state(con, &con->se_out, STATE_PRECONNECT);
>   return (rv);
>  }
>  
> @@ -1585,7 +1616,7 @@ relay_connect(struct rsession *con)
>   return (-1);
>   }
>   relay_connected(con->se_out.s, EV_WRITE, con);
> - con->se_out.state = STATE_CONNECTED;
> + relay_connect_state(con, &con->se_out, STATE_CONNECTED);
>   return (0);
>   }
>  
> @@ -1642,7 

Re: extend BPF filter drop to allow not capturing packets

2019-03-13 Thread Mike Belopuhov


David Gwynne writes:

> On Tue, Mar 05, 2019 at 12:03:05PM +1000, David Gwynne wrote:
>> this extends the fildrop mechanism so you can drop the packets with bpf
>> using the existing fildrop method, but with an extra tweak so you can 
>> avoid the cost of copying packets to userland.
>> 
>> i wanted to quickly drop some packets in the rx interrupt path to try
>> and prioritise some traffic getting processed by the system. the initial
>> version was going to use weird custom DLTs and extra bpf interface
>> pointers and stuff, but most of the glue is already in place with
>> the fildrop functionality.
>> 
>> this also adds a bit to tcpdump so you can set a fildrop action. it
>> means tcpdump can be used as a quick and dirty firewall.
>
> there's a bit more discussion about this that i should have included in
> my original email.
>
> firstly, the functionality it offers. this effectively offers a firewall
> with the ability to filter arbitrary packets. this has significant
> overlap with the functionality that pf offers, but there are a couple of
> important differences. pf only handles IP traffic, but we don't
> really have a good story when it comes to filtering non-ip. we could
> implement something like pf for the next protocol that people need to
> manage, but what is that next protocol? pf like implies a highly
> optimised but constrained set of filters that deeply understands the
> protocol it is handling. is that next protocol ieee1905p? cdp? ipx?
> macsec? where should that protocol be filtered in the stack?
>
> im arguing that bpf with fildrop has the benefit of already existing,
> it's in place, and it already has the ability to be configured with
> arbitrary policy. considering we've got this far without handling
> non-ip, spending more time on it seems unjustified.
>
> secondly, the performance aspects of this diff.
>
> bpf allows for arbitrarily complicated filters, so it is entirely
> possible to slow your box down a lot by writing really complicated
> filters. this is in comparison to pf where each rule has a limit
> on how much work it will do, which is also mitigated by the ruleset
> optimiser and skip steps. i don't have a good answer to that except to
> say you can already add such filters to bpf, they just don't do anything
> except copy packets at the moment.
>
> another interesting performance consideration is that bpf runs a lot
> earlier than pf, so filtering packets with bpf can avoid a lot of work
> in the stack. if you want to pass IP statefully, pf is a much better
> hammer, but to drop packets up front bpf is interesting.
>
> for example, thanks to hrvoje popovski i now have a setup where im
> pushing ~7 million packets per second through a box to do performance
> measurements. those packets are udp from random ips to port 7 on
> another set of random ips. if i have the following rule in pf.conf:
>
>  block in quick proto udp to port 7
>
> i can rx and drop about 550kpps. if im sshed in using another
> interface, the system is super sluggish over that shell.
>
> if i use this diff and run the following;
>
> # tcpdump -B drop -i ix1 udp and port 7
>
> i'm dropping about 1.2 million pps, and the box is responsive when sshed
> in using another interface.
>
> so, to summarise, bpf can already be used to drop packets, this is just
> a tweak to make it faster, and a tweak so tcpdump can be used to set up
> that filtering.
>

I think this is a great development. Diff looks good as well.
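
As I read the description and the BIOCSFILDROP hunk below, there are now three modes instead of a boolean, and anything else is rejected with EINVAL. A rough model of that, assuming numeric values for the constants (the excerpt does not show their definitions) and using made-up ex_* names rather than the real bpf structures:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Assumed values; the diff names the constants but not their numbers. */
enum ex_fildrop {
	EX_FILDROP_PASS = 0,	/* capture, packet continues up the stack */
	EX_FILDROP_CAPTURE = 1,	/* capture, then drop the packet */
	EX_FILDROP_DROP = 2	/* drop without copying to userland */
};

/* Stand-in for the per-descriptor state; only the field used here. */
struct ex_bpf_d {
	unsigned int	bd_fildrop;
};

/* Mirrors the validation in the hunk: accept known modes, else EINVAL. */
static int
ex_set_fildrop(struct ex_bpf_d *d, unsigned int fildrop)
{
	assert(d != NULL);
	switch (fildrop) {
	case EX_FILDROP_PASS:
	case EX_FILDROP_CAPTURE:
	case EX_FILDROP_DROP:
		d->bd_fildrop = fildrop;
		return (0);
	default:
		return (EINVAL);
	}
}

/* Does a packet matching the filter get copied to the capture buffer? */
static int
ex_matched_is_captured(unsigned int mode)
{
	return (mode != EX_FILDROP_DROP);
}

/* Does a packet matching the filter get dropped from the stack? */
static int
ex_matched_is_dropped(unsigned int mode)
{
	return (mode != EX_FILDROP_PASS);
}
```

The performance argument in the quoted mail rests on the third mode: a matching packet is discarded without paying for the copy.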

>> Index: sys/net/bpf.c
>> ===
>> RCS file: /cvs/src/sys/net/bpf.c,v
>> retrieving revision 1.170
>> diff -u -p -r1.170 bpf.c
>> --- sys/net/bpf.c 13 Jul 2018 08:51:15 -  1.170
>> +++ sys/net/bpf.c 4 Mar 2019 22:30:32 -
>> @@ -926,9 +926,20 @@ bpfioctl(dev_t dev, u_long cmd, caddr_t 
>>  *(u_int *)addr = d->bd_fildrop;
>>  break;
>>  
>> -case BIOCSFILDROP:  /* set "filter-drop" flag */
>> -d->bd_fildrop = *(u_int *)addr ? 1 : 0;
>> +case BIOCSFILDROP: {/* set "filter-drop" flag */
>> +unsigned int fildrop = *(u_int *)addr;
>> +switch (fildrop) {
>> +case BPF_FILDROP_PASS:
>> +case BPF_FILDROP_CAPTURE:
>> +case BPF_FILDROP_DROP:
>> +d->bd_fildrop = fildrop;
>> +break;
>> +default:
>> +error = EINVAL;
>> +break;
>> +}
>>  break;
>> +}
>>  
>>  case BIOCGDIRFILT:  /* get direction filter */
>>  *(u_int *)addr = d->bd_dirfilt;
>> @@ -1261,23 +1272,26 @@ _bpf_mtap(caddr_t arg, const struct mbuf
>>  pktlen += m0->m_len;
>>  
>>  SRPL_FOREACH(d, , >bif_dlist, bd_next) {
>> +struct srp_ref bsr;
>> +struct bpf_program *bf;
>> +struct bpf_insn *fcode = NULL;
>> +
>>  atomic_inc_long(>bd_rcount);
>>  
>> -if 

Re: dont let hfsc force the packet priority

2018-10-22 Thread Mike Belopuhov


David Gwynne writes:

> As discovered by Adrian Close on tech@ (in "VLAN priority field and
> PF queues"), setting up traffic shaping in pf on vlan interfaces has a
> side effect where all the packets are sent with the vlan priority field
> set to the highest value. This is because hfsc forces the mbuf priority
> to the highest value, and that ends up on the wire.
>
> I'd argue this is not what you want. HFSC queuing and packet priority
> on the wire are orthogonal, and should be configurable independently.
>
> The diff below allows a packet through HFSC to maintain its priority,
> despite how fast the queueing policy sends it.
>
> This has two consequences. Firstly, it allows mbuf priorities to be
> maintained through the system or set in pf, independently of traffic
> shaping policy implemented with hfsc.
>
> Secondly, it will allow priority queueing on a vlan interface's
> parent to kick in. With HFSC setting the priority to 7, it made
> packets on the physical interface get queued at the highest priority,
> but now they get queued at their natural(?) prio.
>
> It could be argued that allowing priq on the parent for HFSC controlled
> traffic is good and bad. I think it is more good, as it lets the parent
> interface act like the rest of the network that should respect the
> vlan prio value.
>
> Adrian has tested this himself and gets the result he expects now.
>
> OK?
>

I agree that this is odd and you have my OK for the patch.

However, since HFSC uses its own internal FIFO queues
(which replace the default priority queueing), this change
won't make an HFSC-enabled system do any priority queueing.
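
To make the distinction concrete, here is a toy model, not the hfsc.c code: packets leave a FIFO in arrival order no matter what priority they carry, and the only thing the diff changes is whether the priority field is overwritten on enqueue. All ex_* names are hypothetical; EX_MAXPRIO is 7 because the quoted mail says that is the value HFSC was setting.

```c
#include <assert.h>
#include <stddef.h>

#define EX_MAXPRIO	7	/* the value the quoted mail says HFSC forced */

struct ex_pkt {
	int		 prio;
	struct ex_pkt	*next;
};

struct ex_fifo {
	struct ex_pkt	*head;
	struct ex_pkt	*tail;
};

/* Append to the tail; priority plays no part in the ordering. */
static void
ex_fifo_put(struct ex_fifo *q, struct ex_pkt *p)
{
	assert(q != NULL && p != NULL);
	p->next = NULL;
	if (q->tail != NULL)
		q->tail->next = p;
	else
		q->head = p;
	q->tail = p;
}

/* Remove from the head, or return NULL when the queue is empty. */
static struct ex_pkt *
ex_fifo_get(struct ex_fifo *q)
{
	struct ex_pkt *p = q->head;

	if (p != NULL) {
		q->head = p->next;
		if (q->head == NULL)
			q->tail = NULL;
	}
	return (p);
}

/* Old behaviour: enqueue and overwrite the packet priority. */
static void
ex_enqueue_old(struct ex_fifo *q, struct ex_pkt *p)
{
	ex_fifo_put(q, p);
	p->prio = EX_MAXPRIO;
}

/* New behaviour: enqueue and leave the priority alone. */
static void
ex_enqueue_new(struct ex_fifo *q, struct ex_pkt *p)
{
	ex_fifo_put(q, p);
}
```

In both variants the dequeue order is the same; only the priority that later ends up on the wire differs.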

> Index: hfsc.c
> ===
> RCS file: /cvs/src/sys/net/hfsc.c,v
> retrieving revision 1.47
> diff -u -p -r1.47 hfsc.c
> --- hfsc.c 13 Apr 2018 14:09:42 -  1.47
> +++ hfsc.c 22 Oct 2018 07:20:39 -
> @@ -540,7 +540,6 @@ hfsc_pf_enqueue(void *arg, struct mbuf *
>   return (m);
>  
>   ml_enqueue(>q, m);
> - m->m_pkthdr.pf.prio = IFQ_MAXPRIO;
>   return (NULL);
>  }
>  



Re: ENA support

2018-08-27 Thread Mike Belopuhov


Mike Belopuhov writes:

> On Sat, 25 Aug 2018 at 20:03, Zbyszek Żółkiewski 
> wrote:
>>
>> Hi,
>>
>> just a question: has anyone tried or considered porting ENA (Elastic Network
>> Adapter) support to OpenBSD
>> (https://github.com/amzn/amzn-drivers/tree/master/kernel/fbsd/ena)? How hard
>> would it be to get it to obsd?
>>
>> _
>> Zbyszek Żółkiewski
>
> Hi,
>
> It would be nice to have a thoughtful port or a partial rewrite of
> this vendor code dump.  IIRC, dlg@ was toying with the idea.
>
> Cheers,
> Mike

BTW, a merged and trimmed down version is available directly from
the FreeBSD tree: https://svnweb.freebsd.org/base/head/sys/dev/ena/



Re: ENA support

2018-08-27 Thread Mike Belopuhov
On Sat, 25 Aug 2018 at 20:03, Zbyszek Żółkiewski 
wrote:
>
> Hi,
>
> just a question: has anyone tried or considered porting ENA (Elastic Network
> Adapter) support to OpenBSD
> (https://github.com/amzn/amzn-drivers/tree/master/kernel/fbsd/ena)? How hard
> would it be to get it to obsd?
>
> _
> Zbyszek Żółkiewski

Hi,

It would be nice to have a thoughtful port or a partial rewrite of
this vendor code dump.  IIRC, dlg@ was toying with the idea.

Cheers,
Mike


Re: pcidevs Sandisk or WD

2018-07-19 Thread Mike Belopuhov
On 19 July 2018 at 19:36, Stuart Henderson  wrote:
>
> On 2018/07/19 10:00, Bryan Vyhmeister wrote:
> > I am wanting to add the WD Black High-performance NVMe SSD PCI
> > IDs to pcidevs and I am not sure how to proceed. WD bought Sandisk a
> > while back but the vendor ID is 15b7 which is Sandisk Corp. The product
> > itself is WD Black High-performance NVMe SSD and is labeled as Western
> > Digital. The vendor ID of 15b7 does not exist in our pcidevs at this
> > point. Should I create a diff as 15b7 being Sandisk or Western Digital?
> > There are only two products listed under 15b7 that I have found (2001
> > and 5001) and this SSD I have is a new (third) one (5002). My intention
> > is to add the information for 5001 and 5002 which are the original WD
> > Black NVMe SSD and the new WD Black High-performance NVMe SSD. I have
> > both 500GB and 1TB versions which share the same product ID as expected.
> > Output from dmesg and pcidump is below.
>
> AFAIK typically we list under the original vendor name though I'm sure
> there are exceptions to this. My vote would be for listing as Sandisk.
>

But at this point the origin of new devices is WD.  I think since it's
a new addition it should go by WD to reduce potential confusion in the
future.


Re: 6.3 and Prolific PL-2303 USB serial adapter

2018-06-18 Thread Mike Belopuhov
On Mon, Jun 18, 2018 at 19:29 +0200, Paul de Weerd wrote:
> Updated diff included below.  I've set the product name to LD220.
> This diff lacks the updates to usbdevs{,_data}.h, whoever commits
> should update those too.
> 
> Any takers to commit this?  Tested by an owner of the actual device,
> so that's good.  Jan-Piet, maybe you can share a dmesg of the machine
> with the device attached as uplcom(4)?
> 
> Thanks!
>

I've checked the diff in.  Thank you both!

> Paul
> 
> Index: uplcom.c
> ===
> RCS file: /home/OpenBSD/cvs/src/sys/dev/usb/uplcom.c,v
> retrieving revision 1.71
> diff -u -p -r1.71 uplcom.c
> --- uplcom.c  27 Apr 2018 09:40:59 -  1.71
> +++ uplcom.c  18 Jun 2018 17:25:51 -
> @@ -140,6 +140,7 @@ static const struct usb_devno uplcom_dev
>   { USB_VENDOR_ELECOM, USB_PRODUCT_ELECOM_UCSGT },
>   { USB_VENDOR_ELECOM, USB_PRODUCT_ELECOM_UCSGT0 },
>   { USB_VENDOR_HAL, USB_PRODUCT_HAL_IMR001 },
> + { USB_VENDOR_HP, USB_PRODUCT_HP_LD220 },
>   { USB_VENDOR_IODATA, USB_PRODUCT_IODATA_USBRSAQ },
>   { USB_VENDOR_IODATA, USB_PRODUCT_IODATA_USBRSAQ5 },
>   { USB_VENDOR_LEADTEK, USB_PRODUCT_LEADTEK_9531 },
> Index: usbdevs
> ===
> RCS file: /home/OpenBSD/cvs/src/sys/dev/usb/usbdevs,v
> retrieving revision 1.684
> diff -u -p -r1.684 usbdevs
> --- usbdevs   11 Apr 2018 04:15:26 -  1.684
> +++ usbdevs   18 Jun 2018 17:25:36 -
> @@ -2190,6 +2190,7 @@ product HP R1500G2  0x1fe0  R1500 G2 UPS
>  product HP T750G2 0x1fe1  T750 G2 UPS
>  product HP 640C  0x2004  DeskJet 640c
>  product HP 1020  0x2b17  LaserJet 1020
> +product HP LD220 0x3524  LD220
>  product HP P1100 0x3102  Photosmart P1100
>  product HP 1018  0x4117  LaserJet 1018
>  product HP HN210E 0x811c  HN210E Ethernet
> 
> -- 
> >[<++>-]<+++.>+++[<-->-]<.>+++[<+
> +++>-]<.>++[<>-]<+.--.[-]
>  http://www.weirdnet.nl/ 
> 



Hyper-V network: let hvn_iff handle promisc mode activation

2018-05-01 Thread Mike Belopuhov
Hi,

A while back a user reported on reddit that hvn(4) lacks a proper
promiscuous mode, but so far hasn't managed to test the diff.

If somebody is running OpenBSD on Hyper-V, please test the diff below
and see if there are any regressions (like increased CPU load) without
running tcpdump and then try running 'tcpdump -nvvi hvn0' and see if
that works: one indicator is the PROMISC flag in the ifconfig output
that should get set when tcpdump is started and cleared once it's
finished.

I'm not certain that Hyper-V will let you sniff much traffic on the
virtual switch.  However, one goal here is to instruct RNDIS to enter
"all multicast capture" mode when one or more multicast addresses are
configured, which in turn is an attempt to support CARP operation on
the virtual switch.
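
The filter selection in the new hvn_iff() boils down to a small decision table, sketched below as a pure function so it can be checked in isolation. The EX_* bit values are assumed to follow the usual NDIS packet filter encoding and are not taken from the driver headers; ex_rx_filter() is a made-up name.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to match the common NDIS packet filter bits. */
#define EX_PACKET_TYPE_DIRECTED		0x00000001u
#define EX_PACKET_TYPE_ALL_MULTICAST	0x00000004u
#define EX_PACKET_TYPE_BROADCAST	0x00000008u
#define EX_PACKET_TYPE_PROMISCUOUS	0x00000020u

/*
 * promisc:       nonzero when IFF_PROMISC is set on the interface
 * multirangecnt: number of configured multicast address ranges
 * multicnt:      number of configured multicast addresses
 */
static uint32_t
ex_rx_filter(int promisc, int multirangecnt, int multicnt)
{
	uint32_t filter;

	assert(multirangecnt >= 0 && multicnt >= 0);

	/* Promiscuous mode or address ranges: ask for everything. */
	if (promisc || multirangecnt > 0)
		return (EX_PACKET_TYPE_PROMISCUOUS);

	/* Otherwise unicast to us plus broadcast ... */
	filter = EX_PACKET_TYPE_BROADCAST | EX_PACKET_TYPE_DIRECTED;

	/* ... and all multicast once any group address is configured. */
	if (multicnt > 0)
		filter |= EX_PACKET_TYPE_ALL_MULTICAST;

	return (filter);
}
```

Note the difference from the removed hvn_rndis_open(), which always included the all-multicast bit in the non-promiscuous case.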


commit 6e6b001dae79505e3e0fbca663c31f9eb6da285b
Author: Mike Belopuhov <m...@belopuhov.com>
Date:   Sun Mar 11 13:38:21 2018 +0100

Add support for promisc mode

diff --git sys/dev/pv/if_hvn.c sys/dev/pv/if_hvn.c
index 3ca35165565..6d5e919b01c 100644
--- sys/dev/pv/if_hvn.c
+++ sys/dev/pv/if_hvn.c
@@ -134,11 +134,10 @@ struct hvn_softc {
bus_dma_tag_t sc_dmat;
 
struct arpcom sc_ac;
struct ifmedia   sc_media;
int  sc_link_state;
-   int  sc_promisc;
 
/* NVS protocol */
int  sc_proto;
uint32_t sc_nvstid;
uint8_t  sc_nvsrsp[HVN_NVS_MSGSIZE];
@@ -208,11 +207,10 @@ void  hvn_rxeof(struct hvn_softc *, caddr_t, 
uint32_t, struct mbuf_list *);
 void   hvn_rndis_complete(struct hvn_softc *, caddr_t, uint32_t);
 int    hvn_rndis_output(struct hvn_softc *, struct hvn_tx_desc *);
 void   hvn_rndis_status(struct hvn_softc *, caddr_t, uint32_t);
 int    hvn_rndis_query(struct hvn_softc *, uint32_t, void *, size_t *);
 int    hvn_rndis_set(struct hvn_softc *, uint32_t, void *, size_t);
-int    hvn_rndis_open(struct hvn_softc *);
 int    hvn_rndis_close(struct hvn_softc *);
 void   hvn_rndis_detach(struct hvn_softc *);
 
 struct cfdriver hvn_cd = {
NULL, "hvn", DV_IFNET
@@ -401,26 +399,44 @@ hvn_link_status(struct hvn_softc *sc)
 }
 
 int
 hvn_iff(struct hvn_softc *sc)
 {
-   /* XXX */
-   sc->sc_promisc = 0;
+   struct ifnet *ifp = &sc->sc_ac.ac_if;
+   uint32_t filter = 0;
+   int rv;
 
-   return (0);
+   ifp->if_flags &= ~IFF_ALLMULTI;
+
+   if ((ifp->if_flags & IFF_PROMISC) || sc->sc_ac.ac_multirangecnt > 0) {
+   ifp->if_flags |= IFF_ALLMULTI;
+   filter = NDIS_PACKET_TYPE_PROMISCUOUS;
+   } else {
+   filter = NDIS_PACKET_TYPE_BROADCAST |
+   NDIS_PACKET_TYPE_DIRECTED;
+   if (sc->sc_ac.ac_multicnt > 0) {
+   ifp->if_flags |= IFF_ALLMULTI;
+   filter |= NDIS_PACKET_TYPE_ALL_MULTICAST;
+   }
+   }
+
+   rv = hvn_rndis_set(sc, OID_GEN_CURRENT_PACKET_FILTER,
+   &filter, sizeof(filter));
+   if (rv)
+   DPRINTF("%s: failed to set RNDIS filter to %#x\n",
+   sc->sc_dev.dv_xname, filter);
+   return (rv);
 }
 
 void
 hvn_init(struct hvn_softc *sc)
 {
struct ifnet *ifp = &sc->sc_ac.ac_if;
 
hvn_stop(sc);
 
-   hvn_iff(sc);
-
-   if (hvn_rndis_open(sc) == 0) {
+   if (hvn_iff(sc) == 0) {
ifp->if_flags |= IFF_RUNNING;
ifq_clr_oactive(&ifp->if_snd);
}
 }
 
@@ -1723,31 +1739,10 @@ hvn_rndis_set(struct hvn_softc *sc, uint32_t oid, void 
*data, size_t length)
hvn_free_cmd(sc, rc);
 
return (rv);
 }
 
-int
-hvn_rndis_open(struct hvn_softc *sc)
-{
-   uint32_t filter;
-   int rv;
-
-   if (sc->sc_promisc)
-   filter = NDIS_PACKET_TYPE_PROMISCUOUS;
-   else
-   filter = NDIS_PACKET_TYPE_BROADCAST |
-   NDIS_PACKET_TYPE_ALL_MULTICAST |
-   NDIS_PACKET_TYPE_DIRECTED;
-
-   rv = hvn_rndis_set(sc, OID_GEN_CURRENT_PACKET_FILTER,
-   &filter, sizeof(filter));
-   if (rv)
-   DPRINTF("%s: failed to set RNDIS filter to %#x\n",
-   sc->sc_dev.dv_xname, filter);
-   return (rv);
-}
-
 int
 hvn_rndis_close(struct hvn_softc *sc)
 {
uint32_t filter = 0;
int rv;



Re: kqueue EV_DISPATCH and EV_EOF interaction

2018-04-08 Thread Mike Belopuhov
On Tue, Apr 03, 2018 at 17:00 +0200, Lukas Larsson wrote:
> On Fri, Mar 30, 2018 at 1:51 AM, Mike Belopuhov <m...@belopuhov.com> wrote:
> 
> > On Fri, Mar 30, 2018 at 01:21 +0200, Mike Belopuhov wrote:
> > >
> > > Hi,
> > >
> > > This appears to be an issue with reactivating disabled event sources
> > > in kqueue_register.  Something along the lines of FreeBSD commits:
> > >
> > > https://svnweb.freebsd.org/base?view=revision=274560 and
> > > https://reviews.freebsd.org/rS295786 where parent differential review
> > > https://reviews.freebsd.org/D5307 has some additional comments.
> > >
> > > In any case, by either porting their code (#else branch) or slightly
> > > adjusting our own (I think that should be enough), I can no longer
> > > reproduce the issue you've reported.  Please test and report back if
> > > that solves your original issue.  Either variant will require
> > > rigorous testing and a thorough review.
> > >
> > > Cheers,
> > > Mike
> > >
> >
> > After a bit of tinkering, I think I can minimize the change even
> > further.  Basically we just need to call the filter once and if
> > there's some data available, it'll return true and we'll mark the
> > knote as active.
> >
> > diff --git sys/kern/kern_event.c sys/kern/kern_event.c
> > index fb9cad360b1..4e0949645cb 100644
> > --- sys/kern/kern_event.c
> > +++ sys/kern/kern_event.c
> > @@ -671,10 +671,12 @@ kqueue_register(struct kqueue *kq, struct kevent
> > *kev, struct proc *p)
> > }
> >
> > if ((kev->flags & EV_ENABLE) && (kn->kn_status & KN_DISABLED)) {
> > s = splhigh();
> > kn->kn_status &= ~KN_DISABLED;
> > +   if (kn->kn_fop->f_event(kn, 0))
> > +   kn->kn_status |= KN_ACTIVE;
> > if ((kn->kn_status & KN_ACTIVE) &&
> > ((kn->kn_status & KN_QUEUED) == 0))
> > knote_enqueue(kn);
> > splx(s);
> > }
> >
> 
> Hello,
> 
> Thank you for your help and the patch. I've applied the smaller patch to
> one of our test machines
> and the small testcase I sent here on the list has been fixed. I also ran
> our larger test suites where
> I first found the issue and those work as well.
> 
> Lukas

Thanks a lot for a great bug report and testing, I've checked in the diff.

Cheers,
Mike



Re: kqueue EV_DISPATCH and EV_EOF interaction

2018-03-29 Thread Mike Belopuhov
On Fri, Mar 30, 2018 at 01:21 +0200, Mike Belopuhov wrote:
> 
> Hi,
> 
> This appears to be an issue with reactivating disabled event sources
> in kqueue_register.  Something along the lines of FreeBSD commits:
> 
> https://svnweb.freebsd.org/base?view=revision=274560 and
> https://reviews.freebsd.org/rS295786 where parent differential review
> https://reviews.freebsd.org/D5307 has some additional comments.
> 
> In any case, by either porting their code (#else branch) or slightly
> adjusting our own (I think that should be enough), I can no longer
> reproduce the issue you've reported.  Please test and report back if
> that solves your original issue.  Either variant will require
> rigorous testing and a thorough review.
> 
> Cheers,
> Mike
> 

After a bit of tinkering, I think I can minimize the change even
further.  Basically we just need to call the filter once and if
there's some data available, it'll return true and we'll mark the
knote as active.

diff --git sys/kern/kern_event.c sys/kern/kern_event.c
index fb9cad360b1..4e0949645cb 100644
--- sys/kern/kern_event.c
+++ sys/kern/kern_event.c
@@ -671,10 +671,12 @@ kqueue_register(struct kqueue *kq, struct kevent *kev, 
struct proc *p)
}
 
if ((kev->flags & EV_ENABLE) && (kn->kn_status & KN_DISABLED)) {
s = splhigh();
kn->kn_status &= ~KN_DISABLED;
+   if (kn->kn_fop->f_event(kn, 0))
+   kn->kn_status |= KN_ACTIVE;
if ((kn->kn_status & KN_ACTIVE) &&
((kn->kn_status & KN_QUEUED) == 0))
knote_enqueue(kn);
splx(s);
}
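
A stripped-down userland model of why the extra f_event() call helps, under the assumption that the problem is a knote being re-enabled while data is already pending: without polling the filter, nothing sets the active flag and the knote never reaches the queue. The ex_* names, flag values and the poll_filter switch are illustrative only and not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative status bits; the kernel's KN_* values may differ. */
#define EX_KN_ACTIVE	0x01
#define EX_KN_QUEUED	0x02
#define EX_KN_DISABLED	0x04

struct ex_knote {
	int	status;
	int	pending;	/* bytes still readable on the socket */
	int	(*f_event)(struct ex_knote *);
};

/* Toy read filter: the event condition holds while data is pending. */
static int
ex_filt_read(struct ex_knote *kn)
{
	return (kn->pending > 0);
}

/*
 * Re-enable a disabled knote.  With poll_filter == 0 this models the
 * old code path, with poll_filter != 0 the patched one.  Returns 1 if
 * the knote was put on the queue, 0 otherwise.
 */
static int
ex_knote_enable(struct ex_knote *kn, int poll_filter)
{
	assert(kn != NULL && kn->f_event != NULL);

	if ((kn->status & EX_KN_DISABLED) == 0)
		return (0);
	kn->status &= ~EX_KN_DISABLED;

	/* The added step: ask the filter whether the event is ready now. */
	if (poll_filter && kn->f_event(kn))
		kn->status |= EX_KN_ACTIVE;

	if ((kn->status & EX_KN_ACTIVE) &&
	    (kn->status & EX_KN_QUEUED) == 0) {
		kn->status |= EX_KN_QUEUED;
		return (1);
	}
	return (0);
}
```

In the EV_DISPATCH case from the report, the knote is disabled after each delivery, so the re-enable path is the only chance to notice the two bytes that are still unread.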



Re: kqueue EV_DISPATCH and EV_EOF interaction

2018-03-29 Thread Mike Belopuhov
On Thu, Mar 29, 2018 at 15:09 +0200, Lukas Larsson wrote:
> Hello,
> 
> I've been re-writing the polling mechanisms in the Erlang VM and stumbled
> across
> something that might be a bug in the OpenBSD implementation of kqueue.
> 
> When using EV_DISPATCH, the event is never triggered again after the EV_EOF
> flag has been delivered, even though there is more data to be read from the
> socket.
> 
> I've attached a smallish program that shows the problem.
> 
> The shortened ktrace output looks like this on OpenBSD 6.2:
> 
>  29672 a.out0.012883 CALL  kevent(4,0x7f7e8220,1,0,0,0)
>  29672 a.out0.012888 STRU  struct kevent { ident=5, filter=EVFILT_READ,
> flags=0x81, fflags=0<>, data=0, udata=0x0 }
>  29672 a.out0.012895 RET   kevent 0
>  29672 a.out0.012904 CALL  kevent(4,0,0,0x7f7e7cf0,32,0)
>  29672 a.out0.013408 STRU  struct kevent { ident=5, filter=EVFILT_READ,
> flags=0x81, fflags=0<>, data=6, udata=0x0 }
>  29672 a.out0.013493 RET   kevent 1
>  29672 a.out0.013548 CALL  read(5,0x7f7e8286,0x2)
>  29672 a.out0.013562 RET   read 2
>  29672 a.out0.013590 CALL  kevent(4,0x7f7e8220,1,0,0,0)
>  29672 a.out0.013594 STRU  struct kevent { ident=5, filter=EVFILT_READ,
> flags=0x84, fflags=0<>, data=0, udata=0x0 }
>  29672 a.out0.013608 RET   kevent 0
>  29672 a.out1.08 CALL  kevent(4,0,0,0x7f7e7cf0,32,0)
>  29672 a.out1.022537 STRU  struct kevent { ident=5, filter=EVFILT_READ,
> flags=0x8081, fflags=0<>, data=4, udata=0x0 }
>  29672 a.out1.022572 RET   kevent 1
>  29672 a.out1.022663 CALL  read(5,0x7f7e8286,0x2)
>  29672 a.out1.022707 RET   read 2
>  29672 a.out1.022816 CALL  kevent(4,0x7f7e8220,1,0,0,0)
>  29672 a.out1.022822 STRU  struct kevent { ident=5, filter=EVFILT_READ,
> flags=0x84, fflags=0<>, data=0, udata=0x0 }
>  29672 a.out1.022835 RET   kevent 0
>  29672 a.out2.032238 CALL  kevent(4,0,0,0x7f7e7cf0,32,0)
>  29672 a.out5.277194 PSIG  SIGINT SIG_DFL
> 
> In this example I would have expected the last kevent call to return with
> EV_EOF and
> data set to 2, but it does not trigger again. If I don't use EV_DISPATCH,
> the event is
> triggered again and the program terminates.
> 
> Does anyone know if this is the expected behavior or a bug?
> 
> I've worked around this issue by using EV_ONESHOT instead of EV_DISPATCH on
> OpenBSD for now, but would like to use EV_DISPATCH in the future as I've
> found
> that it aligns better with the abstractions that I use, and could possibly
> be a little bit
> more performant.
> 
> Lukas
> 
> PS. If relevant, it seems like FreeBSD does behave the way that I expected,
> i.e.
> it triggers again for EV_DISPATCH after EV_EOF has been shown. DS.

> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> 
> #define USE_DISPATCH 1
> 
> int main() {
> struct addrinfo *addr;
> struct addrinfo hints;
> int kq, listen_s, fd = -1;
> struct kevent evSet;
> struct kevent evList[32];
> 
> /* open a TCP socket */
> memset(&hints, 0, sizeof hints);
> hints.ai_family = PF_UNSPEC; /* any supported protocol */
> hints.ai_flags = AI_PASSIVE; /* result for bind() */
> hints.ai_socktype = SOCK_STREAM; /* TCP */
> int error = getaddrinfo ("127.0.0.1", "8080", &hints, &addr);
> if (error)
> errx(1, "getaddrinfo failed: %s", gai_strerror(error));
> listen_s = socket(addr->ai_family, addr->ai_socktype, addr->ai_protocol);
> if (setsockopt(listen_s, SOL_SOCKET, SO_REUSEADDR, &(int){ 1 }, 
> sizeof(int)) < 0)
> errx(1, "setsockopt(SO_REUSEADDR) failed");
> bind(listen_s, addr->ai_addr, addr->ai_addrlen);
> listen(listen_s, 5);
> 
> kq = kqueue();
> 
> system("echo -n abcdef | nc -v -w 1 127.0.0.1 8080 &");
> 
> EV_SET(&evSet, listen_s, EVFILT_READ, EV_ADD, 0, 0, NULL);
> if (kevent(kq, &evSet, 1, NULL, 0, NULL) == -1)
> err(1, "kevent");
> 
> while(1) {
> int i;
> int nev = kevent(kq, NULL, 0, evList, 32, NULL);
> for (i = 0; i < nev; i++) {
> if (evList[i].ident == listen_s) {
> struct sockaddr_storage addr;
> socklen_t socklen = sizeof(addr);
> if (fd != -1)
> close(fd);
> fd = accept(evList[i].ident, (struct sockaddr *)&addr,
> &socklen);
> printf("accepted %d\n", fd);
> #if USE_DISPATCH
> EV_SET(&evSet, fd, EVFILT_READ, EV_ADD|EV_DISPATCH, 0, 0,
> NULL);
> #else
> EV_SET(&evSet, fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
> #endif
> if (kevent(kq, &evSet, 1, NULL, 0, NULL) == -1)
> err(1, "kevent");
> } else {
> if (evList[i].flags & EV_EOF && evList[i].data == 0) {
> printf("closing 

Re: close filedescriptors of children

2018-03-07 Thread Mike Belopuhov
On 7 March 2018 at 17:27, Gerhard Roth <gerhard_r...@genua.de> wrote:
>
> On Wed, 7 Mar 2018 17:20:06 +0100 Mike Belopuhov <m...@belopuhov.com> wrote:
> > On 7 March 2018 at 17:01, Gerhard Roth <gerhard_r...@genua.de> wrote:
> > >
> > > Hi Benno,
> > >
> > > thanks for your reply.
> > >
> > > On Wed, 7 Mar 2018 15:22:28 +0100 Sebastian Benoit <be...@openbsd.org> wrote:
> > > > Hi,
> > > >
> > > > switchd and vmd use the same proc.c, and should stay in sync.
> > >
> > > Ack. I missed them.
> > >
> >
> > iked also uses proc.c. I think you've got all the others,
> > but perhaps you should run a find?
> >
> > Cheers,
> > Mike
>
> Hi Mike,
>
> but iked still uses an older version of proc.c that just forks off
> the children but does not execve() the own binary.
>
> Also, iked is the only one that daemon(3)-izes before calling
> proc_init(). So here stdout, stdin, and stderr is already remapped
> to /dev/null before forking the kids.
>
> Gerhard

I see.  Reyk always wanted to keep them all in sync, but I guess
it's too late to care about that if they've already diverged.


Re: close filedescriptors of children

2018-03-07 Thread Mike Belopuhov
On 7 March 2018 at 17:01, Gerhard Roth  wrote:
>
> Hi Benno,
>
> thanks for your reply.
>
> On Wed, 7 Mar 2018 15:22:28 +0100 Sebastian Benoit wrote:
> > Hi,
> >
> > switchd and vmd use the same proc.c, and should stay in sync.
>
> Ack. I missed them.
>

iked also uses proc.c. I think you've got all the others,
but perhaps you should run a find?

Cheers,
Mike


Re: tcp reaper timeout

2018-01-22 Thread Mike Belopuhov
Hi,

thanks for the detailed explanation!

On Mon, Jan 22, 2018 at 22:37 +0100, Alexander Bluhm wrote:
> On Sat, Jan 20, 2018 at 05:53:05PM +0100, Mike Belopuhov wrote:
> > On Sat, Jan 20, 2018 at 15:17 +0100, Alexander Bluhm wrote:
> > While I'm not against making all TCP timeouts look similar, I'd like
> > to understand if there's any other reason to do it other than
> > "consistency".
> 
> I think it prevents a use after free.  The timeouts may fire and
> wait for the netlock.  Then the softnet task or a user process calls
> tcp_close().  As our net lock does not include splsoftnet anymore,
> soft timeouts are not blocked.  So the reaper timeout may immediately
> free the tcpcb.  Then the timer functions could operate on invalid
> memory.  I have not seen this in practice, it is just a theory.
> 
> In 4.4 BSD there was no reaper timeout.  We have added it here:
> 
> revision 1.85
> date: 2004/11/25 15:32:08;  author: markus;  state: Exp;  lines: +18 -3;
> fix for race between invocation for timer and network input
> 1) add a reaper for TCP and SYN cache states (cf. netbsd pr 20390)
> 2) additional check for TCP_TIMER_ISARMED(TCPT_REXMT) in tcp_timer_persist()
> with mickey@; ok deraadt@
> 
> > > The tcp reaper timeout is still implemented as soft timeout.  So it
> > > can run while net lock is held by others and it is not synchronized
> > > with the other tcp timeouts.
> > 
> > Am I right to say that this is not an issue?  Neither pool_put nor
> > tcpstat_inc require a NET_LOCK, correct?
> 
> pool_put() does not need the lock to protect itself.  But it must
> not free memory that another thread which is holding the net lock
> is still using.
> 
> > > Convert it to an ordinary tcp timeout
> > > so it is scheduled on the same timeout thread.  It grabs the net
> > > lock to make sure that softnet has finished its work.
> > 
> > This just makes other threads who want to grab NET_LOCK wait for
> > a pool_put to finish, but where's the benefit?
> 
> It restores the behavior introduced in revision 1.85.  The memory
> is only freed after the thread that called tcp_close() has released
> the net lock.  As there was no timeout in 4.4 BSD, and the callers
> of tcp_close() set their tcpcb pointer to NULL, and all synchronisation
> with the timeouts is done by using a single timeout thread, the net
> lock in tcp_timer_reaper() is not necessary.
>

I don't mind restoring the order of operations.

After the change below there will be one thread serializing all timeouts.
Perhaps we can revisit this after we move to multiple timeout threads?
By then we might require different synchronization methods.

> If we fear that freeing the tcpcb is delayed too much by NET_LOCK()
> and we consume too much memory, then I am fine when we do not grab
> the net lock in tcp_timer_reaper().
> 
> Generally I prefer to place too many locks than to miss some.  When
> we see lock congestion, we can remove the unnecessary ones.  This
> strategy is better than adding necessary locks to a crashing kernel.
>

Yeah, but as you've said the kernel isn't crashing and this is purely
theoretical.  While I see why the current state of affairs isn't
desirable, I don't see how it's still flawed after the change below.

> > What's up with the diagnostics?  Do you suspect a race of some sort?
> 
> I consider myself not smart enough to write multi-threaded code.  There
> is always some race I did not think about.  So I add diagnostics
> to have a better feeling.
>

There's nothing wrong with diagnostics of course, but at the same time,
if there's no bug to hunt why bother?  I'm certain you've tried the diff
yourself and it didn't blow up, otherwise we wouldn't be having this
conversation.  Do you want to have it temporarily enabled?  I won't mind
that, but it looks good to me as it is.
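
For reference, here is a rough userland sketch of what the proposed
DIAGNOSTIC block checks before the tcpcb is returned to the pool.  The
struct, the flag value and the timer count are stand-ins I made up for
illustration; only the two invariants (TF_DEAD is set, no timer is
armed) come from the diff, and the kernel code panics where this
returns a string.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userland stand-ins; values are for illustration only. */
#define TF_DEAD		0x1
#define TCPT_NTIMERS	4

struct tcpcb_sketch {
	unsigned int	t_flags;
	bool		t_timer_armed[TCPT_NTIMERS];
};

/*
 * Return NULL if the block may be freed, otherwise a description of
 * the invariant that is violated.
 */
const char *
reaper_check(const struct tcpcb_sketch *tp)
{
	int i;

	if ((tp->t_flags & TF_DEAD) == 0)
		return "tcpcb is not dead";
	for (i = 0; i < TCPT_NTIMERS; i++) {
		if (tp->t_timer_armed[i])
			return "timer is armed";
	}
	return NULL;
}
```

If neither condition can trigger once tcp_close() has run, the checks
cost nothing but a few cycles on DIAGNOSTIC kernels.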

> As we agree that the net lock is not needed in the TCP reaper, I
> propose this diff without diagnostics and lock, but with a comment.
> 
> ok?
> 
> bluhm
> 
> Index: netinet/tcp_subr.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/tcp_subr.c,v
> retrieving revision 1.167
> diff -u -p -r1.167 tcp_subr.c
> --- netinet/tcp_subr.c 7 Dec 2017 16:52:21 -   1.167
> +++ netinet/tcp_subr.c 22 Jan 2018 20:29:06 -
> @@ -436,7 +436,6 @@ tcp_newtcpcb(struct inpcb *inp)
>   TCP_INIT_DELACK(tp);
>   for (i = 0; i < TCPT_NTIMERS; i++)
>   TCP_TIMER_INIT(tp, i);
> - timeout_set(>t_reap_to, tcp_reaper, tp);
>  
>   tp->sack_enable = tcp_do_sack;
>   tp->t_flags = tcp_do_rfc1323 ? (TF_REQ_SCALE|TF_REQ_TSTMP) : 

Re: tcp reaper timeout

2018-01-20 Thread Mike Belopuhov
On Sat, Jan 20, 2018 at 15:17 +0100, Alexander Bluhm wrote:
> Hi,
>

Hi,

While I'm not against making all TCP timeouts look similar, I'd like
to understand if there's any reason to do it other than
"consistency".

> The tcp reaper timeout is still implemented as a soft timeout.  So it
> can run while net lock is held by others and it is not synchronized
> with the other tcp timeouts.

Am I right to say that this is not an issue?  Neither pool_put nor
tcpstat_inc require a NET_LOCK, correct?

> Convert it to an ordinary tcp timeout
> so it is scheduled on the same timeout thread.  It grabs the net
> lock to make sure that softnet has finished its work.
>

This just makes other threads who want to grab NET_LOCK wait for
a pool_put to finish, but where's the benefit?

> ok?
> 
> bluhm
> 

What's up with the diagnostics?  Do you suspect a race of some sort?

> @@ -462,5 +464,26 @@ tcp_timer_2msl(void *arg)
>   tp = tcp_close(tp);
>  
>   out:
> + NET_UNLOCK();
> +}
> +
> +void
> +tcp_timer_reaper(void *arg)
> +{
> + struct tcpcb *tp = arg;
> + int i;
> +
> + NET_LOCK();
> +#ifdef DIAGNOSTIC
> + if ((tp->t_flags & TF_DEAD) == 0)
> + panic("%s: tcpcb %p is not dead", __func__, tp);
> + for (i = 0; i < TCPT_NTIMERS; i++) {
> + if (TCP_TIMER_ISARMED(tp, i))
> + panic("%s: tcpcb %p timer %d is armed",
> + __func__, tp, i);
> + }
> +#endif
> + pool_put(_pool, tp);
> + tcpstat_inc(tcps_closed);
>   NET_UNLOCK();
>  }



Re: inteldrm(4) tests needed

2018-01-20 Thread Mike Belopuhov
On Mon, Jan 15, 2018 at 01:02 +0100, Mark Kettenis wrote:
> The diff below adopts more of the Linux code to manage i2c
> transactions on hardware supported by inteldrm(4).  The i2c stuff is
> responsible for detecting panels and monitors, so it is somewhat
> important that this works right.  And the Linux code developed some
> quirks over the years that my rewrite of the code to use OpenBSD APIs
> didn't have.
> 
> So I'm looking for testers.  I'm especially interested in tests of
> external displays on all sorts of connector types (VGA, DVI, HDMI,
> DP).  It would be really great to get some tests on older stuff with
> (S)DVO.  Please let me know if there are regressions or if this fixes
> things that are currently broken.  But all reports are welcome.
> Please include a dmesg and some information about the display and
> connector type.
> 

Hi,

I've been running with this since it hit the tree and it has fixed
a major issue for me: an external HDMI output was marked 'disconnected'
only a few minutes after being connected.  The output worked fine,
but I presume X was notified that something changed and my window
manager would restrict the area for popups and drop down menus to the
main screen, as in they would only appear on the panel of the laptop,
regardless of where the actual window was.  So huge thanks for that!

In the meantime (somewhere around the Mesa update and a backout), another
issue got fixed.  When a 60fps test video was displayed on the panel,
a tearing artefact used to be visible in the upper portion of the
screen.  And now it's gone.  Thanks Jonathan and Mark for your
work on this, it's greatly appreciated!

My hardware is an X1 Carbon 2017 (Kaby Lake) with a WQHD (2560x1440)
panel and a 1920x1080 external monitor.

Cheers,
Mike



Re: Add sizes for free() in the VIA PadLock driver

2017-11-12 Thread Mike Belopuhov
On Sun, Nov 12, 2017 at 21:53 +0100, Frederic Cambus wrote:
> Hi tech@,
> 
> Add sizes for free() in the VIA PadLock driver.
> 
> Comments? OK?
> 

OK mikeb.



Re: "max" field in "netstat -m" is ambiguous

2017-10-28 Thread Mike Belopuhov
On Sat, Oct 28, 2017 at 11:06 +0200, Mike Belopuhov wrote:
> On Thu, Oct 26, 2017 at 08:58 +0200, Claudio Jeker wrote:
> > On Wed, Oct 25, 2017 at 11:46:05PM +0200, Mike Belopuhov wrote:
> > > On Wed, Oct 25, 2017 at 21:56 +0200, Claudio Jeker wrote:
> > > > Would be great if netstat could show the current and peak memory usage.
> > > >
> > > 
> > > Current is 5876.  Maximum is 524288.  Do you want to display them in
> > > the x/y/z format?
> > > 
> > >   5876//524288 Kbytes allocated to network, 20% in use (current/peak/max)
> > > 
> > > Something like this? Any other ideas?
> > 
> > I think that would be an improvement. I normally look for peak values. The
> > current is normally not interesting when tuning systems. 
> > Maybe we can even drop the use percentage since it is more confusing than
> > anything.
> > 
> 
> How about this then?
> 
>   saru:usr.bin/netstat% ./obj/netstat -m
>   532 mbufs in use:
>   379 mbufs allocated to data
>   12 mbufs allocated to packet headers
>   141 mbufs allocated to socket names and addresses
>   18/208 mbuf 2048 byte clusters in use (current/peak)
>   0/45 mbuf 2112 byte clusters in use (current/peak)
>   256/320 mbuf 4096 byte clusters in use (current/peak)
>   0/48 mbuf 8192 byte clusters in use (current/peak)
>   0/42 mbuf 9216 byte clusters in use (current/peak)
>   0/50 mbuf 12288 byte clusters in use (current/peak)
>   0/48 mbuf 16384 byte clusters in use (current/peak)
>   0/48 mbuf 65536 byte clusters in use (current/peak)
>   5952/7236/524288 Kbytes allocated to network (current/peak/max)
>   0 requests for memory denied
>   0 requests for memory delayed
>   0 calls to protocol drain routines
> 
> OK?
> 

Ian Darwin suggested using fmt_scaled for an output like this:
6.1M/7.1M/512Mbytes allocated to network (current/peak/max)

netstat is already linked against libutil.

Any objections?
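
To illustrate the kind of output this gives, here is a small portable
stand-in.  It is NOT libutil's fmt_scaled(3): scale_bytes() is a
hypothetical helper written for this example, it truncates to one
decimal digit, and its exact output may well differ from what
fmt_scaled produces.

```c
#include <stdio.h>

/*
 * Hypothetical helper: print a byte count with a one-letter binary
 * unit and one truncated decimal digit.
 */
int
scale_bytes(unsigned long long bytes, char *buf, size_t len)
{
	static const char units[] = "BKMGT";
	unsigned long long whole = bytes, frac = 0;
	size_t u = 0;

	/* Divide down by 1024 until the value fits the largest unit. */
	while (whole >= 1024 && u < sizeof(units) - 2) {
		frac = whole % 1024;
		whole /= 1024;
		u++;
	}
	if (u == 0)
		return snprintf(buf, len, "%lluB", whole);
	return snprintf(buf, len, "%llu.%llu%c", whole,
	    frac * 10 / 1024, units[u]);
}
```

The totals in mbpr() are kept in bytes until the final division by
1024, so a helper of this shape would be fed totmem and totpeak
directly.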

diff --git usr.bin/netstat/mbuf.c usr.bin/netstat/mbuf.c
index f7970a57c32..27412f9e217 100644
--- usr.bin/netstat/mbuf.c
+++ usr.bin/netstat/mbuf.c
@@ -42,10 +42,11 @@
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include "netstat.h"
 
 #define YES 1
 typedef int bool;
 
@@ -85,13 +86,13 @@ bool seen[256]; /* "have we seen this type yet?" */
  * Print mbuf statistics.
  */
 void
 mbpr(void)
 {
-   unsigned long totmem, totused, totmbufs;
-   int totpct;
-   int i, mib[4], npools;
+   unsigned long totmem, totpeak, totmbufs;
+   int i, maxclusters, mib[4], npools;
+   char fmt[FMT_SCALED_STRSIZE];
struct kinfo_pool pool;
struct mbtypes *mp;
size_t size;
 
if (nmbtypes != 256) {
@@ -99,10 +100,20 @@ mbpr(void)
"%s: unexpected change to mbstat; check source\n",
__progname);
return;
}
 
+   mib[0] = CTL_KERN;
+   mib[1] = KERN_MAXCLUSTERS;
+   size = sizeof(maxclusters);
+
+   if (sysctl(mib, 2, , , NULL, 0) < 0) {
+   printf("Can't retrieve value of maxclusters from the "
+   "kernel: %s\n",  strerror(errno));
+   return;
+   }
+
mib[0] = CTL_KERN;
mib[1] = KERN_MBSTAT;
size = sizeof(mbstat);
 
if (sysctl(mib, 2, , , NULL, 0) < 0) {
@@ -174,26 +185,34 @@ mbpr(void)
printf("\t%u mbuf%s allocated to \n",
mbstat.m_mtypes[i],
plural(mbstat.m_mtypes[i]), i);
}
totmem = (mbpool.pr_npages * mbpool.pr_pgsize);
-   totused = mbpool.pr_nout * mbpool.pr_size;
+   totpeak = mbpool.pr_hiwat * mbpool.pr_pgsize;
for (i = 0; i < mclp; i++) {
-   printf("%u/%lu/%lu mbuf %d byte clusters in use"
-   " (current/peak/max)\n",
+   printf("%u/%lu mbuf %d byte clusters in use"
+   " (current/peak)\n",
mclpools[i].pr_nout,
(unsigned long)
(mclpools[i].pr_hiwat * mclpools[i].pr_itemsperpage),
-   (unsigned long)
-   (mclpools[i].pr_maxpages * mclpools[i].pr_itemsperpage),
mclpools[i].pr_size);
totmem += (mclpools[i].pr_npages * mclpools[i].pr_pgsize);
-   totused += mclpools[i].pr_nout * mclpools[i].pr_size;
+   totpeak += mclpools[i].pr_hiwat * mclpools[i].pr_pgsize;
}
 
-   totpct = (totmem == 0) ? 0 : (totused/(totmem / 100));
-   printf("%lu Kbytes allocated to network (%d%% in use)\n",
-   totmem / 1024, totpct)

Re: "max" field in "netstat -m" is ambiguous

2017-10-28 Thread Mike Belopuhov
On Thu, Oct 26, 2017 at 08:58 +0200, Claudio Jeker wrote:
> On Wed, Oct 25, 2017 at 11:46:05PM +0200, Mike Belopuhov wrote:
> > On Wed, Oct 25, 2017 at 21:56 +0200, Claudio Jeker wrote:
> > > Would be great if netstat could show the current and peak memory usage.
> > >
> > 
> > Current is 5876.  Maximum is 524288.  Do you want to display them in
> > the x/y/z format?
> > 
> >   5876//524288 Kbytes allocated to network, 20% in use (current/peak/max)
> > 
> > Something like this? Any other ideas?
> 
> I think that would be an improvement. I normally look for peak values. The
> current is normally not interesting when tuning systems. 
> Maybe we can even drop the use percentage since it is more confusing than
> anything.
> 

How about this then?

  saru:usr.bin/netstat% ./obj/netstat -m
  532 mbufs in use:
379 mbufs allocated to data
12 mbufs allocated to packet headers
141 mbufs allocated to socket names and addresses
  18/208 mbuf 2048 byte clusters in use (current/peak)
  0/45 mbuf 2112 byte clusters in use (current/peak)
  256/320 mbuf 4096 byte clusters in use (current/peak)
  0/48 mbuf 8192 byte clusters in use (current/peak)
  0/42 mbuf 9216 byte clusters in use (current/peak)
  0/50 mbuf 12288 byte clusters in use (current/peak)
  0/48 mbuf 16384 byte clusters in use (current/peak)
  0/48 mbuf 65536 byte clusters in use (current/peak)
  5952/7236/524288 Kbytes allocated to network (current/peak/max)
  0 requests for memory denied
  0 requests for memory delayed
  0 calls to protocol drain routines

OK?

diff --git usr.bin/netstat/mbuf.c usr.bin/netstat/mbuf.c
index f7970a57c32..79c2c16a6c3 100644
--- usr.bin/netstat/mbuf.c
+++ usr.bin/netstat/mbuf.c
@@ -85,13 +85,12 @@ bool seen[256]; /* "have we seen this type yet?" */
  * Print mbuf statistics.
  */
 void
 mbpr(void)
 {
-   unsigned long totmem, totused, totmbufs;
-   int totpct;
-   int i, mib[4], npools;
+   unsigned long totmem, totpeak, totmbufs;
+   int i, maxclusters, mib[4], npools;
struct kinfo_pool pool;
struct mbtypes *mp;
size_t size;
 
if (nmbtypes != 256) {
@@ -99,10 +98,20 @@ mbpr(void)
"%s: unexpected change to mbstat; check source\n",
__progname);
return;
}
 
+   mib[0] = CTL_KERN;
+   mib[1] = KERN_MAXCLUSTERS;
+   size = sizeof(maxclusters);
+
+   if (sysctl(mib, 2, , , NULL, 0) < 0) {
+   printf("Can't retrieve value of maxclusters from the "
+   "kernel: %s\n",  strerror(errno));
+   return;
+   }
+
mib[0] = CTL_KERN;
mib[1] = KERN_MBSTAT;
size = sizeof(mbstat);
 
if (sysctl(mib, 2, , , NULL, 0) < 0) {
@@ -174,26 +183,24 @@ mbpr(void)
printf("\t%u mbuf%s allocated to \n",
mbstat.m_mtypes[i],
plural(mbstat.m_mtypes[i]), i);
}
totmem = (mbpool.pr_npages * mbpool.pr_pgsize);
-   totused = mbpool.pr_nout * mbpool.pr_size;
+   totpeak = mbpool.pr_hiwat * mbpool.pr_pgsize;
for (i = 0; i < mclp; i++) {
-   printf("%u/%lu/%lu mbuf %d byte clusters in use"
-   " (current/peak/max)\n",
+   printf("%u/%lu mbuf %d byte clusters in use"
+   " (current/peak)\n",
mclpools[i].pr_nout,
(unsigned long)
(mclpools[i].pr_hiwat * mclpools[i].pr_itemsperpage),
-   (unsigned long)
-   (mclpools[i].pr_maxpages * mclpools[i].pr_itemsperpage),
mclpools[i].pr_size);
totmem += (mclpools[i].pr_npages * mclpools[i].pr_pgsize);
-   totused += mclpools[i].pr_nout * mclpools[i].pr_size;
+   totpeak += mclpools[i].pr_hiwat * mclpools[i].pr_pgsize;
}
 
-   totpct = (totmem == 0) ? 0 : (totused/(totmem / 100));
-   printf("%lu Kbytes allocated to network (%d%% in use)\n",
-   totmem / 1024, totpct);
+   printf("%lu/%lu/%lu Kbytes allocated to network "
+   "(current/peak/max)\n", totmem / 1024, totpeak / 1024,
+   (unsigned long)(maxclusters * MCLBYTES) / 1024);
printf("%lu requests for memory denied\n", mbstat.m_drops);
printf("%lu requests for memory delayed\n", mbstat.m_wait);
printf("%lu calls to protocol drain routines\n", mbstat.m_drain);
 }



Re: "max" field in "netstat -m" is ambiguous

2017-10-25 Thread Mike Belopuhov
On Wed, Oct 25, 2017 at 21:56 +0200, Claudio Jeker wrote:
> On Wed, Oct 25, 2017 at 01:39:35PM -0600, Todd C. Miller wrote:
> > On Wed, 25 Oct 2017 19:46:56 +0200, Mike Belopuhov wrote:
> > 
> > > I think we can extend this by adding an additional number for the
> > > upper boundary (kern.maxclusters), like so:
> > > 
> > >   saru:usr.bin/netstat% ./obj/netstat -m
> > >   539 mbufs in use:
> > >   385 mbufs allocated to data
> > >   13 mbufs allocated to packet headers
> > >   141 mbufs allocated to socket names and addresses
> > >   19/144 mbuf 2048 byte clusters in use (current/peak)
> > >   0/45 mbuf 2112 byte clusters in use (current/peak)
> > >   256/312 mbuf 4096 byte clusters in use (current/peak)
> > >   0/48 mbuf 8192 byte clusters in use (current/peak)
> > >   0/28 mbuf 9216 byte clusters in use (current/peak)
> > >   0/40 mbuf 12288 byte clusters in use (current/peak)
> > >   0/40 mbuf 16384 byte clusters in use (current/peak)
> > >   0/40 mbuf 65536 byte clusters in use (current/peak)
> > >   5876 out of 524288 Kbytes allocated to network (20% in use)
> > >   0 requests for memory denied
> > >   0 requests for memory delayed
> > >   0 calls to protocol drain routines
> > 
> > That's definitely an improvement.  OK millert@
> > 
> 
> The math for the percentage in use is doing something different; at least
> 20% of 524288 is not 5876. AFAIK the percentage is calculated against the
> pool size and not the maximum size.

Correct and I didn't say otherwise. I wrote:

  This shows how much backing memory has been allocated by all cluster
  pools from the UVM and percentage of how much of it has been taken
  out by pool_get operations.

It's 20% of 5876K that is in use.
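
To spell the arithmetic out, here is a sketch of the percentage as I
describe it: bytes handed out by pool_get over bytes the pools have
allocated from UVM.  pct_in_use() is a name made up for this example.
Unlike the expression currently in mbpr(), which divides by
(totmem / 100), it multiplies first in a wider type so that a total
below 100 bytes cannot divide by zero.

```c
/*
 * In-use percentage relative to the memory already allocated by the
 * pools, not relative to the kern.maxclusters limit.
 */
unsigned int
pct_in_use(unsigned long totused, unsigned long totmem)
{
	if (totmem == 0)
		return 0;
	return (unsigned int)((unsigned long long)totused * 100 / totmem);
}
```

The denominator is what makes the number look odd next to the limit:
it is the allocated total, not the maximum.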

> Would be great if netstat could show the current and peak memory usage.
>

Current is 5876.  Maximum is 524288.  Do you want to display them in
the x/y/z format?

  5876//524288 Kbytes allocated to network, 20% in use (current/peak/max)

Something like this? Any other ideas?



"max" field in "netstat -m" is ambiguous

2017-10-25 Thread Mike Belopuhov
Hi,

After some changes in the way mbuf cluster pool limits are set up,
we have a situation where the "max" number doesn't reflect what it
used to and is ambiguous most of the time.  Right now I have:

  36/144/64 mbuf 2048 byte clusters in use (current/peak/max)
  0/45/120 mbuf 2112 byte clusters in use (current/peak/max)
  256/312/64 mbuf 4096 byte clusters in use (current/peak/max)
  0/40/64 mbuf 8192 byte clusters in use (current/peak/max)
  0/14/112 mbuf 9216 byte clusters in use (current/peak/max)
  0/30/80 mbuf 12288 byte clusters in use (current/peak/max)
  0/40/64 mbuf 16384 byte clusters in use (current/peak/max)
  0/40/64 mbuf 65536 byte clusters in use (current/peak/max)

Several users expressed their concern regarding this and I agree
that this was one of the important metrics that we used to look at.

Now that kern.maxclusters defines how much memory (in 2k chunks) in
total can be spent on (all) clusters, there's no well defined maximum
value for each individual pool as they share this global limit.  But
we shouldn't provide values that are misinterpreted by users.
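
As a quick sketch of the unit conversion involved: the limit is a
count of 2k chunks, so turning it into Kbytes is a multiply and a
divide.  The function name and the MCLBYTES_SKETCH constant are
made up for this example (the kernel constant is MCLBYTES); widening
before multiplying is my own precaution so that a large limit cannot
overflow an int.

```c
#define MCLBYTES_SKETCH	2048	/* kern.maxclusters counts 2k chunks */

/* Convert a cluster limit into Kbytes. */
unsigned long long
maxclusters_kbytes(int maxclusters)
{
	if (maxclusters < 0)
		return 0;
	return (unsigned long long)maxclusters * MCLBYTES_SKETCH / 1024;
}
```

With this, a limit of 262144 clusters corresponds to 524288 Kbytes.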

Here's my take on how to improve the situation.

One line in the "netstat -m" output talks about memory usage:

  5748 Kbytes allocated to network (21% in use)

This shows how much backing memory has been allocated by all cluster
pools from the UVM and percentage of how much of it has been taken
out by pool_get operations.

I think we can extend this by adding an additional number for the
upper boundary (kern.maxclusters), like so:

  saru:usr.bin/netstat% ./obj/netstat -m
  539 mbufs in use:
385 mbufs allocated to data
13 mbufs allocated to packet headers
141 mbufs allocated to socket names and addresses
  19/144 mbuf 2048 byte clusters in use (current/peak)
  0/45 mbuf 2112 byte clusters in use (current/peak)
  256/312 mbuf 4096 byte clusters in use (current/peak)
  0/48 mbuf 8192 byte clusters in use (current/peak)
  0/28 mbuf 9216 byte clusters in use (current/peak)
  0/40 mbuf 12288 byte clusters in use (current/peak)
  0/40 mbuf 16384 byte clusters in use (current/peak)
  0/40 mbuf 65536 byte clusters in use (current/peak)
  5876 out of 524288 Kbytes allocated to network (20% in use)
  0 requests for memory denied
  0 requests for memory delayed
  0 calls to protocol drain routines

I gather this isn't very friendly towards existing scripts parsing
this output, but YMMV.

diff --git usr.bin/netstat/mbuf.c usr.bin/netstat/mbuf.c
index f7970a57c32..701385b2e6b 100644
--- usr.bin/netstat/mbuf.c
+++ usr.bin/netstat/mbuf.c
@@ -86,11 +86,11 @@ bool seen[256]; /* "have we seen this type yet?" */
  */
 void
 mbpr(void)
 {
unsigned long totmem, totused, totmbufs;
-   int totpct;
+   int maxclusters, totpct;
int i, mib[4], npools;
struct kinfo_pool pool;
struct mbtypes *mp;
size_t size;
 
@@ -99,10 +99,20 @@ mbpr(void)
"%s: unexpected change to mbstat; check source\n",
__progname);
return;
}
 
+   mib[0] = CTL_KERN;
+   mib[1] = KERN_MAXCLUSTERS;
+   size = sizeof(maxclusters);
+
+   if (sysctl(mib, 2, , , NULL, 0) < 0) {
+   printf("Can't retrieve value of maxclusters from the "
+   "kernel: %s\n",  strerror(errno));
+   return;
+   }
+
mib[0] = CTL_KERN;
mib[1] = KERN_MBSTAT;
size = sizeof(mbstat);
 
if (sysctl(mib, 2, , , NULL, 0) < 0) {
@@ -176,24 +186,23 @@ mbpr(void)
plural(mbstat.m_mtypes[i]), i);
}
totmem = (mbpool.pr_npages * mbpool.pr_pgsize);
totused = mbpool.pr_nout * mbpool.pr_size;
for (i = 0; i < mclp; i++) {
-   printf("%u/%lu/%lu mbuf %d byte clusters in use"
-   " (current/peak/max)\n",
+   printf("%u/%lu mbuf %d byte clusters in use"
+   " (current/peak)\n",
mclpools[i].pr_nout,
(unsigned long)
(mclpools[i].pr_hiwat * mclpools[i].pr_itemsperpage),
-   (unsigned long)
-   (mclpools[i].pr_maxpages * mclpools[i].pr_itemsperpage),
mclpools[i].pr_size);
totmem += (mclpools[i].pr_npages * mclpools[i].pr_pgsize);
totused += mclpools[i].pr_nout * mclpools[i].pr_size;
}
 
totpct = (totmem == 0) ? 0 : (totused/(totmem / 100));
-   printf("%lu Kbytes allocated to network (%d%% in use)\n",
-   totmem / 1024, totpct);
+   printf("%lu out of %lu Kbytes allocated to network (%d%% in use)\n",
+   totmem / 1024, (unsigned long)(maxclusters * MCLBYTES) / 1024,
+   totpct);
printf("%lu requests for memory denied\n", mbstat.m_drops);
printf("%lu requests for memory delayed\n", mbstat.m_wait);
printf("%lu calls to protocol drain routines\n", 

Re: Remove TCP_FACK

2017-10-25 Thread Mike Belopuhov
On Tue, Oct 24, 2017 at 23:22 +0200, Job Snijders wrote:
> Dear all,
> 
> This patch builds upon the work shared in the following email. Mike's
> patch is a prerequisite to apply this patch.
> 
>   Date: Tue, 24 Oct 2017 15:21:08 +0200
>   From: Mike Belopuhov <m...@belopuhov.com>
>   Subject: Re: Refactor TCP partial ACK handling
> 
> TCP_FACK was disabled by provos@ in June 1999. This patch removes
> the TCP_FACK option and associated #if{,n}def code.
> 
> TCP_FACK is an algorithm that assumes, once a loss is detected, that
> all packets not yet SACKed up to the most forward SACK are lost.  This
> may be a correct estimate if the network does not reorder packets.
> 
> The algorithm described in RFC 6675 may be a better replacement. This
> culling patch can provide guidance on how and where to implement 6675.
> 
> Kind regards,
> 
> Job
> 

This makes my life that much easier so naturally I'm in favour of this
change.  OK mikeb

> @@ -2705,11 +2608,9 @@ tcp_sack_partialack(struct tcpcb *tp, struct tcphdr *th)
>   /* Turn off retx. timer (will start again next segment) */
>   TCP_TIMER_DISARM(tp, TCPT_REXMT);
>   tp->t_rtttime = 0;
> -#ifndef TCP_FACK
>   /*
>* Partial window deflation.  This statement relies on the
> -  * fact that tp->snd_una has not been updated yet.  In FACK
> -  * hold snd_cwnd constant during fast recovery.
> +  * fact that tp->snd_una has not been updated yet.  
>*/

trailing white space in the '+' line above.



Re: Refactor TCP partial ACK handling

2017-10-24 Thread Mike Belopuhov
On Tue, Oct 24, 2017 at 13:37 +0200, Martin Pieuchot wrote:
> On 24/10/17(Tue) 12:27, Mike Belopuhov wrote:
> > On Tue, Oct 24, 2017 at 12:05 +0200, Martin Pieuchot wrote:
> > > On 21/10/17(Sat) 15:17, Mike Belopuhov wrote:
> > > > On Fri, Oct 20, 2017 at 22:59 +0200, Klemens Nanni wrote:
> > > > > The comments for both void tcp_{sack,newreno}_partialack() still mention
> > > > > tp->snd_last and return value bits.
> > > > > 
> > > > 
> > > > Good eyes!  It made me spot a mistake I made by folding two lines
> > > > into an incorrect ifdef in tcp_sack_partialack.  I expected it to
> > > > say "ifdef TCP_FACK" while it says "ifNdef".  The adjusted comment
> > > > didn't make sense and I found the bug.
> > > 
> > > Could you send the full/fixed diff?
> > > 
> > 
> > Sure.
> 
> Diff is correct.  I have two suggestions, but it's ok mpi@ either way.
> 
> > > And what about TCP_FACK?  It is disabled by default, is there a
> > > point in keeping it?
> > 
> > Job has pointed out that RFC 6675 might be a better alternative
> > so it might be a good idea to ditch it while we're at it.  I'm
> > not certain which parts need to be preserved (if any) however.
> 
> I'd say remove it.  One can always look in the Attic if necessary.
> 
> > diff --git sys/netinet/tcp_input.c sys/netinet/tcp_input.c
> > index 790e163975e..3809a2371f2 100644
> > --- sys/netinet/tcp_input.c
> > +++ sys/netinet/tcp_input.c
> > @@ -1664,52 +1664,38 @@ trimthenstep6:
> > }
> > /*
> >  * If the congestion window was inflated to account
> >  * for the other side's cached packets, retract it.
> >  */
> > -   if (tp->sack_enable) {
> > -   if (tp->t_dupacks >= tcprexmtthresh) {
> > -   /* Check for a partial ACK */
> > -   if (tcp_sack_partialack(tp, th)) {
> > -#ifdef TCP_FACK
> > -   /* Force call to tcp_output */
> > -   if (tp->snd_awnd < tp->snd_cwnd)
> > -   tp->t_flags |= TF_NEEDOUTPUT;
> > -#else
> > -   tp->snd_cwnd += tp->t_maxseg;
> > -   tp->t_flags |= TF_NEEDOUTPUT;
> > -#endif /* TCP_FACK */
> > -   } else {
> > -   /* Out of fast recovery */
> > -   tp->snd_cwnd = tp->snd_ssthresh;
> > -   if (tcp_seq_subtract(tp->snd_max,
> > -   th->th_ack) < tp->snd_ssthresh)
> > -   tp->snd_cwnd =
> > -  tcp_seq_subtract(tp->snd_max,
> > -  th->th_ack);
> > -   tp->t_dupacks = 0;
> > -#ifdef TCP_FACK
> > -   if (SEQ_GT(th->th_ack, tp->snd_fack))
> > -   tp->snd_fack = th->th_ack;
> > -#endif
> > -   }
> > -   }
> > -   } else {
> > -   if (tp->t_dupacks >= tcprexmtthresh &&
> > -   !tcp_newreno(tp, th)) {
> > +   if (tp->t_dupacks >= tcprexmtthresh) {
> 
> I'd keep the comment:
> 
>   /* Check for a partial ACK */

Sure.

> > diff --git sys/netinet/tcp_var.h sys/netinet/tcp_var.h
> > index 6b797fd48e7..97b04884879 100644
> > --- sys/netinet/tcp_var.h
> > +++ sys/netinet/tcp_var.h
> > @@ -764,15 +764,15 @@ void   tcp_update_sack_list(struct tcpcb *tp, tcp_seq, tcp_seq);
> >  void tcp_del_sackholes(struct tcpcb *, struct tcphdr *);
> >  void tcp_clean_sackreport(struct tcpcb *tp);
> >  void tcp_sack_adjust(struct tcpcb *tp);
> >  struct sackhole *
> >  tcp_sack_output(struct tcpcb *tp);
> > -int tcp_sack_partialack(struct tcpcb *, struct tcphdr *);
> > +void tcp_sack_partialack(struct tcpcb *, struct tcphdr *);
> > +void tcp_newreno_partialack(struct tcpcb *, struct tcphdr *);
> >  #ifdef DEBUG
> >  void tcp_print_holes(struct tcpcb *tp);
> >  #endif
> > -int tcp_newreno(

Re: Refactor TCP partial ACK handling

2017-10-24 Thread Mike Belopuhov
On Tue, Oct 24, 2017 at 12:05 +0200, Martin Pieuchot wrote:
> On 21/10/17(Sat) 15:17, Mike Belopuhov wrote:
> > On Fri, Oct 20, 2017 at 22:59 +0200, Klemens Nanni wrote:
> > > The comments for both void tcp_{sack,newreno}_partialack() still mention
> > > tp->snd_last and return value bits.
> > > 
> > 
> > Good eyes!  It made me spot a mistake I made by folding two lines
> > into an incorrect ifdef in tcp_sack_partialack.  I expected it to
> > say "ifdef TCP_FACK" while it says "ifNdef".  The adjusted comment
> > didn't make sense and I found the bug.
> 
> Could you send the full/fixed diff?
> 

Sure.

> And what about TCP_FACK?  It is disabled by default, is there a
> point in keeping it?

Job has pointed out that RFC 6675 might be a better alternative
so it might be a good idea to ditch it while we're at it.  I'm
not certain which parts need to be preserved (if any) however.

diff --git sys/netinet/tcp_input.c sys/netinet/tcp_input.c
index 790e163975e..3809a2371f2 100644
--- sys/netinet/tcp_input.c
+++ sys/netinet/tcp_input.c
@@ -1664,52 +1664,38 @@ trimthenstep6:
}
/*
 * If the congestion window was inflated to account
 * for the other side's cached packets, retract it.
 */
-   if (tp->sack_enable) {
-   if (tp->t_dupacks >= tcprexmtthresh) {
-   /* Check for a partial ACK */
-   if (tcp_sack_partialack(tp, th)) {
-#ifdef TCP_FACK
-   /* Force call to tcp_output */
-   if (tp->snd_awnd < tp->snd_cwnd)
-   tp->t_flags |= TF_NEEDOUTPUT;
-#else
-   tp->snd_cwnd += tp->t_maxseg;
-   tp->t_flags |= TF_NEEDOUTPUT;
-#endif /* TCP_FACK */
-   } else {
-   /* Out of fast recovery */
-   tp->snd_cwnd = tp->snd_ssthresh;
-   if (tcp_seq_subtract(tp->snd_max,
-   th->th_ack) < tp->snd_ssthresh)
-   tp->snd_cwnd =
-  tcp_seq_subtract(tp->snd_max,
-  th->th_ack);
-   tp->t_dupacks = 0;
-#ifdef TCP_FACK
-   if (SEQ_GT(th->th_ack, tp->snd_fack))
-   tp->snd_fack = th->th_ack;
-#endif
-   }
-   }
-   } else {
-   if (tp->t_dupacks >= tcprexmtthresh &&
-   !tcp_newreno(tp, th)) {
+   if (tp->t_dupacks >= tcprexmtthresh) {
+   if (SEQ_LT(th->th_ack, tp->snd_last)) {
+   if (tp->sack_enable)
+   tcp_sack_partialack(tp, th);
+   else
+   tcp_newreno_partialack(tp, th);
+   } else {
/* Out of fast recovery */
tp->snd_cwnd = tp->snd_ssthresh;
if (tcp_seq_subtract(tp->snd_max, th->th_ack) <
tp->snd_ssthresh)
tp->snd_cwnd =
tcp_seq_subtract(tp->snd_max,
th->th_ack);
tp->t_dupacks = 0;
+#ifdef TCP_FACK
+   if (tp->sack_enable &&
+   SEQ_GT(th->th_ack, tp->snd_fack))
+   tp->snd_fack = th->th_ack;
+#endif
}
-   }
-   if (tp->t_dupacks < tcprexmtthresh)
+   } else {
+   /*
+* Reset the duplicate ACK counter if we
+* were not in fast recovery.
+*/
tp->t_dupacks = 0;
+   }
if (SEQ_GT(th->th_ack, tp->snd_max)) {
tcpstat_inc(tcps_rcvacktoomuch);
goto dropafterack_ratelim;
}
acked = th->th_ack - tp->snd_una;
@@ -2703,36 +2689,38 @@ tcp_clean_sackreport(struct tcpcb *tp)
tp->sackblks[i].start = tp->sackblks[i].end=0;
 
 }
 
 /*
-

Re: Enable TCP selective acknowledgements (SACK) on all kernels

2017-10-22 Thread Mike Belopuhov
On Sun, Oct 22, 2017 at 11:23 +0200, Job Snijders wrote:
> On Thu, Oct 19, 2017 at 06:55:05PM +0200, Mike Belopuhov wrote:
> > SACK has been enabled in GENERIC kernels for over a decade and it's
> > time to make it an official part of the TCP stack. 
> 
> I tested your diff by doing an amd64 release build and testing both the
> newly created /bsd and /bsd.rd, I observed no problems and SACK was
> available in both boot scenarios.
> 
> One thing that stood out to me is that the miniroot's "SMALL" ftp(1)
> didn't send SACK-permitted as a TCP option. However, after chrooting into
> my normal environment and using the real /usr/bin/ftp, I observed that
> SACK was available and used.
> 
> If this is as expected, OK job@
>

It's setting the option in my build here:

15:55:20.336682 fe:e1:bb:d1:a2:f0 fe:e1:ba:d0:55:1e 0800 78: \
  10.50.50.34.17078 > 10.50.50.1.80: S [tcp sum ok] 1313610867:1313610867(0) \
  win 16384  \
  (DF) (ttl 64, id 25292, len 64)

> > This grows bsd.rd on amd64 by 8k but Theo said it's within reason.
> > OK?
> 
> $ ls -latr /bsd.rd /tmp/bsd.rd
> -rw---  1 root  wheel  9787542 Oct 22 08:36 /bsd.rd
> -rw---  1 job   wheel  9782763 Oct 19 12:48 /old/bsd.rd
> 
> > diff --git sys/conf/GENERIC sys/conf/GENERIC
> > -option TCP_SACK# Selective Acknowledgements for TCP
> 
> I think the below patch may be an appropriate companion for removal of
> the option.
>

Yes, jmc@ has already notified me that I didn't include the
manpage diff, but please be my guest and check your diff in.

> Kind regards,
> 
> Job
> 
> diff --git share/man/man4/options.4 share/man/man4/options.4
> index 3e15d4c8c4f..3945611607e 100644
> --- share/man/man4/options.4
> +++ share/man/man4/options.4
> @@ -454,20 +454,6 @@ Turns on forward acknowledgements allowing a more precise estimate of
>  outstanding data during the fast recovery phase by using
>  .Em SACK
>  information.
> -This option can only be used together with
> -.Em TCP_SACK .
> -.It Cd option TCP_SACK
> -Turns on selective acknowledgements.
> -Additional information about
> -segments already received can be transmitted back to the sender,
> -thus indicating segments that have been lost and allowing for
> -a swifter recovery.
> -Both communication endpoints need to support
> -.Em SACK .
> -The fallback behaviour is NewReno fast recovery phase, which allows
> -one lost segment to be recovered per round trip time.
> -When more than one segment has been dropped per window, the transmission can
> -continue without waiting for a retransmission timeout.
>  .It Cd option TCP_SIGNATURE
>  Turns on support for the TCP MD5 Signature option (RFC 2385).
>  This is used by
> 



Re: Refactor TCP partial ACK handling

2017-10-21 Thread Mike Belopuhov
On Fri, Oct 20, 2017 at 22:59 +0200, Klemens Nanni wrote:
> The comments for both void tcp_{sack,newreno}_partialack() still mention
> tp->snd_last and return value bits.
> 

Good eyes!  It made me spot a mistake I made by folding two lines
into an incorrect ifdef in tcp_sack_partialack.  I expected it to
say "ifdef TCP_FACK" while it says "ifNdef".  The adjusted comment
didn't make sense and I found the bug.
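
To make the intended non-FACK branch easier to follow, here is a
userland sketch of the congestion window arithmetic as I read it from
the hunk below.  The function name and the plain integer arguments are
made up for this example; "acked" stands for th->th_ack - tp->snd_una,
and setting TF_NEEDOUTPUT is left out.

```c
#include <stdint.h>

/*
 * Partial window deflation on a partial ACK (non-FACK case), followed
 * by the one-segment increment that belongs to the same branch.
 */
uint32_t
sack_partialack_cwnd(uint32_t cwnd, uint32_t acked, uint32_t maxseg)
{
	if (cwnd > acked) {
		/* Drop what was just acked, then add back one segment. */
		cwnd -= acked;
		cwnd += maxseg;
	} else
		cwnd = maxseg;
	/* The line that had ended up under the wrong ifdef. */
	cwnd += maxseg;
	return cwnd;
}
```

For example, a window of 10000 with 3000 bytes acked and a 1000 byte
segment comes out at 9000.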

diff --git sys/netinet/tcp_input.c sys/netinet/tcp_input.c
index 45aafee0d05..d5de9cb2407 100644
--- sys/netinet/tcp_input.c
+++ sys/netinet/tcp_input.c
@@ -2690,13 +2690,13 @@ tcp_clean_sackreport(struct tcpcb *tp)
tp->sackblks[i].start = tp->sackblks[i].end=0;
 
 }
 
 /*
- * Checks for partial ack.  If partial ack arrives, turn off retransmission
- * timer, deflate the window, do not clear tp->t_dupacks, and return 1.
- * If the ack advances at least to tp->snd_last, return 0.
+ * Partial ack handling within a sack recovery episode.  When a partial ack
+ * arrives, turn off retransmission timer, deflate the window, do not clear
+ * tp->t_dupacks.
  */
 void
 tcp_sack_partialack(struct tcpcb *tp, struct tcphdr *th)
 {
/* Turn off retx. timer (will start again next segment) */
@@ -2711,16 +2711,16 @@ tcp_sack_partialack(struct tcpcb *tp, struct tcphdr *th)
if (tp->snd_cwnd > (th->th_ack - tp->snd_una)) {
tp->snd_cwnd -= th->th_ack - tp->snd_una;
tp->snd_cwnd += tp->t_maxseg;
} else
tp->snd_cwnd = tp->t_maxseg;
+   tp->snd_cwnd += tp->t_maxseg;
+   tp->t_flags |= TF_NEEDOUTPUT;
+#else
/* Force call to tcp_output */
if (tp->snd_awnd < tp->snd_cwnd)
tp->t_flags |= TF_NEEDOUTPUT;
-#else
-   tp->snd_cwnd += tp->t_maxseg;
-   tp->t_flags |= TF_NEEDOUTPUT;
 #endif
 }
 
 /*
  * Pull out of band byte out of a segment so
@@ -3078,14 +3078,14 @@ tcp_mss_update(struct tcpcb *tp)
}
 
 }
 
 /*
- * Checks for partial ack.  If partial ack arrives, force the retransmission
- * of the next unacknowledged segment, do not clear tp->t_dupacks, and return
- * 1.  By setting snd_nxt to ti_ack, this forces retransmission timer to
- * be started again.  If the ack advances at least to tp->snd_last, return 0.
+ * When a partial ack arrives, force the retransmission of the
+ * next unacknowledged segment.  Do not clear tp->t_dupacks.
+ * By setting snd_nxt to ti_ack, this forces retransmission timer
+ * to be started again.
  */
 void
 tcp_newreno_partialack(struct tcpcb *tp, struct tcphdr *th)
 {
/*



Refactor TCP partial ACK handling

2017-10-20 Thread Mike Belopuhov
This is a small, non-intrusive refactoring of partial ACK handling,
although the diff certainly doesn't look like one.  It's intended to be
applied after the TCP SACK diff that I sent earlier and basically moves
the conditional (SEQ_LT(th->th_ack, tp->snd_last)) out of
tcp_sack_partialack and tcp_newreno into tcp_input itself, so that these
two functions just do the work and tcp_input makes the decisions.  Here's
how it looks after refactoring:

  if (tp->t_dupacks >= tcprexmtthresh) {
          if (SEQ_LT(th->th_ack, tp->snd_last)) {
                  if (tp->sack_enable)
                          tcp_sack_partialack(tp, th);
                  else
                          tcp_newreno_partialack(tp, th);
          } else {
                  /* Out of fast recovery */
                  tp->snd_cwnd = tp->snd_ssthresh;
                  if (tcp_seq_subtract(tp->snd_max, th->th_ack) <
                      tp->snd_ssthresh)
                          tp->snd_cwnd =
                              tcp_seq_subtract(tp->snd_max,
                              th->th_ack);
                  tp->t_dupacks = 0;
  #ifdef TCP_FACK
                  if (tp->sack_enable &&
                      SEQ_GT(th->th_ack, tp->snd_fack))
                          tp->snd_fack = th->th_ack;
  #endif
          }
  } else {
          /*
           * Reset the duplicate ACK counter if we
           * were not in fast recovery.
           */
          tp->t_dupacks = 0;
  }

This consolidates the "out of fast recovery" branch, which is currently
repeated twice, and shows how tcp_sack_partialack and tcp_newreno
interact without extra clutter.  The true branch of the old condition
"if (tcp_sack_partialack(tp, th))" gets integrated into the function
tcp_sack_partialack itself and tcp_newreno is renamed for consistency.

The diff also hooks up the "if (tp->t_dupacks < tcprexmtthresh)" branch
to this "main if" since t_dupacks is either greater than or equal to
tcprexmtthresh or not.

In the end there's no (intentional) logic change at all, but the gain in
clarity is quite substantial, as FreeBSD folks have noticed as well.
Consolidating the 'out of fast recovery' branch is also beneficial for
later work.
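
To illustrate the shape of the decision only, here is a standalone
sketch.  It is not kernel code: seq_lt() mirrors the idea behind SEQ_LT
(a wraparound-safe comparison of 32-bit sequence numbers), while
classify_ack() and the ack_action enum are hypothetical names made up
for this example.

```c
#include <stdint.h>

/*
 * Wraparound-safe "a < b" for 32-bit TCP sequence numbers.
 * Same idea as the kernel's SEQ_LT macro; the helper name is ours.
 */
static int
seq_lt(uint32_t a, uint32_t b)
{
	return ((int32_t)(a - b) < 0);
}

/* Hypothetical labels for the four outcomes of the "main if". */
enum ack_action {
	ACK_RESET_DUPACKS,	/* not in fast recovery */
	ACK_PARTIAL_SACK,	/* partial ACK, SACK enabled */
	ACK_PARTIAL_NEWRENO,	/* partial ACK, NewReno fallback */
	ACK_EXIT_RECOVERY	/* ACK reached snd_last */
};

/* The caller decides; the partial ACK handlers would only do the work. */
static enum ack_action
classify_ack(int dupacks, int rexmtthresh, uint32_t th_ack,
    uint32_t snd_last, int sack_enable)
{
	if (dupacks < rexmtthresh)
		return (ACK_RESET_DUPACKS);
	if (seq_lt(th_ack, snd_last))
		return (sack_enable ? ACK_PARTIAL_SACK : ACK_PARTIAL_NEWRENO);
	return (ACK_EXIT_RECOVERY);
}
```

With this layout each outcome appears exactly once, which is the point
of pulling the condition out of the two handlers.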

OK?

diff --git sys/netinet/tcp_input.c sys/netinet/tcp_input.c
index 9951923bbdb..84cdb35f048 100644
--- sys/netinet/tcp_input.c
+++ sys/netinet/tcp_input.c
@@ -1664,52 +1664,38 @@ trimthenstep6:
}
/*
 * If the congestion window was inflated to account
 * for the other side's cached packets, retract it.
 */
-   if (tp->sack_enable) {
-   if (tp->t_dupacks >= tcprexmtthresh) {
-   /* Check for a partial ACK */
-   if (tcp_sack_partialack(tp, th)) {
-#ifdef TCP_FACK
-   /* Force call to tcp_output */
-   if (tp->snd_awnd < tp->snd_cwnd)
-   tp->t_flags |= TF_NEEDOUTPUT;
-#else
-   tp->snd_cwnd += tp->t_maxseg;
-   tp->t_flags |= TF_NEEDOUTPUT;
-#endif /* TCP_FACK */
-   } else {
-   /* Out of fast recovery */
-   tp->snd_cwnd = tp->snd_ssthresh;
-   if (tcp_seq_subtract(tp->snd_max,
-   th->th_ack) < tp->snd_ssthresh)
-   tp->snd_cwnd =
-  tcp_seq_subtract(tp->snd_max,
-  th->th_ack);
-   tp->t_dupacks = 0;
-#ifdef TCP_FACK
-   if (SEQ_GT(th->th_ack, tp->snd_fack))
-   tp->snd_fack = th->th_ack;
-#endif
-   }
-   }
-   } else {
-   if (tp->t_dupacks >= tcprexmtthresh &&
-   !tcp_newreno(tp, th)) {
+   if (tp->t_dupacks >= tcprexmtthresh) {
+   if (SEQ_LT(th->th_ack, tp->snd_last)) {
+   if (tp->sack_enable)
+   tcp_sack_partialack(tp, th);
+   else
+   tcp_newreno_partialack(tp, th);
+   } else {
/* Out of fast recovery */
tp->snd_cwnd = tp->snd_ssthresh;
if (tcp_seq_subtract(tp->snd_max, th->th_ack) <
tp->snd_ssthresh)

Enable TCP selective acknowledgements (SACK) on all kernels

2017-10-19 Thread Mike Belopuhov
SACK has been enabled in GENERIC kernels for over a decade and it's
time to make it an official part of the TCP stack.  This grows bsd.rd
on amd64 by 8k but Theo said that's within reason.  OK?

diff --git sys/conf/GENERIC sys/conf/GENERIC
index 87dd069f514..cd68ae9e651 100644
--- sys/conf/GENERIC
+++ sys/conf/GENERIC
@@ -43,11 +43,10 @@ option  MSDOSFS # MS-DOS file system
 option         FIFO            # FIFOs; RECOMMENDED
 #option        TMPFS           # efficient memory file system
 option         FUSE            # FUSE
 
 option         SOCKET_SPLICE   # Socket Splicing for TCP and UDP
-option         TCP_SACK        # Selective Acknowledgements for TCP
 option         TCP_ECN         # Explicit Congestion Notification for TCP
 option         TCP_SIGNATURE   # TCP MD5 Signatures, for BGP routing sessions
 #option        TCP_FACK        # Forward Acknowledgements for TCP
 
 option INET6   # IPv6
diff --git sys/netinet/tcp_input.c sys/netinet/tcp_input.c
index 52c206f0bf5..9951923bbdb 100644
--- sys/netinet/tcp_input.c
+++ sys/netinet/tcp_input.c
@@ -852,14 +852,12 @@ findpcb:
 */
tp->t_rcvtime = tcp_now;
if (TCPS_HAVEESTABLISHED(tp->t_state))
TCP_TIMER_ARM(tp, TCPT_KEEP, tcp_keepidle);
 
-#ifdef TCP_SACK
if (tp->sack_enable)
tcp_del_sackholes(tp, th); /* Delete stale SACK holes */
-#endif /* TCP_SACK */
 
/*
 * Process options.
 */
 #ifdef TCP_SIGNATURE
@@ -962,25 +960,23 @@ findpcb:
 */
if (tp->t_pmtud_mss_acked < acked)
tp->t_pmtud_mss_acked = acked;
 
tp->snd_una = th->th_ack;
-#if defined(TCP_SACK) || defined(TCP_ECN)
/*
 * We want snd_last to track snd_una so
 * as to avoid sequence wraparound problems
 * for very large transfers.
 */
 #ifdef TCP_ECN
if (SEQ_GT(tp->snd_una, tp->snd_last))
 #endif
tp->snd_last = tp->snd_una;
-#endif /* TCP_SACK */
-#if defined(TCP_SACK) && defined(TCP_FACK)
+#ifdef TCP_FACK
tp->snd_fack = tp->snd_una;
tp->retran_data = 0;
-#endif /* TCP_FACK */
+#endif
m_freem(m);
 
/*
 * If all outstanding data are acked, stop
 * retransmit timer, otherwise restart timer
@@ -1012,15 +1008,13 @@ findpcb:
/*
 * This is a pure, in-sequence data packet
 * with nothing on the reassembly queue and
 * we have enough buffer space to take it.
 */
-#ifdef TCP_SACK
/* Clean receiver SACK report if present */
if (tp->sack_enable && tp->rcv_numsacks)
tcp_clean_sackreport(tp);
-#endif /* TCP_SACK */
tcpstat_inc(tcps_preddat);
tp->rcv_nxt += tlen;
tcpstat_pkt(tcps_rcvpack, tcps_rcvbyte, tlen);
ND6_HINT(tp);
 
@@ -1137,19 +1131,17 @@ findpcb:
/* Reset initial window to 1 segment for retransmit */
if (tp->t_rxtshift > 0)
tp->snd_cwnd = tp->t_maxseg;
tcp_rcvseqinit(tp);
tp->t_flags |= TF_ACKNOW;
-#ifdef TCP_SACK
 /*
  * If we've sent a SACK_PERMITTED option, and the peer
  * also replied with one, then TF_SACK_PERMIT should have
  * been set in tcp_dooptions().  If it was not, disable SACKs.
  */
if (tp->sack_enable)
tp->sack_enable = tp->t_flags & TF_SACK_PERMIT;
-#endif
 #ifdef TCP_ECN
/*
 * if ECE is set but CWR is not set for SYN-ACK, or
 * both ECE and CWR are set for simultaneous open,
 * peer is ECN capable.
@@ -1569,11 +1561,11 @@ trimthenstep6:
 * to keep a constant cwnd packets in the
 * network.
 */
if (TCP_TIMER_ISARMED(tp, TCPT_REXMT) == 0)
tp->t_dupacks = 0;
-#if defined(TCP_SACK) && defined(TCP_FACK)
+#ifdef TCP_FACK
/*
 * In FACK, can enter fast rec. if the receiver
 * reports a reass. queue longer than 3 segs.
 */
else 

Re: rwlock tweak

2017-10-16 Thread Mike Belopuhov
On Mon, Oct 16, 2017 at 09:36 +0000, Mark Kettenis wrote:
> > Date: Mon, 16 Oct 2017 10:52:09 +0200
> > From: Martin Pieuchot 
> > 
> > As pointed out by Mateusz Guzik [0], on x86 the cmpxchg{q,l} used by
> > rw_enter(9) and rw_exit(9) already includes an implicit memory
> > barrier, so we can avoid using an explicit expensive one by
> > using the following variants.
> > 
> > [0] https://marc.info/?l=openbsd-tech&m=150765959923113&w=2
> > 
> > ok?
> 
> Is this really safe?  The atomic instructions are executed
> conditionally...
>

I've been running without membars on amd64 for about 3 weeks now
and haven't noticed any adverse effects, both on my laptop and on an
APU2 serving internet to about a hundred people.  It appears to me
that the access pattern in rwlock.c allows for that: there are no
reads that matter that could be reordered.  But please take this with
a grain of salt.

A few people noted that a single membar per packet hugely affects
forwarding performance, so a macro-benchmark can be set up to
quantify the cost.
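
As a userland analogy only (C11 <stdatomic.h>, not the kernel's
membar_*() interface), the idea of attaching the ordering to the atomic
operation itself instead of issuing a separate barrier can be sketched
like this.  struct toy_lock, toy_trylock() and toy_unlock() are
hypothetical names and this is not how kern_rwlock.c is written.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Toy exclusive lock: owner is 0 when free, otherwise the owner's id. */
struct toy_lock {
	_Atomic uintptr_t owner;
};

/*
 * Acquire: the compare-and-swap carries acquire ordering on success,
 * so no separate barrier is issued afterwards.  Returns 1 if the lock
 * was taken, 0 if it was already held.
 */
static int
toy_trylock(struct toy_lock *l, uintptr_t self)
{
	uintptr_t expected = 0;

	return (atomic_compare_exchange_strong_explicit(&l->owner,
	    &expected, self, memory_order_acquire, memory_order_relaxed));
}

/* Release: earlier stores become visible before the lock looks free. */
static void
toy_unlock(struct toy_lock *l)
{
	atomic_store_explicit(&l->owner, 0, memory_order_release);
}
```

On x86 a locked cmpxchg already orders memory accesses, so a compiler
does not need to emit an additional fence for the acquire case.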

> > Index: kern/kern_rwlock.c
> > ===
> > RCS file: /cvs/src/sys/kern/kern_rwlock.c,v
> > retrieving revision 1.31
> > diff -u -p -r1.31 kern_rwlock.c
> > --- kern/kern_rwlock.c  12 Oct 2017 09:19:45 -0000  1.31
> > +++ kern/kern_rwlock.c  16 Oct 2017 08:24:27 -0000
> > @@ -96,7 +96,7 @@ _rw_enter_read(struct rwlock *rwl LOCK_F
> > rw_cas(&rwl->rwl_owner, owner, owner + RWLOCK_READ_INCR)))
> > _rw_enter(rwl, RW_READ LOCK_FL_ARGS);
> > else {
> > -   membar_enter();
> > +   membar_enter_after_atomic();
> > WITNESS_CHECKORDER(&rwl->rwl_lock_obj, LOP_NEWORDER, file, line,
> > NULL);
> > WITNESS_LOCK(&rwl->rwl_lock_obj, 0, file, line);
> > @@ -112,7 +112,7 @@ _rw_enter_write(struct rwlock *rwl LOCK_
> > RW_PROC(p) | RWLOCK_WRLOCK)))
> > _rw_enter(rwl, RW_WRITE LOCK_FL_ARGS);
> > else {
> > -   membar_enter();
> > +   membar_enter_after_atomic();
> > WITNESS_CHECKORDER(&rwl->rwl_lock_obj,
> > LOP_EXCLUSIVE | LOP_NEWORDER, file, line, NULL);
> > WITNESS_LOCK(&rwl->rwl_lock_obj, LOP_EXCLUSIVE, file, line);
> > @@ -126,7 +126,7 @@ _rw_exit_read(struct rwlock *rwl LOCK_FL
> >  
> > rw_assert_rdlock(rwl);
> >  
> > -   membar_exit();
> > +   membar_exit_before_atomic();
> > if (__predict_false((owner & RWLOCK_WAIT) ||
> > rw_cas(&rwl->rwl_owner, owner, owner - RWLOCK_READ_INCR)))
> > _rw_exit(rwl LOCK_FL_ARGS);
> > @@ -141,7 +141,7 @@ _rw_exit_write(struct rwlock *rwl LOCK_F
> >  
> > rw_assert_wrlock(rwl);
> >  
> > -   membar_exit();
> > +   membar_exit_before_atomic();
> > if (__predict_false((owner & RWLOCK_WAIT) ||
> > rw_cas(&rwl->rwl_owner, owner, 0)))
> > _rw_exit(rwl LOCK_FL_ARGS);
> > @@ -261,7 +261,7 @@ retry:
> >  
> > if (__predict_false(rw_cas(&rwl->rwl_owner, o, o + inc)))
> > goto retry;
> > -   membar_enter();
> > +   membar_enter_after_atomic();
> >  
> > /*
> >  * If old lock had RWLOCK_WAIT and RWLOCK_WRLOCK set, it means we
> > @@ -295,7 +295,7 @@ _rw_exit(struct rwlock *rwl LOCK_FL_VARS
> > WITNESS_UNLOCK(&rwl->rwl_lock_obj, wrlock ? LOP_EXCLUSIVE : 0,
> > file, line);
> >  
> > -   membar_exit();
> > +   membar_exit_before_atomic();
> > do {
> > owner = rwl->rwl_owner;
> > if (wrlock)
> > 
> > 
> 



Re: softdep dangling vnode

2017-10-09 Thread Mike Belopuhov
On Mon, Oct 09, 2017 at 22:21 +0000, Alexander Bluhm wrote:
> Hi,
> 
> we sometimes see a panic "unmount: dangling vnode" when rebooting a 6.1
> system with softdep.
> 
> I have hacked some diagnostic panics until I got these traces from the
> reboot and update process.
> 
> Reboot:
> sleep_finish() at sleep_finish+0xb1
> tsleep() at tsleep+0x154
> biowait() at biowait+0x46
> bwrite() at bwrite+0x10d
> ffs_update() at ffs_update+0x2bd
> VOP_FSYNC() at VOP_FSYNC+0x3c
> ffs_flushfiles() at ffs_flushfiles+0xb9
> softdep_flushfiles() at softdep_flushfiles+0x4e
> ffs_unmount() at ffs_unmount+0x49
> dounmount_leaf() at dounmount_leaf+0x8b
> dounmount() at dounmount+0xb2
> vfs_unmountall() at vfs_unmountall+0x72
> vfs_shutdown() at vfs_shutdown+0x79
> boot() at boot+0x144
> reboot() at reboot+0x30
> sys_reboot() at sys_reboot+0x5e
> syscall() at syscall+0x21f
> 
> Update:
> *115878  74431  0 0x14000  0x2000  update
> Debugger() at Debugger+0x9
> panic() at panic+0xfe
> insmntque() at insmntque+0x86
> getnewvnode() at getnewvnode+0x192
> ffs_vget() at ffs_vget+0x8b
> handle_workitem_remove() at handle_workitem_remove+0x4c
> process_worklist_item() at process_worklist_item+0xf5
> softdep_process_worklist() at softdep_process_worklist+0x169
> sched_sync() at sched_sync+0xfb
> 
> At reboot all vnodes are flushed, but when it sleeps, the update
> process has a chance to create new dirty vnodes.  Resolving soft
> dependencies adds vnodes to the dirty list.
> 
> In softdep_flushfiles() vnodes and softdep are flushed in a loop.
> But if they sleep, it is not guaranteed that all vnodes have been
> flushed when the softdep worklist flush reports that nothing has
> been done.
> 
> My solution is to do a final vnode flush after the softdep worklist
> has been flushed.  Then the dirty list is empty and the final check in
> dounmount_leaf() does not panic.
> 
> ok?
> 
> bluhm
>

Makes sense to me.  FreeBSD does something similar:
https://svnweb.freebsd.org/base/head/sys/ufs/ffs/ffs_softdep.c?revision=324039&view=markup#l1920

> Index: ufs/ffs/ffs_softdep.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/ufs/ffs/ffs_softdep.c,v
> retrieving revision 1.135
> diff -u -p -r1.135 ffs_softdep.c
> --- ufs/ffs/ffs_softdep.c 7 Nov 2016 00:26:33 -   1.135
> +++ ufs/ffs/ffs_softdep.c 9 Oct 2017 22:19:39 -
> @@ -904,6 +904,14 @@ softdep_flushfiles(struct mount *oldmnt,
>   break;
>   }
>   /*
> +  * If the reboot process sleeps during the loop, the update
> +  * process may call softdep_process_worklist() and create
> +  * new dirty vnodes at the mount point.  Call ffs_flushfiles()
> +  * again after the loop has flushed all soft dependencies.
> +  */
> + if (error == 0)
> + error = ffs_flushfiles(oldmnt, flags, p);
> + /*
>* If we are unmounting then it is an error to fail. If we
>* are simply trying to downgrade to read-only, then filesystem
>* activity can keep us busy forever, so we just fail with EBUSY.
> 



Re: TSC timecounters

2017-10-07 Thread Mike Belopuhov
On Sat, Oct 07, 2017 at 17:23 +0000, Theo de Raadt wrote:
> > > > Now that we have an accurate tsc frequency, I would like to expose this
> > > > information to userland via a sysctl.
> > > >
> > > > The diff below exposes the tsc frequency and if it is invariant.
> > > >
> > > > Cheers,
> > > > Adam
> > > >
> > > >
> > > Please ignore that diff, looks like i had some dregs from a older diff i
> > > had been testing
> > 
> > Yeah, I've renamed a few things and deliberately skipped the sysctl part
> > so that we can test for regressions and don't mess with exposed interfaces
> > if we need to back out.  So I'd wait a few weeks for people to get it
> > running on a variety of systems and report the fallout if any.
> 
> The frequency can be found in dmesg.
> 
> Is there a particular reason why a piece of software has to query it
> with sysctl?  

Adam will correct me if I'm wrong, but his idea was to provide clock
emulation to the operating system running in userland (solo5/unikernel).
Perhaps vmd can make use of this interface too.



Re: TSC timecounters

2017-10-07 Thread Mike Belopuhov
On Sat, Oct 07, 2017 at 10:27 +0000, Adam Steen wrote:
> On Sat, Oct 7, 2017 at 5:52 PM, Adam Steen <a...@adamsteen.com.au> wrote:
> 
> > On Fri, Oct 06, 2017 at 03:58:18PM +0200, Mike Belopuhov wrote:
> > > Hi,
> > >
> > > An experimental change to use TSC as a timecounter source on a variety
> > > of modern Intel and AMD CPUs has been just committed and enabled on
> > > OpenBSD/amd64 thanks to the work done by Adam Steen.
> > >
> > > The rationale is, quoting the commit message:
> > >
> > >   If frequency of an invariant (non-stop) time stamp counter is measured
> > >   using an independent working timecounter that has a known frequency, we
> > >   can assume that the measured TSC frequency is as good as the resolution
> > >   of the timecounter that we use to perform the measurement. This lets us
> > >   switch from this high quality but expensive source to the cheaper TSC
> > >   without sacrificing precision on a wide range of modern CPUs.
> > >
> > > You can query and change the current timecounter source in the runtime
> > > via sysctl:
> > >
> > >   % sysctl kern.timecounter.{choice,hardware}
> > >   kern.timecounter.choice=i8254(0) tsc(2000) acpihpet0(1000)
> > acpitimer0(1000) dummy(-100)
> > >   kern.timecounter.hardware=tsc
> > >
> > > Please make sure your NTP drift (/var/db/ntpd.drift) stays within
> > -20..+20
> > > or at least is not worse than it is right now.
> > >
> > > And finally, please make sure to run a "make config" when building the
> > > kernel to update offset tables because of the cpu_info structure changes.
> > >
> > > Regards,
> > > Mike
> > >
> >
> > Hi,
> >
> > Now that we have an accurate tsc frequency, I would like to expose this
> > information to userland via a sysctl.
> >
> > The diff below exposes the tsc frequency and if it is invariant.
> >
> > Cheers,
> > Adam
> >
> >
> Please ignore that diff, looks like i had some dregs from a older diff i
> had been testing

Yeah, I've renamed a few things and deliberately skipped the sysctl part
so that we can test for regressions and don't mess with exposed interfaces
if we need to back out.  So I'd wait a few weeks for people to get it
running on a variety of systems and report the fallout if any.



TSC timecounters

2017-10-06 Thread Mike Belopuhov
Hi,

An experimental change to use TSC as a timecounter source on a variety
of modern Intel and AMD CPUs has just been committed and enabled on
OpenBSD/amd64 thanks to the work done by Adam Steen.

The rationale is, quoting the commit message:

  If frequency of an invariant (non-stop) time stamp counter is measured
  using an independent working timecounter that has a known frequency, we
  can assume that the measured TSC frequency is as good as the resolution
  of the timecounter that we use to perform the measurement. This lets us
  switch from this high quality but expensive source to the cheaper TSC
  without sacrificing precision on a wide range of modern CPUs.
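
  The arithmetic behind such a measurement can be sketched as follows.
  This is an illustration under assumptions rather than the committed
  code: elapsed_usec() and tsc_hz() are made-up names, and any reference
  frequency you plug in is just an example value.

```c
#include <stdint.h>

/*
 * Microseconds elapsed on a reference counter of known frequency
 * ref_hz between two readings c1 and c2.  "mask" is the counter's
 * width mask, so a single wraparound between readings is handled.
 */
static uint64_t
elapsed_usec(uint64_t c1, uint64_t c2, uint64_t mask, uint64_t ref_hz)
{
	uint64_t delta = (c2 - c1) & mask;

	return (delta * 1000000 / ref_hz);
}

/* TSC frequency in Hz from two TSC readings taken usec apart. */
static uint64_t
tsc_hz(uint64_t tsc1, uint64_t tsc2, uint64_t usec)
{
	return ((tsc2 - tsc1) * 1000000 / usec);
}
```

  The precision of the result is bounded by how accurately the reference
  counter measures the interval, which is the argument made in the commit
  message.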

You can query and change the current timecounter source at runtime
via sysctl:

  % sysctl kern.timecounter.{choice,hardware}
  kern.timecounter.choice=i8254(0) tsc(2000) acpihpet0(1000) acpitimer0(1000) dummy(-100)
  kern.timecounter.hardware=tsc

Please make sure your NTP drift (/var/db/ntpd.drift) stays within -20..+20
or at least is not worse than it is right now.

And finally, please make sure to run a "make config" when building the
kernel to update offset tables because of the cpu_info structure changes.

Regards,
Mike



Re: pfctl can do a better job when handling ioctl(2) errors

2017-09-26 Thread Mike Belopuhov
On Tue, Sep 26, 2017 at 11:37 +0200, Alexandr Nedvedicky wrote:
> Hello,
> 
> whenever administrator asks pfctl to modify/remove anchor, which does not
> exist, the pfctl(8) prints warning 'pfctl: DIOCGETRULES: Invalid argument'.
> Few users on Solaris wants pfctl(8) to be more helpful.
> 
> The 'Invalid Argument' (EINVAL) is returned when particular anchor is not
> found. Patch below changes pfctl output to this:
> 
> # pfctl -sA
> # pfctl -a Foo -sr 
> Anchor 'Foo' not found.
> # 
> 
> OK?
>

Can you please reverse the check so that adding conditions is possible
and the default is in the 'else' branch, i.e.

	if (errno == EINVAL)
		errx(1, "Anchor '%s' not found.\n", anchorname);
	else
		err(1, "pfctl_clear_rules");
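
A standalone sketch of the same ordering, with the specific case first
and the generic errno message as the default.  format_ruleset_error() is
a hypothetical helper used only so the logic can be exercised without
exiting; pfctl itself would call errx(3) and err(3) as above.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the message for a failed ruleset transaction.  Known errno
 * values get a specific message; everything else falls through to the
 * generic strerror() text in the final else branch, so further special
 * cases can be added as extra "else if" arms.
 */
static void
format_ruleset_error(char *buf, size_t len, int error, const char *anchor)
{
	if (error == EINVAL)
		snprintf(buf, len, "Anchor '%s' not found.", anchor);
	else
		snprintf(buf, len, "pfctl_clear_rules: %s", strerror(error));
}
```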

The uipc_mbuf.c changes are clearly unrelated, but what about the chunk below?

> @@ -850,7 +860,7 @@ pfctl_show_rules(int dev, char *path, in
>* to the kernel.
>*/
>   if ((p = strrchr(anchorname, '/')) != NULL &&
> - p[1] == '*' && p[2] == '\0') {
> + ((p[1] == '*' && p[2] == '\0') || (p[1] == '\0'))) {
>   p[0] = '\0';
>   }
>  

I think this requires an explanation.
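
For reference, this is what the new condition does when taken in
isolation; it is my reading of the chunk, not an explanation from the
patch.  trim_anchor() is a made-up name for the example.

```c
#include <string.h>

/*
 * Cut the anchor name at its last '/' if what follows is either "*"
 * (the old behaviour) or nothing at all (the newly added case, i.e. a
 * trailing slash).
 */
static void
trim_anchor(char *anchorname)
{
	char *p;

	if ((p = strrchr(anchorname, '/')) != NULL &&
	    ((p[1] == '*' && p[2] == '\0') || (p[1] == '\0')))
		p[0] = '\0';
}
```

So "foo/*" and "foo/" both become "foo", while "foo/bar" is left alone.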

> thanks and
> regards
> sasha
> 
> 8<---8<---8<--8<
> diff -r 215db23c6b05 src/sbin/pfctl/pfctl.c
> --- src/sbin/pfctl/pfctl.cMon Sep 25 13:38:48 2017 +0200
> +++ src/sbin/pfctl/pfctl.cTue Sep 26 11:36:50 2017 +0200
> @@ -318,13 +318,23 @@ void
>  pfctl_clear_rules(int dev, int opts, char *anchorname)
>  {
>   struct pfr_buffer t;
> + char*p;
> +
> + p = strrchr(anchorname, '/');
> + if (p != NULL && p[1] == '\0')
> + errx(1, "%s: bad anchor name %s", __func__, anchorname);
>  
>   memset(&t, 0, sizeof(t));
>   t.pfrb_type = PFRB_TRANS;
> +
>   if (pfctl_add_trans(&t, PF_TRANS_RULESET, anchorname) ||
>   pfctl_trans(dev, &t, DIOCXBEGIN, 0) ||
> - pfctl_trans(dev, &t, DIOCXCOMMIT, 0))
> - err(1, "pfctl_clear_rules");
> + pfctl_trans(dev, &t, DIOCXCOMMIT, 0)) {
> + if (errno != EINVAL)
> + err(1, "pfctl_clear_rules");
> + else
> + errx(1, "Anchor '%s' not found.\n", anchorname);
> + }
>   if ((opts & PF_OPT_QUIET) == 0)
>   fprintf(stderr, "rules cleared\n");
>  }
> @@ -850,7 +860,7 @@ pfctl_show_rules(int dev, char *path, in
>* to the kernel.
>*/
>   if ((p = strrchr(anchorname, '/')) != NULL &&
> - p[1] == '*' && p[2] == '\0') {
> + ((p[1] == '*' && p[2] == '\0') || (p[1] == '\0'))) {
>   p[0] = '\0';
>   }
>  
> @@ -871,7 +881,11 @@ pfctl_show_rules(int dev, char *path, in
>   if (opts & PF_OPT_SHOWALL) {
>   pr.rule.action = PF_PASS;
>   if (ioctl(dev, DIOCGETRULES, &pr)) {
> - warn("DIOCGETRULES");
> + if (errno != EINVAL)
> + warn("DIOCGETRULES");
> + else
> + fprintf(stderr, "Anchor '%s' not found.\n",
> + anchorname);
>   ret = -1;
>   goto error;
>   }
> @@ -886,7 +900,10 @@ pfctl_show_rules(int dev, char *path, in
>  
>   pr.rule.action = PF_PASS;
>   if (ioctl(dev, DIOCGETRULES, &pr)) {
> - warn("DIOCGETRULES");
> + if (errno != EINVAL)
> + warn("DIOCGETRULES");
> + else
> + fprintf(stderr, "Anchor '%s' not found.\n", anchorname);
>   ret = -1;
>   goto error;
>   }
> diff -r 215db23c6b05 src/sys/kern/uipc_mbuf.c
> --- src/sys/kern/uipc_mbuf.c  Mon Sep 25 13:38:48 2017 +0200
> +++ src/sys/kern/uipc_mbuf.c  Tue Sep 26 11:36:50 2017 +0200
> @@ -1,4 +1,4 @@
> -/*   $OpenBSD: uipc_mbuf.c,v 1.249 2017/09/15 18:13:05 bluhm Exp $   */
> +/*   $OpenBSD: uipc_mbuf.c,v 1.248 2017/05/27 16:41:10 bluhm Exp $   */
>  /*   $NetBSD: uipc_mbuf.c,v 1.15.4.1 1996/06/13 17:11:44 cgd Exp $   */
>  
>  /*
> @@ -804,13 +804,12 @@ m_adj(struct mbuf *mp, int req_len)
>   struct mbuf *m;
>   int count;
>  
> - if (mp == NULL)
> + if ((m = mp) == NULL)
>   return;
>   if (len >= 0) {
>   /*
>* Trim from head.
>*/
> - m = mp;
>   while (m != NULL && len > 0) {
>   if (m->m_len <= len) {
>   len -= m->m_len;
> @@ -834,7 +833,6 @@ m_adj(struct mbuf *mp, int req_len)
>*/
>   len = -len;
>   count = 0;
> - m = mp;
>   for (;;) {
>   count += m->m_len;
>   if (m->m_next == NULL)
> @@ -855,16 +853,15 @@ m_adj(struct mbuf *mp, int req_len)
>* Find the mbuf 

Re: pfctl always prints warning when flushes ruleset

2017-09-26 Thread Mike Belopuhov
On Tue, Sep 26, 2017 at 11:15 +0200, Alexandr Nedvedicky wrote:
> Hello,
> 
> few users on Solaris don't like to read warning 'Anchor or Ruleset' does not
> exist:
> 
> # echo 'pass' |pfctl -a foo -f -
> # pfctl -a foo -Fa
> rules cleared
> pfctl: Anchor or Ruleset does not exist.
> # 
> 
> the commands above did work well, the 'pfctl: Anchor ...' warning message
> is kind of invalid. The code path, which ends up with warning, starts in
> pfct main() function:
> 
> 2518 case 'a':
> 2519 pfctl_clear_rules(dev, opts, anchorname);
> 2520 pfctl_clear_tables(anchorname, opts);
> 2521 if (ifaceopt && *ifaceopt) {
> 2522 warnx("don't specify an interface with 
> -Fall");
> 2523 usage();
> 2524 /* NOTREACHED */
> 2525 }
> 
> we call pfctl_clear_rules(), which flushes all rules from anchor. The anchor
> becomes empty. If there are no tables attached to anchor, then
> pf_remove_if_empty_ruleset() called on behalf of DIOCXCOMMIT also removes the
> anchor from table. No wonder the pfctl_clear_tables() invoked later at line 
> 2520
> can not find anchor, which it is searching table to be flushed.
> 
> The patch below just swaps line 2519 and 2520 to put the operations to correct
> order. With patch in place the warning is gone.
> 
> # echo 'pass' |pfctl -a foo -f -
> # pfctl -a foo -Fa
> 0 tables deleted.
> rules cleared
> # 
> 
> Also pfctl still prints warning in expected case:
> # pfctl -sA
> # pfctl -a foo -FT
> pfctl: Anchor or Ruleset does not exist.
> #
> 
> OK?
>

Please make sure the pfctl regress tests don't run into issues with
this diff.  Otherwise OK mikeb.

> thanks and
> regards
> sasha
> 
> 8<---8<---8<--8<
> diff -r 215db23c6b05 src/sbin/pfctl/pfctl.c
> --- src/sbin/pfctl/pfctl.c  Mon Sep 25 13:38:48 2017 +0200
> +++ src/sbin/pfctl/pfctl.c  Tue Sep 26 11:15:26 2017 +0200
> @@ -2516,8 +2516,8 @@ main(int argc, char *argv[])
> pfctl_clear_stats(dev, ifaceopt, opts);
> break;
> case 'a':
> +   pfctl_clear_tables(anchorname, opts);
> pfctl_clear_rules(dev, opts, anchorname);
> -   pfctl_clear_tables(anchorname, opts);
> if (ifaceopt && *ifaceopt) {
> warnx("don't specify an interface with 
> -Fall");
> usage();
> 8<---8<---8<--8<
> 
> 
> 
> 
> 



Re: Improve the accuracy of the TSC frequency calibration - Updated Patch

2017-09-19 Thread Mike Belopuhov
On Tue, Sep 19, 2017 at 22:35 +0200, Mike Belopuhov wrote:
> Keeping all of what Reyk said in mind, here's an updated diff that
> incorporates additional changes:
> 
> 1) doesn't break i386 kernel compilation;
> 
> 2) allows for multiple recalibrations using better timecounter
> sources since acpihpet (and maybe acpitimer) can be attached both
> before and after cpu0;
> 
> 3) factors out the TSC timecounter code into a separate file so
> that it's clear what's related to the timecounter code and what's
> not as I didn't quite like pollution of ACPI bits with unrelated
> stuff;
> 
> 4) cuts down on global variables and provides additional cleanup.
> 
> Same diff as in here: https://github.com/mbelop/src/tree/tsc
> 
> Does this look good to everybody?
> The acpitimer and acpihpet parts look a tiny bit gross I guess, but
> overall tsc.c can be copied to i386 and these ifdefs
> extended to include it.
> 

Forgot to mention.  Since this removes the useless ci_tsc_freq
struct cpu_info member, a "make config" is required to update the
cpu_info member offsets table.  Not doing "make clean" at the
same time is on your conscience.



Re: Improve the accuracy of the TSC frequency calibration - Updated Patch

2017-09-19 Thread Mike Belopuhov
();
return (error);
 #endif
+   case CPU_TSCFREQ:
+   return (sysctl_rdquad(oldp, oldlenp, newp,
+   amd64_tsc_frequency));
+   case CPU_INVARIANTTSC:
+   return (sysctl_rdint(oldp, oldlenp, newp,
+   amd64_has_invariant_tsc));
default:
    return (EOPNOTSUPP);
}
/* NOTREACHED */
 }
diff --git sys/arch/amd64/amd64/tsc.c sys/arch/amd64/amd64/tsc.c
new file mode 100644
index 00000000000..32a02eb00af
--- /dev/null
+++ sys/arch/amd64/amd64/tsc.c
@@ -0,0 +1,223 @@
+/* $OpenBSD$   */
+/*
+ * Copyright (c) 2016,2017 Reyk Floeter <r...@openbsd.org>
+ * Copyright (c) 2017 Adam Steen <a...@adamsteen.com.au>
+ * Copyright (c) 2017 Mike Belopuhov <m...@openbsd.org>
+ *
+ * Permission to use, copy, modify, and distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+ * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#define RECALIBRATE_MAX_RETRIES        5
+#define RECALIBRATE_SMI_THRESHOLD      5
+#define RECALIBRATE_DELAY_THRESHOLD    20
+
+int        tsc_recalibrate;
+
+uint64_t   amd64_tsc_frequency;
+int        amd64_has_invariant_tsc;
+
+uint   tsc_get_timecount(struct timecounter *tc);
+
+struct timecounter tsc_timecounter = {
+   tsc_get_timecount, NULL, ~0u, 0, "tsc", -1000, NULL
+};
+
+uint64_t
+tsc_freq_cpuid(struct cpu_info *ci)
+{
+   uint64_t count;
+   uint32_t eax, ebx, khz, dummy;
+
+   if (!strcmp(cpu_vendor, "GenuineIntel") &&
+   cpuid_level >= 0x15) {
+   eax = ebx = khz = dummy = 0;
+   CPUID(0x15, eax, ebx, khz, dummy);
+   khz /= 1000;
+   if (khz == 0) {
+   switch (ci->ci_model) {
+   case 0x4e: /* Skylake mobile */
+   case 0x5e: /* Skylake desktop */
+   case 0x8e: /* Kabylake mobile */
+   case 0x9e: /* Kabylake desktop */
+   khz = 24000; /* 24.0 Mhz */
+   break;
+   case 0x55: /* Skylake X */
+   khz = 25000; /* 25.0 Mhz */
+   break;
+   case 0x5c: /* Atom Goldmont */
+   khz = 19200; /* 19.2 Mhz */
+   break;
+   }
+   }
+   if (ebx == 0 || eax == 0)
+   count = 0;
+   else if ((count = (uint64_t)khz * (uint64_t)ebx / eax) != 0)
+   return (count * 1000);
+   }
+
+   return (0);
+}
+
+static inline int
+get_tsc_and_timecount(struct timecounter *tc, uint64_t *tsc, uint64_t *count)
+{
+   uint64_t n, tsc1, tsc2;
+   int i;
+
+   for (i = 0; i < RECALIBRATE_MAX_RETRIES; i++) {
+   tsc1 = rdtsc();
+   n = (tc->tc_get_timecount(tc) & tc->tc_counter_mask);
+   tsc2 = rdtsc();
+
+   if ((tsc2 - tsc1) < RECALIBRATE_SMI_THRESHOLD) {
+   *count = n;
+   *tsc = tsc2;
+   return (0);
+   }
+   }
+   return (1);
+}
+
+static inline uint64_t
+calculate_tsc_freq(uint64_t tsc1, uint64_t tsc2, int usec)
+{
+   uint64_t delta;
+
+   delta = (tsc2 - tsc1);
+   return (delta * 1000000 / usec);
+}
+
+static inline uint64_t
+calculate_tc_delay(struct timecounter *tc, uint64_t count1, uint64_t count2)
+{
+   uint64_t delta;
+
+   if (count2 < count1)
+   count2 += tc->tc_counter_mask;
+
+   delta = (count2 - count1);
+   return (delta * 1000000 / tc->tc_frequency);
+}
+
+uint64_t
+measure_tsc_freq(struct timecounter *tc)
+{
+   uint64_t count1, count2, frequency, min_freq, tsc1, tsc2;
+   u_long ef;
+   int delay_usec, i, err1, err2, usec;
+
+   /* warmup the timers */
+   for (i = 0; i < 3; i++) {
+   (void)tc->tc_get_timecount(tc);
+   (void)rdtsc();
+   }
+
+   min_freq = ULLONG_MAX;
+
+   delay_usec = 10;
+   for (i = 0; i < 3; i++) {
+   ef = read_rflags();
+   disa

Re: Improve the accuracy of the TSC frequency calibration - Updated Patch

2017-08-25 Thread Mike Belopuhov
On Fri, Aug 25, 2017 at 00:40 -0700, Mike Larkin wrote:
> On Thu, Aug 24, 2017 at 12:39:33PM +0800, Adam Steen wrote:
> > On Thu, Aug 24, 2017 at 2:35 AM, Mike Larkin  wrote:
> > > On Wed, Aug 23, 2017 at 09:29:15PM +0800, Adam Steen wrote:
> > >>
> > >> Thank you Mike for the feedback on the last patch, please see the diff
> > >> below, updated with your input and style(9)
> > >>
> > >> I have continued to use tsc as my timecounter and /var/db/ntpd.drift
> > >> has stayed under 10.
> > >>
> > >> cat /var/db/ntpd.drift
> > >> 6.665
> > >>
> > >> ntpctl -s all
> > >> 4/4 peers valid, constraint offset -1s, clock synced, stratum 3
> > >>
> > >> peer
> > >>wt tl st  next  poll  offset   delay  jitter
> > >> 144.48.166.166 from pool pool.ntp.org
> > >> 1 10  24s   32s-3.159ms87.723ms10.389ms
> > >> 13.55.50.68 from pool pool.ntp.org
> > >> 1 10  3   11s   32s-3.433ms86.053ms18.095ms
> > >> 14.202.204.182 from pool pool.ntp.org
> > >> 1 10  1   14s   32s 1.486ms86.545ms16.483ms
> > >> 27.124.125.250 from pool pool.ntp.org
> > >>  *  1 10  2   12s   30s   -10.275ms54.156ms70.389ms
> > >>
> > >> Cheers
> > >> Adam
> > >
> > > IIRC you have an x220, right?
> > >
> > > If so, could you try letting the clock run for a bit (while using tsc
> > > timecounter selection) after apm -L to drop the speed? (make sure
> > > apm shows that it dropped).
> > >
> > > Even though my x230 supposedly has a constant/invar TSC (according to
> > > cpuid), the TSC drops from 2.5GHz to 1.2GHz when apm -L runs, which
> > > causes time to run too slowly when tsc is selected there.
> > >
> > > -ml
> > >
> >
> > Yes, x220
> > (bios: LENOVO version "8DET69WW (1.39 )" date 07/18/2013)
> > (cpu: Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz, 2491.91 MHz)
> >
> > I took some measurements before starting the test.
> >
> > note: the laptop has been up for a few days with apm -A set via 
> > rc.conf.local
> > and sysctl kern.timecounter.hardware as tsc via sysctl.conf and mostly
> > idle.
> >
> > cat /var/db/ntpd.drift
> > 6.459
> >
> > apm -v
> > Battery state: high, 100% remaining, unknown life estimate
> > A/C adapter state: connected
> > Performance adjustment mode: auto (800 MHz)
> >
> > 6 hours ago i ran apm -L, verified it was running slowly (800 MHz),
> > and got the following results
> >
> > The clock appears correct (comparing to other computers)
> >
> > apm -v
> > Battery state: high, 100% remaining, unknown life estimate
> > A/C adapter state: connected
> > Performance adjustment mode: manual (800 MHz)
> >
> > cat /var/db/ntpd.drift
> > 6.385
> >
> > ntpctl -s all
> > 4/4 peers valid, constraint offset 0s, clock synced, stratum 4
> >
> > peer
> >wt tl st  next  poll  offset   delay  jitter
> > 203.23.237.200 from pool pool.ntp.org
> > 1 10  2  153s 1505s   -25.546ms73.450ms 2.644ms
> > 203.114.73.24 from pool pool.ntp.org
> > 1 10  2  253s 1560s-1.042ms75.133ms 0.752ms
> > 192.189.54.33 from pool pool.ntp.org
> >  *  1 10  2  204s 1558s31.644ms70.910ms 3.388ms
> > 54.252.165.245 from pool pool.ntp.org
> > 1 10  2  238s 1518s 0.146ms73.005ms 2.025ms
> >
> > I will leave the laptop in lower power mode over the weekend and see
> > what happens.
> >
>
> No need, I think you've convinced me that it works :)

But does it actually work on x230 as well?  I'm surprised to learn
that you've observed TSC frequency change on Ivy Bridge.  I was
under the impression that everything since at least Sandy Bridge (x220)
has constant and invariant TSC as advertised.

Adam, I've readjusted and simplified your diff a bit.  The biggest
change is that we can select the reference tc based on its quality
so there's no need to have acpitimer and acpihpet specific functions
and variables.

There's one big thing missing here: increasing the timecounter
quality so that the OS can pick it.  Something like this:

https://github.com/mbelop/src/commit/99d6ef3ae95bbd8ea93c27e0425eb65e5a3359a1

I'd say we should try getting this in after 6.3 unlock unless there
are objections.  Further cleanup and testing is welcome of course.
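
To illustrate the idea of choosing the reference counter by quality, here
is a minimal userland sketch.  It is not the code from the diff: struct
tc_sketch and pick_reference_tc() are hypothetical stand-ins for the
kernel's struct timecounter and whatever helper ends up doing the
selection, and the quality numbers a caller passes in are up to the caller.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the kernel's struct timecounter. */
struct tc_sketch {
	const char	*name;
	int		 quality;	/* higher is better; negative: never pick */
};

/*
 * Return the best reference for calibrating the counter named `skip`
 * (e.g. "tsc"): the highest-quality entry with non-negative quality that
 * is not the counter being calibrated.  Returns NULL if none qualifies.
 */
static const struct tc_sketch *
pick_reference_tc(const struct tc_sketch *tcs, size_t n, const char *skip)
{
	const struct tc_sketch *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (tcs[i].quality < 0 || strcmp(tcs[i].name, skip) == 0)
			continue;
		if (best == NULL || tcs[i].quality > best->quality)
			best = &tcs[i];
	}
	return best;
}
```

With one generic selector like this, neither acpitimer nor acpihpet needs
its own calibration hook; whichever attached counter ranks highest gets used.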


diff --git sys/arch/amd64/amd64/acpi_machdep.c sys/arch/amd64/amd64/acpi_machdep.c
index 17d8fb205ef..f632b838ec2 100644
--- sys/arch/amd64/amd64/acpi_machdep.c
+++ sys/arch/amd64/amd64/acpi_machdep.c
@@ -67,10 +67,19 @@ extern int acpi_savecpu(void) __returns_twice;
 #define ACPI_BIOS_RSDP_WINDOW_BASE0xe
 #define ACPI_BIOS_RSDP_WINDOW_SIZE0x2
 
 u_int8_t   *acpi_scan(struct acpi_mem_map *, paddr_t, size_t);
 
+#define RECALIBRATE_MAX_RETRIES     5
+#define RECALIBRATE_SMI_THRESHOLD   5
+#define RECALIBRATE_DELAY_THRESHOLD 20
+
+struct timecounter *recalibrate_tc;
+
+uint64_t acpi_calibrate_tsc_freq(void);
+int acpi_get_tsc_and_timecount(struct timecounter *, uint64_t *, uint64_t *);
+
 int
 

CID 1453170, 1452971: Uninitialized scalar variable (txp_start)

2017-08-17 Thread Mike Belopuhov
Hi,

I've made a mistake when refactoring txp_start recently.
firstprod and firstcnt served one purpose only: they cached the
value of prod and cnt at the start of the loop and then if they'd
get incremented but we'd have to bail and goto the oactive label
we'd restore the r_prod and r_cnt to values from before the
increment.  Now that we don't bail where we used to after the
refactoring, we don't need to cache them anymore and can safely
use prod and cnt.

OK?

diff --git sys/dev/pci/if_txp.c sys/dev/pci/if_txp.c
index 9ea9b359832..3529e44cef1 100644
--- sys/dev/pci/if_txp.c
+++ sys/dev/pci/if_txp.c
@@ -1263,11 +1263,11 @@ txp_start(struct ifnet *ifp)
struct txp_tx_desc *txd;
int txdidx;
struct txp_frag_desc *fxd;
struct mbuf *m;
struct txp_swdesc *sd;
-   u_int32_t firstprod, firstcnt, prod, cnt, i;
+   u_int32_t prod, cnt, i;
 
if (!(ifp->if_flags & IFF_RUNNING) || ifq_is_oactive(&ifp->if_snd))
return;
 
prod = r->r_prod;
@@ -1279,13 +1279,10 @@ txp_start(struct ifnet *ifp)
 
m = ifq_dequeue(&ifp->if_snd);
if (m == NULL)
break;
 
-   firstprod = prod;
-   firstcnt = cnt;
-
sd = sc->sc_txd + prod;
sd->sd_mbuf = m;
 
switch (bus_dmamap_load_mbuf(sc->sc_dmat, sd->sd_map, m,
BUS_DMA_NOWAIT)) {
@@ -1399,12 +1396,12 @@ txp_start(struct ifnet *ifp)
r->r_cnt = cnt;
return;
 
 oactive:
ifq_set_oactive(&ifp->if_snd);
-   r->r_prod = firstprod;
-   r->r_cnt = firstcnt;
+   r->r_prod = prod;
+   r->r_cnt = cnt;
 }
 
 /*
  * Handle simple commands sent to the typhoon
  */



Re: hfsc_deferred race

2017-08-16 Thread Mike Belopuhov
On Tue, Aug 15, 2017 at 17:14 +0200, Mike Belopuhov wrote:
> Hi,
> 
> I've just triggered an assert in hfsc_deferred (a callout) on an
> MP kernel running on an SP virtual machine:
> 
>   panic: kernel diagnostic assertion "HFSC_ENABLED(ifq)" failed: file "/home/mike/src/openbsd/sys/net/hfsc.c", line 950
>   Stopped at  db_enter+0x9:   leave
>   TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
>   *247463  28420  0 0x3  00  pfctl
>   db_enter() at db_enter+0x9
>   panic(817f78f0,4,81a3ffc0,8110c140,800c2060,ffff81598b1c) at panic+0x102
>   __assert(81769d93,817d7350,3b6,817d72bd) at __assert+0x35
>   hfsc_deferred(800c2060) at hfsc_deferred+0x9e
>   timeout_run(8004adc8) at timeout_run+0x4c
>   softclock(0) at softclock+0x146
>   softintr_dispatch(0) at softintr_dispatch+0x9f
>   Xsoftclock() at Xsoftclock+0x1f
>   --- interrupt ---
>   end of kernel
>   end trace frame: 0x728d481974c08548, count: 7
>   0x2cfe9c031c9:
>   https://www.openbsd.org/ddb.html describes the minimum info required in bug
>   reports.  Insufficient info makes it difficult to find and fix bugs.
>   ddb{0}> ps
>  PID TID   PPIDUID  S   FLAGS  WAIT  COMMAND
>   *28420  247463   5000  0  7 0x3pfctl
> 
> 
> pfctl runs in a loop reloading the ruleset.  So at some point we
> disable HFSC on the interface but lose a race with hfsc_deferred
> before re-enabling it.
> 
> IFQ has a mechanism to lock the underlying object and I believe this
> is the right tool for this job.  Any other ideas?
> 
> I don't think it's a good idea to hold the mutex (ifq_q_enter and
> ifq_q_leave effectively lock and unlock it) during the ifq_start,
> so we have to make a concession and run the ifq_start before knowing
> whether or not HFSC is attached.  IMO, it's a small price to pay to
> avoid clutter.  Kernel lock assertion is pointless at this point.
> 
> OK?
>

I've been running with this while debugging the issue with the active
class list ("panic: kernel diagnostic assertion" from Aug 12 on bugs@)
and I'm quite confident that this works and I don't observe the race
anymore.

In addition, I've figured we can keep the HFSC_ENABLED check as there
is no issue with bailing early here:

diff --git sys/net/hfsc.c sys/net/hfsc.c
index 12504267dc5..c51f1406a0b 100644
--- sys/net/hfsc.c
+++ sys/net/hfsc.c
@@ -950,10 +950,13 @@ hfsc_deferred(void *arg)
 {
struct ifnet *ifp = arg;
struct ifqueue *ifq = &ifp->if_snd;
struct hfsc_if *hif;
 
+   if (!HFSC_ENABLED(ifq))
+   return;
+
if (!ifq_empty(ifq))
ifq_start(ifq);
 
hif = ifq_q_enter(&ifp->if_snd, ifq_hfsc_ops);
if (hif == NULL)


> diff --git sys/net/hfsc.c sys/net/hfsc.c
> index 410bea733c6..3c5b6f6ef78 100644
> --- sys/net/hfsc.c
> +++ sys/net/hfsc.c
> @@ -944,20 +944,19 @@ hfsc_deferred(void *arg)
>  {
>   struct ifnet *ifp = arg;
>   struct ifqueue *ifq = &ifp->if_snd;
>   struct hfsc_if *hif;
>  
> - KERNEL_ASSERT_LOCKED();
> - KASSERT(HFSC_ENABLED(ifq));
> -
>   if (!ifq_empty(ifq))
>   ifq_start(ifq);
>  
> - hif = ifq->ifq_q;
> -
> + hif = ifq_q_enter(&ifp->if_snd, ifq_hfsc_ops);
> + if (hif == NULL)
> + return;
>   /* XXX HRTIMER nearest virtual/fit time is likely less than 1/HZ. */
>   timeout_add(&hif->hif_defer, 1);
> + ifq_q_leave(&ifp->if_snd, hif);
>  }
>  
>  void
>  hfsc_cl_purge(struct hfsc_if *hif, struct hfsc_class *cl, struct mbuf_list *ml)
>  {



CID 1453358: Out-of-bounds read (bufq_init)

2017-08-16 Thread Mike Belopuhov
There are only two disk elevator disciplines: 0 (fifo) and 1 (nscan).
BUFQ_HOWMANY is 2, but 'type' is used as an index into bufq_impls[],
so the largest valid value is (BUFQ_HOWMANY - 1) and the check has to
reject type == BUFQ_HOWMANY as well.
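
A small standalone sketch of the off-by-one, for illustration only:
NQUEUE_TYPES, queue_names[] and the two helpers are hypothetical names
standing in for BUFQ_HOWMANY and bufq_impls[].  The extra check for
negative values is my addition and not part of the diff below.

```c
#include <stdbool.h>
#include <stddef.h>

#define NQUEUE_TYPES	2	/* hypothetical stand-in for BUFQ_HOWMANY */

static const char *const queue_names[NQUEUE_TYPES] = { "fifo", "nscan" };

/*
 * An array of N elements has valid indices 0 .. N-1, so the rejecting
 * comparison must be `type >= N`; `type > N` lets type == N through.
 */
static bool
queue_type_valid(int type)
{
	return type >= 0 && type < NQUEUE_TYPES;
}

/* Look up a discipline name, returning NULL for an out-of-range type. */
static const char *
queue_name(int type)
{
	return queue_type_valid(type) ? queue_names[type] : NULL;
}
```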

OK?

diff --git sys/kern/kern_bufq.c sys/kern/kern_bufq.c
index 7ed83470e58..ad9558e0d53 100644
--- sys/kern/kern_bufq.c
+++ sys/kern/kern_bufq.c
@@ -76,11 +76,11 @@ const struct bufq_impl bufq_impls[BUFQ_HOWMANY] = {
 int
 bufq_init(struct bufq *bq, int type)
 {
u_int hi = BUFQ_HI, low = BUFQ_LOW;
 
-   if (type > BUFQ_HOWMANY)
+   if (type >= BUFQ_HOWMANY)
panic("bufq_init: type %i unknown", type);
 
/*
 * Ensure that writes can't consume the entire amount of kva
 * available the buffer cache if we only have a limited amount



CID 1452946, 1452957: Uninitialized scalar variable (bridge_ipsec)

2017-08-16 Thread Mike Belopuhov
Hi,

In May this year, the condition that would make this break do the
right thing got removed and now if a short packet is sent to an
ipsec-enabled bridge, various things like 'spi' and 'off' are left
uninitialized, but thankfully the gettdb call that follows will
most likely fail when presented with a random spi value.  But it's
a nasty bug nevertheless.

OK?

diff --git sys/net/if_bridge.c sys/net/if_bridge.c
index 0e048205475..33d4753fd6b 100644
--- sys/net/if_bridge.c
+++ sys/net/if_bridge.c
@@ -1404,11 +1404,11 @@ bridge_ipsec(struct bridge_softc *sc, struct ifnet *ifp,
 
if (dir == BRIDGE_IN) {
switch (af) {
case AF_INET:
if (m->m_pkthdr.len - hlen < 2 * sizeof(u_int32_t))
-   break;
+   goto skiplookup;
 
ip = mtod(m, struct ip *);
proto = ip->ip_p;
off = offsetof(struct ip, ip_p);
 
@@ -1425,11 +1425,11 @@ bridge_ipsec(struct bridge_softc *sc, struct ifnet *ifp,
 
break;
 #ifdef INET6
case AF_INET6:
if (m->m_pkthdr.len - hlen < 2 * sizeof(u_int32_t))
-   break;
+   goto skiplookup;
 
ip6 = mtod(m, struct ip6_hdr *);
 
/* XXX We should chase down the header chain */
proto = ip6->ip6_nxt;



Additional media options for ix(4) [again]

2017-08-16 Thread Mike Belopuhov
Hi,

I haven't gotten any feedback on the following diff
but I think there's still hope.  Please test.

Original mail:

I wouldn't mind some broad testing of the following diff
which adds some additional media options to ix(4) from
FreeBSD and includes a fix for changing media from
Masanobu SAITOH.

The fix makes sure that when the media operation speed
is selected manually, the device doesn't additionally
advertise other (slower) modes.


diff --git sys/dev/pci/if_ix.c sys/dev/pci/if_ix.c
index 339ba2bc4f1..8fca8742f7f 100644
--- sys/dev/pci/if_ix.c
+++ sys/dev/pci/if_ix.c
@@ -1028,62 +1028,115 @@ ixgbe_intr(void *arg)
  *  This routine is called whenever the user queries the status of
  *  the interface using ifconfig.
  *
  **/
 void
-ixgbe_media_status(struct ifnet * ifp, struct ifmediareq *ifmr)
+ixgbe_media_status(struct ifnet *ifp, struct ifmediareq *ifmr)
 {
struct ix_softc *sc = ifp->if_softc;
+   int layer;
+
+   layer = sc->hw.mac.ops.get_supported_physical_layer(&sc->hw);
 
ifmr->ifm_active = IFM_ETHER;
ifmr->ifm_status = IFM_AVALID;
 
INIT_DEBUGOUT("ixgbe_media_status: begin");
ixgbe_update_link_status(sc);
 
-   if (LINK_STATE_IS_UP(ifp->if_link_state)) {
-   ifmr->ifm_status |= IFM_ACTIVE;
+   if (!LINK_STATE_IS_UP(ifp->if_link_state))
+   return;
+
+   ifmr->ifm_status |= IFM_ACTIVE;
 
+   if (layer & IXGBE_PHYSICAL_LAYER_10GBASE_T ||
+   layer & IXGBE_PHYSICAL_LAYER_1000BASE_T ||
+   layer & IXGBE_PHYSICAL_LAYER_100BASE_TX)
switch (sc->link_speed) {
+   case IXGBE_LINK_SPEED_10GB_FULL:
+   ifmr->ifm_active |= IFM_10G_T | IFM_FDX;
+   break;
+   case IXGBE_LINK_SPEED_1GB_FULL:
+   ifmr->ifm_active |= IFM_1000_T | IFM_FDX;
+   break;
case IXGBE_LINK_SPEED_100_FULL:
ifmr->ifm_active |= IFM_100_TX | IFM_FDX;
break;
+   }
+   if (layer & IXGBE_PHYSICAL_LAYER_SFP_PLUS_CU ||
+   layer & IXGBE_PHYSICAL_LAYER_SFP_ACTIVE_DA)
+   switch (sc->link_speed) {
+   case IXGBE_LINK_SPEED_10GB_FULL:
+   ifmr->ifm_active |= IFM_10G_SFP_CU | IFM_FDX;
+   break;
+   }
+   if (layer & IXGBE_PHYSICAL_LAYER_10GBASE_LR)
+   switch (sc->link_speed) {
+   case IXGBE_LINK_SPEED_10GB_FULL:
+   ifmr->ifm_active |= IFM_10G_LR | IFM_FDX;
+   break;
case IXGBE_LINK_SPEED_1GB_FULL:
-   switch (sc->optics) {
-   case IFM_10G_SR: /* multi-speed fiber */
-   ifmr->ifm_active |= IFM_1000_SX | IFM_FDX;
-   break;
-   case IFM_10G_LR: /* multi-speed fiber */
-   ifmr->ifm_active |= IFM_1000_LX | IFM_FDX;
-   break;
-   default:
-   ifmr->ifm_active |= sc->optics | IFM_FDX;
-   break;
-   }
+   ifmr->ifm_active |= IFM_1000_LX | IFM_FDX;
break;
+   }
+   if (layer & IXGBE_PHYSICAL_LAYER_10GBASE_LRM)
+   switch (sc->link_speed) {
case IXGBE_LINK_SPEED_10GB_FULL:
-   ifmr->ifm_active |= sc->optics | IFM_FDX;
+   ifmr->ifm_active |= IFM_10G_LRM | IFM_FDX;
+   break;
+   case IXGBE_LINK_SPEED_1GB_FULL:
+   ifmr->ifm_active |= IFM_1000_LX | IFM_FDX;
break;
}
-
-   switch (sc->hw.fc.current_mode) {
-   case ixgbe_fc_tx_pause:
-   ifmr->ifm_active |= IFM_FLOW | IFM_ETH_TXPAUSE;
+   if (layer & IXGBE_PHYSICAL_LAYER_10GBASE_SR ||
+   layer & IXGBE_PHYSICAL_LAYER_1000BASE_SX)
+   switch (sc->link_speed) {
+   case IXGBE_LINK_SPEED_10GB_FULL:
+   ifmr->ifm_active |= IFM_10G_SR | IFM_FDX;
+   break;
+   case IXGBE_LINK_SPEED_1GB_FULL:
+   ifmr->ifm_active |= IFM_1000_SX | IFM_FDX;
break;
-   case ixgbe_fc_rx_pause:
-   ifmr->ifm_active |= IFM_FLOW | IFM_ETH_RXPAUSE;
+   }
+   if (layer & IXGBE_PHYSICAL_LAYER_10GBASE_CX4)
+   switch (sc->link_speed) {
+   case IXGBE_LINK_SPEED_10GB_FULL:
+   ifmr->ifm_active |= IFM_10G_CX4 | IFM_FDX;
break;
-   case ixgbe_fc_full:
-   ifmr->ifm_active |= IFM_FLOW | 

hfsc_deferred race

2017-08-15 Thread Mike Belopuhov
Hi,

I've just triggered an assert in hfsc_deferred (a callout) on an
MP kernel running on an SP virtual machine:

  panic: kernel diagnostic assertion "HFSC_ENABLED(ifq)" failed: file "/home/mike/src/openbsd/sys/net/hfsc.c", line 950
  Stopped at  db_enter+0x9:   leave
  TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
  *247463  28420  0 0x3  00  pfctl
  db_enter() at db_enter+0x9
  
  panic(817f78f0,4,81a3ffc0,8110c140,800c2060,ffff81598b1c) at panic+0x102
  __assert(81769d93,817d7350,3b6,817d72bd) at __assert+0x35
  hfsc_deferred(800c2060) at hfsc_deferred+0x9e
  timeout_run(8004adc8) at timeout_run+0x4c
  softclock(0) at softclock+0x146
  softintr_dispatch(0) at softintr_dispatch+0x9f
  Xsoftclock() at Xsoftclock+0x1f
  --- interrupt ---
  end of kernel
  end trace frame: 0x728d481974c08548, count: 7
  0x2cfe9c031c9:
  https://www.openbsd.org/ddb.html describes the minimum info required in bug
  reports.  Insufficient info makes it difficult to find and fix bugs.
  ddb{0}> ps
 PID TID   PPIDUID  S   FLAGS  WAIT  COMMAND
  *28420  247463   5000  0  7 0x3pfctl


pfctl runs in a loop reloading the ruleset.  So at some point we
disable HFSC on the interface but lose a race with hfsc_deferred
before re-enabling it.

IFQ has a mechanism to lock the underlying object and I believe this
is the right tool for this job.  Any other ideas?

I don't think it's a good idea to hold the mutex (ifq_q_enter and
ifq_q_leave effectively lock and unlock it) during the ifq_start,
so we have to make a concession and run the ifq_start before knowing
whether or not HFSC is attached.  IMO, it's a small price to pay to
avoid clutter.  Kernel lock assertion is pointless at this point.
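
The enter/leave pattern can be sketched in userland roughly as follows.
This is only a model under stated assumptions: struct queue_sketch,
q_enter(), q_leave() and deferred() are hypothetical names, a pthread
mutex stands in for the ifq mutex, and a counter stands in for
timeout_add().  It is not the ifq implementation.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical model of a queue with a pluggable discipline. */
struct queue_sketch {
	pthread_mutex_t	 mtx;
	const void	*ops;	/* identifies the attached discipline */
	void		*data;	/* discipline state, NULL when detached */
};

/*
 * Lock the queue and return the discipline state, but only if `ops` is
 * still the attached discipline.  On success the lock stays held and the
 * caller must call q_leave(); on failure the lock is already released.
 */
static void *
q_enter(struct queue_sketch *q, const void *ops)
{
	pthread_mutex_lock(&q->mtx);
	if (q->ops == ops && q->data != NULL)
		return q->data;
	pthread_mutex_unlock(&q->mtx);
	return NULL;
}

static void
q_leave(struct queue_sketch *q, void *data)
{
	(void)data;
	pthread_mutex_unlock(&q->mtx);
}

/* Deferred callback: rearm only while the discipline is still attached. */
static int
deferred(struct queue_sketch *q, const void *ops, int *rearmed)
{
	void *state = q_enter(q, ops);

	if (state == NULL)
		return 0;	/* lost the race with detach: do nothing */
	(*rearmed)++;		/* stands in for timeout_add() */
	q_leave(q, state);
	return 1;
}
```

The point of the pattern is that the check for "is this discipline still
attached" and the use of its state happen under the same lock, so a
concurrent detach cannot slip in between them.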

OK?

diff --git sys/net/hfsc.c sys/net/hfsc.c
index 410bea733c6..3c5b6f6ef78 100644
--- sys/net/hfsc.c
+++ sys/net/hfsc.c
@@ -944,20 +944,19 @@ hfsc_deferred(void *arg)
 {
struct ifnet *ifp = arg;
struct ifqueue *ifq = &ifp->if_snd;
struct hfsc_if *hif;
 
-   KERNEL_ASSERT_LOCKED();
-   KASSERT(HFSC_ENABLED(ifq));
-
if (!ifq_empty(ifq))
ifq_start(ifq);
 
-   hif = ifq->ifq_q;
-
+   hif = ifq_q_enter(&ifp->if_snd, ifq_hfsc_ops);
+   if (hif == NULL)
+   return;
/* XXX HRTIMER nearest virtual/fit time is likely less than 1/HZ. */
timeout_add(&hif->hif_defer, 1);
+   ifq_q_leave(&ifp->if_snd, hif);
 }
 
 void
 hfsc_cl_purge(struct hfsc_if *hif, struct hfsc_class *cl, struct mbuf_list *ml)
 {



CID 1452909: Use of untrusted scalar value (pf_table.c)

2017-08-15 Thread Mike Belopuhov
Hi,

Coverity has discovered that we're blindly trusting the value
of pfra_type that we read from the userland supplied pfr_addr
and use it to index an array of pools in pfr_create_kentry.

I suggest doing two things: add a check in pfr_validate_addr
that is called after every copyin and also perform the check
in pfr_create_kentry before we attempt to use the value.
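
As a standalone illustration of "validate at the boundary, re-check at
the point of use", here is a minimal sketch.  The enum, struct addr_req,
entry_sizes[] and both functions are hypothetical and only mirror the
shape of pfra_type, PFRKE_MAX and the pool array; returning 0 from
entry_size() is a userland substitute for the panic in the diff below.

```c
#include <stddef.h>

/* Hypothetical entry types; ENTRY_MAX plays the role of PFRKE_MAX. */
enum { ENTRY_PLAIN, ENTRY_ROUTE, ENTRY_COST, ENTRY_MAX };

struct addr_req {
	unsigned char	type;	/* supplied by userland, untrusted */
};

/* Per-type table indexed by the untrusted field (sizes are made up). */
static const size_t entry_sizes[ENTRY_MAX] = { 16, 32, 48 };

/* Called right after copying the request in: 0 if sane, -1 otherwise. */
static int
validate_req(const struct addr_req *r)
{
	if (r->type >= ENTRY_MAX)
		return -1;
	return 0;
}

/*
 * Re-check where the value is used as an index, so that a caller that
 * forgot to validate still cannot read past the end of the table.
 */
static size_t
entry_size(const struct addr_req *r)
{
	if (r->type >= ENTRY_MAX)
		return 0;
	return entry_sizes[r->type];
}
```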

OK?

P.S.
What does the 'k' prefix on table and entry names stand for in
pf_table.c?  Kernel?

diff --git sys/net/pf_table.c sys/net/pf_table.c
index 7666ec7013c..985c673b5cb 100644
--- sys/net/pf_table.c
+++ sys/net/pf_table.c
@@ -741,10 +741,12 @@ pfr_validate_addr(struct pfr_addr *ad)
return (-1);
if (ad->pfra_not && ad->pfra_not != 1)
return (-1);
if (ad->pfra_fback)
return (-1);
+   if (ad->pfra_type >= PFRKE_MAX)
+   return (-1);
return (0);
 }
 
 void
 pfr_enqueue_addrs(struct pfr_ktable *kt, struct pfr_kentryworkq *workq,
@@ -820,10 +822,13 @@ pfr_lookup_addr(struct pfr_ktable *kt, struct pfr_addr *ad, int exact)
 struct pfr_kentry *
 pfr_create_kentry(struct pfr_addr *ad)
 {
struct pfr_kentry_all   *ke;
 
+   if (ad->pfra_type >= PFRKE_MAX)
+   panic("unknown pfra_type %d", ad->pfra_type);
+
ke = pool_get(&pfr_kentry_pl[ad->pfra_type], PR_NOWAIT | PR_ZERO);
if (ke == NULL)
return (NULL);
 
ke->pfrke_type = ad->pfra_type;
@@ -842,13 +847,10 @@ pfr_create_kentry(struct pfr_addr *ad)
if (ad->pfra_ifname[0])
ke->pfrke_rkif = pfi_kif_get(ad->pfra_ifname);
if (ke->pfrke_rkif)
pfi_kif_ref(ke->pfrke_rkif, PFI_KIF_REF_ROUTE);
break;
-   default:
-   panic("unknown pfrke_type %d", ke->pfrke_type);
-   break;
}
 
switch (ad->pfra_af) {
case AF_INET:
FILLIN_SIN(ke->pfrke_sa.sin, ad->pfra_ip4addr);



Re: Improve the accuracy of the TSC frequency calibration (Was: Calculate the frequency of the tsc timecounter)

2017-08-08 Thread Mike Belopuhov
On Tue, Aug 08, 2017 at 08:18 +0800, Adam Steen wrote:
> On Mon, Jul 31, 2017 at 3:58 PM, Mike Belopuhov <m...@belopuhov.com> wrote:
> > On Mon, Jul 31, 2017 at 09:48 +0800, Adam Steen wrote:
> >> Ted Unangst  wrote:
> >> > we don't currently export this info, but we could add some sysctls.
> >> > there's some cpufeatures stuff there, but generally stuff isn't exported
> >> > until somebody finds a use for it... it shouldn't be too hard to add
> >> > something to amd64/machdep.c sysctl if you're interested.
> >>
> >> I am interested, as i need the info, i will look into it and hopefully
> >> come back with a patch.
> >
> > This is a bad idea because TSC as the time source is only usable
> > by OpenBSD on Skylake and Kaby Lake CPUs since they encode the TSC
> > frequency in the CPUID. All older CPUs have their TSCs measured
> > against the PIT. Currently the measurement done by the kernel isn't
> > very precise and if TSC is selected as a timecounter, the machine
> > would be gaining time on a pace that cannot be corrected by our NTP
> > daemon. (IIRC, about an hour a day on my Haswell running with NTP).
> >
> > To be able to use TSC as a timecounter source on OpenBSD or Solo5
> > you'd have to improve the in-kernel measurement of the TSC frequency
> > first. I've tried to perform 10 measurements and take an average and
> > it does improve accuracy, however I believe we need to poach another
> > bit from Linux and re-calibrate TSC via HPET:
> >
> >  
> > http://elixir.free-electrons.com/linux/v4.12.4/source/arch/x86/kernel/tsc.c#L409
> >
> > I think this is the most sane thing we can do. Here's a complete
> > procedure that Linux kernel undertakes:
> >
> >  
> > http://elixir.free-electrons.com/linux/v4.12.4/source/arch/x86/kernel/tsc.c#L751
> >
> > Regards,
> > Mike
> 
> Hi Mike/All
> 
> I would like to improve the accuracy of TSC frequency calibration as
> Mike B. describes above.
> 
> I initially thought the calibration would take place at line 470 of
> amd64/identcpu.c
> (https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/arch/amd64/amd64/identcpu.c?annotate=1.87)
>

Indeed, it cannot happen there simply because you don't know at
that point whether or not HPET actually exists.

> But I looked into using the acpihpet directly but it is never exposed
> outside of acpihpet.c.
>

And it shouldn't be.

> Could someone point me to where it would be appropriate to complete
> this calibration and how to use the acpihpet?

The way I envision this is a multi-step approach:

1) TSC frequency is approximated with the PIT (possibly performing
multiple measurements and averaging them out; also keep in mind that
doing it 8 times means you can shift the sum right by 3 instead of
using actual integer division).  This is what should happen around
line 470 of identcpu.c.

2) A function can be provided by identcpu.c to further adjust the
TSC frequency once acpitimer(4) (this is a PM timer) and acpihpet(4)
(or any other timer for that matter) are attached.

3) Once acpitimer(4) or acpihpet(4) or any other timecounter source
are attached and are verified to be operating correctly, they can
perform TSC re-calibration and update the TSC frequency with their
measurements.  The idea here is that the function (or functions) that
facilitate this must abstract enough logic so that you don't have to
duplicate it in the acpitimer or acpihpet themselves.
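
The arithmetic behind steps 1 and 3 can be sketched like this.  It is a
userland illustration under assumptions, not kernel code: average8() and
tsc_freq_from_ref() are hypothetical names, and the second function
assumes the measurement interval is short enough that tsc_delta * ref_hz
fits in 64 bits.

```c
#include <stdint.h>

#define NSAMPLES	8	/* power of two, so the average is a shift by 3 */

/* Step 1: average eight frequency measurements without a division. */
static uint64_t
average8(const uint64_t samples[NSAMPLES])
{
	uint64_t sum = 0;
	int i;

	for (i = 0; i < NSAMPLES; i++)
		sum += samples[i];
	return sum >> 3;	/* same as sum / 8 for unsigned values */
}

/*
 * Step 3: derive the TSC frequency from a reference counter of known
 * frequency.  Both counters are sampled at the start and end of the same
 * interval, which lasts ref_delta / ref_hz seconds, so
 *
 *	tsc_hz = tsc_delta * ref_hz / ref_delta
 *
 * Returns 0 if the reference counter did not advance.
 */
static uint64_t
tsc_freq_from_ref(uint64_t tsc_delta, uint64_t ref_delta, uint64_t ref_hz)
{
	if (ref_delta == 0)
		return 0;
	return tsc_delta * ref_hz / ref_delta;
}
```

For example, 25000000 TSC ticks observed over 10000 ticks of a 1 MHz
reference (10 ms) work out to 2500000000 Hz.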

> (Will it need to be
> exposed like i8254_delay/delay_func/delay in machdep.c and cpu.h)
>

No it won't.

> Lastly should the calibration be done using both delay(i8254 pit) and
> hpet timers similar to Linux described above or just using the hpet?
>

Well, that's what I was arguing for.  As I said in my initial mail
on misc (not quoted here), the TSC must be calibrated using separate
known clock sources.



Re: Document hostctl commands for XenServer

2017-07-21 Thread Mike Belopuhov
On Fri, Jul 21, 2017 at 17:49 +0200, Erik van Westen wrote:
> Hi Ingo,
> 
> Op 21-7-2017 om 16:20 schreef Ingo Schwarze:
> > Hi Mike,
> >
> > Mike Belopuhov wrote on Fri, Jul 21, 2017 at 03:14:08PM +0200:
> >
> >> Together with Maxim Khitrov we have figured out what needs to
> >> be set for XenServer
> > If XenServer were free software, i would say that the OpenBSD
> > operating system should detect whether it is running under
> > XenServer and then do all this automatically, by default, but ...
> >
> >> (that's a Citrix product based on Xen) to "recognize"
> >> the OpenBSD VM and let it do things like reboot and so on.
> >  ... that sounds like XenServer is a commercial product, so maybe
> > we don't want to bloat OpenBSD with specific code targeting it,
> > certainly not with large amounts of code.  If it can automatically
> > be done in just a handful of lines of code, i don't feel strongly
> > either way, even if XenServer is commercial.  I mean, other
> > virtual machine hosts are completely commercial in the first
> > place (unlike Xen) and we have large amounts of code in the kernel
> > specifically targeting them (unless i misunderstand).
> 
> A quick check makes sure that XenServer is open source and not commercial.
> http://xenserver.org.
>

Right, thanks for the correction. I apologize for being too vague
and misleading Ingo, it wasn't intentional.

> [quote]
> 
> 
>   ABOUT XENSERVER
> 
> XenServer is the leading open source virtualization platform, powered by
> the Xen Project hypervisor
> <http://xenproject.org/developers/teams/hypervisor.html> and the XAPI
> toolstack <http://xenproject.org/developers/teams/xapi.html>. It is used
> in the world's largest clouds and enterprises.
>  
> Commercial support for XenServer
> <https://xenserver.org/get-support.html> is available from Citrix.
> [/quote]
> 
> [snip]
> 
> Maybe that helps.
> 
> Regards,
> 
> Erik



Re: Document hostctl commands for XenServer

2017-07-21 Thread Mike Belopuhov
On Fri, Jul 21, 2017 at 19:58 +0200, Ingo Schwarze wrote:
> Hi Mike,
> 
> Mike Belopuhov wrote on Fri, Jul 21, 2017 at 07:08:06PM +0200:
> 
> > Thanks for the detailed response, I share your outlook and in
> > this case it is better to keep this stuff in the userland since
> > we actually can do it just fine.
> 
> Fair enough!
> 
> > I will however add some text to the
> > hostctl(8) man page to highlight the fact that multiple pvbus
> > devices may be present on a single VM.
> 
> Makes sense to me.
> 
> > OK?
> 
> Reads well to me.
> 
> Optionally, consider the two following nits:
> 
>  1. It might make sense to mention in half a sentence why it is
> useful to set these properties, probably up front like in
> 
> When running under XenServer, to let the host know that the guest
> has finished initializing and to allow graceful restarts (or
> whatever), set the following XenStore properties with
> .Xr hostctl 8
> in
> .Xr rc.local 8 :
>

This sounds good, thanks.

> or (less convincing to me) after your existing text like so:
> 
> Without these settings, graceful restarts (or whatever)
> may not work.
> 
> In any case, you want "in rc.local", not "in the rc.local".
> 
>  2. Magic number alarm:
> 
> # XenServer Tools version
> hostctl attr/PVAddons/MajorVersion 6
> hostctl attr/PVAddons/MinorVersion 2
> hostctl attr/PVAddons/MicroVersion 0
> hostctl attr/PVAddons/BuildVersion 76888
> hostctl attr/PVAddons/Installed 1
> 
> With the changed content, this reads a bit like a HOWTO:
> Type these commands, but i won't tell you what they do.
>
> If these numbers are completely fake and irrelevant,
> then saying so in one short sentence - or even in the
> comment line above - may be sufficient.
>
> But this quote from Maxim fuels doubts:
> 
> :: I don't know whether XenServer actually cares about what
> :: version is reported, but if it does, this would be tied
> :: to features supported by xen, xbf, and xnf drivers.
> :: You typically update the tools with each new XenServer
> :: release, which gives you most recent disk and
> :: network drivers, at least on Windows.
> 
> If that is true, then you have just built a time bomb:
> YOU need to remember to regularly update the manual.
> USERS need to remember to regularly update their rc.local.

Yes and yes, however, these interfaces don't change often so
there's no need to constantly update them.  We can leave this
as a hint and an exercise for the reader.  I've changed a
comment there saying that this is a version of "XenServer Tools".
XenServer users must be able to comprehend what this means.

> If the latter is true, this may need to be explained
> in this manual page.  Of course, that can be figured
> out and improved later, if needed.  Including a change
> to explain where to get the right magic numbers from
> rather than simply advertising some numbers that may be
> good today, but not tomorrow.
> 
> Yours,
>   Ingo



Create pvbus1 node in addition to pvbus0

2017-07-21 Thread Mike Belopuhov
As suggested by deraadt@ it's better to have /dev/pvbus1 around
than to document the need to create it under certain circumstances,
namely when multiple virtualization interfaces are available.

The diff has survived a make build. I'm not sure the adjustment
is done perfectly, but I've tried to preserve the length of the
line and while there's now empty space, further additions will
certainly rectify that.  OK?

Index: ./etc/MAKEDEV.common
===
RCS file: /home/cvs/src/etc/MAKEDEV.common,v
retrieving revision 1.94
diff -u -p -r1.94 MAKEDEV.common
--- ./etc/MAKEDEV.common	11 Sep 2016 19:59:51 -  1.94
+++ ./etc/MAKEDEV.common	21 Jul 2017 13:20:57 -
@@ -167,7 +167,7 @@ target(all, hotplug)dnl
 target(all, pppx)dnl
 target(all, fuse)dnl
 target(all, vmm)dnl
-target(all, pvbus, 0)dnl
+target(all, pvbus, 0, 1)dnl
 target(all, bpf)dnl
 dnl
 _mkdev(all, {-all-}, {-dnl
Index: ./etc/etc.amd64/MAKEDEV
===
RCS file: /home/cvs/src/etc/etc.amd64/MAKEDEV,v
retrieving revision 1.115
diff -u -p -r1.115 MAKEDEV
--- ./etc/etc.amd64/MAKEDEV 11 Sep 2016 19:59:57 -  1.115
+++ ./etc/etc.amd64/MAKEDEV 21 Jul 2017 16:06:36 -
@@ -571,7 +571,8 @@ all)
R sd5 sd6 sd7 sd8 sd9 cd0 cd1 rd0 tap0 tap1 tap2 tap3 tun0
R tun1 tun2 tun3 bio pty0 fd1 fd1B fd1C fd1D fd1E fd1F fd1G
R fd1H fd0 fd0B fd0C fd0D fd0E fd0F fd0G fd0H diskmap vscsi0
-   R ch0 audio0 audio1 audio2 bpf pvbus0 vmm fuse pppx hotplug
+   R ch0 audio0 audio1 audio2 bpf pvbus0 pvbus1 vmm fuse pppx
+   R hotplug
R ptm gpr0 local wscons pci0 pci1 pci2 pci3 uall rmidi0 rmidi1
R rmidi2 rmidi3 rmidi4 rmidi5 rmidi6 rmidi7 tuner0 radio0
R speaker video0 video1 uk0 random lpa0 lpa1 lpa2 lpt0 lpt1
Index: ./etc/etc.i386/MAKEDEV
===
RCS file: /home/cvs/src/etc/etc.i386/MAKEDEV,v
retrieving revision 1.255
diff -u -p -r1.255 MAKEDEV
--- ./etc/etc.i386/MAKEDEV  11 Sep 2016 19:59:57 -  1.255
+++ ./etc/etc.i386/MAKEDEV  21 Jul 2017 15:50:00 -
@@ -575,8 +575,9 @@ all)
R sd1 sd2 sd3 sd4 sd5 sd6 sd7 sd8 sd9 cd0 cd1 rd0 tap0 tap1
R tap2 tap3 tun0 tun1 tun2 tun3 bio pty0 fd1 fd1B fd1C fd1D
R fd1E fd1F fd1G fd1H fd0 fd0B fd0C fd0D fd0E fd0F fd0G fd0H
-   R diskmap vscsi0 ch0 audio0 audio1 audio2 bpf pvbus0 vmm fuse
-   R pppx hotplug ptm gpr0 local wscons pci0 pci1 pci2 pci3 uall
+   R diskmap vscsi0 ch0 audio0 audio1 audio2 bpf pvbus0 pvbus1
+   R vmm fuse pppx hotplug ptm gpr0 local wscons
+   R pci0 pci1 pci2 pci3 uall
R rmidi0 rmidi1 rmidi2 rmidi3 rmidi4 rmidi5 rmidi6 rmidi7
R tuner0 radio0 speaker video0 video1 uk0 random joy0 joy1
R lpa0 lpa1 lpa2 lpt0 lpt1 lpt2 tty00 tty01 tty02 tty03 tty04



Re: Document hostctl commands for XenServer

2017-07-21 Thread Mike Belopuhov
On Fri, Jul 21, 2017 at 10:28 -0400, Maxim Khitrov wrote:
> On Fri, Jul 21, 2017 at 9:14 AM, Mike Belopuhov <m...@belopuhov.com> wrote:
> > Hi,
> >
> > Together with Maxim Khitrov we have figured out what needs to
> > be set for XenServer (that's a Citrix product based on Xen) to
> > "recognize" the OpenBSD VM and let it do things like reboot and
> > so on.
> >
> > I'd like to get this documented in the xen(4) man page instead
> > of referring users to mailing list archives.
> >
> > There are two things that we can mention:
> >
> > 1) viridian capability, that XenServer comes with enabled by
> >default, interferes with hostctl: you need to either disable
> >it for your VM (if you have access) or MAKEDEV /dev/pvbus1
> >and use that with hostctl(8).
> >
> > 2) to let XenServer management software know that OpenBSD is
> >there in full glory we need to set a few XenStore properties
> >with hostctl(8).  User needs to do this on every boot, so
> >putting them somewhere around /etc/rc.local is necessary.
> >
> > I've come up with the diff below.  Please let me know if this
> > makes sense and if we can improve it.
> >
> > Maxim, can you please double check the script itself.  Are all
> > these values necessary?  I've changed a few things including
> > the BuildVersion value.
> >
> > Thanks.
> 
> Hi Mike,
> 
> attr/PVAddons/* contains version information about XenServer Tools
> installed in the VM, not the OS. Dinar copied the original values from
> tools v6.2. I don't know whether XenServer actually cares about what
> version is reported, but if it does, this would be tied to features
> supported by xen, xbf, and xnf drivers. You typically update the tools
> with each new XenServer release, which gives you most recent disk and
> network drivers, at least on Windows.
> 
> Only attr/PVAddons/{MajorVersion,MinorVersion} and data/updated are
> required to get graceful reboot and shutdown support in XenCenter, but
> I would leave the rest in there to avoid confusing any tools that
> might expect to find the other keys as well.
> data/{os_name,os_uname,os_distro} provide OS information, which is
> shown in VM properties (not required, but useful to have). Setting
> data/updated to 1 triggers a refresh of this information in
> XenStore/XenCenter.
> 
> -Max

Ok, thanks for clarifying this, I've updated the patch.



Re: Document hostctl commands for XenServer

2017-07-21 Thread Mike Belopuhov
On Fri, Jul 21, 2017 at 16:20 +0200, Ingo Schwarze wrote:
> Hi Mike,
> 
> Mike Belopuhov wrote on Fri, Jul 21, 2017 at 03:14:08PM +0200:
> 
> > Together with Maxim Khitrov we have figured out what needs to
> > be set for XenServer
> 
> If XenServer were free software, i would say that the OpenBSD
> operating system should detect whether it is running under
> XenServer and then do all this automatically, by default, but ...
> 
> > (that's a Citrix product based on Xen) to "recognize"
> > the OpenBSD VM and let it do things like reboot and so on.
> 
>  ... that sounds like XenServer is a commercial product, so maybe
> we don't want to bloat OpenBSD with specific code targeting it,
> certainly not with large amounts of code.  If it can automatically
> be done in just a handful of lines of code, i don't feel strongly
> either way, even if XenServer is commercial.  I mean, other
> virtual machine hosts are completely commercial in the first
> place (unlike Xen) and we have large amounts of code in the kernel
> specifically targeting them (unless i misunderstand).
> 
> > I'd like to get this documented in the xen(4) man page instead
> > of referring users to mailing list archives.
> 
> Sure, if you decide to not do this automatically by default,
> then documenting it in a manual page is a good idea, and xen(4)
> seems like the logical place to me - FWTW, i know nothing about Xen,
> pvbus(4), or hostctl(8).
>

Thanks for the detailed response.  I share your outlook, and in
this case it is better to keep this stuff in userland since we
can actually do it there just fine.  As Maxim has pointed out, we
have to fake a version of XenServer Tools in addition to reporting
our own, and I certainly don't want to do that in the kernel.  Reyk
has mentioned that this configuration can be performed automatically
by a provisioning tool like cloud-agent.
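
The rc.local fragment proposed in this thread derives the reported major and minor numbers from kern.osrelease with shell parameter expansion.  The same split can be sketched in C, assuming the release string always has the form "MAJOR.MINOR" (for example "6.1"); split_osrelease is a hypothetical helper, not an existing function.

```c
#include <stdio.h>

/*
 * Split a "MAJOR.MINOR" release string into two integers.
 * Returns 0 on success and -1 if the string has any other shape
 * (missing minor part, trailing garbage, empty string).
 */
static int
split_osrelease(const char *osrelease, int *major, int *minor)
{
	char trailing;

	/* Exactly two conversions must succeed; a third means junk. */
	if (sscanf(osrelease, "%d.%d%c", major, minor, &trailing) != 2)
		return -1;
	return 0;
}
```

Rejecting anything that is not exactly two numbers keeps a malformed string from ending up in XenStore as a bogus version.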

> 
> > Index: xen.4
> > ===
> > RCS file: /home/cvs/src/share/man/man4/xen.4,v
> > retrieving revision 1.1
> > diff -u -p -r1.1 xen.4
> > --- xen.4   9 Dec 2015 00:26:39 -   1.1
> > +++ xen.4   21 Jul 2017 13:00:52 -
> > @@ -28,6 +28,51 @@ driver performs HVM domU guest initializ
> >  virtual Xen interrupts, access to the XenStore configuration storage as
> >  well as a device probing facility for paravirtualized devices such as
> >  disk and network interfaces.
> > +.Sh CAVEATS
> 
> I don't object to putting this into CAVEATS because XenServer does
> seem to be going out of its way in order to set up plenty of traps
> for the user.  On the other hand, putting it at the end of DESCRIPTION
> would seem even more logical to me, because it is just a description
> of how to use XenServer.  If DESCRIPTION seems too prominent to you
> for a blurb about one specific commercial product, EXAMPLES would
> also seem more logical to me than CAVEATS - anyway, your call.
> 
> If you decide to use CAVEATS, it has to go at the very end,
> after AUTHORS.  EXAMPLES would have to go between DESCRIPTION
> and SEE ALSO.
>

DESCRIPTION is fine.

> > +When running under XenServer, it's useful to let it know that the guest
> > +has finished initializing by setting a few XenStore properties with
> > +.Xr hostctl 8
> > +in the
> > +.Pa /etc/rc.local
> 
> Make that line
> 
>   .Xr rc.local 8
>

Done.

> > +Please note, that XenStore is capable of advertising a Hyper-V compatibility
> 
> No comma needed here.
> 
> > +layer called
> > +.Dq Viridian
> > +that may require an additional
> > +.Xr pvbus 4
> > +device node to be created in
> > +.Pa /dev
> > +with
> > +.Xr MAKEDEV 8
> > +and all aforementioned invocations of
> > +.Xr hostctl 8
> > +to be amended with an
> > +.Fl f Ar /dev/pvbus1
> 
> Make that line
> 
>   .Fl f Pa /dev/pvbus1
> 
> > +command line option.
> > +Viridian can also be disabled in the virtual machine configuration.
> 
> Maybe
> 
>   Alternatively, Viridian can be disabled in the virtual machine
>   configuration.
> 
> because doing both ("also") does not seem to make much sense to me,
> but of course i'm not sure.
> 
> Yours,
>   Ingo

Thanks for the suggestions, they're all good.  In the meantime Theo
has suggested creating /dev/pvbus1 by default, and I'm going to
send a diff for that in a moment.  This means that I'm removing
this paragraph completely.  I will, however, add some text to the
hostctl(8) man page to highlight the fact that multiple pvbus
devices may be present on a single VM.

OK?

Index: xen.4
===
RCS file: /hom

Document hostctl commands for XenServer

2017-07-21 Thread Mike Belopuhov
Hi,

Together with Maxim Khitrov we have figured out what needs to
be set for XenServer (that's a Citrix product based on Xen) to
"recognize" the OpenBSD VM and let it do things like reboot and
so on.

I'd like to get this documented in the xen(4) man page instead
of referring users to mailing list archives.

There are two things that we can mention:

1) viridian capability, that XenServer comes with enabled by
   default, interferes with hostctl: you need to either disable
   it for your VM (if you have access) or MAKEDEV /dev/pvbus1
   and use that with hostctl(8).

2) to let XenServer management software know that OpenBSD is
   there in full glory we need to set a few XenStore properties
   with hostctl(8).  User needs to do this on every boot, so
   putting them somewhere around /etc/rc.local is necessary.

I've come up with the diff below.  Please let me know if this
makes sense and if we can improve it.

Maxim, can you please double check the script itself.  Are all
these values necessary?  I've changed a few things including
the BuildVersion value.

Thanks.

- Forwarded message from Maxim Khitrov <m...@mxcrypt.com> -

Date: Mon, 17 Jul 2017 17:07:02 -0400
From: Maxim Khitrov <m...@mxcrypt.com>
To: Mike Belopuhov <m...@belopuhov.com>
Cc: Dinar Talypov <t.dina...@gmail.com>, tech@openbsd.org
Subject: Re: [patch] fake pv drivers installation on xen

On Mon, Jul 17, 2017 at 3:40 PM, Mike Belopuhov <m...@belopuhov.com> wrote:
> On Mon, Jul 17, 2017 at 14:32 -0400, Maxim Khitrov wrote:
>> On Wed, Jan 18, 2017 at 2:16 PM, Dinar Talypov <t.dina...@gmail.com> wrote:
>> > I use Xenserver 7.0 with xencenter management console.
>> > without it doesn't allow shutdown or reboot.
>> > Anyway I'll try with hostctl.
>> >
>> > Thanks.
>>
>> Were you able to get this working with hostctl? I'm running OpenBSD
>> 6.1 amd64 on XenServer 7.0. When I run any hostctl command, such as
>> `hostctl device/vif/0/mac`, I get the following error:
>>
>> hostctl: ioctl: Device not configured
>>
>> During boot, I see these messages:
>>
>> pvbus0 at mainbus0: Hyper-V 0.0, Xen 4.6
>> xen0 at pvbus0: features 0x2705, 32 grant table frames, event channel 3
>> xbf0 at xen0 backend 0 channel 8: disk
>>
>
> You need to disable viridian compatibility in your Xenserver.
>
>> Running `hostctl -t` returns "/dev/pvbus0: Hyper-V"
>>
>
> That's because Xenserver announces Hyper-V compatibility layer
> (called viridian) before Xen for whatever reason.  You need to
> do "cd /dev; ./MAKEDEV pvbus1" and then use "hostctl -f /dev/pvbus1"
> with your commands (I assume -- never tried a Xenserver myself).
>
>> Any tips on getting hostctl to work?
>
> See above.  The easiest is probably just to disable viridian :)

Disabling viridian worked, thanks! For anyone else interested in doing
this, run the following command on your XenServer host:

xe vm-param-set uuid= platform:viridian=false

After that, you can add these commands to /etc/rc.local:

ostype=$(sysctl -n kern.ostype)
osrelease=$(sysctl -n kern.osrelease)

# PV driver version
hostctl attr/PVAddons/MajorVersion 6
hostctl attr/PVAddons/MinorVersion 2
hostctl attr/PVAddons/MicroVersion 0
hostctl attr/PVAddons/BuildVersion 76888
hostctl attr/PVAddons/Installed 1

# OS version
hostctl data/os_name "$ostype $osrelease"
hostctl data/os_uname $osrelease
hostctl data/os_distro $ostype

# Update XenStore
hostctl data/updated 1

-Max

- End forwarded message -


Index: xen.4
===
RCS file: /home/cvs/src/share/man/man4/xen.4,v
retrieving revision 1.1
diff -u -p -r1.1 xen.4
--- xen.4   9 Dec 2015 00:26:39 -   1.1
+++ xen.4   21 Jul 2017 13:00:52 -
@@ -28,6 +28,51 @@ driver performs HVM domU guest initializ
 virtual Xen interrupts, access to the XenStore configuration storage as
 well as a device probing facility for paravirtualized devices such as
 disk and network interfaces.
+.Sh CAVEATS
+When running under XenServer, it's useful to let it know that the guest
+has finished initializing by setting a few XenStore properties with
+.Xr hostctl 8
+in the
+.Pa /etc/rc.local
+script.
+.Bd -literal -offset indent
+ostype=$(sysctl -n kern.ostype)
+osrelease=$(sysctl -n kern.osrelease)
+osbuild=$(sysctl -n kern.osversion)
+osvermaj=${osrelease%\.*}
+osvermin=${osrelease#*\.}
+
+# PV driver version
+hostctl attr/PVAddons/MajorVersion $osvermaj
+hostctl attr/PVAddons/MinorVersion $osvermin
+hostctl attr/PVAddons/MicroVersion 0
+hostctl attr/PVAddons/BuildVersion $osbuild
+hostctl attr/PVAddons/Installed 1
+
+# OS version
+hostctl data/os_name "$ostype $osrelease"
+hostctl data/os_uname $osrelease
+hostctl data/os_distro $ostype
+
+# Updat

Re: urndis issues

2017-07-18 Thread Mike Belopuhov
On Thu, Jul 13, 2017 at 14:04 +0200, Mike Belopuhov wrote:
> On Wed, Jul 12, 2017 at 21:04 +0000, Jonathan Armani wrote:
> > Hi,
> > 
> > Thanks I was cooking the same diff.
> > 
> > Ok armani@
> > 
> 
> Hi, thanks! I want to get rid of printfs though and
> return errors (or unhandled status codes) so that we
> don't paper over them. In theory, all error codes that
> I've seen in the ndis.h from the DDK have the MSB set,
> i.e. (rval & 0x80000000) != 0 for those, but doing it
> would be a bit too hackish IMO. However, if we go down
> this road the rval check can be rewritten as:
> 
>   /* Not an error */
>   if ((rval & 0x80000000) == 0)
>           rval = RNDIS_STATUS_SUCCESS;
>   else
>           printf("%s: status 0x%x\n", DEVNAME(sc), rval);
> 
> Is this something we want to do?
> 
> I was also going to add sc_link toggled by those status
> codes and check it in the urndis_start like other USB
> Ethernet drivers do, but decided against it cause this
> would break other devices that don't send this message.
>

Hi,

Since I haven't heard anything from any of you, I'm assuming
that the diff below is the way to go (since that's what Artturi
has tested essentially).  I'll wait another day and check this
in tomorrow (July 19).

Cheers,
Mike

> 
> diff --git sys/dev/usb/if_urndis.c sys/dev/usb/if_urndis.c
> index 4af6b55cf05..956207f73a8 100644
> --- sys/dev/usb/if_urndis.c
> +++ sys/dev/usb/if_urndis.c
> @@ -88,10 +88,12 @@ u_int32_t urndis_ctrl_handle_init(struct urndis_softc *,
>  const struct rndis_comp_hdr *);
>  u_int32_t urndis_ctrl_handle_query(struct urndis_softc *,
>  const struct rndis_comp_hdr *, void **, size_t *);
>  u_int32_t urndis_ctrl_handle_reset(struct urndis_softc *,
>  const struct rndis_comp_hdr *);
> +u_int32_t urndis_ctrl_handle_status(struct urndis_softc *,
> +const struct rndis_comp_hdr *);
>  
>  u_int32_t urndis_ctrl_init(struct urndis_softc *);
>  u_int32_t urndis_ctrl_halt(struct urndis_softc *);
>  u_int32_t urndis_ctrl_query(struct urndis_softc *, u_int32_t, void *, size_t,
>  void **, size_t *);
> @@ -233,10 +235,14 @@ urndis_ctrl_handle(struct urndis_softc *sc, struct rndis_comp_hdr *hdr,
>   case REMOTE_NDIS_KEEPALIVE_CMPLT:
>   case REMOTE_NDIS_SET_CMPLT:
>   rval = letoh32(hdr->rm_status);
>   break;
>  
> + case REMOTE_NDIS_INDICATE_STATUS_MSG:
> + rval = urndis_ctrl_handle_status(sc, hdr);
> + break;
> +
>   default:
>   printf("%s: ctrl message error: unknown event 0x%x\n",
>   DEVNAME(sc), letoh32(hdr->rm_type));
>   rval = RNDIS_STATUS_FAILURE;
>   }
> @@ -400,10 +406,42 @@ urndis_ctrl_handle_reset(struct urndis_softc *sc,
>  
>   return rval;
>  }
>  
>  u_int32_t
> +urndis_ctrl_handle_status(struct urndis_softc *sc,
> +const struct rndis_comp_hdr *hdr)
> +{
> + const struct rndis_status_msg   *msg;
> + u_int32_t                       rval;
> +
> + msg = (struct rndis_status_msg *)hdr;
> +
> + rval = letoh32(msg->rm_status);
> +
> + DPRINTF(("%s: urndis_ctrl_handle_status: len %u status 0x%x "
> + "stbuflen %u\n",
> + DEVNAME(sc),
> + letoh32(msg->rm_len),
> + rval,
> + letoh32(msg->rm_stbuflen)));
> +
> + switch (rval) {
> + case RNDIS_STATUS_MEDIA_CONNECT:
> + case RNDIS_STATUS_MEDIA_DISCONNECT:
> + case RNDIS_STATUS_OFFLOAD_CURRENT_CONFIG:
> + rval = RNDIS_STATUS_SUCCESS;
> + break;
> +
> + default:
> + printf("%s: status 0x%x\n", DEVNAME(sc), rval);
> + }
> +
> + return rval;
> +}
> +
> +u_int32_t
>  urndis_ctrl_init(struct urndis_softc *sc)
>  {
>   struct rndis_init_req   *msg;
>   u_int32_t               rval;
>   struct rndis_comp_hdr   *hdr;



Re: [patch] fake pv drivers installation on xen

2017-07-17 Thread Mike Belopuhov
On Mon, Jul 17, 2017 at 14:32 -0400, Maxim Khitrov wrote:
> On Wed, Jan 18, 2017 at 2:16 PM, Dinar Talypov  wrote:
> > I use Xenserver 7.0 with xencenter management console.
> > without it doesn't allow shutdown or reboot.
> > Anyway I'll try with hostctl.
> >
> > Thanks.
> 
> Were you able to get this working with hostctl? I'm running OpenBSD
> 6.1 amd64 on XenServer 7.0. When I run any hostctl command, such as
> `hostctl device/vif/0/mac`, I get the following error:
> 
> hostctl: ioctl: Device not configured
> 
> During boot, I see these messages:
> 
> pvbus0 at mainbus0: Hyper-V 0.0, Xen 4.6
> xen0 at pvbus0: features 0x2705, 32 grant table frames, event channel 3
> xbf0 at xen0 backend 0 channel 8: disk
>

You need to disable viridian compatibility in your Xenserver.

> Running `hostctl -t` returns "/dev/pvbus0: Hyper-V"
>

That's because Xenserver announces Hyper-V compatibility layer
(called viridian) before Xen for whatever reason.  You need to
do "cd /dev; ./MAKEDEV pvbus1" and then use "hostctl -f /dev/pvbus1"
with your commands (I assume -- never tried a Xenserver myself).

> Any tips on getting hostctl to work?

See above.  The easiest is probably just to disable viridian :)

> Also, do the values persist
> across reboots or do they need to be set via rc.d?
> 

No, they don't persist; they need to be set on every boot.



Re: time(1): use monotonic clock for computing elapsed time

2017-07-13 Thread Mike Belopuhov
On Thu, Jul 13, 2017 at 13:44 +1000, David Gwynne wrote:
> 
> > On 13 Jul 2017, at 11:16 am, Scott Cheloha  wrote:
> > 
> > Hi,
> > 
> > The "real" elapsed time for time(1) and the ksh/csh time builtins is
> > currently computed with gettimeofday(2), so it's subject to changes
> > by adjtime(2) and, if you're really unlucky, clock_settime(2) or
> > settimeofday(2).  In pathological cases you can get negative values
> > in the output.
> > 
> > This seems wrong to me.  I personally use these tools like a stopwatch,
> > and I was surprised to see that the elapsed difference wasn't (more)
> > immune to changes to the system clock.
> > 
> > The attached patches change the "real" listing for time(1), ksh's time
> > builtin, and csh's time builtin to use a monotonic clock, which I think
> > more closely matches what the typical user and programmer expects.  This
> > interpretation is, near as I can tell, also compatible with the POSIX.1
> > 2008 description of the time(1) utility.  In particular, the use of
> > "elapsed," implying a scalar value, makes me think that this is the
> > intended behavior. [1]
> > 
> > NetBSD did this in 2011 without much fanfare, though for some reason they
> > did it for time(1) and csh's builtin but not for ksh's builtin. [2]
> > 
> > I've tested pathological cases in each of the three and these patches
> > correct the result in said cases without (perceptibly) changing the
> > result in the typical case.
> > 
> > Thoughts?  Feedback?
> 
> this makes sense to me, id like to see it go in.
> 

Same here. I'm surprised to learn time(1) is not using CLOCK_MONOTONIC.
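
For anyone following along, the stopwatch style of measurement being proposed looks roughly like this.  It is a minimal sketch and not the actual time(1) patch: ts_sub, elapsed_monotonic and spin are names invented here, and only clock_gettime(2) with CLOCK_MONOTONIC comes from POSIX.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <time.h>

/* Return end - start, normalising the nanosecond field. */
static struct timespec
ts_sub(struct timespec end, struct timespec start)
{
	struct timespec d;

	d.tv_sec = end.tv_sec - start.tv_sec;
	d.tv_nsec = end.tv_nsec - start.tv_nsec;
	if (d.tv_nsec < 0) {
		d.tv_sec--;
		d.tv_nsec += 1000000000L;
	}
	return d;
}

/* Trivial workload so that there is something to time. */
static void
spin(void)
{
	volatile unsigned int i;

	for (i = 0; i < 100000; i++)
		;
}

/*
 * Time fn() with the monotonic clock, so that adjtime(2) or
 * settimeofday(2) calls made in between cannot make the result
 * go backwards.
 */
static struct timespec
elapsed_monotonic(void (*fn)(void))
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	fn();
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return ts_sub(t1, t0);
}
```

Calling elapsed_monotonic(spin) yields a non-negative interval regardless of what happens to the wall clock meanwhile.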



Re: urndis issues

2017-07-13 Thread Mike Belopuhov
On Wed, Jul 12, 2017 at 21:04 +0000, Jonathan Armani wrote:
> Hi,
> 
> Thanks I was cooking the same diff.
> 
> Ok armani@
> 

Hi, thanks! I want to get rid of printfs though and
return errors (or unhandled status codes) so that we
don't paper over them. In theory, all error codes that
I've seen in the ndis.h from the DDK have the MSB set,
i.e. (rval & 0x80000000) != 0 for those, but doing it
would be a bit too hackish IMO. However, if we go down
this road the rval check can be rewritten as:

	/* Not an error */
	if ((rval & 0x80000000) == 0)
		rval = RNDIS_STATUS_SUCCESS;
	else
		printf("%s: status 0x%x\n", DEVNAME(sc), rval);

Is this something we want to do?

I was also going to add sc_link toggled by those status
codes and check it in the urndis_start like other USB
Ethernet drivers do, but decided against it cause this
would break other devices that don't send this message.
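
To make the MSB idea concrete, here is a standalone sketch of the check, assuming the bit in question is bit 31 of the 32-bit status word.  The status values are the ones I remember from the NDIS headers and should be treated as assumptions; rndis_status_is_error and rndis_status_fold are illustrative names and not functions from if_urndis.c.

```c
#include <stdint.h>

/* Status codes as recalled from the NDIS headers (assumption). */
#define RNDIS_STATUS_SUCCESS		0x00000000U
#define RNDIS_STATUS_MEDIA_CONNECT	0x4001000BU
#define RNDIS_STATUS_MEDIA_DISCONNECT	0x4001000CU
#define RNDIS_STATUS_FAILURE		0xC0000001U

/* Error codes are assumed to have the most significant bit set. */
static int
rndis_status_is_error(uint32_t rval)
{
	return (rval & 0x80000000U) != 0;
}

/* Fold every non-error indication into SUCCESS, keep real errors. */
static uint32_t
rndis_status_fold(uint32_t rval)
{
	return rndis_status_is_error(rval) ? rval : RNDIS_STATUS_SUCCESS;
}
```

With these values the link up/down indications fold into success while a real failure is passed through, which is the behaviour the explicit switch in the diff spells out case by case.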


diff --git sys/dev/usb/if_urndis.c sys/dev/usb/if_urndis.c
index 4af6b55cf05..956207f73a8 100644
--- sys/dev/usb/if_urndis.c
+++ sys/dev/usb/if_urndis.c
@@ -88,10 +88,12 @@ u_int32_t urndis_ctrl_handle_init(struct urndis_softc *,
 const struct rndis_comp_hdr *);
 u_int32_t urndis_ctrl_handle_query(struct urndis_softc *,
 const struct rndis_comp_hdr *, void **, size_t *);
 u_int32_t urndis_ctrl_handle_reset(struct urndis_softc *,
 const struct rndis_comp_hdr *);
+u_int32_t urndis_ctrl_handle_status(struct urndis_softc *,
+const struct rndis_comp_hdr *);
 
 u_int32_t urndis_ctrl_init(struct urndis_softc *);
 u_int32_t urndis_ctrl_halt(struct urndis_softc *);
 u_int32_t urndis_ctrl_query(struct urndis_softc *, u_int32_t, void *, size_t,
 void **, size_t *);
@@ -233,10 +235,14 @@ urndis_ctrl_handle(struct urndis_softc *sc, struct rndis_comp_hdr *hdr,
case REMOTE_NDIS_KEEPALIVE_CMPLT:
case REMOTE_NDIS_SET_CMPLT:
rval = letoh32(hdr->rm_status);
break;
 
+   case REMOTE_NDIS_INDICATE_STATUS_MSG:
+   rval = urndis_ctrl_handle_status(sc, hdr);
+   break;
+
default:
printf("%s: ctrl message error: unknown event 0x%x\n",
DEVNAME(sc), letoh32(hdr->rm_type));
rval = RNDIS_STATUS_FAILURE;
}
@@ -400,10 +406,42 @@ urndis_ctrl_handle_reset(struct urndis_softc *sc,
 
return rval;
 }
 
 u_int32_t
+urndis_ctrl_handle_status(struct urndis_softc *sc,
+const struct rndis_comp_hdr *hdr)
+{
+   const struct rndis_status_msg   *msg;
+   u_int32_t                       rval;
+
+   msg = (struct rndis_status_msg *)hdr;
+
+   rval = letoh32(msg->rm_status);
+
+   DPRINTF(("%s: urndis_ctrl_handle_status: len %u status 0x%x "
+   "stbuflen %u\n",
+   DEVNAME(sc),
+   letoh32(msg->rm_len),
+   rval,
+   letoh32(msg->rm_stbuflen)));
+
+   switch (rval) {
+   case RNDIS_STATUS_MEDIA_CONNECT:
+   case RNDIS_STATUS_MEDIA_DISCONNECT:
+   case RNDIS_STATUS_OFFLOAD_CURRENT_CONFIG:
+   rval = RNDIS_STATUS_SUCCESS;
+   break;
+
+   default:
+   printf("%s: status 0x%x\n", DEVNAME(sc), rval);
+   }
+
+   return rval;
+}
+
+u_int32_t
 urndis_ctrl_init(struct urndis_softc *sc)
 {
struct rndis_init_req   *msg;
u_int32_t               rval;
struct rndis_comp_hdr   *hdr;



Re: urndis issues

2017-07-11 Thread Mike Belopuhov
On Sun, Jul 09, 2017 at 09:57 +0300, Artturi Alm wrote:
> Hi,
> 
> anyone else having issues w/urndis(android)?
> victim of circumstances, i have to rely on it at times during the summer.
> When i plug phone into usb, and enable usb tethering or w/e it is called,
> i never get ip on first try, i have nothing but "dhcp"
> in /etc/hostname.urndis0, so i just ^C on the first "ksh /etc/netstart"
> and get ip pretty much as expected in seconds on the successive run
> right after ^C, the first dhclient would end up sleeping if not ^C'ed..
> 
> this is what i see in dmesg:
> urndis0 at uhub0 port 1 configuration 1 interface 0 "SAMSUNG SAMSUNG_Android" rev 2.00/ff.ff addr 2
> urndis0: using RNDIS, address 02:56:66:63:30:3c
> urndis0: ctrl message error: unknown event 0x7
> 
> no dmesg, as i've ran into this issue on every installation of OpenBSD
> i have tried w/.
> unrelated issue is this spam i get which i haven't noticed to affect
> anything:
> urndis0: urndis_decap invalid buffer len 1 < minimum header 44
> 
> for which i ended up w/diff below.
> -Artturi
> 

What happens if you apply the diff below w/o your modifications?

diff --git sys/dev/usb/if_urndis.c sys/dev/usb/if_urndis.c
index 4af6b55cf05..bdca361713d 100644
--- sys/dev/usb/if_urndis.c
+++ sys/dev/usb/if_urndis.c
@@ -88,10 +88,12 @@ u_int32_t urndis_ctrl_handle_init(struct urndis_softc *,
 const struct rndis_comp_hdr *);
 u_int32_t urndis_ctrl_handle_query(struct urndis_softc *,
 const struct rndis_comp_hdr *, void **, size_t *);
 u_int32_t urndis_ctrl_handle_reset(struct urndis_softc *,
 const struct rndis_comp_hdr *);
+u_int32_t urndis_ctrl_handle_status(struct urndis_softc *,
+const struct rndis_comp_hdr *);
 
 u_int32_t urndis_ctrl_init(struct urndis_softc *);
 u_int32_t urndis_ctrl_halt(struct urndis_softc *);
 u_int32_t urndis_ctrl_query(struct urndis_softc *, u_int32_t, void *, size_t,
 void **, size_t *);
@@ -233,10 +235,14 @@ urndis_ctrl_handle(struct urndis_softc *sc, struct rndis_comp_hdr *hdr,
case REMOTE_NDIS_KEEPALIVE_CMPLT:
case REMOTE_NDIS_SET_CMPLT:
rval = letoh32(hdr->rm_status);
break;
 
+   case REMOTE_NDIS_INDICATE_STATUS_MSG:
+   rval = urndis_ctrl_handle_status(sc, hdr);
+   break;
+
default:
printf("%s: ctrl message error: unknown event 0x%x\n",
DEVNAME(sc), letoh32(hdr->rm_type));
rval = RNDIS_STATUS_FAILURE;
}
@@ -400,10 +406,48 @@ urndis_ctrl_handle_reset(struct urndis_softc *sc,
 
return rval;
 }
 
 u_int32_t
+urndis_ctrl_handle_status(struct urndis_softc *sc,
+const struct rndis_comp_hdr *hdr)
+{
+   const struct rndis_status_msg   *msg;
+   u_int32_t                       rval;
+
+   msg = (struct rndis_status_msg *)hdr;
+
+   rval = letoh32(msg->rm_status);
+
+   DPRINTF(("%s: urndis_ctrl_handle_status: len %u status 0x%x "
+   "stbuflen %u\n",
+   DEVNAME(sc),
+   letoh32(msg->rm_len),
+   rval,
+   letoh32(msg->rm_stbuflen)));
+
+   switch (rval) {
+   case RNDIS_STATUS_MEDIA_CONNECT:
+   printf("%s: link up\n", DEVNAME(sc));
+   break;
+
+   case RNDIS_STATUS_MEDIA_DISCONNECT:
+   printf("%s: link down\n", DEVNAME(sc));
+   break;
+
+   /* Ignore these */
+   case RNDIS_STATUS_OFFLOAD_CURRENT_CONFIG:
+   break;
+
+   default:
+   printf("%s: unknown status 0x%x\n", DEVNAME(sc), rval);
+   }
+
+   return RNDIS_STATUS_SUCCESS;
+}
+
+u_int32_t
 urndis_ctrl_init(struct urndis_softc *sc)
 {
struct rndis_init_req   *msg;
u_int32_t               rval;
struct rndis_comp_hdr   *hdr;



Additional media options for ix(4)

2017-06-27 Thread Mike Belopuhov
Hi,

I wouldn't mind some broad testing of the following diff
which adds some additional media options to ix(4) from
FreeBSD and includes a fix for changing media from
Masanobu SAITOH.

The fix makes sure that when the media operation speed
is selected manually, the device doesn't additionally
advertise other (slower) modes.
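
The intent of the fix can be modelled in a few lines.  This is a sketch under my own assumptions: the SPEED_* bits and advertised_speeds() are invented for illustration and are neither the ixgbe link speed constants nor code from the diff.

```c
/* Hypothetical speed bits for illustration; not the ixgbe constants. */
#define SPEED_100M	(1U << 0)
#define SPEED_1G	(1U << 1)
#define SPEED_10G	(1U << 2)

/*
 * selected == 0 stands for autoselect: advertise everything the
 * hardware supports.  Otherwise advertise only the manually chosen
 * speed, and nothing at all if the hardware cannot do it.
 */
static unsigned int
advertised_speeds(unsigned int supported, unsigned int selected)
{
	if (selected == 0)
		return supported;
	return supported & selected;
}
```

With a manual selection of 1G on a device that supports all three speeds, only the 1G bit is advertised, so the link partner cannot negotiate down to 100M.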


diff --git sys/dev/pci/if_ix.c sys/dev/pci/if_ix.c
index 339ba2bc4f1..8fca8742f7f 100644
--- sys/dev/pci/if_ix.c
+++ sys/dev/pci/if_ix.c
@@ -1028,62 +1028,115 @@ ixgbe_intr(void *arg)
  *  This routine is called whenever the user queries the status of
  *  the interface using ifconfig.
  *
  **/
 void
-ixgbe_media_status(struct ifnet * ifp, struct ifmediareq *ifmr)
+ixgbe_media_status(struct ifnet *ifp, struct ifmediareq *ifmr)
 {
struct ix_softc *sc = ifp->if_softc;
+   int layer;
+
+   layer = sc->hw.mac.ops.get_supported_physical_layer(&sc->hw);
 
ifmr->ifm_active = IFM_ETHER;
ifmr->ifm_status = IFM_AVALID;
 
INIT_DEBUGOUT("ixgbe_media_status: begin");
ixgbe_update_link_status(sc);
 
-   if (LINK_STATE_IS_UP(ifp->if_link_state)) {
-   ifmr->ifm_status |= IFM_ACTIVE;
+   if (!LINK_STATE_IS_UP(ifp->if_link_state))
+   return;
+
+   ifmr->ifm_status |= IFM_ACTIVE;
 
+   if (layer & IXGBE_PHYSICAL_LAYER_10GBASE_T ||
+   layer & IXGBE_PHYSICAL_LAYER_1000BASE_T ||
+   layer & IXGBE_PHYSICAL_LAYER_100BASE_TX)
switch (sc->link_speed) {
+   case IXGBE_LINK_SPEED_10GB_FULL:
+   ifmr->ifm_active |= IFM_10G_T | IFM_FDX;
+   break;
+   case IXGBE_LINK_SPEED_1GB_FULL:
+   ifmr->ifm_active |= IFM_1000_T | IFM_FDX;
+   break;
case IXGBE_LINK_SPEED_100_FULL:
ifmr->ifm_active |= IFM_100_TX | IFM_FDX;
break;
+   }
+   if (layer & IXGBE_PHYSICAL_LAYER_SFP_PLUS_CU ||
+   layer & IXGBE_PHYSICAL_LAYER_SFP_ACTIVE_DA)
+   switch (sc->link_speed) {
+   case IXGBE_LINK_SPEED_10GB_FULL:
+   ifmr->ifm_active |= IFM_10G_SFP_CU | IFM_FDX;
+   break;
+   }
+   if (layer & IXGBE_PHYSICAL_LAYER_10GBASE_LR)
+   switch (sc->link_speed) {
+   case IXGBE_LINK_SPEED_10GB_FULL:
+   ifmr->ifm_active |= IFM_10G_LR | IFM_FDX;
+   break;
case IXGBE_LINK_SPEED_1GB_FULL:
-   switch (sc->optics) {
-   case IFM_10G_SR: /* multi-speed fiber */
-   ifmr->ifm_active |= IFM_1000_SX | IFM_FDX;
-   break;
-   case IFM_10G_LR: /* multi-speed fiber */
-   ifmr->ifm_active |= IFM_1000_LX | IFM_FDX;
-   break;
-   default:
-   ifmr->ifm_active |= sc->optics | IFM_FDX;
-   break;
-   }
+   ifmr->ifm_active |= IFM_1000_LX | IFM_FDX;
break;
+   }
+   if (layer & IXGBE_PHYSICAL_LAYER_10GBASE_LRM)
+   switch (sc->link_speed) {
case IXGBE_LINK_SPEED_10GB_FULL:
-   ifmr->ifm_active |= sc->optics | IFM_FDX;
+   ifmr->ifm_active |= IFM_10G_LRM | IFM_FDX;
+   break;
+   case IXGBE_LINK_SPEED_1GB_FULL:
+   ifmr->ifm_active |= IFM_1000_LX | IFM_FDX;
break;
}
-
-   switch (sc->hw.fc.current_mode) {
-   case ixgbe_fc_tx_pause:
-   ifmr->ifm_active |= IFM_FLOW | IFM_ETH_TXPAUSE;
+   if (layer & IXGBE_PHYSICAL_LAYER_10GBASE_SR ||
+   layer & IXGBE_PHYSICAL_LAYER_1000BASE_SX)
+   switch (sc->link_speed) {
+   case IXGBE_LINK_SPEED_10GB_FULL:
+   ifmr->ifm_active |= IFM_10G_SR | IFM_FDX;
+   break;
+   case IXGBE_LINK_SPEED_1GB_FULL:
+   ifmr->ifm_active |= IFM_1000_SX | IFM_FDX;
break;
-   case ixgbe_fc_rx_pause:
-   ifmr->ifm_active |= IFM_FLOW | IFM_ETH_RXPAUSE;
+   }
+   if (layer & IXGBE_PHYSICAL_LAYER_10GBASE_CX4)
+   switch (sc->link_speed) {
+   case IXGBE_LINK_SPEED_10GB_FULL:
+   ifmr->ifm_active |= IFM_10G_CX4 | IFM_FDX;
break;
-   case ixgbe_fc_full:
-   ifmr->ifm_active |= IFM_FLOW | IFM_ETH_RXPAUSE |
-   IFM_ETH_TXPAUSE;
+   }
+   if (layer & 

Re: faster timecounters for kvm/xen

2017-06-18 Thread Mike Belopuhov
On Sun, Jun 18, 2017 at 21:20 +1000, Jonathan Matthew wrote:
> On Fri, Jun 16, 2017 at 10:25:29AM +0200, Mike Belopuhov wrote:
> > Now regarding the diff.  pvbus_init_vcpu.  Ah yes, please.
> > It was a chicken and the egg problem for me: I didn't have
> > Xen, but wanted a callback from cpu_hatch to setup shared
> > info pages and events (interrupt delivery) for all CPUs.
> > So please factor it out and let's get that committed.
> 
> Updated version of this is below.  The init_cpu function pointer is now in
> the pvbus_hv so it's easier to decide what it does at runtime.
>
[...]
> oks on this bit?
>

OK mikeb

> Index: arch/amd64/amd64/cpu.c
> ===
> RCS file: /cvs/src/sys/arch/amd64/amd64/cpu.c,v
> retrieving revision 1.105
> diff -u -p -r1.105 cpu.c
> --- arch/amd64/amd64/cpu.c30 May 2017 15:11:32 -  1.105
> +++ arch/amd64/amd64/cpu.c18 Jun 2017 09:16:12 -
> @@ -67,6 +67,7 @@
>  #include "lapic.h"
>  #include "ioapic.h"
>  #include "vmm.h"
> +#include "pvbus.h"
>  
>  #include 
>  #include 
> @@ -103,6 +104,10 @@
>  #include 
>  #endif
>  
> +#if NPVBUS > 0
> +#include 
> +#endif
> +
>  #include 
>  #include 
>  #include 
> @@ -728,6 +733,9 @@ cpu_hatch(void *v)
>   lldt(0);
>  
>   cpu_init(ci);
> +#if NPVBUS > 0
> + pvbus_init_cpu();
> +#endif
>  
>   /* Re-initialise memory range handling on AP */
>   if (mem_range_softc.mr_op != NULL)
> Index: arch/i386/i386/cpu.c
> ===
> RCS file: /cvs/src/sys/arch/i386/i386/cpu.c,v
> retrieving revision 1.84
> diff -u -p -r1.84 cpu.c
> --- arch/i386/i386/cpu.c  30 May 2017 15:11:32 -  1.84
> +++ arch/i386/i386/cpu.c  18 Jun 2017 09:16:13 -
> @@ -67,6 +67,7 @@
>  #include "lapic.h"
>  #include "ioapic.h"
>  #include "vmm.h"
> +#include "pvbus.h"
>  
>  #include 
>  #include 
> @@ -104,6 +105,10 @@
>  #include 
>  #endif
>  
> +#if NPVBUS > 0
> +#include 
> +#endif
> +
>  #include 
>  #include 
>  #include 
> @@ -626,6 +631,9 @@ cpu_hatch(void *v)
>  
>   ci->ci_curpmap = pmap_kernel();
>   cpu_init(ci);
> +#if NPVBUS > 0
> + pvbus_init_cpu();
> +#endif
>  
>   /* Re-initialise memory range handling on AP */
>   if (mem_range_softc.mr_op != NULL)
> Index: dev/pv/pvbus.c
> ===
> RCS file: /cvs/src/sys/dev/pv/pvbus.c,v
> retrieving revision 1.16
> diff -u -p -r1.16 pvbus.c
> --- dev/pv/pvbus.c10 Jan 2017 17:16:39 -  1.16
> +++ dev/pv/pvbus.c18 Jun 2017 09:16:17 -
> @@ -210,6 +210,19 @@ pvbus_identify(void)
>   has_hv_cpuid = 1;
>  }
>  
> +void
> +pvbus_init_cpu(void)
> +{
> + int i;
> +
> + for (i = 0; i < PVBUS_MAX; i++) {
> + if (pvbus_hv[i].hv_base == 0)
> + continue;
> + if (pvbus_hv[i].hv_init_cpu != NULL)
> + (pvbus_hv[i].hv_init_cpu)(&pvbus_hv[i]);
> + }
> +}
> +
>  int
>  pvbus_activate(struct device *self, int act)
>  {
> Index: dev/pv/pvvar.h
> ===
> RCS file: /cvs/src/sys/dev/pv/pvvar.h,v
> retrieving revision 1.9
> diff -u -p -r1.9 pvvar.h
> --- dev/pv/pvvar.h10 Jan 2017 17:16:39 -  1.9
> +++ dev/pv/pvvar.h18 Jun 2017 09:16:17 -
> @@ -56,6 +56,7 @@ struct pvbus_hv {
>  
>   void*hv_arg;
>   int (*hv_kvop)(void *, int, char *, char *, size_t);
> + void(*hv_init_cpu)(struct pvbus_hv *);
>  };
>  
>  struct pvbus_softc {
> @@ -77,6 +78,7 @@ struct pv_attach_args {
>  
>  void  pvbus_identify(void);
>  int   pvbus_probe(void);
> +void  pvbus_init_cpu(void);
>  void  pvbus_reboot(struct device *);
>  void  pvbus_shutdown(struct device *);
>  
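
The dispatch done by pvbus_init_cpu() in the diff above can be modelled outside the kernel as follows.  It is only a sketch: hv_entry, count_call and hv_init_cpu_all are made-up stand-ins for struct pvbus_hv and friends, and the ncalls field exists purely so that the example can be checked.

```c
#include <stddef.h>

struct hv_entry {
	unsigned int base;			/* 0 means "not detected" */
	void (*init_cpu)(struct hv_entry *);	/* optional per-CPU hook */
	int ncalls;				/* bookkeeping for the example */
};

/* Example hook: just record that it ran. */
static void
count_call(struct hv_entry *hv)
{
	hv->ncalls++;
}

/*
 * Run the per-CPU hook of every detected hypervisor that provides
 * one.  Returns the number of hooks that were called.
 */
static int
hv_init_cpu_all(struct hv_entry *tbl, size_t n)
{
	size_t i;
	int called = 0;

	for (i = 0; i < n; i++) {
		if (tbl[i].base == 0)
			continue;
		if (tbl[i].init_cpu != NULL) {
			tbl[i].init_cpu(&tbl[i]);
			called++;
		}
	}
	return called;
}
```

Keeping the hook in the per-hypervisor entry means that each hypervisor driver can install or leave out its callback at runtime, which is the point made in the quoted message.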



Re: faster timecounters for kvm/xen

2017-06-17 Thread Mike Belopuhov
On 17 June 2017 at 14:17, Jonathan Matthew <jonat...@d14n.org> wrote:
>
> On Fri, Jun 16, 2017 at 01:19:02PM +0200, Mike Belopuhov wrote:
> > On Fri, Jun 16, 2017 at 10:25 +0200, Mike Belopuhov wrote:
> > > I don't know if it's a good idea to depend on Xen's
> > > definition of vcpu_time_info.  I think I have factored
> > > it out into the pvclock_time_info and put it into the
> > > pvclockvar.h or something like that.  And then made Xen
> > > use those definitions instead of its own.  Dunno what's
> > > the best course of action here.
> > >
> >
> > This is what I would like to use.  I've stripped the API
> > part, but we can add it as well.  I don't believe this
> > file requires a specific license since there's a handful
> > of pvclock header files out there implementing a common
> > interface so a person committing such a file can add his
> > own copyright.  Opinions?
>
> Looks good to me.  Can we put the #defines for flag bits in there too?
>
> #define PVCLOCK_FLAG_TSC_STABLE_BIT (1 << 0)
> #define PVCLOCK_FLAG_GUEST_STOPPED  (1 << 1)
>
> As far as I can tell, xen doesn't use these, but Linux handles them in its
> common pvclock code anyway.
>

Sure, go for it.
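
For context, this is roughly how a shared pvclock header gets used.  It is a sketch, assuming the field layout and scaling rule commonly documented for the pvclock interface; the ti_ field names and pvclock_ns() are my own choices for illustration, and the version check that a real reader needs around the access is left out.

```c
#include <stdint.h>

#define PVCLOCK_FLAG_TSC_STABLE_BIT	(1 << 0)
#define PVCLOCK_FLAG_GUEST_STOPPED	(1 << 1)

/* Field layout as commonly documented for pvclock (assumption). */
struct pvclock_time_info {
	uint32_t ti_version;
	uint32_t ti_pad0;
	uint64_t ti_tsc_timestamp;
	uint64_t ti_system_time;
	uint32_t ti_tsc_to_system_mul;
	int8_t   ti_tsc_shift;
	uint8_t  ti_flags;
	uint8_t  ti_pad[2];
};

/* Convert a raw TSC reading to nanoseconds of system time. */
static uint64_t
pvclock_ns(const struct pvclock_time_info *ti, uint64_t tsc)
{
	uint64_t delta = tsc - ti->ti_tsc_timestamp;
	uint64_t mul = ti->ti_tsc_to_system_mul;

	if (ti->ti_tsc_shift < 0)
		delta >>= -ti->ti_tsc_shift;
	else
		delta <<= ti->ti_tsc_shift;

	/* (delta * mul) >> 32 without a 128-bit intermediate. */
	return ti->ti_system_time +
	    (delta >> 32) * mul + (((delta & 0xffffffffULL) * mul) >> 32);
}
```

The flag bits are not needed for the conversion itself; a consumer would test ti_flags separately, for example to decide whether readings from different CPUs may be compared.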


Re: faster timecounters for kvm/xen

2017-06-16 Thread Mike Belopuhov
On Fri, Jun 16, 2017 at 10:25 +0200, Mike Belopuhov wrote:
> Last time I've tried uebayashi's pvclock on Xen, it didn't
> work for me.  I didn't have time to investigate why but
> probably because we need per-cpu readings.  Which you do
> for KVM.  I'll test this on Xen as soon as I get to the
> office.
>
[...]
> 
> But this brings another point: where and how to perform
> the pvclock initialization and attachment.  In your diff
> pvclock_xen_init comes a bit too early: none of the Xen
> things are initialized at that point, shared info page
> isn't allocated.
> 

As I've told jmatthew@ privately this doesn't work on Xen,
but I've changed a few things and made it work:
   https://github.com/mbelop/src/commits/pvclock

However, I'm observing a huge drift: 6 minutes in about an
hour so it's not usable as it is. I think this is the same
thing as I saw before.  I'll ponder the per-CPU vcpu_info
path and report back when I've got something.  But in the
meantime I hope that jmatthew@ will check in rdtsc and
pvbus_init_cpu bits (we're lobbying for a s/_vcpu/_cpu/
change. :-)



Re: faster timecounters for kvm/xen

2017-06-16 Thread Mike Belopuhov
On Fri, Jun 16, 2017 at 23:09 +1000, Jonathan Gray wrote:
> On Fri, Jun 16, 2017 at 02:23:36PM +0200, Mike Belopuhov wrote:
> > On Fri, Jun 16, 2017 at 16:31 +1000, Jonathan Matthew wrote:
> > > Index: arch/i386/include/cpufunc.h
> > > ===
> > > RCS file: /cvs/src/sys/arch/i386/include/cpufunc.h,v
> > > retrieving revision 1.25
> > > diff -u -p -u -p -r1.25 cpufunc.h
> > > --- arch/i386/include/cpufunc.h   27 May 2017 12:21:50 -  1.25
> > > +++ arch/i386/include/cpufunc.h   16 Jun 2017 06:07:16 -
> > > @@ -217,6 +217,15 @@ mfence(void)
> > >   __asm volatile("mfence" : : : "memory");
> > >  }
> > >  
> > > +static __inline u_int64_t
> > > +rdtsc(void)
> > > +{
> > > + uint32_t hi, lo;
> > > +
> > > + __asm volatile("rdtsc" : "=d" (hi), "=a" (lo));
> > > + return (((uint64_t)hi << 32) | (uint64_t) lo);
> > > +}
> > > +
> > >  static __inline void
> > >  wrmsr(u_int msr, u_int64_t newval)
> > >  {
> > 
> > I think it's OK to get this chunk in.  amd64 has got this already.
> > 
> 
> Perhaps make it __asm volatile ("rdtsc" : "=A" (v)); like the pctr.h version?
>
> That's also what the gcc example uses for rdtsc in
> https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints
>

Sure.

> I guess the chance of something pretending to be a 486 without TSC attaching
> pvbus is slim.

This file is full of instructions not supported on older CPUs
so no harm done as far as I can tell.



Re: faster timecounters for kvm/xen

2017-06-16 Thread Mike Belopuhov
On Fri, Jun 16, 2017 at 16:31 +1000, Jonathan Matthew wrote:
> Index: arch/i386/include/cpufunc.h
> ===
> RCS file: /cvs/src/sys/arch/i386/include/cpufunc.h,v
> retrieving revision 1.25
> diff -u -p -u -p -r1.25 cpufunc.h
> --- arch/i386/include/cpufunc.h   27 May 2017 12:21:50 -  1.25
> +++ arch/i386/include/cpufunc.h   16 Jun 2017 06:07:16 -
> @@ -217,6 +217,15 @@ mfence(void)
>   __asm volatile("mfence" : : : "memory");
>  }
>  
> +static __inline u_int64_t
> +rdtsc(void)
> +{
> + uint32_t hi, lo;
> +
> + __asm volatile("rdtsc" : "=d" (hi), "=a" (lo));
> + return (((uint64_t)hi << 32) | (uint64_t) lo);
> +}
> +
>  static __inline void
>  wrmsr(u_int msr, u_int64_t newval)
>  {

I think it's OK to get this chunk in.  amd64 has got this already.
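
For reference, a small userland sketch of what the quoted rdtsc() does:
the instruction leaves the high half in EDX and the low half in EAX and
the two are glued into one 64-bit value.  tsc_combine() and
rdtsc_sketch() are names invented for this example; the asm part is
only compiled on x86.

```c
#include <stdint.h>

/* Combine the EDX (high) and EAX (low) halves that rdtsc returns. */
static inline uint64_t
tsc_combine(uint32_t hi, uint32_t lo)
{
	return (((uint64_t)hi << 32) | (uint64_t)lo);
}

#if defined(__i386__) || defined(__x86_64__)
/* Same shape as the quoted function, built on the helper above. */
static inline uint64_t
rdtsc_sketch(void)
{
	uint32_t hi, lo;

	__asm volatile("rdtsc" : "=d" (hi), "=a" (lo));
	return (tsc_combine(hi, lo));
}
#endif
```

The explicit "=d"/"=a" pair behaves the same on i386 and amd64, whereas
"=A" only names the EDX:EAX pair on i386.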



Re: faster timecounters for kvm/xen

2017-06-16 Thread Mike Belopuhov
On Fri, Jun 16, 2017 at 10:25 +0200, Mike Belopuhov wrote:
> I don't know if it's a good idea to depend on Xen's
> definition of vcpu_time_info.  I think I have factored
> it out into the pvclock_time_info and put it into the
> pvclockvar.h or something like that.  And then made Xen
> use those definitions instead of its own.  Dunno what's
> the best course of action here.
> 

This is what I would like to use.  I've stripped the API
part, but we can add it as well.  I don't believe this
file requires a specific license since there's a handful
of pvclock header files out there implementing a common
interface so a person committing such a file can add his
own copyright.  Opinions?


#ifndef _PV_PVCLOCK_H_
#define _PV_PVCLOCK_H_

struct pvclock_vcpu_time_info {
	volatile uint32_t	version;
	volatile uint32_t	pad1;
	volatile uint64_t	tsc_timestamp;
	volatile uint64_t	system_time;
	volatile uint32_t	tsc_to_system_mul;
	volatile int8_t		tsc_shift;
	volatile uint8_t	flags;
	volatile uint8_t	pad2[2];
} __packed;

struct pvclock_wall_clock {
	volatile uint32_t	version;
	volatile uint32_t	sec;
	volatile uint32_t	nsec;
} __packed;

#endif  /* _PV_PVCLOCK_H_ */
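
To show how these fields are usually consumed, here is a minimal
userland sketch of the conventional calculation: take the TSC delta
since tsc_timestamp, apply tsc_shift, multiply by tsc_to_system_mul as
a 32.32 fixed-point factor and add the result to system_time.  The
struct and function names below are stand-ins for this example only,
and the sketch leaves out the version check a real reader needs to
detect a concurrent update by the host.

```c
#include <stdint.h>

/* Userland copy of the fields the sketch needs; not the kernel struct. */
struct pvclock_sample {
	uint64_t	tsc_timestamp;
	uint64_t	system_time;
	uint32_t	tsc_to_system_mul;
	int8_t		tsc_shift;
};

/* Scale a TSC delta to nanoseconds: shift, then multiply by mul / 2^32. */
static uint64_t
pvclock_scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;
	/* (delta * mul) >> 32, split so no 128-bit type is needed */
	return ((delta >> 32) * mul +
	    (((delta & 0xffffffffULL) * mul) >> 32));
}

/* Nanoseconds of system time corresponding to the TSC value tsc_now. */
static uint64_t
pvclock_nanotime(const struct pvclock_sample *s, uint64_t tsc_now)
{
	return (s->system_time + pvclock_scale_delta(
	    tsc_now - s->tsc_timestamp, s->tsc_to_system_mul, s->tsc_shift));
}
```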



Re: faster timecounters for kvm/xen

2017-06-16 Thread Mike Belopuhov
On Fri, Jun 16, 2017 at 16:31 +1000, Jonathan Matthew wrote:
> Recently I updated the kernel lock profiling stuff I've been working on, since
> it  had been rotting a bit since witness was introduced.  Running my diff on a
> KVM VM, I found there was a pretty huge performance impact (10 minutes to
> build a kernel instead of 4), which turned out to be because reading the
> emulated HPET in KVM is slow, and lock profiling involves a lot of extra
> clock reads.  The diff below adds a new TSC-based timecounter implementation
> for KVM and Xen to remedy this.
> 
> KVM and Xen provide frequently-updated views of system time from the host to
> each vcpu in a way that lets the VM get accurate high resolution time without
> much work.  Linux calls this mechanism 'pvclock' so I'm doing the same.
> 
> The pvclock structure gives you a system time (in nanoseconds), the TSC
> reading from when the time was updated, and scaling factors for converting TSC
> values to nanoseconds.  Usually you subtract the TSC reading in the pvclock
> structure from a current reading, convert that to nanoseconds, and add it to
> the system time.  I decided to go the other way in order to keep all the
> available resolution.
> 
> Using pvclock as the timecounter reduces the overhead of lock profiling to
> almost nothing.  Even without the extra clock reads for lock profiling,
> it cuts a few seconds off kernel compile time on a 2 vcpu vm.  I've run it
> for ~12 hours without ntpd and the clock keeps time accurately.
> 
> One wrinkle here is that the KVM pvclock mechanism requires setup on each 
> vcpu,
> so I added a new pvbus function that gets called from cpu_hatch, allowing any
> hypervisor-specific setup to happen there.
> 
> I still need to try this on xen, but comments at this stage are welcome.
>

Cool!  You've beaten both of us to it :)

Last time I tried uebayashi's pvclock on Xen, it didn't
work for me.  I didn't have time to investigate why, but
it's probably because we need per-CPU readings, which you
do for KVM.  I'll test this on Xen as soon as I get to the
office.

Now regarding the diff.  pvbus_init_vcpu.  Ah yes, please.
It was a chicken-and-egg problem for me: I didn't have
Xen, but wanted a callback from cpu_hatch to set up shared
info pages and events (interrupt delivery) for all CPUs.
So please factor it out and let's get that committed.

I don't know if it's a good idea to depend on Xen's
definition of vcpu_time_info.  I think I have factored
it out into the pvclock_time_info and put it into the
pvclockvar.h or something like that.  And then made Xen
use those definitions instead of its own.  Dunno what's
the best course of action here.

But this brings another point: where and how to perform
the pvclock initialization and attachment.  In your diff
pvclock_xen_init comes a bit too early: none of the Xen
things are initialized at that point, shared info page
isn't allocated.

I told Stefan in Munich that perhaps we should have a kvm.c
shim that would prepare and attach pvclock (and maybe provide
some flags and other bells and whistles).

I think we need to call pvclock attachment from Xen code
where it's appropriate, not from pvbus code.  Or do a
config_attach on it.  Why didn't you want to put it in
its own device driver?

It's nice that this version avoids using assembly.  Any idea
why the Linux/FreeBSD code uses it?  Were they afraid of
losing precision maybe?

In any case, good job, let's try to get this in.



Re: Better handling of short reads

2017-06-14 Thread Mike Belopuhov
On Wed, Jun 14, 2017 at 11:43 -0400, Ted Unangst wrote:
> Mike Belopuhov wrote:
> > still looking forward to replies to the original set of changes.
> 
> i'm a little in between. on the one hand, yes, ok, it's good that we don't
> leave corrupted buffers around with bad data. on the other hand, don't we want
> to learn about these problems and fix them? i don't think the change is wrong,
> but it seems like it covers up another issue.

Device drivers do not consider such situations as issues.
Yes, they're edge cases that we can't normally trigger,
but they're not bugs in drivers since over the years
developers have deliberately put such code there.

So I'm not entirely sure what you think this is papering
over.  There's a clear violation of contract between buffer
cache and the filesystem: FFS asked for 16k, got 16k plus
resid of 20k which is weird to say the least.



Re: Better handling of short reads

2017-06-14 Thread Mike Belopuhov
On Wed, Jun 14, 2017 at 09:12 -0600, Bob Beck wrote:
> 
> > As you all might have gathered by now Amit has jumped the gun
> > but was wrong to do so.  His setup is not affected by this change.
> > That was expected so please don't get distracted by this as I'm
> > still looking forward to replies to the original set of changes.
> > beck@?
> > 
> > > diff --git sys/kern/vfs_bio.c sys/kern/vfs_bio.c
> > > index 95bc80bc0e6..9316e6e0eb2 100644
> > > --- sys/kern/vfs_bio.c
> > > +++ sys/kern/vfs_bio.c
> > > @@ -534,10 +534,27 @@ bread_cluster_callback(struct buf *bp)
> > >*/
> > >   buf_fix_mapping(bp, newsize);
> > >   bp->b_bcount = newsize;
> > >   }
> > >  
> > > + /* Invalidate read-ahead buffers if read short */
> > > + if (bp->b_resid > 0) {
> > > + printf("read %ld resid %ld\n", bp->b_bcount, bp->b_resid);
> 
> Should the printf actually be here?  I'm not thinking this thing 
> spewing dmesg like a banshee if we get short reads is really going
> to help anything
>

You're looking at the wrong diff.  Please look at the first mail in
the thread.  This one was for Amit.

> > > + for (i = 0; xbpp[i] != NULL; i++)
> > > + continue;
> > > + for (i = i - 1; i != 0; i--) {
> > > + if (xbpp[i]->b_bufsize <= bp->b_resid) {
> > > + bp->b_resid -= xbpp[i]->b_bufsize;
> > > + SET(xbpp[i]->b_flags, B_INVAL);
> > > + } else if (bp->b_resid > 0) {
> > > + bp->b_resid = 0;
> > > + SET(xbpp[i]->b_flags, B_INVAL);
> > > + } else
> > > + break;
> > > + }
> > > + }
> > > +
> > >   for (i = 1; xbpp[i] != 0; i++) {
> > >   if (ISSET(bp->b_flags, B_ERROR))
> > >   SET(xbpp[i]->b_flags, B_INVAL | B_ERROR);
> > >   biodone(xbpp[i]);
> > >   }
> > 



Re: tweak txp to avoid ifq_deq_begin/commit/rollback

2017-06-14 Thread Mike Belopuhov
On Mon, Jun 05, 2017 at 16:13 +0200, Mike Belopuhov wrote:
> On Wed, May 31, 2017 at 20:40 +0200, Mike Belopuhov wrote:
> > According to the FreeBSD driver, txp(4) is not setting up its TX
> > ring correctly.  FreeBSD driver uses up to 16 fragments, while we
> > use up to 252 which is suspicious.
> > 
> > This gets us in line with FreeBSD, introduces goodness of m_defrag
> > and removes pesky if_deq_* thingies.
> > 
> > Does anyone still have the hardware (3com 3CR900 Typhoon) to test?
> > OK's are welcome.
> >
> 
> Any OKs? Tests? Should I go ahead with this?
>

I've heard no objections to this so the best way to test it
would be to get this in.
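
For anyone skimming the diff below, the control flow of the new switch
is roughly the following.  This is a userland sketch with fake
stand-ins for bus_dmamap_load_mbuf() and m_defrag(); every name in it
is made up for the example and FAKE_MAXSEGS just mirrors the 16
fragment limit mentioned above.

```c
#include <errno.h>

#define FAKE_MAXSEGS	16

/* Hypothetical packet: only the fragment count matters here. */
struct fake_pkt {
	int	nsegs;
};

/* Stand-in for the DMA load: fails with EFBIG on too many fragments. */
static int
fake_load(const struct fake_pkt *p)
{
	return (p->nsegs > FAKE_MAXSEGS ? EFBIG : 0);
}

/* Stand-in for defragmentation: collapses the chain if memory allows. */
static int
fake_defrag(struct fake_pkt *p, int can_alloc)
{
	if (!can_alloc)
		return (ENOBUFS);
	p->nsegs = 1;
	return (0);
}

/* Returns 0 if the packet is mapped, or an errno if it must be dropped. */
static int
load_or_defrag(struct fake_pkt *p, int can_alloc)
{
	int error;

	error = fake_load(p);
	if (error != EFBIG)
		return (error);
	error = fake_defrag(p, can_alloc);
	if (error != 0)
		return (error);
	return (fake_load(p));
}
```

A nonzero return corresponds to the default: case in the diff, where
the mbuf is freed and the loop moves on to the next packet.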

> > diff --git sys/dev/pci/if_txp.c sys/dev/pci/if_txp.c
> > index deede70e9de..1aed06765c0 100644
> > --- sys/dev/pci/if_txp.c
> > +++ sys/dev/pci/if_txp.c
> > @@ -883,12 +883,12 @@ txp_alloc_rings(struct txp_softc *sc)
> > sc->sc_txhir.r_desc = (struct txp_tx_desc *)sc->sc_txhiring_dma.dma_vaddr;
> > sc->sc_txhir.r_cons = sc->sc_txhir.r_prod = sc->sc_txhir.r_cnt = 0;
> > sc->sc_txhir.r_off = &sc->sc_hostvar->hv_tx_hi_desc_read_idx;
> > for (i = 0; i < TX_ENTRIES; i++) {
> > if (bus_dmamap_create(sc->sc_dmat, TXP_MAX_PKTLEN,
> > -   TX_ENTRIES - 4, TXP_MAX_SEGLEN, 0,
> > -   BUS_DMA_NOWAIT, &sc->sc_txd[i].sd_map) != 0) {
> > +   TXP_MAXTXSEGS, MCLBYTES, 0, BUS_DMA_NOWAIT,
> > +   &sc->sc_txd[i].sd_map) != 0) {
> > for (j = 0; j < i; j++) {
> > bus_dmamap_destroy(sc->sc_dmat,
> > sc->sc_txd[j].sd_map);
> > sc->sc_txd[j].sd_map = NULL;
> > }
> > @@ -1261,57 +1261,48 @@ txp_start(struct ifnet *ifp)
> > struct txp_softc *sc = ifp->if_softc;
> > struct txp_tx_ring *r = &sc->sc_txhir;
> > struct txp_tx_desc *txd;
> > int txdidx;
> > struct txp_frag_desc *fxd;
> > -   struct mbuf *m, *mnew;
> > +   struct mbuf *m;
> > struct txp_swdesc *sd;
> > u_int32_t firstprod, firstcnt, prod, cnt, i;
> >  
> > if (!(ifp->if_flags & IFF_RUNNING) || ifq_is_oactive(&ifp->if_snd))
> > return;
> >  
> > prod = r->r_prod;
> > cnt = r->r_cnt;
> >  
> > while (1) {
> > -   m = ifq_deq_begin(&ifp->if_snd);
> > +   if (cnt >= TX_ENTRIES - TXP_MAXTXSEGS - 4)
> > +   goto oactive;
> > +
> > +   m = ifq_dequeue(&ifp->if_snd);
> > if (m == NULL)
> > break;
> > -   mnew = NULL;
> >  
> > firstprod = prod;
> > firstcnt = cnt;
> >  
> > sd = sc->sc_txd + prod;
> > sd->sd_mbuf = m;
> >  
> > -   if (bus_dmamap_load_mbuf(sc->sc_dmat, sd->sd_map, m,
> > +   switch (bus_dmamap_load_mbuf(sc->sc_dmat, sd->sd_map, m,
> > BUS_DMA_NOWAIT)) {
> > -   MGETHDR(mnew, M_DONTWAIT, MT_DATA);
> > -   if (mnew == NULL)
> > -   goto oactive1;
> > -   if (m->m_pkthdr.len > MHLEN) {
> > -   MCLGET(mnew, M_DONTWAIT);
> > -   if ((mnew->m_flags & M_EXT) == 0) {
> > -   m_freem(mnew);
> > -   goto oactive1;
> > -   }
> > -   }
> > -   m_copydata(m, 0, m->m_pkthdr.len, mtod(mnew, caddr_t));
> > -   mnew->m_pkthdr.len = mnew->m_len = m->m_pkthdr.len;
> > -   ifq_deq_commit(>if_snd, m);
> > +   case 0:
> > +   break;
> > +   case EFBIG:
> > +   if (m_defrag(m, M_DONTWAIT) == 0 &&
> > +   bus_dmamap_load_mbuf(sc->sc_dmat, sd->sd_map, m,
> > +   BUS_DMA_NOWAIT) == 0)
> > +   break;
> > +   default:
> > m_freem(m);
> > -   m = mnew;
> > -   if (bus_dmamap_load_mbuf(sc->sc_dmat, sd->sd_map, m,
> > -   BUS_DMA_NOWAIT))
> > -   goto oactive1;
> > +   continue;
> > }
> >  
> > -   if ((TX

Re: Better handling of short reads

2017-06-14 Thread Mike Belopuhov
On Thu, Jun 08, 2017 at 11:55 +0200, Mike Belopuhov wrote:
> On Wed, Jun 07, 2017 at 23:04 -0500, Amit Kulkarni wrote:
> > On Wed, 7 Jun 2017 21:27:27 -0500
> > Amit Kulkarni <amit.o...@gmail.com> wrote:
> > 
> > > On Thu, 8 Jun 2017 01:57:25 +0200
> > > Mike Belopuhov <m...@belopuhov.com> wrote:
> > > 
> > > > On Wed, Jun 07, 2017 at 18:35 -0500, Amit Kulkarni wrote:
> > > > > Wow, please get this in!!!
> > > > > 
> > > > > This fixes cvs update on hard disks, to go much much faster. When I am
> > > > > updating the entire set of cvs trees: www, src, xenocara, ports, I can
> > > > > still use firefox and have it perfectly usable. There's a night and
> > > > > day improvement, before and after. Thanks for debugging and fixing
> > > > > this.
> > > > >
> > > > 
> > > > What kind of broken hardware do you have that this diff helps you?
> > > > Can you show us your dmesg?
> > > > 
> > 
> > Please ignore previous dmesg, it was incomplete.
> > 
> 
> Are you 100% sure this diff changes anything for you?
> Can you please try the one below.  It adds a printf.
>

As you all might have gathered by now Amit has jumped the gun
but was wrong to do so.  His setup is not affected by this change.
That was expected so please don't get distracted by this as I'm
still looking forward to replies to the original set of changes.
beck@?

> diff --git sys/kern/vfs_bio.c sys/kern/vfs_bio.c
> index 95bc80bc0e6..9316e6e0eb2 100644
> --- sys/kern/vfs_bio.c
> +++ sys/kern/vfs_bio.c
> @@ -534,10 +534,27 @@ bread_cluster_callback(struct buf *bp)
>*/
>   buf_fix_mapping(bp, newsize);
>   bp->b_bcount = newsize;
>   }
>  
> + /* Invalidate read-ahead buffers if read short */
> + if (bp->b_resid > 0) {
> + printf("read %ld resid %ld\n", bp->b_bcount, bp->b_resid);
> + for (i = 0; xbpp[i] != NULL; i++)
> + continue;
> + for (i = i - 1; i != 0; i--) {
> + if (xbpp[i]->b_bufsize <= bp->b_resid) {
> + bp->b_resid -= xbpp[i]->b_bufsize;
> + SET(xbpp[i]->b_flags, B_INVAL);
> + } else if (bp->b_resid > 0) {
> + bp->b_resid = 0;
> + SET(xbpp[i]->b_flags, B_INVAL);
> + } else
> + break;
> + }
> + }
> +
>   for (i = 1; xbpp[i] != 0; i++) {
>   if (ISSET(bp->b_flags, B_ERROR))
>   SET(xbpp[i]->b_flags, B_INVAL | B_ERROR);
>   biodone(xbpp[i]);
>   }



Re: pool cpu caches and a systat view of them

2017-06-14 Thread Mike Belopuhov
On Wed, Jun 14, 2017 at 13:50 +1000, David Gwynne wrote:
> i have a few things left to do in the pools per cpu caches, one of
> which is make their activity visibile. to that end, here's a diff
> provides a way for userland to request stats from the per cpu caches,
> and uses that in systat so you can watch them.
> 
> there are two added pool sysctls. one copies an array of stats from
> each cpus cache. the interesting bits in those stats are how many
> items each cpu handled, and how many list operations the cpu did
> against the global pool cache.
> 
> the second sysctl reports stats about the global pool cache. currently
> this is the target for the list length the cpus build is, how many
> lists its holding, and how many times the gc has moved a list of
> items back into the pool for recovery.
> 
> these are used by sysctl for a new view which ive called pcaches,
> short for pool caches.
> 

I think this is a nice addition.  It would be nice to have some
(at least terse) description of fields in the man page itself.
In any case, OK mikeb



Re: pfctl: make functions return void, merge two ifs

2017-06-12 Thread Mike Belopuhov
On Sun, Jun 11, 2017 at 15:03 +0100, Raymond wrote:
> Transform the following functions (which never return anything other than 0, 
> and whose return value is never used) to void:
> 
> * pfctl_clear_stats, pfctl_clear_interface_flags, pfctl_clear_rules, 
> pfctl_clear_src_nodes, pfctl_clear_states
> * pfctl_kill_src_nodes, pfctl_net_kill_states, pfctl_label_kill_states, 
> pfctl_id_kill_states, pfctl_key_kill_states
> 
> inside main: merge two identical if conditions next to each other into one.
> 
> credit to
> - awolk@ for the code reading
> - mikeb for pointing out we can void all _clear_ functions
> - ghostyy for pointing out all _kill_ functions can be voided
>

Looks good to me.  I was going to point out that pfctl_clear_tables
should also be converted, but leave that for a rainy day since some
extra return value checking of pfctl_table call is probably in order.

> ? parse.c
> ? pfctl
> Index: pfctl.c
> ===
> RCS file: /cvs/src/sbin/pfctl/pfctl.c,v
> retrieving revision 1.344
> diff -u -p -r1.344 pfctl.c
> --- pfctl.c   30 May 2017 12:13:04 -  1.344
> +++ pfctl.c   11 Jun 2017 13:39:14 -
> @@ -61,17 +61,17 @@ void   usage(void);
>  int   pfctl_enable(int, int);
>  int   pfctl_disable(int, int);
>  void  pfctl_clear_queues(struct pf_qihead *);
> -int   pfctl_clear_stats(int, const char *, int);
> -int   pfctl_clear_interface_flags(int, int);
> -int   pfctl_clear_rules(int, int, char *);
> -int   pfctl_clear_src_nodes(int, int);
> -int   pfctl_clear_states(int, const char *, int);
> +void  pfctl_clear_stats(int, const char *, int);
> +void  pfctl_clear_interface_flags(int, int);
> +void  pfctl_clear_rules(int, int, char *);
> +void  pfctl_clear_src_nodes(int, int);
> +void  pfctl_clear_states(int, const char *, int);
>  void  pfctl_addrprefix(char *, struct pf_addr *);
> -int   pfctl_kill_src_nodes(int, const char *, int);
> -int   pfctl_net_kill_states(int, const char *, int, int);
> -int   pfctl_label_kill_states(int, const char *, int, int);
> -int   pfctl_id_kill_states(int, int);
> -int   pfctl_key_kill_states(int, const char *, int, int);
> +void  pfctl_kill_src_nodes(int, const char *, int);
> +void  pfctl_net_kill_states(int, const char *, int, int);
> +void  pfctl_label_kill_states(int, const char *, int, int);
> +void  pfctl_id_kill_states(int, int);
> +void  pfctl_key_kill_states(int, const char *, int, int);
>  int   pfctl_parse_host(char *, struct pf_rule_addr *);
>  void  pfctl_init_options(struct pfctl *);
>  int   pfctl_load_options(struct pfctl *);
> @@ -278,7 +278,7 @@ pfctl_disable(int dev, int opts)
>   return (0);
>  }
>  
> -int
> +void
>  pfctl_clear_stats(int dev, const char *iface, int opts)
>  {
>   struct pfioc_iface pi;
> @@ -296,10 +296,9 @@ pfctl_clear_stats(int dev, const char *i
>   fprintf(stderr, " for interface %s", iface);
>   fprintf(stderr, "\n");
>   }
> - return (0);
>  }
>  
> -int
> +void
>  pfctl_clear_interface_flags(int dev, int opts)
>  {
>   struct pfioc_iface  pi;
> @@ -313,10 +312,9 @@ pfctl_clear_interface_flags(int dev, int
>   if ((opts & PF_OPT_QUIET) == 0)
>   fprintf(stderr, "pf: interface flags reset\n");
>   }
> - return (0);
>  }
>  
> -int
> +void
>  pfctl_clear_rules(int dev, int opts, char *anchorname)
>  {
>   struct pfr_buffer t;
> @@ -329,20 +327,18 @@ pfctl_clear_rules(int dev, int opts, cha
>   err(1, "pfctl_clear_rules");
>   if ((opts & PF_OPT_QUIET) == 0)
>   fprintf(stderr, "rules cleared\n");
> - return (0);
>  }
>  
> -int
> +void
>  pfctl_clear_src_nodes(int dev, int opts)
>  {
>   if (ioctl(dev, DIOCCLRSRCNODES))
>   err(1, "DIOCCLRSRCNODES");
>   if ((opts & PF_OPT_QUIET) == 0)
>   fprintf(stderr, "source tracking entries cleared\n");
> - return (0);
>  }
>  
> -int
> +void
>  pfctl_clear_states(int dev, const char *iface, int opts)
>  {
>   struct pfioc_state_kill psk;
> @@ -356,7 +352,6 @@ pfctl_clear_states(int dev, const char *
>   err(1, "DIOCCLRSTATES");
>   if ((opts & PF_OPT_QUIET) == 0)
>   fprintf(stderr, "%d states cleared\n", psk.psk_killed);
> - return (0);
>  }
>  
>  void
> @@ -409,7 +404,7 @@ pfctl_addrprefix(char *addr, struct pf_a
>   freeaddrinfo(res);
>  }
>  
> -int
> +void
>  pfctl_kill_src_nodes(int dev, const char *iface, int opts)
>  {
>   struct pfioc_src_node_kill psnk;
> @@ -509,10 +504,9 @@ pfctl_kill_src_nodes(int dev, const char
>   if ((opts & PF_OPT_QUIET) == 0)
>   fprintf(stderr, "killed %d src nodes from %d sources and %d "
>   "destinations\n", killed, sources, dests);
> - return (0);
>  }
>  
> -int
> +void
>  pfctl_net_kill_states(int dev, const char *iface, int opts, int rdomain)
>  {
>   struct pfioc_state_kill psk;
> @@ -617,10 +611,9 @@ 

Re: Better handling of short reads

2017-06-08 Thread Mike Belopuhov
On Wed, Jun 07, 2017 at 23:04 -0500, Amit Kulkarni wrote:
> On Wed, 7 Jun 2017 21:27:27 -0500
> Amit Kulkarni <amit.o...@gmail.com> wrote:
> 
> > On Thu, 8 Jun 2017 01:57:25 +0200
> > Mike Belopuhov <m...@belopuhov.com> wrote:
> > 
> > > On Wed, Jun 07, 2017 at 18:35 -0500, Amit Kulkarni wrote:
> > > > Wow, please get this in!!!
> > > > 
> > > > This fixes cvs update on hard disks, to go much much faster. When I am
> > > > updating the entire set of cvs trees: www, src, xenocara, ports, I can
> > > > still use firefox and have it perfectly usable. There's a night and
> > > > day improvement, before and after. Thanks for debugging and fixing
> > > > this.
> > > >
> > > 
> > > What kind of broken hardware do you have that this diff helps you?
> > > Can you show us your dmesg?
> > > 
> 
> Please ignore previous dmesg, it was incomplete.
> 

Are you 100% sure this diff changes anything for you?
Can you please try the one below.  It adds a printf.

diff --git sys/kern/vfs_bio.c sys/kern/vfs_bio.c
index 95bc80bc0e6..9316e6e0eb2 100644
--- sys/kern/vfs_bio.c
+++ sys/kern/vfs_bio.c
@@ -534,10 +534,27 @@ bread_cluster_callback(struct buf *bp)
 */
buf_fix_mapping(bp, newsize);
bp->b_bcount = newsize;
}
 
+   /* Invalidate read-ahead buffers if read short */
+   if (bp->b_resid > 0) {
+   printf("read %ld resid %ld\n", bp->b_bcount, bp->b_resid);
+   for (i = 0; xbpp[i] != NULL; i++)
+   continue;
+   for (i = i - 1; i != 0; i--) {
+   if (xbpp[i]->b_bufsize <= bp->b_resid) {
+   bp->b_resid -= xbpp[i]->b_bufsize;
+   SET(xbpp[i]->b_flags, B_INVAL);
+   } else if (bp->b_resid > 0) {
+   bp->b_resid = 0;
+   SET(xbpp[i]->b_flags, B_INVAL);
+   } else
+   break;
+   }
+   }
+
for (i = 1; xbpp[i] != 0; i++) {
if (ISSET(bp->b_flags, B_ERROR))
SET(xbpp[i]->b_flags, B_INVAL | B_ERROR);
biodone(xbpp[i]);
}



Re: Better handling of short reads

2017-06-07 Thread Mike Belopuhov
On Wed, Jun 07, 2017 at 18:35 -0500, Amit Kulkarni wrote:
> Wow, please get this in!!!
> 
> This fixes cvs update on hard disks, to go much much faster. When I am
> updating the entire set of cvs trees: www, src, xenocara, ports, I can
> still use firefox and have it perfectly usable. There's a night and
> day improvement, before and after. Thanks for debugging and fixing
> this.
>

What kind of broken hardware do you have that this diff helps you?
Can you show us your dmesg?

> amit
> 
> On Wed, Jun 7, 2017 at 12:29 PM, Mike Belopuhov <m...@belopuhov.com> wrote:
> > Hi,
> >
> > I've discovered that short reads (nonzero b_resid) aren't
> > handled very well in our kernel and I've proposed a diff
> > like this to handle short reads of buffercache read-ahead
> > buffers:
> >
[...]



Better handling of short reads

2017-06-07 Thread Mike Belopuhov
Hi,

I've discovered that short reads (nonzero b_resid) aren't
handled very well in our kernel and I've proposed a diff
like this to handle short reads of buffercache read-ahead
buffers:

diff --git sys/kern/vfs_bio.c sys/kern/vfs_bio.c
index 95bc80bc0e6..1cc1943d752 100644
--- sys/kern/vfs_bio.c
+++ sys/kern/vfs_bio.c
@@ -534,11 +534,27 @@ bread_cluster_callback(struct buf *bp)
 */
buf_fix_mapping(bp, newsize);
bp->b_bcount = newsize;
}
 
-   for (i = 1; xbpp[i] != 0; i++) {
+   /* Invalidate read-ahead buffers if read short */
+   if (bp->b_resid > 0) {
+   for (i = 0; xbpp[i] != NULL; i++)
+   continue;
+   for (i = i - 1; i != 0; i--) {
+   if (xbpp[i]->b_bufsize <= bp->b_resid) {
+   bp->b_resid -= xbpp[i]->b_bufsize;
+   SET(xbpp[i]->b_flags, B_INVAL);
+   } else if (bp->b_resid > 0) {
+   bp->b_resid = 0;
+   SET(xbpp[i]->b_flags, B_INVAL);
+   } else
+   break;
+   }
+   }
+
+   for (i = 1; xbpp[i] != NULL; i++) {
if (ISSET(bp->b_flags, B_ERROR))
SET(xbpp[i]->b_flags, B_INVAL | B_ERROR);
biodone(xbpp[i]);
}
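
To make the walk over the read-ahead buffers easier to follow, here is
a small userland model of the same loop.  struct fake_buf and
invalidate_short() are invented for this sketch; it tracks the residual
in a local variable instead of bp->b_resid, and it uses i > 0 rather
than i != 0 as the loop condition so that an empty array is harmless.

```c
#include <stddef.h>

/* Stand-in for struct buf: just the size and an "invalid" marker. */
struct fake_buf {
	long	bufsize;
	int	invalid;
};

/*
 * Walk the NULL-terminated array from the tail and invalidate the
 * read-ahead buffers that the residual covers.  xbpp[0] is the buffer
 * handed back to the caller and is never touched here.
 */
static void
invalidate_short(struct fake_buf **xbpp, long resid)
{
	int i;

	/* Find the end of the array. */
	for (i = 0; xbpp[i] != NULL; i++)
		continue;
	for (i = i - 1; i > 0; i--) {
		if (xbpp[i]->bufsize <= resid) {
			/* Whole buffer is beyond what was read. */
			resid -= xbpp[i]->bufsize;
			xbpp[i]->invalid = 1;
		} else if (resid > 0) {
			/* Buffer was only partially filled. */
			resid = 0;
			xbpp[i]->invalid = 1;
		} else
			break;
	}
}
```

With four 16k buffers and a residual of 20k, the last buffer is
dropped entirely, the one before it is dropped as partially filled and
the remaining ones are left alone.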
 

Now I said before that the only issue that this diff didn't
fix was with the xbpp[0] aka the buf we return to FFS: if we
have a 64k sized cluster on our filesystem then we've never
created read-ahead bufs and thus this code never runs and we
never account for the b_resid.  However, this is thankfully
not correct as FFS handles short reads itself (except one
small detail...). Here's a chunk from ffs_read:

if (lblktosize(fs, nextlbn) >= DIP(ip, size))
	error = bread(vp, lbn, size, &bp);
else
	error = bread_cluster(vp, lbn, size, &bp);

if (error)
break;

/*
 * We should only get non-zero b_resid when an I/O error
 * has occurred, which should cause us to break above.
 * However, if the short read did not cause an error,
 * then we want to ensure that we do not uiomove bad
 * or uninitialized data.
 */
size -= bp->b_resid;
if (size < xfersize) {
if (size == 0)
break;
xfersize = size;
}
error = uiomove(bp->b_data + blkoffset, xfersize, uio);

As you can see it copies (size - bp->b_resid) bytes into the uio.
That would be OK if b_resid were never larger than 'size'.  But
due to how bread_cluster extends b_bcount to cover all the
additional read-ahead buffers, the transfer in the end
can have a b_resid anywhere in the interval [0, MAXPHYS],
which can be larger than the 'size' that FFS has asked for.

This makes 'size' go negative because it's a signed integer,
and then xfersize picks up that negative value, which gets
converted to a very large unsigned long (size_t) parameter
for uiomove.  And this is bad.  Therefore, additionally I'd
like to assert this in the FFS code itself.  If this is the
way to go, I'll look into other filesystems and propose a
similar check.
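
The arithmetic problem can be reproduced in userland.  The sketch
below is only an illustration: both function names are made up, and
the "safe" variant clamps to zero where the proposed kernel change
would trip an assertion instead.

```c
#include <stddef.h>

/*
 * Returns the number of bytes that may be copied out, or 0 when the
 * residual is larger than the request (the case an assertion would
 * catch).
 */
static size_t
safe_xfersize(int size, long resid, int xfersize)
{
	if (resid < 0 || resid > size)
		return (0);
	size -= (int)resid;
	if (size < xfersize)
		xfersize = size;
	return ((size_t)xfersize);
}

/* Without the check: a negative int turns into a huge size_t. */
static size_t
unchecked_xfersize(int size, long resid)
{
	size -= (int)resid;
	return ((size_t)size);
}
```

With a 16k request and a 20k residual the unchecked version ends up
asking for a copy length close to the maximum size_t value.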

diff --git sys/ufs/ffs/ffs_vnops.c sys/ufs/ffs/ffs_vnops.c
index 160e187820f..56c222612a2 100644
--- sys/ufs/ffs/ffs_vnops.c
+++ sys/ufs/ffs/ffs_vnops.c
@@ -244,10 +244,11 @@ ffs_read(void *v)
 * has occurred, which should cause us to break above.
 * However, if the short read did not cause an error,
 * then we want to ensure that we do not uiomove bad
 * or uninitialized data.
 */
+   KASSERT(bp->b_resid <= size);
size -= bp->b_resid;
if (size < xfersize) {
if (size == 0)
break;
xfersize = size;


So to make it clear: I'd like to commit both changes and
if that's something we agree upon, I'll look into other
filesystems and make sure that they implement similar
assertions.

Opinions?



Fail adding the queue for an interface that doesn't exist

2017-06-07 Thread Mike Belopuhov
This might not be the fix we want in the long run, but it surely
prevents frustration when making a typo in the interface name.

As reported by Sebastien Marie and claudio@.

OK?

diff --git sys/net/pf_ioctl.c sys/net/pf_ioctl.c
index 43cccdb2efa..c563a439c45 100644
--- sys/net/pf_ioctl.c
+++ sys/net/pf_ioctl.c
@@ -582,11 +582,11 @@ pf_ifp2q(struct pf_queue_if *list, struct ifnet *ifp)
 int
 pf_create_queues(void)
 {
struct pf_queuespec *q;
struct ifnet*ifp;
-   struct pf_queue_if  *list = NULL, *qif;
+   struct pf_queue_if  *list = NULL, *qif;
int  error;
 
/*
 * Find root queues and allocate traffic conditioner
 * private data for these interfaces
@@ -1128,20 +1128,19 @@ pfioctl(dev_t dev, u_long cmd, caddr_t addr, int flags, struct proc *p)
bcopy(&q->queue, qs, sizeof(*qs));
qs->qid = pf_qname2qid(qs->qname, 1);
if (qs->parent[0] && (qs->parent_qid =
pf_qname2qid(qs->parent, 0)) == 0) {
pool_put(&pf_queue_pl, qs);
-   error = ESRCH;
+   error = EINVAL;
break;
}
qs->kif = pfi_kif_get(qs->ifname);
-   if (qs->kif == NULL) {
+   if (qs->kif == NULL || qs->kif->pfik_ifp == NULL) {
pool_put(&pf_queue_pl, qs);
-   error = ESRCH;
+   error = EINVAL;
break;
}
-   /* XXX resolve bw percentage specs */
pfi_kif_ref(qs->kif, PFI_KIF_REF_RULE);
 
TAILQ_INSERT_TAIL(pf_queues_inactive, qs, entries);
 
break;
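
The extra condition boils down to the check sketched here.  The types
and the function name are stand-ins for this example; the assumption is
that a kif entry can exist for a name that has no interface behind it,
which is what the added pfik_ifp test in the diff guards against.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct ifnet and struct pfi_kif. */
struct fake_ifnet {
	int	unit;
};

struct fake_kif {
	struct fake_ifnet	*ifp;	/* NULL if no such interface */
};

/* Returns 0 if a queue may be attached, EINVAL otherwise. */
static int
queue_check_kif(const struct fake_kif *kif)
{
	if (kif == NULL || kif->ifp == NULL)
		return (EINVAL);
	return (0);
}
```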



vic: stop using ifq_deq_rollback

2017-06-07 Thread Mike Belopuhov
Hi,

This is a straightforward diff moving the invariant checks before
the dequeue operation.

OK?

diff --git sys/dev/pci/if_vic.c sys/dev/pci/if_vic.c
index e34a1aa4f27..bc1e600d8bc 100644
--- sys/dev/pci/if_vic.c
+++ sys/dev/pci/if_vic.c
@@ -1049,43 +1049,35 @@ vic_start(struct ifnet *ifp)
if (VIC_TXURN(sc)) {
ifq_set_oactive(&ifp->if_snd);
break;
}
 
-   m = ifq_deq_begin(&ifp->if_snd);
-   if (m == NULL)
-   break;
-
idx = sc->sc_data->vd_tx_nextidx;
if (idx >= sc->sc_data->vd_tx_length) {
-   ifq_deq_rollback(&ifp->if_snd, m);
printf("%s: tx idx is corrupt\n", DEVNAME(sc));
ifp->if_oerrors++;
break;
}
 
txd = &sc->sc_txq[idx];
txb = &sc->sc_txbuf[idx];
 
if (txb->txb_m != NULL) {
-   ifq_deq_rollback(&ifp->if_snd, m);
printf("%s: tx ring is corrupt\n", DEVNAME(sc));
sc->sc_data->vd_tx_stopped = 1;
ifp->if_oerrors++;
break;
}
 
-   /*
-* we're committed to sending it now. if we cant map it into
-* dma memory then we drop it.
-*/
-   ifq_deq_commit(&ifp->if_snd, m);
+   m = ifq_dequeue(&ifp->if_snd);
+   if (m == NULL)
+   break;
+
if (vic_load_txb(sc, txb, m) != 0) {
m_freem(m);
ifp->if_oerrors++;
-   /* continue? */
-   break;
+   continue;
}
 
 #if NBPFILTER > 0
if (ifp->if_bpf)
bpf_mtap(ifp->if_bpf, txb->txb_m, BPF_DIRECTION_OUT);



Re: tweak txp to avoid ifq_deq_begin/commit/rollback

2017-06-05 Thread Mike Belopuhov
On Wed, May 31, 2017 at 20:40 +0200, Mike Belopuhov wrote:
> According to the FreeBSD driver, txp(4) is not setting up its TX
> ring correctly.  FreeBSD driver uses up to 16 fragments, while we
> use up to 252 which is suspicious.
> 
> This gets us in line with FreeBSD, introduces goodness of m_defrag
> and removes pesky if_deq_* thingies.
> 
> Does anyone still have the hardware (3com 3CR900 Typhoon) to test?
> OK's are welcome.
>

Any OKs? Tests? Should I go ahead with this?

> diff --git sys/dev/pci/if_txp.c sys/dev/pci/if_txp.c
> index deede70e9de..1aed06765c0 100644
> --- sys/dev/pci/if_txp.c
> +++ sys/dev/pci/if_txp.c
> @@ -883,12 +883,12 @@ txp_alloc_rings(struct txp_softc *sc)
>   sc->sc_txhir.r_desc = (struct txp_tx_desc *)sc->sc_txhiring_dma.dma_vaddr;
>   sc->sc_txhir.r_cons = sc->sc_txhir.r_prod = sc->sc_txhir.r_cnt = 0;
>   sc->sc_txhir.r_off = &sc->sc_hostvar->hv_tx_hi_desc_read_idx;
>   for (i = 0; i < TX_ENTRIES; i++) {
>   if (bus_dmamap_create(sc->sc_dmat, TXP_MAX_PKTLEN,
> - TX_ENTRIES - 4, TXP_MAX_SEGLEN, 0,
> - BUS_DMA_NOWAIT, &sc->sc_txd[i].sd_map) != 0) {
> + TXP_MAXTXSEGS, MCLBYTES, 0, BUS_DMA_NOWAIT,
> + &sc->sc_txd[i].sd_map) != 0) {
>   for (j = 0; j < i; j++) {
>   bus_dmamap_destroy(sc->sc_dmat,
>   sc->sc_txd[j].sd_map);
>   sc->sc_txd[j].sd_map = NULL;
>   }
> @@ -1261,57 +1261,48 @@ txp_start(struct ifnet *ifp)
>   struct txp_softc *sc = ifp->if_softc;
>   struct txp_tx_ring *r = &sc->sc_txhir;
>   struct txp_tx_desc *txd;
>   int txdidx;
>   struct txp_frag_desc *fxd;
> - struct mbuf *m, *mnew;
> + struct mbuf *m;
>   struct txp_swdesc *sd;
>   u_int32_t firstprod, firstcnt, prod, cnt, i;
>  
>   if (!(ifp->if_flags & IFF_RUNNING) || ifq_is_oactive(&ifp->if_snd))
>   return;
>  
>   prod = r->r_prod;
>   cnt = r->r_cnt;
>  
>   while (1) {
> - m = ifq_deq_begin(&ifp->if_snd);
> + if (cnt >= TX_ENTRIES - TXP_MAXTXSEGS - 4)
> + goto oactive;
> +
> + m = ifq_dequeue(&ifp->if_snd);
>   if (m == NULL)
>   break;
> - mnew = NULL;
>  
>   firstprod = prod;
>   firstcnt = cnt;
>  
>   sd = sc->sc_txd + prod;
>   sd->sd_mbuf = m;
>  
> - if (bus_dmamap_load_mbuf(sc->sc_dmat, sd->sd_map, m,
> + switch (bus_dmamap_load_mbuf(sc->sc_dmat, sd->sd_map, m,
>   BUS_DMA_NOWAIT)) {
> - MGETHDR(mnew, M_DONTWAIT, MT_DATA);
> - if (mnew == NULL)
> - goto oactive1;
> - if (m->m_pkthdr.len > MHLEN) {
> - MCLGET(mnew, M_DONTWAIT);
> - if ((mnew->m_flags & M_EXT) == 0) {
> - m_freem(mnew);
> - goto oactive1;
> - }
> - }
> - m_copydata(m, 0, m->m_pkthdr.len, mtod(mnew, caddr_t));
> - mnew->m_pkthdr.len = mnew->m_len = m->m_pkthdr.len;
> - ifq_deq_commit(&ifp->if_snd, m);
> + case 0:
> + break;
> + case EFBIG:
> + if (m_defrag(m, M_DONTWAIT) == 0 &&
> + bus_dmamap_load_mbuf(sc->sc_dmat, sd->sd_map, m,
> + BUS_DMA_NOWAIT) == 0)
> + break;
> + default:
>   m_freem(m);
> - m = mnew;
> - if (bus_dmamap_load_mbuf(sc->sc_dmat, sd->sd_map, m,
> - BUS_DMA_NOWAIT))
> - goto oactive1;
> + continue;
>   }
>  
> - if ((TX_ENTRIES - cnt) < 4)
> - goto oactive;
> -
>   txd = r->r_desc + prod;
>   txdidx = prod;
>   txd->tx_flags = TX_FLAGS_TYPE_DATA;
>   txd->tx_numdesc = 0;
>   txd->tx_addrlo = 0;
> @@ -1321,13 +1312,10 @@ txp_start(struct ifnet *ifp)
>   txd->tx_numdesc = sd->sd_map->dm_nsegs;
>  
>   if (++prod == TX_ENTRIES)
>   prod

Re: iked: add a tag macro for EAP identity

2017-06-01 Thread Mike Belopuhov
On 1 June 2017 at 10:57, Stuart Henderson  wrote:
>
> I have an iked VPN box that needs to restrict access to certain
> resources by user. For connections using a client cert this can be
> done by using PF tags based on the ID from the cert, but this
> falls short for EAP.
>
> This diff adds an $eapid macro that can be used instead. If eapid
> isn't set (non-EAP connection) it just skips expanding the macro.
>
> OK?
>
> (I'd really like per-user IP address setting, but this gets the
> job done in a minimal way.. :)

LGTM, OK mikeb


Re: tweak txp to avoid ifq_deq_begin/commit/rollback

2017-05-31 Thread Mike Belopuhov
On Wed, May 31, 2017 at 20:40 +0200, Mike Belopuhov wrote:
> According to the FreeBSD driver, txp(4) is not setting up its TX
> ring correctly.  FreeBSD driver uses up to 16 fragments, while we
> use up to 252 which is suspicious.
> 
> This gets us in line with FreeBSD, introduces goodness of m_defrag
> and removes pesky if_deq_* thingies.
> 
> Does anyone still have the hardware (3com 3CR900 Typhoon) to test?
> OK's are welcome.
>

Forgot to mention, this "goto oactive" should never happen
because of the check at the start of the loop, but I'm not
brave enough to just ditch it right now.
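
For what it's worth, the claim can be checked outside the kernel.  The
following is a minimal userland sketch, assuming TX_ENTRIES is 256 and
TXP_MAXTXSEGS is 16.  Neither value is shown in the diff: 256 is
inferred from "up to 252" being TX_ENTRIES - 4, and 16 is the FreeBSD
fragment limit mentioned in the original mail.  The helper name is made
up for illustration.

```c
#include <stdbool.h>

/* Assumed values, not taken from the diff. */
#define TX_ENTRIES	256
#define TXP_MAXTXSEGS	16

/*
 * Hypothetical helper: returns true if the per-fragment check
 * "++cnt >= TX_ENTRIES - 4" could ever fire for a packet that
 * already passed the check at the top of the loop.
 */
bool
txp_inner_check_can_fire(void)
{
	unsigned int start, nsegs, cnt, i;

	/* Every cnt that gets past "cnt >= TX_ENTRIES - TXP_MAXTXSEGS - 4". */
	for (start = 0; start < TX_ENTRIES - TXP_MAXTXSEGS - 4; start++) {
		/* A loaded DMA map has between 1 and TXP_MAXTXSEGS segments. */
		for (nsegs = 1; nsegs <= TXP_MAXTXSEGS; nsegs++) {
			cnt = start;
			for (i = 0; i < nsegs; i++) {
				if (++cnt >= TX_ENTRIES - 4)
					return true;
			}
		}
	}
	return false;
}
```

Under these assumptions the worst case is cnt = 235 at the top of the
loop plus 16 fragments, which gives 251, one short of 252, so the
function returns false.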

> @@ -1351,10 +1339,12 @@ txp_start(struct ifnet *ifp)
>   for (i = 0; i < sd->sd_map->dm_nsegs; i++) {
>   if (++cnt >= (TX_ENTRIES - 4)) {
>   bus_dmamap_sync(sc->sc_dmat, sd->sd_map,
>   0, sd->sd_map->dm_mapsize,
>   BUS_DMASYNC_POSTWRITE);
> + bus_dmamap_unload(sc->sc_dmat, sd->sd_map);
> + m_freem(m);
>   goto oactive;
>   }
>  
>   fxd->frag_flags = FRAG_FLAGS_TYPE_FRAG |
>   FRAG_FLAGS_VALID;
[...]
> @@ -1424,13 +1407,10 @@ txp_start(struct ifnet *ifp)
>   r->r_prod = prod;
>   r->r_cnt = cnt;
>   return;
>  
>  oactive:
> - bus_dmamap_unload(sc->sc_dmat, sd->sd_map);
> -oactive1:
> - ifq_deq_rollback(&ifp->if_snd, m);
>   ifq_set_oactive(&ifp->if_snd);
>   r->r_prod = firstprod;
>   r->r_cnt = firstcnt;
>  }
>  



tweak txp to avoid ifq_deq_begin/commit/rollback

2017-05-31 Thread Mike Belopuhov
According to the FreeBSD driver, txp(4) is not setting up its TX
ring correctly.  FreeBSD driver uses up to 16 fragments, while we
use up to 252 which is suspicious.

This gets us in line with FreeBSD, introduces goodness of m_defrag
and removes pesky if_deq_* thingies.

Does anyone still have the hardware (3com 3CR900 Typhoon) to test?
OK's are welcome.
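
The m_defrag fallback used in the diff boils down to "try to load, on
EFBIG compact the chain and retry once, otherwise drop".  Here is a
userland sketch of just that control flow.  Everything in it is a
stand-in: fake_chain, fake_dmamap_load() and fake_defrag() are
hypothetical toys that only mimic the return conventions of
bus_dmamap_load_mbuf() and m_defrag(), and both constants are assumed.

```c
#include <errno.h>
#include <stddef.h>

#define MAXTXSEGS	16	/* assumed segment limit */
#define CLUSTER_SIZE	2048	/* toy limit standing in for MCLBYTES */

/* Toy stand-in for an mbuf chain: only length and fragment count. */
struct fake_chain {
	size_t	len;
	int	nfrags;
};

/* Stand-in for bus_dmamap_load_mbuf(): EFBIG when too fragmented. */
static int
fake_dmamap_load(const struct fake_chain *c)
{
	return (c->nfrags > MAXTXSEGS) ? EFBIG : 0;
}

/* Stand-in for m_defrag(): collapse into one buffer if it fits. */
static int
fake_defrag(struct fake_chain *c)
{
	if (c->len > CLUSTER_SIZE)
		return ENOBUFS;
	c->nfrags = 1;
	return 0;
}

/* Returns 0 if the packet can be sent, 1 if it has to be dropped. */
int
tx_load(struct fake_chain *c)
{
	switch (fake_dmamap_load(c)) {
	case 0:
		break;
	case EFBIG:	/* too fragmented: compact and retry once */
		if (fake_defrag(c) == 0 && fake_dmamap_load(c) == 0)
			break;
		/* FALLTHROUGH */
	default:
		return 1;
	}
	return 0;
}
```

The point of the switch is that every failure ends in the same place,
so the caller only has to free the packet and move on to the next one.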

diff --git sys/dev/pci/if_txp.c sys/dev/pci/if_txp.c
index deede70e9de..1aed06765c0 100644
--- sys/dev/pci/if_txp.c
+++ sys/dev/pci/if_txp.c
@@ -883,12 +883,12 @@ txp_alloc_rings(struct txp_softc *sc)
sc->sc_txhir.r_desc = (struct txp_tx_desc *)sc->sc_txhiring_dma.dma_vaddr;
sc->sc_txhir.r_cons = sc->sc_txhir.r_prod = sc->sc_txhir.r_cnt = 0;
sc->sc_txhir.r_off = &sc->sc_hostvar->hv_tx_hi_desc_read_idx;
for (i = 0; i < TX_ENTRIES; i++) {
if (bus_dmamap_create(sc->sc_dmat, TXP_MAX_PKTLEN,
-   TX_ENTRIES - 4, TXP_MAX_SEGLEN, 0,
-   BUS_DMA_NOWAIT, &sc->sc_txd[i].sd_map) != 0) {
+   TXP_MAXTXSEGS, MCLBYTES, 0, BUS_DMA_NOWAIT,
+   &sc->sc_txd[i].sd_map) != 0) {
for (j = 0; j < i; j++) {
bus_dmamap_destroy(sc->sc_dmat,
sc->sc_txd[j].sd_map);
sc->sc_txd[j].sd_map = NULL;
}
@@ -1261,57 +1261,48 @@ txp_start(struct ifnet *ifp)
struct txp_softc *sc = ifp->if_softc;
struct txp_tx_ring *r = &sc->sc_txhir;
struct txp_tx_desc *txd;
int txdidx;
struct txp_frag_desc *fxd;
-   struct mbuf *m, *mnew;
+   struct mbuf *m;
struct txp_swdesc *sd;
u_int32_t firstprod, firstcnt, prod, cnt, i;
 
if (!(ifp->if_flags & IFF_RUNNING) || ifq_is_oactive(&ifp->if_snd))
return;
 
prod = r->r_prod;
cnt = r->r_cnt;
 
while (1) {
-   m = ifq_deq_begin(&ifp->if_snd);
+   if (cnt >= TX_ENTRIES - TXP_MAXTXSEGS - 4)
+   goto oactive;
+
+   m = ifq_dequeue(&ifp->if_snd);
if (m == NULL)
break;
-   mnew = NULL;
 
firstprod = prod;
firstcnt = cnt;
 
sd = sc->sc_txd + prod;
sd->sd_mbuf = m;
 
-   if (bus_dmamap_load_mbuf(sc->sc_dmat, sd->sd_map, m,
+   switch (bus_dmamap_load_mbuf(sc->sc_dmat, sd->sd_map, m,
BUS_DMA_NOWAIT)) {
-   MGETHDR(mnew, M_DONTWAIT, MT_DATA);
-   if (mnew == NULL)
-   goto oactive1;
-   if (m->m_pkthdr.len > MHLEN) {
-   MCLGET(mnew, M_DONTWAIT);
-   if ((mnew->m_flags & M_EXT) == 0) {
-   m_freem(mnew);
-   goto oactive1;
-   }
-   }
-   m_copydata(m, 0, m->m_pkthdr.len, mtod(mnew, caddr_t));
-   mnew->m_pkthdr.len = mnew->m_len = m->m_pkthdr.len;
-   ifq_deq_commit(&ifp->if_snd, m);
+   case 0:
+   break;
+   case EFBIG:
+   if (m_defrag(m, M_DONTWAIT) == 0 &&
+   bus_dmamap_load_mbuf(sc->sc_dmat, sd->sd_map, m,
+   BUS_DMA_NOWAIT) == 0)
+   break;
+   default:
m_freem(m);
-   m = mnew;
-   if (bus_dmamap_load_mbuf(sc->sc_dmat, sd->sd_map, m,
-   BUS_DMA_NOWAIT))
-   goto oactive1;
+   continue;
}
 
-   if ((TX_ENTRIES - cnt) < 4)
-   goto oactive;
-
txd = r->r_desc + prod;
txdidx = prod;
txd->tx_flags = TX_FLAGS_TYPE_DATA;
txd->tx_numdesc = 0;
txd->tx_addrlo = 0;
@@ -1321,13 +1312,10 @@ txp_start(struct ifnet *ifp)
txd->tx_numdesc = sd->sd_map->dm_nsegs;
 
if (++prod == TX_ENTRIES)
prod = 0;
 
-   if (++cnt >= (TX_ENTRIES - 4))
-   goto oactive;
-
 #if NVLAN > 0
if (m->m_flags & M_VLANTAG) {
txd->tx_pflags = TX_PFLAGS_VLAN |
(htons(m->m_pkthdr.ether_vtag) << TX_PFLAGS_VLANTAG_S);
}
@@ -1351,10 +1339,12 @@ txp_start(struct ifnet *ifp)
for (i = 0; i < sd->sd_map->dm_nsegs; i++) {
if (++cnt >= (TX_ENTRIES - 4)) {
bus_dmamap_sync(sc->sc_dmat, sd->sd_map,
0, sd->sd_map->dm_mapsize,

Re: tweak msk to avoid ifq_deq_begin/commit/rollback

2017-05-31 Thread Mike Belopuhov
On Wed, May 31, 2017 at 10:28 +1000, David Gwynne wrote:
> ie, do the space check before trying to dequeue an mbuf.
> 
> this also moves it to using m_defrag.
>

Thanks, this looks good.

Forgot to mention that you can remove the
/* now we are committed to transmit the packet */
comment from both sk and msk as it doesn't reveal any sacred
truths anymore.

Same as with sk, I've got no real opinion regarding adding
BUS_DMA_STREAMING, but otherwise I'm OK.
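
To spell out the loop shape the diff moves to (check for worst-case
ring space first, then dequeue, then either send or drop), here is a
userland toy.  It is only a sketch: struct txq, struct pkt, RING_CNT
and NTXSEG are all invented for the example and do not correspond to
the real msk(4) structures or constants.

```c
#include <stddef.h>

#define RING_CNT	32	/* hypothetical ring size */
#define NTXSEG		4	/* hypothetical max segments per packet */

struct pkt {
	int	nsegs;		/* <= 0 models a packet that fails to load */
};

struct txq {
	struct pkt	*pkts;		/* stand-in for the send queue */
	size_t		 head, len;
	int		 ring_used;	/* descriptors currently in use */
	int		 oactive;	/* set when the ring is (nearly) full */
	int		 sent, dropped;
};

/* Returns 1 if at least one packet was posted to the ring. */
int
toy_start(struct txq *q)
{
	struct pkt *p;
	int post = 0;

	for (;;) {
		/* Worst-case space check happens before the dequeue. */
		if (q->ring_used + (NTXSEG * 2) + 1 > RING_CNT) {
			q->oactive = 1;
			break;
		}

		if (q->head == q->len)
			break;			/* queue is empty */
		p = &q->pkts[q->head++];

		/* Once dequeued, a packet is either sent or dropped. */
		if (p->nsegs <= 0 || p->nsegs > NTXSEG) {
			q->dropped++;
			continue;
		}

		q->ring_used += p->nsegs * 2;
		q->sent++;
		post = 1;
	}
	return post;
}
```

Because the space check is pessimistic, nothing ever has to be put back
on the queue, which is what makes the begin/commit/rollback dance
unnecessary.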

> i dont have an msk plugged in and i dont know how to use the overdrive
> 1000 i have here. if someone could test and ok this, it would be
> great.
>
> Index: if_msk.c
> ===
> RCS file: /cvs/src/sys/dev/pci/if_msk.c,v
> retrieving revision 1.127
> diff -u -p -r1.127 if_msk.c
> --- if_msk.c  10 Apr 2017 02:15:54 -  1.127
> +++ if_msk.c  31 May 2017 00:27:04 -
> @@ -1489,31 +1489,20 @@ msk_encap(struct sk_if_softc *sc_if, str
>  
>   cur = frag = *txidx;
>  
> -#ifdef MSK_DEBUG
> - if (mskdebug >= 2)
> - msk_dump_mbuf(m_head);
> -#endif
> -
> - /*
> -  * Start packing the mbufs in this chain into
> -  * the fragment pointers. Stop when we run out
> -  * of fragments or hit the end of the mbuf chain.
> -  */
> - if (bus_dmamap_load_mbuf(sc->sc_dmatag, txmap, m_head,
> - BUS_DMA_NOWAIT)) {
> - DPRINTFN(2, ("msk_encap: dmamap failed\n"));
> - return (ENOBUFS);
> - }
> -
> - entries = txmap->dm_nsegs * 2;
> - if (entries > (MSK_TX_RING_CNT - sc_if->sk_cdata.sk_tx_cnt - 2)) {
> - DPRINTFN(2, ("msk_encap: too few descriptors free\n"));
> - bus_dmamap_unload(sc->sc_dmatag, txmap);
> - return (ENOBUFS);
> + switch (bus_dmamap_load_mbuf(sc->sc_dmatag, txmap, m_head,
> + BUS_DMA_STREAMING | BUS_DMA_NOWAIT)) {
> + case 0:
> + break;
> + case EFBIG: /* mbuf chain is too fragmented */
> + if (m_defrag(m_head, M_DONTWAIT) == 0 &&
> + bus_dmamap_load_mbuf(sc->sc_dmatag, txmap, m_head,
> + BUS_DMA_STREAMING | BUS_DMA_NOWAIT) == 0)
> + break;
> + /* FALLTHROUGH */
> + default:
> + return (1);
>   }
>  
> - DPRINTFN(2, ("msk_encap: dm_nsegs=%d\n", txmap->dm_nsegs));
> -
>   /* Sync the DMA map. */
>   bus_dmamap_sync(sc->sc_dmatag, txmap, 0, txmap->dm_mapsize,
>   BUS_DMASYNC_PREWRITE);
> @@ -1585,12 +1574,16 @@ msk_start(struct ifnet *ifp)
>   struct sk_if_softc  *sc_if = ifp->if_softc;
>   struct mbuf *m_head = NULL;
>   u_int32_t   idx = sc_if->sk_cdata.sk_tx_prod;
> - int pkts = 0;
> + int post = 0;
>  
> - DPRINTFN(2, ("msk_start\n"));
> + for (;;) {
> + if (sc_if->sk_cdata.sk_tx_cnt + (SK_NTXSEG * 2) + 1 >
> + MSK_TX_RING_CNT) {
> + ifq_set_oactive(&ifp->if_snd);
> + break;
> + }
>  
> - while (sc_if->sk_cdata.sk_tx_chain[idx].sk_mbuf == NULL) {
> - m_head = ifq_deq_begin(&ifp->if_snd);
> + m_head = ifq_dequeue(&ifp->if_snd);
>   if (m_head == NULL)
>   break;
>  
> @@ -1600,14 +1593,11 @@ msk_start(struct ifnet *ifp)
>* for the NIC to drain the ring.
>*/
>   if (msk_encap(sc_if, m_head, &idx)) {
> - ifq_deq_rollback(&ifp->if_snd, m_head);
> - ifq_set_oactive(&ifp->if_snd);
> - break;
> + m_freem(m_head);
> + continue;
>   }
>  
>   /* now we are committed to transmit the packet */
> - ifq_deq_commit(&ifp->if_snd, m_head);
> - pkts++;
>  
>   /*
>* If there's a BPF listener, bounce a copy of this frame
> @@ -1617,18 +1607,17 @@ msk_start(struct ifnet *ifp)
>   if (ifp->if_bpf)
>   bpf_mtap(ifp->if_bpf, m_head, BPF_DIRECTION_OUT);
>  #endif
> + post = 1;
>   }
> - if (pkts == 0)
> + if (post == 0)
>   return;
>  
>   /* Transmit */
> - if (idx != sc_if->sk_cdata.sk_tx_prod) {
> - sc_if->sk_cdata.sk_tx_prod = idx;
> - SK_IF_WRITE_2(sc_if, 1, SK_TXQA1_Y2_PREF_PUTIDX, idx);
> + sc_if->sk_cdata.sk_tx_prod = idx;
> + SK_IF_WRITE_2(sc_if, 1, SK_TXQA1_Y2_PREF_PUTIDX, idx);
>  
> - /* Set a timeout in case the chip goes out to lunch. */
> - ifp->if_timer = MSK_TX_TIMEOUT;
> - }
> + /* Set a timeout in case the chip goes out to lunch. */
> + ifp->if_timer = MSK_TX_TIMEOUT;
>  }
>  
>  void
> 



Re: kqueue EV_RECEIPT and EV_DISPATCH

2017-05-31 Thread Mike Belopuhov
On Wed, May 31, 2017 at 08:37 +0200, Jan Schreiber wrote:
> Hi,
> 
> I recently stumbled upon software that relies on EV_RECEIPT and
> EV_DISPATCH to be available as flags. It also showed up as dependency
> for a Rust crate.
> FreeBSD has it since 8.1 and OSX since 10.5.
> Patch is below.
> 
> mike@ looked through, thanks a lot!
>

That was me (mikeb@).

> Jan
> 
> Index: sys/kern/kern_event.c
> ===
> RCS file: /cvs/src/sys/kern/kern_event.c,v
> retrieving revision 1.78
> diff -u -p -u -r1.78 kern_event.c
> --- sys/kern/kern_event.c 11 Feb 2017 19:51:06 -  1.78
> +++ sys/kern/kern_event.c 30 May 2017 22:38:49 -
> @@ -512,7 +512,7 @@ sys_kevent(struct proc *p, void *v, regi
>   kevp = &kq->kq_kev[i];
>   kevp->flags &= ~EV_SYSFLAGS;
>   error = kqueue_register(kq, kevp, p);
> - if (error) {
> + if (error || (kevp->flags & EV_RECEIPT)) {
>   if (SCARG(uap, nevents) != 0) {
>   kevp->flags = EV_ERROR;
>   kevp->data = error;
> @@ -788,9 +788,13 @@ start:
>   kn->kn_fop->f_detach(kn);
>   knote_drop(kn, p, p->p_fd);
>   s = splhigh();
> - } else if (kn->kn_flags & EV_CLEAR) {
> - kn->kn_data = 0;
> - kn->kn_fflags = 0;
> + } else if (kn->kn_flags & (EV_CLEAR | EV_DISPATCH)) {
> + if (kn->kn_flags & EV_CLEAR) {
> + kn->kn_data = 0;
> + kn->kn_fflags = 0;
> + }
> + if (kn->kn_flags & EV_DISPATCH)
> + kn->kn_status |= KN_DISABLED;
>   kn->kn_status &= ~(KN_QUEUED | KN_ACTIVE);
>   kq->kq_count--;
>   } else {
> Index: sys/sys/event.h
> ===
> RCS file: /cvs/src/sys/sys/event.h,v
> retrieving revision 1.23
> diff -u -p -u -r1.23 event.h
> --- sys/sys/event.h   24 Sep 2016 18:39:17 -  1.23
> +++ sys/sys/event.h   30 May 2017 22:31:04 -
> @@ -68,6 +68,8 @@ struct kevent {
>  /* flags */
>  #define EV_ONESHOT   0x0010  /* only report one occurrence */
>  #define EV_CLEAR 0x0020  /* clear event state after reporting */
> +#define EV_RECEIPT   0x0040  /* force EV_ERROR on success, data=0 */
> +#define EV_DISPATCH  0x0080  /* disable event after reporting */
>  
>  #define EV_SYSFLAGS  0xF000  /* reserved by system */
>  #define EV_FLAG1 0x2000  /* filter-specific flag */
> Index: lib/libc/sys/kqueue.2
> ===
> RCS file: /cvs/src/lib/libc/sys/kqueue.2,v
> retrieving revision 1.33
> diff -u -p -u -r1.33 kqueue.2
> --- lib/libc/sys/kqueue.2 13 Aug 2016 17:05:02 -  1.33
> +++ lib/libc/sys/kqueue.2 30 May 2017 22:30:29 -
> @@ -184,10 +184,25 @@ Disable the event so
>  .Fn kevent
>  will not return it.
>  The filter itself is not disabled.
> +.It Dv EV_DISPATCH
> +Disable the event source immediately after delivery of an event.
> +See
> +.Dv EV_DISABLE
> +above.
>  .It Dv EV_DELETE
>  Removes the event from the kqueue.
>  Events which are attached to file descriptors are automatically deleted
>  on the last close of the descriptor.
> +.It Dv EV_RECEIPT
> +Causes
> +.Fn kevent
> +to return with
> +.Dv EV_ERROR
> +set without draining any pending events after updating events in the kqueue.
> +When a filter is successfully added the
> +.Va data
> +field will be zero.
> +This flag is useful for making bulk changes to a kqueue.
>  .It Dv EV_ONESHOT
>  Causes the event to return only the first occurrence of the filter
>  being triggered.
> 

We've tweaked the description for EV_RECEIPT a bit because the FreeBSD
version didn't make a whole lot of sense.
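
As a quick illustration of what the kern_event.c hunk does to a knote
after an event has been delivered, here is a userland model of that
"else if" branch.  It is a sketch only: struct toy_knote and
toy_after_delivery() are made up, the EV_* values are the ones from the
event.h hunk, and the KN_* values are assumed for the example.

```c
/* Flag values as in the sys/sys/event.h hunk above. */
#define EV_CLEAR	0x0020
#define EV_DISPATCH	0x0080

/* Status bits; the numeric values here are assumptions. */
#define KN_ACTIVE	0x01
#define KN_QUEUED	0x02
#define KN_DISABLED	0x04

struct toy_knote {
	int	flags;
	int	status;
	long	data;
	int	fflags;
};

/*
 * Models the EV_CLEAR/EV_DISPATCH branch of the hunk.
 * Returns 1 if the knote was taken off the queue, 0 otherwise.
 */
int
toy_after_delivery(struct toy_knote *kn)
{
	if (kn->flags & (EV_CLEAR | EV_DISPATCH)) {
		if (kn->flags & EV_CLEAR) {
			/* Only EV_CLEAR resets the event state. */
			kn->data = 0;
			kn->fflags = 0;
		}
		if (kn->flags & EV_DISPATCH)
			kn->status |= KN_DISABLED;
		kn->status &= ~(KN_QUEUED | KN_ACTIVE);
		return 1;
	}
	return 0;
}
```

So EV_DISPATCH on its own leaves data and fflags alone and just parks
the knote as disabled until userland enables it again.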



HFSC and FQ-CoDel integration

2017-05-28 Thread Mike Belopuhov
This is a first stab at HFSC and FQ-CoDel integration by extending the
PF queueing operations (pfq_ops) interface.  With this, FQ-CoDel can
be attached directly to an interface as well as serve as a replacement
for the HFSC queue to improve its characteristics.  In essence, in
many setups (router behind a modem) FQ-CoDel can benefit immensely
from HFSC.  The pf.conf grammar is simply a "queue" statement with an
additional "flows" parameter, e.g.:

  queue rootq on em0 bandwidth 10M flows 1000 default

The default queue limit (50) is inherited from HFSC and can be
adjusted for FQ-CoDel with the same 'qlimit' keyword:

  queue rootq on em0 bandwidth 10M flows 1000 qlimit 300 default

There is, however, a limitation: the 'min' keyword specifying reserved
bandwidth requires knowing ahead of time which packet will be dequeued
next, and thus is not supported with the "flows" specification
(at least for now).

The polishing is still in progress, but I'd like to continue in tree
if that's possible.  I won't mind if someone does an independent test
of course.

---
 sbin/pfctl/parse.y|  10 ---
 sbin/pfctl/pfctl_parser.c |   5 +-
 sbin/pfctl/pfctl_queue.c  |   3 +-
 sys/conf/files|   4 +-
 sys/net/fq_codel.c| 146 -
 sys/net/hfsc.c| 180 +-
 sys/net/pf_ioctl.c|  13 +++-
 sys/net/pfvar.h   |  18 +++--
 usr.bin/systat/pftop.c|  19 +++--
 9 files changed, 301 insertions(+), 97 deletions(-)

diff --git sbin/pfctl/parse.y sbin/pfctl/parse.y
index 63aaafeeea5..47deb3db3d8 100644
--- sbin/pfctl/parse.y
+++ sbin/pfctl/parse.y
@@ -1326,15 +1326,10 @@ queue_opts_l: queue_opts_l queue_opt
 queue_opt  : BANDWIDTH scspec optscs   {
if (queue_opts.marker & QOM_BWSPEC) {
yyerror("bandwidth cannot be respecified");
YYERROR;
}
-   if (queue_opts.marker & QOM_FLOWS) {
-   yyerror("bandwidth cannot be specified for "
-   "a flow queue");
-   YYERROR;
-   }
queue_opts.marker |= QOM_BWSPEC;
queue_opts.linkshare = $2;
queue_opts.realtime= $3.realtime;
queue_opts.upperlimit = $3.upperlimit;
}
@@ -1369,15 +1364,10 @@ queue_opt   : BANDWIDTH scspec optscs   {
| FLOWS NUMBER  {
if (queue_opts.marker & QOM_FLOWS) {
yyerror("number of flows cannot be respecified");
YYERROR;
}
-   if (queue_opts.marker & QOM_BWSPEC) {
-   yyerror("bandwidth cannot be specified for "
-   "a flow queue");
-   YYERROR;
-   }
if ($2 < 1 || $2 > 32767) {
yyerror("number of flows out of range: "
"max 32767");
YYERROR;
}
diff --git sbin/pfctl/pfctl_parser.c sbin/pfctl/pfctl_parser.c
index a69acb2e5b2..d9f63da99b0 100644
--- sbin/pfctl/pfctl_parser.c
+++ sbin/pfctl/pfctl_parser.c
@@ -1199,21 +1199,22 @@ print_queuespec(struct pf_queuespec *q)
printf("queue %s", q->qname);
if (q->parent[0])
printf(" parent %s", q->parent);
else if (q->ifname[0])
printf(" on %s", q->ifname);
-   if (q->flags & PFQS_FLOWQUEUE) {
+   if (q->flowqueue.flows > 0) {
printf(" flows %u", q->flowqueue.flows);
if (q->flowqueue.quantum > 0)
printf(" quantum %u", q->flowqueue.quantum);
if (q->flowqueue.interval > 0)
printf(" interval %ums",
q->flowqueue.interval / 100);
if (q->flowqueue.target > 0)
printf(" target %ums",
q->flowqueue.target / 100);
-   } else {
+   }
+   if (q->linkshare.m1.absolute || q->linkshare.m2.absolute) {
print_scspec(" bandwidth ", &q->linkshare);
print_scspec(", min ", &q->realtime);
print_scspec(", max ", &q->upperlimit);
}
if (q->flags & PFQS_DEFAULT)
diff --git sbin/pfctl/pfctl_queue.c sbin/pfctl/pfctl_queue.c
index feeeba33f8d..0d1abce36c6 100644
--- sbin/pfctl/pfctl_queue.c
+++ sbin/pfctl/pfctl_queue.c
@@ -210,11 +210,12 @@ pfctl_print_queue_nodestat(int dev, const struct pfctl_queue_node *node)
"dropped pkts: %6llu bytes: %6llu ]\n",
(unsigned long 

Re: memory barriers and atomic instructions

2017-05-23 Thread Mike Belopuhov
On Tue, May 23, 2017 at 17:41 +0200, Mark Kettenis wrote:
> So here is a diff that implements what I proposed recently.  This
> recognizes that atomic instructions on amd64 already include an
> implicit memory barrier and allows us to write optimized code that
> avoids a redundant memory barrier.
> 
> Note that I don't have a use-case for membar_exit_before_atomic() yet;
> I merely added it for symmetry reasons.  I can leave it out if that's
> what people prefer.
> 
> This should allow us to use a generic mutex implementation written in
> C without a significant penalty.
> 
> ok?
>

LGTM, but shouldn't the same thing be done to i386?
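
For anyone following along, this is roughly how I picture the new
macro being used in a mutex written in C.  It is a userland analogy,
not the kernel code: C11 fences stand in for the generic membar_*()
fallbacks, and toy_mutex and its functions are invented for the
example.  On amd64 the diff would reduce membar_enter_after_atomic()
to a compiler barrier because the locked instruction already orders
memory.

```c
#include <stdatomic.h>

/* Stand-ins for the generic fallbacks; full fences in this sketch. */
#define membar_enter_after_atomic()	atomic_thread_fence(memory_order_seq_cst)
#define membar_exit()			atomic_thread_fence(memory_order_seq_cst)

struct toy_mutex {
	atomic_int	locked;
};

/* Returns 1 if the lock was taken, 0 if it was already held. */
int
toy_mtx_enter_try(struct toy_mutex *m)
{
	int expected = 0;

	/* The atomic swap itself carries no ordering here... */
	if (!atomic_compare_exchange_strong_explicit(&m->locked, &expected, 1,
	    memory_order_relaxed, memory_order_relaxed))
		return 0;
	/* ...so the barrier keeps the critical section below it. */
	membar_enter_after_atomic();
	return 1;
}

void
toy_mtx_leave(struct toy_mutex *m)
{
	/* Keep the critical section above the releasing store. */
	membar_exit();
	atomic_store_explicit(&m->locked, 0, memory_order_relaxed);
}
```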

> 
> 
> Index: sys/atomic.h
> ===
> RCS file: /cvs/src/sys/sys/atomic.h,v
> retrieving revision 1.4
> diff -u -p -r1.4 atomic.h
> --- sys/atomic.h  24 Jan 2017 22:22:20 -  1.4
> +++ sys/atomic.h  23 May 2017 15:01:34 -
> @@ -219,4 +219,12 @@ atomic_sub_long_nv(volatile unsigned lon
>  #define membar_sync() __sync_synchronize()
>  #endif
>  
> +#ifndef membar_enter_after_atomic
> +#define membar_enter_after_atomic() membar_enter()
> +#endif
> +
> +#ifndef membar_exit_before_atomic
> +#define membar_exit_before_atomic() membar_exit()
> +#endif
> +
>  #endif /* _SYS_ATOMIC_H_ */
> Index: arch/amd64/include/atomic.h
> ===
> RCS file: /cvs/src/sys/arch/amd64/include/atomic.h,v
> retrieving revision 1.19
> diff -u -p -r1.19 atomic.h
> --- arch/amd64/include/atomic.h   12 May 2017 08:47:03 -  1.19
> +++ arch/amd64/include/atomic.h   23 May 2017 15:02:23 -
> @@ -276,6 +276,9 @@ _atomic_sub_long_nv(volatile unsigned lo
>  #define membar_sync()__membar("")
>  #endif
>  
> +#define membar_enter_after_atomic()  __membar("")
> +#define membar_exit_before_atomic()  __membar("")
> +
>  #ifdef _KERNEL
>  
>  /* virtio needs MP membars even on SP kernels */
> 



Re: UDP sendspace for dlna providing

2017-05-23 Thread Mike Belopuhov
On Tue, May 23, 2017 at 11:51 +0100, Stuart Henderson wrote:
> (replying to an old mail),
> 
> On 2017/03/16 18:07, Claudio Jeker wrote:
> > On Thu, Mar 16, 2017 at 03:46:38PM +0100, Eric JACQUOT wrote:
> > > Hi all,
> > > 
> > > I had some problems with dlna server (minidlna) and a lot of cuts and 
> > > crashes of the client when playing videos.
> > > It seems that the default net.inet.udp.sendspace (9216 by default) 
> > > variable is not suitable. I have increased it as a result of the 
> > > capacities of my network and I no longer have issue of broadcasting 
> > > videos.
> > > It may be time to increase the default value or document all ports based 
> > > on dlna accordingly.
> > > Not having much time for these actions, thank you for giving me the best 
> > > way so that I provide the necessary diffs.
> > > 
> > > Maybe I'm wrong so tell me...
> > > 
> > 
> > Please change minidlna to use a setsockopt() to increase the send buffer.
> > Doing this globally is not the right fix.
> 
> I don't see a sockopt to control udp_sendspace per-socket, am I missing
> something? Nameservers easily run into the default limit too (I've been
> running with increased net.inet.udp.sendspace on nameservers for ages)..
> 

Isn't SO_SNDBUF what you're looking for?  tcpbench(1) does it via -S
for instance.
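
Something along these lines should do it per socket.  This is a
minimal sketch rather than a patch for minidlna: udp_sndbuf_probe() is
a made-up helper that sets SO_SNDBUF on a fresh UDP socket and reads
back what the kernel actually granted.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: open a UDP socket, ask for a send buffer of
 * "bytes" and return the size the kernel reports afterwards, or -1
 * on error.  The socket is closed again; a real program would keep it.
 */
int
udp_sndbuf_probe(int bytes)
{
	int s, eff = 0;
	socklen_t len = sizeof(eff);

	if ((s = socket(AF_INET, SOCK_DGRAM, 0)) == -1)
		return -1;
	if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) == -1 ||
	    getsockopt(s, SOL_SOCKET, SO_SNDBUF, &eff, &len) == -1) {
		close(s);
		return -1;
	}
	close(s);
	return eff;
}
```

The kernel may clamp or adjust the requested size, so reading the value
back with getsockopt() is the only way to know what was applied.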


