Re: Status of Virtual Function driver for Intel 82599 series port?

2023-06-19 Thread Yuichiro NAITO
Hi.

I fixed a bug on Intel 82599. Please refer to following patch. And I found
that Linux 82599 PF driver doesn't allow setting VF MTU size bigger than
PF MTU size. If you set bigger MTU size for ixv interface than PF MTU size,
you will see the following message in dmesg and the ixv interface stops working.

"There is a problem with the PF setup.  It is likely the receive unit for this
 VF will not function correctly."

diff --git a/sys/arch/amd64/conf/GENERIC b/sys/arch/amd64/conf/GENERIC
index c8e4ec8284e..2ad357f9c1b 100644
--- a/sys/arch/amd64/conf/GENERIC
+++ b/sys/arch/amd64/conf/GENERIC
@@ -524,6 +524,7 @@ msk*at mskc?#  each port of 
above
 em*at pci? # Intel Pro/1000 ethernet
 ixgb*  at pci? # Intel Pro/10Gb ethernet
 ix*at pci? # Intel 82598EB 10Gb ethernet
+ixv*   at pci? # Virtual Function of Intel 82598EB
 myx*   at pci? # Myricom Myri-10G 10Gb ethernet
 oce*   at pci? # Emulex OneConnect 10Gb ethernet
 txp*   at pci? # 3com 3CR990
diff --git a/sys/dev/pci/files.pci b/sys/dev/pci/files.pci
index 101ed502e76..72c8b485938 100644
--- a/sys/dev/pci/files.pci
+++ b/sys/dev/pci/files.pci
@@ -350,13 +350,19 @@ file  dev/pci/ixgb_hw.c   ixgb
 # Intel 82598 10GbE
 device ix: ether, ifnet, ifmedia, intrmap, stoeplitz
 attach ix at pci
-file   dev/pci/if_ix.c ix
-file   dev/pci/ixgbe.c ix
-file   dev/pci/ixgbe_82598.c   ix
-file   dev/pci/ixgbe_82599.c   ix
-file   dev/pci/ixgbe_x540.cix
-file   dev/pci/ixgbe_x550.cix
-file   dev/pci/ixgbe_phy.c ix
+file   dev/pci/if_ix.c ix | ixv
+file   dev/pci/ixgbe.c ix | ixv
+file   dev/pci/ixgbe_82598.c   ix | ixv
+file   dev/pci/ixgbe_82599.c   ix | ixv
+file   dev/pci/ixgbe_x540.cix | ixv
+file   dev/pci/ixgbe_x550.cix | ixv
+file   dev/pci/ixgbe_phy.c ix | ixv
+
+# Virtual Function of i82599.
+device ixv: ether, ifnet, ifmedia, intrmap, stoeplitz
+attach ixv at pci
+file   dev/pci/if_ixv.cixv
+file   dev/pci/ixgbe_vf.c  ixv
 
 # Intel Ethernet 700 Series
 device ixl: ether, ifnet, ifmedia, intrmap, stoeplitz
diff --git a/sys/dev/pci/if_ix.c b/sys/dev/pci/if_ix.c
index 98815b51d62..56d8e305dec 100644
--- a/sys/dev/pci/if_ix.c
+++ b/sys/dev/pci/if_ix.c
@@ -507,7 +507,7 @@ ixgbe_start(struct ifqueue *ifq)
 * hardware that this frame is available to transmit.
 */
if (post)
-   IXGBE_WRITE_REG(>hw, IXGBE_TDT(txr->me),
+   IXGBE_WRITE_REG(>hw, txr->tail,
txr->next_avail_desc);
 }
 
@@ -705,7 +705,7 @@ ixgbe_watchdog(struct ifnet * ifp)
for (i = 0; i < sc->num_queues; i++, txr++) {
printf("%s: Queue(%d) tdh = %d, hw tdt = %d\n", ifp->if_xname, 
i,
IXGBE_READ_REG(hw, IXGBE_TDH(i)),
-   IXGBE_READ_REG(hw, IXGBE_TDT(i)));
+   IXGBE_READ_REG(hw, sc->tx_rings[i].tail));
printf("%s: TX(%d) Next TX to Clean = %d\n", ifp->if_xname,
i, txr->next_to_clean);
}
@@ -825,7 +825,7 @@ ixgbe_init(void *arg)
msec_delay(1);
}
IXGBE_WRITE_FLUSH(>hw);
-   IXGBE_WRITE_REG(>hw, IXGBE_RDT(i), rxr->last_desc_filled);
+   IXGBE_WRITE_REG(>hw, rxr[i].tail, rxr->last_desc_filled);
}
 
/* Set up VLAN support and filter */
@@ -2360,9 +2360,12 @@ ixgbe_initialize_transmit_units(struct ix_softc *sc)
IXGBE_WRITE_REG(hw, IXGBE_TDLEN(i),
sc->num_tx_desc * sizeof(struct ixgbe_legacy_tx_desc));
 
+   /* Set Tx Tail register */
+   txr->tail = IXGBE_TDT(i);
+
/* Setup the HW Tx Head and Tail descriptor pointers */
IXGBE_WRITE_REG(hw, IXGBE_TDH(i), 0);
-   IXGBE_WRITE_REG(hw, IXGBE_TDT(i), 0);
+   IXGBE_WRITE_REG(hw, txr->tail, 0);
 
/* Setup Transmit Descriptor Cmd Settings */
txr->txd_cmd = IXGBE_TXD_CMD_IFCS;
@@ -2845,7 +2848,7 @@ ixgbe_rxrefill(void *xrxr)
 
if (ixgbe_rxfill(rxr)) {
/* Advance the Rx Queue "Tail Pointer" */
-   IXGBE_WRITE_REG(>hw, IXGBE_RDT(rxr->me),
+   IXGBE_WRITE_REG(>hw, rxr->tail,
rxr->last_desc_filled);
} else if (if_rxr_inuse(>rx_ring) == 0)
timeout_add(>rx_refill, 1);
@@ -2941,6 +2944,9 @@ ixgbe_initialize_receive_units(struct ix_softc *sc)
srrctl = bufsz | IXGBE_SRRCTL_DESCTYPE_ADV_ONEBUF;
IXGBE_WRITE_REG(hw, IXGBE_SRRCTL(i), srrctl);
 
+   /* Capture Rx Tail index */
+   rxr->tail = 

Re: Status of Virtual Function driver for Intel 82599 series port?

2023-05-26 Thread Yuichiro NAITO
My previous patch is partly rejected for OpenBSD current. Because ixv(4) code
depends on ix(4) that has changed to supported TSO/LRO. I rebased my patch for
OpenBSD current. See the patch at the end of this e-mail.

Thank you, Paul B. Henson! He tested my patch on Linux Qemu and now we have
the knowledge to operate the ixv(4). I will write down the following.

Known Issues and Requirements:

  1. Do not use 'ifconfig lladdr' on ESXi.

 Changing link local address is not permitted on ESXi. If it is changed,
 the interface cannot be usable any more. Need to reboot the VM.

 Link local address can be changed by ESXi user interface.

 On Linux qemu, 'ifconfig lladdr' works.

  2. "pc-q35" emulation is required on Linux Qemu.

 The default chipset emulation "pc-i440fx" doesn't support MSI-X interrupt.
 Ixv(4) requires MSI-X and never work with MSI. So, ixv(4) fails to attach
 with "pc-i440fx" emulation.

  3. Linux ixgbe driver shows "Unhandled Msg 0010" in dmesg.

 Older version of Linux doesn't support GET_LINK_STATE message. Ixv(4)
 cannot see physical link state is changed. I see GET_LINK_STATE message
 has supported by following commit in Linus repository. But I'm not sure
 which distro has this patch.

 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=366fd1000995d4cf64e1a61a0d78a051550b9841

  4. Make sure PF (Primary Function) interface is up.

 While PF interface is down, PF driver stop working and doesn't respond to
 VF (Virtual Function). Ixv interface cannot be brought up.

  5. Linux kernel shows all reset messages from VF in dmesg.

 This is not an error case. Ixv(4) sends reset messages when attaching
 device and the interface is brought up.

  6. Performance

 Recent OpenBSD ix(4) supports TSO. It also improves ixv(4) packet
 transmission performance significantly. TSO is enabled by default. But
 LRO is not supported by VF. There is no way to use LRO in ixv(4).

 Increasing MTU size gives you a chance to improve performance but make
 sure all intermediate routes can handle the MTU size.

diff --git a/sys/arch/amd64/conf/GENERIC b/sys/arch/amd64/conf/GENERIC
index c8e4ec8284e..2ad357f9c1b 100644
--- a/sys/arch/amd64/conf/GENERIC
+++ b/sys/arch/amd64/conf/GENERIC
@@ -524,6 +524,7 @@ msk*at mskc?#  each port of 
above
 em*at pci? # Intel Pro/1000 ethernet
 ixgb*  at pci? # Intel Pro/10Gb ethernet
 ix*at pci? # Intel 82598EB 10Gb ethernet
+ixv*   at pci? # Virtual Function of Intel 82598EB
 myx*   at pci? # Myricom Myri-10G 10Gb ethernet
 oce*   at pci? # Emulex OneConnect 10Gb ethernet
 txp*   at pci? # 3com 3CR990
diff --git a/sys/dev/pci/files.pci b/sys/dev/pci/files.pci
index 101ed502e76..72c8b485938 100644
--- a/sys/dev/pci/files.pci
+++ b/sys/dev/pci/files.pci
@@ -350,13 +350,19 @@ file  dev/pci/ixgb_hw.c   ixgb
 # Intel 82598 10GbE
 device ix: ether, ifnet, ifmedia, intrmap, stoeplitz
 attach ix at pci
-file   dev/pci/if_ix.c ix
-file   dev/pci/ixgbe.c ix
-file   dev/pci/ixgbe_82598.c   ix
-file   dev/pci/ixgbe_82599.c   ix
-file   dev/pci/ixgbe_x540.cix
-file   dev/pci/ixgbe_x550.cix
-file   dev/pci/ixgbe_phy.c ix
+file   dev/pci/if_ix.c ix | ixv
+file   dev/pci/ixgbe.c ix | ixv
+file   dev/pci/ixgbe_82598.c   ix | ixv
+file   dev/pci/ixgbe_82599.c   ix | ixv
+file   dev/pci/ixgbe_x540.cix | ixv
+file   dev/pci/ixgbe_x550.cix | ixv
+file   dev/pci/ixgbe_phy.c ix | ixv
+
+# Virtual Function of i82599.
+device ixv: ether, ifnet, ifmedia, intrmap, stoeplitz
+attach ixv at pci
+file   dev/pci/if_ixv.cixv
+file   dev/pci/ixgbe_vf.c  ixv
 
 # Intel Ethernet 700 Series
 device ixl: ether, ifnet, ifmedia, intrmap, stoeplitz
diff --git a/sys/dev/pci/if_ix.c b/sys/dev/pci/if_ix.c
index 98815b51d62..56d8e305dec 100644
--- a/sys/dev/pci/if_ix.c
+++ b/sys/dev/pci/if_ix.c
@@ -507,7 +507,7 @@ ixgbe_start(struct ifqueue *ifq)
 * hardware that this frame is available to transmit.
 */
if (post)
-   IXGBE_WRITE_REG(>hw, IXGBE_TDT(txr->me),
+   IXGBE_WRITE_REG(>hw, txr->tail,
txr->next_avail_desc);
 }
 
@@ -705,7 +705,7 @@ ixgbe_watchdog(struct ifnet * ifp)
for (i = 0; i < sc->num_queues; i++, txr++) {
printf("%s: Queue(%d) tdh = %d, hw tdt = %d\n", ifp->if_xname, 
i,
IXGBE_READ_REG(hw, IXGBE_TDH(i)),
-   IXGBE_READ_REG(hw, IXGBE_TDT(i)));
+   IXGBE_READ_REG(hw, sc->tx_rings[i].tail));
printf("%s: TX(%d) Next TX to Clean = 

Re: Status of Virtual Function driver for Intel 82599 series port?

2023-05-15 Thread Yuichiro NAITO
Hi.

Paul notified me that my previous patch was broken. I changed my mailer
to Mew on Emacs that sends a plain text mail without any modification.

I re-send my patch by Mew. Please try following patch. Thank you!

diff --git a/sys/arch/amd64/conf/GENERIC b/sys/arch/amd64/conf/GENERIC
index c8e4ec8284e..2ad357f9c1b 100644
--- a/sys/arch/amd64/conf/GENERIC
+++ b/sys/arch/amd64/conf/GENERIC
@@ -524,6 +524,7 @@ msk*at mskc?#  each port of 
above
 em*at pci? # Intel Pro/1000 ethernet
 ixgb*  at pci? # Intel Pro/10Gb ethernet
 ix*at pci? # Intel 82598EB 10Gb ethernet
+ixv*   at pci? # Virtual Function of Intel 82598EB
 myx*   at pci? # Myricom Myri-10G 10Gb ethernet
 oce*   at pci? # Emulex OneConnect 10Gb ethernet
 txp*   at pci? # 3com 3CR990
diff --git a/sys/dev/pci/files.pci b/sys/dev/pci/files.pci
index 101ed502e76..72c8b485938 100644
--- a/sys/dev/pci/files.pci
+++ b/sys/dev/pci/files.pci
@@ -350,13 +350,19 @@ file  dev/pci/ixgb_hw.c   ixgb
 # Intel 82598 10GbE
 device ix: ether, ifnet, ifmedia, intrmap, stoeplitz
 attach ix at pci
-file   dev/pci/if_ix.c ix
-file   dev/pci/ixgbe.c ix
-file   dev/pci/ixgbe_82598.c   ix
-file   dev/pci/ixgbe_82599.c   ix
-file   dev/pci/ixgbe_x540.cix
-file   dev/pci/ixgbe_x550.cix
-file   dev/pci/ixgbe_phy.c ix
+file   dev/pci/if_ix.c ix | ixv
+file   dev/pci/ixgbe.c ix | ixv
+file   dev/pci/ixgbe_82598.c   ix | ixv
+file   dev/pci/ixgbe_82599.c   ix | ixv
+file   dev/pci/ixgbe_x540.cix | ixv
+file   dev/pci/ixgbe_x550.cix | ixv
+file   dev/pci/ixgbe_phy.c ix | ixv
+
+# Virtual Function of i82599.
+device ixv: ether, ifnet, ifmedia, intrmap, stoeplitz
+attach ixv at pci
+file   dev/pci/if_ixv.cixv
+file   dev/pci/ixgbe_vf.c  ixv
 
 # Intel Ethernet 700 Series
 device ixl: ether, ifnet, ifmedia, intrmap, stoeplitz
diff --git a/sys/dev/pci/if_ix.c b/sys/dev/pci/if_ix.c
index 870c3349fb3..24397a01a9d 100644
--- a/sys/dev/pci/if_ix.c
+++ b/sys/dev/pci/if_ix.c
@@ -507,7 +507,7 @@ ixgbe_start(struct ifqueue *ifq)
 * hardware that this frame is available to transmit.
 */
if (post)
-   IXGBE_WRITE_REG(>hw, IXGBE_TDT(txr->me),
+   IXGBE_WRITE_REG(>hw, txr->tail,
txr->next_avail_desc);
 }
 
@@ -705,7 +705,7 @@ ixgbe_watchdog(struct ifnet * ifp)
for (i = 0; i < sc->num_queues; i++, txr++) {
printf("%s: Queue(%d) tdh = %d, hw tdt = %d\n", ifp->if_xname, 
i,
IXGBE_READ_REG(hw, IXGBE_TDH(i)),
-   IXGBE_READ_REG(hw, IXGBE_TDT(i)));
+   IXGBE_READ_REG(hw, sc->tx_rings[i].tail));
printf("%s: TX(%d) Next TX to Clean = %d\n", ifp->if_xname,
i, txr->next_to_clean);
}
@@ -825,7 +825,7 @@ ixgbe_init(void *arg)
msec_delay(1);
}
IXGBE_WRITE_FLUSH(>hw);
-   IXGBE_WRITE_REG(>hw, IXGBE_RDT(i), rxr->last_desc_filled);
+   IXGBE_WRITE_REG(>hw, rxr[i].tail, rxr->last_desc_filled);
}
 
/* Set up VLAN support and filter */
@@ -2358,9 +2358,12 @@ ixgbe_initialize_transmit_units(struct ix_softc *sc)
IXGBE_WRITE_REG(hw, IXGBE_TDLEN(i),
sc->num_tx_desc * sizeof(struct ixgbe_legacy_tx_desc));
 
+   /* Set Tx Tail register */
+   txr->tail = IXGBE_TDT(i);
+
/* Setup the HW Tx Head and Tail descriptor pointers */
IXGBE_WRITE_REG(hw, IXGBE_TDH(i), 0);
-   IXGBE_WRITE_REG(hw, IXGBE_TDT(i), 0);
+   IXGBE_WRITE_REG(hw, txr->tail, 0);
 
/* Setup Transmit Descriptor Cmd Settings */
txr->txd_cmd = IXGBE_TXD_CMD_IFCS;
@@ -2808,7 +2811,7 @@ ixgbe_rxrefill(void *xrxr)
 
if (ixgbe_rxfill(rxr)) {
/* Advance the Rx Queue "Tail Pointer" */
-   IXGBE_WRITE_REG(>hw, IXGBE_RDT(rxr->me),
+   IXGBE_WRITE_REG(>hw, rxr->tail,
rxr->last_desc_filled);
} else if (if_rxr_inuse(>rx_ring) == 0)
timeout_add(>rx_refill, 1);
@@ -2902,6 +2905,9 @@ ixgbe_initialize_receive_units(struct ix_softc *sc)
srrctl = bufsz | IXGBE_SRRCTL_DESCTYPE_ADV_ONEBUF;
IXGBE_WRITE_REG(hw, IXGBE_SRRCTL(i), srrctl);
 
+   /* Capture Rx Tail index */
+   rxr->tail = IXGBE_RDT(i);
+
if (ISSET(ifp->if_xflags, IFXF_TSO)) {
rdrxctl = IXGBE_READ_REG(>hw, IXGBE_RSCCTL(i));
 
@@ -2914,7 +2920,7 @@ ixgbe_initialize_receive_units(struct 

Re: seperate LRO/TSO flags

2023-05-11 Thread Yuichiro NAITO

On 5/11/23 03:58, Todd C. Miller wrote:

On Wed, 10 May 2023 20:55:24 +0200, Jan Klemkow wrote:


For tcprecvoffload and ix(4) it's not possible to enable/disable it per
address family.  Its just one flag for the hardware.

For tcpsendoffload its possible, but I won't do that till its necessary.

Why would you want to differentiate the address families here?


I was mostly just curious as I see FreeBSD seems to support this.
That made we wonder if there is hardware that only supports offloading
for IPv4.


I know that em driver only supports TSO4 on FreeBSD.
It's not enabled by default but users can enable it.

See the following code for detail.

https://cgit.freebsd.org/src/tree/sys/dev/e1000/if_em.c#n900

--
Yuichiro NAITO (naito.yuich...@gmail.com)



Re: Status of Virtual Function driver for Intel 82599 series port?

2023-05-10 Thread Yuichiro NAITO

On 4/30/23 04:04, Paul B. Henson wrote:
> https://marc.info/?l=openbsd-tech=167160269008839=2
>
> The port doesn't appear to be merged yet, and there's no activity on the
> thread since December, so I was just wondering what the current status
> of this is? If there's any further testing needed I have a system I can
> spin up test vm's on.

Hi, I believe my patch works on ESXi and Linux qemu. I'm looking for a
reviewer. It's welcome to testing and reporting the results if it works
or fails to work.

> Also, I tried downloading the patch from the mailing list archive and it
> seems corrupted and won't apply. Could someone possibly forward me
> directly the latest version of the patch?

I rebased my patch for OpenBSD current. See the below patch of this mail.

Install procedure:

  ```
  Make sure that OpenBSD current source is written in /usr/src.
  ```

  # cd /usr/src
  # patch -p1 -i /somewhre/ixv.patch
  # cd sys/dev/pci
  # make
  # cd /usr/src/sys/arch/amd64/compile/GENERIC.MP
  # make obj
  # make config
  # make -j 4
  # make install
  # reboot

Known limitations:

  1. Do not use 'ifconfig lladdr' on ESXi.

 Changing link local address is not permitted on ESXi.
 If it is changed, the interface cannot be usable any more.
 Need to reboot the VM.

 Link local address can be changed by ESXi user interface.

 On Linux qemu, 'ifconfig lladdr' works.

  2. MTU 9000 is required for 10Gbps performance.

 The default MTU size 1500 is too small for 10Gbps link for now.

diff --git a/sys/arch/amd64/conf/GENERIC b/sys/arch/amd64/conf/GENERIC
index c8e4ec8284e..2ad357f9c1b 100644
--- a/sys/arch/amd64/conf/GENERIC
+++ b/sys/arch/amd64/conf/GENERIC
@@ -524,6 +524,7 @@ msk*at mskc?#  each port of 
above
 em*at pci? # Intel Pro/1000 ethernet
 ixgb*  at pci? # Intel Pro/10Gb ethernet
 ix*at pci? # Intel 82598EB 10Gb ethernet
+ixv*   at pci? # Virtual Function of Intel 82598EB
 myx*   at pci? # Myricom Myri-10G 10Gb ethernet
 oce*   at pci? # Emulex OneConnect 10Gb ethernet
 txp*   at pci? # 3com 3CR990
diff --git a/sys/dev/pci/files.pci b/sys/dev/pci/files.pci
index 101ed502e76..72c8b485938 100644
--- a/sys/dev/pci/files.pci
+++ b/sys/dev/pci/files.pci
@@ -350,13 +350,19 @@ file  dev/pci/ixgb_hw.c   ixgb
 # Intel 82598 10GbE
 device ix: ether, ifnet, ifmedia, intrmap, stoeplitz
 attach ix at pci
-file   dev/pci/if_ix.c ix
-file   dev/pci/ixgbe.c ix
-file   dev/pci/ixgbe_82598.c   ix
-file   dev/pci/ixgbe_82599.c   ix
-file   dev/pci/ixgbe_x540.cix
-file   dev/pci/ixgbe_x550.cix
-file   dev/pci/ixgbe_phy.c ix
+file   dev/pci/if_ix.c ix | ixv
+file   dev/pci/ixgbe.c ix | ixv
+file   dev/pci/ixgbe_82598.c   ix | ixv
+file   dev/pci/ixgbe_82599.c   ix | ixv
+file   dev/pci/ixgbe_x540.cix | ixv
+file   dev/pci/ixgbe_x550.cix | ixv
+file   dev/pci/ixgbe_phy.c ix | ixv
+
+# Virtual Function of i82599.
+device ixv: ether, ifnet, ifmedia, intrmap, stoeplitz
+attach ixv at pci
+file   dev/pci/if_ixv.cixv
+file   dev/pci/ixgbe_vf.c  ixv

 # Intel Ethernet 700 Series
 device ixl: ether, ifnet, ifmedia, intrmap, stoeplitz
diff --git a/sys/dev/pci/if_ix.c b/sys/dev/pci/if_ix.c
index 870c3349fb3..24397a01a9d 100644
--- a/sys/dev/pci/if_ix.c
+++ b/sys/dev/pci/if_ix.c
@@ -507,7 +507,7 @@ ixgbe_start(struct ifqueue *ifq)
 * hardware that this frame is available to transmit.
 */
if (post)
-   IXGBE_WRITE_REG(>hw, IXGBE_TDT(txr->me),
+   IXGBE_WRITE_REG(>hw, txr->tail,
txr->next_avail_desc);
 }

@@ -705,7 +705,7 @@ ixgbe_watchdog(struct ifnet * ifp)
for (i = 0; i < sc->num_queues; i++, txr++) {
printf("%s: Queue(%d) tdh = %d, hw tdt = %d\n", ifp->if_xname, 
i,
IXGBE_READ_REG(hw, IXGBE_TDH(i)),
-   IXGBE_READ_REG(hw, IXGBE_TDT(i)));
+   IXGBE_READ_REG(hw, sc->tx_rings[i].tail));
printf("%s: TX(%d) Next TX to Clean = %d\n", ifp->if_xname,
i, txr->next_to_clean);
}
@@ -825,7 +825,7 @@ ixgbe_init(void *arg)
msec_delay(1);
}
IXGBE_WRITE_FLUSH(>hw);
-   IXGBE_WRITE_REG(>hw, IXGBE_RDT(i), rxr->last_desc_filled);
+   IXGBE_WRITE_REG(>hw, rxr[i].tail, rxr->last_desc_filled);
}

/* Set up VLAN support and filter */
@@ -2358,9 +2358,12 @@ ixgbe_initialize_transmit_units(struct ix_softc *sc)
IXGBE_WRITE_REG(hw, IXGBE_TDLEN(i),
sc->num_tx_desc * sizeof(struct ixgbe_legacy_tx_desc));

+  

atactl(8): 'readattr' subcommand quits silently.

2023-04-26 Thread Yuichiro NAITO

Hi, Our IIJ team is planning to run OpenBSD on HPE ProLiant DL20 Gen10 server.
Yasuoka-san and I are testing on this host and found that `atactl sd0 readattr`
doesn't show any messages, just quits silently.

'sd0' disk is shown in dmesg as follows.

```
sd0 at scsibus1 targ 0 lun 0:  naa.5002538e01735e7f
sd0: 457862MB, 512 bytes/sector, 937703088 sectors, thin
```

To see what's happening to atactl(8), we added printf code like this.

```
diff --git a/sbin/atactl/atactl.c b/sbin/atactl/atactl.c
index 85dfced8c9a..2babdafffc7 100644
--- a/sbin/atactl/atactl.c
+++ b/sbin/atactl/atactl.c
@@ -1660,10 +1660,11 @@ device_attr(int argc, char *argv[])
if (attr_val.revision != attr_thr.revision) {
/*
 * Non standard vendor implementation.
 * Return, since we don't know how to use this.
 */
+   printf("atactl: revision mismatch SMART_READ=0x%02x 
SMART_THREASHOLD=0x%02x\n", attr_val.revision, attr_thr.revision);
return;
}

attr = attr_val.attribute;
thr = attr_thr.threshold;

```

And got following message.

```
$ doas ./atactl sd0 readattr
atactl: revision mismatch SMART_READ=0x01 SMART_THREASHOLD=0x00

```

These 2 revisions of 'attr_val' and 'attr_thr' are different on this disk.
The comment says that it's wrong vendor implementation but I can see
'smartctl' shows the attributes as follows. NetBSD's atactl doesn't
see these 2 revisions are same but checks each checksum is valid.

```
SMART Attributes Data Structure revision number: 1
  Vendor Specific SMART Attributes with Thresholds:
  ID# ATTRIBUTE_NAME  FLAG VALUE WORST THRESH TYPE  UPDATED  
WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f   200   200   002Pre-fail  Always  
 -   0
5 Reallocated_Sector_Ct   0x0033   100   100   005Pre-fail  Always  
 -   0
9 Power_On_Hours  0x0032   099   099   000Old_age   Always  
 -   3138
  173 Unknown_Attribute   0x0033   099   099   005Pre-fail  Always  
 -   8
  175 Program_Fail_Count_Chip 0x0033   100   100   001Pre-fail  Always  
 -   0
  180 Unused_Rsvd_Blk_Cnt_Tot 0x003b   200   200   097Pre-fail  Always  
 -   0
  194 Temperature_Celsius 0x0022   080   075   000Old_age   Always  
 -   20
  196 Reallocated_Event_Count 0x0033   100   100   005Pre-fail  Always  
 -   0
  202 Unknown_SSD_Attribute   0x0033   100   100   010Pre-fail  Always  
 -   0
```

So, I propose to change the method to verify checksums instead of revision 
checks.

```
diff --git a/sbin/atactl/atactl.c b/sbin/atactl/atactl.c
index 85dfced8c9a..1f77460ce3d 100644
--- a/sbin/atactl/atactl.c
+++ b/sbin/atactl/atactl.c
@@ -1657,13 +1657,11 @@ device_attr(int argc, char *argv[])
req.datalen = sizeof(attr_thr);
ata_command();

-   if (attr_val.revision != attr_thr.revision) {
-   /*
-* Non standard vendor implementation.
-* Return, since we don't know how to use this.
-*/
-   return;
-   }
+   if (smart_cksum((u_int8_t *)_val, sizeof(attr_val)) != 0)
+   errx(1, "Checksum mismatch (attr_val)");
+
+   if (smart_cksum((u_int8_t *)_thr, sizeof(attr_thr)) != 0)
+   errx(1, "Checksum mismatch (attr_thr)");

attr = attr_val.attribute;
thr = attr_thr.threshold;

```

With this patch, we got following result.

```
Attributes table revision: 1
  ID  Attribute name  Threshold   Value   Raw
1 Raw Read Error Rate2200 0x
5 Reallocated Sector Count   5100 0x
9 Power-On Hours Count   0 99 0x0c42
  173 Unknown5 99 0x0008
  175 Unknown1100 0x
  180 Unknown   97200 0x
  194 Temperature0 80 0x0014
  196 Reallocation Event Count   5100 0x
  202 Data Address Mark Errors  10100 0x
```

Is the proposed patch OK?

--
Yuichiro NAITO (naito.yuich...@gmail.com)



Re: ixv(4): porting Virtual Function driver for Intel 82599 series.

2022-12-20 Thread Yuichiro NAITO

I received an issue that ixv(4) doesn't detect linkdowns in personal email.
When linkdown happens, the PF (Primary Function) driver interrupts all VFs
(Virtual Function) via `mailbox`. But ixv(4) doesn't receive the interrupt.

According to NetBSD, this problem has been fixed by following commits.

http://www.nerv.org/netbsd/changeset.cgi?id=20171003T031229Z.85b67885ef5a26fe6778d9f35f3142c5e14d1de8#src/sys/dev/pci/ixgbe/ixv.c

http://www.nerv.org/netbsd/changeset.cgi?id=20171004T110321Z.7b02467efe35f10f476f3239901730fce3bdb494#src/sys/dev/pci/ixgbe/ixv.c

I ported these commits and updated my patch.

diff --git a/sys/arch/amd64/conf/GENERIC b/sys/arch/amd64/conf/GENERIC
index 3563126b2a1..e421c38e95f 100644
--- a/sys/arch/amd64/conf/GENERIC
+++ b/sys/arch/amd64/conf/GENERIC
@@ -522,6 +522,7 @@ msk*at mskc?#  each port of 
above
 em*at pci? # Intel Pro/1000 ethernet
 ixgb*  at pci? # Intel Pro/10Gb ethernet
 ix*at pci? # Intel 82598EB 10Gb ethernet
+ixv*   at pci? # Virtual Function of Intel 82598EB
 myx*   at pci? # Myricom Myri-10G 10Gb ethernet
 oce*   at pci? # Emulex OneConnect 10Gb ethernet
 txp*   at pci? # 3com 3CR990
diff --git a/sys/dev/pci/files.pci b/sys/dev/pci/files.pci
index a803dc9b659..7d6402e524e 100644
--- a/sys/dev/pci/files.pci
+++ b/sys/dev/pci/files.pci
@@ -350,13 +350,19 @@ file  dev/pci/ixgb_hw.c   ixgb
 # Intel 82598 10GbE
 device ix: ether, ifnet, ifmedia, intrmap, stoeplitz
 attach ix at pci
-file   dev/pci/if_ix.c ix
-file   dev/pci/ixgbe.c ix
-file   dev/pci/ixgbe_82598.c   ix
-file   dev/pci/ixgbe_82599.c   ix
-file   dev/pci/ixgbe_x540.cix
-file   dev/pci/ixgbe_x550.cix
-file   dev/pci/ixgbe_phy.c ix
+file   dev/pci/if_ix.c ix | ixv
+file   dev/pci/ixgbe.c ix | ixv
+file   dev/pci/ixgbe_82598.c   ix | ixv
+file   dev/pci/ixgbe_82599.c   ix | ixv
+file   dev/pci/ixgbe_x540.cix | ixv
+file   dev/pci/ixgbe_x550.cix | ixv
+file   dev/pci/ixgbe_phy.c ix | ixv
+
+# Virtual Function of i82599.
+device ixv: ether, ifnet, ifmedia, intrmap, stoeplitz
+attach ixv at pci
+file   dev/pci/if_ixv.cixv
+file   dev/pci/ixgbe_vf.c  ixv

 # Intel Ethernet 700 Series
 device ixl: ether, ifnet, ifmedia, intrmap, stoeplitz
diff --git a/sys/dev/pci/if_ix.c b/sys/dev/pci/if_ix.c
index b59ec28d9f1..4fb01d85778 100644
--- a/sys/dev/pci/if_ix.c
+++ b/sys/dev/pci/if_ix.c
@@ -507,7 +507,7 @@ ixgbe_start(struct ifqueue *ifq)
 * hardware that this frame is available to transmit.
 */
if (post)
-   IXGBE_WRITE_REG(>hw, IXGBE_TDT(txr->me),
+   IXGBE_WRITE_REG(>hw, txr->tail,
txr->next_avail_desc);
 }

@@ -706,7 +706,7 @@ ixgbe_watchdog(struct ifnet * ifp)
for (i = 0; i < sc->num_queues; i++, txr++) {
printf("%s: Queue(%d) tdh = %d, hw tdt = %d\n", ifp->if_xname, 
i,
IXGBE_READ_REG(hw, IXGBE_TDH(i)),
-   IXGBE_READ_REG(hw, IXGBE_TDT(i)));
+   IXGBE_READ_REG(hw, sc->tx_rings[i].tail));
printf("%s: TX(%d) Next TX to Clean = %d\n", ifp->if_xname,
i, txr->next_to_clean);
}
@@ -826,7 +826,7 @@ ixgbe_init(void *arg)
msec_delay(1);
}
IXGBE_WRITE_FLUSH(>hw);
-   IXGBE_WRITE_REG(>hw, IXGBE_RDT(i), rxr->last_desc_filled);
+   IXGBE_WRITE_REG(>hw, rxr[i].tail, rxr->last_desc_filled);
}

/* Set up VLAN support and filter */
@@ -2359,9 +2359,12 @@ ixgbe_initialize_transmit_units(struct ix_softc *sc)
IXGBE_WRITE_REG(hw, IXGBE_TDLEN(i),
sc->num_tx_desc * sizeof(struct ixgbe_legacy_tx_desc));

+   /* Set Tx Tail register */
+   txr->tail = IXGBE_TDT(i);
+
/* Setup the HW Tx Head and Tail descriptor pointers */
IXGBE_WRITE_REG(hw, IXGBE_TDH(i), 0);
-   IXGBE_WRITE_REG(hw, IXGBE_TDT(i), 0);
+   IXGBE_WRITE_REG(hw, txr->tail, 0);

/* Setup Transmit Descriptor Cmd Settings */
txr->txd_cmd = IXGBE_TXD_CMD_IFCS;
@@ -2834,7 +2837,7 @@ ixgbe_rxrefill(void *xrxr)

if (ixgbe_rxfill(rxr)) {
/* Advance the Rx Queue "Tail Pointer" */
-   IXGBE_WRITE_REG(>hw, IXGBE_RDT(rxr->me),
+   IXGBE_WRITE_REG(>hw, rxr->tail,
rxr->last_desc_filled);
} else if (if_rxr_inuse(>rx_ring) == 0)
timeout_add(>rx_refill, 1);
@@ -2928,6 +2931,9 @@ ixgbe_initialize_receive_units(struct ix_softc *sc)
srrctl = 

Re: ixv(4): porting Virtual Function driver for Intel 82599 series.

2022-12-04 Thread Yuichiro NAITO

I updated my patch to remove AIM code. I can't see any performance improvement
with AIM enabled. I ran packet forwarding test with Cisco TRex traffic
Generator (*1). AIM doesn't improves performance in this test.

*1: https://trex-tgn.cisco.com/

With my latest patch, I've got following maximum packet forwarding rate in
stateless benchmark (stl/bench.py) without packet loss. It starts to drop
packets over these rate.

Stateless benchmark (stl/bench.py)

| N cpu cores | mbps |
|-|--|
|  2  | 3390 |
|  4  | 4250 |
|  6  | 4340 |

Without PF (Packet Filter) improves performance as follows.

| N cpu cores | mbps |
|-|--|
|  2  | 4740 |
|  4  | 7190 |
|  6  | 8410 |

All tests run in MTU 1500. Host hardware spec is shown as below.

Vendor: FUJITSU PRIMERGY RX2530 M2
CPU: 24 CPUs x Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Memory: 192 GB
NIC: Intel X540-AT2
ESXi version: 7.0U3

OK?

diff --git a/sys/arch/amd64/conf/GENERIC b/sys/arch/amd64/conf/GENERIC
index 3563126b2a1..e421c38e95f 100644
--- a/sys/arch/amd64/conf/GENERIC
+++ b/sys/arch/amd64/conf/GENERIC
@@ -522,6 +522,7 @@ msk*at mskc?#  each port of 
above
 em*at pci? # Intel Pro/1000 ethernet
 ixgb*  at pci? # Intel Pro/10Gb ethernet
 ix*at pci? # Intel 82598EB 10Gb ethernet
+ixv*   at pci? # Virtual Function of Intel 82598EB
 myx*   at pci? # Myricom Myri-10G 10Gb ethernet
 oce*   at pci? # Emulex OneConnect 10Gb ethernet
 txp*   at pci? # 3com 3CR990
diff --git a/sys/dev/pci/files.pci b/sys/dev/pci/files.pci
index a803dc9b659..7d6402e524e 100644
--- a/sys/dev/pci/files.pci
+++ b/sys/dev/pci/files.pci
@@ -350,13 +350,19 @@ file  dev/pci/ixgb_hw.c   ixgb
 # Intel 82598 10GbE
 device ix: ether, ifnet, ifmedia, intrmap, stoeplitz
 attach ix at pci
-file   dev/pci/if_ix.c ix
-file   dev/pci/ixgbe.c ix
-file   dev/pci/ixgbe_82598.c   ix
-file   dev/pci/ixgbe_82599.c   ix
-file   dev/pci/ixgbe_x540.cix
-file   dev/pci/ixgbe_x550.cix
-file   dev/pci/ixgbe_phy.c ix
+file   dev/pci/if_ix.c ix | ixv
+file   dev/pci/ixgbe.c ix | ixv
+file   dev/pci/ixgbe_82598.c   ix | ixv
+file   dev/pci/ixgbe_82599.c   ix | ixv
+file   dev/pci/ixgbe_x540.cix | ixv
+file   dev/pci/ixgbe_x550.cix | ixv
+file   dev/pci/ixgbe_phy.c ix | ixv
+
+# Virtual Function of i82599.
+device ixv: ether, ifnet, ifmedia, intrmap, stoeplitz
+attach ixv at pci
+file   dev/pci/if_ixv.cixv
+file   dev/pci/ixgbe_vf.c  ixv

 # Intel Ethernet 700 Series
 device ixl: ether, ifnet, ifmedia, intrmap, stoeplitz
diff --git a/sys/dev/pci/if_ix.c b/sys/dev/pci/if_ix.c
index b59ec28d9f1..4fb01d85778 100644
--- a/sys/dev/pci/if_ix.c
+++ b/sys/dev/pci/if_ix.c
@@ -507,7 +507,7 @@ ixgbe_start(struct ifqueue *ifq)
 * hardware that this frame is available to transmit.
 */
if (post)
-   IXGBE_WRITE_REG(>hw, IXGBE_TDT(txr->me),
+   IXGBE_WRITE_REG(>hw, txr->tail,
txr->next_avail_desc);
 }

@@ -706,7 +706,7 @@ ixgbe_watchdog(struct ifnet * ifp)
for (i = 0; i < sc->num_queues; i++, txr++) {
printf("%s: Queue(%d) tdh = %d, hw tdt = %d\n", ifp->if_xname, 
i,
IXGBE_READ_REG(hw, IXGBE_TDH(i)),
-   IXGBE_READ_REG(hw, IXGBE_TDT(i)));
+   IXGBE_READ_REG(hw, sc->tx_rings[i].tail));
printf("%s: TX(%d) Next TX to Clean = %d\n", ifp->if_xname,
i, txr->next_to_clean);
}
@@ -826,7 +826,7 @@ ixgbe_init(void *arg)
msec_delay(1);
}
IXGBE_WRITE_FLUSH(>hw);
-   IXGBE_WRITE_REG(>hw, IXGBE_RDT(i), rxr->last_desc_filled);
+   IXGBE_WRITE_REG(>hw, rxr[i].tail, rxr->last_desc_filled);
}

/* Set up VLAN support and filter */
@@ -2359,9 +2359,12 @@ ixgbe_initialize_transmit_units(struct ix_softc *sc)
IXGBE_WRITE_REG(hw, IXGBE_TDLEN(i),
sc->num_tx_desc * sizeof(struct ixgbe_legacy_tx_desc));

+   /* Set Tx Tail register */
+   txr->tail = IXGBE_TDT(i);
+
/* Setup the HW Tx Head and Tail descriptor pointers */
IXGBE_WRITE_REG(hw, IXGBE_TDH(i), 0);
-   IXGBE_WRITE_REG(hw, IXGBE_TDT(i), 0);
+   IXGBE_WRITE_REG(hw, txr->tail, 0);

/* Setup Transmit Descriptor Cmd Settings */
txr->txd_cmd = IXGBE_TXD_CMD_IFCS;
@@ -2834,7 +2837,7 @@ ixgbe_rxrefill(void *xrxr)

if (ixgbe_rxfill(rxr)) {
/* Advance the Rx 

Re: ixv(4): porting Virtual Function driver for Intel 82599 series.

2022-11-20 Thread Yuichiro NAITO

Thank you so much for reviewing my patch.

On 11/21/22 08:02, Christian Weisgerber wrote:

Yuichiro NAITO:


+static void
+ixv_set_multi(struct ix_softc *sc)
+{

[...]

+   if ((ifp->if_flags & IFF_PROMISC) == 0 && ac->ac_multirangecnt <= 0 &&
+ ac->ac_multicnt <= MAX_NUM_MULTICAST_ADDRESSES) {
+   ETHER_FIRST_MULTI(step, >arpcom, enm);
+   while (enm != NULL) {
+   bcopy(enm->enm_addrlo,
+ [mcnt * IXGBE_ETH_LENGTH_OF_ADDRESS],
+ IXGBE_ETH_LENGTH_OF_ADDRESS);
+   mcnt++;
+
+   ETHER_NEXT_MULTI(step, enm);
+   }
+
+   update_ptr = mta;
+   sc->hw.mac.ops.update_mc_addr_list(>hw, update_ptr, mcnt,
+  ixv_mc_array_itr, TRUE);
+   }
+
+} /* ixv_set_multi */


This doesn't look right.
There is no handling of ac->ac_multirangecnt > 0 or
mcnt >= MAX_NUM_MULTICAST_ADDRESSES.

Compare ixgb_set_multi() in if_ixgb.c.


Yes, and also 'ixv_set_multi' works similar to 'ixgbe_iff()' in 'if_ix.c'.
Once I found VF cannot write FCTL register, I removed IXGBE_WRITE_REG macro
and 'fctrl' value. That was mistake.

Writing FCTL register should be replaced by calling 'update_xcast_mode'
function to tell PF which multicast addresses are used.

I updated 'ixv_set_multi' and catch the 'ac->ac_multirangecnt > 0 or
mcnt >= MAX_NUM_MULTICAST_ADDRESSES' case.


+int32_t ixgbe_update_mc_addr_list_vf(struct ixgbe_hw *hw, uint8_t 
*mc_addr_list,
+uint32_t mc_addr_count, ixgbe_mc_addr_itr 
next,
+bool clear)
+{

[...]

+   /* Each entry in the list uses 1 16 bit word.  We have 30
+* 16 bit words available in our HW msg buffer (minus 1 for the
+* msg type).  That's 30 hash values if we pack 'em right.  If
+* there are more than 30 MC addresses to add then punt the
+* extras for now and then add code to handle more than 30 later.
+* It would be unusual for a server to request that many multi-cast
+* addresses except for in large enterprise network environments.
+*/
+
+   DEBUGOUT1("MC Addr Count = %d\n", mc_addr_count);
+
+   cnt = (mc_addr_count > 30) ? 30 : mc_addr_count;


Should MAX_NUM_MULTICAST_ADDRESSES simply be set to 30?


MAX_NUM_MULTICAST_ADDRESSES is used for ix(4) driver.
So, I introduce IXGBE_MAX_MULTICAST_ADDRESSES_VF as 30.

Increasing this number will never work as intended.
Because multicast addresses are sent to PF (Primary Function) driver via 
'mailbox'.
PF drivers doesn't allow over 30 multicast addresses for now.

For instance, if Linux qemu hosts ixgbe SR-IOV, following Linux driver code
handles 'update_mc_addr_list' request.

drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c:
 362static int ixgbe_set_vf_multicasts(struct ixgbe_adapter *adapter,
 363   u32 *msgbuf, u32 vf)
 364{
 365int entries = (msgbuf[0] & IXGBE_VT_MSGINFO_MASK)
 366   >> IXGBE_VT_MSGINFO_SHIFT;
 367u16 *hash_list = (u16 *)[1];
 368struct vf_data_storage *vfinfo = >vfinfo[vf];
 369struct ixgbe_hw *hw = >hw;
 370int i;
 371u32 vector_bit;
 372u32 vector_reg;
 373u32 mta_reg;
 374u32 vmolr = IXGBE_READ_REG(hw, IXGBE_VMOLR(vf));
 375
 376/* only so many hash values supported */
 377entries = min(entries, IXGBE_MAX_VF_MC_ENTRIES);

Number of multicast addresses is limited by IXGBE_MAX_VF_MC_ENTRIES(= 30).

I'm not sure about ESXi host, because I don't have an access to ESXi source 
code.
But I'm guessing it has the same limitation like this.

And I've received an another issue about FreeBSD VCS ID.
$FreeBSD$ has completely no meaning now.
I also removed it.

Here is my updated ixv patch.

diff --git a/sys/arch/amd64/conf/GENERIC b/sys/arch/amd64/conf/GENERIC
index 3563126b2a1..e421c38e95f 100644
--- a/sys/arch/amd64/conf/GENERIC
+++ b/sys/arch/amd64/conf/GENERIC
@@ -522,6 +522,7 @@ msk*at mskc?#  each port of 
above
 em*at pci? # Intel Pro/1000 ethernet
 ixgb*  at pci? # Intel Pro/10Gb ethernet
 ix*at pci? # Intel 82598EB 10Gb ethernet
+ixv*   at pci? # Virtual Function of Intel 82598EB
 myx*   at pci? # Myricom Myri-10G 10Gb ethernet
 oce*   at pci? # Emulex OneConnect 10Gb ethernet
 txp*   at pci? # 3com 3CR990
diff --git a/sys/dev/pci/files.pci b/sys/dev/pci/files.pci
index a803dc9b659..7d6402e524e 100644
--- a/sys/dev/pci/files.pci
+++ b/sys/dev/pci/files.pci
@@ -350,13 +350,19 @@ file  dev/pci/ixgb_hw.c   ixgb
 # Intel 8

ixv(4): porting Virtual Function driver for Intel 82599 series.

2022-11-17 Thread Yuichiro NAITO

Hi.

I'm porting Intel 82599 series VF driver ixv(4) from Intel (see below link).

https://www.intel.com/content/www/us/en/download/645984/intel-network-adapter-virtual-function-driver-for-pcie-10-gigabit-network-connections-under-freebsd.html

It's written for FreeBSD but doesn't have iflib code.
I think it is easier than porting from FreeBSD src tree.

In my patch, many codes are shared between OpenBSD ix(4) driver.
Especially ring buffer codes are almost re-used and just linked.
Only TDT/RDT register offset is different between ix(4) and ixv(4),
I changed to store the offset into the ring buffers and refer them.

Intel driver has AIM (Adaptive Interrupt Moderation) code.
I also ported it but it's disabled for now.
When I find some evidence that improves performance in the further testing,
I will enable it.

Most of ixv(4) driver codes (if_ixv.c) and hardware control code (ixgbe_vf.c)
are taken from Intel driver. I adjusted kstat codes to see VF registers.

Internal communication (called 'mailbox') between VF and PF (Primary Function)
codes are updated to the Intel driver.

I introduced a new interface that vlan(4) driver passes the vnet id to network
drivers. VF must tell PF which vnet ids are used or VF never receive VLAN 
packets.

I tested my patch X540 VF and X550 VF on ESXi and Linux qemu.
It works fine for me.
I also confirmed ix(4) works on X550 as a regression test.

Is it OK?

diff --git a/sys/arch/amd64/conf/GENERIC b/sys/arch/amd64/conf/GENERIC
index 3563126b2a1..e421c38e95f 100644
--- a/sys/arch/amd64/conf/GENERIC
+++ b/sys/arch/amd64/conf/GENERIC
@@ -522,6 +522,7 @@ msk*at mskc?#  each port of 
above
 em*at pci? # Intel Pro/1000 ethernet
 ixgb*  at pci? # Intel Pro/10Gb ethernet
 ix*at pci? # Intel 82598EB 10Gb ethernet
+ixv*   at pci? # Virtual Function of Intel 82598EB
 myx*   at pci? # Myricom Myri-10G 10Gb ethernet
 oce*   at pci? # Emulex OneConnect 10Gb ethernet
 txp*   at pci? # 3com 3CR990
diff --git a/sys/dev/pci/files.pci b/sys/dev/pci/files.pci
index a803dc9b659..7d6402e524e 100644
--- a/sys/dev/pci/files.pci
+++ b/sys/dev/pci/files.pci
@@ -350,13 +350,19 @@ file  dev/pci/ixgb_hw.c   ixgb
 # Intel 82598 10GbE
 device ix: ether, ifnet, ifmedia, intrmap, stoeplitz
 attach ix at pci
-file   dev/pci/if_ix.c ix
-file   dev/pci/ixgbe.c ix
-file   dev/pci/ixgbe_82598.c   ix
-file   dev/pci/ixgbe_82599.c   ix
-file   dev/pci/ixgbe_x540.cix
-file   dev/pci/ixgbe_x550.cix
-file   dev/pci/ixgbe_phy.c ix
+file   dev/pci/if_ix.c ix | ixv
+file   dev/pci/ixgbe.c ix | ixv
+file   dev/pci/ixgbe_82598.c   ix | ixv
+file   dev/pci/ixgbe_82599.c   ix | ixv
+file   dev/pci/ixgbe_x540.cix | ixv
+file   dev/pci/ixgbe_x550.cix | ixv
+file   dev/pci/ixgbe_phy.c ix | ixv
+
+# Virtual Function of i82599.
+device ixv: ether, ifnet, ifmedia, intrmap, stoeplitz
+attach ixv at pci
+file   dev/pci/if_ixv.cixv
+file   dev/pci/ixgbe_vf.c  ixv

 # Intel Ethernet 700 Series
 device ixl: ether, ifnet, ifmedia, intrmap, stoeplitz
diff --git a/sys/dev/pci/if_ix.c b/sys/dev/pci/if_ix.c
index b59ec28d9f1..9d88d00222d 100644
--- a/sys/dev/pci/if_ix.c
+++ b/sys/dev/pci/if_ix.c
@@ -507,7 +507,7 @@ ixgbe_start(struct ifqueue *ifq)
 * hardware that this frame is available to transmit.
 */
if (post)
-   IXGBE_WRITE_REG(>hw, IXGBE_TDT(txr->me),
+   IXGBE_WRITE_REG(>hw, txr->tail,
txr->next_avail_desc);
 }

@@ -706,7 +706,7 @@ ixgbe_watchdog(struct ifnet * ifp)
for (i = 0; i < sc->num_queues; i++, txr++) {
printf("%s: Queue(%d) tdh = %d, hw tdt = %d\n", ifp->if_xname, 
i,
IXGBE_READ_REG(hw, IXGBE_TDH(i)),
-   IXGBE_READ_REG(hw, IXGBE_TDT(i)));
+   IXGBE_READ_REG(hw, sc->tx_rings[i].tail));
printf("%s: TX(%d) Next TX to Clean = %d\n", ifp->if_xname,
i, txr->next_to_clean);
}
@@ -826,7 +826,7 @@ ixgbe_init(void *arg)
msec_delay(1);
}
IXGBE_WRITE_FLUSH(>hw);
-   IXGBE_WRITE_REG(>hw, IXGBE_RDT(i), rxr->last_desc_filled);
+   IXGBE_WRITE_REG(>hw, rxr[i].tail, rxr->last_desc_filled);
}

/* Set up VLAN support and filter */
@@ -2359,9 +2359,12 @@ ixgbe_initialize_transmit_units(struct ix_softc *sc)
IXGBE_WRITE_REG(hw, IXGBE_TDLEN(i),
sc->num_tx_desc * sizeof(struct ixgbe_legacy_tx_desc));

+   /* Set Tx Tail register */
+   txr->tail = IXGBE_TDT(i);
+
/* 

Re: [v3] amd64: simplify TSC sync testing

2022-07-06 Thread Yuichiro NAITO



On 7/7/22 11:29, Scott Cheloha wrote:

On Thu, Jul 07, 2022 at 11:13:14AM +0900, Yuichiro NAITO wrote:

I had another try for the v3 patch that runs on FreeBSD bhyve.
In this case, I need the following patch for TSC to be chosen as a timecounter.

diff --git a/sys/arch/amd64/amd64/tsc.c b/sys/arch/amd64/amd64/tsc.c
index c4d3acda8c7..ab34df4463c 100644
--- a/sys/arch/amd64/amd64/tsc.c
+++ b/sys/arch/amd64/amd64/tsc.c
@@ -168,7 +168,7 @@ measure_tsc_freq(struct timecounter *tc)

min_freq = ULLONG_MAX;

-   delay_usec = 10;
+   delay_usec = 1;
for (i = 0; i < 3; i++) {
s = intr_disable();

This is a separate issue from TSC synchronization.
I'll describe it in an another mail.


What are we using for delay_func when this function runs
on cpu0?


`i8254_delay` is used in the `measure_tsc_freq` execution time.




With this patch, Scott's v3 patch works fine on FreeBSD bhyve.
Here is dmesg shown in my OpenBSD.


This is an Intel machine.  Does byhve not pass the TSC frequency
through the MSR?  Or is there something special we need to do if
we're a guest to determine the frequency?


I'm not sure TSC frequency is offered via MSR.
`tsc_freq_cpuid` gets TSC frequency from CPUID 0x15.
It is disabled by bhyve.

So, OpenBSD kernel measures TSC frequency in `measure_tsc_freq` function.
But i8254_delay error is relatively large in this environment.

--
Yuichiro NAITO (naito.yuich...@gmail.com)



fails to measure TSC frequency on Freebsd bhyve

2022-07-06 Thread Yuichiro NAITO

This always happens when OpenBSD runs on FreeBSD bhyve.

Usually OpenBSD gets TSC frequency from CPUID 0x15,
but FreeBSD bhyve disables it by the following commit.

https://cgit.freebsd.org/src/commit/?id=ec048c755076befde5bf533911c70a01d59abdda

Then OpenBSD tries to measure TSC frequency by delay(9) in `measure_tsc_freq`
function. This function waits in 100ms but i8254_delay error is relatively
large in this environment. So, the difference in `tc_get_count` become bigger
than 100ms.

I could see the i8254_delay error by the following patch.

diff --git a/sys/arch/amd64/amd64/tsc.c b/sys/arch/amd64/amd64/tsc.c
index c4d3acda8c7..ab34df4463c 100644
--- a/sys/arch/amd64/amd64/tsc.c
+++ b/sys/arch/amd64/amd64/tsc.c
@@ -182,7 +182,7 @@ measure_tsc_freq(struct timecounter *tc)
continue;

usec = calculate_tc_delay(tc, count1, count2);
-
+   printf("usec - delay_usec = %d\n", usec - delay_usec);
if ((usec < (delay_usec - RECALIBRATE_DELAY_THRESHOLD)) ||
(usec > (delay_usec + RECALIBRATE_DELAY_THRESHOLD)))
continue;

This shows as follows.

```
usec - delay_usec = 98
usec - delay_usec = 98
usec - delay_usec = 100
```

These are bigger than RECALIBRATE_DELAY_THRESHOLD (= 50).

When I change the patch like this.

diff --git a/sys/arch/amd64/amd64/tsc.c b/sys/arch/amd64/amd64/tsc.c
index c4d3acda8c7..ab34df4463c 100644
--- a/sys/arch/amd64/amd64/tsc.c
+++ b/sys/arch/amd64/amd64/tsc.c
@@ -168,7 +168,7 @@ measure_tsc_freq(struct timecounter *tc)

min_freq = ULLONG_MAX;

-   delay_usec = 10;
+   delay_usec = 1;
for (i = 0; i < 3; i++) {
s = intr_disable();

@@ -182,7 +182,7 @@ measure_tsc_freq(struct timecounter *tc)
continue;

usec = calculate_tc_delay(tc, count1, count2);
-
+   printf("usec - delay_usec = %d\n", usec - delay_usec);
if ((usec < (delay_usec - RECALIBRATE_DELAY_THRESHOLD)) ||
(usec > (delay_usec + RECALIBRATE_DELAY_THRESHOLD)))
continue;

I've got following results.

```
usec - delay_usec = 28
usec - delay_usec = 22
usec - delay_usec = 25
```

So, I want to change `delay_usec` value from 100000 to 1.
Is it OK?

--
Yuichiro NAITO (naito.yuich...@gmail.com)



Re: [v3] amd64: simplify TSC sync testing

2022-07-06 Thread Yuichiro NAITO
CPUF_INVAR_TSC0x0100  /* CPU has invariant TSC */
  #define CPUF_USERXSTATE   0x0200  /* CPU has curproc's xsave 
state */
  
-#define CPUF_SYNCTSC	0x0800		/* Synchronize TSC */

  #define CPUF_PRESENT  0x1000  /* CPU is present */
  #define CPUF_RUNNING  0x2000  /* CPU is running */
  #define CPUF_PAUSE0x4000  /* CPU is paused in DDB */
Index: include/cpuvar.h
===
RCS file: /cvs/src/sys/arch/amd64/include/cpuvar.h,v
retrieving revision 1.11
diff -u -p -r1.11 cpuvar.h
--- include/cpuvar.h16 May 2021 04:33:05 -  1.11
+++ include/cpuvar.h5 Jul 2022 01:52:11 -
@@ -97,8 +97,7 @@ void identifycpu(struct cpu_info *);
  void cpu_init(struct cpu_info *);
  void cpu_init_first(void);
  
-void tsc_sync_drift(int64_t);

-void tsc_sync_bp(struct cpu_info *);
-void tsc_sync_ap(struct cpu_info *);
+void tsc_test_sync_bp(struct cpu_info *);
+void tsc_test_sync_ap(struct cpu_info *);
  
  #endif


--
Yuichiro NAITO (naito.yuich...@gmail.com)



Re: [v3] amd64: simplify TSC sync testing

2022-07-05 Thread Yuichiro NAITO
   /* CPU has curproc's xsave 
state */
  
-#define CPUF_SYNCTSC	0x0800		/* Synchronize TSC */

  #define CPUF_PRESENT  0x1000  /* CPU is present */
  #define CPUF_RUNNING  0x2000  /* CPU is running */
  #define CPUF_PAUSE0x4000  /* CPU is paused in DDB */
Index: include/cpuvar.h
===
RCS file: /cvs/src/sys/arch/amd64/include/cpuvar.h,v
retrieving revision 1.11
diff -u -p -r1.11 cpuvar.h
--- include/cpuvar.h16 May 2021 04:33:05 -  1.11
+++ include/cpuvar.h5 Jul 2022 01:52:11 -
@@ -97,8 +97,7 @@ void identifycpu(struct cpu_info *);
  void cpu_init(struct cpu_info *);
  void cpu_init_first(void);
  
-void tsc_sync_drift(int64_t);

-void tsc_sync_bp(struct cpu_info *);
-void tsc_sync_ap(struct cpu_info *);
+void tsc_test_sync_bp(struct cpu_info *);
+void tsc_test_sync_ap(struct cpu_info *);
  
  #endif


--
Yuichiro NAITO (naito.yuich...@gmail.com)



Re: wg(4): 'Address already in use' when wgrtable is changed

2022-06-12 Thread Yuichiro NAITO

Hi, Yasuoka-san.

I can confirm my problem is fixed by your patch.
Thank you!

On 6/11/22 20:14, YASUOKA Masahiko wrote:

Hi,

On Sun, 27 Mar 2022 18:25:18 +0900 (JST)
YASUOKA Masahiko  wrote:

On Wed, 9 Mar 2022 15:28:44 +0900
Yuichiro NAITO  wrote:

I see 'Address already in use' message,
when I change wgrtable for a running wg interface.
It doesn't make sense to me.

It can be reproduced by the following command sequence.

```
# route -T1 add default `cat /etc/mygate`
# ifconfig wg0 create wgport 7111 wgkey `cat /etc/mykey.wg0`
# ifconfig wg0 up
# ifconfig wg0 wgrtable 1
ifconfig: SIOCSWG: Address already in use
```

When I down wg0 interface before changing wgrtable,
It succeeds and no messages are shown.

I investigated the reason why 'Address already in use' is shown.

If wgrtable is specified by ifconfig argument,
`wg_ioctl_set` function in `sys/net/if_wg.c` is called.

And if the wg interface is running, `wg_bind` function is called.
`wg_bind` creates new sockets (IPv4 and 6) and replace them from old
ones.

If only wgrtable is changed, `wg_bind` binds as same port as existing
sockets.
So 'Address already in use' is shown.

Here is a simple patch to close existing sockets before `wg_bind`.
It works for me but I'm not 100% sure this is right fix.

Any other ideas?

```
diff --git a/sys/net/if_wg.c b/sys/net/if_wg.c
index 4dae3e3c976..0159664fb34 100644
--- a/sys/net/if_wg.c
+++ b/sys/net/if_wg.c
@@ -2253,11 +2253,14 @@ wg_ioctl_set(struct wg_softc *sc, struct
wg_data_io *data)
if (port != sc->sc_udp_port || rtable != sc->sc_udp_rtable) {
TAILQ_FOREACH(peer, >sc_peer_seq, p_seq_entry)
wg_peer_clear_src(peer);

-   if (sc->sc_if.if_flags & IFF_RUNNING)
+   if (sc->sc_if.if_flags & IFF_RUNNING) {
+   if (port == sc->sc_udp_port)
+   wg_unbind(sc);
if ((ret = wg_bind(sc, , )) != 0)
goto error;
+   }

sc->sc_udp_port = port;
sc->sc_udp_rtable = rtable;
}
```


If rdomain 1 exists, the error will not shown.

  # ifconfig vether0 rdomain 1 up
  # ifconfig wg0 create wgport 7111 wgkey `openssl rand -base64 32` up
  # ifconfig wg0 wgrtable 1
  #

In the case which you reported to, it is supposed that rtable 1 exists
but rdomain 1 doesn't exist.

Even when "wgtable 1" is configured, becase there is no dedicated
rdomain, rdomain 0 will be used to bind the UDP port.


This is not correct.

Configuring "wgtable 1" when rdomain 1 doesn't exist should mean that
default(rdomain 0) is used for inbound and rtable 1 used for outbound.

Binding the same address/port causes EADDRINUSE even if the rtable is
different.  This is the original problem.  Then we need to unbind the
old socket.

So your original diff has been basically OK.  New diff is almost the
same of that diff, but tweak a condition which checks rdomain is the
same before unbinding the socket.

I belive this diff is ok.

Index: sys/net/if_wg.c
===
RCS file: /disk/cvs/openbsd/src/sys/net/if_wg.c,v
retrieving revision 1.25
diff -u -p -r1.25 if_wg.c
--- sys/net/if_wg.c 6 Jun 2022 14:45:41 -   1.25
+++ sys/net/if_wg.c 11 Jun 2022 10:55:18 -
@@ -751,6 +751,16 @@ wg_bind(struct wg_softc *sc, in_port_t *
int  retries = 0;
  retry:
  #endif
+
+   if (port == sc->sc_udp_port &&
+   rtable_l2(rtable) == rtable_l2(sc->sc_udp_rtable)) {
+   /* changing rtable in the same domain */
+   wg_socket_close(>sc_so4);
+#ifdef INET6
+   wg_socket_close(>sc_so6);
+#endif
+   }
+
if ((ret = wg_socket_open(, AF_INET, , , sc)) != 0)
return ret;
  



--
Yuichiro NAITO (naito.yuich...@gmail.com)



Re: amd64: do CPU identification before TSC sync test

2022-05-11 Thread Yuichiro NAITO

On 5/11/22 14:34, Yuichiro NAITO wrote:

After applying your patch, cpu1 is not identified on my current kernel.
Dmesg shows as follows. I'll see it further more.


I found that LAPIC is necessary for `delay` function that is used in 
`idendifycpu` and
waiting loop for CPUF_IDENTIFY flag is set.

I partly reverted calling `lapic_enable` and `lapic_startclock` and it works 
for me.
Please see my following patch.

diff --git a/sys/arch/amd64/amd64/cpu.c b/sys/arch/amd64/amd64/cpu.c
index 4778347a12e..83c0297e7f1 100644
--- a/sys/arch/amd64/amd64/cpu.c
+++ b/sys/arch/amd64/amd64/cpu.c
@@ -844,50 +844,54 @@ cpu_start_secondary(struct cpu_info *ci)
 * wait for it to become ready
 */
for (i = 10; (!(ci->ci_flags & CPUF_PRESENT)) && i>0;i--) {
delay(10);
}
if (! (ci->ci_flags & CPUF_PRESENT)) {
printf("%s: failed to become ready\n", ci->ci_dev->dv_xname);
 #if defined(MPDEBUG) && defined(DDB)
printf("dropping into debugger; continue from here to resume 
boot\n");
db_enter();
 #endif
-   } else {
-   /*
-* Synchronize time stamp counters. Invalidate cache and
-* synchronize twice (in tsc_sync_bp) to minimize possible
-* cache effects. Disable interrupts to try and rule out any
-* external interference.
-*/
-   s = intr_disable();
-   wbinvd();
-   tsc_sync_bp(ci);
-   intr_restore(s);
-#ifdef TSC_DEBUG
-   printf("TSC skew=%lld\n", (long long)ci->ci_tsc_skew);
-#endif
+   goto cleanup;
}

if ((ci->ci_flags & CPUF_IDENTIFIED) == 0) {
atomic_setbits_int(>ci_flags, CPUF_IDENTIFY);

/* wait for it to identify */
for (i = 200; (ci->ci_flags & CPUF_IDENTIFY) && i > 0; i--)
delay(10);

-   if (ci->ci_flags & CPUF_IDENTIFY)
+   if (ci->ci_flags & CPUF_IDENTIFY) {
printf("%s: failed to identify\n",
ci->ci_dev->dv_xname);
+   goto cleanup;
+   }
}

+   /*
+* Synchronize time stamp counters. Invalidate cache and
+* synchronize twice (in tsc_sync_bp) to minimize possible
+* cache effects. Disable interrupts to try and rule out any
+* external interference.
+*/
+   s = intr_disable();
+   wbinvd();
+   tsc_sync_bp(ci);
+   intr_restore(s);
+#ifdef TSC_DEBUG
+   printf("TSC skew=%lld\n", (long long)ci->ci_tsc_skew);
+#endif
+
+cleanup:
CPU_START_CLEANUP(ci);

pmap_kremove(MP_TRAMPOLINE, PAGE_SIZE);
pmap_kremove(MP_TRAMP_DATA, PAGE_SIZE);
 }

 void
 cpu_boot_secondary(struct cpu_info *ci)
 {
int i;
int64_t drift;
@@ -932,53 +936,53 @@ void
 cpu_hatch(void *v)
 {
struct cpu_info *ci = (struct cpu_info *)v;
int s;

cpu_init_msrs(ci);

 #ifdef DEBUG
if (ci->ci_flags & CPUF_PRESENT)
panic("%s: already running!?", ci->ci_dev->dv_xname);
 #endif
-
-   /*
-* Synchronize the TSC for the first time. Note that interrupts are
-* off at this point.
-*/
-   wbinvd();
ci->ci_flags |= CPUF_PRESENT;
-   ci->ci_tsc_skew = 0; /* reset on resume */
-   tsc_sync_ap(ci);

lapic_enable();
lapic_startclock();
cpu_ucode_apply(ci);
cpu_tsx_disable(ci);

if ((ci->ci_flags & CPUF_IDENTIFIED) == 0) {
/*
 * We need to wait until we can identify, otherwise dmesg
 * output will be messy.
 */
while ((ci->ci_flags & CPUF_IDENTIFY) == 0)
delay(10);

identifycpu(ci);

/* Signal we're done */
atomic_clearbits_int(>ci_flags, CPUF_IDENTIFY);
/* Prevent identifycpu() from running again */
atomic_setbits_int(>ci_flags, CPUF_IDENTIFIED);
}

+   /*
+* Synchronize the TSC for the first time. Note that interrupts are
+* off at this point.
+*/
+   wbinvd();
+   ci->ci_tsc_skew = 0; /* reset on resume */
+   tsc_sync_ap(ci);
+
while ((ci->ci_flags & CPUF_GO) == 0)
delay(10);
 #ifdef HIBERNATE
if ((ci->ci_flags & CPUF_PARK) != 0) {
    atomic_clearbits_int(>ci_flags, CPUF_PARK);
hibernate_drop_to_real_mode();
}
 #endif /* HIBERNATE */

 #ifdef DEBUG
if (ci->ci_flags & CPUF_RUNNING)


--
Yuichiro NAITO (naito.yuich...@gmail.com)



Re: amd64: do CPU identification before TSC sync test

2022-05-10 Thread Yuichiro NAITO
 Invalidate cache and
+* synchronize twice (in tsc_sync_bp) to minimize possible
+* cache effects. Disable interrupts to try and rule out any
+* external interference.
+*/
+   s = intr_disable();
+   wbinvd();
+   tsc_sync_bp(ci);
+   intr_restore(s);
+#ifdef TSC_DEBUG
+   printf("TSC skew=%lld\n", (long long)ci->ci_tsc_skew);
+#endif
+
+cleanup:
CPU_START_CLEANUP(ci);
  
  	pmap_kremove(MP_TRAMPOLINE, PAGE_SIZE);

@@ -940,18 +944,8 @@ cpu_hatch(void *v)
if (ci->ci_flags & CPUF_PRESENT)
panic("%s: already running!?", ci->ci_dev->dv_xname);
  #endif
-
-   /*
-* Synchronize the TSC for the first time. Note that interrupts are
-* off at this point.
-*/
-   wbinvd();
ci->ci_flags |= CPUF_PRESENT;
-   ci->ci_tsc_skew = 0; /* reset on resume */
-   tsc_sync_ap(ci);
  
-	lapic_enable();

-   lapic_startclock();
cpu_ucode_apply(ci);
cpu_tsx_disable(ci);
  
@@ -970,6 +964,17 @@ cpu_hatch(void *v)

/* Prevent identifycpu() from running again */
atomic_setbits_int(>ci_flags, CPUF_IDENTIFIED);
}
+
+   /*
+* Synchronize the TSC for the first time. Note that interrupts are
+* off at this point.
+*/
+   wbinvd();
+   ci->ci_tsc_skew = 0; /* reset on resume */
+   tsc_sync_ap(ci);
+
+   lapic_enable();
+   lapic_startclock();
  
  	while ((ci->ci_flags & CPUF_GO) == 0)

delay(10);



--
Yuichiro NAITO (naito.yuich...@gmail.com)



Re: [v2] amd64: simplify TSC sync testing

2022-05-10 Thread Yuichiro NAITO
uint64_t interval, start;
+   uint64_t ap_val, bp_val, end, lag;
+   u_int i = 0;
  
-	interval = (uint64_t)usecs * tsc_frequency / 100;

-   start = rdtsc_lfence();
-   while (rdtsc_lfence() - start < interval)
-   CPU_BUSY_CYCLE();
+   bp_val = rdtsc_lfence();
+   end = bp_val + tsc_test_cycles;
+   while (bp_val < end) {
+   ap_val = tsc_ap_status.val;
+   bp_val = rdtsc_lfence();
+   tsc_bp_status.val = bp_val;
+
+   if (++i % 8 == 0)
+   CPU_BUSY_CYCLE();
+
+   if (__predict_false(bp_val < ap_val)) {
+   tsc_bp_status.lag_count++;
+   lag = ap_val - bp_val;
+   if (tsc_bp_status.lag_max < lag)
+   tsc_bp_status.lag_max = lag;
+   }
+   }
  }
+
+#endif /* MULTIPROCESSOR */
Index: amd64/cpu.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/cpu.c,v
retrieving revision 1.155
diff -u -p -r1.155 cpu.c
--- amd64/cpu.c 21 Feb 2022 11:03:39 -  1.155
+++ amd64/cpu.c 23 Feb 2022 02:23:26 -
@@ -762,9 +762,9 @@ cpu_init(struct cpu_info *ci)
lcr4(cr4 & ~CR4_PGE);
lcr4(cr4);
  
-	/* Synchronize TSC */

+   /* Check if our TSC is synchronized with the BP's TSC. */
if (cold && !CPU_IS_PRIMARY(ci))
- tsc_sync_ap(ci);
+ tsc_test_sync_ap(ci);
  #endif
  }
  
@@ -844,18 +844,14 @@ cpu_start_secondary(struct cpu_info *ci)

  #endif
} else {
/*
-* Synchronize time stamp counters. Invalidate cache and
-* synchronize twice (in tsc_sync_bp) to minimize possible
-* cache effects. Disable interrupts to try and rule out any
-* external interference.
+* Check if our TSC is synchronized with the AP's TSC.
+* Invalidate cache to minimize possible cache effects
+* and disable interrupts to minimize test interference.
 */
s = intr_disable();
wbinvd();
-   tsc_sync_bp(ci);
+   tsc_test_sync_bp(ci);
intr_restore(s);
-#ifdef TSC_DEBUG
-   printf("TSC skew=%lld\n", (long long)ci->ci_tsc_skew);
-#endif
}
  
  	if ((ci->ci_flags & CPUF_IDENTIFIED) == 0) {

@@ -880,7 +876,6 @@ void
  cpu_boot_secondary(struct cpu_info *ci)
  {
int i;
-   int64_t drift;
u_long s;
  
  	atomic_setbits_int(>ci_flags, CPUF_GO);

@@ -895,18 +890,14 @@ cpu_boot_secondary(struct cpu_info *ci)
db_enter();
  #endif
} else if (cold) {
-   /* Synchronize TSC again, check for drift. */
-   drift = ci->ci_tsc_skew;
+   /*
+* Test TSC synchronization again.  Something may
+* have screwed it up while we were down.
+*/
s = intr_disable();
wbinvd();
-   tsc_sync_bp(ci);
+   tsc_test_sync_bp(ci);
intr_restore(s);
-   drift -= ci->ci_tsc_skew;
-#ifdef TSC_DEBUG
-   printf("TSC skew=%lld drift=%lld\n",
-   (long long)ci->ci_tsc_skew, (long long)drift);
-#endif
-   tsc_sync_drift(drift);
}
  }
  
@@ -932,16 +923,17 @@ cpu_hatch(void *v)

  #endif
  
  	/*

-* Synchronize the TSC for the first time. Note that interrupts are
-* off at this point.
+* Test if our TSC is synchronized with the BP's TSC.
+* Note that interrupts are off at this point.  Invalidate
+* cache to minimize cache effects on the test.
 */
wbinvd();
ci->ci_flags |= CPUF_PRESENT;
-   ci->ci_tsc_skew = 0; /* reset on resume */
-   tsc_sync_ap(ci);
+   tsc_test_sync_ap(ci);
  
  	lapic_enable();

lapic_startclock();
+
cpu_ucode_apply(ci);
cpu_tsx_disable(ci);
  
Index: include/cpu.h

===
RCS file: /cvs/src/sys/arch/amd64/include/cpu.h,v
retrieving revision 1.141
diff -u -p -r1.141 cpu.h
--- include/cpu.h   31 Aug 2021 17:40:59 -  1.141
+++ include/cpu.h   23 Feb 2022 02:23:26 -
@@ -209,8 +209,6 @@ struct cpu_info {
paddr_t ci_vmxon_region_pa;
struct vmxon_region *ci_vmxon_region;
  
-	int64_t		ci_tsc_skew;		/* counter skew vs cpu0 */

-
charci_panicbuf[512];
  
  	paddr_t		ci_vmcs_pa;

@@ -230,7 +228,6 @@ struct cpu_info {
  #define CPUF_INVAR_TSC0x0100  /* CPU has invariant TSC */
  #define CPUF_USERXSTATE   0x0200  /* CPU has curproc's xsave 
state */
  
-#define CPUF_SYNCTSC	0x0800		/* Synchronize TSC */

  #define CPUF_PRESENT  0x1000  /* CPU is present */
  #define CPUF_RUNNING  0x2000  /* CPU is running */
  #define CPUF_PAUSE0x4000  /* CPU is paused in DDB */
Index: include/cpuvar.h
===
RCS file: /cvs/src/sys/arch/amd64/include/cpuvar.h,v
retrieving revision 1.11
diff -u -p -r1.11 cpuvar.h
--- include/cpuvar.h16 May 2021 04:33:05 -  1.11
+++ include/cpuvar.h23 Feb 2022 02:23:26 -
@@ -97,8 +97,7 @@ void identifycpu(struct cpu_info *);
  void cpu_init(struct cpu_info *);
  void cpu_init_first(void);
  
-void tsc_sync_drift(int64_t);

-void tsc_sync_bp(struct cpu_info *);
-void tsc_sync_ap(struct cpu_info *);
+void tsc_test_sync_bp(struct cpu_info *);
+void tsc_test_sync_ap(struct cpu_info *);
  
  #endif




--
Yuichiro NAITO (naito.yuich...@gmail.com)



Re: wg(4): 'Address already in use' when wgrtable is changed

2022-03-29 Thread Yuichiro NAITO

Thanks for the answer.

There is one thing I'm worrying about.
Ifconfig doesn't show wgrtable value with your patch.
In my use case as follows, it seems that setting `wgrtable 1` is ignored.

```
# route -T1 add default `cat /etc/mygate`
# ifconfig wg0 create wgport 7111 wgkey `cat /etc/mykey.wg0`
# ifconfig wg0 up
# ifconfig wg0 wgrtable 1
# ifconfig wg0
wg0: flags=80c3 mtu 1420
index 6 priority 0 llprio 3
wgport 7111
wgpubkey e/CYTG1RGqT4jmrY0Fom8cAdtOWP7F/gBVwamyINRlg=
groups: wg
```

On 3/28/22 15:59, YASUOKA Masahiko wrote:

On Mon, 28 Mar 2022 15:20:02 +0900
Yuichiro NAITO  wrote:

Thanks for the explanation.
I understand how your patch works.

I want to ask the goal of your patch.
It seems just removing 'Address already in use' message.
Is my guessing right?


Yes.  There is nothing to do, since the command is to bind the same
port, protocol, and domain of prevous.

The code seems to do such the skip already, but it lacks consideration
for rtable_l2(rtable) != rtable case.


On 3/28/22 14:01, YASUOKA Masahiko wrote:

Hi,
On Mon, 28 Mar 2022 12:12:39 +0900
Yuichiro NAITO  wrote:

On 3/27/22 18:25, YASUOKA Masahiko wrote:

Hi,
On Wed, 9 Mar 2022 15:28:44 +0900
Yuichiro NAITO  wrote:

I see 'Address already in use' message,
when I change wgrtable for a running wg interface.
It doesn't make sense to me.

It can be reproduced by the following command sequence.

```
# route -T1 add default `cat /etc/mygate`
# ifconfig wg0 create wgport 7111 wgkey `cat /etc/mykey.wg0`
# ifconfig wg0 up
# ifconfig wg0 wgrtable 1
ifconfig: SIOCSWG: Address already in use
```

When I down wg0 interface before changing wgrtable,
It succeeds and no messages are shown.

I investigated the reason why 'Address already in use' is shown.

If wgrtable is specified by ifconfig argument,
`wg_ioctl_set` function in `sys/net/if_wg.c` is called.

And if the wg interface is running, `wg_bind` function is called.
`wg_bind` creates new sockets (IPv4 and 6) and replace them from old
ones.

If only wgrtable is changed, `wg_bind` binds as same port as existing
sockets.
So 'Address already in use' is shown.

Here is a simple patch to close existing sockets before `wg_bind`.
It works for me but I'm not 100% sure this is right fix.

Any other ideas?

```
diff --git a/sys/net/if_wg.c b/sys/net/if_wg.c
index 4dae3e3c976..0159664fb34 100644
--- a/sys/net/if_wg.c
+++ b/sys/net/if_wg.c
@@ -2253,11 +2253,14 @@ wg_ioctl_set(struct wg_softc *sc, struct
wg_data_io *data)
if (port != sc->sc_udp_port || rtable != sc->sc_udp_rtable) {
TAILQ_FOREACH(peer, >sc_peer_seq, p_seq_entry)
wg_peer_clear_src(peer);

-   if (sc->sc_if.if_flags & IFF_RUNNING)
+   if (sc->sc_if.if_flags & IFF_RUNNING) {
+   if (port == sc->sc_udp_port)
+   wg_unbind(sc);
if ((ret = wg_bind(sc, , )) != 0)
goto error;
+   }

sc->sc_udp_port = port;
sc->sc_udp_rtable = rtable;
}
```

If rdomain 1 exists, the error will not shown.
# ifconfig vether0 rdomain 1 up
# ifconfig wg0 create wgport 7111 wgkey `openssl rand -base64 32` up
# ifconfig wg0 wgrtable 1
#


Yes, if rdomain 1 is created before `ifconfig wg0 wgrtable 1`,
setting wgrtable succeeds and there is no problem.


In the case which you reported to, it is supposed that rtable 1 exists
but rdomain 1 doesn't exist.
Even when "wgtable 1" is configured, becase there is no dedicated
rdomain, rdomain 0 will be used to bind the UDP port.


Exactly, it's the case that I reported and want to fix.


So what wg(4) should do for this case is "nothing".


I'm a little bit confused.
As you said, I can confirm your patch doesn't set wgrtable in my use
case.
It is not the result that I wanted.

 # ifconfig wg0 create wgport 7111 wgkey `openssl rand -base64 32` up
 -> bind 7111/udp on rdomain 0  (1)
is expected. (1)
 # ifconfig wg0 wgrtable 1
 -> bind 7111/udp on rdomain 0  (2)
is expected, since there is no "domain 1".
If trying to do (1) and (2), then it causes EADDRINUSE since it is to
bind the same port, proto, and domain.  The latest diff is skip (2)
properly.
Previous


-   if (port != sc->sc_udp_port || rtable != sc->sc_udp_rtable) {

"rtable != sc->sc_udp_rtable" was wrong since rdomain for rtable may
not exist.  This is the cause of EADDRINUSE.


So the diff is updated.
ok?
Index: sys/net/if_wg.c
===
RCS file: /disk/cvs/openbsd/src/sys/net/if_wg.c,v
retrieving revision 1.22
diff -u -p -r1.22 if_wg.c
--- sys/net/if_wg.c 22 Feb 2022 01:15:02 -  1.22
+++ sys/net/if_wg.c 27 Mar 2022 09:17:08 -
@@ -2250,7 +2250,8 @@ wg_ioctl_set(struct wg_softc *sc, struct
else

Re: wg(4): 'Address already in use' when wgrtable is changed

2022-03-28 Thread Yuichiro NAITO

Thanks for the explanation.
I understand how your patch works.

I want to ask the goal of your patch.
It seems just removing 'Address already in use' message.
Is my guessing right?

On 3/28/22 14:01, YASUOKA Masahiko wrote:

Hi,

On Mon, 28 Mar 2022 12:12:39 +0900
Yuichiro NAITO  wrote:

On 3/27/22 18:25, YASUOKA Masahiko wrote:

Hi,
On Wed, 9 Mar 2022 15:28:44 +0900
Yuichiro NAITO  wrote:

I see 'Address already in use' message,
when I change wgrtable for a running wg interface.
It doesn't make sense to me.

It can be reproduced by the following command sequence.

```
# route -T1 add default `cat /etc/mygate`
# ifconfig wg0 create wgport 7111 wgkey `cat /etc/mykey.wg0`
# ifconfig wg0 up
# ifconfig wg0 wgrtable 1
ifconfig: SIOCSWG: Address already in use
```

When I down wg0 interface before changing wgrtable,
It succeeds and no messages are shown.

I investigated the reason why 'Address already in use' is shown.

If wgrtable is specified by ifconfig argument,
`wg_ioctl_set` function in `sys/net/if_wg.c` is called.

And if the wg interface is running, `wg_bind` function is called.
`wg_bind` creates new sockets (IPv4 and 6) and replace them from old
ones.

If only wgrtable is changed, `wg_bind` binds as same port as existing
sockets.
So 'Address already in use' is shown.

Here is a simple patch to close existing sockets before `wg_bind`.
It works for me but I'm not 100% sure this is right fix.

Any other ideas?

```
diff --git a/sys/net/if_wg.c b/sys/net/if_wg.c
index 4dae3e3c976..0159664fb34 100644
--- a/sys/net/if_wg.c
+++ b/sys/net/if_wg.c
@@ -2253,11 +2253,14 @@ wg_ioctl_set(struct wg_softc *sc, struct
wg_data_io *data)
if (port != sc->sc_udp_port || rtable != sc->sc_udp_rtable) {
TAILQ_FOREACH(peer, >sc_peer_seq, p_seq_entry)
wg_peer_clear_src(peer);

-   if (sc->sc_if.if_flags & IFF_RUNNING)
+   if (sc->sc_if.if_flags & IFF_RUNNING) {
+   if (port == sc->sc_udp_port)
+   wg_unbind(sc);
if ((ret = wg_bind(sc, , )) != 0)
goto error;
+   }

sc->sc_udp_port = port;
sc->sc_udp_rtable = rtable;
}
```

If rdomain 1 exists, the error will not shown.
   # ifconfig vether0 rdomain 1 up
   # ifconfig wg0 create wgport 7111 wgkey `openssl rand -base64 32` up
   # ifconfig wg0 wgrtable 1
   #


Yes, if rdomain 1 is created before `ifconfig wg0 wgrtable 1`,
setting wgrtable succeeds and there is no problem.


In the case which you reported to, it is supposed that rtable 1 exists
but rdomain 1 doesn't exist.
Even when "wgtable 1" is configured, becase there is no dedicated
rdomain, rdomain 0 will be used to bind the UDP port.


Exactly, it's the case that I reported and want to fix.


So what wg(4) should do for this case is "nothing".


I'm a little bit confused.
As you said, I can confirm your patch doesn't set wgrtable in my use
case.
It is not the result that I wanted.


# ifconfig wg0 create wgport 7111 wgkey `openssl rand -base64 32` up
-> bind 7111/udp on rdomain 0  (1)

is expected. (1)

# ifconfig wg0 wgrtable 1
-> bind 7111/udp on rdomain 0  (2)

is expected, since there is no "domain 1".

If trying to do (1) and (2), then it causes EADDRINUSE since it is to
bind the same port, proto, and domain.  The latest diff is skip (2)
properly.

Previous


   -if (port != sc->sc_udp_port || rtable != sc->sc_udp_rtable) {


"rtable != sc->sc_udp_rtable" was wrong since rdomain for rtable may
not exist.  This is the cause of EADDRINUSE.



So the diff is updated.
ok?
Index: sys/net/if_wg.c
===
RCS file: /disk/cvs/openbsd/src/sys/net/if_wg.c,v
retrieving revision 1.22
diff -u -p -r1.22 if_wg.c
--- sys/net/if_wg.c 22 Feb 2022 01:15:02 -  1.22
+++ sys/net/if_wg.c 27 Mar 2022 09:17:08 -
@@ -2250,7 +2250,8 @@ wg_ioctl_set(struct wg_softc *sc, struct
else
rtable = sc->sc_udp_rtable;
   -if (port != sc->sc_udp_port || rtable != sc->sc_udp_rtable) {
+   if (port != sc->sc_udp_port ||
+   rtable_l2(rtable) != sc->sc_udp_rtable) {
TAILQ_FOREACH(peer, >sc_peer_seq, p_seq_entry)
    wg_peer_clear_src(peer);
   


--
Yuichiro NAITO (naito.yuich...@gmail.com)



--
Yuichiro NAITO (naito.yuich...@gmail.com)



Re: wg(4): 'Address already in use' when wgrtable is changed

2022-03-27 Thread Yuichiro NAITO

Hello, thanks for the comment.

On 3/27/22 18:25, YASUOKA Masahiko wrote:

Hi,

On Wed, 9 Mar 2022 15:28:44 +0900
Yuichiro NAITO  wrote:

I see 'Address already in use' message,
when I change wgrtable for a running wg interface.
It doesn't make sense to me.

It can be reproduced by the following command sequence.

```
# route -T1 add default `cat /etc/mygate`
# ifconfig wg0 create wgport 7111 wgkey `cat /etc/mykey.wg0`
# ifconfig wg0 up
# ifconfig wg0 wgrtable 1
ifconfig: SIOCSWG: Address already in use
```

When I down wg0 interface before changing wgrtable,
It succeeds and no messages are shown.

I investigated the reason why 'Address already in use' is shown.

If wgrtable is specified by ifconfig argument,
`wg_ioctl_set` function in `sys/net/if_wg.c` is called.

And if the wg interface is running, `wg_bind` function is called.
`wg_bind` creates new sockets (IPv4 and 6) and replace them from old
ones.

If only wgrtable is changed, `wg_bind` binds as same port as existing
sockets.
So 'Address already in use' is shown.

Here is a simple patch to close existing sockets before `wg_bind`.
It works for me but I'm not 100% sure this is right fix.

Any other ideas?

```
diff --git a/sys/net/if_wg.c b/sys/net/if_wg.c
index 4dae3e3c976..0159664fb34 100644
--- a/sys/net/if_wg.c
+++ b/sys/net/if_wg.c
@@ -2253,11 +2253,14 @@ wg_ioctl_set(struct wg_softc *sc, struct
wg_data_io *data)
if (port != sc->sc_udp_port || rtable != sc->sc_udp_rtable) {
TAILQ_FOREACH(peer, >sc_peer_seq, p_seq_entry)
wg_peer_clear_src(peer);

-   if (sc->sc_if.if_flags & IFF_RUNNING)
+   if (sc->sc_if.if_flags & IFF_RUNNING) {
+   if (port == sc->sc_udp_port)
+   wg_unbind(sc);
if ((ret = wg_bind(sc, , )) != 0)
goto error;
+   }

sc->sc_udp_port = port;
sc->sc_udp_rtable = rtable;
}
```


If rdomain 1 exists, the error will not shown.

  # ifconfig vether0 rdomain 1 up
  # ifconfig wg0 create wgport 7111 wgkey `openssl rand -base64 32` up
  # ifconfig wg0 wgrtable 1
  #


Yes, if rdomain 1 is created before `ifconfig wg0 wgrtable 1`,
setting wgrtable succeeds and there is no problem.


In the case which you reported to, it is supposed that rtable 1 exists
but rdomain 1 doesn't exist.

Even when "wgtable 1" is configured, becase there is no dedicated
rdomain, rdomain 0 will be used to bind the UDP port.


Exactly, it's the case that I reported and want to fix.


So what wg(4) should do for this case is "nothing".


I'm a little bit confused.
As you said, I can confirm your patch doesn't set wgrtable in my use case.
It is not the result that I wanted.


So the diff is updated.

ok?

Index: sys/net/if_wg.c
===
RCS file: /disk/cvs/openbsd/src/sys/net/if_wg.c,v
retrieving revision 1.22
diff -u -p -r1.22 if_wg.c
--- sys/net/if_wg.c 22 Feb 2022 01:15:02 -  1.22
+++ sys/net/if_wg.c 27 Mar 2022 09:17:08 -
@@ -2250,7 +2250,8 @@ wg_ioctl_set(struct wg_softc *sc, struct
else
rtable = sc->sc_udp_rtable;
  
-	if (port != sc->sc_udp_port || rtable != sc->sc_udp_rtable) {

+   if (port != sc->sc_udp_port ||
+   rtable_l2(rtable) != sc->sc_udp_rtable) {
TAILQ_FOREACH(peer, >sc_peer_seq, p_seq_entry)
    wg_peer_clear_src(peer);
  


--
Yuichiro NAITO (naito.yuich...@gmail.com)



wg(4): 'Address already in use' when wgrtable is changed

2022-03-08 Thread Yuichiro NAITO

I see 'Address already in use' message,
when I change wgrtable for a running wg interface.
It doesn't make sense to me.

It can be reproduced by the following command sequence.

```
# route -T1 add default `cat /etc/mygate`
# ifconfig wg0 create wgport 7111 wgkey `cat /etc/mykey.wg0`
# ifconfig wg0 up
# ifconfig wg0 wgrtable 1
ifconfig: SIOCSWG: Address already in use
```

When I down wg0 interface before changing wgrtable,
It succeeds and no messages are shown.

I investigated the reason why 'Address already in use' is shown.

If wgrtable is specified by ifconfig argument,
`wg_ioctl_set` function in `sys/net/if_wg.c` is called.

And if the wg interface is running, `wg_bind` function is called.
`wg_bind` creates new sockets (IPv4 and 6) and replace them from old ones.

If only wgrtable is changed, `wg_bind` binds as same port as existing sockets.
So 'Address already in use' is shown.

Here is a simple patch to close existing sockets before `wg_bind`.
It works for me but I'm not 100% sure this is right fix.

Any other ideas?

```
diff --git a/sys/net/if_wg.c b/sys/net/if_wg.c
index 4dae3e3c976..0159664fb34 100644
--- a/sys/net/if_wg.c
+++ b/sys/net/if_wg.c
@@ -2253,11 +2253,14 @@ wg_ioctl_set(struct wg_softc *sc, struct wg_data_io 
*data)
if (port != sc->sc_udp_port || rtable != sc->sc_udp_rtable) {
TAILQ_FOREACH(peer, >sc_peer_seq, p_seq_entry)
wg_peer_clear_src(peer);

-   if (sc->sc_if.if_flags & IFF_RUNNING)
+   if (sc->sc_if.if_flags & IFF_RUNNING) {
+   if (port == sc->sc_udp_port)
+   wg_unbind(sc);
if ((ret = wg_bind(sc, , )) != 0)
goto error;
+   }

sc->sc_udp_port = port;
sc->sc_udp_rtable = rtable;
}
```

--
Yuichiro NAITO (naito.yuich...@gmail.com)



Re: A program compiled with '-pg' option always gets SEGV on its execution.

2022-02-28 Thread Yuichiro NAITO

Thanks for the comment.

On 2/21/22 21:07, Stefan Sperling wrote:

On Mon, Feb 21, 2022 at 10:20:17AM +0100, Marc Espie wrote:

On Mon, Feb 21, 2022 at 05:36:16PM +0900, Yuichiro NAITO wrote:

Of course, all programs compiled without '-pg' work fine for me.
I found this issue when I profile my application with gprof(1).
For example, following example C source code fails to execute on OpenBSD 7.0.


Profile is partly broken and has been for a while.

Compiling with -static and removing any pledge() call allow profiling to work.


Yes, and the proposed patch effectively enables -static if -pg is used.
I don't know if this the best fix. But this issue keeps popping up and
it would be nice to have something that works out of the box.
I would not mind this patch going in.


Yes.
When I use ld.bfd as /usr/bin/ld, this problem doesn't happen
although without '-static' option. So, my patch is helpful to keep same
usability between ld.bfd and ld.lld.


Pledge and unveil interfering with profiling is a separate issue which
is more obvious when it occurs and can easily be worked around by the
developer.


I don't have pledge issues in my use case.
My patch isn't intended to change anything about pledge.

--
Yuichiro NAITO (naito.yuich...@gmail.com)



A program compiled with '-pg' option always gets SEGV on its execution.

2022-02-21 Thread Yuichiro NAITO
t(argv[0] ? argv[0] : "",
522(Elf_Dyn *)(phdp->p_vaddr + exe_loff),
523(Elf_Phdr *)dl_data[AUX_phdr],
524dl_data[AUX_phnum], OBJTYPE_EXE, minva + 
exe_loff,
525exe_loff);
526_dl_add_object(exe_obj);
527break;

(snip)

562exe_obj->load_list = load_list;
563exe_obj->obj_flags |= DF_1_GLOBAL;
564exe_obj->load_size = maxva - minva;
565exe_obj->relro_addr = relro_addr;
566exe_obj->relro_size = relro_size;
```

I wonder why statically linked binary has an INTERP section?
I tried checking which options are passed to 'ld'.
I replaced my '/usr/bin/ld' as following script.

```
#!/bin/sh

echo $*
exec /usr/bin/ld.lld $*
```

I ran `cc -pg` again.

```
$ cc -pg helloworld.c
__start --eh-frame-hdr -Bdynamic -dynamic-linker /usr/libexec/ld.so -nopie -o 
a.out /usr/lib/gcrt0.o /usr/lib/crtbegin.o -L/usr/lib /tmp/helloworld-59ced4.o 
-lcompiler_rt -lc_p -lcompiler_rt /usr/lib/crtend.o
```

`-Bdynamic -dynamic-linker /usr/libexec/ld.so` is passed to the 'ld'.
I think 'ld' is not wrong, passing the dynamic linker option seems to be a bug.

I patched clang source code to pass `-Bstatic` option as follows.
I can confirm my programs compiled with -pg option run successfully.

```
diff --git a/gnu/llvm/clang/lib/Driver/ToolChains/OpenBSD.cpp 
b/gnu/llvm/clang/lib/Driver/ToolChains/OpenBSD.cpp
index 1577f70aad6..c3a2f9e005e 100644
--- a/gnu/llvm/clang/lib/Driver/ToolChains/OpenBSD.cpp
+++ b/gnu/llvm/clang/lib/Driver/ToolChains/OpenBSD.cpp
@@ -108,25 +108,25 @@ void openbsd::Linker::ConstructJob(Compilation , const 
JobAction ,

   if (ToolChain.getArch() == llvm::Triple::mips64)
 CmdArgs.push_back("-EB");
   else if (ToolChain.getArch() == llvm::Triple::mips64el)
 CmdArgs.push_back("-EL");

   if (!Args.hasArg(options::OPT_nostdlib, options::OPT_shared)) {
 CmdArgs.push_back("-e");
 CmdArgs.push_back("__start");
   }

   CmdArgs.push_back("--eh-frame-hdr");
-  if (Args.hasArg(options::OPT_static)) {
+  if (Args.hasArg(options::OPT_static) || Args.hasArg(options::OPT_pg)) {
 CmdArgs.push_back("-Bstatic");
   } else {
 if (Args.hasArg(options::OPT_rdynamic))
   CmdArgs.push_back("-export-dynamic");
 CmdArgs.push_back("-Bdynamic");
 if (Args.hasArg(options::OPT_shared)) {
   CmdArgs.push_back("-shared");
 } else {
   CmdArgs.push_back("-dynamic-linker");
   CmdArgs.push_back("/usr/libexec/ld.so");
 }
   }
```

Is it OK?

--
Yuichiro NAITO (naito.yuich...@gmail.com)



Re: gprof: Profiling a multi-threaded application

2021-12-09 Thread Yuichiro NAITO

Any comments about this topic?

On 7/12/21 18:09, Yuichiro NAITO wrote:

Hi, Martin

n 2021/07/10 16:55, Martin Pieuchot wrote:

Hello Yuichiro, thanks for your work !


Thanks for the response.


On 2021/06/16 16:34, Yuichiro NAITO wrote:

When I compile a multi-threaded application with '-pg' option, I always get no
results in gmon.out. With bad luck, my application gets SIGSEGV while running.
Because the buffer that holds number of caller/callee count is the only one
in the process and will be broken by multi-threaded access.

I get the idea to solve this problem from NetBSD. NetBSD has individual buffers
for each thread and merges them at the end of profiling.


Note that the kernel use a similar approach but doesn't merge the buffer,
instead it generates a file for each CPU.


Yes, so the number of output files are limited by the number of CPUs in case of
the kernel profiling. I think number of application threads can be increased 
more
casually. I don't want to see dozens of 'gmon.out' files.


NetBSD stores the reference to the individual buffer by pthread_setspecific(3).
I think it causes infinite recursive call if whole libc library (except
mcount.c) is compiled with -pg.

The compiler generates '_mcount' function call at the beginning of every
functions. If '_mcount' calls pthread_getspecific(3) to get the individual
buffer, pthread_getspecific() calls '_mcount' again and causes infinite
recursion.

NetBSD prevents from infinite recursive call by checking a global variable. But
this approach also prevents simultaneously call of '_mcount' on a multi-threaded
application. It makes a little bit wrong results of profiling.

So I added a pointer to the buffer in `struct pthread` that can be accessible
via macro call as same as pthread_self(3). This approach can prevent of
infinite recursive call of '_mcount'.


Not calling a libc function for this makes sense.  However I'm not
convinced that accessing `tib_thread' before _rthread_init() has been
called is safe.


Before '_rthread_init’ is called, '__isthreaded' global variable is kept to be 
0.
My patch doesn't access tib_thread in this case.
After calling `_rthread_init`, `pthread_create()` changes `__isthreaded` to 1.
Tib_thread will be accessed by all threads that are newly created and the 
initial one.

I believe tib of the initial thread has been initialized in `_libc_preinit' 
function
in 'lib/libc/dlfcn/init.c'.


I'm not sure where is the cleanest way to place the per-thread buffer,
I'd suggest you ask guenther@ about this.


I added guenther@ in CC of this mail.
I hope I can get an advise about per-thread buffer.


Maybe the initialization can be done outside of _mcount()?


Yes, I think tib is initialized in `pthread_create()` and `_libc_preinit()`.


I obtained merging function from NetBSD that is called in '_mcleanup' function.
Merging function needs to walk through all the individual buffers,
I added SLIST_ENTRY member in 'struct gmonparam' to make a list of the buffers.
And also added '#ifndef _KERNEL' for the SLIST_ENTRY member not to be used for
the kernel.

But I still use pthread_getspecific(3) for that can call destructor when
a thread is terminated. The individual buffer is moved to free list to reuse
for a new thread.


Here is a patch for this approach.


I have various comments:

We choose not to use C++ style comment (// comment) in the tree, could you
fix yours?

There's two copies of mcount.c, the other one lies in sys/lib/libkern could
you keep them in sync?

gmon.c is only compiled in userland and don't need #ifndef _KERNEL

In libc there's also the use of _MUTEX_LOCK/UNLOCK() macro instead of
calling pthread_mutex* directly.  This might help reduce the differences
between ST and MT paths.


Thanks for letting me know them.
I updated my patch as following according to the advice.

diff --git a/lib/libc/gmon/gmon.c b/lib/libc/gmon/gmon.c
index 1ce0a1c289e..5169fa70449 100644
--- a/lib/libc/gmon/gmon.c
+++ b/lib/libc/gmon/gmon.c
@@ -42,6 +42,15 @@
  
  struct gmonparam _gmonparam = { GMON_PROF_OFF };
  
+#include 

+#include 
+
+SLIST_HEAD(, gmonparam) _gmonfree = SLIST_HEAD_INITIALIZER(_gmonfree);
+SLIST_HEAD(, gmonparam) _gmoninuse = SLIST_HEAD_INITIALIZER(_gmoninuse);
+void* _gmonlock = NULL;
+pthread_key_t _gmonkey;
+struct gmonparam _gmondummy;
+
  static ints_scale;
  /* see profil(2) where this is describe (incorrectly) */
  #define   SCALE_1_TO_10x1L
@@ -52,6 +61,11 @@ PROTO_NORMAL(moncontrol);
  PROTO_DEPRECATED(monstartup);
  static int hertz(void);
  
+static void _gmon_destructor(void *);

+struct gmonparam *_gmon_alloc(void);
+static void _gmon_merge(void);
+static void _gmon_merge_two(struct gmonparam *, struct gmonparam *);
+
  void
  monstartup(u_long lowpc, u_long highpc)
  {
@@ -114,6 +128,9 @@ monstartup(u_long lowpc, u_long highpc)
} else
s_scale = SCALE_1_TO_1;
  
+	_gmondummy.state = GMON_PROF_BUSY;

+   pthread_key_create(&_gmonkey, _gmon_destru

Re: bsd.upgrade fails `umount /mnt` with a single partition disk.

2021-11-10 Thread Yuichiro NAITO
0x100080  kqreadresolvd
 18305  363865  1  0  30x10008b  sigsusp   sh
 42161  407682  0  0  3 0x14200  bored smr
 29177  464899  0  0  2 0x14200zerothread
 39289  156385  0  0  3 0x14200  aiodoned  aiodoned
 21282  513038  0  0  3 0x14200  syncerupdate
 13154   20732  0  0  3 0x14200  cleaner   cleaner
 59578   63904  0  0  3 0x14200  reaperreaper
 76905  342808  0  0  3 0x14200  pgdaemon  pagedaemon
 52686   22631  0  0  3 0x14200  usbtskusbtask
 91247  141316  0  0  3 0x14200  usbatsk   usbatsk
  1259  298570  0  0  3  0x40014200  acpi0 acpi0
 75666  385347  0  0  3 0x14200  bored softnet
 13987  278995  0  0  3 0x14200  bored systqmp
 95941  144455  0  0  3 0x14200  bored systq
 54048  119232  0  0  3  0x40014200  bored softclock
 60901  220570  0  0  3  0x40014200idle0
 1  467165  0  0  30x82  wait  init
 0   0 -1  0  3 0x10200  scheduler swapper
```

It seems that dd is just reading now.
I looked for where dd is invoked in install.sub script.

```
 2641 # Feed the random pool some entropy before we read from it.
 2642 feed_random() {
 2643 (dmesg; cat $CGI_INFO /*.conf; sysctl; route -n show; df;
 2644 ifconfig -A; hostname) >/dev/random 2>&1
 2645 if [[ -e /mnt/var/db/host.random ]]; then
 2646 dd if=/mnt/var/db/host.random of=/dev/random 
bs=65536 count=1 \

 2647 status=none
 2648 fi
 2649 }
```

Dd is invoked in feed_random function.
Feed_random is called in start_cgiinfo function in background.

```
 2505 # Fetch the list of mirror servers and installer choices from 
previous runs if
 2506 # available from ftplist.cgi. Start the ftp process in the 
background, but kill

 2507 # it if it takes longer than 12 seconds.
 2508 start_cgiinfo() {
 2509 # If no networks are configured, we do not need the 
httplist file.

 2510 ((NIFS < 1)) && return
 2511
 2512 # Ensure proper name resolution in case there's no dns yet.
 2513 add_hostent 129.128.5.191 ftp.openbsd.org
 2514
 2515 # Make sure the ftp subshell gets its own process group.
 2516 set -m
 2517 (
 2518 unpriv2 ftp -w 15 -Vao - \
 2519 
"$HTTP_PROTO://ftp.openbsd.org/cgi-bin/ftplist.cgi?dbversion=1" \

 2520 2>/dev/null >$CGI_INFO
 2521
 2522 # Remember finish time for adjusting the received 
timestamp.

 2523 echo -n $SECONDS >$HTTP_SEC
 2524 feed_random
 2525 ) & CGIPID=$!
 2526 set +m
 2527
 2528 # If the ftp process takes more than 12 seconds, kill it.
 2529 # XXX We are relying on the pid space not randomly biting us.
 2530 # XXX ftp could terminate early, and the pid could be reused.
 2531 (sleep 12; kill -INT -$CGIPID >/dev/null 2>&1) &
 2532 }
```

For testing, I changed to execute feed_random in foreground as follows.

```
--- a/distrib/miniroot/install.sub
+++ b/distrib/miniroot/install.sub
@@ -2514,7 +2514,6 @@ start_cgiinfo() {

# Make sure the ftp subshell gets its own process group.
set -m
-   (
unpriv2 ftp -w 15 -Vao - \

"$HTTP_PROTO://ftp.openbsd.org/cgi-bin/ftplist.cgi?dbversion=1" \
2>/dev/null >$CGI_INFO
@@ -2522,13 +2521,11 @@ start_cgiinfo() {
# Remember finish time for adjusting the received 
timestamp.

echo -n $SECONDS >$HTTP_SEC
feed_random
-   ) & CGIPID=$!
set +m

# If the ftp process takes more than 12 seconds, kill it.
# XXX We are relying on the pid space not randomly biting us.
# XXX ftp could terminate early, and the pid could be reused.
-   (sleep 12; kill -INT -$CGIPID >/dev/null 2>&1) &
 }
```

Bsd.sysupgrade always succeeds in my bhyve environment.
The reason of umount failure is concurrent execution with dd.

I don't have better idea than executing dd in foreground.
How should we fix this?


On 11/8/21 12:55, Yuichiro NAITO wrote:

Thanks for the comment.
I will see it further more.

On 11/8/21 12:45, Theo de Raadt wrote:

Doesn't this suggest a disk driver failed to perform/complete a write?
Not right at that moment but perhaps in the recent past?

Yuichiro NAITO  wrote:

bsd.upgrade fails and shows following messages with a single 
partition disk.


```
Force checking of clean non-root filesystems? [no] no
umount: /mnt: Device busy
Can't umount sd0a!
```

sd0 has following partitions.

```
#    size   offset  fstype [

Re: bsd.upgrade fails `umount /mnt` with a single partition disk.

2021-11-07 Thread Yuichiro NAITO

Thanks for the comment.
I will see it further more.

On 11/8/21 12:45, Theo de Raadt wrote:

Doesn't this suggest a disk driver failed to perform/complete a write?
Not right at that moment but perhaps in the recent past?

Yuichiro NAITO  wrote:


bsd.upgrade fails and shows following messages with a single partition disk.

```
Force checking of clean non-root filesystems? [no] no
umount: /mnt: Device busy
Can't umount sd0a!
```

sd0 has following partitions.

```
#size   offset  fstype [fsize bsize   cpg]
   a: 37752672   64  4.2bsd   2048 16384 12959 # /
   b:  4176900 37752750swap# none
   c: 419430400  unused
```

I don't want to say any thing about disk partitioning policy.
I just want to improve bsd.upgrade more stable.

This error happens in a specific environment.
In my cases, OpenBSD on FreeBSD bhyve in Intel Core i7-6700 box caused
this error. No other environments (that I have) reproduce it.

In my investigation, the following patch caused panic in the bhyve
environment.

diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
index 43a7cbd4ae9..4f4176a77ac 100644
--- a/sys/kern/vfs_subr.c
+++ b/sys/kern/vfs_subr.c
@@ -966,6 +966,7 @@ vflush_vnode(struct vnode *vp, void *arg)
vprint("vflush: busy vnode", vp);
  #endif
va->busy++;
+   panic("vflush_vnode: mark busy");
return (0);
  }

I think some vnodes are still remaining while `umount /mnt`.
So, the kernel returns EBUSY.
It's right to kernel behavior and user land should handle this case.

Doing `umount /mnt` just after `fsck -p` in disrib/miniroot/install.sub.

```
3264# Create a skeletal /etc/fstab which is usable for the upgrade process.
3265munge_fstab
3266
3267# fsck -p non-root filesystems in /etc/fstab.
3268check_fs
3269
3270# Mount filesystems in /etc/fstab.
3272umount /mnt || { echo "Can't umount $ROOTDEV!"; exit; }
3273mount_fs
```

Is it necessary that /mnt is mounted while `check_fs` function?
The target disk information is already read from /mnt/etc/fstab in
`munge_fstab` function. `check_fs` uses the merged fstab in /tmp.

Just `umount /mnt` before `check_fs` can get avoid this error.
Is the following patch OK?

```
diff --git a/distrib/miniroot/install.sub b/distrib/miniroot/install.sub
old mode 100644
new mode 100755
index 16d3b6e3420..cf7e15c9754
--- a/distrib/miniroot/install.sub
+++ b/distrib/miniroot/install.sub
@@ -3243,11 +3243,12 @@ do_upgrade() {
# Create a skeletal /etc/fstab which is usable for the upgrade process.
munge_fstab

+   umount /mnt || { echo "Can't umount $ROOTDEV!"; exit; }
+
# fsck -p non-root filesystems in /etc/fstab.
check_fs

# Mount filesystems in /etc/fstab.
-   umount /mnt || { echo "Can't umount $ROOTDEV!"; exit; }
mount_fs

rm -f /mnt/bsd.upgrade /mnt/auto_upgrade.conf
```

--
Yuichiro NAITO (naito.yuich...@gmail.com)



--
Yuichiro NAITO (naito.yuich...@gmail.com)



bsd.upgrade fails `umount /mnt` with a single partition disk.

2021-11-07 Thread Yuichiro NAITO

bsd.upgrade fails and shows following messages with a single partition disk.

```
Force checking of clean non-root filesystems? [no] no
umount: /mnt: Device busy
Can't umount sd0a!
```

sd0 has following partitions.

```
#size   offset  fstype [fsize bsize   cpg]
  a: 37752672   64  4.2bsd   2048 16384 12959 # /
  b:  4176900 37752750swap# none
  c: 419430400  unused
```

I don't want to say any thing about disk partitioning policy.
I just want to improve bsd.upgrade more stable.

This error happens in a specific environment.
In my cases, OpenBSD on FreeBSD bhyve in Intel Core i7-6700 box caused 
this error. No other environments (that I have) reproduce it.


In my investigation, the following patch caused panic in the bhyve 
environment.


diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
index 43a7cbd4ae9..4f4176a77ac 100644
--- a/sys/kern/vfs_subr.c
+++ b/sys/kern/vfs_subr.c
@@ -966,6 +966,7 @@ vflush_vnode(struct vnode *vp, void *arg)
vprint("vflush: busy vnode", vp);
 #endif
va->busy++;
+   panic("vflush_vnode: mark busy");
return (0);
 }

I think some vnodes are still remaining while `umount /mnt`.
So, the kernel returns EBUSY.
It's right to kernel behavior and user land should handle this case.

Doing `umount /mnt` just after `fsck -p` in disrib/miniroot/install.sub.

```
3264# Create a skeletal /etc/fstab which is usable for the upgrade process.
3265munge_fstab
3266
3267# fsck -p non-root filesystems in /etc/fstab.
3268check_fs
3269
3270# Mount filesystems in /etc/fstab.
3272umount /mnt || { echo "Can't umount $ROOTDEV!"; exit; }
3273mount_fs
```

Is it necessary that /mnt is mounted while `check_fs` function?
The target disk information is already read from /mnt/etc/fstab in 
`munge_fstab` function. `check_fs` uses the merged fstab in /tmp.


Just `umount /mnt` before `check_fs` can get avoid this error.
Is the following patch OK?

```
diff --git a/distrib/miniroot/install.sub b/distrib/miniroot/install.sub
old mode 100644
new mode 100755
index 16d3b6e3420..cf7e15c9754
--- a/distrib/miniroot/install.sub
+++ b/distrib/miniroot/install.sub
@@ -3243,11 +3243,12 @@ do_upgrade() {
# Create a skeletal /etc/fstab which is usable for the upgrade process.
munge_fstab

+   umount /mnt || { echo "Can't umount $ROOTDEV!"; exit; }
+
# fsck -p non-root filesystems in /etc/fstab.
check_fs

# Mount filesystems in /etc/fstab.
-   umount /mnt || { echo "Can't umount $ROOTDEV!"; exit; }
mount_fs

rm -f /mnt/bsd.upgrade /mnt/auto_upgrade.conf
```

--
Yuichiro NAITO (naito.yuich...@gmail.com)



Re: Exit status of pkg_add

2021-10-18 Thread Yuichiro NAITO

Following patch changes pkg_add to return a error code,
if a package name is wrong.

diff --git a/usr.sbin/pkg_add/OpenBSD/AddDelete.pm 
b/usr.sbin/pkg_add/OpenBSD/AddDelete.pm

index 7a968cbf05d..39bee874ff1 100644
--- a/usr.sbin/pkg_add/OpenBSD/AddDelete.pm
+++ b/usr.sbin/pkg_add/OpenBSD/AddDelete.pm
@@ -403,12 +403,13 @@ sub check_root
 sub choose_location
 {
my ($state, $name, $list, $is_quirks) = @_;
if (@$list == 0) {
if (!$is_quirks) {
$state->errsay("Can't find #1", $name);
+   $state->{bad}++;
$state->run_quirks(
sub {
my $quirks = shift;
$quirks->filter_obsolete([$name], $state);
});
}

Is it OK?

On 10/18/21 16:53, Yuichiro NAITO wrote:

Hi, I have a question about exit status of pkg_add command.

When I wrote a package install script which included typo in a package name
(of course it's my fault), the script didn't stop in spite of `set -e`.

Because pkg_add command returns 0 even if a package name is wrong.
Is this exit status intended or design policy of pkg_add command?

If not, I want a error status getting returned.
It will save my time to look for a typo or similar bug.

I can't see 'EXIT STATUS' section in the pkg_add manual of OpenBSD 7.0.
So, I e-mailed this question.



--
Yuichiro NAITO (naito.yuich...@gmail.com)



gdb: follow frame pointer link over trap frame

2021-09-10 Thread Yuichiro NAITO
db89 in syscall (frame=0xbff5d1c299b74ff0) at 
sys/syscall_mi.h:102
#23 0x81dff134 in Xsyscall ()
```

Without my patch, gdb doesn't show the back trace after 
`alltraps_kern_meltdown`.

```
#0  dumpsys () at /usr/src/sys/arch/amd64/amd64/machdep.c:1055
#1  0x8118d5e7 in boot (howto=18688) at 
/usr/src/sys/archa/amd64/amd64/machdep.c:847
#2  0x8174dd32 in reboot (howto=Variable "howto" is not available.
) at /usr/src/sys/kern/kern_xxx.c:74
#3  0x81b59eb0 in db_reboot (howto=Variable "howto" is not available.
) at /usr/src/sys/ddb/db_command.c:769
#4  0x81b59922 in db_boot_dump_cmd () at 
/usr/src/sys/ddb/db_command.c:787
#5  0x81b5900e in db_command (last_cmdp=0x4900, cmd_table=Variable 
"cmd_table" is not available.
) at /usr/src/sys/ddb/db_command.c:285
#6  0x81b59de4 in db_command_loop () at 
/usr/src/sys/ddb/db_command.c:686
#7  0x81a60185 in db_trap (type=Unhandled dwarf expression opcode 0xa3
) at /usr/src/sys/ddb/db_trap.c:92
#8  0x81b600d8 in db_ktrap (type=1, code=0, regs=0x8000214ef570)
at /usr/src/sys/arch/amd64/amd64/db_interface.c:149
#9  0x8119d44c in kerntrap (frame=0x0) at 
/usr/src/sys/arch/amd64/amd64/trap.c:309
#10 0x81dfe1ce in alltraps_kern_meltdown ()
#11 0x8000214ef660 in ?? ()
#12 0xffff822774a8 in x86_soft_intrs ()
#13 0x9095 in ?? ()
#14 0x in ?? ()
```

-- 
Yuichiro NAITO (naito.yuic...@gmail.com)



Re: gprof: Profiling a multi-threaded application

2021-07-12 Thread Yuichiro NAITO
Hi, Martin

n 2021/07/10 16:55, Martin Pieuchot wrote:
> Hello Yuichiro, thanks for your work !

Thanks for the response.

>> On 2021/06/16 16:34, Yuichiro NAITO wrote:
>>> When I compile a multi-threaded application with '-pg' option, I always get 
>>> no
>>> results in gmon.out. With bad luck, my application gets SIGSEGV while 
>>> running.
>>> Because the buffer that holds number of caller/callee count is the only one
>>> in the process and will be broken by multi-threaded access.
>>>
>>> I get the idea to solve this problem from NetBSD. NetBSD has individual 
>>> buffers
>>> for each thread and merges them at the end of profiling.
> 
> Note that the kernel use a similar approach but doesn't merge the buffer,
> instead it generates a file for each CPU.

Yes, so the number of output files are limited by the number of CPUs in case of 
the kernel profiling. I think number of application threads can be increased 
more
casually. I don't want to see dozens of 'gmon.out' files.

>>> NetBSD stores the reference to the individual buffer by 
>>> pthread_setspecific(3).
>>> I think it causes infinite recursive call if whole libc library (except
>>> mcount.c) is compiled with -pg.
>>>
>>> The compiler generates '_mcount' function call at the beginning of every
>>> functions. If '_mcount' calls pthread_getspecific(3) to get the individual
>>> buffer, pthread_getspecific() calls '_mcount' again and causes infinite
>>> recursion.
>>>
>>> NetBSD prevents from infinite recursive call by checking a global variable. 
>>> But
>>> this approach also prevents simultaneously call of '_mcount' on a 
>>> multi-threaded
>>> application. It makes a little bit wrong results of profiling.
>>>
>>> So I added a pointer to the buffer in `struct pthread` that can be 
>>> accessible
>>> via macro call as same as pthread_self(3). This approach can prevent of
>>> infinite recursive call of '_mcount'.
> 
> Not calling a libc function for this makes sense.  However I'm not
> convinced that accessing `tib_thread' before _rthread_init() has been
> called is safe.

Before '_rthread_init’ is called, '__isthreaded' global variable is kept to be 
0.
My patch doesn't access tib_thread in this case.
After calling `_rthread_init`, `pthread_create()` changes `__isthreaded` to 1.
Tib_thread will be accessed by all threads that are newly created and the 
initial one.

I believe tib of the initial thread has been initialized in `_libc_preinit' 
function
in 'lib/libc/dlfcn/init.c'.

> I'm not sure where is the cleanest way to place the per-thread buffer,
> I'd suggest you ask guenther@ about this.

I added guenther@ in CC of this mail.
I hope I can get an advise about per-thread buffer.

> Maybe the initialization can be done outside of _mcount()?

Yes, I think tib is initialized in `pthread_create()` and `_libc_preinit()`.

>>> I obtained merging function from NetBSD that is called in '_mcleanup' 
>>> function.
>>> Merging function needs to walk through all the individual buffers,
>>> I added SLIST_ENTRY member in 'struct gmonparam' to make a list of the 
>>> buffers.
>>> And also added '#ifndef _KERNEL' for the SLIST_ENTRY member not to be used 
>>> for
>>> the kernel.
>>>
>>> But I still use pthread_getspecific(3) for that can call destructor when
>>> a thread is terminated. The individual buffer is moved to free list to reuse
>>> for a new thread.
>>
>> Here is a patch for this approach.
> 
> I have various comments:
> 
> We choose not to use C++ style comment (// comment) in the tree, could you
> fix yours?
> 
> There's two copies of mcount.c, the other one lies in sys/lib/libkern could
> you keep them in sync?
> 
> gmon.c is only compiled in userland and don't need #ifndef _KERNEL
> 
> In libc there's also the use of _MUTEX_LOCK/UNLOCK() macro instead of
> calling pthread_mutex* directly.  This might help reduce the differences
> between ST and MT paths.

Thanks for letting me know them.
I updated my patch as following according to the advice.

diff --git a/lib/libc/gmon/gmon.c b/lib/libc/gmon/gmon.c
index 1ce0a1c289e..5169fa70449 100644
--- a/lib/libc/gmon/gmon.c
+++ b/lib/libc/gmon/gmon.c
@@ -42,6 +42,15 @@
 
 struct gmonparam _gmonparam = { GMON_PROF_OFF };
 
+#include 
+#include 
+
+SLIST_HEAD(, gmonparam) _gmonfree = SLIST_HEAD_INITIALIZER(_gmonfree);
+SLIST_HEAD(, gmonparam) _gmoninuse = SLIST_HEAD_INITIALIZER(_gmoninuse);
+void* _gmonlock = NULL;
+pthread_key_t _gmonkey;
+struct gmonparam _gmondummy;
+
 static int s_scale;
 /* see profil(2) where this is describe (in

Re: gprof: Profiling a multi-threaded application

2021-06-21 Thread Yuichiro NAITO
if (p == NULL) {
+   // prevent recursive call of _gmon_alloc().
+   t->gmonparam = &_gmondummy;
+   if ((t->gmonparam = _gmon_alloc()) == NULL)
+   return;
+   p = t->gmonparam;
+   }
+   } else
+   p = &_gmonparam;
 #endif
/*
 * check that we are profiling
diff --git a/lib/libc/include/thread_private.h 
b/lib/libc/include/thread_private.h
index 237c3fbd034..d6c89c8334b 100644
--- a/lib/libc/include/thread_private.h
+++ b/lib/libc/include/thread_private.h
@@ -5,6 +5,8 @@
 #ifndef _THREAD_PRIVATE_H_
 #define _THREAD_PRIVATE_H_
 
+#include 
+#include 
 #include  /* for FILE and __isthreaded */
 
 #define _MALLOC_MUTEXES 32
@@ -389,6 +391,7 @@ struct pthread {
 
/* cancel received in a delayed cancel block? */
int delayed_cancel;
+   struct gmonparam *gmonparam;
 };
 /* flags in pthread->flags */
 #defineTHREAD_DONE 0x001
diff --git a/sys/sys/gmon.h b/sys/sys/gmon.h
index 1a63cb52b1b..cce98752346 100644
--- a/sys/sys/gmon.h
+++ b/sys/sys/gmon.h
@@ -35,6 +35,7 @@
 #ifndef _SYS_GMON_H_
 #define _SYS_GMON_H_
 
+#include 
 #include 
 
 /*
@@ -136,6 +137,7 @@ struct gmonparam {
u_long  highpc;
u_long  textsize;
u_long  hashfraction;
+   SLIST_ENTRY(gmonparam)  next;
 };
 
 /*


On 2021/06/16 16:34, Yuichiro NAITO wrote:
> When I compile a multi-threaded application with '-pg' option, I always get no
> results in gmon.out. With bad luck, my application gets SIGSEGV while running.
> Because the buffer that holds number of caller/callee count is the only one
> in the process and will be broken by multi-threaded access.
> 
> I get the idea to solve this problem from NetBSD. NetBSD has individual 
> buffers
> for each thread and merges them at the end of profiling.
> 
> NetBSD stores the reference to the individual buffer by 
> pthread_setspecific(3).
> I think it causes infinite recursive call if whole libc library (except
> mcount.c) is compiled with -pg.
> 
> The compiler generates '_mcount' function call at the beginning of every
> functions. If '_mcount' calls pthread_getspecific(3) to get the individual
> buffer, pthread_getspecific() calls '_mcount' again and causes infinite
> recursion.
> 
> NetBSD prevents from infinite recursive call by checking a global variable. 
> But
> this approach also prevents simultaneously call of '_mcount' on a 
> multi-threaded
> application. It makes a little bit wrong results of profiling.
> 
> So I added a pointer to the buffer in `struct pthread` that can be accessible
> via macro call as same as pthread_self(3). This approach can prevent of
> infinite recursive call of '_mcount'.
> 
> I obtained merging function from NetBSD that is called in '_mcleanup' 
> function.
> Merging function needs to walk through all the individual buffers,
> I added SLIST_ENTRY member in 'struct gmonparam' to make a list of the 
> buffers.
> And also added '#ifndef _KERNEL' for the SLIST_ENTRY member not to be used for
> the kernel.
> 
> But I still use pthread_getspecific(3) for that can call destructor when
> a thread is terminated. The individual buffer is moved to free list to reuse
> for a new thread.


-- 
Yuichiro NAITO (naito.yuic...@gmail.com)



gprof: Profiling a multi-threaded application

2021-06-16 Thread Yuichiro NAITO
IST_FOREACH(q, &_gmonfree, next)
+   _gmon_merge_two(&_gmonparam, q);
+
+   SLIST_FOREACH(q, &_gmoninuse, next) {
+   q->state = GMON_PROF_OFF;
+   _gmon_merge_two(&_gmonparam, q);
+   }
+
+   pthread_mutex_unlock(&_gmonlock);
+}
+#endif
+
+
 void
 _mcleanup(void)
 {
@@ -233,6 +443,9 @@ _mcleanup(void)
snprintf(dbuf, sizeof dbuf, "[mcleanup1] kcount 0x%x ssiz %d\n",
p->kcount, p->kcountsize);
write(log, dbuf, strlen(dbuf));
+#endif
+#ifndef _KERNEL
+   _gmon_merge();
 #endif
hdr = (struct gmonhdr *)
bzero(hdr, sizeof(*hdr));
diff --git a/lib/libc/gmon/mcount.c b/lib/libc/gmon/mcount.c
index f0ce70dd6ae..2394e539337 100644
--- a/lib/libc/gmon/mcount.c
+++ b/lib/libc/gmon/mcount.c
@@ -31,6 +31,15 @@
 #include 
 #include 
 
+#ifndef _KERNEL
+#include  // for the use of '__isthreaded'.
+#include 
+#include 
+#include 
+extern struct gmonparam _gmondummy;
+struct gmonparam *_gmon_alloc(void);
+#endif
+
 /*
  * mcount is called on entry to each function compiled with the profiling
  * switch set.  _mcount(), which is declared in a machine-dependent way
@@ -65,7 +74,22 @@ _MCOUNT_DECL(u_long frompc, u_long selfpc)
if ((p = curcpu()->ci_gmon) == NULL)
return;
 #else
-   p = &_gmonparam;
+   if (__isthreaded) {
+   if (_gmonparam.state != GMON_PROF_ON)
+   return;
+   pthread_t t = TIB_GET()->tib_thread;
+   p = t->gmonparam;
+   if (p == &_gmondummy)
+   return;
+   if (p == NULL) {
+   // prevent recursive call of _gmon_alloc().
+   t->gmonparam = &_gmondummy;
+   if ((t->gmonparam = _gmon_alloc()) == NULL)
+   return;
+   p = t->gmonparam;
+   }
+   } else
+   p = &_gmonparam;
 #endif
/*
 * check that we are profiling
diff --git a/lib/libc/include/thread_private.h 
b/lib/libc/include/thread_private.h
index 237c3fbd034..d6c89c8334b 100644
--- a/lib/libc/include/thread_private.h
+++ b/lib/libc/include/thread_private.h
@@ -5,6 +5,8 @@
 #ifndef _THREAD_PRIVATE_H_
 #define _THREAD_PRIVATE_H_
 
+#include 
+#include 
 #include  /* for FILE and __isthreaded */
 
 #define _MALLOC_MUTEXES 32
@@ -389,6 +391,7 @@ struct pthread {
 
/* cancel received in a delayed cancel block? */
int delayed_cancel;
+   struct gmonparam *gmonparam;
 };
 /* flags in pthread->flags */
 #defineTHREAD_DONE 0x001
diff --git a/sys/sys/gmon.h b/sys/sys/gmon.h
index 1a63cb52b1b..1e9ab4054bd 100644
--- a/sys/sys/gmon.h
+++ b/sys/sys/gmon.h
@@ -35,6 +35,7 @@
 #ifndef _SYS_GMON_H_
 #define _SYS_GMON_H_
 
+#include 
 #include 
 
 /*
@@ -136,6 +137,9 @@ struct gmonparam {
u_long  highpc;
u_long  textsize;
u_long  hashfraction;
+#ifndef _KERNEL
+   SLIST_ENTRY(gmonparam)  next;
+#endif
 };
 
 /*

-- 
Yuichiro NAITO (naito.yuic...@gmail.com)



Re: Wireguard: can't remove multiple peers at once.

2021-01-13 Thread Yuichiro NAITO
Does anybody please review my code?

Yasuoka-san is my coleague of my work.
So, he is interested in this topic. That’s why I CCed this mail.
I don’t mean he is an reviewer.

> 2021/01/12 11:27、Yuichiro NAITO のメール:
> 
> 
> Hi.
> 
> I have set up multiple peers in a wg0 interface,
> and tried to remove more than one peers at once.
> Ifconfig(1) only removes the first peer.
> 
> Command line was like following.
> 
> ```
> # ifconfig wg0 -wgpeer  -wgpeer  -wgpeer 
> ```
> 
> Only  was removed.
> 
> I think next peer pointer isn't calculated in case of removing peer
> in sys/net/if_wg.c: wg_ioctl_set() function.
> 
> I have tried following patch that can fix this problem.
> Is it OK?
> 
> diff --git a/sys/net/if_wg.c b/sys/net/if_wg.c
> index c534f966363..c21f883269f 100644
> --- a/sys/net/if_wg.c
> +++ b/sys/net/if_wg.c
> @@ -2270,7 +2270,7 @@ wg_ioctl_set(struct wg_softc *sc, struct wg_data_io 
> *data)
> 
>   /* Peer must have public key */
>   if (!(peer_o.p_flags & WG_PEER_HAS_PUBLIC))
> - continue;
> + goto next_peer;
> 
>   /* 0 = latest protocol, 1 = this protocol */
>   if (peer_o.p_protocol_version != 0) {
> @@ -2283,7 +2283,7 @@ wg_ioctl_set(struct wg_softc *sc, struct wg_data_io 
> *data)
>   /* Get local public and check that peer key doesn't match */
>   if (noise_local_keys(>sc_local, public, NULL) == 0 &&
>   bcmp(public, peer_o.p_public, WG_KEY_SIZE) == 0)
> - continue;
> + goto next_peer;
> 
>   /* Lookup peer, or create if it doesn't exist */
>   if ((peer = wg_peer_lookup(sc, peer_o.p_public)) == NULL) {
> @@ -2291,7 +2291,7 @@ wg_ioctl_set(struct wg_softc *sc, struct wg_data_io 
> *data)
>* Also, don't create a new one if we only want to
>* update. */
>   if (peer_o.p_flags & (WG_PEER_REMOVE|WG_PEER_UPDATE))
> - continue;
> + goto next_peer;
> 
>   if ((peer = wg_peer_create(sc,
>   peer_o.p_public)) == NULL) {
> @@ -2303,7 +2303,7 @@ wg_ioctl_set(struct wg_softc *sc, struct wg_data_io 
> *data)
>   /* Remove peer and continue if specified */
>   if (peer_o.p_flags & WG_PEER_REMOVE) {
>   wg_peer_destroy(peer);
> - continue;
> + goto next_peer;
>   }
> 
>   if (peer_o.p_flags & WG_PEER_HAS_ENDPOINT)
> @@ -2333,6 +2333,11 @@ wg_ioctl_set(struct wg_softc *sc, struct wg_data_io 
> *data)
>   }
> 
>   peer_p = (struct wg_peer_io *)aip_p;
> + continue;
> + next_peer:
> + aip_p = _p->p_aips[0];
> + aip_p += peer_o.p_aips_count;
> + peer_p = (struct wg_peer_io *)aip_p;
>   }
> 
> error:
> 
> —
> Yuichiro NAITO
> naito.yuich...@gmail.com
> 

—
Yuichiro NAITO
naito.yuich...@gmail.com






Wireguard: can't remove multiple peers at once.

2021-01-11 Thread Yuichiro NAITO


Hi.

I have set up multiple peers in a wg0 interface,
and tried to remove more than one peers at once.
Ifconfig(1) only removes the first peer.

Command line was like following.

```
# ifconfig wg0 -wgpeer  -wgpeer  -wgpeer 
```

Only  was removed.

I think next peer pointer isn't calculated in case of removing peer
in sys/net/if_wg.c: wg_ioctl_set() function.

I have tried following patch that can fix this problem.
Is it OK?

diff --git a/sys/net/if_wg.c b/sys/net/if_wg.c
index c534f966363..c21f883269f 100644
--- a/sys/net/if_wg.c
+++ b/sys/net/if_wg.c
@@ -2270,7 +2270,7 @@ wg_ioctl_set(struct wg_softc *sc, struct wg_data_io *data)
 
/* Peer must have public key */
if (!(peer_o.p_flags & WG_PEER_HAS_PUBLIC))
-   continue;
+   goto next_peer;
 
/* 0 = latest protocol, 1 = this protocol */
if (peer_o.p_protocol_version != 0) {
@@ -2283,7 +2283,7 @@ wg_ioctl_set(struct wg_softc *sc, struct wg_data_io *data)
/* Get local public and check that peer key doesn't match */
if (noise_local_keys(>sc_local, public, NULL) == 0 &&
bcmp(public, peer_o.p_public, WG_KEY_SIZE) == 0)
-   continue;
+   goto next_peer;
 
/* Lookup peer, or create if it doesn't exist */
if ((peer = wg_peer_lookup(sc, peer_o.p_public)) == NULL) {
@@ -2291,7 +2291,7 @@ wg_ioctl_set(struct wg_softc *sc, struct wg_data_io *data)
 * Also, don't create a new one if we only want to
 * update. */
if (peer_o.p_flags & (WG_PEER_REMOVE|WG_PEER_UPDATE))
-   continue;
+   goto next_peer;
 
if ((peer = wg_peer_create(sc,
peer_o.p_public)) == NULL) {
@@ -2303,7 +2303,7 @@ wg_ioctl_set(struct wg_softc *sc, struct wg_data_io *data)
/* Remove peer and continue if specified */
if (peer_o.p_flags & WG_PEER_REMOVE) {
wg_peer_destroy(peer);
-   continue;
+   goto next_peer;
}
 
if (peer_o.p_flags & WG_PEER_HAS_ENDPOINT)
@@ -2333,6 +2333,11 @@ wg_ioctl_set(struct wg_softc *sc, struct wg_data_io 
*data)
}
 
peer_p = (struct wg_peer_io *)aip_p;
+   continue;
+   next_peer:
+   aip_p = _p->p_aips[0];
+   aip_p += peer_o.p_aips_count;
+   peer_p = (struct wg_peer_io *)aip_p;
    }
 
 error:

—
Yuichiro NAITO
naito.yuich...@gmail.com



Re: IPv6 packets are not forwarded via IPsec tunnel

2020-12-07 Thread Yuichiro NAITO



> 2020/12/07 16:47、Claudio Jeker のメール:
> 
> On Mon, Dec 07, 2020 at 01:00:05PM +0900, Yuichiro NAITO wrote:
>> Hi.
>> 
>> I have set up OpenBSD as a IPsec gateway and tried to forward IPv6 packets.
>> But IPv6 packets are not forwarded via IPsec tunnel.
>> IPv4 forwarding via IPsec works for me.
>> 
>> Of course, I set following MIBs.
>> 
>>  net.inet.ip.forwarding=1
>>  net.inet6.ip6.forwarding=1
>> 
>> I have confirmed ipsec policies by 'ipsecctl -sa'.
>> Both SPD and SPA entries are good to me.
>> 
>> In my investigation,
>> IPv6 packets are forwarded by ip6_forward() function in ip6_forward.c.
>> One of the arguments of ip6_forward() is "struct rtentry *rt" that will be 
>> NULL,
>> if the packet is forwarded via IPsec tunnel.
>> Because next hop information is written in SPD and not in the routing table.
>> 
>> If rt is NULL, "rtisvalid(rt)" check is not satisfied,
>> OpenBSD sends back network unreachable ICMP6 packet before 
>> ip6_output_ipsec_send().
>> 
>> I have tried the following patch that makes dummy 'rt' to pass 
>> "rtisvalid(rt)” check
>> and do the Scope check. It forwards IPv6 packets as I expected.
> 
> IPsec requires valid routing to work both for IPv6 and IPv4. At least you
> need to have a default route that forwards the packet. The IPsec flow will
> trigger on the output and "steal" the packets away.

There are 2 destination addresses for IPsec.
The inner packet destination address is written in the encapsulated packet and
the final address to be delivered.
The outer packet destination address is peer address of IPsec tunnel.

Do you mean that I have to set the routing table entry for the inner packet
destination address?

When I set a default route of IPv6, IPv6 packets are forwarded via IPsec.
So it sounds strange to me but necessary for me.

> Is there a reason why your IPv6 setup has no matching route that would
> allow ip6_forward to work?

My case of IPv6 network is for testing how iked(8) works.
So all routing tables are set by manual.

And when I set up OpenBSD that is not allowed to connect to the Internet,
I won't set a default gateway for a security reason.

>> diff --git a/sys/netinet6/ip6_forward.c b/sys/netinet6/ip6_forward.c
>> index e47047e31f1..4dac045f243 100644
>> --- a/sys/netinet6/ip6_forward.c
>> +++ b/sys/netinet6/ip6_forward.c
>> @@ -91,6 +91,7 @@ ip6_forward(struct mbuf *m, struct rtentry *rt, int srcrt)
>>  struct mbuf *mcopy = NULL;
>> #ifdef IPSEC
>>  struct tdb *tdb = NULL;
>> +struct rtentry rtent;
>> #endif /* IPSEC */
>>  char src6[INET6_ADDRSTRLEN], dst6[INET6_ADDRSTRLEN];
>> 
>> @@ -157,6 +158,17 @@ reroute:
>>  m_freem(m);
>>  goto freecopy;
>>  }
>> +if (tdb != NULL && rt == NULL) {
>> +/*
>> + * If we try to forward via IPsec,
>> + * create dummy rtent for scope check.
>> + */
>> +rt = 
>> +memset(rt, 0, sizeof(rtent));
>> +SET(rt->rt_flags, RTF_UP);
>> +/* Do not match any existing interface */
>> +memset(>rt_ifidx, 0xff, sizeof(rt->rt_ifidx));
>> +}
>>  }
>> #endif /* IPSEC */
>> 
>> @@ -302,6 +314,9 @@ reroute:
>>  /* tag as generated to skip over pf_test on rerun */
>>  m->m_pkthdr.pf.flags |= PF_TAG_GENERATED;
>>  srcrt = 1;
>> +#ifdef IPSEC
>> +if (rt != )
>> +#endif /* IPSEC */
>>  rtfree(rt);
>>  rt = NULL;
>>  if_put(ifp);
>> @@ -369,6 +384,9 @@ senderr:
>> freecopy:
>>  m_freem(mcopy);
>> out:
>> +#ifdef IPSEC
>> +if (rt != )
>> +#endif /* IPSEC */
>>  rtfree(rt);
>>  if_put(ifp);
>> }
>> 
>> —
>> Yuichiro NAITO
>> naito.yuich...@gmail.com
>> 
> 
> -- 
> :wq Claudio

—
Yuichiro NAITO
naito.yuich...@gmail.com






IPv6 packets are not forwarded via IPsec tunnel

2020-12-06 Thread Yuichiro NAITO
Hi.

I have set up OpenBSD as a IPsec gateway and tried to forward IPv6 packets.
But IPv6 packets are not forwarded via IPsec tunnel.
IPv4 forwarding via IPsec works for me.

Of course, I set following MIBs.

  net.inet.ip.forwarding=1
  net.inet6.ip6.forwarding=1

I have confirmed ipsec policies by 'ipsecctl -sa'.
Both SPD and SPA entries are good to me.

In my investigation,
IPv6 packets are forwarded by ip6_forward() function in ip6_forward.c.
One of the arguments of ip6_forward() is "struct rtentry *rt" that will be NULL,
if the packet is forwarded via IPsec tunnel.
Because next hop information is written in SPD and not in the routing table.

If rt is NULL, "rtisvalid(rt)" check is not satisfied,
OpenBSD sends back network unreachable ICMP6 packet before 
ip6_output_ipsec_send().

I have tried the following patch that makes dummy 'rt' to pass "rtisvalid(rt)” 
check
and do the Scope check. It forwards IPv6 packets as I expected.

diff --git a/sys/netinet6/ip6_forward.c b/sys/netinet6/ip6_forward.c
index e47047e31f1..4dac045f243 100644
--- a/sys/netinet6/ip6_forward.c
+++ b/sys/netinet6/ip6_forward.c
@@ -91,6 +91,7 @@ ip6_forward(struct mbuf *m, struct rtentry *rt, int srcrt)
struct mbuf *mcopy = NULL;
 #ifdef IPSEC
struct tdb *tdb = NULL;
+   struct rtentry rtent;
 #endif /* IPSEC */
char src6[INET6_ADDRSTRLEN], dst6[INET6_ADDRSTRLEN];
 
@@ -157,6 +158,17 @@ reroute:
m_freem(m);
goto freecopy;
}
+   if (tdb != NULL && rt == NULL) {
+   /*
+* If we try to forward via IPsec,
+* create dummy rtent for scope check.
+*/
+   rt = 
+   memset(rt, 0, sizeof(rtent));
+   SET(rt->rt_flags, RTF_UP);
+   /* Do not match any existing interface */
+   memset(>rt_ifidx, 0xff, sizeof(rt->rt_ifidx));
+   }
}
 #endif /* IPSEC */
 
@@ -302,6 +314,9 @@ reroute:
/* tag as generated to skip over pf_test on rerun */
m->m_pkthdr.pf.flags |= PF_TAG_GENERATED;
srcrt = 1;
+#ifdef IPSEC
+   if (rt != )
+#endif /* IPSEC */
rtfree(rt);
rt = NULL;
if_put(ifp);
@@ -369,6 +384,9 @@ senderr:
 freecopy:
m_freem(mcopy);
 out:
+#ifdef IPSEC
+   if (rt != )
+#endif /* IPSEC */
    rtfree(rt);
if_put(ifp);
 }

—
Yuichiro NAITO
naito.yuich...@gmail.com






Re: make btrace interval event to reciprocal of ticks

2020-06-25 Thread Yuichiro NAITO
From: Martin Pieuchot 
Subject: Re: make btrace interval event to reciprocal of ticks
Date: Thu, 25 Jun 2020 11:36:48 +0200

> On 23/06/20(Tue) 12:04, Yuichiro NAITO wrote:
>> Current btrace has `interval:hz:1` probe that makes periodically events.
>> `interval:hz:1` looks to make events once per second (= 1Hz),
>> but current implementation makes once per tick (= 100Hz on amd64).
>> I think the interval should be counted by reciprocal of ticks so that fit to 
>> Hz.
> 
> Indeed the current implementations assumes it's a number of ticks.  How
> does other tracing tool behave?  Is the behavior you suggest similar to
> bpftrace(8) or dtrace(1)?

I don't know about bpftrace(8).
Dtrace has interval timer in Hz as Lauri says.

> Would it be complicated to add support for other units, like 'ms' or
> 'sec'?  In that case 'profile:s:10' would mean every 10sec while
> 'profile:hz:10' would mean every ~6sec, right?

I feel it's ok to just rename 'interval:hz' to 'interval:ticks' with
no implementation change. (but I hope that 100 is allowed.)
We can know the tick interval from 'hz' in `sysctl -n kern.clockrate`.
So it's easy to understand that 'interval:ticks:N' fires on every N ticks.

And tick resolution is not so high.
In my patch, 51 Hz ~ 100 Hz is calculated to 10 ms (=1 tick) interval.
It might confuse some users.

In my usecase, I want 10ms ~ 1sec intervals.
If N is allowed over 99, 'interval:ticks:N' can be useful to me.

-- 
Yuichiro NAITO
  naito.yuich...@gmail.com



make btrace interval event to reciprocal of ticks

2020-06-22 Thread Yuichiro NAITO
Current btrace has `interval:hz:1` probe that makes periodically events.
`interval:hz:1` looks to make events once per second (= 1Hz),
but current implementation makes once per tick (= 100Hz on amd64).
I think the interval should be counted by reciprocal of ticks so that fit to Hz.

Following patch changes to calculate 'dp_maxtick' fit to Hz.
'dtrq_rate' has number of hz. I think it should be allowed same value of 'hz'
so that we can use `interval:hz:100` that equals to once per tick on amd64.

Index: dt_prov_profile.c
===
RCS file: /cvs/src/sys/dev/dt/dt_prov_profile.c,v
retrieving revision 1.2
diff -u -p -r1.2 dt_prov_profile.c
--- dt_prov_profile.c   25 Mar 2020 14:59:23 -  1.2
+++ dt_prov_profile.c   23 Jun 2020 02:02:27 -
@@ -75,7 +75,7 @@ dt_prov_profile_alloc(struct dt_probe *d
KASSERT(TAILQ_EMPTY(plist));
KASSERT(dtp == dtpp_profile || dtp == dtpp_interval);
 
-   if (dtrq->dtrq_rate <= 0 || dtrq->dtrq_rate >= hz)
+   if (dtrq->dtrq_rate <= 0 || dtrq->dtrq_rate > hz)
return EOPNOTSUPP;
 
CPU_INFO_FOREACH(cii, ci) {
@@ -88,7 +88,7 @@ dt_prov_profile_alloc(struct dt_probe *d
return ENOMEM;
}
 
-   dp->dp_maxtick = dtrq->dtrq_rate;
+   dp->dp_maxtick = hz / dtrq->dtrq_rate;
dp->dp_cpuid = ci->ci_cpuid;
 
    dp->dp_filter = dtrq->dtrq_filter;

-- 
Yuichiro NAITO
  naito.yuich...@gmail.com



btrace(8) seems to read same events

2020-06-17 Thread Yuichiro NAITO
Hi, I'm newbie to OpenBSD.
I'm trying to learn OpenBSD kernel by btrace.
It is really helpful tool to me.
Thanks for the development and 6.7 release.

I found that brace sometimes read more than one event at the same time.
After that, same events are always read and evaluated by my btrace script.

I think btrace command reuses the read buffer for '/dev/dt' in each read.
It is cleared by nobody and overwritten by the kernel.
It seems that btrace expects zero is written at the end of the buffer,
But kernel doesn't write zeroes.

So if more than one events are read and then one event is read in next time,
the second event is still written and evaluated by btrace script.

I think number of read events should be determined by return value of read(2).
Following patch works for me.

Index: btrace.c
===
RCS file: /cvs/src/usr.sbin/btrace/btrace.c,v
retrieving revision 1.18
diff -u -p -r1.18 btrace.c
--- btrace.c24 Apr 2020 14:56:43 -  1.18
+++ btrace.c17 Jun 2020 08:25:49 -
@@ -339,12 +339,8 @@ rules_do(int fd)
err(1, "incorrect read");
 
 
-   for (i = 0; i < nitems(devtbuf); i++) {
+   for (i = 0; i < rlen / sizeof(struct dt_evt); i++) {
struct dt_evt *dtev = [i];
-
-   if (dtev->dtev_tid == 0)
-   break;
-
rules_apply(dtev);
    }
    }

-- 
Yuichiro NAITO (naito.yuich...@gmail.com)