RE: [PATCH] i40e: fix mac filter delete when setting mac address

2018-12-04 Thread Keller, Jacob E
> -Original Message-
> From: Stefan Assmann [mailto:sassm...@kpanic.de]
> Sent: Tuesday, December 04, 2018 6:19 AM
> To: intel-wired-...@lists.osuosl.org
> Cc: netdev@vger.kernel.org; da...@davemloft.net; Kirsher, Jeffrey T
> ; Keller, Jacob E ;
> sassm...@kpanic.de
> Subject: [PATCH] i40e: fix mac filter delete when setting mac address
> 
> A previous commit moved the ether_addr_copy() in i40e_set_mac() before
> the mac filter del/add to avoid a race. However it wasn't taken into
> account that this alters the mac address being handed to
> i40e_del_mac_filter().

Woops!

> 
> Also changed i40e_add_mac_filter() to operate on netdev->dev_addr,
> hopefully that makes the code easier to read.
> 

Makes sense.

> Fixes: 458867b2ca0c ("i40e: don't remove netdev->dev_addr when syncing uc 
> list")
> 

Good catch. This looks correct to me.

Acked-by: Jacob Keller 

> Signed-off-by: Stefan Assmann 
> ---
>  drivers/net/ethernet/intel/i40e/i40e_main.c | 14 +++---
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c
> b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index 6d5b13f69dec..901ef799d2ad 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -1546,17 +1546,17 @@ static int i40e_set_mac(struct net_device *netdev, 
> void
> *p)
>   netdev_info(netdev, "set new mac address %pM\n", addr->sa_data);
> 
>   /* Copy the address first, so that we avoid a possible race with
> -  * .set_rx_mode(). If we copy after changing the address in the filter
> -  * list, we might open ourselves to a narrow race window where
> -  * .set_rx_mode could delete our dev_addr filter and prevent traffic
> -  * from passing.
> +  * .set_rx_mode().
> +  * - Remove old address from MAC filter
> +  * - Copy new address
> +  * - Add new address to MAC filter
>*/
> - ether_addr_copy(netdev->dev_addr, addr->sa_data);
> -
>   spin_lock_bh(>mac_filter_hash_lock);
>   i40e_del_mac_filter(vsi, netdev->dev_addr);
> - i40e_add_mac_filter(vsi, addr->sa_data);
> + ether_addr_copy(netdev->dev_addr, addr->sa_data);
> + i40e_add_mac_filter(vsi, netdev->dev_addr);
>   spin_unlock_bh(>mac_filter_hash_lock);
> +
>   if (vsi->type == I40E_VSI_MAIN) {
>   i40e_status ret;
> 
> --
> 2.19.2



RE: [PATCH net-next 7/8] ixgbe: extend PTP gettime function to read system clock

2018-11-09 Thread Keller, Jacob E
> -Original Message-
> From: Miroslav Lichvar [mailto:mlich...@redhat.com]
> Sent: Friday, November 09, 2018 2:15 AM
> To: netdev@vger.kernel.org
> Cc: Richard Cochran ; Keller, Jacob E
> ; Miroslav Lichvar ; Kirsher,
> Jeffrey T 
> Subject: [PATCH net-next 7/8] ixgbe: extend PTP gettime function to read 
> system
> clock
> 
> -static int ixgbe_ptp_gettime(struct ptp_clock_info *ptp, struct timespec64 
> *ts)
> +static int ixgbe_ptp_gettimex(struct ptp_clock_info *ptp,
> +   struct timespec64 *ts,
> +   struct ptp_system_timestamp *sts)
>  {
>   struct ixgbe_adapter *adapter =
>   container_of(ptp, struct ixgbe_adapter, ptp_caps);
> + struct ixgbe_hw *hw = >hw;
>   unsigned long flags;
> - u64 ns;
> + u64 ns, stamp;
> 
>   spin_lock_irqsave(>tmreg_lock, flags);
> - ns = timecounter_read(>hw_tc);
> +
> + switch (adapter->hw.mac.type) {
> + case ixgbe_mac_X550:
> + case ixgbe_mac_X550EM_x:
> + case ixgbe_mac_x550em_a:
> + /* Upper 32 bits represent billions of cycles, lower 32 bits
> +  * represent cycles. However, we use timespec64_to_ns for the
> +  * correct math even though the units haven't been corrected
> +  * yet.
> +  */
> + ptp_read_system_prets(sts);
> + IXGBE_READ_REG(hw, IXGBE_SYSTIMR);
> + ptp_read_system_postts(sts);
> + ts->tv_nsec = IXGBE_READ_REG(hw, IXGBE_SYSTIML);
> + ts->tv_sec = IXGBE_READ_REG(hw, IXGBE_SYSTIMH);
> + stamp = timespec64_to_ns(ts);
> + break;
> + default:
> + ptp_read_system_prets(sts);
> + stamp = IXGBE_READ_REG(hw, IXGBE_SYSTIML);
> + ptp_read_system_postts(sts);
> + stamp |= (u64)IXGBE_READ_REG(hw, IXGBE_SYSTIMH) << 32;
> + break;
> + }
> +
> + ns = timecounter_cyc2time(>hw_tc, stamp);
> +

At first, I was confused by this entire block of code, but then realized that 
we can't update the timecounter_read method, so we instead have to break this 
out so that our calls to ptp_read_system_prets() and ptp_read_system_postts() 
can be added between the register reads.

Ok, that makes sense.

>   spin_unlock_irqrestore(>tmreg_lock, flags);
> 
>   *ts = ns_to_timespec64(ns);
> @@ -567,10 +597,14 @@ void ixgbe_ptp_overflow_check(struct ixgbe_adapter
> *adapter)
>  {
>   bool timeout = time_is_before_jiffies(adapter->last_overflow_check +
>IXGBE_OVERFLOW_PERIOD);
> - struct timespec64 ts;
> + unsigned long flags;
> 
>   if (timeout) {
> - ixgbe_ptp_gettime(>ptp_caps, );
> + /* Update the timecounter */
> + spin_lock_irqsave(>tmreg_lock, flags);
> + timecounter_read(>hw_tc);
> + spin_unlock_irqrestore(>tmreg_lock, flags);
> +

This also explains this change where we now have to update the timecounter 
during the overflow check.

Ok, this makes sense to me.

Thanks,
Jake


RE: [PATCH net-next 0/8] More accurate PHC<->system clock synchronization

2018-11-09 Thread Keller, Jacob E
> -Original Message-
> From: Miroslav Lichvar [mailto:mlich...@redhat.com]
> Sent: Friday, November 09, 2018 2:15 AM
> To: netdev@vger.kernel.org
> Cc: Richard Cochran ; Keller, Jacob E
> ; Miroslav Lichvar ; Marcelo
> Tosatti ; Kirsher, Jeffrey T 
> ;
> Michael Chan 
> Subject: [PATCH net-next 0/8] More accurate PHC<->system clock synchronization
> 
> RFC->v1:
> - added new patches
> - separated PHC timestamp from ptp_system_timestamp
> - fixed memory leak in PTP_SYS_OFFSET_EXTENDED
> - changed PTP_SYS_OFFSET_EXTENDED to work with array of arrays
> - fixed PTP_SYS_OFFSET_EXTENDED to break correctly from loop
> - fixed timecounter updates in drivers
> - split gettimex in igb driver
> - fixed ptp_read_* functions to be available without
>   CONFIG_PTP_1588_CLOCK
> 
> This series enables a more accurate synchronization between PTP hardware
> clocks and the system clock.

Thanks for doing this, Miroslav!

> 
> The first two patches are minor cleanup/bug fixes.
> 
> The third patch adds an extended version of the PTP_SYS_OFFSET ioctl,
> which returns three timestamps for each measurement. The idea is to
> shorten the interval between the system timestamps to contain just the
> reading of the lowest register of the PHC in order to reduce the error
> in the measured offset and get a smaller upper bound on the maximum
> error.
> 
> The fourth patch deprecates the original gettime function.
> 
> The remaining patches update the gettime function in order to support
> the new ioctl in the e1000e, igb, ixgbe, and tg3 drivers.
> 
> Tests with few different NICs in different machines show that:
> - with an I219 (e1000e) the measured delay was reduced from 2500 to 1300
>   ns and the error in the measured offset, when compared to the cross
>   timestamping supported by the driver, was reduced by a factor of 5
> - with an I210 (igb) the delay was reduced from 5100 to 1700 ns
> - with an I350 (igb) the delay was reduced from 2300 to 750 ns
> - with an X550 (ixgbe) the delay was reduced from 1950 to 650 ns
> - with a BCM5720 (tg3) the delay was reduced from 2400 to 1200 ns
> 

Impressive results!

For the main portions and the Intel driver changes this is

Reviewed-by: Jacob Keller 

Regards,
Jake

> 
> Miroslav Lichvar (8):
>   ptp: reorder declarations in ptp_ioctl()
>   ptp: check gettime64 return code in PTP_SYS_OFFSET ioctl
>   ptp: add PTP_SYS_OFFSET_EXTENDED ioctl
>   ptp: deprecate gettime64() in favor of gettimex64()
>   e1000e: extend PTP gettime function to read system clock
>   igb: extend PTP gettime function to read system clock
>   ixgbe: extend PTP gettime function to read system clock
>   tg3: extend PTP gettime function to read system clock
> 
>  drivers/net/ethernet/broadcom/tg3.c  | 19 --
>  drivers/net/ethernet/intel/e1000e/e1000.h|  3 +
>  drivers/net/ethernet/intel/e1000e/netdev.c   | 42 ++---
>  drivers/net/ethernet/intel/e1000e/ptp.c  | 16 +++--
>  drivers/net/ethernet/intel/igb/igb_ptp.c | 65 +---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c | 54 +---
>  drivers/ptp/ptp_chardev.c| 55 ++---
>  drivers/ptp/ptp_clock.c  |  5 +-
>  include/linux/ptp_clock_kernel.h | 33 ++
>  include/uapi/linux/ptp_clock.h   | 12 
>  10 files changed, 253 insertions(+), 51 deletions(-)
> 
> --
> 2.17.2



RE: [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime

2018-10-31 Thread Keller, Jacob E



> -Original Message-
> From: Richard Cochran [mailto:richardcoch...@gmail.com]
> Sent: Wednesday, October 31, 2018 2:17 PM
> To: Miroslav Lichvar 
> Cc: Keller, Jacob E ; netdev@vger.kernel.org; 
> intel-wired-
> l...@lists.osuosl.org
> Subject: Re: [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime
> 
> On Wed, Oct 31, 2018 at 03:49:35PM +0100, Miroslav Lichvar wrote:
> >
> > How about separating the PHC timestamp from the ptp_system_timestamp
> > structure and use NULL to indicate we don't want to read the system
> > clock? A gettimex64(ptp, ts, NULL) call would be equal to
> > gettime64(ptp, ts).
> 
> Doesn't sound too bad to me.
> 
> Thanks,
> Richard

Yep, this seems fine to me as well.

Regards,
Jake


RE: [RFC PATCH 3/4] igb: add support for extended PHC gettime

2018-10-31 Thread Keller, Jacob E
> -Original Message-
> From: Miroslav Lichvar [mailto:mlich...@redhat.com]
> Sent: Wednesday, October 31, 2018 2:40 AM
> To: Richard Cochran 
> Cc: netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org; Keller, Jacob E
> 
> Subject: Re: [RFC PATCH 3/4] igb: add support for extended PHC gettime
> 
> On Tue, Oct 30, 2018 at 07:29:16PM -0700, Richard Cochran wrote:
> > On Fri, Oct 26, 2018 at 06:27:41PM +0200, Miroslav Lichvar wrote:
> > > +static int igb_ptp_gettimex(struct ptp_clock_info *ptp,
> > > + struct ptp_system_timestamp *sts)
> > > +{
> > > + struct igb_adapter *igb = container_of(ptp, struct igb_adapter,
> > > +ptp_caps);
> > > + struct e1000_hw *hw = >hw;
> > > + unsigned long flags;
> > > + u32 lo, hi;
> > > + u64 ns;
> > > +
> > > + spin_lock_irqsave(>tmreg_lock, flags);
> > > +
> > > + /* 82576 doesn't have SYSTIMR */
> > > + if (igb->hw.mac.type == e1000_82576) {
> >
> > Instead of if/then/else, can't you follow the pattern of providing
> > different function flavors ...
> 
> I can. I was just trying to minimize the amount of triplicated code.
> In the next version I'll add a patch to deprecate the old gettime
> functions, as Jacob suggested, and replace them with the extended
> versions, so the amount of code will not change that much.
> 

Excellent.

-Jake

> Thanks,
> 
> --
> Miroslav Lichvar


RE: [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime

2018-10-31 Thread Keller, Jacob E



> -Original Message-
> From: Richard Cochran [mailto:richardcoch...@gmail.com]
> Sent: Wednesday, October 31, 2018 7:40 AM
> To: Miroslav Lichvar 
> Cc: Keller, Jacob E ; netdev@vger.kernel.org; 
> intel-wired-
> l...@lists.osuosl.org
> Subject: Re: [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime
> 
> On Mon, Oct 29, 2018 at 02:31:09PM +0100, Miroslav Lichvar wrote:
> > I think there could be a flag in ptp_system_timestamp, or a parameter
> > of gettimex64(), which would enable/disable reading of the system
> > clock.
> 
> I'm not a fan of functions that change their behavior based on flags
> in their input parameters.
> 
> Thanks,
> Richard

Neither am I. I do however want to find a solution that avoids having drivers 
needlessly duplicate almost the same functionality.

Thanks,
Jake


RE: [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime

2018-10-29 Thread Keller, Jacob E
> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On
> Behalf Of Miroslav Lichvar
> Sent: Monday, October 29, 2018 6:31 AM
> To: Keller, Jacob E 
> Cc: netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org; Richard Cochran
> 
> Subject: Re: [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime
> 
> On Fri, Oct 26, 2018 at 04:54:57PM +, Keller, Jacob E wrote:
> > > -Original Message-
> > > From: Miroslav Lichvar [mailto:mlich...@redhat.com]
> > > Sent: Friday, October 26, 2018 9:28 AM
> > > To: netdev@vger.kernel.org
> > > Cc: intel-wired-...@lists.osuosl.org; Richard Cochran
> ;
> > > Keller, Jacob E ; Miroslav Lichvar
> 
> > > Subject: [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime
> > >
> > > Cc: Richard Cochran 
> > > Cc: Jacob Keller 
> > > Signed-off-by: Miroslav Lichvar 
> 
> > What about replacing gettime64 with:
> >
> > static int ixgbe_ptp_gettimex(struct ptp_clock_info *ptp, struct timespec64 
> > *ts)
> > {
> > struct ptp_system_timestamp sts
> >
> > ixgbe_ptp_gettimex(ptp, );
> > *ts = sts.phc_ts
> > }
> 
> That will work, but it will be slower. With HPET as a clocksource
> there would be few microseconds of an extra (symmetric) delay and the
> applications would have to assume a larger maximum error.
> 

Right. My intention for this would be that we'd switch from gettime64 to 
gettime64x going forward, and provide this as a way to avoid having to 
duplicate logic in drivers while we're transitioning? Thus, new applications 
should be using the new call if it's available in the driver.

Hmm.
 
> I think there could be a flag in ptp_system_timestamp, or a parameter
> of gettimex64(), which would enable/disable reading of the system
> clock.
> 
> > Actually, could that even just be provided by the PTP core if gettime64 
> > isn't
> implemented? This way new drivers only have to implement the new interface, 
> and
> userspace will just get the old behavior if they use the old call?
> 
> Good idea.

Ideally we can find a way that minimizes the overhead for the old call.

Thanks,
Jake

> 
> Thanks,
> 
> --
> Miroslav Lichvar


RE: [PATCH v2 net] igb: shorten maximum PHC timecounter update interval

2018-10-26 Thread Keller, Jacob E
> -Original Message-
> From: Miroslav Lichvar [mailto:mlich...@redhat.com]
> Sent: Friday, October 26, 2018 10:13 AM
> To: netdev@vger.kernel.org
> Cc: intel-wired-...@lists.osuosl.org; Miroslav Lichvar ;
> Keller, Jacob E ; Richard Cochran
> ; Thomas Gleixner 
> Subject: [PATCH v2 net] igb: shorten maximum PHC timecounter update interval
> 
> The timecounter needs to be updated at least once per ~550 seconds in
> order to avoid a 40-bit SYSTIM timestamp to be misinterpreted as an old
> timestamp.
> 
> Since commit 500462a9de65 ("timers: Switch to a non-cascading wheel"),
> scheduling of delayed work seems to be less accurate and a requested
> delay of 540 seconds may actually be longer than 550 seconds. Also, the
> PHC may be adjusted to run up to 6% faster than real time and the system
> clock up to 10% slower. Shorten the delay to 360 seconds to be sure the
> timecounter is updated in time.
> 
> This fixes an issue with HW timestamps on 82580/I350/I354 being off by
> ~1100 seconds for few seconds every ~9 minutes.
> 

Acked-by: Jacob Keller 

> Cc: Jacob Keller 
> Cc: Richard Cochran 
> Cc: Thomas Gleixner 
> Signed-off-by: Miroslav Lichvar 
> ---
>  drivers/net/ethernet/intel/igb/igb_ptp.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/intel/igb/igb_ptp.c
> b/drivers/net/ethernet/intel/igb/igb_ptp.c
> index 9f4d700e09df..2b95dc9c7a6a 100644
> --- a/drivers/net/ethernet/intel/igb/igb_ptp.c
> +++ b/drivers/net/ethernet/intel/igb/igb_ptp.c
> @@ -51,9 +51,17 @@
>   *
>   * The 40 bit 82580 SYSTIM overflows every
>   *   2^40 * 10^-9 /  60  = 18.3 minutes.
> + *
> + * SYSTIM is converted to real time using a timecounter. As
> + * timecounter_cyc2time() allows old timestamps, the timecounter needs
> + * to be updated at least once per half of the SYSTIM interval.
> + * Scheduling of delayed work is not very accurate, and also the NIC
> + * clock can be adjusted to run up to 6% faster and the system clock
> + * up to 10% slower, so we aim for 6 minutes to be sure the actual
> + * interval in the NIC time is shorter than 9.16 minutes.
>   */
> 
> -#define IGB_SYSTIM_OVERFLOW_PERIOD   (HZ * 60 * 9)
> +#define IGB_SYSTIM_OVERFLOW_PERIOD   (HZ * 60 * 6)
>  #define IGB_PTP_TX_TIMEOUT   (HZ * 15)
>  #define INCPERIOD_82576  BIT(E1000_TIMINCA_16NS_SHIFT)
>  #define INCVALUE_82576_MASK
>   GENMASK(E1000_TIMINCA_16NS_SHIFT - 1, 0)
> --
> 2.17.2



RE: [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime

2018-10-26 Thread Keller, Jacob E



> -Original Message-
> From: Miroslav Lichvar [mailto:mlich...@redhat.com]
> Sent: Friday, October 26, 2018 9:28 AM
> To: netdev@vger.kernel.org
> Cc: intel-wired-...@lists.osuosl.org; Richard Cochran 
> ;
> Keller, Jacob E ; Miroslav Lichvar 
> 
> Subject: [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime
> 
> Cc: Richard Cochran 
> Cc: Jacob Keller 
> Signed-off-by: Miroslav Lichvar 
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c | 57 
>  1 file changed, 57 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c
> b/drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c
> index b3e0d8bb5cbd..d31e8d3effc7 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c
> @@ -466,6 +466,60 @@ static int ixgbe_ptp_gettime(struct ptp_clock_info *ptp,
> struct timespec64 *ts)
>   return 0;
>  }
> 
> +/**
> + * ixgbe_ptp_gettimex
> + * @ptp: the ptp clock structure
> + * @sts: structure to hold the system time before reading the PHC,
> + * the PHC timestamp, and system time after reading the PHC
> + *
> + * read the timecounter and return the correct value on ns,
> + * after converting it into a struct timespec.
> + */
> +static int ixgbe_ptp_gettimex(struct ptp_clock_info *ptp,
> +   struct ptp_system_timestamp *sts)
> +{
> + struct ixgbe_adapter *adapter =
> + container_of(ptp, struct ixgbe_adapter, ptp_caps);
> + struct ixgbe_hw *hw = >hw;
> + unsigned long flags;
> + struct timespec64 ts;
> + u64 ns, stamp;
> +
> + spin_lock_irqsave(>tmreg_lock, flags);
> +
> + switch (adapter->hw.mac.type) {
> + case ixgbe_mac_X550:
> + case ixgbe_mac_X550EM_x:
> + case ixgbe_mac_x550em_a:
> + /* Upper 32 bits represent billions of cycles, lower 32 bits
> +  * represent cycles. However, we use timespec64_to_ns for the
> +  * correct math even though the units haven't been corrected
> +  * yet.
> +  */
> + ptp_read_system_prets(sts);
> + IXGBE_READ_REG(hw, IXGBE_SYSTIMR);
> + ptp_read_system_postts(sts);
> + ts.tv_nsec = IXGBE_READ_REG(hw, IXGBE_SYSTIML);
> + ts.tv_sec = IXGBE_READ_REG(hw, IXGBE_SYSTIMH);
> + stamp = timespec64_to_ns();
> + break;
> + default:
> + ptp_read_system_prets(sts);
> + stamp = IXGBE_READ_REG(hw, IXGBE_SYSTIML);
> + ptp_read_system_postts(sts);
> + stamp |= (u64)IXGBE_READ_REG(hw, IXGBE_SYSTIMH) << 32;
> + break;
> + }
> +
> + ns = timecounter_cyc2time(>hw_tc, stamp);
> +
> + spin_unlock_irqrestore(>tmreg_lock, flags);
> +
> + sts->phc_ts = ns_to_timespec64(ns);
> +
> + return 0;
> +}
> +


What about replacing gettime64 with:

static int ixgbe_ptp_gettimex(struct ptp_clock_info *ptp, struct timespec64 *ts)
{
struct ptp_system_timestamp sts

ixgbe_ptp_gettimex(ptp, );
*ts = sts.phc_ts
}

Actually, could that even just be provided by the PTP core if gettime64 isn't 
implemented? This way new drivers only have to implement the new interface, and 
userspace will just get the old behavior if they use the old call?

Thanks,
Jake

>  /**
>   * ixgbe_ptp_settime
>   * @ptp: the ptp clock structure
> @@ -1217,6 +1271,7 @@ static long ixgbe_ptp_create_clock(struct ixgbe_adapter
> *adapter)
>   adapter->ptp_caps.adjfreq = ixgbe_ptp_adjfreq_82599;
>   adapter->ptp_caps.adjtime = ixgbe_ptp_adjtime;
>   adapter->ptp_caps.gettime64 = ixgbe_ptp_gettime;
> + adapter->ptp_caps.gettimex64 = ixgbe_ptp_gettimex;
>   adapter->ptp_caps.settime64 = ixgbe_ptp_settime;
>   adapter->ptp_caps.enable = ixgbe_ptp_feature_enable;
>   adapter->ptp_setup_sdp = ixgbe_ptp_setup_sdp_x540;
> @@ -1234,6 +1289,7 @@ static long ixgbe_ptp_create_clock(struct ixgbe_adapter
> *adapter)
>   adapter->ptp_caps.adjfreq = ixgbe_ptp_adjfreq_82599;
>   adapter->ptp_caps.adjtime = ixgbe_ptp_adjtime;
>   adapter->ptp_caps.gettime64 = ixgbe_ptp_gettime;
> + adapter->ptp_caps.gettimex64 = ixgbe_ptp_gettimex;
>   adapter->ptp_caps.settime64 = ixgbe_ptp_settime;
>   adapter->ptp_caps.enable = ixgbe_ptp_feature_enable;
>   break;
> @@ -1250,6 +1306,7 @@ static long ixgbe_ptp_create_clock(struct ixgbe_adapter
> *adapter)
>   adapter->ptp_caps.adjfreq = ixgbe_ptp_adjfreq

RE: [RFC PATCH 0/4] More accurate PHC<->system clock synchronization

2018-10-26 Thread Keller, Jacob E
> -Original Message-
> From: Miroslav Lichvar [mailto:mlich...@redhat.com]
> Sent: Friday, October 26, 2018 9:28 AM
> To: netdev@vger.kernel.org
> Cc: intel-wired-...@lists.osuosl.org; Richard Cochran 
> ;
> Keller, Jacob E ; Miroslav Lichvar 
> 
> Subject: [RFC PATCH 0/4] More accurate PHC<->system clock synchronization
> 

I read the whole series, and it looks good to me.

Acked-by: Jacob Keller 

> There is some duplication of code in the igb and ixgbe drivers, which I
> don't like very much, but I thought it's better than extending and
> wrapping the existing functions like in the e1000e driver. Also, mixing
> SYSTIM and "system time" in the code will probably be confusing.
> 
> I wasn't able to find a better name for the ioctl, the structures, and
> the driver function. If anyone has suggestions, please let me know.
> 

Hmm.. Yea, I don't really have better names either.

> Miroslav Lichvar (4):
>   ptp: add PTP_SYS_OFFSET_EXTENDED ioctl
>   e1000e: add support for extended PHC gettime
>   igb: add support for extended PHC gettime
>   ixgbe: add support for extended PHC gettime
> 
>  drivers/net/ethernet/intel/e1000e/e1000.h|  3 ++
>  drivers/net/ethernet/intel/e1000e/netdev.c   | 48 +
>  drivers/net/ethernet/intel/e1000e/ptp.c  | 21 
>  drivers/net/ethernet/intel/igb/igb_ptp.c | 43 +++
>  drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c | 57 
>  drivers/ptp/ptp_chardev.c| 39 ++
>  include/linux/ptp_clock_kernel.h | 26 +
>  include/uapi/linux/ptp_clock.h   | 12 +
>  8 files changed, 239 insertions(+), 10 deletions(-)
> 
> --
> 2.17.2



RE: [RFC PATCH 0/4] More accurate PHC<->system clock synchronization

2018-10-26 Thread Keller, Jacob E
Hi Miroslav,

> -Original Message-
> From: Miroslav Lichvar [mailto:mlich...@redhat.com]
> Sent: Friday, October 26, 2018 9:28 AM
> To: netdev@vger.kernel.org
> Cc: intel-wired-...@lists.osuosl.org; Richard Cochran 
> ;
> Keller, Jacob E ; Miroslav Lichvar 
> 
> Subject: [RFC PATCH 0/4] More accurate PHC<->system clock synchronization
> 
> This series adds support for a more accurate synchronization between a
> PTP hardware clock and the system clock.
> 
> The first patch adds an extended version of the PTP_SYS_OFFSET ioctl,
> which returns three timestamps for each measurement. The idea is to
> shorten the interval between the system timestamps to contain just the
> reading of the lowest register of the PHC in order to reduce the error
> in the measured offset and give a better bound on the maximum error.
> 
> The other patches add support for the new ioctl to the e1000e, igb,
> and ixgbe driver. Tests with few different NICs in different machines
> (and PCIe slots) show that:
> - with an I219 (e1000e) the measured delay improved from 2500 to 1300 ns
>   and the error in the measured offset, when compared to cross
>   timestamping, was reduced by a factor of 5
> - with an I210 (igb) the delay improved from 5100 to 1700 ns
> - with an I350 (igb) the delay improved from 2300 to 750 ns
> - with an X550 (ixgbe) the delay improved from 1950 to 650 ns
> 

That is some very significant improvements! Excellent find.

> There is some duplication of code in the igb and ixgbe drivers, which I
> don't like very much, but I thought it's better than extending and
> wrapping the existing functions like in the e1000e driver. Also, mixing
> SYSTIM and "system time" in the code will probably be confusing.
> 

Yea...

> I wasn't able to find a better name for the ioctl, the structures, and
> the driver function. If anyone has suggestions, please let me know.
> 

I don't have any good suggestions yet. I'll reply after reviewing if I think of 
any.

Thanks,
Jake


RE: [PATCH] igb: shorten maximum PHC timecounter update interval

2018-10-26 Thread Keller, Jacob E
> -Original Message-
> From: Miroslav Lichvar [mailto:mlich...@redhat.com]
> Sent: Friday, October 26, 2018 5:04 AM
> To: Richard Cochran 
> Cc: intel-wired-...@lists.osuosl.org; netdev@vger.kernel.org; Keller, Jacob E
> ; Thomas Gleixner 
> Subject: Re: [PATCH] igb: shorten maximum PHC timecounter update interval
> 
> On Fri, Oct 12, 2018 at 07:05:30AM -0700, Richard Cochran wrote:
> > On Fri, Oct 12, 2018 at 01:13:39PM +0200, Miroslav Lichvar wrote:
> > > Since commit 500462a9d ("timers: Switch to a non-cascading wheel"),
> > > scheduling of delayed work seems to be less accurate and a requested
> > > delay of 540 seconds may actually be longer than 550 seconds. Shorten
> > > the delay to 480 seconds to be sure the timecounter is updated in time.
> >
> > Good catch.  This timer wheel change will affect other, similar
> > drivers.  Guess I'll go through and adjust their timeouts, too.
> 
> I just realized that we need to fit there also any frequency
> adjustments of the PHC and system clock. The PHC can be set to run up
> to 6% faster and the system clock can be slowed down by up to 10%.
> 
> Those 480 seconds in the igb driver is not short enough for that.
> Should I fix and resend this patch, or send a new one?
> 
> Other drivers may have a similar problem.
> 

Hmm, good point. I'd send a v2 of this patch, unless it's already been applied 
to net or net-next.

Thanks,
Jake

> --
> Miroslav Lichvar


RE: Improving accuracy of PHC readings

2018-10-19 Thread Keller, Jacob E
> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On
> Behalf Of Miroslav Lichvar
> Sent: Friday, October 19, 2018 2:52 AM
> To: netdev@vger.kernel.org
> Cc: Richard Cochran ; Keller, Jacob E
> 
> Subject: Improving accuracy of PHC readings
> 
> I think there might be a way how we could significantly improve
> accuracy of synchronization between the system clock and a PTP
> hardware clock, at least with some network drivers.
> 
> Currently, the PTP_SYS_OFFSET ioctl reads the system clock, reads the
> PHC using the gettime64 function of the driver, and reads the system
> clock again. The ioctl can repeat this to provide multiple readings to
> the user space.
> 
> phc2sys (or another program synchronizing the system clock to the PHC)
> assumes the PHC timestamps were captured in the middle between the two
> closest system clock timestamps.
> 
> The trouble is that gettime64 typically reads multiple (2-3) registers
> and the timestamp is latched on the first one, so the assumption about
> middle point is wrong. There is an asymmetry, even if the delays on
> the PCIe bus are perfectly symmetric.
> 

Right! I feel like this is obvious now that you said it, so I'm surprised no 
one thought of it before...

> A solution to this would be a new driver function that wraps the
> latching register read with readings of the system clock and return
> three timestamps instead of one. For example:
> 
> ktime_get_real_ts64(_ts1);
>   IXGBE_READ_REG(hw, IXGBE_SYSTIMR);
>   ktime_get_real_ts64(_ts2);
>   phc_ts.tv_nsec = IXGBE_READ_REG(hw, IXGBE_SYSTIML);
>   phc_ts.tv_sec = IXGBE_READ_REG(hw, IXGBE_SYSTIMH);
> 
> The extra timestamp doesn't fit the API of the PTP_SYS_OFFSET ioctl,
> so it would need to shift the timestamp it returns by the missing
> intervals (assuming the frequency offset between the PHC and system
> clock is small), or a new ioctl could be introduced that would return
> all timestamps in an array looking like this:
> 
>   [sys, phc, sys, sys, phc, sys, ...]
> 

I think the new ioctl is probably the better solution.

> This should significantly improve the accuracy of the synchronization,
> reduce the uncertainty in the readings to less than a half or third,
> and also reduce the jitter as there are fewer register reads sensitive
> to the PCIe delay.
> 
> What do you think?
> 

Nice! I think this is good. I'd love to see some data to back it up, but it 
makes sense to me.

Thanks,
Jake

> --
> Miroslav Lichvar


RE: Issues in error queue polling

2018-10-18 Thread Keller, Jacob E
Hi,

> -Original Message-
> From: Ricardo Biehl Pasquali [mailto:pasqual...@gmail.com]
> Sent: Thursday, October 18, 2018 11:27 AM
> To: netdev@vger.kernel.org
> Cc: Keller, Jacob E ; Gomes, Vinicius
> ; da...@davemloft.net
> Subject: Issues in error queue polling
> 
> The commit 7d4c04fc170087119727 ("net: add option to enable
> error queue packets waking select") (2013-03-28) introduced
> SO_SELECT_ERR_QUEUE, which masks POLLPRI with POLLERR event
> return in some socket poll callbacks.
> 
> POLLERR event issued with sock_queue_err_skb() did not wake
> up a poll when POLLERR is the only requested event because
> sk_data_ready() (sock_def_readable()) was used and it
> doesn't mask POLLERR in poll wake up:
> 

Right.

> wake_up_interruptible_sync_poll(>wait,
> EPOLLIN | EPOLLPRI |
> EPOLLRDNORM | EPOLLRDBAND);
> 
> If POLLIN or POLLPRI are requested, for example, poll does
> wake up.
> 
> POLLERR wakeup by requesting POLLPRI is possible without
> set SO_SELECT_ERR_QUEUE. All the option does is masking
> POLLPRI as a returned event before poll returns. poll
> would return anyway because of POLLERR.
> 

Yes. The problem being that the application thread not being ready to handle 
POLLPRI, so they want to avoid the application waking up to an event.

> Also, the sentence "[...] enable software to wait on error
> queue packets without waking up for regular data on the
> socket." from the above commit is not true.
> 

Not entirely true but...

> A POLLIN event issued via sock_def_readable() wakes up
> threads waiting for POLLPRI, and vice versa. However,
> poll() does not return, sleeping again, as the requested
> events do not match events.
> 

The thread wakes up, but the application handling the events doesn't because 
the thread goes right back to sleep.

> The commit 6e5d58fdc9bedd0255a8 ("skbuff: Fix not waking
> applications when errors are enqueued") (2018-03-14) make
> POLLERR alone wake up poll. It replaces sk_data_ready()
> (sock_def_readable()) with sk_error_report()
> (sock_def_error_report()). This makes "POLLERR wake up by
> requesting POLLPRI" obsolete.
> 

Yep, this is a better solution, and I wish it had been thought of before we 
introduced SO_SELECT_ERR_QUEUE.

> Rationale:
> 
> POLLIN-only and POLLERR-only wake up are useful when there
> is a receiving thread, a sending thread, and a thread that
> get transmit timestamps. The thread polling on POLLERR will
> not wake up when regular data arrives (POLLIN). The thread
> polling on POLLIN will not wake up when tx timestamps are
> ready (POLLERR).

Right. This is the goal for applications like ptp4l.

> 
> One solution is adding an option that disable POLLERR as
> requested event. This is in the Virtual File System
> subsystem, not in the network, though.
> 
> This solves the problem of waking up other threads that
> not interested in error queue. Thus allowing a separate
> thread take care of error queue (useful for receiving
> transmit timestamps).

Yes, this makes sense to me.

Thanks,
Jake


RE: [net-next 1/8] ice: use [sr]q.count when checking if queue is initialized

2018-10-01 Thread Keller, Jacob E


> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of Sergei Shtylyov
> Sent: Friday, September 28, 2018 2:02 AM
> To: Kirsher, Jeffrey T ; da...@davemloft.net
> Cc: Keller, Jacob E ; netdev@vger.kernel.org;
> nhor...@redhat.com; sassm...@redhat.com; jogre...@redhat.com;
> Venkataramanan, Anirudh 
> Subject: Re: [net-next 1/8] ice: use [sr]q.count when checking if queue is
> initialized
> 
> Hello!
> 
> On 9/27/2018 7:21 PM, Jeff Kirsher wrote:
> 
> > From: Jacob Keller 
> >
> > When shutting down the controlqs, we check if they are initialized
> > before we shut them down and destroy the lock. This is important, as it
> > prevents attempts to access the lock of an already shutdown queue.
> >
> > Unfortunately, we checked rq.head and sq.head as the value to determine
> > if the queue was initialized. This doesn't work, because head is not
> > reset when the queue is shutdown. In some flows, the adminq will have
> > already been shut down prior to calling ice_shutdown_all_ctrlqs. This
> > can result in a crash due to attempting to access the already destroyed
> > mutex.
> >
> > Fix this by using rq.count and sq.count instead. Indeed, ice_shutdown_sq
> > and ice_shutdown_rq already indicate that this is the value we should be
> > using to determine of the queue was initialized.
> 
> s/of/if/?

Yes. However, unless there's a functional change required as well, I don't 
think it's worth fixing this.

Thanks,
Jake

> 
> > Signed-off-by: Jacob Keller 
> > Signed-off-by: Anirudh Venkataramanan
> 
> > Tested-by: Andrew Bowers 
> > Signed-off-by: Jeff Kirsher 
> [...]
> 
> MBR, Sergei


RE: [net-next 02/15] i40e: move ethtool stats boiler plate code to i40e_ethtool_stats.h

2018-09-04 Thread Keller, Jacob E



> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of David Miller
> Sent: Wednesday, August 29, 2018 8:05 PM
> To: Kirsher, Jeffrey T 
> Cc: Keller, Jacob E ; netdev@vger.kernel.org;
> nhor...@redhat.com; sassm...@redhat.com; jogre...@redhat.com
> Subject: Re: [net-next 02/15] i40e: move ethtool stats boiler plate code to
> i40e_ethtool_stats.h
> 
> From: Jeff Kirsher 
> Date: Wed, 29 Aug 2018 15:48:21 -0700
> 
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool_stats.h
> b/drivers/net/ethernet/intel/i40e/i40e_ethtool_stats.h
> > new file mode 100644
> > index ..0290ade7494b
> > --- /dev/null
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool_stats.h
> > @@ -0,0 +1,221 @@
> ...
> > +/**
> > + * __i40e_add_stat_strings - copy stat strings into ethtool buffer
> > + * @p: ethtool supplied buffer
> > + * @stats: stat definitions array
> > + * @size: size of the stats array
> > + *
> > + * Format and copy the strings described by stats into the buffer pointed 
> > at
> > + * by p.
> > + **/
> > +static void __i40e_add_stat_strings(u8 **p, const struct i40e_stats 
> > stats[],
> > +   const unsigned int size, ...)
> 
> Need to be marked inline.

Marking this inline seems to be causing build problems on some systems because 
it uses variadic arguments.

Thanks,
Jake


RE: [net 07/13] ice: Use order_base_2 to calculate higher power of 2

2018-08-24 Thread Keller, Jacob E


> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of Sergei Shtylyov
> Sent: Friday, August 24, 2018 2:03 AM
> To: Kirsher, Jeffrey T ; da...@davemloft.net
> Cc: Keller, Jacob E ; netdev@vger.kernel.org;
> nhor...@redhat.com; sassm...@redhat.com; jogre...@redhat.com;
> Venkataramanan, Anirudh 
> Subject: Re: [net 07/13] ice: Use order_base_2 to calculate higher power of 2
> 
> Hello!
> 
> On 8/23/2018 10:14 PM, Jeff Kirsher wrote:
> 
> > From: Jacob Keller 
> >
> > Currently, we use a combination of ilog2 and is_power_of_2() to
> > calculate the next power of 2 for the qcount. This appears to be causing
> > a warning on some combinations of GCC and the Linux kernel:
> >
> > MODPOST 1 modules
> > WARNING: "ilog2_NaN" [ice.ko] undefined!
> >
> > This appears to because because GCC realizes that qcount could be zero
> 
> One "because" is enough. :-)

Woops! Thanks for catching it.

Regards,
Jake




RE: Issue with driver i40e stat strings count mismatch

2018-07-31 Thread Keller, Jacob E


> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of Stefan Assmann
> Sent: Tuesday, July 31, 2018 12:06 AM
> To: Jesper Dangaard Brouer ; Kirsher, Jeffrey T
> 
> Cc: Topel, Bjorn ; Duyck, Alexander H
> ; intel-wired-lan  l...@lists.osuosl.org>; netdev@vger.kernel.org
> Subject: Re: Issue with driver i40e stat strings count mismatch
> 
> On 10.07.2018 13:17, Jesper Dangaard Brouer wrote:
> > Hi Intel-fokes,
> >
> > Your i40e driver have issues with it's ethtool stats.  A warning
> > triggers at drivers/net/ethernet/intel/i40e/i40e_ethtool.c line 1907
> > (see splash below) in func i40e_get_stat_strings().
> 
> Hi Jesper,
> 
> I ran into the same issue. Here's my proposed fix.
> 
> From 46c74c25496bab06712641c7b2b6b34e365397a2 Mon Sep 17 00:00:00 2001
> From: Stefan Assmann 
> Date: Mon, 30 Jul 2018 21:38:43 +0200
> Subject: [PATCH] i40e: fix i40e_get_stat_strings strings count warning
> 
> The current code calculates p - data, which results in a negative value.
> Therefore the WARN_ONCE condition will always be true.
> Fix this by calculating data - p instead.
> 
> Fixes: 9b10df596bd4 ("i40e: use WARN_ONCE to replace the commented
> BUG_ON size check")
> 
> Signed-off-by: Stefan Assmann 
> ---
>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
> b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
> index 6947a2a571cb..5d670f4ce5ac 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
> @@ -1903,7 +1903,7 @@ static void i40e_get_stat_strings(struct net_device
> *netdev, u8 *data)
>   data += ETH_GSTRING_LEN;
>   }
> 
> - WARN_ONCE(p - data != i40e_get_stats_count(netdev) *
> ETH_GSTRING_LEN,
> + WARN_ONCE(data - p != i40e_get_stats_count(netdev) *
> ETH_GSTRING_LEN,
> "stat strings count mismatch!");
>  }
> 


Thanks Stefan. Sorry about this one. I had a fix for this a while back 
internally but I think it somehow got lost in the shuffle. It's now in the 
process of being posted.

It's the same fix though, so I don't feel strongly about which gets applied. 
Thus...

Acked-by: Jacob Keller 

Thanks,
Jake

> --
> 2.17.1
> 
> >
> > [ 5077.779518] [ cut here ]
> > [ 5077.784493] stat strings count mismatch!
> > [ 5077.784529] WARNING: CPU: 0 PID: 2293 at
> drivers/net/ethernet/intel/i40e/i40e_ethtool.c:1907
> i40e_get_strings+0x477/0x4b0 [i40e]
> > [ 5077.800941] Modules linked in: act_gact cls_u32 sch_ingress xt_tcpudp
> iptable_raw ip_tables x_tables tun nfnetlink bridge stp llc bpfilter sunrpc
> coretemp kvm_intel kvm irqbypass intel_cstate intel_uncore intel_rapl_perf
> pcspkr i2c_i801 wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad
> sch_fq_codel ixgbe mlx5_core mlxfw i40e devlink hid_generic igb mdio
> i2c_algo_bit ptp sd_mod i2c_core pps_core [last unloaded: x_tables]
> > [ 5077.839833] CPU: 0 PID: 2293 Comm: ethtool Not tainted 
> > 4.18.0-rc3-net-next-
> EdwardCree01+ #484
> > [ 5077.848962] Hardware name: Supermicro Super Server/X10SRi-F, BIOS 2.0a
> 08/01/2016
> > [ 5077.857049] RIP: 0010:i40e_get_strings+0x477/0x4b0 [i40e]
> > [ 5077.862776] Code: 98 49 39 c4 0f 84 2e fc ff ff 80 3d b3 da 03 00 00 0f 
> > 85 21 fc ff
> ff 48 c7 c7 24 42 19 a0 c6 05 9f da 03 00 01 e8 e9 9c ee e0 <0f> 0b e9 07 fc 
> ff ff 48 83
> c4 10 48 c7 c1 80 01 19 a0 be 20 00 00
> > [ 5077.882506] RSP: 0018:c90003af3c18 EFLAGS: 00010296
> > [ 5077.888063] RAX: 001c RBX: c90003ce1440 RCX:
> 0006
> > [ 5077.895528] RDX: 0007 RSI: 0096 RDI:
> 88087ca15530
> > [ 5077.902991] RBP: c90003ce1440 R08: 001c R09:
> 0411
> > [ 5077.910453] R10: 000fffe0 R11: 82a4e66d R12: 
> > cbc0
> > [ 5077.917913] R13: 88087c50f000 R14: 0008 R15:
> a0199620
> > [ 5077.925376] FS:  7f7a84bcb740() GS:88087ca0()
> knlGS:
> > [ 5077.934061] CS:  0010 DS:  ES:  CR0: 80050033
> > [ 5077.940133] CR2: 55c7d709b000 CR3: 00081acd0004 CR4:
> 003606f0
> > [ 5077.947609] DR0:  DR1:  DR2:
> 
> > [ 5077.955070] DR3:  DR6: fffe0ff0 DR7:
> 0400
> > [ 5077.962531] Call Trace:
> > [ 5077.965309]  dev_ethtool+0xf4e/0x2430
> > [ 5077.969305]  ? get_page_from_freelist+0x2bb/0x1240
> > [ 5077.974428]  ? dev_ioctl+0x1e9/0x3c0
> > [ 5077.978332]  dev_ioctl+0x1e9/0x3c0
> > [ 5077.982062]  sock_do_ioctl+0xa8/0x140
> > [ 5077.986057]  ? sock_ioctl+0x1c0/0x300
> > [ 5077.990051]  sock_ioctl+0x1c0/0x300
> > [ 5077.993864]  ? __handle_mm_fault+0xa82/0xfd0
> > [ 5077.998462]  ? do_vfs_ioctl+0x8d/0x5e0
> > [ 5078.002550]  do_vfs_ioctl+0x8d/0x5e0
> > [ 5078.006456]  ? 

RE: [Intel-wired-lan] Issue with driver i40e stat strings count mismatch

2018-07-31 Thread Keller, Jacob E


> -Original Message-
> From: Wyborny, Carolyn
> Sent: Tuesday, July 31, 2018 8:53 AM
> To: Jesper Dangaard Brouer ; Stefan Assmann
> ; Keller, Jacob E 
> Cc: netdev@vger.kernel.org; intel-wired-lan 
> ;
> Topel, Bjorn ; Keller, Jacob E
> 
> Subject: RE: [Intel-wired-lan] Issue with driver i40e stat strings count 
> mismatch
> 
> > -Original Message-
> > From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> > Behalf Of Jesper Dangaard Brouer
> > Sent: Tuesday, July 31, 2018 7:29 AM
> > To: Stefan Assmann ; Keller, Jacob E
> > 
> > Cc: netdev@vger.kernel.org; intel-wired-lan  > l...@lists.osuosl.org>; bro...@redhat.com; Topel, Bjorn
> > 
> > Subject: Re: [Intel-wired-lan] Issue with driver i40e stat strings count
> > mismatch
> >
> >
> > On Tue, 31 Jul 2018 09:05:40 +0200 Stefan Assmann
> >  wrote:
> >
> > > From: Stefan Assmann 
> > > To: Jesper Dangaard Brouer ,  Jeff Kirsher
> > 
> > > Cc: Björn Töpel ,
> > "alexander.h.du...@intel.com" ,  intel-
> > wired-lan ,  "netdev@vger.kernel.org"
> > 
> > > Subject: Re: Issue with driver i40e stat strings count mismatch
> > > Date: Tue, 31 Jul 2018 09:05:40 +0200
> > > User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
> > >  Thunderbird/52.9.1
> > > Message-ID: <64be8e6a-c285-2864-fd91-356eba645...@kpanic.de>
> > >
> > > On 10.07.2018 13:17, Jesper Dangaard Brouer wrote:
> > > > Hi Intel-fokes,
> > > >
> > > > Your i40e driver have issues with it's ethtool stats.  A warning
> > > > triggers at drivers/net/ethernet/intel/i40e/i40e_ethtool.c line 1907
> > > > (see splash below) in func i40e_get_stat_strings().
> > >
> > > Hi Jesper,
> > >
> > > I ran into the same issue. Here's my proposed fix.
> >
> > Thanks for following up Stefan :-)
> >
> > I'm hoping some Intel people will look at evaluating this fix? ...
> >
> Thanks, we have a patch in process for this fix and other related ones from 
> Jake
> Keller.  We'll expedite it.
> 
> Carolyn
> 
> Carolyn Wyborny
> Linux Development
> Networking Division
> Intel Corporation
> 
> 

Yea I found tihs issue a while ago and we had a patch internally. I suspect it 
just got lost in a shuffle earlier

I think we have an equivalent patch, but I don't care which gets applied.

Thanks,
Jake


RE: [net] sch_fq_codel: zero q->flows_cnt when fq_codel_init fails

2018-07-10 Thread Keller, Jacob E
> -Original Message-
> From: Cong Wang [mailto:xiyou.wangc...@gmail.com]
> Sent: Tuesday, July 10, 2018 1:32 PM
> To: Keller, Jacob E 
> Cc: Linux Kernel Network Developers ; Eric Dumazet
> 
> Subject: Re: [net] sch_fq_codel: zero q->flows_cnt when fq_codel_init fails
> 
> On Mon, Jul 9, 2018 at 8:37 AM Jacob Keller  wrote:
> > +alloc_failure:
> > +   kfree(q->flows);
> 
> You need to call kvfree() instead.
> 
> Other than this,
> 
> Acked-by: Cong Wang 
> 
> For net-next, I will send a patch to skip ->reset() for ->init() failure
> case, there is no reason to reset queues since qdisc is not even
> activated.

Thanks, I sent a v2.

Regards,
Jake


RE: [net] fq_codel: fix NULL pointer deref in fq_codel_reset

2018-07-02 Thread Keller, Jacob E
> -Original Message-
> From: Cong Wang [mailto:xiyou.wangc...@gmail.com]
> Sent: Monday, June 11, 2018 1:23 PM
> To: Keller, Jacob E 
> Cc: Kirsher, Jeffrey T ; da...@davemloft.net;
> netdev@vger.kernel.org; nhor...@redhat.com; sassm...@redhat.com;
> jogre...@redhat.com; Eric Dumazet 
> Subject: Re: [net] fq_codel: fix NULL pointer deref in fq_codel_reset
> 
> On Mon, Jun 11, 2018 at 12:57 PM, Keller, Jacob E
>  wrote:
> >
> > I'm open to alternative suggestinos for fixing this, I think Eric suggested 
> > that
> maybe we should just remove the ->reset() call from qdisc_destroy..?
> 
> You can't remove ->reset() for non-failure call path.
> 
> For failure path, yeah, but it is much simpler to just make
> q->flows_cnt be 0 for this specific case.

Alright. I'll rework the patch to set flows_cnt to 0 when init fails.

Thanks,
Jake


RE: [net] fq_codel: fix NULL pointer deref in fq_codel_reset

2018-06-11 Thread Keller, Jacob E
> -Original Message-
> From: Cong Wang [mailto:xiyou.wangc...@gmail.com]
> Sent: Monday, June 11, 2018 1:03 PM
> To: Kirsher, Jeffrey T 
> Cc: David Miller ; Keller, Jacob E
> ; Linux Kernel Network Developers
> ; nhor...@redhat.com; sassm...@redhat.com;
> jogre...@redhat.com; Eric Dumazet 
> Subject: Re: [net] fq_codel: fix NULL pointer deref in fq_codel_reset
> 

> Making q->flows_cnt 0 is simpler and easier to understand.

Feel free to propose such a patch :)

Thanks,
Jake


RE: [net] fq_codel: fix NULL pointer deref in fq_codel_reset

2018-06-11 Thread Keller, Jacob E
> -Original Message-
> From: Kirsher, Jeffrey T
> Sent: Monday, June 11, 2018 10:00 AM
> To: da...@davemloft.net
> Cc: Keller, Jacob E ; netdev@vger.kernel.org;
> nhor...@redhat.com; sassm...@redhat.com; jogre...@redhat.com; Eric
> Dumazet ; Kirsher, Jeffrey T
> 
> Subject: [net] fq_codel: fix NULL pointer deref in fq_codel_reset
> 
> From: Jacob Keller 
> 
> The function qdisc_create_dftl attempts to create a default qdisc. If
> this fails, it calls qdisc_destroy when cleaning up. The qdisc_destroy
> function calls the ->reset op on the qdisc.
> 
> In the case of sch_fq_codel.c, this function will panic when the qdisc
> wasn't properly initialized:
> 
>kernel: BUG: unable to handle kernel NULL pointer dereference at
> 0008
>kernel: IP: fq_codel_reset+0x58/0xd0 [sch_fq_codel]
>kernel: PGD 0 P4D 0
>kernel: Oops:  [#1] SMP PTI
>kernel: Modules linked in: i40iw i40e(OE) xt_CHECKSUM iptable_mangle
> ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc
> devlink ebtable_filter ebtables ip6table_filter ip6_tables rpcrdma ib_isert
> iscsi_target_mod sunrpc ib_iser libiscsi scsi_transport_iscsi ib_srpt
> target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm
> ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_rapl sb_edac
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass
> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate iTCO_wdt
> iTCO_vendor_support intel_uncore ib_core intel_rapl_perf mei_me mei joydev
> i2c_i801 lpc_ich ioatdma shpchp wmi sch_fq_codel xfs libcrc32c mgag200 ixgbe
> drm_kms_helper isci ttm firewire_ohci
>kernel:  mdio drm igb libsas crc32c_intel firewire_core ptp pps_core
> scsi_transport_sas crc_itu_t dca i2c_algo_bit ipmi_si ipmi_devintf
> ipmi_msghandler [last unloaded: i40e]
>kernel: CPU: 10 PID: 4219 Comm: ip Tainted: G   OE
> 4.16.13custom-fq-
> codel-test+ #3
>kernel: Hardware name: Intel Corporation S2600CO/S2600CO, BIOS
> SE5C600.86B.02.05.0004.051120151007 05/11/2015
>kernel: RIP: 0010:fq_codel_reset+0x58/0xd0 [sch_fq_codel]
>kernel: RSP: 0018:bfbf4c1fb620 EFLAGS: 00010246
>kernel: RAX: 0400 RBX:  RCX: 05b9
>kernel: RDX:  RSI: 9d03264a60c0 RDI: 9cfd17b31c00
>kernel: RBP: 0001 R08: 000260c0 R09: b679c3e9
>kernel: R10: f1dab06a0e80 R11: 9cfd163af800 R12: 9cfd17b31c00
>kernel: R13: 0001 R14: 9cfd153de600 R15: 0001
>kernel: FS:  7fdec2f92800() GS:9d032648()
> knlGS:
>kernel: CS:  0010 DS:  ES:  CR0: 80050033
>kernel: CR2: 0008 CR3: 000c1956a006 CR4: 000606e0
>kernel: Call Trace:
>kernel:  qdisc_destroy+0x56/0x140
>kernel:  qdisc_create_dflt+0x8b/0xb0
>kernel:  mq_init+0xc1/0xf0
>kernel:  qdisc_create_dflt+0x5a/0xb0
>kernel:  dev_activate+0x205/0x230
>kernel:  __dev_open+0xf5/0x160
>kernel:  __dev_change_flags+0x1a3/0x210
>kernel:  dev_change_flags+0x21/0x60
>kernel:  do_setlink+0x660/0xdf0
>kernel:  ? down_trylock+0x25/0x30
>kernel:  ? xfs_buf_trylock+0x1a/0xd0 [xfs]
>kernel:  ? rtnl_newlink+0x816/0x990
>kernel:  ? _xfs_buf_find+0x327/0x580 [xfs]
>kernel:  ? _cond_resched+0x15/0x30
>kernel:  ? kmem_cache_alloc+0x20/0x1b0
>kernel:  ? rtnetlink_rcv_msg+0x200/0x2f0
>kernel:  ? rtnl_calcit.isra.30+0x100/0x100
>kernel:  ? netlink_rcv_skb+0x4c/0x120
>kernel:  ? netlink_unicast+0x19e/0x260
>kernel:  ? netlink_sendmsg+0x1ff/0x3c0
>kernel:  ? sock_sendmsg+0x36/0x40
>kernel:  ? ___sys_sendmsg+0x295/0x2f0
>kernel:  ? ebitmap_cmp+0x6d/0x90
>kernel:  ? dev_get_by_name_rcu+0x73/0x90
>kernel:  ? skb_dequeue+0x52/0x60
>kernel:  ? __inode_wait_for_writeback+0x7f/0xf0
>kernel:  ? bit_waitqueue+0x30/0x30
>kernel:  ? fsnotify_grab_connector+0x3c/0x60
>kernel:  ? __sys_sendmsg+0x51/0x90
>kernel:  ? do_syscall_64+0x74/0x180
>kernel:  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>kernel: Code: 00 00 48 89 87 00 02 00 00 8b 87 a0 01 00 00 85 c0 0f 84 84 
> 00 00 00
> 31 ed 48 63 dd 83 c5 01 48 c1 e3 06 49 03 9c 24 90 01 00 00 <48> 8b 73 08 48 
> 8b 3b e8
> 6c 9a 4f f6 48 8d 43 10 48 c7 03 00 00
>kernel: RIP: fq_codel_reset+0x58/0xd0 [sch_fq_codel] RSP: bfbf4c1fb620
>kernel: CR2: 0008
>kernel: ---[ end trace e81a62bede66274e ]---
> 
> This occurs because if fq_codel_init fails, it has left the private data
> 

RE: [PATCH ethtool 2/6] ethtool: fix RING_VF assignment

2018-06-08 Thread Keller, Jacob E
> -Original Message-
> From: Ivan Vecera [mailto:c...@cera.cz]
> Sent: Friday, June 08, 2018 2:20 AM
> To: linvi...@tuxdriver.com
> Cc: netdev@vger.kernel.org; Keller, Jacob E 
> Subject: [PATCH ethtool 2/6] ethtool: fix RING_VF assignment
> 
> Fixes: 36ee712 ("ethtool: support queue and VF fields for rxclass filters")
> Cc: Jacob Keller 
> Signed-off-by: Ivan Vecera 
> ---
>  rxclass.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/rxclass.c b/rxclass.c
> index ce4b382..42d122d 100644
> --- a/rxclass.c
> +++ b/rxclass.c
> @@ -1066,7 +1066,7 @@ static int rxclass_get_val(char *str, unsigned char *p,
> u32 *flags,
>   val++;
> 
>   *(u64 *)[opt->offset] &=
> ~ETHTOOL_RX_FLOW_SPEC_RING_VF;
> - *(u64 *)[opt->offset] = (u64)val <<
> ETHTOOL_RX_FLOW_SPEC_RING_VF_OFF;
> + *(u64 *)[opt->offset] |= (u64)val <<
> ETHTOOL_RX_FLOW_SPEC_RING_VF_OFF;
>   break;
>   }

Hah. Good catch.

Thanks,
Jake

>   case OPT_RING_QUEUE: {
> --
> 2.16.4



RE: [PATCH] PCI: allow drivers to limit the number of VFs to 0

2018-05-25 Thread Keller, Jacob E
> -Original Message-
> From: Bjorn Helgaas [mailto:helg...@kernel.org]
> Sent: Friday, May 25, 2018 10:01 AM
> To: Jakub Kicinski <jakub.kicin...@netronome.com>
> Cc: Bjorn Helgaas <bhelg...@google.com>; linux-...@vger.kernel.org;
> netdev@vger.kernel.org; Sathya Perla <sathya.pe...@broadcom.com>; Felix
> Manlunas <felix.manlu...@caviumnetworks.com>;
> alexander.du...@gmail.com; john.fastab...@gmail.com; Keller, Jacob E
> <jacob.e.kel...@intel.com>; Donald Dutile <ddut...@redhat.com>; oss-
> driv...@netronome.com; Christoph Hellwig <h...@infradead.org>; Derek
> Chickles <derek.chick...@caviumnetworks.com>; Satanand Burla
> <satananda.bu...@caviumnetworks.com>; Raghu Vatsavayi
> <raghu.vatsav...@caviumnetworks.com>; Ajit Khaparde
> <ajit.khapa...@broadcom.com>; Sriharsha Basavapatna
> <sriharsha.basavapa...@broadcom.com>; Somnath Kotur
> <somnath.ko...@broadcom.com>; Kirsher, Jeffrey T
> <jeffrey.t.kirs...@intel.com>; intel-wired-...@lists.osuosl.org
> Subject: Re: [PATCH] PCI: allow drivers to limit the number of VFs to 0
> 
> [+cc liquidio, benet, fm10k maintainers:
> 
>   The patch below will affect you if your driver calls
> pci_sriov_set_totalvfs(dev, 0);
> 
>   Previously that caused a subsequent pci_sriov_get_totalvfs() to return
>   the totalVFs value from the SR-IOV capability.  After this patch, it will
>   return 0, which has implications for VF enablement via the sysfs
>   "sriov_numvfs" file.]
> 

Thanks. I don't foresee any issues with fm10k regarding this..

Thanks,
Jake


RE: [PATCH] e1000e: Ignore TSYNCRXCTL when getting I219 clock attributes

2018-05-10 Thread Keller, Jacob E
> -Original Message-
> From: Benjamin Poirier [mailto:bpoir...@suse.com]
> Sent: Thursday, May 10, 2018 12:29 AM
> To: Kirsher, Jeffrey T <jeffrey.t.kirs...@intel.com>
> Cc: Keller, Jacob E <jacob.e.kel...@intel.com>; Achim Mildenberger
> <ad...@fph.physik.uni-karlsruhe.de>; olouvig...@gmail.com;
> jaya...@goubiq.com; ehabk...@redhat.com; postmodern.m...@gmail.com;
> bart.vanass...@wdc.com; intel-wired-...@lists.osuosl.org;
> netdev@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: [PATCH] e1000e: Ignore TSYNCRXCTL when getting I219 clock attributes
> 
> There have been multiple reports of crashes that look like
> kernel: RIP: 0010:[] timecounter_read+0xf/0x50
> [...]
> kernel: Call Trace:
> kernel:  [] e1000e_phc_gettime+0x2f/0x60 [e1000e]
> kernel:  [] e1000e_systim_overflow_work+0x1d/0x80 [e1000e]
> kernel:  [] process_one_work+0x155/0x440
> kernel:  [] worker_thread+0x116/0x4b0
> kernel:  [] kthread+0xd2/0xf0
> kernel:  [] ret_from_fork+0x3f/0x70
> 
> These can be traced back to the fact that e1000e_systim_reset() skips the
> timecounter_init() call if e1000e_get_base_timinca() returns -EINVAL, which
> leads to a null deref in timecounter_read().
> 
> Commit 83129b37ef35 ("e1000e: fix systim issues", v4.2-rc1) reworked
> e1000e_get_base_timinca() in such a way that it can return -EINVAL for
> e1000_pch_spt if the SYSCFI bit is not set in TSYNCRXCTL.
> 
> Some experimentation has shown that on I219 (e1000_pch_spt, "MAC: 12")
> adapters, the E1000_TSYNCRXCTL_SYSCFI flag is unstable; TSYNCRXCTL reads
> sometimes don't have the SYSCFI bit set. Retrying the read shortly after
> finds the bit to be set. This was observed at boot (probe) but also link up
> and link down.
> 
> Moreover, the phc (PTP Hardware Clock) seems to operate normally even after
> reads where SYSCFI=0. Therefore, remove this register read and
> unconditionally set the clock parameters.
> 
> Reported-by: Achim Mildenberger <ad...@fph.physik.uni-karlsruhe.de>
> Message-Id: <20180425065243.g5mqewg5irkwgwgv@f2>
> Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1075876
> Fixes: 83129b37ef35 ("e1000e: fix systim issues")
> Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
> ---
>  drivers/net/ethernet/intel/e1000e/netdev.c | 15 ++-
>  1 file changed, 6 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c
> b/drivers/net/ethernet/intel/e1000e/netdev.c
> index ec4a9759a6f2..3afb1f3b6f91 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -3546,15 +3546,12 @@ s32 e1000e_get_base_timinca(struct e1000_adapter
> *adapter, u32 *timinca)
>   }
>   break;
>   case e1000_pch_spt:
> - if (er32(TSYNCRXCTL) & E1000_TSYNCRXCTL_SYSCFI) {
> - /* Stable 24MHz frequency */
> - incperiod = INCPERIOD_24MHZ;
> - incvalue = INCVALUE_24MHZ;
> - shift = INCVALUE_SHIFT_24MHZ;
> - adapter->cc.shift = shift;
> - break;
> - }
> - return -EINVAL;
> + /* Stable 24MHz frequency */
> + incperiod = INCPERIOD_24MHZ;
> + incvalue = INCVALUE_24MHZ;
> + shift = INCVALUE_SHIFT_24MHZ;
> + adapter->cc.shift = shift;
> + break;
>   case e1000_pch_cnp:
>   if (er32(TSYNCRXCTL) & E1000_TSYNCRXCTL_SYSCFI) {
>   /* Stable 24MHz frequency */
> --
> 2.16.3

Given testing showing that the clock operates fine regardless of the register 
read, I think this is probably fine. Normally I believe the register was used 
to check which frequency was in use, but it doesn't seem to serve that purpose 
here.

Thanks,
Jake


RE: [net-next 2/6] fm10k: reduce duplicate fm10k_stat macro code

2018-05-08 Thread Keller, Jacob E
> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of Joe Perches
> Sent: Tuesday, May 08, 2018 10:00 AM
> To: Kirsher, Jeffrey T <jeffrey.t.kirs...@intel.com>; da...@davemloft.net
> Cc: Keller, Jacob E <jacob.e.kel...@intel.com>; netdev@vger.kernel.org;
> nhor...@redhat.com; sassm...@redhat.com; jogre...@redhat.com
> Subject: Re: [net-next 2/6] fm10k: reduce duplicate fm10k_stat macro code
> 
> On Mon, 2018-05-07 at 07:45 -0700, Jeff Kirsher wrote:
> > Share some of the code for setting up fm10k_stat macros by implementing
> > an FM10K_STAT_FIELDS macro which we can use when setting up the type
> > specific macros.
> []
> > diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c
> b/drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c
> []
> > @@ -11,12 +11,16 @@ struct fm10k_stats {
> > int stat_offset;
> >  };o
> >
> > -#define FM10K_NETDEV_STAT(_net_stat) { \
> > -   .stat_string = #_net_stat, \
> > -   .sizeof_stat = FIELD_SIZEOF(struct net_device_stats, _net_stat), \
> > -   .stat_offset = offsetof(struct net_device_stats, _net_stat) \
> > +#define FM10K_STAT_FIELDS(_type, _name, _stat) { \
> > +   .stat_string = _name, \
> > +   .sizeof_stat = FIELD_SIZEOF(_type, _stat), \
> > +   .stat_offset = offsetof(_type, _stat) \
> >  }
> >
> > +/* netdevice statistics */
> > +#define FM10K_NETDEV_STAT(_net_stat) \
> > +   FM10K_STAT_FIELDS(struct net_device_stats, #_net_stat, _net_stat)
> 
> trivia:
> 
> It's somewhat unusual to use # in a macro argument.
> Perhaps this would be slightly easier to understand using __stringify
> 
> #define FM10K_NETDEV_STAT(_net_stat) \
>   FM10K_STAT_FIELDS(struct net_device_stats, __stringify(_net_stat),
> _net_stat)

Makes sense. Will change.

Thanks,
Jake


RE: [PATCH v6 0/5] PCI: Improve PCIe link status reporting

2018-05-03 Thread Keller, Jacob E
> -Original Message-
> This does change the dmesg reporting of link speeds, and in the ixgbe case,
> it changes the reporting from KERN_WARN level to KERN_INFO.  If that's an
> issue, let's talk about it.  I'm hoping the reduce code size, improved
> functionality, and consistency across drivers is enough to make this
> worthwhile.
> 

I personally have no issue with this change, but I don't work on the ixgbe 
driver much anymore.

Thanks,
Jake



RE: e1000e I219 timestamping oops related to TSYNCRXCTL read

2018-04-25 Thread Keller, Jacob E
Hi Benjamin,

> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of Benjamin Poirier
> Sent: Tuesday, April 24, 2018 11:53 PM
> To: Allan, Bruce W <bruce.w.al...@intel.com>; Yanir Lubetkin
> <yanirx.lubet...@intel.com>; Keller, Jacob E <jacob.e.kel...@intel.com>; 
> Neftin,
> Sasha <sasha.nef...@intel.com>
> Cc: Alexander Duyck <alexander.du...@gmail.com>; Kirsher, Jeffrey T
> <jeffrey.t.kirs...@intel.com>; Achim Mildenberger <ad...@fph.physik.uni-
> karlsruhe.de>; olouvig...@gmail.com; jaya...@goubiq.com;
> ehabk...@redhat.com; postmodern.m...@gmail.com;
> bart.vanass...@wdc.com; intel-wired-...@lists.osuosl.org;
> netdev@vger.kernel.org
> Subject: e1000e I219 timestamping oops related to TSYNCRXCTL read
> 
> In the following openSUSE bug report
> https://bugzilla.suse.com/show_bug.cgi?id=1075876
> Achim reported an oops related to e1000e timestamping:
> kernel: RIP: 0010:[] timecounter_read+0xf/0x50
> [...]
> kernel: Call Trace:
> kernel:  [] e1000e_phc_gettime+0x2f/0x60 [e1000e]
> kernel:  [] e1000e_systim_overflow_work+0x1d/0x80 [e1000e]
> kernel:  [] process_one_work+0x155/0x440
> kernel:  [] worker_thread+0x116/0x4b0
> kernel:  [] kthread+0xd2/0xf0
> kernel:  [] ret_from_fork+0x3f/0x70
> 
> It always occurs 4 hours after boot but not on every boot. It is most
> likely the same problem reported here:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668356
> http://lkml.iu.edu/hypermail/linux/kernel/1506.2/index.html#02530
> https://bugzilla.redhat.com/show_bug.cgi?id=1463882
> https://bugzilla.redhat.com/show_bug.cgi?id=1431863
> 

It probably occurs due to the systim overflow check, yes.

> This occurs with MAC: 12, e1000_pch_spt/I219. The reporter has
> reproduced it on a v4.16 derivative.
> 
> We've traced it to the fact that e1000e_systim_reset() skips the
> timecounter_init() call if e1000e_get_base_timinca() returns -EINVAL,
> which leads to a null deref in timecounter_read() (see comment 8 of the
> suse bugzilla for more details.)
> 
> In commit 83129b37ef35 ("e1000e: fix systim issues", v4.2-rc1) Yanir
> reworked e1000e_get_base_timinca() in such a way that it can return
> -EINVAL for e1000_pch_spt if the SYSCFI bit is not set in TSYNCRXCTL.
> This is also the commit that was identified by bisection in the second
> link above (lkml).
> 
> What we've observed (in comment 14) is that TSYNCRXCTL reads sometimes
> don't have the SYSCFI bit set. Retrying the read shortly after finds the
> bit to be set. This was observed at boot (probe) but also link up and
> link down.
> 

I don't know offhand what the SYSCFI bit is for yet still digging into it.

> I have a few questions:
> 
> What's the purpose of the SYSCFI bit in TSYNCRXCTL ("Reserved" in the
> datasheet)?
> 
> Why does it look like subsequent reads of TSYNCRXCTL sometimes have the
> SYSCFI bit set/not set on I219?
> 
> Is it right to check the SYSCFI bit in e1000e_get_base_timinca() for
> _spt and return -EINVAL if it's not set? Could we just remove that
> check?
> 

I think the right approach might be proper cleanup when we fail to reset. I 
think the problem is that when e1000e_systim_reset is called and fails, we 
don't properly cleanup the work items. I think we need to actually stop and 
kill the work task so that it won't run.

> The patch in comment 13 of the suse bugzilla works around the problem by
> retrying TSYNCRXCTL reads, maybe we could instead remove that read
> altogether or move the timecounter_init() call to at least avoid the
> oops. The best approach to take seems to depend on the behavior expected
> of TSYNCRXCTL reads on I219 so I'm hoping that you could provide more
> info on that.
> 

Yea, we need to do something here, I'm still investigating why we need the 
SYSCFI check, but at a minimum, we should disable the overflow check task if we 
fail here, I think.

It looks like the SYSCFI is the System Clock Frequency Indicator bit, and it 
should be used to tell which of two clock frequencies to choose. I do not 
understand why that would be changing at different reads. We do need to make 
sure the clock is already enabled, but we do that prior to the switch case... 
Something is really weird here...

Thanks,
Jake

> Thanks,
> -Benjamin


RE: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited

2018-04-13 Thread Keller, Jacob E


> -Original Message-
> From: Bjorn Helgaas [mailto:helg...@kernel.org]
> Sent: Friday, April 13, 2018 7:07 AM
> To: Jakub Kicinski <kubak...@wp.pl>
> Cc: Tal Gilboa <ta...@mellanox.com>; Tariq Toukan <tar...@mellanox.com>;
> Keller, Jacob E <jacob.e.kel...@intel.com>; Ariel Elior 
> <ariel.el...@cavium.com>;
> Ganesh Goudar <ganes...@chelsio.com>; Kirsher, Jeffrey T
> <jeffrey.t.kirs...@intel.com>; everest-linux...@cavium.com; intel-wired-
> l...@lists.osuosl.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org;
> linux-...@vger.kernel.org
> Subject: Re: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link 
> speed
> and whether it's limited
> 
> On Thu, Apr 12, 2018 at 09:32:49PM -0700, Jakub Kicinski wrote:
> > On Fri, 30 Mar 2018 16:05:18 -0500, Bjorn Helgaas wrote:
> > > + if (bw_avail >= bw_cap)
> > > + pci_info(dev, "%d Mb/s available bandwidth (%s x%d link)\n",
> > > +  bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> > > + else
> > > + pci_info(dev, "%d Mb/s available bandwidth, limited by %s x%d
> link at %s (capable of %d Mb/s with %s x%d link)\n",
> > > +  bw_avail, PCIE_SPEED2STR(speed), width,
> > > +  limiting_dev ? pci_name(limiting_dev) : "",
> > > +  bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> >
> > I was just looking at using this new function to print PCIe BW for a
> > NIC, but I'm slightly worried that there is nothing in the message that
> > says PCIe...  For a NIC some people may interpret the bandwidth as NIC
> > bandwidth:
> >
> > [   39.839989] nfp :04:00.0: Netronome Flow Processor NFP4000/NFP6000
> PCIe Card Probe
> > [   39.848943] nfp :04:00.0: 63.008 Gb/s available bandwidth (8 GT/s x8 
> > link)
> > [   39.857146] nfp :04:00.0: RESERVED BARs: 0.0: General/MSI-X SRAM, 
> > 0.1:
> PCIe XPB/MSI-X PBA, 0.4: Explicit0, 0.5: Explicit1, fre4
> >
> > It's not a 63Gbps NIC...  I'm sorry if this was discussed before and I
> > didn't find it.  Would it make sense to add the "PCIe: " prefix to the
> > message like bnx2x used to do?  Like:
> >
> > nfp :04:00.0: PCIe: 63.008 Gb/s available bandwidth (8 GT/s x8 link)
> 
> I agree, that does look potentially confusing.  How about this:
> 
>   nfp :04:00.0: 63.008 Gb/s available PCIe bandwidth (8 GT/s x8 link)
> 
> I did have to look twice at this before I remembered that we're
> printing Gb/s (not GB/s).  Most of the references I found on the web
> use GB/s when talking about total PCIe bandwidth.
> 
> But either way I think it's definitely worth mentioning PCIe
> explicitly.

I also agree printing PCIe explicitly is good.

Thanks,
Jake


RE: [PATCH v5 03/14] PCI: Add pcie_bandwidth_capable() to compute max supported link bandwidth

2018-04-03 Thread Keller, Jacob E
> -Original Message-
> From: Bjorn Helgaas [mailto:helg...@kernel.org]
> Sent: Tuesday, April 03, 2018 7:06 AM
> To: Jacob Keller <jacob.kel...@gmail.com>
> Cc: Tal Gilboa <ta...@mellanox.com>; Tariq Toukan <tar...@mellanox.com>;
> Keller, Jacob E <jacob.e.kel...@intel.com>; Ariel Elior 
> <ariel.el...@cavium.com>;
> Ganesh Goudar <ganes...@chelsio.com>; Kirsher, Jeffrey T
> <jeffrey.t.kirs...@intel.com>; everest-linux...@cavium.com; intel-wired-
> l...@lists.osuosl.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org;
> linux-...@vger.kernel.org
> Subject: Re: [PATCH v5 03/14] PCI: Add pcie_bandwidth_capable() to compute
> max supported link bandwidth
> 
> On Mon, Apr 02, 2018 at 05:30:54PM -0700, Jacob Keller wrote:
> > On Mon, Apr 2, 2018 at 7:05 AM, Bjorn Helgaas <helg...@kernel.org> wrote:
> > > +/* PCIe speed to Mb/s reduced by encoding overhead */
> > > +#define PCIE_SPEED2MBS_ENC(speed) \
> > > +   ((speed) == PCIE_SPEED_16_0GT ? (16000*(128/130)) : \
> > > +(speed) == PCIE_SPEED_8_0GT  ?  (8000*(128/130)) : \
> > > +(speed) == PCIE_SPEED_5_0GT  ?  (5000*(8/10)) : \
> > > +(speed) == PCIE_SPEED_2_5GT  ?  (2500*(8/10)) : \
> > > +0)
> > > +
> >
> > Should this be "(speed * x ) / y" instead? wouldn't they calculate
> > 128/130 and truncate that to zero before multiplying by the speed? Or
> > are compilers smart enough to do this the other way to avoid the
> > losses?
> 
> Yep, thanks for saving me yet more embarrassment.

That's what patch review is for :D

Thanks,
Jake


RE: [PATCH v5 12/14] fm10k: Report PCIe link properties with pcie_print_link_status()

2018-04-02 Thread Keller, Jacob E
> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of Bjorn Helgaas
> Sent: Monday, April 02, 2018 1:32 PM
> To: Keller, Jacob E <jacob.e.kel...@intel.com>
> Cc: Tal Gilboa <ta...@mellanox.com>; Tariq Toukan <tar...@mellanox.com>; Ariel
> Elior <ariel.el...@cavium.com>; Ganesh Goudar <ganes...@chelsio.com>;
> Kirsher, Jeffrey T <jeffrey.t.kirs...@intel.com>; everest-linux...@cavium.com;
> intel-wired-...@lists.osuosl.org; netdev@vger.kernel.org; linux-
> ker...@vger.kernel.org; linux-...@vger.kernel.org
> Subject: Re: [PATCH v5 12/14] fm10k: Report PCIe link properties with
> pcie_print_link_status()
> 
> On Mon, Apr 02, 2018 at 03:56:06PM +, Keller, Jacob E wrote:
> > > -Original Message-
> > > From: Bjorn Helgaas [mailto:helg...@kernel.org]
> > > Sent: Friday, March 30, 2018 2:06 PM
> > > To: Tal Gilboa <ta...@mellanox.com>
> > > Cc: Tariq Toukan <tar...@mellanox.com>; Keller, Jacob E
> > > <jacob.e.kel...@intel.com>; Ariel Elior <ariel.el...@cavium.com>; Ganesh
> > > Goudar <ganes...@chelsio.com>; Kirsher, Jeffrey T
> > > <jeffrey.t.kirs...@intel.com>; everest-linux...@cavium.com; intel-wired-
> > > l...@lists.osuosl.org; netdev@vger.kernel.org; 
> > > linux-ker...@vger.kernel.org;
> > > linux-...@vger.kernel.org
> > > Subject: [PATCH v5 12/14] fm10k: Report PCIe link properties with
> > > pcie_print_link_status()
> > >
> > > From: Bjorn Helgaas <bhelg...@google.com>
> > >
> > > Use pcie_print_link_status() to report PCIe link speed and possible
> > > limitations instead of implementing this in the driver itself.
> > >
> > > Note that pcie_get_minimum_link() can return misleading information
> because
> > > it finds the slowest link and the narrowest link without considering the
> > > total bandwidth of the link.  If the path contains a 16 GT/s x1 link and a
> > > 2.5 GT/s x16 link, pcie_get_minimum_link() returns 2.5 GT/s x1, which
> > > corresponds to 250 MB/s of bandwidth, not the actual available bandwidth 
> > > of
> > > about 2000 MB/s for a 16 GT/s x1 link.
> >
> > This comment is about what's being fixed, so it would have been easier to
> > parse if it were written to more clearly indicate that we're removing
> > (and not adding) this behavior.
> 
> Good point.  Is this any better?
> 
>   fm10k: Report PCIe link properties with pcie_print_link_status()
> 
>   Previously the driver used pcie_get_minimum_link() to warn when the NIC
>   is in a slot that can't supply as much bandwidth as the NIC could use.
> 
>   pcie_get_minimum_link() can be misleading because it finds the slowest link
>   and the narrowest link (which may be different links) without considering
>   the total bandwidth of each link.  For a path with a 16 GT/s x1 link and a
>   2.5 GT/s x16 link, it returns 2.5 GT/s x1, which corresponds to 250 MB/s of
>   bandwidth, not the true available bandwidth of about 1969 MB/s for a
>   16 GT/s x1 link.
> 
>   Use pcie_print_link_status() to report PCIe link speed and possible
>   limitations instead of implementing this in the driver itself.  This finds
>   the slowest link in the path to the device by computing the total bandwidth
>   of each link and compares that with the capabilities of the device.
> 
>   Note that the driver previously used dev_warn() to suggest using a
>   different slot, but pcie_print_link_status() uses dev_info() because if the
>   platform has no faster slot available, the user can't do anything about the
>   warning and may not want to be bothered with it.

Perfect! Thanks!

-Jake


RE: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited

2018-04-02 Thread Keller, Jacob E


> -Original Message-
> From: Bjorn Helgaas [mailto:helg...@kernel.org]
> Sent: Monday, April 02, 2018 12:58 PM
> To: Keller, Jacob E <jacob.e.kel...@intel.com>
> Cc: Tal Gilboa <ta...@mellanox.com>; Tariq Toukan <tar...@mellanox.com>; Ariel
> Elior <ariel.el...@cavium.com>; Ganesh Goudar <ganes...@chelsio.com>;
> Kirsher, Jeffrey T <jeffrey.t.kirs...@intel.com>; everest-linux...@cavium.com;
> intel-wired-...@lists.osuosl.org; netdev@vger.kernel.org; linux-
> ker...@vger.kernel.org; linux-...@vger.kernel.org
> Subject: Re: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link 
> speed
> and whether it's limited
> 
> On Mon, Apr 02, 2018 at 04:25:17PM +, Keller, Jacob E wrote:
> > > -Original Message-
> > > From: Bjorn Helgaas [mailto:helg...@kernel.org]
> > > Sent: Friday, March 30, 2018 2:05 PM
> > > To: Tal Gilboa <ta...@mellanox.com>
> > > Cc: Tariq Toukan <tar...@mellanox.com>; Keller, Jacob E
> > > <jacob.e.kel...@intel.com>; Ariel Elior <ariel.el...@cavium.com>; Ganesh
> > > Goudar <ganes...@chelsio.com>; Kirsher, Jeffrey T
> > > <jeffrey.t.kirs...@intel.com>; everest-linux...@cavium.com; intel-wired-
> > > l...@lists.osuosl.org; netdev@vger.kernel.org; 
> > > linux-ker...@vger.kernel.org;
> > > linux-...@vger.kernel.org
> > > Subject: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link 
> > > speed
> and
> > > whether it's limited
> > >
> > > From: Tal Gilboa <ta...@mellanox.com>
> > >
> > > Add pcie_print_link_status().  This logs the current settings of the link
> > > (speed, width, and total available bandwidth).
> > >
> > > If the device is capable of more bandwidth but is limited by a slower
> > > upstream link, we include information about the link that limits the
> > > device's performance.
> > >
> > > The user may be able to move the device to a different slot for better
> > > performance.
> > >
> > > This provides a unified method for all PCI devices to report status and
> > > issues, instead of each device reporting in a different way, using
> > > different code.
> > >
> > > Signed-off-by: Tal Gilboa <ta...@mellanox.com>
> > > [bhelgaas: changelog, reword log messages, print device capabilities when
> > > not limited]
> > > Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
> > > ---
> > >  drivers/pci/pci.c   |   29 +
> > >  include/linux/pci.h |1 +
> > >  2 files changed, 30 insertions(+)
> > >
> > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > > index e00d56b12747..cec7aed09f6b 100644
> > > --- a/drivers/pci/pci.c
> > > +++ b/drivers/pci/pci.c
> > > @@ -5283,6 +5283,35 @@ u32 pcie_bandwidth_capable(struct pci_dev *dev,
> > > enum pci_bus_speed *speed,
> > >   return *width * PCIE_SPEED2MBS_ENC(*speed);
> > >  }
> > >
> > > +/**
> > > + * pcie_print_link_status - Report the PCI device's link speed and width
> > > + * @dev: PCI device to query
> > > + *
> > > + * Report the available bandwidth at the device.  If this is less than 
> > > the
> > > + * device is capable of, report the device's maximum possible bandwidth 
> > > and
> > > + * the upstream link that limits its performance to less than that.
> > > + */
> > > +void pcie_print_link_status(struct pci_dev *dev)
> > > +{
> > > + enum pcie_link_width width, width_cap;
> > > + enum pci_bus_speed speed, speed_cap;
> > > + struct pci_dev *limiting_dev = NULL;
> > > + u32 bw_avail, bw_cap;
> > > +
> > > + bw_cap = pcie_bandwidth_capable(dev, _cap, _cap);
> > > + bw_avail = pcie_bandwidth_available(dev, _dev, ,
> > > );
> > > +
> > > + if (bw_avail >= bw_cap)
> > > + pci_info(dev, "%d Mb/s available bandwidth (%s x%d link)\n",
> > > +  bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> > > + else
> > > + pci_info(dev, "%d Mb/s available bandwidth, limited by %s x%d
> > > link at %s (capable of %d Mb/s with %s x%d link)\n",
> > > +  bw_avail, PCIE_SPEED2STR(speed), width,
> > > +  limiting_dev ? pci_name(limiting_dev) : "",
> > > +  bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> > > +}
> >
> > P

RE: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited

2018-04-02 Thread Keller, Jacob E
> -Original Message-
> From: Bjorn Helgaas [mailto:helg...@kernel.org]
> Sent: Friday, March 30, 2018 2:05 PM
> To: Tal Gilboa <ta...@mellanox.com>
> Cc: Tariq Toukan <tar...@mellanox.com>; Keller, Jacob E
> <jacob.e.kel...@intel.com>; Ariel Elior <ariel.el...@cavium.com>; Ganesh
> Goudar <ganes...@chelsio.com>; Kirsher, Jeffrey T
> <jeffrey.t.kirs...@intel.com>; everest-linux...@cavium.com; intel-wired-
> l...@lists.osuosl.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org;
> linux-...@vger.kernel.org
> Subject: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed 
> and
> whether it's limited
> 
> From: Tal Gilboa <ta...@mellanox.com>
> 
> Add pcie_print_link_status().  This logs the current settings of the link
> (speed, width, and total available bandwidth).
> 
> If the device is capable of more bandwidth but is limited by a slower
> upstream link, we include information about the link that limits the
> device's performance.
> 
> The user may be able to move the device to a different slot for better
> performance.
> 
> This provides a unified method for all PCI devices to report status and
> issues, instead of each device reporting in a different way, using
> different code.
> 
> Signed-off-by: Tal Gilboa <ta...@mellanox.com>
> [bhelgaas: changelog, reword log messages, print device capabilities when
> not limited]
> Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
> ---
>  drivers/pci/pci.c   |   29 +
>  include/linux/pci.h |1 +
>  2 files changed, 30 insertions(+)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e00d56b12747..cec7aed09f6b 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5283,6 +5283,35 @@ u32 pcie_bandwidth_capable(struct pci_dev *dev,
> enum pci_bus_speed *speed,
>   return *width * PCIE_SPEED2MBS_ENC(*speed);
>  }
> 
> +/**
> + * pcie_print_link_status - Report the PCI device's link speed and width
> + * @dev: PCI device to query
> + *
> + * Report the available bandwidth at the device.  If this is less than the
> + * device is capable of, report the device's maximum possible bandwidth and
> + * the upstream link that limits its performance to less than that.
> + */
> +void pcie_print_link_status(struct pci_dev *dev)
> +{
> + enum pcie_link_width width, width_cap;
> + enum pci_bus_speed speed, speed_cap;
> + struct pci_dev *limiting_dev = NULL;
> + u32 bw_avail, bw_cap;
> +
> + bw_cap = pcie_bandwidth_capable(dev, _cap, _cap);
> + bw_avail = pcie_bandwidth_available(dev, _dev, ,
> );
> +
> + if (bw_avail >= bw_cap)
> + pci_info(dev, "%d Mb/s available bandwidth (%s x%d link)\n",
> +  bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> + else
> + pci_info(dev, "%d Mb/s available bandwidth, limited by %s x%d
> link at %s (capable of %d Mb/s with %s x%d link)\n",
> +  bw_avail, PCIE_SPEED2STR(speed), width,
> +  limiting_dev ? pci_name(limiting_dev) : "",
> +  bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> +}

Personally, I would make thic last one a pci_warn() to indicate it at a higher 
log level, but I'm  ok with the wording, and if consensus is that this should 
be at info, I'm ok with that.

Thanks,
Jake

> +EXPORT_SYMBOL(pcie_print_link_status);
> +
>  /**
>   * pci_select_bars - Make BAR mask from the type of resource
>   * @dev: the PCI device for which BAR mask is made
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index f2bf2b7a66c7..38f7957121ef 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1086,6 +1086,7 @@ int pcie_get_minimum_link(struct pci_dev *dev, enum
> pci_bus_speed *speed,
>  u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev
> **limiting_dev,
>enum pci_bus_speed *speed,
>enum pcie_link_width *width);
> +void pcie_print_link_status(struct pci_dev *dev);
>  void pcie_flr(struct pci_dev *dev);
>  int __pci_reset_function_locked(struct pci_dev *dev);
>  int pci_reset_function(struct pci_dev *dev);



RE: [PATCH v5 03/14] PCI: Add pcie_bandwidth_capable() to compute max supported link bandwidth

2018-04-02 Thread Keller, Jacob E
> -Original Message-
> From: Tal Gilboa [mailto:ta...@mellanox.com]
> Sent: Monday, April 02, 2018 7:34 AM
> To: Bjorn Helgaas <helg...@kernel.org>
> Cc: Tariq Toukan <tar...@mellanox.com>; Keller, Jacob E
> <jacob.e.kel...@intel.com>; Ariel Elior <ariel.el...@cavium.com>; Ganesh
> Goudar <ganes...@chelsio.com>; Kirsher, Jeffrey T
> <jeffrey.t.kirs...@intel.com>; everest-linux...@cavium.com; intel-wired-
> l...@lists.osuosl.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org;
> linux-...@vger.kernel.org
> Subject: Re: [PATCH v5 03/14] PCI: Add pcie_bandwidth_capable() to compute
> max supported link bandwidth
> 
> On 4/2/2018 5:05 PM, Bjorn Helgaas wrote:
> > On Mon, Apr 02, 2018 at 10:34:58AM +0300, Tal Gilboa wrote:
> >> On 4/2/2018 3:40 AM, Bjorn Helgaas wrote:
> >>> On Sun, Apr 01, 2018 at 11:38:53PM +0300, Tal Gilboa wrote:
> >>>> On 3/31/2018 12:05 AM, Bjorn Helgaas wrote:
> >>>>> From: Tal Gilboa <ta...@mellanox.com>
> >>>>>
> >>>>> Add pcie_bandwidth_capable() to compute the max link bandwidth
> supported by
> >>>>> a device, based on the max link speed and width, adjusted by the
> encoding
> >>>>> overhead.
> >>>>>
> >>>>> The maximum bandwidth of the link is computed as:
> >>>>>
> >>>>>  max_link_speed * max_link_width * (1 - encoding_overhead)
> >>>>>
> >>>>> The encoding overhead is about 20% for 2.5 and 5.0 GT/s links using
> 8b/10b
> >>>>> encoding, and about 1.5% for 8 GT/s or higher speed links using 
> >>>>> 128b/130b
> >>>>> encoding.
> >>>>>
> >>>>> Signed-off-by: Tal Gilboa <ta...@mellanox.com>
> >>>>> [bhelgaas: adjust for pcie_get_speed_cap() and pcie_get_width_cap()
> >>>>> signatures, don't export outside drivers/pci]
> >>>>> Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
> >>>>> Reviewed-by: Tariq Toukan <tar...@mellanox.com>
> >>>>> ---
> >>>>> drivers/pci/pci.c |   21 +
> >>>>> drivers/pci/pci.h |9 +
> >>>>> 2 files changed, 30 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> >>>>> index 43075be79388..9ce89e254197 100644
> >>>>> --- a/drivers/pci/pci.c
> >>>>> +++ b/drivers/pci/pci.c
> >>>>> @@ -5208,6 +5208,27 @@ enum pcie_link_width
> pcie_get_width_cap(struct pci_dev *dev)
> >>>>> return PCIE_LNK_WIDTH_UNKNOWN;
> >>>>> }
> >>>>> +/**
> >>>>> + * pcie_bandwidth_capable - calculates a PCI device's link bandwidth
> capability
> >>>>> + * @dev: PCI device
> >>>>> + * @speed: storage for link speed
> >>>>> + * @width: storage for link width
> >>>>> + *
> >>>>> + * Calculate a PCI device's link bandwidth by querying for its link 
> >>>>> speed
> >>>>> + * and width, multiplying them, and applying encoding overhead.
> >>>>> + */
> >>>>> +u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed
> *speed,
> >>>>> +  enum pcie_link_width *width)
> >>>>> +{
> >>>>> +   *speed = pcie_get_speed_cap(dev);
> >>>>> +   *width = pcie_get_width_cap(dev);
> >>>>> +
> >>>>> +   if (*speed == PCI_SPEED_UNKNOWN || *width ==
> PCIE_LNK_WIDTH_UNKNOWN)
> >>>>> +   return 0;
> >>>>> +
> >>>>> +   return *width * PCIE_SPEED2MBS_ENC(*speed);
> >>>>> +}
> >>>>> +
> >>>>> /**
> >>>>>  * pci_select_bars - Make BAR mask from the type of resource
> >>>>>  * @dev: the PCI device for which BAR mask is made
> >>>>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> >>>>> index 66738f1050c0..2a50172b9803 100644
> >>>>> --- a/drivers/pci/pci.h
> >>>>> +++ b/drivers/pci/pci.h
> >>>>> @@ -261,8 +261,17 @@ void pci_disable_bridge_window(struct pci_dev
> *dev);
> >>>>>  (speed) == PCIE_SPEED_2_5GT ? "2.5 GT/s" : \
> >>>>>

RE: [PATCH v5 12/14] fm10k: Report PCIe link properties with pcie_print_link_status()

2018-04-02 Thread Keller, Jacob E
> -Original Message-
> From: Bjorn Helgaas [mailto:helg...@kernel.org]
> Sent: Friday, March 30, 2018 2:06 PM
> To: Tal Gilboa <ta...@mellanox.com>
> Cc: Tariq Toukan <tar...@mellanox.com>; Keller, Jacob E
> <jacob.e.kel...@intel.com>; Ariel Elior <ariel.el...@cavium.com>; Ganesh
> Goudar <ganes...@chelsio.com>; Kirsher, Jeffrey T
> <jeffrey.t.kirs...@intel.com>; everest-linux...@cavium.com; intel-wired-
> l...@lists.osuosl.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org;
> linux-...@vger.kernel.org
> Subject: [PATCH v5 12/14] fm10k: Report PCIe link properties with
> pcie_print_link_status()
> 
> From: Bjorn Helgaas <bhelg...@google.com>
> 
> Use pcie_print_link_status() to report PCIe link speed and possible
> limitations instead of implementing this in the driver itself.
> 
> Note that pcie_get_minimum_link() can return misleading information because
> it finds the slowest link and the narrowest link without considering the
> total bandwidth of the link.  If the path contains a 16 GT/s x1 link and a
> 2.5 GT/s x16 link, pcie_get_minimum_link() returns 2.5 GT/s x1, which
> corresponds to 250 MB/s of bandwidth, not the actual available bandwidth of
> about 2000 MB/s for a 16 GT/s x1 link.

This comment is about what's being fixed, so it would have been easier to parse 
if it were written to more clearly indicate that we're removing (and not 
adding) this behavior.

Aside from the commit message (which I don't feel strongly enough needs a 
re-send of the patch) this looks good to me.

Acked-by: Jacob Keller <jacob.e.kel...@intel.com>

Thanks Bjorn and Tal for fixing this!

> 
> Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
> ---
>  drivers/net/ethernet/intel/fm10k/fm10k_pci.c |   87 
> --
>  1 file changed, 1 insertion(+), 86 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
> b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
> index a434fecfdfeb..aa05fb534942 100644
> --- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
> +++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
> @@ -2120,91 +2120,6 @@ static int fm10k_sw_init(struct fm10k_intfc *interface,
>   return 0;
>  }
> 
> -static void fm10k_slot_warn(struct fm10k_intfc *interface)
> -{
> - enum pcie_link_width width = PCIE_LNK_WIDTH_UNKNOWN;
> - enum pci_bus_speed speed = PCI_SPEED_UNKNOWN;
> - struct fm10k_hw *hw = >hw;
> - int max_gts = 0, expected_gts = 0;
> -
> - if (pcie_get_minimum_link(interface->pdev, , ) ||
> - speed == PCI_SPEED_UNKNOWN || width ==
> PCIE_LNK_WIDTH_UNKNOWN) {
> - dev_warn(>pdev->dev,
> -  "Unable to determine PCI Express bandwidth.\n");
> - return;
> - }
> -
> - switch (speed) {
> - case PCIE_SPEED_2_5GT:
> - /* 8b/10b encoding reduces max throughput by 20% */
> - max_gts = 2 * width;
> - break;
> - case PCIE_SPEED_5_0GT:
> - /* 8b/10b encoding reduces max throughput by 20% */
> - max_gts = 4 * width;
> - break;
> - case PCIE_SPEED_8_0GT:
> - /* 128b/130b encoding has less than 2% impact on throughput */
> - max_gts = 8 * width;
> - break;
> - default:
> - dev_warn(>pdev->dev,
> -  "Unable to determine PCI Express bandwidth.\n");
> - return;
> - }
> -
> - dev_info(>pdev->dev,
> -  "PCI Express bandwidth of %dGT/s available\n",
> -  max_gts);
> - dev_info(>pdev->dev,
> -  "(Speed:%s, Width: x%d, Encoding Loss:%s, Payload:%s)\n",
> -  (speed == PCIE_SPEED_8_0GT ? "8.0GT/s" :
> -   speed == PCIE_SPEED_5_0GT ? "5.0GT/s" :
> -   speed == PCIE_SPEED_2_5GT ? "2.5GT/s" :
> -   "Unknown"),
> -  hw->bus.width,
> -  (speed == PCIE_SPEED_2_5GT ? "20%" :
> -   speed == PCIE_SPEED_5_0GT ? "20%" :
> -   speed == PCIE_SPEED_8_0GT ? "<2%" :
> -   "Unknown"),
> -  (hw->bus.payload == fm10k_bus_payload_128 ? "128B" :
> -   hw->bus.payload == fm10k_bus_payload_256 ? "256B" :
> -   hw->bus.payload == fm10k_bus_payload_512 ? "512B" :
> -   "Unknown"));
> -
> - switch (hw->bus_caps.speed) {
> - case fm10k_bus_speed_2500:
> - /* 8b/10b encoding reduces max throughp

RE: [PATCH net-next RFC V1 1/5] net: Introduce peer to peer one step PTP time stamping.

2018-03-21 Thread Keller, Jacob E
> -Original Message-
> From: Richard Cochran [mailto:richardcoch...@gmail.com]
> Sent: Wednesday, March 21, 2018 2:26 PM
> To: Keller, Jacob E <jacob.e.kel...@intel.com>
> Cc: netdev@vger.kernel.org; devicet...@vger.kernel.org; Andrew Lunn
> <and...@lunn.ch>; David Miller <da...@davemloft.net>; Florian Fainelli
> <f.faine...@gmail.com>; Mark Rutland <mark.rutl...@arm.com>; Miroslav
> Lichvar <mlich...@redhat.com>; Rob Herring <robh...@kernel.org>; Willem de
> Bruijn <will...@google.com>
> Subject: Re: [PATCH net-next RFC V1 1/5] net: Introduce peer to peer one step
> PTP time stamping.
> 
> On Wed, Mar 21, 2018 at 08:05:36PM +, Keller, Jacob E wrote:
> > I am guessing that we expect all devices which support onestep P2P messages,
> will always support onestep SYNC as well?
> 
> Yes.  Anything else doesn't make sense, don't you think?
> 
> Also, reading 1588, it isn't clear whether supporting only 1-step Sync
> without 1-step P2P is even intended.  There is only a "one-step
> clock", and it is described as doing both.
> 
> Thanks,
> Richard

This was my understanding as well, but given the limited hardware which can do 
sync but not pdelay messages, I just wanted to make sure we were on the same 
page.

Thanks,
Jake


RE: [PATCH net-next RFC V1 1/5] net: Introduce peer to peer one step PTP time stamping.

2018-03-21 Thread Keller, Jacob E


> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of Richard Cochran
> Sent: Wednesday, March 21, 2018 11:58 AM
> To: netdev@vger.kernel.org
> Cc: devicet...@vger.kernel.org; Andrew Lunn ; David Miller
> ; Florian Fainelli ; Mark Rutland
> ; Miroslav Lichvar ; Rob
> Herring ; Willem de Bruijn 
> Subject: [PATCH net-next RFC V1 1/5] net: Introduce peer to peer one step PTP
> +
> + /*
> +  * Same as HWTSTAMP_TX_ONESTEP_SYNC, but also enables time
> +  * stamp insertion directly into PDelay_Resp packets. In this
> +  * case, neither transmitted Sync nor PDelay_Resp packets will
> +  * receive a time stamp via the socket error queue.
> +  */
> + HWTSTAMP_TX_ONESTEP_P2P,
>  };
> 

I am guessing that we expect all devices which support onestep P2P messages, 
will always support onestep SYNC as well?

Thanks,
Jake



RE: [net-next 2/7] fm10k: cleanup unnecessary parenthesis in fm10k_iov.c

2018-01-31 Thread Keller, Jacob E
> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of Joe Perches
> Sent: Wednesday, January 31, 2018 3:35 AM
> To: Kirsher, Jeffrey T <jeffrey.t.kirs...@intel.com>; da...@davemloft.net
> Cc: Keller, Jacob E <jacob.e.kel...@intel.com>; netdev@vger.kernel.org;
> nhor...@redhat.com; sassm...@redhat.com; jogre...@redhat.com
> Subject: Re: [net-next 2/7] fm10k: cleanup unnecessary parenthesis in
> fm10k_iov.c
> 
> On Wed, 2018-01-24 at 14:45 -0800, Jeff Kirsher wrote:
> > This fixes a few warnings found by checkpatch.pl --strict
> []
> > diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
> b/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
> []
> > @@ -353,7 +353,7 @@ int fm10k_iov_resume(struct pci_dev *pdev)
> > struct fm10k_vf_info *vf_info = _data->vf_info[i];
> >
> > /* allocate all but the last GLORT to the VFs */
> > -   if (i == ((~hw->mac.dglort_map) >>
> FM10K_DGLORTMAP_MASK_SHIFT))
> > +   if (i == (~hw->mac.dglort_map >>
> FM10K_DGLORTMAP_MASK_SHIFT))
> 
> Strictly, only the if needs parentheses here, but I
> think it reads better before this change.
> 

I don't really find the extra paranthesis around ~hw->mac.dglort_map to be that 
useful here.

Thanks,
Jake

> > @@ -511,7 +511,7 @@ int fm10k_iov_configure(struct pci_dev *pdev, int
> num_vfs)
> > return err;
> >
> > /* allocate VFs if not already allocated */
> > -   if (num_vfs && (num_vfs != current_vfs)) {
> > +   if (num_vfs && num_vfs != current_vfs) {
> 



RE: macvlan devices and vlan interaction

2018-01-30 Thread Keller, Jacob E
> -Original Message-
> From: Alexander Duyck [mailto:alexander.du...@gmail.com]
> Sent: Tuesday, January 30, 2018 2:39 PM
> To: Keller, Jacob E <jacob.e.kel...@intel.com>
> Cc: Shannon Nelson <shannon.nel...@oracle.com>; netdev@vger.kernel.org;
> Duyck, Alexander H <alexander.h.du...@intel.com>
> Subject: Re: macvlan devices and vlan interaction
> 
> On Tue, Jan 30, 2018 at 2:20 PM, Keller, Jacob E
> <jacob.e.kel...@intel.com> wrote:
> >> -Original Message-
> >> From: Alexander Duyck [mailto:alexander.du...@gmail.com]
> >> Sent: Tuesday, January 30, 2018 12:49 PM
> >> To: Shannon Nelson <shannon.nel...@oracle.com>
> >> Cc: Keller, Jacob E <jacob.e.kel...@intel.com>; netdev@vger.kernel.org;
> Duyck,
> >> Alexander H <alexander.h.du...@intel.com>
> >> Subject: Re: macvlan devices and vlan interaction
> >>
> >> > Hi Jake,
> >> >
> >> > The current behavior seems logical to me, but I suppose Alex might argue
> >> > differently.  The macvlan was put onto the default lowerdev assuming the
> >> > lowerdev will hand it all the default traffic, and then the macvlan 
> >> > splits
> >> > out its own vlan traffic.  As soon as the lowerdev assumption changes, 
> >> > it is
> >> > going to change what gets pushed up to the macvlan dev. If the lowerdev 
> >> > is
> >> > separating the vlan traffic out of the "default" flow headed to the 
> >> > macvlan,
> >> > then the initial assumption has changed and the vlan traffic has been
> >> > vectored off before it can be delivered up the stack to the macvlan.
> >>
> >> It depends on what your goal is. In my mind making macvlan VLAN
> >> challenged is the easier solution since you just have to add some
> >> pass-thru ops to the VLAN drivers and you can guarantee that you are
> >> passing MAC-VLAN pair for each address on the interface for the call.
> >> The alternative gets to be a bit more complex since it requires
> >> multiple rules, one for non-tagged and one per VLAN for tagged
> >> traffic.
> >>
> >> > There's an argument that the lowerdev shouldn't know anything about the
> >> > upperdev's routing, just deliver to the upperdev and let the upperdev 
> >> > worry
> >> > about it.  But perhaps this becomes is a question of precedence: does the
> >> > lowerdev split traffic first by mac address or by vlan tag.
> >>
> >> That is where things get messy. We found it splits by VLAN tag if the
> >> VLAN is present on the lowerdev, or it splits by MAC if it is not.
> >> That is why as Jake pointed out adding the VLAN to the lower dev
> >> causes issues.
> >>
> >
> > Yes, right now the problem is that it splits differently depending on 
> > whether or
> not a VLAN is present on the lower dev.
> >
> >> > I don't like your option 1: as you point out, it breaks current
> >> > functionality, likely depended upon in some containers that are using
> >> > macvlans to manage their traffic.  We don't know what's going on inside 
> >> > that
> >> > container and I don't think we want to break its ability to split its own
> >> > vlans.
> >>
> >> Maybe we should look at an option 1.5. Mark the lowerdev as VLAN
> >> challenged if any macvlan is operating with any VLANs enabled on it
> >> since we can only really allow VLAN filtering to occur at one level
> >> reliably. Either that or maybe we look at making VLANs and rx_handler
> >> setups mutually exclusive.
> >>
> >
> > Actually.. what if we changed the order of splitting, so that we always 
> > check
> macvlan MAC address first, before checking VLANs?
> >
> > This should work in both cases of macvlan -> VLAN -> lowerdev, or VLAN ->
> macvlan -> lowerdev.
> 
> The thing you have to then watch out for is how something like this
> would impact bonding or bridging since both of those use the Rx
> handler as well from my understanding. I suppose it would make sense
> though to do Rx handler first and then VLAN since the Rx handler
> should be placed on the VLAN itself if you are
> bridging/bonding/macvlan over a VLAN versus the reverse.
> 
> > In the first case, the macvlan isn't directly attached to the lowerdev, so 
> > we'd do
> VLAN filtering first, and then the VLAN would check MAC address.
> 
> Right, that bit works without any issues.
> 
> > In t

RE: macvlan devices and vlan interaction

2018-01-30 Thread Keller, Jacob E
> -Original Message-
> From: Yuan, Linyu (NSB - CN/Shanghai) [mailto:linyu.y...@nokia-sbell.com]
> Sent: Monday, January 29, 2018 5:53 PM
> To: Keller, Jacob E <jacob.e.kel...@intel.com>; netdev@vger.kernel.org
> Cc: Duyck, Alexander H <alexander.h.du...@intel.com>
> Subject: RE: macvlan devices and vlan interaction
> 
> https://www.spinics.net/lists/netdev/msg476083.html
> 
> I also have a macvlan device question, but get no answer.
> 
> But my original thought is in __netif_receive_skb_core() we should check 
> packet
> destination mac address,
> if it match macvlan device, change packet as receive from macvlan device, not
> lower device, then packet go to upper layer.
> 
> But I don't know how to process broadcast mac address. Do macvlan device can
> receive broadcast packet ?
> 

I don't know how macvlans behave in regards to broadcast addresses.

I do think that we should make sure macvlan filtering occurs earlier than VLAN 
filtering to ensure that we get the correct behavior (see the other emails on 
this thread).

I can't comment on how that impacts AF_PACKET, because I think AF_PACKET 
sockets bypass a lot of the stack don't they?

Thanks,
Jake



RE: macvlan devices and vlan interaction

2018-01-30 Thread Keller, Jacob E
> -Original Message-
> From: Alexander Duyck [mailto:alexander.du...@gmail.com]
> Sent: Tuesday, January 30, 2018 12:49 PM
> To: Shannon Nelson <shannon.nel...@oracle.com>
> Cc: Keller, Jacob E <jacob.e.kel...@intel.com>; netdev@vger.kernel.org; Duyck,
> Alexander H <alexander.h.du...@intel.com>
> Subject: Re: macvlan devices and vlan interaction
> 
> > Hi Jake,
> >
> > The current behavior seems logical to me, but I suppose Alex might argue
> > differently.  The macvlan was put onto the default lowerdev assuming the
> > lowerdev will hand it all the default traffic, and then the macvlan splits
> > out its own vlan traffic.  As soon as the lowerdev assumption changes, it is
> > going to change what gets pushed up to the macvlan dev. If the lowerdev is
> > separating the vlan traffic out of the "default" flow headed to the macvlan,
> > then the initial assumption has changed and the vlan traffic has been
> > vectored off before it can be delivered up the stack to the macvlan.
> 
> It depends on what your goal is. In my mind making macvlan VLAN
> challenged is the easier solution since you just have to add some
> pass-thru ops to the VLAN drivers and you can guarantee that you are
> passing MAC-VLAN pair for each address on the interface for the call.
> The alternative gets to be a bit more complex since it requires
> multiple rules, one for non-tagged and one per VLAN for tagged
> traffic.
> 
> > There's an argument that the lowerdev shouldn't know anything about the
> > upperdev's routing, just deliver to the upperdev and let the upperdev worry
> > about it.  But perhaps this becomes is a question of precedence: does the
> > lowerdev split traffic first by mac address or by vlan tag.
> 
> That is where things get messy. We found it splits by VLAN tag if the
> VLAN is present on the lowerdev, or it splits by MAC if it is not.
> That is why as Jake pointed out adding the VLAN to the lower dev
> causes issues.
>

Yes, right now the problem is that it splits differently depending on whether 
or not a VLAN is present on the lower dev.

> > I don't like your option 1: as you point out, it breaks current
> > functionality, likely depended upon in some containers that are using
> > macvlans to manage their traffic.  We don't know what's going on inside that
> > container and I don't think we want to break its ability to split its own
> > vlans.
> 
> Maybe we should look at an option 1.5. Mark the lowerdev as VLAN
> challenged if any macvlan is operating with any VLANs enabled on it
> since we can only really allow VLAN filtering to occur at one level
> reliably. Either that or maybe we look at making VLANs and rx_handler
> setups mutually exclusive.
> 

Actually.. what if we changed the order of splitting, so that we always check 
macvlan MAC address first, before checking VLANs?

This should work in both cases of macvlan -> VLAN -> lowerdev, or VLAN -> 
macvlan -> lowerdev.

In the first case, the macvlan isn't directly attached to the lowerdev, so we'd 
do VLAN filtering first, and then the VLAN would check MAC address.

In the second case, even if lowerdev also had the VLAN, we'd do macvlan 
filtering first, and things would work.

Both the lowerdev VLAN and upperdev macvlan should receive traffic correctly in 
this case.

I think this resolves the problem of which device goes to which VLAN.

I don't know if it resolves the issues with leaked VLANs, where a VLAN added to 
the macvlan device causes traffic for that VLAN to be received by all the MAC 
addresses of the lowerdev...

I suppose this might not be considered a problem? The traffic could be received 
either way if you're in promiscuous mode. It's not like we have a sense of 
"trusted" configuration either.

I think some separate work for the case of macvlan on top of VLAN on top of 
lower dev can be done as well, to enable offloading in this case. I'll have 
some more thoughts on that soon.

Thanks,
Jake

> > Like I said, I think the current behavior is mostly correct, but a version
> > of option 2 might be good to help support offload of the mac+vlan pair into
> > a macvlan channel.
> 
> The only issue is I am not completely sure how option 2 solves the
> original issue. Yes it makes the filtering more explicit, but the
> network stack is still filtering VLANs before we get to the rx_handler
> calls, or is this a fix that works for the offloaded approach only and
> doesn't address the issues in the non-offloaded case? It's also
> possible I might have missed something.
> 
> - Alex


RE: macvlan devices and vlan interaction

2018-01-30 Thread Keller, Jacob E
> -Original Message-
> From: Shannon Nelson [mailto:shannon.nel...@oracle.com]
> Sent: Tuesday, January 30, 2018 12:30 PM
> To: Keller, Jacob E <jacob.e.kel...@intel.com>; netdev@vger.kernel.org
> Cc: Duyck, Alexander H <alexander.h.du...@intel.com>
> Subject: Re: macvlan devices and vlan interaction
> 
> Hi Jake,
> 
> The current behavior seems logical to me, but I suppose Alex might argue
> differently.  The macvlan was put onto the default lowerdev assuming the
> lowerdev will hand it all the default traffic, and then the macvlan
> splits out its own vlan traffic.  As soon as the lowerdev assumption
> changes, it is going to change what gets pushed up to the macvlan dev.
> If the lowerdev is separating the vlan traffic out of the "default" flow
> headed to the macvlan, then the initial assumption has changed and the
> vlan traffic has been vectored off before it can be delivered up the
> stack to the macvlan.
> 
> There's an argument that the lowerdev shouldn't know anything about the
> upperdev's routing, just deliver to the upperdev and let the upperdev
> worry about it.  But perhaps this becomes is a question of precedence:
> does the lowerdev split traffic first by mac address or by vlan tag.
> 

There's a few issues at play here. (1) the device driver has no idea which 
VLANs apply to which devs. So when adding a VLAN to upperdev, it just sends a 
notification to the lowerdev, saying please add VLAN N. The lowerdev doesn't 
have a clue which this applies to.

The second issue (2) is that partially, when deciding where traffic goes, the 
stack prioritises VLANs over macvlan upperdevs, so we end up routing traffic 
that should have gone to a macvlan into a VLAN attached to the lowerdev instead.

> I don't like your option 1: as you point out, it breaks current
> functionality, likely depended upon in some containers that are using
> macvlans to manage their traffic.  We don't know what's going on inside
> that container and I don't think we want to break its ability to split
> its own vlans.
> 

I don't really want to break the ability either, but look at this scenario:

upperdev macvlan created on some lowerdev, and put into a container.
upperdev creates VLAN 10 and starts receiving VLAN 10 traffic.

now, lowerdev creates VLAN 10 on the same lowerdev, possibly unaware of what 
the container did.
 
suddenly the upperdev macvlan no longer receives any VLAN 10 traffic.

Worse, the behavior is *different* depending on whether the macvlan is 
offloaded or not.

In an offloaded macvlan, at least from what i can tell, VLANs have not worked 
on any open source driver in the upstream kernel today, so the original case of 
upperdev creates VLAN 10 will just not receive traffic. This is a separate 
issue which I have a patch to resolve, but it still has problems with the 
leaked VLAN issue (where VLANs are added to the lowerdev directly).

You can argue that this is administrator error, but I'd rather fix it so that 
it's not possible one way or another. Unfortunately, I don't have any good way 
to figure out how to prevent this. The driver doesn't have any indication which 
VLANs apply to which devices.

> Like I said, I think the current behavior is mostly correct, but a
> version of option 2 might be good to help support offload of the
> mac+vlan pair into a macvlan channel.
> 
> sln
> 

I don't really like either option, so suggestions are welcome.

Thanks,
Jake


macvlan devices and vlan interaction

2018-01-29 Thread Keller, Jacob E
Hi,

I'm currently investigating how macvlan devices behave in regards to vlan 
support, and found some interesting behavior that I am not sure how best to 
correct, or what the right path forward is.

If I create a macvlan device:

ip link add link ens0 name macvlan0 type macvlan:

and then add a VLAN to it:

ip link add link macvlan0 name vlan10 type vlan id 10

This works to pass VLAN 10 traffic over the macvlan device. This seems like 
expected behavior.

However, if I then also add vlan 10 to the lowerdev:

ip link add link ens0 name lowervlan10  type vlan id 10

Then traffic stops flowing to the VLAN on the macvlan device.

This happens, as far as I can tell, because of how the VLAN traffic is filtered 
first, and then forwarded to the VLAN device, which doesn't know about how the 
macvlan device exists.

It seems, essentially, that vlan stacked on top of a macvlan shouldn't work. 
Because the vlan code basically expects each vlan to apply to every MAC 
address, and the macvlan device works by putting its MAC address into the 
unicast address list, there's no way for a device driver to know when or how to 
apply the vlan.

This gets a bit more confusing when we add in the l2 fwd hardware offload.

Currently, at least for the Intel network parts, this isn't supported, because 
of a bug in which the device drivers don't apply the VLANs to the macvlan 
accelerated addresses. If we fix this, at least for fm10k, the behavior is 
slightly better, because of how the hardware filtering at the MAC address 
happens first, and we direct the traffic to the proper device regardless of 
VLAN.

In addition to this peculiarity of VLANs on both the macvlan and lowerdev, is 
that when a macvlan device adds a VLAN, the lowerdev gets an indication to add 
the vlan via its .ndo_vlan_rx_add_vid(), which doesn't distinguish between 
which addresses the VLAN might apply to. It thus simply, depending on hardware 
design, enables the VLAN for all its unicast and multicast addresses. Some 
hardware could theoretically support MAC+VLAN pairs, where it could distinguish 
that a VLAN should only be added for some subset of addresses. Other hardware 
might not be so lucky..

Unfortunately, this has the weird consequence that if we have the following 
stack of devices:

vlan10@macvlan0
macvlan0@ens0
ens0

Then ens0 will receive VLAN10 traffic on every address. So VLAN 10 traffic 
destined to the MAC of the lowerdev will be received, instead of dropped.

If we add VLAN 10 to the lowerdev so we have both the above stack and also

lowervlan10@ens0
ens0 (mac gg:hh:ii:jj:kk)

then all vlan 10 traffic will be received on the lowerdev VLAN 10, without any 
being forwarded to the VLAN10 attached to the macvlan.

However, if we add two macvlans, and each add the vlan10, so we have the 
following:

avlan10@macvlan0
macvlan0@ens0
ens0

bvlan10@macvlan1
macvlan1@ens0
ens0

In this case, it does appear that traffic is sorted out correctly. It seems 
that only if the lowerdev gets the VLAN does it end up breaking. If I remove 
bvlan10 from macvlan1, the traffic associated with vlan10 is still received by 
macvlan1, even though in principle it should no longer be.

What is the correct behavior here? Should this just be "administrators should 
know better"? I don't think that's a great argument, and either way we're still 
essentially leaking VLANs across the macvlan interfaces, which I don't think is 
ideal.

I see two possible solutions:

1) modify macvlan driver so that it is marked as VLAN_CHALLENGED, and thus 
indicate it cannot handle VLAN traffic on top of it.
  a. In order to get the VLANs associated, administrator could instead add the 
VLAN first, and then add the macvlan on top. This I think is a better 
configuration.
  b. that doesn't work in the offload case, unless/until we fix the VLAN 
interface to forward the l2_dfwd_add_station() along with a vid.
  c. this could appear as loss of functionality, since in some cases these VLAN 
on top of macvlan work today (with the interesting caveats listed above).

2) modify how VLANs interact with MAC addresses, so that the lowerdev can 
explicitly be aware of which VLANs are tied to which address groups, in order 
to allow for the explicit configuration of which MAC+VLAN pairs are actually 
allowed.
  a. this is a much more invasive change to driver interface, and more 
difficult to get right
  b. possibly other configurations of stacked devices might have a similar 
problem, so we could solve more here? Or create more problems.. I'm not really 
certain.


I think the correct solution is (1) but I wasn't sure what others thought, and 
whether anyone else has encountered the problems I mention and outline above. I 
cc'd Alex who I discussed with offline when I first heard of and began 
investigating this, in case he has anything further to add.

Regards,
Jake


RE: [PATCH] [RESEND net] fm10k: mark PM functions as __maybe_unused

2018-01-16 Thread Keller, Jacob E
> -Original Message-
> From: Arnd Bergmann [mailto:a...@arndb.de]
> Sent: Tuesday, January 16, 2018 1:14 AM
> To: Kirsher, Jeffrey T <jeffrey.t.kirs...@intel.com>
> Cc: Arnd Bergmann <a...@arndb.de>; Keller, Jacob E
> <jacob.e.kel...@intel.com>; David S. Miller <da...@davemloft.net>; Kwan,
> Ngai-mint <ngai-mint.k...@intel.com>; intel-wired-...@lists.osuosl.org;
> netdev@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: [PATCH] [RESEND net] fm10k: mark PM functions as __maybe_unused
> 
> A cleanup of the PM code left an incorrect #ifdef in place, leading
> to a harmless build warning:
> 
> drivers/net/ethernet/intel/fm10k/fm10k_pci.c:2502:12: error: 'fm10k_suspend'
> defined but not used [-Werror=unused-function]
> drivers/net/ethernet/intel/fm10k/fm10k_pci.c:2475:12: error: 'fm10k_resume'
> defined but not used [-Werror=unused-function]
> 
> It's easier to use __maybe_unused attributes here, since you
> can't pick the wrong one.
> 
> Fixes: 8249c47c6ba4 ("fm10k: use generic PM hooks instead of legacy PCIe power
> hooks")
> Acked-by: Jacob Keller <jacob.e.kel...@intel.com>
> Tested-by: Krishneil Singh <krishneil.k.si...@intel.com>
> Signed-off-by: Arnd Bergmann <a...@arndb.de>
> ---
> Apparently nobody picked this up the first time around (Oct 2017),
> here is the same patch again.

Odd. I remember seeing this and thought I ack'd it..? Guess it got missed.

Acked-by: Jacob Keller <jacob.e.kel...@intel.com>

> ---
>  drivers/net/ethernet/intel/fm10k/fm10k_pci.c | 9 ++---
>  1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
> b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
> index 7f605221a686..a434fecfdfeb 100644
> --- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
> +++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
> @@ -2463,7 +2463,6 @@ static int fm10k_handle_resume(struct fm10k_intfc
> *interface)
>   return err;
>  }
> 
> -#ifdef CONFIG_PM
>  /**
>   * fm10k_resume - Generic PM resume hook
>   * @dev: generic device structure
> @@ -2472,7 +2471,7 @@ static int fm10k_handle_resume(struct fm10k_intfc
> *interface)
>   * suspend or hibernation. This function does not need to handle lower PCIe
>   * device state as the stack takes care of that for us.
>   **/
> -static int fm10k_resume(struct device *dev)
> +static int __maybe_unused fm10k_resume(struct device *dev)
>  {
>   struct fm10k_intfc *interface = pci_get_drvdata(to_pci_dev(dev));
>   struct net_device *netdev = interface->netdev;
> @@ -2499,7 +2498,7 @@ static int fm10k_resume(struct device *dev)
>   * system suspend or hibernation. This function does not need to handle lower
>   * PCIe device state as the stack takes care of that for us.
>   **/
> -static int fm10k_suspend(struct device *dev)
> +static int __maybe_unused fm10k_suspend(struct device *dev)
>  {
>   struct fm10k_intfc *interface = pci_get_drvdata(to_pci_dev(dev));
>   struct net_device *netdev = interface->netdev;
> @@ -2511,8 +2510,6 @@ static int fm10k_suspend(struct device *dev)
>   return 0;
>  }
> 
> -#endif /* CONFIG_PM */
> -
>  /**
>   * fm10k_io_error_detected - called when PCI error is detected
>   * @pdev: Pointer to PCI device
> @@ -2643,11 +2640,9 @@ static struct pci_driver fm10k_driver = {
>   .id_table   = fm10k_pci_tbl,
>   .probe  = fm10k_probe,
>   .remove = fm10k_remove,
> -#ifdef CONFIG_PM
>   .driver = {
>   .pm = _pm_ops,
>   },
> -#endif /* CONFIG_PM */
>   .sriov_configure= fm10k_iov_configure,
>   .err_handler= _err_handler
>  };
> --
> 2.9.0



RE: [PATCH net-next] i40evf: use GFP_ATOMIC under spin lock

2018-01-12 Thread Keller, Jacob E
> -Original Message-
> From: Wei Yongjun [mailto:weiyongj...@huawei.com]
> Sent: Thursday, January 11, 2018 6:27 PM
> To: Kirsher, Jeffrey T <jeffrey.t.kirs...@intel.com>; Keller, Jacob E
> <jacob.e.kel...@intel.com>
> Cc: Wei Yongjun <weiyongj...@huawei.com>; intel-wired-...@lists.osuosl.org;
> netdev@vger.kernel.org; kernel-janit...@vger.kernel.org
> Subject: [PATCH net-next] i40evf: use GFP_ATOMIC under spin lock
> 
> A spin lock is taken here so we should use GFP_ATOMIC.
> 

You are correct, good catch!

> Fixes: 504398f0a78e ("i40evf: use spinlock to protect (mac|vlan)_filter_list")
> Signed-off-by: Wei Yongjun <weiyongj...@huawei.com>
> ---

Acked-by: Jacob Keller <jacob.e.kel...@intel.com>

>  drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
> b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
> index feb95b6..ca5b538 100644
> --- a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
> +++ b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
> @@ -459,7 +459,7 @@ void i40evf_add_ether_addrs(struct i40evf_adapter
> *adapter)
>   more = true;
>   }
> 
> - veal = kzalloc(len, GFP_KERNEL);
> + veal = kzalloc(len, GFP_ATOMIC);
>   if (!veal) {
>   spin_unlock_bh(>mac_vlan_list_lock);
>   return;
> @@ -532,7 +532,7 @@ void i40evf_del_ether_addrs(struct i40evf_adapter
> *adapter)
> (count * sizeof(struct virtchnl_ether_addr));
>   more = true;
>   }
> - veal = kzalloc(len, GFP_KERNEL);
> + veal = kzalloc(len, GFP_ATOMIC);
>   if (!veal) {
>   spin_unlock_bh(>mac_vlan_list_lock);
>   return;
> @@ -606,7 +606,7 @@ void i40evf_add_vlans(struct i40evf_adapter *adapter)
> (count * sizeof(u16));
>   more = true;
>   }
> - vvfl = kzalloc(len, GFP_KERNEL);
> + vvfl = kzalloc(len, GFP_ATOMIC);
>   if (!vvfl) {
>   spin_unlock_bh(>mac_vlan_list_lock);
>   return;
> @@ -678,7 +678,7 @@ void i40evf_del_vlans(struct i40evf_adapter *adapter)
> (count * sizeof(u16));
>   more = true;
>   }
> - vvfl = kzalloc(len, GFP_KERNEL);
> + vvfl = kzalloc(len, GFP_ATOMIC);
>   if (!vvfl) {
>   spin_unlock_bh(>mac_vlan_list_lock);
>   return;



RE: v4.15-rc2 on thinkpad x60: ethernet stopped working

2017-12-15 Thread Keller, Jacob E
> -Original Message-
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On Behalf Of
> Keller, Jacob E
> Sent: Friday, December 15, 2017 9:29 AM
> To: Gabriel C <nix.or@gmail.com>; Pavel Machek <pa...@ucw.cz>; kernel list
> <linux-ker...@vger.kernel.org>
> Cc: netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org
> Subject: Re: [Intel-wired-lan] v4.15-rc2 on thinkpad x60: ethernet stopped
> working
> 
> > -Original Message-
> > From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org]
> > On Behalf Of Gabriel C
> > Sent: Sunday, December 10, 2017 4:44 AM
> > To: Pavel Machek <pa...@ucw.cz>; kernel list <linux-ker...@vger.kernel.org>
> > Cc: Kirsher, Jeffrey T <jeffrey.t.kirs...@intel.com>; intel-wired-
> > l...@lists.osuosl.org; netdev@vger.kernel.org
> > Subject: Re: v4.15-rc2 on thinkpad x60: ethernet stopped working
> >
> > On 10.12.2017 09:39, Pavel Machek wrote:
> > > Hi!
> >
> > Hi,
> >
> > > In v4.15-rc2+, network manager can not see my ethernet card, and
> > > manual attempts to ifconfig it up did not really help, either.
> > >
> > > Card is:
> > >
> > > 02:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet
> > > Controller
> > >
> > > Dmesg says:
> > >
> > >dmesg | grep eth
> > > [0.648931] e1000e :02:00.0 eth0: (PCI Express:2.5GT/s:Width
> > > x1) 00:16:d3:25:19:04
> > > [0.648934] e1000e :02:00.0 eth0: Intel(R) PRO/1000 Network
> > > Connection
> > > [0.649012] e1000e :02:00.0 eth0: MAC: 2, PHY: 2, PBA No:
> > > 005302-003
> > > [0.706510] usbcore: registered new interface driver cdc_ether
> > > [6.557022] e1000e :02:00.0 eth1: renamed from eth0
> > > [6.577554] systemd-udevd[2363]: renamed network interface eth0 to
> > > eth1
> > >
> > > Any ideas ?
> >
> > Yes , 19110cfbb34d4af0cdfe14cd243f3b09dc95b013 broke it.
> >
> > See:
> > https://bugzilla.kernel.org/show_bug.cgi?id=198047
> >
> > Fix there :
> > https://marc.info/?l=linux-kernel=151272209903675=2
> >
> > Regards,
> >
> > Gabriel C
> 
> Hi,
> 
> Digging into this, the problem is complicated. The original bug assumed 
> behavior
> of the .check_for_link call, which is universally not implemented.
> 
> I think the correct fix is to revert 19110cfbb34d ("e1000e: Separate 
> signaling for
> link check/link up", 2017-10-10) and find a more proper solution.
> 
> I don't think any other code which uses check_for_link expects the interface 
> to
> return in the way this patch attempted.
> 
> Thanks,
> Jake
> 

Alternatively, we can go a step farther and make sure every implementation of 
.check_for_link follows the modified interface.

Thanks,
Jake



RE: v4.15-rc2 on thinkpad x60: ethernet stopped working

2017-12-15 Thread Keller, Jacob E
> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of Gabriel C
> Sent: Sunday, December 10, 2017 4:44 AM
> To: Pavel Machek ; kernel list 
> Cc: Kirsher, Jeffrey T ; intel-wired-
> l...@lists.osuosl.org; netdev@vger.kernel.org
> Subject: Re: v4.15-rc2 on thinkpad x60: ethernet stopped working
> 
> On 10.12.2017 09:39, Pavel Machek wrote:
> > Hi!
> 
> Hi,
> 
> > In v4.15-rc2+, network manager can not see my ethernet card, and
> > manual attempts to ifconfig it up did not really help, either.
> >
> > Card is:
> >
> > 02:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet
> > Controller
> >
> > Dmesg says:
> >
> >dmesg | grep eth
> > [0.648931] e1000e :02:00.0 eth0: (PCI Express:2.5GT/s:Width
> > x1) 00:16:d3:25:19:04
> > [0.648934] e1000e :02:00.0 eth0: Intel(R) PRO/1000 Network
> > Connection
> > [0.649012] e1000e :02:00.0 eth0: MAC: 2, PHY: 2, PBA No:
> > 005302-003
> > [0.706510] usbcore: registered new interface driver cdc_ether
> > [6.557022] e1000e :02:00.0 eth1: renamed from eth0
> > [6.577554] systemd-udevd[2363]: renamed network interface eth0 to
> > eth1
> >
> > Any ideas ?
> 
> Yes , 19110cfbb34d4af0cdfe14cd243f3b09dc95b013 broke it.
> 
> See:
> https://bugzilla.kernel.org/show_bug.cgi?id=198047
> 
> Fix there :
> https://marc.info/?l=linux-kernel=151272209903675=2
> 
> Regards,
> 
> Gabriel C

Hi,

Digging into this, the problem is complicated. The original bug assumed 
behavior of the .check_for_link call, which is universally not implemented.

I think the correct fix is to revert 19110cfbb34d ("e1000e: Separate signaling 
for link check/link up", 2017-10-10) and find a more proper solution.

I don't think any other code which uses check_for_link expects the interface to 
return in the way this patch attempted.

Thanks,
Jake


RE: [PATCH] sched/deadline: fix one-bit signed bitfields to be unsigned

2017-11-29 Thread Keller, Jacob E
> -Original Message-
> From: Jakub Kicinski [mailto:kubak...@wp.pl]
> Sent: Tuesday, November 28, 2017 8:08 PM
> To: Kirsher, Jeffrey T <jeffrey.t.kirs...@intel.com>
> Cc: mi...@redhat.com; pet...@infradead.org; Keller, Jacob E
> <jacob.e.kel...@intel.com>; linux-ker...@vger.kernel.org;
> netdev@vger.kernel.org; nhor...@redhat.com; sassm...@redhat.com;
> jogre...@redhat.com; luca abeni <luca.ab...@santannapisa.it>
> Subject: Re: [PATCH] sched/deadline: fix one-bit signed bitfields to be 
> unsigned
> 
> On Tue, 28 Nov 2017 12:36:19 -0800, Jeff Kirsher wrote:
> > From: Jacob Keller <jacob.e.kel...@intel.com>
> >
> > Commit 799ba82de01e ("sched/deadline: Use C bitfields for the state
> > flags", 2017-10-10) introduced the use of C bitfields for these
> > variables. However, sparse complains about them:
> >
> > ./include/linux/sched.h:476:62: error: dubious one-bit signed bitfield
> > ./include/linux/sched.h:477:62: error: dubious one-bit signed bitfield
> > ./include/linux/sched.h:478:62: error: dubious one-bit signed bitfield
> > ./include/linux/sched.h:479:62: error: dubious one-bit signed bitfield
> >
> > This is because a one-bit signed bitfield can only hold the values 0 and
> > -1, which can cause problems if the program expects to be able to
> > represent the value positive 1.
> >
> > In practice, this may not cause a bug since -1 would be considered
> > "true" in logical tests, however we should avoid the practice anyways.
> >
> > Fixes: 799ba82de01e ("sched/deadline: Use C bitfields for the state flags", 
> > 2017-
> 10-10)
> > Signed-off-by: Jacob Keller <jacob.e.kel...@intel.com>
> > Cc: luca abeni <luca.ab...@santannapisa.it>
> > Tested-by: Andrew Bowers <andrewx.bow...@intel.com>
> > Signed-off-by: Jeff Kirsher <jeffrey.t.kirs...@intel.com>
> 
> This is already in Linus's tree (I've been waiting for it to land as
> well :))
> 

Excellent.

Regards,
Jake


RE: [Intel-wired-lan] [jkirsher/next-queue PATCH 2/5] fm10k: Fix VLAN configuration for macvlan offload

2017-11-08 Thread Keller, Jacob E
> -Original Message-
> From: Alexander Duyck [mailto:alexander.du...@gmail.com]
> Sent: Wednesday, November 08, 2017 4:32 PM
> To: Keller, Jacob E <jacob.e.kel...@intel.com>
> Cc: Brandeburg, Jesse <jesse.brandeb...@intel.com>; netdev@vger.kernel.org;
> intel-wired-...@lists.osuosl.org
> Subject: Re: [Intel-wired-lan] [jkirsher/next-queue PATCH 2/5] fm10k: Fix VLAN
> configuration for macvlan offload
> 
> On Wed, Nov 8, 2017 at 4:21 PM, Keller, Jacob E
> <jacob.e.kel...@intel.com> wrote:
> >
> >
> >> -Original Message-
> >> From: Alexander Duyck [mailto:alexander.du...@gmail.com]
> >> Sent: Wednesday, November 08, 2017 3:04 PM
> >> To: Keller, Jacob E <jacob.e.kel...@intel.com>
> >> Cc: Brandeburg, Jesse <jesse.brandeb...@intel.com>;
> netdev@vger.kernel.org;
> >> intel-wired-...@lists.osuosl.org
> >> Subject: Re: [Intel-wired-lan] [jkirsher/next-queue PATCH 2/5] fm10k: Fix
> VLAN
> >> configuration for macvlan offload
> >>
> >> On Wed, Nov 8, 2017 at 2:05 PM, Keller, Jacob E
> >> <jacob.e.kel...@intel.com> wrote:
> >> >> -Original Message-
> >> >> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> Behalf
> >> Of
> >> >> Jesse Brandeburg
> >> >> Sent: Friday, November 03, 2017 10:06 AM
> >> >> To: Alexander Duyck <alexander.du...@gmail.com>
> >> >> Cc: netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org
> >> >> Subject: Re: [Intel-wired-lan] [jkirsher/next-queue PATCH 2/5] fm10k: 
> >> >> Fix
> >> VLAN
> >> >> configuration for macvlan offload
> >> >>
> >> >> On Thu, 2 Nov 2017 16:33:45 -0700
> >> >> Alexander Duyck <alexander.du...@gmail.com> wrote:
> >> >>
> >> >> > From: Alexander Duyck <alexander.h.du...@intel.com>
> >> >> >
> >> >> > The fm10k driver didn't work correctly when macvlan offload was
> enabled.
> >> >> > Specifically what would occur is that we would see no unicast packets
> being
> >> >> > received. This was traced down to us not correctly configuring the 
> >> >> > default
> >> >> > VLAN ID for the port and defaulting to 0.
> >> >> >
> >> >> > To correct this we either use the default ID provided by the switch or
> >> >> > simply use 1. With that we are able to pass and receive traffic 
> >> >> > without any
> >> >> > issues.
> >> >>
> >> >> Reviewed-by: Jesse Brandeburg <jesse.brandeb...@intel.com>
> >> >>
> >> >
> >> > Hi,
> >> >
> >> > I think this isn't quite right, since we recently made changes to the 
> >> > fm10k
> code
> >> to stop assuming VLAN 1 in these cases. I believe it should just pass the
> >> default_vid
> >> >
> >> > Thanks,
> >> > Jake
> >> >
> >>
> >> I kind of figured that might be the case. We can probably drop this
> >> patch for now and if you want you can work this from our end as the
> >> out-of-tree code doesn't make the upstream and I would imagine you
> >> guys are planning to upstream that patch at some point.
> >>
> >> - Alex
> >
> > I think the patch should just be changed from using
> >
> > hw->mac.default_vid ? : 1
> >
> > to just using hw->mac.default_vid directly.
> >
> > Otherwise, I think the patch is necessary.
> >
> > Thanks,
> > Jake
> 
> Doesn't passing a 0 in the case of the value not being populated cause
> issues? If I understand the problem I thought passing a 0 in the vlan
> ID field caused some sort of error.
> 
> - Alex

No, it just doesn't really do anything useful. But it doesn't harm anything 
either.

Using 0 on the switch side causes problems for the switch manager, but in this 
case, if we don't have a default_vid, then we aren't talking to the switch 
anyways.

Thanks,
Jake


RE: [Intel-wired-lan] [jkirsher/next-queue PATCH 2/5] fm10k: Fix VLAN configuration for macvlan offload

2017-11-08 Thread Keller, Jacob E


> -Original Message-
> From: Alexander Duyck [mailto:alexander.du...@gmail.com]
> Sent: Wednesday, November 08, 2017 3:04 PM
> To: Keller, Jacob E <jacob.e.kel...@intel.com>
> Cc: Brandeburg, Jesse <jesse.brandeb...@intel.com>; netdev@vger.kernel.org;
> intel-wired-...@lists.osuosl.org
> Subject: Re: [Intel-wired-lan] [jkirsher/next-queue PATCH 2/5] fm10k: Fix VLAN
> configuration for macvlan offload
> 
> On Wed, Nov 8, 2017 at 2:05 PM, Keller, Jacob E
> <jacob.e.kel...@intel.com> wrote:
> >> -Original Message-
> >> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On Behalf
> Of
> >> Jesse Brandeburg
> >> Sent: Friday, November 03, 2017 10:06 AM
> >> To: Alexander Duyck <alexander.du...@gmail.com>
> >> Cc: netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org
> >> Subject: Re: [Intel-wired-lan] [jkirsher/next-queue PATCH 2/5] fm10k: Fix
> VLAN
> >> configuration for macvlan offload
> >>
> >> On Thu, 2 Nov 2017 16:33:45 -0700
> >> Alexander Duyck <alexander.du...@gmail.com> wrote:
> >>
> >> > From: Alexander Duyck <alexander.h.du...@intel.com>
> >> >
> >> > The fm10k driver didn't work correctly when macvlan offload was enabled.
> >> > Specifically what would occur is that we would see no unicast packets 
> >> > being
> >> > received. This was traced down to us not correctly configuring the 
> >> > default
> >> > VLAN ID for the port and defaulting to 0.
> >> >
> >> > To correct this we either use the default ID provided by the switch or
> >> > simply use 1. With that we are able to pass and receive traffic without 
> >> > any
> >> > issues.
> >>
> >> Reviewed-by: Jesse Brandeburg <jesse.brandeb...@intel.com>
> >>
> >
> > Hi,
> >
> > I think this isn't quite right, since we recently made changes to the fm10k 
> > code
> to stop assuming VLAN 1 in these cases. I believe it should just pass the
> default_vid
> >
> > Thanks,
> > Jake
> >
> 
> I kind of figured that might be the case. We can probably drop this
> patch for now and if you want you can work this from our end as the
> out-of-tree code doesn't make the upstream and I would imagine you
> guys are planning to upstream that patch at some point.
> 
> - Alex

I think the patch should just be changed from using

hw->mac.default_vid ? : 1

to just using hw->mac.default_vid directly.

Otherwise, I think the patch is necessary.

Thanks,
Jake


RE: [Intel-wired-lan] [jkirsher/next-queue PATCH 2/5] fm10k: Fix VLAN configuration for macvlan offload

2017-11-08 Thread Keller, Jacob E
> -Original Message-
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On Behalf Of
> Jesse Brandeburg
> Sent: Friday, November 03, 2017 10:06 AM
> To: Alexander Duyck 
> Cc: netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org
> Subject: Re: [Intel-wired-lan] [jkirsher/next-queue PATCH 2/5] fm10k: Fix VLAN
> configuration for macvlan offload
> 
> On Thu, 2 Nov 2017 16:33:45 -0700
> Alexander Duyck  wrote:
> 
> > From: Alexander Duyck 
> >
> > The fm10k driver didn't work correctly when macvlan offload was enabled.
> > Specifically what would occur is that we would see no unicast packets being
> > received. This was traced down to us not correctly configuring the default
> > VLAN ID for the port and defaulting to 0.
> >
> > To correct this we either use the default ID provided by the switch or
> > simply use 1. With that we are able to pass and receive traffic without any
> > issues.
> 
> Reviewed-by: Jesse Brandeburg 
> 

Hi,

I think this isn't quite right, since we recently made changes to the fm10k 
code to stop assuming VLAN 1 in these cases. I believe it should just pass the 
default_vid

Thanks,
Jake

___
> Intel-wired-lan mailing list
> intel-wired-...@osuosl.org
> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan


RE: removing bridge in vlan_filtering mode requests delete of attached ports main MAC address

2017-11-02 Thread Keller, Jacob E
> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of Toshiaki Makita
> Sent: Thursday, November 02, 2017 2:23 AM
> To: Keller, Jacob E <jacob.e.kel...@intel.com>; netdev@vger.kernel.org
> Cc: vyase...@redhat.com; Malek, Patryk <patryk.ma...@intel.com>
> Subject: Re: removing bridge in vlan_filtering mode requests delete of 
> attached
> ports main MAC address
> 
> On 2017/11/02 7:25, Keller, Jacob E wrote:
> ...
> >> If we skip adding them, we cannot receive frames which should be
> >> received on the bridge device during non-promiscuous mode.
> >>
> >> --
> >> Toshiaki Makita
> >
> > This makes sense, but then what removes the addresses upon bridge deletion
> or exiting static mode?
> >
> > We want to make sure we remove the correct addresses but don't request a
> delete of the permanent MAC address? Or, do we just completely assume that a
> device will never actually delete it's own permanent address, and thus say 
> this is
> a driver's fault for allowing a delete request of its permanent address to do
> anything..?
> 
> We may be able to skip adding or deleting local address which is
> identical to dev_addr in bridge code.
> Having said that I feel like drivers should ensure not to remove their
> permanent address even when the same address is removed from the uc
> list, since currently it is not prohibited to do that kind of admin
> operation through bridge command (bridge fdb add|del self).
> Note that "bridge fdb ... self" is a command which modifies device's uc
> filter, not modify bridge's fdb entries.
> 
> --
> Toshiaki Makita   

Ok. I'll go ahead and cook a patch for preventing such a removal from deleting 
the permanent address from i40e. That sounds like the most reasonable approach 
given that from digging into other drivers, they don't store the permanent 
address in the regular UC table anyways.

Thanks,
Jake



RE: removing bridge in vlan_filtering mode requests delete of attached ports main MAC address

2017-11-01 Thread Keller, Jacob E
> -Original Message-
> From: Toshiaki Makita [mailto:makita.toshi...@lab.ntt.co.jp]
> Sent: Tuesday, October 31, 2017 5:58 PM
> To: Keller, Jacob E <jacob.e.kel...@intel.com>; vyase...@redhat.com;
> netdev@vger.kernel.org
> Cc: Malek, Patryk <patryk.ma...@intel.com>
> Subject: Re: removing bridge in vlan_filtering mode requests delete of 
> attached
> ports main MAC address
> 
> On 2017/11/01 9:10, Keller, Jacob E wrote:
> >> -Original Message-
> >> From: Keller, Jacob E
> >> Sent: Thursday, October 26, 2017 1:33 PM
> >> To: Keller, Jacob E <jacob.e.kel...@intel.com>; vyase...@redhat.com;
> >> netdev@vger.kernel.org
> >> Cc: Malek, Patryk <patryk.ma...@intel.com>
> >> Subject: RE: removing bridge in vlan_filtering mode requests delete of
> attached
> >> ports main MAC address
> >>
> >>> -Original Message-
> >>> From: netdev-ow...@vger.kernel.org [mailto:netdev-
> >> ow...@vger.kernel.org]
> >>> On Behalf Of Keller, Jacob E
> >>> Sent: Thursday, October 26, 2017 1:27 PM
> >>> To: vyase...@redhat.com; netdev@vger.kernel.org
> >>> Cc: Malek, Patryk <patryk.ma...@intel.com>
> >>> Subject: RE: removing bridge in vlan_filtering mode requests delete of
> attached
> >>> ports main MAC address
> >>>
> >>>> -Original Message-
> >>>> From: Vlad Yasevich [mailto:vyase...@redhat.com]
> >>>> Sent: Thursday, October 26, 2017 3:22 AM
> >>>> To: Keller, Jacob E <jacob.e.kel...@intel.com>; netdev@vger.kernel.org
> >>>> Cc: Malek, Patryk <patryk.ma...@intel.com>
> >>>> Subject: Re: removing bridge in vlan_filtering mode requests delete of
> >> attached
> >>>> ports main MAC address
> >>>>
> >>>> Hi Jake
> >>>>
> >>>> I think adding a !fdb->local should work.  local fdb contain the address 
> >>>> of
> >>> assigned
> >>>> to
> >>>> the ports of the bridge and those shouldn't be directly removed.
> >>>>
> >>>> If that works,  that looks like the right solution.
> >>>>
> >>>> -vlad
> >>>>
> >>>
> >>> So this does prevent us from removing the port's address. However, if I 
> >>> add
> >> two
> >>> devices to the bridge, then after removing the bridge, each device now
> keeps
> >>> both permanent addresses in their list, which isn't what we want is it?
> >>>
> >>> Do we even want to assign the local fdb addresses to every port?
> >>>
> >>> Obviously, I don't fully understand this code, so I think I'm missing 
> >>> something
> >>> here.
> >>>
> >>> Regards,
> >>> Jake
> >>>
> >>
> >> Ok, I tried this again, and it didn't end up crossing the local device 
> >> addresses to
> >> each port. I'm not sure how that happened the first time yet, so maybe it 
> >> is
> >> correct to skip removing local addresses... but if we skip removing them,
> wouldn't
> >> we want to skip adding them too?
> >>
> >> Thanks,
> >> Jake
> >
> > There's definitely some weirdness going on, because I've been able to get 
> > the
> local port addresses added to the wrong device under some circumstances. It
> seems to be some sort of race condition, since I can't reliably re-create the
> scenario.
> >
> > Either way, some more insight on what the correct fix here would be nice.
> >
> > I'm thinking we want to skip adding or removing local addresses when 
> > switching
> into the static mode configuration.
> 
> If we skip adding them, we cannot receive frames which should be
> received on the bridge device during non-promiscuous mode.
> 
> --
> Toshiaki Makita

This makes sense, but then what removes the addresses upon bridge deletion or 
exiting static mode?

We want to make sure we remove the correct addresses but don't request a delete 
of the permanent MAC address? Or, do we just completely assume that a device 
will never actually delete it's own permanent address, and thus say this is a 
driver's fault for allowing a delete request of its permanent address to do 
anything..?

Thanks,
Jake
 


RE: removing bridge in vlan_filtering mode requests delete of attached ports main MAC address

2017-10-31 Thread Keller, Jacob E
> -Original Message-
> From: Keller, Jacob E
> Sent: Thursday, October 26, 2017 1:33 PM
> To: Keller, Jacob E <jacob.e.kel...@intel.com>; vyase...@redhat.com;
> netdev@vger.kernel.org
> Cc: Malek, Patryk <patryk.ma...@intel.com>
> Subject: RE: removing bridge in vlan_filtering mode requests delete of 
> attached
> ports main MAC address
> 
> > -Original Message-
> > From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org]
> > On Behalf Of Keller, Jacob E
> > Sent: Thursday, October 26, 2017 1:27 PM
> > To: vyase...@redhat.com; netdev@vger.kernel.org
> > Cc: Malek, Patryk <patryk.ma...@intel.com>
> > Subject: RE: removing bridge in vlan_filtering mode requests delete of 
> > attached
> > ports main MAC address
> >
> > > -Original Message-
> > > From: Vlad Yasevich [mailto:vyase...@redhat.com]
> > > Sent: Thursday, October 26, 2017 3:22 AM
> > > To: Keller, Jacob E <jacob.e.kel...@intel.com>; netdev@vger.kernel.org
> > > Cc: Malek, Patryk <patryk.ma...@intel.com>
> > > Subject: Re: removing bridge in vlan_filtering mode requests delete of
> attached
> > > ports main MAC address
> > >
> > > Hi Jake
> > >
> > > I think adding a !fdb->local should work.  local fdb contain the address 
> > > of
> > assigned
> > > to
> > > the ports of the bridge and those shouldn't be directly removed.
> > >
> > > If that works,  that looks like the right solution.
> > >
> > > -vlad
> > >
> >
> > So this does prevent us from removing the port's address. However, if I add
> two
> > devices to the bridge, then after removing the bridge, each device now keeps
> > both permanent addresses in their list, which isn't what we want is it?
> >
> > Do we even want to assign the local fdb addresses to every port?
> >
> > Obviously, I don't fully understand this code, so I think I'm missing 
> > something
> > here.
> >
> > Regards,
> > Jake
> >
> 
> Ok, I tried this again, and it didn't end up crossing the local device 
> addresses to
> each port. I'm not sure how that happened the first time yet, so maybe it is
> correct to skip removing local addresses... but if we skip removing them, 
> wouldn't
> we want to skip adding them too?
> 
> Thanks,
> Jake

There's definitely some weirdness going on, because I've been able to get the 
local port addresses added to the wrong device under some circumstances. It 
seems to be some sort of race condition, since I can't reliably re-create the 
scenario.

Either way, some more insight on what the correct fix here would be nice.

I'm thinking we want to skip adding or removing local addresses when switching 
into the static mode configuration.

Thanks,
Jake


RE: removing bridge in vlan_filtering mode requests delete of attached ports main MAC address

2017-10-26 Thread Keller, Jacob E
> -Original Message-
> From: Keller, Jacob E
> Sent: Thursday, October 26, 2017 1:33 PM
> To: Keller, Jacob E <jacob.e.kel...@intel.com>; vyase...@redhat.com;
> netdev@vger.kernel.org
> Cc: Malek, Patryk <patryk.ma...@intel.com>
> Subject: RE: removing bridge in vlan_filtering mode requests delete of 
> attached
> ports main MAC address
> 
> > -Original Message-
> > From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org]
> > On Behalf Of Keller, Jacob E
> > Sent: Thursday, October 26, 2017 1:27 PM
> > To: vyase...@redhat.com; netdev@vger.kernel.org
> > Cc: Malek, Patryk <patryk.ma...@intel.com>
> > Subject: RE: removing bridge in vlan_filtering mode requests delete of 
> > attached
> > ports main MAC address
> >
> > > -Original Message-
> > > From: Vlad Yasevich [mailto:vyase...@redhat.com]
> > > Sent: Thursday, October 26, 2017 3:22 AM
> > > To: Keller, Jacob E <jacob.e.kel...@intel.com>; netdev@vger.kernel.org
> > > Cc: Malek, Patryk <patryk.ma...@intel.com>
> > > Subject: Re: removing bridge in vlan_filtering mode requests delete of
> attached
> > > ports main MAC address
> > >
> > > Hi Jake
> > >
> > > I think adding a !fdb->local should work.  local fdb contain the address 
> > > of
> > assigned
> > > to
> > > the ports of the bridge and those shouldn't be directly removed.
> > >
> > > If that works,  that looks like the right solution.
> > >
> > > -vlad
> > >
> >
> > So this does prevent us from removing the port's address. However, if I add
> two
> > devices to the bridge, then after removing the bridge, each device now keeps
> > both permanent addresses in their list, which isn't what we want is it?
> >
> > Do we even want to assign the local fdb addresses to every port?
> >
> > Obviously, I don't fully understand this code, so I think I'm missing 
> > something
> > here.
> >
> > Regards,
> > Jake
> >
> 
> Ok, I tried this again, and it didn't end up crossing the local device 
> addresses to
> each port. I'm not sure how that happened the first time yet, so maybe it is
> correct to skip removing local addresses... but if we skip removing them, 
> wouldn't
> we want to skip adding them too?
> 
> Thanks,
> Jake

I'm still digging into this. It turns out adding two devices, enabling vlan 
filtering, and deleting the bridge sometimes (but not always, not sure what 
condition triggers it) causes the hw address of one of the devices to be 
assigned to the other device.

I'm still unsure whether sync_static should be assigning local addresses to 
each device, but it appears like it should. In this case, I'm really unsure how 
to handle this case properly.

If we add local addresses, we need to delete the ones that aren't specific to 
that device so that after removing the bridge we end up in the original 
configuration.. but I'm not really sure how best to do this.

Using !fdb->is_local in unsync_static works to resolve my issue, but I believe 
it papers over other issues, since it means that we'll never delete static 
addresses when deleting the ports or exiting promiscuous mode.

I think checking fdb->dst might work, but that would break if we manually add a 
new address and tag is as permanent, see line 806 of br_fdb.c... In this case, 
we'd never delete this address even though it was not originally on the device.

I checked other drivers, and it turns out that at least one (ixgbe) doesn't 
have this problem because the hw address is special and isn't actually stored 
in a hardware MAC filter list. In i40e we keep the hardware address in the same 
list as all the other MAC filters.

We could "fix" this in i40e by treating the hw permanent address separately and 
essentially ignoring it from the dev_uc_del() calls.. but I still feel like 
this papers over the issues in the bridge code.

Any thoughts or suggestions? I haven't checked other drivers to see how they 
handle addresses in the unicast table (whether they treat the hw address as 
special or not, like ixgbe ultimately does).

Thanks,
Jake


RE: removing bridge in vlan_filtering mode requests delete of attached ports main MAC address

2017-10-26 Thread Keller, Jacob E
> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of Keller, Jacob E
> Sent: Thursday, October 26, 2017 1:27 PM
> To: vyase...@redhat.com; netdev@vger.kernel.org
> Cc: Malek, Patryk <patryk.ma...@intel.com>
> Subject: RE: removing bridge in vlan_filtering mode requests delete of 
> attached
> ports main MAC address
> 
> > -Original Message-
> > From: Vlad Yasevich [mailto:vyase...@redhat.com]
> > Sent: Thursday, October 26, 2017 3:22 AM
> > To: Keller, Jacob E <jacob.e.kel...@intel.com>; netdev@vger.kernel.org
> > Cc: Malek, Patryk <patryk.ma...@intel.com>
> > Subject: Re: removing bridge in vlan_filtering mode requests delete of 
> > attached
> > ports main MAC address
> >
> > Hi Jake
> >
> > I think adding a !fdb->local should work.  local fdb contain the address of
> assigned
> > to
> > the ports of the bridge and those shouldn't be directly removed.
> >
> > If that works,  that looks like the right solution.
> >
> > -vlad
> >
> 
> So this does prevent us from removing the port's address. However, if I add 
> two
> devices to the bridge, then after removing the bridge, each device now keeps
> both permanent addresses in their list, which isn't what we want is it?
> 
> Do we even want to assign the local fdb addresses to every port?
> 
> Obviously, I don't fully understand this code, so I think I'm missing 
> something
> here.
> 
> Regards,
> Jake
> 

Ok, I tried this again, and it didn't end up crossing the local device 
addresses to each port. I'm not sure how that happened the first time yet, so 
maybe it is correct to skip removing local addresses... but if we skip removing 
them, wouldn't we want to skip adding them too?

Thanks,
Jake


RE: removing bridge in vlan_filtering mode requests delete of attached ports main MAC address

2017-10-26 Thread Keller, Jacob E
> -Original Message-
> From: Vlad Yasevich [mailto:vyase...@redhat.com]
> Sent: Thursday, October 26, 2017 3:22 AM
> To: Keller, Jacob E <jacob.e.kel...@intel.com>; netdev@vger.kernel.org
> Cc: Malek, Patryk <patryk.ma...@intel.com>
> Subject: Re: removing bridge in vlan_filtering mode requests delete of 
> attached
> ports main MAC address
> 
> On 10/20/2017 08:06 PM, Keller, Jacob E wrote:
> >> -----Original Message-
> >> From: Keller, Jacob E
> >> Sent: Friday, October 20, 2017 10:23 AM
> >> To: netdev@vger.kernel.org
> >> Cc: Malek, Patryk <patryk.ma...@intel.com>; 'Vlad Yasevich'
> >> <vyase...@redhat.com>
> >> Subject: removing bridge in vlan_filtering mode requests delete of attached
> >> ports main MAC address
> >>
> >> Hi,
> >>
> >> We've run into an issue with bridges set in vlan_filtering mode. 
> >> Basically, if we
> >> attach a device to a bridge which has enabled vlan_filtering, and then 
> >> remove
> the
> >> bridge, we end up requesting the driver of the attached device to remove 
> >> its
> >> own MAC HW address.
> >>
> >> In i40e, at least, this causes the driver to actually delete such an 
> >> address and
> then
> >> it will no longer receive any traffic.
> >>
> >> To reproduce this:
> >>
> >> a) brctl addbr br0
> >> b) brctl addif br0 enp
> >> # enable vlan filtering
> >> c) echo 1 >/sys/class/net/br0/bridge/vlan_filtering
> >> d) brctl delbr br0
> >>
> >> Specifically this appears to happen because of how we automatically enter
> static
> >> configuration for routes when vlan_filtering is enabled, and we call
> >> br_fdb_unsync_static which will clear all the routes from the fdb table 
> >> for the
> >> device. See commit 2796d0c648c9 ("bridge: Automatically manage port
> >> promiscuous mode.", 2014-05-16) for more details.
> >>
> >> This happens to include the devices own default address, which results in 
> >> the
> >> bug.
> >>
> >> I'm not sure if this is a driver bug, or if it's a bug in the bridging 
> >> code.
> >>
> >> Who would know more about this and what to do about this?
> >>
> >> One obvious solution is to hard code the i40e device driver so that it 
> >> does not
> >> actually delete the HW address from the unicast filter list. This could 
> >> work, but
> >> seems to me like its papering over the problem. Is this just a known thing 
> >> that
> >> drivers should be aware of? I don't really know...
> >>
> >> An alternative solution would be to possibly ignore any fdb addresses which
> >> specifically target that port?
> >>
> >> Any ideas?
> >
> > For the record, adding a check to prevent unsync_static from removing
> addresses which are targetting the specific port does work to resolve this 
> specific
> issue, but I'm sure it's not the correct solution as I expect that would 
> cause other
> problems.
> >
> 
> Hi Jake
> 
> I think adding a !fdb->local should work.  local fdb contain the address of 
> assigned
> to
> the ports of the bridge and those shouldn't be directly removed.
> 
> If that works,  that looks like the right solution.
> 
> -vlad
> 

So this does prevent us from removing the port's address. However, if I add two 
devices to the bridge, then after removing the bridge, each device now keeps 
both permanent addresses in their list, which isn't what we want is it?

Do we even want to assign the local fdb addresses to every port?

Obviously, I don't fully understand this code, so I think I'm missing something 
here.

Regards,
Jake

> > Thanks,
> > Jake
> >
> >>
> >> Regards,
> >> Jake



RE: removing bridge in vlan_filtering mode requests delete of attached ports main MAC address

2017-10-26 Thread Keller, Jacob E
> -Original Message-
> From: Vlad Yasevich [mailto:vyase...@redhat.com]
> Sent: Thursday, October 26, 2017 3:22 AM
> To: Keller, Jacob E <jacob.e.kel...@intel.com>; netdev@vger.kernel.org
> Cc: Malek, Patryk <patryk.ma...@intel.com>
> Subject: Re: removing bridge in vlan_filtering mode requests delete of 
> attached
> ports main MAC address
> 
> On 10/20/2017 08:06 PM, Keller, Jacob E wrote:
> >> -----Original Message-
> >> From: Keller, Jacob E
> >> Sent: Friday, October 20, 2017 10:23 AM
> >> To: netdev@vger.kernel.org
> >> Cc: Malek, Patryk <patryk.ma...@intel.com>; 'Vlad Yasevich'
> >> <vyase...@redhat.com>
> >> Subject: removing bridge in vlan_filtering mode requests delete of attached
> >> ports main MAC address
> >>
> >> Hi,
> >>
> >> We've run into an issue with bridges set in vlan_filtering mode. 
> >> Basically, if we
> >> attach a device to a bridge which has enabled vlan_filtering, and then 
> >> remove
> the
> >> bridge, we end up requesting the driver of the attached device to remove 
> >> its
> >> own MAC HW address.
> >>
> >> In i40e, at least, this causes the driver to actually delete such an 
> >> address and
> then
> >> it will no longer receive any traffic.
> >>
> >> To reproduce this:
> >>
> >> a) brctl addbr br0
> >> b) brctl addif br0 enp
> >> # enable vlan filtering
> >> c) echo 1 >/sys/class/net/br0/bridge/vlan_filtering
> >> d) brctl delbr br0
> >>
> >> Specifically this appears to happen because of how we automatically enter
> static
> >> configuration for routes when vlan_filtering is enabled, and we call
> >> br_fdb_unsync_static which will clear all the routes from the fdb table 
> >> for the
> >> device. See commit 2796d0c648c9 ("bridge: Automatically manage port
> >> promiscuous mode.", 2014-05-16) for more details.
> >>
> >> This happens to include the devices own default address, which results in 
> >> the
> >> bug.
> >>
> >> I'm not sure if this is a driver bug, or if it's a bug in the bridging 
> >> code.
> >>
> >> Who would know more about this and what to do about this?
> >>
> >> One obvious solution is to hard code the i40e device driver so that it 
> >> does not
> >> actually delete the HW address from the unicast filter list. This could 
> >> work, but
> >> seems to me like its papering over the problem. Is this just a known thing 
> >> that
> >> drivers should be aware of? I don't really know...
> >>
> >> An alternative solution would be to possibly ignore any fdb addresses which
> >> specifically target that port?
> >>
> >> Any ideas?
> >
> > For the record, adding a check to prevent unsync_static from removing
> addresses which are targetting the specific port does work to resolve this 
> specific
> issue, but I'm sure it's not the correct solution as I expect that would 
> cause other
> problems.
> >
> 
> Hi Jake
> 
> I think adding a !fdb->local should work.  local fdb contain the address of 
> assigned
> to
> the ports of the bridge and those shouldn't be directly removed.
> 
> If that works,  that looks like the right solution.
> 
> -vlad
> 

I'll give this a shot, and if so, cook up a patch.

Thanks,
Jake

> > Thanks,
> > Jake
> >
> >>
> >> Regards,
> >> Jake



RE: removing bridge in vlan_filtering mode requests delete of attached ports main MAC address

2017-10-20 Thread Keller, Jacob E
> -Original Message-
> From: Keller, Jacob E
> Sent: Friday, October 20, 2017 10:23 AM
> To: netdev@vger.kernel.org
> Cc: Malek, Patryk <patryk.ma...@intel.com>; 'Vlad Yasevich'
> <vyase...@redhat.com>
> Subject: removing bridge in vlan_filtering mode requests delete of attached
> ports main MAC address
> 
> Hi,
> 
> We've run into an issue with bridges set in vlan_filtering mode. Basically, 
> if we
> attach a device to a bridge which has enabled vlan_filtering, and then remove 
> the
> bridge, we end up requesting the driver of the attached device to remove its
> own MAC HW address.
> 
> In i40e, at least, this causes the driver to actually delete such an address 
> and then
> it will no longer receive any traffic.
> 
> To reproduce this:
> 
> a) brctl addbr br0
> b) brctl addif br0 enp
> # enable vlan filtering
> c) echo 1 >/sys/class/net/br0/bridge/vlan_filtering
> d) brctl delbr br0
> 
> Specifically this appears to happen because of how we automatically enter 
> static
> configuration for routes when vlan_filtering is enabled, and we call
> br_fdb_unsync_static which will clear all the routes from the fdb table for 
> the
> device. See commit 2796d0c648c9 ("bridge: Automatically manage port
> promiscuous mode.", 2014-05-16) for more details.
> 
> This happens to include the devices own default address, which results in the
> bug.
> 
> I'm not sure if this is a driver bug, or if it's a bug in the bridging code.
> 
> Who would know more about this and what to do about this?
> 
> One obvious solution is to hard code the i40e device driver so that it does 
> not
> actually delete the HW address from the unicast filter list. This could work, 
> but
> seems to me like its papering over the problem. Is this just a known thing 
> that
> drivers should be aware of? I don't really know...
> 
> An alternative solution would be to possibly ignore any fdb addresses which
> specifically target that port?
> 
> Any ideas?

For the record, adding a check to prevent unsync_static from removing addresses 
which are targetting the specific port does work to resolve this specific 
issue, but I'm sure it's not the correct solution as I expect that would cause 
other problems.

Thanks,
Jake

> 
> Regards,
> Jake


removing bridge in vlan_filtering mode requests delete of attached ports main MAC address

2017-10-20 Thread Keller, Jacob E
Hi,

We've run into an issue with bridges set in vlan_filtering mode. Basically, if 
we attach a device to a bridge which has enabled vlan_filtering, and then 
remove the bridge, we end up requesting the driver of the attached device to 
remove its own MAC HW address.

In i40e, at least, this causes the driver to actually delete such an address 
and then it will no longer receive any traffic.

To reproduce this:

a) brctl addbr br0
b) brctl addif br0 enp
# enable vlan filtering
c) echo 1 >/sys/class/net/br0/bridge/vlan_filtering
d) brctl delbr br0

Specifically this appears to happen because of how we automatically enter 
static configuration for routes when vlan_filtering is enabled, and we call 
br_fdb_unsync_static which will clear all the routes from the fdb table for the 
device. See commit 2796d0c648c9 ("bridge: Automatically manage port promiscuous 
mode.", 2014-05-16) for more details.

This happens to include the devices own default address, which results in the 
bug.

I'm not sure if this is a driver bug, or if it's a bug in the bridging code.

Who would know more about this and what to do about this?

One obvious solution is to hard code the i40e device driver so that it does not 
actually delete the HW address from the unicast filter list. This could work, 
but seems to me like its papering over the problem. Is this just a known thing 
that drivers should be aware of? I don't really know...

An alternative solution would be to possibly ignore any fdb addresses which 
specifically target that port?

Any ideas?

Regards,
Jake


RE: [PATCH] i40e/i40evf: actually use u32 for feature flags

2017-10-11 Thread Keller, Jacob E
> -Original Message-
> From: Arnd Bergmann [mailto:a...@arndb.de]
> Sent: Wednesday, October 11, 2017 7:03 AM
> To: Kirsher, Jeffrey T <jeffrey.t.kirs...@intel.com>
> Cc: Arnd Bergmann <a...@arndb.de>; Keller, Jacob E
> <jacob.e.kel...@intel.com>; Duyck, Alexander H
> <alexander.h.du...@intel.com>; Williams, Mitch A
> <mitch.a.willi...@intel.com>; Sadowski, Filip <filip.sadow...@intel.com>; 
> intel-
> wired-...@lists.osuosl.org; netdev@vger.kernel.org; linux-
> ker...@vger.kernel.org
> Subject: [PATCH] i40e/i40evf: actually use u32 for feature flags
> 
> A previous cleanup intended to change the flags variable to 32
> bit instead of 64, but accidentally left out the important
> part of that change, leading to a build error:
> 
> drivers/net/ethernet/intel/i40e/i40e_ethtool.o: In function 
> `i40e_set_priv_flags':
> i40e_ethtool.c:(.text+0x1a94): undefined reference to `wrong_size_cmpxchg'
> 
> This adds the missing modification.
> 

Hah good eyes. I'm guessing this got messed up in a patch re-ordering.

Acked-by: Jacob Keller <jacob.e.kel...@intel.com>

> Fixes: b74f571f59a8 ("i40e/i40evf: organize and re-number feature flags")
> Signed-off-by: Arnd Bergmann <a...@arndb.de>
> ---
>  drivers/net/ethernet/intel/i40e/i40e.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h
> b/drivers/net/ethernet/intel/i40e/i40e.h
> index 18c453a3e728..7baf6d8a84dd 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
> @@ -424,7 +424,7 @@ struct i40e_pf {
>  #define I40E_HW_PORT_ID_VALIDBIT(17)
>  #define I40E_HW_RESTART_AUTONEG  BIT(18)
> 
> - u64 flags;
> + u32 flags;
>  #define I40E_FLAG_RX_CSUM_ENABLEDBIT(0)
>  #define I40E_FLAG_MSI_ENABLEDBIT(1)
>  #define I40E_FLAG_MSIX_ENABLED   BIT(2)
> --
> 2.9.0



RE: [PATCH] fm10k: mark PM functions as __maybe_unused

2017-10-11 Thread Keller, Jacob E


> -Original Message-
> From: Arnd Bergmann [mailto:a...@arndb.de]
> Sent: Wednesday, October 11, 2017 6:58 AM
> To: Kirsher, Jeffrey T <jeffrey.t.kirs...@intel.com>
> Cc: Arnd Bergmann <a...@arndb.de>; Keller, Jacob E
> <jacob.e.kel...@intel.com>; Kwan, Ngai-mint <ngai-mint.k...@intel.com>;
> David S. Miller <da...@davemloft.net>; Florian Westphal <f...@strlen.de>;
> intel-wired-...@lists.osuosl.org; netdev@vger.kernel.org; linux-
> ker...@vger.kernel.org
> Subject: [PATCH] fm10k: mark PM functions as __maybe_unused
> 
> A cleanup of the PM code left an incorrect #ifdef in place, leading
> to a harmless build warning:
> 
> drivers/net/ethernet/intel/fm10k/fm10k_pci.c:2502:12: error: 'fm10k_suspend'
> defined but not used [-Werror=unused-function]
> drivers/net/ethernet/intel/fm10k/fm10k_pci.c:2475:12: error: 'fm10k_resume'
> defined but not used [-Werror=unused-function]
> 
> It's easier to use __maybe_unused attributes here, since you
> can't pick the wrong one.
> 

Acked-by: Jacob Keller <jacob.e.kel...@intel.com>

> Fixes: 8249c47c6ba4 ("fm10k: use generic PM hooks instead of legacy PCIe power
> hooks")
> Signed-off-by: Arnd Bergmann <a...@arndb.de>
> ---
>  drivers/net/ethernet/intel/fm10k/fm10k_pci.c | 9 ++---
>  1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
> b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
> index 1e9ae3197b17..52f8eb3c470e 100644
> --- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
> +++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
> @@ -2463,7 +2463,6 @@ static int fm10k_handle_resume(struct fm10k_intfc
> *interface)
>   return err;
>  }
> 
> -#ifdef CONFIG_PM
>  /**
>   * fm10k_resume - Generic PM resume hook
>   * @dev: generic device structure
> @@ -2472,7 +2471,7 @@ static int fm10k_handle_resume(struct fm10k_intfc
> *interface)
>   * suspend or hibernation. This function does not need to handle lower PCIe
>   * device state as the stack takes care of that for us.
>   **/
> -static int fm10k_resume(struct device *dev)
> +static int __maybe_unused fm10k_resume(struct device *dev)
>  {
>   struct fm10k_intfc *interface = pci_get_drvdata(to_pci_dev(dev));
>   struct net_device *netdev = interface->netdev;
> @@ -2499,7 +2498,7 @@ static int fm10k_resume(struct device *dev)
>   * system suspend or hibernation. This function does not need to handle lower
>   * PCIe device state as the stack takes care of that for us.
>   **/
> -static int fm10k_suspend(struct device *dev)
> +static int __maybe_unused fm10k_suspend(struct device *dev)
>  {
>   struct fm10k_intfc *interface = pci_get_drvdata(to_pci_dev(dev));
>   struct net_device *netdev = interface->netdev;
> @@ -2511,8 +2510,6 @@ static int fm10k_suspend(struct device *dev)
>   return 0;
>  }
> 
> -#endif /* CONFIG_PM */
> -
>  /**
>   * fm10k_io_error_detected - called when PCI error is detected
>   * @pdev: Pointer to PCI device
> @@ -2643,11 +2640,9 @@ static struct pci_driver fm10k_driver = {
>   .id_table   = fm10k_pci_tbl,
>   .probe  = fm10k_probe,
>   .remove = fm10k_remove,
> -#ifdef CONFIG_PM
>   .driver = {
>   .pm = _pm_ops,
>   },
> -#endif /* CONFIG_PM */
>   .sriov_configure= fm10k_iov_configure,
>   .err_handler= _err_handler
>  };
> --
> 2.9.0



RE: [PATCH] i40e: mark PM functions as __maybe_unused

2017-10-10 Thread Keller, Jacob E


> -Original Message-
> From: Arnd Bergmann [mailto:a...@arndb.de]
> Sent: Tuesday, October 10, 2017 1:18 AM
> To: Kirsher, Jeffrey T <jeffrey.t.kirs...@intel.com>
> Cc: Arnd Bergmann <a...@arndb.de>; Keller, Jacob E
> <jacob.e.kel...@intel.com>; Williams, Mitch A <mitch.a.willi...@intel.com>;
> Duyck, Alexander H <alexander.h.du...@intel.com>; David S. Miller
> <da...@davemloft.net>; Brady, Alan <alan.br...@intel.com>; intel-wired-
> l...@lists.osuosl.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: [PATCH] i40e: mark PM functions as __maybe_unused
> 
> A cleanup of the PM code left an incorrect #ifdef in place, leading
> to a harmless build warning:
> 
> drivers/net/ethernet/intel/i40e/i40e_main.c:12223:12: error: 'i40e_resume'
> defined but not used [-Werror=unused-function]
> drivers/net/ethernet/intel/i40e/i40e_main.c:12185:12: error: 'i40e_suspend'
> defined but not used [-Werror=unused-function]
> 
> It's easier to use __maybe_unused attributes here, since you
> can't pick the wrong one.
> 

Sure.

Acked-by: Jacob Keller <jacob.e.kel...@intel.com>

> Fixes: 0e5d3da40055 ("i40e: use newer generic PM support instead of legacy PM
> callbacks")
> Signed-off-by: Arnd Bergmann <a...@arndb.de>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_main.c | 11 ++-
>  1 file changed, 2 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c
> b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index 60b11fdeca2d..eb091268bc3c 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -8370,7 +8370,6 @@ static int i40e_init_interrupt_scheme(struct i40e_pf
> *pf)
>   return 0;
>  }
> 
> -#ifdef CONFIG_PM
>  /**
>   * i40e_restore_interrupt_scheme - Restore the interrupt scheme
>   * @pf: private board data structure
> @@ -8419,7 +8418,6 @@ static int i40e_restore_interrupt_scheme(struct i40e_pf
> *pf)
> 
>   return err;
>  }
> -#endif /* CONFIG_PM */
> 
>  /**
>   * i40e_setup_misc_vector - Setup the misc vector to handle non queue events
> @@ -12177,12 +12175,11 @@ static void i40e_shutdown(struct pci_dev *pdev)
>   }
>  }
> 
> -#ifdef CONFIG_PM
>  /**
>   * i40e_suspend - PM callback for moving to D3
>   * @dev: generic device information structure
>   **/
> -static int i40e_suspend(struct device *dev)
> +static int __maybe_unused i40e_suspend(struct device *dev)
>  {
>   struct pci_dev *pdev = to_pci_dev(dev);
>   struct i40e_pf *pf = pci_get_drvdata(pdev);
> @@ -12220,7 +12217,7 @@ static int i40e_suspend(struct device *dev)
>   * i40e_resume - PM callback for waking up from D3
>   * @dev: generic device information structure
>   **/
> -static int i40e_resume(struct device *dev)
> +static int __maybe_unused i40e_resume(struct device *dev)
>  {
>   struct pci_dev *pdev = to_pci_dev(dev);
>   struct i40e_pf *pf = pci_get_drvdata(pdev);
> @@ -12252,8 +12249,6 @@ static int i40e_resume(struct device *dev)
>   return 0;
>  }
> 
> -#endif /* CONFIG_PM */
> -
>  static const struct pci_error_handlers i40e_err_handler = {
>   .error_detected = i40e_pci_error_detected,
>   .slot_reset = i40e_pci_error_slot_reset,
> @@ -12269,11 +12264,9 @@ static struct pci_driver i40e_driver = {
>   .id_table = i40e_pci_tbl,
>   .probe= i40e_probe,
>   .remove   = i40e_remove,
> -#ifdef CONFIG_PM
>   .driver   = {
>   .pm = _pm_ops,
>   },
> -#endif /* CONFIG_PM */
>   .shutdown = i40e_shutdown,
>   .err_handler = _err_handler,
>   .sriov_configure = i40e_pci_sriov_configure,
> --
> 2.9.0



RE: [net-next 14/15] i40e: ignore skb->xmit_more when deciding to set RS bit

2017-10-09 Thread Keller, Jacob E
> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of Alexander Duyck
> Sent: Monday, October 09, 2017 9:28 AM
> To: David Laight <david.lai...@aculab.com>
> Cc: Kirsher, Jeffrey T <jeffrey.t.kirs...@intel.com>; da...@davemloft.net;
> Keller, Jacob E <jacob.e.kel...@intel.com>; netdev@vger.kernel.org;
> nhor...@redhat.com; sassm...@redhat.com; jogre...@redhat.com
> Subject: Re: [net-next 14/15] i40e: ignore skb->xmit_more when deciding to set
> RS bit
> 
> On Mon, Oct 9, 2017 at 2:07 AM, David Laight <david.lai...@aculab.com> wrote:
> > From: Jeff Kirsher
> >> Sent: 06 October 2017 18:57
> >> From: Jacob Keller <jacob.e.kel...@intel.com>
> >>
> >> Since commit 6a7fded776a7 ("i40e: Fix RS bit update in Tx path and
> >> disable force WB workaround") we've tried to "optimize" setting the
> >> RS bit based around skb->xmit_more. This same logic was refactored
> >> in commit 1dc8b538795f ("i40e: Reorder logic for coalescing RS bits"),
> >> but ultimately was not functionally changed.
> >
> > I've tried to understand the above, but without definitions of WB
> > and RS and the '4-ness' of something it is quite difficult.
> >
> > I THINK this is all about telling the hardware there is a packet
> > in the ring to transmit?
> >
> > I don't understand the 4-ness. Linux requires that the hardware
> > be notified of a single packet transmit, and that the 'transmit
> > complete' also be notified in 'a timely manner' for a single packet.
> 
> So to clarify some of this. The RS is short for Report Status. It
> tells the hardware that when it has completed the descriptor it should
> trigger a write back, aka WB.
> 
> The 4-ness is because each descriptor is 16 bytes, so if we write back
> once every 4 descriptors that is one 64B cache line which is much
> better for most platforms performance wise.
> 
> > skb->xmit_more ought to be usable to optimise both of these
> > (assuming it isn't a lie).
> 
> Actually it leads to issues if we hold off for too long. If we are
> busy dequeueing a long list, and the Tx queue has gone empty then we
> are stalling Tx without need to do so. We need to be regularly
> notifying the device that it has buffers available to transmit which
> is what xmit_more is good for. However in addition the hardware needs
> to notify us regularly that there are buffers ready to clean which it
> isn't necessarily so useful for, especially if  are filling the entire
> ring and only providing one notification to the driver that there are
> buffers ready since we can defer the Tx interrupt for too long.
> 
> What Jake is trying to fix here is to prevent us from generating what
> looks like a square wave in terms of throughput. Essentially the
> current behavior that is being seen is that all the buffers in the
> ring either belong to software, or belong to hardware. It is best for
> us to have this split half and half so that the hardware has buffers
> to transmit and software has buffers to clean and enqueue new buffers
> onto.
> 

Yes this sums it up pretty well. The test case which found this bug is outlined 
in the description. Essentially we could see up to dozens or even almost a 
hundred packets in a row bunched up if we had a bunch of virtual devices 
layered on top of the network driver. This led to us not indicating to get 
status from the hardware about finished packets, which results in an increased 
delay to finish Tx.

> > The driver would need to ensure a ring full of messages isn't
> > generated without either wakeup - but that might be 128 frames.
> 
> That is what is happening here. Basically we are guaranteeing
> writebacks are occurring on a regular basis instead of in large
> bursts.
> 
> > FWIW on the systems I have PCIe writes are relatively cheap
> > (reads are slow). So different counts would be appropriate
> > for delaying doorbell writes and requesting completion interrupts.
> >
> > David
> 
> The doorbell writes are still being delayed the same amount with this
> patch. The only thing that was split out was the device write backs
> are now occurring on a more regular basis instead of being held off
> until a much larger set is handled.
> 

Exactly. The tail bumps are still functioning the same, but it's the hardware 
report status requests which are not delayed. Note that there's some further 
quirks in how our hardware performs write backs which exacerbate the problem 
because of how the cachelines are setup so that the delay is even worse than we 
would expect.

Thanks,
Jake

> - Alex


RE: [PATCH 0/2] i40e: fix firmware update

2017-09-01 Thread Keller, Jacob E


> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of Stefan Assmann
> Sent: Friday, September 01, 2017 7:03 AM
> To: intel-wired-...@lists.osuosl.org
> Cc: netdev@vger.kernel.org; da...@davemloft.net; Kirsher, Jeffrey T
> ; sassm...@kpanic.de
> Subject: [PATCH 0/2] i40e: fix firmware update
> 
> The first patch fixes the firmware update which is currently broken and
> results in a bad flash (corrupt firmware). Recovery is possible with a
> fixed driver.
> The second patch reverts a commit that causes the firmware checksum
> verification to fail right after a successful flash. This is related to
> a recent workqueue change. Haven't gotten to the bottom of this yet, but
> for the sake of a smooth firmware update experience let's revert the
> commit for now.

Hi Stefan,

Thanks for these patches, I apologize for the time it took for us to respond to 
this. 

The first patch is functionally correct, and I'm surprised we missed sending an 
equivalent ourselves. It looks like some related changes occurred around this 
code, and we failed to submit the patch.

I think Jeff would prefer if we send the version based directly on the 
out-of-tree code, which I will be reviving and submitting shortly.

The second issue I believe is not fixed correctly by the patch, I'm unsure why 
exactly changing the WQ would cause this but I believe that a similar patch 
which creates a non-locked version of i40e_nvm_read_buffer() will resolve this, 
and I will be sending that patch as well, which I believe is the real fix 
versus halting the work queue.

Thanks,
Jake

> 
> Stefan Assmann (2):
>   i40e: use non-locking i40e_read_nvm_word() function during nvmupdate
>   Revert "i40e: remove WQ_UNBOUND and the task limit of our workqueue"
> 
>  drivers/net/ethernet/intel/i40e/i40e_main.c | 12 +---
>  drivers/net/ethernet/intel/i40e/i40e_nvm.c  | 24 ++--
>  2 files changed, 27 insertions(+), 9 deletions(-)
> 
> --
> 2.13.5




RE: [RFC PATCH] net: limit maximum number of packets to mark with xmit_more

2017-08-28 Thread Keller, Jacob E
> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On
> Behalf Of Jakub Kicinski
> Sent: Friday, August 25, 2017 12:34 PM
> To: Keller, Jacob E <jacob.e.kel...@intel.com>
> Cc: netdev@vger.kernel.org
> Subject: Re: [RFC PATCH] net: limit maximum number of packets to mark with
> xmit_more
> 
> On Fri, 25 Aug 2017 08:24:49 -0700, Jacob Keller wrote:
> > Under some circumstances, such as with many stacked devices, it is
> > possible that dev_hard_start_xmit will bundle many packets together, and
> > mark them all with xmit_more.
> 
> Excuse my ignorance but what are those stacked devices?  Could they
> perhaps be fixed somehow?  My intuition was that long xmit_more
> sequences can only happen if NIC and/or BQL are back pressuring, and
> therefore we shouldn't be seeing a long xmit_more "train" arriving at
> an empty device ring...

a veth device connecting a VM to the host, then connected to a bridge, which is 
connected to a vlan interface connected to a bond, which is hooked in 
active-backup to a physical device.

Sorry if I don't really know the correct way to refer to these, I just think of 
them as devices stacked on top of each other.

During root cause investigation I found that we (the i40e driver) sometimes 
received up to 100 or more SKBs in a row with xmit_more set. We were 
incorrectly also using xmit_more as a hint for not marking packets to get 
writebacks, which caused significant throughput issues. Additionally there was 
concern that that many packets in a row without a tail bump would cause latency 
issues, so I thought maybe it was best to simply guarantee that the stack 
didn't send us too many packets marked with xmit more at once.

It seems based on discussion that it should be up to the driver to determine 
exactly how to handle the xmit_more hint and to determine when it actually 
isn't helpful or not, so I do not think this patch makes sense now.

Thanks,
Jake 


RE: [RFC PATCH] net: limit maximum number of packets to mark with xmit_more

2017-08-28 Thread Keller, Jacob E
> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On
> Behalf Of Alexander Duyck
> Sent: Friday, August 25, 2017 3:34 PM
> To: Stephen Hemminger <step...@networkplumber.org>
> Cc: Waskiewicz Jr, Peter <peter.waskiewicz...@intel.com>; Keller, Jacob E
> <jacob.e.kel...@intel.com>; netdev@vger.kernel.org
> Subject: Re: [RFC PATCH] net: limit maximum number of packets to mark with
> xmit_more
> 
> On Fri, Aug 25, 2017 at 8:58 AM, Stephen Hemminger
> <step...@networkplumber.org> wrote:
> > On Fri, 25 Aug 2017 15:36:22 +
> > "Waskiewicz Jr, Peter" <peter.waskiewicz...@intel.com> wrote:
> >
> >> On 8/25/17 11:25 AM, Jacob Keller wrote:
> >> > Under some circumstances, such as with many stacked devices, it is
> >> > possible that dev_hard_start_xmit will bundle many packets together, and
> >> > mark them all with xmit_more.
> >> >
> >> > Most drivers respond to xmit_more by skipping tail bumps on packet
> >> > rings, or similar behavior as long as xmit_more is set. This is
> >> > a performance win since it means drivers can avoid notifying hardware of
> >> > new packets repeat daily, and thus avoid wasting unnecessary PCIe or 
> >> > other
> >> > bandwidth.
> >> >
> >> > This use of xmit_more comes with a trade off because bundling too many
> >> > packets can increase latency of the Tx packets. To avoid this, we should
> >> > limit the maximum number of packets with xmit_more.
> >> >
> >> > Driver authors could modify their drivers to check for some determined
> >> > limit, but this requires all drivers to be modified in order to gain
> >> > advantage.
> >> >
> >> > Instead, add a sysctl "xmit_more_max" which can be used to configure the
> >> > maximum number of xmit_more skbs to send in a sequence. This ensures
> >> > that all drivers benefit, and allows system administrators the option to
> >> > tune the value to their environment.
> >> >
> >> > Signed-off-by: Jacob Keller <jacob.e.kel...@intel.com>
> >> > ---
> >> >
> >> > Stray thoughts and further questions
> >> >
> >> > Is this the right approach? Did I miss any other places where we should
> >> > limit? Does the limit make sense? Should it instead be a per-device
> >> > tuning nob instead of a global? Is 32 a good default?
> >>
> >> I actually like the idea of a per-device knob.  A xmit_more_max that's
> >> global in a system with 1GbE devices along with a 25/50GbE or more just
> >> doesn't make much sense to me.  Or having heterogeneous vendor devices
> >> in the same system that have different HW behaviors could mask issues
> >> with latency.
> >>
> >> This seems like another incarnation of possible buffer-bloat if the max
> >> is too high...
> >>
> >> >
> >> >   Documentation/sysctl/net.txt |  6 ++
> >> >   include/linux/netdevice.h|  2 ++
> >> >   net/core/dev.c   | 10 +-
> >> >   net/core/sysctl_net_core.c   |  7 +++
> >> >   4 files changed, 24 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
> >> > index b67044a2575f..3d995e8f4448 100644
> >> > --- a/Documentation/sysctl/net.txt
> >> > +++ b/Documentation/sysctl/net.txt
> >> > @@ -230,6 +230,12 @@ netdev_max_backlog
> >> >   Maximum number  of  packets,  queued  on  the  INPUT  side, when the
> interface
> >> >   receives packets faster than kernel can process them.
> >> >
> >> > +xmit_more_max
> >> > +-
> >> > +
> >> > +Maximum number of packets in a row to mark with skb->xmit_more. A value
> of zero
> >> > +indicates no limit.
> >>
> >> What defines "packet?"  MTU-sized packets, or payloads coming down from
> >> the stack (e.g. TSO's)?
> >
> > xmit_more is only a hint to the device. The device driver should ignore it 
> > unless
> > there are hardware advantages. The device driver is the place with HW 
> > specific
> > knowledge (like 4 Tx descriptors is equivalent to one PCI transaction on 
> > this
> device).
> >
> > Anything that pushes that optimization out to the user is only useful for
> benchma

RE: [RFC PATCH] net: limit maximum number of packets to mark with xmit_more

2017-08-25 Thread Keller, Jacob E
> -Original Message-
> From: Stephen Hemminger [mailto:step...@networkplumber.org]
> Sent: Friday, August 25, 2017 8:58 AM
> To: Waskiewicz Jr, Peter <peter.waskiewicz...@intel.com>
> Cc: Keller, Jacob E <jacob.e.kel...@intel.com>; netdev@vger.kernel.org
> Subject: Re: [RFC PATCH] net: limit maximum number of packets to mark with
> xmit_more
> 
> xmit_more is only a hint to the device. The device driver should ignore it 
> unless
> there are hardware advantages. The device driver is the place with HW specific
> knowledge (like 4 Tx descriptors is equivalent to one PCI transaction on this
> device).
> 
> Anything that pushes that optimization out to the user is only useful for
> benchmarks
> and embedded devices.

Right so most drivers I've seen simply take it as a "avoid bumping tail of a 
ring" whenever they see xmit_more. But unfortunately in some circumstances, 
this results in potentially several hundred packets being set with xmit_more in 
a row, and then the driver doesn't bump the tail for a long time, resulting in 
high latency spikes..

I was trying to find a way to fix this potentially in multiple drivers, rather 
than just a single driver, since I figured the same sort of code might need to 
be needed.

So you're suggesting we should just perform some check in the device driver, 
even if it might be duplication?

We could also instead make it a setting in the netdev struct or something which 
would be set by the driver and then tell stack code to limit how many it sends 
at once (so that we don't need to duplicate that checking code in every driver?)

Thanks,
Jake


RE: [PATCH v2] i40e/i40evf: fix out-of-bounds read of cpumask

2017-08-22 Thread Keller, Jacob E
> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On
> Behalf Of Stefano Brivio
> Sent: Tuesday, August 22, 2017 2:24 PM
> To: Keller, Jacob E <jacob.e.kel...@intel.com>
> Cc: Intel Wired LAN <intel-wired-...@lists.osuosl.org>; 
> netdev@vger.kernel.org;
> sta...@vger.kernel.org; Juergen Gross <jgr...@suse.com>
> Subject: Re: [PATCH v2] i40e/i40evf: fix out-of-bounds read of cpumask
> 
> [Fixed Cc: address for stable, Cc'ed Juergen]
> 
> On Tue, 22 Aug 2017 14:04:42 -0700
> Jacob Keller <jacob.e.kel...@intel.com> wrote:
> 
> > When responding to an affinity hint we directly copied a cpumask value,
> > intsead of using cpumask_copy. According to cpumask.h this is not
> > correct because cpumask_t is only guaranteed to have enough space for
> > the number of CPUs in the system, and may not be as big as we expect.
> > Thus a direct copy results in an out-of-bound read and potentially
> > a crash if the pages are aligned just right. This will be easily
> > detected on a kernel with KASAN enabled:
> 
> I still think commit message of my patch
> (ae9c9586f61e914dc1c6fe2e6ac1fb2bf07283bc.1502792828.git.sbri...@redhat.co
> m)
> was perhaps a bit clearer, but okay, this is also clear, fair enough.
> 
> > KASAN reports:
> > [   25.242312] BUG: KASAN: slab-out-of-bounds in
> i40e_irq_affinity_notify+0x30/0x50 [i40e] at addr 880462eea960
> [...]
> > [   25.242597]
> ==
> 
> This is also taken from my message, not terribly happy about it
> (and still happier with it than without). Fair enough, whatever it
> takes to get this applied as soon as possible...
> 
> > Fixes: 96db776a3682 ("i40e/i40evf: fix interrupt affinity bug", 2016-09-14)
> > Signed-off-by: Jacob Keller <jacob.e.kel...@intel.com>
> > Cc: sta...@vger.kernel.org # 4.10+
> 
> FWIW,
> 
> Acked-by: Stefano Brivio <sbri...@redhat.com>
> 

I don't really care which message gets applied either, as long as we get it 
fixed. Either patch is fine with me.

Thanks,
Jake

> 
> --
> Stefano


RE: [PATCH v2] i40e/i40evf: fix out-of-bounds read of cpumask

2017-08-22 Thread Keller, Jacob E
> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On
> Behalf Of Jacob Keller
> Sent: Tuesday, August 22, 2017 2:05 PM
> To: Intel Wired LAN <intel-wired-...@lists.osuosl.org>
> Cc: netdev@vger.kernel.org; Keller, Jacob E <jacob.e.kel...@intel.com>;
> sta...@vger.kernel.org#4.10+
> Subject: [PATCH v2] i40e/i40evf: fix out-of-bounds read of cpumask
> 
> When responding to an affinity hint we directly copied a cpumask value,
> intsead of using cpumask_copy. According to cpumask.h this is not
> correct because cpumask_t is only guaranteed to have enough space for
> the number of CPUs in the system, and may not be as big as we expect.
> Thus a direct copy results in an out-of-bound read and potentially
> a crash if the pages are aligned just right. This will be easily
> detected on a kernel with KASAN enabled:
> 
> KASAN reports:
> [   25.242312] BUG: KASAN: slab-out-of-bounds in
> i40e_irq_affinity_notify+0x30/0x50 [i40e] at addr 880462eea960
> [   25.242315] Read of size 1024 by task kworker/2:1/170
> [   25.242322] CPU: 2 PID: 170 Comm: kworker/2:1 Not tainted 4.11.0-
> 22.el7a.x86_64 #1
> [   25.242325] Hardware name: HP ProLiant DL380 Gen9, BIOS P89 05/06/2015
> [   25.242336] Workqueue: events irq_affinity_notify
> [   25.242340] Call Trace:
> [   25.242350]  dump_stack+0x63/0x8d
> [   25.242358]  kasan_object_err+0x21/0x70
> [   25.242364]  kasan_report+0x288/0x540
> [   25.242397]  ? i40e_irq_affinity_notify+0x30/0x50 [i40e]
> [   25.242403]  check_memory_region+0x13c/0x1a0
> [   25.242408]  __asan_loadN+0xf/0x20
> [   25.242440]  i40e_irq_affinity_notify+0x30/0x50 [i40e]
> [   25.242446]  irq_affinity_notify+0x1b4/0x230
> [   25.242452]  ? irq_set_affinity_notifier+0x130/0x130
> [   25.242457]  ? kasan_slab_free+0x89/0xc0
> [   25.242466]  process_one_work+0x32f/0x6f0
> [   25.242472]  worker_thread+0x89/0x770
> [   25.242481]  ? pci_mmcfg_check_reserved+0xc0/0xc0
> [   25.242488]  kthread+0x18c/0x1e0
> [   25.242493]  ? process_one_work+0x6f0/0x6f0
> [   25.242499]  ? kthread_create_on_node+0xc0/0xc0
> [   25.242506]  ret_from_fork+0x2c/0x40
> [   25.242511] Object at 880462eea960, in cache kmalloc-8 size: 8
> [   25.242513] Allocated:
> [   25.242514] PID = 170
> [   25.242522]  save_stack_trace+0x1b/0x20
> [   25.242529]  save_stack+0x46/0xd0
> [   25.242533]  kasan_kmalloc+0xad/0xe0
> [   25.242537]  __kmalloc_node+0x12c/0x2b0
> [   25.242542]  alloc_cpumask_var_node+0x3c/0x60
> [   25.242546]  alloc_cpumask_var+0xe/0x10
> [   25.242550]  irq_affinity_notify+0x94/0x230
> [   25.242555]  process_one_work+0x32f/0x6f0
> [   25.242559]  worker_thread+0x89/0x770
> [   25.242564]  kthread+0x18c/0x1e0
> [   25.242568]  ret_from_fork+0x2c/0x40
> [   25.242569] Freed:
> [   25.242570] PID = 0
> [   25.242572] (stack is not available)
> [   25.242573] Memory state around the buggy address:
> [   25.242578]  880462eea800: fc fc 00 fc fc 00 fc fc 00 fc fc 00 fc fc 
> fb fc
> [   25.242582]  880462eea880: fc fb fc fc fb fc fc 00 fc fc 00 fc fc 00 
> fc fc
> [   25.242586] >880462eea900: 00 fc fc 00 fc fc 00 fc fc fb fc fc 00 fc 
> fc fc
> [   25.242588]   ^
> [   25.242592]  880462eea980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc 
> fc fc
> [   25.242596]  880462eeaa00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc 
> fc fc
> [   25.242597]
> ==
> 
> Fixes: 96db776a3682 ("i40e/i40evf: fix interrupt affinity bug", 2016-09-14)
> Signed-off-by: Jacob Keller <jacob.e.kel...@intel.com>
> Cc: sta...@vger.kernel.org # 4.10+
> ---
> This updates the commit message for the original fix, and indicates that
> it fixes a potential crash, as well as tagged the commit for stable and
> added a Fixes to indicate which commit this fixes.
> 

I should have noted, I changed the title to be more accurate as well, this is a 
v2 of https://patchwork.ozlabs.org/patch/787388/



RE: [net-next 08/15] i40e/i40evf: organize and re-number feature flags

2017-08-14 Thread Keller, Jacob E


> -Original Message-
> From: David Miller [mailto:da...@davemloft.net]
> Sent: Saturday, August 12, 2017 1:04 PM
> To: Kirsher, Jeffrey T <jeffrey.t.kirs...@intel.com>
> Cc: Keller, Jacob E <jacob.e.kel...@intel.com>; netdev@vger.kernel.org;
> nhor...@redhat.com; sassm...@redhat.com; jogre...@redhat.com
> Subject: Re: [net-next 08/15] i40e/i40evf: organize and re-number feature 
> flags
> 
> From: Jeff Kirsher <jeffrey.t.kirs...@intel.com>
> Date: Sat, 12 Aug 2017 04:08:41 -0700
> 
> > Also ensure that the flags variable is actually a u64 to guarantee
> > 64bits of space on all architectures.
> 
> Why?  You don't need 64-bits, you only need 27.
> 
> This will be unnecessarily expensive on 32-bit platforms.
> 
> Please don't do this.

I suppose a better method would be to switch to using a declare_bitmap instead, 
so that it automatically sizes based on the number of flags we have. The reason 
we chose 64bits is because we will add flags in the future, as we originally 
had more than 32 flags prior to this patch until we moved some into a separate 
field.

But now that I think about it, using DECLARE_BITMAP makes more sense, though 
it's a bit more invasive of the code.

Thanks,
Jake


RE: [RFC PATCH] net: don't set __LINK_STATE_START until after dev->open() call

2017-08-09 Thread Keller, Jacob E
> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On
> Behalf Of David Miller
> Sent: Tuesday, August 08, 2017 6:17 PM
> To: Keller, Jacob E <jacob.e.kel...@intel.com>
> Cc: netdev@vger.kernel.org
> Subject: Re: [RFC PATCH] net: don't set __LINK_STATE_START until after 
> dev->open()
> call
> 
> From: Jacob Keller <jacob.e.kel...@intel.com>
> Date: Mon,  7 Aug 2017 15:24:21 -0700
> > I am wondering whether the proposed change is acceptable here, or
> > whether some ndo_open handlers rely on __LINK_STATE_START being true
> > prior to their being called?
> 
> I think this has the potential to break a bunch of drivers, but I
> cannot prove this.
> 
> A lot of drivers have several pieces of state setup when they bring
> the device up.  And these routines are also invoked from other code
> paths like suspend/resume, PCI-E error recovery, etc. and they
> probably do netif_running() calls here and there.
> 
> This behavior has been this way for a very long time, so the risk is
> quite high I think.

That's what I am worried about. However, I think there are problems with 
leaving it. A lot of drivers rely on netif_running() to determine whether or 
not the device is open, but they may be using it relying on all the changes 
caused by the .ndo_open() handler are finished. The current system there is a 
race, since you set the __LINK_STATE_START before .ndo_open is called.

This is what I found when attempting to debug a race in i40evf_open and 
i40evf_reset. The reset ended up running at the same time as open and the two 
flows collided because reset relied on netif_running() under the assumption 
that it would not return true until the device was actually open. However, it 
was running at the same time as open was, so it was trying to shutdown things 
at the same time as the open call was trying to bring them up.

Any location which uses rtnl_lock() will be safe, since the dev_open call is 
under rtnl_lock() so all callers of netif_running() under lock are serialized 
to see only the flag before or after dev_open completes. This is the majority 
of the callers I think.

Unfortunately, I can't really hold the rtnl_lock() during reset, since this 
caused a deadlock when I tried it due to lock ordering problems. I'm not sure 
if I could fix that or not.

I think there are other places which are incorrect, but haven't yet run into an 
issue. The only place I can see where the functionality might be changed for 
the worse is if a .dev_open handler actually relies on checking netif_running() 
during its call, which seems incredibly silly to me

Thanks,
Jake

P.S. I apologize for the typo in this patch, if there is more discussion I can 
send a v2 which fixes it.


RE: RXFH manual configuration

2017-06-23 Thread Keller, Jacob E


> -Original Message-
> From: Tariq Toukan [mailto:tar...@mellanox.com]
> Sent: Sunday, June 18, 2017 1:32 AM
> To: Keller, Jacob E <jacob.e.kel...@intel.com>
> Cc: Linux Kernel Network Developers <netdev@vger.kernel.org>; Saeed
> Mahameed <sae...@mellanox.com>; Eran Ben Elisha
> <era...@mellanox.com>
> Subject: RXFH manual configuration
> 
> Hi Jacob,
> 
> I am looking at your patch:
> d4ab4286276f ethtool: correctly ensure {GS}CHANNELS doesn't conflict ...
> 
> I wonder - if I want to configure the number of channels (ethtool -L),
> without being aware to the history of indirection table (manually set or
> not), how can I know what is the expected behavior?
> I should be able, from userspace, to query the value of
> netif_is_rxfh_configured() in order to decide whether the outcome of an
> ethtool -L command is correct or not.
> 
> Do you know if this indication exists?
> 
> Thanks,
> Tariq

I mis-spoke earlier, this is set in priv_flags and is thus not shared with 
userspace. I'm not sure if there is any current way to tell the status.

Thanks,
Jake


RE: RXFH manual configuration

2017-06-23 Thread Keller, Jacob E
> -Original Message-
> From: Tariq Toukan [mailto:tar...@mellanox.com]
> Sent: Sunday, June 18, 2017 1:32 AM
> To: Keller, Jacob E <jacob.e.kel...@intel.com>
> Cc: Linux Kernel Network Developers <netdev@vger.kernel.org>; Saeed
> Mahameed <sae...@mellanox.com>; Eran Ben Elisha
> <era...@mellanox.com>
> Subject: RXFH manual configuration
> 
> Hi Jacob,
> 
> I am looking at your patch:
> d4ab4286276f ethtool: correctly ensure {GS}CHANNELS doesn't conflict ...
> 
> I wonder - if I want to configure the number of channels (ethtool -L),
> without being aware to the history of indirection table (manually set or
> not), how can I know what is the expected behavior?
> I should be able, from userspace, to query the value of
> netif_is_rxfh_configured() in order to decide whether the outcome of an
> ethtool -L command is correct or not.
> 
> Do you know if this indication exists?
> 
> Thanks,
> Tariq

Hi,

I believe you can check the flags value, from /sys/class/net//flags, 
though you'll have to manually parse that. I don't think iproute2 suite shows 
this flag, but i suspect it could be modified to do so. Specifically you check 
for IFF_RXFH_CONFIGURED

Thanks,
Jake


RE: [Intel-wired-lan] [i40e] regression on TCP stream and TCP maerts, kernel-4.12.0-0.rc2

2017-06-09 Thread Keller, Jacob E


> -Original Message-
> From: Alexander Duyck [mailto:alexander.du...@gmail.com]
> Sent: Friday, June 09, 2017 12:59 PM
> To: Adrian Tomasov <atoma...@redhat.com>; Kirsher, Jeffrey T
> <jeffrey.t.kirs...@intel.com>; Keller, Jacob E <jacob.e.kel...@intel.com>
> Cc: Duyck, Alexander H <alexander.h.du...@intel.com>; osab...@redhat.com;
> netdev@vger.kernel.org; aokul...@redhat.com; intel-wired-...@lists.osuosl.org;
> jhla...@redhat.com
> Subject: Re: [Intel-wired-lan] [i40e] regression on TCP stream and TCP maerts,
> kernel-4.12.0-0.rc2
> 
> On Fri, Jun 9, 2017 at 3:34 AM, Adrian Tomasov <atoma...@redhat.com> wrote:
> > On Thu, 2017-06-01 at 19:18 +, Duyck, Alexander H wrote:
> >> On Thu, 2017-06-01 at 12:14 +0200, Adrian Tomasov wrote:
> >> >
> >> > On Wed, 2017-05-31 at 14:42 -0700, Alexander Duyck wrote:
> >> > >
> >> > >
> >> > > On Wed, May 31, 2017 at 6:48 AM, Adrian Tomasov <atomasov@redhat.
> >> > > com>
> >> > > wrote:
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Tue, 2017-05-30 at 18:27 -0700, Alexander Duyck wrote:
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > On Tue, May 30, 2017 at 8:41 AM, Alexander Duyck
> >> > > > > <alexander.du...@gmail.com> wrote:
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > On Tue, May 30, 2017 at 6:43 AM, Adam Okuliar <aokuliar@red
> >> > > > > > hat.
> >> > > > > > com>
> >> > > > > > wrote:
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > Hello,
> >> > > > > > >
> >> > > > > > > we found regression on intel card(XL710) with i40e
> >> > > > > > > driver.
> >> > > > > > > Regression is
> >> > > > > > > about ~45%
> >> > > > > > > on TCP_STREAM and TCP_MAERTS test for IPv4 and IPv6.
> >> > > > > > > Regression
> >> > > > > > > was first
> >> > > > > > > visible in kernel-4.12.0-0.rc1.
> >> > > > > > >
> >> > > > > > > More details about results you can see in uploaded images
> >> > > > > > > in
> >> > > > > > > bugzilla. [0]
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > [0] https://bugzilla.kernel.org/show_bug.cgi?id=195923
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > Best regards, / S pozdravom,
> >> > > > > > >
> >> > > > > > > Adrián Tomašov
> >> > > > > > > Kernel Performance QE
> >> > > > > > > atoma...@redhat.com
> >> > > > > >
> >> > > > > > I have added the i40e driver maintainer and the intel-
> >> > > > > > wired-lan
> >> > > > > > mailing list so that we can make are developers aware of
> >> > > > > > the
> >> > > > > > issue.
> >> > > > > >
> >> > > > > > Thanks.
> >> > > > > >
> >> > > > > > - Alex
> >> > > > >
> >> > > > > Adam,
> >> > > > >
> >> > > > > We are having some issues trying to reproduce what you
> >> > > > > reported.
> >> > > > >
> >> > > > > Can you provide some additional data. Specifically we would
> >> > > > > be
> >> > > > > looking
> >> > > > > for an "ethtool -i", and an "ethtool -S" for the port before
> >> > > > > and
> >> > > > > after
> >> > > > > the test. If you can attach it to the bugzilla that would be
> >> > > > > appreciated.
> >> > > > >
> >> > 

RE: [PATCH net-next] i40evf: hide unused variable

2017-04-19 Thread Keller, Jacob E
> -Original Message-
> From: Arnd Bergmann [mailto:a...@arndb.de]
> Sent: Wednesday, April 19, 2017 10:30 AM
> To: Kirsher, Jeffrey T <jeffrey.t.kirs...@intel.com>
> Cc: Arnd Bergmann <a...@arndb.de>; Pujari, Bimmy
> <bimmy.puj...@intel.com>; Duyck, Alexander H
> <alexander.h.du...@intel.com>; Williams, Mitch A
> <mitch.a.willi...@intel.com>; Keller, Jacob E <jacob.e.kel...@intel.com>; 
> Brady,
> Alan <alan.br...@intel.com>; Joe Perches <j...@perches.com>; Singhai, Anjali
> <anjali.sing...@intel.com>; Brandeburg, Jesse <jesse.brandeb...@intel.com>;
> Banala, Preethi <preethi.ban...@intel.com>; intel-wired-...@lists.osuosl.org;
> netdev@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: [PATCH net-next] i40evf: hide unused variable
> 
> On architectures with larger pages, we get a warning about an unused variable:
> 
> drivers/net/ethernet/intel/i40evf/i40evf_main.c: In function
> 'i40evf_configure_rx':
> drivers/net/ethernet/intel/i40evf/i40evf_main.c:690:21: error: unused variable
> 'netdev' [-Werror=unused-variable]
> 
> This moves the declaration into the #ifdef to avoid the warning.
> 

Makes sense.

Acked-by: Jacob Keller <jacob.e.kel...@intel.com>

> Fixes: dab86afdbbd1 ("i40e/i40evf: Change the way we limit the maximum frame
> size for Rx")
> Signed-off-by: Arnd Bergmann <a...@arndb.de>
> ---
>  drivers/net/ethernet/intel/i40evf/i40evf_main.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
> b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
> index 12a930e879af..1bb13c864edd 100644
> --- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
> +++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
> @@ -687,13 +687,14 @@ static void i40evf_configure_tx(struct i40evf_adapter
> *adapter)
>  static void i40evf_configure_rx(struct i40evf_adapter *adapter)
>  {
>   unsigned int rx_buf_len = I40E_RXBUFFER_2048;
> - struct net_device *netdev = adapter->netdev;
>   struct i40e_hw *hw = >hw;
>   int i;
> 
>   /* Legacy Rx will always default to a 2048 buffer size. */
>  #if (PAGE_SIZE < 8192)
>   if (!(adapter->flags & I40EVF_FLAG_LEGACY_RX)) {
> + struct net_device *netdev = adapter->netdev;
> +
>   /* For jumbo frames on systems with 4K pages we have to use
>* an order 1 page, so we might as well increase the size
>* of our Rx buffer to make better use of the available space
> --
> 2.9.0



RE: [RFC PATCH 6/7] net: allow simultaneous SW and HW transmit timestamping

2017-04-13 Thread Keller, Jacob E


> -Original Message-
> From: Miroslav Lichvar [mailto:mlich...@redhat.com]
> Sent: Thursday, April 13, 2017 8:00 AM
>
> Oh, I see. I was struggling to find a good name for this option.
> 
> > The name for this option is therefore not very descriptive. Perhaps
> > SOF_TIMESTAMPING_OPT_BOTH_SW_HW.
> 
> Simultaneous SW/HW timestamping was already possible for incoming
> packets. Maybe _OPT_TX_SWHW would be better?
>

This sounds more accurate to me.

Thanks,
Jake
 
> --
> Miroslav Lichvar


RE: [RFC PATCH 0/7] Extend socket timestamping API

2017-04-13 Thread Keller, Jacob E
> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of Miroslav Lichvar
> Sent: Thursday, April 13, 2017 2:54 AM
> To: Keller, Jacob E <jacob.e.kel...@intel.com>
> Cc: netdev@vger.kernel.org; Richard Cochran <richardcoch...@gmail.com>;
> Willem de Bruijn <will...@google.com>; Soheil Hassas Yeganeh
> <soh...@google.com>; Denny Page <dennyp...@me.com>; Jiri Benc
> <jb...@redhat.com>
> Subject: Re: [RFC PATCH 0/7] Extend socket timestamping API
> 
> On Thu, Apr 13, 2017 at 09:08:08AM +, Keller, Jacob E wrote:
> > > -Original Message-
> > > This patchset adds new options to the timestamping API that will be
> > > useful for NTP implementations and possibly other applications.
> 
> > I think this looks pretty good, straight forward and useful. I agree with 
> > Richard
> about the one commit message, but the rest looks good modulo missing updates
> for all the other drivers.
> 
> Thanks for the review.
> 
> > I didn't see any code to update the ethtool get_ts_info data for _NTP_ALL
> either.
> 
> My understanding was that ethtool should list only filters that are
> actually supported by the HW and are not handled just by switching to
> a more general filter. I think that's what drivers I've looked were
> doing. The phyter driver would be the only one that could list the
> filter as supported. Is that correct?
> 
> --
> Miroslav Lichvar

I'm not sure whether that's how all drivers are implemented, or whether that 
makes sense. Suppose an application wants to know if a particular filter is 
supported, they might ask using get_ts_info, but now the application must 
encode the rules for which filters are more general. I suppose this logic isn't 
all that complicated and is straight forward deductions based on the name and 
generally we prefer if driver or hardware supports timestamping all frames 
anyways.

In this regard I guess you'd be right.

Thanks,
Jake
 


RE: [RFC PATCH 0/7] Extend socket timestamping API

2017-04-13 Thread Keller, Jacob E
> -Original Message-
> From: Miroslav Lichvar [mailto:mlich...@redhat.com]
> Sent: Wednesday, April 12, 2017 7:18 AM
> To: netdev@vger.kernel.org
> Cc: Richard Cochran <richardcoch...@gmail.com>; Willem de Bruijn
> <will...@google.com>; Soheil Hassas Yeganeh <soh...@google.com>; Keller,
> Jacob E <jacob.e.kel...@intel.com>; Denny Page <dennyp...@me.com>; Jiri
> Benc <jb...@redhat.com>
> Subject: [RFC PATCH 0/7] Extend socket timestamping API
> 
> This patchset adds new options to the timestamping API that will be
> useful for NTP implementations and possibly other applications.
> 
> The first patch specifies a timestamp filter for NTP packets, which is
> handled in the second patch in drivers that can timestamp all packets.
> There is no attempt to add the support to the phyter driver.
> 
> The third patch adds a new option to get information about
> HW-timestamped packets. The fourth patch adds support for this option to
> the drivers (currently only igb and e1000e).
> 
> The fifth patch fixes the code to not make a false software TX timestamp
> when HW timestamping is enabled. The sixth patch depends on this fix.
> 
> The sixth patch adds a new option to allow outgoing packets to be looped
> multiple times to the error queue in order to allow simultaneous SW and
> HW timestamping. The seventh patch updates drivers that assumed SW
> timestamping cannot be used together with HW timestamping.
> 

I think this looks pretty good, straight forward and useful. I agree with 
Richard about the one commit message, but the rest looks good modulo missing 
updates for all the other drivers.

I didn't see any code to update the ethtool get_ts_info data for _NTP_ALL 
either.

Thanks,
Jake


RE: [RFC PATCH 4/7] net: ethernet: update drivers to provide timestamping packet info

2017-04-13 Thread Keller, Jacob E
> -Original Message-
> From: Miroslav Lichvar [mailto:mlich...@redhat.com]
> Sent: Wednesday, April 12, 2017 7:18 AM
> To: netdev@vger.kernel.org
> Cc: Richard Cochran <richardcoch...@gmail.com>; Willem de Bruijn
> <will...@google.com>; Soheil Hassas Yeganeh <soh...@google.com>; Keller,
> Jacob E <jacob.e.kel...@intel.com>; Denny Page <dennyp...@me.com>; Jiri
> Benc <jb...@redhat.com>
> Subject: [RFC PATCH 4/7] net: ethernet: update drivers to provide timestamping
> packet info
> 
> Update drivers that support hardware timestamping to provide the
> interface index and packet length for the SOF_TIMESTAMPING_OPT_PKTINFO
> option.
> 
> TODO: update other drivers (not just e1000e and igb)
> 
> CC: Richard Cochran <richardcoch...@gmail.com>
> CC: Willem de Bruijn <will...@google.com>
> CC: Jacob Keller <jacob.e.kel...@intel.com>
> Signed-off-by: Miroslav Lichvar <mlich...@redhat.com>
> ---

Driver changes seem alright to me. I might have gone a different route without 
changing the return value of functions but it doesn't make a huge difference, 
and i guess this route is easier to follow the code.

Thanks,
Jake



RE: [RFC PATCH 2/7] net: ethernet: update drivers to handle HWTSTAMP_FILTER_NTP_ALL

2017-04-13 Thread Keller, Jacob E


> -Original Message-
> From: Miroslav Lichvar [mailto:mlich...@redhat.com]
> Sent: Wednesday, April 12, 2017 7:18 AM
> To: netdev@vger.kernel.org
> Cc: Richard Cochran <richardcoch...@gmail.com>; Willem de Bruijn
> <will...@google.com>; Soheil Hassas Yeganeh <soh...@google.com>; Keller,
> Jacob E <jacob.e.kel...@intel.com>; Denny Page <dennyp...@me.com>; Jiri
> Benc <jb...@redhat.com>
> Subject: [RFC PATCH 2/7] net: ethernet: update drivers to handle
> HWTSTAMP_FILTER_NTP_ALL
> 
> Update drivers which can timestamp all packets to handle also
> HWTSTAMP_FILTER_NTP_ALL.
> 

The code looks ok to me, though I agree with Richard that the description 
should be more clear that not all locations are about supporting the new 
feature (since many locations don't actually support ALL but merely error out)

Thanks,
Jake



RE: Extending socket timestamping API for NTP

2017-03-24 Thread Keller, Jacob E
> -Original Message-
> From: Denny Page [mailto:dennyp...@me.com]
> Sent: Friday, March 24, 2017 10:18 AM
> To: Miroslav Lichvar <mlich...@redhat.com>
> Cc: Richard Cochran <richardcoch...@gmail.com>; netdev@vger.kernel.org; Jiri
> Benc <jb...@redhat.com>; Keller, Jacob E <jacob.e.kel...@intel.com>; Willem
> de Bruijn <will...@google.com>
> Subject: Re: Extending socket timestamping API for NTP
> 
> 
> > On Mar 24, 2017, at 02:45, Miroslav Lichvar <mlich...@redhat.com> wrote:
> >
> > On Thu, Mar 23, 2017 at 10:08:00AM -0700, Denny Page wrote:
> >>> On Mar 23, 2017, at 09:21, Miroslav Lichvar <mlich...@redhat.com> wrote:
> >>>
> >>> After becoming a bit more familiar with the code I don't think this is
> >>> a good idea anymore :). I suspect there would be a noticeable
> >>> performance impact if each timestamped packet could trigger reading of
> >>> the current link speed. If the value had to be cached it would make
> >>> more sense to do it in the application.
> >>
> >> I am very surprised at this. The application caching approach requires the
> application retrieve the value via a system call. The system call overhead is 
> huge
> in comparison to everything else. More importantly, the application cached 
> value
> may be wrong. If the application takes a sample every 5 seconds, there are 5
> seconds of timestamps that can be wildly wrong.
> >
> > I'm just trying to be practical and minimize the performance impact
> > and the amount of code that needs to be written, reviewed and
> > maintained.
> >
> > How common is to have link speed changing in normal operation on LAN?
> 
> In my case, it’s currently every few minutes because I’m doing hw timestamp
> testing. :)
> 
> But this does speak to my point. If it’s cached by the application, the 
> application
> has to check it regularly to minimize the possibility of bad timestamps. If 
> the link
> speed doesn’t change, every call by the application is wasted overhead. If 
> it’s
> cached by the driver, there is no waste, and the stamps are always correct.
> 
> I should have remembered this yesterday... I went and looked at my favorite
> driver, Intel's igb. Not only is the igb driver already caching link speed, 
> it is also
> performing timestamp correction based on that link speed. It appears that all
> Intel drivers are caching link speed. I looked at a few other popular
> manufacturers, and it appears that caching link speed is common. The only one 
> I
> quickly found that didn’t cache was realtek.
> 
> I believe that timestamp correction, whether it be speed based latency, 
> header -
> > trailer, or whatever else might be needed later down the line, are properly
> done in the driver. It’s a lot for the application to try and figure out if 
> it should or
> should not be doing corrections and what correction to apply. The driver 
> knows.

I also believe the right place for these corrections is in the driver.

Thanks,
Jake


RE: [net-next 05/13] i40e: rework exit flow of i40e_add_fdir_ethtool

2017-03-21 Thread Keller, Jacob E
> -Original Message-
> From: Sergei Shtylyov [mailto:sergei.shtyl...@cogentembedded.com]
> Sent: Tuesday, March 21, 2017 3:11 AM
> To: Kirsher, Jeffrey T <jeffrey.t.kirs...@intel.com>; da...@davemloft.net
> Cc: Keller, Jacob E <jacob.e.kel...@intel.com>; netdev@vger.kernel.org;
> nhor...@redhat.com; sassm...@redhat.com; jogre...@redhat.com
> Subject: Re: [net-next 05/13] i40e: rework exit flow of i40e_add_fdir_ethtool
> 
> Hello!
> 
> On 3/21/2017 2:47 AM, Jeff Kirsher wrote:
> 
> > From: Jacob Keller <jacob.e.kel...@intel.com>
> >
> > Refactor the exit flow of the i40e_add_fdir_ethtool function. Move the
> > input_label to the end of the function, removing the dependency on
> 
> I don't see 'input_label' anywhere. Perhaps 'free_input' label was meant?
> 

You're correct. 

Thanks,
Jake



RE: [Intel-wired-lan] [PATCH] i40e: fix memcpy with swapped arguments

2017-03-20 Thread Keller, Jacob E
> -Original Message-
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@lists.osuosl.org] On
> Behalf Of Colin King
> Sent: Monday, March 20, 2017 7:46 AM
> To: Kirsher, Jeffrey T ; intel-wired-
> l...@lists.osuosl.org; netdev@vger.kernel.org
> Cc: kernel-janit...@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: [Intel-wired-lan] [PATCH] i40e: fix memcpy with swapped arguments
> 
> From: Colin Ian King 

Hi there,

> 
> The current code copies an uninitialized params into
> cdev->lan_info.params and then passes the uninitialized params
> to the call cdev->client->ops->l2_param_change.  I believe the
> order of the source and destination in the memcpy is the wrong
> way around and should be swapped.
> 

So you are correct that params is uninitialized. However, the fix here is not 
correct. Somehow we dropped the code for initializing the parameters.

See commit d7ce6422d6e6 ("i40e: don't check params until after checking for 
client instance", 2017-02-09) It looks like the commit itself was malformed 
when applied upstream, and a later commit which should have preserved the 
changes 3140aa9a78c9 ("i40e: KISS the client interface", 2017-03-14) 
accidentally dropped them.

I'll provide a patch to get this back into the correct state.

Thanks for catching this.

Regards,
Jake


RE: [Intel-wired-lan] [PATCH] e1000e: fix timing for 82579 Gigabit Ethernet controller

2017-02-28 Thread Keller, Jacob E
> -Original Message-
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@lists.osuosl.org] On
> Behalf Of Neftin, Sasha
> Sent: Monday, February 27, 2017 12:40 AM
> To: Bernd Faust ; Kirsher, Jeffrey T
> ; intel-wired-...@lists.osuosl.org;
> netdev@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: [Intel-wired-lan] [PATCH] e1000e: fix timing for 82579 Gigabit
> Ethernet controller
> 
> On 2/26/2017 11:08, Neftin, Sasha wrote:
> > On 2/19/2017 14:55, Neftin, Sasha wrote:
> >> On 2/16/2017 20:42, Bernd Faust wrote:
> >>> After an upgrade to Linux kernel v4.x the hardware timestamps of the
> >>> 82579 Gigabit Ethernet Controller are different than expected.
> >>> The values that are being read are almost four times as big as before
> >>> the kernel upgrade.
> >>>
> >>> The difference is that after the upgrade the driver sets the clock
> >>> frequency to 25MHz, where before the upgrade it was set to 96MHz. Intel
> >>> confirmed that the correct frequency for this network adapter is 96MHz.
> >>>
> >>> Signed-off-by: Bernd Faust 
> >>> ---
> >>>   drivers/net/ethernet/intel/e1000e/netdev.c | 6 ++
> >>>   1 file changed, 6 insertions(+)
> >>>
> >>> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c
> >>> b/drivers/net/ethernet/intel/e1000e/netdev.c
> >>> index 7017281..8b7113d 100644
> >>> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> >>> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> >>> @@ -3511,6 +3511,12 @@ s32 e1000e_get_base_timinca(struct
> >>> e1000_adapter *adapter, u32 *timinca)
> >>>
> >>>   switch (hw->mac.type) {
> >>>   case e1000_pch2lan:
> >>> +/* Stable 96MHz frequency */
> >>> +incperiod = INCPERIOD_96MHz;
> >>> +incvalue = INCVALUE_96MHz;
> >>> +shift = INCVALUE_SHIFT_96MHz;
> >>> +adapter->cc.shift = shift + INCPERIOD_SHIFT_96MHz;
> >>> +break;
> >>>   case e1000_pch_lpt:
> >>>   if (er32(TSYNCRXCTL) & E1000_TSYNCRXCTL_SYSCFI) {
> >>>   /* Stable 96MHz frequency */
> >>> --
> >>> 2.7.4
> >>> ___
> >>> Intel-wired-lan mailing list
> >>> intel-wired-...@lists.osuosl.org
> >>> http://lists.osuosl.org/mailman/listinfo/intel-wired-lan
> >>
> >> Hello,
> >>
> >> e1000_pch2lan mac type corresponds to 82579LM and 82579V network
> >> adapters. System clock frequency indication (SYSCFI) for these
> >> devices supports both 25MHz and 96MHz frequency. By default
> >> TSYNCRXCTL.SYSCFI is set to 1 and that means 96MHz frequency is picked.
> >>
> >> It is better to keep the current implementation as it covers all
> >> options.
> >>
> >> Thanks,
> >>
> >> Sasha
> >>
> > Hello,
> >
> > During last couple of weeks I saw few  complaints from community on
> > same timing problem with 82579. I will double check clock definition
> > with HW architecture.
> >
> > Sasha
> >
> I've double checked - 82579 support 96MHz frequency only. So, let's
> accept this suggestion to upstream.
> 
> Ack.
> 

This resolves also a complaint from someone on the LinuxPTP development mailing 
list.

ACK.

Thanks,
Jake



RE: [net-next 12/14] i40e: allow i40e_update_filter_state to skip broadcast filters

2017-02-14 Thread Keller, Jacob E
> -Original Message-
> From: Sergei Shtylyov [mailto:sergei.shtyl...@cogentembedded.com]
> Sent: Sunday, February 12, 2017 2:18 AM
> To: Kirsher, Jeffrey T <jeffrey.t.kirs...@intel.com>; da...@davemloft.net
> Cc: Keller, Jacob E <jacob.e.kel...@intel.com>; netdev@vger.kernel.org;
> nhor...@redhat.com; sassm...@redhat.com; jogre...@redhat.com
> Subject: Re: [net-next 12/14] i40e: allow i40e_update_filter_state to skip
> broadcast filters
> 
> Hello!
> 
> On 2/12/2017 8:30 AM, Jeff Kirsher wrote:
> 
> > From: Jacob Keller <jacob.e.kel...@intel.com>
> >
> > Fix a bug where we modified the mac_filter_hash while outside a lock,
> > when handling addition of broadcast filters.
> >
> > Normally, we add filters to firmware by batching the additions into
> > lists and issuing 1 update for every few filters. Broadcast filters are
> > handled differently, by instead setting the broadcast promiscuous mode
> > flags. In order to make sure the 1<->1 mapping of filters in our
> > addition array lined up with filters in the hlist tmp_add_list, we had
> > to remove the filter and move it back to the main hash. However, we
> > didn't do this under lock, which could cause consistency problems for
> > the list.
> >
> > Fix this by updating i40e_update_filter_state logic so that it knows to
> > avoid broadcast filters. This ensures that we don't have to remove the
> > filter separately, and can put it back using the normal flow.
> >
> > Change-ID: Id288fade80b3e3a9a54b68cc249188cb95147518
> > Signed-off-by: Jacob Keller <jacob.e.kel...@intel.com>
> > Tested-by: Andrew Bowers <andrewx.bow...@intel.com>
> > Signed-off-by: Jeff Kirsher <jeffrey.t.kirs...@intel.com>
> > ---
> >  drivers/net/ethernet/intel/i40e/i40e_main.c | 37
> ++---
> >  1 file changed, 29 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c
> b/drivers/net/ethernet/intel/i40e/i40e_main.c
> > index fa4a04d..06c80d4 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> > @@ -1843,6 +1843,31 @@ static void i40e_undo_filter_entries(struct i40e_vsi
> *vsi,
> >  }
> >
> >  /**
> > + * i40e_next_entry - Get the next non-broadcast filter from a list
> > + * @f: pointer to filter in list
> > + *
> > + * Returns the next non-broadcast filter in the list. Required so that we
> > + * ignore broadcast filters within the list, since these are not handled 
> > via
> > + * the normal firmware update path.
> > + */
> > +static struct i40e_mac_filter *i40e_next_filter(struct i40e_mac_filter *f)
> > +{
> > +   while (f) {
> > +   f = hlist_entry(f->hlist.next,
> > +   typeof(struct i40e_mac_filter),
> > +   hlist);
> > +
> > +   /* keep going if we found a broadcast filter */
> > +   if (f && is_broadcast_ether_addr(f->macaddr))
> > +   continue;
> > +
> > +   break;
> 
> Isn't it simpler to *break* on an inverted condition above?
> This way, *continue* isn;'t needed...
> 

When I wrote the code originally it seemed better, but now that I think about 
it, the inverted conditional isn't that much more complicated, so I'll change 
it.

Thanks,
Jake

> [...]
> 
> MBR, Sergei



RE: Extending socket timestamping API for NTP

2017-02-07 Thread Keller, Jacob E
Hi Miroslav,

> -Original Message-
> From: Miroslav Lichvar [mailto:mlich...@redhat.com]
> Sent: Tuesday, February 07, 2017 6:02 AM
> To: netdev@vger.kernel.org
> Cc: Richard Cochran <richardcoch...@gmail.com>; Jiri Benc
> <jb...@redhat.com>; Keller, Jacob E <jacob.e.kel...@intel.com>; Denny Page
> <dennyp...@me.com>; Willem de Bruijn <will...@google.com>
> Subject: Extending socket timestamping API for NTP
> 
> I'd like to propose some changes and new options for the timestamping
> interface that I think would be useful for NTP implementations and
> maybe also other applications. Before I or someone else tries to
> implement them, do you think they would actually make sense and fit
> well in the current code?
> 
> 1) new rx_filter for NTP
> 
>Some NICs can't timestamp all received packets and are currently
>unusable for NTP with HW timestamping. The new filter would allow
>NTP support in new NICs and adding support to existing NICs with
>firmware/driver updates. The filter would apply to IPv4 and IPv6
>UDP packets received from or sent to the port number 123.

The main problem here is that most hardware that *can't* timestamp all packets 
is pretty limited to timestamping only PTP frames. It's possible with firmware 
upgrades this could be worked around, but I do not know if it would actually 
happen. Still, it can't really hurt too much to add a new filter, and those 
drivers which can support it already should be easy to implement.

> 
>Should be the current drivers of HW that can timestamp all packets
>updated to fall back to HWTSTAMP_FILTER_ALL?

Generally, the drivers I am aware of that support timestamping all packets do 
so for any filter request, rather than actually limiting the timestamping.

> 
> 2) new SO_TIMESTAMPING option to receive from the error queue only
>user data as was passed to sendmsg() instead of Ethernet frames
> 
>Parsing Ethernet and IP headers (especially IPv6 options) is not
>fun and SOF_TIMESTAMPING_OPT_ID is not always practical, e.g. in
>applications which process messages from the error queue
>asynchronously and don't bind/connect their sockets.

This would be useful for application writing.

> 
> 3) target address in msg_name of messages from the error queue
> 
>With 2) and unconnected sockets, there needs to be a way to get the
>address to which the packet was sent. Is it ok to always fill
>msg_name, or does it need to be a new option?


I'm not sure.

> 
> 4) allow sockets to use both SW and HW TX timestamping at the same time
> 
>When using a socket which is not bound to a specific interface, it
>would be nice to get transmit SW timestamps when HW timestamps are
>missing. I suspect it's difficult to predict if a HW timestamp will
>be available. Maybe it would be acceptable to get from the error
>queue two messages per transmission if the interface supports both
>SW and HW timestamping?


This seems useful, but not sure how best to implement it.

> Thoughts?
> 
> --
> Miroslav Lichvar


RE: [net-next 20/20] i40e: don't allow i40e_vsi_(add|kill)_vlan to operate when VID<1

2016-12-07 Thread Keller, Jacob E


> -Original Message-
> From: Kirsher, Jeffrey T
> Sent: Wednesday, December 07, 2016 1:53 PM
> To: Keller, Jacob E <jacob.e.kel...@intel.com>; Sergei Shtylyov
> <sergei.shtyl...@cogentembedded.com>; da...@davemloft.net
> Cc: netdev@vger.kernel.org; nhor...@redhat.com; sassm...@redhat.com;
> jogre...@redhat.com; guru.anbalag...@oracle.com
> Subject: Re: [net-next 20/20] i40e: don't allow i40e_vsi_(add|kill)_vlan to 
> operate
> when VID<1
> 
> On Wed, 2016-12-07 at 13:50 -0800, Keller, Jacob E wrote:
> > > -Original Message-
> > > From: Sergei Shtylyov [mailto:sergei.shtyl...@cogentembedded.com]
> > > Sent: Wednesday, December 07, 2016 2:11 AM
> > > To: Kirsher, Jeffrey T <jeffrey.t.kirs...@intel.com>; davem@davemloft.n
> > > et
> > > Cc: Keller, Jacob E <jacob.e.kel...@intel.com>; netdev@vger.kernel.org;
> > > nhor...@redhat.com; sassm...@redhat.com; jogre...@redhat.com;
> > > guru.anbalag...@oracle.com
> > > Subject: Re: [net-next 20/20] i40e: don't allow
> > > i40e_vsi_(add|kill)_vlan to operate
> > > when VID<1
> > >
> > > Hello!
> > > > +   if (!(vid > 0) || vsi->info.pvid)
> > >
> > >  Why not just '!vid'?
> >
> > Left over artifact of this previously being a signed value. We can fix
> > this.
> >
> > Thanks,
> > Jake
> >
> > > > -void i40e_vsi_kill_vlan(struct i40e_vsi *vsi, s16 vid)
> > > > +void i40e_vsi_kill_vlan(struct i40e_vsi *vsi, u16 vid)
> > > >   {
> > > > +   if (!(vid > 0) || vsi->info.pvid)
> > >
> > >  Likewise.
> >
> > Same here. Can get this fixed.
> 
> While you are fixing this up and sending me a new version of this patch, I
> will just drop this from the series and re-send.

Yes, since it's the last patch that's fine.

Thanks,
Jake


RE: [net-next 20/20] i40e: don't allow i40e_vsi_(add|kill)_vlan to operate when VID<1

2016-12-07 Thread Keller, Jacob E
> -Original Message-
> From: Sergei Shtylyov [mailto:sergei.shtyl...@cogentembedded.com]
> Sent: Wednesday, December 07, 2016 2:11 AM
> To: Kirsher, Jeffrey T <jeffrey.t.kirs...@intel.com>; da...@davemloft.net
> Cc: Keller, Jacob E <jacob.e.kel...@intel.com>; netdev@vger.kernel.org;
> nhor...@redhat.com; sassm...@redhat.com; jogre...@redhat.com;
> guru.anbalag...@oracle.com
> Subject: Re: [net-next 20/20] i40e: don't allow i40e_vsi_(add|kill)_vlan to 
> operate
> when VID<1
> 
> Hello!
> > +   if (!(vid > 0) || vsi->info.pvid)
> 
> Why not just '!vid'?

Left over artifact of this previously being a signed value. We can fix this.

Thanks,
Jake

> > -void i40e_vsi_kill_vlan(struct i40e_vsi *vsi, s16 vid)
> > +void i40e_vsi_kill_vlan(struct i40e_vsi *vsi, u16 vid)
> >  {
> > +   if (!(vid > 0) || vsi->info.pvid)
> 
> Likewise.

Same here. Can get this fixed.

Thanks,
Jake

> 
> > +   return;
> > +
> > spin_lock_bh(>mac_filter_hash_lock);
> > i40e_rm_vlan_all_mac(vsi, vid);
> > spin_unlock_bh(>mac_filter_hash_lock);
> 
> MBR, Sergei



Re: [PATCH net v2] igb: re-assign hw address pointer on reset after PCI error

2016-12-02 Thread Keller, Jacob E
On Fri, 2016-12-02 at 10:55 -0200, Guilherme G. Piccoli wrote:
> On 11/10/2016 04:46 PM, Guilherme G. Piccoli wrote:
> > Whenever the igb driver detects the result of a read operation
> > returns
> > a value composed only by F's (like 0x), it will detach the
> > net_device, clear the hw_addr pointer and warn to the user that
> > adapter's
> > link is lost - those steps happen on igb_rd32().
> > 
> > In case a PCI error happens on Power architecture, there's a
> > recovery
> > mechanism called EEH, that will reset the PCI slot and call
> > driver's
> > handlers to reset the adapter and network functionality as well.
> > 
> > We observed that once hw_addr is NULL after the error is detected
> > on
> > igb_rd32(), it's never assigned back, so in the process of
> > resetting
> > the network functionality we got a NULL pointer dereference in both
> > igb_configure_tx_ring() and igb_configure_rx_ring(). In order to
> > avoid
> > such bug, this patch re-assigns the hw_addr value in the slot_reset
> > handler.
> > 
> > Reported-by: Anthony H. Thai 
> > Reported-by: Harsha Thyagaraja 
> > Signed-off-by: Guilherme G. Piccoli 
> > ---
> >  drivers/net/ethernet/intel/igb/igb_main.c | 5 +
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/drivers/net/ethernet/intel/igb/igb_main.c
> > b/drivers/net/ethernet/intel/igb/igb_main.c
> > index edc9a6a..136ee9e 100644
> > --- a/drivers/net/ethernet/intel/igb/igb_main.c
> > +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> > @@ -7878,6 +7878,11 @@ static pci_ers_result_t
> > igb_io_slot_reset(struct pci_dev *pdev)
> >     pci_enable_wake(pdev, PCI_D3hot, 0);
> >     pci_enable_wake(pdev, PCI_D3cold, 0);
> > 
> > +   /* In case of PCI error, adapter lose its HW
> > address
> > +    * so we should re-assign it here.
> > +    */
> > +   hw->hw_addr = adapter->io_addr;
> > +
> >     igb_reset(adapter);
> >     wr32(E1000_WUS, ~0);
> >     result = PCI_ERS_RESULT_RECOVERED;
> > 
> 
> Ping?
> 
> Sorry to annoy, any news on this?
> Thanks in advance!
> 
> Cheers,
> 
> 
> Guilherme
> 

This seems reasonable. It's similar to what fm10k driver does under
this circumstance.

Thanks,
Jake

Re: [PATCH RFC v2] ethtool: implement helper to get flow_type value

2016-11-29 Thread Keller, Jacob E
On Tue, 2016-11-29 at 15:21 +, Jakub Kicinski wrote:
> On Mon, 28 Nov 2016 15:03:43 -0800, Jacob Keller wrote:
> > +static inline __u32 ethtool_get_flow_spec_type(__u32 flow_type)
> > +{
> > +   return flow_type & (FLOW_EXT | FLOW_MAC_EXT);
> 
> I don't have anything of substance to say but I think you are missing
> a
> negation (~) here compared to the code you are replacing ;)

HAH! Yes you are right. I made a mistake when copying this out of my
driver header file.

Will fix. Sorry for the thrash, and thanks for catching my mistake.

Regards,
Jake

RE: [PATCH RFC v1] ethtool: implement helper to get flow_type value

2016-11-28 Thread Keller, Jacob E
> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On
> Behalf Of David Miller
> Sent: Friday, November 25, 2016 1:07 PM
> To: Keller, Jacob E <jacob.e.kel...@intel.com>
> Cc: netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org
> Subject: Re: [PATCH RFC v1] ethtool: implement helper to get flow_type value
> 
> From: Jacob Keller <jacob.e.kel...@intel.com>
> Date: Tue, 22 Nov 2016 15:44:53 -0800
> 
> > @@ -880,6 +880,14 @@ struct ethtool_rx_flow_spec {
> > __u32   location;
> >  };
> >
> > +/* Flag to enable additional fields in struct ethtool_rx_flow_spec */
> > +#defineFLOW_EXT0x8000
> > +#defineFLOW_MAC_EXT0x4000
> > +static inline __u32 ethtool_get_flow_spec_type(__u32 flow_type)
> > +{
> > +   return flow_type & (FLOW_EXT | FLOW_MAC_EXT);
> > +}
> > +
> >  /* How rings are layed out when accessing virtual functions or
> >   * offloaded queues is device specific. To allow users to do flow
> >   * steering and specify these queues the ring cookie is partitioned
> > @@ -1579,9 +1587,6 @@ static inline int ethtool_validate_duplex(__u8 duplex)
> >  #defineIPV4_FLOW   0x10/* hash only */
> >  #defineIPV6_FLOW   0x11/* hash only */
> >  #defineETHER_FLOW  0x12/* spec only (ether_spec) */
> > -/* Flag to enable additional fields in struct ethtool_rx_flow_spec */
> > -#defineFLOW_EXT0x8000
> > -#defineFLOW_MAC_EXT0x4000
> >
> >  /* L3-L4 network traffic flow hash options */
> >  #defineRXH_L2DA(1 << 1)
> 
> Please put the helper after the FLOW_* definitions rather than moving
> them earlier in the file.

Will do. I originally moved them to place these with other similar helpers but 
I can re-spin this.

Thanks,
Jake


RE: [PATCH net-next 2/3] ptp: igb: Use the high resolution frequency method.

2016-11-09 Thread Keller, Jacob E
> -Original Message-
> From: Richard Cochran [mailto:richardcoch...@gmail.com]
> Sent: Wednesday, November 09, 2016 5:12 AM
> To: Keller, Jacob E <jacob.e.kel...@intel.com>
> Cc: netdev@vger.kernel.org; t...@linutronix.de; manfred.rudig...@omicron.at;
> ulrik.debie...@e2big.org; stefan.soren...@spectralink.com;
> da...@davemloft.net; Kirsher, Jeffrey T <jeffrey.t.kirs...@intel.com>;
> john.stu...@linaro.org; intel-wired-...@lists.osuosl.org
> Subject: Re: [PATCH net-next 2/3] ptp: igb: Use the high resolution frequency
> method.
> 
> On Tue, Nov 08, 2016 at 10:02:22PM +, Keller, Jacob E wrote:
> > On Tue, 2016-11-08 at 22:49 +0100, Richard Cochran wrote:
> > > - rate = ppb;
> > > - rate <<= 26;
> > > - rate = div_u64(rate, 1953125);
> > > + rate = scaled_ppm;
> > > + rate <<= 13;
> > > + rate = div_u64(rate, 15625);
> >
> > I'm curious how you generate the new math here, since this can be
> > tricky, and I could use more examples in order to port to some of the
> > other drivers implementations. I'm not quit sure how to handle the
> > value when the lower 16 bits are fractional.
> 
> TL;DR version:
> 
> In ptp_clock.c we convert scaled_ppm to ppb like this.
> 
>   ppb = scaled_ppm * 10^3 * 2^-16
> 
> If you already have a working driver that does
> 
>   regval = ppb * SOMEMATH;
> 
> then just substitute
> 
>   regval = (scaled_ppm * 10^3 * 2^-16) * SOMEMATH;
>  = (scaled_ppm *  5^3 * 2^-13) * SOMEMATH;
> 
> and simplify by combining the 5^3 and 2^-13 constants into SOMEMATH.
> 

Thanks, this makes more sense :)

> Longer explanation:
> 
> You have to consider how the frequency adjustment HW works, case by
> case.  Both the i210 and the phyter have an adjustment register that
> holds units of 2^-32 nanoseconds per 8 nanosecond clock period, and so
> the rate from adjustment value 1 is (2^-32 / 8).
> 
> Then with the old interface, the conversion from "adjustment unit" to
> ppb was (2^-32 / 8 * 10^9) or (2^-26 * 5^9).  The conversion the other
> way needs the inverse, and so the code did (ppb << 26) / 5^9.
> 
> With the new interface, the conversion from "adjustment unit" to
> scaled_ppm is (2^-32 / 8 * 10^6 * 2^16) or (2^-13 * 5^6).  The code
> converts the other direction using the inverse, (s_ppm << 13) / 5^6.
> 

Right. This helps.

Thanks,
Jake

> HTH,
> Richard


Re: [PATCH net-next 2/3] ptp: igb: Use the high resolution frequency method.

2016-11-08 Thread Keller, Jacob E
On Tue, 2016-11-08 at 22:49 +0100, Richard Cochran wrote:
> The 82580 and related devices offer a frequency resolution of about
> 0.029 ppb.  This patch lets users of the device benefit from the
> increased frequency resolution when tuning the clock.
> 
> Signed-off-by: Richard Cochran 
> ---

Additionally, what about min/max frequency check? Wouldn't this need to
be updated for the new adjfine operation?

Thanks,
Jake

Re: [PATCH net-next 2/3] ptp: igb: Use the high resolution frequency method.

2016-11-08 Thread Keller, Jacob E
On Tue, 2016-11-08 at 22:49 +0100, Richard Cochran wrote:
> The 82580 and related devices offer a frequency resolution of about
> 0.029 ppb.  This patch lets users of the device benefit from the
> increased frequency resolution when tuning the clock.
> 
> Signed-off-by: Richard Cochran 
> ---
>  drivers/net/ethernet/intel/igb/igb_ptp.c | 16 
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/igb/igb_ptp.c
> b/drivers/net/ethernet/intel/igb/igb_ptp.c
> index a7895c4..c30eea8 100644
> --- a/drivers/net/ethernet/intel/igb/igb_ptp.c
> +++ b/drivers/net/ethernet/intel/igb/igb_ptp.c
> @@ -226,7 +226,7 @@ static int igb_ptp_adjfreq_82576(struct
> ptp_clock_info *ptp, s32 ppb)
>   return 0;
>  }
>  
> -static int igb_ptp_adjfreq_82580(struct ptp_clock_info *ptp, s32
> ppb)
> +static int igb_ptp_adjfine_82580(struct ptp_clock_info *ptp, long
> scaled_ppm)
>  {
>   struct igb_adapter *igb = container_of(ptp, struct
> igb_adapter,
>      ptp_caps);
> @@ -235,13 +235,13 @@ static int igb_ptp_adjfreq_82580(struct
> ptp_clock_info *ptp, s32 ppb)
>   u64 rate;
>   u32 inca;
>  
> - if (ppb < 0) {
> + if (scaled_ppm < 0) {
>   neg_adj = 1;
> - ppb = -ppb;
> + scaled_ppm = -scaled_ppm;
>   }
> - rate = ppb;
> - rate <<= 26;
> - rate = div_u64(rate, 1953125);
> + rate = scaled_ppm;
> + rate <<= 13;
> + rate = div_u64(rate, 15625);
>  

I'm curious how you generate the new math here, since this can be
tricky, and I could use more examples in order to port to some of the
other drivers implementations. I'm not quit sure how to handle the
value when the lower 16 bits are fractional.

Thanks,
Jake

>   inca = rate & INCVALUE_MASK;
>   if (neg_adj)
> @@ -1103,7 +1103,7 @@ void igb_ptp_init(struct igb_adapter *adapter)
>   adapter->ptp_caps.max_adj = 6249;
>   adapter->ptp_caps.n_ext_ts = 0;
>   adapter->ptp_caps.pps = 0;
> - adapter->ptp_caps.adjfreq = igb_ptp_adjfreq_82580;
> + adapter->ptp_caps.adjfine = igb_ptp_adjfine_82580;
>   adapter->ptp_caps.adjtime = igb_ptp_adjtime_82576;
>   adapter->ptp_caps.gettime64 = igb_ptp_gettime_82576;
>   adapter->ptp_caps.settime64 = igb_ptp_settime_82576;
> @@ -1131,7 +1131,7 @@ void igb_ptp_init(struct igb_adapter *adapter)
>   adapter->ptp_caps.n_pins = IGB_N_SDP;
>   adapter->ptp_caps.pps = 1;
>   adapter->ptp_caps.pin_config = adapter->sdp_config;
> - adapter->ptp_caps.adjfreq = igb_ptp_adjfreq_82580;
> + adapter->ptp_caps.adjfine = igb_ptp_adjfine_82580;
>   adapter->ptp_caps.adjtime = igb_ptp_adjtime_i210;
>   adapter->ptp_caps.gettime64 = igb_ptp_gettime_i210;
>   adapter->ptp_caps.settime64 = igb_ptp_settime_i210;


Re: [PATCH net-next 0/3] PHC frequency fine tuning

2016-11-08 Thread Keller, Jacob E
On Tue, 2016-11-08 at 22:49 +0100, Richard Cochran wrote:
> This series expands the PTP Hardware Clock subsystem by adding a
> method that passes the frequency tuning word to the the drivers
> without dropping the low order bits.  Keeping those bits is useful
> for
> drivers whose frequency resolution is higher than 1 ppb.
> 

Makes sense.

> The appended script (below) runs a simple demonstration of the
> improvement.  This test needs two Intel i210 PCIe cards installed in
> the same PC, with their SDP0 pins connected by copper
> wire.  Measuring
> the estimated offset (from the ptp4l servo) and the true offset (from
> the PPS) over one hour yields the following statistics.
> 
> > 
> >    |   Est. Before |Est. After |   True Before |True
> > After |
> > +---+---+---+
> > ---|
> > min| -5.20e+01 | -1.60e+01 | -3.10e+01 |
> > -1.00e+00 |
> > max| +5.70e+01 | +2.50e+01 | +8.50e+01 |
> > +4.00e+01 |
> > pk-pk: | +1.09e+02 | +4.10e+01 | +1.16e+02 |
> > +4.10e+01 |
> > mean   | +6.47e-02 | +1.28e-02 | +2.422083e+01 |
> > +1.826083e+01 |
> > stddev | +1.158006e+01 | +4.581982e+00 | +1.207708e+01 |
> > +4.981435e+00 |
> 
> Here the numbers in units of nanoseconds, and the ~20 nanosecond PPS
> offset is due to input/output delays on the i210's external interface
> logic.
> 
> With the series applied, both the peak to peak error and the standard
> deviation improve by a factor of more than two.  These two graphs
> show
> the improvement nicely.
> 
>   http://linuxptp.sourceforge.net/fine-tuning/fine-est.png
> 
>   http://linuxptp.sourceforge.net/fine-tuning/fine-tru.png
> 

Wow, nice! I'll take a look at the actual patches in a few minutes, but
this is a really nice improvement!

Thanks,
Jake

> 
> Thanks,
> Richard
> 
> Richard Cochran (3):
>   ptp: Introduce a high resolution frequency adjustment method.
>   ptp: igb: Use the high resolution frequency method.
>   ptp: dp83640: Use the high resolution frequency method.
> 
>  drivers/net/ethernet/intel/igb/igb_ptp.c | 16 
>  drivers/net/phy/dp83640.c| 14 +++---
>  drivers/ptp/ptp_clock.c  |  5 -
>  include/linux/ptp_clock_kernel.h |  8 
>  4 files changed, 27 insertions(+), 16 deletions(-)
> 


Re: [Intel-wired-lan] [PATCH net] i40e: avoid NULL pointer dereference and recursive errors on early PCI error

2016-09-27 Thread Keller, Jacob E
On Tue, 2016-09-27 at 18:14 -0300, Guilherme G. Piccoli wrote:
> Although rare, it's possible to hit PCI error early on device
> probe, meaning possibly some structs are not entirely initialized,
> and some might even be completely uninitialized, leading to NULL
> pointer dereference.
> 
> The i40e driver currently presents a "bad" behavior if device hits
> such early PCI error: firstly, the struct i40e_pf might not be
> attached to pci_dev yet, leading to a NULL pointer dereference on
> access to pf->state.
> 

Oops! Nice find!

> Even checking if the struct is NULL and avoiding the access in that
> case isn't enough, since the driver cannot recover from PCI error
> that early; in our experiments we saw multiple failures on kernel
> log, like:
> 
>   [549.664] i40e 0007:01:00.1: Initial pf_reset failed: -15
>   [549.664] i40e: probe of 0007:01:00.1 failed with error -15
>   [...]
>   [871.644] i40e 0007:01:00.1: The driver for the device stopped
> because the
>   device firmware failed to init. Try updating your NVM image.
>   [871.644] i40e: probe of 0007:01:00.1 failed with error -32
>   [...]
>   [872.516] i40e 0007:01:00.0: ARQ: Unknown event 0x ignored
> 
> Between the first probe failure (error -15) and the second (error
> -32)
> another PCI error happened due to the first bad probe. Also, driver
> started to flood console with those ARQ event messages.
> 
> This patch will prevent these issues by allowing error recovery
> mechanism to remove the failed device from the system instead of
> trying to recover from early PCI errors during device probe.
> 

This seems reasonable.

> Signed-off-by: Guilherme G. Piccoli 
> ---
>  drivers/net/ethernet/intel/i40e/i40e_main.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c
> b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index d0b3a1b..dad15b6 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -11360,6 +11360,12 @@ static pci_ers_result_t
> i40e_pci_error_detected(struct pci_dev *pdev,
>  
>   dev_info(>dev, "%s: error %d\n", __func__, error);
>  
> + if (!pf) {
> + dev_info(>dev,
> +  "Cannot recover - error happened during
> device probe\n");
> + return PCI_ERS_RESULT_DISCONNECT;
> + }
> +

Looks good to me.

Acked-by: Jacob Keller 

Thanks for the bug fix and detailed explanation!

Regards,
Jake

>   /* shutdown all operations */
>   if (!test_bit(__I40E_SUSPENDED, >state)) {
>   rtnl_lock();


~~

2016-07-21 Thread Keller, Jacob E
On Thu, 2016-07-21 at 13:51 +0300, Sergei Shtylyov wrote:
> Hello.
> 
> On 7/21/2016 1:23 AM, Jeff Kirsher wrote:
> 
> > From: Jacob Keller 
> > 
> > Sometimes, a VF driver will lose PCIe address access, such as due
> > to
> > a PF FLR event. In fm10k_detach_subtask, poll and check whether the
> > PCIe register space is active again and restore the device when it
> > has.
> > 
> > Signed-off-by: Jacob Keller 
> > Tested-by: Krishneil Singh 
> > Signed-off-by: Jeff Kirsher 
> > ---
> >  drivers/net/ethernet/intel/fm10k/fm10k_pci.c | 13 +
> >  1 file changed, 13 insertions(+)
> > 
> > diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
> > b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
> > index 5e40460..d4ccb2a 100644
> > --- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
> > +++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
> > @@ -123,11 +123,24 @@ static void fm10k_service_timer(unsigned long
> > data)
> >  static void fm10k_detach_subtask(struct fm10k_intfc *interface)
> >  {
> >     struct net_device *netdev = interface->netdev;
> > +   u32 __iomem *hw_addr;
> > +   u32 value;
> > 
> >     /* do nothing if device is still present or hw_addr is set
> > */
> >     if (netif_device_present(netdev) || interface->hw.hw_addr)
> >     return;
> > 
> > +   /* check the real address space to see if we've recovered
> > */
> > +   hw_addr = READ_ONCE(interface->uc_addr);
> > +   value = readl(hw_addr);
> > +   if ((~value)) {
> 
> Why these double parens?
> 

You're right it doesn't need them. I think at one point the check was
"!(~value)" in some other portion of code, and likely got copied by
mistake.

Thanks,
Jake

> [...]
> 
> MBR, Sergei
> 


Re: [PATCH net] ixgbe: napi_poll must return the work done

2016-06-16 Thread Keller, Jacob E
On Wed, 2016-06-15 at 09:34 -0700, Venkatesh Srinivas wrote:
> Reviewed-by: Venkatesh Srinivas 
> 
> The same bit of code appears in fm10k and i40e/i40evf. ixgb appears
> to
> correctly return work_done.
> 
> ixgbe_poll also appears to return an (minor) incorrect work_done in
> another case, BTW. It divides its
> budget between Rx rings associated with a vector. If any ring exceeds
> its share of the budget, ixgbe_poll
> claims to have consumed the full budget, even if a full budget of
> frames was not received in a single
> pass.
> 

So the correct return here would also be min between work_done and
budget, right? ie: we should still return the total work done instead
of the budget...?

Thanks,
Jake

> -- vs;


  1   2   >