Re: [Intel-wired-lan] [PATCH net-next v3 1/2] e1000e: factor out systim sanitization
On Tue, Aug 02, 2016 at 07:33:01AM +, Brown, Aaron F wrote: > > From: Intel-wired-lan [mailto:intel-wired-lan-boun...@lists.osuosl.org] On > > Behalf Of Jarod Wilson > > Sent: Monday, August 1, 2016 6:32 PM > > To: Avargil, Raanan <raanan.avar...@intel.com> > > Cc: Hall, Christopher S <christopher.s.h...@intel.com>; > > netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org; linux- > > ker...@vger.kernel.org > > Subject: Re: [Intel-wired-lan] [PATCH net-next v3 1/2] e1000e: factor out > > systim sanitization > > > > On Wed, Jul 27, 2016 at 11:01:55AM -0400, Jarod Wilson wrote: > > > On Wed, Jul 27, 2016 at 02:09:13PM +, Avargil, Raanan wrote: > > > >> This is prepatory work for an expanding list of adapter families that > > > >> have > > occasional ~10 hour clock jumps when being used for PTP. Factor out the > > sanitization function and convert to using a feature (bug) flag, per > > suggestion > > from Jesse Brandeburg. > > > >> > > > >> Littering functional code with device-specific checks is much messier > > than simply checking a flag, and having device-specific init set flags as > > needed. > > > >> There are probably a number of other cases in the e1000e code that > > could/should be converted similarly. > > > > > > > > Looks ok to me. > > > > Adding Chris who asked what happens if we reach the max retry counter > > (E1000_MAX_82574_SYSTIM_REREAD)? > > > > This counter is set to 50. > > > > Can you, for testing purposes, decreased this value (or even set it to > > > > 0) > > and see what happens? > > > > > > Unfortunately, I don't have direct access to the affected hardware myself, > > > so I'd have to prep a test build, hand it off to someone and play relay. I > > > could do that, but it'd have some lag and possible multiple round-trips... > > > Anyone inside Intel have hardware handy to test on? :p > > > > Was tied up with other work the middle of last week, then on vacation for > > a bit. There was some testing feedback provided from someone at neither > > Red Hat or Intel, but I'm not sure where it leaves us right now. What > > needs to happen next? > > Probably nothing else needs to be done on your end. I was out for the last > week and a half and am now running the patches through a series of regression > test covering a fair number of the different e1000e parts. I will also try > to duplicate Tim Woodford' success on a NUC with an i218 in my lab. Assuming > nothing jumps out at me I'll probably give it a tested-by later this week so > that Jeff can push it on up. Looking for a status update on this one, not seeing it pushed to DaveM just yet. -- Jarod Wilson ja...@redhat.com
RE: [Intel-wired-lan] [PATCH net-next v3 1/2] e1000e: factor out systim sanitization
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@lists.osuosl.org] On > Behalf Of Woodford, Timothy W. > Sent: Friday, July 29, 2016 7:41 AM > To: Woodford, Timothy W. <timothy.woodf...@jhuapl.edu>; Avargil, > Raanan <raanan.avar...@intel.com>; Jarod Wilson <ja...@redhat.com>; > linux-ker...@vger.kernel.org; Hall, Christopher S > <christopher.s.h...@intel.com> > Cc: netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org > Subject: Re: [Intel-wired-lan] [PATCH net-next v3 1/2] e1000e: factor out > systim sanitization > > >>> This is prepatory work for an expanding list of adapter families that have > occasional ~10 hour clock jumps when being used for PTP. Factor out the > sanitization function and convert to using a feature (bug) flag, per > suggestion > from Jesse Brandeburg. > >>> > >>> Littering functional code with device-specific checks is much messier than > simply checking a flag, and having device-specific init set flags as needed. > >>> There are probably a number of other cases in the e1000e code that > could/should be converted similarly. > >> > >> Looks ok to me. > >> Adding Chris who asked what happens if we reach the max retry counter > (E1000_MAX_82574_SYSTIM_REREAD)? > >> This counter is set to 50. > >> Can you, for testing purposes, decreased this value (or even set it to 0) > and see what happens? > > I'll set the max retry counter to 1 and run an overnight test to see what > happens. > > After running with this configuration for about 36 hours, I haven't seen any > timing jumps. Either this configuration eliminates the error, or it makes it > significantly less likely to occur. > > Tim Woodford Feel free to throw a Tested-by: on it if you like. Not a big deal either way, I managed to get enough cycles in on it I'm pretty happy with it as well. > ___ > Intel-wired-lan mailing list > intel-wired-...@lists.osuosl.org > http://lists.osuosl.org/mailman/listinfo/intel-wired-lan
RE: [Intel-wired-lan] [PATCH net-next v3 1/2] e1000e: factor out systim sanitization
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@lists.osuosl.org] On > Behalf Of Jarod Wilson > Sent: Tuesday, July 26, 2016 11:26 AM > To: linux-ker...@vger.kernel.org > Cc: netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org > Subject: [Intel-wired-lan] [PATCH net-next v3 1/2] e1000e: factor out systim > sanitization > > This is prepatory work for an expanding list of adapter families that have > occasional ~10 hour clock jumps when being used for PTP. Factor out the > sanitization function and convert to using a feature (bug) flag, per > suggestion from Jesse Brandeburg. > > Littering functional code with device-specific checks is much messier than > simply checking a flag, and having device-specific init set flags as needed. > There are probably a number of other cases in the e1000e code that > could/should be converted similarly. > > Suggested-by: Jesse Brandeburg> CC: Jesse Brandeburg > CC: Jeff Kirsher > CC: intel-wired-...@lists.osuosl.org > CC: netdev@vger.kernel.org > Signed-off-by: Jarod Wilson > --- > drivers/net/ethernet/intel/e1000e/82571.c | 6 ++- > drivers/net/ethernet/intel/e1000e/e1000.h | 1 + > drivers/net/ethernet/intel/e1000e/netdev.c | 66 ++--- > - > 3 files changed, 44 insertions(+), 29 deletions(-) Tested-by: Aaron Brown
RE: [Intel-wired-lan] [PATCH net-next v3 1/2] e1000e: factor out systim sanitization
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@lists.osuosl.org] On > Behalf Of Jarod Wilson > Sent: Monday, August 1, 2016 6:32 PM > To: Avargil, Raanan <raanan.avar...@intel.com> > Cc: Hall, Christopher S <christopher.s.h...@intel.com>; > netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org; linux- > ker...@vger.kernel.org > Subject: Re: [Intel-wired-lan] [PATCH net-next v3 1/2] e1000e: factor out > systim sanitization > > On Wed, Jul 27, 2016 at 11:01:55AM -0400, Jarod Wilson wrote: > > On Wed, Jul 27, 2016 at 02:09:13PM +, Avargil, Raanan wrote: > > >> This is prepatory work for an expanding list of adapter families that > > >> have > occasional ~10 hour clock jumps when being used for PTP. Factor out the > sanitization function and convert to using a feature (bug) flag, per > suggestion > from Jesse Brandeburg. > > >> > > >> Littering functional code with device-specific checks is much messier > than simply checking a flag, and having device-specific init set flags as > needed. > > >> There are probably a number of other cases in the e1000e code that > could/should be converted similarly. > > > > > > Looks ok to me. > > > Adding Chris who asked what happens if we reach the max retry counter > (E1000_MAX_82574_SYSTIM_REREAD)? > > > This counter is set to 50. > > > Can you, for testing purposes, decreased this value (or even set it to 0) > and see what happens? > > > > Unfortunately, I don't have direct access to the affected hardware myself, > > so I'd have to prep a test build, hand it off to someone and play relay. I > > could do that, but it'd have some lag and possible multiple round-trips... > > Anyone inside Intel have hardware handy to test on? :p > > Was tied up with other work the middle of last week, then on vacation for > a bit. There was some testing feedback provided from someone at neither > Red Hat or Intel, but I'm not sure where it leaves us right now. What > needs to happen next? Probably nothing else needs to be done on your end. I was out for the last week and a half and am now running the patches through a series of regression test covering a fair number of the different e1000e parts. I will also try to duplicate Tim Woodford' success on a NUC with an i218 in my lab. Assuming nothing jumps out at me I'll probably give it a tested-by later this week so that Jeff can push it on up. > > -- > Jarod Wilson > ja...@redhat.com > > ___ > Intel-wired-lan mailing list > intel-wired-...@lists.osuosl.org > http://lists.osuosl.org/mailman/listinfo/intel-wired-lan
Re: [Intel-wired-lan] [PATCH net-next v3 1/2] e1000e: factor out systim sanitization
On Wed, Jul 27, 2016 at 11:01:55AM -0400, Jarod Wilson wrote: > On Wed, Jul 27, 2016 at 02:09:13PM +, Avargil, Raanan wrote: > >> This is prepatory work for an expanding list of adapter families that have > >> occasional ~10 hour clock jumps when being used for PTP. Factor out the > >> sanitization function and convert to using a feature (bug) flag, per > >> suggestion from Jesse Brandeburg. > >> > >> Littering functional code with device-specific checks is much messier than > >> simply checking a flag, and having device-specific init set flags as > >> needed. > >> There are probably a number of other cases in the e1000e code that > >> could/should be converted similarly. > > > > Looks ok to me. > > Adding Chris who asked what happens if we reach the max retry counter > > (E1000_MAX_82574_SYSTIM_REREAD)? > > This counter is set to 50. > > Can you, for testing purposes, decreased this value (or even set it to 0) > > and see what happens? > > Unfortunately, I don't have direct access to the affected hardware myself, > so I'd have to prep a test build, hand it off to someone and play relay. I > could do that, but it'd have some lag and possible multiple round-trips... > Anyone inside Intel have hardware handy to test on? :p Was tied up with other work the middle of last week, then on vacation for a bit. There was some testing feedback provided from someone at neither Red Hat or Intel, but I'm not sure where it leaves us right now. What needs to happen next? -- Jarod Wilson ja...@redhat.com
RE: [Intel-wired-lan] [PATCH net-next v3 1/2] e1000e: factor out systim sanitization
>>> This is prepatory work for an expanding list of adapter families that have >>> occasional ~10 hour clock jumps when being used for PTP. Factor out the >>> sanitization function and convert to using a feature (bug) flag, per >>> suggestion from Jesse Brandeburg. >>> >>> Littering functional code with device-specific checks is much messier than >>> simply checking a flag, and having device-specific init set flags as needed. >>> There are probably a number of other cases in the e1000e code that >>> could/should be converted similarly. >> >> Looks ok to me. >> Adding Chris who asked what happens if we reach the max retry counter >> (E1000_MAX_82574_SYSTIM_REREAD)? >> This counter is set to 50. >> Can you, for testing purposes, decreased this value (or even set it to 0) >> and see what happens? > I'll set the max retry counter to 1 and run an overnight test to see what > happens. After running with this configuration for about 36 hours, I haven't seen any timing jumps. Either this configuration eliminates the error, or it makes it significantly less likely to occur. Tim Woodford
RE: [Intel-wired-lan] [PATCH net-next v3 1/2] e1000e: factor out systim sanitization
>> This is prepatory work for an expanding list of adapter families that have >> occasional ~10 hour clock jumps when being used for PTP. Factor out the >> sanitization function and convert to using a feature (bug) flag, per >> suggestion from Jesse Brandeburg. >> >> Littering functional code with device-specific checks is much messier than >> simply checking a flag, and having device-specific init set flags as needed. >> There are probably a number of other cases in the e1000e code that >> could/should be converted similarly. > > Looks ok to me. > Adding Chris who asked what happens if we reach the max retry counter > (E1000_MAX_82574_SYSTIM_REREAD)? > This counter is set to 50. > Can you, for testing purposes, decreased this value (or even set it to 0) and > see what happens? I have one of the affected NICs. I inserted some test code into the module to check the value of i after the for loop inside the sanitize function exits to check how many iterations of the loop were being run. I ran a PTP test setup for a few hours with E1000_MAX_82574_SYSTIM_REREAD=50, and the maximum number was 5 iterations. It looks like setting E1000_MAX_82574_SYSTIM_REREAD to 0 will just disable the return value checking and will have the same results as disabling the sanitization function. I'll set the max retry counter to 1 and run an overnight test to see what happens. Tim Woodford
Re: [Intel-wired-lan] [PATCH net-next v3 1/2] e1000e: factor out systim sanitization
On Wed, Jul 27, 2016 at 02:09:13PM +, Avargil, Raanan wrote: >> This is prepatory work for an expanding list of adapter families that have >> occasional ~10 hour clock jumps when being used for PTP. Factor out the >> sanitization function and convert to using a feature (bug) flag, per >> suggestion from Jesse Brandeburg. >> >> Littering functional code with device-specific checks is much messier than >> simply checking a flag, and having device-specific init set flags as needed. >> There are probably a number of other cases in the e1000e code that >> could/should be converted similarly. > > Looks ok to me. > Adding Chris who asked what happens if we reach the max retry counter > (E1000_MAX_82574_SYSTIM_REREAD)? > This counter is set to 50. > Can you, for testing purposes, decreased this value (or even set it to 0) and > see what happens? Unfortunately, I don't have direct access to the affected hardware myself, so I'd have to prep a test build, hand it off to someone and play relay. I could do that, but it'd have some lag and possible multiple round-trips... Anyone inside Intel have hardware handy to test on? :p -- Jarod Wilson ja...@redhat.com
RE: [Intel-wired-lan] [PATCH net-next v3 1/2] e1000e: factor out systim sanitization
This is prepatory work for an expanding list of adapter families that have occasional ~10 hour clock jumps when being used for PTP. Factor out the sanitization function and convert to using a feature (bug) flag, per suggestion from Jesse Brandeburg. Littering functional code with device-specific checks is much messier than simply checking a flag, and having device-specific init set flags as needed. There are probably a number of other cases in the e1000e code that could/should be converted similarly. Suggested-by: Jesse BrandeburgCC: Jesse Brandeburg CC: Jeff Kirsher CC: intel-wired-...@lists.osuosl.org CC: netdev@vger.kernel.org Signed-off-by: Jarod Wilson --- drivers/net/ethernet/intel/e1000e/82571.c | 6 ++- drivers/net/ethernet/intel/e1000e/e1000.h | 1 + drivers/net/ethernet/intel/e1000e/netdev.c | 66 ++ 3 files changed, 44 insertions(+), 29 deletions(-) diff --git a/drivers/net/ethernet/intel/e1000e/82571.c b/drivers/net/ethernet/intel/e1000e/82571.c index 7fd4d54..6b03c85 100644 --- a/drivers/net/ethernet/intel/e1000e/82571.c +++ b/drivers/net/ethernet/intel/e1000e/82571.c @@ -2032,7 +2032,8 @@ const struct e1000_info e1000_82574_info = { | FLAG2_DISABLE_ASPM_L0S | FLAG2_DISABLE_ASPM_L1 | FLAG2_NO_DISABLE_RX - | FLAG2_DMA_BURST, + | FLAG2_DMA_BURST + | FLAG2_CHECK_SYSTIM_OVERFLOW, .pba= 32, .max_hw_frame_size = DEFAULT_JUMBO, .get_variants = e1000_get_variants_82571, @@ -2053,7 +2054,8 @@ const struct e1000_info e1000_82583_info = { | FLAG_HAS_CTRLEXT_ON_LOAD, .flags2 = FLAG2_DISABLE_ASPM_L0S | FLAG2_DISABLE_ASPM_L1 - | FLAG2_NO_DISABLE_RX, + | FLAG2_NO_DISABLE_RX + | FLAG2_CHECK_SYSTIM_OVERFLOW, .pba= 32, .max_hw_frame_size = DEFAULT_JUMBO, .get_variants = e1000_get_variants_82571, diff --git a/drivers/net/ethernet/intel/e1000e/e1000.h b/drivers/net/ethernet/intel/e1000e/e1000.h index ef96cd1..879cca4 100644 --- a/drivers/net/ethernet/intel/e1000e/e1000.h +++ b/drivers/net/ethernet/intel/e1000e/e1000.h @@ -452,6 +452,7 @@ s32 e1000e_get_base_timinca(struct e1000_adapter *adapter, u32 *timinca); #define FLAG2_PCIM2PCI_ARBITER_WA BIT(11) #define FLAG2_DFLT_CRC_STRIPPING BIT(12) #define FLAG2_CHECK_RX_HWTSTAMP BIT(13) +#define FLAG2_CHECK_SYSTIM_OVERFLOW BIT(14) #define E1000_RX_DESC_PS(R, i) \ (&(((union e1000_rx_desc_packet_split *)((R).desc))[i])) diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c index 41f32c0..1b2df11 100644 --- a/drivers/net/ethernet/intel/e1000e/netdev.c +++ b/drivers/net/ethernet/intel/e1000e/netdev.c @@ -4303,6 +4303,42 @@ void e1000e_reinit_locked(struct e1000_adapter *adapter) } /** + * e1000e_sanitize_systim - sanitize raw cycle counter reads + * @hw: pointer to the HW structure + * @systim: cycle_t value read, sanitized and returned + * + * Errata for 82574/82583 possible bad bits read from SYSTIMH/L: + * check to see that the time is incrementing at a reasonable + * rate and is a multiple of incvalue. + **/ +static cycle_t e1000e_sanitize_systim(struct e1000_hw *hw, cycle_t +systim) { + u64 time_delta, rem, temp; + cycle_t systim_next; + u32 incvalue; + int i; + + incvalue = er32(TIMINCA) & E1000_TIMINCA_INCVALUE_MASK; + for (i = 0; i < E1000_MAX_82574_SYSTIM_REREADS; i++) { + /* latch SYSTIMH on read of SYSTIML */ + systim_next = (cycle_t)er32(SYSTIML); + systim_next |= (cycle_t)er32(SYSTIMH) << 32; + + time_delta = systim_next - systim; + temp = time_delta; + /* VMWare users have seen incvalue of zero, don't div / 0 */ + rem = incvalue ? do_div(temp, incvalue) : (time_delta != 0); + + systim = systim_next; + + if ((time_delta < E1000_82574_SYSTIM_EPSILON) && (rem == 0)) + break; + } + + return systim; +} + +/** * e1000e_cyclecounter_read - read raw cycle counter (used by time counter) * @cc: cyclecounter structure **/ @@ -4312,7 +4348,7 @@ static cycle_t e1000e_cyclecounter_read(const struct cyclecounter *cc) cc); struct e1000_hw *hw = >hw; u32 systimel, systimeh; - cycle_t systim, systim_next; + cycle_t systim; /* SYSTIMH