Re: [Linuxptp-devel] Despite the patch, "timed out while polling for tx timestamp" keeps happening

2018-03-16 Thread Keller, Jacob E
> -----Original Message-----
> From: Richard Cochran [mailto:richardcoch...@gmail.com]
> Sent: Friday, March 16, 2018 7:59 AM
> To: Frantisek Rysanek <frantisek.rysa...@post.cz>
> Cc: linuxptp-devel@lists.sourceforge.net
> Subject: Re: [Linuxptp-devel] Despite the patch, "timed out while polling for
> tx timestamp" keeps happening
> 
> On Fri, Mar 16, 2018 at 09:37:59AM +0100, Frantisek Rysanek wrote:
> > let me just briefly follow up on the past thread on "TX timestamp
> > timeouts". To sum up: on my part the problem seems gone.
> 
> ...
> 
> > On Tuesday I had another chance to try PTP against the TC switch at a
> > substation with IEC61850 running. The "sampled values" (not GOOSE)
> > traffic is a rapid fire of 3.5 Mbps in 124B packets, i.e. about
> > 3.5kpps, on a 100Mbps network. The PTP is mixed in that.
> 
> (Ah, 61850, SV, GOOSE - it brings back bad memories ;)
> 
> > And, ptp4l on the Intel NICs worked fine.
> 
> Glad to hear that.
> 
> Thanks,
> Richard
> 

Yep, same here! Always glad to see the issues resolved.

Regards,
Jake


--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Linuxptp-devel mailing list
Linuxptp-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linuxptp-devel


Re: [Linuxptp-devel] Despite the patch, "timed out while polling for tx timestamp" keeps happening

2018-03-16 Thread Richard Cochran
On Fri, Mar 16, 2018 at 09:37:59AM +0100, Frantisek Rysanek wrote:
> let me just briefly follow up on the past thread on "TX timestamp 
> timeouts". To sum up: on my part the problem seems gone.

...

> On Tuesday I had another chance to try PTP against the TC switch at a
> substation with IEC61850 running. The "sampled values" (not GOOSE) 
> traffic is a rapid fire of 3.5 Mbps in 124B packets, i.e. about 
> 3.5kpps, on a 100Mbps network. The PTP is mixed in that.

(Ah, 61850, SV, GOOSE - it brings back bad memories ;)

> And, ptp4l on the Intel NICs worked fine.

Glad to hear that.

Thanks,
Richard



Re: [Linuxptp-devel] Despite the patch, "timed out while polling for tx timestamp" keeps happening

2018-03-16 Thread Frantisek Rysanek
Dear gents,

let me just briefly follow up on the past thread on "TX timestamp 
timeouts". To sum up: on my part the problem seems gone.

In the three months that have passed, I've built another computer
with some i210 and i82574 NICs inside (related to my thread on
precise timestamping when capturing). I took some unsold ATX mobo
with quad Intel NICs and added some i210s in PCI-e slots.
I also ended up with some fiber NICs that cannot be synced by
external PPS, and thus I needed to sync my system timebase too, and I
did that by PTP on one of the Intel NICs.
This new computer already has Debian 9 Stretch, and the kernel is 
4.14.10.

On Tuesday I had another chance to try PTP against the TC switch at a
substation with IEC61850 running. The "sampled values" (not GOOSE)
traffic is a rapid fire of 3.5 Mbps in 124B packets, i.e. about
3.5kpps, on a 100Mbps network. The PTP is mixed in that.
And, ptp4l on the Intel NICs worked fine.
This time I prepared the interface for PTP (and for the capturing 
too) using the following script:

# $netdev must already hold the interface name (e.g. netdev=eth0)
ethtool --set-eee $netdev eee off
ethtool --pause $netdev rx off tx off autoneg off
# unfortunately, forced mdix off is exclusive with autoneg off :-(
#ethtool -s $netdev mdix off
ethtool -s $netdev speed 100 duplex full autoneg off
ip link set dev $netdev up

There was not a hint of any problem, neither in ptp4l (TX timestamp 
timeout) nor in t-shark doing the HW-timestamped capturing.

This is probably my last message with this subject, 
thanks for all your attention and help.

Frank

On 9 Dec 2017 at 16:26, linuxptp-devel@lists.sourcefo wrote:
> On 7 Dec 2017 at 21:55, Keller, Jacob E wrote:
> > > About ethtool stats - I now understand that you mean the output of
> > > ethtool -S, namely the lines
> > >  tx_hwtstamp_timeouts: 0
> > >  tx_hwtstamp_skipped: 0
> > >  rx_hwtstamp_cleared: 0
> > > This is what they look like now, that the error does not occur.
> > > In a few days I will probably have a chance to try it in the field
> > > again, on a PTP TC switch with GOOSE flooding the network... that's
> > > where the misbehavior was most stubborn. Well now I know what to look
> > > at :-) I'll report more numbers when I have some.
> > > 
> > 
> > Ok good. If you see Tx timeouts again, try to measure the stats here
> > and see if any of these increment. If they do, that's a sure
> > indication that the driver was not able to obtain and send the
> > timestamp to the stack. If they *do not* increment, then that means
> > that the driver was likely too late when responding with the Tx
> > timestamp, which is a separate problem. 
> >
> Thanks for the useful information!
> 
> > Oh.. It's possible that the
> > device might be going to sleep too quickly.. can you check to see if
> > it supports EEE? "ethtool --show-eee "? This causes the device to
> > go into low power link mode which substantially increases the latency
> > for actual Tx packets (when there's little to no traffic). That might
> > be the reason under some circumstances why you see dropped timestamps,
> > if EEE is enabled? 
> >
> 
> So uh... ASPM on PCI-e may have been eliminated a couple years ago, 
> but we still have the Ethernet-level energy saving to sort out?
> I recall that even low-end unmanaged SoHo switches now support some
> "green" functions on Gb Eth... makes me wonder if EEE is the technical
> substance of the "green" word printed on D-Link's packaging carton.
> Wikipedia mentions 802.3az... ahh yes, and so does "man ethtool",
> referring to "eee".
> 
> # ethtool --show-eee eth0
> EEE Settings for eth0:
>         EEE status: enabled - inactive
>         Tx LPI: 0 (us)
>         Supported EEE link modes:  100baseT/Full
>                                    1000baseT/Full
>         Advertised EEE link modes:  100baseT/Full
>                                     1000baseT/Full
>         Link partner advertised EEE link modes:  Not reported
> 
> The Ethernet port is now connected straight to a 2nd-generation 
> Meinberg grandmaster card, called the IMS-TSU. I understand the 
> ethtool listing in the following way:
> the IMS-TSU does not support EEE.
> All the better.
> 
> A month ago, also in my lab, the eth0 was attached directly to a 
> 3rd-generation Meinberg grandmaster card, called the HPS100.
> Not sure, maybe it supports EEE? That's when the TX timeouts were 
> occurring a couple times a day, when combined with my i219LM eth0.
> 
> On site, eth0 was attached to a switch by Ruggedcom, working as a TC. 
> Makes me wonder if the switch supports EEE. Time for me to check the 
> manual. Makes me wonder if EEE combined with the GOOSE multicasts 
> could result in the high frequency of TX timestamp timeouts 
> observed...
> 
> Thanks for all the tips! :-)
> 
> Frank




Re: [Linuxptp-devel] Despite the patch, "timed out while polling for tx timestamp" keeps happening

2017-12-11 Thread Frantisek Rysanek
Dear everyone (Mr. Keller in particular),

in the process of fiddling with an addon i210-T1 received today 
(original Intel board SKU) I have discovered minor havoc in my 
previously posted data.

Over the past few days/weeks I was delighted at how marvellous the 
onboard i219LM was, while in fact, it turns out that I've been using 
the second onboard NIC, the i210, all along for PTP. 
All the praise goes to the i210. The i219LM is inferior in my PC.
I'm re-attaching some samples from a running ptp4l.
The files also contain a corresponding dump of ethtool -T .

I was originally orienting myself by the contents of "dmesg". By the 
eth0 vs. eth1 originally reported upon device detection. 
Only today I started to smell a rat (because the addon i210 behaved
so very well) and after some fumbling in dmesg, I have noticed that
systemd would rename (swap) eth1 for eth0, all along, probably since 
my kernel upgrade (which I did very early on).
Today with the addon board, systemd does a triple rename:
eth0 -> eth1
eth1 -> eth2
eth2 -> eth0
:-)

Trying to trace the renames in dmesg is prone to confusion.
Ultimately my favourite way of mapping the netdevice names
to PCI devices is a combination of the following two commands:
ethtool -i 
and
lspci
The output of ethtool -i contains a row labeled "bus-info", which 
quotes the familiar bus:device.function triplet, matching those 
listed by lspci.
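
A minimal sketch of that mapping, assuming ethtool and lspci are installed. The function names bus_info and list_netdev_pci are made up for illustration; the only thing taken from the thread is the "bus-info" row of ethtool -i and the matching address in lspci.

```shell
#!/bin/sh
# Sketch only: map network device names to PCI devices.
# Function names are hypothetical, not part of ethtool or lspci.

# Read `ethtool -i` output on stdin and print the value of the bus-info row.
bus_info() {
    sed -n 's/^bus-info:[[:space:]]*//p'
}

# Walk all netdevs known to the kernel and print "name -> lspci line".
# Devices without a PCI bus-info (e.g. lo) are skipped.
list_netdev_pci() {
    for dev in /sys/class/net/*; do
        name=${dev##*/}
        addr=$(ethtool -i "$name" 2>/dev/null | bus_info)
        if [ -n "$addr" ]; then
            printf '%s -> %s\n' "$name" "$(lspci -s "$addr")"
        fi
    done
}

# Usage: list_netdev_pci
```

Because the lookup goes through the name the kernel currently uses, it stays valid after systemd has renamed the interfaces.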

Which means that I have a tool that's capable of HW timestamping
with an error within maybe 20-30 ns. Now for the PPS input and 
distribution across some 4 boards... I've already ordered some 
74LVC1G125 to work as level shifters.

Frank Rysanek

Attachments (Unix text):
 i210_onboard_arbor.txt                 11 Dec 2017, 15:06   2576 bytes
 i219LM_onboard_arbor_PCH_LOM.txt       11 Dec 2017, 15:05   2729 bytes
 i210_addon_intel_original_i210-T1.txt  11 Dec 2017, 15:07   2504 bytes


Re: [Linuxptp-devel] Despite the patch, "timed out while polling for tx timestamp" keeps happening

2017-12-09 Thread Frantisek Rysanek
On 7 Dec 2017 at 21:55, Keller, Jacob E wrote:
> > About ethtool stats - I now understand that you mean the output of
> > ethtool -S, namely the lines
> >  tx_hwtstamp_timeouts: 0
> >  tx_hwtstamp_skipped: 0
> >  rx_hwtstamp_cleared: 0
> > This is what they look like now, that the error does not occur.
> > In a few days I will probably have a chance to try it in the field
> > again, on a PTP TC switch with GOOSE flooding the network... that's
> > where the misbehavior was most stubborn. Well now I know what to look
> > at :-) I'll report more numbers when I have some.
> > 
> 
> Ok good. If you see Tx timeouts again, try to measure the stats here
> and see if any of these increment. If they do, that's a sure
> indication that the driver was not able to obtain and send the
> timestamp to the stack. If they *do not* increment, then that means
> that the driver was likely too late when responding with the Tx
> timestamp, which is a separate problem. 
>
Thanks for the useful information!

> Oh.. It's possible that the
> device might be going to sleep too quickly.. can you check to see if
> it supports EEE? "ethtool --show-eee "? This causes the device to
> go into low power link mode which substantially increases the latency
> for actual Tx packets (when there's little to no traffic). That might
> be the reason under some circumstances why you see dropped timestamps,
> if EEE is enabled? 
>

So uh... ASPM on PCI-e may have been eliminated a couple years ago, 
but we still have the Ethernet-level energy saving to sort out?
I recall that even low-end unmanaged SoHo switches now support some
"green" functions on Gb Eth... makes me wonder if EEE is the technical
substance of the "green" word printed on D-Link's packaging carton.
Wikipedia mentions 802.3az... ahh yes, and so does "man ethtool",
referring to "eee".

# ethtool --show-eee eth0
EEE Settings for eth0:
        EEE status: enabled - inactive
        Tx LPI: 0 (us)
        Supported EEE link modes:  100baseT/Full
                                   1000baseT/Full
        Advertised EEE link modes:  100baseT/Full
                                    1000baseT/Full
        Link partner advertised EEE link modes:  Not reported

The Ethernet port is now connected straight to a 2nd-generation 
Meinberg grandmaster card, called the IMS-TSU. I understand the 
ethtool listing in the following way:
the IMS-TSU does not support EEE.
All the better.
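
The same reading of the listing can be scripted. This is a sketch under the assumption that the output keeps the two labels shown above ("EEE status:" and "Link partner advertised EEE link modes:"); the function names are hypothetical, and treating "Not reported" as "the partner does not do EEE" is the interpretation given above, not something ethtool guarantees.

```shell
#!/bin/sh
# Sketch only: summarise `ethtool --show-eee` output read from stdin.
# Function names are hypothetical.

# Print the text that follows "EEE status:", e.g. "enabled - inactive".
eee_status() {
    sed -n 's/^[[:space:]]*EEE status:[[:space:]]*//p'
}

# Succeed (exit 0) if the link partner advertised no EEE link modes.
partner_lacks_eee() {
    grep -q 'Link partner advertised EEE link modes:[[:space:]]*Not reported'
}

# Usage (eth0 is an assumed interface name):
#   ethtool --show-eee eth0 | eee_status
#   ethtool --show-eee eth0 | partner_lacks_eee && echo "partner reports no EEE"
```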

A month ago, also in my lab, the eth0 was attached directly to a 
3rd-generation Meinberg grandmaster card, called the HPS100.
Not sure, maybe it supports EEE? That's when the TX timeouts were 
occurring a couple times a day, when combined with my i219LM eth0.

On site, eth0 was attached to a switch by Ruggedcom, working as a TC. 
Makes me wonder if the switch supports EEE. Time for me to check the 
manual. Makes me wonder if EEE combined with the GOOSE multicasts 
could result in the high frequency of TX timestamp timeouts 
observed...

Thanks for all the tips! :-)

Frank



Re: [Linuxptp-devel] Despite the patch, "timed out while polling for tx timestamp" keeps happening

2017-12-07 Thread Keller, Jacob E
> -----Original Message-----
> From: Frantisek Rysanek [mailto:frantisek.rysa...@post.cz]
> Sent: Thursday, December 07, 2017 12:41 PM
> To: linuxptp-devel@lists.sourceforge.net
> Subject: Re: [Linuxptp-devel] Despite the patch, "timed out while polling for
> tx timestamp" keeps happening
> 
> On 7 Dec 2017 at 18:47, Keller, Jacob E wrote:
> 
> > > => the Intel NIC hardware is possibly sensitive to "irrelevant"
> > > contents in the traffic. I can come up with the following candidate
> > > culprits/theories:
> > > - absence of the VLAN tag
> > > - correction values of 10-20 ms
> > > - other mcast traffic interfering
> > > - higher/different actual jitter in the messages?
> > >
> > > > Which device (and driver) are you using? (I can't see it in the 
> > > > history).
> > > >
> > > On the ptp4l client?
> > > The PC is a pre-production engineering sample panel PC by Arbor/TW,
> > > with Intel Skylake mobile, the NIC that I'm using is an i219LM
> > > integrated on the motherboard (not sure if this has a MAC on chip
> > > within the PCH/south, or if it's a stand-alone NIC). Of the two Intel
> > > NIC chips, this one is more precise. The kernel is a fresh vanilla
> > > 4.13.12 and the e1000e driver came with it.
> > > I'm attaching a dump of dmesg and lspci. Ask for more if you want.
> > >
> > > Frank Rysanek
> >
> > Do you know the packet rate for Tx packets? (How often is it
> > requesting timestamps)? There was a recent-ish problem I believe we
> > fixed but it appears to be in 4.13: 5012863b7347 ("e1000e: fix race
> > condition around skb_tstamp_tx()", 2017-06-06), but that definitely
> > should be in the 4.13 kernel..
> >
> > There should also be statistics you can check in ethtool stats on the
> > device. Could you try checking if tx_hwtstamp_timeouts is
> > incrementing? Also whether tx_hwtstamp_skipped?
> >
> > Thanks,
> > Jake
> 
> Dear Mr. Keller, thanks for your immediate responses and for the job
> that you're doing on the drivers. You have my deepest respect.
> 
> Yes that patch is in my e1000e driver:
> https://patchwork.ozlabs.org/patch/758160/
> That's the patch mentioned in the subject of this e-mail thread :-)
> 
> ptp4l sends one PDelay Request per second, and answers one
> PDelay Request received from the upstream switch (per second).
> That's three PTP messages transmitted per second.
> There is no other TX traffic on that same port.
> 

Ok, so that means it probably isn't caused by too many requests for the device 
to handle.

> About ethtool stats - I now understand that you mean the output of
> ethtool -S, namely the lines
>  tx_hwtstamp_timeouts: 0
>  tx_hwtstamp_skipped: 0
>  rx_hwtstamp_cleared: 0
> This is what they look like now, that the error does not occur.
> In a few days I will probably have a chance to try it in the field
> again, on a PTP TC switch with GOOSE flooding the network... that's
> where the misbehavior was most stubborn. Well now I know what to look
> at :-) I'll report more numbers when I have some.
> 

Ok good. If you see Tx timeouts again, try to measure the stats here and see if 
any of these increment. If they do, that's a sure indication that the driver 
was not able to obtain and send the timestamp to the stack. If they *do not* 
increment, then that means that the driver was likely too late when responding 
with the Tx timestamp, which is a separate problem. Oh.. It's possible that the 
device might be going to sleep too quickly.. can you check to see if it 
supports EEE? "ethtool --show-eee "? This causes the device to go into low 
power link mode which substantially increases the latency for actual Tx packets 
(when there's little to no traffic). That might be the reason under some 
circumstances why you see dropped timestamps, if EEE is enabled?
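
One way to run that comparison is to snapshot the counters before and after a timeout shows up in the ptp4l log. The sketch below assumes the counters appear in ethtool -S as "name: value" lines containing "hwtstamp", as in the listing quoted above; the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: compare the hardware-timestamp counters between two snapshots.
# Function names are hypothetical.

# Keep only the *hwtstamp* lines of `ethtool -S` output read from stdin,
# normalised to "name value".
hwtstamp_counters() {
    awk -F': *' '/hwtstamp/ { gsub(/^[ \t]+/, "", $1); print $1, $2 }'
}

# Print "name before -> after" for every counter whose value differs
# between snapshot file $1 and snapshot file $2.
hwtstamp_changed() {
    awk 'NR == FNR { before[$1] = $2; next }
         ($1 in before) && before[$1] != $2 { print $1, before[$1], "->", $2 }' "$1" "$2"
}

# Usage (eth0 is an assumed interface name):
#   ethtool -S eth0 | hwtstamp_counters > before.txt
#   ... wait until ptp4l logs a tx timestamp timeout ...
#   ethtool -S eth0 | hwtstamp_counters > after.txt
#   hwtstamp_changed before.txt after.txt
```

Any line printed points at the driver failing to obtain the timestamp; empty output alongside a logged timeout points at the timestamp arriving too late.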

> BTW do you know what volume of RX buffers does the i219LM have on
> chip? Or its companion MAC integrated in the PCH, if the i219 is just
> a PHY.
> 
> Frank Rysanek
> 

The i219 is a MAC, however I don't know the volume of buffers on the chip 
unfortunately. Most of my work is on the i40e, fm10k, and ixgbe, though I've 
helped some of the work on PTP for other parts. (And, as far as I know, I'm the 
only one here who monitors the ptp4l list directly).

Thanks,
Jake




Re: [Linuxptp-devel] Despite the patch, "timed out while polling for tx timestamp" keeps happening

2017-12-07 Thread Frantisek Rysanek
On 7 Dec 2017 at 18:47, Keller, Jacob E wrote:

> > => the Intel NIC hardware is possibly sensitive to "irrelevant"
> > contents in the traffic. I can come up with the following candidate
> > culprits/theories:
> > - absence of the VLAN tag
> > - correction values of 10-20 ms
> > - other mcast traffic interfering
> > - higher/different actual jitter in the messages?
> > 
> > > Which device (and driver) are you using? (I can't see it in the history).
> > >
> > On the ptp4l client?
> > The PC is a pre-production engineering sample panel PC by Arbor/TW,
> > with Intel Skylake mobile, the NIC that I'm using is an i219LM
> > integrated on the motherboard (not sure if this has a MAC on chip
> > within the PCH/south, or if it's a stand-alone NIC). Of the two Intel
> > NIC chips, this one is more precise. The kernel is a fresh vanilla
> > 4.13.12 and the e1000e driver came with it.
> > I'm attaching a dump of dmesg and lspci. Ask for more if you want.
> > 
> > Frank Rysanek
> 
> Do you know the packet rate for Tx packets? (How often is it
> requesting timestamps)? There was a recent-ish problem I believe we
> fixed but it appears to be in 4.13: 5012863b7347 ("e1000e: fix race
> condition around skb_tstamp_tx()", 2017-06-06), but that definitely
> should be in the 4.13 kernel.. 
> 
> There should also be statistics you can check in ethtool stats on the
> device. Could you try checking if tx_hwtstamp_timeouts is
> incrementing? Also whether tx_hwtstamp_skipped? 
> 
> Thanks,
> Jake

Dear Mr. Keller, thanks for your immediate responses and for the job 
that you're doing on the drivers. You have my deepest respect.

Yes that patch is in my e1000e driver:
https://patchwork.ozlabs.org/patch/758160/
That's the patch mentioned in the subject of this e-mail thread :-)

ptp4l sends one PDelay Request per second, and answers one
PDelay Request received from the upstream switch (per second).
That's three PTP messages transmitted per second.
There is no other TX traffic on that same port.

About ethtool stats - I now understand that you mean the output of 
ethtool -S, namely the lines
 tx_hwtstamp_timeouts: 0
 tx_hwtstamp_skipped: 0
 rx_hwtstamp_cleared: 0
This is what they look like now, that the error does not occur.
In a few days I will probably have a chance to try it in the field 
again, on a PTP TC switch with GOOSE flooding the network... that's
where the misbehavior was most stubborn. Well now I know what to look 
at :-) I'll report more numbers when I have some.

BTW do you know what volume of RX buffers does the i219LM have on 
chip? Or its companion MAC integrated in the PCH, if the i219 is just 
a PHY.

Frank Rysanek



Re: [Linuxptp-devel] Despite the patch, "timed out while polling for tx timestamp" keeps happening

2017-12-07 Thread Keller, Jacob E
> -----Original Message-----
> From: Frantisek Rysanek [mailto:frantisek.rysa...@post.cz]
> Sent: Thursday, December 07, 2017 8:28 AM
> To: linuxptp-devel@lists.sourceforge.net
> Subject: Re: [Linuxptp-devel] Despite the patch, "timed out while polling for
> tx timestamp" keeps happening
>
> => the Intel NIC hardware is possibly sensitive to "irrelevant"
> contents in the traffic. I can come up with the following candidate
> culprits/theories:
> - absence of the VLAN tag
> - correction values of 10-20 ms
> - other mcast traffic interfering
> - higher/different actual jitter in the messages?
> 
> > Which device (and driver) are you using? (I can't see it in the history).
> >
> On the ptp4l client?
> The PC is a pre-production engineering sample panel PC by Arbor/TW,
> with Intel Skylake mobile, the NIC that I'm using is an i219LM
> integrated on the motherboard (not sure if this has a MAC on chip
> within the PCH/south, or if it's a stand-alone NIC). Of the two Intel
> NIC chips, this one is more precise. The kernel is a fresh vanilla
> 4.13.12 and the e1000e driver came with it.
> I'm attaching a dump of dmesg and lspci. Ask for more if you want.
> 
> Frank Rysanek

Do you know the packet rate for Tx packets? (How often is it requesting 
timestamps)? There was a recent-ish problem I believe we fixed but it appears 
to be in 4.13: 5012863b7347 ("e1000e: fix race condition around 
skb_tstamp_tx()", 2017-06-06), but that definitely should be in the 4.13 
kernel..

There should also be statistics you can check in ethtool stats on the device. 
Could you try checking if tx_hwtstamp_timeouts is incrementing? Also whether 
tx_hwtstamp_skipped?

Thanks,
Jake



Re: [Linuxptp-devel] Despite the patch, "timed out while polling for tx timestamp" keeps happening

2017-12-07 Thread Frantisek Rysanek
Note: I'm forwarding this message with PNG attachments removed,
as I got politely and deservedly reminded that big attachments are a 
no-no in a mailing list. Here goes the message:

> > The "correction" field inserted by the RuggedCom switch contains
> > values between 10 and 20 million raw units, that's some 150 to 300ns.
> > Sounds about appropriate. Makes me wonder if the contents of the PTP
> > traffic can make the Intel hardware puke :-/ The actual jitter, or
> > the non-zero correction field... it's strange.
> > 
Actually... this is probably wrong. The value in the correction.ns 
field is about 10 to 20 million, i.e. 10 to 20 milliseconds. I can 
see the raw value in the frame (in hex) and that's what Wireshark and 
ptpTrackHound interpret, in unison. 
And, one vendor techsupport insists that a correction value of 20 ms
is perfectly all right in a TC switch, due to SW processing of the PTP
packets. Yikes, what?
Or, is there any chance that my sniffing rig is broken? 
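
The two readings of the correction value differ by the scaling of the PTP correctionField, which IEEE 1588 defines as nanoseconds multiplied by 2^16. A small sketch of both conversions follows; the function names are made up for illustration, and which reading applies depends on whether the 10 to 20 million figure is the raw field or an already descaled nanosecond value as a dissector reports it.

```shell
#!/bin/sh
# Sketch only: the two interpretations of a correction value of 10..20 million.
# Function names are hypothetical.

# Raw correctionField (nanoseconds * 2^16) -> whole nanoseconds.
scaled_to_ns() {
    echo $(( $1 / 65536 ))
}

# Nanoseconds -> whole milliseconds.
ns_to_ms() {
    echo $(( $1 / 1000000 ))
}

# If 10000000 is the raw field:         scaled_to_ns 10000000  prints 152 (ns)
# If 10000000 is already nanoseconds:   ns_to_ms 10000000      prints 10  (ms)
```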

I've captured the PTP traffic by libpcap, 
A) either with ptp4l running in software mode as a client 
to a TC switch (with a Meinberg GM as the next upstream hop) 
B) or as a pure sniffer, listening to traffic between a 3rd-party
   client and the TC. The Intel NIC does have PTP support, but I 
   understand that it is turned off, at the time of the capture. 

Any chance that the Intel NIC hardware would mangle the correction 
field? (I hope not - after some debate in another thread, the 10-20ms 
really seem all right, even if spooky.)

I'll probably have to borrow a proper "meter" device anyway :-/

I have some other potentially interesting observations, relevant to 
ptp4l and Intel HW.

There are two GM's in play: 

GM A (older), which correlated with a problem reported on site by a 
particular 3rd-party PTP slave. Presumed buggy.

GM B (younger), whose deployment correlated with the 3rd-party slave 
becoming happy. Presumed healthy.

The 3rd-party slave is a black box, expensive, presumably 
high-quality implementation.
Let me focus on the behavior observed in ptp4l with HW accel.


I actually tried ptp4l with HW support under several slightly 
different scenarios. L2 Multicast and 2-step P2P mechanism were
common, but details were different. 

1) with "grandmaster B", directly attached at 1 Gbps, configured for 
C37.238-2017 (including ALTERNATE_TIME_OFFSET_INDICATOR_TLV),
both ends without a VLAN tag, in my lab. That worked for the most 
part, ptp4l would throw maybe 8 TX timeouts during one night (10 
hours).

2) with "grandmaster B", on site, configured for C37.238-2017 
(including ALTERNATE_TIME_OFFSET_INDICATOR_TLV),
both ends without a VLAN tag, through a PTP-capable switch
(the one adding 10-20 ms of "correction").
Here the ptp4l with HW accel would never stop choking with TX 
timeouts. Sometimes it went for 3 to 10 PDelay transactions without a
timeout, sometimes it would run timeout after timeout.
There was 3rd-party multicast traffic on the network (IEC61850 
GOOSE).

3) with "grandmaster A", on site, direct attached, configured for 
C37.238-2011 (no local timezone TLV), but *with* a VLAN tag 
containing ID=0 configured on the GM, and *without* VLAN tag on the 
ptp4l client, the ptp4l would not synchronize to the GM. In the packet
trace I can see all the messages from the GM, and ptp4l does respond 
to the master's PDelay Requests, but the GM does *not* respond to 
ptp4l's PDelay Requests.
=> I consider this a misconfiguration on my part (PEBKAC),
even though... theoretically... VLAN ID=0 means "this packet has 
802.1p priority assigned, but does not belong to a VLAN".
The GM *could* be a little more tolerant / liberal in what it accepts
:-) Then again, I do not know the wording of the 2011 "power 
profile".

4) with "grandmaster A", direct attached, back home in the lab, 
configured for C37.238-2011 (no local timezone TLV), but *with* a 
VLAN tag containing ID=0 configured on the GM, and *with* a VLAN tag 
ID=0 on the ptp4l client (created a VLAN subinterface eth0.0), 
ptp4l now RUNS LIKE A CHEETAH FOR DAYS !
No TX timeouts in the log.

=> the Intel NIC hardware is possibly sensitive to "irrelevant" 
contents in the traffic. I can come up with the following candidate 
culprits/theories: 
- absence of the VLAN tag
- correction values of 10-20 ms
- other mcast traffic interfering
- higher/different actual jitter in the messages?

> Which device (and driver) are you using? (I can't see it in the history).
> 
On the ptp4l client?
The PC is a pre-production engineering sample panel PC by Arbor/TW, 
with Intel Skylake mobile, the NIC that I'm using is an i219LM 
integrated on the motherboard (not sure if this has a MAC on chip
within the PCH/south, or if it's a stand-alone NIC). Of the two Intel
NIC chips, this one is more precise. The kernel is a fresh vanilla 
4.13.12 and the e1000e driver came with it.
I'm attaching a dump of dmesg and lspci. Ask for more if you want.

Frank Rysanek




Re: [Linuxptp-devel] Despite the patch, "timed out while polling for tx timestamp" keeps happening

2017-12-05 Thread Keller, Jacob E
> -----Original Message-----
> From: Frantisek Rysanek [mailto:frantisek.rysa...@post.cz]
> Sent: Tuesday, December 05, 2017 8:35 AM
> To: linuxptp-devel@lists.sourceforge.net
> Subject: Re: [Linuxptp-devel] Despite the patch, "timed out while polling for
> tx timestamp" keeps happening
> 
> On 13 Nov 2017 at 8:47, Richard Cochran wrote:
> > On Mon, Nov 13, 2017 at 01:54:32PM +0100, Frantisek Rysanek wrote:
> > > [...]
> > > I first tried your software on the stock kernel in Debian 8
> > > (3.16.something if memory serves) - and the "timed out while ..."
> > > errors prompted me to upgrade to the latest vanilla, which happens to
> > > be 4.13.12 at the time of this writing.
> > > The errors are now less frequent, but they do occur.
> >
> > Okay, so maybe the remaining errors are due to latency within your
> > system.  The default tx_timestamp_timeout is 1 millisecond.  Try 10
> > and see if that fixes the problem.
> >
> > > Notice how the errors become more frequent after 9 a.m.
> > > - I came to the machine and started a PCAP sniffer
> > > on that same port.
> >
> > That fact supports the idea that the cause is latency.
> >
> Apologies for not responding for almost a month...
> Other work interfering :-)
> 
> I'm back in the lab with a slightly different GrandMaster to play
> with and some more time on my hands. Still the same Intel PC.
> 
> Even if I increase the tx_timestamp_timeout to 20 ms,
> the TX timeouts still happen.
> 
> Interestingly, in the lab, against a directly attached GM,
> the timeouts are relatively rare.
> 
> Two weeks ago I was on a field trip with my GrandMasters,
> and on site my i219LM talked to the Meinberg GM through
> a RuggedCom switch. It all looked fairly good, the protocol
> clockwork seemed to work, the switch was evidently doing its job
> (the correction field was non-zero) etc.
> But, once I started ptp4l in the HW-accelerated mode,
> the TX timeouts happened so often that the client was pretty much
> useless even as a "test traffic generator". It kept falling over all
> the time! I ended up resorting to the software mode, at the price of
> some 4 orders of magnitude in precision :-( Well the purpose really was
> to generate some test traffic, so thank god for the software-only
> mode :-)
> 
> It seems that I haven't shared some details of my config yet:
> the GrandMasters that I'm playing with are destined for IEC61850
> deployment (power substation) and so they're configured for
> L2 multicast mode, and the switch needs to be a TC with peer delay
> measurement in P2P mode. So the PDelay messages are
> exchanged between immediate neighbors on physical Ethernet links
> (they don't pass through the switch). The GrandMaster's Announce,
> Sync and Follow-up do pass through the switch. I believe only the
> Follow-ups have a correction field, and that is non-zero.
> 
> The "correction" field inserted by the RuggedCom switch contains
> values between 10 and 20 million raw units, that's some 150 to 300ns.
> Sounds about appropriate. Makes me wonder if the contents of the PTP
> traffic can make the Intel hardware puke :-/ The actual jitter, or
> the non-zero correction field... it's strange.
> 
> Frank Rysanek
> 

Which device (and driver) are you using? (I can't see it in the history).

Thanks,
Jake



Re: [Linuxptp-devel] Despite the patch, "timed out while polling for tx timestamp" keeps happening

2017-11-13 Thread Richard Cochran
On Mon, Nov 13, 2017 at 01:54:32PM +0100, Frantisek Rysanek wrote:
> I have the privilege to play around with your software, on a 
> Skylake-based platform containing an Intel i219LM 
> (alongside i210 which seems to give comparably mediocre results).
> 
> I also have the privilege to test your software against a 
> latest-generation Meinberg GrandMaster (HPS100).
> 
> I first tried your software on the stock kernel in Debian 8 
> (3.16.something if memory serves) - and the "timed out while ..." 
> errors prompted me to upgrade to the latest vanilla, which happens to 
> be 4.13.12 at the time of this writing.
> The errors are now less frequent, but they do occur.

Okay, so maybe the remaining errors are due to latency within your
system.  The default tx_timestamp_timeout is 1 millisecond.  Try 10
and see if that fixes the problem.
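
A minimal sketch of trying that, assuming the option is set in the [global] section of a ptp4l configuration file. The helper name, the file name and eth0 are placeholders, not anything from this thread.

```shell
#!/bin/sh
# Sketch only: write a minimal ptp4l config that raises tx_timestamp_timeout.
# write_ptp4l_conf is a hypothetical helper.
#   $1 = output file, $2 = timeout in milliseconds
write_ptp4l_conf() {
    printf '[global]\ntx_timestamp_timeout %s\n' "$2" > "$1"
}

# Usage (file name and interface are assumptions):
#   write_ptp4l_conf ./ptp4l-test.conf 10
#   ptp4l -i eth0 -f ./ptp4l-test.conf -m
```

Any other options already in use (transport, delay mechanism and so on) would need to be carried over into the same file.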

> Notice how the errors become more frequent after 9 a.m.
> - I came to the machine and started a PCAP sniffer 
> on that same port. 

That fact supports the idea that the cause is latency.

> The one thing that has stunned me in a nice way:
> I first ran ptp4l against the GM through two managed switches,
> not very loaded, but unaware of PTP.
> That resulted in the "offset" jumping within +/- 700 "units".
> After I got the idea to plug the client straight into the GM,
> the "offset" now wanders within +/- 10 "units". 

Yes, that is the expected performance.  Store and forward switches
do introduce such errors.

> That's an improvement of two orders of magnitude :-D
> Wonderful :-)
> What are those units BTW? PPB? Nanoseconds?

Nanoseconds.

Thanks,
Richard
