Re: [casper] packets lost of a packetized correlator

2018-03-13 Thread Danny Price
Hey Homin,

I think that looks fine -- it's only an issue if they get changed by a
rogue process afterwards.

- Danny


Re: [casper] packets lost of a packetized correlator

2018-03-13 Thread Homin Jiang
Dear Danny and John:

Thanks for your suggestion. I checked the ARP table, shown below; the unused
entries are all "FF FF ...". Are you suggesting that I assign a different
address to every entry in the ARP table?

best
homin





ARP Table:
IP:  10.  0.  0.  0: MAC: FF FF FF FF FF FF
IP:  10.  0.  0.  1: MAC: FF FF FF FF FF FF
...

IP:  10.  0.  0. 19: MAC: FF FF FF FF FF FF
IP:  10.  0.  0. 20: MAC: 00 60 DD 44 9D 38
IP:  10.  0.  0. 21: MAC: FF FF FF FF FF FF
...
IP:  10.  0.  0.126: MAC: FF FF FF FF FF FF
IP:  10.  0.  0.127: MAC: FF FF FF FF FF FF
IP:  10.  0.  0.128: MAC: 02 02 0A 00 00 80
IP:  10.  0.  0.129: MAC: 02 02 0A 00 00 81
IP:  10.  0.  0.130: MAC: 02 02 0A 00 00 82
IP:  10.  0.  0.131: MAC: 02 02 0A 00 00 83
IP:  10.  0.  0.132: MAC: 02 02 0A 00 00 84
IP:  10.  0.  0.133: MAC: 02 02 0A 00 00 85
IP:  10.  0.  0.134: MAC: 02 02 0A 00 00 86
IP:  10.  0.  0.135: MAC: 02 02 0A 00 00 87
IP:  10.  0.  0.136: MAC: 02 02 0A 00 00 88
IP:  10.  0.  0.137: MAC: 02 02 0A 00 00 89
IP:  10.  0.  0.138: MAC: 02 02 0A 00 00 8A
IP:  10.  0.  0.139: MAC: 02 02 0A 00 00 8B
IP:  10.  0.  0.140: MAC: 02 02 0A 00 00 8C
IP:  10.  0.  0.141: MAC: FF FF FF FF FF FF
IP:  10.  0.  0.142: MAC: 02 02 0A 00 00 8E
IP:  10.  0.  0.143: MAC: 02 02 0A 00 00 8F
IP:  10.  0.  0.144: MAC: 02 02 0A 00 00 90
IP:  10.  0.  0.145: MAC: 02 02 0A 00 00 91
IP:  10.  0.  0.146: MAC: 02 02 0A 00 00 92
IP:  10.  0.  0.147: MAC: 02 02 0A 00 00 93
IP:  10.  0.  0.148: MAC: 02 02 0A 00 00 94
IP:  10.  0.  0.149: MAC: 02 02 0A 00 00 95
IP:  10.  0.  0.150: MAC: 02 02 0A 00 00 96
IP:  10.  0.  0.151: MAC: 02 02 0A 00 00 97
IP:  10.  0.  0.152: MAC: 02 02 0A 00 00 98
IP:  10.  0.  0.153: MAC: 02 02 0A 00 00 99
IP:  10.  0.  0.154: MAC: 02 02 0A 00 00 9A
IP:  10.  0.  0.155: MAC: 02 02 0A 00 00 9B
IP:  10.  0.  0.156: MAC: 02 02 0A 00 00 9C
IP:  10.  0.  0.157: MAC: 02 02 0A 00 00 9D
IP:  10.  0.  0.158: MAC: 02 02 0A 00 00 9E
IP:  10.  0.  0.159: MAC: 02 02 0A 00 00 9F
IP:  10.  0.  0.160: MAC: FF FF FF FF FF FF
...
IP:  10.  0.  0.255: MAC: FF FF FF FF FF FF
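
A dump like the one above can be checked mechanically rather than by eye. The following is a minimal Python sketch, not part of the original message, that parses lines in this format and lists the addresses that are still set to the broadcast MAC. The function names are hypothetical, and which addresses are expected to be populated is an assumption the caller has to supply.

```python
import re

BROADCAST = "FF:FF:FF:FF:FF:FF"

# Matches lines such as "IP:  10.  0.  0.141: MAC: FF FF FF FF FF FF".
LINE_RE = re.compile(
    r"IP:\s*(\d+)\.\s*(\d+)\.\s*(\d+)\.\s*(\d+):\s*MAC:\s*((?:[0-9A-Fa-f]{2}\s*){6})"
)


def parse_arp_dump(text):
    """Return {ip_string: mac_string} for every line that looks like an entry."""
    table = {}
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip "..." elisions and blank lines
        ip = ".".join(match.group(i) for i in range(1, 5))
        mac = ":".join(match.group(5).upper().split())
        table[ip] = mac
    return table


def broadcast_entries(table, expected_ips):
    """List the expected IPs whose entry is missing or is the broadcast MAC."""
    return [ip for ip in expected_ips if table.get(ip, BROADCAST) == BROADCAST]


dump = """
IP:  10.  0.  0.140: MAC: 02 02 0A 00 00 8C
IP:  10.  0.  0.141: MAC: FF FF FF FF FF FF
IP:  10.  0.  0.142: MAC: 02 02 0A 00 00 8E
"""
# Assumption for illustration only: these three addresses should be populated.
expected = ["10.0.0.%d" % n for n in range(140, 143)]
print(broadcast_entries(parse_arp_dump(dump), expected))
```

Running the same check periodically and comparing against the first dump would show whether entries change after start-up.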




Re: [casper] packets lost of a packetized correlator

2018-03-13 Thread Homin Jiang
Dear Danny and John:
Thanks for your suggestion. I checked the ARP table, shown below. The entries
are all "FF FF ... FF" for the unused X engine ports, except the one with
IP = 10.0.0.141. The F engine entries are all "FF FF". Are you suggesting that
I assign different numbers to all the IPs, including the unused ones?

best
homin

ARP Table:
IP:  10.  0.  0.  0: MAC: FF FF FF FF FF FF
IP:  10.  0.  0.  1: MAC: FF FF FF FF FF FF
...

IP:  10.  0.  0. 19: MAC: FF FF FF FF FF FF
IP:  10.  0.  0. 20: MAC: 00 60 DD 44 9D 38
IP:  10.  0.  0. 21: MAC: FF FF FF FF FF FF
...
IP:  10.  0.  0.126: MAC: FF FF FF FF FF FF
IP:  10.  0.  0.127: MAC: FF FF FF FF FF FF
IP:  10.  0.  0.128: MAC: 02 02 0A 00 00 80
IP:  10.  0.  0.129: MAC: 02 02 0A 00 00 81
IP:  10.  0.  0.130: MAC: 02 02 0A 00 00 82
IP:  10.  0.  0.131: MAC: 02 02 0A 00 00 83
IP:  10.  0.  0.132: MAC: 02 02 0A 00 00 84
IP:  10.  0.  0.133: MAC: 02 02 0A 00 00 85
IP:  10.  0.  0.134: MAC: 02 02 0A 00 00 86
IP:  10.  0.  0.135: MAC: 02 02 0A 00 00 87
IP:  10.  0.  0.136: MAC: 02 02 0A 00 00 88
IP:  10.  0.  0.137: MAC: 02 02 0A 00 00 89
IP:  10.  0.  0.138: MAC: 02 02 0A 00 00 8A
IP:  10.  0.  0.139: MAC: 02 02 0A 00 00 8B
IP:  10.  0.  0.140: MAC: 02 02 0A 00 00 8C
IP:  10.  0.  0.141: MAC: FF FF FF FF FF FF
IP:  10.  0.  0.142: MAC: 02 02 0A 00 00 8E
IP:  10.  0.  0.143: MAC: 02 02 0A 00 00 8F
IP:  10.  0.  0.144: MAC: 02 02 0A 00 00 90
IP:  10.  0.  0.145: MAC: 02 02 0A 00 00 91
IP:  10.  0.  0.146: MAC: 02 02 0A 00 00 92
IP:  10.  0.  0.147: MAC: 02 02 0A 00 00 93
IP:  10.  0.  0.148: MAC: 02 02 0A 00 00 94
IP:  10.  0.  0.149: MAC: 02 02 0A 00 00 95
IP:  10.  0.  0.150: MAC: 02 02 0A 00 00 96
IP:  10.  0.  0.151: MAC: 02 02 0A 00 00 97
IP:  10.  0.  0.152: MAC: 02 02 0A 00 00 98
IP:  10.  0.  0.153: MAC: 02 02 0A 00 00 99
IP:  10.  0.  0.154: MAC: 02 02 0A 00 00 9A
IP:  10.  0.  0.155: MAC: 02 02 0A 00 00 9B
IP:  10.  0.  0.156: MAC: 02 02 0A 00 00 9C
IP:  10.  0.  0.157: MAC: 02 02 0A 00 00 9D
IP:  10.  0.  0.158: MAC: 02 02 0A 00 00 9E
IP:  10.  0.  0.159: MAC: 02 02 0A 00 00 9F
IP:  10.  0.  0.160: MAC: FF FF FF FF FF FF
...
IP:  10.  0.  0.255: MAC: FF FF FF FF FF FF




Re: [casper] packets lost of a packetized correlator

2018-03-13 Thread John Ford
Hi Homin.  I think Danny's suggestion is a good one.  We have had similar
problems with the system working for a while, then packets getting lost.
Making sure that the entries in the ARP table are correct (and the yellow
block MAC addresses are correct) may solve it.  Looking at the switch
traffic with the monitoring built into it might tell you if this is a
problem.

John


Re: [casper] packets lost of a packetized correlator

2018-03-12 Thread David MacMahon
I think the tx overflow will be OK since the FPGA won't try to send more than 
10 Gbps.  I think the "rx overrun" flag would be more interesting.  But 
probably best to check both of course! :)
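
One way to act on this is to log the flags over time and note when each one first changes, so the moment of failure can be lined up with other events. The sketch below is only an illustration in Python. It assumes the design exposes the flags as software-readable counters; the names `tx_overflow` and `rx_overrun` are placeholders, and the code that reads them from the FPGA is deliberately left out because it depends on the control library in use.

```python
def first_flag_events(samples):
    """Return {flag_name: timestamp} for the first observed change of each flag.

    samples is an iterable of (timestamp, {flag_name: counter_value}) pairs,
    in time order, as collected by whatever polling loop reads the registers.
    """
    previous = None
    first_seen = {}
    for timestamp, counters in samples:
        if previous is not None:
            for name, value in counters.items():
                changed = value != previous.get(name, value)
                if changed and name not in first_seen:
                    first_seen[name] = timestamp
        previous = counters
    return first_seen


# Made-up readings taken every 10 seconds, for illustration only.
log = [
    (0, {"tx_overflow": 0, "rx_overrun": 0}),
    (10, {"tx_overflow": 0, "rx_overrun": 0}),
    (20, {"tx_overflow": 0, "rx_overrun": 3}),
    (30, {"tx_overflow": 1, "rx_overrun": 5}),
]
print(first_flag_events(log))
```

With readings like these, the rx overrun would be reported first, which is the kind of ordering that helps separate cause from consequence.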

Is the X engine clock an exact copy of the F engine clock (i.e. a common clock 
that goes through a massive splitter) or just a clock of the same frequency 
locked to the same reference (but not the exact same clock)?  Things get more 
complicated once you run F and X at different rates, so I wouldn't recommend 
that path if you can avoid it.

HTH,
Dave


-- 
You received this message because you are subscribed to the Google Groups 
"casper@lists.berkeley.edu" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to casper+unsubscr...@lists.berkeley.edu.
To post to this group, send email to casper@lists.berkeley.edu.


Re: [casper] packets lost of a packetized correlator

2018-03-12 Thread Danny Price
Hi Homin,

Could this be due to a rogue ARP process? We have seen failure modes where
the ARP configuration fills itself with FF:FF:FF:FF and starts broadcasting
UDP traffic. We hard-code the ARP table to stop this.
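
As a rough illustration of what a hard-coded table could look like, here is a minimal Python sketch. It assumes the MAC pattern visible in the ARP dump posted elsewhere in this thread, where 10.0.0.128 maps to 02 02 0A 00 00 80, that is, a 02 02 prefix followed by the four IP bytes. The function names and the choice of which hosts to populate are hypothetical, and how the table is written into the 10 GbE core is not shown because it depends on the control software.

```python
def ip_to_int(ip):
    """Convert a dotted-quad string to a 32-bit integer."""
    a, b, c, d = (int(part) for part in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d


def mac_for_ip(ip, prefix=0x0202):
    """48-bit MAC made of a 16-bit prefix followed by the 32-bit IP."""
    return (prefix << 32) | ip_to_int(ip)


def format_mac(mac):
    """Format a 48-bit integer as six space-separated hex bytes."""
    return " ".join("%02X" % ((mac >> shift) & 0xFF) for shift in range(40, -8, -8))


def build_static_arp(subnet="10.0.0", fpga_hosts=range(128, 160), fixed=None):
    """Return a 256-entry table indexed by the last octet of the IP.

    Entries not listed stay at the broadcast MAC. fixed maps a last octet
    to an explicit MAC, for example the NIC of the receiving PC.
    """
    broadcast = 0xFFFFFFFFFFFF
    table = [broadcast] * 256
    for host in fpga_hosts:
        table[host] = mac_for_ip("%s.%d" % (subnet, host))
    for host, mac in (fixed or {}).items():
        table[host] = mac
    return table


# The PC entry below is the one shown for 10.0.0.20 in the posted dump.
table = build_static_arp(fixed={20: 0x0060DD449D38})
print(format_mac(table[128]), "|", format_mac(table[20]))
```

Writing the same table to every board at start-up, and re-reading it later, would show whether anything modifies it afterwards.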

I also recall reading something similar to do with anti-flood control on
some switches; it might be worth double-checking whether there's an unusual
'feature' that turns on after some threshold is reached.

Good luck debugging!

Regards,
Danny




Re: [casper] packets lost of a packetized correlator

2018-03-12 Thread Homin Jiang
Hi Dave:

Thanks for the prompt response and suggestion.
The X engine is running at the same clock as the F engine, 2.24 GHz/8 = 280 MHz.
Perhaps I should increase the clock in the X engine?
Yes, there is a Tx overflow flag in the model; it will be the first thing for
me to check.
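
The numbers quoted in this thread can be tied together with a few lines of arithmetic. The Python sketch below is a worked example only. The 2.24 GHz clock, the divide-by-8 and the 8.96 Gbps rate come from the messages; reading "2K" as 2048 payload bytes and ignoring packet header overhead are assumptions made here for illustration.

```python
def link_budget(adc_clock_hz=2.24e9, link_rate_bps=8.96e9,
                line_rate_bps=10e9, payload_bytes=2048):
    """Derive the FPGA clock, bus width, link utilisation and packet rate."""
    fpga_clock_hz = adc_clock_hz / 8  # 2.24 GHz / 8
    return {
        "fpga_clock_mhz": fpga_clock_hz / 1e6,
        # Bits that must move per FPGA clock cycle to sustain the link rate.
        "bits_per_fpga_cycle": link_rate_bps / fpga_clock_hz,
        # Fraction of the 10 GbE line rate used by payload alone.
        "utilisation": link_rate_bps / line_rate_bps,
        # Assumes 2048-byte payloads and no header overhead.
        "packets_per_second": link_rate_bps / (payload_bytes * 8),
    }


for key, value in link_budget().items():
    print(key, value)
```

The payload alone uses about 90% of the line rate, so packet headers and inter-frame gaps eat into a fairly small margin.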

best
homin





Re: [casper] packets lost of a packetized correlator

2018-03-12 Thread David MacMahon
Hi, Homin,

The first thing to do is figure out where packet loss is actually happening.
The fact that you have to reset the 10G yellow blocks to get things going again
suggests that the X engines are not keeping up with the data rate. The F
engines will happily churn out 8.96 Gbps data regardless of the receivers'
states, and the X engines will happily churn out data regardless of the PC's
state, so it seems that the only way for the 10 GbE blocks to get confused is
if the X engines are not keeping up with the incoming data rate.  I assume the F
engine ROACH2s are being clocked via their ADCs.  How are the X engine ROACH2s
being clocked?

Assuming the F-to-X packets are going through a switch, you could query the 
switch to see what it thinks the incoming and outgoing data rates are on the 
various ports involved.
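
Most managed switches expose per-port octet counters, and a rate can be derived from two readings taken a known time apart. The Python sketch below shows only that arithmetic. How the counters are fetched (SNMP, CLI, web interface) is not covered, the function names are hypothetical, and the 64-bit counter width is an assumption.

```python
def rate_gbps(octets_start, octets_end, seconds, counter_bits=64):
    """Average rate in Gbps between two readings of a wrapping octet counter."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # The modulo handles a counter that wrapped between the two readings.
    delta = (octets_end - octets_start) % (1 << counter_bits)
    return delta * 8 / seconds / 1e9


def port_report(name, in_start, in_end, out_start, out_end, seconds):
    """One line of text with the incoming and outgoing rates of a port."""
    rx = rate_gbps(in_start, in_end, seconds)
    tx = rate_gbps(out_start, out_end, seconds)
    return "%s: in %.2f Gbps, out %.2f Gbps" % (name, rx, tx)


# Made-up counter readings 10 seconds apart, for illustration only.
print(port_report("x-engine port", 0, 0, 0, 11_200_000_000, 10.0))
```

Comparing the rate leaving each F engine port with the rate the switch forwards to the matching X engine port would show whether packets are being dropped in the switch.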

Does your design have any way of capturing the overflow flags of the 10 GbE 
cores?

Dave

> On Mar 12, 2018, at 19:39, Homin Jiang  wrote:
> 
> Dear Casperite:
> 
> We have deployed a 7-antenna (actually 8) packetized correlator on Mauna
> Loa, Hawaii. It runs at a 2.24 GHz clock, which means 8.96 Gbits per second
> for each 10G Ethernet link. The packet size is 2K. There are 8 sets of ROACH2s
> as F engines and another 8 sets of ROACH2s as X engines. Data packets from F
> to X look fine; the problem is lost packets in the integration data from the
> X engines to the computer. The 10G yellow blocks in the X engines handle the
> incoming data packets from the F engines at a data rate of 8.96 Gbps and
> output the integration data to the PC. The outgoing data rate depends on the
> integration time, which is usually longer than 0.5 second. The symptom is
> that packets are lost by specific X engines after 10 or 20 minutes, or a
> couple of hours. Once it happens, we reset all the 10G yellow blocks in F and
> X, and then the system recovers.
>
> I am not familiar with the 10G Ethernet yellow block. Any comments or
> suggestions are highly welcome.
> 
> best
> homin jiang
>   
> 