Re: rt_e1000e: Detected Hardware Unit Hang

2018-11-27 Thread Jan Kiszka via Xenomai

On 27.11.18 10:47, Means Lee wrote:
> I can't reduce the size of my patch for the e1000e realtime driver that exists
> in Xenomai, because the non-realtime driver has evolved a lot. So I am offering
> the diffs of the three files I changed when porting the driver. In my view, the
> hardware operations should stay unchanged, so I focused on the changes to
> netdev.c. I modified param.c and e1000.h to change the private data structure
> and parameters a little.


> For the questions you asked:
> - the upstream Linux driver I ported over works fine with my hardware, even
>   when I put strong pressure on it (UDP broadcast storm).
> - when I hit the hardware unit hang, the Tx completion interrupts didn't
>   disappear, but they were reduced a lot.
> - I didn't enable CONFIG_IPIPE_DEBUG_CONTEXT, but I do use locks in several
>   places.


Then this should be the next thing to do. This not only detects direct locking 
issues but also those triggered by calling into Linux functions that take normal 
locks.



> - then the TX path
>     - the interrupt registration code is shown below:
>              err = rtdm_irq_request(&adapter->irq_handle,
>                    adapter->pdev->irq, e1000_intr_msi, 0,
>                    netdev->name, adapter);
>     - I write the new ring buffer index to the TDT register to notify the
>       hardware that there is a packet to send.
>              writel(tx_ring->next_to_use, tx_ring->tail);
>              /* after this writel, the interrupt routine should run */
>     - If the 'event flow' means the events during the transmit process, the
>       event


I mean specifically whether both the vanilla driver and the ported version
take the same code path and receive the same interrupts when sending packets.
Of course, you cannot put an identical packet load on both because the higher
RTnet layers do not exist for vanilla Linux. You may capture the outgoing
traffic under RTnet (RTcap) and replay that under Linux.
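
One way to make this comparison concrete is to add the same lightweight counters to both drivers and compare the totals after replaying identical traffic. The following is only a user-space sketch of that idea: the names `tx_event`, `tx_trace`, `trace_hit` and `trace_diff` are hypothetical and exist in neither driver. In a real driver the counters would be bumped from the xmit path and from the interrupt handler.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical instrumentation points; not part of either driver. */
enum tx_event { EV_XMIT, EV_TAIL_WRITE, EV_IRQ, EV_TX_CLEAN, EV_MAX };

static const char *const ev_name[EV_MAX] = {
    "xmit_frame", "tail_write", "irq", "tx_clean"
};

struct tx_trace {
    unsigned long count[EV_MAX];   /* one counter per instrumentation point */
};

/* Record that one instrumentation point was reached. */
static void trace_hit(struct tx_trace *t, enum tx_event ev)
{
    assert(ev < EV_MAX);
    t->count[ev]++;
}

/* Print both columns side by side and return how many counters differ. */
static int trace_diff(const struct tx_trace *vanilla,
                      const struct tx_trace *ported)
{
    int i, mismatches = 0;

    for (i = 0; i < EV_MAX; i++) {
        printf("%-12s vanilla=%lu ported=%lu\n", ev_name[i],
               vanilla->count[i], ported->count[i]);
        if (vanilla->count[i] != ported->count[i])
            mismatches++;
    }
    return mismatches;
}
```

If the ported driver shows the same number of `xmit_frame` and `tail_write` hits as the vanilla one but fewer `irq` or `tx_clean` hits for the same replayed traffic, the divergence is on the completion side rather than on the submission side.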



> flow is shown below:
>      e1000e_xmit_frame        send a packet atomically
>           e1000_tx_map        use DMA to map the packet (mapped before, so
>                               just get the DMA address)
>           e1000_tx_queue      make sure the tx ring buffer index is right
>           write the TDT register to tell the hardware to send a packet
> After the hardware has sent a packet, it is supposed to trigger a TX
> completion interrupt, and the driver should respond:
>      e1000_intr_msi           triggered on Tx/Rx completion
>           e1000_clean_tx_irq  recycle the transmit resources
>
> By the way, I found that every time the master station sends a Ready frame,
> which belongs to RTcfg, this Tx hang shows up. The communication before that
> works fine: the TDMA sync frames are sent properly and every stage before the
> Ready frame goes well.
> If I let it stay in the RTCFG_MAIN_CLIENT_2 stage (by then the master and
> slave know each other), master and slave can communicate properly. So the
> Ready frame triggers this problem, but why? Why would a frame of a specific
> format trigger the hardware hang?


RTcfg is unlikely to be the reason, but maybe the transmission pattern triggers 
the issue in the driver.


Jan

--
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux



Re: rt_e1000e: Detected Hardware Unit Hang

2018-11-27 Thread Means Lee via Xenomai
I can't reduce the size of my patch for the e1000e realtime driver that exists
in Xenomai, because the non-realtime driver has evolved a lot. So I am offering
the diffs of the three files I changed when porting the driver. In my view,
the hardware operations should stay unchanged, so I focused on the changes to
netdev.c. I modified param.c and e1000.h to change the private data structure
and parameters a little.

For the questions you asked:
- the upstream Linux driver I ported over works fine with my hardware, even
  when I put strong pressure on it (UDP broadcast storm).
- when I hit the hardware unit hang, the Tx completion interrupts didn't
  disappear, but they were reduced a lot.
- I didn't enable CONFIG_IPIPE_DEBUG_CONTEXT, but I do use locks in several
  places.
- then the TX path
   - the interrupt registration code is shown below:
         err = rtdm_irq_request(&adapter->irq_handle,
               adapter->pdev->irq, e1000_intr_msi, 0,
               netdev->name, adapter);
   - I write the new ring buffer index to the TDT register to notify the
     hardware that there is a packet to send.
         writel(tx_ring->next_to_use, tx_ring->tail);
         /* after this writel, the interrupt routine should run */
   - If the 'event flow' means the events during the transmit process, the
     event flow is shown below:
         e1000e_xmit_frame        send a packet atomically
              e1000_tx_map        use DMA to map the packet (mapped before,
                                  so just get the DMA address)
              e1000_tx_queue      make sure the tx ring buffer index is right
              write the TDT register to tell the hardware to send a packet
     After the hardware has sent a packet, it is supposed to trigger a TX
     completion interrupt, and the driver should respond:
         e1000_intr_msi           triggered on Tx/Rx completion
              e1000_clean_tx_irq  recycle the transmit resources

By the way, I found that every time the master station sends a Ready frame,
which belongs to RTcfg, this Tx hang shows up. The communication before that
works fine: the TDMA sync frames are sent properly and every stage before the
Ready frame goes well.
If I let it stay in the RTCFG_MAIN_CLIENT_2 stage (by then the master and
slave know each other), master and slave can communicate properly. So the
Ready frame triggers this problem, but why? Why would a frame of a specific
format trigger the hardware hang?
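
The transmit flow described above can be modelled in a few lines of user-space C, which may help when reasoning about which index stops moving during a hang. This is a simplified sketch and not driver code: `sim_ring` and the `sim_*` helpers are hypothetical names, the ring size is illustrative, and the only hardware detail assumed is that the NIC sets the "descriptor done" (DD) bit in a descriptor's status once it has been sent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE   8u     /* illustrative; real rings are larger */
#define TXD_STAT_DD 0x1u   /* "descriptor done", written back by the NIC */

struct sim_ring {
    uint32_t status[RING_SIZE]; /* write-back status per descriptor */
    uint32_t next_to_use;       /* next slot the driver fills */
    uint32_t next_to_clean;     /* oldest slot not yet recycled */
    uint32_t tail;              /* models the TDT register */
    uint32_t head;              /* models the TDH register */
};

/* Driver side: queue one packet, then publish the new tail (the writel step). */
static bool sim_xmit(struct sim_ring *r)
{
    uint32_t next = (r->next_to_use + 1) % RING_SIZE;

    assert(r->next_to_use < RING_SIZE);
    if (next == r->next_to_clean)
        return false;                  /* ring full */
    r->status[r->next_to_use] = 0;     /* DD stays clear until hardware is done */
    r->next_to_use = next;
    r->tail = r->next_to_use;
    return true;
}

/* Hardware side: send up to 'budget' descriptors between head and tail. */
static unsigned sim_hw_send(struct sim_ring *r, unsigned budget)
{
    unsigned sent = 0;

    while (r->head != r->tail && sent < budget) {
        r->status[r->head] |= TXD_STAT_DD;
        r->head = (r->head + 1) % RING_SIZE;
        sent++;
    }
    return sent;
}

/* Interrupt side: recycle every descriptor whose DD bit is set. */
static unsigned sim_clean_tx(struct sim_ring *r)
{
    unsigned cleaned = 0;

    while (r->next_to_clean != r->next_to_use &&
           (r->status[r->next_to_clean] & TXD_STAT_DD)) {
        r->status[r->next_to_clean] = 0;
        r->next_to_clean = (r->next_to_clean + 1) % RING_SIZE;
        cleaned++;
    }
    return cleaned;
}

/* Hang candidate: work is pending but the oldest descriptor is not done. */
static bool sim_looks_hung(const struct sim_ring *r)
{
    return r->next_to_clean != r->next_to_use &&
           !(r->status[r->next_to_clean] & TXD_STAT_DD);
}
```

In this model a "hang candidate" is only a momentary condition; a real driver additionally requires that the oldest descriptor has been pending for longer than a timeout before it reports a hang.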

Jan Kiszka wrote on Fri, Nov 23, 2018 at 7:59 PM:

> On 21.11.18 02:36, Means Lee via Xenomai wrote:
> > Sure thing. Since I ported the e1000e-rt driver from the mainline kernel
> > e1000e driver at commit 089d7720383d7bc9ca6b8824a05dfa66f80d1f41, the
> > patch file is kind of huge, so I attach them to this mail.
> > diff-with-nrt.diff is the diff between the mainline e1000e driver and my
> > ported driver.
> > diff-with-old-rt-e1000e.diff is the diff between the Xenomai v3.0.7 driver
> > and my ported driver.
> >
>
> Unified diff ("diff -u"), please. I got that offlist, which is more
> readable, but it remains huge.
>
> So, let's analyze systematically:
>   - the upstream Linux driver you ported over works fine with your
>     hardware, correct?
>   - hardware unit hang may mean that no TX completion interrupt arrived -
>     can you confirm this based on /proc/xenomai/irq?
>   - did you enable CONFIG_IPIPE_DEBUG_CONTEXT? It can reveal invalid lock
>     usage (common mistake when porting Linux drivers over)
>   - then look into the TX path
>      - is the interrupt registered properly?
>      - is packet submission happening?
>      - is any interrupt arriving?
>      - compare event flow to vanilla Linux driver (add instrumentation to
>        both)
>
> Jan
>
> --
> Siemens AG, Corporate Technology, CT RDA IOT SES-DE
> Corporate Competence Center Embedded Linux
>
-- next part --
A non-text attachment was scrubbed...
Name: e1000.diff
Type: text/x-patch
Size: 2569 bytes
Desc: not available
URL: 

-- next part --
A non-text attachment was scrubbed...
Name: param.diff
Type: text/x-patch
Size: 800 bytes
Desc: not available
URL: 

-- next part --
A non-text attachment was scrubbed...
Name: netdev.diff
Type: text/x-patch
Size: 118926 bytes
Desc: not available
URL: 



Re: rt_e1000e: Detected Hardware Unit Hang

2018-11-23 Thread Jan Kiszka via Xenomai

On 21.11.18 02:36, Means Lee via Xenomai wrote:

> Sure thing. Since I ported the e1000e-rt driver from the mainline kernel
> e1000e driver at commit 089d7720383d7bc9ca6b8824a05dfa66f80d1f41, the patch
> file is kind of huge, so I attach them to this mail.
> diff-with-nrt.diff is the diff between the mainline e1000e driver and my
> ported driver.
> diff-with-old-rt-e1000e.diff is the diff between the Xenomai v3.0.7 driver
> and my ported driver.



Unified diff ("diff -u"), please. I got that offlist, which is more readable, 
but it remains huge.


So, let's analyze systematically:
 - the upstream Linux driver you ported over works fine with your
   hardware, correct?
 - hardware unit hang may mean that no TX completion interrupt arrived -
   can you confirm this based on /proc/xenomai/irq?
 - did you enable CONFIG_IPIPE_DEBUG_CONTEXT? It can reveal invalid lock
   usage (common mistake when porting Linux drivers over)
 - then look into the TX path
    - is the interrupt registered properly?
    - is packet submission happening?
    - is any interrupt arriving?
    - compare event flow to vanilla Linux driver (add instrumentation to
      both)

Jan

--
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux



Re: rt_e1000e: Detected Hardware Unit Hang

2018-11-20 Thread Means Lee via Xenomai
Sure thing. Since I ported the e1000e-rt driver from the mainline kernel
e1000e driver at commit 089d7720383d7bc9ca6b8824a05dfa66f80d1f41, the patch
file is kind of huge, so I attach them to this mail.
diff-with-nrt.diff is the diff between the mainline e1000e driver and my
ported driver.
diff-with-old-rt-e1000e.diff is the diff between the Xenomai v3.0.7 driver
and my ported driver.

Jan Kiszka wrote on Tue, Nov 20, 2018 at 5:42 PM:

> On 19.11.18 22:57, Means Lee via Xenomai wrote:
> > I am porting a newer e1000e driver to Xenomai 3.0.7. When I use my ported
> > driver to set up an RTnet connection, something goes wrong. What I get is
> > a Hardware Unit Hang:
>
> Can you share the diff to/patch of the existing one?
>
> Jan
>
> >
> > [ 1598.783133] rt_e1000e: Detected Hardware Unit Hang:
> >   TDH
> >   TDT                  <1e>
> >   next_to_use          <1e>
> >   next_to_clean
> > buffer_info[next_to_clean]:
> >   time_stamp           <10004ea14>
> >   next_to_watch
> >   jiffies              <10004f458>
> >   next_to_watch.status <0>
> > MAC Status             <40080083>
> > PHY Status             <796d>
> > PHY 1000BASE-T Status  <3800>
> > PHY Extended Status    <3000>
> > PCI Status             <10>
> >
> > I'm sure that the transmit and receive functions work fine, because I can
> > send TDMA sync frames normally.
> > But this message is shown at the end of the RTcfg ready stage. Does
> > anybody have any clue about how to fix this bug?
> >
>
> --
> Siemens AG, Corporate Technology, CT RDA IOT SES-DE
> Corporate Competence Center Embedded Linux
>
-- next part --
A non-text attachment was scrubbed...
Name: diff-with-old-rt-e1000e.diff
Type: text/x-patch
Size: 410032 bytes
Desc: not available
URL: 
<http://xenomai.org/pipermail/xenomai/attachments/20181121/0c3b022a/attachment.bin>
-- next part --
A non-text attachment was scrubbed...
Name: diff-with-nrt.diff
Type: text/x-patch
Size: 98297 bytes
Desc: not available
URL: 
<http://xenomai.org/pipermail/xenomai/attachments/20181121/0c3b022a/attachment-0001.bin>


Re: rt_e1000e: Detected Hardware Unit Hang

2018-11-20 Thread Jan Kiszka via Xenomai

On 19.11.18 22:57, Means Lee via Xenomai wrote:

> I am porting a newer e1000e driver to Xenomai 3.0.7. When I use my ported
> driver to set up an RTnet connection, something goes wrong. What I get is a
> Hardware Unit Hang:


Can you share the diff to/patch of the existing one?

Jan



> [ 1598.783133] rt_e1000e: Detected Hardware Unit Hang:
>   TDH
>   TDT                  <1e>
>   next_to_use          <1e>
>   next_to_clean
> buffer_info[next_to_clean]:
>   time_stamp           <10004ea14>
>   next_to_watch
>   jiffies              <10004f458>
>   next_to_watch.status <0>
> MAC Status             <40080083>
> PHY Status             <796d>
> PHY 1000BASE-T Status  <3800>
> PHY Extended Status    <3000>
> PCI Status             <10>
>
> I'm sure that the transmit and receive functions work fine, because I can
> send TDMA sync frames normally.
> But this message is shown at the end of the RTcfg ready stage. Does anybody
> have any clue about how to fix this bug?



--
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux



rt_e1000e: Detected Hardware Unit Hang

2018-11-19 Thread Means Lee via Xenomai
I am porting a newer e1000e driver to Xenomai 3.0.7. When I use my ported
driver to set up an RTnet connection, something goes wrong. What I get is a
Hardware Unit Hang:

[ 1598.783133] rt_e1000e: Detected Hardware Unit Hang:
  TDH
  TDT                  <1e>
  next_to_use          <1e>
  next_to_clean
buffer_info[next_to_clean]:
  time_stamp           <10004ea14>
  next_to_watch
  jiffies              <10004f458>
  next_to_watch.status <0>
MAC Status             <40080083>
PHY Status             <796d>
PHY 1000BASE-T Status  <3800>
PHY Extended Status    <3000>
PCI Status             <10>

I'm sure that the transmit and receive functions work fine, because I can
send TDMA sync frames normally.
But this message is shown at the end of the RTcfg ready stage. Does anybody
have any clue about how to fix this bug?
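
Some of the values in the dump above can be decoded mechanically, which may help narrow things down. The sketch below is a user-space illustration, not driver code: the helper names are hypothetical, the bit masks are the ones I believe the e1000e register headers use for the STATUS register and the descriptor status, and the tick rate (HZ) is an assumption because the thread does not state it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit masks assumed to match the e1000e headers (STATUS register, TX status). */
#define STATUS_FD         0x00000001u /* full duplex */
#define STATUS_LU         0x00000002u /* link up */
#define STATUS_TXOFF      0x00000010u /* transmission paused */
#define STATUS_SPEED_MASK 0x000000C0u
#define STATUS_SPEED_1000 0x00000080u
#define TXD_STAT_DD       0x00000001u /* descriptor done */

struct mac_state {
    bool link_up, full_duplex, gigabit, tx_off;
};

/* Jiffies elapsed since the oldest pending descriptor was queued. */
static uint64_t desc_age_jiffies(uint64_t jiffies, uint64_t time_stamp)
{
    return jiffies - time_stamp;
}

/* Convert an age in jiffies to milliseconds for an assumed tick rate. */
static uint64_t age_ms(uint64_t age, unsigned hz)
{
    return age * 1000u / hz;
}

/* Split the "MAC Status" word into the fields relevant for a TX hang. */
static struct mac_state decode_mac_status(uint32_t status)
{
    struct mac_state s;

    s.link_up     = (status & STATUS_LU) != 0;
    s.full_duplex = (status & STATUS_FD) != 0;
    s.gigabit     = (status & STATUS_SPEED_MASK) == STATUS_SPEED_1000;
    s.tx_off      = (status & STATUS_TXOFF) != 0;
    return s;
}

/* True if the hardware has written back "done" for this descriptor. */
static bool desc_done(uint32_t status)
{
    return (status & TXD_STAT_DD) != 0;
}
```

Applied to the dump, `desc_age_jiffies(0x10004f458, 0x10004ea14)` gives 2628 jiffies, which would be about 2.6 s if HZ were 1000 or about 10.5 s if HZ were 250. `decode_mac_status(0x40080083)` reports link up, full duplex, 1000 Mb/s and transmission not paused, and `desc_done(0)` is false, so the oldest pending descriptor was never written back as done even though the link itself looks healthy.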