Re: [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration

2017-05-22 Thread Sameeh Jubran
On Mon, May 22, 2017 at 5:46 AM, Jason Wang  wrote:

>
>
> On 2017年05月19日 22:04, Sameeh Jubran wrote:
>
>> On Fri, May 19, 2017 at 9:25 AM, Jason Wang  wrote:
>>
>>
>>> On 2017年05月17日 19:46, Sameeh Jubran wrote:
>>>
>>> The bug was caused by the "receive overrun" (bit #6 of the ICR register)
 interrupt
 which would be triggered post migration in a heavy traffic environment.
 Even though the
 "receive overrun" bit (#6) is masked out by the IMS register (refer to
 the log below)
 the driver still receives an interrupt as the "receive overrun" bit (#6)
 causes the
 "Other" - bit #24 of the ICR register - bit to be set as documented
 below. The driver
 handles the interrupt and clears the "Other" bit (#24) but doesn't clear
 the
 "receive overrun" bit (#6) which leads to an infinite loop. Apparently
 the Windows
 driver expects that the "receive overrun" bit and other ones -
 documented
 below - to be
 cleared when the "Other" bit (#24) is cleared.

 So to sum that up:
 1. Bit #6 of the ICR register is set by heavy traffic
 2. As a results of setting bit #6, bit #24 is set
 3. The driver receives an interrupt for bit 24 (it doesn't receieve an
 interrupt for bit #6 as it is masked out by IMS)
 4. The driver handles and clears the interrupt of bit #24
 5. Bit #6 is still set.
 6. 2 happens all over again

 The Interrupt Cause Read - ICR register:

 The ICR has the "Other" bit - bit #24 - that is set when one or more of
 the following
 ICR register's bits are set:

 LSC - bit #2, RXO - bit #6, MDAC - bit #9, SRPD - bit #16, ACK - bit
 #17,
 MNG - bit #18

 Log sample of the storm:

 27563@1494850819.411877:e1000e_irq_pending_interrupts ICR PENDING:
 0x100 (ICR: 0x815000c2, IMS: 0x1a4)
 27563@1494850819.411900:e1000e_irq_pending_interrupts ICR PENDING: 0x0
 (ICR: 0x815000c2, IMS: 0xa4)
 27563@1494850819.411915:e1000e_irq_pending_interrupts ICR PENDING: 0x0
 (ICR: 0x815000c2, IMS: 0xa4)
 27563@1494850819.412380:e1000e_irq_pending_interrupts ICR PENDING: 0x0
 (ICR: 0x815000c2, IMS: 0xa4)
 27563@1494850819.412395:e1000e_irq_pending_interrupts ICR PENDING: 0x0
 (ICR: 0x815000c2, IMS: 0xa4)
 27563@1494850819.412436:e1000e_irq_pending_interrupts ICR PENDING: 0x0
 (ICR: 0x815000c2, IMS: 0xa4)
 27563@1494850819.412441:e1000e_irq_pending_interrupts ICR PENDING: 0x0
 (ICR: 0x815000c2, IMS: 0xa4)
 27563@1494850819.412998:e1000e_irq_pending_interrupts ICR PENDING:
 0x100 (ICR: 0x815000c2, IMS: 0x1a4)

 This commit solves:
 https://bugzilla.redhat.com/show_bug.cgi?id=1447935
 https://bugzilla.redhat.com/show_bug.cgi?id=1449490

 Signed-off-by: Sameeh Jubran 
 ---
hw/net/e1000e_core.c | 7 +--
1 file changed, 5 insertions(+), 2 deletions(-)

 diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
 index 28c5be1..8174b53 100644
 --- a/hw/net/e1000e_core.c
 +++ b/hw/net/e1000e_core.c
 @@ -2454,14 +2454,17 @@ e1000e_set_ics(E1000ECore *core, int index,
 uint32_t val)
static void
e1000e_set_icr(E1000ECore *core, int index, uint32_t val)
{
 +uint32_t icr = 0;
if ((core->mac[ICR] & E1000_ICR_ASSERTED) &&
(core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) {
trace_e1000e_irq_icr_process_iame();
e1000e_clear_ims_bits(core, core->mac[IAM]);
}
-trace_e1000e_irq_icr_write(val, core->mac[ICR], core->mac[ICR]
 &
 ~val);
 -core->mac[ICR] &= ~val;
 +icr = core->mac[ICR] & ~val;
 +icr = (val & E1000_ICR_OTHER) ? (icr & ~E1000_ICR_OTHER_CAUSES) :
 icr;
 +trace_e1000e_irq_icr_write(val, core->mac[ICR], icr);
 +core->mac[ICR] = icr;
e1000e_update_interrupt_state(core);
}


 Thanks for the patch.
>>>
>>> So this is an undocumented behavior, we must be careful on this. Several
>>> question below:
>>>
>>> - have you verified this on real hardware?
>>>
>>> No I haven't
>>
>
> Any chance to verify the behavior on real hardware?

This can be very challenging to reproduce on a real hardware as the Windows
source code isn't available and the bug doesn't
reproduce with the Linux driver; Moreover, the Linux driver configures the
device differently than the Windows driver.

>
>
>
>> - is MSIX enabled in this case?
>>>
>>> Yes it is, I have tested the patch with msi disabled too.
>>
>
> Does the problem exist in no msi case. If not, is this better to hack EIAC
> behavior?

Even though it doesn't reproduce in no msi case, this certainly can occur
as the ICR register is present in both msi and
no msi cases while the EIAC register is for MSI only.

>
>> - according to the steps you've summed up above, it's not specific to
>>> migration?
>>>

Re: [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration

2017-05-21 Thread Jason Wang



On 2017年05月19日 22:04, Sameeh Jubran wrote:

On Fri, May 19, 2017 at 9:25 AM, Jason Wang  wrote:



On 2017年05月17日 19:46, Sameeh Jubran wrote:


The bug was caused by the "receive overrun" (bit #6 of the ICR register)
interrupt
which would be triggered post migration in a heavy traffic environment.
Even though the
"receive overrun" bit (#6) is masked out by the IMS register (refer to
the log below)
the driver still receives an interrupt as the "receive overrun" bit (#6)
causes the
"Other" - bit #24 of the ICR register - bit to be set as documented
below. The driver
handles the interrupt and clears the "Other" bit (#24) but doesn't clear
the
"receive overrun" bit (#6) which leads to an infinite loop. Apparently
the Windows
driver expects that the "receive overrun" bit and other ones - documented
below - to be
cleared when the "Other" bit (#24) is cleared.

So to sum that up:
1. Bit #6 of the ICR register is set by heavy traffic
2. As a results of setting bit #6, bit #24 is set
3. The driver receives an interrupt for bit 24 (it doesn't receieve an
interrupt for bit #6 as it is masked out by IMS)
4. The driver handles and clears the interrupt of bit #24
5. Bit #6 is still set.
6. 2 happens all over again

The Interrupt Cause Read - ICR register:

The ICR has the "Other" bit - bit #24 - that is set when one or more of
the following
ICR register's bits are set:

LSC - bit #2, RXO - bit #6, MDAC - bit #9, SRPD - bit #16, ACK - bit #17,
MNG - bit #18

Log sample of the storm:

27563@1494850819.411877:e1000e_irq_pending_interrupts ICR PENDING:
0x100 (ICR: 0x815000c2, IMS: 0x1a4)
27563@1494850819.411900:e1000e_irq_pending_interrupts ICR PENDING: 0x0
(ICR: 0x815000c2, IMS: 0xa4)
27563@1494850819.411915:e1000e_irq_pending_interrupts ICR PENDING: 0x0
(ICR: 0x815000c2, IMS: 0xa4)
27563@1494850819.412380:e1000e_irq_pending_interrupts ICR PENDING: 0x0
(ICR: 0x815000c2, IMS: 0xa4)
27563@1494850819.412395:e1000e_irq_pending_interrupts ICR PENDING: 0x0
(ICR: 0x815000c2, IMS: 0xa4)
27563@1494850819.412436:e1000e_irq_pending_interrupts ICR PENDING: 0x0
(ICR: 0x815000c2, IMS: 0xa4)
27563@1494850819.412441:e1000e_irq_pending_interrupts ICR PENDING: 0x0
(ICR: 0x815000c2, IMS: 0xa4)
27563@1494850819.412998:e1000e_irq_pending_interrupts ICR PENDING:
0x100 (ICR: 0x815000c2, IMS: 0x1a4)

This commit solves:
https://bugzilla.redhat.com/show_bug.cgi?id=1447935
https://bugzilla.redhat.com/show_bug.cgi?id=1449490

Signed-off-by: Sameeh Jubran 
---
   hw/net/e1000e_core.c | 7 +--
   1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
index 28c5be1..8174b53 100644
--- a/hw/net/e1000e_core.c
+++ b/hw/net/e1000e_core.c
@@ -2454,14 +2454,17 @@ e1000e_set_ics(E1000ECore *core, int index,
uint32_t val)
   static void
   e1000e_set_icr(E1000ECore *core, int index, uint32_t val)
   {
+uint32_t icr = 0;
   if ((core->mac[ICR] & E1000_ICR_ASSERTED) &&
   (core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) {
   trace_e1000e_irq_icr_process_iame();
   e1000e_clear_ims_bits(core, core->mac[IAM]);
   }
   -trace_e1000e_irq_icr_write(val, core->mac[ICR], core->mac[ICR] &
~val);
-core->mac[ICR] &= ~val;
+icr = core->mac[ICR] & ~val;
+icr = (val & E1000_ICR_OTHER) ? (icr & ~E1000_ICR_OTHER_CAUSES) :
icr;
+trace_e1000e_irq_icr_write(val, core->mac[ICR], icr);
+core->mac[ICR] = icr;
   e1000e_update_interrupt_state(core);
   }



Thanks for the patch.

So this is an undocumented behavior, we must be careful on this. Several
question below:

- have you verified this on real hardware?


No I haven't


Any chance to verify the behavior on real hardware?




- is MSIX enabled in this case?


Yes it is, I have tested the patch with msi disabled too.


Does the problem exist in no msi case. If not, is this better to hack 
EIAC behavior?





- according to the steps you've summed up above, it's not specific to
migration?


True


Then we probably need tweak the title.

Thanks.




Thanks









Re: [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration

2017-05-19 Thread Sameeh Jubran
On Fri, May 19, 2017 at 9:25 AM, Jason Wang  wrote:

>
>
> On 2017年05月17日 19:46, Sameeh Jubran wrote:
>
>> The bug was caused by the "receive overrun" (bit #6 of the ICR register)
>> interrupt
>> which would be triggered post migration in a heavy traffic environment.
>> Even though the
>> "receive overrun" bit (#6) is masked out by the IMS register (refer to
>> the log below)
>> the driver still receives an interrupt as the "receive overrun" bit (#6)
>> causes the
>> "Other" - bit #24 of the ICR register - bit to be set as documented
>> below. The driver
>> handles the interrupt and clears the "Other" bit (#24) but doesn't clear
>> the
>> "receive overrun" bit (#6) which leads to an infinite loop. Apparently
>> the Windows
>> driver expects that the "receive overrun" bit and other ones - documented
>> below - to be
>> cleared when the "Other" bit (#24) is cleared.
>>
>> So to sum that up:
>> 1. Bit #6 of the ICR register is set by heavy traffic
>> 2. As a results of setting bit #6, bit #24 is set
>> 3. The driver receives an interrupt for bit 24 (it doesn't receieve an
>> interrupt for bit #6 as it is masked out by IMS)
>> 4. The driver handles and clears the interrupt of bit #24
>> 5. Bit #6 is still set.
>> 6. 2 happens all over again
>>
>> The Interrupt Cause Read - ICR register:
>>
>> The ICR has the "Other" bit - bit #24 - that is set when one or more of
>> the following
>> ICR register's bits are set:
>>
>> LSC - bit #2, RXO - bit #6, MDAC - bit #9, SRPD - bit #16, ACK - bit #17,
>> MNG - bit #18
>>
>> Log sample of the storm:
>>
>> 27563@1494850819.411877:e1000e_irq_pending_interrupts ICR PENDING:
>> 0x100 (ICR: 0x815000c2, IMS: 0x1a4)
>> 27563@1494850819.411900:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>> (ICR: 0x815000c2, IMS: 0xa4)
>> 27563@1494850819.411915:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>> (ICR: 0x815000c2, IMS: 0xa4)
>> 27563@1494850819.412380:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>> (ICR: 0x815000c2, IMS: 0xa4)
>> 27563@1494850819.412395:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>> (ICR: 0x815000c2, IMS: 0xa4)
>> 27563@1494850819.412436:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>> (ICR: 0x815000c2, IMS: 0xa4)
>> 27563@1494850819.412441:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>> (ICR: 0x815000c2, IMS: 0xa4)
>> 27563@1494850819.412998:e1000e_irq_pending_interrupts ICR PENDING:
>> 0x100 (ICR: 0x815000c2, IMS: 0x1a4)
>>
>> This commit solves:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1447935
>> https://bugzilla.redhat.com/show_bug.cgi?id=1449490
>>
>> Signed-off-by: Sameeh Jubran 
>> ---
>>   hw/net/e1000e_core.c | 7 +--
>>   1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
>> index 28c5be1..8174b53 100644
>> --- a/hw/net/e1000e_core.c
>> +++ b/hw/net/e1000e_core.c
>> @@ -2454,14 +2454,17 @@ e1000e_set_ics(E1000ECore *core, int index,
>> uint32_t val)
>>   static void
>>   e1000e_set_icr(E1000ECore *core, int index, uint32_t val)
>>   {
>> +uint32_t icr = 0;
>>   if ((core->mac[ICR] & E1000_ICR_ASSERTED) &&
>>   (core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) {
>>   trace_e1000e_irq_icr_process_iame();
>>   e1000e_clear_ims_bits(core, core->mac[IAM]);
>>   }
>>   -trace_e1000e_irq_icr_write(val, core->mac[ICR], core->mac[ICR] &
>> ~val);
>> -core->mac[ICR] &= ~val;
>> +icr = core->mac[ICR] & ~val;
>> +icr = (val & E1000_ICR_OTHER) ? (icr & ~E1000_ICR_OTHER_CAUSES) :
>> icr;
>> +trace_e1000e_irq_icr_write(val, core->mac[ICR], icr);
>> +core->mac[ICR] = icr;
>>   e1000e_update_interrupt_state(core);
>>   }
>>
>>
>
> Thanks for the patch.
>
> So this is an undocumented behavior, we must be careful on this. Several
> question below:
>
> - have you verified this on real hardware?
>
No I haven't

> - is MSIX enabled in this case?
>
Yes it is, I have tested the patch with msi disabled too.

> - according to the steps you've summed up above, it's not specific to
> migration?
>
True

>
> Thanks
>



-- 
Respectfully,
*Sameeh Jubran*
*Linkedin *
*Software Engineer @ Daynix .*


Re: [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration

2017-05-18 Thread Jason Wang



On 2017年05月17日 19:46, Sameeh Jubran wrote:

The bug was caused by the "receive overrun" (bit #6 of the ICR register) 
interrupt
which would be triggered post migration in a heavy traffic environment. Even 
though the
"receive overrun" bit (#6) is masked out by the IMS register (refer to the log 
below)
the driver still receives an interrupt as the "receive overrun" bit (#6) causes 
the
"Other" - bit #24 of the ICR register - bit to be set as documented below. The 
driver
handles the interrupt and clears the "Other" bit (#24) but doesn't clear the
"receive overrun" bit (#6) which leads to an infinite loop. Apparently the 
Windows
driver expects that the "receive overrun" bit and other ones - documented below 
- to be
cleared when the "Other" bit (#24) is cleared.

So to sum that up:
1. Bit #6 of the ICR register is set by heavy traffic
2. As a results of setting bit #6, bit #24 is set
3. The driver receives an interrupt for bit 24 (it doesn't receieve an 
interrupt for bit #6 as it is masked out by IMS)
4. The driver handles and clears the interrupt of bit #24
5. Bit #6 is still set.
6. 2 happens all over again

The Interrupt Cause Read - ICR register:

The ICR has the "Other" bit - bit #24 - that is set when one or more of the 
following
ICR register's bits are set:

LSC - bit #2, RXO - bit #6, MDAC - bit #9, SRPD - bit #16, ACK - bit #17, MNG - 
bit #18

Log sample of the storm:

27563@1494850819.411877:e1000e_irq_pending_interrupts ICR PENDING: 0x100 
(ICR: 0x815000c2, IMS: 0x1a4)
27563@1494850819.411900:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
0x815000c2, IMS: 0xa4)
27563@1494850819.411915:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
0x815000c2, IMS: 0xa4)
27563@1494850819.412380:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
0x815000c2, IMS: 0xa4)
27563@1494850819.412395:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
0x815000c2, IMS: 0xa4)
27563@1494850819.412436:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
0x815000c2, IMS: 0xa4)
27563@1494850819.412441:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
0x815000c2, IMS: 0xa4)
27563@1494850819.412998:e1000e_irq_pending_interrupts ICR PENDING: 0x100 
(ICR: 0x815000c2, IMS: 0x1a4)

This commit solves:
https://bugzilla.redhat.com/show_bug.cgi?id=1447935
https://bugzilla.redhat.com/show_bug.cgi?id=1449490

Signed-off-by: Sameeh Jubran 
---
  hw/net/e1000e_core.c | 7 +--
  1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
index 28c5be1..8174b53 100644
--- a/hw/net/e1000e_core.c
+++ b/hw/net/e1000e_core.c
@@ -2454,14 +2454,17 @@ e1000e_set_ics(E1000ECore *core, int index, uint32_t 
val)
  static void
  e1000e_set_icr(E1000ECore *core, int index, uint32_t val)
  {
+uint32_t icr = 0;
  if ((core->mac[ICR] & E1000_ICR_ASSERTED) &&
  (core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) {
  trace_e1000e_irq_icr_process_iame();
  e1000e_clear_ims_bits(core, core->mac[IAM]);
  }
  
-trace_e1000e_irq_icr_write(val, core->mac[ICR], core->mac[ICR] & ~val);

-core->mac[ICR] &= ~val;
+icr = core->mac[ICR] & ~val;
+icr = (val & E1000_ICR_OTHER) ? (icr & ~E1000_ICR_OTHER_CAUSES) : icr;
+trace_e1000e_irq_icr_write(val, core->mac[ICR], icr);
+core->mac[ICR] = icr;
  e1000e_update_interrupt_state(core);
  }
  


Thanks for the patch.

So this is an undocumented behavior, we must be careful on this. Several 
question below:


- have you verified this on real hardware?
- is MSIX enabled in this case?
- according to the steps you've summed up above, it's not specific to 
migration?


Thanks



Re: [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration

2017-05-18 Thread Paolo Bonzini


On 18/05/2017 12:04, Dr. David Alan Gilbert wrote:
> * Sameeh Jubran (sam...@daynix.com) wrote:
>> The bug was caused by the "receive overrun" (bit #6 of the ICR register) 
>> interrupt
>> which would be triggered post migration in a heavy traffic environment. Even 
>> though the
>> "receive overrun" bit (#6) is masked out by the IMS register (refer to the 
>> log below)
>> the driver still receives an interrupt as the "receive overrun" bit (#6) 
>> causes the
>> "Other" - bit #24 of the ICR register - bit to be set as documented below. 
>> The driver
>> handles the interrupt and clears the "Other" bit (#24) but doesn't clear the
>> "receive overrun" bit (#6) which leads to an infinite loop. Apparently the 
>> Windows
>> driver expects that the "receive overrun" bit and other ones - documented 
>> below - to be
>> cleared when the "Other" bit (#24) is cleared.
>>
>> So to sum that up:
>> 1. Bit #6 of the ICR register is set by heavy traffic
>> 2. As a results of setting bit #6, bit #24 is set
>> 3. The driver receives an interrupt for bit 24 (it doesn't receieve an 
>> interrupt for bit #6 as it is masked out by IMS)
>> 4. The driver handles and clears the interrupt of bit #24
>> 5. Bit #6 is still set.
>> 6. 2 happens all over again
>>
>> The Interrupt Cause Read - ICR register:
>>
>> The ICR has the "Other" bit - bit #24 - that is set when one or more of the 
>> following
>> ICR register's bits are set:
>>
>> LSC - bit #2, RXO - bit #6, MDAC - bit #9, SRPD - bit #16, ACK - bit #17, 
>> MNG - bit #18
>>
>> Log sample of the storm:
>>
>> 27563@1494850819.411877:e1000e_irq_pending_interrupts ICR PENDING: 0x100 
>> (ICR: 0x815000c2, IMS: 0x1a4)
>> 27563@1494850819.411900:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
>> 0x815000c2, IMS: 0xa4)
>> 27563@1494850819.411915:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
>> 0x815000c2, IMS: 0xa4)
>> 27563@1494850819.412380:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
>> 0x815000c2, IMS: 0xa4)
>> 27563@1494850819.412395:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
>> 0x815000c2, IMS: 0xa4)
>> 27563@1494850819.412436:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
>> 0x815000c2, IMS: 0xa4)
>> 27563@1494850819.412441:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
>> 0x815000c2, IMS: 0xa4)
>> 27563@1494850819.412998:e1000e_irq_pending_interrupts ICR PENDING: 0x100 
>> (ICR: 0x815000c2, IMS: 0x1a4)
>>
>> This commit solves:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1447935
>> https://bugzilla.redhat.com/show_bug.cgi?id=1449490
>>
>> Signed-off-by: Sameeh Jubran 
> 
> Thanks, I tested this with our downstream and it does
> fix the reproducer for 1447935 for me, so:
> 
> 
> Tested-by: Dr. David Alan Gilbert 
> 
> Dave
> 
>> ---
>>  hw/net/e1000e_core.c | 7 +--
>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
>> index 28c5be1..8174b53 100644
>> --- a/hw/net/e1000e_core.c
>> +++ b/hw/net/e1000e_core.c
>> @@ -2454,14 +2454,17 @@ e1000e_set_ics(E1000ECore *core, int index, uint32_t 
>> val)
>>  static void
>>  e1000e_set_icr(E1000ECore *core, int index, uint32_t val)
>>  {
>> +uint32_t icr = 0;
>>  if ((core->mac[ICR] & E1000_ICR_ASSERTED) &&
>>  (core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) {
>>  trace_e1000e_irq_icr_process_iame();
>>  e1000e_clear_ims_bits(core, core->mac[IAM]);
>>  }
>>  
>> -trace_e1000e_irq_icr_write(val, core->mac[ICR], core->mac[ICR] & ~val);
>> -core->mac[ICR] &= ~val;
>> +icr = core->mac[ICR] & ~val;
>> +icr = (val & E1000_ICR_OTHER) ? (icr & ~E1000_ICR_OTHER_CAUSES) : icr;
>> +trace_e1000e_irq_icr_write(val, core->mac[ICR], icr);
>> +core->mac[ICR] = icr;
>>  e1000e_update_interrupt_state(core);
>>  }
>>  
>> -- 
>> 2.8.1.185.gdc0db2c
>>
>>
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> 

Cc: qemu-sta...@nongnu.org



Re: [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration

2017-05-18 Thread Dr. David Alan Gilbert
* Sameeh Jubran (sam...@daynix.com) wrote:
> The bug was caused by the "receive overrun" (bit #6 of the ICR register) 
> interrupt
> which would be triggered post migration in a heavy traffic environment. Even 
> though the
> "receive overrun" bit (#6) is masked out by the IMS register (refer to the 
> log below)
> the driver still receives an interrupt as the "receive overrun" bit (#6) 
> causes the
> "Other" - bit #24 of the ICR register - bit to be set as documented below. 
> The driver
> handles the interrupt and clears the "Other" bit (#24) but doesn't clear the
> "receive overrun" bit (#6) which leads to an infinite loop. Apparently the 
> Windows
> driver expects that the "receive overrun" bit and other ones - documented 
> below - to be
> cleared when the "Other" bit (#24) is cleared.
> 
> So to sum that up:
> 1. Bit #6 of the ICR register is set by heavy traffic
> 2. As a results of setting bit #6, bit #24 is set
> 3. The driver receives an interrupt for bit 24 (it doesn't receieve an 
> interrupt for bit #6 as it is masked out by IMS)
> 4. The driver handles and clears the interrupt of bit #24
> 5. Bit #6 is still set.
> 6. 2 happens all over again
> 
> The Interrupt Cause Read - ICR register:
> 
> The ICR has the "Other" bit - bit #24 - that is set when one or more of the 
> following
> ICR register's bits are set:
> 
> LSC - bit #2, RXO - bit #6, MDAC - bit #9, SRPD - bit #16, ACK - bit #17, MNG 
> - bit #18
> 
> Log sample of the storm:
> 
> 27563@1494850819.411877:e1000e_irq_pending_interrupts ICR PENDING: 0x100 
> (ICR: 0x815000c2, IMS: 0x1a4)
> 27563@1494850819.411900:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
> 0x815000c2, IMS: 0xa4)
> 27563@1494850819.411915:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
> 0x815000c2, IMS: 0xa4)
> 27563@1494850819.412380:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
> 0x815000c2, IMS: 0xa4)
> 27563@1494850819.412395:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
> 0x815000c2, IMS: 0xa4)
> 27563@1494850819.412436:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
> 0x815000c2, IMS: 0xa4)
> 27563@1494850819.412441:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
> 0x815000c2, IMS: 0xa4)
> 27563@1494850819.412998:e1000e_irq_pending_interrupts ICR PENDING: 0x100 
> (ICR: 0x815000c2, IMS: 0x1a4)
> 
> This commit solves:
> https://bugzilla.redhat.com/show_bug.cgi?id=1447935
> https://bugzilla.redhat.com/show_bug.cgi?id=1449490
> 
> Signed-off-by: Sameeh Jubran 

Thanks, I tested this with our downstream and it does
fix the reproducer for 1447935 for me, so:


Tested-by: Dr. David Alan Gilbert 

Dave

> ---
>  hw/net/e1000e_core.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
> index 28c5be1..8174b53 100644
> --- a/hw/net/e1000e_core.c
> +++ b/hw/net/e1000e_core.c
> @@ -2454,14 +2454,17 @@ e1000e_set_ics(E1000ECore *core, int index, uint32_t 
> val)
>  static void
>  e1000e_set_icr(E1000ECore *core, int index, uint32_t val)
>  {
> +uint32_t icr = 0;
>  if ((core->mac[ICR] & E1000_ICR_ASSERTED) &&
>  (core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) {
>  trace_e1000e_irq_icr_process_iame();
>  e1000e_clear_ims_bits(core, core->mac[IAM]);
>  }
>  
> -trace_e1000e_irq_icr_write(val, core->mac[ICR], core->mac[ICR] & ~val);
> -core->mac[ICR] &= ~val;
> +icr = core->mac[ICR] & ~val;
> +icr = (val & E1000_ICR_OTHER) ? (icr & ~E1000_ICR_OTHER_CAUSES) : icr;
> +trace_e1000e_irq_icr_write(val, core->mac[ICR], icr);
> +core->mac[ICR] = icr;
>  e1000e_update_interrupt_state(core);
>  }
>  
> -- 
> 2.8.1.185.gdc0db2c
> 
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration

2017-05-17 Thread Dmitry Fleytman

> On 17 May 2017, at 14:46 PM, Sameeh Jubran  wrote:
> 
> The bug was caused by the "receive overrun" (bit #6 of the ICR register) 
> interrupt
> which would be triggered post migration in a heavy traffic environment. Even 
> though the
> "receive overrun" bit (#6) is masked out by the IMS register (refer to the 
> log below)
> the driver still receives an interrupt as the "receive overrun" bit (#6) 
> causes the
> "Other" - bit #24 of the ICR register - bit to be set as documented below. 
> The driver
> handles the interrupt and clears the "Other" bit (#24) but doesn't clear the
> "receive overrun" bit (#6) which leads to an infinite loop. Apparently the 
> Windows
> driver expects that the "receive overrun" bit and other ones - documented 
> below - to be
> cleared when the "Other" bit (#24) is cleared.
> 
> So to sum that up:
> 1. Bit #6 of the ICR register is set by heavy traffic
> 2. As a results of setting bit #6, bit #24 is set
> 3. The driver receives an interrupt for bit 24 (it doesn't receieve an 
> interrupt for bit #6 as it is masked out by IMS)
> 4. The driver handles and clears the interrupt of bit #24
> 5. Bit #6 is still set.
> 6. 2 happens all over again
> 
> The Interrupt Cause Read - ICR register:
> 
> The ICR has the "Other" bit - bit #24 - that is set when one or more of the 
> following
> ICR register's bits are set:
> 
> LSC - bit #2, RXO - bit #6, MDAC - bit #9, SRPD - bit #16, ACK - bit #17, MNG 
> - bit #18
> 
> Log sample of the storm:
> 
> 27563@1494850819.411877:e1000e_irq_pending_interrupts ICR PENDING: 0x100 
> (ICR: 0x815000c2, IMS: 0x1a4)
> 27563@1494850819.411900:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
> 0x815000c2, IMS: 0xa4)
> 27563@1494850819.411915:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
> 0x815000c2, IMS: 0xa4)
> 27563@1494850819.412380:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
> 0x815000c2, IMS: 0xa4)
> 27563@1494850819.412395:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
> 0x815000c2, IMS: 0xa4)
> 27563@1494850819.412436:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
> 0x815000c2, IMS: 0xa4)
> 27563@1494850819.412441:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
> 0x815000c2, IMS: 0xa4)
> 27563@1494850819.412998:e1000e_irq_pending_interrupts ICR PENDING: 0x100 
> (ICR: 0x815000c2, IMS: 0x1a4)
> 
> This commit solves:
> https://bugzilla.redhat.com/show_bug.cgi?id=1447935
> https://bugzilla.redhat.com/show_bug.cgi?id=1449490 
> 


Reviewed-by: Dmitry Fleytman 


> 
> Signed-off-by: Sameeh Jubran 
> ---
> hw/net/e1000e_core.c | 7 +--
> 1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
> index 28c5be1..8174b53 100644
> --- a/hw/net/e1000e_core.c
> +++ b/hw/net/e1000e_core.c
> @@ -2454,14 +2454,17 @@ e1000e_set_ics(E1000ECore *core, int index, uint32_t 
> val)
> static void
> e1000e_set_icr(E1000ECore *core, int index, uint32_t val)
> {
> +uint32_t icr = 0;
> if ((core->mac[ICR] & E1000_ICR_ASSERTED) &&
> (core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) {
> trace_e1000e_irq_icr_process_iame();
> e1000e_clear_ims_bits(core, core->mac[IAM]);
> }
> 
> -trace_e1000e_irq_icr_write(val, core->mac[ICR], core->mac[ICR] & ~val);
> -core->mac[ICR] &= ~val;
> +icr = core->mac[ICR] & ~val;
> +icr = (val & E1000_ICR_OTHER) ? (icr & ~E1000_ICR_OTHER_CAUSES) : icr;
> +trace_e1000e_irq_icr_write(val, core->mac[ICR], icr);
> +core->mac[ICR] = icr;
> e1000e_update_interrupt_state(core);
> }
> 
> -- 
> 2.8.1.185.gdc0db2c
> 



[Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration

2017-05-17 Thread Sameeh Jubran
The bug was caused by the "receive overrun" (bit #6 of the ICR register) 
interrupt
which would be triggered post migration in a heavy traffic environment. Even 
though the
"receive overrun" bit (#6) is masked out by the IMS register (refer to the log 
below)
the driver still receives an interrupt as the "receive overrun" bit (#6) causes 
the
"Other" - bit #24 of the ICR register - bit to be set as documented below. The 
driver
handles the interrupt and clears the "Other" bit (#24) but doesn't clear the
"receive overrun" bit (#6) which leads to an infinite loop. Apparently the 
Windows
driver expects that the "receive overrun" bit and other ones - documented below 
- to be
cleared when the "Other" bit (#24) is cleared.

So to sum that up:
1. Bit #6 of the ICR register is set by heavy traffic
2. As a results of setting bit #6, bit #24 is set
3. The driver receives an interrupt for bit 24 (it doesn't receieve an 
interrupt for bit #6 as it is masked out by IMS)
4. The driver handles and clears the interrupt of bit #24
5. Bit #6 is still set.
6. 2 happens all over again

The Interrupt Cause Read - ICR register:

The ICR has the "Other" bit - bit #24 - that is set when one or more of the 
following
ICR register's bits are set:

LSC - bit #2, RXO - bit #6, MDAC - bit #9, SRPD - bit #16, ACK - bit #17, MNG - 
bit #18

Log sample of the storm:

27563@1494850819.411877:e1000e_irq_pending_interrupts ICR PENDING: 0x100 
(ICR: 0x815000c2, IMS: 0x1a4)
27563@1494850819.411900:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
0x815000c2, IMS: 0xa4)
27563@1494850819.411915:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
0x815000c2, IMS: 0xa4)
27563@1494850819.412380:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
0x815000c2, IMS: 0xa4)
27563@1494850819.412395:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
0x815000c2, IMS: 0xa4)
27563@1494850819.412436:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
0x815000c2, IMS: 0xa4)
27563@1494850819.412441:e1000e_irq_pending_interrupts ICR PENDING: 0x0 (ICR: 
0x815000c2, IMS: 0xa4)
27563@1494850819.412998:e1000e_irq_pending_interrupts ICR PENDING: 0x100 
(ICR: 0x815000c2, IMS: 0x1a4)

This commit solves:
https://bugzilla.redhat.com/show_bug.cgi?id=1447935
https://bugzilla.redhat.com/show_bug.cgi?id=1449490

Signed-off-by: Sameeh Jubran 
---
 hw/net/e1000e_core.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
index 28c5be1..8174b53 100644
--- a/hw/net/e1000e_core.c
+++ b/hw/net/e1000e_core.c
@@ -2454,14 +2454,17 @@ e1000e_set_ics(E1000ECore *core, int index, uint32_t 
val)
 static void
 e1000e_set_icr(E1000ECore *core, int index, uint32_t val)
 {
+uint32_t icr = 0;
 if ((core->mac[ICR] & E1000_ICR_ASSERTED) &&
 (core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) {
 trace_e1000e_irq_icr_process_iame();
 e1000e_clear_ims_bits(core, core->mac[IAM]);
 }
 
-trace_e1000e_irq_icr_write(val, core->mac[ICR], core->mac[ICR] & ~val);
-core->mac[ICR] &= ~val;
+icr = core->mac[ICR] & ~val;
+icr = (val & E1000_ICR_OTHER) ? (icr & ~E1000_ICR_OTHER_CAUSES) : icr;
+trace_e1000e_irq_icr_write(val, core->mac[ICR], icr);
+core->mac[ICR] = icr;
 e1000e_update_interrupt_state(core);
 }
 
-- 
2.8.1.185.gdc0db2c