Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-29 Thread Alexandru Chirvasitu
On Fri, Dec 29, 2017 at 02:09:40PM +0100, Thomas Gleixner wrote:
> On Fri, 29 Dec 2017, Alexandru Chirvasitu wrote:
> > All right, I tried to do some more digging around, in the hope of
> > getting as close to the source of the problem as I can.
> > 
> > I went back to the very first commit that went astray for me, 2db1f95
> > (which is the only one actually panicking), and tried to move from its
> > parent 90ad9e2 (that boots fine) to it gradually, altering the code in
> > small chunks.
> > 
> > I tried to ignore the stuff that clearly shouldn't make a difference,
> > such as definitions. So in the end I get defined-but-unused-function
> > errors in my compilations, but I'm ignoring those for now. Some
> > results:
> > 
> > (1) When I move from the good commit 90ad9e2 according to the attached
> > bad-diff (which moves partly towards 2db1f95), I get a panic.
> > 
> > (2) On the other hand, when I further change this last panicking
> > commit by simply doing
> > 
> > 
> > 
> > removed activate / deactivate from x86_vector_domain_ops
> > 
> > diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> > index 7317ba5a..063594d 100644
> > --- a/arch/x86/kernel/apic/vector.c
> > +++ b/arch/x86/kernel/apic/vector.c
> > @@ -514,8 +514,6 @@ void x86_vector_debug_show(struct seq_file *m, struct irq_domain *d,
> >  static const struct irq_domain_ops x86_vector_domain_ops = {
> >     .alloc      = x86_vector_alloc_irqs,
> >     .free       = x86_vector_free_irqs,
> > -   .activate   = x86_vector_activate,
> > -   .deactivate = x86_vector_deactivate,
> >  #ifdef CONFIG_GENERIC_IRQ_DEBUGFS
> >     .debug_show = x86_vector_debug_show,
> >  #endif
> > 
> > 
> > all is well. 
> 
> Nice detective work. Unfortunately that's not a real solution ...
>

Oh, of course. It was never intended as a solution, only as
information perhaps enabling someone who knows what they're doing
(unlike myself :) ) to find one.


> Can you try the patch below on top of Linus tree, please?
> 
> Thanks,
>

Applied it to 464e1d5 (4.15-rc5) just now. It appears to be
trouble-free: booted, logged me in fine, the works.




>   tglx
> 
> 8<-
> --- a/kernel/irq/msi.c
> +++ b/kernel/irq/msi.c
> @@ -339,6 +339,40 @@ int msi_domain_populate_irqs(struct irq_
>      return ret;
>  }
>  
> +/*
> + * Carefully check whether the device can use reservation mode. If
> + * reservation mode is enabled then the early activation will assign a
> + * dummy vector to the device. If the PCI/MSI device does not support
> + * masking of the entry then this can result in spurious interrupts when
> + * the device driver is not absolutely careful. But even then a malfunction
> + * of the hardware could result in a spurious interrupt on the dummy vector
> + * and render the device unusable. If the entry can be masked then the core
> + * logic will prevent the spurious interrupt and reservation mode can be
> + * used. For now reservation mode is restricted to PCI/MSI.
> + */
> +static bool msi_check_reservation_mode(struct irq_domain *domain,
> +                                       struct msi_domain_info *info,
> +                                       struct device *dev)
> +{
> +   struct msi_desc *desc;
> +
> +   if (domain->bus_token != DOMAIN_BUS_PCI_MSI)
> +       return false;
> +
> +   if (!(info->flags & MSI_FLAG_MUST_REACTIVATE))
> +       return false;
> +
> +   if (IS_ENABLED(CONFIG_PCI_MSI) && pci_msi_ignore_mask)
> +       return false;
> +
> +   /*
> +    * Checking the first MSI descriptor is sufficient. MSIX supports
> +    * masking and MSI does so when the maskbit is set.
> +    */
> +   desc = first_msi_entry(dev);
> +   return desc->msi_attrib.is_msix || desc->msi_attrib.maskbit;
> +}
> +
>  /**
>   * msi_domain_alloc_irqs - Allocate interrupts from a MSI interrupt domain
>   * @domain:  The domain to allocate from
> @@ -353,9 +387,11 @@ int msi_domain_alloc_irqs(struct irq_dom
>  {
>      struct msi_domain_info *info = domain->host_data;
>      struct msi_domain_ops *ops = info->ops;
> -   msi_alloc_info_t arg;
> +   struct irq_data *irq_data;
>      struct msi_desc *desc;
> +   msi_alloc_info_t arg;
>      int i, ret, virq;
> +   bool can_reserve;
>  
>      ret = msi_domain_prepare_irqs(domain, dev, nvec, &arg);
>      if (ret)
> @@ -385,6 +421,8 @@ int msi_domain_alloc_irqs(struct irq_dom
>      if (ops->msi_finish)
>          ops->msi_finish(&arg, 0);
>  
> +   can_reserve = msi_check_reservation_mode(domain, info, dev);
> +
>      for_each_msi_entry(desc, dev) {
>          virq = desc->irq;
>          if (desc->nvec_used == 1)
> @@ -397,17 +435,28 @@ int msi_domain_alloc_irqs(struct irq_dom
>           * the MSI entries before the PCI layer enables MSI in the
> 

Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-29 Thread Thomas Gleixner
On Fri, 29 Dec 2017, Alexandru Chirvasitu wrote:
> All right, I tried to do some more digging around, in the hope of
> getting as close to the source of the problem as I can.
> 
> I went back to the very first commit that went astray for me, 2db1f95
> (which is the only one actually panicking), and tried to move from its
> parent 90ad9e2 (that boots fine) to it gradually, altering the code in
> small chunks.
> 
> I tried to ignore the stuff that clearly shouldn't make a difference,
> such as definitions. So in the end I get defined-but-unused-function
> errors in my compilations, but I'm ignoring those for now. Some
> results:
> 
> (1) When I move from the good commit 90ad9e2 according to the attached
> bad-diff (which moves partly towards 2db1f95), I get a panic.
> 
> (2) On the other hand, when I further change this last panicking
> commit by simply doing
> 
> 
> 
> removed activate / deactivate from x86_vector_domain_ops
> 
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index 7317ba5a..063594d 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -514,8 +514,6 @@ void x86_vector_debug_show(struct seq_file *m, struct irq_domain *d,
>  static const struct irq_domain_ops x86_vector_domain_ops = {
>     .alloc      = x86_vector_alloc_irqs,
>     .free       = x86_vector_free_irqs,
> -   .activate   = x86_vector_activate,
> -   .deactivate = x86_vector_deactivate,
>  #ifdef CONFIG_GENERIC_IRQ_DEBUGFS
>     .debug_show = x86_vector_debug_show,
>  #endif
> 
> 
> all is well. 

Nice detective work. Unfortunately that's not a real solution ...

Can you try the patch below on top of Linus tree, please?

Thanks,

tglx

8<-
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -339,6 +339,40 @@ int msi_domain_populate_irqs(struct irq_
    return ret;
 }
 
+/*
+ * Carefully check whether the device can use reservation mode. If
+ * reservation mode is enabled then the early activation will assign a
+ * dummy vector to the device. If the PCI/MSI device does not support
+ * masking of the entry then this can result in spurious interrupts when
+ * the device driver is not absolutely careful. But even then a malfunction
+ * of the hardware could result in a spurious interrupt on the dummy vector
+ * and render the device unusable. If the entry can be masked then the core
+ * logic will prevent the spurious interrupt and reservation mode can be
+ * used. For now reservation mode is restricted to PCI/MSI.
+ */
+static bool msi_check_reservation_mode(struct irq_domain *domain,
+                                       struct msi_domain_info *info,
+                                       struct device *dev)
+{
+   struct msi_desc *desc;
+
+   if (domain->bus_token != DOMAIN_BUS_PCI_MSI)
+       return false;
+
+   if (!(info->flags & MSI_FLAG_MUST_REACTIVATE))
+       return false;
+
+   if (IS_ENABLED(CONFIG_PCI_MSI) && pci_msi_ignore_mask)
+       return false;
+
+   /*
+    * Checking the first MSI descriptor is sufficient. MSIX supports
+    * masking and MSI does so when the maskbit is set.
+    */
+   desc = first_msi_entry(dev);
+   return desc->msi_attrib.is_msix || desc->msi_attrib.maskbit;
+}
+
 /**
  * msi_domain_alloc_irqs - Allocate interrupts from a MSI interrupt domain
  * @domain:  The domain to allocate from
@@ -353,9 +387,11 @@ int msi_domain_alloc_irqs(struct irq_dom
 {
    struct msi_domain_info *info = domain->host_data;
    struct msi_domain_ops *ops = info->ops;
-   msi_alloc_info_t arg;
+   struct irq_data *irq_data;
    struct msi_desc *desc;
+   msi_alloc_info_t arg;
    int i, ret, virq;
+   bool can_reserve;
 
    ret = msi_domain_prepare_irqs(domain, dev, nvec, &arg);
    if (ret)
@@ -385,6 +421,8 @@ int msi_domain_alloc_irqs(struct irq_dom
    if (ops->msi_finish)
        ops->msi_finish(&arg, 0);
 
+   can_reserve = msi_check_reservation_mode(domain, info, dev);
+
    for_each_msi_entry(desc, dev) {
        virq = desc->irq;
        if (desc->nvec_used == 1)
@@ -397,17 +435,28 @@ int msi_domain_alloc_irqs(struct irq_dom
         * the MSI entries before the PCI layer enables MSI in the
         * card. Otherwise the card latches a random msi message.
         */
-       if (info->flags & MSI_FLAG_ACTIVATE_EARLY) {
-           struct irq_data *irq_data;
+       if (!(info->flags & MSI_FLAG_ACTIVATE_EARLY))
+           continue;
 
+       irq_data = irq_domain_get_irq_data(domain, desc->irq);
+       if (!can_reserve)
+           irqd_clr_can_reserve(irq_data);
+   ret = 
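
The decision made by msi_check_reservation_mode() in the patch above can be
modelled outside the kernel. What follows is a minimal sketch, not kernel
code: the mock_* types, the enum and the boolean fields are hypothetical
stand-ins for the domain's bus token, MSI_FLAG_MUST_REACTIVATE,
pci_msi_ignore_mask and the first MSI descriptor.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel types; only what the check reads. */
enum mock_bus_token { MOCK_BUS_ANY, MOCK_BUS_PCI_MSI };

struct mock_msi_desc {
	bool is_msix;	/* MSI-X entries can always be masked */
	bool maskbit;	/* plain MSI: per-vector masking is supported */
};

struct mock_msi_dev {
	enum mock_bus_token bus_token;
	bool must_reactivate;		/* models MSI_FLAG_MUST_REACTIVATE */
	bool ignore_mask;		/* models pci_msi_ignore_mask */
	struct mock_msi_desc first;	/* models first_msi_entry(dev) */
};

/* Same order of checks as msi_check_reservation_mode() in the patch. */
static bool mock_can_reserve(const struct mock_msi_dev *d)
{
	if (d->bus_token != MOCK_BUS_PCI_MSI)
		return false;
	if (!d->must_reactivate)
		return false;
	if (d->ignore_mask)
		return false;
	/* Reservation mode only if the entry can be masked. */
	return d->first.is_msix || d->first.maskbit;
}
```

In this model an MSI-X device, or an MSI device with the mask bit, keeps
reservation mode, while a plain MSI device without the mask bit does not.
That last case appears to be the one the patch is meant to cover.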

Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-29 Thread Alexandru Chirvasitu
On Fri, Dec 29, 2017 at 06:49:15AM -0500, Alexandru Chirvasitu wrote:
> All right, I tried to do some more digging around, in the hope of
> getting as close to the source of the problem as I can.
> 
> I went back to the very first commit that went astray for me, 2db1f95
> (which is the only one actually panicking), and tried to move from its
> parent 90ad9e2 (that boots fine) to it gradually, altering the code in
> small chunks.
> 
> I tried to ignore the stuff that clearly shouldn't make a difference,
> such as definitions. So in the end I get defined-but-unused-function
> errors in my compilations, but I'm ignoring those for now. Some
> results:
> 
> (1) When I move from the good commit 90ad9e2 according to the attached
> bad-diff (which moves partly towards 2db1f95), I get a panic.
> 
> (2) On the other hand, when I further change this last panicking
> commit by simply doing
> 
> 
> 
> removed activate / deactivate from x86_vector_domain_ops
> 
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index 7317ba5a..063594d 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -514,8 +514,6 @@ void x86_vector_debug_show(struct seq_file *m, struct irq_domain *d,
>  static const struct irq_domain_ops x86_vector_domain_ops = {
>     .alloc      = x86_vector_alloc_irqs,
>     .free       = x86_vector_free_irqs,
> -   .activate   = x86_vector_activate,
> -   .deactivate = x86_vector_deactivate,
>  #ifdef CONFIG_GENERIC_IRQ_DEBUGFS
>     .debug_show = x86_vector_debug_show,
>  #endif
> 
> 
> all is well. 
> 

And sure enough, simply diffing


removed activate / deactivate from x86_vector_domain_ops

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 3f53572..e6cb55d 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -511,8 +511,6 @@ void x86_vector_debug_show(struct seq_file *m, struct irq_domain *d,
 static const struct irq_domain_ops x86_vector_domain_ops = {
    .alloc      = x86_vector_alloc_irqs,
    .free       = x86_vector_free_irqs,
-   .activate   = x86_vector_activate,
-   .deactivate = x86_vector_deactivate,
 #ifdef CONFIG_GENERIC_IRQ_DEBUGFS
    .debug_show = x86_vector_debug_show,
 #endif


directly against 2db1f95 fixes the issues (no freezes, lockups, or
panics).
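
Why deleting two initializer lines changes behaviour without breaking the
build can be shown with a small model. This is only a sketch under
assumptions: the mock_* names are hypothetical, and it assumes the generic
irqdomain code skips a callback that is NULL, which is consistent with the
modified kernel booting. Members left out of a designated initializer are
zero-initialized in C, so the omitted callbacks become NULL pointers.

```c
#include <stddef.h>

/* Hypothetical miniature of an ops table with an optional callback. */
struct mock_domain_ops {
	int (*alloc)(void);
	int (*activate)(void);	/* optional: may be left NULL */
};

static int mock_alloc(void) { return 0; }
static int mock_activate(void) { return 1; }

/* Returns 1 if the activate callback ran, 0 if it was absent and skipped. */
static int mock_domain_activate(const struct mock_domain_ops *ops)
{
	if (!ops->activate)
		return 0;
	return ops->activate();
}

static const struct mock_domain_ops with_activate = {
	.alloc    = mock_alloc,
	.activate = mock_activate,
};

static const struct mock_domain_ops without_activate = {
	.alloc = mock_alloc,	/* .activate omitted, so it is NULL */
};
```

With the callback gone, the activation step for the vector domain is simply
never run in this model, which is why the change works as a diagnostic but
not as a fix.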




> 
> 
> 
> On Fri, Dec 29, 2017 at 09:07:45AM +0100, Thomas Gleixner wrote:
> > On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> > > On Fri, Dec 29, 2017 at 12:36:37AM +0100, Thomas Gleixner wrote:
> > > > On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> > > > 
> > > > > Attached, but heads up on this: when redirecting the output of lspci
> > > > > -vvv to a text file as root I get
> > > > > 
> > > > > pcilib: sysfs_read_vpd: read failed: Input/output error
> > > > > 
> > > > > I can find bugs filed for various distros to this same effect, but
> > > > > haven't tracked down any explanations.
> > > > 
> > > > Weird, but the info looks complete.
> > > > 
> > > > Can you please add 'pci=nomsi' to the 4.15 kernel command line and see
> > > > whether that works?
> > > 
> > > It does (emailing from that successful boot as we speak). I'm on a
> > > clean 4.15-rc5 (as in no patches, etc.). 
> > > 
> > > This was also suggested way at the top of this thread by Dexuan Cui
> > > for 4.15-rc3 (where this exchange started), and it worked back then
> > > too.
> > 
> > I missed that part of the conversation. Let me stare into the MSI code
> > again.
> > 
> > Thanks,
> > 
> > tglx

> diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
> index aaf8d28..1e9bd28 100644
> --- a/arch/x86/include/asm/irq_vectors.h
> +++ b/arch/x86/include/asm/irq_vectors.h
> @@ -101,12 +101,8 @@
>  #define POSTED_INTR_NESTED_VECTOR      0xf0
>  #endif
>  
> -/*
> - * Local APIC timer IRQ vector is on a different priority level,
> - * to work around the 'lost local interrupt if more than 2 IRQ
> - * sources per level' errata.
> - */
> -#define LOCAL_TIMER_VECTOR             0xef
> +#define MANAGED_IRQ_SHUTDOWN_VECTOR    0xef
> +#define LOCAL_TIMER_VECTOR             0xee
>  
>  #define NR_VECTORS                     256
>  
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index f08d44f..7317ba5a 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -32,7 +32,8 @@ struct apic_chip_data {
>      unsigned int        prev_cpu;
>      unsigned int        irq;
>      struct hlist_node   clist;
> -   u8                  move_in_progress : 1;
> +   unsigned int        move_in_progress: 1,
> 

Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-29 Thread Alexandru Chirvasitu
All right, I tried to do some more digging around, in the hope of
getting as close to the source of the problem as I can.

I went back to the very first commit that went astray for me, 2db1f95
(which is the only one actually panicking), and tried to move from its
parent 90ad9e2 (that boots fine) to it gradually, altering the code in
small chunks.

I tried to ignore the stuff that clearly shouldn't make a difference,
such as definitions. So in the end I get defined-but-unused-function
errors in my compilations, but I'm ignoring those for now. Some
results:

(1) When I move from the good commit 90ad9e2 according to the attached
bad-diff (which moves partly towards 2db1f95), I get a panic.

(2) On the other hand, when I further change this last panicking
commit by simply doing



removed activate / deactivate from x86_vector_domain_ops

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 7317ba5a..063594d 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -514,8 +514,6 @@ void x86_vector_debug_show(struct seq_file *m, struct irq_domain *d,
 static const struct irq_domain_ops x86_vector_domain_ops = {
    .alloc      = x86_vector_alloc_irqs,
    .free       = x86_vector_free_irqs,
-   .activate   = x86_vector_activate,
-   .deactivate = x86_vector_deactivate,
 #ifdef CONFIG_GENERIC_IRQ_DEBUGFS
    .debug_show = x86_vector_debug_show,
 #endif


all is well. 




On Fri, Dec 29, 2017 at 09:07:45AM +0100, Thomas Gleixner wrote:
> On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> > On Fri, Dec 29, 2017 at 12:36:37AM +0100, Thomas Gleixner wrote:
> > > On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> > > 
> > > > Attached, but heads up on this: when redirecting the output of lspci
> > > > -vvv to a text file as root I get
> > > > 
> > > > pcilib: sysfs_read_vpd: read failed: Input/output error
> > > > 
> > > > I can find bugs filed for various distros to this same effect, but
> > > > haven't tracked down any explanations.
> > > 
> > > Weird, but the info looks complete.
> > > 
> > > Can you please add 'pci=nomsi' to the 4.15 kernel command line and see
> > > whether that works?
> > 
> > It does (emailing from that successful boot as we speak). I'm on a
> > clean 4.15-rc5 (as in no patches, etc.). 
> > 
> > This was also suggested way at the top of this thread by Dexuan Cui
> > for 4.15-rc3 (where this exchange started), and it worked back then
> > too.
> 
> I missed that part of the conversation. Let me stare into the MSI code
> again.
> 
> Thanks,
> 
>   tglx
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index aaf8d28..1e9bd28 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -101,12 +101,8 @@
 #define POSTED_INTR_NESTED_VECTOR      0xf0
 #endif
 
-/*
- * Local APIC timer IRQ vector is on a different priority level,
- * to work around the 'lost local interrupt if more than 2 IRQ
- * sources per level' errata.
- */
-#define LOCAL_TIMER_VECTOR             0xef
+#define MANAGED_IRQ_SHUTDOWN_VECTOR    0xef
+#define LOCAL_TIMER_VECTOR             0xee
 
 #define NR_VECTORS                     256
 
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index f08d44f..7317ba5a 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -32,7 +32,8 @@ struct apic_chip_data {
    unsigned int        prev_cpu;
    unsigned int        irq;
    struct hlist_node   clist;
-   u8                  move_in_progress : 1;
+   unsigned int        move_in_progress: 1,
+                       is_managed      : 1;
 };
 
 struct irq_domain *x86_vector_domain;
@@ -152,6 +153,28 @@ static void apic_update_vector(struct irq_data *irqd, unsigned int newvec,
    per_cpu(vector_irq, newcpu)[newvec] = desc;
 }
 
+static void vector_assign_managed_shutdown(struct irq_data *irqd)
+{
+   unsigned int cpu = cpumask_first(cpu_online_mask);
+
+   apic_update_irq_cfg(irqd, MANAGED_IRQ_SHUTDOWN_VECTOR, cpu);
+}
+
+static int reserve_managed_vector(struct irq_data *irqd)
+{
+   const struct cpumask *affmsk = irq_data_get_affinity_mask(irqd);
+   struct apic_chip_data *apicd = apic_chip_data(irqd);
+   unsigned long flags;
+   int ret;
+
+   raw_spin_lock_irqsave(&vector_lock, flags);
+   apicd->is_managed = true;
+   ret = irq_matrix_reserve_managed(vector_matrix, affmsk);
+   raw_spin_unlock_irqrestore(&vector_lock, flags);
+   trace_vector_reserve_managed(irqd->irq, ret);
+   return ret;
+}
+
 static int allocate_vector(struct irq_data *irqd, const struct cpumask *dest)
 {
    struct apic_chip_data *apicd = apic_chip_data(irqd);
@@ -211,9 +234,58 @@ static int 
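
The comment removed in the irq_vectors.h hunk above talks about priority
levels. On x86 the local APIC derives the priority class of an interrupt
from the upper four bits of its vector number, so the effect of the
renumbering can be checked with a few lines of C. A small sketch, where
vector_priority_class() is a hypothetical helper and not a kernel function:

```c
/* Vector values taken from the irq_vectors.h hunk in the attached diff. */
#define POSTED_INTR_NESTED_VECTOR	0xf0
#define MANAGED_IRQ_SHUTDOWN_VECTOR	0xef
#define LOCAL_TIMER_VECTOR		0xee

/* Hypothetical helper: priority class is bits 7:4 of the vector. */
static unsigned int vector_priority_class(unsigned int vector)
{
	return (vector >> 4) & 0xf;
}
```

After the change the timer vector 0xee and the new shutdown vector 0xef
share priority class 0xe, while 0xf0 sits one class higher.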


Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-29 Thread Thomas Gleixner
On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> On Fri, Dec 29, 2017 at 12:36:37AM +0100, Thomas Gleixner wrote:
> > On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> > 
> > > Attached, but heads up on this: when redirecting the output of lspci
> > > -vvv to a text file as root I get
> > > 
> > > pcilib: sysfs_read_vpd: read failed: Input/output error
> > > 
> > > I can find bugs filed for various distros to this same effect, but
> > > haven't tracked down any explanations.
> > 
> > Weird, but the info looks complete.
> > 
> > Can you please add 'pci=nomsi' to the 4.15 kernel command line and see
> > whether that works?
> 
> It does (emailing from that successful boot as we speak). I'm on a
> clean 4.15-rc5 (as in no patches, etc.). 
> 
> This was also suggested way at the top of this thread by Dexuan Cui
> for 4.15-rc3 (where this exchange started), and it worked back then
> too.

I missed that part of the conversation. Let me stare into the MSI code
again.

Thanks,

tglx


Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Alexandru Chirvasitu
On Thu, Dec 28, 2017 at 06:15:19PM -0600, Bjorn Helgaas wrote:
> On Thu, Dec 28, 2017 at 06:30:58PM -0500, Alexandru Chirvasitu wrote:
> > Attached, but heads up on this: when redirecting the output of lspci
> > -vvv to a text file as root I get
> > 
> > pcilib: sysfs_read_vpd: read failed: Input/output error
> > 
> > I can find bugs filed for various distros to this same effect, but
> > haven't tracked down any explanations.
> 
> This is a tangent, but I think you should *always* see "Input/output
> error" on this system when running "lspci -vvv" as root, regardless of
> whether you redirect the output (the error probably goes to stderr,
> not stdout, so it's probably easy to miss when not redirecting the
> output).
> 
> I think this is the -EIO return from pci_vpd_read(), which probably
> means pci_vpd_size() returned 0 for one of your devices, which means
> the VPD data provided by the device wasn't formatted correctly.  If
> this happens, you should see a warning in dmesg about it ("invalid VPD
> tag" or similar) -- could you verify that?
>

This in dmesg:

 pci 0000:06:00.0: [Firmware Bug]: disabling VPD access (can't
 determine size of non-standard VPD format)

So yes, looks like you pinned it down good. No other VPD instances in
dmesg.

And yes, the error does seem to always be present. I see it with

lspci -vvv 2>&1 | grep pcilib

so it was there in stderr all along. 

> It's possible we should return something other than -EIO, or maybe
> pcilib should do something other than emitting the warning.  In
> pcilib, sysfs_read_vpd() emits the warning [1], and it would seem sort
> of ugly to special-case EIO, so maybe we should change this in the
> kernel.
> 
> It looks like your Qualcomm Atheros Attansic NIC at 06:00.0 is the
> only device with VPD, so that's probably the one:
> 
>   06:00.0 Ethernet controller: Qualcomm Atheros Attansic L2 Fast Ethernet
> Capabilities: [6c] Vital Product Data
>   Not readable
> 
> I think lspci would still print "Not readable" if we just made the
> kernel return 0 instead of -EIO [2].
> 
> Bjorn
> 
> [1] https://git.kernel.org/pub/scm/utils/pciutils/pciutils.git/tree/lib/sysfs.c#n410
> [2] https://git.kernel.org/pub/scm/utils/pciutils/pciutils.git/tree/ls-vpd.c#n87
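
The stdout/stderr point above can be tried without the affected hardware. The sketch below is only an illustration: fake_lspci is a hypothetical stand-in for "lspci -vvv" (it is not a real tool) that prints one device line to stdout and the pcilib warning to stderr, which is how the real command appears to behave on this machine.

```shell
#!/bin/sh
# Hypothetical stand-in for "lspci -vvv": device data goes to stdout,
# the pcilib warning goes to stderr.
fake_lspci() {
    echo "06:00.0 Ethernet controller: Qualcomm Atheros Attansic L2 Fast Ethernet"
    echo "pcilib: sysfs_read_vpd: read failed: Input/output error" >&2
}

# Capturing only stdout: the warning is discarded here, but with a plain
# "> file" redirect it would stay on the terminal and stand out.
stdout_only=$(fake_lspci 2>/dev/null)

# Merging stderr into stdout lets grep see the warning.
warning=$(fake_lspci 2>&1 | grep pcilib)

# Capturing only stderr: send stderr to the capture first, then drop stdout.
stderr_only=$(fake_lspci 2>&1 >/dev/null)

printf 'stdout: %s\nstderr: %s\n' "$stdout_only" "$stderr_only"
```

With the real tool, the same redirections would be "lspci -vvv 2>/dev/null" for the listing alone and "lspci -vvv 2>&1 >/dev/null" for the warnings alone.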


Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Bjorn Helgaas
On Thu, Dec 28, 2017 at 06:30:58PM -0500, Alexandru Chirvasitu wrote:
> Attached, but heads up on this: when redirecting the output of lspci
> -vvv to a text file as root I get
> 
> pcilib: sysfs_read_vpd: read failed: Input/output error
> 
> I can find bugs filed for various distros to this same effect, but
> haven't tracked down any explanations.

This is a tangent, but I think you should *always* see "Input/output
error" on this system when running "lspci -vvv" as root, regardless of
whether you redirect the output (the error probably goes to stderr,
not stdout, so it's probably easy to miss when not redirecting the
output).

I think this is the -EIO return from pci_vpd_read(), which probably
means pci_vpd_size() returned 0 for one of your devices, which means
the VPD data provided by the device wasn't formatted correctly.  If
this happens, you should see a warning in dmesg about it ("invalid VPD
tag" or similar) -- could you verify that?

It's possible we should return something other than -EIO, or maybe
pcilib should do something other than emitting the warning.  In
pcilib, sysfs_read_vpd() emits the warning [1], and it would seem sort
of ugly to special-case EIO, so maybe we should change this in the
kernel.

It looks like your Qualcomm Atheros Attansic NIC at 06:00.0 is the
only device with VPD, so that's probably the one:

  06:00.0 Ethernet controller: Qualcomm Atheros Attansic L2 Fast Ethernet
Capabilities: [6c] Vital Product Data
  Not readable

I think lspci would still print "Not readable" if we just made the
kernel return 0 instead of -EIO [2].

Bjorn

[1] https://git.kernel.org/pub/scm/utils/pciutils/pciutils.git/tree/lib/sysfs.c#n410
[2] https://git.kernel.org/pub/scm/utils/pciutils/pciutils.git/tree/ls-vpd.c#n87
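
To find which device triggers the read failure without going through lspci, one option is to try reading each device's "vpd" attribute in sysfs directly. This is a rough sketch, not something from the thread: probe_vpd is a hypothetical helper name, it assumes the usual /sys/bus/pci/devices/<address>/vpd layout, and it most likely has to run as root because the attribute is normally not readable by other users.

```shell
#!/bin/sh
# Try to read one byte from every "vpd" attribute below a sysfs-style
# device directory and report which devices fail. The default is the
# usual sysfs location; pass another directory as $1 to experiment.
probe_vpd() {
    root=${1:-/sys/bus/pci/devices}
    for vpd in "$root"/*/vpd; do
        [ -e "$vpd" ] || continue    # the glob matched nothing
        dev=$(basename "$(dirname "$vpd")")
        # dd exits non-zero when the read itself fails (for example EIO).
        if dd if="$vpd" of=/dev/null bs=1 count=1 2>/dev/null; then
            echo "$dev: VPD readable"
        else
            echo "$dev: VPD read failed"
        fi
    done
}
```

Calling probe_vpd with no argument scans the real sysfs tree. On the laptop discussed here one would expect a single "read failed" line for the NIC at 06:00.0, but that is an expectation, not a captured result.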


Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Alexandru Chirvasitu
On Fri, Dec 29, 2017 at 12:36:37AM +0100, Thomas Gleixner wrote:
> On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> 
> > Attached, but heads up on this: when redirecting the output of lspci
> > -vvv to a text file as root I get
> > 
> > pcilib: sysfs_read_vpd: read failed: Input/output error
> > 
> > I can find bugs filed for various distros to this same effect, but
> > haven't tracked down any explanations.
> 
> Weird, but the info looks complete.
> 
> Can you please add 'pci=nomsi' to the 4.15 kernel command line and see
> whether that works?

It does (emailing from that successful boot as we speak). I'm on a
clean 4.15-rc5 (as in no patches, etc.). 

This was also suggested way at the top of this thread by Dexuan Cui
for 4.15-rc3 (where this exchange started), and it worked back then
too.

> 
> Thanks,
> 
>   tglx
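
When reporting results like the one above, it helps to confirm that the parameter really reached the running kernel. A minimal sketch, assuming the standard /proc/cmdline file; has_boot_param is a hypothetical helper name, not an existing tool:

```shell
#!/bin/sh
# Succeed if the given parameter appears as a whole word on the kernel
# command line. $2 defaults to /proc/cmdline; another file can be passed
# for experimentation.
has_boot_param() {
    param=$1
    cmdline_file=${2:-/proc/cmdline}
    # One word per line, then an exact fixed-string match of the whole line.
    tr -s ' \t' '\n\n' < "$cmdline_file" | grep -qxF -- "$param"
}

# Usage on the running system:
#   has_boot_param pci=nomsi && echo "booted with pci=nomsi"
```

The whole-word match avoids false positives from longer parameters that merely contain the same text.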


Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Thomas Gleixner
On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:

> Attached, but heads up on this: when redirecting the output of lspci
> -vvv to a text file as root I get
> 
> pcilib: sysfs_read_vpd: read failed: Input/output error
> 
> I can find bugs filed for various distros to this same effect, but
> haven't tracked down any explanations.

Weird, but the info looks complete.

Can you please add 'pci=nomsi' to the 4.15 kernel command line and see
whether that works?

Thanks,

tglx


Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Alexandru Chirvasitu
Attached, but heads up on this: when redirecting the output of lspci
-vvv to a text file as root I get

pcilib: sysfs_read_vpd: read failed: Input/output error

I can find bugs filed for various distros to this same effect, but
haven't tracked down any explanations.

On Fri, Dec 29, 2017 at 12:19:19AM +0100, Thomas Gleixner wrote:
> On Thu, 28 Dec 2017, Thomas Gleixner wrote:
> 
> > On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> > 
> > > Attached.
> > > 
> > > I don't have a 4.14 family kernel available at the moment on that
> > > machine. What I'm attaching comes from the 4.13 one I was playing with
> > > yesterday, what with kexec and all.
> > 
> > Good enough. Thanks !
> 
> Spoke too fast. Could you please run that command as root?
> 
> And while at it please provide the output of
> 
> cat /proc/interrupts
> 
> as well.
> 
> Thanks,
> 
>   tglx
> 
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD/ATI] RC410 Host Bridge (rev 01)
    Subsystem: ASUSTeK Computer Inc. RC410 Host Bridge
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap- 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- SERR- TAbort- SERR- TAbort- Reset- FastB2B-
        PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
    Capabilities: [b0] Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] RC4xx/RS4xx PCI Bridge [int gfx]
    Kernel modules: shpchp

00:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RC4xx/RS4xx PCI Express Port 1 (prog-if 00 [Normal decode])
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- Reset- FastB2B-
        PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
    Capabilities: [50] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [58] Express (v1) Root Port (Slot+), MSI 00
        DevCap: MaxPayload 128 bytes, PhantFunc 0
            ExtTag+ RBE-
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 128 bytes
        DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap: Port #1, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
            Slot #0, PowerLimit 25.000W; Interlock- NoCompl-
        SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
            Control: AttnInd Off, PwrInd Off, Power- Interlock-
        SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
            Changed: MRL- PresDet+ LinkState-
        RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
        RootCap: CRSVisible-
        RootSta: PME ReqID , PMEStatus- PMEPending-
    Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit-
        Address:   Data: 
    Capabilities: [b0] Subsystem: ASUSTeK Computer Inc. RC4xx/RS4xx PCI Express Port 1
    Capabilities: [b8] HyperTransport: MSI Mapping Enable+ Fixed+
    Kernel driver in use: pcieport
    Kernel modules: shpchp

00:05.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RC4xx/RS4xx PCI Express Port 2 (prog-if 00 [Normal decode])
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- Reset- FastB2B-
        PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
    Capabilities: [50] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [58] Express (v1) Root Port (Slot+), MSI 00
        DevCap: MaxPayload 128 bytes, PhantFunc 0
            ExtTag+ RBE-
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 128 bytes
        DevSta: CorrErr- 

Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Thomas Gleixner
On Thu, 28 Dec 2017, Thomas Gleixner wrote:

> On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> 
> > Attached.
> > 
> > I don't have a 4.14 family kernel available at the moment on that
> > machine. What I'm attaching comes from the 4.13 one I was playing with
> > yesterday, what with kexec and all.
> 
> Good enough. Thanks !

Spoke too fast. Could you please run that command as root?

And while at it please provide the output of

cat /proc/interrupts

as well.

Thanks,

tglx
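
For comparing a normal boot against a pci=nomsi boot, a quick summary of /proc/interrupts can be easier to eyeball than the full table. The following is a sketch under the assumption that the interrupt chip names appear as "PCI-MSI" and "IO-APIC" in that file, as they usually do on x86; count_irq_chips is a hypothetical helper name:

```shell
#!/bin/sh
# Count how many interrupt lines are served by PCI-MSI and by the IO-APIC.
# $1 defaults to /proc/interrupts; a saved copy can be passed instead.
count_irq_chips() {
    file=${1:-/proc/interrupts}
    # grep -c prints 0 (and exits non-zero) when nothing matches;
    # only the printed count is used here.
    msi=$(grep -c 'PCI-MSI' "$file")
    ioapic=$(grep -c 'IO-APIC' "$file")
    echo "PCI-MSI: $msi, IO-APIC: $ioapic"
}
```

Running "count_irq_chips" on both boots, or on two saved copies of the file, shows at a glance whether any device is still using MSI.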



Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Thomas Gleixner
On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:

> Attached.
> 
> I don't have a 4.14 family kernel available at the moment on that
> machine. What I'm attaching comes from the 4.13 one I was playing with
> yesterday, what with kexec and all.

Good enough. Thanks !


Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Alexandru Chirvasitu
Attached.

I don't have a 4.14 family kernel available at the moment on that
machine. What I'm attaching comes from the 4.13 one I was playing with
yesterday, what with kexec and all.

On Thu, Dec 28, 2017 at 10:54:25PM +0100, Thomas Gleixner wrote:
> On Thu, 28 Dec 2017, Thomas Gleixner wrote:
> > On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> > 
> > > No; it seems to be tied to this specific issue, and I was seeing even
> > > before getting logs just now, whenever I'd start one of the bad
> > > kernels in recovery mode.
> > > 
> > > But no, I've never seen that in any other logs, or on any other
> > > screens outside of those popping up in relation to this problem.
> > 
> > Ok. I'll dig into it and we have a 100% reproducer reported by someone else
> > now, which might give us simpler insight into that issue. I'll let you know
> > once I have something to test. Might be a couple of days though.
> 
> Can you please provide the output of
> 
> lspci -vvv
> 
> from a working kernel, preferably 4.14.y
> 
> Thanks,
> 
>   tglx
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD/ATI] RC410 Host Bridge (rev 01)
    Subsystem: ASUSTeK Computer Inc. RC410 Host Bridge
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap- 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- SERR- TAbort- SERR- TAbort- Reset- FastB2B-
        PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
    Capabilities: <access denied>
    Kernel modules: shpchp

00:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RC4xx/RS4xx PCI Express Port 1 (prog-if 00 [Normal decode])
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- Reset- FastB2B-
        PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
    Capabilities: <access denied>
    Kernel driver in use: pcieport
    Kernel modules: shpchp

00:05.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RC4xx/RS4xx PCI Express Port 2 (prog-if 00 [Normal decode])
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- Reset- FastB2B-
        PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
    Capabilities: <access denied>
    Kernel driver in use: pcieport
    Kernel modules: shpchp

00:06.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RC4xx/RS4xx PCI Express Port 3 (prog-if 00 [Normal decode])
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- Reset- FastB2B-
        PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
    Capabilities: <access denied>
    Kernel driver in use: pcieport
    Kernel modules: shpchp

00:07.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RC4xx/RS4xx PCI Express Port 4 (prog-if 00 [Normal decode])
    Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- Reset- FastB2B-
        PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
    Capabilities: <access denied>
    Kernel driver in use: pcieport
    Kernel modules: shpchp

00:12.0 SATA controller: Advanced Micro Devices, Inc. [AMD/ATI] SB600 Non-Raid-5 SATA (prog-if 01 [AHCI 1.0])
    Subsystem: ASUSTeK Computer Inc. SB600 Non-Raid-5 SATA
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- SERR- 
    Kernel driver in use: ahci
    Kernel modules: ahci

00:13.0 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB600 USB (OHCI0) (prog-if 10 [OHCI])
    Subsystem: ASUSTeK Computer Inc. SB600 USB (OHCI0)
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap- 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR- TAbort- SERR- TAbort- SERR- TAbort- SERR- TAbort- SERR- TAbort- SERR- 
    Kernel driver in use: ehci-pci
    Kernel modules: ehci_pci

00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 SMBus Controller (rev 13)
    Subsystem: ASUSTeK Computer Inc. SBx00 SMBus Controller
    Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap- 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- SERR- TAbort- SERR- TAbort- SERR- 
    Kernel driver in use: snd_hda_intel
    Kernel modules: snd_hda_intel

00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD/ATI] SB600 PCI to LPC 

Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Thomas Gleixner
On Thu, 28 Dec 2017, Thomas Gleixner wrote:
> On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> 
> > No; it seems to be tied to this specific issue, and I was seeing even
> > before getting logs just now, whenever I'd start one of the bad
> > kernels in recovery mode.
> > 
> > But no, I've never seen that in any other logs, or on any other
> > screens outside of those popping up in relation to this problem.
> 
> Ok. I'll dig into it and we have a 100% reproducer reported by someone else
> now, which might give us simpler insight into that issue. I'll let you know
> once I have something to test. Might be a couple of days though.

Can you please provide the output of

lspci -vvv

from a working kernel, preferably 4.14.y

Thanks,

tglx
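
A note on collecting such a dump: a minimal sketch, assuming the output has been saved to a file. The helper name `lspci_dump_is_complete` and the file name below are hypothetical. When lspci runs without root privileges it prints `<access denied>` in place of each device's capability list, so it can be worth checking a saved dump for that marker before sending it.

```shell
# Succeeds (exit 0) only if the saved lspci dump is readable and contains
# no unreadable capability lists. Hypothetical helper, not part of pciutils.
lspci_dump_is_complete() {
    # -F: match a fixed string, -q: no output; '!' inverts grep's status.
    [ -r "$1" ] && ! grep -q -F -e '<access denied>' -- "$1"
}
```

Typical use would be `sudo lspci -vvv > lspci-vvv.txt`, then `lspci_dump_is_complete lspci-vvv.txt || echo "re-run lspci as root"`.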


RE: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Thomas Gleixner
On Thu, 28 Dec 2017, Dexuan Cui wrote:

> > From: Thomas Gleixner [mailto:t...@linutronix.de]
> > Sent: Thursday, December 28, 2017 03:03
> > > > On Wed, Dec 20, 2017 at 02:12:05AM +, Dexuan Cui wrote:
> > 
> > > > For Linux VM running on Hyper-V, we did get "spurious APIC interrupt
> > > > through vector " and a patchset, which included the patch you identified
> > > > ("genirq: Add config option for reservation mode"), was made to fix the
> > > > issue. But since you're using a physical machine rather than a VM, I
> > > > suspect it should be a different issue.
> > 
> > Aaargh! Why was this never reported and where is that magic patchset?
> > tglx
> 
> Hi Thomas,
> The Hyper-V specific issue was reported and you made a patchset to fix it:
> https://patchwork.kernel.org/patch/10006171/
> https://lkml.org/lkml/2017/10/17/120

Ah, ok. I did not make the connection. I'll have a scan through the tree to
figure out whether there is some other weird place which is missing that.

Thanks,

tglx


RE: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Dexuan Cui
> From: Thomas Gleixner [mailto:t...@linutronix.de]
> Sent: Thursday, December 28, 2017 03:03
> > > On Wed, Dec 20, 2017 at 02:12:05AM +, Dexuan Cui wrote:
> 
> > > For Linux VM running on Hyper-V, we did get "spurious APIC interrupt
> > > through vector " and a patchset, which included the patch you identified
> > > ("genirq: Add config option for reservation mode"), was made to fix the
> > > issue. But since you're using a physical machine rather than a VM, I
> > > suspect it should be a different issue.
> 
> Aaargh! Why was this never reported and where is that magic patchset?
>   tglx

Hi Thomas,
The Hyper-V specific issue was reported and you made a patchset to fix it:
https://patchwork.kernel.org/patch/10006171/
https://lkml.org/lkml/2017/10/17/120


Thanks,
-- Dexuan


Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Thomas Gleixner
On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:

> No; it seems to be tied to this specific issue, and I was seeing even
> before getting logs just now, whenever I'd start one of the bad
> kernels in recovery mode.
> 
> But no, I've never seen that in any other logs, or on any other
> screens outside of those popping up in relation to this problem.

Ok. I'll dig into it and we have a 100% reproducer reported by someone else
now, which might give us simpler insight into that issue. I'll let you know
once I have something to test. Might be a couple of days though.

Thanks,

tglx


Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Alexandru Chirvasitu
No; it seems to be tied to this specific issue, and I was seeing it even
before getting logs just now, whenever I'd start one of the bad
kernels in recovery mode.

But no, I've never seen that in any other logs, or on any other
screens outside of those popping up in relation to this problem.

On Thu, Dec 28, 2017 at 06:29:05PM +0100, Thomas Gleixner wrote:
> On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> > On Thu, Dec 28, 2017 at 05:10:28PM +0100, Thomas Gleixner wrote:
> > > On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> > > > Actually, it decided to cooperate for just long enough for me to get
> > > > the dmesg out. Attached.
> > > > 
> > > > This is from the kernel you asked about: Dou's patch + yours, i.e. the
> > > > latest one in that git log I just sent, booted up with 'apic=debug'.
> > > 
> > > Ok. As I suspected that warning does not trigger. I would have been
> > > massively surprised if that happened. So Dou's patch is just a red herring
> > > and just might change the timing enough to make the problem 'hide'.
> > > 
> > > Can you try something completely different please?
> > > 
> > > > Just use plain Linus tree without any additional patches on top and disable
> > > CONFIG_NO_HZ_IDLE, i.e. select CONFIG_HZ_PERIODIC.
> > > 
> > > If that works, then reenable it and add 'nohz=off' to the kernel command
> > > line.
> > >
> > 
> > No go here I'm afraid:
> > 
> > Linus' clean 4.15-rc5 compiled with CONFIG_HZ_PERIODIC exhibits the
> > familiar behaviour: lockups, sometimes instant upon trying to log in,
> > sometimes logging me in and freaking out seconds later.
> 
> Ok. So it's not the issue I had in mind. 
> 
> Back to some of the interesting bits in the logs:
> 
> [   36.017942] spurious APIC interrupt through vector ff on CPU#0, should never happen.
> 
> Does that message ever show up in 4.14 or 4.9?
> 
> Thanks,
> 
>   tglx


Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Thomas Gleixner
On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> On Thu, Dec 28, 2017 at 05:10:28PM +0100, Thomas Gleixner wrote:
> > On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> > > Actually, it decided to cooperate for just long enough for me to get
> > > the dmesg out. Attached.
> > > 
> > > This is from the kernel you asked about: Dou's patch + yours, i.e. the
> > > latest one in that git log I just sent, booted up with 'apic=debug'.
> > 
> > Ok. As I suspected that warning does not trigger. I would have been
> > massively surprised if that happened. So Dou's patch is just a red herring
> > and just might change the timing enough to make the problem 'hide'.
> > 
> > Can you try something completely different please?
> > 
> > Just use plain Linus tree without any additional patches on top and disable
> > CONFIG_NO_HZ_IDLE, i.e. select CONFIG_HZ_PERIODIC.
> > 
> > If that works, then reenable it and add 'nohz=off' to the kernel command
> > line.
> >
> 
> No go here I'm afraid:
> 
> Linus' clean 4.15-rc5 compiled with CONFIG_HZ_PERIODIC exhibits the
> familiar behaviour: lockups, sometimes instant upon trying to log in,
> sometimes logging me in and freaking out seconds later.

Ok. So it's not the issue I had in mind. 

Back to some of the interesting bits in the logs:

[   36.017942] spurious APIC interrupt through vector ff on CPU#0, should never happen.

Does that message ever show up in 4.14 or 4.9?

Thanks,

tglx
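
One way to answer the question above across several kernels is to save the kernel log from each boot and count the message in every file. This is only a sketch: the function name `count_spurious_apic` and the log file names are hypothetical, and it assumes the logs were saved as plain text (for example with `dmesg > dmesg-4.14.txt`).

```shell
# Print how many "spurious APIC interrupt" lines a saved kernel log holds.
# Prints 0 when the message never appears.
count_spurious_apic() {
    # grep -c prints the match count but exits 1 on zero matches;
    # '|| true' keeps that from aborting a script run under 'set -e'.
    grep -c -e 'spurious APIC interrupt' -- "$1" || true
}
```

Running `count_spurious_apic dmesg-4.14.txt` and `count_spurious_apic dmesg-4.15-rc5.txt` side by side would show whether the message is specific to the affected kernels.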


Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Thomas Gleixner
On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> Actually, it decided to cooperate for just long enough for me to get
> the dmesg out. Attached.
> 
> This is from the kernel you asked about: Dou's patch + yours, i.e. the
> latest one in that git log I just sent, booted up with 'apic=debug'.

Ok. As I suspected that warning does not trigger. I would have been
massively surprised if that happened. So Dou's patch is just a red herring
and just might change the timing enough to make the problem 'hide'.

Can you try something completely different please?

Just use plain Linus tree without any additional patches on top and disable
CONFIG_NO_HZ_IDLE, i.e. select CONFIG_HZ_PERIODIC.

If that works, then reenable it and add 'nohz=off' to the kernel command
line.

Thanks,

tglx
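
Before testing, it may help to confirm which tick mode a given build actually selected and whether `nohz=off` reached the command line. The sketch below assumes a kernel `.config` file and a saved copy of (or path to) `/proc/cmdline`; the function names `tick_mode` and `has_nohz_off` are hypothetical.

```shell
# Print the timer-tick choice set in a kernel config file, e.g.
# "CONFIG_HZ_PERIODIC=y". Prints nothing if none of the three is set.
tick_mode() {
    grep -E -o -e '^CONFIG_(HZ_PERIODIC|NO_HZ_IDLE|NO_HZ_FULL)=y' -- "$1" || true
}

# Succeeds if the given command-line file (default: /proc/cmdline)
# contains the nohz=off parameter as a separate word.
has_nohz_off() {
    grep -q -w -e 'nohz=off' -- "${1:-/proc/cmdline}"
}
```

In a kernel source tree the choice can be switched with the in-tree helper, for example `scripts/config --disable NO_HZ_IDLE --enable HZ_PERIODIC` followed by `make olddefconfig`; `tick_mode .config` then shows what the build will use.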



Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Alexandru Chirvasitu
On Thu, Dec 28, 2017 at 10:48:35AM -0500, Alexandru Chirvasitu wrote:
> On Thu, Dec 28, 2017 at 03:48:15PM +0100, Thomas Gleixner wrote:
> > On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> > > On Thu, Dec 28, 2017 at 12:00:47PM +0100, Thomas Gleixner wrote:
> > > > Ok, let's take a step back. The bisect/kexec attempts led us away from the
> > > > initial problem which is the machine locking up after login, right?
> > > >
> > > 
> > > Yes; sorry about that..
> > 
> > Nothing to be sorry about.
> > 
> > > x86/vector: Replace the raw_spin_lock() with
> > > 
> > > diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> > > index 7504491..e5bab02 100644
> > > --- a/arch/x86/kernel/apic/vector.c
> > > +++ b/arch/x86/kernel/apic/vector.c
> > > @@ -726,6 +726,7 @@ static int apic_set_affinity(struct irq_data *irqd,
> > >  			     const struct cpumask *dest, bool force)
> > >  {
> > >  	struct apic_chip_data *apicd = apic_chip_data(irqd);
> > > +	unsigned long flags;
> > >  	int err;
> > >  
> > >  	/*
> > > @@ -740,13 +741,13 @@ static int apic_set_affinity(struct irq_data *irqd,
> > >  	    (apicd->is_managed || apicd->can_reserve))
> > >  		return IRQ_SET_MASK_OK;
> > >  
> > > -	raw_spin_lock(&vector_lock);
> > > +	raw_spin_lock_irqsave(&vector_lock, flags);
> > >  	cpumask_and(vector_searchmask, dest, cpu_online_mask);
> > >  	if (irqd_affinity_is_managed(irqd))
> > >  		err = assign_managed_vector(irqd, vector_searchmask);
> > >  	else
> > >  		err = assign_vector_locked(irqd, vector_searchmask);
> > > -	raw_spin_unlock(&vector_lock);
> > > +	raw_spin_unlock_irqrestore(&vector_lock, flags);
> > >  	return err ? err : IRQ_SET_MASK_OK;
> > >  }
> > > 
> > > With this, I still get the lockup messages after login, but not the
> > > freezes!
> > 
> > That's really interesting. There should be no code path which calls into
> > that with interrupts enabled. I assume you never ran that kernel with
> > CONFIG_PROVE_LOCKING=y.
> >
> 
> Correct. That option is not set in .config.
> 
> > Find below a debug patch which should show us the call chain for that
> > case. Please apply that on top of Dou's patch so the machine stays
> > accessible. Plain output from dmesg is sufficient.
> > 
> > > The lockups register in the log, which I am attaching (see below for
> > > attachment naming conventions).
> > 
> > Hmm. That's RCU lockups and that backtrace on the CPU which gets the stall
> > looks very familiar. I'd like to see the above result first and then I'll
> > send you another pile of patches which might cure that RCU issue.
> > 
> > Thanks,
> > 
> > tglx
> > 
> > 8<---
> > --- a/arch/x86/kernel/apic/vector.c
> > +++ b/arch/x86/kernel/apic/vector.c
> > @@ -729,6 +729,8 @@ static int apic_set_affinity(struct irq_
> >  	unsigned long flags;
> >  	int err;
> >  
> > +	WARN_ON_ONCE(!irqs_disabled());
> > +
> >  	/*
> >  	 * Core code can call here for inactive interrupts. For inactive
> >  	 * interrupts which use managed or reservation mode there is no
> > 
> > 
> > 
> 
> Bit of a step back here: the kernel treated with Dou's patch no longer
> logs me in reliably as before, with or without this newest patch on
> top..
> 
> So now I sometimes get immediate lockups and freezes upon trying to
> log in, and other times I get logged in but get a freeze seconds
> later.
> 
> In no case can I roam around long enough to get a dmesg, and I no
> longer get the non-freezing lockups from before. I can't imagine what
> I could possibly have changed..
> 
> Here's the output of `git log --pretty=oneline -5` on the branch I'm
> working in.
> 
> 
> 
> f2c02af5cc1d620c039b21fab0ca5948a06daf90 2nd tglx patch
> 7715575170bacf3566d400b9f2210a10ce152880 x86/vector: Replace the raw_spin_lock() with raw_spin_lock_irqsave()
> 8d9d56caf33d78bfe6b6087767b1b84acee58458 x86-32: fix kexec with stack canary (CONFIG_CC_STACKPROTECTOR)
> a197e9dea4ccb72e1a6457fac15329bd5319e719 irq/matrix: Remove the overused BUGON() in irq_matrix_assign_system()
> 464e1d5f23cca236b930ef068c328a64cab78fb1 Linux 4.15-rc5
> 
> 
> 
> 7715575170bacf3566d400b9f2210a10ce152880, which is the kernel with
> Dou's patch, logged me in and allowed me to produce the dmesg from
> before. I did this a couple of times back then. I no longer can, for
> some reason, as it's reverted back to the no-go lockups from before.
> 
> And the next one, f2c02af5cc1d620c039b21fab0ca5948a06daf90, where I
> applied the patch you just sent, behaves identically.
> 
>

Actually, it decided to cooperate for just long enough for me to get
the dmesg out. Attached.

This is from the kernel you asked about: Dou's patch + yours, i.e. the
latest one in that git log I just sent, booted up with 'apic=debug'.
[0.00] Linux version 4.15.0-rc5-dou-thms-p2+ (root@axiomatic) (gcc 
version 

Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Alexandru Chirvasitu
On Thu, Dec 28, 2017 at 03:48:15PM +0100, Thomas Gleixner wrote:
> On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> > On Thu, Dec 28, 2017 at 12:00:47PM +0100, Thomas Gleixner wrote:
> > > Ok, let's take a step back. The bisect/kexec attempts led us away from the
> > > initial problem which is the machine locking up after login, right?
> > >
> > 
> > Yes; sorry about that..
> 
> Nothing to be sorry about.
> 
> > x86/vector: Replace the raw_spin_lock() with
> > 
> > diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> > index 7504491..e5bab02 100644
> > --- a/arch/x86/kernel/apic/vector.c
> > +++ b/arch/x86/kernel/apic/vector.c
> > @@ -726,6 +726,7 @@ static int apic_set_affinity(struct irq_data *irqd,
> >  const struct cpumask *dest, bool force)
> >  {
> > struct apic_chip_data *apicd = apic_chip_data(irqd);
> > +   unsigned long flags;
> > int err;
> >  
> > /*
> > @@ -740,13 +741,13 @@ static int apic_set_affinity(struct irq_data *irqd,
> > (apicd->is_managed || apicd->can_reserve))
> > return IRQ_SET_MASK_OK;
> >  
> > -   raw_spin_lock(&vector_lock);
> > +   raw_spin_lock_irqsave(&vector_lock, flags);
> > cpumask_and(vector_searchmask, dest, cpu_online_mask);
> > if (irqd_affinity_is_managed(irqd))
> > err = assign_managed_vector(irqd, vector_searchmask);
> > else
> > err = assign_vector_locked(irqd, vector_searchmask);
> > -   raw_spin_unlock(&vector_lock);
> > +   raw_spin_unlock_irqrestore(&vector_lock, flags);
> > return err ? err : IRQ_SET_MASK_OK;
> >  }
> > 
> > With this, I still get the lockup messages after login, but not the
> > freezes!
> 
> That's really interesting. There should be no code path which calls into
> that with interrupts enabled. I assume you never ran that kernel with
> CONFIG_PROVE_LOCKING=y.
>

Correct. That option is not set in .config.

> Find below a debug patch which should show us the call chain for that
> case. Please apply that on top of Dou's patch so the machine stays
> accessible. Plain output from dmesg is sufficient.
> 
> > The lockups register in the log, which I am attaching (see below for
> > attachment naming conventions).
> 
> Hmm. That's RCU lockups and that backtrace on the CPU which gets the stall
> looks very familiar. I'd like to see the above result first and then I'll
> send you another pile of patches which might cure that RCU issue.
> 
> Thanks,
> 
>   tglx
> 
> 8<---
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -729,6 +729,8 @@ static int apic_set_affinity(struct irq_
>   unsigned long flags;
>   int err;
>  
> + WARN_ON_ONCE(!irqs_disabled());
> +
>   /*
>* Core code can call here for inactive interrupts. For inactive
>* interrupts which use managed or reservation mode there is no
> 
> 
> 

Bit of a step back here: the kernel treated with Dou's patch no longer
logs me in reliably as before, with or without this newest patch on
top..

So now I sometimes get immediate lockups and freezes upon trying to
log in, and other times I get logged in but get a freeze seconds
later.

In no case can I roam around long enough to get a dmesg, and I no
longer get the non-freezing lockups from before. I can't imagine what
I could possibly have changed..

Here's the output of `git log --pretty=oneline -5` on the branch I'm
working in.



f2c02af5cc1d620c039b21fab0ca5948a06daf90 2nd tglx patch
7715575170bacf3566d400b9f2210a10ce152880 x86/vector: Replace the 
raw_spin_lock() with raw_spin_lock_irqsave()
8d9d56caf33d78bfe6b6087767b1b84acee58458 x86-32: fix kexec with stack canary 
(CONFIG_CC_STACKPROTECTOR)
a197e9dea4ccb72e1a6457fac15329bd5319e719 irq/matrix: Remove the overused 
BUGON() in irq_matrix_assign_system()
464e1d5f23cca236b930ef068c328a64cab78fb1 Linux 4.15-rc5



7715575170bacf3566d400b9f2210a10ce152880, which is the kernel with
Dou's patch, logged me in and allowed me to produce the dmesg from
before. I did this a couple of times back then. I no longer can, for
some reason, as it's reverted back to the no-go lockups from before.

And the next one, f2c02af5cc1d620c039b21fab0ca5948a06daf90, where I
applied the patch you just sent, behaves identically.





Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Thomas Gleixner
On Thu, 28 Dec 2017, Alexandru Chirvasitu wrote:
> On Thu, Dec 28, 2017 at 12:00:47PM +0100, Thomas Gleixner wrote:
> > Ok, lets take a step back. The bisect/kexec attempts led us away from the
> > initial problem which is the machine locking up after login, right?
> >
> 
> Yes; sorry about that..

Nothing to be sorry about.

> x86/vector: Replace the raw_spin_lock() with
> 
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index 7504491..e5bab02 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -726,6 +726,7 @@ static int apic_set_affinity(struct irq_data *irqd,
>  const struct cpumask *dest, bool force)
>  {
> struct apic_chip_data *apicd = apic_chip_data(irqd);
> +   unsigned long flags;
> int err;
>  
> /*
> @@ -740,13 +741,13 @@ static int apic_set_affinity(struct irq_data *irqd,
> (apicd->is_managed || apicd->can_reserve))
> return IRQ_SET_MASK_OK;
>  
> -   raw_spin_lock(&vector_lock);
> +   raw_spin_lock_irqsave(&vector_lock, flags);
> cpumask_and(vector_searchmask, dest, cpu_online_mask);
> if (irqd_affinity_is_managed(irqd))
> err = assign_managed_vector(irqd, vector_searchmask);
> else
> err = assign_vector_locked(irqd, vector_searchmask);
> -   raw_spin_unlock(&vector_lock);
> +   raw_spin_unlock_irqrestore(&vector_lock, flags);
> return err ? err : IRQ_SET_MASK_OK;
>  }
> 
> With this, I still get the lockup messages after login, but not the
> freezes!

That's really interesting. There should be no code path which calls into
that with interrupts enabled. I assume you never ran that kernel with
CONFIG_PROVE_LOCKING=y.

Find below a debug patch which should show us the call chain for that
case. Please apply that on top of Dou's patch so the machine stays
accessible. Plain output from dmesg is sufficient.

> The lockups register in the log, which I am attaching (see below for
> attachment naming conventions).

Hmm. That's RCU lockups and that backtrace on the CPU which gets the stall
looks very familiar. I'd like to see the above result first and then I'll
send you another pile of patches which might cure that RCU issue.

Thanks,

tglx

8<---
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -729,6 +729,8 @@ static int apic_set_affinity(struct irq_
unsigned long flags;
int err;
 
+   WARN_ON_ONCE(!irqs_disabled());
+
/*
 * Core code can call here for inactive interrupts. For inactive
 * interrupts which use managed or reservation mode there is no
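
The debug patch above relies on WARN_ON_ONCE() reporting the first time
its condition is true and staying quiet afterwards, so a single dmesg is
enough to see the offending call chain. What follows is only a rough
user-space model of that once-only behaviour, not kernel code:
model_warn_on_once(), model_set_affinity() and model_warnings_printed
are made-up names, and the real macro also dumps a stack trace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Counts how many warnings the model actually printed. */
static int model_warnings_printed;

/*
 * Hypothetical user-space model of WARN_ON_ONCE(): print on the first
 * hit only, but return the condition on every call.
 */
static bool model_warn_on_once(bool cond, bool *already_warned,
			       const char *what)
{
	assert(already_warned != NULL && what != NULL);

	if (cond && !*already_warned) {
		*already_warned = true;
		model_warnings_printed++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

/*
 * Stand-in for the patched function: complain if the caller arrives
 * with interrupts enabled. The caller passes the interrupt state in
 * explicitly because this is only a model.
 */
static bool model_set_affinity(bool irqs_disabled)
{
	static bool warned;

	return model_warn_on_once(!irqs_disabled, &warned,
				  "set_affinity called with interrupts enabled");
}
```

Calling model_set_affinity(false) any number of times prints one
warning; calling it with true never prints anything.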






Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Thomas Gleixner
On Wed, 20 Dec 2017, Alexandru Chirvasitu wrote:
> Merging the contents of another exchange spawned from the original

> > On Wed, Dec 20, 2017 at 02:12:05AM +, Dexuan Cui wrote:

> > For Linux VM running on Hyper-V, we did get "spurious APIC interrupt
> > through vector " and a patchset, which included the patch you identified
> > ("genirq: Add config option for reservation mode"), was made to fix the
> > issue. But since you're using a physical machine rather than a VM, I
> > suspect it should be a different issue. 

Aaargh! Why was this never reported and where is that magic patchset?

Thanks,

tglx


Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Thomas Gleixner
On Wed, 20 Dec 2017, Alexandru Chirvasitu wrote:
> On Wed, Dec 20, 2017 at 11:58:57AM +0800, Dou Liyang wrote:
> > At 12/20/2017 08:31 AM, Thomas Gleixner wrote:
> > > > I had never heard of 'bisect' before this casual mention (you might tell
> > > > I am a bit out of my depth). I've since applied it to Linus' tree 
> > > > between
> > > 
> > > > bebc608 Linux 4.14 (good)
> > > > 
> > > > and
> > > > 
> > > > 4fbd8d1 Linux 4.15-rc1 (bad)
> > > 
> > > Is Linus current head 4.15-rc4 bad as well?
> > > 
> > [...]
> 
> Yes. Exactly the same symptoms on
> 
> 1291a0d5 Linux 4.15-rc4
> 
> compiled just now from Linus' tree. 

OK, let's take a step back. The bisect/kexec attempts led us away from the
initial problem, which is the machine locking up after login, right?

Could you try the patch below on top of Linus' tree (rc5+)?

Thanks,

tglx

8<---
--- a/arch/x86/kernel/apic/apic_flat_64.c
+++ b/arch/x86/kernel/apic/apic_flat_64.c
@@ -151,7 +151,7 @@ static struct apic apic_flat __ro_after_
.apic_id_valid  = default_apic_id_valid,
.apic_id_registered = flat_apic_id_registered,
 
-   .irq_delivery_mode  = dest_LowestPrio,
+   .irq_delivery_mode  = dest_Fixed,
.irq_dest_mode  = 1, /* logical */
 
.disable_esr= 0,
--- a/arch/x86/kernel/apic/probe_32.c
+++ b/arch/x86/kernel/apic/probe_32.c
@@ -105,7 +105,7 @@ static struct apic apic_default __ro_aft
.apic_id_valid  = default_apic_id_valid,
.apic_id_registered = default_apic_id_registered,
 
-   .irq_delivery_mode  = dest_LowestPrio,
+   .irq_delivery_mode  = dest_Fixed,
/* logical delivery broadcast to all CPUs: */
.irq_dest_mode  = 1,
 
--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -184,7 +184,7 @@ static struct apic apic_x2apic_cluster _
.apic_id_valid  = x2apic_apic_id_valid,
.apic_id_registered = x2apic_apic_id_registered,
 
-   .irq_delivery_mode  = dest_LowestPrio,
+   .irq_delivery_mode  = dest_Fixed,
.irq_dest_mode  = 1, /* logical */
 
.disable_esr= 0,
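
For readers unfamiliar with the field this patch changes: the delivery
mode is a 3-bit value stored in bits 8-10 of an IO-APIC redirection
entry, where 0 means fixed delivery and 1 means lowest-priority
delivery. Below is a minimal user-space sketch of that encoding, not
kernel code. The two enum values match the numeric values of the
kernel's dest_Fixed and dest_LowestPrio, while the helper names
rte_set_delivery_mode() and rte_get_delivery_mode() are invented for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Delivery mode encodings (numeric values as used by the hardware). */
enum delivery_mode {
	DEST_FIXED      = 0,	/* deliver to every CPU named in the destination */
	DEST_LOWESTPRIO = 1,	/* hardware picks one of the named CPUs */
};

#define RTE_DELIVERY_SHIFT	8
#define RTE_DELIVERY_MASK	(0x7u << RTE_DELIVERY_SHIFT)

/*
 * Hypothetical helper: return the low 32 bits of a redirection entry
 * with the delivery mode field replaced and all other bits untouched.
 */
static uint32_t rte_set_delivery_mode(uint32_t rte_low, enum delivery_mode mode)
{
	assert((unsigned int)mode <= 7);
	return (rte_low & ~RTE_DELIVERY_MASK) |
	       ((uint32_t)mode << RTE_DELIVERY_SHIFT);
}

/* Hypothetical helper: extract the delivery mode field. */
static unsigned int rte_get_delivery_mode(uint32_t rte_low)
{
	return (rte_low & RTE_DELIVERY_MASK) >> RTE_DELIVERY_SHIFT;
}
```

With an entry whose low word is 0x931 (vector 0x31, lowest priority,
logical destination mode), switching to DEST_FIXED yields 0x831: only
the delivery mode bits change.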



Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-28 Thread Dou Liyang

Hi Alexandru,

At 12/28/2017 10:51 AM, Alexandru Chirvasitu wrote:

> Ah, of course. Attached is the output of `journalctl --boot=-1` after
> booting, getting locked up, and then rebooting a good kernel.


For the hard lockups on both CPUs after login:

Please try the attached patch, applying it with

git am ./0001-x86-vector-Replace-the-raw_spin_lock-with-raw_spin_l.patch

or

patch -p1 < ./0001-x86-vector-Replace-the-raw_spin_lock-with-raw_spin_l.patch



> Slightly different version of 4.15-rc5; this one has both patches
> applied, yours and Linus' for kexec, but the latter shouldn't make a
> difference.
>
> ---
>
> You'll see another trace in there that's been bugging me, about W+X
> checking. I'm not qualified to judge how related they are, but during
> these past few days I've compiled and tested many kernels, and many of
> them have exhibited the W+X thing but *not* the lockups.



Yes, I saw it, but I am not familiar with it and have no idea what causes it.

Thanks,
dou.

-8<


From 57d8543ea4dcf2a53b1c37757da12866a52aaf57 Mon Sep 17 00:00:00 2001
From: Dou Liyang 
Date: Thu, 28 Dec 2017 16:20:48 +0800
Subject: [PATCH] x86/vector: Replace the raw_spin_lock() with
 raw_spin_lock_irqsave()

Signed-off-by: Dou Liyang 
---
 arch/x86/kernel/apic/vector.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 750449152b04..a43ca26d5dfd 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -726,6 +726,7 @@ static int apic_set_affinity(struct irq_data *irqd,
 			 const struct cpumask *dest, bool force)
 {
 	struct apic_chip_data *apicd = apic_chip_data(irqd);
+	unsigned long flags;
 	int err;
 
 	/*
@@ -740,13 +741,13 @@ static int apic_set_affinity(struct irq_data *irqd,
 	(apicd->is_managed || apicd->can_reserve))
 		return IRQ_SET_MASK_OK;
 
-	raw_spin_lock(&vector_lock);
+	raw_spin_lock_irqsave(&vector_lock, flags);
 	cpumask_and(vector_searchmask, dest, cpu_online_mask);
 	if (irqd_affinity_is_managed(irqd))
 		err = assign_managed_vector(irqd, vector_searchmask);
 	else
 		err = assign_vector_locked(irqd, vector_searchmask);
-	raw_spin_unlock(&vector_lock);
+	raw_spin_unlock_irqrestore(&vector_lock, flags);
 	return err ? err : IRQ_SET_MASK_OK;
 }
 
-- 
2.14.3
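
The point of the irqsave/irqrestore variants used in the patch is that
they turn interrupts off for the critical section and then put the
interrupt state back exactly as the caller had it, whether that was
enabled or disabled. The following is a small user-space model of that
save-and-restore pattern, under the assumption that a single boolean
can stand in for the CPU's interrupt-enable flag; every model_* name is
hypothetical and none of this is kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the CPU interrupt-enable flag; purely a model. */
static bool model_irqs_enabled = true;

/* Model of saving the interrupt state and then disabling interrupts. */
static unsigned long model_irq_save(void)
{
	unsigned long flags = model_irqs_enabled ? 1UL : 0UL;

	model_irqs_enabled = false;
	return flags;
}

/* Model of restoring exactly the state that was saved earlier. */
static void model_irq_restore(unsigned long flags)
{
	assert(flags <= 1UL);
	model_irqs_enabled = (flags != 0);
}

/*
 * Model of a critical section written the way the patch writes it.
 * Returns true if interrupts were off while "inside" the section.
 */
static bool model_critical_section(void)
{
	unsigned long flags = model_irq_save();
	bool irqs_off_inside = !model_irqs_enabled;

	model_irq_restore(flags);
	return irqs_off_inside;
}
```

Whatever the state on entry, interrupts are off inside the section and
the entry state is back in place on exit. A plain lock/unlock pair
gives no such guarantee when a caller arrives with interrupts enabled.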



Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-27 Thread Alexandru Chirvasitu
On Thu, Dec 28, 2017 at 10:06:25AM +0800, Dou Liyang wrote:
> Hi Alexandru,
> 
> Thanks for testing !
> At 12/28/2017 12:18 AM, Alexandru Chirvasitu wrote:
> > As per instructions, I did the following:
> > 
> > (1)
> > 
> > Checked out
> > 
> > 464e1d5 Linux 4.15-rc5
> > 
> > (after getting my copy up to date, fetching, pulling, etc.) and
> > compiled it as-is. Config attached (the one labeled 'np' for 'no
> > patch').
> > 
> > Result:
> > 
> > Boot with no extra parameters locks up after login, as before;
> > 
> > apic=debug does not panic, but locks up after login, as before;
> > 
> I also hope to see the log with "apic=debug" by "journalctl" command,
> though the logs don't have the lockup trace.

Ah, of course. Attached is the output of `journalctl --boot=-1` after
booting, getting locked up, and then rebooting a good kernel.

Slightly different version of 4.15-rc5; this one has both patches
applied, yours and Linus' for kexec, but the latter shouldn't make a
difference.

---

You'll see another trace in there that's been bugging me, about W+X
checking. I'm not qualified to judge how related they are, but during
these past few days I've compiled and tested many kernels, and many of
them have exhibited the W+X thing but *not* the lockups.

I hope to trace that one back to the original commit with another
bisect one of these days, but they do seem to be different issues.

> 
> Thanks,
>   dou.
> > 
> 
> 
> 
-- Logs begin at Sat 2017-12-23 08:45:59 EST, end at Wed 2017-12-27 21:42:46 
EST. --
Dec 27 21:39:03 D-69-91-141-110 kernel: Linux version 4.15.0-rc5-kex-fix+ 
(root@axiomatic) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5)) 
#1 SMP Wed Dec 27 17:37:47 EST 2017
Dec 27 21:39:03 D-69-91-141-110 kernel: x86/fpu: x87 FPU will use FXSAVE
Dec 27 21:39:03 D-69-91-141-110 kernel: e820: BIOS-provided physical RAM map:
Dec 27 21:39:03 D-69-91-141-110 kernel: BIOS-e820: [mem 
0x-0x0009fbff] usable
Dec 27 21:39:03 D-69-91-141-110 kernel: BIOS-e820: [mem 
0x0009fc00-0x0009] reserved
Dec 27 21:39:03 D-69-91-141-110 kernel: BIOS-e820: [mem 
0x000e-0x000f] reserved
Dec 27 21:39:03 D-69-91-141-110 kernel: BIOS-e820: [mem 
0x0010-0xb7f9] usable
Dec 27 21:39:03 D-69-91-141-110 kernel: BIOS-e820: [mem 
0xb7fa-0xb7fadfff] ACPI data
Dec 27 21:39:03 D-69-91-141-110 kernel: BIOS-e820: [mem 
0xb7fae000-0xb7fe] ACPI NVS
Dec 27 21:39:03 D-69-91-141-110 kernel: BIOS-e820: [mem 
0xb7ff-0xb7ff] reserved
Dec 27 21:39:03 D-69-91-141-110 kernel: BIOS-e820: [mem 
0xfee0-0xfee00fff] reserved
Dec 27 21:39:03 D-69-91-141-110 kernel: BIOS-e820: [mem 
0xffb8-0x] reserved
Dec 27 21:39:03 D-69-91-141-110 kernel: NX (Execute Disable) protection: active
Dec 27 21:39:03 D-69-91-141-110 kernel: random: fast init done
Dec 27 21:39:03 D-69-91-141-110 kernel: SMBIOS 2.4 present.
Dec 27 21:39:03 D-69-91-141-110 kernel: DMI: ASUSTeK Computer Inc. F5RL 
   /F5RL  , BIOS 210 06/12/2008
Dec 27 21:39:03 D-69-91-141-110 kernel: e820: update [mem 
0x-0x0fff] usable ==> reserved
Dec 27 21:39:03 D-69-91-141-110 kernel: e820: remove [mem 
0x000a-0x000f] usable
Dec 27 21:39:03 D-69-91-141-110 kernel: e820: last_pfn = 0xb7fa0 max_arch_pfn = 
0x100
Dec 27 21:39:03 D-69-91-141-110 kernel: MTRR default type: uncachable
Dec 27 21:39:03 D-69-91-141-110 kernel: MTRR fixed ranges enabled:
Dec 27 21:39:03 D-69-91-141-110 kernel:   0-9 write-back
Dec 27 21:39:03 D-69-91-141-110 kernel:   A-B uncachable
Dec 27 21:39:03 D-69-91-141-110 kernel:   C-C write-protect
Dec 27 21:39:03 D-69-91-141-110 kernel:   D-D uncachable
Dec 27 21:39:03 D-69-91-141-110 kernel:   E-E write-through
Dec 27 21:39:03 D-69-91-141-110 kernel:   F-F write-protect
Dec 27 21:39:03 D-69-91-141-110 kernel: MTRR variable ranges enabled:
Dec 27 21:39:03 D-69-91-141-110 kernel:   0 base 0 mask F8000 
write-back
Dec 27 21:39:03 D-69-91-141-110 kernel:   1 base 08000 mask FE000 
write-back
Dec 27 21:39:03 D-69-91-141-110 kernel:   2 base 0A000 mask FF000 
write-back
Dec 27 21:39:03 D-69-91-141-110 kernel:   3 base 0B000 mask FF800 
write-back
Dec 27 21:39:03 D-69-91-141-110 kernel:   4 base 0B800 mask FFC00 
write-back
Dec 27 21:39:03 D-69-91-141-110 kernel:   5 base 0BC00 mask FFF00 
write-back
Dec 27 21:39:03 D-69-91-141-110 kernel:   6 base 0C000 mask FF000 
write-combining
Dec 27 21:39:03 D-69-91-141-110 kernel:   7 disabled
Dec 27 21:39:03 D-69-91-141-110 kernel: x86/PAT: Configuration [0-7]: WB  WC  
UC- UC  WB  WP  UC- WT  
Dec 27 21:39:03 D-69-91-141-110 kernel: Scan for SMP in [mem 
0x-0x03ff]
Dec 27 21:39:03 D-69-91-141-110 kernel: Scan for SMP in [mem 
0x0009fc00-0x0009]
Dec 27 21:39:03 D-69-91-141-110 kernel: 

Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-27 Thread Dou Liyang

Hi Alexandru,

Thanks for testing !
At 12/28/2017 12:18 AM, Alexandru Chirvasitu wrote:

> As per instructions, I did the following:
>
> (1)
>
> Checked out
>
> 464e1d5 Linux 4.15-rc5
>
> (after getting my copy up to date, fetching, pulling, etc.) and
> compiled it as-is. Config attached (the one labeled 'np' for 'no
> patch').
>
> Result:
>
> Boot with no extra parameters locks up after login, as before;
>
> apic=debug does not panic, but locks up after login, as before;

I would also like to see the "apic=debug" log captured with the
"journalctl" command, even though the logs don't have the lockup trace.

Thanks,
dou.








Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-27 Thread Dou Liyang

Hi Alexandru,

At 12/24/2017 04:01 AM, Alexandru Chirvasitu wrote:

> On Sat, Dec 23, 2017 at 02:32:52PM +0100, Thomas Gleixner wrote:
> > On Sat, 23 Dec 2017, Dexuan Cui wrote:
> >
> > > > From: Alexandru Chirvasitu [mailto:achirva...@gmail.com]
> > > > Sent: Friday, December 22, 2017 14:29
> > > >
> > > > The output of that precise command run just now on a freshly-compiled
> > > > copy of that commit is attached.
> > > >
> > > > On Fri, Dec 22, 2017 at 09:31:28PM +, Dexuan Cui wrote:
> > > > > > From: Alexandru Chirvasitu [mailto:achirva...@gmail.com]
> > > > > > Sent: Friday, December 22, 2017 06:21
> > > > > >
> > > > > > In the absence of logs, the best I can do at the moment is attach a
> > > > > > picture of the screen I am presented with on the apic=debug boot
> > > > > > attempt.
> > > > > > Alex
> > > > >
> > > > > The panic happens in irq_matrix_assign_system+0x4e/0xd0 in your picture.
> > > > > IMO we should find which line of code causes the panic. I suppose
> > > > > "objdump -D kernel/irq/matrix.o" can help to do that.
> > > > >
> > > > > Thanks,
> > > > > -- Dexuan
> > >
> > > The BUG_ON panic happens at line 147:
> > >    BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));



There are two bugs on your laptop:

  1. Hard lockups on both CPUs after login
  2. panic with "apic=debug"

For the second bug, please try the following patch (it needs Thomas'
confirmation :) ) on Linux 4.15-rc5. I think it can fix the panic.

Once the second bug is fixed, let's get back to the first one:

Is Linus' current head, 4.15-rc5, bad as well?

If yes, please boot with "apic=debug" and send the dmesg log.

Thanks,
dou.

8<---

irq/matrix: Remove the overused BUG_ON() in irq_matrix_assign_system()

Currently, x86 marks the preallocated legacy interrupts when initializing
IRQ (native_init_IRQ), but will clear them if they are not activated in
vector_configure_legacy().

So, in irq_matrix_assign_system(), replacing a legacy vector which may
not be allocated in cpumap->alloc_map[] with a system vector will trigger
the BUG_ON().

Remove the BUG_ON().

Signed-off-by: Dou Liyang 
---
 kernel/irq/matrix.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c
index 0ba0dd8863a7..876cbeab9ca2 100644
--- a/kernel/irq/matrix.c
+++ b/kernel/irq/matrix.c
@@ -143,11 +143,12 @@ void irq_matrix_assign_system(struct irq_matrix *m, unsigned int bit,
 	BUG_ON(m->online_maps > 1 || (m->online_maps && !replace));
 
 	set_bit(bit, m->system_map);
-	if (replace) {
-		BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));
+
+	if (replace && test_and_clear_bit(bit, cm->alloc_map)){
 		cm->allocated--;
 		m->total_allocated--;
 	}
+
 	if (bit >= m->alloc_start && bit < m->alloc_end)
 		m->systembits_inalloc++;
 
--
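
For readers who want to see the effect of the change outside the kernel,
here is a minimal user-space sketch of the patched logic. It is not the
kernel code: the struct names, the 64-vector limit, the non-atomic bit
helper and the boolean return value are all assumptions made for
illustration. Only the shape of the "replace" branch follows the patch
above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins for the kernel's struct cpumap and
 * struct irq_matrix; only the fields touched by the patch are modelled. */
struct model_cpumap {
	uint64_t alloc_map;      /* one bit per vector allocated on this CPU */
	unsigned int allocated;  /* number of bits set in alloc_map */
};

struct model_matrix {
	uint64_t system_map;          /* one bit per system vector */
	unsigned int total_allocated; /* allocated vectors across all CPUs */
};

/* Non-atomic model of test_and_clear_bit(): clear the bit and report
 * whether it was set beforehand. */
bool model_test_and_clear_bit(unsigned int bit, uint64_t *map)
{
	uint64_t mask = UINT64_C(1) << bit;
	bool was_set = (*map & mask) != 0;

	*map &= ~mask;
	return was_set;
}

/* Patched behaviour: adjust the counters only if the vector really was
 * allocated. Returns true when an allocation was replaced (the return
 * value is an addition for this sketch; the kernel function is void). */
bool model_assign_system(struct model_matrix *m, struct model_cpumap *cm,
			 unsigned int bit, bool replace)
{
	assert(bit < 64); /* the model only tracks 64 vectors */

	m->system_map |= UINT64_C(1) << bit;
	if (replace && model_test_and_clear_bit(bit, &cm->alloc_map)) {
		cm->allocated--;
		m->total_allocated--;
		return true;
	}
	return false;
}
```

Calling model_assign_system() with replace set for a vector that is absent
from alloc_map marks it as a system vector and leaves both counters
untouched. That is the case in which the unpatched code hit the BUG_ON().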




Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-23 Thread Dou Liyang

Hi Thomas,

At 12/23/2017 09:32 PM, Thomas Gleixner wrote:
[...]


> > The BUG_ON panic happens at line 147:
> >    BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));
> >
> > I'm sure Thomas and Dou know it better than me.
>
> I'll have a look after the holidays.



Merry Christmas!  :-)

I am trying to look into it.

Thanks,
dou




Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-23 Thread Alexandru Chirvasitu
On Sat, Dec 23, 2017 at 02:32:52PM +0100, Thomas Gleixner wrote:
> On Sat, 23 Dec 2017, Dexuan Cui wrote:
> 
> > > From: Alexandru Chirvasitu [mailto:achirva...@gmail.com]
> > > Sent: Friday, December 22, 2017 14:29
> > > 
> > > The output of that precise command run just now on a freshly-compiled
> > > copy of that commit is attached.
> > > 
> > > On Fri, Dec 22, 2017 at 09:31:28PM +, Dexuan Cui wrote:
> > > > > From: Alexandru Chirvasitu [mailto:achirva...@gmail.com]
> > > > > Sent: Friday, December 22, 2017 06:21
> > > > >
> > > > > In the absence of logs, the best I can do at the moment is attach a
> > > > > picture of the screen I am presented with on the apic=debug boot
> > > > > attempt.
> > > > > Alex
> > > >
> > > > The panic happens in irq_matrix_assign_system+0x4e/0xd0 in your picture.
> > > > IMO we should find which line of code causes the panic. I suppose
> > > > "objdump -D kernel/irq/matrix.o" can help to do that.
> > > >
> > > > Thanks,
> > > > -- Dexuan
> > 
> > The BUG_ON panic happens at line 147:
> >BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));
> > 
> > I'm sure Thomas and Dou know it better than me. 
> 
> I'll have a look after the holidays.
>

Thanks for that!

A quick follow-up on my inability to make kexec / kdump work in order
to perhaps produce better logs: I've done another bisect for that with
this result:

# first bad commit: [e802a51ede91350438c051da2f238f5e8c918ead] x86/idt: 
Consolidate IDT invalidation

I am quite certain this is the one for that issue. Its only parent is

# good: [8f55868f9e42fea56021b17421914b9e4fda4960] x86/idt: Remove unused 
set_trap_gate()

(i.e. one of the "good" commits I hit upon during the bisect).

On the Core 2 Duo machine I've been referring to, e802a51 and later
commits simply return me to a regular BIOS boot when I either issue
kexec -e on a loaded crash kernel or trigger a crash with echo c >
/proc/sysrq-trigger.


Alex


RE: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-23 Thread Thomas Gleixner
On Sat, 23 Dec 2017, Dexuan Cui wrote:

> > From: Alexandru Chirvasitu [mailto:achirva...@gmail.com]
> > Sent: Friday, December 22, 2017 14:29
> > 
> > The output of that precise command run just now on a freshly-compiled
> > copy of that commit is attached.
> > 
> > On Fri, Dec 22, 2017 at 09:31:28PM +, Dexuan Cui wrote:
> > > > From: Alexandru Chirvasitu [mailto:achirva...@gmail.com]
> > > > Sent: Friday, December 22, 2017 06:21
> > > >
> > > > In the absence of logs, the best I can do at the moment is attach a
> > > > picture of the screen I am presented with on the apic=debug boot
> > > > attempt.
> > > > Alex
> > >
> > > The panic happens in irq_matrix_assign_system+0x4e/0xd0 in your picture.
> > > IMO we should find which line of code causes the panic. I suppose
> > > "objdump -D kernel/irq/matrix.o" can help to do that.
> > >
> > > Thanks,
> > > -- Dexuan
> 
> The BUG_ON panic happens at line 147:
>BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));
> 
> I'm sure Thomas and Dou know it better than me. 

I'll have a look after the holidays.

Thanks,

tglx


Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-22 Thread Alexandru Chirvasitu
I was just now trying to track down my other issue, whereby somewhere
along the tree kexec stops working properly. In the process of doing
that I realized I had initially made one change to the original 4.9
config beyond oldconfig: I'd turned off WX debugging.

I've now compiled a bunch of versions with WX debugging back on, and
new behavior arises. I am attaching the journalctl log of a booted
4.13 kernel (from Linus' tree, commit 569dbb8).

It boots and logs me in, but returns a call trace I wasn't seeing
without the WX debugging. I am sending it over in case it provides any
information.

The trace bears the 23:24:09 timestamp.

On Sat, Dec 23, 2017 at 01:35:12AM +, Dexuan Cui wrote:
> > From: Alexandru Chirvasitu [mailto:achirva...@gmail.com]
> > Sent: Friday, December 22, 2017 14:29
> > 
> > The output of that precise command run just now on a freshly-compiled
> > copy of that commit is attached.
> > 
> > On Fri, Dec 22, 2017 at 09:31:28PM +, Dexuan Cui wrote:
> > > > From: Alexandru Chirvasitu [mailto:achirva...@gmail.com]
> > > > Sent: Friday, December 22, 2017 06:21
> > > >
> > > > In the absence of logs, the best I can do at the moment is attach a
> > > > picture of the screen I am presented with on the apic=debug boot
> > > > attempt.
> > > > Alex
> > >
> > > The panic happens in irq_matrix_assign_system+0x4e/0xd0 in your picture.
> > > IMO we should find which line of code causes the panic. I suppose
> > > "objdump -D kernel/irq/matrix.o" can help to do that.
> > >
> > > Thanks,
> > > -- Dexuan
> 
> The BUG_ON panic happens at line 147:
>BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));
> 
> I'm sure Thomas and Dou know it better than me. 
> 
> 137 void irq_matrix_assign_system(struct irq_matrix *m, unsigned int bit,
> 138                               bool replace)
> 139 {
> 140         struct cpumap *cm = this_cpu_ptr(m->maps);
> 141
> 142         BUG_ON(bit > m->matrix_bits);
> 143         BUG_ON(m->online_maps > 1 || (m->online_maps && !replace));
> 144
> 145         set_bit(bit, m->system_map);
> 146         if (replace) {
> 147                 BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));
> 148                 cm->allocated--;
> 149                 m->total_allocated--;
> 150         }
> 151         if (bit >= m->alloc_start && bit < m->alloc_end)
> 152                 m->systembits_inalloc++;
> 153
> 154         trace_irq_matrix_assign_system(bit, m);
> 155 }
> 
> -- Dexuan
> 
-- Logs begin at Thu 2017-12-14 19:59:20 EST, end at Fri 2017-12-22 23:45:29 
EST. --
Dec 22 23:24:09 D-69-91-141-110 kernel: random: get_random_bytes called from 
start_kernel+0x32/0x3a0 with crng_init=0
Dec 22 23:24:09 D-69-91-141-110 kernel: Linux version 4.13.0 (root@axiomatic) 
(gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5)) #1 SMP Fri Dec 22 
22:18:58 EST 2017
Dec 22 23:24:09 D-69-91-141-110 kernel: x86/fpu: x87 FPU will use FXSAVE
Dec 22 23:24:09 D-69-91-141-110 kernel: e820: BIOS-provided physical RAM map:
Dec 22 23:24:09 D-69-91-141-110 kernel: BIOS-e820: [mem 
0x-0x0009fbff] usable
Dec 22 23:24:09 D-69-91-141-110 kernel: BIOS-e820: [mem 
0x0009fc00-0x0009] reserved
Dec 22 23:24:09 D-69-91-141-110 kernel: BIOS-e820: [mem 
0x000e-0x000f] reserved
Dec 22 23:24:09 D-69-91-141-110 kernel: BIOS-e820: [mem 
0x0010-0xb7f9] usable
Dec 22 23:24:09 D-69-91-141-110 kernel: BIOS-e820: [mem 
0xb7fa-0xb7fadfff] ACPI data
Dec 22 23:24:09 D-69-91-141-110 kernel: BIOS-e820: [mem 
0xb7fae000-0xb7fe] ACPI NVS
Dec 22 23:24:09 D-69-91-141-110 kernel: BIOS-e820: [mem 
0xb7ff-0xb7ff] reserved
Dec 22 23:24:09 D-69-91-141-110 kernel: BIOS-e820: [mem 
0xfee0-0xfee00fff] reserved
Dec 22 23:24:09 D-69-91-141-110 kernel: BIOS-e820: [mem 
0xffb8-0x] reserved
Dec 22 23:24:09 D-69-91-141-110 kernel: NX (Execute Disable) protection: active
Dec 22 23:24:09 D-69-91-141-110 kernel: random: fast init done
Dec 22 23:24:09 D-69-91-141-110 kernel: SMBIOS 2.4 present.
Dec 22 23:24:09 D-69-91-141-110 kernel: DMI: ASUSTeK Computer Inc. F5RL 
   /F5RL  , BIOS 210 06/12/2008
Dec 22 23:24:09 D-69-91-141-110 kernel: tsc: Fast TSC calibration using PIT
Dec 22 23:24:09 D-69-91-141-110 kernel: e820: update [mem 
0x-0x0fff] usable ==> reserved
Dec 22 23:24:09 D-69-91-141-110 kernel: e820: remove [mem 
0x000a-0x000f] usable
Dec 22 23:24:09 D-69-91-141-110 kernel: e820: last_pfn = 0xb7fa0 max_arch_pfn = 
0x100
Dec 22 23:24:09 D-69-91-141-110 kernel: MTRR default type: uncachable
Dec 22 23:24:09 D-69-91-141-110 kernel: MTRR fixed ranges enabled:
Dec 22 23:24:09 D-69-91-141-110 kernel:   0-9 write-back
Dec 22 23:24:09 D-69-91-141-110 kernel:   A-B uncachable
Dec 22 23:24:09 D-69-91-141-110 kernel:   C-C write-protect
Dec 22 23:24:09 D-69-91-141-110 

RE: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-22 Thread Dexuan Cui
> From: Alexandru Chirvasitu [mailto:achirva...@gmail.com]
> Sent: Friday, December 22, 2017 14:29
> 
> The output of that precise command run just now on a freshly-compiled
> copy of that commit is attached.
> 
> On Fri, Dec 22, 2017 at 09:31:28PM +, Dexuan Cui wrote:
> > > From: Alexandru Chirvasitu [mailto:achirva...@gmail.com]
> > > Sent: Friday, December 22, 2017 06:21
> > >
> > > In the absence of logs, the best I can do at the moment is attach a
> > > picture of the screen I am presented with on the apic=debug boot
> > > attempt.
> > > Alex
> >
> > The panic happens in irq_matrix_assign_system+0x4e/0xd0 in your picture.
> > IMO we should find which line of code causes the panic. I suppose
> > "objdump -D kernel/irq/matrix.o" can help to do that.
> >
> > Thanks,
> > -- Dexuan

The BUG_ON panic happens at line 147:
   BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));

I'm sure Thomas and Dou know it better than me. 

137 void irq_matrix_assign_system(struct irq_matrix *m, unsigned int bit,
138                               bool replace)
139 {
140         struct cpumap *cm = this_cpu_ptr(m->maps);
141
142         BUG_ON(bit > m->matrix_bits);
143         BUG_ON(m->online_maps > 1 || (m->online_maps && !replace));
144
145         set_bit(bit, m->system_map);
146         if (replace) {
147                 BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));
148                 cm->allocated--;
149                 m->total_allocated--;
150         }
151         if (bit >= m->alloc_start && bit < m->alloc_end)
152                 m->systembits_inalloc++;
153
154         trace_irq_matrix_assign_system(bit, m);
155 }

-- Dexuan



RE: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-22 Thread Dexuan Cui
> From: Alexandru Chirvasitu [mailto:achirva...@gmail.com]
> Sent: Friday, December 22, 2017 06:21
> 
> In the absence of logs, the best I can do at the moment is attach a
> picture of the screen I am presented with on the apic=debug boot
> attempt.
> Alex

The panic happens in irq_matrix_assign_system+0x4e/0xd0 in your picture.
IMO we should find which line of code causes the panic. I suppose
"objdump -D kernel/irq/matrix.o" can help to do that.

Thanks,
-- Dexuan
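
A note on reading the location quoted above: in a kernel trace,
"symbol+0xOFF/0xLEN" gives the offset of the faulting instruction from the
start of the function, followed by the function's length in bytes. The
sketch below splits such a string into its parts. The helper name and the
64-byte name buffer are assumptions made for illustration, not anything
from the kernel tree.

```c
#include <stdio.h>

/* Hypothetical helper: split "symbol+0xOFF/0xLEN" into its parts.
 * `name` must point to a buffer of at least 64 bytes.
 * Returns 0 on success, -1 if the string does not match. */
int parse_symbol_offset(const char *text, char name[64],
			unsigned int *offset, unsigned int *length)
{
	/* %63[^+] reads the symbol name up to the '+'; %x accepts the
	 * "0x" prefix on the two hexadecimal fields. */
	if (sscanf(text, "%63[^+]+%x/%x", name, offset, length) != 3)
		return -1;
	return 0;
}
```

For "irq_matrix_assign_system+0x4e/0xd0" this yields offset 0x4e in a
function of 0xd0 bytes. Adding that offset to the function's start address
in the objdump listing locates the instruction in question.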


Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-22 Thread Dou Liyang

Hi Alexandru,

At 12/21/2017 10:23 AM, Alexandru Chirvasitu wrote:

> This might be more helpful. I ran another bisect with the following
> final log:
>
> ---
>
> git bisect start
> # bad: [d6ffc6ac83b1f9f12652d89b9cb5bcbfbea7796c] x86/vector: Respect affinity mask in irq descriptor
> git bisect bad d6ffc6ac83b1f9f12652d89b9cb5bcbfbea7796c
> # good: [e4ae4c8ea7c65f61fde29c689d148c8c9e05305a] Merge branch 'irq/core' into x86/apic
> git bisect good e4ae4c8ea7c65f61fde29c689d148c8c9e05305a
> # good: [4ef76eb6de734dc03a7f3b8f80884362364e6049] x86/apic: Get rid of the legacy irq data storage
> git bisect good 4ef76eb6de734dc03a7f3b8f80884362364e6049
> # good: [ba801640b10d87b1c4e26cbcbe414a001255404f] x86/vector: Compile SMP only code conditionally
> git bisect good ba801640b10d87b1c4e26cbcbe414a001255404f
> # good: [90ad9e2d91067983f3328e21b306323877e5f48a] x86/io_apic: Reevaluate vector configuration on activate()
> git bisect good 90ad9e2d91067983f3328e21b306323877e5f48a
> # bad: [4900be83602b6be07366d3e69f756c1959f4169a] x86/vector/msi: Switch to global reservation mode
> git bisect bad 4900be83602b6be07366d3e69f756c1959f4169a
> # bad: [2db1f959d9dc16035f2eb44ed5fdb2789b754d6a] x86/vector: Handle managed interrupts proper
> git bisect bad 2db1f959d9dc16035f2eb44ed5fdb2789b754d6a
> # first bad commit: [2db1f959d9dc16035f2eb44ed5fdb2789b754d6a] x86/vector: Handle managed interrupts proper

It's helpful to me. I tried it in QEMU with

  (Intel(R) Core(TM)2 Duo CPU  T7700  @ 2.40GHz)

but couldn't reproduce the bug.

> ---
>
> That first bad commit 2db1f95 identified at the end is interesting:
> it's the only one I've tried through all of this that actually gives
> me a kernel panic when unadorned with kernel options (so unlike all of
> the others it fails to even drop me at a tty login prompt).
>
> I tried a number of things to fiddle with it: it boots fine with
> either nolapic or noapic. The former results in seeing a single cpu
> with lscpu, but the latter (noapic) seems to give me as much
> functionality as I'd need. I'm not seeing the issue noted before,

That is because "noapic" only disables the I/O APIC, whereas "nolapic"
disables both the Local APIC and the I/O APIC.

> whereby noapic for 4.15.0-rc3 was somehow disabling my ethernet card.
>
> I hope this second bisect went down better than the last one..

Could you add "apic=debug" to the kernel command line and then send me the
dmesg log?


Thanks,
dou.




Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-19 Thread Dou Liyang

Hi Thomas,

At 12/20/2017 08:31 AM, Thomas Gleixner wrote:
> On Tue, 19 Dec 2017, Alexandru Chirvasitu wrote:
>
> > I had never heard of 'bisect' before this casual mention (you might tell
> > I am a bit out of my depth). I've since applied it to Linus' tree between
> >
> > bebc608 Linux 4.14 (good)
> >
> > and
> >
> > 4fbd8d1 Linux 4.15-rc1 (bad)
>
> Is Linus' current head 4.15-rc4 bad as well?

[...]

> Thanks for doing that bisect, but unfortunately this commit cannot be the
> problematic one. It merely adds a config symbol, but it does not change any
> code at all. It has no effect whatsoever. So something might have gone
> wrong in your bisecting.

Agree.

> I CC'ed Dou Liyang. He has changed the early APIC setup code and there has
> been an issue reported already. Though I lost track of that. Dou, any

     Is it this one?
               https://marc.info/?l=linux-kernel=151188084018443

> pointers?

Not sure, but it seems the APIC failed to start on that 32-bit system.

I will look into it.

Alex,

Could you give me your .config file and the dmesg log of 4.15.0-rc3?

Thanks,
dou




Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-19 Thread Thomas Gleixner
On Tue, 19 Dec 2017, Alexandru Chirvasitu wrote:

> I had never heard of 'bisect' before this casual mention (you might tell
> I am a bit out of my depth). I've since applied it to Linus' tree between

> bebc608 Linux 4.14 (good)
> 
> and
> 
> 4fbd8d1 Linux 4.15-rc1 (bad)

Is Linus' current head 4.15-rc4 bad as well?

> It took about 13 attempts (I had access to a faster machine to compile
> on, and ccache helped once the cache built up some momentum). The result
> is (as presented by 'git bisect' at the end of the process, between the
> --- dividers added by me for clarity):

> --- start of output ---
> 
> 2b5175c4fa974b6aa05bbd2ee8d443a8036a1714 is the first bad commit
> commit 2b5175c4fa974b6aa05bbd2ee8d443a8036a1714
> Author: Thomas Gleixner 
> Date:   Tue Oct 17 09:54:57 2017 +0200
> 
> genirq: Add config option for reservation mode
> 
> The interrupt reservation mode requires reactivation of PCI/MSI
> interrupts. Create a config option, so the PCI code can set the
> corresponding flag when required.
> 
> Signed-off-by: Thomas Gleixner 
> Cc: Josh Poulson 
> Cc: Mihai Costache 
> Cc: Stephen Hemminger 
> Cc: Marc Zyngier 
> Cc: linux-...@vger.kernel.org
> Cc: Haiyang Zhang 
> Cc: Dexuan Cui 
> Cc: Simon Xiao 
> Cc: Saeed Mahameed 
> Cc: Jork Loeser 
> Cc: Bjorn Helgaas 
> Cc: de...@linuxdriverproject.org
> Cc: KY Srinivasan 
> Link: https://lkml.kernel.org/r/20171017075600.369375...@linutronix.de
> 
> :040000 040000 5e73031cc0c8411a20722cce7876ab7b82ed3858 dcf98e7a6b7d5f7c5353b7ccab02125e6d332ec8 M	kernel
> 
> --- end of output ---
> 
> Consequently, I am cc-ing in the listed addresses.

Thanks for doing that bisect, but unfortunately this commit cannot be the
problematic one. It merely adds a config symbol, but it does not change any
code at all. It has no effect whatsoever. So something might have gone
wrong in your bisecting.

I CC'ed Dou Liyang. He has changed the early APIC setup code and there has
been an issue reported already. Though I lost track of that. Dou, any
pointers?

Thanks,

tglx
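
One generic way a bisect can end on a commit that changes no code is a single
wrong good/bad verdict along the way, for instance if a bad kernel happens to
survive a short test boot. The sketch below illustrates that effect only; it
is not a diagnosis of what happened in this thread. It assumes a linear
history, and the commit count, the culprit index and the misjudged commit are
all hypothetical numbers.

```python
from typing import Callable


def bisect(n: int, verdict: Callable[[int], bool]) -> int:
    """Binary search over commits 0..n-1, where 0 is good and n-1 is bad.

    verdict(i) returns True when the tester reports commit i as bad.
    Returns the index that the search blames as the first bad commit.
    """
    good, bad = 0, n - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if verdict(mid):
            bad = mid
        else:
            good = mid
    return bad


N = 1024              # hypothetical number of commits in the range
TRUE_FIRST_BAD = 700  # hypothetical commit that really introduced the bug


def reliable(i: int) -> bool:
    """Every test gives the correct answer."""
    return i >= TRUE_FIRST_BAD


def one_miss(i: int) -> bool:
    """Same as reliable(), except that bad commit 767 is reported as good."""
    return False if i == 767 else reliable(i)


print("all verdicts correct:", bisect(N, reliable))
print("one wrong verdict:   ", bisect(N, one_miss))
```

With correct verdicts the search lands on commit 700. With the single wrong
"good" at commit 767, every later step searches only above 767 and the search
blames commit 768, which is unrelated to the real culprit. Testing each
candidate kernel long enough for the lockup to show is the usual remedy.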




Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-19 Thread Alexandru Chirvasitu
Thank you!

On Mon, Dec 18, 2017 at 11:11:31AM +0100, Pavel Machek wrote:
> Hi!
> On Mon 2017-12-18 03:20:11, Alexandru Chirvasitu wrote:
> > Short description of the problem: latest rc kernel results in seemingly 
> > APIC-caused hard lockups, whereas latest stable kernel works fine.
> > 
> > I have an old ASUS F5RL laptop with an Intel Core 2 Duo CPU T5450 @1.66GHz. 
> > It is currently running Debian 9.3 stable 32 bit (by default on a 
> > 4.9-series kernel), but I have been compiling and installing the latest 
> > kernels.
> > 
> 
> Thanks for doing that.
> 
> > The latest rc kernel at the time of this writing (4.15.0-rc3) boots but 
> > then results in hard lockups on both CPUs after login. Starting in recovery 
> > mode returns the error
> > 
> > "spurious APIC interrupt through vector ff on CPU#0, should never happen"
> > 
> > before locking up the CPUs again. A hard reboot is necessary.
> > 
> > Starting with kernel option noapic logs me in uneventfully, but for some 
> > reason has the effect of rendering my ethernet card inoperable. It is a 
> > Qualcomm Atheros Attansic L2 Fast Ethernet (rev a0), handled by kernel 
> > module atl2. In noapic mode the card is still seen by the system, can be 
> > brought up / down, etc., but dhclient never manages to acquire a lease.
> > 
> > Starting with kernel option nolapic instead brings up the network and logs 
> > me in, but only sees one CPU instead of two, as usual.
> > 
> > The latest kernel that exhibits none of these issues is the latest stable 
> > one as of this writing: 4.14.7.
> > 
> > ---
> > 
> > As this seems to be APIC-related, I am sending the message to the 
> > maintainers mentioned in arch/x86/kernel/apic/apic.c. I am unsure whether 
> > this is the correct procedure however.
> > 
> 
> Good enough procedure. You want to always copy linux-kernel mailing
> list, and you should probably look for X86 maintainers in MAINTAINERS
> file, and cc them, too.
> 
> If you run out of other options, you can always do "git bisect"...
>


I had never heard of 'bisect' before this casual mention (you might tell I am a 
bit out of my depth). I've since applied it to Linus' tree between

bebc608 Linux 4.14 (good)

and

4fbd8d1 Linux 4.15-rc1 (bad)

It took about 13 attempts (I had access to a faster machine to compile on, and 
ccache helped once the cache built up some momentum). The result is (as 
presented by 'git bisect' at the end of the process, between the --- dividers 
added by me for clarity):

--- start of output ---

2b5175c4fa974b6aa05bbd2ee8d443a8036a1714 is the first bad commit
commit 2b5175c4fa974b6aa05bbd2ee8d443a8036a1714
Author: Thomas Gleixner 
Date:   Tue Oct 17 09:54:57 2017 +0200

genirq: Add config option for reservation mode

The interrupt reservation mode requires reactivation of PCI/MSI
interrupts. Create a config option, so the PCI code can set the
corresponding flag when required.

Signed-off-by: Thomas Gleixner 
Cc: Josh Poulson 
Cc: Mihai Costache 
Cc: Stephen Hemminger 
Cc: Marc Zyngier 
Cc: linux-...@vger.kernel.org
Cc: Haiyang Zhang 
Cc: Dexuan Cui 
Cc: Simon Xiao 
Cc: Saeed Mahameed 
Cc: Jork Loeser 
Cc: Bjorn Helgaas 
Cc: de...@linuxdriverproject.org
Cc: KY Srinivasan 
Link: https://lkml.kernel.org/r/20171017075600.369375...@linutronix.de

:040000 040000 5e73031cc0c8411a20722cce7876ab7b82ed3858 dcf98e7a6b7d5f7c5353b7ccab02125e6d332ec8 M	kernel

--- end of output ---

Consequently, I am cc-ing in the listed addresses.


Thank you,

Alex Chirvasitu

> Best regards, Pavel
> -- 
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) 
> http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
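
The "about 13 attempts" above is what a binary search over a few thousand
commits would be expected to need. As a rough sketch of the idea behind
'git bisect' (not of git's actual implementation, which works on a commit
graph rather than a list), assuming a linear history in which every commit
after the first bad one is also bad; the commit names and the count of 8000
are hypothetical:

```python
from typing import Callable, Sequence, Tuple


def first_bad(commits: Sequence[str],
              is_bad: Callable[[str], bool]) -> Tuple[str, int]:
    """Return (first bad commit, number of kernels that had to be tested).

    Assumes commits[0] is known good and commits[-1] is known bad.
    """
    good, bad = 0, len(commits) - 1  # indices of known-good / known-bad
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1                   # one kernel build and boot per probe
        if is_bad(commits[mid]):
            bad = mid
        else:
            good = mid
    return commits[bad], tests


# Hypothetical history of 8000 commits in which c5000 introduces the bug.
history = [f"c{i}" for i in range(8000)]
culprit, tests = first_bad(history, lambda c: int(c[1:]) >= 5000)
print(culprit, tests)
```

Each probe halves the remaining range, so the number of builds grows only
with the logarithm of the number of commits between the good and bad tags.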




Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-18 Thread Pavel Machek
Hi!
On Mon 2017-12-18 03:20:11, Alexandru Chirvasitu wrote:
> Short description of the problem: latest rc kernel results in seemingly 
> APIC-caused hard lockups, whereas latest stable kernel works fine.
> 
> I have an old ASUS F5RL laptop with an Intel Core 2 Duo CPU T5450 @1.66GHz. 
> It is currently running Debian 9.3 stable 32 bit (by default on a 4.9-series 
> kernel), but I have been compiling and installing the latest kernels.
> 

Thanks for doing that.

> The latest rc kernel at the time of this writing (4.15.0-rc3) boots but then 
> results in hard lockups on both CPUs after login. Starting in recovery mode 
> returns the error
> 
> "spurious APIC interrupt through vector ff on CPU#0, should never happen"
> 
> before locking up the CPUs again. A hard reboot is necessary.
> 
> Starting with kernel option noapic logs me in uneventfully, but for some 
> reason has the effect of rendering my ethernet card inoperable. It is a 
> Qualcomm Atheros Attansic L2 Fast Ethernet (rev a0), handled by kernel module 
> atl2. In noapic mode the card is still seen by the system, can be brought up 
> / down, etc., but dhclient never manages to acquire a lease.
> 
> Starting with kernel option nolapic instead brings up the network and logs me 
> in, but only sees one CPU instead of two, as usual.
> 
> The latest kernel that exhibits none of these issues is the latest stable one 
> as of this writing: 4.14.7.
> 
> ---
> 
> As this seems to be APIC-related, I am sending the message to the maintainers 
> mentioned in arch/x86/kernel/apic/apic.c. I am unsure whether this is the 
> correct procedure however.
> 

Good enough procedure. You want to always copy linux-kernel mailing
list, and you should probably look for X86 maintainers in MAINTAINERS
file, and cc them, too.

If you run out of other options, you can always do "git bisect"...

Best regards,   Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



