Re: [PATCH] random: Fix kernel panic due to system_wq use before init

2016-09-20 Thread Matt Fleming
On Mon, 19 Sep, at 01:09:22PM, Waiman Long wrote:
> On 09/19/2016 10:51 AM, Matt Fleming wrote:
> >On Mon, 19 Sep, at 10:48:12AM, Waiman Long wrote:
> >>With this patch applied, I am able to successfully boot both the 16-socket
> >>12-TB and 8-socket 6TB configurations without problem.
> >>
> >>Tested-by: Waiman Long
> >
> >Could you please show your dmesg after booting with efi=debug? The
> >part I'm interested in is the dump of the EFI memory map.
> 
> Attached is the output of dmesg with efi=debug.

Bingo. There are some more type-conversion bugs in the PAT code; that's
the real bug.

Here's what I've got queued up, along with the first patch I sent.

8<
From e535ec0899d1fe52ec3a84c9bc03457ac67ad6f7 Mon Sep 17 00:00:00 2001
From: Matt Fleming 
Date: Tue, 20 Sep 2016 14:26:21 +0100
Subject: [PATCH 1/2] x86/mm/pat: Prevent hang during boot when mapping pages

There's a mixture of signed 32-bit and unsigned 32-bit and 64-bit data
types used for keeping track of how many pages have been mapped.

This leads to hangs during boot when mapping large numbers of pages
(multiple terabytes, as reported by Waiman) because those values are
interpreted as being negative.

commit 742563777e8d ("x86/mm/pat: Avoid truncation when converting
cpa->numpages to address") fixed one of those bugs, but there is
another lurking in __change_page_attr_set_clr().

Additionally, the return type of the populate_*() functions means they
can return negative values when a large number of pages have been
mapped, triggering the error paths even though no error occurred.

Consistently use 64-bit types on 64-bit platforms when counting pages.
Even in the signed case this gives us room for regions of up to 8 PiB
(pebibytes) in size while still allowing the usual negative-value
error-checking idiom.

Reported-by: Waiman Long 
Cc: Ard Biesheuvel 
Cc: Borislav Petkov 
Cc: Linus Torvalds 
Cc: Theodore Ts'o 
Cc: Arnd Bergmann 
Cc: Greg Kroah-Hartman 
Cc: Scott J Norton 
Cc: Douglas Hatch 
Signed-off-by: Matt Fleming 
---
 arch/x86/mm/pageattr.c | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 849dc09fa4f0..e3353c97d086 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -917,11 +917,11 @@ static void populate_pte(struct cpa_data *cpa,
}
 }
 
-static int populate_pmd(struct cpa_data *cpa,
-   unsigned long start, unsigned long end,
-   unsigned num_pages, pud_t *pud, pgprot_t pgprot)
+static long populate_pmd(struct cpa_data *cpa,
+                         unsigned long start, unsigned long end,
+                         unsigned num_pages, pud_t *pud, pgprot_t pgprot)
 {
-   unsigned int cur_pages = 0;
+   long cur_pages = 0;
pmd_t *pmd;
pgprot_t pmd_pgprot;
 
@@ -991,12 +991,12 @@ static int populate_pmd(struct cpa_data *cpa,
return num_pages;
 }
 
-static int populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd,
-   pgprot_t pgprot)
+static long populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd,
+                         pgprot_t pgprot)
 {
pud_t *pud;
unsigned long end;
-   int cur_pages = 0;
+   long cur_pages = 0;
pgprot_t pud_pgprot;
 
end = start + (cpa->numpages << PAGE_SHIFT);
@@ -1052,7 +1052,7 @@ static int populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd,
 
/* Map trailing leftover */
if (start < end) {
-   int tmp;
+   long tmp;
 
pud = pud_offset(pgd, start);
if (pud_none(*pud))
@@ -1078,7 +1078,7 @@ static int populate_pgd(struct cpa_data *cpa, unsigned long addr)
pgprot_t pgprot = __pgprot(_KERNPG_TABLE);
pud_t *pud = NULL;  /* shut up gcc */
pgd_t *pgd_entry;
-   int ret;
+   long ret;
 
pgd_entry = cpa->pgd + pgd_index(addr);
 
@@ -1327,7 +1327,8 @@ static int cpa_process_alias(struct cpa_data *cpa)
 
 static int __change_page_attr_set_clr(struct cpa_data *cpa, int checkalias)
 {
-   int ret, numpages = cpa->numpages;
+   unsigned long numpages = cpa->numpages;
+   int ret;
 
while (numpages) {
/*
-- 
2.9.3



Re: [PATCH] random: Fix kernel panic due to system_wq use before init

2016-09-19 Thread Matt Fleming
On Mon, 19 Sep, at 10:48:12AM, Waiman Long wrote:
> 
> With this patch applied, I am able to successfully boot both the 16-socket
> 12-TB and 8-socket 6TB configurations without problem.
> 
> Tested-by: Waiman Long 
 
Could you please show your dmesg after booting with efi=debug? The
part I'm interested in is the dump of the EFI memory map.


Re: [PATCH] random: Fix kernel panic due to system_wq use before init

2016-09-19 Thread Waiman Long

On 09/19/2016 08:43 AM, Matt Fleming wrote:

On Sun, 18 Sep, at 11:09:08PM, Waiman Long wrote:

On 09/14/2016 03:19 PM, Linus Torvalds wrote:

On Wed, Sep 14, 2016 at 12:14 PM, Waiman Long   wrote:

In the stack backtrace above, the kernel hadn't even reached SMP boot after
about 50s. That was extremely slow. I tried the 4.7.3 kernel and it booted
up fine. So I suspect that there may be too many interrupts going on and it
consumes most of the CPU cycles. The prime suspect is the random driver, I
think.

Any chance of bisecting it at least partially? The random driver
doesn't do interrupts itself, it just gets called by other drivers
doing interrupts. So if there are too many of them, that would be
something else..

Linus

I have finally finished bisecting the problem. I was wrong in saying that
the 4.7.3 kernel had no problem. It did have. There were some slight
differences between the 4.8 and 4.7 kernel config files that I used. After
some further testing, it was found that the bootup problem only happened
when the following kernel config option was defined:

CONFIG_EFI_MIXED=y


Could you try this patch? It won't be the final version, because it
doesn't address the root cause of the crash, which looks like page
table corruption of some kind, but it should at least confirm that
this is the buggy code.

---

diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 677e29e29473..8dd3784eb075 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -245,7 +245,7 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 * text and allocate a new stack because we can't rely on the
 * stack pointer being < 4GB.
 */
-   if (!IS_ENABLED(CONFIG_EFI_MIXED))
+   if (!IS_ENABLED(CONFIG_EFI_MIXED) || efi_is_native())
return 0;

/*


With this patch applied, I am able to successfully boot both the 
16-socket 12-TB and 8-socket 6TB configurations without problem.


Tested-by: Waiman Long 



Re: [PATCH] random: Fix kernel panic due to system_wq use before init

2016-09-19 Thread Matt Fleming
On Sun, 18 Sep, at 11:09:08PM, Waiman Long wrote:
> On 09/14/2016 03:19 PM, Linus Torvalds wrote:
> >On Wed, Sep 14, 2016 at 12:14 PM, Waiman Long  wrote:
> >>In the stack backtrace above, the kernel hadn't even reached SMP boot after
> >>about 50s. That was extremely slow. I tried the 4.7.3 kernel and it booted
> >>up fine. So I suspect that there may be too many interrupts going on and it
> >>consumes most of the CPU cycles. The prime suspect is the random driver, I
> >>think.
> >Any chance of bisecting it at least partially? The random driver
> >doesn't do interrupts itself, it just gets called by other drivers
> >doing interrupts. So if there are too many of them, that would be
> >something else..
> >
> >Linus
> 
> I have finally finished bisecting the problem. I was wrong in saying that
> the 4.7.3 kernel had no problem. It did have. There were some slight
> differences between the 4.8 and 4.7 kernel config files that I used. After
> some further testing, it was found that the bootup problem only happened
> when the following kernel config option was defined:
> 
> CONFIG_EFI_MIXED=y
 
Could you try this patch? It won't be the final version, because it
doesn't address the root cause of the crash, which looks like page
table corruption of some kind, but it should at least confirm that
this is the buggy code.

---

diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 677e29e29473..8dd3784eb075 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -245,7 +245,7 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 * text and allocate a new stack because we can't rely on the
 * stack pointer being < 4GB.
 */
-   if (!IS_ENABLED(CONFIG_EFI_MIXED))
+   if (!IS_ENABLED(CONFIG_EFI_MIXED) || efi_is_native())
return 0;
 
/*


Re: [PATCH] random: Fix kernel panic due to system_wq use before init

2016-09-19 Thread Matt Fleming
On Sun, 18 Sep, at 11:09:08PM, Waiman Long wrote:
> 
> I have finally finished bisecting the problem. I was wrong in saying that
> the 4.7.3 kernel had no problem. It did have. There were some slight
> differences between the 4.8 and 4.7 kernel config files that I used. After
> some further testing, it was found that the bootup problem only happened
> when the following kernel config option was defined:
> 
> CONFIG_EFI_MIXED=y
> 
> Bisecting revealed that the following 4.6 patch was the first patch that had
> this problem:
> 
> c9f2a9a65e4855b74d92cdad688f6ee4a1a323ff
> [PATCH] x86/efi: Hoist page table switching code into efi_call_virt()
> 
> I did testing on my test system with three different partition sizes:
> 1) 16-socket Broadwell-EX with 12TB memory
> 2) 8-socket Broadwell-EX with 6TB memory
> 3) 4-socket Broadwell-EX with 3TB memory
> 
> Only the 16-socket and 8-socket configurations had this problem. I am not
> sure if over 4TB of main memory is a factor or not.

Yes, I think it's a safe bet that the amount of main memory is a major
factor here. Thanks for the report.

The only real difference when CONFIG_EFI_MIXED is enabled for a 64-bit
kernel on 64-bit firmware is that main memory will be mapped into the
EFI page tables at its physical addresses.

I'll go stare at the code and get back to you.


Re: [PATCH] random: Fix kernel panic due to system_wq use before init

2016-09-18 Thread Waiman Long

On 09/14/2016 03:19 PM, Linus Torvalds wrote:

On Wed, Sep 14, 2016 at 12:14 PM, Waiman Long  wrote:

In the stack backtrace above, the kernel hadn't even reached SMP boot after
about 50s. That was extremely slow. I tried the 4.7.3 kernel and it booted
up fine. So I suspect that there may be too many interrupts going on and it
consumes most of the CPU cycles. The prime suspect is the random driver, I
think.

Any chance of bisecting it at least partially? The random driver
doesn't do interrupts itself, it just gets called by other drivers
doing interrupts. So if there are too many of them, that would be
something else..

Linus


I have finally finished bisecting the problem. I was wrong in saying 
that the 4.7.3 kernel had no problem. It did have. There were some 
slight differences between the 4.8 and 4.7 kernel config files that I 
used. After some further testing, it was found that the bootup problem 
only happened when the following kernel config option was defined:


CONFIG_EFI_MIXED=y

Bisecting revealed that the following 4.6 patch was the first patch that 
had this problem:


c9f2a9a65e4855b74d92cdad688f6ee4a1a323ff
[PATCH] x86/efi: Hoist page table switching code into efi_call_virt()

I did testing on my test system with three different partition sizes:
1) 16-socket Broadwell-EX with 12TB memory
2) 8-socket Broadwell-EX with 6TB memory
3) 4-socket Broadwell-EX with 3TB memory

Only the 16-socket and 8-socket configurations had this problem. I am 
not sure if over 4TB of main memory is a factor or not.


I have attached several slightly different panic messages that had 
happened in my testing. I know little about the EFI code and so I am not 
sure if it is a kernel problem, firmware problem or a combination of 
both. Hopefully someone with knowledge on this code will shed light on 
this problem.


Cheers,
Longman
commit 1bb6936473c07b5a7c8daced1000893b7145bb14
Author: Ard Biesheuvel 
Date:   Mon Feb 1 22:07:00 2016 +

efi: Runtime-wrapper: Get rid of the rtc_lock spinlock

The rtc_lock spinlock aims to serialize access to the CMOS RTC
between the UEFI firmware and the kernel drivers that use it
directly. However, x86 is the only arch that performs such
direct accesses, and that never uses the time related UEFI
runtime services. Since no other UEFI enlightened architectures
have a legacy CMOS RTC anyway, we can remove the rtc_lock
spinlock entirely.

Signed-off-by: Ard Biesheuvel 
Signed-off-by: Matt Fleming 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/1454364428-494-7-git-send-email-matt@codeblue
Signed-off-by: Ingo Molnar 

-
[0.00] ACPI: X2APIC_NMI (uid[0x16b] high level lint[0x1])
[0.00] ACPI: X2APIC_NMI (uid[0x16c] high level lint[0x1])
[0.00] ACPI: X2APIC_NMI (uid[0x16d] high level lint[0x1])
[0.00] ACPI: X2APIC_NMI (uid[0x16e] high level lint[0x1])
[0.00] ACPI: X2APIC_NMI (uid[0x16f] high level lint[0x1])
[0.00] ACPI: X2APIC_NMI (uid[0x170] high level lint[0x1])
[0.00] ACPI: X2APIC_NMI (uid[0x171] high level lint[0x1])
[0.00] ACPI: X2APIC_NMI (uid[0x172] high level lint[0x1])
[0.00] ACPI: X2APIC_NMI (uid[0x173] high level lint[0x1])
[0.00] ACPI: X2APIC_NMI (uid[0x174] high level lint[0x1])
[0.00] ACPI: X2APIC_NMI (uid[0x175] high level lint[0x1])
[0.00] ACPI: X2APIC_NMI (uid[0x176] high level lint[0x1])
[0.00] ACPI: X2APIC_NMI (uid[0x177] high level lint[0x1])
[0.00] ACPI: X2APIC_NMI (uid[0x178] high level lint[0x1])
[0.00] ACPI: X2APIC_NMI (uid[0x179] high level lint[0x1])
[0.00] ACPI: X2APIC_NMI (uid[0x17a] high level lint[0x1])
[0.00] ACPI: X2APIC_NMI (uid[0x17b] high level lint[0x1])
[0.00] ACPI: X2APIC_NMI (uid[0x17c] high level lint[0x1])
[0.00] ACPI: X2APIC_NMI (uid[0x17d] high level lint[0x1])
[0.00] ACPI: X2APIC_NMI (uid[0x17e] high level lint[0x1])
[0.00] ACPI: X2APIC_NMI (uid[0x17f] high level lint[0x1])
[0.00] IOAPIC[0]: apic_id 8, version 32, address 0xfec0, GSI 0-23
[0.00] IOAPIC[1]: apic_id 9, version 32, address 0xfec01000, GSI 24-47
[0.00] IOAPIC[2]: apic_id 10, version 32, address 0xfec04000, GSI 48-71
[0.00] IOAPIC[3]: apic_id 11, version 32, address 0xfec08000, GSI 72-95
[0.00] IOAPIC[4]: apic_id 12, version 32, address 0xfec09000, GSI 96-119
[0.00] IOAPIC[5]: apic_id 13, version 32, address 0xfec0c000, GSI 120-143
[0.00] IOAPIC[6]: apic_id 14, version 32, address 0xfec1, GSI 144-167
[0.00] IOAPIC[7]: apic_id 15, version 32, address 0xfec11000, GSI 168-191
[0.00] IOAPIC[8]: apic_id 16, versi

Re: [PATCH] random: Fix kernel panic due to system_wq use before init

2016-09-14 Thread Tejun Heo
Hello,

On Wed, Sep 14, 2016 at 03:55:51PM -0400, Tejun Heo wrote:
> We've used keventd_up() for this purpose and it hasn't been big enough
> an issue as workqueue usages during earlyboot are very rare (only five
users right now).  But, yeah, it's getting used more and more and
> there's no harm in moving it to earlier.  I'll see how early I can
> push it.

So, we can't really start executing work items much before the init
task is started because other kthreads can't be created or scheduled
before then; however, we can build up all of the workqueue data
structures very early once memory allocation and idr are operational
sans actual worker creation and onlining.  This allows creating
workqueues and queueing work items from very early during the boot and
the queued work items will start executing as soon as init task starts
executing.  Obviously, if someone tries to flush anything, it's not
gonna work but that really shouldn't happen during early boot.

The patch I'm playing with now looks like the following and it seems
to work fine.  I'll verify it further and add some niceties like
WARN_ONs on premature flushes and removal of unnecessary keventd_up()
tests.

Thanks.


Index: work/include/linux/workqueue.h
===
--- work.orig/include/linux/workqueue.h
+++ work/include/linux/workqueue.h
@@ -631,4 +631,7 @@ int workqueue_online_cpu(unsigned int cp
 int workqueue_offline_cpu(unsigned int cpu);
 #endif
 
+int __init workqueue_init_early(void);
+int __init workqueue_init(void);
+
 #endif
Index: work/init/main.c
===
--- work.orig/init/main.c
+++ work/init/main.c
@@ -476,6 +476,11 @@ static void __init mm_init(void)
ioremap_huge_init();
 }
 
+static void my_test_work_fn(struct work_struct *work)
+{
+   printk("XXX my_test_work\n");
+}
+
 asmlinkage __visible void __init start_kernel(void)
 {
char *command_line;
@@ -551,6 +556,13 @@ asmlinkage __visible void __init start_k
 "Interrupts were enabled *very* early, fixing it\n"))
local_irq_disable();
idr_init_cache();
+   workqueue_init_early();
+
+   {
+   static DECLARE_WORK(my_test_work, my_test_work_fn);
+   schedule_work(&my_test_work);
+   }
+
rcu_init();
 
/* trace_printk() and trace points may be used after this */
@@ -1005,6 +1017,7 @@ static noinline void __init kernel_init_
 
smp_prepare_cpus(setup_max_cpus);
 
+   workqueue_init();
do_pre_smp_initcalls();
lockup_detector_init();
 
Index: work/kernel/workqueue.c
===
--- work.orig/kernel/workqueue.c
+++ work/kernel/workqueue.c
@@ -290,6 +290,8 @@ module_param_named(disable_numa, wq_disa
 static bool wq_power_efficient = IS_ENABLED(CONFIG_WQ_POWER_EFFICIENT_DEFAULT);
 module_param_named(power_efficient, wq_power_efficient, bool, 0444);
 
+static bool wq_online;
+
 static bool wq_numa_enabled;   /* unbound NUMA affinity enabled */
 
/* buf for wq_update_unbound_numa_attrs(), protected by CPU hotplug exclusion */
@@ -3352,7 +3354,7 @@ static struct worker_pool *get_unbound_p
goto fail;
 
/* create and start the initial worker */
-   if (!create_worker(pool))
+   if (wq_online && !create_worker(pool))
goto fail;
 
/* install */
@@ -3417,6 +3419,7 @@ static void pwq_adjust_max_active(struct
 {
struct workqueue_struct *wq = pwq->wq;
bool freezable = wq->flags & WQ_FREEZABLE;
+   unsigned long flags;
 
/* for @wq->saved_max_active */
lockdep_assert_held(&wq->mutex);
@@ -3425,7 +3428,7 @@ static void pwq_adjust_max_active(struct
if (!freezable && pwq->max_active == wq->saved_max_active)
return;
 
-   spin_lock_irq(&pwq->pool->lock);
+   spin_lock_irqsave(&pwq->pool->lock, flags);
 
/*
 * During [un]freezing, the caller is responsible for ensuring that
@@ -3448,7 +3451,7 @@ static void pwq_adjust_max_active(struct
pwq->max_active = 0;
}
 
-   spin_unlock_irq(&pwq->pool->lock);
+   spin_unlock_irqrestore(&pwq->pool->lock, flags);
 }
 
 /* initialize newly alloced @pwq which is associated with @wq and @pool */
@@ -5455,7 +5458,7 @@ static void __init wq_numa_init(void)
wq_numa_enabled = true;
 }
 
-static int __init init_workqueues(void)
+int __init workqueue_init_early(void)
 {
int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL };
int i, cpu;
@@ -5488,16 +5491,6 @@ static int __init init_workqueues(void)
}
}
 
-   /* create the initial worker */
-   for_each_online_cpu(cpu) {
-   struct worker_pool *pool;
-
-   for_each_cpu_worker_pool(pool, cpu) {
-   pool->flags &= ~POOL_DISASSOCIATED;
- 

Re: [PATCH] random: Fix kernel panic due to system_wq use before init

2016-09-14 Thread Waiman Long

On 09/14/2016 05:06 PM, Linus Torvalds wrote:

On Wed, Sep 14, 2016 at 12:34 PM, Waiman Long  wrote:

I can try, but the 16-socket system that I have at the moment takes a long
time (more than an hour) for one shutdown-reboot cycle. It may not really be
more interrupts in 4.8; it may be that the random driver just somehow runs
very slowly on my test machine, as it seems to have had a major rewrite in
the 4.8 cycle.

Looking at the random driver updates since 4.7, the only thing I see
is the .crng_fast_load() path for the chacha20 randomness. And that should
trigger only until it's been initialized, so the cost looks like it
should be limited.

Is there some fundamental reason you think it's the random driver?
Other than the oops? Because I'd be more inclined to suspect just some
apic issue or something, where an actual interrupt line ends up
screaming or whatever. Is this UV? There's also the CPU hotplug state
machine changes etc.


Yes, it is because of the oops that I suspect the random driver may be 
the cause.




But a few rounds of bisecting should hopefully cut down on the
suspects a lot. A *full* bisect might be 16-17 rounds, but if you can
do just four or five rounds of bisection, that should still cut it
down from 14k commits to "only" several hundred..

Linus


Yes, I will do a few rounds to see if we can isolate the problem. In 
the meantime, I will also reconfigure the system with fewer sockets to 
see if it can be reproduced in a smaller configuration.


Cheers,
Longman


Re: [PATCH] random: Fix kernel panic due to system_wq use before init

2016-09-14 Thread Linus Torvalds
On Wed, Sep 14, 2016 at 12:34 PM, Waiman Long  wrote:
>
> I can try, but the 16-socket system that I have at the moment takes a long
> time (more than an hour) for one shutdown-reboot cycle. It may not really be
> more interrupts in 4.8; it may be that the random driver just somehow runs
> very slowly on my test machine, as it seems to have had a major rewrite in
> the 4.8 cycle.

Looking at the random driver updates since 4.7, the only thing I see
is the .crng_fast_load() path for the chacha20 randomness. And that should
trigger only until it's been initialized, so the cost looks like it
should be limited.

Is there some fundamental reason you think it's the random driver?
Other than the oops? Because I'd be more inclined to suspect just some
apic issue or something, where an actual interrupt line ends up
screaming or whatever. Is this UV? There's also the CPU hotplug state
machine changes etc.

But a few rounds of bisecting should hopefully cut down on the
suspects a lot. A *full* bisect might be 16-17 rounds, but if you can
do just four or five rounds of bisection, that should still cut it
down from 14k commits to "only" several hundred..

   Linus


Re: [PATCH] random: Fix kernel panic due to system_wq use before init

2016-09-14 Thread Tejun Heo
Hello, Linus.

On Wed, Sep 14, 2016 at 12:14:30PM -0700, Linus Torvalds wrote:
> I'm wondering if we couldn't just initialize "system_wq" earlier.
> Right now init_workqueues() is an "early_initcall()", so it's at the
> same priority as a number of other random early initcalls. My gut
> feeling is that it should be initialized even earlier, with the
> scheduler.
> 
> Because dammit, we use "schedule_work()" as if it was a pretty core
> scheduler thing, and having to have some odd knowledge of system_wq
> initialization details in the rest of the kernel sounds really really
> wrong.
> 
> I don't think the random code is at all special in maybe wanting to
> schedule some work despite being an "early" initcall.
> 
> Adding Tejun to the cc, and quoting the whole email.
> 
> Tejun, comments?

We've used keventd_up() for this purpose and it hasn't been big enough
an issue as workqueue usages during earlyboot are very rare (only five
users right now).  But, yeah, it's getting used more and more and
there's no harm in moving it to earlier.  I'll see how early I can
push it.

Thanks.

-- 
tejun


Re: [PATCH] random: Fix kernel panic due to system_wq use before init

2016-09-14 Thread Waiman Long

On 09/14/2016 03:19 PM, Linus Torvalds wrote:

On Wed, Sep 14, 2016 at 12:14 PM, Waiman Long  wrote:

In the stack backtrace above, the kernel hadn't even reached SMP boot after
about 50s. That was extremely slow. I tried the 4.7.3 kernel and it booted
up fine. So I suspect that there may be too many interrupts going on and it
consumes most of the CPU cycles. The prime suspect is the random driver, I
think.

Any chance of bisecting it at least partially? The random driver
doesn't do interrupts itself, it just gets called by other drivers
doing interrupts. So if there are too many of them, that would be
something else..

Linus


I can try, but the 16-socket system that I have at the moment takes a 
long time (more than an hour) for one shutdown-reboot cycle. It may not 
really be more interrupts in 4.8; it may be that the random driver just 
somehow runs very slowly on my test machine, as it seems to have had a 
major rewrite in the 4.8 cycle. So I would like suggestions on what sort 
of tests I can run to pinpoint where the problem is. I currently have 
the machine until the end of the week.


Cheers,
Longman


Re: [PATCH] random: Fix kernel panic due to system_wq use before init

2016-09-14 Thread Waiman Long

On 09/14/2016 03:14 PM, Linus Torvalds wrote:

Ugh, I detest this patch.

My gut feeling is that a driver (even a fairly core one like the
random code) should not have to know these kinds of details like
"schedule_work() needs system_wq to have been initialized".

I'm wondering if we couldn't just initialize "system_wq" earlier.
Right now init_workqueues() is an "early_initcall()", so it's at the
same priority as a number of other random early initcalls. My gut
feeling is that it should be initialized even earlier, with the
scheduler.

Because dammit, we use "schedule_work()" as if it was a pretty core
scheduler thing, and having to have some odd knowledge of system_wq
initialization details in the rest of the kernel sounds really really
wrong.

I don't think the random code is at all special in maybe wanting to
schedule some work despite being an "early" initcall.

Adding Tejun to the cc, and quoting the whole email.

Tejun, comments?

  Linus




My patch does not really fix the boot problem, as detailed in my 
follow-up email. It serves mostly to jump-start the discussion on the 
problem that I saw. The schedule_work() call was issued as part of 
interrupt handling, which seems to start pretty early in the boot 
process, before the early_initcalls. I guess it is possible to move the 
initialization earlier, but I am not sure where a good place would be.


Cheers,
Longman


Re: [PATCH] random: Fix kernel panic due to system_wq use before init

2016-09-14 Thread Linus Torvalds
On Wed, Sep 14, 2016 at 12:14 PM, Waiman Long  wrote:
>
> In the stack backtrace above, the kernel hadn't even reached SMP boot after
> about 50s. That was extremely slow. I tried the 4.7.3 kernel and it booted
> up fine. So I suspect that there may be too many interrupts going on and it
> consumes most of the CPU cycles. The prime suspect is the random driver, I
> think.

Any chance of bisecting it at least partially? The random driver
doesn't do interrupts itself, it just gets called by other drivers
doing interrupts. So if there are too many of them, that would be
something else..

   Linus


Re: [PATCH] random: Fix kernel panic due to system_wq use before init

2016-09-14 Thread Waiman Long

On 09/14/2016 03:03 PM, Waiman Long wrote:

While booting a 4.8-rc6 kernel on a 16-socket 768-thread Broadwell-EX
system, the kernel panic'ed with the following log:

[   51.837010] BUG: unable to handle kernel NULL pointer dereference at 
0102
[   51.845635] IP: [] __queue_work+0x32/0x420
[   52.004366] Call Trace:
[   52.007053]
[   52.009171]  [] queue_work_on+0x27/0x40
[   52.015306]  [] credit_entropy_bits+0x1d7/0x2a0
[   52.022002]  [] ?  add_interrupt_randomness+0x1b9/0x210
[   52.029366]  [] add_interrupt_randomness+0x1b9/0x210
[   52.036544]  [] handle_irq_event_percpu+0x40/0x80
[   52.043430]  [] handle_irq_event+0x3b/0x60
[   52.049655]  [] handle_level_irq+0x88/0x100
[   52.055968]  [] handle_irq+0xab/0x130
[   52.061708]  [] ? _local_bh_enable+0x21/0x50
[   52.068125]  [] ? __exit_idle+0x5/0x30
[   52.073965]  [] do_IRQ+0x4d/0xd0
[   52.079229]  [] common_interrupt+0x8c/0x8c
[   52.085444]
[   52.087568]  [] ? try_to_free_pmd_page+0x9/0x40
[   52.094462]  [] ? try_to_free_pmd_page+0x5/0x40
[   52.101157]  [] ?  __unmap_pmd_range.part.5+0x4a/0x70
[   52.108330]  [] unmap_pmd_range+0x130/0x250
[   52.114644]  [] __cpa_process_fault+0x47b/0x5a0
[   52.121339]  [] __change_page_attr+0x78b/0x9e0
[   52.127946]  [] ?  
__raw_callee_save___native_queued_spin_unlock+0x15/0x30
[   52.137124]  [] __change_page_attr_set_clr+0x78/0x300
[   52.144404]  [] ? __slab_alloc+0x4d/0x5c
[   52.150436]  [] kernel_map_pages_in_pgd+0x8f/0xd0
[   52.157333]  [] efi_setup_page_tables+0xcc/0x1d9
[   52.164124]  [] efi_enter_virtual_mode+0x35e/0x4af
[   52.171117]  [] start_kernel+0x41f/0x4c8
[   52.177142]  [] ? set_init_arg+0x55/0x55
[   52.183168]  [] ?  early_idt_handler_array+0x120/0x120
[   52.190440]  [] x86_64_start_reservations+0x2a/0x2c
[   52.197516]  [] x86_64_start_kernel+0x14c/0x16f
[   52.204214] Code: 89 e5 41 57 41 56 49 89 f6 41 55 41 89 fd 41 54 49 89 d4 53 48 
83 ec 10 89 7d d4 ff 14 25 a0 a7 c2 81 f6 c4 02 0f 85 0d 03 00 00<41>  f6 86 02 
01 00 00 01 0f 85 ae 02 00 00 49 c7 c7 18 41 01 00
[   52.225516] RIP  [] __queue_work+0x32/0x420
[   52.231838]  RSP
[   52.235667] CR2: 0102
[   52.239667] ---[ end trace 2ee7ea9d2908eb72 ]---
[   52.244743] Kernel panic - not syncing: Fatal exception in interrupt

Looking at the panic'ed instruction indicated that system_wq was
used by credit_entropy_bits() before it was initialized in an
early_initcall.

This patch prevents the schedule_work() call from being made before
system_wq is initialized.

Signed-off-by: Waiman Long
---
  drivers/char/random.c |8 ++--
  1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 3efb3bf..3afc519 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -730,8 +730,12 @@ retry:
	    r->entropy_total >= 2*random_read_wakeup_bits) {
		struct entropy_store *other = &blocking_pool;

-		if (other->entropy_count <=
-		    3 * other->poolinfo->poolfracbits / 4) {
+		/*
+		 * We cannot call schedule_work() before system_wq
+		 * is initialized.
+		 */
+		if (system_wq && (other->entropy_count <=
+		    3 * other->poolinfo->poolfracbits / 4)) {
			schedule_work(&other->push_work);
			r->entropy_total = 0;
		}


This patch fixed the kernel panic, but the test system still seemed to 
hang after the following log messages:


[0.276735] random: fast init done
[6.230775] random: crng init done

In the stack backtrace above, the kernel hadn't even reached SMP boot 
after about 50s. That was extremely slow. I tried the 4.7.3 kernel and 
it booted up fine. So I suspect that there may be too many interrupts 
going on and it consumes most of the CPU cycles. The prime suspect is 
the random driver, I think.


I would like to hear your thought on that.

Cheers,
Longman




Re: [PATCH] random: Fix kernel panic due to system_wq use before init

2016-09-14 Thread Linus Torvalds
Ugh, I detest this patch.

My gut feeling is that a driver (even a fairly core one like the
random code) should not have to know these kinds of details like
"schedule_work() needs system_wq to have been initialized".

I'm wondering if we couldn't just initialize "system_wq" earlier.
Right now init_workqueues() is an "early_initcall()", so it's at the
same priority as a number of other random early initcalls. My gut
feeling is that it should be initialized even earlier, with the
scheduler.

Because dammit, we use "schedule_work()" as if it was a pretty core
scheduler thing, and having to have some odd knowledge of system_wq
initialization details in the rest of the kernel sounds really really
wrong.

I don't think the random code is at all special in maybe wanting to
schedule some work despite being an "early" initcall.

Adding Tejun to the cc, and quoting the whole email.

Tejun, comments?

 Linus

On Wed, Sep 14, 2016 at 12:03 PM, Waiman Long  wrote:
> While booting a 4.8-rc6 kernel on a 16-socket 768-thread Broadwell-EX
> system, the kernel panic'ed with the following log:
>
> [   51.837010] BUG: unable to handle kernel NULL pointer dereference at 
> 0102
> [   51.845635] IP: [] __queue_work+0x32/0x420
> [   52.004366] Call Trace:
> [   52.007053]  
> [   52.009171]  [] queue_work_on+0x27/0x40
> [   52.015306]  [] credit_entropy_bits+0x1d7/0x2a0
> [   52.022002]  [] ?  add_interrupt_randomness+0x1b9/0x210
> [   52.029366]  [] add_interrupt_randomness+0x1b9/0x210
> [   52.036544]  [] handle_irq_event_percpu+0x40/0x80
> [   52.043430]  [] handle_irq_event+0x3b/0x60
> [   52.049655]  [] handle_level_irq+0x88/0x100
> [   52.055968]  [] handle_irq+0xab/0x130
> [   52.061708]  [] ? _local_bh_enable+0x21/0x50
> [   52.068125]  [] ? __exit_idle+0x5/0x30
> [   52.073965]  [] do_IRQ+0x4d/0xd0
> [   52.079229]  [] common_interrupt+0x8c/0x8c
> [   52.085444]  
> [   52.087568]  [] ? try_to_free_pmd_page+0x9/0x40
> [   52.094462]  [] ? try_to_free_pmd_page+0x5/0x40
> [   52.101157]  [] ?  __unmap_pmd_range.part.5+0x4a/0x70
> [   52.108330]  [] unmap_pmd_range+0x130/0x250
> [   52.114644]  [] __cpa_process_fault+0x47b/0x5a0
> [   52.121339]  [] __change_page_attr+0x78b/0x9e0
> [   52.127946]  [] ?  
> __raw_callee_save___native_queued_spin_unlock+0x15/0x30
> [   52.137124]  [] __change_page_attr_set_clr+0x78/0x300
> [   52.144404]  [] ? __slab_alloc+0x4d/0x5c
> [   52.150436]  [] kernel_map_pages_in_pgd+0x8f/0xd0
> [   52.157333]  [] efi_setup_page_tables+0xcc/0x1d9
> [   52.164124]  [] efi_enter_virtual_mode+0x35e/0x4af
> [   52.171117]  [] start_kernel+0x41f/0x4c8
> [   52.177142]  [] ? set_init_arg+0x55/0x55
> [   52.183168]  [] ?  early_idt_handler_array+0x120/0x120
> [   52.190440]  [] x86_64_start_reservations+0x2a/0x2c
> [   52.197516]  [] x86_64_start_kernel+0x14c/0x16f
> [   52.204214] Code: 89 e5 41 57 41 56 49 89 f6 41 55 41 89 fd 41 54 49 89 d4 
> 53 48 83 ec 10 89 7d d4 ff 14 25 a0 a7 c2 81 f6 c4 02 0f 85 0d 03 00 00 <41> 
> f6 86 02 01 00 00 01 0f 85 ae 02 00 00 49 c7 c7 18 41 01 00
> [   52.225516] RIP  [] __queue_work+0x32/0x420
> [   52.231838]  RSP 
> [   52.235667] CR2: 0102
> [   52.239667] ---[ end trace 2ee7ea9d2908eb72 ]---
> [   52.244743] Kernel panic - not syncing: Fatal exception in interrupt
>
> Looking at the panic'ed instruction indicated that system_wq was
> used by credit_entropy_bits() before it was initialized in an
> early_initcall.
>
> This patch prevents the schedule_work() call from being made before
> system_wq is initialized.
>
> Signed-off-by: Waiman Long 
> ---
>  drivers/char/random.c |8 ++--
>  1 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index 3efb3bf..3afc519 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -730,8 +730,12 @@ retry:
> r->entropy_total >= 2*random_read_wakeup_bits) {
> struct entropy_store *other = &blocking_pool;
>
> -   if (other->entropy_count <=
> -   3 * other->poolinfo->poolfracbits / 4) {
> +   /*
> +* We cannot call schedule_work() before system_wq
> +* is initialized.
> +*/
> +   if (system_wq && (other->entropy_count <=
> +   3 * other->poolinfo->poolfracbits / 4)) {
> schedule_work(&other->push_work);
> r->entropy_total = 0;
> }
> --
> 1.7.1
>