Re: [RFC PATCH v2 5/7] Documentation: Add three sysctls for smart idle poll

2017-08-29 Thread Luis R. Rodriguez
On Tue, Aug 29, 2017 at 11:46:39AM +, Yang Zhang wrote:
> To reduce the cost of poll, we introduce three sysctl to control the
> poll time.

This commit does not describe in any way the fact that these knobs are
all for and only for PARAVIRT.

> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index bac23c1..67447b8 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -63,6 +63,9 @@ show up in /proc/sys/kernel:
>  - perf_event_max_stack
>  - perf_event_max_contexts_per_stack
>  - pid_max
> +- poll_grow   [ X86 only ]
> +- poll_shrink [ X86 only ]
> +- poll_threshold_ns   [ X86 only ]

How about paravirt_ prefix?

>  - powersave-nap   [ PPC only ]
>  - printk
>  - printk_delay
> @@ -702,6 +705,28 @@ kernel tries to allocate a number starting from this one.
>  
>  ==
>  
> +poll_grow: (X86 only)
> +
> +This parameter is multiplied in the grow_poll_ns() to increase the poll time.
> +By default, the values is 2.
> +
> +==
> +poll_shrink: (X86 only)
> +
> +This parameter is divided in the shrink_poll_ns() to reduce the poll time.
> +By default, the values is 2.

These don't say anything about this being related to paravirt.

> +
> +==
> +poll_threshold_ns: (X86 only)
> +
> +This parameter controls the max poll time before entering real idle path.
> +This parameter is expected to take effect only when running inside a VM.
> +It would make no sense to turn on it in bare mental.

turn it on? Or modify, or use it?

> +By default, the values is 0 means don't poll. It is recommended to change
> +the value to non-zero if running latency-bound workloads inside VM.
> +
> +==
> +
>  powersave-nap: (PPC only)
>  
>  If set, Linux-PPC will use the 'nap' mode of powersaving,
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index a11b2c2..0b92f8f 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -318,6 +318,10 @@ struct pv_idle_ops pv_idle_ops = {
>   .update = paravirt_nop,
>  };
>  
> +unsigned long poll_threshold_ns;
> +unsigned int poll_shrink = 2;
> +unsigned int poll_grow = 2;
> +
>  __visible struct pv_irq_ops pv_irq_ops = {
>   .save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
>   .restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index bd6d96c..6cb2820 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -462,6 +462,12 @@ extern __scanf(2, 0)
>  
>  extern bool crash_kexec_post_notifiers;
>  
> +#ifdef CONFIG_PARAVIRT
> +extern unsigned long poll_threshold_ns;
> +extern unsigned int poll_shrink;
> +extern unsigned int poll_grow;
> +#endif

What are these if we are on bare metal but support
paravirt? The docs are not clear in any way about it.

> +
>  /*
>   * panic_cpu is used for synchronizing panic() and crash_kexec() execution. 
> It
>   * holds a CPU number which is executing panic() currently. A value of
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 6648fbb..9b86a8f 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1229,6 +1229,29 @@ static int sysrq_sysctl_handler(struct ctl_table 
> *table, int write,
>   .extra2 = ,
>   },
>  #endif
> +#ifdef CONFIG_PARAVIRT
> + {
> + .procname   = "halt_poll_threshold",
> + .data   = _threshold_ns,
> + .maxlen = sizeof(unsigned long),
> + .mode   = 0644,
> + .proc_handler   = proc_dointvec,
> + },

proc_doulongvec_minmax() may be more appropriate here? What is the range?
Also what user did you see used proc_dointvec() but had unsigned long?
If the test below reveal this is invalid we would need to go fix them
as well.

> + {
> + .procname   = "halt_poll_grow",
> + .data   = _grow,
> + .maxlen = sizeof(unsigned int),
> + .mode   = 0644,
> + .proc_handler   = proc_dointvec,
> + },
> + {
> + .procname   = "halt_poll_shrink",
> + .data   = _shrink,
> + .maxlen = sizeof(unsigned int),
> + .mode   = 0644,
> + .proc_handler   = proc_dointvec,

We have now a much more appropriate proc_douintvec() proc_douintvec_minmax().
It seems you want to support only unsigned int for two of these so that would
be more appropriate.

To test things out you can replicate your values using or extending the
test driver lib/test_sysctl.c CONFIG_TEST_SYSCTL=m with your type of
values and then fuzz testing them with arbitrary values to ensure you get
only as input 

Re: [RFC PATCH v2 5/7] Documentation: Add three sysctls for smart idle poll

2017-08-29 Thread Luis R. Rodriguez
On Tue, Aug 29, 2017 at 11:46:39AM +, Yang Zhang wrote:
> To reduce the cost of poll, we introduce three sysctl to control the
> poll time.

This commit does not describe in any way the fact that these knobs are
all for and only for PARAVIRT.

> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index bac23c1..67447b8 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -63,6 +63,9 @@ show up in /proc/sys/kernel:
>  - perf_event_max_stack
>  - perf_event_max_contexts_per_stack
>  - pid_max
> +- poll_grow   [ X86 only ]
> +- poll_shrink [ X86 only ]
> +- poll_threshold_ns   [ X86 only ]

How about paravirt_ prefix?

>  - powersave-nap   [ PPC only ]
>  - printk
>  - printk_delay
> @@ -702,6 +705,28 @@ kernel tries to allocate a number starting from this one.
>  
>  ==
>  
> +poll_grow: (X86 only)
> +
> +This parameter is multiplied in the grow_poll_ns() to increase the poll time.
> +By default, the values is 2.
> +
> +==
> +poll_shrink: (X86 only)
> +
> +This parameter is divided in the shrink_poll_ns() to reduce the poll time.
> +By default, the values is 2.

These don't say anything about this being related to paravirt.

> +
> +==
> +poll_threshold_ns: (X86 only)
> +
> +This parameter controls the max poll time before entering real idle path.
> +This parameter is expected to take effect only when running inside a VM.
> +It would make no sense to turn on it in bare mental.

turn it on? Or modify, or use it?

> +By default, the values is 0 means don't poll. It is recommended to change
> +the value to non-zero if running latency-bound workloads inside VM.
> +
> +==
> +
>  powersave-nap: (PPC only)
>  
>  If set, Linux-PPC will use the 'nap' mode of powersaving,
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index a11b2c2..0b92f8f 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -318,6 +318,10 @@ struct pv_idle_ops pv_idle_ops = {
>   .update = paravirt_nop,
>  };
>  
> +unsigned long poll_threshold_ns;
> +unsigned int poll_shrink = 2;
> +unsigned int poll_grow = 2;
> +
>  __visible struct pv_irq_ops pv_irq_ops = {
>   .save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
>   .restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index bd6d96c..6cb2820 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -462,6 +462,12 @@ extern __scanf(2, 0)
>  
>  extern bool crash_kexec_post_notifiers;
>  
> +#ifdef CONFIG_PARAVIRT
> +extern unsigned long poll_threshold_ns;
> +extern unsigned int poll_shrink;
> +extern unsigned int poll_grow;
> +#endif

What are these if we are on bare metal but support
paravirt? The docs are not clear in any way about it.

> +
>  /*
>   * panic_cpu is used for synchronizing panic() and crash_kexec() execution. 
> It
>   * holds a CPU number which is executing panic() currently. A value of
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 6648fbb..9b86a8f 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1229,6 +1229,29 @@ static int sysrq_sysctl_handler(struct ctl_table 
> *table, int write,
>   .extra2 = ,
>   },
>  #endif
> +#ifdef CONFIG_PARAVIRT
> + {
> + .procname   = "halt_poll_threshold",
> + .data   = _threshold_ns,
> + .maxlen = sizeof(unsigned long),
> + .mode   = 0644,
> + .proc_handler   = proc_dointvec,
> + },

proc_doulongvec_minmax() may be more appropriate here? What is the range?
Also what user did you see used proc_dointvec() but had unsigned long?
If the test below reveal this is invalid we would need to go fix them
as well.

> + {
> + .procname   = "halt_poll_grow",
> + .data   = _grow,
> + .maxlen = sizeof(unsigned int),
> + .mode   = 0644,
> + .proc_handler   = proc_dointvec,
> + },
> + {
> + .procname   = "halt_poll_shrink",
> + .data   = _shrink,
> + .maxlen = sizeof(unsigned int),
> + .mode   = 0644,
> + .proc_handler   = proc_dointvec,

We have now a much more appropriate proc_douintvec() proc_douintvec_minmax().
It seems you want to support only unsigned int for two of these so that would
be more appropriate.

To test things out you can replicate your values using or extending the
test driver lib/test_sysctl.c CONFIG_TEST_SYSCTL=m with your type of
values and then fuzz testing them with arbitrary values to ensure you get
only as input 

[RFC PATCH v2 5/7] Documentation: Add three sysctls for smart idle poll

2017-08-29 Thread Yang Zhang
To reduce the cost of poll, we introduce three sysctl to control the
poll time.

Signed-off-by: Yang Zhang 
Signed-off-by: Quan Xu 
Cc: Jonathan Corbet 
Cc: Jeremy Fitzhardinge 
Cc: Chris Wright 
Cc: Alok Kataria 
Cc: Rusty Russell 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: x...@kernel.org
Cc: "Luis R. Rodriguez" 
Cc: Kees Cook 
Cc: Mauro Carvalho Chehab 
Cc: Krzysztof Kozlowski 
Cc: Josh Poimboeuf 
Cc: Andrew Morton 
Cc: Petr Mladek 
Cc: Peter Zijlstra 
Cc: Jessica Yu 
Cc: Larry Finger 
Cc: zijun_hu 
Cc: Baoquan He 
Cc: Johannes Berg 
Cc: Ian Abbott 
Cc: linux-...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-fsde...@vger.kernel.org
---
 Documentation/sysctl/kernel.txt | 25 +
 arch/x86/kernel/paravirt.c  |  4 
 include/linux/kernel.h  |  6 ++
 kernel/sysctl.c | 23 +++
 4 files changed, 58 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index bac23c1..67447b8 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -63,6 +63,9 @@ show up in /proc/sys/kernel:
 - perf_event_max_stack
 - perf_event_max_contexts_per_stack
 - pid_max
+- poll_grow   [ X86 only ]
+- poll_shrink [ X86 only ]
+- poll_threshold_ns   [ X86 only ]
 - powersave-nap   [ PPC only ]
 - printk
 - printk_delay
@@ -702,6 +705,28 @@ kernel tries to allocate a number starting from this one.
 
 ==
 
+poll_grow: (X86 only)
+
+This parameter is multiplied in the grow_poll_ns() to increase the poll time.
+By default, the values is 2.
+
+==
+poll_shrink: (X86 only)
+
+This parameter is divided in the shrink_poll_ns() to reduce the poll time.
+By default, the values is 2.
+
+==
+poll_threshold_ns: (X86 only)
+
+This parameter controls the max poll time before entering real idle path.
+This parameter is expected to take effect only when running inside a VM.
+It would make no sense to turn on it in bare mental.
+By default, the values is 0 means don't poll. It is recommended to change
+the value to non-zero if running latency-bound workloads inside VM.
+
+==
+
 powersave-nap: (PPC only)
 
 If set, Linux-PPC will use the 'nap' mode of powersaving,
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index a11b2c2..0b92f8f 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -318,6 +318,10 @@ struct pv_idle_ops pv_idle_ops = {
.update = paravirt_nop,
 };
 
+unsigned long poll_threshold_ns;
+unsigned int poll_shrink = 2;
+unsigned int poll_grow = 2;
+
 __visible struct pv_irq_ops pv_irq_ops = {
.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
.restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index bd6d96c..6cb2820 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -462,6 +462,12 @@ extern __scanf(2, 0)
 
 extern bool crash_kexec_post_notifiers;
 
+#ifdef CONFIG_PARAVIRT
+extern unsigned long poll_threshold_ns;
+extern unsigned int poll_shrink;
+extern unsigned int poll_grow;
+#endif
+
 /*
  * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It
  * holds a CPU number which is executing panic() currently. A value of
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6648fbb..9b86a8f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1229,6 +1229,29 @@ static int sysrq_sysctl_handler(struct ctl_table *table, 
int write,
.extra2 = ,
},
 #endif
+#ifdef CONFIG_PARAVIRT
+   {
+   .procname   = "halt_poll_threshold",
+   .data   = _threshold_ns,
+   .maxlen = sizeof(unsigned long),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+   {
+   .procname   = "halt_poll_grow",
+   .data   = _grow,
+   .maxlen = sizeof(unsigned int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+   {
+   .procname   = "halt_poll_shrink",
+

[RFC PATCH v2 5/7] Documentation: Add three sysctls for smart idle poll

2017-08-29 Thread Yang Zhang
To reduce the cost of poll, we introduce three sysctl to control the
poll time.

Signed-off-by: Yang Zhang 
Signed-off-by: Quan Xu 
Cc: Jonathan Corbet 
Cc: Jeremy Fitzhardinge 
Cc: Chris Wright 
Cc: Alok Kataria 
Cc: Rusty Russell 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: x...@kernel.org
Cc: "Luis R. Rodriguez" 
Cc: Kees Cook 
Cc: Mauro Carvalho Chehab 
Cc: Krzysztof Kozlowski 
Cc: Josh Poimboeuf 
Cc: Andrew Morton 
Cc: Petr Mladek 
Cc: Peter Zijlstra 
Cc: Jessica Yu 
Cc: Larry Finger 
Cc: zijun_hu 
Cc: Baoquan He 
Cc: Johannes Berg 
Cc: Ian Abbott 
Cc: linux-...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-fsde...@vger.kernel.org
---
 Documentation/sysctl/kernel.txt | 25 +
 arch/x86/kernel/paravirt.c  |  4 
 include/linux/kernel.h  |  6 ++
 kernel/sysctl.c | 23 +++
 4 files changed, 58 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index bac23c1..67447b8 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -63,6 +63,9 @@ show up in /proc/sys/kernel:
 - perf_event_max_stack
 - perf_event_max_contexts_per_stack
 - pid_max
+- poll_grow   [ X86 only ]
+- poll_shrink [ X86 only ]
+- poll_threshold_ns   [ X86 only ]
 - powersave-nap   [ PPC only ]
 - printk
 - printk_delay
@@ -702,6 +705,28 @@ kernel tries to allocate a number starting from this one.
 
 ==
 
+poll_grow: (X86 only)
+
+This parameter is multiplied in the grow_poll_ns() to increase the poll time.
+By default, the values is 2.
+
+==
+poll_shrink: (X86 only)
+
+This parameter is divided in the shrink_poll_ns() to reduce the poll time.
+By default, the values is 2.
+
+==
+poll_threshold_ns: (X86 only)
+
+This parameter controls the max poll time before entering real idle path.
+This parameter is expected to take effect only when running inside a VM.
+It would make no sense to turn on it in bare mental.
+By default, the values is 0 means don't poll. It is recommended to change
+the value to non-zero if running latency-bound workloads inside VM.
+
+==
+
 powersave-nap: (PPC only)
 
 If set, Linux-PPC will use the 'nap' mode of powersaving,
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index a11b2c2..0b92f8f 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -318,6 +318,10 @@ struct pv_idle_ops pv_idle_ops = {
.update = paravirt_nop,
 };
 
+unsigned long poll_threshold_ns;
+unsigned int poll_shrink = 2;
+unsigned int poll_grow = 2;
+
 __visible struct pv_irq_ops pv_irq_ops = {
.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
.restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index bd6d96c..6cb2820 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -462,6 +462,12 @@ extern __scanf(2, 0)
 
 extern bool crash_kexec_post_notifiers;
 
+#ifdef CONFIG_PARAVIRT
+extern unsigned long poll_threshold_ns;
+extern unsigned int poll_shrink;
+extern unsigned int poll_grow;
+#endif
+
 /*
  * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It
  * holds a CPU number which is executing panic() currently. A value of
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6648fbb..9b86a8f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1229,6 +1229,29 @@ static int sysrq_sysctl_handler(struct ctl_table *table, 
int write,
.extra2 = ,
},
 #endif
+#ifdef CONFIG_PARAVIRT
+   {
+   .procname   = "halt_poll_threshold",
+   .data   = _threshold_ns,
+   .maxlen = sizeof(unsigned long),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+   {
+   .procname   = "halt_poll_grow",
+   .data   = _grow,
+   .maxlen = sizeof(unsigned int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+   {
+   .procname   = "halt_poll_shrink",
+   .data   = _shrink,
+   .maxlen = sizeof(unsigned int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+#endif
{ }
 };
 
-- 
1.8.3.1