Re: [PATCH 1/2] clk: ti: add a usecount for autoidle

2018-11-09 Thread Andreas Kemnade
On Thu, 8 Nov 2018 12:36:35 +0200
Tero Kristo  wrote:

> On 04/10/2018 23:38, Andreas Kemnade wrote:
> > We have the scenario where autoidle is first disabled for all clocks,
> > then disabled for selected ones, and then enabled for all. So
> > we should have some counting here, also according to the
> > comment in _setup_iclk_autoidle().
> > 
> > Signed-off-by: Andreas Kemnade 
> > ---
> >   drivers/clk/ti/autoidle.c | 32 
> >   include/linux/clk/ti.h|  1 +
> >   2 files changed, 25 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/clk/ti/autoidle.c b/drivers/clk/ti/autoidle.c
> > index 7bb9afbe4058..bb6cff168e73 100644
> > --- a/drivers/clk/ti/autoidle.c
> > +++ b/drivers/clk/ti/autoidle.c
> > @@ -37,6 +37,14 @@ struct clk_ti_autoidle {
> >   static LIST_HEAD(autoidle_clks);
> >   static LIST_HEAD(clk_hw_omap_clocks);
> >   
> > +/*
> > + * we have some non-atomic read/write
> > + * operations behind it, so lets
> > + * take one mutex for handling autoidle
> > + * of all clocks
> > + */
> > +static DEFINE_MUTEX(autoidle_mutex);  
> 
> Why mutex? This prevents calling the autoidle APIs from atomic context. 
> Did you check the mutex debug kernel configs with this?
> 
Oops, I thought they were on, but they were not. OK,
I am preparing a v2 of this thing.


> This may cause problems with the runtime PM entries to the code at least.
> 
> 
> > +
> >   /**
> >* omap2_clk_deny_idle - disable autoidle on an OMAP clock
> >* @clk: struct clk * to disable autoidle for
> > @@ -48,8 +56,13 @@ int omap2_clk_deny_idle(struct clk *clk)
> > struct clk_hw_omap *c;
> >   
> > c = to_clk_hw_omap(__clk_get_hw(clk));
> > -   if (c->ops && c->ops->deny_idle)
> > -   c->ops->deny_idle(c);
> > +   if (c->ops && c->ops->deny_idle) {
> > +   mutex_lock(&autoidle_mutex);
> > +   c->autoidle_count--;
> > +   if (c->autoidle_count == -1)  
> 
> I think you should swap the arithmetic here: all the other usecounters
> use positive values, while here you go deep into the negative side when
> autoidle is denied by multiple users, which might be confusing.
> 
agreed.
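
(For illustration only: with positive counting and the names from the quoted
patch, the two paths could look roughly like the sketch below. The mutex vs.
atomic-context question raised above is left aside here, and this is not the
actual v2.)

	/* omap2_clk_deny_idle(): the first denier actually disables autoidle */
	if (c->ops && c->ops->deny_idle) {
		mutex_lock(&autoidle_mutex);
		if (++c->autoidle_count == 1)
			c->ops->deny_idle(c);
		mutex_unlock(&autoidle_mutex);
	}

	/* omap2_clk_allow_idle(): the last user allowing idle re-enables it */
	if (c->ops && c->ops->allow_idle) {
		mutex_lock(&autoidle_mutex);
		if (--c->autoidle_count == 0)
			c->ops->allow_idle(c);
		mutex_unlock(&autoidle_mutex);
	}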

Regards,
Andreas




[PATCH] sched/fair: Make some variables static

2018-11-09 Thread Muchun Song
The variables are local to the source and do not
need to be in global scope, so make them static.

Signed-off-by: Muchun Song 
---
 kernel/sched/fair.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f622fc858d7a..9d0cd2d0515f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -38,7 +38,7 @@
  * (default: 6ms * (1 + ilog(ncpus)), units: nanoseconds)
  */
 unsigned int sysctl_sched_latency  = 6000000ULL;
-unsigned int normalized_sysctl_sched_latency   = 6000000ULL;
+static unsigned int normalized_sysctl_sched_latency= 6000000ULL;
 
 /*
  * The initial- and re-scaling of tunables is configurable
@@ -58,8 +58,8 @@ enum sched_tunable_scaling sysctl_sched_tunable_scaling = 
SCHED_TUNABLESCALING_L
  *
  * (default: 0.75 msec * (1 + ilog(ncpus)), units: nanoseconds)
  */
-unsigned int sysctl_sched_min_granularity  = 750000ULL;
-unsigned int normalized_sysctl_sched_min_granularity   = 750000ULL;
+unsigned int sysctl_sched_min_granularity  = 750000ULL;
+static unsigned int normalized_sysctl_sched_min_granularity= 750000ULL;
 
 /*
  * This value is kept at sysctl_sched_latency/sysctl_sched_min_granularity
@@ -81,8 +81,8 @@ unsigned int sysctl_sched_child_runs_first __read_mostly;
  *
  * (default: 1 msec * (1 + ilog(ncpus)), units: nanoseconds)
  */
-unsigned int sysctl_sched_wakeup_granularity   = 1000000UL;
-unsigned int normalized_sysctl_sched_wakeup_granularity= 1000000UL;
+unsigned int sysctl_sched_wakeup_granularity   = 1000000UL;
+static unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL;
 
 const_debug unsigned int sysctl_sched_migration_cost   = 500000UL;
 
-- 
2.17.1



Re: [PATCH] pinctrl: msm: Add sleep pinctrl state transitions

2018-11-09 Thread Bjorn Andersson
On Fri 09 Nov 14:28 PST 2018, Evan Green wrote:

> Add PM suspend callbacks to the msm core driver that select the
> sleep and default pinctrl states. Then wire those callbacks up
> in the sdm845 driver, for those boards that may have GPIO hogs
> that need to change state during suspend.
> 
> Signed-off-by: Evan Green 
> ---
> 
>  drivers/pinctrl/qcom/pinctrl-msm.c| 16 
>  drivers/pinctrl/qcom/pinctrl-msm.h|  2 ++
>  drivers/pinctrl/qcom/pinctrl-sdm845.c |  4 
>  3 files changed, 22 insertions(+)
> 
> diff --git a/drivers/pinctrl/qcom/pinctrl-msm.c 
> b/drivers/pinctrl/qcom/pinctrl-msm.c
> index 7c7d083e2c0dc..d505d0a50f0a5 100644
> --- a/drivers/pinctrl/qcom/pinctrl-msm.c
> +++ b/drivers/pinctrl/qcom/pinctrl-msm.c
> @@ -1072,6 +1072,22 @@ static void msm_pinctrl_setup_pm_reset(struct 
> msm_pinctrl *pctrl)
>   }
>  }
>  
> +int msm_pinctrl_suspend(struct device *dev)
> +{
> + struct msm_pinctrl *pctrl = dev_get_drvdata(dev);
> +
> + return pinctrl_force_sleep(pctrl->pctrl);
> +}
> +EXPORT_SYMBOL(msm_pinctrl_suspend);
> +
> +int msm_pinctrl_resume(struct device *dev)
> +{
> + struct msm_pinctrl *pctrl = dev_get_drvdata(dev);
> +
> + return pinctrl_force_default(pctrl->pctrl);
> +}
> +EXPORT_SYMBOL(msm_pinctrl_resume);
> +
>  int msm_pinctrl_probe(struct platform_device *pdev,
> const struct msm_pinctrl_soc_data *soc_data)
>  {
> diff --git a/drivers/pinctrl/qcom/pinctrl-msm.h 
> b/drivers/pinctrl/qcom/pinctrl-msm.h
> index 29172fdf5882c..e163ca600ecd3 100644
> --- a/drivers/pinctrl/qcom/pinctrl-msm.h
> +++ b/drivers/pinctrl/qcom/pinctrl-msm.h
> @@ -123,6 +123,8 @@ struct msm_pinctrl_soc_data {
>   unsigned int ntiles;
>  };
>  
> +int msm_pinctrl_suspend(struct device *dev);
> +int msm_pinctrl_resume(struct device *dev);
>  int msm_pinctrl_probe(struct platform_device *pdev,
> const struct msm_pinctrl_soc_data *soc_data);
>  int msm_pinctrl_remove(struct platform_device *pdev);
> diff --git a/drivers/pinctrl/qcom/pinctrl-sdm845.c 
> b/drivers/pinctrl/qcom/pinctrl-sdm845.c
> index 2ab7a88857579..a3ac9cbeabad1 100644
> --- a/drivers/pinctrl/qcom/pinctrl-sdm845.c
> +++ b/drivers/pinctrl/qcom/pinctrl-sdm845.c
> @@ -1287,6 +1287,9 @@ static const struct msm_pinctrl_soc_data sdm845_pinctrl 
> = {
>   .ngpios = 150,
>  };
>  
> +static SIMPLE_DEV_PM_OPS(sdm845_pinctrl_dev_pm_ops, msm_pinctrl_suspend,
> +  msm_pinctrl_resume);

I don't expect this to differ between the various platforms, so how
about moving it to pinctrl-msm?
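
(A rough, untested sketch of that suggestion, with assumed names and not a
patch from this thread: define the ops table once in pinctrl-msm.c and let
each SoC driver point at it.)

/* pinctrl-msm.c */
const struct dev_pm_ops msm_pinctrl_dev_pm_ops = {
	SET_SYSTEM_SLEEP_PM_OPS(msm_pinctrl_suspend, msm_pinctrl_resume)
};
EXPORT_SYMBOL(msm_pinctrl_dev_pm_ops);

/* pinctrl-msm.h */
extern const struct dev_pm_ops msm_pinctrl_dev_pm_ops;

/* pinctrl-sdm845.c (and the other SoC drivers) */
	.driver = {
		.name = "sdm845-pinctrl",
		.pm = &msm_pinctrl_dev_pm_ops,
		.of_match_table = sdm845_pinctrl_of_match,
	},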

Regards,
Bjorn

> +
>  static int sdm845_pinctrl_probe(struct platform_device *pdev)
>  {
>   return msm_pinctrl_probe(pdev, &sdm845_pinctrl);
> @@ -1300,6 +1303,7 @@ static const struct of_device_id 
> sdm845_pinctrl_of_match[] = {
>  static struct platform_driver sdm845_pinctrl_driver = {
>   .driver = {
>   .name = "sdm845-pinctrl",
> + .pm = &sdm845_pinctrl_dev_pm_ops,
>   .of_match_table = sdm845_pinctrl_of_match,
>   },
>   .probe = sdm845_pinctrl_probe,
> -- 
> 2.16.4
> 


[PATCH] riscv: add S and U modes to ISA string

2018-11-09 Thread David Abdurachmanov
Booting kernel (4.20-rc1) with riscv-pk @
6ebd0f2a46255d0c76dad3c05b16c1d154795d26 (master/HEAD) on Fedora 29, one gets:

[..]
[   55.075000] unsupported ISA "rv64imafdcsu" in device tree
[   55.075000] unsupported ISA "rv64imafdcsu" in device tree
[   55.076000] unsupported ISA "rv64imafdcsu" in device tree
[   55.076000] unsupported ISA "rv64imafdcsu" in device tree
[   55.077000] systemd[1]: Detected architecture riscv64.
[..]

The patch adds the missing S and U modes.

Signed-off-by: David Abdurachmanov 
---
 arch/riscv/kernel/cpu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/riscv/kernel/cpu.c b/arch/riscv/kernel/cpu.c
index 3a5a2ee31547..4029c7e6872b 100644
--- a/arch/riscv/kernel/cpu.c
+++ b/arch/riscv/kernel/cpu.c
@@ -64,7 +64,7 @@ int riscv_of_processor_hartid(struct device_node *node)
 
 static void print_isa(struct seq_file *f, const char *orig_isa)
 {
-   static const char *ext = "mafdc";
+   static const char *ext = "mafdcsu";
const char *isa = orig_isa;
const char *e;
 
-- 
2.19.1



Re: Re: [PATCH V3] binder: ipc namespace support for android binder(Internet mail)

2018-11-09 Thread Todd Kjos
On Fri, Nov 9, 2018 at 9:43 PM chouryzhou(周威)  wrote:
>
> > >
> > > If IPC_NS is disabled, "current->nsproxy->ipc_ns" will also exist; it
> > > will be a static reference of "init_ipc_ns" (in ipc/msgutil.c, not
> > > defined in binder.c by me) with no namespace-ization. You will get the
> > > same one in all processes, everything is the same as without this patch.
> >
> > except, as far as I can tell, binder_init_ns() would never have been
> > called on it so the mutex and list heads are not initialized so it's
> > completely broken. Am I missing something? How do those fields get
> > initialized in this case?
>
> > @@ -5832,8 +5888,12 @@ static int __init binder_init(void)
> > goto err_init_binder_device_failed;
> > }
> >
> > -   return ret;
> > +   ret = binder_init_ns(&init_ipc_ns);
> > +   if (ret)
> > +   goto err_init_namespace_failed;
> >
> > +   return ret;
>
> They are initialized here.

Ok, this init_ipc_ns is a global declared in msgutil.c if SYSVIPC ||
POSIX_MQUEUE. This seems kinda hacky, but now I finally see why the
dependency... msgutil.c is the only file we can count on if !IPC_NS &&
(SYSVIPC || POSIX_MQUEUE). There must be a cleaner way to do this, I
really don't like this dependency... wouldn't it be cleaner to do:

#ifndef CONFIG_IPC_NS
static struct ipc_namespace binder_ipc_ns;
#define ipcns  (&binder_ipc_ns)
#else
#define ipcns  (current->nsproxy->ipc_ns)
#endif

(and make the initialization of binder_ipc_ns conditional on IPC_NS)

This gets us the same thing without the incestuous dependency on the
msgutil.c version of init_ipc_ns...and then binder doesn't rely on
SYSVIPC or POSIX_MQUEUE directly.
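
(To make the second half of that concrete, here is a hedged fragment of how
the tail of binder_init() might then look; the names follow the quoted patch
and only the #ifndef block is the illustration.)

#ifndef CONFIG_IPC_NS
	/* no per-container ipc namespaces: set up the single static one here */
	ret = binder_init_ns(&binder_ipc_ns);
	if (ret)
		goto err_init_namespace_failed;
#endif

	return ret;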

>
> - choury -
>


[PATCH] regulator: bd718x7: Use regulator_map_voltage_ascend for buck5 and buck7

2018-11-09 Thread Axel Lin
The voltages in bd718xx_3rd_nodvs_buck_volts are in ascending order, so use
regulator_map_voltage_ascend.

Signed-off-by: Axel Lin 
---
 drivers/regulator/bd718x7-regulator.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/regulator/bd718x7-regulator.c 
b/drivers/regulator/bd718x7-regulator.c
index f142f9ce6853..0ae3e03f5433 100644
--- a/drivers/regulator/bd718x7-regulator.c
+++ b/drivers/regulator/bd718x7-regulator.c
@@ -131,6 +131,7 @@ static struct regulator_ops 
bd718xx_buck_regulator_nolinear_ops = {
.disable = regulator_disable_regmap,
.is_enabled = regulator_is_enabled_regmap,
.list_voltage = regulator_list_voltage_table,
+   .map_voltage = regulator_map_voltage_ascend,
.set_voltage_sel = bd718xx_set_voltage_sel_restricted,
.get_voltage_sel = regulator_get_voltage_sel_regmap,
.set_voltage_time_sel = regulator_set_voltage_time_sel,
-- 
2.17.1
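
(Background, as a simplified sketch of why the ascending order matters: the
core's ascend helper can stop scanning at the first entry above max_uV and
return the first selector that fits, roughly along these lines. This is a
paraphrase, not a verbatim copy of the core implementation.)

	/* inside a map_voltage(rdev, min_uV, max_uV) style helper */
	for (i = 0; i < rdev->desc->n_voltages; i++) {
		ret = rdev->desc->ops->list_voltage(rdev, i);
		if (ret < 0)
			continue;
		if (ret > max_uV)
			break;		/* ascending table: nothing later can match */
		if (ret >= min_uV)
			return i;	/* lowest selector satisfying the request */
	}
	return -EINVAL;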



Re: [PATCH] drivers/android/binder.c: Remove duplicate header

2018-11-09 Thread Souptick Joarder
On Fri, Nov 9, 2018 at 8:17 PM Greg KH  wrote:
>
> On Fri, Nov 09, 2018 at 10:40:14PM +0800, kbuild test robot wrote:
> > Hi Brajeswar,
> >
> > Thank you for the patch! Yet something to improve:
> >
> > [auto build test ERROR on staging/staging-testing]
> > [also build test ERROR on v4.20-rc1 next-20181109]
> > [if your patch is applied to the wrong git tree, please drop us a note to 
> > help improve the system]
> >
> > url:
> > https://github.com/0day-ci/linux/commits/Brajeswar-Ghosh/drivers-android-binder-c-Remove-duplicate-header/20181109-154717
> > config: i386-allmodconfig (attached as .config)
> > compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
> > reproduce:
> > # save the attached .config to linux build tree
> > make ARCH=i386
> >
> > All errors (new ones prefixed by >>):
>
> 
>
> Yeah, I was right :(
>
> Always test-build your patches.

Sorry about that. It was a mistake on our side.


Re: [PATCH] infiniband/hw/hns/hns_roce_hw_v2.c: Use dma_zalloc_coherent

2018-11-09 Thread Souptick Joarder
On Fri, Nov 9, 2018 at 8:08 PM Sabyasachi Gupta
 wrote:
>
> Replaced dma_alloc_coherent + memset with dma_zalloc_coherent

There are a few other places in this file where the same change is applicable.
Can we get those changes in the same patch?
>
> Signed-off-by: Sabyasachi Gupta 
> ---
>  drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c 
> b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
> index a4c62ae..c9966ec 100644
> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
> @@ -4972,13 +4972,12 @@ static int hns_roce_mhop_alloc_eq(struct hns_roce_dev 
> *hr_dev,
> eqe_alloc = i * (buf_chk_sz / eq->eqe_size);
> size = (eq->entries - eqe_alloc) * 
> eq->eqe_size;
> }
> -   eq->buf[i] = dma_alloc_coherent(dev, size,
> +   eq->buf[i] = dma_zalloc_coherent(dev, size,
> &(eq->buf_dma[i]),
> GFP_KERNEL);
> if (!eq->buf[i])
> goto err_dma_alloc_buf;
>
> -   memset(eq->buf[i], 0, size);
> *(eq->bt_l0 + i) = eq->buf_dma[i];
>
> eq_buf_cnt++;
> --
> 2.7.4
>


Re: [PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd

2018-11-09 Thread Andy Lutomirski



> On Nov 9, 2018, at 7:20 PM, Joel Fernandes  wrote:
> 
>> On Fri, Nov 09, 2018 at 10:19:03PM +0100, Jann Horn wrote:
>>> On Fri, Nov 9, 2018 at 10:06 PM Jann Horn  wrote:
>>> On Fri, Nov 9, 2018 at 9:46 PM Joel Fernandes (Google)
>>>  wrote:
 Android uses ashmem for sharing memory regions. We are looking forward
 to migrating all usecases of ashmem to memfd so that we can possibly
 remove the ashmem driver in the future from staging while also
 benefiting from using memfd and contributing to it. Note staging drivers
 are also not ABI and generally can be removed at anytime.
 
 One of the main usecases Android has is the ability to create a region
 and mmap it as writeable, then add protection against making any
 "future" writes while keeping the existing already mmap'ed
 writeable-region active.  This allows us to implement a usecase where
 receivers of the shared memory buffer can get a read-only view, while
 the sender continues to write to the buffer.
 See CursorWindow documentation in Android for more details:
 https://developer.android.com/reference/android/database/CursorWindow
 
 This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
 To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal
 which prevents any future mmap and write syscalls from succeeding while
 keeping the existing mmap active.
>>> 
>>> Please CC linux-api@ on patches like this. If you had done that, I
>>> might have criticized your v1 patch instead of your v3 patch...
>>> 
 The following program shows the seal
 working in action:
>>> [...]
 Cc: jr...@google.com
 Cc: john.stu...@linaro.org
 Cc: tk...@google.com
 Cc: gre...@linuxfoundation.org
 Cc: h...@infradead.org
 Reviewed-by: John Stultz 
 Signed-off-by: Joel Fernandes (Google) 
 ---
>>> [...]
 diff --git a/mm/memfd.c b/mm/memfd.c
 index 2bb5e257080e..5ba9804e9515 100644
 --- a/mm/memfd.c
 +++ b/mm/memfd.c
>>> [...]
 @@ -219,6 +220,25 @@ static int memfd_add_seals(struct file *file, 
 unsigned int seals)
}
}
 
 +   if ((seals & F_SEAL_FUTURE_WRITE) &&
 +   !(*file_seals & F_SEAL_FUTURE_WRITE)) {
 +   /*
 +* The FUTURE_WRITE seal also prevents growing and 
 shrinking
 +* so we need them to be already set, or requested now.
 +*/
 +   int test_seals = (seals | *file_seals) &
 +(F_SEAL_GROW | F_SEAL_SHRINK);
 +
 +   if (test_seals != (F_SEAL_GROW | F_SEAL_SHRINK)) {
 +   error = -EINVAL;
 +   goto unlock;
 +   }
 +
 +   spin_lock(&file->f_lock);
 +   file->f_mode &= ~(FMODE_WRITE | FMODE_PWRITE);
 +   spin_unlock(&file->f_lock);
 +   }
>>> 
>>> So you're fiddling around with the file, but not the inode? How are
>>> you preventing code like the following from re-opening the file as
>>> writable?
>>> 
>>> $ cat memfd.c
>>> #define _GNU_SOURCE
>>> #include 
>>> #include 
>>> #include 
>>> #include 
>>> #include 
>>> #include 
>>> 
>>> int main(void) {
>>>  int fd = syscall(__NR_memfd_create, "testfd", 0);
>>>  if (fd == -1) err(1, "memfd");
>>>  char path[100];
>>>  sprintf(path, "/proc/self/fd/%d", fd);
>>>  int fd2 = open(path, O_RDWR);
>>>  if (fd2 == -1) err(1, "reopen");
>>>  printf("reopen successful: %d\n", fd2);
>>> }
>>> $ gcc -o memfd memfd.c
>>> $ ./memfd
>>> reopen successful: 4
>>> $
>>> 
>>> That aside: I wonder whether a better API would be something that
>>> allows you to create a new readonly file descriptor, instead of
>>> fiddling with the writability of an existing fd.
>> 
>> My favorite approach would be to forbid open() on memfds, hope that
>> nobody notices the tiny API break, and then add an ioctl for "reopen
>> this memfd with reduced permissions" - but that's just my personal
>> opinion.
> 
> I did something along these lines and it fixes the issue, but I forbid open
> of memfd only when the F_SEAL_FUTURE_WRITE seal is in place. So then it's not
> an ABI break because this is a brand new seal. That seems the least intrusive
> solution and it works. Do you mind testing it, and I'll add your Tested-by
> to the new fix? The patch is based on top of this series.
> 
> ---8<---
> From: "Joel Fernandes (Google)" 
> Subject: [PATCH] mm/memfd: Fix possible promotion to writeable of sealed memfd
> 
> Jann Horn found that reopening an F_SEAL_FUTURE_WRITE sealed memfd
> through /proc/self/fd/N symlink as writeable succeeds. The simplest fix
> without causing ABI breakages and ugly VFS hacks is to simply deny all
> opens on F_SEAL_FUTURE_WRITE sealed fds.
> 
> Reported-by: Jann Horn 
> Signed-off-by: Joel Fernandes (Google) 
> ---
> mm/shmem.c | 18 ++
> 1 
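
(The quoted patch is truncated in the archive. For context, below is a minimal
userspace sketch of how the new seal is meant to be used; it assumes the
F_SEAL_FUTURE_WRITE constant added by this series and is not code from the
thread.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* sender: create the region, map it writable, then seal future writes */
	int fd = memfd_create("cursor-window", MFD_ALLOW_SEALING);
	ftruncate(fd, 4096);
	char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* F_SEAL_FUTURE_WRITE is the new seal proposed by this series */
	fcntl(fd, F_ADD_SEALS, F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_FUTURE_WRITE);

	map[0] = 1;	/* the existing mapping stays writable */
	/* a receiver of fd can now only mmap() it read-only; write() fails */
	return 0;
}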

Re: Re: [PATCH V3] binder: ipc namespace support for android binder(Internet mail)

2018-11-09 Thread 周威

> > > I still don't understand the dependencies on SYSVIPC or POSIX_MQUEUE.
> > > It seems like this mechanism would work even if both are disabled --
> > > as long as IPC_NS is enabled. Seems cleaner to change init/Kconfig and
> > > allow IPC_NS if CONFIG_ANDROID_BINDER_IPC and change this line to
> > > "#ifndef CONFIG_IPC_NS"
> >
> > Let me explain it in detail. If SYSVIPC and IPC_NS are both defined,
> > current->nsproxy->ipc_ns will save the ipc namespace variables. We just use
> > it. If SYSVIPC (or POSIX_MQUEUE) is defined while IPC_NS is not set,
> > current->nsproxy->ipc_ns will always refer to init_ipc_ns in ipc/msgutil.c,
> > which is also fine for us. But if neither SYSVIPC nor POSIX_MQUEUE is set
> > (IPC_NS can't be set in this situation), there is no
> > current->nsproxy->ipc_ns.
> > We make a fake init_ipc_ns here and use it.
> 
> Yes, I can read the code. I'm wondering specifically about SYSVIPC and
> POSIX_MQUEUE. Even with your code changes, binder has no dependency on
> these configs. Why are you creating one? The actual dependency with
> your changes is on "current->nsproxy->ipc_ns" being initialized for
> binder -- which is dependent on CONFIG_IPC_NS being enabled, isn't it?
> 
> If SYSVIPC or POSIX_MQUEUE are enabled, but IPC_NS is disabled, does this 
> work? 

If IPC_NS is disabled, "current->nsproxy->ipc_ns" will also exist; it will be
a static reference of "init_ipc_ns" (in ipc/msgutil.c, not defined in binder.c
by me) with no namespace-ization. You will get the same one in all processes,
and everything is the same as without this patch.

- choury -



Re: [PATCH 2/8] of: Drop full path from Sparc PDT full_name

2018-11-09 Thread David Miller
From: Rob Herring 
Date: Fri, 9 Nov 2018 14:30:01 -0600

> That is the intent. With this change we stop storing the full path for
> every node. Everywhere that needs the full path, generates it with the
> %pOF printf specifier. Other than users in arch/sparc converted in
> this series all the users of full_name have been converted already or
> use kbasename(full_name) so they work either way. BTW, I've found a
> couple more, so I'll be sending you a v2.
> 
> And of_pdt_build_full_name for !SPARC does need to be converted too,
> but I'll do that separately.

I would think that, for correctness, you'd need to adjust both functions
at the same time, no?
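
(Side note on the specifier mentioned above, illustrative only: %pOF takes a
struct device_node pointer and emits the node's full path at format time, so
nothing needs to store the path.)

	/* np is a struct device_node *; the printed path is hypothetical */
	pr_info("probing %pOF\n", np);	/* e.g. "/soc/serial@d0012000" */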


[PATCH -next] sysv: return 'err' instead of 0 in __sysv_write_inode

2018-11-09 Thread YueHaibing
Fixes gcc '-Wunused-but-set-variable' warning:

fs/sysv/inode.c: In function '__sysv_write_inode':
fs/sysv/inode.c:239:6: warning:
 variable 'err' set but not used [-Wunused-but-set-variable]

__sysv_write_inode should return 'err' instead of 0

Fixes: 05459ca81ac3 ("repair sysv_write_inode(), switch sysv to simple_fsync()")
Signed-off-by: YueHaibing 
---
 fs/sysv/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/sysv/inode.c b/fs/sysv/inode.c
index 499a20a..273736f 100644
--- a/fs/sysv/inode.c
+++ b/fs/sysv/inode.c
@@ -275,7 +275,7 @@ static int __sysv_write_inode(struct inode *inode, int wait)
 }
 }
brelse(bh);
-   return 0;
+   return err;
 }
 
 int sysv_write_inode(struct inode *inode, struct writeback_control *wbc)





Re: [mm PATCH v5 7/7] mm: Use common iterator for deferred_init_pages and deferred_free_pages

2018-11-09 Thread Pavel Tatashin
On 18-11-05 13:20:01, Alexander Duyck wrote:
> +static unsigned long __next_pfn_valid_range(unsigned long *i,
> + unsigned long end_pfn)
>  {
> - if (!pfn_valid_within(pfn))
> - return false;
> - if (!(pfn & (pageblock_nr_pages - 1)) && !pfn_valid(pfn))
> - return false;
> - return true;
> + unsigned long pfn = *i;
> + unsigned long count;
> +
> + while (pfn < end_pfn) {
> + unsigned long t = ALIGN(pfn + 1, pageblock_nr_pages);
> + unsigned long pageblock_pfn = min(t, end_pfn);
> +
> +#ifndef CONFIG_HOLES_IN_ZONE
> + count = pageblock_pfn - pfn;
> + pfn = pageblock_pfn;
> + if (!pfn_valid(pfn))
> + continue;
> +#else
> + for (count = 0; pfn < pageblock_pfn; pfn++) {
> + if (pfn_valid_within(pfn)) {
> + count++;
> + continue;
> + }
> +
> + if (count)
> + break;
> + }
> +
> + if (!count)
> + continue;
> +#endif
> + *i = pfn;
> + return count;
> + }
> +
> + return 0;
>  }
>  
> +#define for_each_deferred_pfn_valid_range(i, start_pfn, end_pfn, pfn, count) \
> +	for (i = (start_pfn),					\
> +	     count = __next_pfn_valid_range(&i, (end_pfn));	\
> +	     count && ({ pfn = i - count; 1; });		\
> +	     count = __next_pfn_valid_range(&i, (end_pfn)))

Can this be improved somehow? It took me a while to understand this
piece of code. i is actually the end of the block, not an index by PFN, and
({pfn = i - count; 1;}) is simply hard to parse. Why can't we make
__next_pfn_valid_range() return both the end and the start of a block?
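
(To make the question concrete, here is an untested sketch, not a proposal
from this thread, of an interface that hands back an explicit [start, end)
block instead of an end-plus-count pair; it only models the
!CONFIG_HOLES_IN_ZONE case.)

static bool __next_pfn_valid_range(unsigned long *spfn, unsigned long *epfn,
				   unsigned long zone_end_pfn)
{
	unsigned long pfn = *epfn;	/* resume where the previous block ended */

	while (pfn < zone_end_pfn) {
		unsigned long block_end = min(ALIGN(pfn + 1, pageblock_nr_pages),
					      zone_end_pfn);

		if (pfn_valid(pfn)) {
			*spfn = pfn;
			*epfn = block_end;
			return true;
		}
		pfn = block_end;
	}
	return false;
}

#define for_each_deferred_pfn_valid_range(spfn, epfn, start_pfn, end_pfn)	\
	for (spfn = (start_pfn), epfn = (start_pfn);				\
	     __next_pfn_valid_range(&(spfn), &(epfn), (end_pfn));		\
	     )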

The rest is good:

Reviewed-by: Pavel Tatashin 

Thank you,
Pasha


Re: [RFC PATCH 1/3] static_call: Add static call infrastructure

2018-11-09 Thread Steven Rostedt
On Fri, 9 Nov 2018 14:34:59 -0600
Josh Poimboeuf  wrote:

I'm slowly massaging this to work with tracepoints.

But I hit a snag on this patch.

> On Fri, Nov 09, 2018 at 02:57:46PM -0500, Steven Rostedt wrote:
> > On Fri, 9 Nov 2018 13:35:05 -0600
> > Josh Poimboeuf  wrote:
> > 
> >   
> > > > > +#define DECLARE_STATIC_CALL(key, func)   
> > > > > \
> > > > > + extern struct static_call_key key;  
> > > > > \
> > > > > + extern typeof(func) STATIC_CALL_TRAMP(key); 
> > > > > \
> > > > > + /* Preserve the ELF symbol so objtool can access it: */ 
> > > > > \
> > > > > + __ADDRESSABLE(key)
> > > > 
> > > > Does the __ADDRESSABLE(key) need to be in the DECLARE part?
> > > > If so, there needs to be more explanation than just the comment above
> > > > it.
> > > 
> > > For each call site, objtool creates a struct in .static_call_sites:
> > > 
> > >   struct static_call_site {
> > >   s32 addr;
> > >   s32 key;
> > >   };
> > > 
> > > In order to do that, it needs to create a relocation which references
> > > the key symbol.  If the key is defined in another .o file, then the
> > > current .o will not have an ELF symbol associated with the key.  The
> > > __ADDRESSABLE(key) thing tells GCC to leave the key symbol in the .o
> > > file, even though it's not referenced anywhere.  That makes objtool's
> > > job easier, so it doesn't have to edit the symbol table.
> > > 
> > > I could add a comment saying as much, though it's hard to explain it in
> > > fewer words than I just did :-)  
> > 
> > Does this have to do with adding the references by relative address?
> > 
> > In record_mcount, I just picked an existing symbol and referenced that..
> > But perhaps this is a cleaner way.  
> 
> I think recordmcount is different.  It creates references (in
> __mcount_loc) to functions which are already in the object file, so they
> already have symbols associated with them.
> 
> But in this case, when objtool is creating references, the symbol it
> needs to reference is outside the .o file, so there's no symbol to
> associate it with.
> 

__ADDRESSABLE() appears to fail if you have a header with a
DECLARE_STATIC_CALL() that is included where the DEFINE_STATIC_CALL()
is; I'm getting this:

In file included from :
/work/git/linux-trace.git/include/linux/compiler.h:285:11: error: redefinition 
of ‘__addressable___tp_func_sys_enter40’
   __PASTE(__addressable_##sym, __LINE__) = (void *)&sym;
   ^~
/work/git/linux-trace.git/include/linux/compiler_types.h:53:23: note: in 
definition of macro ‘___PASTE’
 #define ___PASTE(a,b) a##b
   ^
/work/git/linux-trace.git/include/linux/compiler.h:285:3: note: in expansion of 
macro ‘__PASTE’
   __PASTE(__addressable_##sym, __LINE__) = (void *)&sym;
   ^~~
/work/git/linux-trace.git/include/linux/static_call.h:112:2: note: in expansion 
of macro ‘__ADDRESSABLE’
  __ADDRESSABLE(key)
  ^
/work/git/linux-trace.git/include/linux/static_call.h:115:2: note: in expansion 
of macro ‘DECLARE_STATIC_CALL’
  DECLARE_STATIC_CALL(key, _func);\
  ^~~
/work/git/linux-trace.git/include/linux/tracepoint.h:310:2: note: in expansion 
of macro ‘DEFINE_STATIC_CALL’
  DEFINE_STATIC_CALL(__tp_func_##name, __tracepoint_iter_##name);
  ^~
/work/git/linux-trace.git/include/trace/define_trace.h:42:2: note: in expansion 
of macro ‘DEFINE_TRACE_FN’
  DEFINE_TRACE_FN(name, reg, unreg, PARAMS(proto), PARAMS(args))
  ^~~
/work/git/linux-trace.git/include/trace/events/syscalls.h:18:1: note: in 
expansion of macro ‘TRACE_EVENT_FN’
 TRACE_EVENT_FN(sys_enter,
 ^~
/work/git/linux-trace.git/include/linux/compiler.h:285:11: note: previous 
definition of ‘__addressable___tp_func_sys_enter40’ was here
   __PASTE(__addressable_##sym, __LINE__) = (void *)&sym;
   ^~
/work/git/linux-trace.git/include/linux/compiler_types.h:53:23: note: in 
definition of macro ‘___PASTE’
 #define ___PASTE(a,b) a##b
   ^
/work/git/linux-trace.git/include/linux/compiler.h:285:3: note: in expansion of 
macro ‘__PASTE’
   __PASTE(__addressable_##sym, __LINE__) = (void *)&sym;
   ^~~
/work/git/linux-trace.git/include/linux/static_call.h:112:2: note: in expansion 
of macro ‘__ADDRESSABLE’
  __ADDRESSABLE(key)
  ^
/work/git/linux-trace.git/include/linux/tracepoint.h:234:2: note: in expansion 
of macro ‘DECLARE_STATIC_CALL’
  DECLARE_STATIC_CALL(__tp_func_##name, __tracepoint_iter_##name); \
  ^~~
/work/git/linux-trace.git/include/linux/tracepoint.h:421:2: note: in expansion 
of macro ‘__DECLARE_TRACE’
  __DECLARE_TRACE(name, PARAMS(proto), PARAMS(args),  \
  ^~~
/work/git/linux-trace.git/include/linux/tracepoint.h:560:2: note: in expansion 
of macro ‘DECLARE_TRACE’
  DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
  ^
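
(The clash comes from __ADDRESSABLE() uniquifying its symbol with __LINE__, so
two expansions that resolve to the same line number within one translation
unit collide. One conceivable way out, sketched here as a hedged idea rather
than something proposed in this thread, is to uniquify with __COUNTER__, which
is distinct for every expansion:)

-		__PASTE(__addressable_##sym, __LINE__) = (void *)&sym;
+		__PASTE(__addressable_##sym, __COUNTER__) = (void *)&sym;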

Re: [PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd

2018-11-09 Thread Joel Fernandes
On Fri, Nov 09, 2018 at 12:36:34PM -0800, Andrew Morton wrote:
> On Wed,  7 Nov 2018 20:15:36 -0800 "Joel Fernandes (Google)" 
>  wrote:
> 
> > Android uses ashmem for sharing memory regions. We are looking forward
> > to migrating all usecases of ashmem to memfd so that we can possibly
> > remove the ashmem driver in the future from staging while also
> > benefiting from using memfd and contributing to it. Note staging drivers
> > are also not ABI and generally can be removed at anytime.
> > 
> > One of the main usecases Android has is the ability to create a region
> > and mmap it as writeable, then add protection against making any
> > "future" writes while keeping the existing already mmap'ed
> > writeable-region active.  This allows us to implement a usecase where
> > receivers of the shared memory buffer can get a read-only view, while
> > the sender continues to write to the buffer.
> > See CursorWindow documentation in Android for more details:
> > https://developer.android.com/reference/android/database/CursorWindow
> 
> It appears that the memfd_create and fcntl manpages will require
> updating.  Please attend to this at the appropriate time?

Yes, I am planning to send those out shortly. I finished working on them.

Also just to let you know, I posted a fix for the security issue Jann Horn
reported and requested him to test it:
https://lore.kernel.org/lkml/20181109234636.ga136...@google.com/T/#m8d9d185e6480d095f0ab8f84bcb103892181f77d

This fix, along with the 2 other patches I posted in v3, is all that's needed.
Thanks!

- Joel



Re: Re: [PATCH V3] binder: ipc namespace support for android binder(Internet mail)

2018-11-09 Thread 周威
> >
> > If IPC_NS is disabled, "current->nsproxy->ipc_ns" will still exist; it will 
> > be a static
> > reference of "init_ipc_ns" (in ipc/msgutil.c, not defined in binder.c by 
> > me) with
> > no namespace-ization. You will get the same one in all processes, 
> > everything is
> > the same as  without this patch.
> 
> except, as far as I can tell, binder_init_ns() would never have been
> called on it so the mutex and list heads are not initialized so its
> completely broken. Am I missing something? How do those fields get
> initialized in this case?

> @@ -5832,8 +5888,12 @@ static int __init binder_init(void)
>                         goto err_init_binder_device_failed;
>         }
>
> -       return ret;
> +       ret = binder_init_ns(_ipc_ns);
> +       if (ret)
> +               goto err_init_namespace_failed;
>
> +       return ret;

They are initialized here.

- choury -
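
For context on the question above about uninitialized fields, a rough sketch of
the kind of per-namespace init helper being discussed. The struct members below
are illustrative assumptions, not the actual V3 patch:

/*
 * Sketch only -- the field names are hypothetical.  The point is that the
 * mutex and list heads carried in the ipc namespace must be initialized
 * (e.g. from binder_init() for the initial namespace) before any binder
 * code uses them.
 */
static int binder_init_ns(struct ipc_namespace *ns)
{
	mutex_init(&ns->binder_procs_lock);	/* hypothetical field */
	INIT_HLIST_HEAD(&ns->binder_procs);	/* hypothetical field */
	INIT_HLIST_HEAD(&ns->binder_contexts);	/* hypothetical field */
	return 0;
}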



Re: Re: [PATCH V3] binder: ipc namespace support for android binder

2018-11-09 Thread Todd Kjos
On Fri, Nov 9, 2018 at 7:09 PM chouryzhou(周威)  wrote:
>
> >
> > I still don't understand the dependencies on SYSVIPC or POSIX_MQUEUE.
> > It seems like this mechanism would work even if both are disabled --
> > as long as IPC_NS is enabled. Seems cleaner to change init/Kconfig and
> > allow IPC_NS if CONFIG_ANDROID_BINDER_IPC and change this line to
> > "#ifndef CONFIG_IPC_NS"
>
> Let me explain it in detail. If SYSVIPC and IPC_NS are both defined,
> current->nsproxy->ipc_ns will hold the ipc namespace variables. We just use
> it. If SYSVIPC (or POSIX_MQUEUE) is defined while IPC_NS is not set,
> current->nsproxy->ipc_ns will always refer to init_ipc_ns in ipc/msgutil.c,
> which is also fine for us. But if neither SYSVIPC nor POSIX_MQUEUE is set
> (IPC_NS can't be set in this situation), there is no current->nsproxy->ipc_ns.
> We make a fake init_ipc_ns here and use it.

Yes, I can read the code. I'm wondering specifically about SYSVIPC and
POSIX_MQUEUE. Even with your code changes, binder has no dependency on
these configs. Why are you creating one? The actual dependency with
your changes is on "current->nsproxy->ipc_ns" being initialized for
binder -- which is dependent on CONFIG_IPC_NS being enabled, isn't it?

If SYSVIPC or POSIX_MQUEUE are enabled, but IPC_NS is disabled, does this work?

>
> > why eliminate name? The string name is very useful for differentiating
> > normal "framework" binder transactions vs "hal" or "vendor"
> > transactions. If we just have a device number it will be hard to tell
> > in the logs even which namespace it belongs to. We need to keep both
> > the "name" (for which there might be multiple in each ns) and some
> > indication of which namespace this is. Maybe we assign some sort of
> > namespace ID during binder_init_ns().
>
>  I will retain the name of the device. The inum of the ipc_ns can be treated
> as a namespace ID.
>
> > As mentioned above, we need to retain name and probably also want a ns
> > id of some sort. So context now has 3 components if IPC_NS, so maybe a
> > helper function to print context like:
> >
> > static void binder_seq_print_context(struct seq_file *m, struct
> > binder_context *context)
> > {
> > #ifdef CONFIG_IPC_NS
> >   seq_printf(m, "%d-%d-%s", context->ns_id, context->device,
> > context->name);
> > #else
> >   seq_printf(m, "%d", context->name);
> > #endif
> > }
> >
> > (same comment below everywhere context is printed)
> >
> > Should these debugfs nodes should be ns aware and only print debugging
> > info for the context of the thread accessing the node? If so, we would
> > also want to be namespace-aware when printing pids.
>
> Nowadays, debugfs is not namespace-ized, and pid namespace is not associated
> with ipc namespace.  Will it be more complicated to debug this if we just 
> print
> the info for current thread? Because we will have to enter the ipc namespace
> firstly. But add ipc inum should be no problem.

With the name and ns id, debugging would be fine from the host-side. I
don't understand the container use cases enough to know if you need to
be able to debug binder transaction failures from within the container
-- in which case it seems like you'd want the container-version of the
PIDs -- but obviously this depends on how the containers are set up
and what the use-cases really are. I'm ok with leaving that for a
later patch.

>
> - choury -
>
>
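
To make the configuration question above concrete, here is one way such a
lookup can be written so that it degrades to a single shared namespace when
current->nsproxy->ipc_ns is not available. This is only a sketch of the idea
being debated; the helper and fallback names are made up:

/*
 * Sketch only -- names are hypothetical.  With SYSVIPC or POSIX_MQUEUE
 * built in, current->nsproxy->ipc_ns always exists (it is simply
 * init_ipc_ns when IPC_NS is off).  Without either, a local static
 * instance stands in, so every process sees the same namespace.
 */
#if defined(CONFIG_SYSVIPC) || defined(CONFIG_POSIX_MQUEUE)
static inline struct ipc_namespace *binder_current_ipc_ns(void)
{
	return current->nsproxy->ipc_ns;
}
#else
static struct ipc_namespace binder_fallback_ipc_ns;	/* hypothetical */

static inline struct ipc_namespace *binder_current_ipc_ns(void)
{
	return &binder_fallback_ipc_ns;
}
#endif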


Re: [RFC 1/4] pwm: sifive: Add DT documentation for SiFive PWM Controller.

2018-11-09 Thread Paul Walmsley


On 10/16/18 3:04 PM, Thierry Reding wrote:
> On Tue, Oct 16, 2018 at 10:31:42AM -0700, Paul Walmsley wrote:
>> On 10/16/18 4:01 AM, Thierry Reding wrote:
>>> On Mon, Oct 15, 2018 at 03:57:35PM -0700, Atish Patra wrote:
 On 10/10/18 6:49 AM, Thierry Reding wrote:
> On Tue, Oct 09, 2018 at 11:51:22AM -0700, Atish Patra wrote:
>> +Required properties:
>> +- compatible: should be one of
>> +"sifive,fu540-c000-pwm0","sifive,pwm0".
> What's the '0' in here? A version number?
>
 I think yes. Since fu540 is the first Linux capable RISC-V core, SiFive 
 Guys
 decided mark it as version 0.

 @Wesly: Please correct me if I am wrong.
>>> It seems fairly superfluous to me to have a version number in addition to
>>> the fu540-c000, which already seems to be the core plus some sort of
>>> part number. Do you really expect there to be any changes in the SoC
>>> that would require a different compatible string at this point? If the
>>> SoC has taped out, how will you ever get a different version of the PWM
>>> IP in it?
>>>
>>> I would expect any improvements or changes to the PWM IP to show up in a
>>> different SoC generation, at which point it would be something like
>>> "sifive,fu640-c000" maybe, or perhaps "sifive,fu540-d000", or whatever
>>> the numbering is.
>>
>> The "0" suffix refers to a revision number for the underlying PWM IP block.
>>
>> It's certainly important to keep that version number on the "sifive,pwm0"
>> compatible string that doesn't have the chip name associated with it.
> Isn't the hardware identified by "sifive,pwm0" and "sifive,fu540-c000"
> effectively identical? 


The intention was that the "sifive,pwm0" compatible string specifies a
register interface and programming model that the IP block exposes to
the software, rather than a particular underlying hardware
implementation.  That is in contrast to a string like
"sifive,fu540-c000-pwm" which might activate particular workarounds or
quirks that are specific to the integration of the IP block on a given SoC.


The idea is that, for this and similar open-source hardware IP blocks,
the driver code can just match on a generic "sifive,pwm0" compatible
string.  The SoC DT data would include both the SoC-specific
"sifive,fu540-c000-pwm0" and the common interface "sifive,pwm0".  But
the driver would only need the SoC-specific compatible string if the SoC
wound up needing some SoC-specific quirks.


In the past, some folks have had a problem with that idea, since for
closed-source IP blocks, it's been difficult to determine what changes
went into a specific version of the IP block.  Thus folks generating
data for later SoCs usually specify a compatible string for another,
older, SoC that seems to have the desired behavior.  But since this
particular IP block has open-source RTL, and contains a "sifive,pwmX"
version string in the RTL itself:


https://github.com/sifive/sifive-blocks/blob/master/src/main/scala/devices/pwm/PWM.scala#L74


... it's straightforward to see what interface the hardware exposes to
the software for a given compatible string.


> Is there a need to have two compatible strings
> that refer to the exact same hardware?


There's no intention that "sifive,pwm0" and "sifive,fu540-c000-pwm0"
refer to the same hardware; just the same software interface and
programming model.  Even now, it's usually pretty unlikely that two
different SoCs that refer to (say) "nvidia,tegra20-pwm" contain the same
hardware, since differences in synthesis, place and route, ECOs, and
integration change the actual realization of the hardware.  Some folks
interpreted that compatible string reuse as implying the same "hardware"
is in use on both SoCs, but we're really just identifying a software
interface.


>> As to whether there could ever be a FU540-C000 part with different IP block
>> versions on it: FU540-C000 is ultimately a marketing name.  While
>> theoretically we shouldn't have another "FU540-C000" chip with different
>> peripheral IP block versions on it, I don't think any engineer can guarantee
>> that it won't happen.
> I would argue that if at some point there was indeed a chip with the
> same name but a different IP block version in it, we can figure out what
> to call it. Sure there are no guarantees, but it's still fairly unlikely
> in my opinion, so I personally wouldn't worry about this up front.
>
> Anyway, I don't feel strongly either way, I'm just pointing out that
> this is somewhat unusual. If you want to keep it, feel free to.


Thanks for the review, Thierry -


- Paul
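
As an illustration of the matching scheme described above, a driver can list
both the generic interface string and the SoC-specific one in its OF match
table, attaching quirk data only to the latter. The quirk structure below is
made up for the example:

/* Sketch only -- the quirk structure and its field are hypothetical. */
struct sifive_pwm_quirks {
	bool example_workaround;
};

static const struct sifive_pwm_quirks fu540_c000_quirks = {
	.example_workaround = true,
};

static const struct of_device_id sifive_pwm_of_match[] = {
	/* SoC-specific entry, only needed if this SoC requires quirks. */
	{ .compatible = "sifive,fu540-c000-pwm0", .data = &fu540_c000_quirks },
	/* Generic entry describing the register interface / programming model. */
	{ .compatible = "sifive,pwm0" },
	{ /* sentinel */ }
};
MODULE_DEVICE_TABLE(of, sifive_pwm_of_match);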



Re: Re: [PATCH V3] binder: ipc namespace support for android binder(Internet mail)

2018-11-09 Thread Todd Kjos
On Fri, Nov 9, 2018 at 8:43 PM chouryzhou(周威)  wrote:

>
> If IPC_NS is disabled, "current->nsproxy->ipc_ns" will still exist; it will 
> be a static
> reference of "init_ipc_ns" (in ipc/msgutil.c, not defined in binder.c by me) 
> with
> no namespace-ization. You will get the same one in all processes, everything 
> is
> the same as  without this patch.

except, as far as I can tell, binder_init_ns() would never have been
called on it so the mutex and list heads are not initialized so its
completely broken. Am I missing something? How do those fields get
initialized in this case?

>
> - choury -
>


RE: [RFC PATCH v4 11/13] mm: parallelize deferred struct page initialization within each node

2018-11-09 Thread Elliott, Robert (Persistent Memory)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org  ow...@vger.kernel.org> On Behalf Of Daniel Jordan
> Sent: Monday, November 05, 2018 10:56 AM
> Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> initialization within each node
> 
> ...  The kernel doesn't
> know the memory bandwidth of a given system to get the most efficient
> number of threads, so there's some guesswork involved.  

The ACPI HMAT (Heterogeneous Memory Attribute Table) is designed to report
that kind of information, and could facilitate automatic tuning.

There was discussion last year about kernel support for it:
https://lore.kernel.org/lkml/20171214021019.13579-1-ross.zwis...@linux.intel.com/


> In testing, a reasonable value turned out to be about a quarter of the
> CPUs on the node.
...
> + /*
> +  * We'd like to know the memory bandwidth of the chip to
> calculate the
> +  * most efficient number of threads to start, but we can't.
> +  * In testing, a good value for a variety of systems was a
> quarter of the CPUs on the node.
> +  */
> + nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4);


You might want to base that calculation on and limit the threads to
physical cores, not hyperthreaded cores.

---
Robert Elliott, HPE Persistent Memory
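
Regarding the suggestion to count only physical cores, below is a sketch of one
way to do that for a node's cpumask, assuming the topology sibling masks are
populated; the helper name is made up:

/*
 * Sketch only: count physical cores in @mask by taking one CPU per
 * sibling group (needs <linux/cpumask.h> and <linux/topology.h>).
 */
static unsigned int nr_physical_cores(const struct cpumask *mask)
{
	cpumask_var_t seen;
	unsigned int cpu, cores = 0;

	if (!zalloc_cpumask_var(&seen, GFP_KERNEL))
		return cpumask_weight(mask);	/* fall back to all CPUs */

	for_each_cpu(cpu, mask) {
		if (cpumask_test_cpu(cpu, seen))
			continue;
		cpumask_or(seen, seen, topology_sibling_cpumask(cpu));
		cores++;
	}
	free_cpumask_var(seen);
	return cores;
}

/* e.g.: nr_node_threads = DIV_ROUND_UP(nr_physical_cores(cpumask), 4); */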




[PATCH v5 2/3] dt-bindings: Add vendor prefix for PHICOMM Co., Ltd.

2018-11-09 Thread He Yangxuan
PHICOMM Co., Ltd. is a hardware provider headquartered in Shanghai; its
products include routers and smart devices.

Signed-off-by: He Yangxuan 
---
 Documentation/devicetree/bindings/vendor-prefixes.txt | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/devicetree/bindings/vendor-prefixes.txt 
b/Documentation/devicetree/bindings/vendor-prefixes.txt
index 4b1a2a8fc..8c413d8cc 100644
--- a/Documentation/devicetree/bindings/vendor-prefixes.txt
+++ b/Documentation/devicetree/bindings/vendor-prefixes.txt
@@ -296,6 +296,7 @@ panasonic   Panasonic Corporation
 parade Parade Technologies Inc.
 pericomPericom Technology Inc.
 pervasive  Pervasive Displays, Inc.
+phicomm PHICOMM Co., Ltd.
 phytec PHYTEC Messtechnik GmbH
 picochip   Picochip Ltd
 pine64 Pine64
-- 
2.11.0



[PATCH v5 3/3] dt-bindings: arm: amlogic: Add Phicomm N1

2018-11-09 Thread He Yangxuan
Add bindings documentation for the Phicomm N1.

Signed-off-by: He Yangxuan 
---
 Documentation/devicetree/bindings/arm/amlogic.txt | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/devicetree/bindings/arm/amlogic.txt 
b/Documentation/devicetree/bindings/arm/amlogic.txt
index 4498292b8..93177f38e 100644
--- a/Documentation/devicetree/bindings/arm/amlogic.txt
+++ b/Documentation/devicetree/bindings/arm/amlogic.txt
@@ -91,6 +91,7 @@ Board compatible values (alphabetically, grouped by SoC):
 
   - "amlogic,p230" (Meson gxl s905d)
   - "amlogic,p231" (Meson gxl s905d)
+  - "phicomm,n1" (Meson gxl s905d)
 
   - "amlogic,p241" (Meson gxl s805x)
 
-- 
2.11.0



[PATCH v5 0/3] arm64: dts: meson-gxl: add support for phicomm n1

2018-11-09 Thread He Yangxuan
This patch adds support for the Phicomm N1. This device is based on the P230
reference design.
The PHY is an RTL8211F; Energy Efficient Ethernet (EEE) needs to be disabled to
make it stable.
This box doesn't have CVBS, so the related section is disabled in the device tree.

Changes since v4:
 - include device tree

Changes since v3:
 - remove external phy from device tree, will disable EEE in 
meson-gxl-s905d-p230.dts

Changes since v2:
 - remove changes section in commit message
 - add commit message for patch of vendor prefix and bindings documentation

Changes since v1:
 - rewrite external phy section
 - add phicomm vendor prefix
 - add phicomm n1 in Documentation/devicetree/bindings/arm/amlogic.txt

He Yangxuan (3):
  arm64: dts: meson-gxl: add support for phicomm n1
  dt-bindings: Add vendor prefix for PHICOMM Co., Ltd.
  dt-bindings: arm: amlogic: Add Phicomm N1

 Documentation/devicetree/bindings/arm/amlogic.txt   |  1 +
 .../devicetree/bindings/vendor-prefixes.txt |  1 +
 arch/arm64/boot/dts/amlogic/Makefile|  1 +
 .../boot/dts/amlogic/meson-gxl-s905d-phicomm-n1.dts | 21 +
 4 files changed, 24 insertions(+)
 create mode 100644 arch/arm64/boot/dts/amlogic/meson-gxl-s905d-phicomm-n1.dts

-- 
2.11.0



[PATCH v5 1/3] arm64: dts: meson-gxl: add support for phicomm n1

2018-11-09 Thread He Yangxuan
This patch adds support for the Phicomm N1. This device is based on the P230
reference design.
This box doesn't have CVBS, so the related section is disabled in the device tree.

Signed-off-by: He Yangxuan 
---
 arch/arm64/boot/dts/amlogic/Makefile|  1 +
 .../boot/dts/amlogic/meson-gxl-s905d-phicomm-n1.dts | 21 +
 2 files changed, 22 insertions(+)
 create mode 100644 arch/arm64/boot/dts/amlogic/meson-gxl-s905d-phicomm-n1.dts

diff --git a/arch/arm64/boot/dts/amlogic/Makefile 
b/arch/arm64/boot/dts/amlogic/Makefile
index c31f29d66..49f3ac5d8 100644
--- a/arch/arm64/boot/dts/amlogic/Makefile
+++ b/arch/arm64/boot/dts/amlogic/Makefile
@@ -18,6 +18,7 @@ dtb-$(CONFIG_ARCH_MESON) += meson-gxl-s905x-nexbox-a95x.dtb
 dtb-$(CONFIG_ARCH_MESON) += meson-gxl-s905x-p212.dtb
 dtb-$(CONFIG_ARCH_MESON) += meson-gxl-s905d-p230.dtb
 dtb-$(CONFIG_ARCH_MESON) += meson-gxl-s905d-p231.dtb
+dtb-$(CONFIG_ARCH_MESON) += meson-gxl-s905d-phicomm-n1.dtb
 dtb-$(CONFIG_ARCH_MESON) += meson-gxl-s805x-p241.dtb
 dtb-$(CONFIG_ARCH_MESON) += meson-gxl-s905w-p281.dtb
 dtb-$(CONFIG_ARCH_MESON) += meson-gxl-s905w-tx3-mini.dtb
diff --git a/arch/arm64/boot/dts/amlogic/meson-gxl-s905d-phicomm-n1.dts 
b/arch/arm64/boot/dts/amlogic/meson-gxl-s905d-phicomm-n1.dts
new file mode 100644
index 0..9a8a8a7e4
--- /dev/null
+++ b/arch/arm64/boot/dts/amlogic/meson-gxl-s905d-phicomm-n1.dts
@@ -0,0 +1,21 @@
+// SPDX-License-Identifier: (GPL-2.0+ OR MIT)
+/*
+ * Copyright (c) 2018 He Yangxuan
+ */
+
+/dts-v1/;
+
+#include "meson-gxl-s905d-p230.dts"
+
+/ {
+   compatible = "phicomm,n1", "amlogic,s905d", "amlogic,meson-gxl";
+   model = "Phicomm N1";
+
+   cvbs-connector {
+   status = "disabled";
+   };
+};
+
+&cvbs_vdac_port {
+   status = "disabled";
+};
-- 
2.11.0



Re: KASAN: use-after-free Read in task_is_descendant

2018-11-09 Thread syzbot

syzbot has found a reproducer for the following crash on:

HEAD commit:3541833fd1f2 Merge tag 's390-4.20-2' of git://git.kernel.o..
git tree:   upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=16d3cad540
kernel config:  https://syzkaller.appspot.com/x/.config?x=8f559fee2fc3375a
dashboard link: https://syzkaller.appspot.com/bug?extid=a9ac39bf55329e206219
compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1494bb0b40

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+a9ac39bf55329e206...@syzkaller.appspotmail.com

IPv6: ADDRCONF(NETDEV_UP): veth1: link is not ready
IPv6: ADDRCONF(NETDEV_CHANGE): veth1: link becomes ready
IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
8021q: adding VLAN 0 to HW filter on device team0
==
BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:182  
[inline]
BUG: KASAN: use-after-free in task_is_descendant.part.3+0x610/0x670  
security/yama/yama_lsm.c:295

Read of size 8 at addr 8801c46f24e0 by task syz-executor2/9973

CPU: 0 PID: 9973 Comm: syz-executor2 Not tainted 4.20.0-rc1+ #107
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011

Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x244/0x39d lib/dump_stack.c:113
 print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
 __read_once_size include/linux/compiler.h:182 [inline]
 task_is_descendant.part.3+0x610/0x670 security/yama/yama_lsm.c:295
 task_is_descendant security/yama/yama_lsm.c:282 [inline]
 yama_ptrace_access_check+0x215/0x10fc security/yama/yama_lsm.c:371
 security_ptrace_access_check+0x54/0xb0 security/security.c:271
 __ptrace_may_access+0x5c8/0x980 kernel/ptrace.c:336
 ptrace_attach+0x1fa/0x640 kernel/ptrace.c:389
 __do_sys_ptrace kernel/ptrace.c:1139 [inline]
 __se_sys_ptrace kernel/ptrace.c:1119 [inline]
 __x64_sys_ptrace+0x229/0x260 kernel/ptrace.c:1119
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457569
Code: fd b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7  
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff  
ff 0f 83 cb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00

RSP: 002b:7f2ed4dbfc78 EFLAGS: 0246 ORIG_RAX: 0065
RAX: ffda RBX: 0004 RCX: 00457569
RDX:  RSI: 021d RDI: 4206
RBP: 0072bf00 R08:  R09: 
R10:  R11: 0246 R12: 7f2ed4dc06d4
R13: 004c33bd R14: 004d50e0 R15: 

Allocated by task 5873:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490
 kmem_cache_alloc_node+0x144/0x730 mm/slab.c:3644
 alloc_task_struct_node kernel/fork.c:158 [inline]
 dup_task_struct kernel/fork.c:843 [inline]
 copy_process+0x2026/0x87a0 kernel/fork.c:1751
 _do_fork+0x1cb/0x11d0 kernel/fork.c:2216
 __do_sys_clone kernel/fork.c:2323 [inline]
 __se_sys_clone kernel/fork.c:2317 [inline]
 __x64_sys_clone+0xbf/0x150 kernel/fork.c:2317
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 9:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
 __cache_free mm/slab.c:3498 [inline]
 kmem_cache_free+0x83/0x290 mm/slab.c:3760
 free_task_struct kernel/fork.c:163 [inline]
 free_task+0x16e/0x1f0 kernel/fork.c:457
 __put_task_struct+0x2e6/0x620 kernel/fork.c:730
 put_task_struct include/linux/sched/task.h:96 [inline]
 delayed_put_task_struct+0x2ff/0x4c0 kernel/exit.c:181
 __rcu_reclaim kernel/rcu/rcu.h:240 [inline]
 rcu_do_batch kernel/rcu/tree.c:2437 [inline]
 invoke_rcu_callbacks kernel/rcu/tree.c:2716 [inline]
 rcu_process_callbacks+0x100a/0x1ac0 kernel/rcu/tree.c:2697
 __do_softirq+0x308/0xb7e kernel/softirq.c:292

The buggy address belongs to the object at 8801c46f2000
 which belongs to the cache task_struct(65:syz2) of size 6080
The buggy address is located 1248 bytes inside of
 6080-byte region [8801c46f2000, 8801c46f37c0)
The buggy address belongs to the page:
page:ea000711bc80 count:1 mapcount:0 mapping:8801c204fc80 index:0x0  
compound_mapcount: 0

flags: 0x2fffc010200(slab|head)
raw: 02fffc010200 ea00073cf808 ea000697c908 8801c204fc80
raw:  8801c46f2000 00010001 8801d8404a80
page dumped because: 


[PATCH] tracehook: fix documentation for tracehook_report_syscall_entry()

2018-11-09 Thread Elvira Khabirova
tracehook_report_syscall_entry() is called not only
if %TIF_SYSCALL_TRACE is set, but also if %TIF_SYSCALL_EMU is set,
as appears from x86's entry code.

Signed-off-by: Elvira Khabirova 
---
 include/linux/tracehook.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 40b0b4c1bf7b..df20f8bdbfa3 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -83,8 +83,8 @@ static inline int ptrace_report_syscall(struct pt_regs *regs)
  * tracehook_report_syscall_entry - task is about to attempt a system call
  * @regs:  user register state of current task
  *
- * This will be called if %TIF_SYSCALL_TRACE has been set, when the
- * current task has just entered the kernel for a system call.
+ * This will be called if %TIF_SYSCALL_TRACE or %TIF_SYSCALL_EMU have been set,
+ * when the current task has just entered the kernel for a system call.
  * Full user register state is available here.  Changing the values
  * in @regs can affect the system call number and arguments to be tried.
  * It is safe to block here, preventing the system call from beginning.
-- 
2.19.1
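
For illustration, the call site the corrected comment describes has roughly the
following shape; this is a simplified sketch, not the actual x86 entry code:

/* Simplified sketch of a syscall-entry path (not the real x86 code). */
if (work & (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU)) {
	/* Returns nonzero if the tracer wants the syscall skipped. */
	if (tracehook_report_syscall_entry(regs))
		return -1L;
}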



Re: [PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd

2018-11-09 Thread Joel Fernandes
On Fri, Nov 09, 2018 at 10:19:03PM +0100, Jann Horn wrote:
> On Fri, Nov 9, 2018 at 10:06 PM Jann Horn  wrote:
> > On Fri, Nov 9, 2018 at 9:46 PM Joel Fernandes (Google)
> >  wrote:
> > > Android uses ashmem for sharing memory regions. We are looking forward
> > > to migrating all usecases of ashmem to memfd so that we can possibly
> > > remove the ashmem driver in the future from staging while also
> > > benefiting from using memfd and contributing to it. Note staging drivers
> > > are also not ABI and generally can be removed at anytime.
> > >
> > > One of the main usecases Android has is the ability to create a region
> > > and mmap it as writeable, then add protection against making any
> > > "future" writes while keeping the existing already mmap'ed
> > > writeable-region active.  This allows us to implement a usecase where
> > > receivers of the shared memory buffer can get a read-only view, while
> > > the sender continues to write to the buffer.
> > > See CursorWindow documentation in Android for more details:
> > > https://developer.android.com/reference/android/database/CursorWindow
> > >
> > > This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
> > > To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal
> > > which prevents any future mmap and write syscalls from succeeding while
> > > keeping the existing mmap active.
> >
> > Please CC linux-api@ on patches like this. If you had done that, I
> > might have criticized your v1 patch instead of your v3 patch...
> >
> > > The following program shows the seal
> > > working in action:
> > [...]
> > > Cc: jr...@google.com
> > > Cc: john.stu...@linaro.org
> > > Cc: tk...@google.com
> > > Cc: gre...@linuxfoundation.org
> > > Cc: h...@infradead.org
> > > Reviewed-by: John Stultz 
> > > Signed-off-by: Joel Fernandes (Google) 
> > > ---
> > [...]
> > > diff --git a/mm/memfd.c b/mm/memfd.c
> > > index 2bb5e257080e..5ba9804e9515 100644
> > > --- a/mm/memfd.c
> > > +++ b/mm/memfd.c
> > [...]
> > > @@ -219,6 +220,25 @@ static int memfd_add_seals(struct file *file, 
> > > unsigned int seals)
> > > }
> > > }
> > >
> > > +   if ((seals & F_SEAL_FUTURE_WRITE) &&
> > > +   !(*file_seals & F_SEAL_FUTURE_WRITE)) {
> > > +   /*
> > > +* The FUTURE_WRITE seal also prevents growing and 
> > > shrinking
> > > +* so we need them to be already set, or requested now.
> > > +*/
> > > +   int test_seals = (seals | *file_seals) &
> > > +(F_SEAL_GROW | F_SEAL_SHRINK);
> > > +
> > > +   if (test_seals != (F_SEAL_GROW | F_SEAL_SHRINK)) {
> > > +   error = -EINVAL;
> > > +   goto unlock;
> > > +   }
> > > +
> > > +   spin_lock(>f_lock);
> > > +   file->f_mode &= ~(FMODE_WRITE | FMODE_PWRITE);
> > > +   spin_unlock(>f_lock);
> > > +   }
> >
> > So you're fiddling around with the file, but not the inode? How are
> > you preventing code like the following from re-opening the file as
> > writable?
> >
> > $ cat memfd.c
> > #define _GNU_SOURCE
> > #include 
> > #include 
> > #include 
> > #include 
> > #include 
> > #include 
> >
> > int main(void) {
> >   int fd = syscall(__NR_memfd_create, "testfd", 0);
> >   if (fd == -1) err(1, "memfd");
> >   char path[100];
> >   sprintf(path, "/proc/self/fd/%d", fd);
> >   int fd2 = open(path, O_RDWR);
> >   if (fd2 == -1) err(1, "reopen");
> >   printf("reopen successful: %d\n", fd2);
> > }
> > $ gcc -o memfd memfd.c
> > $ ./memfd
> > reopen successful: 4
> > $
> >
> > That aside: I wonder whether a better API would be something that
> > allows you to create a new readonly file descriptor, instead of
> > fiddling with the writability of an existing fd.
> 
> My favorite approach would be to forbid open() on memfds, hope that
> nobody notices the tiny API break, and then add an ioctl for "reopen
> this memfd with reduced permissions" - but that's just my personal
> opinion.

I did something along these lines and it fixes the issue, but I forbid open
of memfd only when the F_SEAL_FUTURE_WRITE seal is in place. So then it's not
an ABI break because this is a brand new seal. That seems the least intrusive
solution and it works. Do you mind testing it, and I'll add your Tested-by
to the new fix? The patch is based on top of this series.

---8<---
From: "Joel Fernandes (Google)" 
Subject: [PATCH] mm/memfd: Fix possible promotion to writeable of sealed memfd

Jann Horn found that reopening an F_SEAL_FUTURE_WRITE sealed memfd
through /proc/self/fd/N symlink as writeable succeeds. The simplest fix
without causing ABI breakages and ugly VFS hacks is to simply deny all
opens on F_SEAL_FUTURE_WRITE sealed fds.

Reported-by: Jann Horn 
Signed-off-by: Joel Fernandes (Google) 
---
 mm/shmem.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/mm/shmem.c 
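
The body of the patch above is truncated in this archive after the diff header.
As a rough sketch of the approach the changelog describes (denying open() once
the new seal is applied), something along the following lines would do it; the
hook name and placement are assumptions, not the posted patch:

/*
 * Sketch only -- not the posted patch (whose body is cut off above).
 * Idea: refuse open() on a shmem inode once F_SEAL_FUTURE_WRITE is set,
 * so a /proc/self/fd/N re-open cannot regain write access.
 */
static int shmem_file_open(struct inode *inode, struct file *file)
{
	struct shmem_inode_info *info = SHMEM_I(inode);

	if (info->seals & F_SEAL_FUTURE_WRITE)
		return -EPERM;

	return generic_file_open(inode, file);
}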


Re: Re: [PATCH V3] binder: ipc namespace support for android binder

2018-11-09 Thread 周威
> 
> I still don't understand the dependencies on SYSVIPC or POSIX_MQUEUE.
> It seems like this mechanism would work even if both are disabled --
> as long as IPC_NS is enabled. Seems cleaner to change init/Kconfig and
> allow IPC_NS if CONFIG_ANDROID_BINDER_IPC and change this line to
> "#ifndef CONFIG_IPC_NS"

Let me explain it in detail. If SYSVIPC and IPC_NS are both defined,
current->nsproxy->ipc_ns will hold the ipc namespace variables. We just use
it. If SYSVIPC (or POSIX_MQUEUE) is defined while IPC_NS is not set,
current->nsproxy->ipc_ns will always refer to init_ipc_ns in ipc/msgutil.c,
which is also fine for us. But if neither SYSVIPC nor POSIX_MQUEUE is set
(IPC_NS can't be set in this situation), there is no current->nsproxy->ipc_ns.
We make a fake init_ipc_ns here and use it.

> why eliminate name? The string name is very useful for differentiating
> normal "framework" binder transactions vs "hal" or "vendor"
> transactions. If we just have a device number it will be hard to tell
> in the logs even which namespace it belongs to. We need to keep both
> the "name" (for which there might be multiple in each ns) and some
> indication of which namespace this is. Maybe we assign some sort of
> namespace ID during binder_init_ns().

 I will retain the name of the device. The inum of the ipc_ns can be treated as
a namespace ID.

> As mentioned above, we need to retain name and probably also want a ns
> id of some sort. So context now has 3 components if IPC_NS, so maybe a
> helper function to print context like:
> 
> static void binder_seq_print_context(struct seq_file *m, struct
> binder_context *context)
> {
> #ifdef CONFIG_IPC_NS
>           seq_printf(m, "%d-%d-%s", context->ns_id, context->device,
> context->name);
> #else
>           seq_printf(m, "%d", context->name);
> #endif
> }
> 
> (same comment below everywhere context is printed)
> 
> Should these debugfs nodes should be ns aware and only print debugging
> info for the context of the thread accessing the node? If so, we would
> also want to be namespace-aware when printing pids.

Currently, debugfs is not namespace-aware, and the pid namespace is not
associated with the ipc namespace. Wouldn't it be more complicated to debug if
we only print the info for the current thread's context, since we would first
have to enter that ipc namespace? But adding the ipc inum should be no problem.

- choury -





Re: [PATCH 3/3] lockdep: Use line-buffered printk() for lockdep messages.

2018-11-09 Thread Tetsuo Handa
On 2018/11/10 0:43, Petr Mladek wrote:
>> + * Line buffered printk() tries to assign a buffer when printk() from a new
>> + * context identifier comes in. And it automatically releases that buffer 
>> when
>> + * one of three conditions listed below became true.
>> + *
>> + *   (1) printk() from that context identifier emitted '\n' as the last
>> + *   character of output.
>> + *   (2) printk() from that context identifier tried to print a too long 
>> line
>> + *   which cannot be stored into a buffer.
>> + *   (3) printk() from a new context identifier noticed that some context
>> + *   identifier is reserving a buffer for more than 10 seconds without
>> + *   emitting '\n'.
>> + *
>> + * Since (3) is based on a heuristic that somebody forgot to emit '\n' as 
>> the
>> + * last character of output(), pr_cont()/KERN_CONT users are expected to 
>> emit
>> + * '\n' within 10 seconds even if they reserved a buffer.
> 
> This is my main concern about this approach. It is so easy to omit
> the final '\n'.

If it is so easy to forget the final '\n', there will be a lot of implicit
pr_cont() users (because pr_cont() assumes that the previous printk() omitted
the final '\n'), and "I am going to investigate much more pr_cont() users."
will be insufficient to reach a meaningful conclusion.
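
(As an aside, an "implicit pr_cont() user" here means a fragment like the one
below; only printk()/pr_cont() are real APIs, the variables are placeholders:)

	/* The first record deliberately omits the trailing '\n' ... */
	printk(KERN_INFO "frobdev %d: features:", id);
	/* ... so these continuations are appended to the same line ... */
	for (i = 0; i < nr_features; i++)
		pr_cont(" %s", feature_names[i]);
	/* ... and only this final '\n' terminates (flushes) the line. */
	pr_cont("\n");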

Checking "lack of the the final '\n'" means that we need to check
"all printk() users who are not emitting the final '\n'" and evaluate
"whether there is a possibility that subsequent printk() will not be
 called from that context due to e.g. conditional branches". That is an
impossible task for anybody, for there might be out-of-tree code doing it.

> 
> They are currently delayed until another printk(). Even this is bad.
> Unfortunately we could not setup timer from printk() because it
> would add more locks into the game.

We could use an interval timer for flushing incomplete lines.
But updating printk() users to always end with '\n' would be preferable.

> 
> This patch will make it worse. They might get delayed by 10s or even
> more. Many other lines might appear in between. Also the code is
> more tricky[*].

If there are a lot of implicit pr_cont() users (who are omitting
the final '\n' in previous printk()), we can try to find them (and
fix them) by "reporting the location of printk() which omitted the
final '\n'" instead of "flushing the partial line from different
processes" when "try_buffered_printk() (or an interval timer) found
that some buffer is holding a partial line for more than 10 seconds".

> 
> 
> Sigh, I am really unhappy with how the buffered_printk() conversion
> looks in lockdep.c. But I still think that the approach is more
> reliable. I am going to investigate much more pr_cont() users.

If there is a possibility that subsequent printk() will not be called from
that context due to e.g. conditional branches, we will need to flush after
previous printk() in order to eliminate possibility of failing to flush.

That is asking printk() users to fix previous printk() so that partial
line is always flushed (with '\n' or without '\n').

> I wonder how many would be that complicated. I do not want
> to give up just because one use-case that was complicated
> even before.
> 
> 
> [*] The buffer can get written and flushed by different processes.
> It is not trivial to do it correctly a lockless way.
> 
> The proposed locking looks right on the first glance. But
> the code is a bit scary ;-)

try_buffered_printk() will be a good hook for finding implicit
pr_cont() users who are omitting the final '\n', for we can find them
without making changes to printk() users.

try_buffered_printk() hook can offload atomic-but-complicated task
(converting e.g. %pSOMETHING to %s) to outside of logbuf_lock, and
reduce the period of holding logbuf_lock.

Since try_buffered_printk() serves as a hook, if we ignore the offloading
atomic-but-complicated task, we can eliminate try_buffered_printk() after
we updated printk() users to always end with '\n'.



Re: [PATCH] drivers/android/binder.c: Remove duplicate header

2018-11-09 Thread kbuild test robot
Hi Brajeswar,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on staging/staging-testing]
[also build test ERROR on v4.20-rc1 next-20181109]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Brajeswar-Ghosh/drivers-android-binder-c-Remove-duplicate-header/20181109-154717
config: i386-randconfig-i3-201844 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/android/binder.o: In function `__read_once_size':
>> include/linux/compiler.h:182: undefined reference to 
>> `__tracepoint_binder_return'
>> include/linux/compiler.h:182: undefined reference to 
>> `__tracepoint_binder_return'
>> include/linux/compiler.h:182: undefined reference to 
>> `__tracepoint_binder_return'
>> include/linux/compiler.h:182: undefined reference to 
>> `__tracepoint_binder_transaction_buffer_release'
>> include/linux/compiler.h:182: undefined reference to 
>> `__tracepoint_binder_transaction_buffer_release'
>> include/linux/compiler.h:182: undefined reference to 
>> `__tracepoint_binder_transaction_buffer_release'
>> include/linux/compiler.h:182: undefined reference to 
>> `__tracepoint_binder_wait_for_work'
>> include/linux/compiler.h:182: undefined reference to 
>> `__tracepoint_binder_transaction_fd_recv'
>> include/linux/compiler.h:182: undefined reference to 
>> `__tracepoint_binder_wait_for_work'
>> include/linux/compiler.h:182: undefined reference to 
>> `__tracepoint_binder_transaction_fd_recv'
>> include/linux/compiler.h:182: undefined reference to 
>> `__tracepoint_binder_transaction_fd_recv'
>> include/linux/compiler.h:182: undefined reference to 
>> `__tracepoint_binder_wait_for_work'
>> include/linux/compiler.h:182: undefined reference to 
>> `__tracepoint_binder_transaction_received'
>> include/linux/compiler.h:182: undefined reference to 
>> `__tracepoint_binder_transaction_received'
>> include/linux/compiler.h:182: undefined reference to 
>> `__tracepoint_binder_transaction_received'
>> include/linux/compiler.h:182: undefined reference to 
>> `__tracepoint_binder_transaction'
>> include/linux/compiler.h:182: undefined reference to 
>> `__tracepoint_binder_transaction_alloc_buf'
>> include/linux/compiler.h:182: undefined reference to 
>> `__tracepoint_binder_transaction_failed_buffer_release'
>> include/linux/compiler.h:182: undefined reference to 
>> `__tracepoint_binder_transaction'
>> include/linux/compiler.h:182: undefined reference to 
>> `__tracepoint_binder_transaction_alloc_buf'

vim +182 include/linux/compiler.h

d976441f Andrey Ryabinin   2015-10-19  178  
d976441f Andrey Ryabinin   2015-10-19  179  static __always_inline
d976441f Andrey Ryabinin   2015-10-19  180  void __read_once_size(const 
volatile void *p, void *res, int size)
230fa253 Christian Borntraeger 2014-11-25  181  {
d976441f Andrey Ryabinin   2015-10-19 @182  __READ_ONCE_SIZE;
230fa253 Christian Borntraeger 2014-11-25  183  }
d976441f Andrey Ryabinin   2015-10-19  184  

:: The code at line 182 was first introduced by commit
:: d976441f44bc5d48635d081d277aa76556ffbf8b compiler, atomics, kasan: 
Provide READ_ONCE_NOCHECK()

:: TO: Andrey Ryabinin 
:: CC: Ingo Molnar 

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip



Re: [PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd

2018-11-09 Thread Joel Fernandes
On Fri, Nov 09, 2018 at 08:02:14PM +, Michael Tirado wrote:
[...]
> > > That aside: I wonder whether a better API would be something that
> > > allows you to create a new readonly file descriptor, instead of
> > > fiddling with the writability of an existing fd.
> >
> > Every now and then I try to write a patch to prevent using proc to reopen
> > a file with greater permission than the original open.
> >
> > I like your idea to have a clean way to reopen a memfd with reduced
> > permissions. But I would make it a syscall instead and maybe make it only
> > work for memfd at first.  And the proc issue would need to be fixed, too.
> 
> IMO the best solution would handle the issue at memfd creation time by
> removing the race condition.

I agree, this is another idea I'm exploring. We could add a new .open
callback to shmem_file_operations and check for seals there.
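
A minimal sketch of that idea, assuming the existing info->seals field in
mm/shmem.c and the F_SEAL_FUTURE_WRITE bit added by this series (the callback
below is illustrative only, not the posted patch):

	static int shmem_open(struct inode *inode, struct file *file)
	{
		struct shmem_inode_info *info = SHMEM_I(inode);

		/* Once future writes are sealed, refuse to hand out another
		 * write-capable struct file (e.g. via /proc/self/fd/N). */
		if ((info->seals & F_SEAL_FUTURE_WRITE) &&
		    (file->f_mode & FMODE_WRITE))
			return -EPERM;

		return 0;
	}

It would then be wired up as the .open member of shmem_file_operations.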

thanks,

 - Joel



Re: ODEBUG: Out of memory. ODEBUG disabled

2018-11-09 Thread Qian Cai
> Sent: Friday, November 09, 2018 at 5:08 PM
> From: "Waiman Long" 
> To: "Qian Cai" , "Yang Shi" 
> Cc: "open list" , "Thomas Gleixner" 
> , "Arnd Bergmann" , "Joel Fernandes 
> (Google)" , "Zhong Jiang" 
> Subject: Re: ODEBUG: Out of memory. ODEBUG disabled
> 
> On 11/09/2018 04:51 PM, Qian Cai wrote:
>> 
>>> On Nov 9, 2018, at 4:42 PM, Yang Shi  wrote:
>>> 
>>> 
>>> 
>>> On 11/9/18 1:36 PM, Qian Cai wrote:
 It is a bit annoying that on this aarch64 server with 64 CPUs, booting
 the latest mainline (3541833fd1f2) causes object debugging to always run
 out of memory.
>>> May you please paste the detail failure log?
>> I assume you mean dmesg.
>> 
>> Here is the dmesg for 64 CPUs,
>> https://paste.ubuntu.com/p/BnhvXXhn7k/
 I have to boot the kernel with only 16 CPUs instead (nr_cpus=16)
 to make it work. Is it expected that object debugging is not going
 to work with large machines?
>>> I don't think so. I suppose it works well with a large CPU count on x86.
>> Here is the one with nr_cpus workaround,
>> https://paste.ubuntu.com/p/qMpd2CCPSV/
> 
> The debugobjects code has a set of 1024 statically allocated debug
> objects that can be used in early boot before the slab memory allocator
> is initialized. Apparently, the system may have used up all the
> statically allocated objects. Try doubling ODEBUG_POOL_SIZE to see if it
> helps.
Great, you are right. Doubling the size makes it work. Does it make sense
to have a kconfig option instead?
> 
> There are also quite a number of warnings in your console log. So there
> is certainly something wrong with your kernel or config options.
Yes, I am working on all those warnings. This one is found by ODEBUG,
https://lkml.org/lkml/2018/11/10/136
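
For context, the statically allocated early-boot pool being discussed is sized
by ODEBUG_POOL_SIZE in lib/debugobjects.c; roughly (abridged from the v4.20-era
source, shown only as a reference sketch):

	/* Fixed pool used before the slab allocator is up; "doubling" it
	 * means bumping this constant, e.g. 1024 -> 2048. */
	#define ODEBUG_POOL_SIZE	1024
	#define ODEBUG_POOL_MIN_LEVEL	256

	static struct debug_obj	obj_static_pool[ODEBUG_POOL_SIZE] __initdata;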



[PATCH v9 0/2] Add support for LPASS clock controller for SDM845

2018-11-09 Thread Taniya Das
 [v9]
  * Update GCC documentation binding with the protected-clocks list.
  * Update the GCC code to add the GCC lpass clocks.
  * This depends on the acceptance of
  https://lore.kernel.org/lkml/20181105194011.43770-1-swb...@chromium.org/

 [v8]
  * Add CLK_IS_CRITICAL for GCC lpass clocks for lpass clocks access to go
  through always.

 [v7]
  * Cleanup header file inclusions.
  * Move the comments along with the flags.
  * Update the commit with details for CLK_IGNORE_UNUSED.

 [v6]
  * Update the logic to register the lpass clocks when the device tree property
   is not present.
  * Add the CLK_IGNORE_UNUSED flag for the lpass clocks to not gate the clocks
   at late_init.

 [v5]
  * Address the comments in device tree binding to update the reg-names,
update the unit address in lpass clock node example and also
add reg property for the gcc clock node.
  * Update the lpass driver to take care of the reg-names.

 [v4]
  * Update the description in GCC Documentation binding for
  'qcom,lpass-protected'.
  * Remove 'qcom,lpass-protected' from LPASS Documentation binding.
  * Update KConfig to use Low Power Audio Subsystem.
  * Add module_exit() and also update return value for
devm_ioremap_resource failure.

 [v3]
  * Add a device tree property to identify lpass protected GCC clocks.
  * Update the GCC driver code to register the lpass clocks when the flag is
   defined.
  * Add comment for clocks using the BRANCH_HALT_SKIP flag.
  * Use platform APIs instead of of_address_to_resource.
  * Replace devm_ioremap with devm_ioremap_resource.
  * Use fixed index for 'lpass_cc' & 'lpass_qdsp6ss' in probe.

 [v2]
  * Make gcc_lpass_sway_clk static.
  * Remove using child nodes and use reg-names to differentiate various
domains of LPASS CC.

Add support for the lpass clock controller found on SDM845 based devices.
This would allow lpass peripheral loader drivers to control the clocks to
bring the subsystem out of reset.

Taniya Das (2):
  dt-bindings: clock: Introduce QCOM LPASS clock bindings
  clk: qcom: Add lpass clock controller driver for SDM845

 .../devicetree/bindings/clock/qcom,gcc.txt |  16 ++
 .../devicetree/bindings/clock/qcom,lpasscc.txt |  26 +++
 drivers/clk/qcom/Kconfig   |   9 +
 drivers/clk/qcom/Makefile  |   1 +
 drivers/clk/qcom/gcc-sdm845.c  |  30 
 drivers/clk/qcom/lpasscc-sdm845.c  | 192 +
 include/dt-bindings/clock/qcom,gcc-sdm845.h|   2 +
 include/dt-bindings/clock/qcom,lpass-sdm845.h  |  16 ++
 8 files changed, 292 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/clock/qcom,lpasscc.txt
 create mode 100644 drivers/clk/qcom/lpasscc-sdm845.c
 create mode 100644 include/dt-bindings/clock/qcom,lpass-sdm845.h

--
Qualcomm INDIA, on behalf of Qualcomm Innovation Center, Inc.is a member
of the Code Aurora Forum, hosted by the  Linux Foundation.



[PATCH v9 1/2] dt-bindings: clock: Introduce QCOM LPASS clock bindings

2018-11-09 Thread Taniya Das
Add device tree bindings for Low Power Audio subsystem clock controller for
Qualcomm Technology Inc's SDM845 SoCs.

Signed-off-by: Taniya Das 
---
 .../devicetree/bindings/clock/qcom,gcc.txt | 16 +
 .../devicetree/bindings/clock/qcom,lpasscc.txt | 26 ++
 include/dt-bindings/clock/qcom,gcc-sdm845.h|  2 ++
 include/dt-bindings/clock/qcom,lpass-sdm845.h  | 16 +
 4 files changed, 60 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/clock/qcom,lpasscc.txt
 create mode 100644 include/dt-bindings/clock/qcom,lpass-sdm845.h

diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc.txt 
b/Documentation/devicetree/bindings/clock/qcom,gcc.txt
index 52d9345..8661c3c 100644
--- a/Documentation/devicetree/bindings/clock/qcom,gcc.txt
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc.txt
@@ -35,6 +35,8 @@ be part of GCC and hence the TSENS properties can also be
 part of the GCC/clock-controller node.
 For more details on the TSENS properties please refer
 Documentation/devicetree/bindings/thermal/qcom-tsens.txt
+- protected-clocks : Protected clock specifier list as per common clock
+ binding.

 Example:
clock-controller@90 {
@@ -55,3 +57,17 @@ Example of GCC with TSENS properties:
#reset-cells = <1>;
#thermal-sensor-cells = <1>;
};
+
+Example of GCC with protected-clocks properties:
+   clock-controller@10 {
+   compatible = "qcom,gcc-sdm845";
+   reg = <0x10 0x1f>;
+   #clock-cells = <1>;
+   #reset-cells = <1>;
+   #power-domain-cells = <1>;
+   protected-clocks = ,
+  ,
+  ,
+  ,
+  ;
+   };
diff --git a/Documentation/devicetree/bindings/clock/qcom,lpasscc.txt 
b/Documentation/devicetree/bindings/clock/qcom,lpasscc.txt
new file mode 100644
index 000..b9e9787
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,lpasscc.txt
@@ -0,0 +1,26 @@
+Qualcomm LPASS Clock Controller Binding
+---
+
+Required properties :
+- compatible   : shall contain "qcom,sdm845-lpasscc"
+- #clock-cells : from common clock binding, shall contain 1.
+- reg  : shall contain base register address and size,
+ in the order
+   Index-0 maps to LPASS_CC register region
+   Index-1 maps to LPASS_QDSP6SS register region
+
+Optional properties :
+- reg-names: register names of LPASS domain
+"cc", "qdsp6ss".
+
+Example:
+
+The below node has to be defined in the cases where the LPASS peripheral loader
+would bring the subsystem out of reset.
+
+   lpasscc: clock-controller@17014000 {
+   compatible = "qcom,sdm845-lpasscc";
+   reg = <0x17014000 0x1f004>, <0x1730 0x200>;
+   reg-names = "cc", "qdsp6ss";
+   #clock-cells = <1>;
+   };
diff --git a/include/dt-bindings/clock/qcom,gcc-sdm845.h 
b/include/dt-bindings/clock/qcom,gcc-sdm845.h
index b8eae5a..968fa65 100644
--- a/include/dt-bindings/clock/qcom,gcc-sdm845.h
+++ b/include/dt-bindings/clock/qcom,gcc-sdm845.h
@@ -197,6 +197,8 @@
 #define GCC_QSPI_CORE_CLK_SRC  187
 #define GCC_QSPI_CORE_CLK  188
 #define GCC_QSPI_CNOC_PERIPH_AHB_CLK   189
+#define GCC_LPASS_Q6_AXI_CLK   190
+#define GCC_LPASS_SWAY_CLK 191

 /* GCC Resets */
 #define GCC_MMSS_BCR   0
diff --git a/include/dt-bindings/clock/qcom,lpass-sdm845.h 
b/include/dt-bindings/clock/qcom,lpass-sdm845.h
new file mode 100644
index 000..015968e
--- /dev/null
+++ b/include/dt-bindings/clock/qcom,lpass-sdm845.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2018, The Linux Foundation. All rights reserved.
+ */
+
+#ifndef _DT_BINDINGS_CLK_SDM_LPASS_SDM845_H
+#define _DT_BINDINGS_CLK_SDM_LPASS_SDM845_H
+
+#define LPASS_AUDIO_WRAPPER_AON_CLK0
+#define LPASS_Q6SS_AHBM_AON_CLK1
+#define LPASS_Q6SS_AHBS_AON_CLK2
+#define LPASS_QDSP6SS_XO_CLK   3
+#define LPASS_QDSP6SS_SLEEP_CLK4
+#define LPASS_QDSP6SS_CORE_CLK 5
+
+#endif
--
Qualcomm INDIA, on behalf of Qualcomm Innovation Center, Inc.is a member
of the Code Aurora Forum, hosted by the  Linux Foundation.



[PATCH v9 2/2] clk: qcom: Add lpass clock controller driver for SDM845

2018-11-09 Thread Taniya Das
Add support for the lpass clock controller found on SDM845 based devices.
This would allow lpass peripheral loader drivers to control the clocks to
bring the subsystem out of reset.
LPASS clocks present on the global clock controller would be registered
with the clock framework based on the protected-clock flag. Also do not
gate these clocks if they are left unused, as the lpass clocks require
the global clock controller lpass clocks to be enabled before they are
accessed. Mark the GCC lpass clocks as CRITICAL, for the LPASS clock
access.

Signed-off-by: Taniya Das 
---
 drivers/clk/qcom/Kconfig  |   9 ++
 drivers/clk/qcom/Makefile |   1 +
 drivers/clk/qcom/gcc-sdm845.c |  30 ++
 drivers/clk/qcom/lpasscc-sdm845.c | 192 ++
 4 files changed, 232 insertions(+)
 create mode 100644 drivers/clk/qcom/lpasscc-sdm845.c

diff --git a/drivers/clk/qcom/Kconfig b/drivers/clk/qcom/Kconfig
index a611531..23adc4c 100644
--- a/drivers/clk/qcom/Kconfig
+++ b/drivers/clk/qcom/Kconfig
@@ -293,6 +293,15 @@ config SDM_DISPCC_845
  Say Y if you want to support display devices and functionality such as
  splash screen.

+config SDM_LPASSCC_845
+   tristate "SDM845 Low Power Audio Subsystem (LPAAS) Clock Controller"
+   depends on COMMON_CLK_QCOM
+   select SDM_GCC_845
+   help
+ Support for the LPASS clock controller on SDM845 devices.
+ Say Y if you want to use the LPASS branch clocks of the LPASS clock
+ controller to reset the LPASS subsystem.
+
 config SPMI_PMIC_CLKDIV
tristate "SPMI PMIC clkdiv Support"
depends on (COMMON_CLK_QCOM && SPMI) || COMPILE_TEST
diff --git a/drivers/clk/qcom/Makefile b/drivers/clk/qcom/Makefile
index 981882e..3d530b1 100644
--- a/drivers/clk/qcom/Makefile
+++ b/drivers/clk/qcom/Makefile
@@ -46,6 +46,7 @@ obj-$(CONFIG_SDM_CAMCC_845) += camcc-sdm845.o
 obj-$(CONFIG_SDM_DISPCC_845) += dispcc-sdm845.o
 obj-$(CONFIG_SDM_GCC_660) += gcc-sdm660.o
 obj-$(CONFIG_SDM_GCC_845) += gcc-sdm845.o
+obj-$(CONFIG_SDM_LPASSCC_845) += lpasscc-sdm845.o
 obj-$(CONFIG_SDM_VIDEOCC_845) += videocc-sdm845.o
 obj-$(CONFIG_SPMI_PMIC_CLKDIV) += clk-spmi-pmic-div.o
 obj-$(CONFIG_KPSS_XCC) += kpss-xcc.o
diff --git a/drivers/clk/qcom/gcc-sdm845.c b/drivers/clk/qcom/gcc-sdm845.c
index f133b7f..ba8ff99 100644
--- a/drivers/clk/qcom/gcc-sdm845.c
+++ b/drivers/clk/qcom/gcc-sdm845.c
@@ -3153,6 +3153,34 @@ enum {
},
 };

+static struct clk_branch gcc_lpass_q6_axi_clk = {
+   .halt_reg = 0x47000,
+   .halt_check = BRANCH_HALT,
+   .clkr = {
+   .enable_reg = 0x47000,
+   .enable_mask = BIT(0),
+   .hw.init = &(struct clk_init_data){
+   .name = "gcc_lpass_q6_axi_clk",
+   .flags = CLK_IS_CRITICAL,
+   .ops = &clk_branch2_ops,
+   },
+   },
+};
+
+static struct clk_branch gcc_lpass_sway_clk = {
+   .halt_reg = 0x47008,
+   .halt_check = BRANCH_HALT,
+   .clkr = {
+   .enable_reg = 0x47008,
+   .enable_mask = BIT(0),
+   .hw.init = &(struct clk_init_data){
+   .name = "gcc_lpass_sway_clk",
+   .flags = CLK_IS_CRITICAL,
+   .ops = &clk_branch2_ops,
+   },
+   },
+};
+
 static struct gdsc pcie_0_gdsc = {
.gdscr = 0x6b004,
.pd = {
@@ -3453,6 +3481,8 @@ enum {
[GCC_QSPI_CORE_CLK_SRC] = &gcc_qspi_core_clk_src.clkr,
[GCC_QSPI_CORE_CLK] = &gcc_qspi_core_clk.clkr,
[GCC_QSPI_CNOC_PERIPH_AHB_CLK] = &gcc_qspi_cnoc_periph_ahb_clk.clkr,
+   [GCC_LPASS_Q6_AXI_CLK] = &gcc_lpass_q6_axi_clk.clkr,
+   [GCC_LPASS_SWAY_CLK] = &gcc_lpass_sway_clk.clkr,
 };

 static const struct qcom_reset_map gcc_sdm845_resets[] = {
diff --git a/drivers/clk/qcom/lpasscc-sdm845.c 
b/drivers/clk/qcom/lpasscc-sdm845.c
new file mode 100644
index 000..2ef7f2a
--- /dev/null
+++ b/drivers/clk/qcom/lpasscc-sdm845.c
@@ -0,0 +1,192 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2018, The Linux Foundation. All rights reserved.
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#include "clk-regmap.h"
+#include "clk-branch.h"
+#include "common.h"
+
+static struct clk_branch lpass_audio_wrapper_aon_clk = {
+   .halt_reg = 0x098,
+   .halt_check = BRANCH_VOTED,
+   .clkr = {
+   .enable_reg = 0x098,
+   .enable_mask = BIT(0),
+   .hw.init = &(struct clk_init_data){
+   .name = "lpass_audio_wrapper_aon_clk",
+   .ops = &clk_branch2_ops,
+   },
+   },
+};
+
+static struct clk_branch lpass_q6ss_ahbm_aon_clk = {
+   .halt_reg = 0x12000,
+   .halt_check = BRANCH_VOTED,
+   .clkr = {
+   .enable_reg = 0x12000,
+   .enable_mask = BIT(0),
+   .hw.init = &(struct clk_init_data){
+   .name = 


Re: [PATCH v3 1/4] x86/mm: declare check_la57_support() as inline

2018-11-09 Thread Changbin Du
On Sat, Nov 10, 2018 at 01:27:29AM +0900, Masahiro Yamada wrote:
> On Fri, Nov 2, 2018 at 8:28 PM Borislav Petkov  wrote:
> >
> > On Mon, Oct 29, 2018 at 10:04:46PM +0900, Masahiro Yamada wrote:
> > > On Mon, Oct 29, 2018 at 3:09 AM Steven Rostedt  
> > > wrote:
> > > >
> > > > On Sun, 28 Oct 2018 13:09:42 +
> > > > Changbin Du  wrote:
> > > >
> > > > > The level4_kernel_pgt is only defined when X86_5LEVEL is enabled.
> > > > > So declare check_la57_support() as inline to make sure the code
> > > > > referring to level4_kernel_pgt is optimized out. This is a preparation
> > > > > for CONFIG_CC_OPTIMIZE_FOR_DEBUGGING.
> > > > >
> > > > > Signed-off-by: Changbin Du 
> > > >
> > > > Reviewed-by: Steven Rostedt (VMware) 
> > > >
> > > > -- Steve
> > > >
> > >
> > > Applied to linux-kbuild.
> >
> > Next time, before you route a patch through your tree, please get an x86
> > maintainer's ACK first.
> 
> 
> I changed my mind, and did not include this in my previous pull request
> because this series is causing many warnings.
>
Most of the new warnings are false positives, since this series didn't change
any code where the warnings happen. I'll make patches to make gcc happy.

> Now, it is aiming for v5.0-rc1
> 
> So, it is not too late.
> 
> If x86 maintainers issue Acked-by,
> I am happy to add it.
> 
> 
> 
> 
> > Thx.
> >
> > --
> > Regards/Gruss,
> > Boris.
> >
> > Good mailing practices for 400: avoid top-posting and trim the reply.
> 
> 
> 
> -- 
> Best Regards
> Masahiro Yamada

-- 
Thanks,
Changbin Du



[PATCH v9 0/4] KASLR feature to randomize each loadable module

2018-11-09 Thread Rick Edgecombe
This is V9 of the "KASLR feature to randomize each loadable module" patchset.
The purpose is to increase the randomization for the module space from 10 to 17
bits, and also to make the modules randomized in relation to each other instead
of just the address where the allocations begin, so that if one module leaks the
location of the others can't be inferred.

Why its useful
==
Randomizing the location of executable code is a defense against control flow
attacks, where the kernel is tricked into jumping, or speculatively executing
code other than what is intended. By randomizing the location of the code, the
attacker doesn't know where to redirect the control flow.

Today the RANDOMIZE_BASE feature randomizes the base address where the module
allocations begin with 10 bits of entropy for this purpose. From here, a highly
deterministic algorithm allocates space for the modules as they are loaded and
unloaded. If an attacker can predict the order and identities for modules that
will be loaded (either by the system, or controlled by the user with
request_module or BPF), then a single text address leak can give the attacker
access to the locations of other modules. So in this case this new algorithm can
take the entropy of the other modules from ~0 to 17, making it much more robust.

Another problem today is that the low 10 bits of entropy makes brute force
attacks feasible, especially in the case of speculative execution where a wrong
guess won't necessarily cause a crash. In this case, increasing the
randomization will force attacks to take longer, and so increase the time an
attacker may be detected on a system.

In the past KASLR has been considered mostly a remote defense, due to available
methods of de-randomizing the kernel text locally, but previous easier local
de-randomizing methods have been blocked by KPTI.

There are multiple efforts to apply more randomization to the core kernel text
as well, and so this module space piece can be a first step to increasing
randomization for all kernel space executable code.

Userspace ASLR can get 28 bits of entropy or more, so at least increasing this
to 17 for now improves what is currently a pretty low amount of randomization
for the higher privileged kernel space.

How it works

The algorithm is pretty simple. It just breaks the module space in two, a random
area (2/3 of module space) and a backup area (1/3 of module space). It first
tries to allocate up to 1 randomly located starting pages inside the random
section. If this fails, it will allocate in the backup area. The backup area
base will be offset in the same way as the current algorithm does for the base area,
which has 10 bits of entropy.

The vmalloc allocator can be used to try an allocation at a specific address;
however, it is usually used to try an allocation over a large address range, and
so some behaviors which are non-issues in normal usage can be sub-optimal
when repeatedly trying allocations at small, specific ranges. So this patch also
includes a new vmalloc function __vmalloc_node_try_addr and some other vmalloc
tweaks that allow for more efficient trying of addresses.
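
To make the flow concrete, here is a condensed sketch of the strategy described
above. Only __vmalloc_node_try_addr() and __vmalloc_node_range() are taken from
this series; the function itself and the MAX_RANDOM_TRIES/area variables are
hypothetical placeholders, not the actual patch code.

	static void *module_alloc_randomized(unsigned long size)
	{
		unsigned long addr;
		void *p;
		int i;

		/* Try randomly chosen, page-aligned start addresses in the
		 * "random" 2/3 of the module area. */
		for (i = 0; i < MAX_RANDOM_TRIES; i++) {
			addr = rand_area_start +
			       (get_random_long() % rand_area_pages) * PAGE_SIZE;
			p = __vmalloc_node_try_addr(addr, size, GFP_KERNEL,
						    PAGE_KERNEL_EXEC, 0,
						    NUMA_NO_NODE,
						    __builtin_return_address(0));
			if (p)
				return p;
		}

		/* Fall back to the backup 1/3, which behaves like the existing
		 * allocator (base offset randomized with ~10 bits of entropy). */
		return __vmalloc_node_range(size, MODULE_ALIGN,
					    backup_area_start, backup_area_end,
					    GFP_KERNEL, PAGE_KERNEL_EXEC, 0,
					    NUMA_NO_NODE,
					    __builtin_return_address(0));
	}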

This algorithm targets maintaining high entropy for many 1000's of module
allocations. This is because there are other users of the module space besides
kernel modules, like eBPF JIT, classic BPF socket filter JIT and kprobes.

Performance
===
Simulations were run using module sizes derived from the x86_64 modules to
measure the allocation performance at various levels of fragmentation and
whether the backup area was used.

Capacity

There is a slight reduction in the capacity of modules as simulated by the
x86_64 module sizes of <1000. Note this is a worst case, since in practice
module allocations in the 1000's will consist of smaller BPF JIT allocations or
kprobes which would fit better in the random area.

Allocation time
---
Below are three sets of measurements in ns of the allocation time as measured by
the included kselftests. The first two columns are this new algorithm with and
with out the vmalloc optimizations for trying random addresses quickly. They are
included for consideration of whether the changes are worth it. The last column
is the performance of the original algorithm.

Modules  Vmalloc optimization  No Vmalloc Optimization  Existing Module KASLR
1000     1433                  1993                     3821
2000     2295                  3681                     7830
3000     4424                  7450                     13012
4000     7746                  13824                    18106
5000     12721                 21852                    22572
6000     19724                 33926                    26443
7000     27638                 47427                    30473
8000     37745                 64443                    34200

These allocations are not taking very long, but it may show up on systems with
very high usage of the module space (BPF JITs). 

[PATCH v9 1/4] vmalloc: Add __vmalloc_node_try_addr function

2018-11-09 Thread Rick Edgecombe
Create __vmalloc_node_try_addr function that tries to allocate at a specific
address without triggering any lazy purging and retry. For the randomized
allocator that uses this function, failing to allocate at a specific address is
a lot more common. This function will not try to do any lazy purge and retry,
to try to fail faster when an allocation won't fit at a specific address. This
function is used for a case where lazy free areas are unlikely and so the purge
and retry is just extra work done every time. For the randomized module
loader, the performance for an average allocation in ns for different numbers
of modules was:

Modules  Vmalloc optimization  No Vmalloc Optimization
1000     1433                  1993
2000     2295                  3681
3000     4424                  7450
4000     7746                  13824
5000     12721                 21852
6000     19724                 33926
7000     27638                 47427
8000     37745                 64443

In order to support this behavior a try_addr argument was plugged into several
of the static helpers.

This also changes logic in __get_vm_area_node to be faster in cases where
allocations fail due to no space, which is a lot more common when trying
specific addresses.

Signed-off-by: Rick Edgecombe 
---
 include/linux/vmalloc.h |   3 +
 mm/vmalloc.c| 128 +---
 2 files changed, 95 insertions(+), 36 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 398e9c95cd61..6eaa89612372 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -82,6 +82,9 @@ extern void *__vmalloc_node_range(unsigned long size, 
unsigned long align,
unsigned long start, unsigned long end, gfp_t gfp_mask,
pgprot_t prot, unsigned long vm_flags, int node,
const void *caller);
+extern void *__vmalloc_node_try_addr(unsigned long addr, unsigned long size,
+   gfp_t gfp_mask, pgprot_t prot, unsigned long vm_flags,
+   int node, const void *caller);
 #ifndef CONFIG_MMU
 extern void *__vmalloc_node_flags(unsigned long size, int node, gfp_t flags);
 static inline void *__vmalloc_node_flags_caller(unsigned long size, int node,
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 97d4b25d0373..b8b34d319c85 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -326,6 +326,9 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
 #define VM_LAZY_FREE   0x02
 #define VM_VM_AREA 0x04
 
+#define VMAP_MAY_PURGE 0x2
+#define VMAP_NO_PURGE  0x1
+
 static DEFINE_SPINLOCK(vmap_area_lock);
 /* Export for kexec only */
 LIST_HEAD(vmap_area_list);
@@ -402,12 +405,12 @@ static BLOCKING_NOTIFIER_HEAD(vmap_notify_list);
 static struct vmap_area *alloc_vmap_area(unsigned long size,
unsigned long align,
unsigned long vstart, unsigned long vend,
-   int node, gfp_t gfp_mask)
+   int node, gfp_t gfp_mask, int try_purge)
 {
struct vmap_area *va;
struct rb_node *n;
unsigned long addr;
-   int purged = 0;
+   int purged = try_purge & VMAP_NO_PURGE;
struct vmap_area *first;
 
BUG_ON(!size);
@@ -860,7 +863,7 @@ static void *new_vmap_block(unsigned int order, gfp_t 
gfp_mask)
 
va = alloc_vmap_area(VMAP_BLOCK_SIZE, VMAP_BLOCK_SIZE,
VMALLOC_START, VMALLOC_END,
-   node, gfp_mask);
+   node, gfp_mask, VMAP_MAY_PURGE);
if (IS_ERR(va)) {
kfree(vb);
return ERR_CAST(va);
@@ -1170,8 +1173,9 @@ void *vm_map_ram(struct page **pages, unsigned int count, 
int node, pgprot_t pro
addr = (unsigned long)mem;
} else {
struct vmap_area *va;
-   va = alloc_vmap_area(size, PAGE_SIZE,
-   VMALLOC_START, VMALLOC_END, node, GFP_KERNEL);
+   va = alloc_vmap_area(size, PAGE_SIZE, VMALLOC_START,
+   VMALLOC_END, node, GFP_KERNEL,
+   VMAP_MAY_PURGE);
if (IS_ERR(va))
return NULL;
 
@@ -1372,7 +1376,8 @@ static void clear_vm_uninitialized_flag(struct vm_struct 
*vm)
 
 static struct vm_struct *__get_vm_area_node(unsigned long size,
unsigned long align, unsigned long flags, unsigned long start,
-   unsigned long end, int node, gfp_t gfp_mask, const void *caller)
+   unsigned long end, int node, gfp_t gfp_mask, int try_purge,
+   const void *caller)
 {
struct vmap_area *va;
struct vm_struct *area;
@@ -1386,16 +1391,17 @@ static struct vm_struct *__get_vm_area_node(unsigned 
long size,
align = 1ul << clamp_t(int, get_count_order_long(size),
   

[PATCH v9 3/4] vmalloc: Add debugfs modfraginfo

2018-11-09 Thread Rick Edgecombe
Add debugfs file "modfraginfo" for providing info on module space fragmentation.
This can be used for determining if loadable module randomization is causing any
problems for extreme module loading situations, like huge numbers of modules or
extremely large modules.

Sample output when KASLR is enabled and X86_64 is configured:
Largest free space: 897912 kB
  Total free space: 1025424 kB
Allocations in backup area: 0

Sample output when just X86_64:
Largest free space: 897912 kB
  Total free space: 1025424 kB
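
As a usage illustration (not part of the patch), a trivial userspace reader is
enough to collect these numbers on a test box, assuming debugfs is mounted at
the usual /sys/kernel/debug location:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/sys/kernel/debug/modfraginfo", "r");

	if (!f) {
		perror("modfraginfo");	/* needs debugfs mounted and root */
		return EXIT_FAILURE;
	}
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
	return EXIT_SUCCESS;
}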

Signed-off-by: Rick Edgecombe 
Reviewed-by: Kees Cook 
---
 mm/vmalloc.c | 100 +--
 1 file changed, 98 insertions(+), 2 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index b8b34d319c85..63894cb50873 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -36,6 +37,12 @@
 #include 
 #include 
 
+#ifdef CONFIG_X86
+#include 
+#include 
+#include 
+#endif
+
 #include "internal.h"
 
 struct vfree_deferred {
@@ -2415,7 +2422,6 @@ void free_vm_area(struct vm_struct *area)
 }
 EXPORT_SYMBOL_GPL(free_vm_area);
 
-#ifdef CONFIG_SMP
 static struct vmap_area *node_to_va(struct rb_node *n)
 {
return rb_entry_safe(n, struct vmap_area, rb_node);
@@ -2463,6 +2469,7 @@ static bool pvm_find_next_prev(unsigned long end,
return true;
 }
 
+#ifdef CONFIG_SMP
 /**
  * pvm_determine_end - find the highest aligned address between two vmap_areas
  * @pnext: in/out arg for the next vmap_area
@@ -2804,7 +2811,96 @@ static int __init proc_vmalloc_init(void)
proc_create_seq("vmallocinfo", 0400, NULL, &vmalloc_op);
return 0;
 }
-module_init(proc_vmalloc_init);
+#elif defined(CONFIG_DEBUG_FS)
+static int __init proc_vmalloc_init(void)
+{
+   return 0;
+}
+#endif
+
+#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_RANDOMIZE_FINE_MODULE)
+static inline unsigned long is_in_backup(unsigned long addr)
+{
+   return addr >= MODULES_VADDR + get_modules_rand_len();
+}
+
+static int modulefraginfo_debug_show(struct seq_file *m, void *v)
+{
+   unsigned long last_end = MODULES_VADDR;
+   unsigned long total_free = 0;
+   unsigned long largest_free = 0;
+   unsigned long backup_cnt = 0;
+   unsigned long gap;
+   struct vmap_area *prev, *cur = NULL;
+
+   spin_lock(&vmap_area_lock);
+
+   if (!pvm_find_next_prev(MODULES_VADDR, &cur, &prev) || !cur)
+   goto done;
+
+   for (; cur->va_end <= MODULES_END; cur = list_next_entry(cur, list)) {
+   /* Don't count areas that are marked to be lazily freed */
+   if (!(cur->flags & VM_LAZY_FREE)) {
+   if (kaslr_mod_randomize_each_module())
+   backup_cnt += is_in_backup(cur->va_start);
+   gap = cur->va_start - last_end;
+   if (gap > largest_free)
+   largest_free = gap;
+   total_free += gap;
+   last_end = cur->va_end;
+   }
+
+   if (list_is_last(&cur->list, &vmap_area_list))
+   break;
+   }
+
+done:
+   gap = (MODULES_END - last_end);
+   if (gap > largest_free)
+   largest_free = gap;
+   total_free += gap;
 
+   spin_unlock(&vmap_area_lock);
+
+   seq_printf(m, "\tLargest free space:\t%lu kB\n", largest_free / 1024);
+   seq_printf(m, "\t  Total free space:\t%lu kB\n", total_free / 1024);
+
+   if (kaslr_mod_randomize_each_module())
+   seq_printf(m, "Allocations in backup area:\t%lu\n", backup_cnt);
+
+   return 0;
+}
+
+static int proc_module_frag_debug_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, modulefraginfo_debug_show, NULL);
+}
+
+static const struct file_operations debug_module_frag_operations = {
+   .open   = proc_module_frag_debug_open,
+   .read   = seq_read,
+   .llseek = seq_lseek,
+   .release= single_release,
+};
+
+static void __init debug_modfrag_init(void)
+{
+   debugfs_create_file("modfraginfo", 0400, NULL, NULL,
+   &debug_module_frag_operations);
+}
+#elif defined(CONFIG_DEBUG_FS) || defined(CONFIG_PROC_FS)
+static void __init debug_modfrag_init(void)
+{
+}
 #endif
 
+#if defined(CONFIG_DEBUG_FS) || defined(CONFIG_PROC_FS)
+static int __init info_vmalloc_init(void)
+{
+   proc_vmalloc_init();
+   debug_modfrag_init();
+   return 0;
+}
+
+module_init(info_vmalloc_init);
+#endif
-- 
2.17.1



Re: [PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd

2018-11-09 Thread Joel Fernandes
On Fri, Nov 09, 2018 at 03:14:02PM -0800, Andy Lutomirski wrote:
>  That aside: I wonder whether a better API would be something that
>  allows you to create a new readonly file descriptor, instead of
>  fiddling with the writability of an existing fd.
> >>> 
> >>> That doesn't work, unfortunately. The ashmem API we're replacing with
> >>> memfd requires file descriptor continuity. I also looked into opening
> >>> a new FD and dup2(2)ing atop the old one, but this approach doesn't
> >>> work in the case that the old FD has already leaked to some other
> >>> context (e.g., another dup, SCM_RIGHTS). See
> >>> https://developer.android.com/ndk/reference/group/memory. We can't
> >>> break ASharedMemory_setProt.
> >> 
> >> 
> >> Hmm.  If we fix the general reopen bug, a way to drop write access from
> >> an existing struct file would do what Android needs, right?  I don’t
> >> know if there are general VFS issues with that.
> > 

I don't think there is a way to fix this in /proc/pid/fd. At the proc
level, the /proc/pid/fd/N files are just soft symlinks that follow through to
the actual file. The open is actually done on that inode/file. I think
changing it the way being discussed here means changing the way symlinks work
in Linux.

I think the right way to fix this is at the memfd inode level. I am working
on a follow up patch on top of this patch, and will send that out in a few
days (along with the man page updates).

thanks!

 - Joel
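
For readers following the thread, a rough userspace sketch of the seal being
discussed looks like the following. It is illustrative only: the
F_SEAL_FUTURE_WRITE value is taken from the proposed patch (it may not exist in
older uapi headers, hence the fallback define), and the exact failure mode of
later writes is whatever the final patch settles on.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef F_SEAL_FUTURE_WRITE
#define F_SEAL_FUTURE_WRITE 0x0010	/* assumed value from the proposal */
#endif

int main(void)
{
	const char msg[] = "written before sealing\n";
	int fd = memfd_create("ashmem-like", MFD_ALLOW_SEALING);

	if (fd < 0 || write(fd, msg, sizeof(msg)) < 0)
		return 1;

	/*
	 * Roughly what ASharedMemory_setProt() needs: drop write access on
	 * the same fd without reopening it.
	 */
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_FUTURE_WRITE) < 0)
		perror("F_ADD_SEALS");

	/* New writable mappings should now be refused. */
	if (mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0) ==
	    MAP_FAILED)
		printf("PROT_WRITE mmap rejected after sealing, as expected\n");

	close(fd);
	return 0;
}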


[PATCH v9 2/4] x86/modules: Increase randomization for modules

2018-11-09 Thread Rick Edgecombe
This changes the behavior of the KASLR logic for allocating memory for the text
sections of loadable modules. It randomizes the location of each module text
section with about 17 bits of entropy in typical use. This is enabled on X86_64
only. For 32 bit, the behavior is unchanged.

It refactors the existing code around module randomization somewhat. There are
now three different behaviors for x86 module_alloc() depending on config:
RANDOMIZE_BASE=n; RANDOMIZE_BASE=y with ARCH=x86_64; and RANDOMIZE_BASE=y with
ARCH=i386. The refactor of the existing code aims to show clearly what those
behaviors are without having three separate versions or threading the behaviors
through a bunch of little spots. The reason it is not enabled on 32 bit yet is
that the module space there is much smaller and simulations haven't been run to
see how it performs.

The new algorithm breaks the module space in two, a random area and a backup
area. It first tries to allocate at a number of randomly located starting pages
inside the random section. If this fails, then it will allocate in the backup
area. The backup area base will be offset in the same way as the current
algorithm does for the base area, 1024 possible locations.

Due to boot_params being defined with different types in different places,
placing the config helpers in modules.h or kaslr.h caused conflicts elsewhere,
so they are placed in a new file, kaslr_modules.h, instead.
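
To make the split concrete before the diff, here is an illustrative sketch only
(not the patch's module_alloc) of the two regions, using the helpers this patch
introduces. struct mod_text_area is invented for the example, and the actual
patch additionally applies the existing 1024-position load offset to the backup
base.

#include <linux/mm.h>
#include <asm/kaslr_modules.h>

/* Invented container for the sketch. */
struct mod_text_area {
	unsigned long start;
	unsigned long end;
};

/* Region in which randomly chosen candidate start pages are tried first. */
static struct mod_text_area module_random_area(void)
{
	struct mod_text_area a = {
		.start	= MODULES_VADDR,
		.end	= MODULES_VADDR + get_modules_rand_len(),
	};

	return a;
}

/* Region used when no random candidate fits, or when KASLR is disabled. */
static struct mod_text_area module_backup_area(void)
{
	struct mod_text_area a = {
		.start	= MODULES_VADDR,
		.end	= MODULES_END,
	};

	if (kaslr_mod_randomize_each_module())
		a.start += get_modules_rand_len();

	return a;
}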

Signed-off-by: Rick Edgecombe 
---
 arch/x86/Kconfig|   3 +
 arch/x86/include/asm/kaslr_modules.h|  38 
 arch/x86/include/asm/pgtable_64_types.h |   7 ++
 arch/x86/kernel/module.c| 111 +++-
 4 files changed, 136 insertions(+), 23 deletions(-)
 create mode 100644 arch/x86/include/asm/kaslr_modules.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ba7e3464ee92..db93cde0528a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2144,6 +2144,9 @@ config RANDOMIZE_BASE
 
  If unsure, say Y.
 
+config RANDOMIZE_FINE_MODULE
+   def_bool y if RANDOMIZE_BASE && X86_64 && !CONFIG_UML
+
 # Relocation on x86 needs some additional build support
 config X86_NEED_RELOCS
def_bool y
diff --git a/arch/x86/include/asm/kaslr_modules.h 
b/arch/x86/include/asm/kaslr_modules.h
new file mode 100644
index ..1da6eced4b47
--- /dev/null
+++ b/arch/x86/include/asm/kaslr_modules.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_KASLR_MODULES_H_
+#define _ASM_KASLR_MODULES_H_
+
+#ifdef CONFIG_RANDOMIZE_BASE
+/* kaslr_enabled is not always defined */
+static inline int kaslr_mod_randomize_base(void)
+{
+   return kaslr_enabled();
+}
+#else
+static inline int kaslr_mod_randomize_base(void)
+{
+   return 0;
+}
+#endif /* CONFIG_RANDOMIZE_BASE */
+
+#ifdef CONFIG_RANDOMIZE_FINE_MODULE
+/* kaslr_enabled is not always defined */
+static inline int kaslr_mod_randomize_each_module(void)
+{
+   return kaslr_enabled();
+}
+
+static inline unsigned long get_modules_rand_len(void)
+{
+   return MODULES_RAND_LEN;
+}
+#else
+static inline int kaslr_mod_randomize_each_module(void)
+{
+   return 0;
+}
+
+unsigned long get_modules_rand_len(void);
+#endif /* CONFIG_RANDOMIZE_FINE_MODULE */
+
+#endif
diff --git a/arch/x86/include/asm/pgtable_64_types.h 
b/arch/x86/include/asm/pgtable_64_types.h
index 04edd2d58211..5e26369ab86c 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -143,6 +143,13 @@ extern unsigned int ptrs_per_p4d;
 #define MODULES_END_AC(0xff00, UL)
 #define MODULES_LEN(MODULES_END - MODULES_VADDR)
 
+/*
+ * Dedicate the first part of the module space to a randomized area when KASLR
+ * is in use.  Leave the remaining part for a fallback if we are unable to
+ * allocate in the random area.
+ */
+#define MODULES_RAND_LEN   PAGE_ALIGN((MODULES_LEN/3)*2)
+
 #define ESPFIX_PGD_ENTRY   _AC(-2, UL)
 #define ESPFIX_BASE_ADDR   (ESPFIX_PGD_ENTRY << P4D_SHIFT)
 
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index b052e883dd8c..35cb912ed1f8 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #if 0
 #define DEBUGP(fmt, ...)   \
@@ -48,34 +49,96 @@ do {
\
 } while (0)
 #endif
 
-#ifdef CONFIG_RANDOMIZE_BASE
 static unsigned long module_load_offset;
+static const unsigned long NO_TRY_RAND = 1;
 
 /* Mutex protects the module_load_offset. */
 static DEFINE_MUTEX(module_kaslr_mutex);
 
 static unsigned long int get_module_load_offset(void)
 {
-   if (kaslr_enabled()) {
-   mutex_lock(&module_kaslr_mutex);
-   /*
-* Calculate the module_load_offset the first time this
-* code is called. Once calculated it stays the same until
-* reboot.
-*/

[PATCH v9 4/4] Kselftest for module text allocation benchmarking

2018-11-09 Thread Rick Edgecombe
This adds a test module in lib/, and a script in kselftest that does
benchmarking on the allocation of memory in the module space. Performance here
would have some small impact on kernel module insertions, BPF JIT insertions
and kprobes. In the case of KASLR features for the module space, this module
can be used to measure the allocation performance of different configurations.
This module needs to be compiled into the kernel because module_alloc is not
exported.

With some modification to the code, as explained in the comments, it can be
enabled to measure TLB flushes as well.

There are two tests in the module. One allocates until failure in order to
test module capacity and the other times allocating space in the module area.
They both use module sizes that roughly approximate the distribution of in-tree
X86_64 modules.

You can control the number of modules used in the tests like this:
echo m1000>/dev/mod_alloc_test

Run the test for module capacity like:
echo t1>/dev/mod_alloc_test

The other test will measure the allocation time, and for CONFIG_X86_64 and
CONFIG_RANDOMIZE_BASE, also give data on how often the "backup area" is used.

Run the test for allocation time and backup area usage like:
echo t2>/dev/mod_alloc_test
The output will be something like this:
num     all(ns) last(ns)
1000    1083    1099
Last module in backup count = 0
Total modules in backup = 0
>1 module in backup count   = 0

To run a suite of allocation time tests for a collection of module numbers you 
can run:
tools/testing/selftests/bpf/test_mod_alloc.sh
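
The measurement itself is conceptually simple; the sketch below (not the
module's actual code, names invented) shows the kind of loop the
allocation-time test runs, and why it has to be built in: module_alloc() is not
exported.

#include <linux/ktime.h>
#include <linux/math64.h>
#include <linux/moduleloader.h>
#include <linux/printk.h>
#include <linux/slab.h>

static void time_alloc_batch(const unsigned long *sizes, int n)
{
	void **ptrs;
	u64 total = 0;
	int i;

	if (n <= 0)
		return;

	ptrs = kcalloc(n, sizeof(*ptrs), GFP_KERNEL);
	if (!ptrs)
		return;

	for (i = 0; i < n; i++) {
		u64 t = ktime_get_ns();

		ptrs[i] = module_alloc(sizes[i]);
		total += ktime_get_ns() - t;
	}

	pr_info("%d allocations, avg %llu ns\n", n, div_u64(total, n));

	for (i = 0; i < n; i++)
		if (ptrs[i])
			module_memfree(ptrs[i]);
	kfree(ptrs);
}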

Signed-off-by: Rick Edgecombe 
---
 lib/Kconfig.debug |   9 +
 lib/Makefile  |   1 +
 lib/test_mod_alloc.c  | 375 ++
 tools/testing/selftests/bpf/test_mod_alloc.sh |  29 ++
 4 files changed, 414 insertions(+)
 create mode 100644 lib/test_mod_alloc.c
 create mode 100755 tools/testing/selftests/bpf/test_mod_alloc.sh

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 1af29b8224fd..b590b2bb312f 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1886,6 +1886,15 @@ config TEST_BPF
 
  If unsure, say N.
 
+config TEST_MOD_ALLOC
+   bool "Tests for module allocator/vmalloc"
+   help
+ This builds the "test_mod_alloc" module that performs performance
+ tests on the module text section allocator. The module uses X86_64
+ module text sizes for simulations.
+
+ If unsure, say N.
+
 config FIND_BIT_BENCHMARK
tristate "Test find_bit functions"
help
diff --git a/lib/Makefile b/lib/Makefile
index db06d1237898..c447e07931b0 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -60,6 +60,7 @@ UBSAN_SANITIZE_test_ubsan.o := y
 obj-$(CONFIG_TEST_KSTRTOX) += test-kstrtox.o
 obj-$(CONFIG_TEST_LIST_SORT) += test_list_sort.o
 obj-$(CONFIG_TEST_LKM) += test_module.o
+obj-$(CONFIG_TEST_MOD_ALLOC) += test_mod_alloc.o
 obj-$(CONFIG_TEST_OVERFLOW) += test_overflow.o
 obj-$(CONFIG_TEST_RHASHTABLE) += test_rhashtable.o
 obj-$(CONFIG_TEST_SORT) += test_sort.o
diff --git a/lib/test_mod_alloc.c b/lib/test_mod_alloc.c
new file mode 100644
index ..3a6fb7999df4
--- /dev/null
+++ b/lib/test_mod_alloc.c
@@ -0,0 +1,375 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * This module can be used to test allocation time and randomness.
+ *
+ * To interact with this module, mount debugfs, for example:
+ *  mount -t debugfs none /sys/kernel/debug/
+ *
+ * Then write to the file:
+ *  /sys/kernel/debug/mod_alloc_test
+ *
+ * There are two tests:
+ *  Test 1: Allocate until failure
+ *  Test 2: Run 1000 iterations of a test that simulates loading modules with
+ *  x86_64 module sizes.
+ *
+ * Configure the number (ex:1000) of modules to use per test in the tests:
+ *  echo m1000 > /sys/kernel/debug/mod_alloc_test
+ *
+ * To run test (ex: Test 2):
+ *  echo t2 > /sys/kernel/debug/mod_alloc_test
+ *
+ * For test 1 it will print the results of each test. For test 2 it will print
+ * out statistics for example:
+ * New module count: 1000
+ * Starting 1 iterations of 1000 modules
+ * num all(ns) last(ns)
+ * 1000    1984    2112
+ * Last module in backup count = 0
+ * Total modules in backup = 188
+ * >1 module in backup count   = 7
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct mod { int filesize; int coresize; int initsize; };
+
+/*  Begin optional logging  */
+/*
+ * Note: In order to get an accurate count for the tlb flushes triggered in
+ * vmalloc, create a counter in vmalloc.c with this method signature and export
+ * it. Then replace the below with:
+ *
+ * extern unsigned long get_tlb_flushes_vmalloc(void);
+ */
+static unsigned long get_tlb_flushes_vmalloc(void)
+{
+   return 0;
+}
+
+/*  End optional logging  */
+
+
+#define 

Re: [RFC PATCH] mm: thp: implement THP reservations for anonymous memory

2018-11-09 Thread anthony . yznaga



On 11/09/2018 11:51 AM, Andrea Arcangeli wrote:
> Hello,
>
> On Fri, Nov 09, 2018 at 03:13:18PM +0300, Kirill A. Shutemov wrote:
>> On Thu, Nov 08, 2018 at 10:48:58PM -0800, Anthony Yznaga wrote:
>>> The basic idea as outlined by Mel Gorman in [2] is:
>>>
>>> 1) On first fault in a sufficiently sized range, allocate a huge page
>>>sized and aligned block of base pages.  Map the base page
>>>corresponding to the fault address and hold the rest of the pages in
>>>reserve.
>>> 2) On subsequent faults in the range, map the pages from the reservation.
>>> 3) When enough pages have been mapped, promote the mapped pages and
>>>remaining pages in the reservation to a huge page.
>>> 4) When there is memory pressure, release the unused pages from their
>>>reservations.
>> I haven't yet read the patch in details, but I'm skeptical about the
>> approach in general for few reasons:
>>
>> - PTE page table retracting to replace it with huge PMD entry requires
>>   down_write(mmap_sem). It makes the approach not practical for many
>>   multi-threaded workloads.
>>
>>   I don't see a way to avoid exclusive lock here. I will be glad to
>>   be proved otherwise.
>>
>> - The promotion will also require TLB flush which might be prohibitively
>>   slow on big machines.
>>
>> - Short living processes will fail to benefit from THP with the policy,
>>   even with plenty of free memory in the system: no time to promote to THP
>>   or, with synchronous promotion, cost will overweight the benefit.
>>
>> The goal to reduce memory overhead of THP is admirable, but we need to be
>> careful not to kill THP benefit itself. The approach will reduce number of
>> THP mapped in the system and/or shift their allocation to later stage of
>> process lifetime.
>>
>> The only way I see it can be useful is if it will be possible to apply the
>> policy on per-VMA basis. It will be very useful for malloc()
>> implementations, for instance. But as a global policy it's no-go to me.
> I'm also skeptical about this: the current design is quite
> intentional. It's not a bug but a feature that we're not doing the
> promotion.
>
> Part of the tradeoff with THP is to use more RAM to save CPU, when you
> use less RAM you're inherently already wasting some CPU just for the
> reservation management and you don't get the immediate TLB benefit
> anymore either.
>
> And if you're in the camp that is concerned about the use of more RAM
> or/and about the higher latency of COW faults, I'm afraid the
> intermediate solution will be still slower than the already available
> MADV_NOHUGEPAGE or enabled=madvise.
>
> Apps like redis that will use more RAM during snapshot and that are
> slowed down with THP needs to simply use MADV_NOHUGEPAGE which already
> exists as an madvise from the very first kernel that supported
> THP-anon. Same thing for other apps that use more RAM with THP and
> that are on the losing end of the tradeoff.
>
> Now about the implementation: the whole point of the reservation
> complexity is to skip the khugepaged copy, so it can collapse in
> place. Is skipping the copy worth it? Isn't the big cost the IPI
> anyway to avoid leaving two simultaneous TLB mappings of different
> granularity?
Good questions.  I'll take them into account when measuring performance.
I do wonder about other architectures (e.g. ARM) where the PMD
size may be significantly larger than 2MB.

>
> khugepaged is already tunable to specify a ratio of memory in use to
> avoid wasting memory
> /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none.
>
> If you set max_ptes_none to half the default value, it'll only promote
> pages that are half mapped, reducing the memory waste to 50% of what
> it is by default.
>
> So if you are ok to copy the memory that you promote to THP, you'd
> just need a global THP mode to avoid allocating THP even when they're
> available during the page fault (while still allowing khugepaged to
> collapse hugepages in the background), and then reduce max_ptes_none
> to get the desired promotion ratio.
>
> Doing the copy will avoid the reservation there will be also more THP
> available to use for those khugepaged users without losing them in
> reservations. You won't have to worry about what to do when there's
> memory pressure because you won't have to undo the reservation because
> there was no reservation in the first place. That problem also goes
> away with the copy.
>
> So it sounds like you could achieve a similar runtime behavior with
> much less complexity by reducing max_ptes_none and by doing the copy
> and dropping all reservation code.

These are compelling arguments.  I will be sure to evaluate any
performance data against this alternate implementation/tuning.

Thank you for the comments.

Anthony

>
>> Prove me wrong with performance data. :)
> Same here.
>
> Thanks,
> Andrea



Re: [PATCH v5 1/5] dt-bindings: phy-qcom-qmp: Fix register underspecification

2018-11-09 Thread Stephen Boyd
Quoting Doug Anderson (2018-11-05 08:52:39)
> On Sat, Nov 3, 2018 at 7:40 PM Stephen Boyd  wrote:
> > > +   clocks = <&gcc GCC_USB3_SEC_PHY_PIPE_CLK>;
> > > +   clock-names = "pipe0";
> > > +   clock-output-names = "usb3_uni_phy_pipe_clk_src";
> >
> > If this has clock-output-names then I would expect to see a #clock-cells
> > property, but that isn't here in this node. Are we relying on the same
> > property in the parent node?
> 
> If I had to guess, I believe it's yet more confusing than that.  I
> believe you actually point to the parent phandle if you want to use
> the clock.  I notice that the parent has #clock-cells as 1 so
> presumably this is how you point to one child or the other?  ...but I
> don't think it's documented how this works.

There is a 'clock-ranges' property that almost nobody uses. It might be
usable for this purpose.

> The lane nodes don't have
> any sort of ID as far as I can tell.  ...and in any case having
> #clock-cells of 1 makes no sense for USB 3 PHYs which are supposed to
> only have one child?
> 
> Let's look at the code, maybe?  Hrm, phy_pipe_clk_register() takes no
> ID or anything.  Huh?  OK, so as far as I can tell
> of_clk_add_provider() is never called on this clock...
> 
> So I think the answer is that #clock-cells should be <0> and should
> move to the child node to match with clock-output-names.  Then I guess
> (if anyone references this clock from the device tree rather than
> relying on the global clock-output-names) we should add the
> of_clk_add_provider() into the code?
> 
> Maybe we can add that as a patch to the end of this series?  There are
> so many crazy / random things wrong with these bindings that it makes
> sense to make smaller / incremental changes?
> 

Sure that sounds fine. It would be another case where a driver would
want to call the proposed devm_of_clk_add_parent_provider() API.



Re: [PATCH 3.18 000/144] 3.18.125-stable review

2018-11-09 Thread Greg Kroah-Hartman
On Fri, Nov 09, 2018 at 12:39:56PM -0700, Shuah Khan wrote:
> On 11/08/2018 02:49 PM, Greg Kroah-Hartman wrote:
> > This is the start of the stable review cycle for the 3.18.125 release.
> > There are 144 patches in this series, all will be posted as a response
> > to this one.  If anyone has any issues with these being applied, please
> > let me know.
> > 
> > Responses should be made by Sat Nov 10 21:50:17 UTC 2018.
> > Anything received after that time might be too late.
> > 
> > The whole patch series can be found in one patch at:
> > 
> > https://www.kernel.org/pub/linux/kernel/v3.x/stable-review/patch-3.18.125-rc1.gz
> > or in the git tree and branch at:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git 
> > linux-3.18.y
> > and the diffstat can be found below.
> > 
> > thanks,
> > 
> > greg k-h
> > 
> 
> Compiled and booted on my new test system. I don't have a reference for
> dmesg regressions and I haven't noticed any problems.

Thanks for testing all of these and letting me know.

greg k-h


Re: [RFC PATCH] mm: thp: implement THP reservations for anonymous memory

2018-11-09 Thread anthony . yznaga



On 11/09/2018 07:34 AM, Zi Yan wrote:
> On 9 Nov 2018, at 8:11, Mel Gorman wrote:
>
>> On Fri, Nov 09, 2018 at 03:13:18PM +0300, Kirill A. Shutemov wrote:
>>> On Thu, Nov 08, 2018 at 10:48:58PM -0800, Anthony Yznaga wrote:
>>>> The basic idea as outlined by Mel Gorman in [2] is:
>>>>
>>>> 1) On first fault in a sufficiently sized range, allocate a huge page
>>>>    sized and aligned block of base pages.  Map the base page
>>>>    corresponding to the fault address and hold the rest of the pages in
>>>>    reserve.
>>>> 2) On subsequent faults in the range, map the pages from the reservation.
>>>> 3) When enough pages have been mapped, promote the mapped pages and
>>>>    remaining pages in the reservation to a huge page.
>>>> 4) When there is memory pressure, release the unused pages from their
>>>>    reservations.
>>> I haven't yet read the patch in details, but I'm skeptical about the
>>> approach in general for few reasons:
>>>
>>> - PTE page table retracting to replace it with huge PMD entry requires
>>>   down_write(mmap_sem). It makes the approach not practical for many
>>>   multi-threaded workloads.
>>>
>>>   I don't see a way to avoid exclusive lock here. I will be glad to
>>>   be proved otherwise.
>>>
>> That problem is somewhat fundamental to the mmap_sem itself and
>> conceivably it could be alleviated by range-locking (if that gets
>> completed). The other thing to bear in mind is the timing. If the
>> promotion is in-place due to reservations, there isn't the allocation
>> overhead and the hold times *should* be short.
>>
> Is it possible to convert all these PTEs to migration entries during
> the promotion and replace them with a huge PMD entry afterwards?
> AFAIK, migrating pages does not require holding a mmap_sem.
> Basically, it will act like migrating 512 base pages to a THP without
> actually doing the page copy.
That's an interesting idea.  I'll look into it.

Thanks,
Anthony

>
> --
> Best Regards
> Yan Zi



Re: [PATCH 3/8] sparc: prom: use property "name" directly to construct node names

2018-11-09 Thread David Miller
From: Rob Herring 
Date: Fri, 9 Nov 2018 07:02:46 -0600

> On Thu, Nov 8, 2018 at 9:13 PM David Miller  wrote:
>>
>> From: Rob Herring 
>> Date: Wed,  7 Nov 2018 16:31:46 -0600
>>
>> > In preparation to remove direct accesses to the device_node.name
>> > pointer, retrieve the node name from the "name" property instead.
>> >
>> > Cc: "David S. Miller" 
>> > Cc: sparcli...@vger.kernel.org
>> > Signed-off-by: Rob Herring 
>>
>> On some 32-bit sparcs, the OF runs in non-cached memory.
>>
>> Sucking in the OF device tree takes a significant amount of the
>> boot time on such systems.
>>
>> And these changes are going to make it even slower.
>>
>> Please just put a wrapper around dp->name or whatever and use that,
>> instead of making more get property OF calls which can be very
>> expensive.
> 
> It's not making calls into OF. It is retrieving the property from the
> kernel's copy of the DT (allocated by memblock). So the only
> difference is we have to walk the struct property list to find "name".
> That's perhaps offset by the fact that we'll be doing one less
> property retrieval from OF when this is all done. Right now, we
> retrieve "name" from OF twice per node.

Ok, now I understand, thanks.
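
For illustration, what Rob describes amounts to something like this
hedged sketch (not the actual patch; the helper name is invented):

#include <linux/of.h>

/* The name comes from the kernel's unflattened copy of the DT -- a walk
 * of the node's property list -- not from a call back into the PROM. */
static const char *dp_name(const struct device_node *dp)
{
	const char *name = of_get_property(dp, "name", NULL);

	return name ? name : "<NULL>";
}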


Re: [PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd

2018-11-09 Thread Michael Tirado
On Fri, Nov 9, 2018 at 9:41 PM Andy Lutomirski  wrote:
>
>
>
> > On Nov 9, 2018, at 1:06 PM, Jann Horn  wrote:
> >
> > +linux-api for API addition
> > +hughd as FYI since this is somewhat related to mm/shmem
> >
> > On Fri, Nov 9, 2018 at 9:46 PM Joel Fernandes (Google)
> >  wrote:
> >> Android uses ashmem for sharing memory regions. We are looking forward
> >> to migrating all usecases of ashmem to memfd so that we can possibly
> >> remove the ashmem driver in the future from staging while also
> >> benefiting from using memfd and contributing to it. Note staging drivers
> >> are also not ABI and generally can be removed at anytime.
> >>
> >> One of the main usecases Android has is the ability to create a region
> >> and mmap it as writeable, then add protection against making any
> >> "future" writes while keeping the existing already mmap'ed
> >> writeable-region active.  This allows us to implement a usecase where
> >> receivers of the shared memory buffer can get a read-only view, while
> >> the sender continues to write to the buffer.

Oh I remember trying this years ago with a new seal, F_SEAL_WRITE_PEER,
or something like that.

> >
> > So you're fiddling around with the file, but not the inode? How are
> > you preventing code like the following from re-opening the file as
> > writable?
> >
> > $ cat memfd.c
> > #define _GNU_SOURCE
> > #include <unistd.h>
> > #include <sys/syscall.h>
> > #include <stdio.h>
> > #include <fcntl.h>
> > #include <err.h>
> >
> > int main(void) {
> >  int fd = syscall(__NR_memfd_create, "testfd", 0);
> >  if (fd == -1) err(1, "memfd");
> >  char path[100];
> >  sprintf(path, "/proc/self/fd/%d", fd);
> >  int fd2 = open(path, O_RDWR);
> >  if (fd2 == -1) err(1, "reopen");
> >  printf("reopen successful: %d\n", fd2);
> > }
> > $ gcc -o memfd memfd.c
> > $ ./memfd
> > reopen successful: 4
> > $
> >

The race condition between memfd_create and applying seals via fcntl?
I think it would be possible to block new write mappings from peer processes
if there were a new memfd_create API that accepts seals, allowing the caller
to set a seal like the one I proposed years ago, but in a race-free manner.
We would then also need to consider how to properly handle blocking an
inherited +W mapping across clone/fork. Maybe I'm forgetting some other
pitfalls?



> > That aside: I wonder whether a better API would be something that
> > allows you to create a new readonly file descriptor, instead of
> > fiddling with the writability of an existing fd.
>
> Every now and then I try to write a patch to prevent using proc to reopen a 
> file with greater permission than the original open.
>
> I like your idea to have a clean way to reopen a memfd with reduced 
> permissions. But I would make it a syscall instead and maybe make it only 
> work for memfd at first.  And the proc issue would need to be fixed, too.

IMO the best solution would handle the issue at memfd creation time by
removing the race condition.
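
For reference, a minimal userspace sketch of the sealing flow being
discussed. F_SEAL_FUTURE_WRITE is taken from the proposed patch and
defined by hand, since current headers don't have it; treat the constants
guarded by #ifndef below as assumptions:

#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MFD_ALLOW_SEALING
#define MFD_ALLOW_SEALING 0x0002U
#endif
#ifndef F_ADD_SEALS
#define F_ADD_SEALS 1033            /* F_LINUX_SPECIFIC_BASE + 9 */
#endif
#ifndef F_SEAL_FUTURE_WRITE
#define F_SEAL_FUTURE_WRITE 0x0010  /* value used by the proposed patch */
#endif

int main(void)
{
	int fd = syscall(__NR_memfd_create, "buf", MFD_ALLOW_SEALING);
	if (fd == -1)
		err(1, "memfd_create");
	if (ftruncate(fd, 4096))
		err(1, "ftruncate");

	/* The sender keeps an already established writable mapping... */
	char *w = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (w == MAP_FAILED)
		err(1, "mmap rw");

	/* ...then forbids any *future* writable mappings of the fd. */
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_FUTURE_WRITE))
		err(1, "F_ADD_SEALS");

	strcpy(w, "existing mapping still writable");

	/* A receiver can now only get a read-only view. */
	void *w2 = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	printf("new PROT_WRITE mmap after sealing: %s\n",
	       w2 == MAP_FAILED ? "rejected" : "allowed?!");
	return 0;
}

Note this does nothing about the /proc/self/fd reopen hole Jann points out
above; it only shows the intended fcntl() flow.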


Re: [PATCH 5/6] staging:iio:ad2s90: Add SPDX license identifier

2018-11-09 Thread Matheus Tavares Bernardino
On Fri, Nov 9, 2018 at 10:20 PM Greg Kroah-Hartman
 wrote:
>
> On Fri, Nov 09, 2018 at 09:19:52PM -0200, Matheus Tavares Bernardino wrote:
> > On Fri, Nov 9, 2018 at 8:13 PM Fabio Estevam  wrote:
> > >
> > > Hi Matheus,
> > >
> >
> > Hi, Fabio
> >
> > > On Fri, Nov 9, 2018 at 8:01 PM Matheus Tavares
> > >  wrote:
> > > >
> > > > This patch adds the SPDX GPL-2.0-only license identifier to ad2s90.c,
> > > > which solves the checkpatch.pl warning:
> > > > "WARNING: Missing or malformed SPDX-License-Identifier tag in line 1".
> > > >
> > > > Signed-off-by: Matheus Tavares 
> > > > ---
> > > >  drivers/staging/iio/resolver/ad2s90.c | 1 +
> > > >  1 file changed, 1 insertion(+)
> > > >
> > > > diff --git a/drivers/staging/iio/resolver/ad2s90.c 
> > > > b/drivers/staging/iio/resolver/ad2s90.c
> > > > index 949ff55ac6b0..f439da721df8 100644
> > > > --- a/drivers/staging/iio/resolver/ad2s90.c
> > > > +++ b/drivers/staging/iio/resolver/ad2s90.c
> > > > @@ -1,3 +1,4 @@
> > > > +// SPDX-License-Identifier: GPL-2.0-only
> > >
> > > This should be:
> > > // SPDX-License-Identifier: GPL-2.0
> >
> > Hm, but it seems that the identifier "GPL-2.0" is deprecated, look:
> > https://spdx.org/licenses/GPL-2.0.html. It has been updated to
> > "GPL-2.0-only" in license list v3
> > (https://spdx.org/licenses/GPL-2.0-only.html). Is there some other
> > reason to use the deprecated "GPL-2.0" that I'm not aware of?
>
> Yes, please read the in-kernel documentation for all of this at:
> Documentation/process/license-rules.rst
>
> Long story short, we started the adding of these tags to the kernel
> before the craziness of the "-only" markings for GPL in SPDX.  Let's
> keep it this way for now, if we ever get the whole kernel finished, then
> we can revisit the markings and maybe do a wholesale conversion, if it's
> really needed.
>

Got it, thanks for the explanation! I'll correct this in v2.

Thanks,
Matheus

> thanks,
>
> greg k-h
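
For reference, the v2 change boils down to the first line of
drivers/staging/iio/resolver/ad2s90.c using the plain form, with the
C++-style comment that license-rules.rst asks for in .c files:

// SPDX-License-Identifier: GPL-2.0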


Re: [PATCH] x86/mm/pat: Fix missing preemption disable for __native_flush_tlb()

2018-11-09 Thread Andy Lutomirski



> On Nov 9, 2018, at 4:05 PM, Dan Williams  wrote:
> 
> Commit f77084d96355 "x86/mm/pat: Disable preemption around
> __flush_tlb_all()" addressed a case where __flush_tlb_all() is called
> without preemption being disabled. It also left a warning to catch other
> cases where preemption is not disabled. That warning triggers for the
> memory hotplug path which is also used for persistent memory enabling:

I don’t think I agree with the patch. If you call __flush_tlb_all() in a 
context where you might be *migrated*, then there’s a bug. We could change the 
code to allow this particular use by checking that we haven’t done SMP init 
yet, perhaps.
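
One way that check might look (just a sketch meant to sit next to the
existing helpers in arch/x86/include/asm/tlbflush.h; whether
num_online_cpus() is the right "SMP init done" test is an assumption on
my part, not something from the patch):

static inline void __flush_tlb_all_checked(void)
{
	/*
	 * Before the secondary CPUs are brought up the task cannot be
	 * migrated, so a preemptible caller is still safe here.
	 */
	VM_WARN_ON_ONCE(preemptible() && num_online_cpus() > 1);

	if (boot_cpu_has(X86_FEATURE_PGE))
		__flush_tlb_global();
	else
		__flush_tlb();
}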

> 
> WARNING: CPU: 35 PID: 911 at ./arch/x86/include/asm/tlbflush.h:460
> RIP: 0010:__flush_tlb_all+0x1b/0x3a
> [..]
> Call Trace:
>  phys_pud_init+0x29c/0x2bb
>  kernel_physical_mapping_init+0xfc/0x219
>  init_memory_mapping+0x1a5/0x3b0
>  arch_add_memory+0x2c/0x50
>  devm_memremap_pages+0x3aa/0x610
>  pmem_attach_disk+0x585/0x700 [nd_pmem]
> 
> Rather than audit all __flush_tlb_all() callers to add preemption, just
> do it internally to __flush_tlb_all().
> 
> Fixes: f77084d96355 ("x86/mm/pat: Disable preemption around 
> __flush_tlb_all()")
> Cc: Sebastian Andrzej Siewior 
> Cc: Thomas Gleixner 
> Cc: Andy Lutomirski 
> Cc: Dave Hansen 
> Cc: Peter Zijlstra 
> Cc: Borislav Petkov 
> Cc: 
> Signed-off-by: Dan Williams 
> ---
> arch/x86/include/asm/tlbflush.h |    8 ++++----
> arch/x86/mm/pageattr.c          |    6 +-----
> 2 files changed, 5 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index d760611cfc35..049e0aca0fb5 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -454,11 +454,10 @@ static inline void __native_flush_tlb_one_user(unsigned long addr)
>  static inline void __flush_tlb_all(void)
>  {
>  	/*
> -	 * This is to catch users with enabled preemption and the PGE feature
> -	 * and don't trigger the warning in __native_flush_tlb().
> +	 * Preemption needs to be disabled around __flush_tlb* calls
> +	 * due to CR3 reload in __native_flush_tlb().
>  	 */
> -	VM_WARN_ON_ONCE(preemptible());
> -
> +	preempt_disable();
>  	if (boot_cpu_has(X86_FEATURE_PGE)) {
>  		__flush_tlb_global();
>  	} else {
> @@ -467,6 +466,7 @@ static inline void __flush_tlb_all(void)
>  		 */
>  		__flush_tlb();
>  	}
> +	preempt_enable();
>  }
> 
> /*
> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> index db7a10082238..f799076e3d57 100644
> --- a/arch/x86/mm/pageattr.c
> +++ b/arch/x86/mm/pageattr.c
> @@ -2309,13 +2309,9 @@ void __kernel_map_pages(struct page *page, int numpages, int enable)
>  
>  	/*
>  	 * We should perform an IPI and flush all tlbs,
> -	 * but that can deadlock->flush only current cpu.
> -	 * Preemption needs to be disabled around __flush_tlb_all() due to
> -	 * CR3 reload in __native_flush_tlb().
> +	 * but that can deadlock->flush only current cpu:
>  	 */
> -	preempt_disable();
>  	__flush_tlb_all();
> -	preempt_enable();
>  
>  	arch_flush_lazy_mmu_mode();
>  }
> 

