Re: [PATCH 1/2] clk: ti: add a usecount for autoidle
On Thu, 8 Nov 2018 12:36:35 +0200 Tero Kristo wrote:
> On 04/10/2018 23:38, Andreas Kemnade wrote:
> > We have the scenario that first autoidle is disabled for all clocks,
> > then it is disabled for selected ones and then enabled for all. So
> > we should have some counting here, also according to the
> > comment in _setup_iclk_autoidle()
> >
> > Signed-off-by: Andreas Kemnade
> > ---
> >  drivers/clk/ti/autoidle.c | 32
> >  include/linux/clk/ti.h    |  1 +
> >  2 files changed, 25 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/clk/ti/autoidle.c b/drivers/clk/ti/autoidle.c
> > index 7bb9afbe4058..bb6cff168e73 100644
> > --- a/drivers/clk/ti/autoidle.c
> > +++ b/drivers/clk/ti/autoidle.c
> > @@ -37,6 +37,14 @@ struct clk_ti_autoidle {
> >  static LIST_HEAD(autoidle_clks);
> >  static LIST_HEAD(clk_hw_omap_clocks);
> >
> > +/*
> > + * we have some non-atomic read/write
> > + * operations behind it, so lets
> > + * take one mutex for handling autoidle
> > + * of all clocks
> > + */
> > +static DEFINE_MUTEX(autoidle_mutex);
>
> Why mutex? This prevents calling the autoidle APIs from atomic context.
> Did you check the mutex debug kernel configs with this?

Oops, I thought they were on, but they were not. OK, I am preparing a v2
of this thing.

> This may cause problems with the runtime PM entries to the code at least.
>
> > +
> >  /**
> >   * omap2_clk_deny_idle - disable autoidle on an OMAP clock
> >   * @clk: struct clk * to disable autoidle for
> >
> > @@ -48,8 +56,13 @@ int omap2_clk_deny_idle(struct clk *clk)
> >  	struct clk_hw_omap *c;
> >
> >  	c = to_clk_hw_omap(__clk_get_hw(clk));
> > -	if (c->ops && c->ops->deny_idle)
> > -		c->ops->deny_idle(c);
> > +	if (c->ops && c->ops->deny_idle) {
> > +		mutex_lock(&autoidle_mutex);
> > +		c->autoidle_count--;
> > +		if (c->autoidle_count == -1)
>
> I think you should swap the arithmetics here, all the other usecounters
> use positive values, here you enter deep to the negative side when
> autoidle is denied by multiple users, which might be confusing.

agreed.

Regards,
Andreas
[PATCH] sched/fair: Make some variables static
The variables are local to the source and do not need to be in global
scope, so make them static.

Signed-off-by: Muchun Song
---
 kernel/sched/fair.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f622fc858d7a..9d0cd2d0515f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -38,7 +38,7 @@
  * (default: 6ms * (1 + ilog(ncpus)), units: nanoseconds)
  */
 unsigned int sysctl_sched_latency			= 6000000ULL;
-unsigned int normalized_sysctl_sched_latency		= 6000000ULL;
+static unsigned int normalized_sysctl_sched_latency	= 6000000ULL;

 /*
  * The initial- and re-scaling of tunables is configurable
@@ -58,8 +58,8 @@ enum sched_tunable_scaling sysctl_sched_tunable_scaling = SCHED_TUNABLESCALING_L
  *
  * (default: 0.75 msec * (1 + ilog(ncpus)), units: nanoseconds)
  */
-unsigned int sysctl_sched_min_granularity		= 750000ULL;
-unsigned int normalized_sysctl_sched_min_granularity	= 750000ULL;
+unsigned int sysctl_sched_min_granularity		= 750000ULL;
+static unsigned int normalized_sysctl_sched_min_granularity = 750000ULL;

 /*
  * This value is kept at sysctl_sched_latency/sysctl_sched_min_granularity
@@ -81,8 +81,8 @@ unsigned int sysctl_sched_child_runs_first __read_mostly;
  *
  * (default: 1 msec * (1 + ilog(ncpus)), units: nanoseconds)
  */
-unsigned int sysctl_sched_wakeup_granularity		= 1000000UL;
-unsigned int normalized_sysctl_sched_wakeup_granularity	= 1000000UL;
+unsigned int sysctl_sched_wakeup_granularity		= 1000000UL;
+static unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL;

 const_debug unsigned int sysctl_sched_migration_cost	= 500000UL;
--
2.17.1
Re: [PATCH] pinctrl: msm: Add sleep pinctrl state transitions
On Fri 09 Nov 14:28 PST 2018, Evan Green wrote:

> Add PM suspend callbacks to the msm core driver that select the
> sleep and default pinctrl states. Then wire those callbacks up
> in the sdm845 driver, for those boards that may have GPIO hogs
> that need to change state during suspend.
>
> Signed-off-by: Evan Green
> ---
>
>  drivers/pinctrl/qcom/pinctrl-msm.c    | 16
>  drivers/pinctrl/qcom/pinctrl-msm.h    |  2 ++
>  drivers/pinctrl/qcom/pinctrl-sdm845.c |  4
>  3 files changed, 22 insertions(+)
>
> diff --git a/drivers/pinctrl/qcom/pinctrl-msm.c b/drivers/pinctrl/qcom/pinctrl-msm.c
> index 7c7d083e2c0dc..d505d0a50f0a5 100644
> --- a/drivers/pinctrl/qcom/pinctrl-msm.c
> +++ b/drivers/pinctrl/qcom/pinctrl-msm.c
> @@ -1072,6 +1072,22 @@ static void msm_pinctrl_setup_pm_reset(struct msm_pinctrl *pctrl)
> 	}
> }
>
> +int msm_pinctrl_suspend(struct device *dev)
> +{
> +	struct msm_pinctrl *pctrl = dev_get_drvdata(dev);
> +
> +	return pinctrl_force_sleep(pctrl->pctrl);
> +}
> +EXPORT_SYMBOL(msm_pinctrl_suspend);
> +
> +int msm_pinctrl_resume(struct device *dev)
> +{
> +	struct msm_pinctrl *pctrl = dev_get_drvdata(dev);
> +
> +	return pinctrl_force_default(pctrl->pctrl);
> +}
> +EXPORT_SYMBOL(msm_pinctrl_resume);
> +
> int msm_pinctrl_probe(struct platform_device *pdev,
> 		      const struct msm_pinctrl_soc_data *soc_data)
> {
> diff --git a/drivers/pinctrl/qcom/pinctrl-msm.h b/drivers/pinctrl/qcom/pinctrl-msm.h
> index 29172fdf5882c..e163ca600ecd3 100644
> --- a/drivers/pinctrl/qcom/pinctrl-msm.h
> +++ b/drivers/pinctrl/qcom/pinctrl-msm.h
> @@ -123,6 +123,8 @@ struct msm_pinctrl_soc_data {
> 	unsigned int ntiles;
> };
>
> +int msm_pinctrl_suspend(struct device *dev);
> +int msm_pinctrl_resume(struct device *dev);
> int msm_pinctrl_probe(struct platform_device *pdev,
> 		      const struct msm_pinctrl_soc_data *soc_data);
> int msm_pinctrl_remove(struct platform_device *pdev);
> diff --git a/drivers/pinctrl/qcom/pinctrl-sdm845.c b/drivers/pinctrl/qcom/pinctrl-sdm845.c
> index 2ab7a88857579..a3ac9cbeabad1 100644
> --- a/drivers/pinctrl/qcom/pinctrl-sdm845.c
> +++ b/drivers/pinctrl/qcom/pinctrl-sdm845.c
> @@ -1287,6 +1287,9 @@ static const struct msm_pinctrl_soc_data sdm845_pinctrl = {
> 	.ngpios = 150,
> };
>
> +static SIMPLE_DEV_PM_OPS(sdm845_pinctrl_dev_pm_ops, msm_pinctrl_suspend,
> +			 msm_pinctrl_resume);

I don't expect this to differ between the various platforms, so how about
moving it to pinctrl-msm?

Regards,
Bjorn

> +
> static int sdm845_pinctrl_probe(struct platform_device *pdev)
> {
> 	return msm_pinctrl_probe(pdev, &sdm845_pinctrl);
> @@ -1300,6 +1303,7 @@ static const struct of_device_id sdm845_pinctrl_of_match[] = {
> static struct platform_driver sdm845_pinctrl_driver = {
> 	.driver = {
> 		.name = "sdm845-pinctrl",
> +		.pm = &sdm845_pinctrl_dev_pm_ops,
> 		.of_match_table = sdm845_pinctrl_of_match,
> 	},
> 	.probe = sdm845_pinctrl_probe,
> --
> 2.16.4
>
[PATCH] riscv: add S and U modes to ISA string
Booting kernel (4.20-rc1) with riscv-pk @ 6ebd0f2a46255d0c76dad3c05b16c1d154795d26
(master/HEAD) on Fedora 29 one gets:

[..]
[   55.075000] unsupported ISA "rv64imafdcsu" in device tree
[   55.075000] unsupported ISA "rv64imafdcsu" in device tree
[   55.076000] unsupported ISA "rv64imafdcsu" in device tree
[   55.076000] unsupported ISA "rv64imafdcsu" in device tree
[   55.077000] systemd[1]: Detected architecture riscv64.
[..]

The patch adds the missing S and U modes.

Signed-off-by: David Abdurachmanov
---
 arch/riscv/kernel/cpu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/riscv/kernel/cpu.c b/arch/riscv/kernel/cpu.c
index 3a5a2ee31547..4029c7e6872b 100644
--- a/arch/riscv/kernel/cpu.c
+++ b/arch/riscv/kernel/cpu.c
@@ -64,7 +64,7 @@ int riscv_of_processor_hartid(struct device_node *node)

 static void print_isa(struct seq_file *f, const char *orig_isa)
 {
-	static const char *ext = "mafdc";
+	static const char *ext = "mafdcsu";
 	const char *isa = orig_isa;
 	const char *e;
--
2.19.1
Re: Re: [PATCH V3] binder: ipc namespace support for android binder(Internet mail)
On Fri, Nov 9, 2018 at 9:43 PM chouryzhou(周威) wrote:
>
> > > If IPC_NS is disabled, "current->nsproxy->ipc_ns" will also exist, it
> > > will be a static reference of "init_ipc_ns" (in ipc/msgutil.c, not
> > > defined in binder.c by me) with no namespace-ization. You will get the
> > > same one in all processes, everything is the same as without this patch.
> >
> > except, as far as I can tell, binder_init_ns() would never have been
> > called on it so the mutex and list heads are not initialized so its
> > completely broken. Am I missing something? How do those fields get
> > initialized in this case?
>
> > @@ -5832,8 +5888,12 @@ static int __init binder_init(void)
> > 		goto err_init_binder_device_failed;
> > 	}
> >
> > -	return ret;
> > +	ret = binder_init_ns(&init_ipc_ns);
> > +	if (ret)
> > +		goto err_init_namespace_failed;
> >
> > +	return ret;
>
> They are initialized here.

Ok, This init_ipc_ns is a global declared in msgutil.c if SYSVIPC ||
POSIX_MQUEUE. This seems kinda hacky, but now I finally see why the
dependency... msgutil.c is the only file we can count on if !IPC_NS &&
(SYSVIPC || POSIX_MQUEUE).

There must be a cleaner way to do this, I really don't like this
dependency... wouldn't it be cleaner to do:

#ifndef CONFIG_IPC_NS
static struct ipc_namespace binder_ipc_ns;
#define ipcns (&binder_ipc_ns)
#else
#define ipcns (current->nsproxy->ipc_ns)
#endif

(and make the initialization of binder_ipc_ns conditional on IPC_NS)

This gets us the same thing without the incestuous dependency on the
msgutil.c version of init_ipc_ns... and then binder doesn't rely on
SYSVIPC or POSIX_MQUEUE directly.

> - choury -
[PATCH] regulator: bd718x7: Use regulator_map_voltage_ascend for buck5 and buck7
The voltages in bd718xx_3rd_nodvs_buck_volts are in ascending order, so
use regulator_map_voltage_ascend.

Signed-off-by: Axel Lin
---
 drivers/regulator/bd718x7-regulator.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/regulator/bd718x7-regulator.c b/drivers/regulator/bd718x7-regulator.c
index f142f9ce6853..0ae3e03f5433 100644
--- a/drivers/regulator/bd718x7-regulator.c
+++ b/drivers/regulator/bd718x7-regulator.c
@@ -131,6 +131,7 @@ static struct regulator_ops bd718xx_buck_regulator_nolinear_ops = {
 	.disable = regulator_disable_regmap,
 	.is_enabled = regulator_is_enabled_regmap,
 	.list_voltage = regulator_list_voltage_table,
+	.map_voltage = regulator_map_voltage_ascend,
 	.set_voltage_sel = bd718xx_set_voltage_sel_restricted,
 	.get_voltage_sel = regulator_get_voltage_sel_regmap,
 	.set_voltage_time_sel = regulator_set_voltage_time_sel,
--
2.17.1
Re: [PATCH] drivers/android/binder.c: Remove duplicate header
On Fri, Nov 9, 2018 at 8:17 PM Greg KH wrote:
>
> On Fri, Nov 09, 2018 at 10:40:14PM +0800, kbuild test robot wrote:
> > Hi Brajeswar,
> >
> > Thank you for the patch! Yet something to improve:
> >
> > [auto build test ERROR on staging/staging-testing]
> > [also build test ERROR on v4.20-rc1 next-20181109]
> > [if your patch is applied to the wrong git tree, please drop us a note to
> > help improve the system]
> >
> > url: https://github.com/0day-ci/linux/commits/Brajeswar-Ghosh/drivers-android-binder-c-Remove-duplicate-header/20181109-154717
> > config: i386-allmodconfig (attached as .config)
> > compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
> > reproduce:
> >         # save the attached .config to linux build tree
> >         make ARCH=i386
> >
> > All errors (new ones prefixed by >>):
>
> Yeah, I was right :(
>
> Always test-build your patches.

Sorry about it. It was a mistake from our side.
Re: [PATCH] infiniband/hw/hns/hns_roce_hw_v2.c: Use dma_zalloc_coherent
On Fri, Nov 9, 2018 at 8:08 PM Sabyasachi Gupta wrote:
>
> Replaced dma_alloc_coherent + memset with dma_zalloc_coherent

There are a few other places in this file where the same is applicable.
Can we get those changes in the same patch?

> Signed-off-by: Sabyasachi Gupta
> ---
>  drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
> index a4c62ae..c9966ec 100644
> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
> @@ -4972,13 +4972,12 @@ static int hns_roce_mhop_alloc_eq(struct hns_roce_dev *hr_dev,
> 			eqe_alloc = i * (buf_chk_sz / eq->eqe_size);
> 			size = (eq->entries - eqe_alloc) * eq->eqe_size;
> 		}
> -		eq->buf[i] = dma_alloc_coherent(dev, size,
> +		eq->buf[i] = dma_zalloc_coherent(dev, size,
> 					&(eq->buf_dma[i]),
> 					GFP_KERNEL);
> 		if (!eq->buf[i])
> 			goto err_dma_alloc_buf;
>
> -		memset(eq->buf[i], 0, size);
> 		*(eq->bt_l0 + i) = eq->buf_dma[i];
>
> 		eq_buf_cnt++;
> --
> 2.7.4
>
Re: [PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd
> On Nov 9, 2018, at 7:20 PM, Joel Fernandes wrote:
>
>> On Fri, Nov 09, 2018 at 10:19:03PM +0100, Jann Horn wrote:
>>> On Fri, Nov 9, 2018 at 10:06 PM Jann Horn wrote:
>>> On Fri, Nov 9, 2018 at 9:46 PM Joel Fernandes (Google) wrote:
>>>>
>>>> Android uses ashmem for sharing memory regions. We are looking forward
>>>> to migrating all usecases of ashmem to memfd so that we can possibly
>>>> remove the ashmem driver in the future from staging while also
>>>> benefiting from using memfd and contributing to it. Note staging
>>>> drivers are also not ABI and generally can be removed at anytime.
>>>>
>>>> One of the main usecases Android has is the ability to create a region
>>>> and mmap it as writeable, then add protection against making any
>>>> "future" writes while keeping the existing already mmap'ed
>>>> writeable-region active. This allows us to implement a usecase where
>>>> receivers of the shared memory buffer can get a read-only view, while
>>>> the sender continues to write to the buffer. See CursorWindow
>>>> documentation in Android for more details:
>>>> https://developer.android.com/reference/android/database/CursorWindow
>>>>
>>>> This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
>>>> To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal
>>>> which prevents any future mmap and write syscalls from succeeding while
>>>> keeping the existing mmap active.
>>>
>>> Please CC linux-api@ on patches like this. If you had done that, I
>>> might have criticized your v1 patch instead of your v3 patch...
>>>
>>>> The following program shows the seal working in action:
>>> [...]
>>>> Cc: jr...@google.com
>>>> Cc: john.stu...@linaro.org
>>>> Cc: tk...@google.com
>>>> Cc: gre...@linuxfoundation.org
>>>> Cc: h...@infradead.org
>>>> Reviewed-by: John Stultz
>>>> Signed-off-by: Joel Fernandes (Google)
>>>> ---
>>> [...]
>>>> diff --git a/mm/memfd.c b/mm/memfd.c
>>>> index 2bb5e257080e..5ba9804e9515 100644
>>>> --- a/mm/memfd.c
>>>> +++ b/mm/memfd.c
>>> [...]
>>>> @@ -219,6 +220,25 @@ static int memfd_add_seals(struct file *file, unsigned int seals)
>>>> 		}
>>>> 	}
>>>>
>>>> +	if ((seals & F_SEAL_FUTURE_WRITE) &&
>>>> +	    !(*file_seals & F_SEAL_FUTURE_WRITE)) {
>>>> +		/*
>>>> +		 * The FUTURE_WRITE seal also prevents growing and shrinking
>>>> +		 * so we need them to be already set, or requested now.
>>>> +		 */
>>>> +		int test_seals = (seals | *file_seals) &
>>>> +				 (F_SEAL_GROW | F_SEAL_SHRINK);
>>>> +
>>>> +		if (test_seals != (F_SEAL_GROW | F_SEAL_SHRINK)) {
>>>> +			error = -EINVAL;
>>>> +			goto unlock;
>>>> +		}
>>>> +
>>>> +		spin_lock(&file->f_lock);
>>>> +		file->f_mode &= ~(FMODE_WRITE | FMODE_PWRITE);
>>>> +		spin_unlock(&file->f_lock);
>>>> +	}
>>>
>>> So you're fiddling around with the file, but not the inode? How are
>>> you preventing code like the following from re-opening the file as
>>> writable?
>>>
>>> $ cat memfd.c
>>> #define _GNU_SOURCE
>>> #include
>>> #include
>>> #include
>>> #include
>>> #include
>>> #include
>>>
>>> int main(void) {
>>>   int fd = syscall(__NR_memfd_create, "testfd", 0);
>>>   if (fd == -1) err(1, "memfd");
>>>   char path[100];
>>>   sprintf(path, "/proc/self/fd/%d", fd);
>>>   int fd2 = open(path, O_RDWR);
>>>   if (fd2 == -1) err(1, "reopen");
>>>   printf("reopen successful: %d\n", fd2);
>>> }
>>> $ gcc -o memfd memfd.c
>>> $ ./memfd
>>> reopen successful: 4
>>> $
>>>
>>> That aside: I wonder whether a better API would be something that
>>> allows you to create a new readonly file descriptor, instead of
>>> fiddling with the writability of an existing fd.
>>
>> My favorite approach would be to forbid open() on memfds, hope that
>> nobody notices the tiny API break, and then add an ioctl for "reopen
>> this memfd with reduced permissions" - but that's just my personal
>> opinion.
>
> I did something along these lines and it fixes the issue, but I forbid open
> of memfd only when the F_SEAL_FUTURE_WRITE seal is in place. So then its not
> an ABI break because this is a brand new seal. That seems the least intrusive
> solution and it works. Do you mind testing it and I'll add your Tested-by
> to the new fix? The patch is based on top of this series.
>
> ---8<---
> From: "Joel Fernandes (Google)"
> Subject: [PATCH] mm/memfd: Fix possible promotion to writeable of sealed memfd
>
> Jann Horn found that reopening an F_SEAL_FUTURE_WRITE sealed memfd
> through /proc/self/fd/N symlink as writeable succeeds. The simplest fix
> without causing ABI breakages and ugly VFS hacks is to simply deny all
> opens on F_SEAL_FUTURE_WRITE sealed fds.
>
> Reported-by: Jann Horn
> Signed-off-by: Joel Fernandes (Google)
> ---
> mm/shmem.c | 18 ++
> 1
Re: [PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd
> On Nov 9, 2018, at 7:20 PM, Joel Fernandes wrote: > >> On Fri, Nov 09, 2018 at 10:19:03PM +0100, Jann Horn wrote: >>> On Fri, Nov 9, 2018 at 10:06 PM Jann Horn wrote: >>> On Fri, Nov 9, 2018 at 9:46 PM Joel Fernandes (Google) >>> wrote: Android uses ashmem for sharing memory regions. We are looking forward to migrating all usecases of ashmem to memfd so that we can possibly remove the ashmem driver in the future from staging while also benefiting from using memfd and contributing to it. Note staging drivers are also not ABI and generally can be removed at anytime. One of the main usecases Android has is the ability to create a region and mmap it as writeable, then add protection against making any "future" writes while keeping the existing already mmap'ed writeable-region active. This allows us to implement a usecase where receivers of the shared memory buffer can get a read-only view, while the sender continues to write to the buffer. See CursorWindow documentation in Android for more details: https://developer.android.com/reference/android/database/CursorWindow This usecase cannot be implemented with the existing F_SEAL_WRITE seal. To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal which prevents any future mmap and write syscalls from succeeding while keeping the existing mmap active. >>> >>> Please CC linux-api@ on patches like this. If you had done that, I >>> might have criticized your v1 patch instead of your v3 patch... >>> The following program shows the seal working in action: >>> [...] Cc: jr...@google.com Cc: john.stu...@linaro.org Cc: tk...@google.com Cc: gre...@linuxfoundation.org Cc: h...@infradead.org Reviewed-by: John Stultz Signed-off-by: Joel Fernandes (Google) --- >>> [...] diff --git a/mm/memfd.c b/mm/memfd.c index 2bb5e257080e..5ba9804e9515 100644 --- a/mm/memfd.c +++ b/mm/memfd.c >>> [...] 
@@ -219,6 +220,25 @@ static int memfd_add_seals(struct file *file, unsigned int seals) } } + if ((seals & F_SEAL_FUTURE_WRITE) && + !(*file_seals & F_SEAL_FUTURE_WRITE)) { + /* +* The FUTURE_WRITE seal also prevents growing and shrinking +* so we need them to be already set, or requested now. +*/ + int test_seals = (seals | *file_seals) & +(F_SEAL_GROW | F_SEAL_SHRINK); + + if (test_seals != (F_SEAL_GROW | F_SEAL_SHRINK)) { + error = -EINVAL; + goto unlock; + } + + spin_lock(>f_lock); + file->f_mode &= ~(FMODE_WRITE | FMODE_PWRITE); + spin_unlock(>f_lock); + } >>> >>> So you're fiddling around with the file, but not the inode? How are >>> you preventing code like the following from re-opening the file as >>> writable? >>> >>> $ cat memfd.c >>> #define _GNU_SOURCE >>> #include >>> #include >>> #include >>> #include >>> #include >>> #include >>> >>> int main(void) { >>> int fd = syscall(__NR_memfd_create, "testfd", 0); >>> if (fd == -1) err(1, "memfd"); >>> char path[100]; >>> sprintf(path, "/proc/self/fd/%d", fd); >>> int fd2 = open(path, O_RDWR); >>> if (fd2 == -1) err(1, "reopen"); >>> printf("reopen successful: %d\n", fd2); >>> } >>> $ gcc -o memfd memfd.c >>> $ ./memfd >>> reopen successful: 4 >>> $ >>> >>> That aside: I wonder whether a better API would be something that >>> allows you to create a new readonly file descriptor, instead of >>> fiddling with the writability of an existing fd. >> >> My favorite approach would be to forbid open() on memfds, hope that >> nobody notices the tiny API break, and then add an ioctl for "reopen >> this memfd with reduced permissions" - but that's just my personal >> opinion. > > I did something along these lines and it fixes the issue, but I forbid open > of memfd only when the F_SEAL_FUTURE_WRITE seal is in place. So then its not > an ABI break because this is a brand new seal. That seems the least intrusive > solution and it works. Do you mind testing it and I'll add your and Tested-by > to the new fix? 
The patch is based on top of this series. > > ---8<--- > From: "Joel Fernandes (Google)" > Subject: [PATCH] mm/memfd: Fix possible promotion to writeable of sealed memfd > > Jann Horn found that reopening an F_SEAL_FUTURE_WRITE sealed memfd > through /proc/self/fd/N symlink as writeable succeeds. The simplest fix > without causing ABI breakages and ugly VFS hacks is to simply deny all > opens on F_SEAL_FUTURE_WRITE sealed fds. > > Reported-by: Jann Horn > Signed-off-by: Joel Fernandes (Google) > --- > mm/shmem.c | 18 ++ > 1
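For readers following the thread, the seal's intended userspace flow can be sketched as below. This is a minimal sketch under stated assumptions, not Android's actual code: the `F_SEAL_FUTURE_WRITE` value (0x0010) is taken from the proposed patch and defined locally in case the installed headers predate it, and a kernel that rejects the new bit is treated as a skip rather than a failure.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef F_SEAL_FUTURE_WRITE
#define F_SEAL_FUTURE_WRITE 0x0010 /* value taken from the proposed patch */
#endif

/* Returns 0 on success, and also 0 on kernels without the new seal. */
static int future_write_demo(void)
{
	int fd = memfd_create("fw_demo", MFD_ALLOW_SEALING);
	if (fd < 0 || ftruncate(fd, 4096) < 0)
		return -1;

	/* The sender maps the buffer writable before sealing... */
	char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return -1;
	map[0] = 'a';

	/* FUTURE_WRITE requires GROW and SHRINK to be set (or requested). */
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_GROW | F_SEAL_SHRINK |
				   F_SEAL_FUTURE_WRITE) < 0) {
		close(fd);
		return errno == EINVAL ? 0 : -1; /* kernel lacks the seal */
	}

	map[0] = 'b';			/* existing mapping keeps working */

	if (write(fd, "x", 1) >= 0)	/* ...but new writes must fail */
		return -1;
	if (mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
		 fd, 0) != MAP_FAILED)	/* ...and new writable maps too */
		return -1;

	munmap(map, 4096);
	close(fd);
	return 0;
}
```

This matches the CursorWindow use case described in the thread: the sender's pre-existing writable mapping stays live while every later attempt to gain write access through the fd is refused.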
Re: Re: [PATCH V3] binder: ipc namespace support for android binder(Internet mail)
> > > I still don't understand the dependencies on SYSVIPC or POSIX_MQUEUE. > > > It seems like this mechanism would work even if both are disabled -- > > > as long as IPC_NS is enabled. Seems cleaner to change init/Kconfig and > > > allow IPC_NS if CONFIG_ANDROID_BINDER_IPC and change this line to > > > "#ifndef CONFIG_IPC_NS" > > > > Let me explain it in detail. If SYSVIPC and IPC_NS are both defined, > > current->nsproxy->ipc_ns will save the ipc namespace variables. We just use > > it. If SYSVIPC (or POSIX_MQUEUE) is defined while IPC_NS is not set, > > current->nsproxy->ipc_ns will always refer to init_ipc_ns in ipc/msgutil.c, > > which is also fine for us. But if neither SYSVIPC nor POSIX_MQUEUE is set > > (IPC_NS can't be set in this situation), there is no > > current->nsproxy->ipc_ns. > > We make a fake init_ipc_ns here and use it. > > Yes, I can read the code. I'm wondering specifically about SYSVIPC and > POSIX_MQUEUE. Even with your code changes, binder has no dependency on > these configs. Why are you creating one? The actual dependency with > your changes is on "current->nsproxy->ipc_ns" being initialized for > binder -- which is dependent on CONFIG_IPC_NS being enabled, isn't it? > > If SYSVIPC or POSIX_MQUEUE are enabled, but IPC_NS is disabled, does this > work? If IPC_NS is disabled, "current->nsproxy->ipc_ns" will still exist; it will be a static reference to "init_ipc_ns" (in ipc/msgutil.c, not defined in binder.c by me) with no namespace-ization. You will get the same one in all processes, and everything is the same as without this patch. - choury -
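The fallback being discussed (supplying a private `init_ipc_ns` when the kernel builds without one) can be modeled in ordinary C. Everything below is illustrative, not the actual binder code: pthread mutexes stand in for kernel mutexes and the struct fields are invented. It only shows why every process then resolves to the same, explicitly initialized namespace object.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Illustrative stand-ins for the kernel types involved. */
struct binder_list { struct binder_list *next; };

struct ipc_namespace {
	pthread_mutex_t binder_procs_lock; /* kernel mutex in the patch */
	struct binder_list binder_procs;   /* per-namespace proc list */
};

/* When SYSVIPC/POSIX_MQUEUE are both off, the kernel provides no
 * init_ipc_ns, so the patch supplies a single private instance. */
static struct ipc_namespace init_ipc_ns;

/* Mirrors the binder_init_ns() call added to binder_init(): the
 * mutex and list heads must be initialized exactly once. */
static int binder_init_ns(struct ipc_namespace *ns)
{
	if (pthread_mutex_init(&ns->binder_procs_lock, NULL))
		return -1;
	ns->binder_procs.next = NULL;
	return 0;
}

/* Every process resolves to the same namespace object, just as
 * current->nsproxy->ipc_ns does when CONFIG_IPC_NS is disabled. */
static struct ipc_namespace *current_ipc_ns(void)
{
	return &init_ipc_ns;
}
```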
Re: [PATCH 2/8] of: Drop full path from Sparc PDT full_name
From: Rob Herring Date: Fri, 9 Nov 2018 14:30:01 -0600 > That is the intent. With this change we stop storing the full path for > every node. Everywhere that needs the full path, generates it with the > %pOF printf specifier. Other than users in arch/sparc converted in > this series all the users of full_name have been converted already or > use kbasename(full_name) so they work either way. BTW, I've found a > couple more, so I'll be sending you a v2. > > And of_pdt_build_full_name for !SPARC does need to be converted too, > but I'll do that separately. I would think that, for correctness, you'd need to adjust both functions at the same time, no?
[PATCH -next] sysv: return 'err' instead of 0 in __sysv_write_inode
Fixes gcc '-Wunused-but-set-variable' warning: fs/sysv/inode.c: In function '__sysv_write_inode': fs/sysv/inode.c:239:6: warning: variable 'err' set but not used [-Wunused-but-set-variable] __sysv_write_inode should return 'err' instead of 0 Fixes: 05459ca81ac3 ("repair sysv_write_inode(), switch sysv to simple_fsync()") Signed-off-by: YueHaibing --- fs/sysv/inode.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/sysv/inode.c b/fs/sysv/inode.c index 499a20a..273736f 100644 --- a/fs/sysv/inode.c +++ b/fs/sysv/inode.c @@ -275,7 +275,7 @@ static int __sysv_write_inode(struct inode *inode, int wait) } } brelse(bh); - return 0; + return err; } int sysv_write_inode(struct inode *inode, struct writeback_control *wbc)
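The bug class fixed above, an error code collected inside the function but an unconditional `return 0;` at the end, is easy to model outside the kernel. Below is a sketch with invented names; `write_block` stands in for the buffer writes that `__sysv_write_inode` performs.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for writing one on-disk block. */
static int write_block(int blk)
{
	return blk == 2 ? -EIO : 0; /* pretend block 2 fails */
}

/* Mirrors the fixed __sysv_write_inode() shape: record the first
 * error seen and return it, instead of an unconditional 0 that
 * silently discards the failure. */
static int write_inode_blocks(int nblocks)
{
	int err = 0, blk;

	for (blk = 0; blk < nblocks; blk++) {
		int ret = write_block(blk);
		if (ret && !err)
			err = ret;
	}
	return err; /* the patch replaces "return 0;" with this */
}
```

With the hard-coded `return 0;`, callers (and gcc's `-Wunused-but-set-variable`) can see that `err` is computed but never used; returning it propagates the I/O failure.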
Re: [mm PATCH v5 7/7] mm: Use common iterator for deferred_init_pages and deferred_free_pages
On 18-11-05 13:20:01, Alexander Duyck wrote: > +static unsigned long __next_pfn_valid_range(unsigned long *i, > + unsigned long end_pfn) > { > - if (!pfn_valid_within(pfn)) > - return false; > - if (!(pfn & (pageblock_nr_pages - 1)) && !pfn_valid(pfn)) > - return false; > - return true; > + unsigned long pfn = *i; > + unsigned long count; > + > + while (pfn < end_pfn) { > + unsigned long t = ALIGN(pfn + 1, pageblock_nr_pages); > + unsigned long pageblock_pfn = min(t, end_pfn); > + > +#ifndef CONFIG_HOLES_IN_ZONE > + count = pageblock_pfn - pfn; > + pfn = pageblock_pfn; > + if (!pfn_valid(pfn)) > + continue; > +#else > + for (count = 0; pfn < pageblock_pfn; pfn++) { > + if (pfn_valid_within(pfn)) { > + count++; > + continue; > + } > + > + if (count) > + break; > + } > + > + if (!count) > + continue; > +#endif > + *i = pfn; > + return count; > + } > + > + return 0; > } > > +#define for_each_deferred_pfn_valid_range(i, start_pfn, end_pfn, pfn, count) > \ > + for (i = (start_pfn), \ > + count = __next_pfn_valid_range(&i, (end_pfn)); \ > + count && ({ pfn = i - count; 1; }); \ > + count = __next_pfn_valid_range(&i, (end_pfn))) Can this be improved somehow? It took me a while to understand this piece of code. i is actually the end of a block, not an index by PFN, and ({ pfn = i - count; 1; }) is simply hard to parse. Why can't we make __next_pfn_valid_range() return both the end and the start of a block? The rest is good: Reviewed-by: Pavel Tatashin Thank you, Pasha
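The cleaner interface Pavel suggests, having the helper report the range start directly so callers never compute `i - count`, can be sketched in plain userspace C over a toy pfn-validity predicate. All names, the block size, and the hole layout below are made up for illustration; this simplified version only checks validity at the first pfn of each block, like the `!CONFIG_HOLES_IN_ZONE` path.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGEBLOCK 4UL /* toy stand-in for pageblock_nr_pages */

/* Toy memory map: pretend pfns 8..11 are a hole. */
static bool pfn_valid(unsigned long pfn)
{
	return pfn < 8 || pfn >= 12;
}

/* Returns the number of valid pfns in the next run and reports the
 * run's first pfn via *start; *i advances to the end of the run.
 * Callers get both endpoints without any "i - count" arithmetic. */
static unsigned long next_valid_range(unsigned long *i,
				      unsigned long end_pfn,
				      unsigned long *start)
{
	unsigned long pfn = *i;

	while (pfn < end_pfn) {
		unsigned long block_end = (pfn / PAGEBLOCK + 1) * PAGEBLOCK;

		if (block_end > end_pfn)
			block_end = end_pfn;
		if (pfn_valid(pfn)) {
			*start = pfn;
			*i = block_end;
			return block_end - pfn;
		}
		pfn = block_end; /* skip the invalid block */
	}
	return 0; /* no more valid ranges */
}
```

A `for_each`-style macro built on this needs no statement-expression trick: the loop condition is simply `(count = next_valid_range(&i, end, &start)) != 0`.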
Re: [RFC PATCH 1/3] static_call: Add static call infrastructure
On Fri, 9 Nov 2018 14:34:59 -0600 Josh Poimboeuf wrote: I'm slowly massaging this to work with tracepoints. But I hit a snag on this patch. > On Fri, Nov 09, 2018 at 02:57:46PM -0500, Steven Rostedt wrote: > > On Fri, 9 Nov 2018 13:35:05 -0600 > > Josh Poimboeuf wrote: > > > > > > > > > +#define DECLARE_STATIC_CALL(key, func) > > > > > \ > > > > > + extern struct static_call_key key; > > > > > \ > > > > > + extern typeof(func) STATIC_CALL_TRAMP(key); > > > > > \ > > > > > + /* Preserve the ELF symbol so objtool can access it: */ > > > > > \ > > > > > + __ADDRESSABLE(key) > > > > > > > > Does the __ADDRESSABLE(key) need to be in the DECLARE part? > > > > If so, there needs to be more explanation than just the comment above > > > > it. > > > > > > For each call site, objtool creates a struct in .static_call_sites: > > > > > > struct static_call_site { > > > s32 addr; > > > s32 key; > > > }; > > > > > > In order to do that, it needs to create a relocation which references > > > the key symbol. If the key is defined in another .o file, then the > > > current .o will not have an ELF symbol associated with the key. The > > > __ADDRESSABLE(key) thing tells GCC to leave the key symbol in the .o > > > file, even though it's not referenced anywhere. That makes objtool's > > > job easier, so it doesn't have to edit the symbol table. > > > > > > I could add a comment saying as much, though it's hard to explain it in > > > fewer words than I just did :-) > > > > Does this have to do with adding the references by relative address? > > > > In record_mcount, I just picked an existing symbol and referenced that.. > > But perhaps this is a cleaner way. > > I think recordmcount is different. It creates references (in > __mcount_loc) to functions which are already in the object file, so they > already have symbols associated with them. 
> > But in this case, when objtool is creating references, the symbol it > needs to reference is outside the .o file, so there's no symbol to > associate it with. > The __ADDRESSABLE() appears to fail if you have a header with a DECLARE_STATIC_CALL() that is included where the DEFINE_STATIC_CALL() is, because I'm getting this: In file included from : /work/git/linux-trace.git/include/linux/compiler.h:285:11: error: redefinition of ‘__addressable___tp_func_sys_enter40’ __PASTE(__addressable_##sym, __LINE__) = (void *) ^~ /work/git/linux-trace.git/include/linux/compiler_types.h:53:23: note: in definition of macro ‘___PASTE’ #define ___PASTE(a,b) a##b ^ /work/git/linux-trace.git/include/linux/compiler.h:285:3: note: in expansion of macro ‘__PASTE’ __PASTE(__addressable_##sym, __LINE__) = (void *) ^~~ /work/git/linux-trace.git/include/linux/static_call.h:112:2: note: in expansion of macro ‘__ADDRESSABLE’ __ADDRESSABLE(key) ^ /work/git/linux-trace.git/include/linux/static_call.h:115:2: note: in expansion of macro ‘DECLARE_STATIC_CALL’ DECLARE_STATIC_CALL(key, _func);\ ^~~ /work/git/linux-trace.git/include/linux/tracepoint.h:310:2: note: in expansion of macro ‘DEFINE_STATIC_CALL’ DEFINE_STATIC_CALL(__tp_func_##name, __tracepoint_iter_##name); ^~ /work/git/linux-trace.git/include/trace/define_trace.h:42:2: note: in expansion of macro ‘DEFINE_TRACE_FN’ DEFINE_TRACE_FN(name, reg, unreg, PARAMS(proto), PARAMS(args)) ^~~ /work/git/linux-trace.git/include/trace/events/syscalls.h:18:1: note: in expansion of macro ‘TRACE_EVENT_FN’ TRACE_EVENT_FN(sys_enter, ^~ /work/git/linux-trace.git/include/linux/compiler.h:285:11: note: previous definition of ‘__addressable___tp_func_sys_enter40’ was here __PASTE(__addressable_##sym, __LINE__) = (void *) ^~ /work/git/linux-trace.git/include/linux/compiler_types.h:53:23: note: in definition of macro ‘___PASTE’ #define ___PASTE(a,b) a##b ^ /work/git/linux-trace.git/include/linux/compiler.h:285:3: note: in expansion of macro ‘__PASTE’ 
__PASTE(__addressable_##sym, __LINE__) = (void *) ^~~ /work/git/linux-trace.git/include/linux/static_call.h:112:2: note: in expansion of macro ‘__ADDRESSABLE’ __ADDRESSABLE(key) ^ /work/git/linux-trace.git/include/linux/tracepoint.h:234:2: note: in expansion of macro ‘DECLARE_STATIC_CALL’ DECLARE_STATIC_CALL(__tp_func_##name, __tracepoint_iter_##name); \ ^~~ /work/git/linux-trace.git/include/linux/tracepoint.h:421:2: note: in expansion of macro ‘__DECLARE_TRACE’ __DECLARE_TRACE(name, PARAMS(proto), PARAMS(args), \ ^~~ /work/git/linux-trace.git/include/linux/tracepoint.h:560:2: note: in expansion of macro ‘DECLARE_TRACE’ DECLARE_TRACE(name, PARAMS(proto), PARAMS(args)) ^
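The collision Steven hits comes from `__ADDRESSABLE()` keying its symbol name off `__LINE__`: when `DEFINE_STATIC_CALL()` expands `DECLARE_STATIC_CALL()` for the same symbol on the same source line, both expansions produce the same identifier. A `__COUNTER__`-based variant sidesteps this, since `__COUNTER__` (a GCC/Clang extension) increments per expansion rather than per line. This is a sketch, not the kernel's macro; the real one also marks the symbol `used` and places it in a discard section.

```c
#include <assert.h>

static int dummy; /* stand-in for a static_call key */

/* Indirection so __COUNTER__ is expanded before pasting. */
#define __PASTE2(a, b) a##b
#define __PASTE(a, b) __PASTE2(a, b)

/* Emit a uniquely named pointer that keeps `sym` addressable.
 * With __LINE__ here instead of __COUNTER__, the two expansions
 * below would both name __addressable_dummy<line> and fail. */
#define KEEP_SYM(sym) \
	static void *__PASTE(__addressable_##sym, __COUNTER__) \
		__attribute__((unused)) = (void *)&sym

/* Two expansions on one line, as in the nested DEFINE/DECLARE
 * case from tracepoint.h: compiles fine with __COUNTER__. */
KEEP_SYM(dummy); KEEP_SYM(dummy);
```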
Re: [PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd
On Fri, Nov 09, 2018 at 12:36:34PM -0800, Andrew Morton wrote: > On Wed, 7 Nov 2018 20:15:36 -0800 "Joel Fernandes (Google)" > wrote: > > > Android uses ashmem for sharing memory regions. We are looking forward > > to migrating all usecases of ashmem to memfd so that we can possibly > > remove the ashmem driver in the future from staging while also > > benefiting from using memfd and contributing to it. Note staging drivers > > are also not ABI and generally can be removed at anytime. > > > > One of the main usecases Android has is the ability to create a region > > and mmap it as writeable, then add protection against making any > > "future" writes while keeping the existing already mmap'ed > > writeable-region active. This allows us to implement a usecase where > > receivers of the shared memory buffer can get a read-only view, while > > the sender continues to write to the buffer. > > See CursorWindow documentation in Android for more details: > > https://developer.android.com/reference/android/database/CursorWindow > > It appears that the memfd_create and fcntl manpages will require > updating. Please attend to this at the appropriate time? Yes, I am planning to send those out shortly. I finished working on them. Also just to let you know, I posted a fix for the security issue Jann Horn reported and requested him to test it: https://lore.kernel.org/lkml/20181109234636.ga136...@google.com/T/#m8d9d185e6480d095f0ab8f84bcb103892181f77d This fix along with the 2 other patches I posted in v3 are all that's needed. thanks! - Joel
Re: Re: [PATCH V3] binder: ipc namespace support for android binder(Internet mail)
> > > > If IPC_NS is disabled, "current->nsproxy->ipc_ns" will still exist; it will > > be a static > > reference to "init_ipc_ns" (in ipc/msgutil.c, not defined in binder.c by > > me) with > > no namespace-ization. You will get the same one in all processes, > > everything is > > the same as without this patch. > > except, as far as I can tell, binder_init_ns() would never have been > called on it so the mutex and list heads are not initialized so it's > completely broken. Am I missing something? How do those fields get > initialized in this case? > @@ -5832,8 +5888,12 @@ static int __init binder_init(void) > goto err_init_binder_device_failed; > } > > - return ret; > + ret = binder_init_ns(&init_ipc_ns); > + if (ret) > + goto err_init_namespace_failed; > > + return ret; They are initialized here. - choury -
Re: Re: [PATCH V3] binder: ipc namespace support for android binder
On Fri, Nov 9, 2018 at 7:09 PM chouryzhou(周威) wrote: > > > > > I still don't understand the dependencies on SYSVIPC or POSIX_MQUEUE. > > It seems like this mechanism would work even if both are disabled -- > > as long as IPC_NS is enabled. Seems cleaner to change init/Kconfig and > > allow IPC_NS if CONFIG_ANDROID_BINDER_IPC and change this line to > > "#ifndef CONFIG_IPC_NS" > > Let me explain it in detail. If SYSVIPC and IPC_NS are both defined, > current->nsproxy->ipc_ns will save the ipc namespace variables. We just use > it. If SYSVIPC (or POSIX_MQUEUE) is defined while IPC_NS is not set, > current->nsproxy->ipc_ns will always refer to init_ipc_ns in ipc/msgutil.c, > which is also fine for us. But if neither SYSVIPC nor POSIX_MQUEUE is set > (IPC_NS can't be set in this situation), there is no current->nsproxy->ipc_ns. > We make a fake init_ipc_ns here and use it. Yes, I can read the code. I'm wondering specifically about SYSVIPC and POSIX_MQUEUE. Even with your code changes, binder has no dependency on these configs. Why are you creating one? The actual dependency with your changes is on "current->nsproxy->ipc_ns" being initialized for binder -- which is dependent on CONFIG_IPC_NS being enabled, isn't it? If SYSVIPC or POSIX_MQUEUE are enabled, but IPC_NS is disabled, does this work? > > > why eliminate name? The string name is very useful for differentiating > > normal "framework" binder transactions vs "hal" or "vendor" > > transactions. If we just have a device number it will be hard to tell > > in the logs even which namespace it belongs to. We need to keep both > > the "name" (for which there might be multiple in each ns) and some > > indication of which namespace this is. Maybe we assign some sort of > > namespace ID during binder_init_ns(). > > I will retain the name of the device. The inum of ipc_ns can be treated as > namespace ID in ipc_ns. > > > As mentioned above, we need to retain name and probably also want a ns > > id of some sort. 
So context now has 3 components if IPC_NS, so maybe a > > helper function to print context like: > > > > static void binder_seq_print_context(struct seq_file *m, struct > > binder_context *context) > > { > > #ifdef CONFIG_IPC_NS > > seq_printf(m, "%d-%d-%s", context->ns_id, context->device, > > context->name); > > #else > > seq_printf(m, "%s", context->name); > > #endif > > } > > > > (same comment below everywhere context is printed) > > > > Should these debugfs nodes be ns aware and only print debugging > > info for the context of the thread accessing the node? If so, we would > > also want to be namespace-aware when printing pids. > > Nowadays, debugfs is not namespace-ized, and pid namespace is not associated > with ipc namespace. Will it be more complicated to debug this if we just > print > the info for the current thread? Because we would have to enter the ipc namespace > first. But adding the ipc inum should be no problem. With the name and ns id, debugging would be fine from the host-side. I don't understand the container use cases enough to know if you need to be able to debug binder transaction failures from within the container -- in which case it seems like you'd want the container-version of the PIDs -- but obviously this depends on how the containers are set up and what the use-cases really are. I'm ok with leaving that for a later patch. > > - choury - > >
Re: [RFC 1/4] pwm: sifive: Add DT documentation for SiFive PWM Controller.
On 10/16/18 3:04 PM, Thierry Reding wrote: > On Tue, Oct 16, 2018 at 10:31:42AM -0700, Paul Walmsley wrote: >> On 10/16/18 4:01 AM, Thierry Reding wrote: >>> On Mon, Oct 15, 2018 at 03:57:35PM -0700, Atish Patra wrote: On 10/10/18 6:49 AM, Thierry Reding wrote: > On Tue, Oct 09, 2018 at 11:51:22AM -0700, Atish Patra wrote: >> +Required properties: >> +- compatible: should be one of >> +"sifive,fu540-c000-pwm0","sifive,pwm0". > What's the '0' in here? A version number? > I think yes. Since the FU540 is the first Linux-capable RISC-V core, the SiFive folks decided to mark it as version 0. @Wesly: Please correct me if I am wrong. >>> It seems fairly superfluous to me to have a version number in addition to >>> the fu540-c000, which already seems to be the core plus some sort of >>> part number. Do you really expect there to be any changes in the SoC >>> that would require a different compatible string at this point? If the >>> SoC has taped out, how will you ever get a different version of the PWM >>> IP in it? >>> >>> I would expect any improvements or changes to the PWM IP to show up in a >>> different SoC generation, at which point it would be something like >>> "sifive,fu640-c000" maybe, or perhaps "sifive,fu540-d000", or whatever >>> the numbering is. >> >> The "0" suffix refers to a revision number for the underlying PWM IP block. >> >> It's certainly important to keep that version number on the "sifive,pwm0" >> compatible string that doesn't have the chip name associated with it. > Isn't the hardware identified by "sifive,pwm0" and "sifive,fu540-c000" > effectively identical? The intention was that the "sifive,pwm0" compatible string specifies a register interface and programming model that the IP block exposes to the software, rather than a particular underlying hardware implementation. That is in contrast to a string like "sifive,fu540-c000-pwm" which might activate particular workarounds or quirks that are specific to the integration of the IP block on a given SoC. 
The idea is that, for this and similar open-source hardware IP blocks, the driver code can just match on a generic "sifive,pwm0" compatible string. The SoC DT data would include both the SoC-specific "sifive,fu540-c000-pwm0" and the common interface "sifive,pwm0". But the driver would only need the SoC-specific compatible string if the SoC wound up needing some SoC-specific quirks. In the past, some folks have had a problem with that idea, since for closed-source IP blocks, it's been difficult to determine what changes went into a specific version of the IP block. Thus folks generating data for later SoCs usually specify a compatible string for another, older, SoC that seems to have the desired behavior. But since this particular IP block has open-source RTL, and contains a "sifive,pwmX" version string in the RTL itself: https://github.com/sifive/sifive-blocks/blob/master/src/main/scala/devices/pwm/PWM.scala#L74 ... it's straightforward to see what interface the hardware exposes to the software for a given compatible string. > Is there a need to have two compatible strings > that refer to the exact same hardware? There's no intention that "sifive,pwm0" and "sifive,fu540-c000-pwm0" refer to the same hardware; just the same software interface and programming model. Even now, it's usually pretty unlikely that two different SoCs that refer to (say) "nvidia,tegra20-pwm" contain the same hardware, since differences in synthesis, place and route, ECOs, and integration change the actual realization of the hardware. Some folks interpreted that compatible string reuse as implying the same "hardware" is in use on both SoCs, but we're really just identifying a software interface. >> As to whether there could ever be a FU540-C000 part with different IP block >> versions on it: FU540-C000 is ultimately a marketing name. 
While >> theoretically we shouldn't have another "FU540-C000" chip with different >> peripheral IP block versions on it, I don't think any engineer can guarantee >> that it won't happen. > I would argue that if at some point there was indeed a chip with the > same name but a different IP block version in it, we can figure out what > to call it. Sure there are no guarantees, but it's still fairly unlikely > in my opinion, so I personally wouldn't worry about this up front. > > Anyway, I don't feel strongly either way, I'm just pointing out that > this is somewhat unusual. If you want to keep it, feel free to. Thanks for the review, Thierry - - Paul
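The dual-compatible scheme described above would look something like the following in the SoC DT data. This is a sketch, not copied from the actual FU540 .dtsi: the node name, register layout, clock phandle, and cell counts are illustrative assumptions.

```dts
/* Hypothetical PWM node: the SoC-specific compatible comes first,
 * followed by the generic programming-model string that a driver
 * would ordinarily match on. */
pwm0: pwm@10020000 {
	compatible = "sifive,fu540-c000-pwm0", "sifive,pwm0";
	reg = <0x0 0x10020000 0x0 0x1000>;
	clocks = <&tlclk>;
	#pwm-cells = <3>;
};
```

A driver matching only "sifive,pwm0" binds here as-is; the SoC-specific string only matters if FU540-specific quirks ever need to be activated, since compatible strings are matched in order from most to least specific.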
Re: Re: [PATCH V3] binder: ipc namespace support for android binder(Internet mail)
On Fri, Nov 9, 2018 at 8:43 PM chouryzhou(周威) wrote: > > If IPC_NS is disabled, "current->nsproxy->ipc_ns" will still exist; it will > be a static > reference to "init_ipc_ns" (in ipc/msgutil.c, not defined in binder.c by me) > with > no namespace-ization. You will get the same one in all processes, everything > is > the same as without this patch. Except, as far as I can tell, binder_init_ns() would never have been called on it, so the mutex and list heads are not initialized and it's completely broken. Am I missing something? How do those fields get initialized in this case? > > - choury - >
RE: [RFC PATCH v4 11/13] mm: parallelize deferred struct page initialization within each node
> -----Original Message----- > From: linux-kernel-ow...@vger.kernel.org On Behalf Of Daniel Jordan > Sent: Monday, November 05, 2018 10:56 AM > Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page > initialization within each node > > ... The kernel doesn't > know the memory bandwidth of a given system to get the most efficient > number of threads, so there's some guesswork involved. The ACPI HMAT (Heterogeneous Memory Attribute Table) is designed to report that kind of information, and could facilitate automatic tuning. There was discussion last year about kernel support for it: https://lore.kernel.org/lkml/20171214021019.13579-1-ross.zwis...@linux.intel.com/ > In testing, a reasonable value turned out to be about a quarter of the > CPUs on the node. ... > + /* > + * We'd like to know the memory bandwidth of the chip to > calculate the > + * most efficient number of threads to start, but we can't. > + * In testing, a good value for a variety of systems was a > quarter of the CPUs on the node. > + */ > + nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4); You might want to base that calculation on and limit the threads to physical cores, not hyperthreaded cores. --- Robert Elliott, HPE Persistent Memory
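The thread-count heuristic quoted above is plain integer arithmetic, and can be sketched standalone. The function names below are mine, not from the patch; only the DIV_ROUND_UP semantics and the divide-by-four heuristic come from the quoted code.

```c
#include <assert.h>

/* Mirrors the kernel's DIV_ROUND_UP(n, d) macro: ceiling division
 * on unsigned integers. */
static unsigned int div_round_up(unsigned int n, unsigned int d)
{
	return (n + d - 1) / d;
}

/* The patch's heuristic: use about a quarter of the node's CPUs for
 * deferred struct-page init.  Rounding up means a node with fewer
 * than four CPUs still gets one thread. */
unsigned int nr_init_threads(unsigned int node_cpus)
{
	return div_round_up(node_cpus, 4);
}
```

On a 64-CPU node this picks 16 threads, while a 1-CPU node still gets a single thread; Robert's point is that `node_cpus` here counts hyperthreads, so an SMT-2 system would start twice as many threads per physical core as perhaps intended.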
[PATCH v5 2/3] dt-bindings: Add vendor prefix for PHICOMM Co., Ltd.
PHICOMM Co., Ltd. is a hardware provider headquartered in Shanghai; its products include routers and smart devices. Signed-off-by: He Yangxuan --- Documentation/devicetree/bindings/vendor-prefixes.txt | 1 + 1 file changed, 1 insertion(+) diff --git a/Documentation/devicetree/bindings/vendor-prefixes.txt b/Documentation/devicetree/bindings/vendor-prefixes.txt index 4b1a2a8fc..8c413d8cc 100644 --- a/Documentation/devicetree/bindings/vendor-prefixes.txt +++ b/Documentation/devicetree/bindings/vendor-prefixes.txt @@ -296,6 +296,7 @@ panasonic Panasonic Corporation parade Parade Technologies Inc. pericom Pericom Technology Inc. pervasive Pervasive Displays, Inc. +phicomm PHICOMM Co., Ltd. phytec PHYTEC Messtechnik GmbH picochip Picochip Ltd pine64 Pine64 -- 2.11.0
[PATCH v5 3/3] dt-bindings: arm: amlogic: Add Phicomm N1
Add bindings documentation for the Phicomm N1. Signed-off-by: He Yangxuan --- Documentation/devicetree/bindings/arm/amlogic.txt | 1 + 1 file changed, 1 insertion(+) diff --git a/Documentation/devicetree/bindings/arm/amlogic.txt b/Documentation/devicetree/bindings/arm/amlogic.txt index 4498292b8..93177f38e 100644 --- a/Documentation/devicetree/bindings/arm/amlogic.txt +++ b/Documentation/devicetree/bindings/arm/amlogic.txt @@ -91,6 +91,7 @@ Board compatible values (alphabetically, grouped by SoC): - "amlogic,p230" (Meson gxl s905d) - "amlogic,p231" (Meson gxl s905d) + - "phicomm,n1" (Meson gxl s905d) - "amlogic,p241" (Meson gxl s805x) -- 2.11.0
[PATCH v5 0/3] arm64: dts: meson-gxl: add support for phicomm n1
This patch adds support for the Phicomm N1, a device based on the P230 reference design. The PHY is an RTL8211F; Energy Efficient Ethernet (EEE) needs to be disabled to make it stable. This box doesn't have CVBS, so the related section is disabled in the device tree. Changes since v4: - include device tree Changes since v3: - remove external phy from device tree, will disable EEE in meson-gxl-s905d-p230.dts Changes since v2: - remove changes section in commit message - add commit message for patch of vendor prefix and bindings documentation Changes since v1: - rewrite external phy section - add phicomm vendor prefix - add phicomm n1 in Documentation/devicetree/bindings/arm/amlogic.txt He Yangxuan (3): arm64: dts: meson-gxl: add support for phicomm n1 dt-bindings: Add vendor prefix for PHICOMM Co., Ltd. dt-bindings: arm: amlogic: Add Phicomm N1 Documentation/devicetree/bindings/arm/amlogic.txt | 1 + .../devicetree/bindings/vendor-prefixes.txt | 1 + arch/arm64/boot/dts/amlogic/Makefile| 1 + .../boot/dts/amlogic/meson-gxl-s905d-phicomm-n1.dts | 21 + 4 files changed, 24 insertions(+) create mode 100644 arch/arm64/boot/dts/amlogic/meson-gxl-s905d-phicomm-n1.dts -- 2.11.0
[PATCH v5 1/3] arm64: dts: meson-gxl: add support for phicomm n1
This patch adds support for the Phicomm N1, a device based on the P230 reference design. This box doesn't have CVBS, so the related section is disabled in the device tree. Signed-off-by: He Yangxuan --- arch/arm64/boot/dts/amlogic/Makefile| 1 + .../boot/dts/amlogic/meson-gxl-s905d-phicomm-n1.dts | 21 + 2 files changed, 22 insertions(+) create mode 100644 arch/arm64/boot/dts/amlogic/meson-gxl-s905d-phicomm-n1.dts diff --git a/arch/arm64/boot/dts/amlogic/Makefile b/arch/arm64/boot/dts/amlogic/Makefile index c31f29d66..49f3ac5d8 100644 --- a/arch/arm64/boot/dts/amlogic/Makefile +++ b/arch/arm64/boot/dts/amlogic/Makefile @@ -18,6 +18,7 @@ dtb-$(CONFIG_ARCH_MESON) += meson-gxl-s905x-nexbox-a95x.dtb dtb-$(CONFIG_ARCH_MESON) += meson-gxl-s905x-p212.dtb dtb-$(CONFIG_ARCH_MESON) += meson-gxl-s905d-p230.dtb dtb-$(CONFIG_ARCH_MESON) += meson-gxl-s905d-p231.dtb +dtb-$(CONFIG_ARCH_MESON) += meson-gxl-s905d-phicomm-n1.dtb dtb-$(CONFIG_ARCH_MESON) += meson-gxl-s805x-p241.dtb dtb-$(CONFIG_ARCH_MESON) += meson-gxl-s905w-p281.dtb dtb-$(CONFIG_ARCH_MESON) += meson-gxl-s905w-tx3-mini.dtb diff --git a/arch/arm64/boot/dts/amlogic/meson-gxl-s905d-phicomm-n1.dts b/arch/arm64/boot/dts/amlogic/meson-gxl-s905d-phicomm-n1.dts new file mode 100644 index 0..9a8a8a7e4 --- /dev/null +++ b/arch/arm64/boot/dts/amlogic/meson-gxl-s905d-phicomm-n1.dts @@ -0,0 +1,21 @@ +// SPDX-License-Identifier: (GPL-2.0+ OR MIT) +/* + * Copyright (c) 2018 He Yangxuan + */ + +/dts-v1/; + +#include "meson-gxl-s905d-p230.dts" + +/ { + compatible = "phicomm,n1", "amlogic,s905d", "amlogic,meson-gxl"; + model = "Phicomm N1"; + + cvbs-connector { + status = "disabled"; + }; +}; + +&cvbs_vdac_port { + status = "disabled"; +}; -- 2.11.0
Re: KASAN: use-after-free Read in task_is_descendant
syzbot has found a reproducer for the following crash on: HEAD commit:3541833fd1f2 Merge tag 's390-4.20-2' of git://git.kernel.o.. git tree: upstream console output: https://syzkaller.appspot.com/x/log.txt?x=16d3cad540 kernel config: https://syzkaller.appspot.com/x/.config?x=8f559fee2fc3375a dashboard link: https://syzkaller.appspot.com/bug?extid=a9ac39bf55329e206219 compiler: gcc (GCC) 8.0.1 20180413 (experimental) syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1494bb0b40 IMPORTANT: if you fix the bug, please add the following tag to the commit: Reported-by: syzbot+a9ac39bf55329e206...@syzkaller.appspotmail.com IPv6: ADDRCONF(NETDEV_UP): veth1: link is not ready IPv6: ADDRCONF(NETDEV_CHANGE): veth1: link becomes ready IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready 8021q: adding VLAN 0 to HW filter on device team0 == BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:182 [inline] BUG: KASAN: use-after-free in task_is_descendant.part.3+0x610/0x670 security/yama/yama_lsm.c:295 Read of size 8 at addr 8801c46f24e0 by task syz-executor2/9973 CPU: 0 PID: 9973 Comm: syz-executor2 Not tainted 4.20.0-rc1+ #107 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x244/0x39d lib/dump_stack.c:113 print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433 __read_once_size include/linux/compiler.h:182 [inline] task_is_descendant.part.3+0x610/0x670 security/yama/yama_lsm.c:295 task_is_descendant security/yama/yama_lsm.c:282 [inline] yama_ptrace_access_check+0x215/0x10fc security/yama/yama_lsm.c:371 security_ptrace_access_check+0x54/0xb0 security/security.c:271 __ptrace_may_access+0x5c8/0x980 kernel/ptrace.c:336 ptrace_attach+0x1fa/0x640 kernel/ptrace.c:389 __do_sys_ptrace 
kernel/ptrace.c:1139 [inline] __se_sys_ptrace kernel/ptrace.c:1119 [inline] __x64_sys_ptrace+0x229/0x260 kernel/ptrace.c:1119 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x457569 Code: fd b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:7f2ed4dbfc78 EFLAGS: 0246 ORIG_RAX: 0065 RAX: ffda RBX: 0004 RCX: 00457569 RDX: RSI: 021d RDI: 4206 RBP: 0072bf00 R08: R09: R10: R11: 0246 R12: 7f2ed4dc06d4 R13: 004c33bd R14: 004d50e0 R15: Allocated by task 5873: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490 kmem_cache_alloc_node+0x144/0x730 mm/slab.c:3644 alloc_task_struct_node kernel/fork.c:158 [inline] dup_task_struct kernel/fork.c:843 [inline] copy_process+0x2026/0x87a0 kernel/fork.c:1751 _do_fork+0x1cb/0x11d0 kernel/fork.c:2216 __do_sys_clone kernel/fork.c:2323 [inline] __se_sys_clone kernel/fork.c:2317 [inline] __x64_sys_clone+0xbf/0x150 kernel/fork.c:2317 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 9: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528 __cache_free mm/slab.c:3498 [inline] kmem_cache_free+0x83/0x290 mm/slab.c:3760 free_task_struct kernel/fork.c:163 [inline] free_task+0x16e/0x1f0 kernel/fork.c:457 __put_task_struct+0x2e6/0x620 kernel/fork.c:730 put_task_struct include/linux/sched/task.h:96 [inline] delayed_put_task_struct+0x2ff/0x4c0 kernel/exit.c:181 __rcu_reclaim kernel/rcu/rcu.h:240 [inline] rcu_do_batch kernel/rcu/tree.c:2437 [inline] invoke_rcu_callbacks kernel/rcu/tree.c:2716 [inline] rcu_process_callbacks+0x100a/0x1ac0 
kernel/rcu/tree.c:2697 __do_softirq+0x308/0xb7e kernel/softirq.c:292 The buggy address belongs to the object at 8801c46f2000 which belongs to the cache task_struct(65:syz2) of size 6080 The buggy address is located 1248 bytes inside of 6080-byte region [8801c46f2000, 8801c46f37c0) The buggy address belongs to the page: page:ea000711bc80 count:1 mapcount:0 mapping:8801c204fc80 index:0x0 compound_mapcount: 0 flags: 0x2fffc010200(slab|head) raw: 02fffc010200 ea00073cf808 ea000697c908 8801c204fc80 raw: 8801c46f2000 00010001 8801d8404a80 page dumped because:
[PATCH] tracehook: fix documentation for tracehook_report_syscall_entry()
tracehook_report_syscall_entry() is called not only if %TIF_SYSCALL_TRACE is set, but also if %TIF_SYSCALL_EMU is set, as appears from x86's entry code. Signed-off-by: Elvira Khabirova --- include/linux/tracehook.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h index 40b0b4c1bf7b..df20f8bdbfa3 100644 --- a/include/linux/tracehook.h +++ b/include/linux/tracehook.h @@ -83,8 +83,8 @@ static inline int ptrace_report_syscall(struct pt_regs *regs) * tracehook_report_syscall_entry - task is about to attempt a system call * @regs: user register state of current task * - * This will be called if %TIF_SYSCALL_TRACE has been set, when the - * current task has just entered the kernel for a system call. + * This will be called if %TIF_SYSCALL_TRACE or %TIF_SYSCALL_EMU have been set, + * when the current task has just entered the kernel for a system call. * Full user register state is available here. Changing the values * in @regs can affect the system call number and arguments to be tried. * It is safe to block here, preventing the system call from beginning. -- 2.19.1
Re: [PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd
On Fri, Nov 09, 2018 at 10:19:03PM +0100, Jann Horn wrote: > On Fri, Nov 9, 2018 at 10:06 PM Jann Horn wrote: > > On Fri, Nov 9, 2018 at 9:46 PM Joel Fernandes (Google) > > wrote: > > > Android uses ashmem for sharing memory regions. We are looking forward > > > to migrating all usecases of ashmem to memfd so that we can possibly > > > remove the ashmem driver in the future from staging while also > > > benefiting from using memfd and contributing to it. Note staging drivers > > > are also not ABI and generally can be removed at anytime. > > > > > > One of the main usecases Android has is the ability to create a region > > > and mmap it as writeable, then add protection against making any > > > "future" writes while keeping the existing already mmap'ed > > > writeable-region active. This allows us to implement a usecase where > > > receivers of the shared memory buffer can get a read-only view, while > > > the sender continues to write to the buffer. > > > See CursorWindow documentation in Android for more details: > > > https://developer.android.com/reference/android/database/CursorWindow > > > > > > This usecase cannot be implemented with the existing F_SEAL_WRITE seal. > > > To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal > > > which prevents any future mmap and write syscalls from succeeding while > > > keeping the existing mmap active. > > > > Please CC linux-api@ on patches like this. If you had done that, I > > might have criticized your v1 patch instead of your v3 patch... > > > > > The following program shows the seal > > > working in action: > > [...] > > > Cc: jr...@google.com > > > Cc: john.stu...@linaro.org > > > Cc: tk...@google.com > > > Cc: gre...@linuxfoundation.org > > > Cc: h...@infradead.org > > > Reviewed-by: John Stultz > > > Signed-off-by: Joel Fernandes (Google) > > > --- > > [...] 
> > > diff --git a/mm/memfd.c b/mm/memfd.c
> > > index 2bb5e257080e..5ba9804e9515 100644
> > > --- a/mm/memfd.c
> > > +++ b/mm/memfd.c
> > [...]
> > > @@ -219,6 +220,25 @@ static int memfd_add_seals(struct file *file, unsigned int seals)
> > >  		}
> > >  	}
> > >
> > > +	if ((seals & F_SEAL_FUTURE_WRITE) &&
> > > +	    !(*file_seals & F_SEAL_FUTURE_WRITE)) {
> > > +		/*
> > > +		 * The FUTURE_WRITE seal also prevents growing and shrinking
> > > +		 * so we need them to be already set, or requested now.
> > > +		 */
> > > +		int test_seals = (seals | *file_seals) &
> > > +				 (F_SEAL_GROW | F_SEAL_SHRINK);
> > > +
> > > +		if (test_seals != (F_SEAL_GROW | F_SEAL_SHRINK)) {
> > > +			error = -EINVAL;
> > > +			goto unlock;
> > > +		}
> > > +
> > > +		spin_lock(&file->f_lock);
> > > +		file->f_mode &= ~(FMODE_WRITE | FMODE_PWRITE);
> > > +		spin_unlock(&file->f_lock);
> > > +	}
> >
> > So you're fiddling around with the file, but not the inode? How are
> > you preventing code like the following from re-opening the file as
> > writable?
> >
> > $ cat memfd.c
> > #define _GNU_SOURCE
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> > #include <fcntl.h>
> > #include <err.h>
> > #include <sys/syscall.h>
> >
> > int main(void) {
> >   int fd = syscall(__NR_memfd_create, "testfd", 0);
> >   if (fd == -1) err(1, "memfd");
> >   char path[100];
> >   sprintf(path, "/proc/self/fd/%d", fd);
> >   int fd2 = open(path, O_RDWR);
> >   if (fd2 == -1) err(1, "reopen");
> >   printf("reopen successful: %d\n", fd2);
> > }
> > $ gcc -o memfd memfd.c
> > $ ./memfd
> > reopen successful: 4
> > $
> >
> > That aside: I wonder whether a better API would be something that
> > allows you to create a new readonly file descriptor, instead of
> > fiddling with the writability of an existing fd.
>
> My favorite approach would be to forbid open() on memfds, hope that
> nobody notices the tiny API break, and then add an ioctl for "reopen
> this memfd with reduced permissions" - but that's just my personal
> opinion.
I did something along these lines and it fixes the issue, but I forbid open of the memfd only when the F_SEAL_FUTURE_WRITE seal is in place. So then it's not an ABI break, because this is a brand new seal. That seems the least intrusive solution, and it works. Do you mind testing it? I'll add your Tested-by to the new fix. The patch is based on top of this series.

---8<---

From: "Joel Fernandes (Google)"
Subject: [PATCH] mm/memfd: Fix possible promotion to writeable of sealed memfd

Jann Horn found that reopening an F_SEAL_FUTURE_WRITE sealed memfd through the /proc/self/fd/N symlink as writeable succeeds. The simplest fix without causing ABI breakage or ugly VFS hacks is to simply deny all opens on F_SEAL_FUTURE_WRITE sealed fds.

Reported-by: Jann Horn
Signed-off-by: Joel Fernandes (Google)
---
 mm/shmem.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/mm/shmem.c
Re: Re: [PATCH V3] binder: ipc namespace support for android binder
>
> I still don't understand the dependencies on SYSVIPC or POSIX_MQUEUE.
> It seems like this mechanism would work even if both are disabled --
> as long as IPC_NS is enabled. Seems cleaner to change init/Kconfig and
> allow IPC_NS if CONFIG_ANDROID_BINDER_IPC and change this line to
> "#ifndef CONFIG_IPC_NS"

Let me explain it in detail. If SYSVIPC and IPC_NS are both defined, current->nsproxy->ipc_ns will hold the ipc namespace variables; we just use it. If SYSVIPC (or POSIX_MQUEUE) is defined while IPC_NS is not set, current->nsproxy->ipc_ns will always refer to init_ipc_ns in ipc/msgutil.c, which is also fine for us. But if neither SYSVIPC nor POSIX_MQUEUE is set (IPC_NS can't be set in this situation), there is no current->nsproxy->ipc_ns. We make a fake init_ipc_ns here and use it.

> why eliminate name? The string name is very useful for differentiating
> normal "framework" binder transactions vs "hal" or "vendor"
> transactions. If we just have a device number it will be hard to tell
> in the logs even which namespace it belongs to. We need to keep both
> the "name" (for which there might be multiple in each ns) and some
> indication of which namespace this is. Maybe we assign some sort of
> namespace ID during binder_init_ns().

I will retain the name of the device. The inum of the ipc_ns can be treated as the namespace ID.

> As mentioned above, we need to retain name and probably also want a ns
> id of some sort. So context now has 3 components if IPC_NS, so maybe a
> helper function to print context like:
>
> static void binder_seq_print_context(struct seq_file *m, struct binder_context *context)
> {
> #ifdef CONFIG_IPC_NS
>         seq_printf(m, "%d-%d-%s", context->ns_id, context->device, context->name);
> #else
>         seq_printf(m, "%d", context->name);
> #endif
> }
>
> (same comment below everywhere context is printed)
>
> Should these debugfs nodes be ns aware and only print debugging
> info for the context of the thread accessing the node?
> If so, we would
> also want to be namespace-aware when printing pids.

Nowadays, debugfs is not namespace-ized, and the pid namespace is not associated with the ipc namespace. Wouldn't it be more complicated to debug if we only print the info for the current thread's context? We would have to enter the ipc namespace first. But adding the ipc inum should be no problem.

- choury -
Re: [PATCH 3/3] lockdep: Use line-buffered printk() for lockdep messages.
On 2018/11/10 0:43, Petr Mladek wrote:
>> + * Line buffered printk() tries to assign a buffer when printk() from a new
>> + * context identifier comes in. And it automatically releases that buffer when
>> + * one of three conditions listed below became true.
>> + *
>> + * (1) printk() from that context identifier emitted '\n' as the last
>> + *     character of output.
>> + * (2) printk() from that context identifier tried to print a too long line
>> + *     which cannot be stored into a buffer.
>> + * (3) printk() from a new context identifier noticed that some context
>> + *     identifier is reserving a buffer for more than 10 seconds without
>> + *     emitting '\n'.
>> + *
>> + * Since (3) is based on a heuristic that somebody forgot to emit '\n' as the
>> + * last character of output(), pr_cont()/KERN_CONT users are expected to emit
>> + * '\n' within 10 seconds even if they reserved a buffer.
>
> This is my main concern about this approach. It is so easy to omit
> the final '\n'.

If it is so easy to forget the final '\n', there will be a lot of implicit pr_cont() users (because pr_cont() assumes that the previous printk() omitted the final '\n'), and "I am going to investigate much more pr_cont() users." will be insufficient for reaching a meaningful conclusion. Checking for "lack of the final '\n'" means that we need to check all printk() users who do not emit the final '\n' and evaluate whether there is a possibility that a subsequent printk() will not be called from that context due to e.g. conditional branches. That is an impossible task for anybody, for there might be out-of-tree code doing it.

> They are currently delayed until another printk(). Even this is bad.
> Unfortunately we could not setup timer from printk() because it
> would add more locks into the game.

We could use an interval timer for flushing an incomplete line. But updating printk() users to always end with '\n' would be preferable.

> This patch will make it worse.
> They might get delayed by 10s or even
> more. Many other lines might appear in between. Also the code is
> more tricky[*].

If there are a lot of implicit pr_cont() users (who omit the final '\n' in the previous printk()), we can try to find them (and fix them) by reporting the location of the printk() which omitted the final '\n', instead of flushing the partial line from different processes, when try_buffered_printk() (or an interval timer) finds that some buffer has been holding a partial line for more than 10 seconds.

> Sigh, I am really unhappy how the buffered_printk() conversion
> looks in lockdep.c. But I still think that the approach is more
> reliable. I am going to investigate much more pr_cont() users.

If there is a possibility that a subsequent printk() will not be called from that context due to e.g. conditional branches, we will need to flush after the previous printk() in order to eliminate the possibility of failing to flush. That amounts to asking printk() users to fix the previous printk() so that a partial line is always flushed (with '\n' or without '\n').

> I wonder how many would be that complicated. I do not want
> to give up just because one use-case that was complicated
> even before.
>
> [*] The buffer can get written and flushed by different processes.
>     It is not trivial to do it correctly a lockless way.
>
> The proposed locking looks right on the first glance. But
> the code is a bit scary ;-)

try_buffered_printk() will be a good hook for finding implicit pr_cont() users who omit the final '\n', for we can find them without making changes to printk() users. The try_buffered_printk() hook can offload the atomic-but-complicated task (converting e.g. %pSOMETHING to %s) to outside of logbuf_lock, and reduce the period of holding logbuf_lock. Since try_buffered_printk() serves as a hook, if we ignore offloading the atomic-but-complicated task, we can eliminate try_buffered_printk() after we have updated printk() users to always end with '\n'.
Re: [PATCH] drivers/android/binder.c: Remove duplicate header
Hi Brajeswar,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on staging/staging-testing]
[also build test ERROR on v4.20-rc1 next-20181109]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Brajeswar-Ghosh/drivers-android-binder-c-Remove-duplicate-header/20181109-154717
config: i386-randconfig-i3-201844 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All errors (new ones prefixed by >>):

   drivers/android/binder.o: In function `__read_once_size':
>> include/linux/compiler.h:182: undefined reference to `__tracepoint_binder_return'
>> include/linux/compiler.h:182: undefined reference to `__tracepoint_binder_return'
>> include/linux/compiler.h:182: undefined reference to `__tracepoint_binder_return'
>> include/linux/compiler.h:182: undefined reference to `__tracepoint_binder_transaction_buffer_release'
>> include/linux/compiler.h:182: undefined reference to `__tracepoint_binder_transaction_buffer_release'
>> include/linux/compiler.h:182: undefined reference to `__tracepoint_binder_transaction_buffer_release'
>> include/linux/compiler.h:182: undefined reference to `__tracepoint_binder_wait_for_work'
>> include/linux/compiler.h:182: undefined reference to `__tracepoint_binder_transaction_fd_recv'
>> include/linux/compiler.h:182: undefined reference to `__tracepoint_binder_wait_for_work'
>> include/linux/compiler.h:182: undefined reference to `__tracepoint_binder_transaction_fd_recv'
>> include/linux/compiler.h:182: undefined reference to `__tracepoint_binder_transaction_fd_recv'
>> include/linux/compiler.h:182: undefined reference to `__tracepoint_binder_wait_for_work'
>> include/linux/compiler.h:182: undefined reference to `__tracepoint_binder_transaction_received'
>> include/linux/compiler.h:182: undefined reference to `__tracepoint_binder_transaction_received'
>> include/linux/compiler.h:182: undefined reference to `__tracepoint_binder_transaction_received'
>> include/linux/compiler.h:182: undefined reference to `__tracepoint_binder_transaction'
>> include/linux/compiler.h:182: undefined reference to `__tracepoint_binder_transaction_alloc_buf'
>> include/linux/compiler.h:182: undefined reference to `__tracepoint_binder_transaction_failed_buffer_release'
>> include/linux/compiler.h:182: undefined reference to `__tracepoint_binder_transaction'
>> include/linux/compiler.h:182: undefined reference to `__tracepoint_binder_transaction_alloc_buf'

vim +182 include/linux/compiler.h

d976441f Andrey Ryabinin       2015-10-19  178
d976441f Andrey Ryabinin       2015-10-19  179  static __always_inline
d976441f Andrey Ryabinin       2015-10-19  180  void __read_once_size(const volatile void *p, void *res, int size)
230fa253 Christian Borntraeger 2014-11-25  181  {
d976441f Andrey Ryabinin       2015-10-19 @182  	__READ_ONCE_SIZE;
230fa253 Christian Borntraeger 2014-11-25  183  }
d976441f Andrey Ryabinin       2015-10-19  184

:::::: The code at line 182 was first introduced by commit
:::::: d976441f44bc5d48635d081d277aa76556ffbf8b compiler, atomics, kasan: Provide READ_ONCE_NOCHECK()

:::::: TO: Andrey Ryabinin
:::::: CC: Ingo Molnar

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Re: [PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd
On Fri, Nov 09, 2018 at 08:02:14PM +, Michael Tirado wrote:
[...]
> > > That aside: I wonder whether a better API would be something that
> > > allows you to create a new readonly file descriptor, instead of
> > > fiddling with the writability of an existing fd.
> >
> > Every now and then I try to write a patch to prevent using proc to reopen
> > a file with greater permission than the original open.
> >
> > I like your idea to have a clean way to reopen a memfd with reduced
> > permissions. But I would make it a syscall instead and maybe make it only
> > work for memfd at first. And the proc issue would need to be fixed, too.
>
> IMO the best solution would handle the issue at memfd creation time by
> removing the race condition.

I agree, this is another idea I'm exploring. We could add a new .open callback to shmem_file_operations and check for seals there.

thanks,

- Joel
Re: ODEBUG: Out of memory. ODEBUG disabled
> Sent: Friday, November 09, 2018 at 5:08 PM > From: "Waiman Long" > To: "Qian Cai" , "Yang Shi" > Cc: "open list" , "Thomas Gleixner" > , "Arnd Bergmann" , "Joel Fernandes (Google)" , "Zhong Jiang" > Subject: Re: ODEBUG: Out of memory. ODEBUG disabled > > On 11/09/2018 04:51 PM, Qian Cai wrote: >> >>> On Nov 9, 2018, at 4:42 PM, Yang Shi wrote: >>> >>> >>> >>> On 11/9/18 1:36 PM, Qian Cai wrote: It is a bit annoying on this aarch64 server with 64 CPUs that booting the latest mainline (3541833fd1f2) causes object debugging to always run out of memory. >>> Could you please paste the detailed failure log? >> I assume you mean dmesg. >> >> Here is the dmesg for 64 CPUs, >> https://paste.ubuntu.com/p/BnhvXXhn7k/ I have to boot the kernel with only 16 CPUs instead (nr_cpus=16) to make it work. Is it expected that object debugging is not going to work with large machines? >>> I don't think so. I suppose it works well with large CPU numbers on x86. >> Here is the one with the nr_cpus workaround, >> https://paste.ubuntu.com/p/qMpd2CCPSV/ > > The debugobjects code has a set of 1024 statically allocated debug > objects that can be used in early boot before the slab memory allocator > is initialized. Apparently, the system may have used up all the > statically allocated objects. Try doubling ODEBUG_POOL_SIZE to see if it > helps. Great, you are right. Doubling the size makes it work. Does it make sense to have a kconfig option instead? > > There are also quite a number of warnings in your console log. So there > is certainly something wrong with your kernel or config options. Yes, I am working on all those warnings. This one is found by ODEBUG, https://lkml.org/lkml/2018/11/10/136
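A Kconfig knob, as floated above, could look roughly like this. Note this is hypothetical: at the time of the thread, ODEBUG_POOL_SIZE is a hard-coded constant in lib/debugobjects.c, and the option name here is invented:

```kconfig
config DEBUG_OBJECTS_POOL_SIZE
	int "Static debug objects pool size"
	depends on DEBUG_OBJECTS
	default 1024
	help
	  Number of statically allocated debug objects available before the
	  slab allocator is up. Large machines (many CPUs, many early timers
	  and workqueues) may exhaust the default pool during boot, which
	  disables object debugging. Raise this value if you see
	  "ODEBUG: Out of memory. ODEBUG disabled" early in the boot log.
```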
[PATCH v9 0/2] Add support for LPASS clock controller for SDM845
[v9]
 * Update GCC documentation binding with the protected-clocks list.
 * Update the GCC code to add the GCC lpass clocks.
 * This depends on the acceptance of
   https://lore.kernel.org/lkml/20181105194011.43770-1-swb...@chromium.org/

[v8]
 * Add CLK_IS_CRITICAL for GCC lpass clocks so that lpass clock access
   always goes through.

[v7]
 * Clean up header file inclusions.
 * Move the comments along with the flags.
 * Update the commit with details for CLK_IGNORE_UNUSED.

[v6]
 * Update the logic to register the lpass clocks when the device tree
   property is not present.
 * Add the CLK_IGNORE_UNUSED flag for the lpass clocks to not gate the
   clocks at late_init.

[v5]
 * Address the comments in the device tree binding to update the
   reg-names, update the unit address in the lpass clock node example
   and also add the reg property for the gcc clock node.
 * Update the lpass driver to take care of the reg-names.

[v4]
 * Update the description in the GCC documentation binding for
   'qcom,lpass-protected'.
 * Remove 'qcom,lpass-protected' from the LPASS documentation binding.
 * Update Kconfig to use Low Power Audio Subsystem.
 * Add module_exit() and also update the return value for
   devm_ioremap_resource failure.

[v3]
 * Add a device tree property to identify lpass protected GCC clocks.
 * Update the GCC driver code to register the lpass clocks when the flag
   is defined.
 * Add a comment for clocks using the BRANCH_HALT_SKIP flag.
 * Use platform APIs instead of of_address_to_resource.
 * Replace devm_ioremap with devm_ioremap_resource.
 * Use fixed indexes for 'lpass_cc' & 'lpass_qdsp6ss' in probe.

[v2]
 * Make gcc_lpass_sway_clk static.
 * Remove the use of child nodes and use reg-names to differentiate the
   various domains of the LPASS CC.

Add support for the lpass clock controller found on SDM845 based devices.
This would allow lpass peripheral loader drivers to control the clocks to
bring the subsystem out of reset.
Taniya Das (2):
  dt-bindings: clock: Introduce QCOM LPASS clock bindings
  clk: qcom: Add lpass clock controller driver for SDM845

 .../devicetree/bindings/clock/qcom,gcc.txt     |  16 ++
 .../devicetree/bindings/clock/qcom,lpasscc.txt |  26 +++
 drivers/clk/qcom/Kconfig                       |   9 +
 drivers/clk/qcom/Makefile                      |   1 +
 drivers/clk/qcom/gcc-sdm845.c                  |  30 ++
 drivers/clk/qcom/lpasscc-sdm845.c              | 192 +
 include/dt-bindings/clock/qcom,gcc-sdm845.h    |   2 +
 include/dt-bindings/clock/qcom,lpass-sdm845.h  |  16 ++
 8 files changed, 292 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/clock/qcom,lpasscc.txt
 create mode 100644 drivers/clk/qcom/lpasscc-sdm845.c
 create mode 100644 include/dt-bindings/clock/qcom,lpass-sdm845.h

--
Qualcomm INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of the Code Aurora Forum, hosted by the Linux Foundation.
[PATCH v9 1/2] dt-bindings: clock: Introduce QCOM LPASS clock bindings
Add device tree bindings for the Low Power Audio subsystem clock
controller for Qualcomm Technology Inc's SDM845 SoCs.

Signed-off-by: Taniya Das
---
 .../devicetree/bindings/clock/qcom,gcc.txt     | 16 +
 .../devicetree/bindings/clock/qcom,lpasscc.txt | 26 ++
 include/dt-bindings/clock/qcom,gcc-sdm845.h    |  2 ++
 include/dt-bindings/clock/qcom,lpass-sdm845.h  | 16 +
 4 files changed, 60 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/clock/qcom,lpasscc.txt
 create mode 100644 include/dt-bindings/clock/qcom,lpass-sdm845.h

diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc.txt b/Documentation/devicetree/bindings/clock/qcom,gcc.txt
index 52d9345..8661c3c 100644
--- a/Documentation/devicetree/bindings/clock/qcom,gcc.txt
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc.txt
@@ -35,6 +35,8 @@ be part of GCC and hence the TSENS properties can also be part of
 the GCC/clock-controller node. For more details on the TSENS properties
 please refer Documentation/devicetree/bindings/thermal/qcom-tsens.txt
+- protected-clocks : Protected clock specifier list as per common clock
+		     binding.

 Example:
 	clock-controller@90 {
@@ -55,3 +57,17 @@ Example of GCC with TSENS properties:
 		#reset-cells = <1>;
 		#thermal-sensor-cells = <1>;
 	};
+
+Example of GCC with protected-clocks properties:
+	clock-controller@10 {
+		compatible = "qcom,gcc-sdm845";
+		reg = <0x10 0x1f>;
+		#clock-cells = <1>;
+		#reset-cells = <1>;
+		#power-domain-cells = <1>;
+		protected-clocks = ,
+				   ,
+				   ,
+				   ,
+				   ;
+	};
diff --git a/Documentation/devicetree/bindings/clock/qcom,lpasscc.txt b/Documentation/devicetree/bindings/clock/qcom,lpasscc.txt
new file mode 100644
index 000..b9e9787
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,lpasscc.txt
@@ -0,0 +1,26 @@
+Qualcomm LPASS Clock Controller Binding
+---------------------------------------
+
+Required properties :
+- compatible : shall contain "qcom,sdm845-lpasscc"
+- #clock-cells : from common clock binding, shall contain 1.
+- reg : shall contain base register address and size,
+        in the order
+        Index-0 maps to LPASS_CC register region
+        Index-1 maps to LPASS_QDSP6SS register region
+
+Optional properties :
+- reg-names : register names of LPASS domain
+              "cc", "qdsp6ss".
+
+Example:
+
+The below node has to be defined in the cases where the LPASS peripheral loader
+would bring the subsystem out of reset.
+
+	lpasscc: clock-controller@17014000 {
+		compatible = "qcom,sdm845-lpasscc";
+		reg = <0x17014000 0x1f004>, <0x1730 0x200>;
+		reg-names = "cc", "qdsp6ss";
+		#clock-cells = <1>;
+	};
diff --git a/include/dt-bindings/clock/qcom,gcc-sdm845.h b/include/dt-bindings/clock/qcom,gcc-sdm845.h
index b8eae5a..968fa65 100644
--- a/include/dt-bindings/clock/qcom,gcc-sdm845.h
+++ b/include/dt-bindings/clock/qcom,gcc-sdm845.h
@@ -197,6 +197,8 @@
 #define GCC_QSPI_CORE_CLK_SRC			187
 #define GCC_QSPI_CORE_CLK			188
 #define GCC_QSPI_CNOC_PERIPH_AHB_CLK		189
+#define GCC_LPASS_Q6_AXI_CLK			190
+#define GCC_LPASS_SWAY_CLK			191

 /* GCC Resets */
 #define GCC_MMSS_BCR				0
diff --git a/include/dt-bindings/clock/qcom,lpass-sdm845.h b/include/dt-bindings/clock/qcom,lpass-sdm845.h
new file mode 100644
index 000..015968e
--- /dev/null
+++ b/include/dt-bindings/clock/qcom,lpass-sdm845.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2018, The Linux Foundation. All rights reserved.
+ */
+
+#ifndef _DT_BINDINGS_CLK_SDM_LPASS_SDM845_H
+#define _DT_BINDINGS_CLK_SDM_LPASS_SDM845_H
+
+#define LPASS_AUDIO_WRAPPER_AON_CLK	0
+#define LPASS_Q6SS_AHBM_AON_CLK		1
+#define LPASS_Q6SS_AHBS_AON_CLK		2
+#define LPASS_QDSP6SS_XO_CLK		3
+#define LPASS_QDSP6SS_SLEEP_CLK		4
+#define LPASS_QDSP6SS_CORE_CLK		5
+
+#endif
--
Qualcomm INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of the Code Aurora Forum, hosted by the Linux Foundation.
[PATCH v9 2/2] clk: qcom: Add lpass clock controller driver for SDM845
Add support for the lpass clock controller found on SDM845 based devices.
This would allow lpass peripheral loader drivers to control the clocks to
bring the subsystem out of reset.

LPASS clocks present on the global clock controller would be registered
with the clock framework based on the protected-clock flag. Also do not
gate these clocks if they are left unused, as the lpass clocks require
the global clock controller lpass clocks to be enabled before they are
accessed. Mark the GCC lpass clocks as CRITICAL for the LPASS clock
access.

Signed-off-by: Taniya Das
---
 drivers/clk/qcom/Kconfig          |   9 ++
 drivers/clk/qcom/Makefile         |   1 +
 drivers/clk/qcom/gcc-sdm845.c     |  30 ++
 drivers/clk/qcom/lpasscc-sdm845.c | 192 ++
 4 files changed, 232 insertions(+)
 create mode 100644 drivers/clk/qcom/lpasscc-sdm845.c

diff --git a/drivers/clk/qcom/Kconfig b/drivers/clk/qcom/Kconfig
index a611531..23adc4c 100644
--- a/drivers/clk/qcom/Kconfig
+++ b/drivers/clk/qcom/Kconfig
@@ -293,6 +293,15 @@ config SDM_DISPCC_845
 	  Say Y if you want to support display devices and functionality such as
 	  splash screen.

+config SDM_LPASSCC_845
+	tristate "SDM845 Low Power Audio Subsystem (LPASS) Clock Controller"
+	depends on COMMON_CLK_QCOM
+	select SDM_GCC_845
+	help
+	  Support for the LPASS clock controller on SDM845 devices.
+	  Say Y if you want to use the LPASS branch clocks of the LPASS clock
+	  controller to reset the LPASS subsystem.
+
 config SPMI_PMIC_CLKDIV
 	tristate "SPMI PMIC clkdiv Support"
 	depends on (COMMON_CLK_QCOM && SPMI) || COMPILE_TEST
diff --git a/drivers/clk/qcom/Makefile b/drivers/clk/qcom/Makefile
index 981882e..3d530b1 100644
--- a/drivers/clk/qcom/Makefile
+++ b/drivers/clk/qcom/Makefile
@@ -46,6 +46,7 @@ obj-$(CONFIG_SDM_CAMCC_845) += camcc-sdm845.o
 obj-$(CONFIG_SDM_DISPCC_845) += dispcc-sdm845.o
 obj-$(CONFIG_SDM_GCC_660) += gcc-sdm660.o
 obj-$(CONFIG_SDM_GCC_845) += gcc-sdm845.o
+obj-$(CONFIG_SDM_LPASSCC_845) += lpasscc-sdm845.o
 obj-$(CONFIG_SDM_VIDEOCC_845) += videocc-sdm845.o
 obj-$(CONFIG_SPMI_PMIC_CLKDIV) += clk-spmi-pmic-div.o
 obj-$(CONFIG_KPSS_XCC) += kpss-xcc.o
diff --git a/drivers/clk/qcom/gcc-sdm845.c b/drivers/clk/qcom/gcc-sdm845.c
index f133b7f..ba8ff99 100644
--- a/drivers/clk/qcom/gcc-sdm845.c
+++ b/drivers/clk/qcom/gcc-sdm845.c
@@ -3153,6 +3153,34 @@ enum {
 	},
 };

+static struct clk_branch gcc_lpass_q6_axi_clk = {
+	.halt_reg = 0x47000,
+	.halt_check = BRANCH_HALT,
+	.clkr = {
+		.enable_reg = 0x47000,
+		.enable_mask = BIT(0),
+		.hw.init = &(struct clk_init_data){
+			.name = "gcc_lpass_q6_axi_clk",
+			.flags = CLK_IS_CRITICAL,
+			.ops = &clk_branch2_ops,
+		},
+	},
+};
+
+static struct clk_branch gcc_lpass_sway_clk = {
+	.halt_reg = 0x47008,
+	.halt_check = BRANCH_HALT,
+	.clkr = {
+		.enable_reg = 0x47008,
+		.enable_mask = BIT(0),
+		.hw.init = &(struct clk_init_data){
+			.name = "gcc_lpass_sway_clk",
+			.flags = CLK_IS_CRITICAL,
+			.ops = &clk_branch2_ops,
+		},
+	},
+};
+
 static struct gdsc pcie_0_gdsc = {
 	.gdscr = 0x6b004,
 	.pd = {
@@ -3453,6 +3481,8 @@ enum {
 	[GCC_QSPI_CORE_CLK_SRC] = &gcc_qspi_core_clk_src.clkr,
 	[GCC_QSPI_CORE_CLK] = &gcc_qspi_core_clk.clkr,
 	[GCC_QSPI_CNOC_PERIPH_AHB_CLK] = &gcc_qspi_cnoc_periph_ahb_clk.clkr,
+	[GCC_LPASS_Q6_AXI_CLK] = &gcc_lpass_q6_axi_clk.clkr,
+	[GCC_LPASS_SWAY_CLK] = &gcc_lpass_sway_clk.clkr,
 };

 static const struct qcom_reset_map gcc_sdm845_resets[] = {
diff --git a/drivers/clk/qcom/lpasscc-sdm845.c b/drivers/clk/qcom/lpasscc-sdm845.c
new file mode 100644
index
000..2ef7f2a
--- /dev/null
+++ b/drivers/clk/qcom/lpasscc-sdm845.c
@@ -0,0 +1,192 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2018, The Linux Foundation. All rights reserved.
+ */
+
+#include
+#include
+#include
+#include
+
+#include
+
+#include "clk-regmap.h"
+#include "clk-branch.h"
+#include "common.h"
+
+static struct clk_branch lpass_audio_wrapper_aon_clk = {
+	.halt_reg = 0x098,
+	.halt_check = BRANCH_VOTED,
+	.clkr = {
+		.enable_reg = 0x098,
+		.enable_mask = BIT(0),
+		.hw.init = &(struct clk_init_data){
+			.name = "lpass_audio_wrapper_aon_clk",
+			.ops = &clk_branch2_ops,
+		},
+	},
+};
+
+static struct clk_branch lpass_q6ss_ahbm_aon_clk = {
+	.halt_reg = 0x12000,
+	.halt_check = BRANCH_VOTED,
+	.clkr = {
+		.enable_reg = 0x12000,
+		.enable_mask = BIT(0),
+		.hw.init = &(struct clk_init_data){
+			.name =
Re: [PATCH v3 1/4] x86/mm: declare check_la57_support() as inline
On Sat, Nov 10, 2018 at 01:27:29AM +0900, Masahiro Yamada wrote:
> On Fri, Nov 2, 2018 at 8:28 PM Borislav Petkov wrote:
> >
> > On Mon, Oct 29, 2018 at 10:04:46PM +0900, Masahiro Yamada wrote:
> > > On Mon, Oct 29, 2018 at 3:09 AM Steven Rostedt wrote:
> > > >
> > > > On Sun, 28 Oct 2018 13:09:42 + Changbin Du wrote:
> > > >
> > > > > The level4_kernel_pgt is only defined when X86_5LEVEL is enabled.
> > > > > So declare check_la57_support() as inline to make sure the code
> > > > > referring to level4_kernel_pgt is optimized out. This is a
> > > > > preparation for CONFIG_CC_OPTIMIZE_FOR_DEBUGGING.
> > > > >
> > > > > Signed-off-by: Changbin Du
> > > >
> > > > Reviewed-by: Steven Rostedt (VMware)
> > > >
> > > > -- Steve
> > >
> > > Applied to linux-kbuild.
> >
> > Next time, before you route a patch through your tree, please get an
> > x86 maintainer's ACK first.
>
> I changed my mind, and did not include this in my previous pull request
> because this series is causing many warnings.

Most of the new warnings are false positives, since this series didn't
change any code where the warnings happen. I'll make patches to make gcc
happy.

> Now, it is aiming for v5.0-rc1.
>
> So, it is not too late.
>
> If x86 maintainers issue Acked-by, I am happy to add it.
>
> > Thx.
> >
> > --
> > Regards/Gruss,
> > Boris.
> >
> > Good mailing practices for 400: avoid top-posting and trim the reply.
>
> --
> Best Regards
> Masahiro Yamada

--
Thanks,
Changbin Du
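The idea in the patch under discussion can be shown with a minimal sketch. All names here are hypothetical stand-ins for the kernel symbols: in the kernel, check_la57_support() touches level4_kernel_pgt, which only exists when CONFIG_X86_5LEVEL is set, and declaring the helper inline lets the compiler drop the guarded reference even in an unoptimized (CC_OPTIMIZE_FOR_DEBUGGING) build. This sketch uses preprocessor guards rather than the kernel's IS_ENABLED() dead-code elimination so that it builds the same way at any optimization level.

```c
#include <assert.h>
#include <stdbool.h>

/* hypothetical stand-in for CONFIG_X86_5LEVEL=n */
#define X86_5LEVEL_ENABLED 0

#if X86_5LEVEL_ENABLED
/* the top-level page table only exists with 5-level paging */
static unsigned long level4_kernel_pgt[512];
#endif

static inline bool check_la57_support(void)
{
#if X86_5LEVEL_ENABLED
	/* symbol is only referenced in configurations where it exists */
	return level4_kernel_pgt[0] != 0;
#else
	/* 4-level paging: nothing to check, branch folds away */
	return false;
#endif
}
```

The point of the original patch is that with IS_ENABLED() the guard is an ordinary `if` on a compile-time constant, so the reference survives to the compiler; marking the function inline ensures the dead branch is still eliminated before the linker ever sees the missing symbol.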
[PATCH v9 0/4] KASLR feature to randomize each loadable module
This is V9 of the "KASLR feature to randomize each loadable module"
patchset. The purpose is to increase the randomization for the module
space from 10 to 17 bits, and also to make the modules randomized in
relation to each other instead of just the address where the allocations
begin, so that if one module leaks, the locations of the others can't be
inferred.

Why it's useful
===============

Randomizing the location of executable code is a defense against control
flow attacks, where the kernel is tricked into jumping to, or
speculatively executing, code other than what is intended. By randomizing
the location of the code, the attacker doesn't know where to redirect the
control flow.

Today the RANDOMIZE_BASE feature randomizes the base address where the
module allocations begin with 10 bits of entropy for this purpose. From
here, a highly deterministic algorithm allocates space for the modules as
they are loaded and unloaded. If an attacker can predict the order and
identities of the modules that will be loaded (either by the system, or
controlled by the user with request_module or BPF), then a single text
address leak can give the attacker access to the locations of other
modules. So in this case the new algorithm can take the entropy of the
other modules from ~0 to 17 bits, making it much more robust.

Another problem today is that the low 10 bits of entropy make brute force
attacks feasible, especially in the case of speculative execution, where
a wrong guess won't necessarily cause a crash. In this case, increasing
the randomization will force attacks to take longer, and so increase the
time in which an attacker may be detected on a system.

In the past KASLR has been considered mostly a remote defense, due to
available methods of de-randomizing the kernel text locally, but previous
easier local de-randomizing methods have been blocked by KPTI.
There are multiple efforts to apply more randomization to the core kernel
text as well, and so this module space piece can be a first step to
increasing randomization for all kernel space executable code. Userspace
ASLR can get 28 bits of entropy or more, so at least increasing this to
17 for now improves what is currently a pretty low amount of
randomization for the higher privileged kernel space.

How it works
============

The algorithm is pretty simple. It just breaks the module space in two: a
random area (2/3 of the module space) and a backup area (1/3 of the
module space). It first tries to allocate up to 1 randomly located
starting pages inside the random section. If this fails, it will allocate
in the backup area. The backup area base will be offset in the same way
as the current algorithm does for the base area, which has 10 bits of
entropy.

The vmalloc allocator can be used to try an allocation at a specific
address, however it is usually used to try an allocation over a large
address range, and so some behaviors which are non-issues in normal usage
can be sub-optimal when trying an allocation at small ranges. So this
patch also includes a new vmalloc function, __vmalloc_node_try_addr, and
some other vmalloc tweaks that allow for more efficient trying of
addresses.

This algorithm targets maintaining high entropy for many 1000's of module
allocations. This is because there are other users of the module space
besides kernel modules, like eBPF JIT, the classic BPF socket filter JIT
and kprobes.

Performance
===========

Simulations were run using module sizes derived from the x86_64 modules
to measure the allocation performance at various levels of fragmentation
and whether the backup area was used.

Capacity
--------
There is a slight reduction in the capacity of modules, as simulated by
the x86_64 module sizes, of <1000. Note this is a worst case, since in
practice module allocations in the 1000's will consist of smaller BPF JIT
allocations or kprobes, which would fit better in the random area.
Allocation time
---------------
Below are three sets of measurements, in ns, of the allocation time as
measured by the included kselftests. The first two columns are this new
algorithm with and without the vmalloc optimizations for trying random
addresses quickly. They are included for consideration of whether the
changes are worth it. The last column is the performance of the original
algorithm.

Modules   Vmalloc optimization   No Vmalloc optimization   Existing module KASLR
1000      1433                   1993                      3821
2000      2295                   3681                      7830
3000      4424                   7450                      13012
4000      7746                   13824                     18106
5000      12721                  21852                     22572
6000      19724                  33926                     26443
7000      27638                  47427                     30473
8000      37745                  64443                     34200

These allocations are not taking very long, but it may show up on systems
with very high usage of the module space (BPF JITs).
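The random-area/backup-area split described in the cover letter can be modelled in a few dozen lines of userspace C. This is a toy simulation, not the kernel code: the area size, the number of random tries (MAX_TRIES) and the bump-pointer backup allocator are simplifying assumptions chosen for illustration, not values from the patchset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Toy model of the proposed allocator: the module area is split into a
 * random region (2/3) and a backup region (1/3). An allocation first
 * tries a bounded number of random page-aligned start addresses in the
 * random region, then falls back to a simple linear allocator in the
 * backup region.
 */
#define AREA_PAGES	1024			/* whole module area, in pages */
#define RND_PAGES	(AREA_PAGES * 2 / 3)	/* random region */
#define MAX_TRIES	100			/* assumed probe budget */

static bool page_used[AREA_PAGES];
static long backup_next = RND_PAGES;	/* bump pointer into backup region */

static bool range_free(long start, long npages)
{
	for (long i = start; i < start + npages; i++)
		if (page_used[i])
			return false;
	return true;
}

static void mark_used(long start, long npages)
{
	for (long i = start; i < start + npages; i++)
		page_used[i] = true;
}

/* returns the first page of the allocation, or -1 on failure */
static long mod_alloc(long npages)
{
	/* probe random starting pages inside the random region */
	for (int try = 0; try < MAX_TRIES; try++) {
		long start = rand() % (RND_PAGES - npages + 1);
		if (range_free(start, npages)) {
			mark_used(start, npages);
			return start;
		}
	}
	/* all probes collided: fall back to the backup region */
	if (backup_next + npages <= AREA_PAGES) {
		long start = backup_next;
		backup_next += npages;
		mark_used(start, npages);
		return start;
	}
	return -1;
}
```

The simulation makes the trade-off visible: early allocations almost always land at an unpredictable offset in the random region, while heavy fragmentation pushes later allocations into the deterministic backup area, which is why the cover letter measures capacity and backup-area usage separately.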
[PATCH v9 1/4] vmalloc: Add __vmalloc_node_try_addr function
Create a __vmalloc_node_try_addr function that tries to allocate at a
specific address without triggering any lazy purging and retry. For the
randomized allocator that uses this function, failing to allocate at a
specific address is a lot more common. This function will not try to do
any lazy purge and retry, so that it fails faster when an allocation
won't fit at a specific address. It is used for a case where lazy free
areas are unlikely, and so the purge and retry is just extra work done
every time.

For the randomized module loader, the performance for an average
allocation in ns for different numbers of modules was:

Modules   Vmalloc optimization   No Vmalloc optimization
1000      1433                   1993
2000      2295                   3681
3000      4424                   7450
4000      7746                   13824
5000      12721                  21852
6000      19724                  33926
7000      27638                  47427
8000      37745                  64443

In order to support this behavior, a try_purge argument was plumbed into
several of the static helpers. This also changes logic in
__get_vm_area_node to be faster in cases where allocations fail due to no
space, which is a lot more common when trying specific addresses.
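The behavioural difference the patch introduces can be sketched outside the kernel. All names below are hypothetical stand-ins (the real code paths are alloc_vmap_area() and purge_vmap_area_lazy() in mm/vmalloc.c); the sketch only shows the control flow: a range-based allocation is worth a purge-and-retry on failure, while a single-address probe should fail fast.

```c
#include <assert.h>
#include <stdbool.h>

/* pretend some lazily-freed areas are pending reclaim */
static bool lazy_free_pending = true;
static int purge_count;

/* toy fit check: the address only fits once lazy areas are purged */
static bool try_fit(unsigned long addr)
{
	(void)addr;
	return !lazy_free_pending;
}

static void purge_vmap_area_lazy(void)
{
	purge_count++;
	lazy_free_pending = false;
}

/*
 * may_purge mirrors the VMAP_MAY_PURGE / VMAP_NO_PURGE distinction in
 * the patch: range allocations retry after a purge, single-address
 * probes from the randomized loader return failure immediately.
 */
static bool alloc_at(unsigned long addr, bool may_purge)
{
	if (try_fit(addr))
		return true;
	if (may_purge) {		/* VMAP_MAY_PURGE path */
		purge_vmap_area_lazy();
		return try_fit(addr);
	}
	return false;			/* VMAP_NO_PURGE: fail fast */
}
```

Since the randomized loader probes many candidate addresses per allocation, skipping the purge on each failed probe is where the "Vmalloc optimization" column in the table above comes from.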
Signed-off-by: Rick Edgecombe
---
 include/linux/vmalloc.h |   3 +
 mm/vmalloc.c            | 128 +---
 2 files changed, 95 insertions(+), 36 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 398e9c95cd61..6eaa89612372 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -82,6 +82,9 @@ extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
 			unsigned long start, unsigned long end, gfp_t gfp_mask,
 			pgprot_t prot, unsigned long vm_flags, int node,
 			const void *caller);
+extern void *__vmalloc_node_try_addr(unsigned long addr, unsigned long size,
+			gfp_t gfp_mask, pgprot_t prot, unsigned long vm_flags,
+			int node, const void *caller);
 #ifndef CONFIG_MMU
 extern void *__vmalloc_node_flags(unsigned long size, int node, gfp_t flags);
 static inline void *__vmalloc_node_flags_caller(unsigned long size, int node,
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 97d4b25d0373..b8b34d319c85 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -326,6 +326,9 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
 #define VM_LAZY_FREE	0x02
 #define VM_VM_AREA	0x04

+#define VMAP_MAY_PURGE	0x2
+#define VMAP_NO_PURGE	0x1
+
 static DEFINE_SPINLOCK(vmap_area_lock);
 /* Export for kexec only */
 LIST_HEAD(vmap_area_list);
@@ -402,12 +405,12 @@ static BLOCKING_NOTIFIER_HEAD(vmap_notify_list);
 static struct vmap_area *alloc_vmap_area(unsigned long size,
 				unsigned long align,
 				unsigned long vstart, unsigned long vend,
-				int node, gfp_t gfp_mask)
+				int node, gfp_t gfp_mask, int try_purge)
 {
 	struct vmap_area *va;
 	struct rb_node *n;
 	unsigned long addr;
-	int purged = 0;
+	int purged = try_purge & VMAP_NO_PURGE;
 	struct vmap_area *first;

 	BUG_ON(!size);
@@ -860,7 +863,7 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)

 	va = alloc_vmap_area(VMAP_BLOCK_SIZE, VMAP_BLOCK_SIZE,
 					VMALLOC_START, VMALLOC_END,
-					node, gfp_mask);
+					node, gfp_mask, VMAP_MAY_PURGE);
 	if (IS_ERR(va)) {
 		kfree(vb);
 		return ERR_CAST(va);
@@ -1170,8 +1173,9 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t prot)
 		addr = (unsigned long)mem;
 	} else {
 		struct vmap_area *va;
-		va = alloc_vmap_area(size, PAGE_SIZE,
-				VMALLOC_START, VMALLOC_END, node, GFP_KERNEL);
+		va = alloc_vmap_area(size, PAGE_SIZE, VMALLOC_START,
+				VMALLOC_END, node, GFP_KERNEL,
+				VMAP_MAY_PURGE);
 		if (IS_ERR(va))
 			return NULL;

@@ -1372,7 +1376,8 @@ static void clear_vm_uninitialized_flag(struct vm_struct *vm)

 static struct vm_struct *__get_vm_area_node(unsigned long size,
 		unsigned long align, unsigned long flags, unsigned long start,
-		unsigned long end, int node, gfp_t gfp_mask, const void *caller)
+		unsigned long end, int node, gfp_t gfp_mask, int try_purge,
+		const void *caller)
 {
 	struct vmap_area *va;
 	struct vm_struct *area;
@@ -1386,16 +1391,17 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
 	align = 1ul << clamp_t(int, get_count_order_long(size),
[PATCH v9 0/4] KASLR feature to randomize each loadable module
This is V9 of the "KASLR feature to randomize each loadable module" patchset.
The purpose is to increase the randomization for the module space from 10 to
17 bits, and also to make the modules randomized in relation to each other
instead of just the address where the allocations begin, so that if one module
leaks, the location of the others can't be inferred.

Why it's useful
===============
Randomizing the location of executable code is a defense against control flow
attacks, where the kernel is tricked into jumping, or speculatively executing,
code other than what was intended. By randomizing the location of the code,
the attacker doesn't know where to redirect the control flow.

Today the RANDOMIZE_BASE feature randomizes the base address where the module
allocations begin with 10 bits of entropy for this purpose. From there, a
highly deterministic algorithm allocates space for the modules as they are
loaded and unloaded. If an attacker can predict the order and identities of
the modules that will be loaded (either by the system, or controlled by the
user with request_module or BPF), then a single text address leak can give the
attacker access to the locations of the other modules. In this case the new
algorithm takes the entropy of the other modules from ~0 bits to 17 bits,
making it much more robust.

Another problem today is that the low 10 bits of entropy make brute force
attacks feasible, especially in the case of speculative execution, where a
wrong guess won't necessarily cause a crash. Increasing the randomization
forces attacks to take longer, and so increases the time in which an attacker
may be detected on a system.

In the past KASLR has been considered mostly a remote defense, due to
available methods of de-randomizing the kernel text locally, but the previous
easier local de-randomizing methods have been blocked by KPTI.
There are multiple efforts to apply more randomization to the core kernel text
as well, and so this module space piece can be a first step to increasing
randomization for all kernel space executable code. Userspace ASLR can get 28
bits of entropy or more, so increasing this to 17 bits at least improves what
is currently a pretty low amount of randomization for the higher privileged
kernel space.

How it works
============
The algorithm is pretty simple. It just breaks the module space in two: a
random area (2/3 of the module space) and a backup area (1/3 of the module
space). It first tries to allocate at a number of randomly located starting
pages inside the random area. If this fails, it will allocate in the backup
area. The backup area base will be offset in the same way as the current
algorithm does for the base area, which has 10 bits of entropy.

The vmalloc allocator can be used to try an allocation at a specific address;
however, it is usually used to try an allocation over a large address range,
and so some behaviors which are non-issues in normal usage can be sub-optimal
when trying an allocation in a single small range. So this patchset also
includes a new vmalloc function, __vmalloc_node_try_addr, and some other
vmalloc tweaks that allow for more efficient trying of addresses.

This algorithm targets maintaining high entropy for many thousands of module
allocations. This is because there are other users of the module space besides
kernel modules, like eBPF JIT, classic BPF socket filter JIT and kprobes.

Performance
===========
Simulations were run, using module sizes derived from the in-tree x86_64
modules, to measure the allocation performance at various levels of
fragmentation and whether the backup area was used.

Capacity
--------
There is a slight reduction in capacity, as simulated with the x86_64 module
sizes, for module counts under 1000. Note this is a worst case, since in
practice module allocations in the 1000's will consist of smaller BPF JIT
allocations or kprobes, which would fit better in the random area.
Allocation time
---------------
Below are three sets of measurements, in ns, of the allocation time as
measured by the included kselftests. The first two columns are this new
algorithm with and without the vmalloc optimizations for trying random
addresses quickly. They are included for consideration of whether the changes
are worth it. The last column is the performance of the original algorithm.

Modules  Vmalloc optimization  No vmalloc optimization  Existing module KASLR
1000     1433                  1993                     3821
2000     2295                  3681                     7830
3000     4424                  7450                     13012
4000     7746                  13824                    18106
5000     12721                 21852                    22572
6000     19724                 33926                    26443
7000     27638                 47427                    30473
8000     37745                 64443                    34200

These allocations are not taking very long, but they may show up on systems
with very high usage of the module space (BPF JITs).
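The random-then-backup placement described above can be sketched in userspace C. This is an illustrative simulation, not the kernel implementation: the space size, the single "busy" interval standing in for existing allocations, and the MAX_TRIES value are all hypothetical; the real allocator checks candidate addresses against the live vmap area tree.

```c
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096ULL
#define MOD_LEN   (1024ULL * 1024 * 1024)  /* hypothetical 1 GB module space */
#define RAND_LEN  ((MOD_LEN / 3) * 2)      /* 2/3 randomized, 1/3 backup */
#define MAX_TRIES 10000

/*
 * Pick a page-aligned start for an allocation of 'size' bytes. The interval
 * [busy_start, busy_end) stands in for already-allocated space; the real
 * allocator consults the vmap area tree instead.
 */
uint64_t pick_module_addr(uint64_t size, uint64_t busy_start, uint64_t busy_end)
{
	uint64_t slots = (RAND_LEN - size) / PAGE_SIZE;

	for (int i = 0; i < MAX_TRIES; i++) {
		uint64_t start = ((uint64_t)rand() % slots) * PAGE_SIZE;

		/* Does [start, start + size) miss the busy interval? */
		if (start + size <= busy_start || start >= busy_end)
			return start;
	}
	return RAND_LEN;	/* fall back to the backup area base */
}
```

With the whole random area busy, the sketch falls through to the backup base, mirroring how the module loader falls back when its random tries all collide.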
[PATCH v9 1/4] vmalloc: Add __vmalloc_node_try_addr function
Create a __vmalloc_node_try_addr function that tries to allocate at a
specific address without triggering any lazy purging and retry. For the
randomized allocator that uses this function, failing to allocate at a
specific address is a lot more common. This function will not try to do any
lazy purge and retry, in order to fail faster when an allocation won't fit at
a specific address. It is used for a case where lazy free areas are unlikely,
and so the purge and retry is just extra work done every time.

For the randomized module loader, the performance for an average allocation
in ns for different numbers of modules was:

Modules  Vmalloc optimization  No vmalloc optimization
1000     1433                  1993
2000     2295                  3681
3000     4424                  7450
4000     7746                  13824
5000     12721                 21852
6000     19724                 33926
7000     27638                 47427
8000     37745                 64443

In order to support this behavior, a try_addr argument was plugged into
several of the static helpers. This also changes logic in __get_vm_area_node
to be faster in cases where allocations fail due to no space, which is a lot
more common when trying specific addresses.

Signed-off-by: Rick Edgecombe
---
 include/linux/vmalloc.h |   3 +
 mm/vmalloc.c            | 128 ++++++++++++++++++++++++++++++-----------
 2 files changed, 95 insertions(+), 36 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 398e9c95cd61..6eaa89612372 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -82,6 +82,9 @@ extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
 			unsigned long start, unsigned long end, gfp_t gfp_mask,
 			pgprot_t prot, unsigned long vm_flags, int node,
 			const void *caller);
+extern void *__vmalloc_node_try_addr(unsigned long addr, unsigned long size,
+			gfp_t gfp_mask, pgprot_t prot, unsigned long vm_flags,
+			int node, const void *caller);
 #ifndef CONFIG_MMU
 extern void *__vmalloc_node_flags(unsigned long size, int node, gfp_t flags);
 static inline void *__vmalloc_node_flags_caller(unsigned long size, int node,
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 97d4b25d0373..b8b34d319c85 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -326,6 +326,9 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
 #define VM_LAZY_FREE	0x02
 #define VM_VM_AREA	0x04
 
+#define VMAP_MAY_PURGE	0x2
+#define VMAP_NO_PURGE	0x1
+
 static DEFINE_SPINLOCK(vmap_area_lock);
 /* Export for kexec only */
 LIST_HEAD(vmap_area_list);
@@ -402,12 +405,12 @@ static BLOCKING_NOTIFIER_HEAD(vmap_notify_list);
 static struct vmap_area *alloc_vmap_area(unsigned long size,
 				unsigned long align,
 				unsigned long vstart, unsigned long vend,
-				int node, gfp_t gfp_mask)
+				int node, gfp_t gfp_mask, int try_purge)
 {
 	struct vmap_area *va;
 	struct rb_node *n;
 	unsigned long addr;
-	int purged = 0;
+	int purged = try_purge & VMAP_NO_PURGE;
 	struct vmap_area *first;
 
 	BUG_ON(!size);
@@ -860,7 +863,7 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
 
 	va = alloc_vmap_area(VMAP_BLOCK_SIZE, VMAP_BLOCK_SIZE,
 					VMALLOC_START, VMALLOC_END,
-					node, gfp_mask);
+					node, gfp_mask, VMAP_MAY_PURGE);
 	if (IS_ERR(va)) {
 		kfree(vb);
 		return ERR_CAST(va);
@@ -1170,8 +1173,9 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t prot)
 		addr = (unsigned long)mem;
 	} else {
 		struct vmap_area *va;
-		va = alloc_vmap_area(size, PAGE_SIZE,
-				VMALLOC_START, VMALLOC_END, node, GFP_KERNEL);
+		va = alloc_vmap_area(size, PAGE_SIZE, VMALLOC_START,
+					VMALLOC_END, node, GFP_KERNEL,
+					VMAP_MAY_PURGE);
 		if (IS_ERR(va))
 			return NULL;
 
@@ -1372,7 +1376,8 @@ static void clear_vm_uninitialized_flag(struct vm_struct *vm)
 
 static struct vm_struct *__get_vm_area_node(unsigned long size,
 		unsigned long align, unsigned long flags, unsigned long start,
-		unsigned long end, int node, gfp_t gfp_mask, const void *caller)
+		unsigned long end, int node, gfp_t gfp_mask, int try_purge,
+		const void *caller)
 {
 	struct vmap_area *va;
 	struct vm_struct *area;
@@ -1386,16 +1391,17 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
 	align = 1ul << clamp_t(int, get_count_order_long(size),
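The fail-fast idea behind __vmalloc_node_try_addr (attempt exactly this address; report failure rather than purging and retrying elsewhere) has a rough userspace analogue in mmap's MAP_FIXED_NOREPLACE flag (Linux 4.17+). The sketch below is purely illustrative and is not how the kernel-side patch is implemented:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Try to map 'len' bytes exactly at 'addr'. MAP_FIXED_NOREPLACE makes the
 * kernel fail (EEXIST) rather than clobber an existing mapping or place the
 * region somewhere else -- the same "try this address, fail fast" shape as
 * __vmalloc_node_try_addr.
 */
void *try_map_at(void *addr, size_t len)
{
	void *p = mmap(addr, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE,
		       -1, 0);
	return p == MAP_FAILED ? NULL : p;
}
```

An occupied address returns NULL immediately, with no side effects on the rest of the address space, which is exactly the property the patch wants for the randomized try loop.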
[PATCH v9 3/4] vmalloc: Add debugfs modfraginfo
Add debugfs file "modfraginfo" for providing info on module space
fragmentation. This can be used for determining if loadable module
randomization is causing any problems for extreme module loading situations,
like huge numbers of modules or extremely large modules.

Sample output when KASLR is enabled and X86_64 is configured:
	Largest free space:	897912 kB
	 Total free space:	1025424 kB
	Allocations in backup area:	0

Sample output when just X86_64:
	Largest free space:	897912 kB
	 Total free space:	1025424 kB

Signed-off-by: Rick Edgecombe
Reviewed-by: Kees Cook
---
 mm/vmalloc.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 98 insertions(+), 2 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index b8b34d319c85..63894cb50873 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -36,6 +37,12 @@
 #include
 #include
 
+#ifdef CONFIG_X86
+#include
+#include
+#include
+#endif
+
 #include "internal.h"
 
 struct vfree_deferred {
@@ -2415,7 +2422,6 @@ void free_vm_area(struct vm_struct *area)
 }
 EXPORT_SYMBOL_GPL(free_vm_area);
 
-#ifdef CONFIG_SMP
 static struct vmap_area *node_to_va(struct rb_node *n)
 {
 	return rb_entry_safe(n, struct vmap_area, rb_node);
@@ -2463,6 +2469,7 @@ static bool pvm_find_next_prev(unsigned long end,
 	return true;
 }
 
+#ifdef CONFIG_SMP
 /**
  * pvm_determine_end - find the highest aligned address between two vmap_areas
  * @pnext: in/out arg for the next vmap_area
@@ -2804,7 +2811,96 @@ static int __init proc_vmalloc_init(void)
 	proc_create_seq("vmallocinfo", 0400, NULL, &vmalloc_op);
 	return 0;
 }
-module_init(proc_vmalloc_init);
+#elif defined(CONFIG_DEBUG_FS)
+static int __init proc_vmalloc_init(void)
+{
+	return 0;
+}
+#endif
+
+#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_RANDOMIZE_FINE_MODULE)
+static inline unsigned long is_in_backup(unsigned long addr)
+{
+	return addr >= MODULES_VADDR + get_modules_rand_len();
+}
+
+static int modulefraginfo_debug_show(struct seq_file *m, void *v)
+{
+	unsigned long last_end = MODULES_VADDR;
+	unsigned long total_free = 0;
+	unsigned long largest_free = 0;
+	unsigned long backup_cnt = 0;
+	unsigned long gap;
+	struct vmap_area *prev, *cur = NULL;
+
+	spin_lock(&vmap_area_lock);
+
+	if (!pvm_find_next_prev(MODULES_VADDR, &cur, &prev) || !cur)
+		goto done;
+
+	for (; cur->va_end <= MODULES_END; cur = list_next_entry(cur, list)) {
+		/* Don't count areas that are marked to be lazily freed */
+		if (!(cur->flags & VM_LAZY_FREE)) {
+			if (kaslr_mod_randomize_each_module())
+				backup_cnt += is_in_backup(cur->va_start);
+			gap = cur->va_start - last_end;
+			if (gap > largest_free)
+				largest_free = gap;
+			total_free += gap;
+			last_end = cur->va_end;
+		}
+
+		if (list_is_last(&cur->list, &vmap_area_list))
+			break;
+	}
+
+done:
+	gap = (MODULES_END - last_end);
+	if (gap > largest_free)
+		largest_free = gap;
+	total_free += gap;
+	spin_unlock(&vmap_area_lock);
+
+	seq_printf(m, "\tLargest free space:\t%lu kB\n", largest_free / 1024);
+	seq_printf(m, "\t Total free space:\t%lu kB\n", total_free / 1024);
+
+	if (kaslr_mod_randomize_each_module())
+		seq_printf(m, "Allocations in backup area:\t%lu\n", backup_cnt);
+
+	return 0;
+}
+
+static int proc_module_frag_debug_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, modulefraginfo_debug_show, NULL);
+}
+
+static const struct file_operations debug_module_frag_operations = {
+	.open		= proc_module_frag_debug_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static void __init debug_modfrag_init(void)
+{
+	debugfs_create_file("modfraginfo", 0400, NULL, NULL,
+			&debug_module_frag_operations);
+}
+#elif defined(CONFIG_DEBUG_FS) || defined(CONFIG_PROC_FS)
+static void __init debug_modfrag_init(void)
+{
+}
 #endif
 
+#if defined(CONFIG_DEBUG_FS) || defined(CONFIG_PROC_FS)
+static int __init info_vmalloc_init(void)
+{
+	proc_vmalloc_init();
+	debug_modfrag_init();
+	return 0;
+}
+
+module_init(info_vmalloc_init);
+#endif
--
2.17.1
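The gap walk that modfraginfo performs over the vmap area list can be shown with plain C over a sorted array of allocated ranges. This is a standalone sketch of the same accounting (total and largest free gap, including the tail gap), with hypothetical types; it does not use the kernel's list machinery or locking:

```c
#include <stddef.h>

struct area { unsigned long start, end; };	/* sorted, non-overlapping */

/*
 * Walk allocated areas the way modfraginfo walks vmap areas: sum the gaps
 * between them and track the largest one, including the final gap up to
 * 'space_end'.
 */
void frag_stats(const struct area *a, size_t n,
		unsigned long space_start, unsigned long space_end,
		unsigned long *total_free, unsigned long *largest_free)
{
	unsigned long last_end = space_start, gap;

	*total_free = *largest_free = 0;
	for (size_t i = 0; i < n; i++) {
		gap = a[i].start - last_end;
		if (gap > *largest_free)
			*largest_free = gap;
		*total_free += gap;
		last_end = a[i].end;
	}
	gap = space_end - last_end;
	if (gap > *largest_free)
		*largest_free = gap;
	*total_free += gap;
}
```

For areas [10,20) and [30,35) in a [0,50) space this reports 35 bytes total free with a largest gap of 15, the same two numbers the debugfs file prints in kB.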
Re: [PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd
On Fri, Nov 09, 2018 at 03:14:02PM -0800, Andy Lutomirski wrote:
> That aside: I wonder whether a better API would be something that
> allows you to create a new readonly file descriptor, instead of
> fiddling with the writability of an existing fd.
> >>>
> >>> That doesn't work, unfortunately. The ashmem API we're replacing with
> >>> memfd requires file descriptor continuity. I also looked into opening
> >>> a new FD and dup2(2)ing atop the old one, but this approach doesn't
> >>> work in the case that the old FD has already leaked to some other
> >>> context (e.g., another dup, SCM_RIGHTS). See
> >>> https://developer.android.com/ndk/reference/group/memory. We can't
> >>> break ASharedMemory_setProt.
> >>
> >>
> >> Hmm. If we fix the general reopen bug, a way to drop write access from
> >> an existing struct file would do what Android needs, right? I don't
> >> know if there are general VFS issues with that.

I don't think there is a way to fix this in /proc/pid/fd. At the proc level,
the /proc/pid/fd/N files are just soft symlinks that follow through to the
actual file. The open is actually done on that inode/file. I think changing
it the way being discussed here means changing the way symlinks work in
Linux.

I think the right way to fix this is at the memfd inode level. I am working
on a follow up patch on top of this patch, and will send that out in a few
days (along with the man page updates).

thanks!

- Joel
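For readers following the seal discussion: the existing memfd sealing API already supports cutting off writes through F_SEAL_WRITE. The sketch below uses only that long-standing interface; F_SEAL_FUTURE_WRITE, the seal proposed in this thread, differs in that it blocks only future writers while leaving existing writable mappings usable.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Seal a memfd against any further writes. The fd must have been created
 * with MFD_ALLOW_SEALING and have no writable shared mappings; afterwards,
 * write() on it fails with EPERM.
 */
int seal_writes(int fd)
{
	return fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE | F_SEAL_SEAL);
}
```

Note the in-place semantics: the very same fd (and any dup of it, including ones passed over SCM_RIGHTS) loses write access, which is the fd-continuity property the ashmem replacement needs.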
[PATCH v9 2/4] x86/modules: Increase randomization for modules
This changes the behavior of the KASLR logic for allocating memory for the
text sections of loadable modules. It randomizes the location of each module
text section with about 17 bits of entropy in typical use. This is enabled on
X86_64 only. For 32 bit, the behavior is unchanged.

It refactors existing code around module randomization somewhat. There are
now three different behaviors for x86 module_alloc depending on config:
RANDOMIZE_BASE=n; RANDOMIZE_BASE=y with ARCH=x86_64; and RANDOMIZE_BASE=y
with ARCH=i386. The refactor of the existing code is to try to clearly show
what those behaviors are, without having three separate versions or threading
the behaviors in a bunch of little spots. The reason it is not enabled on 32
bit yet is because the module space is much smaller and simulations haven't
been run to see how it performs.

The new algorithm breaks the module space in two: a random area and a backup
area. It first tries to allocate at a number of randomly located starting
pages inside the random area. If this fails, then it will allocate in the
backup area. The backup area base will be offset in the same way as the
current algorithm does for the base area, 1024 possible locations.

Due to boot_params being defined with different types in different places,
placing the config helpers in modules.h or kaslr.h caused conflicts
elsewhere, and so they are placed in a new file, kaslr_modules.h, instead.

Signed-off-by: Rick Edgecombe
---
 arch/x86/Kconfig                        |   3 +
 arch/x86/include/asm/kaslr_modules.h    |  38 ++++++++
 arch/x86/include/asm/pgtable_64_types.h |   7 ++
 arch/x86/kernel/module.c                | 111 +++++++++++++++++++-----
 4 files changed, 136 insertions(+), 23 deletions(-)
 create mode 100644 arch/x86/include/asm/kaslr_modules.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ba7e3464ee92..db93cde0528a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2144,6 +2144,9 @@ config RANDOMIZE_BASE
 
 	  If unsure, say Y.
 
+config RANDOMIZE_FINE_MODULE
+	def_bool y if RANDOMIZE_BASE && X86_64 && !CONFIG_UML
+
 # Relocation on x86 needs some additional build support
 config X86_NEED_RELOCS
 	def_bool y
diff --git a/arch/x86/include/asm/kaslr_modules.h b/arch/x86/include/asm/kaslr_modules.h
new file mode 100644
index 000000000000..1da6eced4b47
--- /dev/null
+++ b/arch/x86/include/asm/kaslr_modules.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_KASLR_MODULES_H_
+#define _ASM_KASLR_MODULES_H_
+
+#ifdef CONFIG_RANDOMIZE_BASE
+/* kaslr_enabled is not always defined */
+static inline int kaslr_mod_randomize_base(void)
+{
+	return kaslr_enabled();
+}
+#else
+static inline int kaslr_mod_randomize_base(void)
+{
+	return 0;
+}
+#endif /* CONFIG_RANDOMIZE_BASE */
+
+#ifdef CONFIG_RANDOMIZE_FINE_MODULE
+/* kaslr_enabled is not always defined */
+static inline int kaslr_mod_randomize_each_module(void)
+{
+	return kaslr_enabled();
+}
+
+static inline unsigned long get_modules_rand_len(void)
+{
+	return MODULES_RAND_LEN;
+}
+#else
+static inline int kaslr_mod_randomize_each_module(void)
+{
+	return 0;
+}
+
+unsigned long get_modules_rand_len(void);
+#endif /* CONFIG_RANDOMIZE_FINE_MODULE */
+
+#endif
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 04edd2d58211..5e26369ab86c 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -143,6 +143,13 @@ extern unsigned int ptrs_per_p4d;
 #define MODULES_END	_AC(0xffffffffff000000, UL)
 #define MODULES_LEN	(MODULES_END - MODULES_VADDR)
 
+/*
+ * Dedicate the first part of the module space to a randomized area when KASLR
+ * is in use. Leave the remaining part for a fallback if we are unable to
+ * allocate in the random area.
+ */
+#define MODULES_RAND_LEN	PAGE_ALIGN((MODULES_LEN/3)*2)
+
 #define ESPFIX_PGD_ENTRY	_AC(-2, UL)
 #define ESPFIX_BASE_ADDR	(ESPFIX_PGD_ENTRY << P4D_SHIFT)
 
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index b052e883dd8c..35cb912ed1f8 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -36,6 +36,7 @@
 #include
 #include
 #include
+#include
 
 #if 0
 #define DEBUGP(fmt, ...)				\
@@ -48,34 +49,96 @@ do {						\
 } while (0)
 #endif
 
-#ifdef CONFIG_RANDOMIZE_BASE
 static unsigned long module_load_offset;
+static const unsigned long NO_TRY_RAND = 1;
 
 /* Mutex protects the module_load_offset. */
 static DEFINE_MUTEX(module_kaslr_mutex);
 
 static unsigned long int get_module_load_offset(void)
 {
-	if (kaslr_enabled()) {
-		mutex_lock(&module_kaslr_mutex);
-		/*
-		 * Calculate the module_load_offset the first time this
-		 * code is called. Once calculated it stays the same until
-		 * reboot.
-		 */
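As a sanity check on the "about 17 bits" claim, the entropy follows directly from the constants in this patch: the number of page-aligned starting positions in MODULES_RAND_LEN. A minimal userspace recomputation, with the constants mirrored from the patch (and PAGE_ALIGN omitted, which does not change the result):

```c
/* Constants mirrored from arch/x86: 1 GB module space, 2/3 randomized. */
#define PAGE_SIZE        4096UL
#define MODULES_LEN      (1024UL * 1024 * 1024)
#define MODULES_RAND_LEN ((MODULES_LEN / 3) * 2)

/* floor(log2(number of page-aligned starts in the random area)) */
unsigned int module_entropy_bits(void)
{
	unsigned long n = MODULES_RAND_LEN / PAGE_SIZE;
	unsigned int bits = 0;

	while (n >>= 1)
		bits++;
	return bits;
}
```

That gives roughly 174K candidate pages, i.e. floor(log2) = 17 bits, matching the commit message (versus the 1024 positions, 10 bits, of the existing base-offset scheme).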
[PATCH v9 4/4] Kselftest for module text allocation benchmarking
This adds a test module in lib/, and a script in kselftest, that do
benchmarking on the allocation of memory in the module space. Performance
here would have some small impact on kernel module insertions, BPF JIT
insertions and kprobes. In the case of KASLR features for the module space,
this module can be used to measure the allocation performance of different
configurations. This module needs to be compiled into the kernel because
module_alloc is not exported. With some modification to the code, as
explained in the comments, it can be enabled to measure TLB flushes as well.

There are two tests in the module. One allocates until failure in order to
test module capacity, and the other times allocating space in the module
area. They both use module sizes that roughly approximate the distribution
of in-tree X86_64 modules.

You can control the number of modules used in the tests like this:
 echo m1000 > /sys/kernel/debug/mod_alloc_test

Run the test for module capacity like:
 echo t1 > /sys/kernel/debug/mod_alloc_test

The other test will measure the allocation time, and for CONFIG_X86_64 and
CONFIG_RANDOMIZE_BASE, also give data on how often the "backup area" is used.

Run the test for allocation time and backup area usage like:
 echo t2 > /sys/kernel/debug/mod_alloc_test

The output will be something like this:
 num	all(ns)	last(ns)
 1000	1083	1099
 Last module in backup count = 0
 Total modules in backup = 0
 >1 module in backup count = 0

To run a suite of allocation time tests for a collection of module numbers
you can run:
 tools/testing/selftests/bpf/test_mod_alloc.sh

Signed-off-by: Rick Edgecombe
---
 lib/Kconfig.debug                             |   9 +
 lib/Makefile                                  |   1 +
 lib/test_mod_alloc.c                          | 375 ++++++++++++++++++
 tools/testing/selftests/bpf/test_mod_alloc.sh |  29 ++
 4 files changed, 414 insertions(+)
 create mode 100644 lib/test_mod_alloc.c
 create mode 100755 tools/testing/selftests/bpf/test_mod_alloc.sh

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 1af29b8224fd..b590b2bb312f 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1886,6 +1886,15 @@ config TEST_BPF
 
 	  If unsure, say N.
 
+config TEST_MOD_ALLOC
+	bool "Tests for module allocator/vmalloc"
+	help
+	  This builds the "test_mod_alloc" module that performs performance
+	  tests on the module text section allocator. The module uses X86_64
+	  module text sizes for simulations.
+
+	  If unsure, say N.
+
 config FIND_BIT_BENCHMARK
 	tristate "Test find_bit functions"
 	help
diff --git a/lib/Makefile b/lib/Makefile
index db06d1237898..c447e07931b0 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -60,6 +60,7 @@ UBSAN_SANITIZE_test_ubsan.o := y
 obj-$(CONFIG_TEST_KSTRTOX) += test-kstrtox.o
 obj-$(CONFIG_TEST_LIST_SORT) += test_list_sort.o
 obj-$(CONFIG_TEST_LKM) += test_module.o
+obj-$(CONFIG_TEST_MOD_ALLOC) += test_mod_alloc.o
 obj-$(CONFIG_TEST_OVERFLOW) += test_overflow.o
 obj-$(CONFIG_TEST_RHASHTABLE) += test_rhashtable.o
 obj-$(CONFIG_TEST_SORT) += test_sort.o
diff --git a/lib/test_mod_alloc.c b/lib/test_mod_alloc.c
new file mode 100644
index 000000000000..3a6fb7999df4
--- /dev/null
+++ b/lib/test_mod_alloc.c
@@ -0,0 +1,375 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * This module can be used to test allocation time and randomness.
+ *
+ * To interact with this module, mount debugfs, for example:
+ *	mount -t debugfs none /sys/kernel/debug/
+ *
+ * Then write to the file:
+ *	/sys/kernel/debug/mod_alloc_test
+ *
+ * There are two tests:
+ * Test 1: Allocate until failure
+ * Test 2: Run 1000 iterations of a test that simulates loading modules with
+ *	   x86_64 module sizes.
+ *
+ * Configure the number (ex: 1000) of modules to use per test in the tests:
+ *	echo m1000 > /sys/kernel/debug/mod_alloc_test
+ *
+ * To run test (ex: Test 2):
+ *	echo t2 > /sys/kernel/debug/mod_alloc_test
+ *
+ * For test 1 it will print the results of each test. For test 2 it will print
+ * out statistics, for example:
+ *	New module count: 1000
+ *	Starting 1 iterations of 1000 modules
+ *	num	all(ns)	last(ns)
+ *	1000	1984	2112
+ *	Last module in backup count = 0
+ *	Total modules in backup = 188
+ *	>1 module in backup count = 7
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+struct mod { int filesize; int coresize; int initsize; };
+
+/* Begin optional logging */
+/*
+ * Note: In order to get an accurate count for the tlb flushes triggered in
+ * vmalloc, create a counter in vmalloc.c with this method signature and
+ * export it. Then replace the below with:
+ *
+ * extern unsigned long get_tlb_flushes_vmalloc(void);
+ */
+static unsigned long get_tlb_flushes_vmalloc(void)
+{
+	return 0;
+}
+
+/* End optional logging */
+
+#define
[PATCH v9 4/4] Kselftest for module text allocation benchmarking
This adds a test module in lib/, and a script in kselftest that does benchmarking on the allocation of memory in the module space. Performance here would have some small impact on kernel module insertions, BPF JIT insertions and kprobes. In the case of KASLR features for the module space, this module can be used to measure the allocation performance of different configurations. This module needs to be compiled into the kernel because module_alloc is not exported. With some modification to the code, as explained in the comments, it can be enabled to measure TLB flushes as well.

There are two tests in the module. One allocates until failure in order to test module capacity and the other times allocating space in the module area. They both use module sizes that roughly approximate the distribution of in-tree X86_64 modules.

You can control the number of modules used in the tests like this:
echo m1000 > /dev/mod_alloc_test

Run the test for module capacity like:
echo t1 > /dev/mod_alloc_test

The other test will measure the allocation time, and for CONFIG_X86_64 and CONFIG_RANDOMIZE_BASE, also give data on how often the "backup area" is used.
Run the test for allocation time and backup area usage like:
echo t2 > /dev/mod_alloc_test

The output will be something like this:
num     all(ns)     last(ns)
1000    1083        1099
Last module in backup count = 0
Total modules in backup = 0
>1 module in backup count = 0

To run a suite of allocation time tests for a collection of module numbers you can run:
tools/testing/selftests/bpf/test_mod_alloc.sh

Signed-off-by: Rick Edgecombe
---
 lib/Kconfig.debug                             |   9 +
 lib/Makefile                                  |   1 +
 lib/test_mod_alloc.c                          | 375 ++
 tools/testing/selftests/bpf/test_mod_alloc.sh |  29 ++
 4 files changed, 414 insertions(+)
 create mode 100644 lib/test_mod_alloc.c
 create mode 100755 tools/testing/selftests/bpf/test_mod_alloc.sh

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 1af29b8224fd..b590b2bb312f 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1886,6 +1886,15 @@ config TEST_BPF
 	  If unsure, say N.
 
+config TEST_MOD_ALLOC
+	bool "Tests for module allocator/vmalloc"
+	help
+	  This builds the "test_mod_alloc" module that performs performance
+	  tests on the module text section allocator. The module uses X86_64
+	  module text sizes for simulations.
+
+	  If unsure, say N.
+
 config FIND_BIT_BENCHMARK
 	tristate "Test find_bit functions"
 	help

diff --git a/lib/Makefile b/lib/Makefile
index db06d1237898..c447e07931b0 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -60,6 +60,7 @@ UBSAN_SANITIZE_test_ubsan.o := y
 obj-$(CONFIG_TEST_KSTRTOX) += test-kstrtox.o
 obj-$(CONFIG_TEST_LIST_SORT) += test_list_sort.o
 obj-$(CONFIG_TEST_LKM) += test_module.o
+obj-$(CONFIG_TEST_MOD_ALLOC) += test_mod_alloc.o
 obj-$(CONFIG_TEST_OVERFLOW) += test_overflow.o
 obj-$(CONFIG_TEST_RHASHTABLE) += test_rhashtable.o
 obj-$(CONFIG_TEST_SORT) += test_sort.o

diff --git a/lib/test_mod_alloc.c b/lib/test_mod_alloc.c
new file mode 100644
index ..3a6fb7999df4
--- /dev/null
+++ b/lib/test_mod_alloc.c
@@ -0,0 +1,375 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * This module can be used to test allocation time and randomness.
+ *
+ * To interact with this module, mount debugfs, for example:
+ * mount -t debugfs none /sys/kernel/debug/
+ *
+ * Then write to the file:
+ * /sys/kernel/debug/mod_alloc_test
+ *
+ * There are two tests:
+ * Test 1: Allocate until failure
+ * Test 2: Run 1000 iterations of a test that simulates loading modules with
+ *         x86_64 module sizes.
+ *
+ * Configure the number (ex: 1000) of modules to use per test:
+ * echo m1000 > /sys/kernel/debug/mod_alloc_test
+ *
+ * To run a test (ex: Test 2):
+ * echo t2 > /sys/kernel/debug/mod_alloc_test
+ *
+ * For test 1 it will print the results of each test. For test 2 it will print
+ * out statistics, for example:
+ * New module count: 1000
+ * Starting 1 iterations of 1000 modules
+ * num     all(ns)     last(ns)
+ * 1000    1984        2112
+ * Last module in backup count = 0
+ * Total modules in backup = 188
+ * >1 module in backup count = 7
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+struct mod { int filesize; int coresize; int initsize; };
+
+/* Begin optional logging */
+/*
+ * Note: In order to get an accurate count for the tlb flushes triggered in
+ * vmalloc, create a counter in vmalloc.c with this method signature and export
+ * it. Then replace the below with:
+ *
+ * extern unsigned long get_tlb_flushes_vmalloc(void);
+ */
+static unsigned long get_tlb_flushes_vmalloc(void)
+{
+	return 0;
+}
+
+/* End optional logging */
+
+#define
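The driving commands from the commit message can be collected into a small script. This is a hedged sketch: the debugfs path and output format are taken from the comment block above, and `parse_backup_total` is a hypothetical helper, not part of the patch.

```shell
# Extract the "Total modules in backup" counter from the test output
# (line format taken from the comment block in test_mod_alloc.c).
parse_backup_total() {
    sed -n 's/^Total modules in backup = \([0-9]*\)$/\1/p'
}

# Intended use on a kernel built with CONFIG_TEST_MOD_ALLOC=y (needs root):
#   mount -t debugfs none /sys/kernel/debug
#   echo m1000 > /sys/kernel/debug/mod_alloc_test   # 1000 modules per test
#   echo t2 > /sys/kernel/debug/mod_alloc_test      # timed-allocation test
#   dmesg | parse_backup_total

# Demonstrate the parser on a sample line from the commit message:
printf 'Total modules in backup = 188\n' | parse_backup_total   # prints: 188
```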
Re: [RFC PATCH] mm: thp: implement THP reservations for anonymous memory
On 11/09/2018 11:51 AM, Andrea Arcangeli wrote: > Hello, > > On Fri, Nov 09, 2018 at 03:13:18PM +0300, Kirill A. Shutemov wrote: >> On Thu, Nov 08, 2018 at 10:48:58PM -0800, Anthony Yznaga wrote: >>> The basic idea as outlined by Mel Gorman in [2] is: >>> >>> 1) On first fault in a sufficiently sized range, allocate a huge page >>>sized and aligned block of base pages. Map the base page >>>corresponding to the fault address and hold the rest of the pages in >>>reserve. >>> 2) On subsequent faults in the range, map the pages from the reservation. >>> 3) When enough pages have been mapped, promote the mapped pages and >>>remaining pages in the reservation to a huge page. >>> 4) When there is memory pressure, release the unused pages from their >>>reservations. >> I haven't yet read the patch in details, but I'm skeptical about the >> approach in general for few reasons: >> >> - PTE page table retracting to replace it with huge PMD entry requires >> down_write(mmap_sem). It makes the approach not practical for many >> multi-threaded workloads. >> >> I don't see a way to avoid exclusive lock here. I will be glad to >> be proved otherwise. >> >> - The promotion will also require TLB flush which might be prohibitively >> slow on big machines. >> >> - Short living processes will fail to benefit from THP with the policy, >> even with plenty of free memory in the system: no time to promote to THP >> or, with synchronous promotion, cost will overweight the benefit. >> >> The goal to reduce memory overhead of THP is admirable, but we need to be >> careful not to kill THP benefit itself. The approach will reduce number of >> THP mapped in the system and/or shift their allocation to later stage of >> process lifetime. >> >> The only way I see it can be useful is if it will be possible to apply the >> policy on per-VMA basis. It will be very useful for malloc() >> implementations, for instance. But as a global policy it's no-go to me. 
> I'm also skeptical about this: the current design is quite > intentional. It's not a bug but a feature that we're not doing the > promotion. > > Part of the tradeoff with THP is to use more RAM to save CPU, when you > use less RAM you're inherently already wasting some CPU just for the > reservation management and you don't get the immediate TLB benefit > anymore either. > > And if you're in the camp that is concerned about the use of more RAM > or/and about the higher latency of COW faults, I'm afraid the > intermediate solution will be still slower than the already available > MADV_NOHUGEPAGE or enabled=madvise. > > Apps like redis that will use more RAM during snapshot and that are > slowed down with THP needs to simply use MADV_NOHUGEPAGE which already > exists as an madvise from the very first kernel that supported > THP-anon. Same thing for other apps that use more RAM with THP and > that are on the losing end of the tradeoff. > > Now about the implementation: the whole point of the reservation > complexity is to skip the khugepaged copy, so it can collapse in > place. Is skipping the copy worth it? Isn't the big cost the IPI > anyway to avoid leaving two simultaneous TLB mappings of different > granularity? Good questions. I'll take them into account when measuring performance. I do wonder about other architectures (e.g. ARM) where the PMD size may be significantly larger than 2MB. > > khugepaged is already tunable to specify a ratio of memory in use to > avoid wasting memory > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none. > > If you set max_ptes_none to half the default value, it'll only promote > pages that are half mapped, reducing the memory waste to 50% of what > it is by default. 
> > So if you are ok to copy the memory that you promote to THP, you'd > just need a global THP mode to avoid allocating THP even when they're > available during the page fault (while still allowing khugepaged to > collapse hugepages in the background), and then reduce max_ptes_none > to get the desired promotion ratio. > > Doing the copy will avoid the reservation there will be also more THP > available to use for those khugepaged users without losing them in > reservations. You won't have to worry about what to do when there's > memory pressure because you won't have to undo the reservation because > there was no reservation in the first place. That problem also goes > away with the copy. > > So it sounds like you could achieve a similar runtime behavior with > much less complexity by reducing max_ptes_none and by doing the copy > and dropping all reservation code. These are compelling arguments. I will be sure to evaluate any performance data against this alternate implementation/tuning. Thank you for the comments. Anthony > >> Prove me wrong with performance data. :) > Same here. > > Thanks, > Andrea
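Andrea's max_ptes_none suggestion amounts to a one-line tunable change. A hedged sketch of the arithmetic, assuming the usual 2MB/4KB geometry (512 PTEs per PMD, default max_ptes_none of 511); the sysfs write itself needs root:

```shell
# khugepaged collapses a PMD-sized range when at most max_ptes_none of its
# 512 PTEs are unmapped. The assumed default, 511, lets a range with a
# single mapped page be collapsed; halving it requires more than 50% of
# the range to be mapped before promotion.
default_max_ptes_none=511                          # assumed: 512 PTEs - 1
halved=$(( (default_max_ptes_none + 1) / 2 - 1 ))
echo "$halved"                                     # prints: 255

# To apply (as root):
#   echo "$halved" > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
```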
Re: [PATCH v5 1/5] dt-bindings: phy-qcom-qmp: Fix register underspecification
Quoting Doug Anderson (2018-11-05 08:52:39) > On Sat, Nov 3, 2018 at 7:40 PM Stephen Boyd wrote: > > > + clocks = < GCC_USB3_SEC_PHY_PIPE_CLK>; > > > + clock-names = "pipe0"; > > > + clock-output-names = "usb3_uni_phy_pipe_clk_src"; > > > > If this has clock-output-names then I would expect to see a #clock-cells > > property, but that isn't here in this node. Are we relying on the same > > property in the parent node? > > If I had to guess, I believe it's yet more confusing than that. I > believe you actually point to the parent phandle if you want to use > the clock. I notice that the parent has #clock-cells as 1 so > presumably this is how you point to one child or the other? ...but I > don't think it's documented how this works. There is 'clock-ranges', which almost nobody uses. It might be usable for this purpose. > The lane nodes don't have > any sort of ID as far as I can tell. ...and in any case having > #clock-cells of 1 makes no sense for USB 3 PHYs which are supposed to > only have one child? > > Let's look at the code, maybe? Hrm, phy_pipe_clk_register() takes no > ID or anything. Huh? OK, so as far as I can tell > of_clk_add_provider() is never called on this clock... > > So I think the answer is that #clock-cells should be <0> and should > move to the child node to match with clock-output-names. Then I guess > (if anyone references this clock from the device tree rather than > relying on the global clock-output-names) we should add the > of_clk_add_provider() into the code? > > Maybe we can add that as a patch to the end of this series? There are > so many crazy / random things wrong with these bindings that it makes > sense to make smaller / incremental changes? > Sure that sounds fine. It would be another case where a driver would want to call the proposed devm_of_clk_add_parent_provider() API.
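A hedged sketch of the binding shape being suggested, with #clock-cells = <0> moved into the lane child node next to clock-output-names. Node names, unit addresses, and labels here are illustrative, not taken from the series:

```dts
usb3phy: phy@88eb000 {
	/* ... PHY wrapper properties, #clock-cells dropped from here ... */

	usb3_uni_lane: lanes@88eb200 {
		/* ... lane registers, clocks, resets ... */
		#clock-cells = <0>;
		clock-output-names = "usb3_uni_phy_pipe_clk_src";
	};
};
```

A consumer could then reference the pipe clock as `clocks = <&usb3_uni_lane>;` rather than relying on the global clock-output-names lookup, once of_clk_add_provider() is wired up for the child node.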
Re: [PATCH 3.18 000/144] 3.18.125-stable review
On Fri, Nov 09, 2018 at 12:39:56PM -0700, Shuah Khan wrote: > On 11/08/2018 02:49 PM, Greg Kroah-Hartman wrote: > > This is the start of the stable review cycle for the 3.18.125 release. > > There are 144 patches in this series, all will be posted as a response > > to this one. If anyone has any issues with these being applied, please > > let me know. > > > > Responses should be made by Sat Nov 10 21:50:17 UTC 2018. > > Anything received after that time might be too late. > > > > The whole patch series can be found in one patch at: > > > > https://www.kernel.org/pub/linux/kernel/v3.x/stable-review/patch-3.18.125-rc1.gz > > or in the git tree and branch at: > > > > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git > > linux-3.18.y > > and the diffstat can be found below. > > > > thanks, > > > > greg k-h > > > > Compiled and booted on my new test system. I don't have a reference for > dmesg regressions and I haven't noticed any problems. Thanks for testing all of these and letting me know. greg k-h
Re: [RFC PATCH] mm: thp: implement THP reservations for anonymous memory
On 11/09/2018 07:34 AM, Zi Yan wrote: > On 9 Nov 2018, at 8:11, Mel Gorman wrote: > >> On Fri, Nov 09, 2018 at 03:13:18PM +0300, Kirill A. Shutemov wrote: >>> On Thu, Nov 08, 2018 at 10:48:58PM -0800, Anthony Yznaga wrote: The basic idea as outlined by Mel Gorman in [2] is: 1) On first fault in a sufficiently sized range, allocate a huge page sized and aligned block of base pages. Map the base page corresponding to the fault address and hold the rest of the pages in reserve. 2) On subsequent faults in the range, map the pages from the reservation. 3) When enough pages have been mapped, promote the mapped pages and remaining pages in the reservation to a huge page. 4) When there is memory pressure, release the unused pages from their reservations. >>> I haven't yet read the patch in details, but I'm skeptical about the >>> approach in general for few reasons: >>> >>> - PTE page table retracting to replace it with huge PMD entry requires >>> down_write(mmap_sem). It makes the approach not practical for many >>> multi-threaded workloads. >>> >>> I don't see a way to avoid exclusive lock here. I will be glad to >>> be proved otherwise. >>> >> That problem is somewhat fundamental to the mmap_sem itself and >> conceivably it could be alleviated by range-locking (if that gets >> completed). The other thing to bear in mind is the timing. If the >> promotion is in-place due to reservations, there isn't the allocation >> overhead and the hold times *should* be short. >> > Is it possible to convert all these PTEs to migration entries during > the promotion and replace them with a huge PMD entry afterwards? > AFAIK, migrating pages does not require holding a mmap_sem. > Basically, it will act like migrating 512 base pages to a THP without > actually doing the page copy. That's an interesting idea. I'll look into it. Thanks, Anthony > > -- > Best Regards > Yan Zi
Re: [PATCH 3/8] sparc: prom: use property "name" directly to construct node names
From: Rob Herring Date: Fri, 9 Nov 2018 07:02:46 -0600 > On Thu, Nov 8, 2018 at 9:13 PM David Miller wrote: >> >> From: Rob Herring >> Date: Wed, 7 Nov 2018 16:31:46 -0600 >> >> > In preparation to remove direct accesses to the device_node.name >> > pointer, retrieve the node name from the "name" property instead. >> > >> > Cc: "David S. Miller" >> > Cc: sparcli...@vger.kernel.org >> > Signed-off-by: Rob Herring >> >> On some 32-bit sparcs, the OF runs in non-cached memory. >> >> Sucking in the OF device tree take a signficant amount of the >> boot time on such systems. >> >> And these changes are going to make it even slower. >> >> Please just put a wrapper around dp->name or whatever and use that, >> instead of making more get property OF calls which can be very >> expensive. > > It's not making calls into OF. It is retrieving the property from the > kernel's copy of the DT (allocated by memblock). So the only > difference is we have to walk the struct property list to find "name". > That's perhaps offset by the fact that we'll be doing one less > property retrieval from OF when this is all done. Right now, we > retrieve "name" from OF twice per node. Ok, now I understand, thanks.
Re: [PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd
On Fri, Nov 9, 2018 at 9:41 PM Andy Lutomirski wrote: > > > > > On Nov 9, 2018, at 1:06 PM, Jann Horn wrote: > > > > +linux-api for API addition > > +hughd as FYI since this is somewhat related to mm/shmem > > > > On Fri, Nov 9, 2018 at 9:46 PM Joel Fernandes (Google) > > wrote: > >> Android uses ashmem for sharing memory regions. We are looking forward > >> to migrating all usecases of ashmem to memfd so that we can possibly > >> remove the ashmem driver in the future from staging while also > >> benefiting from using memfd and contributing to it. Note staging drivers > >> are also not ABI and generally can be removed at anytime. > >> > >> One of the main usecases Android has is the ability to create a region > >> and mmap it as writeable, then add protection against making any > >> "future" writes while keeping the existing already mmap'ed > >> writeable-region active. This allows us to implement a usecase where > >> receivers of the shared memory buffer can get a read-only view, while > >> the sender continues to write to the buffer. Oh I remember trying this years ago with a new seal, F_SEAL_WRITE_PEER, or something like that. > > > > So you're fiddling around with the file, but not the inode? How are > > you preventing code like the following from re-opening the file as > > writable? > > > > $ cat memfd.c > > #define _GNU_SOURCE > > #include > > #include > > #include > > #include > > #include > > #include > > > > int main(void) { > > int fd = syscall(__NR_memfd_create, "testfd", 0); > > if (fd == -1) err(1, "memfd"); > > char path[100]; > > sprintf(path, "/proc/self/fd/%d", fd); > > int fd2 = open(path, O_RDWR); > > if (fd2 == -1) err(1, "reopen"); > > printf("reopen successful: %d\n", fd2); > > } > > $ gcc -o memfd memfd.c > > $ ./memfd > > reopen successful: 4 > > $ > > The race condition between memfd_create and applying seals in fcntl? 
I think it would be possible to block new write mappings from peer processes if there is a new memfd_create api that accepts seals. This would allow the caller to set a seal like the one I proposed years ago, though in a race-free manner. Then also consider how to properly handle blocking inherited +W mapping through clone/fork. Maybe I'm forgetting some other pitfalls? > > That aside: I wonder whether a better API would be something that > > allows you to create a new readonly file descriptor, instead of > > fiddling with the writability of an existing fd. > > Every now and then I try to write a patch to prevent using proc to reopen a > file with greater permission than the original open. > > I like your idea to have a clean way to reopen a a memfd with reduced > permissions. But I would make it a syscall instead and maybe make it only > work for memfd at first. And the proc issue would need to be fixed, too. IMO the best solution would handle the issue at memfd creation time by removing the race condition.
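Jann's C reproducer can be mirrored in plain shell, which makes the underlying point easy to see: opening a /proc/self/fd entry is a fresh open(), not a dup(), so it can acquire write access the original descriptor never had. A sketch using a regular file (the memfd case is analogous, since memfd files are owner-writable); this assumes a Linux system with /proc mounted:

```shell
tmp=$(mktemp)
echo original > "$tmp"

exec 3< "$tmp"                    # fd 3: opened read-only

# /proc/self/fd/3 is a "magic" symlink: redirecting to it performs a brand
# new open() on the underlying file, so O_WRONLY succeeds even though fd 3
# itself is O_RDONLY (ordinary file permissions still apply).
echo clobbered > /proc/self/fd/3

exec 3<&-                         # close fd 3
result=$(cat "$tmp")
echo "$result"                    # prints: clobbered
rm -f "$tmp"
```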
Re: [PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd
On Fri, Nov 9, 2018 at 9:41 PM Andy Lutomirski wrote: > > > > > On Nov 9, 2018, at 1:06 PM, Jann Horn wrote: > > > > +linux-api for API addition > > +hughd as FYI since this is somewhat related to mm/shmem > > > > On Fri, Nov 9, 2018 at 9:46 PM Joel Fernandes (Google) > > wrote: > >> Android uses ashmem for sharing memory regions. We are looking forward > >> to migrating all usecases of ashmem to memfd so that we can possibly > >> remove the ashmem driver in the future from staging while also > >> benefiting from using memfd and contributing to it. Note staging drivers > >> are also not ABI and generally can be removed at anytime. > >> > >> One of the main usecases Android has is the ability to create a region > >> and mmap it as writeable, then add protection against making any > >> "future" writes while keeping the existing already mmap'ed > >> writeable-region active. This allows us to implement a usecase where > >> receivers of the shared memory buffer can get a read-only view, while > >> the sender continues to write to the buffer. Oh I remember trying this years ago with a new seal, F_SEAL_WRITE_PEER, or something like that. > > > > So you're fiddling around with the file, but not the inode? How are > > you preventing code like the following from re-opening the file as > > writable? > > > > $ cat memfd.c > > #define _GNU_SOURCE > > #include > > #include > > #include > > #include > > #include > > #include > > > > int main(void) { > > int fd = syscall(__NR_memfd_create, "testfd", 0); > > if (fd == -1) err(1, "memfd"); > > char path[100]; > > sprintf(path, "/proc/self/fd/%d", fd); > > int fd2 = open(path, O_RDWR); > > if (fd2 == -1) err(1, "reopen"); > > printf("reopen successful: %d\n", fd2); > > } > > $ gcc -o memfd memfd.c > > $ ./memfd > > reopen successful: 4 > > $ > > The race condition between memfd_create and applying seals in fcntl? 
I think it would be possible to block new write mappings from peer processes
if there is a new memfd_create API that accepts seals, allowing the caller to
set a seal like the one I proposed years ago, though in a race-free manner.
Then also consider how to properly handle blocking inherited +W mappings
through clone/fork. Maybe I'm forgetting some other pitfalls?

> > That aside: I wonder whether a better API would be something that
> > allows you to create a new readonly file descriptor, instead of
> > fiddling with the writability of an existing fd.
>
> Every now and then I try to write a patch to prevent using proc to reopen a
> file with greater permission than the original open.
>
> I like your idea to have a clean way to reopen a memfd with reduced
> permissions. But I would make it a syscall instead and maybe make it only
> work for memfd at first. And the proc issue would need to be fixed, too.

IMO the best solution would handle the issue at memfd creation time by
removing the race condition.
Re: [PATCH 5/6] staging:iio:ad2s90: Add SPDX license identifier
On Fri, Nov 9, 2018 at 10:20 PM Greg Kroah-Hartman wrote:
>
> On Fri, Nov 09, 2018 at 09:19:52PM -0200, Matheus Tavares Bernardino wrote:
> > On Fri, Nov 9, 2018 at 8:13 PM Fabio Estevam wrote:
> > >
> > > Hi Matheus,
> >
> > Hi, Fabio
> >
> > > On Fri, Nov 9, 2018 at 8:01 PM Matheus Tavares wrote:
> > > >
> > > > This patch adds the SPDX GPL-2.0-only license identifier to ad2s90.c,
> > > > which solves the checkpatch.pl warning:
> > > > "WARNING: Missing or malformed SPDX-License-Identifier tag in line 1".
> > > >
> > > > Signed-off-by: Matheus Tavares
> > > > ---
> > > >  drivers/staging/iio/resolver/ad2s90.c | 1 +
> > > >  1 file changed, 1 insertion(+)
> > > >
> > > > diff --git a/drivers/staging/iio/resolver/ad2s90.c b/drivers/staging/iio/resolver/ad2s90.c
> > > > index 949ff55ac6b0..f439da721df8 100644
> > > > --- a/drivers/staging/iio/resolver/ad2s90.c
> > > > +++ b/drivers/staging/iio/resolver/ad2s90.c
> > > > @@ -1,3 +1,4 @@
> > > > +// SPDX-License-Identifier: GPL-2.0-only
> > >
> > > This should be:
> > > // SPDX-License-Identifier: GPL-2.0
> >
> > Hm, but it seems that the identifier "GPL-2.0" is deprecated, look:
> > https://spdx.org/licenses/GPL-2.0.html. It has been updated to
> > "GPL-2.0-only" in license list v3
> > (https://spdx.org/licenses/GPL-2.0-only.html). Is there some other
> > reason to use the deprecated "GPL-2.0" that I'm not aware of?
>
> Yes, please read the in-kernel documentation for all of this at:
>   Documentation/process/license-rules.rst
>
> Long story short, we started the adding of these tags to the kernel
> before the crazyness of the "-only" markings for GPL in spdx.  Let's
> keep it this way for now, if we ever get the whole kernel finished, then
> we can revisit the markings and maybe do a wholesale conversion, if it's
> really needed.
>

Got it, thanks for the explanation! I'll correct this in v2.

Thanks,
Matheus

> thanks,
>
> greg k-h
Re: [PATCH] x86/mm/pat: Fix missing preemption disable for __native_flush_tlb()
> On Nov 9, 2018, at 4:05 PM, Dan Williams wrote:
>
> Commit f77084d96355 "x86/mm/pat: Disable preemption around
> __flush_tlb_all()" addressed a case where __flush_tlb_all() is called
> without preemption being disabled. It also left a warning to catch other
> cases where preemption is not disabled. That warning triggers for the
> memory hotplug path which is also used for persistent memory enabling:

I don’t think I agree with the patch. If you call __flush_tlb_all() in a
context where you might be *migrated*, then there’s a bug. We could change
the code to allow this particular use by checking that we haven’t done SMP
init yet, perhaps.

>
> WARNING: CPU: 35 PID: 911 at ./arch/x86/include/asm/tlbflush.h:460
> RIP: 0010:__flush_tlb_all+0x1b/0x3a
> [..]
> Call Trace:
>  phys_pud_init+0x29c/0x2bb
>  kernel_physical_mapping_init+0xfc/0x219
>  init_memory_mapping+0x1a5/0x3b0
>  arch_add_memory+0x2c/0x50
>  devm_memremap_pages+0x3aa/0x610
>  pmem_attach_disk+0x585/0x700 [nd_pmem]
>
> Rather than audit all __flush_tlb_all() callers to add preemption, just
> do it internally to __flush_tlb_all().
>
> Fixes: f77084d96355 ("x86/mm/pat: Disable preemption around __flush_tlb_all()")
> Cc: Sebastian Andrzej Siewior
> Cc: Thomas Gleixner
> Cc: Andy Lutomirski
> Cc: Dave Hansen
> Cc: Peter Zijlstra
> Cc: Borislav Petkov
> Cc:
> Signed-off-by: Dan Williams
> ---
>  arch/x86/include/asm/tlbflush.h |    8 ++++----
>  arch/x86/mm/pageattr.c          |    6 +-----
>  2 files changed, 5 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index d760611cfc35..049e0aca0fb5 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -454,11 +454,10 @@ static inline void __native_flush_tlb_one_user(unsigned long addr)
>  static inline void __flush_tlb_all(void)
>  {
>  	/*
> -	 * This is to catch users with enabled preemption and the PGE feature
> -	 * and don't trigger the warning in __native_flush_tlb().
> +	 * Preemption needs to be disabled around __flush_tlb* calls
> +	 * due to CR3 reload in __native_flush_tlb().
>  	 */
> -	VM_WARN_ON_ONCE(preemptible());
> -
> +	preempt_disable();
>  	if (boot_cpu_has(X86_FEATURE_PGE)) {
>  		__flush_tlb_global();
>  	} else {
> @@ -467,6 +466,7 @@ static inline void __flush_tlb_all(void)
>  		 */
>  		__flush_tlb();
>  	}
> +	preempt_enable();
>  }
>
>  /*
> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> index db7a10082238..f799076e3d57 100644
> --- a/arch/x86/mm/pageattr.c
> +++ b/arch/x86/mm/pageattr.c
> @@ -2309,13 +2309,9 @@ void __kernel_map_pages(struct page *page, int numpages, int enable)
>
>  	/*
>  	 * We should perform an IPI and flush all tlbs,
> -	 * but that can deadlock->flush only current cpu.
> -	 * Preemption needs to be disabled around __flush_tlb_all() due to
> -	 * CR3 reload in __native_flush_tlb().
> +	 * but that can deadlock->flush only current cpu:
>  	 */
> -	preempt_disable();
>  	__flush_tlb_all();
> -	preempt_enable();
>
>  	arch_flush_lazy_mmu_mode();
>  }
>