Re: [PATCH 1/6] dump_stack: Support adding to the dump stack arch description
On Tue, 2015-05-05 at 14:16 -0700, Andrew Morton wrote:
> On Tue, 5 May 2015 21:12:12 +1000 Michael Ellerman <m...@ellerman.id.au> wrote:
>
> > Arch code can set a dump stack arch description string which is
> > displayed with oops output to describe the hardware platform.
> >
> > +	len = strnlen(dump_stack_arch_desc_str, sizeof(dump_stack_arch_desc_str));
> > +	pos = len;
> > +
> > +	if (len)
> > +		pos++;
> > +
> > +	if (pos >= sizeof(dump_stack_arch_desc_str))
> > +		return; /* Ran out of space */
> > +
> > +	p = &dump_stack_arch_desc_str[pos];
> > +
> > +	va_start(args, fmt);
> > +	vsnprintf(p, sizeof(dump_stack_arch_desc_str) - pos, fmt, args);
> > +	va_end(args);
>
> This code is almost race-free. A (documented) smp_wmb() in here would
> make that 100%?
>
> > +	if (len)
> > +		dump_stack_arch_desc_str[len] = ' ';
> > +}

On second thoughts I don't think it would. It would order the stores in
vsnprintf() vs the store of the space, the idea being that you never see a
partially printed string.

But for that to actually work you need a barrier on the read side, and where
do you put it? The CPU printing the buffer could speculate the load of the
tail of the buffer, seeing something half printed from vsnprintf(), and then
load the head of the buffer and see the space, unless you order those loads.

So I don't think we can prevent a crashing CPU from seeing a semi-printed
buffer without a lock, and we don't want to add a lock.

The other issue would be that a reader could miss the trailing NUL from
vsnprintf() but see the space, meaning it would wander off the end of the
buffer. But the buffer is in BSS to start with, and we're careful not to
print off the end of it, so it should always be NUL terminated.

cheers
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
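The append logic under discussion can be modelled in userspace. This is a
sketch, not the kernel code: `desc_str` and `desc_append()` are stand-ins for
`dump_stack_arch_desc_str` and the patch's append helper, with the bounds
check and deliberate store ordering (space written last) reproduced.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Userspace model of the append helper discussed above.  The real
 * kernel buffer lives in BSS (so it starts zeroed, hence always NUL
 * terminated); a static array models that. */
static char desc_str[128];

static void desc_append(const char *fmt, ...)
{
	va_list args;
	size_t len, pos;

	len = strnlen(desc_str, sizeof(desc_str));
	pos = len;

	if (len)		/* leave room for a separating space */
		pos++;

	if (pos >= sizeof(desc_str))
		return;		/* ran out of space */

	va_start(args, fmt);
	vsnprintf(&desc_str[pos], sizeof(desc_str) - pos, fmt, args);
	va_end(args);

	/* Written last so a reader ideally never sees the space without
	 * a NUL-terminated tail -- though, as the thread notes, without
	 * read-side ordering this is not an actual guarantee. */
	if (len)
		desc_str[len] = ' ';
}
```

Single-threaded, the helper behaves as intended; the race discussed above only
matters for a concurrent reader on another CPU.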
Re: [PATCH v3 1/2] perf/kvm: Port perf kvm to powerpc
On 05/08/2015 09:58 AM, Ingo Molnar wrote:
> * Hemant Kumar <hem...@linux.vnet.ibm.com> wrote:
>
> > # perf kvm stat report -p 60515
> >
> > Analyze events for pid(s) 60515, all VCPUs:
> >
> > VM-EXIT         Samples  Samples%  Time%   Min Time  Max Time      Avg time
> > H_DATA_STORAGE  5006     35.30%    0.13%   1.94us    49.46us       12.37us ( +- 0.52% )
> > HV_DECREMENTER  4457     31.43%    0.02%   0.72us    16.14us       1.91us ( +- 0.96% )
> > SYSCALL         2690     18.97%    0.10%   2.84us    528.24us      18.29us ( +- 3.75% )
> > RETURN_TO_HOST  1789     12.61%    99.76%  1.58us    672791.91us   27470.23us ( +- 3.00% )
> > EXTERNAL 240 1.69% 0.00% 0.69us 10.67us 1.33us ( +- 5.34% )
>
> Where is the last line misaligned? Copy paste error or does perf kvm
> produce it in such a way?

It's a copy-paste error. Thanks for pointing this out. Shall I resend the
patches with the correct alignment of the o/p?

> Thanks,
>	Ingo

--
Thanks,
Hemant Kumar
Re: [PATCH v3 1/2] perf/kvm: Port perf kvm to powerpc
* Hemant Kumar <hem...@linux.vnet.ibm.com> wrote:

> On 05/08/2015 09:58 AM, Ingo Molnar wrote:
> > * Hemant Kumar <hem...@linux.vnet.ibm.com> wrote:
> >
> > > # perf kvm stat report -p 60515
> > >
> > > Analyze events for pid(s) 60515, all VCPUs:
> > >
> > > VM-EXIT         Samples  Samples%  Time%   Min Time  Max Time      Avg time
> > > H_DATA_STORAGE  5006     35.30%    0.13%   1.94us    49.46us       12.37us ( +- 0.52% )
> > > HV_DECREMENTER  4457     31.43%    0.02%   0.72us    16.14us       1.91us ( +- 0.96% )
> > > SYSCALL         2690     18.97%    0.10%   2.84us    528.24us      18.29us ( +- 3.75% )
> > > RETURN_TO_HOST  1789     12.61%    99.76%  1.58us    672791.91us   27470.23us ( +- 3.00% )
> > > EXTERNAL 240 1.69% 0.00% 0.69us 10.67us 1.33us ( +- 5.34% )
> >
> > Where is the last line misaligned? Copy paste error or does perf kvm
> > produce it in such a way?
>
> Its a copy-paste error. Thanks for pointing this out. Shall I resend
> the patches with the correct alignment of the o/p?

I don't think that's necessary, as long as the code is fine.

Thanks,
	Ingo
[PATCH Part3 v11 8/9] PCI: Remove platform specific pci_domain_nr()
Now pci_host_bridge holds the domain number, so we could eliminate all platform specific pci_domain_nr(). Signed-off-by: Yijing Wang wangyij...@huawei.com --- arch/alpha/include/asm/pci.h |2 -- arch/ia64/include/asm/pci.h |1 - arch/microblaze/pci/pci-common.c | 11 --- arch/mips/include/asm/pci.h |2 -- arch/powerpc/kernel/pci-common.c | 11 --- arch/s390/pci/pci.c |6 -- arch/sh/include/asm/pci.h|2 -- arch/sparc/kernel/pci.c | 17 - arch/tile/include/asm/pci.h |2 -- arch/x86/include/asm/pci.h |6 -- drivers/pci/pci.c|8 include/linux/pci.h |7 +-- 12 files changed, 9 insertions(+), 66 deletions(-) diff --git a/arch/alpha/include/asm/pci.h b/arch/alpha/include/asm/pci.h index f7f680f..63a9a1e 100644 --- a/arch/alpha/include/asm/pci.h +++ b/arch/alpha/include/asm/pci.h @@ -95,8 +95,6 @@ static inline int pci_get_legacy_ide_irq(struct pci_dev *dev, int channel) return channel ? 15 : 14; } -#define pci_domain_nr(bus) ((struct pci_controller *)(bus)-sysdata)-index - static inline int pci_proc_domain(struct pci_bus *bus) { struct pci_controller *hose = bus-sysdata; diff --git a/arch/ia64/include/asm/pci.h b/arch/ia64/include/asm/pci.h index 52af5ed..1dcea49 100644 --- a/arch/ia64/include/asm/pci.h +++ b/arch/ia64/include/asm/pci.h @@ -99,7 +99,6 @@ struct pci_controller { #define PCI_CONTROLLER(busdev) ((struct pci_controller *) busdev-sysdata) -#define pci_domain_nr(busdev)(PCI_CONTROLLER(busdev)-segment) extern struct pci_ops pci_root_ops; diff --git a/arch/microblaze/pci/pci-common.c b/arch/microblaze/pci/pci-common.c index d232c8a..6f64908 100644 --- a/arch/microblaze/pci/pci-common.c +++ b/arch/microblaze/pci/pci-common.c @@ -123,17 +123,6 @@ unsigned long pci_address_to_pio(phys_addr_t address) } EXPORT_SYMBOL_GPL(pci_address_to_pio); -/* - * Return the domain number for this bus. 
- */ -int pci_domain_nr(struct pci_bus *bus) -{ - struct pci_controller *hose = pci_bus_to_host(bus); - - return hose-global_number; -} -EXPORT_SYMBOL(pci_domain_nr); - /* This routine is meant to be used early during boot, when the * PCI bus numbers have not yet been assigned, and you need to * issue PCI config cycles to an OF device. diff --git a/arch/mips/include/asm/pci.h b/arch/mips/include/asm/pci.h index d969299..f5e96d4 100644 --- a/arch/mips/include/asm/pci.h +++ b/arch/mips/include/asm/pci.h @@ -124,8 +124,6 @@ static inline void pci_dma_burst_advice(struct pci_dev *pdev, #endif #ifdef CONFIG_PCI_DOMAINS -#define pci_domain_nr(bus) ((struct pci_controller *)(bus)-sysdata)-index - static inline int pci_proc_domain(struct pci_bus *bus) { struct pci_controller *hose = bus-sysdata; diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c index 5754367..b787d89 100644 --- a/arch/powerpc/kernel/pci-common.c +++ b/arch/powerpc/kernel/pci-common.c @@ -195,17 +195,6 @@ unsigned long pci_address_to_pio(phys_addr_t address) } EXPORT_SYMBOL_GPL(pci_address_to_pio); -/* - * Return the domain number for this bus. - */ -int pci_domain_nr(struct pci_bus *bus) -{ - struct pci_controller *hose = pci_bus_to_host(bus); - - return hose-global_number; -} -EXPORT_SYMBOL(pci_domain_nr); - /* This routine is meant to be used early during boot, when the * PCI bus numbers have not yet been assigned, and you need to * issue PCI config cycles to an OF device. diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c index b9ac2f5..86acba4 100644 --- a/arch/s390/pci/pci.c +++ b/arch/s390/pci/pci.c @@ -101,12 +101,6 @@ static struct zpci_dev *get_zdev_by_bus(struct pci_bus *bus) return (bus bus-sysdata) ? 
(struct zpci_dev *) bus-sysdata : NULL; } -int pci_domain_nr(struct pci_bus *bus) -{ - return ((struct zpci_dev *) bus-sysdata)-domain; -} -EXPORT_SYMBOL_GPL(pci_domain_nr); - int pci_proc_domain(struct pci_bus *bus) { return pci_domain_nr(bus); diff --git a/arch/sh/include/asm/pci.h b/arch/sh/include/asm/pci.h index 5b45115..4dc3ad6 100644 --- a/arch/sh/include/asm/pci.h +++ b/arch/sh/include/asm/pci.h @@ -109,8 +109,6 @@ static inline void pci_dma_burst_advice(struct pci_dev *pdev, /* Board-specific fixup routines. */ int pcibios_map_platform_irq(const struct pci_dev *dev, u8 slot, u8 pin); -#define pci_domain_nr(bus) ((struct pci_channel *)(bus)-sysdata)-index - static inline int pci_proc_domain(struct pci_bus *bus) { struct pci_channel *hose = bus-sysdata; diff --git a/arch/sparc/kernel/pci.c b/arch/sparc/kernel/pci.c index dc74202..b38eba5 100644 --- a/arch/sparc/kernel/pci.c +++ b/arch/sparc/kernel/pci.c @@ -886,23 +886,6 @@ int pcibus_to_node(struct pci_bus *pbus) EXPORT_SYMBOL(pcibus_to_node); #endif -/* Return the domain number for this pci bus */ - -int pci_domain_nr(struct pci_bus *pbus) -{ -
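With the domain number stored in `pci_host_bridge`, the per-arch
`pci_domain_nr()` implementations removed above can collapse into one generic
lookup. A minimal userspace model of that lookup (structures are simplified
stand-ins, not the kernel's actual types):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; the real ones carry
 * far more state. */
struct pci_host_bridge {
	int domain;
};

struct pci_bus {
	struct pci_bus *parent;		/* NULL for a root bus */
	struct pci_host_bridge *bridge;	/* set only on the root bus */
};

/* Walk up to the root bus, then read the domain from its host bridge:
 * the generic shape that replaces each arch's pci_domain_nr(). */
static int pci_domain_nr(struct pci_bus *bus)
{
	while (bus->parent)
		bus = bus->parent;
	return bus->bridge->domain;
}
```

Any bus in the hierarchy resolves to its root's host bridge, so sysdata-based
per-arch variants become unnecessary.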
[PATCH Part3 v11 5/9] powerpc/PCI: Rename pcibios_root_bridge_prepare() to pcibios_root_bus_prepare()
pcibios_root_bridge_prepare() on powerpc sets the root bus speed; it is not
preparation of the PCI host bridge. For better separation of host bridge and
root bus creation, rename it to another weak function.

Signed-off-by: Yijing Wang <wangyij...@huawei.com>
---
 arch/powerpc/include/asm/machdep.h       |    2 +-
 arch/powerpc/kernel/pci-common.c         |    6 +++---
 arch/powerpc/platforms/pseries/pci.c     |    2 +-
 arch/powerpc/platforms/pseries/pseries.h |    2 +-
 arch/powerpc/platforms/pseries/setup.c   |    2 +-
 drivers/pci/probe.c                      |    9 +++++++++
 6 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index ef88994..f236660 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -125,7 +125,7 @@ struct machdep_calls {
 	/* Called after allocating resources */
 	void		(*pcibios_fixup)(void);
 	void		(*pci_irq_fixup)(struct pci_dev *dev);
-	int		(*pcibios_root_bridge_prepare)(struct pci_host_bridge
+	int		(*pcibios_root_bus_prepare)(struct pci_host_bridge
 				*bridge);

 	/* To setup PHBs when using automatic OF platform driver for PCI */
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index e9506d5..5754367 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -781,10 +781,10 @@ int pci_proc_domain(struct pci_bus *bus)
 	return 1;
 }

-int pcibios_root_bridge_prepare(struct pci_host_bridge *bridge)
+int pcibios_root_bus_prepare(struct pci_host_bridge *bridge)
 {
-	if (ppc_md.pcibios_root_bridge_prepare)
-		return ppc_md.pcibios_root_bridge_prepare(bridge);
+	if (ppc_md.pcibios_root_bus_prepare)
+		return ppc_md.pcibios_root_bus_prepare(bridge);

 	return 0;
 }
diff --git a/arch/powerpc/platforms/pseries/pci.c b/arch/powerpc/platforms/pseries/pci.c
index fe16a50..885f9ff 100644
--- a/arch/powerpc/platforms/pseries/pci.c
+++ b/arch/powerpc/platforms/pseries/pci.c
@@ -110,7 +110,7 @@ static void fixup_winbond_82c105(struct pci_dev* dev)
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_WINBOND, PCI_DEVICE_ID_WINBOND_82C105,
 			 fixup_winbond_82c105);

-int pseries_root_bridge_prepare(struct pci_host_bridge *bridge)
+int pseries_root_bus_prepare(struct pci_host_bridge *bridge)
 {
 	struct device_node *dn, *pdn;
 	struct pci_bus *bus;
diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
index 8411c27..41310dc 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -75,7 +75,7 @@ static inline int dlpar_memory(struct pseries_hp_errorlog *hp_elog)

 /* PCI root bridge prepare function override for pseries */
 struct pci_host_bridge;
-int pseries_root_bridge_prepare(struct pci_host_bridge *bridge);
+int pseries_root_bus_prepare(struct pci_host_bridge *bridge);

 extern struct pci_controller_ops pseries_pci_controller_ops;
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index df6a704..2815309 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -537,7 +537,7 @@ static void __init pSeries_setup_arch(void)
 		ppc_md.enable_pmcs = power4_enable_pmcs;
 	}

-	ppc_md.pcibios_root_bridge_prepare = pseries_root_bridge_prepare;
+	ppc_md.pcibios_root_bus_prepare = pseries_root_bus_prepare;

 	if (firmware_has_feature(FW_FEATURE_SET_MODE)) {
 		long rc;
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 9f9445e..f5f5de6 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1883,6 +1883,11 @@ int __weak pcibios_root_bridge_prepare(struct pci_host_bridge *bridge)
 	return 0;
 }

+int __weak pcibios_root_bus_prepare(struct pci_host_bridge *bridge)
+{
+	return 0;
+}
+
 void __weak pcibios_add_bus(struct pci_bus *bus)
 {
 }
@@ -1948,6 +1953,10 @@ struct pci_bus *pci_create_root_bus(struct device *parent, int domain,
 	b->dev.class = pcibus_class;
 	b->dev.parent = b->bridge;
 	dev_set_name(&b->dev, "%04x:%02x", pci_domain_nr(b), bus);
+	error = pcibios_root_bus_prepare(bridge);
+	if (error)
+		goto class_dev_reg_err;
+
 	error = device_register(&b->dev);
 	if (error)
 		goto class_dev_reg_err;
--
1.7.1
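The patch hinges on the `__weak` default-plus-arch-override pattern. A
standalone sketch of that linkage behaviour (the struct is opaque here and the
call is a model, not the kernel's): the generic file provides a weak no-op,
and a non-weak definition in arch code, when present, silently replaces it at
link time.

```c
#include <assert.h>

struct pci_host_bridge;		/* opaque for this sketch */

/* Generic weak default.  If another translation unit (arch code, as
 * powerpc does above) defines a non-weak pcibios_root_bus_prepare(),
 * the linker picks that one instead; with no override, callers get
 * this no-op. */
int __attribute__((weak)) pcibios_root_bus_prepare(struct pci_host_bridge *bridge)
{
	(void)bridge;
	return 0;
}
```

This is why core code can call the hook unconditionally: every architecture
gets either its own implementation or the harmless default.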
[PATCH Part3 v11 2/9] PCI: Move pci_bus_assign_domain_nr() declaration into drivers/pci/pci.h
pci_bus_assign_domain_nr() is only called in probe.c, so move its declaration
into drivers/pci/pci.h.

Signed-off-by: Yijing Wang <wangyij...@huawei.com>
---
 drivers/pci/pci.h   |    9 +++++++++
 include/linux/pci.h |    6 ------
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 9bd762c..bc3e79a 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -325,4 +325,13 @@ static inline int pci_dev_specific_reset(struct pci_dev *dev, int probe)

 struct pci_host_bridge *pci_find_host_bridge(struct pci_bus *bus);

+#ifdef CONFIG_PCI_DOMAINS_GENERIC
+void pci_bus_assign_domain_nr(struct pci_bus *bus, struct device *parent);
+#else
+static inline void pci_bus_assign_domain_nr(struct pci_bus *bus,
+		struct device *parent)
+{
+}
+#endif
+
 #endif /* DRIVERS_PCI_H */
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 720fdbb..5ff35cb 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1332,12 +1332,6 @@ static inline int pci_domain_nr(struct pci_bus *bus)
 {
 	return bus->domain_nr;
 }
-void pci_bus_assign_domain_nr(struct pci_bus *bus, struct device *parent);
-#else
-static inline void pci_bus_assign_domain_nr(struct pci_bus *bus,
-		struct device *parent)
-{
-}
 #endif

 /* some architectures require additional setup to direct VGA traffic */
--
1.7.1
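The moved declaration keeps the kernel's usual header idiom: a real prototype
when the feature is configured, a `static inline` no-op otherwise, so callers
never need an `#ifdef`. A standalone illustration (`FEATURE_X` is a made-up
stand-in for `CONFIG_PCI_DOMAINS_GENERIC`):

```c
#include <assert.h>

/* FEATURE_X is a hypothetical config symbol used only for this sketch. */
#ifdef FEATURE_X
int feature_get_value(void);		/* real implementation lives elsewhere */
#else
static inline int feature_get_value(void)
{
	return 0;			/* harmless default when not configured */
}
#endif
```

Compiled without `FEATURE_X`, callers transparently get the stub and the
compiler can discard the call entirely.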
[PATCH V3] cpuidle: Handle tick_broadcast_enter() failure gracefully
When a CPU has to enter an idle state where tick stops, it makes a call to tick_broadcast_enter(). The call will fail if this CPU is the broadcast CPU. Today, under such a circumstance, the arch cpuidle code handles this CPU. This is not convincing because not only do we not know what the arch cpuidle code does, but we also do not account for the idle state residency time and usage of such a CPU. This scenario can be handled better by simply choosing an idle state where in ticks do not stop. To accommodate this change move the setting of runqueue idle state from the core to the cpuidle driver, else the rq-idle_state will be set wrong. Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com --- Changes from V2: https://lkml.org/lkml/2015/5/7/78 Introduce a function in cpuidle core to select an idle state where ticks do not stop rather than going through the governors. Changes from V1: https://lkml.org/lkml/2015/5/7/24 Rebased on the latest linux-pm/bleeding-edge branch drivers/cpuidle/cpuidle.c | 45 +++-- include/linux/sched.h | 16 kernel/sched/core.c | 17 + kernel/sched/fair.c |2 +- kernel/sched/idle.c |6 -- kernel/sched/sched.h | 24 6 files changed, 77 insertions(+), 33 deletions(-) diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index 8c24f95..d1af760 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -21,6 +21,7 @@ #include linux/module.h #include linux/suspend.h #include linux/tick.h +#include linux/sched.h #include trace/events/power.h #include cpuidle.h @@ -146,6 +147,36 @@ int cpuidle_enter_freeze(struct cpuidle_driver *drv, struct cpuidle_device *dev) return index; } +/* + * find_tick_valid_state - select a state where tick does not stop + * @dev: cpuidle device for this cpu + * @drv: cpuidle driver for this cpu + */ +static int find_tick_valid_state(struct cpuidle_device *dev, + struct cpuidle_driver *drv) +{ + int i, ret = -1; + + for (i = CPUIDLE_DRIVER_STATE_START; i drv-state_count; i++) { + struct 
cpuidle_state *s = drv-states[i]; + struct cpuidle_state_usage *su = dev-states_usage[i]; + + /* +* We do not explicitly check for latency requirement +* since it is safe to assume that only shallower idle +* states will have the CPUIDLE_FLAG_TIMER_STOP bit +* cleared and they will invariably meet the latency +* requirement. +*/ + if (s-disabled || su-disable || + (s-flags CPUIDLE_FLAG_TIMER_STOP)) + continue; + + ret = i; + } + return ret; +} + /** * cpuidle_enter_state - enter the state and update stats * @dev: cpuidle device for this cpu @@ -168,10 +199,17 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, * CPU as a broadcast timer, this call may fail if it is not available. */ if (broadcast tick_broadcast_enter()) { - default_idle_call(); - return -EBUSY; + index = find_tick_valid_state(dev, drv); + if (index 0) { + default_idle_call(); + return -EBUSY; + } + target_state = drv-states[index]; } + /* Take note of the planned idle state. */ + idle_set_state(smp_processor_id(), target_state); + trace_cpu_idle_rcuidle(index, dev-cpu); time_start = ktime_get(); @@ -180,6 +218,9 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, time_end = ktime_get(); trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev-cpu); + /* The cpu is no longer idle or about to enter idle. 
*/ + idle_set_state(smp_processor_id(), NULL); + if (broadcast) { if (WARN_ON_ONCE(!irqs_disabled())) local_irq_disable(); diff --git a/include/linux/sched.h b/include/linux/sched.h index 26a2e61..fef8359 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -45,6 +45,7 @@ struct sched_param { #include linux/rcupdate.h #include linux/rculist.h #include linux/rtmutex.h +#include linux/cpuidle.h #include linux/time.h #include linux/param.h @@ -893,6 +894,21 @@ enum cpu_idle_type { CPU_MAX_IDLE_TYPES }; +#ifdef CONFIG_CPU_IDLE +extern void idle_set_state(int cpu, struct cpuidle_state *idle_state); +extern struct cpuidle_state *idle_get_state(int cpu); +#else +static inline void idle_set_state(int cpu, + struct cpuidle_state *idle_state) +{ +} + +static inline struct cpuidle_state *idle_get_state(int cpu) +{ + return NULL; +} +#endif + /* * Increase resolution of
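Note that find_tick_valid_state() keeps overwriting `ret` as it scans, so it
returns the deepest enabled state whose timer keeps ticking. A userspace model
of that selection loop (the flag value and struct layout are simplified for
the sketch):

```c
#include <assert.h>

#define FLAG_TIMER_STOP	0x1	/* stand-in for CPUIDLE_FLAG_TIMER_STOP */

struct state {
	unsigned int flags;
	int disabled;
};

/* Mirror of find_tick_valid_state(): scan all states and remember the
 * last (i.e. deepest) usable one that does not stop the tick. */
static int find_tick_valid_state(const struct state *states, int count)
{
	int i, ret = -1;

	for (i = 0; i < count; i++) {
		if (states[i].disabled ||
		    (states[i].flags & FLAG_TIMER_STOP))
			continue;
		ret = i;	/* keep scanning: prefer the deepest match */
	}
	return ret;
}
```

If every state stops the tick, the function returns -1 and the caller falls
back to default_idle_call(), exactly as in the patch.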
[PATCH Part3 v11 4/9] PCI: Introduce pci_host_assign_domain_nr() to assign domain
Introduce pci_host_assign_domain_nr() to save the domain number in
pci_host_bridge.

Signed-off-by: Yijing Wang <wangyij...@huawei.com>
---
 drivers/pci/pci.c |   24 +++++++++++++++++++-----
 drivers/pci/pci.h |    1 +
 2 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 7bf27e8..46a0240 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4506,10 +4506,10 @@ static int pci_get_new_domain_nr(void)
 	return atomic_inc_return(&__domain_nr);
 }

-void pci_bus_assign_domain_nr(struct pci_bus *bus, struct device *parent)
+static int pci_assign_domain_nr(struct device *dev)
 {
 	static int use_dt_domains = -1;
-	int domain = of_get_pci_domain_nr(parent->of_node);
+	int domain = of_get_pci_domain_nr(dev->of_node);

 	/*
 	 * Check DT domain and use_dt_domains values.
@@ -4543,16 +4543,30 @@ void pci_bus_assign_domain_nr(struct pci_bus *bus, struct device *parent)
 		use_dt_domains = 0;
 		domain = pci_get_new_domain_nr();
 	} else {
-		dev_err(parent, "Node %s has inconsistent \"linux,pci-domain\" property in DT\n",
-			parent->of_node->full_name);
+		dev_err(dev, "Node %s has inconsistent \"linux,pci-domain\" property in DT\n",
+			dev->of_node->full_name);
 		domain = -1;
 	}

-	bus->domain_nr = domain;
+	return domain;
+}
+
+void pci_bus_assign_domain_nr(struct pci_bus *bus, struct device *parent)
+{
+	bus->domain_nr = pci_assign_domain_nr(parent);
 }
 #endif
 #endif

+void pci_host_assign_domain_nr(struct pci_host_bridge *host, int domain)
+{
+#ifdef CONFIG_PCI_DOMAINS_GENERIC
+	host->domain = pci_assign_domain_nr(host->dev.parent);
+#else
+	host->domain = domain;
+#endif
+}
+
 /**
  * pci_ext_cfg_avail - can we access extended PCI config space?
  *
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index bc3e79a..c2e1a6b 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -334,4 +334,5 @@ static inline void pci_bus_assign_domain_nr(struct pci_bus *bus,
 }
 #endif

+void pci_host_assign_domain_nr(struct pci_host_bridge *host, int domain);
 #endif /* DRIVERS_PCI_H */
--
1.7.1
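The policy inside pci_assign_domain_nr() is all-or-nothing: either every host
bridge gets its domain from the device tree's "linux,pci-domain" property, or
none does; mixing the two is rejected. A userspace model of that tri-state
logic (state is held in a struct here, rather than a function-local static, so
independent scenarios can be exercised):

```c
#include <assert.h>

/* Model of the DT-vs-generated numbering policy discussed above. */
struct domain_alloc {
	int use_dt_domains;	/* -1: undecided, 1: DT, 0: generated */
	int next_domain;	/* counter standing in for pci_get_new_domain_nr() */
};

static int assign_domain_nr(struct domain_alloc *da, int dt_domain)
{
	if (dt_domain >= 0 && da->use_dt_domains) {
		da->use_dt_domains = 1;		/* commit to DT numbering */
		return dt_domain;
	} else if (dt_domain < 0 && da->use_dt_domains != 1) {
		da->use_dt_domains = 0;		/* commit to generated numbering */
		return da->next_domain++;
	}
	return -1;				/* inconsistent DT properties */
}
```

The initial -1 ("undecided") is truthy, which is why the first caller can
steer the system either way; every later caller must be consistent with it.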
Re: [PATCH V2] cpuidle: Handle tick_broadcast_enter() failure gracefully
On 05/08/2015 02:20 AM, Rafael J. Wysocki wrote: On Thursday, May 07, 2015 11:17:21 PM Preeti U Murthy wrote: When a CPU has to enter an idle state where tick stops, it makes a call to tick_broadcast_enter(). The call will fail if this CPU is the broadcast CPU. Today, under such a circumstance, the arch cpuidle code handles this CPU. This is not convincing because not only are we not aware what the arch cpuidle code does, but we also do not account for the idle state residency time and usage of such a CPU. This scenario can be handled better by simply asking the cpuidle governor to choose an idle state where in ticks do not stop. To accommodate this change move the setting of runqueue idle state from the core to the cpuidle driver, else the rq-idle_state will be set wrong. Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com --- Changes from V1: https://lkml.org/lkml/2015/5/7/24 Rebased on the latest linux-pm/bleeding-edge drivers/cpuidle/cpuidle.c | 21 + drivers/cpuidle/governors/ladder.c | 13 ++--- drivers/cpuidle/governors/menu.c |6 +- include/linux/cpuidle.h|6 +++--- include/linux/sched.h | 16 kernel/sched/core.c| 17 + kernel/sched/fair.c|2 +- kernel/sched/idle.c|8 +--- kernel/sched/sched.h | 24 9 files changed, 70 insertions(+), 43 deletions(-) diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index 8c24f95..b7e86f4 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -21,6 +21,7 @@ #include linux/module.h #include linux/suspend.h #include linux/tick.h +#include linux/sched.h #include trace/events/power.h #include cpuidle.h @@ -168,10 +169,17 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, * CPU as a broadcast timer, this call may fail if it is not available. */ if (broadcast tick_broadcast_enter()) { -default_idle_call(); -return -EBUSY; +index = cpuidle_select(drv, dev, !broadcast); No, you can't do that. 
> This code path may be used by suspend-to-idle and that should not call
> cpuidle_select().
>
> What's needed here seems to be a fallback mechanism like "choose the
> deepest state shallower than X and such that it won't stop the tick".
> You don't really need to run a full governor for that.

Agreed. Makes the patch a lot simpler as well. I have sent out V3 doing
this.

Thank you

Regards
Preeti U Murthy
Re: [PATCH v6 1/2] arm64: dts: Add the arasan sdhci nodes in apm-storm.dtsi.
On Wed, May 6, 2015 at 7:12 PM, Suman Tripathi stripa...@apm.com wrote: This patch adds the arasan sdhci nodes to reuse the of-arasan driver for APM X-Gene SoC. Signed-off-by: Suman Tripathi stripa...@apm.com --- arch/arm64/boot/dts/apm/apm-mustang.dts | 4 +++ arch/arm64/boot/dts/apm/apm-storm.dtsi | 43 + 2 files changed, 47 insertions(+) diff --git a/arch/arm64/boot/dts/apm/apm-mustang.dts b/arch/arm64/boot/dts/apm/apm-mustang.dts index 83578e7..7ccd517 100644 --- a/arch/arm64/boot/dts/apm/apm-mustang.dts +++ b/arch/arm64/boot/dts/apm/apm-mustang.dts @@ -52,3 +52,7 @@ xgenet { status = ok; }; + +sdhci0 { + status = ok; +}; diff --git a/arch/arm64/boot/dts/apm/apm-storm.dtsi b/arch/arm64/boot/dts/apm/apm-storm.dtsi index c8d3e0e..b5d2698 100644 --- a/arch/arm64/boot/dts/apm/apm-storm.dtsi +++ b/arch/arm64/boot/dts/apm/apm-storm.dtsi @@ -145,6 +145,40 @@ clock-output-names = socplldiv2; }; + ahbclk: ahbclk@1f2ac000 { + compatible = apm,xgene-device-clock; + #clock-cells = 1; + clocks = socplldiv2 0; + reg = 0x0 0x1f2ac000 0x0 0x1000 + 0x0 0x1700 0x0 0x2000; + reg-names = csr-reg, div-reg; + csr-offset = 0x0; + csr-mask = 0x1; + enable-offset = 0x8; + enable-mask = 0x1; + divider-offset = 0x164; + divider-width = 0x5; + divider-shift = 0x0; + clock-output-names = ahbclk; + }; + + sdioclk: sdioclk@1f2ac000 { + compatible = apm,xgene-device-clock; + #clock-cells = 1; + clocks = socplldiv2 0; + reg = 0x0 0x1f2ac000 0x0 0x1000 + 0x0 0x1700 0x0 0x2000; + reg-names = csr-reg, div-reg; + csr-offset = 0x0; + csr-mask = 0x2; + enable-offset = 0x8; + enable-mask = 0x2; + divider-offset = 0x178; + divider-width = 0x8; + divider-shift = 0x0; + clock-output-names = sdioclk; + }; + qmlclk: qmlclk { compatible = apm,xgene-device-clock; #clock-cells = 1; @@ -533,6 +567,15 @@ interrupts = 0x0 0x4f 0x4; }; + sdhci0: sdhci@1c00 { + compatible = arasan,sdhci-4.9a; + reg = 0x0 0x1c00 0x0 0x100; + interrupts = 0x0 0x49 0x4; + dma-coherent; + clock-names = clk_xin, clk_ahb; + clocks = 
sdioclk 0, ahbclk 0; + }; + phy1: phy@1f21a000 { compatible = apm,xgene-phy; reg = 0x0 0x1f21a000 0x0 0x100; -- 1.8.2.1 Any comments on this patch ?? -- Thanks, with regards, Suman Tripathi ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
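The clock nodes above describe dividers by register location: a
`divider-offset` into the CSR block, plus `divider-shift` and `divider-width`
selecting a bit field. A sketch of how such a field is typically evaluated to
a clock rate; the register value below is invented for illustration, and this
is not the xgene clock driver's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Extract a divider field of 'width' bits at 'shift' from a divider
 * register value, then divide the parent rate by it.  A zero divider is
 * treated as divide-by-one here (real drivers vary on this). */
static uint64_t divider_rate(uint64_t parent_rate, uint32_t reg_val,
			     unsigned int shift, unsigned int width)
{
	uint32_t div = (reg_val >> shift) & ((1u << width) - 1);

	return div ? parent_rate / div : parent_rate;
}
```

For the `sdioclk` node above the parameters would be shift 0 and width 8,
matching `divider-shift = 0x0` and `divider-width = 0x8`.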
[PATCH Part3 v11 7/9] PCI: Create pci host bridge prior to root bus
pci_host_bridge holds the domain number, so we need to assign the domain
number prior to root bus creation, because the root bus needs the domain
number to check whether it already exists.

Signed-off-by: Yijing Wang <wangyij...@huawei.com>
---
 drivers/pci/probe.c |   60 ++++++++++++++++++++++++++------------------------
 1 files changed, 31 insertions(+), 29 deletions(-)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 9ed8ab7..e4ef791 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -515,7 +515,7 @@ static void pci_release_host_bridge_dev(struct device *dev)
 	kfree(bridge);
 }

-static struct pci_host_bridge *pci_alloc_host_bridge(struct pci_bus *b)
+static struct pci_host_bridge *pci_alloc_host_bridge(void)
 {
 	struct pci_host_bridge *bridge;

@@ -524,7 +524,6 @@ static struct pci_host_bridge *pci_alloc_host_bridge(struct pci_bus *b)
 		return NULL;

 	INIT_LIST_HEAD(&bridge->windows);
-	bridge->bus = b;

 	return bridge;
 }
@@ -1902,48 +1901,51 @@ struct pci_bus *pci_create_root_bus(struct device *parent, int domain,
 {
 	int error;
 	struct pci_host_bridge *bridge;
-	struct pci_bus *b, *b2;
+	struct pci_bus *b;
 	struct resource_entry *window, *n;
 	struct resource *res;
 	resource_size_t offset;
 	char bus_addr[64];
 	char *fmt;

-	b = pci_alloc_bus(NULL);
-	if (!b)
-		return NULL;
-
-	b->sysdata = sysdata;
-	b->ops = ops;
-	b->number = b->busn_res.start = bus;
-	pci_bus_assign_domain_nr(b, parent);
-	b2 = pci_find_bus(pci_domain_nr(b), bus);
-	if (b2) {
-		/* If we already got to this bus through a different bridge, ignore it */
-		dev_dbg(&b2->dev, "bus already known\n");
-		goto err_out;
-	}
-
-	bridge = pci_alloc_host_bridge(b);
+	bridge = pci_alloc_host_bridge();
 	if (!bridge)
-		goto err_out;
+		return NULL;

-	bridge->domain = domain;
 	bridge->dev.parent = parent;
+	pci_host_assign_domain_nr(bridge, domain);
 	bridge->dev.release = pci_release_host_bridge_dev;
 	dev_set_drvdata(&bridge->dev, sysdata);
-	dev_set_name(&bridge->dev, "pci%04x:%02x", pci_domain_nr(b), bus);
+	dev_set_name(&bridge->dev, "pci%04x:%02x", bridge->domain, bus);
 	error = pcibios_root_bridge_prepare(bridge);
 	if (error) {
 		kfree(bridge);
-		goto err_out;
+		return NULL;
 	}

 	error = device_register(&bridge->dev);
 	if (error) {
 		put_device(&bridge->dev);
-		goto err_out;
+		return NULL;
 	}
+
+	b = pci_find_bus(bridge->domain, bus);
+	if (b) {
+		/* If we already got to this bus through a different bridge, ignore it */
+		dev_dbg(&b->dev, "bus already known\n");
+		goto unregister_host;
+	}
+
+	b = pci_alloc_bus(NULL);
+	if (!b)
+		goto unregister_host;
+
+	bridge->bus = b;
+	b->sysdata = sysdata;
+	b->ops = ops;
+	b->number = b->busn_res.start = bus;
+	pci_bus_assign_domain_nr(b, parent);

 	b->bridge = get_device(&bridge->dev);
 	device_enable_async_suspend(b->bridge);
 	pci_set_bus_of_node(b);
@@ -1956,11 +1958,11 @@ struct pci_bus *pci_create_root_bus(struct device *parent, int domain,
 	dev_set_name(&b->dev, "%04x:%02x", pci_domain_nr(b), bus);
 	error = pcibios_root_bus_prepare(bridge);
 	if (error)
-		goto class_dev_reg_err;
+		goto free_bus;

 	error = device_register(&b->dev);
 	if (error)
-		goto class_dev_reg_err;
+		goto free_bus;

 	pcibios_add_bus(b);

@@ -2000,11 +2002,11 @@ struct pci_bus *pci_create_root_bus(struct device *parent, int domain,

 	return b;

-class_dev_reg_err:
+free_bus:
+	kfree(b);
 	put_device(&bridge->dev);
+unregister_host:
 	device_unregister(&bridge->dev);
-err_out:
-	kfree(b);
 	return NULL;
 }
 EXPORT_SYMBOL_GPL(pci_create_root_bus);
--
1.7.1
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
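The reordered error path above follows the kernel's goto-unwind convention:
each failure jumps to a label that releases exactly what was built so far, in
reverse order. A generic standalone sketch of that shape (plain allocations
stand in for the bridge and bus objects; names are invented):

```c
#include <assert.h>
#include <stdlib.h>

/* Build two resources in order; on failure of the second, unwind the
 * first via the labelled cleanup path, mirroring how
 * pci_create_root_bus() unwinds the host bridge when bus creation
 * fails.  'fail_second' lets the test force the error path. */
static int build_pair(int fail_second, void **a_out, void **b_out)
{
	void *a, *b;

	a = malloc(16);
	if (!a)
		goto err;

	b = fail_second ? NULL : malloc(16);
	if (!b)
		goto free_a;

	*a_out = a;
	*b_out = b;
	return 0;

free_a:
	free(a);
err:
	return -1;
}
```

Because each label only undoes the steps that preceded it, reordering the
construction (bridge before bus) just means reordering the labels, which is
exactly the class_dev_reg_err to free_bus/unregister_host change above.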
[PATCH Part3 v11 6/9] PCI: Make pci_host_bridge hold sysdata in drvdata
Now platform specific sysdata is saved in pci_bus, and
pcibios_root_bridge_prepare() needs to know the sysdata. Later, we will move
pcibios_root_bridge_prepare() prior to root bus creation, so we need to make
pci_host_bridge hold the sysdata.

Signed-off-by: Yijing Wang <wangyij...@huawei.com>
---
 arch/ia64/pci/pci.c |    2 +-
 arch/x86/pci/acpi.c |    2 +-
 drivers/pci/probe.c |    1 +
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/ia64/pci/pci.c b/arch/ia64/pci/pci.c
index 33803f7..c82d666 100644
--- a/arch/ia64/pci/pci.c
+++ b/arch/ia64/pci/pci.c
@@ -478,7 +478,7 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)

 int pcibios_root_bridge_prepare(struct pci_host_bridge *bridge)
 {
-	struct pci_controller *controller = bridge->bus->sysdata;
+	struct pci_controller *controller = dev_get_drvdata(&bridge->dev);

 	ACPI_COMPANION_SET(&bridge->dev, controller->companion);
 	return 0;
diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
index 7563855..948b675 100644
--- a/arch/x86/pci/acpi.c
+++ b/arch/x86/pci/acpi.c
@@ -462,7 +462,7 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)

 int pcibios_root_bridge_prepare(struct pci_host_bridge *bridge)
 {
-	struct pci_sysdata *sd = bridge->bus->sysdata;
+	struct pci_sysdata *sd = dev_get_drvdata(&bridge->dev);

 	ACPI_COMPANION_SET(&bridge->dev, sd->companion);
 	return 0;
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index f5f5de6..9ed8ab7 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1931,6 +1931,7 @@ struct pci_bus *pci_create_root_bus(struct device *parent, int domain,
 	bridge->domain = domain;
 	bridge->dev.parent = parent;
 	bridge->dev.release = pci_release_host_bridge_dev;
+	dev_set_drvdata(&bridge->dev, sysdata);
 	dev_set_name(&bridge->dev, "pci%04x:%02x", pci_domain_nr(b), bus);
 	error = pcibios_root_bridge_prepare(bridge);
 	if (error) {
--
1.7.1
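The patch leans on the driver-data mechanism: a device carries one opaque
pointer that the producer sets and the consumer reads back, with neither side's
concrete type visible at the device layer. A minimal model of that mechanism
(the real `struct device` carries much more, and the kernel accessors go
through `dev->driver_data` equivalently):

```c
#include <assert.h>

/* Stripped-down device with just the opaque driver-data slot. */
struct device {
	void *driver_data;
};

static void dev_set_drvdata(struct device *dev, void *data)
{
	dev->driver_data = data;
}

static void *dev_get_drvdata(const struct device *dev)
{
	return dev->driver_data;
}
```

This is why pci_create_root_bus() can stash arch-specific `sysdata` on the
bridge device and each arch's pcibios_root_bridge_prepare() can cast it back
to its own controller type.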
Re: [PATCH v2 1/2] powerpc/powernv: Add poweroff (EPOW, DPO) events support for PowerNV platform
Hi Vipin, These comments are in addition to what Joel has said in his review. On Thu, May 7, 2015 at 3:00 PM, Vipin K Parashar vi...@linux.vnet.ibm.com wrote: This patch adds support for FSP EPOW (Early Power Off Warning) and DPO (Delayed Power Off) events support for PowerNV platform. EPOW events are generated by SPCN/FSP due to various critical system conditions that need system shutdown. Few examples of these conditions are high ambient temperature or system running on UPS power with low UPS battery. DPO event is generated in response to admin initiated system shutdown request. This patch enables host kernel on PowerNV platform to handle OPAL notifications for these events and initiate system poweroff. Since EPOW notifications are sent in advance of impending shutdown event and thus this patch also adds functionality to wait for EPOW condition to return to normal. Host allows MAX_POWEROFF_SYS_TIME (600 seconds) as system poweroff time (time for host + guests shutdown) and waits for remaining time for EPOW condition to return to normal. If EPOW condition doesn't return to normal in calculated time it proceeds with graceful system shutdown. For EPOW events with smaller timeouts values than MAX_POWEROFF_SYS_TIME it proceeds with system shutdown without any wait for EPOW condition to return to normal. System admin can also add systemd service shutdown scripts to perform any specific actions like graceful guest shutdown upon system poweroff. libvirt-guests is systemd service available on recent distros for management of guests at system stat/shutdown time. 
Signed-off-by: Vipin K Parashar vi...@linux.vnet.ibm.com --- arch/powerpc/include/asm/opal-api.h| 30 ++ arch/powerpc/include/asm/opal.h| 3 +- arch/powerpc/platforms/powernv/opal-power.c| 379 +++-- arch/powerpc/platforms/powernv/opal-wrappers.S | 1 + 4 files changed, 391 insertions(+), 22 deletions(-) diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h index 0321a90..03b3cef 100644 --- a/arch/powerpc/include/asm/opal-api.h +++ b/arch/powerpc/include/asm/opal-api.h @@ -730,6 +730,36 @@ struct opal_i2c_request { __be64 buffer_ra; /* Buffer real address */ }; +/* + * EPOW status sharing (OPAL and the host) + * + * The host will pass on OPAL, a buffer of length OPAL_EPOW_MAX_CLASSES + * to fetch system wide EPOW status. Each element in the returned buffer + * will contain bitwise EPOW status for each EPOW sub class. + */ + +/* EPOW types */ +enum OpalEpow { + OPAL_EPOW_POWER = 0,/* Power EPOW */ + OPAL_EPOW_TEMP = 1,/* Temperature EPOW */ + OPAL_EPOW_COOLING = 2,/* Cooling EPOW */ + OPAL_MAX_EPOW_CLASSES = 3,/* Max EPOW categories */ +}; Dont explicitly assign sequential numbers in an enum. Its taken care of by the compiler. 
+ +/* Power EPOW events */ +enum OpalEpowPower { + OPAL_EPOW_POWER_UPS = 0x1, /* System on UPS power */ + OPAL_EPOW_POWER_UPS_LOW = 0x2, /* System on UPS power with low battery*/ +}; + +/* Temperature EPOW events */ +enum OpalEpowTemp { + OPAL_EPOW_TEMP_HIGH_AMB = 0x1, /* High ambient temperature */ + OPAL_EPOW_TEMP_CRIT_AMB = 0x2, /* Critical ambient temperature */ + OPAL_EPOW_TEMP_HIGH_INT = 0x4, /* High internal temperature */ + OPAL_EPOW_TEMP_CRIT_INT = 0x8, /* Critical internal temperature */ +}; + #endif /* __ASSEMBLY__ */ #endif /* __OPAL_API_H */ diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h index 042af1a..0777864 100644 --- a/arch/powerpc/include/asm/opal.h +++ b/arch/powerpc/include/asm/opal.h @@ -141,7 +141,6 @@ int64_t opal_pci_fence_phb(uint64_t phb_id); int64_t opal_pci_reinit(uint64_t phb_id, uint64_t reinit_scope, uint64_t data); int64_t opal_pci_mask_pe_error(uint64_t phb_id, uint16_t pe_number, uint8_t error_type, uint8_t mask_action); int64_t opal_set_slot_led_status(uint64_t phb_id, uint64_t slot_id, uint8_t led_type, uint8_t led_action); -int64_t opal_get_epow_status(__be64 *status); int64_t opal_set_system_attention_led(uint8_t led_action); int64_t opal_pci_next_error(uint64_t phb_id, __be64 *first_frozen_pe, __be16 *pci_error_type, __be16 *severity); @@ -200,6 +199,8 @@ int64_t opal_flash_write(uint64_t id, uint64_t offset, uint64_t buf, uint64_t size, uint64_t token); int64_t opal_flash_erase(uint64_t id, uint64_t offset, uint64_t size, uint64_t token); +int32_t opal_get_epow_status(__be32 *status, __be32 *num_classes); +int32_t opal_get_dpo_status(__be32 *timeout); /* Internal functions */ extern int early_init_dt_scan_opal(unsigned long node, const char *uname, diff --git a/arch/powerpc/platforms/powernv/opal-power.c b/arch/powerpc/platforms/powernv/opal-power.c
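The review comment about the OpalEpow enum is that C already auto-numbers enumerators: each one without an explicit value is the previous value plus one, starting at zero, so a trailing MAX entry tracks the count by itself. A small illustrative enum (not the OPAL one):

```c
#include <assert.h>

/* Without explicit values, C numbers enumerators 0, 1, 2, ... and a
 * trailing MAX entry always equals the number of real entries. */
enum epow_class {
    EPOW_POWER,        /* 0 */
    EPOW_TEMP,         /* 1 */
    EPOW_COOLING,      /* 2 */
    EPOW_MAX_CLASSES   /* 3 -- stays correct if entries are added above */
};
```

Note this only applies to sequential indices like the class list; the bit-flag enums (0x1, 0x2, 0x4, ...) in the patch do need explicit values.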
[PATCH Part3 v11 3/9] PCI: Remove declaration for pci_get_new_domain_nr()
pci_get_new_domain_nr() is only used in drivers/pci/pci.c, so remove the declaration in include/linux/pci.h. Signed-off-by: Yijing Wang wangyij...@huawei.com --- drivers/pci/pci.c |4 ++-- include/linux/pci.h |3 --- 2 files changed, 2 insertions(+), 5 deletions(-) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index acc4b6e..7bf27e8 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -4498,14 +4498,14 @@ static void pci_no_domains(void) } #ifdef CONFIG_PCI_DOMAINS +#ifdef CONFIG_PCI_DOMAINS_GENERIC static atomic_t __domain_nr = ATOMIC_INIT(-1); -int pci_get_new_domain_nr(void) +static int pci_get_new_domain_nr(void) { return atomic_inc_return(&__domain_nr); } -#ifdef CONFIG_PCI_DOMAINS_GENERIC void pci_bus_assign_domain_nr(struct pci_bus *bus, struct device *parent) { static int use_dt_domains = -1; diff --git a/include/linux/pci.h b/include/linux/pci.h index 5ff35cb..636c0a9 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -1314,12 +1314,10 @@ void pci_cfg_access_unlock(struct pci_dev *dev); */ #ifdef CONFIG_PCI_DOMAINS extern int pci_domains_supported; -int pci_get_new_domain_nr(void); #else enum { pci_domains_supported = 0 }; static inline int pci_domain_nr(struct pci_bus *bus) { return 0; } static inline int pci_proc_domain(struct pci_bus *bus) { return 0; } -static inline int pci_get_new_domain_nr(void) { return -ENOSYS; } #endif /* CONFIG_PCI_DOMAINS */ /* @@ -1442,7 +1440,6 @@ static inline struct pci_dev *pci_get_bus_and_slot(unsigned int bus, static inline int pci_domain_nr(struct pci_bus *bus) { return 0; } static inline struct pci_dev *pci_dev_get(struct pci_dev *dev) { return NULL; } -static inline int pci_get_new_domain_nr(void) { return -ENOSYS; } #define dev_is_pci(d) (false) #define dev_is_pf(d) (false) -- 1.7.1
Re: [PATCH v6 2/2] mmc: sdhci: Add support to disable SDR104/SDR50/DDR50 based on capability register 0.
On Wed, May 6, 2015 at 7:12 PM, Suman Tripathi stripa...@apm.com wrote: The sdhci framework disables SDR104/SDR50/DDR50 based only on a quirk. This patch adds support to disable SDR104/SDR50/DDR50 based on reading capability register 0. Signed-off-by: Suman Tripathi stripa...@apm.com --- drivers/mmc/host/sdhci.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c index c80287a..e024c64 100644 --- a/drivers/mmc/host/sdhci.c +++ b/drivers/mmc/host/sdhci.c @@ -3199,7 +3199,8 @@ int sdhci_add_host(struct sdhci_host *host) } } - if (host->quirks2 & SDHCI_QUIRK2_NO_1_8_V) + if (host->quirks2 & SDHCI_QUIRK2_NO_1_8_V || + !(caps[0] & SDHCI_CAN_VDD_180)) caps[1] &= ~(SDHCI_SUPPORT_SDR104 | SDHCI_SUPPORT_SDR50 | SDHCI_SUPPORT_DDR50); -- 1.8.2.1 Any comments on this patch? -- Thanks, with regards, Suman Tripathi
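The added logic is a plain bit test against capability register 0: if the controller cannot supply 1.8 V signalling, the 1.8 V-only UHS modes are masked out of capability register 1. A host-independent sketch; the bit positions below are invented for illustration (the real SDHCI_* masks live in drivers/mmc/host/sdhci.h):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit positions, not the real sdhci.h values. */
#define CAN_VDD_180     (1u << 26)  /* caps[0]: 1.8 V supply supported */
#define SUPPORT_SDR50   (1u << 0)   /* caps[1] UHS mode bits */
#define SUPPORT_SDR104  (1u << 1)
#define SUPPORT_DDR50   (1u << 2)
#define QUIRK2_NO_1_8_V (1u << 0)

/* Clear the 1.8 V-only modes when either the quirk is set or the
 * capability register says 1.8 V is unavailable. */
static uint32_t mask_uhs_modes(uint32_t caps0, uint32_t caps1,
                               uint32_t quirks2)
{
    if ((quirks2 & QUIRK2_NO_1_8_V) || !(caps0 & CAN_VDD_180))
        caps1 &= ~(SUPPORT_SDR104 | SUPPORT_SDR50 | SUPPORT_DDR50);
    return caps1;
}
```

The patch's point is that the hardware capability bit gates the modes even on controllers that never set the quirk.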
[PATCH Part3 v11 9/9] PCI: Remove pci_bus_assign_domain_nr()
Now we save the domain number in pci_host_bridge, we could remove pci_bus_assign_domain_nr() and clean the domain member in pci_bus. Signed-off-by: Yijing Wang wangyij...@huawei.com --- drivers/pci/pci.c |5 - drivers/pci/pci.h |9 - drivers/pci/probe.c | 11 +++ include/linux/pci.h |3 --- 4 files changed, 3 insertions(+), 25 deletions(-) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 2e2f429..a3cb571 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -4558,11 +4558,6 @@ static int pci_assign_domain_nr(struct device *dev) return domain; } - -void pci_bus_assign_domain_nr(struct pci_bus *bus, struct device *parent) -{ - bus-domain_nr = pci_assign_domain_nr(parent); -} #endif #endif diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index c2e1a6b..d8a4238 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -325,14 +325,5 @@ static inline int pci_dev_specific_reset(struct pci_dev *dev, int probe) struct pci_host_bridge *pci_find_host_bridge(struct pci_bus *bus); -#ifdef CONFIG_PCI_DOMAINS_GENERIC -void pci_bus_assign_domain_nr(struct pci_bus *bus, struct device *parent); -#else -static inline void pci_bus_assign_domain_nr(struct pci_bus *bus, - struct device *parent) -{ -} -#endif - void pci_host_assign_domain_nr(struct pci_host_bridge *host, int domain); #endif /* DRIVERS_PCI_H */ diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index e4ef791..be60074 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -481,7 +481,7 @@ void pci_read_bridge_bases(struct pci_bus *child) } } -static struct pci_bus *pci_alloc_bus(struct pci_bus *parent) +static struct pci_bus *pci_alloc_bus(void) { struct pci_bus *b; @@ -496,10 +496,6 @@ static struct pci_bus *pci_alloc_bus(struct pci_bus *parent) INIT_LIST_HEAD(b-resources); b-max_bus_speed = PCI_SPEED_UNKNOWN; b-cur_bus_speed = PCI_SPEED_UNKNOWN; -#ifdef CONFIG_PCI_DOMAINS_GENERIC - if (parent) - b-domain_nr = parent-domain_nr; -#endif return b; } @@ -670,7 +666,7 @@ static struct pci_bus 
*pci_alloc_child_bus(struct pci_bus *parent, /* * Allocate a new bus, and inherit stuff from the parent.. */ - child = pci_alloc_bus(parent); + child = pci_alloc_bus(); if (!child) return NULL; @@ -1936,7 +1932,7 @@ struct pci_bus *pci_create_root_bus(struct device *parent, int domain, goto unregister_host; } - b = pci_alloc_bus(NULL); + b = pci_alloc_bus(); if (!b) goto unregister_host; @@ -1944,7 +1940,6 @@ struct pci_bus *pci_create_root_bus(struct device *parent, int domain, b-sysdata = sysdata; b-ops = ops; b-number = b-busn_res.start = bus; - pci_bus_assign_domain_nr(b, parent); b-bridge = get_device(bridge-dev); device_enable_async_suspend(b-bridge); diff --git a/include/linux/pci.h b/include/linux/pci.h index 13ed681..f010042 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -460,9 +460,6 @@ struct pci_bus { unsigned char primary;/* number of primary bridge */ unsigned char max_bus_speed; /* enum pci_bus_speed */ unsigned char cur_bus_speed; /* enum pci_bus_speed */ -#ifdef CONFIG_PCI_DOMAINS_GENERIC - int domain_nr; -#endif charname[48]; -- 1.7.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH Part3 v11 1/9] PCI: Save domain in pci_host_bridge
Save domain in pci_host_bridge, so we could get domain from pci_host_bridge, and at the end of series, we could clean up the arch specific pci_domain_nr(). For arm/arm64, the domain argument is pointless, because they enable CONFIG_PCI_DOMAINS_GENERIC, PCI core would assign domain number for them, so we pass meaningless -1 as the domain number. Tested-by: Gregory CLEMENT gregory.clem...@free-electrons.com #mvebu part Signed-off-by: Yijing Wang wangyij...@huawei.com --- arch/alpha/kernel/pci.c|4 ++-- arch/alpha/kernel/sys_nautilus.c |2 +- arch/arm/kernel/bios32.c |2 +- arch/arm/mach-dove/pcie.c |2 +- arch/arm/mach-iop13xx/pci.c|4 ++-- arch/arm/mach-mv78xx0/pcie.c |2 +- arch/arm/mach-orion5x/pci.c|4 ++-- arch/frv/mb93090-mb00/pci-vdk.c|3 ++- arch/ia64/pci/pci.c|4 ++-- arch/ia64/sn/kernel/io_init.c |4 ++-- arch/m68k/coldfire/pci.c |2 +- arch/microblaze/pci/pci-common.c |4 ++-- arch/mips/pci/pci.c|4 ++-- arch/mn10300/unit-asb2305/pci.c|3 ++- arch/powerpc/kernel/pci-common.c |4 ++-- arch/s390/pci/pci.c|4 ++-- arch/sh/drivers/pci/pci.c |4 ++-- arch/sparc/kernel/leon_pci.c |2 +- arch/sparc/kernel/pci.c|4 ++-- arch/sparc/kernel/pcic.c |2 +- arch/tile/kernel/pci.c |4 ++-- arch/tile/kernel/pci_gx.c |4 ++-- arch/unicore32/kernel/pci.c|2 +- arch/x86/pci/acpi.c|4 ++-- arch/x86/pci/common.c |2 +- arch/xtensa/kernel/pci.c |2 +- drivers/parisc/dino.c |2 +- drivers/parisc/lba_pci.c |2 +- drivers/pci/host/pci-versatile.c |3 ++- drivers/pci/host/pci-xgene.c |2 +- drivers/pci/host/pcie-designware.c |2 +- drivers/pci/host/pcie-iproc.c |2 +- drivers/pci/host/pcie-xilinx.c |2 +- drivers/pci/hotplug/ibmphp_core.c |2 +- drivers/pci/probe.c| 21 + drivers/pci/xen-pcifront.c |2 +- include/linux/pci.h|8 +--- 37 files changed, 70 insertions(+), 60 deletions(-) diff --git a/arch/alpha/kernel/pci.c b/arch/alpha/kernel/pci.c index 82f738e..2b0bce9 100644 --- a/arch/alpha/kernel/pci.c +++ b/arch/alpha/kernel/pci.c @@ -336,8 +336,8 @@ common_init_pci(void) pci_add_resource_offset(resources, 
hose-mem_space, hose-mem_space-start); - bus = pci_scan_root_bus(NULL, next_busno, alpha_mv.pci_ops, - hose, resources); + bus = pci_scan_root_bus(NULL, hose-index, next_busno, + alpha_mv.pci_ops, hose, resources); if (!bus) continue; hose-bus = bus; diff --git a/arch/alpha/kernel/sys_nautilus.c b/arch/alpha/kernel/sys_nautilus.c index 700686d..9614e4e 100644 --- a/arch/alpha/kernel/sys_nautilus.c +++ b/arch/alpha/kernel/sys_nautilus.c @@ -206,7 +206,7 @@ nautilus_init_pci(void) unsigned long memtop = max_low_pfn PAGE_SHIFT; /* Scan our single hose. */ - bus = pci_scan_bus(0, alpha_mv.pci_ops, hose); + bus = pci_scan_bus(hose-index, 0, alpha_mv.pci_ops, hose); if (!bus) return; diff --git a/arch/arm/kernel/bios32.c b/arch/arm/kernel/bios32.c index fc1..5c5a9bd 100644 --- a/arch/arm/kernel/bios32.c +++ b/arch/arm/kernel/bios32.c @@ -486,7 +486,7 @@ static void pcibios_init_hw(struct device *parent, struct hw_pci *hw, if (hw-scan) sys-bus = hw-scan(nr, sys); else - sys-bus = pci_scan_root_bus(parent, sys-busnr, + sys-bus = pci_scan_root_bus(parent, -1, sys-busnr, hw-ops, sys, sys-resources); if (!sys-bus) diff --git a/arch/arm/mach-dove/pcie.c b/arch/arm/mach-dove/pcie.c index 91fe971..a379287 100644 --- a/arch/arm/mach-dove/pcie.c +++ b/arch/arm/mach-dove/pcie.c @@ -160,7 +160,7 @@ dove_pcie_scan_bus(int nr, struct pci_sys_data *sys) return NULL; } - return pci_scan_root_bus(NULL, sys-busnr, pcie_ops, sys, + return pci_scan_root_bus(NULL, -1, sys-busnr, pcie_ops, sys, sys-resources); } diff --git a/arch/arm/mach-iop13xx/pci.c b/arch/arm/mach-iop13xx/pci.c index 9082b84..bc4ba7e 100644 --- a/arch/arm/mach-iop13xx/pci.c +++ b/arch/arm/mach-iop13xx/pci.c @@ -535,12 +535,12 @@ struct pci_bus *iop13xx_scan_bus(int nr, struct pci_sys_data *sys) while(time_before(jiffies, atux_trhfa_timeout)) udelay(100); - bus =
[PATCH Part3 v11 0/9] Remove platform pci_domain_nr()
This series is splitted out from previous patchset Refine PCI scan interfaces and make generic pci host bridge. It try to clean up all platform pci_domain_nr(), save domain in pci_host_bridge, so we could get domain number from the common interface. You could pull it from https://github.com/YijingWang/linux-pci.git enumer11 Yijing Wang (9): PCI: Save domain in pci_host_bridge PCI: Move pci_bus_assign_domain_nr() declaration into drivers/pci/pci.h PCI: Remove declaration for pci_get_new_domain_nr() PCI: Introduce pci_host_assign_domain_nr() to assign domain powerpc/PCI: Rename pcibios_root_bridge_prepare() to pcibios_root_bus_prepare() PCI: Make pci_host_bridge hold sysdata in drvdata PCI: Create pci host bridge prior to root bus PCI: Remove platform specific pci_domain_nr() PCI: Remove pci_bus_assign_domain_nr() arch/alpha/include/asm/pci.h |2 - arch/alpha/kernel/pci.c |4 +- arch/alpha/kernel/sys_nautilus.c |2 +- arch/arm/kernel/bios32.c |2 +- arch/arm/mach-dove/pcie.c|2 +- arch/arm/mach-iop13xx/pci.c |4 +- arch/arm/mach-mv78xx0/pcie.c |2 +- arch/arm/mach-orion5x/pci.c |4 +- arch/frv/mb93090-mb00/pci-vdk.c |3 +- arch/ia64/include/asm/pci.h |1 - arch/ia64/pci/pci.c |6 +- arch/ia64/sn/kernel/io_init.c|4 +- arch/m68k/coldfire/pci.c |2 +- arch/microblaze/pci/pci-common.c | 15 + arch/mips/include/asm/pci.h |2 - arch/mips/pci/pci.c |4 +- arch/mn10300/unit-asb2305/pci.c |3 +- arch/powerpc/include/asm/machdep.h |2 +- arch/powerpc/kernel/pci-common.c | 21 ++- arch/powerpc/platforms/pseries/pci.c |2 +- arch/powerpc/platforms/pseries/pseries.h |2 +- arch/powerpc/platforms/pseries/setup.c |2 +- arch/s390/pci/pci.c | 10 +--- arch/sh/drivers/pci/pci.c|4 +- arch/sh/include/asm/pci.h|2 - arch/sparc/kernel/leon_pci.c |2 +- arch/sparc/kernel/pci.c | 21 +-- arch/sparc/kernel/pcic.c |2 +- arch/tile/include/asm/pci.h |2 - arch/tile/kernel/pci.c |4 +- arch/tile/kernel/pci_gx.c|4 +- arch/unicore32/kernel/pci.c |2 +- arch/x86/include/asm/pci.h |6 -- arch/x86/pci/acpi.c |6 +- 
arch/x86/pci/common.c|2 +- arch/xtensa/kernel/pci.c |2 +- drivers/parisc/dino.c|2 +- drivers/parisc/lba_pci.c |2 +- drivers/pci/host/pci-versatile.c |3 +- drivers/pci/host/pci-xgene.c |2 +- drivers/pci/host/pcie-designware.c |2 +- drivers/pci/host/pcie-iproc.c|2 +- drivers/pci/host/pcie-xilinx.c |2 +- drivers/pci/hotplug/ibmphp_core.c|2 +- drivers/pci/pci.c| 31 -- drivers/pci/pci.h|1 + drivers/pci/probe.c | 94 +- drivers/pci/xen-pcifront.c |2 +- include/linux/pci.h | 27 ++--- 49 files changed, 145 insertions(+), 187 deletions(-) ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 02/12] KVM: define common __KVM_GUESTDBG_USE_SW/HW_BP values
On Wed, May 06, 2015 at 05:23:17PM +0100, Alex Bennée wrote: Currently x86, powerpc and soon arm64 use the same two architecture specific bits for guest debug support for software and hardware breakpoints. This makes the shared values explicit while leaving the gate open for another architecture to use some other value if they really really want to. Signed-off-by: Alex Bennée alex.ben...@linaro.org Reviewed-by: Andrew Jones drjo...@redhat.com diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index ab4d473..1731569 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -310,8 +310,8 @@ struct kvm_guest_debug_arch { * and upper 16 bits are architecture specific. Architecture specific defines * that ioctl is for setting hardware breakpoint or software breakpoint. */ -#define KVM_GUESTDBG_USE_SW_BP 0x00010000 -#define KVM_GUESTDBG_USE_HW_BP 0x00020000 +#define KVM_GUESTDBG_USE_SW_BP __KVM_GUESTDBG_USE_SW_BP +#define KVM_GUESTDBG_USE_HW_BP __KVM_GUESTDBG_USE_HW_BP /* definition of registers in kvm_run */ struct kvm_sync_regs { diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h index d7dcef5..1438202 100644 --- a/arch/x86/include/uapi/asm/kvm.h +++ b/arch/x86/include/uapi/asm/kvm.h @@ -250,8 +250,8 @@ struct kvm_debug_exit_arch { __u64 dr7; }; -#define KVM_GUESTDBG_USE_SW_BP 0x00010000 -#define KVM_GUESTDBG_USE_HW_BP 0x00020000 +#define KVM_GUESTDBG_USE_SW_BP __KVM_GUESTDBG_USE_SW_BP +#define KVM_GUESTDBG_USE_HW_BP __KVM_GUESTDBG_USE_HW_BP #define KVM_GUESTDBG_INJECT_DB 0x00040000 #define KVM_GUESTDBG_INJECT_BP 0x00080000 diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 70ac641..3b6252e 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -570,8 +570,16 @@ struct kvm_s390_irq_state { /* for KVM_SET_GUEST_DEBUG */ -#define KVM_GUESTDBG_ENABLE 0x00000001 -#define KVM_GUESTDBG_SINGLESTEP 0x00000002 +#define KVM_GUESTDBG_ENABLE (1 << 0) +#define KVM_GUESTDBG_SINGLESTEP (1 << 1) + +/* + * Architecture specific stuff uses the top 16 bits of the field, s/stuff/something more specific/ + * however there is some shared commonality for the common cases + */ +#define __KVM_GUESTDBG_USE_SW_BP (1 << 16) +#define __KVM_GUESTDBG_USE_HW_BP (1 << 17) + struct kvm_guest_debug { __u32 control; We sort of left this discussion hanging with me expressing slight concern about the usefulness of these defines. Paolo, what are your thoughts? -Christoffer
Re: [PATCH v3 02/12] KVM: define common __KVM_GUESTDBG_USE_SW/HW_BP values
On 08/05/2015 11:23, Christoffer Dall wrote: On Wed, May 06, 2015 at 05:23:17PM +0100, Alex Bennée wrote: Currently x86, powerpc and soon arm64 use the same two architecture specific bits for guest debug support for software and hardware breakpoints. This makes the shared values explicit while leaving the gate open for another architecture to use some other value if they really really want to. Signed-off-by: Alex Bennée alex.ben...@linaro.org Reviewed-by: Andrew Jones drjo...@redhat.com diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index ab4d473..1731569 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -310,8 +310,8 @@ struct kvm_guest_debug_arch { * and upper 16 bits are architecture specific. Architecture specific defines * that ioctl is for setting hardware breakpoint or software breakpoint. */ -#define KVM_GUESTDBG_USE_SW_BP 0x0001 -#define KVM_GUESTDBG_USE_HW_BP 0x0002 +#define KVM_GUESTDBG_USE_SW_BP __KVM_GUESTDBG_USE_SW_BP +#define KVM_GUESTDBG_USE_HW_BP __KVM_GUESTDBG_USE_HW_BP /* definition of registers in kvm_run */ struct kvm_sync_regs { diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h index d7dcef5..1438202 100644 --- a/arch/x86/include/uapi/asm/kvm.h +++ b/arch/x86/include/uapi/asm/kvm.h @@ -250,8 +250,8 @@ struct kvm_debug_exit_arch { __u64 dr7; }; -#define KVM_GUESTDBG_USE_SW_BP 0x0001 -#define KVM_GUESTDBG_USE_HW_BP 0x0002 +#define KVM_GUESTDBG_USE_SW_BP __KVM_GUESTDBG_USE_SW_BP +#define KVM_GUESTDBG_USE_HW_BP __KVM_GUESTDBG_USE_HW_BP #define KVM_GUESTDBG_INJECT_DB 0x0004 #define KVM_GUESTDBG_INJECT_BP 0x0008 diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 70ac641..3b6252e 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -570,8 +570,16 @@ struct kvm_s390_irq_state { /* for KVM_SET_GUEST_DEBUG */ -#define KVM_GUESTDBG_ENABLE 0x0001 -#define KVM_GUESTDBG_SINGLESTEP 0x0002 +#define 
KVM_GUESTDBG_ENABLE (1 << 0) +#define KVM_GUESTDBG_SINGLESTEP (1 << 1) + +/* + * Architecture specific stuff uses the top 16 bits of the field, s/stuff/something more specific/ + * however there is some shared commonality for the common cases + */ +#define __KVM_GUESTDBG_USE_SW_BP (1 << 16) +#define __KVM_GUESTDBG_USE_HW_BP (1 << 17) + struct kvm_guest_debug { __u32 control; We sort of left this discussion hanging with me expressing slight concern about the usefulness of these defines. Paolo, what are your thoughts? I would just lift these two KVM_GUESTDBG_* defines to include/uapi/linux/kvm.h and say that architecture specific stuff uses the top 14 bits of the field. :) Paolo
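Whichever way the defines finally land, the layout under discussion is fixed: generic flags occupy the low 16 bits of kvm_guest_debug.control, and architecture-specific flags (including the two shared breakpoint values) occupy the top 16. A minimal sketch of that split; the macro names are shortened and the has_arch_flags() helper is illustrative, not part of the KVM uapi:

```c
#include <assert.h>
#include <stdint.h>

/* Generic flags: low 16 bits of the control word. */
#define GUESTDBG_ENABLE     (1u << 0)
#define GUESTDBG_SINGLESTEP (1u << 1)

/* Shared arch values: the bottom of the arch-specific top 16 bits. */
#define GUESTDBG_USE_SW_BP  (1u << 16)
#define GUESTDBG_USE_HW_BP  (1u << 17)
#define GUESTDBG_ARCH_MASK  0xffff0000u

/* True when the control word carries any architecture-specific flag. */
static int has_arch_flags(uint32_t control)
{
    return (control & GUESTDBG_ARCH_MASK) != 0;
}
```

Writing the values as shifts makes it obvious why (1 << 16) and 0x00010000 are the same flag, which is the whole point of the patch.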
Re: [PATCH V3] cpuidle: Handle tick_broadcast_enter() failure gracefully
On Friday, May 08, 2015 01:05:32 PM Preeti U Murthy wrote: When a CPU has to enter an idle state where tick stops, it makes a call to tick_broadcast_enter(). The call will fail if this CPU is the broadcast CPU. Today, under such a circumstance, the arch cpuidle code handles this CPU. This is not convincing because not only do we not know what the arch cpuidle code does, but we also do not account for the idle state residency time and usage of such a CPU. This scenario can be handled better by simply choosing an idle state where in ticks do not stop. To accommodate this change move the setting of runqueue idle state from the core to the cpuidle driver, else the rq-idle_state will be set wrong. Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com --- Changes from V2: https://lkml.org/lkml/2015/5/7/78 Introduce a function in cpuidle core to select an idle state where ticks do not stop rather than going through the governors. Changes from V1: https://lkml.org/lkml/2015/5/7/24 Rebased on the latest linux-pm/bleeding-edge branch drivers/cpuidle/cpuidle.c | 45 +++-- include/linux/sched.h | 16 kernel/sched/core.c | 17 + kernel/sched/fair.c |2 +- kernel/sched/idle.c |6 -- kernel/sched/sched.h | 24 6 files changed, 77 insertions(+), 33 deletions(-) diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index 8c24f95..d1af760 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -21,6 +21,7 @@ #include linux/module.h #include linux/suspend.h #include linux/tick.h +#include linux/sched.h #include trace/events/power.h #include cpuidle.h @@ -146,6 +147,36 @@ int cpuidle_enter_freeze(struct cpuidle_driver *drv, struct cpuidle_device *dev) return index; } +/* + * find_tick_valid_state - select a state where tick does not stop + * @dev: cpuidle device for this cpu + * @drv: cpuidle driver for this cpu + */ +static int find_tick_valid_state(struct cpuidle_device *dev, + struct cpuidle_driver *drv) +{ + int i, ret = -1; + + for (i = 
CPUIDLE_DRIVER_STATE_START; i < drv->state_count; i++) { + struct cpuidle_state *s = &drv->states[i]; + struct cpuidle_state_usage *su = &dev->states_usage[i]; + + /* + * We do not explicitly check for latency requirement + * since it is safe to assume that only shallower idle + * states will have the CPUIDLE_FLAG_TIMER_STOP bit + * cleared and they will invariably meet the latency + * requirement. + */ + if (s->disabled || su->disable || + (s->flags & CPUIDLE_FLAG_TIMER_STOP)) + continue; + + ret = i; + } + return ret; +} + /** * cpuidle_enter_state - enter the state and update stats * @dev: cpuidle device for this cpu @@ -168,10 +199,17 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, * CPU as a broadcast timer, this call may fail if it is not available. */ if (broadcast && tick_broadcast_enter()) { - default_idle_call(); - return -EBUSY; + index = find_tick_valid_state(dev, drv); Well, the new state needs to be deeper than the old one or you may violate the governor's choice and this doesn't guarantee that. Also I don't quite see a reason to duplicate the find_deepest_state() functionality here. + if (index < 0) { + default_idle_call(); + return -EBUSY; + } + target_state = &drv->states[index]; } + /* Take note of the planned idle state. */ + idle_set_state(smp_processor_id(), target_state); And I wouldn't do this either. The behavior here is pretty much as though the driver demoted the state chosen by the governor and we don't call idle_set_state() again in those cases. + trace_cpu_idle_rcuidle(index, dev-cpu); time_start = ktime_get(); Overall, something like the patch below (untested) should work I suppose?
--- drivers/cpuidle/cpuidle.c | 21 ++--- 1 file changed, 14 insertions(+), 7 deletions(-) Index: linux-pm/drivers/cpuidle/cpuidle.c === --- linux-pm.orig/drivers/cpuidle/cpuidle.c +++ linux-pm/drivers/cpuidle/cpuidle.c @@ -73,17 +73,19 @@ int cpuidle_play_dead(void) } static int find_deepest_state(struct cpuidle_driver *drv, - struct cpuidle_device *dev, bool freeze) + struct cpuidle_device *dev, bool freeze, + int limit, unsigned int flags_to_avoid) { unsigned int latency_req = 0;
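Rafael's counter-proposal is to generalize find_deepest_state() with a flag mask instead of duplicating its loop: walk the states shallow to deep, skip disabled ones and any whose flags intersect the mask, and remember the last survivor, which is the deepest. A stripped-down userspace sketch of that selection; field names loosely follow the kernel structs, but the types and table are invented for illustration:

```c
#include <assert.h>

#define FLAG_TIMER_STOP 0x1  /* stand-in for CPUIDLE_FLAG_TIMER_STOP */

struct state {
    int disabled;
    unsigned int flags;
};

/* Return the index of the deepest usable state whose flags avoid
 * 'avoid', or -1 if none qualifies. States are ordered shallow->deep. */
static int find_deepest_state(const struct state *s, int count,
                              unsigned int avoid)
{
    int i, ret = -1;

    for (i = 0; i < count; i++) {
        if (s[i].disabled || (s[i].flags & avoid))
            continue;
        ret = i;  /* keep overwriting: the last hit is the deepest */
    }
    return ret;
}
```

With avoid = FLAG_TIMER_STOP this yields the deepest state whose tick keeps running, which is what the failed tick_broadcast_enter() path needs.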
[tip:perf/core] perf_event: Don't allow vmalloc() backed perf on powerpc
Commit-ID: cb307113746b4d184155d2c412e8069aeaa60d42 Gitweb: http://git.kernel.org/tip/cb307113746b4d184155d2c412e8069aeaa60d42 Author: Michael Ellerman m...@ellerman.id.au AuthorDate: Mon, 4 May 2015 16:26:39 +1000 Committer: Ingo Molnar mi...@kernel.org CommitDate: Fri, 8 May 2015 12:26:01 +0200 perf_event: Don't allow vmalloc() backed perf on powerpc On powerpc the perf event interrupt is not masked when interrupts are disabled, allowing it to function as an NMI. This causes problems if perf is using vmalloc. If we take a page fault on the vmalloc region the fault handler will fail the page fault because it detects we are coming in from an NMI (see do_hash_page()). We don't actually need or want vmalloc backed perf so just disable it on powerpc. Signed-off-by: Michael Ellerman m...@ellerman.id.au Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org Cc: linuxppc-...@ozlabs.org Cc: Andrew Morton a...@osdl.org Cc: Anton Blanchard an...@samba.org Cc: Borislav Petkov b...@alien8.de Cc: H. Peter Anvin h...@zytor.com Cc: Paul Mackerras pau...@samba.org Cc: Thomas Gleixner t...@linutronix.de Cc: a...@ghostprotocols.net Cc: suka...@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1430720799-18426-1-git-send-email-...@ellerman.id.au Signed-off-by: Ingo Molnar mi...@kernel.org --- init/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/init/Kconfig b/init/Kconfig index dc24dec..81050e4 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1637,7 +1637,7 @@ config PERF_EVENTS config DEBUG_PERF_USE_VMALLOC default n bool "Debug: use vmalloc to back perf mmap() buffers" - depends on PERF_EVENTS && DEBUG_KERNEL + depends on PERF_EVENTS && DEBUG_KERNEL && !PPC select PERF_USE_VMALLOC help Use vmalloc memory to back perf mmap() buffers.
Re: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE
On Friday, May 08, 2015 09:16:44 AM Preeti U Murthy wrote: On 05/08/2015 02:29 AM, Rafael J. Wysocki wrote: On Thursday, May 07, 2015 05:49:22 PM Preeti U Murthy wrote: On 05/05/2015 02:11 PM, Preeti U Murthy wrote: On 05/05/2015 12:03 PM, Shilpasri G Bhat wrote: Hi Preeti, On 05/05/2015 09:30 AM, Preeti U Murthy wrote: Hi Shilpa, On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote: Re-evaluate the chip's throttled state on recieving OCC_THROTTLE notification by executing *throttle_check() on any one of the cpu on the chip. This is a sanity check to verify if we were indeed throttled/unthrottled after receiving OCC_THROTTLE notification. We cannot call *throttle_check() directly from the notification handler because we could be handling chip1's notification in chip2. So initiate an smp_call to execute *throttle_check(). We are irq-disabled in the notification handler, so use a worker thread to smp_call throttle_check() on any of the cpu in the chipmask. I see that the first patch takes care of reporting *per-chip* throttling for pmax capping condition. But where are we taking care of reporting pstate set to safe and freq control disabled scenarios per-chip ? IMO let us not have psafe and freq control disabled states managed per-chip. Because when the above two conditions occur it is likely to happen across all chips during an OCC reset cycle. So I am setting 'throttled' to false on OCC_ACTIVE and re-verifying if it actually is the case by invoking *throttle_check(). Alright like I pointed in the previous reply, a comment to indicate that psafe and freq control disabled conditions will fail when occ is inactive and that all chips face the consequence of this will help. From your explanation on the thread of the first patch of this series, this will not be required. So, Reviewed-by: Preeti U Murthy pre...@linux.vnet.ibm.com OK, so is the whole series reviewed now? Yes the whole series has been reviewed. OK, I'll queue it up for 4.2, then, thanks! 
-- I speak only for myself. Rafael J. Wysocki, Intel Open Source Technology Center. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Build Failure with allyesconfig for PowerPc on latest verison of Linus's tree
Greetings Benjamin,Paul,Michael and others, I am reporting the below error message: drivers/built-in.o: In function `.i40e_vc_process_vflr_event': (.text+0x1ffaea0): relocation truncated to fit: R_PPC64_REL24 (stub) against symbol `._mcount' defined in .text section in arch/powerpc/kernel/entry_64.o drivers/built-in.o: In function `.i40e_vc_process_vflr_event': (.text+0x1ffafa0): relocation truncated to fit: R_PPC64_REL24 (stub) against symbol `.eeh_check_failure' defined in .text section in arch/powerpc/kernel/built-in.o drivers/built-in.o: In function `.i40e_vc_process_vflr_event': (.text+0x1ffb120): relocation truncated to fit: R_PPC64_REL24 (stub) against symbol `.eeh_check_failure' defined in .text section in arch/powerpc/kernel/built-in.o drivers/built-in.o: In function `.i40e_vc_process_vflr_event': (.text+0x1ffb254): relocation truncated to fit: R_PPC64_REL24 (stub) against symbol `.eeh_check_failure' defined in .text section in arch/powerpc/kernel/built-in.o drivers/built-in.o: In function `.i40e_vc_process_vflr_event': (.text+0x1ffb358): relocation truncated to fit: R_PPC64_REL24 (stub) against symbol `_restgpr0_23' defined in .text.save.restore section in arch/powerpc/lib/built-in.o drivers/built-in.o: In function `.i40e_ndo_set_vf_mac': (.text+0x1ffb360): relocation truncated to fit: R_PPC64_REL24 (stub) against symbol `_savegpr0_24' defined in .text.save.restore section in arch/powerpc/lib/built-in.o drivers/built-in.o: In function `.i40e_ndo_set_vf_mac': (.text+0x1ffb374): relocation truncated to fit: R_PPC64_REL24 (stub) against symbol `._mcount' defined in .text section in arch/powerpc/kernel/entry_64.o drivers/built-in.o: In function `.i40e_ndo_set_vf_mac': (.text+0x1ffb6e4): relocation truncated to fit: R_PPC64_REL24 (stub) against symbol `.eeh_check_failure' defined in .text section in arch/powerpc/kernel/built-in.o drivers/built-in.o: In function `.i40e_ndo_set_vf_mac': (.text+0x1ffb870): relocation truncated to fit: R_PPC64_REL24 (stub) against 
symbol `.eeh_check_failure' defined in .text section in arch/powerpc/kernel/built-in.o This breaks the build on powerpc on the latest version of Linus's tree. Unfortunately my understanding of the powerpc code is rather limited, so I felt it best just to report it. Please let me know if there is anything else I can do to help solve this build breakage. Cheers, Nick
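The "relocation truncated to fit: R_PPC64_REL24" errors mean a relative branch target landed outside the reach of the ppc64 branch instruction: the b/bl opcode stores a signed 24-bit word displacement, implicitly shifted left by two bits, so a direct branch can only reach about +/-32 MiB. An allyesconfig kernel image is far larger than that, so calls from drivers into symbols such as `_mcount` or `eeh_check_failure` fall out of range unless the linker inserts stubs. The reachability arithmetic, as a small hedged sketch (the helper is illustrative, not a linker API):

```c
#include <assert.h>
#include <stdint.h>

/* R_PPC64_REL24 patches the 24-bit LI field of a b/bl instruction.
 * The field is a signed word offset (implicitly shifted left 2), so a
 * direct branch reaches at most +/-32 MiB from the branch site. */
static int rel24_reaches(int64_t displacement)
{
    const int64_t lo = -(1LL << 25);     /* -32 MiB */
    const int64_t hi = (1LL << 25) - 4;  /* +32 MiB minus one instruction */

    return (displacement & 3) == 0 &&    /* must be word aligned */
           displacement >= lo && displacement <= hi;
}
```

When a displacement fails this check and no stub is generated, the linker reports exactly the error quoted above.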
Re: [PATCH 0/3] Allow user to request memory to be locked on page fault
On Fri, 8 May 2015 15:33:43 -0400 Eric B Munson emun...@akamai.com wrote: mlock() allows a user to control page out of program memory, but this comes at the cost of faulting in the entire mapping when it is allocated. For large mappings where the entire area is not necessary this is not ideal. This series introduces new flags for mmap() and mlockall() that allow a user to specify that the covered area should not be paged out, but only after the memory has been used the first time. Please tell us much much more about the value of these changes: the use cases, the behavioural improvements and performance results which the patchset brings to those use cases, etc.
Re: [PATCH V3] cpuidle: Handle tick_broadcast_enter() failure gracefully
On Friday, May 08, 2015 04:18:02 PM Rafael J. Wysocki wrote: On Friday, May 08, 2015 01:05:32 PM Preeti U Murthy wrote: When a CPU has to enter an idle state where tick stops, it makes a call to tick_broadcast_enter(). The call will fail if this CPU is the broadcast CPU. Today, under such a circumstance, the arch cpuidle code handles this CPU. This is not convincing because not only do we not know what the arch cpuidle code does, but we also do not account for the idle state residency time and usage of such a CPU. This scenario can be handled better by simply choosing an idle state where in ticks do not stop. To accommodate this change move the setting of runqueue idle state from the core to the cpuidle driver, else the rq->idle_state will be set wrong. Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com --- Changes from V2: https://lkml.org/lkml/2015/5/7/78 Introduce a function in cpuidle core to select an idle state where ticks do not stop rather than going through the governors. Changes from V1: https://lkml.org/lkml/2015/5/7/24 Rebased on the latest linux-pm/bleeding-edge branch drivers/cpuidle/cpuidle.c | 45 +++-- include/linux/sched.h | 16 kernel/sched/core.c | 17 + kernel/sched/fair.c | 2 +- kernel/sched/idle.c | 6 -- kernel/sched/sched.h | 24 6 files changed, 77 insertions(+), 33 deletions(-) diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index 8c24f95..d1af760 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -21,6 +21,7 @@ #include <linux/module.h> #include <linux/suspend.h> #include <linux/tick.h> +#include <linux/sched.h> #include <trace/events/power.h> #include "cpuidle.h" @@ -146,6 +147,36 @@ int cpuidle_enter_freeze(struct cpuidle_driver *drv, struct cpuidle_device *dev) return index; } +/* + * find_tick_valid_state - select a state where tick does not stop + * @dev: cpuidle device for this cpu + * @drv: cpuidle driver for this cpu + */ +static int find_tick_valid_state(struct cpuidle_device *dev, + struct
cpuidle_driver *drv) +{ + int i, ret = -1; + + for (i = CPUIDLE_DRIVER_STATE_START; i < drv->state_count; i++) { + struct cpuidle_state *s = drv->states[i]; + struct cpuidle_state_usage *su = dev->states_usage[i]; + + /* +* We do not explicitly check for latency requirement +* since it is safe to assume that only shallower idle +* states will have the CPUIDLE_FLAG_TIMER_STOP bit +* cleared and they will invariably meet the latency +* requirement. +*/ + if (s->disabled || su->disable || + (s->flags & CPUIDLE_FLAG_TIMER_STOP)) + continue; + + ret = i; + } + return ret; +} + /** * cpuidle_enter_state - enter the state and update stats * @dev: cpuidle device for this cpu @@ -168,10 +199,17 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, * CPU as a broadcast timer, this call may fail if it is not available. */ if (broadcast && tick_broadcast_enter()) { - default_idle_call(); - return -EBUSY; + index = find_tick_valid_state(dev, drv); Well, the new state needs to be deeper (I should have said shallower, sorry about that. The state chosen by the governor satisfies certain latency requirements and we can't violate those by choosing a deeper state here. But the patch I sent actually did the right thing. :-)) than the old one or you may violate the governor's choice and this doesn't guarantee that. Also I don't quite see a reason to duplicate the find_deepest_state() functionality here. + if (index < 0) { + default_idle_call(); + return -EBUSY; + } + target_state = drv->states[index]; } + /* Take note of the planned idle state. */ + idle_set_state(smp_processor_id(), target_state); And I wouldn't do this either. The behavior here is pretty much as though the driver demoted the state chosen by the governor and we don't call idle_set_state() again in those cases. + trace_cpu_idle_rcuidle(index, dev->cpu); time_start = ktime_get(); Overall, something like the patch below (untested) should work I suppose?
--- drivers/cpuidle/cpuidle.c | 21 ++--- 1 file changed, 14 insertions(+), 7 deletions(-) Index: linux-pm/drivers/cpuidle/cpuidle.c === --- linux-pm.orig/drivers/cpuidle/cpuidle.c +++ linux-pm/drivers/cpuidle/cpuidle.c @@ -73,17 +73,19
Re: [PATCH] powerpc/mpc85xx: Fix EDAC address capture
On Fri, 2015-05-08 at 16:34 -0500, Scott Wood wrote: On Thu, 2015-05-07 at 17:04 +0800, songwenbin wrote: From: York Sun york...@freescale.com Extend err_addr to cover 64 bits for DDR errors. Signed-off-by: York Sun york...@freescale.com Change-Id: Idb112c4a106416a9cad9933c415e6f62de5cf07b Reviewed-on: http://git.am.freescale.net:8181/553 Tested-by: Schmitt Richard-B43082 b43...@freescale.com Reviewed-by: Fleming Andrew-AFLEMING aflem...@freescale.com Tested-by: Fleming Andrew-AFLEMING aflem...@freescale.com Signed-off-by: songwenbin wenbin.s...@freescale.com Please don't include gerrit stuff in upstream submissions. Definitely don't include Reviewed-by/Tested-by from gerrit as those approvals are from an entirely different context. Never mind, I see you fixed that in v2. :-) That said, these patches should go via the edac tree (see MAINTAINERS). -Scott
Re: [PATCH] powerpc/mpc85xx: Fix EDAC address capture
On Thu, 2015-05-07 at 17:04 +0800, songwenbin wrote: From: York Sun york...@freescale.com Extend err_addr to cover 64 bits for DDR errors. Signed-off-by: York Sun york...@freescale.com Change-Id: Idb112c4a106416a9cad9933c415e6f62de5cf07b Reviewed-on: http://git.am.freescale.net:8181/553 Tested-by: Schmitt Richard-B43082 b43...@freescale.com Reviewed-by: Fleming Andrew-AFLEMING aflem...@freescale.com Tested-by: Fleming Andrew-AFLEMING aflem...@freescale.com Signed-off-by: songwenbin wenbin.s...@freescale.com Please don't include gerrit stuff in upstream submissions. Definitely don't include Reviewed-by/Tested-by from gerrit as those approvals are from an entirely different context. -Scott
[PATCH] cxl: Use call_rcu to reduce latency when releasing the afu fd
From: Ian Munsie imun...@au1.ibm.com The afu fd release path was identified as a significant bottleneck in the overall performance of cxl. While an optimal AFU design would minimise the need to close & reopen the AFU fd, it is not always practical to avoid. The bottleneck seems to be down to the call to synchronize_rcu(), which will block until every other thread is guaranteed to be out of an RCU critical section. Replace it with call_rcu() to free the context structures later so we can return to the application sooner. This reduces the time spent in the fd release path from 13356 usec to 13.3 usec, about a 1000x speed up. Reported-by: Fei K Chen uc...@cn.ibm.com Signed-off-by: Ian Munsie imun...@au1.ibm.com --- drivers/misc/cxl/context.c | 15 ++- drivers/misc/cxl/cxl.h | 2 ++ 2 files changed, 12 insertions(+), 5 deletions(-) diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c index 22eb338..cea299e 100644 --- a/drivers/misc/cxl/context.c +++ b/drivers/misc/cxl/context.c @@ -243,12 +243,9 @@ void cxl_context_detach_all(struct cxl_afu *afu) mutex_unlock(&afu->contexts_lock); } -void cxl_context_free(struct cxl_context *ctx) +static void reclaim_ctx(struct rcu_head *rcu) { - mutex_lock(&ctx->afu->contexts_lock); - idr_remove(&ctx->afu->contexts_idr, ctx->pe); - mutex_unlock(&ctx->afu->contexts_lock); - synchronize_rcu(); + struct cxl_context *ctx = container_of(rcu, struct cxl_context, rcu); free_page((u64)ctx->sstp); ctx->sstp = NULL; @@ -256,3 +253,11 @@ void cxl_context_free(struct cxl_context *ctx) put_pid(ctx->pid); kfree(ctx); } + +void cxl_context_free(struct cxl_context *ctx) +{ + mutex_lock(&ctx->afu->contexts_lock); + idr_remove(&ctx->afu->contexts_idr, ctx->pe); + mutex_unlock(&ctx->afu->contexts_lock); + call_rcu(&ctx->rcu, reclaim_ctx); +} diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h index 47f655f..ebd2e0d 100644 --- a/drivers/misc/cxl/cxl.h +++ b/drivers/misc/cxl/cxl.h @@ -460,6 +460,8 @@ struct cxl_context { bool pending_irq; bool pending_fault; bool pending_afu_err; + + struct rcu_head rcu; }; struct cxl { -- 2.1.4
Re: [PATCH V2 2/2] powerpc/thp: Serialize pmd clear against a linux page table walk.
On Thu, 7 May 2015 12:53:28 +0530 Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com wrote: Serialize against find_linux_pte_or_hugepte which does lock-less lookup in page tables with local interrupts disabled. For huge pages it casts pmd_t to pte_t. Since format of pte_t is different from pmd_t we want to prevent transit from pmd pointing to page table to pmd pointing to huge page (and back) while interrupts are disabled. We clear pmd to possibly replace it with page table pointer in different code paths. So make sure we wait for the parallel find_linux_pte_or_hugepage to finish. I'm not seeing here any description of the problem which is being fixed. Does the patch make the machine faster? Does the machine crash?
Re: [PATCH V2 1/2] mm/thp: Split out pmd collpase flush into a seperate functions
On Thu, 7 May 2015 12:53:27 +0530 Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com wrote: After this patch pmdp_* functions operate only on hugepage pte, and not on regular pmd_t values pointing to page table. The patch looks like a pretty safe no-op for non-powerpc? --- a/arch/powerpc/include/asm/pgtable-ppc64.h +++ b/arch/powerpc/include/asm/pgtable-ppc64.h @@ -576,6 +576,10 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long addr, extern void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp); +#define __HAVE_ARCH_PMDP_COLLAPSE_FLUSH +extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, + unsigned long address, pmd_t *pmdp); + The fashionable way of doing this is extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp); #define pmdp_collapse_flush pmdp_collapse_flush then, elsewhere, #ifndef pmdp_collapse_flush static inline pmd_t pmdp_collapse_flush(...) {} #define pmdp_collapse_flush pmdp_collapse_flush #endif It avoids introducing a second (ugly) symbol into the kernel.
Re: [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages
On Wed, 06 May 2015 11:28:12 +0200 Vlastimil Babka vba...@suse.cz wrote: On 05/06/2015 12:09 AM, Nishanth Aravamudan wrote: On 03.04.2015 [10:45:56 -0700], Nishanth Aravamudan wrote: What I find somewhat worrying though is that we could potentially break the pfmemalloc_watermark_ok() test in situations where zone_reclaimable_pages(zone) == 0 is a transient situation (and not a permanently allocated hugepage). In that case, the throttling is supposed to help system recover, and we might be breaking that ability with this patch, no? Well, if it's transient, we'll skip it this time through, and once there are reclaimable pages, we should notice it again. I'm not familiar enough with this logic, so I'll read through the code again soon to see if your concern is valid, as best I can. In reviewing the code, I think that transiently unreclaimable zones will lead to some higher direct reclaim rates and possible contention, but shouldn't cause any major harm. The likelihood of that situation, as well, in a non-reserved memory setup like the one I described, seems exceedingly low. OK, I guess when a reasonably configured system has nothing to reclaim, it's already busted and throttling won't change much. Consider the patch Acked-by: Vlastimil Babka vba...@suse.cz OK, thanks, I'll move this patch into the queue for 4.2-rc1. Or is it important enough to merge into 4.1? From: Nishanth Aravamudan n...@linux.vnet.ibm.com Subject: mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages Based upon 675becce15 (mm: vmscan: do not throttle based on pfmemalloc reserves if node has no ZONE_NORMAL) from Mel. 
We have a system with the following topology:

# numactl -H
available: 3 nodes (0,2-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 0 size: 28273 MB
node 0 free: 27323 MB
node 2 cpus:
node 2 size: 16384 MB
node 2 free: 0 MB
node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 3 size: 30533 MB
node 3 free: 13273 MB
node distances:
node   0   2   3
  0:  10  20  20
  2:  20  10  20
  3:  20  20  10

Node 2 has no free memory, because:
# cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages
1

This leads to the following zoneinfo:

Node 2, zone      DMA
  pages free     0
        min      1840
        low      2300
        high     2760
        scanned  0
        spanned  262144
        present  262144
        managed  262144
...
  all_unreclaimable: 1

If one then attempts to allocate some normal 16M hugepages via

echo 37 > /proc/sys/vm/nr_hugepages

the echo never returns and kswapd2 consumes CPU cycles. This is because throttle_direct_reclaim ends up calling wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...). pfmemalloc_watermark_ok() in turn checks all zones on the node if there are any reserves, and if so, then indicates the watermarks are ok, by seeing if there are sufficient free pages. 675becce15 added a condition already for memoryless nodes. In this case, though, the node has memory, it is just all consumed (and not reclaimable). Effectively, though, the result is the same on this call to pfmemalloc_watermark_ok() and thus seems like a reasonable additional condition. With this change, the afore-mentioned 16M hugepage allocation attempt succeeds and correctly round-robins between Nodes 1 and 3.
Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com Reviewed-by: Michal Hocko mho...@suse.cz Acked-by: Vlastimil Babka vba...@suse.cz Cc: Dave Hansen dave.han...@intel.com Cc: Mel Gorman mgor...@suse.de Cc: Anton Blanchard an...@samba.org Cc: Johannes Weiner han...@cmpxchg.org Cc: Michal Hocko mho...@suse.cz Cc: Rik van Riel r...@redhat.com Cc: Dan Streetman ddstr...@ieee.org Signed-off-by: Andrew Morton a...@linux-foundation.org --- mm/vmscan.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff -puN mm/vmscan.c~mm-vmscan-do-not-throttle-based-on-pfmemalloc-reserves-if-node-has-no-reclaimable-pages mm/vmscan.c --- a/mm/vmscan.c~mm-vmscan-do-not-throttle-based-on-pfmemalloc-reserves-if-node-has-no-reclaimable-pages +++ a/mm/vmscan.c @@ -2646,7 +2646,8 @@ static bool pfmemalloc_watermark_ok(pg_d for (i = 0; i <= ZONE_NORMAL; i++) { zone = &pgdat->node_zones[i]; - if (!populated_zone(zone)) + if (!populated_zone(zone) || + zone_reclaimable_pages(zone) == 0) continue; pfmemalloc_reserve += min_wmark_pages(zone); _
Re: [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages
On 08.05.2015 [15:47:26 -0700], Andrew Morton wrote: On Wed, 06 May 2015 11:28:12 +0200 Vlastimil Babka vba...@suse.cz wrote: On 05/06/2015 12:09 AM, Nishanth Aravamudan wrote: On 03.04.2015 [10:45:56 -0700], Nishanth Aravamudan wrote: What I find somewhat worrying though is that we could potentially break the pfmemalloc_watermark_ok() test in situations where zone_reclaimable_pages(zone) == 0 is a transient situation (and not a permanently allocated hugepage). In that case, the throttling is supposed to help system recover, and we might be breaking that ability with this patch, no? Well, if it's transient, we'll skip it this time through, and once there are reclaimable pages, we should notice it again. I'm not familiar enough with this logic, so I'll read through the code again soon to see if your concern is valid, as best I can. In reviewing the code, I think that transiently unreclaimable zones will lead to some higher direct reclaim rates and possible contention, but shouldn't cause any major harm. The likelihood of that situation, as well, in a non-reserved memory setup like the one I described, seems exceedingly low. OK, I guess when a reasonably configured system has nothing to reclaim, it's already busted and throttling won't change much. Consider the patch Acked-by: Vlastimil Babka vba...@suse.cz OK, thanks, I'll move this patch into the queue for 4.2-rc1. Thank you! Or is it important enough to merge into 4.1? I think 4.2 is sufficient, but I wonder now if I should have included a stable tag? The issue has been around for a while and there's a relatively easy workaround (use the per-node sysfs files to manually round-robin around the exhausted node) in older kernels, so I had decided against it before. Thanks, Nish
Re: [PATCH V3] cpuidle: Handle tick_broadcast_enter() failure gracefully
On 08/05/15 08:35, Preeti U Murthy wrote: When a CPU has to enter an idle state where tick stops, it makes a call to tick_broadcast_enter(). The call will fail if this CPU is the broadcast CPU. Today, under such a circumstance, the arch cpuidle code handles this CPU. This is not convincing because not only do we not know what the arch cpuidle code does, but we also do not account for the idle state residency time and usage of such a CPU. This scenario can be handled better by simply choosing an idle state where in ticks do not stop. To accommodate this change move the setting of runqueue idle state from the core to the cpuidle driver, else the rq->idle_state will be set wrong. Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com I gave it a spin on ARM64 Juno platform with one of the CPU in broadcast mode and Vexpress TC2 with broadcast timer. I found no issues in both the cases. So, you can add: Tested-by: Sudeep Holla sudeep.ho...@arm.com Regards, Sudeep
Re: [PATCH V3] cpuidle: Handle tick_broadcast_enter() failure gracefully
Hi Rafael, On 05/08/2015 07:48 PM, Rafael J. Wysocki wrote: +/* + * find_tick_valid_state - select a state where tick does not stop + * @dev: cpuidle device for this cpu + * @drv: cpuidle driver for this cpu + */ +static int find_tick_valid_state(struct cpuidle_device *dev, +struct cpuidle_driver *drv) +{ +int i, ret = -1; + +for (i = CPUIDLE_DRIVER_STATE_START; i < drv->state_count; i++) { +struct cpuidle_state *s = drv->states[i]; +struct cpuidle_state_usage *su = dev->states_usage[i]; + +/* + * We do not explicitly check for latency requirement + * since it is safe to assume that only shallower idle + * states will have the CPUIDLE_FLAG_TIMER_STOP bit + * cleared and they will invariably meet the latency + * requirement. + */ +if (s->disabled || su->disable || +(s->flags & CPUIDLE_FLAG_TIMER_STOP)) +continue; + +ret = i; +} +return ret; +} + /** * cpuidle_enter_state - enter the state and update stats * @dev: cpuidle device for this cpu @@ -168,10 +199,17 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, * CPU as a broadcast timer, this call may fail if it is not available. */ if (broadcast && tick_broadcast_enter()) { -default_idle_call(); -return -EBUSY; +index = find_tick_valid_state(dev, drv); Well, the new state needs to be deeper than the old one or you may violate the governor's choice and this doesn't guarantee that. The comment above in find_tick_valid_state() explains why we are bound to choose a shallow idle state. I think it's safe to assume that any state deeper than this one would have the CPUIDLE_FLAG_TIMER_STOP flag set and hence would be skipped. Your patch relies on the assumption that the idle states are arranged in the increasing order of exit_latency/in the order of shallow to deep. This is not guaranteed, is it? Also I don't quite see a reason to duplicate the find_deepest_state() functionality here. Agreed. We could club them like in your patch.
+if (index < 0) { +default_idle_call(); +return -EBUSY; +} +target_state = drv->states[index]; } +/* Take note of the planned idle state. */ +idle_set_state(smp_processor_id(), target_state); And I wouldn't do this either. The behavior here is pretty much as though the driver demoted the state chosen by the governor and we don't call idle_set_state() again in those cases. Why is this wrong? The idea here is to set the idle state of the runqueue to the one that it is more likely to enter into. It is true that the state has been demoted, but I don't see any code that requires rq->idle_state to be only a governor-chosen state or nothing at all. This is a more important chunk of this patch because it allows us to track the idle states of the broadcast CPU. Else the system idle time is bound to be higher than the residency time in different idle states of all the CPUs. This shows up starkly as an anomaly if we are profiling cpuidle state entry/exit. + trace_cpu_idle_rcuidle(index, dev->cpu); time_start = ktime_get(); Overall, something like the patch below (untested) should work I suppose? With the exception of the above two points, yes this should work. --- drivers/cpuidle/cpuidle.c | 21 ++--- 1 file changed, 14 insertions(+), 7 deletions(-) Index: linux-pm/drivers/cpuidle/cpuidle.c === --- linux-pm.orig/drivers/cpuidle/cpuidle.c +++ linux-pm/drivers/cpuidle/cpuidle.c @@ -73,17 +73,19 @@ int cpuidle_play_dead(void) } static int find_deepest_state(struct cpuidle_driver *drv, - struct cpuidle_device *dev, bool freeze) + struct cpuidle_device *dev, bool freeze, + int limit, unsigned int flags_to_avoid) { unsigned int latency_req = 0; int i, ret = freeze ? -1 : CPUIDLE_DRIVER_STATE_START - 1; - for (i = CPUIDLE_DRIVER_STATE_START; i < drv->state_count; i++) { + for (i = CPUIDLE_DRIVER_STATE_START; i < limit; i++) { struct cpuidle_state *s = drv->states[i]; struct cpuidle_state_usage *su = dev->states_usage[i]; if (s->disabled || su->disable || s->exit_latency <= latency_req - || (freeze && !s->enter_freeze)) + || (freeze && !s->enter_freeze) + || (s->flags & flags_to_avoid)) continue; latency_req = s->exit_latency; @@ -100,7 +102,7 @@ static int find_deepest_state(struct cpu int cpuidle_find_deepest_state(struct cpuidle_driver *drv,
[PATCH 0/3] Allow user to request memory to be locked on page fault
mlock() allows a user to control page out of program memory, but this comes at the cost of faulting in the entire mapping when it is allocated. For large mappings where the entire area is not necessary this is not ideal. This series introduces new flags for mmap() and mlockall() that allow a user to specify that the covered area should not be paged out, but only after the memory has been used the first time. The performance cost of these patches is minimal on the two benchmarks I have tested (stream and kernbench).

Avg throughput in MB/s from stream using 100 element arrays
Test              4.1-rc2      4.1-rc2+lock-on-fault
Copy:             10,979.08    10,917.34
Scale:            11,094.45    11,023.01
Add:              12,487.29    12,388.65
Triad:            12,505.77    12,418.78

Kernbench optimal load
                  4.1-rc2      4.1-rc2+lock-on-fault
Elapsed Time      71.046       71.324
User Time         62.117       62.352
System Time       8.926        8.969
Context Switches  14531.9      14542.5
Sleeps            14935.9      14939

Eric B Munson (3):
  Add flag to request pages are locked after page fault
  Add mlockall flag for locking pages on fault
  Add tests for lock on fault

 arch/alpha/include/uapi/asm/mman.h          |   2 +
 arch/mips/include/uapi/asm/mman.h           |   2 +
 arch/parisc/include/uapi/asm/mman.h         |   2 +
 arch/powerpc/include/uapi/asm/mman.h        |   2 +
 arch/sparc/include/uapi/asm/mman.h          |   2 +
 arch/tile/include/uapi/asm/mman.h           |   2 +
 arch/xtensa/include/uapi/asm/mman.h         |   2 +
 include/linux/mm.h                          |   1 +
 include/linux/mman.h                        |   3 +-
 include/uapi/asm-generic/mman.h             |   2 +
 mm/mlock.c                                  |  13 ++-
 mm/mmap.c                                   |   4 +-
 mm/swap.c                                   |   3 +-
 tools/testing/selftests/vm/Makefile         |   8 +-
 tools/testing/selftests/vm/lock-on-fault.c  | 145
 tools/testing/selftests/vm/on-fault-limit.c |  47 +
 tools/testing/selftests/vm/run_vmtests      |  23 +
 17 files changed, 254 insertions(+), 9 deletions(-)
 create mode 100644 tools/testing/selftests/vm/lock-on-fault.c
 create mode 100644 tools/testing/selftests/vm/on-fault-limit.c

Cc: Shuah Khan shua...@osg.samsung.com Cc: linux-al...@vger.kernel.org Cc: linux-ker...@vger.kernel.org Cc: linux-m...@linux-mips.org Cc:
linux-par...@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: sparcli...@vger.kernel.org Cc: linux-xte...@linux-xtensa.org Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org -- 1.9.1
[PATCH 2/3] Add mlockall flag for locking pages on fault
Building on the previous patch, extend mlockall() to give a process a way to specify that pages should be locked when they are faulted in, but that pre-faulting is not needed. Signed-off-by: Eric B Munson emun...@akamai.com Cc: linux-al...@vger.kernel.org Cc: linux-ker...@vger.kernel.org Cc: linux-m...@linux-mips.org Cc: linux-par...@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: sparcli...@vger.kernel.org Cc: linux-xte...@linux-xtensa.org Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux...@kvack.org --- arch/alpha/include/uapi/asm/mman.h | 1 + arch/mips/include/uapi/asm/mman.h| 1 + arch/parisc/include/uapi/asm/mman.h | 1 + arch/powerpc/include/uapi/asm/mman.h | 1 + arch/sparc/include/uapi/asm/mman.h | 1 + arch/tile/include/uapi/asm/mman.h| 1 + arch/xtensa/include/uapi/asm/mman.h | 1 + include/uapi/asm-generic/mman.h | 1 + mm/mlock.c | 13 + 9 files changed, 17 insertions(+), 4 deletions(-) diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h index 15e96e1..3120dfb 100644 --- a/arch/alpha/include/uapi/asm/mman.h +++ b/arch/alpha/include/uapi/asm/mman.h @@ -38,6 +38,7 @@ #define MCL_CURRENT 8192 /* lock all currently mapped pages */ #define MCL_FUTURE 16384 /* lock all additions to address space */ +#define MCL_ON_FAULT 32768 /* lock all pages that are faulted in */ #define MADV_NORMAL0 /* no further special treatment */ #define MADV_RANDOM1 /* expect random page references */ diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h index 47846a5..82aec3c 100644 --- a/arch/mips/include/uapi/asm/mman.h +++ b/arch/mips/include/uapi/asm/mman.h @@ -62,6 +62,7 @@ */ #define MCL_CURRENT1 /* lock all current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ +#define MCL_ON_FAULT 4 /* lock all pages that are faulted in */ #define MADV_NORMAL0 /* no further special treatment */ #define MADV_RANDOM1 /* expect random page references */ diff --git 
a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h index 1514cd7..f4601f3 100644 --- a/arch/parisc/include/uapi/asm/mman.h +++ b/arch/parisc/include/uapi/asm/mman.h @@ -32,6 +32,7 @@ #define MCL_CURRENT1 /* lock all current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ +#define MCL_ON_FAULT 4 /* lock all pages that are faulted in */ #define MADV_NORMAL 0 /* no further special treatment */ #define MADV_RANDOM 1 /* expect random page references */ diff --git a/arch/powerpc/include/uapi/asm/mman.h b/arch/powerpc/include/uapi/asm/mman.h index fce74fe..0a28efc 100644 --- a/arch/powerpc/include/uapi/asm/mman.h +++ b/arch/powerpc/include/uapi/asm/mman.h @@ -22,6 +22,7 @@ #define MCL_CURRENT 0x2000 /* lock all currently mapped pages */ #define MCL_FUTURE 0x4000 /* lock all additions to address space */ +#define MCL_ON_FAULT 0x8 /* lock all pages that are faulted in */ #define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */ #define MAP_NONBLOCK 0x1 /* do not block on IO */ diff --git a/arch/sparc/include/uapi/asm/mman.h b/arch/sparc/include/uapi/asm/mman.h index 12425d8..119be80 100644 --- a/arch/sparc/include/uapi/asm/mman.h +++ b/arch/sparc/include/uapi/asm/mman.h @@ -17,6 +17,7 @@ #define MCL_CURRENT 0x2000 /* lock all currently mapped pages */ #define MCL_FUTURE 0x4000 /* lock all additions to address space */ +#define MCL_ON_FAULT 0x8 /* lock all pages that are faulted in */ #define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */ #define MAP_NONBLOCK 0x1 /* do not block on IO */ diff --git a/arch/tile/include/uapi/asm/mman.h b/arch/tile/include/uapi/asm/mman.h index ec04eaf..66ea935 100644 --- a/arch/tile/include/uapi/asm/mman.h +++ b/arch/tile/include/uapi/asm/mman.h @@ -37,6 +37,7 @@ */ #define MCL_CURRENT1 /* lock all current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ +#define MCL_ON_FAULT 4 /* lock all pages that are faulted in */ #endif /* _ASM_TILE_MMAN_H */ diff --git 
a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h index 42d43cc..9abcc29 100644 --- a/arch/xtensa/include/uapi/asm/mman.h +++ b/arch/xtensa/include/uapi/asm/mman.h @@ -75,6 +75,7 @@ */ #define MCL_CURRENT1 /* lock all current mappings */ #define MCL_FUTURE 2 /* lock all future mappings */ +#define MCL_ON_FAULT 4 /* lock all pages that are faulted in */ #define MADV_NORMAL0
Re: [PATCH v4 02/21] powerpc/powernv: Enable M64 on P7IOC
On 05/01/2015 04:02 PM, Gavin Shan wrote: The patch enables M64 window on P7IOC, which has been enabled on PHB3. Comparing to PHB3, there are 16 M64 BARs and each of them are divided to 8 segments. compared to something means you will tell about PHB3 too :) Do I understand correctly that IODA==IODA1==P7IOC and P7IOC != IODA2? The code does not use PHB3 or P7IOC acronym so it is a bit confusing. So each PHB can support 128 M64 segments. Also, P7IOC has M64DT, which helps mapping one particular M64 segment# to arbitrary PE#. However, we just provide 128 M64 (16 BARs) segments and fixed mapping between PE# and M64 segment# in order to keep same logic to support M64 for PHB3 and P7IOC. In turn, we just need different phb-init_m64() hooks for P7IOC and PHB3. Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com --- arch/powerpc/platforms/powernv/pci-ioda.c | 115 ++ 1 file changed, 103 insertions(+), 12 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index f8bc950..646962f 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -165,6 +165,67 @@ static void pnv_ioda_free_pe(struct pnv_phb *phb, int pe) clear_bit(pe, phb-ioda.pe_alloc); } +static int pnv_ioda1_init_m64(struct pnv_phb *phb) +{ + struct resource *r; + int seg; + s64 rc; Here @rc is of the s64 type. + + /* Each PHB supports 16 separate M64 BARs, each of which are +* divided into 8 segments. So there are number of M64 segments +* as total PE#, which is 128. +*/ there are as many M64 segments as a maximum number of PEs which is 128? 
+ for (seg = 0; seg < phb->ioda.total_pe; seg += 8) { + unsigned long base; + + base = phb->ioda.m64_base + seg * phb->ioda.m64_segsize; + rc = opal_pci_set_phb_mem_window(phb->opal_id, +OPAL_M64_WINDOW_TYPE, +seg / 8, +base, +0, /* unused */ +8 * phb->ioda.m64_segsize); + if (rc != OPAL_SUCCESS) { + pr_warn("  Failure %lld configuring M64 BAR#%d on PHB#%d\n", + rc, seg / 8, phb->hose->global_number); + goto fail; + } + + rc = opal_pci_phb_mmio_enable(phb->opal_id, + OPAL_M64_WINDOW_TYPE, + seg / 8, + OPAL_ENABLE_M64_SPLIT); + if (rc != OPAL_SUCCESS) { + pr_warn("  Failure %lld enabling M64 BAR#%d on PHB#%d\n", + rc, seg / 8, phb->hose->global_number); + goto fail; + } + } + + /* Strip of the segment used by the reserved PE, which +* is expected to be 0 or last supported PE# +*/ + r = &phb->hose->mem_resources[1]; mem_resources[0] is IO, mem_resources[1] is MMIO, mem_resources[2] is for what? Would be nice to have this commented somewhere. + if (phb->ioda.reserved_pe == 0) + r->start += phb->ioda.m64_segsize; + else if (phb->ioda.reserved_pe == (phb->ioda.total_pe - 1)) + r->end -= phb->ioda.m64_segsize; + else + pr_warn("  Cannot strip M64 segment for reserved PE#%d\n", + phb->ioda.reserved_pe); + + return 0; + +fail: + for ( ; seg >= 0; seg -= 8) + opal_pci_phb_mmio_enable(phb->opal_id, +OPAL_M64_WINDOW_TYPE, +seg / 8, +OPAL_DISABLE_M64); Out of curiosity - is not there a counterpart for opal_pci_set_phb_mem_window() for cleanup?
+
+	return -EIO;
+}
+
 /* The default M64 BAR is shared by all PEs */
 static int pnv_ioda2_init_m64(struct pnv_phb *phb)
 {
@@ -222,7 +283,7 @@ fail:
 	return -EIO;
 }
 
-static void pnv_ioda2_reserve_m64_pe(struct pnv_phb *phb)
+static void pnv_ioda_reserve_m64_pe(struct pnv_phb *phb)
 {
 	resource_size_t sgsz = phb->ioda.m64_segsize;
 	struct pci_dev *pdev;
@@ -248,8 +309,8 @@ static void pnv_ioda2_reserve_m64_pe(struct pnv_phb *phb)
 	}
 }
 
-static int pnv_ioda2_pick_m64_pe(struct pnv_phb *phb,
-				 struct pci_bus *bus, int all)
+static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
+				struct pci_bus *bus, int all)
 {
 	resource_size_t segsz = phb->ioda.m64_segsize;
 	struct pci_dev *pdev;
@@ -346,6 +407,28 @@ done:
 	pe->master = master_pe;
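[Editorial aside: the fixed window layout the patch programs (16 BARs x 8 segments = 128 segments, segment# == PE#, each BAR covering 8 consecutive segments) can be modelled outside the kernel. The sketch below is an illustrative user-space calculation, not kernel code; the base address and segment size are hypothetical placeholders, and bar_window_base() is a made-up name.]

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define M64_BARS      16                        /* per PHB, from the commit message */
#define SEGS_PER_BAR  8
#define TOTAL_PE      (M64_BARS * SEGS_PER_BAR) /* 128 */

/* For a given M64 segment number, return the window base its BAR must
 * be programmed with, mirroring base = m64_base + seg * m64_segsize
 * in pnv_ioda1_init_m64(). */
static uint64_t bar_window_base(uint64_t m64_base, uint64_t segsize, int seg)
{
	return m64_base + (uint64_t)seg * segsize;
}

int main(void)
{
	uint64_t m64_base = 0x3fc000000000ull;  /* hypothetical PHB M64 base    */
	uint64_t segsize  = 0x10000000ull;      /* hypothetical 256MB segment   */
	int seg;

	for (seg = 0; seg < TOTAL_PE; seg += SEGS_PER_BAR) {
		int bar = seg / SEGS_PER_BAR;   /* same BAR index math as the loop above */
		uint64_t base = bar_window_base(m64_base, segsize, seg);
		uint64_t size = SEGS_PER_BAR * segsize;

		printf("BAR#%-2d covers PE#%3d..%3d: [%#llx, %#llx)\n",
		       bar, seg, seg + SEGS_PER_BAR - 1,
		       (unsigned long long)base,
		       (unsigned long long)(base + size));
	}

	/* Consecutive BAR windows tile the M64 space with no gaps. */
	assert(bar_window_base(m64_base, segsize, SEGS_PER_BAR) ==
	       m64_base + SEGS_PER_BAR * segsize);
	return 0;
}

This also makes the error path visible: if programming BAR seg/8 fails, every previously enabled BAR (seg-8, seg-16, ...) must be walked back, which is exactly what the `for ( ; seg >= 0; seg -= 8)` cleanup loop does.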
Re: [PATCH 0/3] Allow user to request memory to be locked on page fault
On Fri, 08 May 2015, Andrew Morton wrote: On Fri, 8 May 2015 15:33:43 -0400 Eric B Munson emun...@akamai.com wrote: mlock() allows a user to control page out of program memory, but this comes at the cost of faulting in the entire mapping when it is allocated. For large mappings where the entire area is not necessary this is not ideal. This series introduces new flags for mmap() and mlockall() that allow a user to specify that the covered area should not be paged out, but only after the memory has been used the first time. Please tell us much much more about the value of these changes: the use cases, the behavioural improvements and performance results which the patchset brings to those use cases, etc. The primary use case is for mmapping large files read only. The process knows that some of the data is necessary, but it is unlikely that the entire file will be needed. The developer only wants to pay the cost to read the data in once. Unfortunately, the developer must choose between allowing the kernel to page in the memory as needed and guaranteeing that the data will only be read from disk once. The first option runs the risk of having the memory reclaimed if the system is under memory pressure; the second forces the memory usage and startup delay of faulting in the entire file. I am working on getting startup times with and without this change for an application; I will post them as soon as I have them. Eric
Re: [PATCH v4 00/21] PowerPC/PowerNV: PCI Slot Management
On 05/01/2015 04:02 PM, Gavin Shan wrote: The series of patches intends to support PCI slots for the PowerPC PowerNV platform, which runs on top of the skiboot firmware. The patchset requires corresponding changes in the skiboot firmware, which were sent to skib...@lists.ozlabs.org for review. The PCI slots are exposed by skiboot with device node properties, and the kernel utilizes those properties to populate PCI slots accordingly. The original PCI infrastructure on the PowerNV platform can't support hotplug because the PE is assigned during PHB fixup time, which is called only once, during system boot. For this, the PCI infrastructure on the PowerNV platform has been reworked substantially. After that, the PE and its corresponding resources (IODT, M32DT, M64 segments, DMA32 and bypass window) are assigned upon updating a PCI bridge's resources, which might decide the PE# assigned to the PE (e.g. M64 resources, on P8 strictly speaking). Out of curiosity - does this PCI scan happen when the memory subsystem is initialized? More precisely, after these changes, won't pnv_pci_ioda2_setup_dma_pe() be called too early after boot, so that I won't be able to use kmalloc() to allocate iommu_tables? Also, checkpatch.pl failed multiple times on the series. Please fix. Each PE will maintain a reference count, which is (number of child PCI devices + 1). That indicates that when the last child PCI device leaves the PE, the PE and its included resources will be released and put back into the free pool again. With this design, the PE will be released when the EEH PE is released. PATCH[1 - 8] are related to this part. From the skiboot perspective, the PCI slot provides (hot/fundamental/complete) resets to EEH. The kernel gets to know whether skiboot supports the various resets on one particular PCI slot through its device-tree node. If it does, EEH will utilize the functionality provided by skiboot. Besides, the device-tree nodes have to change in order to support PCI hotplug.
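[Editorial aside: the PE lifecycle the cover letter describes - a refcount of (number of child PCI devices + 1), with resources returned to the free pool on the last put - can be sketched in a few lines. This is an illustrative user-space model under stated assumptions, not the patchset's actual code; the names pnv_pe, pe_get() and pe_put() are made up for the example.]

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy model of the PE lifecycle: the PE starts with refcount 1 (its
 * own reference) plus one per child device; when the count hits zero,
 * the PE# and its segments/windows would go back to the free pool. */
struct pnv_pe {
	int refcount;
	bool released;
};

static void pe_get(struct pnv_pe *pe)
{
	pe->refcount++;          /* a child PCI device joins the PE */
}

static void pe_put(struct pnv_pe *pe)
{
	if (--pe->refcount == 0) {
		/* would free PE#, IODT/M32DT entries, M64 segments, DMA windows */
		pe->released = true;
		printf("PE released back to free pool\n");
	}
}

int main(void)
{
	struct pnv_pe pe = { .refcount = 1, .released = false };

	pe_get(&pe);               /* first child device added  */
	pe_get(&pe);               /* second child device added */
	assert(pe.refcount == 3);  /* children + 1, as in the cover letter */

	pe_put(&pe);               /* hot-remove first child  */
	pe_put(&pe);               /* hot-remove second child */
	assert(!pe.released);      /* PE's own reference is still held */

	pe_put(&pe);               /* PE itself torn down, e.g. on EEH PE release */
	assert(pe.released);
	return 0;
}

The "+ 1" is what lets the PE outlive a transient state where all children are gone but the PE is still being torn down, matching the claim that the PE is released together with the EEH PE.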
For example, when one PCI adapter is inserted into a slot, its device-tree node should be added to the system dynamically. Conversely, the device-tree node should be removed from the system when the PCI adapter goes offline. Since pci_dn and eeh_dev have the same life cycle as PCI device nodes, they should be added/removed accordingly during PCI hotplug. PATCH[9 - 20] are doing the related work. The last patch is the standalone PCI hotplug driver for the PowerNV platform. When removing a PCI adapter from a PCI slot, which is invoked by a command in userland, skiboot will power off the slot to save power and remove the device-tree nodes for all PCI devices behind the slot. Conversely, when power to the slot is turned on, the PCI devices behind the slot are rescanned, and the device-tree nodes for the newly detected PCI devices will be built in skiboot. In both cases, a message is sent to the kernel by skiboot so that the kernel can adjust the device-tree accordingly. At the same time, the kernel also has to deallocate or allocate the PE# and its related resources for the removed/added PCI devices.

Changelog
=========
v4:
* Rebased to 4.1.RC1
* Added API to unflatten FDT blob to a device node sub-tree, which is attached to the indicated parent device node. The original mechanism based on a formatted string stream has been dropped.
* The PATCH[v3 09/21] (powerpc/eeh: Delay probing EEH device during hotplug) was picked up and sent to linux-ppc@ separately for review as Richard's VF EEH Support depends on that.
v3:
* Rebased to 4.1.RC0
* PowerNV PCI infrastructure is totally refactored in order to support PCI hotplug. The PowerNV hotplug driver is also reworked a lot because of the changes in skiboot in order to support PCI hotplug.
Gavin Shan (21): pci: Add pcibios_setup_bridge() powerpc/powernv: Enable M64 on P7IOC powerpc/powernv: M64 support improvement powerpc/powernv: Improve IO and M32 mapping powerpc/powernv: Improve DMA32 segment assignment powerpc/powernv: Create PEs dynamically powerpc/powernv: Release PEs dynamically powerpc/powernv: Drop pnv_ioda_setup_dev_PE() powerpc/powernv: Use PCI slot reset infrastructure powerpc/powernv: Fundamental reset for PCI bus reset powerpc/pci: Don't scan empty slot powerpc/pci: Move pcibios_find_pci_bus() around powerpc/powernv: Introduce pnv_pci_poll() powerpc/powernv: Functions to get/reset PCI slot status powerpc/pci: Delay creating pci_dn powerpc/pci: Create eeh_dev while creating pci_dn powerpc/pci: Export traverse_pci_device_nodes() powerpc/pci: Update bridge windows on PCI plugging drivers/of: Support adding sub-tree powerpc/powernv: Select OF_DYNAMIC pci/hotplug: PowerPC PowerNV PCI hotplug driver arch/powerpc/include/asm/eeh.h |7 +- arch/powerpc/include/asm/opal-api.h|7 +- arch/powerpc/include/asm/opal.h|7 +- arch/powerpc/include/asm/pci-bridge.h |
[PATCH 1/3] Add flag to request pages are locked after page fault
The cost of faulting in all memory to be locked can be very high when working with large mappings. If only portions of the mapping will be used this can incur a high penalty for locking. This patch introduces the ability to request that pages are not pre-faulted, but are placed on the unevictable LRU when they are finally faulted in. To keep accounting checks out of the page fault path, users are billed for the entire mapping lock as if MAP_LOCKED was used.

Signed-off-by: Eric B Munson emun...@akamai.com
Cc: linux-al...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux-m...@linux-mips.org
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparcli...@vger.kernel.org
Cc: linux-xte...@linux-xtensa.org
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
---
 arch/alpha/include/uapi/asm/mman.h   | 1 +
 arch/mips/include/uapi/asm/mman.h    | 1 +
 arch/parisc/include/uapi/asm/mman.h  | 1 +
 arch/powerpc/include/uapi/asm/mman.h | 1 +
 arch/sparc/include/uapi/asm/mman.h   | 1 +
 arch/tile/include/uapi/asm/mman.h    | 1 +
 arch/xtensa/include/uapi/asm/mman.h  | 1 +
 include/linux/mm.h                   | 1 +
 include/linux/mman.h                 | 3 ++-
 include/uapi/asm-generic/mman.h      | 1 +
 mm/mmap.c                            | 4 ++--
 mm/swap.c                            | 3 ++-
 12 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 0086b47..15e96e1 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -30,6 +30,7 @@
 #define MAP_NONBLOCK	0x40000		/* do not block on IO */
 #define MAP_STACK	0x80000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x100000	/* create a huge page mapping */
+#define MAP_LOCKONFAULT	0x200000	/* Lock pages after they are faulted in, do not prefault */
 
 #define MS_ASYNC	1		/* sync memory asynchronously */
 #define MS_SYNC	2		/* synchronous memory sync */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index cfcb876..47846a5 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -48,6 +48,7 @@
 #define MAP_NONBLOCK	0x20000		/* do not block on IO */
 #define MAP_STACK	0x40000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x80000		/* create a huge page mapping */
+#define MAP_LOCKONFAULT	0x100000	/* Lock pages after they are faulted in, do not prefault */
 
 /*
  * Flags for msync
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 294d251..1514cd7 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -24,6 +24,7 @@
 #define MAP_NONBLOCK	0x20000		/* do not block on IO */
 #define MAP_STACK	0x40000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x80000		/* create a huge page mapping */
+#define MAP_LOCKONFAULT	0x100000	/* Lock pages after they are faulted in, do not prefault */
 
 #define MS_SYNC	1		/* synchronous memory sync */
 #define MS_ASYNC	2		/* sync memory asynchronously */
diff --git a/arch/powerpc/include/uapi/asm/mman.h b/arch/powerpc/include/uapi/asm/mman.h
index 6ea26df..fce74fe 100644
--- a/arch/powerpc/include/uapi/asm/mman.h
+++ b/arch/powerpc/include/uapi/asm/mman.h
@@ -27,5 +27,6 @@
 #define MAP_NONBLOCK	0x10000		/* do not block on IO */
 #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
+#define MAP_LOCKONFAULT	0x80000		/* Lock pages after they are faulted in, do not prefault */
 
 #endif /* _UAPI_ASM_POWERPC_MMAN_H */
diff --git a/arch/sparc/include/uapi/asm/mman.h b/arch/sparc/include/uapi/asm/mman.h
index 0b14df3..12425d8 100644
--- a/arch/sparc/include/uapi/asm/mman.h
+++ b/arch/sparc/include/uapi/asm/mman.h
@@ -22,6 +22,7 @@
 #define MAP_NONBLOCK	0x10000		/* do not block on IO */
 #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
+#define MAP_LOCKONFAULT	0x80000		/* Lock pages after they are faulted in, do not prefault */
 
 #endif /* _UAPI__SPARC_MMAN_H__ */
diff --git a/arch/tile/include/uapi/asm/mman.h b/arch/tile/include/uapi/asm/mman.h
index 81b8fc3..ec04eaf 100644
--- a/arch/tile/include/uapi/asm/mman.h
+++ b/arch/tile/include/uapi/asm/mman.h
@@ -29,6 +29,7 @@
 #define MAP_DENYWRITE	0x0800	/* ETXTBSY */
 #define
Re: [PATCH 0/3] Allow user to request memory to be locked on page fault
On Fri, 8 May 2015 16:06:10 -0400 Eric B Munson emun...@akamai.com wrote: On Fri, 08 May 2015, Andrew Morton wrote: On Fri, 8 May 2015 15:33:43 -0400 Eric B Munson emun...@akamai.com wrote: mlock() allows a user to control page out of program memory, but this comes at the cost of faulting in the entire mapping when it is allocated. For large mappings where the entire area is not necessary this is not ideal. This series introduces new flags for mmap() and mlockall() that allow a user to specify that the covered area should not be paged out, but only after the memory has been used the first time. Please tell us much much more about the value of these changes: the use cases, the behavioural improvements and performance results which the patchset brings to those use cases, etc. The primary use case is for mmapping large files read only. The process knows that some of the data is necessary, but it is unlikely that the entire file will be needed. The developer only wants to pay the cost to read the data in once. Unfortunately, the developer must choose between allowing the kernel to page in the memory as needed and guaranteeing that the data will only be read from disk once. The first option runs the risk of having the memory reclaimed if the system is under memory pressure; the second forces the memory usage and startup delay of faulting in the entire file. Why can't the application mmap only those parts of the file which it wants and mlock those? I am working on getting startup times with and without this change for an application, I will post them as soon as I have them.