Re: [PATCH RFC v0 00/12] Cyclic Scheduler Against RTC
On Mon, 2016-04-11 at 22:29 -0700, Bill Huey (hui) wrote:
> Hi,
>
> This is a crude cyclic scheduler implementation. It uses SCHED_FIFO tasks
> and runs them according to a map pattern specified by a 64-bit mask. Each
> bit corresponds to an entry in a 64-entry array of 'struct task_struct'.
> This works on CPU 0 only for now.
>
> Threads are 'admitted' to this map by an extension to ioctl() via the
> (rtc) real-time clock interface. The bit pattern then determines when
> the task will run or activate next.
>
> The /dev/rtc interface is chosen for this purpose because of its
> accessibility to userspace. For example, the mplayer program already uses
> it as a timer source and could possibly benefit from being synced to a
> vertical retrace interrupt during decoding. It could be an OpenGL program
> needing precise scheduler support for handling vertical retrace
> interrupts, low-latency audio, and timely handling of touch events,
> amongst other uses.

Sounds like you want SGI's frame rate scheduler.

	-Mike
Re: [PATCH 1/3] ARM: dts: vf-colibri: alias the primary FEC as ethernet0
On Fri, Apr 01, 2016 at 11:13:39PM -0700, Stefan Agner wrote:
> The Vybrid based Colibri modules provide an on-module PHY which is
> connected to the second FEC instance, FEC1. Since the on-module
> Ethernet port is considered the primary Ethernet interface, alias
> fec1 as ethernet0. This also makes sure that the first MAC address
> provided by the boot loader gets assigned to the FEC instance used
> for the on-module PHY.
>
> Signed-off-by: Stefan Agner

Applied all, thanks.
[PATCH] watchdog: kempld_wdt: don't build for avr32
The build of avr32 allmodconfig fails with the error:

ERROR: "__avr32_udiv64" [drivers/watchdog/kempld_wdt.ko] undefined!

Exclude this driver from the avr32 build.

Signed-off-by: Sudip Mukherjee
---
avr32 build log is at:
https://travis-ci.org/sudipm-mukherjee/parport/jobs/122158665

 drivers/watchdog/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/watchdog/Kconfig b/drivers/watchdog/Kconfig
index fb94765..61041ba 100644
--- a/drivers/watchdog/Kconfig
+++ b/drivers/watchdog/Kconfig
@@ -981,7 +981,7 @@ config HP_WATCHDOG
 
 config KEMPLD_WDT
 	tristate "Kontron COM Watchdog Timer"
-	depends on MFD_KEMPLD
+	depends on MFD_KEMPLD && !AVR32
 	select WATCHDOG_CORE
 	help
 	  Support for the PLD watchdog on some Kontron ETX and COMexpress
-- 
1.9.1
Re: [PATCH v6 10/10] clocksource: arm_arch_timer: Remove arch_timer_get_timecounter
On Mon, Apr 11, 2016 at 04:33:00PM +0100, Julien Grall wrote:
> The only caller of arch_timer_get_timecounter (in KVM) has been removed.
>
> Signed-off-by: Julien Grall
> Acked-by: Christoffer Dall
>
> ---
> Cc: Daniel Lezcano
> Cc: Thomas Gleixner
>
> Changes in v4:
>     - Add Christoffer's acked-by
>
> Changes in v3:
>     - Patch added
> ---

Acked-by: Daniel Lezcano
Re: [PATCH v6 02/10] clocksource: arm_arch_timer: Extend arch_timer_kvm_info to get the virtual IRQ
On Mon, Apr 11, 2016 at 04:32:52PM +0100, Julien Grall wrote:
> Currently, the firmware table is parsed by the virtual timer code in
> order to retrieve the virtual timer interrupt. However, this is already
> done by the arch timer driver.
>
> To avoid code duplication, extend arch_timer_kvm_info to get the virtual
> IRQ.
>
> Note that the KVM code will be modified in a subsequent patch.
>
> Signed-off-by: Julien Grall
> Acked-by: Christoffer Dall
>
> ---

Acked-by: Daniel Lezcano
Re: [PATCH v6 01/10] clocksource: arm_arch_timer: Gather KVM specific information in a structure
On Mon, Apr 11, 2016 at 04:32:51PM +0100, Julien Grall wrote:
> Introduce a structure which is filled in by the arch timer driver and
> used by the virtual timer in KVM.
>
> The first member of this structure will be the timecounter. More members
> will be added later.
>
> A stub for the new helper isn't introduced because KVM requires the arch
> timer for both ARM64 and ARM32.
>
> The function arch_timer_get_timecounter is kept for the time being and
> will be dropped in a subsequent patch.
>
> Signed-off-by: Julien Grall
> Acked-by: Christoffer Dall

Acked-by: Daniel Lezcano
Re: [PATCH v6 0/3] thermal: mediatek: Add cpu dynamic power cooling model.
On 12-04-16, 13:24, dawei chien wrote:
> Please refer to following for my resending, thank you.
>
> https://lkml.org/lkml/2016/3/15/101
> https://patchwork.kernel.org/patch/8586131/
> https://patchwork.kernel.org/patch/8586111/
> https://patchwork.kernel.org/patch/8586081/

Oh, you were continuously sending new ping requests on the old thread.
You should have used the new thread instead :)

Anyway, I have pinged Rafael over the new thread now.

-- 
viresh
Re: [RESEND][PATCH 1/3] thermal: mediatek: Add cpu dynamic power cooling model.
Hi Rafael,

On 15-03-16, 16:10, Dawei Chien wrote:
> The MT8173 cpufreq driver selects of_cpufreq_power_cooling_register,
> registering cooling devices with a dynamic power coefficient.
>
> Signed-off-by: Dawei Chien
> Acked-by: Viresh Kumar

Can you please apply this patch from Dawei?

-- 
viresh
[PATCH RFC v0 03/12] Add cyclic support to rtc-dev.c
Wait-queue changes to rtc_dev_read() so that it can support overrun count
reporting when multiple threads are blocked on a single wait object.
ioctl() additions allow callers to admit the thread to the cyclic
scheduler.

Signed-off-by: Bill Huey (hui)
---
 drivers/rtc/rtc-dev.c | 161 ++
 1 file changed, 161 insertions(+)

diff --git a/drivers/rtc/rtc-dev.c b/drivers/rtc/rtc-dev.c
index a6d9434..0fc9a8c 100644
--- a/drivers/rtc/rtc-dev.c
+++ b/drivers/rtc/rtc-dev.c
@@ -18,6 +18,15 @@
 #include
 #include "rtc-core.h"
 
+#ifdef CONFIG_RTC_CYCLIC
+#include
+#include
+
+#include <../kernel/sched/sched.h>
+#include <../kernel/sched/cyclic.h>
+//#include <../kernel/sched/cyclic_rt.h>
+#endif
+
 static dev_t rtc_devt;
 
 #define RTC_DEV_MAX 16 /* 16 RTCs should be enough for everyone... */
@@ -29,6 +38,10 @@ static int rtc_dev_open(struct inode *inode, struct file *file)
 					struct rtc_device, char_dev);
 	const struct rtc_class_ops *ops = rtc->ops;
 
+#ifdef CONFIG_RTC_CYCLIC
+	reset_rt_overrun();
+#endif
+
 	if (test_and_set_bit_lock(RTC_DEV_BUSY, &rtc->flags))
 		return -EBUSY;
 
@@ -153,13 +166,26 @@ rtc_dev_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
 {
 	struct rtc_device *rtc = file->private_data;
 
+#ifdef CONFIG_RTC_CYCLIC
+	DEFINE_WAIT_FUNC(wait, single_default_wake_function);
+#else
 	DECLARE_WAITQUEUE(wait, current);
+#endif
 	unsigned long data;
+	unsigned long flags;
+#ifdef CONFIG_RTC_CYCLIC
+	int wake = 0, block = 0;
+#endif
 	ssize_t ret;
 
 	if (count != sizeof(unsigned int) && count < sizeof(unsigned long))
 		return -EINVAL;
 
+#ifdef CONFIG_RTC_CYCLIC
+	if (rt_overrun_task_yield(current))
+		goto yield;
+#endif
+printk("%s: 0 color = %d \n", __func__, current->rt.rt_overrun.color);
 	add_wait_queue(&rtc->irq_queue, &wait);
 	do {
 		__set_current_state(TASK_INTERRUPTIBLE);
@@ -169,23 +195,59 @@ rtc_dev_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
 		rtc->irq_data = 0;
 		spin_unlock_irq(&rtc->irq_lock);
 
+if (block) {
+	block = 0;
+	if (wake) {
+		printk("%s: wake \n", __func__);
+		wake = 0;
+	} else {
+		printk("%s: ~wake \n", __func__);
+	}
+}
 		if (data != 0) {
+#ifdef CONFIG_RTC_CYCLIC
+			/* overrun reporting */
+			raw_spin_lock_irqsave(&rt_overrun_lock, flags);
+			if (_on_rt_overrun_admitted(current)) {
+				/* pass back to userspace */
+				data = rt_task_count(current);
+				rt_task_count(current) = 0;
+			}
+			raw_spin_unlock_irqrestore(&rt_overrun_lock, flags);
+			ret = 0;
+printk("%s: 1 color = %d \n", __func__, current->rt.rt_overrun.color);
+			break;
+		}
+#else
 			ret = 0;
 			break;
 		}
+#endif
 		if (file->f_flags & O_NONBLOCK) {
 			ret = -EAGAIN;
+printk("%s: 2 color = %d \n", __func__, current->rt.rt_overrun.color);
 			break;
 		}
 		if (signal_pending(current)) {
+printk("%s: 3 color = %d \n", __func__, current->rt.rt_overrun.color);
 			ret = -ERESTARTSYS;
 			break;
 		}
+#ifdef CONFIG_RTC_CYCLIC
+		block = 1;
+#endif
 		schedule();
+#ifdef CONFIG_RTC_CYCLIC
+		/* debugging */
+		wake = 1;
+#endif
 	} while (1);
 	set_current_state(TASK_RUNNING);
 	remove_wait_queue(&rtc->irq_queue, &wait);
 
+#ifdef CONFIG_RTC_CYCLIC
+ret:
+#endif
 	if (ret == 0) {
 		/* Check for any data updates */
 		if (rtc->ops->read_callback)
@@ -201,6 +263,29 @@ rtc_dev_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
 			sizeof(unsigned long);
 	}
 	return ret;
+
+#ifdef CONFIG_RTC_CYCLIC
+yield:
+
+	spin_lock_irq(&rtc->irq_lock);
+	data = rtc->irq_data;
+	rtc->irq_data = 0;
+	spin_unlock_irq(&rtc->irq_lock);
+
+	raw_spin_lock_irqsave(&rt_overrun_lock, flags);
+	if (_on_rt_overrun_admitted(current)) {
+		/* pass back to userspace */
+		data = rt_task_count(current);
+		rt_task_count(current) = 0;
+	}
+	else {
+	}
+
+	raw_spin_unlock_irqrestore(&rt_overrun_lock, flags);
+	ret = 0;
+
+	goto ret;
+#endif
 }
 
 static unsigned int rtc_dev_poll(struct file *file, poll_table *wait)
@@ -215,6 +300,56 @@ static unsigned int rtc_dev_poll(struct file *file, poll_table
[PATCH RFC v0 05/12] Task tracking per file descriptor
Task tracking per file descriptor for thread-death cleanup.

Signed-off-by: Bill Huey (hui)
---
 drivers/rtc/class.c | 3 +++
 include/linux/rtc.h | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/drivers/rtc/class.c b/drivers/rtc/class.c
index 74fd974..ad570b9 100644
--- a/drivers/rtc/class.c
+++ b/drivers/rtc/class.c
@@ -201,6 +201,9 @@ struct rtc_device *rtc_device_register(const char *name, struct device *dev,
 	rtc->irq_freq = 1;
 	rtc->max_user_freq = 64;
 	rtc->dev.parent = dev;
+#ifdef CONFIG_RTC_CYCLIC
+	INIT_LIST_HEAD(&rtc->rt_overrun_tasks); //struct list_head
+#endif
 	rtc->dev.class = rtc_class;
 	rtc->dev.groups = rtc_get_dev_attribute_groups();
 	rtc->dev.release = rtc_device_release;
diff --git a/include/linux/rtc.h b/include/linux/rtc.h
index b693ada..1424550 100644
--- a/include/linux/rtc.h
+++ b/include/linux/rtc.h
@@ -114,6 +114,9 @@ struct rtc_timer {
 struct rtc_device {
 	struct device dev;
 	struct module *owner;
+#ifdef CONFIG_RTC_CYCLIC
+	struct list_head rt_overrun_tasks;
+#endif
 
 	int id;
 	char name[RTC_DEVICE_NAME_SIZE];
-- 
2.5.0
[PATCH RFC v0 10/12] Export SCHED_FIFO/RT requeuing functions
SCHED_FIFO/RT tail/head runqueue insertion support, and initial thread-death
support via a hook to the scheduler class. Thread death must include
additional semantics to remove/discharge an admitted task properly.

Signed-off-by: Bill Huey (hui)
---
 kernel/sched/rt.c | 41 +
 1 file changed, 41 insertions(+)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index c41ea7a..1d77adc 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -8,6 +8,11 @@
 #include
 #include
 
+#ifdef CONFIG_RTC_CYCLIC
+#include "cyclic.h"
+extern int rt_overrun_task_admitted1(struct rq *rq, struct task_struct *p);
+#endif
+
 int sched_rr_timeslice = RR_TIMESLICE;
 
 static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun);
@@ -1321,8 +1326,18 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 	if (flags & ENQUEUE_WAKEUP)
 		rt_se->timeout = 0;
 
+#ifdef CONFIG_RTC_CYCLIC
+	/* if admitted and the current slot then head, otherwise tail */
+	if (rt_overrun_task_admitted1(rq, p)) {
+		if (rt_overrun_task_active(p)) {
+			flags |= ENQUEUE_HEAD;
+		}
+	}
+	enqueue_rt_entity(rt_se, flags);
+#else
 	enqueue_rt_entity(rt_se, flags & ENQUEUE_HEAD);
+#endif
 
 	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_task(rq, p);
@@ -1367,6 +1382,18 @@ static void requeue_task_rt(struct rq *rq, struct task_struct *p, int head)
 	}
 }
 
+#ifdef CONFIG_RTC_CYCLIC
+void dequeue_task_rt2(struct rq *rq, struct task_struct *p, int flags)
+{
+	dequeue_task_rt(rq, p, flags);
+}
+
+void requeue_task_rt2(struct rq *rq, struct task_struct *p, int head)
+{
+	requeue_task_rt(rq, p, head);
+}
+#endif
+
 static void yield_task_rt(struct rq *rq)
 {
 	requeue_task_rt(rq, rq->curr, 0);
@@ -2177,6 +2204,10 @@ void __init init_sched_rt_class(void)
 		zalloc_cpumask_var_node(&per_cpu(local_cpu_mask, i),
 					GFP_KERNEL, cpu_to_node(i));
 	}
+
+#ifdef CONFIG_RTC_CYCLIC
+	init_rt_overrun();
+#endif
 }
 #endif /* CONFIG_SMP */
@@ -2322,6 +2353,13 @@ static unsigned int get_rr_interval_rt(struct rq *rq, struct task_struct *task)
 	return 0;
 }
 
+#ifdef CONFIG_RTC_CYCLIC
+static void task_dead_rt(struct task_struct *p)
+{
+	rt_overrun_entry_delete(p);
+}
+#endif
+
 const struct sched_class rt_sched_class = {
 	.next			= &fair_sched_class,
 	.enqueue_task		= enqueue_task_rt,
@@ -2344,6 +2382,9 @@ const struct sched_class rt_sched_class = {
 #endif
 
 	.set_curr_task		= set_curr_task_rt,
+#ifdef CONFIG_RTC_CYCLIC
+	.task_dead		= task_dead_rt,
+#endif
 	.task_tick		= task_tick_rt,
 	.get_rr_interval	= get_rr_interval_rt,
-- 
2.5.0
[PATCH RFC v0 06/12] Add anonymous struct to sched_rt_entity
Add an anonymous struct to support admittance using a red-black tree,
overrun tracking, state for whether or not to yield or block, debugging
support, and the execution slot pattern for the scheduler.

Signed-off-by: Bill Huey (hui)
---
 include/linux/sched.h | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 084ed9f..cff56c6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1305,6 +1305,21 @@ struct sched_rt_entity {
 	/* rq "owned" by this entity/group: */
 	struct rt_rq		*my_q;
 #endif
+#ifdef CONFIG_RTC_CYCLIC
+	struct {
+		struct rb_node node;	/* admittance structure */
+		struct list_head task_list;
+		unsigned long count;	/* overrun count per slot */
+		int type, color, yield;
+		u64 slots;
+
+		/* debug */
+		unsigned long last_task_state;
+
+		/* instrumentation */
+		unsigned int machine_state, last_machine_state;
+	} rt_overrun;
+#endif
 };
 
 struct sched_dl_entity {
-- 
2.5.0
[PATCH RFC v0 02/12] Reroute rtc update irqs to the cyclic scheduler handler
Redirect rtc update irqs so that they drive the cyclic scheduler timer
handler instead. Let the handler determine which slot to activate next.
Similar to scheduler tick handling, but just for the cyclic scheduler.

Signed-off-by: Bill Huey (hui)
---
 drivers/rtc/interface.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/drivers/rtc/interface.c b/drivers/rtc/interface.c
index 9ef5f6f..6d39d40 100644
--- a/drivers/rtc/interface.c
+++ b/drivers/rtc/interface.c
@@ -17,6 +17,10 @@
 #include
 #include
 
+#ifdef CONFIG_RTC_CYCLIC
+#include "../kernel/sched/cyclic.h"
+#endif
+
 static int rtc_timer_enqueue(struct rtc_device *rtc, struct rtc_timer *timer);
 static void rtc_timer_remove(struct rtc_device *rtc, struct rtc_timer *timer);
 
@@ -488,6 +492,9 @@ EXPORT_SYMBOL_GPL(rtc_update_irq_enable);
 void rtc_handle_legacy_irq(struct rtc_device *rtc, int num, int mode)
 {
 	unsigned long flags;
+#ifdef CONFIG_RTC_CYCLIC
+	int handled = 0;
+#endif
 
 	/* mark one irq of the appropriate mode */
 	spin_lock_irqsave(&rtc->irq_lock, flags);
@@ -500,7 +507,23 @@ void rtc_handle_legacy_irq(struct rtc_device *rtc, int num, int mode)
 		rtc->irq_task->func(rtc->irq_task->private_data);
 	spin_unlock_irqrestore(&rtc->irq_task_lock, flags);
 
+#ifdef CONFIG_RTC_CYCLIC
+	/* wake up slot_curr if overrun task */
+	if (RTC_PF) {
+		if (rt_overrun_rq_admitted()) {
+			/* advance the cursor, overrun report */
+			rt_overrun_timer_handler(rtc);
+			handled = 1;
+		}
+	}
+
+	if (!handled) {
+		wake_up_interruptible(&rtc->irq_queue);
+	}
+#else
 	wake_up_interruptible(&rtc->irq_queue);
+#endif
+
 	kill_fasync(&rtc->async_queue, SIGIO, POLL_IN);
}
-- 
2.5.0
Re: [PATCH] MAINTAINERS: correct entry for LVM
On Tuesday 12 April 2016 12:20 AM, Wols Lists wrote: On 11/04/16 17:39, Sudip Mukherjee wrote: On Monday 11 April 2016 09:53 PM, Alasdair G Kergon wrote: On Mon, Apr 11, 2016 at 09:45:01PM +0530, Sudip Mukherjee wrote: L stands for "Mailing list that is relevant to this area", and this is a mailing list. :) Your proposed patch isn't changing the L entry, so this is of no relevance. Sorry, I am not understanding. The current entry in MAINTAINERS is: DEVICE-MAPPER (LVM) M: Alasdair Kergon M: Mike Snitzer M: dm-de...@redhat.com L: dm-de...@redhat.com ... So my patch just removed the line : "M: dm-de...@redhat.com" So now the entry becomes : DEVICE-MAPPER (LVM) M: Alasdair Kergon M: Mike Snitzer L: dm-de...@redhat.com ... So, now it correctly shows dm-de...@redhat.com as a mailing list which should be cc'ed on all the patches related to LVM. Or am I understanding this wrong? Yes. Because (I guess M stands for maintainer) this list has maintainer status. As all patches should be sent to the maintainers, all patches should therefore be sent to this list. The same person can appear twice in a phone book, once under their name and once under their job title. This is exactly the same situation - this list should appear once as a list to tell people that it's a list, AND ALSO as a maintainer to tell people that patches must be sent to the list. I guess English is not your first language, but the important point is that M and L are not mutually exclusive. Don't worry, English is my first language. Have you tried with get_maintainer.pl and seen the result? It only shows dm-de...@redhat.com as a Maintainer and not as a list. (I noticed because I was sending a patch, and hence this patch again). But I believe a mailing list cannot be a Maintainer ( have you seen any patch with a Signed-off-by: from a mailing list? ). Anyway, I think this thread has become too long for an unimportant patch. regards sudip
[PATCH RFC v0 01/12] Kconfig change
Add the selection options for the cyclic scheduler Signed-off-by: Bill Huey (hui) --- drivers/rtc/Kconfig | 5 + 1 file changed, 5 insertions(+) diff --git a/drivers/rtc/Kconfig b/drivers/rtc/Kconfig index 544bd34..8a1b704 100644 --- a/drivers/rtc/Kconfig +++ b/drivers/rtc/Kconfig @@ -73,6 +73,11 @@ config RTC_DEBUG Say yes here to enable debugging support in the RTC framework and individual RTC drivers. +config RTC_CYCLIC + bool "RTC cyclic executive scheduler support" + help + Frame/Cyclic executive scheduler support through the RTC interface + comment "RTC interfaces" config RTC_INTF_SYSFS -- 2.5.0
[PATCH RFC v0 08/12] Compilation support
Makefile changes to support the menuconfig option Signed-off-by: Bill Huey (hui) --- kernel/sched/Makefile | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index 302d6eb..df8e131 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -19,4 +19,5 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o obj-$(CONFIG_SCHEDSTATS) += stats.o obj-$(CONFIG_SCHED_DEBUG) += debug.o obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o +obj-$(CONFIG_RTC_CYCLIC) += cyclic.o obj-$(CONFIG_CPU_FREQ) += cpufreq.o -- 2.5.0
[PATCH RFC v0 07/12] kernel/userspace additions for additional ioctl() support for rtc
Add additional ioctl() values to rtc so that it can 'admit' the calling thread into a red-black tree for tracking, set the execution slot pattern, and set whether read() will yield or block. Signed-off-by: Bill Huey (hui) --- include/uapi/linux/rtc.h | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/include/uapi/linux/rtc.h b/include/uapi/linux/rtc.h index f8c82e6..76c9254 100644 --- a/include/uapi/linux/rtc.h +++ b/include/uapi/linux/rtc.h @@ -94,6 +94,10 @@ struct rtc_pll_info { #define RTC_VL_READ _IOR('p', 0x13, int) /* Voltage low detector */ #define RTC_VL_CLR _IO('p', 0x14) /* Clear voltage low information */ +#define RTC_OV_ADMIT _IOW('p', 0x15, unsigned long) /* Set test */ +#define RTC_OV_REPLEN _IOW('p', 0x16, unsigned long) /* Set test */ +#define RTC_OV_YIELD _IOW('p', 0x17, unsigned long) /* Set test */ + /* interrupt flags */ #define RTC_IRQF 0x80 /* Any of the following is active */ #define RTC_PF 0x40 /* Periodic interrupt */ -- 2.5.0
[REGRESSION, bisect] pci: cxgb4 probe fails after commit 104daa71b3961434 ("PCI: Determine actual VPD size on first access")
Hi All, The following patch introduced a regression, causing cxgb4 driver to fail in PCIe probe. commit 104daa71b39614343929e1982170d5fcb0569bb5 Author: Hannes Reinecke Date: Mon Feb 15 09:42:01 2016 +0100 PCI: Determine actual VPD size on first access PCI-2.2 VPD entries have a maximum size of 32k, but might actually be smaller than that. To figure out the actual size one has to read the VPD area until the 'end marker' is reached. Per spec, reading outside of the VPD space is "not allowed." In practice, it may cause simple read errors or even crash the card. To make matters worse not every PCI card implements this properly, leaving us with no 'end' marker or even completely invalid data. Try to determine the size of the VPD data when it's first accessed. If no valid data can be read an I/O error will be returned when reading or writing the sysfs attribute. As the amount of VPD data is unknown initially the size of the sysfs attribute will always be set to '0'. [bhelgaas: changelog, use 0/1 (not false/true) for bitfield, tweak pci_vpd_pci22_read() error checking] Tested-by: Shane Seymour Tested-by: Babu Moger Signed-off-by: Hannes Reinecke Signed-off-by: Bjorn Helgaas Cc: Alexander Duyck The problem stems from the fact that the Chelsio adapters actually have two VPD structures stored in the VPD. An abbreviated one at Offset 0x0 and the complete VPD at Offset 0x400. The abbreviated one only contains the PN, SN and EC Keywords, while the complete VPD contains those plus various adapter constants contained in V0, V1, etc. And it also contains the Base Ethernet MAC Address in the "NA" Keyword which the cxgb4 driver needs when it can't contact the adapter firmware. (We don't have the "NA" Keyword in the VPD Structure at Offset 0x0 because that's not an allowed VPD Keyword in the PCI-E 3.0 specification.) With the new code, the computed size of the VPD is 0x200 and so our efforts to read the VPD at Offset 0x400 silently fail. 
We check the result of the read looking for a signature 0x82 byte, but we're checking against random stack garbage. The end result is that the cxgb4 driver now fails the PCIe probe. Thanks, Hari
Re: [PATCH v4 1/2] scsi: Add intermediate STARGET_REMOVE state to scsi_target_state
Hi, On Tue, Apr 5, 2016 at 5:50 PM, Johannes Thumshirn <jthumsh...@suse.de> wrote: > Add intermediate STARGET_REMOVE state to scsi_target_state to avoid > running into the BUG_ON() in scsi_target_reap(). The STARGET_REMOVE > state is only valid in the path from scsi_remove_target() to > scsi_target_destroy() indicating this target is going to be removed. > > This re-fixes the problem introduced in commits > bc3f02a795d3b4faa99d37390174be2a75d091bd and > 40998193560dab6c3ce8d25f4fa58a23e252ef38 in a more comprehensive way. > > Signed-off-by: Johannes Thumshirn <jthumsh...@suse.de> > Fixes: 40998193560dab6c3ce8d25f4fa58a23e252ef38 > Cc: sta...@vger.kernel.org > Reviewed-by: Ewan D. Milne <emi...@redhat.com> > Reviewed-by: Hannes Reinecke <h...@suse.com> > Reviewed-by: James Bottomley <j...@linux.vnet.ibm.com> > --- > drivers/scsi/scsi_scan.c | 2 ++ > drivers/scsi/scsi_sysfs.c | 2 ++ > include/scsi/scsi_device.h | 1 + > 3 files changed, 5 insertions(+) > > diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c > index 6a82066..63b8bca 100644 > --- a/drivers/scsi/scsi_scan.c > +++ b/drivers/scsi/scsi_scan.c > @@ -315,6 +315,8 @@ static void scsi_target_destroy(struct scsi_target > *starget) > struct Scsi_Host *shost = dev_to_shost(dev->parent); > unsigned long flags; > > + BUG_ON(starget->state != STARGET_REMOVE && > + starget->state != STARGET_CREATED); #modprobe scsi_debug #modprobe -r scsi_debug always triggers this BUG_ON in linux-next-20160411 printk says starget->state is _RUNNING > starget->state = STARGET_DEL; > transport_destroy_device(dev); > spin_lock_irqsave(shost->host_lock, flags); > diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c > index 00bc721..0df82e8 100644 > --- a/drivers/scsi/scsi_sysfs.c > +++ b/drivers/scsi/scsi_sysfs.c > @@ -1279,11 +1279,13 @@ restart: > spin_lock_irqsave(shost->host_lock, flags); > list_for_each_entry(starget, &shost->__targets, siblings) { > if (starget->state == STARGET_DEL || > + starget->state == STARGET_REMOVE ||
> starget == last_target) > continue; > if (starget->dev.parent == dev || &starget->dev == dev) { > kref_get(&starget->reap_ref); > last_target = starget; > + starget->state = STARGET_REMOVE; > spin_unlock_irqrestore(shost->host_lock, flags); > __scsi_remove_target(starget); > scsi_target_reap(starget); > diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h > index f63a167..2bffaa6 100644 > --- a/include/scsi/scsi_device.h > +++ b/include/scsi/scsi_device.h > @@ -240,6 +240,7 @@ scmd_printk(const char *, const struct scsi_cmnd *, const > char *, ...); > enum scsi_target_state { > STARGET_CREATED = 1, > STARGET_RUNNING, > + STARGET_REMOVE, > STARGET_DEL, > }; > > -- > 1.8.5.6 >
[PATCH RFC v0 12/12] Cyclic/rtc documentation
Initial attempt at documentation with a test program Signed-off-by: Bill Huey (hui) --- Documentation/scheduler/sched-cyclic-rtc.txt | 468 +++ 1 file changed, 468 insertions(+) create mode 100644 Documentation/scheduler/sched-cyclic-rtc.txt diff --git a/Documentation/scheduler/sched-cyclic-rtc.txt b/Documentation/scheduler/sched-cyclic-rtc.txt new file mode 100644 index 000..4d22381 --- /dev/null +++ b/Documentation/scheduler/sched-cyclic-rtc.txt @@ -0,0 +1,468 @@ +[in progress] + +"Work Conserving" + +When a task is active and calls read(), it will block or yield depending on +what is requested from the cyclic scheduler. An RT_OV_YIELD call to ioctl() +specifies the behavior for the calling thread. + +In the case where read() is called before the time slice is over, it will +allow other tasks to run with the leftover time. + +"Overrun Reporting/Apps" + +Calls to read() will return the overrun count and zero the counter. This +can be used to adjust the execution time of the thread so that it can run +within that slot and meet some deadline constraint. + +[no decision has been made to return a more meaningful set of numbers as +you can just get time stamps and do the math in userspace but it could +be changed to do so] + +The behavior of read() depends on whether the thread has been admitted or not +via an ioctl() using RTC_OV_ADMIT. If it has been admitted, read() will return +the overrun count. If it has not been admitted, read() returns a value +corresponding to the default read() behavior for rtc. + +See the sample test sources for details. + +Using a video game as an example, having a rendering engine overrun its +slot, driven by a vertical retrace interrupt, can cause visual skipping and +hurt interactivity. Adapting the computation from the read() result can +allow for the frame buffer swap at the frame interrupt. If read() reports +an overrun, the program can simplify calculations and adapt to fit within that slot. 
+It would then allow the program to respond to events (touches, buttons), +minimizing the possibility of perceived pauses. + +The slot allocation scheme for the video game must have some inherent +definition of interactivity. That determines appropriate slot allocation +amongst a mixture of soft/hard real-time. A general policy must be created +for the system, and all programs, to meet real-time criteria. + +"Admittance" + +Admittance of a task is done through an ioctl() call using RTC_OV_ADMIT. +This passes a 64-bit wide bitmap that maps onto entries in the slot map. + +(slot map of two threads) +execution direction -> + +1000 1000 1000 1000... +0100 0100 0100 0100... + +(bit pattern of two threads) +0001 0001 0001 0001... +0010 0010 0010 0010... + +(hex) +0x1111111111111111 +0x2222222222222222 + +The slot map is an array of 64 entries of threads. An index is incremented +through it to determine what the next active thread-slot will be. The end of the +index is set in /proc/rt_overrun_proc + +"Slot/slice activation" + +Move the task to the front of the SCHED_FIFO list when active, the tail when +inactive. + +"RTC Infrastructure and Interrupt Routing" + +The cyclic scheduler is driven by the update interrupt in the RTC +infrastructure but can be rerouted to any periodic interrupt source. + +One of those applications could be when interrupts from a display refresh +happen, or some interval driven by an external controller such as a drum pad +or touch panel, amongst other uses. + +"Embedded Environments" + +This is single run queue only, targeting embedded scenarios where not all +cores are guaranteed to be available. Older Qualcomm MSM kernels have a very +aggressive cpu hotplug as a means of fully powering off cores. The only +CPU guaranteed to run is CPU 0. + +"Project History" + +This was originally created when I was at HP/Palm to solve issues related +to touch event handling and lag working with the real-time media subsystem. 
+The typical workaround used to prevent skipping is to use large buffers to +prevent data underruns. Programs running at SCHED_FIFO can +starve the system from handling external events in a timely manner, like +buttons or touch events. The lack of a globally defined policy on how to +use real-time resources can cause long pauses between handling touch +events and other kinds of implicit deadline misses. + +By choosing some kind of slot execution pattern, it was hoped that it +can be controlled globally across the system so that some basic interactive +guarantees can be met. Whether the tasks are some combination of soft or +hard real-time, a mechanism like this can help guide how SCHED_FIFO tasks +are run versus letting SCHED_FIFO tasks run wildly. + +"Future work" + +Possible integration with the deadline scheduler. Power management +awareness, CPU clock governor. Turning off the scheduler tick when there +are no runnable tasks, other things... + +"Power management" + +Governor awareness... + +[more] + + + +/* + * Based on the: + * + *
[PATCH RFC v0 09/12] Add priority support for the cyclic scheduler
Initial bits to prevent priority changing of cyclic scheduler tasks by only allowing them to be SCHED_FIFO. Fairly hacky at this time and will need revisiting because of security concerns. Affects task death handling since it uses an additional scheduler class hook for clean up at death. Must be SCHED_FIFO. Signed-off-by: Bill Huey (hui) --- kernel/sched/core.c | 13 + 1 file changed, 13 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 44db0ff..cf6cf57 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -87,6 +87,10 @@ #include "../workqueue_internal.h" #include "../smpboot.h" +#ifdef CONFIG_RTC_CYCLIC +#include "cyclic.h" +#endif + #define CREATE_TRACE_POINTS #include @@ -2074,6 +2078,10 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) memset(&p->se.statistics, 0, sizeof(p->se.statistics)); #endif +#ifdef CONFIG_RTC_CYCLIC + RB_CLEAR_NODE(&p->rt.rt_overrun.node); +#endif + RB_CLEAR_NODE(&p->dl.rb_node); init_dl_task_timer(&p->dl); __dl_clear_params(p); @@ -3881,6 +3889,11 @@ recheck: if (dl_policy(policy)) return -EPERM; +#ifdef CONFIG_RTC_CYCLIC + if (rt_overrun_policy(p, policy)) + return -EPERM; +#endif + /* * Treat SCHED_IDLE as nice 20. Only allow a switch to * SCHED_NORMAL if the RLIMIT_NICE would normally permit it. -- 2.5.0
[PATCH RFC v0 11/12] Cyclic scheduler support
Core implementation of the cyclic scheduler that includes admittance handling, thread death support, cyclic timer tick handler, primitive proc debugging interface, wait-queue modifications. Signed-off-by: Bill Huey (hui) --- kernel/sched/cyclic.c | 620 +++ kernel/sched/cyclic.h | 86 +++ kernel/sched/cyclic_rt.h | 7 + 3 files changed, 713 insertions(+) create mode 100644 kernel/sched/cyclic.c create mode 100644 kernel/sched/cyclic.h create mode 100644 kernel/sched/cyclic_rt.h diff --git a/kernel/sched/cyclic.c b/kernel/sched/cyclic.c new file mode 100644 index 000..8ce34bd --- /dev/null +++ b/kernel/sched/cyclic.c @@ -0,0 +1,620 @@ +/* + * cyclic scheduler for rtc support + * + * Copyright (C) Bill Huey + * Author: Bill Huey + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. +*/ + +#include +#include +#include +#include "sched.h" +#include "cyclic.h" +#include "cyclic_rt.h" + +#include +#include + +DEFINE_RAW_SPINLOCK(rt_overrun_lock); +struct rb_root rt_overrun_tree = RB_ROOT; + +#define MASK2 0xfff0 + +/* must revisit again when I get more time to fix the possibility of + * overflow here and 32 bit portability */ +static int cmp_ptr_unsigned_long(long *p, long *q) +{ + int result = ((unsigned long)p & MASK2) - ((unsigned long)q & MASK2); + + WARN_ON(sizeof(long *) != 8); + + if (!result) + return 0; + else if (result > 0) + return 1; + else + return -1; +} + +static int eq_ptr_unsigned_long(long *p, long *q) +{ + return (((long)p & MASK2) == ((long)q & MASK2)); +} + +#define CMP_PTR_LONG(p,q) cmp_ptr_unsigned_long((long *)p, (long *)q) + +static +struct task_struct *_rt_overrun_entry_find(struct rb_root *root, + struct task_struct *p) +{ + struct task_struct *ret = NULL; + struct rb_node *node = root->rb_node; + + while (node) { // double_rq_lock(struct rq *, struct rq *) cpu_rq + struct task_struct *task = container_of(node, + 
struct task_struct, rt.rt_overrun.node); + + int result = CMP_PTR_LONG(p, task); + + if (result < 0) + node = node->rb_left; + else if (result > 0) + node = node->rb_right; + else { + ret = task; + goto exit; + } + } +exit: + return ret; +} + +static int rt_overrun_task_runnable(struct task_struct *p) +{ + return task_on_rq_queued(p); +} + +/* avoiding excessive debug printing, splitting the entry point */ +static +struct task_struct *rt_overrun_entry_find(struct rb_root *root, + struct task_struct *p) +{ +printk("%s: \n", __func__); + return _rt_overrun_entry_find(root, p); +} + +static int _rt_overrun_entry_insert(struct rb_root *root, struct task_struct *p) +{ + struct rb_node **new = &(root->rb_node), *parent = NULL; + +printk("%s: \n", __func__); + while (*new) { + struct task_struct *task = container_of(*new, + struct task_struct, rt.rt_overrun.node); + + int result = CMP_PTR_LONG(p, task); + + parent = *new; + if (result < 0) + new = &((*new)->rb_left); + else if (result > 0) + new = &((*new)->rb_right); + else + return 0; + } + + /* Add new node and rebalance tree. */ + rb_link_node(&p->rt.rt_overrun.node, parent, new); + rb_insert_color(&p->rt.rt_overrun.node, root); + + return 1; +} + +static void _rt_overrun_entry_delete(struct task_struct *p) +{ + struct task_struct *task; + int i; + + task = rt_overrun_entry_find(&rt_overrun_tree, p); + + if (task) { + printk("%s: p color %d - comm %s - slots 0x%016llx\n", + __func__, task->rt.rt_overrun.color, task->comm, + task->rt.rt_overrun.slots); + + rb_erase(&task->rt.rt_overrun.node, &rt_overrun_tree); + list_del(&task->rt.rt_overrun.task_list); + for (i = 0; i < SLOTS; ++i) { + if (rt_admit_rq.curr[i] == p) + rt_admit_rq.curr[i] = NULL; + } + + if (rt_admit_curr == p) + rt_admit_curr = NULL; + } +} + +void rt_overrun_entry_delete(struct task_struct *p) +{ + unsigned long flags; + + raw_spin_lock_irqsave(&rt_overrun_lock, flags); + _rt_overrun_entry_delete(p); +
[PATCH RFC v0 00/12] Cyclic Scheduler Against RTC
Hi, This is a crude cyclic scheduler implementation. It uses SCHED_FIFO tasks and runs them according to a map pattern specified by a 64 bit mask. Each bit corresponds to an entry into a 64 entry array of 'struct task_struct'. This works on single core CPU 0 only for now. Threads are 'admitted' to this map by an extension to the ioctl() of the (rtc) real-time clock interface. The bit pattern then determines when the task will run or activate next. The /dev/rtc interface is chosen for this purpose because of its accessibility to userspace. For example, the mplayer program already uses it as a timer source and could possibly benefit from being synced to a vertical retrace interrupt during decoding. It could be an OpenGL program needing precise scheduler support for handling those same vertical retrace interrupts, low latency audio and timely handling of touch events amongst other uses. There is also a need for some kind of blocking/yielding interface that can return an overrun count for when the thread utilizes more time than allocated for that frame. The read() function in rtc is overloaded for this purpose and reports overrun events. Yield functionality has yet to be fully tested. I apologize for any informal or misused terminology as I haven't fully reviewed all of the academic literature regarding these kinds of schedulers. I welcome suggestions, corrections, etc. Special thanks to includes... Peter Zijlstra (Intel), Steve Rostedt (Red Hat), Rik van Riel (Red Hat) for encouraging me to continue working in the Linux kernel community and being generally positive and supportive. KY Srinivasan (formerly Novell now Microsoft) for discussion of real-time schedulers and pointers to specifics on that topic. It was just a single discussion but was basically the inspiration for this kind of work. Amir Frenkel (Palm), Kenneth Albanowski (Palm), Bdale Garbee (HP) for the amazing place that was Palm, Kenneth for being a co-conspirator with this scheduler. 
This scheduler was inspired by performance work that I did at Palm's kernel group along with discussions with the multimedia team before HP killed webOS off. A sad and infuriating moment. Maybe, in a short while, the community will understand the value of these patches for -rt and start solving the general phenomenon of high-performance multimedia and user-interactivity problems more properly, with both a scheduler like this and -rt shipped as default in the near future. [Also, I'd love some kind of sponsorship to continue what I think is critical work versus heading back into the valley] --- Bill Huey (hui) (12): Kconfig change Reroute rtc update irqs to the cyclic scheduler handler Add cyclic support to rtc-dev.c Anonymous struct initialization Task tracking per file descriptor Add anonymous struct to sched_rt_entity kernel/userspace additions for additional ioctl() support for rtc Compilation support Add priority support for the cyclic scheduler Export SCHED_FIFO/RT requeuing functions Cyclic scheduler support Cyclic/rtc documentation Documentation/scheduler/sched-cyclic-rtc.txt | 468 drivers/rtc/Kconfig | 5 + drivers/rtc/class.c | 3 + drivers/rtc/interface.c | 23 + drivers/rtc/rtc-dev.c| 161 +++ include/linux/init_task.h| 18 + include/linux/rtc.h | 3 + include/linux/sched.h| 15 + include/uapi/linux/rtc.h | 4 + kernel/sched/Makefile| 1 + kernel/sched/core.c | 13 + kernel/sched/cyclic.c| 620 +++ kernel/sched/cyclic.h| 86 kernel/sched/cyclic_rt.h | 7 + kernel/sched/rt.c| 41 ++ 15 files changed, 1468 insertions(+) create mode 100644 Documentation/scheduler/sched-cyclic-rtc.txt create mode 100644 kernel/sched/cyclic.c create mode 100644 kernel/sched/cyclic.h create mode 100644 kernel/sched/cyclic_rt.h -- 2.5.0
[PATCH RFC v0 04/12] Anonymous struct initialization
Anonymous struct initialization Signed-off-by: Bill Huey (hui)--- include/linux/init_task.h | 18 ++ 1 file changed, 18 insertions(+) diff --git a/include/linux/init_task.h b/include/linux/init_task.h index f2cb8d4..308caf6 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -183,6 +183,23 @@ extern struct task_group root_task_group; # define INIT_KASAN(tsk) #endif +#ifdef CONFIG_RTC_CYCLIC +# define INIT_RT_OVERRUN(tsk) \ + .rt_overrun = { \ + .count = 0, \ + .task_list = LIST_HEAD_INIT(tsk.rt.rt_overrun.task_list), \ + .type = 0, \ + .color = 0, \ + .slots = 0, \ + .yield = 0, \ + .machine_state = 0, \ + .last_machine_state = 0,\ + .last_task_state = 0, \ + }, +#else +# define INIT_RT_OVERRUN +#endif + /* * INIT_TASK is used to set up the first task table, touch at * your own risk!. Base=0, limit=0x1f (=2MB) @@ -210,6 +227,7 @@ extern struct task_group root_task_group; .rt = { \ .run_list = LIST_HEAD_INIT(tsk.rt.run_list), \ .time_slice = RR_TIMESLICE, \ + INIT_RT_OVERRUN(tsk)\ }, \ .tasks = LIST_HEAD_INIT(tsk.tasks),\ INIT_PUSHABLE_TASKS(tsk)\ -- 2.5.0
[PATCH RFC v0 09/12] Add priority support for the cyclic scheduler
Initial bits to prevent priority changing of cyclic scheduler tasks by only allowing them to be SCHED_FIFO. Fairly hacky at this time and will need revisiting because of the security concerns. Affects task death handling since it uses an additional scheduler class hook for clean up at death. Must be SCHED_FIFO. Signed-off-by: Bill Huey (hui) --- kernel/sched/core.c | 13 + 1 file changed, 13 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 44db0ff..cf6cf57 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -87,6 +87,10 @@ #include "../workqueue_internal.h" #include "../smpboot.h" +#ifdef CONFIG_RTC_CYCLIC +#include "cyclic.h" +#endif + #define CREATE_TRACE_POINTS #include @@ -2074,6 +2078,10 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) memset(&p->se.statistics, 0, sizeof(p->se.statistics)); #endif +#ifdef CONFIG_RTC_CYCLIC + RB_CLEAR_NODE(&p->rt.rt_overrun.node); +#endif + RB_CLEAR_NODE(&p->dl.rb_node); init_dl_task_timer(&p->dl); __dl_clear_params(p); @@ -3881,6 +3889,11 @@ recheck: if (dl_policy(policy)) return -EPERM; +#ifdef CONFIG_RTC_CYCLIC + if (rt_overrun_policy(p, policy)) + return -EPERM; +#endif + /* * Treat SCHED_IDLE as nice 20. Only allow a switch to * SCHED_NORMAL if the RLIMIT_NICE would normally permit it. -- 2.5.0
[PATCH RFC v0 11/12] Cyclic scheduler support
Core implementation of the cyclic scheduler that includes admittance handling, thread death support, cyclic timer tick handler, primitive proc debugging interface, wait-queue modifications. Signed-off-by: Bill Huey (hui) --- kernel/sched/cyclic.c| 620 +++ kernel/sched/cyclic.h| 86 +++ kernel/sched/cyclic_rt.h | 7 + 3 files changed, 713 insertions(+) create mode 100644 kernel/sched/cyclic.c create mode 100644 kernel/sched/cyclic.h create mode 100644 kernel/sched/cyclic_rt.h diff --git a/kernel/sched/cyclic.c b/kernel/sched/cyclic.c new file mode 100644 index 000..8ce34bd --- /dev/null +++ b/kernel/sched/cyclic.c @@ -0,0 +1,620 @@ +/* + * cyclic scheduler for rtc support + * + * Copyright (C) Bill Huey + * Author: Bill Huey + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. +*/ + +#include +#include +#include +#include "sched.h" +#include "cyclic.h" +#include "cyclic_rt.h" + +#include +#include + +DEFINE_RAW_SPINLOCK(rt_overrun_lock); +struct rb_root rt_overrun_tree = RB_ROOT; + +#define MASK2 0xfff0 + +/* must revisit again when I get more time to fix the possibility of + * overflow here and 32 bit portability */ +static int cmp_ptr_unsigned_long(long *p, long *q) +{ + int result = ((unsigned long)p & MASK2) - ((unsigned long)q & MASK2); + + WARN_ON(sizeof(long *) != 8); + + if (!result) + return 0; + else if (result > 0) + return 1; + else + return -1; +} + +static int eq_ptr_unsigned_long(long *p, long *q) +{ + return (((long)p & MASK2) == ((long)q & MASK2)); +} + +#define CMP_PTR_LONG(p,q) cmp_ptr_unsigned_long((long *)p, (long *)q) + +static +struct task_struct *_rt_overrun_entry_find(struct rb_root *root, + struct task_struct *p) +{ + struct task_struct *ret = NULL; + struct rb_node *node = root->rb_node; + + while (node) { // double_rq_lock(struct rq *, struct rq *) cpu_rq + struct task_struct *task = container_of(node, 
+ struct task_struct, rt.rt_overrun.node); + + int result = CMP_PTR_LONG(p, task); + + if (result < 0) + node = node->rb_left; + else if (result > 0) + node = node->rb_right; + else { + ret = task; + goto exit; + } + } +exit: + return ret; +} + +static int rt_overrun_task_runnable(struct task_struct *p) +{ + return task_on_rq_queued(p); +} + +/* avoiding excessive debug printing, splitting the entry point */ +static +struct task_struct *rt_overrun_entry_find(struct rb_root *root, + struct task_struct *p) +{ +printk("%s: \n", __func__); + return _rt_overrun_entry_find(root, p); +} + +static int _rt_overrun_entry_insert(struct rb_root *root, struct task_struct *p) +{ + struct rb_node **new = &(root->rb_node), *parent = NULL; + +printk("%s: \n", __func__); + while (*new) { + struct task_struct *task = container_of(*new, + struct task_struct, rt.rt_overrun.node); + + int result = CMP_PTR_LONG(p, task); + + parent = *new; + if (result < 0) + new = &((*new)->rb_left); + else if (result > 0) + new = &((*new)->rb_right); + else + return 0; + } + + /* Add new node and rebalance tree. */ + rb_link_node(&p->rt.rt_overrun.node, parent, new); + rb_insert_color(&p->rt.rt_overrun.node, root); + + return 1; +} + +static void _rt_overrun_entry_delete(struct task_struct *p) +{ + struct task_struct *task; + int i; + + task = rt_overrun_entry_find(&rt_overrun_tree, p); + + if (task) { + printk("%s: p color %d - comm %s - slots 0x%016llx\n", + __func__, task->rt.rt_overrun.color, task->comm, + task->rt.rt_overrun.slots); + + rb_erase(&task->rt.rt_overrun.node, &rt_overrun_tree); + list_del(&task->rt.rt_overrun.task_list); + for (i = 0; i < SLOTS; ++i) { + if (rt_admit_rq.curr[i] == p) + rt_admit_rq.curr[i] = NULL; + } + + if (rt_admit_curr == p) + rt_admit_curr = NULL; + } +} + +void rt_overrun_entry_delete(struct task_struct *p) +{ + unsigned long flags; + + raw_spin_lock_irqsave(&rt_overrun_lock, flags); + _rt_overrun_entry_delete(p); + raw_spin_unlock_irqrestore(&rt_overrun_lock, flags); +} + +/* forward
Re: [PATCH v6 0/3] thermal: mediatek: Add cpu dynamic power cooling model.
On Tue, 2016-04-12 at 10:41 +0530, Viresh Kumar wrote: > On 12-04-16, 10:32, dawei chien wrote: > > On Tue, 2016-03-22 at 13:13 +0800, dawei chien wrote: > > > On Tue, 2016-03-15 at 13:17 +0700, Viresh Kumar wrote: > > > > Its Rafael, who is going to apply this one. > > > > > > > > Can you please resend it as he may not have it in patchworks? > > > > > > > > > > Hi Rafael, > > > Would you merge this patch to your tree, thank you. > > > > > > BR, > > > Dawei > > > > Hi Rafael, > > Would you please merge this patch, or please kindly let me know for any > > problem, thank you. > > Didn't I ask you earlier to resend this patch as Rafael wouldn't have > it in his queue now ? > > Please resend it and that will make it easier for Rafael to get it > applied. > Hi Viresh, Please refer to following for my resending, thank you. https://lkml.org/lkml/2016/3/15/101 https://patchwork.kernel.org/patch/8586131/ https://patchwork.kernel.org/patch/8586111/ https://patchwork.kernel.org/patch/8586081/ BR, Dawei
Re: [PATCH v4 1/8] arm64: dts: rockchip: Clean up /memory nodes
On Thursday, 31 March 2016, 22:45:52, Heiko Stuebner wrote: > On Thursday, 31 March 2016, 19:15:43, Heiko Stuebner wrote: > > On Saturday, 19 March 2016, 09:04:08, Heiko Stuebner wrote: > > > On Wednesday, 16 March 2016, 14:58:39, Andreas Färber wrote: > > > > A dtc update results in warnings for nodes with reg property but > > > > without > > > > unit address in the node name, so rename /memory to /memory@0. > > > > > > > > Signed-off-by: Andreas Färber> > > > > > applied to a dts64-fixes branch for 4.6, after changing the commit > > > message to > > > A dtc update results in warnings for nodes with reg property but > > > without > > > unit address in the node name, so rename /memory to > > > /memory@startaddress > > > (memory starts at 0 in the case of the rk3368). > > > > > > > > > To clarify that the @0 is not arbitrarily chosen. > > > > This dtc update in question hasn't landed in v4.6-rc1 and from what I > > gathered will need some changes. The patch is obviously still correct, > > but I have now moved it from v4.6-fixes to the regular v4.7 64bit dts > > changes. > > also it seems "memory" is special and memory without unitname will stay > allowed [0], especially as uboot or other bootloaders may expect such a > node to insert the actual amount of memory into it. > > Looking at uboot, fdt_fixup_memory_banks seems to look explicitly for a > "memory" node, so I'm actually not sure, if this is safe to keep at all. so after pondering this some more, I decided to drop this change again. /memory will stay allowed and might produce fewer issues with bootloaders touching the memory values. > [0] http://www.spinics.net/lists/arm-kernel/msg494038.html
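For readers unfamiliar with the rename under discussion, a sketch of the two forms follows. The reg values and cell sizes here are placeholders (a typical arm64 layout with two address and two size cells), not the rk3368's actual memory map; U-Boot's fdt_fixup_memory_banks concern is about it matching only the plain "memory" name.

```dts
/ {
	/* before: plain node name, which newer dtc warns about
	 * when the node carries a reg property */
	memory {
		device_type = "memory";
		reg = <0x0 0x0 0x0 0x40000000>;
	};

	/* after: unit address matches the first address in reg
	 * (memory starts at 0 in the rk3368 case) */
	memory@0 {
		device_type = "memory";
		reg = <0x0 0x0 0x0 0x40000000>;
	};
};
```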
Re: [PATCH 5/5] cpufreq: Loongson1: Replace goto out with return in ls1x_cpufreq_probe()
On 11-04-16, 19:55, Keguang Zhang wrote: > From: Kelvin Cheung> > This patch replaces goto out with return in ls1x_cpufreq_probe(), > and also includes some minor fixes. > > Signed-off-by: Kelvin Cheung > --- > drivers/cpufreq/loongson1-cpufreq.c | 37 > - > 1 file changed, 16 insertions(+), 21 deletions(-) > > diff --git a/drivers/cpufreq/loongson1-cpufreq.c > b/drivers/cpufreq/loongson1-cpufreq.c > index 5074f5e..1bc90af 100644 > --- a/drivers/cpufreq/loongson1-cpufreq.c > +++ b/drivers/cpufreq/loongson1-cpufreq.c > @@ -1,7 +1,7 @@ > /* > * CPU Frequency Scaling for Loongson 1 SoC > * > - * Copyright (C) 2014 Zhang, Keguang > + * Copyright (C) 2014-2016 Zhang, Keguang Actually you should fold above into the first patch of the series, that renames this file. It makes much sense that way. > * > * This file is licensed under the terms of the GNU General Public > * License version 2. This program is licensed "as is" without any > @@ -141,7 +141,8 @@ static int ls1x_cpufreq_probe(struct platform_device > *pdev) > struct clk *clk; > int ret; > > - if (!pdata || !pdata->clk_name || !pdata->osc_clk_name) > + if (!pdata || !pdata->clk_name || !pdata->osc_clk_name) { You added a '{' here, but the closing '}' is added way down.. Something is wrong here I feel.. > + dev_err(>dev, "platform data missing\n"); > return -EINVAL; > > cpufreq = > @@ -155,8 +156,7 @@ static int ls1x_cpufreq_probe(struct platform_device > *pdev) > if (IS_ERR(clk)) { > dev_err(>dev, "unable to get %s clock\n", > pdata->clk_name); > - ret = PTR_ERR(clk); > - goto out; > + return PTR_ERR(clk); > } > static struct platform_driver ls1x_cpufreq_platdrv = { > - .driver = { > + .probe = ls1x_cpufreq_probe, > + .remove = ls1x_cpufreq_remove, > + .driver = { > .name = "ls1x-cpufreq", > }, > - .probe = ls1x_cpufreq_probe, > - .remove = ls1x_cpufreq_remove, Why do this change at all? Do it in the first patch if you really want to. 
> }; > > module_platform_driver(ls1x_cpufreq_platdrv); > > MODULE_AUTHOR("Kelvin Cheung "); > -MODULE_DESCRIPTION("Loongson 1 CPUFreq driver"); > +MODULE_DESCRIPTION("Loongson1 CPUFreq driver"); This one as well, move it to the first patch. > MODULE_LICENSE("GPL"); > -- > 1.9.1 -- viresh
Re: [PATCH 4/5] cpufreq: Loongson1: Use devm_kzalloc() instead of global structure
On 11-04-16, 19:55, Keguang Zhang wrote: > From: Kelvin Cheung> > This patch uses devm_kzalloc() instead of global structure. > Why are you doing this? The commit log should contain that. > Signed-off-by: Kelvin Cheung > --- > drivers/cpufreq/loongson1-cpufreq.c | 63 > - > 1 file changed, 35 insertions(+), 28 deletions(-) I don't have any issues with you doing this, but I don't think that's necessary to do. Acked-by: Viresh Kumar -- viresh
Re: [PATCH V2 1/3] acpi,pci,irq: reduce resource requirements
Hi Sinan, I was hoping we could *simplify* this, but I think it's just getting even more complicated (it's a net addition of 100 lines), which is due to feature creep that I'm afraid is my fault. IIRC, the main thing you need is to get rid of some early memory allocation. I don't think all the trigger mode/level checking is worth it. The current code doesn't do it, and it's not fixing a problem for you. It's conceivable that it could even make us trip over a new problem, e.g., some broken BIOS that we currently tolerate. I think you could make this a little easier to review if you split things like the acpi_irq_penalty[] -> acpi_isa_irq_penalty[] rename into their own patches. Little patches like that are trivial to review because a simple rename is pretty safe, and then the patches that actually *do* interesting things are smaller and easier to review, too. On Fri, Apr 08, 2016 at 09:26:30PM -0400, Sinan Kaya wrote: > Code has been redesigned to calculate penalty requirements on the fly. This > significantly simplifies the implementation and removes some of the init > calls from x86 architecture. Command line penalty assignment has been > limited to ISA interrupts only. > > Signed-off-by: Sinan Kaya> --- > drivers/acpi/pci_link.c | 176 > ++-- > 1 file changed, 140 insertions(+), 36 deletions(-) > > diff --git a/drivers/acpi/pci_link.c b/drivers/acpi/pci_link.c > index ededa90..25695ea 100644 > --- a/drivers/acpi/pci_link.c > +++ b/drivers/acpi/pci_link.c > @@ -437,17 +437,15 @@ static int acpi_pci_link_set(struct acpi_pci_link > *link, int irq) > * enabled system. > */ > > -#define ACPI_MAX_IRQS256 > #define ACPI_MAX_ISA_IRQ 16 ACPI_MAX_ISA_IRQ is a bit of a misnomer. The maximum ISA IRQ is 15, not 16, so I think this should be named ACPI_MAX_ISA_IRQS. 
> -#define PIRQ_PENALTY_PCI_AVAILABLE (0) > #define PIRQ_PENALTY_PCI_POSSIBLE(16*16) > #define PIRQ_PENALTY_PCI_USING (16*16*16) > #define PIRQ_PENALTY_ISA_TYPICAL (16*16*16*16) > #define PIRQ_PENALTY_ISA_USED(16*16*16*16*16) > #define PIRQ_PENALTY_ISA_ALWAYS (16*16*16*16*16*16) > > -static int acpi_irq_penalty[ACPI_MAX_IRQS] = { > +static int acpi_isa_irq_penalty[ACPI_MAX_ISA_IRQ] = { > PIRQ_PENALTY_ISA_ALWAYS,/* IRQ0 timer */ > PIRQ_PENALTY_ISA_ALWAYS,/* IRQ1 keyboard */ > PIRQ_PENALTY_ISA_ALWAYS,/* IRQ2 cascade */ > @@ -457,9 +455,9 @@ static int acpi_irq_penalty[ACPI_MAX_IRQS] = { > PIRQ_PENALTY_ISA_TYPICAL, /* IRQ6 */ > PIRQ_PENALTY_ISA_TYPICAL, /* IRQ7 parallel, spurious */ > PIRQ_PENALTY_ISA_TYPICAL, /* IRQ8 rtc, sometimes */ > - PIRQ_PENALTY_PCI_AVAILABLE, /* IRQ9 PCI, often acpi */ > - PIRQ_PENALTY_PCI_AVAILABLE, /* IRQ10 PCI */ > - PIRQ_PENALTY_PCI_AVAILABLE, /* IRQ11 PCI */ > + 0, /* IRQ9 PCI, often acpi */ > + 0, /* IRQ10 PCI */ > + 0, /* IRQ11 PCI */ > PIRQ_PENALTY_ISA_USED, /* IRQ12 mouse */ > PIRQ_PENALTY_ISA_USED, /* IRQ13 fpe, sometimes */ > PIRQ_PENALTY_ISA_USED, /* IRQ14 ide0 */ > @@ -467,6 +465,121 @@ static int acpi_irq_penalty[ACPI_MAX_IRQS] = { > /* >IRQ15 */ > }; > > +static int acpi_link_trigger(int irq, u8 *polarity, u8 *triggering) > +{ > + struct acpi_pci_link *link; > + bool found = false; > + > + *polarity = ~0; > + *triggering = ~0; > + > + list_for_each_entry(link, _link_list, list) { > + int i; > + > + if (link->irq.active && link->irq.active == irq) { > + if (*polarity == ~0) > + *polarity = link->irq.polarity; > + > + if (*triggering == ~0) > + *triggering = link->irq.triggering; > + > + if (*polarity != link->irq.polarity) > + return -EINVAL; > + > + if (*triggering != link->irq.triggering) > + return -EINVAL; > + > + found = true; > + } > + > + for (i = 0; i < link->irq.possible_count; i++) > + if (link->irq.possible[i] == irq) { > + if (*polarity == ~0) > + *polarity = link->irq.polarity; > + > + if (*triggering == ~0) > + 
*triggering = link->irq.triggering; > + > + if (*polarity != link->irq.polarity) > + return -EINVAL; > + > + if (*triggering != link->irq.triggering) > + return -EINVAL; > + > + found = true; > +
Re: [PATCH 2/5] cpufreq: Loongson1: Replace kzalloc() with kcalloc()
On 11-04-16, 19:55, Keguang Zhang wrote: > From: Kelvin Cheung > > This patch replaces kzalloc() with kcalloc() when allocating > the frequency table, and removes the unnecessary 'out of memory' message. > > Signed-off-by: Kelvin Cheung > --- > drivers/cpufreq/loongson1-cpufreq.c | 12 > 1 file changed, 4 insertions(+), 8 deletions(-) Acked-by: Viresh Kumar -- viresh
Re: [PATCH 3/5] cpufreq: Loongson1: Use dev_get_platdata() to get platform_data
On 11-04-16, 19:55, Keguang Zhang wrote: > From: Kelvin Cheung > > This patch uses dev_get_platdata() to get the platform_data > instead of referencing it directly. > > Signed-off-by: Kelvin Cheung > --- > drivers/cpufreq/loongson1-cpufreq.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/cpufreq/loongson1-cpufreq.c > b/drivers/cpufreq/loongson1-cpufreq.c > index 2d83744..f0d40fd 100644 > --- a/drivers/cpufreq/loongson1-cpufreq.c > +++ b/drivers/cpufreq/loongson1-cpufreq.c > @@ -134,7 +134,7 @@ static int ls1x_cpufreq_remove(struct platform_device > *pdev) > > static int ls1x_cpufreq_probe(struct platform_device *pdev) > { > - struct plat_ls1x_cpufreq *pdata = pdev->dev.platform_data; > + struct plat_ls1x_cpufreq *pdata = dev_get_platdata(&pdev->dev); > struct clk *clk; > int ret; Acked-by: Viresh Kumar -- viresh
Re: [PATCH] arm64: CONFIG_DEVPORT should not be used when PCI is being used
On 04/07/2016 11:56 AM, Al Stone wrote: > config DEVPORT > bool > depends on ISA && PCI > default y > > That makes more sense. Thanks. I think Itanium does IO ports but not ISA. Probably best to just turn on IO ports on the three architectures that use them in that case. Jon. -- Computer Architect
Re: [PATCH 1/5] cpufreq: Loongson1: Rename the file to loongson1-cpufreq.c
On 11-04-16, 19:55, Keguang Zhang wrote: > From: Kelvin Cheung > > This patch renames the file to loongson1-cpufreq.c > > Signed-off-by: Kelvin Cheung > --- > drivers/cpufreq/Makefile| 2 +- > drivers/cpufreq/{ls1x-cpufreq.c => loongson1-cpufreq.c} | 0 > 2 files changed, 1 insertion(+), 1 deletion(-) > rename drivers/cpufreq/{ls1x-cpufreq.c => loongson1-cpufreq.c} (100%) > > diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile > index 9e63fb1..bebe9c8 100644 > --- a/drivers/cpufreq/Makefile > +++ b/drivers/cpufreq/Makefile > @@ -100,7 +100,7 @@ obj-$(CONFIG_CRIS_MACH_ARTPEC3) += > cris-artpec3-cpufreq.o > obj-$(CONFIG_ETRAXFS)+= cris-etraxfs-cpufreq.o > obj-$(CONFIG_IA64_ACPI_CPUFREQ) += ia64-acpi-cpufreq.o > obj-$(CONFIG_LOONGSON2_CPUFREQ) += loongson2_cpufreq.o > -obj-$(CONFIG_LOONGSON1_CPUFREQ) += ls1x-cpufreq.o > +obj-$(CONFIG_LOONGSON1_CPUFREQ) += loongson1-cpufreq.o > obj-$(CONFIG_SH_CPU_FREQ)+= sh-cpufreq.o > obj-$(CONFIG_SPARC_US2E_CPUFREQ) += sparc-us2e-cpufreq.o > obj-$(CONFIG_SPARC_US3_CPUFREQ) += sparc-us3-cpufreq.o > diff --git a/drivers/cpufreq/ls1x-cpufreq.c > b/drivers/cpufreq/loongson1-cpufreq.c > similarity index 100% > rename from drivers/cpufreq/ls1x-cpufreq.c > rename to drivers/cpufreq/loongson1-cpufreq.c Acked-by: Viresh Kumar -- viresh
Re: [RESEND] fence: add missing descriptions for fence
Hi Luis, On 12 April 2016 at 04:03, Luis de Bethencourt wrote: > On 11/04/16 21:09, Gustavo Padovan wrote: >> Hi Luis, >> >> 2016-04-11 Luis de Bethencourt : >> >>> The members child_list and active_list were added to the fence struct >>> without descriptions for the Documentation. Adding these. >>> Thanks for the patch; will get it queued for for-next. >>> Fixes: b55b54b5db33 ("staging/android: remove struct sync_pt") >>> Signed-off-by: Luis de Bethencourt >>> Reviewed-by: Javier Martinez Canillas >>> --- >>> Hi, >>> >>> Just resending this patch since it hasn't had any reviews since >>> March 21st. >>> >>> Thanks, >>> Luis >>> >>> include/linux/fence.h | 2 ++ >>> 1 file changed, 2 insertions(+) >> >> Reviewed-by: Gustavo Padovan >> >> Gustavo >> > > Thank you Gustavo. > > Nice seeing you around here :) > > Luis BR, Sumit.
Re: [PATCH v6 0/3] thermal: mediatek: Add cpu dynamic power cooling model.
On 12-04-16, 10:32, dawei chien wrote: > On Tue, 2016-03-22 at 13:13 +0800, dawei chien wrote: > > On Tue, 2016-03-15 at 13:17 +0700, Viresh Kumar wrote: > > > Its Rafael, who is going to apply this one. > > > > > > Can you please resend it as he may not have it in patchworks? > > > > > > > Hi Rafael, > > Would you merge this patch to your tree, thank you. > > > > BR, > > Dawei > > Hi Rafael, > Would you please merge this patch, or please kindly let me know for any > problem, thank you. Didn't I ask you earlier to resend this patch, as Rafael wouldn't have it in his queue now? Please resend it and that will make it easier for Rafael to get it applied. -- viresh
Re: [PATCH v1] clk: Add clk_composite_set_rate_and_parent
Hi Finley, On Monday, 11 April 2016 at 09:54:12, Finlye Xiao wrote: > From: Finley Xiao > > Some of Rockchip's clocks should consider the priority of .set_parent > and .set_rate to prevent a too large temporary clock rate. > > For example, the gpu clock can be parented to cpll(750MHz) and > usbphy_480m(480MHz), 375MHz comes from cpll and the div is set > to 2, 480MHz comes from usbphy_480m and the div is set to 1. > > From the code, when change rate from 480MHz to 375MHz, it changes > the gpu's parent from USBPHY_480M to cpll first(.set_parent), but the > div value is still 1 and the gpu's rate will be 750MHz at the moment, > then it changes the div value from 1 to 2(.set_rate) and the gpu's > rate will be changed to 375MHz(480MHZ->750MHz->375MHz), here temporary > rate is 750MHz, the voltage which supply for 480MHz certainly can not > supply for 750MHz, so the gpu will crash. We did talk about this internally and while we refined the actual code change, it seems I forgot to look at the commit message itself. This behaviour (and the wish to not overflow a target clock rate) should be the same on all socs, so the commit message is quite a bit too Rockchip-specific. I think I would go with something like: --- 8< When changing the clock-rate, currently a new parent is set first and a divider adapted thereafter. This may result in the clock-rate overflowing its target rate for a short time if the new parent has a higher rate than the old parent. While this often doesn't produce negative effects, it can affect components in a voltage-scaling environment, like the GPU on the rk3399 socs, where the voltage then is simply too low for the temporarily too-high clock rate. For general clock hierarchies this may need more extensive adaptations to the common clock-framework, but at least for composite clocks having both parent and rate settings it is easy to create a short-term solution to make sure the clock-rate does not overflow the target.
--- 8< But of course feel free to extend or change that as you wish ;-). > > Signed-off-by: Finley Xiao I remember that having clocks not overflow their target rate came up in some ELC talk last week (probably in Stephen's Qualcomm kernel talk) and a general solution might need some changes closer to the core. But at least for composite clocks where we can control the rate+parent process easily, Finley's change is a nice low-hanging fruit which at least improves behaviour for those clock-types in the short term, so Reviewed-by: Heiko Stuebner Heiko
Re: [PATCH RESEND] sst: fix missing breaks that would cause the wrong operation to execute
On Thu, Apr 07, 2016 at 09:30:44AM +0800, Yang Jie wrote: > > > On 2016-04-06 21:44, Alan wrote: > >From: Alan > > > >Now we correctly error an attempt to execute an unsupported operation. The > >current code does something else random. Hi Alan, Can you either send me this patch or send this to the ALSA mailing list. You might want to add the subsystem name which here should be: ASoC: Intel: atom: fix missing breaks that would cause the wrong operation to execute And you can add my Ack :) Acked-by: Vinod Koul Thanks -- ~Vinod > > > >Signed-off-by: Alan Cox > > I think this is a nice fix. + Vinod who is the owner of the atom sound driver. > > Thanks, > ~Keyon > > >--- > > sound/soc/intel/atom/sst-mfld-platform-compress.c |9 +++-- > > 1 file changed, 7 insertions(+), 2 deletions(-) > > > >diff --git a/sound/soc/intel/atom/sst-mfld-platform-compress.c > >b/sound/soc/intel/atom/sst-mfld-platform-compress.c > >index 3951689..1bead81 100644 > >--- a/sound/soc/intel/atom/sst-mfld-platform-compress.c > >+++ b/sound/soc/intel/atom/sst-mfld-platform-compress.c > >@@ -182,24 +182,29 @@ static int sst_platform_compr_trigger(struct > >snd_compr_stream *cstream, int cmd) > > case SNDRV_PCM_TRIGGER_START: > > if (stream->compr_ops->stream_start) > > return stream->compr_ops->stream_start(sst->dev, > > stream->id); > >+break; > > case SNDRV_PCM_TRIGGER_STOP: > > if (stream->compr_ops->stream_drop) > > return stream->compr_ops->stream_drop(sst->dev, > > stream->id); > >+break; > > case SND_COMPR_TRIGGER_DRAIN: > > if (stream->compr_ops->stream_drain) > > return stream->compr_ops->stream_drain(sst->dev, > > stream->id); > >+break; > > case SND_COMPR_TRIGGER_PARTIAL_DRAIN: > > if (stream->compr_ops->stream_partial_drain) > > return > > stream->compr_ops->stream_partial_drain(sst->dev, stream->id); > >+break; > > case SNDRV_PCM_TRIGGER_PAUSE_PUSH: > > if (stream->compr_ops->stream_pause) > > return stream->compr_ops->stream_pause(sst->dev, > > stream->id); > >+break; > > case 
SNDRV_PCM_TRIGGER_PAUSE_RELEASE: > > if (stream->compr_ops->stream_pause_release) > > return > > stream->compr_ops->stream_pause_release(sst->dev, stream->id); > >-default: > >-return -EINVAL; > >+break; > > } > >+return -EINVAL; > > } > > > > static int sst_platform_compr_pointer(struct snd_compr_stream *cstream, > > > >
Re: [RFC][PATCH 2/3] locking/qrwlock: Use smp_cond_load_acquire()
On Mon, 04 Apr 2016, Peter Zijlstra wrote: Use smp_cond_load_acquire() to make better use of the hardware assisted 'spin' wait on arm64. Arguably the second hunk is the more horrid abuse possible, but avoids having to use cmpwait (see next patch) directly. Also, this makes 'clever' (ab)use of the cond+rmb acquire to omit the acquire from cmpxchg(). Signed-off-by: Peter Zijlstra (Intel) --- kernel/locking/qrwlock.c | 18 -- 1 file changed, 4 insertions(+), 14 deletions(-) --- a/kernel/locking/qrwlock.c +++ b/kernel/locking/qrwlock.c @@ -53,10 +53,7 @@ struct __qrwlock { static __always_inline void rspin_until_writer_unlock(struct qrwlock *lock, u32 cnts) { - while ((cnts & _QW_WMASK) == _QW_LOCKED) { - cpu_relax_lowlatency(); - cnts = atomic_read_acquire(&lock->cnts); - } + smp_cond_load_acquire(&lock->cnts.counter, (VAL & _QW_WMASK) != _QW_LOCKED); } /** @@ -109,8 +106,6 @@ EXPORT_SYMBOL(queued_read_lock_slowpath) */ void queued_write_lock_slowpath(struct qrwlock *lock) { - u32 cnts; - /* Put the writer into the wait queue */ arch_spin_lock(&lock->wait_lock); @@ -134,15 +129,10 @@ void queued_write_lock_slowpath(struct q } /* When no more readers, set the locked flag */ - for (;;) { - cnts = atomic_read(&lock->cnts); - if ((cnts == _QW_WAITING) && - (atomic_cmpxchg_acquire(&lock->cnts, _QW_WAITING, - _QW_LOCKED) == _QW_WAITING)) - break; + smp_cond_load_acquire(&lock->cnts.counter, + (VAL == _QW_WAITING) && + atomic_cmpxchg_relaxed(&lock->cnts, _QW_WAITING, _QW_LOCKED) == _QW_WAITING); - cpu_relax_lowlatency(); You would need some variant for cpu_relax_lowlatency otherwise you'll be hurting s390, no? fwiw back when I was looking at this, I recall thinking about possibly introducing smp_cond_acquire_lowlatency but never got around to it. Thanks, Davidlohr
[PATCH v2 03/11] mm/slab: drain the free slab as much as possible
From: Joonsoo Kim slabs_tofree() implies freeing all free slabs. We can do it by just providing INT_MAX. Acked-by: Christoph Lameter Signed-off-by: Joonsoo Kim --- mm/slab.c | 12 +++- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index 373b8be..5451929 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -888,12 +888,6 @@ static int init_cache_node_node(int node) return 0; } -static inline int slabs_tofree(struct kmem_cache *cachep, - struct kmem_cache_node *n) -{ - return (n->free_objects + cachep->num - 1) / cachep->num; -} - static void cpuup_canceled(long cpu) { struct kmem_cache *cachep; @@ -958,7 +952,7 @@ free_slab: n = get_node(cachep, node); if (!n) continue; - drain_freelist(cachep, n, slabs_tofree(cachep, n)); + drain_freelist(cachep, n, INT_MAX); } } @@ -1110,7 +1104,7 @@ static int __meminit drain_cache_node_node(int node) if (!n) continue; - drain_freelist(cachep, n, slabs_tofree(cachep, n)); + drain_freelist(cachep, n, INT_MAX); if (!list_empty(&n->slabs_full) || !list_empty(&n->slabs_partial)) { @@ -2304,7 +2298,7 @@ int __kmem_cache_shrink(struct kmem_cache *cachep, bool deactivate) check_irq_on(); for_each_kmem_cache_node(cachep, node, n) { - drain_freelist(cachep, n, slabs_tofree(cachep, n)); + drain_freelist(cachep, n, INT_MAX); ret += !list_empty(&n->slabs_full) || !list_empty(&n->slabs_partial); -- 1.9.1
[PATCH v2 04/11] mm/slab: factor out kmem_cache_node initialization code
From: Joonsoo Kim It can be reused in other places, so factor it out. A following patch will use it. Signed-off-by: Joonsoo Kim --- mm/slab.c | 68 --- 1 file changed, 39 insertions(+), 29 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index 5451929..49af685 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -841,6 +841,40 @@ static inline gfp_t gfp_exact_node(gfp_t flags) } #endif +static int init_cache_node(struct kmem_cache *cachep, int node, gfp_t gfp) +{ + struct kmem_cache_node *n; + + /* +* Set up the kmem_cache_node for cpu before we can +* begin anything. Make sure some other cpu on this +* node has not already allocated this +*/ + n = get_node(cachep, node); + if (n) + return 0; + + n = kmalloc_node(sizeof(struct kmem_cache_node), gfp, node); + if (!n) + return -ENOMEM; + + kmem_cache_node_init(n); + n->next_reap = jiffies + REAPTIMEOUT_NODE + + ((unsigned long)cachep) % REAPTIMEOUT_NODE; + + n->free_limit = + (1 + nr_cpus_node(node)) * cachep->batchcount + cachep->num; + + /* +* The kmem_cache_nodes don't come and go as CPUs +* come and go. slab_mutex is sufficient +* protection here. +*/ + cachep->node[node] = n; + + return 0; +} + /* * Allocates and initializes node for a node on each slab cache, used for * either memory or cpu hotplug. If memory is being hot-added, the kmem_cache_node @@ -852,39 +886,15 @@ static inline gfp_t gfp_exact_node(gfp_t flags) */ static int init_cache_node_node(int node) { + int ret; struct kmem_cache *cachep; - struct kmem_cache_node *n; - const size_t memsize = sizeof(struct kmem_cache_node); list_for_each_entry(cachep, &slab_caches, list) { - /* -* Set up the kmem_cache_node for cpu before we can -* begin anything. 
Make sure some other cpu on this -* node has not already allocated this -*/ - n = get_node(cachep, node); - if (!n) { - n = kmalloc_node(memsize, GFP_KERNEL, node); - if (!n) - return -ENOMEM; - kmem_cache_node_init(n); - n->next_reap = jiffies + REAPTIMEOUT_NODE + - ((unsigned long)cachep) % REAPTIMEOUT_NODE; - - /* -* The kmem_cache_nodes don't come and go as CPUs -* come and go. slab_mutex is sufficient -* protection here. -*/ - cachep->node[node] = n; - } - - spin_lock_irq(&n->list_lock); - n->free_limit = - (1 + nr_cpus_node(node)) * - cachep->batchcount + cachep->num; - spin_unlock_irq(&n->list_lock); + ret = init_cache_node(cachep, node, GFP_KERNEL); + if (ret) + return ret; } + return 0; } -- 1.9.1
[PATCH v2 10/11] mm/slab: refill cpu cache through a new slab without holding a node lock
From: Joonsoo Kim Until now, cache growing makes a free slab on the node's slab list and then we can allocate free objects from it. This necessarily requires holding a node lock which is very contended. If we refill the cpu cache before attaching it to the node's slab list, we can avoid holding a node lock as much as possible because this newly allocated slab is only visible to the current task. This will reduce lock contention. Below is the result of concurrent allocation/free in the slab allocation benchmark made by Christoph a long time ago. I make the output simpler. The number shows cycle count during alloc/free respectively so less is better. * Before Kmalloc N*alloc N*free(32): Average=355/750 Kmalloc N*alloc N*free(64): Average=452/812 Kmalloc N*alloc N*free(128): Average=559/1070 Kmalloc N*alloc N*free(256): Average=1176/980 Kmalloc N*alloc N*free(512): Average=1939/1189 Kmalloc N*alloc N*free(1024): Average=3521/1278 Kmalloc N*alloc N*free(2048): Average=7152/1838 Kmalloc N*alloc N*free(4096): Average=13438/2013 * After Kmalloc N*alloc N*free(32): Average=248/966 Kmalloc N*alloc N*free(64): Average=261/949 Kmalloc N*alloc N*free(128): Average=314/1016 Kmalloc N*alloc N*free(256): Average=741/1061 Kmalloc N*alloc N*free(512): Average=1246/1152 Kmalloc N*alloc N*free(1024): Average=2437/1259 Kmalloc N*alloc N*free(2048): Average=4980/1800 Kmalloc N*alloc N*free(4096): Average=9000/2078 It shows that contention is reduced for all the object sizes and performance increases by 30 ~ 40%. 
Signed-off-by: Joonsoo Kim
---
 mm/slab.c | 68 +--
 1 file changed, 36 insertions(+), 32 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 2c28ad5..cf12fbd 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2852,6 +2852,30 @@ static noinline void *cache_alloc_pfmemalloc(struct kmem_cache *cachep,
 	return obj;
 }
 
+/*
+ * Slab list should be fixed up by fixup_slab_list() for existing slab
+ * or cache_grow_end() for new slab
+ */
+static __always_inline int alloc_block(struct kmem_cache *cachep,
+		struct array_cache *ac, struct page *page, int batchcount)
+{
+	/*
+	 * There must be at least one object available for
+	 * allocation.
+	 */
+	BUG_ON(page->active >= cachep->num);
+
+	while (page->active < cachep->num && batchcount--) {
+		STATS_INC_ALLOCED(cachep);
+		STATS_INC_ACTIVE(cachep);
+		STATS_SET_HIGH(cachep);
+
+		ac->entry[ac->avail++] = slab_get_obj(cachep, page);
+	}
+
+	return batchcount;
+}
+
 static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
 {
 	int batchcount;
@@ -2864,7 +2888,6 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
 	check_irq_off();
 	node = numa_mem_id();
 
-retry:
 	ac = cpu_cache_get(cachep);
 	batchcount = ac->batchcount;
 	if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
@@ -2894,21 +2917,7 @@ retry:
 
 		check_spinlock_acquired(cachep);
 
-		/*
-		 * The slab was either on partial or free list so
-		 * there must be at least one object available for
-		 * allocation.
-		 */
-		BUG_ON(page->active >= cachep->num);
-
-		while (page->active < cachep->num && batchcount--) {
-			STATS_INC_ALLOCED(cachep);
-			STATS_INC_ACTIVE(cachep);
-			STATS_SET_HIGH(cachep);
-
-			ac->entry[ac->avail++] = slab_get_obj(cachep, page);
-		}
-
+		batchcount = alloc_block(cachep, ac, page, batchcount);
 		fixup_slab_list(cachep, n, page, &list);
 	}
 
@@ -2928,21 +2937,18 @@ alloc_done:
 		}
 
 		page = cache_grow_begin(cachep, gfp_exact_node(flags), node);
-		cache_grow_end(cachep, page);
 
 		/*
 		 * cache_grow_begin() can reenable interrupts,
 		 * then ac could change.
 		 */
 		ac = cpu_cache_get(cachep);
-		node = numa_mem_id();
+		if (!ac->avail && page)
+			alloc_block(cachep, ac, page, batchcount);
+		cache_grow_end(cachep, page);
 
-		/* no objects in sight? abort */
-		if (!page && ac->avail == 0)
+		if (!ac->avail)
 			return NULL;
-
-		if (!ac->avail)	/* objects refilled by interrupt? */
-			goto retry;
 	}
 	ac->touched = 1;
@@ -3136,14 +3142,13 @@ static void *cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
 {
 	struct page *page;
 	struct kmem_cache_node *n;
-	void *obj;
+	void *obj = NULL;
 	void *list = NULL;
 
 	VM_BUG_ON(nodeid < 0 ||
[PATCH v2 11/11] mm/slab: lockless decision to grow cache
From: Joonsoo Kim

To check precisely whether free objects exist or not, we need to grab
a lock. But accuracy isn't that important here: the race window is
small, and if there are too many free objects the cache reaper will
reap them. So this patch makes the check for free-object existence run
without holding a lock. This reduces lock contention in
allocation-heavy cases.

Note that until now, n->shared could be freed while it was being
processed, by a write to slabinfo; with a small trick in this patch we
can access it safely within an interrupt-disabled period.

Below is the result of concurrent allocation/free in the slab
allocation benchmark Christoph made a long time ago. I have simplified
the output. The numbers are cycle counts for alloc/free respectively,
so lower is better.

* Before
Kmalloc N*alloc N*free(32): Average=248/966
Kmalloc N*alloc N*free(64): Average=261/949
Kmalloc N*alloc N*free(128): Average=314/1016
Kmalloc N*alloc N*free(256): Average=741/1061
Kmalloc N*alloc N*free(512): Average=1246/1152
Kmalloc N*alloc N*free(1024): Average=2437/1259
Kmalloc N*alloc N*free(2048): Average=4980/1800
Kmalloc N*alloc N*free(4096): Average=9000/2078

* After
Kmalloc N*alloc N*free(32): Average=344/792
Kmalloc N*alloc N*free(64): Average=347/882
Kmalloc N*alloc N*free(128): Average=390/959
Kmalloc N*alloc N*free(256): Average=393/1067
Kmalloc N*alloc N*free(512): Average=683/1229
Kmalloc N*alloc N*free(1024): Average=1295/1325
Kmalloc N*alloc N*free(2048): Average=2513/1664
Kmalloc N*alloc N*free(4096): Average=4742/2172

It shows that allocation performance decreases for object sizes up to
128, possibly due to the extra checks in cache_alloc_refill(). But
considering the improvement in free performance, the net result looks
about the same. Results for the other size classes look very
promising: roughly a 50% performance improvement.

v2: replace kick_all_cpus_sync() with synchronize_sched().
Signed-off-by: Joonsoo Kim
---
 mm/slab.c | 21 ++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index cf12fbd..13e74aa 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -952,6 +952,15 @@ static int setup_kmem_cache_node(struct kmem_cache *cachep,
 	spin_unlock_irq(&n->list_lock);
 	slabs_destroy(cachep, &list);
 
+	/*
+	 * To protect lockless access to n->shared during irq disabled context.
+	 * If n->shared isn't NULL in irq disabled context, accessing to it is
+	 * guaranteed to be valid until irq is re-enabled, because it will be
+	 * freed after synchronize_sched().
+	 */
+	if (force_change)
+		synchronize_sched();
+
 fail:
 	kfree(old_shared);
 	kfree(new_shared);
@@ -2880,7 +2889,7 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
 {
 	int batchcount;
 	struct kmem_cache_node *n;
-	struct array_cache *ac;
+	struct array_cache *ac, *shared;
 	int node;
 	void *list = NULL;
 	struct page *page;
@@ -2901,11 +2910,16 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
 	n = get_node(cachep, node);
 
 	BUG_ON(ac->avail > 0 || !n);
+	shared = READ_ONCE(n->shared);
+	if (!n->free_objects && (!shared || !shared->avail))
+		goto direct_grow;
+
 	spin_lock(&n->list_lock);
+	shared = READ_ONCE(n->shared);
 
 	/* See if we can refill from the shared array */
-	if (n->shared && transfer_objects(ac, n->shared, batchcount)) {
-		n->shared->touched = 1;
+	if (shared && transfer_objects(ac, shared, batchcount)) {
+		shared->touched = 1;
 		goto alloc_done;
 	}
 
@@ -2927,6 +2941,7 @@ alloc_done:
 	spin_unlock(&n->list_lock);
 	fixup_objfreelist_debug(cachep, &list);
 
+direct_grow:
 	if (unlikely(!ac->avail)) {
 		/* Check if we can use obj in pfmemalloc slab */
 		if (sk_memalloc_socks()) {
--
1.9.1
[PATCH v2 08/11] mm/slab: make cache_grow() handle the page allocated on arbitrary node
From: Joonsoo Kim

Currently, cache_grow() assumes that the allocated page's nodeid is
the same as the nodeid parameter used for the allocation request. If
we discard this assumption, we can handle the fallback_alloc() case
gracefully. So this patch makes cache_grow() handle a page allocated
on an arbitrary node, and cleans up the relevant code.

Signed-off-by: Joonsoo Kim
---
 mm/slab.c | 60 +---
 1 file changed, 21 insertions(+), 39 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index a3422bc..1910589 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2543,13 +2543,14 @@ static void slab_map_pages(struct kmem_cache *cache, struct page *page,
  * Grow (by 1) the number of slabs within a cache.  This is called by
  * kmem_cache_alloc() when there are no active objs left in a cache.
  */
-static int cache_grow(struct kmem_cache *cachep,
-		gfp_t flags, int nodeid, struct page *page)
+static int cache_grow(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 {
 	void *freelist;
 	size_t offset;
 	gfp_t local_flags;
+	int page_node;
 	struct kmem_cache_node *n;
+	struct page *page;
 
 	/*
 	 * Be lazy and only check for valid flags here, keeping it out of the
@@ -2577,12 +2578,12 @@ static int cache_grow(struct kmem_cache *cachep,
 	 * Get mem for the objs. Attempt to allocate a physical page from
 	 * 'nodeid'.
 	 */
-	if (!page)
-		page = kmem_getpages(cachep, local_flags, nodeid);
+	page = kmem_getpages(cachep, local_flags, nodeid);
 	if (!page)
 		goto failed;
 
-	n = get_node(cachep, nodeid);
+	page_node = page_to_nid(page);
+	n = get_node(cachep, page_node);
 
 	/* Get colour for the slab, and cal the next value. */
 	n->colour_next++;
@@ -2597,7 +2598,7 @@ static int cache_grow(struct kmem_cache *cachep,
 
 	/* Get slab management. */
 	freelist = alloc_slabmgmt(cachep, page, offset,
-			local_flags & ~GFP_CONSTRAINT_MASK, nodeid);
+			local_flags & ~GFP_CONSTRAINT_MASK, page_node);
 	if (OFF_SLAB(cachep) && !freelist)
 		goto opps1;
@@ -2616,13 +2617,13 @@ static int cache_grow(struct kmem_cache *cachep,
 	STATS_INC_GROWN(cachep);
 	n->free_objects += cachep->num;
 	spin_unlock(&n->list_lock);
-	return 1;
+	return page_node;
 
 opps1:
 	kmem_freepages(cachep, page);
 failed:
 	if (gfpflags_allow_blocking(local_flags))
 		local_irq_disable();
-	return 0;
+	return -1;
 }
 
 #if DEBUG
@@ -2903,14 +2904,14 @@ alloc_done:
 			return obj;
 		}
 
-		x = cache_grow(cachep, gfp_exact_node(flags), node, NULL);
+		x = cache_grow(cachep, gfp_exact_node(flags), node);
 
 		/* cache_grow can reenable interrupts, then ac could change. */
 		ac = cpu_cache_get(cachep);
 		node = numa_mem_id();
 
 		/* no objects in sight? abort */
-		if (!x && ac->avail == 0)
+		if (x < 0 && ac->avail == 0)
 			return NULL;
 
 		if (!ac->avail)	/* objects refilled by interrupt? */
@@ -3039,7 +3040,6 @@ static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags)
 static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
 {
 	struct zonelist *zonelist;
-	gfp_t local_flags;
 	struct zoneref *z;
 	struct zone *zone;
 	enum zone_type high_zoneidx = gfp_zone(flags);
@@ -3050,8 +3050,6 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
 	if (flags & __GFP_THISNODE)
 		return NULL;
 
-	local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
-
 retry_cpuset:
 	cpuset_mems_cookie = read_mems_allowed_begin();
 	zonelist = node_zonelist(mempolicy_slab_node(), flags);
@@ -3081,33 +3079,17 @@ retry:
 		 * We may trigger various forms of reclaim on the allowed
 		 * set and go into memory reserves if necessary.
 		 */
-		struct page *page;
+		nid = cache_grow(cache, flags, numa_mem_id());
+		if (nid >= 0) {
+			obj = cache_alloc_node(cache,
+				gfp_exact_node(flags), nid);
 
-		if (gfpflags_allow_blocking(local_flags))
-			local_irq_enable();
-		kmem_flagcheck(cache, flags);
-		page = kmem_getpages(cache, local_flags, numa_mem_id());
-		if (gfpflags_allow_blocking(local_flags))
-			local_irq_disable();
-		if (page) {
 			/*
-			 * Insert into the appropriate per node queues
+			 *
[PATCH v2 07/11] mm/slab: racy access/modify the slab color
From: Joonsoo Kim

Slab colour doesn't need to be maintained strictly, and the locking
used to change it can itself add lock contention, so this patch makes
access to and modification of the slab colour racy. This is a
preparation step for implementing a lockless allocation path for when
there are no free objects in the kmem_cache.

Below is the result of concurrent allocation/free in the slab
allocation benchmark Christoph made a long time ago. I have simplified
the output. The numbers are cycle counts for alloc/free respectively,
so lower is better.

* Before
Kmalloc N*alloc N*free(32): Average=365/806
Kmalloc N*alloc N*free(64): Average=452/690
Kmalloc N*alloc N*free(128): Average=736/886
Kmalloc N*alloc N*free(256): Average=1167/985
Kmalloc N*alloc N*free(512): Average=2088/1125
Kmalloc N*alloc N*free(1024): Average=4115/1184
Kmalloc N*alloc N*free(2048): Average=8451/1748
Kmalloc N*alloc N*free(4096): Average=16024/2048

* After
Kmalloc N*alloc N*free(32): Average=355/750
Kmalloc N*alloc N*free(64): Average=452/812
Kmalloc N*alloc N*free(128): Average=559/1070
Kmalloc N*alloc N*free(256): Average=1176/980
Kmalloc N*alloc N*free(512): Average=1939/1189
Kmalloc N*alloc N*free(1024): Average=3521/1278
Kmalloc N*alloc N*free(2048): Average=7152/1838
Kmalloc N*alloc N*free(4096): Average=13438/2013

It shows that contention is reduced for object sizes >= 1024 and
performance increases by roughly 15%.

Acked-by: Christoph Lameter
Signed-off-by: Joonsoo Kim
---
 mm/slab.c | 26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 6e61461..a3422bc 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2561,20 +2561,7 @@ static int cache_grow(struct kmem_cache *cachep,
 	}
 	local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
 
-	/* Take the node list lock to change the colour_next on this node */
 	check_irq_off();
-	n = get_node(cachep, nodeid);
-	spin_lock(&n->list_lock);
-
-	/* Get colour for the slab, and cal the next value. */
-	offset = n->colour_next;
-	n->colour_next++;
-	if (n->colour_next >= cachep->colour)
-		n->colour_next = 0;
-	spin_unlock(&n->list_lock);
-
-	offset *= cachep->colour_off;
-
 	if (gfpflags_allow_blocking(local_flags))
 		local_irq_enable();
@@ -2595,6 +2582,19 @@ static int cache_grow(struct kmem_cache *cachep,
 	if (!page)
 		goto failed;
 
+	n = get_node(cachep, nodeid);
+
+	/* Get colour for the slab, and cal the next value. */
+	n->colour_next++;
+	if (n->colour_next >= cachep->colour)
+		n->colour_next = 0;
+
+	offset = n->colour_next;
+	if (offset >= cachep->colour)
+		offset = 0;
+
+	offset *= cachep->colour_off;
+
 	/* Get slab management. */
 	freelist = alloc_slabmgmt(cachep, page, offset,
 			local_flags & ~GFP_CONSTRAINT_MASK, nodeid);
--
1.9.1
[PATCH v2 09/11] mm/slab: separate cache_grow() to two parts
From: Joonsoo Kim

This is a preparation step for implementing a lockless allocation path
for when there are no free objects in the kmem_cache. What we'd like
to do here is refill the cpu cache without holding the node lock. To
accomplish this, the refill should happen after the new slab is
allocated but before it is attached to the management list. So this
patch separates cache_grow() into two parts, allocation and attachment
to the list, in order to add some code in between them in the
following patch.

Signed-off-by: Joonsoo Kim
---
 mm/slab.c | 74 ---
 1 file changed, 52 insertions(+), 22 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 1910589..2c28ad5 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -213,6 +213,11 @@ static void slabs_destroy(struct kmem_cache *cachep, struct list_head *list);
 static int enable_cpucache(struct kmem_cache *cachep, gfp_t gfp);
 static void cache_reap(struct work_struct *unused);
 
+static inline void fixup_objfreelist_debug(struct kmem_cache *cachep,
+						void **list);
+static inline void fixup_slab_list(struct kmem_cache *cachep,
+				struct kmem_cache_node *n, struct page *page,
+				void **list);
 static int slab_early_init = 1;
 
 #define INDEX_NODE kmalloc_index(sizeof(struct kmem_cache_node))
@@ -1797,7 +1802,7 @@ static size_t calculate_slab_order(struct kmem_cache *cachep,
 
 			/*
 			 * Needed to avoid possible looping condition
-			 * in cache_grow()
+			 * in cache_grow_begin()
 			 */
 			if (OFF_SLAB(freelist_cache))
 				continue;
@@ -2543,7 +2548,8 @@ static void slab_map_pages(struct kmem_cache *cache, struct page *page,
  * Grow (by 1) the number of slabs within a cache.  This is called by
  * kmem_cache_alloc() when there are no active objs left in a cache.
  */
-static int cache_grow(struct kmem_cache *cachep, gfp_t flags, int nodeid)
+static struct page *cache_grow_begin(struct kmem_cache *cachep,
+				gfp_t flags, int nodeid)
 {
 	void *freelist;
 	size_t offset;
@@ -2609,21 +2615,40 @@ static int cache_grow(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 	if (gfpflags_allow_blocking(local_flags))
 		local_irq_disable();
 
-	check_irq_off();
-	spin_lock(&n->list_lock);
-
-	/* Make slab active. */
-	list_add_tail(&page->lru, &(n->slabs_free));
-	STATS_INC_GROWN(cachep);
-	n->free_objects += cachep->num;
-	spin_unlock(&n->list_lock);
-	return page_node;
+	return page;
+
 opps1:
 	kmem_freepages(cachep, page);
 failed:
 	if (gfpflags_allow_blocking(local_flags))
 		local_irq_disable();
-	return -1;
+	return NULL;
+}
+
+static void cache_grow_end(struct kmem_cache *cachep, struct page *page)
+{
+	struct kmem_cache_node *n;
+	void *list = NULL;
+
+	check_irq_off();
+
+	if (!page)
+		return;
+
+	INIT_LIST_HEAD(&page->lru);
+	n = get_node(cachep, page_to_nid(page));
+
+	spin_lock(&n->list_lock);
+	if (!page->active)
+		list_add_tail(&page->lru, &(n->slabs_free));
+	else
+		fixup_slab_list(cachep, n, page, &list);
+	STATS_INC_GROWN(cachep);
+	n->free_objects += cachep->num - page->active;
+	spin_unlock(&n->list_lock);
+
+	fixup_objfreelist_debug(cachep, &list);
 }
 
 #if DEBUG
@@ -2834,6 +2859,7 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
 	struct array_cache *ac;
 	int node;
 	void *list = NULL;
+	struct page *page;
 
 	check_irq_off();
 	node = numa_mem_id();
@@ -2861,7 +2887,6 @@ retry:
 	}
 
 	while (batchcount > 0) {
-		struct page *page;
 		/* Get slab alloc is to come from. */
 		page = get_first_slab(n, false);
 		if (!page)
@@ -2894,8 +2919,6 @@ alloc_done:
 	fixup_objfreelist_debug(cachep, &list);
 
 	if (unlikely(!ac->avail)) {
-		int x;
-
 		/* Check if we can use obj in pfmemalloc slab */
 		if (sk_memalloc_socks()) {
 			void *obj = cache_alloc_pfmemalloc(cachep, n, flags);
@@ -2904,14 +2927,18 @@ alloc_done:
 			return obj;
 		}
 
-		x = cache_grow(cachep, gfp_exact_node(flags), node);
+		page = cache_grow_begin(cachep, gfp_exact_node(flags), node);
+		cache_grow_end(cachep, page);
 
-		/* cache_grow can reenable interrupts, then ac could change. */
+		/*
+		 * cache_grow_begin() can reenable interrupts,
+		 * then ac could change.
+		 */
[PATCH v2 11/11] mm/slab: lockless decision to grow cache
From: Joonsoo Kim To check whther free objects exist or not precisely, we need to grab a lock. But, accuracy isn't that important because race window would be even small and if there is too much free object, cache reaper would reap it. So, this patch makes the check for free object exisistence not to hold a lock. This will reduce lock contention in heavily allocation case. Note that until now, n->shared can be freed during the processing by writing slabinfo, but, with some trick in this patch, we can access it freely within interrupt disabled period. Below is the result of concurrent allocation/free in slab allocation benchmark made by Christoph a long time ago. I make the output simpler. The number shows cycle count during alloc/free respectively so less is better. * Before Kmalloc N*alloc N*free(32): Average=248/966 Kmalloc N*alloc N*free(64): Average=261/949 Kmalloc N*alloc N*free(128): Average=314/1016 Kmalloc N*alloc N*free(256): Average=741/1061 Kmalloc N*alloc N*free(512): Average=1246/1152 Kmalloc N*alloc N*free(1024): Average=2437/1259 Kmalloc N*alloc N*free(2048): Average=4980/1800 Kmalloc N*alloc N*free(4096): Average=9000/2078 * After Kmalloc N*alloc N*free(32): Average=344/792 Kmalloc N*alloc N*free(64): Average=347/882 Kmalloc N*alloc N*free(128): Average=390/959 Kmalloc N*alloc N*free(256): Average=393/1067 Kmalloc N*alloc N*free(512): Average=683/1229 Kmalloc N*alloc N*free(1024): Average=1295/1325 Kmalloc N*alloc N*free(2048): Average=2513/1664 Kmalloc N*alloc N*free(4096): Average=4742/2172 It shows that allocation performance decreases for the object size up to 128 and it may be due to extra checks in cache_alloc_refill(). But, with considering improvement of free performance, net result looks the same. Result for other size class looks very promising, roughly, 50% performance improvement. v2: replace kick_all_cpus_sync() with synchronize_sched(). 
Signed-off-by: Joonsoo Kim --- mm/slab.c | 21 ++--- 1 file changed, 18 insertions(+), 3 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index cf12fbd..13e74aa 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -952,6 +952,15 @@ static int setup_kmem_cache_node(struct kmem_cache *cachep, spin_unlock_irq(>list_lock); slabs_destroy(cachep, ); + /* +* To protect lockless access to n->shared during irq disabled context. +* If n->shared isn't NULL in irq disabled context, accessing to it is +* guaranteed to be valid until irq is re-enabled, because it will be +* freed after synchronize_sched(). +*/ + if (force_change) + synchronize_sched(); + fail: kfree(old_shared); kfree(new_shared); @@ -2880,7 +2889,7 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags) { int batchcount; struct kmem_cache_node *n; - struct array_cache *ac; + struct array_cache *ac, *shared; int node; void *list = NULL; struct page *page; @@ -2901,11 +2910,16 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags) n = get_node(cachep, node); BUG_ON(ac->avail > 0 || !n); + shared = READ_ONCE(n->shared); + if (!n->free_objects && (!shared || !shared->avail)) + goto direct_grow; + spin_lock(>list_lock); + shared = READ_ONCE(n->shared); /* See if we can refill from the shared array */ - if (n->shared && transfer_objects(ac, n->shared, batchcount)) { - n->shared->touched = 1; + if (shared && transfer_objects(ac, shared, batchcount)) { + shared->touched = 1; goto alloc_done; } @@ -2927,6 +2941,7 @@ alloc_done: spin_unlock(>list_lock); fixup_objfreelist_debug(cachep, ); +direct_grow: if (unlikely(!ac->avail)) { /* Check if we can use obj in pfmemalloc slab */ if (sk_memalloc_socks()) { -- 1.9.1
[PATCH v2 08/11] mm/slab: make cache_grow() handle the page allocated on arbitrary node
From: Joonsoo Kim Currently, cache_grow() assumes that allocated page's nodeid would be same with parameter nodeid which is used for allocation request. If we discard this assumption, we can handle fallback_alloc() case gracefully. So, this patch makes cache_grow() handle the page allocated on arbitrary node and clean-up relevant code. Signed-off-by: Joonsoo Kim --- mm/slab.c | 60 +--- 1 file changed, 21 insertions(+), 39 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index a3422bc..1910589 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -2543,13 +2543,14 @@ static void slab_map_pages(struct kmem_cache *cache, struct page *page, * Grow (by 1) the number of slabs within a cache. This is called by * kmem_cache_alloc() when there are no active objs left in a cache. */ -static int cache_grow(struct kmem_cache *cachep, - gfp_t flags, int nodeid, struct page *page) +static int cache_grow(struct kmem_cache *cachep, gfp_t flags, int nodeid) { void *freelist; size_t offset; gfp_t local_flags; + int page_node; struct kmem_cache_node *n; + struct page *page; /* * Be lazy and only check for valid flags here, keeping it out of the @@ -2577,12 +2578,12 @@ static int cache_grow(struct kmem_cache *cachep, * Get mem for the objs. Attempt to allocate a physical page from * 'nodeid'. */ - if (!page) - page = kmem_getpages(cachep, local_flags, nodeid); + page = kmem_getpages(cachep, local_flags, nodeid); if (!page) goto failed; - n = get_node(cachep, nodeid); + page_node = page_to_nid(page); + n = get_node(cachep, page_node); /* Get colour for the slab, and cal the next value. */ n->colour_next++; @@ -2597,7 +2598,7 @@ static int cache_grow(struct kmem_cache *cachep, /* Get slab management. 
*/ freelist = alloc_slabmgmt(cachep, page, offset, - local_flags & ~GFP_CONSTRAINT_MASK, nodeid); + local_flags & ~GFP_CONSTRAINT_MASK, page_node); if (OFF_SLAB(cachep) && !freelist) goto opps1; @@ -2616,13 +2617,13 @@ static int cache_grow(struct kmem_cache *cachep, STATS_INC_GROWN(cachep); n->free_objects += cachep->num; spin_unlock(>list_lock); - return 1; + return page_node; opps1: kmem_freepages(cachep, page); failed: if (gfpflags_allow_blocking(local_flags)) local_irq_disable(); - return 0; + return -1; } #if DEBUG @@ -2903,14 +2904,14 @@ alloc_done: return obj; } - x = cache_grow(cachep, gfp_exact_node(flags), node, NULL); + x = cache_grow(cachep, gfp_exact_node(flags), node); /* cache_grow can reenable interrupts, then ac could change. */ ac = cpu_cache_get(cachep); node = numa_mem_id(); /* no objects in sight? abort */ - if (!x && ac->avail == 0) + if (x < 0 && ac->avail == 0) return NULL; if (!ac->avail) /* objects refilled by interrupt? */ @@ -3039,7 +3040,6 @@ static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags) static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags) { struct zonelist *zonelist; - gfp_t local_flags; struct zoneref *z; struct zone *zone; enum zone_type high_zoneidx = gfp_zone(flags); @@ -3050,8 +3050,6 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags) if (flags & __GFP_THISNODE) return NULL; - local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK); - retry_cpuset: cpuset_mems_cookie = read_mems_allowed_begin(); zonelist = node_zonelist(mempolicy_slab_node(), flags); @@ -3081,33 +3079,17 @@ retry: * We may trigger various forms of reclaim on the allowed * set and go into memory reserves if necessary. 
*/ - struct page *page; + nid = cache_grow(cache, flags, numa_mem_id()); + if (nid >= 0) { + obj = cache_alloc_node(cache, + gfp_exact_node(flags), nid); - if (gfpflags_allow_blocking(local_flags)) - local_irq_enable(); - kmem_flagcheck(cache, flags); - page = kmem_getpages(cache, local_flags, numa_mem_id()); - if (gfpflags_allow_blocking(local_flags)) - local_irq_disable(); - if (page) { /* -* Insert into the appropriate per node queues +* Another processor may allocate the objects in +
[PATCH v2 07/11] mm/slab: racy access/modify the slab color
From: Joonsoo Kim Slab color isn't needed to be changed strictly. Because locking for changing slab color could cause more lock contention so this patch implements racy access/modify the slab color. This is a preparation step to implement lockless allocation path when there is no free objects in the kmem_cache. Below is the result of concurrent allocation/free in slab allocation benchmark made by Christoph a long time ago. I make the output simpler. The number shows cycle count during alloc/free respectively so less is better. * Before Kmalloc N*alloc N*free(32): Average=365/806 Kmalloc N*alloc N*free(64): Average=452/690 Kmalloc N*alloc N*free(128): Average=736/886 Kmalloc N*alloc N*free(256): Average=1167/985 Kmalloc N*alloc N*free(512): Average=2088/1125 Kmalloc N*alloc N*free(1024): Average=4115/1184 Kmalloc N*alloc N*free(2048): Average=8451/1748 Kmalloc N*alloc N*free(4096): Average=16024/2048 * After Kmalloc N*alloc N*free(32): Average=355/750 Kmalloc N*alloc N*free(64): Average=452/812 Kmalloc N*alloc N*free(128): Average=559/1070 Kmalloc N*alloc N*free(256): Average=1176/980 Kmalloc N*alloc N*free(512): Average=1939/1189 Kmalloc N*alloc N*free(1024): Average=3521/1278 Kmalloc N*alloc N*free(2048): Average=7152/1838 Kmalloc N*alloc N*free(4096): Average=13438/2013 It shows that contention is reduced for object size >= 1024 and performance increases by roughly 15%. Acked-by: Christoph Lameter Signed-off-by: Joonsoo Kim --- mm/slab.c | 26 +- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index 6e61461..a3422bc 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -2561,20 +2561,7 @@ static int cache_grow(struct kmem_cache *cachep, } local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK); - /* Take the node list lock to change the colour_next on this node */ check_irq_off(); - n = get_node(cachep, nodeid); - spin_lock(>list_lock); - - /* Get colour for the slab, and cal the next value. 
*/ - offset = n->colour_next; - n->colour_next++; - if (n->colour_next >= cachep->colour) - n->colour_next = 0; - spin_unlock(>list_lock); - - offset *= cachep->colour_off; - if (gfpflags_allow_blocking(local_flags)) local_irq_enable(); @@ -2595,6 +2582,19 @@ static int cache_grow(struct kmem_cache *cachep, if (!page) goto failed; + n = get_node(cachep, nodeid); + + /* Get colour for the slab, and cal the next value. */ + n->colour_next++; + if (n->colour_next >= cachep->colour) + n->colour_next = 0; + + offset = n->colour_next; + if (offset >= cachep->colour) + offset = 0; + + offset *= cachep->colour_off; + /* Get slab management. */ freelist = alloc_slabmgmt(cachep, page, offset, local_flags & ~GFP_CONSTRAINT_MASK, nodeid); -- 1.9.1
[PATCH v2 09/11] mm/slab: separate cache_grow() to two parts
From: Joonsoo Kim This is a preparation step to implement lockless allocation path when there is no free objects in kmem_cache. What we'd like to do here is to refill cpu cache without holding a node lock. To accomplish this purpose, refill should be done after new slab allocation but before attaching the slab to the management list. So, this patch separates cache_grow() to two parts, allocation and attaching to the list in order to add some code inbetween them in the following patch. Signed-off-by: Joonsoo Kim --- mm/slab.c | 74 --- 1 file changed, 52 insertions(+), 22 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index 1910589..2c28ad5 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -213,6 +213,11 @@ static void slabs_destroy(struct kmem_cache *cachep, struct list_head *list); static int enable_cpucache(struct kmem_cache *cachep, gfp_t gfp); static void cache_reap(struct work_struct *unused); +static inline void fixup_objfreelist_debug(struct kmem_cache *cachep, + void **list); +static inline void fixup_slab_list(struct kmem_cache *cachep, + struct kmem_cache_node *n, struct page *page, + void **list); static int slab_early_init = 1; #define INDEX_NODE kmalloc_index(sizeof(struct kmem_cache_node)) @@ -1797,7 +1802,7 @@ static size_t calculate_slab_order(struct kmem_cache *cachep, /* * Needed to avoid possible looping condition -* in cache_grow() +* in cache_grow_begin() */ if (OFF_SLAB(freelist_cache)) continue; @@ -2543,7 +2548,8 @@ static void slab_map_pages(struct kmem_cache *cache, struct page *page, * Grow (by 1) the number of slabs within a cache. This is called by * kmem_cache_alloc() when there are no active objs left in a cache. 
 */
-static int cache_grow(struct kmem_cache *cachep, gfp_t flags, int nodeid)
+static struct page *cache_grow_begin(struct kmem_cache *cachep,
+				gfp_t flags, int nodeid)
 {
 	void *freelist;
 	size_t offset;
@@ -2609,21 +2615,40 @@ static int cache_grow(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 	if (gfpflags_allow_blocking(local_flags))
 		local_irq_disable();
 
-	check_irq_off();
-	spin_lock(&n->list_lock);
-
-	/* Make slab active. */
-	list_add_tail(&page->lru, &(n->slabs_free));
-	STATS_INC_GROWN(cachep);
-	n->free_objects += cachep->num;
-	spin_unlock(&n->list_lock);
-	return page_node;
+	return page;
+
 opps1:
 	kmem_freepages(cachep, page);
 failed:
 	if (gfpflags_allow_blocking(local_flags))
 		local_irq_disable();
-	return -1;
+	return NULL;
+}
+
+static void cache_grow_end(struct kmem_cache *cachep, struct page *page)
+{
+	struct kmem_cache_node *n;
+	void *list = NULL;
+
+	check_irq_off();
+
+	if (!page)
+		return;
+
+	INIT_LIST_HEAD(&page->lru);
+	n = get_node(cachep, page_to_nid(page));
+
+	spin_lock(&n->list_lock);
+	if (!page->active)
+		list_add_tail(&page->lru, &(n->slabs_free));
+	else
+		fixup_slab_list(cachep, n, page, &list);
+	STATS_INC_GROWN(cachep);
+	n->free_objects += cachep->num - page->active;
+	spin_unlock(&n->list_lock);
+
+	fixup_objfreelist_debug(cachep, &list);
 }
 
 #if DEBUG
@@ -2834,6 +2859,7 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
 	struct array_cache *ac;
 	int node;
 	void *list = NULL;
+	struct page *page;
 
 	check_irq_off();
 	node = numa_mem_id();
@@ -2861,7 +2887,6 @@ retry:
 	}
 
 	while (batchcount > 0) {
-		struct page *page;
 		/* Get slab alloc is to come from.
 */
 		page = get_first_slab(n, false);
 		if (!page)
@@ -2894,8 +2919,6 @@ alloc_done:
 	fixup_objfreelist_debug(cachep, &list);
 
 	if (unlikely(!ac->avail)) {
-		int x;
-
 		/* Check if we can use obj in pfmemalloc slab */
 		if (sk_memalloc_socks()) {
 			void *obj = cache_alloc_pfmemalloc(cachep, n, flags);
@@ -2904,14 +2927,18 @@ alloc_done:
 				return obj;
 		}
 
-		x = cache_grow(cachep, gfp_exact_node(flags), node);
+		page = cache_grow_begin(cachep, gfp_exact_node(flags), node);
+		cache_grow_end(cachep, page);
 
-		/* cache_grow can reenable interrupts, then ac could change. */
+		/*
+		 * cache_grow_begin() can reenable interrupts,
+		 * then ac could change.
+		 */
 		ac = cpu_cache_get(cachep);
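The begin/end split above is a general pattern: do the expensive, possibly-sleeping allocation with no lock held, and take the lock only to publish the result. A toy illustration (deliberately not kernel code; `lock_held` is a hypothetical stand-in for the node spinlock):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model of a node with a singly linked free-slab list. */
struct slab { struct slab *next; };
struct node { struct slab *free_list; int lock_held; };

/* "begin": no lock held, so the allocation may block or fail safely. */
static struct slab *grow_begin(void)
{
	return malloc(sizeof(struct slab));
}

/* "end": only the cheap list insertion runs in the critical section. */
static void grow_end(struct node *n, struct slab *s)
{
	if (!s)
		return;

	n->lock_held = 1;	/* stands in for spin_lock(&n->list_lock) */
	s->next = n->free_list;
	n->free_list = s;
	n->lock_held = 0;	/* stands in for spin_unlock(&n->list_lock) */
}
```

The payoff is the same as in the patch: concurrent callers contend only for the short publish step, not for the whole grow operation.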
[PATCH v2 05/11] mm/slab: clean-up kmem_cache_node setup
From: Joonsoo Kim

There is mostly the same code for setting up a kmem_cache_node in
cpuup_prepare() and alloc_kmem_cache_node(). Factor it out and clean
it up.

v2
o Rename setup_kmem_cache_node_node to setup_kmem_cache_nodes
o Fix suspend-to-ram issue reported by Nishanth

Tested-by: Nishanth Menon
Tested-by: Jon Hunter
Signed-off-by: Joonsoo Kim
---
 mm/slab.c | 168 +++++++++++++++++++++++++-------------------------------------
 1 file changed, 68 insertions(+), 100 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 49af685..27cb390 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -898,6 +898,63 @@ static int init_cache_node_node(int node)
 	return 0;
 }
 
+static int setup_kmem_cache_node(struct kmem_cache *cachep,
+				int node, gfp_t gfp, bool force_change)
+{
+	int ret = -ENOMEM;
+	struct kmem_cache_node *n;
+	struct array_cache *old_shared = NULL;
+	struct array_cache *new_shared = NULL;
+	struct alien_cache **new_alien = NULL;
+	LIST_HEAD(list);
+
+	if (use_alien_caches) {
+		new_alien = alloc_alien_cache(node, cachep->limit, gfp);
+		if (!new_alien)
+			goto fail;
+	}
+
+	if (cachep->shared) {
+		new_shared = alloc_arraycache(node,
+			cachep->shared * cachep->batchcount, 0xbaadf00d, gfp);
+		if (!new_shared)
+			goto fail;
+	}
+
+	ret = init_cache_node(cachep, node, gfp);
+	if (ret)
+		goto fail;
+
+	n = get_node(cachep, node);
+	spin_lock_irq(&n->list_lock);
+	if (n->shared && force_change) {
+		free_block(cachep, n->shared->entry,
+				n->shared->avail, node, &list);
+		n->shared->avail = 0;
+	}
+
+	if (!n->shared || force_change) {
+		old_shared = n->shared;
+		n->shared = new_shared;
+		new_shared = NULL;
+	}
+
+	if (!n->alien) {
+		n->alien = new_alien;
+		new_alien = NULL;
+	}
+
+	spin_unlock_irq(&n->list_lock);
+	slabs_destroy(cachep, &list);
+
+fail:
+	kfree(old_shared);
+	kfree(new_shared);
+	free_alien_cache(new_alien);
+
+	return ret;
+}
+
 static void cpuup_canceled(long cpu)
 {
 	struct kmem_cache *cachep;
@@ -969,7 +1026,6 @@ free_slab:
 static int cpuup_prepare(long cpu)
 {
 	struct kmem_cache *cachep;
-	struct kmem_cache_node *n = NULL;
 	int node = cpu_to_mem(cpu);
 	int err;
 
@@ -988,44 +1044,9 @@ static int cpuup_prepare(long cpu)
 	 * array caches
 	 */
 	list_for_each_entry(cachep, &slab_caches, list) {
-		struct array_cache *shared = NULL;
-		struct alien_cache **alien = NULL;
-
-		if (cachep->shared) {
-			shared = alloc_arraycache(node,
-				cachep->shared * cachep->batchcount,
-				0xbaadf00d, GFP_KERNEL);
-			if (!shared)
-				goto bad;
-		}
-		if (use_alien_caches) {
-			alien = alloc_alien_cache(node, cachep->limit, GFP_KERNEL);
-			if (!alien) {
-				kfree(shared);
-				goto bad;
-			}
-		}
-		n = get_node(cachep, node);
-		BUG_ON(!n);
-
-		spin_lock_irq(&n->list_lock);
-		if (!n->shared) {
-			/*
-			 * We are serialised from CPU_DEAD or
-			 * CPU_UP_CANCELLED by the cpucontrol lock
-			 */
-			n->shared = shared;
-			shared = NULL;
-		}
-#ifdef CONFIG_NUMA
-		if (!n->alien) {
-			n->alien = alien;
-			alien = NULL;
-		}
-#endif
-		spin_unlock_irq(&n->list_lock);
-		kfree(shared);
-		free_alien_cache(alien);
+		err = setup_kmem_cache_node(cachep, node, GFP_KERNEL, false);
+		if (err)
+			goto bad;
 	}
 
 	return 0;
@@ -3676,72 +3697,19 @@ EXPORT_SYMBOL(kfree);
 /*
  * This initializes kmem_cache_node or resizes various caches for all nodes.
  */
-static int alloc_kmem_cache_node(struct kmem_cache *cachep, gfp_t gfp)
+static int setup_kmem_cache_nodes(struct kmem_cache *cachep, gfp_t gfp)
 {
+	int ret;
 	int node;
 	struct kmem_cache_node *n;
-	struct array_cache *new_shared;
-	struct alien_cache **new_alien = NULL;
 
 	for_each_online_node(node) {
-
-		if (use_alien_caches) {
-			new_alien = alloc_alien_cache(node,
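The factored-out setup_kmem_cache_node() follows a classic prealloc-then-swap discipline: allocate replacements outside the lock, install them under the lock only if the slot is empty (or a change is forced), and free whatever was not installed afterwards. A minimal userspace sketch of that discipline, with an `int *` standing in for the shared array cache (the critical-section markers are comments, not real locking):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy node: 'shared' stands in for kmem_cache_node's shared array. */
struct node { int *shared; };

static int setup_shared(struct node *n, int value, int force_change)
{
	int *new_shared = malloc(sizeof(int));
	int *old_shared = NULL;

	if (!new_shared)
		return -1;
	*new_shared = value;

	/* ---- the node lock would be taken here ---- */
	if (!n->shared || force_change) {
		old_shared = n->shared;	/* detach old value under the lock */
		n->shared = new_shared;
		new_shared = NULL;	/* replacement was consumed */
	}
	/* ---- the node lock would be dropped here ---- */

	free(old_shared);	/* freed outside the critical section */
	free(new_shared);	/* unused replacement, if the slot was kept */
	return 0;
}
```

Because no allocation or free happens while the "lock" is held, the critical section stays a handful of pointer swaps, which is the whole point of the clean-up.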
[PATCH v2 06/11] mm/slab: don't keep free slabs if free_objects exceeds free_limit
From: Joonsoo Kim

Currently, the decision to free a slab is made every time a freed
object is put back into the slab. This has the following problem.

Assume free_limit = 10 and nr_free = 9. Frees happen in the following
sequence and nr_free changes accordingly.

free(becomes a free slab)  free(does not become a free slab)
nr_free: 9 -> 10 (at first free) -> 11 (at second free)

If we check whether we can free the current slab on each object free,
we cannot free any slab in this situation, because the current slab is
not a free slab at the moment nr_free exceeds free_limit (at the second
free), even though a free slab exists. If we instead check at the end,
we can free one free slab. This problem can cause the slab subsystem
to keep too much memory.

This patch fixes it by checking the number of free objects after all
the free work is done. If there is a free slab at that time, we free as
many slabs as possible, so the number of kept free slabs stays minimal.

v2: explain more about the problem

Signed-off-by: Joonsoo Kim
---
 mm/slab.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 27cb390..6e61461 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3283,6 +3283,9 @@ static void free_block(struct kmem_cache *cachep, void **objpp,
 {
 	int i;
 	struct kmem_cache_node *n = get_node(cachep, node);
+	struct page *page;
+
+	n->free_objects += nr_objects;
 
 	for (i = 0; i < nr_objects; i++) {
 		void *objp;
@@ -3295,17 +3298,11 @@ static void free_block(struct kmem_cache *cachep, void **objpp,
 		check_spinlock_acquired_node(cachep, node);
 		slab_put_obj(cachep, page, objp);
 		STATS_DEC_ACTIVE(cachep);
-		n->free_objects++;
 
 		/* fixup slab chains */
-		if (page->active == 0) {
-			if (n->free_objects > n->free_limit) {
-				n->free_objects -= cachep->num;
-				list_add_tail(&page->lru, list);
-			} else {
-				list_add(&page->lru, &n->slabs_free);
-			}
-		} else {
+		if (page->active == 0)
+			list_add(&page->lru, &n->slabs_free);
+		else {
 			/* Unconditionally move a slab to the end of the
 			 * partial list on free - maximum time for the
 			 * other objects to be freed, too.
@@ -3313,6 +3310,14 @@ static void free_block(struct kmem_cache *cachep, void **objpp,
 			list_add_tail(&page->lru, &n->slabs_partial);
 		}
 	}
+
+	while (n->free_objects > n->free_limit && !list_empty(&n->slabs_free)) {
+		n->free_objects -= cachep->num;
+
+		page = list_last_entry(&n->slabs_free, struct page, lru);
+		list_del(&page->lru);
+		list_add(&page->lru, list);
+	}
 }
 
 static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
-- 
1.9.1
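The deferred check that this patch adds at the end of free_block() can be modeled in a few lines. The helper below is a simplified stand-in (not the kernel code): it applies the same loop condition — keep releasing fully-free slabs while the free-object count exceeds the limit and a free slab exists.

```c
#include <assert.h>

/*
 * Model of the deferred while-loop added to free_block(): after all
 * objects have been returned, release whole free slabs until the node
 * drops back under its free_objects limit or runs out of free slabs.
 * Returns how many slabs would be handed back to the page allocator.
 */
static int slabs_released_deferred(int nr_free, int free_limit,
				   int objs_per_slab, int free_slabs)
{
	int released = 0;

	while (nr_free > free_limit && free_slabs > 0) {
		nr_free -= objs_per_slab;	/* n->free_objects -= cachep->num */
		free_slabs--;			/* take one page off slabs_free */
		released++;
	}
	return released;
}
```

With the commit message's numbers (limit 10, nr_free ending at 11, one free slab), the per-object check frees nothing, while this end-of-batch check releases the one free slab.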
[PATCH v2 02/11] mm/slab: remove BAD_ALIEN_MAGIC again
From: Joonsoo Kim

The initial attempt to remove BAD_ALIEN_MAGIC was reverted by commit
edcad2509550 ("Revert "slab: remove BAD_ALIEN_MAGIC"") because it
caused a problem on m68k, which has many nodes but !CONFIG_NUMA. In
that case the alien cache isn't used at all, but, to cope with some
initialization paths, a garbage value is used, and that value is
BAD_ALIEN_MAGIC. Now that this patch sets use_alien_caches to 0 when
!CONFIG_NUMA, there is no initialization-path problem, so we don't
need BAD_ALIEN_MAGIC at all. So remove it.

Tested-by: Geert Uytterhoeven
Signed-off-by: Joonsoo Kim
---
 mm/slab.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index d8746c0..373b8be 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -421,8 +421,6 @@ static struct kmem_cache kmem_cache_boot = {
 	.name = "kmem_cache",
 };
 
-#define BAD_ALIEN_MAGIC 0x01020304ul
-
 static DEFINE_PER_CPU(struct delayed_work, slab_reap_work);
 
 static inline struct array_cache *cpu_cache_get(struct kmem_cache *cachep)
@@ -637,7 +635,7 @@ static int transfer_objects(struct array_cache *to,
 static inline struct alien_cache **alloc_alien_cache(int node,
 						int limit, gfp_t gfp)
 {
-	return (struct alien_cache **)BAD_ALIEN_MAGIC;
+	return NULL;
 }
 
 static inline void free_alien_cache(struct alien_cache **ac_ptr)
@@ -1205,7 +1203,7 @@ void __init kmem_cache_init(void)
 					sizeof(struct rcu_head));
 	kmem_cache = &kmem_cache_boot;
 
-	if (num_possible_nodes() == 1)
+	if (!IS_ENABLED(CONFIG_NUMA) || num_possible_nodes() == 1)
 		use_alien_caches = 0;
 
 	for (i = 0; i < NUM_INIT_LISTS; i++)
-- 
1.9.1
[PATCH v2 00/11] mm/slab: reduce lock contention in alloc path
From: Joonsoo Kim

Major changes from v1
o hold node lock instead of slab_mutex in kmem_cache_shrink()
o fix suspend-to-ram issue reported by Nishanth
o use synchronize_sched() instead of kick_all_cpus_sync()

While processing concurrent allocations, SLAB could be contended a lot
because it does a lot of work while holding a lock. This patchset tries
to reduce the number of critical sections in order to reduce lock
contention. The major changes are a lockless decision to allocate more
slabs and a lockless cpu cache refill from the newly allocated slab.

Below is the result of concurrent allocation/free in the slab
allocation benchmark made by Christoph a long time ago. I've made the
output simpler. The numbers show the cycle count during alloc/free,
respectively, so less is better.

* Before
Kmalloc N*alloc N*free(32): Average=365/806
Kmalloc N*alloc N*free(64): Average=452/690
Kmalloc N*alloc N*free(128): Average=736/886
Kmalloc N*alloc N*free(256): Average=1167/985
Kmalloc N*alloc N*free(512): Average=2088/1125
Kmalloc N*alloc N*free(1024): Average=4115/1184
Kmalloc N*alloc N*free(2048): Average=8451/1748
Kmalloc N*alloc N*free(4096): Average=16024/2048

* After
Kmalloc N*alloc N*free(32): Average=344/792
Kmalloc N*alloc N*free(64): Average=347/882
Kmalloc N*alloc N*free(128): Average=390/959
Kmalloc N*alloc N*free(256): Average=393/1067
Kmalloc N*alloc N*free(512): Average=683/1229
Kmalloc N*alloc N*free(1024): Average=1295/1325
Kmalloc N*alloc N*free(2048): Average=2513/1664
Kmalloc N*alloc N*free(4096): Average=4742/2172

It shows that performance improves greatly (roughly more than 50%) for
object classes whose size is more than 128 bytes.

Joonsoo Kim (11):
  mm/slab: fix the theoretical race by holding proper lock
  mm/slab: remove BAD_ALIEN_MAGIC again
  mm/slab: drain the free slab as much as possible
  mm/slab: factor out kmem_cache_node initialization code
  mm/slab: clean-up kmem_cache_node setup
  mm/slab: don't keep free slabs if free_objects exceeds free_limit
  mm/slab: racy access/modify the slab color
  mm/slab: make cache_grow() handle the page allocated on arbitrary node
  mm/slab: separate cache_grow() to two parts
  mm/slab: refill cpu cache through a new slab without holding a node lock
  mm/slab: lockless decision to grow cache

 mm/slab.c | 562 +++++++++++++++++++++++++++++++++----------------------------
 1 file changed, 295 insertions(+), 267 deletions(-)

-- 
1.9.1
[PATCH v2 01/11] mm/slab: fix the theoretical race by holding proper lock
From: Joonsoo Kim

While processing concurrent allocations, SLAB could be contended a lot
because it does a lot of work while holding a lock. This patchset tries
to reduce the number of critical sections in order to reduce lock
contention. The major changes are a lockless decision to allocate more
slabs and a lockless cpu cache refill from the newly allocated slab.

Below is the result of concurrent allocation/free in the slab
allocation benchmark made by Christoph a long time ago. I've made the
output simpler. The numbers show the cycle count during alloc/free,
respectively, so less is better.

* Before
Kmalloc N*alloc N*free(32): Average=365/806
Kmalloc N*alloc N*free(64): Average=452/690
Kmalloc N*alloc N*free(128): Average=736/886
Kmalloc N*alloc N*free(256): Average=1167/985
Kmalloc N*alloc N*free(512): Average=2088/1125
Kmalloc N*alloc N*free(1024): Average=4115/1184
Kmalloc N*alloc N*free(2048): Average=8451/1748
Kmalloc N*alloc N*free(4096): Average=16024/2048

* After
Kmalloc N*alloc N*free(32): Average=344/792
Kmalloc N*alloc N*free(64): Average=347/882
Kmalloc N*alloc N*free(128): Average=390/959
Kmalloc N*alloc N*free(256): Average=393/1067
Kmalloc N*alloc N*free(512): Average=683/1229
Kmalloc N*alloc N*free(1024): Average=1295/1325
Kmalloc N*alloc N*free(2048): Average=2513/1664
Kmalloc N*alloc N*free(4096): Average=4742/2172

It shows that performance improves greatly (roughly more than 50%) for
object classes whose size is more than 128 bytes.

This patch (of 11):

If we hold neither the slab_mutex nor the node lock, the node's shared
array cache could be freed and re-populated. If __kmem_cache_shrink()
is called at the same time, it will call drain_array() with n->shared
without holding the node lock, so a problem can happen. This patch
fixes the situation by holding the node lock before trying to drain
the shared array.

In addition, add a debug check to confirm that the n->shared access
race doesn't exist.

v2:
o Hold the node lock instead of holding the slab_mutex (per Christoph)
o Add a debug check rather than adding a code comment (per Nikolay)

Signed-off-by: Joonsoo Kim
---
 mm/slab.c | 68 +++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 45 insertions(+), 23 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index a53a0f6..d8746c0 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2173,6 +2173,11 @@ static void check_irq_on(void)
 	BUG_ON(irqs_disabled());
 }
 
+static void check_mutex_acquired(void)
+{
+	BUG_ON(!mutex_is_locked(&slab_mutex));
+}
+
 static void check_spinlock_acquired(struct kmem_cache *cachep)
 {
 #ifdef CONFIG_SMP
@@ -2192,13 +2197,27 @@ static void check_spinlock_acquired_node(struct kmem_cache *cachep, int node)
 #else
 #define check_irq_off()	do { } while(0)
 #define check_irq_on()	do { } while(0)
+#define check_mutex_acquired()	do { } while(0)
 #define check_spinlock_acquired(x)	do { } while(0)
 #define check_spinlock_acquired_node(x, y)	do { } while(0)
 #endif
 
-static void drain_array(struct kmem_cache *cachep, struct kmem_cache_node *n,
-			struct array_cache *ac,
-			int force, int node);
+static void drain_array_locked(struct kmem_cache *cachep, struct array_cache *ac,
+				int node, bool free_all, struct list_head *list)
+{
+	int tofree;
+
+	if (!ac || !ac->avail)
+		return;
+
+	tofree = free_all ? ac->avail : (ac->limit + 4) / 5;
+	if (tofree > ac->avail)
+		tofree = (ac->avail + 1) / 2;
+
+	free_block(cachep, ac->entry, tofree, node, list);
+	ac->avail -= tofree;
+	memmove(ac->entry, &(ac->entry[tofree]), sizeof(void *) * ac->avail);
+}
 
 static void do_drain(void *arg)
 {
@@ -,6 +2241,7 @@ static void drain_cpu_caches(struct kmem_cache *cachep)
 {
 	struct kmem_cache_node *n;
 	int node;
+	LIST_HEAD(list);
 
 	on_each_cpu(do_drain, cachep, 1);
 	check_irq_on();
@@ -2229,8 +2249,13 @@ static void drain_cpu_caches(struct kmem_cache *cachep)
 		if (n->alien)
 			drain_alien_cache(cachep, n->alien);
 
-	for_each_kmem_cache_node(cachep, node, n)
-		drain_array(cachep, n, n->shared, 1, node);
+	for_each_kmem_cache_node(cachep, node, n) {
+		spin_lock_irq(&n->list_lock);
+		drain_array_locked(cachep, n->shared, node, true, &list);
+		spin_unlock_irq(&n->list_lock);
+
+		slabs_destroy(cachep, &list);
+	}
 }
 
 /*
@@ -3873,29 +3898,26 @@ skip_setup:
  * if drain_array() is used on the shared array.
  */
 static void drain_array(struct kmem_cache *cachep, struct kmem_cache_node *n,
-			 struct array_cache *ac, int force, int node)
+			 struct array_cache *ac, int node)
 {
 	LIST_HEAD(list);
-	int tofree;
+
+	/* ac from n->shared can be freed if we don't hold the slab_mutex. */
+	check_mutex_acquired();
 
 	if (!ac || !ac->avail)
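The tofree arithmetic in drain_array_locked() above is easy to misread, so here it is extracted into a standalone helper (an illustration only, with plain `int` parameters standing in for the array_cache fields): drain everything when `free_all` is set, otherwise drain roughly a fifth of the configured limit, clamped to about half of what is actually available.

```c
#include <assert.h>

/*
 * Mirror of the tofree computation in drain_array_locked():
 * - free_all: drain every available object;
 * - otherwise: drain ceil(limit / 5), but never more than about
 *   half of the currently available objects.
 */
static int compute_tofree(int avail, int limit, int free_all)
{
	int tofree = free_all ? avail : (limit + 4) / 5;

	if (tofree > avail)
		tofree = (avail + 1) / 2;
	return tofree;
}
```

The clamp matters when the array is nearly empty: a partial drain then takes half of the few remaining objects instead of trying to free more than exist.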