Re: [PATCH v2] RISC-V: Don't use a global include guard for uapi/asm/syscalls.h
This actually seems to break the compilation for me in for-next:

hch@carbon:~/work/linux$ make ARCH=riscv
  CALL    scripts/checksyscalls.sh
:1335:2: warning: #warning syscall rseq not implemented [-Wcpp]
  CHK     include/generated/compile.h
  CC      arch/riscv/kernel/syscall_table.o
./arch/riscv/include/uapi/asm/syscalls.h:29:36: error: 'sys_riscv_flush_icache' undeclared here (not in a function); did you mean '__NR_riscv_flush_icache'?
 __SYSCALL(__NR_riscv_flush_icache, sys_riscv_flush_icache)
                                    ^
arch/riscv/kernel/syscall_table.c:21:37: note: in definition of macro '__SYSCALL'
 #define __SYSCALL(nr, call) [nr] = (call),
                                     ^~~~
make[1]: *** [scripts/Makefile.build:318: arch/riscv/kernel/syscall_table.o] Error 1
make: *** [Makefile:1029: arch/riscv/kernel] Error 2

Reverting it for now, but I guess this might hit others too..
Re: [PATCH v8 07/22] KVM: s390: refactor crypto initialization
On 08.08.2018 16:44, Tony Krowiak wrote:
> From: Tony Krowiak
>
> This patch refactors the code that initializes and sets up the
> crypto configuration for a guest. The following changes are
> implemented via this patch:
>
> 1. Prior to the introduction of AP device virtualization, it
>    was not necessary to provide guest access to the CRYCB
>    unless the MSA extension 3 (MSAX3) facility was installed
>    on the host system. With the introduction of AP device
>    virtualization, the CRYCB must be made accessible to the
>    guest if the AP instructions are installed on the host
>    and are to be provided to the guest.
>
> 2. Introduces a flag indicating AP instructions executed on
>    the guest shall be interpreted by the firmware. It is
>    initialized to indicate AP instructions are to be
>    interpreted and is used to set the SIE bit for
>    each vcpu during vcpu setup.
>
> Signed-off-by: Tony Krowiak
> Reviewed-by: Halil Pasic
> Acked-by: Christian Borntraeger
> Tested-by: Michael Mueller
> Tested-by: Farhan Ali
> Signed-off-by: Christian Borntraeger

Acked-by: Janosch Frank

> ---
>  arch/s390/include/asm/kvm_host.h |  3 +
>  arch/s390/include/uapi/asm/kvm.h |  1 +
>  arch/s390/kvm/kvm-s390.c         | 86
>  3 files changed, 49 insertions(+), 41 deletions(-)
>
> diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
> index af39561..0c13f61 100644
> --- a/arch/s390/include/asm/kvm_host.h
> +++ b/arch/s390/include/asm/kvm_host.h
> @@ -187,6 +187,7 @@ struct kvm_s390_sie_block {
>  #define ECA_AIV		0x0020
>  #define ECA_VX		0x0002
>  #define ECA_PROTEXCI	0x2000
> +#define ECA_APIE	0x0008
>  #define ECA_SII		0x0001
>  	__u32	eca;			/* 0x004c */
>  #define ICPT_INST	0x04
> @@ -256,6 +257,7 @@ struct kvm_s390_sie_block {
>  	__u8	reservede4[4];		/* 0x00e4 */
>  	__u64	tecmc;			/* 0x00e8 */
>  	__u8	reservedf0[12];		/* 0x00f0 */
> +#define CRYCB_FORMAT_MASK 0x0003
>  #define CRYCB_FORMAT1 0x0001
>  #define CRYCB_FORMAT2 0x0003
>  	__u32	crycbd;			/* 0x00fc */
> @@ -714,6 +716,7 @@ struct kvm_s390_crypto {
>  	__u32 crycbd;
>  	__u8 aes_kw;
>  	__u8 dea_kw;
> +	__u8 apie;

In the last review I wanted a comment here to know what they do.

>  static void kvm_s390_vcpu_crypto_setup(struct kvm_vcpu *vcpu)
>  {
> -	if (!test_kvm_facility(vcpu->kvm, 76))
> +	/*
> +	 * If neither the AP instructions nor the MSAX3 facility are installed
> +	 * on the host, then there is no need for a CRYCB in SIE because the
> +	 * they will not be installed on the guest either.

the they

> +	 */
> +	if (ap_instructions_available() && !test_facility(76))
>  		return;

I know you're not responsible for that one :) but 0 being the wanted
value here is a bit counter-intuitive.

>
> -	vcpu->arch.sie_block->ecb3 &= ~(ECB3_AES | ECB3_DEA);
> +	vcpu->arch.sie_block->crycbd = vcpu->kvm->arch.crypto.crycbd;
> +
> +	vcpu->arch.sie_block->eca &= ~ECA_APIE;

The scb is zero allocated, are the ECA and the ECB3s set somewhere
in-between, or is that your way of making sure the controls are
definitely gone for good?

> +	if (vcpu->kvm->arch.crypto.apie &&
> +	    test_kvm_cpu_feat(vcpu->kvm, KVM_S390_VM_CPU_FEAT_AP))
> +		vcpu->arch.sie_block->eca |= ECA_APIE;
>
> -	if (vcpu->kvm->arch.crypto.aes_kw)
> -		vcpu->arch.sie_block->ecb3 |= ECB3_AES;
> -	if (vcpu->kvm->arch.crypto.dea_kw)
> -		vcpu->arch.sie_block->ecb3 |= ECB3_DEA;
> +	/* If MSAX3 is installed on the guest, set up protected key support */
> +	if (test_kvm_facility(vcpu->kvm, 76)) {
> +		vcpu->arch.sie_block->ecb3 &= ~(ECB3_AES | ECB3_DEA);
>
> -	vcpu->arch.sie_block->crycbd = vcpu->kvm->arch.crypto.crycbd;
> +		if (vcpu->kvm->arch.crypto.aes_kw)
> +			vcpu->arch.sie_block->ecb3 |= ECB3_AES;
> +		if (vcpu->kvm->arch.crypto.dea_kw)
> +			vcpu->arch.sie_block->ecb3 |= ECB3_DEA;
> +	}
>  }
>
>  void kvm_s390_vcpu_unsetup_cmma(struct kvm_vcpu *vcpu)
Re: [PATCH v5 1/2] leds: core: Introduce LED pattern trigger
Hi Jacek,

On 9 August 2018 at 05:28, Jacek Anaszewski wrote:
> Hi Baolin,
>
> On 08/08/2018 08:01 AM, Baolin Wang wrote:
>> Hi Jacek,
>>
>> On 8 August 2018 at 05:54, Jacek Anaszewski wrote:
>>> Hi Baolin,
>>>
>>> Thank you for addressing the review remarks.
>>> Since the patch set is targeted for 4.19, then we have three weeks
>>> before it will be merged to the for-next anyway. That said, I propose
>>> one more modification, please take a look below.
>>
>> Sure.
>>
>>> On 08/06/2018 02:05 PM, Baolin Wang wrote:
>>>> Some LED controllers have support for autonomously controlling
>>>> brightness over time, according to some preprogrammed pattern or
>>>> function. This patch adds a pattern trigger so that the LED device
>>>> can configure the pattern and trigger it.
>>>>
>>>> Signed-off-by: Raphael Teysseyre
>>>> Signed-off-by: Baolin Wang
>>>> ---
>>>> Changes from v4:
>>>>  - Change the repeat file to return the originally written number.
>>>>  - Improve comments.
>>>>  - Fix some build warnings.
>>>>
>>>> Changes from v3:
>>>>  - Reset pattern number to 0 if user provides incorrect pattern string.
>>>>  - Support one pattern.
>>>>
>>>> Changes from v2:
>>>>  - Remove hardware_pattern boolean.
>>>>  - Change the pattern string format.
>>>>
>>>> Changes from v1:
>>>>  - Use ATTRIBUTE_GROUPS() to define attributes.
>>>>  - Introduce hardware_pattern flag to determine if software pattern
>>>>    or hardware pattern.
>>>>  - Re-implement pattern_trig_store_pattern() function.
>>>>  - Remove pattern_get() interface.
>>>>  - Improve comments.
>>>>  - Other small optimization.
>>>> ---
>>>>  .../ABI/testing/sysfs-class-led-trigger-pattern |  24 ++
>>>>  drivers/leds/trigger/Kconfig                    |   7 +
>>>>  drivers/leds/trigger/Makefile                   |   1 +
>>>>  drivers/leds/trigger/ledtrig-pattern.c          | 266
>>>>  include/linux/leds.h                            |  16 ++
>>>>  5 files changed, 314 insertions(+)
>>>>  create mode 100644 Documentation/ABI/testing/sysfs-class-led-trigger-pattern
>>>>  create mode 100644 drivers/leds/trigger/ledtrig-pattern.c
>>>>
>>>> diff --git a/Documentation/ABI/testing/sysfs-class-led-trigger-pattern b/Documentation/ABI/testing/sysfs-class-led-trigger-pattern
>>>> new file mode 100644
>>>> index 000..bc7475f
>>>> --- /dev/null
>>>> +++ b/Documentation/ABI/testing/sysfs-class-led-trigger-pattern
>>>> @@ -0,0 +1,24 @@
>>>> +What:		/sys/class/leds/<led>/pattern
>>>> +Date:		August 2018
>>>> +KernelVersion:	4.19
>>>> +Description:
>>>> +	Specify a pattern for the LED, for LED hardware that support
>>>> +	altering the brightness as a function of time.
>>>> +
>>>> +	The pattern is given by a series of tuples, of brightness and
>>>> +	duration (ms). The LED is expected to traverse the series and
>>>> +	each brightness value for the specified duration. Duration of
>>>> +	0 means brightness should immediately change to new value.
>>>> +
>>>> +	The format of the pattern values should be:
>>>> +	"brightness_1 duration_1 brightness_2 duration_2 brightness_3
>>>> +	duration_3 ...".
>>>> +
>>>> +What:		/sys/class/leds/<led>/repeat
>>>> +Date:		August 2018
>>>> +KernelVersion:	4.19
>>>> +Description:
>>>> +	Specify a pattern repeat number. 0 means repeat indefinitely.
>>>> +
>>>> +	This file will always return the originally written repeat
>>>> +	number.
>>>> diff --git a/drivers/leds/trigger/Kconfig b/drivers/leds/trigger/Kconfig
>>>> index 4018af7..b76fc3c 100644
>>>> --- a/drivers/leds/trigger/Kconfig
>>>> +++ b/drivers/leds/trigger/Kconfig
>>>> @@ -129,4 +129,11 @@ config LEDS_TRIGGER_NETDEV
>>>>  	  This allows LEDs to be controlled by network device activity.
>>>>  	  If unsure, say Y.
>>>>
>>>> +config LEDS_TRIGGER_PATTERN
>>>> +	tristate "LED Pattern Trigger"
>>>> +	help
>>>> +	  This allows LEDs to be controlled by a software or hardware pattern
>>>> +	  which is a series of tuples, of brightness and duration (ms).
>>>> +	  If unsure, say N
>>>> +
>>>>  endif # LEDS_TRIGGERS
>>>> diff --git a/drivers/leds/trigger/Makefile b/drivers/leds/trigger/Makefile
>>>> index f3cfe19..9bcb64e 100644
>>>> --- a/drivers/leds/trigger/Makefile
>>>> +++ b/drivers/leds/trigger/Makefile
>>>> @@ -13,3 +13,4 @@ obj-$(CONFIG_LEDS_TRIGGER_TRANSIENT)	+= ledtrig-transient.o
>>>>  obj-$(CONFIG_LEDS_TRIGGER_CAMERA)	+= ledtrig-camera.o
>>>>  obj-$(CONFIG_LEDS_TRIGGER_PANIC)	+= ledtrig-panic.o
>>>>  obj-$(CONFIG_LEDS_TRIGGER_NETDEV)	+= ledtrig-netdev.o
>>>> +obj-$(CONFIG_LEDS_TRIGGER_PATTERN)	+= ledtrig-pattern.o
>>>> diff --git a/drivers/leds/trigger/ledtrig-pattern.c b/drivers/leds/trigger/ledtrig-pattern.c
>>>> new file mode 100644
>>>> index 000..63b94a2
[PATCH] sched: idle: Reenable sched tick for cpuidle request
The idle loop stops the tick according to the decision from the cpuidle
framework. If 'need_resched()' stays false without any task scheduling,
the CPU keeps running in the do_idle() loop and never gets a chance to
call tick_nohz_idle_exit() to re-enable the tick. As a result, the idle
loop cannot re-enable the sched tick afterwards; if the idle governor
then selects a shallow state, the powernightmares issue can occur again.

This issue can easily be reproduced on an Arm Hikey board: use CPU0 to
send an IPI to CPU7; CPU7 receives the IPI and in the callback function
starts a hrtimer with a 4ms interval, so the 4ms timer delta value lets
the 'menu' governor choose the deepest state on the next idle entry.
From then on, CPU7 restarts the hrtimer with a 1ms interval for 10
times in total; this exploits the typical-pattern logic in the 'menu'
governor to predict a 1ms duration, so the idle governor easily selects
a shallow state, which on the Hikey board is usually the CPU off state.
From then on, CPU7 stays in this shallow state for a long time, until
some other interrupt arrives on it.

  C2: cluster off; C1: CPU off

  Idle state:  C2    C2    C2    C2    C2    C2    C2    C1  ->
  Interrupt:  ^     ^     ^     ^     ^     ^     ^     ^     ^
              IPI   Timer Timer Timer Timer Timer Timer Timer Timer
                    4ms   1ms   1ms   1ms   1ms   1ms   1ms   1ms

To fix this issue, the idle loop needs to support re-enabling the sched
tick. This patch checks for the condition 'stop_tick' being false while
the tick is stopped: this indicates the cpuidle governor asks to
re-enable the tick, and tick_nohz_idle_restart_tick() can be used for
that purpose.

A synthetic case is used to verify this patch: we use CPU0 to send an
IPI to wake up CPU7 at a 50ms interval, and CPU7 generates a series of
hrtimer events (the first interval is 4ms, then the sequential 10 timer
events use a 1ms interval, as described above).
We collected statistics for the idle state durations; the unit is
seconds (s). The testing result shows the C2 state (deepest state)
residency can be improved significantly for CPU7 (+7.942s out of 10s
execution time on CPU7) and system-wide (+13.360s out of ~80s of total
CPU execution time).

       Without patches          With patches             Difference
  CPU  C0     C1     C2        C0     C1     C2        C0      C1      C2
  0    0.000  0.027  9.941     0.055  0.038  9.700     +0.055  +0.010  -0.240
  1    0.045  0.000  9.964     0.019  0.000  9.943     -0.026  +0.000  -0.020
  2    0.002  0.003  10.007    0.035  0.053  9.916     +0.033  +0.049  -0.090
  3    0.000  0.023  9.994     0.024  0.246  9.732     +0.024  +0.222  -0.261
  4    0.032  0.000  9.985     0.015  0.007  9.993     -0.016  +0.007  +0.008
  5    0.001  0.000  9.226     0.039  0.000  9.971     +0.038  +0.000  +0.744
  6    0.000  0.000  0.000     0.036  0.000  5.278     +0.036  +0.000  +5.278
  7    1.894  8.013  0.059     1.509  0.026  8.002     -0.384  -7.987  +7.942
  All  1.976  8.068  59.179    1.737  0.372  72.539    -0.239  -7.695  +13.360

Cc: Daniel Lezcano
Cc: Vincent Guittot
Signed-off-by: Leo Yan
---
 kernel/sched/idle.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 1a3e9bd..802286e 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -190,10 +190,18 @@ static void cpuidle_idle_call(void)
 	 */
 	next_state = cpuidle_select(drv, dev, &stop_tick);
 
-	if (stop_tick)
+	if (stop_tick) {
 		tick_nohz_idle_stop_tick();
-	else
+	} else {
+		/*
+		 * The cpuidle framework says not to stop the tick, but
+		 * the tick has already been stopped, so restart it.
+		 */
+		if (tick_nohz_tick_stopped())
+			tick_nohz_idle_restart_tick();
+
 		tick_nohz_idle_retain_tick();
+	}
 
 	rcu_idle_enter();
-- 
2.7.4
Crypto Fixes for 4.18
Hi Linus:

This push fixes a performance regression in arm64 NEON crypto as well
as a crash in x86 aegis/morus on unsupported CPUs.

Please pull from

git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6.git linus

Ard Biesheuvel (1):
      crypto: arm64 - revert NEON yield for fast AEAD implementations

Ondrej Mosnacek (1):
      crypto: x86/aegis,morus - Fix and simplify CPUID checks

 arch/arm64/crypto/aes-ce-ccm-core.S    | 150
 arch/arm64/crypto/ghash-ce-core.S      |  76 ++--
 arch/x86/crypto/aegis128-aesni-glue.c  |  12 +--
 arch/x86/crypto/aegis128l-aesni-glue.c |  12 +--
 arch/x86/crypto/aegis256-aesni-glue.c  |  12 +--
 arch/x86/crypto/morus1280-avx2-glue.c  |  10 +--
 arch/x86/crypto/morus1280-sse2-glue.c  |  10 +--
 arch/x86/crypto/morus640-sse2-glue.c   |  10 +--
 8 files changed, 101 insertions(+), 191 deletions(-)

Thanks,
-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH RFC 2/2] KVM: s390: introduce and use KVM_REQ_VSIE_RESTART
On 07.08.2018 14:51, David Hildenbrand wrote:
> When we change the crycb (or execution controls), we also have to make
> sure that the vSIE shadow datastructures properly consider the changed
> values before rerunning the vSIE. We can achieve that by simply using a
> VCPU request now.
>
> This has to be a synchronous request (== handled before entering the
> (v)SIE again).
>
> The request will make sure that the vSIE handler is left, and that the
> request will be processed (NOP), therefore forcing a reload of all
> vSIE data (including rebuilding the crycb) when re-entering the vSIE
> interception handler the next time.
>
> Signed-off-by: David Hildenbrand

Reviewed-by: Janosch Frank
Re: [PATCH RFC 1/2] KVM: s390: vsie: simulate VCPU SIE entry/exit
On 07.08.2018 14:51, David Hildenbrand wrote:
> VCPU requests and VCPU blocking right now don't take care of the vSIE
> (as it was not necessary until now). But we want to have VCPU requests
> that will also be handled before running the vSIE again.
>
> So let's simulate a SIE entry when entering the vSIE loop and check
> for PROG_ flags. The existing infrastructure (e.g. exit_sie()) will then
> detect that the SIE (in form of the vSIE execution loop) is running and
> properly kick the vSIE CPU, resulting in it leaving the vSIE loop and
> therefore the vSIE interception handler, allowing it to handle VCPU
> requests.
>
> E.g. if we want to modify the crycb of the VCPU and make sure that any
> masks also get applied to the VSIE crycb shadow (which uses masks from the
> VCPU crycb), we will need a way to hinder the vSIE from running and make
> sure to process the updated crycb before reentering the vSIE again.
>
> Signed-off-by: David Hildenbrand

Finally found some time:
Reviewed-by: Janosch Frank

As the first user will be AP, I guess the patches will be queued with
them. Thanks for helping out :)
[PATCH] kbuild: remove deprecated host-progs variable
host-progs has been supported as an alias of hostprogs-y at least
since the beginning of the Git era, with the clear prompt:

  Usage of host-progs is deprecated. Please replace with hostprogs-y!

Enough time for the migration has passed.

Signed-off-by: Masahiro Yamada
---
 arch/xtensa/boot/Makefile | 3 +--
 scripts/Makefile.build    | 7 -------
 scripts/Makefile.clean    | 1 -
 3 files changed, 1 insertion(+), 10 deletions(-)

diff --git a/arch/xtensa/boot/Makefile b/arch/xtensa/boot/Makefile
index 53e4178..dc9e0ba 100644
--- a/arch/xtensa/boot/Makefile
+++ b/arch/xtensa/boot/Makefile
@@ -30,8 +30,7 @@ Image: boot-elf
 zImage: boot-redboot
 uImage: $(obj)/uImage
 
-boot-elf boot-redboot: $(addprefix $(obj)/,$(subdir-y)) \
-		       $(addprefix $(obj)/,$(host-progs))
+boot-elf boot-redboot: $(addprefix $(obj)/,$(subdir-y))
 	$(Q)$(MAKE) $(build)=$(obj)/$@ $(MAKECMDGOALS)
 
 OBJCOPYFLAGS = --strip-all -R .comment -R .note.gnu.build-id -O binary
diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index ba1c83b..b134b37 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -53,13 +53,6 @@ endif
 
 include scripts/Makefile.lib
 
-ifdef host-progs
-ifneq ($(hostprogs-y),$(host-progs))
-$(warning kbuild: $(obj)/Makefile - Usage of host-progs is deprecated. Please replace with hostprogs-y!)
-hostprogs-y += $(host-progs)
-endif
-endif
-
 # Do not include host rules unless needed
 ifneq ($(hostprogs-y)$(hostprogs-m)$(hostlibs-y)$(hostlibs-m)$(hostcxxlibs-y)$(hostcxxlibs-m),)
 include scripts/Makefile.host
diff --git a/scripts/Makefile.clean b/scripts/Makefile.clean
index 17ef94c..0b80e320 100644
--- a/scripts/Makefile.clean
+++ b/scripts/Makefile.clean
@@ -38,7 +38,6 @@ subdir-ymn	:= $(addprefix $(obj)/,$(subdir-ymn))
 
 __clean-files	:= $(extra-y) $(extra-m) $(extra-) \
 		   $(always) $(targets) $(clean-files) \
-		   $(host-progs) \
 		   $(hostprogs-y) $(hostprogs-m) $(hostprogs-) \
 		   $(hostlibs-y) $(hostlibs-m) $(hostlibs-) \
 		   $(hostcxxlibs-y) $(hostcxxlibs-m)
-- 
2.7.4
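For any out-of-tree or arch Makefile still using the removed alias, the migration is a one-line rename; the program name below is illustrative:

```make
# Before (deprecated alias, removed by this patch):
#   host-progs := mytool
# After:
hostprogs-y := mytool
mytool-objs := mytool.o helper.o
```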
Re: [PATCH 1/1] kbuild: allow alternate src for target's implicit prerequisite
2018-08-06 23:37 GMT+09:00 Vasily Gorbik :
> With kbuild there is no easy way to re-compile source files from another
> directory, which is required for the decompressor on some platforms
> (files like lib/ctype.c, lib/cmdline.c, etc). Writing custom build
> rules for those files is not feasible since original rules are complex
> and kbuild functions and variables are not exposed.
>
> The simplest solution is to reverse include source files either into
> existing files or separate files. That eliminates the need to tackle
> with the kbuild rules, but is ugly.
>
> Here is another solution to that problem, utilizing secondary expansion.
> Build rules are in a form:
> $(obj)/%.o: $(src)/%.c ...
> $(obj)/%.o: $(src)/%.S ...
>
> "src" variable could be changed to cover the need of specifying alternate
> source file directory.
> src := $(if $(SRCDIR_$(@F)),$(SRCDIR_$(@F)),$(src))
>
> So, if there is SRCDIR_ set, it will be used, original "src" is
> used otherwise. But that wouldn't work as it is. To be able to utilize
> automatic variables in implicit prerequisite secondary expansion has to
> be used and src value has to be additionally escaped.
>
> Alternate src dir then could be specified in Makefile as:
> obj-y := file1.o file2.o
> SRCDIR_file1.o := file1/src/dir
> SRCDIR_file2.o := file2/src/dir
>
> Which is enough to build $(obj)/file1.o from file1/src/dir/file1.(c|S),
> and $(obj)/file2.o from file2/src/dir/file2.(c|S)

Is this a simple solution compared with existing ones? Kind of.
However, it is just moving the complexity around from Makefiles to the
core build scripts, after all.

I guess the increase of build time is not too big (1% or so for the
incremental build of allmodconfig), but I do not like performing the
second expansion for every directory to save a few Makefiles.

> Secondary expansion has been introduced with make 3.81, which is
> the minimal supported version by kbuild itself.
>
> Signed-off-by: Vasily Gorbik
> ---
>  scripts/Makefile.build | 26 +++---
>  1 file changed, 15 insertions(+), 11 deletions(-)
>
> diff --git a/scripts/Makefile.build b/scripts/Makefile.build
> index 514ed63ff571..97c6ece96cfb 100644
> --- a/scripts/Makefile.build
> +++ b/scripts/Makefile.build
> @@ -70,6 +70,10 @@ $(warning kbuild: Makefile.build is included improperly)
> endif
>
> # ===
> +# Allow to specify alternate source directory of target's implicit
> prerequisite
> +# e.g. 'SRCDIR_cmdline.o := lib'

This does not work for single targets.
'make foo/bar/cmdline.o' works, but 'make foo/bar/cmdline.i' does not.

> +.SECONDEXPANSION:
> +srcdir := $$(if $$(SRCDIR_$$(@F)),$$(SRCDIR_$$(@F)),$(src))
>

--
Best Regards
Masahiro Yamada
ALT+CTRL+F2 - Switch Terminal - agetty process
I am trying to understand how the agetty process is called when ALT+CTRL+F2/ALT+CTRL+F3/ keys are pressed in CentOS or in any other Linux distro.

What I know so far:
- systemd runs agetty.service with the next tty[N] number
- agetty then runs the login process, and the other processes continue from there

What I want to know:
When the ALT+CTRL+F2 keys are pressed, what happens in the kernel? Looking at keyboard.c, it seems k_cons is called, which calls set_console:

static void k_cons(struct vc_data *vc, unsigned char value, char up_flag)
{
	if (up_flag)
		return;
	set_console(value);
}

I am not sure how set_console triggers agetty with tty2/tty3/tty4 as one of its parameters. Any information will be helpful.

Regards
Mohit
Re: [PATCH] kbuild: add Map option to save vmlinux linker map file(s)
2018-08-06 22:38 GMT+09:00 Vasily Gorbik : > Add "Map" kbuild option, so that "make Map=1" would save vmlinux linker > map into vmlinux.map, which is quite useful during making kernel changes > related to how the kernel is composed. > > KBUILD_SAVE_LINK_MAP flag is exported and architectures > supporting compressed kernel images might respect it and produce > arch/*/boot/compressed/vmlinux.map for the decompressor code as well. > > Signed-off-by: Vasily Gorbik > --- If the map file is quite useful, should it be generated all the time? Or, how about CONFIG option if you want this conditionally generated? I do not want to increase fine-grained command-line switches except _really_ useful ones. > .gitignore | 1 + > Makefile | 14 ++ > 2 files changed, 15 insertions(+) > > diff --git a/.gitignore b/.gitignore > index 97ba6b79834c..1d2308e597ad 100644 > --- a/.gitignore > +++ b/.gitignore > @@ -44,6 +44,7 @@ > *.xz > Module.symvers > modules.builtin > +vmlinux.map > > # > # Top-level generic files > diff --git a/Makefile b/Makefile > index 7a3c4548162b..f1790deae03b 100644 > --- a/Makefile > +++ b/Makefile > @@ -176,6 +176,14 @@ ifndef KBUILD_CHECKSRC >KBUILD_CHECKSRC = 0 > endif > > +# Use 'make Map=1' to enable saving linker map file(s): > +# vmlinux.map for vmlinux, > +# (arch/*/boot/compressed/vmlinux.map for arch/*/boot/compressed/vmlinux) > + > +ifeq ("$(origin Map)", "command line") > + export KBUILD_SAVE_LINK_MAP := $(Map) > +endif > + > # Use make M=dir to specify directory of external module to build > # Old syntax make ... 
SUBDIRS=$PWD is still supported > # Setting the environment variable KBUILD_EXTMOD take precedence > @@ -838,6 +846,11 @@ ifeq ($(CONFIG_STRIP_ASM_SYMS),y) > LDFLAGS_vmlinux+= $(call ld-option, -X,) > endif > > +ifdef KBUILD_SAVE_LINK_MAP > +LDFLAGS_vmlinux+= -Map=vmlinux.map > +CLEAN_FILES+= vmlinux.map > +endif > + > # insure the checker run with the right endianness > CHECKFLAGS += $(if $(CONFIG_CPU_BIG_ENDIAN),-mbig-endian,-mlittle-endian) > > @@ -1434,6 +1447,7 @@ help: > @echo '2: warnings which occur quite often but may > still be relevant' > @echo '3: more obscure warnings, can most likely be > ignored' > @echo 'Multiple levels can be combined with W=12 or > W=123' > + @echo ' make Map=1 [targets] Save vmlinux linker map file(s)' > @echo '' > @echo 'Execute "make" or "make all" to build all targets marked with > [*] ' > @echo 'For further info see the ./README file' > -- > 2.18.0.13.gd42ae10 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kbuild" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Best Regards Masahiro Yamada
[PATCH v3] perf report: Create auxiliary trace data files for s390
Create auxiliary trace data log files when invoked with option --itrace=d as in [root@s35lp76 perf] ./perf report -i perf.data.aux1 --stdio --itrace=d perf report creates several data files in the current directory named aux.smp.## where ## is a 2 digit hex number with leading zeros representing the CPU number this trace data was recorded from. The file contents is binary and contains the CPU-Measurement Sampling Data Blocks (SDBs). The directory to save the auxiliary trace buffer can be changed using the perf config file and command. Specify section 'auxtrace' keyword 'dumpdir' and assign it a valid directory name. If the directory does not exist or has the wrong file type, the current directory is used. [root@p23lp27 perf]# ./perf config auxtrace.dumpdir=/tmp [root@p23lp27 perf]# ./perf config --user -l auxtrace.dumpdir=/tmp [root@p23lp27 perf]# ./perf report ... [root@p23lp27 perf]# ll /tmp/aux.smp.00 -rw-r--r-- 1 root root 204800 Aug 2 13:48 /tmp/aux.smp.00 [root@p23lp27 perf]# Signed-off-by: Thomas Richter Reviewed-by: Hendrik Brueckner --- tools/perf/util/s390-cpumsf.c | 94 +-- 1 file changed, 91 insertions(+), 3 deletions(-) diff --git a/tools/perf/util/s390-cpumsf.c b/tools/perf/util/s390-cpumsf.c index d2c78ffd9fee..a2eeebbfb25f 100644 --- a/tools/perf/util/s390-cpumsf.c +++ b/tools/perf/util/s390-cpumsf.c @@ -147,6 +147,9 @@ #include #include +#include +#include + #include "cpumap.h" #include "color.h" #include "evsel.h" @@ -159,6 +162,7 @@ #include "auxtrace.h" #include "s390-cpumsf.h" #include "s390-cpumsf-kernel.h" +#include "config.h" struct s390_cpumsf { struct auxtrace auxtrace; @@ -170,6 +174,8 @@ struct s390_cpumsf { u32 pmu_type; u16 machine_type; booldata_queued; + booluse_logfile; + char*logdir; }; struct s390_cpumsf_queue { @@ -177,6 +183,7 @@ struct s390_cpumsf_queue { unsigned intqueue_nr; struct auxtrace_buffer *buffer; int cpu; + FILE*logfile; }; /* Display s390 CPU measurement facility basic-sampling data entry */ @@ -595,6 +602,12 @@ 
static int s390_cpumsf_run_decoder(struct s390_cpumsf_queue *sfq, buffer->use_size = buffer->size; buffer->use_data = buffer->data; } + if (sfq->logfile) { /* Write into log file */ + size_t rc = fwrite(buffer->data, buffer->size, 1, + sfq->logfile); + if (rc != 1) + pr_err("Failed to write auxiliary data\n"); + } } else buffer = sfq->buffer; @@ -606,6 +619,13 @@ static int s390_cpumsf_run_decoder(struct s390_cpumsf_queue *sfq, return -ENOMEM; buffer->use_size = buffer->size; buffer->use_data = buffer->data; + + if (sfq->logfile) { /* Write into log file */ + size_t rc = fwrite(buffer->data, buffer->size, 1, + sfq->logfile); + if (rc != 1) + pr_err("Failed to write auxiliary data\n"); + } } pr_debug4("%s queue_nr:%d buffer:%" PRId64 " offset:%#" PRIx64 " size:%#zx rest:%#zx\n", __func__, sfq->queue_nr, buffer->buffer_nr, buffer->offset, @@ -640,6 +660,23 @@ s390_cpumsf_alloc_queue(struct s390_cpumsf *sf, unsigned int queue_nr) sfq->sf = sf; sfq->queue_nr = queue_nr; sfq->cpu = -1; + if (sf->use_logfile) { + char *name; + int rc; + + rc = (sf->logdir) + ? asprintf(, "%s/aux.smp.%02x", +sf->logdir, queue_nr) + : asprintf(, "aux.smp.%02x", queue_nr); + if (rc > 0) + sfq->logfile = fopen(name, "w"); + if (sfq->logfile == NULL) { + pr_err("Failed to open auxiliary log file %s," + "continue...\n", name); + sf->use_logfile = false; + } + free(name); + } return sfq; } @@ -850,8 +887,16 @@ static void s390_cpumsf_free_queues(struct perf_session *session) struct auxtrace_queues *queues = >queues; unsigned int i; - for (i = 0; i < queues->nr_queues; i++) + for (i = 0; i < queues->nr_queues; i++) { + struct s390_cpumsf_queue *sfq = (struct s390_cpumsf_queue *) + queues->queue_array[i].priv; + + if (sfq != NULL && sfq->logfile) { +
Re: [PATCH] extcon: maxim: Add SPDX license identifiers
Hi Krzysztof,

On 2018-08-08 01:21, Krzysztof Kozlowski wrote:
> Replace GPL v2.0+ license statements with SPDX license identifiers.
>
> Signed-off-by: Krzysztof Kozlowski
> ---
>  drivers/extcon/extcon-max14577.c | 24 +++-
>  drivers/extcon/extcon-max77693.c | 22 ++
>  drivers/extcon/extcon-max77843.c | 19 +++
>  drivers/extcon/extcon-max8997.c  | 22 ++
>  4 files changed, 26 insertions(+), 61 deletions(-)
>

Thanks. Applied it for v4.20.

(snip)

--
Best Regards,
Chanwoo Choi
Samsung Electronics
Re: [PATCH 2/3] perf report: Add raw report support for s390 auxiliary trace
On 08/08/2018 06:42 PM, Arnaldo Carvalho de Melo wrote:
> Em Wed, Aug 08, 2018 at 01:14:51PM -0300, Arnaldo Carvalho de Melo escreveu:
>> Em Wed, Aug 08, 2018 at 01:08:43PM -0300, Arnaldo Carvalho de Melo escreveu:
>>> No need for __packed.
>>
>>> I'm removing that to avoid having this failing in compilers that have
>>> such a warning, since we default to Werror.
>>
>>> Holler if there is something I'm missing :-)
>>
>> In file included from util/cpumap.h:10,
>> from util/s390-cpumsf.c:150:
>> util/s390-cpumsf.c: In function 's390_cpumsf_diag_show':
>> util/s390-cpumsf.c:208:10: error: format '%lx' expects argument of type 'long unsigned int', but argument 4 has type 'size_t' {aka 'unsigned int'} [-Werror=format=]
>>    pr_err("Invalid AUX trace diagnostic entry [%#08lx]\n", pos);
>>    ^~~
>> /git/linux/tools/perf/util/debug.h:20:21: note: in definition of macro 'pr_fmt'
>>  #define pr_fmt(fmt) fmt
>>  ^~~
>> util/s390-cpumsf.c:208:3: note: in expansion of macro 'pr_err'
>>    pr_err("Invalid AUX trace diagnostic entry [%#08lx]\n", pos);
>>    ^~
>> util/s390-cpumsf.c: In function 's390_cpumsf_trailer_show':
>> util/s390-cpumsf.c:233:10: error: format '%lx' expects argument of type 'long unsigned int', but argument 4 has type 'size_t' {aka 'unsigned int'} [-Werror=format=]
>>    pr_err("Invalid AUX trace trailer entry [%#08lx]\n", pos);
>
> So for those I applied this, seems to pass the ones that were failing,
> restarting tests...
> > diff --git a/tools/perf/util/s390-cpumsf.c b/tools/perf/util/s390-cpumsf.c > index 2bcb160a08f0..d2c78ffd9fee 100644 > --- a/tools/perf/util/s390-cpumsf.c > +++ b/tools/perf/util/s390-cpumsf.c > @@ -187,7 +187,7 @@ static bool s390_cpumsf_basic_show(const char *color, > size_t pos, > pr_err("Invalid AUX trace basic entry [%#08zx]\n", pos); > return false; > } > - color_fprintf(stdout, color, "[%#08x] Basic Def:%04x Inst:%#04x" > + color_fprintf(stdout, color, "[%#08zx] Basic Def:%04x Inst:%#04x" > " %c%c%c%c AS:%d ASN:%#04x IA:%#018llx\n" > "\t\tCL:%d HPP:%#018llx GPP:%#018llx\n", > pos, basic->def, basic->U, > @@ -205,10 +205,10 @@ static bool s390_cpumsf_diag_show(const char *color, > size_t pos, > struct hws_diag_entry *diag) > { > if (diag->def < S390_CPUMSF_DIAG_DEF_FIRST) { > - pr_err("Invalid AUX trace diagnostic entry [%#08lx]\n", pos); > + pr_err("Invalid AUX trace diagnostic entry [%#08zx]\n", pos); > return false; > } > - color_fprintf(stdout, color, "[%#08x] DiagDef:%04x %c\n", > + color_fprintf(stdout, color, "[%#08zx] DiagDef:%04x %c\n", > pos, diag->def, diag->I ? 
'I' : ' '); > return true; > } > @@ -230,10 +230,10 @@ static bool s390_cpumsf_trailer_show(const char *color, > size_t pos, >struct hws_trailer_entry *te) > { > if (te->bsdes != sizeof(struct hws_basic_entry)) { > - pr_err("Invalid AUX trace trailer entry [%#08lx]\n", pos); > + pr_err("Invalid AUX trace trailer entry [%#08zx]\n", pos); > return false; > } > - color_fprintf(stdout, color, "[%#08x] Trailer %c%c%c bsdes:%d" > + color_fprintf(stdout, color, "[%#08zx] Trailer %c%c%c bsdes:%d" > " dsdes:%d Overflow:%lld Time:%#llx\n" > "\t\tC:%d TOD:%#lx 1:%#llx 2:%#llx\n", > pos, > @@ -418,7 +418,7 @@ static bool s390_cpumsf_make_event(size_t pos, > event.sample.header.misc = sample.cpumode; > event.sample.header.size = sizeof(struct perf_event_header); > > - pr_debug4("%s pos:%#zx ip:%#lx P:%d CL:%d pid:%d.%d cpumode:%d > cpu:%d\n", > + pr_debug4("%s pos:%#zx ip:%#" PRIx64 " P:%d CL:%d pid:%d.%d cpumode:%d > cpu:%d\n", >__func__, pos, sample.ip, basic->P, basic->CL, sample.pid, >sample.tid, sample.cpumode, sample.cpu); > if (perf_session__deliver_synth_event(sfq->sf->session, , > @@ -498,7 +498,7 @@ static int s390_cpumsf_samples(struct s390_cpumsf_queue > *sfq, u64 *ts) >*/ > aux_ts = get_trailer_time(buf); > if (!aux_ts) { > - pr_err("[%#08lx] Invalid AUX trailer entry TOD clock base\n", > + pr_err("[%#08" PRIx64 "] Invalid AUX trailer entry TOD clock > base\n", > sfq->buffer->data_offset); > aux_ts = ~0ULL; > goto out; > @@ -607,7 +607,7 @@ static int s390_cpumsf_run_decoder(struct > s390_cpumsf_queue *sfq, > buffer->use_size = buffer->size; > buffer->use_data = buffer->data; > } > - pr_debug4("%s queue_nr:%d buffer:%ld offset:%#lx size:%#zx rest:%#zx\n", > + pr_debug4("%s
[PATCH v8 0/6] Uprobes: Support SDT markers having reference count (semaphore)
v8 changes:
 - Call delayed_uprobe_remove(uprobe, NULL) from put_uprobe() as well.
 - Other minor optimizations.

v7: https://lkml.org/lkml/2018/7/31/20

Description:
Userspace Statically Defined Tracepoints[1] are dtrace-style markers inside
userspace applications. Applications like PostgreSQL, MySQL, Pthread, Perl,
Python, Java, Ruby, Node.js, libvirt, QEMU, glib etc. have these markers
embedded in them. These markers are added by developers at important places
in the code. Each marker source expands to a single nop instruction in the
compiled code, but there may be additional overhead for computing the marker
arguments, which expands to a couple of instructions. If that overhead is
significant, it can be avoided with a runtime if() condition when no one is
tracing the marker:

    if (reference_counter > 0) {
        Execute marker instructions;
    }

The default value of the reference counter is 0. A tracer has to increment
the reference counter before tracing a marker and decrement it when done.

Currently, the perf tool has limited support for SDT markers, i.e. it cannot
trace markers surrounded by a reference counter. Also, it's not easy to add
reference counter logic in a userspace tool like perf, so the basic idea of
this patchset is to add the reference counter logic to the uprobe
infrastructure. Ex,[2]

  # cat tick.c
  ...
  for (i = 0; i < 100; i++) {
        DTRACE_PROBE1(tick, loop1, i);
        if (TICK_LOOP2_ENABLED()) {
                DTRACE_PROBE1(tick, loop2, i);
        }
        printf("hi: %d\n", i);
        sleep(1);
  }
  ...

Here tick:loop1 is a marker without a reference counter, whereas tick:loop2
is surrounded by a reference counter condition.

  # perf buildid-cache --add /tmp/tick
  # perf probe sdt_tick:loop1
  # perf probe sdt_tick:loop2
  # perf stat -e sdt_tick:loop1,sdt_tick:loop2 -- /tmp/tick
    hi: 0
    hi: 1
    hi: 2
  ^C
  Performance counter stats for '/tmp/tick':
             3      sdt_tick:loop1
             0      sdt_tick:loop2
       2.747086086 seconds time elapsed

Perf failed to record data for tick:loop2.
Same experiment with this patch series:

  # ./perf buildid-cache --add /tmp/tick
  # ./perf probe sdt_tick:loop2
  # ./perf stat -e sdt_tick:loop2 /tmp/tick
    hi: 0
    hi: 1
    hi: 2
  ^C
  Performance counter stats for '/tmp/tick':
             3      sdt_tick:loop2
       2.561851452 seconds time elapsed

[1] https://sourceware.org/systemtap/wiki/UserSpaceProbeImplementation

Ravi Bangoria (6):
  Uprobes: Simplify uprobe_register() body
  Uprobe: Additional argument arch_uprobe to uprobe_write_opcode()
  Uprobes: Support SDT markers having reference count (semaphore)
  Uprobes/sdt: Prevent multiple reference counter for same uprobe
  trace_uprobe/sdt: Prevent multiple reference counter for same uprobe
  perf probe: Support SDT markers having reference counter (semaphore)

 arch/arm/probes/uprobes/core.c |   2 +-
 arch/mips/kernel/uprobes.c     |   2 +-
 include/linux/uprobes.h        |   7 +-
 kernel/events/uprobes.c        | 329 +++--
 kernel/trace/trace.c           |   2 +-
 kernel/trace/trace_uprobe.c    |  75 +-
 tools/perf/util/probe-event.c  |  39 -
 tools/perf/util/probe-event.h  |   1 +
 tools/perf/util/probe-file.c   |  34 -
 tools/perf/util/probe-file.h   |   1 +
 tools/perf/util/symbol-elf.c   |  46 --
 tools/perf/util/symbol.h       |   7 +
 12 files changed, 472 insertions(+), 73 deletions(-)

-- 
2.14.4
[PATCH v8 1/6] Uprobes: Simplify uprobe_register() body
Simplify uprobe_register() function body and let __uprobe_register() handle everything. Also move dependency functions around to fix build failures. Signed-off-by: Ravi Bangoria --- kernel/events/uprobes.c | 69 ++--- 1 file changed, 36 insertions(+), 33 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index ccc579a7d32e..471eac896635 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -840,13 +840,8 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new) return err; } -static int __uprobe_register(struct uprobe *uprobe, struct uprobe_consumer *uc) -{ - consumer_add(uprobe, uc); - return register_for_each_vma(uprobe, uc); -} - -static void __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc) +static void +__uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc) { int err; @@ -860,24 +855,46 @@ static void __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *u } /* - * uprobe_register - register a probe + * uprobe_unregister - unregister a already registered probe. + * @inode: the file in which the probe has to be removed. + * @offset: offset from the start of the file. + * @uc: identify which probe if multiple probes are colocated. + */ +void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) +{ + struct uprobe *uprobe; + + uprobe = find_uprobe(inode, offset); + if (WARN_ON(!uprobe)) + return; + + down_write(>register_rwsem); + __uprobe_unregister(uprobe, uc); + up_write(>register_rwsem); + put_uprobe(uprobe); +} +EXPORT_SYMBOL_GPL(uprobe_unregister); + +/* + * __uprobe_register - register a probe * @inode: the file in which the probe has to be placed. * @offset: offset from the start of the file. * @uc: information on howto handle the probe.. 
* - * Apart from the access refcount, uprobe_register() takes a creation + * Apart from the access refcount, __uprobe_register() takes a creation * refcount (thro alloc_uprobe) if and only if this @uprobe is getting * inserted into the rbtree (i.e first consumer for a @inode:@offset * tuple). Creation refcount stops uprobe_unregister from freeing the * @uprobe even before the register operation is complete. Creation * refcount is released when the last @uc for the @uprobe - * unregisters. Caller of uprobe_register() is required to keep @inode + * unregisters. Caller of __uprobe_register() is required to keep @inode * (and the containing mount) referenced. * * Return errno if it cannot successully install probes * else return 0 (success) */ -int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) +static int __uprobe_register(struct inode *inode, loff_t offset, +struct uprobe_consumer *uc) { struct uprobe *uprobe; int ret; @@ -904,7 +921,8 @@ int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer * down_write(>register_rwsem); ret = -EAGAIN; if (likely(uprobe_is_active(uprobe))) { - ret = __uprobe_register(uprobe, uc); + consumer_add(uprobe, uc); + ret = register_for_each_vma(uprobe, uc); if (ret) __uprobe_unregister(uprobe, uc); } @@ -915,6 +933,12 @@ int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer * goto retry; return ret; } + +int uprobe_register(struct inode *inode, loff_t offset, + struct uprobe_consumer *uc) +{ + return __uprobe_register(inode, offset, uc); +} EXPORT_SYMBOL_GPL(uprobe_register); /* @@ -946,27 +970,6 @@ int uprobe_apply(struct inode *inode, loff_t offset, return ret; } -/* - * uprobe_unregister - unregister a already registered probe. - * @inode: the file in which the probe has to be removed. - * @offset: offset from the start of the file. - * @uc: identify which probe if multiple probes are colocated. 
- */ -void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) -{ - struct uprobe *uprobe; - - uprobe = find_uprobe(inode, offset); - if (WARN_ON(!uprobe)) - return; - - down_write(>register_rwsem); - __uprobe_unregister(uprobe, uc); - up_write(>register_rwsem); - put_uprobe(uprobe); -} -EXPORT_SYMBOL_GPL(uprobe_unregister); - static int unapply_uprobe(struct uprobe *uprobe, struct mm_struct *mm) { struct vm_area_struct *vma; -- 2.14.4
[PATCH v8 5/6] trace_uprobe/sdt: Prevent multiple reference counter for same uprobe
We assume to have only one reference counter for one uprobe. Don't allow user to add multiple trace_uprobe entries having same inode+offset but different reference counter. Ex, # echo "p:sdt_tick/loop2 /home/ravi/tick:0x6e4(0x10036)" > uprobe_events # echo "p:sdt_tick/loop2_1 /home/ravi/tick:0x6e4(0xf)" >> uprobe_events bash: echo: write error: Invalid argument # dmesg trace_kprobe: Reference counter offset mismatch. There is one exception though: When user is trying to replace the old entry with the new one, we allow this if the new entry does not conflict with any other existing entries. Signed-off-by: Ravi Bangoria --- kernel/trace/trace_uprobe.c | 37 +++-- 1 file changed, 35 insertions(+), 2 deletions(-) diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index bf2be098eb08..be64d943d7ea 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -324,6 +324,35 @@ static int unregister_trace_uprobe(struct trace_uprobe *tu) return 0; } +/* + * Uprobe with multiple reference counter is not allowed. i.e. + * If inode and offset matches, reference counter offset *must* + * match as well. Though, there is one exception: If user is + * replacing old trace_uprobe with new one(same group/event), + * then we allow same uprobe with new reference counter as far + * as the new one does not conflict with any other existing + * ones. + */ +static struct trace_uprobe *find_old_trace_uprobe(struct trace_uprobe *new) +{ + struct trace_uprobe *tmp, *old = NULL; + struct inode *new_inode = d_real_inode(new->path.dentry); + + old = find_probe_event(trace_event_name(>tp.call), + new->tp.call.class->system); + + list_for_each_entry(tmp, _list, list) { + if ((old ? 
old != tmp : true) && + new_inode == d_real_inode(tmp->path.dentry) && + new->offset == tmp->offset && + new->ref_ctr_offset != tmp->ref_ctr_offset) { + pr_warn("Reference counter offset mismatch."); + return ERR_PTR(-EINVAL); + } + } + return old; +} + /* Register a trace_uprobe and probe_event */ static int register_trace_uprobe(struct trace_uprobe *tu) { @@ -333,8 +362,12 @@ static int register_trace_uprobe(struct trace_uprobe *tu) mutex_lock(_lock); /* register as an event */ - old_tu = find_probe_event(trace_event_name(>tp.call), - tu->tp.call.class->system); + old_tu = find_old_trace_uprobe(tu); + if (IS_ERR(old_tu)) { + ret = PTR_ERR(old_tu); + goto end; + } + if (old_tu) { /* delete old event */ ret = unregister_trace_uprobe(old_tu); -- 2.14.4
[PATCH v8 2/6] Uprobe: Additional argument arch_uprobe to uprobe_write_opcode()
Add addition argument 'arch_uprobe' to uprobe_write_opcode(). We need this in later set of patches. Signed-off-by: Ravi Bangoria --- arch/arm/probes/uprobes/core.c | 2 +- arch/mips/kernel/uprobes.c | 2 +- include/linux/uprobes.h| 2 +- kernel/events/uprobes.c| 9 + 4 files changed, 8 insertions(+), 7 deletions(-) diff --git a/arch/arm/probes/uprobes/core.c b/arch/arm/probes/uprobes/core.c index d1329f1ba4e4..bf992264060e 100644 --- a/arch/arm/probes/uprobes/core.c +++ b/arch/arm/probes/uprobes/core.c @@ -32,7 +32,7 @@ bool is_swbp_insn(uprobe_opcode_t *insn) int set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr) { - return uprobe_write_opcode(mm, vaddr, + return uprobe_write_opcode(auprobe, mm, vaddr, __opcode_to_mem_arm(auprobe->bpinsn)); } diff --git a/arch/mips/kernel/uprobes.c b/arch/mips/kernel/uprobes.c index f7a0645ccb82..4aaff3b3175c 100644 --- a/arch/mips/kernel/uprobes.c +++ b/arch/mips/kernel/uprobes.c @@ -224,7 +224,7 @@ unsigned long arch_uretprobe_hijack_return_addr( int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr) { - return uprobe_write_opcode(mm, vaddr, UPROBE_SWBP_INSN); + return uprobe_write_opcode(auprobe, mm, vaddr, UPROBE_SWBP_INSN); } void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr, diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index 0a294e950df8..bb9d2084af03 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -121,7 +121,7 @@ extern bool is_swbp_insn(uprobe_opcode_t *insn); extern bool is_trap_insn(uprobe_opcode_t *insn); extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs); extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs); -extern int uprobe_write_opcode(struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t); +extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t); extern int uprobe_register(struct inode *inode, loff_t 
offset, struct uprobe_consumer *uc); extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool); extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 471eac896635..c0418ba52ba8 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -299,8 +299,8 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t * Called with mm->mmap_sem held for write. * Return 0 (success) or a negative errno. */ -int uprobe_write_opcode(struct mm_struct *mm, unsigned long vaddr, - uprobe_opcode_t opcode) +int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, + unsigned long vaddr, uprobe_opcode_t opcode) { struct page *old_page, *new_page; struct vm_area_struct *vma; @@ -351,7 +351,7 @@ int uprobe_write_opcode(struct mm_struct *mm, unsigned long vaddr, */ int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr) { - return uprobe_write_opcode(mm, vaddr, UPROBE_SWBP_INSN); + return uprobe_write_opcode(auprobe, mm, vaddr, UPROBE_SWBP_INSN); } /** @@ -366,7 +366,8 @@ int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned int __weak set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr) { - return uprobe_write_opcode(mm, vaddr, *(uprobe_opcode_t *)>insn); + return uprobe_write_opcode(auprobe, mm, vaddr, + *(uprobe_opcode_t *)>insn); } static struct uprobe *get_uprobe(struct uprobe *uprobe) -- 2.14.4
[PATCH v8 6/6] perf probe: Support SDT markers having reference counter (semaphore)
With this, perf buildid-cache will save SDT markers with reference counter in probe cache. Perf probe will be able to probe markers having reference counter. Ex, # readelf -n /tmp/tick | grep -A1 loop2 Name: loop2 ... Semaphore: 0x10020036 # ./perf buildid-cache --add /tmp/tick # ./perf probe sdt_tick:loop2 # ./perf stat -e sdt_tick:loop2 /tmp/tick hi: 0 hi: 1 hi: 2 ^C Performance counter stats for '/tmp/tick': 3 sdt_tick:loop2 2.561851452 seconds time elapsed Signed-off-by: Ravi Bangoria Acked-by: Masami Hiramatsu Acked-by: Srikar Dronamraju --- tools/perf/util/probe-event.c | 39 tools/perf/util/probe-event.h | 1 + tools/perf/util/probe-file.c | 34 ++-- tools/perf/util/probe-file.h | 1 + tools/perf/util/symbol-elf.c | 46 --- tools/perf/util/symbol.h | 7 +++ 6 files changed, 106 insertions(+), 22 deletions(-) diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c index f119eb628dbb..e86f8be89157 100644 --- a/tools/perf/util/probe-event.c +++ b/tools/perf/util/probe-event.c @@ -1819,6 +1819,12 @@ int parse_probe_trace_command(const char *cmd, struct probe_trace_event *tev) tp->offset = strtoul(fmt2_str, NULL, 10); } + if (tev->uprobes) { + fmt2_str = strchr(p, '('); + if (fmt2_str) + tp->ref_ctr_offset = strtoul(fmt2_str + 1, NULL, 0); + } + tev->nargs = argc - 2; tev->args = zalloc(sizeof(struct probe_trace_arg) * tev->nargs); if (tev->args == NULL) { @@ -2012,6 +2018,22 @@ static int synthesize_probe_trace_arg(struct probe_trace_arg *arg, return err; } +static int +synthesize_uprobe_trace_def(struct probe_trace_event *tev, struct strbuf *buf) +{ + struct probe_trace_point *tp = >point; + int err; + + err = strbuf_addf(buf, "%s:0x%lx", tp->module, tp->address); + + if (err >= 0 && tp->ref_ctr_offset) { + if (!uprobe_ref_ctr_is_supported()) + return -1; + err = strbuf_addf(buf, "(0x%lx)", tp->ref_ctr_offset); + } + return err >= 0 ? 
0 : -1; +} + char *synthesize_probe_trace_command(struct probe_trace_event *tev) { struct probe_trace_point *tp = >point; @@ -2041,15 +2063,17 @@ char *synthesize_probe_trace_command(struct probe_trace_event *tev) } /* Use the tp->address for uprobes */ - if (tev->uprobes) - err = strbuf_addf(, "%s:0x%lx", tp->module, tp->address); - else if (!strncmp(tp->symbol, "0x", 2)) + if (tev->uprobes) { + err = synthesize_uprobe_trace_def(tev, ); + } else if (!strncmp(tp->symbol, "0x", 2)) { /* Absolute address. See try_to_find_absolute_address() */ err = strbuf_addf(, "%s%s0x%lx", tp->module ?: "", tp->module ? ":" : "", tp->address); - else + } else { err = strbuf_addf(, "%s%s%s+%lu", tp->module ?: "", tp->module ? ":" : "", tp->symbol, tp->offset); + } + if (err) goto error; @@ -2633,6 +2657,13 @@ static void warn_uprobe_event_compat(struct probe_trace_event *tev) { int i; char *buf = synthesize_probe_trace_command(tev); + struct probe_trace_point *tp = >point; + + if (tp->ref_ctr_offset && !uprobe_ref_ctr_is_supported()) { + pr_warning("A semaphore is associated with %s:%s and " + "seems your kernel doesn't support it.\n", + tev->group, tev->event); + } /* Old uprobe event doesn't support memory dereference */ if (!tev->uprobes || tev->nargs == 0 || !buf) diff --git a/tools/perf/util/probe-event.h b/tools/perf/util/probe-event.h index 45b14f020558..15a98c3a2a2f 100644 --- a/tools/perf/util/probe-event.h +++ b/tools/perf/util/probe-event.h @@ -27,6 +27,7 @@ struct probe_trace_point { char*symbol;/* Base symbol */ char*module;/* Module name */ unsigned long offset; /* Offset from symbol */ + unsigned long ref_ctr_offset; /* SDT reference counter offset */ unsigned long address;/* Actual address of the trace point */ boolretprobe; /* Return probe flag */ }; diff --git a/tools/perf/util/probe-file.c b/tools/perf/util/probe-file.c index b76088fadf3d..aac7817d9e14 100644 --- a/tools/perf/util/probe-file.c +++ b/tools/perf/util/probe-file.c @@ -696,8 +696,16 @@ int 
probe_cache__add_entry(struct probe_cache *pcache, #ifdef HAVE_GELF_GETNOTE_SUPPORT static unsigned long long sdt_note__get_addr(struct sdt_note *note) { - return note->bit32 ? (unsigned long
[PATCH v8 1/6] Uprobes: Simplify uprobe_register() body
Simplify uprobe_register() function body and let __uprobe_register() handle everything. Also move dependency functions around to fix build failures. Signed-off-by: Ravi Bangoria --- kernel/events/uprobes.c | 69 ++--- 1 file changed, 36 insertions(+), 33 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index ccc579a7d32e..471eac896635 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -840,13 +840,8 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new) return err; } -static int __uprobe_register(struct uprobe *uprobe, struct uprobe_consumer *uc) -{ - consumer_add(uprobe, uc); - return register_for_each_vma(uprobe, uc); -} - -static void __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc) +static void +__uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc) { int err; @@ -860,24 +855,46 @@ static void __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *u } /* - * uprobe_register - register a probe + * uprobe_unregister - unregister a already registered probe. + * @inode: the file in which the probe has to be removed. + * @offset: offset from the start of the file. + * @uc: identify which probe if multiple probes are colocated. + */ +void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) +{ + struct uprobe *uprobe; + + uprobe = find_uprobe(inode, offset); + if (WARN_ON(!uprobe)) + return; + + down_write(>register_rwsem); + __uprobe_unregister(uprobe, uc); + up_write(>register_rwsem); + put_uprobe(uprobe); +} +EXPORT_SYMBOL_GPL(uprobe_unregister); + +/* + * __uprobe_register - register a probe * @inode: the file in which the probe has to be placed. * @offset: offset from the start of the file. * @uc: information on howto handle the probe.. 
* - * Apart from the access refcount, uprobe_register() takes a creation + * Apart from the access refcount, __uprobe_register() takes a creation * refcount (thro alloc_uprobe) if and only if this @uprobe is getting * inserted into the rbtree (i.e first consumer for a @inode:@offset * tuple). Creation refcount stops uprobe_unregister from freeing the * @uprobe even before the register operation is complete. Creation * refcount is released when the last @uc for the @uprobe - * unregisters. Caller of uprobe_register() is required to keep @inode + * unregisters. Caller of __uprobe_register() is required to keep @inode * (and the containing mount) referenced. * * Return errno if it cannot successully install probes * else return 0 (success) */ -int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) +static int __uprobe_register(struct inode *inode, loff_t offset, +struct uprobe_consumer *uc) { struct uprobe *uprobe; int ret; @@ -904,7 +921,8 @@ int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer * down_write(>register_rwsem); ret = -EAGAIN; if (likely(uprobe_is_active(uprobe))) { - ret = __uprobe_register(uprobe, uc); + consumer_add(uprobe, uc); + ret = register_for_each_vma(uprobe, uc); if (ret) __uprobe_unregister(uprobe, uc); } @@ -915,6 +933,12 @@ int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer * goto retry; return ret; } + +int uprobe_register(struct inode *inode, loff_t offset, + struct uprobe_consumer *uc) +{ + return __uprobe_register(inode, offset, uc); +} EXPORT_SYMBOL_GPL(uprobe_register); /* @@ -946,27 +970,6 @@ int uprobe_apply(struct inode *inode, loff_t offset, return ret; } -/* - * uprobe_unregister - unregister a already registered probe. - * @inode: the file in which the probe has to be removed. - * @offset: offset from the start of the file. - * @uc: identify which probe if multiple probes are colocated. 
- */ -void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) -{ - struct uprobe *uprobe; - - uprobe = find_uprobe(inode, offset); - if (WARN_ON(!uprobe)) - return; - - down_write(>register_rwsem); - __uprobe_unregister(uprobe, uc); - up_write(>register_rwsem); - put_uprobe(uprobe); -} -EXPORT_SYMBOL_GPL(uprobe_unregister); - static int unapply_uprobe(struct uprobe *uprobe, struct mm_struct *mm) { struct vm_area_struct *vma; -- 2.14.4
[PATCH v8 5/6] trace_uprobe/sdt: Prevent multiple reference counter for same uprobe
We assume to have only one reference counter for one uprobe. Don't allow user to add multiple trace_uprobe entries having same inode+offset but different reference counter. Ex, # echo "p:sdt_tick/loop2 /home/ravi/tick:0x6e4(0x10036)" > uprobe_events # echo "p:sdt_tick/loop2_1 /home/ravi/tick:0x6e4(0xf)" >> uprobe_events bash: echo: write error: Invalid argument # dmesg trace_kprobe: Reference counter offset mismatch. There is one exception though: When user is trying to replace the old entry with the new one, we allow this if the new entry does not conflict with any other existing entries. Signed-off-by: Ravi Bangoria --- kernel/trace/trace_uprobe.c | 37 +++-- 1 file changed, 35 insertions(+), 2 deletions(-) diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index bf2be098eb08..be64d943d7ea 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -324,6 +324,35 @@ static int unregister_trace_uprobe(struct trace_uprobe *tu) return 0; } +/* + * Uprobe with multiple reference counter is not allowed. i.e. + * If inode and offset matches, reference counter offset *must* + * match as well. Though, there is one exception: If user is + * replacing old trace_uprobe with new one(same group/event), + * then we allow same uprobe with new reference counter as far + * as the new one does not conflict with any other existing + * ones. + */ +static struct trace_uprobe *find_old_trace_uprobe(struct trace_uprobe *new) +{ + struct trace_uprobe *tmp, *old = NULL; + struct inode *new_inode = d_real_inode(new->path.dentry); + + old = find_probe_event(trace_event_name(>tp.call), + new->tp.call.class->system); + + list_for_each_entry(tmp, _list, list) { + if ((old ? 
old != tmp : true) && + new_inode == d_real_inode(tmp->path.dentry) && + new->offset == tmp->offset && + new->ref_ctr_offset != tmp->ref_ctr_offset) { + pr_warn("Reference counter offset mismatch."); + return ERR_PTR(-EINVAL); + } + } + return old; +} + /* Register a trace_uprobe and probe_event */ static int register_trace_uprobe(struct trace_uprobe *tu) { @@ -333,8 +362,12 @@ static int register_trace_uprobe(struct trace_uprobe *tu) mutex_lock(_lock); /* register as an event */ - old_tu = find_probe_event(trace_event_name(>tp.call), - tu->tp.call.class->system); + old_tu = find_old_trace_uprobe(tu); + if (IS_ERR(old_tu)) { + ret = PTR_ERR(old_tu); + goto end; + } + if (old_tu) { /* delete old event */ ret = unregister_trace_uprobe(old_tu); -- 2.14.4
[PATCH v8 2/6] Uprobe: Additional argument arch_uprobe to uprobe_write_opcode()
Add addition argument 'arch_uprobe' to uprobe_write_opcode(). We need this in later set of patches. Signed-off-by: Ravi Bangoria --- arch/arm/probes/uprobes/core.c | 2 +- arch/mips/kernel/uprobes.c | 2 +- include/linux/uprobes.h| 2 +- kernel/events/uprobes.c| 9 + 4 files changed, 8 insertions(+), 7 deletions(-) diff --git a/arch/arm/probes/uprobes/core.c b/arch/arm/probes/uprobes/core.c index d1329f1ba4e4..bf992264060e 100644 --- a/arch/arm/probes/uprobes/core.c +++ b/arch/arm/probes/uprobes/core.c @@ -32,7 +32,7 @@ bool is_swbp_insn(uprobe_opcode_t *insn) int set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr) { - return uprobe_write_opcode(mm, vaddr, + return uprobe_write_opcode(auprobe, mm, vaddr, __opcode_to_mem_arm(auprobe->bpinsn)); } diff --git a/arch/mips/kernel/uprobes.c b/arch/mips/kernel/uprobes.c index f7a0645ccb82..4aaff3b3175c 100644 --- a/arch/mips/kernel/uprobes.c +++ b/arch/mips/kernel/uprobes.c @@ -224,7 +224,7 @@ unsigned long arch_uretprobe_hijack_return_addr( int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr) { - return uprobe_write_opcode(mm, vaddr, UPROBE_SWBP_INSN); + return uprobe_write_opcode(auprobe, mm, vaddr, UPROBE_SWBP_INSN); } void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr, diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index 0a294e950df8..bb9d2084af03 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -121,7 +121,7 @@ extern bool is_swbp_insn(uprobe_opcode_t *insn); extern bool is_trap_insn(uprobe_opcode_t *insn); extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs); extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs); -extern int uprobe_write_opcode(struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t); +extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t); extern int uprobe_register(struct inode *inode, loff_t 
offset, struct uprobe_consumer *uc); extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool); extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 471eac896635..c0418ba52ba8 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -299,8 +299,8 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t * Called with mm->mmap_sem held for write. * Return 0 (success) or a negative errno. */ -int uprobe_write_opcode(struct mm_struct *mm, unsigned long vaddr, - uprobe_opcode_t opcode) +int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, + unsigned long vaddr, uprobe_opcode_t opcode) { struct page *old_page, *new_page; struct vm_area_struct *vma; @@ -351,7 +351,7 @@ int uprobe_write_opcode(struct mm_struct *mm, unsigned long vaddr, */ int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr) { - return uprobe_write_opcode(mm, vaddr, UPROBE_SWBP_INSN); + return uprobe_write_opcode(auprobe, mm, vaddr, UPROBE_SWBP_INSN); } /** @@ -366,7 +366,8 @@ int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned int __weak set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr) { - return uprobe_write_opcode(mm, vaddr, *(uprobe_opcode_t *)>insn); + return uprobe_write_opcode(auprobe, mm, vaddr, + *(uprobe_opcode_t *)>insn); } static struct uprobe *get_uprobe(struct uprobe *uprobe) -- 2.14.4
[PATCH v8 6/6] perf probe: Support SDT markers having reference counter (semaphore)
With this, perf buildid-cache will save SDT markers with reference counter in probe cache. Perf probe will be able to probe markers having reference counter. Ex, # readelf -n /tmp/tick | grep -A1 loop2 Name: loop2 ... Semaphore: 0x10020036 # ./perf buildid-cache --add /tmp/tick # ./perf probe sdt_tick:loop2 # ./perf stat -e sdt_tick:loop2 /tmp/tick hi: 0 hi: 1 hi: 2 ^C Performance counter stats for '/tmp/tick': 3 sdt_tick:loop2 2.561851452 seconds time elapsed Signed-off-by: Ravi Bangoria Acked-by: Masami Hiramatsu Acked-by: Srikar Dronamraju --- tools/perf/util/probe-event.c | 39 tools/perf/util/probe-event.h | 1 + tools/perf/util/probe-file.c | 34 ++-- tools/perf/util/probe-file.h | 1 + tools/perf/util/symbol-elf.c | 46 --- tools/perf/util/symbol.h | 7 +++ 6 files changed, 106 insertions(+), 22 deletions(-) diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c index f119eb628dbb..e86f8be89157 100644 --- a/tools/perf/util/probe-event.c +++ b/tools/perf/util/probe-event.c @@ -1819,6 +1819,12 @@ int parse_probe_trace_command(const char *cmd, struct probe_trace_event *tev) tp->offset = strtoul(fmt2_str, NULL, 10); } + if (tev->uprobes) { + fmt2_str = strchr(p, '('); + if (fmt2_str) + tp->ref_ctr_offset = strtoul(fmt2_str + 1, NULL, 0); + } + tev->nargs = argc - 2; tev->args = zalloc(sizeof(struct probe_trace_arg) * tev->nargs); if (tev->args == NULL) { @@ -2012,6 +2018,22 @@ static int synthesize_probe_trace_arg(struct probe_trace_arg *arg, return err; } +static int +synthesize_uprobe_trace_def(struct probe_trace_event *tev, struct strbuf *buf) +{ + struct probe_trace_point *tp = >point; + int err; + + err = strbuf_addf(buf, "%s:0x%lx", tp->module, tp->address); + + if (err >= 0 && tp->ref_ctr_offset) { + if (!uprobe_ref_ctr_is_supported()) + return -1; + err = strbuf_addf(buf, "(0x%lx)", tp->ref_ctr_offset); + } + return err >= 0 ? 
0 : -1; +} + char *synthesize_probe_trace_command(struct probe_trace_event *tev) { struct probe_trace_point *tp = &tev->point; @@ -2041,15 +2063,17 @@ char *synthesize_probe_trace_command(struct probe_trace_event *tev) } /* Use the tp->address for uprobes */ - if (tev->uprobes) - err = strbuf_addf(&buf, "%s:0x%lx", tp->module, tp->address); - else if (!strncmp(tp->symbol, "0x", 2)) + if (tev->uprobes) { + err = synthesize_uprobe_trace_def(tev, &buf); + } else if (!strncmp(tp->symbol, "0x", 2)) { /* Absolute address. See try_to_find_absolute_address() */ err = strbuf_addf(&buf, "%s%s0x%lx", tp->module ?: "", tp->module ? ":" : "", tp->address); - else + } else { err = strbuf_addf(&buf, "%s%s%s+%lu", tp->module ?: "", tp->module ? ":" : "", tp->symbol, tp->offset); + } + if (err) goto error; @@ -2633,6 +2657,13 @@ static void warn_uprobe_event_compat(struct probe_trace_event *tev) { int i; char *buf = synthesize_probe_trace_command(tev); + struct probe_trace_point *tp = &tev->point; + + if (tp->ref_ctr_offset && !uprobe_ref_ctr_is_supported()) { + pr_warning("A semaphore is associated with %s:%s and " + "seems your kernel doesn't support it.\n", + tev->group, tev->event); + } /* Old uprobe event doesn't support memory dereference */ if (!tev->uprobes || tev->nargs == 0 || !buf) diff --git a/tools/perf/util/probe-event.h b/tools/perf/util/probe-event.h index 45b14f020558..15a98c3a2a2f 100644 --- a/tools/perf/util/probe-event.h +++ b/tools/perf/util/probe-event.h @@ -27,6 +27,7 @@ struct probe_trace_point { char*symbol;/* Base symbol */ char*module;/* Module name */ unsigned long offset; /* Offset from symbol */ + unsigned long ref_ctr_offset; /* SDT reference counter offset */ unsigned long address;/* Actual address of the trace point */ boolretprobe; /* Return probe flag */ }; diff --git a/tools/perf/util/probe-file.c b/tools/perf/util/probe-file.c index b76088fadf3d..aac7817d9e14 100644 --- a/tools/perf/util/probe-file.c +++ b/tools/perf/util/probe-file.c @@ -696,8 +696,16 @@ int 
probe_cache__add_entry(struct probe_cache *pcache, #ifdef HAVE_GELF_GETNOTE_SUPPORT static unsigned long long sdt_note__get_addr(struct sdt_note *note) { - return note->bit32 ? (unsigned long
[PATCH v8 4/6] Uprobes/sdt: Prevent multiple reference counter for same uprobe
We assume there is only one reference counter per uprobe. Don't allow users to register multiple uprobes having the same inode+offset but different reference counters. Signed-off-by: Ravi Bangoria --- kernel/events/uprobes.c | 9 + 1 file changed, 9 insertions(+) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 61b0481ef417..492a0e005b4d 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -689,6 +689,12 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset, cur_uprobe = insert_uprobe(uprobe); /* a uprobe exists for this inode:offset combination */ if (cur_uprobe) { + if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) { + pr_warn("Reference counter offset mismatch.\n"); + put_uprobe(cur_uprobe); + kfree(uprobe); + return ERR_PTR(-EINVAL); + } kfree(uprobe); uprobe = cur_uprobe; } @@ -1103,6 +1109,9 @@ static int __uprobe_register(struct inode *inode, loff_t offset, uprobe = alloc_uprobe(inode, offset, ref_ctr_offset); if (!uprobe) return -ENOMEM; + if (IS_ERR(uprobe)) + return PTR_ERR(uprobe); + /* * We can race with uprobe_unregister()->delete_uprobe(). * Check uprobe_is_active() and retry if it is false. -- 2.14.4
[PATCH v8 3/6] Uprobes: Support SDT markers having reference count (semaphore)
Userspace Statically Defined Tracepoints[1] are dtrace-style markers inside userspace applications. Applications like PostgreSQL, MySQL, Pthread, Perl, Python, Java, Ruby, Node.js, libvirt, QEMU, glib etc. have these markers embedded in them. These markers are added by developers at important places in the code. Each marker source expands to a single nop instruction in the compiled code, but there may be additional overhead for computing the marker arguments, which expands to a couple of instructions. When that overhead matters, execution can be omitted by a runtime if() condition when no one is tracing the marker: if (reference_counter > 0) { Execute marker instructions; } The default value of the reference counter is 0. A tracer has to increment the reference counter before tracing a marker and decrement it when done with the tracing. Implement the reference counter logic in the core uprobe code. Users will be able to use it from trace_uprobe as well as from kernel modules. The new trace_uprobe definition with a reference counter is: <path>:<offset>[(ref_ctr_offset)] where ref_ctr_offset is an optional field. For kernel modules, a new variant of uprobe_register() has been introduced: uprobe_register_refctr(inode, offset, ref_ctr_offset, consumer) There is no new variant of uprobe_unregister() because we assume only one reference counter per uprobe. [1] https://sourceware.org/systemtap/wiki/UserSpaceProbeImplementation Note: the 'reference counter' is called a 'semaphore' in the original Dtrace (and Systemtap, bcc and even ELF) documentation and code. But the term 'semaphore' is misleading in this context: it is just a counter holding the number of tracers tracing a marker, and it is not used for any synchronization. So we refer to it as a 'reference counter' in kernel / perf code. 
Signed-off-by: Ravi Bangoria Reviewed-by: Masami Hiramatsu [Only trace_uprobe.c] --- include/linux/uprobes.h | 5 + kernel/events/uprobes.c | 246 ++-- kernel/trace/trace.c| 2 +- kernel/trace/trace_uprobe.c | 38 ++- 4 files changed, 280 insertions(+), 11 deletions(-) diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index bb9d2084af03..103a48a48872 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -123,6 +123,7 @@ extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs); extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs); extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t); extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); +extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc); extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool); extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); extern int uprobe_mmap(struct vm_area_struct *vma); @@ -160,6 +161,10 @@ uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) { return -ENOSYS; } +static inline int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc) +{ + return -ENOSYS; +} static inline int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool add) { diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index c0418ba52ba8..61b0481ef417 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -73,6 +73,7 @@ struct uprobe { struct uprobe_consumer *consumers; struct inode*inode; /* Also hold a ref to inode */ loff_t offset; + loff_t ref_ctr_offset; unsigned long flags; /* @@ -88,6 +89,15 @@ struct uprobe { struct arch_uprobe arch; }; +struct delayed_uprobe { + struct list_head list; + 
struct uprobe *uprobe; + struct mm_struct *mm; +}; + +static DEFINE_MUTEX(delayed_uprobe_lock); +static LIST_HEAD(delayed_uprobe_list); + /* * Execute out of line area: anonymous executable mapping installed * by the probed task to execute the copy of the original instruction @@ -282,6 +292,157 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t return 1; } +static struct delayed_uprobe * +delayed_uprobe_check(struct uprobe *uprobe, struct mm_struct *mm) +{ + struct delayed_uprobe *du; + + list_for_each_entry(du, &delayed_uprobe_list, list) + if (du->uprobe == uprobe && du->mm == mm) + return du; + return NULL; +} + +static int delayed_uprobe_add(struct uprobe *uprobe, struct mm_struct *mm) +{ + struct delayed_uprobe *du; + + if (delayed_uprobe_check(uprobe, mm)) + return 0; +
FUSE: write operations trigger balance_dirty_pages when using writeback cache
Dear Miklos, Recently I've been testing FUSE and the libfuse example passthrough_ll with writeback cache on, and found that the performance drops significantly compared to a local filesystem. As I can see from the trace, balance_dirty_pages is triggered very frequently even if there are not enough pages that should be sent to libfuse. I'm not sure if this is a known fact or whether the FUSE writeback feature requires some specific configuration. Trace log is attached. dd-19067 [001] 195295.568097: balance_dirty_pages: bdi 0:42: limit=3180390 setpoint=2782421 dirty=5 bdi_setpoint=0 bdi_dirty=32 dirty_ratelimit=32 task_ratelimit=0 dirtied=32 dirtied_pause=32 paused=0 pause=33 period=33 think=0 cgroup_ino=1 dd-19067 [001] 195295.602029: balance_dirty_pages: bdi 0:42: limit=3180390 setpoint=2782421 dirty=5 bdi_setpoint=0 bdi_dirty=33 dirty_ratelimit=32 task_ratelimit=0 dirtied=1 dirtied_pause=0 paused=0 pause=34 period=34 think=1 cgroup_ino=1 dd-19067 [001] 195295.637026: balance_dirty_pages: bdi 0:42: limit=3180390 setpoint=2782421 dirty=5 bdi_setpoint=0 bdi_dirty=34 dirty_ratelimit=32 task_ratelimit=0 dirtied=1 dirtied_pause=0 paused=0 pause=35 period=35 think=1 cgroup_ino=1 BTW, I'm using Linux kernel 4.17.12 and libfuse 3.2.5. I look forward to hearing from you. Thanks in advance. Regards, Shuoran Liu
Re: [PATCH] dmaengine: sprd: Support DMA link-list mode
Hi Vinod, (Comments from Eric Long) On 9 August 2018 at 10:32, Vinod wrote: > On 26-07-18, 16:00, Baolin Wang wrote: >> From: Eric Long >> >> The Spreadtrum DMA can support the link-list transaction mode, which means >> DMA controller can do transaction one by one automatically once we linked >> these transaction by link-list register. >> >> Signed-off-by: Eric Long >> Signed-off-by: Baolin Wang >> --- >> drivers/dma/sprd-dma.c | 82 >> ++ >> include/linux/dma/sprd-dma.h | 69 +++ >> 2 files changed, 144 insertions(+), 7 deletions(-) >> >> diff --git a/drivers/dma/sprd-dma.c b/drivers/dma/sprd-dma.c >> index 55df0d4..649bd2c 100644 >> --- a/drivers/dma/sprd-dma.c >> +++ b/drivers/dma/sprd-dma.c >> @@ -68,6 +68,7 @@ >> >> /* SPRD_DMA_CHN_CFG register definition */ >> #define SPRD_DMA_CHN_EN BIT(0) >> +#define SPRD_DMA_LINKLIST_EN BIT(4) >> #define SPRD_DMA_WAIT_BDONE_OFFSET 24 >> #define SPRD_DMA_DONOT_WAIT_BDONE1 >> >> @@ -103,7 +104,7 @@ >> #define SPRD_DMA_REQ_MODE_MASK GENMASK(1, 0) >> #define SPRD_DMA_FIX_SEL_OFFSET 21 >> #define SPRD_DMA_FIX_EN_OFFSET 20 >> -#define SPRD_DMA_LLIST_END_OFFSET19 >> +#define SPRD_DMA_LLIST_END BIT(19) >> #define SPRD_DMA_FRG_LEN_MASKGENMASK(16, 0) >> >> /* SPRD_DMA_CHN_BLK_LEN register definition */ >> @@ -164,6 +165,7 @@ struct sprd_dma_desc { >> struct sprd_dma_chn { >> struct virt_dma_chanvc; >> void __iomem*chn_base; >> + struct sprd_dma_linklistlinklist; >> struct dma_slave_config slave_cfg; >> u32 chn_num; >> u32 dev_id; >> @@ -582,7 +584,8 @@ static int sprd_dma_get_step(enum dma_slave_buswidth >> buswidth) >> } >> >> static int sprd_dma_fill_desc(struct dma_chan *chan, >> - struct sprd_dma_desc *sdesc, >> + struct sprd_dma_chn_hw *hw, >> + unsigned int sglen, int sg_index, >> dma_addr_t src, dma_addr_t dst, u32 len, >> enum dma_transfer_direction dir, >> unsigned long flags, >> @@ -590,7 +593,6 @@ static int sprd_dma_fill_desc(struct dma_chan *chan, >> { >> struct sprd_dma_dev *sdev = to_sprd_dma_dev(chan); >> struct 
sprd_dma_chn *schan = to_sprd_dma_chan(chan); >> - struct sprd_dma_chn_hw *hw = &sdesc->chn_hw; >> u32 req_mode = (flags >> SPRD_DMA_REQ_SHIFT) & SPRD_DMA_REQ_MODE_MASK; >> u32 int_mode = flags & SPRD_DMA_INT_MASK; >> int src_datawidth, dst_datawidth, src_step, dst_step; >> @@ -670,12 +672,58 @@ static int sprd_dma_fill_desc(struct dma_chan *chan, >> temp |= (src_step & SPRD_DMA_TRSF_STEP_MASK) << >> SPRD_DMA_SRC_TRSF_STEP_OFFSET; >> hw->trsf_step = temp; >> >> + /* link-list configuration */ >> + if (schan->linklist.phy_addr) { >> + if (sg_index == sglen - 1) >> + hw->frg_len |= SPRD_DMA_LLIST_END; >> + >> + hw->cfg |= SPRD_DMA_LINKLIST_EN; >> + hw->llist_ptr = schan->linklist.phy_addr + >> + ((sg_index + 1) % sglen) * sizeof(*hw) + >> + SPRD_DMA_CHN_SRC_ADDR; >> + } else { >> + hw->llist_ptr = 0; >> + } >> + >> hw->frg_step = 0; >> hw->src_blk_step = 0; >> hw->des_blk_step = 0; >> return 0; >> } >> >> +static int sprd_dma_fill_linklist_desc(struct dma_chan *chan, >> +unsigned int sglen, int sg_index, >> +dma_addr_t src, dma_addr_t dst, u32 len, >> +enum dma_transfer_direction dir, >> +unsigned long flags, >> +struct dma_slave_config *slave_cfg) >> +{ >> + struct sprd_dma_chn *schan = to_sprd_dma_chan(chan); >> + struct sprd_dma_chn_hw *hw; >> + >> + if (sglen < 2 || !schan->linklist.virt_addr) >> + return -EINVAL; > > why the limitation of 2 here? Spreadtrum DMA link-list mode needs more than 2 link-list configurations. If there is only one sg, it doesn't need to fill the link-list configuration. Moreover, we have validated the sg length in sprd_dma_prep_slave_sg(), so will remove the redundant validation here. > >> + >> + hw = (struct sprd_dma_chn_hw *)(schan->linklist.virt_addr + >> + sg_index * sizeof(*hw)); >> + >> + return sprd_dma_fill_desc(chan, hw, sglen, sg_index, src, dst, len, >> dir, >> + flags, slave_cfg); >> +} >> + >> +static int sprd_dma_fill_chn_desc(struct dma_chan *chan, >> + struct sprd_dma_desc *sdesc, >> + dma_addr_t src, dma_addr_t dst, u32 len,
Re: [PATCH 2/2] spi: sprd: Add SPI driver for Spreadtrum SC9860
Hi Trent, On 9 August 2018 at 03:08, Trent Piepho wrote: > On Wed, 2018-08-08 at 11:19 +0800, Baolin Wang wrote: >> On 8 August 2018 at 01:10, Trent Piepho wrote: >> > On Tue, 2018-08-07 at 18:43 +0800, Baolin Wang wrote: >> > > >> > > +static u32 sprd_spi_transfer_max_timeout(struct sprd_spi *ss, >> > > + struct spi_transfer *t) >> > > +{ >> > > + /* >> > > + * The time spent on transmission of the full FIFO data is the >> > > maximum >> > > + * SPI transmission time. >> > > + */ >> > > + u32 size = t->bits_per_word * SPRD_SPI_FIFO_SIZE; >> > > + u32 bit_time_us = SPRD_SPI_HZ / t->speed_hz + 1; > > There's another flaw here in that the transfer speed, t->speed_hz, > might not be exactly what is used, due to limitations of the clock > divider. It would be better to find the actual SPI clock used, then > use that in the calculations. Right, will use the real speed to calculate the transfer time. > >> > > + u32 total_time_us = size * bit_time_us; >> > > + /* >> > > + * There is an interval between data and the data in our SPI >> > > hardware, >> > > + * so the total transmission time need add the interval time. >> > > + * >> > > + * The inteval calculation formula: >> > > + * interval time (source clock cycles) = interval * 4 + 10. >> > > + */ >> > > + u32 interval_cycle = SPRD_SPI_FIFO_SIZE * ((ss->interval << 2) + >> > > 10); >> > > + u32 interval_time_us = interval_cycle * SPRD_SPI_HZ / ss->src_clk >> > > + 1; >> > >> > If interval is greater than 31, this will overflow. >> >> Usually we will not set such a big value for interval, but this is a >> risk like you said. So we will check and limit the interval value to >> make sure the formula will not overflow. >> > > Better would be to limit the inter word delay to whatever maximum value > your hardware supports, and then write code that can calculate that > without error. Yes, will define the maximum word delay values to avoid overflow. 
> >> > > +static void sprd_spi_set_speed(struct sprd_spi *ss, u32 speed_hz) >> > > +{ >> > > + /* >> > > + * From SPI datasheet, the prescale calculation formula: >> > > + * prescale = SPI source clock / (2 * SPI_freq) - 1; >> > > + */ >> > > + u32 clk_div = ss->src_clk / (speed_hz << 1) - 1; >> > >> > You should probably round up here. The convention is to use the >> > closest speed less than or equal to the requested speed, but since this is >> > a divider, rounding it down will select the next highest speed. >> >> Per the datasheet, we do not need round up/down the speed. Since our >> hardware can handle the clock calculated by the above formula if the >> requested speed is in the normal region (less than ->max_speed_hz). > > That is not what I mean. Let me explain differently. > > An integer divider like this can not produce any frequency exactly. > Consider if src_clk is 10 MHz. A clk_div value of 0 produces a 5 MHz > SPI clock. A clk_div value of 1 produces a 2.5 MHz SPI clock. > > What if the transfer requests a SPI clock of 3 MHz? > > Your math above will produce a SPI clock of 5 MHz, faster than > requested. This is not the convention in Linux SPI masters. You > should instead have chosen a clk_div value of 1 to get a SPI clock > of 2.5 MHz, the closest clock possible that is not greater than the > requested value. > > To do this, round up. > > clk_div = DIV_ROUND_UP(ss->src_clk, speed_hz * 2) - 1; Thanks for your patient explanation. After talking with Lanqing who is the author of the SPI driver, we think you are definitely correct and will fix in next version according to your suggestion. Thanks a lot. -- Baolin Wang Best Regards
Re: [PATCH 1/2] dt-bindings: spi: Add Spreadtrum SPI controller documentation
Hi Trent, On 9 August 2018 at 02:57, Trent Piepho wrote: > On Wed, 2018-08-08 at 11:54 +0100, Mark Brown wrote: >> On Wed, Aug 08, 2018 at 06:35:28PM +0800, Baolin Wang wrote: >> > On 8 August 2018 at 17:50, Mark Brown wrote: >> > > Right, I don't think we added this yet (if we did I can't see >> > > it). I'd >> > > add a new field to spi_transfer for this, then other controllers >> > > with >> > > the same support can implement it as well and drivers can start >> > > using >> > > it too. >> > OK. So I will name the new field as 'word_delay', is it OK for you? >> >> Sounds good, yes. > > Should it be in µs like the existing inter-transfer delay? I think > perhaps units of cycles of the SPI clock make more sense? Since some SPI controllers just want some interval values (neither in µs nor in cycles) set into hardware, and the hardware will convert to the correct delay time automatically, I did not force the 'word_delay' unit to be µs or cycles, and instead let the slave devices decide the unit depending on the SPI hardware requirement. -- Baolin Wang Best Regards
[PATCH v2 2/2] ASoC: trace: track dapm type in snd_soc_dapm_widget
Track snd_soc_dapm_widget:id which is used as dapm sequence index in dapm_down_seq/dapm_up_seq. It's useful for checking dapm seq after tracking it. Signed-off-by: Liu Changcheng diff --git a/include/trace/events/asoc.h b/include/trace/events/asoc.h index bf17edc..7c4e8de 100644 --- a/include/trace/events/asoc.h +++ b/include/trace/events/asoc.h @@ -92,16 +92,18 @@ DECLARE_EVENT_CLASS(snd_soc_dapm_widget, TP_STRUCT__entry( __string( name, w->name ) + __field(int,id ) __field(int,val ) ), TP_fast_assign( __assign_str(name, w->name); + __entry->id = (int)w->id; __entry->val = val; ), - TP_printk("widget=%s val=%d", __get_str(name), - __entry->val) + TP_printk("widget=%s dapm_id=%d val=%d", __get_str(name), + __entry->id, __entry->val) ); DEFINE_EVENT(snd_soc_dapm_widget, snd_soc_dapm_widget_power, -- 2.7.4
[PATCH v2 1/2] ASoC: trace: remove unnecessary typecast
There's no need to do a typecast since the defined type matches the output format requirement. Signed-off-by: Liu Changcheng diff --git a/include/trace/events/asoc.h b/include/trace/events/asoc.h index 40c300f..bf17edc 100644 --- a/include/trace/events/asoc.h +++ b/include/trace/events/asoc.h @@ -32,7 +32,7 @@ DECLARE_EVENT_CLASS(snd_soc_card, __entry->val = val; ), - TP_printk("card=%s val=%d", __get_str(name), (int)__entry->val) + TP_printk("card=%s val=%d", __get_str(name), __entry->val) ); DEFINE_EVENT(snd_soc_card, snd_soc_bias_level_start, @@ -101,7 +101,7 @@ DECLARE_EVENT_CLASS(snd_soc_dapm_widget, ), TP_printk("widget=%s val=%d", __get_str(name), - (int)__entry->val) + __entry->val) ); DEFINE_EVENT(snd_soc_dapm_widget, snd_soc_dapm_widget_power, @@ -149,8 +149,8 @@ TRACE_EVENT(snd_soc_dapm_walk_done, ), TP_printk("%s: checks %d power, %d path, %d neighbour", - __get_str(name), (int)__entry->power_checks, - (int)__entry->path_checks, (int)__entry->neighbour_checks) + __get_str(name), __entry->power_checks, + __entry->path_checks, __entry->neighbour_checks) ); TRACE_EVENT(snd_soc_dapm_path, @@ -180,8 +180,7 @@ TRACE_EVENT(snd_soc_dapm_path, ), TP_printk("%c%s %s %s %s %s", - (int) __entry->path_node && - (int) __entry->path_connect ? '*' : ' ', + __entry->path_node && __entry->path_connect ? '*' : ' ', __get_str(wname), DAPM_ARROW(__entry->path_dir), __get_str(pname), DAPM_ARROW(__entry->path_dir), __get_str(pnname)) @@ -242,8 +241,8 @@ TRACE_EVENT(snd_soc_jack_report, __entry->val = val; ), - TP_printk("jack=%s %x/%x", __get_str(name), (int)__entry->val, - (int)__entry->mask) + TP_printk("jack=%s %x/%x", __get_str(name), __entry->val, + __entry->mask) ); TRACE_EVENT(snd_soc_jack_notify, @@ -262,7 +261,7 @@ TRACE_EVENT(snd_soc_jack_notify, __entry->val = val; ), - TP_printk("jack=%s %x", __get_str(name), (int)__entry->val) + TP_printk("jack=%s %x", __get_str(name), __entry->val) ); #endif /* _TRACE_ASOC_H */ -- 2.7.4
[PATCH v2 0/2] ASoC: trace: format file and add trace field
1. Some typecast are not needed 2. trace dapm type in snd_soc_dapm_widget class event Liu Changcheng (2): ASoC: trace: remove unnecessary typecast ASoC: trace: track dapm type in snd_soc_dapm_widget v1->v2 correct code style to meet kernel requirement. include/trace/events/asoc.h | 21 +++-- 1 file changed, 11 insertions(+), 10 deletions(-) -- 2.7.4
Re: [PATCH 0/4][RFC v2] Introduce the in-kernel hibernation encryption
Hi Pavel, Joey, Oliver Please let me describe the original requirement and my understanding about hibernation encryption here, thus help us sync on the same thread: On Wed, Aug 08, 2018 at 07:50:36PM +0200, Pavel Machek wrote: > Hi! > > > > > > User space doesn't need to involve. The EFI root key is generated by > > > > > EFI boot stub and be transfer to kernel. It's stored in EFI boot > > > > > service > > > > > variable that it can only be accessed by trusted EFI binary when > > > > > secure boot is enabled. > > > > > > > > > Okay, this apply to the 'suspend' phase, right? > > > > I'm still a little confused about the 'resume' phase. > > > > Taking encryption as example(not signature), > > > > the purpose of doing hibernation encryption is to prevent other users > > > > from stealing ram content. Say, user A uses a passphrase to generate > > > > the > > > > > > No, I don't think that's purpose here. > > > > > > Purpose here is to prevent user from reading/modifying kernel memory > > > content on machine he owns. > > > > > Say, A puts his laptop into hibernation and walks away, > > and B walks by, and opens A's laptop and wakes up the system and he > > can do what he wants. Although EFI key/TPM trusted key is enabled, > > currently there's no certification during resume, which sounds > > unsafe to me. Afterall, the original requirement is to probe > > Define unsafe. > > If you want security against bad people resuming your machines, please Yes, this is one of the requirements. > take a look at existing uswsusp solutions. It defends against that. > > If you want security against bad people tampering with your machines > physically, sorry, there's no way to defend against that. No, this is not the requirement. > > But I thought you were trying to do something for secure boot, and "bad > person resumes your machine" is out of scope there. > Not exactly, secure boot is one solution to meet the requirement. 
> So please always explain security against _what kind of attack_ you > are trying to improve; intelligent communication is not possible > without that. > User requirement: A is the user, B is the attacker. User A launches a STD (suspend-to-disk) and encrypts A's ram data, then writes the encrypted data onto the disk, so that: Even if user B has access to the disk, B could not know A's content. Which implies: 1. If B unplugs the disk from A's machine, and plugs the disk onto another machine, B could not decode the content without A's 'permission'. 2. If B is using the same machine as A, even if A has walked away leaving the system suspended, B could not resume to A's context without A's 'permission'. Previously, there were three proposals for this: a. Enhance the uswsusp (Pavel) b. Using a user-provided password to generate the key for encryption (Yu) c. Using secure boot (TPM or EFI key) for encryption (Joey) Since I was proposing solution b, I'll say a little more about it. The original idea was that the user provides a password, then this password is used to generate the key, which means that if user B has provided an incorrect password, the kernel will fail to decrypt the data and is likely to fail the resume process. That is to say, no matter which physical machine B is using, only if he has provided the password would he be able to resume. In the first version, the key derivation was first done in kernel space, which satisfies both the requirement and safety. Unfortunately it was rejected, and people would like to see the key generated in user space instead. However, using a user-provided key directly is not safe, according to the discussion in the thread. I don't have a good idea on how to improve this, only some workarounds, say, asking the kernel to use a TPM key to protect the user-provided 'key', etc. Then let's talk a little more about secure boot. 
According to my understanding, the situation secure boot tries to deal with is a little different from the use case we raised above - it is an enhancement for case 1, because it refuses to resume once the machine is changed. And it does not cover case 2. But if it is a requirement from the user, that's ok. uswsusp does all the work in user space, and the other two aim to do all the work in kernel space. I'm not arguing which one is better, but I'm not sure how often users are using it, as we haven't gotten uswsusp-related bug reports on the kernel bugzilla (or internally) in recent years. Another point is, why is the compression in the kernel rather than in uswsusp? It looks like the same case as encryption here. Best, Yu
[PATCH v3] resource: Merge resources on a node when hot-adding memory
When hot-removing memory release_mem_region_adjustable() splits iomem resources if they are not the exact size of the memory being hot-deleted. Adding this memory back to the kernel adds a new resource. Eg a node has memory 0x0 - 0xf. Offlining and hot-removing 1GB from 0xf4000 results in the single resource 0x0-0xf being split into two resources: 0x0-0xf3fff and 0xf8000-0xf. When we hot-add the memory back we now have three resources: 0x0-0xf3fff, 0xf4000-0xf7fff, and 0xf8000-0xf. Now if we try to remove some memory that overlaps these resources, like 2GB from 0xf4000, release_mem_region_adjustable() fails as it expects the chunk of memory to be within the boundaries of a single resource. This patch adds a function request_resource_and_merge(). This is called instead of request_resource_conflict() when registering a resource in add_memory(). It calls request_resource_conflict() and if hot-removing is enabled (if it isn't we won't get resource fragmentation) we attempt to merge contiguous resources on the node. Signed-off-by: Rashmica Gupta --- v2->v3: Update Xen balloon, make the commit msg and a comment clearer, and changed '>' to '>=' when comparing the end of a resource and the end of a node. v1->v2: Only attempt to merge resources if hot-remove is enabled. drivers/xen/balloon.c | 3 +- include/linux/ioport.h | 2 + include/linux/memory_hotplug.h | 2 +- kernel/resource.c | 120 + mm/memory_hotplug.c| 22 5 files changed, 136 insertions(+), 13 deletions(-) diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c index 065f0b607373..9b972b37b0da 100644 --- a/drivers/xen/balloon.c +++ b/drivers/xen/balloon.c @@ -401,7 +401,8 @@ static enum bp_state reserve_additional_memory(void) * callers drop the mutex before trying again. 
*/ mutex_unlock(_mutex); - rc = add_memory_resource(nid, resource, memhp_auto_online); + rc = add_memory_resource(nid, resource->start, resource_size(resource), +memhp_auto_online); mutex_lock(_mutex); if (rc) { diff --git a/include/linux/ioport.h b/include/linux/ioport.h index da0ebaec25f0..f5b93a711e86 100644 --- a/include/linux/ioport.h +++ b/include/linux/ioport.h @@ -189,6 +189,8 @@ extern int allocate_resource(struct resource *root, struct resource *new, resource_size_t, resource_size_t), void *alignf_data); +extern struct resource *request_resource_and_merge(struct resource *parent, + struct resource *new, int nid); struct resource *lookup_resource(struct resource *root, resource_size_t start); int adjust_resource(struct resource *res, resource_size_t start, resource_size_t size); diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 4e9828cda7a2..9c00f97c8cc6 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -322,7 +322,7 @@ static inline void remove_memory(int nid, u64 start, u64 size) {} extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn, void *arg, int (*func)(struct memory_block *, void *)); extern int add_memory(int nid, u64 start, u64 size); -extern int add_memory_resource(int nid, struct resource *resource, bool online); +extern int add_memory_resource(int nid, u64 start, u64 size, bool online); extern int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap, bool want_memblock); extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn, diff --git a/kernel/resource.c b/kernel/resource.c index 30e1bc68503b..a31d3f5bccb7 100644 --- a/kernel/resource.c +++ b/kernel/resource.c @@ -1621,3 +1621,123 @@ static int __init strict_iomem(char *str) } __setup("iomem=", strict_iomem); + +#ifdef CONFIG_MEMORY_HOTPLUG +#ifdef CONFIG_MEMORY_HOTREMOVE +/* + * Attempt to merge resource and its sibling + */ +static int merge_resources(struct
resource *res) +{ + struct resource *next; + struct resource *tmp; + uint64_t size; + int ret = -EINVAL; + + next = res->sibling; + + /* +* Not sure how to handle two different children. So only attempt +* to merge two resources if neither have children, only one has a +* child or if both have the same child. +*/ + if ((res->child && next->child) && (res->child != next->child)) + return ret; + + if (res->end + 1 != next->start) + return ret; + + if (res->flags != next->flags) + return ret; + + /* Update sibling and child of resource */ + res->sibling = next->sibling; + tmp
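The core merge condition in merge_resources() — the two resources must be exactly adjacent (`res->end + 1 == next->start`) and share the same flags — can be exercised stand-alone. A simplified user-space sketch using a stripped-down stand-in for `struct resource` (the real structure in include/linux/ioport.h also carries parent/child/sibling pointers that the patch has to fix up):

```c
#include <stdbool.h>
#include <stdint.h>

/* Stripped-down stand-in for struct resource (illustration only). */
struct res {
	uint64_t start;
	uint64_t end;		/* inclusive, as in struct resource */
	unsigned long flags;
};

/*
 * Two resources may be merged only if they are exactly adjacent and
 * have identical flags, mirroring the checks in merge_resources().
 */
static bool can_merge(const struct res *a, const struct res *b)
{
	return a->end + 1 == b->start && a->flags == b->flags;
}

/* Absorb b into a; the caller unlinks b from the sibling list. */
static void merge_into(struct res *a, const struct res *b)
{
	a->end = b->end;
}
```

This is why the inclusive `end` matters: a gap of even one byte (or differing flags, e.g. busy vs. free) must leave the resources split, exactly the situation the hot-remove path creates.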
RE: [PATCH 0/4] locks: avoid thundering-herd wake-ups
On Wed, Aug 08 2018, Frank Filz wrote: > > Now here's another question... How does this new logic play with Open > File Description Locks? Should still be ok since there's a thread > waiting on each of those. At this level Posix locks and OFD locks are almost identical - just a different owner. So this enhancement should work equally for posix, flock, ofd and even leases. Thanks, NeilBrown > > Frank signature.asc Description: PGP signature
Re: [PATCH v3 3/4] dmaengine: Add Actions Semi Owl family S900 DMA driver
On 26-07-18, 10:36, Manivannan Sadhasivam wrote: > Add Actions Semi Owl family S900 DMA driver. Applied, thanks -- ~Vinod
Re: [PATCH v3 1/4] dt-bindings: dmaengine: Add binding for Actions Semi Owl SoCs
On 26-07-18, 10:36, Manivannan Sadhasivam wrote: > Add devicetree binding for Actions Semi Owl SoCs DMA controller. Applied, thanks -- ~Vinod
[PATCH v1 2/2] ASoC: trace: track dapm type in snd_soc_dapm_widget
ASoC: trace: track dapm type in snd_soc_dapm_widget Track snd_soc_dapm_widget:id which is used as dapm sequence index in dapm_down_seq/dapm_up_seq. It's useful for checking dapm seq after tracking it. Signed-off-by: Liu Changcheng diff --git a/include/trace/events/asoc.h b/include/trace/events/asoc.h index bf17edc..aa08a0b 100644 --- a/include/trace/events/asoc.h +++ b/include/trace/events/asoc.h @@ -92,16 +92,18 @@ DECLARE_EVENT_CLASS(snd_soc_dapm_widget, TP_STRUCT__entry( __string( name, w->name ) + __field(int,id ) __field(int,val ) ), TP_fast_assign( __assign_str(name, w->name); + __entry->id= (int)w->id; __entry->val = val; ), - TP_printk("widget=%s val=%d", __get_str(name), - __entry->val) + TP_printk("widget=%s dapm_id=%d val=%d", __get_str(name), + __entry->id, __entry->val) ); DEFINE_EVENT(snd_soc_dapm_widget, snd_soc_dapm_widget_power, -- 2.7.4
[PATCH v1 1/2] ASoC: trace: remove unnecessary typecast
There's no need to do a typecast since the defined type matches the output format requirement. Signed-off-by: Liu Changcheng diff --git a/include/trace/events/asoc.h b/include/trace/events/asoc.h index 40c300f..bf17edc 100644 --- a/include/trace/events/asoc.h +++ b/include/trace/events/asoc.h @@ -32,7 +32,7 @@ DECLARE_EVENT_CLASS(snd_soc_card, __entry->val = val; ), - TP_printk("card=%s val=%d", __get_str(name), (int)__entry->val) + TP_printk("card=%s val=%d", __get_str(name), __entry->val) ); DEFINE_EVENT(snd_soc_card, snd_soc_bias_level_start, @@ -101,7 +101,7 @@ DECLARE_EVENT_CLASS(snd_soc_dapm_widget, ), TP_printk("widget=%s val=%d", __get_str(name), - (int)__entry->val) + __entry->val) ); DEFINE_EVENT(snd_soc_dapm_widget, snd_soc_dapm_widget_power, @@ -149,8 +149,8 @@ TRACE_EVENT(snd_soc_dapm_walk_done, ), TP_printk("%s: checks %d power, %d path, %d neighbour", - __get_str(name), (int)__entry->power_checks, - (int)__entry->path_checks, (int)__entry->neighbour_checks) + __get_str(name), __entry->power_checks, + __entry->path_checks, __entry->neighbour_checks) ); TRACE_EVENT(snd_soc_dapm_path, @@ -180,8 +180,7 @@ TRACE_EVENT(snd_soc_dapm_path, ), TP_printk("%c%s %s %s %s %s", - (int) __entry->path_node && - (int) __entry->path_connect ? '*' : ' ', + __entry->path_node && __entry->path_connect ? '*' : ' ', __get_str(wname), DAPM_ARROW(__entry->path_dir), __get_str(pname), DAPM_ARROW(__entry->path_dir), __get_str(pnname)) @@ -242,8 +241,8 @@ TRACE_EVENT(snd_soc_jack_report, __entry->val = val; ), - TP_printk("jack=%s %x/%x", __get_str(name), (int)__entry->val, - (int)__entry->mask) + TP_printk("jack=%s %x/%x", __get_str(name), __entry->val, + __entry->mask) ); TRACE_EVENT(snd_soc_jack_notify, @@ -262,7 +261,7 @@ TRACE_EVENT(snd_soc_jack_notify, __entry->val = val; ), - TP_printk("jack=%s %x", __get_str(name), (int)__entry->val) + TP_printk("jack=%s %x", __get_str(name), __entry->val) ); #endif /* _TRACE_ASOC_H */ -- 2.7.4
[PATCH v1 0/2] ASoC: trace: format file and add trace field
1. Some typecasts are not needed. 2. Trace the dapm type in the snd_soc_dapm_widget class event. Liu Changcheng (2): ASoC: trace: remove unnecessary typecast ASoC: trace: track dapm type in snd_soc_dapm_widget include/trace/events/asoc.h | 21 +++-- 1 file changed, 11 insertions(+), 10 deletions(-) -- 2.7.4
Re: [PATCH v3 2/2] clk: meson: add sub MMC clock controller driver
Hi Jerome On 07/30/18 16:57, Jerome Brunet wrote: > On Fri, 2018-07-27 at 09:45 -0700, Stephen Boyd wrote: >> Quoting Stephen Boyd (2018-07-27 09:41:40) >>> Quoting Yixun Lan (2018-07-27 07:52:23) HI Stephen: On 07/26/2018 11:20 PM, Stephen Boyd wrote: > Quoting Yixun Lan (2018-07-12 14:12:44) >> diff --git a/drivers/clk/meson/mmc-clkc.c b/drivers/clk/meson/mmc-clkc.c >> new file mode 100644 >> index ..36c4c7cd69a6 >> --- /dev/null >> +++ b/drivers/clk/meson/mmc-clkc.c >> @@ -0,0 +1,367 @@ >> +// SPDX-License-Identifier: (GPL-2.0+ OR MIT) >> +/* >> + * Amlogic Meson MMC Sub Clock Controller Driver >> + * >> + * Copyright (c) 2017 Baylibre SAS. >> + * Author: Jerome Brunet >> + * >> + * Copyright (c) 2018 Amlogic, inc. >> + * Author: Yixun Lan >> + */ >> + >> +#include > > Is this include used? > this is needed by clk_get_rate() see drivers/clk/meson/mmc-clkc.c:204 >>> >>> Hmm ok. That's unfortunate. >> >> You should be able to read the hardware to figure out the clk frequency? >> This may be a sign that the phase clk_ops are bad and should be passing >> in the frequency of the parent clk to the op so that phase can be >> calculated. Jerome? >> > > It could be a away to do it but: > a) if we modify the API, we would need to update every clock driver using it. >There is not that many users of the phase API but still, it is annoying > b) This particular driver need the parent rate, other might need something > else > I guess. (parent phase ??, duty cycle ??) > > I think the real problem here it that you are using the consumer API. You > should > be using the provider API like clk_hw_get_rate. Look at the clk-divider.c > which > use clk_hw_round_rate() on the parent clock. I will replace it with clk_hw_get_rate() > > Clock drivers should deal with 'struct clk_hw', not 'struct clk'. I think it > was > mentioned in the past that the 'clk' within 'struct clk_hw' might be removed > someday. > > Yixun, please don't put your clock driver within the controller driver. 
Please > implement your 'phase-delay' clock in its own file and export the ops, like > every other clock in the amlogic directory. Also, please review your list of > '#define', some of them are unnecessary copy/paste from the MMC driver. > I will implement a clk-phase-delay.c. I can move the extra CC list. Yixun
Re: [PATCH 07/46] dmaengine: omap-dma: use dmaenginem_async_device_register to simplify the code
On 03-08-18, 10:47, Peter Ujfalusi wrote: > > > On 2018-08-03 10:19, Huang Shijie wrote: > > Use dmaenginem_async_device_register to simplify the code: > >remove dma_async_device_unregister > > > > Signed-off-by: Huang Shijie > > --- > > drivers/dma/ti/omap-dma.c | 5 + > > 1 file changed, 1 insertion(+), 4 deletions(-) > > > > diff --git a/drivers/dma/ti/omap-dma.c b/drivers/dma/ti/omap-dma.c > > index a4a931ddf6f6..085748c6eb67 100644 > > --- a/drivers/dma/ti/omap-dma.c > > +++ b/drivers/dma/ti/omap-dma.c > > @@ -1566,7 +1566,7 @@ static int omap_dma_probe(struct platform_device > > *pdev) > > } > > } > > > > - rc = dma_async_device_register(>ddev); > > + rc = dmaenginem_async_device_register(>ddev); > > Why it is dmaenginem_async_device_register() and not aligned other > resource managed functions (devm_* dmam_*), like > devm_dma_async_device_register() > > and in dmaenginem_async_device_register() what is the 'm' in dmaenginem ? > DMAengine Managed? yup, as you rightly said we could have done devm_dmaengine... or like few others do dmaenginem_ Yes it makes a little odd API but am trying to move away from dma_ to dmaengine_ for everything new.. -- ~Vinod
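The "m" suffix discussed above means device-managed: the unregister call is attached to the device's resource list and runs automatically at teardown, which is why the patch can drop the explicit dma_async_device_unregister(). The pattern behind it can be sketched in plain C; all names here (devres, devm_add_action, device_teardown) are illustrative stand-ins, not the kernel API:

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal sketch of the devm/"managed" pattern behind
 * dmaenginem_async_device_register(): a release callback registered
 * against a device runs automatically when the device goes away, so
 * the driver's error and remove paths need no explicit unregister. */

struct devres {
	void (*release)(void *data);
	void *data;
	struct devres *next;
};

struct device {
	struct devres *resources;
};

static int devm_add_action(struct device *dev,
			   void (*release)(void *), void *data)
{
	struct devres *r = malloc(sizeof(*r));
	if (!r)
		return -1;
	r->release = release;
	r->data = data;
	r->next = dev->resources;
	dev->resources = r;	/* LIFO: released in reverse order */
	return 0;
}

static void device_teardown(struct device *dev)
{
	struct devres *r = dev->resources;
	while (r) {
		struct devres *next = r->next;
		r->release(r->data);	/* e.g. the unregister call */
		free(r);
		r = next;
	}
	dev->resources = NULL;
}

/* Demo: returns 1 if the managed release ran at teardown. */
static int cleanup_ran;
static void mark_released(void *data) { (void)data; cleanup_ran = 1; }

static int demo_managed_release(void)
{
	struct device dev = { 0 };

	cleanup_ran = 0;
	devm_add_action(&dev, mark_released, NULL);
	device_teardown(&dev);
	return cleanup_ran;
}
```

The LIFO release order matters: resources acquired later (which may depend on earlier ones) are torn down first, mirroring a driver's hand-written error-unwind path.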
RE: [PATCH 0/4] locks: avoid thundering-herd wake-ups
> On Wed, Aug 08, 2018 at 03:54:45PM -0400, J. Bruce Fields wrote: > > On Wed, Aug 08, 2018 at 11:51:07AM +1000, NeilBrown wrote: > > > If you have a many-core machine, and have many threads all wanting > > > to briefly lock a given file (udev is known to do this), you can get > > > quite poor performance. > > > > > > When one thread releases a lock, it wakes up all other threads that > > > are waiting (classic thundering-herd) - one will get the lock and > > > the others go to sleep. > > > When you have few cores, this is not very noticeable: by the time > > > the 4th or 5th thread gets enough CPU time to try to claim the lock, > > > the earlier threads have claimed it, done what was needed, and released. > > > With 50+ cores, the contention can easily be measured. > > > > > > This patchset creates a tree of pending lock requests in which > > > siblings don't conflict and each lock request does conflict with its parent. > > > When a lock is released, only requests which don't conflict with > > > each other are woken. > > > > Are you sure you aren't depending on the (incorrect) assumption that > > "X blocks Y" is a transitive relation? > > > > OK I should be able to answer that question myself, my patience for > > code-reading is at a real low this afternoon > > In other words, is there the possibility of a tree of, say, exclusive locks with > (offset, length) like: > > (0, 2) waiting on (1, 2) waiting on (2, 2) waiting on (0, 4) > > and when waking (0, 4) you could wake up (2, 2) but not (0, 2), leaving a process > waiting without there being an actual conflict. That implies that the order the locks were received in was: (0,4) (2,2) (1,2) (0,2) But couldn't (0,2) have been made only dependent on (0,4)? Of course then (1,2) is dependent on BOTH (2,2) and (0,2). Does this tree logic handle that case?
On the other hand, there might be a fairness reason to make (0,2) wait for (1,2) even though it could have been granted concurrently with (2,2) since this dependency tree also preserves some of the order of lock requests. Frank
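Bruce's worry — that "X blocks Y" is not a transitive relation — can be checked directly on the ranges from his example. A small sketch, treating each exclusive lock as a half-open (offset, length) byte range and conflict as overlap:

```c
#include <assert.h>

/* Overlap test for the thread's (offset, length) byte ranges; two
 * exclusive locks conflict exactly when their ranges overlap. */
static int ranges_overlap(unsigned long off1, unsigned long len1,
			  unsigned long off2, unsigned long len2)
{
	return off1 < off2 + len2 && off2 < off1 + len1;
}
```

With the example ranges, (0,2) conflicts with (1,2) and (1,2) conflicts with (2,2), yet (0,2) and (2,2) are disjoint — so a wait chain built purely on pairwise conflicts can indeed leave (0,2) asleep behind a lock it never conflicted with.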
Re: [PATCH] dmaengine: sprd: Support DMA link-list mode
On 26-07-18, 16:00, Baolin Wang wrote: > From: Eric Long > > The Spreadtrum DMA can support the link-list transaction mode, which means > DMA controller can do transaction one by one automatically once we linked > these transaction by link-list register. > > Signed-off-by: Eric Long > Signed-off-by: Baolin Wang > --- > drivers/dma/sprd-dma.c | 82 > ++ > include/linux/dma/sprd-dma.h | 69 +++ > 2 files changed, 144 insertions(+), 7 deletions(-) > > diff --git a/drivers/dma/sprd-dma.c b/drivers/dma/sprd-dma.c > index 55df0d4..649bd2c 100644 > --- a/drivers/dma/sprd-dma.c > +++ b/drivers/dma/sprd-dma.c > @@ -68,6 +68,7 @@ > > /* SPRD_DMA_CHN_CFG register definition */ > #define SPRD_DMA_CHN_EN BIT(0) > +#define SPRD_DMA_LINKLIST_EN BIT(4) > #define SPRD_DMA_WAIT_BDONE_OFFSET 24 > #define SPRD_DMA_DONOT_WAIT_BDONE1 > > @@ -103,7 +104,7 @@ > #define SPRD_DMA_REQ_MODE_MASK GENMASK(1, 0) > #define SPRD_DMA_FIX_SEL_OFFSET 21 > #define SPRD_DMA_FIX_EN_OFFSET 20 > -#define SPRD_DMA_LLIST_END_OFFSET19 > +#define SPRD_DMA_LLIST_END BIT(19) > #define SPRD_DMA_FRG_LEN_MASKGENMASK(16, 0) > > /* SPRD_DMA_CHN_BLK_LEN register definition */ > @@ -164,6 +165,7 @@ struct sprd_dma_desc { > struct sprd_dma_chn { > struct virt_dma_chanvc; > void __iomem*chn_base; > + struct sprd_dma_linklistlinklist; > struct dma_slave_config slave_cfg; > u32 chn_num; > u32 dev_id; > @@ -582,7 +584,8 @@ static int sprd_dma_get_step(enum dma_slave_buswidth > buswidth) > } > > static int sprd_dma_fill_desc(struct dma_chan *chan, > - struct sprd_dma_desc *sdesc, > + struct sprd_dma_chn_hw *hw, > + unsigned int sglen, int sg_index, > dma_addr_t src, dma_addr_t dst, u32 len, > enum dma_transfer_direction dir, > unsigned long flags, > @@ -590,7 +593,6 @@ static int sprd_dma_fill_desc(struct dma_chan *chan, > { > struct sprd_dma_dev *sdev = to_sprd_dma_dev(chan); > struct sprd_dma_chn *schan = to_sprd_dma_chan(chan); > - struct sprd_dma_chn_hw *hw = >chn_hw; > u32 req_mode = (flags >> SPRD_DMA_REQ_SHIFT) & 
SPRD_DMA_REQ_MODE_MASK; > u32 int_mode = flags & SPRD_DMA_INT_MASK; > int src_datawidth, dst_datawidth, src_step, dst_step; > @@ -670,12 +672,58 @@ static int sprd_dma_fill_desc(struct dma_chan *chan, > temp |= (src_step & SPRD_DMA_TRSF_STEP_MASK) << > SPRD_DMA_SRC_TRSF_STEP_OFFSET; > hw->trsf_step = temp; > > + /* link-list configuration */ > + if (schan->linklist.phy_addr) { > + if (sg_index == sglen - 1) > + hw->frg_len |= SPRD_DMA_LLIST_END; > + > + hw->cfg |= SPRD_DMA_LINKLIST_EN; > + hw->llist_ptr = schan->linklist.phy_addr + > + ((sg_index + 1) % sglen) * sizeof(*hw) + > + SPRD_DMA_CHN_SRC_ADDR; > + } else { > + hw->llist_ptr = 0; > + } > + > hw->frg_step = 0; > hw->src_blk_step = 0; > hw->des_blk_step = 0; > return 0; > } > > +static int sprd_dma_fill_linklist_desc(struct dma_chan *chan, > +unsigned int sglen, int sg_index, > +dma_addr_t src, dma_addr_t dst, u32 len, > +enum dma_transfer_direction dir, > +unsigned long flags, > +struct dma_slave_config *slave_cfg) > +{ > + struct sprd_dma_chn *schan = to_sprd_dma_chan(chan); > + struct sprd_dma_chn_hw *hw; > + > + if (sglen < 2 || !schan->linklist.virt_addr) > + return -EINVAL; why the limitation of 2 here? > + > + hw = (struct sprd_dma_chn_hw *)(schan->linklist.virt_addr + > + sg_index * sizeof(*hw)); > + > + return sprd_dma_fill_desc(chan, hw, sglen, sg_index, src, dst, len, dir, > + flags, slave_cfg); > +} > + > +static int sprd_dma_fill_chn_desc(struct dma_chan *chan, > + struct sprd_dma_desc *sdesc, > + dma_addr_t src, dma_addr_t dst, u32 len, > + enum dma_transfer_direction dir, > + unsigned long flags, > + struct dma_slave_config *slave_cfg) > +{ > + struct sprd_dma_chn_hw *hw = >chn_hw; > + > + return sprd_dma_fill_desc(chan, hw, 0, 0, src, dst, len, dir, > + flags, slave_cfg); > +} this fn does not do much, so why not call sprd_dma_fill_desc() instead from
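The llist_ptr computation in sprd_dma_fill_desc() above links the hardware descriptors into a ring: entry i points at entry (i + 1) % sglen, so the tail wraps back to the head. The address arithmetic can be sketched in isolation; DESC_SIZE and CHN_SRC_ADDR_OFF below are illustrative placeholders for sizeof(struct sprd_dma_chn_hw) and SPRD_DMA_CHN_SRC_ADDR, whose real values live in the driver:

```c
#include <assert.h>

#define DESC_SIZE	 0x40u	/* stand-in for sizeof(struct sprd_dma_chn_hw) */
#define CHN_SRC_ADDR_OFF 0x40u	/* stand-in for SPRD_DMA_CHN_SRC_ADDR */

/* Sketch of the circular link-list pointer from sprd_dma_fill_desc():
 * descriptor sg_index points at descriptor (sg_index + 1) % sglen
 * within the physically contiguous descriptor array at phy_addr. */
static unsigned long next_desc_ptr(unsigned long phy_addr,
				   unsigned int sg_index,
				   unsigned int sglen)
{
	return phy_addr + ((sg_index + 1) % sglen) * DESC_SIZE
			+ CHN_SRC_ADDR_OFF;
}
```

The SPRD_DMA_LLIST_END flag on the last entry tells the engine to stop rather than loop, even though the pointer itself wraps.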
Re: [PATCH mmc-next v3 0/3] solve SDHCI DWC MSHC 128MB DMA boundary limitation
Hi Adrian, Ulf, could you please review these patches? Any comments are welcome. Thanks in advance, Jisheng On Mon, 30 Jul 2018 10:42:28 +0800 Jisheng Zhang wrote: > When using DMA, if the DMA addr spans a 128MB boundary, we have to split > the DMA transfer into two so that each one doesn't exceed the boundary. > > patch1 adds adma_table_num to struct sdhci_host so that driver can > control the ADMA table number. > patch2 introduces adma_write_desc() hook to struct sdhci_ops so that > driver can override it. > patch3 finally solves the 128MB boundary limitation. > > since v2: > - make use of "likely" to check (!len || BOUNDARY_OK(addr, len)) > - explicitly include for SZ_128M > > since v1: > - fix BOUNDARY_OK macro if addr+len is aligned to 128MB > - use DIV_ROUND_UP to calculate the extra desc num > - fix !len for dwcmshc_adma_write_desc() > > Jisheng Zhang (3): > mmc: sdhci: add adma_table_num member to struct sdhci_host > mmc: sdhci: introduce adma_write_desc() hook to struct sdhci_ops > mmc: sdhci-of-dwcmshc: solve 128MB DMA boundary limitation > > drivers/mmc/host/sdhci-of-dwcmshc.c | 43 ++ > drivers/mmc/host/sdhci.c| 48 +++-- > drivers/mmc/host/sdhci.h| 8 + > 3 files changed, 83 insertions(+), 16 deletions(-) >
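One plausible shape of the BOUNDARY_OK check named in the changelog (the exact kernel macro may differ): a transfer is safe when its first and last byte lie in the same 128MB window. Note the "- 1" on the last-byte address, which is the v1 fix for the case where addr + len lands exactly on a 128MB boundary:

```c
#include <assert.h>

#define SZ_128M 0x08000000ull

/* Sketch of a 128MB-window check: OR-ing with (SZ_128M - 1) maps an
 * address to the last byte of its window, so two addresses are in the
 * same window exactly when those values match.  Assumed form only;
 * the driver's real macro lives in sdhci-of-dwcmshc.c. */
static int boundary_ok(unsigned long long addr, unsigned long long len)
{
	return (addr | (SZ_128M - 1)) == ((addr + len - 1) | (SZ_128M - 1));
}
```

When the check fails, the driver would emit two ADMA descriptors split at the boundary — hence patch1's extra descriptor-table entries.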
[PATCH 3/5] fs/locks: change all *_conflict() functions to return a new enum.
In a future patch we will need to differentiate between conflicts that are "transitive" and those that aren't. A "transitive" conflict is defined as one where any lock that conflicts with the first (newly requested) lock would conflict with the existing lock. So change posix_locks_conflict(), flock_locks_conflict() (both currently returning int) and leases_conflict() (currently returning bool) to return "enum conflict". Add locks_transitive_overlap() to make it possible to compute the correct conflict for posix locks. The FL_NO_CONFLICT value is zero, so current code which only tests the truth value of these functions will still work the same way. And convert some return (foo); to return foo; Signed-off-by: NeilBrown --- fs/locks.c | 64 ++-- 1 file changed, 49 insertions(+), 15 deletions(-) diff --git a/fs/locks.c b/fs/locks.c index b4812da2a374..d06658b2dc7a 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -139,6 +139,16 @@ #define IS_OFDLCK(fl) (fl->fl_flags & FL_OFDLCK) #define IS_REMOTELCK(fl) (fl->fl_pid <= 0) +/* A transitive conflict is one where the first lock conflicts with + * the second lock, and any other lock that conflicts with the + * first lock, also conflicts with the second lock. + */ +enum conflict { + FL_NO_CONFLICT = 0, + FL_CONFLICT, + FL_TRANSITIVE_CONFLICT, +}; + static inline bool is_remote_lock(struct file *filp) { return likely(!(filp->f_path.dentry->d_sb->s_flags & SB_NOREMOTELOCK)); @@ -612,6 +622,15 @@ static inline int locks_overlap(struct file_lock *fl1, struct file_lock *fl2) (fl2->fl_end >= fl1->fl_start)); } +/* Check for transitive-overlap - true if any lock that overlaps + * the first lock must overlap the seconds + */ +static inline bool locks_transitive_overlap(struct file_lock *fl1, + struct file_lock *fl2) +{ + return (fl1->fl_start >= fl2->fl_start) && + (fl1->fl_end <= fl2->fl_end); +} /* * Check whether two locks have the same owner. 
*/ @@ -793,47 +812,61 @@ locks_delete_lock_ctx(struct file_lock *fl, struct list_head *dispose) /* Determine if lock sys_fl blocks lock caller_fl. Common functionality * checks for shared/exclusive status of overlapping locks. */ -static int locks_conflict(struct file_lock *caller_fl, struct file_lock *sys_fl) +static enum conflict locks_conflict(struct file_lock *caller_fl, + struct file_lock *sys_fl) { if (sys_fl->fl_type == F_WRLCK) - return 1; + return FL_TRANSITIVE_CONFLICT; if (caller_fl->fl_type == F_WRLCK) - return 1; - return 0; + return FL_CONFLICT; + return FL_NO_CONFLICT; } /* Determine if lock sys_fl blocks lock caller_fl. POSIX specific * checking before calling the locks_conflict(). */ -static int posix_locks_conflict(struct file_lock *caller_fl, struct file_lock *sys_fl) +static enum conflict posix_locks_conflict(struct file_lock *caller_fl, + struct file_lock *sys_fl) { /* POSIX locks owned by the same process do not conflict with * each other. */ if (posix_same_owner(caller_fl, sys_fl)) - return (0); + return FL_NO_CONFLICT; /* Check whether they overlap */ if (!locks_overlap(caller_fl, sys_fl)) - return 0; + return FL_NO_CONFLICT; - return (locks_conflict(caller_fl, sys_fl)); + switch (locks_conflict(caller_fl, sys_fl)) { + default: + case FL_NO_CONFLICT: + return FL_NO_CONFLICT; + case FL_CONFLICT: + return FL_CONFLICT; + case FL_TRANSITIVE_CONFLICT: + if (locks_transitive_overlap(caller_fl, sys_fl)) + return FL_TRANSITIVE_CONFLICT; + else + return FL_CONFLICT; + } } /* Determine if lock sys_fl blocks lock caller_fl. FLOCK specific * checking before calling the locks_conflict(). */ -static int flock_locks_conflict(struct file_lock *caller_fl, struct file_lock *sys_fl) +static enum conflict flock_locks_conflict(struct file_lock *caller_fl, + struct file_lock *sys_fl) { /* FLOCK locks referring to the same filp do not conflict with * each other. 
*/ if (caller_fl->fl_file == sys_fl->fl_file) - return (0); + return FL_NO_CONFLICT; if ((caller_fl->fl_type & LOCK_MAND) || (sys_fl->fl_type & LOCK_MAND)) - return 0; + return FL_NO_CONFLICT; - return (locks_conflict(caller_fl, sys_fl)); + return locks_conflict(caller_fl, sys_fl); } void @@ -1435,12 +1468,13 @@ static void time_out_leases(struct inode *inode, struct list_head *dispose) } }
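The two range predicates introduced by this patch are easy to exercise in userspace. This copy uses the same inclusive [fl_start, fl_end] convention as struct file_lock; the struct and parameter passing are simplified for illustration:

```c
#include <assert.h>

/* Userspace copy of the patch's two predicates, using inclusive
 * [start, end] byte ranges like struct file_lock. */
struct range { unsigned long start, end; };

static int locks_overlap(struct range a, struct range b)
{
	return a.end >= b.start && b.end >= a.start;
}

/* Transitive overlap: a lies entirely inside b, so any lock that
 * overlaps a necessarily overlaps b as well. */
static int locks_transitive_overlap(struct range a, struct range b)
{
	return a.start >= b.start && a.end <= b.end;
}
```

This containment test is what lets FL_TRANSITIVE_CONFLICT be returned safely: a write lock whose range contains the blocked request blocks everything the request would block.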
[PATCH 5/5] fs/locks: create a tree of dependent requests.
When we find an existing lock which conflicts with a request, and the request wants to wait, we currently add the request to a list. When the lock is removed, the whole list is woken. This can cause the thundering-herd problem. To reduce the problem, we make use of the (new) fact that a pending request can itself have a list of blocked requests. When we find a conflict, we look through the existing blocked requests. If any one of them blocks the new request, the new request is attached below that request. This way, when the lock is released, only a set of non-conflicting locks will be woken. The rest of the herd can stay asleep. Reported-and-tested-by: Martin Wilck Signed-off-by: NeilBrown --- fs/locks.c | 69 +++- 1 file changed, 63 insertions(+), 6 deletions(-) diff --git a/fs/locks.c b/fs/locks.c index fc64016d01ee..17843feb6f5b 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -738,6 +738,39 @@ static void locks_delete_block(struct file_lock *waiter) spin_unlock(_lock_lock); } +static void wake_non_conflicts(struct file_lock *waiter, struct file_lock *blocker, + enum conflict conflict(struct file_lock *, + struct file_lock *)) +{ + struct file_lock *parent = waiter; + struct file_lock *fl; + struct file_lock *t; + + fl = list_entry(>fl_blocked, struct file_lock, fl_block); +restart: + list_for_each_entry_safe_continue(fl, t, >fl_blocked, fl_block) { + switch (conflict(fl, blocker)) { + default: + case FL_NO_CONFLICT: + __locks_wake_one(fl); + break; + case FL_CONFLICT: + /* Need to check children */ + parent = fl; + fl = list_entry(>fl_blocked, struct file_lock, fl_block); + goto restart; + case FL_TRANSITIVE_CONFLICT: + /* all children must also conflict, no need to check */ + continue; + } + } + if (parent != waiter) { + parent = parent->fl_blocker; + fl = parent; + goto restart; + } +} + /* Insert waiter into blocker's block list. * We use a circular list so that processes can be easily woken up in * the order they blocked. 
The documentation doesn't require this but @@ -747,11 +780,32 @@ static void locks_delete_block(struct file_lock *waiter) * fl_blocked list itself is protected by the blocked_lock_lock, but by ensuring * that the flc_lock is also held on insertions we can avoid taking the * blocked_lock_lock in some cases when we see that the fl_blocked list is empty. + * + * Rather than just adding to the list, we check for conflicts with any existing + * waiter, and add to that waiter instead. + * Thus wakeups don't happen until needed. */ static void __locks_insert_block(struct file_lock *blocker, - struct file_lock *waiter) +struct file_lock *waiter, +enum conflict conflict(struct file_lock *, + struct file_lock *)) { + struct file_lock *fl; BUG_ON(!list_empty(>fl_block)); + + /* Any request in waiter->fl_blocked is know to conflict with +* waiter, but it might not conflict with blocker. +* If it doesn't, it needs to be woken now so it can find +* somewhere else to wait, or possible it can get granted. +*/ + if (conflict(waiter, blocker) != FL_TRANSITIVE_CONFLICT) + wake_non_conflicts(waiter, blocker, conflict); +new_blocker: + list_for_each_entry(fl, >fl_blocked, fl_block) + if (conflict(fl, waiter)) { + blocker = fl; + goto new_blocker; + } waiter->fl_blocker = blocker; list_add_tail(>fl_block, >fl_blocked); if (IS_POSIX(blocker) && !IS_OFDLCK(blocker)) @@ -760,10 +814,12 @@ static void __locks_insert_block(struct file_lock *blocker, /* Must be called with flc_lock held. */ static void locks_insert_block(struct file_lock *blocker, - struct file_lock *waiter) + struct file_lock *waiter, + enum conflict conflict(struct file_lock *, + struct file_lock *)) { spin_lock(_lock_lock); - __locks_insert_block(blocker, waiter); + __locks_insert_block(blocker, waiter, conflict); spin_unlock(_lock_lock); } @@ -1033,7 +1089,7 @@ static int flock_lock_inode(struct inode *inode, struct file_lock *request) if (!(request->fl_flags & FL_SLEEP)) goto
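The insertion walk in __locks_insert_block() can be modelled in a few lines of userspace C, which also answers the ordering question raised earlier in the thread: with this descent, (0,2) ends up waiting only on (0,4), not on (1,2). A toy sketch under the simplification that conflict = range overlap (the real code uses the supplied conflict callback):

```c
#include <assert.h>

struct req {
	unsigned long start, len;
	int parent;	/* index of the request this one waits on; -1 = lock holder */
};

static int overlap(const struct req *a, const struct req *b)
{
	return a->start < b->start + b->len && b->start < a->start + a->len;
}

/* Toy model of __locks_insert_block(): starting at the lock holder,
 * descend into the first already-queued child that conflicts with the
 * new request, and attach the request once no child conflicts. */
static void insert_waiter(struct req *reqs, int idx)
{
	int blocker = -1;
again:
	for (int i = 0; i < idx; i++) {
		if (reqs[i].parent == blocker && overlap(&reqs[i], &reqs[idx])) {
			blocker = i;
			goto again;
		}
	}
	reqs[idx].parent = blocker;
}
```

Releasing the held lock then wakes only the top-level waiters (parent == -1 entries); each of those, once granted or requeued, wakes its own subtree in turn.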
[PATCH 4/5] fs/locks: split out __locks_wake_one()
This functionality will be useful in the next patch, so split it out from __locks_wake_up_blocks(). Signed-off-by: NeilBrown --- fs/locks.c | 15 ++- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/fs/locks.c b/fs/locks.c index d06658b2dc7a..fc64016d01ee 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -710,6 +710,15 @@ static void __locks_delete_block(struct file_lock *waiter) waiter->fl_blocker = NULL; } +static void __locks_wake_one(struct file_lock *waiter) +{ + __locks_delete_block(waiter); + if (waiter->fl_lmops && waiter->fl_lmops->lm_notify) + waiter->fl_lmops->lm_notify(waiter); + else + wake_up(>fl_wait); +} + static void __locks_wake_up_blocks(struct file_lock *blocker) { while (!list_empty(>fl_blocked)) { @@ -717,11 +726,7 @@ static void __locks_wake_up_blocks(struct file_lock *blocker) waiter = list_first_entry(>fl_blocked, struct file_lock, fl_block); - __locks_delete_block(waiter); - if (waiter->fl_lmops && waiter->fl_lmops->lm_notify) - waiter->fl_lmops->lm_notify(waiter); - else - wake_up(>fl_wait); + __locks_wake_one(waiter); } }
[PATCH 5/5] fs/locks: create a tree of dependent requests.
When we find an existing lock which conflicts with a request, and the request wants to wait, we currently add the request to a list. When the lock is removed, the whole list is woken. This can cause the thundering-herd problem. To reduce the problem, we make use of the (new) fact that a pending request can itself have a list of blocked requests. When we find a conflict, we look through the existing blocked requests. If any one of them blocks the new request, the new request is attached below that request. This way, when the lock is released, only a set of non-conflicting locks will be woken. The rest of the herd can stay asleep.

Reported-and-tested-by: Martin Wilck
Signed-off-by: NeilBrown
---
 fs/locks.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 63 insertions(+), 6 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index fc64016d01ee..17843feb6f5b 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -738,6 +738,39 @@ static void locks_delete_block(struct file_lock *waiter)
 	spin_unlock(&blocked_lock_lock);
 }
 
+static void wake_non_conflicts(struct file_lock *waiter, struct file_lock *blocker,
+			       enum conflict conflict(struct file_lock *,
+						      struct file_lock *))
+{
+	struct file_lock *parent = waiter;
+	struct file_lock *fl;
+	struct file_lock *t;
+
+	fl = list_entry(&parent->fl_blocked, struct file_lock, fl_block);
+restart:
+	list_for_each_entry_safe_continue(fl, t, &parent->fl_blocked, fl_block) {
+		switch (conflict(fl, blocker)) {
+		default:
+		case FL_NO_CONFLICT:
+			__locks_wake_one(fl);
+			break;
+		case FL_CONFLICT:
+			/* Need to check children */
+			parent = fl;
+			fl = list_entry(&parent->fl_blocked, struct file_lock, fl_block);
+			goto restart;
+		case FL_TRANSITIVE_CONFLICT:
+			/* all children must also conflict, no need to check */
+			continue;
+		}
+	}
+	if (parent != waiter) {
+		fl = parent;
+		parent = parent->fl_blocker;
+		goto restart;
+	}
+}
+
 /* Insert waiter into blocker's block list.
  * We use a circular list so that processes can be easily woken up in
  * the order they blocked. The documentation doesn't require this but
@@ -747,11 +780,32 @@ static void locks_delete_block(struct file_lock *waiter)
  * fl_blocked list itself is protected by the blocked_lock_lock, but by ensuring
  * that the flc_lock is also held on insertions we can avoid taking the
  * blocked_lock_lock in some cases when we see that the fl_blocked list is empty.
+ *
+ * Rather than just adding to the list, we check for conflicts with any existing
+ * waiter, and add to that waiter instead.
+ * Thus wakeups don't happen until needed.
  */
 static void __locks_insert_block(struct file_lock *blocker,
-				 struct file_lock *waiter)
+				 struct file_lock *waiter,
+				 enum conflict conflict(struct file_lock *,
+							struct file_lock *))
 {
+	struct file_lock *fl;
+
 	BUG_ON(!list_empty(&waiter->fl_block));
+
+	/* Any request in waiter->fl_blocked is known to conflict with
+	 * waiter, but it might not conflict with blocker.
+	 * If it doesn't, it needs to be woken now so it can find
+	 * somewhere else to wait, or possibly it can get granted.
+	 */
+	if (conflict(waiter, blocker) != FL_TRANSITIVE_CONFLICT)
+		wake_non_conflicts(waiter, blocker, conflict);
+new_blocker:
+	list_for_each_entry(fl, &blocker->fl_blocked, fl_block)
+		if (conflict(fl, waiter)) {
+			blocker = fl;
+			goto new_blocker;
+		}
 	waiter->fl_blocker = blocker;
 	list_add_tail(&waiter->fl_block, &blocker->fl_blocked);
 	if (IS_POSIX(blocker) && !IS_OFDLCK(blocker))
@@ -760,10 +814,12 @@ static void __locks_insert_block(struct file_lock *blocker,
 
 /* Must be called with flc_lock held. */
 static void locks_insert_block(struct file_lock *blocker,
-			       struct file_lock *waiter)
+			       struct file_lock *waiter,
+			       enum conflict conflict(struct file_lock *,
+						      struct file_lock *))
 {
 	spin_lock(&blocked_lock_lock);
-	__locks_insert_block(blocker, waiter);
+	__locks_insert_block(blocker, waiter, conflict);
 	spin_unlock(&blocked_lock_lock);
 }
 
@@ -1033,7 +1089,7 @@ static int flock_lock_inode(struct inode *inode, struct file_lock *request)
 		if (!(request->fl_flags & FL_SLEEP))
 			goto
[PATCH 4/5] fs/locks: split out __locks_wake_one()
This functionality will be useful in the next patch, so split it out from __locks_wake_up_blocks().

Signed-off-by: NeilBrown
---
 fs/locks.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index d06658b2dc7a..fc64016d01ee 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -710,6 +710,15 @@ static void __locks_delete_block(struct file_lock *waiter)
 	waiter->fl_blocker = NULL;
 }
 
+static void __locks_wake_one(struct file_lock *waiter)
+{
+	__locks_delete_block(waiter);
+	if (waiter->fl_lmops && waiter->fl_lmops->lm_notify)
+		waiter->fl_lmops->lm_notify(waiter);
+	else
+		wake_up(&waiter->fl_wait);
+}
+
 static void __locks_wake_up_blocks(struct file_lock *blocker)
 {
 	while (!list_empty(&blocker->fl_blocked)) {
@@ -717,11 +726,7 @@ static void __locks_wake_up_blocks(struct file_lock *blocker)
 		waiter = list_first_entry(&blocker->fl_blocked,
 					  struct file_lock, fl_block);
-		__locks_delete_block(waiter);
-		if (waiter->fl_lmops && waiter->fl_lmops->lm_notify)
-			waiter->fl_lmops->lm_notify(waiter);
-		else
-			wake_up(&waiter->fl_wait);
+		__locks_wake_one(waiter);
 	}
 }
[PATCH 3/5] fs/locks: change all *_conflict() functions to return a new enum.
In a future patch we will need to differentiate between conflicts that are "transitive" and those that aren't. A "transitive" conflict is defined as one where any lock that conflicts with the first (newly requested) lock would conflict with the existing lock.

So change posix_locks_conflict(), flock_locks_conflict() (both currently returning int) and leases_conflict() (currently returning bool) to return "enum conflict". Add locks_transitive_overlap() to make it possible to compute the correct conflict for posix locks.

The FL_NO_CONFLICT value is zero, so current code which only tests the truth value of these functions will still work the same way. Also convert some "return (foo);" to "return foo;".

Signed-off-by: NeilBrown
---
 fs/locks.c | 64 ++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 49 insertions(+), 15 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index b4812da2a374..d06658b2dc7a 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -139,6 +139,16 @@
 #define IS_OFDLCK(fl)	(fl->fl_flags & FL_OFDLCK)
 #define IS_REMOTELCK(fl)	(fl->fl_pid <= 0)
 
+/* A transitive conflict is one where the first lock conflicts with
+ * the second lock, and any other lock that conflicts with the
+ * first lock, also conflicts with the second lock.
+ */
+enum conflict {
+	FL_NO_CONFLICT = 0,
+	FL_CONFLICT,
+	FL_TRANSITIVE_CONFLICT,
+};
+
 static inline bool is_remote_lock(struct file *filp)
 {
 	return likely(!(filp->f_path.dentry->d_sb->s_flags & SB_NOREMOTELOCK));
@@ -612,6 +622,15 @@ static inline int locks_overlap(struct file_lock *fl1, struct file_lock *fl2)
 		(fl2->fl_end >= fl1->fl_start));
 }
 
+/* Check for transitive-overlap - true if any lock that overlaps
+ * the first lock must overlap the second
+ */
+static inline bool locks_transitive_overlap(struct file_lock *fl1,
+					    struct file_lock *fl2)
+{
+	return (fl1->fl_start >= fl2->fl_start) &&
+		(fl1->fl_end <= fl2->fl_end);
+}
 
 /*
  * Check whether two locks have the same owner.
 */
@@ -793,47 +812,61 @@ locks_delete_lock_ctx(struct file_lock *fl, struct list_head *dispose)
 /* Determine if lock sys_fl blocks lock caller_fl. Common functionality
  * checks for shared/exclusive status of overlapping locks.
  */
-static int locks_conflict(struct file_lock *caller_fl, struct file_lock *sys_fl)
+static enum conflict locks_conflict(struct file_lock *caller_fl,
+				    struct file_lock *sys_fl)
 {
 	if (sys_fl->fl_type == F_WRLCK)
-		return 1;
+		return FL_TRANSITIVE_CONFLICT;
 	if (caller_fl->fl_type == F_WRLCK)
-		return 1;
-	return 0;
+		return FL_CONFLICT;
+	return FL_NO_CONFLICT;
 }
 
 /* Determine if lock sys_fl blocks lock caller_fl. POSIX specific
  * checking before calling the locks_conflict().
  */
-static int posix_locks_conflict(struct file_lock *caller_fl, struct file_lock *sys_fl)
+static enum conflict posix_locks_conflict(struct file_lock *caller_fl,
+					  struct file_lock *sys_fl)
 {
 	/* POSIX locks owned by the same process do not conflict with
 	 * each other.
 	 */
 	if (posix_same_owner(caller_fl, sys_fl))
-		return (0);
+		return FL_NO_CONFLICT;
 
 	/* Check whether they overlap */
 	if (!locks_overlap(caller_fl, sys_fl))
-		return 0;
+		return FL_NO_CONFLICT;
 
-	return (locks_conflict(caller_fl, sys_fl));
+	switch (locks_conflict(caller_fl, sys_fl)) {
+	default:
+	case FL_NO_CONFLICT:
+		return FL_NO_CONFLICT;
+	case FL_CONFLICT:
+		return FL_CONFLICT;
+	case FL_TRANSITIVE_CONFLICT:
+		if (locks_transitive_overlap(caller_fl, sys_fl))
+			return FL_TRANSITIVE_CONFLICT;
+		else
+			return FL_CONFLICT;
+	}
 }
 
 /* Determine if lock sys_fl blocks lock caller_fl. FLOCK specific
  * checking before calling the locks_conflict().
  */
-static int flock_locks_conflict(struct file_lock *caller_fl, struct file_lock *sys_fl)
+static enum conflict flock_locks_conflict(struct file_lock *caller_fl,
+					  struct file_lock *sys_fl)
 {
 	/* FLOCK locks referring to the same filp do not conflict with
 	 * each other.
 	 */
 	if (caller_fl->fl_file == sys_fl->fl_file)
-		return (0);
+		return FL_NO_CONFLICT;
 	if ((caller_fl->fl_type & LOCK_MAND) || (sys_fl->fl_type & LOCK_MAND))
-		return 0;
+		return FL_NO_CONFLICT;
 
-	return (locks_conflict(caller_fl, sys_fl));
+	return locks_conflict(caller_fl, sys_fl);
 }
 
 void
@@ -1435,12 +1468,13 @@ static void time_out_leases(struct inode *inode, struct list_head *dispose)
 		}
 	}
[PATCH 2/5] fs/locks: allow a lock request to block other requests.
Currently, a lock can block pending requests, but all pending requests are equal. If lots of pending requests are mutually exclusive, this means they will all be woken up and all but one will fail. This can hurt performance.

So we will allow pending requests to block other requests. Only the first request will be woken, and it will wake the others.

This patch doesn't implement this fully, but prepares the way.

- It acknowledges that a request might be blocking other requests, and when the request is converted to a lock, those blocked requests are moved across.
- When a request is removed, all blocked requests are woken.
- When deadlock-detection looks for the lock which blocks a given request, we follow the chain of ->fl_blocker all the way to the top.

Signed-off-by: NeilBrown
---
 fs/locks.c | 58 ++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 40 insertions(+), 18 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 322491e70e41..b4812da2a374 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -408,9 +408,19 @@ void locks_copy_lock(struct file_lock *new, struct file_lock *fl)
 		fl->fl_ops->fl_copy_lock(new, fl);
 	}
 }
-
 EXPORT_SYMBOL(locks_copy_lock);
 
+static void locks_move_blocks(struct file_lock *new, struct file_lock *fl)
+{
+	struct file_lock *f;
+
+	spin_lock(&blocked_lock_lock);
+	list_splice_init(&fl->fl_blocked, &new->fl_blocked);
+	list_for_each_entry(f, &new->fl_blocked, fl_block)
+		f->fl_blocker = new;
+	spin_unlock(&blocked_lock_lock);
+}
+
 static inline int flock_translate_cmd(int cmd) {
 	if (cmd & LOCK_MAND)
 		return cmd & (LOCK_MAND | LOCK_RW);
@@ -681,9 +691,25 @@ static void __locks_delete_block(struct file_lock *waiter)
 	waiter->fl_blocker = NULL;
 }
 
+static void __locks_wake_up_blocks(struct file_lock *blocker)
+{
+	while (!list_empty(&blocker->fl_blocked)) {
+		struct file_lock *waiter;
+
+		waiter = list_first_entry(&blocker->fl_blocked,
+					  struct file_lock, fl_block);
+		__locks_delete_block(waiter);
+		if (waiter->fl_lmops && waiter->fl_lmops->lm_notify)
+			waiter->fl_lmops->lm_notify(waiter);
+		else
+			wake_up(&waiter->fl_wait);
+	}
+}
+
 static
 void locks_delete_block(struct file_lock *waiter)
 {
 	spin_lock(&blocked_lock_lock);
+	__locks_wake_up_blocks(waiter);
 	__locks_delete_block(waiter);
 	spin_unlock(&blocked_lock_lock);
 }
@@ -735,17 +761,7 @@ static void locks_wake_up_blocks(struct file_lock *blocker)
 		return;
 
 	spin_lock(&blocked_lock_lock);
-	while (!list_empty(&blocker->fl_blocked)) {
-		struct file_lock *waiter;
-
-		waiter = list_first_entry(&blocker->fl_blocked,
-					  struct file_lock, fl_block);
-		__locks_delete_block(waiter);
-		if (waiter->fl_lmops && waiter->fl_lmops->lm_notify)
-			waiter->fl_lmops->lm_notify(waiter);
-		else
-			wake_up(&waiter->fl_wait);
-	}
+	__locks_wake_up_blocks(blocker);
 	spin_unlock(&blocked_lock_lock);
 }
 
@@ -888,8 +904,11 @@ static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
 	struct file_lock *fl;
 
 	hash_for_each_possible(blocked_hash, fl, fl_link, posix_owner_key(block_fl)) {
-		if (posix_same_owner(fl, block_fl))
-			return fl->fl_blocker;
+		if (posix_same_owner(fl, block_fl)) {
+			while (fl->fl_blocker)
+				fl = fl->fl_blocker;
+			return fl;
+		}
 	}
 	return NULL;
 }
@@ -982,6 +1001,7 @@ static int flock_lock_inode(struct inode *inode, struct file_lock *request)
 		if (request->fl_flags & FL_ACCESS)
 			goto out;
 		locks_copy_lock(new_fl, request);
+		locks_move_blocks(new_fl, request);
 		locks_insert_lock_ctx(new_fl, &ctx->flc_flock);
 		new_fl = NULL;
 		error = 0;
@@ -1174,6 +1194,7 @@ static int posix_lock_inode(struct inode *inode, struct file_lock *request,
 			goto out;
 		}
 		locks_copy_lock(new_fl, request);
+		locks_move_blocks(new_fl, request);
 		locks_insert_lock_ctx(new_fl, &fl->fl_list);
 		fl = new_fl;
 		new_fl = NULL;
@@ -2582,13 +2603,14 @@ void locks_remove_file(struct file *filp)
 int
 posix_unblock_lock(struct file_lock *waiter)
 {
-	int status = 0;
+	int status = -ENOENT;
 
 	spin_lock(&blocked_lock_lock);
-	if (waiter->fl_blocker)
+	if (waiter->fl_blocker) {
+		__locks_wake_up_blocks(waiter);
 		__locks_delete_block(waiter);
-	else
-		status = -ENOENT;
+		status = 0;
+	}
 	spin_unlock(&blocked_lock_lock);
 	return
[PATCH 1/5] fs/locks: rename some lists and pointers.
struct file_lock contains an 'fl_next' pointer which is used to point to the lock that this request is blocked waiting for. So rename it to fl_blocker.

The fl_blocked list_head in an active lock is the head of a list of blocked requests. In a request it is a node in that list. These are two distinct uses, so replace with two list_heads with different names:
fl_blocked is the head of a list of blocked requests;
fl_block is a node on that list.

This separation will be used in the next patch.

Note that a tracepoint is changed to report fl_blocker instead of fl_next. Might this be an ABI breakage?

Signed-off-by: NeilBrown
---
 fs/cifs/file.c                  |  2 +-
 fs/locks.c                      | 40 +++++++++++++++++++---------------
 include/linux/fs.h              |  5 +++--
 include/trace/events/filelock.h | 16 ++++++++--------
 4 files changed, 33 insertions(+), 30 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 8d41ca7bfcf1..066ed2e4ba96 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1092,7 +1092,7 @@ cifs_posix_lock_set(struct file *file, struct file_lock *flock)
 	rc = posix_lock_file(file, flock, NULL);
 	up_write(&cinode->lock_sem);
 	if (rc == FILE_LOCK_DEFERRED) {
-		rc = wait_event_interruptible(flock->fl_wait, !flock->fl_next);
+		rc = wait_event_interruptible(flock->fl_wait, !flock->fl_blocker);
 		if (!rc)
 			goto try_again;
 		posix_unblock_lock(flock);
diff --git a/fs/locks.c b/fs/locks.c
index db7b6917d9c5..322491e70e41 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -194,7 +194,7 @@ static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS);
  * This lock protects the blocked_hash. Generally, if you're accessing it, you
  * want to be holding this lock.
  *
- * In addition, it also protects the fl->fl_block list, and the fl->fl_next
+ * In addition, it also protects the fl->fl_blocked list, and the fl->fl_blocker
  * pointer for file_lock structures that are acting as lock requests (in
  * contrast to those that are acting as records of acquired locks).
 *
@@ -203,7 +203,7 @@ static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS);
  * protected by this lock, we can skip acquiring it iff we already hold the
  * flc_lock.
  *
- * In particular, adding an entry to the fl_block list requires that you hold
+ * In particular, adding an entry to the fl_blocked list requires that you hold
  * both the flc_lock and the blocked_lock_lock (acquired in that order).
  * Deleting an entry from the list however only requires the file_lock_lock.
  */
@@ -302,6 +302,7 @@ static void locks_init_lock_heads(struct file_lock *fl)
 {
 	INIT_HLIST_NODE(&fl->fl_link);
 	INIT_LIST_HEAD(&fl->fl_list);
+	INIT_LIST_HEAD(&fl->fl_blocked);
 	INIT_LIST_HEAD(&fl->fl_block);
 	init_waitqueue_head(&fl->fl_wait);
 }
@@ -341,6 +342,7 @@ void locks_free_lock(struct file_lock *fl)
 {
 	BUG_ON(waitqueue_active(&fl->fl_wait));
 	BUG_ON(!list_empty(&fl->fl_list));
+	BUG_ON(!list_empty(&fl->fl_blocked));
 	BUG_ON(!list_empty(&fl->fl_block));
 	BUG_ON(!hlist_unhashed(&fl->fl_link));
 
@@ -676,7 +678,7 @@ static void __locks_delete_block(struct file_lock *waiter)
 {
 	locks_delete_global_blocked(waiter);
 	list_del_init(&waiter->fl_block);
-	waiter->fl_next = NULL;
+	waiter->fl_blocker = NULL;
 }
 
 static void locks_delete_block(struct file_lock *waiter)
@@ -692,16 +694,16 @@ static void locks_delete_block(struct file_lock *waiter)
  * it seems like the reasonable thing to do.
  *
  * Must be called with both the flc_lock and blocked_lock_lock held. The
- * fl_block list itself is protected by the blocked_lock_lock, but by ensuring
+ * fl_blocked list itself is protected by the blocked_lock_lock, but by ensuring
  * that the flc_lock is also held on insertions we can avoid taking the
- * blocked_lock_lock in some cases when we see that the fl_block list is empty.
+ * blocked_lock_lock in some cases when we see that the fl_blocked list is empty.
  */
 static void __locks_insert_block(struct file_lock *blocker,
 				 struct file_lock *waiter)
 {
 	BUG_ON(!list_empty(&waiter->fl_block));
-	waiter->fl_next = blocker;
-	list_add_tail(&waiter->fl_block, &blocker->fl_block);
+	waiter->fl_blocker = blocker;
+	list_add_tail(&waiter->fl_block, &blocker->fl_blocked);
 	if (IS_POSIX(blocker) && !IS_OFDLCK(blocker))
 		locks_insert_global_blocked(waiter);
 }
@@ -725,18 +727,18 @@ static void locks_wake_up_blocks(struct file_lock *blocker)
 	/*
 	 * Avoid taking global lock if list is empty. This is safe since new
 	 * blocked requests are only added to the list under the flc_lock, and
-	 * the flc_lock is always held here. Note that removal from the fl_block
+	 * the flc_lock is always held here. Note that removal from the fl_blocked
 	 * list does not require the flc_lock, so we must recheck
[PATCH 0/5 - V2] locks: avoid thundering-herd wake-ups
This series adds "wake_non_conflicts()" to wake up any waiters that are being added beneath a lock that they don't actually conflict with, even though they conflict with a parent which conflicts with the top-level blocker. Thanks to Bruce for highlighting this issue.

This series hasn't been tested beyond compile-test.

Original description:

If you have a many-core machine, and have many threads all wanting to briefly lock a given file (udev is known to do this), you can get quite poor performance. When one thread releases a lock, it wakes up all other threads that are waiting (classic thundering-herd) - one will get the lock and the others go to sleep. When you have few cores, this is not very noticeable: by the time the 4th or 5th thread gets enough CPU time to try to claim the lock, the earlier threads have claimed it, done what was needed, and released. With 50+ cores, the contention can easily be measured.

This patchset creates a tree of pending lock requests in which siblings don't conflict and each lock request does conflict with its parent. When a lock is released, only requests which don't conflict with each other are woken.

Testing shows that lock-acquisitions-per-second is now fairly stable even as the number of contending processes goes to 1000. Without this patch, locks-per-second drops off steeply after a few tens of processes.

There is a small cost to this extra complexity. At 20 processes running a particular test on 72 cores, the lock acquisitions per second drops from 1.8 million to 1.4 million with this patch. For 100 processes, this patch still provides 1.4 million while without this patch there are about 700,000.

NeilBrown

---

NeilBrown (5):
      fs/locks: rename some lists and pointers.
      fs/locks: allow a lock request to block other requests.
      fs/locks: change all *_conflict() functions to return a new enum.
      fs/locks: split out __locks_wake_one()
      fs/locks: create a tree of dependent requests.
 fs/cifs/file.c                  |    2 
 fs/locks.c                      |  228 +++++++++++++++++++++++++++++-------
 include/linux/fs.h              |    5 +
 include/trace/events/filelock.h |   16 +--
 4 files changed, 186 insertions(+), 65 deletions(-)

--
Signature
Re: [PATCH] net/9p/trans_virtio.c: decrease the refcount of 9p virtio device when removing it
piaojun wrote on Thu, Aug 09, 2018:
> > What exact kernel commit are you running?
>
> My kernel commit id 6edf1d4cb0acde, and I replace the 9p code with
> 9p-next. And I wonder if this will work well?

That is somewhere on top of 4.18-rc1 and got merged in 4.18-rc4, which are close enough, so while I can question the practice I don't see why not.

I've just tried the following:

$ git checkout 6edf1d4cb0acde
$ git checkout martinetd/9p-next net/9p fs/9p include/net/9p
(martinetd/9p-next is 9f961802a7 as of this mail)
$ uname -r
4.18.0-rc1+
$ lsmod | grep -E '^9pnet_virtio' || echo "not loaded"
9pnet_virtio           32768  0
$ sudo modprobe -r 9pnet_virtio
$ lsmod | grep -E '^9pnet_virtio' || echo "not loaded"
not loaded
$ sudo modprobe 9pnet_virtio
$ sudo mount -t 9p -o debug=1,trans=virtio shm /mnt
$ ls /mnt
$ cat /sys/module/9pnet_virtio/drivers/virtio\:9pnet_virtio/*/mount_tag
tmpshm
(these could use a new line...)
$ sudo umount /mnt
$ sudo modprobe -r 9pnet_virtio
$ lsmod | grep -E '^9pnet_virtio' || echo "not loaded"
not loaded

The /sys/devices/pci*/*/virtio*/mount_tag files are also removed properly; I don't see any problem.

Not being able to reproduce is fine in general, but I also get problems when applying the patch and unloading the module multiple times, so I can't help but question this patch and think your problem lies somewhere else.
--
Dominique
Re: [PATCH RFC ftrace/core] tracing: Directly call tracer and lockdep probes
On Wed, 8 Aug 2018 15:20:22 -0700, Joel Fernandes wrote:
> On Wed, Aug 8, 2018 at 3:00 PM, Joel Fernandes (Google) wrote:
> > From: Steven Rostedt
> >
>
> I realize I kept your authorship since I started working based on your
> patch ;-)
>
> Well sorry about that, and you could change it to mine or keep it as is
> if you want ;-)

Actually I already started testing a second patch that basically does
your extension ;-) I'll post it shortly.

-- Steve
Re: [PATCH 0/4] locks: avoid thundering-herd wake-ups
On Wed, Aug 08 2018, J. Bruce Fields wrote:
> On Wed, Aug 08, 2018 at 12:47:22PM -0400, Jeff Layton wrote:
>> On Wed, 2018-08-08 at 11:51 +1000, NeilBrown wrote:
>> > If you have a many-core machine, and have many threads all wanting to
>> > briefly lock a given file (udev is known to do this), you can get quite
>> > poor performance.
>> >
>> > When one thread releases a lock, it wakes up all other threads that
>> > are waiting (classic thundering-herd) - one will get the lock and the
>> > others go to sleep.
>> > When you have few cores, this is not very noticeable: by the time the
>> > 4th or 5th thread gets enough CPU time to try to claim the lock, the
>> > earlier threads have claimed it, done what was needed, and released.
>> > With 50+ cores, the contention can easily be measured.
>> >
>> > This patchset creates a tree of pending lock requests in which siblings
>> > don't conflict and each lock request does conflict with its parent.
>> > When a lock is released, only requests which don't conflict with each
>> > other are woken.
>> >
>> > Testing shows that lock-acquisitions-per-second is now fairly stable even
>> > as the number of contending processes goes to 1000. Without this patch,
>> > locks-per-second drops off steeply after a few tens of processes.
>> >
>> > There is a small cost to this extra complexity.
>> > At 20 processes running a particular test on 72 cores, the lock
>> > acquisitions per second drop from 1.8 million to 1.4 million with
>> > this patch. For 100 processes, this patch still provides 1.4 million
>> > while without this patch there are about 700,000.
>> >
>> > NeilBrown
>> >
>> > ---
>> >
>> > NeilBrown (4):
>> >       fs/locks: rename some lists and pointers.
>> >       fs/locks: allow a lock request to block other requests.
>> >       fs/locks: change all *_conflict() functions to return bool.
>> >       fs/locks: create a tree of dependent requests.
>> >
>> >
>> >  fs/cifs/file.c                  |    2 -
>> >  fs/locks.c                      |  142 +--
>> >  include/linux/fs.h              |    5 +
>> >  include/trace/events/filelock.h |   16 ++-
>> >  4 files changed, 103 insertions(+), 62 deletions(-)
>> >
>>
>> Nice work! I looked over this and I think it looks good.
>>
>> I made an attempt to fix this issue several years ago, but my method
>> sucked as it ended up penalizing the unlocking task too much. This is
>> much cleaner and should scale well overall, I think.
>
> I think I also took a crack at this at one point while I was at UM/CITI
> and never got anything I was happy with. Looks like good work!
>
> I remember one main obstacle: I felt like I never had a good benchmark.
>
> How did you choose this workload and hardware? Was it in fact udev
> (booting a large machine?), or was there some other motivation?

I'm hoping Martin will chime in here - he identified the problem and
did most of the testing...

NeilBrown

> Not that I'm likely to do it any time soon, but could you share
> sufficient details for someone else to reproduce your results?
>
> --b.
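The tree of dependent requests the patchset describes can be modelled
in a few lines. This is a rough sketch of the idea only - the class,
function names, and the byte-range conflict rule are invented here, and
this is not the kernel implementation: each new waiter is queued beneath
an existing waiter it conflicts with, so siblings never conflict, and on
unlock only the top-level waiters (which can all proceed at once) are
woken.

```python
class Request:
    """A pending lock request over a byte range."""
    def __init__(self, rng):
        self.range = rng       # (start, end) byte range requested
        self.children = []     # requests that conflict with this one

def conflicts(a, b):
    # Two byte ranges conflict when they overlap.
    return a[0] <= b[1] and b[0] <= a[1]

def enqueue(roots, req):
    # Walk down the tree: if req conflicts with an existing waiter,
    # queue it beneath that waiter; otherwise it becomes a new root.
    # Roots (and siblings at any level) are mutually non-conflicting.
    for r in roots:
        if conflicts(r.range, req.range):
            enqueue(r.children, req)
            return
    roots.append(req)

def wake_on_unlock(roots):
    # Wake only the top-level waiters: since they don't conflict with
    # each other, all of them can make progress, and nobody else is
    # needlessly scheduled just to go back to sleep.
    woken = list(roots)
    roots.clear()
    return woken
```

In this model a thundering herd of N overlapping requests forms a chain,
so an unlock wakes one waiter instead of all N.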