[PATCH v2 RESEND 0/5] sched/vtime: vtime.h headers cleanup

2024-04-10 Thread Alexander Gordeev
Hi All,

There are no changes since the last post, just a re-send.

v2:
- patch 4: commit message reworded (Heiko)
- patch 5: vtime.h is removed from Kbuild scripts (PowerPC only) (Heiko)

v1:
Please find a small cleanup to vtime_task_switch() wiring.
I split it into smaller patches to allow separate PowerPC
vs s390 reviews. Otherwise patches 2+3 and 4+5 could have
been merged.

I tested it on s390 and compile-tested it on 32- and 64-bit
PowerPC and a few other major architectures only, but it is
only of concern for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE-capable
ones (AFAICT).

Thanks!


Alexander Gordeev (5):
  sched/vtime: remove confusing arch_vtime_task_switch() declaration
  sched/vtime: get rid of generic vtime_task_switch() implementation
  s390/vtime: remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover
  s390/irq,nmi: include <asm/vtime.h> header directly
  sched/vtime: do not include <asm/vtime.h> header

 arch/powerpc/include/asm/Kbuild|  1 -
 arch/powerpc/include/asm/cputime.h | 13 -
 arch/powerpc/kernel/time.c | 22 ++
 arch/s390/include/asm/vtime.h  |  2 --
 arch/s390/kernel/irq.c |  1 +
 arch/s390/kernel/nmi.c |  1 +
 include/asm-generic/vtime.h|  1 -
 include/linux/vtime.h  |  5 -
 kernel/sched/cputime.c | 13 -
 9 files changed, 24 insertions(+), 35 deletions(-)
 delete mode 100644 include/asm-generic/vtime.h

-- 
2.40.1



[PATCH v2 RESEND 3/5] s390/vtime: remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover

2024-04-10 Thread Alexander Gordeev
__ARCH_HAS_VTIME_TASK_SWITCH macro is not used anymore.

Reviewed-by: Frederic Weisbecker 
Acked-by: Heiko Carstens 
Acked-by: Nicholas Piggin 
Signed-off-by: Alexander Gordeev 
---
 arch/s390/include/asm/vtime.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/s390/include/asm/vtime.h b/arch/s390/include/asm/vtime.h
index fe17e448c0c5..561c91c1a87c 100644
--- a/arch/s390/include/asm/vtime.h
+++ b/arch/s390/include/asm/vtime.h
@@ -2,8 +2,6 @@
 #ifndef _S390_VTIME_H
 #define _S390_VTIME_H
 
-#define __ARCH_HAS_VTIME_TASK_SWITCH
-
 static inline void update_timer_sys(void)
 {
 	S390_lowcore.system_timer += S390_lowcore.last_update_timer - S390_lowcore.exit_timer;
-- 
2.40.1



[PATCH v2 RESEND 1/5] sched/vtime: remove confusing arch_vtime_task_switch() declaration

2024-04-10 Thread Alexander Gordeev
Callback arch_vtime_task_switch() is only defined when
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is selected. Yet, the
function prototype forward declaration is present for
CONFIG_VIRT_CPU_ACCOUNTING_GEN variant. Remove it.

Reviewed-by: Frederic Weisbecker 
Reviewed-by: Nicholas Piggin 
Signed-off-by: Alexander Gordeev 
---
 include/linux/vtime.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 3684487d01e1..593466ceebed 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -18,7 +18,6 @@ extern void vtime_account_idle(struct task_struct *tsk);
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-extern void arch_vtime_task_switch(struct task_struct *tsk);
 extern void vtime_user_enter(struct task_struct *tsk);
 extern void vtime_user_exit(struct task_struct *tsk);
 extern void vtime_guest_enter(struct task_struct *tsk);
-- 
2.40.1



[PATCH v2 RESEND 4/5] s390/irq,nmi: include <asm/vtime.h> header directly

2024-04-10 Thread Alexander Gordeev
update_timer_sys() and update_timer_mcck() are inlines used for
CPU time accounting from the interrupt and machine-check handlers.
These routines are specific to s390 architecture, but included
via  header implicitly. Avoid the extra loop and
include <asm/vtime.h> header directly.

Acked-by: Heiko Carstens 
Signed-off-by: Alexander Gordeev 
---
 arch/s390/kernel/irq.c | 1 +
 arch/s390/kernel/nmi.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/s390/kernel/irq.c b/arch/s390/kernel/irq.c
index 6f71b0ce1068..259496fe0ef9 100644
--- a/arch/s390/kernel/irq.c
+++ b/arch/s390/kernel/irq.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include <asm/vtime.h>
 #include "entry.h"
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct irq_stat, irq_stat);
diff --git a/arch/s390/kernel/nmi.c b/arch/s390/kernel/nmi.c
index c77382a67325..230d010bac9b 100644
--- a/arch/s390/kernel/nmi.c
+++ b/arch/s390/kernel/nmi.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include <asm/vtime.h>
 
 struct mcck_struct {
unsigned int kill_task : 1;
-- 
2.40.1



[PATCH v2 RESEND 5/5] sched/vtime: do not include <asm/vtime.h> header

2024-04-10 Thread Alexander Gordeev
There is no architecture-specific code or data left
that generic <linux/vtime.h> needs to know about.
Thus, avoid the inclusion of <asm/vtime.h> header.

Reviewed-by: Frederic Weisbecker 
Acked-by: Nicholas Piggin 
Signed-off-by: Alexander Gordeev 
---
 arch/powerpc/include/asm/Kbuild | 1 -
 include/asm-generic/vtime.h | 1 -
 include/linux/vtime.h   | 4 
 3 files changed, 6 deletions(-)
 delete mode 100644 include/asm-generic/vtime.h

diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index 61a8dcd7..e5fdc336c9b2 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -6,5 +6,4 @@ generic-y += agp.h
 generic-y += kvm_types.h
 generic-y += mcs_spinlock.h
 generic-y += qrwlock.h
-generic-y += vtime.h
 generic-y += early_ioremap.h
diff --git a/include/asm-generic/vtime.h b/include/asm-generic/vtime.h
deleted file mode 100644
index b1a49677fe25..
--- a/include/asm-generic/vtime.h
+++ /dev/null
@@ -1 +0,0 @@
-/* no content, but patch(1) dislikes empty files */
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 593466ceebed..29dd5b91dd7d 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -5,10 +5,6 @@
 #include 
 #include 
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
-#include <asm/vtime.h>
-#endif
-
 /*
  * Common vtime APIs
  */
-- 
2.40.1



[PATCH v2 RESEND 2/5] sched/vtime: get rid of generic vtime_task_switch() implementation

2024-04-10 Thread Alexander Gordeev
The generic vtime_task_switch() implementation gets built only
if __ARCH_HAS_VTIME_TASK_SWITCH is not defined, but requires an
architecture to implement arch_vtime_task_switch() callback at
the same time, which is confusing.

Further, arch_vtime_task_switch() is implemented for 32-bit PowerPC
architecture only and vtime_task_switch() generic variant is rather
superfluous.

Simplify the whole vtime_task_switch() wiring by moving the existing
generic implementation to PowerPC.

Reviewed-by: Frederic Weisbecker 
Reviewed-by: Nicholas Piggin 
Acked-by: Michael Ellerman 
Signed-off-by: Alexander Gordeev 
---
 arch/powerpc/include/asm/cputime.h | 13 -
 arch/powerpc/kernel/time.c | 22 ++
 kernel/sched/cputime.c | 13 -
 3 files changed, 22 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/include/asm/cputime.h
index 4961fb38e438..aff858ca99c0 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -32,23 +32,10 @@
 #ifdef CONFIG_PPC64
 #define get_accounting(tsk)	(&get_paca()->accounting)
 #define raw_get_accounting(tsk)	(&local_paca->accounting)
-static inline void arch_vtime_task_switch(struct task_struct *tsk) { }
 
 #else
 #define get_accounting(tsk)	(&task_thread_info(tsk)->accounting)
 #define raw_get_accounting(tsk)	get_accounting(tsk)
-/*
- * Called from the context switch with interrupts disabled, to charge all
- * accumulated times to the current process, and to prepare accounting on
- * the next process.
- */
-static inline void arch_vtime_task_switch(struct task_struct *prev)
-{
-   struct cpu_accounting_data *acct = get_accounting(current);
-   struct cpu_accounting_data *acct0 = get_accounting(prev);
-
-   acct->starttime = acct0->starttime;
-}
 #endif
 
 /*
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index df20cf201f74..c0fdc6d94fee 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -354,6 +354,28 @@ void vtime_flush(struct task_struct *tsk)
acct->hardirq_time = 0;
acct->softirq_time = 0;
 }
+
+/*
+ * Called from the context switch with interrupts disabled, to charge all
+ * accumulated times to the current process, and to prepare accounting on
+ * the next process.
+ */
+void vtime_task_switch(struct task_struct *prev)
+{
+   if (is_idle_task(prev))
+   vtime_account_idle(prev);
+   else
+   vtime_account_kernel(prev);
+
+   vtime_flush(prev);
+
+   if (!IS_ENABLED(CONFIG_PPC64)) {
+   struct cpu_accounting_data *acct = get_accounting(current);
+   struct cpu_accounting_data *acct0 = get_accounting(prev);
+
+   acct->starttime = acct0->starttime;
+   }
+}
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 void __no_kcsan __delay(unsigned long loops)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index af7952f12e6c..aa48b2ec879d 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -424,19 +424,6 @@ static inline void irqtime_account_process_tick(struct task_struct *p, int user_
  */
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 
-# ifndef __ARCH_HAS_VTIME_TASK_SWITCH
-void vtime_task_switch(struct task_struct *prev)
-{
-   if (is_idle_task(prev))
-   vtime_account_idle(prev);
-   else
-   vtime_account_kernel(prev);
-
-   vtime_flush(prev);
-   arch_vtime_task_switch(prev);
-}
-# endif
-
 void vtime_account_irq(struct task_struct *tsk, unsigned int offset)
 {
unsigned int pc = irq_count() - offset;
-- 
2.40.1



[PATCH v2 RESEND 4/5] s390/irq,nmi: include <asm/vtime.h> header directly

2024-02-22 Thread Alexander Gordeev
update_timer_sys() and update_timer_mcck() are inlines used for
CPU time accounting from the interrupt and machine-check handlers.
These routines are specific to s390 architecture, but included
via  header implicitly. Avoid the extra loop and
include <asm/vtime.h> header directly.

Acked-by: Heiko Carstens 
Signed-off-by: Alexander Gordeev 
---
 arch/s390/kernel/irq.c | 1 +
 arch/s390/kernel/nmi.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/s390/kernel/irq.c b/arch/s390/kernel/irq.c
index 6f71b0ce1068..259496fe0ef9 100644
--- a/arch/s390/kernel/irq.c
+++ b/arch/s390/kernel/irq.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include <asm/vtime.h>
 #include "entry.h"
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct irq_stat, irq_stat);
diff --git a/arch/s390/kernel/nmi.c b/arch/s390/kernel/nmi.c
index 9ad44c26d1a2..4422a27faace 100644
--- a/arch/s390/kernel/nmi.c
+++ b/arch/s390/kernel/nmi.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include <asm/vtime.h>
 #include 
 
 struct mcck_struct {
-- 
2.40.1



[PATCH v2 RESEND 3/5] s390/vtime: remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover

2024-02-22 Thread Alexander Gordeev
__ARCH_HAS_VTIME_TASK_SWITCH macro is not used anymore.

Reviewed-by: Frederic Weisbecker 
Acked-by: Heiko Carstens 
Acked-by: Nicholas Piggin 
Signed-off-by: Alexander Gordeev 
---
 arch/s390/include/asm/vtime.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/s390/include/asm/vtime.h b/arch/s390/include/asm/vtime.h
index fe17e448c0c5..561c91c1a87c 100644
--- a/arch/s390/include/asm/vtime.h
+++ b/arch/s390/include/asm/vtime.h
@@ -2,8 +2,6 @@
 #ifndef _S390_VTIME_H
 #define _S390_VTIME_H
 
-#define __ARCH_HAS_VTIME_TASK_SWITCH
-
 static inline void update_timer_sys(void)
 {
 	S390_lowcore.system_timer += S390_lowcore.last_update_timer - S390_lowcore.exit_timer;
-- 
2.40.1



[PATCH v2 RESEND 2/5] sched/vtime: get rid of generic vtime_task_switch() implementation

2024-02-22 Thread Alexander Gordeev
The generic vtime_task_switch() implementation gets built only
if __ARCH_HAS_VTIME_TASK_SWITCH is not defined, but requires an
architecture to implement arch_vtime_task_switch() callback at
the same time, which is confusing.

Further, arch_vtime_task_switch() is implemented for 32-bit PowerPC
architecture only and vtime_task_switch() generic variant is rather
superfluous.

Simplify the whole vtime_task_switch() wiring by moving the existing
generic implementation to PowerPC.

Reviewed-by: Frederic Weisbecker 
Reviewed-by: Nicholas Piggin 
Signed-off-by: Alexander Gordeev 
---
 arch/powerpc/include/asm/cputime.h | 13 -
 arch/powerpc/kernel/time.c | 22 ++
 kernel/sched/cputime.c | 13 -
 3 files changed, 22 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/include/asm/cputime.h
index 4961fb38e438..aff858ca99c0 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -32,23 +32,10 @@
 #ifdef CONFIG_PPC64
 #define get_accounting(tsk)	(&get_paca()->accounting)
 #define raw_get_accounting(tsk)	(&local_paca->accounting)
-static inline void arch_vtime_task_switch(struct task_struct *tsk) { }
 
 #else
 #define get_accounting(tsk)	(&task_thread_info(tsk)->accounting)
 #define raw_get_accounting(tsk)	get_accounting(tsk)
-/*
- * Called from the context switch with interrupts disabled, to charge all
- * accumulated times to the current process, and to prepare accounting on
- * the next process.
- */
-static inline void arch_vtime_task_switch(struct task_struct *prev)
-{
-   struct cpu_accounting_data *acct = get_accounting(current);
-   struct cpu_accounting_data *acct0 = get_accounting(prev);
-
-   acct->starttime = acct0->starttime;
-}
 #endif
 
 /*
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index df20cf201f74..c0fdc6d94fee 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -354,6 +354,28 @@ void vtime_flush(struct task_struct *tsk)
acct->hardirq_time = 0;
acct->softirq_time = 0;
 }
+
+/*
+ * Called from the context switch with interrupts disabled, to charge all
+ * accumulated times to the current process, and to prepare accounting on
+ * the next process.
+ */
+void vtime_task_switch(struct task_struct *prev)
+{
+   if (is_idle_task(prev))
+   vtime_account_idle(prev);
+   else
+   vtime_account_kernel(prev);
+
+   vtime_flush(prev);
+
+   if (!IS_ENABLED(CONFIG_PPC64)) {
+   struct cpu_accounting_data *acct = get_accounting(current);
+   struct cpu_accounting_data *acct0 = get_accounting(prev);
+
+   acct->starttime = acct0->starttime;
+   }
+}
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 void __no_kcsan __delay(unsigned long loops)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index af7952f12e6c..aa48b2ec879d 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -424,19 +424,6 @@ static inline void irqtime_account_process_tick(struct task_struct *p, int user_
  */
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 
-# ifndef __ARCH_HAS_VTIME_TASK_SWITCH
-void vtime_task_switch(struct task_struct *prev)
-{
-   if (is_idle_task(prev))
-   vtime_account_idle(prev);
-   else
-   vtime_account_kernel(prev);
-
-   vtime_flush(prev);
-   arch_vtime_task_switch(prev);
-}
-# endif
-
 void vtime_account_irq(struct task_struct *tsk, unsigned int offset)
 {
unsigned int pc = irq_count() - offset;
-- 
2.40.1



[PATCH v2 RESEND 1/5] sched/vtime: remove confusing arch_vtime_task_switch() declaration

2024-02-22 Thread Alexander Gordeev
Callback arch_vtime_task_switch() is only defined when
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is selected. Yet, the
function prototype forward declaration is present for
CONFIG_VIRT_CPU_ACCOUNTING_GEN variant. Remove it.

Reviewed-by: Frederic Weisbecker 
Reviewed-by: Nicholas Piggin 
Signed-off-by: Alexander Gordeev 
---
 include/linux/vtime.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 3684487d01e1..593466ceebed 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -18,7 +18,6 @@ extern void vtime_account_idle(struct task_struct *tsk);
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-extern void arch_vtime_task_switch(struct task_struct *tsk);
 extern void vtime_user_enter(struct task_struct *tsk);
 extern void vtime_user_exit(struct task_struct *tsk);
 extern void vtime_guest_enter(struct task_struct *tsk);
-- 
2.40.1



[PATCH v2 RESEND 5/5] sched/vtime: do not include <asm/vtime.h> header

2024-02-22 Thread Alexander Gordeev
There is no architecture-specific code or data left
that generic <linux/vtime.h> needs to know about.
Thus, avoid the inclusion of <asm/vtime.h> header.

Reviewed-by: Frederic Weisbecker 
Acked-by: Nicholas Piggin 
Signed-off-by: Alexander Gordeev 
---
 arch/powerpc/include/asm/Kbuild | 1 -
 include/asm-generic/vtime.h | 1 -
 include/linux/vtime.h   | 4 
 3 files changed, 6 deletions(-)
 delete mode 100644 include/asm-generic/vtime.h

diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index 61a8dcd7..e5fdc336c9b2 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -6,5 +6,4 @@ generic-y += agp.h
 generic-y += kvm_types.h
 generic-y += mcs_spinlock.h
 generic-y += qrwlock.h
-generic-y += vtime.h
 generic-y += early_ioremap.h
diff --git a/include/asm-generic/vtime.h b/include/asm-generic/vtime.h
deleted file mode 100644
index b1a49677fe25..
--- a/include/asm-generic/vtime.h
+++ /dev/null
@@ -1 +0,0 @@
-/* no content, but patch(1) dislikes empty files */
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 593466ceebed..29dd5b91dd7d 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -5,10 +5,6 @@
 #include 
 #include 
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
-#include <asm/vtime.h>
-#endif
-
 /*
  * Common vtime APIs
  */
-- 
2.40.1



[PATCH v2 RESEND 0/5] sched/vtime: vtime.h headers cleanup

2024-02-22 Thread Alexander Gordeev
Hi All,

There are no changes since the last post, just a re-send.

v2:

- patch 4: commit message reworded (Heiko)
- patch 5: vtime.h is removed from Kbuild scripts (PowerPC only) (Heiko)

v1:

Please find a small cleanup to vtime_task_switch() wiring.
I split it into smaller patches to allow separate PowerPC
vs s390 reviews. Otherwise patches 2+3 and 4+5 could have
been merged.

I tested it on s390 and compile-tested it on 32- and 64-bit
PowerPC and a few other major architectures only, but it is
only of concern for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE-capable
ones (AFAICT).

Thanks!

Alexander Gordeev (5):
  sched/vtime: remove confusing arch_vtime_task_switch() declaration
  sched/vtime: get rid of generic vtime_task_switch() implementation
  s390/vtime: remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover
  s390/irq,nmi: include <asm/vtime.h> header directly
  sched/vtime: do not include <asm/vtime.h> header

 arch/powerpc/include/asm/Kbuild|  1 -
 arch/powerpc/include/asm/cputime.h | 13 -
 arch/powerpc/kernel/time.c | 22 ++
 arch/s390/include/asm/vtime.h  |  2 --
 arch/s390/kernel/irq.c |  1 +
 arch/s390/kernel/nmi.c |  1 +
 include/asm-generic/vtime.h|  1 -
 include/linux/vtime.h  |  5 -
 kernel/sched/cputime.c | 13 -
 9 files changed, 24 insertions(+), 35 deletions(-)
 delete mode 100644 include/asm-generic/vtime.h

-- 
2.40.1



[PATCH v2 1/5] sched/vtime: remove confusing arch_vtime_task_switch() declaration

2024-02-08 Thread Alexander Gordeev
Callback arch_vtime_task_switch() is only defined when
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is selected. Yet, the
function prototype forward declaration is present for
CONFIG_VIRT_CPU_ACCOUNTING_GEN variant. Remove it.

Reviewed-by: Frederic Weisbecker 
Signed-off-by: Alexander Gordeev 
---
 include/linux/vtime.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 3684487d01e1..593466ceebed 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -18,7 +18,6 @@ extern void vtime_account_idle(struct task_struct *tsk);
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-extern void arch_vtime_task_switch(struct task_struct *tsk);
 extern void vtime_user_enter(struct task_struct *tsk);
 extern void vtime_user_exit(struct task_struct *tsk);
 extern void vtime_guest_enter(struct task_struct *tsk);
-- 
2.40.1



[PATCH v2 5/5] sched/vtime: do not include <asm/vtime.h> header

2024-02-08 Thread Alexander Gordeev
There is no architecture-specific code or data left
that generic <linux/vtime.h> needs to know about.
Thus, avoid the inclusion of <asm/vtime.h> header.

Reviewed-by: Frederic Weisbecker 
Signed-off-by: Alexander Gordeev 
---
 arch/powerpc/include/asm/Kbuild | 1 -
 include/asm-generic/vtime.h | 1 -
 include/linux/vtime.h   | 4 
 3 files changed, 6 deletions(-)
 delete mode 100644 include/asm-generic/vtime.h

diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index 61a8dcd7..e5fdc336c9b2 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -6,5 +6,4 @@ generic-y += agp.h
 generic-y += kvm_types.h
 generic-y += mcs_spinlock.h
 generic-y += qrwlock.h
-generic-y += vtime.h
 generic-y += early_ioremap.h
diff --git a/include/asm-generic/vtime.h b/include/asm-generic/vtime.h
deleted file mode 100644
index b1a49677fe25..
--- a/include/asm-generic/vtime.h
+++ /dev/null
@@ -1 +0,0 @@
-/* no content, but patch(1) dislikes empty files */
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 593466ceebed..29dd5b91dd7d 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -5,10 +5,6 @@
 #include 
 #include 
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
-#include <asm/vtime.h>
-#endif
-
 /*
  * Common vtime APIs
  */
-- 
2.40.1



[PATCH v2 3/5] s390/vtime: remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover

2024-02-08 Thread Alexander Gordeev
__ARCH_HAS_VTIME_TASK_SWITCH macro is not used anymore.

Reviewed-by: Frederic Weisbecker 
Acked-by: Heiko Carstens 
Signed-off-by: Alexander Gordeev 
---
 arch/s390/include/asm/vtime.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/s390/include/asm/vtime.h b/arch/s390/include/asm/vtime.h
index fe17e448c0c5..561c91c1a87c 100644
--- a/arch/s390/include/asm/vtime.h
+++ b/arch/s390/include/asm/vtime.h
@@ -2,8 +2,6 @@
 #ifndef _S390_VTIME_H
 #define _S390_VTIME_H
 
-#define __ARCH_HAS_VTIME_TASK_SWITCH
-
 static inline void update_timer_sys(void)
 {
 	S390_lowcore.system_timer += S390_lowcore.last_update_timer - S390_lowcore.exit_timer;
-- 
2.40.1



[PATCH v2 0/5] sched/vtime: vtime.h headers cleanup

2024-02-08 Thread Alexander Gordeev
Hi All,

I kept all tags on reviewed patches.

v2:

- patch 4: commit message reworded (Heiko)
- patch 5: vtime.h is removed from Kbuild scripts (PowerPC only) (Heiko)

v1:

Please find a small cleanup to vtime_task_switch() wiring.
I split it into smaller patches to allow separate PowerPC
vs s390 reviews. Otherwise patches 2+3 and 4+5 could have
been merged.

I tested it on s390 and compile-tested it on 32- and 64-bit
PowerPC and a few other major architectures only, but it is
only of concern for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE-capable
ones (AFAICT).

Thanks!

Alexander Gordeev (5):
  sched/vtime: remove confusing arch_vtime_task_switch() declaration
  sched/vtime: get rid of generic vtime_task_switch() implementation
  s390/vtime: remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover
  s390/irq,nmi: include <asm/vtime.h> header directly
  sched/vtime: do not include <asm/vtime.h> header

 arch/powerpc/include/asm/Kbuild|  1 -
 arch/powerpc/include/asm/cputime.h | 13 -
 arch/powerpc/kernel/time.c | 22 ++
 arch/s390/include/asm/vtime.h  |  2 --
 arch/s390/kernel/irq.c |  1 +
 arch/s390/kernel/nmi.c |  1 +
 include/asm-generic/vtime.h|  1 -
 include/linux/vtime.h  |  5 -
 kernel/sched/cputime.c | 13 -
 9 files changed, 24 insertions(+), 35 deletions(-)
 delete mode 100644 include/asm-generic/vtime.h

-- 
2.40.1



[PATCH v2 4/5] s390/irq,nmi: include <asm/vtime.h> header directly

2024-02-08 Thread Alexander Gordeev
update_timer_sys() and update_timer_mcck() are inlines used for
CPU time accounting from the interrupt and machine-check handlers.
These routines are specific to s390 architecture, but included
via  header implicitly. Avoid the extra loop and
include <asm/vtime.h> header directly.

Acked-by: Heiko Carstens 
Signed-off-by: Alexander Gordeev 
---
 arch/s390/kernel/irq.c | 1 +
 arch/s390/kernel/nmi.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/s390/kernel/irq.c b/arch/s390/kernel/irq.c
index 6f71b0ce1068..259496fe0ef9 100644
--- a/arch/s390/kernel/irq.c
+++ b/arch/s390/kernel/irq.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include <asm/vtime.h>
 #include "entry.h"
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct irq_stat, irq_stat);
diff --git a/arch/s390/kernel/nmi.c b/arch/s390/kernel/nmi.c
index 9ad44c26d1a2..4422a27faace 100644
--- a/arch/s390/kernel/nmi.c
+++ b/arch/s390/kernel/nmi.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include <asm/vtime.h>
 #include 
 
 struct mcck_struct {
-- 
2.40.1



[PATCH v2 2/5] sched/vtime: get rid of generic vtime_task_switch() implementation

2024-02-08 Thread Alexander Gordeev
The generic vtime_task_switch() implementation gets built only
if __ARCH_HAS_VTIME_TASK_SWITCH is not defined, but requires an
architecture to implement arch_vtime_task_switch() callback at
the same time, which is confusing.

Further, arch_vtime_task_switch() is implemented for 32-bit PowerPC
architecture only and vtime_task_switch() generic variant is rather
superfluous.

Simplify the whole vtime_task_switch() wiring by moving the existing
generic implementation to PowerPC.

Reviewed-by: Frederic Weisbecker 
Signed-off-by: Alexander Gordeev 
---
 arch/powerpc/include/asm/cputime.h | 13 -
 arch/powerpc/kernel/time.c | 22 ++
 kernel/sched/cputime.c | 13 -
 3 files changed, 22 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/include/asm/cputime.h
index 4961fb38e438..aff858ca99c0 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -32,23 +32,10 @@
 #ifdef CONFIG_PPC64
 #define get_accounting(tsk)	(&get_paca()->accounting)
 #define raw_get_accounting(tsk)	(&local_paca->accounting)
-static inline void arch_vtime_task_switch(struct task_struct *tsk) { }
 
 #else
 #define get_accounting(tsk)	(&task_thread_info(tsk)->accounting)
 #define raw_get_accounting(tsk)	get_accounting(tsk)
-/*
- * Called from the context switch with interrupts disabled, to charge all
- * accumulated times to the current process, and to prepare accounting on
- * the next process.
- */
-static inline void arch_vtime_task_switch(struct task_struct *prev)
-{
-   struct cpu_accounting_data *acct = get_accounting(current);
-   struct cpu_accounting_data *acct0 = get_accounting(prev);
-
-   acct->starttime = acct0->starttime;
-}
 #endif
 
 /*
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index df20cf201f74..c0fdc6d94fee 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -354,6 +354,28 @@ void vtime_flush(struct task_struct *tsk)
acct->hardirq_time = 0;
acct->softirq_time = 0;
 }
+
+/*
+ * Called from the context switch with interrupts disabled, to charge all
+ * accumulated times to the current process, and to prepare accounting on
+ * the next process.
+ */
+void vtime_task_switch(struct task_struct *prev)
+{
+   if (is_idle_task(prev))
+   vtime_account_idle(prev);
+   else
+   vtime_account_kernel(prev);
+
+   vtime_flush(prev);
+
+   if (!IS_ENABLED(CONFIG_PPC64)) {
+   struct cpu_accounting_data *acct = get_accounting(current);
+   struct cpu_accounting_data *acct0 = get_accounting(prev);
+
+   acct->starttime = acct0->starttime;
+   }
+}
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 void __no_kcsan __delay(unsigned long loops)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index af7952f12e6c..aa48b2ec879d 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -424,19 +424,6 @@ static inline void irqtime_account_process_tick(struct task_struct *p, int user_
  */
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 
-# ifndef __ARCH_HAS_VTIME_TASK_SWITCH
-void vtime_task_switch(struct task_struct *prev)
-{
-   if (is_idle_task(prev))
-   vtime_account_idle(prev);
-   else
-   vtime_account_kernel(prev);
-
-   vtime_flush(prev);
-   arch_vtime_task_switch(prev);
-}
-# endif
-
 void vtime_account_irq(struct task_struct *tsk, unsigned int offset)
 {
unsigned int pc = irq_count() - offset;
-- 
2.40.1



Re: [PATCH 4/5] s390/irq,nmi: do not include header

2024-02-07 Thread Alexander Gordeev
On Mon, Jan 29, 2024 at 10:51:44AM +0100, Heiko Carstens wrote:
> It is confusing when the patch subject is "do not include.." and all
> what this patch is doing is to add two includes. I see what this is
> doing: getting rid of the implicit include of asm/vtime.h most likely
> via linux/hardirq.h, but that's not very obvious.
> 
> Anyway:
> Acked-by: Heiko Carstens 

Thank you, Heiko!

Does this wording sound better?

s390/irq,nmi: include <asm/vtime.h> header directly

update_timer_sys() and update_timer_mcck() are inlines used for 
CPU time accounting from the interrupt and machine-check handlers.  
These routines are specific to s390 architecture, but included  
via  header implicitly. Avoid the extra loop and 
include <asm/vtime.h> header directly.


Re: [PATCH 5/5] sched/vtime: do not include <asm/vtime.h> header

2024-02-07 Thread Alexander Gordeev
On Wed, Feb 07, 2024 at 12:30:08AM +0100, Frederic Weisbecker wrote:
> Reviewed-by: Frederic Weisbecker 

Thank you for the review, Frederic!

The Heiko comment is valid and I would add this chunk in v2:

--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -6,5 +6,4 @@ generic-y += agp.h
 generic-y += kvm_types.h
 generic-y += mcs_spinlock.h
 generic-y += qrwlock.h
-generic-y += vtime.h
 generic-y += early_ioremap.h

Would you keep your Reviewed-by?


[PATCH 4/5] s390/irq,nmi: do not include header

2024-01-28 Thread Alexander Gordeev
update_timer_sys() and update_timer_mcck() are inlines used for
CPU time accounting from the interrupt and machine-check handlers.
These routines are specific to s390 architecture, but declared
via  header, which in turn includes .
Avoid the extra loop and include <asm/vtime.h> header directly.

Signed-off-by: Alexander Gordeev 
---
 arch/s390/kernel/irq.c | 1 +
 arch/s390/kernel/nmi.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/s390/kernel/irq.c b/arch/s390/kernel/irq.c
index 6f71b0ce1068..259496fe0ef9 100644
--- a/arch/s390/kernel/irq.c
+++ b/arch/s390/kernel/irq.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include <asm/vtime.h>
 #include "entry.h"
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct irq_stat, irq_stat);
diff --git a/arch/s390/kernel/nmi.c b/arch/s390/kernel/nmi.c
index 9ad44c26d1a2..4422a27faace 100644
--- a/arch/s390/kernel/nmi.c
+++ b/arch/s390/kernel/nmi.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include <asm/vtime.h>
 #include 
 
 struct mcck_struct {
-- 
2.40.1



[PATCH 0/5] sched/vtime: vtime.h headers cleanup

2024-01-28 Thread Alexander Gordeev
Hi all,

Please find a small cleanup to vtime_task_switch() wiring.
I split it into smaller patches to allow separate PowerPC
vs s390 reviews. Otherwise patches 2+3 and 4+5 could have
been merged.

I tested it on s390 and compile-tested it on 32- and 64-bit
PowerPC and a few other major architectures only, but it is
only of concern for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE-capable
ones (AFAICT).

Thanks!

Alexander Gordeev (5):
  sched/vtime: remove confusing arch_vtime_task_switch() declaration
  sched/vtime: get rid of generic vtime_task_switch() implementation
  s390/vtime: remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover
  s390/irq,nmi: do not include  header
  sched/vtime: do not include <asm/vtime.h> header

 arch/powerpc/include/asm/cputime.h | 13 -
 arch/powerpc/kernel/time.c | 22 ++
 arch/s390/include/asm/vtime.h  |  2 --
 arch/s390/kernel/irq.c |  1 +
 arch/s390/kernel/nmi.c |  1 +
 include/asm-generic/vtime.h|  1 -
 include/linux/vtime.h  |  5 -
 kernel/sched/cputime.c | 13 -
 8 files changed, 24 insertions(+), 34 deletions(-)
 delete mode 100644 include/asm-generic/vtime.h

-- 
2.40.1



[PATCH 2/5] sched/vtime: get rid of generic vtime_task_switch() implementation

2024-01-28 Thread Alexander Gordeev
The generic vtime_task_switch() implementation gets built only
if __ARCH_HAS_VTIME_TASK_SWITCH is not defined, but requires an
architecture to implement arch_vtime_task_switch() callback at
the same time, which is confusing.

Further, arch_vtime_task_switch() is implemented for 32-bit PowerPC
architecture only and vtime_task_switch() generic variant is rather
superfluous.

Simplify the whole vtime_task_switch() wiring by moving the existing
generic implementation to PowerPC.

Signed-off-by: Alexander Gordeev 
---
 arch/powerpc/include/asm/cputime.h | 13 -
 arch/powerpc/kernel/time.c | 22 ++
 kernel/sched/cputime.c | 13 -
 3 files changed, 22 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/include/asm/cputime.h
index 4961fb38e438..aff858ca99c0 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -32,23 +32,10 @@
 #ifdef CONFIG_PPC64
 #define get_accounting(tsk)	(&get_paca()->accounting)
 #define raw_get_accounting(tsk)	(&local_paca->accounting)
-static inline void arch_vtime_task_switch(struct task_struct *tsk) { }
 
 #else
 #define get_accounting(tsk)	(&task_thread_info(tsk)->accounting)
 #define raw_get_accounting(tsk)	get_accounting(tsk)
-/*
- * Called from the context switch with interrupts disabled, to charge all
- * accumulated times to the current process, and to prepare accounting on
- * the next process.
- */
-static inline void arch_vtime_task_switch(struct task_struct *prev)
-{
-   struct cpu_accounting_data *acct = get_accounting(current);
-   struct cpu_accounting_data *acct0 = get_accounting(prev);
-
-   acct->starttime = acct0->starttime;
-}
 #endif
 
 /*
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index df20cf201f74..c0fdc6d94fee 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -354,6 +354,28 @@ void vtime_flush(struct task_struct *tsk)
acct->hardirq_time = 0;
acct->softirq_time = 0;
 }
+
+/*
+ * Called from the context switch with interrupts disabled, to charge all
+ * accumulated times to the current process, and to prepare accounting on
+ * the next process.
+ */
+void vtime_task_switch(struct task_struct *prev)
+{
+   if (is_idle_task(prev))
+   vtime_account_idle(prev);
+   else
+   vtime_account_kernel(prev);
+
+   vtime_flush(prev);
+
+   if (!IS_ENABLED(CONFIG_PPC64)) {
+   struct cpu_accounting_data *acct = get_accounting(current);
+   struct cpu_accounting_data *acct0 = get_accounting(prev);
+
+   acct->starttime = acct0->starttime;
+   }
+}
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 void __no_kcsan __delay(unsigned long loops)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index af7952f12e6c..aa48b2ec879d 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -424,19 +424,6 @@ static inline void irqtime_account_process_tick(struct task_struct *p, int user_
  */
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 
-# ifndef __ARCH_HAS_VTIME_TASK_SWITCH
-void vtime_task_switch(struct task_struct *prev)
-{
-   if (is_idle_task(prev))
-   vtime_account_idle(prev);
-   else
-   vtime_account_kernel(prev);
-
-   vtime_flush(prev);
-   arch_vtime_task_switch(prev);
-}
-# endif
-
 void vtime_account_irq(struct task_struct *tsk, unsigned int offset)
 {
unsigned int pc = irq_count() - offset;
-- 
2.40.1



[PATCH 3/5] s390/vtime: remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover

2024-01-28 Thread Alexander Gordeev
__ARCH_HAS_VTIME_TASK_SWITCH macro is not used anymore.

Signed-off-by: Alexander Gordeev 
---
 arch/s390/include/asm/vtime.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/s390/include/asm/vtime.h b/arch/s390/include/asm/vtime.h
index fe17e448c0c5..561c91c1a87c 100644
--- a/arch/s390/include/asm/vtime.h
+++ b/arch/s390/include/asm/vtime.h
@@ -2,8 +2,6 @@
 #ifndef _S390_VTIME_H
 #define _S390_VTIME_H
 
-#define __ARCH_HAS_VTIME_TASK_SWITCH
-
 static inline void update_timer_sys(void)
 {
 	S390_lowcore.system_timer += S390_lowcore.last_update_timer - S390_lowcore.exit_timer;
-- 
2.40.1



[PATCH 5/5] sched/vtime: do not include <asm/vtime.h> header

2024-01-28 Thread Alexander Gordeev
There is no architecture-specific code or data left
that generic <linux/vtime.h> needs to know about.
Thus, avoid the inclusion of <asm/vtime.h> header.

Signed-off-by: Alexander Gordeev 
---
 include/asm-generic/vtime.h | 1 -
 include/linux/vtime.h   | 4 
 2 files changed, 5 deletions(-)
 delete mode 100644 include/asm-generic/vtime.h

diff --git a/include/asm-generic/vtime.h b/include/asm-generic/vtime.h
deleted file mode 100644
index b1a49677fe25..
--- a/include/asm-generic/vtime.h
+++ /dev/null
@@ -1 +0,0 @@
-/* no content, but patch(1) dislikes empty files */
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 593466ceebed..29dd5b91dd7d 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -5,10 +5,6 @@
 #include 
 #include 
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
-#include <asm/vtime.h>
-#endif
-
 /*
  * Common vtime APIs
  */
-- 
2.40.1



[PATCH 1/5] sched/vtime: remove confusing arch_vtime_task_switch() declaration

2024-01-28 Thread Alexander Gordeev
Callback arch_vtime_task_switch() is only defined when
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is selected. Yet, the
function prototype forward declaration is present for
CONFIG_VIRT_CPU_ACCOUNTING_GEN variant. Remove it.

Signed-off-by: Alexander Gordeev 
---
 include/linux/vtime.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 3684487d01e1..593466ceebed 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -18,7 +18,6 @@ extern void vtime_account_idle(struct task_struct *tsk);
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-extern void arch_vtime_task_switch(struct task_struct *tsk);
 extern void vtime_user_enter(struct task_struct *tsk);
 extern void vtime_user_exit(struct task_struct *tsk);
 extern void vtime_guest_enter(struct task_struct *tsk);
-- 
2.40.1



Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-11-22 Thread Alexander Gordeev
On Tue, Jan 17, 2023 at 10:45:25PM +0100, Jann Horn wrote:

Hi Jann,

> On Tue, Jan 17, 2023 at 10:28 PM Suren Baghdasaryan  wrote:
> > On Tue, Jan 17, 2023 at 10:03 AM Jann Horn  wrote:
> > >
> > > +locking maintainers
> >
> > Thanks! I'll CC the locking maintainers in the next posting.
> >
> > >
> > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan  
> > > wrote:
> > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock
> > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > mmap_write_lock holder is done with all modifications and drops 
> > > > mmap_lock,
> > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > locked.
> > > [...]
> > > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > +{
> > > > +   up_read(&vma->lock);
> > > > +}
> > >
> > > One thing that might be gnarly here is that I think you might not be
> > > allowed to use up_read() to fully release ownership of an object -
> > > from what I remember, I think that up_read() (unlike something like
> > > spin_unlock()) can access the lock object after it's already been
> > > acquired by someone else. So if you want to protect against concurrent
> > > deletion, this might have to be something like:
> > >
> > > rcu_read_lock(); /* keeps vma alive */
> > > up_read(&vma->lock);
> > > rcu_read_unlock();
> >
> > But for deleting VMA one would need to write-lock the vma->lock first,
> > which I assume can't happen until this up_read() is complete. Is that
> > assumption wrong?
> 
> __up_read() does:
> 
> rwsem_clear_reader_owned(sem);
> tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
> DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
> if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
>   RWSEM_FLAG_WAITERS)) {
>   clear_nonspinnable(sem);
>   rwsem_wake(sem);
> }

This sequence is covered by preempt_disable()/preempt_enable().
Would not it preserve the RCU grace period until after __up_read()
exited?

> The atomic_long_add_return_release() is the point where we are doing
> the main lock-releasing.
> 
> So if a reader dropped the read-lock while someone else was waiting on
> the lock (RWSEM_FLAG_WAITERS) and no other readers were holding the
> lock together with it, the reader also does clear_nonspinnable() and
> rwsem_wake() afterwards.
> But in rwsem_down_write_slowpath(), after we've set
> RWSEM_FLAG_WAITERS, we can return successfully immediately once
> rwsem_try_write_lock() sees that there are no active readers or
> writers anymore (if RWSEM_LOCK_MASK is unset and the cmpxchg
> succeeds). We're not necessarily waiting for the "nonspinnable" bit or
> the wake.
> 
> So yeah, I think down_write() can return successfully before up_read()
> is done with its memory accesses.

Thanks!


Re: [PATCH 1/8] S390: Remove sentinel elem from ctl_table arrays

2023-09-07 Thread Alexander Gordeev
On Wed, Sep 06, 2023 at 12:03:22PM +0200, Joel Granados via B4 Relay wrote:
> From: Joel Granados 
> 
> This commit comes at the tail end of a greater effort to remove the
> empty elements at the end of the ctl_table arrays (sentinels) which
> will reduce the overall build time size of the kernel and run time
> memory bloat by ~64 bytes per sentinel (further information Link :
> https://lore.kernel.org/all/zo5yx5jfoggi%2f...@bombadil.infradead.org/)
> 
> Remove the sentinel element from appldata_table, s390dbf_table,
> topology_ctl_table, cmm_table and page_table_sysctl. Reduced the
> memory allocation in appldata_register_ops by 1 effectively removing the
> sentinel from ops->ctl_table.
> 
> Signed-off-by: Joel Granados 
> ---
>  arch/s390/appldata/appldata_base.c | 6 ++
>  arch/s390/kernel/debug.c   | 3 +--
>  arch/s390/kernel/topology.c| 3 +--
>  arch/s390/mm/cmm.c | 3 +--
>  arch/s390/mm/pgalloc.c | 3 +--
>  5 files changed, 6 insertions(+), 12 deletions(-)

Tested-by: Alexander Gordeev 


Re: [PATCH rfc v2 04/10] s390: mm: use try_vma_locked_page_fault()

2023-08-24 Thread Alexander Gordeev
On Mon, Aug 21, 2023 at 08:30:50PM +0800, Kefeng Wang wrote:
> Use new try_vma_locked_page_fault() helper to simplify code.
> No functional change intended.
> 
> Signed-off-by: Kefeng Wang 
> ---
>  arch/s390/mm/fault.c | 66 ++--
>  1 file changed, 27 insertions(+), 39 deletions(-)
> 
> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> index 099c4824dd8a..fbbdebde6ea7 100644
> --- a/arch/s390/mm/fault.c
> +++ b/arch/s390/mm/fault.c
> @@ -357,16 +357,18 @@ static noinline void do_fault_error(struct pt_regs 
> *regs, vm_fault_t fault)
>  static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>  {
>   struct gmap *gmap;
> - struct task_struct *tsk;
> - struct mm_struct *mm;
>   struct vm_area_struct *vma;
>   enum fault_type type;
> - unsigned long address;
> - unsigned int flags;
> + struct mm_struct *mm = current->mm;
> + unsigned long address = get_fault_address(regs);
>   vm_fault_t fault;
>   bool is_write;
> + struct vm_fault vmf = {
> + .real_address = address,
> + .flags = FAULT_FLAG_DEFAULT,
> + .vm_flags = access,
> + };
>  
> - tsk = current;
>   /*
>* The instruction that caused the program check has
>* been nullified. Don't signal single step via SIGTRAP.
> @@ -376,8 +378,6 @@ static inline vm_fault_t do_exception(struct pt_regs 
> *regs, int access)
>   if (kprobe_page_fault(regs, 14))
>   return 0;
>  
> - mm = tsk->mm;
> - address = get_fault_address(regs);
>   is_write = fault_is_write(regs);
>  
>   /*
> @@ -398,45 +398,33 @@ static inline vm_fault_t do_exception(struct pt_regs 
> *regs, int access)
>   }
>  
>   perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
> - flags = FAULT_FLAG_DEFAULT;
>   if (user_mode(regs))
> - flags |= FAULT_FLAG_USER;
> + vmf.flags |= FAULT_FLAG_USER;
>   if (is_write)
> - access = VM_WRITE;
> - if (access == VM_WRITE)
> - flags |= FAULT_FLAG_WRITE;
> - if (!(flags & FAULT_FLAG_USER))
> - goto lock_mmap;
> - vma = lock_vma_under_rcu(mm, address);
> - if (!vma)
> - goto lock_mmap;
> - if (!(vma->vm_flags & access)) {
> - vma_end_read(vma);
> - goto lock_mmap;
> - }
> - fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, 
> regs);
> - if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
> - vma_end_read(vma);
> - if (!(fault & VM_FAULT_RETRY)) {
> - count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
> - if (likely(!(fault & VM_FAULT_ERROR)))
> - fault = 0;

This fault fixup is removed in the new version.

> + vmf.vm_flags = VM_WRITE;
> + if (vmf.vm_flags == VM_WRITE)
> + vmf.flags |= FAULT_FLAG_WRITE;
> +
> + fault = try_vma_locked_page_fault(&vmf);
> + if (fault == VM_FAULT_NONE)
> + goto lock_mm;

Because VM_FAULT_NONE is defined as 0, it gets confused with
the success code of 0 returned by a fault handler. In the
former case we want to continue, while in the latter we want
to return successfully. I think this applies to all archs.
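
To illustrate the ambiguity (a hypothetical caller sketch reusing the names
from the patch above, not code from the series), both outcomes look identical
to the caller because VM_FAULT_NONE and a cleanly handled fault are both 0:

	fault = try_vma_locked_page_fault(&vmf);
	if (fault == VM_FAULT_NONE)	/* 0 == "vma-locked path not attempted"... */
		goto lock_mm;		/* ...but a fault handled without error is
					 * also 0, so it would wrongly be retried
					 * under mmap_lock instead of returning */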

> + if (!(fault & VM_FAULT_RETRY))
>   goto out;
> - }
> - count_vm_vma_lock_event(VMA_LOCK_RETRY);
> +
>   /* Quick path to respond to signals */
>   if (fault_signal_pending(fault, regs)) {
>   fault = VM_FAULT_SIGNAL;
>   goto out;
>   }
> -lock_mmap:
> +
> +lock_mm:
>   mmap_read_lock(mm);
>  
>   gmap = NULL;
>   if (IS_ENABLED(CONFIG_PGSTE) && type == GMAP_FAULT) {
>   gmap = (struct gmap *) S390_lowcore.gmap;
>   current->thread.gmap_addr = address;
> - current->thread.gmap_write_flag = !!(flags & FAULT_FLAG_WRITE);
> + current->thread.gmap_write_flag = !!(vmf.flags & 
> FAULT_FLAG_WRITE);
>   current->thread.gmap_int_code = regs->int_code & 0x;
>   address = __gmap_translate(gmap, address);
>   if (address == -EFAULT) {
> @@ -444,7 +432,7 @@ static inline vm_fault_t do_exception(struct pt_regs 
> *regs, int access)
>   goto out_up;
>   }
>   if (gmap->pfault_enabled)
> - flags |= FAULT_FLAG_RETRY_NOWAIT;
> + vmf.flags |= FAULT_FLAG_RETRY_NOWAIT;
>   }
>  
>  retry:
> @@ -466,7 +454,7 @@ static inline vm_fault_t do_exception(struct pt_regs 
> *regs, int access)
>* we can handle it..
>*/
>   fault = VM_FAULT_BADACCESS;
> - if (unlikely(!(vma->vm_flags & access)))
> + if (unlikely(!(vma->vm_flags & vmf.vm_flags)))
>   goto out_up;
>  
>   /*
> @@ -474,10 +462,10 @@ static inline vm_fault_t do_exception(struct pt_regs 
> *regs, int access)
>* make sure we exit gracefully rather 

Re: [PATCH rfc v2 01/10] mm: add a generic VMA lock-based page fault handler

2023-08-24 Thread Alexander Gordeev
On Mon, Aug 21, 2023 at 08:30:47PM +0800, Kefeng Wang wrote:

Hi Kefeng,

> The ARCH_SUPPORTS_PER_VMA_LOCK are enabled by more and more architectures,
> eg, x86, arm64, powerpc and s390, and riscv, those implementation are very
> similar which results in some duplicated codes, let's add a generic VMA
> lock-based page fault handler try_to_vma_locked_page_fault() to eliminate
> them, and which also make us easy to support this on new architectures.
> 
> Since different architectures use different way to check vma whether is
> accessable or not, the struct pt_regs, page fault error code and vma flags
> are added into struct vm_fault, then, the architecture's page fault code
> could re-use struct vm_fault to record and check vma accessable by each
> own implementation.
> 
> Signed-off-by: Kefeng Wang 
> ---
>  include/linux/mm.h   | 17 +
>  include/linux/mm_types.h |  2 ++
>  mm/memory.c  | 39 +++
>  3 files changed, 58 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 3f764e84e567..22a6f4c56ff3 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -512,9 +512,12 @@ struct vm_fault {
>   pgoff_t pgoff;  /* Logical page offset based on 
> vma */
>   unsigned long address;  /* Faulting virtual address - 
> masked */
>   unsigned long real_address; /* Faulting virtual address - 
> unmasked */
> + unsigned long fault_code;   /* Faulting error code during 
> page fault */
> + struct pt_regs *regs;   /* The registers stored during 
> page fault */
>   };
>   enum fault_flag flags;  /* FAULT_FLAG_xxx flags
>* XXX: should really be 'const' */
> + vm_flags_t vm_flags;/* VMA flags to be used for access 
> checking */
>   pmd_t *pmd; /* Pointer to pmd entry matching
>* the 'address' */
>   pud_t *pud; /* Pointer to pud entry matching
> @@ -774,6 +777,9 @@ static inline void assert_fault_locked(struct vm_fault 
> *vmf)
>  struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> unsigned long address);
>  
> +bool arch_vma_access_error(struct vm_area_struct *vma, struct vm_fault *vmf);
> +vm_fault_t try_vma_locked_page_fault(struct vm_fault *vmf);
> +
>  #else /* CONFIG_PER_VMA_LOCK */
>  
>  static inline bool vma_start_read(struct vm_area_struct *vma)
> @@ -801,6 +807,17 @@ static inline void assert_fault_locked(struct vm_fault 
> *vmf)
>   mmap_assert_locked(vmf->vma->vm_mm);
>  }
>  
> +static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> + unsigned long address)
> +{
> + return NULL;
> +}
> +
> +static inline vm_fault_t try_vma_locked_page_fault(struct vm_fault *vmf)
> +{
> + return VM_FAULT_NONE;
> +}
> +
>  #endif /* CONFIG_PER_VMA_LOCK */
>  
>  extern const struct vm_operations_struct vma_dummy_vm_ops;
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index f5ba5b0bc836..702820cea3f9 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1119,6 +1119,7 @@ typedef __bitwise unsigned int vm_fault_t;
>   * fault. Used to decide whether a process gets delivered SIGBUS or
>   * just gets major/minor fault counters bumped up.
>   *
> + * @VM_FAULT_NONE:   Special case, not starting to handle fault
>   * @VM_FAULT_OOM:Out Of Memory
>   * @VM_FAULT_SIGBUS: Bad access
>   * @VM_FAULT_MAJOR:  Page read from storage
> @@ -1139,6 +1140,7 @@ typedef __bitwise unsigned int vm_fault_t;
>   *
>   */
>  enum vm_fault_reason {
> + VM_FAULT_NONE   = (__force vm_fault_t)0x00,
>   VM_FAULT_OOM= (__force vm_fault_t)0x01,
>   VM_FAULT_SIGBUS = (__force vm_fault_t)0x02,
>   VM_FAULT_MAJOR  = (__force vm_fault_t)0x04,
> diff --git a/mm/memory.c b/mm/memory.c
> index 3b4aaa0d2fff..60fe35db5134 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5510,6 +5510,45 @@ struct vm_area_struct *lock_vma_under_rcu(struct 
> mm_struct *mm,
>   count_vm_vma_lock_event(VMA_LOCK_ABORT);
>   return NULL;
>  }
> +
> +#ifdef CONFIG_PER_VMA_LOCK
> +bool __weak arch_vma_access_error(struct vm_area_struct *vma, struct 
> vm_fault *vmf)
> +{
> + return (vma->vm_flags & vmf->vm_flags) == 0;
> +}
> +#endif
> +
> +vm_fault_t try_vma_locked_page_fault(struct vm_fault *vmf)
> +{
> + vm_fault_t fault = VM_FAULT_NONE;
> + struct vm_area_struct *vma;
> +
> + if (!(vmf->flags & FAULT_FLAG_USER))
> + return fault;
> +
> + vma = lock_vma_under_rcu(current->mm, vmf->real_address);
> + if (!vma)
> + return fault;
> +
> + if (arch_vma_access_error(vma, vmf)) {
> + vma_end_read(vma);
> +  

Re: [PATCH v3 07/13] s390: add pte_free_defer() for pgtables sharing page

2023-07-12 Thread Alexander Gordeev
 list_empty(&page->lru))
>   return;
>   snprintf(msg, sizeof(msg),
>"Invalid pgtable %p release half 0x%02x mask 0x%02x",
> @@ -308,6 +320,15 @@ static void page_table_release_check(struct page *page, 
> void *table,
>   dump_page(page, msg);
>  }
>  
> +static void pte_free_now(struct rcu_head *head)
> +{
> + struct page *page;
> +
> + page = container_of(head, struct page, rcu_head);
> + pgtable_pte_page_dtor(page);
> + __free_page(page);
> +}
> +
>  void page_table_free(struct mm_struct *mm, unsigned long *table)
>  {
>   unsigned int mask, bit, half;
> @@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, unsigned 
> long *table)
>*/
>   mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
>   mask >>= 24;
> - if (mask & 0x03U)
> + if ((mask & 0x03U) && !PageActive(page)) {
> + /*
> +  * Other half is allocated, and neither half has had
> +  * its free deferred: add page to head of list, to make
> +  * this freed half available for immediate reuse.
> +  */
>   list_add(&page->lru, &mm->context.pgtable_list);
> - else
> - list_del(&page->lru);
> + } else {
> + /* If page is on list, now remove it. */
> + list_del_init(&page->lru);
> + }
>   spin_unlock_bh(&mm->context.lock);
>   mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
>   mask >>= 24;
> @@ -342,8 +370,10 @@ void page_table_free(struct mm_struct *mm, unsigned long 
> *table)
>   }
>  
>   page_table_release_check(page, table, half, mask);
> - pgtable_pte_page_dtor(page);
> - __free_page(page);
> + if (TestClearPageActive(page))
> + call_rcu(&page->rcu_head, pte_free_now);
> + else
> + pte_free_now(&page->rcu_head);
>  }
>  
>  void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
> @@ -370,10 +400,18 @@ void page_table_free_rcu(struct mmu_gather *tlb, 
> unsigned long *table,
>*/
>   mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
>   mask >>= 24;
> - if (mask & 0x03U)
> + if ((mask & 0x03U) && !PageActive(page)) {
> + /*
> +  * Other half is allocated, and neither half has had
> +  * its free deferred: add page to end of list, to make
> +  * this freed half available for reuse once its pending
> +  * bit has been cleared by __tlb_remove_table().
> +  */
>   list_add_tail(&page->lru, &mm->context.pgtable_list);
> - else
> - list_del(&page->lru);
> + } else {
> + /* If page is on list, now remove it. */
> + list_del_init(&page->lru);
> + }
>   spin_unlock_bh(&mm->context.lock);
>   table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
>   tlb_remove_table(tlb, table);
> @@ -403,10 +441,28 @@ void __tlb_remove_table(void *_table)
>   }
>  
>   page_table_release_check(page, table, half, mask);
> - pgtable_pte_page_dtor(page);
> - __free_page(page);
> + if (TestClearPageActive(page))
> + call_rcu(&page->rcu_head, pte_free_now);
> + else
> + pte_free_now(&page->rcu_head);
>  }
>  
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> + struct page *page;
> +
> + page = virt_to_page(pgtable);
> + SetPageActive(page);
> + page_table_free(mm, (unsigned long *)pgtable);
> + /*
> +  * page_table_free() does not do the pgste gmap_unlink() which
> +  * page_table_free_rcu() does: warn us if pgste ever reaches here.
> +  */
> + WARN_ON_ONCE(mm_alloc_pgste(mm));
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
>  /*
>   * Base infrastructure required to generate basic asces, region, segment,
>   * and page tables that do not make use of enhanced features like EDAT1.

Tested-by: Alexander Gordeev 
Acked-by: Alexander Gordeev 


Re: [PATCH v4 12/13] s390/kexec: refactor for kernel/Kconfig.kexec

2023-07-06 Thread Alexander Gordeev
~~
> |KEXEC_ON_CRASH
>   arch/s390/kernel/machine_kexec.c:280:66: error: passing argument 1 of 
> 'kdump_csum_valid' from incompatible pointer type 
> [-Werror=incompatible-pointer-types]
> 280 | if (image->type == KEXEC_TYPE_CRASH && 
> !kdump_csum_valid(image))
> |  
> ^
> |  |
> |  
> struct kimage *
>   arch/s390/kernel/machine_kexec.c:120:45: note: expected 'struct kimage *' 
> but argument is of type 'struct kimage *'
> 120 | static bool kdump_csum_valid(struct kimage *image)
> |  ~~~^
>   cc1: some warnings being treated as errors
> 
> I don't think this change is equivalent for s390, which had
> 
>   config KEXEC
>   def_bool y
>   select KEXEC_CORE
> 
> but it is now the equivalent of
> 
>   config KEXEC
>   bool "Enable kexec system call"
>   default y
> 
> which enables KEXEC by default but it also allows KEXEC to be disabled
> for s390 now, because it is a user-visible symbol, not one that is
> unconditionally enabled no matter what. If s390 can tolerate KEXEC being
> user selectable, then I assume the fix is just adjusting
> arch/s390/kernel/Makefile to only build the machine_kexec files when
> CONFIG_KEXEC_CORE is set:
> 
> diff --git a/arch/s390/kernel/Makefile b/arch/s390/kernel/Makefile
> index 6b2a051e1f8a..a06b39da95f0 100644
> --- a/arch/s390/kernel/Makefile
> +++ b/arch/s390/kernel/Makefile
> @@ -37,10 +37,10 @@ CFLAGS_unwind_bc.o+= -fno-optimize-sibling-calls
>  obj-y:= head64.o traps.o time.o process.o earlypgm.o early.o setup.o 
> idle.o vtime.o
>  obj-y+= processor.o syscall.o ptrace.o signal.o cpcmd.o ebcdic.o 
> nmi.o
>  obj-y+= debug.o irq.o ipl.o dis.o diag.o vdso.o cpufeature.o
> -obj-y+= sysinfo.o lgr.o os_info.o machine_kexec.o
> +obj-y+= sysinfo.o lgr.o os_info.o
>  obj-y+= runtime_instr.o cache.o fpu.o dumpstack.o guarded_storage.o 
> sthyi.o
>  obj-y+= entry.o reipl.o relocate_kernel.o kdebugfs.o alternative.o
> -obj-y+= nospec-branch.o ipl_vmparm.o machine_kexec_reloc.o 
> unwind_bc.o
> +obj-y+= nospec-branch.o ipl_vmparm.o unwind_bc.o
>  obj-y+= smp.o text_amode31.o stacktrace.o abs_lowcore.o
>  
>  extra-y  += vmlinux.lds
> @@ -66,6 +66,7 @@ obj-$(CONFIG_CRASH_DUMP)+= crash_dump.o
>  obj-$(CONFIG_UPROBES)+= uprobes.o
>  obj-$(CONFIG_JUMP_LABEL) += jump_label.o
>  
> +obj-$(CONFIG_KEXEC_CORE) += machine_kexec.o machine_kexec_reloc.o
>  obj-$(CONFIG_KEXEC_FILE) += machine_kexec_file.o kexec_image.o
>  obj-$(CONFIG_KEXEC_FILE) += kexec_elf.o
>  
> 
> Otherwise, the prompt for KEXEC could be made conditional on some ARCH
> symbol so that architectures can opt out of it.

Hi Nathan,

Thanks a lot for looking into it!
With a few modifications the fix would look like below.
It probably needs to be a prerequisite for this series:


[PATCH] s390/kexec: make machine_kexec depend on CONFIG_KEXEC_CORE

Make machine_kexec.o and relocate_kernel.o depend on
CONFIG_KEXEC_CORE option as other architectures do.

Still generate machine_kexec_reloc.o unconditionally,
since the arch_kexec_do_relocs() function is needed by the
decompressor.

Probably, the #include  could be removed from
machine_kexec_reloc.c source as well, but that would revert
commit 155e6c706125 ("s390/kexec: add missing include to
machine_kexec_reloc.c").

Suggested-by: Nathan Chancellor 
Reported-by: Nathan Chancellor 
Reported-by: Linux Kernel Functional Testing 
Signed-off-by: Alexander Gordeev 
---
 arch/s390/kernel/Makefile | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/s390/kernel/Makefile b/arch/s390/kernel/Makefile
index 8d7514c72bb8..0df2b88cc0da 100644
--- a/arch/s390/kernel/Makefile
+++ b/arch/s390/kernel/Makefile
@@ -37,9 +37,9 @@ CFLAGS_unwind_bc.o+= -fno-optimize-sibling-calls
 obj-y  := head64.o traps.o time.o process.o earlypgm.o early.o setup.o idle.o 
vtime.o
 obj-y  += processor.o syscall.o ptrace.o signal.o cpcmd.o ebcdic.o nmi.o
 obj-y  += debug.o irq.o ipl.o dis.o diag.o vdso.o cpufeature.o
-obj-y  += sysinfo.o lgr.o os_info.o machine_kexec.o
+obj-y  += sysinfo.o lgr.o os_info.o
 obj-y  += runtime_instr.o cache.o fpu.o dumpstack.o guarded_storage.o sthyi.o
-obj-y  += entry.o reipl.o relocate_kernel.o kdebugfs.o alternative.o
+obj-y  += entry.o reipl.o kdebugfs.o alternative.o
 obj-y  += nospec-branch.o ipl_vmparm.o machine_kexec_reloc.o unwind_bc.o
 obj-y  += smp.o text_amode31.o stacktrace.o abs_lowcore.o
 
@@ -63,6 +63,7 @@ obj-$(CONFIG_RETHOOK) += rethook.o
 obj-$(CONFIG_FUNCTION_TRACER)  += ftrace.o
 obj-$(CONFIG_FUNCTION_TRACER)  += mcount.o
 obj-$(CONFIG_CRASH_DUMP)   += crash_dump.o
+obj-$(CONFIG_KEXEC_CORE)   += machine_kexec.o relocate_kernel.o
 obj-$(CONFIG_UPROBES)  += uprobes.o
 obj-$(CONFIG_JUMP_LABEL)   += jump_label.o
 
> Cheers,
> Nathan

Thanks!


Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

2023-07-05 Thread Alexander Gordeev
On Sat, Jul 01, 2023 at 09:32:38PM -0700, Hugh Dickins wrote:
> On Thu, 29 Jun 2023, Hugh Dickins wrote:

Hi Hugh,

...

> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> + struct page *page;

If I got your and Claudio's conversation right, you were going to add
a WARN_ON_ONCE() here in case of mm_alloc_pgste(mm)?

> + page = virt_to_page(pgtable);
> + SetPageActive(page);
> + page_table_free(mm, (unsigned long *)pgtable);
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
>  /*
>   * Base infrastructure required to generate basic asces, region, segment,
>   * and page tables that do not make use of enhanced features like EDAT1.

Thanks!


Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

2023-07-04 Thread Alexander Gordeev
On Sat, Jul 01, 2023 at 09:32:38PM -0700, Hugh Dickins wrote:
> On Thu, 29 Jun 2023, Hugh Dickins wrote:

Hi Hugh,

...
> No, not quite the same rules as before: I came to realize that using
> list_add_tail() for the HH pages would be liable to put a page on the
> list which forever blocked reuse of PP list_add_tail() pages after it
> (could be solved by a list_move() somewhere, but we have agreed to
> prefer simplicity).

Just to make things clearer for me: do I understand correctly that this
was an attempt to add HH fragments to pgtable_list from pte_free_defer()?

Thanks!


Re: [PATCH v3 12/13] s390/kexec: refactor for kernel/Kconfig.kexec

2023-06-30 Thread Alexander Gordeev
On Mon, Jun 26, 2023 at 12:13:31PM -0400, Eric DeVolder wrote:
> The kexec and crash kernel options are provided in the common
> kernel/Kconfig.kexec. Utilize the common options and provide
> the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
> equivalent set of KEXEC and CRASH options.
> 
> NOTE: The original Kconfig has a KEXEC_SIG which depends on
> MODULE_SIG_FORMAT. However, attempts to keep the MODULE_SIG_FORMAT
> dependency (using the strategy outlined in this series, and other
> techniques) results in 'error: recursive dependency detected'
> on CRYPTO.
> 
> Per Alexander Gordeev : "the MODULE_SIG_FORMAT
> dependency was introduced with [git commit below] and in fact was not
> necessary, since s390 did/does not use mod_check_sig() anyway.
> 
>  commit c8424e776b09 ("MODSIGN: Export module signature definitions")
> 
> MODULE_SIG_FORMAT is needed to select SYSTEM_DATA_VERIFICATION. But
> SYSTEM_DATA_VERIFICATION is also selected by FS_VERITY*, so dropping
> MODULE_SIG_FORMAT does not hurt."
> 
> Therefore, the solution is to drop the MODULE_SIG_FORMAT dependency
> from KEXEC_SIG. Still results in equivalent .config files for s390.
> 
> Signed-off-by: Eric DeVolder 
> ---
>  arch/s390/Kconfig | 65 ++-
>  1 file changed, 19 insertions(+), 46 deletions(-)
> 
> diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
> index 6dab9c1be508..58dc124433ca 100644
> --- a/arch/s390/Kconfig
> +++ b/arch/s390/Kconfig
> @@ -243,6 +243,25 @@ config PGTABLE_LEVELS
>  
>  source "kernel/livepatch/Kconfig"
>  
> +config ARCH_DEFAULT_KEXEC
> + def_bool y
> +
> +config ARCH_SUPPORTS_KEXEC
> + def_bool y
> +
> +config ARCH_SUPPORTS_KEXEC_FILE
> + def_bool CRYPTO && CRYPTO_SHA256 && CRYPTO_SHA256_S390
> +
> +config ARCH_HAS_KEXEC_PURGATORY
> + def_bool KEXEC_FILE
> +
> +config ARCH_SUPPORTS_CRASH_DUMP
> + def_bool y
> + help
> +   Refer to  for more details on 
> this.
> +   This option also enables s390 zfcpdump.
> +   See also 
> +
>  menu "Processor type and features"
>  
>  config HAVE_MARCH_Z10_FEATURES
> @@ -481,36 +500,6 @@ config SCHED_TOPOLOGY
>  
>  source "kernel/Kconfig.hz"
>  
> -config KEXEC
> - def_bool y
> - select KEXEC_CORE
> -
> -config KEXEC_FILE
> - bool "kexec file based system call"
> - select KEXEC_CORE
> - depends on CRYPTO
> - depends on CRYPTO_SHA256
> - depends on CRYPTO_SHA256_S390
> - help
> -   Enable the kexec file based system call. In contrast to the normal
> -   kexec system call this system call takes file descriptors for the
> -   kernel and initramfs as arguments.
> -
> -config ARCH_HAS_KEXEC_PURGATORY
> - def_bool y
> - depends on KEXEC_FILE
> -
> -config KEXEC_SIG
> - bool "Verify kernel signature during kexec_file_load() syscall"
> - depends on KEXEC_FILE && MODULE_SIG_FORMAT
> - help
> -   This option makes kernel signature verification mandatory for
> -   the kexec_file_load() syscall.
> -
> -   In addition to that option, you need to enable signature
> -   verification for the corresponding kernel image type being
> -   loaded in order for this to work.
> -
>  config KERNEL_NOBP
>   def_bool n
>   prompt "Enable modified branch prediction for the kernel by default"
> @@ -732,22 +721,6 @@ config VFIO_AP
>  
>  endmenu
>  
> -menu "Dump support"
> -
> -config CRASH_DUMP
> - bool "kernel crash dumps"
> - select KEXEC
> - help
> -   Generate crash dump after being started by kexec.
> -   Crash dump kernels are loaded in the main kernel with kexec-tools
> -   into a specially reserved region and then later executed after
> -   a crash by kdump/kexec.
> -   Refer to  for more details on 
> this.
> -   This option also enables s390 zfcpdump.
> -   See also 
> -
> -endmenu
> -
>  config CCW
>   def_bool y

Acked-by: Alexander Gordeev 


Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

2023-06-29 Thread Alexander Gordeev
On Wed, Jun 28, 2023 at 09:16:24PM +0200, Gerald Schaefer wrote:
> On Tue, 20 Jun 2023 00:51:19 -0700 (PDT)
> Hugh Dickins  wrote:

Hi Gerald, Hugh!

...
> @@ -407,6 +445,88 @@ void __tlb_remove_table(void *_table)
>   __free_page(page);
>  }
>  
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static void pte_free_now0(struct rcu_head *head);
> +static void pte_free_now1(struct rcu_head *head);

What about pte_free_lower() / pte_free_upper()?

...
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> + unsigned int bit, mask;
> + struct page *page;
> +
> + page = virt_to_page(pgtable);
> + if (mm_alloc_pgste(mm)) {
> + /*
> +  * TODO: Do we need gmap_unlink(mm, pgtable, addr), like in
> +  * page_table_free_rcu()?
> +  * If yes -> need addr parameter here, like in pte_free_tlb().
> +  */
> + call_rcu(&page->rcu_head, pte_free_pgste);
> + return;
> +}
> + bit = ((unsigned long)pgtable & ~PAGE_MASK) / (PTRS_PER_PTE * 
> sizeof(pte_t));
> +
> + spin_lock_bh(&mm->context.lock);
> + mask = atomic_xor_bits(&page->_refcount, 0x15U << (bit + 24));

This makes the bit logic increasingly complicated to me.

What if instead we set the rule "one bit at a time only"?
That means an atomic group bit flip is only allowed between
pairs of bits, namely:

bit flipinitiated from
--- 
P  <- A page_table_free(), page_table_free_rcu()
 H <- A pte_free_defer()
P <- H  pte_free_half()

In the current model the P bit could be on together with the H
bit simultaneously. That actually brings nothing to the equation.

Besides, this check in page_table_alloc() (while still
correct) makes one (well, me) wonder "what about HH bits?":

mask = (mask | (mask >> 4)) & 0x03U;
if (mask != 0x03U) {
...
}

By contrast, with the "one bit at a time only" policy each
of the three bits effectively indicates which state a page
half is currently in.
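
A minimal sketch of the idea (illustration only: AA/PP/HH are made-up
mask names, not the real bit layout, and atomic_xor_bits() is assumed
to behave like the s390 helper of the same name; the real code keeps
such bits in the upper byte of page->_refcount, cf. the
0x15U << (bit + 24) above):

#include <linux/atomic.h>

/* returns the new bit value after the xor */
static inline unsigned int atomic_xor_bits(atomic_t *v, unsigned int bits)
{
	return atomic_fetch_xor(bits, v) ^ bits;
}

#define AA	0x4U	/* half is allocated */
#define PP	0x1U	/* half is pending free */
#define HH	0x2U	/* half was handed over to pte_free_defer() */

/* P <- A: page_table_free(), page_table_free_rcu() */
static unsigned int half_alloc_to_pending(atomic_t *v, unsigned int shift)
{
	return atomic_xor_bits(v, (AA | PP) << shift);
}

/* H <- A: pte_free_defer() */
static unsigned int half_alloc_to_deferred(atomic_t *v, unsigned int shift)
{
	return atomic_xor_bits(v, (AA | HH) << shift);
}

/* P <- H: pte_free_half() */
static unsigned int half_deferred_to_pending(atomic_t *v, unsigned int shift)
{
	return atomic_xor_bits(v, (HH | PP) << shift);
}

Every helper flips exactly one pair of bits, so at any point in time
exactly one of the three bits describes the state of a page half.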

Thanks!


Re: [PATCH v2 12/13] s390/kexec: refactor for kernel/Kconfig.kexec

2023-06-22 Thread Alexander Gordeev
On Wed, Jun 21, 2023 at 12:10:49PM -0500, Eric DeVolder wrote:
Hi Eric,
...
> > > NOTE: The original Kconfig has a KEXEC_SIG which depends on
> > > MODULE_SIG_FORMAT. However, attempts to keep the MODULE_SIG_FORMAT
> > > dependency (using the strategy outlined in this series, and other
> > > techniques) results in 'error: recursive dependency detected'
> > > on CRYPTO. This occurs due to any path through KEXEC_SIG
> > > attempting to select CRYPTO is ultimately dependent upon CRYPTO:
> > > 
> > >   CRYPTO
> > ><- ARCH_SUPPORTS_KEXEC_FILE
> > >   <- KEXEC_FILE
> > >  <- KEXEC_SIG
> > > 
> > > Therefore, the solution is to drop the MODULE_SIG_FORMAT dependency
> > > for KEXEC_SIG. In practice, however, MODULE_SIG_FORMAT is still
> > > configured-in as the use of KEXEC_SIG is in step with the use of
> > > SYSTEM_DATA_VERIFICATION, which does select MODULE_SIG_FORMAT.
> > 
> > No, it is actually the other way around.
> > Could you please provide the correct explanation?
> > 
> > AFAICT the MODULE_SIG_FORMAT dependency was introduced with commit
> > c8424e776b09 ("MODSIGN: Export module signature definitions") and
> > in fact was not necessary, since s390 did/does not use mod_check_sig()
> anyway. So the SYSTEM_DATA_VERIFICATION could have been left intact.
> 
> Thomas, would the correct explanation be simply indicating that
> MODULE_SIG_FORMAT isn't needed as it is not used by s390 (crediting your
> summary above)?

I guess, you asked me? Anyway, I will try to answer as if I were Thomas :)

MODULE_SIG_FORMAT is needed to select SYSTEM_DATA_VERIFICATION.
But SYSTEM_DATA_VERIFICATION is also selected by FS_VERITY*, so
dropping MODULE_SIG_FORMAT does not hurt.

Thanks!


Re: [PATCH v2 12/13] s390/kexec: refactor for kernel/Kconfig.kexec

2023-06-21 Thread Alexander Gordeev
On Mon, Jun 19, 2023 at 10:58:00AM -0400, Eric DeVolder wrote:

Hi Eric,

> The kexec and crash kernel options are provided in the common
> kernel/Kconfig.kexec. Utilize the common options and provide
> the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
> equivalent set of KEXEC and CRASH options.
> 
> NOTE: The original Kconfig has a KEXEC_SIG which depends on
> MODULE_SIG_FORMAT. However, attempts to keep the MODULE_SIG_FORMAT
> dependency (using the strategy outlined in this series, and other
> techniques) results in 'error: recursive dependency detected'
> on CRYPTO. This occurs due to any path through KEXEC_SIG
> attempting to select CRYPTO is ultimately dependent upon CRYPTO:
> 
>  CRYPTO
>   <- ARCH_SUPPORTS_KEXEC_FILE
>  <- KEXEC_FILE
> <- KEXEC_SIG
> 
> Therefore, the solution is to drop the MODULE_SIG_FORMAT dependency
> for KEXEC_SIG. In practice, however, MODULE_SIG_FORMAT is still
> configured-in as the use of KEXEC_SIG is in step with the use of
> SYSTEM_DATA_VERIFICATION, which does select MODULE_SIG_FORMAT.

No, it is actually the other way around.
Could you please provide the correct explanation?

AFAICT the MODULE_SIG_FORMAT dependency was introduced with commit
c8424e776b09 ("MODSIGN: Export module signature definitions") and
in fact was not necessary, since s390 did/does not use mod_check_sig()
anyway. So the SYSTEM_DATA_VERIFICATION could have been left intact.

However, the original SYSTEM_DATA_VERIFICATION seems sane and I do
not understand why other architectures do not have it as well. Maybe
Mimi Zohar (putting her on CC) could explain that?

It looks like such dependency actually exists in implicit form
(which you picked from x86):

In addition to this option, you need to enable signature
verification for the corresponding kernel image type being
loaded in order for this to work.

Does it mean that if an architecture did not enable the signature
verification type explicitly the linker could fail - both before
and after your series?

Thanks!

> Not ideal, but results in equivalent .config files for s390.
> 
> Signed-off-by: Eric DeVolder 
> ---
>  arch/s390/Kconfig | 65 ++-
>  1 file changed, 19 insertions(+), 46 deletions(-)
> 
> diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
> index 6dab9c1be508..58dc124433ca 100644
> --- a/arch/s390/Kconfig
> +++ b/arch/s390/Kconfig
> @@ -243,6 +243,25 @@ config PGTABLE_LEVELS
>  
>  source "kernel/livepatch/Kconfig"
>  
> +config ARCH_DEFAULT_KEXEC
> + def_bool y
> +
> +config ARCH_SUPPORTS_KEXEC
> + def_bool y
> +
> +config ARCH_SUPPORTS_KEXEC_FILE
> + def_bool CRYPTO && CRYPTO_SHA256 && CRYPTO_SHA256_S390
> +
> +config ARCH_HAS_KEXEC_PURGATORY
> + def_bool KEXEC_FILE
> +
> +config ARCH_SUPPORTS_CRASH_DUMP
> + def_bool y
> + help
> +   Refer to  for more details on 
> this.
> +   This option also enables s390 zfcpdump.
> +   See also 
> +
>  menu "Processor type and features"
>  
>  config HAVE_MARCH_Z10_FEATURES
> @@ -481,36 +500,6 @@ config SCHED_TOPOLOGY
>  
>  source "kernel/Kconfig.hz"
>  
> -config KEXEC
> - def_bool y
> - select KEXEC_CORE
> -
> -config KEXEC_FILE
> - bool "kexec file based system call"
> - select KEXEC_CORE
> - depends on CRYPTO
> - depends on CRYPTO_SHA256
> - depends on CRYPTO_SHA256_S390
> - help
> -   Enable the kexec file based system call. In contrast to the normal
> -   kexec system call this system call takes file descriptors for the
> -   kernel and initramfs as arguments.
> -
> -config ARCH_HAS_KEXEC_PURGATORY
> - def_bool y
> - depends on KEXEC_FILE
> -
> -config KEXEC_SIG
> - bool "Verify kernel signature during kexec_file_load() syscall"
> - depends on KEXEC_FILE && MODULE_SIG_FORMAT
> - help
> -   This option makes kernel signature verification mandatory for
> -   the kexec_file_load() syscall.
> -
> -   In addition to that option, you need to enable signature
> -   verification for the corresponding kernel image type being
> -   loaded in order for this to work.
> -
>  config KERNEL_NOBP
>   def_bool n
>   prompt "Enable modified branch prediction for the kernel by default"
> @@ -732,22 +721,6 @@ config VFIO_AP
>  
>  endmenu
>  
> -menu "Dump support"
> -
> -config CRASH_DUMP
> - bool "kernel crash dumps"
> - select KEXEC
> - help
> -   Generate crash dump after being started by kexec.
> -   Crash dump kernels are loaded in the main kernel with kexec-tools
> -   into a specially reserved region and then later executed after
> -   a crash by kdump/kexec.
> -   Refer to  for more details on 
> this.
> -   This option also enables s390 zfcpdump.
> -   See also 
> -
> -endmenu
> -
>  config CCW
>   def_bool y
>  
> -- 
> 2.31.1
> 


Re: [PATCH v1 01/21] kexec: consolidate kexec and crash options into kernel/Kconfig.kexec

2023-06-14 Thread Alexander Gordeev
On Mon, Jun 12, 2023 at 01:27:53PM -0400, Eric DeVolder wrote:
...
> +config KEXEC_FILE
> + bool "Enable kexec file based system call"
> + depends on ARCH_HAS_KEXEC_FILE
> + select KEXEC_CORE
> + help
> +   This is new version of kexec system call. This system call is
> +   file based and takes file descriptors as system call argument
> +   for kernel and initramfs as opposed to list of segments as
> +   accepted by previous system call.

Which "previous"? I guess, "by kexec system call" would sound clear.

Thanks!


Re: [PATCH] irq_work: consolidate arch_irq_work_raise prototypes

2023-05-17 Thread Alexander Gordeev
On Tue, May 16, 2023 at 10:02:31PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> The prototype was hidden on x86, which causes a warning:
> 
> kernel/irq_work.c:72:13: error: no previous prototype for 
> 'arch_irq_work_raise' [-Werror=missing-prototypes]
> 
> Fix this by providing it in only one place that is always visible.
> 
> Signed-off-by: Arnd Bergmann 
> ---
>  arch/arm/include/asm/irq_work.h | 2 --
>  arch/arm64/include/asm/irq_work.h   | 2 --
>  arch/csky/include/asm/irq_work.h| 2 +-
>  arch/powerpc/include/asm/irq_work.h | 1 -
>  arch/riscv/include/asm/irq_work.h   | 2 +-
>  arch/s390/include/asm/irq_work.h| 2 --
>  arch/x86/include/asm/irq_work.h | 1 -
>  include/linux/irq_work.h| 3 +++
>  8 files changed, 5 insertions(+), 10 deletions(-)

...

> diff --git a/arch/s390/include/asm/irq_work.h 
> b/arch/s390/include/asm/irq_work.h
> index 603783766d0a..f00c9f610d5a 100644
> --- a/arch/s390/include/asm/irq_work.h
> +++ b/arch/s390/include/asm/irq_work.h
> @@ -7,6 +7,4 @@ static inline bool arch_irq_work_has_interrupt(void)
>   return true;
>  }
>  
> -void arch_irq_work_raise(void);
> -
>  #endif /* _ASM_S390_IRQ_WORK_H */

...

> diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
> index 8cd11a223260..136f2980cba3 100644
> --- a/include/linux/irq_work.h
> +++ b/include/linux/irq_work.h
> @@ -66,6 +66,9 @@ void irq_work_sync(struct irq_work *work);
>  void irq_work_run(void);
>  bool irq_work_needs_cpu(void);
>  void irq_work_single(void *arg);
> +
> +void arch_irq_work_raise(void);
> +
>  #else
>  static inline bool irq_work_needs_cpu(void) { return false; }
>  static inline void irq_work_run(void) { }

For s390:

Reviewed-by: Alexander Gordeev 


Re: [PATCH] procfs: consolidate arch_report_meminfo declaration

2023-05-17 Thread Alexander Gordeev
On Tue, May 16, 2023 at 09:57:29PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> The arch_report_meminfo() function is provided by four architectures,
> with a __weak fallback in procfs itself. On architectures that don't
> have a custom version, the __weak version causes a warning because
> of the missing prototype.
> 
> Remove the architecture specific prototypes and instead add one
> in linux/proc_fs.h.
> 
> Signed-off-by: Arnd Bergmann 
> ---
>  arch/parisc/include/asm/pgtable.h| 3 ---
>  arch/powerpc/include/asm/pgtable.h   | 3 ---
>  arch/s390/include/asm/pgtable.h  | 3 ---
>  arch/s390/mm/pageattr.c  | 1 +
>  arch/x86/include/asm/pgtable.h   | 1 +
>  arch/x86/include/asm/pgtable_types.h | 3 ---
>  arch/x86/mm/pat/set_memory.c | 1 +
>  include/linux/proc_fs.h  | 2 ++
>  8 files changed, 5 insertions(+), 12 deletions(-)
...
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 6822a11c2c8a..c55f3c3365af 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -42,9 +42,6 @@ static inline void update_page_count(int level, long count)
>   atomic_long_add(count, &direct_pages_count[level]);
>  }
>  
> -struct seq_file;
> -void arch_report_meminfo(struct seq_file *m);
> -
>  /*
>   * The S390 doesn't have any external MMU info: the kernel page
>   * tables contain all the necessary information.
> diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
> index 5ba3bd8a7b12..ca5a418c58a8 100644
> --- a/arch/s390/mm/pageattr.c
> +++ b/arch/s390/mm/pageattr.c
> @@ -4,6 +4,7 @@
>   * Author(s): Jan Glauber 
>   */
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 

For s390:

Reviewed-by: Alexander Gordeev 


Re: [PATCH 16/23] s390: gmap use pte_unmap_unlock() not spin_unlock()

2023-05-17 Thread Alexander Gordeev
On Tue, May 09, 2023 at 10:02:32PM -0700, Hugh Dickins wrote:
> pte_alloc_map_lock() expects to be followed by pte_unmap_unlock(): to
> keep balance in future, pass ptep as well as ptl to gmap_pte_op_end(),
> and use pte_unmap_unlock() instead of direct spin_unlock() (even though
> ptep ends up unused inside the macro).
> 
> Signed-off-by: Hugh Dickins 
> ---
>  arch/s390/mm/gmap.c | 22 +++---
>  1 file changed, 11 insertions(+), 11 deletions(-)

Acked-by: Alexander Gordeev 


Re: [PATCH] cputime: remove cputime_to_nsecs fallback

2022-12-21 Thread Alexander Gordeev
On Tue, Dec 20, 2022 at 05:07:05PM +1000, Nicholas Piggin wrote:
> The archs that use cputime_to_nsecs() internally provide their own
> definition and don't need the fallback. cputime_to_usecs() unused except
> in this fallback, and is not defined anywhere.
> 
> This removes the final remnant of the cputime_t code from the kernel.
> 
> Cc: Peter Zijlstra 
> Cc: Rik van Riel 
> Cc: Thomas Gleixner 
> Cc: Frederic Weisbecker 
> Cc: Sven Schnelle 
> Cc: Arnd Bergmann 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-s...@vger.kernel.org
> Signed-off-by: Nicholas Piggin 
> ---
> This required a couple of tweaks to s390 includes so we're not pulling
> asm/cputime.h into the core header unnecessarily. In that case maybe
> this can go via s390 tree because the patch should be otherwise quite
> trivial. Could it get an ack or two from a core maintainer to support
> that?
> 
> Thanks,
> Nick
> 
>  arch/s390/kernel/idle.c   | 2 +-
>  arch/s390/kernel/vtime.c  | 2 +-
>  include/linux/sched/cputime.h | 9 -
>  kernel/sched/cputime.c| 4 
>  4 files changed, 6 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/s390/kernel/idle.c b/arch/s390/kernel/idle.c
> index 4bf1ee293f2b..a6bbceaf7616 100644
> --- a/arch/s390/kernel/idle.c
> +++ b/arch/s390/kernel/idle.c
> @@ -12,9 +12,9 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include "entry.h"
> diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c
> index 9436f3053b88..e0a88dcaf5cb 100644
> --- a/arch/s390/kernel/vtime.c
> +++ b/arch/s390/kernel/vtime.c
> @@ -7,13 +7,13 @@
>   */
>  
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> diff --git a/include/linux/sched/cputime.h b/include/linux/sched/cputime.h
> index ce3c58286062..5f8fd5b24a2e 100644
> --- a/include/linux/sched/cputime.h
> +++ b/include/linux/sched/cputime.h
> @@ -8,15 +8,6 @@
>   * cputime accounting APIs:
>   */
>  
> -#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
> -#include 
> -
> -#ifndef cputime_to_nsecs
> -# define cputime_to_nsecs(__ct)  \
> - (cputime_to_usecs(__ct) * NSEC_PER_USEC)
> -#endif
> -#endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
> -
>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
>  extern bool task_cputime(struct task_struct *t,
>u64 *utime, u64 *stime);
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 95fc77853743..af7952f12e6c 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -3,6 +3,10 @@
>   * Simple CPU accounting cgroup controller
>   */
>  
> +#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
> + #include 
> +#endif
> +
>  #ifdef CONFIG_IRQ_TIME_ACCOUNTING
>  
>  /*

For s390:

Acked-by: Alexander Gordeev 


Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-12-03 Thread Alexander Gordeev
On Thu, Dec 03, 2020 at 09:14:22AM -0800, Andy Lutomirski wrote:
> 
> 
> > On Dec 3, 2020, at 9:09 AM, Alexander Gordeev  
> > wrote:
> > 
> > On Mon, Nov 30, 2020 at 10:31:51AM -0800, Andy Lutomirski wrote:
> >> other arch folk: there's some background here:
> 
> > 
> >> 
> >> power: Ridiculously complicated, seems to vary by system and kernel config.
> >> 
> >> So, Nick, your unconditional IPI scheme is apparently a big
> >> improvement for power, and it should be an improvement and have low
> >> cost for x86.  On arm64 and s390x it will add more IPIs on process
> >> exit but reduce contention on context switching depending on how lazy
> > 
> > s390 does not invalidate TLBs per-CPU explicitly - we have special
> > instructions for that. Those in turn initiate signalling to other
> > CPUs, completely transparent to OS.
> 
> Just to make sure I understand: this means that you broadcast flushes to all 
> CPUs, not just a subset?

Correct.
If mm has one CPU attached we flush TLB only for that CPU.
If mm has more than one CPUs attached we flush all CPUs' TLBs.

In fact, the details are a bit more complicated, since the hardware
is able to flush subsets of TLB entries depending on the provided
parameters (e.g. the page tables used to create those entries).
But we cannot select a CPU subset.
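
In (very rough) pseudo-C, the shape of the decision is as below; this
is just an illustration of what I described, not the actual s390 code,
and the two flush helpers are placeholders:

#include <linux/cpumask.h>
#include <linux/mm_types.h>
#include <linux/smp.h>

static void local_tlb_flush(void);	/* placeholder: flush this CPU only */
static void broadcast_tlb_flush(void);	/* placeholder: hardware signals all CPUs */

static void tlb_flush_mm_sketch(struct mm_struct *mm)
{
	if (cpumask_equal(mm_cpumask(mm), cpumask_of(smp_processor_id())))
		local_tlb_flush();	/* mm only ever attached to this CPU */
	else
		broadcast_tlb_flush();	/* no IPIs issued by the OS */
}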

> > Apart from mm_count, I am struggling to realize how the suggested
> > scheme could change the contention on s390 in connection with
> > TLB. Could you clarify a bit here, please?
> 
> I’m just talking about mm_count. Maintaining mm_count is quite expensive on 
> some workloads.
> 
> > 
> >> TLB works.  I suppose we could try it for all architectures without
> >> any further optimizations.  Or we could try one of the perhaps
> >> excessively clever improvements I linked above.  arm64, s390x people,
> >> what do you think?
> > 
> > I do not immediately see anything in the series that would harm
> > performance on s390.
> > 
> > We however use mm_cpumask to distinguish between local and global TLB
> > flushes. With this series it looks like mm_cpumask is *required* to
> > be consistent with lazy users. And that is something quite difficult
> > for us to adhere to (at least in the foreseeable future).
> 
> You don’t actually need to maintain mm_cpumask — we could scan all CPUs 
> instead.
> 
> > 
> > But actually keeping track of lazy users in a cpumask is something
> > the generic code would rather do AFAICT.
> 
> The problem is that arches don’t agree on what the contents of mm_cpumask 
> should be.  Tracking a mask of exactly what the arch wants in generic code is 
> a nontrivial operation.

It could be yet another cpumask or the CPU scan you mentioned.
Just wanted to make sure there is no new requirement for an arch
to maintain mm_cpumask ;)

Thanks, Andy!


Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-12-03 Thread Alexander Gordeev
On Mon, Nov 30, 2020 at 10:31:51AM -0800, Andy Lutomirski wrote:
> other arch folk: there's some background here:
> 
> https://lkml.kernel.org/r/calcetrvxube8lfnn-qs+dzroqaiw+sfug1j047ybyv31sat...@mail.gmail.com
> 
> On Sun, Nov 29, 2020 at 12:16 PM Andy Lutomirski  wrote:
> >
> > On Sat, Nov 28, 2020 at 7:54 PM Andy Lutomirski  wrote:
> > >
> > > On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin  wrote:
> > > >
> > > > On big systems, the mm refcount can become highly contented when doing
> > > > a lot of context switching with threaded applications (particularly
> > > > switching between the idle thread and an application thread).
> > > >
> > > > Abandoning lazy tlb slows switching down quite a bit in the important
> > > > user->idle->user cases, so so instead implement a non-refcounted scheme
> > > > that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
> > > > any remaining lazy ones.
> > > >
> > > > Shootdown IPIs are some concern, but they have not been observed to be
> > > > a big problem with this scheme (the powerpc implementation generated
> > > > 314 additional interrupts on a 144 CPU system during a kernel compile).
> > > > There are a number of strategies that could be employed to reduce IPIs
> > > > if they turn out to be a problem for some workload.
> > >
> > > I'm still wondering whether we can do even better.
> > >
> >
> > Hold on a sec.. __mmput() unmaps VMAs, frees pagetables, and flushes
> > the TLB.  On x86, this will shoot down all lazies as long as even a
> > single pagetable was freed.  (Or at least it will if we don't have a
> > serious bug, but the code seems okay.  We'll hit pmd_free_tlb, which
> > sets tlb->freed_tables, which will trigger the IPI.)  So, on
> > architectures like x86, the shootdown approach should be free.  The
> > only way it ought to have any excess IPIs is if we have CPUs in
> > mm_cpumask() that don't need IPI to free pagetables, which could
> > happen on paravirt.
> 
> Indeed, on x86, we do this:
> 
> [   11.558844]  flush_tlb_mm_range.cold+0x18/0x1d
> [   11.559905]  tlb_finish_mmu+0x10e/0x1a0
> [   11.561068]  exit_mmap+0xc8/0x1a0
> [   11.561932]  mmput+0x29/0xd0
> [   11.562688]  do_exit+0x316/0xa90
> [   11.563588]  do_group_exit+0x34/0xb0
> [   11.564476]  __x64_sys_exit_group+0xf/0x10
> [   11.565512]  do_syscall_64+0x34/0x50
> 
> and we have info->freed_tables set.
> 
> What are the architectures that have large systems like?
> 
> x86: we already zap lazies, so it should cost basically nothing to do
> a little loop at the end of __mmput() to make sure that no lazies are
> left.  If we care about paravirt performance, we could implement one
> of the optimizations I mentioned above to fix up the refcounts instead
> of sending an IPI to any remaining lazies.
> 
> arm64: AFAICT arm64's flush uses magic arm64 hardware support for
> remote flushes, so any lazy mm references will still exist after
> exit_mmap().  (arm64 uses lazy TLB, right?)  So this is kind of like
> the x86 paravirt case.  Are there large enough arm64 systems that any
> of this matters?
> 
> s390x: The code has too many acronyms for me to understand it fully,
> but I think it's more or less the same situation as arm64.  How big do
> s390x systems come?
> 
> power: Ridiculously complicated, seems to vary by system and kernel config.
> 
> So, Nick, your unconditional IPI scheme is apparently a big
> improvement for power, and it should be an improvement and have low
> cost for x86.  On arm64 and s390x it will add more IPIs on process
> exit but reduce contention on context switching depending on how lazy

s390 does not invalidate TLBs per-CPU explicitly - we have special
instructions for that. Those in turn initiate signalling to other
CPUs, completely transparent to OS.

Apart from mm_count, I am struggling to see how the suggested
scheme could change the contention on s390 in connection with
the TLB. Could you clarify a bit here, please?

> TLB works.  I suppose we could try it for all architectures without
> any further optimizations.  Or we could try one of the perhaps
> excessively clever improvements I linked above.  arm64, s390x people,
> what do you think?

I do not immediately see anything in the series that would harm
performance on s390.

We however use mm_cpumask to distinguish between local and global TLB
flushes. With this series it looks like mm_cpumask is *required* to
be consistent with lazy users. And that is something quite difficult
for us to adhere to (at least in the foreseeable future).

But actually keeping track of lazy users in a cpumask is something
the generic code would rather do AFAICT.

Thanks!


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-11 Thread Alexander Gordeev
On Thu, Sep 10, 2020 at 07:11:16PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 10, 2020 at 02:22:37PM -0700, John Hubbard wrote:
> 
> > Or am I way off here, and it really is possible (aside from the current
> > s390 situation) to observe something that "is no longer a page table"?
> 
> Yes, that is the issue. Remember there is no locking for GUP
> fast. While a page table cannot be freed there is nothing preventing
> the page table entry from being concurrently modified.
> 
> Without the stack variable it looks like this:
> 
>pud_t pud = READ_ONCE(*pudp);
>if (!pud_present(pud))
> return
>pmd_offset(pudp, address);
> 
> And pmd_offset() expands to
> 
> return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
> 
> Between the READ_ONCE(*pudp) and (*pud) inside pmd_offset() the value
> of *pud can change, eg to !pud_present.
> 
> Then pud_page_vaddr(*pud) will crash. It is not use after free, it
> is using data that has not been validated.

One thing I ask myself, and it is probably a good moment to wonder:

What if the entry is still pud_present, but got remapped after
READ_ONCE(*pudp)? IOW, it is still valid, but points elsewhere?

> Jason


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-10 Thread Alexander Gordeev
On Wed, Sep 09, 2020 at 03:03:24PM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 09, 2020 at 07:25:34PM +0200, Gerald Schaefer wrote:
> > I actually had to draw myself a picture to get some hold of
> > this, or rather a walk-through with a certain pud-crossing
> > range in a folded 3-level scenario. Not sure if I would have
> > understood my explanation above w/o that, but I hope you can
> > make some sense out of it. Or draw yourself a picture :-)
> 
> What I don't understand is how does anything work with S390 today?
> 
> If the fix is only to change pxx_addr_end() then than generic code
> like mm/pagewalk.c will iterate over a *different list* of page table
> entries. 
> 
> It's choice of entries to look at is entirely driven by pxx_addr_end().
> 
> Which suggest to me that mm/pagewalk.c also doesn't work properly
> today on S390 and this issue is not really about stack variables?
> 
> Fundamentally if pXX_offset() and pXX_addr_end() must be consistent
> together, if pXX_offset() is folded then pXX_addr_end() must cause a
> single iteration of that level.

Your observation is correct.

Another way to describe the problem is that the existing pXd_addr_end
helpers could be applied to mismatching levels on s390 (e.g. p4d_addr_end
applied to a pud, or pgd_addr_end applied to a p4d). As you noticed,
all *_pXd_range iterators could be called with address ranges that
exceed a single pXd table.

However, when it happens with pointers to real page tables (passed to
the *_pXd_range iterators) we still operate on valid tables, which just
(luckily for us) happened to be folded. Thus we still reference correct
table entries.

It is only the gup_fast case that exposes the issue. It hits because
pointers to stack copies are passed to the gup_pXd_range iterators, not
pointers to the real page tables themselves.
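
To illustrate the contrast (hypothetical walker names, not the actual
mm code):

#include <linux/pgtable.h>

/* hypothetical next-level walkers, declared only to keep the sketch complete */
static void walk_pmd_range(pud_t *pudp, unsigned long addr, unsigned long next);
static void gup_pmd_range(pud_t *pudp, unsigned long addr, unsigned long next);

/* generic walker style: the callee gets a pointer into the real pud table */
static void walk_style(pud_t *pudp, unsigned long addr, unsigned long next)
{
	walk_pmd_range(pudp, addr, next);
}

/* gup_fast style: the callee only ever sees a stack copy of the entry */
static void gup_fast_style(pud_t *pudp, unsigned long addr, unsigned long next)
{
	pud_t pud = READ_ONCE(*pudp);

	if (!pud_present(pud))
		return;
	gup_pmd_range(&pud, addr, next);
}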

As Gerald mentioned, it is very difficult to explain in a clear way.
Hopefully, one could make sense out of it.

> Jason


Re: [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions

2020-09-08 Thread Alexander Gordeev
On Tue, Sep 08, 2020 at 07:19:38AM +0200, Christophe Leroy wrote:

[...]

> >diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >index 67ebc22cf83d..d9e7d16c2263 100644
> >--- a/include/linux/pgtable.h
> >+++ b/include/linux/pgtable.h
> >@@ -656,31 +656,35 @@ static inline int arch_unmap_one(struct mm_struct *mm,
> >   */
> >  #ifndef pgd_addr_end
> >-#define pgd_addr_end(pgd, addr, end)
> >\
> >-({  unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;  \
> >-(__boundary - 1 < (end) - 1)? __boundary: (end);\
> >-})
> >+#define pgd_addr_end pgd_addr_end
> 
> I think that #define is pointless, usually there is no such #define
> for the default case.

Default pgd_addr_end() gets overridden on s390 (arch/s390/include/asm/pgtable.h):

#define pgd_addr_end pgd_addr_end
static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, 
unsigned long end)
{
return rste_addr_end_folded(pgd_val(pgd), addr, end);
}

> >+static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, 
> >unsigned long end)
> >+{   unsigned long __boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
> >+return (__boundary - 1 < end - 1) ? __boundary : end;
> >+}


Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware

2020-09-08 Thread Alexander Gordeev
On Tue, Sep 08, 2020 at 07:14:38AM +0200, Christophe Leroy wrote:
[...]
> You forgot arch/powerpc/mm/book3s64/subpage_prot.c it seems.

Would this one be okay?

diff --git a/arch/powerpc/mm/book3s64/subpage_prot.c 
b/arch/powerpc/mm/book3s64/subpage_prot.c
index 60c6ea16..3690d22 100644
--- a/arch/powerpc/mm/book3s64/subpage_prot.c
+++ b/arch/powerpc/mm/book3s64/subpage_prot.c
@@ -88,6 +88,7 @@ static void hpte_flush_range(struct mm_struct *mm, unsigned 
long addr,
 static void subpage_prot_clear(unsigned long addr, unsigned long len)
 {
struct mm_struct *mm = current->mm;
+   pmd_t *pmd = pmd_off(mm, addr);
struct subpage_prot_table *spt;
u32 **spm, *spp;
unsigned long i;
@@ -103,8 +104,8 @@ static void subpage_prot_clear(unsigned long addr, unsigned 
long len)
limit = addr + len;
if (limit > spt->maxaddr)
limit = spt->maxaddr;
-   for (; addr < limit; addr = next) {
-   next = pmd_addr_end(addr, limit);
+   for (; addr < limit; addr = next, pmd++) {
+   next = pmd_addr_end(*pmd, addr, limit);
if (addr < 0x1UL) {
spm = spt->low_prot;
} else {
@@ -191,6 +192,7 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, 
unsigned long addr,
unsigned long, len, u32 __user *, map)
 {
struct mm_struct *mm = current->mm;
+   pmd_t *pmd = pmd_off(mm, addr);
struct subpage_prot_table *spt;
u32 **spm, *spp;
unsigned long i;
@@ -236,8 +238,8 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, 
unsigned long addr,
}
 
subpage_mark_vma_nohuge(mm, addr, len);
-   for (limit = addr + len; addr < limit; addr = next) {
-   next = pmd_addr_end(addr, limit);
+   for (limit = addr + len; addr < limit; addr = next, pmd++) {
+   next = pmd_addr_end(*pmd, addr, limit);
err = -ENOMEM;
if (addr < 0x1UL) {
spm = spt->low_prot;

Thanks!

> Christophe


Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware

2020-09-08 Thread Alexander Gordeev
On Tue, Sep 08, 2020 at 10:16:49AM +0200, Christophe Leroy wrote:
> >Yes, and also two more sources :/
> > arch/powerpc/mm/kasan/8xx.c
> > arch/powerpc/mm/kasan/kasan_init_32.c
> >
> >But these two are not quite obvious wrt pgd_addr_end() used
> >while traversing pmds. Could you please clarify a bit?
> >
> >
> >diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c
> >index 2784224..89c5053 100644
> >--- a/arch/powerpc/mm/kasan/8xx.c
> >+++ b/arch/powerpc/mm/kasan/8xx.c
> >@@ -15,8 +15,8 @@
> > for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block 
> > += SZ_8M) {
> > pte_basic_t *new;
> >-k_next = pgd_addr_end(k_cur, k_end);
> >-k_next = pgd_addr_end(k_next, k_end);
> >+k_next = pmd_addr_end(k_cur, k_end);
> >+k_next = pmd_addr_end(k_next, k_end);
> 
> No, I don't think so.
> On powerpc32 we have only two levels, so pgd and pmd are more or
> less the same.
> But pmd_addr_end() as defined in include/asm-generic/pgtable-nopmd.h
> is a no-op, so I don't think it will work.
> 
> It is likely that this function should iterate on pgd, then you get
> pmd = pmd_offset(pud_offset(p4d_offset(pgd)));

It looks like the code iterates over a single pmd table while using
pgd_addr_end() only to skip all the middle levels and bail out
from the loop.

I would be wary of switching from pmds to pgds, since we are
trying to minimize the impact (especially functional) and the
rework does not seem that obvious.

Assuming pmd and pgd are the same, would such an approach
actually work for now?

diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c
index 2784224..94466cc 100644
--- a/arch/powerpc/mm/kasan/8xx.c
+++ b/arch/powerpc/mm/kasan/8xx.c
@@ -15,8 +15,8 @@
for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block 
+= SZ_8M) {
pte_basic_t *new;
 
-   k_next = pgd_addr_end(k_cur, k_end);
-   k_next = pgd_addr_end(k_next, k_end);
+   k_next = pgd_addr_end(__pgd(pmd_val(*pmd)), k_cur, k_end);
+   k_next = pgd_addr_end(__pgd(pmd_val(*(pmd + 1))), k_next, 
k_end);
if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
continue;
 
diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c 
b/arch/powerpc/mm/kasan/kasan_init_32.c
index fb29404..c0bcd64 100644
--- a/arch/powerpc/mm/kasan/kasan_init_32.c
+++ b/arch/powerpc/mm/kasan/kasan_init_32.c
@@ -38,7 +38,7 @@ int __init kasan_init_shadow_page_tables(unsigned long 
k_start, unsigned long k_
for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd++) {
pte_t *new;
 
-   k_next = pgd_addr_end(k_cur, k_end);
+   k_next = pgd_addr_end(__pgd(pmd_val(*pmd)), k_cur, k_end);
if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
continue;
 
@@ -196,7 +196,7 @@ void __init kasan_early_init(void)
kasan_populate_pte(kasan_early_shadow_pte, PAGE_KERNEL);
 
do {
-   next = pgd_addr_end(addr, end);
+   next = pgd_addr_end(__pgd(pmd_val(*pmd)), addr, end);
pmd_populate_kernel(_mm, pmd, kasan_early_shadow_pte);
} while (pmd++, addr = next, addr != end);
 

Alternatively we could pass an invalid pgd to keep the code structure
intact, but that of course is less nice.

Thanks!

> Christophe


Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware

2020-09-08 Thread Alexander Gordeev
On Tue, Sep 08, 2020 at 07:14:38AM +0200, Christophe Leroy wrote:
> You forgot arch/powerpc/mm/book3s64/subpage_prot.c it seems.

Yes, and also two more sources :/
arch/powerpc/mm/kasan/8xx.c
arch/powerpc/mm/kasan/kasan_init_32.c

But these two are not quite obvious wrt pgd_addr_end() used
while traversing pmds. Could you please clarify a bit?


diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c
index 2784224..89c5053 100644
--- a/arch/powerpc/mm/kasan/8xx.c
+++ b/arch/powerpc/mm/kasan/8xx.c
@@ -15,8 +15,8 @@
for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block 
+= SZ_8M) {
pte_basic_t *new;
 
-   k_next = pgd_addr_end(k_cur, k_end);
-   k_next = pgd_addr_end(k_next, k_end);
+   k_next = pmd_addr_end(k_cur, k_end);
+   k_next = pmd_addr_end(k_next, k_end);
if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
continue;
 
diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c 
b/arch/powerpc/mm/kasan/kasan_init_32.c
index fb29404..3f7d6dc6 100644
--- a/arch/powerpc/mm/kasan/kasan_init_32.c
+++ b/arch/powerpc/mm/kasan/kasan_init_32.c
@@ -38,7 +38,7 @@ int __init kasan_init_shadow_page_tables(unsigned long 
k_start, unsigned long k_
for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd++) {
pte_t *new;
 
-   k_next = pgd_addr_end(k_cur, k_end);
+   k_next = pmd_addr_end(k_cur, k_end);
if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
continue;
 
@@ -196,7 +196,7 @@ void __init kasan_early_init(void)
kasan_populate_pte(kasan_early_shadow_pte, PAGE_KERNEL);
 
do {
-   next = pgd_addr_end(addr, end);
+   next = pmd_addr_end(addr, end);
pmd_populate_kernel(_mm, pmd, kasan_early_shadow_pte);
} while (pmd++, addr = next, addr != end);
 

> Christophe


Re: [PATCH V4 0/4] mm/debug_vm_pgtable: Add some more tests

2020-07-06 Thread Alexander Gordeev
On Mon, Jul 06, 2020 at 06:18:32AM +0530, Anshuman Khandual wrote:

[...]

> Tested on arm64, x86 platforms but only build tested on all other enabled
> platforms through ARCH_HAS_DEBUG_VM_PGTABLE i.e powerpc, arc, s390. The

Sorry for not testing this earlier. Works for me on s390. Also tried
setting/unsetting a few relevant config options.

Thanks!


Re: FSL P5020/P5040: DPAA Ethernet issue with the latest Git kernel

2020-06-25 Thread Alexander Gordeev
On Thu, Jun 25, 2020 at 01:01:52AM +0200, Christian Zigotzky wrote:
[...]
> I compiled a test kernel with the option "CONFIG_TEST_BITMAP=y"
> yesterday. After that Skateman and I booted it and looked for the
> bitmap tests with "dmesg | grep -i bitmap".
> 
> Results:
> 
> FSL P5020:
> 
> [    0.297756] test_bitmap: loaded.
> [    0.298113] test_bitmap: parselist: 14: input is '0-2047:128/256'
> OK, Time: 562
> [    0.298142] test_bitmap: parselist_user: 14: input is
> '0-2047:128/256' OK, Time: 761
> [    0.301553] test_bitmap: all 1663 tests passed
> 
> FSL P5040:
> 
> [    0.296563] test_bitmap: loaded.
> [    0.296894] test_bitmap: parselist: 14: input is '0-2047:128/256'
> OK, Time: 540
> [    0.296920] test_bitmap: parselist_user: 14: input is
> '0-2047:128/256' OK, Time: 680
> [    0.24] test_bitmap: all 1663 tests passed

Thanks for the test! So it works as expected.

I would suggest comparing what is going on during device probing
with and without the bisected commit.

There seems to be a MAC and PHY mode initialization issue that might
have resulted from the bitmap format change.

I put Madalin and Sascha on CC as they have done some work on
this part recently.

Thanks!


> Thanks,
> Christian


Re: FSL P5020/P5040: DPAA Ethernet issue with the latest Git kernel

2020-06-24 Thread Alexander Gordeev
On Sun, Jun 21, 2020 at 08:40:14AM +0200, Christian Zigotzky wrote:
> Hello Alexander,
> 
> The DPAA Ethernet doesn’t work anymore on our FSL P5020/P5040 boards [1] 
> since the RC1 of kernel 5.8 [2].
> We bisected last days [3] and found the problematic commit [4]. I was able to 
> revert it [5]. After that the DPAA Ethernet works again. I created a patch 
> for reverting the commit [5]. This patch works and I will use it for the RC2.
> Could you please check your commit? [4]

Hi Christian,

Could you please check if the kernel passes the CONFIG_TEST_BITMAP self-test?

Thanks!

> Thanks,
> Christian
> 
> [1] http://wiki.amiga.org/index.php?title=X5000
> [2] https://forum.hyperion-entertainment.com/viewtopic.php?p=50885#p50885
> [3] https://forum.hyperion-entertainment.com/viewtopic.php?p=50892#p50892
> [4] lib: fix bitmap_parse() on 64-bit big endian archs: 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=81c4f4d924d5d009b5ed785a3e22b18d0f7b831f
> [5] https://forum.hyperion-entertainment.com/viewtopic.php?p=50982#p50982
> 
> 


Re: [PATCH 0/2] PCI/MSI: Remove arch_msi_check_device()

2014-09-07 Thread Alexander Gordeev
On Fri, Sep 05, 2014 at 03:27:49PM -0600, Bjorn Helgaas wrote:
  I applied these (with Michael's ack on the first, and v2 of the second) to
  pci/msi for v3.18, thanks!

Hi Bjorn,

I resent the series with updates that fix the kbuild robot errors.
Hopefully, the rebase of pci/msi will not cause trouble for anyone.

 Oh, I forgot -- if you'd rather take the first one through the PPC
 tree, you can do that and I can merge the second one later.  Let me
 know if you want to do that.

Nah, your tree is just fine.

Thanks!

 Bjorn

-- 
Regards,
Alexander Gordeev
agord...@redhat.com

[PATCH v2 3/3] PCI/MSI: Remove arch_msi_check_device()

2014-09-07 Thread Alexander Gordeev
There are no archs that override the arch_msi_check_device()
hook. Remove it as it is completely redundant.

If an arch needs to check MSI/MSI-X feasibility for a
device, it should do so within its arch_setup_msi_irqs() hook.
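
For illustration only (hypothetical helpers, not part of this patch),
such an arch-side check would then simply sit at the top of the hook:

#include <linux/msi.h>
#include <linux/pci.h>

static bool platform_supports_msi(struct pci_dev *dev);				/* placeholder */
static int platform_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);	/* placeholder */

int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
{
	if (!platform_supports_msi(dev))
		return -ENODEV;

	return platform_setup_msi_irqs(dev, nvec, type);
}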

Cc: Michael Ellerman m...@ellerman.id.au
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-...@vger.kernel.org
Signed-off-by: Alexander Gordeev agord...@redhat.com
---
 drivers/pci/msi.c   | 49 +
 include/linux/msi.h |  3 ---
 2 files changed, 13 insertions(+), 39 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 5a40516..6c2cc41 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -56,16 +56,6 @@ void __weak arch_teardown_msi_irq(unsigned int irq)
chip->teardown_irq(chip, irq);
 }
 
-int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
-{
-   struct msi_chip *chip = dev->bus->msi;
-
-   if (!chip || !chip->check_device)
-   return 0;
-
-   return chip->check_device(chip, dev, nvec, type);
-}
-
 int __weak arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
 {
struct msi_desc *entry;
@@ -806,22 +796,23 @@ out_free:
 }
 
 /**
- * pci_msi_check_device - check whether MSI may be enabled on a device
+ * pci_msi_supported - check whether MSI may be enabled on a device
  * @dev: pointer to the pci_dev data structure of MSI device function
  * @nvec: how many MSIs have been requested ?
- * @type: are we checking for MSI or MSI-X ?
  *
  * Look at global flags, the device itself, and its parent buses
  * to determine if MSI/-X are supported for the device. If MSI/-X is
  * supported return 0, else return an error code.
  **/
-static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
+static int pci_msi_supported(struct pci_dev *dev, int nvec)
 {
struct pci_bus *bus;
-   int ret;
 
/* MSI must be globally enabled and supported by the device */
-   if (!pci_msi_enable || !dev || dev->no_msi)
+   if (!pci_msi_enable)
+   return -EINVAL;
+
+   if (!dev || dev->no_msi || dev->current_state != PCI_D0)
return -EINVAL;
 
/*
@@ -843,10 +834,6 @@ static int pci_msi_check_device(struct pci_dev *dev, int 
nvec, int type)
if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI)
return -EINVAL;
 
-   ret = arch_msi_check_device(dev, nvec, type);
-   if (ret)
-   return ret;
-
return 0;
 }
 
@@ -949,13 +936,13 @@ int pci_enable_msix(struct pci_dev *dev, struct 
msix_entry *entries, int nvec)
int status, nr_entries;
int i, j;
 
-   if (!entries || !dev->msix_cap || dev->current_state != PCI_D0)
-   return -EINVAL;
-
-   status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSIX);
+   status = pci_msi_supported(dev, nvec);
if (status)
return status;
 
+   if (!entries)
+   return -EINVAL;
+
nr_entries = pci_msix_vec_count(dev);
if (nr_entries < 0)
return nr_entries;
@@ -1062,8 +1049,9 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, 
int maxvec)
int nvec;
int rc;
 
-   if (dev->current_state != PCI_D0)
-   return -EINVAL;
+   rc = pci_msi_supported(dev, minvec);
+   if (rc)
+   return rc;
 
WARN_ON(!!dev->msi_enabled);
 
@@ -1086,17 +1074,6 @@ int pci_enable_msi_range(struct pci_dev *dev, int 
minvec, int maxvec)
nvec = maxvec;
 
do {
-   rc = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI);
-   if (rc < 0) {
-   return rc;
-   } else if (rc > 0) {
-   if (rc < minvec)
-   return -ENOSPC;
-   nvec = rc;
-   }
-   } while (rc);
-
-   do {
rc = msi_capability_init(dev, nvec);
if (rc < 0) {
return rc;
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 8103f32..dbf7cc9 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -60,7 +60,6 @@ int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc 
*desc);
 void arch_teardown_msi_irq(unsigned int irq);
 int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
 void arch_teardown_msi_irqs(struct pci_dev *dev);
-int arch_msi_check_device(struct pci_dev* dev, int nvec, int type);
 void arch_restore_msi_irqs(struct pci_dev *dev);
 
 void default_teardown_msi_irqs(struct pci_dev *dev);
@@ -77,8 +76,6 @@ struct msi_chip {
int (*setup_irq)(struct msi_chip *chip, struct pci_dev *dev,
 struct msi_desc *desc);
void (*teardown_irq)(struct msi_chip *chip, unsigned int irq);
-   int (*check_device)(struct msi_chip *chip, struct pci_dev *dev,
-   int nvec, int type);
 };
 
 #endif /* LINUX_MSI_H */
-- 
1.9.3

[PATCH v2 1/3] PCI/MSI/PPC: Remove arch_msi_check_device()

2014-09-07 Thread Alexander Gordeev
Moving the MSI checks from the arch_msi_check_device() function to
the arch_setup_msi_irqs() function makes the code more compact and
allows removing the unnecessary arch_msi_check_device() hook
from the generic MSI code.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-...@vger.kernel.org
Cc: Michael Ellerman m...@ellerman.id.au
Signed-off-by: Alexander Gordeev agord...@redhat.com
---
 arch/powerpc/include/asm/machdep.h |  2 --
 arch/powerpc/kernel/msi.c  | 12 +-
 arch/powerpc/platforms/cell/axon_msi.c |  9 
 arch/powerpc/platforms/powernv/pci.c   | 19 ---
 arch/powerpc/platforms/pseries/msi.c   | 42 +-
 arch/powerpc/sysdev/fsl_msi.c  | 12 +++---
 arch/powerpc/sysdev/mpic_pasemi_msi.c  | 11 ++---
 arch/powerpc/sysdev/mpic_u3msi.c   | 28 +--
 arch/powerpc/sysdev/ppc4xx_hsta_msi.c  | 18 +--
 arch/powerpc/sysdev/ppc4xx_msi.c   | 19 +--
 10 files changed, 50 insertions(+), 122 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h 
b/arch/powerpc/include/asm/machdep.h
index b125cea..3af7216 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -136,8 +136,6 @@ struct machdep_calls {
int (*pci_setup_phb)(struct pci_controller *host);
 
 #ifdef CONFIG_PCI_MSI
-   int (*msi_check_device)(struct pci_dev* dev,
-   int nvec, int type);
int (*setup_msi_irqs)(struct pci_dev *dev,
  int nvec, int type);
void(*teardown_msi_irqs)(struct pci_dev *dev);
diff --git a/arch/powerpc/kernel/msi.c b/arch/powerpc/kernel/msi.c
index 8bbc12d..71bd161 100644
--- a/arch/powerpc/kernel/msi.c
+++ b/arch/powerpc/kernel/msi.c
@@ -13,7 +13,7 @@
 
 #include <asm/machdep.h>
 
-int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
+int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
 {
if (!ppc_md.setup_msi_irqs || !ppc_md.teardown_msi_irqs) {
pr_debug("msi: Platform doesn't provide MSI callbacks.\n");
@@ -24,16 +24,6 @@ int arch_msi_check_device(struct pci_dev* dev, int nvec, int 
type)
if (type == PCI_CAP_ID_MSI && nvec > 1)
return 1;
 
-   if (ppc_md.msi_check_device) {
-   pr_debug("msi: Using platform check routine.\n");
-   return ppc_md.msi_check_device(dev, nvec, type);
-   }
-
-return 0;
-}
-
-int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
-{
return ppc_md.setup_msi_irqs(dev, nvec, type);
 }
 
diff --git a/arch/powerpc/platforms/cell/axon_msi.c 
b/arch/powerpc/platforms/cell/axon_msi.c
index 85825b5..862b327 100644
--- a/arch/powerpc/platforms/cell/axon_msi.c
+++ b/arch/powerpc/platforms/cell/axon_msi.c
@@ -199,14 +199,6 @@ out_error:
return msic;
 }
 
-static int axon_msi_check_device(struct pci_dev *dev, int nvec, int type)
-{
-   if (!find_msi_translator(dev))
-   return -ENODEV;
-
-   return 0;
-}
-
 static int setup_msi_msg_address(struct pci_dev *dev, struct msi_msg *msg)
 {
struct device_node *dn;
@@ -416,7 +408,6 @@ static int axon_msi_probe(struct platform_device *device)
 
ppc_md.setup_msi_irqs = axon_msi_setup_msi_irqs;
ppc_md.teardown_msi_irqs = axon_msi_teardown_msi_irqs;
-   ppc_md.msi_check_device = axon_msi_check_device;
 
axon_msi_debug_setup(dn, msic);
 
diff --git a/arch/powerpc/platforms/powernv/pci.c 
b/arch/powerpc/platforms/powernv/pci.c
index b854b57..b45c492 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -46,29 +46,21 @@
 //#define cfg_dbg(fmt...)  printk(fmt)
 
 #ifdef CONFIG_PCI_MSI
-static int pnv_msi_check_device(struct pci_dev* pdev, int nvec, int type)
-{
-   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
-   struct pnv_phb *phb = hose->private_data;
-   struct pci_dn *pdn = pci_get_pdn(pdev);
-
-   if (pdn && pdn->force_32bit_msi && !phb->msi32_support)
-   return -ENODEV;
-
-   return (phb && phb->msi_bmp.bitmap) ? 0 : -ENODEV;
-}
-
 static int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
 {
struct pci_controller *hose = pci_bus_to_host(pdev->bus);
struct pnv_phb *phb = hose->private_data;
+   struct pci_dn *pdn = pci_get_pdn(pdev);
struct msi_desc *entry;
struct msi_msg msg;
int hwirq;
unsigned int virq;
int rc;
 
-   if (WARN_ON(!phb))
+   if (WARN_ON(!phb) || !phb->msi_bmp.bitmap)
+   return -ENODEV;
+
+   if (pdn && pdn->force_32bit_msi && !phb->msi32_support)
return -ENODEV;
 
list_for_each_entry(entry, &pdev->msi_list, list) {
@@ -860,7 +852,6 @@ void __init pnv_pci_init(void)
 
/* Configure MSIs */
 #ifdef CONFIG_PCI_MSI
-   ppc_md.msi_check_device = pnv_msi_check_device

[PATCH v2 0/3] PCI/MSI: Remove arch_msi_check_device()

2014-09-07 Thread Alexander Gordeev
Hello,

This is a cleanup effort to get rid of arch_msi_check_device() function.

I am sending the v2 series, since kbuild for v1 reported compile errors on
ppc4xx and Armada 370. Still, I have not checked the fixes on these
platforms.

Changes since v1:
  - patch 1: 'pdev' undeclared compile error fixed on ppc4xx. I updated
 changelog and removed ACK from this patch in case there are
 any objections;

  - patch 2: 'struct msi_chip' has no 'check_device' error fixed on
 Armada 370 - armada_370_xp_check_msi_device() hook removed;

  - patch 3: msi_check_device() renamed to pci_msi_supported(). This is
 the very same patch I sent earlier;

Thanks!

Cc: Thomas Gleixner t...@linutronix.de
Cc: Jason Cooper ja...@lakedaemon.net
Cc: Michael Ellerman m...@ellerman.id.au
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-...@vger.kernel.org

Alexander Gordeev (3):
  patch 1 - PCI/MSI/PPC: Remove arch_msi_check_device()
  patch 2 - PCI/MSI/Armada-370-xp: Remove arch_msi_check_device()
  patch 3 - PCI/MSI: Remove arch_msi_check_device()

 arch/powerpc/include/asm/machdep.h |  2 --
 arch/powerpc/kernel/msi.c  | 12 +
 arch/powerpc/platforms/cell/axon_msi.c |  9 ---
 arch/powerpc/platforms/powernv/pci.c   | 19 -
 arch/powerpc/platforms/pseries/msi.c   | 42 +++--
 arch/powerpc/sysdev/fsl_msi.c  | 12 +++--
 arch/powerpc/sysdev/mpic_pasemi_msi.c  | 11 ++--
 arch/powerpc/sysdev/mpic_u3msi.c   | 28 ---
 arch/powerpc/sysdev/ppc4xx_hsta_msi.c  | 18 +
 arch/powerpc/sysdev/ppc4xx_msi.c   | 19 +
 drivers/irqchip/irq-armada-370-xp.c| 14 +++---
 drivers/pci/msi.c  | 49 +-
 include/linux/msi.h|  3 ---
 13 files changed, 67 insertions(+), 171 deletions(-)

-- 
1.9.3

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 1/2] PCI/MSI/PPC: Remove arch_msi_check_device()

2014-08-19 Thread Alexander Gordeev
On Thu, Jul 31, 2014 at 03:53:33PM +0200, Alexander Gordeev wrote:
 Michael, Ben,
 
 Any feedback?

Michael, Ben?

 Thanks!

-- 
Regards,
Alexander Gordeev
agord...@redhat.com
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v2 2/2] PCI/MSI: Remove arch_msi_check_device()

2014-08-11 Thread Alexander Gordeev
There are no architectures left that override the arch_msi_check_device()
hook. Remove it, as it is completely redundant.

If an architecture needs to check whether MSI/MSI-X can be enabled for a
device, it should do so within its arch_setup_msi_irqs() hook.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-...@vger.kernel.org
Signed-off-by: Alexander Gordeev agord...@redhat.com
---
 drivers/pci/msi.c   | 49 +
 include/linux/msi.h |  3 ---
 2 files changed, 13 insertions(+), 39 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 5a40516..6c2cc41 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -56,16 +56,6 @@ void __weak arch_teardown_msi_irq(unsigned int irq)
 chip->teardown_irq(chip, irq);
 }
 
-int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
-{
-   struct msi_chip *chip = dev->bus->msi;
-
-   if (!chip || !chip->check_device)
-   return 0;
-
-   return chip->check_device(chip, dev, nvec, type);
-}
-
 int __weak arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
 {
struct msi_desc *entry;
@@ -806,22 +796,23 @@ out_free:
 }
 
 /**
- * pci_msi_check_device - check whether MSI may be enabled on a device
+ * pci_msi_supported - check whether MSI may be enabled on a device
  * @dev: pointer to the pci_dev data structure of MSI device function
  * @nvec: how many MSIs have been requested ?
- * @type: are we checking for MSI or MSI-X ?
  *
  * Look at global flags, the device itself, and its parent buses
  * to determine if MSI/-X are supported for the device. If MSI/-X is
  * supported return 0, else return an error code.
  **/
-static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
+static int pci_msi_supported(struct pci_dev *dev, int nvec)
 {
struct pci_bus *bus;
-   int ret;
 
/* MSI must be globally enabled and supported by the device */
-   if (!pci_msi_enable || !dev || dev->no_msi)
+   if (!pci_msi_enable)
+   return -EINVAL;
+
+   if (!dev || dev->no_msi || dev->current_state != PCI_D0)
return -EINVAL;
 
/*
@@ -843,10 +834,6 @@ static int pci_msi_check_device(struct pci_dev *dev, int 
nvec, int type)
 if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI)
return -EINVAL;
 
-   ret = arch_msi_check_device(dev, nvec, type);
-   if (ret)
-   return ret;
-
return 0;
 }
 
@@ -949,13 +936,13 @@ int pci_enable_msix(struct pci_dev *dev, struct 
msix_entry *entries, int nvec)
int status, nr_entries;
int i, j;
 
-   if (!entries || !dev->msix_cap || dev->current_state != PCI_D0)
-   return -EINVAL;
-
-   status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSIX);
+   status = pci_msi_supported(dev, nvec);
if (status)
return status;
 
+   if (!entries)
+   return -EINVAL;
+
nr_entries = pci_msix_vec_count(dev);
 if (nr_entries < 0)
return nr_entries;
@@ -1062,8 +1049,9 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, 
int maxvec)
int nvec;
int rc;
 
-   if (dev->current_state != PCI_D0)
-   return -EINVAL;
+   rc = pci_msi_supported(dev, minvec);
+   if (rc)
+   return rc;
 
 WARN_ON(!!dev->msi_enabled);
 
@@ -1086,17 +1074,6 @@ int pci_enable_msi_range(struct pci_dev *dev, int 
minvec, int maxvec)
nvec = maxvec;
 
do {
-   rc = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI);
-   if (rc < 0) {
-   return rc;
-   } else if (rc > 0) {
-   if (rc < minvec)
-   return -ENOSPC;
-   nvec = rc;
-   }
-   } while (rc);
-
-   do {
rc = msi_capability_init(dev, nvec);
 if (rc < 0) {
return rc;
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 8103f32..dbf7cc9 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -60,7 +60,6 @@ int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc 
*desc);
 void arch_teardown_msi_irq(unsigned int irq);
 int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
 void arch_teardown_msi_irqs(struct pci_dev *dev);
-int arch_msi_check_device(struct pci_dev* dev, int nvec, int type);
 void arch_restore_msi_irqs(struct pci_dev *dev);
 
 void default_teardown_msi_irqs(struct pci_dev *dev);
@@ -77,8 +76,6 @@ struct msi_chip {
int (*setup_irq)(struct msi_chip *chip, struct pci_dev *dev,
 struct msi_desc *desc);
void (*teardown_irq)(struct msi_chip *chip, unsigned int irq);
-   int (*check_device)(struct msi_chip *chip, struct pci_dev *dev,
-   int nvec, int type);
 };
 
 #endif /* LINUX_MSI_H */
-- 
1.9.3

___
Linuxppc-dev mailing

Re: [PATCH v2 2/2] PCI/MSI: Remove arch_msi_check_device()

2014-08-11 Thread Alexander Gordeev
On Mon, Aug 11, 2014 at 02:33:25PM +, bharat.bhus...@freescale.com wrote:
 
 
  -Original Message-
  From: linux-pci-ow...@vger.kernel.org 
  [mailto:linux-pci-ow...@vger.kernel.org]
  On Behalf Of Alexander Gordeev
  Sent: Monday, August 11, 2014 5:16 PM
  To: Bjorn Helgaas
  Cc: linux-ker...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; linux-
  p...@vger.kernel.org
  Subject: [PATCH v2 2/2] PCI/MSI: Remove arch_msi_check_device()
  
  There are no archs that override arch_msi_check_device() hook. Remove it as 
  it
  is completely redundant.
 
 Is not we are still overriding this in powerpc (not sure I missed some patch 
 , if so please point to that).

Patch 1/2 ;)

 $ grep -r  arch_msi_check_device arch/powerpc/
 Binary file arch/powerpc/kernel/msi.o matches
 arch/powerpc/kernel/msi.c:int arch_msi_check_device(struct pci_dev* dev, int 
 nvec, int type)
 Binary file arch/powerpc/kernel/built-in.o matches
 
 Thanks
 -Bharat
 
  

Re: [PATCH 1/2] PCI/MSI/PPC: Remove arch_msi_check_device()

2014-07-31 Thread Alexander Gordeev
On Sat, Jul 12, 2014 at 01:21:07PM +0200, Alexander Gordeev wrote:
 PowerPC is the only architecture that makes use of hook
 arch_msi_check_device() and does perform some checks to
 figure out if MSI/MSI-X could be enabled for a device.
 However, there are no reasons why those checks could not
 be done within arch_setup_msi_irqs() hook.
 
 Moving MSI checks into arch_setup_msi_irqs() makes code
 more readable and allows getting rid of unnecessary hook
 arch_msi_check_device().

Michael, Ben,

Any feedback?

Thanks!

-- 
Regards,
Alexander Gordeev
agord...@redhat.com
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 2/2] PCI/MSI: Remove arch_msi_check_device()

2014-07-17 Thread Alexander Gordeev
On Wed, Jul 16, 2014 at 04:20:24PM -0600, Bjorn Helgaas wrote:
  @@ -809,22 +799,23 @@ out_free:
   }
   
   /**
  - * pci_msi_check_device - check whether MSI may be enabled on a device
  + * msi_check_device - check whether MSI may be enabled on a device
* @dev: pointer to the pci_dev data structure of MSI device function
* @nvec: how many MSIs have been requested ?
  - * @type: are we checking for MSI or MSI-X ?
*
* Look at global flags, the device itself, and its parent buses
* to determine if MSI/-X are supported for the device. If MSI/-X is
* supported return 0, else return an error code.
**/
  -static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
  +static int msi_check_device(struct pci_dev *dev, int nvec)
 
 I think check_device is a terrible name because it really doesn't
 give a clue about what it's doing or what the return value means.
 Since you're removing the external usage (arch_msi_check_device) and
 this one is static, this would be a good time to fix it.  Maybe
 pci_msi_supported() or something?

What about pci_can_enable_msi() or pci_msi_can_enable() or msi_can_enable()?

 I *love* the idea of getting rid of this much code!
 
 Bjorn

-- 
Regards,
Alexander Gordeev
agord...@redhat.com
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 2/2] PCI/MSI: Remove arch_msi_check_device()

2014-07-14 Thread Alexander Gordeev
On Mon, Jul 14, 2014 at 10:11:57AM +0800, Yijing Wang wrote:
   /**
  - * pci_msi_check_device - check whether MSI may be enabled on a device
  + * msi_check_device - check whether MSI may be enabled on a device
* @dev: pointer to the pci_dev data structure of MSI device function
* @nvec: how many MSIs have been requested ?
  - * @type: are we checking for MSI or MSI-X ?
*
* Look at global flags, the device itself, and its parent buses
* to determine if MSI/-X are supported for the device. If MSI/-X is
* supported return 0, else return an error code.
**/
  -static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
  +static int msi_check_device(struct pci_dev *dev, int nvec)
   {
  struct pci_bus *bus;
  -   int ret;
   
  /* MSI must be globally enabled and supported by the device */
  -   if (!pci_msi_enable || !dev || dev->no_msi)
  +   if (!pci_msi_enable)
  +   return -EINVAL;
  +
  +   if (!dev || dev->no_msi || dev->current_state != PCI_D0)
  return -EINVAL;
   
  /*
  @@ -846,10 +837,6 @@ static int pci_msi_check_device(struct pci_dev *dev, 
  int nvec, int type)
  if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI)
  return -EINVAL;
   
  -   ret = arch_msi_check_device(dev, nvec, type);
  -   if (ret)
  -   return ret;
  -
 
 Move the arch_msi_check_device() into arch_msi_setup_irq(), make we can not 
 detect whether the device in this platform
 supports MSI or MSI-X aeap. If we delay this, maybe we will do a lot 
 unnecessary working for MSI/MSI-X setup.

A traditional approach for a function is to first make sanity checks and
then allocate resources. I do not see a reason to keep these two steps
in separate functions: arch_msi_check_device() and arch_setup_msi_irq().

Just making the checks within arch_setup_msi_irq() and bailing out early
would be as cheap as it is now, but more natural, and it would deflate the
interface (see the sketch below).

Moreover, some platforms duplicate checks in arch_msi_check_device() and
arch_setup_msi_irq(), which does not add to readability.


-- 
Regards,
Alexander Gordeev
agord...@redhat.com
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 1/2] PCI/MSI/PPC: Remove arch_msi_check_device()

2014-07-12 Thread Alexander Gordeev
PowerPC is the only architecture that makes use of the
arch_msi_check_device() hook to perform checks that figure
out whether MSI/MSI-X could be enabled for a device.
However, there is no reason why those checks could not
be done within the arch_setup_msi_irqs() hook.

Moving the MSI checks into arch_setup_msi_irqs() makes the
code more readable and allows getting rid of the unnecessary
arch_msi_check_device() hook.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-...@vger.kernel.org
Signed-off-by: Alexander Gordeev agord...@redhat.com
---
 arch/powerpc/include/asm/machdep.h |2 -
 arch/powerpc/kernel/msi.c  |   12 +
 arch/powerpc/platforms/cell/axon_msi.c |9 ---
 arch/powerpc/platforms/powernv/pci.c   |   19 --
 arch/powerpc/platforms/pseries/msi.c   |   42 ---
 arch/powerpc/sysdev/fsl_msi.c  |   12 ++---
 arch/powerpc/sysdev/mpic_pasemi_msi.c  |   11 +---
 arch/powerpc/sysdev/mpic_u3msi.c   |   28 -
 arch/powerpc/sysdev/ppc4xx_hsta_msi.c  |   18 -
 arch/powerpc/sysdev/ppc4xx_msi.c   |   19 --
 10 files changed, 50 insertions(+), 122 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h 
b/arch/powerpc/include/asm/machdep.h
index f92b0b5..d05aa76 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -136,8 +136,6 @@ struct machdep_calls {
int (*pci_setup_phb)(struct pci_controller *host);
 
 #ifdef CONFIG_PCI_MSI
-   int (*msi_check_device)(struct pci_dev* dev,
-   int nvec, int type);
int (*setup_msi_irqs)(struct pci_dev *dev,
  int nvec, int type);
void(*teardown_msi_irqs)(struct pci_dev *dev);
diff --git a/arch/powerpc/kernel/msi.c b/arch/powerpc/kernel/msi.c
index 8bbc12d..71bd161 100644
--- a/arch/powerpc/kernel/msi.c
+++ b/arch/powerpc/kernel/msi.c
@@ -13,7 +13,7 @@
 
 #include asm/machdep.h
 
-int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
+int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
 {
if (!ppc_md.setup_msi_irqs || !ppc_md.teardown_msi_irqs) {
 pr_debug("msi: Platform doesn't provide MSI callbacks.\n");
@@ -24,16 +24,6 @@ int arch_msi_check_device(struct pci_dev* dev, int nvec, int 
type)
 if (type == PCI_CAP_ID_MSI && nvec > 1)
return 1;
 
-   if (ppc_md.msi_check_device) {
-   pr_debug("msi: Using platform check routine.\n");
-   return ppc_md.msi_check_device(dev, nvec, type);
-   }
-
-return 0;
-}
-
-int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
-{
return ppc_md.setup_msi_irqs(dev, nvec, type);
 }
 
diff --git a/arch/powerpc/platforms/cell/axon_msi.c 
b/arch/powerpc/platforms/cell/axon_msi.c
index 85825b5..862b327 100644
--- a/arch/powerpc/platforms/cell/axon_msi.c
+++ b/arch/powerpc/platforms/cell/axon_msi.c
@@ -199,14 +199,6 @@ out_error:
return msic;
 }
 
-static int axon_msi_check_device(struct pci_dev *dev, int nvec, int type)
-{
-   if (!find_msi_translator(dev))
-   return -ENODEV;
-
-   return 0;
-}
-
 static int setup_msi_msg_address(struct pci_dev *dev, struct msi_msg *msg)
 {
struct device_node *dn;
@@ -416,7 +408,6 @@ static int axon_msi_probe(struct platform_device *device)
 
ppc_md.setup_msi_irqs = axon_msi_setup_msi_irqs;
ppc_md.teardown_msi_irqs = axon_msi_teardown_msi_irqs;
-   ppc_md.msi_check_device = axon_msi_check_device;
 
axon_msi_debug_setup(dn, msic);
 
diff --git a/arch/powerpc/platforms/powernv/pci.c 
b/arch/powerpc/platforms/powernv/pci.c
index f91a4e5..987b677 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -46,29 +46,21 @@
 //#define cfg_dbg(fmt...)  printk(fmt)
 
 #ifdef CONFIG_PCI_MSI
-static int pnv_msi_check_device(struct pci_dev* pdev, int nvec, int type)
-{
-   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
-   struct pnv_phb *phb = hose->private_data;
-   struct pci_dn *pdn = pci_get_pdn(pdev);
-
-   if (pdn && pdn->force_32bit_msi && !phb->msi32_support)
-   return -ENODEV;
-
-   return (phb && phb->msi_bmp.bitmap) ? 0 : -ENODEV;
-}
-
 static int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
 {
 struct pci_controller *hose = pci_bus_to_host(pdev->bus);
 struct pnv_phb *phb = hose->private_data;
+   struct pci_dn *pdn = pci_get_pdn(pdev);
struct msi_desc *entry;
struct msi_msg msg;
int hwirq;
unsigned int virq;
int rc;
 
-   if (WARN_ON(!phb))
+   if (WARN_ON(!phb) || !phb->msi_bmp.bitmap)
+   return -ENODEV;
+
+   if (pdn && pdn->force_32bit_msi && !phb->msi32_support)
return -ENODEV;
 
 list_for_each_entry(entry, &pdev->msi_list, list

[PATCH 2/2] PCI/MSI: Remove arch_msi_check_device()

2014-07-12 Thread Alexander Gordeev
There are no architectures left that override the arch_msi_check_device()
hook. Remove it, as it is completely redundant.

If an architecture needs to check whether MSI/MSI-X can be enabled for a
device, it should do so within its arch_setup_msi_irqs() hook.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-...@vger.kernel.org
Signed-off-by: Alexander Gordeev agord...@redhat.com
---
 drivers/pci/msi.c   |   49 +
 include/linux/msi.h |3 ---
 2 files changed, 13 insertions(+), 39 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 13f3d30..19ac058 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -56,16 +56,6 @@ void __weak arch_teardown_msi_irq(unsigned int irq)
 chip->teardown_irq(chip, irq);
 }
 
-int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
-{
-   struct msi_chip *chip = dev->bus->msi;
-
-   if (!chip || !chip->check_device)
-   return 0;
-
-   return chip->check_device(chip, dev, nvec, type);
-}
-
 int __weak arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
 {
struct msi_desc *entry;
@@ -809,22 +799,23 @@ out_free:
 }
 
 /**
- * pci_msi_check_device - check whether MSI may be enabled on a device
+ * msi_check_device - check whether MSI may be enabled on a device
  * @dev: pointer to the pci_dev data structure of MSI device function
  * @nvec: how many MSIs have been requested ?
- * @type: are we checking for MSI or MSI-X ?
  *
  * Look at global flags, the device itself, and its parent buses
  * to determine if MSI/-X are supported for the device. If MSI/-X is
  * supported return 0, else return an error code.
  **/
-static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
+static int msi_check_device(struct pci_dev *dev, int nvec)
 {
struct pci_bus *bus;
-   int ret;
 
/* MSI must be globally enabled and supported by the device */
-   if (!pci_msi_enable || !dev || dev->no_msi)
+   if (!pci_msi_enable)
+   return -EINVAL;
+
+   if (!dev || dev->no_msi || dev->current_state != PCI_D0)
return -EINVAL;
 
/*
@@ -846,10 +837,6 @@ static int pci_msi_check_device(struct pci_dev *dev, int 
nvec, int type)
 if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI)
return -EINVAL;
 
-   ret = arch_msi_check_device(dev, nvec, type);
-   if (ret)
-   return ret;
-
return 0;
 }
 
@@ -954,13 +941,13 @@ int pci_enable_msix(struct pci_dev *dev, struct 
msix_entry *entries, int nvec)
int status, nr_entries;
int i, j;
 
-   if (!entries || !dev->msix_cap || dev->current_state != PCI_D0)
-   return -EINVAL;
-
-   status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSIX);
+   status = msi_check_device(dev, nvec);
if (status)
return status;
 
+   if (!entries)
+   return -EINVAL;
+
nr_entries = pci_msix_vec_count(dev);
 if (nr_entries < 0)
return nr_entries;
@@ -1085,8 +1072,9 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, 
int maxvec)
int nvec;
int rc;
 
-   if (dev->current_state != PCI_D0)
-   return -EINVAL;
+   rc = msi_check_device(dev, minvec);
+   if (rc)
+   return rc;
 
 WARN_ON(!!dev->msi_enabled);
 
@@ -1109,17 +1097,6 @@ int pci_enable_msi_range(struct pci_dev *dev, int 
minvec, int maxvec)
nvec = maxvec;
 
do {
-   rc = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI);
-   if (rc < 0) {
-   return rc;
-   } else if (rc > 0) {
-   if (rc < minvec)
-   return -ENOSPC;
-   nvec = rc;
-   }
-   } while (rc);
-
-   do {
rc = msi_capability_init(dev, nvec);
 if (rc < 0) {
return rc;
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 92a2f99..3b873bc 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -59,7 +59,6 @@ int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc 
*desc);
 void arch_teardown_msi_irq(unsigned int irq);
 int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
 void arch_teardown_msi_irqs(struct pci_dev *dev);
-int arch_msi_check_device(struct pci_dev* dev, int nvec, int type);
 void arch_restore_msi_irqs(struct pci_dev *dev);
 
 void default_teardown_msi_irqs(struct pci_dev *dev);
@@ -76,8 +75,6 @@ struct msi_chip {
int (*setup_irq)(struct msi_chip *chip, struct pci_dev *dev,
 struct msi_desc *desc);
void (*teardown_irq)(struct msi_chip *chip, unsigned int irq);
-   int (*check_device)(struct msi_chip *chip, struct pci_dev *dev,
-   int nvec, int type);
 };
 
 #endif /* LINUX_MSI_H */
-- 
1.7.7.6

___
Linuxppc-dev

Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()

2014-07-10 Thread Alexander Gordeev
On Wed, Jul 09, 2014 at 10:06:48AM -0600, Bjorn Helgaas wrote:
 Out of curiosity, do you have a pointer to this?  It looks like it

I.e. ICH8 chapter 12.1.30 or ICH10 chapter 14.1.27

 uses one vector per port, and I'm wondering if the reason it requests
 16 is because there's some possibility of a part with more than 8
 ports.

I doubt that is the reason. The only allowed MME values (powers of two)
are 0b000, 0b001, 0b010 and 0b100. As you can see, at most one bit is ever
set - I would speculate that suits some hardware decoding logic nicely
(the sketch below shows the general encoding).
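As background, a small sketch (mine, not from the ICH documentation) of how
the generic MMC/MME field maps to vector counts per the PCI spec;
order_base_2() is the kernel's round-up log2 helper:

#include <linux/log2.h>
#include <linux/types.h>

/* The 3-bit MMC/MME field encodes log2(vectors): 0b000..0b101 -> 1..32. */
static unsigned int msi_vectors_from_mme(u8 mme)
{
	return 1U << mme;		/* e.g. 0b100 -> 16 vectors */
}

static u8 msi_mme_from_vectors(unsigned int nvec)
{
	/* Non-power-of-two requests round up: 6 -> order_base_2(6) == 3, i.e. 8 vectors. */
	return order_base_2(nvec);
}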

BTW, apart from AHCI, it seems the reason MSI is not going to disappear
(in a decade at least) is it is way cheaper to implement than MSI-X.

  No, this is not an erratum. The value of 8 vectors is reserved and could
  cause undefined results if used.
 
 As I read the spec (PCI 3.0, sec 6.8.1.3), if MMC contains 0b100
 (requesting 16 vectors), the OS is allowed to allocate 1, 2, 4, 8, or
 16 vectors.  If allocating 8 vectors and writing 0b011 to MME causes
 undefined results, I'd say that's a chipset defect.

Well, the PCI spec does not prevent devices from having their own specs on top
of it. Undefined results are meant on the device side here. On the MSI side
these results are likely perfectly within the PCI spec. I feel like I am
speaking as a lawyer here ;)

 Interrupt vector space is the issue I would worry about, but I think
 I'm going to put this on the back burner until it actually becomes a
 problem.

I plan to try to get rid of the arch_msi_check_device() hook. Should I repost
this series afterwards?

Thanks!

-- 
Regards,
Alexander Gordeev
agord...@redhat.com
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()

2014-07-08 Thread Alexander Gordeev
On Mon, Jul 07, 2014 at 01:40:48PM -0600, Bjorn Helgaas wrote:
  Can you quantify the benefit of this?  Can't a device already use
  MSI-X to request exactly the number of vectors it can use?  (I know
 
  A Intel AHCI chipset requires 16 vectors written to MME while advertises
  (via AHCI registers) and uses only 6. Even attempt to init 8 vectors results
  in device's fallback to 1 (!).
 
 Is the fact that it uses only 6 vectors documented in the public spec?

Yes, it is documented in ICH specs.

 Is this a chipset erratum?  Are there newer versions of the chipset
 that fix this, e.g., by requesting 8 vectors and using 6, or by also
 supporting MSI-X?

No, this is not an erratum. The value of 8 vectors is reserved and could
cause undefined results if used.

 I know this conserves vector numbers.  What does that mean in real
 user-visible terms?  Are there systems that won't boot because of this
 issue, and this patch fixes them?  Does it enable bigger
 configurations, e.g., more I/O devices, than before?

Visibly, it stops logging messages ('ahci :00:1f.2: irq 107 for
MSI/MSI-X') for IRQs that never show up in /proc/interrupts later.

No, it does not enable/fix any existing hardware issue I am aware of.
It just saves a couple of interrupt vectors, as Michael put it (10 out of 16,
to be precise). However, interrupt vector space is a pretty scarce
resource on x86, and the risk of exhausting the vectors (and of having to
introduce quotas) has already been raised, AFAIR.

 Do you know how Windows handles this?  Does it have a similar interface?

Have no clue, TBH. Can try to investigate if you see it helpful.

 As you can tell, I'm a little skeptical about this.  It's a fairly big
 change, it affects the arch interface, it seems to be targeted for
 only a single chipset (though it's widely used), and we already
 support a standard solution (MSI-X, reducing the number of vectors
 requested, or even operating with 1 vector).

I also do not like the fact the arch interface is getting complicated,
so I happily leave it to your judgement ;) Well, it is low-level and
hidden from drivers at least.

Thanks!

 Bjorn

-- 
Regards,
Alexander Gordeev
agord...@redhat.com
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()

2014-07-07 Thread Alexander Gordeev
On Mon, Jul 07, 2014 at 01:40:48PM -0600, Bjorn Helgaas wrote:
 As you can tell, I'm a little skeptical about this.  It's a fairly big
 change, it affects the arch interface, it seems to be targeted for
 only a single chipset (though it's widely used), and we already
 support a standard solution (MSI-X, reducing the number of vectors
 requested, or even operating with 1 vector).

Bjorn,

I surely understand your concerns. I am answering this summary
question right away.

Even though an extra parameter is introduced, functionally this update
is rather small. It is only the new pci_enable_msi_partial() function
that can exploit a custom 'nvec_mme' parameter. By contrast, the existing
pci_enable_msi_range() function (and therefore all device drivers) is
unaffected - it just rounds 'nvec' up to the nearest power of two and
continues exactly as it has been. All archs besides x86 simply ignore the
new parameter, and the x86 change is fairly small as well - all the
necessary functionality is already in place.

Thus, at the moment only AHCI is of concern (a usage sketch follows
below). And no, AHCI cannot do MSI-X.
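For concreteness, a hedged sketch of how the AHCI driver might call the
proposed interface; the helper name and structure are purely illustrative,
only pci_enable_msi_partial() and its (dev, nvec, nvec_mme) signature come
from this series, and the 6-of-16 figures are the ones discussed in this
thread:

/* Request 6 usable MSIs while programming the MME field as if 16 were granted. */
static int ahci_enable_partial_msi_sketch(struct pci_dev *pdev)
{
	int n_msis = 6;		/* vectors the chipset actually signals */
	int rc;

	rc = pci_enable_msi_partial(pdev, n_msis, 16 /* MME value the device insists on */);
	if (rc)
		return rc;	/* caller would fall back to a single MSI or INTx */

	return n_msis;
}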

Thanks!

 Bjorn

-- 
Regards,
Alexander Gordeev
agord...@redhat.com
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()

2014-07-04 Thread Alexander Gordeev
On Wed, Jul 02, 2014 at 02:22:01PM -0600, Bjorn Helgaas wrote:
 On Tue, Jun 10, 2014 at 03:10:30PM +0200, Alexander Gordeev wrote:
  There are PCI devices that require a particular value written
  to the Multiple Message Enable (MME) register while aligned on
  power of 2 boundary value of actually used MSI vectors 'nvec'
  is a lesser of that MME value:
  
  roundup_pow_of_two(nvec) < 'Multiple Message Enable'
  
  However the existing pci_enable_msi_block() interface is not
  able to configure such devices, since the value written to the
  MME register is calculated from the number of requested MSIs
  'nvec':
  
  'Multiple Message Enable' = roundup_pow_of_two(nvec)
 
 For MSI, software learns how many vectors a device requests by reading
 the Multiple Message Capable (MMC) field.  This field is encoded, so a
 device can only request 1, 2, 4, 8, etc., vectors.  It's impossible
 for a device to request 3 vectors; it would have to round that up
 to a power of two and request 4 vectors.
 
 Software writes similarly encoded values to MME to tell the device how
 many vectors have been allocated for its use.  For example, it's
 impossible to tell the device that it can use 3 vectors; the OS has to
 round that up and tell the device it can use 4 vectors.

Nod.

 So if I understand correctly, the point of this series is to take
 advantage of device-specific knowledge, e.g., the device requests 4
 vectors via MMC, but we know the device is only capable of using 3.
 Moreover, we tell the device via MME that 4 vectors are available, but
 we've only actually set up 3 of them.

Exactly.

 This makes me uneasy because we're lying to the device, and the device
 is perfectly within spec to use all 4 of those vectors.  If anything
 changes the number of vectors the device uses (new device revision,
 firmware upgrade, etc.), this is liable to break.

If a device committed, via non-MSI-specific means, to send only 3 vectors
out of the 4 available, why should we expect it to send 4? The probability of
firmware sending 4/4 vectors in this case is equal to the probability
of sending 5/4 or 16/4, for the very same reason - a bug in the firmware.
Moreover, even vector 4/4 would be unexpected by the device driver, though
it is perfectly within the spec.

As for a new device revision or firmware update etc. - that is just yet another
case of a device driver vs firmware mismatch. Not including this
change does not help here at all, IMHO.

 Can you quantify the benefit of this?  Can't a device already use
 MSI-X to request exactly the number of vectors it can use?  (I know

An Intel AHCI chipset requires 16 vectors written to MME while it advertises
(via AHCI registers) and uses only 6. Even an attempt to init 8 vectors results
in the device falling back to 1 (!).

 not all devices support MSI-X, but maybe we should just accept MSI for
 what it is and encourage the HW guys to use MSI-X if MSI isn't good
 enough.)
 
  In this case the result written to the MME register may not
  satisfy the aforementioned PCI devices requirement and therefore
  the PCI functions will not operate in a desired mode.
 
 I'm not sure what you mean by will not operate in a desired mode.
 I thought this was an optimization to save vectors and that these
 changes would be completely invisible to the hardware.

Yes, this should be invisible to the hardware. The above is an attempt
to describe the Intel AHCI weirdness in general terms :) I think it
could be omitted.

 Bjorn
 
  This update introduces pci_enable_msi_partial() extension to
  pci_enable_msi_block() interface that accepts extra 'nvec_mme'
  argument which is then written to MME register while the value
  of 'nvec' is still used to setup as many interrupts as requested.
  
  As result of this change, architecture-specific callbacks
  arch_msi_check_device() and arch_setup_msi_irqs() get an extra
  'nvec_mme' parameter as well, but it is ignored for now.
  Therefore, this update is a placeholder for architectures that
  wish to support pci_enable_msi_partial() function in the future.
  
  Cc: linux-...@vger.kernel.org
  Cc: linux-m...@linux-mips.org
  Cc: linuxppc-dev@lists.ozlabs.org
  Cc: linux-s...@vger.kernel.org
  Cc: x...@kernel.org
  Cc: xen-de...@lists.xenproject.org
  Cc: io...@lists.linux-foundation.org
  Cc: linux-...@vger.kernel.org
  Cc: linux-...@vger.kernel.org
  Signed-off-by: Alexander Gordeev agord...@redhat.com
  ---
   Documentation/PCI/MSI-HOWTO.txt |   36 ++--
   arch/mips/pci/msi-octeon.c  |2 +-
   arch/powerpc/kernel/msi.c   |4 +-
   arch/s390/pci/pci.c |2 +-
   arch/x86/kernel/x86_init.c  |2 +-
   drivers/pci/msi.c   |   83 
  ++-
   include/linux/msi.h |5 +-
   include/linux/pci.h |3 +
   8 files changed, 115 insertions(+), 22 deletions(-)
  
  diff --git a/Documentation/PCI/MSI-HOWTO.txt 
  b/Documentation/PCI/MSI-HOWTO.txt
  index 10a9369

Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()

2014-07-04 Thread Alexander Gordeev
On Thu, Jul 03, 2014 at 09:20:52AM +, David Laight wrote:
 From: Bjorn Helgaas
  On Tue, Jun 10, 2014 at 03:10:30PM +0200, Alexander Gordeev wrote:
   There are PCI devices that require a particular value written
   to the Multiple Message Enable (MME) register while aligned on
   power of 2 boundary value of actually used MSI vectors 'nvec'
   is a lesser of that MME value:
  
  roundup_pow_of_two(nvec) < 'Multiple Message Enable'
  
   However the existing pci_enable_msi_block() interface is not
   able to configure such devices, since the value written to the
   MME register is calculated from the number of requested MSIs
   'nvec':
  
 'Multiple Message Enable' = roundup_pow_of_two(nvec)
  
  For MSI, software learns how many vectors a device requests by reading
  the Multiple Message Capable (MMC) field.  This field is encoded, so a
  device can only request 1, 2, 4, 8, etc., vectors.  It's impossible
   for a device to request 3 vectors; it would have to round that up
  to a power of two and request 4 vectors.
  
  Software writes similarly encoded values to MME to tell the device how
  many vectors have been allocated for its use.  For example, it's
  impossible to tell the device that it can use 3 vectors; the OS has to
  round that up and tell the device it can use 4 vectors.
  
  So if I understand correctly, the point of this series is to take
  advantage of device-specific knowledge, e.g., the device requests 4
  vectors via MMC, but we know the device is only capable of using 3.
  Moreover, we tell the device via MME that 4 vectors are available, but
  we've only actually set up 3 of them.
 ...
 
 Even if you do that, you ought to write valid interrupt information
 into the 4th slot (maybe replicating one of the earlier interrupts).
 Then, if the device does raise the 'unexpected' interrupt you don't
 get a write to a random kernel location.

I might be missing something, but we are talking of MSI address space
here, aren't we? I am not getting how we could end up with a 'write'
to a random kernel location when an unclaimed MSI vector is sent. We could
only expect a spurious interrupt at worst, which is handled and reported.

Anyway, as I described in my reply to Bjorn, this is not a concern IMO.

 Plausibly something similar should be done when a smaller number of
 interrupts is assigned.
 
   David
 

-- 
Regards,
Alexander Gordeev
agord...@redhat.com
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()

2014-07-04 Thread Alexander Gordeev
On Fri, Jul 04, 2014 at 09:11:50AM +, David Laight wrote:
  I might be missing something, but we are talking of MSI address space
  here, aren't we? I am not getting how we could end up with a 'write'
  to a random kernel location when a unclaimed MSI vector sent. We could
  only expect a spurious interrupt at worst, which is handled and reported.
  
  Anyway, as I described in my reply to Bjorn, this is not a concern IMO.
 
 I'm thinking of the following - which might be MSI-X ?
 1) Hardware requests some interrupts and tells the host the BAR (and offset)
where the 'vectors' should be written.
 2) To raise an interrupt the hardware uses the 'vector' as the address
of a normal PCIe write cycle.
 
 So if the hardware requests 4 interrupts, but the driver (believing it
 will only use 3) only write 3 vectors, and then the hardware uses the
 4th vector it can write to a random location.
 
 Debugging that would be hard!

The MSI base address is essentially hardcoded for a platform. A combination of
the MSI base address, the PCI function number and the MSI vector makes a PCI
host raise an interrupt on a CPU (the sketch below illustrates the message
format). I might be inaccurate in the details, but the scenario you described
is impossible AFAICT.
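To illustrate the point, a hedged sketch of what an MSI message looks like,
with x86 as an example (the struct name is made up and the layout simplified):
the 'address' a device writes to is a magic APIC range decoded by the host
bridge, never ordinary memory, so an unprogrammed extra vector cannot scribble
over random kernel data.

#include <linux/types.h>

struct msi_msg_sketch {
	u32 address_lo;		/* 0xFEExxxxx - decoded by the CPU/APIC, not by RAM */
	u32 address_hi;
	u32 data;		/* low 8 bits carry the interrupt vector */
};

static void compose_msi_msg_sketch(struct msi_msg_sketch *msg, u8 dest_apic_id, u8 vector)
{
	msg->address_hi = 0;
	msg->address_lo = 0xFEE00000 | ((u32)dest_apic_id << 12);
	msg->data = vector;
}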

   David
 
 
 

-- 
Regards,
Alexander Gordeev
agord...@redhat.com
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()

2014-06-23 Thread Alexander Gordeev
Hi Bjorn,

Any feedback?

Thanks!

-- 
Regards,
Alexander Gordeev
agord...@redhat.com
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 0/3] Add pci_enable_msi_partial() to conserve MSI-related resources

2014-06-10 Thread Alexander Gordeev
Add a new pci_enable_msi_partial() interface and use it to
conserve otherwise wasted interrupt resources.

The AHCI driver is the first user; it would save 10 out of 16
otherwise unused MSI vectors on some Intel chipsets.

Cc: linux-...@vger.kernel.org
Cc: linux-m...@linux-mips.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: x...@kernel.org
Cc: xen-de...@lists.xenproject.org
Cc: io...@lists.linux-foundation.org
Cc: linux-...@vger.kernel.org
Cc: linux-...@vger.kernel.org

Alexander Gordeev (3):
  PCI/MSI: Add pci_enable_msi_partial()
  PCI/MSI/x86: Support pci_enable_msi_partial()
  AHCI: Use pci_enable_msi_partial() to conserve on 10/16 MSIs

 Documentation/PCI/MSI-HOWTO.txt |   36 ++--
 arch/mips/pci/msi-octeon.c  |2 +-
 arch/powerpc/kernel/msi.c   |4 +-
 arch/s390/pci/pci.c |2 +-
 arch/x86/include/asm/pci.h  |3 +-
 arch/x86/include/asm/x86_init.h |3 +-
 arch/x86/kernel/apic/io_apic.c  |2 +-
 arch/x86/kernel/x86_init.c  |4 +-
 arch/x86/pci/xen.c  |9 +++-
 drivers/ata/ahci.c  |4 +-
 drivers/iommu/irq_remapping.c   |   10 ++--
 drivers/pci/msi.c   |   83 ++-
 include/linux/msi.h |5 +-
 include/linux/pci.h |3 +
 14 files changed, 134 insertions(+), 36 deletions(-)

-- 
1.7.7.6

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()

2014-06-10 Thread Alexander Gordeev
There are PCI devices that require a particular value written
to the Multiple Message Enable (MME) register, while the number
of actually used MSI vectors 'nvec', aligned to a power-of-2
boundary, is less than that MME value:

roundup_pow_of_two(nvec) < 'Multiple Message Enable'

However, the existing pci_enable_msi_block() interface is not
able to configure such devices, since the value written to the
MME register is calculated from the number of requested MSIs
'nvec':

'Multiple Message Enable' = roundup_pow_of_two(nvec)

In this case the value written to the MME register may not
satisfy the aforementioned device requirement, and the PCI
functions will therefore not operate in the desired mode.

This update introduces the pci_enable_msi_partial() extension to
the pci_enable_msi_block() interface. It accepts an extra
'nvec_mme' argument, which is then written to the MME register,
while the value of 'nvec' is still used to set up as many
interrupts as requested.

As a result of this change, the architecture-specific callbacks
arch_msi_check_device() and arch_setup_msi_irqs() get an extra
'nvec_mme' parameter as well, but it is ignored for now.
Therefore, this update is a placeholder for architectures that
wish to support the pci_enable_msi_partial() function in the future.

Cc: linux-...@vger.kernel.org
Cc: linux-m...@linux-mips.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: x...@kernel.org
Cc: xen-de...@lists.xenproject.org
Cc: io...@lists.linux-foundation.org
Cc: linux-...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Signed-off-by: Alexander Gordeev agord...@redhat.com
---
 Documentation/PCI/MSI-HOWTO.txt |   36 ++--
 arch/mips/pci/msi-octeon.c  |2 +-
 arch/powerpc/kernel/msi.c   |4 +-
 arch/s390/pci/pci.c |2 +-
 arch/x86/kernel/x86_init.c  |2 +-
 drivers/pci/msi.c   |   83 ++-
 include/linux/msi.h |5 +-
 include/linux/pci.h |3 +
 8 files changed, 115 insertions(+), 22 deletions(-)

diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
index 10a9369..c8a8503 100644
--- a/Documentation/PCI/MSI-HOWTO.txt
+++ b/Documentation/PCI/MSI-HOWTO.txt
@@ -195,14 +195,40 @@ By contrast with pci_enable_msi_range() function, 
pci_enable_msi_exact()
 returns zero in case of success, which indicates MSI interrupts have been
 successfully allocated.
 
-4.2.4 pci_disable_msi
+4.2.4 pci_enable_msi_partial
+
+int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
+
+This variation on the pci_enable_msi_exact() call allows a device driver
+to set up 'nvec_mme' multiple MSIs with the PCI function, while setting
+up only 'nvec' (which may be less than 'nvec_mme') MSIs in the operating
+system. The MSI specification only allows 'nvec_mme' to be allocated in
+powers of two, up to a maximum of 2^5 (32).
+
+This function could be used when a PCI function is known to send 'nvec'
+MSIs, but still requires being initialized with a particular number of
+MSIs 'nvec_mme'. As a result, the 'nvec_mme' - 'nvec' unused MSIs do
+not waste system resources.
+
+If this function returns 0, it has succeeded in allocating 'nvec_mme'
+interrupts and setting up 'nvec' interrupts. In this case, the function
+enables MSI on this device and updates dev->irq to be the lowest of the
+new interrupts assigned to it.  The other interrupts assigned to the
+device are in the range dev->irq to dev->irq + nvec - 1.
+
+If this function returns a negative number, it indicates an error and
+the driver should not attempt to request any more MSI interrupts for
+this device.
+
+4.2.5 pci_disable_msi
 
 void pci_disable_msi(struct pci_dev *dev)
 
-This function should be used to undo the effect of pci_enable_msi_range().
-Calling it restores dev->irq to the pin-based interrupt number and frees
-the previously allocated MSIs.  The interrupts may subsequently be assigned
-to another device, so drivers should not cache the value of dev->irq.
+This function should be used to undo the effect of pci_enable_msi_range()
+or pci_enable_msi_partial(). Calling it restores dev->irq to the pin-based
+interrupt number and frees the previously allocated MSIs.  The interrupts
+may subsequently be assigned to another device, so drivers should not cache
+the value of dev->irq.
 
 Before calling this function, a device driver must always call free_irq()
 on any interrupt for which it previously called request_irq().
diff --git a/arch/mips/pci/msi-octeon.c b/arch/mips/pci/msi-octeon.c
index 2b91b0e..2be7979 100644
--- a/arch/mips/pci/msi-octeon.c
+++ b/arch/mips/pci/msi-octeon.c
@@ -178,7 +178,7 @@ msi_irq_allocated:
return 0;
 }
 
-int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
 {
struct msi_desc *entry;
int ret;
diff --git a/arch/powerpc/kernel/msi.c b/arch/powerpc/kernel/msi.c

Re: [PATCH 2/2] PCI/MSI: Phase out pci_enable_msi_block()

2014-05-01 Thread Alexander Gordeev
On Wed, Apr 30, 2014 at 05:49:33PM -0600, Bjorn Helgaas wrote:
 I mistakenly assumed this would have to wait because I thought there were
 other pci_enable_msi_block() users that wouldn't be removed until the v3.16
 merge window.  But I think I was wrong: I put your GenWQE patch in my tree,
 and I think that was the last use, so I can just add this patch on top.

That is right, GenWQE is the last one.

 But I need a little help understanding the changelog:
 
  Up until now, when enabling MSI mode for a device a single
  successful call to arch_msi_check_device() was followed by
  a single call to arch_setup_msi_irqs() function.
 
 I understand this part; the following two paths call
 arch_msi_check_device() once and then arch_setup_msi_irqs() once:
 
   pci_enable_msi_block
 pci_msi_check_device
   arch_msi_check_device
 msi_capability_init
   arch_setup_msi_irqs
 
   pci_enable_msix
 pci_msi_check_device
   arch_msi_check_device
 msix_capability_init
   arch_setup_msi_irqs
 
  Yet, if arch_msi_check_device() returned success we should be
  able to call arch_setup_msi_irqs() multiple times - while it
  returns a number of MSI vectors that could have been allocated
  (a third state).
 
 I don't know what you mean by a third state.

That is a number of MSI vectors that could have been allocated, which
is neither success nor failure. In previous conversations someone branded
this as the 'third state'.

 Previously we only called arch_msi_check_device() once.  After your patch,
 pci_enable_msi_range() can call it several times.  The only non-trivial
 implementation of arch_msi_check_device() is in powerpc, and all the
 ppc_md.msi_check_device() possibilities look safe to call multiple times.

Yep, I see it the same way.

 After your patch, the pci_enable_msi_range() path can also call
 arch_setup_msi_irqs() several times.  I don't see a problem with that --
 even if the first call succeeds and allocates something, then a subsequent
 call fails, I assume the allocations will be cleaned up when
 msi_capability_init() calls free_msi_irqs().

Well, the potential problem is related to the fact that arch_msi_check_device()
could be called with 'nvec1' while arch_setup_msi_irqs() could be called
with 'nvec2', where 'nvec1' > 'nvec2'.

While it is not a problem with current implementations, in theory it is
possible that free_msi_irqs() could be called with 'nvec2' and fail to clean
up 'nvec1' - 'nvec2' resources.

The only assumption that makes the above scenario impossible is if
arch_msi_check_device() is stateless.

Again, that is purely theoretical with no particular architecture in mind.

  This update makes use of the assumption described above. It
  could have broke things had the architectures done any pre-
  allocations or switch to some state in a call to function
  arch_msi_check_device(). But because arch_msi_check_device()
  is expected stateless and MSI resources are allocated in a
  follow-up call to arch_setup_msi_irqs() we should be fine.
 
 I guess you mean that your patch assumes arch_msi_check_device() is
 stateless.  That looks like a safe assumption to me.

Moreover, if arch_msi_check_device() was not stateless then it would
be superfluous, since all state switches and allocations are done in
arch_setup_msi_irqs() anyway.

In fact, I think arch_msi_check_device() could be eliminated, but I
do not want to engage with PPC folks at this stage ;)

 arch_setup_msi_irqs() is clearly not stateless, but I assume
 free_msi_irqs() is enough to clean up any state if we fail.
 
 Bjorn

-- 
Regards,
Alexander Gordeev
agord...@redhat.com
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 2/2] PCI/MSI: Phase out pci_enable_msi_block()

2014-04-30 Thread Alexander Gordeev
On Mon, Apr 14, 2014 at 03:28:34PM +0200, Alexander Gordeev wrote:
 There are no users of the pci_enable_msi_block() function
 left. Obsolete it in favor of the pci_enable_msi_range() and
 pci_enable_msi_exact() functions.

Hi Bjorn,

How about this one?

Thanks!

-- 
Regards,
Alexander Gordeev
agord...@redhat.com
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 2/2] PCI/MSI: Phase out pci_enable_msi_block()

2014-04-14 Thread Alexander Gordeev
There are no users of the pci_enable_msi_block() function
left. Obsolete it in favor of the pci_enable_msi_range() and
pci_enable_msi_exact() functions.

Up until now, when enabling MSI mode for a device, a single
successful call to arch_msi_check_device() was followed by
a single call to the arch_setup_msi_irqs() function.

Yet, if arch_msi_check_device() returned success, we should be
able to call arch_setup_msi_irqs() multiple times - for as long
as it returns a number of MSI vectors that could have been
allocated (a third state).

This update makes use of the assumption described above. It
could have broken things had the architectures done any pre-
allocations or switched to some state in a call to
arch_msi_check_device(). But because arch_msi_check_device()
is expected to be stateless and MSI resources are allocated in
a follow-up call to arch_setup_msi_irqs(), we should be fine.

Signed-off-by: Alexander Gordeev agord...@redhat.com
Cc: linux-m...@linux-mips.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-...@vger.kernel.org
---
 drivers/pci/msi.c   |   79 +-
 include/linux/pci.h |5 +--
 2 files changed, 34 insertions(+), 50 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 955ab79..fc669ab 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -883,50 +883,6 @@ int pci_msi_vec_count(struct pci_dev *dev)
 }
 EXPORT_SYMBOL(pci_msi_vec_count);
 
-/**
- * pci_enable_msi_block - configure device's MSI capability structure
- * @dev: device to configure
- * @nvec: number of interrupts to configure
- *
- * Allocate IRQs for a device with the MSI capability.
- * This function returns a negative errno if an error occurs.  If it
- * is unable to allocate the number of interrupts requested, it returns
- * the number of interrupts it might be able to allocate.  If it successfully
- * allocates at least the number of interrupts requested, it returns 0 and
- * updates the @dev's irq member to the lowest new interrupt number; the
- * other interrupt numbers allocated to this device are consecutive.
- */
-int pci_enable_msi_block(struct pci_dev *dev, int nvec)
-{
-   int status, maxvec;
-
-   if (dev->current_state != PCI_D0)
-   return -EINVAL;
-
-   maxvec = pci_msi_vec_count(dev);
-   if (maxvec < 0)
-   return maxvec;
-   if (nvec > maxvec)
-   return maxvec;
-
-   status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI);
-   if (status)
-   return status;
-
-   WARN_ON(!!dev->msi_enabled);
-
-   /* Check whether driver already requested MSI-X irqs */
-   if (dev->msix_enabled) {
-   dev_info(&dev->dev, "can't enable MSI "
-"(MSI-X already enabled)\n");
-   return -EINVAL;
-   }
-
-   status = msi_capability_init(dev, nvec);
-   return status;
-}
-EXPORT_SYMBOL(pci_enable_msi_block);
-
 void pci_msi_shutdown(struct pci_dev *dev)
 {
struct msi_desc *desc;
@@ -1132,14 +1088,45 @@ void pci_msi_init_pci_dev(struct pci_dev *dev)
  **/
 int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
 {
-   int nvec = maxvec;
+   int nvec;
int rc;
 
+   if (dev->current_state != PCI_D0)
+   return -EINVAL;
+
+   WARN_ON(!!dev->msi_enabled);
+
+   /* Check whether driver already requested MSI-X irqs */
+   if (dev->msix_enabled) {
+   dev_info(&dev->dev,
+"can't enable MSI (MSI-X already enabled)\n");
+   return -EINVAL;
+   }
+
 if (maxvec < minvec)
return -ERANGE;
 
+   nvec = pci_msi_vec_count(dev);
+   if (nvec < 0)
+   return nvec;
+   else if (nvec < minvec)
+   return -EINVAL;
+   else if (nvec > maxvec)
+   nvec = maxvec;
+
+   do {
+   rc = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI);
+   if (rc < 0) {
+   return rc;
+   } else if (rc > 0) {
+   if (rc < minvec)
+   return -ENOSPC;
+   nvec = rc;
+   }
+   } while (rc);
+
do {
-   rc = pci_enable_msi_block(dev, nvec);
+   rc = msi_capability_init(dev, nvec);
 if (rc < 0) {
 return rc;
 } else if (rc > 0) {
diff --git a/include/linux/pci.h b/include/linux/pci.h
index aab57b4..d8104f4 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1158,7 +1158,6 @@ struct msix_entry {
 
 #ifdef CONFIG_PCI_MSI
 int pci_msi_vec_count(struct pci_dev *dev);
-int pci_enable_msi_block(struct pci_dev *dev, int nvec);
 void pci_msi_shutdown(struct pci_dev *dev);
 void pci_disable_msi(struct pci_dev *dev);
 int pci_msix_vec_count(struct pci_dev *dev);
@@ -1188,8 +1187,6 @@ static inline int pci_enable_msix_exact(struct pci_dev 
*dev

Re: [PATCH 2/2] PCI/MSI: Phase out pci_enable_msi_block()

2014-04-14 Thread Alexander Gordeev
On Mon, Apr 14, 2014 at 09:14:07AM +0200, Alexander Gordeev wrote:
 @@ -1244,7 +1241,7 @@ static inline void pcie_set_ecrc_checking(struct 
 pci_dev *dev) { }
  static inline void pcie_ecrc_get_policy(char *str) { }
  #endif
  
 -#define pci_enable_msi(pdev) pci_enable_msi_block(pdev, 1)
 +#define pci_enable_msi(pdev) pci_enable_msi_range(pdev, 1, 1)

Self-Nack, it should have been pci_enable_msi_exact(pdev, 1) ^^^

  #ifdef CONFIG_HT_IRQ
  /* The functions a driver should call */

-- 
Regards,
Alexander Gordeev
agord...@redhat.com
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 2/2] PCI/MSI: Phase out pci_enable_msi_block()

2014-04-14 Thread Alexander Gordeev
There are no users of the pci_enable_msi_block() function
left. Obsolete it in favor of the pci_enable_msi_range() and
pci_enable_msi_exact() functions.

Up until now, when enabling MSI mode for a device, a single
successful call to arch_msi_check_device() was followed by
a single call to the arch_setup_msi_irqs() function.

Yet, if arch_msi_check_device() returned success, we should be
able to call arch_setup_msi_irqs() multiple times - for as long
as it returns a number of MSI vectors that could have been
allocated (a third state).

This update makes use of the assumption described above. It
could have broken things had the architectures done any pre-
allocations or switched to some state in a call to
arch_msi_check_device(). But because arch_msi_check_device()
is expected to be stateless and MSI resources are allocated in
a follow-up call to arch_setup_msi_irqs(), we should be fine.

Signed-off-by: Alexander Gordeev agord...@redhat.com
Cc: linux-m...@linux-mips.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-...@vger.kernel.org
---
 drivers/pci/msi.c   |   79 +-
 include/linux/pci.h |5 +--
 2 files changed, 34 insertions(+), 50 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 955ab79..fc669ab 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -883,50 +883,6 @@ int pci_msi_vec_count(struct pci_dev *dev)
 }
 EXPORT_SYMBOL(pci_msi_vec_count);
 
-/**
- * pci_enable_msi_block - configure device's MSI capability structure
- * @dev: device to configure
- * @nvec: number of interrupts to configure
- *
- * Allocate IRQs for a device with the MSI capability.
- * This function returns a negative errno if an error occurs.  If it
- * is unable to allocate the number of interrupts requested, it returns
- * the number of interrupts it might be able to allocate.  If it successfully
- * allocates at least the number of interrupts requested, it returns 0 and
- * updates the @dev's irq member to the lowest new interrupt number; the
- * other interrupt numbers allocated to this device are consecutive.
- */
-int pci_enable_msi_block(struct pci_dev *dev, int nvec)
-{
-	int status, maxvec;
-
-	if (dev->current_state != PCI_D0)
-		return -EINVAL;
-
-	maxvec = pci_msi_vec_count(dev);
-	if (maxvec < 0)
-		return maxvec;
-	if (nvec > maxvec)
-		return maxvec;
-
-	status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI);
-	if (status)
-		return status;
-
-	WARN_ON(!!dev->msi_enabled);
-
-	/* Check whether driver already requested MSI-X irqs */
-	if (dev->msix_enabled) {
-		dev_info(&dev->dev, "can't enable MSI "
-			 "(MSI-X already enabled)\n");
-		return -EINVAL;
-	}
-
-	status = msi_capability_init(dev, nvec);
-	return status;
-}
-EXPORT_SYMBOL(pci_enable_msi_block);
-
 void pci_msi_shutdown(struct pci_dev *dev)
 {
 	struct msi_desc *desc;
@@ -1132,14 +1088,45 @@ void pci_msi_init_pci_dev(struct pci_dev *dev)
  **/
 int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
 {
-	int nvec = maxvec;
+	int nvec;
 	int rc;
 
+	if (dev->current_state != PCI_D0)
+		return -EINVAL;
+
+	WARN_ON(!!dev->msi_enabled);
+
+	/* Check whether driver already requested MSI-X irqs */
+	if (dev->msix_enabled) {
+		dev_info(&dev->dev,
+			 "can't enable MSI (MSI-X already enabled)\n");
+		return -EINVAL;
+	}
+
 	if (maxvec < minvec)
 		return -ERANGE;
 
+	nvec = pci_msi_vec_count(dev);
+	if (nvec < 0)
+		return nvec;
+	else if (nvec < minvec)
+		return -EINVAL;
+	else if (nvec > maxvec)
+		nvec = maxvec;
+
+	do {
+		rc = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI);
+		if (rc < 0) {
+			return rc;
+		} else if (rc > 0) {
+			if (rc < minvec)
+				return -ENOSPC;
+			nvec = rc;
+		}
+	} while (rc);
+
 	do {
-		rc = pci_enable_msi_block(dev, nvec);
+		rc = msi_capability_init(dev, nvec);
 		if (rc < 0) {
 			return rc;
 		} else if (rc > 0) {
diff --git a/include/linux/pci.h b/include/linux/pci.h
index aab57b4..499755e 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1158,7 +1158,6 @@ struct msix_entry {
 
 #ifdef CONFIG_PCI_MSI
 int pci_msi_vec_count(struct pci_dev *dev);
-int pci_enable_msi_block(struct pci_dev *dev, int nvec);
 void pci_msi_shutdown(struct pci_dev *dev);
 void pci_disable_msi(struct pci_dev *dev);
 int pci_msix_vec_count(struct pci_dev *dev);
@@ -1188,8 +1187,6 @@ static inline int pci_enable_msix_exact(struct pci_dev *dev

Re: [PATCH v3 00/11] Introduce pcim_enable_msi*() family helpers

2013-12-06 Thread Alexander Gordeev
On Tue, Nov 26, 2013 at 10:09:49AM +0100, Alexander Gordeev wrote:
 Patches 3,4   - fixes for PowerPC pSeries platform
[...]
   PCI/MSI/pSeries: Fix wrong error code reporting
   PCI/MSI/pSeries: Make quota traversing and requesting race-safe

Hi Bjorn,

The two unreviewed PowerPC patches are not a blocker for the rest of the
series, as they are completely independent of the rest of the changes.
I can rework them later if needed.

-- 
Regards,
Alexander Gordeev
agord...@redhat.com


Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-15 Thread Alexander Gordeev
On Fri, Oct 11, 2013 at 04:29:39PM -0400, Mark Lord wrote:
  static int xx_alloc_msix_irqs(struct xx_dev *dev, int nvec)
  {
  	nvec = roundup_pow_of_two(nvec);	/* assume 0 < nvec <= 16 */
  
  	xx_disable_all_irqs(dev);
  
  	pci_lock_msi(dev->pdev);
  
  	rc = pci_get_msix_limit(dev->pdev, nvec);
  	if (rc < 0)
  		goto err;
  
  	nvec = min(nvec, rc);	/* if limit is more than requested */
  	nvec = rounddown_pow_of_two(nvec);	/* (a) */
  
  	xx_prep_for_msix_vectors(dev, nvec);
  
  	rc = pci_enable_msix(dev->pdev, dev->irqs, nvec);	/* (b)  */
  	if (rc < 0)
  		goto err;
  
  	pci_unlock_msi(dev->pdev);
  
  	dev->num_vectors = nvec;	/* (b) */
  	return 0;
  
  err:
  	pci_unlock_msi(dev->pdev);
  
  	kerr(dev->name, "pci_enable_msix() failed, err=%d", rc);
  	dev->num_vectors = 0;
  	return rc;
  }
 
 That would still need a loop, to handle the natural race between
 the calls to pci_get_msix_limit() and pci_enable_msix() -- the driver and 
 device
 can and should fall back to a smaller number of vectors when 
 pci_enable_msix() fails.

Could you please explain why the value returned by pci_get_msix_limit()
might change before pci_enable_msix() returns, while both are protected by
pci_lock_msi()?

Anyway, although the loop-free code (IMHO) reads better, pci_lock_msi()
is not part of the original proposal, and the more I think about it
the less I like it.

-- 
Regards,
Alexander Gordeev
agord...@redhat.com


Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-11 Thread Alexander Gordeev
On Thu, Oct 10, 2013 at 07:17:18PM -0400, Mark Lord wrote:
 Just to help us all understand the loop issue..
 
 Here's an example of driver code which uses the existing MSI-X interfaces,
 for a device which can work with either 16, 8, 4, 2, or 1 MSI-X interrupt.
 This is from a new driver I'm working on right now:
 
 
 static int xx_alloc_msix_irqs (struct xx_dev *dev, int nvec)
 {
 	xx_disable_all_irqs(dev);
 	do {
 		if (nvec < 2)
 			xx_prep_for_1_msix_vector(dev);
 		else if (nvec < 4)
 			xx_prep_for_2_msix_vectors(dev);
 		else if (nvec < 8)
 			xx_prep_for_4_msix_vectors(dev);
 		else if (nvec < 16)
 			xx_prep_for_8_msix_vectors(dev);
 		else
 			xx_prep_for_16_msix_vectors(dev);
 		nvec = pci_enable_msix(dev->pdev, dev->irqs,
 				       dev->num_vectors);
 	} while (nvec > 0);
 
 	if (nvec) {
 		kerr(dev->name, "pci_enable_msix() failed, err=%d", nvec);
 		dev->num_vectors = 0;
 		return nvec;
 	}
 	return 0;	/* success */
 }

Yeah, that is a very good example.

I would move all xx_prep_for_pow2_msix_vector() functions to a single
helper i.e. xx_prep_msix_vectors(dev, ndev).

Considering also that (a) we do not want to waste unused platform resources
associated with MSI-Xs and pull more quota than needed, and (b) fixing a
couple of bugs, this function could look like this:

static int xx_alloc_msix_irqs(struct xx_dev *dev, int nvec_in)
{
	int nvec = roundup_pow_of_two(nvec_in);	/* assume 0 < nvec_in <= 16 */
	int rc;

	xx_disable_all_irqs(dev);

retry:
	xx_prep_for_msix_vectors(dev, nvec);

	rc = pci_enable_msix(dev->pdev, dev->irqs, nvec);	/* (b) */
	if (rc > 0) {
		nvec = rounddown_pow_of_two(nvec);	/* (a) */
		goto retry;
	}

	if (rc) {
		kerr(dev->name, "pci_enable_msix() failed, err=%d", rc);
		dev->num_vectors = 0;
		return rc;
	}

	dev->num_vectors = nvec;	/* (b) */
	return 0;	/* success */
}

Now, this is a loop-free alternative:

static int xx_alloc_msix_irqs(struct xx_dev *dev, int nvec)
{
	nvec = roundup_pow_of_two(nvec);	/* assume 0 < nvec <= 16 */

	xx_disable_all_irqs(dev);

	pci_lock_msi(dev->pdev);

	rc = pci_get_msix_limit(dev->pdev, nvec);
	if (rc < 0)
		goto err;

	nvec = min(nvec, rc);	/* if limit is more than requested */
	nvec = rounddown_pow_of_two(nvec);	/* (a) */

	xx_prep_for_msix_vectors(dev, nvec);

	rc = pci_enable_msix(dev->pdev, dev->irqs, nvec);	/* (b)  */
	if (rc < 0)
		goto err;

	pci_unlock_msi(dev->pdev);

	dev->num_vectors = nvec;	/* (b) */
	return 0;

err:
	pci_unlock_msi(dev->pdev);

	kerr(dev->name, "pci_enable_msix() failed, err=%d", rc);
	dev->num_vectors = 0;
	return rc;
}

-- 
Regards,
Alexander Gordeev
agord...@redhat.com


Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-10 Thread Alexander Gordeev
On Wed, Oct 09, 2013 at 03:24:08PM +1100, Benjamin Herrenschmidt wrote:
 On Tue, 2013-10-08 at 20:55 -0700, H. Peter Anvin wrote:
  Why not add a minimum number to pci_enable_msix(), i.e.:
  
  pci_enable_msix(pdev, msix_entries, nvec, minvec)
  
  ... which means nvec is the number of interrupts *requested*, and
  minvec is the minimum acceptable number (otherwise fail).
 
 Which is exactly what Ben (the other Ben :-) suggested and that I
 supports...

Ok, this suggestion has sounded in one form or another from several people.
What about naming it pcim_enable_msix_range() and wrapping it in a couple
more helpers to complete an API:

int pcim_enable_msix_range(pdev, msix_entries, nvec, minvec);
< 0 - error code
> 0 - number of MSIs allocated, where minvec <= result <= nvec

int pcim_enable_msix(pdev, msix_entries, nvec);
< 0 - error code
> 0 - number of MSIs allocated, where 1 <= result <= nvec

int pcim_enable_msix_exact(pdev, msix_entries, nvec);
< 0 - error code
> 0 - number of MSIs allocated, where result == nvec

The latter's return value seems odd, but I can not help to make
it consistent with the first two.
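
A rough sketch of how the two convenience wrappers could sit on top of the
range variant (names and argument order exactly as proposed above; this is
not an existing interface):

static inline int pcim_enable_msix(struct pci_dev *pdev,
				   struct msix_entry *entries, int nvec)
{
	/* accept anything from 1 vector up to nvec */
	return pcim_enable_msix_range(pdev, entries, nvec, 1);
}

static inline int pcim_enable_msix_exact(struct pci_dev *pdev,
					 struct msix_entry *entries, int nvec)
{
	/* succeed only if exactly nvec vectors can be allocated */
	return pcim_enable_msix_range(pdev, entries, nvec, nvec);
}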


(Sorry if you see this message twice - my MUA seems struggle with one of CC).

 Cheers,
 Ben.
 
 

-- 
Regards,
Alexander Gordeev
agord...@redhat.com


Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-10 Thread Alexander Gordeev
On Thu, Oct 10, 2013 at 09:28:27AM -0700, H. Peter Anvin wrote:
 On 10/10/2013 03:17 AM, Alexander Gordeev wrote:
  On Wed, Oct 09, 2013 at 03:24:08PM +1100, Benjamin Herrenschmidt wrote:
  
  Ok, this suggestion sounded in one or another form by several people.
  What about name it pcim_enable_msix_range() and wrap in couple more
  helpers to complete an API:
  
   int pcim_enable_msix_range(pdev, msix_entries, nvec, minvec);
   < 0 - error code
   > 0 - number of MSIs allocated, where minvec <= result <= nvec
   
   int pcim_enable_msix(pdev, msix_entries, nvec);
   < 0 - error code
   > 0 - number of MSIs allocated, where 1 <= result <= nvec 
   
   int pcim_enable_msix_exact(pdev, msix_entries, nvec);
   < 0 - error code
   > 0 - number of MSIs allocated, where result == nvec
  
  The latter's return value seems odd, but I can not help to make
  it consistent with the first two.
  
 
 Is there a reason for the wrappers, as opposed to just specifying either
 1 or nvec as the minimum?

The wrappers are more handy IMO.

I.e. I can imagine people struggling to figure out what minvec to provide:
1 or 0? Why 1? Oh.. okay.. Or should we tolerate 0 (as opposed to -ERANGE)?

Well, do not know.. pcim_enable_msix(pdev, msix_entries, nvec, nvec) is 
less readable for me than just pcim_enable_msix_exact(pdev, msix_entries,
nvec).

   -hpa

-- 
Regards,
Alexander Gordeev
agord...@redhat.com


Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-09 Thread Alexander Gordeev
On Mon, Oct 07, 2013 at 02:01:11PM -0400, Tejun Heo wrote:
 Hmmm... yean, the race condition could be an issue as multiple msi
 allocation might fail even if the driver can and explicitly handle
 multiple allocation if the quota gets reduced inbetween.

BTW, should we care about the quota getting increased inbetween?
That would entail.. kind of pci_get_msi_limit() :), but IMHO it is
not worth it.

 tejun

-- 
Regards,
Alexander Gordeev
agord...@redhat.com


Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-08 Thread Alexander Gordeev
On Tue, Oct 08, 2013 at 03:33:30PM +1100, Michael Ellerman wrote:
 On Wed, Oct 02, 2013 at 12:29:04PM +0200, Alexander Gordeev wrote:
  This technique proved to be confusing and error-prone. Vast share
  of device drivers simply fail to follow the described guidelines.
 
 To clarify Vast share of device drivers:
 
  - 58 drivers call pci_enable_msix()
  - 24 try a single allocation and then fallback to MSI/LSI
  - 19 use the loop style allocation as above
  - 14 try an allocation, and if it fails retry once
  - 1  incorrectly continues when pci_enable_msix() returns > 0
 
 So 33 drivers (> 50%) successfully make use of the confusing and
 error-prone return value.

Ok, you caught me - 'vast share' is incorrect and is subject to
rewording. But out of 19/58 how many drivers tested fallbacks on the
real hardware? IOW, which drivers are affected by the pSeries quota?

 And yes, one is buggy, so obviously the interface is too complex. Thanks
 drivers/ntb/ntb_hw.c

vmxnet3 would be the best example.

 cheers

-- 
Regards,
Alexander Gordeev
agord...@redhat.com


Re: [PATCH RFC 07/77] PCI/MSI: Re-design MSI/MSI-X interrupts enablement pattern

2013-10-08 Thread Alexander Gordeev
On Mon, Oct 07, 2013 at 02:17:49PM -0400, Tejun Heo wrote:
 Hello,
 
 On Wed, Oct 02, 2013 at 12:48:23PM +0200, Alexander Gordeev wrote:
   +static int foo_driver_enable_msi(struct foo_adapter *adapter, int nvec)
   +{
   +	rc = pci_get_msi_cap(adapter->pdev);
   +	if (rc < 0)
   +		return rc;
   +
   +	nvec = min(nvec, rc);
   +	if (nvec < FOO_DRIVER_MINIMUM_NVEC)
   +		return -ENOSPC;
   +
   +	rc = pci_enable_msi_block(adapter->pdev, nvec);
   +	return rc;
   +}
 
 If there are many which duplicate the above pattern, it'd probably be
 worthwhile to provide a helper?  It's usually a good idea to reduce
 the amount of boilerplate code in drivers.

I wanted to limit discussion in v1 to as little changes as possible.
I 'planned' those helper(s) for a separate effort if/when the most
important change is accepted and soaked a bit.
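
Purely as an illustration of the kind of helper Tejun is asking about (the
name foo_enable_msi_range() and its exact semantics are hypothetical, not
something this series adds), the boilerplate above could collapse into:

static int foo_enable_msi_range(struct pci_dev *pdev, int minvec, int maxvec)
{
	int rc, nvec = pci_get_msi_cap(pdev);

	if (nvec < 0)
		return nvec;
	nvec = min(nvec, maxvec);
	if (nvec < minvec)
		return -ENOSPC;

	rc = pci_enable_msi_block(pdev, nvec);
	return rc ? rc : nvec;	/* number of vectors actually enabled */
}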

  @@ -975,7 +951,7 @@ int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
  	if (nr_entries < 0)
  		return nr_entries;
  	if (nvec > nr_entries)
  -		return nr_entries;
  +		return -EINVAL;
   
  	/* Check for any invalid entries */
  	for (i = 0; i < nvec; i++) {
 
 If we do things this way, it breaks all drivers using this interface
 until they're converted, right?

Right. And the rest of the series does it.

 Also, it probably isn't the best idea
 to flip the behavior like this as this can go completely unnoticed (no
 compiler warning or anything, the same function just behaves
 differently).  Maybe it'd be a better idea to introduce a simpler
 interface that most can be converted to?

Well, an *other* interface is a good idea. What do you mean with the
"simpler" here?

 tejun

-- 
Regards,
Alexander Gordeev
agord...@redhat.com


Re: [PATCH RFC 05/77] PCI/MSI: Convert pci_msix_table_size() to a public interface

2013-10-08 Thread Alexander Gordeev
On Mon, Oct 07, 2013 at 02:10:43PM -0400, Tejun Heo wrote:
 On Wed, Oct 02, 2013 at 12:48:21PM +0200, Alexander Gordeev wrote:
   Make pci_msix_table_size() return an error code if the device
  does not support MSI-X. This update is needed to facilitate a
  forthcoming re-design MSI/MSI-X interrupts enabling pattern.
  
  Device drivers will use this interface to obtain maximum number
  of MSI-X interrupts the device supports and use that value in
  the following call to pci_enable_msix() interface.
 
 Hmmm... I probably missed something but why is this necessary?  To
 discern between -EINVAL and -ENOSPC?  If so, does that really matter?

pci_msix_table_size() is a kind of helper and returns 0 if a device does
not have an MSI-X table. If we want drivers to use it, we need to return
-EINVAL for MSI-X incapable/disabled devices. Nothing to do with -ENOSPC.
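
A minimal sketch of the intended driver-side use, assuming the reworked
return convention described above:

	nr_entries = pci_msix_table_size(pdev);
	if (nr_entries < 0)
		return nr_entries;	/* e.g. -EINVAL: MSI-X not supported */
	nvec = min(nvec, nr_entries);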

 tejun

-- 
Regards,
Alexander Gordeev
agord...@redhat.com


Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-08 Thread Alexander Gordeev
On Mon, Oct 07, 2013 at 02:21:17PM -0400, Tejun Heo wrote:
 Whee that's a lot more than I expected.  I was just scanning
 multiple msi users.  Maybe we can stage the work in more manageable
 steps so that you don't have to go through massive conversion only to
 do it all over again afterwards and likewise people don't get
 bombarded on each iteration?  Maybe we can first update pci / msi code
 proper, msi and then msix?

Multiple MSIs are just a handful of drivers, really. The MSI-X impact will
still be huge. But if we opt for a different name for the new pci_enable_msix()
then we could first update pci/msi, then drivers (in a few stages possibly)
and finally remove the old implementation.

 tejun

-- 
Regards,
Alexander Gordeev
agord...@redhat.com


Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-08 Thread Alexander Gordeev
On Mon, Oct 07, 2013 at 02:01:11PM -0400, Tejun Heo wrote:
  What about introducing pci_lock_msi() and pci_unlock_msi() and let device
  drivers care about their ranges and specifics in race-safe manner?
  I do not call to introduce it right now (since it appears pSeries has not
  been hitting the race for years) just as a possible alternative to Ben's
  proposal.
 
 I don't think the same race condition would happen with the loop.

We need to distinguish the contexts we're discussing.

If we talk about pSeries quota, then the current pSeries pci_enable_msix()
implementation is racy internally and could fail if the quota went down
*while* pci_enable_msix() is executing. In this case the loop will have to
exit rather than retry with a lower number (what number?).

In this regard the new scheme does not bring anything new; it relies on
the fact that this race does not hit, and therefore does not worry about it.

If we talk about quota as it has to be, then yes - the loop scheme seems
more preferable.

Overall, looks like we just need to fix the pSeries implementation,
if the guys want it, he-he :)

 The problem case is where multiple msi(x) allocation fails completely
 because the global limit went down before inquiry and allocation.  In
 the loop based interface, it'd retry with the lower number.

I am probably missing something here. If the global limit went down before
the inquiry then the inquiry will get what is available and try to allocate
with that number.

-- 
Regards,
Alexander Gordeev
agord...@redhat.com


Re: [PATCH RFC 54/77] ntb: Ensure number of MSIs on SNB is enough for the link interrupt

2013-10-07 Thread Alexander Gordeev
On Mon, Oct 07, 2013 at 09:50:57AM -0700, Jon Mason wrote:
 On Sat, Oct 05, 2013 at 11:43:04PM +0200, Alexander Gordeev wrote:
  On Wed, Oct 02, 2013 at 05:48:05PM -0700, Jon Mason wrote:
   On Wed, Oct 02, 2013 at 12:49:10PM +0200, Alexander Gordeev wrote:
Signed-off-by: Alexander Gordeev agord...@redhat.com
---
 drivers/ntb/ntb_hw.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/ntb/ntb_hw.c b/drivers/ntb/ntb_hw.c
index de2062c..eccd5e5 100644
--- a/drivers/ntb/ntb_hw.c
+++ b/drivers/ntb/ntb_hw.c
@@ -1066,7 +1066,7 @@ static int ntb_setup_msix(struct ntb_device *ndev)
 	/* On SNB, the link interrupt is always tied to 4th vector.  If
 	 * we can't get all 4, then we can't use MSI-X.
 	 */
-	if (ndev->hw_type != BWD_HW) {
+	if ((rc < SNB_MSIX_CNT) && (ndev->hw_type != BWD_HW)) {
   
   Nack, this check is unnecessary.
  
  If SNB can do more than SNB_MSIX_CNT MSI-Xs then this check is needed
  to enable less than maximum MSI-Xs in case the maximum was not allocated.
  Otherwise SNB will fallback to single MSI instead of multiple MSI-Xs.
 
 Per the comment in the code snippet above, "If we can't get all 4,
 then we can't use MSI-X."  There is already a check to see if more
 than 4 were acquired.  So it's not possible to hit this.  Even if it
 was, don't use SNB_MSIX_CNT here (limits.msix_cnt is the preferred
 variable).  Also, the () are unnecessary.

The changelog is definitely bogus. I meant here an improvement to the
existing scheme, not a conversion to the new one:

msix_entries = msix_table_size(val);

Getting e.g. 16 vectors here.

if (msix_entries > ndev->limits.msix_cnt) {
	rc = -EINVAL;
	goto err;
}

Upper limit check i.e. succeeds.

[...]

rc = pci_enable_msix(pdev, ndev->msix_entries, msix_entries);

pci_enable_msix() does not succeed and returns e.g. 8 here; we should retry.

if (rc < 0)
	goto err1;
if (rc > 0) {
	/* On SNB, the link interrupt is always tied to 4th vector.  If
	 * we can't get all 4, then we can't use MSI-X.
	 */
	if (ndev->hw_type != BWD_HW) {

On SNB we bail out here, although we could have continued with 8 vectors.
Can only use SNB_MSIX_CNT here, since limits.msix_cnt is the upper limit.

		rc = -EIO;
		goto err1;
	}

[...]
}

-- 
Regards,
Alexander Gordeev
agord...@redhat.com


Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-06 Thread Alexander Gordeev
On Sun, Oct 06, 2013 at 08:46:26AM +1100, Benjamin Herrenschmidt wrote:
 On Sat, 2013-10-05 at 16:20 +0200, Alexander Gordeev wrote:
  So my point is - drivers should first obtain a number of MSIs they *can*
  get, then *derive* a number of MSIs the device is fine with and only then
  request that number. Not terribly different from memory or any other type
  of resource allocation ;)
 
 What if the limit is for a group of devices ? Your interface is racy in
 that case, another driver could have eaten into the limit in between the
 calls.

Well, the other driver has had better karma ;) But seriously, the
current scheme with a loop is not race-safe wrt any other type of
resource which might be exhausted. What makes the quota so special that we
should care about it but not care e.g. about a lack of msi_desc's?

Yeah, I know the quota might hit more likely. But why is it not addressed
right now then? Not a single function in the chains...
  rtas_msi_check_device() -> msi_quota_for_device() -> traverse_pci_devices()
  rtas_setup_msi_irqs() -> msi_quota_for_device() -> traverse_pci_devices()
...is race-safe. So if it has not been bothering anyone until now then
there is no reason to start worrying now :)

In fact, in the current design, to address the quota race decently the
drivers would have to protect the *loop* to prevent the quota from changing
between a pci_enable_msix() call that returned a positive number and the next
call to pci_enable_msix() with that number. Is it doable?
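
To make that concrete, a sketch of such a protected loop (pci_lock_msi() and
pci_unlock_msi() being the hypothetical interfaces floated earlier in this
thread, not an existing API):

	pci_lock_msi(pdev);		/* hypothetical: freeze the quota */
	for (;;) {
		rc = pci_enable_msix(pdev, entries, nvec);
		if (rc <= 0)
			break;		/* 0 = success, < 0 = hard error */
		nvec = rc;		/* quota allows only rc vectors, retry */
	}
	pci_unlock_msi(pdev);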

 Ben.
 
 

-- 
Regards,
Alexander Gordeev
agord...@redhat.com


Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-06 Thread Alexander Gordeev
On Sun, Oct 06, 2013 at 05:19:46PM +1100, Benjamin Herrenschmidt wrote:
 On Sun, 2013-10-06 at 08:02 +0200, Alexander Gordeev wrote:
  In fact, in the current design to address the quota race decently the
  drivers would have to protect the *loop* to prevent the quota change
  between a pci_enable_msix() returned a positive number and the the next
  call to pci_enable_msix() with that number. Is it doable?
 
 I am not advocating for the current design, simply saying that your
 proposal doesn't address this issue while Ben's does.

There is one major flaw in the min-max approach - the generic MSI layer
will have to take decisions on the exact number of MSIs to request, not
the device drivers.

This will never work for all devices, because there might be specific
requirements which are not covered by the min-max. That is what Ben
described: "...say, any even number within a certain range". Ben suggests
leaving the existing loop scheme to cover such devices, which I think is
not right.

What about introducing pci_lock_msi() and pci_unlock_msi() and letting device
drivers care about their ranges and specifics in a race-safe manner?
I am not calling to introduce it right now (since it appears pSeries has not
been hitting the race for years), just as a possible alternative to Ben's
proposal.
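
As an illustration of such a requirement (a hypothetical driver, not taken
from this series): a device that only works with an even number of vectors
cannot express that through a plain (min, max) pair and still needs its own
loop around the tri-state pci_enable_msix() return value:

	int nvec = 16;	/* assumed maximum the hypothetical device supports */

	for (;;) {
		rc = pci_enable_msix(pdev, entries, nvec);
		if (rc == 0)
			break;		/* got exactly nvec vectors */
		if (rc < 0)
			return rc;	/* hard error */
		nvec = rc & ~1;		/* retry with the next lower even count */
		if (nvec < 2)
			return -ENOSPC;
	}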

-- 
Regards,
Alexander Gordeev
agord...@redhat.com


Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-05 Thread Alexander Gordeev
On Fri, Oct 04, 2013 at 10:29:16PM +0100, Ben Hutchings wrote:
 On Fri, 2013-10-04 at 10:29 +0200, Alexander Gordeev wrote:
 All I can see there is that Tejun didn't think that the global limits
 and positive return values were implemented by any architecture.

I would say more than just that :) I picked a few quotes which in my
reading represent the guys' positions:

Tejun Heo: "...do we really
care about possible partial success?  This sort of interface is
unnecessarily complex and actively harmful.  It forces all users to
wonder what could possibly happen and implement all sorts of nutty
fallback logic which is highly likely to be error-prone on both the
software and hardware side.  Seriously, how much testing would such
code path would get both on the driver and firmware sides?"

Bjorn Helgaas: "I agree, that would be much simpler.
I propose that you rework it that way, and at least find out what
(if anything) would break if we do that."

Michael Ellerman: "I really think you're overstating the complexity here.
Functions typically return a boolean   - nothing to see here
This function returns a tristate value - brain explosion!";
"All a lot of bother for no real gain IMHO."

 But you have a counter-example, so I'm not sure what your point is.

I concur with Tejun. I think we need to get rid of the loop.

As for the counter-example, I think we could honour the pSeries quota by
introducing an interface to interrogate what you call global limits -
pci_get_msix_limit(), i.e.:

rc = pci_msix_table_size(pdev, nvec);
if (rc < 0)
	return rc;

nvec = min(rc, nvec);

rc = pci_get_msix_limit(pdev, nvec);
if (rc < 0)
	return rc;

nvec = min(rc, nvec);

for (i = 0; i < nvec; i++)
	msix_entry[i].entry = i;

rc = pci_enable_msix(pdev, msix_entry, nvec);
if (rc)
	return rc;

The latest state of those discussions is somewhere around Michael's:
"We could probably make that work." and Tejun's: "Are we talking about
some limited number of device drivers here? Also, is the quota still
necessary for machines in production today?"

So my point is - drivers should first obtain a number of MSIs they *can*
get, then *derive* a number of MSIs the device is fine with and only then
request that number. Not terribly different from memory or any other type
of resource allocation ;)

 It has been quite a while since I saw this happen on x86.  But I just
 checked on a test system running RHEL 5 i386 (Linux 2.6.18).  If I ask
 for 16 MSI-X vectors on a device that supports 1024, the return value is
 8, and indeed I can then successfully allocate 8.
 
 Now that's going quite a way back, and it may be that global limits
 aren't a significant problem any more.  With the x86_64 build of RHEL 5
 on an identical system, I can allocate 16 or even 32, so this is
 apparently not a hardware limit in this case.

Well, I do not know how to comment here. 2.6.18 has a significantly
different codebase wrt MSIs. What about a recent version?

 Most drivers seem to either:
 (a) require exactly a certain number of MSI vectors, or
 (b) require a minimum number of MSI vectors, usually want to allocate
 more, and work with any number in between
 
 We can support drivers in both classes by adding new allocation
 functions that allow specifying a minimum (required) and maximum
 (wanted) number of MSI vectors.  Those in class (a) would just specify
 the same value for both.  These new functions can take account of any
 global limit or allocation policy without any further changes to the
 drivers that use them.

I think such an interface is redundant wrt the current pci_enable_msix()
implementation.. and you propose to leave it. IMO it unnecessarily bloats
the generic MSI API with no demand from drivers.

 The few drivers with more specific requirements would still need to
 implement the currently recommended loop, using the old allocation
 functions.

Although the classes of drivers you specified indeed exist, I do believe
just a single interface is enough to handle them all.
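
For illustration, both classes collapse onto one range-style call by choosing
the minimum accordingly (sketch assuming a helper with the signature
pci_enable_msix_range(dev, entries, minvec, maxvec); the FOO_* constants are
hypothetical driver values):

	/* class (a): the device needs exactly FOO_NVEC vectors */
	rc = pci_enable_msix_range(pdev, entries, FOO_NVEC, FOO_NVEC);

	/* class (b): anything between FOO_MIN_NVEC and FOO_MAX_NVEC will do */
	rc = pci_enable_msix_range(pdev, entries, FOO_MIN_NVEC, FOO_MAX_NVEC);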

 Ben.

-- 
Regards,
Alexander Gordeev
agord...@redhat.com


Re: [PATCH RFC 54/77] ntb: Ensure number of MSIs on SNB is enough for the link interrupt

2013-10-05 Thread Alexander Gordeev
On Wed, Oct 02, 2013 at 05:48:05PM -0700, Jon Mason wrote:
 On Wed, Oct 02, 2013 at 12:49:10PM +0200, Alexander Gordeev wrote:
  Signed-off-by: Alexander Gordeev agord...@redhat.com
  ---
   drivers/ntb/ntb_hw.c |2 +-
   1 files changed, 1 insertions(+), 1 deletions(-)
  
  diff --git a/drivers/ntb/ntb_hw.c b/drivers/ntb/ntb_hw.c
  index de2062c..eccd5e5 100644
  --- a/drivers/ntb/ntb_hw.c
  +++ b/drivers/ntb/ntb_hw.c
  @@ -1066,7 +1066,7 @@ static int ntb_setup_msix(struct ntb_device *ndev)
  /* On SNB, the link interrupt is always tied to 4th vector.  If
   * we can't get all 4, then we can't use MSI-X.
   */
  -	if (ndev->hw_type != BWD_HW) {
  +	if ((rc < SNB_MSIX_CNT) && (ndev->hw_type != BWD_HW)) {
 
 Nack, this check is unnecessary.

If SNB can do more than SNB_MSIX_CNT MSI-Xs then this check is needed
to enable fewer than the maximum MSI-Xs in case the maximum was not allocated.
Otherwise SNB will fall back to a single MSI instead of multiple MSI-Xs.

-- 
Regards,
Alexander Gordeev
agord...@redhat.com


Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-04 Thread Alexander Gordeev
On Thu, Oct 03, 2013 at 11:49:45PM +0100, Ben Hutchings wrote:
 On Wed, 2013-10-02 at 12:48 +0200, Alexander Gordeev wrote:
  This update converts pci_enable_msix() and pci_enable_msi_block()
  interfaces to canonical kernel functions and makes them return a
  error code in case of failure or 0 in case of success.
 [...]
 
 I think this is fundamentally flawed: pci_msix_table_size() and
 pci_get_msi_cap() can only report the limits of the *device* (which the
 driver usually already knows), whereas MSI allocation can also be
 constrained due to *global* limits on the number of distinct IRQs.

Even the current implementation by no means addresses it. Although it
might seem a case for architectures to report the number of IRQs available
for a driver to retry, in fact they all just fail. The same applies to
*any* other type of resource involved: irq_desc's, CPU interrupt vector
space, msi_desc's etc. No platform cares about it and just bails out once
a constrain met (please correct me if I am wrong here). Given that Linux
has been doing well even on embedded I think we should not change it.

The only exception to the above is the pSeries platform, which takes advantage
of the current design (to implement the MSI quota). There are indications we
can satisfy the pSeries requirements, but the design proposed in this RFC
is not going to change drastically anyway. The start of the discussion
is here: https://lkml.org/lkml/2013/9/5/293

 Currently pci_enable_msix() will report a positive value if it fails due
 to the global limit.  Your patch 7 removes that.  pci_enable_msi_block()
 unfortunately doesn't appear to do this.

pci_enable_msi_block() can do more than one MSI only on x86 (with IOMMU),
but it does not bother to return positive numbers, indeed.

 It seems to me that a more useful interface would take a minimum and
 maximum number of vectors from the driver.  This wouldn't allow the
 driver to specify that it could only accept, say, any even number within
 a certain range, but you could still leave the current functions
 available for any driver that needs that.

Mmmm.. I am not sure I am getting it. Could you please rephrase?

 Ben.

-- 
Regards,
Alexander Gordeev
agord...@redhat.com


Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-04 Thread Alexander Gordeev
On Fri, Oct 04, 2013 at 09:31:49AM +0100, David Laight wrote:
  Mmmm.. I am not sure I am getting it. Could you please rephrase?
 
 One possibility is for drivers than can use a lot of interrupts to
 request a minimum number initially and then request the additional
 ones much later on.
 That would make it less likely that none will be available for
 devices/drivers that need them but are initialised later.

It sounds like a whole new topic to me. Isn't it?

Anyway, what prevents the above from being done with pci_enable_msix(nvec1) ->
pci_disable_msix() -> pci_enable_msix(nvec2), where nvec1 < nvec2?
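
A minimal sketch of that grow-later pattern (hypothetical driver helper,
simplified error handling):

static int foo_grow_msix(struct pci_dev *pdev, struct msix_entry *entries,
			 int nvec1, int nvec2)
{
	int rc;

	/* nvec1 vectors were enabled at probe time; try to grow to nvec2 */
	pci_disable_msix(pdev);

	rc = pci_enable_msix(pdev, entries, nvec2);	/* nvec1 < nvec2 */
	if (rc)
		rc = pci_enable_msix(pdev, entries, nvec1);	/* fall back */

	return rc;
}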

   David

-- 
Regards,
Alexander Gordeev
agord...@redhat.com


Re: [PATCH RFC 50/77] mlx5: Update MSI/MSI-X interrupts enablement code

2013-10-03 Thread Alexander Gordeev
On Thu, Oct 03, 2013 at 10:14:33AM +0300, Eli Cohen wrote:
 On Wed, Oct 02, 2013 at 12:49:06PM +0200, Alexander Gordeev wrote:
   
   +	err = pci_msix_table_size(dev->pdev);
   +	if (err < 0)
   +		return err;
   +
   	nvec = dev->caps.num_ports * num_online_cpus() + MLX5_EQ_VEC_COMP_BASE;
   	nvec = min_t(int, nvec, num_eqs);
   +	nvec = min_t(int, nvec, err);
   	if (nvec <= MLX5_EQ_VEC_COMP_BASE)
   		return -ENOSPC;
 
  Making sure we don't request more vectors than the device is capable
 of -- looks good.
   
   @@ -131,20 +136,15 @@ static int mlx5_enable_msix(struct mlx5_core_dev *dev)
   	for (i = 0; i < nvec; i++)
   		table->msix_arr[i].entry = i;
    
   -retry:
   -	table->num_comp_vectors = nvec - MLX5_EQ_VEC_COMP_BASE;
   	err = pci_enable_msix(dev->pdev, table->msix_arr, nvec);
   -	if (err <= 0) {
   +	if (err) {
   +		kfree(table->msix_arr);
   		return err;
   -	} else if (err > MLX5_EQ_VEC_COMP_BASE) {
   -		nvec = err;
   -		goto retry;
   	}
   
 
 According to latest sources, pci_enable_msix() may still fail so why
 do you want to remove this code?

pci_enable_msix() may fail, but it can not return a positive number.

We first calculate how many MSI-Xs we need, adjust to what we can get,
check if that is enough and only then go for it.

   -	mlx5_core_dbg(dev, "received %d MSI vectors out of %d requested\n", err, nvec);
   -	kfree(table->msix_arr);
    
   +	table->num_comp_vectors = nvec - MLX5_EQ_VEC_COMP_BASE;
  -   return -ENOSPC;
  +   return 0;
   }
   
   static void mlx5_disable_msix(struct mlx5_core_dev *dev)

-- 
Regards,
Alexander Gordeev
agord...@redhat.com


Re: [PATCH RFC 06/77] PCI/MSI: Factor out pci_get_msi_cap() interface

2013-10-03 Thread Alexander Gordeev
On Thu, Oct 03, 2013 at 10:52:54PM +0100, Ben Hutchings wrote:
 On Wed, 2013-10-02 at 12:48 +0200, Alexander Gordeev wrote:
   #ifndef CONFIG_PCI_MSI
  +static inline int pci_get_msi_cap(struct pci_dev *dev)
  +{
  +   return -1;
 [...]
 
 Shouldn't this also return -EINVAL?

Yep, all the inline stubs here would be better off returning -EINVAL.
Will do unless someone speaks out against it.
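
i.e. the !CONFIG_PCI_MSI stub would then read (sketch mirroring the hunk
quoted above):

#ifndef CONFIG_PCI_MSI
static inline int pci_get_msi_cap(struct pci_dev *dev)
{
	return -EINVAL;
}
#endif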

 Ben.

-- 
Regards,
Alexander Gordeev
agord...@redhat.com

