Re: [PATCH v4 3/7] kprobes: validate the symbol name provided during probe registration

2017-04-21 Thread Michael Ellerman
"Naveen N. Rao"  writes:

> When a kprobe is being registered, we use the symbol_name field to
> lookup the address where the probe should be placed. Since this is a
> user-provided field, let's ensure that the length of the string is
> within expected limits.

What are we actually trying to protect against here?

If you ignore powerpc for a moment, kprobe_lookup_name() is just
kallsyms_lookup_name().

All kallsyms_lookup_name() does with name is strcmp() it against a
legitimate symbol name which is at most KSYM_NAME_LEN.

So I don't think any of this validation helps in that case?

In the powerpc version of kprobe_lookup_name() we do need to do some
string juggling, for which it helps to know the input is sane. But I
think we should just make that code more robust by checking the input
before we do anything with it.
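
For illustration only (a rough, untested sketch; the helper name is made
up), the powerpc side could sanity check the name before any of the
string juggling:

static bool kprobe_symbol_name_sane(const char *name)
{
        size_t len = strnlen(name, KSYM_NAME_LEN);

        /* Reject empty names and names with no NUL within KSYM_NAME_LEN. */
        return len > 0 && len < KSYM_NAME_LEN;
}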

cheers


Re: [PATCH 8/8] selftests: x86: override clean in lib.mk to fix warnings

2017-04-21 Thread Michael Ellerman
Shuah Khan  writes:

> Add override for lib.mk clean to fix the following warnings from clean
> target run.
>
> Makefile:44: warning: overriding recipe for target 'clean'
> ../lib.mk:55: warning: ignoring old recipe for target 'clean'
>
> Signed-off-by: Shuah Khan 
> ---
>  tools/testing/selftests/x86/Makefile | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/x86/Makefile 
> b/tools/testing/selftests/x86/Makefile
> index 38e0a9c..4d27550 100644
> --- a/tools/testing/selftests/x86/Makefile
> +++ b/tools/testing/selftests/x86/Makefile
> @@ -40,8 +40,9 @@ all_32: $(BINARIES_32)
>  
>  all_64: $(BINARIES_64)
>  
> -clean:
> +override define CLEAN
>   $(RM) $(BINARIES_32) $(BINARIES_64)
> +endef

Simpler as:

EXTRA_CLEAN := $(BINARIES_32) $(BINARIES_64)

cheers


Re: [PATCH 6/8] selftests: splice: override clean in lib.mk to fix warnings

2017-04-21 Thread Michael Ellerman
Shuah Khan  writes:

> Add override for lib.mk clean to fix the following warnings from clean
> target run.
>
> Makefile:8: warning: overriding recipe for target 'clean'
> ../lib.mk:55: warning: ignoring old recipe for target 'clean'
>
> Signed-off-by: Shuah Khan 
> ---
>  tools/testing/selftests/splice/Makefile | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/splice/Makefile 
> b/tools/testing/selftests/splice/Makefile
> index 559512c..3f967ba 100644
> --- a/tools/testing/selftests/splice/Makefile
> +++ b/tools/testing/selftests/splice/Makefile
> @@ -4,5 +4,6 @@ all: $(TEST_PROGS) $(EXTRA)
>  
>  include ../lib.mk
>  
> -clean:
> +override define CLEAN
>   rm -fr $(EXTRA)
> +endef

Could just be:

EXTRA_CLEAN := $(EXTRA)

cheers


Re: [PATCH 7/8] selftests: sync: override clean in lib.mk to fix warnings

2017-04-21 Thread Michael Ellerman
Shuah Khan  writes:

> Add override for lib.mk clean to fix the following warnings from clean
> target run.
>
> Makefile:24: warning: overriding recipe for target 'clean'
> ../lib.mk:55: warning: ignoring old recipe for target 'clean'
>
> Signed-off-by: Shuah Khan 
> ---
>  tools/testing/selftests/sync/Makefile | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/sync/Makefile 
> b/tools/testing/selftests/sync/Makefile
> index 87ac400..f7d250d 100644
> --- a/tools/testing/selftests/sync/Makefile
> +++ b/tools/testing/selftests/sync/Makefile
> @@ -20,5 +20,6 @@ TESTS += sync_stress_merge.o
>  
>  sync_test: $(OBJS) $(TESTS)
>  
> -clean:
> +override define CLEAN
>   $(RM) sync_test $(OBJS) $(TESTS)
> +endef

EXTRA_CLEAN := sync_test $(OBJS) $(TESTS)

cheers


Re: [PATCH 4/8] selftests: gpio: override clean in lib.mk to fix warnings

2017-04-21 Thread Michael Ellerman
Shuah Khan  writes:

> Add override for lib.mk clean to fix the following warnings from clean
> target run.
>
> Makefile:11: warning: overriding recipe for target 'clean'
> ../lib.mk:55: warning: ignoring old recipe for target 'clean'
>
> Signed-off-by: Shuah Khan 
> ---
>  tools/testing/selftests/gpio/Makefile | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/gpio/Makefile 
> b/tools/testing/selftests/gpio/Makefile
> index 205e4d1..4f6d9e0 100644
> --- a/tools/testing/selftests/gpio/Makefile
> +++ b/tools/testing/selftests/gpio/Makefile
> @@ -7,8 +7,9 @@ include ../lib.mk
>  
>  all: $(BINARIES)
>  
> -clean:
> +override define CLEAN
>   $(RM) $(BINARIES)
> +endef

This could be achieved more simply with:

EXTRA_CLEAN := $(BINARIES)

cheers


Re: [PATCH 2/8] selftests: lib.mk: define CLEAN macro to allow Makefiles to override clean

2017-04-21 Thread Michael Ellerman
Shuah Khan  writes:

> Define CLEAN macro to allow Makefiles to override common clean target
> in lib.mk. This will help fix the following failures:
>
> warning: overriding recipe for target 'clean'
> ../lib.mk:55: warning: ignoring old recipe for target 'clean'
>
> Signed-off-by: Shuah Khan 

Should probably have:

Fixes: 88baa78d1f31 ("selftests: remove duplicated all and clean target")


In hindsight I'm not sure moving the clean target into lib.mk was
the best idea, but anyway it's a bit late to change our mind on that.

This patch is a good solution to fix the warnings.

Acked-by: Michael Ellerman 

cheers


Re: Linux 3.18.50

2017-04-21 Thread Greg KH
diff --git a/Makefile b/Makefile
index 252070fdf91c..8665178e2a36 100644
--- a/Makefile
+++ b/Makefile
@@ -1,6 +1,6 @@
 VERSION = 3
 PATCHLEVEL = 18
-SUBLEVEL = 49
+SUBLEVEL = 50
 EXTRAVERSION =
 NAME = Diseased Newt
 
diff --git a/arch/arm/include/asm/psci.h b/arch/arm/include/asm/psci.h
index e3789fb02c9c..c25ef3ec6d1f 100644
--- a/arch/arm/include/asm/psci.h
+++ b/arch/arm/include/asm/psci.h
@@ -37,7 +37,7 @@ struct psci_operations {
 extern struct psci_operations psci_ops;
 extern struct smp_operations psci_smp_ops;
 
-#if defined(CONFIG_SMP) && defined(CONFIG_ARM_PSCI)
+#ifdef CONFIG_ARM_PSCI
 int psci_init(void);
 bool psci_smp_available(void);
 #else
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index feda3ff185e9..9fb14a37263b 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -1407,6 +1407,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
(KVM_PHYS_SIZE >> PAGE_SHIFT))
return -EFAULT;
 
+   down_read(&current->mm->mmap_sem);
/*
 * A memory region could potentially cover multiple VMAs, and any holes
 * between them, so iterate over all of them to find out if we can map
@@ -1464,6 +1465,8 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
else
stage2_flush_memslot(kvm, memslot);
spin_unlock(&kvm->mmu_lock);
+
+   up_read(&current->mm->mmap_sem);
return ret;
 }
 
diff --git a/arch/c6x/kernel/ptrace.c b/arch/c6x/kernel/ptrace.c
index 3c494e8d..a511ac16a8e3 100644
--- a/arch/c6x/kernel/ptrace.c
+++ b/arch/c6x/kernel/ptrace.c
@@ -69,46 +69,6 @@ static int gpr_get(struct task_struct *target,
   0, sizeof(*regs));
 }
 
-static int gpr_set(struct task_struct *target,
-  const struct user_regset *regset,
-  unsigned int pos, unsigned int count,
-  const void *kbuf, const void __user *ubuf)
-{
-   int ret;
-   struct pt_regs *regs = task_pt_regs(target);
-
-   /* Don't copyin TSR or CSR */
-   ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf,
-&regs,
-0, PT_TSR * sizeof(long));
-   if (ret)
-   return ret;
-
-   ret = user_regset_copyin_ignore(&pos, &count, &kbuf, &ubuf,
-   PT_TSR * sizeof(long),
-   (PT_TSR + 1) * sizeof(long));
-   if (ret)
-   return ret;
-
-   ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf,
-&regs,
-(PT_TSR + 1) * sizeof(long),
-PT_CSR * sizeof(long));
-   if (ret)
-   return ret;
-
-   ret = user_regset_copyin_ignore(&pos, &count, &kbuf, &ubuf,
-   PT_CSR * sizeof(long),
-   (PT_CSR + 1) * sizeof(long));
-   if (ret)
-   return ret;
-
-   ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf,
-&regs,
-(PT_CSR + 1) * sizeof(long), -1);
-   return ret;
-}
-
 enum c6x_regset {
REGSET_GPR,
 };
@@ -120,7 +80,6 @@ static const struct user_regset c6x_regsets[] = {
.size = sizeof(u32),
.align = sizeof(u32),
.get = gpr_get,
-   .set = gpr_set
},
 };
 
diff --git a/arch/metag/include/asm/uaccess.h b/arch/metag/include/asm/uaccess.h
index 7841f2290385..9d523375f68a 100644
--- a/arch/metag/include/asm/uaccess.h
+++ b/arch/metag/include/asm/uaccess.h
@@ -192,20 +192,21 @@ extern long __must_check strnlen_user(const char __user 
*src, long count);
 
 #define strlen_user(str) strnlen_user(str, 32767)
 
-extern unsigned long __must_check __copy_user_zeroing(void *to,
- const void __user *from,
- unsigned long n);
+extern unsigned long raw_copy_from_user(void *to, const void __user *from,
+   unsigned long n);
 
 static inline unsigned long
 copy_from_user(void *to, const void __user *from, unsigned long n)
 {
+   unsigned long res = n;
if (likely(access_ok(VERIFY_READ, from, n)))
-   return __copy_user_zeroing(to, from, n);
-   memset(to, 0, n);
-   return n;
+   res = raw_copy_from_user(to, from, n);
+   if (unlikely(res))
+   memset(to + (n - res), 0, res);
+   return res;
 }
 
-#define __copy_from_user(to, from, n) __copy_user_zeroing(to, from, n)
+#define __copy_from_user(to, from, n) raw_copy_from_user(to, from, n)
 #define __copy_from_user_inatomic __copy_from_user
 
 extern unsigned long __must_check __copy_user(void __user *to,
diff --git a/arch/metag/kernel/ptrace.c b/arch/metag/kernel/ptrace.c
index 7563628822bd..5e2dc7defd2c 100644
--- a/arch/metag/kernel/ptrace.c
+++ b/arch/metag/kernel/ptrace.c
@@ -24,6 +24,16 @@
  * user_regset definitions.
  */
 
+static unsigned long 

Linux 3.18.50

2017-04-21 Thread Greg KH
I'm announcing the release of the 3.18.50 kernel.

All users of the 3.18 kernel series must upgrade.

The updated 3.18.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git 
linux-3.18.y
and can be browsed at the normal kernel.org git web browser:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary

thanks,

greg k-h



 Makefile   |2 
 arch/arm/include/asm/psci.h|2 
 arch/arm/kvm/mmu.c |3 
 arch/c6x/kernel/ptrace.c   |   41 ---
 arch/metag/include/asm/uaccess.h   |   15 -
 arch/metag/kernel/ptrace.c |   19 +
 arch/metag/lib/usercopy.c  |  312 +
 arch/mips/kernel/ptrace.c  |3 
 arch/powerpc/boot/zImage.lds.S |1 
 arch/powerpc/kernel/align.c|   27 +-
 arch/powerpc/kernel/setup_64.c |9 
 arch/powerpc/kvm/emulate.c |1 
 arch/powerpc/mm/hash_native_64.c   |7 
 arch/s390/boot/compressed/misc.c   |   35 +-
 arch/s390/include/asm/uaccess.h|2 
 arch/sparc/kernel/ptrace_64.c  |2 
 arch/x86/include/asm/elf.h |2 
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |2 
 arch/x86/kvm/vmx.c |   10 
 arch/x86/mm/init.c |   40 ++-
 arch/x86/vdso/vdso32-setup.c   |   11 
 block/scsi_ioctl.c |3 
 crypto/ahash.c |   79 --
 drivers/acpi/Makefile  |1 
 drivers/acpi/acpi_platform.c   |8 
 drivers/block/zram/zram_drv.c  |6 
 drivers/char/Kconfig   |6 
 drivers/char/mem.c |   82 --
 drivers/char/virtio_console.c  |   12 
 drivers/crypto/caam/ctrl.c |3 
 drivers/gpu/drm/ttm/ttm_object.c   |   10 
 drivers/gpu/drm/vmwgfx/vmwgfx_fence.c  |   79 --
 drivers/gpu/drm/vmwgfx/vmwgfx_ioctl.c  |4 
 drivers/gpu/drm/vmwgfx/vmwgfx_resource.c   |4 
 drivers/gpu/drm/vmwgfx/vmwgfx_surface.c|   31 +-
 drivers/hv/hv_balloon.c|4 
 drivers/iio/adc/ti_am335x_adc.c|   13 -
 drivers/input/joystick/iforce/iforce-usb.c |3 
 drivers/input/joystick/xpad.c  |2 
 drivers/input/misc/cm109.c |4 
 drivers/input/misc/ims-pcu.c   |4 
 drivers/input/misc/yealink.c   |4 
 drivers/input/serio/i8042-x86ia64io.h  |7 
 drivers/input/tablet/hanwang.c |3 
 drivers/input/tablet/kbtab.c   |3 
 drivers/input/touchscreen/sur40.c  |3 
 drivers/iommu/intel-iommu.c|2 
 drivers/isdn/gigaset/bas-gigaset.c |3 
 drivers/md/raid10.c|   18 +
 drivers/media/usb/dvb-usb-v2/dvb_usb_core.c|   10 
 drivers/media/usb/dvb-usb/dvb-usb-firmware.c   |   33 +-
 drivers/media/usb/uvc/uvc_driver.c |  118 -
 drivers/mmc/host/sdhci.c   |4 
 drivers/mmc/host/ushc.c|3 
 drivers/mtd/bcm47xxpart.c  |   10 
 drivers/net/ethernet/broadcom/genet/bcmgenet.c |6 
 drivers/net/ethernet/intel/igb/e1000_phy.c |4 
 drivers/net/ethernet/mellanox/mlx5/core/main.c |2 
 drivers/net/usb/catc.c |   56 ++--
 drivers/net/usb/pegasus.c  |   29 ++
 drivers/net/usb/rtl8150.c  |   34 ++
 drivers/pinctrl/qcom/pinctrl-msm.c |4 
 drivers/platform/x86/acer-wmi.c|   22 +
 drivers/rtc/rtc-s35390a.c  |  167 ++---
 drivers/rtc/rtc-tegra.c|   28 ++
 drivers/scsi/libiscsi.c|   26 ++
 drivers/scsi/libsas/sas_ata.c  |2 
 drivers/scsi/lpfc/lpfc_init.c  |1 
 drivers/scsi/sd.c  |   20 +
 drivers/scsi/sg.c  |2 
 drivers/scsi/sr.c  |6 
 drivers/target/iscsi/iscsi_target_parameters.c |   16 -
 drivers/target/iscsi/iscsi_target_util.c   |   12 
 drivers/target/target_core_pscsi.c |   47 ---
 drivers/target/target_core_sbc.c   |   10 
 drivers/tty/serial/8250/8250_pci.c |   23 +
 drivers/tty/serial/atmel_serial.c  |5 
 drivers/usb/class/usbtmc.c |7 
 drivers/usb/gadget/function/f_acm.c|4 
 drivers/uwb/hwa-rc.c   |

Re: [HMM 03/15] mm/unaddressable-memory: new type of ZONE_DEVICE for unaddressable memory

2017-04-21 Thread Dan Williams
On Fri, Apr 21, 2017 at 8:30 PM, Jérôme Glisse  wrote:
> HMM (heterogeneous memory management) needs struct page to support migration
> from system main memory to device memory.  The reasons for HMM and for
> migration to device memory are explained in the HMM core patch.
>
> This patch deals with device memory that is un-addressable (ie the CPU
> cannot access it). Hence we do not want those struct pages to be managed
> like regular memory. That is why we extend ZONE_DEVICE to support different
> types of memory.
>
> A persistent memory type is defined for the existing users of ZONE_DEVICE,
> and a new device un-addressable type is added for the un-addressable memory
> type. There is a clear separation between what is expected from each memory
> type, and existing users of ZONE_DEVICE are unaffected by the new requirement
> and the new use of the un-addressable type. All type-specific code paths are
> protected by a test against the memory type.
>
> Because the memory is un-addressable, we use a new special swap type for when
> a page is migrated to device memory (this reduces the maximum number of
> swap files).
>
> The two main additions besides the memory type to ZONE_DEVICE are two
> callbacks. The first one, page_free(), is called whenever the page refcount
> reaches 1 (which means the page is free, as a ZONE_DEVICE page never reaches
> a refcount of 0). This allows the device driver to manage its memory and the
> associated struct pages.
>
> The second callback, page_fault(), happens when there is a CPU access to
> an address that is backed by a device page (which is un-addressable by the
> CPU). This callback is responsible for migrating the page back to system
> main memory. The device driver cannot block migration back to system memory;
> HMM makes sure that such a page cannot be pinned in device memory.
>
> If the device is in some error condition and cannot migrate the memory back,
> then a CPU page fault to device memory should end with SIGBUS.
>
> Signed-off-by: Jérôme Glisse 
> Cc: Dan Williams 
> Cc: Ross Zwisler 
> ---
>  fs/proc/task_mmu.c   |  7 +
>  include/linux/ioport.h   |  1 +
>  include/linux/memremap.h | 82 
> 
>  include/linux/swap.h | 24 --
>  include/linux/swapops.h  | 68 +++
>  kernel/memremap.c| 43 -
>  mm/Kconfig   | 13 
>  mm/memory.c  | 61 +++
>  mm/memory_hotplug.c  | 10 --
>  mm/mprotect.c| 14 +
>  10 files changed, 317 insertions(+), 6 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index f0c8b33..a12ba94 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -542,6 +542,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long 
> addr,
> }
> } else if (is_migration_entry(swpent))
> page = migration_entry_to_page(swpent);
> +   else if (is_device_entry(swpent))
> +   page = device_entry_to_page(swpent);
> } else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
> && pte_none(*pte))) {
> page = find_get_entry(vma->vm_file->f_mapping,
> @@ -704,6 +706,8 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long 
> hmask,
>
> if (is_migration_entry(swpent))
> page = migration_entry_to_page(swpent);
> +   else if (is_device_entry(swpent))
> +   page = device_entry_to_page(swpent);
> }
> if (page) {
> int mapcount = page_mapcount(page);
> @@ -1196,6 +1200,9 @@ static pagemap_entry_t pte_to_pagemap_entry(struct 
> pagemapread *pm,
> flags |= PM_SWAP;
> if (is_migration_entry(entry))
> page = migration_entry_to_page(entry);
> +
> +   if (is_device_entry(entry))
> +   page = device_entry_to_page(entry);
> }
>
> if (page && !PageAnon(page))
> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> index 6230064..ec619dc 100644
> --- a/include/linux/ioport.h
> +++ b/include/linux/ioport.h
> @@ -130,6 +130,7 @@ enum {
> IORES_DESC_ACPI_NV_STORAGE  = 3,
> IORES_DESC_PERSISTENT_MEMORY= 4,
> IORES_DESC_PERSISTENT_MEMORY_LEGACY = 5,
> +   IORES_DESC_DEVICE_MEMORY_UNADDRESSABLE  = 6,
>  };
>
>  /* helpers to define resources */
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 9341619..365fb4e 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -35,24 +35,101 @@ static inline struct vmem_altmap 
> *to_vmem_altmap(unsigned long memmap_start)
>  }
>  #endif
>
> +/*
> + * Specialize ZONE_DEVICE memory into multiple types each having differents
> + * usage.
> + *
> + * MEMORY_DEVICE_PERSISTENT:
> + * Persistent 

Re: [PATCH] drm: fourcc byteorder: brings header file comments in line with reality.

2017-04-21 Thread Ilia Mirkin
On Fri, Apr 21, 2017 at 12:59 PM, Ville Syrjälä
 wrote:
> On Fri, Apr 21, 2017 at 10:49:49AM -0400, Ilia Mirkin wrote:
>> On Fri, Apr 21, 2017 at 3:58 AM, Gerd Hoffmann  wrote:
>> > While working on graphics support for virtual machines on ppc64 (which
>> > exists in both little and big endian variants) I've figured the comments
>> > for various drm fourcc formats in the header file don't match reality.
>> >
>> > Comments say the RGB formats are little endian, but in practice they
>> > are native endian.  Look at the drm_mode_legacy_fb_format() helper.  It
>> > maps -- for example -- bpp/depth 32/24 to DRM_FORMAT_XRGB8888, no matter
>> > whether the machine is little endian or big endian.  The users of this
>> > function (fbdev emulation, DRM_IOCTL_MODE_ADDFB) expect the framebuffer
>> > is native endian, not little endian.  Most userspace also operates on
>> > native endian only.
>> >
>> > So, go update the comments for all 16+24+32 bpp RGB formats.
>> >
>> > Leaving the yuv formats as-is.  I have no idea if and how those are used
>> > on bigendian machines.
>>
>> I think this is premature. The current situation is that I can't get
>> modetest to work *at all* on my NV34 / BE setup (I mean, it runs, just
>> the colors displayed are wrong). I believe that currently it packs
>> things in "cpu native endian". I've tried futzing with that without
>> much success, although I didn't spend too much time on it. I have a
>> NV34 plugged into my LE setup as well although I haven't tested to
>> double-check that it all works there. However I'm quite sure it used
>> to, as I used modetest to help develop the YUV overlay support for
>> those GPUs.
>
> I just took a quick stab at fixing modetest to respect the current
> wording in drm_fourcc.h:
>
> git://github.com/vsyrjala/libdrm.git modetest_endian

Looks like there was some careless testing on my part :( So ... it
looks like the current modetest without those changes does, in fact,
work on NV34/BE. With the changes, it breaks (and the handling of the
b* modes is a little broken in those patches -- they're not selectable
from the cmdline.) Which means that, as Michel & co predicted, it
appears to be taking BE input not LE input. This is very surprising to
me, but it is what it is. As I mentioned before, the details of how
the "BE" mode works on the GPUs is largely unknown to us beyond a few
basics. Note that only XR24 works, AR24 ends up with all black
displayed. This also happens on LE.
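
(For anyone following along: a tiny standalone C sketch, not drm code, of
why "little endian" vs "native endian" matters for a format like
DRM_FORMAT_XRGB8888.)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        /* One XRGB8888 pixel: X=0x00 R=0x11 G=0x22 B=0x33. */
        uint32_t pixel = 0x00112233;
        const unsigned char *b = (const unsigned char *)&pixel;

        /*
         * A little-endian CPU stores 33 22 11 00 in memory, a big-endian
         * CPU stores 00 11 22 33.  A "little endian format" pins the byte
         * order in memory; "native endian" means the bytes the hardware
         * scans out differ between LE and BE hosts.
         */
        printf("bytes in memory: %02x %02x %02x %02x\n",
               b[0], b[1], b[2], b[3]);
        return 0;
}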

Furthermore, all of YUYV, UYVY, and NV12 plane overlays don't appear
to work properly. The first half of the overlay is OK (but I think
compressed), while the second half is gibberish. Testing this on my
board plugged into a LE CPU, I also get the same type of issue, so I'm
guessing that it's just bitrot of the feature. (Or modetest gained a
bug.)

Cheers,

  -ilia


[HMM 14/15] mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory v3

2017-04-21 Thread Jérôme Glisse
This introduces a dummy HMM device class so a device driver can use it to
create an hmm_device for the sole purpose of registering device memory.
It is useful to device drivers that want to manage multiple physical
device memories under the same struct device umbrella.

Changed since v2:
  - use device_initcall() and drop everything that is module specific
Changed since v1:
  - Improve commit message
  - Add drvdata parameter to set on struct device

Signed-off-by: Jérôme Glisse 
Signed-off-by: Evgeny Baskakov 
Signed-off-by: John Hubbard 
Signed-off-by: Mark Hairgrove 
Signed-off-by: Sherry Cheung 
Signed-off-by: Subhash Gutti 
---
 include/linux/hmm.h | 22 +-
 mm/hmm.c| 88 +
 2 files changed, 109 insertions(+), 1 deletion(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 50a1115..374e5fd 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -72,11 +72,11 @@
 
 #if IS_ENABLED(CONFIG_HMM)
 
+#include 
 #include 
 #include 
 #include 
 
-
 struct hmm;
 
 /*
@@ -433,6 +433,26 @@ static inline unsigned long 
hmm_devmem_page_get_drvdata(struct page *page)
 
return drvdata[1];
 }
+
+
+/*
+ * struct hmm_device - fake device to hang device memory onto
+ *
+ * @device: device struct
+ * @minor: device minor number
+ */
+struct hmm_device {
+   struct device   device;
+   unsigned intminor;
+};
+
+/*
+ * A device driver that wants to handle multiple devices memory through a
+ * single fake device can use hmm_device to do so. This is purely a helper and
+ * it is not strictly needed, in order to make use of any HMM functionality.
+ */
+struct hmm_device *hmm_device_new(void *drvdata);
+void hmm_device_put(struct hmm_device *hmm_device);
 #endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 5d882e6..8b6b4c6 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -19,6 +19,7 @@
  */
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1112,4 +1113,91 @@ int hmm_devmem_fault_range(struct hmm_devmem *devmem,
return 0;
 }
 EXPORT_SYMBOL(hmm_devmem_fault_range);
+
+/*
+ * A device driver that wants to handle multiple devices memory through a
+ * single fake device can use hmm_device to do so. This is purely a helper
+ * and it is not needed to make use of any HMM functionality.
+ */
+#define HMM_DEVICE_MAX 256
+
+static DECLARE_BITMAP(hmm_device_mask, HMM_DEVICE_MAX);
+static DEFINE_SPINLOCK(hmm_device_lock);
+static struct class *hmm_device_class;
+static dev_t hmm_device_devt;
+
+static void hmm_device_release(struct device *device)
+{
+   struct hmm_device *hmm_device;
+
+   hmm_device = container_of(device, struct hmm_device, device);
+   spin_lock(&hmm_device_lock);
+   clear_bit(hmm_device->minor, hmm_device_mask);
+   spin_unlock(&hmm_device_lock);
+
+   kfree(hmm_device);
+}
+
+struct hmm_device *hmm_device_new(void *drvdata)
+{
+   struct hmm_device *hmm_device;
+   int ret;
+
+   hmm_device = kzalloc(sizeof(*hmm_device), GFP_KERNEL);
+   if (!hmm_device)
+   return ERR_PTR(-ENOMEM);
+
+   ret = alloc_chrdev_region(&hmm_device->device.devt, 0, 1, "hmm_device");
+   if (ret < 0) {
+   kfree(hmm_device);
+   return NULL;
+   }
+
+   spin_lock(&hmm_device_lock);
+   hmm_device->minor = find_first_zero_bit(hmm_device_mask, 
HMM_DEVICE_MAX);
+   if (hmm_device->minor >= HMM_DEVICE_MAX) {
+   spin_unlock(&hmm_device_lock);
+   kfree(hmm_device);
+   return NULL;
+   }
+   set_bit(hmm_device->minor, hmm_device_mask);
+   spin_unlock(&hmm_device_lock);
+
+   dev_set_name(&hmm_device->device, "hmm_device%d", hmm_device->minor);
+   hmm_device->device.devt = MKDEV(MAJOR(hmm_device_devt),
+   hmm_device->minor);
+   hmm_device->device.release = hmm_device_release;
+   dev_set_drvdata(&hmm_device->device, drvdata);
+   hmm_device->device.class = hmm_device_class;
+   device_initialize(&hmm_device->device);
+
+   return hmm_device;
+}
+EXPORT_SYMBOL(hmm_device_new);
+
+void hmm_device_put(struct hmm_device *hmm_device)
+{
+   put_device(&hmm_device->device);
+}
+EXPORT_SYMBOL(hmm_device_put);
+
+static int __init hmm_init(void)
+{
+   int ret;
+
+   ret = alloc_chrdev_region(&hmm_device_devt, 0,
+ HMM_DEVICE_MAX,
+ "hmm_device");
+   if (ret)
+   return ret;
+
+   hmm_device_class = class_create(THIS_MODULE, "hmm_device");
+   if (IS_ERR(hmm_device_class)) {
+   unregister_chrdev_region(hmm_device_devt, HMM_DEVICE_MAX);
+   return PTR_ERR(hmm_device_class);
+   }
+   return 0;
+}
+
+device_initcall(hmm_init);
 #endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
-- 
2.9.3
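
For context, a hypothetical driver-side usage sketch (not part of the
patch; the my_driver_* names are made up):

#include <linux/err.h>
#include <linux/hmm.h>
#include <linux/init.h>

static struct hmm_device *my_hmm_device;

static int __init my_driver_init(void)
{
        /* One fake device to hang several physical device memories onto. */
        my_hmm_device = hmm_device_new(NULL);
        /* hmm_device_new() returns ERR_PTR() or NULL on failure. */
        if (IS_ERR_OR_NULL(my_hmm_device))
                return my_hmm_device ? PTR_ERR(my_hmm_device) : -ENOMEM;

        /* ... hotplug device memory against &my_hmm_device->device ... */
        return 0;
}

static void __exit my_driver_exit(void)
{
        hmm_device_put(my_hmm_device);
}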



Re: [HMM 02/15] mm/put_page: move ZONE_DEVICE page reference decrement v2

2017-04-21 Thread Dan Williams
On Fri, Apr 21, 2017 at 8:30 PM, Jérôme Glisse  wrote:
> Move the page reference decrement of ZONE_DEVICE pages from put_page()
> to put_zone_device_page(); this does not affect non-ZONE_DEVICE
> pages.
>
> Doing this allows us to catch when a ZONE_DEVICE page refcount reaches
> 1, which means the device page is no longer referenced by anyone (unlike
> pages from other zones, a ZONE_DEVICE page refcount never reaches 0).
>
> This patch is just a preparatory patch for HMM.
>
> Changes since v1:
>   - commit message
>
> Signed-off-by: Jérôme Glisse 
> Cc: Dan Williams 
> Cc: Ross Zwisler 

Reviewed-by: Dan Williams 
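
For reference, a simplified sketch of the idea described above (not the
literal diff):

static inline void put_page(struct page *page)
{
        page = compound_head(page);

        /*
         * ZONE_DEVICE pages never drop to a refcount of 0, so their
         * decrement is done in put_zone_device_page(), which can notice
         * the 2 -> 1 transition and tell the driver the page is free.
         */
        if (unlikely(is_zone_device_page(page))) {
                put_zone_device_page(page);
                return;
        }

        if (put_page_testzero(page))
                __put_page(page);
}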


Re: [tip:irq/urgent] genirq/affinity: Fix calculating vectors to assign

2017-04-21 Thread Andrei Vagin
Looks like 4.11 will be released in a few days; it would be nice if this
commit reached the upstream tree before then.

Thanks.

On Thu, Apr 20, 2017 at 07:06:49AM -0700, tip-bot for Keith Busch wrote:
> Commit-ID:  b72f8051f34b8164a62391e3676edc34523c5952
> Gitweb: http://git.kernel.org/tip/b72f8051f34b8164a62391e3676edc34523c5952
> Author: Keith Busch 
> AuthorDate: Wed, 19 Apr 2017 19:51:10 -0400
> Committer:  Thomas Gleixner 
> CommitDate: Thu, 20 Apr 2017 16:03:09 +0200
> 
> genirq/affinity: Fix calculating vectors to assign
> 
> The vectors_per_node is calculated from the remaining available vectors.
> The current vector starts after pre_vectors, so we need to subtract that
> from the current to properly account for the number of remaining vectors
> to assign.
> 
> Fixes: 3412386b531 ("irq/affinity: Fix extra vecs calculation")
> Reported-by: Andrei Vagin 
> Signed-off-by: Keith Busch 
> Link: 
> http://lkml.kernel.org/r/1492645870-13019-1-git-send-email-keith.bu...@intel.com
> Signed-off-by: Thomas Gleixner 
> 
> ---
>  kernel/irq/affinity.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> index d052947..e2d356d 100644
> --- a/kernel/irq/affinity.c
> +++ b/kernel/irq/affinity.c
> @@ -98,7 +98,7 @@ irq_create_affinity_masks(int nvecs, const struct 
> irq_affinity *affd)
>   int ncpus, v, vecs_to_assign, vecs_per_node;
>  
>   /* Spread the vectors per node */
> - vecs_per_node = (affv - curvec) / nodes;
> + vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;
>  
>   /* Get the cpus on this node which are in the mask */
>   cpumask_and(nmsk, cpu_online_mask, cpumask_of_node(n));
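
As a rough illustration of the off-by-pre_vectors error (numbers made up,
standalone C rather than kernel code): take nvecs = 10, pre_vectors = 2,
post_vectors = 0 and two NUMA nodes, so affv = 8 and curvec starts at 2
on the first loop iteration.

#include <stdio.h>

int main(void)
{
        int affv = 8, curvec = 2, pre_vectors = 2, nodes = 2;

        /* Old formula: (8 - 2) / 2 = 3 per node, so 2 vectors are never assigned. */
        printf("old: %d per node\n", (affv - curvec) / nodes);
        /* Fixed formula: (8 - (2 - 2)) / 2 = 4 per node, all 8 assigned. */
        printf("new: %d per node\n", (affv - (curvec - pre_vectors)) / nodes);
        return 0;
}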


[HMM 07/15] mm/hmm: heterogeneous memory management (HMM for short) v2

2017-04-21 Thread Jérôme Glisse
HMM provides 3 separate types of functionality:
- Mirroring: synchronize CPU page table and device page table
- Device memory: allocating struct page for device memory
- Migration: migrating regular memory to device memory

This patch introduces some common helpers and definitions shared by all
three of those functionalities.

Changed since v1:
  - Kconfig logic (depend on x86-64 and use ARCH_HAS pattern)

Signed-off-by: Jérôme Glisse 
Signed-off-by: Evgeny Baskakov 
Signed-off-by: John Hubbard 
Signed-off-by: Mark Hairgrove 
Signed-off-by: Sherry Cheung 
Signed-off-by: Subhash Gutti 
---
 MAINTAINERS  |   7 +++
 include/linux/hmm.h  | 146 +++
 include/linux/mm_types.h |   5 ++
 kernel/fork.c|   2 +
 mm/Kconfig   |  13 +
 mm/Makefile  |   1 +
 mm/hmm.c |  71 +++
 7 files changed, 245 insertions(+)
 create mode 100644 include/linux/hmm.h
 create mode 100644 mm/hmm.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 09341a9..6df21a9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6052,6 +6052,13 @@ S:   Supported
 F: drivers/scsi/hisi_sas/
 F: Documentation/devicetree/bindings/scsi/hisilicon-sas.txt
 
+HMM - Heterogeneous Memory Management
+M: Jérôme Glisse 
+L: linux...@kvack.org
+S: Maintained
+F: mm/hmm*
+F: include/linux/hmm*
+
 HOST AP DRIVER
 M: Jouni Malinen 
 L: linux-wirel...@vger.kernel.org
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
new file mode 100644
index 000..93b363d
--- /dev/null
+++ b/include/linux/hmm.h
@@ -0,0 +1,146 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse 
+ */
+/*
+ * Heterogeneous Memory Management (HMM)
+ *
+ * See Documentation/vm/hmm.txt for reasons and overview of what HMM is and it
+ * is for. Here we focus on the HMM API description, with some explanation of
+ * the underlying implementation.
+ *
+ * Short description: HMM provides a set of helpers to share a virtual address
+ * space between CPU and a device, so that the device can access any valid
+ * address of the process (while still obeying memory protection). HMM also
+ * provides helpers to migrate process memory to device memory, and back. Each
+ * set of functionality (address space mirroring, and migration to and from
+ * device memory) can be used independently of the other.
+ *
+ *
+ * HMM address space mirroring API:
+ *
+ * Use HMM address space mirroring if you want to mirror range of the CPU page
+ * table of a process into a device page table. Here, "mirror" means "keep
+ * synchronized". Prerequisites: the device must provide the ability to write-
+ * protect its page tables (at PAGE_SIZE granularity), and must be able to
+ * recover from the resulting potential page faults.
+ *
+ * HMM guarantees that at any point in time, a given virtual address points to
+ * either the same memory in both CPU and device page tables (that is: CPU and
+ * device page tables each point to the same pages), or that one page table 
(CPU
+ * or device) points to no entry, while the other still points to the old page
+ * for the address. The latter case happens when the CPU page table update
+ * happens first, and then the update is mirrored over to the device page 
table.
+ * This does not cause any issue, because the CPU page table cannot start
+ * pointing to a new page until the device page table is invalidated.
+ *
+ * HMM uses mmu_notifiers to monitor the CPU page tables, and forwards any
+ * updates to each device driver that has registered a mirror. It also provides
+ * some API calls to help with taking a snapshot of the CPU page table, and to
+ * synchronize with any updates that might happen concurrently.
+ *
+ *
+ * HMM migration to and from device memory:
+ *
+ * HMM provides a set of helpers to hotplug device memory as ZONE_DEVICE, with
+ * a new MEMORY_DEVICE_UNADDRESSABLE type. This provides a struct page for
+ * each page of the device memory, and allows the device driver to manage its
+ * memory using those struct pages. Having struct pages for device memory makes
+ * migration easier. Because that memory is not addressable by the CPU it must
+ * never be pinned to the device; in other words, any CPU page fault can always
+ * 

[HMM 07/15] mm/hmm: heterogeneous memory management (HMM for short) v2

2017-04-21 Thread Jérôme Glisse
HMM provides 3 separate types of functionality:
- Mirroring: synchronize CPU page table and device page table
- Device memory: allocating struct page for device memory
- Migration: migrating regular memory to device memory

This patch introduces some common helpers and definitions shared by all
three of those functionalities.

Changed since v1:
  - Kconfig logic (depend on x86-64 and use ARCH_HAS pattern)

Signed-off-by: Jérôme Glisse 
Signed-off-by: Evgeny Baskakov 
Signed-off-by: John Hubbard 
Signed-off-by: Mark Hairgrove 
Signed-off-by: Sherry Cheung 
Signed-off-by: Subhash Gutti 
---
 MAINTAINERS  |   7 +++
 include/linux/hmm.h  | 146 +++
 include/linux/mm_types.h |   5 ++
 kernel/fork.c|   2 +
 mm/Kconfig   |  13 +
 mm/Makefile  |   1 +
 mm/hmm.c |  71 +++
 7 files changed, 245 insertions(+)
 create mode 100644 include/linux/hmm.h
 create mode 100644 mm/hmm.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 09341a9..6df21a9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6052,6 +6052,13 @@ S:   Supported
 F: drivers/scsi/hisi_sas/
 F: Documentation/devicetree/bindings/scsi/hisilicon-sas.txt
 
+HMM - Heterogeneous Memory Management
+M: Jérôme Glisse 
+L: linux...@kvack.org
+S: Maintained
+F: mm/hmm*
+F: include/linux/hmm*
+
 HOST AP DRIVER
 M: Jouni Malinen 
 L: linux-wirel...@vger.kernel.org
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
new file mode 100644
index 000..93b363d
--- /dev/null
+++ b/include/linux/hmm.h
@@ -0,0 +1,146 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse 
+ */
+/*
+ * Heterogeneous Memory Management (HMM)
+ *
+ * See Documentation/vm/hmm.txt for reasons and overview of what HMM is and it
+ * is for. Here we focus on the HMM API description, with some explanation of
+ * the underlying implementation.
+ *
+ * Short description: HMM provides a set of helpers to share a virtual address
+ * space between CPU and a device, so that the device can access any valid
+ * address of the process (while still obeying memory protection). HMM also
+ * provides helpers to migrate process memory to device memory, and back. Each
+ * set of functionality (address space mirroring, and migration to and from
+ * device memory) can be used independently of the other.
+ *
+ *
+ * HMM address space mirroring API:
+ *
+ * Use HMM address space mirroring if you want to mirror range of the CPU page
+ * table of a process into a device page table. Here, "mirror" means "keep
+ * synchronized". Prerequisites: the device must provide the ability to write-
+ * protect its page tables (at PAGE_SIZE granularity), and must be able to
+ * recover from the resulting potential page faults.
+ *
+ * HMM guarantees that at any point in time, a given virtual address points to
+ * either the same memory in both CPU and device page tables (that is: CPU and
+ * device page tables each point to the same pages), or that one page table 
(CPU
+ * or device) points to no entry, while the other still points to the old page
+ * for the address. The latter case happens when the CPU page table update
+ * happens first, and then the update is mirrored over to the device page 
table.
+ * This does not cause any issue, because the CPU page table cannot start
+ * pointing to a new page until the device page table is invalidated.
+ *
+ * HMM uses mmu_notifiers to monitor the CPU page tables, and forwards any
+ * updates to each device driver that has registered a mirror. It also provides
+ * some API calls to help with taking a snapshot of the CPU page table, and to
+ * synchronize with any updates that might happen concurrently.
+ *
+ *
+ * HMM migration to and from device memory:
+ *
+ * HMM provides a set of helpers to hotplug device memory as ZONE_DEVICE, with
+ * a new MEMORY_DEVICE_UNADDRESSABLE type. This provides a struct page for
+ * each page of the device memory, and allows the device driver to manage its
+ * memory using those struct pages. Having struct pages for device memory makes
+ * migration easier. Because that memory is not addressable by the CPU it must
+ * never be pinned to the device; in other words, any CPU page fault can always
+ * cause the device memory to be migrated (copied/moved) back to regular 
memory.
+ *
+ * A new migrate helper (migrate_vma()) has been added (see mm/migrate.c) that
+ * allows 

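The hmm_pfn_t helpers are cut off in this excerpt; the general idea is a pfn packed above a set of flag bits. A rough sketch of that packing pattern (flag names here are illustrative, not the actual HMM ones; the real MIGRATE_PFN_* variant appears later in the series):

/* Illustrative pfn-plus-flags packing; not the actual hmm_pfn_t definitions. */
#define EX_PFN_VALID	(1UL << 0)	/* entry holds a valid pfn */
#define EX_PFN_WRITE	(1UL << 1)	/* page is writable */
#define EX_PFN_SHIFT	2		/* pfn lives above the flag bits */

static inline unsigned long ex_pfn_pack(unsigned long pfn, bool write)
{
	return (pfn << EX_PFN_SHIFT) | EX_PFN_VALID | (write ? EX_PFN_WRITE : 0);
}

static inline unsigned long ex_pfn_unpack(unsigned long entry)
{
	return entry >> EX_PFN_SHIFT;	/* caller must check EX_PFN_VALID first */
}
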
[HMM 08/15] mm/hmm/mirror: mirror process address space on device with HMM helpers v2

2017-04-21 Thread Jérôme Glisse
This implements heterogeneous memory management (HMM) process address space
mirroring. In a nutshell, it provides an API to mirror a process address
space on a device. This boils down to keeping the CPU and device page tables
synchronized (we assume that both device and CPU are cache coherent, as
PCIe devices can be).

This patch provides a simple API for device drivers to achieve address
space mirroring, avoiding the need for each device driver to grow its own
CPU page table walker and its own CPU page table synchronization mechanism.

This is useful for NVidia GPU >= Pascal, Mellanox IB >= mlx5 and more
hardware in the future.

Changed since v1:
  - Kconfig logic (depend on x86-64 and use ARCH_HAS pattern)

Signed-off-by: Jérôme Glisse 
Signed-off-by: Evgeny Baskakov 
Signed-off-by: John Hubbard 
Signed-off-by: Mark Hairgrove 
Signed-off-by: Sherry Cheung 
Signed-off-by: Subhash Gutti 
---
 include/linux/hmm.h | 110 ++
 mm/Kconfig  |  12 
 mm/hmm.c| 170 +++-
 3 files changed, 277 insertions(+), 15 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 93b363d..6668a1b 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -72,6 +72,7 @@
 
 #if IS_ENABLED(CONFIG_HMM)
 
+struct hmm;
 
 /*
  * hmm_pfn_t - HMM uses its own pfn type to keep several flags per page
@@ -134,6 +135,115 @@ static inline hmm_pfn_t hmm_pfn_t_from_pfn(unsigned long 
pfn)
 }
 
 
+#if IS_ENABLED(CONFIG_HMM_MIRROR)
+/*
+ * Mirroring: how to synchronize device page table with CPU page table.
+ *
+ * A device driver that is participating in HMM mirroring must always
+ * synchronize with CPU page table updates. For this, device drivers can either
+ * directly use mmu_notifier APIs or they can use the hmm_mirror API. Device
+ * drivers can decide to register one mirror per device per process, or just
+ * one mirror per process for a group of devices. The pattern is:
+ *
+ *  int device_bind_address_space(..., struct mm_struct *mm, ...)
+ *  {
+ *  struct device_address_space *das;
+ *
+ *  // Device driver specific initialization, and allocation of das
+ *  // which contains an hmm_mirror struct as one of its fields.
+ *  ...
+ *
+ *  ret = hmm_mirror_register(&das->mirror, mm, &device_mirror_ops);
+ *  if (ret) {
+ *  // Cleanup on error
+ *  return ret;
+ *  }
+ *
+ *  // Other device driver specific initialization
+ *  ...
+ *  }
+ *
+ * Once an hmm_mirror is registered for an address space, the device driver
+ * will get callbacks through sync_cpu_device_pagetables() operation (see
+ * hmm_mirror_ops struct).
+ *
+ * Device driver must not free the struct containing the hmm_mirror struct
+ * before calling hmm_mirror_unregister(). The expected usage is to do that 
when
+ * the device driver is unbinding from an address space.
+ *
+ *
+ *  void device_unbind_address_space(struct device_address_space *das)
+ *  {
+ *  // Device driver specific cleanup
+ *  ...
+ *
+ *  hmm_mirror_unregister(&das->mirror);
+ *
+ *  // Other device driver specific cleanup, and now das can be freed
+ *  ...
+ *  }
+ */
+
+struct hmm_mirror;
+
+/*
+ * enum hmm_update_type - type of update
+ * @HMM_UPDATE_INVALIDATE: invalidate range (no indication as to why)
+ */
+enum hmm_update_type {
+   HMM_UPDATE_INVALIDATE,
+};
+
+/*
+ * struct hmm_mirror_ops - HMM mirror device operations callback
+ *
+ * @update: callback to update range on a device
+ */
+struct hmm_mirror_ops {
+   /* sync_cpu_device_pagetables() - synchronize page tables
+*
+* @mirror: pointer to struct hmm_mirror
+* @update_type: type of update that occurred to the CPU page table
+* @start: virtual start address of the range to update
+* @end: virtual end address of the range to update
+*
+* This callback ultimately originates from mmu_notifiers when the CPU
+* page table is updated. The device driver must update its page table
+* in response to this callback. The update argument tells what action
+* to perform.
+*
+* The device driver must not return from this callback until the device
+* page tables are completely updated (TLBs flushed, etc); this is a
+* synchronous call.
+*/
+   void (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror,
+  enum hmm_update_type update_type,
+  unsigned long start,
+  unsigned long end);
+};
+
+/*
+ * struct hmm_mirror - mirror struct for a device driver
+ *
+ * @hmm: pointer to struct hmm (which is unique per mm_struct)
+ * @ops: device driver callback for HMM mirror operations
+ * @list: for list of mirrors of a given mm
+ *
+ * Each address space (mm_struct) being 

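A rough sketch of what the driver side of this might look like (illustrative only; my_device_invalidate_range() is a hypothetical helper, and device_address_space is the example structure from the comment above):

/* Illustrative mirror callback for a driver participating in HMM mirroring. */
static void example_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
					       enum hmm_update_type update_type,
					       unsigned long start,
					       unsigned long end)
{
	struct device_address_space *das;

	das = container_of(mirror, struct device_address_space, mirror);

	switch (update_type) {
	case HMM_UPDATE_INVALIDATE:
		/* Must not return before the device page tables/TLBs are updated. */
		my_device_invalidate_range(das, start, end);
		break;
	}
}

static struct hmm_mirror_ops example_mirror_ops = {
	.sync_cpu_device_pagetables = example_sync_cpu_device_pagetables,
};
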
[HMM 01/15] mm, memory_hotplug: introduce add_pages

2017-04-21 Thread Jérôme Glisse
From: Michal Hocko 

There are new users of memory hotplug emerging. Some of them require a
different subset of arch_add_memory. There are some which only require
allocation of struct pages without mapping those pages into the kernel
address space. We currently have __add_pages for that purpose. But this
is rather low-level and not very suitable for code outside of the
memory hotplug core. E.g. x86_64 wants to update max_pfn, which should be
done by the caller. Introduce add_pages(), which should take care of those
details if they are needed. Each architecture should define its own
implementation and select CONFIG_ARCH_HAS_ADD_PAGES. All others use
the currently existing __add_pages.

Signed-off-by: Michal Hocko 
Signed-off-by: Jérôme Glisse 
---
 arch/x86/Kconfig   |  4 
 arch/x86/mm/init_64.c  | 22 +++---
 include/linux/memory_hotplug.h | 11 +++
 3 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c43f476..e515dc2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2263,6 +2263,10 @@ source "kernel/livepatch/Kconfig"
 
 endmenu
 
+config ARCH_HAS_ADD_PAGES
+   def_bool y
+   depends on X86_64 && ARCH_ENABLE_MEMORY_HOTPLUG
+
 config ARCH_ENABLE_MEMORY_HOTPLUG
def_bool y
depends on X86_64 || (X86_32 && HIGHMEM)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index ffeba90..a573ebc 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -671,7 +671,7 @@ void __init paging_init(void)
  * After memory hotplug the variables max_pfn, max_low_pfn and high_memory need
  * updating.
  */
-static void  update_end_of_memory_vars(u64 start, u64 size)
+static void update_end_of_memory_vars(u64 start, u64 size)
 {
unsigned long end_pfn = PFN_UP(start + size);
 
@@ -682,22 +682,30 @@ static void  update_end_of_memory_vars(u64 start, u64 
size)
}
 }
 
-int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock)
+int add_pages(int nid, unsigned long start_pfn,
+ unsigned long nr_pages, bool want_memblock)
 {
-   unsigned long start_pfn = start >> PAGE_SHIFT;
-   unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;
 
-   init_memory_mapping(start, start + size);
-
ret = __add_pages(nid, start_pfn, nr_pages, want_memblock);
WARN_ON_ONCE(ret);
 
/* update max_pfn, max_low_pfn and high_memory */
-   update_end_of_memory_vars(start, size);
+   update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
+ nr_pages << PAGE_SHIFT);
 
return ret;
 }
+
+int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock)
+{
+   unsigned long start_pfn = start >> PAGE_SHIFT;
+   unsigned long nr_pages = size >> PAGE_SHIFT;
+
+   init_memory_mapping(start, start + size);
+
+   return add_pages(nid, start_pfn, nr_pages, want_memblock);
+}
 EXPORT_SYMBOL_GPL(arch_add_memory);
 
 #define PAGE_INUSE 0xFD
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index aec8865..5ec6d64 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -126,6 +126,17 @@ extern int __remove_pages(struct zone *zone, unsigned long 
start_pfn,
 extern int __add_pages(int nid, unsigned long start_pfn,
unsigned long nr_pages, bool want_memblock);
 
+#ifndef CONFIG_ARCH_HAS_ADD_PAGES
+static inline int add_pages(int nid, unsigned long start_pfn,
+   unsigned long nr_pages, bool want_memblock)
+{
+   return __add_pages(nid, start_pfn, nr_pages, want_memblock);
+}
+#else /* ARCH_HAS_ADD_PAGES */
+int add_pages(int nid, unsigned long start_pfn,
+ unsigned long nr_pages, bool want_memblock);
+#endif /* ARCH_HAS_ADD_PAGES */
+
 #ifdef CONFIG_NUMA
 extern int memory_add_physaddr_to_nid(u64 start);
 #else
-- 
2.9.3


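A hypothetical architecture that needs its own post-hotplug fixups would opt in the same way as the x86_64 code above (sketch only, not a real port):

/* Hypothetical arch implementation selected via CONFIG_ARCH_HAS_ADD_PAGES. */
int add_pages(int nid, unsigned long start_pfn,
	      unsigned long nr_pages, bool want_memblock)
{
	int ret;

	ret = __add_pages(nid, start_pfn, nr_pages, want_memblock);
	WARN_ON_ONCE(ret);

	/* Arch-specific bookkeeping (the x86_64 equivalent of max_pfn) goes here. */
	return ret;
}
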


[HMM 03/15] mm/unaddressable-memory: new type of ZONE_DEVICE for unaddressable memory

2017-04-21 Thread Jérôme Glisse
HMM (heterogeneous memory management) needs struct page to support migration
from system main memory to device memory.  The reasons for HMM and for
migration to device memory are explained in the HMM core patch.

This patch deals with device memory that is un-addressable (ie the CPU
can not access it). Hence we do not want those struct pages to be managed
like regular memory. That is why we extend ZONE_DEVICE to support different
types of memory.

A persistent memory type is defined for existing users of ZONE_DEVICE, and a
new device un-addressable type is added for the un-addressable memory type.
There is a clear separation between what is expected from each memory type,
and existing users of ZONE_DEVICE are unaffected by the new requirements and
the new use of the un-addressable type. All type-specific code paths are
protected by a test against the memory type.

Because the memory is un-addressable, we use a new special swap type for when
a page is migrated to device memory (this reduces the maximum number of
swap files).

The two main additions, besides the memory type for ZONE_DEVICE, are two
callbacks. The first one, page_free(), is called whenever the page refcount
reaches 1 (which means the page is free, as a ZONE_DEVICE page never reaches
a refcount of 0). This allows the device driver to manage its memory and the
associated struct pages.

The second callback, page_fault(), happens when there is a CPU access to
an address that is backed by a device page (which is un-addressable by the
CPU). This callback is responsible for migrating the page back to system
main memory. The device driver can not block migration back to system memory;
HMM makes sure that such a page can not be pinned into device memory.

If the device is in some error condition and can not migrate the memory back,
then a CPU page fault to device memory should end with SIGBUS.

Signed-off-by: Jérôme Glisse 
Cc: Dan Williams 
Cc: Ross Zwisler 
---
 fs/proc/task_mmu.c   |  7 +
 include/linux/ioport.h   |  1 +
 include/linux/memremap.h | 82 
 include/linux/swap.h | 24 --
 include/linux/swapops.h  | 68 +++
 kernel/memremap.c| 43 -
 mm/Kconfig   | 13 
 mm/memory.c  | 61 +++
 mm/memory_hotplug.c  | 10 --
 mm/mprotect.c| 14 +
 10 files changed, 317 insertions(+), 6 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f0c8b33..a12ba94 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -542,6 +542,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
}
} else if (is_migration_entry(swpent))
page = migration_entry_to_page(swpent);
+   else if (is_device_entry(swpent))
+   page = device_entry_to_page(swpent);
} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
&& pte_none(*pte))) {
page = find_get_entry(vma->vm_file->f_mapping,
@@ -704,6 +706,8 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long 
hmask,
 
if (is_migration_entry(swpent))
page = migration_entry_to_page(swpent);
+   else if (is_device_entry(swpent))
+   page = device_entry_to_page(swpent);
}
if (page) {
int mapcount = page_mapcount(page);
@@ -1196,6 +1200,9 @@ static pagemap_entry_t pte_to_pagemap_entry(struct 
pagemapread *pm,
flags |= PM_SWAP;
if (is_migration_entry(entry))
page = migration_entry_to_page(entry);
+
+   if (is_device_entry(entry))
+   page = device_entry_to_page(entry);
}
 
if (page && !PageAnon(page))
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 6230064..ec619dc 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -130,6 +130,7 @@ enum {
IORES_DESC_ACPI_NV_STORAGE  = 3,
IORES_DESC_PERSISTENT_MEMORY= 4,
IORES_DESC_PERSISTENT_MEMORY_LEGACY = 5,
+   IORES_DESC_DEVICE_MEMORY_UNADDRESSABLE  = 6,
 };
 
 /* helpers to define resources */
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 9341619..365fb4e 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -35,24 +35,101 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned 
long memmap_start)
 }
 #endif
 
+/*
+ * Specialize ZONE_DEVICE memory into multiple types each having differents
+ * usage.
+ *
+ * MEMORY_DEVICE_PERSISTENT:
+ * Persistent device memory (pmem): struct page might be allocated in different
+ * memory and architecture might want to perform special actions. It is similar
+ * to regular memory, in that the CPU can access it transparently. However,
+ * it is likely to have different bandwidth and 

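The pattern this patch adds to the various page-table walkers (as in the task_mmu.c hunks above) is, roughly:

/* Sketch: resolve a non-present pte to the page it refers to, if any. */
static struct page *example_entry_to_page(pte_t pte)
{
	swp_entry_t entry;

	if (pte_present(pte) || pte_none(pte))
		return NULL;

	entry = pte_to_swp_entry(pte);
	if (is_migration_entry(entry))
		return migration_entry_to_page(entry);
	if (is_device_entry(entry))		/* un-addressable device page */
		return device_entry_to_page(entry);
	return NULL;
}
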
[PATCH v3] selftests: ftrace: Allow some tests to be run in a tracing instance

2017-04-21 Thread Steven Rostedt
From 4464dc867ead3ea14654165ad3ab68263aff7b17 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (VMware)" 
Date: Thu, 20 Apr 2017 12:53:18 -0400
Subject: [PATCH] selftests: ftrace: Allow some tests to be run in a tracing
 instance

A tracing instance has several of the same capabilities as the top level
instance, but may be implemented slightly differently. Instead of just writing
tests that duplicate the same test cases of the top level instance, allow a
test to be written for both the top level as well as for an instance.

If a test case can be run both in the top level and in a tracing
instance directory, then it should add a tag "# flags: instance" in the
header of the test file. Then, after all tests have run, any test that has the
instance flag set will run again within a tracing instance.

Cc: Shuah Khan 
Cc: Namhyung Kim 
Suggestions-from: Masami Hiramatsu 
Signed-off-by: Steven Rostedt (VMware) 
---
 tools/testing/selftests/ftrace/ftracetest | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/tools/testing/selftests/ftrace/ftracetest 
b/tools/testing/selftests/ftrace/ftracetest
index a8631d9..3215a8d 100755
--- a/tools/testing/selftests/ftrace/ftracetest
+++ b/tools/testing/selftests/ftrace/ftracetest
@@ -157,6 +157,10 @@ testcase() { # testfile
   prlog -n "[$CASENO]$desc"
 }
 
+test_on_instance() { # testfile
+  grep -q "^#[ \t]*flags:.*instance" $1
+}
+
 eval_result() { # sigval
   case $1 in
 $PASS)
@@ -271,6 +275,21 @@ for t in $TEST_CASES; do
   run_test $t
 done
 
+# Test on instance loop
+FIRST_INSTANCE=0
+for t in $TEST_CASES; do
+  test_on_instance $t || continue
+  if [ $FIRST_INSTANCE -eq 0 ]; then
+FIRST_INSTANCE=1
+echo "Running tests in a tracing instance:"
+  fi
+  SAVED_TRACING_DIR=$TRACING_DIR
+  export TRACING_DIR=`mktemp -d $TRACING_DIR/instances/ftracetest.XX`
+  run_test $t
+  rmdir $TRACING_DIR
+  TRACING_DIR=$SAVED_TRACING_DIR
+done
+
 prlog ""
 prlog "# of passed: " `echo $PASSED_CASES | wc -w`
 prlog "# of failed: " `echo $FAILED_CASES | wc -w`
-- 
2.9.3


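For reference, a test that wants to run both at the top level and inside an instance would carry the tag in its header, along these lines (hypothetical test-file excerpt):

#!/bin/sh
# description: example test that can also run in a tracing instance
# flags: instance

The "# flags: instance" line is what the test_on_instance() grep above matches.
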


[HMM 04/15] mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY

2017-04-21 Thread Jérôme Glisse
Introduce a new migration mode that allows offloading the copy to
a device DMA engine. This changes the migration workflow, and
not every address_space migratepage callback can support it. So
it needs to be tested for in those callbacks.

This is intended to be used by migrate_vma(), which itself is used
for things like HMM (see include/linux/hmm.h).

Signed-off-by: Jérôme Glisse 
---
 fs/aio.c |  8 +++
 fs/f2fs/data.c   |  5 -
 fs/hugetlbfs/inode.c |  5 -
 fs/ubifs/file.c  |  5 -
 include/linux/migrate.h  |  5 +
 include/linux/migrate_mode.h |  5 +
 mm/balloon_compaction.c  |  8 +++
 mm/migrate.c | 52 ++--
 mm/zsmalloc.c|  8 +++
 9 files changed, 86 insertions(+), 15 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index f52d925..e51351e 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -373,6 +373,14 @@ static int aio_migratepage(struct address_space *mapping, 
struct page *new,
pgoff_t idx;
int rc;
 
+   /*
+* We cannot support the _NO_COPY case here, because copy needs to
+* happen under the ctx->completion_lock. That does not work with the
+* migration workflow of MIGRATE_SYNC_NO_COPY.
+*/
+   if (mode == MIGRATE_SYNC_NO_COPY)
+   return -EINVAL;
+
rc = 0;
 
/* mapping->private_lock here protects against the kioctx teardown.  */
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index e984a42..b36191f 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -2131,7 +2131,10 @@ int f2fs_migrate_page(struct address_space *mapping,
SetPagePrivate(newpage);
set_page_private(newpage, page_private(page));
 
-   migrate_page_copy(newpage, page);
+   if (mode != MIGRATE_SYNC_NO_COPY)
+   migrate_page_copy(newpage, page);
+   else
+   migrate_page_states(newpage, page);
 
return MIGRATEPAGE_SUCCESS;
 }
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index dde8613..c02ff56 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -846,7 +846,10 @@ static int hugetlbfs_migrate_page(struct address_space 
*mapping,
rc = migrate_huge_page_move_mapping(mapping, newpage, page);
if (rc != MIGRATEPAGE_SUCCESS)
return rc;
-   migrate_page_copy(newpage, page);
+   if (mode != MIGRATE_SYNC_NO_COPY)
+   migrate_page_copy(newpage, page);
+   else
+   migrate_page_states(newpage, page);
 
return MIGRATEPAGE_SUCCESS;
 }
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index 2cda3d6..b2292be 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1482,7 +1482,10 @@ static int ubifs_migrate_page(struct address_space 
*mapping,
SetPagePrivate(newpage);
}
 
-   migrate_page_copy(newpage, page);
+   if (mode != MIGRATE_SYNC_NO_COPY)
+   migrate_page_copy(newpage, page);
+   else
+   migrate_page_states(newpage, page);
return MIGRATEPAGE_SUCCESS;
 }
 #endif
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 48e2484..78a0fdc 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -43,6 +43,7 @@ extern void putback_movable_page(struct page *page);
 
 extern int migrate_prep(void);
 extern int migrate_prep_local(void);
+extern void migrate_page_states(struct page *newpage, struct page *page);
 extern void migrate_page_copy(struct page *newpage, struct page *page);
 extern int migrate_huge_page_move_mapping(struct address_space *mapping,
  struct page *newpage, struct page *page);
@@ -63,6 +64,10 @@ static inline int isolate_movable_page(struct page *page, 
isolate_mode_t mode)
 static inline int migrate_prep(void) { return -ENOSYS; }
 static inline int migrate_prep_local(void) { return -ENOSYS; }
 
+static inline void migrate_page_states(struct page *newpage, struct page *page)
+{
+}
+
 static inline void migrate_page_copy(struct page *newpage,
 struct page *page) {}
 
diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index ebf3d89..bdf66af 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -6,11 +6,16 @@
  * on most operations but not ->writepage as the potential stall time
  * is too significant
  * MIGRATE_SYNC will block when migrating pages
+ * MIGRATE_SYNC_NO_COPY will block when migrating pages but will not copy pages
+ * with the CPU. Instead, page copy happens outside the migratepage()
+ * callback and is likely using a DMA engine. See migrate_vma() and HMM
+ * (mm/hmm.c) for users of this mode.
  */
 enum migrate_mode {
MIGRATE_ASYNC,
MIGRATE_SYNC_LIGHT,
MIGRATE_SYNC,
+   MIGRATE_SYNC_NO_COPY,
 };
 
 #endif /* MIGRATE_MODE_H_INCLUDED */
diff --git a/mm/balloon_compaction.c 

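The per-filesystem pattern, as in the f2fs/hugetlbfs/ubifs hunks above, boils down to the following tail of a migratepage callback (sketch):

	/* Sketch: honour MIGRATE_SYNC_NO_COPY inside a migratepage callback. */
	if (mode != MIGRATE_SYNC_NO_COPY)
		migrate_page_copy(newpage, page);	/* copy data and state with the CPU */
	else
		migrate_page_states(newpage, page);	/* state only; data copied elsewhere, e.g. by DMA */
	return MIGRATEPAGE_SUCCESS;

Callbacks that cannot split the copy out (such as aio above) simply return -EINVAL for this mode.
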
[HMM 11/15] mm/migrate: support un-addressable ZONE_DEVICE page in migration

2017-04-21 Thread Jérôme Glisse
Allow unmapping and restoring the special swap entries of un-addressable
ZONE_DEVICE memory.

Signed-off-by: Jérôme Glisse 
Cc: Kirill A. Shutemov 
---
 include/linux/migrate.h |  10 +++-
 mm/migrate.c| 136 ++--
 mm/page_vma_mapped.c|  10 
 mm/rmap.c   |  25 +
 4 files changed, 152 insertions(+), 29 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 576b3f5..7dd875a 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -130,12 +130,18 @@ static inline int migrate_misplaced_transhuge_page(struct 
mm_struct *mm,
 
 #ifdef CONFIG_MIGRATION
 
+/*
+ * Watch out for PAE architecture, which has an unsigned long, and might not
+ * have enough bits to store all physical address and flags. So far we have
+ * enough room for all our flags.
+ */
 #define MIGRATE_PFN_VALID  (1UL << 0)
 #define MIGRATE_PFN_MIGRATE(1UL << 1)
 #define MIGRATE_PFN_LOCKED (1UL << 2)
 #define MIGRATE_PFN_WRITE  (1UL << 3)
-#define MIGRATE_PFN_ERROR  (1UL << 4)
-#define MIGRATE_PFN_SHIFT  5
+#define MIGRATE_PFN_DEVICE (1UL << 4)
+#define MIGRATE_PFN_ERROR  (1UL << 5)
+#define MIGRATE_PFN_SHIFT  6
 
 static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
 {
diff --git a/mm/migrate.c b/mm/migrate.c
index 4ac2a7a..62ad41c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -233,7 +234,15 @@ static bool remove_migration_pte(struct page *page, struct 
vm_area_struct *vma,
pte = arch_make_huge_pte(pte, vma, new, 0);
}
 #endif
-   flush_dcache_page(new);
+
+   if (unlikely(is_zone_device_page(new)) &&
+   is_device_unaddressable_page(new)) {
+   entry = make_device_entry(new, pte_write(pte));
+   pte = swp_entry_to_pte(entry);
+   if (pte_swp_soft_dirty(*pvmw.pte))
+   pte = pte_mksoft_dirty(pte);
+   } else
+   flush_dcache_page(new);
set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
 
if (PageHuge(new)) {
@@ -305,6 +314,8 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t 
*ptep,
 */
if (!get_page_unless_zero(page))
goto out;
+   if (is_zone_device_page(page))
+   get_zone_device_page(page);
pte_unmap_unlock(ptep, ptl);
wait_on_page_locked(page);
put_page(page);
@@ -2139,17 +2150,40 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
pte = *ptep;
pfn = pte_pfn(pte);
 
-   if (!pte_present(pte)) {
+   if (pte_none(pte)) {
mpfn = pfn = 0;
goto next;
}
 
+   if (!pte_present(pte)) {
+   mpfn = pfn = 0;
+
+   /*
+* Only care about unaddressable device page special
+* page table entry. Other special swap entries are not
+* migratable, and we ignore regular swapped page.
+*/
+   entry = pte_to_swp_entry(pte);
+   if (!is_device_entry(entry))
+   goto next;
+
+   page = device_entry_to_page(entry);
+   mpfn = migrate_pfn(page_to_pfn(page))|
+   MIGRATE_PFN_DEVICE | MIGRATE_PFN_MIGRATE;
+   if (is_write_device_entry(entry))
+   mpfn |= MIGRATE_PFN_WRITE;
+   } else {
+   page = vm_normal_page(migrate->vma, addr, pte);
+   mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
+   mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
+   }
+
/* FIXME support THP */
-   page = vm_normal_page(migrate->vma, addr, pte);
if (!page || !page->mapping || PageTransCompound(page)) {
mpfn = pfn = 0;
goto next;
}
+   pfn = page_to_pfn(page);
 
/*
 * By getting a reference on the page we pin it and that blocks
@@ -2162,8 +2196,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 */
get_page(page);
migrate->cpages++;
-   mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
-   mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
 
/*
 * Optimize for the common case where page is only mapped once
@@ -2194,6 +2226,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
}
 
 next:
+   migrate->dst[migrate->npages] = 0;
   

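Putting the MIGRATE_PFN_* bits above together, a collected un-addressable device page is encoded and decoded roughly like this (illustrative round trip):

/* Sketch: round trip of the MIGRATE_PFN_* encoding defined in migrate.h above. */
static struct page *example_encode_decode(struct page *page, bool writable)
{
	unsigned long mpfn = migrate_pfn(page_to_pfn(page)) |	/* sets MIGRATE_PFN_VALID */
			     MIGRATE_PFN_MIGRATE |
			     MIGRATE_PFN_DEVICE |		/* un-addressable device memory */
			     (writable ? MIGRATE_PFN_WRITE : 0);

	return migrate_pfn_to_page(mpfn);	/* recovers the same struct page */
}
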
Re: [PATCH] xprtrdma: use offset_in_page() macro

2017-04-21 Thread Chuck Lever

> On Apr 21, 2017, at 9:21 PM, Geliang Tang  wrote:
> 
> Use offset_in_page() macro instead of open-coding.
> 
> Signed-off-by: Geliang Tang 
> ---
> net/sunrpc/xprtrdma/rpc_rdma.c| 4 ++--
> net/sunrpc/xprtrdma/svc_rdma_sendto.c | 3 +--
> 2 files changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
> index a044be2..429beea 100644
> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
> @@ -540,7 +540,7 @@ rpcrdma_prepare_msg_sges(struct rpcrdma_ia *ia, struct 
> rpcrdma_req *req,
>   goto out;
> 
>   page = virt_to_page(xdr->tail[0].iov_base);
> - page_base = (unsigned long)xdr->tail[0].iov_base & ~PAGE_MASK;
> + page_base = offset_in_page(xdr->tail[0].iov_base);
> 
>   /* If the content in the page list is an odd length,
>* xdr_write_pages() has added a pad at the beginning
> @@ -587,7 +587,7 @@ rpcrdma_prepare_msg_sges(struct rpcrdma_ia *ia, struct 
> rpcrdma_req *req,
>*/
>   if (xdr->tail[0].iov_len) {
>   page = virt_to_page(xdr->tail[0].iov_base);
> - page_base = (unsigned long)xdr->tail[0].iov_base & ~PAGE_MASK;
> + page_base = offset_in_page(xdr->tail[0].iov_base);
>   len = xdr->tail[0].iov_len;
> 
> map_tail:

There are several other sites that use PAGE_MASK in
rpc_rdma.c. Should those be included in this patch?

Do you have a way to test this change? If not I
can take it (once the above comment is addressed),
run it through the usual battery of NFS/RDMA
testing, and then pass it along to Anna.


> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c 
> b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> index 1736337..60b3f29 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> @@ -306,12 +306,11 @@ static int svc_rdma_dma_map_buf(struct svcxprt_rdma 
> *rdma,
>   unsigned char *base,
>   unsigned int len)
> {
> - unsigned long offset = (unsigned long)base & ~PAGE_MASK;
>   struct ib_device *dev = rdma->sc_cm_id->device;
>   dma_addr_t dma_addr;
> 
>   dma_addr = ib_dma_map_page(dev, virt_to_page(base),
> -offset, len, DMA_TO_DEVICE);
> +offset_in_page(base), len, DMA_TO_DEVICE);
>   if (ib_dma_mapping_error(dev, dma_addr))
>   return -EIO;
> 

This hunk conflicts with a rewrite of svc_rdma_sendto.c that
Bruce has already accepted for v4.12. I would prefer this
be dropped.

The rewritten code also has this issue. I can submit a patch
separately that adds offset_in_page in the appropriate place.


--
Chuck Lever



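For context, the macro and the open-coded form it replaces are equivalent (assuming base is a kernel virtual address; offset_in_page() comes from linux/mm.h):

	/* Both expressions yield the byte offset of 'base' within its page. */
	unsigned long off_open_coded = (unsigned long)base & ~PAGE_MASK;
	unsigned long off_macro      = offset_in_page(base);
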


[HMM 05/15] mm/migrate: new memory migration helper for use with device memory v4

2017-04-21 Thread Jérôme Glisse
This patch adds a new memory migration helper, which migrates the memory
backing a range of virtual addresses of a process to different memory
(which can be allocated through a special allocator). It differs from
NUMA migration by working on a range of virtual addresses and thus by
doing migration in chunks that can be large enough to use a DMA engine
or a special copy offloading engine.

Expected users are anyone with heterogeneous memory where different
memories have different characteristics (latency, bandwidth, ...). As
an example, an IBM platform with a CAPI bus can make use of this feature
to migrate between regular memory and CAPI device memory. New CPU
architectures with a pool of high performance memory not managed as a
cache but presented as regular memory (while being faster and with
lower latency than DDR) will also be prime users of this patch.

Migration to private device memory will be useful for devices that
have a large pool of such memory, like GPUs; NVIDIA plans to use HMM for that.
Changes since v3:
  - Rebase

Changes since v2:
  - droped HMM prefix and HMM specific code
Changes since v1:
  - typos fix
  - split early unmap optimization for page with single mapping

Signed-off-by: Jérôme Glisse 
Signed-off-by: Evgeny Baskakov 
Signed-off-by: John Hubbard 
Signed-off-by: Mark Hairgrove 
Signed-off-by: Sherry Cheung 
Signed-off-by: Subhash Gutti 
---
 include/linux/migrate.h | 104 
 mm/migrate.c| 444 
 2 files changed, 548 insertions(+)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 78a0fdc..576b3f5 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -127,4 +127,108 @@ static inline int migrate_misplaced_transhuge_page(struct 
mm_struct *mm,
 }
 #endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/
 
+
+#ifdef CONFIG_MIGRATION
+
+#define MIGRATE_PFN_VALID  (1UL << 0)
+#define MIGRATE_PFN_MIGRATE(1UL << 1)
+#define MIGRATE_PFN_LOCKED (1UL << 2)
+#define MIGRATE_PFN_WRITE  (1UL << 3)
+#define MIGRATE_PFN_ERROR  (1UL << 4)
+#define MIGRATE_PFN_SHIFT  5
+
+static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
+{
+   if (!(mpfn & MIGRATE_PFN_VALID))
+   return NULL;
+   return pfn_to_page(mpfn >> MIGRATE_PFN_SHIFT);
+}
+
+static inline unsigned long migrate_pfn(unsigned long pfn)
+{
+   return (pfn << MIGRATE_PFN_SHIFT) | MIGRATE_PFN_VALID;
+}
+
+/*
+ * struct migrate_vma_ops - migrate operation callback
+ *
+ * @alloc_and_copy: alloc destination memory and copy source memory to it
+ * @finalize_and_map: allow caller to map the successfully migrated pages
+ *
+ *
+ * The alloc_and_copy() callback happens once all source pages have been 
locked,
+ * unmapped and checked (checked whether pinned or not). All pages that can be
+ * migrated will have an entry in the src array set with the pfn value of the
+ * page and with the MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag set (other
+ * flags might be set but should be ignored by the callback).
+ *
+ * The alloc_and_copy() callback can then allocate destination memory and copy
+ * source memory to it for all those entries (ie with MIGRATE_PFN_VALID and
+ * MIGRATE_PFN_MIGRATE flag set). Once these are allocated and copied, the
+ * callback must update each corresponding entry in the dst array with the pfn
+ * value of the destination page and with the MIGRATE_PFN_VALID and
+ * MIGRATE_PFN_LOCKED flags set (destination pages must have their struct pages
+ * locked, via lock_page()).
+ *
+ * At this point the alloc_and_copy() callback is done and returns.
+ *
+ * Note that the callback does not have to migrate all the pages that are
+ * marked with MIGRATE_PFN_MIGRATE flag in src array unless this is a migration
+ * from device memory to system memory (ie the MIGRATE_PFN_DEVICE flag is also
+ * set in the src array entry). If the device driver cannot migrate a device
+ * page back to system memory, then it must set the corresponding dst array
+ * entry to MIGRATE_PFN_ERROR. This will trigger a SIGBUS if CPU tries to
+ * access any of the virtual addresses originally backed by this page. Because
+ * a SIGBUS is such a severe result for the userspace process, the device
+ * driver should avoid setting MIGRATE_PFN_ERROR unless it is really in an
+ * unrecoverable state.
+ *
+ * THE alloc_and_copy() CALLBACK MUST NOT CHANGE ANY OF THE SRC ARRAY ENTRIES
+ * OR BAD THINGS WILL HAPPEN !
+ *
+ *
+ * The finalize_and_map() callback happens after struct page migration from
+ * source to destination (destination struct pages are the struct pages for the
+ * memory allocated by the alloc_and_copy() callback).  Migration can fail, and
+ * thus the finalize_and_map() allows the driver to inspect which pages were
+ * successfully migrated, and which were not. Successfully migrated pages will
+ * 

Re: [PATCH] xprtrdma: use offset_in_page() macro

2017-04-21 Thread Chuck Lever

> On Apr 21, 2017, at 9:21 PM, Geliang Tang  wrote:
> 
> Use offset_in_page() macro instead of open-coding.
> 
> Signed-off-by: Geliang Tang 
> ---
> net/sunrpc/xprtrdma/rpc_rdma.c| 4 ++--
> net/sunrpc/xprtrdma/svc_rdma_sendto.c | 3 +--
> 2 files changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
> index a044be2..429beea 100644
> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
> @@ -540,7 +540,7 @@ rpcrdma_prepare_msg_sges(struct rpcrdma_ia *ia, struct 
> rpcrdma_req *req,
>   goto out;
> 
>   page = virt_to_page(xdr->tail[0].iov_base);
> - page_base = (unsigned long)xdr->tail[0].iov_base & ~PAGE_MASK;
> + page_base = offset_in_page(xdr->tail[0].iov_base);
> 
>   /* If the content in the page list is an odd length,
>* xdr_write_pages() has added a pad at the beginning
> @@ -587,7 +587,7 @@ rpcrdma_prepare_msg_sges(struct rpcrdma_ia *ia, struct 
> rpcrdma_req *req,
>*/
>   if (xdr->tail[0].iov_len) {
>   page = virt_to_page(xdr->tail[0].iov_base);
> - page_base = (unsigned long)xdr->tail[0].iov_base & ~PAGE_MASK;
> + page_base = offset_in_page(xdr->tail[0].iov_base);
>   len = xdr->tail[0].iov_len;
> 
> map_tail:

There are several other sites that use PAGE_MASK in
rpc_rdma.c. Should those be included in this patch?

Do you have a way to test this change? If not I
can take it (once the above comment is addressed),
run it through the usual battery of NFS/RDMA
testing, and then pass it along to Anna.


> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c 
> b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> index 1736337..60b3f29 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> @@ -306,12 +306,11 @@ static int svc_rdma_dma_map_buf(struct svcxprt_rdma 
> *rdma,
>   unsigned char *base,
>   unsigned int len)
> {
> - unsigned long offset = (unsigned long)base & ~PAGE_MASK;
>   struct ib_device *dev = rdma->sc_cm_id->device;
>   dma_addr_t dma_addr;
> 
>   dma_addr = ib_dma_map_page(dev, virt_to_page(base),
> -offset, len, DMA_TO_DEVICE);
> +offset_in_page(base), len, DMA_TO_DEVICE);
>   if (ib_dma_mapping_error(dev, dma_addr))
>   return -EIO;
> 

This hunk conflicts with a rewrite of svc_rdma_sendto.c that
Bruce has already accepted for v4.12. I would prefer this
be dropped.

The rewritten code also has this issue. I can submit a patch
separately that adds offset_in_page in the appropriate place.


--
Chuck Lever





[HMM 05/15] mm/migrate: new memory migration helper for use with device memory v4

2017-04-21 Thread Jérôme Glisse
This patch adds a new memory migration helper, which migrates the memory
backing a range of virtual addresses of a process to different memory
(which can be allocated through a special allocator). It differs from
NUMA migration by working on a range of virtual addresses, and thus by
doing the migration in chunks that can be large enough to use a DMA
engine or a special copy-offloading engine.

Expected users are anyone with heterogeneous memory, where different
memories have different characteristics (latency, bandwidth, ...). As
an example, IBM platforms with a CAPI bus can make use of this feature
to migrate between regular memory and CAPI device memory. New CPU
architectures with a pool of high-performance memory, not managed as a
cache but presented as regular memory (while being faster and with
lower latency than DDR), will also be prime users of this patch.

Migration to private device memory will be useful for devices that have
a large pool of such memory, like GPUs; NVidia plans to use HMM for that.
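
For illustration, here is a rough driver-side sketch of the alloc_and_copy()
step described above (not part of the patch; the callback prototype is assumed
for the example, and a real driver would allocate its own device memory rather
than call alloc_page_vma()):

/* Illustration only -- not part of this patch.  A driver's alloc_and_copy()
 * callback walks the src array filled in by migrate_vma(), allocates a
 * destination page for every entry that may be migrated, copies the data,
 * and reports the new page back through the dst array.  The helpers and
 * flags (migrate_pfn(), MIGRATE_PFN_*) are the ones added below; the exact
 * callback prototype is assumed here.
 */
static void example_alloc_and_copy(struct vm_area_struct *vma,
				   const unsigned long *src,
				   unsigned long *dst,
				   unsigned long start,
				   unsigned long end,
				   void *private)
{
	unsigned long addr, i;

	for (addr = start, i = 0; addr < end; addr += PAGE_SIZE, i++) {
		struct page *spage = migrate_pfn_to_page(src[i]);
		struct page *dpage;

		dst[i] = 0;
		/* Skip entries not selected for migration. */
		if (!(src[i] & MIGRATE_PFN_MIGRATE))
			continue;

		/* A real driver would allocate device memory here. */
		dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);
		if (!dpage)
			continue;

		lock_page(dpage);
		if (spage)
			copy_highpage(dpage, spage);

		/* Destination must be reported valid and locked. */
		dst[i] = migrate_pfn(page_to_pfn(dpage)) |
			 MIGRATE_PFN_LOCKED;
	}
}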

Changes since v3:
  - Rebase

Changes since v2:
  - droped HMM prefix and HMM specific code
Changes since v1:
  - typos fix
  - split early unmap optimization for page with single mapping

Signed-off-by: Jérôme Glisse 
Signed-off-by: Evgeny Baskakov 
Signed-off-by: John Hubbard 
Signed-off-by: Mark Hairgrove 
Signed-off-by: Sherry Cheung 
Signed-off-by: Subhash Gutti 
---
 include/linux/migrate.h | 104 
 mm/migrate.c| 444 
 2 files changed, 548 insertions(+)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 78a0fdc..576b3f5 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -127,4 +127,108 @@ static inline int migrate_misplaced_transhuge_page(struct 
mm_struct *mm,
 }
 #endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/
 
+
+#ifdef CONFIG_MIGRATION
+
+#define MIGRATE_PFN_VALID  (1UL << 0)
+#define MIGRATE_PFN_MIGRATE(1UL << 1)
+#define MIGRATE_PFN_LOCKED (1UL << 2)
+#define MIGRATE_PFN_WRITE  (1UL << 3)
+#define MIGRATE_PFN_ERROR  (1UL << 4)
+#define MIGRATE_PFN_SHIFT  5
+
+static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
+{
+   if (!(mpfn & MIGRATE_PFN_VALID))
+   return NULL;
+   return pfn_to_page(mpfn >> MIGRATE_PFN_SHIFT);
+}
+
+static inline unsigned long migrate_pfn(unsigned long pfn)
+{
+   return (pfn << MIGRATE_PFN_SHIFT) | MIGRATE_PFN_VALID;
+}
+
+/*
+ * struct migrate_vma_ops - migrate operation callback
+ *
+ * @alloc_and_copy: alloc destination memory and copy source memory to it
+ * @finalize_and_map: allow caller to map the successfully migrated pages
+ *
+ *
+ * The alloc_and_copy() callback happens once all source pages have been locked,
+ * unmapped and checked (checked whether pinned or not). All pages that can be
+ * migrated will have an entry in the src array set with the pfn value of the
+ * page and with the MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag set (other
+ * flags might be set but should be ignored by the callback).
+ *
+ * The alloc_and_copy() callback can then allocate destination memory and copy
+ * source memory to it for all those entries (ie with MIGRATE_PFN_VALID and
+ * MIGRATE_PFN_MIGRATE flag set). Once these are allocated and copied, the
+ * callback must update each corresponding entry in the dst array with the pfn
+ * value of the destination page and with the MIGRATE_PFN_VALID and
+ * MIGRATE_PFN_LOCKED flags set (destination pages must have their struct pages
+ * locked, via lock_page()).
+ *
+ * At this point the alloc_and_copy() callback is done and returns.
+ *
+ * Note that the callback does not have to migrate all the pages that are
+ * marked with MIGRATE_PFN_MIGRATE flag in src array unless this is a migration
+ * from device memory to system memory (ie the MIGRATE_PFN_DEVICE flag is also
+ * set in the src array entry). If the device driver cannot migrate a device
+ * page back to system memory, then it must set the corresponding dst array
+ * entry to MIGRATE_PFN_ERROR. This will trigger a SIGBUS if CPU tries to
+ * access any of the virtual addresses originally backed by this page. Because
+ * a SIGBUS is such a severe result for the userspace process, the device
+ * driver should avoid setting MIGRATE_PFN_ERROR unless it is really in an
+ * unrecoverable state.
+ *
+ * THE alloc_and_copy() CALLBACK MUST NOT CHANGE ANY OF THE SRC ARRAY ENTRIES
+ * OR BAD THINGS WILL HAPPEN !
+ *
+ *
+ * The finalize_and_map() callback happens after struct page migration from
+ * source to destination (destination struct pages are the struct pages for the
+ * memory allocated by the alloc_and_copy() callback).  Migration can fail, and
+ * thus the finalize_and_map() allows the driver to inspect which pages were
+ * successfully migrated, and which were not. Successfully migrated pages will
+ * have the MIGRATE_PFN_MIGRATE flag set for their src array entry.
+ *
+ * It is safe to update device page table from within 

[HMM 02/15] mm/put_page: move ZONE_DEVICE page reference decrement v2

2017-04-21 Thread Jérôme Glisse
Move the page reference decrement for ZONE_DEVICE pages from put_page()
to put_zone_device_page(); this does not affect non-ZONE_DEVICE
pages.

Doing this allows us to catch when a ZONE_DEVICE page's refcount reaches
1, which means the device is no longer referenced by anyone (unlike
pages from other zones, a ZONE_DEVICE page's refcount never reaches 0).

This patch is just a preparatory patch for HMM.
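
Put differently, after this change a live ZONE_DEVICE page always has a
refcount of at least 1, so "refcount == 1" becomes the signal that only the
device-memory bookkeeping still holds it. A tiny helper a later patch in the
series could build on (the helper name is hypothetical, illustration only):

/* Illustration only -- not part of this patch. */
static inline bool zone_device_page_is_idle(struct page *page)
{
	VM_BUG_ON_PAGE(!is_zone_device_page(page), page);

	/* ZONE_DEVICE refcounts never drop to 0, so 1 means no CPU
	 * mapping or get_user_pages() reference is left.
	 */
	return page_ref_count(page) == 1;
}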

Changes since v1:
  - commit message

Signed-off-by: Jérôme Glisse 
Cc: Dan Williams 
Cc: Ross Zwisler 
---
 include/linux/mm.h | 14 +++---
 kernel/memremap.c  |  6 ++
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c82e8db..022423c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -821,11 +821,19 @@ static inline void put_page(struct page *page)
 {
page = compound_head(page);
 
+   /*
+* ZONE_DEVICE pages should never have their refcount reach 0 (this
+* would be a bug), so call page_ref_dec() in put_zone_device_page()
+* to decrement page refcount and skip __put_page() here, as this
+* would worsen things if a ZONE_DEVICE had a refcount bug.
+*/
+   if (unlikely(is_zone_device_page(page))) {
+   put_zone_device_page(page);
+   return;
+   }
+
if (put_page_testzero(page))
__put_page(page);
-
-   if (unlikely(is_zone_device_page(page)))
-   put_zone_device_page(page);
 }
 
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
diff --git a/kernel/memremap.c b/kernel/memremap.c
index ea714ee..97ef676 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -190,6 +190,12 @@ EXPORT_SYMBOL(get_zone_device_page);
 
 void put_zone_device_page(struct page *page)
 {
+   /*
+* ZONE_DEVICE page refcount should never reach 0 and never be freed
+* to kernel memory allocator.
+*/
+   page_ref_dec(page);
+
put_dev_pagemap(page->pgmap);
 }
 EXPORT_SYMBOL(put_zone_device_page);
-- 
2.9.3



Re: [RFC PATCH v5 1/2] usb: typec: USB Type-C Port Manager (tcpm)

2017-04-21 Thread Guenter Roeck

On 04/21/2017 03:15 PM, Guenter Roeck wrote:

From: Guenter Roeck 

This driver implements the USB Type-C Power Delivery state machine
for both source and sink ports. Alternate mode support is not
fully implemented.

The driver attaches to the USB Type-C class code implemented in
the following patches.

usb: typec: add driver for Intel Whiskey Cove PMIC USB Type-C PHY
usb: USB Type-C connector class

This driver only implements the state machine. Lower level drivers are
responsible for
- Reporting VBUS status and activating VBUS
- Setting CC lines and providing CC line status
- Setting line polarity
- Activating and deactivating VCONN
- Setting the current limit
- Activating and deactivating PD message transfers
- Sending and receiving PD messages

The driver provides both a functional API as well as callbacks for
lower level drivers.



Open question, since this still requires some work:

Would the code, in its current form, be acceptable in -staging ?

[ ... ]


diff --git a/drivers/usb/typec/tcpm.c b/drivers/usb/typec/tcpm.c
new file mode 100644
index ..1a82dddb243d
--- /dev/null
+++ b/drivers/usb/typec/tcpm.c
@@ -0,0 +1,3443 @@
+/*
+ * Copyright 2015-2016 Google, Inc
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * USB Power Delivery protocol stack.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 


Missing

#include 

to actually build it on top of -next.

Guenter



[HMM 15/15] hmm: heterogeneous memory management documentation

2017-04-21 Thread Jérôme Glisse
This adds documentation for HMM (Heterogeneous Memory Management). It
presents the motivation behind it and the features necessary for it to
be useful, and gives an overview of how this is implemented.

Signed-off-by: Jérôme Glisse 
---
 Documentation/vm/hmm.txt | 362 +++
 1 file changed, 362 insertions(+)
 create mode 100644 Documentation/vm/hmm.txt

diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
new file mode 100644
index 000..a18ffc0
--- /dev/null
+++ b/Documentation/vm/hmm.txt
@@ -0,0 +1,362 @@
+Heterogeneous Memory Management (HMM)
+
+Transparently allow any component of a program to use any memory region of
+said program with a device, without using a device-specific memory allocator.
+This is becoming a requirement to simplify the use of advanced heterogeneous
+computing, where GPUs, DSPs or FPGAs are used to perform various computations.
+
+This document is divided as follows: in the first section I expose the
+problems related to the use of a device-specific allocator. The second section
+exposes the hardware limitations that are inherent to many platforms. The
+third section gives an overview of the HMM design. The fourth section explains
+how CPU page-table mirroring works and what HMM's purpose is in this context.
+The fifth section deals with how device memory is represented inside the
+kernel. Finally, the last section presents the new migration helper that
+allows leveraging the device DMA engine.
+
+
+---
+
+1) Problems of using device specific memory allocator:
+
+Devices with a large amount of on-board memory (several gigabytes), like GPUs,
+have historically managed their memory through a dedicated, driver-specific
+API. This creates a disconnect between memory allocated and managed by the
+device driver and regular application memory (private anonymous, shared memory
+or regular file-backed memory). From here on I will refer to this aspect as
+split address space. I use shared address space to refer to the opposite
+situation, i.e. one in which any memory region can be used by the device
+transparently.
+
+The address space is split because the device can only access memory allocated
+through the device-specific API. This implies that all memory objects in a
+program are not equal from the device's point of view, which complicates large
+programs that rely on a wide set of libraries.
+
+Concretely, this means that code that wants to leverage a device like a GPU
+needs to copy objects between generically allocated memory (malloc, mmap
+private/shared) and memory allocated through the device driver API (this still
+ends up being an mmap, but of the device file).
+
+For flat data sets (array, grid, image, ...) this isn't too hard to achieve,
+but complex data sets (list, tree, ...) are hard to get right. Duplicating a
+complex data set requires re-mapping all the pointer relations between each of
+its elements. This is error prone, and programs get harder to debug because of
+the duplicated data set.
+
+Split address space also means that libraries can not transparently use data
+they get from the core program or from another library, and thus each library
+might have to duplicate its input data set using a specific memory allocator.
+Large projects suffer from this and waste resources because of the various
+memory copies.
+
+Duplicating each library API to accept as input or output memory allocated by
+each device-specific allocator is not a viable option. It would lead to a
+combinatorial explosion in the library entry points.
+
+Finally, with the advance of high-level language constructs (in C++ but in
+other languages too) it is now possible for the compiler to leverage GPUs or
+other devices without even the programmer's knowledge. Some compiler-identified
+patterns are only doable with a shared address space. It is as well more
+reasonable to use a shared address space for all the other patterns.
+
+
+---
+
+2) System bus, device memory characteristics
+
+System buses cripple shared address spaces due to a few limitations. Most
+system buses only allow basic memory access from the device to main memory;
+even cache coherency is often optional. Access to device memory from the CPU
+is even more limited; more often than not, it is not cache coherent.
+
+If we only consider the PCIe bus, then a device can access main memory (often
+through an IOMMU) and be cache coherent with the CPUs. However, it only allows
+a limited set of atomic operations from the device on main memory. This is
+worse in the other direction: the CPUs can only access a limited range of the
+device memory and can not perform atomic operations on it. Thus device memory
+can not be considered like regular memory from the kernel's point of view.
+
+Another crippling factor is the limited bandwidth (~32 GBytes/s with PCIe 4.0
+and 16 lanes). This is 33 times less than the fastest GPU memory (1 TBytes/s).
+The final limitation is latency, access to main memory from 

[HMM 10/15] mm/hmm/mirror: device page fault handler

2017-04-21 Thread Jérôme Glisse
This handles page faults on behalf of a device driver; unlike handle_mm_fault(),
it does not trigger migration of device memory back to system memory.
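
For illustration, a rough sketch of how a mirror driver might drive the new
helper, following the prototype and locking rules documented in the patch
below (not part of the patch; the hmm_range setup and the snapshot
synchronization via hmm_vma_range_done() are left out):

/* Illustration only -- error handling simplified. */
static int example_mirror_fault(struct mm_struct *mm,
				struct vm_area_struct *vma,
				struct hmm_range *range,
				unsigned long start, unsigned long end,
				hmm_pfn_t *pfns, bool write)
{
	unsigned long i, npages = (end - start) >> PAGE_SHIFT;
	int ret;

again:
	down_read(&mm->mmap_sem);
	ret = hmm_vma_fault(vma, range, start, end, pfns, write, false);
	if (ret == -EAGAIN)
		goto again;	/* the helper dropped mmap_sem for us */
	if (ret) {
		up_read(&mm->mmap_sem);
		return ret;
	}

	/* A zero return does not mean every address faulted in: each
	 * entry of the pfns array must be inspected individually.
	 */
	for (i = 0; i < npages; i++)
		if (pfns[i] == HMM_PFN_ERROR)
			ret = -EFAULT;	/* simplified error handling */

	up_read(&mm->mmap_sem);
	return ret;
}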

Signed-off-by: Jérôme Glisse 
Signed-off-by: Evgeny Baskakov 
Signed-off-by: John Hubbard 
Signed-off-by: Mark Hairgrove 
Signed-off-by: Sherry Cheung 
Signed-off-by: Subhash Gutti 
---
 include/linux/hmm.h |  27 ++
 mm/hmm.c| 243 +---
 2 files changed, 256 insertions(+), 14 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index defa7cd..d267989 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -292,6 +292,33 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 unsigned long end,
 hmm_pfn_t *pfns);
 bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
+
+
+/*
+ * Fault memory on behalf of device driver. Unlike handle_mm_fault(), this will
+ * not migrate any device memory back to system memory. The hmm_pfn_t array will
+ * be updated with the fault result and current snapshot of the CPU page table
+ * for the range.
+ *
+ * The mmap_sem must be taken in read mode before entering and it might be
+ * dropped by the function if the block argument is false. In that case, the
+ * function returns -EAGAIN.
+ *
+ * Return value does not reflect if the fault was successful for every single
+ * address or not. Therefore, the caller must inspect the hmm_pfn_t array to
+ * determine fault status for each address.
+ *
+ * Trying to fault inside an invalid vma will result in -EINVAL.
+ *
+ * See the function description in mm/hmm.c for further documentation.
+ */
+int hmm_vma_fault(struct vm_area_struct *vma,
+ struct hmm_range *range,
+ unsigned long start,
+ unsigned long end,
+ hmm_pfn_t *pfns,
+ bool write,
+ bool block);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 4828b97..be88807 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -235,6 +235,36 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
 
+struct hmm_vma_walk {
+   struct hmm_range*range;
+   unsigned long   last;
+   boolfault;
+   boolblock;
+   boolwrite;
+};
+
+static int hmm_vma_do_fault(struct mm_walk *walk,
+   unsigned long addr,
+   hmm_pfn_t *pfn)
+{
+   unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_REMOTE;
+   struct hmm_vma_walk *hmm_vma_walk = walk->private;
+   struct vm_area_struct *vma = walk->vma;
+   int r;
+
+   flags |= hmm_vma_walk->block ? 0 : FAULT_FLAG_ALLOW_RETRY;
+   flags |= hmm_vma_walk->write ? FAULT_FLAG_WRITE : 0;
+   r = handle_mm_fault(vma, addr, flags);
+   if (r & VM_FAULT_RETRY)
+   return -EBUSY;
+   if (r & VM_FAULT_ERROR) {
+   *pfn = HMM_PFN_ERROR;
+   return -EFAULT;
+   }
+
+   return -EAGAIN;
+}
+
 static void hmm_pfns_special(hmm_pfn_t *pfns,
 unsigned long addr,
 unsigned long end)
@@ -243,34 +273,62 @@ static void hmm_pfns_special(hmm_pfn_t *pfns,
*pfns = HMM_PFN_SPECIAL;
 }
 
+static void hmm_pfns_clear(hmm_pfn_t *pfns,
+  unsigned long addr,
+  unsigned long end)
+{
+   for (; addr < end; addr += PAGE_SIZE, pfns++)
+   *pfns = 0;
+}
+
 static int hmm_vma_walk_hole(unsigned long addr,
 unsigned long end,
 struct mm_walk *walk)
 {
-   struct hmm_range *range = walk->private;
+   struct hmm_vma_walk *hmm_vma_walk = walk->private;
+   struct hmm_range *range = hmm_vma_walk->range;
hmm_pfn_t *pfns = range->pfns;
unsigned long i;
 
+   hmm_vma_walk->last = addr;
i = (addr - range->start) >> PAGE_SHIFT;
-   for (; addr < end; addr += PAGE_SIZE, i++)
+   for (; addr < end; addr += PAGE_SIZE, i++) {
pfns[i] = HMM_PFN_EMPTY;
+   if (hmm_vma_walk->fault) {
+   int ret;
 
-   return 0;
+   ret = hmm_vma_do_fault(walk, addr, &pfns[i]);
+   if (ret != -EAGAIN)
+   return ret;
+   }
+   }
+
+   return hmm_vma_walk->fault ? -EAGAIN : 0;
 }
 
 static int hmm_vma_walk_clear(unsigned long addr,
  unsigned long end,
  struct mm_walk *walk)
 {
-   struct hmm_range *range = walk->private;
+   struct hmm_vma_walk *hmm_vma_walk = walk->private;
+   struct hmm_range *range = hmm_vma_walk->range;
hmm_pfn_t *pfns = range->pfns;
unsigned long i;
 
+   hmm_vma_walk->last 

[HMM 12/15] mm/migrate: allow migrate_vma() to alloc new page on empty entry v2

2017-04-21 Thread Jérôme Glisse
This allows the caller of migrate_vma() to allocate a new page for an empty
CPU page table entry. It only supports anonymous memory, and it won't allow
a new page to be instantiated if userfaultfd is armed.

This is useful for device drivers that want to migrate a range of virtual
addresses and would rather allocate new memory than having to fault later
on.
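
For illustration, extending the driver-side alloc_and_copy() sketch given under
patch 05/15 above: when the source entry is empty, the driver can still hand
back a freshly allocated, zeroed page (names assumed; a real driver would use
its own device allocator):

/* Illustration only -- not part of this patch.  Fill one dst entry for one
 * address; migrate_vma() will then create the page table entry through
 * migrate_vma_insert_page() below instead of leaving a hole to fault later.
 */
static void example_fill_dst_entry(struct vm_area_struct *vma,
				   unsigned long addr,
				   unsigned long src_entry,
				   unsigned long *dst_entry)
{
	struct page *spage = migrate_pfn_to_page(src_entry);
	struct page *dpage;

	*dst_entry = 0;
	dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);
	if (!dpage)
		return;

	lock_page(dpage);
	if (spage)
		copy_highpage(dpage, spage);
	else
		clear_highpage(dpage);	/* empty CPU pte: nothing to copy */

	*dst_entry = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
}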

Changed since v1:
  - 5 level page table fix

Signed-off-by: Jérôme Glisse 
---
 mm/migrate.c | 135 +--
 1 file changed, 131 insertions(+), 4 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 62ad41c..c4c3f93 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2112,9 +2113,10 @@ static int migrate_vma_collect_hole(unsigned long start,
struct mm_walk *walk)
 {
struct migrate_vma *migrate = walk->private;
-   unsigned long addr, next;
+   unsigned long addr;
 
for (addr = start & PAGE_MASK; addr < end; addr += PAGE_SIZE) {
+   migrate->cpages++;
migrate->dst[migrate->npages] = 0;
migrate->src[migrate->npages++] = 0;
}
@@ -2151,6 +2153,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
pfn = pte_pfn(pte);
 
if (pte_none(pte)) {
+   migrate->cpages++;
mpfn = pfn = 0;
goto next;
}
@@ -2464,6 +2467,118 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
}
 }
 
+static void migrate_vma_insert_page(struct migrate_vma *migrate,
+   unsigned long addr,
+   struct page *page,
+   unsigned long *src,
+   unsigned long *dst)
+{
+   struct vm_area_struct *vma = migrate->vma;
+   struct mm_struct *mm = vma->vm_mm;
+   struct mem_cgroup *memcg;
+   spinlock_t *ptl;
+   pgd_t *pgdp;
+   p4d_t *p4dp;
+   pud_t *pudp;
+   pmd_t *pmdp;
+   pte_t *ptep;
+   pte_t entry;
+
+   /* Only allow populating anonymous memory */
+   if (!vma_is_anonymous(vma))
+   goto abort;
+
+   pgdp = pgd_offset(mm, addr);
+   p4dp = p4d_alloc(mm, pgdp, addr);
+   if (!p4dp)
+   goto abort;
+   pudp = pud_alloc(mm, p4dp, addr);
+   if (!pudp)
+   goto abort;
+   pmdp = pmd_alloc(mm, pudp, addr);
+   if (!pmdp)
+   goto abort;
+
+   if (pmd_trans_unstable(pmdp) || pmd_devmap(*pmdp))
+   goto abort;
+
+   /*
+* Use pte_alloc() instead of pte_alloc_map().  We can't run
+* pte_offset_map() on pmds where a huge pmd might be created
+* from a different thread.
+*
+* pte_alloc_map() is safe to use under down_write(mmap_sem) or when
+* parallel threads are excluded by other means.
+*
+* Here we only have down_read(mmap_sem).
+*/
+   if (pte_alloc(mm, pmdp, addr))
+   goto abort;
+
+   /* See the comment in pte_alloc_one_map() */
+   if (unlikely(pmd_trans_unstable(pmdp)))
+   goto abort;
+
+   if (unlikely(anon_vma_prepare(vma)))
+   goto abort;
+   if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg, false))
+   goto abort;
+
+   /*
+* The memory barrier inside __SetPageUptodate makes sure that
+* preceding stores to the page contents become visible before
+* the set_pte_at() write.
+*/
+   __SetPageUptodate(page);
+
+   if (is_zone_device_page(page) && is_device_unaddressable_page(page)) {
+   swp_entry_t swp_entry;
+
+   swp_entry = make_device_entry(page, vma->vm_flags & VM_WRITE);
+   entry = swp_entry_to_pte(swp_entry);
+   } else {
+   entry = mk_pte(page, vma->vm_page_prot);
+   if (vma->vm_flags & VM_WRITE)
+   entry = pte_mkwrite(pte_mkdirty(entry));
+   }
+
+   ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+   if (!pte_none(*ptep)) {
+   pte_unmap_unlock(ptep, ptl);
+   mem_cgroup_cancel_charge(page, memcg, false);
+   goto abort;
+   }
+
+   /*
+* Check for userfaultfd but do not deliver the fault. Instead,
+* just back off.
+*/
+   if (userfaultfd_missing(vma)) {
+   pte_unmap_unlock(ptep, ptl);
+   mem_cgroup_cancel_charge(page, memcg, false);
+   goto abort;
+   }
+
+   inc_mm_counter(mm, MM_ANONPAGES);
+   page_add_new_anon_rmap(page, vma, addr, false);
+   mem_cgroup_commit_charge(page, memcg, false, false);
+   if (!is_zone_device_page(page))
+   lru_cache_add_active_or_unevictable(page, vma);
+   set_pte_at(mm, addr, 

[HMM 13/15] mm/hmm/devmem: device memory hotplug using ZONE_DEVICE v3

2017-04-21 Thread Jérôme Glisse
This introduces a simple struct and associated helpers for device drivers
to use when hotplugging un-addressable device memory as ZONE_DEVICE. It
will find an unused physical address range and trigger memory hotplug for
it, which allocates and initializes struct pages for the device memory.
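
For illustration, the intended driver-side usage, built from the
hmm_devmem_ops and hmm_devmem_add() declarations added below (not part of the
patch; the probe function, the 1GB size and the ERR_PTR-style error handling
are assumptions of the sketch):

/* Illustration only -- not part of this patch. */
static void example_devmem_free(struct hmm_devmem *devmem, struct page *page)
{
	/* refcount dropped to 1: hand the page back to the device allocator */
}

static int example_devmem_fault(struct hmm_devmem *devmem,
				struct vm_area_struct *vma,
				unsigned long addr,
				struct page *page,
				unsigned int flags,
				pmd_t *pmdp)
{
	/* CPU touched unaddressable memory: a real driver would migrate it
	 * back here, e.g. with hmm_devmem_fault_range() declared below.
	 */
	return VM_FAULT_SIGBUS;	/* placeholder only */
}

static const struct hmm_devmem_ops example_devmem_ops = {
	.free	= example_devmem_free,
	.fault	= example_devmem_fault,
};

static int example_probe(struct device *dev)
{
	struct hmm_devmem *devmem;

	/* Hotplug 1GB of un-addressable device memory as ZONE_DEVICE. */
	devmem = hmm_devmem_add(&example_devmem_ops, dev, SZ_1G);
	if (IS_ERR(devmem))
		return PTR_ERR(devmem);

	/* devmem->pfn_first .. devmem->pfn_last now have struct pages. */
	return 0;
}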

Changed since v2:
  - s/SECTION_SIZE/PA_SECTION_SIZE
Changed since v1:
  - change to adapt to new add_pages() helper
  - make this x86-64 only for now

Signed-off-by: Jérôme Glisse 
Signed-off-by: Evgeny Baskakov 
Signed-off-by: John Hubbard 
Signed-off-by: Mark Hairgrove 
Signed-off-by: Sherry Cheung 
Signed-off-by: Subhash Gutti 
---
 include/linux/hmm.h | 114 +++
 mm/Kconfig  |   9 ++
 mm/hmm.c| 404 
 3 files changed, 527 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index d267989..50a1115 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -72,6 +72,11 @@
 
 #if IS_ENABLED(CONFIG_HMM)
 
+#include 
+#include 
+#include 
+
+
 struct hmm;
 
 /*
@@ -322,6 +327,115 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
+#if IS_ENABLED(CONFIG_HMM_DEVMEM)
+struct hmm_devmem;
+
+struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
+  unsigned long addr);
+
+/*
+ * struct hmm_devmem_ops - callback for ZONE_DEVICE memory events
+ *
+ * @free: called when the refcount on a page reaches 1 and it is no longer used
+ * @fault: called when there is a page fault to unaddressable memory
+ */
+struct hmm_devmem_ops {
+   void (*free)(struct hmm_devmem *devmem, struct page *page);
+   int (*fault)(struct hmm_devmem *devmem,
+struct vm_area_struct *vma,
+unsigned long addr,
+struct page *page,
+unsigned int flags,
+pmd_t *pmdp);
+};
+
+/*
+ * struct hmm_devmem - track device memory
+ *
+ * @completion: completion object for device memory
+ * @pfn_first: first pfn for this resource (set by hmm_devmem_add())
+ * @pfn_last: last pfn for this resource (set by hmm_devmem_add())
+ * @resource: IO resource reserved for this chunk of memory
+ * @pagemap: device page map for that chunk
+ * @device: device to bind resource to
+ * @ops: memory operations callback
+ * @ref: per CPU refcount
+ *
+ * This is a helper structure for device drivers that do not wish to implement
+ * the gory details related to hotplugging new memory and allocating struct
+ * pages.
+ *
+ * Device drivers can directly use ZONE_DEVICE memory on their own if they
+ * wish to do so.
+ */
+struct hmm_devmem {
+   struct completion   completion;
+   unsigned long   pfn_first;
+   unsigned long   pfn_last;
+   struct resource *resource;
+   struct device   *device;
+   struct dev_pagemap  pagemap;
+   const struct hmm_devmem_ops *ops;
+   struct percpu_ref   ref;
+};
+
+/*
+ * To add (hotplug) device memory, HMM assumes that there is no real resource
+ * that reserves a range in the physical address space (this is intended to be
+ * used by unaddressable device memory). It will reserve a physical range big
+ * enough and allocate struct page for it.
+ *
+ * The device driver can wrap the hmm_devmem struct inside a private device
+ * driver struct. The device driver must call hmm_devmem_remove() before the
+ * device goes away and before freeing the hmm_devmem struct memory.
+ */
+struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
+ struct device *device,
+ unsigned long size);
+void hmm_devmem_remove(struct hmm_devmem *devmem);
+
+int hmm_devmem_fault_range(struct hmm_devmem *devmem,
+  struct vm_area_struct *vma,
+  const struct migrate_vma_ops *ops,
+  unsigned long *src,
+  unsigned long *dst,
+  unsigned long start,
+  unsigned long addr,
+  unsigned long end,
+  void *private);
+
+/*
+ * hmm_devmem_page_set_drvdata - set per-page driver data field
+ *
+ * @page: pointer to struct page
+ * @data: driver data value to set
+ *
+ * Because the page can not be on an LRU list, we have an unsigned long that
+ * the driver can use to store a per-page field. This is just a simple helper
+ * to do that.
+ */
+static inline void hmm_devmem_page_set_drvdata(struct page *page,
+  unsigned long data)
+{
+   unsigned long *drvdata = (unsigned long *)&page->pgmap;
+
+   drvdata[1] = data;
+}
+
+/*
+ * hmm_devmem_page_get_drvdata - get per page driver data field
+ *
+ * @page: pointer to struct page
+ * Return: driver data value
+ */
+static inline unsigned long 

[HMM 06/15] mm/migrate: migrate_vma() unmap page from vma while collecting pages

2017-04-21 Thread Jérôme Glisse
The common case for migration of a virtual address range is that pages are
mapped only once, inside the vma in which migration is taking place. Because
we already walk the CPU page table for that range, we can directly do
the unmap there and set up the special migration swap entries.

Signed-off-by: Jérôme Glisse 
Signed-off-by: Evgeny Baskakov 
Signed-off-by: John Hubbard 
Signed-off-by: Mark Hairgrove 
Signed-off-by: Sherry Cheung 
Signed-off-by: Subhash Gutti 
---
 mm/migrate.c | 114 ++-
 1 file changed, 98 insertions(+), 16 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 452f894..4ac2a7a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2118,7 +2118,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 {
struct migrate_vma *migrate = walk->private;
struct mm_struct *mm = walk->vma->vm_mm;
-   unsigned long addr = start;
+   unsigned long addr = start, unmapped = 0;
spinlock_t *ptl;
pte_t *ptep;
 
@@ -2128,9 +2128,12 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
}
 
ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+   arch_enter_lazy_mmu_mode();
+
for (; addr < end; addr += PAGE_SIZE, ptep++) {
unsigned long mpfn, pfn;
struct page *page;
+   swp_entry_t entry;
pte_t pte;
 
pte = *ptep;
@@ -2162,11 +2165,44 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
 
+   /*
+* Optimize for the common case where page is only mapped once
+* in one process. If we can lock the page, then we can safely
+* set up a special migration page table entry now.
+*/
+   if (trylock_page(page)) {
+   pte_t swp_pte;
+
+   mpfn |= MIGRATE_PFN_LOCKED;
+   ptep_get_and_clear(mm, addr, ptep);
+
+   /* Setup special migration page table entry */
+   entry = make_migration_entry(page, pte_write(pte));
+   swp_pte = swp_entry_to_pte(entry);
+   if (pte_soft_dirty(pte))
+   swp_pte = pte_swp_mksoft_dirty(swp_pte);
+   set_pte_at(mm, addr, ptep, swp_pte);
+
+   /*
+* This is like regular unmap: we remove the rmap and
+* drop page refcount. Page won't be freed, as we took
+* a reference just above.
+*/
+   page_remove_rmap(page, false);
+   put_page(page);
+   unmapped++;
+   }
+
 next:
migrate->src[migrate->npages++] = mpfn;
}
+   arch_leave_lazy_mmu_mode();
pte_unmap_unlock(ptep - 1, ptl);
 
+   /* Only flush the TLB if we actually modified any entries */
+   if (unmapped)
+   flush_tlb_range(walk->vma, start, end);
+
return 0;
 }
 
@@ -2191,7 +2227,13 @@ static void migrate_vma_collect(struct migrate_vma 
*migrate)
mm_walk.mm = migrate->vma->vm_mm;
mm_walk.private = migrate;
 
+   mmu_notifier_invalidate_range_start(mm_walk.mm,
+   migrate->start,
+   migrate->end);
walk_page_range(migrate->start, migrate->end, _walk);
+   mmu_notifier_invalidate_range_end(mm_walk.mm,
+ migrate->start,
+ migrate->end);
 
migrate->end = migrate->start + (migrate->npages << PAGE_SHIFT);
 }
@@ -2247,12 +2289,16 @@ static void migrate_vma_prepare(struct migrate_vma 
*migrate)
 
for (i = 0; i < npages; i++) {
struct page *page = migrate_pfn_to_page(migrate->src[i]);
+   bool remap = true;
 
if (!page)
continue;
 
-   lock_page(page);
-   migrate->src[i] |= MIGRATE_PFN_LOCKED;
+   if (!(migrate->src[i] & MIGRATE_PFN_LOCKED)) {
+   remap = false;
+   lock_page(page);
+   migrate->src[i] |= MIGRATE_PFN_LOCKED;
+   }
 
if (!PageLRU(page) && allow_drain) {
/* Drain CPU's pagevec */
@@ -2261,21 +2307,50 @@ static void migrate_vma_prepare(struct migrate_vma 
*migrate)
}
 
if (isolate_lru_page(page)) {
-   migrate->src[i] = 0;
-   unlock_page(page);
-   migrate->cpages--;
-   

[HMM 06/15] mm/migrate: migrate_vma() unmap page from vma while collecting pages

2017-04-21 Thread Jérôme Glisse
The common case for migration of a virtual address range is that pages are
mapped only once, inside the vma in which the migration is taking place.
Because we already walk the CPU page table for that range, we can directly
do the unmap there and set up the special migration swap entry.

Signed-off-by: Jérôme Glisse 
Signed-off-by: Evgeny Baskakov 
Signed-off-by: John Hubbard 
Signed-off-by: Mark Hairgrove 
Signed-off-by: Sherry Cheung 
Signed-off-by: Subhash Gutti 
---
 mm/migrate.c | 114 ++-
 1 file changed, 98 insertions(+), 16 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 452f894..4ac2a7a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2118,7 +2118,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 {
struct migrate_vma *migrate = walk->private;
struct mm_struct *mm = walk->vma->vm_mm;
-   unsigned long addr = start;
+   unsigned long addr = start, unmapped = 0;
spinlock_t *ptl;
pte_t *ptep;
 
@@ -2128,9 +2128,12 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
}
 
ptep = pte_offset_map_lock(mm, pmdp, addr, );
+   arch_enter_lazy_mmu_mode();
+
for (; addr < end; addr += PAGE_SIZE, ptep++) {
unsigned long mpfn, pfn;
struct page *page;
+   swp_entry_t entry;
pte_t pte;
 
pte = *ptep;
@@ -2162,11 +2165,44 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
 
+   /*
+* Optimize for the common case where page is only mapped once
+* in one process. If we can lock the page, then we can safely
+* set up a special migration page table entry now.
+*/
+   if (trylock_page(page)) {
+   pte_t swp_pte;
+
+   mpfn |= MIGRATE_PFN_LOCKED;
+   ptep_get_and_clear(mm, addr, ptep);
+
+   /* Setup special migration page table entry */
+   entry = make_migration_entry(page, pte_write(pte));
+   swp_pte = swp_entry_to_pte(entry);
+   if (pte_soft_dirty(pte))
+   swp_pte = pte_swp_mksoft_dirty(swp_pte);
+   set_pte_at(mm, addr, ptep, swp_pte);
+
+   /*
+* This is like regular unmap: we remove the rmap and
+* drop page refcount. Page won't be freed, as we took
+* a reference just above.
+*/
+   page_remove_rmap(page, false);
+   put_page(page);
+   unmapped++;
+   }
+
 next:
migrate->src[migrate->npages++] = mpfn;
}
+   arch_leave_lazy_mmu_mode();
pte_unmap_unlock(ptep - 1, ptl);
 
+   /* Only flush the TLB if we actually modified any entries */
+   if (unmapped)
+   flush_tlb_range(walk->vma, start, end);
+
return 0;
 }
 
@@ -2191,7 +2227,13 @@ static void migrate_vma_collect(struct migrate_vma 
*migrate)
mm_walk.mm = migrate->vma->vm_mm;
mm_walk.private = migrate;
 
+   mmu_notifier_invalidate_range_start(mm_walk.mm,
+   migrate->start,
+   migrate->end);
walk_page_range(migrate->start, migrate->end, _walk);
+   mmu_notifier_invalidate_range_end(mm_walk.mm,
+ migrate->start,
+ migrate->end);
 
migrate->end = migrate->start + (migrate->npages << PAGE_SHIFT);
 }
@@ -2247,12 +2289,16 @@ static void migrate_vma_prepare(struct migrate_vma 
*migrate)
 
for (i = 0; i < npages; i++) {
struct page *page = migrate_pfn_to_page(migrate->src[i]);
+   bool remap = true;
 
if (!page)
continue;
 
-   lock_page(page);
-   migrate->src[i] |= MIGRATE_PFN_LOCKED;
+   if (!(migrate->src[i] & MIGRATE_PFN_LOCKED)) {
+   remap = false;
+   lock_page(page);
+   migrate->src[i] |= MIGRATE_PFN_LOCKED;
+   }
 
if (!PageLRU(page) && allow_drain) {
/* Drain CPU's pagevec */
@@ -2261,21 +2307,50 @@ static void migrate_vma_prepare(struct migrate_vma 
*migrate)
}
 
if (isolate_lru_page(page)) {
-   migrate->src[i] = 0;
-   unlock_page(page);
-   migrate->cpages--;
-   put_page(page);
+   if (remap) {
+   migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+ 
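
For context: the special entries installed above are ordinary migration swap
entries, so a concurrent CPU access to the range is handled by the existing
fault path rather than by anything added here. A condensed sketch of that
handling (simplified from the generic swap fault code, not from this patch;
mm, pmdp, ptep and address stand in for the fault context):

  pte_t pte = *ptep;

  if (!pte_present(pte) && !pte_none(pte)) {
          swp_entry_t entry = pte_to_swp_entry(pte);

          if (is_migration_entry(entry)) {
                  /* Sleep until the migration of this page completes,
                   * then retry the faulting access. */
                  migration_entry_wait(mm, pmdp, address);
          }
  }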

[HMM 00/15] HMM (Heterogeneous Memory Management) v20

2017-04-21 Thread Jérôme Glisse
Patchset is on top of mmotm mmotm-2017-04-18 and Michal patchset
([PATCH -v3 0/13] mm: make movable onlining suck less). Branch:

https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v20

I have included all suggestions made since v19; they are all build
fixes and changes with respect to memory hotplug after Michal's rework.
Changes since v19:
- Included various build fixes and compilation warning fixes
- Limit HMM to x86-64 (easy to enable other archs as separate patches)
- Rebase on top of Michal's memory hotplug rework

Heterogeneous Memory Management (HMM) (description and justification)

Today device drivers expose a dedicated memory allocation API through their
device file, often relying on a combination of IOCTL and mmap calls. The
device can only access and use memory allocated through this API. This
effectively splits the program address space into objects allocated for the
device and usable by the device, and other regular memory (malloc, mmap
of a file, shared memory, …) only accessible by the CPU (or in a very
limited way by a device, by pinning memory).

Allowing different isolated components of a program to use a device thus
requires duplicating the input data structures using the device memory
allocator. This is reasonable for simple data structures (arrays, grids,
images, …) but it gets extremely complex with advanced data structures
(lists, trees, graphs, …) that rely on a web of memory pointers. This is
becoming a serious limitation on the kind of workload that can be
offloaded to a device like a GPU.

New industry standards like C++, OpenCL or CUDA are pushing to remove this
barrier. This requires a shared address space between the GPU device and the
CPU so that the GPU can access any memory of a process (while still obeying
memory protections like read-only). This kind of feature is also appearing
in various other operating systems.

HMM is a set of helpers to facilitate several aspects of address space
sharing and device memory management. Unlike existing sharing mechanisms
that rely on pinning pages used by a device, HMM relies on mmu_notifier to
propagate CPU page table updates to the device page table.

Duplicating the CPU page table is only one aspect necessary for efficiently
using a device like a GPU. GPU local memory has bandwidth in the TeraBytes/
second range but is connected to main memory through a system bus
like PCIE that is limited to 32 GigaBytes/second (PCIE 4.0 16x). Thus it
is necessary to allow migration of process memory from main system memory
to device memory. The issue is that on platforms that only have PCIE, the
device memory is not accessible by the CPU with the same properties as main
memory (cache coherency, atomic operations, …).

To allow migration from main memory to device memory, HMM provides a set
of helpers to hotplug device memory as a new type of ZONE_DEVICE memory
which is un-addressable by the CPU but still has struct pages representing it.
This allows most of the core kernel logic that deals with process memory
to stay oblivious of the peculiarities of device memory.

When the page backing an address of a process is migrated to device memory,
the CPU page table entry is set to a new specific swap entry. CPU access
to such an address triggers a migration back to system memory, just as if
the page had been swapped to disk. HMM also blocks anyone from pinning a
ZONE_DEVICE page so that it can always be migrated back to system memory
if the CPU accesses it. Conversely, HMM does not migrate to device memory
any page that is pinned in system memory.

To allow efficient migration between device memory and main memory, a new
migrate_vma() helper is added with this patchset. It allows leveraging the
device DMA engine to perform the copy operation.
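
As a rough sketch of the driver-side call, assuming the migrate_vma()/
migrate_vma_ops interface this series introduces (the prototype and callback
names shown are assumptions based on that, and the my_* names are made up):

  /* Hypothetical: migrate one page of a vma to device memory. */
  static void my_alloc_and_copy(struct vm_area_struct *vma,
                                const unsigned long *src, unsigned long *dst,
                                unsigned long start, unsigned long end,
                                void *private)
  {
          /* For each src[i] with MIGRATE_PFN_MIGRATE set: allocate a device
           * page, DMA the data into it, and store migrate_pfn(new_pfn) plus
           * flags into dst[i]. */
  }

  static void my_finalize_and_map(struct vm_area_struct *vma,
                                  const unsigned long *src,
                                  const unsigned long *dst,
                                  unsigned long start, unsigned long end,
                                  void *private)
  {
          /* Commit device page table entries for pages that did migrate. */
  }

  static const struct migrate_vma_ops my_migrate_ops = {
          .alloc_and_copy   = my_alloc_and_copy,
          .finalize_and_map = my_finalize_and_map,
  };

  static int my_migrate_one_page(struct vm_area_struct *vma, unsigned long addr)
  {
          unsigned long src = 0, dst = 0;

          return migrate_vma(&my_migrate_ops, vma, addr, addr + PAGE_SIZE,
                             &src, &dst, NULL);
  }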

This feature will be used by upstream drivers like nouveau and mlx5, and
probably others in the future (amdgpu is the next suspect in line). We are
actively working on nouveau and mlx5 support. To test this patchset we also
worked with the NVidia closed source driver team; they have more resources
than us to test this kind of infrastructure and also a bigger and better
userspace eco-system with various real industry workloads that can be used
to test and profile HMM.

The expected workload is a program that builds a data set on the CPU (from
disk, from network, from sensors, …). The program uses the GPU API (OpenCL,
CUDA, ...) to give hints on memory placement for the input data and also for
the output buffer. The program calls the GPU API to schedule a GPU job; this
happens using a device driver specific ioctl. All this is hidden from the
programmer's point of view in the case of a C++ compiler that transparently
offloads some parts of a program to the GPU. The program can keep doing other
stuff on the CPU while the GPU is crunching numbers.

It is expected that the CPU will not access the same data set as the GPU
while the GPU is working on it, but this is not mandatory. In fact we expect
some small memory objects to be actively accessed by both GPU and CPU
concurrently, as a synchronization channel and/or for monitoring purposes.
Such objects will stay in system memory and should not be 

[HMM 09/15] mm/hmm/mirror: helper to snapshot CPU page table v2

2017-04-21 Thread Jérôme Glisse
This does not use the existing page table walker because we want to share
the same code with our page fault handler.

Changes since v1:
  - Use spinlock instead of rcu synchronized list traversal

Signed-off-by: Jérôme Glisse 
Signed-off-by: Evgeny Baskakov 
Signed-off-by: John Hubbard 
Signed-off-by: Mark Hairgrove 
Signed-off-by: Sherry Cheung 
Signed-off-by: Subhash Gutti 
---
 include/linux/hmm.h |  55 +-
 mm/hmm.c| 285 
 2 files changed, 338 insertions(+), 2 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 6668a1b..defa7cd 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -79,13 +79,26 @@ struct hmm;
  *
  * Flags:
  * HMM_PFN_VALID: pfn is valid
+ * HMM_PFN_READ:  CPU page table has read permission set
  * HMM_PFN_WRITE: CPU page table has write permission set
+ * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
+ * HMM_PFN_EMPTY: corresponding CPU page table entry is pte_none()
+ * HMM_PFN_SPECIAL: corresponding CPU page table entry is special; i.e., the
+ *  result of vm_insert_pfn() or vm_insert_page(). Therefore, it should not
+ *  be mirrored by a device, because the entry will never have 
HMM_PFN_VALID
+ *  set and the pfn value is undefined.
+ * HMM_PFN_DEVICE_UNADDRESSABLE: unaddressable device memory (ZONE_DEVICE)
  */
 typedef unsigned long hmm_pfn_t;
 
 #define HMM_PFN_VALID (1 << 0)
-#define HMM_PFN_WRITE (1 << 1)
-#define HMM_PFN_SHIFT 2
+#define HMM_PFN_READ (1 << 1)
+#define HMM_PFN_WRITE (1 << 2)
+#define HMM_PFN_ERROR (1 << 3)
+#define HMM_PFN_EMPTY (1 << 4)
+#define HMM_PFN_SPECIAL (1 << 5)
+#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 6)
+#define HMM_PFN_SHIFT 7
 
 /*
  * hmm_pfn_t_to_page() - return struct page pointed to by a valid hmm_pfn_t
@@ -241,6 +254,44 @@ struct hmm_mirror {
 
 int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
+
+
+/*
+ * struct hmm_range - track invalidation lock on virtual address range
+ *
+ * @list: all range locks are on a list
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @pfns: array of pfns (big enough for the range)
+ * @valid: pfns array did not change since it has been filled by an HMM function
+ */
+struct hmm_range {
+   struct list_headlist;
+   unsigned long   start;
+   unsigned long   end;
+   hmm_pfn_t   *pfns;
+   boolvalid;
+};
+
+/*
+ * To snapshot the CPU page table, call hmm_vma_get_pfns(), then take a device
+ * driver lock that serializes device page table updates, then call
+ * hmm_vma_range_done(), to check if the snapshot is still valid. The same
+ * device driver page table update lock must also be used in the
+ * hmm_mirror_ops.sync_cpu_device_pagetables() callback, so that CPU page
+ * table invalidation serializes on it.
+ *
+ * YOU MUST CALL hmm_vma_range_done() ONCE AND ONLY ONCE EACH TIME YOU CALL
+ * hmm_vma_get_pfns() WITHOUT ERROR !
+ *
+ * IF YOU DO NOT FOLLOW THE ABOVE RULE THE SNAPSHOT CONTENT MIGHT BE INVALID !
+ */
+int hmm_vma_get_pfns(struct vm_area_struct *vma,
+struct hmm_range *range,
+unsigned long start,
+unsigned long end,
+hmm_pfn_t *pfns);
+bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 7ed4b4c..4828b97 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -19,8 +19,12 @@
  */
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
+#include 
+#include 
 #include 
 
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
@@ -30,14 +34,18 @@ static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
  * struct hmm - HMM per mm struct
  *
  * @mm: mm struct this HMM struct is bound to
+ * @lock: lock protecting ranges list
  * @sequence: we track updates to the CPU page table with a sequence number
+ * @ranges: list of range being snapshotted
  * @mirrors: list of mirrors for this mm
  * @mmu_notifier: mmu notifier to track updates to CPU page table
  * @mirrors_sem: read/write semaphore protecting the mirrors list
  */
 struct hmm {
struct mm_struct*mm;
+   spinlock_t  lock;
atomic_tsequence;
+   struct list_headranges;
struct list_headmirrors;
struct mmu_notifier mmu_notifier;
struct rw_semaphore mirrors_sem;
@@ -71,6 +79,8 @@ static struct hmm *hmm_register(struct mm_struct *mm)
init_rwsem(>mirrors_sem);
atomic_set(>sequence, 0);
hmm->mmu_notifier.ops = NULL;
+   INIT_LIST_HEAD(>ranges);
+   spin_lock_init(>lock);
hmm->mm = mm;
 
/*
@@ -111,6 +121,22 @@ static void hmm_invalidate_range(struct hmm *hmm,
   
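
To make the protocol in the comment above concrete, a mirror driver is
expected to do roughly the following; the mutex name and the retry loop are
illustrative only (the my_* names are made up), while the two HMM calls are
the ones declared in this patch:

  /* Hypothetical mirror driver: snapshot a range, then update the device. */
  static int my_mirror_range(struct vm_area_struct *vma,
                             unsigned long start, unsigned long end,
                             hmm_pfn_t *pfns)
  {
          struct hmm_range range;
          int ret;

  again:
          ret = hmm_vma_get_pfns(vma, &range, start, end, pfns);
          if (ret)
                  return ret;

          /* Same lock the sync_cpu_device_pagetables() callback takes. */
          mutex_lock(&my_device_pagetable_lock);
          if (!hmm_vma_range_done(vma, &range)) {
                  /* CPU page table changed since the snapshot: try again. */
                  mutex_unlock(&my_device_pagetable_lock);
                  goto again;
          }
          my_device_update_pagetable(pfns, start, end);   /* hypothetical */
          mutex_unlock(&my_device_pagetable_lock);
          return 0;
  }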

Re: [PATCH v2] selftests: ftrace: Allow some tests to be run in a tracing instance

2017-04-21 Thread Steven Rostedt
On Sat, 22 Apr 2017 08:00:36 +0900
Masami Hiramatsu  wrote:

> 
> BTW, this seems too complicated (with many similar variables).

I'm used to complicated ;-)

> I think we just need following patch, if we run the tests which
> have "instance" flag twice on top-level and an instance.
> (If you'd like to run those tests only on instance, we need
>  just one more line in the main loop ;-) )
> 
> ---
> diff --git a/tools/testing/selftests/ftrace/ftracetest 
> b/tools/testing/selftests
> index 52e3c4d..bdb10e6 100755
> --- a/tools/testing/selftests/ftrace/ftracetest
> +++ b/tools/testing/selftests/ftrace/ftracetest
> @@ -152,6 +152,10 @@ testcase() { # testfile
>prlog -n "[$CASENO]$desc"
>  }
>  
> +test_on_instance() { # testfile
> +  grep "^#[ \t]*flag:.*instance" $1
> +}
> +
>  eval_result() { # sigval
>case $1 in
>  $PASS)
> @@ -266,6 +270,16 @@ for t in $TEST_CASES; do
>run_test $t
>  done
>  
> +# Test on instance loop
> +for t in $TEST_CASES; do
> +  test_on_instance $t || continue
> +  SAVED_TRACING_DIR=$TRACING_DIR
> +  export TRACING_DIR=`mktemp -d $TRACING_DIR/instances/ftracetest.XX`
> +  run_test $t
> +  rmdir $TRACING_DIR
> +  TRACING_DIR=$SAVED_TRACING_DIR
> +done

I have a slight modification to this. Will send soon.

-- Steve

> +
>  prlog ""
>  prlog "# of passed: " `echo $PASSED_CASES | wc -w`
>  prlog "# of failed: " `echo $FAILED_CASES | wc -w`
> ---
> 
> Would this work for you?
> 
> Thank you,
> 
> > +
> >  eval_result() { # sigval
> >case $1 in
> >  $PASS)
> > @@ -233,18 +242,26 @@ exit_xfail () {
> >  }
> >  trap 'SIG_RESULT=$XFAIL' $SIG_XFAIL
> >  
> > +INSTANCE_DIR="."
> >  __run_test() { # testfile
> ># setup PID and PPID, $$ is not updated.
> >(cd $TRACING_DIR; read PID _ < /proc/self/stat; set -e; set -x; 
> > initialize_ftrace; . $1)
> >[ $? -ne 0 ] && kill -s $SIG_FAIL $SIG_PID
> >  }
> >  
> > +TEST_INSTANCES=0
> > +
> >  # Run one test case
> >  run_test() { # testfile
> >local testname=`basename $1`
> >local testlog=`mktemp $LOG_DIR/${testname}-log.XX`
> >export TMPDIR=`mktemp -d /tmp/ftracetest-dir.XX`
> >testcase $1
> > +  local SAVE_TRACING_DIR=$TRACING_DIR
> > +  if [ $TEST_INSTANCES -eq 1 ]; then
> > +TRACING_DIR=$TRACING_DIR/instances/${testname}_test_$$
> > +mkdir $TRACING_DIR
> > +  fi
> >echo "execute: "$1 > $testlog
> >SIG_RESULT=0
> >if [ $VERBOSE -ge 2 ]; then
> > @@ -260,17 +277,40 @@ run_test() { # testfile
> >  [ $VERBOSE -ge 1 ] && catlog $testlog
> >  TOTAL_RESULT=1
> >fi
> > +  if [ $TEST_INSTANCES -eq 1 ]; then
> > +rmdir $TRACING_DIR
> > +  fi
> > +  TRACING_DIR=$SAVE_TRACING_DIR
> >rm -rf $TMPDIR
> >  }
> >  
> >  # load in the helper functions
> >  . $TEST_DIR/functions
> >  
> > +RUN_INSTANCES=0
> > +
> >  # Main loop
> >  for t in $TEST_CASES; do
> > +  testinstance $t
> > +  if [ $INSTANCE -eq 1 ]; then
> > +RUN_INSTANCES=1
> > +  fi
> >run_test $t
> >  done
> >  
> > +TEST_INSTANCES=1
> > +
> > +if [ $RUN_INSTANCES -eq 1 ]; then
> > +echo "Running tests in a tracing instance:"
> > +for t in $TEST_CASES; do
> > +   testinstance $t
> > +   if [ $INSTANCE -eq 0 ]; then
> > +   continue
> > +   fi
> > +   run_test $t
> > +done
> > +fi
> > +
> >  prlog ""
> >  prlog "# of passed: " `echo $PASSED_CASES | wc -w`
> >  prlog "# of failed: " `echo $FAILED_CASES | wc -w`
> > -- 
> > 2.9.3
> >   
> 
> 



Re: [RFC PATCH 1/3] clk: add clk_bulk_get accessories

2017-04-21 Thread Stephen Boyd
On 04/12, Dong Aisheng wrote:
>  
>  #ifdef CONFIG_HAVE_CLK
> @@ -230,6 +257,32 @@ static inline void clk_unprepare(struct clk *clk)
>  struct clk *clk_get(struct device *dev, const char *id);
>  
>  /**
> + * clk_bulk_get - lookup and obtain a number of references to clock producer.
> + * @dev: device for clock "consumer"
> + * @num_clks: the number of clk_bulk_data
> + * @clks: the clk_bulk_data table of consumer
> + *
> + * This helper function allows drivers to get several clk consumers in one
> + * operation. If any of the clk cannot be acquired then any clks
> + * that were obtained will be freed before returning to the caller.
> + *
> + * Returns 0 if all clocks specified in clk_bulk_data table are obtained
> + * successfully, or valid IS_ERR() condition containing errno.
> + * The implementation uses @dev and @clk_bulk_data.id to determine the
> + * clock consumer, and thereby the clock producer.
> + * (IOW, @id may be identical strings, but clk_get may return different
> + * clock producers depending on @dev.) The clock returned is stored in

This comment is inaccurate. Only one dev is possible with this
API.

> + * each @clk_bulk_data.clk field.
> + *
> + * Drivers must assume that the clock source is not enabled.
> + *
> + * clk_bulk_get should not be called from within interrupt context.
> + */
> +

Drop space.

> +int __must_check clk_bulk_get(struct device *dev, int num_clks,
> +   struct clk_bulk_data *clks);
> +
> +/**
>   * devm_clk_get - lookup and obtain a managed reference to a clock producer.
>   * @dev: device for clock "consumer"
>   * @id: clock consumer ID
> @@ -279,6 +332,20 @@ struct clk *devm_get_clk_from_child(struct device *dev,
>  int clk_enable(struct clk *clk);
>  
>  /**
> + * clk_bulk_enable - inform the system when the bulk of clock source should
> + *be running.
> + * @num_clks: the number of clk_bulk_data
> + * @clks: the clk_bulk_data table of consumer
> + *
> + * If the clock can not be enabled/disabled all, this should return success.
> + *
> + * May be called from atomic contexts.
> + *
> + * Returns success (0) or negative errno.
> + */
> +int __must_check clk_bulk_enable(int num_clks, struct clk_bulk_data *clks);
> +
> +/**
>   * clk_disable - inform the system when the clock source is no longer 
> required.
>   * @clk: clock source
>   *
> @@ -295,6 +362,24 @@ int clk_enable(struct clk *clk);
>  void clk_disable(struct clk *clk);
>  
>  /**
> + * clk_bulk_disable - inform the system when the bulk of clock source is no
> + * longer required.
> + * @num_clks: the number of clk_bulk_data
> + * @clks: the clk_bulk_data table of consumer
> + *
> + * Inform the system that a bulk of clock source is no longer required by
> + * a driver and may be shut down.
> + *
> + * May be called from atomic contexts.
> + *
> + * Implementation detail: if the bulk of clock source is shared between

I'm not sure "bulk of clock source" is the correct terminology.
Perhaps "set of clks"?

> + * multiple drivers, clk_bulk_enable() calls must be balanced by the
> + * same number of clk_bulk_disable() calls for the clock source to be
> + * disabled.
> + */
> +void clk_bulk_disable(int num_clks, struct clk_bulk_data *clks);

We can mark clk_bulk_data structure as const here? Probably
applies in other places as well in this patch.

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
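
For reference, a consumer of the proposed API would look roughly like the
sketch below; the "ahb"/"per" clock names and the my_* functions are made up,
and whether a separate prepare step is required is not visible in the quoted
hunks:

  static struct clk_bulk_data my_clks[] = {
          { .id = "ahb" },
          { .id = "per" },
  };

  static int my_clocks_on(struct device *dev)
  {
          int ret;

          /* Acquire all clocks in one call; on failure none are held. */
          ret = clk_bulk_get(dev, ARRAY_SIZE(my_clks), my_clks);
          if (ret)
                  return ret;

          return clk_bulk_enable(ARRAY_SIZE(my_clks), my_clks);
  }

  static void my_clocks_off(void)
  {
          clk_bulk_disable(ARRAY_SIZE(my_clks), my_clks);
  }

The devm_clk_bulk_get() variant discussed in patch 2/3 would drop the need to
release the clocks explicitly on the error and teardown paths.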


Re: [RFC] minimum gcc version for kernel: raise to gcc-4.3 or 4.6?

2017-04-21 Thread Maciej W. Rozycki
On Fri, 21 Apr 2017, Kees Cook wrote:

> > The linux-4.2 x86 defconfig could still be built with gcc-4.0, but
> > later kernels have several minor problems with that, and
> > require at least gcc-4.3.
> >
> > If we are ok with this status quo, we could simply declare gcc-4.3
> > the absolute minimum version for the kernel, make gcc-4.9
> the recommended minimum version, and remove all workarounds
> > for gcc-4.2 or older.
> 
> I think starting with this would be a good first step. I'm not sure
> the best way to add "recommended minimum" to
> Documentation/process/changes.rst hmmm

 FWIW for some reasons (mainly the ability to avoid NPTL) I have stuck to 
GCC 4.1.2 with some MIPS configurations and I've had no issues with that 
compiler up to Linux 4.6.0, which is the last kernel version I have tried 
with that compiler so far.  I could check if anything has regressed since 
then I suppose.

  Maciej


Re: [RFC PATCH 0/3] clk: introduce clk_bulk_get accessories

2017-04-21 Thread Stephen Boyd
On 04/12, Dong Aisheng wrote:
> 
> Together with the err path handling for each clocks, it does make
> things a bit ugly.
> 
> Since we already have regulator_bulk_get accessories, i thought we
> probably could introduce clk_bulk_get as well to handle such case to
> ease the driver owners' life. 
> 
> Besides the IMX cpufreq driver, there are also some similar cases
> in the kernel which could benefit from this api as well.
> e.g.
> drivers/cpufreq/tegra124-cpufreq.c
> drivers/cpufreq/s3c2412-cpufreq.c
> sound/soc/samsung/smdk_spdif.c
> arch/arm/mach-omap1/serial.c
> ...
> 
> And actually, if we handle more than 3 clocks, then it might be
> worth a try; there are quite many such cases in the kernel and
> that probably could save a lot of code.
> 
> This is a RFC patch intending to bring up the idea to discuss.
> 

Idea seems fine to me. Please also add Russell King, as we need
an ack from him on the clk.h API changes.

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [RFC PATCH 2/3] clk: add managed version of clk_bulk_get

2017-04-21 Thread Stephen Boyd
On 04/13, Dong Aisheng wrote:
> On Wed, Apr 12, 2017 at 12:03:28PM +0800, Dong Aisheng wrote:
> 
>drivers/built-in.o: In function `devm_clk_bulk_get':
> >> (.text+0x1930e): undefined reference to `clk_bulk_get'
>drivers/built-in.o: In function `devm_clk_bulk_release':
> >> clk-devres.c:(.text+0x19370): undefined reference to `clk_bulk_put'
> 
> clk_bulk_get is defined in clkdev.c which depends on CONFIG_CLKDEV_LOOKUP.
> However, some platforms like m68k may not select CLKDEV_LOOKUP but
> select HAVE_CLK. Thus compiling devm_clk_bulk_get may cause an undefined
> reference to 'clk_bulk_get'.
> 
> Since clk_bulk_get is built upon the platform specific clk_get api,
> clk_bulk_get can also be used by that platform accordingly.
> 
> Then we probably could move clk_bulk_get into clk-devres.c as well which
> is controlled by common CONFIG_HAVE_CLK to benefit all platforms.

clk-devres is for devm* things. I'd just make another file for
now, clk-bulk.c or something like that. When everyone moves to
common clk, we can fold it into clk.c, or not because clk.c is
rather large right now.

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [RFC PATCH 2/3] clk: add managed version of clk_bulk_get

2017-04-21 Thread Stephen Boyd
On 04/12, Dong Aisheng wrote:
> diff --git a/include/linux/clk.h b/include/linux/clk.h
> index 1d05b66..3fc6010 100644
> --- a/include/linux/clk.h
> +++ b/include/linux/clk.h
> @@ -278,11 +278,25 @@ struct clk *clk_get(struct device *dev, const char *id);
>   *
>   * clk_bulk_get should not be called from within interrupt context.
>   */
> -

Should be in previous patch?

>  int __must_check clk_bulk_get(struct device *dev, int num_clks,
> struct clk_bulk_data *clks);
>  
>  /**
> + * devm_clk_bulk_get - managed get multiple clk consumers
> + * @dev: device for clock "consumer"
> + * @num_clks: the number of clk_bulk_data
> + * @clks: the clk_bulk_data table of consumer
> + *
> + * Return 0 on success, an errno on failure.
> + *
> + * This helper function allows drivers to get several regulator

s/regulator/clk/

> + * consumers in one operation with management, the clks will
> + * automatically be freed when the device is unbound.
> + */
> +int __must_check devm_clk_bulk_get(struct device *dev, int num_clks,

Thanks for the __must_check. We need to add more __must_check to
clk APIs.

> +struct clk_bulk_data *clks);
> +
> +/**

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH v2 8/8] platform: x86: intel_bxtwc_tmu: remove first level irq unmask

2017-04-21 Thread Andy Shevchenko
On Sat, Apr 22, 2017 at 1:34 AM, sathyanarayanan kuppuswamy
 wrote:

> Thanks for bringing it up. I was planning to ask either Andy or Lee regarding
> this issue after all patches in the series are reviewed.

Darren, I'm planning to review this soon.

P.S. We have a few series flying around regarding Intel PMIC(s): mine
for Kconfig naming, Hans' for Crystal Cove (touches Kconfig as well),
and Sathya's series. I hope Lee can collect them in proper order.

-- 
With Best Regards,
Andy Shevchenko


Re: [PATCH 3/5] clk: mvebu: Use kcalloc() in two functions

2017-04-21 Thread Stephen Boyd
On 04/19, SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Wed, 19 Apr 2017 21:08:54 +0200
> 
> * Multiplications for the size determination of memory allocations
>   indicated that array data structures should be processed.
>   Thus use the corresponding function "kcalloc".
> 
>   This issue was detected by using the Coccinelle software.
> 
> * Replace the specification of data types by pointer dereferences
>   to make the corresponding size determination a bit safer according to
>   the Linux coding style convention.
> 
> Signed-off-by: Markus Elfring 
> ---

Applied to clk-next

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


[RFC][PATCH tip/sched/core] sched/rt: Simplify the IPI rt balancing logic

2017-04-21 Thread Steven Rostedt

When a CPU lowers its priority (schedules out a high priority task for a
lower priority one), a check is made to see if any other CPU has overloaded
RT tasks (more than one). It checks the rto_mask to determine this and if so
it will request to pull one of those tasks to itself if the non running RT
task is of higher priority than the new priority of the next task to run on
the current CPU.

When we deal with a large number of CPUs, the original pull logic suffered
from heavy lock contention on a single CPU run queue, which caused huge
latency across all CPUs. This was caused by only one CPU having
overloaded RT tasks and a bunch of other CPUs lowering their priority. To
solve this issue, commit b6366f048e0c ("sched/rt: Use IPI to trigger RT task
push migration instead of pulling") changed the way to request a pull.
Instead of grabbing the lock of the overloaded CPU's runqueue, it simply
sent an IPI to that CPU to do the work.

Although the IPI logic worked very well in removing the large latency build
up, it still could suffer from a large number of IPIs being sent to a single
CPU. On a 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
when I tested this on a 120 CPU box, with a stress test that had lots of
RT tasks scheduling on all CPUs, it actually triggered the hard lockup
detector! One CPU had so many IPIs sent to it, and due to the restart
mechanism, of if a source run queue changes status, the CPU spent minutes!
processing the IPIs.

Thinking about this further, I realized there's no reason for each run queue
to send its own IPI. As all CPUs with overloaded tasks must be scanned
regardless if there's one or many CPUs lowering their priority, because
there's no current way to find the CPU with the highest priority task that
can schedule to one of these CPUs, there really only needs to be one IPI
being sent around at a time.

This greatly simplifies the code!

The new approach is to have each root domain have its own irq work, as the
rto_mask is per root domain. The root domain has the following fields
attached to it:

  rto_push_work  - the irq work to process each CPU set in rto_mask
  rto_lock   - the lock to protect some of the other rto fields
  rto_loop_start - an atomic that keeps contention down on rto_lock
the first CPU scheduling in a lower priority task
is the one to kick off the process.
  rto_loop_next  - an atomic that gets incremented for each CPU that
schedules in a lower priority task.
  rto_loop   - a variable protected by rto_lock that is used to
compare against rto_loop_next
  rto_cpu- The cpu to send the next IPI to.

When a CPU schedules in a lower priority task and wants to make sure
overloaded CPUs know about it. It increments the rto_loop_next. Then it does
an atomic_inc_return() on rto_loop_start. If the returned value is not "1",
then it does atomic_dec() on rt_loop_start and returns. If the value is "1",
then it will take the rto_lock to synchronize with a possible IPI being sent
around to the overloaded CPUs.

If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
IPI being sent around, or one is about to finish. Then rto_cpu is set to the
first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set
in rto_mask, then there's nothing to be done.

When the CPU receives the IPI, it will first try to push any RT task that is
queued on the CPU but can't run because a higher priority RT task is
currently running on the CPU.

Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
finds one, it simply sends an IPI to that CPU and the process continues.

If there's no more CPUs in the rto_mask, then rto_loop is compared with
rto_loop_next. If they match, everything is done and the process is over. If
they do not match, then a CPU scheduled in a lower priority task as the IPI
was being passed around, and the process needs to start again. The first CPU
in rto_mask is sent the IPI.

This change removes duplication of work in the IPI logic, and greatly
lowers the latency caused by the IPIs. This removed the lockup happening on
the 120 CPU machine. It also simplifies the code tremendously. What else
could anyone ask for?

Signed-off-by: Steven Rostedt (VMware) 
---
 kernel/sched/rt.c   | 300 
 kernel/sched/sched.h|  24 ++--
 kernel/sched/topology.c |   6 +
 3 files changed, 122 insertions(+), 208 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 979b734..fe4022e 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -73,10 +73,6 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
raw_spin_unlock(_b->rt_runtime_lock);
 }
 
-#if defined(CONFIG_SMP) && defined(HAVE_RT_PUSH_IPI)
-static void push_irq_work_func(struct irq_work *work);
-#endif
-
 void init_rt_rq(struct rt_rq *rt_rq)
 {
struct rt_prio_array *array;

Re: [PATCH 1/5] clk: mvebu: Use kcalloc() in of_cpu_clk_setup()

2017-04-21 Thread Stephen Boyd
On 04/19, SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Wed, 19 Apr 2017 20:15:21 +0200
> 
> Multiplications for the size determination of memory allocations
> indicated that array data structures should be processed.
> Thus use the corresponding function "kcalloc".
> 
> This issue was detected by using the Coccinelle software.
> 
> Signed-off-by: Markus Elfring 
> ---

Applied to clk-next

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH 8/8] clk: nomadik: Delete error messages for a failed memory allocation in two functions

2017-04-21 Thread Stephen Boyd
On 04/20, SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Thu, 20 Apr 2017 10:04:00 +0200
> 
> The script "checkpatch.pl" pointed information out like the following.
> 
> WARNING: Possible unnecessary 'out of memory' message
> 
> Thus remove such statements here.
> 
> Link: 
> http://events.linuxfoundation.org/sites/events/files/slides/LCJ16-Refactor_Strings-WSang_0.pdf
> Signed-off-by: Markus Elfring 
> ---

Applied to clk-next

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH 7/8] clk: nomadik: Use seq_puts() in nomadik_src_clk_show()

2017-04-21 Thread Stephen Boyd
On 04/20, SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Thu, 20 Apr 2017 09:45:04 +0200
> 
> A string which did not contain a data format specification should be put
> into a sequence. Thus use the corresponding function "seq_puts".
> 
> This issue was detected by using the Coccinelle software.
> 
> Signed-off-by: Markus Elfring 
> ---

Applied to clk-next

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH 6/8] clk: Improve a size determination in two functions

2017-04-21 Thread Stephen Boyd
On 04/20, SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Thu, 20 Apr 2017 09:30:52 +0200
> 
> Replace the specification of two data structures by pointer dereferences
> as the parameter for the operator "sizeof" to make the corresponding size
> determination a bit safer according to the Linux coding style convention.
> 
> Signed-off-by: Markus Elfring 
> ---

Applied to clk-next

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH 4/8] clk: Replace four seq_printf() calls by seq_putc()

2017-04-21 Thread Stephen Boyd
On 04/20, SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Thu, 20 Apr 2017 08:45:43 +0200
> 
> Four single characters should be put into a sequence.
> Thus use the corresponding function "seq_putc".
> 
> This issue was detected by using the Coccinelle software.
> 
> Signed-off-by: Markus Elfring 
> ---

Applied to clk-next

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


[PATCH v2] nfs/filelayout: fix NULL pointer dereference in fl_pnfs_update_layout()

2017-04-21 Thread Artem Savkov
Calling pnfs_put_lseg() on an IS_ERR pointer results in a NULL pointer
dereference like the one below. At the same time, the check of the return
value of filelayout_check_deviceid() sets lseg to an error pointer, but
does not free it first.

[ 3000.636161] BUG: unable to handle kernel NULL pointer dereference at 
003c
[ 3000.636970] IP: pnfs_put_lseg+0x29/0x100 [nfsv4]
[ 3000.637420] PGD 4f23b067
[ 3000.637421] PUD 4a0f4067
[ 3000.637679] PMD 0
[ 3000.637937]
[ 3000.638287] Oops:  [#1] SMP
[ 3000.638591] Modules linked in: nfs_layout_nfsv41_files nfsv3 nfnetlink_queue 
nfnetlink_log nfnetlink bluetooth rfkill rpcsec_gss_krb5 nfsv4 nfs fscache 
binfmt_misc arc4 md4 nls_utf8 cifs ccm dns_resolver rpcrdma ib_isert 
iscsi_target_mod ib_iser rdma_cm iw_cm libiscsi scsi_transport_iscsi ib_srpt 
target_core_mod ib_srp scsi_transport_srp ib_ipoib ib_ucm ib_uverbs ib_umad 
ib_cm ib_core nls_koi8_u nls_cp932 ts_kmp nf_conntrack_ipv4 nf_defrag_ipv4 
nf_conntrack crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr 
virtio_balloon ppdev virtio_rng parport_pc i2c_piix4 parport acpi_cpufreq nfsd 
auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c ata_generic pata_acpi 
virtio_blk virtio_net cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt 
fb_sys_fops crc32c_intel ata_piix ttm libata drm serio_raw
[ 3000.645245]  i2c_core virtio_pci virtio_ring virtio floppy dm_mirror 
dm_region_hash dm_log dm_mod [last unloaded: xt_u32]
[ 3000.646360] CPU: 1 PID: 26402 Comm: date Not tainted 
4.11.0-rc7.1.el7.test.x86_64 #1
[ 3000.647092] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 3000.647638] task: 8800415ada00 task.stack: c9ff
[ 3000.648207] RIP: 0010:pnfs_put_lseg+0x29/0x100 [nfsv4]
[ 3000.648696] RSP: 0018:c9ff39b8 EFLAGS: 00010246
[ 3000.649193] RAX:  RBX: fff4 RCX: 000d43be
[ 3000.649859] RDX: 000d43bd RSI:  RDI: fff4
[ 3000.650530] RBP: c9ff39d8 R08: 0001e320 R09: a05c35ce
[ 3000.651203] R10: 88007fd1e320 R11: ea0001283d80 R12: 01400040
[ 3000.651875] R13: 88004f77d9f0 R14: c9ff3cd8 R15: 8800417ade00
[ 3000.652546] FS:  7fac4d5cd740() GS:88007fd0() 
knlGS:
[ 3000.653304] CS:  0010 DS:  ES:  CR0: 80050033
[ 3000.653849] CR2: 003c CR3: 4f08 CR4: 000406e0
[ 3000.654527] Call Trace:
[ 3000.654771]  fl_pnfs_update_layout.constprop.20+0x10c/0x150 
[nfs_layout_nfsv41_files]
[ 3000.655505]  filelayout_pg_init_write+0x21d/0x270 [nfs_layout_nfsv41_files]
[ 3000.656195]  __nfs_pageio_add_request+0x11c/0x490 [nfs]
[ 3000.656698]  nfs_pageio_add_request+0xac/0x260 [nfs]
[ 3000.657180]  nfs_do_writepage+0x109/0x2e0 [nfs]
[ 3000.657616]  nfs_writepages_callback+0x16/0x30 [nfs]
[ 3000.658096]  write_cache_pages+0x26f/0x510
[ 3000.658495]  ? nfs_do_writepage+0x2e0/0x2e0 [nfs]
[ 3000.658946]  ? _raw_spin_unlock_bh+0x1e/0x20
[ 3000.659357]  ? wb_wakeup_delayed+0x5f/0x70
[ 3000.659748]  ? __mark_inode_dirty+0x2eb/0x360
[ 3000.660170]  nfs_writepages+0x84/0xd0 [nfs]
[ 3000.660575]  ? nfs_updatepage+0x571/0xb70 [nfs]
[ 3000.661012]  do_writepages+0x1e/0x30
[ 3000.661358]  __filemap_fdatawrite_range+0xc6/0x100
[ 3000.661819]  filemap_write_and_wait_range+0x41/0x90
[ 3000.662292]  nfs_file_fsync+0x34/0x1f0 [nfs]
[ 3000.662704]  vfs_fsync_range+0x3d/0xb0
[ 3000.663065]  vfs_fsync+0x1c/0x20
[ 3000.663385]  nfs4_file_flush+0x57/0x80 [nfsv4]
[ 3000.663813]  filp_close+0x2f/0x70
[ 3000.664132]  __close_fd+0x9a/0xc0
[ 3000.664453]  SyS_close+0x23/0x50
[ 3000.664785]  do_syscall_64+0x67/0x180
[ 3000.665162]  entry_SYSCALL64_slow_path+0x25/0x25
[ 3000.665600] RIP: 0033:0x7fac4d0e1e90
[ 3000.665946] RSP: 002b:7ffd54e90c88 EFLAGS: 0246 ORIG_RAX: 
0003
[ 3000.79] RAX: ffda RBX: 7fac4d3b5400 RCX: 7fac4d0e1e90
[ 3000.667349] RDX:  RSI: 7fac4d5d9000 RDI: 0001
[ 3000.668031] RBP:  R08: 7fac4d3b6a00 R09: 7fac4d5cd740
[ 3000.668709] R10: 7ffd54e909e0 R11: 0246 R12: 
[ 3000.669385] R13: 7fac4d3b5e80 R14:  R15: 
[ 3000.670061] Code: 00 00 66 66 66 66 90 55 48 85 ff 48 89 e5 41 56 41 55 41 
54 53 48 89 fb 0f 84 97 00 00 00 f6 05 16 8f bc ff 10 0f 85 a6 00 00 00 <4c> 8b 
63 48 48 8d 7b 38 49 8b 84 24 90 00 00 00 4c 8d a8 88 00
[ 3000.671831] RIP: pnfs_put_lseg+0x29/0x100 [nfsv4] RSP: c9ff39b8
[ 3000.672462] CR2: 003c
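
For reference, a minimal user-space sketch of the ERR_PTR()/IS_ERR()
convention involved here (illustrative only; the macros mimic
include/linux/err.h and the structures and helpers are stand-ins, not the
NFS code):

#include <stdio.h>

#define MAX_ERRNO       4095
#define ERR_PTR(err)    ((void *)(long)(err))
#define IS_ERR(ptr)     ((unsigned long)(ptr) >= (unsigned long)-MAX_ERRNO)
#define PTR_ERR(ptr)    ((long)(ptr))

struct lseg { int refcount; };

/* Dereferences its argument: passing an ERR_PTR() value here oopses. */
static void put_lseg(struct lseg *l)
{
        if (--l->refcount == 0)
                printf("freeing lseg\n");
}

static struct lseg *lookup_lseg(int fail)
{
        static struct lseg good = { .refcount = 1 };

        return fail ? ERR_PTR(-12) : &good;     /* -12 == -ENOMEM */
}

int main(void)
{
        struct lseg *lseg = lookup_lseg(1);

        if (IS_ERR(lseg)) {
                /* report the error; never hand the encoded pointer to "put" */
                printf("layout lookup failed: %ld\n", PTR_ERR(lseg));
                return 1;
        }
        put_lseg(lseg);
        return 0;
}

In the driver the same rule applies: drop the reference before overwriting
lseg with an error value, and never hand an IS_ERR() value to pnfs_put_lseg().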

Signed-off-by: Artem Savkov 
---
 fs/nfs/filelayout/filelayout.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/filelayout/filelayout.c b/fs/nfs/filelayout/filelayout.c
index acd30ba..fb39fd8 100644
--- a/fs/nfs/filelayout/filelayout.c
+++ b/fs/nfs/filelayout/filelayout.c
@@ -921,11 +921,11 @@ fl_pnfs_update_layout(struct inode *ino,
fl = FILELAYOUT_LSEG(lseg);
 
status = 

Re: [PATCH 2/8] clk: si5351: Delete an error message for a failed memory allocation in si5351_i2c_probe()

2017-04-21 Thread Stephen Boyd
On 04/20, SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Thu, 20 Apr 2017 07:34:54 +0200
> 
> The script "checkpatch.pl" pointed information out like the following.
> 
> * CHECK: Comparison to NULL could be written "!drvdata"
> 
>   Thus adjust this expression.
> 
> 
> * WARNING: Possible unnecessary 'out of memory' message
> 
>   Thus remove such a statement here.
> 
>   Link: 
> http://events.linuxfoundation.org/sites/events/files/slides/LCJ16-Refactor_Strings-WSang_0.pdf
> 
> Signed-off-by: Markus Elfring 
> ---

Applied to clk-next

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH 1/8] clk: si5351: Use devm_kcalloc() in si5351_i2c_probe()

2017-04-21 Thread Stephen Boyd
On 04/20, SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Wed, 19 Apr 2017 22:37:30 +0200
> 
> Multiplications for the size determination of memory allocations
> indicated that array data structures should be processed.
> Thus use the corresponding function "devm_kcalloc".
> 
> This issue was detected by using the Coccinelle software.
> 
> Signed-off-by: Markus Elfring 
> ---

Applied to clk-next

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH] clk: tegra: fix SS control on PLL enable/disable

2017-04-21 Thread Stephen Boyd
On 04/20, Peter De Schrijver wrote:
> PLL SS was only controlled when setting the PLL rate, not when the PLL
> itself is enabled or disabled. This means that if the PLL rate was set
> before the PLL is enabled, SS will not be enabled, even when configured.
> 
> Signed-off-by: Peter De Schrijver 

Fixes tag? Or this isn't a problem right now, just future fix?

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH 2/2] clk: ti: fix building without legacy omap3

2017-04-21 Thread Stephen Boyd
On 04/19, Arnd Bergmann wrote:
> When CONFIG_ATAGS or CONFIG_OMAP3 is disabled, we get a build error:
> 
> In file included from include/linux/clk-provider.h:15:0,
>  from drivers/clk/ti/clk.c:19:
> drivers/clk/ti/clk.c: In function 'ti_clk_add_aliases':
> drivers/clk/ti/clk.c:438:29: error: 'simple_clk_match_table' undeclared 
> (first use in this function); did you mean 'simple_attr_write'?
> 
> Moving the match table down fixes it.
> 
> Fixes: c17435c56bb1 ("clk: ti: add API for creating aliases automatically for 
> simple clock types")
> Signed-off-by: Arnd Bergmann 
> ---

Applied to clk-next

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

