date:20171001

Re: [PATCH 01/12] mmc: dt-bindings: update Mediatek MMC bindings

2017-10-01 Thread Ulf Hansson

[...]

>> > >> >  Required properties:
>> > >> > -- compatible: Should be "mediatek,mt8173-mmc","mediatek,mt8135-mmc"
>> > >> > +- compatible: value should be either of the following.
>> > >> > +   "mediatek,mt8135-mmc": for mmc host ip compatible with mt8135
>> > >> > +   "mediatek,mt8173-mmc": for mmc host ip compatible with mt8173
>> > >> > +   "mediatek,mt2701-mmc": for mmc host ip compatible with mt2701
>> > >> > +   "mediatek,mt2712-mmc": for mmc host ip compatible with mt2712
>> > >> > +- reg: physical base address of the controller and length
>> > >> >  - interrupts: Should contain MSDC interrupt number
>> > >> > -- clocks: MSDC source clock, HCLK
>> > >> > -- clock-names: "source", "hclk"
>> > >> > +- clocks: MSDC source clock, HCLK, source_cg
>> > >> > +- clock-names: "source", "hclk", "source_cg"
>> > >>
>> > >> All chips support source_cg? That's not backwards compatible for
>> > >> existing compatible strings if the driver requires it.
>> > > Not all chips support source_cg. Chips that do not support it do not
>> > > need source_cg here, and the driver will parse the node to find out
>> > > whether the current chip supports it.
>> >
>> > In such a case you must not add a required binding for it. I think
>> > that is what Rob is trying to point out for you.
>> >
>> > [...]
>> >
>> > Kind regards
>> > Uffe
>> The source_cg clock is required (MUST) on MT2712 and future SoCs, but
>> earlier SoCs do not have it, so I put it under the required properties
>> and let the driver handle it.

Then you must explain that in the binding...

On 29 September 2017 at 03:56, Chaotian Jing wrote:
> On Wed, 2017-09-27 at 09:18 +0800, Chaotian Jing wrote:
>> On Wed, 2017-09-27 at 00:33 +0200, Ulf Hansson wrote:
>> > On 14 September 2017 at 04:10, Chaotian Jing wrote:
>> > > On Wed, 2017-09-13 at 09:10 -0500, Rob Herring wrote:
>> > >> On Tue, Sep 12, 2017 at 05:07:41PM +0800, Chaotian Jing wrote:
>> > >> > Change the compatible for support of multi-platform
>> > >> > Add description for reg
>> > >> > Add description for source_cg
>> > >> > Add description for mediatek,latch-ck
>> > >>
>> > >> This is at least the 3rd patch with exactly the same vague subject.
>> > >> Please make the subject somewhat unique.
>> > >>
>> > > Thx, will change the subject at next version
>> > >> >
>> > >> > Signed-off-by: Chaotian Jing 
>> > >> > ---
>> > >> >  Documentation/devicetree/bindings/mmc/mtk-sd.txt | 13 ++---
>> > >> >  1 file changed, 10 insertions(+), 3 deletions(-)
>> > >> >
>> > >> > diff --git a/Documentation/devicetree/bindings/mmc/mtk-sd.txt b/Documentation/devicetree/bindings/mmc/mtk-sd.txt
>> > >> > index 4182ea3..405cd06 100644
>> > >> > --- a/Documentation/devicetree/bindings/mmc/mtk-sd.txt
>> > >> > +++ b/Documentation/devicetree/bindings/mmc/mtk-sd.txt
>> > >> > @@ -7,10 +7,15 @@ This file documents differences between the core properties in mmc.txt
>> > >> >  and the properties used by the msdc driver.
>> > >> >
>
> Any other comments about it? Must it still not be added as a required
> binding? If it becomes an optional binding, how should it be added, given
> that "clocks" and "clock-names" cannot be duplicated in one node?
>
>

I suggest you keep the description of the new clock name as part of the
"Required properties:" header. However, after each clock name,
explicitly state when the clock is required or optional, along the
lines of what sdhci-msm does:
Documentation/devicetree/bindings/mmc/sdhci-msm.txt

Kind regards
Uffe

[PATCH v4 06/10] arm64: kexec_file: create purgatory

2017-10-01 Thread AKASHI Takahiro

This is a basic purgatory, or a kind of glue code between the two kernels,
for arm64.

Since the generic kexec code assumes that purgatory is a relocatable (not
executable) object, arch_kexec_apply_relocations_add() is required in
general. Arm64's purgatory, however, is simple assembly in which all
references can be resolved locally, so no re-linking is needed here.

Please note that even though we don't support a digest check in purgatory,
we need purgatory_sha_regions and purgatory_sha256_digest because they are
referenced by the generic kexec code.

Signed-off-by: AKASHI Takahiro 
Cc: Catalin Marinas 
Cc: Will Deacon 
---
 arch/arm64/Makefile   |  1 +
 arch/arm64/purgatory/Makefile | 24 +++
 arch/arm64/purgatory/entry.S  | 55 +++
 3 files changed, 80 insertions(+)
 create mode 100644 arch/arm64/purgatory/Makefile
 create mode 100644 arch/arm64/purgatory/entry.S

diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
index 939b310913cf..cf39ec3baf5a 100644
--- a/arch/arm64/Makefile
+++ b/arch/arm64/Makefile
@@ -110,6 +110,7 @@ core-$(CONFIG_XEN) += arch/arm64/xen/
 core-$(CONFIG_CRYPTO) += arch/arm64/crypto/
 libs-y := arch/arm64/lib/ $(libs-y)
 core-$(CONFIG_EFI_STUB) += $(objtree)/drivers/firmware/efi/libstub/lib.a
+core-$(CONFIG_KEXEC_FILE) += arch/arm64/purgatory/
 
 # Default target when executing plain make
 boot   := arch/arm64/boot
diff --git a/arch/arm64/purgatory/Makefile b/arch/arm64/purgatory/Makefile
new file mode 100644
index ..c2127a2cbd51
--- /dev/null
+++ b/arch/arm64/purgatory/Makefile
@@ -0,0 +1,24 @@
+OBJECT_FILES_NON_STANDARD := y
+
+purgatory-y := entry.o
+
+targets += $(purgatory-y)
+PURGATORY_OBJS = $(addprefix $(obj)/,$(purgatory-y))
+
+LDFLAGS_purgatory.ro := -e purgatory_start -r --no-undefined \
+   -nostdlib -z nodefaultlib
+targets += purgatory.ro
+
+$(obj)/purgatory.ro: $(PURGATORY_OBJS) FORCE
+   $(call if_changed,ld)
+
+targets += kexec_purgatory.c
+
+CMD_BIN2C = $(objtree)/scripts/basic/bin2c
+quiet_cmd_bin2c = BIN2C $@
+   cmd_bin2c = $(CMD_BIN2C) kexec_purgatory < $< > $@
+
+$(obj)/kexec_purgatory.c: $(obj)/purgatory.ro FORCE
+   $(call if_changed,bin2c)
+
+obj-${CONFIG_KEXEC_FILE}   += kexec_purgatory.o
diff --git a/arch/arm64/purgatory/entry.S b/arch/arm64/purgatory/entry.S
new file mode 100644
index ..fe6e968076db
--- /dev/null
+++ b/arch/arm64/purgatory/entry.S
@@ -0,0 +1,55 @@
+/*
+ * kexec core purgatory
+ */
+#include 
+#include 
+
+#define SHA256_DIGEST_SIZE 32 /* defined in crypto/sha.h */
+
+.text
+
+ENTRY(purgatory_start)
+   /* Start new image. */
+   ldr x17, __kernel_entry
+   ldr x0, __dtb_addr
+   mov x1, xzr
+   mov x2, xzr
+   mov x3, xzr
+   br  x17
+END(purgatory_start)
+
+/*
+ * data section:
+ * kernel_entry and dtb_addr are global but also labelled as local,
+ * "__xxx:", to avoid unwanted re-linking.
+ *
+ * purgatory_sha_regions and purgatory_sha256_digest are referenced
+ * by kexec generic code and so must exist, but not actually used
+ * here because hash check is not that useful in purgatory.
+ */
+.align 3
+
+.globl kernel_entry
+kernel_entry:
+__kernel_entry:
+   .quad   0
+END(kernel_entry)
+
+.globl dtb_addr
+dtb_addr:
+__dtb_addr:
+   .quad   0
+END(dtb_addr)
+
+.globl purgatory_sha_regions
+purgatory_sha_regions:
+   .rept   KEXEC_SEGMENT_MAX
+   .quad   0
+   .quad   0
+   .endr
+END(purgatory_sha_regions)
+
+.globl purgatory_sha256_digest
+purgatory_sha256_digest:
+.skip   SHA256_DIGEST_SIZE
+END(purgatory_sha256_digest)
-- 
2.14.1
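
The hand-off performed by purgatory_start can be pictured in C. This is
only a sketch under assumptions: kernel_entry_fn, purgatory_start_sketch()
and fake_kernel() are hypothetical names introduced for illustration, and
the real purgatory is the assembly in the patch, which runs without a C
runtime and never returns. It shows the register contract visible in the
assembly: x0 carries the device-tree address and x1..x3 are zero.

```c
#include <stdint.h>

/* Hypothetical type: an entry point receiving the values of x0..x3. */
typedef void (*kernel_entry_fn)(uint64_t x0, uint64_t x1,
				uint64_t x2, uint64_t x3);

/* Patched by the loader before the jump; names mirror the patch's symbols. */
static kernel_entry_fn kernel_entry;
static uint64_t dtb_addr;

/* C rendering of purgatory_start: x0 = dtb address, x1..x3 = 0. */
static void purgatory_start_sketch(void)
{
	kernel_entry(dtb_addr, 0, 0, 0);
}

/* Stand-in for the next kernel: records the arguments it was handed. */
static uint64_t seen[4];

static void fake_kernel(uint64_t x0, uint64_t x1, uint64_t x2, uint64_t x3)
{
	seen[0] = x0;
	seen[1] = x1;
	seen[2] = x2;
	seen[3] = x3;
}
```

Pointing kernel_entry at fake_kernel and calling purgatory_start_sketch()
leaves the device-tree address in seen[0] and zeros in the other slots.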

[PATCH v4 08/10] arm64: kexec_file: set up for crash dump adding elf core header

2017-10-01 Thread AKASHI Takahiro

load_crashdump_segments() creates and loads a memory segment holding the
ELF core header for the crash dump.

"linux,usable-memory-range" and "linux,elfcorehdr" will be added to the 2nd
kernel's device-tree blob. The logic of this code is also from kexec-tools.

Signed-off-by: AKASHI Takahiro 
Cc: Catalin Marinas 
Cc: Will Deacon 
---
 arch/arm64/include/asm/kexec.h |   5 ++
 arch/arm64/kernel/machine_kexec_file.c | 149 +
 kernel/kexec_file.c|   2 +-
 3 files changed, 155 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index 2fadd3cbf3af..edb702e64a8a 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -98,6 +98,10 @@ static inline void crash_post_resume(void) {}
 
 struct kimage_arch {
void *dtb_buf;
+   /* Core ELF header buffer */
+   void *elf_headers;
+   unsigned long elf_headers_sz;
+   unsigned long elf_load_addr;
 };
 
 struct kimage;
@@ -113,6 +117,7 @@ extern int load_other_segments(struct kimage *image,
unsigned long kernel_load_addr,
char *initrd, unsigned long initrd_len,
char *cmdline, unsigned long cmdline_len);
+extern int load_crashdump_segments(struct kimage *image);
 #endif
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/arm64/kernel/machine_kexec_file.c b/arch/arm64/kernel/machine_kexec_file.c
index 8a09d89f6266..1d30b4773af5 100644
--- a/arch/arm64/kernel/machine_kexec_file.c
+++ b/arch/arm64/kernel/machine_kexec_file.c
@@ -32,6 +32,10 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image)
vfree(image->arch.dtb_buf);
image->arch.dtb_buf = NULL;
 
+   vfree(image->arch.elf_headers);
+   image->arch.elf_headers = NULL;
+   image->arch.elf_headers_sz = 0;
+
return _kexec_kernel_post_load_cleanup(image);
 }
 
@@ -48,6 +52,77 @@ int arch_kexec_walk_mem(struct kexec_buf *kbuf, int (*func)(u64, u64, void *))
return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
 }
 
+static int __init arch_kexec_file_init(void)
+{
+   /* Those values are used later on loading the kernel */
+   __dt_root_addr_cells = dt_root_addr_cells;
+   __dt_root_size_cells = dt_root_size_cells;
+
+   return 0;
+}
+late_initcall(arch_kexec_file_init);
+
+#define FDT_ALIGN(x, a)(((x) + (a) - 1) & ~((a) - 1))
+#define FDT_TAGALIGN(x)(FDT_ALIGN((x), FDT_TAGSIZE))
+
+static int fdt_prop_len(const char *prop_name, int len)
+{
+   return (strlen(prop_name) + 1) +
+   sizeof(struct fdt_property) +
+   FDT_TAGALIGN(len);
+}
+
+static bool cells_size_fitted(unsigned long base, unsigned long size)
+{
+   /* if *_cells >= 2, cells can hold 64-bit values anyway */
+   if ((__dt_root_addr_cells == 1) && (base >= (1ULL << 32)))
+   return false;
+
+   if ((__dt_root_size_cells == 1) && (size >= (1ULL << 32)))
+   return false;
+
+   return true;
+}
+
+static void fill_property(void *buf, u64 val64, int cells)
+{
+   u32 val32;
+   int i;
+
+   if (cells == 1) {
+   val32 = cpu_to_fdt32((u32)val64);
+   memcpy(buf, &val32, sizeof(val32));
+   } else {
+   for (i = 0; i < (cells * sizeof(u32) - sizeof(u64)); i++)
+   *(char *)buf++ = 0;
+
+   val64 = cpu_to_fdt64(val64);
+   memcpy(buf, &val64, sizeof(val64));
+   }
+}
+
+static int fdt_setprop_range(void *fdt, int nodeoffset, const char *name,
+   unsigned long addr, unsigned long size)
+{
+   u64 range[2];
+   void *prop;
+   size_t buf_size;
+   int result;
+
+   prop = range;
+   buf_size = (__dt_root_addr_cells + __dt_root_size_cells) * sizeof(u32);
+
+   fill_property(prop, addr, __dt_root_addr_cells);
+   prop += __dt_root_addr_cells * sizeof(u32);
+
+   fill_property(prop, size, __dt_root_size_cells);
+   prop += __dt_root_size_cells * sizeof(u32);
+
+   result = fdt_setprop(fdt, nodeoffset, name, range, buf_size);
+
+   return result;
+}
+
 int setup_dtb(struct kimage *image,
unsigned long initrd_load_addr, unsigned long initrd_len,
char *cmdline, unsigned long cmdline_len,
@@ -60,10 +135,26 @@ int setup_dtb(struct kimage *image,
int range_len;
int ret;
 
+   /* check ranges against root's #address-cells and #size-cells */
+   if (image->type == KEXEC_TYPE_CRASH &&
+   (!cells_size_fitted(image->arch.elf_load_addr,
+   image->arch.elf_headers_sz) ||
+!cells_size_fitted(crashk_res.start,
+   crashk_res.end - crashk_res.start + 1))) {
+   pr_err("Crash memory region doesn't fit into DT's root cell sizes.\n");
+   ret = -EINVAL;
+   goto out_err;
+   }
+
/* duplica
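
The cell handling in cells_size_fitted(), fill_property() and
fdt_setprop_range() can be tried outside the kernel. The following is a
minimal user-space sketch, not the patch's code: put_be32(),
fits_in_cells(), encode_cells() and encode_range() are hypothetical names,
and the big-endian stores are written out by hand instead of using
cpu_to_fdt32()/cpu_to_fdt64(). It assumes the caller supplies a buffer
large enough for the requested number of cells.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value big-endian, the byte order used inside an FDT. */
static void put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

/* One cell holds 32 bits; two or more cells can hold any 64-bit value. */
static bool fits_in_cells(uint64_t val, int cells)
{
	return cells >= 2 || val < (1ULL << 32);
}

/* Encode val into `cells` cells, zero-padded on the left; returns bytes used. */
static size_t encode_cells(uint8_t *buf, uint64_t val, int cells)
{
	size_t len = (size_t)cells * 4;

	assert(cells >= 1 && fits_in_cells(val, cells));
	memset(buf, 0, len);
	if (cells >= 2)
		put_be32(buf + len - 8, (uint32_t)(val >> 32));
	put_be32(buf + len - 4, (uint32_t)val);
	return len;
}

/* Build an <address size> pair as it would appear in a range property. */
static size_t encode_range(uint8_t *buf, uint64_t addr, int addr_cells,
			   uint64_t size, int size_cells)
{
	size_t n = encode_cells(buf, addr, addr_cells);

	return n + encode_cells(buf + n, size, size_cells);
}
```

With two address cells and one size cell, a range at 0x80000000 of length
0x1000 occupies 12 bytes: eight for the address and four for the size.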

[PATCH v4 03/10] kexec_file: factor out arch_kexec_kernel_*() from x86, powerpc

2017-10-01 Thread AKASHI Takahiro

arch_kexec_kernel_*() and arch_kimage_file_post_load_cleanup() would now be
duplicated across several architectures, so let's factor them out.

Signed-off-by: AKASHI Takahiro 
Cc: Dave Young 
Cc: Vivek Goyal 
Cc: Baoquan He 
Cc: Michael Ellerman 
Cc: Thiago Jung Bauermann 
---
 arch/powerpc/kernel/kexec_elf_64.c  |  2 +-
 arch/powerpc/kernel/machine_kexec_file_64.c | 36 ++---
 arch/x86/kernel/kexec-bzimage64.c   |  2 +-
 arch/x86/kernel/machine_kexec_64.c  | 45 +
 include/linux/kexec.h   | 15 +++
 kernel/kexec_file.c | 63 ++---
 6 files changed, 73 insertions(+), 90 deletions(-)

diff --git a/arch/powerpc/kernel/kexec_elf_64.c b/arch/powerpc/kernel/kexec_elf_64.c
index 9a42309b091a..6c78c11c7faf 100644
--- a/arch/powerpc/kernel/kexec_elf_64.c
+++ b/arch/powerpc/kernel/kexec_elf_64.c
@@ -657,7 +657,7 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
return ret ? ERR_PTR(ret) : fdt;
 }
 
-struct kexec_file_ops kexec_elf64_ops = {
+const struct kexec_file_ops kexec_elf64_ops = {
.probe = elf64_probe,
.load = elf64_load,
 };
diff --git a/arch/powerpc/kernel/machine_kexec_file_64.c b/arch/powerpc/kernel/machine_kexec_file_64.c
index 992c0d258e5d..e7ce78857f0b 100644
--- a/arch/powerpc/kernel/machine_kexec_file_64.c
+++ b/arch/powerpc/kernel/machine_kexec_file_64.c
@@ -31,8 +31,9 @@
 
 #define SLAVE_CODE_SIZE256
 
-static struct kexec_file_ops *kexec_file_loaders[] = {
+const struct kexec_file_ops * const kexec_file_loaders[] = {
&kexec_elf64_ops,
+   NULL
 };
 
 int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
@@ -45,38 +46,7 @@ int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
if (image->type == KEXEC_TYPE_CRASH)
return -ENOTSUPP;
 
-   for (i = 0; i < ARRAY_SIZE(kexec_file_loaders); i++) {
-   fops = kexec_file_loaders[i];
-   if (!fops || !fops->probe)
-   continue;
-
-   ret = fops->probe(buf, buf_len);
-   if (!ret) {
-   image->fops = fops;
-   return ret;
-   }
-   }
-
-   return ret;
-}
-
-void *arch_kexec_kernel_image_load(struct kimage *image)
-{
-   if (!image->fops || !image->fops->load)
-   return ERR_PTR(-ENOEXEC);
-
-   return image->fops->load(image, image->kernel_buf,
-image->kernel_buf_len, image->initrd_buf,
-image->initrd_buf_len, image->cmdline_buf,
-image->cmdline_buf_len);
-}
-
-int arch_kimage_file_post_load_cleanup(struct kimage *image)
-{
-   if (!image->fops || !image->fops->cleanup)
-   return 0;
-
-   return image->fops->cleanup(image->image_loader_data);
+   return _kexec_kernel_image_probe(image, buf, buf_len);
 }
 
 /**
diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index fb095ba0c02f..705654776c0c 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -538,7 +538,7 @@ static int bzImage64_verify_sig(const char *kernel, unsigned long kernel_len)
 }
 #endif
 
-struct kexec_file_ops kexec_bzImage64_ops = {
+const struct kexec_file_ops kexec_bzImage64_ops = {
.probe = bzImage64_probe,
.load = bzImage64_load,
.cleanup = bzImage64_cleanup,
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 1f790cf9d38f..2cdd29d64181 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -30,8 +30,9 @@
 #include 
 
 #ifdef CONFIG_KEXEC_FILE
-static struct kexec_file_ops *kexec_file_loaders[] = {
+const struct kexec_file_ops * const kexec_file_loaders[] = {
&kexec_bzImage64_ops,
+   NULL
 };
 #endif
 
@@ -363,27 +364,6 @@ void arch_crash_save_vmcoreinfo(void)
 /* arch-dependent functionality related to kexec file-based syscall */
 
 #ifdef CONFIG_KEXEC_FILE
-int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
- unsigned long buf_len)
-{
-   int i, ret = -ENOEXEC;
-   struct kexec_file_ops *fops;
-
-   for (i = 0; i < ARRAY_SIZE(kexec_file_loaders); i++) {
-   fops = kexec_file_loaders[i];
-   if (!fops || !fops->probe)
-   continue;
-
-   ret = fops->probe(buf, buf_len);
-   if (!ret) {
-   image->fops = fops;
-   return ret;
-   }
-   }
-
-   return ret;
-}
-
 void *arch_kexec_kernel_image_load(struct kimage *image)
 {
vfree(image->arch.elf_headers);
@@ -398,27 +378,6 @@ void *arch_kexec_kernel_image_load(struct kimage *image)
 image->cmdline_buf_len);
 }
 
-int arch
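
The hunks above make each architecture's kexec_file_loaders[] const and
NULL-terminated, which lets shared code walk the table without knowing its
length. A minimal user-space sketch of such a walk follows. It is an
illustration under assumptions, not the factored-out kernel function:
struct image_ops, probe_elf(), probe_mz() and probe_image() are
hypothetical stand-ins for struct kexec_file_ops and the real probes.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct kexec_file_ops. */
struct image_ops {
	const char *name;
	int (*probe)(const char *buf, unsigned long len);
};

/* Each probe returns 0 if it recognises the image, -ENOEXEC otherwise. */
static int probe_elf(const char *buf, unsigned long len)
{
	return (len >= 4 && memcmp(buf, "\177ELF", 4) == 0) ? 0 : -ENOEXEC;
}

static int probe_mz(const char *buf, unsigned long len)
{
	return (len >= 2 && memcmp(buf, "MZ", 2) == 0) ? 0 : -ENOEXEC;
}

static const struct image_ops elf_ops = { "elf", probe_elf };
static const struct image_ops mz_ops = { "mz", probe_mz };

/* NULL-terminated, so the walker needs no ARRAY_SIZE(). */
static const struct image_ops * const loaders[] = {
	&elf_ops,
	&mz_ops,
	NULL
};

/* Return the first loader whose probe accepts the buffer, or NULL. */
static const struct image_ops *probe_image(const char *buf, unsigned long len)
{
	const struct image_ops * const *ops;

	for (ops = loaders; *ops; ops++) {
		if ((*ops)->probe && (*ops)->probe(buf, len) == 0)
			return *ops;
	}
	return NULL;
}
```

An architecture with no loaders yet can still provide a table holding only
NULL, in which case the walk simply finds nothing.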

[PATCH v4 07/10] arm64: kexec_file: load initrd, device-tree and purgatory segments

2017-10-01 Thread AKASHI Takahiro

load_other_segments() sets up and adds all the necessary memory segments
other than the kernel: initrd, device-tree blob and purgatory.
Most of the code was borrowed from kexec-tools' counterpart.

arch_kimage_file_post_load_cleanup() is meant to free arm64-specific data
allocated for loading the kernel.

Signed-off-by: AKASHI Takahiro 
Cc: Catalin Marinas 
Cc: Will Deacon 
---
 arch/arm64/include/asm/kexec.h |  22 
 arch/arm64/kernel/Makefile |   3 +-
 arch/arm64/kernel/machine_kexec_file.c | 213 +
 3 files changed, 237 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm64/kernel/machine_kexec_file.c

diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index e17f0529a882..2fadd3cbf3af 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -93,6 +93,28 @@ static inline void crash_prepare_suspend(void) {}
 static inline void crash_post_resume(void) {}
 #endif
 
+#ifdef CONFIG_KEXEC_FILE
+#define ARCH_HAS_KIMAGE_ARCH
+
+struct kimage_arch {
+   void *dtb_buf;
+};
+
+struct kimage;
+
+#define arch_kimage_file_post_load_cleanup arch_kimage_file_post_load_cleanup
+extern int arch_kimage_file_post_load_cleanup(struct kimage *image);
+
+extern int setup_dtb(struct kimage *image,
+   unsigned long initrd_load_addr, unsigned long initrd_len,
+   char *cmdline, unsigned long cmdline_len,
+   char **dtb_buf, size_t *dtb_buf_len);
+extern int load_other_segments(struct kimage *image,
+   unsigned long kernel_load_addr,
+   char *initrd, unsigned long initrd_len,
+   char *cmdline, unsigned long cmdline_len);
+#endif
+
 #endif /* __ASSEMBLY__ */
 
 #endif
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index f2b4e816b6de..5df003d6157c 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -48,8 +48,9 @@ arm64-obj-$(CONFIG_ARM64_ACPI_PARKING_PROTOCOL)   += acpi_parking_protocol.o
 arm64-obj-$(CONFIG_PARAVIRT)   += paravirt.o
 arm64-obj-$(CONFIG_RANDOMIZE_BASE) += kaslr.o
 arm64-obj-$(CONFIG_HIBERNATION)+= hibernate.o hibernate-asm.o
-arm64-obj-$(CONFIG_KEXEC)  += machine_kexec.o relocate_kernel.o \
+arm64-obj-$(CONFIG_KEXEC_CORE) += machine_kexec.o relocate_kernel.o \
   cpu-reset.o
+arm64-obj-$(CONFIG_KEXEC_FILE) += machine_kexec_file.o
 arm64-obj-$(CONFIG_ARM64_RELOC_TEST)   += arm64-reloc-test.o
 arm64-reloc-test-y := reloc_test_core.o reloc_test_syms.o
 arm64-obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
diff --git a/arch/arm64/kernel/machine_kexec_file.c b/arch/arm64/kernel/machine_kexec_file.c
new file mode 100644
index ..8a09d89f6266
--- /dev/null
+++ b/arch/arm64/kernel/machine_kexec_file.c
@@ -0,0 +1,213 @@
+/*
+ * kexec_file for arm64
+ *
+ * Copyright (C) 2017 Linaro Limited
+ * Author: AKASHI Takahiro 
+ *
+ * Most code is derived from arm64 port of kexec-tools
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#define pr_fmt(fmt) "kexec_file: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static int __dt_root_addr_cells;
+static int __dt_root_size_cells;
+
+const struct kexec_file_ops * const kexec_file_loaders[] = {
+   NULL
+};
+
+int arch_kimage_file_post_load_cleanup(struct kimage *image)
+{
+   vfree(image->arch.dtb_buf);
+   image->arch.dtb_buf = NULL;
+
+   return _kexec_kernel_post_load_cleanup(image);
+}
+
+int arch_kexec_walk_mem(struct kexec_buf *kbuf, int (*func)(u64, u64, void *))
+{
+   if (kbuf->image->type == KEXEC_TYPE_CRASH)
+   return walk_iomem_res_desc(crashk_res.desc,
+   IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
+   crashk_res.start, crashk_res.end,
+   kbuf, func);
+   else if (kbuf->top_down)
+   return walk_system_ram_res_rev(0, ULONG_MAX, kbuf, func);
+   else
+   return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
+}
+
+int setup_dtb(struct kimage *image,
+   unsigned long initrd_load_addr, unsigned long initrd_len,
+   char *cmdline, unsigned long cmdline_len,
+   char **dtb_buf, size_t *dtb_buf_len)
+{
+   char *buf = NULL;
+   size_t buf_size;
+   int nodeoffset;
+   u64 value;
+   int range_len;
+   int ret;
+
+   /* duplicate dt blob */
+   buf_size = fdt_totalsize(initial_boot_params);
+   range_len = (__dt_root_addr_cells + __dt_root_size_cells) * sizeof(u32);
+
+   if (initrd_load_addr)
+   buf_size += fdt_prop_len("initrd-start", sizeof(u64))
+   + fdt_p

[PATCH v4 09/10] arm64: enable KEXEC_FILE config

2017-10-01 Thread AKASHI Takahiro

Modify arm64/Kconfig and Makefile to enable kexec_file_load support.

Signed-off-by: AKASHI Takahiro 
Cc: Catalin Marinas 
Cc: Will Deacon 
---
 arch/arm64/Kconfig | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 0df64a6a56d4..e37be8a59a88 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -757,6 +757,28 @@ config KEXEC
  but it is independent of the system firmware.   And like a reboot
  you can start any kernel with it, not just Linux.
 
+config KEXEC_FILE
+   bool "kexec file based system call"
+   select KEXEC_CORE
+   select BUILD_BIN2C
+   ---help---
+ This is new version of kexec system call. This system call is
+ file based and takes file descriptors as system call argument
+ for kernel and initramfs as opposed to list of segments as
+ accepted by previous system call.
+
+ In addition to this option, you need to enable a specific type
+ of image support.
+
+config KEXEC_VERIFY_SIG
+   bool "Verify kernel signature during kexec_file_load() syscall"
+   depends on KEXEC_FILE
+   select SYSTEM_DATA_VERIFICATION
+   ---help---
+ Select this option to verify a signature with loaded kernel
+ image. If configured, any attempt of loading a image without
+ valid signature will fail.
+
 config CRASH_DUMP
bool "Build kdump crash kernel"
help
-- 
2.14.1
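
The help text in the patch above describes the interface from the caller's
side: file descriptors for the kernel and initramfs instead of a segment
list. A hedged user-space sketch of a call through syscall(2) follows.
kexec_file_load_sketch() and cmdline_len() are hypothetical helper names,
the call needs CAP_SYS_BOOT to succeed, and on C libraries whose headers do
not define SYS_kexec_file_load the wrapper simply reports ENOSYS.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* The length passed to the kernel counts the terminating NUL byte. */
static unsigned long cmdline_len(const char *cmdline)
{
	return strlen(cmdline) + 1;
}

/*
 * Ask the kernel to load a new kernel image and initramfs from two open
 * file descriptors. Returns 0 on success, -1 with errno set on failure.
 */
static long kexec_file_load_sketch(int kernel_fd, int initrd_fd,
				   const char *cmdline, unsigned long flags)
{
#ifdef SYS_kexec_file_load
	return syscall(SYS_kexec_file_load, kernel_fd, initrd_fd,
		       cmdline_len(cmdline), cmdline, flags);
#else
	(void)kernel_fd;
	(void)initrd_fd;
	(void)cmdline;
	(void)flags;
	errno = ENOSYS;
	return -1;
#endif
}
```

A caller would open the kernel image and initramfs with open(2), pass both
descriptors with a command line such as "console=ttyAMA0", and check errno
if the wrapper returns -1.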

[PATCH v4 10/10] arm64: kexec_file: add Image format support

2017-10-01 Thread AKASHI Takahiro

The "Image" binary will be loaded at the offset of TEXT_OFFSET from
the start of system memory. TEXT_OFFSET is determined from the header
of the image.

Kernel signature verification will be done through
verify_pefile_signature(), as arm64's "Image" binary can be treated as
being in PE format. This approach is consistent with the x86 implementation.

We can sign an image with the sbsign command.

Signed-off-by: AKASHI Takahiro 
Cc: Catalin Marinas 
Cc: Will Deacon 
---
 arch/arm64/Kconfig |   7 +++
 arch/arm64/include/asm/kexec.h |  66 +
 arch/arm64/kernel/Makefile |   1 +
 arch/arm64/kernel/kexec_image.c| 105 +
 arch/arm64/kernel/machine_kexec_file.c |   3 +
 5 files changed, 182 insertions(+)
 create mode 100644 arch/arm64/kernel/kexec_image.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index e37be8a59a88..a9ef277faa3e 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -770,10 +770,17 @@ config KEXEC_FILE
  In addition to this option, you need to enable a specific type
  of image support.
 
+config KEXEC_FILE_IMAGE_FMT
+   bool "Enable Image support"
+   depends on KEXEC_FILE
+   ---help---
+ Select this option to enable 'Image' kernel loading.
+
 config KEXEC_VERIFY_SIG
bool "Verify kernel signature during kexec_file_load() syscall"
depends on KEXEC_FILE
select SYSTEM_DATA_VERIFICATION
+   select SIGNED_PE_FILE_VERIFICATION if KEXEC_FILE_IMAGE_FMT
---help---
  Select this option to verify a signature with loaded kernel
  image. If configured, any attempt of loading a image without
diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index edb702e64a8a..2a63bf5f32ea 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -104,6 +104,72 @@ struct kimage_arch {
unsigned long elf_load_addr;
 };
 
+/**
+ * struct arm64_image_header - arm64 kernel image header
+ *
+ * @pe_sig: Optional PE format 'MZ' signature
+ * @branch_code: Instruction to branch to stext
+ * @text_offset: Image load offset, little endian
+ * @image_size: Effective image size, little endian
+ * @flags:
+ * Bit 0: Kernel endianness. 0=little endian, 1=big endian
+ * @reserved: Reserved
+ * @magic: Magic number, "ARM\x64"
+ * @pe_header: Optional offset to a PE format header
+ **/
+
+struct arm64_image_header {
+   u8 pe_sig[2];
+   u8 pad[2];
+   u32 branch_code;
+   u64 text_offset;
+   u64 image_size;
+   u64 flags;
+   u64 reserved[3];
+   u8 magic[4];
+   u32 pe_header;
+};
+
+static const u8 arm64_image_magic[4] = {'A', 'R', 'M', 0x64U};
+static const u8 arm64_image_pe_sig[2] = {'M', 'Z'};
+
+/**
+ * arm64_header_check_magic - Helper to check the arm64 image header.
+ *
+ * Returns non-zero if header is OK.
+ */
+
+static inline int arm64_header_check_magic(const struct arm64_image_header *h)
+{
+   if (!h)
+   return 0;
+
+   if (!h->text_offset)
+   return 0;
+
+   return (h->magic[0] == arm64_image_magic[0]
+   && h->magic[1] == arm64_image_magic[1]
+   && h->magic[2] == arm64_image_magic[2]
+   && h->magic[3] == arm64_image_magic[3]);
+}
+
+/**
+ * arm64_header_check_pe_sig - Helper to check the arm64 image header.
+ *
+ * Returns non-zero if 'MZ' signature is found.
+ */
+
+static inline int arm64_header_check_pe_sig(const struct arm64_image_header *h)
+{
+   if (!h)
+   return 0;
+
+   return (h->pe_sig[0] == arm64_image_pe_sig[0]
+   && h->pe_sig[1] == arm64_image_pe_sig[1]);
+}
+
+extern const struct kexec_file_ops kexec_image_ops;
+
 struct kimage;
 
 #define arch_kimage_file_post_load_cleanup arch_kimage_file_post_load_cleanup
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 5df003d6157c..a1161bab6810 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -51,6 +51,7 @@ arm64-obj-$(CONFIG_HIBERNATION)   += hibernate.o hibernate-asm.o
 arm64-obj-$(CONFIG_KEXEC_CORE) += machine_kexec.o relocate_kernel.o \
   cpu-reset.o
 arm64-obj-$(CONFIG_KEXEC_FILE) += machine_kexec_file.o
+arm64-obj-$(CONFIG_KEXEC_FILE_IMAGE_FMT)   += kexec_image.o
 arm64-obj-$(CONFIG_ARM64_RELOC_TEST)   += arm64-reloc-test.o
 arm64-reloc-test-y := reloc_test_core.o reloc_test_syms.o
 arm64-obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
diff --git a/arch/arm64/kernel/kexec_image.c b/arch/arm64/kernel/kexec_image.c
new file mode 100644
index ..b840b6ed6ed9
--- /dev/null
+++ b/arch/arm64/kernel/kexec_image.c
@@ -0,0 +1,105 @@
+/*
+ * Kexec image loader
+
+ * Copyright (C) 2017 Linaro Limited
+ * Author: AKASHI Takahiro 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the

[PATCH v4 05/10] asm-generic: add kexec_file_load system call to unistd.h

2017-10-01 Thread AKASHI Takahiro

The initial user of this system call number is arm64.

Signed-off-by: AKASHI Takahiro 
Acked-by: Arnd Bergmann 
---
 include/uapi/asm-generic/unistd.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 061185a5eb51..086697fe3917 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -731,9 +731,11 @@ __SYSCALL(__NR_pkey_alloc,sys_pkey_alloc)
 __SYSCALL(__NR_pkey_free, sys_pkey_free)
 #define __NR_statx 291
 __SYSCALL(__NR_statx, sys_statx)
+#define __NR_kexec_file_load 292
+__SYSCALL(__NR_kexec_file_load, sys_kexec_file_load)
 
 #undef __NR_syscalls
-#define __NR_syscalls 292
+#define __NR_syscalls 293
 
 /*
  * All syscalls below here should go away really,
-- 
2.14.1

[PATCH v4 04/10] kexec_file: factor out crashdump elf header function from x86

2017-10-01 Thread AKASHI Takahiro

prepare_elf_headers() can also be useful for other architectures,
including arm64, so let's factor it out.

Signed-off-by: AKASHI Takahiro 
Cc: Dave Young 
Cc: Vivek Goyal 
Cc: Baoquan He 
---
 arch/x86/kernel/crash.c | 324 
 include/linux/kexec.h   |  17 +++
 kernel/kexec_file.c | 308 +
 kernel/kexec_internal.h |  20 +++
 4 files changed, 345 insertions(+), 324 deletions(-)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 44404e2307bb..3c6b880f6dbf 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -21,7 +21,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 #include 
@@ -41,34 +40,6 @@
 /* Alignment required for elf header segment */
 #define ELF_CORE_HEADER_ALIGN   4096
 
-/* This primarily represents number of split ranges due to exclusion */
-#define CRASH_MAX_RANGES   16
-
-struct crash_mem_range {
-   u64 start, end;
-};
-
-struct crash_mem {
-   unsigned int nr_ranges;
-   struct crash_mem_range ranges[CRASH_MAX_RANGES];
-};
-
-/* Misc data about ram ranges needed to prepare elf headers */
-struct crash_elf_data {
-   struct kimage *image;
-   /*
-* Total number of ram ranges we have after various adjustments for
-* crash reserved region, etc.
-*/
-   unsigned int max_nr_ranges;
-
-   /* Pointer to elf header */
-   void *ehdr;
-   /* Pointer to next phdr */
-   void *bufp;
-   struct crash_mem mem;
-};
-
 /* Used while preparing memory map entries for second kernel */
 struct crash_memmap_data {
struct boot_params *params;
@@ -209,301 +180,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 }
 
 #ifdef CONFIG_KEXEC_FILE
-static int get_nr_ram_ranges_callback(u64 start, u64 end, void *arg)
-{
-   unsigned int *nr_ranges = arg;
-
-   (*nr_ranges)++;
-   return 0;
-}
-
-
-/* Gather all the required information to prepare elf headers for ram regions */
-static void fill_up_crash_elf_data(struct crash_elf_data *ced,
-  struct kimage *image)
-{
-   unsigned int nr_ranges = 0;
-
-   ced->image = image;
-
-   walk_system_ram_res(0, -1, &nr_ranges,
-   get_nr_ram_ranges_callback);
-
-   ced->max_nr_ranges = nr_ranges;
-
-   /* Exclusion of crash region could split memory ranges */
-   ced->max_nr_ranges++;
-
-   /* If crashk_low_res is not 0, another range split possible */
-   if (crashk_low_res.end)
-   ced->max_nr_ranges++;
-}
-
-static int exclude_mem_range(struct crash_mem *mem,
-   unsigned long long mstart, unsigned long long mend)
-{
-   int i, j;
-   unsigned long long start, end;
-   struct crash_mem_range temp_range = {0, 0};
-
-   for (i = 0; i < mem->nr_ranges; i++) {
-   start = mem->ranges[i].start;
-   end = mem->ranges[i].end;
-
-   if (mstart > end || mend < start)
-   continue;
-
-   /* Truncate any area outside of range */
-   if (mstart < start)
-   mstart = start;
-   if (mend > end)
-   mend = end;
-
-   /* Found completely overlapping range */
-   if (mstart == start && mend == end) {
-   mem->ranges[i].start = 0;
-   mem->ranges[i].end = 0;
-   if (i < mem->nr_ranges - 1) {
-   /* Shift rest of the ranges to left */
-   for (j = i; j < mem->nr_ranges - 1; j++) {
-   mem->ranges[j].start =
-   mem->ranges[j+1].start;
-   mem->ranges[j].end =
-   mem->ranges[j+1].end;
-   }
-   }
-   mem->nr_ranges--;
-   return 0;
-   }
-
-   if (mstart > start && mend < end) {
-   /* Split original range */
-   mem->ranges[i].end = mstart - 1;
-   temp_range.start = mend + 1;
-   temp_range.end = end;
-   } else if (mstart != start)
-   mem->ranges[i].end = mstart - 1;
-   else
-   mem->ranges[i].start = mend + 1;
-   break;
-   }
-
-   /* If a split happend, add the split to array */
-   if (!temp_range.end)
-   return 0;
-
-   /* Split happened */
-   if (i == CRASH_MAX_RANGES - 1) {
-   pr_err("Too many crash ranges after split\n");
-   return -ENOMEM;
-   }
-
-   /* Location where new range should go */
-   j = i + 1;
-   if (j < mem->nr_ranges) {
-

[PATCH v4 02/10] resource: add walk_system_ram_res_rev()

2017-10-01 Thread AKASHI Takahiro

This function, a variant of walk_system_ram_res() introduced in
commit 8c86e70acead ("resource: provide new functions to walk through
resources"), walks through a list of all the resources of System RAM
in reverse order, i.e., from higher to lower.

It will be used in the kexec_file implementation on arm64.

Signed-off-by: AKASHI Takahiro 
Cc: Vivek Goyal 
Cc: Andrew Morton 
Cc: Linus Torvalds 
---
 include/linux/ioport.h |  3 +++
 kernel/resource.c  | 59 ++
 2 files changed, 62 insertions(+)

diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index f5cf32e80041..62eb62b98118 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -273,6 +273,9 @@ extern int
 walk_system_ram_res(u64 start, u64 end, void *arg,
int (*func)(u64, u64, void *));
 extern int
+walk_system_ram_res_rev(u64 start, u64 end, void *arg,
+   int (*func)(u64, u64, void *));
+extern int
 walk_iomem_res_desc(unsigned long desc, unsigned long flags, u64 start, u64 end,
void *arg, int (*func)(u64, u64, void *));
 
diff --git a/kernel/resource.c b/kernel/resource.c
index 9b5f04404152..572f2f91ce9c 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -23,6 +23,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 
 
@@ -469,6 +471,63 @@ int walk_system_ram_res(u64 start, u64 end, void *arg,
return ret;
 }
 
+int walk_system_ram_res_rev(u64 start, u64 end, void *arg,
+   int (*func)(u64, u64, void *))
+{
+   struct resource res, *rams;
+   u64 orig_end;
+   int count, i;
+   int ret = -1;
+
+   count = 16; /* initial */
+
+   /* create a list */
+   rams = vmalloc(sizeof(struct resource) * count);
+   if (!rams)
+   return ret;
+
+   res.start = start;
+   res.end = end;
+   res.flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
+   orig_end = res.end;
+   i = 0;
+   while ((res.start < res.end) &&
+   (!find_next_iomem_res(&res, IORES_DESC_NONE, true))) {
+   if (i >= count) {
+   /* re-alloc */
+   struct resource *rams_new;
+   int count_new;
+
+   count_new = count + 16;
+   rams_new = vmalloc(sizeof(struct resource) * count_new);
+   if (!rams_new)
+   goto out;
+
+   memcpy(rams_new, rams, sizeof(struct resource) * count);
+   vfree(rams);
+   rams = rams_new;
+   count = count_new;
+   }
+
+   rams[i].start = res.start;
+   rams[i++].end = res.end;
+
+   res.start = res.end + 1;
+   res.end = orig_end;
+   }
+
+   /* go reverse */
+   for (i--; i >= 0; i--) {
+   ret = (*func)(rams[i].start, rams[i].end, arg);
+   if (ret)
+   break;
+   }
+
+out:
+   vfree(rams);
+   return ret;
+}
+
 #if !defined(CONFIG_ARCH_HAS_WALK_MEMORY)
 
 /*
-- 
2.14.1
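
For illustration only, a minimal userspace sketch of the approach the
commit message describes: collect the ranges with a forward-only walk,
then call the callback from the highest range down to the lowest. The
names struct range, walk_ranges_rev(), struct visit_log and
record_start() are made up for this sketch and are not kernel API;
realloc() stands in for the vmalloc()/vfree() pair, and it copies whole
elements when the list grows.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct resource. */
struct range {
    uint64_t start;
    uint64_t end;
};

/*
 * Call func() on each of the n ranges in src, from the last one down to
 * the first. The ranges are first copied into a growable list, as a
 * forward-only iterator would force us to do.
 * Returns -1 on allocation failure or empty input, otherwise the last
 * value returned by func(); the walk stops at the first non-zero value.
 */
static int walk_ranges_rev(const struct range *src, size_t n, void *arg,
                           int (*func)(uint64_t, uint64_t, void *))
{
    struct range *list = NULL, *grown;
    size_t cap = 0, used = 0, i;
    int ret = -1;

    for (i = 0; i < n; i++) {
        if (used == cap) {
            /* Grow by 16 entries; the size is in bytes, not entries. */
            grown = realloc(list, (cap + 16) * sizeof(*list));
            if (!grown)
                goto out;
            list = grown;
            cap += 16;
        }
        list[used++] = src[i];
    }

    /* Go in reverse order. */
    while (used-- > 0) {
        ret = func(list[used].start, list[used].end, arg);
        if (ret)
            break;
    }
out:
    free(list);
    return ret;
}

/* Example callback: records each start address in visit order. */
struct visit_log {
    uint64_t starts[64];
    size_t count;
};

static int record_start(uint64_t start, uint64_t end, void *arg)
{
    struct visit_log *log = arg;

    (void)end;
    assert(log->count < 64);
    log->starts[log->count++] = start;
    return 0;
}
```

Passing more than 16 ranges exercises the growth path, which is where
an element count and a byte count are easy to confuse.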

[PATCH v4 01/10] include: pe.h: remove message[] from mz header definition

2017-10-01 Thread AKASHI Takahiro

message[] field won't be part of the definition of mz header.

This change is crucial for enabling kexec_file_load on arm64 because
arm64's "Image" binary, as in PE format, doesn't have any data for it and
accordingly the following check in pefile_parse_binary() will fail:

chkaddr(cursor, mz->peaddr, sizeof(*pe));

Signed-off-by: AKASHI Takahiro 
Reviewed-by: Ard Biesheuvel 
Cc: David Howells 
Cc: Vivek Goyal 
Cc: Herbert Xu 
Cc: David S. Miller 
---
 include/linux/pe.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/pe.h b/include/linux/pe.h
index 143ce75be5f0..3482b18a48b5 100644
--- a/include/linux/pe.h
+++ b/include/linux/pe.h
@@ -166,7 +166,7 @@ struct mz_hdr {
uint16_t oem_info;  /* oem specific */
uint16_t reserved1[10]; /* reserved */
uint32_t peaddr;/* address of pe header */
-   char message[64];   /* message to print */
+   char message[]; /* message to print */
 };
 
 struct mz_reloc {
-- 
2.14.1
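
A small sketch of why the fixed-size field matters, under stated
assumptions: the two structs below are shortened, hypothetical versions
of the MZ header (only the total size is meant to match a 64-byte
header), and pe_offset_ok() is loosely modelled on the kind of bounds
check quoted above, not copied from the kernel. It assumes, for
illustration, a PE header placed directly after a 64-byte MZ header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header with the message counted in its size. */
struct mz_fixed {
    uint16_t fields[30];    /* stand-in for the 16-bit header fields */
    uint32_t peaddr;        /* offset of the PE header */
    char message[64];       /* counted by sizeof() */
};

/* Same layout, but the message is a flexible array member. */
struct mz_flex {
    uint16_t fields[30];
    uint32_t peaddr;
    char message[];         /* adds nothing to sizeof() */
};

/*
 * Return 1 if a PE header of pe_size bytes at offset peaddr lies inside
 * a file of datalen bytes and does not start before the cursor, where
 * the cursor is the end of the MZ header as the parser sees it.
 */
static int pe_offset_ok(size_t cursor, size_t peaddr, size_t pe_size,
                        size_t datalen)
{
    if (peaddr < cursor || pe_size >= datalen || peaddr > datalen - pe_size)
        return 0;
    return 1;
}
```

On common ABIs sizeof(struct mz_fixed) is 128 and sizeof(struct mz_flex)
is 64, so a PE header at offset 0x40 is rejected with the first
definition and accepted with the second.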

[PATCH v4 00/10] arm64: kexec: add kexec_file_load() support

2017-10-01 Thread AKASHI Takahiro

This is the fourth round of implementing kexec_file_load() support
on arm64.[1]
Most of the code is based on kexec-tools (along with some kernel code
from x86, which also came from kexec-tools).


This patch series enables us to
  * load the kernel, Image, with kexec_file_load system call, and
  * optionally verify its signature at load time for trusted boot.

To load the kernel via kexec_file_load system call, a small change
is also needed to kexec-tools. See [2]. This enables '-s' option.

As we discussed a long time ago, users may not be allowed to specify
the device-tree file of the 2nd kernel explicitly with kexec-tools, so
the blob of the first kernel is re-used.

Regarding a signing method, we conform with x86 (or rather Microsoft?)
style of signing since the binary can also be seen as in PE format
(assuming that CONFIG_EFI is enabled).

Powerpc is also going to support extended-file-attribute-based
verification[3] with vmlinux, but arm64 doesn't for now partly
because we don't have TPM-based IMA at this moment.

Accordingly, we can use the existing command, sbsign, to sign the kernel.

$ sbsign --key ${KEY} --cert ${CERT} Image

Please note that it is totally up to the system what key/certificate is
used for signing, but one easy way to *try* this feature is to turn on
CONFIG_MODULE_SIG so that we can reuse certs/signing_key.pem as the
signing key, KEY and CERT above, for the kernel.
(This also enables CONFIG_CRYPTO_SHA1 by default.)


Some concerns (or future work):
* Even if the kernel is configured with CONFIG_RANDOMIZE_BASE, the 2nd
  kernel won't be placed at a randomized address. We will have to
  add some boot code similar to efi-stub to implement the feature.
* While big-endian kernel can support kernel signing, I'm not sure that
  Image can be recognized as in PE format because x86 standard only
  defines little-endian-based format.
* IMA(and extended file attribute)-based kexec
* vmlinux support

  [1] http://git.linaro.org/people/takahiro.akashi/linux-aarch64.git
branch:arm64/kexec_file
  [2] http://git.linaro.org/people/takahiro.akashi/kexec-tools.git
branch:arm64/kexec_file
  [3] http://lkml.iu.edu//hypermail/linux/kernel/1707.0/03669.html


Changes in v4 (Oct 2, 2017)
* reinstate x86's arch_kexec_kernel_image_load()
* rename weak arch_kexec_kernel_xxx() to _kexec_kernel_xxx() for
  better re-use
* constify kexec_file_loaders[]

Changes in v3 (Sep 15, 2017)
* fix kbuild test error
* factor out arch_kexec_kernel_*() & arch_kimage_file_post_load_cleanup()
* remove CONFIG_CRASH_CORE guard from kexec_file.c
* add vmapped kernel region to vmcore for gdb backtracing
  (see prepare_elf64_headers())
* merge asm/kexec_file.h into asm/kexec.h
* and some cleanups

Changes in v2 (Sep 8, 2017)
* move core-header-related functions from crash_core.c to kexec_file.c
* drop hash-check code from purgatory
* modify purgatory asm to remove arch_kexec_apply_relocations_add()
* drop older kernel support
* drop vmlinux support (at least, for this series)

Patch #1 to #5 are all preparatory patches on generic side.
Patch #6 is purgatory code.
Patch #7 to #9 are common for enabling kexec_file_load.
Patch #10 is for 'Image' support.

AKASHI Takahiro (10):
  include: pe.h: remove message[] from mz header definition
  resource: add walk_system_ram_res_rev()
  kexec_file: factor out arch_kexec_kernel_*() from x86, powerpc
  kexec_file: factor out crashdump elf header function from x86
  asm-generic: add kexec_file_load system call to unistd.h
  arm64: kexec_file: create purgatory
  arm64: kexec_file: load initrd, device-tree and purgatory segments
  arm64: kexec_file: set up for crash dump adding elf core header
  arm64: enable KEXEC_FILE config
  arm64: kexec_file: add Image format support

 arch/arm64/Kconfig  |  29 +++
 arch/arm64/Makefile |   1 +
 arch/arm64/include/asm/kexec.h  |  93 +++
 arch/arm64/kernel/Makefile  |   4 +-
 arch/arm64/kernel/kexec_image.c | 105 
 arch/arm64/kernel/machine_kexec_file.c  | 365 +++
 arch/arm64/purgatory/Makefile   |  24 ++
 arch/arm64/purgatory/entry.S|  55 +
 arch/powerpc/kernel/kexec_elf_64.c  |   2 +-
 arch/powerpc/kernel/machine_kexec_file_64.c |  36 +--
 arch/x86/kernel/crash.c | 324 
 arch/x86/kernel/kexec-bzimage64.c   |   2 +-
 arch/x86/kernel/machine_kexec_64.c  |  45 +---
 include/linux/ioport.h  |   3 +
 include/linux/kexec.h   |  32 ++-
 include/linux/pe.h  |   2 +-
 include/uapi/asm-generic/unistd.h   |   4 +-
 kernel/kexec_file.c | 371 +++-
 kernel/kexec_internal.h |  20 ++
 kernel/resource.c   |  59 +
 20 files changed, 1159 insertions(+), 417 del

Re: [PATCH][next] mlxsw: spectrum: fix uninitialized value in err

2017-10-01 Thread David Miller

From: Colin King 
Date: Sun,  1 Oct 2017 17:27:35 +0100

> From: Colin Ian King 
> 
> In the unlikely event that mfc->mfc_un.res.ttls[i] is 255 for all
> values of i from 0 to MAXVIFS-1, the err is not set at all and hence
> has a garbage value on the error return at the end of the function,
> so initialize it to 0.  Also, the error return check on err and goto
> to err: inside the for loop makes it impossible for err to be zero
> at the end of the for loop, so we can remove the redundant err check
> at the end of the loop.
> 
> Detected by CoverityScan CID#1457207 ("Uninitialized scalar value")
> 
> Fixes: c011ec1bbfd6 ("mlxsw: spectrum: Add the multicast routing offloading logic")
> Signed-off-by: Colin Ian King 

Applied.

Re: [PATCH] MAINTAINERS: Remove myself as reviewer

2017-10-01 Thread Michal Simek

On 27.9.2017 23:23, Soren Brinkmann wrote:
> This address is gonna bounce in the not too far away future.
> 
> Signed-off-by: Soren Brinkmann 
> ---
>  MAINTAINERS | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index dc3ff3aaa588..4502f016be12 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2161,7 +2161,6 @@ F:  sound/soc/zte/
>  
>  ARM/ZYNQ ARCHITECTURE
>  M:   Michal Simek 
> -R:   Sören Brinkmann 
>  L:   linux-arm-ker...@lists.infradead.org (moderated for non-subscribers)
>  W:   http://wiki.xilinx.com
>  T:   git https://github.com/Xilinx/linux-xlnx.git
> 

Applied to zynq/soc.

Thanks,
Michal

Re: [PATCH] net: hns3: Fix an error handling path in 'hclge_rss_init_hw()'

2017-10-01 Thread David Miller

From: Christophe JAILLET 
Date: Sat, 30 Sep 2017 07:34:34 +0200

> If this sanity check fails, we must free 'rss_indir'. Otherwise there is a
> memory leak.
> 'goto err' as done in the other error handling paths to fix it.
> 
> Fixes: 46a3df9f9718 ("net: hns3: Fix for setting rss_size incorrectly")
> Signed-off-by: Christophe JAILLET 

Applied.

Re: [PATCH net] RDS: IB: Limit the scope of has_fr/has_fmr variables

2017-10-01 Thread David Miller

From: David Miller 
Date: Sun, 01 Oct 2017 22:54:19 -0700 (PDT)

> From: Avinash Repaka 
> Date: Fri, 29 Sep 2017 18:13:50 -0700
> 
>> This patch fixes the scope of has_fr and has_fmr variables as they are
>> needed only in rds_ib_add_one().
>> 
>> Signed-off-by: Avinash Repaka 
> 
> Applied.

Actually, reverted, this breaks the build.

net/rds/rdma_transport.c:38:10: fatal error: ib.h: No such file or directory
 #include "ib.h"

Although I can't see how in the world this patch is causing such
an error.

Re: [PATCH net] RDS: IB: Limit the scope of has_fr/has_fmr variables

2017-10-01 Thread David Miller

From: Avinash Repaka 
Date: Fri, 29 Sep 2017 18:13:50 -0700

> This patch fixes the scope of has_fr and has_fmr variables as they are
> needed only in rds_ib_add_one().
> 
> Signed-off-by: Avinash Repaka 

Applied.

Re: [PATCH v2 net] net: mvpp2: Fix clock resource by adding an optional bus clock

2017-10-01 Thread David Miller

From: Gregory CLEMENT 
Date: Fri, 29 Sep 2017 14:27:39 +0200

> On Armada 7K/8K we need to explicitly enable the bus clock. The bus clock
> is optional because not all the SoCs need them but at least for Armada
> 7K/8K it is actually mandatory.
> 
> The binding documentation is updated accordingly.
> 
> Signed-off-by: Gregory CLEMENT 

Applied.

Re: [PATCH V4] r8152: add Linksys USB3GIGV1 id

2017-10-01 Thread David Miller

From: Grant Grundler 
Date: Thu, 28 Sep 2017 11:35:00 -0700

> This linksys dongle by default comes up in cdc_ether mode.
> This patch allows r8152 to claim the device:
>Bus 002 Device 002: ID 13b1:0041 Linksys
> 
> Signed-off-by: Grant Grundler 

Applied, thanks.

Re: [PATCH 00/18] use ARRAY_SIZE macro

2017-10-01 Thread Greg KH

On Sun, Oct 01, 2017 at 08:52:20PM -0400, Jérémy Lefaure wrote:
> On Mon, 2 Oct 2017 09:01:31 +1100
> "Tobin C. Harding"  wrote:
> 
> > > In order to reduce the size of the To: and Cc: lines, each patch of the
> > > series is sent only to the maintainers and lists concerned by the patch.
> > > This cover letter is sent to every list concerned by this series.  
> > 
> > Why don't you just send individual patches for each subsystem? I'm not a
> > maintainer but I don't see how any one person is going to be able to apply
> > this whole series, it is making it hard for maintainers if they have to
> > pick patches out from among the series (if indeed any will bother doing
> > that).
> Yeah, maybe it would have been better to send individual patches.
> 
> From my point of view it's a series because the patches are related (I
> did a git format-patch from my local branch). But from the maintainers'
> point of view, they are individual patches.

And the maintainers' view is what matters here, if you wish to get your
patches reviewed and accepted...

thanks,

greg k-h

Re: [PATCH net 3/3] net: skb_queue_purge(): lock/unlock the queue only once

2017-10-01 Thread Michael Witten

On Sun, 1 Oct 2017 17:59:09 -0700, Stephen Hemminger wrote:

> On Sun, 01 Oct 2017 22:19:20 - Michael Witten wrote:
>
>> +spin_lock_irqsave(&q->lock, flags);
>> +skb = q->next;
>> +__skb_queue_head_init(q);
>> +spin_unlock_irqrestore(&q->lock, flags);
>
> Other code manipulating lists uses splice operation and
> a sk_buff_head temporary on the stack. That would be easier
> to understand.
>
>   struct sk_buff_head head;
>
>   __skb_queue_head_init(&head);
>   spin_lock_irqsave(&q->lock, flags);
>   skb_queue_splice_init(q, &head);
>   spin_unlock_irqrestore(&q->lock, flags);
>
>
>> +while (skb != head) {
>> +next = skb->next;
>>  kfree_skb(skb);
>> +skb = next;
>> +}
>
> It would be cleaner if you could use
> skb_queue_walk_safe rather than open coding the loop.
>
>   skb_queue_walk_safe(&head, skb,  tmp)
>   kfree_skb(skb);

I appreciate abstraction as much as anybody, but I do not believe
that such abstractions would actually be an improvement here.

* Splice-initing seems more like an idiom than an abstraction;
  at first blush, it wouldn't be clear to me what the intention
  is.

* Such abstractions are fairly unnecessary.

* The function as written is already so short as to be
  easily digested.

* More to the point, this function is not some generic,
  higher-level algorithm that just happens to employ the
  socket buffer interface; rather, it is a function that
  implements part of that very interface, and may thus
  twiddle the intimate bits of these data structures
  without being accused of abusing a leaky abstraction.

* Such abstractions add overhead, if only conceptually. In this
  case, a temporary socket buffer queue allocates *3* unnecessary
  struct members, including a whole `spinlock_t' member:

prev
qlen
lock

  It's possible that the compiler will be smart enough to leave
  those out, but I have my suspicions that it won't, not only
  given that the interface contract requires that the temporary
  socket buffer queue be properly initialized before use, but
  also because splicing into the temporary will manipulate its
  `qlen'. Yet, why worry whether optimization happens? The whole
  issue can simply be avoided by exploiting the intimate details
  that are already philosophically available to us.

  Similarly, the function `skb_queue_walk_safe' is nice, but it
  loses value both because a temporary queue loses value (as just
  described), and because it ignores the fact that legitimate
  access to the internals of these data structures allows for
  setting up the requested loop in advance; that is to say, the
  two parts of the function that we are now debating can be woven
  together more tightly than `skb_queue_walk_safe' allows.

For these reasons, I stand by the way that the patch currently
implements this function; it does exactly what is desired, no more
or less.

Sincerely,
Michael Witten
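
As a rough userspace illustration of the pattern under discussion (take
the lock once, detach the whole chain, free it with the lock dropped),
here is a sketch of the open-coded variant. Everything in it is a
stand-in: struct node, struct queue and the queue_*() helpers are
hypothetical and are not the sk_buff interface, and an atomic_flag
spin loop plays the role of the spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical miniature of a circular, doubly linked queue. */
struct node {
    struct node *next, *prev;
};

struct queue {
    struct node head;       /* sentinel: head.next == &head when empty */
    unsigned int qlen;
    atomic_flag lock;       /* stand-in for a spinlock */
};

static void queue_init(struct queue *q)
{
    q->head.next = q->head.prev = &q->head;
    q->qlen = 0;
    atomic_flag_clear(&q->lock);
}

static void queue_lock(struct queue *q)
{
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
        ;   /* spin until the flag was previously clear */
}

static void queue_unlock(struct queue *q)
{
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Append a freshly allocated node at the tail; returns 0 on success. */
static int queue_push(struct queue *q)
{
    struct node *n = malloc(sizeof(*n));

    if (!n)
        return -1;
    queue_lock(q);
    n->prev = q->head.prev;
    n->next = &q->head;
    q->head.prev->next = n;
    q->head.prev = n;
    q->qlen++;
    queue_unlock(q);
    return 0;
}

/* Free every node, locking and unlocking once; returns the number freed. */
static unsigned int queue_purge(struct queue *q)
{
    struct node *n, *next;
    unsigned int freed = 0;

    queue_lock(q);
    n = q->head.next;                       /* first node of the chain */
    q->head.next = q->head.prev = &q->head; /* queue is now empty */
    q->qlen = 0;
    queue_unlock(q);

    /* The detached chain still terminates at the sentinel's address. */
    while (n != &q->head) {
        next = n->next;
        free(n);
        freed++;
        n = next;
    }
    return freed;
}
```

The splice-based alternative would move the chain onto a temporary
queue head on the stack inside the same critical section and walk that
instead; both forms hold the lock for a constant amount of work.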

Re: [PATCH 04/18] IB/mlx5: Use ARRAY_SIZE

2017-10-01 Thread Leon Romanovsky

On Sun, Oct 01, 2017 at 03:30:42PM -0400, Jérémy Lefaure wrote:
> Using the ARRAY_SIZE macro improves the readability of the code.
>
> Found with Coccinelle with the following semantic patch:
> @r depends on (org || report)@
> type T;
> T[] E;
> position p;
> @@
> (
>  (sizeof(E)@p /sizeof(*E))
> |
>  (sizeof(E)@p /sizeof(E[...]))
> |
>  (sizeof(E)@p /sizeof(T))
> )
>
> Signed-off-by: Jérémy Lefaure 
> ---
>  drivers/infiniband/hw/mlx5/odp.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>

Thanks Jérémy,
I took this into my tree and will forward it to Doug L. during this
cycle.


Re: [RFC PATCH 3/3] fs: detect that the i_rwsem has already been taken exclusively

2017-10-01 Thread Dave Chinner

On Sun, Oct 01, 2017 at 07:42:42PM -0400, Mimi Zohar wrote:
> On Mon, 2017-10-02 at 09:34 +1100, Dave Chinner wrote:
> > On Sun, Oct 01, 2017 at 11:41:48AM -0700, Linus Torvalds wrote:
> > > On Sun, Oct 1, 2017 at 5:08 AM, Mimi Zohar  
> > > wrote:
> > > >
> > > > Right, re-introducing the iint->mutex and a new i_generation field in
> > > > the iint struct with a separate set of locks should work.  It will be
> > > > reset if the file metadata changes (eg. setxattr, chown, chmod).
> > > 
> > > Note that the "inner lock" could possibly be omitted if the
> > > invalidation can be just a single atomic instruction.
> > > 
> > > So particularly if invalidation could be just an atomic_inc() on the
> > > generation count, there might not need to be any inner lock at all.
> > > 
> > > You'd have to serialize the actual measurement with the "read
> > > generation count", but that should be as simple as just doing a
> > > smp_rmb() between the "read generation count" and "do measurement on
> > > file contents".
> > 
> > We already have a change counter on the inode, which is modified on
> > any data or metadata write (i_version) under filesystem locks.  The
> > i_version counter has well defined semantics - it's required by
> > NFSv4 to increment on any metadata or data change - so we should be
> > > able to rely on its behaviour to implement IMA as well. Filesystems
> > that support i_version are marked with [SB|MS]_I_VERSION in the
> > superblock (IS_I_VERSION(inode)) so it should be easy to tell if IMA
> > can be supported on a specific filesystem (btrfs, ext4, fuse and xfs
> > ATM).
> 
> Recently I received a patch to replace i_version with mtime/atime.

mtime is not guaranteed to change on data writes - the resolution of
the filesystem timestamps may mean mtime only changes once a second
regardless of the number of writes performed to that file. That's
why NFS can't use it as a change attribute, and hence we have
i_version.

>  Now, even more recently, I received a patch that claims that
> i_version is just a performance improvement.

Did you ask them to explain/quantify the performance improvement? 

e.g. Using i_version on XFS slows down performance on small
writes by 2-3% because with i_version all data writes log a
version change rather than only logging a change when mtime updates.
We take that penalty because NFS requires specific change attribute
behaviour, otherwise we wouldn't have implemented it at all in
XFS...

>  For file systems that
> don't support i_version, assume that the file has changed.
> 
> For file systems that don't support i_version, instead of assuming
> that the file has changed, we can at least use i_generation.

I'm not sure what you mean here - the struct inode already has a
i_generation variable. It's a lifecycle indicator used to
discriminate between alloc/free cycles on the same inode number.
i.e. It only changes at inode allocation time, not whenever the data
in the inode changes...

> With Linus' suggested changes, I think this will work nicely.
> 
> > The IMA code should be able to sample that at measurement time and
> > either fail or be retried if i_version changes during measurement.
> > We can then simply make the IMA xattr write conditional on the
> > i_version value being unchanged from the sample the IMA code passes
> > into the filesystem once the filesystem holds all the locks it needs
> > to write the xattr...
> 
> > I note that IMA already grabs the i_version in
> > ima_collect_measurement(), so this shouldn't be too hard to do.
> > Perhaps we don't need any new locks or counters at all, maybe just
> > the ability to feed a version cookie to the set_xattr method?
> 
> The security.ima xattr is normally written out in
> ima_check_last_writer(), not in ima_collect_measurement().

Which, if IIUC, does this to measure and update the xattr:

ima_check_last_writer
  -> ima_update_xattr
    -> ima_collect_measurement
    -> ima_fix_xattr

>  ima_collect_measurement() calculates the file hash for storing in the
> measurement list (IMA-measurement), verifying the hash/signature (IMA-
> appraisal) already stored in the xattr, and auditing (IMA-audit).

Yup, and it samples the i_version before it calculates the hash and
stores it in the iint, which then gets passed to ima_fix_xattr().
Looks like all that is needed is to pass the i_version back to the
filesystem through the xattr call.

IOWs, sample the i_version early while we hold the inode lock and
check the writer count, then if it is the last writer drop the inode
lock and call ima_update_xattr(). The sampled i_version then tells
us if the file has changed before we write the updated xattr...
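
To make the idea concrete, a toy userspace sketch of "sample the
version, measure, and only store the result if the version is
unchanged". All names here (struct file_obj, toy_hash(), file_write(),
store_hash_if_unchanged()) are invented for the sketch; nothing is IMA
or VFS API. In a real filesystem the compare-and-store step would run
under the filesystem's own locks, which this sketch does not model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical file: a change counter, contents, and a stored hash. */
struct file_obj {
    _Atomic uint64_t version;   /* bumped on every change */
    unsigned char data[64];
    size_t len;
    uint32_t stored_hash;       /* stand-in for the xattr value */
    int hash_valid;
};

static void file_init(struct file_obj *f, size_t len)
{
    assert(len <= sizeof(f->data));
    atomic_init(&f->version, 0);
    memset(f->data, 0, sizeof(f->data));
    f->len = len;
    f->stored_hash = 0;
    f->hash_valid = 0;
}

/* Toy FNV-1a hash standing in for the real measurement. */
static uint32_t toy_hash(const unsigned char *p, size_t n)
{
    uint32_t h = 2166136261u;

    while (n--) {
        h ^= *p++;
        h *= 16777619u;
    }
    return h;
}

/* Any write bumps the change counter. */
static void file_write(struct file_obj *f, size_t off, unsigned char byte)
{
    assert(off < f->len);
    f->data[off] = byte;
    atomic_fetch_add(&f->version, 1);
}

/*
 * Store the hash only if the version still equals the sampled one.
 * Returns 0 on success, -1 if the file changed since the sample and
 * the caller should retry or give up.
 */
static int store_hash_if_unchanged(struct file_obj *f, uint64_t sampled,
                                   uint32_t hash)
{
    if (atomic_load(&f->version) != sampled)
        return -1;
    f->stored_hash = hash;
    f->hash_valid = 1;
    return 0;
}
```

A caller samples the version first, then computes toy_hash() over the
data, then calls store_hash_if_unchanged() with the sample; a write in
between makes the store fail instead of recording a stale hash.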

> The only time that ima_collect_measurement() writes the file xattr is
> in "fix" mode.  Writing the xattr will need to be deferred until after
> the iint->mutex is released.

ima_collect_measurement() doesn't write an xattr at all - it just
reads the file data and calculates the hash.

> There should be no open writers in im

Re: [PATCH v16 3/5] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-10-01 Thread Michael S. Tsirkin

Looks good to me. minor comments below.

On Sat, Sep 30, 2017 at 12:05:52PM +0800, Wei Wang wrote:
> @@ -141,13 +146,128 @@ static void set_page_pfns(struct virtio_balloon *vb,
> page_to_balloon_pfn(page) + i);
>  }
>  
> +
> +static void kick_and_wait(struct virtqueue *vq, wait_queue_head_t wq_head)
> +{
> + unsigned int len;
> +
> + virtqueue_kick(vq);
> + wait_event(wq_head, virtqueue_get_buf(vq, &len));
> +}
> +
> +static int add_one_sg(struct virtqueue *vq, void *addr, uint32_t size)
> +{
> + struct scatterlist sg;
> + unsigned int len;
> +
> + sg_init_one(&sg, addr, size);
> +
> + /* Detach all the used buffers from the vq */
> + while (virtqueue_get_buf(vq, &len))
> + ;
> +
> + return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
> +}
> +
> +static int send_balloon_page_sg(struct virtio_balloon *vb,
> +  struct virtqueue *vq,
> +  void *addr,
> +  uint32_t size,
> +  bool batch)
> +{
> + int err;
> +
> + err = add_one_sg(vq, addr, size);
> +
> + /* If batchng is requested, we batch till the vq is full */

typo

> + if (!batch || !vq->num_free)
> + kick_and_wait(vq, vb->acked);
> +
> + return err;
> +}

If add_one_sg fails, kick_and_wait will hang forever.

The reason this might work is because
1. with 1 sg there are no memory allocations
2. if adding fails on vq full, then something
   is in queue and will wake up kick_and_wait.

So in short this is expected to never fail.
How about a BUG_ON here then?
And make it void, and add a comment with above explanation.

> +
> +/*
> + * Send balloon pages in sgs to host. The balloon pages are recorded in the
> + * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> + * The page xbitmap is searched for continuous "1" bits, which correspond
> + * to continuous pages, to chunk into sgs.
> + *
> + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
> + * need to be searched.
> + */
> +static void tell_host_sgs(struct virtio_balloon *vb,
> +   struct virtqueue *vq,
> +   unsigned long page_xb_start,
> +   unsigned long page_xb_end)
> +{
> + unsigned long sg_pfn_start, sg_pfn_end;
> + void *sg_addr;
> + uint32_t sg_len, sg_max_len = round_down(UINT_MAX, PAGE_SIZE);
> + int err = 0;
> +
> + sg_pfn_start = page_xb_start;
> + while (sg_pfn_start < page_xb_end) {
> + sg_pfn_start = xb_find_next_set_bit(&vb->page_xb, sg_pfn_start,
> + page_xb_end);
> + if (sg_pfn_start == page_xb_end + 1)
> + break;
> + sg_pfn_end = xb_find_next_zero_bit(&vb->page_xb,
> +sg_pfn_start + 1,
> +page_xb_end);
> + sg_addr = (void *)pfn_to_kaddr(sg_pfn_start);
> + sg_len = (sg_pfn_end - sg_pfn_start) << PAGE_SHIFT;
> + while (sg_len > sg_max_len) {
> + err = send_balloon_page_sg(vb, vq, sg_addr, sg_max_len,
> +true);
> + if (unlikely(err < 0))
> + goto err_out;
> + sg_addr += sg_max_len;
> + sg_len -= sg_max_len;
> + }
> + err = send_balloon_page_sg(vb, vq, sg_addr, sg_len, true);
> + if (unlikely(err < 0))
> + goto err_out;
> + sg_pfn_start = sg_pfn_end + 1;
> + }
> +
> + /*
> +  * The last few sgs may not reach the batch size, but need a kick to
> +  * notify the device to handle them.
> +  */
> + if (vq->num_free != virtqueue_get_vring_size(vq))
> + kick_and_wait(vq, vb->acked);
> +
> + xb_clear_bit_range(&vb->page_xb, page_xb_start, page_xb_end);
> + return;
> +
> +err_out:
> + dev_warn(&vb->vdev->dev, "%s failure: %d\n", __func__, err);

so fundamentally just make send_balloon_page_sg void then.

> +}
> +
> +static inline void xb_set_page(struct virtio_balloon *vb,
> +struct page *page,
> +unsigned long *pfn_min,
> +unsigned long *pfn_max)
> +{
> + unsigned long pfn = page_to_pfn(page);
> +
> + *pfn_min = min(pfn, *pfn_min);
> + *pfn_max = max(pfn, *pfn_max);
> + xb_preload(GFP_KERNEL);
> + xb_set_bit(&vb->page_xb, pfn);
> + xb_preload_end();
> +}
> +
>  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  {
>   struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>   unsigned num_allocated_pages;
> + bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> + unsigned long pf

Re: [PATCH 2/6] ath9k: add a quirk to set use_msi automatically

2017-10-01 Thread Daniel Drake

Hi AceLan,

On Thu, Sep 28, 2017 at 4:28 PM, AceLan Kao  wrote:
> Hi Daniel,
>
> I've tried your patch, but it doesn't work for me.
> Wifi can scan AP, but can't get connected.

Can you please clarify which patch(es) you have tried?

This is the base patch which adds the infrastructure to request
specific MSI IRQ vectors:
https://marc.info/?l=linux-wireless&m=150631274108016&w=2

This is the ath9k MSI patch which makes use of that:
https://github.com/endlessm/linux/commit/739c7a924db8f4434a9617657

If you were already able to use ath9k MSI interrupts without specific
consideration for which MSI vector numbers were used, these are the
possible explanations that spring to mind:

1. You got lucky and it picked a vector number that is 4-aligned. You
can check this in the "lspci -vvv" output. You'll see something like:
Capabilities: [50] MSI: Enable+ Count=1/4 Maskable+ 64bit+
Address: fee0300c  Data: 4142
The lower number is the vector number. In my example here 0x42 (66) is
not 4-aligned so the failure condition will be hit.

2. You are using interrupt remapping, which I suspect may provide a
high likelihood of MSI interrupt vectors being 4-aligned. See if
/proc/interrupts shows the IRQ type as IR-PCI-MSI.
Unfortunately interrupt remapping is not available here:
https://lists.linuxfoundation.org/pipermail/iommu/2017-August/023717.html

3. My assumption that all ath9k hardware corrupts the MSI vector
number could be wrong. However we've seen this on different wifi modules
in laptops produced by different OEMs and ODMs, so it seems to be a
somewhat widespread problem at least.

4. My assumption that ath9k hardware is corrupting the MSI vector
number could be wrong; maybe another component is to blame, could it
be a BIOS issue? Admittedly I don't really know how I can debug the
layers in between seeing the MSI Message Data value disagree with the
vector number being handled inside do_IRQ().
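
The arithmetic behind point 1 can be checked with a few lines of C. A
minimal sketch, assuming as in the lspci example above that the low
byte of the MSI Data value is the vector number; msi_vector() and
vector_is_4_aligned() are names made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Extract the vector number from the MSI Message Data value. */
static unsigned int msi_vector(uint16_t msi_data)
{
    return msi_data & 0xff;
}

/* A vector is 4-aligned when its two lowest bits are zero. */
static int vector_is_4_aligned(unsigned int vector)
{
    return (vector & 0x3) == 0;
}
```

For Data 0x4142 this gives vector 0x42 (66), and 66 is not a multiple
of 4, which matches the failing case described in point 1.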

Daniel

[PATCH v2] rpmsg: Allow RPMSG_VIRTIO to be enabled via menuconfig or defconfig

2017-10-01 Thread Anup Patel

Currently, RPMSG_VIRTIO can only be enabled if some other kconfig
option selects it. This does not allow it to be enabled for
virtualized systems where Virtio RPMSG is available over Virtio
MMIO or PCI transport.

This patch updates RPMSG_VIRTIO kconfig option so that we can
enable the VirtIO RPMSG driver via menuconfig or defconfig.

Signed-off-by: Anup Patel 
---

Changes since v1:
- Add depends on HAS_DMA to avoid build failures on
  archs (such as um) with NO_DMA=y. For most archs,
  HAS_DMA=y so having depends on HAS_DMA is fine. 

 drivers/rpmsg/Kconfig | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/rpmsg/Kconfig b/drivers/rpmsg/Kconfig
index 0fe6eac..65a9f6b 100644
--- a/drivers/rpmsg/Kconfig
+++ b/drivers/rpmsg/Kconfig
@@ -47,7 +47,8 @@ config RPMSG_QCOM_SMD
  platforms.
 
 config RPMSG_VIRTIO
-   tristate
+   tristate "Virtio RPMSG bus driver"
+   depends on HAS_DMA
select RPMSG
select VIRTIO
 
-- 
2.7.4

Re: [PATCH 01/18] sound: use ARRAY_SIZE

2017-10-01 Thread Joe Perches

On Sun, 2017-10-01 at 15:30 -0400, Jérémy Lefaure wrote:
> Using the ARRAY_SIZE macro improves the readability of the code.
> 
> Found with Coccinelle with the following semantic patch:
> @r depends on (org || report)@
> type T;
> T[] E;
> position p;
> @@
> (
>  (sizeof(E)@p /sizeof(*E))
> |
>  (sizeof(E)@p /sizeof(E[...]))
> |
>  (sizeof(E)@p /sizeof(T))
> )
[]
> diff --git a/sound/oss/ad1848.c b/sound/oss/ad1848.c
[]
> @@ -797,7 +798,7 @@ static int ad1848_set_speed(int dev, int arg)
>  
>   int i, n, selected = -1;
>  
> - n = sizeof(speed_table) / sizeof(speed_struct);
> + n = ARRAY_SIZE(speed_table);

These sorts of changes are OK, but for many
uses, it's more readable to use ARRAY_SIZE(foo)
in each location rather than using a temporary.
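
A small sketch of what that looks like, with made-up data: the table
contents, struct speed_entry and closest_speed_index() are hypothetical
and only loosely inspired by the hunk above. The ARRAY_SIZE definition
here is the plain form; the kernel's version additionally rejects
non-array arguments at build time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified form: element count of a true array (not a pointer). */
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Hypothetical table entry and contents, for illustration only. */
struct speed_entry {
    int speed;
    unsigned char bits;
};

static const struct speed_entry speed_table[] = {
    { 5510, 0x00 }, { 6620, 0x0e }, { 8000, 0x00 },
    { 9600, 0x0e }, { 11025, 0x02 },
};

/* Return the index of the table entry closest to the wanted speed. */
static int closest_speed_index(int wanted)
{
    size_t i, best = 0;

    /* ARRAY_SIZE() is used at the point of use, with no temporary. */
    for (i = 1; i < ARRAY_SIZE(speed_table); i++) {
        if (abs(speed_table[i].speed - wanted) <
            abs(speed_table[best].speed - wanted))
            best = i;
    }
    return (int)best;
}
```

With the macro in the loop condition, the bound stays visibly tied to
the array it describes.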

[PATCH] platform/x86: peaq-wmi: Blacklist Lenovo ideapad 700-15ISK

2017-10-01 Thread Kai-Heng Feng

peaq-wmi on Lenovo ideapad 700-15ISK keeps sending KEY_SOUND,
which interrupts the user's repeated keys.

The system does not have a Dolby button, so blacklist it.

BugLink: https://bugs.launchpad.net/bugs/1720219
Signed-off-by: Kai-Heng Feng 
---
 drivers/platform/x86/peaq-wmi.c | 19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/platform/x86/peaq-wmi.c b/drivers/platform/x86/peaq-wmi.c
index bc98ef95514a..5673d5daebc3 100644
--- a/drivers/platform/x86/peaq-wmi.c
+++ b/drivers/platform/x86/peaq-wmi.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define PEAQ_DOLBY_BUTTON_GUID "ABBC0F6F-8EA1-11D1-00A0-C9062910"
 #define PEAQ_DOLBY_BUTTON_METHOD_ID5
@@ -64,9 +65,22 @@ static void peaq_wmi_poll(struct input_polled_dev *dev)
}
 }
 
+static const struct dmi_system_id peaq_blacklist[] __initconst = {
+   {
+   /* Lenovo ideapad 700-15ISK does not have Dolby button */
+   .ident = "Lenovo ideapad 700-15ISK",
+   .matches = {
+   DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
+   DMI_MATCH(DMI_PRODUCT_NAME, "80RU"),
+   },
+   },
+   {}
+};
+
 static int __init peaq_wmi_init(void)
 {
-   if (!wmi_has_guid(PEAQ_DOLBY_BUTTON_GUID))
+   if (!wmi_has_guid(PEAQ_DOLBY_BUTTON_GUID) ||
+   dmi_check_system(peaq_blacklist))
return -ENODEV;
 
peaq_poll_dev = input_allocate_polled_device();
@@ -86,7 +100,8 @@ static int __init peaq_wmi_init(void)
 
 static void __exit peaq_wmi_exit(void)
 {
-   if (!wmi_has_guid(PEAQ_DOLBY_BUTTON_GUID))
+   if (!wmi_has_guid(PEAQ_DOLBY_BUTTON_GUID) ||
+   dmi_check_system(peaq_blacklist))
return;
 
input_unregister_polled_device(peaq_poll_dev);
-- 
2.14.1

Re: [RFC PATCH 3/3] fs: detect that the i_rwsem has already been taken exclusively

2017-10-01 Thread Dave Chinner

On Sun, Oct 01, 2017 at 04:15:07PM -0700, Linus Torvalds wrote:
> On Sun, Oct 1, 2017 at 3:34 PM, Dave Chinner  wrote:
> >
> > We already have a change counter on the inode, which is modified on
> > any data or metadata write (i_version) under filesystem locks.  The
> > i_version counter has well defined semantics - it's required by
> > NFSv4 to increment on any metadata or data change - so we should be
> > able to rely on its behaviour to implement IMA as well.
> 
> I actually think i_version has exactly the wrong semantics.
> 
> Afaik, it doesn't actually version the file _data_ at all, it only
> versions "inode itself changed".

No, the NFSv4 change attribute must change if either data or
metadata on the inode is changed, and be consistent and persistent
across server crashes. For data updates, they piggy back on mtime
updates 

> But I might have missed something obvious. The updates are hidden in
> some odd places sometimes.

... which are in file_update_time().

Hence every data write or write page fault will call
file_update_time() and trigger an i_version increment, even if the
mtime doesn't change.
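
A minimal userspace sketch of the sample-then-recheck pattern discussed in
this thread, assuming a single atomic counter that stands in for i_version.
The names fake_inode, fake_write, measure_if_stable and store_if_unchanged
are hypothetical, the checksum is a toy rather than a real hash, and this is
not kernel code: real i_version updates happen under filesystem locks, and
this sketch is only meant to be read single-threaded.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an inode: a change counter plus some data. */
struct fake_inode {
	_Atomic uint64_t version;	/* plays the role of i_version */
	unsigned char data[64];
	size_t len;
};

/* Writer side: change the data, then bump the counter. */
static void fake_write(struct fake_inode *ino, size_t off, unsigned char byte)
{
	assert(off < sizeof(ino->data));
	ino->data[off] = byte;
	if (off >= ino->len)
		ino->len = off + 1;
	atomic_fetch_add(&ino->version, 1);
}

/* Toy "measurement": a simple checksum instead of a real hash. */
static uint32_t fake_measure(const struct fake_inode *ino)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < ino->len; i++)
		sum = sum * 31 + ino->data[i];
	return sum;
}

/*
 * Sample the counter, measure, then re-check the counter.  Returns true and
 * fills *digest and *version only if the counter did not move in between.
 */
static bool measure_if_stable(struct fake_inode *ino, uint32_t *digest,
			      uint64_t *version)
{
	uint64_t before = atomic_load(&ino->version);
	uint32_t sum = fake_measure(ino);

	if (atomic_load(&ino->version) != before)
		return false;
	*digest = sum;
	*version = before;
	return true;
}

/* Store the digest only if the counter still matches the sampled cookie. */
static bool store_if_unchanged(struct fake_inode *ino, uint64_t cookie,
			       uint32_t digest, uint32_t *xattr)
{
	if (atomic_load(&ino->version) != cookie)
		return false;
	*xattr = digest;
	return true;
}
```

A caller would retry measure_if_stable() until it succeeds, then pass the
returned version as the cookie to store_if_unchanged(); a false return there
means the file changed after it was measured and the digest is stale.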

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: [RFC PATCH 3/3] fs: detect that the i_rwsem has already been taken exclusively

2017-10-01 Thread Eric W. Biederman

Mimi Zohar  writes:

> On Mon, 2017-10-02 at 09:34 +1100, Dave Chinner wrote:
>> On Sun, Oct 01, 2017 at 11:41:48AM -0700, Linus Torvalds wrote:
>> > On Sun, Oct 1, 2017 at 5:08 AM, Mimi Zohar wrote:
>> > >
>> > > Right, re-introducing the iint->mutex and a new i_generation field in
>> > > the iint struct with a separate set of locks should work.  It will be
>> > > reset if the file metadata changes (eg. setxattr, chown, chmod).
>> > 
>> > Note that the "inner lock" could possibly be omitted if the
>> > invalidation can be just a single atomic instruction.
>> > 
>> > So particularly if invalidation could be just an atomic_inc() on the
>> > generation count, there might not need to be any inner lock at all.
>> > 
>> > You'd have to serialize the actual measurement with the "read
>> > generation count", but that should be as simple as just doing a
>> > smp_rmb() between the "read generation count" and "do measurement on
>> > file contents".
>> 
>> We already have a change counter on the inode, which is modified on
>> any data or metadata write (i_version) under filesystem locks.  The
>> i_version counter has well defined semantics - it's required by
>> NFSv4 to increment on any metadata or data change - so we should be
>> able to rely on its behaviour to implement IMA as well. Filesystems
>> that support i_version are marked with [SB|MS]_I_VERSION in the
>> superblock (IS_I_VERSION(inode)) so it should be easy to tell if IMA
>> can be supported on a specific filesystem (btrfs, ext4, fuse and xfs
>> ATM).
>
> Recently I received a patch to replace i_version with mtime/atime.
>  Now, even more recently, I received a patch that claims that
> i_version is just a performance improvement.  For file systems that
> don't support i_version, assume that the file has changed.
>
> For file systems that don't support i_version, instead of assuming
> that the file has changed, we can at least use i_generation.
>
> With Linus' suggested changes, I think this will work nicely.
>
>> The IMA code should be able to sample that at measurement time and
>> either fail or be retried if i_version changes during measurement.
>> We can then simply make the IMA xattr write conditional on the
>> i_version value being unchanged from the sample the IMA code passes
>> into the filesystem once the filesystem holds all the locks it needs
>> to write the xattr...
>
>> I note that IMA already grabs the i_version in
>> ima_collect_measurement(), so this shouldn't be too hard to do.
>> Perhaps we don't need any new locks or counters at all, maybe just
>> the ability to feed a version cookie to the set_xattr method?
>
> The security.ima xattr is normally written out in
> ima_check_last_writer(), not in ima_collect_measurement().
>  ima_collect_measurement() calculates the file hash for storing in the
> measurement list (IMA-measurement), verifying the hash/signature (IMA-
> appraisal) already stored in the xattr, and auditing (IMA-audit).
>
> The only time that ima_collect_measurement() writes the file xattr is
> in "fix" mode.  Writing the xattr will need to be deferred until after
> the iint->mutex is released.
>
> There should be no open writers in ima_check_last_writer(), so the
> file shouldn't be changing.

This is slightly tangential, but I think it is important to consider.
What do you do about distributed filesystems (fuse, nfs, etc.) that
can change the data behind the kernel's back?

Do you not support such systems or do you have a sufficient way to
detect changes?

Eric

Re: [PATCH] powernv: Add OCC driver to mmap sensor area

2017-10-01 Thread Stewart Smith

Shilpasri G Bhat  writes:
> This driver provides interface to mmap the OCC sensor area
> to userspace to parse and read OCC inband sensors.

Why?

Is this for debug? If so, the existing exports interface should be used.

If there are actual sensors, we already have two ways of exposing sensors
to Linux: the OPAL_SENSOR API and the IMC API.

Why use this method rather than the existing ones?

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [RFC][PATCH] KEYS: Replace uid/gid/perm permissions checking with ACL

2017-10-01 Thread Eric Biggers

Hi David,

On Wed, Sep 27, 2017 at 12:41:41PM +0100, David Howells wrote:
> 
> Replace the uid/gid/perm permissions checking on a key with an ACL to allow
> the SETATTR permission to be split.  The problem is that SETATTR covers a
> slew of things, not all of which should be grouped together.  This
> includes:
> 
>  (1) Changing the key ownership.
> 
>  (2) Changing the security information.
> 
>  (3) Keyring restriction.
> 
>  (4) Expiry time.
> 
>  (5) Revocation.
> 
> and it has also been proposed to add:
> 
>  (6) Invalidation.
> 
> The above can be divided into three groups: Controlling access (1), (2) and
> (3), managing the content at construction time (4) and managing the key (5)
> and (6).

This is interesting work, though it adds complexity and makes a lot of subtle
(and potentially breaking) changes to which permissions are required for various
things.  First I think you need to start out with a better statement of the
problems you are trying to solve.  The patch does much more than simply split up
the SETATTR permission --- for example, it also adds the ability to assign
permissions to specific uids, gids, and capabilities.  Who is planning to use
those features and why?

> The KEYCTL_SETATTR function is then deprecated.  If called, it will

KEYCTL_SETPERM

> construct an ACL to reflect the mask it is given, using possessor, owner,
> group and other ACE's as appropriate if any of those elements are granted
> any permissions.  SETATTR permission turns on all of INVAL, REVOKE and
> SET_SECURITY.  WRITE permission turns on WRITE, REVOKE and, if a keyring,
> CLEAR.  JOIN is turned on if a keyring is being altered.

The proposed changes to keyctl_setperm_key() actually never enable INVAL at all,
which doesn't match the description here.  Also, all breaking changes need to be
justified.  If keyctl_setperm(key, KEY_*_SEARCH) is no longer going to allow the
key to be invalidated (as I had proposed earlier), that is really its own change
which needs its own justification; it shouldn't be hidden in a larger patch.

> will return an error if SETACL has been called on a key.

That is simplest, but it doesn't match the behavior of POSIX ACLs, for example.
With POSIX ACLs you can still chmod() a file that has an ACL.

> The KEYCTL_DESCRIBE function then creates a permissions mask to return
> depending on possessor, owner, group and other ACEs, indicating SETATTR if
> any of INVAL, REVOKE and SET_SECURITY are set and indicating WRITE if any
> of WRITE, REVOKE or CLEAR are set.

Ignoring ACEs for specific users, groups, and capabilities may be problematic
because the returned mask will under-estimate rather than over-estimate the
permissions that have been granted.  With POSIX ACLs, for example, the union of
all permissions that have been granted to any subjects other than the regular
ones is reflected in the group entry.  I believe that's generally considered
better from a security perspective, because then no permissions are "hidden"
from a listing of the regular (non-ACL) permissions only.
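
To illustrate the difference, here is a rough sketch of both ways of
deriving a legacy mask from a list of ACEs.  Everything in it is an
assumption made for illustration: the ex_ace type, the subject enum and the
one-byte-per-subject layout (possessor, owner, group, other from high to
low) are hypothetical and are not taken from the patch, and folding into the
group byte only approximates what POSIX ACLs do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subject kinds: four "regular" ones plus specific ids. */
enum ex_subject {
	EX_POSSESSOR,
	EX_OWNER,
	EX_GROUP,
	EX_OTHER,
	EX_SPECIFIC_UID,
	EX_SPECIFIC_GID,
};

struct ex_ace {
	enum ex_subject subject;
	uint8_t perms;		/* permission bits granted to this subject */
};

/*
 * Build a legacy mask with one byte per regular subject.  With
 * fold_specific == 0, ACEs for specific uids/gids are ignored, so the mask
 * can under-report what has been granted.  With fold_specific != 0 they are
 * OR'ed into the group byte, so the mask can only over-report.
 */
static uint32_t ex_legacy_mask(const struct ex_ace *aces, size_t n,
			       int fold_specific)
{
	uint32_t mask = 0;

	assert(aces != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		uint32_t p = aces[i].perms;

		switch (aces[i].subject) {
		case EX_POSSESSOR:
			mask |= p << 24;
			break;
		case EX_OWNER:
			mask |= p << 16;
			break;
		case EX_GROUP:
			mask |= p << 8;
			break;
		case EX_OTHER:
			mask |= p;
			break;
		case EX_SPECIFIC_UID:
		case EX_SPECIFIC_GID:
			if (fold_specific)
				mask |= p << 8;
			break;
		}
	}
	return mask;
}
```

With an owner ACE of 0x3f, a group ACE of 0x01 and a specific-uid ACE of
0x04, the ignoring variant reports 0x003f0100 and the folding variant
reports 0x003f0500, so only the second one reveals that something beyond
the regular entries has been granted.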

> Note that the value subsequently returned by KEYCTL_DESCRIBE may not match
> the value set with KEYCTL_SETATTR - but this is already true because keys
> that lack ->read() can't have READ set and keys that lack ->write() can't
> have WRITE set.

Not true; you *can* set READ on a key that lacks ->read() and WRITE on a key
that lacks ->update().  They are only omitted from the default permissions.

> The KEYCTL_SET_TIMEOUT function then is permitted if WRITE or SETSEC is
> set, or if the caller has a valid instantiation auth token.

This doesn't match the code, which asks for WRITE permission only.  It's also a
breaking change which needs to be justified on its own.  Also I'm not sure that
WRITE permission actually makes sense, given that KEYCTL_SET_TIMEOUT doesn't
modify the payload of the key.

> +static struct key_acl blacklist_key_acl = {
> + .usage  = REFCOUNT_INIT(1),
> + .nr_ace = 2,
> + .aces[0] = {
> + .mask = KEY_ACE_SPECIAL | (KEY_ACE_SEARCH | KEY_ACE_READ),
> + .special_id = KEY_ACE_POSSESSOR,
> + },
> + .aces[1] = {
> + .mask = KEY_ACE_SPECIAL | (KEY_ACE_VIEW | KEY_ACE_SEARCH),
> + .special_id = KEY_ACE_OWNER,
> + },
> +};

Designators into flexible arrays are a gcc extension which doesn't work with
clang.  Use this instead:

	.aces = {
		{
			.mask = KEY_ACE_SPECIAL | (KEY_ACE_SEARCH | KEY_ACE_READ),
			.special_id = KEY_ACE_POSSESSOR,
		},
		{
			.mask = KEY_ACE_SPECIAL | (KEY_ACE_VIEW | KEY_ACE_SEARCH),
			.special_id = KEY_ACE_OWNER,
		},
	},
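
For reference, a standalone version of the same initializer style that can
be compiled outside the kernel.  The ex_ace and ex_acl types and the EX_*
constants are simplified, hypothetical stand-ins and not the definitions
from the patch.  Note that statically initializing a flexible array member
at all is still a GNU C extension; as far as I know both gcc and clang
accept the nested-brace form.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the structures in the patch. */
struct ex_ace {
	unsigned int mask;
	unsigned int special_id;
};

struct ex_acl {
	unsigned int nr_ace;
	struct ex_ace aces[];	/* flexible array member */
};

/* Made-up permission bits and special ids, for illustration only. */
enum { EX_VIEW = 0x01, EX_READ = 0x02, EX_SEARCH = 0x08, EX_SPECIAL = 0x100 };
enum { EX_POSSESSOR = 1, EX_OWNER = 2 };

/* Nested braces instead of .aces[0] = ..., .aces[1] = ... designators. */
static struct ex_acl example_acl = {
	.nr_ace = 2,
	.aces = {
		{
			.mask = EX_SPECIAL | (EX_SEARCH | EX_READ),
			.special_id = EX_POSSESSOR,
		},
		{
			.mask = EX_SPECIAL | (EX_VIEW | EX_SEARCH),
			.special_id = EX_OWNER,
		},
	},
};

/* Return the ACE for a special id, or NULL if the ACL has none. */
static const struct ex_ace *ex_find(const struct ex_acl *acl,
				    unsigned int special_id)
{
	assert(acl != NULL);
	for (unsigned int i = 0; i < acl->nr_ace; i++)
		if (acl->aces[i].special_id == special_id)
			return &acl->aces[i];
	return NULL;
}
```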

It's also difficult to read these lists of ACEs.  An ACE should read as

[Patch v4 14/22] CIFS: SMBD: Implement function to send data via RDMA send

2017-10-01 Thread Long Li

From: Long Li 

The transport doesn't maintain send buffers or send queue for transferring
payload via RDMA send. There is no data copy in the transport on send.

Signed-off-by: Long Li 
---
 fs/cifs/smbdirect.c | 248 
 fs/cifs/smbdirect.h |   4 +
 2 files changed, 252 insertions(+)

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index b9be9d6..90e2c94 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -42,6 +42,12 @@ static int smbd_post_recv(
struct smbd_response *response);
 
 static int smbd_post_send_empty(struct smbd_connection *info);
+static int smbd_post_send_data(
+   struct smbd_connection *info,
+   struct kvec *iov, int n_vec, int remaining_data_length);
+static int smbd_post_send_page(struct smbd_connection *info,
+   struct page *page, unsigned long offset,
+   size_t size, int remaining_data_length);
 
 /* SMBD version number */
 #define SMBD_V1 0x0100
@@ -198,6 +204,10 @@ static void smbd_destroy_rdma_work(struct work_struct *work)
log_rdma_event(INFO, "cancelling send immediate work\n");
cancel_delayed_work_sync(&info->send_immediate_work);
 
+   log_rdma_event(INFO, "wait for all send to finish\n");
+   wait_event(info->wait_smbd_send_pending,
+   info->smbd_send_pending == 0);
+
log_rdma_event(INFO, "wait for all recv to finish\n");
wake_up_interruptible(&info->wait_reassembly_queue);
wait_event(info->wait_smbd_recv_pending,
@@ -1103,6 +1113,24 @@ static int smbd_post_send_sgl(struct smbd_connection *info,
 }
 
 /*
+ * Send a page
+ * page: the page to send
+ * offset: offset in the page to send
+ * size: length in the page to send
+ * remaining_data_length: remaining data to send in this payload
+ */
+static int smbd_post_send_page(struct smbd_connection *info, struct page *page,
+   unsigned long offset, size_t size, int remaining_data_length)
+{
+   struct scatterlist sgl;
+
+   sg_init_table(&sgl, 1);
+   sg_set_page(&sgl, page, size, offset);
+
+   return smbd_post_send_sgl(info, &sgl, size, remaining_data_length);
+}
+
+/*
  * Send an empty message
  * Empty message is used to extend credits to peer for keep alive
  * while there is no upper layer payload to send at the time
@@ -1114,6 +1142,35 @@ static int smbd_post_send_empty(struct smbd_connection *info)
 }
 
 /*
+ * Send a data buffer
+ * iov: the iov array describing the data buffers
+ * n_vec: number of entries in the iov array
+ * remaining_data_length: remaining data to send following this packet
+ * in segmented SMBD packet
+ */
+static int smbd_post_send_data(
+   struct smbd_connection *info, struct kvec *iov, int n_vec,
+   int remaining_data_length)
+{
+   int i;
+   u32 data_length = 0;
+   struct scatterlist sgl[SMBDIRECT_MAX_SGE];
+
+   if (n_vec > SMBDIRECT_MAX_SGE) {
+   cifs_dbg(VFS, "Can't fit data to SGL, n_vec=%d\n", n_vec);
+   return -ENOMEM;
+   }
+
+   sg_init_table(sgl, n_vec);
+   for (i = 0; i < n_vec; i++) {
+   data_length += iov[i].iov_len;
+   sg_set_buf(&sgl[i], iov[i].iov_base, iov[i].iov_len);
+   }
+
+   return smbd_post_send_sgl(info, sgl, data_length, remaining_data_length);
+}
+
+/*
  * Post a receive request to the transport
  * The remote peer can only send data when a receive request is posted
  * The interaction is controlled by send/receive credit system
@@ -1680,6 +1737,9 @@ struct smbd_connection *_smbd_get_connection(
queue_delayed_work(info->workqueue, &info->idle_timer_work,
info->keep_alive_interval*HZ);
 
+   init_waitqueue_head(&info->wait_smbd_send_pending);
+   info->smbd_send_pending = 0;
+
init_waitqueue_head(&info->wait_smbd_recv_pending);
info->smbd_recv_pending = 0;
 
@@ -1973,3 +2033,191 @@ int smbd_recv(struct smbd_connection *info, struct msghdr *msg)
msg->msg_iter.count = 0;
return rc;
 }
+
+/*
+ * Send data to transport
+ * Each rqst is transported as a SMBDirect payload
+ * rqst: the data to write
+ * return value: 0 on successful write, otherwise error code
+ */
+int smbd_send(struct smbd_connection *info, struct smb_rqst *rqst)
+{
+   struct kvec vec;
+   int nvecs;
+   int size;
+   int buflen = 0, remaining_data_length;
+   int start, i, j;
+   int max_iov_size =
+   info->max_send_size - sizeof(struct smbd_data_transfer);
+   struct kvec iov[SMBDIRECT_MAX_SGE];
+   int rc;
+   unsigned long long t1 = rdtsc();
+
+   info->smbd_send_pending++;
+   if (info->transport_status != SMBD_CONNECTED) {
+   rc = -ENODEV;
+   goto done;
+   }
+
+   /*
+* This usually means a configuration error
+* We use RDMA read/write for packet size > rdma_readwrite_threshold
+* as long as it's

[Patch v4 02/22] CIFS: SMBD: Establish SMBDirect connection

2017-10-01 Thread Long Li

From: Long Li 

Add code to implement the core functions to establish a SMBDirect connection.

1. Establish an RDMA connection to SMB server.
2. Negotiate and setup SMBDirect protocol.
3. Implement idle connection timer and credit management.

Add to Makefile.

Signed-off-by: Long Li 
---
 fs/cifs/Makefile|2 +-
 fs/cifs/smbdirect.c | 1600 +++
 fs/cifs/smbdirect.h |  229 
 3 files changed, 1830 insertions(+), 1 deletion(-)

diff --git a/fs/cifs/Makefile b/fs/cifs/Makefile
index 5e853a3..bb54662b 100644
--- a/fs/cifs/Makefile
+++ b/fs/cifs/Makefile
@@ -8,7 +8,7 @@ cifs-y := cifsfs.o cifssmb.o cifs_debug.o connect.o dir.o file.o inode.o \
  cifs_unicode.o nterr.o cifsencrypt.o \
  readdir.o ioctl.o sess.o export.o smb1ops.o winucase.o \
  smb2ops.o smb2maperror.o smb2transport.o \
- smb2misc.o smb2pdu.o smb2inode.o smb2file.o
+ smb2misc.o smb2pdu.o smb2inode.o smb2file.o smbdirect.o
 
 cifs-$(CONFIG_CIFS_XATTR) += xattr.o
 cifs-$(CONFIG_CIFS_ACL) += cifsacl.o
diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index d3c16f8..e8f976f 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -13,7 +13,35 @@
  *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See
  *   the GNU General Public License for more details.
  */
+#include 
 #include "smbdirect.h"
+#include "cifs_debug.h"
+#include 
+
+static struct smbd_response *get_empty_queue_buffer(
+   struct smbd_connection *info);
+static struct smbd_response *get_receive_buffer(
+   struct smbd_connection *info);
+static void put_receive_buffer(
+   struct smbd_connection *info,
+   struct smbd_response *response,
+   bool lock);
+static int allocate_receive_buffers(struct smbd_connection *info, int num_buf);
+static void destroy_receive_buffers(struct smbd_connection *info);
+
+static void put_empty_packet(
+   struct smbd_connection *info, struct smbd_response *response);
+static void enqueue_reassembly(
+   struct smbd_connection *info,
+   struct smbd_response *response, int data_length);
+static struct smbd_response *_get_first_reassembly(
+   struct smbd_connection *info);
+
+static int smbd_post_recv(
+   struct smbd_connection *info,
+   struct smbd_response *response);
+
+static int smbd_post_send_empty(struct smbd_connection *info);
 
 /* SMBD version number */
 #define SMBD_V1 0x0100
@@ -75,3 +103,1575 @@ int smbd_max_frmr_depth = 2048;
 
 /* If payload is smaller than this many bytes, use RDMA send/recv not read/write */
 int rdma_readwrite_threshold = 4096;
+
+/* Transport logging functions
+ * Log messages are grouped into classes. Classes can be OR'ed to define the
+ * actual logging level via module parameter smbd_logging_class
+ * e.g. cifs.smbd_logging_class=0xa0 will log all log_rdma_recv() and
+ * log_rdma_event()
+ */
+#define LOG_OUTGOING   0x1
+#define LOG_INCOMING   0x2
+#define LOG_READ   0x4
+#define LOG_WRITE  0x8
+#define LOG_RDMA_SEND  0x10
+#define LOG_RDMA_RECV  0x20
+#define LOG_KEEP_ALIVE 0x40
+#define LOG_RDMA_EVENT 0x80
+#define LOG_RDMA_MR    0x100
+static unsigned int smbd_logging_class = 0;
+module_param(smbd_logging_class, uint, 0644);
+MODULE_PARM_DESC(smbd_logging_class,
+   "Logging class for SMBD transport 0x0 to 0x100");
+
+#define ERR    0x0
+#define INFO   0x1
+static unsigned int smbd_logging_level = ERR;
+module_param(smbd_logging_level, uint, 0644);
+MODULE_PARM_DESC(smbd_logging_level,
+   "Logging level for SMBD transport, 0 (default): error, 1: info");
+
+#define log_rdma(level, class, fmt, args...)   \
+do {   \
+   if (level <= smbd_logging_level || class & smbd_logging_class)  \
+   cifs_dbg(VFS, "%s:%d " fmt, __func__, __LINE__, ##args);\
+} while (0)
+
+#define log_outgoing(level, fmt, args...) \
+   log_rdma(level, LOG_OUTGOING, fmt, ##args)
+#define log_incoming(level, fmt, args...) \
+   log_rdma(level, LOG_INCOMING, fmt, ##args)
+#define log_read(level, fmt, args...)  log_rdma(level, LOG_READ, fmt, ##args)
+#define log_write(level, fmt, args...) log_rdma(level, LOG_WRITE, fmt, ##args)
+#define log_rdma_send(level, fmt, args...) \
+   log_rdma(level, LOG_RDMA_SEND, fmt, ##args)
+#define log_rdma_recv(level, fmt, args...) \
+   log_rdma(level, LOG_RDMA_RECV, fmt, ##args)
+#define log_keep_alive(level, fmt, args...) \
+   log_rdma(level, LOG_KEEP_ALIVE, fmt, ##args)
+#define log_rdma_event(level, fmt, args...) \
+   log_rdma(level, LOG_RDMA_EVENT, fmt, ##args)
+#define log_rdma_mr(level, fmt, args...) \
+   log

[Patch v4 01/22] CIFS: SMBD: Add SMBDirect protocol initial values and constants

2017-10-01 Thread Long Li

From: Long Li 

To prepare for protocol implementation, add constants and user-configurable
values in the SMBDirect protocol.

Signed-off-by: Long Li 
---
 fs/cifs/smbdirect.c | 77 +
 fs/cifs/smbdirect.h | 21 +++
 2 files changed, 98 insertions(+)
 create mode 100644 fs/cifs/smbdirect.c
 create mode 100644 fs/cifs/smbdirect.h

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
new file mode 100644
index 000..d3c16f8
--- /dev/null
+++ b/fs/cifs/smbdirect.c
@@ -0,0 +1,77 @@
+/*
+ *   Copyright (C) 2017, Microsoft Corporation.
+ *
+ *   Author(s): Long Li 
+ *
+ *   This program is free software;  you can redistribute it and/or modify
+ *   it under the terms of the GNU General Public License as published by
+ *   the Free Software Foundation; either version 2 of the License, or
+ *   (at your option) any later version.
+ *
+ *   This program is distributed in the hope that it will be useful,
+ *   but WITHOUT ANY WARRANTY;  without even the implied warranty of
+ *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See
+ *   the GNU General Public License for more details.
+ */
+#include "smbdirect.h"
+
+/* SMBD version number */
+#define SMBD_V1    0x0100
+
+/* Port numbers for SMBD transport */
+#define SMB_PORT   445
+#define SMBD_PORT  5445
+
+/* Address lookup and resolve timeout in ms */
+#define RDMA_RESOLVE_TIMEOUT   5000
+
+/* SMBD negotiation timeout in seconds */
+#define SMBD_NEGOTIATE_TIMEOUT 120
+
+/* SMBD minimum receive size and fragmented size defined in [MS-SMBD] */
+#define SMBD_MIN_RECEIVE_SIZE  128
+#define SMBD_MIN_FRAGMENTED_SIZE   131072
+
+/*
+ * Default maximum number of RDMA read/write outstanding on this connection
+ * This value is possibly decreased during QP creation on hardware limit
+ */
+#define SMBD_CM_RESPONDER_RESOURCES    32
+
+/* Maximum number of retries on data transfer operations */
+#define SMBD_CM_RETRY  6
+/* No need to retry on Receiver Not Ready since SMBD manages credits */
+#define SMBD_CM_RNR_RETRY  0
+
+/*
+ * User configurable initial values per SMBD transport connection
+ * as defined in [MS-SMBD] 3.1.1.1
+ * Those may change after a SMBD negotiation
+ */
+/* The local peer's maximum number of credits to grant to the peer */
+int smbd_receive_credit_max = 255;
+
+/* The remote peer's credit request of local peer */
+int smbd_send_credit_target = 255;
+
+/* The maximum size of a single message that can be sent to the remote peer */
+int smbd_max_send_size = 1364;
+
+/*  The maximum fragmented upper-layer payload receive size supported */
+int smbd_max_fragmented_recv_size = 1024 * 1024;
+
+/*  The maximum single-message size which can be received */
+int smbd_max_receive_size = 8192;
+
+/* The timeout to initiate send of a keepalive message on idle */
+int smbd_keep_alive_interval = 120;
+
+/*
+ * User configurable initial values for RDMA transport
+ * The actual values used may be lower and are limited to hardware capabilities
+ */
+/* Default maximum number of SGEs in a RDMA write/read */
+int smbd_max_frmr_depth = 2048;
+
+/* If payload is smaller than this many bytes, use RDMA send/recv not read/write */
+int rdma_readwrite_threshold = 4096;
diff --git a/fs/cifs/smbdirect.h b/fs/cifs/smbdirect.h
new file mode 100644
index 000..c55f28b
--- /dev/null
+++ b/fs/cifs/smbdirect.h
@@ -0,0 +1,21 @@
+/*
+ *   Copyright (C) 2017, Microsoft Corporation.
+ *
+ *   Author(s): Long Li 
+ *
+ *   This program is free software;  you can redistribute it and/or modify
+ *   it under the terms of the GNU General Public License as published by
+ *   the Free Software Foundation; either version 2 of the License, or
+ *   (at your option) any later version.
+ *
+ *   This program is distributed in the hope that it will be useful,
+ *   but WITHOUT ANY WARRANTY;  without even the implied warranty of
+ *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See
+ *   the GNU General Public License for more details.
+ */
+#ifndef _SMBDIRECT_H
+#define _SMBDIRECT_H
+
+/* Default maximum number of SGEs in a RDMA send/recv */
+#define SMBDIRECT_MAX_SGE  16
+#endif
-- 
2.7.4

[Patch v4 06/22] CIFS: SMBD: Upper layer connects to SMBDirect session

2017-10-01 Thread Long Li

From: Long Li 

When "rdma" is specified in the mount option, CIFS attempts to connect to
SMBDirect instead of TCP socket.

Signed-off-by: Long Li 
---
 fs/cifs/connect.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index b5a575f..94b6357 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -45,6 +45,7 @@
 #include 
 #include 
 
+#include "smbdirect.h"
 #include "cifspdu.h"
 #include "cifsglob.h"
 #include "cifsproto.h"
@@ -2280,12 +2281,26 @@ cifs_get_tcp_session(struct smb_vol *volume_info)
else
tcp_ses->echo_interval = SMB_ECHO_INTERVAL_DEFAULT * HZ;
 
+   if (tcp_ses->rdma) {
+   tcp_ses->smbd_conn = smbd_get_connection(
+   tcp_ses, (struct sockaddr *)&volume_info->dstaddr);
+   if (tcp_ses->smbd_conn) {
+   cifs_dbg(VFS, "RDMA transport established\n");
+   rc = 0;
+   goto connected;
+   } else {
+   rc = -ENOENT;
+   goto out_err_crypto_release;
+   }
+   }
+
rc = ip_connect(tcp_ses);
if (rc < 0) {
cifs_dbg(VFS, "Error connecting to socket. Aborting operation.\n");
goto out_err_crypto_release;
}
 
+connected:
/*
 * since we're in a cifs function already, we know that
 * this will succeed. No need for try_module_get().
-- 
2.7.4

[Patch v4 07/22] CIFS: SMBD: Implement function to reconnect to a SMBDirect transport

2017-10-01 Thread Long Li

From: Long Li 

Add function to implement a reconnect to SMBDirect. This involves tearing down
the current connection and establishing/negotiating a new connection.

Signed-off-by: Long Li 
---
 fs/cifs/smbdirect.c | 36 
 fs/cifs/smbdirect.h |  3 +++
 2 files changed, 39 insertions(+)

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index 34f73e2..1f0f33c 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -1416,6 +1416,42 @@ static void idle_connection_timer(struct work_struct *work)
info->keep_alive_interval*HZ);
 }
 
+/*
+ * Reconnect this SMBD connection, called from upper layer
+ * return value: 0 on success, or actual error code
+ */
+int smbd_reconnect(struct TCP_Server_Info *server)
+{
+   log_rdma_event(INFO, "reconnecting rdma session\n");
+
+   if (!server->smbd_conn) {
+   log_rdma_event(ERR, "rdma session already destroyed\n");
+   return -EINVAL;
+   }
+
+   /*
+* This is possible if transport is disconnected and we haven't received
+* notification from RDMA, but upper layer has detected timeout
+*/
+   if (server->smbd_conn->transport_status == SMBD_CONNECTED) {
+   log_rdma_event(INFO, "disconnecting transport\n");
+   smbd_disconnect_rdma_connection(server->smbd_conn);
+   }
+
+   /* wait until the transport is destroyed */
+   wait_event(server->smbd_conn->wait_destroy,
+   server->smbd_conn->transport_status == SMBD_DESTROYED);
+
+   destroy_workqueue(server->smbd_conn->workqueue);
+   kfree(server->smbd_conn);
+
+   log_rdma_event(INFO, "creating rdma session\n");
+   server->smbd_conn = smbd_get_connection(
+   server, (struct sockaddr *) &server->dstaddr);
+
+   return server->smbd_conn ? 0 : -ENOENT;
+}
+
 static void destroy_caches_and_workqueue(struct smbd_connection *info)
 {
destroy_receive_buffers(info);
diff --git a/fs/cifs/smbdirect.h b/fs/cifs/smbdirect.h
index 42a9338..9818852 100644
--- a/fs/cifs/smbdirect.h
+++ b/fs/cifs/smbdirect.h
@@ -249,6 +249,9 @@ struct smbd_response {
 struct smbd_connection *smbd_get_connection(
struct TCP_Server_Info *server, struct sockaddr *dstaddr);
 
+/* Reconnect SMBDirect session */
+int smbd_reconnect(struct TCP_Server_Info *server);
+
 void profiling_display_histogram(
struct seq_file *m, unsigned long long array[]);
 #endif
-- 
2.7.4

[Patch v4 11/22] CIFS: SMBD: Set SMBDirect maximum read or write size for I/O

2017-10-01 Thread Long Li

From: Long Li 

When connecting over SMBDirect, the transport negotiates its maximum I/O sizes
with the server and determines how to choose to do RDMA send/recv vs
read/write. Expose these maximum I/O sizes to upper layer so we will get the
correct sized payloads.

Signed-off-by: Long Li 
---
 fs/cifs/smb2ops.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/cifs/smb2ops.c b/fs/cifs/smb2ops.c
index fb2934b..7ad35d6 100644
--- a/fs/cifs/smb2ops.c
+++ b/fs/cifs/smb2ops.c
@@ -32,6 +32,7 @@
 #include "smb2status.h"
 #include "smb2glob.h"
 #include "cifs_ioctl.h"
+#include "smbdirect.h"
 
 static int
 change_conf(struct TCP_Server_Info *server)
@@ -249,7 +250,11 @@ smb2_negotiate_wsize(struct cifs_tcon *tcon, struct smb_vol *volume_info)
 
/* start with specified wsize, or default */
wsize = volume_info->wsize ? volume_info->wsize : CIFS_DEFAULT_IOSIZE;
-   wsize = min_t(unsigned int, wsize, server->max_write);
+   if (server->rdma)
+   wsize = min_t(unsigned int,
+   wsize, server->smbd_conn->max_readwrite_size);
+   else
+   wsize = min_t(unsigned int, wsize, server->max_write);
 
if (!(server->capabilities & SMB2_GLOBAL_CAP_LARGE_MTU))
wsize = min_t(unsigned int, wsize, SMB2_MAX_BUFFER_SIZE);
@@ -265,7 +270,11 @@ smb2_negotiate_rsize(struct cifs_tcon *tcon, struct smb_vol *volume_info)
 
/* start with specified rsize, or default */
rsize = volume_info->rsize ? volume_info->rsize : CIFS_DEFAULT_IOSIZE;
-   rsize = min_t(unsigned int, rsize, server->max_read);
+   if (server->rdma)
+   rsize = min_t(unsigned int,
+   rsize, server->smbd_conn->max_readwrite_size);
+   else
+   rsize = min_t(unsigned int, rsize, server->max_read);
 
if (!(server->capabilities & SMB2_GLOBAL_CAP_LARGE_MTU))
rsize = min_t(unsigned int, rsize, SMB2_MAX_BUFFER_SIZE);
-- 
2.7.4
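
The size selection in the two hunks above can be sketched in isolation as
follows.  This is only an illustration of the steps visible in the diff
context, not the actual CIFS code: the ex_limits structure, the
ex_negotiate_iosize() helper and the two EX_* constants are hypothetical,
and their values are picked purely for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constants, chosen only for illustration. */
#define EX_DEFAULT_IOSIZE	(1024 * 1024)
#define EX_MAX_BUFFER_SIZE	65536

struct ex_limits {
	int rdma;			/* non-zero when the mount uses SMBDirect */
	unsigned int server_max;	/* limit from SMB2 negotiate (max_write or max_read) */
	unsigned int rdma_max;		/* limit from SMBDirect negotiate (max_readwrite_size) */
	int large_mtu;			/* server advertises the large MTU capability */
};

static unsigned int ex_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Start from the mount option (0 means "not specified"), cap it by the
 * transport-specific limit, then apply the small-buffer cap if the server
 * lacks large MTU support.
 */
static unsigned int ex_negotiate_iosize(unsigned int requested,
					const struct ex_limits *lim)
{
	unsigned int size = requested ? requested : EX_DEFAULT_IOSIZE;

	assert(lim != NULL);
	size = ex_min(size, lim->rdma ? lim->rdma_max : lim->server_max);
	if (!lim->large_mtu)
		size = ex_min(size, EX_MAX_BUFFER_SIZE);
	return size;
}
```

The point of the change is the second step: on an RDMA mount the cap comes
from the SMBDirect connection rather than from the SMB2 negotiate response.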

[Patch v4 08/22] CIFS: SMBD: Upper layer reconnects to SMBDirect session

2017-10-01 Thread Long Li

From: Long Li 

Do a reconnect on SMBDirect when it is used as the connection. Reconnect can
happen for many reasons and it's mostly the decision of upper layer SMB2.

Signed-off-by: Long Li 
---
 fs/cifs/connect.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 94b6357..26ad706 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -405,7 +405,11 @@ cifs_reconnect(struct TCP_Server_Info *server)
 
/* we should try only the port we connected to before */
mutex_lock(&server->srv_mutex);
-   rc = generic_ip_connect(server);
+   if (server->rdma)
+   rc = smbd_reconnect(server);
+   else
+   rc = generic_ip_connect(server);
+
if (rc) {
cifs_dbg(FYI, "reconnect error %d\n", rc);
mutex_unlock(&server->srv_mutex);
-- 
2.7.4

[Patch v4 21/22] CIFS: SMBD: Upper layer performs SMB read via RDMA write through memory registration

2017-10-01 Thread Long Li

From: Long Li 

If I/O size is larger than rdma_readwrite_threshold, use RDMA write for
SMB read by specifying channel SMB2_CHANNEL_RDMA_V1 or
SMB2_CHANNEL_RDMA_V1_INVALIDATE in the SMB packet, depending on SMB dialect
used. Append a smbd_buffer_descriptor_v1 to the end of the SMB packet and fill
in other values to indicate this SMB read uses RDMA write.

There is no need to read from the transport for incoming payload. At the time
SMB read response comes back, the data is already transferred and placed in the
pages by RDMA hardware.

When SMB read is finished, deregister the memory regions if RDMA write is used
for this SMB read. smbd_deregister_mr may need to do local invalidation and
sleep, if server remote invalidation is not used.

There are situations where the MID may not be created on I/O failure, in
which case the memory region is deregistered when the read data context is
released.

Signed-off-by: Long Li 
---
 fs/cifs/cifsglob.h |  1 +
 fs/cifs/file.c | 10 ++
 fs/cifs/smb2pdu.c  | 43 +++
 3 files changed, 54 insertions(+)

diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index f851b50..30b99a5 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -1152,6 +1152,7 @@ struct cifs_readdata {
struct cifs_readdata *rdata,
struct iov_iter *iter);
struct kvec iov[2];
+   struct smbd_mr  *mr;
unsigned int pagesz;
unsigned int tailsz;
unsigned int credits;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0786f19..8396f1e 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -42,6 +42,7 @@
 #include "cifs_debug.h"
 #include "cifs_fs_sb.h"
 #include "fscache.h"
+#include "smbdirect.h"
 
 
 static inline int cifs_convert_flags(unsigned int flags)
@@ -2909,6 +2910,11 @@ cifs_readdata_release(struct kref *refcount)
struct cifs_readdata *rdata = container_of(refcount,
struct cifs_readdata, refcount);
 
+   if (rdata->mr) {
+   smbd_deregister_mr(rdata->mr);
+   rdata->mr = NULL;
+   }
+
if (rdata->cfile)
cifsFileInfo_put(rdata->cfile);
 
@@ -3037,6 +3043,8 @@ uncached_fill_pages(struct TCP_Server_Info *server,
}
if (iter)
result = copy_page_from_iter(page, 0, n, iter);
+   else if (rdata->mr)
+   result = n;
else
result = cifs_read_page_from_socket(server, page, n);
if (result < 0)
@@ -3606,6 +3614,8 @@ readpages_fill_pages(struct TCP_Server_Info *server,
 
if (iter)
result = copy_page_from_iter(page, 0, n, iter);
+   else if (rdata->mr)
+   result = n;
else
result = cifs_read_page_from_socket(server, page, n);
if (result < 0)
diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index 7053db9..31dcee0 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -2380,6 +2380,39 @@ smb2_new_read_req(void **buf, unsigned int *total_len,
req->Length = cpu_to_le32(io_parms->length);
req->Offset = cpu_to_le64(io_parms->offset);
 
+   /*
+* If we want to do a RDMA write, fill in and append
+* smbd_buffer_descriptor_v1 to the end of read request
+*/
+   if (server->rdma && rdata &&
+   rdata->bytes >= server->smbd_conn->rdma_readwrite_threshold) {
+
+   struct smbd_buffer_descriptor_v1 *v1;
+   bool need_invalidate =
+   io_parms->tcon->ses->server->dialect == SMB30_PROT_ID;
+
+   rdata->mr = smbd_register_mr(
+   server->smbd_conn, rdata->pages,
+   rdata->nr_pages, rdata->tailsz,
+   true, need_invalidate);
+   if (!rdata->mr)
+   return -ENOBUFS;
+
+   req->Channel = SMB2_CHANNEL_RDMA_V1_INVALIDATE;
+   if (need_invalidate)
+   req->Channel = SMB2_CHANNEL_RDMA_V1;
+   req->ReadChannelInfoOffset =
+   offsetof(struct smb2_read_plain_req, Buffer);
+   req->ReadChannelInfoLength =
+   sizeof(struct smbd_buffer_descriptor_v1);
+   v1 = (struct smbd_buffer_descriptor_v1 *) &req->Buffer[0];
+   v1->offset = rdata->mr->mr->iova;
+   v1->token = rdata->mr->mr->rkey;
+   v1->length = rdata->mr->mr->length;
+
+   *total_len += sizeof(*v1) - 1;
+   }
+
if (request_type & CHAINED_REQUEST) {
if (!(request_type & END_OF_CHAIN)) {
/* next 8-byte aligned request */
@@ -2459,6 +2492,16

[Patch v4 12/22] CIFS: SMBD: Implement function to receive data via RDMA receive

2017-10-01 Thread Long Li

From: Long Li 

On the receive path, the transport maintains receive buffers and a reassembly
queue for transferring payload via RDMA recv. There is a data copy in the
transport on recv when it copies the payload to the upper layer.
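
As an illustration of the copy described above, here is a simplified userspace sketch. It is not the code from this patch: the structures and the name queue_read are hypothetical, the queue is a plain array rather than a locked list, and short reads are allowed for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One received fragment waiting in the (hypothetical) reassembly queue. */
struct fragment {
	const char *data;
	size_t len;
};

/*
 * Simplified queue: fragments [head, count) are pending; first_offset bytes
 * of fragment `head` were already consumed by an earlier read.
 */
struct reassembly_queue {
	const struct fragment *frags;
	size_t count;
	size_t head;
	size_t first_offset;
};

/* Copy up to size bytes into buf; returns the number of bytes copied. */
size_t queue_read(struct reassembly_queue *q, char *buf, size_t size)
{
	size_t copied = 0;

	while (copied < size && q->head < q->count) {
		const struct fragment *f = &q->frags[q->head];
		size_t avail, n;

		assert(q->first_offset <= f->len);
		avail = f->len - q->first_offset;
		n = size - copied < avail ? size - copied : avail;

		memcpy(buf + copied, f->data + q->first_offset, n);
		copied += n;

		if (n == avail) {
			/* fragment fully consumed: advance to the next one */
			q->head++;
			q->first_offset = 0;
		} else {
			/* partial read: remember where to resume */
			q->first_offset += n;
		}
	}
	return copied;
}
```

The resume offset is what lets a later read start in the middle of a fragment that an earlier read only partly consumed.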

The transport recognizes the RFC1002 header length used in the SMB
upper layer payloads in CIFS. Because this length is mainly used for TCP and
is not applicable to RDMA, it is handled as out-of-band information and is
never sent over the wire. The transport behaves like TCP to the upper layer
by processing and exposing the length correctly on data payloads.
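
A minimal sketch of how such a length prefix could be synthesized for the upper layer. This is an assumption-laden illustration: it assumes the total length is the first fragment's data_length plus its remaining_data_length, and treats the prefix as a plain 4-byte big-endian value. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Assumed rule: total upper-layer length = bytes in the first fragment
 * plus the bytes the sender reports as still to come.
 */
uint32_t total_payload_len(uint32_t data_length, uint32_t remaining_data_length)
{
	return data_length + remaining_data_length;
}

/* Write len as a 4-byte big-endian prefix, as seen on a TCP stream. */
void put_rfc1002_len(uint8_t hdr[4], uint32_t len)
{
	assert(hdr != NULL);
	hdr[0] = (uint8_t)(len >> 24);
	hdr[1] = (uint8_t)(len >> 16);
	hdr[2] = (uint8_t)(len >> 8);
	hdr[3] = (uint8_t)len;
}

/* Read the prefix back, the way an upper layer would. */
uint32_t get_rfc1002_len(const uint8_t hdr[4])
{
	assert(hdr != NULL);
	return ((uint32_t)hdr[0] << 24) | ((uint32_t)hdr[1] << 16) |
	       ((uint32_t)hdr[2] << 8) | (uint32_t)hdr[3];
}
```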

Signed-off-by: Long Li 
---
 fs/cifs/smbdirect.c | 229 
 fs/cifs/smbdirect.h |   6 ++
 2 files changed, 235 insertions(+)

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index cb129c2..b9be9d6 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -200,6 +200,8 @@ static void smbd_destroy_rdma_work(struct work_struct *work)
 
log_rdma_event(INFO, "wait for all recv to finish\n");
wake_up_interruptible(&info->wait_reassembly_queue);
+   wait_event(info->wait_smbd_recv_pending,
+   info->smbd_recv_pending == 0);
 
log_rdma_event(INFO, "wait for all send posted to IB to finish\n");
wait_event(info->wait_send_pending,
@@ -1678,6 +1680,9 @@ struct smbd_connection *_smbd_get_connection(
queue_delayed_work(info->workqueue, &info->idle_timer_work,
info->keep_alive_interval*HZ);
 
+   init_waitqueue_head(&info->wait_smbd_recv_pending);
+   info->smbd_recv_pending = 0;
+
init_waitqueue_head(&info->wait_send_pending);
atomic_set(&info->send_pending, 0);
 
@@ -1744,3 +1749,227 @@ struct smbd_connection *smbd_get_connection(
}
return ret;
 }
+
+/*
+ * Receive data from receive reassembly queue
+ * All the incoming data packets are placed in reassembly queue
+ * buf: the buffer to read data into
+ * size: the length of data to read
+ * return value: actual data read
+ * Note: this implementation copies the data from reassembly queue to receive
+ * buffers used by upper layer. This is not the optimal code path. A better way
+ * to do it is to not have upper layer allocate its receive buffers but rather
+ * borrow the buffer from reassembly queue, and return it after data is
+ * consumed. But this will require more changes to upper layer code, and also
+ * need to consider packet boundaries while they are still being reassembled.
+ */
+int smbd_recv_buf(struct smbd_connection *info, char *buf, unsigned int size)
+{
+   struct smbd_response *response;
+   struct smbd_data_transfer *data_transfer;
+   int to_copy, to_read, data_read, offset;
+   u32 data_length, remaining_data_length, data_offset;
+   int rc;
+   unsigned long flags;
+
+again:
+   if (info->transport_status != SMBD_CONNECTED) {
+   log_read(ERR, "disconnected\n");
+   return -ENODEV;
+   }
+
+   /*
+* No need to hold the reassembly queue lock all the time as we are
+* the only one reading from the front of the queue. The transport
+* may add more entries to the back of the queue at the same time
+*/
+   log_read(INFO, "size=%d info->reassembly_data_length=%d\n", size,
+   info->reassembly_data_length);
+   if (info->reassembly_data_length >= size) {
+   unsigned long long t1 = rdtsc();
+   int queue_length;
+   int queue_removed = 0;
+
+   /*
+* Need to make sure reassembly_data_length is read before
+* reading reassembly_queue_length and calling
+* _get_first_reassembly. This call is lock free
+* as we never read at the end of the queue which are being
+* updated in SOFTIRQ as more data is received
+*/
+   virt_rmb();
+   queue_length = info->reassembly_queue_length;
+   data_read = 0;
+   to_read = size;
+   offset = info->first_entry_offset;
+   while (data_read < size) {
+   response = _get_first_reassembly(info);
+   data_transfer = smbd_response_payload(response);
+   data_length = le32_to_cpu(data_transfer->data_length);
+   remaining_data_length =
+   le32_to_cpu(
+   data_transfer->remaining_data_length);
+   data_offset = le32_to_cpu(data_transfer->data_offset);
+
+   /*
+* The upper layer expects RFC1002 length at the
+* beginning of the payload. Return it to indicate
+* the total length of the packet. This minimize the
+* change to upper layer packet processing logic. This
+* w

[Patch v4 10/22] CIFS: SMBD: Upper layer destroys SMBDirect session on shutdown or umount

2017-10-01 Thread Long Li

From: Long Li 

When CIFS unmounts, shut down the transport if SMBDirect is in use.

Signed-off-by: Long Li 
---
 fs/cifs/connect.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 26ad706..1a9f22f 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -704,6 +704,11 @@ static void clean_demultiplex_info(struct TCP_Server_Info 
*server)
/* give those requests time to exit */
msleep(125);
 
+   if (server->smbd_conn) {
+   smbd_destroy(server->smbd_conn);
+   server->smbd_conn = NULL;
+   }
+
if (server->ssocket) {
sock_release(server->ssocket);
server->ssocket = NULL;
-- 
2.7.4

[Patch v4 04/22] CIFS: SMBD: Add rdma mount option

2017-10-01 Thread Long Li

From: Long Li 

Add "rdma" to CIFS mount options to connect to SMB Direct.
Add checks to validate this is used on SMB 3.X dialects.

To connect to SMBDirect, use "mount.cifs -o rdma,vers=3.x".
At the time of this patch, 3.x can be 3.0, 3.02 or 3.1.1.
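
The dialect check this patch adds can be pictured with the userspace sketch below. It is illustrative only: the helper names are hypothetical, and only the vers= strings mentioned above plus "2.1" are mapped. The dialect identifiers are the values negotiated on the wire.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DIALECT_SMB21  0x0210
#define DIALECT_SMB30  0x0300
#define DIALECT_SMB302 0x0302
#define DIALECT_SMB311 0x0311

/* Map a vers= string to a dialect id; returns 0 for strings not handled here. */
uint16_t dialect_from_vers(const char *vers)
{
	assert(vers != NULL);
	if (strcmp(vers, "2.1") == 0)
		return DIALECT_SMB21;
	if (strcmp(vers, "3.0") == 0)
		return DIALECT_SMB30;
	if (strcmp(vers, "3.02") == 0)
		return DIALECT_SMB302;
	if (strcmp(vers, "3.1.1") == 0)
		return DIALECT_SMB311;
	return 0;
}

/* Returns 0 if the option combination is acceptable, -1 otherwise. */
int check_rdma_option(bool rdma, uint16_t dialect)
{
	if (rdma && dialect < DIALECT_SMB30)
		return -1; /* SMB Direct requires SMB 3.0 or later */
	return 0;
}
```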

Signed-off-by: Long Li 
---
 fs/cifs/cifs_debug.c |  2 ++
 fs/cifs/cifsfs.c |  2 ++
 fs/cifs/cifsglob.h   |  5 +
 fs/cifs/connect.c| 15 ++-
 4 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/fs/cifs/cifs_debug.c b/fs/cifs/cifs_debug.c
index bdc2f38..9738026 100644
--- a/fs/cifs/cifs_debug.c
+++ b/fs/cifs/cifs_debug.c
@@ -171,6 +171,8 @@ static int cifs_debug_data_proc_show(struct seq_file *m, 
void *v)
ses->ses_count, ses->serverOS, ses->serverNOS,
ses->capabilities, ses->status);
}
+   if (server->rdma)
+   seq_printf(m, "RDMA\n\t");
seq_printf(m, "TCP status: %d\n\tLocal Users To "
   "Server: %d SecMode: 0x%x Req On Wire: %d",
   server->tcpStatus, server->srv_count,
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 180b335..e15fbf1 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -327,6 +327,8 @@ cifs_show_address(struct seq_file *s, struct 
TCP_Server_Info *server)
default:
seq_puts(s, "(unknown)");
}
+   if (server->rdma)
+   seq_puts(s, ",rdma");
 }
 
 static void
diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index 808486c..5585516 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -530,6 +530,7 @@ struct smb_vol {
bool nopersistent:1;
bool resilient:1; /* noresilient not required since not fored for CA */
bool domainauto:1;
+   bool rdma:1;
unsigned int rsize;
unsigned int wsize;
bool sockopt_tcp_nodelay:1;
@@ -646,6 +647,10 @@ struct TCP_Server_Info {
bool sec_kerberos;   /* supports plain Kerberos */
bool sec_mskerberos; /* supports legacy MS Kerberos */
bool large_buf;  /* is current buffer large? */
+   /* use SMBD connection instead of socket */
+   bool rdma;
+   /* point to the SMBD connection if RDMA is used instead of socket */
+   struct smbd_connection *smbd_conn;
struct delayed_work echo; /* echo ping workqueue job */
char *smallbuf;  /* pointer to current "small" buffer */
char *bigbuf; /* pointer to current "big" buffer */
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 59647eb..b5a575f 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -92,7 +92,7 @@ enum {
Opt_multiuser, Opt_sloppy, Opt_nosharesock,
Opt_persistent, Opt_nopersistent,
Opt_resilient, Opt_noresilient,
-   Opt_domainauto,
+   Opt_domainauto, Opt_rdma,
 
/* Mount options which take numeric value */
Opt_backupuid, Opt_backupgid, Opt_uid,
@@ -183,6 +183,7 @@ static const match_table_t cifs_mount_option_tokens = {
{ Opt_resilient, "resilienthandles"},
{ Opt_noresilient, "noresilienthandles"},
{ Opt_domainauto, "domainauto"},
+   { Opt_rdma, "rdma"},
 
{ Opt_backupuid, "backupuid=%s" },
{ Opt_backupgid, "backupgid=%s" },
@@ -1538,6 +1539,9 @@ cifs_parse_mount_options(const char *mountdata, const 
char *devname,
case Opt_domainauto:
vol->domainauto = true;
break;
+   case Opt_rdma:
+   vol->rdma = true;
+   break;
 
/* Numeric Values */
case Opt_backupuid:
@@ -1928,6 +1932,11 @@ cifs_parse_mount_options(const char *mountdata, const 
char *devname,
goto cifs_parse_mount_err;
}
 
+   if (vol->rdma && vol->vals->protocol_id < SMB30_PROT_ID) {
+   cifs_dbg(VFS, "SMB Direct requires Version >=3.0\n");
+   goto cifs_parse_mount_err;
+   }
+
 #ifndef CONFIG_KEYS
/* Muliuser mounts require CONFIG_KEYS support */
if (vol->multiuser) {
@@ -2131,6 +2140,9 @@ static int match_server(struct TCP_Server_Info *server, 
struct smb_vol *vol)
if (server->echo_interval != vol->echo_interval * HZ)
return 0;
 
+   if (server->rdma != vol->rdma)
+   return 0;
+
return 1;
 }
 
@@ -2229,6 +2241,7 @@ cifs_get_tcp_session(struct smb_vol *volume_info)
tcp_ses->noblocksnd = volume_info->noblocksnd;
tcp_ses->noautotune = volume_info->noautotune;
tcp_ses->tcp_nodelay = volume_info->sockopt_tcp_nodelay;
+   tcp_ses->rdma = volume_info->rdma;
tcp_ses->in_flight = 0;
tcp_ses->credits = 1;
init_waitqueue_head(&tcp_ses->response_q);
-- 
2.7.4

[Patch v4 05/22] CIFS: SMBD: Implement function to create a SMBDirect connection

2017-10-01 Thread Long Li

From: Long Li 

The upper layer calls this function to connect to the peer through SMBDirect.
Each SMBDirect connection is based on an RC Queue Pair.
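
The function added below first tries SMBD_PORT and then falls back to SMB_PORT. A userspace sketch of that retry shape follows. The port numbers 5445 and 445 are assumptions here (they are not shown in this email), and connect_fn, connect_with_fallback and stub_connect are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define SMBD_PORT 5445 /* assumed: dedicated SMB Direct port */
#define SMB_PORT  445  /* assumed: regular SMB port */

/* Hypothetical connect callback: returns a connection handle or NULL. */
typedef void *(*connect_fn)(int port, void *ctx);

/* Try the dedicated port first, then fall back to the regular SMB port. */
void *connect_with_fallback(connect_fn try_connect, void *ctx)
{
	void *conn;

	assert(try_connect != NULL);
	conn = try_connect(SMBD_PORT, ctx);
	if (!conn)
		conn = try_connect(SMB_PORT, ctx);
	return conn;
}

/* Example stub: only SMB_PORT "accepts"; ctx points to an attempt counter. */
void *stub_connect(int port, void *ctx)
{
	int *attempts = ctx;

	(*attempts)++;
	return port == SMB_PORT ? ctx : NULL;
}
```

With the stub, a call makes two attempts and returns the handle from the second one.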

Signed-off-by: Long Li 
---
 fs/cifs/smbdirect.c | 17 +
 fs/cifs/smbdirect.h |  4 
 2 files changed, 21 insertions(+)

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index e8f976f..34f73e2 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -1675,3 +1675,20 @@ struct smbd_connection *_smbd_get_connection(
kfree(info);
return NULL;
 }
+
+struct smbd_connection *smbd_get_connection(
+   struct TCP_Server_Info *server, struct sockaddr *dstaddr)
+{
+   struct smbd_connection *ret;
+   int port = SMBD_PORT;
+
+try_again:
+   ret = _smbd_get_connection(server, dstaddr, port);
+
+   /* Try SMB_PORT if SMBD_PORT doesn't work */
+   if (!ret && port == SMBD_PORT) {
+   port = SMB_PORT;
+   goto try_again;
+   }
+   return ret;
+}
diff --git a/fs/cifs/smbdirect.h b/fs/cifs/smbdirect.h
index ca60700..42a9338 100644
--- a/fs/cifs/smbdirect.h
+++ b/fs/cifs/smbdirect.h
@@ -245,6 +245,10 @@ struct smbd_response {
u8 packet[];
 };
 
+/* Create a SMBDirect session */
+struct smbd_connection *smbd_get_connection(
+   struct TCP_Server_Info *server, struct sockaddr *dstaddr);
+
 void profiling_display_histogram(
struct seq_file *m, unsigned long long array[]);
 #endif
-- 
2.7.4

[Patch v4 16/22] CIFS: SMBD: Fix the definition for SMB2_CHANNEL_RDMA_V1_INVALIDATE

2017-10-01 Thread Long Li

From: Long Li 

The channel value that asks the server to remotely invalidate the client's
memory registration should be 0x0002.
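
To show why the two constants must differ, here is a small sketch of selecting the Channel field from the negotiated dialect. The function name pick_channel and the CHANNEL_* macro names are hypothetical; the numeric values follow the definitions fixed by this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CHANNEL_NONE               0x00000000u
#define CHANNEL_RDMA_V1            0x00000001u /* SMB 3.0 or later */
#define CHANNEL_RDMA_V1_INVALIDATE 0x00000002u /* SMB 3.02 or later */

/* Choose the Channel value for a read or write request. */
uint32_t pick_channel(bool use_rdma, uint16_t dialect)
{
	if (!use_rdma)
		return CHANNEL_NONE;

	/* RDMA channels only exist from SMB 3.0 (0x0300) onwards. */
	assert(dialect >= 0x0300);

	/* SMB 3.02 (0x0302) and later can ask for remote invalidation. */
	return dialect >= 0x0302 ? CHANNEL_RDMA_V1_INVALIDATE
				 : CHANNEL_RDMA_V1;
}
```

If both macros were 0x0001, the server could not tell the two requests apart and would never be asked to invalidate.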

Signed-off-by: Long Li 
---
 fs/cifs/smb2pdu.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/cifs/smb2pdu.h b/fs/cifs/smb2pdu.h
index 393ed5f..f783a08 100644
--- a/fs/cifs/smb2pdu.h
+++ b/fs/cifs/smb2pdu.h
@@ -832,7 +832,7 @@ struct smb2_flush_rsp {
 /* Channel field for read and write: exactly one of following flags can be 
set*/
 #define SMB2_CHANNEL_NONE  0x
 #define SMB2_CHANNEL_RDMA_V1   0x0001 /* SMB3 or later */
-#define SMB2_CHANNEL_RDMA_V1_INVALIDATE 0x0001 /* SMB3.02 or later */
+#define SMB2_CHANNEL_RDMA_V1_INVALIDATE 0x0002 /* SMB3.02 or later */
 
 /* SMB2 read request without RFC1001 length at the beginning */
 struct smb2_read_plain_req {
-- 
2.7.4

[Patch v4 03/22] CIFS: SMBD: export protocol initial values

2017-10-01 Thread Long Li

From: Long Li 

These values can be configured by the user. Export them under /proc/fs/cifs.
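
The write handler in the macro below parses a decimal integer from the user buffer. A rough userspace analogue, for illustration only, is sketched here with strtol. It is not equivalent to the kernel helper in every corner case (for example, strtol skips leading whitespace), and parse_int_setting is a hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse a decimal integer: the whole string must be a number, with one
 * trailing newline tolerated. Returns 0 and stores the value on success,
 * or a negative errno-style code on failure.
 */
int parse_int_setting(const char *text, int *out)
{
	char *end;
	long v;

	assert(text != NULL && out != NULL);

	errno = 0;
	v = strtol(text, &end, 10);
	if (end == text)
		return -EINVAL;  /* no digits at all */
	if (*end == '\n')
		end++;           /* tolerate "123\n" as written by echo */
	if (*end != '\0')
		return -EINVAL;  /* trailing garbage */
	if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
		return -ERANGE;  /* does not fit in an int */

	*out = (int)v;
	return 0;
}
```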

Signed-off-by: Long Li 
---
 fs/cifs/cifs_debug.c | 70 
 1 file changed, 70 insertions(+)

diff --git a/fs/cifs/cifs_debug.c b/fs/cifs/cifs_debug.c
index 9727e1d..bdc2f38 100644
--- a/fs/cifs/cifs_debug.c
+++ b/fs/cifs/cifs_debug.c
@@ -369,6 +369,52 @@ static const struct file_operations cifs_stats_proc_fops = 
{
 };
 #endif /* STATS */
 
+#define PROC_FILE_DEFINE(name) \
+static ssize_t name##_write(struct file *file, const char __user *buffer, \
+   size_t count, loff_t *ppos) \
+{ \
+   int rc; \
+   rc = kstrtoint_from_user(buffer, count, 10, & name ); \
+   if (rc) \
+   return rc; \
+   return count; \
+} \
+static int name##_proc_show(struct seq_file *m, void *v) \
+{ \
+   seq_printf(m, "%d\n", name ); \
+   return 0; \
+} \
+static int name##_open(struct inode *inode, struct file *file) \
+{ \
+   return single_open(file, name##_proc_show, NULL); \
+} \
+\
+static const struct file_operations cifs_##name##_proc_fops = { \
+   .open   = name##_open, \
+   .read   = seq_read, \
+   .llseek = seq_lseek, \
+   .release= single_release, \
+   .write  = name##_write, \
+}
+
+extern int rdma_readwrite_threshold;
+extern int smbd_max_frmr_depth;
+extern int smbd_keep_alive_interval;
+extern int smbd_max_receive_size;
+extern int smbd_max_fragmented_recv_size;
+extern int smbd_max_send_size;
+extern int smbd_send_credit_target;
+extern int smbd_receive_credit_max;
+
+PROC_FILE_DEFINE(rdma_readwrite_threshold);
+PROC_FILE_DEFINE(smbd_max_frmr_depth);
+PROC_FILE_DEFINE(smbd_keep_alive_interval);
+PROC_FILE_DEFINE(smbd_max_receive_size);
+PROC_FILE_DEFINE(smbd_max_fragmented_recv_size);
+PROC_FILE_DEFINE(smbd_max_send_size);
+PROC_FILE_DEFINE(smbd_send_credit_target);
+PROC_FILE_DEFINE(smbd_receive_credit_max);
+
 static struct proc_dir_entry *proc_fs_cifs;
 static const struct file_operations cifsFYI_proc_fops;
 static const struct file_operations cifs_lookup_cache_proc_fops;
@@ -396,6 +442,22 @@ cifs_proc_init(void)
&cifs_security_flags_proc_fops);
proc_create("LookupCacheEnabled", 0, proc_fs_cifs,
&cifs_lookup_cache_proc_fops);
+   proc_create("rdma_readwrite_threshold", 0, proc_fs_cifs,
+   &cifs_rdma_readwrite_threshold_proc_fops);
+   proc_create("smbd_max_frmr_depth", 0, proc_fs_cifs,
+   &cifs_smbd_max_frmr_depth_proc_fops);
+   proc_create("smbd_keep_alive_interval", 0, proc_fs_cifs,
+   &cifs_smbd_keep_alive_interval_proc_fops);
+   proc_create("smbd_max_receive_size", 0, proc_fs_cifs,
+   &cifs_smbd_max_receive_size_proc_fops);
+   proc_create("smbd_max_fragmented_recv_size", 0, proc_fs_cifs,
+   &cifs_smbd_max_fragmented_recv_size_proc_fops);
+   proc_create("smbd_max_send_size", 0, proc_fs_cifs,
+   &cifs_smbd_max_send_size_proc_fops);
+   proc_create("smbd_send_credit_target", 0, proc_fs_cifs,
+   &cifs_smbd_send_credit_target_proc_fops);
+   proc_create("smbd_receive_credit_max", 0, proc_fs_cifs,
+   &cifs_smbd_receive_credit_max_proc_fops);
 }
 
 void
@@ -413,6 +475,14 @@ cifs_proc_clean(void)
remove_proc_entry("SecurityFlags", proc_fs_cifs);
remove_proc_entry("LinuxExtensionsEnabled", proc_fs_cifs);
remove_proc_entry("LookupCacheEnabled", proc_fs_cifs);
+   remove_proc_entry("rdma_readwrite_threshold", proc_fs_cifs);
+   remove_proc_entry("smbd_max_frmr_depth", proc_fs_cifs);
+   remove_proc_entry("smbd_keep_alive_interval", proc_fs_cifs);
+   remove_proc_entry("smbd_max_receive_size", proc_fs_cifs);
+   remove_proc_entry("smbd_max_fragmented_recv_size", proc_fs_cifs);
+   remove_proc_entry("smbd_max_send_size", proc_fs_cifs);
+   remove_proc_entry("smbd_send_credit_target", proc_fs_cifs);
+   remove_proc_entry("smbd_receive_credit_max", proc_fs_cifs);
remove_proc_entry("fs/cifs", NULL);
 }
 
-- 
2.7.4

[Patch v4 22/22] CIFS: SMBD: Add SMBDirect debug counters

2017-10-01 Thread Long Li

From: Long Li 

Export SMBDirect debug counters to /proc/fs/cifs/DebugData.

These counters are used for debugging, troubleshooting and profiling.

Signed-off-by: Long Li 
---
 fs/cifs/cifs_debug.c | 87 
 1 file changed, 87 insertions(+)

diff --git a/fs/cifs/cifs_debug.c b/fs/cifs/cifs_debug.c
index 9738026..1ea78d5 100644
--- a/fs/cifs/cifs_debug.c
+++ b/fs/cifs/cifs_debug.c
@@ -30,6 +30,7 @@
 #include "cifsproto.h"
 #include "cifs_debug.h"
 #include "cifsfs.h"
+#include "smbdirect.h"
 
 void
 cifs_dump_mem(char *label, void *data, int length)
@@ -152,6 +153,92 @@ static int cifs_debug_data_proc_show(struct seq_file *m, 
void *v)
list_for_each(tmp1, &cifs_tcp_ses_list) {
server = list_entry(tmp1, struct TCP_Server_Info,
tcp_ses_list);
+
+   if (!server->rdma)
+   goto skip_rdma;
+
+   seq_printf(m, "\nSMBDirect (in hex) protocol version: %x "
+   "transport status: %x",
+   server->smbd_conn->protocol,
+   server->smbd_conn->transport_status);
+   seq_printf(m, "\nConn receive_credit_max: %x "
+   "send_credit_target: %x max_send_size: %x",
+   server->smbd_conn->receive_credit_max,
+   server->smbd_conn->send_credit_target,
+   server->smbd_conn->max_send_size);
+   seq_printf(m, "\nConn max_fragmented_recv_size: %x "
+   "max_fragmented_send_size: %x max_receive_size:%x",
+   server->smbd_conn->max_fragmented_recv_size,
+   server->smbd_conn->max_fragmented_send_size,
+   server->smbd_conn->max_receive_size);
+   seq_printf(m, "\nConn keep_alive_interval: %x "
+   "max_readwrite_size: %x rdma_readwrite_threshold: %x",
+   server->smbd_conn->keep_alive_interval,
+   server->smbd_conn->max_readwrite_size,
+   server->smbd_conn->rdma_readwrite_threshold);
+   seq_printf(m, "\nDebug count_get_receive_buffer: %x "
+   "count_put_receive_buffer: %x count_send_empty: %x",
+   server->smbd_conn->count_get_receive_buffer,
+   server->smbd_conn->count_put_receive_buffer,
+   server->smbd_conn->count_send_empty);
+   seq_printf(m, "\nRead Queue count_reassembly_queue: %x "
+   "count_enqueue_reassembly_queue: %x "
+   "count_dequeue_reassembly_queue: %x "
+   "fragment_reassembly_remaining: %x "
+   "reassembly_data_length: %x "
+   "reassembly_queue_length: %x",
+   server->smbd_conn->count_reassembly_queue,
+   server->smbd_conn->count_enqueue_reassembly_queue,
+   server->smbd_conn->count_dequeue_reassembly_queue,
+   server->smbd_conn->fragment_reassembly_remaining,
+   server->smbd_conn->reassembly_data_length,
+   server->smbd_conn->reassembly_queue_length);
+   seq_printf(m, "\nCurrent Credits send_credits: %x "
+   "receive_credits: %x receive_credit_target: %x",
+   atomic_read(&server->smbd_conn->send_credits),
+   atomic_read(&server->smbd_conn->receive_credits),
+   server->smbd_conn->receive_credit_target);
+   seq_printf(m, "\nPending send_pending: %x send_payload_pending:"
+   " %x smbd_send_pending: %x smbd_recv_pending: %x",
+   atomic_read(&server->smbd_conn->send_pending),
+   atomic_read(&server->smbd_conn->send_payload_pending),
+   server->smbd_conn->smbd_send_pending,
+   server->smbd_conn->smbd_recv_pending);
+   seq_printf(m, "\nReceive buffers count_receive_queue: %x "
+   "count_empty_packet_queue: %x",
+   server->smbd_conn->count_receive_queue,
+   server->smbd_conn->count_empty_packet_queue);
+   seq_printf(m, "\nMR responder_resources: %x "
+   "max_frmr_depth: %x mr_type: %x",
+   server->smbd_conn->responder_resources,
+   server->smbd_conn->max_frmr_depth,
+   server->smbd_conn->mr_type);
+   seq_printf(m, "\nMR mr_ready_count: %x mr_used_count: %x",
+   atomic_read(&server->smbd_conn->mr_ready_count),
+   atomic_read(&server->smbd_conn->mr_used_count));
+
+   seq_printf(m, "\nTSC cycle histogram in I/O path: "
+   "(the number of most significa

[Patch v4 09/22] CIFS: SMBD: Implement function to destroy a SMBDirect connection

2017-10-01 Thread Long Li

From: Long Li 

Add a function to tear down an SMBDirect connection. The upper layer uses it
to free all SMBDirect connection and transport resources.

Signed-off-by: Long Li 
---
 fs/cifs/smbdirect.c | 16 
 fs/cifs/smbdirect.h |  3 +++
 2 files changed, 19 insertions(+)

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index 1f0f33c..cb129c2 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -1416,6 +1416,22 @@ static void idle_connection_timer(struct work_struct 
*work)
info->keep_alive_interval*HZ);
 }
 
+/* Destroy this SMBD connection, called from upper layer */
+void smbd_destroy(struct smbd_connection *info)
+{
+   log_rdma_event(INFO, "destroying rdma session\n");
+
+   /* Kick off the disconnection process */
+   smbd_disconnect_rdma_connection(info);
+
+   log_rdma_event(INFO, "wait for transport being destroyed\n");
+   wait_event(info->wait_destroy,
+   info->transport_status == SMBD_DESTROYED);
+
+   destroy_workqueue(info->workqueue);
+   kfree(info);
+}
+
 /*
  * Reconnect this SMBD connection, called from upper layer
  * return value: 0 on success, or actual error code
diff --git a/fs/cifs/smbdirect.h b/fs/cifs/smbdirect.h
index 9818852..d14a484 100644
--- a/fs/cifs/smbdirect.h
+++ b/fs/cifs/smbdirect.h
@@ -252,6 +252,9 @@ struct smbd_connection *smbd_get_connection(
 /* Reconnect SMBDirect session */
 int smbd_reconnect(struct TCP_Server_Info *server);
 
+/* Destroy SMBDirect session */
+void smbd_destroy(struct smbd_connection *info);
+
 void profiling_display_histogram(
struct seq_file *m, unsigned long long array[]);
 #endif
-- 
2.7.4

[Patch v4 19/22] CIFS: SMBD: Add parameter rdata to smb2_new_read_req

2017-10-01 Thread Long Li

From: Long Li 

This patch prepares the upper layer for doing SMB read via RDMA write.

When we assemble the SMB read packet header, we need to know the I/O layout
if this request is to use an RDMA write. rdata has all the information we need
for memory registration. Add rdata to smb2_new_read_req.

Signed-off-by: Long Li 
---
 fs/cifs/smb2pdu.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index 6089957..7053db9 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -2351,18 +2351,21 @@ SMB2_flush(const unsigned int xid, struct cifs_tcon 
*tcon, u64 persistent_fid,
  */
 static int
 smb2_new_read_req(void **buf, unsigned int *total_len,
- struct cifs_io_parms *io_parms, unsigned int remaining_bytes,
- int request_type)
+   struct cifs_io_parms *io_parms, struct cifs_readdata *rdata,
+   unsigned int remaining_bytes, int request_type)
 {
int rc = -EACCES;
struct smb2_read_plain_req *req = NULL;
struct smb2_sync_hdr *shdr;
+   struct TCP_Server_Info *server;
 
rc = smb2_plain_req_init(SMB2_READ, io_parms->tcon, (void **) &req,
 total_len);
if (rc)
return rc;
-   if (io_parms->tcon->ses->server == NULL)
+
+   server = io_parms->tcon->ses->server;
+   if (server == NULL)
return -ECONNABORTED;
 
shdr = &req->sync_hdr;
@@ -2490,7 +2493,8 @@ smb2_async_readv(struct cifs_readdata *rdata)
 
server = io_parms.tcon->ses->server;
 
-   rc = smb2_new_read_req((void **) &buf, &total_len, &io_parms, 0, 0);
+   rc = smb2_new_read_req(
+   (void **) &buf, &total_len, &io_parms, rdata, 0, 0);
if (rc) {
if (rc == -EAGAIN && rdata->credits) {
/* credits was reset by reconnect */
@@ -2558,7 +2562,7 @@ SMB2_read(const unsigned int xid, struct cifs_io_parms 
*io_parms,
struct cifs_ses *ses = io_parms->tcon->ses;
 
*nbytes = 0;
-   rc = smb2_new_read_req((void **)&req, &total_len, io_parms, 0, 0);
+   rc = smb2_new_read_req((void **)&req, &total_len, io_parms, NULL, 0, 0);
if (rc)
return rc;
 
-- 
2.7.4

[Patch v4 00/22] CIFS: Implement SMBDirect

2017-10-01 Thread Long Li

From: Long Li 

Starting with SMB2 dialect 3.0, Microsoft introduced the SMBDirect transport
protocol for transferring upper layer (SMB2) payload over RDMA via InfiniBand,
RoCE or iWARP. The protocol is published in [MS-SMBD]
(https://msdn.microsoft.com/en-us/library/hh536346.aspx).

Patch v2 added RDMA read/write via memory registration, and addressed
feedback on v1.

Patch v3 improved performance by introducing an additional queue for handling
empty packets and reducing lock contention on the IRQ path. It also added
lightweight profiling by reading TSC and addressed feedback on v2.

Patch v4 fixed connectivity issues with iWARP devices and addressed comments.

Long Li (22):
  CIFS: SMBD: Add SMBDirect protocol initial values and constants
  CIFS: SMBD: Establish SMBDirect connection
  CIFS: SMBD: export protocol initial values
  CIFS: SMBD: Add rdma mount option
  CIFS: SMBD: Implement function to create a SMBDirect connection
  CIFS: SMBD: Upper layer connects to SMBDirect session
  CIFS: SMBD: Implement function to reconnect to a SMBDirect transport
  CIFS: SMBD: Upper layer reconnects to SMBDirect session
  CIFS: SMBD: Implement function to destroy a SMBDirect connection
  CIFS: SMBD: Upper layer destroys SMBDirect session on shutdown or
umount
  CIFS: SMBD: Set SMBDirect maximum read or write size for I/O
  CIFS: SMBD: Implement function to receive data via RDMA receive
  CIFS: SMBD: Upper layer receives data via RDMA receive
  CIFS: SMBD: Implement function to send data via RDMA send
  CIFS: SMBD: Upper layer sends data via RDMA send
  CIFS: SMBD: Fix the definition for SMB2_CHANNEL_RDMA_V1_INVALIDATE
  CIFS: SMBD: Implement RDMA memory registration
  CIFS: SMBD: Upper layer performs SMB write via RDMA read through
memory registration
  CIFS: SMBD: Add parameter rdata to smb2_new_read_req
  CIFS: SMBD: Read correct returned data length for RDMA write (SMB
read) I/O
  CIFS: SMBD: Upper layer performs SMB read via RDMA write through
memory registration
  CIFS: SMBD: Add SMBDirect debug counters

 fs/cifs/Makefile |2 +-
 fs/cifs/cifs_debug.c |  159 +++
 fs/cifs/cifsfs.c |2 +
 fs/cifs/cifsglob.h   |   17 +-
 fs/cifs/cifssmb.c|   10 +-
 fs/cifs/connect.c|   46 +-
 fs/cifs/file.c   |   10 +
 fs/cifs/smb1ops.c|2 +-
 fs/cifs/smb2ops.c|   21 +-
 fs/cifs/smb2pdu.c|  114 ++-
 fs/cifs/smb2pdu.h|2 +-
 fs/cifs/smbdirect.c  | 2651 ++
 fs/cifs/smbdirect.h  |  325 +++
 fs/cifs/transport.c  |7 +
 14 files changed, 3348 insertions(+), 20 deletions(-)
 create mode 100644 fs/cifs/smbdirect.c
 create mode 100644 fs/cifs/smbdirect.h

-- 
2.7.4

[Patch v4 13/22] CIFS: SMBD: Upper layer receives data via RDMA receive

2017-10-01 Thread Long Li

From: Long Li 

With SMBDirect connected, use it for receiving data via RDMA receive.

Signed-off-by: Long Li 
---
 fs/cifs/connect.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 1a9f22f..8026682 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -542,7 +542,10 @@ cifs_readv_from_socket(struct TCP_Server_Info *server, 
struct msghdr *smb_msg)
if (server_unresponsive(server))
return -ECONNABORTED;
 
-   length = sock_recvmsg(server->ssocket, smb_msg, 0);
+   if (server->smbd_conn)
+   length = smbd_recv(server->smbd_conn, smb_msg);
+   else
+   length = sock_recvmsg(server->ssocket, smb_msg, 0);
 
if (server->tcpStatus == CifsExiting)
return -ESHUTDOWN;
-- 
2.7.4

[Patch v4 17/22] CIFS: SMBD: Implement RDMA memory registration

2017-10-01 Thread Long Li

From: Long Li 

Memory registration is used for transferring payload via RDMA read or write.
After I/O is done, memory registrations are recovered and reused. This
process can be time consuming and is done in a work queue.
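
The diff below also derives the per-connection limits that bound each registration. The following sketch restates that arithmetic in userspace. The struct and function names are hypothetical, and a 4096-byte page is assumed for the example.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u /* assumed page size for this example */

static uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

struct smbd_limits {
	uint32_t rdma_readwrite_threshold;
	uint32_t max_readwrite_size;
	uint32_t max_frmr_depth;
};

/* Clamp the configured values by what the peer and the device allow. */
struct smbd_limits negotiate_limits(uint32_t configured_threshold,
				    uint32_t max_fragmented_send_size,
				    uint32_t server_max_readwrite_size,
				    uint32_t device_frmr_depth)
{
	struct smbd_limits l;

	/* guard the multiplication below against 32-bit overflow */
	assert(device_frmr_depth <= UINT32_MAX / PAGE_SIZE_BYTES);

	/* never use RDMA read/write for I/O that fits in a fragmented send */
	l.rdma_readwrite_threshold =
		min_u32(configured_threshold, max_fragmented_send_size);

	/* one registration can cover at most frmr_depth pages */
	l.max_readwrite_size = min_u32(server_max_readwrite_size,
				       device_frmr_depth * PAGE_SIZE_BYTES);

	/* shrink the depth again if the server limit was the smaller one */
	l.max_frmr_depth = l.max_readwrite_size / PAGE_SIZE_BYTES;
	return l;
}
```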

Signed-off-by: Long Li 
---
 fs/cifs/smbdirect.c | 428 
 fs/cifs/smbdirect.h |  55 +++
 2 files changed, 483 insertions(+)

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index 90e2c94..3f2de48 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -49,6 +49,9 @@ static int smbd_post_send_page(struct smbd_connection *info,
struct page *page, unsigned long offset,
size_t size, int remaining_data_length);
 
+static void destroy_mr_list(struct smbd_connection *info);
+static int allocate_mr_list(struct smbd_connection *info);
+
 /* SMBD version number */
 #define SMBD_V1 0x0100
 
@@ -219,6 +222,12 @@ static void smbd_destroy_rdma_work(struct work_struct 
*work)
wait_event(info->wait_send_payload_pending,
atomic_read(&info->send_payload_pending) == 0);
 
+   log_rdma_event(INFO, "freeing mr list\n");
+   wake_up_interruptible_all(&info->wait_mr);
+   wait_event(info->wait_for_mr_cleanup,
+   atomic_read(&info->mr_used_count) == 0);
+   destroy_mr_list(info);
+
/* It's not posssible for upper layer to get to reassembly */
log_rdma_event(INFO, "drain the reassembly queue\n");
do {
@@ -475,6 +484,16 @@ static bool process_negotiation_response(
}
info->max_fragmented_send_size =
le32_to_cpu(packet->max_fragmented_size);
+   info->rdma_readwrite_threshold =
+   rdma_readwrite_threshold > info->max_fragmented_send_size ?
+   info->max_fragmented_send_size :
+   rdma_readwrite_threshold;
+
+
+   info->max_readwrite_size = min_t(u32,
+   le32_to_cpu(packet->max_readwrite_size),
+   info->max_frmr_depth * PAGE_SIZE);
+   info->max_frmr_depth = info->max_readwrite_size / PAGE_SIZE;
 
return true;
 }
@@ -773,6 +792,12 @@ static int smbd_ia_open(
rc = -EPROTONOSUPPORT;
goto out2;
}
+   info->max_frmr_depth = min_t(int,
+   smbd_max_frmr_depth,
+   info->id->device->attrs.max_fast_reg_page_list_len);
+   info->mr_type = IB_MR_TYPE_MEM_REG;
+   if (info->id->device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG)
+   info->mr_type = IB_MR_TYPE_SG_GAPS;
 
info->pd = ib_alloc_pd(info->id->device, 0);
if (IS_ERR(info->pd)) {
@@ -1610,6 +1635,8 @@ struct smbd_connection *_smbd_get_connection(
struct rdma_conn_param conn_param;
struct ib_qp_init_attr qp_attr;
struct sockaddr_in *addr_in = (struct sockaddr_in *) dstaddr;
+   struct ib_port_immutable port_immutable;
+   u32 ird_ord_hdr[2];
 
info = kzalloc(sizeof(struct smbd_connection), GFP_KERNEL);
if (!info)
@@ -1698,6 +1725,28 @@ struct smbd_connection *_smbd_get_connection(
memset(&conn_param, 0, sizeof(conn_param));
conn_param.initiator_depth = 0;
 
+   conn_param.responder_resources =
+   info->id->device->attrs.max_qp_rd_atom
+   < SMBD_CM_RESPONDER_RESOURCES ?
+   info->id->device->attrs.max_qp_rd_atom :
+   SMBD_CM_RESPONDER_RESOURCES;
+   info->responder_resources = conn_param.responder_resources;
+   log_rdma_mr(INFO, "responder_resources=%d\n",
+   info->responder_resources);
+
+   /* Need to send IRD/ORD in private data for iWARP */
+   info->id->device->get_port_immutable(
+   info->id->device, info->id->port_num, &port_immutable);
+   if (port_immutable.core_cap_flags & RDMA_CORE_PORT_IWARP) {
+   ird_ord_hdr[0] = info->responder_resources;
+   ird_ord_hdr[1] = 1;
+   conn_param.private_data = ird_ord_hdr;
+   conn_param.private_data_len = sizeof(ird_ord_hdr);
+   } else {
+   conn_param.private_data = NULL;
+   conn_param.private_data_len = 0;
+   }
+
conn_param.retry_count = SMBD_CM_RETRY;
conn_param.rnr_retry_count = SMBD_CM_RNR_RETRY;
conn_param.flow_control = 0;
@@ -1762,8 +1811,19 @@ struct smbd_connection *_smbd_get_connection(
goto negotiation_failed;
}
 
+   rc = allocate_mr_list(info);
+   if (rc) {
+   log_rdma_mr(ERR, "memory registration allocation failed\n");
+   goto allocate_mr_failed;
+   }
+
return info;
 
+allocate_mr_failed:
+   /* At this point, need to do a full transport shutdown */
+   smbd_destroy(info);
+   return NULL;
+
 negotiation_failed:
cancel_delayed_work_sync(&info->idle_timer_work);
destroy_caches_and_workqueue(info);
@@ -2221,3 +2281,371 @@ i

[Patch v4 15/22] CIFS: SMBD: Upper layer sends data via RDMA send

2017-10-01 Thread Long Li

From: Long Li 

With SMBDirect connected, use it for sending data via RDMA send.

Signed-off-by: Long Li 
---
 fs/cifs/transport.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/fs/cifs/transport.c b/fs/cifs/transport.c
index 7efbab0..3a9b5a0 100644
--- a/fs/cifs/transport.c
+++ b/fs/cifs/transport.c
@@ -37,6 +37,7 @@
 #include "cifsglob.h"
 #include "cifsproto.h"
 #include "cifs_debug.h"
+#include "smbdirect.h"
 
 void
 cifs_wake_up_task(struct mid_q_entry *mid)
@@ -230,6 +231,11 @@ __smb_send_rqst(struct TCP_Server_Info *server, struct smb_rqst *rqst)
struct msghdr smb_msg;
int val = 1;
 
+   if (server->smbd_conn) {
+   rc = smbd_send(server->smbd_conn, rqst);
+   goto done;
+   }
+
if (ssocket == NULL)
return -ENOTSOCK;
 
@@ -299,6 +305,7 @@ __smb_send_rqst(struct TCP_Server_Info *server, struct smb_rqst *rqst)
server->tcpStatus = CifsNeedReconnect;
}
 
+done:
if (rc < 0 && rc != -EINTR)
cifs_dbg(VFS, "Error %d sending data on socket to server\n",
 rc);
-- 
2.7.4

[Patch v4 20/22] CIFS: SMBD: Read correct returned data length for RDMA write (SMB read) I/O

2017-10-01 Thread Long Li

From: Long Li 

This patch prepares the upper layer for doing SMB read via RDMA write.

When RDMA write is used for SMB read, the returned data length is in the
DataRemaining field of the response packet. Read it properly by adding a
parameter that specifies where the returned data length is.
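
As a rough illustration of the selection described above, here is a minimal
standalone C sketch. The struct read_rsp_sketch type and the
read_data_length_sketch() helper are hypothetical stand-ins for
struct smb2_read_rsp and smb2_read_data_length(); the fields are kept in host
byte order here, whereas the patch converts them with le32_to_cpu().

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical, simplified stand-in for the SMB2 read response.
 * Only the two length fields discussed in the commit message are modeled.
 */
struct read_rsp_sketch {
	uint32_t data_length;    /* DataLength: bytes carried inline in the response */
	uint32_t data_remaining; /* DataRemaining: bytes delivered out of band */
};

/*
 * Return the length of the data returned by a read.
 * When in_remaining is true (data arrived via RDMA write), the length is
 * taken from DataRemaining; otherwise it is taken from DataLength.
 */
static unsigned int read_data_length_sketch(const struct read_rsp_sketch *rsp,
					    bool in_remaining)
{
	if (in_remaining)
		return rsp->data_remaining;

	return rsp->data_length;
}
```

In the patch the callers pass rdata->mr as the flag, so the DataRemaining path
is taken only when a memory registration exists for the read.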

Signed-off-by: Long Li 
---
 fs/cifs/cifsglob.h | 10 --
 fs/cifs/cifssmb.c  |  4 ++--
 fs/cifs/smb1ops.c  |  2 +-
 fs/cifs/smb2ops.c  |  8 ++--
 4 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index bcb6df1..f851b50 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -228,8 +228,14 @@ struct smb_version_operations {
__u64 (*get_next_mid)(struct TCP_Server_Info *);
/* data offset from read response message */
unsigned int (*read_data_offset)(char *);
-   /* data length from read response message */
-   unsigned int (*read_data_length)(char *);
+   /*
+* Data length from read response message
+* When in_remaining is true, the returned data length is in
+* message field DataRemaining for out-of-band data read (e.g through
+* Memory Registration RDMA write in SMBD).
+* Otherwise, the returned data length is in message field DataLength.
+*/
+   unsigned int (*read_data_length)(char *, bool in_remaining);
/* map smb to linux error */
int (*map_error)(char *, bool);
/* find mid corresponding to the response message */
diff --git a/fs/cifs/cifssmb.c b/fs/cifs/cifssmb.c
index 0e29ecf..b9410e1 100644
--- a/fs/cifs/cifssmb.c
+++ b/fs/cifs/cifssmb.c
@@ -1531,8 +1531,8 @@ cifs_readv_receive(struct TCP_Server_Info *server, struct mid_q_entry *mid)
 rdata->iov[0].iov_base, server->total_read);
 
/* how much data is in the response? */
-   data_len = server->ops->read_data_length(buf);
-   if (data_offset + data_len > buflen) {
+   data_len = server->ops->read_data_length(buf, rdata->mr);
+   if (!rdata->mr && (data_offset + data_len > buflen)) {
/* data_len is corrupt -- discard frame */
rdata->result = -EIO;
return cifs_readv_discard(server, mid);
diff --git a/fs/cifs/smb1ops.c b/fs/cifs/smb1ops.c
index a723df3..27a8280 100644
--- a/fs/cifs/smb1ops.c
+++ b/fs/cifs/smb1ops.c
@@ -87,7 +87,7 @@ cifs_read_data_offset(char *buf)
 }
 
 static unsigned int
-cifs_read_data_length(char *buf)
+cifs_read_data_length(char *buf, bool in_remaining)
 {
READ_RSP *rsp = (READ_RSP *)buf;
return (le16_to_cpu(rsp->DataLengthHigh) << 16) +
diff --git a/fs/cifs/smb2ops.c b/fs/cifs/smb2ops.c
index 7ad35d6..a765877 100644
--- a/fs/cifs/smb2ops.c
+++ b/fs/cifs/smb2ops.c
@@ -935,9 +935,13 @@ smb2_read_data_offset(char *buf)
 }
 
 static unsigned int
-smb2_read_data_length(char *buf)
+smb2_read_data_length(char *buf, bool in_remaining)
 {
struct smb2_read_rsp *rsp = (struct smb2_read_rsp *)buf;
+
+   if (in_remaining)
+   return le32_to_cpu(rsp->DataRemaining);
+
return le32_to_cpu(rsp->DataLength);
 }
 
@@ -2446,7 +2450,7 @@ handle_read_data(struct TCP_Server_Info *server, struct mid_q_entry *mid,
}
 
data_offset = server->ops->read_data_offset(buf) + 4;
-   data_len = server->ops->read_data_length(buf);
+   data_len = server->ops->read_data_length(buf, rdata->mr);
 
if (data_offset < server->vals->read_rsp_size) {
/*
-- 
2.7.4

[Patch v4 18/22] CIFS: SMBD: Upper layer performs SMB write via RDMA read through memory registration

2017-10-01 Thread Long Li

From: Long Li 

When sending I/O, if the size is at least rdma_readwrite_threshold, we prepare
to send an SMB write packet for an RDMA read via memory registration. The
actual I/O is done by the remote peer through the local RDMA hardware. Modify
the relevant fields in the packet accordingly, and append a
smbd_buffer_descriptor_v1 to the end of the SMB write packet.

When write I/O finishes, deregister the memory region if it was used for an
RDMA read. If remote invalidation is not used, the call to smbd_deregister_mr
will do local invalidation and possibly wait. The memory region is normally
deregistered in the MID callback as soon as it has been used. In some
situations the MID may not be created on I/O failure; in that case the memory
region is deregistered when the write data context is released.
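
The threshold check and the RemainingBytes arithmetic described above can be
sketched in standalone C as follows. The helper names and SKETCH_PAGE_SIZE are
hypothetical (the kernel code uses PAGE_SIZE, and 4096 is only an assumed
value); the comparison and the formula follow the diff below.

```c
#include <stdbool.h>

/* Assumed page size for this sketch only; the kernel uses PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Decide whether a write should go through a server-side RDMA read.
 * Mirrors the condition in the diff: RDMA must be in use and the write
 * must be at least as large as the configured threshold.
 */
static bool use_rdma_read_sketch(bool rdma_enabled, unsigned int bytes,
				 unsigned int threshold)
{
	return rdma_enabled && bytes >= threshold;
}

/*
 * Number of bytes described by the registered pages: every page except
 * the last is full, and the last page holds tailsz bytes.
 * Assumes nr_pages >= 1.
 */
static unsigned int remaining_bytes_sketch(unsigned int nr_pages,
					   unsigned int tailsz)
{
	return (nr_pages - 1) * SKETCH_PAGE_SIZE + tailsz;
}
```

For example, three pages with a 100-byte tail give 2 * 4096 + 100 bytes, which
is the value the patch stores in RemainingBytes while Length and DataOffset
are set to zero.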

Signed-off-by: Long Li 
---
 fs/cifs/cifsglob.h |  1 +
 fs/cifs/cifssmb.c  |  6 ++
 fs/cifs/smb2pdu.c  | 57 +-
 3 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index 5585516..bcb6df1 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -1168,6 +1168,7 @@ struct cifs_writedata {
pid_t   pid;
unsigned intbytes;
int result;
+   struct smbd_mr  *mr;
unsigned intpagesz;
unsigned inttailsz;
unsigned intcredits;
diff --git a/fs/cifs/cifssmb.c b/fs/cifs/cifssmb.c
index 5857009..0e29ecf 100644
--- a/fs/cifs/cifssmb.c
+++ b/fs/cifs/cifssmb.c
@@ -43,6 +43,7 @@
 #include "cifs_unicode.h"
 #include "cifs_debug.h"
 #include "fscache.h"
+#include "smbdirect.h"
 
 #ifdef CONFIG_CIFS_POSIX
 static struct {
@@ -1912,6 +1913,11 @@ cifs_writedata_release(struct kref *refcount)
struct cifs_writedata *wdata = container_of(refcount,
struct cifs_writedata, refcount);
 
+   if (wdata->mr) {
+   smbd_deregister_mr(wdata->mr);
+   wdata->mr = NULL;
+   }
+
if (wdata->cfile)
cifsFileInfo_put(wdata->cfile);
 
diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index bab3da6..6089957 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -48,6 +48,7 @@
 #include "smb2glob.h"
 #include "cifspdu.h"
 #include "cifs_spnego.h"
+#include "smbdirect.h"
 
 /*
  *  The following table defines the expected "StructureSize" of SMB2 requests
@@ -2653,6 +2654,18 @@ smb2_writev_callback(struct mid_q_entry *mid)
break;
}
 
+   /*
+* If this wdata has a memory region (MR) registered, it can be freed
+* now. The number of MRs available is limited, so it is important to
+* recover a used MR as soon as I/O is finished. Holding an MR longer
+* in the later I/O process can result in I/O deadlock due to a lack
+* of MRs to send requests on I/O retry.
+*/
+   if (wdata->mr) {
+   smbd_deregister_mr(wdata->mr);
+   wdata->mr = NULL;
+   }
+
if (wdata->result)
cifs_stats_fail_inc(tcon, SMB2_WRITE_HE);
 
@@ -2704,6 +2717,41 @@ smb2_async_writev(struct cifs_writedata *wdata,
offsetof(struct smb2_write_req, Buffer) - 4);
req->RemainingBytes = 0;
 
+   /*
+* If we want to do a server RDMA read, fill in and append
+* smbd_buffer_descriptor_v1 to the end of write request
+*/
+   if (server->rdma && wdata->bytes >=
+   server->smbd_conn->rdma_readwrite_threshold) {
+
+   struct smbd_buffer_descriptor_v1 *v1;
+   bool need_invalidate = server->dialect == SMB30_PROT_ID;
+
+   wdata->mr = smbd_register_mr(
+   server->smbd_conn, wdata->pages,
+   wdata->nr_pages, wdata->tailsz,
+   false, need_invalidate);
+   if (!wdata->mr) {
+   rc = -ENOBUFS;
+   goto async_writev_out;
+   }
+   req->Length = 0;
+   req->DataOffset = 0;
+   req->RemainingBytes =
+   (wdata->nr_pages-1)*PAGE_SIZE + wdata->tailsz;
+   req->Channel = SMB2_CHANNEL_RDMA_V1_INVALIDATE;
+   if (need_invalidate)
+   req->Channel = SMB2_CHANNEL_RDMA_V1;
+   req->WriteChannelInfoOffset =
+   offsetof(struct smb2_write_req, Buffer) - 4;
+   req->WriteChannelInfoLength =
+   sizeof(struct smbd_buffer_descriptor_v1);
+   v1 = (struct smbd_buffer_descriptor_v1 *) &req->Buffer[0];
+   v1->offset = wdata->mr->mr->iova;
+   v1->token = wdata->mr->mr->rkey;
+   v1->length = wdata->mr->mr->length;
+   }
+
/* 4 for rfc1002 length field and 1 for Buffer

[PATCH v3 07/10] arm: dts: mt7623: add iommu and jpecdec nodes

2017-10-01 Thread Ryder Lee

This patch adds iommu and jpegdec nodes for MT7623.

Signed-off-by: Ryder Lee 
---
 arch/arm/boot/dts/mt7623.dtsi | 74 +++
 1 file changed, 74 insertions(+)

diff --git a/arch/arm/boot/dts/mt7623.dtsi b/arch/arm/boot/dts/mt7623.dtsi
index a877f9a..b257715 100644
--- a/arch/arm/boot/dts/mt7623.dtsi
+++ b/arch/arm/boot/dts/mt7623.dtsi
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include "skeleton64.dtsi"
@@ -273,6 +274,17 @@
clock-names = "system-clk", "rtc-clk";
};
 
+   smi_common: smi@1000c000 {
+   compatible = "mediatek,mt7623-smi-common",
+"mediatek,mt2701-smi-common";
+   reg = <0 0x1000c000 0 0x1000>;
+   clocks = <&infracfg CLK_INFRA_SMI>,
+<&mmsys CLK_MM_SMI_COMMON>,
+<&infracfg CLK_INFRA_SMI>;
+   clock-names = "apb", "smi", "async";
+   power-domains = <&scpsys MT2701_POWER_DOMAIN_DISP>;
+   };
+
pwrap: pwrap@1000d000 {
compatible = "mediatek,mt7623-pwrap",
 "mediatek,mt2701-pwrap";
@@ -304,6 +316,17 @@
reg = <0 0x10200100 0 0x1c>;
};
 
+   iommu: mmsys_iommu@10205000 {
+   compatible = "mediatek,mt7623-m4u",
+"mediatek,mt2701-m4u";
+   reg = <0 0x10205000 0 0x1000>;
+   interrupts = ;
+   clocks = <&infracfg CLK_INFRA_M4U>;
+   clock-names = "bclk";
+   mediatek,larbs = <&larb0 &larb1 &larb2>;
+   #iommu-cells = <1>;
+   };
+
efuse: efuse@10206000 {
compatible = "mediatek,mt7623-efuse",
 "mediatek,mt8173-efuse";
@@ -669,6 +692,18 @@
#clock-cells = <1>;
};
 
+   larb0: larb@1401 {
+   compatible = "mediatek,mt7623-smi-larb",
+"mediatek,mt2701-smi-larb";
+   reg = <0 0x1401 0 0x1000>;
+   mediatek,smi = <&smi_common>;
+   mediatek,larb-id = <0>;
+   clocks = <&mmsys CLK_MM_SMI_LARB0>,
+<&mmsys CLK_MM_SMI_LARB0>;
+   clock-names = "apb", "smi";
+   power-domains = <&scpsys MT2701_POWER_DOMAIN_DISP>;
+   };
+
imgsys: syscon@1500 {
compatible = "mediatek,mt7623-imgsys",
 "mediatek,mt2701-imgsys",
@@ -677,6 +712,33 @@
#clock-cells = <1>;
};
 
+   larb2: larb@15001000 {
+   compatible = "mediatek,mt7623-smi-larb",
+"mediatek,mt2701-smi-larb";
+   reg = <0 0x15001000 0 0x1000>;
+   mediatek,smi = <&smi_common>;
+   mediatek,larb-id = <2>;
+   clocks = <&imgsys CLK_IMG_SMI_COMM>,
+<&imgsys CLK_IMG_SMI_COMM>;
+   clock-names = "apb", "smi";
+   power-domains = <&scpsys MT2701_POWER_DOMAIN_ISP>;
+   };
+
+   jpegdec: jpegdec@15004000 {
+   compatible = "mediatek,mt7623-jpgdec",
+"mediatek,mt2701-jpgdec";
+   reg = <0 0x15004000 0 0x1000>;
+   interrupts = ;
+   clocks =  <&imgsys CLK_IMG_JPGDEC_SMI>,
+ <&imgsys CLK_IMG_JPGDEC>;
+   clock-names = "jpgdec-smi",
+ "jpgdec";
+   power-domains = <&scpsys MT2701_POWER_DOMAIN_ISP>;
+   mediatek,larb = <&larb2>;
+   iommus = <&iommu MT2701_M4U_PORT_JPGDEC_WDMA>,
+<&iommu MT2701_M4U_PORT_JPGDEC_BSDMA>;
+   };
+
vdecsys: syscon@1600 {
compatible = "mediatek,mt7623-vdecsys",
 "mediatek,mt2701-vdecsys",
@@ -685,6 +747,18 @@
#clock-cells = <1>;
};
 
+   larb1: larb@1601 {
+   compatible = "mediatek,mt7623-smi-larb",
+"mediatek,mt2701-smi-larb";
+   reg = <0 0x1601 0 0x1000>;
+   mediatek,smi = <&smi_common>;
+   mediatek,larb-id = <1>;
+   clocks = <&vdecsys CLK_VDEC_CKGEN>,
+<&vdecsys CLK_VDEC_LARB>;
+   clock-names = "apb", "smi";
+   power-domains = <&scpsys MT2701_POWER_DOMAIN_VDEC>;
+   };
+
hifsys: syscon@1a00 {
compatible = "mediatek,mt7623-hifsys",
 "mediatek,mt2701-hifsys",
-- 
1.9.1

[PATCH v3 06/10] arm: dts: mt7623: add subsystem clock controller nodes

2017-10-01 Thread Ryder Lee

This patch adds the missing subsystem clock controller nodes for MT7623
(e.g., mmsys, imgsys, vdecsys and bdpsys).

Signed-off-by: Ryder Lee 
---
 arch/arm/boot/dts/mt7623.dtsi | 32 
 1 file changed, 32 insertions(+)

diff --git a/arch/arm/boot/dts/mt7623.dtsi b/arch/arm/boot/dts/mt7623.dtsi
index 0640fb7..a877f9a 100644
--- a/arch/arm/boot/dts/mt7623.dtsi
+++ b/arch/arm/boot/dts/mt7623.dtsi
@@ -661,6 +661,30 @@
status = "disabled";
};
 
+   mmsys: syscon@1400 {
+   compatible = "mediatek,mt7623-mmsys",
+"mediatek,mt2701-mmsys",
+"syscon";
+   reg = <0 0x1400 0 0x1000>;
+   #clock-cells = <1>;
+   };
+
+   imgsys: syscon@1500 {
+   compatible = "mediatek,mt7623-imgsys",
+"mediatek,mt2701-imgsys",
+"syscon";
+   reg = <0 0x1500 0 0x1000>;
+   #clock-cells = <1>;
+   };
+
+   vdecsys: syscon@1600 {
+   compatible = "mediatek,mt7623-vdecsys",
+"mediatek,mt2701-vdecsys",
+"syscon";
+   reg = <0 0x1600 0 0x1000>;
+   #clock-cells = <1>;
+   };
+
hifsys: syscon@1a00 {
compatible = "mediatek,mt7623-hifsys",
 "mediatek,mt2701-hifsys",
@@ -799,4 +823,12 @@
power-domains = <&scpsys MT2701_POWER_DOMAIN_ETH>;
status = "disabled";
};
+
+   bdpsys: syscon@1c00 {
+   compatible = "mediatek,mt7623-bdpsys",
+"mediatek,mt2701-bdpsys",
+"syscon";
+   reg = <0 0x1c00 0 0x1000>;
+   #clock-cells = <1>;
+   };
 };
-- 
1.9.1

[PATCH v3 05/10] arm: dts: mt7623: update pio, usb and crypto nodes

2017-10-01 Thread Ryder Lee

This patch updates the pio, usb and crypto nodes to make them consistent
with the binding documents.

Signed-off-by: Ryder Lee 
---
 arch/arm/boot/dts/mt7623.dtsi | 26 ++
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/arch/arm/boot/dts/mt7623.dtsi b/arch/arm/boot/dts/mt7623.dtsi
index 381843e..0640fb7 100644
--- a/arch/arm/boot/dts/mt7623.dtsi
+++ b/arch/arm/boot/dts/mt7623.dtsi
@@ -227,8 +227,7 @@
};
 
pio: pinctrl@10005000 {
-   compatible = "mediatek,mt7623-pinctrl",
-"mediatek,mt2701-pinctrl";
+   compatible = "mediatek,mt7623-pinctrl";
reg = <0 0x1000b000 0 0x1000>;
mediatek,pctl-regmap = <&syscfg_pctl_a>;
pins-are-numbered;
@@ -680,7 +679,7 @@
interrupts = ;
clocks = <&hifsys CLK_HIFSYS_USB0PHY>,
 <&topckgen CLK_TOP_ETHIF_SEL>;
-   clock-names = "sys_ck", "free_ck";
+   clock-names = "sys_ck", "ref_ck";
power-domains = <&scpsys MT2701_POWER_DOMAIN_HIF>;
phys = <&u2port0 PHY_TYPE_USB2>, <&u3port0 PHY_TYPE_USB3>;
status = "disabled";
@@ -690,8 +689,6 @@
compatible = "mediatek,mt7623-u3phy",
 "mediatek,mt2701-u3phy";
reg = <0 0x1a1c4000 0 0x0700>;
-   clocks = <&clk26m>;
-   clock-names = "u3phya_ref";
#address-cells = <2>;
#size-cells = <2>;
ranges;
@@ -699,12 +696,16 @@
 
u2port0: usb-phy@1a1c4800 {
reg = <0 0x1a1c4800 0 0x0100>;
+   clocks = <&topckgen CLK_TOP_USB_PHY48M>;
+   clock-names = "ref";
#phy-cells = <1>;
status = "okay";
};
 
u3port0: usb-phy@1a1c4900 {
reg = <0 0x1a1c4900 0 0x0700>;
+   clocks = <&clk26m>;
+   clock-names = "ref";
#phy-cells = <1>;
status = "okay";
};
@@ -719,7 +720,7 @@
interrupts = ;
clocks = <&hifsys CLK_HIFSYS_USB1PHY>,
 <&topckgen CLK_TOP_ETHIF_SEL>;
-   clock-names = "sys_ck", "free_ck";
+   clock-names = "sys_ck", "ref_ck";
power-domains = <&scpsys MT2701_POWER_DOMAIN_HIF>;
phys = <&u2port1 PHY_TYPE_USB2>, <&u3port1 PHY_TYPE_USB3>;
status = "disabled";
@@ -729,8 +730,6 @@
compatible = "mediatek,mt7623-u3phy",
 "mediatek,mt2701-u3phy";
reg = <0 0x1a244000 0 0x0700>;
-   clocks = <&clk26m>;
-   clock-names = "u3phya_ref";
#address-cells = <2>;
#size-cells = <2>;
ranges;
@@ -738,12 +737,16 @@
 
u2port1: usb-phy@1a244800 {
reg = <0 0x1a244800 0 0x0100>;
+   clocks = <&topckgen CLK_TOP_USB_PHY48M>;
+   clock-names = "ref";
#phy-cells = <1>;
status = "okay";
};
 
u3port1: usb-phy@1a244900 {
reg = <0 0x1a244900 0 0x0700>;
+   clocks = <&clk26m>;
+   clock-names = "ref";
#phy-cells = <1>;
status = "okay";
};
@@ -784,16 +787,15 @@
};
 
crypto: crypto@1b24 {
-   compatible = "mediatek,mt7623-crypto";
+   compatible = "mediatek,eip97-crypto";
reg = <0 0x1b24 0 0x2>;
interrupts = ,
 ,
 ,
 ,
 ;
-   clocks = <&topckgen CLK_TOP_ETHIF_SEL>,
-<&ethsys CLK_ETHSYS_CRYPTO>;
-   clock-names = "ethif","cryp";
+   clocks = <&ethsys CLK_ETHSYS_CRYPTO>;
+   clock-names = "cryp";
power-domains = <&scpsys MT2701_POWER_DOMAIN_ETH>;
status = "disabled";
};
-- 
1.9.1

[PATCH v3 02/10] arm: dts: mt2701: enable display pwm backlight

2017-10-01 Thread Ryder Lee

From: Weiqing Kong 

This patch adds board related config for MT2701 pwm backlight.

Signed-off-by: Weiqing Kong 
Signed-off-by: Erin Lo 
Signed-off-by: Ryder Lee 
---
 arch/arm/boot/dts/mt2701-evb.dts | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/arch/arm/boot/dts/mt2701-evb.dts b/arch/arm/boot/dts/mt2701-evb.dts
index f484973..63af4b1 100644
--- a/arch/arm/boot/dts/mt2701-evb.dts
+++ b/arch/arm/boot/dts/mt2701-evb.dts
@@ -56,12 +56,29 @@
bt_sco_codec:bt_sco_codec {
compatible = "linux,bt-sco";
};
+
+   backlight_lcd: backlight_lcd {
+   compatible = "pwm-backlight";
+   pwms = <&bls 0 10>;
+   brightness-levels = <
+ 0  16  32  48  64  80  96 112
+   128 144 160 176 192 208 224 240
+   255
+   >;
+   default-brightness-level = <9>;
+   };
 };
 
 &auxadc {
status = "okay";
 };
 
+&bls {
+   status = "okay";
+   pinctrl-names = "default";
+   pinctrl-0 = <&pwm_bls_gpio>;
+};
+
 &i2c0 {
pinctrl-names = "default";
pinctrl-0 = <&i2c0_pins_a>;
@@ -111,6 +128,12 @@
};
};
 
+   pwm_bls_gpio: pwm_bls_gpio {
+   pins_cmd_dat {
+   pinmux = ;
+   };
+   };
+
spi_pins_a: spi0@0 {
pins_spi {
pinmux = ,
-- 
1.9.1

[PATCH v3 04/10] arm: dts: mediatek: update audio node for mt2701 and mt7623

2017-10-01 Thread Ryder Lee

This patch adds the interrupt-names property to the audio node so that the
binding can be agnostic of the IRQ order.
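
To illustrate why naming the interrupts makes the order irrelevant, here is a
minimal standalone C sketch of a lookup by name. The struct, the helper and
the IRQ numbers are hypothetical and only model the idea; in the kernel a
driver would normally use platform_get_irq_byname() for this.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical pairing of one "interrupt-names" entry with its interrupt. */
struct named_irq_sketch {
	const char *name;
	int irq;
};

/*
 * Find an interrupt by name rather than by position.
 * Returns the IRQ number, or -1 if no entry has that name.
 */
static int irq_by_name_sketch(const struct named_irq_sketch *tbl, size_t n,
			      const char *name)
{
	for (size_t i = 0; i < n; i++) {
		if (strcmp(tbl[i].name, name) == 0)
			return tbl[i].irq;
	}

	return -1;
}
```

Looking up "afe" or "asys" this way returns the same interrupt whether the
node lists "afe" first or second.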

Signed-off-by: Ryder Lee 
---
 arch/arm/boot/dts/mt2701.dtsi | 4 +++-
 arch/arm/boot/dts/mt7623.dtsi | 4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/arm/boot/dts/mt2701.dtsi b/arch/arm/boot/dts/mt2701.dtsi
index 8c9fbe5..ecd388a 100644
--- a/arch/arm/boot/dts/mt2701.dtsi
+++ b/arch/arm/boot/dts/mt2701.dtsi
@@ -445,7 +445,9 @@
compatible = "mediatek,mt2701-audio";
reg = <0 0x1122 0 0x2000>,
  <0 0x112a 0 0x2>;
-   interrupts = ;
+   interrupts =  ,
+ ;
+   interrupt-names = "afe", "asys";
power-domains = <&scpsys MT2701_POWER_DOMAIN_IFR_MSC>;
 
clocks = <&infracfg CLK_INFRA_AUDIO>,
diff --git a/arch/arm/boot/dts/mt7623.dtsi b/arch/arm/boot/dts/mt7623.dtsi
index ec8a074..381843e 100644
--- a/arch/arm/boot/dts/mt7623.dtsi
+++ b/arch/arm/boot/dts/mt7623.dtsi
@@ -544,7 +544,9 @@
 "mediatek,mt2701-audio";
reg = <0 0x1122 0 0x2000>,
  <0 0x112a 0 0x2>;
-   interrupts = ;
+   interrupts =  ,
+ ;
+   interrupt-names = "afe", "asys";
power-domains = <&scpsys MT2701_POWER_DOMAIN_IFR_MSC>;
 
clocks = <&infracfg CLK_INFRA_AUDIO>,
-- 
1.9.1

[PATCH v3 03/10] arm: dts: mt2701: add display subsystem related nodes

2017-10-01 Thread Ryder Lee

From: YT Shen 

This patch adds the device nodes for MT2701 DISP function blocks.

Signed-off-by: YT Shen 
Signed-off-by: Erin Lo 
Signed-off-by: Ryder Lee 
---
 arch/arm/boot/dts/mt2701.dtsi | 75 +++
 1 file changed, 75 insertions(+)

diff --git a/arch/arm/boot/dts/mt2701.dtsi b/arch/arm/boot/dts/mt2701.dtsi
index 3c85879..8c9fbe5 100644
--- a/arch/arm/boot/dts/mt2701.dtsi
+++ b/arch/arm/boot/dts/mt2701.dtsi
@@ -26,6 +26,11 @@
compatible = "mediatek,mt2701";
interrupt-parent = <&cirq>;
 
+   aliases {
+   rdma0 = &rdma0;
+   rdma1 = &rdma1;
+   };
+
cpus {
#address-cells = <1>;
#size-cells = <0>;
@@ -203,6 +208,16 @@
power-domains = <&scpsys MT2701_POWER_DOMAIN_DISP>;
};
 
+   mipi_tx0: mipi-dphy@1001 {
+   compatible = "mediatek,mt2701-mipi-tx";
+   reg = <0 0x1001 0 0x90>;
+   clocks = <&clk26m>;
+   clock-output-names = "mipi_tx0_pll";
+   #clock-cells = <0>;
+   #phy-cells = <0>;
+   status = "disabled";
+   };
+
sysirq: interrupt-controller@10200100 {
compatible = "mediatek,mt2701-sysirq",
 "mediatek,mt6577-sysirq";
@@ -530,6 +545,30 @@
#clock-cells = <1>;
};
 
+   display_components: dispsys@1400 {
+   compatible = "mediatek,mt2701-mmsys";
+   reg = <0 0x1400 0 0x1000>;
+   power-domains = <&scpsys MT2701_POWER_DOMAIN_DISP>;
+   };
+
+   ovl: ovl@14007000 {
+   compatible = "mediatek,mt2701-disp-ovl";
+   reg = <0 0x14007000 0 0x1000>;
+   interrupts = ;
+   clocks = <&mmsys CLK_MM_DISP_OVL>;
+   iommus = <&iommu MT2701_M4U_PORT_DISP_OVL_0>;
+   mediatek,larb = <&larb0>;
+   };
+
+   rdma0: rdma@14008000 {
+   compatible = "mediatek,mt2701-disp-rdma";
+   reg = <0 0x14008000 0 0x1000>;
+   interrupts = ;
+   clocks = <&mmsys CLK_MM_DISP_RDMA>;
+   iommus = <&iommu MT2701_M4U_PORT_DISP_RDMA>;
+   mediatek,larb = <&larb0>;
+   };
+
bls: pwm@1400a000 {
compatible = "mediatek,mt2701-disp-pwm";
reg = <0 0x1400a000 0 0x1000>;
@@ -539,6 +578,33 @@
status = "disabled";
};
 
+   color: color@1400b000 {
+   compatible = "mediatek,mt2701-disp-color";
+   reg = <0 0x1400b000 0 0x1000>;
+   interrupts = ;
+   clocks = <&mmsys CLK_MM_DISP_COLOR>;
+   };
+
+   dsi: dsi@1400c000 {
+   compatible = "mediatek,mt2701-dsi";
+   reg = <0 0x1400c000 0 0x1000>;
+   interrupts = ;
+   clocks = <&mmsys CLK_MM_DSI_ENGINE>,
+<&mmsys CLK_MM_DSI_DIG>,
+<&mipi_tx0>;
+   clock-names = "engine", "digital", "hs";
+   phys = <&mipi_tx0>;
+   phy-names = "dphy";
+   status = "disabled";
+   };
+
+   mutex: mutex@1400e000 {
+   compatible = "mediatek,mt2701-disp-mutex";
+   reg = <0 0x1400e000 0 0x1000>;
+   interrupts = ;
+   clocks = <&mmsys CLK_MM_MUTEX_32K>;
+   };
+
larb0: larb@1401 {
compatible = "mediatek,mt2701-smi-larb";
reg = <0 0x1401 0 0x1000>;
@@ -550,6 +616,15 @@
power-domains = <&scpsys MT2701_POWER_DOMAIN_DISP>;
};
 
+   rdma1: rdma@14012000 {
+   compatible = "mediatek,mt2701-disp-rdma";
+   reg = <0 0x14012000 0 0x1000>;
+   interrupts = ;
+   clocks = <&mmsys CLK_MM_DISP_RDMA1>;
+   iommus = <&iommu MT2701_M4U_PORT_DISP_RDMA1>;
+   mediatek,larb = <&larb0>;
+   };
+
imgsys: syscon@1500 {
compatible = "mediatek,mt2701-imgsys", "syscon";
reg = <0 0x1500 0 0x1000>;
-- 
1.9.1

[PATCH v3 00/10] update MT7623 and MT2701 dts

2017-10-01 Thread Ryder Lee

Hi Matthias,

This patch series adds/corrects some device nodes for both MT7623 and MT2701.

changes since v3:
- revert PIO register space.

changes since v2:
- move non-common part and non-display related nodes to different patches.
- remove unused wdma node.
- add display related nodes for MT2701.

changes since v1:
- rebase to v4.14.
- sort nodes in alphabetical order

Ryder Lee (7):
  arm: dts: mediatek: update audio node for mt2701 and mt7623
  arm: dts: mt7623: update pio, usb and crypto nodes
  arm: dts: mt7623: add subsystem clock controller nodes
  arm: dts: mt7623: add iommu and jpecdec nodes
  arm: dts: mt7623: add display subsystem related nodes
  arm: dts: mt7623: enable bananapi-r2 display function
  arm: dts: mt7623: add PCIe related nodes

Weiqing Kong (2):
  arm: dts: mt2701: add pwm backlight device node
  arm: dts: mt2701: enable display pwm backlight

YT Shen (1):
  arm: dts: mt2701: add display subsystem related nodes

 arch/arm/boot/dts/mt2701-evb.dts  |  23 ++
 arch/arm/boot/dts/mt2701.dtsi |  88 ++-
 arch/arm/boot/dts/mt7623.dtsi | 338 +-
 arch/arm/boot/dts/mt7623n-bananapi-bpi-r2.dts |  71 +-
 include/dt-bindings/pinctrl/mt7623-pinfunc.h  |  12 +
 5 files changed, 516 insertions(+), 16 deletions(-)

-- 
1.9.1

[PATCH v3 08/10] arm: dts: mt7623: add display subsystem related nodes

2017-10-01 Thread Ryder Lee

This patch adds the device nodes for the display function blocks.

Signed-off-by: Ryder Lee 
---
 arch/arm/boot/dts/mt7623.dtsi | 94 +++
 1 file changed, 94 insertions(+)

diff --git a/arch/arm/boot/dts/mt7623.dtsi b/arch/arm/boot/dts/mt7623.dtsi
index b257715..b19aa9f 100644
--- a/arch/arm/boot/dts/mt7623.dtsi
+++ b/arch/arm/boot/dts/mt7623.dtsi
@@ -29,6 +29,11 @@
compatible = "mediatek,mt7623";
interrupt-parent = <&sysirq>;
 
+   aliases {
+   rdma0 = &rdma0;
+   rdma1 = &rdma1;
+   };
+
cpu_opp_table: opp_table {
compatible = "operating-points-v2";
opp-shared;
@@ -298,6 +303,17 @@
clock-names = "spi", "wrap";
};
 
+   mipi_tx0: mipi-dphy@1001 {
+   compatible = "mediatek,mt7623-mipi-tx",
+"mediatek,mt2701-mipi-tx";
+   reg = <0 0x1001 0 0x90>;
+   clocks = <&clk26m>;
+   clock-output-names = "mipi_tx0_pll";
+   #clock-cells = <0>;
+   #phy-cells = <0>;
+   status = "disabled";
+   };
+
cir: cir@10013000 {
compatible = "mediatek,mt7623-cir";
reg = <0 0x10013000 0 0x1000>;
@@ -692,6 +708,74 @@
#clock-cells = <1>;
};
 
+   display_components: dispsys@1400 {
+   compatible = "mediatek,mt7623-mmsys",
+"mediatek,mt2701-mmsys";
+   reg = <0 0x1400 0 0x1000>;
+   power-domains = <&scpsys MT2701_POWER_DOMAIN_DISP>;
+   };
+
+   ovl: ovl@14007000 {
+   compatible = "mediatek,mt7623-disp-ovl",
+"mediatek,mt2701-disp-ovl";
+   reg = <0 0x14007000 0 0x1000>;
+   interrupts = ;
+   clocks = <&mmsys CLK_MM_DISP_OVL>;
+   iommus = <&iommu MT2701_M4U_PORT_DISP_OVL_0>;
+   mediatek,larb = <&larb0>;
+   };
+
+   rdma0: rdma@14008000 {
+   compatible = "mediatek,mt7623-disp-rdma",
+"mediatek,mt2701-disp-rdma";
+   reg = <0 0x14008000 0 0x1000>;
+   interrupts = ;
+   clocks = <&mmsys CLK_MM_DISP_RDMA>;
+   iommus = <&iommu MT2701_M4U_PORT_DISP_RDMA>;
+   mediatek,larb = <&larb0>;
+   };
+
+   bls: pwm@1400a000 {
+   compatible = "mediatek,mt7623-disp-pwm",
+"mediatek,mt2701-disp-pwm";
+   reg = <0 0x1400a000 0 0x1000>;
+   #pwm-cells = <2>;
+   clocks = <&mmsys CLK_MM_MDP_BLS_26M>,
+<&mmsys CLK_MM_DISP_BLS>;
+   clock-names = "main", "mm";
+   status = "disabled";
+   };
+
+   color: color@1400b000 {
+   compatible = "mediatek,mt7623-disp-color",
+"mediatek,mt2701-disp-color";
+   reg = <0 0x1400b000 0 0x1000>;
+   interrupts = ;
+   clocks = <&mmsys CLK_MM_DISP_COLOR>;
+   };
+
+   dsi: dsi@1400c000 {
+   compatible = "mediatek,mt7623-dsi",
+"mediatek,mt2701-dsi";
+   reg = <0 0x1400c000 0 0x1000>;
+   interrupts = ;
+   clocks = <&mmsys CLK_MM_DSI_ENGINE>,
+<&mmsys CLK_MM_DSI_DIG>,
+<&mipi_tx0>;
+   clock-names = "engine", "digital", "hs";
+   phys = <&mipi_tx0>;
+   phy-names = "dphy";
+   status = "disabled";
+   };
+
+   mutex: mutex@1400e000 {
+   compatible = "mediatek,mt7623-disp-mutex",
+"mediatek,mt2701-disp-mutex";
+   reg = <0 0x1400e000 0 0x1000>;
+   interrupts = ;
+   clocks = <&mmsys CLK_MM_MUTEX_32K>;
+   };
+
larb0: larb@1401 {
compatible = "mediatek,mt7623-smi-larb",
 "mediatek,mt2701-smi-larb";
@@ -704,6 +788,16 @@
power-domains = <&scpsys MT2701_POWER_DOMAIN_DISP>;
};
 
+   rdma1: rdma@14012000 {
+   compatible = "mediatek,mt7623-disp-rdma",
+"mediatek,mt2701-disp-rdma";
+   reg = <0 0x14012000 0 0x1000>;
+   interrupts = ;
+   clocks = <&mmsys CLK_MM_DISP_RDMA1>;
+   iommus = <&iommu MT2701_M4U_PORT_DISP_RDMA1>;
+   mediatek,larb = <&larb0>;
+   };
+
imgsys: syscon@1500 {
compatible = "mediatek,mt7623-imgsys",
 "mediatek,mt2701-imgsys",
-- 
1.9.1

[PATCH v3 01/10] arm: dts: mt2701: add pwm backlight device node

2017-10-01 Thread Ryder Lee

From: Weiqing Kong 

This patch adds the device node for MT2701 pwm backlight.

Signed-off-by: Weiqing Kong 
Signed-off-by: Erin Lo 
Signed-off-by: Ryder Lee 
---
 arch/arm/boot/dts/mt2701.dtsi | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/arm/boot/dts/mt2701.dtsi b/arch/arm/boot/dts/mt2701.dtsi
index afe12e5..3c85879 100644
--- a/arch/arm/boot/dts/mt2701.dtsi
+++ b/arch/arm/boot/dts/mt2701.dtsi
@@ -530,6 +530,15 @@
#clock-cells = <1>;
};
 
+   bls: pwm@1400a000 {
+   compatible = "mediatek,mt2701-disp-pwm";
+   reg = <0 0x1400a000 0 0x1000>;
+   #pwm-cells = <2>;
+   clocks = <&mmsys CLK_MM_MDP_BLS_26M>, <&mmsys CLK_MM_DISP_BLS>;
+   clock-names = "main", "mm";
+   status = "disabled";
+   };
+
larb0: larb@1401 {
compatible = "mediatek,mt2701-smi-larb";
reg = <0 0x1401 0 0x1000>;
-- 
1.9.1

[PATCH v3 10/10] arm: dts: mt7623: add PCIe related nodes

2017-10-01 Thread Ryder Lee

This patch adds device nodes and updates the pinmux settings for the PCIe
function block. Note that the PCIe port2 PHY is shared with the U3 port.

Signed-off-by: Ryder Lee 
---
 arch/arm/boot/dts/mt7623.dtsi | 108 ++
 arch/arm/boot/dts/mt7623n-bananapi-bpi-r2.dts |  30 +++
 2 files changed, 138 insertions(+)

diff --git a/arch/arm/boot/dts/mt7623.dtsi b/arch/arm/boot/dts/mt7623.dtsi
index b19aa9f..32d454e 100644
--- a/arch/arm/boot/dts/mt7623.dtsi
+++ b/arch/arm/boot/dts/mt7623.dtsi
@@ -862,6 +862,114 @@
#reset-cells = <1>;
};
 
+   pcie: pcie-controller@1a14 {
+   compatible = "mediatek,mt7623-pcie";
+   device_type = "pci";
+   reg = <0 0x1a14 0 0x1000>, /* PCIe shared registers */
+ <0 0x1a142000 0 0x1000>, /* Port0 registers */
+ <0 0x1a143000 0 0x1000>, /* Port1 registers */
+ <0 0x1a144000 0 0x1000>; /* Port2 registers */
+   reg-names = "subsys", "port0", "port1", "port2";
+   #address-cells = <3>;
+   #size-cells = <2>;
+   #interrupt-cells = <1>;
+   interrupt-map-mask = <0xf800 0 0 0>;
+   interrupt-map = <0x 0 0 0 &sysirq GIC_SPI 193 IRQ_TYPE_LEVEL_LOW>,
+   <0x0800 0 0 0 &sysirq GIC_SPI 194 IRQ_TYPE_LEVEL_LOW>,
+   <0x1000 0 0 0 &sysirq GIC_SPI 195 IRQ_TYPE_LEVEL_LOW>;
+   clocks = <&topckgen CLK_TOP_ETHIF_SEL>,
+<&hifsys CLK_HIFSYS_PCIE0>,
+<&hifsys CLK_HIFSYS_PCIE1>,
+<&hifsys CLK_HIFSYS_PCIE2>;
+   clock-names = "free_ck", "sys_ck0", "sys_ck1", "sys_ck2";
+   resets = <&hifsys MT2701_HIFSYS_PCIE0_RST>,
+<&hifsys MT2701_HIFSYS_PCIE1_RST>,
+<&hifsys MT2701_HIFSYS_PCIE2_RST>;
+   reset-names = "pcie-rst0", "pcie-rst1", "pcie-rst2";
+   phys = <&pcie0_port PHY_TYPE_PCIE>,
+  <&pcie1_port PHY_TYPE_PCIE>,
+  <&u3port1 PHY_TYPE_PCIE>;
+   phy-names = "pcie-phy0", "pcie-phy1", "pcie-phy2";
+   power-domains = <&scpsys MT2701_POWER_DOMAIN_HIF>;
+   bus-range = <0x00 0xff>;
+   status = "disabled";
+   ranges = <0x8100 0 0x1a16 0 0x1a16 0 0x0001
+ 0x8300 0 0x6000 0 0x6000 0 0x1000>;
+
+   pcie@0,0 {
+   device_type = "pci";
+   reg = <0x 0 0 0 0>;
+   #address-cells = <3>;
+   #size-cells = <2>;
+   #interrupt-cells = <1>;
+   interrupt-map-mask = <0 0 0 0>;
+   interrupt-map = <0 0 0 0 &sysirq GIC_SPI 193 IRQ_TYPE_LEVEL_LOW>;
+   ranges;
+   num-lanes = <1>;
+   status = "disabled";
+   };
+
+   pcie@1,0 {
+   device_type = "pci";
+   reg = <0x0800 0 0 0 0>;
+   #address-cells = <3>;
+   #size-cells = <2>;
+   #interrupt-cells = <1>;
+   interrupt-map-mask = <0 0 0 0>;
+   interrupt-map = <0 0 0 0 &sysirq GIC_SPI 194 IRQ_TYPE_LEVEL_LOW>;
+   ranges;
+   num-lanes = <1>;
+   status = "disabled";
+   };
+
+   pcie@2,0 {
+   device_type = "pci";
+   reg = <0x1000 0 0 0 0>;
+   #address-cells = <3>;
+   #size-cells = <2>;
+   #interrupt-cells = <1>;
+   interrupt-map-mask = <0 0 0 0>;
+   interrupt-map = <0 0 0 0 &sysirq GIC_SPI 195 IRQ_TYPE_LEVEL_LOW>;
+   ranges;
+   num-lanes = <1>;
+   status = "disabled";
+   };
+   };
+
+   pcie0_phy: pcie-phy@1a149000 {
+   compatible = "mediatek,generic-tphy-v1";
+   reg = <0 0x1a149000 0 0x0700>;
+   #address-cells = <2>;
+   #size-cells = <2>;
+   ranges;
+   status = "disabled";
+
+   pcie0_port: pcie-phy@1a149900 {
+   reg = <0 0x1a149900 0 0x0700>;
+   clocks = <&clk26m>;
+   clock-names = "ref";
+   #phy-cells = <1>;
+   status = "okay";
+   };
+   };
+
+   pcie1_phy: pcie-phy@1a14a000 {
+   compatible = "mediatek,generic-tphy-v1";
+   reg = <0 0x1a14a000 0 0x0700>;
+   #address-cells = <2>;
+   #size-cells

[PATCH v3 09/10] arm: dts: mt7623: enable bananapi-r2 display function

2017-10-01 Thread Ryder Lee

This patch adds missing MIPI pin macros in mt7623-pinfunc.h and
enables pwm backlight support for bananapi-r2.

Signed-off-by: Ryder Lee 
Acked-by: Linus Walleij 
---
 arch/arm/boot/dts/mt7623n-bananapi-bpi-r2.dts | 41 +--
 include/dt-bindings/pinctrl/mt7623-pinfunc.h  | 12 
 2 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/arch/arm/boot/dts/mt7623n-bananapi-bpi-r2.dts b/arch/arm/boot/dts/mt7623n-bananapi-bpi-r2.dts
index 688a863..267a05a 100644
--- a/arch/arm/boot/dts/mt7623n-bananapi-bpi-r2.dts
+++ b/arch/arm/boot/dts/mt7623n-bananapi-bpi-r2.dts
@@ -17,6 +17,17 @@
serial2 = &uart2;
};
 
+   backlight_lcd: backlight_lcd {
+   compatible = "pwm-backlight";
+   pwms = <&bls 0 10>;
+   brightness-levels = <
+ 0  16  32  48  64  80  96 112
+   128 144 160 176 192 208 224 240
+   255
+   >;
+   default-brightness-level = <9>;
+   };
+
chosen {
stdout-path = "serial2:115200n8";
};
@@ -86,6 +97,12 @@
};
 };
 
+&bls {
+   status = "okay";
+   pinctrl-names = "default";
+   pinctrl-0 = <&bls_pins_a>;
+};
+
 &cir {
pinctrl-names = "default";
pinctrl-0 = <&cir_pins_a>;
@@ -210,6 +227,12 @@
 };
 
 &pio {
+   bls_pins_a: bls@0 {
+   pins_cmd_dat {
+   pinmux = ;
+   };
+   };
+
cir_pins_a:cir@0 {
pins_cir {
pinmux = ;
@@ -273,6 +296,21 @@
};
};
 
+   mipi_dsi_pin: mipi_dsi_pin {
+   pins_cmd_dat {
+   pinmux = ,
+,
+,
+,
+,
+,
+,
+,
+,
+;
+   };
+   };
+
mmc0_pins_default: mmc0default {
pins_cmd_dat {
pinmux = ,
@@ -378,8 +416,7 @@
 
pwm_pins_a: pwm@0 {
pins_pwm {
-   pinmux = ,
-,
+   pinmux = ,
 ,
 ,
 ;
diff --git a/include/dt-bindings/pinctrl/mt7623-pinfunc.h b/include/dt-bindings/pinctrl/mt7623-pinfunc.h
index 436a87b..72bed67 100644
--- a/include/dt-bindings/pinctrl/mt7623-pinfunc.h
+++ b/include/dt-bindings/pinctrl/mt7623-pinfunc.h
@@ -272,6 +272,18 @@
 #define MT7623_PIN_84_DSI_TE_FUNC_GPIO84 (MTK_PIN_NO(84) | 0)
 #define MT7623_PIN_84_DSI_TE_FUNC_DSI_TE (MTK_PIN_NO(84) | 1)
 
+#define MT7623_PIN_91_MIPI_TDN3_FUNC_GPIO91 (MTK_PIN_NO(91) | 0)
+#define MT7623_PIN_91_MIPI_TDN3_FUNC_TDN3 (MTK_PIN_NO(91) | 1)
+
+#define MT7623_PIN_92_MIPI_TDP3_FUNC_GPIO92 (MTK_PIN_NO(92) | 0)
+#define MT7623_PIN_92_MIPI_TDP3_FUNC_TDP3 (MTK_PIN_NO(92) | 1)
+
+#define MT7623_PIN_93_MIPI_TDN2_FUNC_GPIO93 (MTK_PIN_NO(93) | 0)
+#define MT7623_PIN_93_MIPI_TDN2_FUNC_TDN2 (MTK_PIN_NO(93) | 1)
+
+#define MT7623_PIN_94_MIPI_TDP2_FUNC_GPIO94 (MTK_PIN_NO(94) | 0)
+#define MT7623_PIN_94_MIPI_TDP2_FUNC_TDP2 (MTK_PIN_NO(94) | 1)
+
 #define MT7623_PIN_95_MIPI_TCN_FUNC_GPIO95 (MTK_PIN_NO(95) | 0)
 #define MT7623_PIN_95_MIPI_TCN_FUNC_TCN (MTK_PIN_NO(95) | 1)
 
-- 
1.9.1

Re: [PATCH v2 2/2] cpufreq: schedutil: consolidate capacity margin calculation

2017-10-01 Thread Joel Fernandes

Hi Leo,

On Sun, Oct 1, 2017 at 5:30 PM, Leo Yan  wrote:
> Scheduler CFS class has variable 'capacity_margin' to calculate the

s/calculate/represent/ ?

> capacity margin, and schedutil governor also needs to compensate the
> same margin for frequency tipping point. Below are formulas used in
> CFS class and schedutil governor separately:
>
> CFS:   U` = U * capacity_margin / 1024 = U * 1.25

You should mention in the commit message that capacity_margin is
currently 1280, which makes U` = U * 1.25.

> Schedutil: U` = U + U >> 2 = U + U * 0.25  = U * 1.25
>
> This patch consolidates the capacity margin calculation so let
> schedutil to use same formula with CFS class. As result this can avoid

As a result.

> the mismatch issue between schedutil and CFS class after change
> 'capacity_margin' to other values.

This didn't make sense to me. Maybe you meant:

This patch consolidates the usage of the capacity margin value and
lets schedutil use the same formula as the CFS class. Thus we can
avoid the mismatch between schedutil and CFS class if
'capacity_margin' is changed to other values in the future.

> Cc: Dietmar Eggemann 
> Cc: Morten Rasmussen 
> Cc: Chris Redpath 
> Cc: Joel Fernandes 
> Cc: Vincent Guittot 
> Cc: Patrick Bellasi 
> Cc: Rafael J. Wysocki 
> Signed-off-by: Leo Yan 
> ---
>  kernel/sched/cpufreq_schedutil.c | 6 --
>  kernel/sched/sched.h | 1 +
>  2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c 
> b/kernel/sched/cpufreq_schedutil.c
> index 9209d83..13cc243 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -155,7 +155,8 @@ static void sugov_update_commit(struct sugov_policy 
> *sg_policy, u64 time,
>   *
>   * next_freq = C * curr_freq * util_raw / max
>   *
> - * Take C = 1.25 for the frequency tipping point at (util / max) = 0.8.
> + * Take C = capacity_margin / 1024 = 1.25, so it's for the frequency tipping
> + * point at (util / max) = 0.8.

The above comment assumes capacity_margin is 1280. If for any reason
the capacity_margin is changed to something else, then the comment
won't make sense anymore.
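
A minimal userspace sketch of how the comment could follow the value
instead of assuming 1280: with C = capacity_margin / 1024, next_freq
equals curr_freq when util / max = 1024 / capacity_margin. The helper
name tipping_point_permille() is hypothetical and not part of the
kernel.

```c
#include <assert.h>

#define SCHED_CAPACITY_SHIFT 10
#define SCHED_CAPACITY_SCALE (1U << SCHED_CAPACITY_SHIFT)

/*
 * Hypothetical helper: utilization ratio, in parts per thousand, at which
 * next_freq = C * curr_freq * util / max equals curr_freq, where
 * C = margin / 1024. Solving C * util / max = 1 gives
 * util / max = 1024 / margin.
 */
static unsigned int tipping_point_permille(unsigned int margin)
{
	assert(margin != 0);
	return SCHED_CAPACITY_SCALE * 1000U / margin;
}
```

With the current value, tipping_point_permille(1280) evaluates to 800,
i.e. the 0.8 named in the comment; a margin of 1536 would move it to
roughly 0.67.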

thanks,

- Joel

Re: [PATCH net 3/3] net: skb_queue_purge(): lock/unlock the queue only once

2017-10-01 Thread Stephen Hemminger

On Sun, 01 Oct 2017 22:19:20 -
Michael Witten  wrote:

> + spin_lock_irqsave(&q->lock, flags);
> + skb = q->next;
> + __skb_queue_head_init(q);
> + spin_unlock_irqrestore(&q->lock, flags);

Other code manipulating lists uses splice operation and
a sk_buff_head temporary on the stack. That would be easier
to understand.

struct sk_buff_head head;

__skb_queue_head_init(&head);
spin_lock_irqsave(&q->lock, flags);
skb_queue_splice_init(q, &head);
spin_unlock_irqrestore(&q->lock, flags);

> + while (skb != head) {
> + next = skb->next;
>   kfree_skb(skb);
> + skb = next;
> + }

It would be cleaner if you could use
skb_queue_walk_safe rather than open coding the loop.

skb_queue_walk_safe(&head, skb, tmp)
kfree_skb(skb);
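
The splice-then-free pattern suggested above can be sketched in plain
userspace C. This is an illustration under assumptions, not the
networking code: struct node and struct queue stand in for sk_buff and
sk_buff_head, and queue_lock()/queue_unlock() are hypothetical stubs
that only count acquisitions instead of taking a real spinlock.

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal circular doubly linked list, standing in for sk_buff_head. */
struct node {
	struct node *next, *prev;
};

struct queue {
	struct node head;	/* sentinel; head.next == &head when empty */
	int lock_ops;		/* counts lock acquisitions (stub for a spinlock) */
};

static void queue_init(struct queue *q)
{
	q->head.next = q->head.prev = &q->head;
	q->lock_ops = 0;
}

/* Hypothetical lock stubs: a real implementation would take a spinlock. */
static void queue_lock(struct queue *q)   { q->lock_ops++; }
static void queue_unlock(struct queue *q) { (void)q; }

static void queue_add_tail(struct queue *q, struct node *n)
{
	n->prev = q->head.prev;
	n->next = &q->head;
	q->head.prev->next = n;
	q->head.prev = n;
}

/* Move every node from src onto the empty list dst and reinitialise src. */
static void queue_splice_init(struct queue *src, struct queue *dst)
{
	assert(dst->head.next == &dst->head);	/* dst must start out empty */
	if (src->head.next == &src->head)
		return;
	dst->head.next = src->head.next;
	dst->head.prev = src->head.prev;
	dst->head.next->prev = &dst->head;
	dst->head.prev->next = &dst->head;
	src->head.next = src->head.prev = &src->head;
}

/* Free all nodes while taking the queue lock only once; returns the count. */
static int queue_purge(struct queue *q)
{
	struct queue tmp;
	struct node *n, *next;
	int freed = 0;

	queue_init(&tmp);

	queue_lock(q);
	queue_splice_init(q, &tmp);
	queue_unlock(q);

	/* Safe walk: fetch the successor before freeing the current node. */
	for (n = tmp.head.next; n != &tmp.head; n = next) {
		next = n->next;
		free(n);
		freed++;
	}
	return freed;
}
```

The point of the temporary list is that the shared queue is empty and
consistent again as soon as the lock is dropped, so the frees run
without the lock held.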

Re: [PATCH v3 4/8] platform/x86: wmi: create character devices when requested by drivers

2017-10-01 Thread Darren Hart

On Sat, Sep 30, 2017 at 10:12:05AM +0200, Greg Kroah-Hartman wrote:
> On Fri, Sep 29, 2017 at 06:52:28PM -0700, Darren Hart wrote:
> > 
> > On Wed, Sep 27, 2017 at 11:02:16PM -0500, Mario Limonciello wrote:
> > > For WMI operations that are only Set or Query read or write sysfs
> > > attributes created by WMI vendor drivers make sense.
> > > 
> > > For other WMI operations that are run on Method, there needs to be a
> > > way to guarantee to userspace that the results from the method call
> > > belong to the data request to the method call.  Sysfs attributes don't
> > > work well in this scenario because two userspace processes may be
> > > competing at reading/writing an attribute and step on each other's
> > > data.
> > > 
> > > When a WMI vendor driver declares a set of functions in a
> > > file_operations object the WMI bus driver will create a character
> > > device that maps to those file operations.
> > > 
> > > That character device will correspond to this path:
> > > /dev/wmi/$driver
> > > 
> > > This policy is selected as one driver may map and use multiple
> > > GUIDs and it would be better to only expose a single character
> > > device.
> > > 
> > > The WMI vendor drivers will be responsible for managing access to
> > > this character device and proper locking on it.
> > > 
> > > When a WMI vendor driver is unloaded the WMI bus driver will clean
> > > up the character device.
> > > 
> > > Signed-off-by: Mario Limonciello 
> > > ---
> > >  drivers/platform/x86/wmi.c | 98 
> > > +++---
> > >  include/linux/wmi.h|  1 +
> > >  2 files changed, 94 insertions(+), 5 deletions(-)
> > 
> > +Greg, Rafael, Matthew, and Christoph
> > 
> > You each provided feedback regarding the method of exposing WMI methods
> > to userspace. This and subsequent patches from Mario lay some of the
> > core groundwork.
> > 
> > They implement an implicit whitelist as only drivers requesting the char
> > dev will see it created.
> > 
> > https://lkml.org/lkml/2017/9/28/8
> 
> If you want patchs reviewed, it's best to actually cc: us on the patch
> itself :(
> 

Of course. I didn't send the series, but thought you should see it. I
could have asked Mario to resend, but I thought a pointer would have
made it easy enough to find in your lkml folder, and it would avoid
splitting the conversation which resending would inevitably lead to. I
pruned this one because Christoph gets upset if I don't.

We can wait for v4 I guess. And next time I want to get your take on
something someone doesn't Cc you on, I'll just ask them to resend the
whole series with you on Cc.

-- 
Darren Hart
VMware Open Source Technology Center

Re: [PATCH 00/18] use ARRAY_SIZE macro

2017-10-01 Thread Jérémy Lefaure

On Mon, 2 Oct 2017 09:01:31 +1100
"Tobin C. Harding"  wrote:

> > In order to reduce the size of the To: and Cc: lines, each patch of the
> > series is sent only to the maintainers and lists concerned by the patch.
> > This cover letter is sent to every list concerned by this series.  
> 
> Why don't you just send individual patches for each subsystem? I'm not a
> maintainer but I don't see how any one person is going to be able to apply
> this whole series, it is making it hard for maintainers if they have to pick
> patches out from among the series (if indeed any will bother doing that).
Yeah, maybe it would have been better to send individual patches.

From my point of view it's a series because the patches are related (I
did a git format-patch from my local branch). But from the maintainers'
point of view, they are individual patches.

As each patch in this series is standalone, the maintainers can pick
the patches they want and apply them individually. So yeah, maybe I
should have sent individual patches. But I also wanted to tie all
patches together as it's the same change.

Anyway, I can tell each maintainer that they can apply the patches
they're concerned about, and next time I may send individual patches.

Thank you for your answer,
Jérémy

Re: [PATCH] rpmsg: Allow RPMSG_VIRTIO to be enabled via menuconfig or defconfig

2017-10-01 Thread kbuild test robot

Hi Anup,

[auto build test ERROR on v4.14-rc2]
[cannot apply to rpmsg/for-next]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:
https://github.com/0day-ci/linux/commits/Anup-Patel/rpmsg-Allow-RPMSG_VIRTIO-to-be-enabled-via-menuconfig-or-defconfig/20170926-121327
config: um-allyesconfig (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=um 

All errors (new ones prefixed by >>):

   arch/um/drivers/vde.o: In function `vde_open_real':
   (.text+0xfb1): warning: Using 'getgrnam' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
   arch/um/drivers/vde.o: In function `vde_open_real':
   (.text+0xdfc): warning: Using 'getpwuid' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
   arch/um/drivers/vde.o: In function `vde_open_real':
   (.text+0x1115): warning: Using 'getaddrinfo' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
   arch/um/drivers/pcap.o: In function `pcap_nametoaddr':
   (.text+0xe475): warning: Using 'gethostbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
   arch/um/drivers/pcap.o: In function `pcap_nametonetaddr':
   (.text+0xe515): warning: Using 'getnetbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
   arch/um/drivers/pcap.o: In function `pcap_nametoproto':
   (.text+0xe735): warning: Using 'getprotobyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
   arch/um/drivers/pcap.o: In function `pcap_nametoport':
   (.text+0xe567): warning: Using 'getservbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
   drivers/auxdisplay/img-ascii-lcd.o: In function `img_ascii_lcd_probe':
   drivers/auxdisplay/img-ascii-lcd.c:386: undefined reference to `devm_ioremap_resource'
   drivers/rpmsg/virtio_rpmsg_bus.o: In function `rpmsg_remove':
>> include/linux/dma-mapping.h:528: undefined reference to `bad_dma_ops'
   include/linux/dma-mapping.h:534: undefined reference to `bad_dma_ops'
   drivers/rpmsg/virtio_rpmsg_bus.o: In function `rpmsg_probe':
   include/linux/dma-mapping.h:507: undefined reference to `bad_dma_ops'
   include/linux/dma-mapping.h:534: undefined reference to `bad_dma_ops'
   drivers/virtio/virtio_ring.o: In function `vring_unmap_one':
   include/linux/dma-mapping.h:251: undefined reference to `bad_dma_ops'
   drivers/virtio/virtio_ring.o:include/linux/dma-mapping.h:315: more undefined references to `bad_dma_ops' follow
   collect2: error: ld returned 1 exit status

vim +528 include/linux/dma-mapping.h

e1c7e324 Christoph Hellwig   2016-01-20  521
e1c7e324 Christoph Hellwig   2016-01-20  522  static inline void dma_free_attrs(struct device *dev, size_t size,
e1c7e324 Christoph Hellwig   2016-01-20  523                                    void *cpu_addr, dma_addr_t dma_handle,
00085f1e Krzysztof Kozlowski 2016-08-03  524                                    unsigned long attrs)
e1c7e324 Christoph Hellwig   2016-01-20  525  {
5299709d Bart Van Assche     2017-01-20  526          const struct dma_map_ops *ops = get_dma_ops(dev);
e1c7e324 Christoph Hellwig   2016-01-20  527
e1c7e324 Christoph Hellwig   2016-01-20 @528          BUG_ON(!ops);
e1c7e324 Christoph Hellwig   2016-01-20  529          WARN_ON(irqs_disabled());
e1c7e324 Christoph Hellwig   2016-01-20  530
43fc509c Vladimir Murzin     2017-07-20  531          if (dma_release_from_dev_coherent(dev, get_order(size), cpu_addr))
e1c7e324 Christoph Hellwig   2016-01-20  532                  return;
e1c7e324 Christoph Hellwig   2016-01-20  533
d6b7eaeb Zhen Lei            2016-03-09  534          if (!ops->free || !cpu_addr)
e1c7e324 Christoph Hellwig   2016-01-20  535                  return;
e1c7e324 Christoph Hellwig   2016-01-20  536
e1c7e324 Christoph Hellwig   2016-01-20  537          debug_dma_free_coherent(dev, size, cpu_addr, dma_handle);
e1c7e324 Christoph Hellwig   2016-01-20  538          ops->free(dev, size, cpu_addr, dma_handle, attrs);
e1c7e324 Christoph Hellwig   2016-01-20  539  }
e1c7e324 Christoph Hellwig   2016-01-20  540

:: The code at line 528 was first introduced by commit
:: e1c7e324539ada3b2b13ca2898bcb4948a9ef9db dma-mapping: always provide the dma_map_ops based implementation

:: TO: Christoph Hellwig 
:: CC: Linus Torvalds 

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


.config.gz
Description: application/gzip

[PATCH v2 2/2] cpufreq: schedutil: consolidate capacity margin calculation

2017-10-01 Thread Leo Yan

Scheduler CFS class has variable 'capacity_margin' to calculate the
capacity margin, and schedutil governor also needs to compensate the
same margin for frequency tipping point. Below are formulas used in
CFS class and schedutil governor separately:

CFS:   U` = U * capacity_margin / 1024 = U * 1.25
Schedutil: U` = U + U >> 2 = U + U * 0.25  = U * 1.25

This patch consolidates the capacity margin calculation so let
schedutil to use same formula with CFS class. As result this can avoid
the mismatch issue between schedutil and CFS class after change
'capacity_margin' to other values.

Cc: Dietmar Eggemann 
Cc: Morten Rasmussen 
Cc: Chris Redpath 
Cc: Joel Fernandes 
Cc: Vincent Guittot 
Cc: Patrick Bellasi 
Cc: Rafael J. Wysocki 
Signed-off-by: Leo Yan 
---
 kernel/sched/cpufreq_schedutil.c | 6 --
 kernel/sched/sched.h | 1 +
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 9209d83..13cc243 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -155,7 +155,8 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
  *
  * next_freq = C * curr_freq * util_raw / max
  *
- * Take C = 1.25 for the frequency tipping point at (util / max) = 0.8.
+ * Take C = capacity_margin / 1024 = 1.25, so it's for the frequency tipping
+ * point at (util / max) = 0.8.
  *
  * The lowest driver-supported frequency which is equal or greater than the raw
  * next_freq (as calculated above) is returned, subject to policy min/max and
@@ -168,7 +169,8 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
unsigned int freq = arch_scale_freq_invariant() ?
policy->cpuinfo.max_freq : policy->cur;
 
-   freq = (freq + (freq >> 2)) * util / max;
+   freq = freq * capacity_margin >> SCHED_CAPACITY_SHIFT;
+   freq = freq * util / max;
 
if (freq == sg_policy->cached_raw_freq && sg_policy->next_freq != UINT_MAX)
return sg_policy->next_freq;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 14db76c..cf75bdc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -52,6 +52,7 @@ struct cpuidle_state;
 #define TASK_ON_RQ_MIGRATING   2
 
 extern __read_mostly int scheduler_running;
+extern unsigned int capacity_margin __read_mostly;
 
 extern unsigned long calc_load_update;
 extern atomic_long_t calc_load_tasks;
-- 
2.7.4
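
As a quick check of the claim in the commit message, the two forms can
be compared in userspace. This is a sketch under assumptions, not the
kernel code: next_freq_shift() and next_freq_margin() are hypothetical
names, and the arithmetic is widened to 64 bits. With the 32-bit
'unsigned int' used in the patch, freq * capacity_margin would wrap
once freq exceeds UINT_MAX / 1280 (about 3355443), which the sketch
deliberately sidesteps.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT 10

/* Old schedutil form: fixed 25% headroom via a shift. */
static uint64_t next_freq_shift(uint64_t freq, uint64_t util, uint64_t max)
{
	assert(max != 0);
	return (freq + (freq >> 2)) * util / max;
}

/* New form: headroom taken from a margin expressed in 1/1024 units. */
static uint64_t next_freq_margin(uint64_t freq, uint64_t util, uint64_t max,
				 uint64_t margin)
{
	assert(max != 0);
	freq = (freq * margin) >> SCHED_CAPACITY_SHIFT;
	return freq * util / max;
}
```

For margin = 1280 the two agree for every freq, because
freq * 1280 >> 10 is floor(freq * 5 / 4), which equals
freq + (freq >> 2). Any other margin changes only the second form,
which is the mismatch the patch is meant to avoid.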

[PATCH v2 1/2] sched/fair: make capacity_margin __read_mostly

2017-10-01 Thread Leo Yan

Variable 'capacity_margin' is mostly read when calculating the
capacity margin, so put it into the __read_mostly section.

Cc: Dietmar Eggemann 
Cc: Morten Rasmussen 
Cc: Chris Redpath 
Cc: Joel Fernandes 
Cc: Vincent Guittot 
Cc: Patrick Bellasi 
Signed-off-by: Leo Yan 
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 70ba32e..ad03bf4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -129,7 +129,7 @@ unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
  *
  * (default: ~20%)
  */
-unsigned int capacity_margin   = 1280;
+unsigned int capacity_margin __read_mostly = 1280;
 
 static inline void update_load_add(struct load_weight *lw, unsigned long inc)
 {
-- 
2.7.4

Re: [PATCH v2 01/16] hyper-v: trace vmbus_on_msg_dpc()

2017-10-01 Thread kbuild test robot

Hi Vitaly,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.14-rc3 next-20170929]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:
https://github.com/0day-ci/linux/commits/Vitaly-Kuznetsov/hyper-v-trace-vmbus_on_msg_dpc/20171002-062040
config: i386-randconfig-x017-201740 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All warnings (new ones prefixed by >>):

   In file included from include/trace/define_trace.h:95:0,
                    from drivers/hv/hv_trace.h:29,
                    from drivers/hv/hv_trace.c:4:
   include/trace/trace_events.h:759:13: warning: 'print_fmt_vmbus_hdr_msg' defined but not used [-Wunused-variable]
     static char print_fmt_##call[] = print; \
                 ^
>> drivers/hv/./hv_trace.h:9:1: note: in expansion of macro 'DECLARE_EVENT_CLASS'
    DECLARE_EVENT_CLASS(vmbus_hdr_msg,
    ^~~
   In file included from include/trace/define_trace.h:95:0,
                    from drivers/hv/hv_trace.h:29,
                    from drivers/hv/hv_trace.c:4:
   include/trace/trace_events.h:363:37: warning: 'trace_event_type_funcs_vmbus_hdr_msg' defined but not used [-Wunused-variable]
     static struct trace_event_functions trace_event_type_funcs_##call = { \
                                         ^
>> drivers/hv/./hv_trace.h:9:1: note: in expansion of macro 'DECLARE_EVENT_CLASS'
    DECLARE_EVENT_CLASS(vmbus_hdr_msg,
    ^~~

vim +/DECLARE_EVENT_CLASS +9 drivers/hv/./hv_trace.h

 8  
   > 9  DECLARE_EVENT_CLASS(vmbus_hdr_msg,
10  TP_PROTO(const struct vmbus_channel_message_header *hdr),
11  TP_ARGS(hdr),
12  TP_STRUCT__entry(__field(unsigned int, msgtype)),
13  TP_fast_assign(__entry->msgtype = hdr->msgtype;),
14  TP_printk("msgtype=%d", __entry->msgtype)
15  );
16  

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


.config.gz
Description: application/gzip

Re: [RFC PATCH 3/3] fs: detect that the i_rwsem has already been taken exclusively

2017-10-01 Thread Mimi Zohar

On Sun, 2017-10-01 at 15:20 -0700, Linus Torvalds wrote:
> On Sun, Oct 1, 2017 at 3:06 PM, Eric W. Biederman  
> wrote:
> >
> > Unless I misread something it was being pointed out there are some vfs
> > operations today on which ima writes an ima xattr as a side effect.  And
> > those operations hold the i_sem.  So perhaps I am misunderstanding
> > things or writing the ima xattr needs to happen at some point.  Which
> > implies something like queued work.
> 
> So the issue is indeed the inode semaphore, as it is used by IMA. But
> all these IMA patches to work around the issue are just horribly ugly.
> One adds a VFS-layer filesystem method that most filesystems end up
> not really needing (it's the same as the regular read), and other
> filesystems end up then having hacks with ("oh, I don't need to take
> this lock because it was already taken by the caller").
> 
> The second patch attempt avoided the need for a new filesystem method,
> but added a flag in an annoying place (for the same basic logic). The
> advantage is that now most filesystems don't actually need to care any
> more (and the filesystems that used to care now check that flag).
> 
> There was discussion about moving the flag to a more convenient spot,
> which would have made it a lot less intrusive.
> 
> But the basic issue is that almost always when you see lock
> inversions, the problem can just be fixed by doing the locking
> differently instead.

This is what I've been missing.  Thank you for taking the time to
understand the problem and explain how!

> And that's what I was/am pushing for.

> There really are two totally different issues:
> 
>  - the integrity _measurement_.
> 
>This one wants to be serialized, so that you don't have multiple
> concurrent measurements, and the serialization fundamentally has to be
> around all the IO, so this lock pretty much has to be outside the
> i_sem.
> 
>  - the integrity invalidation on certain operations.
> 
>This one fundamentally had to be inside the i_sem, since some of
> the operations that cause this end up already holding the i_sem at a
> VFS layer.
> 
> so you had these two different requirements (inside _and_ outside),
> and the IMA approach was basically to avoid the problem by making
> i_sem *the* lock, and then making the IO routines aware of it already
> being held. That does solve the inside/outside issue.
> 
> But the simpler way to fix it is to simply use two locks that nest
> inside each other, with i_sem nesting in the middle.  That just avoids
> the problem entirely, and doesn't require anybody to ever care about
> i_sem semantic changes, because i_sem semantics simply didn't change
> at all.
> 
> So that's the approach I'm pushing. I admittedly haven't actually
> looked at the IMA details, but from a high-level standpoint you can
> basically describe it (as above) without having to care too much about
> exactly what IMA even wants.
> 
> The two-lock approach does require that the operations that invalidate
> the integrity measurements always only invalidate it, and don't try to
> re-compute it. But I suspect that would be entirely insane anyway
> (imagine a world where "setxattr" would have to read the whole file
> contents in order to revalidate the integrity measurement - even if
> there is nobody who even *cares*).

Right, the setxattr, chmod, and chown syscalls just reset the cached
flags, which indicate whether the file needs to be re-measured,
re-validated, or re-audited.  The file hash is not re-calculated at this
point.  That happens on the next access (in policy).
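
The "invalidate only, re-measure on next access" behaviour described
above can be sketched as follows. This is a toy model under
assumptions, not the IMA code: struct integrity_cache, toy_hash(),
cache_invalidate() and cache_measure() are hypothetical names, and the
hash is a simple djb2-style function used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-file integrity cache; names are illustrative only. */
struct integrity_cache {
	bool measured;		/* cached result is still valid */
	unsigned long hash;	/* last computed hash */
	int hash_runs;		/* how many times the contents were hashed */
};

/* Toy hash over the file contents (djb2-style). */
static unsigned long toy_hash(const char *data, size_t len)
{
	unsigned long h = 5381;

	for (size_t i = 0; i < len; i++)
		h = h * 33 + (unsigned char)data[i];
	return h;
}

/* Metadata change (setxattr/chmod/chown analogue): only drop cached state. */
static void cache_invalidate(struct integrity_cache *c)
{
	c->measured = false;
}

/* Access path: re-hash only when the cached state was invalidated. */
static unsigned long cache_measure(struct integrity_cache *c,
				   const char *data, size_t len)
{
	assert(data != NULL);
	if (!c->measured) {
		c->hash = toy_hash(data, len);
		c->hash_runs++;
		c->measured = true;
	}
	return c->hash;
}
```

Because cache_invalidate() never reads the file contents, it is cheap
enough to run inside whatever lock the metadata operation already
holds; the expensive hashing stays on the access path.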

Mimi

Re: [RFC PATCH 3/3] fs: detect that the i_rwsem has already been taken exclusively

2017-10-01 Thread Mimi Zohar

On Mon, 2017-10-02 at 09:34 +1100, Dave Chinner wrote:
> On Sun, Oct 01, 2017 at 11:41:48AM -0700, Linus Torvalds wrote:
> > On Sun, Oct 1, 2017 at 5:08 AM, Mimi Zohar  wrote:
> > >
> > > Right, re-introducing the iint->mutex and a new i_generation field in
> > > the iint struct with a separate set of locks should work.  It will be
> > > reset if the file metadata changes (eg. setxattr, chown, chmod).
> > 
> > Note that the "inner lock" could possibly be omitted if the
> > invalidation can be just a single atomic instruction.
> > 
> > So particularly if invalidation could be just an atomic_inc() on the
> > generation count, there might not need to be any inner lock at all.
> > 
> > You'd have to serialize the actual measurement with the "read
> > generation count", but that should be as simple as just doing a
> > smp_rmb() between the "read generation count" and "do measurement on
> > file contents".
> 
> We already have a change counter on the inode, which is modified on
> any data or metadata write (i_version) under filesystem locks.  The
> i_version counter has well defined semantics - it's required by
> NFSv4 to increment on any metadata or data change - so we should be
> able to rely on its behaviour to implement IMA as well. Filesystems
> that support i_version are marked with [SB|MS]_I_VERSION in the
> superblock (IS_I_VERSION(inode)) so it should be easy to tell if IMA
> can be supported on a specific filesystem (btrfs, ext4, fuse and xfs
> ATM).

Recently I received a patch to replace i_version with mtime/atime.
Now, even more recently, I received a patch that claims that
i_version is just a performance improvement.  For file systems that
don't support i_version, assume that the file has changed.

For file systems that don't support i_version, instead of assuming
that the file has changed, we can at least use i_generation.

With Linus' suggested changes, I think this will work nicely.

> The IMA code should be able to sample that at measurement time and
> either fail or be retried if i_version changes during measurement.
> We can then simply make the IMA xattr write conditional on the
> i_version value being unchanged from the sample the IMA code passes
> into the filesystem once the filesystem holds all the locks it needs
> to write the xattr...

> I note that IMA already grabs the i_version in
> ima_collect_measurement(), so this shouldn't be too hard to do.
> Perhaps we don't need any new locks or counters at all, maybe just
> the ability to feed a version cookie to the set_xattr method?

The security.ima xattr is normally written out in
ima_check_last_writer(), not in ima_collect_measurement().
ima_collect_measurement() calculates the file hash for storing in the
measurement list (IMA-measurement), verifying the hash/signature
(IMA-appraisal) already stored in the xattr, and auditing (IMA-audit).

The only time that ima_collect_measurement() writes the file xattr is
in "fix" mode.  Writing the xattr will need to be deferred until after
the iint->mutex is released.

There should be no open writers in ima_check_last_writer(), so the
file shouldn't be changing.
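
The sample-then-compare scheme discussed in this thread (sample a
change counter, measure, write the result only if the counter is
unchanged) can be sketched with C11 atomics. This is an illustration
under assumptions: struct toy_inode and the toy_*() helpers are
hypothetical, and the sketch leaves out the filesystem locks under
which the final comparison and xattr write would have to happen.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode with a change counter. */
struct toy_inode {
	atomic_ulong version;	/* bumped on every data or metadata change */
	unsigned long xattr;	/* stored measurement */
	bool xattr_valid;
};

static void toy_inode_init(struct toy_inode *inode)
{
	assert(inode != NULL);
	atomic_init(&inode->version, 0);
	inode->xattr = 0;
	inode->xattr_valid = false;
}

/* Writer side: invalidation is a single atomic increment. */
static void toy_inode_change(struct toy_inode *inode)
{
	atomic_fetch_add(&inode->version, 1);
}

/* Measurement side, step 1: sample the counter before reading contents. */
static unsigned long toy_sample_version(struct toy_inode *inode)
{
	return atomic_load(&inode->version);
}

/* Step 2: store the result only if nothing changed since the sample. */
static bool toy_commit_measurement(struct toy_inode *inode,
				   unsigned long sampled, unsigned long digest)
{
	if (atomic_load(&inode->version) != sampled)
		return false;	/* stale sample: caller retries or gives up */
	inode->xattr = digest;
	inode->xattr_valid = true;
	return true;
}
```

A caller would sample, hash the file contents, then call
toy_commit_measurement(); a false return means a change raced with the
measurement and the digest must not be stored.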

Mimi

Re: [v8 0/4] cgroup-aware OOM killer

2017-10-01 Thread Shakeel Butt

>
> Going back to Michal's example, say the user configured the following:
>
>root
>   /\
>  A  D
> / \
>B   C
>
> A global OOM event happens and we find this:
> - A > D
> - B, C, D are oomgroups
>
> What the user is telling us is that B, C, and D are compound memory
> consumers. They cannot be divided into their task parts from a memory
> point of view.
>
> However, the user doesn't say the same for A: the A subtree summarizes
> and controls aggregate consumption of B and C, but without groupoom
> set on A, the user says that A is in fact divisible into independent
> memory consumers B and C.
>
> If we don't have to kill all of A, but we'd have to kill all of D,
> does it make sense to compare the two?
>

I think Tim has given a very clear explanation of why comparing A & D
makes perfect sense. However, I think the above example, a single-user
system where a user has designed and created the whole hierarchy and then
attaches different jobs/applications to different nodes in this
hierarchy, is also a valid scenario. One solution I can think of, to
cater to both scenarios, is to introduce a notion of 'bypass oom', i.e.
do not include a memcg in the oom comparison and instead include its
children in the comparison.

So, in the same above example:
root
   /   \
  A(b)D
 /  \
B   C

A is marked as bypass and thus B and C are to be compared to D. So,
for the single user scenario, all the internal nodes are marked
'bypass oom comparison' and oom_priority of the leaves has to be set
to the same value.

Below is the pseudo code of select_victim_memcg() based on this idea
and David's previous pseudo code. The calculation of size of a memcg
is still not very well baked here yet. I am working on it and I plan
to have a patch based on Roman's v9 "mm, oom: cgroup-aware OOM killer"
patch.


struct mem_cgroup *memcg = root_mem_cgroup;
struct mem_cgroup *selected_memcg = root_mem_cgroup;
struct mem_cgroup *low_memcg;
unsigned long low_priority;
unsigned long prev_badness = memcg_oom_badness(memcg); // Roman's code
LIST_HEAD(queue);

next_level:
        low_memcg = NULL;
        low_priority = ULONG_MAX;

next:
        for_each_child_of_memcg(it, memcg) {
                unsigned long prio = it->oom_priority;
                unsigned long badness = 0;

                if (it->bypass_oom && !it->oom_group &&
                    memcg_has_children(it)) {
                        list_add(&it->oom_queue, &queue);
                        continue;
                }

                if (prio > low_priority)
                        continue;

                if (prio == low_priority) {
                        // for simplicity, need more thinking
                        badness = mem_cgroup_usage(it);
                        if (badness < prev_badness)
                                continue;
                }

                low_memcg = it;
                low_priority = prio;
                // for simplicity
                prev_badness = badness ?: mem_cgroup_usage(it);
        }
        if (!list_empty(&queue)) {
                memcg = list_last_entry(&queue, struct mem_cgroup, oom_queue);
                list_del(&memcg->oom_queue);
                goto next;
        }
        if (low_memcg) {
                selected_memcg = memcg = low_memcg;
                prev_badness = 0;
                if (!low_memcg->oom_group)
                        goto next_level;
        }
        if (selected_memcg->oom_group)
                oom_kill_memcg(selected_memcg);
        else
                oom_kill_process_from_memcg(selected_memcg);

4879b7ae05 ("Merge tag 'dmaengine-4.12-rc1' of .."): WARNING: kernel stack regs at bd92bc2e in 01-cpu-hotplug:3811 has bad 'bp' value 000001be

2017-10-01 Thread kernel test robot

Greetings,

0day kernel testing robot got the below dmesg and the first bad commit is

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master

commit 4879b7ae05431ebcd228a4ff25a81120b3d85891
Merge: ecc721a72c121 be13ec668d043
Author: Linus Torvalds 
AuthorDate: Tue May 9 15:40:28 2017 -0700
Commit: Linus Torvalds 
CommitDate: Tue May 9 15:40:28 2017 -0700

Merge tag 'dmaengine-4.12-rc1' of git://git.infradead.org/users/vkoul/slave-dma

Pull dmaengine updates from Vinod Koul:
 "This time again a smaller update consisting of:

   - support for TI DA8xx dma controller and updates to the cppi driver

   - updates on bunch of drivers like xilinx, pl08x, stm32-dma, mv_xor,
 ioat, dmatest"

* tag 'dmaengine-4.12-rc1' of git://git.infradead.org/users/vkoul/slave-dma: (35 commits)
  dmaengine: pl08x: remove lock documentation
  dmaengine: pl08x: fix pl08x_dma_chan_state documentation
  dmaengine: pl08x: Use the BIT() macro consistently
  dmaengine: pl080: Fix some missing kerneldoc
  dmaengine: pl080: Cut some unused defines
  dmaengine: dmatest: Add check for supported buffer count (sg_buffers)
  dmaengine: dmatest: Select DMA_ENGINE_RAID as its needed for the slave_sg test
  dmaengine: virt-dma: Convert to use list_for_each_entry_safe()
  dma-debug: use offset_in_page() macro
  dmaengine: mv_xor: use offset_in_page() macro
  dmaengine: dmatest: use offset_in_page() macro
  dmaengine: sun4i: fix invalid argument
  dmaengine: ioat: use setup_timer
  dmaengine: cppi41: Fix an Oops happening in cppi41_dma_probe()
  dmaengine: pl330: remove pdata based initialization
  dmaengine: cppi: fix build error due to bad variable
  dmaengine: imx-sdma: add 1ms delay to ensure SDMA channel is stopped
  dmaengine: cppi41: use managed functions devm_*()
  dmaengine: cppi41: fix cppi41_dma_tx_status() logic
  dmaengine: qcom_hidma: pause the channel on shutdown
  ...

ecc721a72c  Merge tag 'pwm/for-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm
be13ec668d  Merge branch 'topic/pl330' into for-linus
4879b7ae05  Merge tag 'dmaengine-4.12-rc1' of git://git.infradead.org/users/vkoul/slave-dma
9e66317d3c  Linux 4.14-rc3
1418b85217  Add linux-next specific files for 20170929
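+-------------------------------------------------------+------------+------------+------------+-----------+---------------+
|                                                       | ecc721a72c | be13ec668d | 4879b7ae05 | v4.14-rc3 | next-20170929 |
+-------------------------------------------------------+------------+------------+------------+-----------+---------------+
| boot_successes                                        | 1009       | 1009       | 909        | 5         | 510           |
| boot_failures                                         | 0          | 0          | 1          | 4         | 153           |
| WARNING:kernel_stack                                  | 0          | 0          | 1          | 3         | 111           |
| BUG:unable_to_handle_kernel                           | 0          | 0          | 0          | 3         | 48            |
| Oops:#[##]                                            | 0          | 0          | 0          | 3         | 48            |
| EIP:update_stack_state                                | 0          | 0          | 0          | 3         | 48            |
| Kernel_panic-not_syncing:Fatal_exception_in_interrupt | 0          | 0          | 0          | 3         | 48            |
| invoked_oom-killer:gfp_mask=0x                        | 0          | 0          | 0          | 1         | 16            |
| Mem-Info                                              | 0          | 0          | 0          | 1         | 16            |
| EIP:clear_user                                        | 0          | 0          | 0          | 0         | 2             |
| EIP:copy_page_to_iter                                 | 0          | 0          | 0          | 0         | 1             |
+-------------------------------------------------------+------------+------------+------------+-----------+---------------+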

[   98.541044] random: ubusd: uninitialized urandom read (4 bytes read)
[   98.541708] random: ubusd: uninitialized urandom read (4 bytes read)
[   98.549594] random: ubusd: uninitialized urandom read (4 bytes read)
[   98.550365] random: ubusd: uninitialized urandom read (4 bytes read)
[  110.395203] sock: process `trinity-main' is using obsolete setsockopt SO_BSDCOMPAT
[  110.399425] WARNING: kernel stack regs at bd92bc2e in 01-cpu-hotplug:3811 has bad 'bp' value 01be
[  110.399428] unwind stack type:0 next_sp:  (null) mask:0x2 graph_idx:0
[  110.399431] bd92bc2e: 92bc7f00 (0x92bc7f00)
[  110.399433] bd92bc32: c27041bd (0xc27041bd)
[  110.399435] bd92bc36: 92bc7fb8 (0x92bc7fb8)
[  110.399436] bd92bc3a: ffbd

Re: [RFC PATCH 3/3] fs: detect that the i_rwsem has already been taken exclusively

2017-10-01 Thread Linus Torvalds

On Sun, Oct 1, 2017 at 3:34 PM, Dave Chinner  wrote:
>
> We already have a change counter on the inode, which is modified on
> any data or metadata write (i_version) under filesystem locks.  The
> i_version counter has well defined semantics - it's required by
> NFSv4 to increment on any metadata or data change - so we should be
> able to rely on its behaviour to implement IMA as well.

I actually think i_version has exactly the wrong semantics.

Afaik, it doesn't actually version the file _data_ at all, it only
versions "inode itself changed".

But I might have missed something obvious. The updates are hidden in
some odd places sometimes.

  Linus

[PATCH] mm,hugetlb,migration: don't migrate kernelcore hugepages

2017-10-01 Thread Alexandru Moise

This attempts to bring more flexibility to how hugepages are allocated
by making it possible to decide whether we want the hugepages to be
allocated from ZONE_MOVABLE or from the zone that the "kernelcore="
boot parameter sets aside for non-movable allocations.

A new boot parameter is introduced, "hugepages_movable=", this sets the
default value for the "hugepages_treat_as_movable" sysctl. This allows
us to determine the zone for hugepages allocated at boot time. It only
affects 2M hugepages allocated at boot time for now because 1G
hugepages are allocated much earlier in the boot process and ignore
this sysctl completely.

The "hugepages_treat_as_movable" sysctl is also turned into a mandatory
setting that all hugepage allocations at runtime must respect (both
2M and 1G sized hugepages). The default value is changed to "1" to
preserve the existing behavior that if hugepage migration is supported,
then the pages will be allocated from ZONE_MOVABLE.

Note, however, that if not enough contiguous memory is present in
ZONE_MOVABLE then the allocation will fall back to the non-movable zone
and those pages will not be migratable.
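
For readers following along, the change to the zone decision can be modelled
in userspace. This is only a sketch: the enum and the function names are
illustrative stand-ins (not kernel API) for the GFP masks, for
htlb_alloc_mask() before and after this patch, and for the new
"hugepages_movable=" handler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for GFP_HIGHUSER and GFP_HIGHUSER_MOVABLE. */
enum zone_choice { ZONE_CHOICE_NONMOVABLE, ZONE_CHOICE_MOVABLE };

/* Before the patch: either condition alone selects the movable mask. */
enum zone_choice alloc_zone_before(bool treat_as_movable, bool migration_supported)
{
	return (treat_as_movable || migration_supported)
		? ZONE_CHOICE_MOVABLE : ZONE_CHOICE_NONMOVABLE;
}

/* After the patch: the sysctl and migration support must both hold. */
enum zone_choice alloc_zone_after(bool treat_as_movable, bool migration_supported)
{
	return (treat_as_movable && migration_supported)
		? ZONE_CHOICE_MOVABLE : ZONE_CHOICE_NONMOVABLE;
}

/*
 * Same decision as the boot parameter handler in the patch: only a leading
 * '0' or '1' changes the value, anything else keeps the current setting.
 */
int parse_hugepages_movable(const char *str, int current)
{
	assert(str != NULL);
	if (str[0] == '0')
		return 0;
	if (str[0] == '1')
		return 1;
	return current;
}
```

With the sysctl at its new default of 1, both variants agree whenever
migration is supported; they differ when the sysctl is 0 (the new code then
always picks the non-movable mask) and when the sysctl is 1 but migration is
not supported.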

The implementation is a bit dirty so obviously I'm open to suggestions
for a better way to implement this behavior, or comments whether the whole
idea is fundamentally __wrong__.

Signed-off-by: Alexandru Moise <00moses.alexande...@gmail.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  8 
 Documentation/sysctl/vm.txt |  3 +++
 mm/hugetlb.c| 15 +--
 mm/migrate.c|  8 +++-
 4 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 05496622b4ef..25116d32d59e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1318,6 +1318,14 @@
x86-64 are 2M (when the CPU supports "pse") and 1G
(when the CPU supports the "pdpe1gb" cpuinfo flag).
 
+   hugepages_movable=
+   [HW,IA-64,PPC,X86-64] Default value for the
+   hugepages_treat_as_movable sysctl (default is 1).
+   When 1 this will attempt to allocate hugepages from
+   ZONE_MOVABLE, if 0 it will attempt to allocate hugepages
+   from the non-movable zone created with the "kernelcore="
+   kernel parameter.
+
hvc_iucv=   [S390] Number of z/VM IUCV hypervisor console (HVC)
   terminal devices. Valid values: 0..8
hvc_iucv_allow= [S390] Comma-separated list of z/VM user IDs.
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 9baf66a9ef4e..4c5755a1cf9f 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -267,6 +267,9 @@ or not. If set to non-zero, hugepages can be allocated from ZONE_MOVABLE.
 ZONE_MOVABLE is created when kernel boot parameter kernelcore= is specified,
 so this parameter has no effect if used without kernelcore=.
 
+The default value for this sysctl can also be set via the hugepages_movable=
+kernel boot parameter (to 0 or 1), default is 1.
+
 Hugepage migration is now available in some situations which depend on the
 architecture and/or the hugepage size. If a hugepage supports migration,
 allocation from ZONE_MOVABLE is always enabled for the hugepage regardless
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 424b0ef08a60..5d4efdadbd56 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -36,7 +36,7 @@
 #include 
 #include "internal.h"
 
-int hugepages_treat_as_movable;
+int hugepages_treat_as_movable = 1;
 
 int hugetlb_max_hstate __read_mostly;
 unsigned int default_hstate_idx;
@@ -926,7 +926,7 @@ static struct page *dequeue_huge_page_nodemask(struct hstate *h, gfp_t gfp_mask,
 /* Movability of hugepages depends on migration support. */
 static inline gfp_t htlb_alloc_mask(struct hstate *h)
 {
-   if (hugepages_treat_as_movable || hugepage_migration_supported(h))
+   if (hugepages_treat_as_movable && hugepage_migration_supported(h))
return GFP_HIGHUSER_MOVABLE;
else
return GFP_HIGHUSER;
@@ -2805,6 +2805,17 @@ static int __init hugetlb_init(void)
 }
 subsys_initcall(hugetlb_init);
 
+static int __init hugepages_movable(char *str)
+{
+   if (!strncmp(str, "0", 1))
+   hugepages_treat_as_movable = 0;
+   else if (!strncmp(str, "1", 1))
+   hugepages_treat_as_movable = 1;
+
+   return 1;
+}
+__setup("hugepages_movable=", hugepages_movable);
+
 /* Should be called on processing a hugepagesz=... option */
 void __init hugetlb_bad_size(void)
 {
diff --git a/mm/migrate.c b/mm/migrate.c
index 6954c1435833..23946d88e533 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1266,6 +1266,7 @@ static int unmap_and_

Linux 4.14-rc3

2017-10-01 Thread Linus Torvalds

So 4.14 continues to be a somewhat painful release, and I'm starting
to at least partly blame the fact that it's meant to be an LTS
release.

The last LTS release we had (4.9) resulted in one of the biggest
kernel releases we ever had because everybody wanted in; the 4.14
release doesn't seem to be as large, but it does seem to result in
some late work happening because people want to prep for 4.14, knowing
it will be LTS.

But who knows. Some of this may just be pure coincidence too. But I
already know of two more pull requests that are still pending that
will also probably want to be pushed into 4.14.

Anyway, on to the actual rc3 changes.. Most of them are the normal
small fixes, but a few things do stand out:

 - some x86 FPU state handling fixes

 - fixed some crypto problems in our internal key handling

 - some smp/hotplug cleanups

and all of them are bigger than I would have wished for at this stage,
but all of them had fine reasons for going in now. They all had one
thing in common, in that they also came with cleanups in order to fix
the underlying problem (so often the actual commit that _fixes_ it is
pretty small, but there's a series of cleanups that makes that fix
possible).

The two issues that I know as potentially still pending are some of
the same kind: a writeback fix and some watchdog fixes, both with the
majority being cleanups in order to fix things.

Anyway, this all has the common thread that I'd have loved to get that
code during the merge window as "obviously good changes", but I'm not
thrilled to get it during the rc stages.

Oh well. Enough of the "Woe is me".

Things don't actually look *bad*. Yes, it's more changes than I would
have wished for at this stage, but at the same time none of it looks
like it's really fundamentally problematic for the 4.14 release. Most
of the x86 FPU state cleanups had already been around for a while just
because they were needed cleanup, for example, it's just that the bug
fixes made them get merged at a less than optimal time.

The various changes do end up making the diffstat look somewhat
unusual: driver fixes, which usually dominate, are just a quarter of the
haul this rc, and arch fixes (almost all of which are x86) are
another quarter. The rest is core kernel (much of it the smp/hotplug
updates), security (the key handling changes) and tooling (much of it
perf, but also more selftests). Some fs fixes (btrfs and xfs, some
misc) account for the rest.

It's still early enough in the rc release that I don't know if this
will impact timing. Right now it still feels like we're fine with the
usual schedule (ie rc7 being the last rc), but we'll just have to see
how this release cycle continues.

Do go out and test, please.

Linus

---

Adrian Hunter (1):
  mmc: sdhci-pci: Fix voltage switch for some Intel host controllers

Akemi Yagi (1):
  perf tools: Fix syscalltbl build failure

Al Viro (2):
  fix a typo in put_compat_shm_info()
  fix infoleak in waitid(2)

Alex Deucher (1):
  drm/radeon: disable hard reset in hibernate for APUs

Alex Estrin (1):
  Revert "IB/ipoib: Update broadcast object if PKey value was
changed in index 0"

Alex Vesker (1):
  IB/ipoib: Fix inconsistency with free_netdev and free_rdma_netdev

Alexander Shishkin (1):
  perf/aux: Only update ->aux_wakeup in non-overwrite mode

Alexandru Moise (1):
  genirq: Check __free_irq() return value for NULL

Andi Kleen (1):
  x86/fpu: Turn WARN_ON() in context switch into WARN_ON_FPU()

Andreas Gruenbacher (2):
  gfs2: Fix debugfs glocks dump
  vfs: Return -ENXIO for negative SEEK_HOLE / SEEK_DATA offsets

Andrey Ryabinin (1):
  x86/asm: Use register variable to get stack pointer value

Arnaldo Carvalho de Melo (2):
  perf tools: Get all of tools/{arch,include}/ in the MANIFEST
  perf evsel: Fix attr.exclude_kernel setting for default cycles:p

Arnd Bergmann (1):
  isofs: fix build regression

Arvind Yadav (1):
  iommu/amd: pr_err() strings should end with newlines

Babu Moger (1):
  arch: change default endian for microblaze

Bhumika Goyal (1):
  x86/numachip: Add const and __initconst to numachip2_clockevent

Boqun Feng (1):
  kvm/x86: Handle async PF in RCU read-side critical sections

Boris Brezillon (1):
  mtd: Fix partition alignment check on multi-erasesize devices

Carlos Maiolino (1):
  xfs: Capture state of the right inode in xfs_iflush_done

Chandan Rajendra (1):
  iomap_dio_rw: Allocate AIO completion queue before submitting dio

Christoph Hellwig (1):
  bsg-lib: don't free job in bsg_prepare_job

Colin Ian King (3):
  drm/amdkfd: check for null dev to avoid a null pointer dereference
  x86/xen: clean up clang build warning
  xfs: remove redundant re-initialization of total_nr_pages

Daniel Díaz (2):
  selftests: net: More graceful finding of `ip'.
  selftests: intel_pstate: build only on x86

Darrick J. Wong (4):
  xfs: don't uncon

[PATCH v2] staging: android: TODO: Removing an invalid issue

2017-10-01 Thread Joaquin Garmendia Cabrera

The first line of TODO is invalid because no file
has an error or warning when running checkpatch.pl

Signed-off-by: Joaquin Garmendia Cabrera 
---
Changes in v2:
  - Fixing a Typo.

 drivers/staging/android/TODO | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/staging/android/TODO b/drivers/staging/android/TODO
index 5f14247392bf..687e0eac85bf 100644
--- a/drivers/staging/android/TODO
+++ b/drivers/staging/android/TODO
@@ -1,5 +1,4 @@
 TODO:
-   - checkpatch.pl cleanups
- sparse fixes
- rename files to be not so "generic"
- add proper arch dependencies as needed
-- 
2.14.2

[PATCH net 3/3] net: skb_queue_purge(): lock/unlock the queue only once

2017-10-01 Thread Michael Witten

Date: Sat, 9 Sep 2017 05:50:23 +
Hitherto, the queue's lock has been locked/unlocked every time
an item is dequeued; this seems not only inefficient, but also
incorrect, as the whole point of `skb_queue_purge()' is to clear
the queue, presumably without giving any other thread a chance to
manipulate the queue in the interim.

With this commit, the queue's lock is locked/unlocked only once
when `skb_queue_purge()' is called, and in a way that disables
the IRQs for only a minimal amount of time.

This is achieved by atomically re-initializing the queue (thereby
clearing it), and then freeing each of the items as though it were
enqueued in a private queue that doesn't require locking.
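
The detach-then-free idea can be sketched in userspace as follows. This is a
rough model under stated assumptions, not the kernel code: `struct node`,
`struct queue` and the helper names are hypothetical, and a C11 atomic_flag
spinlock stands in for the queue's lock (there is no IRQ handling here).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model: circular doubly linked list with a sentinel head. */
struct node {
	struct node *next, *prev;
};

struct queue {
	struct node head;	/* sentinel; queue is empty when head.next == &head */
	atomic_flag lock;	/* stand-in for the queue's spinlock */
	unsigned int qlen;
};

void queue_lock(struct queue *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

void queue_unlock(struct queue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

void queue_init(struct queue *q)
{
	assert(q != NULL);
	q->head.next = q->head.prev = &q->head;
	atomic_flag_clear(&q->lock);
	q->qlen = 0;
}

void queue_push_tail(struct queue *q, struct node *n)
{
	assert(n != NULL);
	queue_lock(q);
	n->prev = q->head.prev;
	n->next = &q->head;
	q->head.prev->next = n;
	q->head.prev = n;
	q->qlen++;
	queue_unlock(q);
}

/* Take the lock once, detach the whole chain, then free it unlocked. */
unsigned int queue_purge(struct queue *q)
{
	struct node *n, *next;
	unsigned int freed = 0;

	queue_lock(q);
	n = q->head.next;			/* first item, or the sentinel */
	q->head.next = q->head.prev = &q->head;	/* queue is now empty */
	q->qlen = 0;
	queue_unlock(q);

	/* The detached chain still ends at the sentinel's address. */
	while (n != &q->head) {
		next = n->next;
		free(n);
		freed++;
		n = next;
	}
	return freed;
}
```

The free loop only compares against the sentinel's address and never
dereferences it, so items that other threads enqueue after the unlock are
not touched by the purge.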

Signed-off-by: Michael Witten 
---
 net/core/skbuff.c | 26 ++
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 68065d7d383f..bd26b0bde784 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2825,18 +2825,28 @@ struct sk_buff *skb_dequeue_tail(struct sk_buff_head *list)
 EXPORT_SYMBOL(skb_dequeue_tail);
 
 /**
- * skb_queue_purge - empty a list
- * @list: list to empty
+ * skb_queue_purge - empty a queue
+ * @q: the queue to empty
  *
- * Delete all buffers on an &sk_buff list. Each buffer is removed from
- * the list and one reference dropped. This function takes the list
- * lock and is atomic with respect to other list locking functions.
+ * Dequeue and free each socket buffer that is in @q.
+ *
+ * This function is atomic with respect to other queue-locking functions.
  */
-void skb_queue_purge(struct sk_buff_head *list)
+void skb_queue_purge(struct sk_buff_head *q)
 {
-   struct sk_buff *skb;
-   while ((skb = skb_dequeue(list)) != NULL)
+   unsigned long flags;
+   struct sk_buff *skb, *next, *head = (struct sk_buff *)q;
+
+   spin_lock_irqsave(&q->lock, flags);
+   skb = q->next;
+   __skb_queue_head_init(q);
+   spin_unlock_irqrestore(&q->lock, flags);
+
+   while (skb != head) {
+   next = skb->next;
kfree_skb(skb);
+   skb = next;
+   }
 }
 EXPORT_SYMBOL(skb_queue_purge);
 
-- 
2.14.1

[PATCH net 2/3] net: inet_recvmsg(): Remove unnecessary bitwise operation

2017-10-01 Thread Michael Witten

Date: Fri, 8 Sep 2017 00:47:49 +
The flag `MSG_DONTWAIT' is handled by passing an argument through
the dedicated parameter `nonblock' of the function `tcp_recvmsg()'.

Presumably because `MSG_DONTWAIT' is handled so explicitly, it is
unset in the collection of flags that are passed to `tcp_recvmsg()';
yet, this unsetting appears to be unnecessary, and so this commit
removes the bitwise operation that performs the unsetting.
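
To make the difference concrete, here is a small userspace sketch of what is
forwarded before and after this change. `struct recv_args` and the two helper
names are hypothetical; only MSG_DONTWAIT (from <sys/socket.h>) is real.
Whether the receiving function is truly indifferent to the extra bit is the
assumption this patch rests on, and the sketch does not verify it.

```c
#include <assert.h>
#include <sys/socket.h>

/* Hypothetical pair of values handed to the protocol's recvmsg handler. */
struct recv_args {
	int nonblock;	/* non-zero when the caller asked not to block */
	int flags;	/* flags forwarded alongside it */
};

/* Before: MSG_DONTWAIT is moved into 'nonblock' and cleared from 'flags'. */
struct recv_args split_flags_before(int flags)
{
	struct recv_args args = {
		.nonblock = flags & MSG_DONTWAIT,
		.flags = flags & ~MSG_DONTWAIT,
	};

	assert((args.flags & MSG_DONTWAIT) == 0);
	return args;
}

/* After: 'nonblock' is derived the same way, 'flags' is passed unchanged. */
struct recv_args split_flags_after(int flags)
{
	struct recv_args args = {
		.nonblock = flags & MSG_DONTWAIT,
		.flags = flags,
	};

	return args;
}
```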

Signed-off-by: Michael Witten 
---
 net/ipv4/af_inet.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index e31108e5ef79..2dbed042a412 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -791,7 +791,7 @@ int inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
sock_rps_record_flow(sk);
 
err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT,
-  flags & ~MSG_DONTWAIT, &addr_len);
+  flags, &addr_len);
if (err >= 0)
msg->msg_namelen = addr_len;
return err;
-- 
2.14.1

[PATCH net 1/3] net: __sock_cmsg_send(): Remove unused parameter `msg'

2017-10-01 Thread Michael Witten

Date: Thu, 7 Sep 2017 03:21:38 +
Signed-off-by: Michael Witten 
---
 include/net/sock.h | 2 +-
 net/core/sock.c| 4 ++--
 net/ipv4/ip_sockglue.c | 2 +-
 net/ipv6/datagram.c| 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 03a362568357..83373d7148a9 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1562,7 +1562,7 @@ struct sockcm_cookie {
u16 tsflags;
 };
 
-int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
+int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
 struct sockcm_cookie *sockc);
 int sock_cmsg_send(struct sock *sk, struct msghdr *msg,
   struct sockcm_cookie *sockc);
diff --git a/net/core/sock.c b/net/core/sock.c
index 9b7b6bbb2a23..425e03fe1c56 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2091,7 +2091,7 @@ struct sk_buff *sock_alloc_send_skb(struct sock *sk, unsigned long size,
 }
 EXPORT_SYMBOL(sock_alloc_send_skb);
 
-int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
+int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
 struct sockcm_cookie *sockc)
 {
u32 tsflags;
@@ -2137,7 +2137,7 @@ int sock_cmsg_send(struct sock *sk, struct msghdr *msg,
return -EINVAL;
if (cmsg->cmsg_level != SOL_SOCKET)
continue;
-   ret = __sock_cmsg_send(sk, msg, cmsg, sockc);
+   ret = __sock_cmsg_send(sk, cmsg, sockc);
if (ret)
return ret;
}
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index e558e4f9597b..c79b7822b0b9 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -263,7 +263,7 @@ int ip_cmsg_send(struct sock *sk, struct msghdr *msg, struct ipcm_cookie *ipc,
}
 #endif
if (cmsg->cmsg_level == SOL_SOCKET) {
-   err = __sock_cmsg_send(sk, msg, cmsg, &ipc->sockc);
+   err = __sock_cmsg_send(sk, cmsg, &ipc->sockc);
if (err)
return err;
continue;
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index a1f918713006..1d1926a4cbe2 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -756,7 +756,7 @@ int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
}
 
if (cmsg->cmsg_level == SOL_SOCKET) {
-   err = __sock_cmsg_send(sk, msg, cmsg, sockc);
+   err = __sock_cmsg_send(sk, cmsg, sockc);
if (err)
return err;
continue;
-- 
2.14.1

Re: [RFC PATCH 3/3] fs: detect that the i_rwsem has already been taken exclusively

2017-10-01 Thread Dave Chinner

On Sun, Oct 01, 2017 at 11:41:48AM -0700, Linus Torvalds wrote:
> On Sun, Oct 1, 2017 at 5:08 AM, Mimi Zohar  wrote:
> >
> > Right, re-introducing the iint->mutex and a new i_generation field in
> > the iint struct with a separate set of locks should work.  It will be
> > reset if the file metadata changes (eg. setxattr, chown, chmod).
> 
> Note that the "inner lock" could possibly be omitted if the
> invalidation can be just a single atomic instruction.
> 
> So particularly if invalidation could be just an atomic_inc() on the
> generation count, there might not need to be any inner lock at all.
> 
> You'd have to serialize the actual measurement with the "read
> generation count", but that should be as simple as just doing a
> smp_rmb() between the "read generation count" and "do measurement on
> file contents".

We already have a change counter on the inode, which is modified on
any data or metadata write (i_version) under filesystem locks.  The
i_version counter has well defined semantics - it's required by
NFSv4 to increment on any metadata or data change - so we should be
able to rely on its behaviour to implement IMA as well. Filesystems
that support i_version are marked with [SB|MS]_I_VERSION in the
superblock (IS_I_VERSION(inode)) so it should be easy to tell if IMA
can be supported on a specific filesystem (btrfs, ext4, fuse and xfs
ATM).

The IMA code should be able to sample that at measurement time and
either fail or be retried if i_version changes during measurement.
We can then simply make the IMA xattr write conditional on the
i_version value being unchanged from the sample the IMA code passes
into the filesystem once the filesystem holds all the locks it needs
to write the xattr...

I note that IMA already grabs the i_version in
ima_collect_measurement(), so this shouldn't be too hard to do.
Perhaps we don't need any new locks or counters at all, maybe just
the ability to feed a version cookie to the set_xattr method?
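
Roughly, the "sample, measure, conditionally store" flow could look like the
userspace model below. Everything here is hypothetical: `struct model_inode`
and the helper names are made up for illustration, the counter merely plays
the role of i_version, and in a real filesystem the final check-and-store
would run under the filesystem's own locks rather than as bare loads.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an inode carrying a change counter. */
struct model_inode {
	_Atomic uint64_t version;	/* plays the role of i_version */
	uint32_t stored_digest;		/* plays the role of the IMA xattr */
	bool has_digest;
};

void model_inode_init(struct model_inode *inode)
{
	assert(inode != NULL);
	atomic_init(&inode->version, 0);
	inode->stored_digest = 0;
	inode->has_digest = false;
}

/* Any data or metadata change bumps the counter. */
void model_modify(struct model_inode *inode)
{
	atomic_fetch_add(&inode->version, 1);
}

/* Taken before the measurement starts; this is the "version cookie". */
uint64_t model_sample_version(struct model_inode *inode)
{
	return atomic_load(&inode->version);
}

/*
 * Store the digest only if nothing changed since the cookie was sampled.
 * A false return means the measurement is stale: retry or fail.
 */
bool model_store_digest(struct model_inode *inode, uint64_t cookie,
			uint32_t digest)
{
	if (atomic_load(&inode->version) != cookie)
		return false;
	inode->stored_digest = digest;
	inode->has_digest = true;
	return true;
}
```

A caller would sample the cookie, compute the digest over the file contents,
and loop (or give up) whenever model_store_digest() reports a stale cookie.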

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

[PATCH net 0/3] net: TCP/IP: A few minor cleanups

2017-10-01 Thread Michael Witten

The following patch series is an ad hoc "cleanup" that I made
while perusing the code (I'm not well versed in this code, so I
would not be surprised if there were objections to the changes):

  [1] net: __sock_cmsg_send(): Remove unused parameter `msg'
  [2] net: inet_recvmsg(): Remove unnecessary bitwise operation
  [3] net: skb_queue_purge(): lock/unlock the queue only once

Each patch will be sent as an individual reply to this email;
the total diff is appended below for your convenience.

You may also fetch these patches from GitHub:

  git checkout --detach 5969d1bb3082b41eba8fd2c826559abe38ccb6df
  git pull https://github.com/mfwitten/linux.git net/tcp-ip/01-cleanup/02

Overall:

  include/net/sock.h |  2 +-
  net/core/skbuff.c  | 26 ++
  net/core/sock.c|  4 ++--
  net/ipv4/af_inet.c |  2 +-
  net/ipv4/ip_sockglue.c |  2 +-
  net/ipv6/datagram.c|  2 +-
  6 files changed, 24 insertions(+), 14 deletions(-)

Sincerely,
Michael Witten

diff --git a/include/net/sock.h b/include/net/sock.h
index 03a362568357..83373d7148a9 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1562,7 +1562,7 @@ struct sockcm_cookie {
u16 tsflags;
 };
 
-int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
+int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
 struct sockcm_cookie *sockc);
 int sock_cmsg_send(struct sock *sk, struct msghdr *msg,
   struct sockcm_cookie *sockc);
diff --git a/net/core/sock.c b/net/core/sock.c
index 9b7b6bbb2a23..425e03fe1c56 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2091,7 +2091,7 @@ struct sk_buff *sock_alloc_send_skb(struct sock *sk, unsigned long size,
 }
 EXPORT_SYMBOL(sock_alloc_send_skb);
 
-int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
+int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
 struct sockcm_cookie *sockc)
 {
u32 tsflags;
@@ -2137,7 +2137,7 @@ int sock_cmsg_send(struct sock *sk, struct msghdr *msg,
return -EINVAL;
if (cmsg->cmsg_level != SOL_SOCKET)
continue;
-   ret = __sock_cmsg_send(sk, msg, cmsg, sockc);
+   ret = __sock_cmsg_send(sk, cmsg, sockc);
if (ret)
return ret;
}
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index e558e4f9597b..c79b7822b0b9 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -263,7 +263,7 @@ int ip_cmsg_send(struct sock *sk, struct msghdr *msg, struct ipcm_cookie *ipc,
}
 #endif
if (cmsg->cmsg_level == SOL_SOCKET) {
-   err = __sock_cmsg_send(sk, msg, cmsg, &ipc->sockc);
+   err = __sock_cmsg_send(sk, cmsg, &ipc->sockc);
if (err)
return err;
continue;
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index a1f918713006..1d1926a4cbe2 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -756,7 +756,7 @@ int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
}
 
if (cmsg->cmsg_level == SOL_SOCKET) {
-   err = __sock_cmsg_send(sk, msg, cmsg, sockc);
+   err = __sock_cmsg_send(sk, cmsg, sockc);
if (err)
return err;
continue;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index e31108e5ef79..2dbed042a412 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -791,7 +791,7 @@ int inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
sock_rps_record_flow(sk);
 
err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT,
-  flags & ~MSG_DONTWAIT, &addr_len);
+  flags, &addr_len);
if (err >= 0)
msg->msg_namelen = addr_len;
return err;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 68065d7d383f..bd26b0bde784 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2825,18 +2825,28 @@
 EXPORT_SYMBOL(skb_dequeue_tail);
 
 /**
- * skb_queue_purge - empty a list
- * @list: list to empty
+ * skb_queue_purge - empty a queue
+ * @q: the queue to empty
  *
- * Delete all buffers on an &sk_buff list. Each buffer is removed from
- * the list and one reference dropped. This function takes the list
- * lock and is atomic with respect to other list locking functions.
+ * Dequeue and free each socket buffer that is in @q.
+ *
+ * This function is atomic with respect to other queue-locking functions.
  */
-void skb_queue_purge(struct sk_buff_head *list)
+void skb_queue_purge(struct sk_buff_head *q)
 {
-   struct sk_buff *skb;
-   while ((skb = skb_dequeue(list)) != NULL)
+   unsigned

Re: [PATCH v4 4/5] cramfs: add mmap support

2017-10-01 Thread Nicolas Pitre

On Sun, 1 Oct 2017, Christoph Hellwig wrote:

> up_read(&mm->mmap_sem) in the fault path is still a complete
> no-go,
> 
> NAK

Care to elaborate?

What about mm/filemap.c:__lock_page_or_retry() then?

Why the special handling on mm->mmap_sem with VM_FAULT_RETRY?

What are the potential problems with my approach I didn't cover yet?

Serious: I'm simply looking for solutions here.

Nicolas

Re: [PATCH v4 1/5] cramfs: direct memory access support

2017-10-01 Thread Nicolas Pitre

On Sun, 1 Oct 2017, Christoph Hellwig wrote:

> On Wed, Sep 27, 2017 at 07:32:20PM -0400, Nicolas Pitre wrote:
> > To distinguish between both access types, the cramfs_physmem filesystem
> > type must be specified when using a memory accessible cramfs image, and
> > the physaddr argument must provide the actual filesystem image's physical
> > memory location.
> 
> Sorry, but this still is a complete no-go.  A physical address is not a
> proper interface.  You still need to have some interface for your NOR nand
> or DRAM.  - usually that would be a mtd driver, but if you have a good
> reason why that's not suitable for you (and please explain it well)
> we'll need a little OF or similar layer to bind a thin driver.

The primary use case for this is to run Linux on a small microcontroller 
with some amount of RAM and ROM on chip. And this is not theoretical -- 
I already have it running here. The ROM is some kind of flash that 
appears in the direct memory address space and requires no access layer 
whatsoever given it is meant to execute code from it. The flash is 
programmed with an external programmer through some debug port. It can't 
be programmed from the microcontroller itself, not even probed, as that 
would make the running code unavailable (unless the probe code is copied 
elsewhere but what would be the point?). Persistent state is typically 
kept in NVRAM or external flash, not in _that_ flash.

The MTD subsystem provides a lot of features and flexibility, but almost 
none of it would be usable here and constitutes only a useless kernel 
size increase.

The kernel itself runs XIP from that ROM. It has to be linked for the 
exact address where it is flashed. The link address is therefore not a 
variable that can be changed at run time. It is the same for the 
filesystem image: it is related to the way things are laid out in ROM, 
and typically depends on the actual size of the kernel when ROM is 
tight.

You fundamentally need to know the address of the kernel _and_ the 
address of the fs image. Those addresses are properties of your kernel 
config. So having to specify one in Kconfig and bury the other in DT 
doesn't make sense to me as this is just an extra file to edit and 
compile, and an extra binary to write into flash, for something that 
isn't a property of the hardware. The bootloader and DT should remain 
stable as much as possible with invariant data.

If you prefer, the physical address could be specified with a Kconfig 
symbol just like the kernel link address. Personally I think it is best 
to keep it along with the other root mount args. But going all the way 
with a dynamic driver binding interface and a dummy intermediate name is 
like using a sledge hammer to kill an ant: it will work of course, but 
given the context it is prone to errors due to the added manipulations 
mentioned previously ... and a tad overkill.

> > Signed-off-by: Nicolas Pitre 
> > Tested-by: Chris Brandt 
> > ---
> >  fs/cramfs/Kconfig |  29 +-
> >  fs/cramfs/inode.c | 264 +++---
> >  2 files changed, 241 insertions(+), 52 deletions(-)
> > 
> > diff --git a/fs/cramfs/Kconfig b/fs/cramfs/Kconfig
> > index 11b29d491b..5b4e0b7e13 100644
> > --- a/fs/cramfs/Kconfig
> > +++ b/fs/cramfs/Kconfig
> > @@ -1,6 +1,5 @@
> >  config CRAMFS
> > tristate "Compressed ROM file system support (cramfs) (OBSOLETE)"
> > -   depends on BLOCK
> > select ZLIB_INFLATE
> > help
> >   Saying Y here includes support for CramFs (Compressed ROM File
> > @@ -20,3 +19,31 @@ config CRAMFS
> >   in terms of performance and features.
> >  
> >   If unsure, say N.
> > +
> > +config CRAMFS_BLOCKDEV
> > +   bool "Support CramFs image over a regular block device" if EXPERT
> > +   depends on CRAMFS && BLOCK
> > +   default y
> > +   help
> > + This option allows the CramFs driver to load data from a regular
> > + block device such a disk partition or a ramdisk.
> > +
> > +config CRAMFS_PHYSMEM
> > +   bool "Support CramFs image directly mapped in physical memory"
> > +   depends on CRAMFS
> > +   default y if !CRAMFS_BLOCKDEV
> > +   help
> > + This option allows the CramFs driver to load data directly from
> > + a linear adressed memory range (usually non volatile memory
> > + like flash) instead of going through the block device layer.
> > + This saves some memory since no intermediate buffering is
> > + necessary.
> > +
> > + The filesystem type for this feature is "cramfs_physmem".
> > + The location of the CramFs image in memory is board
> > + dependent. Therefore, if you say Y, you must know the proper
> > + physical address where to store the CramFs image and specify
> > + it using the physaddr=0x mount option (for example:
> > + "mount -t cramfs_physmem -o physaddr=0x10 none /mnt").
> > +
> > + If unsure, say N.
> > diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
> > index 7919967488..

Re: [RFC PATCH 3/3] fs: detect that the i_rwsem has already been taken exclusively

2017-10-01 Thread Linus Torvalds

On Sun, Oct 1, 2017 at 3:06 PM, Eric W. Biederman  wrote:
>
> Unless I misread something it was being pointed out there are some vfs
> operations today on which ima writes an ima xattr as a side effect.  And
> those operations hold the i_sem.  So perhaps I am misunderstanding
> things or writing the ima xattr needs to happen at some point.  Which
> implies something like queued work.

So the issue is indeed the inode semaphore, as it is used by IMA. But
all these IMA patches to work around the issue are just horribly ugly.
One adds a VFS-layer filesystem method that most filesystems end up
not really needing (it's the same as the regular read), and other
filesystems end up then having hacks with ("oh, I don't need to take
this lock because it was already taken by the caller").

The second patch attempt avoided the need for a new filesystem method,
but added a flag in an annoying place (for the same basic logic). The
advantage is that now most filesystems don't actually need to care any
more (and the filesystems that used to care now check that flag).

There was discussion about moving the flag to a more convenient spot,
which would have made it a lot less intrusive.

But the basic issue is that almost always when you see lock
inversions, the problem can just be fixed by doing the locking
differently instead.

And that's what I was/am pushing for.

There really are two totally different issues:

 - the integrity _measurement_.

   This one wants to be serialized, so that you don't have multiple
concurrent measurements, and the serialization fundamentally has to be
around all the IO, so this lock pretty much has to be outside the
i_sem.

 - the integrity invalidation on certain operations.

   This one fundamentally had to be inside the i_sem, since some of
the operations that cause this end up already holding the i_sem at a
VFS layer.

So you had these two different requirements (inside _and_ outside),
and the IMA approach was basically to avoid the problem by making
i_sem *the* lock, and then making the IO routines aware of it already
being held. That does solve the inside/outside issue.

But the simpler way to fix it is to simply use two locks that nest
inside each other, with i_sem nesting in the middle.  That just avoids
the problem entirely, and doesn't require anybody to ever care about
i_sem semantic changes, because i_sem semantics simply didn't change
at all.
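
The nesting described above can be modelled in userspace roughly as
follows. This is only a sketch under stated assumptions: the struct, the
lock names and the functions are hypothetical stand-ins built on pthread
mutexes, not the actual IMA or VFS code, and the "read" step is reduced
to a comment. The lock order is always measure_lock, then i_sem, then
state_lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names come from IMA or the VFS. */
struct model_inode {
    pthread_mutex_t measure_lock; /* outer lock: serializes whole measurements */
    pthread_mutex_t i_sem;        /* middle lock: stands in for the inode semaphore */
    pthread_mutex_t state_lock;   /* inner lock: protects only the 'valid' flag */
    bool valid;                   /* is the stored measurement still valid? */
};

static void model_inode_init(struct model_inode *ino)
{
    pthread_mutex_init(&ino->measure_lock, NULL);
    pthread_mutex_init(&ino->i_sem, NULL);
    pthread_mutex_init(&ino->state_lock, NULL);
    ino->valid = false;
}

/* Measurement path: the outer lock is taken before any I/O, so it sits
 * outside i_sem. The I/O path takes i_sem with unchanged semantics. */
static void model_measure(struct model_inode *ino)
{
    pthread_mutex_lock(&ino->measure_lock);
    pthread_mutex_lock(&ino->i_sem);
    /* ... read the file contents and hash them here ... */
    pthread_mutex_lock(&ino->state_lock);
    ino->valid = true;
    pthread_mutex_unlock(&ino->state_lock);
    pthread_mutex_unlock(&ino->i_sem);
    pthread_mutex_unlock(&ino->measure_lock);
}

/* Invalidation path: the caller already holds i_sem, so only the inner
 * lock is taken. It invalidates and never tries to re-measure. */
static void model_invalidate_locked(struct model_inode *ino)
{
    pthread_mutex_lock(&ino->state_lock);
    ino->valid = false;
    pthread_mutex_unlock(&ino->state_lock);
}

/* Example of an operation that runs with i_sem held, like an xattr write. */
static void model_setxattr(struct model_inode *ino)
{
    pthread_mutex_lock(&ino->i_sem);
    model_invalidate_locked(ino);
    pthread_mutex_unlock(&ino->i_sem);
}
```

Because the invalidation side never takes the outer lock, it cannot
invert against the measurement side, and nothing in the model needs to
know whether i_sem is "already held".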

So that's the approach I'm pushing. I admittedly haven't actually
looked at the IMA details, but from a high-level standpoint you can
basically describe it (as above) without having to care too much about
exactly what IMA even wants.

The two-lock approach does require that the operations that invalidate
the integrity measurements always only invalidate it, and don't try to
re-compute it. But I suspect that would be entirely insane anyway
(imagine a world where "setxattr" would have to read the whole file
contents in order to revalidate the integrity measurement - even if
there is nobody who even *cares*).

   Linus

Re: i2c-omap.c vs. qemu: too much work in one irq (and broken boot)

2017-10-01 Thread Pavel Machek

Hi!

> I'm trying to get qemu emulation of Nokia N900 to work, but
> unfortunately i2c-omap.c breaks boot in the emulator (real hardware
> works ok).

I started bisection. v4.6 works, v4.7 is broken. (It still may be a
config difference or something.)

This looked suspect, so I tried reverting it, but it did not help. And
it is the only significant change in i2c-omap.c. So... this may be a lot
of fun.

commit 126a66caec4a00b8d66dbc3174b0efa905cf68c3
Author: Sebastian Andrzej Siewior 
Date:   Mon Apr 4 16:55:23 2016 +0300

Unfortunately, mmc is broken in v4.6, so it's useless for emulation...
[ 3.236724] mmc1: error -22 whilst initialising SDIO card

Ideas?
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



Re: [RFC PATCH 3/3] fs: detect that the i_rwsem has already been taken exclusively

2017-10-01 Thread Eric W. Biederman

Linus Torvalds  writes:

> On Sep 30, 2017 18:33, "Eric W. Biederman"  wrote:
>
>  That would require a task_work or another kind of work callback so that
>  the writes of the xattr are not synchronous with the vfs callback
>  correct?
>
> No, why?
>
> You should just invalidate the IMA on xattr write or other operations that
> make the measurement invalid. You only need the inner lock.
>
> Why are you guys making up all these things just to make it complicated?

I am not trying to make things complicated; I am just trying to
understand the conversation.

Unless I misread something it was being pointed out there are some vfs
operations today on which ima writes an ima xattr as a side effect.  And
those operations hold the i_sem.  So perhaps I am misunderstanding
things or writing the ima xattr needs to happen at some point.  Which
implies something like queued work.

But perhaps I am misunderstanding the conversation and ima.  I frequently
misunderstand ima.

Eric

Re: [PATCH] x86/CPU/AMD, mm: Extend with mem_encrypt=sme option

2017-10-01 Thread Borislav Petkov

On Sun, Oct 01, 2017 at 02:45:09PM -0500, Brijesh Singh wrote:
> >
> > So I want to be able to disable SEV and the whole code that comes with
> > it in the *host*.
> 
> We can add a new variable 'sme_only'. By default this variable should be set
> to false. When mem_encrypt=sme is passed then set it to true and
> based on sme_only state early_detect_mem_encrypt() can clear X86_FEATURE_SEV
> flag.

Why would you need yet another variable? We have sev_enabled already?!?

-- 
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

Re: [PATCH 00/18] use ARRAY_SIZE macro

2017-10-01 Thread Tobin C. Harding

On Sun, Oct 01, 2017 at 03:30:38PM -0400, Jérémy Lefaure wrote:
> Hi everyone,
> Using ARRAY_SIZE improves the code readability. I used coccinelle (I
> made a change to the array_size.cocci file [1]) to find several places
> where ARRAY_SIZE could be used instead of other macros or sizeof
> division.
> 
> I tried to divide the changes into a patch per subsystem (except for
> staging). If one of the patches should be split into several patches, let
> me know.
> 
> In order to reduce the size of the To: and Cc: lines, each patch of the
> series is sent only to the maintainers and lists concerned by the patch.
> This cover letter is sent to every list concerned by this series.
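
For readers who have not seen the pattern, the kind of replacement the
cover letter describes looks roughly like the sketch below. The macro is
a simplified userspace version (the kernel's ARRAY_SIZE additionally
rejects non-array arguments at compile time), and the array and function
names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified form; the kernel macro also adds a compile-time array check. */
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Hypothetical table, only here to have something to count. */
static const int sample_rates[] = { 8000, 16000, 44100, 48000 };

/* Before: open-coded sizeof division. */
static size_t count_rates_open_coded(void)
{
    return sizeof(sample_rates) / sizeof(sample_rates[0]);
}

/* After: same value, but the intent is visible at the call site. */
static size_t count_rates(void)
{
    return ARRAY_SIZE(sample_rates);
}
```

Both functions return the same element count; the second form just
makes it harder to divide by the wrong element size.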

Why don't you just send individual patches for each subsystem? I'm not a
maintainer but I don't see how any one person is going to be able to apply
this whole series, it is making it hard for maintainers if they have to pick
patches out from among the series (if indeed any will bother doing that).

I get that this will be more work for you but AFAIK we are optimizing for
maintainers.

Good luck,
Tobin.

Re: [RFC] yamldt v0.5, now a DTS compiler too

2017-10-01 Thread Rob Herring

On Thu, Sep 28, 2017 at 2:58 PM, Pantelis Antoniou wrote:
> Hello again,
>
> Significant progress has been made on yamldt and is now capable of
> not only generating yaml from DTS source but also compiling DTS sources
> and being almost fully compatible with DTC.

Can you quantify "almost"?

> Compiling the kernel's DTBs using yamldt is as simple as using a
> DTC=yamldt.

Good.

>
> Error reporting is accurate and validation against a YAML based schema
> works as well. In a short while I will begin posting patches with
> fixes on bindings and DTS files in the kernel.

What I would like to see is the schema format posted for review.

I would also like to see the bindings for top-level compatible strings
(aka boards) as an example. That's something that's simple enough that
I'd think we could agree on a format and start moving towards defining
board bindings that way.

Rob

i2c-omap.c vs. qemu: too much work in one irq (and broken boot)

2017-10-01 Thread Pavel Machek

Hi!

I'm trying to get qemu emulation of Nokia N900 to work, but
unfortunately i2c-omap.c breaks boot in the emulator (real hardware
works ok).

[0.837524] omap2-onenand omap2-onenand: initializing on CS0, phys base 0x0100, virtual base d00c, freq 66 MHz
[0.838958] Muxed OneNAND 256MB 1.8V 16-bit (0x40)
[0.839752] OneNAND version = 0x0121
[0.842102] Scanning device for bad blocks
[1.012451] 6 ofpart partitions found on MTD device omap2-onenand
[1.013153] Creating 6 MTD partitions on "omap2-onenand":
[1.014007] 0x-0x0002 : "bootloader"
[1.018066] 0x0002-0x0008 : "config"
[1.020660] 0x0008-0x000c : "log"
[1.022827] 0x000c-0x002c : "kernel"
[1.025848] 0x002c-0x004c : "initfs"
[1.028106] 0x004c-0x1000 : "rootfs"
[1.047668] omap_i2c 4807.i2c: addr: 0x004b, len: 2, flags: 0x0, stop: 1
[1.048828] omap_i2c 4807.i2c: IRQ (ISR = 0x0010)
[1.049530] omap_i2c 4807.i2c: IRQ (ISR = 0x0010)
[1.050018] omap_i2c 4807.i2c: IRQ (ISR = 0x0010)
[1.050476] omap_i2c 4807.i2c: IRQ (ISR = 0x0010)
[1.050872] omap_i2c 4807.i2c: IRQ (ISR = 0x0010)
[1.051422] omap_i2c 4807.i2c: IRQ (ISR = 0x0010)
[1.052001] omap_i2c 4807.i2c: IRQ (ISR = 0x0010)
[1.052398] omap_i2c 4807.i2c: IRQ (ISR = 0x0010)
[1.052825] omap_i2c 4807.i2c: IRQ (ISR = 0x0010)
[1.053222] omap_i2c 4807.i2c: IRQ (ISR = 0x0010)
[1.053619] omap_i2c 4807.i2c: IRQ (ISR = 0x0010)
[1.054016] omap_i2c 4807.i2c: IRQ (ISR = 0x0010)
[1.054412] omap_i2c 4807.i2c: IRQ (ISR = 0x0010)
[1.054840] omap_i2c 4807.i2c: IRQ (ISR = 0x0010)
...

After 100 messages I get a "Too much work in one IRQ" message, and then
it repeats. I tried just disabling omap_i2c in the dts, but 1) it's not
easy 2) I'd lose quite fundamental functionality.

Ideas welcome.

Best regards,
Pavel

PS: If anyone is interested, this is why working qemu would be useful:
https://wiki.postmarketos.org/wiki/Nokia_N900_(nokia-rx51) . There's a
lot of work to be done in the userspace, and swapping SD cards is kind
of slow.
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



[PATCH] exthdr: Add support for reserved header and address

2017-10-01 Thread Harsha Sharma

Add support for the IPv6 type 0 routing header reserved field and address.
I was unable to test it with nft-test.py.

Signed-off-by: Harsha Sharma 
---
 include/exthdr.h  | 2 ++
 src/exthdr.c  | 7 +--
 tests/py/ip6/rt.t | 2 ++
 3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/include/exthdr.h b/include/exthdr.h
index 97ccc38..ad09f27 100644
--- a/include/exthdr.h
+++ b/include/exthdr.h
@@ -14,6 +14,8 @@
 struct exthdr_desc {
const char  *name;
uint8_t type;
+   unsigned intprotocol_key;
+   const struct exthdr_desc*protocols[3];
struct proto_hdr_template   templates[10];
 };
 
diff --git a/src/exthdr.c b/src/exthdr.c
index 4add3da..87c09da 100644
--- a/src/exthdr.c
+++ b/src/exthdr.c
@@ -263,13 +263,8 @@ const struct exthdr_desc exthdr_rt0 = {
 const struct exthdr_desc exthdr_rt = {
.name   = "rt",
.type   = IPPROTO_ROUTING,
-#if 0
.protocol_key   = RTHDR_TYPE,
-   .protocols  = {
-   [0] = &exthdr_rt0,
-   [2] = &exthdr_rt2,
-   },
-#endif
+   .protocols  = {&exthdr_rt0, NULL, &exthdr_rt2},
.templates  = {
[RTHDR_NEXTHDR] = RT_FIELD("nexthdr", ip6r_nxt, &inet_protocol_type),
[RTHDR_HDRLENGTH]   = RT_FIELD("hdrlength", ip6r_len, &integer_type),
diff --git a/tests/py/ip6/rt.t b/tests/py/ip6/rt.t
index 2d044c3..1eb198d 100644
--- a/tests/py/ip6/rt.t
+++ b/tests/py/ip6/rt.t
@@ -44,3 +44,5 @@ rt seg-left { 33, 55, 67, 88};ok
 rt seg-left != { 33, 55, 67, 88};ok
 rt seg-left { 33-55};ok
 rt seg-left != { 33-55};ok
+
+rt type 0 reserved 2;ok
-- 
1.9.1
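
A note on the .protocols change in the patch above: replacing the
designated initializers with a positional list should not change the
resulting table, as long as the gap at index 1 is filled with an explicit
NULL. A minimal standalone sketch, using a hypothetical stand-in struct
rather than the real struct exthdr_desc:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct exthdr_desc; only a name is kept. */
struct desc {
    const char *name;
};

static const struct desc rt0 = { "rt0" };
static const struct desc rt2 = { "rt2" };

/* Designated form: index 1 is left out and is implicitly NULL. */
static const struct desc *designated[3] = {
    [0] = &rt0,
    [2] = &rt2,
};

/* Positional form: the gap at index 1 must be written out as NULL. */
static const struct desc *positional[3] = { &rt0, NULL, &rt2 };
```

The designated form stays correct if entries are later reordered or the
array grows, which is one reason it is often preferred for sparse tables
keyed by a protocol number.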

Re: [Part2 PATCH v4 05/29] crypto: ccp: Add Platform Security Processor (PSP) device support

2017-10-01 Thread Brijesh Singh

On 9/30/17 11:11 AM, Borislav Petkov wrote:
> I think just from having CRYPTO_DEV_CCP_DD depend on CPU_SUP_AMD ||
> ARM64, CRYPTO_DEV_SP_PSP gets almost the same dependency transitively.
> But sure, let's make the PSP build only on x86. It should depend on
> X86_64, to be precise.

I think theoretically a 32-bit host OS can invoke PSP commands, but
currently the PSP interface is exposing only the SEV FW command. And the SEV
feature is available when we are in 64-bit mode, hence for now it's okay
to depend on X86_64. I will make CRYPTO_DEV_CCP_DD depend on
CPU_SUP_AMD || ARM64 and CRYPTO_DEV_SP_PSP depend on X86_64 and send you
v4.2. Thanks

Re: [RFC GIT Pull] core watchdog sanitizing

2017-10-01 Thread Linus Torvalds

I refuse to pull this.

Look, I understand what you want to do, but the code is disgusting.

Maybe most of it is fine, but I just couldn't stomach looking at it
after just a few lines.

Look at that abortion called "watchdog_nmi_reconfigure()".

It's one single function that does two completely different things
based on a static argument.

Whaa?  So now you have doubly illegible code: the callers have that
insane set of

 watchdog_nmi_reconfigure(true/false);

in it, which is illegible garbage. There's no way that makes sense to anybody.

And the actual implementation has

void watchdog_nmi_reconfigure(bool run)

where the whole function is basically a single if-statement on that
"run" argument that does two totally different things.

If you don't see how that is pure and utter garbage, I don't know what
to say. It's just bad code. Don't do that in the first place, but
*definitely* don't do that and then try to send it to me during an rc.

Seriously, instead of that retarded

watchdog_nmi_reconfigure(false);
...
watchdog_nmi_reconfigure(true);

you could have written it something like

watchdog_nmi_stop();
...
watchdog_nmi_start();

and it would be _understandable_ code. You don't have to even know the
code to understand what's going on.
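
To make the contrast concrete, here is a small standalone model of the
two calling conventions. It is only a sketch: the function names mirror
the ones used in this mail, but the bodies are hypothetical and reduce
the watchdog to a single boolean, with none of the real locking or
perf-event handling.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the NMI watchdog state. */
static bool nmi_watchdog_running;

/* Flag style: one function, two unrelated behaviours selected by 'run'.
 * A call site reads reconfigure(true) and says nothing about intent. */
static void watchdog_nmi_reconfigure(bool run)
{
    if (run)
        nmi_watchdog_running = true;   /* the "start" half */
    else
        nmi_watchdog_running = false;  /* the "stop" half */
}

/* Split style: each function does one thing and the name says which. */
static void watchdog_nmi_stop(void)
{
    nmi_watchdog_running = false;
}

static void watchdog_nmi_start(void)
{
    nmi_watchdog_running = true;
}
```

A caller that brackets a reconfiguration would then read
watchdog_nmi_stop(); ...; watchdog_nmi_start(); with no flag to decode.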

So get rid of that kind of shit, and I may reconsider. But as is, I
look at that patch and say "no, this is worse than the garbage it used
to be".

Yeah, yeah, I'll be honest, and you were unlucky, and just happened to
hit a pet peeve of mine, but I absolutely *detest* code that describes
obvious static things as non-obvious dynamic "flag" arguments to a
function for no good reason. There's not even any code sharing between
the two cases going on aside from the trivial locking.

That kind of code makes it harder for everybody. It makes it harder
for the compiler to generate good code (you need to inline it to get
the obvious code generation), it makes it harder for checkers to
verify things, and it makes it harder for humans to read and
understand what is going on.

If you see code like

watchdog_nmi_reconfigure(true);

and you don't ask yourself "what does 'true' mean in this call site",
you're either not human, or you just don't care. Neither of which are
good.

In contrast, when you see code like

watchdog_nmi_start();

it kind of documents itself, don't you think?

And yes, I see the comment above the function definition. No, the
comment doesn't help. The comment just makes it obvious that somebody
realized that the calling convention was too damn confusing. Good. But
even better would have been to just not make it that confusing in the
first place.

 Linus

On Sun, Oct 1, 2017 at 3:34 AM, Thomas Gleixner  wrote:
>
> The watchdog (hard/softlockup detector) code is pretty much broken in its
> current state. The patch series addresses this by removing all duct tape
> and refactoring it into a workable state.
