Re: [PATCH 2/4] MIPS: kvm: Use vm_get_page_prot to get protection bits

2020-11-13 Thread Huacai Chen
Hi, Thomas,

On Fri, Nov 13, 2020 at 7:13 PM Thomas Bogendoerfer
 wrote:
>
> MIPS protection bits are set up at runtime, so using defines like
> PAGE_SHARED ignores these runtime changes. Using vm_get_page_prot
> to get the correct page protection fixes this.
Are there any visible bugs without this fix?

Huacai
>
> Signed-off-by: Thomas Bogendoerfer 
> ---
>  arch/mips/kvm/mmu.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
> index 28c366d307e7..3dabeda82458 100644
> --- a/arch/mips/kvm/mmu.c
> +++ b/arch/mips/kvm/mmu.c
> @@ -1074,6 +1074,7 @@ int kvm_mips_handle_commpage_tlb_fault(unsigned long 
> badvaddr,
>  {
> kvm_pfn_t pfn;
> pte_t *ptep;
> +   pgprot_t prot;
>
> ptep = kvm_trap_emul_pte_for_gva(vcpu, badvaddr);
> if (!ptep) {
> @@ -1083,7 +1084,8 @@ int kvm_mips_handle_commpage_tlb_fault(unsigned long 
> badvaddr,
>
> pfn = PFN_DOWN(virt_to_phys(vcpu->arch.kseg0_commpage));
> /* Also set valid and dirty, so refill handler doesn't have to */
> -   *ptep = pte_mkyoung(pte_mkdirty(pfn_pte(pfn, PAGE_SHARED)));
> +   prot = vm_get_page_prot(VM_READ|VM_WRITE|VM_SHARED);
> +   *ptep = pte_mkyoung(pte_mkdirty(pfn_pte(pfn, prot)));
>
> /* Invalidate this entry in the TLB, guest kernel ASID only */
> kvm_mips_host_tlb_inv(vcpu, badvaddr, false, true);
> --
> 2.16.4
>


Re: [PATCH] s5p-jpeg: hangle error condition in s5p_jpeg_probe

2020-11-13 Thread baskov

On 2020-11-14 00:35, Jacek Anaszewski wrote:

There is a typo in the subject: s/hangle/handle/


Thanks for pointing out, sorry for that.

Apparently, there is also a typo in my name -> s/Evgeiny/Evgeniy/

--
Respectfully,
Baskov Evgeniy


[PATCH -next] scsi: be2iscsi: Mark beiscsi_attrs with static keyword

2020-11-13 Thread Zou Wei
Fix the following sparse warning:

./be_main.c:167:25: warning: symbol 'beiscsi_attrs' was not declared. Should it 
be static?

Signed-off-by: Zou Wei 
---
 drivers/scsi/be2iscsi/be_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/be2iscsi/be_main.c b/drivers/scsi/be2iscsi/be_main.c
index 202ba92..50e4642 100644
--- a/drivers/scsi/be2iscsi/be_main.c
+++ b/drivers/scsi/be2iscsi/be_main.c
@@ -164,7 +164,7 @@ DEVICE_ATTR(beiscsi_active_session_count, S_IRUGO,
 beiscsi_active_session_disp, NULL);
 DEVICE_ATTR(beiscsi_free_session_count, S_IRUGO,
 beiscsi_free_session_disp, NULL);
-struct device_attribute *beiscsi_attrs[] = {
+static struct device_attribute *beiscsi_attrs[] = {
	&dev_attr_beiscsi_log_enable,
	&dev_attr_beiscsi_drvr_ver,
	&dev_attr_beiscsi_adapter_family,
-- 
2.6.2



[PATCH -next] drm/virtio: Mark virtgpu_dmabuf_ops with static keyword

2020-11-13 Thread Zou Wei
Fix the following sparse warning:

./virtgpu_prime.c:46:33: warning: symbol 'virtgpu_dmabuf_ops' was not declared. 
Should it be static?

Signed-off-by: Zou Wei 
---
 drivers/gpu/drm/virtio/virtgpu_prime.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/virtio/virtgpu_prime.c 
b/drivers/gpu/drm/virtio/virtgpu_prime.c
index 1ef1e2f..807a27a 100644
--- a/drivers/gpu/drm/virtio/virtgpu_prime.c
+++ b/drivers/gpu/drm/virtio/virtgpu_prime.c
@@ -43,7 +43,7 @@ static int virtgpu_virtio_get_uuid(struct dma_buf *buf,
return 0;
 }
 
-const struct virtio_dma_buf_ops virtgpu_dmabuf_ops =  {
+static const struct virtio_dma_buf_ops virtgpu_dmabuf_ops =  {
.ops = {
.cache_sgt_mapping = true,
.attach = virtio_dma_buf_attach,
-- 
2.6.2



Re: [PATCH] ARM: configs: sunxi: enable Realtek PHY

2020-11-13 Thread Jernej Škrabec
Dne četrtek, 12. november 2020 ob 21:26:52 CET je Corentin Labbe napisal(a):
> Lots of sunxi boards have a Realtek PHY, so let's enable it.
> 
> Signed-off-by: Corentin Labbe 
> ---
>  arch/arm/configs/sunxi_defconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm/configs/sunxi_defconfig
> b/arch/arm/configs/sunxi_defconfig index 244126172fd6..05f7f4ed8ded 100644
> --- a/arch/arm/configs/sunxi_defconfig
> +++ b/arch/arm/configs/sunxi_defconfig
> @@ -51,6 +51,7 @@ CONFIG_STMMAC_ETH=y
>  # CONFIG_NET_VENDOR_VIA is not set
>  # CONFIG_NET_VENDOR_WIZNET is not set
>  CONFIG_MICREL_PHY=y
> +CONFIG_REALTEK_PHY=y
>  # CONFIG_WLAN is not set
>  CONFIG_INPUT_EVDEV=y
>  CONFIG_KEYBOARD_SUN4I_LRADC=y

Acked-by: Jernej Skrabec 

Thanks!

Best regards,
Jernej




Re: [RFC bpf-next 1/3] bpf: add module support to btf display helpers

2020-11-13 Thread Andrii Nakryiko
On Fri, Nov 13, 2020 at 10:11 AM Alan Maguire  wrote:
>
> bpf_snprintf_btf and bpf_seq_printf_btf use a "struct btf_ptr *"
> argument that specifies type information about the type to
> be displayed.  Augment this information to include a module
> name, allowing such display to support module types.
>
> Signed-off-by: Alan Maguire 
> ---
>  include/linux/btf.h|  8 
>  include/uapi/linux/bpf.h   |  5 -
>  kernel/bpf/btf.c   | 18 ++
>  kernel/trace/bpf_trace.c   | 42 
> --
>  tools/include/uapi/linux/bpf.h |  5 -
>  5 files changed, 66 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/btf.h b/include/linux/btf.h
> index 2bf6418..d55ca00 100644
> --- a/include/linux/btf.h
> +++ b/include/linux/btf.h
> @@ -209,6 +209,14 @@ static inline const struct btf_var_secinfo 
> *btf_type_var_secinfo(
>  const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id);
>  const char *btf_name_by_offset(const struct btf *btf, u32 offset);
>  struct btf *btf_parse_vmlinux(void);
> +#ifdef CONFIG_DEBUG_INFO_BTF_MODULES
> +struct btf *bpf_get_btf_module(const char *name);
> +#else
> +static inline struct btf *bpf_get_btf_module(const char *name)
> +{
> +   return ERR_PTR(-ENOTSUPP);
> +}
> +#endif
>  struct btf *bpf_prog_get_target_btf(const struct bpf_prog *prog);
>  #else
>  static inline const struct btf_type *btf_type_by_id(const struct btf *btf,
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 162999b..26978be 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3636,7 +3636,8 @@ struct bpf_stack_build_id {
>   * the pointer data is carried out to avoid kernel crashes during
>   * operation.  Smaller types can use string space on the stack;
>   * larger programs can use map data to store the string
> - * representation.
> + * representation.  Module-specific data structures can be
> + * displayed if the module name is supplied.
>   *
>   * The string can be subsequently shared with userspace via
>   * bpf_perf_event_output() or ring buffer interfaces.
> @@ -5076,11 +5077,13 @@ struct bpf_sk_lookup {
>   * potentially to specify additional details about the BTF pointer
>   * (rather than its mode of display) - is included for future use.
>   * Display flags - BTF_F_* - are passed to bpf_snprintf_btf separately.
> + * A module name can be specified for module-specific data.
>   */
>  struct btf_ptr {
> void *ptr;
> __u32 type_id;
> __u32 flags;/* BTF ptr flags; unused at present. */
> +   const char *module; /* optional module name. */

I think module name is the wrong API here, similar to how type name was
the wrong API for specifying the type (and thus we use type_id here).
Using the module's BTF ID seems like a more suitable interface. That's
what I'm going to use for all kinds of existing BPF APIs that expect
BTF type to attach BPF programs.

Right now, we use only type_id and implicitly know that it's in
vmlinux BTF. With module BTFs, we now need a pair of BTF object ID +
BTF type ID to uniquely identify the type. vmlinux BTF now can be
specified in two different ways: either leaving BTF object ID as zero
(for simplicity and backwards compatibility) or specifying its actual
BTF obj ID (which pretty much always should be 1, btw). This feels
like a natural extension, WDYT?
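Concretely, the extension could look something like this (a sketch only,
field name made up, not a final API):

struct btf_ptr {
	void *ptr;
	__u32 type_id;
	__u32 btf_obj_id;	/* 0 = vmlinux BTF; otherwise BTF object (e.g. module) ID */
};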

And similar to type_id, no one should expect users to specify these
IDs by hand, Clang built-in and libbpf should work together to figure
this out for the kernel to use.

BTW, with module names there is an extra problem for end users. Some
types could be either built-in or built as a module (e.g., XFS data
structures). Why would we require BPF users to care which is the case
on any given host? It feels right now that we should just extend the
existing __builtin_btf_type_id() helper to generate ldimm64
instructions that would encode both BTF type ID and BTF object ID.
This would just naturally add transparent module BTF support without
BPF programs having to do any changes.

But we need to do a bit of thinking and experimentation with Yonghong,
haven't gotten around to this yet, you are running a bit ahead of me
with module BTFs. :)

>  };
>
>  /*

[...]

>  struct btf_ptr {
> void *ptr;
> __u32 type_id;
> __u32 flags;/* BTF ptr flags; unused at present. */

Also, if flags are not used at present, can we repurpose it to just
encode btf_obj_id and avoid (at least for now) the backwards
compatibility checks based on btf_ptr size?

> +   const char *module; /* optional module name. */
>  };
>
>  /*
> --
> 1.8.3.1
>


Re: [PATCH 1/6] seq_file: add seq_read_iter

2020-11-13 Thread Al Viro
On Fri, Nov 13, 2020 at 11:19:34PM -0700, Nathan Chancellor wrote:

> Assuming so, I have attached the output both with and without the
> WARN_ON. Looks like mountinfo is what is causing the error?

Cute...  FWIW, on #origin + that commit with fix folded in I don't
see anything unusual in reads from mountinfo ;-/  OTOH, they'd
obviously been... creative with readv(2) arguments, so it would
be very interesting to see what it is they are passing to it.

I'm half-asleep right now; will try to cook something to gather
that information tomorrow morning.  'Later...


Re: [RFC bpf-next 3/3] selftests/bpf: verify module-specific types can be shown via bpf_snprintf_btf

2020-11-13 Thread Andrii Nakryiko
On Fri, Nov 13, 2020 at 10:11 AM Alan Maguire  wrote:
>
> Verify that specifying a module name in "struct btf_ptr *" along
> with a type id of a module-specific type will succeed.
>
> veth_stats_rx() is chosen because its function signature consists
> of a module-specific type "struct veth_stats" and a kernel-specific
> one "struct net_device".
>
> Signed-off-by: Alan Maguire 
> ---
>  .../selftests/bpf/prog_tests/snprintf_btf_mod.c| 96 
> ++
>  tools/testing/selftests/bpf/progs/btf_ptr.h|  1 +
>  tools/testing/selftests/bpf/progs/veth_stats_rx.c  | 73 
>  3 files changed, 170 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/snprintf_btf_mod.c
>  create mode 100644 tools/testing/selftests/bpf/progs/veth_stats_rx.c
>

[...]

> +   err = veth_stats_rx__load(skel);
> +   if (CHECK(err, "skel_load", "failed to load skeleton: %d\n", err))
> +   goto cleanup;
> +
> +   bss = skel->bss;
> +
> +   bss->veth_stats_btf_id = btf__find_by_name(veth_btf, "veth_stats");

This is really awkward that this needs to be done from user-space.
Libbpf will be able to do this regardless of whether the type is in
vmlinux or kernel module. See my comments on patch #1.

> +
> +   if (CHECK(bss->veth_stats_btf_id <= 0, "find 'struct veth_stats'",
> + "could not find 'struct veth_stats' in veth BTF: %d",
> + bss->veth_stats_btf_id))
> +   goto cleanup;
> +

[...]

> +   btf_ids[0] = veth_stats_btf_id;
> +   ptrs[0] = (void *)PT_REGS_PARM1_CORE(ctx);
> +#if __has_builtin(__builtin_btf_type_id)

nit: there are a bunch of selftests that just assume we have this
built-in, so I don't think you need to guard it with #if here.

> +   btf_ids[1] = bpf_core_type_id_kernel(struct net_device);
> +   ptrs[1] = (void *)PT_REGS_PARM2_CORE(ctx);
> +#endif

[...]


Re: [RFC bpf-next 2/3] libbpf: bpf__find_by_name[_kind] should use btf__get_nr_types()

2020-11-13 Thread Andrii Nakryiko
On Fri, Nov 13, 2020 at 10:11 AM Alan Maguire  wrote:
>
> When operating on split BTF, btf__find_by_name[_kind] will not
> iterate over all types since they use btf->nr_types to show
> the number of types to iterate over.  For split BTF this is
> the number of types _on top of base BTF_, so it will
> underestimate the number of types to iterate over, especially
> for vmlinux + module BTF, where the latter is much smaller.
>
> Use btf__get_nr_types() instead.
>
> Signed-off-by: Alan Maguire 
> ---

Good catch. I'm amazed I didn't fix it up when I implemented split BTF
support, I distinctly remember looking at these two APIs...

Can you please add Fixes tag and post this as a separate patch? There
is no need to wait on all the other changes.

Fixes: ba451366bf44 ("libbpf: Implement basic split BTF support")

>  tools/lib/bpf/btf.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
> index 2d0d064..0fccf4b 100644
> --- a/tools/lib/bpf/btf.c
> +++ b/tools/lib/bpf/btf.c
> @@ -679,7 +679,7 @@ __s32 btf__find_by_name(const struct btf *btf, const char 
> *type_name)
> if (!strcmp(type_name, "void"))
> return 0;
>
> -   for (i = 1; i <= btf->nr_types; i++) {
> +   for (i = 1; i <= btf__get_nr_types(btf); i++) {

I think it's worthwhile to cache the result of btf__get_nr_types(btf)
in a local variable instead of re-calculating it thousands of times.
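I.e., something like:

	__s32 nr_types = btf__get_nr_types(btf);

	for (i = 1; i <= nr_types; i++) {
		const struct btf_type *t = btf__type_by_id(btf, i);
		...
	}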

> const struct btf_type *t = btf__type_by_id(btf, i);
> const char *name = btf__name_by_offset(btf, t->name_off);
>
> @@ -698,7 +698,7 @@ __s32 btf__find_by_name_kind(const struct btf *btf, const 
> char *type_name,
> if (kind == BTF_KIND_UNKN || !strcmp(type_name, "void"))
> return 0;
>
> -   for (i = 1; i <= btf->nr_types; i++) {
> +   for (i = 1; i <= btf__get_nr_types(btf); i++) {

same as above


> const struct btf_type *t = btf__type_by_id(btf, i);
> const char *name;
>
> --
> 1.8.3.1
>


[PATCH v2] mm/shmem.c: make shmem_mapping() inline

2020-11-13 Thread Hui Su
Inline shmem_mapping(), and use shmem_mapping()
instead of 'inode->i_mapping->a_ops == &shmem_aops'
in shmem_evict_inode().

v1->v2:
remove the inline for func declaration in shmem_fs.h

Reviewed-by: Pankaj Gupta 
Reported-by: kernel test robot 
Signed-off-by: Hui Su 
---
 mm/shmem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 537c137698f8..7395d8e8226a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1152,7 +1152,7 @@ static void shmem_evict_inode(struct inode *inode)
struct shmem_inode_info *info = SHMEM_I(inode);
struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
 
-   if (inode->i_mapping->a_ops == &shmem_aops) {
+   if (shmem_mapping(inode->i_mapping)) {
shmem_unacct_size(info->flags, inode->i_size);
inode->i_size = 0;
shmem_truncate_range(inode, 0, (loff_t)-1);
@@ -2352,7 +2352,7 @@ static struct inode *shmem_get_inode(struct super_block 
*sb, const struct inode
return inode;
 }
 
-bool shmem_mapping(struct address_space *mapping)
+inline bool shmem_mapping(struct address_space *mapping)
 {
	return mapping->a_ops == &shmem_aops;
 }
-- 
2.29.0




RE: [PATCH RESEND net-next 18/18] net: phy: adin: remove the use of the .ack_interrupt()

2020-11-13 Thread Ardelean, Alexandru



> -Original Message-
> From: Ioana Ciornei 
> Sent: Friday, November 13, 2020 6:52 PM
> To: Andrew Lunn ; Heiner Kallweit ;
> Russell King ; Florian Fainelli ;
> Jakub Kicinski ; net...@vger.kernel.org; linux-
> ker...@vger.kernel.org
> Cc: Ioana Ciornei ; Ardelean, Alexandru
> 
> Subject: [PATCH RESEND net-next 18/18] net: phy: adin: remove the use of the
> .ack_interrupt()
> 
> [External]
> 
> From: Ioana Ciornei 
> 
> In preparation of removing the .ack_interrupt() callback, we must replace its
> occurrences (aka phy_clear_interrupt) in the 2 places where it is called
> (phy_enable_interrupts and phy_disable_interrupts) with equivalent
> functionality.
> 
> This means that clearing interrupts now becomes something that the PHY driver
> is responsible for doing, before enabling interrupts and after disabling them.
> Make this driver follow the new contract.
> 

Acked-by: Alexandru Ardelean 

> Cc: Alexandru Ardelean 
> Signed-off-by: Ioana Ciornei 
> ---
>  drivers/net/phy/adin.c | 25 ++---
>  1 file changed, 18 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/phy/adin.c b/drivers/net/phy/adin.c index
> ba24434b867d..55a0b91816e2 100644
> --- a/drivers/net/phy/adin.c
> +++ b/drivers/net/phy/adin.c
> @@ -471,12 +471,25 @@ static int adin_phy_ack_intr(struct phy_device
> *phydev)
> 
>  static int adin_phy_config_intr(struct phy_device *phydev)  {
> - if (phydev->interrupts == PHY_INTERRUPT_ENABLED)
> - return phy_set_bits(phydev, ADIN1300_INT_MASK_REG,
> - ADIN1300_INT_MASK_EN);
> + int err;
> +
> + if (phydev->interrupts == PHY_INTERRUPT_ENABLED) {
> + err = adin_phy_ack_intr(phydev);
> + if (err)
> + return err;
> +
> + err = phy_set_bits(phydev, ADIN1300_INT_MASK_REG,
> +ADIN1300_INT_MASK_EN);
> + } else {
> + err = phy_clear_bits(phydev, ADIN1300_INT_MASK_REG,
> +  ADIN1300_INT_MASK_EN);
> + if (err)
> + return err;
> +
> + err = adin_phy_ack_intr(phydev);
> + }
> 
> - return phy_clear_bits(phydev, ADIN1300_INT_MASK_REG,
> -   ADIN1300_INT_MASK_EN);
> + return err;
>  }
> 
>  static irqreturn_t adin_phy_handle_interrupt(struct phy_device *phydev) @@ -
> 895,7 +908,6 @@ static struct phy_driver adin_driver[] = {
>   .read_status= adin_read_status,
>   .get_tunable= adin_get_tunable,
>   .set_tunable= adin_set_tunable,
> - .ack_interrupt  = adin_phy_ack_intr,
>   .config_intr= adin_phy_config_intr,
>   .handle_interrupt = adin_phy_handle_interrupt,
>   .get_sset_count = adin_get_sset_count,
> @@ -919,7 +931,6 @@ static struct phy_driver adin_driver[] = {
>   .read_status= adin_read_status,
>   .get_tunable= adin_get_tunable,
>   .set_tunable= adin_set_tunable,
> - .ack_interrupt  = adin_phy_ack_intr,
>   .config_intr= adin_phy_config_intr,
>   .handle_interrupt = adin_phy_handle_interrupt,
>   .get_sset_count = adin_get_sset_count,
> --
> 2.28.0



RE: [PATCH RESEND net-next 17/18] net: phy: adin: implement generic .handle_interrupt() callback

2020-11-13 Thread Ardelean, Alexandru



> -Original Message-
> From: Ioana Ciornei 
> Sent: Friday, November 13, 2020 6:52 PM
> To: Andrew Lunn ; Heiner Kallweit ;
> Russell King ; Florian Fainelli ;
> Jakub Kicinski ; net...@vger.kernel.org; linux-
> ker...@vger.kernel.org
> Cc: Ioana Ciornei ; Ardelean, Alexandru
> 
> Subject: [PATCH RESEND net-next 17/18] net: phy: adin: implement generic
> .handle_interrupt() callback
> 
> [External]
> 
> From: Ioana Ciornei 
> 
> In an attempt to actually support shared IRQs in phylib, we now move the
> responsibility of triggering the phylib state machine or just returning 
> IRQ_NONE,
> based on the IRQ status register, to the PHY driver. Having
> 3 different IRQ handling callbacks (.handle_interrupt(),
> .did_interrupt() and .ack_interrupt()) is confusing, so let the PHY driver
> directly implement an IRQ handler like any other device driver.
> Make this driver follow the new convention.
> 

Acked-by: Alexandru Ardelean 

> Cc: Alexandru Ardelean 
> Signed-off-by: Ioana Ciornei 
> ---
>  drivers/net/phy/adin.c | 20 
>  1 file changed, 20 insertions(+)
> 
> diff --git a/drivers/net/phy/adin.c b/drivers/net/phy/adin.c index
> 3727b38addf7..ba24434b867d 100644
> --- a/drivers/net/phy/adin.c
> +++ b/drivers/net/phy/adin.c
> @@ -479,6 +479,24 @@ static int adin_phy_config_intr(struct phy_device
> *phydev)
> ADIN1300_INT_MASK_EN);
>  }
> 
> +static irqreturn_t adin_phy_handle_interrupt(struct phy_device *phydev)
> +{
> + int irq_status;
> +
> + irq_status = phy_read(phydev, ADIN1300_INT_STATUS_REG);
> + if (irq_status < 0) {
> + phy_error(phydev);
> + return IRQ_NONE;
> + }
> +
> + if (!(irq_status & ADIN1300_INT_LINK_STAT_CHNG_EN))
> + return IRQ_NONE;
> +
> + phy_trigger_machine(phydev);
> +
> + return IRQ_HANDLED;
> +}
> +
>  static int adin_cl45_to_adin_reg(struct phy_device *phydev, int devad,
>u16 cl45_regnum)
>  {
> @@ -879,6 +897,7 @@ static struct phy_driver adin_driver[] = {
>   .set_tunable= adin_set_tunable,
>   .ack_interrupt  = adin_phy_ack_intr,
>   .config_intr= adin_phy_config_intr,
> + .handle_interrupt = adin_phy_handle_interrupt,
>   .get_sset_count = adin_get_sset_count,
>   .get_strings= adin_get_strings,
>   .get_stats  = adin_get_stats,
> @@ -902,6 +921,7 @@ static struct phy_driver adin_driver[] = {
>   .set_tunable= adin_set_tunable,
>   .ack_interrupt  = adin_phy_ack_intr,
>   .config_intr= adin_phy_config_intr,
> + .handle_interrupt = adin_phy_handle_interrupt,
>   .get_sset_count = adin_get_sset_count,
>   .get_strings= adin_get_strings,
>   .get_stats  = adin_get_stats,
> --
> 2.28.0



Re: [PATCH 1/3] arm64: dts: ti: k3-j7200-main: Add gpio nodes in main domain

2020-11-13 Thread Sekhar Nori
On 14/11/20 9:45 AM, Grygorii Strashko wrote:
> Hi
> 
> On 13/11/2020 22:55, Nishanth Menon wrote:
>> On 00:39-20201114, Sekhar Nori wrote:
>>>
>>> I was using the latest schema from master. But I changed to 2020.08.1
>>> also, and still don't see the warning.
>>>
>>> $ dt-doc-validate --version
>>> 2020.12.dev1+gab5a73fcef26
>>>
>>> I don't have a system-wide dtc installed. The one in the kernel tree is updated.
>>>
>>> $ scripts/dtc/dtc --version
>>> Version: DTC 1.6.0-gcbca977e
>>>
>>> Looking at your logs, it looks like you have more patches than just this
>>> applied. I wonder if that's making a difference. Can you check with just
>>> these patches applied to linux-next or share your tree which includes
>>> other patches?
>>>
>>> In your logs, you have such error for other interrupt controller nodes
>>> as well. For example:
>>>
>>>   arch/arm64/boot/dts/ti/k3-j7200-main.dtsi:
>>> /bus@10/bus@3000/interrupt-controller1: Missing #address-cells
>>> in interrupt provider
>>>
>>> Which I don't see in my logs. My guess is some other patch(es) in your
>>> patch stack either uncovers this warning or causes it.
>>
>> Oh boy! I sent you and myself on a wild goose chase! Really sorry about
>> messing up in the report of bug.
>>
>> It is not dtbs_check, it is building dtbs with W=2 that generates this
>> warning. dtc 1.6.0 is sufficient to reproduce this behavior.
>>
>> Using v5.10-rc1 as baseline (happens the same with next-20201113 as
>> well).
>>
>> v5.10-rc1: https://pastebin.ubuntu.com/p/Pn9HDqRjQ4/ (recording:
>>  https://asciinema.org/a/55YVpql9Bq8rh8fePTxI2xObO)
>>
>> v5.10-rc1 + 1st patch in the series(since we are testing):
>> https://pastebin.ubuntu.com/p/QWQRMSv565/ (recording:
>> https://asciinema.org/a/ZSKZkOY13l4lmZ2xWH34jMlM1)
>>
>> Diff: https://pastebin.ubuntu.com/p/239sYYT2QY/
>>
> 
> This warning comes from scripts/dtc/checks.c
> and was introduced by commit 3eb619b2f7d8 ("scripts/dtc: Update to
> upstream version v1.6.0-11-g9d7888cbf19c").
> 
> In my opinion it's a false warning, as there is no requirement to have
> #address-cells in an interrupt provider node.
> By the way, the above commit description says: "The interrupt_provider check
> is noisy, so turn it off for now."

Adding Andre, who added this check in upstream dtc, for guidance.

It looks like #address-cells makes sense only if there is an
interrupt-map specified as well. Since we don't use it, I can add

#address-cells = <0>;

to silence the warning. Let me know if there is a better way to deal
with this.
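
E.g., a hypothetical gpio interrupt-provider node fragment would then look
like this (node name and unit address made up for illustration):

main_gpio0: gpio@60 {
	/* ... */
	gpio-controller;
	interrupt-controller;
	#interrupt-cells = <2>;
	/* no interrupt-map is used, so no child addressing is needed */
	#address-cells = <0>;
};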

Thanks,
Sekhar


Re: [PATCH 1/6] seq_file: add seq_read_iter

2020-11-13 Thread Al Viro
On Fri, Nov 13, 2020 at 09:14:20PM -0700, Nathan Chancellor wrote:

> Unfortunately that patch does not solve my issue. Is there any other
> debugging I should add?

Hmm...  I wonder which file it is; how about
if (WARN_ON(!iovec.iov_len))
printk(KERN_ERR "odd readv on %pd4\n", file);
in the loop in fs/read_write.c:do_loop_readv_writev()?


Re: [PATCH] mm/shmem.c: make shmem_mapping() inline

2020-11-13 Thread Hui Su
On Sat, Nov 14, 2020 at 12:54:47AM +0800, kernel test robot wrote:
> Hi Hui,
> 
> Thank you for the patch! Perhaps something to improve:
> 
> [auto build test WARNING on mmotm/master]
> 
> url:
> https://github.com/0day-ci/linux/commits/Hui-Su/mm-shmem-c-make-shmem_mapping-inline/20201113-215549
> base:   git://git.cmpxchg.org/linux-mmotm.git master
> config: arm-randconfig-s032-20201113 (attached as .config)
> compiler: arm-linux-gnueabi-gcc (GCC) 9.3.0
> reproduce:
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # apt-get install sparse
> # sparse version: v0.6.3-107-gaf3512a6-dirty
> # 
> https://github.com/0day-ci/linux/commit/0434762d5523a3d702cd589a7f8e3771fee7b3b2
> git remote add linux-review https://github.com/0day-ci/linux
> git fetch --no-tags linux-review 
> Hui-Su/mm-shmem-c-make-shmem_mapping-inline/20201113-215549
> git checkout 0434762d5523a3d702cd589a7f8e3771fee7b3b2
> # save the attached .config to linux build tree
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross C=1 
> CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=arm 
> 
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot 
> 
> 
> "sparse warnings: (new ones prefixed by >>)"
>mm/filemap.c: note: in included file:
> >> include/linux/shmem_fs.h:66:33: sparse: sparse: marked inline, but without 
> >> a definition
> >> include/linux/shmem_fs.h:66:33: sparse: sparse: marked inline, but without 
> >> a definition
> >> include/linux/shmem_fs.h:66:33: sparse: sparse: marked inline, but without 
> >> a definition
> --
>mm/truncate.c: note: in included file:
> >> include/linux/shmem_fs.h:66:33: sparse: sparse: marked inline, but without 
> >> a definition
> >> include/linux/shmem_fs.h:66:33: sparse: sparse: marked inline, but without 
> >> a definition
> >> include/linux/shmem_fs.h:66:33: sparse: sparse: marked inline, but without 
> >> a definition
> --
>mm/memfd.c: note: in included file:
> >> include/linux/shmem_fs.h:66:33: sparse: sparse: marked inline, but without 
> >> a definition
> 
> vim +66 include/linux/shmem_fs.h
> 
> 48
> 49/*
> 50 * Functions in mm/shmem.c called directly from elsewhere:
> 51 */
> 52extern const struct fs_parameter_description 
> shmem_fs_parameters;
> 53extern int shmem_init(void);
> 54extern int shmem_init_fs_context(struct fs_context *fc);
> 55extern struct file *shmem_file_setup(const char *name,
> 56loff_t size, unsigned 
> long flags);
> 57extern struct file *shmem_kernel_file_setup(const char *name, 
> loff_t size,
> 58unsigned long 
> flags);
> 59extern struct file *shmem_file_setup_with_mnt(struct vfsmount 
> *mnt,
> 60const char *name, loff_t size, unsigned long 
> flags);
> 61extern int shmem_zero_setup(struct vm_area_struct *);
> 62extern unsigned long shmem_get_unmapped_area(struct file *, 
> unsigned long addr,
> 63unsigned long len, unsigned long pgoff, 
> unsigned long flags);
> 64extern int shmem_lock(struct file *file, int lock, struct 
> user_struct *user);
> 65#ifdef CONFIG_SHMEM
>   > 66extern inline bool shmem_mapping(struct address_space *mapping);
> 67#else
> 68static inline bool shmem_mapping(struct address_space *mapping)
> 69{
> 70return false;
> 71}
> 72#endif /* CONFIG_SHMEM */
> 73extern void shmem_unlock_mapping(struct address_space *mapping);
> 74extern struct page *shmem_read_mapping_page_gfp(struct 
> address_space *mapping,
> 75pgoff_t index, gfp_t 
> gfp_mask);
> 76extern void shmem_truncate_range(struct inode *inode, loff_t 
> start, loff_t end);
> 77extern int shmem_unuse(unsigned int type, bool frontswap,
> 78   unsigned long *fs_pages_to_unuse);
> 79
> 
> ---
> 0-DAY CI Kernel Test Service, Intel Corporation
> https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org

Thanks for your test.

I will resend a PATCH V2 later.
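
For reference, the v2 direction would presumably be a plain declaration in the
header, keeping the definition in mm/shmem.c -- a sketch based on the robot
output above, not the actual v2 patch:

/* include/linux/shmem_fs.h */
#ifdef CONFIG_SHMEM
extern bool shmem_mapping(struct address_space *mapping);
#else
static inline bool shmem_mapping(struct address_space *mapping)
{
	return false;
}
#endif /* CONFIG_SHMEM */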



[PATCH] rtlwifi: rtl8192de: remove the useless value assignment

2020-11-13 Thread xiakaixu1987
From: Kaixu Xia 

The variable u4tmp is overwritten by the following call and the assignment
is useless, so remove it.

Reported-by: Tosk Robot 
Signed-off-by: Kaixu Xia 
---
 drivers/net/wireless/realtek/rtlwifi/rtl8192de/phy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wireless/realtek/rtlwifi/rtl8192de/phy.c 
b/drivers/net/wireless/realtek/rtlwifi/rtl8192de/phy.c
index e34d33e73e52..68ec009ea157 100644
--- a/drivers/net/wireless/realtek/rtlwifi/rtl8192de/phy.c
+++ b/drivers/net/wireless/realtek/rtlwifi/rtl8192de/phy.c
@@ -2566,7 +2566,7 @@ static void _rtl92d_phy_lc_calibrate_sw(struct 
ieee80211_hw *hw, bool is2t)
}
RTPRINT(rtlpriv, FINIT, INIT_IQK,
"PHY_LCK finish delay for %d ms=2\n", timecount);
-   u4tmp = rtl_get_rfreg(hw, index, RF_SYN_G4, RFREG_OFFSET_MASK);
+   rtl_get_rfreg(hw, index, RF_SYN_G4, RFREG_OFFSET_MASK);
if (index == 0 && rtlhal->interfaceindex == 0) {
RTPRINT(rtlpriv, FINIT, INIT_IQK,
"path-A / 5G LCK\n");
-- 
2.20.0



Re: [RFC 07/11] coresight: sink: Add TRBE driver

2020-11-13 Thread Tingwei Zhang
Hi Anshuman,

On Tue, Nov 10, 2020 at 08:45:05PM +0800, Anshuman Khandual wrote:
> Trace Buffer Extension (TRBE) implements a trace buffer per CPU which is
> accessible via the system registers. The TRBE supports different addressing
> modes including CPU virtual address and buffer modes including the circular
> buffer mode. The TRBE buffer is addressed by a base pointer (TRBBASER_EL1),
> a write pointer (TRBPTR_EL1) and a limit pointer (TRBLIMITR_EL1). But the
> access to the trace buffer could be prohibited by a higher exception level
> (EL3 or EL2), indicated by TRBIDR_EL1.P. The TRBE can also generate a CPU
> private interrupt (PPI) on address translation errors and when the buffer
> is full. The overall implementation here is inspired by the Arm SPE driver.
> 
> Signed-off-by: Anshuman Khandual 
> ---
>  Documentation/trace/coresight/coresight-trbe.rst |  36 ++
>  arch/arm64/include/asm/sysreg.h  |   2 +
>  drivers/hwtracing/coresight/Kconfig  |  11 +
>  drivers/hwtracing/coresight/Makefile |   1 +
>  drivers/hwtracing/coresight/coresight-trbe.c | 766 
> +++
>  drivers/hwtracing/coresight/coresight-trbe.h | 525 
>  6 files changed, 1341 insertions(+)
>  create mode 100644 Documentation/trace/coresight/coresight-trbe.rst
>  create mode 100644 drivers/hwtracing/coresight/coresight-trbe.c
>  create mode 100644 drivers/hwtracing/coresight/coresight-trbe.h
> 
> diff --git a/Documentation/trace/coresight/coresight-trbe.rst 
> b/Documentation/trace/coresight/coresight-trbe.rst
> new file mode 100644
> index 000..4320a8b
> --- /dev/null
> +++ b/Documentation/trace/coresight/coresight-trbe.rst
> @@ -0,0 +1,36 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==
> +Trace Buffer Extension (TRBE).
> +==
> +
> +:Author:   Anshuman Khandual 
> +:Date: November 2020
> +
> +Hardware Description
> +
> +
> +Trace Buffer Extension (TRBE) is per-CPU hardware which captures, in system
> +memory, CPU traces generated by a corresponding per-CPU tracing unit. This
> +gets plugged in as a coresight sink device because the corresponding trace
> +generators (ETE) are plugged in as source devices.
> +
> +Sysfs files and directories
> +---
> +
> +The TRBE devices appear on the existing coresight bus alongside the other
> +coresight devices::
> +
> + >$ ls /sys/bus/coresight/devices
> + trbe0  trbe1  trbe2 trbe3
> +
> +The ``trbe<N>`` named TRBEs are associated with a CPU.::
> +
> + >$ ls /sys/bus/coresight/devices/trbe0/
> + irq align dbm
> +
> +*Key file items are:-*
> +   * ``irq``: TRBE maintenance interrupt number
> +   * ``align``: TRBE write pointer alignment
> +   * ``dbm``: TRBE updates memory with access and dirty flags
> +
> diff --git a/arch/arm64/include/asm/sysreg.h 
> b/arch/arm64/include/asm/sysreg.h
> index 14cb156..61136f6 100644
> --- a/arch/arm64/include/asm/sysreg.h
> +++ b/arch/arm64/include/asm/sysreg.h
> @@ -97,6 +97,7 @@
>  #define SET_PSTATE_UAO(x)__emit_inst(0xd500401f | PSTATE_UAO | 
> ((!!x) << 
> PSTATE_Imm_shift))
>  #define SET_PSTATE_SSBS(x)   __emit_inst(0xd500401f | PSTATE_SSBS | 
> ((!!x) 
> << PSTATE_Imm_shift))
>  #define SET_PSTATE_TCO(x)__emit_inst(0xd500401f | PSTATE_TCO | 
> ((!!x) << 
> PSTATE_Imm_shift))
> +#define TSB_CSYNC			__emit_inst(0xd503225f)
> 
>  #define __SYS_BARRIER_INSN(CRm, op2, Rt) \
>   __emit_inst(0xd500 | sys_insn(0, 3, 3, (CRm), (op2)) | ((Rt) & 
> 0x1f))
> @@ -865,6 +866,7 @@
>  #define ID_AA64MMFR2_CNP_SHIFT   0
> 
>  /* id_aa64dfr0 */
> +#define ID_AA64DFR0_TRBE_SHIFT   44
>  #define ID_AA64DFR0_TRACE_FILT_SHIFT 40
>  #define ID_AA64DFR0_DOUBLELOCK_SHIFT 36
>  #define ID_AA64DFR0_PMSVER_SHIFT 32
> diff --git a/drivers/hwtracing/coresight/Kconfig 
> b/drivers/hwtracing/coresight/Kconfig
> index c119824..0f5e101 100644
> --- a/drivers/hwtracing/coresight/Kconfig
> +++ b/drivers/hwtracing/coresight/Kconfig
> @@ -156,6 +156,17 @@ config CORESIGHT_CTI
> To compile this driver as a module, choose M here: the
> module will be called coresight-cti.
> 
> +config CORESIGHT_TRBE
> + bool "Trace Buffer Extension (TRBE) driver"

Can you consider supporting TRBE as a loadable module, since all coresight
drivers support loadable modules now?
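
(A first step would presumably be switching the Kconfig symbol to tristate,
roughly as below -- a sketch only; the driver would also need the usual
module init/exit plumbing:)

config CORESIGHT_TRBE
	tristate "Trace Buffer Extension (TRBE) driver"
	depends on ARM64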

Thanks
Tingwei

> + depends on ARM64
> + help
> +   This driver provides support for percpu Trace Buffer Extension (TRBE).
> +   TRBE always needs to be used along with its corresponding percpu ETE
> +   component. ETE generates trace data which is then captured with TRBE.
> +   Unlike traditional sink devices, TRBE is a CPU feature accessible via
> +   system registers. But its explicit dependency on the trace unit (ETE)
> +   requires it to be plugged in as a coresight sink device.
> +
>  config 

Re: [RFC 06/11] coresight: ete: Detect ETE as one of the supported ETMs

2020-11-13 Thread Tingwei Zhang
Hi Anshuman,

On Tue, Nov 10, 2020 at 08:45:04PM +0800, Anshuman Khandual wrote:
> From: Suzuki K Poulose 
> 
> Add ETE as one of the device types we support
> with the ETM4x driver. The devices are named following the
> existing convention as ete<N>.
> 
> ETE mandates that the trace resource status register is programmed
> before the tracing is turned on. For the moment simply write to
> it indicating TraceActive.
> 
> Signed-off-by: Suzuki K Poulose 
> Signed-off-by: Anshuman Khandual 
> ---
>  .../devicetree/bindings/arm/coresight.txt  |  3 ++
>  drivers/hwtracing/coresight/coresight-etm4x-core.c | 55 
> +-
>  drivers/hwtracing/coresight/coresight-etm4x.h  |  7 +++
>  3 files changed, 52 insertions(+), 13 deletions(-)
> 
> diff --git a/Documentation/devicetree/bindings/arm/coresight.txt 
> b/Documentation/devicetree/bindings/arm/coresight.txt
> index bff96a5..784cc1b 100644
> --- a/Documentation/devicetree/bindings/arm/coresight.txt
> +++ b/Documentation/devicetree/bindings/arm/coresight.txt
> @@ -40,6 +40,9 @@ its hardware characteristcs.
>   - Embedded Trace Macrocell with system register access only.
>   "arm,coresight-etm-sysreg";
> 
> + - Embedded Trace Extensions.
> + "arm,ete"
> +
>   - Coresight programmable Replicator :
>   "arm,coresight-dynamic-replicator", "arm,primecell";
> 
> diff --git a/drivers/hwtracing/coresight/coresight-etm4x-core.c 
> b/drivers/hwtracing/coresight/coresight-etm4x-core.c
> index 15b6e94..0fea349 100644
> --- a/drivers/hwtracing/coresight/coresight-etm4x-core.c
> +++ b/drivers/hwtracing/coresight/coresight-etm4x-core.c
> @@ -331,6 +331,13 @@ static int etm4_enable_hw(struct etmv4_drvdata 
> *drvdata)
>   etm4x_relaxed_write32(csa, trcpdcr | TRCPDCR_PU, TRCPDCR);
>   }
> 
> + /*
> +  * ETE mandates that the TRCRSR is written to before
> +  * enabling it.
> +  */
> + if (drvdata->arch >= ETM_ARCH_ETE)
> + etm4x_relaxed_write32(csa, TRCRSR_TA, TRCRSR);
> +
>   /* Enable the trace unit */
>   etm4x_relaxed_write32(csa, 1, TRCPRGCTLR);
> 
> @@ -763,13 +770,24 @@ static bool etm_init_sysreg_access(struct 
> etmv4_drvdata *drvdata,
>* ETMs implementing sysreg access must implement TRCDEVARCH.
>*/
>   devarch = read_etm4x_sysreg_const_offset(TRCDEVARCH);
> - if ((devarch & ETM_DEVARCH_ID_MASK) != ETM_DEVARCH_ETMv4x_ARCH)
> + switch (devarch & ETM_DEVARCH_ID_MASK) {
> + case ETM_DEVARCH_ETMv4x_ARCH:
> + *csa = (struct csdev_access) {
> + .io_mem = false,
> + .read   = etm4x_sysreg_read,
> + .write  = etm4x_sysreg_write,
> + };
> + break;
> + case ETM_DEVARCH_ETE_ARCH:
> + *csa = (struct csdev_access) {
> + .io_mem = false,
> + .read   = ete_sysreg_read,
> + .write  = ete_sysreg_write,
> + };
> + break;
> + default:
>   return false;
> - *csa = (struct csdev_access) {
> - .io_mem = false,
> - .read   = etm4x_sysreg_read,
> - .write  = etm4x_sysreg_write,
> - };
> + }
> 
>   drvdata->arch = etm_devarch_to_arch(devarch);
>   return true;
> @@ -1698,6 +1716,8 @@ static int etm4_probe(struct device *dev, void __iomem 
> *base)
>   struct etmv4_drvdata *drvdata;
>   struct coresight_desc desc = { 0 };
>   struct etm_init_arg init_arg = { 0 };
> + u8 major, minor;
> + char *type_name;
> 
>   drvdata = devm_kzalloc(dev, sizeof(*drvdata), GFP_KERNEL);
>   if (!drvdata)
> @@ -1724,10 +1744,6 @@ static int etm4_probe(struct device *dev, void 
> __iomem *base)
>   if (drvdata->cpu < 0)
>   return drvdata->cpu;
> 
> - desc.name = devm_kasprintf(dev, GFP_KERNEL, "etm%d", drvdata->cpu);
> - if (!desc.name)
> - return -ENOMEM;
> -
>   init_arg.drvdata = drvdata;
>   init_arg.csa = 
> 
> @@ -1742,6 +1758,19 @@ static int etm4_probe(struct device *dev, void 
> __iomem *base)
>   if (!desc.access.io_mem ||
>   fwnode_property_present(dev_fwnode(dev), "qcom,skip-power-up"))
>   drvdata->skip_power_up = true;
> + major = ETM_ARCH_MAJOR_VERSION(drvdata->arch);
> + minor = ETM_ARCH_MINOR_VERSION(drvdata->arch);
> + if (drvdata->arch >= ETM_ARCH_ETE) {
> + type_name = "ete";
> + major -= 4;
> + } else {
> + type_name = "etm";
> + }
> +
When the trace unit supports ETE, could it still be compatible with ETMv4.4?
Can we selectively use it as an ETM instead of an ETE?

Thanks,
Tingwei

> + desc.name = devm_kasprintf(dev, GFP_KERNEL,
> +"%s%d", type_name, drvdata->cpu);
> + if (!desc.name)
> + return -ENOMEM;
> 
>   

Re: [RFC 00/11] arm64: coresight: Enable ETE and TRBE

2020-11-13 Thread Tingwei Zhang
Hi Anshuman,

On Tue, Nov 10, 2020 at 08:44:58PM +0800, Anshuman Khandual wrote:
> This series enables future IP trace features Embedded Trace Extension (ETE)
> and Trace Buffer Extension (TRBE). This series depends on the ETM system
> register instruction support series [0] and the v8.4 Self hosted tracing
> support series (Jonathan Zhou) [1]. The tree is available here [2] for
> quick access.
> 
> ETE is the PE (CPU) trace unit for CPUs, implementing future architecture
> extensions. ETE overlaps with the ETMv4 architecture, with additions to
> support the newer architecture features and some restrictions on the
> supported features w.r.t ETMv4. The ETE support is added by extending the
> ETMv4 driver to recognise the ETE and handle the features as exposed by the
> TRCIDRx registers. ETE only supports system instructions access from the
> host CPU. The ETE could be integrated with a TRBE (see below), or with the
> legacy CoreSight trace bus (e.g, ETRs). Thus the ETE follows same firmware
> description as the ETMs and requires a node per instance.
> 
> Trace Buffer Extensions (TRBE) implements a per CPU trace buffer, which is
> accessible via the system registers and can be combined with the ETE to
> provide a 1x1 configuration of source & sink. TRBE is being represented
> here as a CoreSight sink. Primary reason is that the ETE source could work
> with other traditional CoreSight sink devices. As TRBE captures the trace
> data which is produced by ETE, it cannot work alone.
> 
> The TRBE representation here has some distinct deviations from a traditional
> CoreSight sink device. The Coresight path between ETE and TRBE is not built
> during boot by looking at respective DT or ACPI entries. Instead TRBE gets
> checked on each available CPU, when found gets connected with respective
> ETE source device on the same CPU, after altering its outward connections.
> The ETE-TRBE path connection lasts only as long as the CPU is online. But the
> ETE-TRBE coupling/decoupling method implemented here is not optimal and will
> be reworked later on.

Only perf mode is supported for TRBE in the current patch set. Will you
consider supporting sysfs mode as well in following patch sets?

Thanks,
Tingwei

> 
> Unlike traditional sinks, TRBE can generate interrupts to signal, among many
> other things, that the buffer got filled. The interrupt is a PPI and should be
> communicated from the platform. DT or ACPI entry representing TRBE should
> have the PPI number for a given platform. During perf session, the TRBE IRQ
> handler should capture trace for perf auxiliary buffer before restarting it
> back. System registers being used here to configure ETE and TRBE could be
> referred in the link below.
> 
> https://developer.arm.com/docs/ddi0601/g/aarch64-system-registers.
> 
> This adds another change where CoreSight sink device needs to be disabled
> before capturing the trace data for perf in order to avoid race condition
> with another simultaneous TRBE IRQ handling. This might cause problems with
> traditional sink devices which can be operated in both sysfs and perf mode.
> This needs to be addressed correctly. One option would be to move the
> update_buffer callback into the respective sink devices. e.g, disable().
> 
> This series is primarily looking from some early feed back both on proposed
> design and its implementation. It acknowledges, that it might be incomplete
> and will have scopes for improvement.
> 
> Things todo:
> - Improve ETE-TRBE coupling and decoupling method
> - Improve TRBE IRQ handling for all possible corner cases
> - Implement sysfs based trace sessions
> 
> [0] 
> https://lore.kernel.org/linux-arm-kernel/20201028220945.3826358-1-suzuki.poul...@arm.com/
> [1] 
> https://lore.kernel.org/linux-arm-kernel/1600396210-54196-1-git-send-email-jonathan.zhou...@huawei.com/
> [2] 
> https://gitlab.arm.com/linux-arm/linux-skp/-/tree/coresight/etm/v8.4-self-hosted
> 
> Anshuman Khandual (6):
>   arm64: Add TRBE definitions
>   coresight: sink: Add TRBE driver
>   coresight: etm-perf: Truncate the perf record if handle has no space
>   coresight: etm-perf: Disable the path before capturing the trace data
>   coresgith: etm-perf: Connect TRBE sink with ETE source
>   dts: bindings: Document device tree binding for Arm TRBE
> 
> Suzuki K Poulose (5):
>   coresight: etm-perf: Allow an event to use different sinks
>   coresight: Do not scan for graph if none is present
>   coresight: etm4x: Add support for PE OS lock
>   coresight: ete: Add support for sysreg support
>   coresight: ete: Detect ETE as one of the supported ETMs
> 
>  .../devicetree/bindings/arm/coresight.txt  |   3 +
>  Documentation/devicetree/bindings/arm/trbe.txt |  20 +
>  Documentation/trace/coresight/coresight-trbe.rst   |  36 +
>  arch/arm64/include/asm/sysreg.h|  51 ++
>  drivers/hwtracing/coresight/Kconfig|  11 +
>  drivers/hwtracing/coresight/Makefile   |   1 +
>  drivers/hwtracing/coresight/coresight-etm-perf.c   |  85 

Re: [PATCH v4 5/5] drm/i915/display: Introduce DEFINE_SHOW_STORE_ATTRIBUTE for debugfs

2020-11-13 Thread kernel test robot
Hi Luo,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on mkp-scsi/for-next]
[also build test ERROR on scsi/for-next linus/master v5.10-rc3 next-20201113]
[cannot apply to hnaz-linux-mm/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:
https://github.com/0day-ci/linux/commits/Luo-Jiaxing/Introduce-a-new-helper-macro-DEFINE_SHOW_STORE_ATTRIBUTE-at-seq_file-c/20201112-150927
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next
config: x86_64-rhel (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce (this is a W=1 build):
# 
https://github.com/0day-ci/linux/commit/c5417f366b929124a8b8a6add9b86653da6935a8
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review 
Luo-Jiaxing/Introduce-a-new-helper-macro-DEFINE_SHOW_STORE_ATTRIBUTE-at-seq_file-c/20201112-150927
git checkout c5417f366b929124a8b8a6add9b86653da6935a8
# save the attached .config to linux build tree
make W=1 ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All errors (new ones prefixed by >>):

   In file included from include/drm/drm_debugfs.h:36,
from drivers/gpu/drm/i915/display/intel_display_debugfs.c:6:
>> drivers/gpu/drm/i915/display/intel_display_debugfs.c:1788:29: error: 
>> redefinition of 'i915_hpd_short_storm_ctl_open'
1788 | DEFINE_SHOW_STORE_ATTRIBUTE(i915_hpd_short_storm_ctl);
 | ^~~~
   include/linux/seq_file.h:195:12: note: in definition of macro 
'DEFINE_SHOW_STORE_ATTRIBUTE'
 195 | static int __name ## _open(struct inode *inode, struct file *file) \
 |^~
   drivers/gpu/drm/i915/display/intel_display_debugfs.c:1735:1: note: previous 
definition of 'i915_hpd_short_storm_ctl_open' was here
1735 | i915_hpd_short_storm_ctl_open(struct inode *inode, struct file *file)
 | ^
   drivers/gpu/drm/i915/display/intel_display_debugfs.c:1735:1: warning: 
'i915_hpd_short_storm_ctl_open' defined but not used [-Wunused-function]

vim +/i915_hpd_short_storm_ctl_open +1788 
drivers/gpu/drm/i915/display/intel_display_debugfs.c

  1787  
> 1788  DEFINE_SHOW_STORE_ATTRIBUTE(i915_hpd_short_storm_ctl);
  1789  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org




[PATCH] USB: serial: mos7720: defer state restore to a workqueue

2020-11-13 Thread Davidlohr Bueso

The parallel port restore operation currently defers writes
to a tasklet, if it sees a locked disconnect mutex. The
driver goes to a lot of trouble to ensure writes happen
in a non-blocking context, but things can be greatly
simplified if it's done in regular process context and
this is not a system performance critical path. As such,
instead of doing the async state restore writes in irq
context, use a workqueue and just do regular synchronous
writes.

In addition to the cleanup, this also imposes less on the
overall system, as tasklets have been deprecated because
of their BH implications, potentially blocking a higher
priority task from running. We also get rid of hacks
such as trylocking a mutex in irq, something which does
not play nice with priority boosting in PREEMPT_RT.
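
A minimal sketch of the resulting pattern (struct names from this driver;
the work-function name is made up, and error handling and the actual
synchronous register writes are elided):

static void parport_restore_work(struct work_struct *work)
{
	struct mos7715_parport *mos_parport =
		container_of(work, struct mos7715_parport, work);

	/* process context: we may simply block on the mutex */
	mutex_lock(&mos_parport->serial->disc_mutex);
	if (!mos_parport->serial->disconnected) {
		/* ... synchronous control-message writes to restore state ... */
	}
	mutex_unlock(&mos_parport->serial->disc_mutex);
}

/* at init:  INIT_WORK(&mos_parport->work, parport_restore_work);
 * restore:  schedule_work(&mos_parport->work);
 */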

Signed-off-by: Davidlohr Bueso 
---
drivers/usb/serial/mos7720.c | 235 ++-
1 file changed, 36 insertions(+), 199 deletions(-)

diff --git a/drivers/usb/serial/mos7720.c b/drivers/usb/serial/mos7720.c
index 5a5d2a95070e..d36aaa4a13de 100644
--- a/drivers/usb/serial/mos7720.c
+++ b/drivers/usb/serial/mos7720.c
@@ -79,14 +79,6 @@ MODULE_DEVICE_TABLE(usb, id_table);
#define DCR_INIT_VAL   0x0c /* SLCTIN, nINIT */
#define ECR_INIT_VAL   0x00 /* SPP mode */

-struct urbtracker {
-   struct mos7715_parport  *mos_parport;
-   struct list_headurblist_entry;
-   struct kref ref_count;
-   struct urb  *urb;
-   struct usb_ctrlrequest  *setup;
-};
-
enum mos7715_pp_modes {
SPP = 0<<5,
PS2 = 1<<5,  /* moschip calls this 'NIBBLE' mode */
@@ -96,12 +88,9 @@ enum mos7715_pp_modes {
struct mos7715_parport {
struct parport  *pp;   /* back to containing struct */
struct kref ref_count; /* to instance of this struct */
-   struct list_headdeferred_urbs; /* list deferred async urbs */
-   struct list_headactive_urbs;   /* list async urbs in flight */
-   spinlock_t  listlock;  /* protects list access */
boolmsg_pending;   /* usb sync call pending */
struct completion   syncmsg_compl; /* usb sync call completed */
-   struct tasklet_struct   urb_tasklet;   /* for sending deferred urbs */
+   struct work_struct  work;  /* restore deferred writes */
struct usb_serial   *serial;   /* back to containing struct */
__u8shadowECR; /* parallel port regs... */
__u8shadowDCR;
@@ -265,174 +254,8 @@ static void destroy_mos_parport(struct kref *kref)
kfree(mos_parport);
}

-static void destroy_urbtracker(struct kref *kref)
-{
-   struct urbtracker *urbtrack =
-   container_of(kref, struct urbtracker, ref_count);
-   struct mos7715_parport *mos_parport = urbtrack->mos_parport;
-
-   usb_free_urb(urbtrack->urb);
-   kfree(urbtrack->setup);
-   kfree(urbtrack);
-   kref_put(&mos_parport->ref_count, destroy_mos_parport);
-}
-
/*
- * This runs as a tasklet when sending an urb in a non-blocking parallel
- * port callback had to be deferred because the disconnect mutex could not be
- * obtained at the time.
- */
-static void send_deferred_urbs(struct tasklet_struct *t)
-{
-   int ret_val;
-   unsigned long flags;
-   struct mos7715_parport *mos_parport = from_tasklet(mos_parport, t,
-  urb_tasklet);
-   struct urbtracker *urbtrack, *tmp;
-   struct list_head *cursor, *next;
-   struct device *dev;
-
-   /* if release function ran, game over */
-   if (unlikely(mos_parport->serial == NULL))
-   return;
-
-   dev = &mos_parport->serial->dev->dev;
-
-   /* try again to get the mutex */
-   if (!mutex_trylock(&mos_parport->serial->disc_mutex)) {
-   dev_dbg(dev, "%s: rescheduling tasklet\n", __func__);
-   tasklet_schedule(&mos_parport->urb_tasklet);
-   return;
-   }
-
-   /* if device disconnected, game over */
-   if (unlikely(mos_parport->serial->disconnected)) {
-   mutex_unlock(&mos_parport->serial->disc_mutex);
-   return;
-   }
-
-   spin_lock_irqsave(&mos_parport->listlock, flags);
-   if (list_empty(&mos_parport->deferred_urbs)) {
-   spin_unlock_irqrestore(&mos_parport->listlock, flags);
-   mutex_unlock(&mos_parport->serial->disc_mutex);
-   dev_dbg(dev, "%s: deferred_urbs list empty\n", __func__);
-   return;
-   }
-
-   /* move contents of deferred_urbs list to active_urbs list and submit */
-   list_for_each_safe(cursor, next, &mos_parport->deferred_urbs)
-   list_move_tail(cursor, &mos_parport->active_urbs);
-   list_for_each_entry_safe(urbtrack, tmp, &mos_parport->active_urbs,
-   urblist_entry) {
-   ret_val = 

Re: [Nouveau] [PATCH 1/8] drm/nouveau/kms/nv50-: Use atomic encoder callbacks everywhere

2020-11-13 Thread Ben Skeggs
I've merged all of these.  Sent the first to 5.10-fixes for the
regression there, the rest will go in with a later -next pull request.

Thanks,
Ben.

On Sat, 14 Nov 2020 at 10:14, Lyude Paul  wrote:
>
> It turns out that I forgot to go through and make sure that I converted all
> encoder callbacks to use atomic_enable/atomic_disable(), so let's go and
> actually do that.
>
> Signed-off-by: Lyude Paul 
> Cc: Kirill A. Shutemov 
> Fixes: 09838c4efe9a ("drm/nouveau/kms: Search for encoders' connectors 
> properly")
> ---
>  drivers/gpu/drm/nouveau/dispnv50/disp.c | 29 -
>  1 file changed, 14 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/gpu/drm/nouveau/dispnv50/disp.c 
> b/drivers/gpu/drm/nouveau/dispnv50/disp.c
> index b111fe24a06b..36d6b6093d16 100644
> --- a/drivers/gpu/drm/nouveau/dispnv50/disp.c
> +++ b/drivers/gpu/drm/nouveau/dispnv50/disp.c
> @@ -455,7 +455,7 @@ nv50_outp_get_old_connector(struct nouveau_encoder *outp,
>   * DAC
>   
> */
>  static void
> -nv50_dac_disable(struct drm_encoder *encoder)
> +nv50_dac_disable(struct drm_encoder *encoder, struct drm_atomic_state *state)
>  {
> struct nouveau_encoder *nv_encoder = nouveau_encoder(encoder);
> struct nv50_core *core = nv50_disp(encoder->dev)->core;
> @@ -467,7 +467,7 @@ nv50_dac_disable(struct drm_encoder *encoder)
>  }
>
>  static void
> -nv50_dac_enable(struct drm_encoder *encoder)
> +nv50_dac_enable(struct drm_encoder *encoder, struct drm_atomic_state *state)
>  {
> struct nouveau_encoder *nv_encoder = nouveau_encoder(encoder);
> struct nouveau_crtc *nv_crtc = nouveau_crtc(encoder->crtc);
> @@ -525,8 +525,8 @@ nv50_dac_detect(struct drm_encoder *encoder, struct 
> drm_connector *connector)
>  static const struct drm_encoder_helper_funcs
>  nv50_dac_help = {
> .atomic_check = nv50_outp_atomic_check,
> -   .enable = nv50_dac_enable,
> -   .disable = nv50_dac_disable,
> +   .atomic_enable = nv50_dac_enable,
> +   .atomic_disable = nv50_dac_disable,
> .detect = nv50_dac_detect
>  };
>
> @@ -1055,7 +1055,7 @@ nv50_dp_bpc_to_depth(unsigned int bpc)
>  }
>
>  static void
> -nv50_msto_enable(struct drm_encoder *encoder)
> +nv50_msto_enable(struct drm_encoder *encoder, struct drm_atomic_state *state)
>  {
> struct nv50_head *head = nv50_head(encoder->crtc);
> struct nv50_head_atom *armh = nv50_head_atom(head->base.base.state);
> @@ -1101,7 +1101,7 @@ nv50_msto_enable(struct drm_encoder *encoder)
>  }
>
>  static void
> -nv50_msto_disable(struct drm_encoder *encoder)
> +nv50_msto_disable(struct drm_encoder *encoder, struct drm_atomic_state 
> *state)
>  {
> struct nv50_msto *msto = nv50_msto(encoder);
> struct nv50_mstc *mstc = msto->mstc;
> @@ -1118,8 +1118,8 @@ nv50_msto_disable(struct drm_encoder *encoder)
>
>  static const struct drm_encoder_helper_funcs
>  nv50_msto_help = {
> -   .disable = nv50_msto_disable,
> -   .enable = nv50_msto_enable,
> +   .atomic_disable = nv50_msto_disable,
> +   .atomic_enable = nv50_msto_enable,
> .atomic_check = nv50_msto_atomic_check,
>  };
>
> @@ -1645,8 +1645,7 @@ nv50_sor_disable(struct drm_encoder *encoder,
>  }
>
>  static void
> -nv50_sor_enable(struct drm_encoder *encoder,
> -   struct drm_atomic_state *state)
> +nv50_sor_enable(struct drm_encoder *encoder, struct drm_atomic_state *state)
>  {
> struct nouveau_encoder *nv_encoder = nouveau_encoder(encoder);
> struct nouveau_crtc *nv_crtc = nouveau_crtc(encoder->crtc);
> @@ -1873,7 +1872,7 @@ nv50_pior_atomic_check(struct drm_encoder *encoder,
>  }
>
>  static void
> -nv50_pior_disable(struct drm_encoder *encoder)
> +nv50_pior_disable(struct drm_encoder *encoder, struct drm_atomic_state 
> *state)
>  {
> struct nouveau_encoder *nv_encoder = nouveau_encoder(encoder);
> struct nv50_core *core = nv50_disp(encoder->dev)->core;
> @@ -1885,7 +1884,7 @@ nv50_pior_disable(struct drm_encoder *encoder)
>  }
>
>  static void
> -nv50_pior_enable(struct drm_encoder *encoder)
> +nv50_pior_enable(struct drm_encoder *encoder, struct drm_atomic_state *state)
>  {
> struct nouveau_encoder *nv_encoder = nouveau_encoder(encoder);
> struct nouveau_crtc *nv_crtc = nouveau_crtc(encoder->crtc);
> @@ -1921,14 +1920,14 @@ nv50_pior_enable(struct drm_encoder *encoder)
> }
>
> core->func->pior->ctrl(core, nv_encoder->or, ctrl, asyh);
> -   nv_encoder->crtc = encoder->crtc;
> +   nv_encoder->crtc = &nv_crtc->base;
>  }
>
>  static const struct drm_encoder_helper_funcs
>  nv50_pior_help = {
> .atomic_check = nv50_pior_atomic_check,
> -   .enable = nv50_pior_enable,
> -   .disable = nv50_pior_disable,
> +   .atomic_enable = nv50_pior_enable,
> +   .atomic_disable = nv50_pior_disable,
>  };
>
>  static void
> --
> 2.28.0
>
> 

[PATCH] rtc: Fix memleak in sun6i_rtc_clk_init

2020-11-13 Thread Youling Tang
When rtc->base or rtc->int_osc or rtc->losc or rtc->ext_losc is NULL,
we should free clk_data and rtc before the function returns to prevent
a memleak.

Signed-off-by: Youling Tang 
---
 drivers/rtc/rtc-sun6i.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/rtc/rtc-sun6i.c b/drivers/rtc/rtc-sun6i.c
index e2b8b15..84ff1e6 100644
--- a/drivers/rtc/rtc-sun6i.c
+++ b/drivers/rtc/rtc-sun6i.c
@@ -272,7 +272,7 @@ static void __init sun6i_rtc_clk_init(struct device_node 
*node,
3);
if (IS_ERR(rtc->int_osc)) {
pr_crit("Couldn't register the internal oscillator\n");
-   return;
+   goto err;
}
 
parents[0] = clk_hw_get_name(rtc->int_osc);
@@ -290,7 +290,7 @@ static void __init sun6i_rtc_clk_init(struct device_node 
*node,
	rtc->losc = clk_register(NULL, &rtc->hw);
if (IS_ERR(rtc->losc)) {
pr_crit("Couldn't register the LOSC clock\n");
-   return;
+   goto err;
}
 
of_property_read_string_index(node, "clock-output-names", 1,
@@ -301,7 +301,7 @@ static void __init sun6i_rtc_clk_init(struct device_node 
*node,
  &rtc->lock);
if (IS_ERR(rtc->ext_losc)) {
pr_crit("Couldn't register the LOSC external gate\n");
-   return;
+   goto err;
}
 
clk_data->num = 2;
@@ -316,6 +316,7 @@ static void __init sun6i_rtc_clk_init(struct device_node 
*node,
 
 err:
kfree(clk_data);
+   kfree(rtc);
 }
 
 static const struct sun6i_rtc_clk_data sun6i_a31_rtc_data = {
-- 
2.1.0



Re: [PATCH 1/3] arm64: dts: ti: k3-j7200-main: Add gpio nodes in main domain

2020-11-13 Thread Grygorii Strashko

Hi

On 13/11/2020 22:55, Nishanth Menon wrote:

On 00:39-20201114, Sekhar Nori wrote:


I was using the latest schema from master. But I changed to 2020.08.1
also, and still don't see the warning.

$ dt-doc-validate --version
2020.12.dev1+gab5a73fcef26

I don't have a system-wide dtc installed. The one in the kernel tree is updated.

$ scripts/dtc/dtc --version
Version: DTC 1.6.0-gcbca977e

Looking at your logs, it looks like you have more patches than just this
applied. I wonder if that's making a difference. Can you check with just
these patches applied to linux-next or share your tree which includes
other patches?

In your logs, you have such error for other interrupt controller nodes
as well. For example:

  arch/arm64/boot/dts/ti/k3-j7200-main.dtsi:
/bus@10/bus@3000/interrupt-controller1: Missing #address-cells
in interrupt provider

Which I don't see in my logs. My guess is some other patch(es) in your
patch stack either uncovers this warning or causes it.


Oh boy! I sent you and myself on a wild goose chase! Really sorry about
messing up the bug report.

It is not dtbs_check, it is building dtbs with W=2 that generates this
warning. dtc 1.6.0 is sufficient to reproduce this behavior.

Using v5.10-rc1 as baseline (the same happens with next-20201113 as
well).

v5.10-rc1: https://pastebin.ubuntu.com/p/Pn9HDqRjQ4/ (recording:
 https://asciinema.org/a/55YVpql9Bq8rh8fePTxI2xObO)

v5.10-rc1 + 1st patch in the series (since we are testing):
https://pastebin.ubuntu.com/p/QWQRMSv565/ (recording:
https://asciinema.org/a/ZSKZkOY13l4lmZ2xWH34jMlM1)

Diff: https://pastebin.ubuntu.com/p/239sYYT2QY/



This warning comes from scripts/dtc/checks.c
and was introduced by commit 3eb619b2f7d8 ("scripts/dtc: Update to upstream version 
v1.6.0-11-g9d7888cbf19c").

In my opinion it's a false warning, as there is no requirement to have
#address-cells in an interrupt provider node.
By the way, the above commit description says: "The interrupt_provider check is noisy,
so turn it off for now."

--
Best regards,
grygorii


Re: [PATCH 1/6] seq_file: add seq_read_iter

2020-11-13 Thread Nathan Chancellor
On Sat, Nov 14, 2020 at 03:54:53AM +, Al Viro wrote:
> On Fri, Nov 13, 2020 at 08:01:24PM -0700, Nathan Chancellor wrote:
> > Sure thing, it does trigger.
> > 
> > [0.235058] [ cut here ]
> > [0.235062] WARNING: CPU: 15 PID: 237 at fs/seq_file.c:176 
> > seq_read_iter+0x3b3/0x3f0
> > [0.235064] CPU: 15 PID: 237 Comm: localhost Not tainted 
> > 5.10.0-rc2-microsoft-cbl-2-g6a9f696d1627-dirty #15
> > [0.235065] RIP: 0010:seq_read_iter+0x3b3/0x3f0
> > [0.235066] Code: ba 01 00 00 00 e8 6d d2 fc ff 4c 89 e7 48 89 ee 48 8b 
> > 54 24 10 e8 ad 8b 45 00 49 01 c5 48 29 43 18 48 89 43 10 e9 61 fe ff ff 
> > <0f> 0b e9 6f fc ff ff 0f 0b 45 31 ed e9 0d fd ff ff 48 c7 43 18 00
> > [0.235067] RSP: 0018:9c774063bd08 EFLAGS: 00010246
> > [0.235068] RAX: 91a77ac01f00 RBX: 91a50133c348 RCX: 
> > 0001
> > [0.235069] RDX: 9c774063bdb8 RSI: 9c774063bd60 RDI: 
> > 9c774063bd88
> > [0.235069] RBP:  R08:  R09: 
> > 91a50058b768
> > [0.235070] R10: 91a7f79f R11: bc2c2030 R12: 
> > 9c774063bd88
> > [0.235070] R13: 9c774063bd60 R14: 9c774063be48 R15: 
> > 91a77af58900
> > [0.235072] FS:  0029c800() GS:91a7f7bc() 
> > knlGS:
> > [0.235073] CS:  0010 DS:  ES:  CR0: 80050033
> > [0.235073] CR2: 7ab6c1fabad0 CR3: 00037a004000 CR4: 
> > 00350ea0
> > [0.235074] Call Trace:
> > [0.235077]  seq_read+0x127/0x150
> > [0.235078]  proc_reg_read+0x42/0xa0
> > [0.235080]  do_iter_read+0x14c/0x1e0
> > [0.235081]  do_readv+0x18d/0x240
> > [0.235083]  do_syscall_64+0x33/0x70
> > [0.235085]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> *blink*
> 
>   Lovely...  For one thing, it did *not* go through
> proc_reg_read_iter().  For another, it has hit proc_reg_read() with
> zero length, which must've been an iovec with zero ->iov_len in
> readv(2) arguments.  I wonder if we should use that kind of
> pathology (readv() with zero-length segment in the middle of
> iovec array) for regression tests...
> 
>   OK...  First of all, since that kind of crap can happen,
> let's do this (incremental to be folded); then (and that's
> a separate patch) we ought to switch the proc_ops with ->proc_read
> equal to seq_read to ->proc_read_iter = seq_read_iter, so that
> those guys would not mess with seq_read() wrapper at all.
> 
>   Finally, is there any point having do_loop_readv_writev()
> call any methods for zero-length segments?
> 
>   In any case, the following should be folded into
> "fix return values of seq_read_iter()"; could you check if that
> fixes the problem you are seeing?
> 
> diff --git a/fs/seq_file.c b/fs/seq_file.c
> index 07b33c1f34a9..e66d6b8bae23 100644
> --- a/fs/seq_file.c
> +++ b/fs/seq_file.c
> @@ -211,9 +211,9 @@ ssize_t seq_read_iter(struct kiocb *iocb, struct iov_iter 
> *iter)
>   m->count -= n;
>   m->from += n;
>   copied += n;
> - if (!iov_iter_count(iter) || m->count)
> - goto Done;
>   }
> + if (m->count || !iov_iter_count(iter))
> + goto Done;
>   /* we need at least one record in buffer */
>   m->from = 0;
>   p = m->op->start(m, &m->index);

Unfortunately that patch does not solve my issue. Is there any other
debugging I should add?

Cheers,
Nathan


[PATCH 2/2] arm: Fix kfree NULL pointer in omap2xxx_clkt_vps_init

2020-11-13 Thread Youling Tang
The pointer is NULL when kzalloc fails to allocate memory, so the cleanup
path only ends up calling kfree() on a NULL pointer; return directly instead
and drop the now-unneeded label.

Signed-off-by: Youling Tang 
---
 arch/arm/mach-omap2/clkt2xxx_virt_prcm_set.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/arch/arm/mach-omap2/clkt2xxx_virt_prcm_set.c 
b/arch/arm/mach-omap2/clkt2xxx_virt_prcm_set.c
index 70892b3..edf046b 100644
--- a/arch/arm/mach-omap2/clkt2xxx_virt_prcm_set.c
+++ b/arch/arm/mach-omap2/clkt2xxx_virt_prcm_set.c
@@ -235,7 +235,7 @@ void omap2xxx_clkt_vps_init(void)
 
hw = kzalloc(sizeof(*hw), GFP_KERNEL);
if (!hw)
-   goto cleanup;
+   return;
init.name = "virt_prcm_set";
	init.name = "virt_prcm_set";
	init.ops = &virt_prcm_set_ops;
	init.parent_names = &parent_name;
@@ -251,8 +251,5 @@ void omap2xxx_clkt_vps_init(void)
}
 
clkdev_create(clk, "cpufreq_ck", NULL);
-   return;
-cleanup:
-   kfree(hw);
 }
 #endif
-- 
2.1.0



[PATCH 1/2] arm: Fix memleak in omap2xxx_clkt_vps_init

2020-11-13 Thread Youling Tang
If clk_register() fails, we should free hw before the function returns to
prevent a memleak.

Signed-off-by: Youling Tang 
---
 arch/arm/mach-omap2/clkt2xxx_virt_prcm_set.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/arm/mach-omap2/clkt2xxx_virt_prcm_set.c 
b/arch/arm/mach-omap2/clkt2xxx_virt_prcm_set.c
index 2a3e722..70892b3 100644
--- a/arch/arm/mach-omap2/clkt2xxx_virt_prcm_set.c
+++ b/arch/arm/mach-omap2/clkt2xxx_virt_prcm_set.c
@@ -244,6 +244,12 @@ void omap2xxx_clkt_vps_init(void)
	hw->hw.init = &init;
 
	clk = clk_register(NULL, &hw->hw);
+   if (IS_ERR(clk)) {
+   printk(KERN_ERR "Failed to register clock\n");
+   kfree(hw);
+   return;
+   }
+
clkdev_create(clk, "cpufreq_ck", NULL);
return;
 cleanup:
-- 
2.1.0



[PATCH] dmaengine: pl330: _prep_dma_memcpy: Fix wrong burst size

2020-11-13 Thread Sugar Zhang
Actually, the burst size is equal to '1 << desc->rqcfg.brst_size', so the
check should use the burst size in bytes, not the raw desc->rqcfg.brst_size
value.
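
A quick worked example (the numbers are illustrative): with brst_size = 3 the
burst is 1 << 3 = 8 bytes. On a 64-bit data bus the old check computes
3 * 8 = 24 < 64 and wrongly forces brst_len = 1, while the corrected check
computes 8 * 8 = 64, which is not smaller than the bus width, so the full
burst length is kept.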

dma memcpy performance on Rockchip RV1126
@ 1512MHz A7, 1056MHz LPDDR3, 200MHz DMA:

dmatest:

/# echo dma0chan0 > /sys/module/dmatest/parameters/channel
/# echo 4194304 > /sys/module/dmatest/parameters/test_buf_size
/# echo 8 > /sys/module/dmatest/parameters/iterations
/# echo y > /sys/module/dmatest/parameters/norandom
/# echo y > /sys/module/dmatest/parameters/verbose
/# echo 1 > /sys/module/dmatest/parameters/run

dmatest: dma0chan0-copy0: result #1: 'test passed' with src_off=0x0 dst_off=0x0 len=0x40
dmatest: dma0chan0-copy0: result #2: 'test passed' with src_off=0x0 dst_off=0x0 len=0x40
dmatest: dma0chan0-copy0: result #3: 'test passed' with src_off=0x0 dst_off=0x0 len=0x40
dmatest: dma0chan0-copy0: result #4: 'test passed' with src_off=0x0 dst_off=0x0 len=0x40
dmatest: dma0chan0-copy0: result #5: 'test passed' with src_off=0x0 dst_off=0x0 len=0x40
dmatest: dma0chan0-copy0: result #6: 'test passed' with src_off=0x0 dst_off=0x0 len=0x40
dmatest: dma0chan0-copy0: result #7: 'test passed' with src_off=0x0 dst_off=0x0 len=0x40
dmatest: dma0chan0-copy0: result #8: 'test passed' with src_off=0x0 dst_off=0x0 len=0x40

Before:

  dmatest: dma0chan0-copy0: summary 8 tests, 0 failures 48 iops 200338 KB/s (0)

After this patch:

  dmatest: dma0chan0-copy0: summary 8 tests, 0 failures 179 iops 734873 KB/s (0)

After this patch and increase dma clk to 400MHz:

  dmatest: dma0chan0-copy0: summary 8 tests, 0 failures 259 iops 1062929 KB/s (0)

Signed-off-by: Sugar Zhang 
---

 drivers/dma/pl330.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/dma/pl330.c b/drivers/dma/pl330.c
index e9f0101..0f5c193 100644
--- a/drivers/dma/pl330.c
+++ b/drivers/dma/pl330.c
@@ -2799,7 +2799,7 @@ pl330_prep_dma_memcpy(struct dma_chan *chan, dma_addr_t 
dst,
 * If burst size is smaller than bus width then make sure we only
 * transfer one at a time to avoid a burst stradling an MFIFO entry.
 */
-   if (desc->rqcfg.brst_size * 8 < pl330->pcfg.data_bus_width)
+   if (burst * 8 < pl330->pcfg.data_bus_width)
desc->rqcfg.brst_len = 1;
 
desc->bytes_requested = len;
-- 
2.7.4





[PATCH net-next 2/3] net: ethernet: ti: cpsw_new: enable broadcast/multicast rate limit support

2020-11-13 Thread Grygorii Strashko
This patch enables support for ingress broadcast (BC)/multicast (MC) rate
limiting in the TI CPSW switchdev driver (the corresponding ALE support was
added in the previous patch) by implementing HW offload for a simple
tc-flower policer with matches on dst_mac:
 - ff:ff:ff:ff:ff:ff has to be used for BC rate limiting
 - 01:00:00:00:00:00 fixed value has to be used for MC rate limiting

The tc policer defines the rate limit in bits per second, but the ALE limits
in packets per second, so the bits/sec rate limit is converted to a number of
packets per second assuming the minimum Ethernet packet size ETH_ZLEN = 60
bytes.

Examples:
- BC rate limit to 1000pps:
  tc qdisc add dev eth0 clsact
  tc filter add dev eth0 ingress flower skip_sw dst_mac ff:ff:ff:ff:ff:ff \
  action police rate 480kbit burst 64k

  rate 480kbit - 1000pps * 60 bytes * 8, burst - not used.

- MC rate limit to 2pps:
  tc qdisc add dev eth0 clsact
  tc filter add dev eth0 ingress flower skip_sw dst_mac 01:00:00:00:00:00 \
  action police rate 9600kbit burst 64k

  rate 9600kbit - 2pps * 60 bytes * 8, burst - not used.

Signed-off-by: Grygorii Strashko 
---
 drivers/net/ethernet/ti/cpsw_new.c  |   4 +-
 drivers/net/ethernet/ti/cpsw_priv.c | 171 
 drivers/net/ethernet/ti/cpsw_priv.h |   8 ++
 3 files changed, 182 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ti/cpsw_new.c 
b/drivers/net/ethernet/ti/cpsw_new.c
index 2f5e0ad23ad7..6fad5a5461f6 100644
--- a/drivers/net/ethernet/ti/cpsw_new.c
+++ b/drivers/net/ethernet/ti/cpsw_new.c
@@ -505,6 +505,8 @@ static void cpsw_restore(struct cpsw_priv *priv)
 
/* restore CBS offload */
	cpsw_cbs_resume(&cpsw->slaves[priv->emac_port - 1], priv);
+
+   cpsw_qos_clsflower_resume(priv);
 }
 
 static void cpsw_init_stp_ale_entry(struct cpsw_common *cpsw)
@@ -1418,7 +1420,7 @@ static int cpsw_create_ports(struct cpsw_common *cpsw)
cpsw->slaves[i].ndev = ndev;
 
ndev->features |= NETIF_F_HW_VLAN_CTAG_FILTER |
- NETIF_F_HW_VLAN_CTAG_RX | NETIF_F_NETNS_LOCAL;
+ NETIF_F_HW_VLAN_CTAG_RX | NETIF_F_NETNS_LOCAL 
| NETIF_F_HW_TC;
 
		ndev->netdev_ops = &cpsw_netdev_ops;
		ndev->ethtool_ops = &cpsw_ethtool_ops;
diff --git a/drivers/net/ethernet/ti/cpsw_priv.c 
b/drivers/net/ethernet/ti/cpsw_priv.c
index 31c5e36ff706..0908d476b854 100644
--- a/drivers/net/ethernet/ti/cpsw_priv.c
+++ b/drivers/net/ethernet/ti/cpsw_priv.c
@@ -502,6 +502,7 @@ int cpsw_init_common(struct cpsw_common *cpsw, void __iomem 
*ss_regs,
ale_params.ale_ageout   = ale_ageout;
ale_params.ale_ports= CPSW_ALE_PORTS_NUM;
ale_params.dev_id   = "cpsw";
+   ale_params.bus_freq = cpsw->bus_freq_mhz * 1000000;
 
	cpsw->ale = cpsw_ale_create(&ale_params);
if (IS_ERR(cpsw->ale)) {
@@ -1046,6 +1047,8 @@ static int cpsw_set_mqprio(struct net_device *ndev, void 
*type_data)
return 0;
 }
 
+static int cpsw_qos_setup_tc_block(struct net_device *ndev, struct 
flow_block_offload *f);
+
 int cpsw_ndo_setup_tc(struct net_device *ndev, enum tc_setup_type type,
  void *type_data)
 {
@@ -1056,6 +1059,9 @@ int cpsw_ndo_setup_tc(struct net_device *ndev, enum 
tc_setup_type type,
case TC_SETUP_QDISC_MQPRIO:
return cpsw_set_mqprio(ndev, type_data);
 
+   case TC_SETUP_BLOCK:
+   return cpsw_qos_setup_tc_block(ndev, type_data);
+
default:
return -EOPNOTSUPP;
}
@@ -1383,3 +1389,168 @@ int cpsw_run_xdp(struct cpsw_priv *priv, int ch, struct 
xdp_buff *xdp,
page_pool_recycle_direct(cpsw->page_pool[ch], page);
return ret;
 }
+
+static int cpsw_qos_clsflower_add_policer(struct cpsw_priv *priv,
+ struct netlink_ext_ack *extack,
+ struct flow_cls_offload *cls,
+ u64 rate_bytes_ps)
+{
+   struct flow_rule *rule = flow_cls_offload_flow_rule(cls);
+   struct flow_dissector *dissector = rule->match.dissector;
+   u8 null_mac[] = {0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
+   u8 bc_mac[] = {0xff, 0xff, 0xff, 0xff, 0xff, 0xff};
+   u8 mc_mac[] = {0x01, 0x00, 0x00, 0x00, 0x00, 0x00};
+   struct flow_match_eth_addrs match;
+   u32 pps, port_id;
+   int ret;
+
+   if (dissector->used_keys &
+   ~(BIT(FLOW_DISSECTOR_KEY_BASIC) |
+ BIT(FLOW_DISSECTOR_KEY_CONTROL) |
+ BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS))) {
+   NL_SET_ERR_MSG_MOD(extack,
+  "Unsupported keys used");
+   return -EOPNOTSUPP;
+   }
+
+   if (!flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
+   NL_SET_ERR_MSG_MOD(extack, "Not matching on eth address");
+   return -EOPNOTSUPP;
+   

[PATCH net-next 0/3] net: ethernet: ti: cpsw: enable broadcast/multicast rate limit support

2020-11-13 Thread Grygorii Strashko
Hi

This series first adds support for the ALE feature to rate limit the number
of ingress broadcast (BC)/multicast (MC) packets per second, whose main
purpose is BC/MC storm prevention.

It then enables the corresponding support for ingress broadcast (BC)/multicast (MC)
rate limiting in the TI CPSW switchdev and AM65x/J721E CPSW_NUSS drivers by
implementing HW offload for a simple tc-flower policer with matches on dst_mac:
 - ff:ff:ff:ff:ff:ff has to be used for BC rate limiting
 - 01:00:00:00:00:00 fixed value has to be used for MC rate limiting

The tc policer defines the rate limit in bits per second, but the ALE limits
in packets per second, so the bits/sec rate limit is converted to a number of
packets per second assuming the minimum Ethernet packet size ETH_ZLEN = 60
bytes.
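
In driver code the conversion boils down to a single line (a sketch;
rate_bytes_ps is the byte rate handed over by the flow_action police entry):

	u32 pps = div_u64(rate_bytes_ps, ETH_ZLEN);

e.g. 480 kbit/s = 60000 bytes/s, and 60000 / 60 = 1000 packets/s.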

The solution is inspired by a patch from Vladimir Oltean [1].

Examples:
- BC rate limit to 1000pps:
  tc qdisc add dev eth0 clsact
  tc filter add dev eth0 ingress flower skip_sw dst_mac ff:ff:ff:ff:ff:ff \
  action police rate 480kbit burst 64k

  rate 480kbit - 1000pps * 60 bytes * 8, burst - not used.

- MC rate limit to 2pps:
  tc qdisc add dev eth0 clsact
  tc filter add dev eth0 ingress flower skip_sw dst_mac 01:00:00:00:00:00 \
  action police rate 9600kbit burst 64k

  rate 9600kbit - 2pps * 60 bytes * 8, burst - not used.

- show: tc filter show dev eth0 ingress
filter protocol all pref 49151 flower chain 0
filter protocol all pref 49151 flower chain 0 handle 0x1
  dst_mac ff:ff:ff:ff:ff:ff
  skip_sw
  in_hw in_hw_count 1
action order 1:  police 0x2 rate 480Kbit burst 64Kb mtu 2Kb action 
reclassify overhead 0b
ref 1 bind 1

filter protocol all pref 49152 flower chain 0
filter protocol all pref 49152 flower chain 0 handle 0x1
  dst_mac 01:00:00:00:00:00
  skip_sw
  in_hw in_hw_count 1
action order 1:  police 0x1 rate 9600Kbit burst 64Kb mtu 2Kb action 
reclassify overhead 0b
ref 1 bind

Testing MC with iperf:
- client
  -- setup tc-flower as per above
  route add -host 239.255.1.3 eth0
  iperf -s -B 239.255.1.3 -u -f m &
  cat /sys/class/net/eth0/statistics/rx_packets

- server
  route add -host 239.255.1.3 eth0
  iperf -c 239.255.1.3 -u -f m -i 5 -t 30 -l1472  -b12176 -t1 //~1pps

[1] https://lore.kernel.org/patchwork/patch/1217254/

Grygorii Strashko (3):
  drivers: net: cpsw: ale: add broadcast/multicast rate limit support
  net: ethernet: ti: cpsw_new: enable broadcast/multicast rate limit
support
  net: ethernet: ti: am65-cpsw: enable broadcast/multicast rate limit
support

 drivers/net/ethernet/ti/am65-cpsw-qos.c | 148 
 drivers/net/ethernet/ti/am65-cpsw-qos.h |   8 ++
 drivers/net/ethernet/ti/cpsw_ale.c  |  66 +
 drivers/net/ethernet/ti/cpsw_ale.h  |   2 +
 drivers/net/ethernet/ti/cpsw_new.c  |   4 +-
 drivers/net/ethernet/ti/cpsw_priv.c | 171 
 drivers/net/ethernet/ti/cpsw_priv.h |   8 ++
 7 files changed, 406 insertions(+), 1 deletion(-)

-- 
2.17.1



Re: [PATCH 1/6] seq_file: add seq_read_iter

2020-11-13 Thread Al Viro
On Fri, Nov 13, 2020 at 08:01:24PM -0700, Nathan Chancellor wrote:
> Sure thing, it does trigger.
> 
> [0.235058] [ cut here ]
> [0.235062] WARNING: CPU: 15 PID: 237 at fs/seq_file.c:176 
> seq_read_iter+0x3b3/0x3f0
> [0.235064] CPU: 15 PID: 237 Comm: localhost Not tainted 
> 5.10.0-rc2-microsoft-cbl-2-g6a9f696d1627-dirty #15
> [0.235065] RIP: 0010:seq_read_iter+0x3b3/0x3f0
> [0.235066] Code: ba 01 00 00 00 e8 6d d2 fc ff 4c 89 e7 48 89 ee 48 8b 54 
> 24 10 e8 ad 8b 45 00 49 01 c5 48 29 43 18 48 89 43 10 e9 61 fe ff ff <0f> 0b 
> e9 6f fc ff ff 0f 0b 45 31 ed e9 0d fd ff ff 48 c7 43 18 00
> [0.235067] RSP: 0018:9c774063bd08 EFLAGS: 00010246
> [0.235068] RAX: 91a77ac01f00 RBX: 91a50133c348 RCX: 
> 0001
> [0.235069] RDX: 9c774063bdb8 RSI: 9c774063bd60 RDI: 
> 9c774063bd88
> [0.235069] RBP:  R08:  R09: 
> 91a50058b768
> [0.235070] R10: 91a7f79f R11: bc2c2030 R12: 
> 9c774063bd88
> [0.235070] R13: 9c774063bd60 R14: 9c774063be48 R15: 
> 91a77af58900
> [0.235072] FS:  0029c800() GS:91a7f7bc() 
> knlGS:
> [0.235073] CS:  0010 DS:  ES:  CR0: 80050033
> [0.235073] CR2: 7ab6c1fabad0 CR3: 00037a004000 CR4: 
> 00350ea0
> [0.235074] Call Trace:
> [0.235077]  seq_read+0x127/0x150
> [0.235078]  proc_reg_read+0x42/0xa0
> [0.235080]  do_iter_read+0x14c/0x1e0
> [0.235081]  do_readv+0x18d/0x240
> [0.235083]  do_syscall_64+0x33/0x70
> [0.235085]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

*blink*

Lovely...  For one thing, it did *not* go through
proc_reg_read_iter().  For another, it has hit proc_reg_read() with
zero length, which must've been an iovec with zero ->iov_len in
readv(2) arguments.  I wonder if we should use that kind of
pathology (readv() with zero-length segment in the middle of
iovec array) for regression tests...
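
A minimal userspace reproducer along those lines might look like this
(a sketch only; the file and buffer sizes are arbitrary):

	#include <stdio.h>
	#include <fcntl.h>
	#include <sys/uio.h>

	int main(void)
	{
		char a[64], b[64];
		/* zero-length segment in the middle of the iovec array */
		struct iovec iov[3] = {
			{ .iov_base = a, .iov_len = sizeof(a) },
			{ .iov_base = b, .iov_len = 0 },
			{ .iov_base = b, .iov_len = sizeof(b) },
		};
		int fd = open("/proc/self/stat", O_RDONLY);
		ssize_t n = readv(fd, iov, 3);

		printf("readv() returned %zd\n", n);
		return 0;
	}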

OK...  First of all, since that kind of crap can happen,
let's do this (incremental to be folded); then (and that's
a separate patch) we ought to switch the proc_ops with ->proc_read
equal to seq_read to ->proc_read_iter = seq_read_iter, so that
those guys would not mess with seq_read() wrapper at all.

Finally, is there any point having do_loop_readv_writev()
call any methods for zero-length segments?

In any case, the following should be folded into
"fix return values of seq_read_iter()"; could you check if that
fixes the problem you are seeing?

diff --git a/fs/seq_file.c b/fs/seq_file.c
index 07b33c1f34a9..e66d6b8bae23 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -211,9 +211,9 @@ ssize_t seq_read_iter(struct kiocb *iocb, struct iov_iter 
*iter)
m->count -= n;
m->from += n;
copied += n;
-   if (!iov_iter_count(iter) || m->count)
-   goto Done;
}
+   if (m->count || !iov_iter_count(iter))
+   goto Done;
/* we need at least one record in buffer */
m->from = 0;
	p = m->op->start(m, &m->index);


[PATCH net-next 3/3] net: ethernet: ti: am65-cpsw: enable broadcast/multicast rate limit support

2020-11-13 Thread Grygorii Strashko
This patch enables support for ingress broadcast (BC)/multicast (MC) rate
limiting in the TI AM65x CPSW driver (the corresponding ALE support was added
in a previous patch) by implementing HW offload for a simple tc-flower
policer with matches on dst_mac:
 - ff:ff:ff:ff:ff:ff has to be used for BC rate limiting
 - 01:00:00:00:00:00 fixed value has to be used for MC rate limiting

The tc policer defines the rate limit in bits per second, but the ALE limits
in packets per second, so the bits/sec rate limit is converted to a number of
packets per second assuming the minimum Ethernet packet size ETH_ZLEN = 60
bytes.

Examples:
- BC rate limit to 1000pps:
  tc qdisc add dev eth0 clsact
  tc filter add dev eth0 ingress flower skip_sw dst_mac ff:ff:ff:ff:ff:ff \
  action police rate 480kbit burst 64k

  rate 480kbit - 1000pps * 60 bytes * 8, burst - not used.

- MC rate limit to 2pps:
  tc qdisc add dev eth0 clsact
  tc filter add dev eth0 ingress flower skip_sw dst_mac 01:00:00:00:00:00 \
  action police rate 9600kbit burst 64k

  rate 9600kbit - 2pps * 60 bytes * 8, burst - not used.

Signed-off-by: Grygorii Strashko 
---
 drivers/net/ethernet/ti/am65-cpsw-qos.c | 148 
 drivers/net/ethernet/ti/am65-cpsw-qos.h |   8 ++
 2 files changed, 156 insertions(+)

diff --git a/drivers/net/ethernet/ti/am65-cpsw-qos.c 
b/drivers/net/ethernet/ti/am65-cpsw-qos.c
index 3bdd4dbcd2ff..a06207233cd5 100644
--- a/drivers/net/ethernet/ti/am65-cpsw-qos.c
+++ b/drivers/net/ethernet/ti/am65-cpsw-qos.c
@@ -8,10 +8,12 @@
 
 #include 
 #include 
+#include 
 
 #include "am65-cpsw-nuss.h"
 #include "am65-cpsw-qos.h"
 #include "am65-cpts.h"
+#include "cpsw_ale.h"
 
 #define AM65_CPSW_REG_CTL  0x004
 #define AM65_CPSW_PN_REG_CTL   0x004
@@ -588,12 +590,158 @@ static int am65_cpsw_setup_taprio(struct net_device 
*ndev, void *type_data)
return am65_cpsw_set_taprio(ndev, type_data);
 }
 
+static int am65_cpsw_qos_clsflower_add_policer(struct am65_cpsw_port *port,
+  struct netlink_ext_ack *extack,
+  struct flow_cls_offload *cls,
+  u64 rate_bytes_ps)
+{
+   struct flow_rule *rule = flow_cls_offload_flow_rule(cls);
+   struct flow_dissector *dissector = rule->match.dissector;
+   u8 null_mac[] = {0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
+   u8 bc_mac[] = {0xff, 0xff, 0xff, 0xff, 0xff, 0xff};
+   u8 mc_mac[] = {0x01, 0x00, 0x00, 0x00, 0x00, 0x00};
	struct am65_cpsw_qos *qos = &port->qos;
+   struct flow_match_eth_addrs match;
+   u32 pps;
+   int ret;
+
+   if (dissector->used_keys &
+   ~(BIT(FLOW_DISSECTOR_KEY_BASIC) |
+ BIT(FLOW_DISSECTOR_KEY_CONTROL) |
+ BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS))) {
+   NL_SET_ERR_MSG_MOD(extack,
+  "Unsupported keys used");
+   return -EOPNOTSUPP;
+   }
+
+   if (!flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
+   NL_SET_ERR_MSG_MOD(extack, "Not matching on eth address");
+   return -EOPNOTSUPP;
+   }
+
	flow_rule_match_eth_addrs(rule, &match);
+
+   if (!ether_addr_equal_masked(match.key->src, null_mac,
+match.mask->src)) {
+   NL_SET_ERR_MSG_MOD(extack,
+  "Matching on source MAC not supported");
+   return -EOPNOTSUPP;
+   }
+
+   /* Calculate number of packets per second for given bps
+* assuming min ethernet packet size
+*/
+   pps = div_u64(rate_bytes_ps, ETH_ZLEN);
+
+   if (ether_addr_equal(match.key->dst, bc_mac)) {
+   ret = cpsw_ale_rx_ratelimit_bc(port->common->ale, 
port->port_id, pps);
+   if (ret)
+   return ret;
+
+   qos->ale_bc_ratelimit.cookie = cls->cookie;
+   qos->ale_bc_ratelimit.rate_packet_ps = pps;
+   }
+
+   if (ether_addr_equal(match.key->dst, mc_mac)) {
+   ret = cpsw_ale_rx_ratelimit_mc(port->common->ale, 
port->port_id, pps);
+   if (ret)
+   return ret;
+
+   qos->ale_mc_ratelimit.cookie = cls->cookie;
+   qos->ale_mc_ratelimit.rate_packet_ps = pps;
+   }
+
+   return 0;
+}
+
+static int am65_cpsw_qos_configure_clsflower(struct am65_cpsw_port *port,
+struct flow_cls_offload *cls)
+{
+   struct flow_rule *rule = flow_cls_offload_flow_rule(cls);
+   struct netlink_ext_ack *extack = cls->common.extack;
+   const struct flow_action_entry *act;
+   int i;
+
	flow_action_for_each(i, act, &rule->action) {
+   switch (act->id) {
+   case FLOW_ACTION_POLICE:
+   return am65_cpsw_qos_clsflower_add_policer(port, 
extack, cls,
+   

[PATCH net-next 1/3] drivers: net: cpsw: ale: add broadcast/multicast rate limit support

2020-11-13 Thread Grygorii Strashko
The CPSW ALE supports a feature to rate limit the number of ingress
broadcast (BC)/multicast (MC) packets per second, whose main purpose is
BC/MC storm prevention.

The ALE BC/MC packet rate limit configuration consist of two parts:
- global
  ALE_CONTROL.ENABLE_RATE_LIMIT bit 0 which enables rate limiting globally
  ALE_PRESCALE.PRESCALE specifies rate limiting interval
- per-port
  ALE_PORTCTLx.BCAST/MCAST_LIMIT specifies the number of BC/MC packets
  allowed per rate limiting interval.
  When port.BCAST/MCAST_LIMIT is 0, rate limiting is disabled for the port.

When BC/MC packet rate limiting is enabled, the number of allowed packets
per second is defined as:
  number_of_packets/sec = (Fclk / ALE_PRESCALE) * port.BCAST/MCAST_LIMIT

Since the ALE_PRESCALE configuration is common for all ports, a 1 ms
interval is selected and configured during ALE initialization, while
port.BCAST/MCAST_LIMIT is configured per port.
This allows achieving:
 - min number_of_packets = 1000 when port.BCAST/MCAST_LIMIT = 1
 - max number_of_packets = 1000 * 255 = 255000
   when port.BCAST/MCAST_LIMIT = 0xFF
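
For illustration (assuming a hypothetical 125 MHz ALE functional clock):
ALE_PRESCALE = 125000000 / 1000 = 125000 yields a 1 ms interval, so
port.BCAST_LIMIT = 100 then allows 1000 * 100 = 100000 BC packets/sec on
that port.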

The ALE_CONTROL.ENABLE_RATE_LIMIT bit can also be set once during ALE
initialization, since rate limiting is effectively enabled per port by
non-zero port.BCAST/MCAST_LIMIT values.

This patch implements the above logic in the ALE driver and adds the new ALE APIs
 cpsw_ale_rx_ratelimit_bc();
 cpsw_ale_rx_ratelimit_mc();
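
A hypothetical caller limiting ingress broadcast on port 1 to 5000 packets/sec
would then do something like:

	ret = cpsw_ale_rx_ratelimit_bc(cpsw->ale, 1, 5000);
	if (ret)
		dev_err(dev, "BC rate limit rejected\n");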

Signed-off-by: Grygorii Strashko 
---
 drivers/net/ethernet/ti/cpsw_ale.c | 66 ++
 drivers/net/ethernet/ti/cpsw_ale.h |  2 +
 2 files changed, 68 insertions(+)

diff --git a/drivers/net/ethernet/ti/cpsw_ale.c 
b/drivers/net/ethernet/ti/cpsw_ale.c
index cdc308a2aa3e..771e4d9f98ab 100644
--- a/drivers/net/ethernet/ti/cpsw_ale.c
+++ b/drivers/net/ethernet/ti/cpsw_ale.c
@@ -50,6 +50,8 @@
 /* ALE_AGING_TIMER */
 #define ALE_AGING_TIMER_MASK   GENMASK(23, 0)
 
+#define ALE_RATE_LIMIT_MIN_PPS 1000
+
 /**
  * struct ale_entry_fld - The ALE tbl entry field description
  * @start_bit: field start bit
@@ -1136,6 +1138,50 @@ int cpsw_ale_control_get(struct cpsw_ale *ale, int port, 
int control)
return tmp & BITMASK(info->bits);
 }
 
+int cpsw_ale_rx_ratelimit_mc(struct cpsw_ale *ale, int port, unsigned int 
ratelimit_pps)
+
+{
+   int val = ratelimit_pps / ALE_RATE_LIMIT_MIN_PPS;
+   u32 remainder = ratelimit_pps % ALE_RATE_LIMIT_MIN_PPS;
+
+   if (ratelimit_pps && !val) {
+   dev_err(ale->params.dev, "ALE MC port:%d ratelimit min value 
1000pps\n", port);
+   return -EINVAL;
+   }
+
+   if (remainder)
+   dev_info(ale->params.dev, "ALE port:%d MC ratelimit set to 
%dpps (requested %d)\n",
+port, ratelimit_pps - remainder, ratelimit_pps);
+
+   cpsw_ale_control_set(ale, port, ALE_PORT_MCAST_LIMIT, val);
+
+   dev_dbg(ale->params.dev, "ALE port:%d MC ratelimit set %d\n",
+   port, val * ALE_RATE_LIMIT_MIN_PPS);
+   return 0;
+}
+
+int cpsw_ale_rx_ratelimit_bc(struct cpsw_ale *ale, int port, unsigned int 
ratelimit_pps)
+
+{
+   int val = ratelimit_pps / ALE_RATE_LIMIT_MIN_PPS;
+   u32 remainder = ratelimit_pps % ALE_RATE_LIMIT_MIN_PPS;
+
+   if (ratelimit_pps && !val) {
+   dev_err(ale->params.dev, "ALE port:%d BC ratelimit min value 
1000pps\n", port);
+   return -EINVAL;
+   }
+
+   if (remainder)
+   dev_info(ale->params.dev, "ALE port:%d BC ratelimit set to 
%dpps (requested %d)\n",
+port, ratelimit_pps - remainder, ratelimit_pps);
+
+   cpsw_ale_control_set(ale, port, ALE_PORT_BCAST_LIMIT, val);
+
+   dev_dbg(ale->params.dev, "ALE port:%d BC ratelimit set %d\n",
+   port, val * ALE_RATE_LIMIT_MIN_PPS);
+   return 0;
+}
+
 static void cpsw_ale_timer(struct timer_list *t)
 {
struct cpsw_ale *ale = from_timer(ale, t, timer);
@@ -1199,6 +1245,26 @@ static void cpsw_ale_aging_stop(struct cpsw_ale *ale)
 
 void cpsw_ale_start(struct cpsw_ale *ale)
 {
+   unsigned long ale_prescale;
+
+   /* configure Broadcast and Multicast Rate Limit
+* number_of_packets = (Fclk / ALE_PRESCALE) * port.BCAST/MCAST_LIMIT
+* ALE_PRESCALE width is 19bit and min value 0x10
+* port.BCAST/MCAST_LIMIT is 8bit
+*
+* For multi port configuration support the ALE_PRESCALE is configured 
to 1ms interval,
+* which allows to configure port.BCAST/MCAST_LIMIT per port and 
achieve:
+* min number_of_packets = 1000 when port.BCAST/MCAST_LIMIT = 1
+* max number_of_packets = 1000 * 255 = 255000 when 
port.BCAST/MCAST_LIMIT = 0xFF
+*/
+   ale_prescale = ale->params.bus_freq / ALE_RATE_LIMIT_MIN_PPS;
+   writel((u32)ale_prescale, ale->params.ale_regs + ALE_PRESCALE);
+
+   /* Allow MC/BC rate limiting globally.
+* The actual Rate Limit cfg enabled per-port by port.BCAST/MCAST_LIMIT
+*/
+   cpsw_ale_control_set(ale, 0, ALE_RATE_LIMIT, 1);
+
cpsw_ale_control_set(ale, 

Re: [PATCH kernel v3] genirq/irqdomain: Add reference counting to IRQs

2020-11-13 Thread Alexey Kardashevskiy




On 14/11/2020 05:19, Cédric Le Goater wrote:

On 11/9/20 10:46 AM, Alexey Kardashevskiy wrote:

PCI devices share 4 legacy INTx interrupts from the same PCI host bridge.
Device drivers map/unmap hardware interrupts via irq_create_mapping()/
irq_dispose_mapping(). The problem is that these interrupts are
shared, and when performing hot unplug we need to unmap the interrupt
only when the last device is released.


The background context for such a need is that the POWER9 and POWER10
processors have a new XIVE interrupt controller which uses MMIO pages
for interrupt management. Each interrupt has a pair of pages which are
required to be unmapped in some environment, like PHB removal. And so,
all interrupts need to be unmmaped.



This reuses already existing irq_desc::kobj for this purpose.
The refcounter is naturally 1 when the descriptor is allocated already;
this adds kobject_get() in places where already existing mapped virq
is returned.

This reorganizes irq_dispose_mapping() to release the kobj and let
the release callback do the cleanup.

As kobject_put() is called directly now (not via RCU), it can also handle
the early boot case (irq_kobj_base==NULL) with the help of
the kobject::state_in_sysfs flag and without additional irq_sysfs_del().


Could this change be done in a following patch ?


No. Before this patch, we remove the kobject from sysfs (via kobject_del())
before calling kobject_put(), which we do via RCU. After the patch,
kobject_del() is called from the very last kobject_put(), and by the time
we get to the release handler the sysfs node is already removed, so we get
a message about the missing parent.




While at it, clean up the comment where irq_sysfs_del() was called.
Quick grep shows no sign of irq reference counting in drivers. Drivers
typically request mapping when probing and dispose it when removing;


Some ARM drivers call irq_alloc_descs() and irq_free_descs() directly.
Is that a problem?


Kind of. I'll need to go through these places and replace
irq_free_descs() with kobject_put() (maybe via some wrapper, or maybe
change irq_free_descs() to do kobject_put()).




platforms tend to dispose only if setup failed and the rest seem to
call one dispose per mapping. Except (at least) PPC/pseries
which needs https://lkml.org/lkml/2020/10/27/259

Cc: Cédric Le Goater 
Cc: Marc Zyngier 
Cc: Michael Ellerman 
Cc: Qian Cai 
Cc: Rob Herring 
Cc: Frederic Barrat 
Cc: Michal Suchánek 
Cc: Thomas Gleixner 
Signed-off-by: Alexey Kardashevskiy 


I used this patch and the ppc one doing the LSI removal:

   
http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20201027090655.14118-3-...@ozlabs.ru/

on different P10 and P9 systems, on a large system (>1K HW threads),
KVM guests and pSeries machines. Checked that PHB removal was OK.
  
Tested-by: Cédric Le Goater 


But the IRQ subsystem covers much more than these systems.


Indeed. But doing our own powerpc-only reference counting on top of 
irq_desc is just ugly.





Some comments below,


---

This is what it is fixing for powerpc:

There was a comment about whether hierarchical IRQ domains should
contribute to this reference counter and I need some help here as
I cannot see why.
It is reverse now - IRQs contribute to domain->mapcount and
irq_domain_associate/irq_domain_disassociate take necessary steps to
keep this counter in order. What might be missing is that if we have
cascade of IRQs (as in the IOAPIC example from
Documentation/core-api/irq/irq-domain.rst ), then a parent IRQ should
contribute to the children IRQs and it is up to
irq_domain_ops::alloc/free hooks, and they all seem to be eventually
calling irq_domain_alloc_irqs_xxx/irq_domain_free_irqs_xxx which seems
right.

Documentation/core-api/irq/irq-domain.rst also suggests there is a lot
to see in debugfs about IRQs but on my thinkpad there is nothing about
hierarchy.

So I'll ask again :)

What is the easiest way to get irq-hierarchical hardware?
I have a bunch of powerpc boxes (no good) but also a raspberry pi,
a bunch of 32/64bit orange pi's, an "armada" arm box,
thinkpads - is any of this good for the task?



---
Changes:
v3:
* removed very wrong kobject_get/_put from irq_domain_associate/
irq_domain_disassociate as these are called from kobject_release so
irq_descs were never actually released
* removed irq_sysfs_del as 1) we do not seem to need it with changed
counting  2) produces a "no parent" warning as it would be called from
kobject_release which removes sysfs nodes itself

v2:
* added more get/put, including irq_domain_associate/irq_domain_disassociate
---
  kernel/irq/irqdesc.c   | 55 ++
  kernel/irq/irqdomain.c | 37 
  2 files changed, 46 insertions(+), 46 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 1a7723604399..79c904ebfd5c 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -295,18 +295,6 @@ static void irq_sysfs_add(int irq, struct irq_desc 

Re: [PATCH 1/2] mm,thp,shmem: limit shmem THP alloc gfp_mask

2020-11-13 Thread Rik van Riel
On Thu, 2020-11-12 at 11:52 +0100, Michal Hocko wrote:
> On Thu 05-11-20 14:15:07, Rik van Riel wrote:
> > 
> > This patch applies the same configured limitation of THPs to shmem
> > hugepage allocations, to prevent that from happening.
> 
> I believe you should also explain why we want to control defrag by the
> global knob while the enable logic is per mount.

I added that to the changelog for the next version of
the patches.

> > This way a THP defrag setting of "never" or "defer+madvise" will result
> > in quick allocation failures without direct reclaim when no 2MB free
> > pages are available.
> > 
> > With this patch applied, THP allocations for tmpfs will be a little
> > more aggressive than today for files mmapped with MADV_HUGEPAGE,
> > and a little less aggressive for files that are not mmapped or
> > mapped without that flag.
> 
> This begs some numbers. A little is a rather bad unit of performance. I do
> agree that unifying those makes sense in general though.

The aggressiveness is in changes to the gfp_mask, eg by
adding __GFP_NORETRY. How that translates into THP
allocation success rates is entirely dependent on the
workload and on what else is in memory at the time.

I am not sure any numbers I could gather will be representative
for anything but the workloads I am testing.

However, I did find an issue in hugepage_vma_check
that prevents khugepaged from collapsing pages on
shmem filesystems mounted with huge=always or
huge=within_size when transparent_hugepage/enabled
is set to [madvise].

The next version of the series will have a third
patch, in order to fix that.

-- 
All Rights Reversed.




Re: [PATCH 2/2] mm,thp,shm: limit gfp mask to no more than specified

2020-11-13 Thread Rik van Riel
On Thu, 2020-11-12 at 12:22 +0100, Michal Hocko wrote:
> [Cc Chris for i915 and Andray]
> 
> On Thu 05-11-20 14:15:08, Rik van Riel wrote:
> > Matthew Wilcox pointed out that the i915 driver opportunistically
> > allocates tmpfs memory, but will happily reclaim some of its
> > pool if no memory is available.
> 
> It would be good to explicitly mention the requested gfp flags for those
> allocations. i915 uses __GFP_NORETRY | __GFP_NOWARN, or GFP_KERNEL. Is
> __shmem_rw really meant to not allocate from highmem/movable zones? Can
> it ever be backed by THPs?

You are right, I need to copy the zone flags __GFP_DMA through
__GFP_MOVABLE straight from the limiting gfp_mask into the gfp_mask
used for THP allocations, and not use the default THP zone flags if
the caller specifies something else.

I'll send out a new version that fixes that.
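
A minimal sketch of what I have in mind (names and flag choices are
illustrative, not the final patch):

	/* Clamp the THP gfp mask so it is never more aggressive than the
	 * mask the driver passed in, and copy the zone bits verbatim. */
	static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
	{
		gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM;
		gfp_t zoneflags = limit_gfp & GFP_ZONEMASK;
		gfp_t result = huge_gfp & ~(allowflags | GFP_ZONEMASK);

		/* allocate only from the originally specified zones */
		result |= zoneflags;
		/* keep reclaim-related flags only where both masks agree */
		result |= (huge_gfp & limit_gfp) & allowflags;
		return result;
	}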

> ttm might want __GFP_RETRY_MAYFAIL while shmem_read_mapping_page uses
> the mapping gfp mask which can be NOFS or something else. This is quite
> messy already and I suspect that they are more targeting regular order-0
> requests. E.g. have a look at cb5f1a52caf23.
> 
> I am worried that these games with gfp flags will lead to unmaintainable
> code later on. There is a clear disconnect between the core THP
> allocation strategy and what drivers are asking for, and those
> requirements might be really conflicting. Not to mention that flags
> might be different between regular and THP pages.

That is exactly why I want to make sure the THP allocations
are never more aggressive than the gfp flags the drivers
request, and the THP allocations may only ever be less
aggressive than the order 0 gfp_mask specified by the drivers.


-- 
All Rights Reversed.




Re: [PATCH kernel v3] genirq/irqdomain: Add reference counting to IRQs

2020-11-13 Thread Alexey Kardashevskiy




On 14/11/2020 05:34, Marc Zyngier wrote:

Hi Alexey,

On 2020-11-09 09:46, Alexey Kardashevskiy wrote:

PCI devices share 4 legacy INTx interrupts from the same PCI host bridge.
Device drivers map/unmap hardware interrupts via irq_create_mapping()/
irq_dispose_mapping(). The problem is that these interrupts are
shared, and when performing hot unplug we need to unmap the interrupt
only when the last device is released.

This reuses already existing irq_desc::kobj for this purpose.
The refcounter is naturally 1 when the descriptor is allocated already;
this adds kobject_get() in places where already existing mapped virq
is returned.

This reorganizes irq_dispose_mapping() to release the kobj and let
the release callback do the cleanup.

As kobject_put() is called directly now (not via RCU), it can also handle
the early boot case (irq_kobj_base==NULL) with the help of
the kobject::state_in_sysfs flag and without additional irq_sysfs_del().
While at it, clean up the comment where irq_sysfs_del() was called.

Quick grep shows no sign of irq reference counting in drivers. Drivers
typically request mapping when probing and dispose it when removing;
platforms tend to dispose only if setup failed and the rest seem to
call one dispose per mapping. Except (at least) PPC/pseries
which needs https://lkml.org/lkml/2020/10/27/259

Cc: Cédric Le Goater 
Cc: Marc Zyngier 
Cc: Michael Ellerman 
Cc: Qian Cai 
Cc: Rob Herring 
Cc: Frederic Barrat 
Cc: Michal Suchánek 
Cc: Thomas Gleixner 
Signed-off-by: Alexey Kardashevskiy 
---

This is what it is fixing for powerpc:

There was a comment about whether hierarchical IRQ domains should
contribute to this reference counter and I need some help here as
I cannot see why.
It is reverse now - IRQs contribute to domain->mapcount and
irq_domain_associate/irq_domain_disassociate take necessary steps to
keep this counter in order. What might be missing is that if we have
cascade of IRQs (as in the IOAPIC example from
Documentation/core-api/irq/irq-domain.rst ), then a parent IRQ should
contribute to the children IRQs and it is up to
irq_domain_ops::alloc/free hooks, and they all seem to be eventually
calling irq_domain_alloc_irqs_xxx/irq_domain_free_irqs_xxx which seems
right.

Documentation/core-api/irq/irq-domain.rst also suggests there is a lot
to see in debugfs about IRQs but on my thinkpad there is nothing about
hierarchy.

So I'll ask again :)

What is the easiest way to get irq-hierarchical hardware?
I have a bunch of powerpc boxes (no good) but also a raspberry pi,
a bunch of 32/64bit orange pi's, an "armada" arm box,
thinkpads - is any of this good for the task?


If your HW doesn't require an interrupt hierarchy, run VMs!
Booting an arm64 guest with virtual PCI devices will result in
hierarchies being created (PCI-MSI -> GIC MSI widget -> GIC).
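
For example, something along these lines should do (illustrative command
line only):

  qemu-system-aarch64 -M virt,gic-version=3 -cpu cortex-a57 -m 1G \
	-kernel Image -nographic \
	-device virtio-net-pci,netdev=n0 -netdev user,id=n0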


Absolutely :) But the beauty of ARM is that one can buy an actual ARM 
device for $20. I have an "opi one+ allwinner h6 64bit cortex a53 1GB RAM"; 
is it worth using KVM on this device, or is it too small for that?



You can use KVM, or even bare QEMU on x86 if you are so inclined.


Have a QEMU command line handy for x86/tcg?


I'll try to go through this patch over the week-end (or more probably
early next week), and try to understand where our understandings
differ.


Great, thanks! Fred spotted a problem with irq_free_descs() not doing 
kobject_put() anymore, which is a problem for sa.c and the like, 
and I will go through these places anyway.



--
Alexey


[PATCH 05/10] tracepoints: Migrate to use SYSCALL_WORK flag

2020-11-13 Thread Gabriel Krisman Bertazi
For architectures that rely on the generic syscall entry code, use the
syscall_work field in struct thread_info and the specific SYSCALL_WORK
flag.  This set of flags has the advantage of being architecture
independent.

Users of the flag outside of the generic entry code should rely on the
accessor macros, such that the flag is still correctly resolved for
architectures that don't use the generic entry code and still rely on
TIF flags for system call work.

Signed-off-by: Gabriel Krisman Bertazi 
---
 include/linux/entry-common.h | 13 +
 include/linux/thread_info.h  |  3 +++
 include/trace/syscall.h  |  6 +++---
 kernel/entry/common.c|  4 ++--
 kernel/trace/trace_events.c  |  2 +-
 kernel/tracepoint.c  |  4 ++--
 6 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index f3fc4457f63f..8aba367e5c79 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -17,10 +17,6 @@
 # define _TIF_SYSCALL_EMU  (0)
 #endif
 
-#ifndef _TIF_SYSCALL_TRACEPOINT
-# define _TIF_SYSCALL_TRACEPOINT   (0)
-#endif
-
 #ifndef _TIF_SYSCALL_AUDIT
 # define _TIF_SYSCALL_AUDIT(0)
 #endif
@@ -46,7 +42,7 @@
 
 #define SYSCALL_ENTER_WORK \
(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT  | \
-_TIF_SYSCALL_TRACEPOINT | _TIF_SYSCALL_EMU |   \
+_TIF_SYSCALL_EMU | \
 ARCH_SYSCALL_ENTER_WORK)
 
 /*
@@ -58,10 +54,11 @@
 
 #define SYSCALL_EXIT_WORK  \
(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT |  \
-_TIF_SYSCALL_TRACEPOINT | ARCH_SYSCALL_EXIT_WORK)
+ARCH_SYSCALL_EXIT_WORK)
 
-#define SYSCALL_WORK_ENTER (SYSCALL_WORK_SECCOMP)
-#define SYSCALL_WORK_EXIT  (0)
+#define SYSCALL_WORK_ENTER (SYSCALL_WORK_SECCOMP | \
+SYSCALL_WORK_SYSCALL_TRACEPOINT)
+#define SYSCALL_WORK_EXIT  (SYSCALL_WORK_SYSCALL_TRACEPOINT)
 
 /*
  * TIF flags handled in exit_to_user_mode_loop()
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index fb53c24fc8a6..f764314b00b9 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -36,10 +36,13 @@ enum {
 };
 
 enum syscall_work_bit {
+
SYSCALL_WORK_SECCOMP= 0,
+   SYSCALL_WORK_SYSCALL_TRACEPOINT = 1,
 };
 
 #define _SYSCALL_WORK_SECCOMP   BIT(SYSCALL_WORK_SECCOMP)
+#define _SYSCALL_WORK_SYSCALL_TRACEPOINT BIT(SYSCALL_WORK_SYSCALL_TRACEPOINT)
 
 #include <asm/thread_info.h>
 
diff --git a/include/trace/syscall.h b/include/trace/syscall.h
index dc8ac27d27c1..8e193f3a33b3 100644
--- a/include/trace/syscall.h
+++ b/include/trace/syscall.h
@@ -37,10 +37,10 @@ struct syscall_metadata {
 #if defined(CONFIG_TRACEPOINTS) && defined(CONFIG_HAVE_SYSCALL_TRACEPOINTS)
 static inline void syscall_tracepoint_update(struct task_struct *p)
 {
-   if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
-   set_tsk_thread_flag(p, TIF_SYSCALL_TRACEPOINT);
+   if (test_syscall_work(SYSCALL_TRACEPOINT))
+   set_task_syscall_work(p, SYSCALL_TRACEPOINT);
else
-   clear_tsk_thread_flag(p, TIF_SYSCALL_TRACEPOINT);
+   clear_task_syscall_work(p, SYSCALL_TRACEPOINT);
 }
 #else
 static inline void syscall_tracepoint_update(struct task_struct *p)
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index ef49786e5c5b..745b847f4ed4 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -63,7 +63,7 @@ static long syscall_trace_enter(struct pt_regs *regs, long 
syscall,
/* Either of the above might have changed the syscall number */
syscall = syscall_get_nr(current, regs);
 
-   if (unlikely(ti_work & _TIF_SYSCALL_TRACEPOINT))
+   if (unlikely(work & _SYSCALL_WORK_SYSCALL_TRACEPOINT))
trace_sys_enter(regs, syscall);
 
syscall_enter_audit(regs, syscall);
@@ -233,7 +233,7 @@ static void syscall_exit_work(struct pt_regs *regs, 
unsigned long ti_work,
 
audit_syscall_exit(regs);
 
-   if (ti_work & _TIF_SYSCALL_TRACEPOINT)
+   if (work & _SYSCALL_WORK_SYSCALL_TRACEPOINT)
trace_sys_exit(regs, syscall_get_return_value(current, regs));
 
step = report_single_step(ti_work);
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 47a71f96e5bc..950764dd226f 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -3428,7 +3428,7 @@ static __init int event_trace_enable(void)
  * initialize events and perhaps start any events that are on the
  * command line. Unfortunately, there are some events that will not
  * start this early, like the system call tracepoints that need
- * to set the TIF_SYSCALL_TRACEPOINT flag of pid 1. But event_trace_enable()
+ * to set the 

[PATCH 02/10] kernel: entry: Expose helpers to migrate TIF to SYSCALL_WORK flags

2020-11-13 Thread Gabriel Krisman Bertazi
With the goal to split the syscall work related flags into a separate field
that is architecture independent, expose transitional helpers that
resolve to either the TIF flags or to the corresponding SYSCALL_WORK
flags.  This will allow architectures to migrate only when they port to
the generic syscall entry code.
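
For illustration, a hypothetical call site checking seccomp work would then
be written as:

	if (test_task_syscall_work(task, SECCOMP))
		...

which resolves to a syscall_work bit test on CONFIG_GENERIC_ENTRY
architectures and to a TIF_SECCOMP thread-flag test everywhere else.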

Signed-off-by: Gabriel Krisman Bertazi 
---
 include/linux/thread_info.h | 42 +
 1 file changed, 42 insertions(+)

diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index e93e249a4e9b..18755373dc4d 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -97,6 +97,48 @@ static inline int test_ti_thread_flag(struct thread_info 
*ti, int flag)
 #define test_thread_flag(flag) \
test_ti_thread_flag(current_thread_info(), flag)
 
+#ifdef CONFIG_GENERIC_ENTRY
+static inline void __set_task_syscall_work(struct thread_info *ti, int flag)
+{
+   set_bit(flag, (unsigned long *)&ti->syscall_work);
+}
+static inline int __test_task_syscall_work(struct thread_info *ti, int flag)
+{
+   return test_bit(flag, (unsigned long *)&ti->syscall_work);
+}
+static inline void __clear_task_syscall_work(struct thread_info *ti, int flag)
+{
+   return clear_bit(flag, (unsigned long *)&ti->syscall_work);
+}
+#define set_syscall_work(fl) \
+   __set_task_syscall_work(current_thread_info(), SYSCALL_WORK_##fl)
+#define test_syscall_work(fl) \
+   __test_task_syscall_work(current_thread_info(), SYSCALL_WORK_##fl)
+#define clear_syscall_work(fl) \
+   __clear_task_syscall_work(current_thread_info(), SYSCALL_WORK_##fl)
+
+#define set_task_syscall_work(t, fl) \
+   __set_task_syscall_work(task_thread_info(t), SYSCALL_WORK_##fl)
+#define test_task_syscall_work(t, fl) \
+   __test_task_syscall_work(task_thread_info(t), SYSCALL_WORK_##fl)
+#define clear_task_syscall_work(t, fl) \
+   __clear_task_syscall_work(task_thread_info(t), SYSCALL_WORK_##fl)
+#else
+#define set_syscall_work(fl) \
+   set_ti_thread_flag(current_thread_info(), SYSCALL_WORK_##fl)
+#define test_syscall_work(fl) \
+   test_ti_thread_flag(current_thread_info(), SYSCALL_WORK_##fl)
+#define clear_syscall_work(fl) \
+   clear_ti_thread_flag(current_thread_info(), SYSCALL_WORK_##fl)
+
+#define set_task_syscall_work(t, fl) \
+   set_ti_thread_flag(task_thread_info(t), TIF_##fl)
+#define test_task_syscall_work(t, fl) \
+   test_ti_thread_flag(task_thread_info(t), TIF_##fl)
+#define clear_task_syscall_work(t, fl) \
+   clear_ti_thread_flag(task_thread_info(t), TIF_##fl)
+#endif /* CONFIG_GENERIC_ENTRY */
+
 #define tif_need_resched() test_thread_flag(TIF_NEED_RESCHED)
 
 #ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
-- 
2.29.2



[PATCH 04/10] seccomp: Migrate to use SYSCALL_WORK flag

2020-11-13 Thread Gabriel Krisman Bertazi
When using the generic syscall entry code, use the syscall_work field in
struct thread_info and the specific SYSCALL_WORK flag to set up this syscall
work.  This flag has the advantage of being architecture independent.

Users of the flag outside of the generic entry code should rely on the
accessor macros, such that the flag is still correctly resolved for
architectures that don't use the generic entry code and still rely on
TIF flags for system call work.

Signed-off-by: Gabriel Krisman Bertazi 
---
 include/linux/entry-common.h | 8 ++--
 include/linux/seccomp.h  | 2 +-
 include/linux/thread_info.h  | 6 ++
 kernel/entry/common.c| 2 +-
 kernel/fork.c| 2 +-
 kernel/seccomp.c | 6 +++---
 6 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index cbc5c702ee4d..f3fc4457f63f 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -21,10 +21,6 @@
 # define _TIF_SYSCALL_TRACEPOINT   (0)
 #endif
 
-#ifndef _TIF_SECCOMP
-# define _TIF_SECCOMP  (0)
-#endif
-
 #ifndef _TIF_SYSCALL_AUDIT
 # define _TIF_SYSCALL_AUDIT(0)
 #endif
@@ -49,7 +45,7 @@
 #endif
 
 #define SYSCALL_ENTER_WORK \
-   (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SECCOMP |   \
+   (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT  | \
 _TIF_SYSCALL_TRACEPOINT | _TIF_SYSCALL_EMU |   \
 ARCH_SYSCALL_ENTER_WORK)
 
@@ -64,7 +60,7 @@
(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT |  \
 _TIF_SYSCALL_TRACEPOINT | ARCH_SYSCALL_EXIT_WORK)
 
-#define SYSCALL_WORK_ENTER (0)
+#define SYSCALL_WORK_ENTER (SYSCALL_WORK_SECCOMP)
 #define SYSCALL_WORK_EXIT  (0)
 
 /*
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 02aef2844c38..47763f3999f7 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -42,7 +42,7 @@ struct seccomp {
 extern int __secure_computing(const struct seccomp_data *sd);
 static inline int secure_computing(void)
 {
-   if (unlikely(test_thread_flag(TIF_SECCOMP)))
+   if (unlikely(test_syscall_work(SECCOMP)))
return  __secure_computing(NULL);
return 0;
 }
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 18755373dc4d..fb53c24fc8a6 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -35,6 +35,12 @@ enum {
GOOD_STACK,
 };
 
+enum syscall_work_bit {
+   SYSCALL_WORK_SECCOMP= 0,
+};
+
+#define _SYSCALL_WORK_SECCOMP   BIT(SYSCALL_WORK_SECCOMP)
+
 #include <asm/thread_info.h>
 
 #ifdef __KERNEL__
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 5a4bb72ff28e..ef49786e5c5b 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -54,7 +54,7 @@ static long syscall_trace_enter(struct pt_regs *regs, long 
syscall,
}
 
/* Do seccomp after ptrace, to catch any tracer changes. */
-   if (ti_work & _TIF_SECCOMP) {
+   if (work & _SYSCALL_WORK_SECCOMP) {
ret = __secure_computing(NULL);
if (ret == -1L)
return ret;
diff --git a/kernel/fork.c b/kernel/fork.c
index 7199d359690c..4433c9c60100 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1625,7 +1625,7 @@ static void copy_seccomp(struct task_struct *p)
 * to manually enable the seccomp thread flag here.
 */
if (p->seccomp.mode != SECCOMP_MODE_DISABLED)
-   set_tsk_thread_flag(p, TIF_SECCOMP);
+   set_task_syscall_work(p, SECCOMP);
 #endif
 }
 
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 8ad7a293255a..f67e92d11ad7 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -356,14 +356,14 @@ static inline void seccomp_assign_mode(struct task_struct 
*task,
 
task->seccomp.mode = seccomp_mode;
/*
-* Make sure TIF_SECCOMP cannot be set before the mode (and
+* Make sure SYSCALL_WORK_SECCOMP cannot be set before the mode (and
 * filter) is set.
 */
smp_mb__before_atomic();
/* Assume default seccomp processes want spec flaw mitigation. */
if ((flags & SECCOMP_FILTER_FLAG_SPEC_ALLOW) == 0)
arch_seccomp_spec_mitigate(task);
-   set_tsk_thread_flag(task, TIF_SECCOMP);
+   set_task_syscall_work(task, SECCOMP);
 }
 
 #ifdef CONFIG_SECCOMP_FILTER
@@ -929,7 +929,7 @@ static int __seccomp_filter(int this_syscall, const struct 
seccomp_data *sd,
 
/*
 * Make sure that any changes to mode from another thread have
-* been seen after TIF_SECCOMP was seen.
+* been seen after SYSCALL_WORK_SECCOMP was seen.
 */
rmb();
 
-- 
2.29.2



[PATCH 09/10] kernel: entry: Drop usage of TIF flags in the generic syscall code

2020-11-13 Thread Gabriel Krisman Bertazi
Now that the flags migration in the common syscall entry is complete and
the code relies exclusively on syscall_work, clean up the
accesses to TI flags in that path.

Signed-off-by: Gabriel Krisman Bertazi 
---
 include/linux/entry-common.h | 20 +---
 kernel/entry/common.c| 17 +++--
 2 files changed, 16 insertions(+), 21 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index afeb927e8545..cffd8bf1e085 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -26,31 +26,29 @@
 #endif
 
 /*
- * TIF flags handled in syscall_enter_from_user_mode()
+ * SYSCALL_WORK flags handled in syscall_enter_from_user_mode()
  */
-#ifndef ARCH_SYSCALL_ENTER_WORK
-# define ARCH_SYSCALL_ENTER_WORK   (0)
+#ifndef ARCH_SYSCALL_WORK_ENTER
+# define ARCH_SYSCALL_WORK_ENTER   (0)
 #endif
 
-#define SYSCALL_ENTER_WORK ARCH_SYSCALL_ENTER_WORK
-
 /*
  * TIF flags handled in syscall_exit_to_user_mode()
  */
-#ifndef ARCH_SYSCALL_EXIT_WORK
-# define ARCH_SYSCALL_EXIT_WORK(0)
+#ifndef ARCH_SYSCALL_WORK_EXIT
+# define ARCH_SYSCALL_WORK_EXIT(0)
 #endif
 
-#define SYSCALL_EXIT_WORK ARCH_SYSCALL_EXIT_WORK
-
 #define SYSCALL_WORK_ENTER (SYSCALL_WORK_SECCOMP | \
 SYSCALL_WORK_SYSCALL_TRACEPOINT |  \
 SYSCALL_WORK_SYSCALL_TRACE |   \
 SYSCALL_WORK_SYSCALL_EMU | \
-SYSCALL_WORK_SYSCALL_AUDIT)
+SYSCALL_WORK_SYSCALL_AUDIT |   \
+ARCH_SYSCALL_WORK_ENTER)
 #define SYSCALL_WORK_EXIT  (SYSCALL_WORK_SYSCALL_TRACEPOINT |  \
 SYSCALL_WORK_SYSCALL_TRACE |   \
-SYSCALL_WORK_SYSCALL_AUDIT)
+SYSCALL_WORK_SYSCALL_AUDIT |   \
+ARCH_SYSCALL_WORK_EXIT)
 
 /*
  * TIF flags handled in exit_to_user_mode_loop()
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 0170a4ae58f8..0ddc590bfe73 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -42,7 +42,7 @@ static inline void syscall_enter_audit(struct pt_regs *regs, 
long syscall)
 }
 
 static long syscall_trace_enter(struct pt_regs *regs, long syscall,
-   unsigned long ti_work, unsigned long work)
+   unsigned long work)
 {
long ret = 0;
 
@@ -74,12 +74,10 @@ static long syscall_trace_enter(struct pt_regs *regs, long 
syscall,
 static __always_inline long
 __syscall_enter_from_user_work(struct pt_regs *regs, long syscall)
 {
-   unsigned long ti_work;
unsigned long work = READ_ONCE(current_thread_info()->syscall_work);
 
-   ti_work = READ_ONCE(current_thread_info()->flags);
-   if (work & SYSCALL_WORK_ENTER || ti_work & SYSCALL_ENTER_WORK)
-   syscall = syscall_trace_enter(regs, syscall, ti_work, work);
+   if (work & SYSCALL_WORK_ENTER)
+   syscall = syscall_trace_enter(regs, syscall, work);
 
return syscall;
 }
@@ -227,8 +225,8 @@ static inline bool report_single_step(unsigned long work)
 }
 #endif
 
-static void syscall_exit_work(struct pt_regs *regs, unsigned long ti_work,
- unsigned long work)
+
+static void syscall_exit_work(struct pt_regs *regs, unsigned long work)
 {
bool step;
 
@@ -248,7 +246,6 @@ static void syscall_exit_work(struct pt_regs *regs, 
unsigned long ti_work,
  */
 static void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
 {
-   u32 cached_flags = READ_ONCE(current_thread_info()->flags);
unsigned long work = READ_ONCE(current_thread_info()->syscall_work);
unsigned long nr = syscall_get_nr(current, regs);
 
@@ -266,8 +263,8 @@ static void syscall_exit_to_user_mode_prepare(struct 
pt_regs *regs)
 * enabled, we want to run them exactly once per syscall exit with
 * interrupts enabled.
 */
-   if (unlikely(work & SYSCALL_WORK_EXIT || cached_flags & 
SYSCALL_EXIT_WORK))
-   syscall_exit_work(regs, cached_flags, work);
+   if (unlikely(work & SYSCALL_WORK_EXIT))
+   syscall_exit_work(regs, work);
 }
 
 __visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
-- 
2.29.2



[PATCH 03/10] kernel: entry: Wire up syscall_work in common entry code

2020-11-13 Thread Gabriel Krisman Bertazi
Prepares the common entry code to use the SYSCALL_WORK flags. They will
be defined in subsequent patches for each type of syscall
work. SYSCALL_WORK_ENTRY/EXIT are defined for the transition, as they
will replace the TIF_ equivalent defines.

Signed-off-by: Gabriel Krisman Bertazi 
---
 include/linux/entry-common.h |  3 +++
 kernel/entry/common.c| 15 +--
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 1a128baf3628..cbc5c702ee4d 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -64,6 +64,9 @@
(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT |  \
 _TIF_SYSCALL_TRACEPOINT | ARCH_SYSCALL_EXIT_WORK)
 
+#define SYSCALL_WORK_ENTER (0)
+#define SYSCALL_WORK_EXIT  (0)
+
 /*
  * TIF flags handled in exit_to_user_mode_loop()
  */
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index bc75c114c1b3..5a4bb72ff28e 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -42,7 +42,7 @@ static inline void syscall_enter_audit(struct pt_regs *regs, 
long syscall)
 }
 
 static long syscall_trace_enter(struct pt_regs *regs, long syscall,
-   unsigned long ti_work)
+   unsigned long ti_work, unsigned long work)
 {
long ret = 0;
 
@@ -75,10 +75,11 @@ static __always_inline long
 __syscall_enter_from_user_work(struct pt_regs *regs, long syscall)
 {
unsigned long ti_work;
+   unsigned long work = READ_ONCE(current_thread_info()->syscall_work);
 
ti_work = READ_ONCE(current_thread_info()->flags);
-   if (ti_work & SYSCALL_ENTER_WORK)
-   syscall = syscall_trace_enter(regs, syscall, ti_work);
+   if (work & SYSCALL_WORK_ENTER || ti_work & SYSCALL_ENTER_WORK)
+   syscall = syscall_trace_enter(regs, syscall, ti_work, work);
 
return syscall;
 }
@@ -225,7 +226,8 @@ static inline bool report_single_step(unsigned long ti_work)
 }
 #endif
 
-static void syscall_exit_work(struct pt_regs *regs, unsigned long ti_work)
+static void syscall_exit_work(struct pt_regs *regs, unsigned long ti_work,
+ unsigned long work)
 {
bool step;
 
@@ -246,6 +248,7 @@ static void syscall_exit_work(struct pt_regs *regs, 
unsigned long ti_work)
 static void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
 {
u32 cached_flags = READ_ONCE(current_thread_info()->flags);
+   unsigned long work = READ_ONCE(current_thread_info()->syscall_work);
unsigned long nr = syscall_get_nr(current, regs);
 
CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
@@ -262,8 +265,8 @@ static void syscall_exit_to_user_mode_prepare(struct 
pt_regs *regs)
 * enabled, we want to run them exactly once per syscall exit with
 * interrupts enabled.
 */
-   if (unlikely(cached_flags & SYSCALL_EXIT_WORK))
-   syscall_exit_work(regs, cached_flags);
+   if (unlikely(work & SYSCALL_WORK_EXIT || cached_flags & 
SYSCALL_EXIT_WORK))
+   syscall_exit_work(regs, cached_flags, work);
 }
 
 __visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
-- 
2.29.2



[PATCH 06/10] ptrace: Migrate to use SYSCALL_TRACE flag

2020-11-13 Thread Gabriel Krisman Bertazi
For architectures that rely on the generic syscall entry code, use the
syscall_work field in struct thread_info and the specific SYSCALL_WORK
flag.  This set of flags has the advantage of being architecture
independent.

Users of the flag outside of the generic entry code should rely on the
accessor macros, such that the flag is still correctly resolved for
architectures that don't use the generic entry code and still rely on
TIF flags for system call work.

Signed-off-by: Gabriel Krisman Bertazi 
---
 include/asm-generic/syscall.h | 14 +++---
 include/linux/entry-common.h  | 10 ++
 include/linux/thread_info.h   |  2 ++
 include/linux/tracehook.h |  6 +++---
 kernel/entry/common.c |  4 ++--
 kernel/fork.c |  2 +-
 kernel/ptrace.c   |  6 +++---
 7 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/include/asm-generic/syscall.h b/include/asm-generic/syscall.h
index f3135e734387..5042d1ba4bc5 100644
--- a/include/asm-generic/syscall.h
+++ b/include/asm-generic/syscall.h
@@ -43,7 +43,7 @@ int syscall_get_nr(struct task_struct *task, struct pt_regs 
*regs);
  * @regs:  task_pt_regs() of @task
  *
  * It's only valid to call this when @task is stopped for system
- * call exit tracing (due to TIF_SYSCALL_TRACE or TIF_SYSCALL_AUDIT),
+ * call exit tracing (due to SYSCALL_TRACE or TIF_SYSCALL_AUDIT),
  * after tracehook_report_syscall_entry() returned nonzero to prevent
  * the system call from taking place.
  *
@@ -63,7 +63,7 @@ void syscall_rollback(struct task_struct *task, struct 
pt_regs *regs);
  * Returns 0 if the system call succeeded, or -ERRORCODE if it failed.
  *
  * It's only valid to call this when @task is stopped for tracing on exit
- * from a system call, due to %TIF_SYSCALL_TRACE or %TIF_SYSCALL_AUDIT.
+ * from a system call, due to %SYSCALL_TRACE or %TIF_SYSCALL_AUDIT.
  */
 long syscall_get_error(struct task_struct *task, struct pt_regs *regs);
 
@@ -76,7 +76,7 @@ long syscall_get_error(struct task_struct *task, struct 
pt_regs *regs);
  * This value is meaningless if syscall_get_error() returned nonzero.
  *
  * It's only valid to call this when @task is stopped for tracing on exit
- * from a system call, due to %TIF_SYSCALL_TRACE or %TIF_SYSCALL_AUDIT.
+ * from a system call, due to %SYSCALL_TRACE or %TIF_SYSCALL_AUDIT.
  */
 long syscall_get_return_value(struct task_struct *task, struct pt_regs *regs);
 
@@ -93,7 +93,7 @@ long syscall_get_return_value(struct task_struct *task, 
struct pt_regs *regs);
  * code; the user sees a failed system call with this errno code.
  *
  * It's only valid to call this when @task is stopped for tracing on exit
- * from a system call, due to %TIF_SYSCALL_TRACE or %TIF_SYSCALL_AUDIT.
+ * from a system call, due to %SYSCALL_TRACE or %TIF_SYSCALL_AUDIT.
  */
 void syscall_set_return_value(struct task_struct *task, struct pt_regs *regs,
  int error, long val);
@@ -108,7 +108,7 @@ void syscall_set_return_value(struct task_struct *task, 
struct pt_regs *regs,
 *  @args[0], and so on.
  *
  * It's only valid to call this when @task is stopped for tracing on
- * entry to a system call, due to %TIF_SYSCALL_TRACE or %TIF_SYSCALL_AUDIT.
+ * entry to a system call, due to %SYSCALL_TRACE or %TIF_SYSCALL_AUDIT.
  */
 void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs,
   unsigned long *args);
@@ -123,7 +123,7 @@ void syscall_get_arguments(struct task_struct *task, struct 
pt_regs *regs,
  * The first argument gets value @args[0], and so on.
  *
  * It's only valid to call this when @task is stopped for tracing on
- * entry to a system call, due to %TIF_SYSCALL_TRACE or %TIF_SYSCALL_AUDIT.
+ * entry to a system call, due to %SYSCALL_TRACE or %TIF_SYSCALL_AUDIT.
  */
 void syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
   const unsigned long *args);
@@ -135,7 +135,7 @@ void syscall_set_arguments(struct task_struct *task, struct 
pt_regs *regs,
  * Returns the AUDIT_ARCH_* based on the system call convention in use.
  *
  * It's only valid to call this when @task is stopped on entry to a system
- * call, due to %TIF_SYSCALL_TRACE, %TIF_SYSCALL_AUDIT, or %TIF_SECCOMP.
+ * call, due to %SYSCALL_TRACE, %TIF_SYSCALL_AUDIT, or %TIF_SECCOMP.
  *
  * Architectures which permit CONFIG_HAVE_ARCH_SECCOMP_FILTER must
  * provide an implementation of this.
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 8aba367e5c79..dc864edb7950 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -41,7 +41,7 @@
 #endif
 
 #define SYSCALL_ENTER_WORK \
-   (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT  | \
+   (_TIF_SYSCALL_AUDIT  |  \
 _TIF_SYSCALL_EMU | \
 ARCH_SYSCALL_ENTER_WORK)
 
@@ 

[PATCH 00/10] Migrate syscall entry/exit work to SYSCALL_WORK flagset

2020-11-13 Thread Gabriel Krisman Bertazi
Thomas,

This is a refactor moving the work done by features like seccomp,
ptrace, audit and tracepoints out of the TI flags.  The reasons are:

   1) Scarcity of TI flags in x86 32-bit.

   2) TI flags are defined by the architecture, while these features are
   arch-independent.

   3) Community resistance in merging new architecture-independent
   features as TI flags.

The design exposes a new field in struct thread_info that is read at
syscall_trace_enter and syscall_exit_work in place of the TI flags.
No functional changes are expected from this patchset.  The design and
organization of this patchset achieves the following goals:

  1) SYSCALL_WORK flags are architecture-independent

  2) Architectures that are not using the generic entry code can
  continue to use TI flags transparently and be converted later.

  3) Architectures that migrate to the generic entry code are forced to
  use the new design.

  4) x86, since it supports the generic code, is migrated in this
  patchset.

The transparent usage of TIF or SYSCALL_WORK flags is achieved through
some macros.  Any code outside of the generic entry code is converted to
use the flags only through the accessors.
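
For illustration, the accessors work along these lines (a simplified
sketch, not necessarily the exact helpers from patch 02): on
generic-entry architectures they operate on the new syscall_work field,
everywhere else they fall back to the legacy TIF bit, so callers never
need to know which one is in use.

#ifdef CONFIG_GENERIC_ENTRY
# define set_task_syscall_work(t, fl) \
	set_bit(SYSCALL_WORK_##fl, &task_thread_info(t)->syscall_work)
# define test_task_syscall_work(t, fl) \
	test_bit(SYSCALL_WORK_##fl, &task_thread_info(t)->syscall_work)
#else
# define set_task_syscall_work(t, fl) \
	set_ti_thread_flag(task_thread_info(t), TIF_##fl)
# define test_task_syscall_work(t, fl) \
	test_ti_thread_flag(task_thread_info(t), TIF_##fl)
#endif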

The patchset has some transition helpers, in an attempt to simplify the
patches converting each of the subsystems separately.  I believe this
simplifies the review while making the tree bisectable.

I tested this by running each of the features in x86.  Other
architectures were compile tested only.

This is based on top of tip/master.

A tree with the patches applied can be pulled from

  https://gitlab.collabora.com/krisman/linux.git -b x86/tif-cleanup-v1

Please, if possible, consider queueing this for the 5.11 merge window,
as this is blocking the Syscall User Dispatch work that has been on the
list for a while.

Gabriel Krisman Bertazi (10):
  x86: Expose syscall_work field in thread_info
  kernel: entry: Expose helpers to migrate TIF to SYSCALL_WORK flags
  kernel: entry: Wire up syscall_work in common entry code
  seccomp: Migrate to use SYSCALL_WORK flag
  tracepoints: Migrate to use SYSCALL_WORK flag
  ptrace: Migrate to use SYSCALL_TRACE flag
  ptrace: Migrate TIF_SYSCALL_EMU to use SYSCALL_WORK flag
  audit: Migrate to use SYSCALL_WORK flag
  kernel: entry: Drop usage of TIF flags in the generic syscall code
  x86: Reclaim unused x86 TI flags

 arch/x86/include/asm/thread_info.h | 11 +-
 include/asm-generic/syscall.h  | 14 
 include/linux/entry-common.h   | 44 ---
 include/linux/seccomp.h|  2 +-
 include/linux/thread_info.h| 57 ++
 include/linux/tracehook.h  |  6 ++--
 include/trace/syscall.h|  6 ++--
 kernel/auditsc.c   |  4 +--
 kernel/entry/common.c  | 45 +++
 kernel/fork.c  |  8 ++---
 kernel/ptrace.c| 16 -
 kernel/seccomp.c   |  6 ++--
 kernel/trace/trace_events.c|  2 +-
 kernel/tracepoint.c|  4 +--
 14 files changed, 130 insertions(+), 95 deletions(-)

-- 
2.29.2



[PATCH 07/10] ptrace: Migrate TIF_SYSCALL_EMU to use SYSCALL_WORK flag

2020-11-13 Thread Gabriel Krisman Bertazi
For architectures that rely on the generic syscall entry code, use the
syscall_work field in struct thread_info and the specific SYSCALL_WORK
flag.  This set of flags has the advantage of being architecture
independent.

Users of the flag outside of the generic entry code should rely on the
accessor macros, such that the flag is still correctly resolved for
architectures that don't use the generic entry code and still rely on
TIF flags for system call work.

Signed-off-by: Gabriel Krisman Bertazi 
---
 include/linux/entry-common.h |  8 ++--
 include/linux/thread_info.h  |  2 ++
 include/linux/tracehook.h|  2 +-
 kernel/entry/common.c| 19 ++-
 kernel/fork.c|  4 ++--
 kernel/ptrace.c  | 10 +-
 6 files changed, 22 insertions(+), 23 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index dc864edb7950..39d56558818d 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -13,10 +13,6 @@
  * Define dummy _TIF work flags if not defined by the architecture or for
  * disabled functionality.
  */
-#ifndef _TIF_SYSCALL_EMU
-# define _TIF_SYSCALL_EMU  (0)
-#endif
-
 #ifndef _TIF_SYSCALL_AUDIT
 # define _TIF_SYSCALL_AUDIT(0)
 #endif
@@ -42,7 +38,6 @@
 
 #define SYSCALL_ENTER_WORK \
(_TIF_SYSCALL_AUDIT  |  \
-_TIF_SYSCALL_EMU | \
 ARCH_SYSCALL_ENTER_WORK)
 
 /*
@@ -58,7 +53,8 @@
 
 #define SYSCALL_WORK_ENTER (SYSCALL_WORK_SECCOMP | \
 SYSCALL_WORK_SYSCALL_TRACEPOINT |  \
-SYSCALL_WORK_SYSCALL_TRACE)
+SYSCALL_WORK_SYSCALL_TRACE |   \
+SYSCALL_WORK_SYSCALL_EMU)
 #define SYSCALL_WORK_EXIT  (SYSCALL_WORK_SYSCALL_TRACEPOINT |  \
 SYSCALL_WORK_SYSCALL_TRACE)
 
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index b01f05282158..3c7dedadf94d 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -40,11 +40,13 @@ enum syscall_work_bit {
SYSCALL_WORK_SECCOMP= 0,
SYSCALL_WORK_SYSCALL_TRACEPOINT = 1,
SYSCALL_WORK_SYSCALL_TRACE  = 2,
+   SYSCALL_WORK_SYSCALL_EMU= 3,
 };
 
 #define _SYSCALL_WORK_SECCOMP   BIT(SYSCALL_WORK_SECCOMP)
 #define _SYSCALL_WORK_SYSCALL_TRACEPOINT BIT(SYSCALL_WORK_SYSCALL_TRACEPOINT)
 #define _SYSCALL_WORK_SYSCALL_TRACE BIT(SYSCALL_WORK_SYSCALL_TRACE)
+#define _SYSCALL_WORK_SYSCALL_EMU   BIT(SYSCALL_WORK_SYSCALL_EMU)
 
 #include 
 
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 0aa3771d1df5..24424da49abc 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -83,7 +83,7 @@ static inline int ptrace_report_syscall(struct pt_regs *regs,
  * tracehook_report_syscall_entry - task is about to attempt a system call
  * @regs:  user register state of current task
  *
- * This will be called if %SYSCALL_TRACE or %TIF_SYSCALL_EMU have been set,
+ * This will be called if %SYSCALL_TRACE or %SYSCALL_EMU have been set,
  * when the current task has just entered the kernel for a system call.
  * Full user register state is available here.  Changing the values
  * in @regs can affect the system call number and arguments to be tried.
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 55ede5fed650..0170a4ae58f8 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -47,9 +47,9 @@ static long syscall_trace_enter(struct pt_regs *regs, long 
syscall,
long ret = 0;
 
/* Handle ptrace */
-   if (work & _SYSCALL_WORK_SYSCALL_TRACE || ti_work & _TIF_SYSCALL_EMU) {
+   if (work & (_SYSCALL_WORK_SYSCALL_TRACE | _SYSCALL_WORK_SYSCALL_EMU)) {
ret = arch_syscall_enter_tracehook(regs);
-   if (ret || (ti_work & _TIF_SYSCALL_EMU))
+   if (ret || (work & _SYSCALL_WORK_SYSCALL_EMU))
return -1L;
}
 
@@ -208,21 +208,22 @@ static void exit_to_user_mode_prepare(struct pt_regs 
*regs)
 }
 
 #ifndef _TIF_SINGLESTEP
-static inline bool report_single_step(unsigned long ti_work)
+static inline bool report_single_step(unsigned long work)
 {
return false;
 }
 #else
 /*
- * If TIF_SYSCALL_EMU is set, then the only reason to report is when
+ * If SYSCALL_EMU is set, then the only reason to report is when
  * TIF_SINGLESTEP is set (i.e. PTRACE_SYSEMU_SINGLESTEP).  This syscall
  * instruction has been already reported in syscall_enter_from_user_mode().
  */
-#define SYSEMU_STEP(_TIF_SINGLESTEP | _TIF_SYSCALL_EMU)
-
-static inline bool report_single_step(unsigned long ti_work)
+static inline bool report_single_step(unsigned long work)
 {
-   return (ti_work 

[PATCH 10/10] x86: Reclaim unused x86 TI flags

2020-11-13 Thread Gabriel Krisman Bertazi
Reclaim TI flags that were migrated to syscall_work flags.

Signed-off-by: Gabriel Krisman Bertazi 
---
 arch/x86/include/asm/thread_info.h | 10 --
 1 file changed, 10 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h 
b/arch/x86/include/asm/thread_info.h
index b217f63e73b7..33b637442b9e 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -75,15 +75,11 @@ struct thread_info {
  * - these are process state flags that various assembly files
  *   may need to access
  */
-#define TIF_SYSCALL_TRACE  0   /* syscall trace active */
 #define TIF_NOTIFY_RESUME  1   /* callback before returning to user */
 #define TIF_SIGPENDING 2   /* signal pending */
 #define TIF_NEED_RESCHED   3   /* rescheduling necessary */
 #define TIF_SINGLESTEP 4   /* reenable singlestep on user return*/
 #define TIF_SSBD   5   /* Speculative store bypass disable */
-#define TIF_SYSCALL_EMU6   /* syscall emulation active */
-#define TIF_SYSCALL_AUDIT  7   /* syscall auditing active */
-#define TIF_SECCOMP8   /* secure computing */
 #define TIF_SPEC_IB9   /* Indirect branch speculation 
mitigation */
 #define TIF_SPEC_L1D_FLUSH 10  /* Flush L1D on mm switches (processes) 
*/
 #define TIF_USER_RETURN_NOTIFY 11  /* notify kernel of userspace return */
@@ -101,18 +97,13 @@ struct thread_info {
 #define TIF_FORCED_TF  24  /* true if TF in eflags artificially */
 #define TIF_BLOCKSTEP  25  /* set when we want DEBUGCTLMSR_BTF */
 #define TIF_LAZY_MMU_UPDATES   27  /* task is updating the mmu lazily */
-#define TIF_SYSCALL_TRACEPOINT 28  /* syscall tracepoint instrumentation */
 #define TIF_ADDR32 29  /* 32-bit address space on 64 bits */
 
-#define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
 #define _TIF_SIGPENDING(1 << TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED  (1 << TIF_NEED_RESCHED)
 #define _TIF_SINGLESTEP(1 << TIF_SINGLESTEP)
 #define _TIF_SSBD  (1 << TIF_SSBD)
-#define _TIF_SYSCALL_EMU   (1 << TIF_SYSCALL_EMU)
-#define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT)
-#define _TIF_SECCOMP   (1 << TIF_SECCOMP)
 #define _TIF_SPEC_IB   (1 << TIF_SPEC_IB)
 #define _TIF_SPEC_L1D_FLUSH(1 << TIF_SPEC_L1D_FLUSH)
 #define _TIF_USER_RETURN_NOTIFY(1 << TIF_USER_RETURN_NOTIFY)
@@ -129,7 +120,6 @@ struct thread_info {
 #define _TIF_FORCED_TF (1 << TIF_FORCED_TF)
 #define _TIF_BLOCKSTEP (1 << TIF_BLOCKSTEP)
 #define _TIF_LAZY_MMU_UPDATES  (1 << TIF_LAZY_MMU_UPDATES)
-#define _TIF_SYSCALL_TRACEPOINT(1 << TIF_SYSCALL_TRACEPOINT)
 #define _TIF_ADDR32(1 << TIF_ADDR32)
 
 /* flags to check in __switch_to() */
-- 
2.29.2



[PATCH 01/10] x86: Expose syscall_work field in thread_info

2020-11-13 Thread Gabriel Krisman Bertazi
This field will be used by the SYSCALL_WORK flags migrated from the TI flags.

Signed-off-by: Gabriel Krisman Bertazi 
---
 arch/x86/include/asm/thread_info.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/thread_info.h 
b/arch/x86/include/asm/thread_info.h
index 93277a8d2ef0..b217f63e73b7 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -55,6 +55,7 @@ struct task_struct;
 
 struct thread_info {
unsigned long   flags;  /* low level flags */
+   unsigned long   syscall_work;   /* SYSCALL_WORK_ flags */
u32 status; /* thread synchronous flags */
 };
 
-- 
2.29.2



[PATCH 08/10] audit: Migrate to use SYSCALL_WORK flag

2020-11-13 Thread Gabriel Krisman Bertazi
For architectures that rely on the generic syscall entry code, use the
syscall_work field in struct thread_info and the specific SYSCALL_WORK
flag.  This set of flags has the advantage of being architecture
independent.

Users of the flag outside of the generic entry code should rely on the
accessor macros, such that the flag is still correctly resolved for
architectures that don't use the generic entry code and still rely on
TIF flags for system call work.

Signed-off-by: Gabriel Krisman Bertazi 
---
 include/asm-generic/syscall.h | 14 +++---
 include/linux/entry-common.h  | 18 ++
 include/linux/thread_info.h   |  2 ++
 kernel/auditsc.c  |  4 ++--
 4 files changed, 17 insertions(+), 21 deletions(-)

diff --git a/include/asm-generic/syscall.h b/include/asm-generic/syscall.h
index 5042d1ba4bc5..66ada3b099eb 100644
--- a/include/asm-generic/syscall.h
+++ b/include/asm-generic/syscall.h
@@ -43,7 +43,7 @@ int syscall_get_nr(struct task_struct *task, struct pt_regs 
*regs);
  * @regs:  task_pt_regs() of @task
  *
  * It's only valid to call this when @task is stopped for system
- * call exit tracing (due to SYSCALL_TRACE or TIF_SYSCALL_AUDIT),
+ * call exit tracing (due to SYSCALL_TRACE or SYSCALL_AUDIT),
  * after tracehook_report_syscall_entry() returned nonzero to prevent
  * the system call from taking place.
  *
@@ -63,7 +63,7 @@ void syscall_rollback(struct task_struct *task, struct 
pt_regs *regs);
  * Returns 0 if the system call succeeded, or -ERRORCODE if it failed.
  *
  * It's only valid to call this when @task is stopped for tracing on exit
- * from a system call, due to %SYSCALL_TRACE or %TIF_SYSCALL_AUDIT.
+ * from a system call, due to %SYSCALL_TRACE or %SYSCALL_AUDIT.
  */
 long syscall_get_error(struct task_struct *task, struct pt_regs *regs);
 
@@ -76,7 +76,7 @@ long syscall_get_error(struct task_struct *task, struct 
pt_regs *regs);
  * This value is meaningless if syscall_get_error() returned nonzero.
  *
  * It's only valid to call this when @task is stopped for tracing on exit
- * from a system call, due to %SYSCALL_TRACE or %TIF_SYSCALL_AUDIT.
+ * from a system call, due to %SYSCALL_TRACE or %SYSCALL_AUDIT.
  */
 long syscall_get_return_value(struct task_struct *task, struct pt_regs *regs);
 
@@ -93,7 +93,7 @@ long syscall_get_return_value(struct task_struct *task, 
struct pt_regs *regs);
  * code; the user sees a failed system call with this errno code.
  *
  * It's only valid to call this when @task is stopped for tracing on exit
- * from a system call, due to %SYSCALL_TRACE or %TIF_SYSCALL_AUDIT.
+ * from a system call, due to %SYSCALL_TRACE or %SYSCALL_AUDIT.
  */
 void syscall_set_return_value(struct task_struct *task, struct pt_regs *regs,
  int error, long val);
@@ -108,7 +108,7 @@ void syscall_set_return_value(struct task_struct *task, 
struct pt_regs *regs,
 *  @args[0], and so on.
  *
  * It's only valid to call this when @task is stopped for tracing on
- * entry to a system call, due to %SYSCALL_TRACE or %TIF_SYSCALL_AUDIT.
+ * entry to a system call, due to %SYSCALL_TRACE or %SYSCALL_AUDIT.
  */
 void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs,
   unsigned long *args);
@@ -123,7 +123,7 @@ void syscall_get_arguments(struct task_struct *task, struct 
pt_regs *regs,
  * The first argument gets value @args[0], and so on.
  *
  * It's only valid to call this when @task is stopped for tracing on
- * entry to a system call, due to %SYSCALL_TRACE or %TIF_SYSCALL_AUDIT.
+ * entry to a system call, due to %SYSCALL_TRACE or %SYSCALL_AUDIT.
  */
 void syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
   const unsigned long *args);
@@ -135,7 +135,7 @@ void syscall_set_arguments(struct task_struct *task, struct 
pt_regs *regs,
  * Returns the AUDIT_ARCH_* based on the system call convention in use.
  *
  * It's only valid to call this when @task is stopped on entry to a system
- * call, due to %SYSCALL_TRACE, %TIF_SYSCALL_AUDIT, or %TIF_SECCOMP.
+ * call, due to %SYSCALL_TRACE, %SYSCALL_AUDIT, or %TIF_SECCOMP.
  *
  * Architectures which permit CONFIG_HAVE_ARCH_SECCOMP_FILTER must
  * provide an implementation of this.
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 39d56558818d..afeb927e8545 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -13,10 +13,6 @@
  * Define dummy _TIF work flags if not defined by the architecture or for
  * disabled functionality.
  */
-#ifndef _TIF_SYSCALL_AUDIT
-# define _TIF_SYSCALL_AUDIT(0)
-#endif
-
 #ifndef _TIF_PATCH_PENDING
 # define _TIF_PATCH_PENDING(0)
 #endif
@@ -36,9 +32,7 @@
 # define ARCH_SYSCALL_ENTER_WORK   (0)
 #endif
 
-#define SYSCALL_ENTER_WORK \
-   (_TIF_SYSCALL_AUDIT  |  \
-

Re: [PATCH v6 1/2] kunit: Support for Parameterized Testing

2020-11-13 Thread David Gow
On Sat, Nov 14, 2020 at 9:38 AM Arpitha Raghunandan <98.a...@gmail.com> wrote:
>
> On 14/11/20 5:44 am, Marco Elver wrote:
> >
> > Arpitha: Do you want to send v7, but with the following modifications
> > from what I proposed? Assuming nobody objects.
> >
> > 1. Remove the num_params counter and don't print the number of params
> > anymore, nor do validation that generators are deterministic.
> > 2. Remove the [].
> > [ I'm happy to send as well, just let me know what you prefer. ]
> >
> > Thanks,
> > -- Marco
> >
>
> If no objections I will send the v7 with the above changes.
> Thanks!

This sounds good to me!

Cheers,
-- David


Re: [SPECIFICATION RFC] The firmware and bootloader log specification

2020-11-13 Thread Randy Dunlap
On 11/13/20 3:52 PM, Daniel Kiper wrote:
> Hey,
> 
> 
> Here is the description (pseudocode) of the structures which will be
> used to store the log data.
> 
> Anyway, I am aware that this is not a specification per se.


Yes, you have caveats here. I'm sure that you either already know
or would learn soon enough that struct struct bf_log has some
padding added to it (for alignment) unless it is packed.
Or you could rearrange the order of some of its fields
and save 8 bytes per struct on x86_64.


>   struct bf_log
>   {
> uint32_t   version;
> char   producer[64];
> uint64_t   flags;
> uint64_t   next_bf_log_addr;
> uint32_t   next_msg_off;
> bf_log_msg msgs[];
>   }
> 
>   struct bf_log_msg
>   {
> uint32_t size;
> uint64_t ts_nsec;
> uint32_t level;
> uint32_t facility;
> uint32_t msg_off;
> char strings[];
>   }
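
To make that concrete, a rearranged layout along these lines (an
illustrative sketch, assuming x86_64 natural alignment) drops
sizeof(struct bf_log) from 96 to 88 bytes, since the original ordering
needs 4 bytes of padding after producer[] to 8-align flags, and 4 more
after next_msg_off to 8-align msgs[]:

  struct bf_log
  {
    uint64_t   flags;            /* offset  0 */
    uint64_t   next_bf_log_addr; /* offset  8 */
    uint32_t   version;          /* offset 16 */
    uint32_t   next_msg_off;     /* offset 20 */
    char       producer[64];     /* offset 24, ends at 88 */
    bf_log_msg msgs[];           /* starts 8-byte aligned at 88 */
  }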


cheers.
-- 
~Randy



Re: [PATCH 1/6] seq_file: add seq_read_iter

2020-11-13 Thread Nathan Chancellor
On Sat, Nov 14, 2020 at 01:17:54AM +, Al Viro wrote:
> On Fri, Nov 13, 2020 at 04:54:53PM -0700, Nathan Chancellor wrote:
> 
> > This patch in -next (6a9f696d1627bacc91d1cebcfb177f474484e8ba) breaks
> > WSL2's interoperability feature, where Windows paths automatically get
> > added to PATH on start up so that Windows binaries can be accessed from
> > within Linux (such as clip.exe to pipe output to the clipboard). Before,
> > I would see a bunch of Linux + Windows folders in $PATH but after, I
> > only see the Linux folders (I can give you the actual PATH value if you
> > care but it is really long).
> > 
> > I am not at all familiar with the semantics of this patch or how
> > Microsoft would be using it to inject folders into PATH (they have some
> > documentation on it here:
> > https://docs.microsoft.com/en-us/windows/wsl/interop) and I am not sure
> > how to go about figuring that out to see why this patch breaks something
> > (unless you have an idea). I have added the Hyper-V maintainers and list
> > to CC in case they know someone who could help.
> 
> Out of curiosity: could you slap WARN_ON(!iov_iter_count(iter)); right in
> the beginning of seq_read_iter() and see if that triggers?

Sure thing, it does trigger.

[0.235058] [ cut here ]
[0.235062] WARNING: CPU: 15 PID: 237 at fs/seq_file.c:176 
seq_read_iter+0x3b3/0x3f0
[0.235064] CPU: 15 PID: 237 Comm: localhost Not tainted 
5.10.0-rc2-microsoft-cbl-2-g6a9f696d1627-dirty #15
[0.235065] RIP: 0010:seq_read_iter+0x3b3/0x3f0
[0.235066] Code: ba 01 00 00 00 e8 6d d2 fc ff 4c 89 e7 48 89 ee 48 8b 54 
24 10 e8 ad 8b 45 00 49 01 c5 48 29 43 18 48 89 43 10 e9 61 fe ff ff <0f> 0b e9 
6f fc ff ff 0f 0b 45 31 ed e9 0d fd ff ff 48 c7 43 18 00
[0.235067] RSP: 0018:9c774063bd08 EFLAGS: 00010246
[0.235068] RAX: 91a77ac01f00 RBX: 91a50133c348 RCX: 0001
[0.235069] RDX: 9c774063bdb8 RSI: 9c774063bd60 RDI: 9c774063bd88
[0.235069] RBP:  R08:  R09: 91a50058b768
[0.235070] R10: 91a7f79f R11: bc2c2030 R12: 9c774063bd88
[0.235070] R13: 9c774063bd60 R14: 9c774063be48 R15: 91a77af58900
[0.235072] FS:  0029c800() GS:91a7f7bc() 
knlGS:
[0.235073] CS:  0010 DS:  ES:  CR0: 80050033
[0.235073] CR2: 7ab6c1fabad0 CR3: 00037a004000 CR4: 00350ea0
[0.235074] Call Trace:
[0.235077]  seq_read+0x127/0x150
[0.235078]  proc_reg_read+0x42/0xa0
[0.235080]  do_iter_read+0x14c/0x1e0
[0.235081]  do_readv+0x18d/0x240
[0.235083]  do_syscall_64+0x33/0x70
[0.235085]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[0.235086] RIP: 0033:0x22c483
[0.235086] Code: 4e 66 48 0f 7e c8 48 83 f8 01 48 89 d0 48 83 d0 ff 48 89 
46 08 66 0f 7f 46 10 48 63 7f 78 b8 13 00 00 00 ba 02 00 00 00 0f 05 <48> 89 c7 
e8 15 bb ff ff 48 85 c0 7e 34 48 89 c1 48 2b 4c 24 08 76
[0.235087] RSP: 002b:7ffca2245ca0 EFLAGS: 0257 ORIG_RAX: 
0013
[0.235088] RAX: ffda RBX: 00a58120 RCX: 0022c483
[0.235088] RDX: 0002 RSI: 7ffca2245ca0 RDI: 0005
[0.235089] RBP:  R08: fefefefefefefeff R09: 8080808080808080
[0.235089] R10: 7ab6c1fabb20 R11: 0257 R12: 00a58120
[0.235089] R13: 7ffca2245d90 R14: 0001 R15: 7ffca2245ce7
[0.235091] CPU: 15 PID: 237 Comm: localhost Not tainted 
5.10.0-rc2-microsoft-cbl-2-g6a9f696d1627-dirty #15
[0.235091] Call Trace:
[0.235092]  dump_stack+0xa1/0xfb
[0.235094]  __warn+0x7f/0x120
[0.235095]  ? seq_read_iter+0x3b3/0x3f0
[0.235096]  report_bug+0xb1/0x110
[0.235097]  handle_bug+0x3d/0x70
[0.235098]  exc_invalid_op+0x18/0xb0
[0.235098]  asm_exc_invalid_op+0x12/0x20
[0.235100] RIP: 0010:seq_read_iter+0x3b3/0x3f0
[0.235100] Code: ba 01 00 00 00 e8 6d d2 fc ff 4c 89 e7 48 89 ee 48 8b 54 
24 10 e8 ad 8b 45 00 49 01 c5 48 29 43 18 48 89 43 10 e9 61 fe ff ff <0f> 0b e9 
6f fc ff ff 0f 0b 45 31 ed e9 0d fd ff ff 48 c7 43 18 00
[0.235101] RSP: 0018:9c774063bd08 EFLAGS: 00010246
[0.235101] RAX: 91a77ac01f00 RBX: 91a50133c348 RCX: 0001
[0.235102] RDX: 9c774063bdb8 RSI: 9c774063bd60 RDI: 9c774063bd88
[0.235102] RBP:  R08:  R09: 91a50058b768
[0.235103] R10: 91a7f79f R11: bc2c2030 R12: 9c774063bd88
[0.235103] R13: 9c774063bd60 R14: 9c774063be48 R15: 91a77af58900
[0.235104]  ? seq_open+0x70/0x70
[0.235105]  ? path_openat+0xbc0/0xc40
[0.235106]  seq_read+0x127/0x150
[0.235107]  proc_reg_read+0x42/0xa0
[0.235108]  do_iter_read+0x14c/0x1e0
[0.235109]  do_readv+0x18d/0x240
[0.235109]  do_syscall_64+0x33/0x70
[0.235110]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[0.235111] RIP: 

Re: [net-next,v2,4/5] seg6: add support for the SRv6 End.DT4 behavior

2020-11-13 Thread David Ahern
On 11/13/20 7:29 PM, Andrea Mayer wrote:
> Hi Jakub,
> 
> On Fri, 13 Nov 2020 18:01:26 -0800
> Jakub Kicinski  wrote:
> 
>>> UAPI solution 2
>>>
>>> we turn "table" into an optional parameter and we add the "vrftable" 
>>> optional
>>> parameter. DT4 can only be used with the "vrftable" (hence it is a required
>>> parameter for DT4).
>>> DT6 can be used with "vrftable" (new vrf mode) or with "table" (legacy mode)
>>> (hence it is an optional parameter for DT6).
>>>
>>> UAPI solution 2 examples:
>>>
>>> ip -6 route add 2001:db8::1/128 encap seg6local action End.DT4 vrftable 100 
>>> dev eth0
>>> ip -6 route add 2001:db8::1/128 encap seg6local action End.DT6 vrftable 100 
>>> dev eth0
>>> ip -6 route add 2001:db8::1/128 encap seg6local action End.DT6 table 100 
>>> dev eth0
>>>
>>> IMO solution 2 is nicer from UAPI POV because we always have only one 
>>> parameter, maybe solution 1 is slightly easier to implement, all in all 
>>> we prefer solution 2 but we can go for 1 if you prefer.
>>
>> Agreed, 2 looks better to me as well. But let's not conflate uABI with
>> iproute2's command line. I'm more concerned about the kernel ABI.
> 
> Sorry I was a little imprecise here. I reported only the user command 
> perspective.
> From the kernel point of view in solution 2 the vrftable will be a new
> [SEG6_LOCAL_VRFTABLE] optional parameter.
> 
>> BTW you prefer to operate on tables (and therefore require
>> net.vrf.strict_mode=1) because that's closer to the spirit of the RFC,
>> correct? As I said from the implementation perspective passing any VRF
>> ifindex down from user space to the kernel should be fine?
> 
> Yes, I definitely prefer to operate on tables (and so on the table ID) due to
> the spirit of the RFC. We have discussed in depth this design choice with
> David Ahern when implementing the DT4 patch and we are confident that 
> operating
> with VRF strict mode is a sound approach also for DT6. 
> 

I like the vrftable option. Straightforward extension from current table
argument.


Re: [PATCH v2] perf data: Allow to use stdio functions for pipe mode

2020-11-13 Thread Namhyung Kim
Gentle ping! :)


On Fri, Oct 30, 2020 at 2:47 PM Namhyung Kim  wrote:
>
> When perf data is in a pipe, it reads each event separately using the
> read(2) syscall.  This is a huge performance bottleneck when
> processing large data, as in perf inject.  Also, perf inject needs to
> use the write(2) syscall for the output.
>
> So convert it to use buffered I/O functions from the stdio library for
> pipe data.  This makes the inject-build-id bench time drop from 20ms
> to 8ms.
>
>   $ perf bench internals inject-build-id
>   # Running 'internals/inject-build-id' benchmark:
> Average build-id injection took: 8.074 msec (+- 0.013 msec)
> Average time per event: 0.792 usec (+- 0.001 usec)
> Average memory usage: 8328 KB (+- 0 KB)
> Average build-id-all injection took: 5.490 msec (+- 0.008 msec)
> Average time per event: 0.538 usec (+- 0.001 usec)
> Average memory usage: 7563 KB (+- 0 KB)
>
> This patch enables it just for perf inject when used with a pipe (which
> is the default behavior).  Maybe we could do it for perf record and/or
> report later.
>
> Signed-off-by: Namhyung Kim 
> ---
> v2: check result of fdopen()
>
>  tools/perf/builtin-inject.c |  2 ++
>  tools/perf/util/data.c  | 41 ++---
>  tools/perf/util/data.h  | 11 +-
>  tools/perf/util/header.c|  8 
>  tools/perf/util/session.c   |  7 ---
>  5 files changed, 58 insertions(+), 11 deletions(-)
>
> diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
> index 452a75fe68e5..14d6c88fed76 100644
> --- a/tools/perf/builtin-inject.c
> +++ b/tools/perf/builtin-inject.c
> @@ -853,10 +853,12 @@ int cmd_inject(int argc, const char **argv)
> .output = {
> .path = "-",
> .mode = PERF_DATA_MODE_WRITE,
> +   .use_stdio = true,
> },
> };
> struct perf_data data = {
> .mode = PERF_DATA_MODE_READ,
> +   .use_stdio = true,
> };
> int ret;
>
> diff --git a/tools/perf/util/data.c b/tools/perf/util/data.c
> index c47aa34fdc0a..05bbcb663c41 100644
> --- a/tools/perf/util/data.c
> +++ b/tools/perf/util/data.c
> @@ -174,8 +174,21 @@ static bool check_pipe(struct perf_data *data)
> is_pipe = true;
> }
>
> -   if (is_pipe)
> -   data->file.fd = fd;
> +   if (is_pipe) {
> +   if (data->use_stdio) {
> +   const char *mode;
> +
> +   mode = perf_data__is_read(data) ? "r" : "w";
> +   data->file.fptr = fdopen(fd, mode);
> +
> +   if (data->file.fptr == NULL) {
> +   data->file.fd = fd;
> +   data->use_stdio = false;
> +   }
> +   } else {
> +   data->file.fd = fd;
> +   }
> +   }
>
> return data->is_pipe = is_pipe;
>  }
> @@ -334,6 +347,9 @@ int perf_data__open(struct perf_data *data)
> if (check_pipe(data))
> return 0;
>
> +   /* currently it allows stdio for pipe only */
> +   data->use_stdio = false;
> +
> if (!data->path)
> data->path = "perf.data";
>
> @@ -353,7 +369,21 @@ void perf_data__close(struct perf_data *data)
> perf_data__close_dir(data);
>
> zfree(>file.path);
> -   close(data->file.fd);
> +
> +   if (data->use_stdio)
> +   fclose(data->file.fptr);
> +   else
> +   close(data->file.fd);
> +}
> +
> +ssize_t perf_data__read(struct perf_data *data, void *buf, size_t size)
> +{
> +   if (data->use_stdio) {
> +   if (fread(buf, size, 1, data->file.fptr) == 1)
> +   return size;
> +   return feof(data->file.fptr) ? 0 : -1;
> +   }
> +   return readn(data->file.fd, buf, size);
>  }
>
>  ssize_t perf_data_file__write(struct perf_data_file *file,
> @@ -365,6 +395,11 @@ ssize_t perf_data_file__write(struct perf_data_file 
> *file,
>  ssize_t perf_data__write(struct perf_data *data,
>   void *buf, size_t size)
>  {
> +   if (data->use_stdio) {
> +   if (fwrite(buf, size, 1, data->file.fptr) == 1)
> +   return size;
> +   return -1;
> +   }
> return perf_data_file__write(>file, buf, size);
>  }
>
> diff --git a/tools/perf/util/data.h b/tools/perf/util/data.h
> index 75947ef6bc17..c563fcbb0288 100644
> --- a/tools/perf/util/data.h
> +++ b/tools/perf/util/data.h
> @@ -2,6 +2,7 @@
>  #ifndef __PERF_DATA_H
>  #define __PERF_DATA_H
>
> +#include 
>  #include 
>
>  enum perf_data_mode {
> @@ -16,7 +17,10 @@ enum perf_dir_version {
>
>  struct perf_data_file {
> char*path;
> -   int  fd;
> +   union {
> +   int  fd;
> +   FILE*fptr;
> +   };
> 
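
As background, the gist of the change is simply wrapping the pipe fd in
a buffered stream; a minimal userspace sketch of the same pattern
(illustrative only, not perf code):

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Wrap an already-open pipe fd in a buffered FILE stream; callers fall
 * back to raw read(2)/write(2) on the fd when fdopen() fails, which is
 * what check_pipe() above does. */
static FILE *wrap_pipe(int fd, int for_read)
{
	return fdopen(fd, for_read ? "r" : "w");
}

/* Buffered read with the same contract as perf_data__read(): returns
 * size on success, 0 on EOF, -1 on error. */
static ssize_t buffered_read(FILE *fp, void *buf, size_t size)
{
	if (fread(buf, size, 1, fp) == 1)
		return (ssize_t)size;
	return feof(fp) ? 0 : -1;
}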

Re: [PATCH 1/1] RFC: add pidfd_send_signal flag to reclaim mm while killing a process

2020-11-13 Thread Suren Baghdasaryan
On Fri, Nov 13, 2020 at 6:16 PM Andrew Morton  wrote:
>
> On Fri, 13 Nov 2020 17:57:02 -0800 Suren Baghdasaryan  
> wrote:
>
> > On Fri, Nov 13, 2020 at 5:18 PM Andrew Morton  
> > wrote:
> > >
> > > On Fri, 13 Nov 2020 17:09:37 -0800 Suren Baghdasaryan  
> > > wrote:
> > >
> > > > > > > Seems to me that the ability to reap another process's memory is a
> > > > > > > generally useful one, and that it should not be tied to 
> > > > > > > delivering a
> > > > > > > signal in this fashion.
> > > > > > >
> > > > > > > And we do have the new process_madvise(MADV_PAGEOUT).  It may 
> > > > > > > need a
> > > > > > > few changes and tweaks, but can't that be used to solve this 
> > > > > > > problem?
> > > > > >
> > > > > > Thank you for the feedback, Andrew. process_madvise(MADV_DONTNEED) 
> > > > > > was
> > > > > > one of the options recently discussed in
> > > > > > https://lore.kernel.org/linux-api/cajucfpgz1kpm3g1gzh+09z7aowkg05qsammisj7h5mdmrrr...@mail.gmail.com
> > > > > > . The thread describes some of the issues with that approach but if 
> > > > > > we
> > > > > > limit it to processes with pending SIGKILL only then I think that
> > > > > > would be doable.
> > > > >
> > > > > Why would it be necessary to read /proc/pid/maps?  I'd have thought
> > > > > that a starting effort would be
> > > > >
> > > > > madvise((void *)0, (void *)-1, MADV_PAGEOUT)
> > > > >
> > > > > (after translation into process_madvise() speak).  Which is equivalent
> > > > > to the proposed process_madvise(MADV_DONTNEED_MM)?
> > > >
> > > > Yep, this is very similar to option #3 in
> > > > https://lore.kernel.org/linux-api/cajucfpgz1kpm3g1gzh+09z7aowkg05qsammisj7h5mdmrrr...@mail.gmail.com
> > > > and I actually have a tested prototype for that.
> > >
> > > Why is the `vector=NULL' needed?  Can't `vector' point at a single iovec
> > > which spans the whole address range?
> >
> > That would be the option #4 from the same discussion and the issues
> > noted there are "process_madvise return value can't handle such a
> > large number of bytes and there is MAX_RW_COUNT limit on max number of
> > bytes one process_madvise call can handle". In my prototype I have a
> > special handling for such "bulk operation" to work around the
> > MAX_RW_COUNT limitation.
>
> Ah, OK, return value.  Maybe process_madvise() shouldn't have done that
> and should have simply returned 0 on success, like madvise().
>
> I guess a special "nuke whole address space" command is OK.  But, again
> in the search for generality, the ability to nuke very large amounts of
> address space (but not the entire address space) would be better.
>
> The process_madvise() return value issue could be addressed by adding a
> process_madvise() mode which returns 0 on success.
>
> And I guess the MAX_RW_COUNT issue is solvable by adding an
> import_iovec() arg to say "don't check that".  Along those lines.
>
> It's all sounding a bit painful (but not *too* painful).  But to
> reiterate, I do think that adding the ability for a process to shoot
> down a large amount of another process's memory is a lot more generally
> useful than tying it to SIGKILL, agree?

I see. So you are suggesting a mode where process_madvise() can
operate on large areas spanning multiple VMAs. This slightly differs
from option 4 in the previous RFC which suggested a special mode that
operates on the *entire* mm of the process. I agree, your suggestion
is more generic.

>
> > >
> > > > If that's the
> > > > preferred method then I can post it quite quickly.
> > >
> > > I assume you've tested that prototype.  How did its usefulness compare
> > > with this SIGKILL-based approach?
> >
> > Just to make sure I understand correctly your question, you are asking
> > about performance comparison of:
> >
> > // approach in this RFC
> > pidfd_send_signal(SIGKILL, SYNC_REAP_MM)
> >
> > vs
> >
> > // option #4 in the previous RFC
> > kill(SIGKILL); process_madvise(vector=NULL, MADV_DONTNEED);
> >
> > If so, I have results for the current RFC approach but the previous
> > approach was tested on an older device, so I don't have
> > apples-to-apples comparison results at the moment. I can collect the
> > data for fair comparison if desired, however I don't expect a
> > noticeable performance difference since they both do pretty much the
> > same thing (even on different devices my results are quite close). I
> > think it's more a question of which API would be more appropriate.
>
> OK.  I wouldn't expect performance to be very different (and things can
> be sped up if so), but the API usefulness might be an issue.  Using
> process_madvise() (or similar) makes it a two-step operation, whereas
> tying it to SIGKILL&_UNINTERRUPTIBLE provides a more precise tool.
> Any thoughts on this?
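
For concreteness, the two-step variant under discussion would look
roughly like this from userspace (a hypothetical sketch: MADV_DONTNEED
support in process_madvise() and the whole-address-space iovec are
exactly the proposed extensions, not current uAPI):

#include <signal.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical: kill the target, then ask the kernel to reap its whole
 * address space in one call.  Assumes a process_madvise() mode that
 * returns 0 on success and skips the MAX_RW_COUNT check. */
static int kill_and_reap(int pidfd, pid_t pid)
{
	struct iovec whole = {
		.iov_base = NULL,	/* start of the address space */
		.iov_len  = (size_t)-1,	/* "everything" */
	};

	if (kill(pid, SIGKILL))
		return -1;
	return syscall(__NR_process_madvise, pidfd, &whole, 1,
		       MADV_DONTNEED, 0);
}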
>


[RFC] depopulate_range_driver_managed() for removing page-table mappings for hot-added memory blocks

2020-11-13 Thread Sudarshan Rajagopalan



Hello,

When memory blocks are removed, along with removing the memmap entries,
the memory resource and the memory block devices, the arch-specific
arch_remove_memory() is called, which takes care of tearing down the
page tables.


Suppose there's a use case where the removed memory blocks will be added
back into the system at a later point. We can remove/offline a block in
a way that all entries such as memmaps, memory resources and block
devices are kept intact, so that they won't need to be created again
when the blocks are added back. This can be done by offlining alone.
But consider a special use case where the page-table entries need to be
torn down when blocks are offlined, in order to avoid speculative
accesses to the offlined memory region, while the memmap entries and
block devices are still kept intact. For that, I was thinking we could
implement something like {populate|depopulate}_range_driver_managed(),
callable after online/offline, to create/tear down the page-table
mappings for that range. This would avoid the need to do
remove_memory() entirely just for the sake of removing the page-table
entries. We can now just offline the block and call
depopulate_range_driver_managed().


This basically isolates arch_{add,remove}_memory() from the
add/remove_memory() routines, so that a driver can choose whether it
needs to just offline the blocks and remove the page-table mappings, or
hot-remove the memory entirely. It gives drivers the flexibility to
retain the memmap entries, the memory resource and the block devices,
so that their creation can be skipped when blocks are added back; this
helps reduce the latency of removing and adding memory blocks.
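
A rough sketch of what the proposed interface might look like (purely
illustrative; the names and signatures are placeholders until the
actual patch is posted):

/* Re-create the direct-map page tables for a range of hot-added memory
 * that is currently offline but still has its memmap, memory resource
 * and block devices intact. */
int populate_range_driver_managed(u64 start, u64 size, int nid);

/* Tear down only the page-table mappings for the range, leaving the
 * memmap entries, memory resource and block devices in place so that
 * re-onlining can skip recreating them. */
void depopulate_range_driver_managed(u64 start, u64 size, int nid);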


I'm still in the process of creating the patch that implements this,
which would give a clearer view of this RFC, but I'm putting the thought
out here to see whether it makes sense or not.



Sudarshan
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a 
Linux Foundation Collaborative Project


[PATCH] perf stat: Take cgroups into account for shadow stats

2020-11-13 Thread Namhyung Kim
As of now it doesn't consider cgroups when collecting shadow stats and
metrics, so counter values from different cgroups will be saved in the
same slot.  This results in incorrect numbers when those cgroups run
different workloads.

For example, let's look at the below - cgroups A and C run the same
workload, which burns a CPU, while cgroup B runs a light workload.

  $ perf stat -a -e cycles,instructions --for-each-cgroup A,B,C  sleep 1

   Performance counter stats for 'system wide':

     3,958,116,522      cycles                    A
     6,722,650,929      instructions              A    #    2.53  insn per cycle
         1,132,741      cycles                    B
           571,743      instructions              B    #    0.00  insn per cycle
     4,007,799,935      cycles                    C
     6,793,181,523      instructions              C    #    2.56  insn per cycle

   1.001050869 seconds time elapsed

When I run perf stat with a single workload, it usually shows an IPC
around 1.7.  We can verify it (6,722,650,929.0 / 3,958,116,522 = 1.698)
for cgroup A.

But in this case, since cgroups are ignored, the cycle counts are
averaged, so the lower value was used for the IPC calculation, resulting
in around 2.5.

  avg cycle: (3958116522 + 1132741 + 4007799935) / 3 = 2655683066
  IPC (A)  :  6722650929 / 2655683066 = 2.531
  IPC (B)  :  571743 / 2655683066 = 0.0002
  IPC (C)  :  6793181523 / 2655683066 = 2.557

We can simply compare cgroup pointers in the evsel and it'll be NULL
when cgroups are not specified.  With this patch, I can see correct
numbers like below:

  $ perf stat -a -e cycles,instructions --for-each-cgroup A,B,C  sleep 1

  Performance counter stats for 'system wide':

     4,171,051,687      cycles                    A
     7,219,793,922      instructions              A    #    1.73  insn per cycle
         1,051,189      cycles                    B
           583,102      instructions              B    #    0.55  insn per cycle
     4,171,124,710      cycles                    C
     7,192,944,580      instructions              C    #    1.72  insn per cycle

   1.007909814 seconds time elapsed

Signed-off-by: Namhyung Kim 
---
 tools/perf/util/stat-shadow.c | 243 ++
 1 file changed, 132 insertions(+), 111 deletions(-)

diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
index 901265127e36..10d0f5a0fd4a 100644
--- a/tools/perf/util/stat-shadow.c
+++ b/tools/perf/util/stat-shadow.c
@@ -8,6 +8,7 @@
 #include "evlist.h"
 #include "expr.h"
 #include "metricgroup.h"
+#include "cgroup.h"
 #include 
 
 /*
@@ -28,6 +29,7 @@ struct saved_value {
enum stat_type type;
int ctx;
int cpu;
+   struct cgroup *cgrp;
struct runtime_stat *stat;
struct stats stats;
u64 metric_total;
@@ -57,6 +59,9 @@ static int saved_value_cmp(struct rb_node *rb_node, const 
void *entry)
if (a->ctx != b->ctx)
return a->ctx - b->ctx;
 
+   if (a->cgrp != b->cgrp)
+   return (char *)a->cgrp < (char *)b->cgrp ? -1 : +1;
+
if (a->evsel == NULL && b->evsel == NULL) {
if (a->stat == b->stat)
return 0;
@@ -100,7 +105,8 @@ static struct saved_value *saved_value_lookup(struct evsel 
*evsel,
  bool create,
  enum stat_type type,
  int ctx,
- struct runtime_stat *st)
+ struct runtime_stat *st,
+ struct cgroup *cgrp)
 {
struct rblist *rblist;
struct rb_node *nd;
@@ -110,6 +116,7 @@ static struct saved_value *saved_value_lookup(struct evsel 
*evsel,
.type = type,
.ctx = ctx,
.stat = st,
+   .cgrp = cgrp,
};
 
rblist = >value_list;
@@ -193,10 +200,11 @@ void perf_stat__reset_shadow_per_stat(struct runtime_stat 
*st)
 
 static void update_runtime_stat(struct runtime_stat *st,
enum stat_type type,
-   int ctx, int cpu, u64 count)
+   int ctx, int cpu, u64 count,
+   struct cgroup *cgrp)
 {
struct saved_value *v = saved_value_lookup(NULL, cpu, true,
-  type, ctx, st);
+  type, ctx, st, cgrp);
 
if (v)
update_stats(>stats, count);
@@ -212,80 +220,81 @@ void perf_stat__update_shadow_stats(struct evsel 
*counter, u64 count,
 {
int ctx = evsel_context(counter);
u64 count_ns = count;
+   struct cgroup *cgrp = counter->cgrp;
struct saved_value *v;
 
count *= counter->scale;
 
if (evsel__is_clock(counter))
-   update_runtime_stat(st, STAT_NSECS, 0, cpu, count_ns);
+ 

Re: [PATCH net-next] net: phy: mscc: Add PTP support for 2 more VSC PHYs

2020-11-13 Thread patchwork-bot+netdevbpf
Hello:

This patch was applied to netdev/net-next.git (refs/heads/master):

On Thu, 12 Nov 2020 10:22:50 +0100 you wrote:
> Add VSC8572 and VSC8574 in the PTP configuration
> as they also support PTP.
> 
> The relevant datasheets can be found here:
>   - VSC8572: https://www.microchip.com/wwwproducts/en/VSC8572
>   - VSC8574: https://www.microchip.com/wwwproducts/en/VSC8574
> 
> [...]

Here is the summary with links:
  - [net-next] net: phy: mscc: Add PTP support for 2 more VSC PHYs
https://git.kernel.org/netdev/net-next/c/774626fa440e

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




Re: [net-next,v2,4/5] seg6: add support for the SRv6 End.DT4 behavior

2020-11-13 Thread Andrea Mayer
Hi Jakub,

On Fri, 13 Nov 2020 18:01:26 -0800
Jakub Kicinski  wrote:

> > UAPI solution 2
> > 
> > we turn "table" into an optional parameter and we add the "vrftable" 
> > optional
> > parameter. DT4 can only be used with the "vrftable" (hence it is a required
> > parameter for DT4).
> > DT6 can be used with "vrftable" (new vrf mode) or with "table" (legacy mode)
> > (hence it is an optional parameter for DT6).
> > 
> > UAPI solution 2 examples:
> > 
> > ip -6 route add 2001:db8::1/128 encap seg6local action End.DT4 vrftable 100 
> > dev eth0
> > ip -6 route add 2001:db8::1/128 encap seg6local action End.DT6 vrftable 100 
> > dev eth0
> > ip -6 route add 2001:db8::1/128 encap seg6local action End.DT6 table 100 
> > dev eth0
> > 
> > IMO solution 2 is nicer from UAPI POV because we always have only one 
> > parameter, maybe solution 1 is slightly easier to implement, all in all 
> > we prefer solution 2 but we can go for 1 if you prefer.
> 
> Agreed, 2 looks better to me as well. But let's not conflate uABI with
> iproute2's command line. I'm more concerned about the kernel ABI.

Sorry I was a little imprecise here. I reported only the user command 
perspective.
From the kernel point of view in solution 2 the vrftable will be a new
[SEG6_LOCAL_VRFTABLE] optional parameter.

> BTW you prefer to operate on tables (and therefore require
> net.vrf.strict_mode=1) because that's closer to the spirit of the RFC,
> correct? As I said from the implementation perspective passing any VRF
> ifindex down from user space to the kernel should be fine?

Yes, I definitely prefer to operate on tables (and so on the table ID) due to
the spirit of the RFC. We have discussed in depth this design choice with
David Ahern when implementing the DT4 patch and we are confident that operating
with VRF strict mode is a sound approach also for DT6. 

Thanks,
Andrea


Re: [PATCH 1/1] RFC: add pidfd_send_signal flag to reclaim mm while killing a process

2020-11-13 Thread Andrew Morton
On Fri, 13 Nov 2020 17:57:02 -0800 Suren Baghdasaryan  wrote:

> On Fri, Nov 13, 2020 at 5:18 PM Andrew Morton  
> wrote:
> >
> > On Fri, 13 Nov 2020 17:09:37 -0800 Suren Baghdasaryan  
> > wrote:
> >
> > > > > > Seems to me that the ability to reap another process's memory is a
> > > > > > generally useful one, and that it should not be tied to delivering a
> > > > > > signal in this fashion.
> > > > > >
> > > > > > And we do have the new process_madvise(MADV_PAGEOUT).  It may need a
> > > > > > few changes and tweaks, but can't that be used to solve this 
> > > > > > problem?
> > > > >
> > > > > Thank you for the feedback, Andrew. process_madvise(MADV_DONTNEED) was
> > > > > one of the options recently discussed in
> > > > > https://lore.kernel.org/linux-api/cajucfpgz1kpm3g1gzh+09z7aowkg05qsammisj7h5mdmrrr...@mail.gmail.com
> > > > > . The thread describes some of the issues with that approach but if we
> > > > > limit it to processes with pending SIGKILL only then I think that
> > > > > would be doable.
> > > >
> > > > Why would it be necessary to read /proc/pid/maps?  I'd have thought
> > > > that a starting effort would be
> > > >
> > > > madvise((void *)0, (void *)-1, MADV_PAGEOUT)
> > > >
> > > > (after translation into process_madvise() speak).  Which is equivalent
> > > > to the proposed process_madvise(MADV_DONTNEED_MM)?
> > >
> > > Yep, this is very similar to option #3 in
> > > https://lore.kernel.org/linux-api/cajucfpgz1kpm3g1gzh+09z7aowkg05qsammisj7h5mdmrrr...@mail.gmail.com
> > > and I actually have a tested prototype for that.
> >
> > Why is the `vector=NULL' needed?  Can't `vector' point at a single iovec
> > which spans the whole address range?
> 
> That would be the option #4 from the same discussion and the issues
> noted there are "process_madvise return value can't handle such a
> large number of bytes and there is MAX_RW_COUNT limit on max number of
> bytes one process_madvise call can handle". In my prototype I have a
> special handling for such "bulk operation" to work around the
> MAX_RW_COUNT limitation.

Ah, OK, return value.  Maybe process_madvise() shouldn't have done that
and should have simply returned 0 on success, like madvise().

I guess a special "nuke whole address space" command is OK.  But, again
in the search for generality, the ability to nuke very large amounts of
address space (but not the entire address space) would be better. 

The process_madvise() return value issue could be addressed by adding a
process_madvise() mode which returns 0 on success.

And I guess the MAX_RW_COUNT issue is solvable by adding an
import_iovec() arg to say "don't check that".  Along those lines.

It's all sounding a bit painful (but not *too* painful).  But to
reiterate, I do think that adding the ability for a process to shoot
down a large amount of another process's memory is a lot more generally
useful than tying it to SIGKILL, agree?

> >
> > > If that's the
> > > preferred method then I can post it quite quickly.
> >
> > I assume you've tested that prototype.  How did its usefulness compare
> > with this SIGKILL-based approach?
> 
> Just to make sure I understand correctly your question, you are asking
> about performance comparison of:
> 
> // approach in this RFC
> pidfd_send_signal(SIGKILL, SYNC_REAP_MM)
> 
> vs
> 
> // option #4 in the previous RFC
> kill(SIGKILL); process_madvise(vector=NULL, MADV_DONTNEED);
> 
> If so, I have results for the current RFC approach but the previous
> approach was tested on an older device, so I don't have
> apples-to-apples comparison results at the moment. I can collect the
> data for fair comparison if desired, however I don't expect a
> noticeable performance difference since they both do pretty much the
> same thing (even on different devices my results are quite close). I
> think it's more a question of which API would be more appropriate.

OK.  I wouldn't expect performance to be very different (and things can
be sped up if so), but the API usefulness might be an issue.  Using
process_madvise() (or similar) makes it a two-step operation, whereas
tying it to SIGKILL&_UNINTERRUPTIBLE provides a more precise tool.
Any thoughts on this?



[GIT PULL] hwmon fixes for v5.10-rc4

2020-11-13 Thread Guenter Roeck
Hi Linus,

Please pull hwmon fixes for Linux v5.10-rc4 from signed tag:

git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging.git 
hwmon-for-v5.10-rc4

Thanks,
Guenter
--

The following changes since commit 3650b228f83adda7e5ee532e2b90429c03f7b9ec:

  Linux 5.10-rc1 (2020-10-25 15:14:11 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging.git 
tags/hwmon-for-v5.10-rc4

for you to fetch changes up to 60268b0e8258fdea9a3c9f4b51e161c123571db3:

  hwmon: (amd_energy) modify the visibility of the counters (2020-11-13 
06:46:20 -0800)


hwmon fixes for v5.10-rc4

Fix potential buffer overflow in pmbus/max20730 driver
Fix locking issue in pmbus core
Fix regression causing timeouts in applesmc driver
Fix RPM calculation in pwm-fan driver
Restrict counter visibility in amd_energy driver


Brad Campbell (1):
  hwmon: (applesmc) Re-work SMC comms

Dan Carpenter (1):
  hwmon: (pmbus/max20730) use scnprintf() instead of snprintf()

Naveen Krishna Chatradhi (1):
  hwmon: (amd_energy) modify the visibility of the counters

Paul Barker (1):
  hwmon: (pwm-fan) Fix RPM calculation

Robert Hancock (1):
  hwmon: (pmbus) Add mutex locking for sysfs reads

 drivers/hwmon/amd_energy.c   |   2 +-
 drivers/hwmon/applesmc.c | 130 ---
 drivers/hwmon/pmbus/max20730.c   |  26 
 drivers/hwmon/pmbus/pmbus_core.c |  13 +++-
 drivers/hwmon/pwm-fan.c  |  16 ++---
 5 files changed, 115 insertions(+), 72 deletions(-)


Re: [net-next,v2,4/5] seg6: add support for the SRv6 End.DT4 behavior

2020-11-13 Thread Jakub Kicinski
On Sat, 14 Nov 2020 02:50:58 +0100 Andrea Mayer wrote:
> Hi Jakub,
> Please see my responses inline:
> 
> On Fri, 13 Nov 2020 15:54:37 -0800
> Jakub Kicinski  wrote:
> 
> > On Sat, 14 Nov 2020 00:00:24 +0100 Andrea Mayer wrote:  
> > > On Fri, 13 Nov 2020 13:40:10 -0800
> > > Jakub Kicinski  wrote:
> > > 
> > > I can tackle the v6 version but how do we face the compatibility issue 
> > > raised
> > > by Stefano in his message?
> > > 
> > > if it is ok to implement a uAPI that breaks the existing scripts, it is 
> > > relatively
> > > easy to replicate the VRF-based approach also in v6.  
> > 
> > We need to keep existing End.DT6 as is, and add a separate
> > implementation.  
> 
> ok
> 
> >
> > The way to distinguish between the two could be either by  
> 
> > 1) passing via
> > netlink a flag attribute (which would request use of VRF in both
> > cases);  
> 
> yes, feasible... see UAPI solution 1
> 
> > 2) using a different attribute than SEG6_LOCAL_TABLE for the
> > table id (or perhaps passing VRF's ifindex instead), e.g.
> > SEG6_LOCAL_TABLE_VRF;  
> 
> yes, feasible... see UAPI solution 2
> 
> > 3) or adding a new command
> > (SEG6_LOCAL_ACTION_END_DT6_VRF) which would behave like End.DT4.  
> 
> no, we prefer not to add a new command, because it is better to keep a 
> semantic one-to-one relationship between these commands and the SRv6 
> behaviors defined in the draft.
> 
> 
> UAPI solution 1
> 
> we add a new parameter "vrfmode". DT4 can only be used with the 
> vrfmode parameter (hence it is a required parameter for DT4).
> DT6 can be used with "vrfmode" (new vrf based mode) or without "vrfmode" 
> (legacy mode)(hence "vrfmode" is an optional parameter for DT6)
> 
> UAPI solution 1 examples:
> 
> ip -6 route add 2001:db8::1/128 encap seg6local action End.DT4 vrfmode table 
> 100 dev eth0
> ip -6 route add 2001:db8::1/128 encap seg6local action End.DT6 vrfmode table 
> 100 dev eth0
> ip -6 route add 2001:db8::1/128 encap seg6local action End.DT6 table 100 dev 
> eth0
> 
> UAPI solution 2
> 
> we turn "table" into an optional parameter and we add the "vrftable" optional
> parameter. DT4 can only be used with the "vrftable" (hence it is a required
> parameter for DT4).
> DT6 can be used with "vrftable" (new vrf mode) or with "table" (legacy mode)
> (hence it is an optional parameter for DT6).
> 
> UAPI solution 2 examples:
> 
> ip -6 route add 2001:db8::1/128 encap seg6local action End.DT4 vrftable 100 
> dev eth0
> ip -6 route add 2001:db8::1/128 encap seg6local action End.DT6 vrftable 100 
> dev eth0
> ip -6 route add 2001:db8::1/128 encap seg6local action End.DT6 table 100 dev 
> eth0
> 
> IMO solution 2 is nicer from the UAPI POV because we always have only one 
> parameter; maybe solution 1 is slightly easier to implement. All in all 
> we prefer solution 2, but we can go for 1 if you prefer.  

Agreed, 2 looks better to me as well. But let's not conflate uABI with
iproute2's command line. I'm more concerned about the kernel ABI.

BTW, you prefer to operate on tables (and therefore require
net.vrf.strict_mode=1) because that's closer to the spirit of the RFC,
correct? And, as I said, from the implementation perspective, passing any
VRF ifindex down from user space to the kernel should be fine?


Re: [PATCH 1/1] RFC: add pidfd_send_signal flag to reclaim mm while killing a process

2020-11-13 Thread Suren Baghdasaryan
On Fri, Nov 13, 2020 at 5:18 PM Andrew Morton  wrote:
>
> On Fri, 13 Nov 2020 17:09:37 -0800 Suren Baghdasaryan  
> wrote:
>
> > > > > Seems to me that the ability to reap another process's memory is a
> > > > > generally useful one, and that it should not be tied to delivering a
> > > > > signal in this fashion.
> > > > >
> > > > > And we do have the new process_madvise(MADV_PAGEOUT).  It may need a
> > > > > few changes and tweaks, but can't that be used to solve this problem?
> > > >
> > > > Thank you for the feedback, Andrew. process_madvise(MADV_DONTNEED) was
> > > > one of the options recently discussed in
> > > > https://lore.kernel.org/linux-api/cajucfpgz1kpm3g1gzh+09z7aowkg05qsammisj7h5mdmrrr...@mail.gmail.com
> > > > . The thread describes some of the issues with that approach but if we
> > > > limit it to processes with pending SIGKILL only then I think that
> > > > would be doable.
> > >
> > > Why would it be necessary to read /proc/pid/maps?  I'd have thought
> > > that a starting effort would be
> > >
> > > madvise((void *)0, (void *)-1, MADV_PAGEOUT)
> > >
> > > (after translation into process_madvise() speak).  Which is equivalent
> > > to the proposed process_madvise(MADV_DONTNEED_MM)?
> >
> > Yep, this is very similar to option #3 in
> > https://lore.kernel.org/linux-api/cajucfpgz1kpm3g1gzh+09z7aowkg05qsammisj7h5mdmrrr...@mail.gmail.com
> > and I actually have a tested prototype for that.
>
> Why is the `vector=NULL' needed?  Can't `vector' point at a single iovec
> which spans the whole address range?

That would be option #4 from the same discussion, and the issues
noted there are "process_madvise return value can't handle such a
large number of bytes and there is MAX_RW_COUNT limit on max number of
bytes one process_madvise call can handle". In my prototype I have
special handling for such a "bulk operation" to work around the
MAX_RW_COUNT limitation.

>
> > If that's the
> > preferred method then I can post it quite quickly.
>
> I assume you've tested that prototype.  How did its usefulness compare
> with this SIGKILL-based approach?

Just to make sure I understand your question correctly: you are asking
about a performance comparison of:

// approach in this RFC
pidfd_send_signal(SIGKILL, SYNC_REAP_MM)

vs

// option #4 in the previous RFC
kill(SIGKILL); process_madvise(vector=NULL, MADV_DONTNEED);

If so, I have results for the current RFC approach, but the previous
approach was tested on an older device, so I don't have
apples-to-apples comparison results at the moment. I can collect the
data for a fair comparison if desired; however, I don't expect a
noticeable performance difference since they both do pretty much the
same thing (even on different devices my results are quite close). I
think it's more a question of which API would be more appropriate.
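
For reference, the option #4 call shape from userspace is roughly the
following. This is a minimal sketch: it assumes a kernel where
process_madvise() accepts MADV_DONTNEED and special-cases the
whole-address-space iovec (the behavior under discussion, not something
mainline does today), plus headers that define __NR_process_madvise.

#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

/* Reap everything in the target's address space with one call. */
static long reap_mm(int pidfd)
{
	struct iovec whole = {
		.iov_base = NULL,	/* start of the address space */
		.iov_len = (size_t)-1,	/* ... through the end */
	};

	/* glibc has no wrapper yet, so invoke the syscall directly. */
	return syscall(__NR_process_madvise, pidfd, &whole, 1,
		       MADV_DONTNEED, 0);
}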

>


Re: [net-next,v2,4/5] seg6: add support for the SRv6 End.DT4 behavior

2020-11-13 Thread Andrea Mayer
Hi Jakub,
Please see my responses inline:

On Fri, 13 Nov 2020 15:54:37 -0800
Jakub Kicinski  wrote:

> On Sat, 14 Nov 2020 00:00:24 +0100 Andrea Mayer wrote:
> > On Fri, 13 Nov 2020 13:40:10 -0800
> > Jakub Kicinski  wrote:
> > 
> > I can tackle the v6 version but how do we face the compatibility issue 
> > raised
> > by Stefano in his message?
> > 
> > if it is ok to implement a uAPI that breaks the existing scripts, it is 
> > relatively
> > easy to replicate the VRF-based approach also in v6.
> 
> We need to keep existing End.DT6 as is, and add a separate
> implementation.

ok

>
> The way to distinguish between the two could be either by

> 1) passing via
> netlink a flag attribute (which would request use of VRF in both
> cases);

yes, feasible... see UAPI solution 1

> 2) using a different attribute than SEG6_LOCAL_TABLE for the
> table id (or perhaps passing VRF's ifindex instead), e.g.
> SEG6_LOCAL_TABLE_VRF;

yes, feasible... see UAPI solution 2

> 3) or adding a new command
> (SEG6_LOCAL_ACTION_END_DT6_VRF) which would behave like End.DT4.

no, we prefer not to add a new command, because it is better to keep a 
semantic one-to-one relationship between these commands and the SRv6 
behaviors defined in the draft.


UAPI solution 1

we add a new parameter "vrfmode". DT4 can only be used with the 
vrfmode parameter (hence it is a required parameter for DT4).
DT6 can be used with "vrfmode" (new vrf based mode) or without "vrfmode" 
(legacy mode) (hence "vrfmode" is an optional parameter for DT6).

UAPI solution 1 examples:

ip -6 route add 2001:db8::1/128 encap seg6local action End.DT4 vrfmode table 
100 dev eth0
ip -6 route add 2001:db8::1/128 encap seg6local action End.DT6 vrfmode table 
100 dev eth0
ip -6 route add 2001:db8::1/128 encap seg6local action End.DT6 table 100 dev 
eth0

UAPI solution 2

we turn "table" into an optional parameter and we add the "vrftable" optional
parameter. DT4 can only be used with the "vrftable" (hence it is a required
parameter for DT4).
DT6 can be used with "vrftable" (new vrf mode) or with "table" (legacy mode)
(hence it is an optional parameter for DT6).

UAPI solution 2 examples:

ip -6 route add 2001:db8::1/128 encap seg6local action End.DT4 vrftable 100 dev 
eth0
ip -6 route add 2001:db8::1/128 encap seg6local action End.DT6 vrftable 100 dev 
eth0
ip -6 route add 2001:db8::1/128 encap seg6local action End.DT6 table 100 dev 
eth0

IMO solution 2 is nicer from the UAPI POV because we always have only one 
parameter; maybe solution 1 is slightly easier to implement. All in all 
we prefer solution 2, but we can go for 1 if you prefer.
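
For concreteness, a sketch of what solution 2 could mean on the kernel
side of the uAPI; the SEG6_LOCAL_VRFTABLE name is illustrative only and
nothing here is settled:

/* include/uapi/linux/seg6_local.h (sketch): one new optional
 * attribute next to the existing SEG6_LOCAL_TABLE. */
enum {
	SEG6_LOCAL_UNSPEC,
	SEG6_LOCAL_ACTION,
	SEG6_LOCAL_SRH,
	SEG6_LOCAL_TABLE,	/* existing: legacy table lookup */
	SEG6_LOCAL_NH4,
	SEG6_LOCAL_NH6,
	SEG6_LOCAL_IIF,
	SEG6_LOCAL_OIF,
	SEG6_LOCAL_BPF,
	SEG6_LOCAL_VRFTABLE,	/* new: table id resolved via a VRF */
	__SEG6_LOCAL_MAX,
};

With that shape, End.DT4 would require SEG6_LOCAL_VRFTABLE and reject
SEG6_LOCAL_TABLE, while End.DT6 would accept either one, matching the
examples above.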

Waiting for your advice!

Thanks,
Andrea


Re: [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner.

2020-11-13 Thread Roman Gushchin
On Fri, Nov 13, 2020 at 08:08:58PM -0500, Zi Yan wrote:
> On 13 Nov 2020, at 19:15, Roman Gushchin wrote:
> 
> > On Wed, Nov 11, 2020 at 03:40:05PM -0500, Zi Yan wrote:
> > > From: Zi Yan 
> > > 
> > > It adds a new_order parameter to set new page order in page owner.
> > > It prepares for upcoming changes to support split huge page to any
> > > lower
> > > order.
> > > 
> > > Signed-off-by: Zi Yan 
> > > ---
> > >  include/linux/page_owner.h | 7 ---
> > >  mm/huge_memory.c   | 2 +-
> > >  mm/page_alloc.c| 2 +-
> > >  mm/page_owner.c| 6 +++---
> > >  4 files changed, 9 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h
> > > index 3468794f83d2..215cbb159568 100644
> > > --- a/include/linux/page_owner.h
> > > +++ b/include/linux/page_owner.h
> > > @@ -31,10 +31,11 @@ static inline void set_page_owner(struct page
> > > *page,
> > >   __set_page_owner(page, order, gfp_mask);
> > >  }
> > > 
> > > -static inline void split_page_owner(struct page *page, unsigned int
> > > nr)
> > > +static inline void split_page_owner(struct page *page, unsigned int
> > > nr,
> > > + unsigned int new_order)
> > >  {
> > >   if (static_branch_unlikely(_owner_inited))
> > > - __split_page_owner(page, nr);
> > > + __split_page_owner(page, nr, new_order);
> > >  }
> > >  static inline void copy_page_owner(struct page *oldpage, struct
> > > page *newpage)
> > >  {
> > > @@ -60,7 +61,7 @@ static inline void set_page_owner(struct page
> > > *page,
> > >  {
> > >  }
> > >  static inline void split_page_owner(struct page *page,
> > > - unsigned int order)
> > > + unsigned int nr, unsigned int new_order)
> > 
> > With the addition of the new argument it's a bit hard to understand
> > what the function is supposed to do. It seems like nr ==
> > page_order(page),
> > is it right? Maybe we can pass old_order and new_order? Or just the page
> > and the new order?
> 
> Yeah, it is a bit confusing. Please see more below.
> 
> > 
> > >  {
> > >  }
> > >  static inline void copy_page_owner(struct page *oldpage, struct
> > > page *newpage)
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index f599f5b9bf7f..8b7d771ee962 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -2459,7 +2459,7 @@ static void __split_huge_page(struct page
> > > *page, struct list_head *list,
> > > 
> > >   ClearPageCompound(head);
> > > 
> > > - split_page_owner(head, nr);
> > > + split_page_owner(head, nr, 1);
> > > 
> > >   /* See comment in __split_huge_page_tail() */
> > >   if (PageAnon(head)) {
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index d77220615fd5..a9eead0e091a 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -3284,7 +3284,7 @@ void split_page(struct page *page, unsigned
> > > int order)
> > > 
> > >   for (i = 1; i < (1 << order); i++)
> > >   set_page_refcounted(page + i);
> > > - split_page_owner(page, 1 << order);
> > > + split_page_owner(page, 1 << order, 1);
> > >  }
> > >  EXPORT_SYMBOL_GPL(split_page);
> > > 
> > > diff --git a/mm/page_owner.c b/mm/page_owner.c
> > > index b735a8eafcdb..2b7f7e9056dc 100644
> > > --- a/mm/page_owner.c
> > > +++ b/mm/page_owner.c
> > > @@ -204,7 +204,7 @@ void __set_page_owner_migrate_reason(struct page
> > > *page, int reason)
> > >   page_owner->last_migrate_reason = reason;
> > >  }
> > > 
> > > -void __split_page_owner(struct page *page, unsigned int nr)
> > > +void __split_page_owner(struct page *page, unsigned int nr,
> > > unsigned int new_order)
> > >  {
> > >   int i;
> > >   struct page_ext *page_ext = lookup_page_ext(page);
> > > @@ -213,9 +213,9 @@ void __split_page_owner(struct page *page,
> > > unsigned int nr)
> > >   if (unlikely(!page_ext))
> > >   return;
> > > 
> > > - for (i = 0; i < nr; i++) {
> > > + for (i = 0; i < nr; i += (1 << new_order)) {
> > >   page_owner = get_page_owner(page_ext);
> > > - page_owner->order = 0;
> > > + page_owner->order = new_order;
> > >   page_ext = page_ext_next(page_ext);
> > 
> > I believe there cannot be any leftovers because nr is always a power of
> > 2.
> > Is it true? Converting nr argument to order (if it's possible) will make
> > it obvious.
> 
> Right. nr = thp_nr_pages(head), which is a power of 2. There would not be
> any
> leftover.
> 
> Matthew recently converted split_page_owner to take nr instead of order.[1]
> But I am not
> sure why, since it seems to me that two call sites (__split_huge_page in
> mm/huge_memory.c and split_page in mm/page_alloc.c) can pass the order
> information.

Yeah, I'm not sure why either. Maybe Matthew has some input here?
You can also pass new_nr, but IMO orders look so much better here.
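
For illustration, the order-based variant could look like the sketch
below; the loop body is unchanged from the patch, only nr is derived
from the old order instead of being passed in:

void __split_page_owner(struct page *page, unsigned int old_order,
			unsigned int new_order)
{
	int i;
	struct page_ext *page_ext = lookup_page_ext(page);
	struct page_owner *page_owner;

	if (unlikely(!page_ext))
		return;

	for (i = 0; i < (1 << old_order); i += (1 << new_order)) {
		page_owner = get_page_owner(page_ext);
		page_owner->order = new_order;
		page_ext = page_ext_next(page_ext);
	}
}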

Thanks!


Re: [PATCH V4 3/5] arm64: dts: ti: am65/j721e: Fix up un-necessary status set to "okay" for crypto

2020-11-13 Thread J, KEERTHY




On 11/14/2020 2:48 AM, Nishanth Menon wrote:

> The default state of a device tree node is "okay". There is no specific
> use of explicitly adding status = "okay" in the SoC dtsi.

Reviewed-by: Keerthy 

> Signed-off-by: Nishanth Menon 
> Reviewed-by: Tony Lindgren 
> Acked-by: Tero Kristo 
> Cc: Keerthy 
> ---
> Change in v4: Dropped Fixes
>
> V3: https://lore.kernel.org/linux-arm-kernel/20201112183538.6805-4...@ti.com/
> V2: https://lore.kernel.org/linux-arm-kernel/20201112014929.25227-4...@ti.com/
> V1: https://lore.kernel.org/linux-arm-kernel/20201104224356.18040-4...@ti.com/
>
>  arch/arm64/boot/dts/ti/k3-am65-main.dtsi  | 1 -
>  arch/arm64/boot/dts/ti/k3-j721e-main.dtsi | 2 --
>  2 files changed, 3 deletions(-)
>
> diff --git a/arch/arm64/boot/dts/ti/k3-am65-main.dtsi b/arch/arm64/boot/dts/ti/k3-am65-main.dtsi
> index c842b9803f2d..116818912ba2 100644
> --- a/arch/arm64/boot/dts/ti/k3-am65-main.dtsi
> +++ b/arch/arm64/boot/dts/ti/k3-am65-main.dtsi
> @@ -119,7 +119,6 @@ crypto: crypto@4e0 {
>  	#address-cells = <2>;
>  	#size-cells = <2>;
>  	ranges = <0x0 0x04e0 0x00 0x04e0 0x0 0x3>;
> -	status = "okay";
>
>  	dmas = <_udmap 0xc000>, <_udmap 0x4000>,
>  	       <_udmap 0x4001>;
> diff --git a/arch/arm64/boot/dts/ti/k3-j721e-main.dtsi b/arch/arm64/boot/dts/ti/k3-j721e-main.dtsi
> index 137966c6be1f..19e602afdb05 100644
> --- a/arch/arm64/boot/dts/ti/k3-j721e-main.dtsi
> +++ b/arch/arm64/boot/dts/ti/k3-j721e-main.dtsi
> @@ -345,8 +345,6 @@ main_crypto: crypto@4e0 {
>  	#size-cells = <2>;
>  	ranges = <0x0 0x04e0 0x00 0x04e0 0x0 0x3>;
>
> -	status = "okay";
> -
>  	dmas = <_udmap 0xc000>, <_udmap 0x4000>,
>  	       <_udmap 0x4001>;
>  	dma-names = "tx", "rx1", "rx2";



Re: [PATCH v6 1/2] kunit: Support for Parameterized Testing

2020-11-13 Thread Arpitha Raghunandan
On 14/11/20 5:44 am, Marco Elver wrote:
> On Fri, 13 Nov 2020 at 23:37, David Gow  wrote:
>>
>> On Fri, Nov 13, 2020 at 6:31 PM Marco Elver  wrote:
>>>
>>> On Fri, Nov 13, 2020 at 01:17PM +0800, David Gow wrote:
 On Thu, Nov 12, 2020 at 8:37 PM Marco Elver  wrote:
>>> [...]
>> (It also might be a little tricky with the current implementation to
>> produce the test plan, as the parameters come from a generator, and I
>> don't think there's a way of getting the number of parameters ahead of
>> time. That's a problem with the sub-subtest model, too, though at
>> least there it's a little more isolated from other tests.)
>
> The whole point of generators, as I envisage it, is to also provide the
> ability for varying parameters dependent on e.g. environment,
> configuration, number of CPUs, etc. The current array-based generator is
> the simplest possible use-case.
>
> However, we *can* require generators generate a deterministic number of
> parameters when called multiple times on the same system.

>>>> I think this is a reasonable compromise, though it's not actually
>>>> essential. As I understand the TAP spec, the test plan is actually
>>>> optional (and/or can be at the end of the sequence of tests), though
>>>> kunit_tool currently only supports having it at the beginning (which
>>>> is strongly preferred by the spec anyway). I think we could get away
>>>> with having it at the bottom of the subtest results though, which
>>>> would save having to run the generator twice, when subtest support is
>>>> added to kunit_tool.
>>>
>>> I can't find this in the TAP spec, where should I look? Perhaps we
>>> shouldn't venture too far off the beaten path, given we might not be the
>>> only ones that want to parse this output.
>>>
>>
>> It's in the "Test Lines and the Plan" section:
>> "The plan is optional but if there is a plan before the test points it
>> must be the first non-diagnostic line output by the test file. In
>> certain instances a test file may not know how many test points it
>> will ultimately be running. In this case the plan can be the last
>> non-diagnostic line in the output. The plan cannot appear in the
>> middle of the output, nor can it appear more than once."
> 
> Ah, that's fine then.
> 
>> My only concern with running through the generator multiple times to
>> get the count is that it might be slow and/or more difficult if
>> someone uses a more complicated generator. I can't think of anything
>> specific yet, though, so we can always do it for now and change it
>> later if a problematic case occurs.
> 
> I'm all for simplicity, so if nobody objects, let's just get rid of
> the number of parameters and avoid running it twice.
> 
> To that end, I propose a v7 (below) that takes care of getting number of
> parameters (and also displays descriptions for each parameter where
> available).
>
> Now it is up to you how you want to turn the output from diagnostic
> lines into something TAP compliant, because now we have the number of
> parameters and can turn it into a subsubtest. But I think kunit-tool
> doesn't understand subsubtests yet, so I suggest we take these patches,
> and then somebody can prepare kunit-tool.
>

>>>> This sounds good to me. The only thing I'm not sure about is the
>>>> format of the parameter description: thus far test names must be valid C
>>>> identifier names, due to the fact they're named after the test
>>>> function. I don't think there's a fundamental reason parameters (and
>>>> hence, potentially, subsubtests) need to follow that convention as
>>>> well, but it does look a bit odd.  Equally, the square brackets around
>>>> the description shouldn't be necessary according to the TAP spec, but
>>>> do seem to make things a little more readable, particularly with the
>>>> names in the ext4 inode test. I'm not too worried about either of
>>>> those, though: I'm sure it'll look fine once I've got used to it.
>>>
>>> The parameter description doesn't need to be a C identifier. At least
>>> that's what I could immediately glean from TAP v13 spec (I'm looking
>>> here: https://testanything.org/tap-version-13-specification.html and see
>>> e.g. "ok 1 - Input file opened" ...).
>>>
>>
>> Yeah: it looked a bit weird for everything else to be an identifier
>> (given that KUnit does require it for tests), but these parameter
>> descriptions not to be. It's not a problem, though, so let's go ahead
>> with it.
>>
>>> [...]
>> In any case, I'm happy to leave the final decision here to Arpitha and
>> Marco, so long as we don't actually violate the TAP/KTAP spec and
>> kunit_tool is able to read at least the top-level result. My
>> preference is still to go either with the "# [test_case->name]:
>> [ok|not ok] [index] - param-[index]", or to get rid of the
>> per-parameter results entirely for now (or just print out a diagnostic
>> message on failure). In 

Re: [RFC PATCH 5/9] cxl/mem: Find device capabilities

2020-11-13 Thread Ben Widawsky
On 20-11-13 12:26:03, Bjorn Helgaas wrote:
> On Tue, Nov 10, 2020 at 09:43:52PM -0800, Ben Widawsky wrote:
> > CXL devices contain an array of capabilities that describe how
> > software can interact with the device, or with firmware running
> > on the device. A CXL compliant device must implement the device status
> > and the mailbox capability. A CXL compliant memory device must implement
> > the memory device capability.
> > 
> > Each of the capabilities can [will] provide an offset within the MMIO
> > region for interacting with the CXL device.
> > 
> > Signed-off-by: Ben Widawsky 
> > ---
> >  drivers/cxl/cxl.h | 89 +++
> >  drivers/cxl/mem.c | 58 +++---
> >  2 files changed, 143 insertions(+), 4 deletions(-)
> >  create mode 100644 drivers/cxl/cxl.h
> > 
> > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > new file mode 100644
> > index ..02858ae63d6d
> > --- /dev/null
> > +++ b/drivers/cxl/cxl.h
> > @@ -0,0 +1,89 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +// Copyright(c) 2020 Intel Corporation. All rights reserved.
> 
> Fix comment usage (I think SPDX in .h needs "/* */")
> 
> > +#ifndef __CXL_H__
> > +#define __CXL_H__
> > +
> > +/* Device */
> > +#define CXLDEV_CAP_ARRAY_REG 0x0
> > +#define CXLDEV_CAP_ARRAY_CAP_ID 0
> > +#define CXLDEV_CAP_ARRAY_ID(x) ((x) & 0x)
> > +#define CXLDEV_CAP_ARRAY_COUNT(x) (((x) >> 32) & 0x)
> > +
> > +#define CXL_CAPABILITIES_CAP_ID_DEVICE_STATUS 1
> > +#define CXL_CAPABILITIES_CAP_ID_PRIMARY_MAILBOX 2
> > +#define CXL_CAPABILITIES_CAP_ID_SECONDARY_MAILBOX 3
> > +#define CXL_CAPABILITIES_CAP_ID_MEMDEV 0x4000
> 
> Strange that the first three are decimal and the last is hex.
> 
> > +/* Mailbox */
> > +#define CXLDEV_MB_CAPS 0x00
> > +#define   CXLDEV_MB_CAP_PAYLOAD_SIZE(cap) ((cap) & 0x1F)
> 
> Use upper- or lower-case hex consistently.  Add tabs to line things
> up.
> 
> > +#define CXLDEV_MB_CTRL 0x04
> > +#define CXLDEV_MB_CMD 0x08
> > +#define CXLDEV_MB_STATUS 0x10
> > +#define CXLDEV_MB_BG_CMD_STATUS 0x18
> > +
> > +struct cxl_mem {
> > +   struct pci_dev *pdev;
> > +   void __iomem *regs;
> > +
> > +   /* Cap 0001h */
> > +   struct {
> > +   void __iomem *regs;
> > +   } status;
> > +
> > +   /* Cap 0002h */
> > +   struct {
> > +   void __iomem *regs;
> > +   size_t payload_size;
> > +   } mbox;
> > +
> > +   /* Cap 0040h */
> > +   struct {
> > +   void __iomem *regs;
> > +   } mem;
> > +};
> 
> Maybe a note about why READ_ONCE() is required?
> 

I don't believe it's actually necessary. I will drop it.

> > +#define cxl_reg(type)  
> > \
> > +   static inline void cxl_write_##type##_reg32(struct cxl_mem *cxlm,  \
> > +   u32 reg, u32 value)\
> > +   {  \
> > +   void __iomem *reg_addr = READ_ONCE(cxlm->type.regs);   \
> > +   writel(value, reg_addr + reg); \
> > +   }  \
> > +   static inline void cxl_write_##type##_reg64(struct cxl_mem *cxlm,  \
> > +   u32 reg, u64 value)\
> > +   {  \
> > +   void __iomem *reg_addr = READ_ONCE(cxlm->type.regs);   \
> > +   writeq(value, reg_addr + reg); \
> > +   }  \
> > +   static inline u32 cxl_read_##type##_reg32(struct cxl_mem *cxlm,\
> > + u32 reg) \
> > +   {  \
> > +   void __iomem *reg_addr = READ_ONCE(cxlm->type.regs);   \
> > +   return readl(reg_addr + reg);  \
> > +   }  \
> > +   static inline u64 cxl_read_##type##_reg64(struct cxl_mem *cxlm,\
> > + u32 reg) \
> > +   {  \
> > +   void __iomem *reg_addr = READ_ONCE(cxlm->type.regs);   \
> > +   return readq(reg_addr + reg);  \
> > +   }
> > +
> > +cxl_reg(status)
> > +cxl_reg(mbox)
> > +
> > +static inline u32 __cxl_raw_read_reg32(struct cxl_mem *cxlm, u32 reg)
> > +{
> > +   void __iomem *reg_addr = READ_ONCE(cxlm->regs);
> > +
> > +   return readl(reg_addr + reg);
> > +}
> > +
> > +static inline u64 __cxl_raw_read_reg64(struct cxl_mem *cxlm, u32 reg)
> > +{
> > +   void __iomem *reg_addr = READ_ONCE(cxlm->regs);
> > +
> > + 

Re: [PATCH v2 1/2] x86/pci: use unsigned int in check_reserved_t

2020-11-13 Thread Randy Dunlap
On 11/13/20 4:23 PM, Sami Tolvanen wrote:
> Use unsigned int instead of raw unsigned in check_reserved_t to follow
> the kernel's style guidelines.
> 
> Signed-off-by: Sami Tolvanen 

Acked-by: Randy Dunlap 

Thanks.

> ---
>  arch/x86/pci/mmconfig-shared.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/pci/mmconfig-shared.c b/arch/x86/pci/mmconfig-shared.c
> index 6fa42e9c4e6f..37f31dd7005a 100644
> --- a/arch/x86/pci/mmconfig-shared.c
> +++ b/arch/x86/pci/mmconfig-shared.c
> @@ -425,7 +425,7 @@ static acpi_status find_mboard_resource(acpi_handle 
> handle, u32 lvl,
>   return AE_OK;
>  }
>  
> -static bool is_acpi_reserved(u64 start, u64 end, unsigned not_used)
> +static bool is_acpi_reserved(u64 start, u64 end, unsigned int not_used)
>  {
>   struct resource mcfg_res;
>  
> @@ -442,7 +442,7 @@ static bool is_acpi_reserved(u64 start, u64 end, unsigned 
> not_used)
>   return mcfg_res.flags;
>  }
>  
> -typedef bool (*check_reserved_t)(u64 start, u64 end, unsigned type);
> +typedef bool (*check_reserved_t)(u64 start, u64 end, unsigned int type);
>  
>  static bool __ref is_mmconf_reserved(check_reserved_t is_reserved,
>struct pci_mmcfg_region *cfg,
> 
> base-commit: 9e6a39eae450b81c8b2c8cbbfbdf8218e9b40c81
> 


-- 
~Randy


Re: [PATCH v2 2/2] x86/e820: fix the function type for e820__mapped_all

2020-11-13 Thread Randy Dunlap
On 11/13/20 4:23 PM, Sami Tolvanen wrote:
> e820__mapped_all is passed as a callback to is_mmconf_reserved, which
> expects a function of type:
> 
> typedef bool (*check_reserved_t)(u64 start, u64 end, unsigned int type);
> 
> This trips indirect call checking with Clang's Control-Flow Integrity
> (CFI). Change the last argument from enum e820_type to unsigned to fix
> the type mismatch.
> 
> Reported-by: Sedat Dilek 
> Signed-off-by: Sami Tolvanen 

Acked-by: Randy Dunlap 

Thanks.

> ---
>  arch/x86/include/asm/e820/api.h | 2 +-
>  arch/x86/kernel/e820.c  | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/e820/api.h b/arch/x86/include/asm/e820/api.h
> index e8f58ddd06d9..a122ca2784b2 100644
> --- a/arch/x86/include/asm/e820/api.h
> +++ b/arch/x86/include/asm/e820/api.h
> @@ -12,7 +12,7 @@ extern unsigned long pci_mem_start;
>  
>  extern bool e820__mapped_raw_any(u64 start, u64 end, enum e820_type type);
>  extern bool e820__mapped_any(u64 start, u64 end, enum e820_type type);
> -extern bool e820__mapped_all(u64 start, u64 end, enum e820_type type);
> +extern bool e820__mapped_all(u64 start, u64 end, unsigned int type);
>  
>  extern void e820__range_add   (u64 start, u64 size, enum e820_type type);
>  extern u64  e820__range_update(u64 start, u64 size, enum e820_type old_type, 
> enum e820_type new_type);
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index 22aad412f965..24b82ff53513 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -145,7 +145,7 @@ static struct e820_entry *__e820__mapped_all(u64 start, 
> u64 end,
>  /*
>   * This function checks if the entire range  is mapped with type.
>   */
> -bool __init e820__mapped_all(u64 start, u64 end, enum e820_type type)
> +bool __init e820__mapped_all(u64 start, u64 end, unsigned int type)
>  {
>   return __e820__mapped_all(start, end, type);
>  }
> 


-- 
~Randy
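
To make the failure mode concrete: with Clang CFI, every indirect call
site checks that the callee's type hash matches the function-pointer
type, so an enum parameter where the pointer type says unsigned int is
enough to trip it. A standalone illustration (not kernel code):

#include <stdbool.h>

typedef unsigned long long u64;
typedef bool (*check_reserved_t)(u64 start, u64 end, unsigned int type);

enum e820_type { E820_TYPE_RAM = 1 };

/* 'enum e820_type' gives this function a different CFI type hash
 * than check_reserved_t, so an indirect call through that pointer
 * type would be rejected at runtime. */
static bool mapped_all_enum(u64 start, u64 end, enum e820_type type)
{
	return type == E820_TYPE_RAM;
}

/* A matching 'unsigned int' parameter keeps indirect calls through
 * check_reserved_t CFI-clean. */
static bool mapped_all_uint(u64 start, u64 end, unsigned int type)
{
	return type == 1;
}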



Re: [PATCH 1/6] seq_file: add seq_read_iter

2020-11-13 Thread Al Viro
On Fri, Nov 13, 2020 at 04:54:53PM -0700, Nathan Chancellor wrote:

> This patch in -next (6a9f696d1627bacc91d1cebcfb177f474484e8ba) breaks
> WSL2's interoperability feature, where Windows paths automatically get
> added to PATH on start up so that Windows binaries can be accessed from
> within Linux (such as clip.exe to pipe output to the clipboard). Before,
> I would see a bunch of Linux + Windows folders in $PATH but after, I
> only see the Linux folders (I can give you the actual PATH value if you
> care but it is really long).
> 
> I am not at all familiar with the semantics of this patch or how
> Microsoft would be using it to inject folders into PATH (they have some
> documentation on it here:
> https://docs.microsoft.com/en-us/windows/wsl/interop) and I am not sure
> how to go about figuring that out to see why this patch breaks something
> (unless you have an idea). I have added the Hyper-V maintainers and list
> to CC in case they know someone who could help.

Out of curiosity: could you slap WARN_ON(!iov_iter_count(iter)); right in
the beginning of seq_read_iter() and see if that triggers?
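
In patch form, the suggested check would sit roughly here (illustrative
hunk; the surrounding context is abbreviated):

--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ ... @@ ssize_t seq_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 {
 	struct seq_file *m = iocb->ki_filp->private_data;
+
+	WARN_ON(!iov_iter_count(iter));	/* flag zero-count read attempts */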


Re: [PATCH 1/1] RFC: add pidfd_send_signal flag to reclaim mm while killing a process

2020-11-13 Thread Andrew Morton
On Fri, 13 Nov 2020 17:09:37 -0800 Suren Baghdasaryan  wrote:

> > > > Seems to me that the ability to reap another process's memory is a
> > > > generally useful one, and that it should not be tied to delivering a
> > > > signal in this fashion.
> > > >
> > > > And we do have the new process_madvise(MADV_PAGEOUT).  It may need a
> > > > few changes and tweaks, but can't that be used to solve this problem?
> > >
> > > Thank you for the feedback, Andrew. process_madvise(MADV_DONTNEED) was
> > > one of the options recently discussed in
> > > https://lore.kernel.org/linux-api/cajucfpgz1kpm3g1gzh+09z7aowkg05qsammisj7h5mdmrrr...@mail.gmail.com
> > > . The thread describes some of the issues with that approach but if we
> > > limit it to processes with pending SIGKILL only then I think that
> > > would be doable.
> >
> > Why would it be necessary to read /proc/pid/maps?  I'd have thought
> > that a starting effort would be
> >
> > madvise((void *)0, (void *)-1, MADV_PAGEOUT)
> >
> > (after translation into process_madvise() speak).  Which is equivalent
> > to the proposed process_madvise(MADV_DONTNEED_MM)?
> 
> Yep, this is very similar to option #3 in
> https://lore.kernel.org/linux-api/cajucfpgz1kpm3g1gzh+09z7aowkg05qsammisj7h5mdmrrr...@mail.gmail.com
> and I actually have a tested prototype for that.

Why is the `vector=NULL' needed?  Can't `vector' point at a single iovec
which spans the whole address range?

> If that's the
> preferred method then I can post it quite quickly.

I assume you've tested that prototype.  How did its usefulness compare
with this SIGKILL-based approach?



Re: [PATCH v3 5/6] i2c: iproc: handle master read request

2020-11-13 Thread Dhananjay Phadke
On Tue, 10 Nov 2020 11:24:36 -0800, Ray Jui wrote:

>>>> Yes it's true that for master write-read events both
>>>> IS_S_RD_EVENT_SHIFT and IS_S_RX_EVENT_SHIFT are coming together.
>>>> So before the slave starts transmitting data to the master, it should
>>>> first read all data from rx-fifo i.e. complete master write and then
>>>> process master read.
>>>>
>>>> To minimise interrupt overhead, we are batching 64bytes.
>>>> To keep isr running for less time, we are using a tasklet.
>>>> Again to keep the tasklet not running for more than 20u, we have set
>>>> max of 10 bytes data read from rx-fifo per tasklet run.
>>>>
>>>> If we start processing everything in isr and using rx threshold
>>>> interrupt, then isr will run for a longer time and this may hog the
>>>> system.
>>>> For example, to process 10 bytes it takes 20us, to process 30 bytes it
>>>> takes 60us and so on.
>>>> So is it okay to run isr for so long ?
>>>>
>>>> Keeping all this in mind we thought a tasklet would be a good option
>>>> and kept max of 10 bytes read per tasklet.
>>>>
>>>> Please let me know if you still feel we should not use a tasklet and
>>>> don't batch 64 bytes.
>>>
>>> Deferring to tasklet is OK, could use a kernel thread (i.e. threaded_irq)
>>> as i2c rate is quite low.
>>>
>
>kernel thread was proposed in the internal review. I don't see much
>benefit of using tasklet. If a thread is blocked from running for more
>than several tenth of ms, that's really a system-level issue than an
>issue with this driver.
>
>IMO, it's an overkill to use tasklet here but we can probably leave it
>as it is since it does not have a adverse effect and the code ran in
>tasklet is short.
>
>How much time is expected to read 64 bytes from an RX FIFO? Even with
>APB bus each register read is expected to be in the tenth or hundreds of
>nanosecond range. Reading the entire FIFO of 64 bytes should take less
>than 10 us. The interrupt context switch overhead is probably longer
>than that. It's much more effective to read all of them in a single
>batch than breaking them into multiple batches.

OK, there's a general discussion about moving away from tasklets; if this
fix works with a threaded ISR, I strongly recommend that route.

This comment in the code suggested that register reads take a long time to
drain 64 bytes.

>+/*
>+ * It takes ~18us to reading 10bytes of data, hence to keep tasklet
>+ * running for less time, max slave read per tasklet is set to 10
>bytes.

@Rayagonda, please take care of the hand-off mentioned below: once the tasklet
is scheduled, the ISR should just return, and the status should be cleared at
the end of the tasklet.
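
A bare-bones sketch of that flow with a threaded handler instead of the
tasklet (register helpers as in the driver code quoted in this thread;
the drain helper is hypothetical):

static irqreturn_t bcm_iproc_i2c_isr(int irq, void *data)
{
	struct bcm_iproc_i2c_dev *iproc_i2c = data;
	u32 status = iproc_i2c_rd_reg(iproc_i2c, IS_OFFSET);

	if (status & BIT(IS_S_RX_EVENT_SHIFT))
		return IRQ_WAKE_THREAD;	/* defer the FIFO drain */

	/* ... handle the remaining events here ... */
	return IRQ_HANDLED;
}

static irqreturn_t bcm_iproc_i2c_isr_thread(int irq, void *data)
{
	struct bcm_iproc_i2c_dev *iproc_i2c = data;

	slave_rx_drain_fifo(iproc_i2c);	/* hypothetical helper */

	/* clear the RX event only after the FIFO has been drained */
	iproc_i2c_wr_reg(iproc_i2c, IS_OFFSET, BIT(IS_S_RX_EVENT_SHIFT));
	return IRQ_HANDLED;
}

Registered with devm_request_threaded_irq() and IRQF_ONESHOT, the line
stays masked until the thread function finishes, which avoids re-firing
while the FIFO is still being drained.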

>>
>> Few other comments -
>>
>>> +  /* schedule tasklet to read data later */
>>> +  tasklet_schedule(_i2c->slave_rx_tasklet);
>>> +
>>> +  /* clear only IS_S_RX_EVENT_SHIFT interrupt */
>>> +  iproc_i2c_wr_reg(iproc_i2c, IS_OFFSET,
>>> +   BIT(IS_S_RX_EVENT_SHIFT));
>>> +  }
>>
>> Why clearing one rx interrupt bit here after scheduling tasklet? Should all 
>> that
>> be done by tasklet? Also should just return after scheduling tasklet?

Regards,
Dhananjay


Re: [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner.

2020-11-13 Thread Zi Yan

On 13 Nov 2020, at 19:15, Roman Gushchin wrote:

> On Wed, Nov 11, 2020 at 03:40:05PM -0500, Zi Yan wrote:
>> From: Zi Yan 
>>
>> It adds a new_order parameter to set new page order in page owner.
>> It prepares for upcoming changes to support split huge page to any lower
>> order.
>>
>> Signed-off-by: Zi Yan 
>> ---
>>  include/linux/page_owner.h | 7 ---
>>  mm/huge_memory.c   | 2 +-
>>  mm/page_alloc.c| 2 +-
>>  mm/page_owner.c| 6 +++---
>>  4 files changed, 9 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h
>> index 3468794f83d2..215cbb159568 100644
>> --- a/include/linux/page_owner.h
>> +++ b/include/linux/page_owner.h
>> @@ -31,10 +31,11 @@ static inline void set_page_owner(struct page *page,
>>  	__set_page_owner(page, order, gfp_mask);
>>  }
>>
>> -static inline void split_page_owner(struct page *page, unsigned int nr)
>> +static inline void split_page_owner(struct page *page, unsigned int nr,
>> +					unsigned int new_order)
>>  {
>>  	if (static_branch_unlikely(_owner_inited))
>> -		__split_page_owner(page, nr);
>> +		__split_page_owner(page, nr, new_order);
>>  }
>>  static inline void copy_page_owner(struct page *oldpage, struct page *newpage)
>>  {
>> @@ -60,7 +61,7 @@ static inline void set_page_owner(struct page *page,
>>  {
>>  }
>>  static inline void split_page_owner(struct page *page,
>> -			unsigned int order)
>> +			unsigned int nr, unsigned int new_order)
>
> With the addition of the new argument it's a bit hard to understand
> what the function is supposed to do. It seems like nr == page_order(page),
> is it right? Maybe we can pass old_order and new_order? Or just the page
> and the new order?

Yeah, it is a bit confusing. Please see more below.

>>  {
>>  }
>>  static inline void copy_page_owner(struct page *oldpage, struct page *newpage)
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index f599f5b9bf7f..8b7d771ee962 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2459,7 +2459,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>>
>>  	ClearPageCompound(head);
>>
>> -	split_page_owner(head, nr);
>> +	split_page_owner(head, nr, 1);
>>
>>  	/* See comment in __split_huge_page_tail() */
>>  	if (PageAnon(head)) {
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index d77220615fd5..a9eead0e091a 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -3284,7 +3284,7 @@ void split_page(struct page *page, unsigned int order)
>>
>>  	for (i = 1; i < (1 << order); i++)
>>  		set_page_refcounted(page + i);
>> -	split_page_owner(page, 1 << order);
>> +	split_page_owner(page, 1 << order, 1);
>>  }
>>  EXPORT_SYMBOL_GPL(split_page);
>>
>> diff --git a/mm/page_owner.c b/mm/page_owner.c
>> index b735a8eafcdb..2b7f7e9056dc 100644
>> --- a/mm/page_owner.c
>> +++ b/mm/page_owner.c
>> @@ -204,7 +204,7 @@ void __set_page_owner_migrate_reason(struct page *page, int reason)
>>  	page_owner->last_migrate_reason = reason;
>>  }
>>
>> -void __split_page_owner(struct page *page, unsigned int nr)
>> +void __split_page_owner(struct page *page, unsigned int nr, unsigned int new_order)
>>  {
>>  	int i;
>>  	struct page_ext *page_ext = lookup_page_ext(page);
>> @@ -213,9 +213,9 @@ void __split_page_owner(struct page *page, unsigned int nr)
>>  	if (unlikely(!page_ext))
>>  		return;
>>
>> -	for (i = 0; i < nr; i++) {
>> +	for (i = 0; i < nr; i += (1 << new_order)) {
>>  		page_owner = get_page_owner(page_ext);
>> -		page_owner->order = 0;
>> +		page_owner->order = new_order;
>>  		page_ext = page_ext_next(page_ext);
>
> I believe there cannot be any leftovers because nr is always a power of 2.
> Is it true? Converting nr argument to order (if it's possible) will make
> it obvious.

Right. nr = thp_nr_pages(head), which is a power of 2. There would not be any
leftover.

Matthew recently converted split_page_owner to take nr instead of order.[1]
But I am not sure why, since it seems to me that two call sites
(__split_huge_page in mm/huge_memory.c and split_page in mm/page_alloc.c)
can pass the order information.

[1] https://lore.kernel.org/linux-mm/20200908195539.25896-4-wi...@infradead.org/

—
Best Regards,
Yan Zi


Re: linux-next: build failure after merge of the akpm tree

2020-11-13 Thread Andrew Morton
On Fri, 13 Nov 2020 18:02:39 +1100 Stephen Rothwell  
wrote:

> Hi all,
> 
> After merging the akpm tree, today's linux-next build (i386 defconfig)
> failed like this:
> 
> mm/secretmem.c: In function 'secretmem_memcg_charge':
> mm/secretmem.c:72:4: error: 'struct page' has no member named 'memcg_data'
>72 |   p->memcg_data = page->memcg_data;
>   |^~
> mm/secretmem.c:72:23: error: 'struct page' has no member named 'memcg_data'
>72 |   p->memcg_data = page->memcg_data;
>   |   ^~
> mm/secretmem.c: In function 'secretmem_memcg_uncharge':
> mm/secretmem.c:86:4: error: 'struct page' has no member named 'memcg_data'
>86 |   p->memcg_data = 0;
>   |^~
> 
> ...
>
> --- a/mm/secretmem.c
> +++ b/mm/secretmem.c
> @@ -69,7 +69,9 @@ static int secretmem_memcg_charge(struct page *page, gfp_t 
> gfp, int order)
>   for (i = 1; i < nr_pages; i++) {
>   struct page *p = page + i;
>  
> +#ifdef CONFIG_MEMCG
>   p->memcg_data = page->memcg_data;
> +#endif
>   }
>  
>   return 0;

Thanks, that'll work for now.

I guess we're looking at adding a set_page_memcg() (I'd prefer
page_memcg_set()).

But probably these functions shouldn't be compiled at all if
CONFIG_MEMCG=n.
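
One shape such a helper could take for the secretmem use; the name is
hypothetical, and a raw setter like the page_memcg_set() suggested above
would work the same way:

/* include/linux/memcontrol.h (sketch): copy memcg ownership from one
 * page to another without open-coding ->memcg_data access. */
#ifdef CONFIG_MEMCG
static inline void page_memcg_copy(struct page *to, struct page *from)
{
	to->memcg_data = from->memcg_data;
}
#else
static inline void page_memcg_copy(struct page *to, struct page *from)
{
}
#endif

The loop in secretmem_memcg_charge() could then call
page_memcg_copy(p, page) unconditionally, and it compiles away to
nothing under CONFIG_MEMCG=n.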


Re: [RFC PATCH 4/9] cxl/mem: Map memory device registers

2020-11-13 Thread Ben Widawsky
On 20-11-13 12:17:32, Bjorn Helgaas wrote:
> On Tue, Nov 10, 2020 at 09:43:51PM -0800, Ben Widawsky wrote:
> > All the necessary bits are initialized in order to find and map the
> > register space for CXL Memory Devices. This is accomplished by using the
> > Register Locator DVSEC (CXL 2.0 - 8.1.9.1) to determine which PCI BAR to
> > use, and how much of an offset from that BAR should be added.
> 
> "Initialize the necessary bits ..." to use the usual imperative
> sentence structure, as you did in the subject.
> 
> > If the memory device registers are found and mapped a new internal data
> > structure tracking device state is allocated.
> 
> "Allocate device state if we find device registers" or similar.
> 
> > Signed-off-by: Ben Widawsky 
> > ---
> >  drivers/cxl/mem.c | 68 +++
> >  drivers/cxl/pci.h |  6 +
> >  2 files changed, 69 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> > index aa7d881fa47b..8d9b9ab6c5ea 100644
> > --- a/drivers/cxl/mem.c
> > +++ b/drivers/cxl/mem.c
> > @@ -7,9 +7,49 @@
> >  #include "pci.h"
> >  
> >  struct cxl_mem {
> > +   struct pci_dev *pdev;
> > void __iomem *regs;
> >  };
> >  
> > +static struct cxl_mem *cxl_mem_create(struct pci_dev *pdev, u32 reg_lo, 
> > u32 reg_hi)
> > +{
> > +   struct device *dev = >dev;
> > +   struct cxl_mem *cxlm;
> > +   void __iomem *regs;
> > +   u64 offset;
> > +   u8 bar;
> > +   int rc;
> > +
> > +   offset = ((u64)reg_hi << 32) | (reg_lo & 0x);
> > +   bar = reg_lo & 0x7;
> > +
> > +   /* Basic sanity check that BAR is big enough */
> > +   if (pci_resource_len(pdev, bar) < offset) {
> > +   dev_err(dev, "bar%d: %pr: too small (offset: %#llx)\n",
> > +   bar, >resource[bar], (unsigned long long) 
> > offset);
> 
> s/bar/BAR/
> 
> > +   return ERR_PTR(-ENXIO);
> > +   }
> > +
> > +   rc = pcim_iomap_regions(pdev, 1 << bar, pci_name(pdev));
> > +   if (rc != 0) {
> > +   dev_err(dev, "failed to map registers\n");
> > +   return ERR_PTR(-ENXIO);
> > +   }
> > +
> > +   cxlm = devm_kzalloc(>dev, sizeof(*cxlm), GFP_KERNEL);
> > +   if (!cxlm) {
> > +   dev_err(dev, "No memory available\n");
> > +   return ERR_PTR(-ENOMEM);
> > +   }
> > +
> > +   regs = pcim_iomap_table(pdev)[bar];
> > +   cxlm->pdev = pdev;
> > +   cxlm->regs = regs + offset;
> > +
> > +   dev_dbg(dev, "Mapped CXL Memory Device resource\n");
> > +   return cxlm;
> > +}
> > +
> >  static int cxl_mem_dvsec(struct pci_dev *pdev, int dvsec)
> >  {
> > int pos;
> > @@ -34,9 +74,9 @@ static int cxl_mem_dvsec(struct pci_dev *pdev, int dvsec)
> >  
> >  static int cxl_mem_probe(struct pci_dev *pdev, const struct pci_device_id 
> > *id)
> >  {
> > +   struct cxl_mem *cxlm = ERR_PTR(-ENXIO);
> > struct device *dev = >dev;
> > -   struct cxl_mem *cxlm;
> 
> The order was better before ("dev", then "cxlm").  Oh, I suppose this
> is a "reverse Christmas tree" thing.
> 

I don't actually care either way as long as it's consistent. I tend to do
reverse Christmas tree for no particular reason.

> > -   int rc, regloc;
> > +   int rc, regloc, i;
> >  
> > rc = cxl_bus_prepared(pdev);
> > if (rc != 0) {
> > @@ -44,15 +84,33 @@ static int cxl_mem_probe(struct pci_dev *pdev, const 
> > struct pci_device_id *id)
> > return rc;
> > }
> >  
> > +   rc = pcim_enable_device(pdev);
> > +   if (rc)
> > +   return rc;
> > +
> > regloc = cxl_mem_dvsec(pdev, PCI_DVSEC_ID_CXL_REGLOC);
> > if (!regloc) {
> > dev_err(dev, "register location dvsec not found\n");
> > return -ENXIO;
> > }
> > +   regloc += 0xc; /* Skip DVSEC + reserved fields */
> > +
> > +   for (i = regloc; i < regloc + 0x24; i += 8) {
> > +   u32 reg_lo, reg_hi;
> > +
> > +   pci_read_config_dword(pdev, i, _lo);
> > +   pci_read_config_dword(pdev, i + 4, _hi);
> > +
> > +   if (CXL_REGLOG_IS_MEMDEV(reg_lo)) {
> > +   cxlm = cxl_mem_create(pdev, reg_lo, reg_hi);
> > +   break;
> > +   }
> > +   }
> > +
> > +   if (IS_ERR(cxlm))
> > +   return -ENXIO;
> 
> I think this would be easier to read if cxl_mem_create() returned NULL
> on failure (it prints error messages and we throw away
> -ENXIO/-ENOMEM distinction here anyway) so you could do:
> 
>   struct cxl_mem *cxlm = NULL;
> 
>   for (...) {
> if (...) {
>   cxlm = cxl_mem_create(pdev, reg_lo, reg_hi);
>   break;
> }
>   }
> 
>   if (!cxlm)
> return -ENXIO;  /* -ENODEV might be more natural? */
> 

I agree on both counts. Both of these came from Dan, so I will let him explain.

> > -   cxlm = devm_kzalloc(dev, sizeof(*cxlm), GFP_KERNEL);
> > -   if (!cxlm)
> > -   return -ENOMEM;
> > +   pci_set_drvdata(pdev, cxlm);
> >  
> > return 0;
> >  }
> > diff --git a/drivers/cxl/pci.h b/drivers/cxl/pci.h
> > index beb03921e6da..be87f62e9132 

Re: [RFC PATCH 3/9] cxl/mem: Add a driver for the type-3 mailbox

2020-11-13 Thread Ben Widawsky
On 20-11-13 12:17:28, Bjorn Helgaas wrote:
> On Tue, Nov 10, 2020 at 09:43:50PM -0800, Ben Widawsky wrote:
> > From: Dan Williams 
> > 
> > The CXL.mem protocol allows a device to act as a provider of "System
> > RAM" and/or "Persistent Memory" that is fully coherent as if the memory
> > was attached to the typical CPU memory controller.
> > 
> > The memory range exported by the device may optionally be described by
> > the platform firmware memory map, or by infrastructure like LIBNVDIMM to
> > provision persistent memory capacity from one, or more, CXL.mem devices.
> > 
> > A pre-requisite for Linux-managed memory-capacity provisioning is this
> > cxl_mem driver that can speak the "type-3 mailbox" protocol.
> 
> "Type 3" to indicate that this is a proper adjective that can be
> looked up in the spec and to match the usage there.
> 
> The r1.1 spec I have doesn't mention "mailbox".  Is that also
> something defined in the 2.0 spec?

Yes, these device types are new to 2.0.

> 
> A URL or similar citation for the spec would be nice somewhere.
> 

Agreed. For the patches I authored at least, it seemed repetitive to put a Link:
in each one to the spec. It was meant to be in the cover letter, but obviously I
missed that. Do you have a suggestion there? Is the cover letter good enough?

> > For now just land the driver boiler-plate and fill it in with
> > functionality in subsequent commits.
> > 
> > Signed-off-by: Dan Williams 
> > Signed-off-by: Ben Widawsky 
> > ---
> >  drivers/cxl/Kconfig  | 20 +++
> >  drivers/cxl/Makefile |  2 ++
> >  drivers/cxl/mem.c| 82 
> >  drivers/cxl/pci.h| 15 
> >  4 files changed, 119 insertions(+)
> >  create mode 100644 drivers/cxl/mem.c
> >  create mode 100644 drivers/cxl/pci.h
> > 
> > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> > index dd724bd364df..15548f5c77ff 100644
> > --- a/drivers/cxl/Kconfig
> > +++ b/drivers/cxl/Kconfig
> > @@ -27,4 +27,24 @@ config CXL_ACPI
> >   resources described by the CEDT (CXL Early Discovery Table)
> >  
> >   Say 'y' to enable CXL (Compute Express Link) drivers.
> > +
> > +config CXL_MEM
> > +tristate "CXL.mem Device Support"
> > +depends on PCI && CXL_BUS_PROVIDER != n
> > +default m if CXL_BUS_PROVIDER
> > +help
> > +  The CXL.mem protocol allows a device to act as a provider of
> > +  "System RAM" and/or "Persistent Memory" that is fully coherent
> > +  as if the memory was attached to the typical CPU memory
> > +  controller.
> > +
> > +  Say 'y/m' to enable a driver named "cxl_mem.ko" that will attach
> > +  to CXL.mem devices for configuration, provisioning, and health
> > +  monitoring, the so called "type-3 mailbox". Note, this driver
> 
> "Type 3"
> 
> > +  is required for dynamic provisioning of CXL.mem attached
> > +  memory, a pre-requisite for persistent memory support, but
> > +  devices that provide volatile memory may be fully described by
> > +  existing platform firmware memory enumeration.
> > +
> > +  If unsure say 'n'.
> >  endif
> > diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile
> > index d38cd34a2582..97fdffb00f2d 100644
> > --- a/drivers/cxl/Makefile
> > +++ b/drivers/cxl/Makefile
> > @@ -1,5 +1,7 @@
> >  # SPDX-License-Identifier: GPL-2.0
> >  obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o
> > +obj-$(CONFIG_CXL_MEM) += cxl_mem.o
> >  
> >  ccflags-y += -DDEFAULT_SYMBOL_NAMESPACE=CXL
> >  cxl_acpi-y := acpi.o
> > +cxl_mem-y := mem.o
> > diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> > new file mode 100644
> > index ..aa7d881fa47b
> > --- /dev/null
> > +++ b/drivers/cxl/mem.c
> > @@ -0,0 +1,82 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +// Copyright(c) 2020 Intel Corporation. All rights reserved.
> > +#include 
> > +#include 
> > +#include 
> > +#include "acpi.h"
> > +#include "pci.h"
> > +
> > +struct cxl_mem {
> > +   void __iomem *regs;
> > +};
> 
> Unused, maybe move it to the patch that adds the use?
> 

This is a remnant from when Dan gave me the basis to do the mmio work. I agree
it can be removed now.

> > +static int cxl_mem_dvsec(struct pci_dev *pdev, int dvsec)
> > +{
> > +   int pos;
> > +
> > +   pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_DVSEC);
> > +   if (!pos)
> > +   return 0;
> > +
> > +   while (pos) {
> > +   u16 vendor, id;
> > +
> > +   pci_read_config_word(pdev, pos + PCI_DVSEC_VENDOR_OFFSET, 
> > );
> > +   pci_read_config_word(pdev, pos + PCI_DVSEC_ID_OFFSET, );
> > +   if (vendor == PCI_DVSEC_VENDOR_CXL && dvsec == id)
> > +   return pos;
> > +
> > +   pos = pci_find_next_ext_capability(pdev, pos, 
> > PCI_EXT_CAP_ID_DVSEC);
> > +   }
> > +
> > +   return 0;
> > +}
> 
> I assume we'll refactor and move this into the PCI core after we
> resolve the several places this is needed.  When 

[PATCH net-next] net: linux/skbuff.h: combine NET + KCOV handling

2020-11-13 Thread Randy Dunlap
The previous Kconfig patch led to some other build errors as
reported by the 0day bot and my own overnight build testing.

These are all in <linux/skbuff.h> when KCOV is enabled but
NET is not enabled, so fix those by combining those conditions
in the header file.

Fixes: 6370cc3bbd8a ("net: add kcov handle to skb extensions")
Fixes: 85ce50d337d1 ("net: kcov: don't select SKB_EXTENSIONS when there is no 
NET")
Signed-off-by: Randy Dunlap 
Reported-by: kernel test robot 
Cc: Aleksandr Nogikh 
Cc: Willem de Bruijn 
Cc: Jakub Kicinski 
Cc: linux-n...@vger.kernel.org
Cc: net...@vger.kernel.org
---
 include/linux/skbuff.h |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- linux-next-20201113.orig/include/linux/skbuff.h
+++ linux-next-20201113/include/linux/skbuff.h
@@ -4151,7 +4151,7 @@ enum skb_ext_id {
 #if IS_ENABLED(CONFIG_MPTCP)
SKB_EXT_MPTCP,
 #endif
-#if IS_ENABLED(CONFIG_KCOV)
+#if IS_ENABLED(CONFIG_KCOV) && IS_ENABLED(CONFIG_NET)
SKB_EXT_KCOV_HANDLE,
 #endif
SKB_EXT_NUM, /* must be last */
@@ -4608,7 +4608,7 @@ static inline void skb_reset_redirect(struct sk_buff *skb)
 #endif
 }
 
-#ifdef CONFIG_KCOV
+#if IS_ENABLED(CONFIG_KCOV) && IS_ENABLED(CONFIG_NET)
 static inline void skb_set_kcov_handle(struct sk_buff *skb,
   const u64 kcov_handle)
 {
@@ -4636,7 +4636,7 @@ static inline u64 skb_get_kcov_handle(struct sk_buff *skb)
 static inline void skb_set_kcov_handle(struct sk_buff *skb,
   const u64 kcov_handle) { }
 static inline u64 skb_get_kcov_handle(struct sk_buff *skb) { return 0; }
-#endif /* CONFIG_KCOV */
+#endif /* CONFIG_KCOV && CONFIG_NET */
 
 #endif /* __KERNEL__ */
 #endif /* _LINUX_SKBUFF_H */


Re: [PATCH v2 08/10] ARM: dts: NSP: Add a SRAB compatible string for each board

2020-11-13 Thread Vladimir Oltean
On Wed, Nov 11, 2020 at 08:50:18PM -0800, Florian Fainelli wrote:
> Provide a valid compatible string for the Ethernet switch node based on
> the board including the switch. This allows us to have sane defaults and
> silences the following warnings:
> 
>  arch/arm/boot/dts/bcm958522er.dt.yaml:
> ethernet-switch@36000: compatible: 'oneOf' conditional failed,
> one
> must be fixed:
> ['brcm,bcm5301x-srab'] is too short
> 'brcm,bcm5325' was expected
> 'brcm,bcm53115' was expected
> 'brcm,bcm53125' was expected
> 'brcm,bcm53128' was expected
> 'brcm,bcm5365' was expected
> 'brcm,bcm5395' was expected
> 'brcm,bcm5389' was expected
> 'brcm,bcm5397' was expected
> 'brcm,bcm5398' was expected
> 'brcm,bcm11360-srab' was expected
> 'brcm,bcm5301x-srab' is not one of ['brcm,bcm53010-srab',
> 'brcm,bcm53011-srab', 'brcm,bcm53012-srab', 'brcm,bcm53018-srab',
> 'brcm,bcm53019-srab']
> 'brcm,bcm5301x-srab' is not one of ['brcm,bcm11404-srab',
> 'brcm,bcm11407-srab', 'brcm,bcm11409-srab', 'brcm,bcm58310-srab',
> 'brcm,bcm58311-srab', 'brcm,bcm58313-srab']
> 'brcm,bcm5301x-srab' is not one of ['brcm,bcm58522-srab',
> 'brcm,bcm58523-srab', 'brcm,bcm58525-srab', 'brcm,bcm58622-srab',
> 'brcm,bcm58623-srab', 'brcm,bcm58625-srab', 'brcm,bcm88312-srab']
> 'brcm,bcm5301x-srab' is not one of ['brcm,bcm3384-switch',
> 'brcm,bcm6328-switch', 'brcm,bcm6368-switch']
> From schema:
> Documentation/devicetree/bindings/net/dsa/b53.yaml
> 
> Signed-off-by: Florian Fainelli 
> ---

Reviewed-by: Vladimir Oltean 


Re: [PATCH v2 09/10] ARM: dts: NSP: Provide defaults ports container node

2020-11-13 Thread Vladimir Oltean
On Wed, Nov 11, 2020 at 08:50:19PM -0800, Florian Fainelli wrote:
> Provide an empty 'ports' container node with the correct #address-cells
> and #size-cells properties. This silences the following warning:
> 
> arch/arm/boot/dts/bcm958522er.dt.yaml:
> ethernet-switch@36000: 'oneOf' conditional failed, one must be fixed:
> 'ports' is a required property
> 'ethernet-ports' is a required property
> From schema:
> Documentation/devicetree/bindings/net/dsa/b53.yaml
> 
> Signed-off-by: Florian Fainelli 
> ---

So 'ports' is not going away and getting bulk-replaced with
'ethernet-ports'. Good.

Reviewed-by: Vladimir Oltean 


Re: [PATCH 1/1] RFC: add pidfd_send_signal flag to reclaim mm while killing a process

2020-11-13 Thread Suren Baghdasaryan
On Fri, Nov 13, 2020 at 5:00 PM Andrew Morton  wrote:
>
> On Fri, 13 Nov 2020 16:06:25 -0800 Suren Baghdasaryan  
> wrote:
>
> > On Fri, Nov 13, 2020 at 3:55 PM Andrew Morton  
> > wrote:
> > >
> > > On Fri, 13 Nov 2020 09:34:48 -0800 Suren Baghdasaryan  
> > > wrote:
> > >
> > > > When a process is being killed it might be in an uninterruptible sleep
> > > > which leads to an unpredictable delay in its memory reclaim. In low 
> > > > memory
> > > > situations, when it's important to free up memory quickly, such delay is
> > > > problematic. Kernel solves this problem with oom-reaper thread which
> > > > performs memory reclaim even when the victim process is not runnable.
> > > > Userspace currently lacks such mechanisms and the need and potential
> > > > solutions were discussed before (see links below).
> > > > This patch provides a mechanism to perform memory reclaim in the context
> > > > of the process that sends SIGKILL signal. New SYNC_REAP_MM flag for
> > > > pidfd_send_signal syscall can be used only when sending SIGKILL signal
> > > > and will lead to the caller synchronously reclaiming the memory that
> > > > belongs to the victim and can be easily reclaimed.
> > >
> > > hm.
> > >
> > > Seems to me that the ability to reap another process's memory is a
> > > generally useful one, and that it should not be tied to delivering a
> > > signal in this fashion.
> > >
> > > And we do have the new process_madvise(MADV_PAGEOUT).  It may need a
> > > few changes and tweaks, but can't that be used to solve this problem?
> >
> > Thank you for the feedback, Andrew. process_madvise(MADV_DONTNEED) was
> > one of the options recently discussed in
> > https://lore.kernel.org/linux-api/cajucfpgz1kpm3g1gzh+09z7aowkg05qsammisj7h5mdmrrr...@mail.gmail.com
> > . The thread describes some of the issues with that approach but if we
> > limit it to processes with pending SIGKILL only then I think that
> > would be doable.
>
> Why would it be necessary to read /proc/pid/maps?  I'd have thought
> that a starting effort would be
>
> madvise((void *)0, (void *)-1, MADV_PAGEOUT)
>
> (after translation into process_madvise() speak).  Which is equivalent
> to the proposed process_madvise(MADV_DONTNEED_MM)?

Yep, this is very similar to option #3 in
https://lore.kernel.org/linux-api/cajucfpgz1kpm3g1gzh+09z7aowkg05qsammisj7h5mdmrrr...@mail.gmail.com
and I actually have a tested prototype for that. If that's the
preferred method then I can post it quite quickly.

>
> There may be things which trip this up, such as mlocked regions or
> whatever, but we could add another madvise `advice' mode to handle
> this?


Re: #PF from NMI

2020-11-13 Thread Paul E. McKenney
On Sat, Nov 14, 2020 at 12:13:58AM +0100, Thomas Gleixner wrote:
> On Fri, Nov 13 2020 at 13:53, Peter Zijlstra wrote:
> > [  139.226724] WARNING: CPU: 9 PID: 2290 at kernel/rcu/tree.c:932 
> > __rcu_irq_enter_check_tick+0x84/0xd0
> > [  139.226753]  irqentry_enter+0x25/0x40
> > [  139.226753]  exc_page_fault+0x38/0x4c0
> > [  139.226753]  asm_exc_page_fault+0x1e/0x30
> 
> ...
> 
> > [  139.226757]  perf_callchain_user+0xf4/0x280
> >
> > Which is a #PF from NMI context, which is perfectly fine. However
> > __rcu_irq_enter_check_tick() is triggering WARN.
> >
> > AFAICT the right thing is to simply remove the warn like so.
> >
> > ---
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 430ba58d8bfe..9bda92d8b914 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -928,8 +928,8 @@ void __rcu_irq_enter_check_tick(void)
> >  {
> > struct rcu_data *rdp = this_cpu_ptr(_data);
> >  
> > -// Enabling the tick is unsafe in NMI handlers.
> > -   if (WARN_ON_ONCE(in_nmi()))
> > +   // if we're here from NMI, there's nothing to do.
> > +   if (in_nmi())
> > return;
> >  
> > RCU_LOCKDEP_WARN(rcu_dynticks_curr_cpu_in_eqs(),
> 
> Yes. That's right.
> 
> To answer Pauls question:
> 
> > But is a corresponding change required on return-from-NMI side?
> > Looks OK to me at first glance, but I could be missing something.
> 
> No. The corresponding issue is not return from NMI. The corresponding
> problem is the return from the page fault handler, but there is nothing
> to worry about. That part is NMI safe already.

In that case:

Reviewed-by: Paul E. McKenney 

Or let me know (and get me a Signed-off-by) if you want me to take it.

Thanx, Paul

> And Luto's as well:
> 
> > with the following caveat that has nothing to do with NMI: the rest of
> > irqentry_enter() has tracing calls in it. Does anything prevent
> > infinite recursion if one of those tracing calls causes a page fault?
> 
> nmi:
>   ...
>   trace_hardirqs_off_finish() {
> if (!this_cpu_read(tracing_irq_cpu)) {
>this_cpu_write(tracing_irq_cpu, 1);
>...
>   }
>   ...
>   perf()
> 
> #PF
>   save_cr2()
>   
>   irqentry_enter()
>  trace_hardirqs_off_finish()
> if (!this_cpu_read(tracing_irq_cpu)) {
> 
> So yes, it is recursion protected unless I'm missing something.
> 
> Thanks,
> 
> tglx


Re: [PATCH 1/1] RFC: add pidfd_send_signal flag to reclaim mm while killing a process

2020-11-13 Thread Andrew Morton
On Fri, 13 Nov 2020 16:06:25 -0800 Suren Baghdasaryan  wrote:

> On Fri, Nov 13, 2020 at 3:55 PM Andrew Morton  
> wrote:
> >
> > On Fri, 13 Nov 2020 09:34:48 -0800 Suren Baghdasaryan  
> > wrote:
> >
> > > When a process is being killed it might be in an uninterruptible sleep
> > > which leads to an unpredictable delay in its memory reclaim. In low memory
> > > situations, when it's important to free up memory quickly, such delay is
> > > problematic. Kernel solves this problem with oom-reaper thread which
> > > performs memory reclaim even when the victim process is not runnable.
> > > Userspace currently lacks such mechanisms and the need and potential
> > > solutions were discussed before (see links below).
> > > This patch provides a mechanism to perform memory reclaim in the context
> > > of the process that sends SIGKILL signal. New SYNC_REAP_MM flag for
> > > pidfd_send_signal syscall can be used only when sending SIGKILL signal
> > > and will lead to the caller synchronously reclaiming the memory that
> > > belongs to the victim and can be easily reclaimed.
> >
> > hm.
> >
> > Seems to me that the ability to reap another process's memory is a
> > generally useful one, and that it should not be tied to delivering a
> > signal in this fashion.
> >
> > And we do have the new process_madvise(MADV_PAGEOUT).  It may need a
> > few changes and tweaks, but can't that be used to solve this problem?
> 
> Thank you for the feedback, Andrew. process_madvise(MADV_DONTNEED) was
> one of the options recently discussed in
> https://lore.kernel.org/linux-api/cajucfpgz1kpm3g1gzh+09z7aowkg05qsammisj7h5mdmrrr...@mail.gmail.com
> . The thread describes some of the issues with that approach but if we
> limit it to processes with pending SIGKILL only then I think that
> would be doable.

Why would it be necessary to read /proc/pid/maps?  I'd have thought
that a starting effort would be

madvise((void *)0, (void *)-1, MADV_PAGEOUT)

(after translation into process_madvise() speak).  Which is equivalent
to the proposed process_madvise(MADV_DONTNEED_MM)?

There may be things which trip this up, such as mlocked regions or
whatever, but we could add another madvise `advice' mode to handle
this?
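
A minimal userspace sketch of that whole-address-space variant, assuming
the process_madvise(2) syscall as merged in 5.10 (the full-range iovec is
the idea under discussion here, not an established API, and a given kernel
may clamp or reject it; __NR_process_madvise and MADV_PAGEOUT need recent
kernel headers):

	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <sys/uio.h>
	#include <unistd.h>

	/* Sketch only: ask the kernel to page out everything it can of
	 * the (dying) target's memory, from the killing process. */
	static long reap_mm(int pidfd)
	{
		struct iovec iov = {
			.iov_base = NULL,
			.iov_len  = (size_t)-1,	/* whole address space */
		};

		return syscall(__NR_process_madvise, pidfd, &iov, 1,
			       MADV_PAGEOUT, 0);
	}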


Re: [RFC PATCH 4/6] mm: thp: add support for split huge page to any lower order pages.

2020-11-13 Thread Zi Yan
On 13 Nov 2020, at 19:52, Roman Gushchin wrote:

> On Wed, Nov 11, 2020 at 03:40:06PM -0500, Zi Yan wrote:
>> From: Zi Yan 
>>
>> To split a THP to any lower order pages, we need to reform THPs on
>> subpages at given order and add page refcount based on the new page
>> order. Also we need to reinitialize page_deferred_list after removing
>> the page from the split_queue, otherwise a subsequent split will see
>> list corruption when checking the page_deferred_list again.
>>
>> It has many uses, like minimizing the number of pages after
>> truncating a pagecache THP. For anonymous THPs, we can only split them
>> to order-0 like before until we add support for any size anonymous THPs.
>>
>> Signed-off-by: Zi Yan 
>> ---
>>  include/linux/huge_mm.h |  8 +
>>  mm/huge_memory.c| 78 +
>>  mm/swap.c   |  1 -
>>  3 files changed, 63 insertions(+), 24 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 60a907a19f7d..9819cd9b4619 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -189,6 +189,8 @@ bool is_transparent_hugepage(struct page *page);
>>
>>  bool can_split_huge_page(struct page *page, int *pextra_pins);
>>  int split_huge_page_to_list(struct page *page, struct list_head *list);
>> +int split_huge_page_to_list_to_order(struct page *page, struct list_head 
>> *list,
>> +unsigned int new_order);
>>  static inline int split_huge_page(struct page *page)
>>  {
>>  return split_huge_page_to_list(page, NULL);
>> @@ -396,6 +398,12 @@ split_huge_page_to_list(struct page *page, struct 
>> list_head *list)
>>  {
>>  return 0;
>>  }
>> +static inline int
>> +split_huge_page_to_order_to_list(struct page *page, struct list_head *list,
>> +unsigned int new_order)
>
> It was
> int split_huge_page_to_list_to_order(struct page *page, struct list_head 
> *list,
>   unsigned int new_order);
> above.

Right. It should be split_huge_page_to_list_to_order. Will fix it.

>
>> +{
>> +return 0;
>> +}
>>  static inline int split_huge_page(struct page *page)
>>  {
>>  return 0;
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 8b7d771ee962..88f50da40c9b 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2327,11 +2327,14 @@ void vma_adjust_trans_huge(struct vm_area_struct 
>> *vma,
>>  static void unmap_page(struct page *page)
>>  {
>>  enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
>> -TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
>> +TTU_RMAP_LOCKED;
>>  bool unmap_success;
>>
>>  VM_BUG_ON_PAGE(!PageHead(page), page);
>>
>> +if (thp_order(page) >= HPAGE_PMD_ORDER)
>> +ttu_flags |= TTU_SPLIT_HUGE_PMD;
>> +
>>  if (PageAnon(page))
>>  ttu_flags |= TTU_SPLIT_FREEZE;
>>
>> @@ -2339,21 +2342,22 @@ static void unmap_page(struct page *page)
>>  VM_BUG_ON_PAGE(!unmap_success, page);
>>  }
>>
>> -static void remap_page(struct page *page, unsigned int nr)
>> +static void remap_page(struct page *page, unsigned int nr, unsigned int 
>> new_nr)
>>  {
>>  int i;
>> -if (PageTransHuge(page)) {
>> +if (thp_nr_pages(page) == nr) {
>>  remove_migration_ptes(page, page, true);
>>  } else {
>> -for (i = 0; i < nr; i++)
>> +for (i = 0; i < nr; i += new_nr)
>>  remove_migration_ptes(page + i, page + i, true);
>>  }
>>  }
>>
>>  static void __split_huge_page_tail(struct page *head, int tail,
>> -struct lruvec *lruvec, struct list_head *list)
>> +struct lruvec *lruvec, struct list_head *list, unsigned int 
>> new_order)
>>  {
>>  struct page *page_tail = head + tail;
>> +unsigned long compound_head_flag = new_order ? (1L << PG_head) : 0;
>>
>>  VM_BUG_ON_PAGE(atomic_read(_tail->_mapcount) != -1, page_tail);
>>
>> @@ -2377,6 +2381,7 @@ static void __split_huge_page_tail(struct page *head, 
>> int tail,
>>  #ifdef CONFIG_64BIT
>>   (1L << PG_arch_2) |
>>  #endif
>> + compound_head_flag |
>>   (1L << PG_dirty)));
>>
>>  /* ->mapping in first tail page is compound_mapcount */
>> @@ -2395,10 +2400,15 @@ static void __split_huge_page_tail(struct page 
>> *head, int tail,
>>   * which needs correct compound_head().
>>   */
>>  clear_compound_head(page_tail);
>> +if (new_order) {
>> +prep_compound_page(page_tail, new_order);
>> +thp_prep(page_tail);
>> +}
>>
>>  /* Finally unfreeze refcount. Additional reference from page cache. */
>> -page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) ||
>> -  PageSwapCache(head)));
>> +page_ref_unfreeze(page_tail, 1 + ((!PageAnon(head) ||
>> +   PageSwapCache(head)) ?
>> +thp_nr_pages(page_tail) : 

Re: Error: invalid switch -me200

2020-11-13 Thread Segher Boessenkool
On Fri, Nov 13, 2020 at 04:37:38PM -0800, Fāng-ruì Sòng wrote:
> On Fri, Nov 13, 2020 at 4:23 PM Segher Boessenkool
>  wrote:
> > On Fri, Nov 13, 2020 at 12:14:18PM -0800, Nick Desaulniers wrote:
> > > > > > Error: invalid switch -me200
> > > > > > Error: unrecognized option -me200
> > > > >
> > > > > 251 cpu-as-$(CONFIG_E200)   += -Wa,-me200
> > > > >
> > > > > Are those all broken configs, or is Kconfig messed up such that
> > > > > randconfig can select these when it should not?
> > > >
> > > > Hmmm, looks like this flag does not exist in mainline binutils? There is
> > > > a thread in 2010 about this that Segher commented on:
> > > >
> > > > https://lore.kernel.org/linuxppc-dev/9859e645-954d-4d07-8003-ffcd2391a...@kernel.crashing.org/
> > > >
> > > > Guess this config should be eliminated?
> >
> > The help text for this config option says that e200 is used in 55xx,
> > and there *is* an -me5500 GAS flag (which probably does this same
> > thing, too).  But is any of this tested, or useful, or wanted?
> >
> > Maybe Christophe knows, cc:ed.
> 
> CC Alan Modra, a binutils global maintainer.
> 
> Alan, can the few -Wa,-m* options be deleted from arch/powerpc/Makefile?

All the others work fine (and are needed afaics), it is only -me200 that
doesn't exist (in mainline binutils).  Perhaps -me5500 will work for it
instead.


Segher


[PATCH v5 0/6] Intel MAX10 BMC Secure Update Driver

2020-11-13 Thread Russ Weight
The Intel MAX10 BMC Secure Update driver instantiates the FPGA
Security Manager class driver and provides the callback functions
required to support secure updates on Intel n3000 PAC devices.
This driver is implemented as a sub-driver of the Intel MAX10 BMC
mfd driver. Future instances of the MAX10 BMC will support other
devices as well (e.g. d5005) and this same MAX10 BMC Secure
Update driver will receive modifications to support that device.

This driver interacts with the HW secure update engine of the
BMC in order to transfer new FPGA and BMC images to FLASH so
that they will be automatically loaded when the FPGA card reboots.
Security is enforced by hardware and firmware. The MAX10 BMC
Secure Update driver interacts with the firmware to initiate
an update, pass in the necessary data, and collect status on
the update.

This driver provides sysfs files for displaying the flash count,
the root entry hashes (REH), and the code-signing-key (CSK)
cancellation vectors.

These patches are dependent on other patches that are under
review. If you want to apply and compile these patches on
linux-next, please apply these patches first:

(7 patches) https://marc.info/?l=linux-fpga&m=160462501901359&w=2

If you have an n3000 PAC card and want to test this driver, you
will also need this patch:

(1 patch) https://marc.info/?l=linux-fpga&m=160379607703940&w=2

Changelog v4 -> v5:
  - Renamed sysfs node user_flash_count to flash_count and updated
the sysfs documentation accordingly to more accurately describe
the purpose of the count.

Changelog v3 -> v4:
  - Moved sysfs files for displaying the flash count, the root
entry hashes (REH), and the code-signing-key (CSK) cancellation
vectors from the FPGA Security Manager class driver to this
driver (as they are not generic enough for the class driver).
  - Added a new ABI documentation file with information about the
new sysfs entries: sysfs-driver-intel-m10-bmc-secure
  - Updated the MAINTAINERS file to add the new ABI documentation
file: sysfs-driver-intel-m10-bmc-secure
  - Removed unnecessary ret variable from m10bmc_secure_probe()
  - Incorporated new devm_fpga_sec_mgr_register() function into
m10bmc_secure_probe() and removed the m10bmc_secure_remove()
function.

Changelog v2 -> v3:
  - Changed "MAX10 BMC Security Engine driver" to "MAX10 BMC Secure
Update driver"
  - Changed from "Intel FPGA Security Manager" to "FPGA Security Manager"
  - Changed: iops -> sops, imgr -> smgr, IFPGA_ -> FPGA_, ifpga_ to fpga_
  - Removed wrapper functions (m10bmc_raw_*, m10bmc_sys_*). The
underlying functions are now called directly.
  - Changed "_root_entry_hash" to "_reh", with a comment explaining
what reh is.
  - Renamed get_csk_vector() to m10bmc_csk_vector()
  - Changed calling functions of functions that return "enum fpga_sec_err"
to check for (ret != FPGA_SEC_ERR_NONE) instead of (ret)

Changelog v1 -> v2:
  - These patches were previously submitted as part of a larger V1
patch set under the title "Intel FPGA Security Manager Class Driver".
  - Grouped all changes to include/linux/mfd/intel-m10-bmc.h into a
single patch: "mfd: intel-m10-bmc: support for MAX10 BMC Security
Engine".
  - Removed ifpga_sec_mgr_init() and ifpga_sec_mgr_uinit() functions.
  - Adapted to changes in the Intel FPGA Security Manager by splitting
the single call to ifpga_sec_mgr_register() into two function
calls: devm_ifpga_sec_mgr_create() and ifpga_sec_mgr_register().
  - Replaced small function-creation macros with explicit function
declarations.
  - Bug fix for the get_csk_vector() function to properly apply the
stride variable in calls to m10bmc_raw_bulk_read().
  - Added m10bmc_ prefix to functions in m10bmc_iops structure
  - Implemented HW_ERRINFO_POISON for m10bmc_sec_hw_errinfo() to
ensure that corresponding bits are set to 1 if we are unable
to read the doorbell or auth_result registers.
  - Added comments and additional code cleanup per V1 review.

Russ Weight (6):
  mfd: intel-m10-bmc: support for MAX10 BMC Secure Updates
  fpga: m10bmc-sec: create max10 bmc secure update driver
  fpga: m10bmc-sec: expose max10 flash update count
  fpga: m10bmc-sec: expose max10 canceled keys in sysfs
  fpga: m10bmc-sec: add max10 secure update functions
  fpga: m10bmc-sec: add max10 get_hw_errinfo callback func

 .../testing/sysfs-driver-intel-m10-bmc-secure |  61 ++
 MAINTAINERS   |   2 +
 drivers/fpga/Kconfig  |  11 +
 drivers/fpga/Makefile |   3 +
 drivers/fpga/intel-m10-bmc-secure.c   | 542 ++
 include/linux/mfd/intel-m10-bmc.h |  85 +++
 6 files changed, 704 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-driver-intel-m10-bmc-secure
 create mode 100644 drivers/fpga/intel-m10-bmc-secure.c

-- 
2.25.1



[PATCH v5 4/6] fpga: m10bmc-sec: expose max10 canceled keys in sysfs

2020-11-13 Thread Russ Weight
Extend the MAX10 BMC Secure Update driver to provide sysfs
files to expose the canceled code signing key (CSK) bit
vectors. These use the standard bitmap list format
(e.g. 1,2-6,9).

Signed-off-by: Russ Weight 
Reviewed-by: Tom Rix 
---
v5:
  - No change
v4:
  - Moved sysfs files for displaying the code-signing-key (CSK)
cancellation vectors from the FPGA Security Manager class driver
to here. The m10bmc_csk_vector() and m10bmc_csk_cancel_nbits()
functions are removed and the functionality from these functions
is moved into a show_canceled_csk() function for displaying
the CSK vectors.
  - Added ABI documentation for new sysfs entries
v3:
  - Changed: iops -> sops, imgr -> smgr, IFPGA_ -> FPGA_, ifpga_ to fpga_
  - Changed "MAX10 BMC Secure Engine driver" to "MAX10 BMC Secure Update
driver"
  - Removed wrapper functions (m10bmc_raw_*, m10bmc_sys_*). The
underlying functions are now called directly.
  - Renamed get_csk_vector() to m10bmc_csk_vector()
v2:
  - Replaced small function-creation macros with explicit function
declarations.
  - Fixed get_csk_vector() function to properly apply the stride
variable in calls to m10bmc_raw_bulk_read()
  - Added m10bmc_ prefix to functions in m10bmc_iops structure
---
 .../testing/sysfs-driver-intel-m10-bmc-secure | 24 ++
 drivers/fpga/intel-m10-bmc-secure.c   | 46 +++
 2 files changed, 70 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-driver-intel-m10-bmc-secure 
b/Documentation/ABI/testing/sysfs-driver-intel-m10-bmc-secure
index 73a3aba750e8..610f19569b5f 100644
--- a/Documentation/ABI/testing/sysfs-driver-intel-m10-bmc-secure
+++ b/Documentation/ABI/testing/sysfs-driver-intel-m10-bmc-secure
@@ -28,6 +28,30 @@ Description: Read only. Returns the root entry hash for the 
BMC image
underlying device supports it.
Format: "0x%x".
 
+What:  
/sys/bus/platform/devices/n3000bmc-secure.*.auto/security/sr_canceled_csks
+Date:  Oct 2020
+KernelVersion:  5.11
+Contact:   Russ Weight 
+Description:   Read only. Returns a list of indices for canceled code
+   signing keys for the static region. The standard bitmap
+   list format is used (e.g. "1,2-6,9").
+
+What:  
/sys/bus/platform/devices/n3000bmc-secure.*.auto/security/pr_canceled_csks
+Date:  Oct 2020
+KernelVersion:  5.11
+Contact:   Russ Weight 
+Description:   Read only. Returns a list of indices for canceled code
+   signing keys for the partial reconfiguration region. The
+   standard bitmap list format is used (e.g. "1,2-6,9").
+
+What:  
/sys/bus/platform/devices/n3000bmc-secure.*.auto/security/bmc_canceled_csks
+Date:  Oct 2020
+KernelVersion:  5.11
+Contact:   Russ Weight 
+Description:   Read only. Returns a list of indices for canceled code
+   signing keys for the BMC.  The standard bitmap list format
+   is used (e.g. "1,2-6,9").
+
 What:  
/sys/bus/platform/devices/n3000bmc-secure.*.auto/security/flash_count
 Date:  Oct 2020
 KernelVersion:  5.11
diff --git a/drivers/fpga/intel-m10-bmc-secure.c 
b/drivers/fpga/intel-m10-bmc-secure.c
index 6ad897001086..689da5bc6461 100644
--- a/drivers/fpga/intel-m10-bmc-secure.c
+++ b/drivers/fpga/intel-m10-bmc-secure.c
@@ -78,6 +78,49 @@ DEVICE_ATTR_SEC_REH_RO(bmc, BMC_PROG_MAGIC, BMC_PROG_ADDR, 
BMC_REH_ADDR);
 DEVICE_ATTR_SEC_REH_RO(sr, SR_PROG_MAGIC, SR_PROG_ADDR, SR_REH_ADDR);
 DEVICE_ATTR_SEC_REH_RO(pr, PR_PROG_MAGIC, PR_PROG_ADDR, PR_REH_ADDR);
 
+#define CSK_BIT_LEN128U
+#define CSK_32ARRAY_SIZE   DIV_ROUND_UP(CSK_BIT_LEN, 32)
+
+static ssize_t
+show_canceled_csk(struct device *dev, u32 addr, char *buf)
+{
+   unsigned int i, stride, size = CSK_32ARRAY_SIZE * sizeof(u32);
+   struct m10bmc_sec *sec = dev_get_drvdata(dev);
+   DECLARE_BITMAP(csk_map, CSK_BIT_LEN);
+   __le32 csk_le32[CSK_32ARRAY_SIZE];
+   u32 csk32[CSK_32ARRAY_SIZE];
+   int ret;
+
+   stride = regmap_get_reg_stride(sec->m10bmc->regmap);
+
+   ret = regmap_bulk_read(sec->m10bmc->regmap, addr, csk_le32, size / 
stride);
+   if (ret) {
+   dev_err(sec->dev, "failed to read CSK vector: %x cnt %x: %d\n",
+   addr, size / stride, ret);
+   return ret;
+   }
+
+   for (i = 0; i < CSK_32ARRAY_SIZE; i++)
+   csk32[i] = le32_to_cpu(((csk_le32[i])));
+
+   bitmap_from_arr32(csk_map, csk32, CSK_BIT_LEN);
+   bitmap_complement(csk_map, csk_map, CSK_BIT_LEN);
+   return bitmap_print_to_pagebuf(1, buf, csk_map, CSK_BIT_LEN);
+}
+
+#define DEVICE_ATTR_SEC_CSK_RO(_name, _addr) \
+static ssize_t _name##_canceled_csks_show(struct device *dev, \
+ struct device_attribute *attr, \
+ char *buf) \
+{ return show_canceled_csk(dev, _addr, buf); } \
+static 

[PATCH v5 3/6] fpga: m10bmc-sec: expose max10 flash update count

2020-11-13 Thread Russ Weight
Extend the MAX10 BMC Secure Update driver to provide a
sysfs file to expose the flash update count for the FPGA
user image.

Signed-off-by: Russ Weight 
Reviewed-by: Tom Rix 
---
v5:
  - Renamed sysfs node user_flash_count to flash_count and updated the
sysfs documentation accordingly.
v4:
  - Moved the sysfs file for displaying the flash count from the
FPGA Security Manager class driver to here. The
m10bmc_user_flash_count() function is removed and the
functionality is moved into a user_flash_count_show()
function.
  - Added ABI documentation for the new sysfs entry
v3:
  - Changed: iops -> sops, imgr -> smgr, IFPGA_ -> FPGA_, ifpga_ to fpga_
  - Changed "MAX10 BMC Secure Engine driver" to "MAX10 BMC Secure Update
driver"
  - Removed wrapper functions (m10bmc_raw_*, m10bmc_sys_*). The
underlying functions are now called directly.
v2:
  - Renamed get_qspi_flash_count() to m10bmc_user_flash_count()
  - Minor code cleanup per review comments
  - Added m10bmc_ prefix to functions in m10bmc_iops structure
---
 .../testing/sysfs-driver-intel-m10-bmc-secure |  8 +
 drivers/fpga/intel-m10-bmc-secure.c   | 34 +++
 2 files changed, 42 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-driver-intel-m10-bmc-secure 
b/Documentation/ABI/testing/sysfs-driver-intel-m10-bmc-secure
index 2992488b717a..73a3aba750e8 100644
--- a/Documentation/ABI/testing/sysfs-driver-intel-m10-bmc-secure
+++ b/Documentation/ABI/testing/sysfs-driver-intel-m10-bmc-secure
@@ -27,3 +27,11 @@ Description: Read only. Returns the root entry hash for the 
BMC image
"hash not programmed".  This file is only visible if the
underlying device supports it.
Format: "0x%x".
+
+What:  
/sys/bus/platform/devices/n3000bmc-secure.*.auto/security/flash_count
+Date:  Oct 2020
+KernelVersion:  5.11
+Contact:   Russ Weight 
+Description:   Read only. Returns number of times the secure update
+   staging area has been flashed.
+   Format: "%u".
diff --git a/drivers/fpga/intel-m10-bmc-secure.c 
b/drivers/fpga/intel-m10-bmc-secure.c
index 198bc8273d6b..6ad897001086 100644
--- a/drivers/fpga/intel-m10-bmc-secure.c
+++ b/drivers/fpga/intel-m10-bmc-secure.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct m10bmc_sec {
struct device *dev;
@@ -77,7 +78,40 @@ DEVICE_ATTR_SEC_REH_RO(bmc, BMC_PROG_MAGIC, BMC_PROG_ADDR, 
BMC_REH_ADDR);
 DEVICE_ATTR_SEC_REH_RO(sr, SR_PROG_MAGIC, SR_PROG_ADDR, SR_REH_ADDR);
 DEVICE_ATTR_SEC_REH_RO(pr, PR_PROG_MAGIC, PR_PROG_ADDR, PR_REH_ADDR);
 
+#define FLASH_COUNT_SIZE 4096  /* count stored as inverted bit vector */
+
+static ssize_t flash_count_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct m10bmc_sec *sec = dev_get_drvdata(dev);
+   unsigned int stride = regmap_get_reg_stride(sec->m10bmc->regmap);
+   unsigned int num_bits = FLASH_COUNT_SIZE * 8;
+   u8 *flash_buf;
+   int cnt, ret;
+
+   flash_buf = kmalloc(FLASH_COUNT_SIZE, GFP_KERNEL);
+   if (!flash_buf)
+   return -ENOMEM;
+
+   ret = regmap_bulk_read(sec->m10bmc->regmap, STAGING_FLASH_COUNT,
+  flash_buf, FLASH_COUNT_SIZE / stride);
+   if (ret) {
+   dev_err(sec->dev,
+   "failed to read flash count: %x cnt %x: %d\n",
+   STAGING_FLASH_COUNT, FLASH_COUNT_SIZE / stride, ret);
+   goto exit_free;
+   }
+   cnt = num_bits - bitmap_weight((unsigned long *)flash_buf, num_bits);
+
+exit_free:
+   kfree(flash_buf);
+
+   return ret ? : sysfs_emit(buf, "%u\n", cnt);
+}
+static DEVICE_ATTR_RO(flash_count);
+
 static struct attribute *m10bmc_security_attrs[] = {
+   _attr_flash_count.attr,
_attr_bmc_root_entry_hash.attr,
_attr_sr_root_entry_hash.attr,
_attr_pr_root_entry_hash.attr,
-- 
2.25.1



[PATCH v5 5/6] fpga: m10bmc-sec: add max10 secure update functions

2020-11-13 Thread Russ Weight
Extend the MAX10 BMC Secure Update driver to include
the functions that enable secure updates of BMC images,
FPGA images, etc.

Signed-off-by: Russ Weight 
---
v5:
  - No change
v4:
  - No change
v3:
  - Changed: iops -> sops, imgr -> smgr, IFPGA_ -> FPGA_, ifpga_ to fpga_
  - Changed "MAX10 BMC Secure Engine driver" to "MAX10 BMC Secure Update
driver"
  - Removed wrapper functions (m10bmc_raw_*, m10bmc_sys_*). The
underlying functions are now called directly.
  - Changed calling functions of functions that return "enum fpga_sec_err"
to check for (ret != FPGA_SEC_ERR_NONE) instead of (ret)
v2:
  - Reworked the rsu_start_done() function to make it more readable
  - Reworked while-loop condition/content in rsu_prog_ready()
  - Minor code cleanup per review comments
  - Added a comment to the m10bmc_sec_poll_complete() function to
explain the context (could take 30+ minutes to complete).
  - Added m10bmc_ prefix to functions in m10bmc_iops structure
  - Moved MAX10 BMC address and function definitions to a separate
patch.
---
 drivers/fpga/intel-m10-bmc-secure.c | 305 +++-
 1 file changed, 304 insertions(+), 1 deletion(-)

diff --git a/drivers/fpga/intel-m10-bmc-secure.c 
b/drivers/fpga/intel-m10-bmc-secure.c
index 689da5bc6461..4fa8a2256088 100644
--- a/drivers/fpga/intel-m10-bmc-secure.c
+++ b/drivers/fpga/intel-m10-bmc-secure.c
@@ -174,7 +174,310 @@ static const struct attribute_group 
*m10bmc_sec_attr_groups[] = {
NULL,
 };
 
-static const struct fpga_sec_mgr_ops m10bmc_sops = { };
+static void log_error_regs(struct m10bmc_sec *sec, u32 doorbell)
+{
+   u32 auth_result;
+
+   dev_err(sec->dev, "RSU error status: 0x%08x\n", doorbell);
+
+   if (!m10bmc_sys_read(sec->m10bmc, M10BMC_AUTH_RESULT, _result))
+   dev_err(sec->dev, "RSU auth result: 0x%08x\n", auth_result);
+}
+
+static enum fpga_sec_err rsu_check_idle(struct m10bmc_sec *sec)
+{
+   u32 doorbell;
+   int ret;
+
+   ret = m10bmc_sys_read(sec->m10bmc, M10BMC_DOORBELL, );
+   if (ret)
+   return FPGA_SEC_ERR_RW_ERROR;
+
+   if (rsu_prog(doorbell) != RSU_PROG_IDLE &&
+   rsu_prog(doorbell) != RSU_PROG_RSU_DONE) {
+   log_error_regs(sec, doorbell);
+   return FPGA_SEC_ERR_BUSY;
+   }
+
+   return FPGA_SEC_ERR_NONE;
+}
+
+static inline bool rsu_start_done(u32 doorbell)
+{
+   u32 status, progress;
+
+   if (doorbell & DRBL_RSU_REQUEST)
+   return false;
+
+   status = rsu_stat(doorbell);
+   if (status == RSU_STAT_ERASE_FAIL || status == RSU_STAT_WEAROUT)
+   return true;
+
+   progress = rsu_prog(doorbell);
+   if (progress != RSU_PROG_IDLE && progress != RSU_PROG_RSU_DONE)
+   return true;
+
+   return false;
+}
+
+static enum fpga_sec_err rsu_update_init(struct m10bmc_sec *sec)
+{
+   u32 doorbell, status;
+   int ret;
+
+   ret = regmap_update_bits(sec->m10bmc->regmap,
+M10BMC_SYS_BASE + M10BMC_DOORBELL,
+DRBL_RSU_REQUEST | DRBL_HOST_STATUS,
+DRBL_RSU_REQUEST |
+FIELD_PREP(DRBL_HOST_STATUS,
+   HOST_STATUS_IDLE));
+   if (ret)
+   return FPGA_SEC_ERR_RW_ERROR;
+
+   ret = regmap_read_poll_timeout(sec->m10bmc->regmap,
+  M10BMC_SYS_BASE + M10BMC_DOORBELL,
+  doorbell,
+  rsu_start_done(doorbell),
+  NIOS_HANDSHAKE_INTERVAL_US,
+  NIOS_HANDSHAKE_TIMEOUT_US);
+
+   if (ret == -ETIMEDOUT) {
+   log_error_regs(sec, doorbell);
+   return FPGA_SEC_ERR_TIMEOUT;
+   } else if (ret) {
+   return FPGA_SEC_ERR_RW_ERROR;
+   }
+
+   status = rsu_stat(doorbell);
+   if (status == RSU_STAT_WEAROUT) {
+   dev_warn(sec->dev, "Excessive flash update count detected\n");
+   return FPGA_SEC_ERR_WEAROUT;
+   } else if (status == RSU_STAT_ERASE_FAIL) {
+   log_error_regs(sec, doorbell);
+   return FPGA_SEC_ERR_HW_ERROR;
+   }
+
+   return FPGA_SEC_ERR_NONE;
+}
+
+static enum fpga_sec_err rsu_prog_ready(struct m10bmc_sec *sec)
+{
+   unsigned long poll_timeout;
+   u32 doorbell, progress;
+   int ret;
+
+   ret = m10bmc_sys_read(sec->m10bmc, M10BMC_DOORBELL, );
+   if (ret)
+   return FPGA_SEC_ERR_RW_ERROR;
+
+   poll_timeout = jiffies + msecs_to_jiffies(RSU_PREP_TIMEOUT_MS);
+   while (rsu_prog(doorbell) == RSU_PROG_PREPARE) {
+   msleep(RSU_PREP_INTERVAL_MS);
+   if (time_after(jiffies, poll_timeout))
+   break;
+
+   ret = m10bmc_sys_read(sec->m10bmc, M10BMC_DOORBELL, );
+   

[PATCH v5 2/6] fpga: m10bmc-sec: create max10 bmc secure update driver

2020-11-13 Thread Russ Weight
Create a platform driver that can be invoked as a sub
driver for the Intel MAX10 BMC in order to support
secure updates. This sub-driver will invoke an
instance of the FPGA Security Manager class driver
in order to expose sysfs interfaces for managing and
monitoring secure updates to FPGA and BMC images.

This patch creates the MAX10 BMC Secure Update driver and
provides sysfs files for displaying the current root entry hashes
for the FPGA static region, the FPGA PR region, and the MAX10
BMC.

Signed-off-by: Russ Weight 
---
v5:
  - No change
v4:
  - Moved sysfs files for displaying the root entry hashes (REH)
from the FPGA Security Manager class driver to here. The
m10bmc_reh() and m10bmc_reh_size() functions are removed and
the functionality from these functions is moved into a
show_root_entry_hash() function for displaying the REHs.
  - Added ABI documentation for the new sysfs entries:
sysfs-driver-intel-m10-bmc-secure
  - Updated the MAINTAINERS file to add the new ABI documentation
file: sysfs-driver-intel-m10-bmc-secure
  - Removed unnecessary ret variable from m10bmc_secure_probe()
  - Incorporated new devm_fpga_sec_mgr_register() function into
m10bmc_secure_probe() and removed the m10bmc_secure_remove()
function.
v3:
  - Changed from "Intel FPGA Security Manager" to "FPGA Security Manager"
  - Changed: iops -> sops, imgr -> smgr, IFPGA_ -> FPGA_, ifpga_ to fpga_
  - Changed "MAX10 BMC Secure Engine driver" to "MAX10 BMC Secure
Update driver"
  - Removed wrapper functions (m10bmc_raw_*, m10bmc_sys_*). The
underlying functions are now called directly.
  - Changed "_root_entry_hash" to "_reh", with a comment explaining
what reh is.
v2:
  - Added drivers/fpga/intel-m10-bmc-secure.c file to MAINTAINERS.
  - Switched to GENMASK(31, 16) for a couple of mask definitions.
  - Moved MAX10 BMC address and function definitions to a separate
patch.
  - Replaced small function-creation macros with explicit function
declarations.
  - Removed ifpga_sec_mgr_init() and ifpga_sec_mgr_uinit() functions.
  - Adapted to changes in the Intel FPGA Security Manager by splitting
the single call to ifpga_sec_mgr_register() into two function
calls: devm_ifpga_sec_mgr_create() and ifpga_sec_mgr_register().
---
 .../testing/sysfs-driver-intel-m10-bmc-secure |  29 
 MAINTAINERS   |   2 +
 drivers/fpga/Kconfig  |  11 ++
 drivers/fpga/Makefile |   3 +
 drivers/fpga/intel-m10-bmc-secure.c   | 134 ++
 5 files changed, 179 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-driver-intel-m10-bmc-secure
 create mode 100644 drivers/fpga/intel-m10-bmc-secure.c

diff --git a/Documentation/ABI/testing/sysfs-driver-intel-m10-bmc-secure 
b/Documentation/ABI/testing/sysfs-driver-intel-m10-bmc-secure
new file mode 100644
index ..2992488b717a
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-driver-intel-m10-bmc-secure
@@ -0,0 +1,29 @@
+What:  
/sys/bus/platform/devices/n3000bmc-secure.*.auto/security/sr_root_entry_hash
+Date:  Oct 2020
+KernelVersion:  5.11
+Contact:   Russ Weight 
+Description:   Read only. Returns the root entry hash for the static
+   region if one is programmed, else it returns the
+   string: "hash not programmed".  This file is only
+   visible if the underlying device supports it.
+   Format: "0x%x".
+
+What:  
/sys/bus/platform/devices/n3000bmc-secure.*.auto/security/pr_root_entry_hash
+Date:  Oct 2020
+KernelVersion:  5.11
+Contact:   Russ Weight 
+Description:   Read only. Returns the root entry hash for the partial
+   reconfiguration region if one is programmed, else it
+   returns the string: "hash not programmed".  This file
+   is only visible if the underlying device supports it.
+   Format: "0x%x".
+
+What:  
/sys/bus/platform/devices/n3000bmc-secure.*.auto/security/bmc_root_entry_hash
+Date:  Oct 2020
+KernelVersion:  5.11
+Contact:   Russ Weight 
+Description:   Read only. Returns the root entry hash for the BMC image
+   if one is programmed, else it returns the string:
+   "hash not programmed".  This file is only visible if the
+   underlying device supports it.
+   Format: "0x%x".
diff --git a/MAINTAINERS b/MAINTAINERS
index 23c655fc0001..bbd2366280de 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6941,8 +6941,10 @@ M:   Russ Weight 
 L: linux-f...@vger.kernel.org
 S: Maintained
 F: Documentation/ABI/testing/sysfs-class-fpga-sec-mgr
+F: Documentation/ABI/testing/sysfs-driver-intel-m10-bmc-secure
 F: Documentation/fpga/fpga-sec-mgr.rst
 F: drivers/fpga/fpga-sec-mgr.c
+F: drivers/fpga/intel-m10-bmc-secure.c
 F: include/linux/fpga/fpga-sec-mgr.h
 
 FPU EMULATOR
diff --git 

[PATCH v5 1/6] mfd: intel-m10-bmc: support for MAX10 BMC Secure Updates

2020-11-13 Thread Russ Weight
Add macros and definitions required by the MAX10 BMC
Secure Update driver.

Signed-off-by: Russ Weight 
Acked-by: Lee Jones 
---
v5:
  - Renamed USER_FLASH_COUNT to STAGING_FLASH_COUNT
v4:
  - No change
v3:
  - Changed "MAX10 BMC Secure Engine driver" to "MAX10 BMC Secure
Update driver"
  - Removed wrapper functions (m10bmc_raw_*, m10bmc_sys_*). The
underlying functions will be called directly.
v2:
  - These functions and macros were previously distributed among
the patches that needed them. They are now grouped together
in a single patch containing changes to the Intel MAX10 BMC
driver.
  - Added DRBL_ prefix to some definitions
  - Some address definitions were moved here from the .c files that
use them.
---
 include/linux/mfd/intel-m10-bmc.h | 85 +++
 1 file changed, 85 insertions(+)

diff --git a/include/linux/mfd/intel-m10-bmc.h 
b/include/linux/mfd/intel-m10-bmc.h
index c8ef2f1654a4..ab8d78b92df9 100644
--- a/include/linux/mfd/intel-m10-bmc.h
+++ b/include/linux/mfd/intel-m10-bmc.h
@@ -13,6 +13,9 @@
 #define M10BMC_SYS_BASE0x300800
 #define M10BMC_MEM_END 0x20fc
 
+#define M10BMC_STAGING_BASE0x1800
+#define M10BMC_STAGING_SIZE0x380
+
 /* Register offset of system registers */
 #define NIOS2_FW_VERSION   0x0
 #define M10BMC_TEST_REG0x3c
@@ -21,6 +24,88 @@
 #define M10BMC_VER_PCB_INFO_MSKGENMASK(31, 24)
 #define M10BMC_VER_LEGACY_INVALID  0x
 
+/* Secure update doorbell register, in system register region */
+#define M10BMC_DOORBELL0x400
+
+/* Authorization Result register, in system register region */
+#define M10BMC_AUTH_RESULT 0x404
+
+/* Doorbell register fields */
+#define DRBL_RSU_REQUEST   BIT(0)
+#define DRBL_RSU_PROGRESS  GENMASK(7, 4)
+#define DRBL_HOST_STATUS   GENMASK(11, 8)
+#define DRBL_RSU_STATUSGENMASK(23, 16)
+#define DRBL_PKVL_EEPROM_LOAD_SEC  BIT(24)
+#define DRBL_PKVL1_POLL_EN BIT(25)
+#define DRBL_PKVL2_POLL_EN BIT(26)
+#define DRBL_CONFIG_SELBIT(28)
+#define DRBL_REBOOT_REQBIT(29)
+#define DRBL_REBOOT_DISABLED   BIT(30)
+
+/* Progress states */
+#define RSU_PROG_IDLE  0x0
+#define RSU_PROG_PREPARE   0x1
+#define RSU_PROG_READY 0x3
+#define RSU_PROG_AUTHENTICATING0x4
+#define RSU_PROG_COPYING   0x5
+#define RSU_PROG_UPDATE_CANCEL 0x6
+#define RSU_PROG_PROGRAM_KEY_HASH  0x7
+#define RSU_PROG_RSU_DONE  0x8
+#define RSU_PROG_PKVL_PROM_DONE0x9
+
+/* Device and error states */
+#define RSU_STAT_NORMAL0x0
+#define RSU_STAT_TIMEOUT   0x1
+#define RSU_STAT_AUTH_FAIL 0x2
+#define RSU_STAT_COPY_FAIL 0x3
+#define RSU_STAT_FATAL 0x4
+#define RSU_STAT_PKVL_REJECT   0x5
+#define RSU_STAT_NON_INC   0x6
+#define RSU_STAT_ERASE_FAIL0x7
+#define RSU_STAT_WEAROUT   0x8
+#define RSU_STAT_NIOS_OK   0x80
+#define RSU_STAT_USER_OK   0x81
+#define RSU_STAT_FACTORY_OK0x82
+#define RSU_STAT_USER_FAIL 0x83
+#define RSU_STAT_FACTORY_FAIL  0x84
+#define RSU_STAT_NIOS_FLASH_ERR0x85
+#define RSU_STAT_FPGA_FLASH_ERR0x86
+
+#define HOST_STATUS_IDLE   0x0
+#define HOST_STATUS_WRITE_DONE 0x1
+#define HOST_STATUS_ABORT_RSU  0x2
+
+#define rsu_prog(doorbell) FIELD_GET(DRBL_RSU_PROGRESS, doorbell)
+#define rsu_stat(doorbell) FIELD_GET(DRBL_RSU_STATUS, doorbell)
+
+/* interval 100ms and timeout 5s */
+#define NIOS_HANDSHAKE_INTERVAL_US (100 * 1000)
+#define NIOS_HANDSHAKE_TIMEOUT_US  (5 * 1000 * 1000)
+
+/* RSU PREP Timeout (2 minutes) to erase flash staging area */
+#define RSU_PREP_INTERVAL_MS   100
+#define RSU_PREP_TIMEOUT_MS(2 * 60 * 1000)
+
+/* RSU Complete Timeout (40 minutes) for full flash update */
+#define RSU_COMPLETE_INTERVAL_MS   1000
+#define RSU_COMPLETE_TIMEOUT_MS(40 * 60 * 1000)
+
+/* Addresses for security related data in FLASH */
+#define BMC_REH_ADDR   0x17ffc004
+#define BMC_PROG_ADDR  0x17ffc000
+#define BMC_PROG_MAGIC 0x5746
+
+#define SR_REH_ADDR0x17ffd004
+#define SR_PROG_ADDR   0x17ffd000
+#define SR_PROG_MAGIC  0x5253
+
+#define PR_REH_ADDR0x17ffe004
+#define PR_PROG_ADDR   0x17ffe000
+#define PR_PROG_MAGIC  0x5250
+
+/* Address of 4KB inverted bit vector containing staging area FLASH count */
+#define STAGING_FLASH_COUNT0x17ffb000
+
 /**
  * struct intel_m10bmc - Intel MAX 10 BMC parent driver data structure
  * @dev: this device
-- 
2.25.1



[PATCH v5 6/6] fpga: m10bmc-sec: add max10 get_hw_errinfo callback func

2020-11-13 Thread Russ Weight
Extend the MAX10 BMC Secure Update driver to include
a function that returns 64 bits of additional HW specific
data for errors that require additional information.
This callback function enables the hw_errinfo sysfs
node in the FPGA Security Manager class driver.

Signed-off-by: Russ Weight 
---
v5:
  - No change
v4:
  - No change
v3:
  - Changed: iops -> sops, imgr -> smgr, IFPGA_ -> FPGA_, ifpga_ to fpga_
  - Changed "MAX10 BMC Secure Engine driver" to "MAX10 BMC Secure Update
driver"
v2:
  - Implemented HW_ERRINFO_POISON for m10bmc_sec_hw_errinfo() to
ensure that corresponding bits are set to 1 if we are unable
to read the doorbell or auth_result registers.
  - Added m10bmc_ prefix to functions in m10bmc_iops structure
---
 drivers/fpga/intel-m10-bmc-secure.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/drivers/fpga/intel-m10-bmc-secure.c 
b/drivers/fpga/intel-m10-bmc-secure.c
index 4fa8a2256088..a024efb173d3 100644
--- a/drivers/fpga/intel-m10-bmc-secure.c
+++ b/drivers/fpga/intel-m10-bmc-secure.c
@@ -472,11 +472,36 @@ static enum fpga_sec_err m10bmc_sec_cancel(struct 
fpga_sec_mgr *smgr)
return ret ? FPGA_SEC_ERR_RW_ERROR : FPGA_SEC_ERR_NONE;
 }
 
+#define HW_ERRINFO_POISON  GENMASK(31, 0)
+static u64 m10bmc_sec_hw_errinfo(struct fpga_sec_mgr *smgr)
+{
+   struct m10bmc_sec *sec = smgr->priv;
+   u32 doorbell, auth_result;
+
+   switch (smgr->err_code) {
+   case FPGA_SEC_ERR_HW_ERROR:
+   case FPGA_SEC_ERR_TIMEOUT:
+   case FPGA_SEC_ERR_BUSY:
+   case FPGA_SEC_ERR_WEAROUT:
+   if (m10bmc_sys_read(sec->m10bmc, M10BMC_DOORBELL, ))
+   doorbell = HW_ERRINFO_POISON;
+
+   if (m10bmc_sys_read(sec->m10bmc, M10BMC_AUTH_RESULT,
+   _result))
+   auth_result = HW_ERRINFO_POISON;
+
+   return (u64)doorbell << 32 | (u64)auth_result;
+   default:
+   return 0;
+   }
+}
+
 static const struct fpga_sec_mgr_ops m10bmc_sops = {
.prepare = m10bmc_sec_prepare,
.write_blk = m10bmc_sec_write_blk,
.poll_complete = m10bmc_sec_poll_complete,
.cancel = m10bmc_sec_cancel,
+   .get_hw_errinfo = m10bmc_sec_hw_errinfo,
 };
 
 static int m10bmc_secure_probe(struct platform_device *pdev)
-- 
2.25.1



Re: [RFC PATCH 2/6] mm: memcg: make memcg huge page split support any order split.

2020-11-13 Thread Zi Yan
On 13 Nov 2020, at 19:23, Roman Gushchin wrote:

> On Wed, Nov 11, 2020 at 03:40:04PM -0500, Zi Yan wrote:
>> From: Zi Yan 
>>
>> It reads thp_nr_pages and splits to provided new_nr. It prepares for
>> upcoming changes to support split huge page to any lower order.
>>
>> Signed-off-by: Zi Yan 
>> ---
>>  include/linux/memcontrol.h | 5 +++--
>>  mm/huge_memory.c   | 2 +-
>>  mm/memcontrol.c| 4 ++--
>>  3 files changed, 6 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 0f4dd7829fb2..b3bac79ceed6 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -1105,7 +1105,7 @@ static inline void memcg_memory_event_mm(struct 
>> mm_struct *mm,
>>  }
>>
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> -void mem_cgroup_split_huge_fixup(struct page *head);
>> +void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_nr);
>>  #endif
>>
>>  #else /* CONFIG_MEMCG */
>> @@ -1451,7 +1451,8 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t 
>> *pgdat, int order,
>>  return 0;
>>  }
>>
>> -static inline void mem_cgroup_split_huge_fixup(struct page *head)
>> +static inline void mem_cgroup_split_huge_fixup(struct page *head,
>> +   unsigned int new_nr)
>>  {
>>  }
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index c4fead5ead31..f599f5b9bf7f 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2429,7 +2429,7 @@ static void __split_huge_page(struct page *page, 
>> struct list_head *list,
>>  lruvec = mem_cgroup_page_lruvec(head, pgdat);
>>
>>  /* complete memcg works before add pages to LRU */
>> -mem_cgroup_split_huge_fixup(head);
>> +mem_cgroup_split_huge_fixup(head, 1);
>>
>>  if (PageAnon(head) && PageSwapCache(head)) {
>>  swp_entry_t entry = { .val = page_private(head) };
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 33f632689cee..e9705ba6bbcc 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3247,7 +3247,7 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, 
>> size_t size)
>>   * Because tail pages are not marked as "used", set it. We're under
>>   * pgdat->lru_lock and migration entries setup in all page mappings.
>>   */
>> -void mem_cgroup_split_huge_fixup(struct page *head)
>> +void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_nr)
>
> I'd go with unsigned int new_order, then it's obvious that we can split
> the original page without any leftovers.

Makes sense. Will change it.

>
> Other than that the patch looks good!
> Acked-by: Roman Gushchin 

Thanks.

>>  {
>>  struct mem_cgroup *memcg = page_memcg(head);
>>  int i;
>> @@ -3255,7 +3255,7 @@ void mem_cgroup_split_huge_fixup(struct page *head)
>>  if (mem_cgroup_disabled())
>>  return;
>>
>> -for (i = 1; i < thp_nr_pages(head); i++) {
>> +for (i = new_nr; i < thp_nr_pages(head); i += new_nr) {
>>  css_get(>css);
>>  head[i].memcg_data = (unsigned long)memcg;
>>  }
>> -- 
>> 2.28.0
>>


—
Best Regards,
Yan Zi




Re: [PATCH net-next v2 09/11] net: dsa: microchip: ksz9477: add hardware time stamping support

2020-11-13 Thread Vladimir Oltean
On Fri, Nov 13, 2020 at 07:57:32PM +0100, Christian Eggers wrote:
> On Friday, 13 November 2020, 03:40:20 CET, Vladimir Oltean wrote:
> > On Thu, Nov 12, 2020 at 04:35:35PM +0100, Christian Eggers wrote:
> [...]
> > > @@ -103,6 +108,10 @@ static int ksz9477_ptp_adjtime(struct ptp_clock_info
> > > *ptp, s64 delta)>
> > > if (ret)
> > >
> > > goto error_return;
> > >
> > > +   spin_lock_irqsave(>ptp_clock_lock, flags);
> >
> > I believe that spin_lock_irqsave is unnecessary, since there is no code
> > that runs in hardirq context.
> I'll check this again. Originally I had only a mutex for everything, but later
> it turned out that for ptp_clock_time a spinlock is required. Maybe this has
> changed since I started my work on the driver.

Yes, it's called from the networking softirq.
The typical assumption is that the networking data path can run in
both hardirq and softirq context (or, well, in process context if it
gets picked up by ksoftirqd), so one would think that _irqsave would be
justified. But the hardirq stuff is only used by netpoll, for
netconsole. So you would never hit that condition for PTP timestamping.
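
To sketch the implied locking pattern (illustrative, not the driver's
code, assuming the lock is only ever contended between process context
and the networking softirq):

	/* process context, e.g. the .adjtime callback */
	spin_lock_bh(&dev->ptp_clock_lock);
	dev->ptp_clock_time = timespec64_add(dev->ptp_clock_time, delta64);
	spin_unlock_bh(&dev->ptp_clock_lock);

	/* softirq context, e.g. timestamp processing on the RX path;
	 * bottom halves are already disabled here, and the _bh user
	 * above cannot be preempted by this softirq on the same CPU */
	spin_lock(&dev->ptp_clock_lock);
	/* ... read or update dev->ptp_clock_time ... */
	spin_unlock(&dev->ptp_clock_lock);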

> >
> > > +   dev->ptp_clock_time = timespec64_add(dev->ptp_clock_time, delta64);
> > > +   spin_unlock_irqrestore(>ptp_clock_lock, flags);
> > > +
>
> [...]
>
> > Could we make this line shorter?
> ...
> > Additionally, you exceed the 80 characters limit.
> ...
> > Also, you exceeded 80 characters by quite a bit.
> ...
> > In networking we still have the 80-characters rule, please follow it.
> Can this be added to the netdev-FAQ (just below the section about
> "comment style convention")?
>
> > > +static void ksz9477_ptp_ports_deinit(struct ksz_device *dev)
> > > +{
> > > +   int port;
> > > +
> > > +   for (port = dev->port_cnt - 1; port >= 0; --port)
> >
> > Nice, but also probably not worth the effort?
> What do you mean? Shall I use the forward direction?

Yes, that's what I meant.

> > > +
> > > +   /* Should already been tested in dsa_skb_tx_timestamp()? */
> > > +   if (!(skb_shinfo(clone)->tx_flags & SKBTX_HW_TSTAMP))
> > > +   return false;
> >
> > Yeah, should have...
> > What do you think about this one though:
> > https://lore.kernel.org/netdev/20201104015834.mcn2eoibxf6j3ksw@skbuf/
> I am not an expert on performance stuff. But to me it looks obvious that
> cheaper checks should be performed first. What about also moving the check
> for ops->port_txtstamp above ptp_classify_raw()?

I am no expert either. Also, it looks like I'm not even keeping on top
of things lately. I'll try to return to that investigation during this
weekend.

>
> Is there any reason why this isn't already applied?

Probably because nobody sent a patch for it?

> > case in which you'll need an skb_queue and a process context
> > to wait for the TX timestamp of the previous PTP message before calling
> > dsa_enqueue_skb for the next PTP event message. There are already
> > implementations of both models in DSA that you can look at.
> In the past I sometimes got a "timeout waiting for hw timestamp" (or similar)
> message from ptp4l. I am not sure whether this is still the case, but this may
> explain this type of problem.

Yeah, well, the default tx_timestamp_timeout value of 1 ms chosen by
ptp4l is not going to be enough in general for DSA. If you schedule a
workqueue for timestamping, that delay will only get worse, but luckily
you can increase the timestamp timeout value and all should be fine.
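
For example (illustrative value; tx_timestamp_timeout is in milliseconds
and defaults to 1):

	# ptp4l.conf excerpt
	[global]
	tx_timestamp_timeout	10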

> > So good for you that you can use a function so simple for timestamp
> > reconstruction.
> You already told me that other hardware has a much smaller budget than 4 seconds.

sja1105 has 24 bits of partial timestamp (and 1 bit measures 8 ns). So
it wraps around in 135 ms. You can imagine that periodically reading the
PTP clock over SPI is not an option there :)

> How is timestamp reconstruction done there? Is there any code which I should
> reuse?

No, I wasn't suggesting you reuse that logic, since it's very
error-prone. If you can get away with reconstruction done on-the-fly,
great. But just for reference:
- In drivers/net/dsa/sja1105/, the actual transmission of the PTP
  packets is done synchronously, from process context, and an interrupt
  is not even used. See sja1105_ptp_txtstamp_skb and
  sja1105_tstamp_reconstruct. Actually, more interesting would be the RX
  timestamping case, since we have a worse problem there: the partial
  PTP timestamp is obtained in softirq context, and we need process
  context for the current PTP time. For that, see sja1105_port_rxtstamp
  and sja1105_rxtstamp_work.
- In drivers/net/dsa/ocelot/, the reconstruction is done in IRQ context,
  since it is a memory-mapped switch and therefore, reading the PTP time
  is "cheap". See ocelot_get_txtstamp and ocelot_get_hwtimestamp.
The point is that both these drivers read the full PTP current time
_after_ the partial timestamp was obtained. That's what gives you a
solid guarantee that the "partial_timestamp > 
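
To make the wraparound arithmetic concrete, a minimal kernel-style sketch
of partial-timestamp reconstruction (the names and the 24-bit width are
illustrative, not taken from either driver):

	#define TS_BITS	24
	#define TS_MASK	((1ULL << TS_BITS) - 1)

	/* full_now: full counter value read *after* the partial stamp was
	 * latched, less than one wraparound (2^TS_BITS ticks) later;
	 * partial: the low TS_BITS bits latched by the hardware. */
	static u64 reconstruct_tstamp(u64 full_now, u32 partial)
	{
		u64 ts = (full_now & ~TS_MASK) | (partial & TS_MASK);

		/* If the counter wrapped between the stamp and the full
		 * read, ts now lies in the future; step back one period. */
		if (ts > full_now)
			ts -= TS_MASK + 1;

		return ts;
	}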

Re: [RFC PATCH 4/6] mm: thp: add support for split huge page to any lower order pages.

2020-11-13 Thread Roman Gushchin
On Wed, Nov 11, 2020 at 03:40:06PM -0500, Zi Yan wrote:
> From: Zi Yan 
> 
> To split a THP to any lower order pages, we need to reform THPs on
> subpages at given order and add page refcount based on the new page
> order. Also we need to reinitialize page_deferred_list after removing
> the page from the split_queue, otherwise a subsequent split will see
> list corruption when checking the page_deferred_list again.
> 
> It has many uses, like minimizing the number of pages after
> truncating a pagecache THP. For anonymous THPs, we can only split them
> to order-0 like before until we add support for any size anonymous THPs.
> 
> Signed-off-by: Zi Yan 
> ---
>  include/linux/huge_mm.h |  8 +
>  mm/huge_memory.c| 78 +
>  mm/swap.c   |  1 -
>  3 files changed, 63 insertions(+), 24 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 60a907a19f7d..9819cd9b4619 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -189,6 +189,8 @@ bool is_transparent_hugepage(struct page *page);
>  
>  bool can_split_huge_page(struct page *page, int *pextra_pins);
>  int split_huge_page_to_list(struct page *page, struct list_head *list);
> +int split_huge_page_to_list_to_order(struct page *page, struct list_head 
> *list,
> + unsigned int new_order);
>  static inline int split_huge_page(struct page *page)
>  {
>   return split_huge_page_to_list(page, NULL);
> @@ -396,6 +398,12 @@ split_huge_page_to_list(struct page *page, struct 
> list_head *list)
>  {
>   return 0;
>  }
> +static inline int
> +split_huge_page_to_order_to_list(struct page *page, struct list_head *list,
> + unsigned int new_order)

It was
int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
unsigned int new_order);
above.


> +{
> + return 0;
> +}
>  static inline int split_huge_page(struct page *page)
>  {
>   return 0;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 8b7d771ee962..88f50da40c9b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2327,11 +2327,14 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
>  static void unmap_page(struct page *page)
>  {
>   enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
> - TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
> + TTU_RMAP_LOCKED;
>   bool unmap_success;
>  
>   VM_BUG_ON_PAGE(!PageHead(page), page);
>  
> + if (thp_order(page) >= HPAGE_PMD_ORDER)
> + ttu_flags |= TTU_SPLIT_HUGE_PMD;
> +
>   if (PageAnon(page))
>   ttu_flags |= TTU_SPLIT_FREEZE;
>  
> @@ -2339,21 +2342,22 @@ static void unmap_page(struct page *page)
>   VM_BUG_ON_PAGE(!unmap_success, page);
>  }
>  
> -static void remap_page(struct page *page, unsigned int nr)
> +static void remap_page(struct page *page, unsigned int nr, unsigned int 
> new_nr)
>  {
>   int i;
> - if (PageTransHuge(page)) {
> + if (thp_nr_pages(page) == nr) {
>   remove_migration_ptes(page, page, true);
>   } else {
> - for (i = 0; i < nr; i++)
> + for (i = 0; i < nr; i += new_nr)
>   remove_migration_ptes(page + i, page + i, true);
>   }
>  }
>  
>  static void __split_huge_page_tail(struct page *head, int tail,
> - struct lruvec *lruvec, struct list_head *list)
> + struct lruvec *lruvec, struct list_head *list, unsigned int 
> new_order)
>  {
>   struct page *page_tail = head + tail;
> + unsigned long compound_head_flag = new_order ? (1L << PG_head) : 0;
>  
>   VM_BUG_ON_PAGE(atomic_read(_tail->_mapcount) != -1, page_tail);
>  
> @@ -2377,6 +2381,7 @@ static void __split_huge_page_tail(struct page *head, 
> int tail,
>  #ifdef CONFIG_64BIT
>(1L << PG_arch_2) |
>  #endif
> +  compound_head_flag |
>(1L << PG_dirty)));
>  
>   /* ->mapping in first tail page is compound_mapcount */
> @@ -2395,10 +2400,15 @@ static void __split_huge_page_tail(struct page *head, 
> int tail,
>* which needs correct compound_head().
>*/
>   clear_compound_head(page_tail);
> + if (new_order) {
> + prep_compound_page(page_tail, new_order);
> + thp_prep(page_tail);
> + }
>  
>   /* Finally unfreeze refcount. Additional reference from page cache. */
> - page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) ||
> -   PageSwapCache(head)));
> + page_ref_unfreeze(page_tail, 1 + ((!PageAnon(head) ||
> +PageSwapCache(head)) ?
> + thp_nr_pages(page_tail) : 0));
>  
>   if (page_is_young(head))
>   set_page_young(page_tail);
> @@ -2416,7 +2426,7 @@ static void __split_huge_page_tail(struct page *head, 
> int tail,
>  }
>  
>  

Re: [PATCH v6 22/25] x86/asm: annotate indirect jumps

2020-11-13 Thread Josh Poimboeuf
On Fri, Nov 13, 2020 at 03:31:34PM -0800, Sami Tolvanen wrote:
> >  #else /* !CONFIG_STACK_VALIDATION */
> > @@ -123,6 +129,8 @@ struct unwind_hint {
> >  .macro UNWIND_HINT sp_reg:req sp_offset=0 type:req end=0
> >  .endm
> >  #endif
> > +.macro STACK_FRAME_NON_STANDARD func:req
> > +.endm
> 
> This macro needs to be before the #endif, so it's defined only for
> assembly code. This breaks my arm64 builds even though x86 curiously
> worked just fine.

Yeah, I noticed that after syncing objtool.h with the tools copy.  Fixed
now.

I've got fixes for some of the other warnings, but I'll queue them up
and post when they're all ready.

-- 
Josh



Re: [PATCH v4] zram: break the strict dependency from lzo

2020-11-13 Thread Sergey Senozhatsky
On (20/11/04 12:41), Minchan Kim wrote:
> On Wed, Nov 04, 2020 at 02:12:35PM +, Rui Salvaterra wrote:
> > Hi, Minchan,
> > 
> > On Tue, 3 Nov 2020 at 21:29, Minchan Kim  wrote:
> > >
> > > Can't we just provide choice/endchoice in Kconfig to select default
> > > comp algorithm from admin?
> > 
> > I'm fine with whatever you guys decide, just let me know what works
> > best for everyone.
> 
> Thanks. Sergey, if you don't mind, I'd like to pursue a more explicit
> way to select the default compressor algorithm in Kconfig.
> 
> Do you have any suggestion?

No objections. Some people tend to dislike new Kconfig options,
but we probably can live with that.
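
Something along these lines, presumably (a sketch; the symbol names are
illustrative only):

	choice
		prompt "Default zram compressor"
		default ZRAM_DEF_COMP_LZORLE

	config ZRAM_DEF_COMP_LZORLE
		bool "lzo-rle"
		depends on CRYPTO_LZO

	config ZRAM_DEF_COMP_ZSTD
		bool "zstd"
		depends on CRYPTO_ZSTD

	endchoice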

-ss


Re: [PATCH v2 net-next] net: stmmac: platform: use optional clk/reset get APIs

2020-11-13 Thread patchwork-bot+netdevbpf
Hello:

This patch was applied to netdev/net-next.git (refs/heads/master):

On Thu, 12 Nov 2020 09:27:37 +0800 you wrote:
> Use the devm_reset_control_get_optional() and devm_clk_get_optional()
> rather than open coding them.
> 
> Signed-off-by: Jisheng Zhang 
> ---
> Since v1:
>  - keep wrapped as suggested by Jakub
> 
> [...]

Here is the summary with links:
  - [v2,net-next] net: stmmac: platform: use optional clk/reset get APIs
https://git.kernel.org/netdev/net-next/c/bb3222f71b57

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




Re: Error: invalid switch -me200

2020-11-13 Thread Fāng-ruì Sòng
On Fri, Nov 13, 2020 at 4:23 PM Segher Boessenkool
 wrote:
>
> On Fri, Nov 13, 2020 at 12:14:18PM -0800, Nick Desaulniers wrote:
> > > > > Error: invalid switch -me200
> > > > > Error: unrecognized option -me200
> > > >
> > > > 251 cpu-as-$(CONFIG_E200)   += -Wa,-me200
> > > >
> > > > Are those all broken configs, or is Kconfig messed up such that
> > > > randconfig can select these when it should not?
> > >
> > > Hmmm, looks like this flag does not exist in mainline binutils? There is
> > > a thread in 2010 about this that Segher commented on:
> > >
> > > https://lore.kernel.org/linuxppc-dev/9859e645-954d-4d07-8003-ffcd2391a...@kernel.crashing.org/
> > >
> > > Guess this config should be eliminated?
>
> The help text for this config option says that e200 is used in 55xx,
> and there *is* an -me5500 GAS flag (which probably does this same
> thing, too).  But is any of this tested, or useful, or wanted?
>
> Maybe Christophe knows, cc:ed.
>
>
> Segher

CC Alan Modra, a binutils global maintainer.

Alan, can the few -Wa,-m* options be deleted from arch/powerpc/Makefile?
The topic started at
http://lore.kernel.org/r/202011131146.g8dplqdd-...@intel.com and
people would like to get rid of some options (if possible).


Re: [PATCH 24/24] perf record: Add --buildid-mmap option to enable mmap's build id

2020-11-13 Thread Namhyung Kim
On Fri, Nov 13, 2020 at 8:09 PM Jiri Olsa  wrote:
>
> On Fri, Nov 13, 2020 at 01:40:00PM +0900, Namhyung Kim wrote:
> > On Mon, Nov 09, 2020 at 10:54:15PM +0100, Jiri Olsa wrote:
> > > Adding --buildid-mmap option to enable build id in mmap2 events.
> > > It will only work if there's kernel support for that and it disables
> > > build id cache (implies --no-buildid).
> > >
> > > It's also possible to enable it permanently via config option
> > > in ~.perfconfig file:
> > >
> > >   [record]
> > >   build-id=mmap
> >
> > You also need to update the documentation.
>
> right, forgot doc for the config option
>
> SNIP
>
> > > "append timestamp to output filename"),
> > > OPT_BOOLEAN(0, "timestamp-boundary", _boundary,
> > > @@ -2657,6 +2662,21 @@ int cmd_record(int argc, const char **argv)
> > >
> > > }
> > >
> > > +   if (rec->buildid_mmap) {
> > > +   if (!perf_can_record_build_id()) {
> > > +   pr_err("Failed: no support to record build id in mmap 
> > > events, update your kernel.\n");
> > > +   err = -EINVAL;
> > > +   goto out_opts;
> > > +   }
> > > +   pr_debug("Enabling build id in mmap2 events.\n");
> > > +   /* Enable mmap build id synthesizing. */
> > > +   symbol_conf.buildid_mmap2 = true;
> > > +   /* Enable perf_event_attr::build_id bit. */
> > > +   rec->opts.build_id = true;
> > > +   /* Disable build id cache. */
> > > +   rec->no_buildid = true;
> >
> > I'm afraid this can make it miss some build-id in the end because of
> > the possibility of the failure.
>
> with following fix (already merged):
>   b33164f2bd1c bpf: Iterate through all PT_NOTE sections when looking for 
> build id
>
> I could see high rate of build id being retrieved
>
> I'll make new numbers for next version, but I think we can neglect
> the failure, considering that we pick only 'hit' objects out of all
> of them
>
> also, enabling the build id cache for this would go against the
> purpose of having this.. so hopefully the numbers
> will be convincing ;-)

Yeah, I think it'd be ok for most cases but we cannot guarantee..
What about checking the dso list at the end of a record session
and verifying that all of them have a build-id?  Then we can safely skip
the build-id collecting stage.  Hmm.. but it won't work for the pipe.

Thanks,
Namhyung


Re: Error: invalid switch -me200

2020-11-13 Thread Segher Boessenkool
On Fri, Nov 13, 2020 at 12:14:18PM -0800, Nick Desaulniers wrote:
> > > > Error: invalid switch -me200
> > > > Error: unrecognized option -me200
> > >
> > > 251 cpu-as-$(CONFIG_E200)   += -Wa,-me200
> > >
> > > Are those all broken configs, or is Kconfig messed up such that
> > > randconfig can select these when it should not?
> >
> > Hmmm, looks like this flag does not exist in mainline binutils? There is
> > a thread in 2010 about this that Segher commented on:
> >
> > https://lore.kernel.org/linuxppc-dev/9859e645-954d-4d07-8003-ffcd2391a...@kernel.crashing.org/
> >
> > Guess this config should be eliminated?

The help text for this config option says that e200 is used in 55xx,
and there *is* an -me5500 GAS flag (which probably does this same
thing, too).  But is any of this tested, or useful, or wanted?

Maybe Christophe knows, cc:ed.


Segher

