Re: nfp: abm: fix memory leak in nfp_abm_u32_knode_replace

2019-09-27 Thread Markus Elfring
>> Can such a change variant be a bit nicer?
>
> Definitely not.
>
> Looks good as is, thanks Navid!

I find it interesting how software development opinions differ, in this
use case as well, over the implementation of correct and efficient
exception handling.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/coding-style.rst?id=f1f2f614d535564992f32e720739cb53cf03489f#n450

Regards,
Markus


Re: [PATCH 4/4] perf docs: Correct and clarify jitdump spec

2019-09-27 Thread Stephane Eranian
On Fri, Sep 27, 2019 at 6:53 PM Steve MacLean
 wrote:
>
> Specification claims latest version of jitdump file format is 2. Current
> jit dump reading code treats 1 as the latest version.
>
> Correct spec to match code.
>
> The original language made it unclear what value should be written in the magic
> field.
>
> Revise the language to state that the writer always writes the same value. Specify that
> the reader uses the value to detect endian mismatches.
>
> Cc: Peter Zijlstra 
> Cc: Ingo Molnar 
> Cc: Arnaldo Carvalho de Melo 
> Cc: Mark Rutland 
> Cc: Alexander Shishkin 
> Cc: Jiri Olsa 
> Cc: Namhyung Kim 
> Cc: Stephane Eranian 
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Steve MacLean 

Corrections are valid.

Acked-by: Stephane Eranian 

> ---
>  tools/perf/Documentation/jitdump-specification.txt | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tools/perf/Documentation/jitdump-specification.txt 
> b/tools/perf/Documentation/jitdump-specification.txt
> index 4c62b07..52152d1 100644
> --- a/tools/perf/Documentation/jitdump-specification.txt
> +++ b/tools/perf/Documentation/jitdump-specification.txt
> @@ -36,8 +36,8 @@ III/ Jitdump file header format
>  Each jitdump file starts with a fixed size header containing the following 
> fields in order:
>
>
> -* uint32_t magic : a magic number tagging the file type. The value is 
> 4-byte long and represents the string "JiTD" in ASCII form. It is 0x4A695444 
> or 0x4454694a depending on the endianness. The field can be used to detect 
> the endianness of the file
> -* uint32_t version   : a 4-byte value representing the format version. It is 
> currently set to 2
> +* uint32_t magic : a magic number tagging the file type. The value is 
> 4-byte long and represents the string "JiTD" in ASCII form. It is written as 
> 0x4A695444. The reader will detect an endian mismatch when it reads 
> 0x4454694a.
> +* uint32_t version   : a 4-byte value representing the format version. It is 
> currently set to 1
>  * uint32_t total_size: size in bytes of file header
>  * uint32_t elf_mach  : ELF architecture encoding (ELF e_machine value as 
> specified in /usr/include/elf.h)
>  * uint32_t pad1  : padding. Reserved for future use
> --
> 2.7.4
>


[PATCH -next] nfsd: remove set but not used variable 'len'

2019-09-27 Thread YueHaibing
Fixes gcc '-Wunused-but-set-variable' warning:

fs/nfsd/nfs4xdr.c: In function nfsd4_encode_splice_read:
fs/nfsd/nfs4xdr.c:3464:7: warning: variable len set but not used 
[-Wunused-but-set-variable]

It is not used since commit 83a63072c815 ("nfsd: fix nfs read eof detection")

Reported-by: Hulk Robot 
Signed-off-by: YueHaibing 
---
 fs/nfsd/nfs4xdr.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 533d0fc..1883370 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -3461,7 +3461,6 @@ static __be32 nfsd4_encode_splice_read(
struct xdr_stream *xdr = &resp->xdr;
struct xdr_buf *buf = xdr->buf;
u32 eof;
-   long len;
int space_left;
__be32 nfserr;
__be32 *p = xdr->p - 2;
@@ -3470,7 +3469,6 @@ static __be32 nfsd4_encode_splice_read(
if (xdr->end - xdr->p < 1)
return nfserr_resource;
 
-   len = maxcount;
nfserr = nfsd_splice_read(read->rd_rqstp, read->rd_fhp,
  file, read->rd_offset, &maxcount, &eof);
read->rd_length = maxcount;
-- 
2.7.4




Re: [PATCH v4 0/3] Optimize tlbflush path

2019-09-27 Thread Atish Patra
On Fri, 2019-08-30 at 19:50 -0700, Paul Walmsley wrote:
> Hi Atish,
> 
> On Thu, 22 Aug 2019, Atish Patra wrote:
> 
> > This series adds a few optimizations to reduce the trap cost in the
> > tlb flush path. We should make SBI calls for remote tlb flushes only
> > if absolutely required.
> 
> The patches look great.  My understanding is that these optimization
> patches may actually be a partial workaround for the TLB flushing bug
> that we've been looking at for the last month or so, which can corrupt
> memory or crash the system.
> 
> If that's the case, let's first root-cause the underlying bug.
> Otherwise we'll just be papering over the actual issue, which probably
> could still occur even with this series, correct?  Since it contains
> no explicit fixes?
> 
> 
I have verified the glibc locale install issue both in Qemu and
Unleashed. I don't see any issue with OpenSBI master + Linux v5.3
kernel.

As per our investigation, it looks like a hardware erratum on the
Unleashed board, as the memory corruption issue only happens in the case
of a tlb range flush. In RISC-V, sfence.vma can only be issued per page.
If the range spans multiple pages, OpenSBI has to issue multiple
sfence.vma calls back to back, leading to possible memory corruption.

Currently, OpenSBI has a platform feature, "tlb_range_flush_limit", that
allows the tlb flush threshold to be configured per platform. Any tlb
flush range request greater than this threshold is converted to a full
flush. Currently, it is set to the default value of 4K for every
platform[1]. The glibc locale install memory corruption only happens if
this threshold is changed to a higher value, e.g. 1G. This doesn't
change anything in the OpenSBI code path except that it will issue many
sfence.vma instructions back to back instead of one.

If the hardware team at SiFive can look into this as well, it would be
great.

To conclude, we think this issue needs to be investigated by the
hardware team, and the kernel patch can be merged to get the performance
benefit.

[1] 
https://github.com/riscv/opensbi/blob/master/include/sbi/sbi_platform.h#L40



> - Paul
> 
> ___
> linux-riscv mailing list
> linux-ri...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv

-- 
Regards,
Atish


[PATCH -next] habanalabs: remove set but not used variable 'ctx'

2019-09-27 Thread YueHaibing
Fixes gcc '-Wunused-but-set-variable' warning:

drivers/misc/habanalabs/device.c: In function hpriv_release:
drivers/misc/habanalabs/device.c:45:17: warning: variable ctx set but not used 
[-Wunused-but-set-variable]

It is never used since commit eb7caf84b029 ("habanalabs:
maintain a list of file private data objects")

Reported-by: Hulk Robot 
Signed-off-by: YueHaibing 
---
 drivers/misc/habanalabs/device.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index 459fee7..2f5a4da 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -42,12 +42,10 @@ static void hpriv_release(struct kref *ref)
 {
struct hl_fpriv *hpriv;
struct hl_device *hdev;
-   struct hl_ctx *ctx;
 
hpriv = container_of(ref, struct hl_fpriv, refcount);
 
hdev = hpriv->hdev;
-   ctx = hpriv->ctx;
 
put_pid(hpriv->taskpid);
 
-- 
2.7.4




Re: [PATCH v2 1/1] leds: remove PAGE_SIZE limit of /sys/class/leds//trigger

2019-09-27 Thread Akinobu Mita
On Sat, Sep 28, 2019 at 2:46 Greg Kroah-Hartman wrote:
>
> On Sat, Sep 28, 2019 at 01:47:21AM +0900, Akinobu Mita wrote:
> > On Fri, Sep 27, 2019 at 15:39 Greg Kroah-Hartman wrote:
> > >
> > > On Sat, Sep 14, 2019 at 12:03:24AM +0900, Akinobu Mita wrote:
> > > > Reading /sys/class/leds//trigger returns all available LED 
> > > > triggers.
> > > > However, the size of this file is limited to PAGE_SIZE because of the
> > > > limitation for sysfs attribute.
> > > >
> > > > Enabling LED CPU trigger on systems with thousands of CPUs easily hits
> > > > PAGE_SIZE limit, and makes it impossible to see all available LED 
> > > > triggers
> > > > and which trigger is currently activated.
> > > >
> > > > We work around it here by converting /sys/class/leds//trigger to
> > > > binary attribute, which is not limited by length. This is _not_ good
> > > > design, do not copy it.
> > > >
> > > > Cc: Greg Kroah-Hartman 
> > > > Cc: "Rafael J. Wysocki" 
> > > > Cc: Jacek Anaszewski 
> > > > Cc: Pavel Machek 
> > > > Cc: Dan Murphy 
> > > > Acked-by: Pavel Machek 
> > > > Signed-off-by: Akinobu Mita 
> > > > ---
> > > >  drivers/leds/led-class.c|  8 ++--
> > > >  drivers/leds/led-triggers.c | 90 
> > > > ++---
> > > >  drivers/leds/leds.h |  6 +++
> > > >  include/linux/leds.h|  5 ---
> > > >  4 files changed, 79 insertions(+), 30 deletions(-)
> > > >
> > > > diff --git a/drivers/leds/led-class.c b/drivers/leds/led-class.c
> > > > index 4793e77..8b5a1d1 100644
> > > > --- a/drivers/leds/led-class.c
> > > > +++ b/drivers/leds/led-class.c
> > > > @@ -73,13 +73,13 @@ static ssize_t max_brightness_show(struct device 
> > > > *dev,
> > > >  static DEVICE_ATTR_RO(max_brightness);
> > > >
> > > >  #ifdef CONFIG_LEDS_TRIGGERS
> > > > -static DEVICE_ATTR(trigger, 0644, led_trigger_show, led_trigger_store);
> > > > -static struct attribute *led_trigger_attrs[] = {
> > > > - &dev_attr_trigger.attr,
> > > > +static BIN_ATTR(trigger, 0644, led_trigger_read, led_trigger_write, 0);
> > > > +static struct bin_attribute *led_trigger_bin_attrs[] = {
> > > > + &bin_attr_trigger,
> > > >   NULL,
> > > >  };
> > > >  static const struct attribute_group led_trigger_group = {
> > > > - .attrs = led_trigger_attrs,
> > > > + .bin_attrs = led_trigger_bin_attrs,
> > > >  };
> > > >  #endif
> > > >
> > > > diff --git a/drivers/leds/led-triggers.c b/drivers/leds/led-triggers.c
> > > > index 8d11a5e..ed5a311 100644
> > > > --- a/drivers/leds/led-triggers.c
> > > > +++ b/drivers/leds/led-triggers.c
> > > > @@ -16,6 +16,7 @@
> > > >  #include 
> > > >  #include 
> > > >  #include 
> > > > +#include 
> > > >  #include "leds.h"
> > > >
> > > >  /*
> > > > @@ -26,9 +27,11 @@ LIST_HEAD(trigger_list);
> > > >
> > > >   /* Used by LED Class */
> > > >
> > > > -ssize_t led_trigger_store(struct device *dev, struct device_attribute 
> > > > *attr,
> > > > - const char *buf, size_t count)
> > > > +ssize_t led_trigger_write(struct file *filp, struct kobject *kobj,
> > > > +   struct bin_attribute *bin_attr, char *buf,
> > > > +   loff_t pos, size_t count)
> > > >  {
> > > > + struct device *dev = kobj_to_dev(kobj);
> > > >   struct led_classdev *led_cdev = dev_get_drvdata(dev);
> > > >   struct led_trigger *trig;
> > > >   int ret = count;
> > > > @@ -64,39 +67,84 @@ ssize_t led_trigger_store(struct device *dev, 
> > > > struct device_attribute *attr,
> > > >   mutex_unlock(&led_cdev->led_access);
> > > >   return ret;
> > > >  }
> > > > -EXPORT_SYMBOL_GPL(led_trigger_store);
> > > > +EXPORT_SYMBOL_GPL(led_trigger_write);
> > > >
> > > > -ssize_t led_trigger_show(struct device *dev, struct device_attribute 
> > > > *attr,
> > > > - char *buf)
> > > > +__printf(4, 5)
> > > > +static int led_trigger_snprintf(char *buf, size_t size, bool query,
> > > > + const char *fmt, ...)
> > > > +{
> > > > + va_list args;
> > > > + int i;
> > > > +
> > > > + va_start(args, fmt);
> > > > + if (query)
> > > > + i = vsnprintf(NULL, 0, fmt, args);
> > > > + else
> > > > + i = vscnprintf(buf, size, fmt, args);
> > > > + va_end(args);
> > > > +
> > > > + return i;
> > > > +}
> > >
> > > You only call this in one place, why is it needed like this?  The "old"
> > > code open-coded this, what is this helping with here?
> > >
> > > And what does "query" mean here?  I have no idea how that variable
> > > matters, or what it does.  Why not just test if buf is NULL or not if
> > > you don't want to use it?
> > >
> > > Ah, you are trying to see how "long" the buffer is going to be.  That
> > > makes more sense, but just trigger off of the NULL buffer or not, making
> > > this a bit more "obvious" what you are doing and not tieing two
> > > parameters to each other (meaning one always reflects the other one).
> >
> > We cannot simply replace the "query" by checking whether the buffer is NULL or not.

Re: [PATCH] media: uvcvideo: Use streaming DMA APIs to transfer buffers

2019-09-27 Thread Nicolas Boichat
On Fri, Aug 2, 2019 at 9:13 PM Shik Chen  wrote:
>
> Similar to the commit 1161db6776bd ("media: usb: pwc: Don't use coherent
> DMA buffers for ISO transfer") [1] for the pwc driver. Use streaming DMA
> APIs to transfer buffers and sync them explicitly, because accessing
> buffers allocated by usb_alloc_coherent() is slow on platforms without
> hardware coherent DMA.
>
> Tested on x86-64 (Intel Celeron 4305U) and armv7l (Rockchip RK3288) with
> Logitech Brio 4K camera at 1920x1080 using the WebRTC sample site [3].
>
> |  |  URB (us)  | Decode (Gbps) | CPU (%) |
> |--||---|-|
> | x86-64 coherent  |  53 +-  20 |  50.6 |0.24 |
> | x86-64 streaming |  55 +-  19 |  50.1 |0.25 |
> | armv7l coherent  | 342 +- 379 |   1.8 |2.16 |
> | armv7l streaming |  99 +-  98 |  11.0 |0.36 |
>
> The performance characteristics are improved a lot on armv7l, and
> remained (almost) unchanged on x86-64. The code used for measurement can
> be found at [2].
>
> [1] https://git.kernel.org/torvalds/c/1161db6776bd
> [2] https://crrev.com/c/1729133
> [3] https://webrtc.github.io/samples/src/content/getusermedia/resolution/
>
> Signed-off-by: Shik Chen 
> ---
> The allocated buffer could be as large as 768K when streaming 4K video.
> Ideally we should use some generic helper to allocate non-coherent
> memory in a more efficient way, such as https://lwn.net/Articles/774429/
> ("make the non-consistent DMA allocator more useful").
>
> But I cannot find any existing helper so a simple kzalloc() is used in
> this patch. The logic to figure out the DMA addressable GFP flags is
> similar to __dma_direct_alloc_pages() without the optimistic retries:
> https://elixir.bootlin.com/linux/v5.2.5/source/kernel/dma/direct.c#L96
>
>  drivers/media/usb/uvc/uvc_video.c | 65 +--
>  1 file changed, 45 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/media/usb/uvc/uvc_video.c 
> b/drivers/media/usb/uvc/uvc_video.c
> index 8fa77a81dd7f2c..962c35478896c4 100644
> --- a/drivers/media/usb/uvc/uvc_video.c
> +++ b/drivers/media/usb/uvc/uvc_video.c
> @@ -1539,6 +1539,8 @@ static void uvc_video_complete(struct urb *urb)
>  * Process the URB headers, and optionally queue expensive memcpy 
> tasks
>  * to be deferred to a work queue.
>  */
> +   dma_sync_single_for_cpu(&urb->dev->dev, urb->transfer_dma,
> +   urb->transfer_buffer_length, DMA_FROM_DEVICE);
> stream->decode(uvc_urb, buf, buf_meta);
>
> /* If no async work is needed, resubmit the URB immediately. */
> @@ -1565,18 +1567,51 @@ static void uvc_free_urb_buffers(struct uvc_streaming 
> *stream)
> if (!uvc_urb->buffer)
> continue;
>
> -#ifndef CONFIG_DMA_NONCOHERENT
> -   usb_free_coherent(stream->dev->udev, stream->urb_size,
> - uvc_urb->buffer, uvc_urb->dma);
> -#else
> +   dma_unmap_single(&stream->dev->udev->dev, uvc_urb->dma,
> +stream->urb_size, DMA_FROM_DEVICE);
> kfree(uvc_urb->buffer);
> -#endif
> -   uvc_urb->buffer = NULL;
> }
>
> stream->urb_size = 0;
>  }
>
> +static gfp_t uvc_alloc_gfp_flags(struct device *dev)
> +{
> +   u64 mask = dma_get_mask(dev);
> +
> +   if (dev->bus_dma_mask)
> +   mask &= dev->bus_dma_mask;
> +
> +   if (mask < DMA_BIT_MASK(32) && IS_ENABLED(CONFIG_ZONE_DMA))
> +   return GFP_DMA;
> +
> +   if (mask < DMA_BIT_MASK(64)) {
> +   if (IS_ENABLED(CONFIG_ZONE_DMA32))
> +   return GFP_DMA32;

We're hitting issues with this on a 64-bit ARM platform where
ZONE_DMA32 is enabled (the default): the kmalloc allocation with
GFP_DMA32 fails.

There is no slab cache for GFP_DMA32, so those calls are expected to
fail; AFAICT there are no other (working) kmalloc(..., .. | GFP_DMA32)
users in the kernel, so I don't think we want to add a cache. If this
helps, some discussion here why the cache wasn't added:
https://lore.kernel.org/patchwork/patch/1009563/#1198622

This function looks out of place in a high-level driver, but from your
comment above (below ---), I guess we may have to live with that until
the kernel provides a better API?

For the time being, looking at how __dma_direct_alloc_pages works,
could you use alloc_pages_node (or dma_alloc_contiguous?) instead of
kmalloc, that would let you use GFP_DMA32 as needed?

> +   if (IS_ENABLED(CONFIG_ZONE_DMA))
> +   return GFP_DMA;
> +   }
> +
> +   return 0;
> +}
> +
> +static char *uvc_alloc_urb_buffer(struct device *dev, size_t size,
> + gfp_t gfp_flags, dma_addr_t *dma_handle)
> +{
> +   void *buffer = kzalloc(size, gfp_flags | uvc_alloc_gfp_flags(dev));
> +
> +   if (!buffer)
> +   return NULL;
> +
> + 

Re: KASAN: slab-out-of-bounds Read in bpf_prog_create

2019-09-27 Thread Eric Biggers
Arnd and Al,

On Tue, Sep 17, 2019 at 11:49:06AM -0700, syzbot wrote:
> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:2015a28f Add linux-next specific files for 20190915
> git tree:   linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=11880d6960
> kernel config:  https://syzkaller.appspot.com/x/.config?x=110691c2286b679a
> dashboard link: https://syzkaller.appspot.com/bug?extid=eb853b51b10f1befa0b7
> compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=127c348160
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1150a70d60
> 
> The bug was bisected to:
> 
> commit 2f4fa2db75e26995709043c8d3de4632ebed5c4b
> Author: Al Viro 
> Date:   Thu Apr 18 03:48:01 2019 +
> 
> compat_ioctl: unify copy-in of ppp filters
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=145eee1d60
> final crash:https://syzkaller.appspot.com/x/report.txt?x=165eee1d60
> console output: https://syzkaller.appspot.com/x/log.txt?x=125eee1d60
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+eb853b51b10f1befa...@syzkaller.appspotmail.com
> Fixes: 2f4fa2db75e2 ("compat_ioctl: unify copy-in of ppp filters")
> 
> ==
> BUG: KASAN: slab-out-of-bounds in memcpy include/linux/string.h:404 [inline]
> BUG: KASAN: slab-out-of-bounds in bpf_prog_create+0xe9/0x250
> net/core/filter.c:1351
> Read of size 32768 at addr 88809cf74000 by task syz-executor183/8575
> 
> CPU: 0 PID: 8575 Comm: syz-executor183 Not tainted 5.3.0-rc8-next-20190915
> #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
>  print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
>  __kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506
>  kasan_report+0x12/0x20 mm/kasan/common.c:634
>  check_memory_region_inline mm/kasan/generic.c:185 [inline]
>  check_memory_region+0x134/0x1a0 mm/kasan/generic.c:192
>  memcpy+0x24/0x50 mm/kasan/common.c:122
>  memcpy include/linux/string.h:404 [inline]
>  bpf_prog_create+0xe9/0x250 net/core/filter.c:1351
>  get_filter.isra.0+0x108/0x1a0 drivers/net/ppp/ppp_generic.c:572
>  ppp_get_filter drivers/net/ppp/ppp_generic.c:584 [inline]
>  ppp_ioctl+0x129d/0x2590 drivers/net/ppp/ppp_generic.c:801

This is a correct bisection.  This commit needs:

diff --git a/drivers/net/ppp/ppp_generic.c b/drivers/net/ppp/ppp_generic.c
index 267fe2c58087..f55d7937d6c5 100644
--- a/drivers/net/ppp/ppp_generic.c
+++ b/drivers/net/ppp/ppp_generic.c
@@ -564,8 +564,9 @@ static struct bpf_prog *get_filter(struct sock_fprog *uprog)
return NULL;
 
/* uprog->len is unsigned short, so no overflow here */
-   fprog.len = uprog->len * sizeof(struct sock_filter);
-   fprog.filter = memdup_user(uprog->filter, fprog.len);
+   fprog.len = uprog->len;
+   fprog.filter = memdup_user(uprog->filter,
+  uprog->len * sizeof(struct sock_filter));
if (IS_ERR(fprog.filter))
return ERR_CAST(fprog.filter);
 


Re: [GIT PULL] integrity subsystem updates for v5.4

2019-09-27 Thread pr-tracker-bot
The pull request you sent on Wed, 11 Sep 2019 17:29:25 -0400:

> git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity.git 
> next-integrity

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/f1f2f614d535564992f32e720739cb53cf03489f

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker


Re: WARNING: locking bug in selinux_netlbl_socket_connect

2019-09-27 Thread Eric Biggers
On Tue, Sep 24, 2019 at 09:17:35AM -0400, Paul Moore wrote:
> On Tue, Sep 24, 2019 at 4:21 AM Dmitry Vyukov  wrote:
> > On Tue, Sep 24, 2019 at 4:14 AM Paul Moore  wrote:
> > > On Sat, Sep 21, 2019 at 11:50 AM syzbot
> > >  wrote:
> > > > Hello,
> > > >
> > > > syzbot found the following crash on:
> > > >
> > > > HEAD commit:f97c81dc Merge tag 'armsoc-late' of 
> > > > git://git.kernel.org/p..
> > > > git tree:   upstream
> > > > console output: https://syzkaller.appspot.com/x/log.txt?x=127b709d60
> > > > kernel config:  
> > > > https://syzkaller.appspot.com/x/.config?x=10283c2b00ab4cd7
> > > > dashboard link: 
> > > > https://syzkaller.appspot.com/bug?extid=5fa07e4e18e4eb1ccb12
> > > > compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> > > > syz repro:  
> > > > https://syzkaller.appspot.com/x/repro.syz?x=1299684160
> > > >
> > > > IMPORTANT: if you fix the bug, please add the following tag to the 
> > > > commit:
> > > > Reported-by: syzbot+5fa07e4e18e4eb1cc...@syzkaller.appspotmail.com
> > > >
> > > > WARNING: CPU: 0 PID: 10315 at kernel/locking/lockdep.c:840
> > > > look_up_lock_class kernel/locking/lockdep.c:840 [inline]
> > > > WARNING: CPU: 0 PID: 10315 at kernel/locking/lockdep.c:840
> > > > register_lock_class+0x206/0x1850 kernel/locking/lockdep.c:1185
> > > > Kernel panic - not syncing: panic_on_warn set ...
> > > > CPU: 0 PID: 10315 Comm: syz-executor.0 Not tainted 5.3.0+ #0
> > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > > > Google 01/01/2011
> > > > Call Trace:
> > > >   __dump_stack lib/dump_stack.c:77 [inline]
> > > >   dump_stack+0x172/0x1f0 lib/dump_stack.c:113
> > > >   panic+0x2dc/0x755 kernel/panic.c:219
> > > >   __warn.cold+0x20/0x4c kernel/panic.c:576
> > > >   report_bug+0x263/0x2b0 lib/bug.c:186
> > > >   fixup_bug arch/x86/kernel/traps.c:179 [inline]
> > > >   fixup_bug arch/x86/kernel/traps.c:174 [inline]
> > > >   do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272
> > > >   do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291
> > > >   invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1028
> > > > RIP: 0010:look_up_lock_class kernel/locking/lockdep.c:840 [inline]
> > > > RIP: 0010:register_lock_class+0x206/0x1850 kernel/locking/lockdep.c:1185
> > > > Code: fc ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 aa 10 00 00 4c 3b 
> > > > 7b
> > > > 18 44 8b 35 d5 de 55 09 74 0b 48 81 3b a0 65 06 8a 74 02 <0f> 0b 45 85 
> > > > ed
> > > > 0f 84 71 03 00 00 f6 85 70 ff ff ff 01 0f 85 64 03
> > > > RSP: 0018:888096777a48 EFLAGS: 00010002
> > > > RAX: dc00 RBX: 888093ff78e0 RCX: 
> > > > RDX: 1110127fef1f RSI:  RDI: 888093ff78f8
> > > > RBP: 888096777b10 R08: 111012ceef51 R09: 8aaea0e0
> > > > R10: 8a7753c8 R11:  R12: 8a7b5d20
> > > > R13:  R14:  R15: 884766e0
> > > >   __lock_acquire+0xf4/0x4e70 kernel/locking/lockdep.c:3837
> > > >   lock_acquire+0x190/0x410 kernel/locking/lockdep.c:4487
> > > >   __raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline]
> > > >   _raw_spin_lock_bh+0x33/0x50 kernel/locking/spinlock.c:175
> > > >   spin_lock_bh include/linux/spinlock.h:343 [inline]
> > > >   lock_sock_nested+0x41/0x120 net/core/sock.c:2929
> > > >   lock_sock include/net/sock.h:1522 [inline]
> > > >   selinux_netlbl_socket_connect+0x20/0xc0 
> > > > security/selinux/netlabel.c:607
> > > >   selinux_socket_connect+0x6a/0x90 security/selinux/hooks.c:4745
> > > >   security_socket_connect+0x77/0xc0 security/security.c:1958
> > > >   __sys_connect+0x19d/0x330 net/socket.c:1824
> > > >   __do_sys_connect net/socket.c:1839 [inline]
> > > >   __se_sys_connect net/socket.c:1836 [inline]
> > > >   __x64_sys_connect+0x73/0xb0 net/socket.c:1836
> > > >   do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
> > > >   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > > > RIP: 0033:0x459a09
> > > > Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 
> > > > f7
> > > > 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 
> > > > ff
> > > > ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
> > > > RSP: 002b:7fc302ec5c78 EFLAGS: 0246 ORIG_RAX: 002a
> > > > RAX: ffda RBX: 0003 RCX: 00459a09
> > > > RDX: 001c RSI: 2080 RDI: 0005
> > > > RBP: 0075bf20 R08:  R09: 
> > > > R10:  R11: 0246 R12: 7fc302ec66d4
> > > > R13: 004bff42 R14: 004d1eb0 R15: 
> > > > Kernel Offset: disabled
> > > > Rebooting in 86400 seconds..
> > >
> > > This doesn't appear to be related to selinux_netlbl_socket_connect();
> > > I believe it should be okay to call lock_sock() in that context.
> >
> >
> > FTR, this is this warning:
> >
> > static inline struct lock_class *
> > look_up_lock_class(const 

Re: [PATCH RESEND v4] fs/epoll: Remove unnecessary wakeups of nested epoll that in ET mode

2019-09-27 Thread Andrew Morton
On Wed, 25 Sep 2019 09:56:03 +0800 hev  wrote:

> From: Heiher 
> 
> Take the case where we have:
> 
> t0
>  | (ew)
> e0
>  | (et)
> e1
>  | (lt)
> s0
> 
> t0: thread 0
> e0: epoll fd 0
> e1: epoll fd 1
> s0: socket fd 0
> ew: epoll_wait
> et: edge-trigger
> lt: level-trigger
> 
> We only need to wakeup nested epoll fds if something has been queued to the
> overflow list, since the ep_poll() traverses the rdllist during recursive poll
> and thus events on the overflow list may not be visible yet.
> 
> Test code:

Looks sane to me.  Do you have any performance testing results which
show a benefit?

epoll maintainership isn't exactly a hive of activity nowadays :(
Roman, would you please have time to review this?



[PATCH 4/4] perf docs: Correct and clarify jitdump spec

2019-09-27 Thread Steve MacLean
Specification claims latest version of jitdump file format is 2. Current
jit dump reading code treats 1 as the latest version.

Correct spec to match code.

The original language made it unclear what value should be written in the magic
field.

Revise the language to state that the writer always writes the same value. Specify that
the reader uses the value to detect endian mismatches.

Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Arnaldo Carvalho de Melo 
Cc: Mark Rutland 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Stephane Eranian 
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Steve MacLean 
---
 tools/perf/Documentation/jitdump-specification.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/perf/Documentation/jitdump-specification.txt 
b/tools/perf/Documentation/jitdump-specification.txt
index 4c62b07..52152d1 100644
--- a/tools/perf/Documentation/jitdump-specification.txt
+++ b/tools/perf/Documentation/jitdump-specification.txt
@@ -36,8 +36,8 @@ III/ Jitdump file header format
 Each jitdump file starts with a fixed size header containing the following 
fields in order:
 
 
-* uint32_t magic : a magic number tagging the file type. The value is 
4-byte long and represents the string "JiTD" in ASCII form. It is 0x4A695444 or 
0x4454694a depending on the endianness. The field can be used to detect the 
endianness of the file
-* uint32_t version   : a 4-byte value representing the format version. It is 
currently set to 2
+* uint32_t magic : a magic number tagging the file type. The value is 
4-byte long and represents the string "JiTD" in ASCII form. It is written as 
0x4A695444. The reader will detect an endian mismatch when it reads 0x4454694a.
+* uint32_t version   : a 4-byte value representing the format version. It is 
currently set to 1
 * uint32_t total_size: size in bytes of file header
 * uint32_t elf_mach  : ELF architecture encoding (ELF e_machine value as 
specified in /usr/include/elf.h)
 * uint32_t pad1  : padding. Reserved for future use
-- 
2.7.4



[PATCH 3/4] perf inject --jit: Remove //anon mmap events

2019-09-27 Thread Steve MacLean
While a JIT is jitting code it will eventually need to commit more pages and
change these pages to executable permissions.

Typically the JIT will want these co-located to minimize branch displacements.

The kernel will coalesce these anonymous mappings with identical permissions
before sending an mmap event for the new pages. This means the mmap event for
the new pages will include the older pages.

These anonymous mmap events will obscure the jitdump injected pseudo events.
This means that the jitdump generated symbols, machine code, debugging info,
and unwind info will no longer be used.

Observations:

When a process emits a jit dump marker and a jitdump file, the perf-xxx.map
file represents inferior information which has been superseded by the
jitdump jit-xxx.dump file.

Further the '//anon*' mmap events are only required for the legacy
perf-xxx.map mapping.

Summary:

Add rbtree to track which pids have successfully injected a jitdump file.

During "perf inject --jit", discard "//anon*" mmap events for any pid which
has successfully processed a jitdump file.

Committer testing:

// jitdump case
perf record 
perf inject --jit --input perf.data --output perfjit.data

// verify mmap "//anon" events present initially
perf script --input perf.data --show-mmap-events | grep '//anon'
// verify mmap "//anon" events removed
perf script --input perfjit.data --show-mmap-events | grep '//anon'

// no jitdump case
perf record 
perf inject --jit --input perf.data --output perfjit.data

// verify mmap "//anon" events present initially
perf script --input perf.data --show-mmap-events | grep '//anon'
// verify mmap "//anon" events not removed
perf script --input perfjit.data --show-mmap-events | grep '//anon'

Repro:

This issue was discovered while testing the initial CoreCLR jitdump
implementation. https://github.com/dotnet/coreclr/pull/26897.

Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Arnaldo Carvalho de Melo 
Cc: Mark Rutland 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Stephane Eranian 
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Steve MacLean 
---
 tools/perf/builtin-inject.c |  4 +--
 tools/perf/util/jitdump.c   | 63 +
 2 files changed, 65 insertions(+), 2 deletions(-)

diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index 372ecb3..0f38862 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -263,7 +263,7 @@ static int perf_event__jit_repipe_mmap(struct perf_tool 
*tool,
 * if jit marker, then inject jit mmaps and generate ELF images
 */
> ret = jit_process(inject->session, &inject->output, machine,
> - event->mmap.filename, sample->pid, &n);
> + event->mmap.filename, event->mmap.pid, &n);
if (ret < 0)
return ret;
if (ret) {
@@ -301,7 +301,7 @@ static int perf_event__jit_repipe_mmap2(struct perf_tool 
*tool,
 * if jit marker, then inject jit mmaps and generate ELF images
 */
> ret = jit_process(inject->session, &inject->output, machine,
> - event->mmap2.filename, sample->pid, &n);
> + event->mmap2.filename, event->mmap2.pid, &n);
if (ret < 0)
return ret;
if (ret) {
diff --git a/tools/perf/util/jitdump.c b/tools/perf/util/jitdump.c
index e3ccb0c..6d891d1 100644
--- a/tools/perf/util/jitdump.c
+++ b/tools/perf/util/jitdump.c
@@ -749,6 +749,59 @@ jit_detect(char *mmap_name, pid_t pid)
return 0;
 }
 
+struct pid_rbtree
+{
+   struct rb_node node;
+   pid_t pid;
+};
+
+static void jit_add_pid(struct rb_root *root, pid_t pid)
+{
+   struct rb_node **new = &(root->rb_node), *parent = NULL;
+   struct pid_rbtree* data = NULL;
+
+   /* Figure out where to put new node */
+   while (*new) {
+   struct pid_rbtree *this = container_of(*new, struct pid_rbtree, 
node);
+   pid_t nodePid = this->pid;
+
+   parent = *new;
+   if (pid < nodePid)
+   new = &((*new)->rb_left);
+   else if (pid > nodePid)
+   new = &((*new)->rb_right);
+   else
+   return;
+   }
+
+   data = malloc(sizeof(struct pid_rbtree));
+   data->pid = pid;
+
+   /* Add new node and rebalance tree. */
+   rb_link_node(&data->node, parent, new);
+   rb_insert_color(&data->node, root);
+
+   return;
+}
+
+static bool jit_has_pid(struct rb_root *root, pid_t pid)
+{
+   struct rb_node *node = root->rb_node;
+
+   while (node) {
+   struct pid_rbtree *this = container_of(node, struct pid_rbtree, node);
+   pid_t nodePid = this->pid;
+
+   if (pid < nodePid)
+   node = node->rb_left;
+   else if (pid > nodePid)
+   node = node->rb_right;
+   else
+   return 1;
+   }
+   

[PATCH 2/4] perf inject jit: Fix JIT_CODE_MOVE filename

2019-09-27 Thread Steve MacLean
During perf inject --jit, JIT_CODE_MOVE records were injecting MMAP records
with an incorrect filename. Specifically, it was missing the ".so" suffix.

Further, the JIT_CODE_LOAD record was silently truncating the
jr->load.code_index field to 32 bits before generating the filename.

Make both records emit the same filename based on the full 64 bit
code_index field.

Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Arnaldo Carvalho de Melo 
Cc: Mark Rutland 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Stephane Eranian 
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Steve MacLean 
---
 tools/perf/util/jitdump.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/jitdump.c b/tools/perf/util/jitdump.c
index 1bdf4c6..e3ccb0c 100644
--- a/tools/perf/util/jitdump.c
+++ b/tools/perf/util/jitdump.c
@@ -395,7 +395,7 @@ static int jit_repipe_code_load(struct jit_buf_desc *jd, union jr_entry *jr)
size_t size;
u16 idr_size;
const char *sym;
-   uint32_t count;
+   uint64_t count;
int ret, csize, usize;
pid_t pid, tid;
struct {
@@ -418,7 +418,7 @@ static int jit_repipe_code_load(struct jit_buf_desc *jd, union jr_entry *jr)
return -1;
 
filename = event->mmap2.filename;
-   size = snprintf(filename, PATH_MAX, "%s/jitted-%d-%u.so",
+   size = snprintf(filename, PATH_MAX, "%s/jitted-%d-%" PRIu64 ".so",
jd->dir,
pid,
count);
@@ -529,7 +529,7 @@ static int jit_repipe_code_move(struct jit_buf_desc *jd, union jr_entry *jr)
return -1;
 
filename = event->mmap2.filename;
-   size = snprintf(filename, PATH_MAX, "%s/jitted-%d-%"PRIu64,
+   size = snprintf(filename, PATH_MAX, "%s/jitted-%d-%" PRIu64 ".so",
 jd->dir,
 pid,
 jr->move.code_index);
-- 
2.7.4


[PATCH] KVM: Unlimit number of ioeventfd assignments for real

2019-09-27 Thread Peter Xu
Previously we've tried to unlimit ioeventfd creation (6ea34c9b78c1,
"kvm: exclude ioeventfd from counting kvm_io_range limit",
2013-06-04), because that can be easily done by fd limitations and
otherwise it can easily reach the current maximum of 1000 iodevices.
Meanwhile, we still use the counter to limit the maximum allowed kvm
io devices to be created besides ioeventfd.

6ea34c9b78c1 achieved that in most cases; however, it will still fail the
ioeventfd creation when the non-ioeventfd io devices overflow to 1000.
Then the next ioeventfd creation will fail while logically it should
be the next non-ioeventfd iodevice creation to fail.

That's not really a big problem at all because when it happens it
probably means something has leaked in userspace (or even a malicious
program), so it's a bug to fix there.  However, an error message like
"ioeventfd creation failed" with an -ENOSPACE is really confusing and
may let people think about the fact that it's the ioeventfd that is
leaked (while in most cases it's not!).

Let's use this patch to unlimit the creation of ioeventfd for real
this time, assuming this is also a bugfix of 6ea34c9b78c1.  More
importantly to me, when there is a bug in userspace this patch can
probably give us a more meaningful failure on what has overflowed/leaked
rather than "ioeventfd creation failure: -ENOSPC".

CC: Dr. David Alan Gilbert 
Signed-off-by: Peter Xu 
---
 include/linux/kvm_host.h |  3 +++
 virt/kvm/eventfd.c   |  4 ++--
 virt/kvm/kvm_main.c  | 23 ---
 3 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index fcb46b3374c6..d8530e7d85d4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -192,6 +192,9 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
int len, void *val);
 int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
int len, struct kvm_io_device *dev);
+int kvm_io_bus_register_dev_ioeventfd(struct kvm *kvm, enum kvm_bus bus_idx,
+ gpa_t addr, int len,
+ struct kvm_io_device *dev);
 void kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
   struct kvm_io_device *dev);
 struct kvm_io_device *kvm_io_bus_get_dev(struct kvm *kvm, enum kvm_bus bus_idx,
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 67b6fc153e9c..3cb0e1c3279b 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -823,8 +823,8 @@ static int kvm_assign_ioeventfd_idx(struct kvm *kvm,
 
kvm_iodevice_init(&p->dev, &ioeventfd_ops);
 
-   ret = kvm_io_bus_register_dev(kvm, bus_idx, p->addr, p->length,
- &p->dev);
+   ret = kvm_io_bus_register_dev_ioeventfd(kvm, bus_idx, p->addr,
+   p->length, &p->dev);
if (ret < 0)
goto unlock_fail;
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c6a91b044d8d..242cfcaa9a56 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3809,8 +3809,10 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
 }
 
 /* Caller must hold slots_lock. */
-int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
-   int len, struct kvm_io_device *dev)
+static int __kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx,
+gpa_t addr, int len,
+struct kvm_io_device *dev,
+bool check_limit)
 {
int i;
struct kvm_io_bus *new_bus, *bus;
@@ -3821,7 +3823,8 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
return -ENOMEM;
 
/* exclude ioeventfd which is limited by maximum fd */
-   if (bus->dev_count - bus->ioeventfd_count > NR_IOBUS_DEVS - 1)
+   if (check_limit &&
+   (bus->dev_count - bus->ioeventfd_count > NR_IOBUS_DEVS - 1))
return -ENOSPC;
 
new_bus = kmalloc(struct_size(bus, range, bus->dev_count + 1),
@@ -3851,6 +3854,20 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
return 0;
 }
 
+int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
+   int len, struct kvm_io_device *dev)
+{
+   return __kvm_io_bus_register_dev(kvm, bus_idx, addr, len, dev, true);
+}
+
+int kvm_io_bus_register_dev_ioeventfd(struct kvm *kvm, enum kvm_bus bus_idx,
+ gpa_t addr, int len,
+ struct kvm_io_device *dev)
+{
+   return __kvm_io_bus_register_dev(kvm, bus_idx, addr, len, dev, false);
+}
+
+
 /* Caller must hold slots_lock. */
 void kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
  

[PATCH 1/4] perf map: fix overlapped map handling

2019-09-27 Thread Steve MacLean
Whenever an mmap/mmap2 event occurs, the map tree must be updated to add a new
entry. If a new map overlaps a previous map, the overlapped section of the
previous map is effectively unmapped, but the non-overlapping sections are
still valid.

maps__fixup_overlappings() is responsible for creating any new map entries from
the previously overlapped map. It optionally creates a before and an after map.

When creating the after map, the existing code failed to adjust map.pgoff.
This meant the new after map would incorrectly calculate the file offset
for the ip, resulting in incorrect symbol name resolution for any ip in the
after region.

Make maps__fixup_overlappings() correctly populate map.pgoff.

Add an assert that new mapping matches old mapping at the beginning of
the after map.

Committer-testing:

Validated correct parsing of libcoreclr.so symbols from .NET Core 3.0 preview9
(which didn't strip symbols).

Preparation:

~/dotnet3.0-preview9/dotnet new webapi -o perfSymbol
cd perfSymbol
~/dotnet3.0-preview9/dotnet publish
perf record ~/dotnet3.0-preview9/dotnet \
bin/Debug/netcoreapp3.0/publish/perfSymbol.dll
^C

Before:

perf script --show-mmap-events 2>&1 | grep -e MMAP -e unknown |\
   grep libcoreclr.so | head -n 4
  dotnet  1907 373352.698780: PERF_RECORD_MMAP2 1907/1907: \
  [0x7fe615726000(0x768000) @ 0 08:02 5510620 765057155]: \
  r-xp .../3.0.0-preview9-19423-09/libcoreclr.so
  dotnet  1907 373352.701091: PERF_RECORD_MMAP2 1907/1907: \
  [0x7fe615974000(0x1000) @ 0x24e000 08:02 5510620 765057155]: \
  rwxp .../3.0.0-preview9-19423-09/libcoreclr.so
  dotnet  1907 373352.701241: PERF_RECORD_MMAP2 1907/1907: \
  [0x7fe615c42000(0x1000) @ 0x51c000 08:02 5510620 765057155]: \
  rwxp .../3.0.0-preview9-19423-09/libcoreclr.so
  dotnet  1907 373352.705249: 25 cpu-clock: \
   7fe6159a1f99 [unknown] \
   (.../3.0.0-preview9-19423-09/libcoreclr.so)

After:

perf script --show-mmap-events 2>&1 | grep -e MMAP -e unknown |\
   grep libcoreclr.so | head -n 4
  dotnet  1907 373352.698780: PERF_RECORD_MMAP2 1907/1907: \
  [0x7fe615726000(0x768000) @ 0 08:02 5510620 765057155]: \
  r-xp .../3.0.0-preview9-19423-09/libcoreclr.so
  dotnet  1907 373352.701091: PERF_RECORD_MMAP2 1907/1907: \
  [0x7fe615974000(0x1000) @ 0x24e000 08:02 5510620 765057155]: \
  rwxp .../3.0.0-preview9-19423-09/libcoreclr.so
  dotnet  1907 373352.701241: PERF_RECORD_MMAP2 1907/1907: \
  [0x7fe615c42000(0x1000) @ 0x51c000 08:02 5510620 765057155]: \
  rwxp .../3.0.0-preview9-19423-09/libcoreclr.so

All the [unknown] symbols were resolved.

Tested-by: Brian Robbins 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Arnaldo Carvalho de Melo 
Cc: Mark Rutland 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Stephane Eranian 
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Steve MacLean 
---
 tools/perf/util/map.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/perf/util/map.c b/tools/perf/util/map.c
index 5b83ed1..eec9b28 100644
--- a/tools/perf/util/map.c
+++ b/tools/perf/util/map.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include "symbol.h"
+#include 
 #include 
 #include 
 #include 
@@ -850,6 +851,8 @@ static int maps__fixup_overlappings(struct maps *maps, struct map *map, FILE *fp
}
 
after->start = map->end;
+   after->pgoff += map->end - pos->start;
+   assert(pos->map_ip(pos, map->end) == after->map_ip(after, map->end));
__map_groups__insert(pos->groups, after);
if (verbose >= 2 && !use_browser)
map__fprintf(after, fp);
-- 
2.7.4


RE: [PATCH] perf map: fix overlapped map handling

2019-09-27 Thread Steve MacLean
>> An earlier version of this patch used:
>>  after->start = map->end;
>> +after->pgoff += map->end - pos->start;
>> 
>> Instead of the newer, functionally equivalent:
>>  after->start = map->end;
>> +after->pgoff = pos->map_ip(pos, map->end);
>> 
>> I preferred the latter form as it made more sense with the assertion that 
>> the mapping of map->end should match in pos and after.
>
> So, if they are equivalent then I think its better to use code that 
> ressembles the kernel as much as possible, so that when in doubt we can 
> compare the tools/perf calcs with how the kernel does it, filtering out 
> things like the PAGE_SHIFT, can we go that way?
>
> Also do you have some reproducer, if you have one then we can try and have 
> this as a 'perf test' entry, bolting some more checks into 
> tools/perf/tests/perf-record.c or using it as a start for a test that 
> stresses this code.
>
> This is not a prerequisite for having your fix on, but would help checking 
> that perf doesn't regresses in this area.
>
> - Arnaldo

I have updated the patch to use the earlier version, which more closely matches 
the kernel.

I have updated the commit message to include the repro info.

I am including a few other patches I have generated while adding support for 
perf jitdump to coreclr.


Re: [GIT PULL] nfsd changes for 5.4

2019-09-27 Thread pr-tracker-bot
The pull request you sent on Fri, 27 Sep 2019 16:08:38 -0400:

> git://linux-nfs.org/~bfields/linux.git tags/nfsd-5.4

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/4f375483559c94ec7c58fa3499e28e327d1ef911

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker


Re: Re: [GIT PULL] nfsd changes for 5.4

2019-09-27 Thread pr-tracker-bot
The pull request you sent on Fri, 27 Sep 2019 19:55:54 -0400:

> git://linux-nfs.org/~bfields/linux.git tags/nfsd-5.4

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/298fb76a5583900a155d387efaf37a8b39e5dea2

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker


Re: x86/purgatory: undefined symbol __stack_chk_fail

2019-09-27 Thread Randy Dunlap
On 9/3/19 8:50 AM, Andreas Smas wrote:
> Hi,
> 
> For me, kernels built including this commit
> b059f801a937 (x86/purgatory: Use CFLAGS_REMOVE rather than reset 
> KBUILD_CFLAGS)
> 
> results in kexec() failing to load the kernel:
> 
> kexec: Undefined symbol: __stack_chk_fail
> kexec-bzImage64: Loading purgatory failed
> 
> Can be seen:
> 
> $ readelf -a arch/x86/purgatory/purgatory.ro | grep UND
>  0:  0 NOTYPE  LOCAL  DEFAULT  UND
> 51:  0 NOTYPE  GLOBAL DEFAULT  UND __stack_chk_fail
> 
> Using: gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
> 
> Adding -ffreestanding or -fno-stack-protector to ccflags-y in
> arch/x86/purgatory/Makefile
> fixes the problem. Not sure which would be preferred.
> 

Hi,
Do you have a kernel .config file that causes this?
I can't seem to reproduce it.

Thanks.
-- 
~Randy


[PATCH] iio: adc: meson-saradc: Variables could be uninitialized if regmap_read() fails

2019-09-27 Thread Yizhuo
Several functions in this file use regmap_read() to initialize a
variable; however, if regmap_read() fails, the variable could be
left uninitialized yet used directly, which is potentially
unsafe. The return value of regmap_read() should be
checked and handled.

Signed-off-by: Yizhuo 
---
 drivers/iio/adc/meson_saradc.c | 28 +++-
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/drivers/iio/adc/meson_saradc.c b/drivers/iio/adc/meson_saradc.c
index 7b28d045d271..c032a64108b4 100644
--- a/drivers/iio/adc/meson_saradc.c
+++ b/drivers/iio/adc/meson_saradc.c
@@ -323,6 +323,7 @@ static int meson_sar_adc_wait_busy_clear(struct iio_dev *indio_dev)
 {
struct meson_sar_adc_priv *priv = iio_priv(indio_dev);
int regval, timeout = 1;
+   int ret;
 
/*
 * NOTE: we need a small delay before reading the status, otherwise
@@ -331,7 +332,9 @@ static int meson_sar_adc_wait_busy_clear(struct iio_dev *indio_dev)
 */
do {
udelay(1);
-   regmap_read(priv->regmap, MESON_SAR_ADC_REG0, &regval);
+   ret = regmap_read(priv->regmap, MESON_SAR_ADC_REG0, &regval);
+   if (ret)
+   return ret;
} while (FIELD_GET(MESON_SAR_ADC_REG0_BUSY_MASK, regval) && timeout--);
 
if (timeout < 0)
@@ -358,7 +361,11 @@ static int meson_sar_adc_read_raw_sample(struct iio_dev *indio_dev,
return -EINVAL;
}
 
-   regmap_read(priv->regmap, MESON_SAR_ADC_FIFO_RD, &regval);
+   int ret = regmap_read(priv->regmap, MESON_SAR_ADC_FIFO_RD, &regval);
+
+   if (ret)
+   return ret;
+
fifo_chan = FIELD_GET(MESON_SAR_ADC_FIFO_RD_CHAN_ID_MASK, regval);
if (fifo_chan != chan->address) {
dev_err(&indio_dev->dev,
@@ -491,6 +498,7 @@ static int meson_sar_adc_lock(struct iio_dev *indio_dev)
 {
struct meson_sar_adc_priv *priv = iio_priv(indio_dev);
int val, timeout = 1;
+   int ret;
 
mutex_lock(&indio_dev->mlock);
 
@@ -506,7 +514,10 @@ static int meson_sar_adc_lock(struct iio_dev *indio_dev)
 */
do {
udelay(1);
-   regmap_read(priv->regmap, MESON_SAR_ADC_DELAY, &val);
+   ret = regmap_read(priv->regmap,
+   MESON_SAR_ADC_DELAY, &val);
+   if (ret)
+   return ret;
} while (val & MESON_SAR_ADC_DELAY_BL30_BUSY && timeout--);
 
if (timeout < 0) {
@@ -784,7 +795,10 @@ static int meson_sar_adc_init(struct iio_dev *indio_dev)
 * BL30 to make sure BL30 gets the values it expects when
 * reading the temperature sensor.
 */
-   regmap_read(priv->regmap, MESON_SAR_ADC_REG3, &regval);
+   ret = regmap_read(priv->regmap, MESON_SAR_ADC_REG3, &regval);
+   if (ret)
+   return ret;
+
if (regval & MESON_SAR_ADC_REG3_BL30_INITIALIZED)
return 0;
}
@@ -1014,7 +1028,11 @@ static irqreturn_t meson_sar_adc_irq(int irq, void *data)
unsigned int cnt, threshold;
u32 regval;
 
-   regmap_read(priv->regmap, MESON_SAR_ADC_REG0, &regval);
+   int ret = regmap_read(priv->regmap, MESON_SAR_ADC_REG0, &regval);
+
+   if (ret)
+   return ret;
+
cnt = FIELD_GET(MESON_SAR_ADC_REG0_FIFO_COUNT_MASK, regval);
threshold = FIELD_GET(MESON_SAR_ADC_REG0_FIFO_CNT_IRQ_MASK, regval);
 
-- 
2.17.1



Re: [PATCH] Input: hyperv-keyboard: Add the support of hibernation

2019-09-27 Thread dmitry.torok...@gmail.com
On Sat, Sep 21, 2019 at 06:56:04AM +, Dexuan Cui wrote:
> > From: dmitry.torok...@gmail.com 
> > Sent: Thursday, September 19, 2019 9:18 AM
> > 
> > Hi Dexuan,
> > 
> > On Wed, Sep 11, 2019 at 11:36:20PM +, Dexuan Cui wrote:
> > > We need hv_kbd_pm_notify() to make sure the pm_wakeup_hard_event()
> > call
> > > does not prevent the system from entering hibernation: the hibernation
> > > is a relatively long process, which can be aborted by the call
> > > pm_wakeup_hard_event(), which is invoked upon keyboard events.
> > >
> > > diff --git a/drivers/input/serio/hyperv-keyboard.c b/drivers/input/serio/hyperv-keyboard.c
> > > index 88ae7c2..277dc4c 100644
> > > --- a/drivers/input/serio/hyperv-keyboard.c
> > > +++ b/drivers/input/serio/hyperv-keyboard.c
> > > @@ -10,6 +10,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >
> > >  /*
> > >   * Current version 1.0
> > > @@ -95,6 +96,9 @@ struct hv_kbd_dev {
> > >   struct completion wait_event;
> > >   spinlock_t lock; /* protects 'started' field */
> > >   bool started;
> > > +
> > > + struct notifier_block pm_nb;
> > > + bool hibernation_in_progress;
> > 
> > Why do you use notifier block instead of exposing proper PM methods if
> > you want to support hibernation?
> > 
> > Dmitry
> 
> Hi,
> In the patch I do implement hv_kbd_suspend() and hv_kbd_resume(), and
> add them into the hv_kbd_drv struct:
> 
> @@ -416,6 +472,8 @@ static struct  hv_driver hv_kbd_drv = {
> .id_table = id_table,
> .probe = hv_kbd_probe,
> .remove = hv_kbd_remove,
> +   .suspend = hv_kbd_suspend,
> +   .resume = hv_kbd_resume,
> 
> The .suspend and .resume callbacks are introduced by another patch (which
> uses the dev_pm_ops struct):
> 271b2224d42f ("Drivers: hv: vmbus: Implement suspend/resume for VSC drivers 
> for hibernation")
> (which is on the Hyper-V tree's hyperv-next branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/commit/?h=hyperv-next=271b2224d42f88870e6b060924ee374871c131fc
>  )
> 
> The only purpose of the notifier is to set the variable 
> kbd_dev->hibernation_in_progress to true during the hibernation process.
> 
> As I explained in the changelog, the hibernation is a long process (which
> can take 10+ seconds), during which the user may unintentionally touch
> the keyboard, causing key up/down events, which are still handled by
> hv_kbd_on_receive(), which calls pm_wakeup_hard_event(), which
> calls some other functions which increase the global counter
> "pm_abort_suspend", and hence pm_wakeup_pending() becomes true.
> 
> pm_wakeup_pending() is tested in a lot of places in the suspend
> process and eventually an unintentional keystroke (or mouse movement,
> when it comes to the Hyper-V mouse driver drivers/hid/hid-hyperv.c)
> causes the whole hibernation process to be aborted. Usually this
> behavior is not expected by the user, I think.

Why not? If a device is configured as a wakeup source, then its activity
should wake up the system, unless you disable it.

> 
> So, I use the notifier to set the flag variable and with it the driver can
> know when it should not call pm_wakeup_hard_event().

No, please implement hibernation support properly, as notifier + flag is
a hack. In this particular case you do not want to have your
hv_kbd_resume() to be called in place of pm_ops->thaw() as that is what
reenables the keyboard vmbus channel and causes the undesired wakeup
events. Your vmbus implementation should allow individual drivers to
control the set of PM operations that they wish to use, instead of
forcing everything through suspend/resume.

Thanks.

-- 
Dmitry


[PATCH] iio: adc: imx25-gcq: Variable could be uninitialized if regmap_read() fails

2019-09-27 Thread Yizhuo
In function mx25_gcq_irq(), the local variable "stats" could
be uninitialized if function regmap_read() returns -EINVAL.
However, this value is used in an if statement, which is
potentially unsafe. The same applies to the variable
"data" in function mx25_gcq_get_raw_value() in the same file.

Signed-off-by: Yizhuo 
---
 drivers/iio/adc/fsl-imx25-gcq.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/iio/adc/fsl-imx25-gcq.c b/drivers/iio/adc/fsl-imx25-gcq.c
index df19ecae52f7..dbf3e8e85aba 100644
--- a/drivers/iio/adc/fsl-imx25-gcq.c
+++ b/drivers/iio/adc/fsl-imx25-gcq.c
@@ -74,7 +74,10 @@ static irqreturn_t mx25_gcq_irq(int irq, void *data)
struct mx25_gcq_priv *priv = data;
u32 stats;
 
-   regmap_read(priv->regs, MX25_ADCQ_SR, &stats);
+   int ret = regmap_read(priv->regs, MX25_ADCQ_SR, &stats);
+
+   if (ret)
+   return ret;
 
if (stats & MX25_ADCQ_SR_EOQ) {
regmap_update_bits(priv->regs, MX25_ADCQ_MR,
@@ -121,7 +124,10 @@ static int mx25_gcq_get_raw_value(struct device *dev,
return -ETIMEDOUT;
}
 
-   regmap_read(priv->regs, MX25_ADCQ_FIFO, &data);
+   int ret = regmap_read(priv->regs, MX25_ADCQ_FIFO, &data);
+
+   if (ret)
+   return ret;
 
*val = MX25_ADCQ_FIFO_DATA(data);
 
-- 
2.17.1



Re: Summary of Crash/Panic Behaviors in Linux Kernel

2019-09-27 Thread 김동현
On Fri, Sep 27, 2019 at 5:01 AM, 慕冬亮 wrote:
>
> Dear all,
>
> Is there any summary of crash/panic behaviors in the Linux Kernel? For
> example, GPF (general protection fault), Kernel BUG (BUG_ON).

There are a number of blogs and other materials that describe the
behavior of a kernel panic.

Talking about an Android Linux device, the behavior of a kernel panic is as
follows:

USER DEBUG Version(Engineering version)

* On a kernel panic, the device suddenly stops execution and keeps
displaying the kernel log on the screen.
Below is one of the examples.
[   34.187856 01-05 13:30:49.424] Unable to handle kernel NULL pointer
dereference at virtual address 
[   34.197812 01-05 13:30:49.434] pgd = e6f9c000
[   34.202275 01-05 13:30:49.438] [] *pgd=
[   34.207610 01-05 13:30:49.444] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[   34.214663 01-05 13:30:49.451] Modules linked in: bcmdhd(O)
[   34.220400 01-05 13:30:49.457] CPU: 0Tainted: GW  O
(3.4.65-gbc0bf75 #1)
[   34.228317 01-05 13:30:49.464] PC is at ext4_free_inode+0x288/0x574
[   34.234664 01-05 13:30:49.471] LR is at ext4_free_inode+0x260/0x574

Kernel engineers can identify the cause of the kernel
panic from the kernel log.

USER Version(Production version)

 * When a kernel panic occurs, the device suddenly stops execution and
starts rebooting.

In addition, we can describe BUG_ON() and exceptions as follows:

BUG_ON()

  * If the subsystem realizes that it is running under a critical condition,
BUG_ON() is called to cause a kernel panic.

Please be aware that BUG_ON or BUG is executed only with CONFIG_BUG enabled.
Normally, many Linux devices are running with CONFIG_BUG enabled
*by default*.

Exception (Data abort, Prefetch abort: ARM processor)

  * If the MMU cannot handle the virtual address translation, an exception
occurs. The program counter jumps into a predefined exception vector.
Usually panic() is called and then the system crashes.

Hope above would be helpful.

Thanks,
Austin Kim

>
> --
> My best regards to you.
>
>  No System Is Safe!
>  Dongliang Mu


Re: [PATCH v2] Input: atmel_mxt_ts - Disable IRQ across suspend

2019-09-27 Thread Dmitry Torokhov
Hi Evan,

On Tue, Sep 24, 2019 at 02:52:38PM -0700, Evan Green wrote:
> Across suspend and resume, we are seeing error messages like the following:
> 
> atmel_mxt_ts i2c-PRP0001:00: __mxt_read_reg: i2c transfer failed (-121)
> atmel_mxt_ts i2c-PRP0001:00: Failed to read T44 and T5 (-121)
> 
> This occurs because the driver leaves its IRQ enabled. Upon resume, there
> is an IRQ pending, but the interrupt is serviced before both the driver and
> the underlying I2C bus have been resumed. This causes EREMOTEIO errors.
> 
> Disable the IRQ in suspend, and re-enable it on resume. If there are cases
> where the driver enters suspend with interrupts disabled, that's a bug we
> should fix separately.
> 
> Signed-off-by: Evan Green 
> ---
> 
> Changes in v2:
>  - Enable and disable unconditionally (Dmitry)
> 
>  drivers/input/touchscreen/atmel_mxt_ts.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/input/touchscreen/atmel_mxt_ts.c b/drivers/input/touchscreen/atmel_mxt_ts.c
> index 24c4b691b1c9..a58092488679 100644
> --- a/drivers/input/touchscreen/atmel_mxt_ts.c
> +++ b/drivers/input/touchscreen/atmel_mxt_ts.c
> @@ -3155,6 +3155,7 @@ static int __maybe_unused mxt_suspend(struct device *dev)
>   mxt_stop(data);
>  
>   mutex_unlock(&input_dev->mutex);
> + disable_irq(data->irq);
>  
>   return 0;
>  }
> @@ -3174,6 +3175,7 @@ static int __maybe_unused mxt_resume(struct device *dev)
>   mxt_start(data);
>  
>   mutex_unlock(&input_dev->mutex);
> + enable_irq(data->irq);

At least for older devices that do soft reset on resume we need
interrupts to already work when we call mxt_start().

In general, order of resume steps should mirror suspend.

Thanks.

-- 
Dmitry


[PATCH] staging: rtl8723bs: Variable rf_type in function rtw_cfg80211_init_wiphy() could be uninitialized

2019-09-27 Thread Yizhuo
In function rtw_cfg80211_init_wiphy(), the local variable "rf_type" could
be uninitialized if function rtw_hal_get_hwreg() fails to initialize
it. However, this value is used in function rtw_cfg80211_init_ht_capab()
to decide the value written to ht_cap, which is potentially
unsafe.

Signed-off-by: Yizhuo 
---
 drivers/staging/rtl8723bs/os_dep/ioctl_cfg80211.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/rtl8723bs/os_dep/ioctl_cfg80211.c b/drivers/staging/rtl8723bs/os_dep/ioctl_cfg80211.c
index 9bc685632651..dd39a581b7ef 100644
--- a/drivers/staging/rtl8723bs/os_dep/ioctl_cfg80211.c
+++ b/drivers/staging/rtl8723bs/os_dep/ioctl_cfg80211.c
@@ -3315,7 +3315,7 @@ static void rtw_cfg80211_init_ht_capab(struct ieee80211_sta_ht_cap *ht_cap, enum
 
 void rtw_cfg80211_init_wiphy(struct adapter *padapter)
 {
-   u8 rf_type;
+   u8 rf_type = RF_MAX_TYPE;
struct ieee80211_supported_band *bands;
struct wireless_dev *pwdev = padapter->rtw_wdev;
struct wiphy *wiphy = pwdev->wiphy;
-- 
2.17.1



Re: [PATCH v3 03/10] sched/fair: remove meaningless imbalance calculation

2019-09-27 Thread Rik van Riel
On Thu, 2019-09-19 at 09:33 +0200, Vincent Guittot wrote:
> clean up load_balance and remove meaningless calculation and fields
> before
> adding new algorithm.
> 
> Signed-off-by: Vincent Guittot 

Yay.

Acked-by: Rik van Riel 

-- 
All Rights Reversed.


signature.asc
Description: This is a digitally signed message part


Re: [PATCH v3 02/10] sched/fair: rename sum_nr_running to sum_h_nr_running

2019-09-27 Thread Rik van Riel
On Thu, 2019-09-19 at 09:33 +0200, Vincent Guittot wrote:
> Rename sum_nr_running to sum_h_nr_running because it effectively
> tracks
> cfs->h_nr_running so we can use sum_nr_running to track rq-
> >nr_running
> when needed.
> 
> There is no functional changes.
> 
> Signed-off-by: Vincent Guittot 
> 
Acked-by: Rik van Riel 

-- 
All Rights Reversed.


signature.asc
Description: This is a digitally signed message part


Re: [PATCH v3 01/10] sched/fair: clean up asym packing

2019-09-27 Thread Rik van Riel
On Thu, 2019-09-19 at 09:33 +0200, Vincent Guittot wrote:
> Clean up asym packing to follow the default load balance behavior:
> - classify the group by creating a group_asym_packing field.
> - calculate the imbalance in calculate_imbalance() instead of
> bypassing it.
> 
> We don't need to test twice same conditions anymore to detect asym
> packing
> and we consolidate the calculation of imbalance in
> calculate_imbalance().
> 
> There is no functional changes.
> 
> Signed-off-by: Vincent Guittot 

Acked-by: Rik van Riel 

-- 
All Rights Reversed.


signature.asc
Description: This is a digitally signed message part


Re: [GIT PULL] nfsd changes for 5.4

2019-09-27 Thread J. Bruce Fields
On Fri, Sep 27, 2019 at 03:44:02PM -0700, Linus Torvalds wrote:
> But then the actual code is just one single small commit:
> 
> > Dave Wysochanski (1):
> >   SUNRPC: Track writers of the 'channel' file to improve 
> > cache_listeners_exist

And that's also all that was in the diffstat in my pull message, but I
somehow didn't notice.  I must be tired.

> which doesn't actually match any of the things your description says
> should be there.
> 
> So I undid my pull - either the description is completely wrong, or
> you tagged the wrong commit.

Yes, the latter.  I've redone the tag; the pull request should have
been for:

  git://linux-nfs.org/~bfields/linux.git tags/nfsd-5.4


Highlights:

- add a new knfsd file cache, so that we don't have to open and
  close on each (NFSv2/v3) READ or WRITE.  This can speed up
  read and write in some cases.  It also replaces our readahead
  cache.
- Prevent silent data loss on write errors, by treating write
  errors like server reboots for the purposes of write caching,
  thus forcing clients to resend their writes.
- Tweak the code that allocates sessions to be more forgiving,
  so that NFSv4.1 mounts are less likely to hang when a server
  already has a lot of clients.
- Eliminate an arbitrary limit on NFSv4 ACL sizes; they should
  now be limited only by the backend filesystem and the
  maximum RPC size.
- Allow the server to enforce use of the correct kerberos
  credentials when a client reclaims state after a reboot.

And some miscellaneous smaller bugfixes and cleanup.


Chuck Lever (2):
  svcrdma: Remove svc_rdma_wq
  svcrdma: Use llist for managing cache of recv_ctxts

Colin Ian King (1):
  sunrpc: clean up indentation issue

Dave Wysochanski (1):
  SUNRPC: Track writers of the 'channel' file to improve 
cache_listeners_exist

J. Bruce Fields (4):
  Merge nfsd bugfixes
  nfsd: Remove unnecessary NULL checks
  Deprecate nfsd fault injection
  nfsd: eliminate an unnecessary acl size limit

Jeff Layton (12):
  sunrpc: add a new cache_detail operation for when a cache is flushed
  locks: create a new notifier chain for lease attempts
  nfsd: add a new struct file caching facility to nfsd
  nfsd: hook up nfsd_write to the new nfsd_file cache
  nfsd: hook up nfsd_read to the nfsd_file cache
  nfsd: hook nfsd_commit up to the nfsd_file cache
  nfsd: convert nfs4_file->fi_fds array to use nfsd_files
  nfsd: convert fi_deleg_file and ls_file fields to nfsd_file
  nfsd: hook up nfs4_preprocess_stateid_op to the nfsd_file cache
  nfsd: have nfsd_test_lock use the nfsd_file cache
  nfsd: rip out the raparms cache
  nfsd: close cached files prior to a REMOVE or RENAME that would replace 
target

NeilBrown (2):
  nfsd: handle drc over-allocation gracefully.
  nfsd: degraded slot-count more gracefully as allocation nears exhaustion.

Scott Mayhew (2):
  nfsd: add a "GetVersion" upcall for nfsdcld
  nfsd: add support for upcall version 2

Trond Myklebust (9):
  notify: export symbols for use by the knfsd file cache
  vfs: Export flush_delayed_fput for use by knfsd.
  nfsd: Fix up some unused variable warnings
  nfsd: Fix the documentation for svcxdr_tmpalloc()
  nfsd: nfsd_file cache entries should be per net namespace
  nfsd: Support the server resetting the boot verifier
  nfsd: Don't garbage collect files that might contain write errors
  nfsd: Reset the boot verifier on all write I/O errors
  nfsd: fix nfs read eof detection

YueHaibing (2):
  nfsd: remove duplicated include from filecache.c
  nfsd: Make nfsd_reset_boot_verifier_locked static

 fs/file_table.c  |   1 +
 fs/locks.c   |  62 ++
 fs/nfsd/Kconfig  |   3 +-
 fs/nfsd/Makefile |   3 +-
 fs/nfsd/acl.h|   8 -
 fs/nfsd/blocklayout.c|   3 +-
 fs/nfsd/export.c |  13 +
 fs/nfsd/filecache.c  | 934 +++
 fs/nfsd/filecache.h  |  61 ++
 fs/nfsd/netns.h  |   4 +
 fs/nfsd/nfs3proc.c   |   9 +-
 fs/nfsd/nfs3xdr.c|  13 +-
 fs/nfsd/nfs4callback.c   |  35 +-
 fs/nfsd/nfs4layouts.c|  12 +-
 fs/nfsd/nfs4proc.c   |  97 ++--
 fs/nfsd/nfs4recover.c| 388 ++---
 fs/nfsd/nfs4state.c  | 239 
 fs/nfsd/nfs4xdr.c|  56 +-
 fs/nfsd/nfsctl.c |   1 +
 fs/nfsd/nfsproc.c|   4 

[PATCH v4 11/11] powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing

2019-09-27 Thread Leonardo Bras
Skips the slow part of serialize_against_pte_lookup() if there is no
lockless pagetable walk currently running.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/mm/book3s64/pgtable.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/pgtable.c 
b/arch/powerpc/mm/book3s64/pgtable.c
index 6ba6195bff1b..27966481f2a6 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -95,7 +95,8 @@ static void do_nothing(void *unused)
 void serialize_against_pte_lookup(struct mm_struct *mm)
 {
smp_mb();
-   smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
+   if (running_lockless_pgtbl_walk(mm))
+   smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
 }
 
 /*
-- 
2.20.1



[PATCH v4 07/11] powerpc/kvm/e500: Applies counting method to monitor lockless pgtbl walks

2019-09-27 Thread Leonardo Bras
Applies the counting-based method for monitoring lockless pgtable walks on
kvmppc_e500_shadow_map().

Fixes the place where local_irq_restore() is called: previously, if ptep
was NULL, local_irq_restore() would never be called.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/kvm/e500_mmu_host.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 321db0fdb9db..a0be76ad8d14 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -473,6 +473,7 @@ static inline int kvmppc_e500_shadow_map(struct 
kvmppc_vcpu_e500 *vcpu_e500,
 * We are holding kvm->mmu_lock so a notifier invalidate
 * can't run hence pfn won't change.
 */
+   start_lockless_pgtbl_walk(kvm->mm);
local_irq_save(flags);
ptep = find_linux_pte(pgdir, hva, NULL, NULL);
if (ptep) {
@@ -481,15 +482,18 @@ static inline int kvmppc_e500_shadow_map(struct 
kvmppc_vcpu_e500 *vcpu_e500,
if (pte_present(pte)) {
wimg = (pte_val(pte) >> PTE_WIMGE_SHIFT) &
MAS2_WIMGE_MASK;
-   local_irq_restore(flags);
} else {
local_irq_restore(flags);
+   end_lockless_pgtbl_walk(kvm->mm);
pr_err_ratelimited("%s: pte not present: gfn %lx,pfn 
%lx\n",
   __func__, (long)gfn, pfn);
ret = -EINVAL;
goto out;
}
}
+   local_irq_restore(flags);
+   end_lockless_pgtbl_walk(kvm->mm);
+
kvmppc_e500_ref_setup(ref, gtlbe, pfn, wimg);
 
	kvmppc_e500_setup_stlbe(&vcpu_e500->vcpu, gtlbe, tsize,
-- 
2.20.1



[PATCH v4 10/11] powerpc/book3s_64: Enables counting method to monitor lockless pgtbl walk

2019-09-27 Thread Leonardo Bras
Enables the count-based monitoring method for lockless pagetable walks on
PowerPC book3s_64.

Other architectures/platforms fall back to using generic dummy functions.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 8308f32e9782..eb9b26a4a483 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1370,5 +1370,10 @@ static inline bool pgd_is_leaf(pgd_t pgd)
return !!(pgd_raw(pgd) & cpu_to_be64(_PAGE_PTE));
 }
 
+#define __HAVE_ARCH_LOCKLESS_PGTBL_WALK_COUNTER
+void start_lockless_pgtbl_walk(struct mm_struct *mm);
+void end_lockless_pgtbl_walk(struct mm_struct *mm);
+int running_lockless_pgtbl_walk(struct mm_struct *mm);
+
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
-- 
2.20.1



[PATCH v4 05/11] powerpc/perf: Applies counting method to monitor lockless pgtbl walks

2019-09-27 Thread Leonardo Bras
Applies the counting-based method for monitoring lockless pgtable walks on
read_user_stack_slow.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/perf/callchain.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/perf/callchain.c b/arch/powerpc/perf/callchain.c
index c84bbd4298a0..9d76194a2a8f 100644
--- a/arch/powerpc/perf/callchain.c
+++ b/arch/powerpc/perf/callchain.c
@@ -113,16 +113,18 @@ static int read_user_stack_slow(void __user *ptr, void 
*buf, int nb)
int ret = -EFAULT;
pgd_t *pgdir;
pte_t *ptep, pte;
+   struct mm_struct *mm = current->mm;
unsigned shift;
unsigned long addr = (unsigned long) ptr;
unsigned long offset;
unsigned long pfn, flags;
void *kaddr;
 
-   pgdir = current->mm->pgd;
+   pgdir = mm->pgd;
if (!pgdir)
return -EFAULT;
 
+   start_lockless_pgtbl_walk(mm);
local_irq_save(flags);
	ptep = find_current_mm_pte(pgdir, addr, NULL, &shift);
if (!ptep)
@@ -146,6 +148,7 @@ static int read_user_stack_slow(void __user *ptr, void 
*buf, int nb)
ret = 0;
 err_out:
local_irq_restore(flags);
+   end_lockless_pgtbl_walk(mm);
return ret;
 }
 
-- 
2.20.1



[PATCH v4 08/11] powerpc/kvm/book3s_hv: Applies counting method to monitor lockless pgtbl walks

2019-09-27 Thread Leonardo Bras
Applies the counting-based method for monitoring all book3s_hv related
functions that do lockless pagetable walks.

Adds comments explaining that some lockless pagetable walks don't need
protection due to guest pgd not being a target of THP collapse/split, or
due to being called from Realmode + MSR_EE = 0

kvmppc_do_h_enter: Fixes where local_irq_restore() must be placed (after
the last usage of ptep).

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/kvm/book3s_hv_nested.c | 22 --
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 18 ++
 2 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index 735e0ac6f5b2..5a641b559de7 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -803,7 +803,11 @@ static void kvmhv_update_nest_rmap_rc(struct kvm *kvm, u64 
n_rmap,
if (!gp)
return;
 
-   /* Find the pte */
+   /* Find the pte:
+* We are walking the nested guest (partition-scoped) page table here.
+* We can do this without disabling irq because the Linux MM
+* subsystem doesn't do THP splits and collapses on this tree.
+*/
	ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift);
/*
 * If the pte is present and the pfn is still the same, update the pte.
@@ -853,7 +857,11 @@ static void kvmhv_remove_nest_rmap(struct kvm *kvm, u64 
n_rmap,
if (!gp)
return;
 
-   /* Find and invalidate the pte */
+   /* Find and invalidate the pte:
+* We are walking the nested guest (partition-scoped) page table here.
+* We can do this without disabling irq because the Linux MM
+* subsystem doesn't do THP splits and collapses on this tree.
+*/
	ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift);
/* Don't spuriously invalidate ptes if the pfn has changed */
if (ptep && pte_present(*ptep) && ((pte_val(*ptep) & mask) == hpa))
@@ -921,6 +929,11 @@ static bool kvmhv_invalidate_shadow_pte(struct kvm_vcpu 
*vcpu,
int shift;
 
	spin_lock(&kvm->mmu_lock);
+   /*
+* We are walking the nested guest (partition-scoped) page table here.
+* We can do this without disabling irq because the Linux MM
+* subsystem doesn't do THP splits and collapses on this tree.
+*/
	ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift);
if (!shift)
shift = PAGE_SHIFT;
@@ -1362,6 +1375,11 @@ static long int __kvmhv_nested_page_fault(struct kvm_run 
*run,
/* See if can find translation in our partition scoped tables for L1 */
pte = __pte(0);
	spin_lock(&kvm->mmu_lock);
+   /*
+* We are walking the secondary (partition-scoped) page table here.
+* We can do this without disabling irq because the Linux MM
+* subsystem doesn't do THP splits and collapses on this tree.
+*/
	pte_p = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (!shift)
shift = PAGE_SHIFT;
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 63e0ce91e29d..2076a7ac230a 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -252,6 +252,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 * If we had a page table table change after lookup, we would
 * retry via mmu_notifier_retry.
 */
+   start_lockless_pgtbl_walk(kvm->mm);
if (!realmode)
local_irq_save(irq_flags);
/*
@@ -287,8 +288,6 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
pa |= gpa & ~PAGE_MASK;
}
}
-   if (!realmode)
-   local_irq_restore(irq_flags);
 
ptel &= HPTE_R_KEY | HPTE_R_PP0 | (psize-1);
ptel |= pa;
@@ -311,6 +310,9 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
ptel &= ~(HPTE_R_W|HPTE_R_I|HPTE_R_G);
ptel |= HPTE_R_M;
}
+   if (!realmode)
+   local_irq_restore(irq_flags);
+   end_lockless_pgtbl_walk(kvm->mm);
 
/* Find and lock the HPTEG slot to use */
  do_insert:
@@ -885,11 +887,19 @@ static int kvmppc_get_hpa(struct kvm_vcpu *vcpu, unsigned 
long gpa,
/* Translate to host virtual address */
hva = __gfn_to_hva_memslot(memslot, gfn);
 
-   /* Try to find the host pte for that virtual address */
+   /* Try to find the host pte for that virtual address :
+* Called by hcall_real_table (real mode + MSR_EE=0)
+* Interrupts are disabled here.
+*/
+   start_lockless_pgtbl_walk(kvm->mm);
	ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);
-   if (!ptep)
+   if (!ptep) {
+   end_lockless_pgtbl_walk(kvm->mm);
return H_TOO_HARD;

[PATCH v4 06/11] powerpc/mm/book3s64/hash: Applies counting method to monitor lockless pgtbl walks

2019-09-27 Thread Leonardo Bras
Applies the counting-based method for monitoring all hash-related functions
that do lockless pagetable walks.

hash_page_mm: Adds a comment explaining that there is no need for
local_irq_disable/save, given that it is only called from the DataAccess
interrupt, so interrupts are already disabled.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/mm/book3s64/hash_tlb.c   |  2 ++
 arch/powerpc/mm/book3s64/hash_utils.c | 12 +++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c 
b/arch/powerpc/mm/book3s64/hash_tlb.c
index 4a70d8dd39cd..5e5213c3f7c4 100644
--- a/arch/powerpc/mm/book3s64/hash_tlb.c
+++ b/arch/powerpc/mm/book3s64/hash_tlb.c
@@ -209,6 +209,7 @@ void __flush_hash_table_range(struct mm_struct *mm, 
unsigned long start,
 * to being hashed). This is not the most performance oriented
 * way to do things but is fine for our needs here.
 */
+   start_lockless_pgtbl_walk(mm);
local_irq_save(flags);
arch_enter_lazy_mmu_mode();
for (; start < end; start += PAGE_SIZE) {
@@ -230,6 +231,7 @@ void __flush_hash_table_range(struct mm_struct *mm, 
unsigned long start,
}
arch_leave_lazy_mmu_mode();
local_irq_restore(flags);
+   end_lockless_pgtbl_walk(mm);
 }
 
 void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index b8ad14bb1170..8615fab87c43 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -1321,7 +1321,11 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
ea &= ~((1ul << mmu_psize_defs[psize].shift) - 1);
 #endif /* CONFIG_PPC_64K_PAGES */
 
-   /* Get PTE and page size from page tables */
+   /* Get PTE and page size from page tables :
+* Called in from DataAccess interrupt (data_access_common: 0x300),
+* interrupts are disabled here.
+*/
+   start_lockless_pgtbl_walk(mm);
	ptep = find_linux_pte(pgdir, ea, &is_thp, &hugeshift);
if (ptep == NULL || !pte_present(*ptep)) {
DBG_LOW(" no PTE !\n");
@@ -1438,6 +1442,7 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
DBG_LOW(" -> rc=%d\n", rc);
 
 bail:
+   end_lockless_pgtbl_walk(mm);
exception_exit(prev_state);
return rc;
 }
@@ -1547,10 +1552,12 @@ void hash_preload(struct mm_struct *mm, unsigned long 
ea,
	vsid = get_user_vsid(&mm->context, ea, ssize);
if (!vsid)
return;
+
/*
 * Hash doesn't like irqs. Walking linux page table with irq disabled
 * saves us from holding multiple locks.
 */
+   start_lockless_pgtbl_walk(mm);
local_irq_save(flags);
 
/*
@@ -1597,6 +1604,7 @@ void hash_preload(struct mm_struct *mm, unsigned long ea,
   pte_val(*ptep));
 out_exit:
local_irq_restore(flags);
+   end_lockless_pgtbl_walk(mm);
 }
 
 #ifdef CONFIG_PPC_MEM_KEYS
@@ -1613,11 +1621,13 @@ u16 get_mm_addr_key(struct mm_struct *mm, unsigned long 
address)
if (!mm || !mm->pgd)
return 0;
 
+   start_lockless_pgtbl_walk(mm);
local_irq_save(flags);
ptep = find_linux_pte(mm->pgd, address, NULL, NULL);
if (ptep)
pkey = pte_to_pkey_bits(pte_val(READ_ONCE(*ptep)));
local_irq_restore(flags);
+   end_lockless_pgtbl_walk(mm);
 
return pkey;
 }
-- 
2.20.1



[PATCH v4 03/11] mm/gup: Applies counting method to monitor gup_pgd_range

2019-09-27 Thread Leonardo Bras
As described, gup_pgd_range is a lockless pagetable walk. So, in order to
monitor it against THP split/collapse with the counting method, it's
necessary to bound it with {start,end}_lockless_pgtbl_walk.

There are dummy functions for archs that don't use this method, so it is
not going to add any overhead there.

Signed-off-by: Leonardo Bras 
---
 mm/gup.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index 98f13ab37bac..7105c829cf44 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2325,6 +2325,7 @@ static bool gup_fast_permitted(unsigned long start, 
unsigned long end)
 int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
  struct page **pages)
 {
+   struct mm_struct *mm;
unsigned long len, end;
unsigned long flags;
int nr = 0;
@@ -2352,9 +2353,12 @@ int __get_user_pages_fast(unsigned long start, int 
nr_pages, int write,
 
if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
gup_fast_permitted(start, end)) {
+   mm = current->mm;
+   start_lockless_pgtbl_walk(mm);
local_irq_save(flags);
		gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr);
local_irq_restore(flags);
+   end_lockless_pgtbl_walk(mm);
}
 
return nr;
@@ -2404,6 +2408,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
unsigned int gup_flags, struct page **pages)
 {
unsigned long addr, len, end;
+   struct mm_struct *mm;
int nr = 0, ret = 0;
 
if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
@@ -2421,9 +2426,12 @@ int get_user_pages_fast(unsigned long start, int 
nr_pages,
 
if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
gup_fast_permitted(start, end)) {
+   mm = current->mm;
+   start_lockless_pgtbl_walk(mm);
local_irq_disable();
		gup_pgd_range(addr, end, gup_flags, pages, &nr);
local_irq_enable();
+   end_lockless_pgtbl_walk(mm);
ret = nr;
}
 
-- 
2.20.1



[PATCH v4 04/11] powerpc/mce_power: Applies counting method to monitor lockless pgtbl walks

2019-09-27 Thread Leonardo Bras
Applies the counting-based method for monitoring lockless pgtable walks on
addr_to_pfn().

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/kernel/mce_power.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index a814d2dfb5b0..0f2f87da4cd1 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -27,6 +27,7 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long 
addr)
 {
pte_t *ptep;
unsigned long flags;
+   unsigned long pfn;
struct mm_struct *mm;
 
if (user_mode(regs))
@@ -34,15 +35,21 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned 
long addr)
else
		mm = &init_mm;
 
+   start_lockless_pgtbl_walk(mm);
local_irq_save(flags);
if (mm == current->mm)
ptep = find_current_mm_pte(mm->pgd, addr, NULL, NULL);
else
ptep = find_init_mm_pte(addr, NULL);
-   local_irq_restore(flags);
+
if (!ptep || pte_special(*ptep))
-   return ULONG_MAX;
-   return pte_pfn(*ptep);
+   pfn = ULONG_MAX;
+   else
+   pfn = pte_pfn(*ptep);
+
+   local_irq_restore(flags);
+   end_lockless_pgtbl_walk(mm);
+   return pfn;
 }
 
 /* flush SLBs and reload */
-- 
2.20.1



[PATCH v4 09/11] powerpc/kvm/book3s_64: Applies counting method to monitor lockless pgtbl walks

2019-09-27 Thread Leonardo Bras
Applies the counting-based method for monitoring all book3s_64-related
functions that do lockless pagetable walks.

Adds comments explaining that some lockless pagetable walks don't need
protection due to guest pgd not being a target of THP collapse/split, or
due to being called from Realmode + MSR_EE = 0.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c|  2 ++
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 30 ++
 arch/powerpc/kvm/book3s_64_vio_hv.c|  3 +++
 3 files changed, 35 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 9a75f0e1933b..fcd3dad1297f 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -620,6 +620,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
 * We need to protect against page table destruction
 * hugepage split and collapse.
 */
+   start_lockless_pgtbl_walk(kvm->mm);
local_irq_save(flags);
ptep = find_current_mm_pte(current->mm->pgd,
   hva, NULL, NULL);
@@ -629,6 +630,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
write_ok = 1;
}
local_irq_restore(flags);
+   end_lockless_pgtbl_walk(kvm->mm);
}
}
 
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 2d415c36a61d..9b374b9838fa 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -813,6 +813,7 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
 * Read the PTE from the process' radix tree and use that
 * so we get the shift and attribute bits.
 */
+   start_lockless_pgtbl_walk(kvm->mm);
local_irq_disable();
	ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);
/*
@@ -821,12 +822,14 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
 */
if (!ptep) {
local_irq_enable();
+   end_lockless_pgtbl_walk(kvm->mm);
if (page)
put_page(page);
return RESUME_GUEST;
}
pte = *ptep;
local_irq_enable();
+   end_lockless_pgtbl_walk(kvm->mm);
 
/* If we're logging dirty pages, always map single pages */
large_enable = !(memslot->flags & KVM_MEM_LOG_DIRTY_PAGES);
@@ -972,10 +975,16 @@ int kvm_unmap_radix(struct kvm *kvm, struct 
kvm_memory_slot *memslot,
unsigned long gpa = gfn << PAGE_SHIFT;
unsigned int shift;
 
+   /*
+* We are walking the secondary (partition-scoped) page table here.
+* We can do this without disabling irq because the Linux MM
+* subsystem doesn't do THP splits and collapses on this tree.
+*/
	ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep))
kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot,
 kvm->arch.lpid);
+
return 0;   
 }
 
@@ -989,6 +998,11 @@ int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot 
*memslot,
int ref = 0;
unsigned long old, *rmapp;
 
+   /*
+* We are walking the secondary (partition-scoped) page table here.
+* We can do this without disabling irq because the Linux MM
+* subsystem doesn't do THP splits and collapses on this tree.
+*/
	ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep) && pte_young(*ptep)) {
old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_ACCESSED, 0,
@@ -1013,6 +1027,11 @@ int kvm_test_age_radix(struct kvm *kvm, struct 
kvm_memory_slot *memslot,
unsigned int shift;
int ref = 0;
 
+   /*
+* We are walking the secondary (partition-scoped) page table here.
+* We can do this without disabling irq because the Linux MM
+* subsystem doesn't do THP splits and collapses on this tree.
+*/
	ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep) && pte_young(*ptep))
ref = 1;
@@ -1030,6 +1049,11 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
int ret = 0;
unsigned long old, *rmapp;
 
+   /*
+* We are walking the secondary (partition-scoped) page table here.
+* We can do this without disabling irq because the Linux MM
+* subsystem doesn't do THP splits and collapses on this tree.
+*/
	ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep) && 

[PATCH v4 02/11] asm-generic/pgtable: Adds dummy functions to monitor lockless pgtable walks

2019-09-27 Thread Leonardo Bras
There is a need to monitor lockless pagetable walks, in order to avoid
doing THP splitting/collapsing during them.

Some methods rely on local_irq_{save,restore}, but that can be slow in
cases where a lot of CPUs are used by the process.

In order to speed up these cases, I propose a refcount-based approach that
counts the number of lockless pagetable walks happening on the process.

Given that there are lockless pagetable walks on generic code, it's
necessary to create dummy functions for archs that won't use the approach.

Signed-off-by: Leonardo Bras 
---
 include/asm-generic/pgtable.h | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 75d9d68a6de7..0831475e72d3 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1172,6 +1172,21 @@ static inline bool arch_has_pfn_modify_check(void)
 #endif
 #endif
 
+#ifndef __HAVE_ARCH_LOCKLESS_PGTBL_WALK_COUNTER
+static inline void start_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+}
+
+static inline void end_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+}
+
+static inline int running_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+   return 0;
+}
+#endif
+
 /*
  * On some architectures it depends on the mm if the p4d/pud or pmd
  * layer of the page table hierarchy is folded or not.
-- 
2.20.1



[PATCH v4 01/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks

2019-09-27 Thread Leonardo Bras
It's necessary to monitor lockless pagetable walks, in order to avoid doing
THP splitting/collapsing during them.

Some methods rely on local_irq_{save,restore}, but that can be slow in
cases where a lot of CPUs are used by the process.

In order to speed up some cases, I propose a refcount-based approach that
counts the number of lockless pagetable walks happening on the process.

This method does not exclude the current irq-oriented method. It works as a
complement to skip unnecessary waiting.

start_lockless_pgtbl_walk(mm)
Insert before starting any lockless pgtable walk
end_lockless_pgtbl_walk(mm)
Insert after the end of any lockless pgtable walk
(Mostly after the ptep is last used)
running_lockless_pgtbl_walk(mm)
Returns the number of lockless pgtable walks running

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
 arch/powerpc/mm/book3s64/mmu_context.c   |  1 +
 arch/powerpc/mm/book3s64/pgtable.c   | 45 
 3 files changed, 49 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
b/arch/powerpc/include/asm/book3s/64/mmu.h
index 23b83d3593e2..13b006e7dde4 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -116,6 +116,9 @@ typedef struct {
/* Number of users of the external (Nest) MMU */
atomic_t copros;
 
+   /* Number of running instances of lockless pagetable walk*/
+   atomic_t lockless_pgtbl_walk_count;
+
struct hash_mm_context *hash_context;
 
unsigned long vdso_base;
diff --git a/arch/powerpc/mm/book3s64/mmu_context.c 
b/arch/powerpc/mm/book3s64/mmu_context.c
index 2d0cb5ba9a47..3dd01c0ca5be 100644
--- a/arch/powerpc/mm/book3s64/mmu_context.c
+++ b/arch/powerpc/mm/book3s64/mmu_context.c
@@ -200,6 +200,7 @@ int init_new_context(struct task_struct *tsk, struct 
mm_struct *mm)
 #endif
	atomic_set(&mm->context.active_cpus, 0);
	atomic_set(&mm->context.copros, 0);
+   atomic_set(&mm->context.lockless_pgtbl_walk_count, 0);
 
return 0;
 }
diff --git a/arch/powerpc/mm/book3s64/pgtable.c 
b/arch/powerpc/mm/book3s64/pgtable.c
index 7d0e0d0d22c4..6ba6195bff1b 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -98,6 +98,51 @@ void serialize_against_pte_lookup(struct mm_struct *mm)
smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
 }
 
+/*
+ * Counting method to monitor lockless pagetable walks:
+ * Uses start_lockless_pgtbl_walk and end_lockless_pgtbl_walk to track the
+ * number of lockless pgtable walks happening, and
+ * running_lockless_pgtbl_walk to return this value.
+ */
+
+/* start_lockless_pgtbl_walk: Must be inserted before a function call that does
+ *   lockless pagetable walks, such as __find_linux_pte()
+ */
+void start_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+   atomic_inc(&mm->context.lockless_pgtbl_walk_count);
+   /* Avoid reorder to guarantee that the increment will happen before any
+* part of the lockless pagetable walk after it.
+*/
+   smp_mb();
+}
+EXPORT_SYMBOL(start_lockless_pgtbl_walk);
+
+/*
+ * end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer
+ *   returned by a lockless pagetable walk, such as __find_linux_pte()
+*/
+void end_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+   /* Avoid reorder to guarantee that it will only decrement after the last
+* use of the returned ptep from the lockless pagetable walk.
+*/
+   smp_mb();
+   atomic_dec(&mm->context.lockless_pgtbl_walk_count);
+}
+EXPORT_SYMBOL(end_lockless_pgtbl_walk);
+
+/*
+ * running_lockless_pgtbl_walk: Returns the number of lockless pagetable walks
+ *   currently running. If it returns 0, there is no running pagetable walk, and
+ *   THP split/collapse can be safely done. This can be used to avoid more
+ *   expensive approaches like serialize_against_pte_lookup()
+ */
+int running_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+   return atomic_read(&mm->context.lockless_pgtbl_walk_count);
+}
+
 /*
  * We use this to invalidate a pmdp entry before switching from a
  * hugepte to regular pmd entry.
-- 
2.20.1



[PATCH v4 00/11] Introduces new count-based method for monitoring lockless pagetable walks

2019-09-27 Thread Leonardo Bras
If a process (qemu) with a lot of CPUs (128) tries to munmap() a large
chunk of memory (496GB) mapped with THP, it takes an average of 275
seconds, which can cause a lot of problems for the load (in qemu's case,
the guest will lock up for this long).

Trying to find the source of this bug, I found out most of this time is
spent on serialize_against_pte_lookup(). This function will take a lot
of time in smp_call_function_many() if there are more than a couple of
CPUs running the user process. Since it has to happen for every mapped
THP, it will take a very long time for large amounts of memory.

By the docs, serialize_against_pte_lookup() is needed in order to keep
the pmd_t to pte_t casting inside find_current_mm_pte(), or any lockless
pagetable walk, from happening concurrently with THP splitting/collapsing.

It does so by calling a do_nothing() IPI on each CPU in mm->cpu_bitmap[],
which only runs after that CPU re-enables interrupts.
Since interrupts are (usually) disabled during a lockless pagetable
walk, and serialize_against_pte_lookup() only returns after interrupts
are re-enabled on every such CPU, the walk is protected.

So, by my understanding, if there is no lockless pagetable walk
running, there is no need to call serialize_against_pte_lookup().

So, to avoid the cost of running serialize_against_pte_lookup(), I
propose a counter that keeps track of how many lockless pagetable walks
(e.g. find_current_mm_pte()) are currently running, and if there are
none, just skips smp_call_function_many().

The related functions are:
start_lockless_pgtbl_walk(mm)
Insert before starting any lockless pgtable walk
end_lockless_pgtbl_walk(mm)
Insert after the end of any lockless pgtable walk
(Mostly after the ptep is last used)
running_lockless_pgtbl_walk(mm)
Returns the number of lockless pgtable walks running

On my workload (qemu), I could see munmap's time reduction from 275
seconds to 418ms.

Also, I documented some lockless pagetable walks which don't need to be
counted, given they work on init_mm or on a guest pgd.

Changes since v3:
 Adds memory barrier to {start,end}_lockless_pgtbl_walk()
 Explain (comments) why some lockless pgtbl walks don't need
local_irq_disable (real mode + MSR_EE=0)
 Explain (comments) places where counting method is not needed (guest pgd,
which is not touched by THP)
 Fixes some misplaced local_irq_restore()
 Link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=132417

Changes since v2:
 Rebased to v5.3
 Adds support on __get_user_pages_fast
 Adds usage decription to *_lockless_pgtbl_walk()
 Better style to dummy functions
 Link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=131839

Changes since v1:
 Isolated atomic operations in functions *_lockless_pgtbl_walk()
 Fixed behavior of decrementing before last ptep was used
 Link: http://patchwork.ozlabs.org/patch/1163093/


Leonardo Bras (11):
  powerpc/mm: Adds counting method to monitor lockless pgtable walks
  asm-generic/pgtable: Adds dummy functions to monitor lockless pgtable
walks
  mm/gup: Applies counting method to monitor gup_pgd_range
  powerpc/mce_power: Applies counting method to monitor lockless pgtbl
walks
  powerpc/perf: Applies counting method to monitor lockless pgtbl walks
  powerpc/mm/book3s64/hash: Applies counting method to monitor lockless
pgtbl walks
  powerpc/kvm/e500: Applies counting method to monitor lockless pgtbl
walks
  powerpc/kvm/book3s_hv: Applies counting method to monitor lockless
pgtbl walks
  powerpc/kvm/book3s_64: Applies counting method to monitor lockless
pgtbl walks
  powerpc/book3s_64: Enables counting method to monitor lockless pgtbl
walk
  powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing

 arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
 arch/powerpc/include/asm/book3s/64/pgtable.h |  5 ++
 arch/powerpc/kernel/mce_power.c  | 13 --
 arch/powerpc/kvm/book3s_64_mmu_hv.c  |  2 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c   | 30 
 arch/powerpc/kvm/book3s_64_vio_hv.c  |  3 ++
 arch/powerpc/kvm/book3s_hv_nested.c  | 22 -
 arch/powerpc/kvm/book3s_hv_rm_mmu.c  | 18 ++--
 arch/powerpc/kvm/e500_mmu_host.c |  6 ++-
 arch/powerpc/mm/book3s64/hash_tlb.c  |  2 +
 arch/powerpc/mm/book3s64/hash_utils.c| 12 -
 arch/powerpc/mm/book3s64/mmu_context.c   |  1 +
 arch/powerpc/mm/book3s64/pgtable.c   | 48 +++-
 arch/powerpc/perf/callchain.c|  5 +-
 include/asm-generic/pgtable.h| 15 ++
 mm/gup.c |  8 
 16 files changed, 180 insertions(+), 13 deletions(-)

-- 
2.20.1



Re: [PATCH v2 1/8] KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter

2019-09-27 Thread Jim Mattson
On Fri, Sep 27, 2019 at 2:45 PM Sean Christopherson
 wrote:
>
> Write the desired L2 CR3 into vmcs02.GUEST_CR3 during nested VM-Enter
> instead of deferring the VMWRITE until vmx_set_cr3().  If the VMWRITE
> is deferred, then KVM can consume a stale vmcs02.GUEST_CR3 when it
> refreshes vmcs12->guest_cr3 during nested_vmx_vmexit() if the emulated
> VM-Exit occurs without actually entering L2, e.g. if the nested run
> is squashed because nested VM-Enter (from L1) is putting L2 into HLT.
>
> Note, the above scenario can occur regardless of whether L1 is
> intercepting HLT, e.g. L1 can intercept HLT and then re-enter L2 with
> vmcs.GUEST_ACTIVITY_STATE=HALTED.  But practically speaking, a VMM will
> likely put a guest into HALTED if and only if it's not intercepting HLT.
>
> In an ideal world where EPT *requires* unrestricted guest (and vice
> versa), VMX could handle CR3 similar to how it handles RSP and RIP,
> e.g. mark CR3 dirty and conditionally load it at vmx_vcpu_run().  But
> the unrestricted guest silliness complicates the dirty tracking logic
> to the point that explicitly handling vmcs02.GUEST_CR3 during nested
> VM-Enter is a simpler overall implementation.
>
> Cc: sta...@vger.kernel.org
> Reported-and-tested-by: Reto Buerki 
> Tested-by: Vitaly Kuznetsov 
> Reviewed-by: Liran Alon 
> Signed-off-by: Sean Christopherson 
Reviewed-by: Jim Mattson 


[PATCH 3/3] perf annotate: Improve handling of corrupted ~/.debug

2019-09-27 Thread Andi Kleen
From: Andi Kleen 

Sometimes ~/.debug can get corrupted and contain files that still
have symbol tables, but which objdump cannot handle. Add a fallback
to read the "original" file in such a case. This might be wrong
too if it's different, but in many cases when profiling
on the same host it will work.
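The retry in this patch is a simple two-attempt pattern: try the build-id cache copy first, and if objdump produces no output at all, fall back to the binary's original path. A standalone sketch of that control flow (the `disassemble_file()` stand-in for the objdump invocation is hypothetical, for illustration only):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for running objdump: returns the number of output lines.
 * Here we pretend any path containing ".debug" is a corrupted cache
 * entry that yields no output at all. */
static int disassemble_file(const char *path)
{
	return strstr(path, ".debug") ? 0 : 42;
}

/* Mirrors the patch's flow: prefer the build-id cache copy, but retry
 * with the "true" file when the first attempt produces nothing. */
static int disassemble_with_fallback(const char *cache_path,
				     const char *orig_path)
{
	const char *path = cache_path ? cache_path : orig_path;
	bool build_id = cache_path != NULL;
	int nline;

again:
	nline = disassemble_file(path);
	if (nline == 0 && build_id) {
		/* Cache entry may be corrupted: retry the original file. */
		path = orig_path;
		build_id = false;
		goto again;
	}
	return nline;
}
```

As in the patch, the fallback runs at most once, since `build_id` is cleared before the retry.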

Signed-off-by: Andi Kleen 
---
 tools/perf/util/annotate.c | 27 +++
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index 1748f528b6e9..cff5f36786fa 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -1638,19 +1638,22 @@ int symbol__strerror_disassemble(struct symbol *sym __maybe_unused, struct map *
return 0;
 }
 
-static int dso__disassemble_filename(struct dso *dso, char *filename, size_t filename_size)
+static int dso__disassemble_filename(struct dso *dso, char *filename, size_t filename_size,
+				     bool *build_id)
 {
char linkname[PATH_MAX];
char *build_id_filename;
char *build_id_path = NULL;
char *pos;
 
+   *build_id = false;
if (dso->symtab_type == DSO_BINARY_TYPE__KALLSYMS &&
!dso__is_kcore(dso))
return SYMBOL_ANNOTATE_ERRNO__NO_VMLINUX;
 
build_id_filename = dso__build_id_filename(dso, NULL, 0, false);
if (build_id_filename) {
+   *build_id = true;
		__symbol__join_symfs(filename, filename_size, build_id_filename);
free(build_id_filename);
} else {
@@ -1854,11 +1857,14 @@ static int symbol__disassemble(struct symbol *sym, struct annotate_args *args)
int lineno = 0;
int nline;
pid_t pid;
-	int err = dso__disassemble_filename(dso, symfs_filename, sizeof(symfs_filename));
+	bool build_id;
+	int err = dso__disassemble_filename(dso, symfs_filename, sizeof(symfs_filename),
+					    &build_id);
 
if (err)
return err;
 
+again:
	pr_debug("%s: filename=%s, sym=%s, start=%#" PRIx64 ", end=%#" PRIx64 "\n", __func__,
 symfs_filename, sym->name, map->unmap_ip(map, sym->start),
 map->unmap_ip(map, sym->end));
@@ -1955,8 +1961,21 @@ static int symbol__disassemble(struct symbol *sym, struct annotate_args *args)
nline++;
}
 
-   if (nline == 0)
-   pr_err("No output from %s\n", command);
+   if (nline == 0) {
+		pr_err("No objdump output for %s. Corrupted file?\n", symfs_filename);
+   if (build_id) {
+   /*
+* It could be that the buildid file is corrupted.
+* Try again with the "true" file. This might be wrong
+* too, but it's better than nothing.
+* We could move the build id file here?
+*/
+			__symbol__join_symfs(symfs_filename, sizeof(symfs_filename),
+					     dso->long_name);
+   build_id = false;
+   goto again;
+   }
+   }
 
/*
 * kallsyms does not have symbol sizes so there may a nop at the end.
-- 
2.21.0



[PATCH 2/3] perf jevents: Fix period for Intel fixed counters

2019-09-27 Thread Andi Kleen
From: Andi Kleen 

The Intel fixed counters use a special table to override the JSON
information. During this override the period information from
the JSON file got dropped, which results in inst_retired.any
and similar running with frequency mode instead of a period.
Just specify the expected period in the table.

Signed-off-by: Andi Kleen 
---
 tools/perf/pmu-events/jevents.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/tools/perf/pmu-events/jevents.c b/tools/perf/pmu-events/jevents.c
index d413761621b0..fa85e33762f7 100644
--- a/tools/perf/pmu-events/jevents.c
+++ b/tools/perf/pmu-events/jevents.c
@@ -449,12 +449,12 @@ static struct fixed {
const char *name;
const char *event;
 } fixed[] = {
-   { "inst_retired.any", "event=0xc0" },
-   { "inst_retired.any_p", "event=0xc0" },
-   { "cpu_clk_unhalted.ref", "event=0x0,umask=0x03" },
-   { "cpu_clk_unhalted.thread", "event=0x3c" },
-   { "cpu_clk_unhalted.core", "event=0x3c" },
-   { "cpu_clk_unhalted.thread_any", "event=0x3c,any=1" },
+   { "inst_retired.any", "event=0xc0,period=2000003" },
+   { "inst_retired.any_p", "event=0xc0,period=2000003" },
+   { "cpu_clk_unhalted.ref", "event=0x0,umask=0x03,period=2000003" },
+   { "cpu_clk_unhalted.thread", "event=0x3c,period=2000003" },
+   { "cpu_clk_unhalted.core", "event=0x3c,period=2000003" },
+   { "cpu_clk_unhalted.thread_any", "event=0x3c,any=1,period=2000003" },
{ NULL, NULL},
 };
 
-- 
2.21.0



[PATCH 1/3] perf script brstackinsn: Fix recovery from LBR/binary mismatch

2019-09-27 Thread Andi Kleen
From: Andi Kleen 

When the LBR data and the instructions in a binary do not match the
loop printing instructions could get confused and print a long
stream of bogus <bad> instructions.

The problem was that if the instruction decoder cannot decode an
instruction, its ilen wasn't initialized, so the loop going through
the basic block would continue with the previous value.

Harden the code to avoid such problems:
- Make sure ilen is always freshly initialized and is 0 for bad
instructions.
- Do not overrun the code buffer while printing instructions
- Print a warning message if the final jump is not on an
instruction boundary.
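The hardening boils down to two invariants in the decode loop: `ilen` is taken fresh on every iteration, and the offset can never step past the buffer. A minimal standalone sketch (the one-byte `decode_one()` standing in for `dump_insn()` is hypothetical):

```c
#include <stddef.h>

/* Hypothetical decoder: returns the instruction length, or 0 when the
 * byte cannot be decoded (here, 0xff marks an undecodable byte). */
static int decode_one(const unsigned char *buf, size_t len)
{
	if (len == 0 || buf[0] == 0xff)
		return 0;
	return 1;
}

/* Walk a basic block the way the patched loop does: bound the loop by
 * the buffer length and never reuse a stale length, so a failed decode
 * stops the walk instead of streaming bogus instructions. */
static int count_insns(const unsigned char *buf, size_t len)
{
	size_t off;
	int ilen, n = 0;

	for (off = 0; off < len; off += ilen) {
		/* Like the patch: ilen is set anew on every pass. */
		ilen = decode_one(buf + off, len - off);
		if (ilen == 0)
			break;		/* LBR/binary mismatch: stop */
		n++;
	}
	return n;
}
```

With the old unbounded `for (off = 0;; off += ilen)` shape, a decoder that left `ilen` untouched could loop far past the buffer; here both failure modes are closed off.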

Signed-off-by: Andi Kleen 
---
 tools/perf/builtin-script.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index e079b34201f2..32b17d51c982 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -1061,7 +1061,7 @@ static int perf_sample__fprintf_brstackinsn(struct perf_sample *sample,
continue;
 
insn = 0;
-   for (off = 0;; off += ilen) {
+   for (off = 0; off < (unsigned)len; off += ilen) {
uint64_t ip = start + off;
 
			printed += ip__fprintf_sym(ip, thread, x.cpumode, x.cpu, &lastsym, attr, fp);
@@ -1072,6 +1072,7 @@ static int perf_sample__fprintf_brstackinsn(struct perf_sample *sample,
				printed += print_srccode(thread, x.cpumode, ip);
break;
} else {
+   ilen = 0;
				printed += fprintf(fp, "\t%016" PRIx64 "\t%s\n", ip,
						   dump_insn(&x, ip, buffer + off, len - off, &ilen));
if (ilen == 0)
@@ -1081,6 +1082,8 @@ static int perf_sample__fprintf_brstackinsn(struct perf_sample *sample,
insn++;
}
}
+   if (off != (unsigned)len)
+			printed += fprintf(fp, "\tmismatch of LBR data and executable\n");
}
 
/*
@@ -1121,6 +1124,7 @@ static int perf_sample__fprintf_brstackinsn(struct perf_sample *sample,
goto out;
}
for (off = 0; off <= end - start; off += ilen) {
+   ilen = 0;
printed += fprintf(fp, "\t%016" PRIx64 "\t%s\n", start + off,
				   dump_insn(&x, start + off, buffer + off, len - off, &ilen));
if (ilen == 0)
-- 
2.21.0



RE: [PATCH] namespace: fix namespace.pl script to support relative paths

2019-09-27 Thread Keller, Jacob E
> -Original Message-
> From: Randy Dunlap [mailto:rdun...@infradead.org]
> Sent: Friday, September 27, 2019 4:12 PM
> To: Keller, Jacob E 
> Cc: intel-wired-...@lists.osuosl.org; linux-kernel@vger.kernel.org; 
> linux-kbuild  kbu...@vger.kernel.org>; Masahiro Yamada 
> Subject: Re: [PATCH] namespace: fix namespace.pl script to support relative 
> paths
> 
> 
> re: 
> https://lore.kernel.org/lkml/20190129204319.15238-1-jacob.e.kel...@intel.com/
> 
> Did anything happen with this patch?
> 
> Please send it to linux-kbu...@vger.kernel.org and
> Cc: Masahiro Yamada 
> 
> You can also add:
> Acked-by: Randy Dunlap 
> Tested-by: Randy Dunlap 
> 
> 
> I was just about to fix this script but I decided to first see if anyone else
> had already done so.  Thanks.
> 
> --
> ~Randy

Done, thanks.

Regards,
Jake


Re: [PATCH v3 00/11] Introduces new count-based method for monitoring lockless pagetable walks

2019-09-27 Thread Leonardo Bras
On Fri, 2019-09-27 at 11:46 -0300, Leonardo Bras wrote:
> I am not sure if it would be ok to use irq_{save,restore} in real mode,
> I will do some more reading of the docs before addressing this. 

It looks like it's unsafe to merge irq_{save,restore} in
{start,end}_lockless_pgtbl_walk(), due to a possible access of code
that is not accessible in real mode.

I am sending a v4 for the changes so far.
I will look forward for your feedback.


signature.asc
Description: This is a digitally signed message part


RE: [PATCH] namespace: fix namespace.pl script to support relative paths

2019-09-27 Thread Keller, Jacob E
> -Original Message-
> From: Randy Dunlap [mailto:rdun...@infradead.org]
> Sent: Friday, September 27, 2019 4:12 PM
> To: Keller, Jacob E 
> Cc: intel-wired-...@lists.osuosl.org; linux-kernel@vger.kernel.org; 
> linux-kbuild  kbu...@vger.kernel.org>; Masahiro Yamada 
> Subject: Re: [PATCH] namespace: fix namespace.pl script to support relative 
> paths
> 
> 
> re: 
> https://lore.kernel.org/lkml/20190129204319.15238-1-jacob.e.kel...@intel.com/
> 
> Did anything happen with this patch?
> 

I haven't heard anything or seen it get applied.

> Please send it to linux-kbu...@vger.kernel.org and
> Cc: Masahiro Yamada 
> 

Sure, I can forward it.

> You can also add:
> Acked-by: Randy Dunlap 
> Tested-by: Randy Dunlap 
> 
> 
> I was just about to fix this script but I decided to first see if anyone else
> had already done so.  Thanks.
> 

Thanks for the review/notice here.

> --
> ~Randy

-Jake


Re: [PATCH] PCI: mobiveil: Fix csr_read/write build issue

2019-09-27 Thread Christoph Hellwig
On Thu, Sep 26, 2019 at 10:29:34AM +0100, Andrew Murray wrote:
> Though I'd be just as happy if the csr_[read,write][l,] functions were
> renamed to mobiveil_csr_[read,write][l,].

Please do that instead, using such generic names as csr_* in a driver
is a bad idea, with or without a __ prefix.


[PATCH v2] riscv: Fix memblock reservation for device tree blob

2019-09-27 Thread Albert Ou
This fixes an error with how the FDT blob is reserved in memblock.
An incorrect physical address calculation exposed the FDT header to
unintended corruption, which typically manifested with of_fdt_raw_init()
faulting during late boot after fdt_totalsize() returned a wrong value.
Systems with smaller physical memory sizes more frequently trigger this
issue, as the kernel is more likely to allocate from the DMA32 zone
where bbl places the DTB after the kernel image.

Commit 671f9a3e2e24 ("RISC-V: Setup initial page tables in two stages")
changed the mapping of the DTB to reside in the fixmap area.
Consequently, early_init_fdt_reserve_self() cannot be used anymore in
setup_bootmem() since it relies on __pa() to derive a physical address,
which does not work with dtb_early_va that is no longer a valid kernel
logical address.

The reserved[0x1] region shows the effect of the pointer underflow
resulting from the __pa(initial_boot_params) offset subtraction:

[0.00] MEMBLOCK configuration:
[0.00]  memory size = 0x1fe0 reserved size = 0x00a2e514
[0.00]  memory.cnt  = 0x1
[0.00]  memory[0x0] [0x8020-0x9fff], 0x1fe0 bytes flags: 0x0
[0.00]  reserved.cnt  = 0x2
[0.00]  reserved[0x0]   [0x8020-0x80c2dfeb], 0x00a2dfec bytes flags: 0x0
[0.00]  reserved[0x1]   [0xfff08010-0xfff080100527], 0x0528 bytes flags: 0x0

With the fix applied:

[0.00] MEMBLOCK configuration:
[0.00]  memory size = 0x1fe0 reserved size = 0x00a2e514
[0.00]  memory.cnt  = 0x1
[0.00]  memory[0x0] [0x8020-0x9fff], 0x1fe0 bytes flags: 0x0
[0.00]  reserved.cnt  = 0x2
[0.00]  reserved[0x0]   [0x8020-0x80c2dfeb], 0x00a2dfec bytes flags: 0x0
[0.00]  reserved[0x1]   [0x80e0-0x80e00527], 0x0528 bytes flags: 0x0

Fixes: 671f9a3e2e24 ("RISC-V: Setup initial page tables in two stages")
Signed-off-by: Albert Ou 
---

Changes in v2:
* Changed variable identifier to dtb_early_pa per reviewer feedback.
* Removed whitespace change.

 arch/riscv/mm/init.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index f0ba713..83f7d12 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -82,6 +83,8 @@ static void __init setup_initrd(void)
 }
 #endif /* CONFIG_BLK_DEV_INITRD */
 
+static phys_addr_t dtb_early_pa __initdata;
+
 void __init setup_bootmem(void)
 {
struct memblock_region *reg;
@@ -117,7 +120,12 @@ void __init setup_bootmem(void)
setup_initrd();
 #endif /* CONFIG_BLK_DEV_INITRD */
 
-   early_init_fdt_reserve_self();
+   /*
+* Avoid using early_init_fdt_reserve_self() since __pa() does
+* not work for DTB pointers that are fixmap addresses
+*/
+   memblock_reserve(dtb_early_pa, fdt_totalsize(dtb_early_va));
+
early_init_fdt_scan_reserved_mem();
memblock_allow_resize();
memblock_dump_all();
@@ -393,6 +401,8 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
 
/* Save pointer to DTB for early FDT parsing */
dtb_early_va = (void *)fix_to_virt(FIX_FDT) + (dtb_pa & ~PAGE_MASK);
+   /* Save physical address for memblock reservation */
+   dtb_early_pa = dtb_pa;
 }
 
 static void __init setup_vm_final(void)
-- 
2.7.4



Re: [PATCH] namespace: fix namespace.pl script to support relative paths

2019-09-27 Thread Randy Dunlap


re: 
https://lore.kernel.org/lkml/20190129204319.15238-1-jacob.e.kel...@intel.com/

Did anything happen with this patch?

Please send it to linux-kbu...@vger.kernel.org and
Cc: Masahiro Yamada 

You can also add:
Acked-by: Randy Dunlap 
Tested-by: Randy Dunlap 


I was just about to fix this script but I decided to first see if anyone else
had already done so.  Thanks.

-- 
~Randy


[PATCH] virt: vbox: fix memory leak in hgcm_call_preprocess_linaddr

2019-09-27 Thread Navid Emamdoost
In hgcm_call_preprocess_linaddr memory is allocated for bounce_buf but
is not released if copy_from_user fails. The release is added.

Fixes: 579db9d45cb4 ("virt: Add vboxguest VMMDEV communication code")

Signed-off-by: Navid Emamdoost 
---
 drivers/virt/vboxguest/vboxguest_utils.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/virt/vboxguest/vboxguest_utils.c b/drivers/virt/vboxguest/vboxguest_utils.c
index 75fd140b02ff..7965885a50fa 100644
--- a/drivers/virt/vboxguest/vboxguest_utils.c
+++ b/drivers/virt/vboxguest/vboxguest_utils.c
@@ -222,8 +222,10 @@ static int hgcm_call_preprocess_linaddr(
 
if (copy_in) {
ret = copy_from_user(bounce_buf, (void __user *)buf, len);
-   if (ret)
+   if (ret) {
+   kvfree(bounce_buf);
return -EFAULT;
+   }
} else {
memset(bounce_buf, 0, len);
}
-- 
2.17.1
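The bug class here is generic: a buffer allocated at the top of a function must be freed on every early-error return. A userspace analogue of the fixed flow (plain `malloc`/`free` standing in for `kvmalloc`/`kvfree`, and a hypothetical fallible `copy_in_data()` standing in for `copy_from_user()`):

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for copy_from_user(): fails when the source is NULL. */
static int copy_in_data(void *dst, const void *src, size_t len)
{
	if (!src)
		return -1;
	memcpy(dst, src, len);
	return 0;
}

/* Mirrors the fixed hgcm_call_preprocess_linaddr() flow: on a failed
 * copy, release the bounce buffer before returning -EFAULT. */
static int preprocess(const void *buf, size_t len, int copy_in, void **out)
{
	void *bounce_buf = malloc(len);

	if (!bounce_buf)
		return -ENOMEM;

	if (copy_in) {
		if (copy_in_data(bounce_buf, buf, len)) {
			free(bounce_buf);	/* the leak the patch plugs */
			return -EFAULT;
		}
	} else {
		memset(bounce_buf, 0, len);
	}
	*out = bounce_buf;
	return 0;
}
```

Before the fix, the `-EFAULT` return skipped the `free()`, so a failed copy leaked one buffer per call.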



Re: [GIT PULL] 9p updates for 5.4

2019-09-27 Thread pr-tracker-bot
The pull request you sent on Fri, 27 Sep 2019 16:27:26 +0200:

> https://github.com/martinetd/linux tags/9p-for-5.4

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/9977b1a71488742606376c09e19e0074e4403cdf

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker


Re: [GIT PULL] add virtio-fs

2019-09-27 Thread pr-tracker-bot
The pull request you sent on Thu, 26 Sep 2019 10:43:40 +0200:

> git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git tags/virtio-fs-5.4

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/8f744bdee4fefb17fac052c7418b830de2b59ac8

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker


Re: [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-09-27 Thread Mike Kravetz
On 9/27/19 3:51 PM, Mina Almasry wrote:
> On Fri, Sep 27, 2019 at 2:59 PM Mike Kravetz  wrote:
>>
>> On 9/26/19 5:55 PM, Mina Almasry wrote:
>>> Provided we keep the existing controller untouched, should the new
>>> controller track:
>>>
>>> 1. only reservations, or
>>> 2. both reservations and allocations for which no reservations exist
>>> (such as the MAP_NORESERVE case)?
>>>
>>> I like the 'both' approach. Seems to me a counter like that would work
>>> automatically regardless of whether the application is allocating
>>> hugetlb memory with NORESERVE or not. NORESERVE allocations cannot cut
>>> into reserved hugetlb pages, correct?
>>
>> Correct.  One other easy way to allocate huge pages without reserves
>> (that I know is used today) is via the fallocate system call.
>>
>>>   If so, then applications that
>>> allocate with NORESERVE will get sigbused when they hit their limit,
>>> and applications that allocate without NORESERVE may get an error at
>>> mmap time but will always be within their limits while they access the
>>> mmap'd memory, correct?
>>
>> Correct.  At page allocation time we can easily check to see if a reservation
>> exists and not charge.  For any specific page within a hugetlbfs file,
>> a charge would happen at mmap time or allocation time.
>>
>> One exception (that I can think of) to this mmap(RESERVE) will not cause
>> a SIGBUS rule is in the case of hole punch.  If someone punches a hole in
>> a file, not only do they remove pages associated with the file but the
>> reservation information as well.  Therefore, a subsequent fault will be
>> the same as an allocation without reservation.
>>
> 
> I don't think it causes a sigbus. This is the scenario, right:
> 
> 1. Make cgroup with limit X bytes.
> 2. Task in cgroup mmaps a file with X bytes, causing the cgroup to get charged
> 3. A hole of size Y is punched in the file, causing the cgroup to get
> uncharged Y bytes.
> 4. The task faults in memory from the hole, getting charged up to Y
> bytes again. But they will be still within their limits.
> 
> IIUC userspace only gets sigbus'd if the limit is lowered between
> steps 3 and 4, and it's ok if it gets sigbus'd there in my opinion.

You are correct.  That was my mistake.  I was still thinking of behavior
for a reservation only cgroup model.  It would behave as you describe
above (no SIGBUS) for a combined reservation/allocate model.
-- 
Mike Kravetz


Re: [PATCH v2 0/3] Add support for SBI v0.2

2019-09-27 Thread Atish Patra
On Fri, 2019-09-27 at 15:19 -0700, Christoph Hellwig wrote:
> On Thu, Sep 26, 2019 at 05:09:12PM -0700, Atish Patra wrote:
> > The Supervisor Binary Interface(SBI) specification[1] now defines a
> > base extension that provides extendability to add future extensions
> > while maintaining backward compatibility with previous versions.
> > The new version is defined as 0.2 and older version is marked as
> > 0.1.
> > 
> > This series adds support v0.2 and a unified calling convention
> > implementation between 0.1 and 0.2. It also adds minimal SBI
> > functions
> > from 0.2 as well to keep the series lean. 
> 
> So before we do this game can be please make sure we have a clean 0.2
> environment that never uses the legacy extensions as discussed
> before?
> Without that all this work is rather futile.
> 

As per our discussion offline, here are the things that need to be done to
achieve that.

1. Replace timer, sfence and ipi with better alternative APIs
- sbi_set_timer will be same but with new calling convention
- send_ipi and sfence_* apis can be modified in such a way that
- we don't have to use unprivileged load anymore
- Make it scalable

2. Drop clear_ipi, console, and shutdown in 0.2.

We will have a new kernel config (LEGACY_SBI) that can be manually
enabled if older firmware need to be used. By default, LEGACY_SBI will
be disabled and kernel with new SBI will be built. We will have to set
a flag day in a year or so when we can remove the LEGACY_SBI
completely.

Let us know if it is not an acceptable approach to anybody.
I will post a RFC patch with new alternate v0.2 APIs sometime next
week.

> ___
> linux-riscv mailing list
> linux-ri...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv

-- 
Regards,
Atish


Re: [PATCH 2/4] rsicv: avoid sending a SIGTRAP to a user thread trapped in WARN()

2019-09-27 Thread Christoph Hellwig
Oh and s/rsicv/riscv/ in the subject, please.


Re: [PATCH] riscv: move sifive_l2_cache.c to drivers/soc

2019-09-27 Thread Christoph Hellwig
On Thu, Aug 22, 2019 at 06:26:35AM -0300, Mauro Carvalho Chehab wrote:
> Em Mon, 19 Aug 2019 08:26:19 +0200
> Christoph Hellwig  escreveu:
> 
> > On Mon, Aug 19, 2019 at 08:09:04AM +0200, Borislav Petkov wrote:
> > > On Sun, Aug 18, 2019 at 10:29:35AM +0200, Christoph Hellwig wrote:  
> > > > The sifive_l2_cache.c is in no way related to RISC-V architecture
> > > > memory management.  It is a little stub driver working around the fact
> > > > that the EDAC maintainers prefer their drivers to be structured in a
> > > > certain way  
> > > 
> > > That changed recently so I guess we can do the per-IP block driver after
> > > all, if people would still prefer it.  
> > 
> > That would seem like the best idea.  But I don't really know this code
> > well enough myself, and I really need to get this code out of the
> > forced on RISC-V codebase as some SOCs I'm working with simply don't
> > have the memory for it..
> > 
> > So unless someone signs up to do a per-IP block edac drivers instead
> > very quickly I'd still like to see something like this go into 5.4
> > for now.
> 
> I'm wandering if we should at least add an entry for this one at
> MAINTAINERS, pointing it to the EDAC mailing list. Something like:

Sounds fine.  Can you also ACK the patch with that, as Paul mention
in another thread he wants an EDAC ACK for it.


Re: [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-09-27 Thread Mina Almasry
On Fri, Sep 27, 2019 at 2:59 PM Mike Kravetz  wrote:
>
> On 9/26/19 5:55 PM, Mina Almasry wrote:
> > Provided we keep the existing controller untouched, should the new
> > controller track:
> >
> > 1. only reservations, or
> > 2. both reservations and allocations for which no reservations exist
> > (such as the MAP_NORESERVE case)?
> >
> > I like the 'both' approach. Seems to me a counter like that would work
> > automatically regardless of whether the application is allocating
> > hugetlb memory with NORESERVE or not. NORESERVE allocations cannot cut
> > into reserved hugetlb pages, correct?
>
> Correct.  One other easy way to allocate huge pages without reserves
> (that I know is used today) is via the fallocate system call.
>
> >   If so, then applications that
> > allocate with NORESERVE will get sigbused when they hit their limit,
> > and applications that allocate without NORESERVE may get an error at
> > mmap time but will always be within their limits while they access the
> > mmap'd memory, correct?
>
> Correct.  At page allocation time we can easily check to see if a reservation
> exists and not charge.  For any specific page within a hugetlbfs file,
> a charge would happen at mmap time or allocation time.
>
> One exception (that I can think of) to this mmap(RESERVE) will not cause
> a SIGBUS rule is in the case of hole punch.  If someone punches a hole in
> a file, not only do they remove pages associated with the file but the
> reservation information as well.  Therefore, a subsequent fault will be
> the same as an allocation without reservation.
>

I don't think it causes a sigbus. This is the scenario, right:

1. Make cgroup with limit X bytes.
2. Task in cgroup mmaps a file with X bytes, causing the cgroup to get charged
3. A hole of size Y is punched in the file, causing the cgroup to get
uncharged Y bytes.
4. The task faults in memory from the hole, getting charged up to Y
bytes again. But they will be still within their limits.

IIUC userspace only gets sigbus'd if the limit is lowered between
steps 3 and 4, and it's ok if it gets sigbus'd there in my opinion.

> I 'think' the code to remove/truncate a file will work correctly as it
> is today, but I need to think about this some more.
>
> > mmap'd memory, correct? So the 'both' counter seems like a one size
> > fits all.
> >
> > I think the only sticking point left is whether an added controller
> > can support both cgroup-v2 and cgroup-v1. If I could get confirmation
> > on that I'll provide a patchset.
>
> Sorry, but I can not provide cgroup expertise.
> --
> Mike Kravetz


Re: [PATCH 4/4] riscv: remove the switch statement in do_trap_break()

2019-09-27 Thread Christoph Hellwig
On Mon, Sep 23, 2019 at 08:45:17AM +0800, Vincent Chen wrote:
> To make the code more straightforward, replacing the switch statement
> with if statement.
> 
> Suggested-by: Paul Walmsley 
> Signed-off-by: Vincent Chen 
> ---
>  arch/riscv/kernel/traps.c | 23 ---
>  1 file changed, 12 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
> index dd13bc90aeb6..1ac75f7d0bff 100644
> --- a/arch/riscv/kernel/traps.c
> +++ b/arch/riscv/kernel/traps.c
> @@ -124,23 +124,24 @@ static inline unsigned long 
> get_break_insn_length(unsigned long pc)
>  
>  asmlinkage void do_trap_break(struct pt_regs *regs)
>  {
> + if (user_mode(regs)) {
> + force_sig_fault(SIGTRAP, TRAP_BRKPT,
> + (void __user *)(regs->sepc));
> + return;
> + }
> +#ifdef CONFIG_GENERIC_BUG
> + {
>   enum bug_trap_type type;
>  
>   type = report_bug(regs->sepc, regs);
> + if (type == BUG_TRAP_TYPE_WARN) {
>   regs->sepc += get_break_insn_length(regs->sepc);
>   return;
>   }
> +#endif /* CONFIG_GENERIC_BUG */
> +
> + die(regs, "Kernel BUG");

I like where this is going, but I think this can be improved further
given that fact that report_bug has a nice stub for the
!CONFIG_GENERIC_BUG case.

How about:

if (user_mode(regs))
force_sig_fault(SIGTRAP, TRAP_BRKPT, (void __user *)regs->sepc);
else if (report_bug(regs->sepc, regs) == BUG_TRAP_TYPE_WARN)
regs->sepc += get_break_insn_length(regs->sepc);
else
die(regs, "Kernel BUG");


Re: [GIT PULL] nfsd changes for 5.4

2019-09-27 Thread Linus Torvalds
On Fri, Sep 27, 2019 at 1:08 PM J. Bruce Fields  wrote:
>
> Please pull nfsd changes from:
>
>   git://linux-nfs.org/~bfields/linux.git tags/nfsd-5.4

Hmm. I started to pull, but then realized that the description
doesn't match the content at all.

You say:

> Highlights:
>
> - add a new knfsd file cache, so that we don't have to open and
>   close on each (NFSv2/v3) READ or WRITE.  This can speed up
>   read and write in some cases.  It also replaces our readahead
>   cache.
> - Prevent silent data loss on write errors, by treating write
>   errors like server reboots for the purposes of write caching,
>   thus forcing clients to resend their writes.
> - Tweak the code that allocates sessions to be more forgiving,
>   so that NFSv4.1 mounts are less likely to hang when a server
>   already has a lot of clients.
> - Eliminate an arbitrary limit on NFSv4 ACL sizes; they should
>   now be limited only by the backend filesystem and the
>   maximum RPC size.
> - Allow the server to enforce use of the correct kerberos
>   credentials when a client reclaims state after a reboot.
>
> And some miscellaneous smaller bugfixes and cleanup.

But then the actual code is just one single small commit:

> Dave Wysochanski (1):
>   SUNRPC: Track writers of the 'channel' file to improve cache_listeners_exist

which doesn't actually match any of the things your description says
should be there.

So I undid my pull - either the description is completely wrong, or
you tagged the wrong commit.

Please double-check,

 Linus


Re: [PATCH 3/4] riscv: Correct the handling of unexpected ebreak in do_trap_break()

2019-09-27 Thread Christoph Hellwig
Looks ok:

Reviewed-by: Christoph Hellwig 


Re: [PATCH] compiler: enable CONFIG_OPTIMIZE_INLINING forcibly

2019-09-27 Thread Linus Torvalds
On Fri, Sep 27, 2019 at 3:08 PM Nick Desaulniers
 wrote:
>
> So get_user() was passed a bad value/pointer from userspace? Do you
> know which of the tree calls to get_user() from sock_setsockopt() is
> failing?  (It's not immediately clear to me how this patch is at
> fault, vs there just being a bug in the source somewhere).

Based on the error messages, the SO_PASSCRED ones are almost certainly
from the get_user() in net/core/sock.c: sock_setsockopt(), which just
does

if (optlen < sizeof(int))
return -EINVAL;

if (get_user(val, (int __user *)optval))
return -EFAULT;

valbool = val ? 1 : 0;

but the other messages imply that a lot of other cases are
failing too (ie the "Failed to bind netlink socket" is, according to
google, a bind() that fails with the same EFAULT error).

There are probably even more failures that happen elsewhere and just
don't even syslog the fact. I'd guess that all get_user() calls just
fail, and those are the ones that happen to get printed out.

Now, _why_ it would fail, I have no idea. There are several inlines in
the arm uaccess.h file, and it depends on other headers like
<asm/domain.h> with more inlines still - eg get/set_domain() etc.

Some of that code is pretty subtle. They have fixed register usage
(but the asm macros actually check them). And the inline asms clobber
the link register, but they do seem to clearly _state_ that they
clobber it, so who knows.

Just based on the EFAULT, I'd _guess_ that it's some interaction with
the domain access control register (so that get/set_domain() thing).
But I'm not even sure that code is enabled for the Rpi2, so who
knows..

   Linus


[PATCH] net/mlx5: fix memory leak in mlx5_fw_fatal_reporter_dump

2019-09-27 Thread Navid Emamdoost
In mlx5_fw_fatal_reporter_dump if mlx5_crdump_collect fails the
allocated memory for cr_data must be released otherwise there will be
memory leak. To fix this, this commit changes the return instruction
into goto error handling.

Fixes: 9b1f29823605 ("net/mlx5: Add support for FW fatal reporter dump")

Signed-off-by: Navid Emamdoost 
---
 drivers/net/ethernet/mellanox/mlx5/core/health.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/health.c b/drivers/net/ethernet/mellanox/mlx5/core/health.c
index d685122d9ff7..c07f3154437c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/health.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/health.c
@@ -572,7 +572,7 @@ mlx5_fw_fatal_reporter_dump(struct devlink_health_reporter *reporter,
return -ENOMEM;
err = mlx5_crdump_collect(dev, cr_data);
if (err)
-   return err;
+   goto free_data;
 
if (priv_ctx) {
struct mlx5_fw_reporter_ctx *fw_reporter_ctx = priv_ctx;
-- 
2.17.1
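This is the standard kernel error-unwind shape: a single exit label releases everything allocated so far, and every failure after the allocation jumps to it instead of returning directly. A compact standalone sketch of that shape (the hypothetical `collect()` stands in for `mlx5_crdump_collect()`):

```c
#include <errno.h>
#include <stdlib.h>

/* Stand-in for mlx5_crdump_collect(): fails when told to. */
static int collect(int should_fail)
{
	return should_fail ? -EIO : 0;
}

/* Every error path after the allocation funnels through free_data, so
 * cr_data cannot leak no matter where the function bails out. */
static int reporter_dump(int fail_collect, size_t size)
{
	void *cr_data = malloc(size);
	int err;

	if (!cr_data)
		return -ENOMEM;

	err = collect(fail_collect);
	if (err)
		goto free_data;	/* was a bare "return err" - the leak */

	/* ... report cr_data here ... */

free_data:
	free(cr_data);
	return err;
}
```

On success, `err` is 0 and the function falls through the same label, so the buffer is freed exactly once on every path.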



Re: [PATCH v5 4/7] hugetlb: disable region_add file_region coalescing

2019-09-27 Thread Mina Almasry
On Fri, Sep 27, 2019 at 2:44 PM Mike Kravetz  wrote:
>
> On 9/19/19 3:24 PM, Mina Almasry wrote:
> > A follow up patch in this series adds hugetlb cgroup uncharge info to the
> > file_region entries in resv->regions. The cgroup uncharge info may
> > differ for different regions, so they can no longer be coalesced at
> > region_add time. So, disable region coalescing in region_add in this
> > patch.
> >
> > Behavior change:
> >
> > Say a resv_map exists like this [0->1], [2->3], and [5->6].
> >
> > Then a region_chg/add call comes in region_chg/add(f=0, t=5).
> >
> > Old code would generate resv->regions: [0->5], [5->6].
> > New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5],
> > [5->6].
> >
> > Special care needs to be taken to handle the resv->adds_in_progress
> > variable correctly. In the past, only 1 region would be added for every
> > region_chg and region_add call. But now, each call may add multiple
> > regions, so we can no longer increment adds_in_progress by 1 in region_chg,
> > or decrement adds_in_progress by 1 after region_add or region_abort. Instead,
> > region_chg calls add_reservation_in_range() to count the number of regions
> > needed and allocates those, and that info is passed to region_add and
> > region_abort to decrement adds_in_progress correctly.
> >
> > Signed-off-by: Mina Almasry 
> >
> > ---
> >  mm/hugetlb.c | 273 +--
> >  1 file changed, 158 insertions(+), 115 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index bac1cbdd027c..d03b048084a3 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -244,6 +244,12 @@ struct file_region {
> >   long to;
> >  };
> >
> > +/* Helper that removes a struct file_region from the resv_map cache and returns
> > + * it for use.
> > + */
> > +static struct file_region *
> > +get_file_region_entry_from_cache(struct resv_map *resv, long from, long to);
> > +
>
> Instead of the forward declaration, just put the function here.
>
> >  /* Must be called with resv->lock held. Calling this with count_only == true
> >   * will count the number of pages to be added but will not modify the linked
> >   * list.
> > @@ -251,51 +257,61 @@ struct file_region {
> >  static long add_reservation_in_range(struct resv_map *resv, long f, long t,
> >bool count_only)
> >  {
> > - long chg = 0;
> > + long add = 0;
> >   struct list_head *head = >regions;
> > + long last_accounted_offset = f;
> >   struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;
> >
> > - /* Locate the region we are before or in. */
> > - list_for_each_entry (rg, head, link)
> > - if (f <= rg->to)
> > - break;
> > -
> > - /* Round our left edge to the current segment if it encloses us. */
> > - if (f > rg->from)
> > - f = rg->from;
> > -
> > - chg = t - f;
> > + /* In this loop, we essentially handle an entry for the range
> > +  * last_accounted_offset -> rg->from, at every iteration, with some
> > +  * bounds checking.
> > +  */
> > + list_for_each_entry_safe(rg, trg, head, link) {
> > + /* Skip irrelevant regions that start before our range. */
> > + if (rg->from < f) {
> > + /* If this region ends after the last accounted 
> > offset,
> > +  * then we need to update last_accounted_offset.
> > +  */
> > + if (rg->to > last_accounted_offset)
> > + last_accounted_offset = rg->to;
> > + continue;
> > + }
> >
> > - /* Check for and consume any regions we now overlap with. */
> > - nrg = rg;
> > - list_for_each_entry_safe (rg, trg, rg->link.prev, link) {
> > - if (>link == head)
> > - break;
> > + /* When we find a region that starts beyond our range, we've
> > +  * finished.
> > +  */
> >   if (rg->from > t)
> >   break;
> >
> > - /* We overlap with this area, if it extends further than
> > -  * us then we must extend ourselves.  Account for its
> > -  * existing reservation.
> > + /* Add an entry for last_accounted_offset -> rg->from, and
> > +  * update last_accounted_offset.
> >*/
> > - if (rg->to > t) {
> > - chg += rg->to - t;
> > - t = rg->to;
> > + if (rg->from > last_accounted_offset) {
> > + add += rg->from - last_accounted_offset;
> > + if (!count_only) {
> > + nrg = get_file_region_entry_from_cache(
> > + resv, last_accounted_offset, 
> > rg->from);
> > + list_add(>link, rg->link.prev);
> > +  

Re: [PATCH 2/4] riscv: avoid sending a SIGTRAP to a user thread trapped in WARN()


2019-09-27 Thread Christoph Hellwig
On Mon, Sep 23, 2019 at 08:45:15AM +0800, Vincent Chen wrote:
> On RISC-V, when the kernel runs code on behalf of a user thread, and the
> kernel executes a WARN() or WARN_ON(), the user thread will be sent
> a bogus SIGTRAP.  Fix the RISC-V kernel code to not send a SIGTRAP when
> a WARN()/WARN_ON() is executed.
> 
> Signed-off-by: Vincent Chen 

Looks good,

Reviewed-by: Christoph Hellwig 


Re: [alsa-devel] [PATCH v2] ASoC: Intel: Skylake: prevent memory leak in snd_skl_parse_uuids

2019-09-27 Thread Pierre-Louis Bossart

On 9/27/19 3:39 PM, Andy Shevchenko wrote:

On Fri, Sep 27, 2019 at 7:39 PM Pierre-Louis Bossart
 wrote:

The problem with solution #1 is freeing orphaned pointer. It will work,
but it's simple is not okay from object life time prospective.


?? I don't get your point at all Andy.
Two allocations happen in a loop, and if the second fails, you free the
first and then jump to free everything allocated in the previous
iterations. What am I missing?


Two things:
  - one allocation is done with kzalloc(), while the other one with
devm_kcalloc()
  - due to above the ordering of resources is reversed


Ah yes, I see your point now, sorry for being thick.
Indeed it'd make sense to use devm_ for both allocations, but then the 
kfree needs to be removed in the error handling.




Re: [PATCH 1/4] riscv: avoid kernel hangs when trapped in BUG()

2019-09-27 Thread Christoph Hellwig
On Mon, Sep 23, 2019 at 08:45:14AM +0800, Vincent Chen wrote:
> When the CONFIG_GENERIC_BUG is disabled by disabling CONFIG_BUG, if a
> kernel thread is trapped by BUG(), the whole system will be in the
> loop that infinitely handles the ebreak exception instead of entering the
> die function. To fix this problem, the do_trap_break() will always call
> the die() to deal with the break exception as the type of break is
> BUG_TRAP_TYPE_BUG.
> 
> Signed-off-by: Vincent Chen 

Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH] RISC-V: Clear load reservations while restoring hart contexts

2019-09-27 Thread Christoph Hellwig
On Tue, Sep 24, 2019 at 05:15:56PM -0700, Palmer Dabbelt wrote:
> This is almost entirely a comment.  The bug is unlikely to manifest on
> existing hardware because there is a timeout on load reservations, but
> manifests on QEMU because there is no timeout.

Given how many different lines of RISC-V hardware are around how do
you know it doesn't affect any existing hardware?

Except for that the patch looks fine to me:

Reviewed-by: Christoph Hellwig 


Re: [GIT PULL] RISC-V additional updates for v5.4-rc1

2019-09-27 Thread Christoph Hellwig
On Fri, Sep 27, 2019 at 11:25:13AM -0700, Paul Walmsley wrote:
> Atish Patra (1):
>   RISC-V: Export kernel symbols for kvm

None of these have any current users, they should go in with the kvm
series once the virtualization spec has been finalized.


Re: [PATCH v2 3/3] RISC-V: Move SBI related macros under uapi.

2019-09-27 Thread Christoph Hellwig
On Thu, Sep 26, 2019 at 05:09:15PM -0700, Atish Patra wrote:
> All SBI related macros can be reused by KVM RISC-V and userspace tools
> such as kvmtool, qemu-kvm. SBI calls can also be emulated by userspace
> if required. Any future vendor extensions can leverage this to emulate
> the specific extension in userspace instead of kernel.

Just because userspace can use them that doesn't mean they are a
userspace API.  Please don't do this as this limits how we can ever
remove previously existing symbols.  Just copy over the current
version of the file into the other project of your choice instead
of creating an API we need to maintain.


Re: [PATCH v2 1/3] RISC-V: Mark existing SBI as 0.1 SBI.

2019-09-27 Thread Christoph Hellwig
On Thu, Sep 26, 2019 at 05:09:13PM -0700, Atish Patra wrote:
> -#define SBI_CALL(which, arg0, arg1, arg2, arg3) ({   \
> +#define SBI_CALL(which, arg0, arg1, arg2, arg3) ({ \

Spurious whitespace change.


Re: [PATCH v2 0/3] Add support for SBI v0.2

2019-09-27 Thread Christoph Hellwig
On Thu, Sep 26, 2019 at 05:09:12PM -0700, Atish Patra wrote:
> The Supervisor Binary Interface(SBI) specification[1] now defines a
> base extension that provides extendability to add future extensions
> while maintaining backward compatibility with previous versions.
> The new version is defined as 0.2 and older version is marked as 0.1.
> 
> This series adds support v0.2 and a unified calling convention
> implementation between 0.1 and 0.2. It also adds minimal SBI functions
> from 0.2 as well to keep the series lean. 

So before we play this game, can we please make sure we have a clean 0.2
environment that never uses the legacy extensions, as discussed before?
Without that all this work is rather futile.


Re: [PATCH] mm/page_alloc: fix a crash in free_pages_prepare()

2019-09-27 Thread Alexander Duyck
On Fri, Sep 27, 2019 at 2:59 PM Andrew Morton  wrote:
>
> On Fri, 27 Sep 2019 17:28:06 -0400 Qian Cai  wrote:
>
> > >
> > > So I think you've moved the arch_free_page() to be after the final
> > > thing which can access page contents, yes?  If so, we should have a
> > > comment in free_pages_prepare() to attempt to prevent this problem from
> > > reoccurring as the code evolves?
> >
> > Right, something like this above arch_free_page() there?
> >
> > /*
> >  * It needs to be just above kernel_map_pages(), as s390 could mark those
> >  * pages unused and then trigger a fault when accessing.
> >  */
>
> I did this.
>
> --- a/mm/page_alloc.c~mm-page_alloc-fix-a-crash-in-free_pages_prepare-fix
> +++ a/mm/page_alloc.c
> @@ -1179,7 +1179,13 @@ static __always_inline bool free_pages_p
> kernel_init_free_pages(page, 1 << order);
>
> kernel_poison_pages(page, 1 << order, 0);
> +   /*
> +* arch_free_page() can make the page's contents inaccessible.  s390
> +* does this.  So nothing which can access the page's contents should
> +* happen after this.
> +*/
> arch_free_page(page, order);
> +
> if (debug_pagealloc_enabled())
> kernel_map_pages(page, 1 << order, 0);
>

So the question I would have is what is the state of the page after it
has been marked unused and then pulled back in? I'm assuming it will
be all 0s.

I know with the work I am still doing on marking pages as unused this
ends up being an additional item that we will need to pay attention
to, however in our case we will just be initializing the page as zero
if we end up evicting it from the guest.


Re: [PATCH v3 2/5] dt-bindings: mfd: Document the Xylon LogiCVC multi-function device

2019-09-27 Thread Rob Herring
On Fri, Sep 27, 2019 at 12:04:04PM +0200, Paul Kocialkowski wrote:
> The LogiCVC is a display engine which also exposes GPIO functionality.
> For this reason, it is described as a multi-function device that is expected
> to provide register access to its children nodes for gpio and display.
> 
> Signed-off-by: Paul Kocialkowski 
> ---
>  .../bindings/mfd/xylon,logicvc.yaml   | 50 +++
>  1 file changed, 50 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/mfd/xylon,logicvc.yaml
> 
> diff --git a/Documentation/devicetree/bindings/mfd/xylon,logicvc.yaml 
> b/Documentation/devicetree/bindings/mfd/xylon,logicvc.yaml
> new file mode 100644
> index ..abc9937506e0
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/mfd/xylon,logicvc.yaml
> @@ -0,0 +1,50 @@
> +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
> +# Copyright 2019 Bootlin
> +%YAML 1.2
> +---
> +$id: "http://devicetree.org/schemas/mfd/xylon,logicvc.yaml#"
> +$schema: "http://devicetree.org/meta-schemas/core.yaml#"
> +
> +title: Xylon LogiCVC multi-function device
> +
> +maintainers:
> +  - Paul Kocialkowski 
> +
> +description: |
> +  The LogiCVC is a display controller that also contains a GPIO controller.
> +  As a result, a multi-function device is exposed as parent of the display
> +  and GPIO blocks.
> +
> +properties:
> +  compatible:
> +items:
> +  - enum:
> +  - xylon,logicvc-3.02.a
> +  - const: syscon
> +  - const: simple-mfd
> +
> +  reg:
> +maxItems: 1
> +
> +select:
> +  properties:
> +compatible:
> +  contains:
> +enum:
> +  - xylon,logicvc-3.02.a

I've seen a couple of these with 'syscon' today, so I fixed the schema 
tool to just exclude 'syscon' and 'simple-mfd' from the generated 
'select'. So you can drop select now.

Reviewed-by: Rob Herring 

> +
> +  required:
> +- compatible
> +
> +required:
> +  - compatible
> +  - reg
> +
> +examples:
> +  - |
> +logicvc: logicvc@43c0 {
> +  compatible = "xylon,logicvc-3.02.a", "syscon", "simple-mfd";
> +  reg = <0x43c0 0x6000>;
> +  #address-cells = <1>;
> +  #size-cells = <1>;
> +};
> -- 
> 2.23.0
> 


Re: [PATCH v3 4/5] dt-bindings: gpio: Document the Xylon LogiCVC GPIO controller

2019-09-27 Thread Rob Herring
On Fri, 27 Sep 2019 12:04:06 +0200, Paul Kocialkowski wrote:
> The Xylon LogiCVC display controller exports some GPIOs, which are
> exposed as a separate entity.
> 
> Signed-off-by: Paul Kocialkowski 
> ---
>  .../bindings/gpio/xylon,logicvc-gpio.yaml | 69 +++
>  1 file changed, 69 insertions(+)
>  create mode 100644 
> Documentation/devicetree/bindings/gpio/xylon,logicvc-gpio.yaml
> 

Reviewed-by: Rob Herring 


Re: [PATCH v6 5/5] pinctrl: mediatek: Add support for pin configuration dump via debugfs.

2019-09-27 Thread Sean Wang
Hi,

On Thu, Sep 26, 2019 at 10:02 PM Light Hsieh  wrote:
>
> Add support for pin configuration dump via catting
> /sys/kernel/debug/pinctrl/$platform_dependent_path/pinconf-pins.
> pinctrl framework already supports such a dump. This patch implements the
> operation function pointer to fulfill this dump.
>

Here are missing tags too.

> ---
>  drivers/pinctrl/mediatek/pinctrl-paris.c | 88 
> 
>  drivers/pinctrl/mediatek/pinctrl-paris.h | 30 +++
>  2 files changed, 118 insertions(+)
>
> diff --git a/drivers/pinctrl/mediatek/pinctrl-paris.c 
> b/drivers/pinctrl/mediatek/pinctrl-paris.c
> index 2a47c45..f531908 100644
> --- a/drivers/pinctrl/mediatek/pinctrl-paris.c
> +++ b/drivers/pinctrl/mediatek/pinctrl-paris.c
> @@ -538,12 +538,99 @@ static int mtk_pctrl_get_group_pins(struct pinctrl_dev 
> *pctldev,
> return 0;
>  }
>
> +int mtk_hw_get_value_wrap(struct mtk_pinctrl *hw, unsigned int gpio, int 
> field)
> +{
> +   const struct mtk_pin_desc *desc;
> +   int value, err;
> +
> +   if (gpio > hw->soc->npins)
> +   return -EINVAL;
> +
> +   desc = (const struct mtk_pin_desc *)>soc->pins[gpio];
> +
> +   err = mtk_hw_get_value(hw, desc, field, );
> +   if (err)
> +   return err;
> +
> +   return value;
> +}
> +
> +ssize_t mtk_pctrl_show_one_pin(struct mtk_pinctrl *hw,
> +   unsigned int gpio, char *buf, unsigned int bufLen)
> +{
> +   const struct mtk_pin_desc *desc;
> +   int pinmux, pullup, pullen, r1 = -1, r0 = -1, len = 0;

Sort the variable declarations in reverse xmas tree order.

> +
> +   if (gpio > hw->soc->npins)
> +   return -EINVAL;
> +
> +   desc = (const struct mtk_pin_desc *)>soc->pins[gpio];
> +   pinmux = mtk_pctrl_get_pinmux(hw, gpio);
> +   if (pinmux >= hw->soc->nfuncs)
> +   pinmux -= hw->soc->nfuncs;
> +
> +   mtk_pinconf_bias_get_combo(hw, desc, , );
> +   if (pullen == MTK_PUPD_SET_R1R0_00) {
> +   pullen = 0;
> +   r1 = 0;
> +   r0 = 0;
> +   } else if (pullen == MTK_PUPD_SET_R1R0_01) {
> +   pullen = 1;
> +   r1 = 0;
> +   r0 = 1;
> +   } else if (pullen == MTK_PUPD_SET_R1R0_10) {
> +   pullen = 1;
> +   r1 = 1;
> +   r0 = 0;
> +   } else if (pullen == MTK_PUPD_SET_R1R0_11) {
> +   pullen = 1;
> +   r1 = 1;
> +   r0 = 1;
> +   } else if (pullen != MTK_DISABLE && pullen != MTK_ENABLE) {
> +   pullen = 0;
> +   }
> +   len += snprintf(buf + len, bufLen - len,
> +   "%03d: %1d%1d%1d%1d%02d%1d%1d%1d%1d",
> +   gpio,
> +   pinmux,
> +   mtk_pctrl_get_direction(hw, gpio),
> +   mtk_pctrl_get_out(hw, gpio),
> +   mtk_pctrl_get_in(hw, gpio),
> +   mtk_pctrl_get_driving(hw, gpio),
> +   mtk_pctrl_get_smt(hw, gpio),
> +   mtk_pctrl_get_ies(hw, gpio),
> +   pullen,
> +   pullup);
> +
> +   if (r1 != -1) {
> +   len += snprintf(buf + len, bufLen - len, " (%1d %1d)\n",
> +   r1, r0);
> +   } else {
> +   len += snprintf(buf + len, bufLen - len, "\n");
> +   }
> +
> +   return len;
> +}
> +
> +#define PIN_DBG_BUF_SZ 96
> +static void mtk_pctrl_dbg_show(struct pinctrl_dev *pctldev, struct seq_file 
> *s,
> + unsigned int gpio)
> +{
> +   struct mtk_pinctrl *hw = pinctrl_dev_get_drvdata(pctldev);
> +   char buf[PIN_DBG_BUF_SZ];
> +
> +   (void)mtk_pctrl_show_one_pin(hw, gpio, buf, PIN_DBG_BUF_SZ);
> +
> +   seq_printf(s, "%s", buf);
> +}
> +
>  static const struct pinctrl_ops mtk_pctlops = {
> .dt_node_to_map = mtk_pctrl_dt_node_to_map,
> .dt_free_map= pinctrl_utils_free_map,
> .get_groups_count   = mtk_pctrl_get_groups_count,
> .get_group_name = mtk_pctrl_get_group_name,
> .get_group_pins = mtk_pctrl_get_group_pins,
> +   .pin_dbg_show   = mtk_pctrl_dbg_show,
>  };
>
>  static int mtk_pmx_get_funcs_cnt(struct pinctrl_dev *pctldev)
> @@ -640,6 +727,7 @@ static int mtk_pconf_group_set(struct pinctrl_dev 
> *pctldev, unsigned group,
> .pin_config_get = mtk_pinconf_get,
> .pin_config_group_get   = mtk_pconf_group_get,
> .pin_config_group_set   = mtk_pconf_group_set,
> +   .is_generic = true,
>  };
>
>  static struct pinctrl_desc mtk_desc = {
> diff --git a/drivers/pinctrl/mediatek/pinctrl-paris.h 
> b/drivers/pinctrl/mediatek/pinctrl-paris.h
> index 3d43771..d73f4b6 100644
> --- a/drivers/pinctrl/mediatek/pinctrl-paris.h
> +++ b/drivers/pinctrl/mediatek/pinctrl-paris.h
> @@ -60,6 +60,36 @@
>  int mtk_paris_pinctrl_probe(struct 

Re: [PATCH] compiler: enable CONFIG_OPTIMIZE_INLINING forcibly

2019-09-27 Thread Nick Desaulniers
On Fri, Sep 27, 2019 at 3:43 AM Nicolas Saenz Julienne
 wrote:
>
> On Fri, 2019-08-30 at 12:43 +0900, Masahiro Yamada wrote:
> > Commit 9012d011660e ("compiler: allow all arches to enable
> > CONFIG_OPTIMIZE_INLINING") allowed all architectures to enable
> > this option. A couple of build errors were reported by randconfig,
> > but all of them have been ironed out.
> >
> > Towards the goal of removing CONFIG_OPTIMIZE_INLINING entirely
> > (and it will simplify the 'inline' macro in compiler_types.h),
> > this commit changes it to always-on option. Going forward, the
> > compiler will always be allowed to not inline functions marked
> > 'inline'.
> >
> > This is not a problem for x86 since it has been long used by
> > arch/x86/configs/{x86_64,i386}_defconfig.
> >
> > I am keeping the config option just in case any problem crops up for
> > other architectures.
> >
> > The code clean-up will be done after confirming this is solid.
> >
> > Signed-off-by: Masahiro Yamada 
>
> [ Adding some ARM people as they might be able to help ]
>
> This was found to cause a regression on a Raspberry Pi 2 B built with
> bcm2835_defconfig which among other things has no SMP support.
>
> The relevant logs (edited to remove the noise) are:
>
> [5.827333] Run /init as init process
> Loading, please wait...
> Failed to set SO_PASSCRED: Bad address
> Failed to bind netlink socket: Bad address
> Failed to create manager: Bad address
> Failed to set SO_PASSCRED: Bad address
> [9.021623] systemd[1]: SO_PASSCRED failed: Bad address
> [!!] Failed to start up manager.
> [9.079148] systemd[1]: Freezing execution.
>
> I looked into it, it turns out that the call to get_user() in 
> sock_setsockopt()
> is returning -EFAULT. Down the assembly rabbit hole that get_user() is I
> found-out that it's the macro 'check_uaccess' who's triggering the error.
>
> I'm clueless at this point, so I hope you can give me some hints on what's
> going bad here.

So get_user() was passed a bad value/pointer from userspace? Do you
know which of the three calls to get_user() from sock_setsockopt() is
failing?  (It's not immediately clear to me how this patch is at
fault, vs there just being a bug in the source somewhere).
-- 
Thanks,
~Nick Desaulniers


Re: [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-09-27 Thread Mike Kravetz
On 9/26/19 5:55 PM, Mina Almasry wrote:
> Provided we keep the existing controller untouched, should the new
> controller track:
> 
> 1. only reservations, or
> 2. both reservations and allocations for which no reservations exist
> (such as the MAP_NORESERVE case)?
> 
> I like the 'both' approach. Seems to me a counter like that would work
> automatically regardless of whether the application is allocating
> hugetlb memory with NORESERVE or not. NORESERVE allocations cannot cut
> into reserved hugetlb pages, correct?

Correct.  One other easy way to allocate huge pages without reserves
(that I know is used today) is via the fallocate system call.

>   If so, then applications that
> allocate with NORESERVE will get sigbused when they hit their limit,
> and applications that allocate without NORESERVE may get an error at
> mmap time but will always be within their limits while they access the
> mmap'd memory, correct?

Correct.  At page allocation time we can easily check to see if a reservation
exists and not charge.  For any specific page within a hugetlbfs file,
a charge would happen at mmap time or allocation time.

One exception (that I can think of) to this 'mmap(RESERVE) will not cause
a SIGBUS' rule is in the case of hole punch.  If someone punches a hole in
a file, not only do they remove pages associated with the file but the
reservation information as well.  Therefore, a subsequent fault will be
the same as an allocation without reservation.

I 'think' the code to remove/truncate a file will work correctly as it
is today, but I need to think about this some more.

> mmap'd memory, correct? So the 'both' counter seems like a one size
> fits all.
> 
> I think the only sticking point left is whether an added controller
> can support both cgroup-v2 and cgroup-v1. If I could get confirmation
> on that I'll provide a patchset.

Sorry, but I can not provide cgroup expertise.
-- 
Mike Kravetz


Re: [PATCH] mm/page_alloc: fix a crash in free_pages_prepare()

2019-09-27 Thread Andrew Morton
On Fri, 27 Sep 2019 17:28:06 -0400 Qian Cai  wrote:

> > 
> > So I think you've moved the arch_free_page() to be after the final
> > thing which can access page contents, yes?  If so, we should have a
> > comment in free_pages_prepare() to attempt to prevent this problem from
> > reoccurring as the code evolves?
> 
> Right, something like this above arch_free_page() there?
> 
> /*
>  * It needs to be just above kernel_map_pages(), as s390 could mark those
>  * pages unused and then trigger a fault when accessing.
>  */

I did this.

--- a/mm/page_alloc.c~mm-page_alloc-fix-a-crash-in-free_pages_prepare-fix
+++ a/mm/page_alloc.c
@@ -1179,7 +1179,13 @@ static __always_inline bool free_pages_p
kernel_init_free_pages(page, 1 << order);
 
kernel_poison_pages(page, 1 << order, 0);
+   /*
+* arch_free_page() can make the page's contents inaccessible.  s390
+* does this.  So nothing which can access the page's contents should
+* happen after this.
+*/
arch_free_page(page, order);
+
if (debug_pagealloc_enabled())
kernel_map_pages(page, 1 << order, 0);
 
_



Re: [PATCH v3] platform/x86: dell-laptop: disable kbd backlight on Inspiron 10xx

2019-09-27 Thread Pali Rohár
On Friday 27 September 2019 23:19:03 Pacien TRAN-GIRARD wrote:
> This patch adds a quirk disabling keyboard backlight support for the
> Dell Inspiron 1012 and 1018.
> 
> Those models wrongly report supporting keyboard backlight control
> features (through SMBIOS tokens) even though they're not equipped with
> a backlit keyboard. This led to broken controls being exposed
> through sysfs by this driver which froze the system when used.
> 
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=107651
> Signed-off-by: Pacien TRAN-GIRARD 
> ---

Thank you for the update! Now it is clear what the problem is,
and you can add my

Reviewed-by: Pali Rohár 

>  drivers/platform/x86/dell-laptop.c | 26 ++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/drivers/platform/x86/dell-laptop.c 
> b/drivers/platform/x86/dell-laptop.c
> index d27be2836bc2..74e988f839e8 100644
> --- a/drivers/platform/x86/dell-laptop.c
> +++ b/drivers/platform/x86/dell-laptop.c
> @@ -33,6 +33,7 @@
>  
>  struct quirk_entry {
>   bool touchpad_led;
> + bool kbd_led_not_present;
>   bool kbd_led_levels_off_1;
>   bool kbd_missing_ac_tag;
>  
> @@ -73,6 +74,10 @@ static struct quirk_entry quirk_dell_latitude_e6410 = {
>   .kbd_led_levels_off_1 = true,
>  };
>  
> +static struct quirk_entry quirk_dell_inspiron_1012 = {
> + .kbd_led_not_present = true,
> +};
> +
>  static struct platform_driver platform_driver = {
>   .driver = {
>   .name = "dell-laptop",
> @@ -310,6 +315,24 @@ static const struct dmi_system_id dell_quirks[] 
> __initconst = {
>   },
>   .driver_data = _dell_latitude_e6410,
>   },
> + {
> + .callback = dmi_matched,
> + .ident = "Dell Inspiron 1012",
> + .matches = {
> + DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
> + DMI_MATCH(DMI_PRODUCT_NAME, "Inspiron 1012"),
> + },
> + .driver_data = _dell_inspiron_1012,
> + },
> + {
> + .callback = dmi_matched,
> + .ident = "Dell Inspiron 1018",
> + .matches = {
> + DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
> + DMI_MATCH(DMI_PRODUCT_NAME, "Inspiron 1018"),
> + },
> + .driver_data = _dell_inspiron_1012,
> + },
>   { }
>  };
>  
> @@ -1493,6 +1516,9 @@ static void kbd_init(void)
>  {
>   int ret;
>  
> + if (quirks && quirks->kbd_led_not_present)
> + return;
> +
>   ret = kbd_init_info();
>   kbd_init_tokens();
>  

-- 
Pali Rohár
pali.ro...@gmail.com


Re: Do we need to correct barriering in circular-buffers.rst?

2019-09-27 Thread Nick Desaulniers
On Fri, Sep 27, 2019 at 1:43 PM Nick Desaulniers
 wrote:
>
> On Fri, Sep 27, 2019 at 5:49 AM Peter Zijlstra  wrote:
> > Barring LTO the above works for perf because of inter-translation-unit
> > function calls, which imply a compiler barrier.
> >
> > Now, when the compiler inlines, it loses that sync point (and thereby
> > subtlely changes semantics from the non-inline variant). I suspect LTO
> > does the same and can cause subtle breakage through this transformation.
>
> Do you have a bug report or godbolt link for the above?  I trust that
> you're familiar enough with the issue to be able to quickly reproduce
> it?  These descriptions of problems are difficult for me to picture in
> code or generated code, and when I try to read through
> memory-barriers.txt my eyes start to glaze over (then something else
> catches fire and I have to go put that out).  Having a concise test
> case I think would better illustrate potential issues with LTO that
> we'd then be able to focus on trying to fix/support.
>

Further, if we identified a case were the existing invariants were
broken under LTO, such a case could be added to
CONFIG_DEBUG_LOCKING_API_SELFTESTS (or whatever the most appropriate
kself test is).

-- 
Thanks,
~Nick Desaulniers


Re: [PATCH v8 17/17] Documentation/x86/64: Add documentation for GS/FS addressing mode

2019-09-27 Thread Bae, Chang Seok
> On Sep 27, 2019, at 14:25, Randy Dunlap  wrote:
> 
> Hi,
> Some doc comments/fixes below...
> 
> On 9/12/19 1:06 PM, Chang S. Bae wrote:
>> From: Thomas Gleixner 
>> 

[snip]

>> +There exist two mechanisms to read and write the FS/FS base address:
> 
>FS/GS

[snip]

>> + The arch_prctl(2) based mechanism is available on all 64bit CPUs and all
> 
>  64-bit

[snip]

>> +mov %reg, %gs:offset
>> \ No newline at end of file
> oops.

Thanks. Let me include this in my v9 submission.

Chang


Re: [PATCH 1/2] Modify cpupower to schedule itself on cores it is reading MSRs from

2019-09-27 Thread Thomas Renninger
On Friday, September 27, 2019 6:07:56 PM CEST  Natarajan, Janakarajan  wrote:
> On 9/18/2019 11:34 AM, Natarajan, Janakarajan wrote:

> > This is advantageous because an IPI is not generated when a read_msr() is
> > executed on the local logical CPU thereby reducing the chance of having
> > APERF and MPERF being out of sync.
> > +   if (sched_setaffinity(getpid(), sizeof(set), ) == -1) {
> > +   dprint("Could not migrate to cpu: %d\n", cpu);
> > +   return 1;

On an 80-core CPU the process would be pushed around through the
system quite a lot.
This might affect what you are measuring or the other measured values?
Otherwise it's the kernel's MSR read only, not the whole cpupower process,
right? No idea about the exact overhead, though. Others in CC list should
know.

AFAIK MSR reads through the msr module should be avoided anyway?
Those which are worth it are abstracted through sysfs nowadays?

For aperf/mperf it might make sense to define a sysfs file where you
can read both, as this is what you always need?

It would take a while, but could be a longterm solution which is also
usable in secure boot or without msr module case.

  Thomas





[PATCH v2 2/8] KVM: VMX: Skip GUEST_CR3 VMREAD+VMWRITE if the VMCS is up-to-date

2019-09-27 Thread Sean Christopherson
Skip the VMWRITE to update GUEST_CR3 if CR3 is not available, i.e. has
not been read from the VMCS since the last VM-Enter.  If vcpu->arch.cr3
is stale, kvm_read_cr3(vcpu) will refresh vcpu->arch.cr3 from the VMCS,
meaning KVM will do a VMREAD and then VMWRITE the value it just pulled
from the VMCS.

Note, this is a purely theoretical change, no instances of skipping
the VMREAD+VMWRITE have been observed with this change.

Tested-by: Reto Buerki 
Tested-by: Vitaly Kuznetsov 
Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/vmx/vmx.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 7679c2a05a50..0b8dd9c315f8 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3004,10 +3004,12 @@ void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long 
cr3)
/* Loading vmcs02.GUEST_CR3 is handled by nested VM-Enter. */
if (is_guest_mode(vcpu))
update_guest_cr3 = false;
-   else if (enable_unrestricted_guest || is_paging(vcpu))
-   guest_cr3 = kvm_read_cr3(vcpu);
-   else
+   else if (!enable_unrestricted_guest && !is_paging(vcpu))
guest_cr3 = to_kvm_vmx(kvm)->ept_identity_map_addr;
+   else if (test_bit(VCPU_EXREG_CR3, (ulong 
*)>arch.regs_avail))
+   guest_cr3 = vcpu->arch.cr3;
+   else /* vmcs01.GUEST_CR3 is already up-to-date. */
+   update_guest_cr3 = false;
ept_load_pdptrs(vcpu);
}
 
-- 
2.22.0



[PATCH v2 4/8] KVM: VMX: Optimize vmx_set_rflags() for unrestricted guest

2019-09-27 Thread Sean Christopherson
Rework vmx_set_rflags() to avoid the extra code need to handle emulation
of real mode and invalid state when unrestricted guest is disabled.  The
primary reason for doing so is to avoid the call to vmx_get_rflags(),
which will incur a VMREAD when RFLAGS is not already available.  When
running nested VMs, the majority of calls to vmx_set_rflags() will occur
without an associated vmx_get_rflags(), i.e. when stuffing GUEST_RFLAGS
during transitions between vmcs01 and vmcs02.

Note, vmx_get_rflags() guarantees RFLAGS is marked available.

Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/vmx/vmx.c | 28 ++--
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 83fe8b02b732..814d3e6d0264 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1426,18 +1426,26 @@ unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu)
 void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
-   unsigned long old_rflags = vmx_get_rflags(vcpu);
+   unsigned long old_rflags;
 
-   __set_bit(VCPU_EXREG_RFLAGS, (ulong *)>arch.regs_avail);
-   vmx->rflags = rflags;
-   if (vmx->rmode.vm86_active) {
-   vmx->rmode.save_rflags = rflags;
-   rflags |= X86_EFLAGS_IOPL | X86_EFLAGS_VM;
+   if (enable_unrestricted_guest) {
+   __set_bit(VCPU_EXREG_RFLAGS, (ulong *)>arch.regs_avail);
+
+   vmx->rflags = rflags;
+   vmcs_writel(GUEST_RFLAGS, rflags);
+   } else {
+   old_rflags = vmx_get_rflags(vcpu);
+
+   vmx->rflags = rflags;
+   if (vmx->rmode.vm86_active) {
+   vmx->rmode.save_rflags = rflags;
+   rflags |= X86_EFLAGS_IOPL | X86_EFLAGS_VM;
+   }
+   vmcs_writel(GUEST_RFLAGS, rflags);
+
+   if ((old_rflags ^ vmx->rflags) & X86_EFLAGS_VM)
+   vmx->emulation_required = emulation_required(vcpu);
}
-   vmcs_writel(GUEST_RFLAGS, rflags);
-
-   if ((old_rflags ^ vmx->rflags) & X86_EFLAGS_VM)
-   vmx->emulation_required = emulation_required(vcpu);
 }
 
 u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu)
-- 
2.22.0



[PATCH v2 3/8] KVM: VMX: Consolidate to_vmx() usage in RFLAGS accessors

2019-09-27 Thread Sean Christopherson
Capture struct vcpu_vmx in a local variable to improve the readability
of vmx_{g,s}et_rflags().

No functional change intended.

Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/vmx/vmx.c | 20 +++-
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0b8dd9c315f8..83fe8b02b732 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1407,35 +1407,37 @@ static void vmx_decache_cr0_guest_bits(struct kvm_vcpu 
*vcpu);
 
 unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu)
 {
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
unsigned long rflags, save_rflags;
 
if (!test_bit(VCPU_EXREG_RFLAGS, (ulong *)>arch.regs_avail)) {
__set_bit(VCPU_EXREG_RFLAGS, (ulong *)>arch.regs_avail);
rflags = vmcs_readl(GUEST_RFLAGS);
-   if (to_vmx(vcpu)->rmode.vm86_active) {
+   if (vmx->rmode.vm86_active) {
rflags &= RMODE_GUEST_OWNED_EFLAGS_BITS;
-   save_rflags = to_vmx(vcpu)->rmode.save_rflags;
+   save_rflags = vmx->rmode.save_rflags;
rflags |= save_rflags & ~RMODE_GUEST_OWNED_EFLAGS_BITS;
}
-   to_vmx(vcpu)->rflags = rflags;
+   vmx->rflags = rflags;
}
-   return to_vmx(vcpu)->rflags;
+   return vmx->rflags;
 }
 
 void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
 {
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
unsigned long old_rflags = vmx_get_rflags(vcpu);
 
__set_bit(VCPU_EXREG_RFLAGS, (ulong *)>arch.regs_avail);
-   to_vmx(vcpu)->rflags = rflags;
-   if (to_vmx(vcpu)->rmode.vm86_active) {
-   to_vmx(vcpu)->rmode.save_rflags = rflags;
+   vmx->rflags = rflags;
+   if (vmx->rmode.vm86_active) {
+   vmx->rmode.save_rflags = rflags;
rflags |= X86_EFLAGS_IOPL | X86_EFLAGS_VM;
}
vmcs_writel(GUEST_RFLAGS, rflags);
 
-   if ((old_rflags ^ to_vmx(vcpu)->rflags) & X86_EFLAGS_VM)
-   to_vmx(vcpu)->emulation_required = emulation_required(vcpu);
+   if ((old_rflags ^ vmx->rflags) & X86_EFLAGS_VM)
+   vmx->emulation_required = emulation_required(vcpu);
 }
 
 u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu)
-- 
2.22.0



[PATCH v2 0/8] KVM: x86: nVMX GUEST_CR3 bug fix, and then some...

2019-09-27 Thread Sean Christopherson
*sigh*

v2 was shaping up to be a trivial update, until I started working on
Vitaly's suggestion to add a helper to test for register availability.

The primary purpose of this series is to fix a CR3 corruption in L2
reported by Reto Buerki when running with HLT interception disabled in L1.
On a nested VM-Enter that puts L2 into HLT, KVM never actually enters L2
and instead mimics HLT interception by canceling the nested run and
pretending that VM-Enter to L2 completed and then exited on HLT (which
KVM intercepted).  Because KVM never actually runs L2, KVM skips the
pending MMU update for L2 and so leaves a stale value in vmcs02.GUEST_CR3.
If the next wake event for L2 triggers a nested VM-Exit, KVM will refresh
vmcs12->guest_cr3 from vmcs02.GUEST_CR3 and consume the stale value.

Fix the issue by unconditionally writing vmcs02.GUEST_CR3 during nested
VM-Enter instead of deferring the update to vmx_set_cr3(), and skip the
update of GUEST_CR3 in vmx_set_cr3() when running L2.  I.e. make the
nested code fully responsible for vmcs02.GUEST_CR3.

Patch 02/08 is a minor optimization to skip the GUEST_CR3 update if
vmcs01 is already up-to-date.

Patches 03 and beyond are Vitaly's fault ;-).

Patches 03 and 04 are tangentially related cleanup to vmx_set_rflags()
that was discovered when working through the avail/dirty testing code.
Ideally they'd be sent as a separate series, but they conflict with the
avail/dirty helper changes and are themselves minor and straightforward.

Patches 05 and 06 clean up the register caching code so that there is a
single enum for all registers which use avail/dirty tracking.  While not
a true prerequisite for the avail/dirty helpers, the cleanup allows the
new helpers to take an 'enum kvm_reg' instead of a less helpful 'int reg'.

Patch 07 is the helpers themselves, as suggested by Vitaly.

Patch 08 is a truly optional change to ditch decache_cr3() in favor of
handling CR3 via cache_reg() like any other avail/dirty register.


Note, I collected the Reviewed-by and Tested-by tags for patches 01 and 02
even though I inverted the boolean from 'skip_cr3' to 'update_guest_cr3'.
Please drop the tags if that constitutes a non-trivial functional change.

v2:
  - Invert skip_cr3 to update_guest_cr3.  [Liran]
  - Reword the changelog and comment to be more explicit in detailing
how/when KVM will process a nested VM-Enter without running L2.  [Liran]
  - Added Reviewed-by and Tested-by tags.
  - Add a comment in vmx_set_cr3() to explicitly state that nested
VM-Enter is responsible for loading vmcs02.GUEST_CR3.  [Jim]
  - All of the loveliness in patches 03-08. [Vitaly]

Sean Christopherson (8):
  KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter
  KVM: VMX: Skip GUEST_CR3 VMREAD+VMWRITE if the VMCS is up-to-date
  KVM: VMX: Consolidate to_vmx() usage in RFLAGS accessors
  KVM: VMX: Optimize vmx_set_rflags() for unrestricted guest
  KVM: x86: Add WARNs to detect out-of-bounds register indices
  KVM: x86: Fold 'enum kvm_ex_reg' definitions into 'enum kvm_reg'
  KVM: x86: Add helpers to test/mark reg availability and dirtiness
  KVM: x86: Fold decache_cr3() into cache_reg()

 arch/x86/include/asm/kvm_host.h |  5 +-
 arch/x86/kvm/kvm_cache_regs.h   | 67 +--
 arch/x86/kvm/svm.c  |  5 --
 arch/x86/kvm/vmx/nested.c   | 14 -
 arch/x86/kvm/vmx/vmx.c  | 94 ++---
 arch/x86/kvm/x86.c  | 13 ++---
 arch/x86/kvm/x86.h  |  6 +--
 7 files changed, 123 insertions(+), 81 deletions(-)

-- 
2.22.0



[PATCH v2 8/8] KVM: x86: Fold decache_cr3() into cache_reg()

2019-09-27 Thread Sean Christopherson
Handle caching CR3 (from VMX's VMCS) into struct kvm_vcpu via the common
cache_reg() callback and drop the dedicated decache_cr3().  The name
decache_cr3() is somewhat confusing as the caching behavior of CR3
follows that of GPRs, RFLAGS and PDPTRs, (handled via cache_reg()), and
has nothing in common with the caching behavior of CR0/CR4 (whose
decache_cr{0,4}_guest_bits() likely provided the 'decache' verbiage).

Note, this effectively adds a BUG() if KVM attempts to cache CR3 on SVM.
Opportunistically add a WARN_ON_ONCE() in VMX to provide an equivalent
check.

Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/kvm_host.h |  1 -
 arch/x86/kvm/kvm_cache_regs.h   |  2 +-
 arch/x86/kvm/svm.c  |  5 -
 arch/x86/kvm/vmx/vmx.c  | 15 ++-
 4 files changed, 7 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a27f7f6b6b7a..0411dc0a27b0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1040,7 +1040,6 @@ struct kvm_x86_ops {
struct kvm_segment *var, int seg);
void (*get_cs_db_l_bits)(struct kvm_vcpu *vcpu, int *db, int *l);
void (*decache_cr0_guest_bits)(struct kvm_vcpu *vcpu);
-   void (*decache_cr3)(struct kvm_vcpu *vcpu);
void (*decache_cr4_guest_bits)(struct kvm_vcpu *vcpu);
void (*set_cr0)(struct kvm_vcpu *vcpu, unsigned long cr0);
void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3);
diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 9c2bc528800b..f18177cd0030 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -145,7 +145,7 @@ static inline ulong kvm_read_cr4_bits(struct kvm_vcpu *vcpu, ulong mask)
 static inline ulong kvm_read_cr3(struct kvm_vcpu *vcpu)
 {
if (!kvm_register_is_available(vcpu, VCPU_EXREG_CR3))
-   kvm_x86_ops->decache_cr3(vcpu);
+   kvm_x86_ops->cache_reg(vcpu, VCPU_EXREG_CR3);
return vcpu->arch.cr3;
 }
 
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index f8ecb6df5106..3102c44c12c6 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2517,10 +2517,6 @@ static void svm_decache_cr0_guest_bits(struct kvm_vcpu *vcpu)
 {
 }
 
-static void svm_decache_cr3(struct kvm_vcpu *vcpu)
-{
-}
-
 static void svm_decache_cr4_guest_bits(struct kvm_vcpu *vcpu)
 {
 }
@@ -7208,7 +7204,6 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
.get_cpl = svm_get_cpl,
.get_cs_db_l_bits = kvm_get_cs_db_l_bits,
.decache_cr0_guest_bits = svm_decache_cr0_guest_bits,
-   .decache_cr3 = svm_decache_cr3,
.decache_cr4_guest_bits = svm_decache_cr4_guest_bits,
.set_cr0 = svm_set_cr0,
.set_cr3 = svm_set_cr3,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ed03d0cd1cc8..c84798026e85 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2188,7 +2188,12 @@ static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
if (enable_ept)
ept_save_pdptrs(vcpu);
break;
+   case VCPU_EXREG_CR3:
+		if (enable_unrestricted_guest || (enable_ept && is_paging(vcpu)))
+   vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
+   break;
default:
+   WARN_ON_ONCE(1);
break;
}
 }
@@ -2859,13 +2864,6 @@ static void vmx_decache_cr0_guest_bits(struct kvm_vcpu *vcpu)
vcpu->arch.cr0 |= vmcs_readl(GUEST_CR0) & cr0_guest_owned_bits;
 }
 
-static void vmx_decache_cr3(struct kvm_vcpu *vcpu)
-{
-   if (enable_unrestricted_guest || (enable_ept && is_paging(vcpu)))
-   vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
-   kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);
-}
-
 static void vmx_decache_cr4_guest_bits(struct kvm_vcpu *vcpu)
 {
ulong cr4_guest_owned_bits = vcpu->arch.cr4_guest_owned_bits;
@@ -2910,7 +2908,7 @@ static void ept_update_paging_mode_cr0(unsigned long *hw_cr0,
struct vcpu_vmx *vmx = to_vmx(vcpu);
 
if (!kvm_register_is_available(vcpu, VCPU_EXREG_CR3))
-   vmx_decache_cr3(vcpu);
+   vmx_cache_reg(vcpu, VCPU_EXREG_CR3);
if (!(cr0 & X86_CR0_PG)) {
/* From paging/starting to nonpaging */
exec_controls_setbit(vmx, CPU_BASED_CR3_LOAD_EXITING |
@@ -7792,7 +7790,6 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
.get_cpl = vmx_get_cpl,
.get_cs_db_l_bits = vmx_get_cs_db_l_bits,
.decache_cr0_guest_bits = vmx_decache_cr0_guest_bits,
-   .decache_cr3 = vmx_decache_cr3,
.decache_cr4_guest_bits = vmx_decache_cr4_guest_bits,
.set_cr0 = vmx_set_cr0,
.set_cr3 = vmx_set_cr3,
-- 
2.22.0



[PATCH v2 1/8] KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter

2019-09-27 Thread Sean Christopherson
Write the desired L2 CR3 into vmcs02.GUEST_CR3 during nested VM-Enter
instead of deferring the VMWRITE until vmx_set_cr3().  If the VMWRITE
is deferred, then KVM can consume a stale vmcs02.GUEST_CR3 when it
refreshes vmcs12->guest_cr3 during nested_vmx_vmexit() if the emulated
VM-Exit occurs without actually entering L2, e.g. if the nested run
is squashed because nested VM-Enter (from L1) is putting L2 into HLT.

Note, the above scenario can occur regardless of whether L1 is
intercepting HLT, e.g. L1 can intercept HLT and then re-enter L2 with
vmcs.GUEST_ACTIVITY_STATE=HALTED.  But practically speaking, a VMM will
likely put a guest into HALTED if and only if it's not intercepting HLT.

In an ideal world where EPT *requires* unrestricted guest (and vice
versa), VMX could handle CR3 similar to how it handles RSP and RIP,
e.g. mark CR3 dirty and conditionally load it at vmx_vcpu_run().  But
the unrestricted guest silliness complicates the dirty tracking logic
to the point that explicitly handling vmcs02.GUEST_CR3 during nested
VM-Enter is a simpler overall implementation.

Cc: sta...@vger.kernel.org
Reported-and-tested-by: Reto Buerki 
Tested-by: Vitaly Kuznetsov 
Reviewed-by: Liran Alon 
Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/vmx/nested.c | 10 ++
 arch/x86/kvm/vmx/vmx.c| 10 +++---
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 41abc62c9a8a..b72a00b53e4a 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2418,6 +2418,16 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
entry_failure_code))
return -EINVAL;
 
+   /*
+* Immediately write vmcs02.GUEST_CR3.  It will be propagated to vmcs12
+* on nested VM-Exit, which can occur without actually running L2 and
+* thus without hitting vmx_set_cr3(), e.g. if L1 is entering L2 with
+* vmcs12.GUEST_ACTIVITYSTATE=HLT, in which case KVM will intercept the
+* transition to HLT instead of running L2.
+*/
+   if (enable_ept)
+   vmcs_writel(GUEST_CR3, vmcs12->guest_cr3);
+
/* Late preparation of GUEST_PDPTRs now that EFER and CRs are set. */
if (load_guest_pdptrs_vmcs12 && nested_cpu_has_ept(vmcs12) &&
is_pae_paging(vcpu)) {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d4575ffb3cec..7679c2a05a50 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2984,6 +2984,7 @@ u64 construct_eptp(struct kvm_vcpu *vcpu, unsigned long root_hpa)
 void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 {
struct kvm *kvm = vcpu->kvm;
+   bool update_guest_cr3 = true;
unsigned long guest_cr3;
u64 eptp;
 
@@ -3000,15 +3001,18 @@ void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
		spin_unlock(&to_kvm_vmx(kvm)->ept_pointer_lock);
	}
}
 
-   if (enable_unrestricted_guest || is_paging(vcpu) ||
-   is_guest_mode(vcpu))
+   /* Loading vmcs02.GUEST_CR3 is handled by nested VM-Enter. */
+   if (is_guest_mode(vcpu))
+   update_guest_cr3 = false;
+   else if (enable_unrestricted_guest || is_paging(vcpu))
guest_cr3 = kvm_read_cr3(vcpu);
else
guest_cr3 = to_kvm_vmx(kvm)->ept_identity_map_addr;
ept_load_pdptrs(vcpu);
}
 
-   vmcs_writel(GUEST_CR3, guest_cr3);
+   if (update_guest_cr3)
+   vmcs_writel(GUEST_CR3, guest_cr3);
 }
 
 int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
-- 
2.22.0



[PATCH v2 7/8] KVM: x86: Add helpers to test/mark reg availability and dirtiness

2019-09-27 Thread Sean Christopherson
Add helpers to prettify code that tests and/or marks whether or not a
register is available and/or dirty.

Suggested-by: Vitaly Kuznetsov 
Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/kvm_cache_regs.h | 45 +--
 arch/x86/kvm/vmx/nested.c |  4 ++--
 arch/x86/kvm/vmx/vmx.c| 29 ++
 arch/x86/kvm/x86.c| 13 --
 4 files changed, 53 insertions(+), 38 deletions(-)

diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index b85fc4b4e04f..9c2bc528800b 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -37,12 +37,37 @@ BUILD_KVM_GPR_ACCESSORS(r14, R14)
 BUILD_KVM_GPR_ACCESSORS(r15, R15)
 #endif
 
+static inline bool kvm_register_is_available(struct kvm_vcpu *vcpu,
+					     enum kvm_reg reg)
+{
+	return test_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
+}
+
+static inline bool kvm_register_is_dirty(struct kvm_vcpu *vcpu,
+					 enum kvm_reg reg)
+{
+	return test_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
+}
+
+static inline void kvm_register_mark_available(struct kvm_vcpu *vcpu,
+					       enum kvm_reg reg)
+{
+	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
+}
+
+static inline void kvm_register_mark_dirty(struct kvm_vcpu *vcpu,
+					   enum kvm_reg reg)
+{
+	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
+	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
+}
+
 static inline unsigned long kvm_register_read(struct kvm_vcpu *vcpu, int reg)
 {
if (WARN_ON_ONCE((unsigned int)reg >= NR_VCPU_REGS))
return 0;
 
-	if (!test_bit(reg, (unsigned long *)&vcpu->arch.regs_avail))
+   if (!kvm_register_is_available(vcpu, reg))
kvm_x86_ops->cache_reg(vcpu, reg);
 
return vcpu->arch.regs[reg];
@@ -55,13 +80,12 @@ static inline void kvm_register_write(struct kvm_vcpu 
*vcpu, int reg,
return;
 
vcpu->arch.regs[reg] = val;
-	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
-	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
+   kvm_register_mark_dirty(vcpu, reg);
 }
 
 static inline unsigned long kvm_rip_read(struct kvm_vcpu *vcpu)
 {
-	if (!test_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_avail))
+   if (!kvm_register_is_available(vcpu, VCPU_REGS_RIP))
kvm_x86_ops->cache_reg(vcpu, VCPU_REGS_RIP);
 
return vcpu->arch.regs[VCPU_REGS_RIP];
@@ -70,13 +94,12 @@ static inline unsigned long kvm_rip_read(struct kvm_vcpu *vcpu)
 static inline void kvm_rip_write(struct kvm_vcpu *vcpu, unsigned long val)
 {
vcpu->arch.regs[VCPU_REGS_RIP] = val;
-	__set_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_dirty);
-	__set_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_avail);
+   kvm_register_mark_dirty(vcpu, VCPU_REGS_RIP);
 }
 
 static inline unsigned long kvm_rsp_read(struct kvm_vcpu *vcpu)
 {
-	if (!test_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_avail))
+   if (!kvm_register_is_available(vcpu, VCPU_REGS_RSP))
kvm_x86_ops->cache_reg(vcpu, VCPU_REGS_RSP);
 
return vcpu->arch.regs[VCPU_REGS_RSP];
@@ -85,16 +108,14 @@ static inline unsigned long kvm_rsp_read(struct kvm_vcpu *vcpu)
 static inline void kvm_rsp_write(struct kvm_vcpu *vcpu, unsigned long val)
 {
vcpu->arch.regs[VCPU_REGS_RSP] = val;
-	__set_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_dirty);
-	__set_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_avail);
+   kvm_register_mark_dirty(vcpu, VCPU_REGS_RSP);
 }
 
 static inline u64 kvm_pdptr_read(struct kvm_vcpu *vcpu, int index)
 {
might_sleep();  /* on svm */
 
-   if (!test_bit(VCPU_EXREG_PDPTR,
-		      (unsigned long *)&vcpu->arch.regs_avail))
+   if (!kvm_register_is_available(vcpu, VCPU_EXREG_PDPTR))
kvm_x86_ops->cache_reg(vcpu, VCPU_EXREG_PDPTR);
 
return vcpu->arch.walk_mmu->pdptrs[index];
@@ -123,7 +144,7 @@ static inline ulong kvm_read_cr4_bits(struct kvm_vcpu *vcpu, ulong mask)
 
 static inline ulong kvm_read_cr3(struct kvm_vcpu *vcpu)
 {
-	if (!test_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail))
+   if (!kvm_register_is_available(vcpu, VCPU_EXREG_CR3))
kvm_x86_ops->decache_cr3(vcpu);
return vcpu->arch.cr3;
 }
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index b72a00b53e4a..8946f11c574c 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1012,7 +1012,7 @@ static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3, bool ne
kvm_mmu_new_cr3(vcpu, cr3, false);
 
vcpu->arch.cr3 = cr3;
-	__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
+   kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);
 
kvm_init_mmu(vcpu, 

[PATCH v2 6/8] KVM: x86: Fold 'enum kvm_ex_reg' definitions into 'enum kvm_reg'

2019-09-27 Thread Sean Christopherson
Now that indexing into arch.regs is either protected by WARN_ON_ONCE or
done with hardcoded enums, combine all definitions for registers that
are tracked by regs_avail and regs_dirty into 'enum kvm_reg'.  Having a
single enum type will simplify additional cleanup related to regs_avail
and regs_dirty.

Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/kvm_host.h | 4 +---
 arch/x86/kvm/kvm_cache_regs.h   | 2 +-
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 23edf56cf577..a27f7f6b6b7a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -156,10 +156,8 @@ enum kvm_reg {
VCPU_REGS_R15 = __VCPU_REGS_R15,
 #endif
VCPU_REGS_RIP,
-   NR_VCPU_REGS
-};
+   NR_VCPU_REGS,
 
-enum kvm_reg_ex {
VCPU_EXREG_PDPTR = NR_VCPU_REGS,
VCPU_EXREG_CR3,
VCPU_EXREG_RFLAGS,
diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 3972e1b65635..b85fc4b4e04f 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -95,7 +95,7 @@ static inline u64 kvm_pdptr_read(struct kvm_vcpu *vcpu, int index)
 
if (!test_bit(VCPU_EXREG_PDPTR,
		      (unsigned long *)&vcpu->arch.regs_avail))
-   kvm_x86_ops->cache_reg(vcpu, (enum kvm_reg)VCPU_EXREG_PDPTR);
+   kvm_x86_ops->cache_reg(vcpu, VCPU_EXREG_PDPTR);
 
return vcpu->arch.walk_mmu->pdptrs[index];
 }
-- 
2.22.0



[PATCH v2 5/8] KVM: x86: Add WARNs to detect out-of-bounds register indices

2019-09-27 Thread Sean Christopherson
Add WARN_ON_ONCE() checks in kvm_register_{read,write}() to detect reg
values that would cause KVM to overflow vcpu->arch.regs.  Change the reg
param to an 'int' to make it clear that the reg index is unverified.

Open code the RIP and RSP accessors so as to avoid pointless overhead of
WARN_ON_ONCE().  Alternatively, lower-level helpers could be provided,
but that opens the door for improper use of said helpers, and the
ugliness of the open-coding will be slightly improved in future patches.

Regarding the overhead of WARN_ON_ONCE(), now that all fixed GPR reads
and writes use dedicated accessors, e.g. kvm_rax_read(), the overhead
is limited to flows where the reg index is generated at runtime.  And
there is at least one historical bug where KVM has generated an out-of-
bounds access to arch.regs (see commit b68f3cc7d9789, "KVM: x86: Always
use 32-bit SMRAM save state for 32-bit kernels").

Adding the WARN_ON_ONCE() protection paves the way for additional
cleanup related to kvm_reg and kvm_reg_ex.

Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/kvm_cache_regs.h | 30 ++
 arch/x86/kvm/x86.h|  6 ++
 2 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 1cc6c47dc77e..3972e1b65635 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -37,19 +37,23 @@ BUILD_KVM_GPR_ACCESSORS(r14, R14)
 BUILD_KVM_GPR_ACCESSORS(r15, R15)
 #endif
 
-static inline unsigned long kvm_register_read(struct kvm_vcpu *vcpu,
- enum kvm_reg reg)
+static inline unsigned long kvm_register_read(struct kvm_vcpu *vcpu, int reg)
 {
+   if (WARN_ON_ONCE((unsigned int)reg >= NR_VCPU_REGS))
+   return 0;
+
	if (!test_bit(reg, (unsigned long *)&vcpu->arch.regs_avail))
kvm_x86_ops->cache_reg(vcpu, reg);
 
return vcpu->arch.regs[reg];
 }
 
-static inline void kvm_register_write(struct kvm_vcpu *vcpu,
- enum kvm_reg reg,
+static inline void kvm_register_write(struct kvm_vcpu *vcpu, int reg,
  unsigned long val)
 {
+   if (WARN_ON_ONCE((unsigned int)reg >= NR_VCPU_REGS))
+   return;
+
vcpu->arch.regs[reg] = val;
	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
@@ -57,22 +61,32 @@ static inline void kvm_register_write(struct kvm_vcpu *vcpu,
 
 static inline unsigned long kvm_rip_read(struct kvm_vcpu *vcpu)
 {
-   return kvm_register_read(vcpu, VCPU_REGS_RIP);
+	if (!test_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_avail))
+   kvm_x86_ops->cache_reg(vcpu, VCPU_REGS_RIP);
+
+   return vcpu->arch.regs[VCPU_REGS_RIP];
 }
 
 static inline void kvm_rip_write(struct kvm_vcpu *vcpu, unsigned long val)
 {
-   kvm_register_write(vcpu, VCPU_REGS_RIP, val);
+   vcpu->arch.regs[VCPU_REGS_RIP] = val;
+	__set_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_dirty);
+	__set_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_avail);
 }
 
 static inline unsigned long kvm_rsp_read(struct kvm_vcpu *vcpu)
 {
-   return kvm_register_read(vcpu, VCPU_REGS_RSP);
+	if (!test_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_avail))
+   kvm_x86_ops->cache_reg(vcpu, VCPU_REGS_RSP);
+
+   return vcpu->arch.regs[VCPU_REGS_RSP];
 }
 
 static inline void kvm_rsp_write(struct kvm_vcpu *vcpu, unsigned long val)
 {
-   kvm_register_write(vcpu, VCPU_REGS_RSP, val);
+   vcpu->arch.regs[VCPU_REGS_RSP] = val;
+	__set_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_dirty);
+	__set_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_avail);
 }
 
 static inline u64 kvm_pdptr_read(struct kvm_vcpu *vcpu, int index)
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index dbf7442a822b..45d82b8277e5 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -238,8 +238,7 @@ static inline bool vcpu_match_mmio_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
return false;
 }
 
-static inline unsigned long kvm_register_readl(struct kvm_vcpu *vcpu,
-  enum kvm_reg reg)
+static inline unsigned long kvm_register_readl(struct kvm_vcpu *vcpu, int reg)
 {
unsigned long val = kvm_register_read(vcpu, reg);
 
@@ -247,8 +246,7 @@ static inline unsigned long kvm_register_readl(struct kvm_vcpu *vcpu,
 }
 
 static inline void kvm_register_writel(struct kvm_vcpu *vcpu,
-  enum kvm_reg reg,
-  unsigned long val)
+  int reg, unsigned long val)
 {
if (!is_64_bit_mode(vcpu))
val = (u32)val;
-- 
2.22.0



Re: [PATCH v2] nvme-pci: Save PCI state before putting drive into deepest state

2019-09-27 Thread Christoph Hellwig
Looks sensible to me:

Reviewed-by: Christoph Hellwig 


Re: [PATCH v6 4/5] pinctrl: mediatek: Backward compatible to previous Mediatek's bias-pull usage

2019-09-27 Thread Sean Wang
Hi,

On Thu, Sep 26, 2019 at 10:02 PM Light Hsieh  wrote:
>
> Refine mtk_pinconf_set()/mtk_pinconf_get() for backward compatibility to
> previous Mediatek's bias-pull usage.

MediaTek

> In PINCTRL_MTK that use pinctrl-mtk-common.c, bias-pull setting for pins
> with 2 pull resistors can be specified as value for bias-pull-up and
> bias-pull-down. For example:
> bias-pull-up = ;
> bias-pull-up = ;
> bias-pull-up = ;
> bias-pull-up = ;
> bias-pull-down = ;
> bias-pull-down = ;
> bias-pull-down = ;
> bias-pull-down = ;
>
> On the other hand, PINCTRL_MTK_PARIS use customized properties
> "mediatek,pull-up-adv" and "mediatek,pull-down-adv" to specify bias-pull
> setting for pins with 2 pull resistors.
> This introduce in-compatibility in device tree and increatse porting

s/increatse/increase/

> effort to Mediatek's customer that had already used PINCTRL_MTK version.

s/Mediatek/MediaTek/

> Besides, if customers are not awared of this change and still write devicetree

s/awared/aware/

> for PINCTRL_MTK version, they may encounter runtime failure with pinctrl and
> spent time to debug.
>
> This patch add backward compatible to previous Mediatek's bias-pull usage

s/add/adds/

> so that Mediatek's customer need not use a new devicetree property name.

s/Mediatek/MediaTek/

> The rationale is that: changing driver implemenation had better leave

s/implemenation/implementation/

> interface unchanged.
>

There are many tags missing here too.

> ---
>  drivers/pinctrl/mediatek/pinctrl-mt6765.c|   6 +-
>  drivers/pinctrl/mediatek/pinctrl-mtk-common-v2.c | 285 +++
>  drivers/pinctrl/mediatek/pinctrl-mtk-common-v2.h |  11 +
>  drivers/pinctrl/mediatek/pinctrl-paris.c |  49 ++--
>  4 files changed, 327 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/pinctrl/mediatek/pinctrl-mt6765.c b/drivers/pinctrl/mediatek/pinctrl-mt6765.c
> index bada37f..ae85fdc 100644
> --- a/drivers/pinctrl/mediatek/pinctrl-mt6765.c
> +++ b/drivers/pinctrl/mediatek/pinctrl-mt6765.c
> @@ -1072,10 +1072,8 @@
> .gpio_m = 0,
> .base_names = mt6765_pinctrl_register_base_names,
> .nbase_names = ARRAY_SIZE(mt6765_pinctrl_register_base_names),
> -   .bias_disable_set = mtk_pinconf_bias_disable_set,
> -   .bias_disable_get = mtk_pinconf_bias_disable_get,
> -   .bias_set = mtk_pinconf_bias_set,
> -   .bias_get = mtk_pinconf_bias_get,
> +   .bias_set_combo = mtk_pinconf_bias_set_combo,
> +   .bias_get_combo = mtk_pinconf_bias_get_combo,
> .drive_set = mtk_pinconf_drive_set_direct_val,
> .drive_get = mtk_pinconf_drive_get_direct_val,
> .adv_pull_get = mtk_pinconf_adv_pull_get,
> diff --git a/drivers/pinctrl/mediatek/pinctrl-mtk-common-v2.c b/drivers/pinctrl/mediatek/pinctrl-mtk-common-v2.c
> index acfddf9..6d9972f 100644
> --- a/drivers/pinctrl/mediatek/pinctrl-mtk-common-v2.c
> +++ b/drivers/pinctrl/mediatek/pinctrl-mtk-common-v2.c
> @@ -13,6 +13,8 @@
>  #include 
>  #include 
>
> +#include 
> +

List the header declarations in alphabetic order.

>  #include "mtk-eint.h"
>  #include "pinctrl-mtk-common-v2.h"
>
> @@ -206,6 +208,20 @@ int mtk_hw_set_value(struct mtk_pinctrl *hw, const struct mtk_pin_desc *desc,
> return 0;
>  }
>
> +void mtk_hw_set_value_no_lookup(struct mtk_pinctrl *hw,
> +   const struct mtk_pin_desc *desc,
> +   int value, struct mtk_pin_field *pf)
> +{
> +   if (value < 0 || value > pf->mask)
> +   return;
> +
> +   if (!pf->next)
> +   mtk_rmw(hw, pf->index, pf->offset, pf->mask << pf->bitpos,
> +   (value & pf->mask) << pf->bitpos);
> +   else
> +   mtk_hw_write_cross_field(hw, pf, value);
> +}
> +
>  int mtk_hw_get_value(struct mtk_pinctrl *hw, const struct mtk_pin_desc *desc,
>  int field, int *value)
>  {
> @@ -225,6 +241,17 @@ int mtk_hw_get_value(struct mtk_pinctrl *hw, const struct mtk_pin_desc *desc,
> return 0;
>  }
>
> +void mtk_hw_get_value_no_lookup(struct mtk_pinctrl *hw,
> +   const struct mtk_pin_desc *desc,
> +   int *value, struct mtk_pin_field *pf)
> +{
> +   if (!pf->next)
> +   *value = (mtk_r32(hw, pf->index, pf->offset)
> + >> pf->bitpos) & pf->mask;
> +   else
> +   mtk_hw_read_cross_field(hw, pf, value);
> +}
> +

We are able to improve mtk_hw_[set,get]_value with a cache for the
recently visited pin_desc to speed up the subsequent access to the
same register for all clients, rather than creating a specific one
with no_lookup just for a few clients. Generally, all clients should
be able to just apply the  mtk_hw_[set,get]_value to compose what they
actually need.

And the changes related to improving mtk_hw_[set,get]_value with a
cache need to go in a separate patch.

[PATCH] staging: rtl8188eu: fix null dereference when kzalloc fails

2019-09-27 Thread Connor Kuehl
If kzalloc() returns NULL, the error path doesn't stop the flow of
control from entering rtw_hal_read_chip_version() which dereferences the
null pointer. Fix this by adding a 'goto' to the error path to more
gracefully handle the issue and avoid proceeding with initialization
steps that we're no longer prepared to handle.

Also update the debug message to be more consistent with the other debug
messages in this function.

Addresses-Coverity: ("Dereference after null check")

Signed-off-by: Connor Kuehl 
---
 drivers/staging/rtl8188eu/os_dep/usb_intf.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/rtl8188eu/os_dep/usb_intf.c 
b/drivers/staging/rtl8188eu/os_dep/usb_intf.c
index 664d93a7f90d..4fac9dca798e 100644
--- a/drivers/staging/rtl8188eu/os_dep/usb_intf.c
+++ b/drivers/staging/rtl8188eu/os_dep/usb_intf.c
@@ -348,8 +348,10 @@ static struct adapter *rtw_usb_if1_init(struct dvobj_priv 
*dvobj,
}
 
padapter->HalData = kzalloc(sizeof(struct hal_data_8188e), GFP_KERNEL);
-   if (!padapter->HalData)
-   DBG_88E("cant not alloc memory for HAL DATA\n");
+   if (!padapter->HalData) {
+   DBG_88E("Failed to allocate memory for HAL data\n");
+   goto free_adapter;
+   }
 
/* step read_chip_version */
rtw_hal_read_chip_version(padapter);
-- 
2.17.1



Re: [PATCH v5 4/7] hugetlb: disable region_add file_region coalescing

2019-09-27 Thread Mike Kravetz
On 9/19/19 3:24 PM, Mina Almasry wrote:
> A follow up patch in this series adds hugetlb cgroup uncharge info the
> file_region entries in resv->regions. The cgroup uncharge info may
> differ for different regions, so they can no longer be coalesced at
> region_add time. So, disable region coalescing in region_add in this
> patch.
> 
> Behavior change:
> 
> Say a resv_map exists like this [0->1], [2->3], and [5->6].
> 
> Then a region_chg/add call comes in region_chg/add(f=0, t=5).
> 
> Old code would generate resv->regions: [0->5], [5->6].
> New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5],
> [5->6].
> 
> Special care needs to be taken to handle the resv->adds_in_progress
> variable correctly. In the past, only 1 region would be added for every
> region_chg and region_add call. But now, each call may add multiple
> regions, so we can no longer increment adds_in_progress by 1 in region_chg,
> or decrement adds_in_progress by 1 after region_add or region_abort. Instead,
> region_chg calls add_reservation_in_range() to count the number of regions
> needed and allocates those, and that info is passed to region_add and
> region_abort to decrement adds_in_progress correctly.
> 
> Signed-off-by: Mina Almasry 
> 
> ---
>  mm/hugetlb.c | 273 +--
>  1 file changed, 158 insertions(+), 115 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index bac1cbdd027c..d03b048084a3 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -244,6 +244,12 @@ struct file_region {
>   long to;
>  };
> 
> +/* Helper that removes a struct file_region from the resv_map cache and returns
> + * it for use.
> + */
> +static struct file_region *
> +get_file_region_entry_from_cache(struct resv_map *resv, long from, long to);
> +

Instead of the forward declaration, just put the function here.

>  /* Must be called with resv->lock held. Calling this with count_only == true
>   * will count the number of pages to be added but will not modify the linked
>   * list.
> @@ -251,51 +257,61 @@ struct file_region {
>  static long add_reservation_in_range(struct resv_map *resv, long f, long t,
>bool count_only)
>  {
> - long chg = 0;
> + long add = 0;
>   struct list_head *head = >regions;
> + long last_accounted_offset = f;
>   struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;
> 
> - /* Locate the region we are before or in. */
> - list_for_each_entry (rg, head, link)
> - if (f <= rg->to)
> - break;
> -
> - /* Round our left edge to the current segment if it encloses us. */
> - if (f > rg->from)
> - f = rg->from;
> -
> - chg = t - f;
> + /* In this loop, we essentially handle an entry for the range
> +  * last_accounted_offset -> rg->from, at every iteration, with some
> +  * bounds checking.
> +  */
> + list_for_each_entry_safe(rg, trg, head, link) {
> + /* Skip irrelevant regions that start before our range. */
> + if (rg->from < f) {
> + /* If this region ends after the last accounted offset,
> +  * then we need to update last_accounted_offset.
> +  */
> + if (rg->to > last_accounted_offset)
> + last_accounted_offset = rg->to;
> + continue;
> + }
> 
> - /* Check for and consume any regions we now overlap with. */
> - nrg = rg;
> - list_for_each_entry_safe (rg, trg, rg->link.prev, link) {
> - if (>link == head)
> - break;
> + /* When we find a region that starts beyond our range, we've
> +  * finished.
> +  */
>   if (rg->from > t)
>   break;
> 
> - /* We overlap with this area, if it extends further than
> -  * us then we must extend ourselves.  Account for its
> -  * existing reservation.
> + /* Add an entry for last_accounted_offset -> rg->from, and
> +  * update last_accounted_offset.
>*/
> - if (rg->to > t) {
> - chg += rg->to - t;
> - t = rg->to;
> + if (rg->from > last_accounted_offset) {
> + add += rg->from - last_accounted_offset;
> + if (!count_only) {
> + nrg = get_file_region_entry_from_cache(
> + resv, last_accounted_offset, rg->from);
> + list_add(>link, rg->link.prev);
> + }
>   }
> - chg -= rg->to - rg->from;
> 
> - if (!count_only && rg != nrg) {
> - list_del(>link);
> - kfree(rg);
> - }
> + last_accounted_offset = rg->to;
>   }
> 
> - if 
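Outside the kernel, the accounting walk in the hunk above can be sketched in userspace as follows. This is a hedged, simplified model (hypothetical `struct region` and `pages_to_add` names, count-only mode only), and it assumes the function finishes by accounting the tail from the last covered offset up to `t`, as the rest of the patch does:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct file_region. */
struct region {
	long from;
	long to;
};

/*
 * Count the pages in [f, t) not yet covered by the sorted,
 * non-overlapping regions in rg[] -- the same bookkeeping that
 * add_reservation_in_range() does when count_only == true.
 */
long pages_to_add(const struct region *rg, size_t n, long f, long t)
{
	long add = 0;
	long last_accounted_offset = f;
	size_t i;

	for (i = 0; i < n; i++) {
		/* Skip irrelevant regions that start before our range. */
		if (rg[i].from < f) {
			if (rg[i].to > last_accounted_offset)
				last_accounted_offset = rg[i].to;
			continue;
		}
		/* A region starting beyond our range ends the walk. */
		if (rg[i].from > t)
			break;
		/* Account the gap last_accounted_offset -> rg[i].from. */
		if (rg[i].from > last_accounted_offset)
			add += rg[i].from - last_accounted_offset;
		last_accounted_offset = rg[i].to;
	}
	/* Account the uncovered tail of the range. */
	if (t > last_accounted_offset)
		add += t - last_accounted_offset;
	return add;
}
```

For example, reserving [0, 10) with an existing region [2, 4) needs 8 pages: 2 for the gap before the region and 6 for the tail after it.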

[PATCH v2] perf tools: avoid sample_reg_masks being const + weak

2019-09-27 Thread Ian Rogers
Being const + weak breaks with some compilers that constant-propagate
from the weak symbol. This behavior is outside of the specification, but
in LLVM it is chosen to match GCC's behavior.

LLVM's implementation was set in this patch:
https://github.com/llvm/llvm-project/commit/f49573d1eedcf1e44893d5a062ac1b72c8419646
A const + weak symbol is set to be weak_odr:
https://llvm.org/docs/LangRef.html
ODR is the one-definition rule; given there is one constant definition,
constant propagation is possible. It is possible to get this code to
miscompile with LLVM when applying link-time optimization. As compilers
become more aggressive, this is likely to break in more instances.

Move the definition of sample_reg_masks to the conditional part of
perf_regs.h and guard usage with HAVE_PERF_REGS_SUPPORT. This avoids the
weak symbol.
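For illustration, the pattern being removed can be reduced to a single translation unit. This is only a sketch mirroring the perf names (`sample_reg_masks`, `SMPL_REG_END`), not the actual build; the `count_regs` helper is hypothetical:

```c
#include <assert.h>
#include <stddef.h>

struct sample_reg {
	const char *name;
	unsigned long long mask;
};

#define SMPL_REG(n, b)	{ .name = #n, .mask = 1ULL << (b) }
#define SMPL_REG_END	{ .name = NULL }

/*
 * Weak empty default, like the removed fallback in perf_regs.c. A
 * strong definition in an arch object is supposed to override it at
 * link time, but because the array is also const, an LTO compiler may
 * fold this empty table into every use site before the linker runs.
 */
__attribute__((weak)) const struct sample_reg sample_reg_masks[] = {
	SMPL_REG_END
};

/* Walk a sentinel-terminated table, as __parse_regs() does. */
size_t count_regs(const struct sample_reg *r)
{
	size_t n = 0;

	while (r[n].name)
		n++;
	return n;
}
```

If constant propagation assumes the weak empty table, a loop like the one in `__parse_regs()` can be dropped entirely even when an arch file supplies a real table — which is the miscompile the patch avoids by removing the weak default.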

Fix an issue when HAVE_PERF_REGS_SUPPORT isn't defined from patch v1.

Signed-off-by: Ian Rogers 
---
 tools/perf/util/parse-regs-options.c | 8 ++--
 tools/perf/util/perf_regs.c  | 4 
 tools/perf/util/perf_regs.h  | 4 ++--
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
index ef46c2848808..e687497b3aac 100644
--- a/tools/perf/util/parse-regs-options.c
+++ b/tools/perf/util/parse-regs-options.c
@@ -13,7 +13,7 @@ static int
 __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
 {
uint64_t *mode = (uint64_t *)opt->value;
-   const struct sample_reg *r;
+   const struct sample_reg *r = NULL;
char *s, *os = NULL, *p;
int ret = -1;
uint64_t mask;
@@ -46,19 +46,23 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
 
if (!strcmp(s, "?")) {
fprintf(stderr, "available registers: ");
+#ifdef HAVE_PERF_REGS_SUPPORT
for (r = sample_reg_masks; r->name; r++) {
if (r->mask & mask)
fprintf(stderr, "%s ", r->name);
}
+#endif
fputc('\n', stderr);
/* just printing available regs */
return -1;
}
+#ifdef HAVE_PERF_REGS_SUPPORT
for (r = sample_reg_masks; r->name; r++) {
if ((r->mask & mask) && !strcasecmp(s, r->name))
break;
}
-   if (!r->name) {
+#endif
+   if (!r || !r->name) {
ui__warning("Unknown register \"%s\", check man page or run \"perf record %s?\"\n",
s, intr ? "-I" : "--user-regs=");
goto error;
diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
index 2774cec1f15f..5ee47ae1509c 100644
--- a/tools/perf/util/perf_regs.c
+++ b/tools/perf/util/perf_regs.c
@@ -3,10 +3,6 @@
 #include "perf_regs.h"
 #include "event.h"
 
-const struct sample_reg __weak sample_reg_masks[] = {
-   SMPL_REG_END
-};
-
 int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
 char **new_op __maybe_unused)
 {
diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
index 47fe34e5f7d5..e014c2c038f4 100644
--- a/tools/perf/util/perf_regs.h
+++ b/tools/perf/util/perf_regs.h
@@ -15,8 +15,6 @@ struct sample_reg {
 #define SMPL_REG2(n, b) { .name = #n, .mask = 3ULL << (b) }
 #define SMPL_REG_END { .name = NULL }
 
-extern const struct sample_reg sample_reg_masks[];
-
 enum {
SDT_ARG_VALID = 0,
SDT_ARG_SKIP,
@@ -27,6 +25,8 @@ uint64_t arch__intr_reg_mask(void);
 uint64_t arch__user_reg_mask(void);
 
 #ifdef HAVE_PERF_REGS_SUPPORT
+extern const struct sample_reg sample_reg_masks[];
+
 #include 
 
 #define DWARF_MINIMAL_REGS ((1ULL << PERF_REG_IP) | (1ULL << PERF_REG_SP))
-- 
2.23.0.444.g18eeb5a265-goog



Re: [PATCH v3] nfp: abm: fix memory leak in nfp_abm_u32_knode_replace

2019-09-27 Thread Jakub Kicinski
On Fri, 27 Sep 2019 14:12:42 +0200, Markus Elfring wrote:
> > Updated other gotos to have correct errno returned, too.  
> 
> What do you think about adding a jump target here?
> 
> 
> > +++ b/drivers/net/ethernet/netronome/nfp/abm/cls.c
> > @@ -176,8 +176,10 @@ nfp_abm_u32_knode_replace(struct nfp_abm_link *alink,
> > u8 mask, val;
> > int err;
> >
> > -   if (!nfp_abm_u32_check_knode(alink->abm, knode, proto, extack))
> > +   if (!nfp_abm_u32_check_knode(alink->abm, knode, proto, extack)) {
> > +   err = -EOPNOTSUPP;
> > goto err_delete;
> > +   }
> >
> > tos_off = proto == htons(ETH_P_IP) ? 16 : 20;  
> 
> - goto err_delete;
> + goto e_opnotsupp;
> 
> 
> > @@ -221,7 +227,7 @@ nfp_abm_u32_knode_replace(struct nfp_abm_link *alink,
> >  
> 
> +e_opnotsupp:
> + err = -EOPNOTSUPP;
> 
> >  err_delete:
> > nfp_abm_u32_knode_delete(alink, knode);
> > -   return -EOPNOTSUPP;
> > +   return err;
> >  }
> >
> >  static int nfp_abm_setup_tc_block_cb(enum tc_setup_type type,  
> 
> 
> Can such a change variant be a bit nicer?

Definitely not.

Looks good as is, thanks Navid!
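For reference, the shape that was merged — each failure site picking its own error code and jumping to a single cleanup label, per coding-style.rst — can be sketched with a hypothetical helper (the two flags stand in for the real check and allocation; this is not the driver code):

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper mirroring the merged error handling in
 * nfp_abm_u32_knode_replace(): set err at the failure site, then
 * jump to one cleanup label that deletes the knode in the driver.
 */
int knode_replace_sketch(int check_ok, int alloc_ok)
{
	int err;

	if (!check_ok) {
		err = -EOPNOTSUPP;
		goto err_delete;
	}
	if (!alloc_ok) {
		err = -ENOMEM;
		goto err_delete;
	}
	return 0;

err_delete:
	/* Single cleanup point; each path arrives with err already set. */
	return err;
}
```

The rejected variant — a per-error label that assigns `err` and falls through into the cleanup label — saves one assignment at one call site but adds a label per error code, which is why a single `err_delete` was preferred here.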

