Re: nfp: abm: fix memory leak in nfp_abm_u32_knode_replace
>> Can such a change variant be a bit nicer? > > Definitely not. > > Looks good as is, thanks Navid! I find it interesting how software development opinions differ in this case as well, when it comes to implementing correct and efficient exception handling. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/coding-style.rst?id=f1f2f614d535564992f32e720739cb53cf03489f#n450 Regards, Markus
Re: [PATCH 4/4] perf docs: Correct and clarify jitdump spec
On Fri, Sep 27, 2019 at 6:53 PM Steve MacLean wrote: > > Specification claims latest version of jitdump file format is 2. Current > jit dump reading code treats 1 as the latest version. > > Correct spec to match code. > > The original language made it unclear what value should be written in the magic > field. > > Revise the language so that the writer always writes the same value. Specify that > the reader uses the value to detect endian mismatches. > > Cc: Peter Zijlstra > Cc: Ingo Molnar > Cc: Arnaldo Carvalho de Melo > Cc: Mark Rutland > Cc: Alexander Shishkin > Cc: Jiri Olsa > Cc: Namhyung Kim > Cc: Stephane Eranian > Cc: linux-kernel@vger.kernel.org > Signed-off-by: Steve MacLean Corrections are valid. Acked-by: Stephane Eranian > --- > tools/perf/Documentation/jitdump-specification.txt | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/tools/perf/Documentation/jitdump-specification.txt > b/tools/perf/Documentation/jitdump-specification.txt > index 4c62b07..52152d1 100644 > --- a/tools/perf/Documentation/jitdump-specification.txt > +++ b/tools/perf/Documentation/jitdump-specification.txt > @@ -36,8 +36,8 @@ III/ Jitdump file header format > Each jitdump file starts with a fixed size header containing the following > fields in order: > > > -* uint32_t magic : a magic number tagging the file type. The value is > 4-byte long and represents the string "JiTD" in ASCII form. It is 0x4A695444 > or 0x4454694a depending on the endianness. The field can be used to detect > the endianness of the file > -* uint32_t version : a 4-byte value representing the format version. It is > currently set to 2 > +* uint32_t magic : a magic number tagging the file type. The value is > 4-byte long and represents the string "JiTD" in ASCII form. It is written as > 0x4A695444. The reader will detect an endian mismatch when it reads > 0x4454694a. > +* uint32_t version : a 4-byte value representing the format version.
It is > currently set to 1 > * uint32_t total_size: size in bytes of file header > * uint32_t elf_mach : ELF architecture encoding (ELF e_machine value as > specified in /usr/include/elf.h) > * uint32_t pad1 : padding. Reserved for future use > -- > 2.7.4 >
[PATCH -next] nfsd: remove set but not used variable 'len'
Fixes gcc '-Wunused-but-set-variable' warning: fs/nfsd/nfs4xdr.c: In function nfsd4_encode_splice_read: fs/nfsd/nfs4xdr.c:3464:7: warning: variable len set but not used [-Wunused-but-set-variable] It is not used since commit 83a63072c815 ("nfsd: fix nfs read eof detection") Reported-by: Hulk Robot Signed-off-by: YueHaibing --- fs/nfsd/nfs4xdr.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c index 533d0fc..1883370 100644 --- a/fs/nfsd/nfs4xdr.c +++ b/fs/nfsd/nfs4xdr.c @@ -3461,7 +3461,6 @@ static __be32 nfsd4_encode_splice_read( struct xdr_stream *xdr = &resp->xdr; struct xdr_buf *buf = xdr->buf; u32 eof; - long len; int space_left; __be32 nfserr; __be32 *p = xdr->p - 2; @@ -3470,7 +3469,6 @@ static __be32 nfsd4_encode_splice_read( if (xdr->end - xdr->p < 1) return nfserr_resource; - len = maxcount; nfserr = nfsd_splice_read(read->rd_rqstp, read->rd_fhp, file, read->rd_offset, &maxcount, &eof); read->rd_length = maxcount; -- 2.7.4
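The warning class being silenced here is easy to reproduce outside the kernel. A minimal userspace sketch (ours, not the nfsd code) of the before/after shape of the patch:

```c
#include <assert.h>

/* Minimal sketch (not the nfsd code) of the pattern the patch removes:
 * `len` is assigned but its value is never read afterwards, so the
 * assignment is dead code.  Without the (void) cast below, building
 * with `gcc -Wall -Wextra` reports:
 *     warning: variable 'len' set but not used [-Wunused-but-set-variable]
 */
static int encode_read_before(int maxcount)
{
    long len;
    len = maxcount;
    (void)len;        /* kept only so this sketch compiles warning-free */
    return maxcount;
}

/* The fix: delete the dead variable; behaviour is unchanged. */
static int encode_read_after(int maxcount)
{
    return maxcount;
}
```

Since the value of `len` is never read, deleting it cannot change behaviour, which is what makes this kind of cleanup safe to apply mechanically.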
Re: [PATCH v4 0/3] Optimize tlbflush path
On Fri, 2019-08-30 at 19:50 -0700, Paul Walmsley wrote: > Hi Atish, > > On Thu, 22 Aug 2019, Atish Patra wrote: > > > This series adds a few optimizations to reduce the trap cost in the > > tlb > > flush path. We should make SBI calls for remote tlb flush only > > if > > absolutely required. > > The patches look great. My understanding is that these optimization > patches may actually be a partial workaround for the TLB flushing bug > that > we've been looking at for the last month or so, which can corrupt > memory > or crash the system. > > If that's the case, let's first root-cause the underlying > bug. Otherwise > we'll just be papering over the actual issue, which probably could > still > occur even with this series, correct? Since it contains no explicit > fixes? > > I have verified the glibc locale install issue both in Qemu and Unleashed. I don't see any issue with OpenSBI master + Linux v5.3 kernel. As per our investigation, it looks like a hardware erratum on the Unleashed board, as the memory corruption issue only happens in the case of a tlb range flush. In RISC-V, sfence.vma can only be issued one page at a time. If the range is larger than that, OpenSBI has to issue multiple sfence.vma calls back to back, leading to possible memory corruption. Currently, OpenSBI has a platform feature, "tlb_range_flush_limit", that allows configuring the tlb flush threshold per platform. Any tlb flush range request greater than this threshold is converted to a full flush. Currently, it is set to the default value 4K for every platform[1]. Glibc locale install memory corruption only happens if this threshold is changed to a higher value, i.e. 1G. This doesn't change anything in the OpenSBI code path except the fact that it will issue many sfence.vma instructions back to back instead of one. If the hardware team at SiFive can look into this as well, it would be great.
To conclude, we think this issue needs to be investigated by the hardware team, and the kernel patches can be merged to get the performance benefit. [1] https://github.com/riscv/opensbi/blob/master/include/sbi/sbi_platform.h#L40 > - Paul > > ___ > linux-riscv mailing list > linux-ri...@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-riscv -- Regards, Atish
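For readers following along, here is a hedged userspace sketch of the thresholding Atish describes; the names and the decision function are ours, not OpenSBI's:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL

/* Any remote TLB flush request whose range exceeds the platform's
 * tlb_range_flush_limit is converted into a full flush, so the SBI
 * never issues a long back-to-back run of sfence.vma instructions. */
static bool convert_to_full_flush(uint64_t size, uint64_t range_flush_limit)
{
    return size > range_flush_limit;
}

/* Number of per-page sfence.vma instructions a range flush would need:
 * with a 1G range and a raised threshold, this is 262144 back-to-back
 * fences -- the scenario suspected of corrupting memory. */
static uint64_t sfence_vma_count(uint64_t size)
{
    return (size + PAGE_SIZE - 1) / PAGE_SIZE;
}
```

With the default 4K threshold, anything bigger than one page degenerates to a full flush, which is why the corruption only appears after the threshold is raised.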
[PATCH -next] habanalabs: remove set but not used variable 'ctx'
Fixes gcc '-Wunused-but-set-variable' warning: drivers/misc/habanalabs/device.c: In function hpriv_release: drivers/misc/habanalabs/device.c:45:17: warning: variable ctx set but not used [-Wunused-but-set-variable] It is never used since commit eb7caf84b029 ("habanalabs: maintain a list of file private data objects") Reported-by: Hulk Robot Signed-off-by: YueHaibing --- drivers/misc/habanalabs/device.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c index 459fee7..2f5a4da 100644 --- a/drivers/misc/habanalabs/device.c +++ b/drivers/misc/habanalabs/device.c @@ -42,12 +42,10 @@ static void hpriv_release(struct kref *ref) { struct hl_fpriv *hpriv; struct hl_device *hdev; - struct hl_ctx *ctx; hpriv = container_of(ref, struct hl_fpriv, refcount); hdev = hpriv->hdev; - ctx = hpriv->ctx; put_pid(hpriv->taskpid); -- 2.7.4
Re: [PATCH v2 1/1] leds: remove PAGE_SIZE limit of /sys/class/leds//trigger
On Sat, Sep 28, 2019 at 2:46 Greg Kroah-Hartman : > > On Sat, Sep 28, 2019 at 01:47:21AM +0900, Akinobu Mita wrote: > > On Fri, Sep 27, 2019 at 15:39 Greg Kroah-Hartman : > > > > > > On Sat, Sep 14, 2019 at 12:03:24AM +0900, Akinobu Mita wrote: > > > > Reading /sys/class/leds//trigger returns all available LED > > > > triggers. > > > > However, the size of this file is limited to PAGE_SIZE because of the > > > > limitation for sysfs attribute. > > > > > > > > Enabling LED CPU trigger on systems with thousands of CPUs easily hits > > > > PAGE_SIZE limit, and makes it impossible to see all available LED > > > > triggers > > > > and which trigger is currently activated. > > > > > > > > We work around it here by converting /sys/class/leds//trigger to > > > > binary attribute, which is not limited by length. This is _not_ good > > > > design, do not copy it. > > > > > > > > Cc: Greg Kroah-Hartman > > > > Cc: "Rafael J. Wysocki" > > > > Cc: Jacek Anaszewski > > > > Cc: Pavel Machek > > > > Cc: Dan Murphy > > > > Acked-by: Pavel Machek > > > > Signed-off-by: Akinobu Mita > > > > --- > > > > drivers/leds/led-class.c| 8 ++-- > > > > drivers/leds/led-triggers.c | 90 > > > > ++--- > > > > drivers/leds/leds.h | 6 +++ > > > > include/linux/leds.h| 5 --- > > > > 4 files changed, 79 insertions(+), 30 deletions(-) > > > > > > > > diff --git a/drivers/leds/led-class.c b/drivers/leds/led-class.c > > > > index 4793e77..8b5a1d1 100644 > > > > --- a/drivers/leds/led-class.c > > > > +++ b/drivers/leds/led-class.c > > > > @@ -73,13 +73,13 @@ static ssize_t max_brightness_show(struct device > > > > *dev, > > > > static DEVICE_ATTR_RO(max_brightness); > > > > > > > > #ifdef CONFIG_LEDS_TRIGGERS > > > > -static DEVICE_ATTR(trigger, 0644, led_trigger_show, led_trigger_store); > > > > -static struct attribute *led_trigger_attrs[] = { > > > > - &dev_attr_trigger.attr, > > > > +static BIN_ATTR(trigger, 0644, led_trigger_read, led_trigger_write, 0); > > > > +static struct bin_attribute *led_trigger_bin_attrs[] = { > > 
> > + &bin_attr_trigger, > > > > NULL, > > > > }; > > > > static const struct attribute_group led_trigger_group = { > > > > - .attrs = led_trigger_attrs, > > > > + .bin_attrs = led_trigger_bin_attrs, > > > > }; > > > > #endif > > > > > > > > diff --git a/drivers/leds/led-triggers.c b/drivers/leds/led-triggers.c > > > > index 8d11a5e..ed5a311 100644 > > > > --- a/drivers/leds/led-triggers.c > > > > +++ b/drivers/leds/led-triggers.c > > > > @@ -16,6 +16,7 @@ > > > > #include > > > > #include > > > > #include > > > > +#include > > > > #include "leds.h" > > > > > > > > /* > > > > @@ -26,9 +27,11 @@ LIST_HEAD(trigger_list); > > > > > > > > /* Used by LED Class */ > > > > > > > > -ssize_t led_trigger_store(struct device *dev, struct device_attribute > > > > *attr, > > > > - const char *buf, size_t count) > > > > +ssize_t led_trigger_write(struct file *filp, struct kobject *kobj, > > > > + struct bin_attribute *bin_attr, char *buf, > > > > + loff_t pos, size_t count) > > > > { > > > > + struct device *dev = kobj_to_dev(kobj); > > > > struct led_classdev *led_cdev = dev_get_drvdata(dev); > > > > struct led_trigger *trig; > > > > int ret = count; > > > > @@ -64,39 +67,84 @@ ssize_t led_trigger_store(struct device *dev, > > > > struct device_attribute *attr, > > > > mutex_unlock(&led_cdev->led_access); > > > > return ret; > > > > } > > > > -EXPORT_SYMBOL_GPL(led_trigger_store); > > > > +EXPORT_SYMBOL_GPL(led_trigger_write); > > > > > > > > -ssize_t led_trigger_show(struct device *dev, struct device_attribute > > > > *attr, > > > > - char *buf) > > > > +__printf(4, 5) > > > > +static int led_trigger_snprintf(char *buf, size_t size, bool query, > > > > + const char *fmt, ...) 
> > > > +{ > > > > + va_list args; > > > > + int i; > > > > + > > > > + va_start(args, fmt); > > > > + if (query) > > > > + i = vsnprintf(NULL, 0, fmt, args); > > > > + else > > > > + i = vscnprintf(buf, size, fmt, args); > > > > + va_end(args); > > > > + > > > > + return i; > > > > +} > > > You only call this in one place, why is it needed like this? The "old" > > > code open-coded this, what is this helping with here? > > > > > > And what does "query" mean here? I have no idea how that variable > > > matters, or what it does. Why not just test if buf is NULL or not if > > > you don't want to use it? > > > > > > Ah, you are trying to see how "long" the buffer is going to be. That > > > makes more sense, but just trigger off of the NULL buffer or not, making > > > this a bit more "obvious" what you are doing and not tying two > > > parameters to each other (meaning one always reflects the other one). > > > > We cannot simply replace the "query" flag by checking whether the buffer is NULL.
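For reference, the measure-then-write pattern under discussion can be sketched in plain userspace C (the function name is ours; the kernel helper additionally takes the explicit `query` flag Greg is objecting to, and uses vscnprintf() for the write pass):

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Two-pass formatting: with buf == NULL the helper only measures the
 * would-be output (vsnprintf with a zero-size NULL buffer returns the
 * required length); with a real buffer it actually writes.  The kernel
 * version uses vscnprintf() for the write pass, which differs only in
 * clamping the return value to what actually fit in the buffer. */
static int trigger_snprintf(char *buf, size_t size, const char *fmt, ...)
{
    va_list args;
    int n;

    va_start(args, fmt);
    if (!buf)
        n = vsnprintf(NULL, 0, fmt, args);   /* query pass: count only */
    else
        n = vsnprintf(buf, size, fmt, args); /* write pass */
    va_end(args);
    return n;
}
```

Passing a NULL buffer with size 0 is the standard-C idiom for asking how many characters the formatted output would need, which is what lets the caller size an allocation before formatting into it.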
Re: [PATCH] media: uvcvideo: Use streaming DMA APIs to transfer buffers
On Fri, Aug 2, 2019 at 9:13 PM Shik Chen wrote: > > Similar to the commit 1161db6776bd ("media: usb: pwc: Don't use coherent > DMA buffers for ISO transfer") [1] for the pwc driver. Use streaming DMA > APIs to transfer buffers and sync them explicitly, because accessing > buffers allocated by usb_alloc_coherent() is slow on platforms without > hardware coherent DMA. > > Tested on x86-64 (Intel Celeron 4305U) and armv7l (Rockchip RK3288) with > Logitech Brio 4K camera at 1920x1080 using the WebRTC sample site [3]. > > | | URB (us) | Decode (Gbps) | CPU (%) | > |--||---|-| > | x86-64 coherent | 53 +- 20 | 50.6 |0.24 | > | x86-64 streaming | 55 +- 19 | 50.1 |0.25 | > | armv7l coherent | 342 +- 379 | 1.8 |2.16 | > | armv7l streaming | 99 +- 98 | 11.0 |0.36 | > > The performance characteristics are improved a lot on armv7l, and > remained (almost) unchanged on x86-64. The code used for measurement can > be found at [2]. > > [1] https://git.kernel.org/torvalds/c/1161db6776bd > [2] https://crrev.com/c/1729133 > [3] https://webrtc.github.io/samples/src/content/getusermedia/resolution/ > > Signed-off-by: Shik Chen > --- > The allocated buffer could be as large as 768K when streaming 4K video. > Ideally we should use some generic helper to allocate non-coherent > memory in a more efficient way, such as https://lwn.net/Articles/774429/ > ("make the non-consistent DMA allocator more userful"). > > But I cannot find any existing helper so a simple kzalloc() is used in > this patch. 
The logic to figure out the DMA addressable GFP flags is > similar to __dma_direct_alloc_pages() without the optimistic retries: > https://elixir.bootlin.com/linux/v5.2.5/source/kernel/dma/direct.c#L96 > > drivers/media/usb/uvc/uvc_video.c | 65 +-- > 1 file changed, 45 insertions(+), 20 deletions(-) > > diff --git a/drivers/media/usb/uvc/uvc_video.c > b/drivers/media/usb/uvc/uvc_video.c > index 8fa77a81dd7f2c..962c35478896c4 100644 > --- a/drivers/media/usb/uvc/uvc_video.c > +++ b/drivers/media/usb/uvc/uvc_video.c > @@ -1539,6 +1539,8 @@ static void uvc_video_complete(struct urb *urb) > * Process the URB headers, and optionally queue expensive memcpy > tasks > * to be deferred to a work queue. > */ > + dma_sync_single_for_cpu(&urb->dev->dev, urb->transfer_dma, > + urb->transfer_buffer_length, DMA_FROM_DEVICE); > stream->decode(uvc_urb, buf, buf_meta); > > /* If no async work is needed, resubmit the URB immediately. */ > @@ -1565,18 +1567,51 @@ static void uvc_free_urb_buffers(struct uvc_streaming > *stream) > if (!uvc_urb->buffer) > continue; > > -#ifndef CONFIG_DMA_NONCOHERENT > - usb_free_coherent(stream->dev->udev, stream->urb_size, > - uvc_urb->buffer, uvc_urb->dma); > -#else > + dma_unmap_single(&stream->dev->udev->dev, uvc_urb->dma, > +stream->urb_size, DMA_FROM_DEVICE); > kfree(uvc_urb->buffer); > -#endif > - uvc_urb->buffer = NULL; > } > > stream->urb_size = 0; > } > > +static gfp_t uvc_alloc_gfp_flags(struct device *dev) > +{ > + u64 mask = dma_get_mask(dev); > + > + if (dev->bus_dma_mask) > + mask &= dev->bus_dma_mask; > + > + if (mask < DMA_BIT_MASK(32) && IS_ENABLED(CONFIG_ZONE_DMA)) > + return GFP_DMA; > + > + if (mask < DMA_BIT_MASK(64)) { > + if (IS_ENABLED(CONFIG_ZONE_DMA32)) > + return GFP_DMA32; We're hitting issues with this on a 64-bit ARM platform: where ZONE_DMA32 is enabled (the default), the kmalloc allocation with GFP_DMA32 fails. There is no slab cache for GFP_DMA32, so those calls are expected to fail, AFAICT there are no other (working) kmalloc(..., .. 
| GFP_DMA32) users in the kernel, so I don't think we want to add a cache. If this helps, some discussion here why the cache wasn't added: https://lore.kernel.org/patchwork/patch/1009563/#1198622 This function looks out of place in a high-level driver, but from your comment above (below ---), I guess we may have to live with that until the kernel provides a better API? For the time being, looking at how __dma_direct_alloc_pages works, could you use alloc_pages_node (or dma_alloc_contiguous?) instead of kmalloc, that would let you use GFP_DMA32 as needed? > + if (IS_ENABLED(CONFIG_ZONE_DMA)) > + return GFP_DMA; > + } > + > + return 0; > +} > + > +static char *uvc_alloc_urb_buffer(struct device *dev, size_t size, > + gfp_t gfp_flags, dma_addr_t *dma_handle) > +{ > + void *buffer = kzalloc(size, gfp_flags | uvc_alloc_gfp_flags(dev)); > + > + if (!buffer) > + return NULL; > + > +
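To make the zone-selection logic concrete, here is a hedged userspace sketch of what uvc_alloc_gfp_flags() computes; the GFP_* values and config flags are stand-ins so the sketch runs outside the kernel:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in values: in the kernel these come from <linux/gfp.h> and the
 * ZONE_DMA/ZONE_DMA32 config options. */
#define GFP_NONE  0u
#define GFP_DMA   1u
#define GFP_DMA32 2u

#define DMA_BIT_MASK(n) ((n) == 64 ? ~0ULL : ((1ULL << (n)) - 1))

/* Pick an allocation zone from the device's effective DMA mask: narrow
 * masks force low memory zones; a full 64-bit mask needs no zone at all. */
static unsigned int gfp_for_mask(uint64_t dma_mask, uint64_t bus_dma_mask,
                                 int have_zone_dma, int have_zone_dma32)
{
    uint64_t mask = dma_mask;

    if (bus_dma_mask)
        mask &= bus_dma_mask;

    if (mask < DMA_BIT_MASK(32) && have_zone_dma)
        return GFP_DMA;

    if (mask < DMA_BIT_MASK(64)) {
        if (have_zone_dma32)
            return GFP_DMA32;   /* but note: kmalloc has no DMA32 cache */
        if (have_zone_dma)
            return GFP_DMA;
    }
    return GFP_NONE;            /* any memory is addressable */
}
```

The GFP_DMA32 branch is exactly where the ARM report above bites: the logic is sound for page allocators, but kmalloc() has no GFP_DMA32 slab cache, so the returned flag cannot simply be OR-ed into a kmalloc call.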
Re: KASAN: slab-out-of-bounds Read in bpf_prog_create
Arnd and Al, On Tue, Sep 17, 2019 at 11:49:06AM -0700, syzbot wrote: > Hello, > > syzbot found the following crash on: > > HEAD commit:2015a28f Add linux-next specific files for 20190915 > git tree: linux-next > console output: https://syzkaller.appspot.com/x/log.txt?x=11880d6960 > kernel config: https://syzkaller.appspot.com/x/.config?x=110691c2286b679a > dashboard link: https://syzkaller.appspot.com/bug?extid=eb853b51b10f1befa0b7 > compiler: gcc (GCC) 9.0.0 20181231 (experimental) > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=127c348160 > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1150a70d60 > > The bug was bisected to: > > commit 2f4fa2db75e26995709043c8d3de4632ebed5c4b > Author: Al Viro > Date: Thu Apr 18 03:48:01 2019 + > > compat_ioctl: unify copy-in of ppp filters > > bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=145eee1d60 > final crash:https://syzkaller.appspot.com/x/report.txt?x=165eee1d60 > console output: https://syzkaller.appspot.com/x/log.txt?x=125eee1d60 > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > Reported-by: syzbot+eb853b51b10f1befa...@syzkaller.appspotmail.com > Fixes: 2f4fa2db75e2 ("compat_ioctl: unify copy-in of ppp filters") > > == > BUG: KASAN: slab-out-of-bounds in memcpy include/linux/string.h:404 [inline] > BUG: KASAN: slab-out-of-bounds in bpf_prog_create+0xe9/0x250 > net/core/filter.c:1351 > Read of size 32768 at addr 88809cf74000 by task syz-executor183/8575 > > CPU: 0 PID: 8575 Comm: syz-executor183 Not tainted 5.3.0-rc8-next-20190915 > #0 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Call Trace: > __dump_stack lib/dump_stack.c:77 [inline] > dump_stack+0x172/0x1f0 lib/dump_stack.c:113 > print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374 > __kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506 > kasan_report+0x12/0x20 mm/kasan/common.c:634 > check_memory_region_inline mm/kasan/generic.c:185 
[inline] > check_memory_region+0x134/0x1a0 mm/kasan/generic.c:192 > memcpy+0x24/0x50 mm/kasan/common.c:122 > memcpy include/linux/string.h:404 [inline] > bpf_prog_create+0xe9/0x250 net/core/filter.c:1351 > get_filter.isra.0+0x108/0x1a0 drivers/net/ppp/ppp_generic.c:572 > ppp_get_filter drivers/net/ppp/ppp_generic.c:584 [inline] > ppp_ioctl+0x129d/0x2590 drivers/net/ppp/ppp_generic.c:801 This is a correct bisection. This commit needs: diff --git a/drivers/net/ppp/ppp_generic.c b/drivers/net/ppp/ppp_generic.c index 267fe2c58087..f55d7937d6c5 100644 --- a/drivers/net/ppp/ppp_generic.c +++ b/drivers/net/ppp/ppp_generic.c @@ -564,8 +564,9 @@ static struct bpf_prog *get_filter(struct sock_fprog *uprog) return NULL; /* uprog->len is unsigned short, so no overflow here */ - fprog.len = uprog->len * sizeof(struct sock_filter); - fprog.filter = memdup_user(uprog->filter, fprog.len); + fprog.len = uprog->len; + fprog.filter = memdup_user(uprog->filter, + uprog->len * sizeof(struct sock_filter)); if (IS_ERR(fprog.filter)) return ERR_CAST(fprog.filter);
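The root cause is a units mix-up: sock_fprog.len counts sock_filter entries, not bytes. A hedged userspace sketch (stand-in struct layouts, not the kernel headers) of why the buggy assignment makes the later copy read 8x too much:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in definitions mirroring the classic BPF uapi layout. */
struct sock_filter { uint16_t code; uint8_t jt, jf; uint32_t k; };
struct sock_fprog  { unsigned short len; struct sock_filter *filter; };

/* Buggy: stores the BYTE size in .len, so when bpf_prog_create() later
 * computes len * sizeof(struct sock_filter) for its memcpy, the copy
 * size is multiplied by sizeof(struct sock_filter) twice. */
static size_t copy_size_buggy(unsigned short uprog_len)
{
    struct sock_fprog fprog;
    fprog.len = uprog_len * sizeof(struct sock_filter); /* bytes: wrong */
    return fprog.len * sizeof(struct sock_filter);      /* what BPF copies */
}

/* Fixed: .len stays the entry count; only the user copy uses bytes. */
static size_t copy_size_fixed(unsigned short uprog_len)
{
    struct sock_fprog fprog;
    fprog.len = uprog_len;                              /* entry count */
    return fprog.len * sizeof(struct sock_filter);
}
```

The 8x over-read past the memdup_user() allocation is exactly the slab-out-of-bounds KASAN reports above.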
Re: [GIT PULL] integrity subsystem updates for v5.4
The pull request you sent on Wed, 11 Sep 2019 17:29:25 -0400: > git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity.git > next-integrity has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/f1f2f614d535564992f32e720739cb53cf03489f Thank you! -- Deet-doot-dot, I am a bot. https://korg.wiki.kernel.org/userdoc/prtracker
Re: WARNING: locking bug in selinux_netlbl_socket_connect
On Tue, Sep 24, 2019 at 09:17:35AM -0400, Paul Moore wrote: > On Tue, Sep 24, 2019 at 4:21 AM Dmitry Vyukov wrote: > > On Tue, Sep 24, 2019 at 4:14 AM Paul Moore wrote: > > > On Sat, Sep 21, 2019 at 11:50 AM syzbot > > > wrote: > > > > Hello, > > > > > > > > syzbot found the following crash on: > > > > > > > > HEAD commit:f97c81dc Merge tag 'armsoc-late' of > > > > git://git.kernel.org/p.. > > > > git tree: upstream > > > > console output: https://syzkaller.appspot.com/x/log.txt?x=127b709d60 > > > > kernel config: > > > > https://syzkaller.appspot.com/x/.config?x=10283c2b00ab4cd7 > > > > dashboard link: > > > > https://syzkaller.appspot.com/bug?extid=5fa07e4e18e4eb1ccb12 > > > > compiler: gcc (GCC) 9.0.0 20181231 (experimental) > > > > syz repro: > > > > https://syzkaller.appspot.com/x/repro.syz?x=1299684160 > > > > > > > > IMPORTANT: if you fix the bug, please add the following tag to the > > > > commit: > > > > Reported-by: syzbot+5fa07e4e18e4eb1cc...@syzkaller.appspotmail.com > > > > > > > > WARNING: CPU: 0 PID: 10315 at kernel/locking/lockdep.c:840 > > > > look_up_lock_class kernel/locking/lockdep.c:840 [inline] > > > > WARNING: CPU: 0 PID: 10315 at kernel/locking/lockdep.c:840 > > > > register_lock_class+0x206/0x1850 kernel/locking/lockdep.c:1185 > > > > Kernel panic - not syncing: panic_on_warn set ... 
> > > > CPU: 0 PID: 10315 Comm: syz-executor.0 Not tainted 5.3.0+ #0 > > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > > > > Google 01/01/2011 > > > > Call Trace: > > > > __dump_stack lib/dump_stack.c:77 [inline] > > > > dump_stack+0x172/0x1f0 lib/dump_stack.c:113 > > > > panic+0x2dc/0x755 kernel/panic.c:219 > > > > __warn.cold+0x20/0x4c kernel/panic.c:576 > > > > report_bug+0x263/0x2b0 lib/bug.c:186 > > > > fixup_bug arch/x86/kernel/traps.c:179 [inline] > > > > fixup_bug arch/x86/kernel/traps.c:174 [inline] > > > > do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272 > > > > do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291 > > > > invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1028 > > > > RIP: 0010:look_up_lock_class kernel/locking/lockdep.c:840 [inline] > > > > RIP: 0010:register_lock_class+0x206/0x1850 kernel/locking/lockdep.c:1185 > > > > Code: fc ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 aa 10 00 00 4c 3b > > > > 7b > > > > 18 44 8b 35 d5 de 55 09 74 0b 48 81 3b a0 65 06 8a 74 02 <0f> 0b 45 85 > > > > ed > > > > 0f 84 71 03 00 00 f6 85 70 ff ff ff 01 0f 85 64 03 > > > > RSP: 0018:888096777a48 EFLAGS: 00010002 > > > > RAX: dc00 RBX: 888093ff78e0 RCX: > > > > RDX: 1110127fef1f RSI: RDI: 888093ff78f8 > > > > RBP: 888096777b10 R08: 111012ceef51 R09: 8aaea0e0 > > > > R10: 8a7753c8 R11: R12: 8a7b5d20 > > > > R13: R14: R15: 884766e0 > > > > __lock_acquire+0xf4/0x4e70 kernel/locking/lockdep.c:3837 > > > > lock_acquire+0x190/0x410 kernel/locking/lockdep.c:4487 > > > > __raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline] > > > > _raw_spin_lock_bh+0x33/0x50 kernel/locking/spinlock.c:175 > > > > spin_lock_bh include/linux/spinlock.h:343 [inline] > > > > lock_sock_nested+0x41/0x120 net/core/sock.c:2929 > > > > lock_sock include/net/sock.h:1522 [inline] > > > > selinux_netlbl_socket_connect+0x20/0xc0 > > > > security/selinux/netlabel.c:607 > > > > selinux_socket_connect+0x6a/0x90 security/selinux/hooks.c:4745 > > > > 
security_socket_connect+0x77/0xc0 security/security.c:1958 > > > > __sys_connect+0x19d/0x330 net/socket.c:1824 > > > > __do_sys_connect net/socket.c:1839 [inline] > > > > __se_sys_connect net/socket.c:1836 [inline] > > > > __x64_sys_connect+0x73/0xb0 net/socket.c:1836 > > > > do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290 > > > > entry_SYSCALL_64_after_hwframe+0x49/0xbe > > > > RIP: 0033:0x459a09 > > > > Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 > > > > f7 > > > > 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 > > > > ff > > > > ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 > > > > RSP: 002b:7fc302ec5c78 EFLAGS: 0246 ORIG_RAX: 002a > > > > RAX: ffda RBX: 0003 RCX: 00459a09 > > > > RDX: 001c RSI: 2080 RDI: 0005 > > > > RBP: 0075bf20 R08: R09: > > > > R10: R11: 0246 R12: 7fc302ec66d4 > > > > R13: 004bff42 R14: 004d1eb0 R15: > > > > Kernel Offset: disabled > > > > Rebooting in 86400 seconds.. > > > > > > This doesn't appear to be related to selinux_netlbl_socket_connect(); > > > I believe it should be okay to call lock_sock() in that context. > > > > > > FTR, this is this warning: > > > > static inline struct lock_class * > > look_up_lock_class(const
Re: [PATCH RESEND v4] fs/epoll: Remove unnecessary wakeups of nested epoll that in ET mode
On Wed, 25 Sep 2019 09:56:03 +0800 hev wrote: > From: Heiher > > Take the case where we have: > > t0 > | (ew) > e0 > | (et) > e1 > | (lt) > s0 > > t0: thread 0 > e0: epoll fd 0 > e1: epoll fd 1 > s0: socket fd 0 > ew: epoll_wait > et: edge-trigger > lt: level-trigger > > We only need to wake up nested epoll fds if something has been queued to the > overflow list, since the ep_poll() traverses the rdllist during recursive poll > and thus events on the overflow list may not be visible yet. > > Test code: Looks sane to me. Do you have any performance testing results which show a benefit? epoll maintainership isn't exactly a hive of activity nowadays :( Roman, would you please have time to review this?
[PATCH 4/4] perf docs: Correct and clarify jitdump spec
Specification claims latest version of jitdump file format is 2. Current jit dump reading code treats 1 as the latest version. Correct spec to match code. The original language made it unclear what value should be written in the magic field. Revise the language so that the writer always writes the same value. Specify that the reader uses the value to detect endian mismatches. Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Arnaldo Carvalho de Melo Cc: Mark Rutland Cc: Alexander Shishkin Cc: Jiri Olsa Cc: Namhyung Kim Cc: Stephane Eranian Cc: linux-kernel@vger.kernel.org Signed-off-by: Steve MacLean --- tools/perf/Documentation/jitdump-specification.txt | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/perf/Documentation/jitdump-specification.txt b/tools/perf/Documentation/jitdump-specification.txt index 4c62b07..52152d1 100644 --- a/tools/perf/Documentation/jitdump-specification.txt +++ b/tools/perf/Documentation/jitdump-specification.txt @@ -36,8 +36,8 @@ III/ Jitdump file header format Each jitdump file starts with a fixed size header containing the following fields in order: -* uint32_t magic : a magic number tagging the file type. The value is 4-byte long and represents the string "JiTD" in ASCII form. It is 0x4A695444 or 0x4454694a depending on the endianness. The field can be used to detect the endianness of the file -* uint32_t version : a 4-byte value representing the format version. It is currently set to 2 +* uint32_t magic : a magic number tagging the file type. The value is 4-byte long and represents the string "JiTD" in ASCII form. It is written as 0x4A695444. The reader will detect an endian mismatch when it reads 0x4454694a. +* uint32_t version : a 4-byte value representing the format version. It is currently set to 1 * uint32_t total_size: size in bytes of file header * uint32_t elf_mach : ELF architecture encoding (ELF e_machine value as specified in /usr/include/elf.h) * uint32_t pad1 : padding. Reserved for future use -- 2.7.4
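A reader following the clarified wording can be sketched like this (hedged userspace code; the helper names are ours, the magic values are from the spec):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The writer always emits 0x4A695444 ("JiTD") in its native byte order;
 * a reader that sees 0x4454694a knows every header field must be
 * byte-swapped before use. */
#define JITDUMP_MAGIC         0x4A695444u
#define JITDUMP_MAGIC_SWAPPED 0x4454694Au

static uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0xff00u) |
           ((v << 8) & 0xff0000u) | (v << 24);
}

/* Returns true and stores the (fixed-up) version if the magic matches. */
static bool read_header(uint32_t magic, uint32_t raw_version,
                        uint32_t *version)
{
    if (magic == JITDUMP_MAGIC) {
        *version = raw_version;            /* same endianness as writer */
        return true;
    }
    if (magic == JITDUMP_MAGIC_SWAPPED) {
        *version = bswap32(raw_version);   /* endian mismatch: swap fields */
        return true;
    }
    return false;                          /* not a jitdump file */
}
```

This is why the spec only needs to document one written value: the swapped constant is not a second legal encoding, just what a mismatched reader observes.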
[PATCH 3/4] perf inject --jit: Remove //anon mmap events
While a JIT is jitting code, it will eventually need to commit more pages and change these pages to executable permissions. Typically the JIT will want these co-located to minimize branch displacements. The kernel will coalesce these anonymous mappings with identical permissions before sending an mmap event for the new pages. This means the mmap event for the new pages will include the older pages. These anonymous mmap events will obscure the jitdump injected pseudo events. This means that the jitdump generated symbols, machine code, debugging info, and unwind info will no longer be used. Observations: When a process emits a jit dump marker and a jitdump file, the perf-xxx.map file represents inferior information which has been superseded by the jitdump jit-xxx.dump file. Further, the '//anon*' mmap events are only required for the legacy perf-xxx.map mapping. Summary: Add an rbtree to track which pids have successfully injected a jitdump file. During "perf inject --jit", discard "//anon*" mmap events for any pid which has successfully processed a jitdump file. Committer testing: // jitdump case perf record perf inject --jit --input perf.data --output perfjit.data // verify mmap "//anon" events present initially perf script --input perf.data --show-mmap-events | grep '//anon' // verify mmap "//anon" events removed perf script --input perfjit.data --show-mmap-events | grep '//anon' // no jitdump case perf record perf inject --jit --input perf.data --output perfjit.data // verify mmap "//anon" events present initially perf script --input perf.data --show-mmap-events | grep '//anon' // verify mmap "//anon" events not removed perf script --input perfjit.data --show-mmap-events | grep '//anon' Repro: This issue was discovered while testing the initial CoreCLR jitdump implementation. https://github.com/dotnet/coreclr/pull/26897. 
Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Arnaldo Carvalho de Melo Cc: Mark Rutland Cc: Alexander Shishkin Cc: Jiri Olsa Cc: Namhyung Kim Cc: Stephane Eranian Cc: linux-kernel@vger.kernel.org Signed-off-by: Steve MacLean --- tools/perf/builtin-inject.c | 4 +-- tools/perf/util/jitdump.c | 63 + 2 files changed, 65 insertions(+), 2 deletions(-) diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c index 372ecb3..0f38862 100644 --- a/tools/perf/builtin-inject.c +++ b/tools/perf/builtin-inject.c @@ -263,7 +263,7 @@ static int perf_event__jit_repipe_mmap(struct perf_tool *tool, * if jit marker, then inject jit mmaps and generate ELF images */ ret = jit_process(inject->session, >output, machine, - event->mmap.filename, sample->pid, ); + event->mmap.filename, event->mmap.pid, ); if (ret < 0) return ret; if (ret) { @@ -301,7 +301,7 @@ static int perf_event__jit_repipe_mmap2(struct perf_tool *tool, * if jit marker, then inject jit mmaps and generate ELF images */ ret = jit_process(inject->session, >output, machine, - event->mmap2.filename, sample->pid, ); + event->mmap2.filename, event->mmap2.pid, ); if (ret < 0) return ret; if (ret) { diff --git a/tools/perf/util/jitdump.c b/tools/perf/util/jitdump.c index e3ccb0c..6d891d1 100644 --- a/tools/perf/util/jitdump.c +++ b/tools/perf/util/jitdump.c @@ -749,6 +749,59 @@ jit_detect(char *mmap_name, pid_t pid) return 0; } +struct pid_rbtree +{ + struct rb_node node; + pid_t pid; +}; + +static void jit_add_pid(struct rb_root *root, pid_t pid) +{ + struct rb_node **new = &(root->rb_node), *parent = NULL; + struct pid_rbtree* data = NULL; + + /* Figure out where to put new node */ + while (*new) { + struct pid_rbtree *this = container_of(*new, struct pid_rbtree, node); + pid_t nodePid = this->pid; + + parent = *new; + if (pid < nodePid) + new = &((*new)->rb_left); + else if (pid > nodePid) + new = &((*new)->rb_right); + else + return; + } + + data = malloc(sizeof(struct pid_rbtree)); + data->pid = pid; + + /* Add new node 
and rebalance tree. */ + rb_link_node(&data->node, parent, new); + rb_insert_color(&data->node, root); + + return; +} + +static bool jit_has_pid(struct rb_root *root, pid_t pid) +{ + struct rb_node *node = root->rb_node; + + while (node) { + struct pid_rbtree *this = container_of(node, struct pid_rbtree, node); + pid_t nodePid = this->pid; + + if (pid < nodePid) + node = node->rb_left; + else if (pid > nodePid) + node = node->rb_right; + else + return 1; + } +
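The pid-set semantics of jit_add_pid()/jit_has_pid() above can be sketched in portable C with a plain (unbalanced) binary search tree standing in for the kernel rbtree; note this sketch also checks the allocation result, which the posted patch's malloc() call does not:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* A set of pids backed by a simple BST: insert is idempotent, lookup is
 * a standard compare-and-descend.  The kernel version uses struct
 * rb_root plus rb_link_node()/rb_insert_color() for self-balancing. */
struct pid_node {
    struct pid_node *left, *right;
    int pid;
};

static void pid_set_add(struct pid_node **root, int pid)
{
    while (*root) {
        if (pid < (*root)->pid)
            root = &(*root)->left;
        else if (pid > (*root)->pid)
            root = &(*root)->right;
        else
            return;                       /* already tracked: no-op */
    }
    *root = calloc(1, sizeof(**root));
    if (*root)                            /* unlike the patch, check alloc */
        (*root)->pid = pid;
}

static bool pid_set_has(const struct pid_node *root, int pid)
{
    while (root) {
        if (pid < root->pid)
            root = root->left;
        else if (pid > root->pid)
            root = root->right;
        else
            return true;
    }
    return false;
}
```

One review-worthy detail in the patch itself: the malloc() return value is dereferenced without a NULL check, and a bool function returns 1 rather than true.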
[PATCH 2/4] perf inject jit: Fix JIT_CODE_MOVE filename
During perf inject --jit, JIT_CODE_MOVE records were injecting MMAP records with an incorrect filename. Specifically it was missing the ".so" suffix. Further the JIT_CODE_LOAD record were silently truncating the jr->load.code_index field to 32 bits before generating the filename. Make both records emit the same filename based on the full 64 bit code_index field. Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Arnaldo Carvalho de Melo Cc: Mark Rutland Cc: Alexander Shishkin Cc: Jiri Olsa Cc: Namhyung Kim Cc: Stephane Eranian Cc: linux-kernel@vger.kernel.org Signed-off-by: Steve MacLean --- tools/perf/util/jitdump.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/tools/perf/util/jitdump.c b/tools/perf/util/jitdump.c index 1bdf4c6..e3ccb0c 100644 --- a/tools/perf/util/jitdump.c +++ b/tools/perf/util/jitdump.c @@ -395,7 +395,7 @@ static int jit_repipe_code_load(struct jit_buf_desc *jd, union jr_entry *jr) size_t size; u16 idr_size; const char *sym; - uint32_t count; + uint64_t count; int ret, csize, usize; pid_t pid, tid; struct { @@ -418,7 +418,7 @@ static int jit_repipe_code_load(struct jit_buf_desc *jd, union jr_entry *jr) return -1; filename = event->mmap2.filename; - size = snprintf(filename, PATH_MAX, "%s/jitted-%d-%u.so", + size = snprintf(filename, PATH_MAX, "%s/jitted-%d-%" PRIu64 ".so", jd->dir, pid, count); @@ -529,7 +529,7 @@ static int jit_repipe_code_move(struct jit_buf_desc *jd, union jr_entry *jr) return -1; filename = event->mmap2.filename; - size = snprintf(filename, PATH_MAX, "%s/jitted-%d-%"PRIu64, + size = snprintf(filename, PATH_MAX, "%s/jitted-%d-%" PRIu64 ".so", jd->dir, pid, jr->move.code_index); -- 2.7.4
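The truncation being fixed is easy to demonstrate in userspace; a hedged sketch (filenames simplified, pid hard-coded) of the old %u path versus the PRIu64 path:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Old JIT_CODE_LOAD path: code_index is 64-bit, but it was funnelled
 * through a uint32_t and printed with %u, so the high 32 bits were
 * silently dropped -- producing a different filename than the one the
 * code-move path generates for the same index. */
static void fmt_name_buggy(char *buf, size_t n, uint64_t code_index)
{
    uint32_t count = (uint32_t)code_index;  /* high 32 bits lost here */
    snprintf(buf, n, "jitted-%d-%u.so", 1, count);
}

/* Fixed path: format the full 64-bit value with PRIu64, matching the
 * (also corrected) JIT_CODE_MOVE filename including the ".so" suffix. */
static void fmt_name_fixed(char *buf, size_t n, uint64_t code_index)
{
    snprintf(buf, n, "jitted-%d-%" PRIu64 ".so", 1, code_index);
}
```

Once code_index passes 2^32 the two paths disagree, and the MMAP events injected for a move no longer resolve to the ELF image emitted at load time.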
[PATCH] KVM: Unlimit number of ioeventfd assignments for real
Previously we've tried to unlimit ioeventfd creation (6ea34c9b78c1, "kvm: exclude ioeventfd from counting kvm_io_range limit", 2013-06-04), because that can be easily done by fd limitations and otherwise it can easily reach the current maximum of 1000 iodevices. Meanwhile, we still use the counter to limit the maximum allowed kvm io devices to be created besides ioeventfd.

6ea34c9b78c1 achieved that in most cases, however it'll still fail the ioeventfd creation when non-ioeventfd io devices overflow to 1000. Then the next ioeventfd creation will fail, while logically it should be the next non-ioeventfd iodevice creation to fail.

That's not really a big problem at all, because when it happens it probably means something has leaked in userspace (or even a malicious program), so it's a bug to fix there. However, an error message like "ioeventfd creation failed" with an -ENOSPC is really confusing and may let people think that it's the ioeventfd that is leaked (while in most cases it's not!).

Let's use this patch to unlimit the creation of ioeventfd for real this time, assuming this is also a bugfix of 6ea34c9b78c1. More importantly to me, when there is a bug in userspace this patch can probably give us a more meaningful failure on what has overflowed/leaked, rather than "ioeventfd creation failure: -ENOSPC".

CC: Dr.
David Alan Gilbert
Signed-off-by: Peter Xu
---
 include/linux/kvm_host.h |  3 +++
 virt/kvm/eventfd.c       |  4 ++--
 virt/kvm/kvm_main.c      | 23 ---
 3 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index fcb46b3374c6..d8530e7d85d4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -192,6 +192,9 @@
 int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
 		    int len, void *val);
 int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
 			    int len, struct kvm_io_device *dev);
+int kvm_io_bus_register_dev_ioeventfd(struct kvm *kvm, enum kvm_bus bus_idx,
+				      gpa_t addr, int len,
+				      struct kvm_io_device *dev);
 void kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
 			       struct kvm_io_device *dev);
 struct kvm_io_device *kvm_io_bus_get_dev(struct kvm *kvm, enum kvm_bus bus_idx,
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 67b6fc153e9c..3cb0e1c3279b 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -823,8 +823,8 @@ static int kvm_assign_ioeventfd_idx(struct kvm *kvm,
 	kvm_iodevice_init(>dev, _ops);
 
-	ret = kvm_io_bus_register_dev(kvm, bus_idx, p->addr, p->length,
-				      >dev);
+	ret = kvm_io_bus_register_dev_ioeventfd(kvm, bus_idx, p->addr,
+						p->length, >dev);
 	if (ret < 0)
 		goto unlock_fail;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c6a91b044d8d..242cfcaa9a56 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3809,8 +3809,10 @@
 }
 
 /* Caller must hold slots_lock. */
-int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
-			    int len, struct kvm_io_device *dev)
+static int __kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx,
+				     gpa_t addr, int len,
+				     struct kvm_io_device *dev,
+				     bool check_limit)
 {
 	int i;
 	struct kvm_io_bus *new_bus, *bus;
@@ -3821,7 +3823,8 @@
 		return -ENOMEM;
 
 	/* exclude ioeventfd which is limited by maximum fd */
-	if (bus->dev_count - bus->ioeventfd_count > NR_IOBUS_DEVS - 1)
+	if (check_limit &&
+	    (bus->dev_count - bus->ioeventfd_count > NR_IOBUS_DEVS - 1))
 		return -ENOSPC;
 
 	new_bus = kmalloc(struct_size(bus, range, bus->dev_count + 1),
@@ -3851,6 +3854,20 @@
 	return 0;
 }
 
+int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
+			    int len, struct kvm_io_device *dev)
+{
+	return __kvm_io_bus_register_dev(kvm, bus_idx, addr, len, dev, true);
+}
+
+int kvm_io_bus_register_dev_ioeventfd(struct kvm *kvm, enum kvm_bus bus_idx,
+				      gpa_t addr, int len,
+				      struct kvm_io_device *dev)
+{
+	return __kvm_io_bus_register_dev(kvm, bus_idx, addr, len, dev, false);
+}
+
+
 /* Caller must hold slots_lock. */
 void kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
[PATCH 1/4] perf map: fix overlapped map handling
Whenever an mmap/mmap2 event occurs, the map tree must be updated to add a new entry. If a new map overlaps a previous map, the overlapped section of the previous map is effectively unmapped, but the non-overlapping sections are still valid.

maps__fixup_overlappings() is responsible for creating any new map entries from the previously overlapped map. It optionally creates a before and an after map.

When creating the after map, the existing code failed to adjust map.pgoff. This meant the new after map would incorrectly calculate the file offset for the ip. This results in incorrect symbol name resolution for any ip in the after region.

Make maps__fixup_overlappings() correctly populate map.pgoff. Add an assert that the new mapping matches the old mapping at the beginning of the after map.

Committer-testing:

Validated correct parsing of libcoreclr.so symbols from .NET Core 3.0 preview9 (which didn't strip symbols).

Preparation:

  ~/dotnet3.0-preview9/dotnet new webapi -o perfSymbol
  cd perfSymbol
  ~/dotnet3.0-preview9/dotnet publish
  perf record ~/dotnet3.0-preview9/dotnet \
      bin/Debug/netcoreapp3.0/publish/perfSymbol.dll
  ^C

Before:

  perf script --show-mmap-events 2>&1 | grep -e MMAP -e unknown |\
      grep libcoreclr.so | head -n 4
  dotnet 1907 373352.698780: PERF_RECORD_MMAP2 1907/1907: \
      [0x7fe615726000(0x768000) @ 0 08:02 5510620 765057155]: \
      r-xp .../3.0.0-preview9-19423-09/libcoreclr.so
  dotnet 1907 373352.701091: PERF_RECORD_MMAP2 1907/1907: \
      [0x7fe615974000(0x1000) @ 0x24e000 08:02 5510620 765057155]: \
      rwxp .../3.0.0-preview9-19423-09/libcoreclr.so
  dotnet 1907 373352.701241: PERF_RECORD_MMAP2 1907/1907: \
      [0x7fe615c42000(0x1000) @ 0x51c000 08:02 5510620 765057155]: \
      rwxp .../3.0.0-preview9-19423-09/libcoreclr.so
  dotnet 1907 373352.705249: 25 cpu-clock: \
      7fe6159a1f99 [unknown] \
      (.../3.0.0-preview9-19423-09/libcoreclr.so)

After:

  perf script --show-mmap-events 2>&1 | grep -e MMAP -e unknown |\
      grep libcoreclr.so | head -n 4
  dotnet 1907 373352.698780: PERF_RECORD_MMAP2 1907/1907: \
      [0x7fe615726000(0x768000) @ 0 08:02 5510620 765057155]: \
      r-xp .../3.0.0-preview9-19423-09/libcoreclr.so
  dotnet 1907 373352.701091: PERF_RECORD_MMAP2 1907/1907: \
      [0x7fe615974000(0x1000) @ 0x24e000 08:02 5510620 765057155]: \
      rwxp .../3.0.0-preview9-19423-09/libcoreclr.so
  dotnet 1907 373352.701241: PERF_RECORD_MMAP2 1907/1907: \
      [0x7fe615c42000(0x1000) @ 0x51c000 08:02 5510620 765057155]: \
      rwxp .../3.0.0-preview9-19423-09/libcoreclr.so

All the [unknown] symbols were resolved.

Tested-by: Brian Robbins
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Arnaldo Carvalho de Melo
Cc: Mark Rutland
Cc: Alexander Shishkin
Cc: Jiri Olsa
Cc: Namhyung Kim
Cc: Stephane Eranian
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Steve MacLean
---
 tools/perf/util/map.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/perf/util/map.c b/tools/perf/util/map.c
index 5b83ed1..eec9b28 100644
--- a/tools/perf/util/map.c
+++ b/tools/perf/util/map.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include "symbol.h"
+#include <assert.h>
 #include
 #include
 #include
@@ -850,6 +851,8 @@ static int maps__fixup_overlappings(struct maps *maps, struct map *map, FILE *fp
 		}
 
 		after->start = map->end;
+		after->pgoff += map->end - pos->start;
+		assert(pos->map_ip(pos, map->end) == after->map_ip(after, map->end));
 		__map_groups__insert(pos->groups, after);
 		if (verbose >= 2 && !use_browser)
 			map__fprintf(after, fp);
-- 
2.7.4
RE: [PATCH] perf map: fix overlapped map handling
>> An earlier version of this patch used:
>>        after->start = map->end;
>> +      after->pgoff += map->end - pos->start;
>>
>> Instead of the newer, functionally equivalent:
>>        after->start = map->end;
>> +      after->pgoff = pos->map_ip(pos, map->end);
>>
>> I preferred the latter form as it made more sense with the assertion that
>> the mapping of map->end should match in pos and after.
>
> So, if they are equivalent then I think its better to use code that
> resembles the kernel as much as possible, so that when in doubt we can
> compare the tools/perf calcs with how the kernel does it, filtering out
> things like the PAGE_SHIFT, can we go that way?
>
> Also, do you have some reproducer? If you have one then we can try and have
> this as a 'perf test' entry, bolting some more checks into
> tools/perf/tests/perf-record.c or using it as a start for a test that
> stresses this code.
>
> This is not a prerequisite for having your fix on, but would help checking
> that perf doesn't regress in this area.
>
> - Arnaldo

I have updated the patch to use the earlier version, which more closely matches the kernel. I have updated the commit message to include the repro info.

I am including a few other patches I have generated while adding support for perf jitdump to coreclr.
Re: [GIT PULL] nfsd changes for 5.4
The pull request you sent on Fri, 27 Sep 2019 16:08:38 -0400: > git://linux-nfs.org/~bfields/linux.git tags/nfsd-5.4 has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/4f375483559c94ec7c58fa3499e28e327d1ef911 Thank you! -- Deet-doot-dot, I am a bot. https://korg.wiki.kernel.org/userdoc/prtracker
Re: Re: [GIT PULL] nfsd changes for 5.4
The pull request you sent on Fri, 27 Sep 2019 19:55:54 -0400: > git://linux-nfs.org/~bfields/linux.git tags/nfsd-5.4 has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/298fb76a5583900a155d387efaf37a8b39e5dea2 Thank you! -- Deet-doot-dot, I am a bot. https://korg.wiki.kernel.org/userdoc/prtracker
Re: x86/purgatory: undefined symbol __stack_chk_fail
On 9/3/19 8:50 AM, Andreas Smas wrote:
> Hi,
>
> For me, kernels built including this commit
> b059f801a937 (x86/purgatory: Use CFLAGS_REMOVE rather than reset KBUILD_CFLAGS)
>
> results in kexec() failing to load the kernel:
>
> kexec: Undefined symbol: __stack_chk_fail
> kexec-bzImage64: Loading purgatory failed
>
> Can be seen:
>
> $ readelf -a arch/x86/purgatory/purgatory.ro | grep UND
>      0: 0 NOTYPE LOCAL  DEFAULT UND
>     51: 0 NOTYPE GLOBAL DEFAULT UND __stack_chk_fail
>
> Using: gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
>
> Adding -ffreestanding or -fno-stack-protector to ccflags-y in
> arch/x86/purgatory/Makefile
> fixes the problem. Not sure which would be preferred.

Hi,
Do you have a kernel .config file that causes this? I can't seem to reproduce it.

Thanks.
-- 
~Randy
[PATCH] iio: adc: meson-saradc: Variables could be uninitalized if regmap_read() fails
Several functions in this file use regmap_read() to initialize a local variable; however, if regmap_read() fails, the variable could be used while uninitialized, which is potentially unsafe. The return value of regmap_read() should be checked and handled.

Signed-off-by: Yizhuo
---
 drivers/iio/adc/meson_saradc.c | 28 +++-
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/drivers/iio/adc/meson_saradc.c b/drivers/iio/adc/meson_saradc.c
index 7b28d045d271..c032a64108b4 100644
--- a/drivers/iio/adc/meson_saradc.c
+++ b/drivers/iio/adc/meson_saradc.c
@@ -323,6 +323,7 @@ static int meson_sar_adc_wait_busy_clear(struct iio_dev *indio_dev)
 {
 	struct meson_sar_adc_priv *priv = iio_priv(indio_dev);
 	int regval, timeout = 1;
+	int ret;
 
 	/*
 	 * NOTE: we need a small delay before reading the status, otherwise
@@ -331,7 +332,9 @@ static int meson_sar_adc_wait_busy_clear(struct iio_dev *indio_dev)
 	 */
 	do {
 		udelay(1);
-		regmap_read(priv->regmap, MESON_SAR_ADC_REG0, );
+		ret = regmap_read(priv->regmap, MESON_SAR_ADC_REG0, );
+		if (ret)
+			return ret;
 	} while (FIELD_GET(MESON_SAR_ADC_REG0_BUSY_MASK, regval) && timeout--);
 
 	if (timeout < 0)
@@ -358,7 +361,11 @@ static int meson_sar_adc_read_raw_sample(struct iio_dev *indio_dev,
 		return -EINVAL;
 	}
 
-	regmap_read(priv->regmap, MESON_SAR_ADC_FIFO_RD, );
+	int ret = regmap_read(priv->regmap, MESON_SAR_ADC_FIFO_RD, );
+
+	if (ret)
+		return ret;
+
 	fifo_chan = FIELD_GET(MESON_SAR_ADC_FIFO_RD_CHAN_ID_MASK, regval);
 	if (fifo_chan != chan->address) {
 		dev_err(_dev->dev,
@@ -491,6 +498,7 @@ static int meson_sar_adc_lock(struct iio_dev *indio_dev)
 {
 	struct meson_sar_adc_priv *priv = iio_priv(indio_dev);
 	int val, timeout = 1;
+	int ret;
 
 	mutex_lock(_dev->mlock);
 
@@ -506,7 +514,10 @@ static int meson_sar_adc_lock(struct iio_dev *indio_dev)
 	 */
 	do {
 		udelay(1);
-		regmap_read(priv->regmap, MESON_SAR_ADC_DELAY, );
+		ret = regmap_read(priv->regmap,
+				  MESON_SAR_ADC_DELAY, );
+		if (ret)
+			return ret;
 	} while (val & MESON_SAR_ADC_DELAY_BL30_BUSY && timeout--);
 
 	if (timeout < 0) {
@@ -784,7 +795,10 @@ static int meson_sar_adc_init(struct iio_dev *indio_dev)
 	 * BL30 to make sure BL30 gets the values it expects when
 	 * reading the temperature sensor.
 	 */
-	regmap_read(priv->regmap, MESON_SAR_ADC_REG3, );
+	ret = regmap_read(priv->regmap, MESON_SAR_ADC_REG3, );
+	if (ret)
+		return ret;
+
 	if (regval & MESON_SAR_ADC_REG3_BL30_INITIALIZED)
 		return 0;
 	}
@@ -1014,7 +1028,11 @@ static irqreturn_t meson_sar_adc_irq(int irq, void *data)
 	unsigned int cnt, threshold;
 	u32 regval;
 
-	regmap_read(priv->regmap, MESON_SAR_ADC_REG0, );
+	int ret = regmap_read(priv->regmap, MESON_SAR_ADC_REG0, );
+
+	if (ret)
+		return ret;
+
 	cnt = FIELD_GET(MESON_SAR_ADC_REG0_FIFO_COUNT_MASK, regval);
 	threshold = FIELD_GET(MESON_SAR_ADC_REG0_FIFO_CNT_IRQ_MASK, regval);
-- 
2.17.1
Re: [PATCH] Input: hyperv-keyboard: Add the support of hibernation
On Sat, Sep 21, 2019 at 06:56:04AM +, Dexuan Cui wrote:
> > From: dmitry.torok...@gmail.com
> > Sent: Thursday, September 19, 2019 9:18 AM
> >
> > Hi Dexuan,
> >
> > On Wed, Sep 11, 2019 at 11:36:20PM +, Dexuan Cui wrote:
> > > We need hv_kbd_pm_notify() to make sure the pm_wakeup_hard_event() call
> > > does not prevent the system from entering hibernation: the hibernation
> > > is a relatively long process, which can be aborted by the call
> > > pm_wakeup_hard_event(), which is invoked upon keyboard events.
> > >
> > > diff --git a/drivers/input/serio/hyperv-keyboard.c b/drivers/input/serio/hyperv-keyboard.c
> > > index 88ae7c2..277dc4c 100644
> > > --- a/drivers/input/serio/hyperv-keyboard.c
> > > +++ b/drivers/input/serio/hyperv-keyboard.c
> > > @@ -10,6 +10,7 @@
> > >  #include
> > >  #include
> > >  #include
> > > +#include
> > >
> > >  /*
> > >   * Current version 1.0
> > > @@ -95,6 +96,9 @@ struct hv_kbd_dev {
> > >  	struct completion wait_event;
> > >  	spinlock_t lock; /* protects 'started' field */
> > >  	bool started;
> > > +
> > > +	struct notifier_block pm_nb;
> > > +	bool hibernation_in_progress;
> >
> > Why do you use notifier block instead of exposing proper PM methods if
> > you want to support hibernation?
> >
> > Dmitry

> Hi,
> In the patch I do implement hv_kbd_suspend() and hv_kbd_resume(), and
> add them into the hv_kbd_drv struct:
>
> @@ -416,6 +472,8 @@ static struct hv_driver hv_kbd_drv = {
> 	.id_table = id_table,
> 	.probe = hv_kbd_probe,
> 	.remove = hv_kbd_remove,
> +	.suspend = hv_kbd_suspend,
> +	.resume = hv_kbd_resume,
>
> The .suspend and .resume callbacks are introduced by another patch (which
> uses the dev_pm_ops struct):
> 271b2224d42f ("Drivers: hv: vmbus: Implement suspend/resume for VSC drivers for hibernation")
> (which is on the Hyper-V tree's hyperv-next branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/commit/?h=hyperv-next=271b2224d42f88870e6b060924ee374871c131fc
> )
>
> The only purpose of the notifier is to set the variable
> kbd_dev->hibernation_in_progress to true during the hibernation process.
>
> As I explained in the changelog, the hibernation is a long process (which
> can take 10+ seconds), during which the user may unintentionally touch
> the keyboard, causing key up/down events, which are still handled by
> hv_kbd_on_receive(), which calls pm_wakeup_hard_event(), which
> calls some other functions which increase the global counter
> "pm_abort_suspend", and hence pm_wakeup_pending() becomes true.
>
> pm_wakeup_pending() is tested in a lot of places in the suspend
> process and eventually an unintentional keystroke (or mouse movement,
> when it comes to the Hyper-V mouse driver drivers/hid/hid-hyperv.c)
> causes the whole hibernation process to be aborted. Usually this
> behavior is not expected by the user, I think.

Why not? If a device is configured as a wakeup source, then its activity
should wake up the system, unless you disable it.

> So, I use the notifier to set the flag variable and with it the driver can
> know when it should not call pm_wakeup_hard_event().

No, please implement hibernation support properly, as notifier + flag is a hack.
In this particular case you do not want your hv_kbd_resume() to be called in
place of pm_ops->thaw(), as that is what re-enables the keyboard vmbus channel
and causes the undesired wakeup events. Your vmbus implementation should allow
individual drivers to control the set of PM operations that they wish to use,
instead of forcing everything through suspend/resume.

Thanks.

-- 
Dmitry
[PATCH] iio: adc: imx25-gcq: Variable could be uninitialized if regmap_read() fails
In function mx25_gcq_irq(), the local variable "stats" could be uninitialized if regmap_read() returns -EINVAL; however, this value is then used in an if statement, which is potentially unsafe. The same applies to the variable "data" in function mx25_gcq_get_raw_value() in the same file.

Signed-off-by: Yizhuo
---
 drivers/iio/adc/fsl-imx25-gcq.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/iio/adc/fsl-imx25-gcq.c b/drivers/iio/adc/fsl-imx25-gcq.c
index df19ecae52f7..dbf3e8e85aba 100644
--- a/drivers/iio/adc/fsl-imx25-gcq.c
+++ b/drivers/iio/adc/fsl-imx25-gcq.c
@@ -74,7 +74,10 @@ static irqreturn_t mx25_gcq_irq(int irq, void *data)
 	struct mx25_gcq_priv *priv = data;
 	u32 stats;
 
-	regmap_read(priv->regs, MX25_ADCQ_SR, );
+	int ret = regmap_read(priv->regs, MX25_ADCQ_SR, );
+
+	if (ret)
+		return ret;
 
 	if (stats & MX25_ADCQ_SR_EOQ) {
 		regmap_update_bits(priv->regs, MX25_ADCQ_MR,
@@ -121,7 +124,10 @@ static int mx25_gcq_get_raw_value(struct device *dev,
 		return -ETIMEDOUT;
 	}
 
-	regmap_read(priv->regs, MX25_ADCQ_FIFO, );
+	int ret = regmap_read(priv->regs, MX25_ADCQ_FIFO, );
+
+	if (ret)
+		return ret;
 
 	*val = MX25_ADCQ_FIFO_DATA(data);
-- 
2.17.1
Re: Summary of Crash/Panic Behaviors in Linux Kernel
On Fri, Sep 27, 2019 at 5:01 AM 慕冬亮 wrote:
>
> Dear all,
>
> Is there any summary of crash/panic behaviors in the Linux Kernel? For
> example, GPF (general protection fault), Kernel BUG (BUG_ON).

There are a number of blogs and other material which describe the behavior of a kernel panic. Talking about an Android Linux device, the behavior of a kernel panic is as follows:

USERDEBUG version (engineering version)
* On kernel panic, the device suddenly stops execution and keeps displaying the kernel log on the screen. Below is one example:

[ 34.187856 01-05 13:30:49.424] Unable to handle kernel NULL pointer dereference at virtual address
[ 34.197812 01-05 13:30:49.434] pgd = e6f9c000
[ 34.202275 01-05 13:30:49.438] [] *pgd=
[ 34.207610 01-05 13:30:49.444] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[ 34.214663 01-05 13:30:49.451] Modules linked in: bcmdhd(O)
[ 34.220400 01-05 13:30:49.457] CPU: 0 Tainted: G W O (3.4.65-gbc0bf75 #1)
[ 34.228317 01-05 13:30:49.464] PC is at ext4_free_inode+0x288/0x574
[ 34.234664 01-05 13:30:49.471] LR is at ext4_free_inode+0x260/0x574

Kernel engineers are able to work out the cause of the kernel panic from the kernel log.

USER version (production version)
* When a kernel panic occurs, the device suddenly stops execution and starts rebooting again.

In addition, we can describe BUG_ON() and exceptions as follows:

BUG_ON()
* If a subsystem realizes that it is running under a critical condition, BUG_ON() is called to cause a kernel panic. Please be aware that BUG_ON or BUG is executed only with CONFIG_BUG enabled. Normally, many Linux devices are running with CONFIG_BUG enabled *by default*.

Exception (data abort, prefetch abort: ARM processor)
* If the MMU cannot handle a virtual address translation, an exception occurs. The program counter jumps into a predefined exception vector. Usually panic() is called and then the system crashes.

Hope the above would be helpful.

Thanks,
Austin Kim

>
> --
> My best regards to you.
>
> No System Is Safe!
> Dongliang Mu
Re: [PATCH v2] Input: atmel_mxt_ts - Disable IRQ across suspend
Hi Evan,

On Tue, Sep 24, 2019 at 02:52:38PM -0700, Evan Green wrote:
> Across suspend and resume, we are seeing error messages like the following:
>
> atmel_mxt_ts i2c-PRP0001:00: __mxt_read_reg: i2c transfer failed (-121)
> atmel_mxt_ts i2c-PRP0001:00: Failed to read T44 and T5 (-121)
>
> This occurs because the driver leaves its IRQ enabled. Upon resume, there
> is an IRQ pending, but the interrupt is serviced before both the driver and
> the underlying I2C bus have been resumed. This causes EREMOTEIO errors.
>
> Disable the IRQ in suspend, and re-enable it on resume. If there are cases
> where the driver enters suspend with interrupts disabled, that's a bug we
> should fix separately.
>
> Signed-off-by: Evan Green
> ---
>
> Changes in v2:
> - Enable and disable unconditionally (Dmitry)
>
>  drivers/input/touchscreen/atmel_mxt_ts.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/input/touchscreen/atmel_mxt_ts.c b/drivers/input/touchscreen/atmel_mxt_ts.c
> index 24c4b691b1c9..a58092488679 100644
> --- a/drivers/input/touchscreen/atmel_mxt_ts.c
> +++ b/drivers/input/touchscreen/atmel_mxt_ts.c
> @@ -3155,6 +3155,7 @@ static int __maybe_unused mxt_suspend(struct device *dev)
>  	mxt_stop(data);
>
>  	mutex_unlock(_dev->mutex);
> +	disable_irq(data->irq);
>
>  	return 0;
>  }
> @@ -3174,6 +3175,7 @@ static int __maybe_unused mxt_resume(struct device *dev)
>  	mxt_start(data);
>
>  	mutex_unlock(_dev->mutex);
> +	enable_irq(data->irq);

At least for older devices that do soft reset on resume we need interrupts to
already work when we call mxt_start(). In general, order of resume steps
should mirror suspend.

Thanks.

-- 
Dmitry
[PATCH] staging: rtl8723bs: Variable rf_type in function rtw_cfg80211_init_wiphy() could be uninitialized
In function rtw_cfg80211_init_wiphy(), the local variable "rf_type" could be uninitialized if rtw_hal_get_hwreg() fails to initialize it. However, this value is used in rtw_cfg80211_init_ht_capab() to decide the value written to ht_cap, which is potentially unsafe.

Signed-off-by: Yizhuo
---
 drivers/staging/rtl8723bs/os_dep/ioctl_cfg80211.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/rtl8723bs/os_dep/ioctl_cfg80211.c b/drivers/staging/rtl8723bs/os_dep/ioctl_cfg80211.c
index 9bc685632651..dd39a581b7ef 100644
--- a/drivers/staging/rtl8723bs/os_dep/ioctl_cfg80211.c
+++ b/drivers/staging/rtl8723bs/os_dep/ioctl_cfg80211.c
@@ -3315,7 +3315,7 @@ static void rtw_cfg80211_init_ht_capab(struct ieee80211_sta_ht_cap *ht_cap, enum
 
 void rtw_cfg80211_init_wiphy(struct adapter *padapter)
 {
-	u8 rf_type;
+	u8 rf_type = RF_MAX_TYPE;
 	struct ieee80211_supported_band *bands;
 	struct wireless_dev *pwdev = padapter->rtw_wdev;
 	struct wiphy *wiphy = pwdev->wiphy;
-- 
2.17.1
Re: [PATCH v3 03/10] sched/fair: remove meaningless imbalance calculation
On Thu, 2019-09-19 at 09:33 +0200, Vincent Guittot wrote: > clean up load_balance and remove meaningless calculation and fields > before > adding new algorithm. > > Signed-off-by: Vincent Guittot Yay. Acked-by: Rik van Riel -- All Rights Reversed. signature.asc Description: This is a digitally signed message part
Re: [PATCH v3 02/10] sched/fair: rename sum_nr_running to sum_h_nr_running
On Thu, 2019-09-19 at 09:33 +0200, Vincent Guittot wrote: > Rename sum_nr_running to sum_h_nr_running because it effectively > tracks > cfs->h_nr_running so we can use sum_nr_running to track rq- > >nr_running > when needed. > > There is no functional changes. > > Signed-off-by: Vincent Guittot > Acked-by: Rik van Riel -- All Rights Reversed. signature.asc Description: This is a digitally signed message part
Re: [PATCH v3 01/10] sched/fair: clean up asym packing
On Thu, 2019-09-19 at 09:33 +0200, Vincent Guittot wrote: > Clean up asym packing to follow the default load balance behavior: > - classify the group by creating a group_asym_packing field. > - calculate the imbalance in calculate_imbalance() instead of > bypassing it. > > We don't need to test twice same conditions anymore to detect asym > packing > and we consolidate the calculation of imbalance in > calculate_imbalance(). > > There is no functional changes. > > Signed-off-by: Vincent Guittot Acked-by: Rik van Riel -- All Rights Reversed. signature.asc Description: This is a digitally signed message part
Re: [GIT PULL] nfsd changes for 5.4
On Fri, Sep 27, 2019 at 03:44:02PM -0700, Linus Torvalds wrote:
> But then the actual code is just one single small commit:
>
> > Dave Wysochanski (1):
> >       SUNRPC: Track writers of the 'channel' file to improve
> >       cache_listeners_exist

And that's also all that was in the diffstat in my pull message, but I somehow didn't notice. I must be tired

> which doesn't actually match any of the things your description says
> should be there.
>
> So I undid my pull - either the description is completely wrong, or
> you tagged the wrong commit.

Yes, the latter. I've redone the tag; the pull request should have been for:

  git://linux-nfs.org/~bfields/linux.git tags/nfsd-5.4

Highlights:

- add a new knfsd file cache, so that we don't have to open and close on each (NFSv2/v3) READ or WRITE. This can speed up read and write in some cases. It also replaces our readahead cache.
- Prevent silent data loss on write errors, by treating write errors like server reboots for the purposes of write caching, thus forcing clients to resend their writes.
- Tweak the code that allocates sessions to be more forgiving, so that NFSv4.1 mounts are less likely to hang when a server already has a lot of clients.
- Eliminate an arbitrary limit on NFSv4 ACL sizes; they should now be limited only by the backend filesystem and the maximum RPC size.
- Allow the server to enforce use of the correct kerberos credentials when a client reclaims state after a reboot.

And some miscellaneous smaller bugfixes and cleanup.

Chuck Lever (2):
      svcrdma: Remove svc_rdma_wq
      svcrdma: Use llist for managing cache of recv_ctxts

Colin Ian King (1):
      sunrpc: clean up indentation issue

Dave Wysochanski (1):
      SUNRPC: Track writers of the 'channel' file to improve cache_listeners_exist

J. Bruce Fields (4):
      Merge nfsd bugfixes
      nfsd: Remove unnecessary NULL checks
      Deprecate nfsd fault injection
      nfsd: eliminate an unnecessary acl size limit

Jeff Layton (12):
      sunrpc: add a new cache_detail operation for when a cache is flushed
      locks: create a new notifier chain for lease attempts
      nfsd: add a new struct file caching facility to nfsd
      nfsd: hook up nfsd_write to the new nfsd_file cache
      nfsd: hook up nfsd_read to the nfsd_file cache
      nfsd: hook nfsd_commit up to the nfsd_file cache
      nfsd: convert nfs4_file->fi_fds array to use nfsd_files
      nfsd: convert fi_deleg_file and ls_file fields to nfsd_file
      nfsd: hook up nfs4_preprocess_stateid_op to the nfsd_file cache
      nfsd: have nfsd_test_lock use the nfsd_file cache
      nfsd: rip out the raparms cache
      nfsd: close cached files prior to a REMOVE or RENAME that would replace target

NeilBrown (2):
      nfsd: handle drc over-allocation gracefully.
      nfsd: degraded slot-count more gracefully as allocation nears exhaustion.

Scott Mayhew (2):
      nfsd: add a "GetVersion" upcall for nfsdcld
      nfsd: add support for upcall version 2

Trond Myklebust (9):
      notify: export symbols for use by the knfsd file cache
      vfs: Export flush_delayed_fput for use by knfsd.
      nfsd: Fix up some unused variable warnings
      nfsd: Fix the documentation for svcxdr_tmpalloc()
      nfsd: nfsd_file cache entries should be per net namespace
      nfsd: Support the server resetting the boot verifier
      nfsd: Don't garbage collect files that might contain write errors
      nfsd: Reset the boot verifier on all write I/O errors
      nfsd: fix nfs read eof detection

YueHaibing (2):
      nfsd: remove duplicated include from filecache.c
      nfsd: Make nfsd_reset_boot_verifier_locked static

 fs/file_table.c          |   1 +
 fs/locks.c               |  62 ++
 fs/nfsd/Kconfig          |   3 +-
 fs/nfsd/Makefile         |   3 +-
 fs/nfsd/acl.h            |   8 -
 fs/nfsd/blocklayout.c    |   3 +-
 fs/nfsd/export.c         |  13 +
 fs/nfsd/filecache.c      | 934 +++
 fs/nfsd/filecache.h      |  61 ++
 fs/nfsd/netns.h          |   4 +
 fs/nfsd/nfs3proc.c       |   9 +-
 fs/nfsd/nfs3xdr.c        |  13 +-
 fs/nfsd/nfs4callback.c   |  35 +-
 fs/nfsd/nfs4layouts.c    |  12 +-
 fs/nfsd/nfs4proc.c       |  97 ++--
 fs/nfsd/nfs4recover.c    | 388 ++---
 fs/nfsd/nfs4state.c      | 239 
 fs/nfsd/nfs4xdr.c        |  56 +-
 fs/nfsd/nfsctl.c         |   1 +
 fs/nfsd/nfsproc.c        |   4
[PATCH v4 11/11] powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing
Skips the slow part of serialize_against_pte_lookup() if there is no running lockless pagetable walk.

Signed-off-by: Leonardo Bras
---
 arch/powerpc/mm/book3s64/pgtable.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 6ba6195bff1b..27966481f2a6 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -95,7 +95,8 @@ static void do_nothing(void *unused)
 void serialize_against_pte_lookup(struct mm_struct *mm)
 {
 	smp_mb();
-	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
+	if (running_lockless_pgtbl_walk(mm))
+		smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
 }
 
 /*
-- 
2.20.1
[PATCH v4 07/11] powerpc/kvm/e500: Applies counting method to monitor lockless pgtbl walks
Applies the counting-based method for monitoring lockless pgtable walks on kvmppc_e500_shadow_map().

Fixes the place where local_irq_restore() is called: previously, if ptep was NULL, local_irq_restore() would never be called.

Signed-off-by: Leonardo Bras
---
 arch/powerpc/kvm/e500_mmu_host.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 321db0fdb9db..a0be76ad8d14 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -473,6 +473,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 	 * We are holding kvm->mmu_lock so a notifier invalidate
 	 * can't run hence pfn won't change.
 	 */
+	start_lockless_pgtbl_walk(kvm->mm);
 	local_irq_save(flags);
 	ptep = find_linux_pte(pgdir, hva, NULL, NULL);
 	if (ptep) {
@@ -481,15 +482,18 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 		if (pte_present(pte)) {
 			wimg = (pte_val(pte) >> PTE_WIMGE_SHIFT) &
 				MAS2_WIMGE_MASK;
-			local_irq_restore(flags);
 		} else {
 			local_irq_restore(flags);
+			end_lockless_pgtbl_walk(kvm->mm);
 			pr_err_ratelimited("%s: pte not present: gfn %lx,pfn %lx\n",
 					   __func__, (long)gfn, pfn);
 			ret = -EINVAL;
 			goto out;
 		}
 	}
+	local_irq_restore(flags);
+	end_lockless_pgtbl_walk(kvm->mm);
+
 	kvmppc_e500_ref_setup(ref, gtlbe, pfn, wimg);
 
 	kvmppc_e500_setup_stlbe(_e500->vcpu, gtlbe, tsize,
-- 
2.20.1
[PATCH v4 10/11] powerpc/book3s_64: Enables counting method to monitor lockless pgtbl walk
Enables the count-based monitoring method for lockless pagetable walks on PowerPC book3s_64. Other architectures/platforms fall back to using generic dummy functions.

Signed-off-by: Leonardo Bras
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 8308f32e9782..eb9b26a4a483 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1370,5 +1370,10 @@ static inline bool pgd_is_leaf(pgd_t pgd)
 	return !!(pgd_raw(pgd) & cpu_to_be64(_PAGE_PTE));
 }
 
+#define __HAVE_ARCH_LOCKLESS_PGTBL_WALK_COUNTER
+void start_lockless_pgtbl_walk(struct mm_struct *mm);
+void end_lockless_pgtbl_walk(struct mm_struct *mm);
+int running_lockless_pgtbl_walk(struct mm_struct *mm);
+
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
-- 
2.20.1
[PATCH v4 05/11] powerpc/perf: Applies counting method to monitor lockless pgtbl walks
Applies the counting-based method for monitoring lockless pgtable walks on
read_user_stack_slow.

Signed-off-by: Leonardo Bras
---
 arch/powerpc/perf/callchain.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/perf/callchain.c b/arch/powerpc/perf/callchain.c
index c84bbd4298a0..9d76194a2a8f 100644
--- a/arch/powerpc/perf/callchain.c
+++ b/arch/powerpc/perf/callchain.c
@@ -113,16 +113,18 @@ static int read_user_stack_slow(void __user *ptr, void *buf, int nb)
 	int ret = -EFAULT;
 	pgd_t *pgdir;
 	pte_t *ptep, pte;
+	struct mm_struct *mm = current->mm;
 	unsigned shift;
 	unsigned long addr = (unsigned long) ptr;
 	unsigned long offset;
 	unsigned long pfn, flags;
 	void *kaddr;

-	pgdir = current->mm->pgd;
+	pgdir = mm->pgd;
 	if (!pgdir)
 		return -EFAULT;

+	start_lockless_pgtbl_walk(mm);
 	local_irq_save(flags);
 	ptep = find_current_mm_pte(pgdir, addr, NULL, &shift);
 	if (!ptep)
@@ -146,6 +148,7 @@ static int read_user_stack_slow(void __user *ptr, void *buf, int nb)
 	ret = 0;
 err_out:
 	local_irq_restore(flags);
+	end_lockless_pgtbl_walk(mm);
 	return ret;
 }
--
2.20.1
[PATCH v4 08/11] powerpc/kvm/book3s_hv: Applies counting method to monitor lockless pgtbl walks
Applies the counting-based method for monitoring all book3s_hv related
functions that do lockless pagetable walks.

Adds comments explaining that some lockless pagetable walks don't need
protection, either because the guest pgd is not a target of THP
collapse/split, or because they are called from Realmode + MSR_EE = 0.

kvmppc_do_h_enter: Fixes where local_irq_restore() must be placed (after
the last usage of ptep).

Signed-off-by: Leonardo Bras
---
 arch/powerpc/kvm/book3s_hv_nested.c | 22 ++++++++++++++++++++--
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 18 ++++++++++++++----
 2 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
index 735e0ac6f5b2..5a641b559de7 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -803,7 +803,11 @@ static void kvmhv_update_nest_rmap_rc(struct kvm *kvm, u64 n_rmap,
 	if (!gp)
 		return;

-	/* Find the pte */
+	/* Find the pte:
+	 * We are walking the nested guest (partition-scoped) page table here.
+	 * We can do this without disabling irq because the Linux MM
+	 * subsystem doesn't do THP splits and collapses on this tree.
+	 */
 	ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift);
 	/*
 	 * If the pte is present and the pfn is still the same, update the pte.
@@ -853,7 +857,11 @@ static void kvmhv_remove_nest_rmap(struct kvm *kvm, u64 n_rmap,
 	if (!gp)
 		return;

-	/* Find and invalidate the pte */
+	/* Find and invalidate the pte:
+	 * We are walking the nested guest (partition-scoped) page table here.
+	 * We can do this without disabling irq because the Linux MM
+	 * subsystem doesn't do THP splits and collapses on this tree.
+	 */
 	ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift);
 	/* Don't spuriously invalidate ptes if the pfn has changed */
 	if (ptep && pte_present(*ptep) && ((pte_val(*ptep) & mask) == hpa))
@@ -921,6 +929,11 @@ static bool kvmhv_invalidate_shadow_pte(struct kvm_vcpu *vcpu,
 	int shift;

 	spin_lock(&kvm->mmu_lock);
+	/*
+	 * We are walking the nested guest (partition-scoped) page table here.
+	 * We can do this without disabling irq because the Linux MM
+	 * subsystem doesn't do THP splits and collapses on this tree.
+	 */
 	ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift);
 	if (!shift)
 		shift = PAGE_SHIFT;
@@ -1362,6 +1375,11 @@ static long int __kvmhv_nested_page_fault(struct kvm_run *run,
 	/* See if can find translation in our partition scoped tables for L1 */
 	pte = __pte(0);
 	spin_lock(&kvm->mmu_lock);
+	/*
+	 * We are walking the secondary (partition-scoped) page table here.
+	 * We can do this without disabling irq because the Linux MM
+	 * subsystem doesn't do THP splits and collapses on this tree.
+	 */
 	pte_p = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
 	if (!shift)
 		shift = PAGE_SHIFT;
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 63e0ce91e29d..2076a7ac230a 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -252,6 +252,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 	 * If we had a page table table change after lookup, we would
 	 * retry via mmu_notifier_retry.
 	 */
+	start_lockless_pgtbl_walk(kvm->mm);
 	if (!realmode)
 		local_irq_save(irq_flags);
 	/*
@@ -287,8 +288,6 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 			pa |= gpa & ~PAGE_MASK;
 		}
 	}
-	if (!realmode)
-		local_irq_restore(irq_flags);

 	ptel &= HPTE_R_KEY | HPTE_R_PP0 | (psize-1);
 	ptel |= pa;
@@ -311,6 +310,9 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 		ptel &= ~(HPTE_R_W|HPTE_R_I|HPTE_R_G);
 		ptel |= HPTE_R_M;
 	}
+	if (!realmode)
+		local_irq_restore(irq_flags);
+	end_lockless_pgtbl_walk(kvm->mm);

 	/* Find and lock the HPTEG slot to use */
  do_insert:
@@ -885,11 +887,19 @@ static int kvmppc_get_hpa(struct kvm_vcpu *vcpu, unsigned long gpa,
 	/* Translate to host virtual address */
 	hva = __gfn_to_hva_memslot(memslot, gfn);

-	/* Try to find the host pte for that virtual address */
+	/* Try to find the host pte for that virtual address :
+	 * Called by hcall_real_table (real mode + MSR_EE=0)
+	 * Interrupts are disabled here.
+	 */
+	start_lockless_pgtbl_walk(kvm->mm);
 	ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);
-	if (!ptep)
+	if (!ptep) {
+		end_lockless_pgtbl_walk(kvm->mm);
 		return H_TOO_HARD;
[PATCH v4 06/11] powerpc/mm/book3s64/hash: Applies counting method to monitor lockless pgtbl walks
Applies the counting-based method for monitoring all hash-related functions
that do lockless pagetable walks.

hash_page_mm: Adds a comment explaining that there is no need for
local_irq_{disable,save}, given that it is only called from the DataAccess
interrupt, so interrupts are already disabled.

Signed-off-by: Leonardo Bras
---
 arch/powerpc/mm/book3s64/hash_tlb.c   |  2 ++
 arch/powerpc/mm/book3s64/hash_utils.c | 12 +++++++++++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c
index 4a70d8dd39cd..5e5213c3f7c4 100644
--- a/arch/powerpc/mm/book3s64/hash_tlb.c
+++ b/arch/powerpc/mm/book3s64/hash_tlb.c
@@ -209,6 +209,7 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
 	 * to being hashed). This is not the most performance oriented
 	 * way to do things but is fine for our needs here.
 	 */
+	start_lockless_pgtbl_walk(mm);
 	local_irq_save(flags);
 	arch_enter_lazy_mmu_mode();
 	for (; start < end; start += PAGE_SIZE) {
@@ -230,6 +231,7 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
 	}
 	arch_leave_lazy_mmu_mode();
 	local_irq_restore(flags);
+	end_lockless_pgtbl_walk(mm);
 }

 void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index b8ad14bb1170..8615fab87c43 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -1321,7 +1321,11 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
 		ea &= ~((1ul << mmu_psize_defs[psize].shift) - 1);
 #endif /* CONFIG_PPC_64K_PAGES */

-	/* Get PTE and page size from page tables */
+	/* Get PTE and page size from page tables :
+	 * Called in from DataAccess interrupt (data_access_common: 0x300),
+	 * interrupts are disabled here.
+	 */
+	start_lockless_pgtbl_walk(mm);
 	ptep = find_linux_pte(pgdir, ea, &is_thp, &hugeshift);
 	if (ptep == NULL || !pte_present(*ptep)) {
 		DBG_LOW(" no PTE !\n");
@@ -1438,6 +1442,7 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
 	DBG_LOW(" -> rc=%d\n", rc);

 bail:
+	end_lockless_pgtbl_walk(mm);
 	exception_exit(prev_state);
 	return rc;
 }
@@ -1547,10 +1552,12 @@ void hash_preload(struct mm_struct *mm, unsigned long ea,
 	vsid = get_user_vsid(&mm->context, ea, ssize);
 	if (!vsid)
 		return;
+
 	/*
 	 * Hash doesn't like irqs. Walking linux page table with irq disabled
 	 * saves us from holding multiple locks.
 	 */
+	start_lockless_pgtbl_walk(mm);
 	local_irq_save(flags);

 	/*
@@ -1597,6 +1604,7 @@ void hash_preload(struct mm_struct *mm, unsigned long ea,
 			   pte_val(*ptep));
 out_exit:
 	local_irq_restore(flags);
+	end_lockless_pgtbl_walk(mm);
 }

 #ifdef CONFIG_PPC_MEM_KEYS
@@ -1613,11 +1621,13 @@ u16 get_mm_addr_key(struct mm_struct *mm, unsigned long address)
 	if (!mm || !mm->pgd)
 		return 0;

+	start_lockless_pgtbl_walk(mm);
 	local_irq_save(flags);
 	ptep = find_linux_pte(mm->pgd, address, NULL, NULL);
 	if (ptep)
 		pkey = pte_to_pkey_bits(pte_val(READ_ONCE(*ptep)));
 	local_irq_restore(flags);
+	end_lockless_pgtbl_walk(mm);

 	return pkey;
 }
--
2.20.1
[PATCH v4 03/11] mm/gup: Applies counting method to monitor gup_pgd_range
As described, gup_pgd_range is a lockless pagetable walk. So, in order to
monitor against THP split/collapse with the counting method, it's necessary
to bound it with {start,end}_lockless_pgtbl_walk.

There are dummy functions, so it is not going to add any overhead on archs
that don't use this method.

Signed-off-by: Leonardo Bras
---
 mm/gup.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index 98f13ab37bac..7105c829cf44 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2325,6 +2325,7 @@ static bool gup_fast_permitted(unsigned long start, unsigned long end)
 int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			  struct page **pages)
 {
+	struct mm_struct *mm;
 	unsigned long len, end;
 	unsigned long flags;
 	int nr = 0;
@@ -2352,9 +2353,12 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
 	    gup_fast_permitted(start, end)) {
+		mm = current->mm;
+		start_lockless_pgtbl_walk(mm);
 		local_irq_save(flags);
 		gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr);
 		local_irq_restore(flags);
+		end_lockless_pgtbl_walk(mm);
 	}

 	return nr;
@@ -2404,6 +2408,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages)
 {
 	unsigned long addr, len, end;
+	struct mm_struct *mm;
 	int nr = 0, ret = 0;

 	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
@@ -2421,9 +2426,12 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
 	    gup_fast_permitted(start, end)) {
+		mm = current->mm;
+		start_lockless_pgtbl_walk(mm);
 		local_irq_disable();
 		gup_pgd_range(addr, end, gup_flags, pages, &nr);
 		local_irq_enable();
+		end_lockless_pgtbl_walk(mm);
 		ret = nr;
 	}
--
2.20.1
[PATCH v4 04/11] powerpc/mce_power: Applies counting method to monitor lockless pgtbl walks
Applies the counting-based method for monitoring lockless pgtable walks on
addr_to_pfn().

Signed-off-by: Leonardo Bras
---
 arch/powerpc/kernel/mce_power.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index a814d2dfb5b0..0f2f87da4cd1 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -27,6 +27,7 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
 {
 	pte_t *ptep;
 	unsigned long flags;
+	unsigned long pfn;
 	struct mm_struct *mm;

 	if (user_mode(regs))
@@ -34,15 +35,21 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
 	else
 		mm = &init_mm;

+	start_lockless_pgtbl_walk(mm);
 	local_irq_save(flags);
 	if (mm == current->mm)
 		ptep = find_current_mm_pte(mm->pgd, addr, NULL, NULL);
 	else
 		ptep = find_init_mm_pte(addr, NULL);
-	local_irq_restore(flags);
+
 	if (!ptep || pte_special(*ptep))
-		return ULONG_MAX;
-	return pte_pfn(*ptep);
+		pfn = ULONG_MAX;
+	else
+		pfn = pte_pfn(*ptep);
+
+	local_irq_restore(flags);
+	end_lockless_pgtbl_walk(mm);
+	return pfn;
 }

 /* flush SLBs and reload */
--
2.20.1
[PATCH v4 09/11] powerpc/kvm/book3s_64: Applies counting method to monitor lockless pgtbl walks
Applies the counting-based method for monitoring all book3s_64-related
functions that do lockless pagetable walks.

Adds comments explaining that some lockless pagetable walks don't need
protection, either because the guest pgd is not a target of THP
collapse/split, or because they are called from Realmode + MSR_EE = 0.

Signed-off-by: Leonardo Bras
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c    |  2 ++
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 30 ++++++++++++++++++++++++++++++
 arch/powerpc/kvm/book3s_64_vio_hv.c    |  3 +++
 3 files changed, 35 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 9a75f0e1933b..fcd3dad1297f 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -620,6 +620,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 			 * We need to protect against page table destruction
 			 * hugepage split and collapse.
 			 */
+			start_lockless_pgtbl_walk(kvm->mm);
 			local_irq_save(flags);
 			ptep = find_current_mm_pte(current->mm->pgd,
 						   hva, NULL, NULL);
@@ -629,6 +630,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 					write_ok = 1;
 			}
 			local_irq_restore(flags);
+			end_lockless_pgtbl_walk(kvm->mm);
 		}
 	}

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 2d415c36a61d..9b374b9838fa 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -813,6 +813,7 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
 	 * Read the PTE from the process' radix tree and use that
 	 * so we get the shift and attribute bits.
 	 */
+	start_lockless_pgtbl_walk(kvm->mm);
 	local_irq_disable();
 	ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);
 	/*
@@ -821,12 +822,14 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
 	 */
 	if (!ptep) {
 		local_irq_enable();
+		end_lockless_pgtbl_walk(kvm->mm);
 		if (page)
 			put_page(page);
 		return RESUME_GUEST;
 	}
 	pte = *ptep;
 	local_irq_enable();
+	end_lockless_pgtbl_walk(kvm->mm);

 	/* If we're logging dirty pages, always map single pages */
 	large_enable = !(memslot->flags & KVM_MEM_LOG_DIRTY_PAGES);
@@ -972,10 +975,16 @@ int kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	unsigned long gpa = gfn << PAGE_SHIFT;
 	unsigned int shift;

+	/*
+	 * We are walking the secondary (partition-scoped) page table here.
+	 * We can do this without disabling irq because the Linux MM
+	 * subsystem doesn't do THP splits and collapses on this tree.
+	 */
 	ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
 	if (ptep && pte_present(*ptep))
 		kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot,
 				 kvm->arch.lpid);
+
 	return 0;
 }

@@ -989,6 +998,11 @@ int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	int ref = 0;
 	unsigned long old, *rmapp;

+	/*
+	 * We are walking the secondary (partition-scoped) page table here.
+	 * We can do this without disabling irq because the Linux MM
+	 * subsystem doesn't do THP splits and collapses on this tree.
+	 */
 	ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
 	if (ptep && pte_present(*ptep) && pte_young(*ptep)) {
 		old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_ACCESSED, 0,
@@ -1013,6 +1027,11 @@ int kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	unsigned int shift;
 	int ref = 0;

+	/*
+	 * We are walking the secondary (partition-scoped) page table here.
+	 * We can do this without disabling irq because the Linux MM
+	 * subsystem doesn't do THP splits and collapses on this tree.
+	 */
 	ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
 	if (ptep && pte_present(*ptep) && pte_young(*ptep))
 		ref = 1;
@@ -1030,6 +1049,11 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
 	int ret = 0;
 	unsigned long old, *rmapp;

+	/*
+	 * We are walking the secondary (partition-scoped) page table here.
+	 * We can do this without disabling irq because the Linux MM
+	 * subsystem doesn't do THP splits and collapses on this tree.
+	 */
 	ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
 	if (ptep && pte_present(*ptep) &&
[PATCH v4 02/11] asm-generic/pgtable: Adds dummy functions to monitor lockless pgtable walks
There is a need to monitor lockless pagetable walks, in order to avoid
doing THP splitting/collapsing during them.

Some methods rely on local_irq_{save,restore}, but that can be slow in
cases where a lot of cpus are used by the process.

In order to speed up these cases, I propose a refcount-based approach that
counts the number of lockless pagetable walks happening on the process.

Given that there are lockless pagetable walks on generic code, it's
necessary to create dummy functions for archs that won't use the approach.

Signed-off-by: Leonardo Bras
---
 include/asm-generic/pgtable.h | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 75d9d68a6de7..0831475e72d3 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1172,6 +1172,21 @@ static inline bool arch_has_pfn_modify_check(void)
 #endif
 #endif

+#ifndef __HAVE_ARCH_LOCKLESS_PGTBL_WALK_COUNTER
+static inline void start_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+}
+
+static inline void end_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+}
+
+static inline int running_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+	return 0;
+}
+#endif
+
 /*
  * On some architectures it depends on the mm if the p4d/pud or pmd
  * layer of the page table hierarchy is folded or not.
--
2.20.1
[PATCH v4 01/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks
It's necessary to monitor lockless pagetable walks, in order to avoid doing
THP splitting/collapsing during them.

Some methods rely on local_irq_{save,restore}, but that can be slow in
cases where a lot of cpus are used by the process.

In order to speed up some cases, I propose a refcount-based approach that
counts the number of lockless pagetable walks happening on the process.

This method does not exclude the current irq-oriented method. It works as a
complement to skip unnecessary waiting.

start_lockless_pgtbl_walk(mm)
	Insert before starting any lockless pgtable walk
end_lockless_pgtbl_walk(mm)
	Insert after the end of any lockless pgtable walk
	(Mostly after the ptep is last used)
running_lockless_pgtbl_walk(mm)
	Returns the number of lockless pgtable walks running

Signed-off-by: Leonardo Bras
---
 arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
 arch/powerpc/mm/book3s64/mmu_context.c   |  1 +
 arch/powerpc/mm/book3s64/pgtable.c       | 45 ++++++++++++++++++++++
 3 files changed, 49 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
index 23b83d3593e2..13b006e7dde4 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -116,6 +116,9 @@ typedef struct {
 	/* Number of users of the external (Nest) MMU */
 	atomic_t copros;

+	/* Number of running instances of lockless pagetable walk*/
+	atomic_t lockless_pgtbl_walk_count;
+
 	struct hash_mm_context *hash_context;

 	unsigned long vdso_base;
diff --git a/arch/powerpc/mm/book3s64/mmu_context.c b/arch/powerpc/mm/book3s64/mmu_context.c
index 2d0cb5ba9a47..3dd01c0ca5be 100644
--- a/arch/powerpc/mm/book3s64/mmu_context.c
+++ b/arch/powerpc/mm/book3s64/mmu_context.c
@@ -200,6 +200,7 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
 #endif
 	atomic_set(&mm->context.active_cpus, 0);
 	atomic_set(&mm->context.copros, 0);
+	atomic_set(&mm->context.lockless_pgtbl_walk_count, 0);

 	return 0;
 }
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 7d0e0d0d22c4..6ba6195bff1b 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -98,6 +98,51 @@ void serialize_against_pte_lookup(struct mm_struct *mm)
 	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
 }

+/*
+ * Counting method to monitor lockless pagetable walks:
+ * Uses start_lockless_pgtbl_walk and end_lockless_pgtbl_walk to track the
+ * number of lockless pgtable walks happening, and
+ * running_lockless_pgtbl_walk to return this value.
+ */
+
+/* start_lockless_pgtbl_walk: Must be inserted before a function call that does
+ * lockless pagetable walks, such as __find_linux_pte()
+ */
+void start_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+	atomic_inc(&mm->context.lockless_pgtbl_walk_count);
+	/* Avoid reorder to guarantee that the increment will happen before any
+	 * part of the lockless pagetable walk after it.
+	 */
+	smp_mb();
+}
+EXPORT_SYMBOL(start_lockless_pgtbl_walk);
+
+/*
+ * end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer
+ * returned by a lockless pagetable walk, such as __find_linux_pte()
+ */
+void end_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+	/* Avoid reorder to guarantee that it will only decrement after the last
+	 * use of the returned ptep from the lockless pagetable walk.
+	 */
+	smp_mb();
+	atomic_dec(&mm->context.lockless_pgtbl_walk_count);
+}
+EXPORT_SYMBOL(end_lockless_pgtbl_walk);
+
+/*
+ * running_lockless_pgtbl_walk: Returns the number of lockless pagetable walks
+ * currently running. If it returns 0, there is no running pagetable walk, and
+ * THP split/collapse can be safely done. This can be used to avoid more
+ * expensive approaches like serialize_against_pte_lookup()
+ */
+int running_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+	return atomic_read(&mm->context.lockless_pgtbl_walk_count);
+}
+
 /*
  * We use this to invalidate a pmdp entry before switching from a
  * hugepte to regular pmd entry.
--
2.20.1
[PATCH v4 00/11] Introduces new count-based method for monitoring lockless pagetable walks
If a process (qemu) with a lot of CPUs (128) tries to munmap() a large
chunk of memory (496GB) mapped with THP, it takes an average of 275
seconds, which can cause a lot of problems to the load (in the qemu case,
the guest will lock up for this time).

Trying to find the source of this bug, I found out most of this time is
spent on serialize_against_pte_lookup(). This function will take a lot of
time in smp_call_function_many() if there is more than a couple CPUs
running the user process. Since it has to happen for every THP mapped, it
will take a very long time for large amounts of memory.

By the docs, serialize_against_pte_lookup() is needed in order to avoid
pmd_t to pte_t casting inside find_current_mm_pte(), or any lockless
pagetable walk, happening concurrently with THP splitting/collapsing.

It does so by calling a do_nothing() on each CPU in mm->cpu_bitmap[], after
interrupts are re-enabled. Since interrupts are (usually) disabled during a
lockless pagetable walk, and serialize_against_pte_lookup() will only
return after interrupts are enabled, it is protected.

So, by what I could understand, if there is no lockless pagetable walk
running, there is no need to call serialize_against_pte_lookup().

So, to avoid the cost of running serialize_against_pte_lookup(), I propose
a counter that keeps track of how many find_current_mm_pte() are currently
running, and if there is none, just skip smp_call_function_many().

The related functions are:
start_lockless_pgtbl_walk(mm)
	Insert before starting any lockless pgtable walk
end_lockless_pgtbl_walk(mm)
	Insert after the end of any lockless pgtable walk
	(Mostly after the ptep is last used)
running_lockless_pgtbl_walk(mm)
	Returns the number of lockless pgtable walks running

On my workload (qemu), I could see munmap's time reduction from 275
seconds to 418ms.

Also, I documented some lockless pagetable walks in which it's not
necessary to keep track, given they work on init_mm or guest pgd.
Changes since v3:
 Adds memory barrier to {start,end}_lockless_pgtbl_walk()
 Explain (comments) why some lockless pgtbl walks don't need
   local_irq_disable (real mode + MSR_EE=0)
 Explain (comments) places where counting method is not needed
   (guest pgd, which is not touched by THP)
 Fixes some misplaced local_irq_restore()
 Link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=132417

Changes since v2:
 Rebased to v5.3
 Adds support on __get_user_pages_fast
 Adds usage description to *_lockless_pgtbl_walk()
 Better style to dummy functions
 Link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=131839

Changes since v1:
 Isolated atomic operations in functions *_lockless_pgtbl_walk()
 Fixed behavior of decrementing before last ptep was used
 Link: http://patchwork.ozlabs.org/patch/1163093/

Leonardo Bras (11):
  powerpc/mm: Adds counting method to monitor lockless pgtable walks
  asm-generic/pgtable: Adds dummy functions to monitor lockless pgtable
    walks
  mm/gup: Applies counting method to monitor gup_pgd_range
  powerpc/mce_power: Applies counting method to monitor lockless pgtbl
    walks
  powerpc/perf: Applies counting method to monitor lockless pgtbl walks
  powerpc/mm/book3s64/hash: Applies counting method to monitor lockless
    pgtbl walks
  powerpc/kvm/e500: Applies counting method to monitor lockless pgtbl
    walks
  powerpc/kvm/book3s_hv: Applies counting method to monitor lockless
    pgtbl walks
  powerpc/kvm/book3s_64: Applies counting method to monitor lockless
    pgtbl walks
  powerpc/book3s_64: Enables counting method to monitor lockless pgtbl
    walk
  powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing

 arch/powerpc/include/asm/book3s/64/mmu.h     |  3 ++
 arch/powerpc/include/asm/book3s/64/pgtable.h |  5 ++
 arch/powerpc/kernel/mce_power.c              | 13 ++---
 arch/powerpc/kvm/book3s_64_mmu_hv.c          |  2 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c       | 30 ++++++++++
 arch/powerpc/kvm/book3s_64_vio_hv.c          |  3 ++
 arch/powerpc/kvm/book3s_hv_nested.c          | 22 +++++++-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c          | 18 ++++--
 arch/powerpc/kvm/e500_mmu_host.c             |  6 ++-
 arch/powerpc/mm/book3s64/hash_tlb.c          |  2 +
 arch/powerpc/mm/book3s64/hash_utils.c        | 12 +++-
 arch/powerpc/mm/book3s64/mmu_context.c       |  1 +
 arch/powerpc/mm/book3s64/pgtable.c           | 48 +++++++++++++++-
 arch/powerpc/perf/callchain.c                |  5 +-
 include/asm-generic/pgtable.h                | 15 ++++++
 mm/gup.c                                     |  8 +++
 16 files changed, 180 insertions(+), 13 deletions(-)

--
2.20.1
Re: [PATCH v2 1/8] KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter
On Fri, Sep 27, 2019 at 2:45 PM Sean Christopherson wrote:
>
> Write the desired L2 CR3 into vmcs02.GUEST_CR3 during nested VM-Enter
> instead of deferring the VMWRITE until vmx_set_cr3(). If the VMWRITE
> is deferred, then KVM can consume a stale vmcs02.GUEST_CR3 when it
> refreshes vmcs12->guest_cr3 during nested_vmx_vmexit() if the emulated
> VM-Exit occurs without actually entering L2, e.g. if the nested run
> is squashed because nested VM-Enter (from L1) is putting L2 into HLT.
>
> Note, the above scenario can occur regardless of whether L1 is
> intercepting HLT, e.g. L1 can intercept HLT and then re-enter L2 with
> vmcs.GUEST_ACTIVITY_STATE=HALTED. But practically speaking, a VMM will
> likely put a guest into HALTED if and only if it's not intercepting HLT.
>
> In an ideal world where EPT *requires* unrestricted guest (and vice
> versa), VMX could handle CR3 similar to how it handles RSP and RIP,
> e.g. mark CR3 dirty and conditionally load it at vmx_vcpu_run(). But
> the unrestricted guest silliness complicates the dirty tracking logic
> to the point that explicitly handling vmcs02.GUEST_CR3 during nested
> VM-Enter is a simpler overall implementation.
>
> Cc: sta...@vger.kernel.org
> Reported-and-tested-by: Reto Buerki
> Tested-by: Vitaly Kuznetsov
> Reviewed-by: Liran Alon
> Signed-off-by: Sean Christopherson

Reviewed-by: Jim Mattson
[PATCH 3/3] perf annotate: Improve handling of corrupted ~/.debug
From: Andi Kleen

Sometimes ~/.debug can get corrupted and contain files that still have
symbol tables, but which objdump cannot handle. Add a fallback to read the
"original" file in such a case. This might be wrong too if it's different,
but in many cases when profiling on the same host it will work.

Signed-off-by: Andi Kleen
---
 tools/perf/util/annotate.c | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index 1748f528b6e9..cff5f36786fa 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -1638,19 +1638,22 @@ int symbol__strerror_disassemble(struct symbol *sym __maybe_unused, struct map *
 	return 0;
 }

-static int dso__disassemble_filename(struct dso *dso, char *filename, size_t filename_size)
+static int dso__disassemble_filename(struct dso *dso, char *filename, size_t filename_size,
+				     bool *build_id)
 {
 	char linkname[PATH_MAX];
 	char *build_id_filename;
 	char *build_id_path = NULL;
 	char *pos;

+	*build_id = false;
 	if (dso->symtab_type == DSO_BINARY_TYPE__KALLSYMS &&
 	    !dso__is_kcore(dso))
 		return SYMBOL_ANNOTATE_ERRNO__NO_VMLINUX;

 	build_id_filename = dso__build_id_filename(dso, NULL, 0, false);
 	if (build_id_filename) {
+		*build_id = true;
 		__symbol__join_symfs(filename, filename_size, build_id_filename);
 		free(build_id_filename);
 	} else {
@@ -1854,11 +1857,14 @@ static int symbol__disassemble(struct symbol *sym, struct annotate_args *args)
 	int lineno = 0;
 	int nline;
 	pid_t pid;
-	int err = dso__disassemble_filename(dso, symfs_filename, sizeof(symfs_filename));
+	bool build_id;
+	int err = dso__disassemble_filename(dso, symfs_filename, sizeof(symfs_filename),
+					    &build_id);

 	if (err)
 		return err;

+again:
 	pr_debug("%s: filename=%s, sym=%s, start=%#" PRIx64 ", end=%#" PRIx64 "\n", __func__,
 		 symfs_filename, sym->name,
 		 map->unmap_ip(map, sym->start), map->unmap_ip(map, sym->end));
@@ -1955,8 +1961,21 @@ static int symbol__disassemble(struct symbol *sym, struct annotate_args *args)
 		nline++;
 	}

-	if (nline == 0)
-		pr_err("No output from %s\n", command);
+	if (nline == 0) {
+		pr_err("No objdump output for %s. Corrupted file?\n", symfs_filename);
+		if (build_id) {
+			/*
+			 * It could be that the buildid file is corrupted.
+			 * Try again with the "true" file. This might be wrong
+			 * too, but it's better than nothing.
+			 * We could move the build id file here?
+			 */
+			__symbol__join_symfs(symfs_filename, sizeof(symfs_filename),
+					     dso->long_name);
+			build_id = false;
+			goto again;
+		}
+	}

 	/*
	 * kallsyms does not have symbol sizes so there may a nop at the end.
--
2.21.0
[PATCH 2/3] perf jevents: Fix period for Intel fixed counters
From: Andi Kleen

The Intel fixed counters use a special table to override the JSON
information. During this override the period information from the JSON
file got dropped, which results in inst_retired.any and similar events
running in frequency mode instead of with a fixed period. Just specify the
expected period in the table.

Signed-off-by: Andi Kleen
---
 tools/perf/pmu-events/jevents.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/tools/perf/pmu-events/jevents.c b/tools/perf/pmu-events/jevents.c
index d413761621b0..fa85e33762f7 100644
--- a/tools/perf/pmu-events/jevents.c
+++ b/tools/perf/pmu-events/jevents.c
@@ -449,12 +449,12 @@ static struct fixed {
 	const char *name;
 	const char *event;
 } fixed[] = {
-	{ "inst_retired.any", "event=0xc0" },
-	{ "inst_retired.any_p", "event=0xc0" },
-	{ "cpu_clk_unhalted.ref", "event=0x0,umask=0x03" },
-	{ "cpu_clk_unhalted.thread", "event=0x3c" },
-	{ "cpu_clk_unhalted.core", "event=0x3c" },
-	{ "cpu_clk_unhalted.thread_any", "event=0x3c,any=1" },
+	{ "inst_retired.any", "event=0xc0,period=2000003" },
+	{ "inst_retired.any_p", "event=0xc0,period=2000003" },
+	{ "cpu_clk_unhalted.ref", "event=0x0,umask=0x03,period=2000003" },
+	{ "cpu_clk_unhalted.thread", "event=0x3c,period=2000003" },
+	{ "cpu_clk_unhalted.core", "event=0x3c,period=2000003" },
+	{ "cpu_clk_unhalted.thread_any", "event=0x3c,any=1,period=2000003" },
 	{ NULL, NULL},
 };
--
2.21.0
[PATCH 1/3] perf script brstackinsn: Fix recovery from LBR/binary mismatch
From: Andi Kleen

When the LBR data and the instructions in a binary do not match, the loop
printing instructions could get confused and print a long stream of bogus
instructions.

The problem was that, when the instruction decoder cannot decode an
instruction, ilen wasn't initialized, so the loop going through the basic
block would continue with the previous value.

Harden the code to avoid such problems:
- Make sure ilen is always freshly initialized and is 0 for bad
  instructions.
- Do not overrun the code buffer while printing instructions.
- Print a warning message if the final jump is not on an instruction
  boundary.

Signed-off-by: Andi Kleen
---
 tools/perf/builtin-script.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index e079b34201f2..32b17d51c982 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -1061,7 +1061,7 @@ static int perf_sample__fprintf_brstackinsn(struct perf_sample *sample,
 			continue;

 		insn = 0;
-		for (off = 0;; off += ilen) {
+		for (off = 0; off < (unsigned)len; off += ilen) {
 			uint64_t ip = start + off;

 			printed += ip__fprintf_sym(ip, thread, x.cpumode, x.cpu, &lastsym, attr, fp);
@@ -1072,6 +1072,7 @@ static int perf_sample__fprintf_brstackinsn(struct perf_sample *sample,
 				printed += print_srccode(thread, x.cpumode, ip);
 				break;
 			} else {
+				ilen = 0;
 				printed += fprintf(fp, "\t%016" PRIx64 "\t%s\n", ip,
 						   dump_insn(&x, ip, buffer + off, len - off, &ilen));
 				if (ilen == 0)
@@ -1081,6 +1082,8 @@ static int perf_sample__fprintf_brstackinsn(struct perf_sample *sample,
 				insn++;
 			}
 		}
+		if (off != (unsigned)len)
+			printed += fprintf(fp, "\tmismatch of LBR data and executable\n");
 	}

 	/*
@@ -1121,6 +1124,7 @@ static int perf_sample__fprintf_brstackinsn(struct perf_sample *sample,
 		goto out;
 	}
 	for (off = 0; off <= end - start; off += ilen) {
+		ilen = 0;
 		printed += fprintf(fp, "\t%016" PRIx64 "\t%s\n", start + off,
 				   dump_insn(&x, start + off, buffer + off, len - off, &ilen));
 		if (ilen == 0)
--
2.21.0
RE: [PATCH] namespace: fix namespace.pl script to support relative paths
> -Original Message- > From: Randy Dunlap [mailto:rdun...@infradead.org] > Sent: Friday, September 27, 2019 4:12 PM > To: Keller, Jacob E > Cc: intel-wired-...@lists.osuosl.org; linux-kernel@vger.kernel.org; > linux-kbuild kbu...@vger.kernel.org>; Masahiro Yamada > Subject: Re: [PATCH] namespace: fix namespace.pl script to support relative > paths > > > re: > https://lore.kernel.org/lkml/20190129204319.15238-1-jacob.e.kel...@intel.com/ > > Did anything happen with this patch? > > Please send it to linux-kbu...@vger.kernel.org and > Cc: Masahiro Yamada > > You can also add: > Acked-by: Randy Dunlap > Tested-by: Randy Dunlap > > > I was just about to fix this script but I decided to first see if anyone else > had already done so. Thanks. > > -- > ~Randy Done, thanks. Regards, Jake
Re: [PATCH v3 00/11] Introduces new count-based method for monitoring lockless pagetable walks
On Fri, 2019-09-27 at 11:46 -0300, Leonardo Bras wrote: > I am not sure if it would be ok to use irq_{save,restore} in real mode, > I will do some more reading of the docs before addressing this. It looks like it's unsafe to fold irq_{save,restore} into {start,end}_lockless_pgtbl_walk(), because that could access code that is not accessible in real mode. I am sending a v4 with the changes so far. I look forward to your feedback.
RE: [PATCH] namespace: fix namespace.pl script to support relative paths
> -Original Message- > From: Randy Dunlap [mailto:rdun...@infradead.org] > Sent: Friday, September 27, 2019 4:12 PM > To: Keller, Jacob E > Cc: intel-wired-...@lists.osuosl.org; linux-kernel@vger.kernel.org; > linux-kbuild kbu...@vger.kernel.org>; Masahiro Yamada > Subject: Re: [PATCH] namespace: fix namespace.pl script to support relative > paths > > > re: > https://lore.kernel.org/lkml/20190129204319.15238-1-jacob.e.kel...@intel.com/ > > Did anything happen with this patch? > I haven't heard anything or seen it get applied. > Please send it to linux-kbu...@vger.kernel.org and > Cc: Masahiro Yamada > Sure, I can forward it. > You can also add: > Acked-by: Randy Dunlap > Tested-by: Randy Dunlap > > > I was just about to fix this script but I decided to first see if anyone else > had already done so. Thanks. > Thanks for the review/notice here. > -- > ~Randy -Jake
Re: [PATCH] PCI: mobiveil: Fix csr_read/write build issue
On Thu, Sep 26, 2019 at 10:29:34AM +0100, Andrew Murray wrote: > Though I'd be just as happy if the csr_[read,write][l,] functions were > renamed to mobiveil_csr_[read,write][l,]. Please do that instead, using such generic names as csr_* in a driver is a bad idea, with or without a __ prefix.
[PATCH v2] riscv: Fix memblock reservation for device tree blob
This fixes an error with how the FDT blob is reserved in memblock. An incorrect physical address calculation exposed the FDT header to unintended corruption, which typically manifested with of_fdt_raw_init() faulting during late boot after fdt_totalsize() returned a wrong value. Systems with smaller physical memory sizes more frequently trigger this issue, as the kernel is more likely to allocate from the DMA32 zone where bbl places the DTB after the kernel image. Commit 671f9a3e2e24 ("RISC-V: Setup initial page tables in two stages") changed the mapping of the DTB to reside in the fixmap area. Consequently, early_init_fdt_reserve_self() cannot be used anymore in setup_bootmem() since it relies on __pa() to derive a physical address, which does not work with dtb_early_va that is no longer a valid kernel logical address. The reserved[0x1] region shows the effect of the pointer underflow resulting from the __pa(initial_boot_params) offset subtraction: [0.00] MEMBLOCK configuration: [0.00] memory size = 0x1fe0 reserved size = 0x00a2e514 [0.00] memory.cnt = 0x1 [0.00] memory[0x0] [0x8020-0x9fff], 0x1fe0 bytes flags: 0x0 [0.00] reserved.cnt = 0x2 [0.00] reserved[0x0] [0x8020-0x80c2dfeb], 0x00a2dfec bytes flags: 0x0 [0.00] reserved[0x1] [0xfff08010-0xfff080100527], 0x0528 bytes flags: 0x0 With the fix applied: [0.00] MEMBLOCK configuration: [0.00] memory size = 0x1fe0 reserved size = 0x00a2e514 [0.00] memory.cnt = 0x1 [0.00] memory[0x0] [0x8020-0x9fff], 0x1fe0 bytes flags: 0x0 [0.00] reserved.cnt = 0x2 [0.00] reserved[0x0] [0x8020-0x80c2dfeb], 0x00a2dfec bytes flags: 0x0 [0.00] reserved[0x1] [0x80e0-0x80e00527], 0x0528 bytes flags: 0x0 Fixes: 671f9a3e2e24 ("RISC-V: Setup initial page tables in two stages") Signed-off-by: Albert Ou --- Changes in v2: * Changed variable identifier to dtb_early_pa per reviewer feedback. * Removed whitespace change. 
arch/riscv/mm/init.c | 12 +++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c index f0ba713..83f7d12 100644 --- a/arch/riscv/mm/init.c +++ b/arch/riscv/mm/init.c @@ -11,6 +11,7 @@ #include #include #include +#include #include #include @@ -82,6 +83,8 @@ static void __init setup_initrd(void) } #endif /* CONFIG_BLK_DEV_INITRD */ +static phys_addr_t dtb_early_pa __initdata; + void __init setup_bootmem(void) { struct memblock_region *reg; @@ -117,7 +120,12 @@ void __init setup_bootmem(void) setup_initrd(); #endif /* CONFIG_BLK_DEV_INITRD */ - early_init_fdt_reserve_self(); + /* +* Avoid using early_init_fdt_reserve_self() since __pa() does +* not work for DTB pointers that are fixmap addresses +*/ + memblock_reserve(dtb_early_pa, fdt_totalsize(dtb_early_va)); + early_init_fdt_scan_reserved_mem(); memblock_allow_resize(); memblock_dump_all(); @@ -393,6 +401,8 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa) /* Save pointer to DTB for early FDT parsing */ dtb_early_va = (void *)fix_to_virt(FIX_FDT) + (dtb_pa & ~PAGE_MASK); + /* Save physical address for memblock reservation */ + dtb_early_pa = dtb_pa; } static void __init setup_vm_final(void) -- 2.7.4
Re: [PATCH] namespace: fix namespace.pl script to support relative paths
re: https://lore.kernel.org/lkml/20190129204319.15238-1-jacob.e.kel...@intel.com/ Did anything happen with this patch? Please send it to linux-kbu...@vger.kernel.org and Cc: Masahiro Yamada You can also add: Acked-by: Randy Dunlap Tested-by: Randy Dunlap I was just about to fix this script but I decided to first see if anyone else had already done so. Thanks. -- ~Randy
[PATCH] virt: vbox: fix memory leak in hgcm_call_preprocess_linaddr
In hgcm_call_preprocess_linaddr(), memory is allocated for bounce_buf but is not released if copy_from_user() fails. Add the missing release. Fixes: 579db9d45cb4 ("virt: Add vboxguest VMMDEV communication code") Signed-off-by: Navid Emamdoost --- drivers/virt/vboxguest/vboxguest_utils.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/virt/vboxguest/vboxguest_utils.c b/drivers/virt/vboxguest/vboxguest_utils.c index 75fd140b02ff..7965885a50fa 100644 --- a/drivers/virt/vboxguest/vboxguest_utils.c +++ b/drivers/virt/vboxguest/vboxguest_utils.c @@ -222,8 +222,10 @@ static int hgcm_call_preprocess_linaddr( if (copy_in) { ret = copy_from_user(bounce_buf, (void __user *)buf, len); - if (ret) + if (ret) { + kvfree(bounce_buf); return -EFAULT; + } } else { memset(bounce_buf, 0, len); } -- 2.17.1
Re: [GIT PULL] 9p updates for 5.4
The pull request you sent on Fri, 27 Sep 2019 16:27:26 +0200: > https://github.com/martinetd/linux tags/9p-for-5.4 has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/9977b1a71488742606376c09e19e0074e4403cdf Thank you! -- Deet-doot-dot, I am a bot. https://korg.wiki.kernel.org/userdoc/prtracker
Re: [GIT PULL] add virtio-fs
The pull request you sent on Thu, 26 Sep 2019 10:43:40 +0200: > git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git > tags/virtio-fs-5.4 has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/8f744bdee4fefb17fac052c7418b830de2b59ac8 Thank you! -- Deet-doot-dot, I am a bot. https://korg.wiki.kernel.org/userdoc/prtracker
Re: [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits
On 9/27/19 3:51 PM, Mina Almasry wrote: > On Fri, Sep 27, 2019 at 2:59 PM Mike Kravetz wrote: >> >> On 9/26/19 5:55 PM, Mina Almasry wrote: >>> Provided we keep the existing controller untouched, should the new >>> controller track: >>> >>> 1. only reservations, or >>> 2. both reservations and allocations for which no reservations exist >>> (such as the MAP_NORESERVE case)? >>> >>> I like the 'both' approach. Seems to me a counter like that would work >>> automatically regardless of whether the application is allocating >>> hugetlb memory with NORESERVE or not. NORESERVE allocations cannot cut >>> into reserved hugetlb pages, correct? >> >> Correct. One other easy way to allocate huge pages without reserves >> (that I know is used today) is via the fallocate system call. >> >>> If so, then applications that >>> allocate with NORESERVE will get sigbused when they hit their limit, >>> and applications that allocate without NORESERVE may get an error at >>> mmap time but will always be within their limits while they access the >>> mmap'd memory, correct? >> >> Correct. At page allocation time we can easily check to see if a reservation >> exists and not charge. For any specific page within a hugetlbfs file, >> a charge would happen at mmap time or allocation time. >> >> One exception (that I can think of) to this mmap(RESERVE) will not cause >> a SIGBUS rule is in the case of hole punch. If someone punches a hole in >> a file, not only do they remove pages associated with the file but the >> reservation information as well. Therefore, a subsequent fault will be >> the same as an allocation without reservation. >> > > I don't think it causes a sigbus. This is the scenario, right: > > 1. Make cgroup with limit X bytes. > 2. Task in cgroup mmaps a file with X bytes, causing the cgroup to get charged > 3. A hole of size Y is punched in the file, causing the cgroup to get > uncharged Y bytes. > 4. 
The task faults in memory from the hole, getting charged up to Y > bytes again. But they will be still within their limits. > > IIUC userspace only gets sigbus'd if the limit is lowered between > steps 3 and 4, and it's ok if it gets sigbus'd there in my opinion. You are correct. That was my mistake. I was still thinking of behavior for a reservation only cgroup model. It would behave as you describe above (no SIGBUS) for a combined reservation/allocate model. -- Mike Kravetz
Re: [PATCH v2 0/3] Add support for SBI v0.2
On Fri, 2019-09-27 at 15:19 -0700, Christoph Hellwig wrote: > On Thu, Sep 26, 2019 at 05:09:12PM -0700, Atish Patra wrote: > > The Supervisor Binary Interface(SBI) specification[1] now defines a > > base extension that provides extendability to add future extensions > > while maintaining backward compatibility with previous versions. > > The new version is defined as 0.2 and the older version is marked as > > 0.1. > > > > This series adds support for v0.2 and a unified calling convention > > implementation between 0.1 and 0.2. It also adds minimal SBI > > functions > > from 0.2 as well to keep the series lean. > > So before we do this game can we please make sure we have a clean 0.2 > environment that never uses the legacy extensions as discussed > before? > Without that all this work is rather futile. > As per our discussion offline, here are the things that need to be done to achieve that. 1. Replace timer, sfence and ipi with better alternative APIs - sbi_set_timer will be the same but with the new calling convention - send_ipi and sfence_* APIs can be modified in such a way that - we don't have to use unprivileged load anymore - Make it scalable 2. Drop clear_ipi, console, and shutdown in 0.2. We will have a new kernel config (LEGACY_SBI) that can be manually enabled if older firmware needs to be used. By default, LEGACY_SBI will be disabled and a kernel with the new SBI will be built. We will have to set a flag day in a year or so when we can remove LEGACY_SBI completely. Let us know if this approach is not acceptable to anybody. I will post an RFC patch with the new alternate v0.2 APIs sometime next week. > ___ > linux-riscv mailing list > linux-ri...@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-riscv -- Regards, Atish
Re: [PATCH 2/4] rsicv: avoid sending a SIGTRAP to a user thread trapped in WARN()
Oh and s/rsicv/riscv/ in the subject, please.
Re: [PATCH] riscv: move sifive_l2_cache.c to drivers/soc
On Thu, Aug 22, 2019 at 06:26:35AM -0300, Mauro Carvalho Chehab wrote: > On Mon, 19 Aug 2019 08:26:19 +0200 > Christoph Hellwig wrote: > > > On Mon, Aug 19, 2019 at 08:09:04AM +0200, Borislav Petkov wrote: > > > On Sun, Aug 18, 2019 at 10:29:35AM +0200, Christoph Hellwig wrote: > > > > The sifive_l2_cache.c is in no way related to RISC-V architecture > > > > memory management. It is a little stub driver working around the fact > > > > that the EDAC maintainers prefer their drivers to be structured in a > > > > certain way > > > > > > That changed recently so I guess we can do the per-IP block driver after > > > all, if people would still prefer it. > > > > That would seem like the best idea. But I don't really know this code > > well enough myself, and I really need to get this code out of the > > forced on RISC-V codebase as some SOCs I'm working with simply don't > > have the memory for it.. > > > > So unless someone signs up to do a per-IP block edac drivers instead > > very quickly I'd still like to see something like this go into 5.4 > > for now. > > I'm wondering if we should at least add an entry for this one at > MAINTAINERS, pointing it to the EDAC mailing list. Something like: Sounds fine. Can you also ACK the patch with that, as Paul mentioned in another thread that he wants an EDAC ACK for it.
Re: [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits
On Fri, Sep 27, 2019 at 2:59 PM Mike Kravetz wrote: > > On 9/26/19 5:55 PM, Mina Almasry wrote: > > Provided we keep the existing controller untouched, should the new > > controller track: > > > > 1. only reservations, or > > 2. both reservations and allocations for which no reservations exist > > (such as the MAP_NORESERVE case)? > > > > I like the 'both' approach. Seems to me a counter like that would work > > automatically regardless of whether the application is allocating > > hugetlb memory with NORESERVE or not. NORESERVE allocations cannot cut > > into reserved hugetlb pages, correct? > > Correct. One other easy way to allocate huge pages without reserves > (that I know is used today) is via the fallocate system call. > > > If so, then applications that > > allocate with NORESERVE will get sigbused when they hit their limit, > > and applications that allocate without NORESERVE may get an error at > > mmap time but will always be within their limits while they access the > > mmap'd memory, correct? > > Correct. At page allocation time we can easily check to see if a reservation > exists and not charge. For any specific page within a hugetlbfs file, > a charge would happen at mmap time or allocation time. > > One exception (that I can think of) to this mmap(RESERVE) will not cause > a SIGBUS rule is in the case of hole punch. If someone punches a hole in > a file, not only do they remove pages associated with the file but the > reservation information as well. Therefore, a subsequent fault will be > the same as an allocation without reservation. > I don't think it causes a sigbus. This is the scenario, right: 1. Make cgroup with limit X bytes. 2. Task in cgroup mmaps a file with X bytes, causing the cgroup to get charged 3. A hole of size Y is punched in the file, causing the cgroup to get uncharged Y bytes. 4. The task faults in memory from the hole, getting charged up to Y bytes again. But they will be still within their limits. 
IIUC userspace only gets sigbus'd if the limit is lowered between steps 3 and 4, and it's ok if it gets sigbus'd there in my opinion. > I 'think' the code to remove/truncate a file will work correctly as it > is today, but I need to think about this some more. > > > mmap'd memory, correct? So the 'both' counter seems like a one size > > fits all. > > > > I think the only sticking point left is whether an added controller > > can support both cgroup-v2 and cgroup-v1. If I could get confirmation > > on that I'll provide a patchset. > > Sorry, but I cannot provide cgroup expertise. > -- > Mike Kravetz
Re: [PATCH 4/4] riscv: remove the switch statement in do_trap_break()
On Mon, Sep 23, 2019 at 08:45:17AM +0800, Vincent Chen wrote: > To make the code more straightforward, replacing the switch statement > with if statement. > > Suggested-by: Paul Walmsley > Signed-off-by: Vincent Chen > --- > arch/riscv/kernel/traps.c | 23 --- > 1 file changed, 12 insertions(+), 11 deletions(-) > > diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c > index dd13bc90aeb6..1ac75f7d0bff 100644 > --- a/arch/riscv/kernel/traps.c > +++ b/arch/riscv/kernel/traps.c > @@ -124,23 +124,24 @@ static inline unsigned long > get_break_insn_length(unsigned long pc) > > asmlinkage void do_trap_break(struct pt_regs *regs) > { > + if (user_mode(regs)) { > + force_sig_fault(SIGTRAP, TRAP_BRKPT, > + (void __user *)(regs->sepc)); > + return; > + } > +#ifdef CONFIG_GENERIC_BUG > + { > enum bug_trap_type type; > > type = report_bug(regs->sepc, regs); > + if (type == BUG_TRAP_TYPE_WARN) { > regs->sepc += get_break_insn_length(regs->sepc); > return; > } > +#endif /* CONFIG_GENERIC_BUG */ > + > + die(regs, "Kernel BUG"); I like where this is going, but I think this can be improved further given that fact that report_bug has a nice stub for the !CONFIG_GENERIC_BUG case. How about: if (user_mode(regs)) force_sig_fault(SIGTRAP, TRAP_BRKPT, (void __user *)regs->sepc); else if (report_bug(regs->sepc, regs) == BUG_TRAP_TYPE_WARN) regs->sepc += get_break_insn_length(regs->sepc); else die(regs, "Kernel BUG");
Re: [GIT PULL] nfsd changes for 5.4
On Fri, Sep 27, 2019 at 1:08 PM J. Bruce Fields wrote: > > Please pull nfsd changes from: > > git://linux-nfs.org/~bfields/linux.git tags/nfsd-5.4 Hmm. I started to pull, but then realized that the description doesn't match the content at all. You say: > Highlights: > > - add a new knfsd file cache, so that we don't have to open and > close on each (NFSv2/v3) READ or WRITE. This can speed up > read and write in some cases. It also replaces our readahead > cache. > - Prevent silent data loss on write errors, by treating write > errors like server reboots for the purposes of write caching, > thus forcing clients to resend their writes. > - Tweak the code that allocates sessions to be more forgiving, > so that NFSv4.1 mounts are less likely to hang when a server > already has a lot of clients. > - Eliminate an arbitrary limit on NFSv4 ACL sizes; they should > now be limited only by the backend filesystem and the > maximum RPC size. > - Allow the server to enforce use of the correct kerberos > credentials when a client reclaims state after a reboot. > > And some miscellaneous smaller bugfixes and cleanup. But then the actual code is just one single small commit: > Dave Wysochanski (1): > SUNRPC: Track writers of the 'channel' file to improve > cache_listeners_exist which doesn't actually match any of the things your description says should be there. So I undid my pull - either the description is completely wrong, or you tagged the wrong commit. Please double-check, Linus
Re: [PATCH 3/4] riscv: Correct the handling of unexpected ebreak in do_trap_break()
Looks ok: Reviewed-by: Christoph Hellwig
Re: [PATCH] compiler: enable CONFIG_OPTIMIZE_INLINING forcibly
On Fri, Sep 27, 2019 at 3:08 PM Nick Desaulniers wrote: > > So get_user() was passed a bad value/pointer from userspace? Do you > know which of the three calls to get_user() from sock_setsockopt() is > failing? (It's not immediately clear to me how this patch is at > fault, vs there just being a bug in the source somewhere). Based on the error messages, the SO_PASSCRED ones are almost certainly from the get_user() in net/core/sock.c: sock_setsockopt(), which just does if (optlen < sizeof(int)) return -EINVAL; if (get_user(val, (int __user *)optval)) return -EFAULT; valbool = val ? 1 : 0; but the other messages imply that a lot of other cases are failing too (ie the "Failed to bind netlink socket" is, according to google, a bind() that fails with the same EFAULT error). There are probably even more failures that happen elsewhere and just don't even syslog the fact. I'd guess that all get_user() calls just fail, and those are the ones that happen to get printed out. Now, _why_ it would fail, I have no idea. There are several inlines in the arm uaccess.h file, and it depends on other headers with more inlines still - eg get/set_domain() etc. Some of that code is pretty subtle. They have fixed register usage (but the asm macros actually check them). And the inline asms clobber the link register, but they do seem to clearly _state_ that they clobber it, so who knows. Just based on the EFAULT, I'd _guess_ that it's some interaction with the domain access control register (so that get/set_domain() thing). But I'm not even sure that code is enabled for the Rpi2, so who knows.. Linus
[PATCH] net/mlx5: fix memory leak in mlx5_fw_fatal_reporter_dump
In mlx5_fw_fatal_reporter_dump, if mlx5_crdump_collect() fails, the memory allocated for cr_data must be released; otherwise it is leaked. To fix this, change the return statement into a goto to the error-handling path. Fixes: 9b1f29823605 ("net/mlx5: Add support for FW fatal reporter dump") Signed-off-by: Navid Emamdoost --- drivers/net/ethernet/mellanox/mlx5/core/health.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/health.c b/drivers/net/ethernet/mellanox/mlx5/core/health.c index d685122d9ff7..c07f3154437c 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/health.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/health.c @@ -572,7 +572,7 @@ mlx5_fw_fatal_reporter_dump(struct devlink_health_reporter *reporter, return -ENOMEM; err = mlx5_crdump_collect(dev, cr_data); if (err) - return err; + goto free_data; if (priv_ctx) { struct mlx5_fw_reporter_ctx *fw_reporter_ctx = priv_ctx; -- 2.17.1
Re: [PATCH v5 4/7] hugetlb: disable region_add file_region coalescing
On Fri, Sep 27, 2019 at 2:44 PM Mike Kravetz wrote: > > On 9/19/19 3:24 PM, Mina Almasry wrote: > > A follow up patch in this series adds hugetlb cgroup uncharge info the > > file_region entries in resv->regions. The cgroup uncharge info may > > differ for different regions, so they can no longer be coalesced at > > region_add time. So, disable region coalescing in region_add in this > > patch. > > > > Behavior change: > > > > Say a resv_map exists like this [0->1], [2->3], and [5->6]. > > > > Then a region_chg/add call comes in region_chg/add(f=0, t=5). > > > > Old code would generate resv->regions: [0->5], [5->6]. > > New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5], > > [5->6]. > > > > Special care needs to be taken to handle the resv->adds_in_progress > > variable correctly. In the past, only 1 region would be added for every > > region_chg and region_add call. But now, each call may add multiple > > regions, so we can no longer increment adds_in_progress by 1 in region_chg, > > or decrement adds_in_progress by 1 after region_add or region_abort. > > Instead, > > region_chg calls add_reservation_in_range() to count the number of regions > > needed and allocates those, and that info is passed to region_add and > > region_abort to decrement adds_in_progress correctly. > > > > Signed-off-by: Mina Almasry > > > > --- > > mm/hugetlb.c | 273 +-- > > 1 file changed, 158 insertions(+), 115 deletions(-) > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index bac1cbdd027c..d03b048084a3 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -244,6 +244,12 @@ struct file_region { > > long to; > > }; > > > > +/* Helper that removes a struct file_region from the resv_map cache and > > returns > > + * it for use. > > + */ > > +static struct file_region * > > +get_file_region_entry_from_cache(struct resv_map *resv, long from, long > > to); > > + > > Instead of the forward declaration, just put the function here. 
> > /* Must be called with resv->lock held. Calling this with count_only == > true * will count the number of pages to be added but will not modify the > linked * list. > @@ -251,51 +257,61 @@ struct file_region { > static long add_reservation_in_range(struct resv_map *resv, long f, long t, >bool count_only) > { > - long chg = 0; > + long add = 0; > struct list_head *head = &resv->regions; > + long last_accounted_offset = f; > struct file_region *rg = NULL, *trg = NULL, *nrg = NULL; > > - /* Locate the region we are before or in. */ > - list_for_each_entry (rg, head, link) > - if (f <= rg->to) > - break; > - > - /* Round our left edge to the current segment if it encloses us. */ > - if (f > rg->from) > - f = rg->from; > - > - chg = t - f; > + /* In this loop, we essentially handle an entry for the range > + * last_accounted_offset -> rg->from, at every iteration, with some > + * bounds checking. > + */ > + list_for_each_entry_safe(rg, trg, head, link) { > + /* Skip irrelevant regions that start before our range. */ > + if (rg->from < f) { > + /* If this region ends after the last accounted > offset, > + * then we need to update last_accounted_offset. > + */ > + if (rg->to > last_accounted_offset) > + last_accounted_offset = rg->to; > + continue; > + } > > - /* Check for and consume any regions we now overlap with. */ > - nrg = rg; > - list_for_each_entry_safe (rg, trg, rg->link.prev, link) { > - if (&rg->link == head) > - break; > + /* When we find a region that starts beyond our range, we've > + * finished. > + */ > if (rg->from > t) > break; > > - /* We overlap with this area, if it extends further than > - * us then we must extend ourselves. Account for its > - * existing reservation. > + /* Add an entry for last_accounted_offset -> rg->from, and > + * update last_accounted_offset.
> >*/ > > - if (rg->to > t) { > > - chg += rg->to - t; > > - t = rg->to; > > + if (rg->from > last_accounted_offset) { > > + add += rg->from - last_accounted_offset; > > + if (!count_only) { > > + nrg = get_file_region_entry_from_cache( > > + resv, last_accounted_offset, > > rg->from); > > + list_add(&nrg->link, rg->link.prev); > > +
Re: [PATCH 2/4] rsicv: avoid sending a SIGTRAP to a user thread trapped in WARN()
On Mon, Sep 23, 2019 at 08:45:15AM +0800, Vincent Chen wrote: > On RISC-V, when the kernel runs code on behalf of a user thread, and the > kernel executes a WARN() or WARN_ON(), the user thread will be sent > a bogus SIGTRAP. Fix the RISC-V kernel code to not send a SIGTRAP when > a WARN()/WARN_ON() is executed. > > Signed-off-by: Vincent Chen Looks good, Reviewed-by: Christoph Hellwig
Re: [alsa-devel] [PATCH v2] ASoC: Intel: Skylake: prevent memory leak in snd_skl_parse_uuids
On 9/27/19 3:39 PM, Andy Shevchenko wrote: On Fri, Sep 27, 2019 at 7:39 PM Pierre-Louis Bossart wrote: The problem with solution #1 is freeing orphaned pointer. It will work, but it's simple is not okay from object life time prospective. ?? I don't get your point at all Andy. Two allocations happens in a loop and if the second fails, you free the first and then jump to free everything allocated in the previous iterations. what am I missing? Two things: - one allocation is done with kzalloc(), while the other one with devm_kcalloc() - due to above the ordering of resources is reversed Ah yes, I see your point now, sorry for being thick. Indeed it'd make sense to use devm_ for both allocations, but then the kfree needs to be removed in the error handling.
Re: [PATCH 1/4] riscv: avoid kernel hangs when trapped in BUG()
On Mon, Sep 23, 2019 at 08:45:14AM +0800, Vincent Chen wrote: > When the CONFIG_GENERIC_BUG is disabled by disabling CONFIG_BUG, if a > kernel thread is trapped by BUG(), the whole system will be in the > loop that infinitely handles the ebreak exception instead of entering the > die function. To fix this problem, the do_trap_break() will always call > the die() to deal with the break exception as the type of break is > BUG_TRAP_TYPE_BUG. > > Signed-off-by: Vincent Chen Looks good, Reviewed-by: Christoph Hellwig
Re: [PATCH] RISC-V: Clear load reservations while restoring hart contexts
On Tue, Sep 24, 2019 at 05:15:56PM -0700, Palmer Dabbelt wrote: > This is almost entirely a comment. The bug is unlikely to manifest on > existing hardware because there is a timeout on load reservations, but > manifests on QEMU because there is no timeout. Given how many different lines of RISC-V hardware are around how do you know it doesn't affect any existing hardware? Except for that the patch looks fine to me: Reviewed-by: Christoph Hellwig
Re: [GIT PULL] RISC-V additional updates for v5.4-rc1
On Fri, Sep 27, 2019 at 11:25:13AM -0700, Paul Walmsley wrote: > Atish Patra (1): > RISC-V: Export kernel symbols for kvm None of these have any current users, they should go in with the kvm series once the virtualization spec has been finalized.
Re: [PATCH v2 3/3] RISC-V: Move SBI related macros under uapi.
On Thu, Sep 26, 2019 at 05:09:15PM -0700, Atish Patra wrote: > All SBI related macros can be reused by KVM RISC-V and userspace tools > such as kvmtool, qemu-kvm. SBI calls can also be emulated by userspace > if required. Any future vendor extensions can leverage this to emulate > the specific extension in userspace instead of kernel. Just because userspace can use them that doesn't mean they are a userspace API. Please don't do this as this limits how we can ever remove previously existing symbols. Just copy over the current version of the file into the other project of your choice instead of creating and API we need to maintain.
Re: [PATCH v2 1/3] RISC-V: Mark existing SBI as 0.1 SBI.
On Thu, Sep 26, 2019 at 05:09:13PM -0700, Atish Patra wrote: > -#define SBI_CALL(which, arg0, arg1, arg2, arg3) ({ \ > +#define SBI_CALL(which, arg0, arg1, arg2, arg3) ({ \ Spurious whitespace change.
Re: [PATCH v2 0/3] Add support for SBI v0.2
On Thu, Sep 26, 2019 at 05:09:12PM -0700, Atish Patra wrote: > The Supervisor Binary Interface(SBI) specification[1] now defines a > base extension that provides extendability to add future extensions > while maintaining backward compatibility with previous versions. > The new version is defined as 0.2 and the older version is marked as 0.1. > > This series adds support for v0.2 and a unified calling convention > implementation between 0.1 and 0.2. It also adds minimal SBI functions > from 0.2 as well to keep the series lean. So before we do this game can we please make sure we have a clean 0.2 environment that never uses the legacy extensions as discussed before? Without that all this work is rather futile.
Re: [PATCH] mm/page_alloc: fix a crash in free_pages_prepare()
On Fri, Sep 27, 2019 at 2:59 PM Andrew Morton wrote: > > On Fri, 27 Sep 2019 17:28:06 -0400 Qian Cai wrote: > > > > > > > So I think you've moved the arch_free_page() to be after the final > > > thing which can access page contents, yes? If so, we should have a > > > comment in free_pages_prepare() to attmept to prevent this problem from > > > reoccurring as the code evolves? > > > > Right, something like this above arch_free_page() there? > > > > /* > > * It needs to be just above kernel_map_pages(), as s390 could mark those > > * pages unused and then trigger a fault when accessing. > > */ > > I did this. > > --- a/mm/page_alloc.c~mm-page_alloc-fix-a-crash-in-free_pages_prepare-fix > +++ a/mm/page_alloc.c > @@ -1179,7 +1179,13 @@ static __always_inline bool free_pages_p > kernel_init_free_pages(page, 1 << order); > > kernel_poison_pages(page, 1 << order, 0); > + /* > +* arch_free_page() can make the page's contents inaccessible. s390 > +* does this. So nothing which can access the page's contents should > +* happen after this. > +*/ > arch_free_page(page, order); > + > if (debug_pagealloc_enabled()) > kernel_map_pages(page, 1 << order, 0); > So the question I would have is what is the state of the page after it has been marked unused and then pulled back in? I'm assuming it will be all 0s. I know with the work I am still doing on marking pages as unused this ends up being an additional item that we will need to pay attention to, however in our case we will just be initializing the page as zero if we end up evicting it from the guest.
Re: [PATCH v3 2/5] dt-bindings: mfd: Document the Xylon LogiCVC multi-function device
On Fri, Sep 27, 2019 at 12:04:04PM +0200, Paul Kocialkowski wrote: > The LogiCVC is a display engine which also exposes GPIO functionality. > For this reason, it is described as a multi-function device that is expected > to provide register access to its children nodes for gpio and display. > > Signed-off-by: Paul Kocialkowski > --- > .../bindings/mfd/xylon,logicvc.yaml | 50 +++ > 1 file changed, 50 insertions(+) > create mode 100644 Documentation/devicetree/bindings/mfd/xylon,logicvc.yaml > > diff --git a/Documentation/devicetree/bindings/mfd/xylon,logicvc.yaml > b/Documentation/devicetree/bindings/mfd/xylon,logicvc.yaml > new file mode 100644 > index ..abc9937506e0 > --- /dev/null > +++ b/Documentation/devicetree/bindings/mfd/xylon,logicvc.yaml > @@ -0,0 +1,50 @@ > +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) > +# Copyright 2019 Bootlin > +%YAML 1.2 > +--- > +$id: "http://devicetree.org/schemas/mfd/xylon,logicvc.yaml#; > +$schema: "http://devicetree.org/meta-schemas/core.yaml#; > + > +title: Xylon LogiCVC multi-function device > + > +maintainers: > + - Paul Kocialkowski > + > +description: | > + The LogiCVC is a display controller that also contains a GPIO controller. > + As a result, a multi-function device is exposed as parent of the display > + and GPIO blocks. > + > +properties: > + compatible: > +items: > + - enum: > + - xylon,logicvc-3.02.a > + - const: syscon > + - const: simple-mfd > + > + reg: > +maxItems: 1 > + > +select: > + properties: > +compatible: > + contains: > +enum: > + - xylon,logicvc-3.02.a I've seen a couple of these with 'syscon' today, so I fixed the schema tool to just exclude 'syscon' and 'simple-mfd' from the generated 'select'. So you can drop select now. 
Reviewed-by: Rob Herring > + > + required: > +- compatible > + > +required: > + - compatible > + - reg > + > +examples: > + - | > +logicvc: logicvc@43c0 { > + compatible = "xylon,logicvc-3.02.a", "syscon", "simple-mfd"; > + reg = <0x43c0 0x6000>; > + #address-cells = <1>; > + #size-cells = <1>; > +}; > -- > 2.23.0 >
Re: [PATCH v3 4/5] dt-bindings: gpio: Document the Xylon LogiCVC GPIO controller
On Fri, 27 Sep 2019 12:04:06 +0200, Paul Kocialkowski wrote: > The Xylon LogiCVC display controller exports some GPIOs, which are > exposed as a separate entity. > > Signed-off-by: Paul Kocialkowski > --- > .../bindings/gpio/xylon,logicvc-gpio.yaml | 69 +++ > 1 file changed, 69 insertions(+) > create mode 100644 > Documentation/devicetree/bindings/gpio/xylon,logicvc-gpio.yaml > Reviewed-by: Rob Herring
Re: [PATCH v6 5/5] pinctrl: mediatek: Add support for pin configuration dump via debugfs.
Hi, On Thu, Sep 26, 2019 at 10:02 PM Light Hsieh wrote: > > Add support for pin configuration dump via catting > /sys/kernel/debug/pinctrl/$platform_dependent_path/pinconf-pins. > pinctrl framework had already support such dump. This patch implement the > operation function pointer to fullfill this dump. > Here are missing tags too. > --- > drivers/pinctrl/mediatek/pinctrl-paris.c | 88 > > drivers/pinctrl/mediatek/pinctrl-paris.h | 30 +++ > 2 files changed, 118 insertions(+) > > diff --git a/drivers/pinctrl/mediatek/pinctrl-paris.c > b/drivers/pinctrl/mediatek/pinctrl-paris.c > index 2a47c45..f531908 100644 > --- a/drivers/pinctrl/mediatek/pinctrl-paris.c > +++ b/drivers/pinctrl/mediatek/pinctrl-paris.c > @@ -538,12 +538,99 @@ static int mtk_pctrl_get_group_pins(struct pinctrl_dev > *pctldev, > return 0; > } > > +int mtk_hw_get_value_wrap(struct mtk_pinctrl *hw, unsigned int gpio, int > field) > +{ > + const struct mtk_pin_desc *desc; > + int value, err; > + > + if (gpio > hw->soc->npins) > + return -EINVAL; > + > + desc = (const struct mtk_pin_desc *)&hw->soc->pins[gpio]; > + > + err = mtk_hw_get_value(hw, desc, field, &value); > + if (err) > + return err; > + > + return value; > +} > + > +ssize_t mtk_pctrl_show_one_pin(struct mtk_pinctrl *hw, > + unsigned int gpio, char *buf, unsigned int bufLen) > +{ > + const struct mtk_pin_desc *desc; > + int pinmux, pullup, pullen, r1 = -1, r0 = -1, len = 0; Sort the variable declarations in reverse xmas tree order.
> + > + if (gpio > hw->soc->npins) > + return -EINVAL; > + > + desc = (const struct mtk_pin_desc *)&hw->soc->pins[gpio]; > + pinmux = mtk_pctrl_get_pinmux(hw, gpio); > + if (pinmux >= hw->soc->nfuncs) > + pinmux -= hw->soc->nfuncs; > + > + mtk_pinconf_bias_get_combo(hw, desc, &pullup, &pullen); > + if (pullen == MTK_PUPD_SET_R1R0_00) { > + pullen = 0; > + r1 = 0; > + r0 = 0; > + } else if (pullen == MTK_PUPD_SET_R1R0_01) { > + pullen = 1; > + r1 = 0; > + r0 = 1; > + } else if (pullen == MTK_PUPD_SET_R1R0_10) { > + pullen = 1; > + r1 = 1; > + r0 = 0; > + } else if (pullen == MTK_PUPD_SET_R1R0_11) { > + pullen = 1; > + r1 = 1; > + r0 = 1; > + } else if (pullen != MTK_DISABLE && pullen != MTK_ENABLE) { > + pullen = 0; > + } > + len += snprintf(buf + len, bufLen - len, > + "%03d: %1d%1d%1d%1d%02d%1d%1d%1d%1d", > + gpio, > + pinmux, > + mtk_pctrl_get_direction(hw, gpio), > + mtk_pctrl_get_out(hw, gpio), > + mtk_pctrl_get_in(hw, gpio), > + mtk_pctrl_get_driving(hw, gpio), > + mtk_pctrl_get_smt(hw, gpio), > + mtk_pctrl_get_ies(hw, gpio), > + pullen, > + pullup); > + > + if (r1 != -1) { > + len += snprintf(buf + len, bufLen - len, " (%1d %1d)\n", > + r1, r0); > + } else { > + len += snprintf(buf + len, bufLen - len, "\n"); > + } > + > + return len; > + } > + > +#define PIN_DBG_BUF_SZ 96 > +static void mtk_pctrl_dbg_show(struct pinctrl_dev *pctldev, struct seq_file > *s, > + unsigned int gpio) > +{ > + struct mtk_pinctrl *hw = pinctrl_dev_get_drvdata(pctldev); > + char buf[PIN_DBG_BUF_SZ]; > + > + (void)mtk_pctrl_show_one_pin(hw, gpio, buf, PIN_DBG_BUF_SZ); > + > + seq_printf(s, "%s", buf); > +} > + > static const struct pinctrl_ops mtk_pctlops = { > .dt_node_to_map = mtk_pctrl_dt_node_to_map, > .dt_free_map= pinctrl_utils_free_map, > .get_groups_count = mtk_pctrl_get_groups_count, > .get_group_name = mtk_pctrl_get_group_name, > .get_group_pins = mtk_pctrl_get_group_pins, > + .pin_dbg_show = mtk_pctrl_dbg_show, > }; > > static int mtk_pmx_get_funcs_cnt(struct pinctrl_dev *pctldev) > @@ -640,6
+727,7 @@ static int mtk_pconf_group_set(struct pinctrl_dev > *pctldev, unsigned group, > .pin_config_get = mtk_pinconf_get, > .pin_config_group_get = mtk_pconf_group_get, > .pin_config_group_set = mtk_pconf_group_set, > + .is_generic = true, > }; > > static struct pinctrl_desc mtk_desc = { > diff --git a/drivers/pinctrl/mediatek/pinctrl-paris.h > b/drivers/pinctrl/mediatek/pinctrl-paris.h > index 3d43771..d73f4b6 100644 > --- a/drivers/pinctrl/mediatek/pinctrl-paris.h > +++ b/drivers/pinctrl/mediatek/pinctrl-paris.h > @@ -60,6 +60,36 @@ > int mtk_paris_pinctrl_probe(struct
Re: [PATCH] compiler: enable CONFIG_OPTIMIZE_INLINING forcibly
On Fri, Sep 27, 2019 at 3:43 AM Nicolas Saenz Julienne wrote: > > On Fri, 2019-08-30 at 12:43 +0900, Masahiro Yamada wrote: > > Commit 9012d011660e ("compiler: allow all arches to enable > > CONFIG_OPTIMIZE_INLINING") allowed all architectures to enable > > this option. A couple of build errors were reported by randconfig, > > but all of them have been ironed out. > > > > Towards the goal of removing CONFIG_OPTIMIZE_INLINING entirely > > (and it will simplify the 'inline' macro in compiler_types.h), > > this commit changes it to always-on option. Going forward, the > > compiler will always be allowed to not inline functions marked > > 'inline'. > > > > This is not a problem for x86 since it has been long used by > > arch/x86/configs/{x86_64,i386}_defconfig. > > > > I am keeping the config option just in case any problem crops up for > > other architectures. > > > > The code clean-up will be done after confirming this is solid. > > > > Signed-off-by: Masahiro Yamada > > [ Adding some ARM people as they might be able to help ] > > This was found to cause a regression on a Raspberry Pi 2 B built with > bcm2835_defconfig which among other things has no SMP support. > > The relevant logs (edited to remove the noise) are: > > [5.827333] Run /init as init process > Loading, please wait... > Failed to set SO_PASSCRED: Bad address > Failed to bind netlink socket: Bad address > Failed to create manager: Bad address > Failed to set SO_PASSCRED: Bad address > [9.021623] systemd[1]: SO_PASSCRED failed: Bad address > [!!] Failed to start up manager. > [9.079148] systemd[1]: Freezing execution. > > I looked into it, it turns out that the call to get_user() in > sock_setsockopt() > is returning -EFAULT. Down the assembly rabbit hole that get_user() is I > found-out that it's the macro 'check_uaccess' who's triggering the error. > > I'm clueless at this point, so I hope you can give me some hints on what's > going bad here. 
So get_user() was passed a bad value/pointer from userspace? Do you know which of the three calls to get_user() from sock_setsockopt() is failing? (It's not immediately clear to me how this patch is at fault, vs there just being a bug in the source somewhere). -- Thanks, ~Nick Desaulniers
Re: [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits
On 9/26/19 5:55 PM, Mina Almasry wrote: > Provided we keep the existing controller untouched, should the new > controller track: > > 1. only reservations, or > 2. both reservations and allocations for which no reservations exist > (such as the MAP_NORESERVE case)? > > I like the 'both' approach. Seems to me a counter like that would work > automatically regardless of whether the application is allocating > hugetlb memory with NORESERVE or not. NORESERVE allocations cannot cut > into reserved hugetlb pages, correct? Correct. One other easy way to allocate huge pages without reserves (that I know is used today) is via the fallocate system call. > If so, then applications that > allocate with NORESERVE will get sigbused when they hit their limit, > and applications that allocate without NORESERVE may get an error at > mmap time but will always be within their limits while they access the > mmap'd memory, correct? Correct. At page allocation time we can easily check to see if a reservation exists and not charge. For any specific page within a hugetlbfs file, a charge would happen at mmap time or allocation time. One exception (that I can think of) to this mmap(RESERVE) will not cause a SIGBUS rule is in the case of hole punch. If someone punches a hole in a file, not only do they remove pages associated with the file but the reservation information as well. Therefore, a subsequent fault will be the same as an allocation without reservation. I 'think' the code to remove/truncate a file will work correctly as it is today, but I need to think about this some more. > mmap'd memory, correct? So the 'both' counter seems like a one size > fits all. > > I think the only sticking point left is whether an added controller > can support both cgroup-v2 and cgroup-v1. If I could get confirmation > on that I'll provide a patchset. Sorry, but I can not provide cgroup expertise. -- Mike Kravetz
Re: [PATCH] mm/page_alloc: fix a crash in free_pages_prepare()
On Fri, 27 Sep 2019 17:28:06 -0400 Qian Cai wrote: > > > > So I think you've moved the arch_free_page() to be after the final > > thing which can access page contents, yes? If so, we should have a > > comment in free_pages_prepare() to attmept to prevent this problem from > > reoccurring as the code evolves? > > Right, something like this above arch_free_page() there? > > /* > * It needs to be just above kernel_map_pages(), as s390 could mark those > * pages unused and then trigger a fault when accessing. > */ I did this. --- a/mm/page_alloc.c~mm-page_alloc-fix-a-crash-in-free_pages_prepare-fix +++ a/mm/page_alloc.c @@ -1179,7 +1179,13 @@ static __always_inline bool free_pages_p kernel_init_free_pages(page, 1 << order); kernel_poison_pages(page, 1 << order, 0); + /* +* arch_free_page() can make the page's contents inaccessible. s390 +* does this. So nothing which can access the page's contents should +* happen after this. +*/ arch_free_page(page, order); + if (debug_pagealloc_enabled()) kernel_map_pages(page, 1 << order, 0); _
Re: [PATCH v3] platform/x86: dell-laptop: disable kbd backlight on Inspiron 10xx
On Friday 27 September 2019 23:19:03 Pacien TRAN-GIRARD wrote: > This patch adds a quirk disabling keyboard backlight support for the > Dell Inspiron 1012 and 1018. > > Those models wrongly report supporting keyboard backlight control > features (through SMBIOS tokens) even though they're not equipped with > a backlit keyboard. This led to broken controls being exposed > through sysfs by this driver which froze the system when used. > > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=107651 > Signed-off-by: Pacien TRAN-GIRARD > --- Thank you for update! Now it is clear what is the problem and you can add my Reviewed-by: Pali Rohár > drivers/platform/x86/dell-laptop.c | 26 ++ > 1 file changed, 26 insertions(+) > > diff --git a/drivers/platform/x86/dell-laptop.c > b/drivers/platform/x86/dell-laptop.c > index d27be2836bc2..74e988f839e8 100644 > --- a/drivers/platform/x86/dell-laptop.c > +++ b/drivers/platform/x86/dell-laptop.c > @@ -33,6 +33,7 @@ > > struct quirk_entry { > bool touchpad_led; > + bool kbd_led_not_present; > bool kbd_led_levels_off_1; > bool kbd_missing_ac_tag; > > @@ -73,6 +74,10 @@ static struct quirk_entry quirk_dell_latitude_e6410 = { > .kbd_led_levels_off_1 = true, > }; > > +static struct quirk_entry quirk_dell_inspiron_1012 = { > + .kbd_led_not_present = true, > +}; > + > static struct platform_driver platform_driver = { > .driver = { > .name = "dell-laptop", > @@ -310,6 +315,24 @@ static const struct dmi_system_id dell_quirks[] > __initconst = { > }, > .driver_data = &quirk_dell_latitude_e6410, > }, > + { > + .callback = dmi_matched, > + .ident = "Dell Inspiron 1012", > + .matches = { > + DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."), > + DMI_MATCH(DMI_PRODUCT_NAME, "Inspiron 1012"), > + }, > + .driver_data = &quirk_dell_inspiron_1012, > + }, > + { > + .callback = dmi_matched, > + .ident = "Dell Inspiron 1018", > + .matches = { > + DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."), > + DMI_MATCH(DMI_PRODUCT_NAME, "Inspiron 1018"), > + }, > + .driver_data = &quirk_dell_inspiron_1012, > + }, > { } > }; > > @@ -1493,6 +1516,9 @@ static void kbd_init(void) > { > int ret; > > + if (quirks && quirks->kbd_led_not_present) > + return; > + > ret = kbd_init_info(); > kbd_init_tokens(); > -- Pali Rohár pali.ro...@gmail.com
Re: Do we need to correct barriering in circular-buffers.rst?
On Fri, Sep 27, 2019 at 1:43 PM Nick Desaulniers wrote: > > On Fri, Sep 27, 2019 at 5:49 AM Peter Zijlstra wrote: > > Barring LTO the above works for perf because of inter-translation-unit > > function calls, which imply a compiler barrier. > > > > Now, when the compiler inlines, it looses that sync point (and thereby > > subtlely changes semantics from the non-inline variant). I suspect LTO > > does the same and can cause subtle breakage through this transformation. > > Do you have a bug report or godbolt link for the above? I trust that > you're familiar enough with the issue to be able to quickly reproduce > it? These descriptions of problems are difficult for me to picture in > code or generated code, and when I try to read through > memory-barriers.txt my eyes start to glaze over (then something else > catches fire and I have to go put that out). Having a concise test > case I think would better illustrate potential issues with LTO that > we'd then be able to focus on trying to fix/support. > Further, if we identified a case were the existing invariants were broken under LTO, such a case could be added to CONFIG_DEBUG_LOCKING_API_SELFTESTS (or whatever the most appropriate kself test is). -- Thanks, ~Nick Desaulniers
Re: [PATCH v8 17/17] Documentation/x86/64: Add documentation for GS/FS addressing mode
> On Sep 27, 2019, at 14:25, Randy Dunlap wrote: > > Hi, > Some doc comments/fixes below... > > On 9/12/19 1:06 PM, Chang S. Bae wrote: >> From: Thomas Gleixner >> [snip] >> +There exist two mechanisms to read and write the FS/FS base address: > >FS/GS [snip] >> + The arch_prctl(2) based mechanism is available on all 64bit CPUs and all > > 64-bit [snip] >> +mov %reg, %gs:offset >> \ No newline at end of file > oops. Thanks. Let me include this in my v9 submission. Chang
Re: [PATCH 1/2] Modify cpupower to schedule itself on cores it is reading MSRs from
On Friday, September 27, 2019 6:07:56 PM CEST Natarajan, Janakarajan wrote: > On 9/18/2019 11:34 AM, Natarajan, Janakarajan wrote: > > This is advantageous because an IPI is not generated when a read_msr() is > > executed on the local logical CPU thereby reducing the chance of having > > APERF and MPERF being out of sync. > > + if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) { > > + dprint("Could not migrate to cpu: %d\n", cpu); > > + return 1; On an 80 core CPU the process would be pushed around through the system quite a lot. This might affect what you are measuring or the other measured values? Otherwise it's the kernel's MSR read only, not the whole cpupower process, right? No idea about the exact overhead, though. Others in CC list should know. Afaik msr reads through msr module should be avoided anyway? Those which are worth it are abstracted through sysfs nowadays? For aperf/mperf it might make sense to define a sysfs file where you can read both, as this is what you always need? It would take a while, but could be a longterm solution which is also usable in secure boot or without msr module case. Thomas
[PATCH v2 2/8] KVM: VMX: Skip GUEST_CR3 VMREAD+VMWRITE if the VMCS is up-to-date
Skip the VMWRITE to update GUEST_CR3 if CR3 is not available, i.e. has not been read from the VMCS since the last VM-Enter. If vcpu->arch.cr3 is stale, kvm_read_cr3(vcpu) will refresh vcpu->arch.cr3 from the VMCS, meaning KVM will do a VMREAD and then VMWRITE the value it just pulled from the VMCS. Note, this is a purely theoretical change, no instances of skipping the VMREAD+VMWRITE have been observed with this change. Tested-by: Reto Buerki Tested-by: Vitaly Kuznetsov Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/vmx.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 7679c2a05a50..0b8dd9c315f8 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -3004,10 +3004,12 @@ void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3) /* Loading vmcs02.GUEST_CR3 is handled by nested VM-Enter. */ if (is_guest_mode(vcpu)) update_guest_cr3 = false; - else if (enable_unrestricted_guest || is_paging(vcpu)) - guest_cr3 = kvm_read_cr3(vcpu); - else + else if (!enable_unrestricted_guest && !is_paging(vcpu)) guest_cr3 = to_kvm_vmx(kvm)->ept_identity_map_addr; + else if (test_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail)) + guest_cr3 = vcpu->arch.cr3; + else /* vmcs01.GUEST_CR3 is already up-to-date. */ + update_guest_cr3 = false; ept_load_pdptrs(vcpu); } -- 2.22.0
[PATCH v2 4/8] KVM: VMX: Optimize vmx_set_rflags() for unrestricted guest
Rework vmx_set_rflags() to avoid the extra code needed to handle emulation of real mode and invalid state when unrestricted guest is disabled. The primary reason for doing so is to avoid the call to vmx_get_rflags(), which will incur a VMREAD when RFLAGS is not already available. When running nested VMs, the majority of calls to vmx_set_rflags() will occur without an associated vmx_get_rflags(), i.e. when stuffing GUEST_RFLAGS during transitions between vmcs01 and vmcs02. Note, vmx_get_rflags() guarantees RFLAGS is marked available. Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/vmx.c | 28 ++-- 1 file changed, 18 insertions(+), 10 deletions(-) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 83fe8b02b732..814d3e6d0264 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -1426,18 +1426,26 @@ unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu) void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags) { struct vcpu_vmx *vmx = to_vmx(vcpu); - unsigned long old_rflags = vmx_get_rflags(vcpu); + unsigned long old_rflags; - __set_bit(VCPU_EXREG_RFLAGS, (ulong *)&vcpu->arch.regs_avail); - vmx->rflags = rflags; - if (vmx->rmode.vm86_active) { - vmx->rmode.save_rflags = rflags; - rflags |= X86_EFLAGS_IOPL | X86_EFLAGS_VM; + if (enable_unrestricted_guest) { + __set_bit(VCPU_EXREG_RFLAGS, (ulong *)&vcpu->arch.regs_avail); + + vmx->rflags = rflags; + vmcs_writel(GUEST_RFLAGS, rflags); + } else { + old_rflags = vmx_get_rflags(vcpu); + + vmx->rflags = rflags; + if (vmx->rmode.vm86_active) { + vmx->rmode.save_rflags = rflags; + rflags |= X86_EFLAGS_IOPL | X86_EFLAGS_VM; + } + vmcs_writel(GUEST_RFLAGS, rflags); + + if ((old_rflags ^ vmx->rflags) & X86_EFLAGS_VM) + vmx->emulation_required = emulation_required(vcpu); } - vmcs_writel(GUEST_RFLAGS, rflags); - - if ((old_rflags ^ vmx->rflags) & X86_EFLAGS_VM) - vmx->emulation_required = emulation_required(vcpu); } u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu) -- 2.22.0
[PATCH v2 3/8] KVM: VMX: Consolidate to_vmx() usage in RFLAGS accessors
Capture struct vcpu_vmx in a local variable to improve the readability of vmx_{g,s}et_rflags(). No functional change intended. Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/vmx.c | 20 +++- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 0b8dd9c315f8..83fe8b02b732 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -1407,35 +1407,37 @@ static void vmx_decache_cr0_guest_bits(struct kvm_vcpu *vcpu); unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu) { + struct vcpu_vmx *vmx = to_vmx(vcpu); unsigned long rflags, save_rflags; if (!test_bit(VCPU_EXREG_RFLAGS, (ulong *)&vcpu->arch.regs_avail)) { __set_bit(VCPU_EXREG_RFLAGS, (ulong *)&vcpu->arch.regs_avail); rflags = vmcs_readl(GUEST_RFLAGS); - if (to_vmx(vcpu)->rmode.vm86_active) { + if (vmx->rmode.vm86_active) { rflags &= RMODE_GUEST_OWNED_EFLAGS_BITS; - save_rflags = to_vmx(vcpu)->rmode.save_rflags; + save_rflags = vmx->rmode.save_rflags; rflags |= save_rflags & ~RMODE_GUEST_OWNED_EFLAGS_BITS; } - to_vmx(vcpu)->rflags = rflags; + vmx->rflags = rflags; } - return to_vmx(vcpu)->rflags; + return vmx->rflags; } void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags) { + struct vcpu_vmx *vmx = to_vmx(vcpu); unsigned long old_rflags = vmx_get_rflags(vcpu); __set_bit(VCPU_EXREG_RFLAGS, (ulong *)&vcpu->arch.regs_avail); - to_vmx(vcpu)->rflags = rflags; - if (to_vmx(vcpu)->rmode.vm86_active) { - to_vmx(vcpu)->rmode.save_rflags = rflags; + vmx->rflags = rflags; + if (vmx->rmode.vm86_active) { + vmx->rmode.save_rflags = rflags; rflags |= X86_EFLAGS_IOPL | X86_EFLAGS_VM; } vmcs_writel(GUEST_RFLAGS, rflags); - if ((old_rflags ^ to_vmx(vcpu)->rflags) & X86_EFLAGS_VM) - to_vmx(vcpu)->emulation_required = emulation_required(vcpu); + if ((old_rflags ^ vmx->rflags) & X86_EFLAGS_VM) + vmx->emulation_required = emulation_required(vcpu); } u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu) -- 2.22.0
[PATCH v2 0/8] KVM: x86: nVMX GUEST_CR3 bug fix, and then some...
*sigh* v2 was shaping up to be a trivial update, until I started working on Vitaly's suggestion to add a helper to test for register availability. The primary purpose of this series is to fix a CR3 corruption in L2 reported by Reto Buerki when running with HLT interception disabled in L1. On a nested VM-Enter that puts L2 into HLT, KVM never actually enters L2 and instead mimics HLT interception by canceling the nested run and pretending that VM-Enter to L2 completed and then exited on HLT (which KVM intercepted). Because KVM never actually runs L2, KVM skips the pending MMU update for L2 and so leaves a stale value in vmcs02.GUEST_CR3. If the next wake event for L2 triggers a nested VM-Exit, KVM will refresh vmcs12->guest_cr3 from vmcs02.GUEST_CR3 and consume the stale value. Fix the issue by unconditionally writing vmcs02.GUEST_CR3 during nested VM-Enter instead of deferring the update to vmx_set_cr3(), and skip the update of GUEST_CR3 in vmx_set_cr3() when running L2. I.e. make the nested code fully responsible for vmcs02.GUEST_CR3. Patch 02/08 is a minor optimization to skip the GUEST_CR3 update if vmcs01 is already up-to-date. Patches 03 and beyond are Vitaly's fault ;-). Patches 03 and 04 are tangentially related cleanup to vmx_set_rflags() that was discovered when working through the avail/dirty testing code. Ideally they'd be sent as a separate series, but they conflict with the avail/dirty helper changes and are themselves minor and straightforward. Patches 05 and 06 clean up the register caching code so that there is a single enum for all registers which use avail/dirty tracking. While not a true prerequisite for the avail/dirty helpers, the cleanup allows the new helpers to take an 'enum kvm_reg' instead of a less helpful 'int reg'. Patch 07 is the helpers themselves, as suggested by Vitaly. Patch 08 is a truly optional change to ditch decache_cr3() in favor of handling CR3 via cache_reg() like any other avail/dirty register. 
Note, I collected the Reviewed-by and Tested-by tags for patches 01 and 02 even though I inverted the boolean from 'skip_cr3' to 'update_guest_cr3'. Please drop the tags if that constitutes a non-trivial functional change. v2: - Invert skip_cr3 to update_guest_cr3. [Liran] - Reword the changelog and comment to be more explicit in detailing how/when KVM will process a nested VM-Enter without running L2. [Liran] - Added Reviewed-by and Tested-by tags. - Add a comment in vmx_set_cr3() to explicitly state that nested VM-Enter is responsible for loading vmcs02.GUEST_CR3. [Jim] - All of the loveliness in patches 03-08. [Vitaly] Sean Christopherson (8): KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter KVM: VMX: Skip GUEST_CR3 VMREAD+VMWRITE if the VMCS is up-to-date KVM: VMX: Consolidate to_vmx() usage in RFLAGS accessors KVM: VMX: Optimize vmx_set_rflags() for unrestricted guest KVM: x86: Add WARNs to detect out-of-bounds register indices KVM: x86: Fold 'enum kvm_ex_reg' definitions into 'enum kvm_reg' KVM: x86: Add helpers to test/mark reg availability and dirtiness KVM: x86: Fold decache_cr3() into cache_reg() arch/x86/include/asm/kvm_host.h | 5 +- arch/x86/kvm/kvm_cache_regs.h | 67 +-- arch/x86/kvm/svm.c | 5 -- arch/x86/kvm/vmx/nested.c | 14 - arch/x86/kvm/vmx/vmx.c | 94 ++--- arch/x86/kvm/x86.c | 13 ++--- arch/x86/kvm/x86.h | 6 +-- 7 files changed, 123 insertions(+), 81 deletions(-) -- 2.22.0
[PATCH v2 8/8] KVM: x86: Fold decache_cr3() into cache_reg()
Handle caching CR3 (from VMX's VMCS) into struct kvm_vcpu via the common cache_reg() callback and drop the dedicated decache_cr3(). The name decache_cr3() is somewhat confusing as the caching behavior of CR3 follows that of GPRs, RFLAGS and PDPTRs, (handled via cache_reg()), and has nothing in common with the caching behavior of CR0/CR4 (whose decache_cr{0,4}_guest_bits() likely provided the 'decache' verbiage). Note, this effectively adds a BUG() if KVM attempts to cache CR3 on SVM. Opportunistically add a WARN_ON_ONCE() in VMX to provide an equivalent check. Signed-off-by: Sean Christopherson --- arch/x86/include/asm/kvm_host.h | 1 - arch/x86/kvm/kvm_cache_regs.h | 2 +- arch/x86/kvm/svm.c | 5 - arch/x86/kvm/vmx/vmx.c | 15 ++- 4 files changed, 7 insertions(+), 16 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index a27f7f6b6b7a..0411dc0a27b0 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1040,7 +1040,6 @@ struct kvm_x86_ops { struct kvm_segment *var, int seg); void (*get_cs_db_l_bits)(struct kvm_vcpu *vcpu, int *db, int *l); void (*decache_cr0_guest_bits)(struct kvm_vcpu *vcpu); - void (*decache_cr3)(struct kvm_vcpu *vcpu); void (*decache_cr4_guest_bits)(struct kvm_vcpu *vcpu); void (*set_cr0)(struct kvm_vcpu *vcpu, unsigned long cr0); void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3); diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h index 9c2bc528800b..f18177cd0030 100644 --- a/arch/x86/kvm/kvm_cache_regs.h +++ b/arch/x86/kvm/kvm_cache_regs.h @@ -145,7 +145,7 @@ static inline ulong kvm_read_cr4_bits(struct kvm_vcpu *vcpu, ulong mask) static inline ulong kvm_read_cr3(struct kvm_vcpu *vcpu) { if (!kvm_register_is_available(vcpu, VCPU_EXREG_CR3)) - kvm_x86_ops->decache_cr3(vcpu); + kvm_x86_ops->cache_reg(vcpu, VCPU_EXREG_CR3); return vcpu->arch.cr3; } diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index f8ecb6df5106..3102c44c12c6 100644 --- 
a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -2517,10 +2517,6 @@ static void svm_decache_cr0_guest_bits(struct kvm_vcpu *vcpu) { } -static void svm_decache_cr3(struct kvm_vcpu *vcpu) -{ -} - static void svm_decache_cr4_guest_bits(struct kvm_vcpu *vcpu) { } @@ -7208,7 +7204,6 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = { .get_cpl = svm_get_cpl, .get_cs_db_l_bits = kvm_get_cs_db_l_bits, .decache_cr0_guest_bits = svm_decache_cr0_guest_bits, - .decache_cr3 = svm_decache_cr3, .decache_cr4_guest_bits = svm_decache_cr4_guest_bits, .set_cr0 = svm_set_cr0, .set_cr3 = svm_set_cr3, diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index ed03d0cd1cc8..c84798026e85 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -2188,7 +2188,12 @@ static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg) if (enable_ept) ept_save_pdptrs(vcpu); break; + case VCPU_EXREG_CR3: + if (enable_unrestricted_guest || (enable_ept && is_paging(vcpu))) + vcpu->arch.cr3 = vmcs_readl(GUEST_CR3); + break; default: + WARN_ON_ONCE(1); break; } } @@ -2859,13 +2864,6 @@ static void vmx_decache_cr0_guest_bits(struct kvm_vcpu *vcpu) vcpu->arch.cr0 |= vmcs_readl(GUEST_CR0) & cr0_guest_owned_bits; } -static void vmx_decache_cr3(struct kvm_vcpu *vcpu) -{ - if (enable_unrestricted_guest || (enable_ept && is_paging(vcpu))) - vcpu->arch.cr3 = vmcs_readl(GUEST_CR3); - kvm_register_mark_available(vcpu, VCPU_EXREG_CR3); -} - static void vmx_decache_cr4_guest_bits(struct kvm_vcpu *vcpu) { ulong cr4_guest_owned_bits = vcpu->arch.cr4_guest_owned_bits; @@ -2910,7 +2908,7 @@ static void ept_update_paging_mode_cr0(unsigned long *hw_cr0, struct vcpu_vmx *vmx = to_vmx(vcpu); if (!kvm_register_is_available(vcpu, VCPU_EXREG_CR3)) - vmx_decache_cr3(vcpu); + vmx_cache_reg(vcpu, VCPU_EXREG_CR3); if (!(cr0 & X86_CR0_PG)) { /* From paging/starting to nonpaging */ exec_controls_setbit(vmx, CPU_BASED_CR3_LOAD_EXITING | @@ -7792,7 +7790,6 @@ static struct kvm_x86_ops vmx_x86_ops 
__ro_after_init = { .get_cpl = vmx_get_cpl, .get_cs_db_l_bits = vmx_get_cs_db_l_bits, .decache_cr0_guest_bits = vmx_decache_cr0_guest_bits, - .decache_cr3 = vmx_decache_cr3, .decache_cr4_guest_bits = vmx_decache_cr4_guest_bits, .set_cr0 = vmx_set_cr0, .set_cr3 = vmx_set_cr3, -- 2.22.0
[PATCH v2 1/8] KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter
Write the desired L2 CR3 into vmcs02.GUEST_CR3 during nested VM-Enter instead of deferring the VMWRITE until vmx_set_cr3(). If the VMWRITE is deferred, then KVM can consume a stale vmcs02.GUEST_CR3 when it refreshes vmcs12->guest_cr3 during nested_vmx_vmexit() if the emulated VM-Exit occurs without actually entering L2, e.g. if the nested run is squashed because nested VM-Enter (from L1) is putting L2 into HLT. Note, the above scenario can occur regardless of whether L1 is intercepting HLT, e.g. L1 can intercept HLT and then re-enter L2 with vmcs.GUEST_ACTIVITY_STATE=HALTED. But practically speaking, a VMM will likely put a guest into HALTED if and only if it's not intercepting HLT. In an ideal world where EPT *requires* unrestricted guest (and vice versa), VMX could handle CR3 similar to how it handles RSP and RIP, e.g. mark CR3 dirty and conditionally load it at vmx_vcpu_run(). But the unrestricted guest silliness complicates the dirty tracking logic to the point that explicitly handling vmcs02.GUEST_CR3 during nested VM-Enter is a simpler overall implementation. Cc: sta...@vger.kernel.org Reported-and-tested-by: Reto Buerki Tested-by: Vitaly Kuznetsov Reviewed-by: Liran Alon Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/nested.c | 10 ++ arch/x86/kvm/vmx/vmx.c| 10 +++--- 2 files changed, 17 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 41abc62c9a8a..b72a00b53e4a 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -2418,6 +2418,16 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, entry_failure_code)) return -EINVAL; + /* +* Immediately write vmcs02.GUEST_CR3. It will be propagated to vmcs12 +* on nested VM-Exit, which can occur without actually running L2 and +* thus without hitting vmx_set_cr3(), e.g. if L1 is entering L2 with +* vmcs12.GUEST_ACTIVITYSTATE=HLT, in which case KVM will intercept the +* transition to HLT instead of running L2. 
+	 */
+	if (enable_ept)
+		vmcs_writel(GUEST_CR3, vmcs12->guest_cr3);
+
 	/* Late preparation of GUEST_PDPTRs now that EFER and CRs are set. */
 	if (load_guest_pdptrs_vmcs12 && nested_cpu_has_ept(vmcs12) &&
 	    is_pae_paging(vcpu)) {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d4575ffb3cec..7679c2a05a50 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2984,6 +2984,7 @@ u64 construct_eptp(struct kvm_vcpu *vcpu, unsigned long root_hpa)
 void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 {
 	struct kvm *kvm = vcpu->kvm;
+	bool update_guest_cr3 = true;
 	unsigned long guest_cr3;
 	u64 eptp;
@@ -3000,15 +3001,18 @@ void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 		spin_unlock(&to_kvm_vmx(kvm)->ept_pointer_lock);
 	}

-	if (enable_unrestricted_guest || is_paging(vcpu) ||
-	    is_guest_mode(vcpu))
+	/* Loading vmcs02.GUEST_CR3 is handled by nested VM-Enter. */
+	if (is_guest_mode(vcpu))
+		update_guest_cr3 = false;
+	else if (enable_unrestricted_guest || is_paging(vcpu))
 		guest_cr3 = kvm_read_cr3(vcpu);
 	else
 		guest_cr3 = to_kvm_vmx(kvm)->ept_identity_map_addr;
 		ept_load_pdptrs(vcpu);
 	}

-	vmcs_writel(GUEST_CR3, guest_cr3);
+	if (update_guest_cr3)
+		vmcs_writel(GUEST_CR3, guest_cr3);
 }

 int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
--
2.22.0
[PATCH v2 7/8] KVM: x86: Add helpers to test/mark reg availability and dirtiness
Add helpers to prettify code that tests and/or marks whether or not a register is available and/or dirty.

Suggested-by: Vitaly Kuznetsov
Signed-off-by: Sean Christopherson
---
 arch/x86/kvm/kvm_cache_regs.h | 45 +++++++++++++++++++++++++++---------
 arch/x86/kvm/vmx/nested.c     |  4 ++--
 arch/x86/kvm/vmx/vmx.c        | 29 ++++++++---------------
 arch/x86/kvm/x86.c            | 13 ++++++-----
 4 files changed, 53 insertions(+), 38 deletions(-)

diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index b85fc4b4e04f..9c2bc528800b 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -37,12 +37,37 @@ BUILD_KVM_GPR_ACCESSORS(r14, R14)
 BUILD_KVM_GPR_ACCESSORS(r15, R15)
 #endif

+static inline bool kvm_register_is_available(struct kvm_vcpu *vcpu,
+					     enum kvm_reg reg)
+{
+	return test_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
+}
+
+static inline bool kvm_register_is_dirty(struct kvm_vcpu *vcpu,
+					 enum kvm_reg reg)
+{
+	return test_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
+}
+
+static inline void kvm_register_mark_available(struct kvm_vcpu *vcpu,
+					       enum kvm_reg reg)
+{
+	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
+}
+
+static inline void kvm_register_mark_dirty(struct kvm_vcpu *vcpu,
+					   enum kvm_reg reg)
+{
+	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
+	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
+}
+
 static inline unsigned long kvm_register_read(struct kvm_vcpu *vcpu, int reg)
 {
 	if (WARN_ON_ONCE((unsigned int)reg >= NR_VCPU_REGS))
 		return 0;

-	if (!test_bit(reg, (unsigned long *)&vcpu->arch.regs_avail))
+	if (!kvm_register_is_available(vcpu, reg))
 		kvm_x86_ops->cache_reg(vcpu, reg);

 	return vcpu->arch.regs[reg];
@@ -55,13 +80,12 @@ static inline void kvm_register_write(struct kvm_vcpu *vcpu, int reg,
 		return;

 	vcpu->arch.regs[reg] = val;
-	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
-	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
+	kvm_register_mark_dirty(vcpu, reg);
 }

 static inline unsigned long kvm_rip_read(struct kvm_vcpu *vcpu)
 {
-	if (!test_bit(VCPU_REGS_RIP, (unsigned long
*)&vcpu->arch.regs_avail))
+	if (!kvm_register_is_available(vcpu, VCPU_REGS_RIP))
 		kvm_x86_ops->cache_reg(vcpu, VCPU_REGS_RIP);

 	return vcpu->arch.regs[VCPU_REGS_RIP];
@@ -70,13 +94,12 @@ static inline unsigned long kvm_rip_read(struct kvm_vcpu *vcpu)
 static inline void kvm_rip_write(struct kvm_vcpu *vcpu, unsigned long val)
 {
 	vcpu->arch.regs[VCPU_REGS_RIP] = val;
-	__set_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_dirty);
-	__set_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_avail);
+	kvm_register_mark_dirty(vcpu, VCPU_REGS_RIP);
 }

 static inline unsigned long kvm_rsp_read(struct kvm_vcpu *vcpu)
 {
-	if (!test_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_avail))
+	if (!kvm_register_is_available(vcpu, VCPU_REGS_RSP))
 		kvm_x86_ops->cache_reg(vcpu, VCPU_REGS_RSP);

 	return vcpu->arch.regs[VCPU_REGS_RSP];
@@ -85,16 +108,14 @@ static inline unsigned long kvm_rsp_read(struct kvm_vcpu *vcpu)
 static inline void kvm_rsp_write(struct kvm_vcpu *vcpu, unsigned long val)
 {
 	vcpu->arch.regs[VCPU_REGS_RSP] = val;
-	__set_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_dirty);
-	__set_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_avail);
+	kvm_register_mark_dirty(vcpu, VCPU_REGS_RSP);
 }

 static inline u64 kvm_pdptr_read(struct kvm_vcpu *vcpu, int index)
 {
 	might_sleep();  /* on svm */

-	if (!test_bit(VCPU_EXREG_PDPTR,
-		      (unsigned long *)&vcpu->arch.regs_avail))
+	if (!kvm_register_is_available(vcpu, VCPU_EXREG_PDPTR))
 		kvm_x86_ops->cache_reg(vcpu, VCPU_EXREG_PDPTR);

 	return vcpu->arch.walk_mmu->pdptrs[index];
@@ -123,7 +144,7 @@ static inline ulong kvm_read_cr4_bits(struct kvm_vcpu *vcpu, ulong mask)
 static inline ulong kvm_read_cr3(struct kvm_vcpu *vcpu)
 {
-	if (!test_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail))
+	if (!kvm_register_is_available(vcpu, VCPU_EXREG_CR3))
 		kvm_x86_ops->decache_cr3(vcpu);
 	return vcpu->arch.cr3;
 }
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index b72a00b53e4a..8946f11c574c 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@
-1012,7 +1012,7 @@ static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3, bool ne
 	kvm_mmu_new_cr3(vcpu, cr3, false);
 	vcpu->arch.cr3 = cr3;
-	__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
+	kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);
 	kvm_init_mmu(vcpu,
[PATCH v2 6/8] KVM: x86: Fold 'enum kvm_ex_reg' definitions into 'enum kvm_reg'
Now that indexing into arch.regs is either protected by WARN_ON_ONCE or done with hardcoded enums, combine all definitions for registers that are tracked by regs_avail and regs_dirty into 'enum kvm_reg'. Having a single enum type will simplify additional cleanup related to regs_avail and regs_dirty. Signed-off-by: Sean Christopherson --- arch/x86/include/asm/kvm_host.h | 4 +--- arch/x86/kvm/kvm_cache_regs.h | 2 +- 2 files changed, 2 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 23edf56cf577..a27f7f6b6b7a 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -156,10 +156,8 @@ enum kvm_reg { VCPU_REGS_R15 = __VCPU_REGS_R15, #endif VCPU_REGS_RIP, - NR_VCPU_REGS -}; + NR_VCPU_REGS, -enum kvm_reg_ex { VCPU_EXREG_PDPTR = NR_VCPU_REGS, VCPU_EXREG_CR3, VCPU_EXREG_RFLAGS, diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h index 3972e1b65635..b85fc4b4e04f 100644 --- a/arch/x86/kvm/kvm_cache_regs.h +++ b/arch/x86/kvm/kvm_cache_regs.h @@ -95,7 +95,7 @@ static inline u64 kvm_pdptr_read(struct kvm_vcpu *vcpu, int index) if (!test_bit(VCPU_EXREG_PDPTR, (unsigned long *)>arch.regs_avail)) - kvm_x86_ops->cache_reg(vcpu, (enum kvm_reg)VCPU_EXREG_PDPTR); + kvm_x86_ops->cache_reg(vcpu, VCPU_EXREG_PDPTR); return vcpu->arch.walk_mmu->pdptrs[index]; } -- 2.22.0
[PATCH v2 5/8] KVM: x86: Add WARNs to detect out-of-bounds register indices
Add WARN_ON_ONCE() checks in kvm_register_{read,write}() to detect reg values that would cause KVM to overflow vcpu->arch.regs. Change the reg param to an 'int' to make it clear that the reg index is unverified. Open code the RIP and RSP accessors so as to avoid pointless overhead of WARN_ON_ONCE(). Alternatively, lower-level helpers could be provided, but that opens the door for improper use of said helpers, and the ugliness of the open-coding will be slightly improved in future patches. Regarding the overhead of WARN_ON_ONCE(), now that all fixed GPR reads and writes use dedicated accessors, e.g. kvm_rax_read(), the overhead is limited to flows where the reg index is generated at runtime. And there is at least one historical bug where KVM has generated an out-of-bounds access to arch.regs (see commit b68f3cc7d9789, "KVM: x86: Always use 32-bit SMRAM save state for 32-bit kernels"). Adding the WARN_ON_ONCE() protection paves the way for additional cleanup related to kvm_reg and kvm_reg_ex.
Signed-off-by: Sean Christopherson
---
 arch/x86/kvm/kvm_cache_regs.h | 30 ++++++++++++++++++++----------
 arch/x86/kvm/x86.h            |  6 ++----
 2 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 1cc6c47dc77e..3972e1b65635 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -37,19 +37,23 @@ BUILD_KVM_GPR_ACCESSORS(r14, R14)
 BUILD_KVM_GPR_ACCESSORS(r15, R15)
 #endif

-static inline unsigned long kvm_register_read(struct kvm_vcpu *vcpu,
-					      enum kvm_reg reg)
+static inline unsigned long kvm_register_read(struct kvm_vcpu *vcpu, int reg)
 {
+	if (WARN_ON_ONCE((unsigned int)reg >= NR_VCPU_REGS))
+		return 0;
+
 	if (!test_bit(reg, (unsigned long *)&vcpu->arch.regs_avail))
 		kvm_x86_ops->cache_reg(vcpu, reg);

 	return vcpu->arch.regs[reg];
 }

-static inline void kvm_register_write(struct kvm_vcpu *vcpu,
-				      enum kvm_reg reg,
+static inline void kvm_register_write(struct kvm_vcpu *vcpu, int reg,
 				      unsigned long val)
 {
+	if (WARN_ON_ONCE((unsigned int)reg >= NR_VCPU_REGS))
+		return;
+
 	vcpu->arch.regs[reg] = val;
 	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
 	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
@@ -57,22 +61,32 @@ static inline void kvm_register_write(struct kvm_vcpu *vcpu,
 static inline unsigned long kvm_rip_read(struct kvm_vcpu *vcpu)
 {
-	return kvm_register_read(vcpu, VCPU_REGS_RIP);
+	if (!test_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_avail))
+		kvm_x86_ops->cache_reg(vcpu, VCPU_REGS_RIP);
+
+	return vcpu->arch.regs[VCPU_REGS_RIP];
 }

 static inline void kvm_rip_write(struct kvm_vcpu *vcpu, unsigned long val)
 {
-	kvm_register_write(vcpu, VCPU_REGS_RIP, val);
+	vcpu->arch.regs[VCPU_REGS_RIP] = val;
+	__set_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_dirty);
+	__set_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_avail);
 }

 static inline unsigned long kvm_rsp_read(struct kvm_vcpu *vcpu)
 {
-	return kvm_register_read(vcpu, VCPU_REGS_RSP);
+	if (!test_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_avail))
+
		kvm_x86_ops->cache_reg(vcpu, VCPU_REGS_RSP);
+
+	return vcpu->arch.regs[VCPU_REGS_RSP];
 }

 static inline void kvm_rsp_write(struct kvm_vcpu *vcpu, unsigned long val)
 {
-	kvm_register_write(vcpu, VCPU_REGS_RSP, val);
+	vcpu->arch.regs[VCPU_REGS_RSP] = val;
+	__set_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_dirty);
+	__set_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_avail);
 }

 static inline u64 kvm_pdptr_read(struct kvm_vcpu *vcpu, int index)
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index dbf7442a822b..45d82b8277e5 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -238,8 +238,7 @@ static inline bool vcpu_match_mmio_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
 	return false;
 }

-static inline unsigned long kvm_register_readl(struct kvm_vcpu *vcpu,
-					       enum kvm_reg reg)
+static inline unsigned long kvm_register_readl(struct kvm_vcpu *vcpu, int reg)
 {
 	unsigned long val = kvm_register_read(vcpu, reg);

@@ -247,8 +246,7 @@ static inline unsigned long kvm_register_readl(struct kvm_vcpu *vcpu,
 }

 static inline void kvm_register_writel(struct kvm_vcpu *vcpu,
-				       enum kvm_reg reg,
-				       unsigned long val)
+				       int reg, unsigned long val)
 {
 	if (!is_64_bit_mode(vcpu))
 		val = (u32)val;
--
2.22.0
Re: [PATCH v2] nvme-pci: Save PCI state before putting drive into deepest state
Looks sensible to me: Reviewed-by: Christoph Hellwig
Re: [PATCH v6 4/5] pinctrl: mediatek: Backward compatible to previous Mediatek's bias-pull usage
Hi,

On Thu, Sep 26, 2019 at 10:02 PM Light Hsieh wrote:
>
> Refine mtk_pinconf_set()/mtk_pinconf_get() for backward compatibility to
> previous Mediatek's bias-pull usage.

MediaTek

> In PINCTRL_MTK that use pinctrl-mtk-common.c, bias-pull setting for pins
> with 2 pull resistors can be specified as value for bias-pull-up and
> bias-pull-down. For example:
> bias-pull-up = ;
> bias-pull-up = ;
> bias-pull-up = ;
> bias-pull-up = ;
> bias-pull-down = ;
> bias-pull-down = ;
> bias-pull-down = ;
> bias-pull-down = ;
>
> On the other hand, PINCTRL_MTK_PARIS use customized properties
> "mediatek,pull-up-adv" and "mediatek,pull-down-adv" to specify bias-pull
> setting for pins with 2 pull resistors.
> This introduce in-compatibility in device tree and increatse porting

s/increatse/increase/

> effort to Mediatek's customer that had already used PINCTRL_MTK version.

s/Mediatek/MediaTek/

> Besides, if customers are not awared of this change and still write devicetree

s/awared/aware/

> for PINCTRL_MTK version, they may encounter runtime failure with pinctrl and
> spent time to debug.
>
> This patch add backward compatible to previous Mediatek's bias-pull usage

s/add/adds/

> so that Mediatek's customer need not use a new devicetree property name.

s/Mediatek/MediaTek/

> The rationale is that: changing driver implemenation had better leave

s/implemenation/implementation/

> interface unchanged.
>

There are many tags missing here too.
> --- > drivers/pinctrl/mediatek/pinctrl-mt6765.c| 6 +- > drivers/pinctrl/mediatek/pinctrl-mtk-common-v2.c | 285 > +++ > drivers/pinctrl/mediatek/pinctrl-mtk-common-v2.h | 11 + > drivers/pinctrl/mediatek/pinctrl-paris.c | 49 ++-- > 4 files changed, 327 insertions(+), 24 deletions(-) > > diff --git a/drivers/pinctrl/mediatek/pinctrl-mt6765.c > b/drivers/pinctrl/mediatek/pinctrl-mt6765.c > index bada37f..ae85fdc 100644 > --- a/drivers/pinctrl/mediatek/pinctrl-mt6765.c > +++ b/drivers/pinctrl/mediatek/pinctrl-mt6765.c > @@ -1072,10 +1072,8 @@ > .gpio_m = 0, > .base_names = mt6765_pinctrl_register_base_names, > .nbase_names = ARRAY_SIZE(mt6765_pinctrl_register_base_names), > - .bias_disable_set = mtk_pinconf_bias_disable_set, > - .bias_disable_get = mtk_pinconf_bias_disable_get, > - .bias_set = mtk_pinconf_bias_set, > - .bias_get = mtk_pinconf_bias_get, > + .bias_set_combo = mtk_pinconf_bias_set_combo, > + .bias_get_combo = mtk_pinconf_bias_get_combo, > .drive_set = mtk_pinconf_drive_set_direct_val, > .drive_get = mtk_pinconf_drive_get_direct_val, > .adv_pull_get = mtk_pinconf_adv_pull_get, > diff --git a/drivers/pinctrl/mediatek/pinctrl-mtk-common-v2.c > b/drivers/pinctrl/mediatek/pinctrl-mtk-common-v2.c > index acfddf9..6d9972f 100644 > --- a/drivers/pinctrl/mediatek/pinctrl-mtk-common-v2.c > +++ b/drivers/pinctrl/mediatek/pinctrl-mtk-common-v2.c > @@ -13,6 +13,8 @@ > #include > #include > > +#include > + List the header declarations in alphabetic order. 
> #include "mtk-eint.h" > #include "pinctrl-mtk-common-v2.h" > > @@ -206,6 +208,20 @@ int mtk_hw_set_value(struct mtk_pinctrl *hw, const > struct mtk_pin_desc *desc, > return 0; > } > > +void mtk_hw_set_value_no_lookup(struct mtk_pinctrl *hw, > + const struct mtk_pin_desc *desc, > + int value, struct mtk_pin_field *pf) > +{ > + if (value < 0 || value > pf->mask) > + return; > + > + if (!pf->next) > + mtk_rmw(hw, pf->index, pf->offset, pf->mask << pf->bitpos, > + (value & pf->mask) << pf->bitpos); > + else > + mtk_hw_write_cross_field(hw, pf, value); > +} > + > int mtk_hw_get_value(struct mtk_pinctrl *hw, const struct mtk_pin_desc *desc, > int field, int *value) > { > @@ -225,6 +241,17 @@ int mtk_hw_get_value(struct mtk_pinctrl *hw, const > struct mtk_pin_desc *desc, > return 0; > } > > +void mtk_hw_get_value_no_lookup(struct mtk_pinctrl *hw, > + const struct mtk_pin_desc *desc, > + int *value, struct mtk_pin_field *pf) > +{ > + if (!pf->next) > + *value = (mtk_r32(hw, pf->index, pf->offset) > + >> pf->bitpos) & pf->mask; > + else > + mtk_hw_read_cross_field(hw, pf, value); > +} > + We are able to improve mtk_hw_[set,get]_value with a cache for the recently visited pin_desc to speed up the subsequent access to the same register for all clients, rather than creating a specific one with no_lookup just for a few clients. Generally, all clients should be able to just apply the mtk_hw_[set,get]_value to compose what they actually need. And the changes related to the improvement on mtk_hw_[set,get]_value with a cache is needed to put to a separate patch
[PATCH] staging: rtl8188eu: fix null dereference when kzalloc fails
If kzalloc() returns NULL, the error path doesn't stop the flow of control from entering rtw_hal_read_chip_version() which dereferences the null pointer. Fix this by adding a 'goto' to the error path to more gracefully handle the issue and avoid proceeding with initialization steps that we're no longer prepared to handle. Also update the debug message to be more consistent with the other debug messages in this function. Addresses-Coverity: ("Dereference after null check") Signed-off-by: Connor Kuehl --- drivers/staging/rtl8188eu/os_dep/usb_intf.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/staging/rtl8188eu/os_dep/usb_intf.c b/drivers/staging/rtl8188eu/os_dep/usb_intf.c index 664d93a7f90d..4fac9dca798e 100644 --- a/drivers/staging/rtl8188eu/os_dep/usb_intf.c +++ b/drivers/staging/rtl8188eu/os_dep/usb_intf.c @@ -348,8 +348,10 @@ static struct adapter *rtw_usb_if1_init(struct dvobj_priv *dvobj, } padapter->HalData = kzalloc(sizeof(struct hal_data_8188e), GFP_KERNEL); - if (!padapter->HalData) - DBG_88E("cant not alloc memory for HAL DATA\n"); + if (!padapter->HalData) { + DBG_88E("Failed to allocate memory for HAL data\n"); + goto free_adapter; + } /* step read_chip_version */ rtw_hal_read_chip_version(padapter); -- 2.17.1
Re: [PATCH v5 4/7] hugetlb: disable region_add file_region coalescing
On 9/19/19 3:24 PM, Mina Almasry wrote: > A follow up patch in this series adds hugetlb cgroup uncharge info the > file_region entries in resv->regions. The cgroup uncharge info may > differ for different regions, so they can no longer be coalesced at > region_add time. So, disable region coalescing in region_add in this > patch. > > Behavior change: > > Say a resv_map exists like this [0->1], [2->3], and [5->6]. > > Then a region_chg/add call comes in region_chg/add(f=0, t=5). > > Old code would generate resv->regions: [0->5], [5->6]. > New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5], > [5->6]. > > Special care needs to be taken to handle the resv->adds_in_progress > variable correctly. In the past, only 1 region would be added for every > region_chg and region_add call. But now, each call may add multiple > regions, so we can no longer increment adds_in_progress by 1 in region_chg, > or decrement adds_in_progress by 1 after region_add or region_abort. Instead, > region_chg calls add_reservation_in_range() to count the number of regions > needed and allocates those, and that info is passed to region_add and > region_abort to decrement adds_in_progress correctly. > > Signed-off-by: Mina Almasry > > --- > mm/hugetlb.c | 273 +-- > 1 file changed, 158 insertions(+), 115 deletions(-) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index bac1cbdd027c..d03b048084a3 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -244,6 +244,12 @@ struct file_region { > long to; > }; > > +/* Helper that removes a struct file_region from the resv_map cache and > returns > + * it for use. > + */ > +static struct file_region * > +get_file_region_entry_from_cache(struct resv_map *resv, long from, long to); > + Instead of the forward declaration, just put the function here. > /* Must be called with resv->lock held. Calling this with count_only == true > * will count the number of pages to be added but will not modify the linked > * list. 
> @@ -251,51 +257,61 @@ struct file_region {
> static long add_reservation_in_range(struct resv_map *resv, long f, long t,
> 				     bool count_only)
> {
> -	long chg = 0;
> +	long add = 0;
> 	struct list_head *head = &resv->regions;
> +	long last_accounted_offset = f;
> 	struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;
>
> -	/* Locate the region we are before or in. */
> -	list_for_each_entry (rg, head, link)
> -		if (f <= rg->to)
> -			break;
> -
> -	/* Round our left edge to the current segment if it encloses us. */
> -	if (f > rg->from)
> -		f = rg->from;
> -
> -	chg = t - f;
> +	/* In this loop, we essentially handle an entry for the range
> +	 * last_accounted_offset -> rg->from, at every iteration, with some
> +	 * bounds checking.
> +	 */
> +	list_for_each_entry_safe(rg, trg, head, link) {
> +		/* Skip irrelevant regions that start before our range. */
> +		if (rg->from < f) {
> +			/* If this region ends after the last accounted offset,
> +			 * then we need to update last_accounted_offset.
> +			 */
> +			if (rg->to > last_accounted_offset)
> +				last_accounted_offset = rg->to;
> +			continue;
> +		}
>
> -	/* Check for and consume any regions we now overlap with. */
> -	nrg = rg;
> -	list_for_each_entry_safe (rg, trg, rg->link.prev, link) {
> -		if (&rg->link == head)
> -			break;
> +		/* When we find a region that starts beyond our range, we've
> +		 * finished.
> +		 */
> 		if (rg->from > t)
> 			break;
>
> -		/* We overlap with this area, if it extends further than
> -		 * us then we must extend ourselves. Account for its
> -		 * existing reservation.
> +		/* Add an entry for last_accounted_offset -> rg->from, and
> +		 * update last_accounted_offset.
> 		 */
> -		if (rg->to > t) {
> -			chg += rg->to - t;
> -			t = rg->to;
> +		if (rg->from > last_accounted_offset) {
> +			add += rg->from - last_accounted_offset;
> +			if (!count_only) {
> +				nrg = get_file_region_entry_from_cache(
> +					resv, last_accounted_offset, rg->from);
> +				list_add(&nrg->link, rg->link.prev);
> +			}
> 		}
> -		chg -= rg->to - rg->from;
>
> -		if (!count_only && rg != nrg) {
> -			list_del(&rg->link);
> -			kfree(rg);
> -		}
> +		last_accounted_offset = rg->to;
> 	}
>
> -	if
[PATCH v2] perf tools: avoid sample_reg_masks being const + weak
Being const + weak breaks with some compilers that constant-propagate from the weak symbol. This behavior is outside of the specification, but in LLVM is chosen to match GCC's behavior. LLVM's implementation was set in this patch: https://github.com/llvm/llvm-project/commit/f49573d1eedcf1e44893d5a062ac1b72c8419646 A const + weak symbol is set to be weak_odr: https://llvm.org/docs/LangRef.html ODR is one definition rule, and given there is one constant definition constant-propagation is possible. It is possible to get this code to miscompile with LLVM when applying link time optimization. As compilers become more aggressive, this is likely to break in more instances. Move the definition of sample_reg_masks to the conditional part of perf_regs.h and guard usage with HAVE_PERF_REGS_SUPPORT. This avoids the weak symbol. Fix an issue when HAVE_PERF_REGS_SUPPORT isn't defined from patch v1. Signed-off-by: Ian Rogers --- tools/perf/util/parse-regs-options.c | 8 ++-- tools/perf/util/perf_regs.c | 4 tools/perf/util/perf_regs.h | 4 ++-- 3 files changed, 8 insertions(+), 8 deletions(-) diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c index ef46c2848808..e687497b3aac 100644 --- a/tools/perf/util/parse-regs-options.c +++ b/tools/perf/util/parse-regs-options.c @@ -13,7 +13,7 @@ static int __parse_regs(const struct option *opt, const char *str, int unset, bool intr) { uint64_t *mode = (uint64_t *)opt->value; - const struct sample_reg *r; + const struct sample_reg *r = NULL; char *s, *os = NULL, *p; int ret = -1; uint64_t mask; @@ -46,19 +46,23 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr) if (!strcmp(s, "?")) { fprintf(stderr, "available registers: "); +#ifdef HAVE_PERF_REGS_SUPPORT for (r = sample_reg_masks; r->name; r++) { if (r->mask & mask) fprintf(stderr, "%s ", r->name); } +#endif fputc('\n', stderr); /* just printing available regs */ return -1; } +#ifdef HAVE_PERF_REGS_SUPPORT for (r = 
sample_reg_masks; r->name; r++) { if ((r->mask & mask) && !strcasecmp(s, r->name)) break; } - if (!r->name) { +#endif + if (!r || !r->name) { ui__warning("Unknown register \"%s\", check man page or run \"perf record %s?\"\n", s, intr ? "-I" : "--user-regs="); goto error; diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c index 2774cec1f15f..5ee47ae1509c 100644 --- a/tools/perf/util/perf_regs.c +++ b/tools/perf/util/perf_regs.c @@ -3,10 +3,6 @@ #include "perf_regs.h" #include "event.h" -const struct sample_reg __weak sample_reg_masks[] = { - SMPL_REG_END -}; - int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused, char **new_op __maybe_unused) { diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h index 47fe34e5f7d5..e014c2c038f4 100644 --- a/tools/perf/util/perf_regs.h +++ b/tools/perf/util/perf_regs.h @@ -15,8 +15,6 @@ struct sample_reg { #define SMPL_REG2(n, b) { .name = #n, .mask = 3ULL << (b) } #define SMPL_REG_END { .name = NULL } -extern const struct sample_reg sample_reg_masks[]; - enum { SDT_ARG_VALID = 0, SDT_ARG_SKIP, @@ -27,6 +25,8 @@ uint64_t arch__intr_reg_mask(void); uint64_t arch__user_reg_mask(void); #ifdef HAVE_PERF_REGS_SUPPORT +extern const struct sample_reg sample_reg_masks[]; + #include #define DWARF_MINIMAL_REGS ((1ULL << PERF_REG_IP) | (1ULL << PERF_REG_SP)) -- 2.23.0.444.g18eeb5a265-goog
Re: [PATCH v3] nfp: abm: fix memory leak in nfp_abm_u32_knode_replace
On Fri, 27 Sep 2019 14:12:42 +0200, Markus Elfring wrote: > > Updated other gotos to have correct errno returned, too. > > How do you think about to add a jump target here? > > > > +++ b/drivers/net/ethernet/netronome/nfp/abm/cls.c > > @@ -176,8 +176,10 @@ nfp_abm_u32_knode_replace(struct nfp_abm_link *alink, > > u8 mask, val; > > int err; > > > > - if (!nfp_abm_u32_check_knode(alink->abm, knode, proto, extack)) > > + if (!nfp_abm_u32_check_knode(alink->abm, knode, proto, extack)) { > > + err = -EOPNOTSUPP; > > goto err_delete; > > + } > > > > tos_off = proto == htons(ETH_P_IP) ? 16 : 20; > > - goto err_delete; > + goto e_opnotsupp; > > > > @@ -221,7 +227,7 @@ nfp_abm_u32_knode_replace(struct nfp_abm_link *alink, > > > > +e_opnotsupp: > + err = -EOPNOTSUPP; > > > err_delete: > > nfp_abm_u32_knode_delete(alink, knode); > > - return -EOPNOTSUPP; > > + return err; > > } > > > > static int nfp_abm_setup_tc_block_cb(enum tc_setup_type type, > > > Can such a change variant be a bit nicer? Definitely not. Looks good as is, thanks Navid!