from:"Mika Penttilä"

Re: a question of split_huge_page

2020-07-10 Thread Mika Penttilä



On 10.7.2020 10.00, Alex Shi wrote:
>
> 在 2020/7/10 下午1:28, Mika Penttilä 写道:
>>> Thanks a lot for quick reply!
>>> What I am confusing is the call chain: __iommu_dma_alloc_pages()
>>> to split_huge_page(), in the func, splited page,
>>> page = alloc_pages_node(nid, alloc_flags, order);
>>> And if the pages were added into lru, they maybe reclaimed and lost,
>>> that would be a panic bug. But in fact, this never happened for long time.
>>> Also I put a BUG() at the line, it's nevre triggered in ltp, and run_vmtests
>> In  __iommu_dma_alloc_pages, after split_huge_page(),  who is taking a
>> reference on tail pages? Seems tail pages are freed and the function
>> errornously returns them in pages[] array for use?
>>
> Why you say so? It looks like the tail page returned and be used
>   pages = __iommu_dma_alloc_pages() in iommu_dma_alloc_remap()
> and still on node's lru. Is this right?
>
> thanks!
IMHO they are new pages coming from alloc_pages_node() so they are not
on lru. And split_huge_page() frees not pinned tail pages again to page
allocator.

Thanks,
Mika





pEpkey.asc
Description: application/pgp-keys

Re: a question of split_huge_page

2020-07-09 Thread Mika Penttilä



On 10.7.2020 7.51, Alex Shi wrote:
>
> 在 2020/7/10 上午12:07, Kirill A. Shutemov 写道:
>> On Thu, Jul 09, 2020 at 04:50:02PM +0100, Matthew Wilcox wrote:
>>> On Thu, Jul 09, 2020 at 11:11:11PM +0800, Alex Shi wrote:
 Hi Kirill & Matthew,

 In the func call chain, from split_huge_page() to lru_add_page_tail(),
 Seems tail pages are added to lru list at line 963, but in this scenario
 the head page has no lru bit and isn't set the bit later. Why we do this?
 or do I miss sth?
>>> I don't understand how we get to split_huge_page() with a page that's
>>> not on an LRU list.  Both anonymous and page cache pages should be on
>>> an LRU list.  What am I missing?> 
>
> Thanks a lot for quick reply!
> What I am confusing is the call chain: __iommu_dma_alloc_pages()
> to split_huge_page(), in the func, splited page,
>   page = alloc_pages_node(nid, alloc_flags, order);
> And if the pages were added into lru, they maybe reclaimed and lost,
> that would be a panic bug. But in fact, this never happened for long time.
> Also I put a BUG() at the line, it's nevre triggered in ltp, and run_vmtests


In  __iommu_dma_alloc_pages, after split_huge_page(),  who is taking a
reference on tail pages? Seems tail pages are freed and the function
errornously returns them in pages[] array for use?

> in kselftest.
>
>> Right, and it's never got removed from LRU during the split. The tail
>> pages have to be added to LRU because they now separate from the tail
>> page.
>>
> According to the explaination, looks like we could remove the code path,
> since it's never got into. (base on my v15 patchset). Any comments?
>
> Thanks
> Alex
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 7c52c5228aab..c28409509ad3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2357,17 +2357,6 @@ static void lru_add_page_tail(struct page *head, 
> struct page *page_tail,
> if (!list)
> SetPageLRU(page_tail);
>
> if (likely(PageLRU(head)))
> list_add_tail(_tail->lru, >lru);
> else if (list) {
> /* page reclaim is reclaiming a huge page */
> get_page(page_tail);
> list_add_tail(_tail->lru, list);
> -   } else {
> -   /*
> -* Head page has not yet been counted, as an hpage,
> -* so we must account for each subpage individually.
> -*
> -* Put page_tail on the list at the correct position
> -* so they all end up in order.
> -*/
> -   VM_BUG_ON_PAGE(1, head);
> -   add_page_to_lru_list_tail(page_tail, lruvec,
> - page_lru(page_tail));
> }
>  }



pEpkey.asc
Description: application/pgp-keys

Re: [PATCH 4/6] vhost_vdpa: support doorbell mapping via mmap

2020-05-29 Thread Mika Penttilä


Hi,

On 29.5.2020 11.03, Jason Wang wrote:

Currently the doorbell is relayed via eventfd which may have
significant overhead because of the cost of vmexits or syscall. This
patch introduces mmap() based doorbell mapping which can eliminate the
overhead caused by vmexit or syscall.


Just wondering. I know very little about vdpa. But how is such a "sw 
doorbell" monitored or observed, if no fault or wmexit etc.

Is there some kind of polling used?


To ease the userspace modeling of the doorbell layout (usually
virtio-pci), this patch starts from a doorbell per page
model. Vhost-vdpa only support the hardware doorbell that sit at the
boundary of a page and does not share the page with other registers.

Doorbell of each virtqueue must be mapped separately, pgoff is the
index of the virtqueue. This allows userspace to map a subset of the
doorbell which may be useful for the implementation of software
assisted virtqueue (control vq) in the future.

Signed-off-by: Jason Wang 
---
  drivers/vhost/vdpa.c | 59 
  1 file changed, 59 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 6ff72289f488..bbe23cea139a 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -15,6 +15,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -741,12 +742,70 @@ static int vhost_vdpa_release(struct inode *inode, struct 
file *filep)
return 0;
  }
  
+static vm_fault_t vhost_vdpa_fault(struct vm_fault *vmf)

+{
+   struct vhost_vdpa *v = vmf->vma->vm_file->private_data;
+   struct vdpa_device *vdpa = v->vdpa;
+   const struct vdpa_config_ops *ops = vdpa->config;
+   struct vdpa_notification_area notify;
+   struct vm_area_struct *vma = vmf->vma;
+   u16 index = vma->vm_pgoff;
+
+   notify = ops->get_vq_notification(vdpa, index);
+
+   vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+   if (remap_pfn_range(vma, vmf->address & PAGE_MASK,
+   notify.addr >> PAGE_SHIFT, PAGE_SIZE,
+   vma->vm_page_prot))
+   return VM_FAULT_SIGBUS;
+
+   return VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct vhost_vdpa_vm_ops = {
+   .fault = vhost_vdpa_fault,
+};
+
+static int vhost_vdpa_mmap(struct file *file, struct vm_area_struct *vma)
+{
+   struct vhost_vdpa *v = vma->vm_file->private_data;
+   struct vdpa_device *vdpa = v->vdpa;
+   const struct vdpa_config_ops *ops = vdpa->config;
+   struct vdpa_notification_area notify;
+   int index = vma->vm_pgoff;
+
+   if (vma->vm_end - vma->vm_start != PAGE_SIZE)
+   return -EINVAL;
+   if ((vma->vm_flags & VM_SHARED) == 0)
+   return -EINVAL;
+   if (vma->vm_flags & VM_READ)
+   return -EINVAL;
+   if (index > 65535)
+   return -EINVAL;
+   if (!ops->get_vq_notification)
+   return -ENOTSUPP;
+
+   /* To be safe and easily modelled by userspace, We only
+* support the doorbell which sits on the page boundary and
+* does not share the page with other registers.
+*/
+   notify = ops->get_vq_notification(vdpa, index);
+   if (notify.addr & (PAGE_SIZE - 1))
+   return -EINVAL;
+   if (vma->vm_end - vma->vm_start != notify.size)
+   return -ENOTSUPP;
+
+   vma->vm_ops = _vdpa_vm_ops;
+   return 0;
+}
+
  static const struct file_operations vhost_vdpa_fops = {
.owner  = THIS_MODULE,
.open   = vhost_vdpa_open,
.release= vhost_vdpa_release,
.write_iter = vhost_vdpa_chr_write_iter,
.unlocked_ioctl = vhost_vdpa_unlocked_ioctl,
+   .mmap   = vhost_vdpa_mmap,
.compat_ioctl   = compat_ptr_ioctl,
  };

BUG/Oops - unix domain sockets / wakeups

2019-08-25 Thread Mika Penttilä

Hello,

I have got  a couple of these in a week with 5.3-rc. Happens very
randomly and infrequently, the system has been nearly idle.
Otoh, seems new to 5.3, dont remeber this happening before.

Thanks,
Mika

# [386517.339218] Internal error: Oops: 17 [#1] PREEMPT SMP ARM
[386517.344726] Modules linked in: btwilink radio_quantek st_drv
[386517.350488] CPU: 0 PID: 363 Comm: compositor Not tainted 5.3.0-rc5 #5
[386517.357016] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[386517.363644] PC is at __wake_up_common+0x5c/0x134
[386517.368351] LR is at __wake_up_common+0x90/0x134
[386517.373057] pc : [<80152bec>]    lr : [<80152c20>]    psr: 80070093
[386517.379411] sp : a9193cf0  ip :   fp : 0001
[386517.384724] r10: 0001  r9 : a9193d2c  r8 : 80219b08
[386517.390037] r7 : 0001  r6 : a63668c4  r5 :   r4 : 00f4
[386517.396653] r3 : 0304  r2 : 0100  r1 : 0001  r0 : 
[386517.403270] Flags: Nzcv  IRQs off  FIQs on  Mode SVC_32  ISA ARM 
Segment user
[386517.410580] Control: 10c5387d  Table: 3942404a  DAC: 0055
[386517.416415] Process compositor (pid: 363, stack limit = 0x5fe37b3d)
[386517.422771] Stack: (0xa9193cf0 to 0xa9194000)
[386517.427218] 3ce0: 0304
0001  a63668c0
[386517.435486] 3d00: 0001 20070013 0001 0001 a9193d2c
0304  80152e90
[386517.443754] 3d20: 0304 a9193d2c 012dd844  
 0100 0122
[386517.452022] 3d40: a8cc8f00 a8cc8fec a8cca7a0 a9193e4c a8cca700
0fc0 0040 80152f74
[386517.460289] 3d60: 0304  a9192000 8079cc44 a8cc8f00
806d4310 9a0e6180 
[386517.468556] 3d80: a8cca7a0 807a0744 a9193e4c  
  
[386517.476824] 3da0: 9a0e6180 806da340 9a0e6180 806da3ac 9a0e6180
806daa0c 9a0e6180 8079d730
[386517.485091] 3dc0: 7eb24360 a9193e14  0040 a8cca944
  a8cca700
[386517.493358] 3de0: a6752f40 a8cca90c 0001  a9192000
0001  
[386517.501627] 3e00:    a9193e14 7eb245c8
a9193e78 80101204 a9193f70
[386517.509896] 3e20: a9193e74 a9193f68 4040 a9193eb8 7eb245c8
7eb245e4 a6752f40 0129
[386517.518163] 3e40: 7615db10 8079dac8 0002 8079c408 a6752f40
a9193f68  1000
[386517.526431] 3e60: 4040  7eb245c8 806cfe7c 
 014c9b90 0d00
[386517.534698] 3e80: 014c9890 0300  011c52cc 75bc9788
0001 a8cdfa00 a8cdfa0c
[386517.542966] 3ea0: a8f7e980 8024afd8 0148ef88  
0019 75bc9788 8024aeb4
[386517.551233] 3ec0: a8f7e980 a8f7e9ac a8f7e9b4 a8f7e980 
808fe018  8024a7ac
[386517.559500] 3ee0: a9193ee0 a9193ee0 a94b0c00 0020 7eb246e0
a94b0c01 a94b0c00 a8f7e980
[386517.567767] 3f00: 0001 a8f7e9ac  8024b43c 
7eb24318 a89ccf68 a94d5480
[386517.576034] 3f20: a8fce600 80223bc0 a9193f64 a9193f60 4040
0129 80101204 a6752f40
[386517.584301] 3f40: 7eb245c8 4040 0129 80101204 a9192000
806d0fb8  7eb246e0
[386517.592568] 3f60: 0001 fff7   0004
0040 0fc0 a9193e78
[386517.600837] 3f80: 0002 011c5460 7eb245e4 007c 4000
 0040 0028
[386517.609105] 3fa0: 7eb245c8 80101000 0040 0028 0028
7eb245c8 4040 
[386517.617374] 3fc0: 0040 0028 7eb245c8 0129 76f01600
7eb245c8 0024 7615db10
[386517.625642] 3fe0:  7eb24598 76f826e0 75acd798 80070010
0028  
[386517.633921] [<80152bec>] (__wake_up_common) from [<80152e90>]
(__wake_up_common_lock+0x64/0x88)
[386517.642715] [<80152e90>] (__wake_up_common_lock) from [<80152f74>]
(__wake_up_sync_key+0x24/0x2c)
[386517.651683] [<80152f74>] (__wake_up_sync_key) from [<8079cc44>]
(unix_write_space+0x58/0x84)
[386517.660220] [<8079cc44>] (unix_write_space) from [<806d4310>]
(sock_wfree+0x58/0x74)
[386517.668059] [<806d4310>] (sock_wfree) from [<807a0744>]
(unix_destruct_scm+0x6c/0x74)
[386517.675984] [<807a0744>] (unix_destruct_scm) from [<806da340>]
(skb_release_head_state+0x50/0xb0)
[386517.684947] [<806da340>] (skb_release_head_state) from [<806da3ac>]
(skb_release_all+0xc/0x24)
[386517.693649] [<806da3ac>] (skb_release_all) from [<806daa0c>]
(consume_skb+0x28/0x48)
[386517.701485] [<806daa0c>] (consume_skb) from [<8079d730>]
(unix_stream_read_generic+0x494/0x76c)
[386517.710275] [<8079d730>] (unix_stream_read_generic) from
[<8079dac8>] (unix_stream_recvmsg+0x3c/0x44)
[386517.719586] [<8079dac8>] (unix_stream_recvmsg) from [<806cfe7c>]
(___sys_recvmsg+0x90/0x158)
[386517.728115] [<806cfe7c>] (___sys_recvmsg) from [<806d0fb8>]
(__sys_recvmsg+0x3c/0x6c)
[386517.736038] [<806d0fb8>] (__sys_recvmsg) from [<80101000>]
(ret_fast_syscall+0x0/0x4c)



pEpkey.asc
Description: pEpkey.asc

Re: [PATCH 2/4] pid: add pidctl()

2019-03-25 Thread Mika Penttilä

Hi!


> +SYSCALL_DEFINE5(pidctl, unsigned int, cmd, pid_t, pid, int, source, int, 
> target,
> + unsigned int, flags)
> +{
> + struct pid_namespace *source_ns = NULL, *target_ns = NULL;
> + struct pid *struct_pid;
> + pid_t result;
> +
> + switch (cmd) {
> + case PIDCMD_QUERY_PIDNS:
> + if (pid != 0)
> + return -EINVAL;
> + pid = 1;
> + /* fall through */
> + case PIDCMD_QUERY_PID:
> + if (flags != 0)
> + return -EINVAL;
> + break;
> + case PIDCMD_GET_PIDFD:
> + if (flags & ~PIDCTL_CLOEXEC)
> + return -EINVAL;
> + break;
> + default:
> + return -EOPNOTSUPP;
> + }
> +
> + source_ns = get_pid_ns_by_fd(source);
> + result = PTR_ERR(source_ns);
> + if (IS_ERR(source_ns))
> + goto err_source;
> +
> + target_ns = get_pid_ns_by_fd(target);
> + result = PTR_ERR(target_ns);
> + if (IS_ERR(target_ns))
> + goto err_target;
> +
> + if (cmd == PIDCMD_QUERY_PIDNS) {
> + result = pidns_related(source_ns, target_ns);
> + } else {
> + rcu_read_lock();
> + struct_pid = find_pid_ns(pid, source_ns);
> + result = struct_pid ? pid_nr_ns(struct_pid, target_ns) : -ESRCH;

Should you do get_pid(struct_pid) here to keep it alive till 
pidfd_create_fd() ?

> + rcu_read_unlock();
> +
> + if (cmd == PIDCMD_GET_PIDFD) {
> + int cloexec = (flags & PIDCTL_CLOEXEC) ? O_CLOEXEC : 0;
> + if (result > 0)
> + result = pidfd_create_fd(struct_pid, cloexec);
> + else if (result == 0)
> + result = -ENOENT;
> + }
> + }
> +
> + if (target)
> + put_pid_ns(target_ns);
> +err_target:
> + if (source)
> + put_pid_ns(source_ns);
> +err_source:
> + return result;
> +}
> +
>  void __init pid_idr_init(void)
>  {
>   /* Verify no one has done anything silly: */
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index aa6e72fb7c08..1c863fb3d55a 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -429,6 +429,31 @@ static struct ns_common *pidns_get_parent(struct 
> ns_common *ns)
>   return _pid_ns(pid_ns)->ns;
>  }
>  
> +/**
> + * pidnscmp - Determine if @ancestor is ancestor of @descendant
> + * @ancestor:   pidns suspected to be the ancestor of @descendant
> + * @descendant: pidns suspected to be the descendant of @ancestor
> + *
> + * Returns -1 if @ancestor is not an ancestor of @descendant,
> + * 0 if @ancestor is the same pidns as @descendant, 1 if @ancestor
> + * is an ancestor of @descendant.
> + */
> +int pidnscmp(struct pid_namespace *ancestor, struct pid_namespace 
> *descendant)
> +{
> + if (ancestor == descendant)
> + return 0;
> +
> + for (;;) {
> + if (!descendant)
> + return -1;
> + if (descendant == ancestor)
> + break;
> + descendant = descendant->parent;
> + }
> +
> + return 1;
> +}
> +
>  static struct user_namespace *pidns_owner(struct ns_common *ns)
>  {
>   return to_pid_ns(ns)->user_ns;

Re: Regression: i.MX6: pinctrl: fsl: add scu based pinctrl support

2019-01-14 Thread Mika Penttilä

Hi!

On 14.1.2019 18.58, Fabio Estevam wrote:
> Hi Mika,
>
> On Mon, Jan 14, 2019 at 8:21 AM Mika Penttilä
>  wrote:
>> Hello,
>>
>>
>> The patch titled "pinctrl: fsl: add scu based pinctrl support" causes
>> regression on i.MX6. Tested on a custom board based on
>> i.MX6Q and sgtl5000 codec, there is no sound,  tested with simple wav
>> playing.
>>
>> Reverting the patch makes audio work again.
> Which kernel version do you use?
>
> Most likely this has already been fixed by the following commit:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/pinctrl/freescale/pinctrl-imx.c?h=v5.0-rc2=571610678bf344006ab4c47c6fd0a842e9ac6a1b
>
> Please let us know if you still have issues with such commit applied.

I am merging mainline and this regression is present in 5.0-rc2 so it should 
have the mentioned patch applied.

Obviously I had to revert them both ("pinctrl: imx: fix NO_PAD_CTL setting for 
MMIO pads"
 and "pinctrl: fsl: add scu based pinctrl support"), and after that it works. 
Is there something more I could test?


--Mika

Regression: i.MX6: pinctrl: fsl: add scu based pinctrl support

2019-01-14 Thread Mika Penttilä

Hello,


The patch titled "pinctrl: fsl: add scu based pinctrl support" causes
regression on i.MX6. Tested on a custom board based on
i.MX6Q and sgtl5000 codec, there is no sound,  tested with simple wav
playing.

Reverting the patch makes audio work again.

--Mika



pEpkey.asc
Description: pEpkey.asc

Re: [PATCH] PM / suspend: Count suspend-to-idle loop as sleep time

2018-09-14 Thread Mika Penttilä

On 09/14/2018 11:46 AM, Rafael J. Wysocki wrote:
> On Friday, September 14, 2018 10:28:44 AM CEST Mika Penttilä wrote:
>> Hi!
>>
>>
>> On 09/14/2018 09:59 AM, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki 
>>>
>>> There is a difference in behavior between suspend-to-idle and
>>> suspend-to-RAM in the timekeeping handling that leads to functional
>>> issues.  Namely, every iteration of the loop in s2idle_loop()
>>> increases the monotinic clock somewhat, even if timekeeping_suspend()
>>> and timekeeping_resume() are invoked from s2idle_enter(), and if
>>> many of them are carried out in a row, the monotonic clock can grow
>>> significantly while the system is regarded as suspended, which
>>> doesn't happen during suspend-to-RAM and so it is unexpected and
>>> leads to confusion and misbehavior in user space (similar to what
>>> ensued when we tried to combine the boottime and monotonic clocks).
>>>
>>> To avoid that, count all iterations of the loop in s2idle_loop()
>>> as "sleep time" and adjust the clock for that on exit from
>>> suspend-to-idle.
>>>
>>> [That also covers systems on which timekeeping is not suspended
>>>  by by s2idle_enter().]
>>>
>>> Signed-off-by: Rafael J. Wysocki 
>>> ---
>>>
>>> This is a replacement for https://patchwork.kernel.org/patch/10599209/
>>>
>>> I decided to count the entire loop in s2idle_loop() as "sleep time" as the
>>> patch is then simpler and it also covers systems where timekeeping is not
>>> suspended in the final step of suspend-to-idle.
>>>
>>> I dropped the "Fixes:" tag, because the monotonic clock delta problem
>>> has been present on the latter since the very introduction of "freeze"
>>> (as suspend-to-idle was referred to previously) and so this doesn't fix
>>> any particular later commits.
>>>
>>> ---
>>>  kernel/power/suspend.c |   18 ++
>>>  1 file changed, 18 insertions(+)
>>>
>>> Index: linux-pm/kernel/power/suspend.c
>>> ===
>>> --- linux-pm.orig/kernel/power/suspend.c
>>> +++ linux-pm/kernel/power/suspend.c
>>> @@ -109,8 +109,12 @@ static void s2idle_enter(void)
>>>  
>>>  static void s2idle_loop(void)
>>>  {
>>> +   ktime_t start, delta;
>>> +
>>> pm_pr_dbg("suspend-to-idle\n");
>>>  
>>> +   start = ktime_get();
>>> +
>>> for (;;) {
>>> int error;
>>>  
>>> @@ -150,6 +154,20 @@ static void s2idle_loop(void)
>>> pm_wakeup_clear(false);
>>> }
>>>  
>>> +   /*
>>> +* If the monotonic clock difference between the start of the loop and
>>> +* this point is too large, user space may get confused about whether or
>>> +* not the system has been suspended and tasks may get killed by
>>> +* watchdogs etc., so count the loop as "sleep time" to compensate for
>>> +* that.
>>> +*/
>>> +   delta = ktime_sub(ktime_get(), start);
>>> +   if (ktime_to_ns(delta) > 0) {
>>> +   struct timespec64 timespec64_delta = ktime_to_timespec64(delta);
>>> +
>>> +   timekeeping_inject_sleeptime64(_delta);
>>> +   }
>>
>> But doesn't injecting sleep time here make monotonic clock too large by the 
>> amount of sleeptime? 
>> tick_freeze() / tick_unfreeze() already injects the sleeptime (otherwise 
>> delta would be 0).
> 
> No, it doesn't.
> 
> The delta here is the extra time taken by the loop which hasn't been counted
> as sleep time yet.

I said incorrectly monotonic clock, but timekeeping_inject_sleeptime64() 
forwards the wall time, by the amount of delta.
Why wouldn't some other cpu update xtime when one cpu is in the loop? And if 
all cpus enter s2idle, tick_unfreeze()
injects sleeptime. My point is that this extra injection makes wall time wrong, 
no?

> 
> Thanks,
> Rafael
>

Re: [PATCH] PM / suspend: Count suspend-to-idle loop as sleep time

2018-09-14 Thread Mika Penttilä

On 09/14/2018 11:46 AM, Rafael J. Wysocki wrote:
> On Friday, September 14, 2018 10:28:44 AM CEST Mika Penttilä wrote:
>> Hi!
>>
>>
>> On 09/14/2018 09:59 AM, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki 
>>>
>>> There is a difference in behavior between suspend-to-idle and
>>> suspend-to-RAM in the timekeeping handling that leads to functional
>>> issues.  Namely, every iteration of the loop in s2idle_loop()
>>> increases the monotinic clock somewhat, even if timekeeping_suspend()
>>> and timekeeping_resume() are invoked from s2idle_enter(), and if
>>> many of them are carried out in a row, the monotonic clock can grow
>>> significantly while the system is regarded as suspended, which
>>> doesn't happen during suspend-to-RAM and so it is unexpected and
>>> leads to confusion and misbehavior in user space (similar to what
>>> ensued when we tried to combine the boottime and monotonic clocks).
>>>
>>> To avoid that, count all iterations of the loop in s2idle_loop()
>>> as "sleep time" and adjust the clock for that on exit from
>>> suspend-to-idle.
>>>
>>> [That also covers systems on which timekeeping is not suspended
>>>  by by s2idle_enter().]
>>>
>>> Signed-off-by: Rafael J. Wysocki 
>>> ---
>>>
>>> This is a replacement for https://patchwork.kernel.org/patch/10599209/
>>>
>>> I decided to count the entire loop in s2idle_loop() as "sleep time" as the
>>> patch is then simpler and it also covers systems where timekeeping is not
>>> suspended in the final step of suspend-to-idle.
>>>
>>> I dropped the "Fixes:" tag, because the monotonic clock delta problem
>>> has been present on the latter since the very introduction of "freeze"
>>> (as suspend-to-idle was referred to previously) and so this doesn't fix
>>> any particular later commits.
>>>
>>> ---
>>>  kernel/power/suspend.c |   18 ++
>>>  1 file changed, 18 insertions(+)
>>>
>>> Index: linux-pm/kernel/power/suspend.c
>>> ===
>>> --- linux-pm.orig/kernel/power/suspend.c
>>> +++ linux-pm/kernel/power/suspend.c
>>> @@ -109,8 +109,12 @@ static void s2idle_enter(void)
>>>  
>>>  static void s2idle_loop(void)
>>>  {
>>> +   ktime_t start, delta;
>>> +
>>> pm_pr_dbg("suspend-to-idle\n");
>>>  
>>> +   start = ktime_get();
>>> +
>>> for (;;) {
>>> int error;
>>>  
>>> @@ -150,6 +154,20 @@ static void s2idle_loop(void)
>>> pm_wakeup_clear(false);
>>> }
>>>  
>>> +   /*
>>> +* If the monotonic clock difference between the start of the loop and
>>> +* this point is too large, user space may get confused about whether or
>>> +* not the system has been suspended and tasks may get killed by
>>> +* watchdogs etc., so count the loop as "sleep time" to compensate for
>>> +* that.
>>> +*/
>>> +   delta = ktime_sub(ktime_get(), start);
>>> +   if (ktime_to_ns(delta) > 0) {
>>> +   struct timespec64 timespec64_delta = ktime_to_timespec64(delta);
>>> +
>>> +   timekeeping_inject_sleeptime64(_delta);
>>> +   }
>>
>> But doesn't injecting sleep time here make monotonic clock too large by the 
>> amount of sleeptime? 
>> tick_freeze() / tick_unfreeze() already injects the sleeptime (otherwise 
>> delta would be 0).
> 
> No, it doesn't.
> 
> The delta here is the extra time taken by the loop which hasn't been counted
> as sleep time yet.

I said incorrectly monotonic clock, but timekeeping_inject_sleeptime64() 
forwards the wall time, by the amount of delta.
Why wouldn't some other cpu update xtime when one cpu is in the loop? And if 
all cpus enter s2idle, tick_unfreeze()
injects sleeptime. My point is that this extra injection makes wall time wrong, 
no?

> 
> Thanks,
> Rafael
>

Re: [PATCH] PM / suspend: Count suspend-to-idle loop as sleep time

2018-09-14 Thread Mika Penttilä

Hi!


On 09/14/2018 09:59 AM, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki 
> 
> There is a difference in behavior between suspend-to-idle and
> suspend-to-RAM in the timekeeping handling that leads to functional
> issues.  Namely, every iteration of the loop in s2idle_loop()
> increases the monotinic clock somewhat, even if timekeeping_suspend()
> and timekeeping_resume() are invoked from s2idle_enter(), and if
> many of them are carried out in a row, the monotonic clock can grow
> significantly while the system is regarded as suspended, which
> doesn't happen during suspend-to-RAM and so it is unexpected and
> leads to confusion and misbehavior in user space (similar to what
> ensued when we tried to combine the boottime and monotonic clocks).
> 
> To avoid that, count all iterations of the loop in s2idle_loop()
> as "sleep time" and adjust the clock for that on exit from
> suspend-to-idle.
> 
> [That also covers systems on which timekeeping is not suspended
>  by by s2idle_enter().]
> 
> Signed-off-by: Rafael J. Wysocki 
> ---
> 
> This is a replacement for https://patchwork.kernel.org/patch/10599209/
> 
> I decided to count the entire loop in s2idle_loop() as "sleep time" as the
> patch is then simpler and it also covers systems where timekeeping is not
> suspended in the final step of suspend-to-idle.
> 
> I dropped the "Fixes:" tag, because the monotonic clock delta problem
> has been present on the latter since the very introduction of "freeze"
> (as suspend-to-idle was referred to previously) and so this doesn't fix
> any particular later commits.
> 
> ---
>  kernel/power/suspend.c |   18 ++
>  1 file changed, 18 insertions(+)
> 
> Index: linux-pm/kernel/power/suspend.c
> ===
> --- linux-pm.orig/kernel/power/suspend.c
> +++ linux-pm/kernel/power/suspend.c
> @@ -109,8 +109,12 @@ static void s2idle_enter(void)
>  
>  static void s2idle_loop(void)
>  {
> + ktime_t start, delta;
> +
>   pm_pr_dbg("suspend-to-idle\n");
>  
> + start = ktime_get();
> +
>   for (;;) {
>   int error;
>  
> @@ -150,6 +154,20 @@ static void s2idle_loop(void)
>   pm_wakeup_clear(false);
>   }
>  
> + /*
> +  * If the monotonic clock difference between the start of the loop and
> +  * this point is too large, user space may get confused about whether or
> +  * not the system has been suspended and tasks may get killed by
> +  * watchdogs etc., so count the loop as "sleep time" to compensate for
> +  * that.
> +  */
> + delta = ktime_sub(ktime_get(), start);
> + if (ktime_to_ns(delta) > 0) {
> + struct timespec64 timespec64_delta = ktime_to_timespec64(delta);
> +
> + timekeeping_inject_sleeptime64(_delta);
> + }

But doesn't injecting sleep time here make monotonic clock too large by the 
amount of sleeptime? 
tick_freeze() / tick_unfreeze() already injects the sleeptime (otherwise delta 
would be 0).

> +
>   pm_pr_dbg("resume from suspend-to-idle\n");
>  }
>  
>

Re: [PATCH] PM / suspend: Count suspend-to-idle loop as sleep time

2018-09-14 Thread Mika Penttilä

Hi!


On 09/14/2018 09:59 AM, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki 
> 
> There is a difference in behavior between suspend-to-idle and
> suspend-to-RAM in the timekeeping handling that leads to functional
> issues.  Namely, every iteration of the loop in s2idle_loop()
> increases the monotinic clock somewhat, even if timekeeping_suspend()
> and timekeeping_resume() are invoked from s2idle_enter(), and if
> many of them are carried out in a row, the monotonic clock can grow
> significantly while the system is regarded as suspended, which
> doesn't happen during suspend-to-RAM and so it is unexpected and
> leads to confusion and misbehavior in user space (similar to what
> ensued when we tried to combine the boottime and monotonic clocks).
> 
> To avoid that, count all iterations of the loop in s2idle_loop()
> as "sleep time" and adjust the clock for that on exit from
> suspend-to-idle.
> 
> [That also covers systems on which timekeeping is not suspended
>  by by s2idle_enter().]
> 
> Signed-off-by: Rafael J. Wysocki 
> ---
> 
> This is a replacement for https://patchwork.kernel.org/patch/10599209/
> 
> I decided to count the entire loop in s2idle_loop() as "sleep time" as the
> patch is then simpler and it also covers systems where timekeeping is not
> suspended in the final step of suspend-to-idle.
> 
> I dropped the "Fixes:" tag, because the monotonic clock delta problem
> has been present on the latter since the very introduction of "freeze"
> (as suspend-to-idle was referred to previously) and so this doesn't fix
> any particular later commits.
> 
> ---
>  kernel/power/suspend.c |   18 ++
>  1 file changed, 18 insertions(+)
> 
> Index: linux-pm/kernel/power/suspend.c
> ===
> --- linux-pm.orig/kernel/power/suspend.c
> +++ linux-pm/kernel/power/suspend.c
> @@ -109,8 +109,12 @@ static void s2idle_enter(void)
>  
>  static void s2idle_loop(void)
>  {
> + ktime_t start, delta;
> +
>   pm_pr_dbg("suspend-to-idle\n");
>  
> + start = ktime_get();
> +
>   for (;;) {
>   int error;
>  
> @@ -150,6 +154,20 @@ static void s2idle_loop(void)
>   pm_wakeup_clear(false);
>   }
>  
> + /*
> +  * If the monotonic clock difference between the start of the loop and
> +  * this point is too large, user space may get confused about whether or
> +  * not the system has been suspended and tasks may get killed by
> +  * watchdogs etc., so count the loop as "sleep time" to compensate for
> +  * that.
> +  */
> + delta = ktime_sub(ktime_get(), start);
> + if (ktime_to_ns(delta) > 0) {
> + struct timespec64 timespec64_delta = ktime_to_timespec64(delta);
> +
> + timekeeping_inject_sleeptime64(_delta);
> + }

But doesn't injecting sleep time here make monotonic clock too large by the 
amount of sleeptime? 
tick_freeze() / tick_unfreeze() already injects the sleeptime (otherwise delta 
would be 0).

> +
>   pm_pr_dbg("resume from suspend-to-idle\n");
>  }
>  
>

Re: [RFC v6 PATCH 2/2] mm: mmap: zap pages with read mmap_sem in munmap

2018-07-26 Thread Mika Penttilä




On 26.07.2018 21:10, Yang Shi wrote:
> When running some mmap/munmap scalability tests with large memory (i.e.
>> 300GB), the below hung task issue may happen occasionally.
> INFO: task ps:14018 blocked for more than 120 seconds.
>Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>  ps  D0 14018  1 0x0004
>   885582f84000 885e8682f000 880972943000 885ebf499bc0
>   8828ee12 c900349bfca8 817154d0 0040
>   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
>  Call Trace:
>   [] ? __schedule+0x250/0x730
>   [] schedule+0x36/0x80
>   [] rwsem_down_read_failed+0xf0/0x150
>   [] call_rwsem_down_read_failed+0x18/0x30
>   [] down_read+0x20/0x40
>   [] proc_pid_cmdline_read+0xd9/0x4e0
>   [] ? do_filp_open+0xa5/0x100
>   [] __vfs_read+0x37/0x150
>   [] ? security_file_permission+0x9b/0xc0
>   [] vfs_read+0x96/0x130
>   [] SyS_read+0x55/0xc0
>   [] entry_SYSCALL_64_fastpath+0x1a/0xc5
>
> It is because munmap holds mmap_sem exclusively from very beginning to
> all the way down to the end, and doesn't release it in the middle. When
> unmapping large mapping, it may take long time (take ~18 seconds to
> unmap 320GB mapping with every single page mapped on an idle machine).
>
> Zapping pages is the most time consuming part, according to the
> suggestion from Michal Hocko [1], zapping pages can be done with holding
> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
> mmap_sem to cleanup vmas.
>
> But, some part may need write mmap_sem, for example, vma splitting. So,
> the design is as follows:
> acquire write mmap_sem
> lookup vmas (find and split vmas)
>   detach vmas
> deal with special mappings
> downgrade_write
>
> zap pages
>   free page tables
> release mmap_sem
>
> The vm events with read mmap_sem may come in during page zapping, but
> since vmas have been detached before, they, i.e. page fault, gup, etc,
> will not be able to find valid vma, then just return SIGSEGV or -EFAULT
> as expected.
>
> If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, they are
> considered as special mappings. They will be dealt with before zapping
> pages with write mmap_sem held. Basically, just update vm_flags.
>
> And, since they are also manipulated by unmap_single_vma() which is
> called by unmap_vma() with read mmap_sem held in this case, to
> prevent from updating vm_flags in read critical section, a new
> parameter, called "skip_flags" is added to unmap_region(), unmap_vmas()
> and unmap_single_vma(). If it is true, then just skip unmap those
> special mappings. Currently, the only place which pass true to this
> parameter is us.
>
> With this approach we don't have to re-acquire mmap_sem again to clean
> up vmas to avoid race window which might get the address space changed.
>
> And, since the lock acquire/release cost is managed to the minimum and
> almost as same as before, the optimization could be extended to any size
> of mapping without incurring significant penalty to small mappings.
>
> For the time being, just do this in munmap syscall path. Other
> vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain
> intact for stability reason.
>
> With the patches, exclusive mmap_sem hold time when munmap a 80GB
> address space on a machine with 32 cores of E5-2680 @ 2.70GHz dropped to
> us level from second.
>
> munmap_test-15002 [008]   594.380138: funcgraph_entry: |  
> vm_munmap_zap_rlock() {
> munmap_test-15002 [008]   594.380146: funcgraph_entry:  !2485684 us |
> unmap_region();
> munmap_test-15002 [008]   596.865836: funcgraph_exit:   !2485692 us |  }
>
> Here the excution time of unmap_region() is used to evaluate the time of
> holding read mmap_sem, then the remaining time is used with holding
> exclusive lock.
>
> [1] https://lwn.net/Articles/753269/
>
> Suggested-by: Michal Hocko 
> Suggested-by: Kirill A. Shutemov 
> Cc: Matthew Wilcox 
> Cc: Laurent Dufour 
> Cc: Andrew Morton 
> Signed-off-by: Yang Shi 
> ---
>  include/linux/mm.h |  2 +-
>  mm/memory.c| 41 --
>  mm/mmap.c  | 99 
> +-
>  3 files changed, 123 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a0fbb9f..e4480d8 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1321,7 +1321,7 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned 
> long address,
>  void zap_page_range(struct vm_area_struct *vma, unsigned long address,
>   unsigned long size);
>  void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> - unsigned long start, unsigned long end);
> + unsigned long start, unsigned long end, bool skip_vm_flags);
>  
>  /**
>   * mm_walk - callbacks for

Re: [RFC v6 PATCH 2/2] mm: mmap: zap pages with read mmap_sem in munmap

2018-07-26 Thread Mika Penttilä




On 26.07.2018 21:10, Yang Shi wrote:
> When running some mmap/munmap scalability tests with large memory (i.e.
>> 300GB), the below hung task issue may happen occasionally.
> INFO: task ps:14018 blocked for more than 120 seconds.
>Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>  ps  D0 14018  1 0x0004
>   885582f84000 885e8682f000 880972943000 885ebf499bc0
>   8828ee12 c900349bfca8 817154d0 0040
>   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
>  Call Trace:
>   [] ? __schedule+0x250/0x730
>   [] schedule+0x36/0x80
>   [] rwsem_down_read_failed+0xf0/0x150
>   [] call_rwsem_down_read_failed+0x18/0x30
>   [] down_read+0x20/0x40
>   [] proc_pid_cmdline_read+0xd9/0x4e0
>   [] ? do_filp_open+0xa5/0x100
>   [] __vfs_read+0x37/0x150
>   [] ? security_file_permission+0x9b/0xc0
>   [] vfs_read+0x96/0x130
>   [] SyS_read+0x55/0xc0
>   [] entry_SYSCALL_64_fastpath+0x1a/0xc5
>
> It is because munmap holds mmap_sem exclusively from very beginning to
> all the way down to the end, and doesn't release it in the middle. When
> unmapping large mapping, it may take long time (take ~18 seconds to
> unmap 320GB mapping with every single page mapped on an idle machine).
>
> Zapping pages is the most time consuming part, according to the
> suggestion from Michal Hocko [1], zapping pages can be done with holding
> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
> mmap_sem to cleanup vmas.
>
> But, some part may need write mmap_sem, for example, vma splitting. So,
> the design is as follows:
> acquire write mmap_sem
> lookup vmas (find and split vmas)
>   detach vmas
> deal with special mappings
> downgrade_write
>
> zap pages
>   free page tables
> release mmap_sem
>
> The vm events with read mmap_sem may come in during page zapping, but
> since vmas have been detached before, they, i.e. page fault, gup, etc,
> will not be able to find valid vma, then just return SIGSEGV or -EFAULT
> as expected.
>
> If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, they are
> considered as special mappings. They will be dealt with before zapping
> pages with write mmap_sem held. Basically, just update vm_flags.
>
> And, since they are also manipulated by unmap_single_vma() which is
> called by unmap_vma() with read mmap_sem held in this case, to
> prevent from updating vm_flags in read critical section, a new
> parameter, called "skip_flags" is added to unmap_region(), unmap_vmas()
> and unmap_single_vma(). If it is true, then just skip unmap those
> special mappings. Currently, the only place which pass true to this
> parameter is us.
>
> With this approach we don't have to re-acquire mmap_sem again to clean
> up vmas to avoid race window which might get the address space changed.
>
> And, since the lock acquire/release cost is managed to the minimum and
> almost as same as before, the optimization could be extended to any size
> of mapping without incurring significant penalty to small mappings.
>
> For the time being, just do this in munmap syscall path. Other
> vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain
> intact for stability reason.
>
> With the patches, exclusive mmap_sem hold time when munmap a 80GB
> address space on a machine with 32 cores of E5-2680 @ 2.70GHz dropped to
> us level from second.
>
> munmap_test-15002 [008]   594.380138: funcgraph_entry: |  
> vm_munmap_zap_rlock() {
> munmap_test-15002 [008]   594.380146: funcgraph_entry:  !2485684 us |
> unmap_region();
> munmap_test-15002 [008]   596.865836: funcgraph_exit:   !2485692 us |  }
>
> Here the excution time of unmap_region() is used to evaluate the time of
> holding read mmap_sem, then the remaining time is used with holding
> exclusive lock.
>
> [1] https://lwn.net/Articles/753269/
>
> Suggested-by: Michal Hocko 
> Suggested-by: Kirill A. Shutemov 
> Cc: Matthew Wilcox 
> Cc: Laurent Dufour 
> Cc: Andrew Morton 
> Signed-off-by: Yang Shi 
> ---
>  include/linux/mm.h |  2 +-
>  mm/memory.c| 41 --
>  mm/mmap.c  | 99 
> +-
>  3 files changed, 123 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a0fbb9f..e4480d8 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1321,7 +1321,7 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned 
> long address,
>  void zap_page_range(struct vm_area_struct *vma, unsigned long address,
>   unsigned long size);
>  void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> - unsigned long start, unsigned long end);
> + unsigned long start, unsigned long end, bool skip_vm_flags);
>  
>  /**
>   * mm_walk - callbacks for

Re: [PATCHv3 14/17] x86/mm: Introduce direct_mapping_size

2018-06-12 Thread Mika Penttilä




On 12.06.2018 17:39, Kirill A. Shutemov wrote:
> Kernel need to have a way to access encrypted memory. We are going to
> use per-KeyID direct mapping to facilitate the access with minimal
> overhead.
>
> Direct mapping for each KeyID will be put next to each other in the
> virtual address space. We need to have a way to find boundaries of
> direct mapping for particular KeyID.
>
> The new variable direct_mapping_size specifies the size of direct
> mapping. With the value, it's trivial to find direct mapping for
> KeyID-N: PAGE_OFFSET + N * direct_mapping_size.
>
> Size of direct mapping is calculated during KASLR setup. If KALSR is
> disable it happens during MKTME initialization.
>
> Signed-off-by: Kirill A. Shutemov 
> ---
>  arch/x86/include/asm/mktme.h   |  2 ++
>  arch/x86/include/asm/page_64.h |  1 +
>  arch/x86/kernel/head64.c   |  2 ++
>  arch/x86/mm/kaslr.c| 21 ---
>  arch/x86/mm/mktme.c| 48 ++
>  5 files changed, 71 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
> index 9363b989a021..3bf481fe3f56 100644
> --- a/arch/x86/include/asm/mktme.h
> +++ b/arch/x86/include/asm/mktme.h
> @@ -40,6 +40,8 @@ int page_keyid(const struct page *page);
>  
>  void mktme_disable(void);
>  
> +void setup_direct_mapping_size(void);
> +
>  #else
>  #define mktme_keyid_mask ((phys_addr_t)0)
>  #define mktme_nr_keyids  0
> diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
> index 939b1cff4a7b..53c32af895ab 100644
> --- a/arch/x86/include/asm/page_64.h
> +++ b/arch/x86/include/asm/page_64.h
> @@ -14,6 +14,7 @@ extern unsigned long phys_base;
>  extern unsigned long page_offset_base;
>  extern unsigned long vmalloc_base;
>  extern unsigned long vmemmap_base;
> +extern unsigned long direct_mapping_size;
>  
>  static inline unsigned long __phys_addr_nodebug(unsigned long x)
>  {
> diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
> index a21d6ace648e..b6175376b2e1 100644
> --- a/arch/x86/kernel/head64.c
> +++ b/arch/x86/kernel/head64.c
> @@ -59,6 +59,8 @@ EXPORT_SYMBOL(vmalloc_base);
>  unsigned long vmemmap_base __ro_after_init = __VMEMMAP_BASE_L4;
>  EXPORT_SYMBOL(vmemmap_base);
>  #endif
> +unsigned long direct_mapping_size __ro_after_init = -1UL;
> +EXPORT_SYMBOL(direct_mapping_size);
>  
>  #define __head   __section(.head.text)
>  
> diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
> index 4408cd9a3bef..3d8ef8cb97e1 100644
> --- a/arch/x86/mm/kaslr.c
> +++ b/arch/x86/mm/kaslr.c
> @@ -69,6 +69,15 @@ static inline bool kaslr_memory_enabled(void)
>   return kaslr_enabled() && !IS_ENABLED(CONFIG_KASAN);
>  }
>  
> +#ifndef CONFIG_X86_INTEL_MKTME
> +static void __init setup_direct_mapping_size(void)
> +{
> + direct_mapping_size = max_pfn << PAGE_SHIFT;
> + direct_mapping_size = round_up(direct_mapping_size, 1UL << TB_SHIFT);
> + direct_mapping_size += (1UL << TB_SHIFT) * 
> CONFIG_MEMORY_PHYSICAL_PADDING;
> +}
> +#endif
> +
>  /* Initialize base and padding for each memory region randomized with KASLR 
> */
>  void __init kernel_randomize_memory(void)
>  {
> @@ -93,7 +102,11 @@ void __init kernel_randomize_memory(void)
>   if (!kaslr_memory_enabled())
>   return;
>  
> - kaslr_regions[0].size_tb = 1 << (__PHYSICAL_MASK_SHIFT - TB_SHIFT);
> + /*
> +  * Upper limit for direct mapping size is 1/4 of whole virtual
> +  * address space

or 1/2?

> +  */
> + kaslr_regions[0].size_tb = 1 << (__VIRTUAL_MASK_SHIFT - 1 - TB_SHIFT);



>   kaslr_regions[1].size_tb = VMALLOC_SIZE_TB;
>  
>   /*
> @@ -101,8 +114,10 @@ void __init kernel_randomize_memory(void)
>* add padding if needed (especially for memory hotplug support).
>*/
>   BUG_ON(kaslr_regions[0].base != _offset_base);
> - memory_tb = DIV_ROUND_UP(max_pfn << PAGE_SHIFT, 1UL << TB_SHIFT) +
> - CONFIG_MEMORY_PHYSICAL_PADDING;
> +
> + setup_direct_mapping_size();
> +
> + memory_tb = direct_mapping_size * mktme_nr_keyids + 1;

parenthesis ?

memory_tb = direct_mapping_size * (mktme_nr_keyids + 1);


>  
>   /* Adapt phyiscal memory region size based on available memory */
>   if (memory_tb < kaslr_regions[0].size_tb)
> diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
> index 43a44f0f2a2d..3e5322bf035e 100644
> --- a/arch/x86/mm/mktme.c
> +++ b/arch/x86/mm/mktme.c
> @@ -89,3 +89,51 @@ static bool need_page_mktme(void)
>  struct page_ext_operations page_mktme_ops = {
>   .need = need_page_mktme,
>  };
> +
> +void __init setup_direct_mapping_size(void)
> +{
> + unsigned long available_va;
> +
> + /* 1/4 of virtual address space is didicated for direct mapping */
> + available_va = 1UL << (__VIRTUAL_MASK_SHIFT - 1);
> +
> + /* How much memory the systrem has? */
> + direct_mapping_size = max_pfn << PAGE_SHIFT;
> + direct_mapping_size =

Re: [PATCHv3 14/17] x86/mm: Introduce direct_mapping_size

2018-06-12 Thread Mika Penttilä




On 12.06.2018 17:39, Kirill A. Shutemov wrote:
> Kernel need to have a way to access encrypted memory. We are going to
> use per-KeyID direct mapping to facilitate the access with minimal
> overhead.
>
> Direct mapping for each KeyID will be put next to each other in the
> virtual address space. We need to have a way to find boundaries of
> direct mapping for particular KeyID.
>
> The new variable direct_mapping_size specifies the size of direct
> mapping. With the value, it's trivial to find direct mapping for
> KeyID-N: PAGE_OFFSET + N * direct_mapping_size.
>
> Size of direct mapping is calculated during KASLR setup. If KALSR is
> disable it happens during MKTME initialization.
>
> Signed-off-by: Kirill A. Shutemov 
> ---
>  arch/x86/include/asm/mktme.h   |  2 ++
>  arch/x86/include/asm/page_64.h |  1 +
>  arch/x86/kernel/head64.c   |  2 ++
>  arch/x86/mm/kaslr.c| 21 ---
>  arch/x86/mm/mktme.c| 48 ++
>  5 files changed, 71 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
> index 9363b989a021..3bf481fe3f56 100644
> --- a/arch/x86/include/asm/mktme.h
> +++ b/arch/x86/include/asm/mktme.h
> @@ -40,6 +40,8 @@ int page_keyid(const struct page *page);
>  
>  void mktme_disable(void);
>  
> +void setup_direct_mapping_size(void);
> +
>  #else
>  #define mktme_keyid_mask ((phys_addr_t)0)
>  #define mktme_nr_keyids  0
> diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
> index 939b1cff4a7b..53c32af895ab 100644
> --- a/arch/x86/include/asm/page_64.h
> +++ b/arch/x86/include/asm/page_64.h
> @@ -14,6 +14,7 @@ extern unsigned long phys_base;
>  extern unsigned long page_offset_base;
>  extern unsigned long vmalloc_base;
>  extern unsigned long vmemmap_base;
> +extern unsigned long direct_mapping_size;
>  
>  static inline unsigned long __phys_addr_nodebug(unsigned long x)
>  {
> diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
> index a21d6ace648e..b6175376b2e1 100644
> --- a/arch/x86/kernel/head64.c
> +++ b/arch/x86/kernel/head64.c
> @@ -59,6 +59,8 @@ EXPORT_SYMBOL(vmalloc_base);
>  unsigned long vmemmap_base __ro_after_init = __VMEMMAP_BASE_L4;
>  EXPORT_SYMBOL(vmemmap_base);
>  #endif
> +unsigned long direct_mapping_size __ro_after_init = -1UL;
> +EXPORT_SYMBOL(direct_mapping_size);
>  
>  #define __head   __section(.head.text)
>  
> diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
> index 4408cd9a3bef..3d8ef8cb97e1 100644
> --- a/arch/x86/mm/kaslr.c
> +++ b/arch/x86/mm/kaslr.c
> @@ -69,6 +69,15 @@ static inline bool kaslr_memory_enabled(void)
>   return kaslr_enabled() && !IS_ENABLED(CONFIG_KASAN);
>  }
>  
> +#ifndef CONFIG_X86_INTEL_MKTME
> +static void __init setup_direct_mapping_size(void)
> +{
> + direct_mapping_size = max_pfn << PAGE_SHIFT;
> + direct_mapping_size = round_up(direct_mapping_size, 1UL << TB_SHIFT);
> + direct_mapping_size += (1UL << TB_SHIFT) * 
> CONFIG_MEMORY_PHYSICAL_PADDING;
> +}
> +#endif
> +
>  /* Initialize base and padding for each memory region randomized with KASLR 
> */
>  void __init kernel_randomize_memory(void)
>  {
> @@ -93,7 +102,11 @@ void __init kernel_randomize_memory(void)
>   if (!kaslr_memory_enabled())
>   return;
>  
> - kaslr_regions[0].size_tb = 1 << (__PHYSICAL_MASK_SHIFT - TB_SHIFT);
> + /*
> +  * Upper limit for direct mapping size is 1/4 of whole virtual
> +  * address space

or 1/2?

> +  */
> + kaslr_regions[0].size_tb = 1 << (__VIRTUAL_MASK_SHIFT - 1 - TB_SHIFT);



>   kaslr_regions[1].size_tb = VMALLOC_SIZE_TB;
>  
>   /*
> @@ -101,8 +114,10 @@ void __init kernel_randomize_memory(void)
>* add padding if needed (especially for memory hotplug support).
>*/
>   BUG_ON(kaslr_regions[0].base != _offset_base);
> - memory_tb = DIV_ROUND_UP(max_pfn << PAGE_SHIFT, 1UL << TB_SHIFT) +
> - CONFIG_MEMORY_PHYSICAL_PADDING;
> +
> + setup_direct_mapping_size();
> +
> + memory_tb = direct_mapping_size * mktme_nr_keyids + 1;

parenthesis ?

memory_tb = direct_mapping_size * (mktme_nr_keyids + 1);


>  
>   /* Adapt phyiscal memory region size based on available memory */
>   if (memory_tb < kaslr_regions[0].size_tb)
> diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
> index 43a44f0f2a2d..3e5322bf035e 100644
> --- a/arch/x86/mm/mktme.c
> +++ b/arch/x86/mm/mktme.c
> @@ -89,3 +89,51 @@ static bool need_page_mktme(void)
>  struct page_ext_operations page_mktme_ops = {
>   .need = need_page_mktme,
>  };
> +
> +void __init setup_direct_mapping_size(void)
> +{
> + unsigned long available_va;
> +
> + /* 1/4 of virtual address space is didicated for direct mapping */
> + available_va = 1UL << (__VIRTUAL_MASK_SHIFT - 1);
> +
> + /* How much memory the systrem has? */
> + direct_mapping_size = max_pfn << PAGE_SHIFT;
> + direct_mapping_size =

Re: [PATCH 6/6] x86/vdso: Move out the CPU number store

2018-06-05 Thread Mika Penttilä

On 06/05/2018 08:36 AM, H. Peter Anvin wrote:
> On 06/04/18 20:57, Mika Penttilä wrote:
>>
>> This won't work on X86-32 because it actually uses the segment limit with 
>> fs: access. So there 
>> is a reason why the lsl based method is X86-64 only.
>>
> 
> 
> 
> Why does that matter in any shape, way, or form?  The LSL instruction
> doesn't touch any of the segment registers, it just uses a segment
> selector number.
> 
> 
> 
> I see... we have a VERY unfortunate name collision: the x86-64
> GDT_ENTRY_PERC_PU and the i386 GDT_ENTRY_PERCPU are in fact two
> completely different things, with the latter being the actual percpu
> offset used by the kernel.
> 
> So yes, this patch is wrong, because the naming of the x86-64 segment is
> insane especially in the proximity of the  -- it should be something
> like GDT_ENTRY_CPU_NUMBER.
> 
> Unfortunately we probably can't use the same GDT entry on x86-32 and
> x86-64, because entry 15 (selector 0x7b) is USER_DS, which is something
> we really don't want to screw with.  This means i386 programs that
> execute LSL directly for whatever reason will have to understand the
> difference, but most of the other segment numbers differ as well,
> including user space %cs (USER_CS/USER32_CS) and %ds/%es/%ss (USER_DS).
> Perhaps we could bump down segments 23-28 by one and put it as 23, that
> way the CPU_NUMBER segment would always be at %ss+80 for the default
> (flat, initial) user space %ss.  (We want to specify using %ss rather
> than %ds, because it is less likely to be changed and because 64 bits,
> too, have %ss defined, but not %ds.)
> 
> 
> 
> Rename the x86-64 segment to CPU_NUMBER, fixing the naming conflict.
> Add 1 to GDT entry numbers 23-28 for i386 (all of these are
> kernel-internal segments and so have no impact on user space).
> Add i386 CPU_NUMBER equivalent to x86-64 at GDT entry 23.
> Document the above relationship between segments.
> 
> OK, everyone?
> 
>   -hpa
> 

Yes GDT_ENTRY_PER_CPU and GDT_ENTRY_PERCPU meaning two totally different things 
is really confusing,
the proposal seems ok to me!

--Mika

Re: [PATCH 6/6] x86/vdso: Move out the CPU number store

2018-06-05 Thread Mika Penttilä

On 06/05/2018 08:36 AM, H. Peter Anvin wrote:
> On 06/04/18 20:57, Mika Penttilä wrote:
>>
>> This won't work on X86-32 because it actually uses the segment limit with 
>> fs: access. So there 
>> is a reason why the lsl based method is X86-64 only.
>>
> 
> 
> 
> Why does that matter in any shape, way, or form?  The LSL instruction
> doesn't touch any of the segment registers, it just uses a segment
> selector number.
> 
> 
> 
> I see... we have a VERY unfortunate name collision: the x86-64
> GDT_ENTRY_PERC_PU and the i386 GDT_ENTRY_PERCPU are in fact two
> completely different things, with the latter being the actual percpu
> offset used by the kernel.
> 
> So yes, this patch is wrong, because the naming of the x86-64 segment is
> insane especially in the proximity of the  -- it should be something
> like GDT_ENTRY_CPU_NUMBER.
> 
> Unfortunately we probably can't use the same GDT entry on x86-32 and
> x86-64, because entry 15 (selector 0x7b) is USER_DS, which is something
> we really don't want to screw with.  This means i386 programs that
> execute LSL directly for whatever reason will have to understand the
> difference, but most of the other segment numbers differ as well,
> including user space %cs (USER_CS/USER32_CS) and %ds/%es/%ss (USER_DS).
> Perhaps we could bump down segments 23-28 by one and put it as 23, that
> way the CPU_NUMBER segment would always be at %ss+80 for the default
> (flat, initial) user space %ss.  (We want to specify using %ss rather
> than %ds, because it is less likely to be changed and because 64 bits,
> too, have %ss defined, but not %ds.)
> 
> 
> 
> Rename the x86-64 segment to CPU_NUMBER, fixing the naming conflict.
> Add 1 to GDT entry numbers 23-28 for i386 (all of these are
> kernel-internal segments and so have no impact on user space).
> Add i386 CPU_NUMBER equivalent to x86-64 at GDT entry 23.
> Document the above relationship between segments.
> 
> OK, everyone?
> 
>   -hpa
> 

Yes GDT_ENTRY_PER_CPU and GDT_ENTRY_PERCPU meaning two totally different things 
is really confusing,
the proposal seems ok to me!

--Mika

Re: [PATCH 6/6] x86/vdso: Move out the CPU number store

2018-06-04 Thread Mika Penttilä

On 06/05/2018 07:44 AM, Bae, Chang Seok wrote:
>>> diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
>>> index ea554f8..e716e94 100644
>>> --- a/arch/x86/kernel/setup_percpu.c
>>> +++ b/arch/x86/kernel/setup_percpu.c
>>> @@ -155,12 +155,21 @@ static void __init pcpup_populate_pte(unsigned long 
>>> addr)
>>>  
>>>  static inline void setup_percpu_segment(int cpu)
>>>  {
>>> -#ifdef CONFIG_X86_32
>>> -   struct desc_struct d = GDT_ENTRY_INIT(0x8092, per_cpu_offset(cpu),
>>> - 0xF);
>>> +#ifdef CONFIG_NUMA
>>> +   unsigned long node = early_cpu_to_node(cpu);
>>> +#else
>>> +   unsigned long node = 0;
>>> +#endif
>>> +   struct desc_struct d = GDT_ENTRY_INIT(0x0, per_cpu_offset(cpu),
>>> +  make_lsl_tscp(cpu, node));
>>> +
>>> +   d.type = 5; /* R0 data, expand down, accessed */
>>> +   d.dpl = 3;  /* Visible to user code */
>>> +   d.s = 1;/* Not a system segment */
>>> +   d.p = 1;/* Present */
>>> +   d.d = 1;/* 32-bit */
>>>  
>>> write_gdt_entry(get_cpu_gdt_rw(cpu), GDT_ENTRY_PERCPU, , DESCTYPE_S);
>>> -#endif
>>>  }
> 
>> This won't work on X86-32 because it actually uses the segment limit with 
>> fs: access. So there 
>> is a reason why the lsl based method is X86-64 only.
> 
> The limit will be consumed in X86-64 only, while the unification with i386 
> was suggested for a
> different reason.
> 
> Thanks,
> Chang
> 

The unification affects i386, and the limit is consumed by the processor with 
fs: access.
The limit was 0xF before, now it depends on the cpu/node. So accesses on 
small number cpus 
are likely to fault.

--Mika

Re: [PATCH 6/6] x86/vdso: Move out the CPU number store

2018-06-04 Thread Mika Penttilä

On 06/05/2018 07:44 AM, Bae, Chang Seok wrote:
>>> diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
>>> index ea554f8..e716e94 100644
>>> --- a/arch/x86/kernel/setup_percpu.c
>>> +++ b/arch/x86/kernel/setup_percpu.c
>>> @@ -155,12 +155,21 @@ static void __init pcpup_populate_pte(unsigned long 
>>> addr)
>>>  
>>>  static inline void setup_percpu_segment(int cpu)
>>>  {
>>> -#ifdef CONFIG_X86_32
>>> -   struct desc_struct d = GDT_ENTRY_INIT(0x8092, per_cpu_offset(cpu),
>>> - 0xF);
>>> +#ifdef CONFIG_NUMA
>>> +   unsigned long node = early_cpu_to_node(cpu);
>>> +#else
>>> +   unsigned long node = 0;
>>> +#endif
>>> +   struct desc_struct d = GDT_ENTRY_INIT(0x0, per_cpu_offset(cpu),
>>> +  make_lsl_tscp(cpu, node));
>>> +
>>> +   d.type = 5; /* R0 data, expand down, accessed */
>>> +   d.dpl = 3;  /* Visible to user code */
>>> +   d.s = 1;/* Not a system segment */
>>> +   d.p = 1;/* Present */
>>> +   d.d = 1;/* 32-bit */
>>>  
>>> write_gdt_entry(get_cpu_gdt_rw(cpu), GDT_ENTRY_PERCPU, , DESCTYPE_S);
>>> -#endif
>>>  }
> 
>> This won't work on X86-32 because it actually uses the segment limit with 
>> fs: access. So there 
>> is a reason why the lsl based method is X86-64 only.
> 
> The limit will be consumed in X86-64 only, while the unification with i386 
> was suggested for a
> different reason.
> 
> Thanks,
> Chang
> 

The unification affects i386, and the limit is consumed by the processor with 
fs: access.
The limit was 0xF before, now it depends on the cpu/node. So accesses on 
small number cpus 
are likely to fault.

--Mika

Re: [PATCH 6/6] x86/vdso: Move out the CPU number store

2018-06-04 Thread Mika Penttilä

On 06/04/2018 10:24 PM, Chang S. Bae wrote:
> The CPU (and node) number will be written, as early enough,
> to the segment limit of per CPU data and TSC_AUX MSR entry.
> The information has been retrieved by vgetcpu in user space
> and will be also loaded from the paranoid entry, when
> FSGSBASE enabled. So, it is moved out from vDSO to the CPU
> initialization path where IST setup is serialized.
> 
> Now, redundant setting of the segment in entry/vdso/vma.c
> was removed; a substantial code removal. It removes a
> hotplug notifier, makes a facility useful to both the kernel
> and userspace unconditionally available much sooner, and
> unification with i386. (Thanks to HPA for suggesting the
> cleanup)
> 
> Signed-off-by: Chang S. Bae 
> Cc: H. Peter Anvin 
> Cc: Dave Hansen 
> Cc: Andy Lutomirski 
> Cc: Andi Kleen 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> ---

> diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
> index ea554f8..e716e94 100644
> --- a/arch/x86/kernel/setup_percpu.c
> +++ b/arch/x86/kernel/setup_percpu.c
> @@ -155,12 +155,21 @@ static void __init pcpup_populate_pte(unsigned long 
> addr)
>  
>  static inline void setup_percpu_segment(int cpu)
>  {
> -#ifdef CONFIG_X86_32
> - struct desc_struct d = GDT_ENTRY_INIT(0x8092, per_cpu_offset(cpu),
> -   0xF);
> +#ifdef CONFIG_NUMA
> + unsigned long node = early_cpu_to_node(cpu);
> +#else
> + unsigned long node = 0;
> +#endif
> + struct desc_struct d = GDT_ENTRY_INIT(0x0, per_cpu_offset(cpu),
> +make_lsl_tscp(cpu, node));
> +
> + d.type = 5; /* R0 data, expand down, accessed */
> + d.dpl = 3;  /* Visible to user code */
> + d.s = 1;/* Not a system segment */
> + d.p = 1;/* Present */
> + d.d = 1;/* 32-bit */
>  
>   write_gdt_entry(get_cpu_gdt_rw(cpu), GDT_ENTRY_PERCPU, , DESCTYPE_S);
> -#endif
>  }


This won't work on X86-32 because it actually uses the segment limit with fs: 
access. So there 
is a reason why the lsl based method is X86-64 only.


-Mika

>  
>  void __init setup_per_cpu_areas(void)
>

Re: [PATCH 6/6] x86/vdso: Move out the CPU number store

2018-06-04 Thread Mika Penttilä

On 06/04/2018 10:24 PM, Chang S. Bae wrote:
> The CPU (and node) number will be written, as early enough,
> to the segment limit of per CPU data and TSC_AUX MSR entry.
> The information has been retrieved by vgetcpu in user space
> and will be also loaded from the paranoid entry, when
> FSGSBASE enabled. So, it is moved out from vDSO to the CPU
> initialization path where IST setup is serialized.
> 
> Now, redundant setting of the segment in entry/vdso/vma.c
> was removed; a substantial code removal. It removes a
> hotplug notifier, makes a facility useful to both the kernel
> and userspace unconditionally available much sooner, and
> unification with i386. (Thanks to HPA for suggesting the
> cleanup)
> 
> Signed-off-by: Chang S. Bae 
> Cc: H. Peter Anvin 
> Cc: Dave Hansen 
> Cc: Andy Lutomirski 
> Cc: Andi Kleen 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> ---

> diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
> index ea554f8..e716e94 100644
> --- a/arch/x86/kernel/setup_percpu.c
> +++ b/arch/x86/kernel/setup_percpu.c
> @@ -155,12 +155,21 @@ static void __init pcpup_populate_pte(unsigned long 
> addr)
>  
>  static inline void setup_percpu_segment(int cpu)
>  {
> -#ifdef CONFIG_X86_32
> - struct desc_struct d = GDT_ENTRY_INIT(0x8092, per_cpu_offset(cpu),
> -   0xF);
> +#ifdef CONFIG_NUMA
> + unsigned long node = early_cpu_to_node(cpu);
> +#else
> + unsigned long node = 0;
> +#endif
> + struct desc_struct d = GDT_ENTRY_INIT(0x0, per_cpu_offset(cpu),
> +make_lsl_tscp(cpu, node));
> +
> + d.type = 5; /* R0 data, expand down, accessed */
> + d.dpl = 3;  /* Visible to user code */
> + d.s = 1;/* Not a system segment */
> + d.p = 1;/* Present */
> + d.d = 1;/* 32-bit */
>  
>   write_gdt_entry(get_cpu_gdt_rw(cpu), GDT_ENTRY_PERCPU, , DESCTYPE_S);
> -#endif
>  }


This won't work on X86-32 because it actually uses the segment limit with fs: 
access. So there 
is a reason why the lsl based method is X86-64 only.


-Mika

>  
>  void __init setup_per_cpu_areas(void)
>

Re: [PATCH v2 4/9] x86, memcpy_mcsafe: add write-protection-fault handling

2018-05-02 Thread Mika Penttilä

On 05/03/2018 07:59 AM, Dan Williams wrote:
> In preparation for using memcpy_mcsafe() to handle user copies it needs
> to be to handle write-protection faults while writing user pages. Add
> MMU-fault handlers alongside the machine-check exception handlers.
> 
> Note that the machine check fault exception handling makes assumptions
> about source buffer alignment and poison alignment. In the write fault
> case, given the destination buffer is arbitrarily aligned, it needs a
> separate / additional fault handling approach. The mcsafe_handle_tail()
> helper is reused. The @limit argument is set to @len since there is no
> safety concern about retriggering an MMU fault, and this simplifies the
> assembly.
> 

> diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
> index 75d3776123cc..9787f5ee0cf9 100644
> --- a/arch/x86/lib/usercopy_64.c
> +++ b/arch/x86/lib/usercopy_64.c
> @@ -75,6 +75,23 @@ copy_user_handle_tail(char *to, char *from, unsigned len)
>   return len;
>  }
>  
> +/*
> + * Similar to copy_user_handle_tail, probe for the write fault point,
> + * but reuse __memcpy_mcsafe in case a new read error is encountered.
> + * clac() is handled in _copy_to_iter_mcsafe().
> + */
> +__visible unsigned long
> +mcsafe_handle_tail(char *to, char *from, unsigned len)
> +{
> + for (; len; --len, to++) {
> + unsigned long rem = memcpy_mcsafe(to, from, 1);
> +


Hmm why not 
for (; len; --len, from++, to++)



> + if (rem)
> + break;
> + }
> + return len;
> +}


--Mika

Re: [PATCH v2 4/9] x86, memcpy_mcsafe: add write-protection-fault handling

2018-05-02 Thread Mika Penttilä

On 05/03/2018 07:59 AM, Dan Williams wrote:
> In preparation for using memcpy_mcsafe() to handle user copies it needs
> to be to handle write-protection faults while writing user pages. Add
> MMU-fault handlers alongside the machine-check exception handlers.
> 
> Note that the machine check fault exception handling makes assumptions
> about source buffer alignment and poison alignment. In the write fault
> case, given the destination buffer is arbitrarily aligned, it needs a
> separate / additional fault handling approach. The mcsafe_handle_tail()
> helper is reused. The @limit argument is set to @len since there is no
> safety concern about retriggering an MMU fault, and this simplifies the
> assembly.
> 

> diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
> index 75d3776123cc..9787f5ee0cf9 100644
> --- a/arch/x86/lib/usercopy_64.c
> +++ b/arch/x86/lib/usercopy_64.c
> @@ -75,6 +75,23 @@ copy_user_handle_tail(char *to, char *from, unsigned len)
>   return len;
>  }
>  
> +/*
> + * Similar to copy_user_handle_tail, probe for the write fault point,
> + * but reuse __memcpy_mcsafe in case a new read error is encountered.
> + * clac() is handled in _copy_to_iter_mcsafe().
> + */
> +__visible unsigned long
> +mcsafe_handle_tail(char *to, char *from, unsigned len)
> +{
> + for (; len; --len, to++) {
> + unsigned long rem = memcpy_mcsafe(to, from, 1);
> +


Hmm why not 
for (; len; --len, from++, to++)



> + if (rem)
> + break;
> + }
> + return len;
> +}


--Mika

Re: [REGRESSION][BISECTED] i.MX6 pinctrl hogs stopped working

2018-04-10 Thread Mika Penttilä



On 10.04.2018 13:21, Richard Fitzgerald wrote:
> On 04/04/18 06:33, Mika Penttilä wrote:
>> Hi!
>>
>> Reverting this made the hogs on a i.MX6 board work again. :
>>
>>
>> commit b89405b6102fcc3746f43697b826028caa94c823
>> Author: Richard Fitzgerald <r...@opensource.cirrus.com>
>> Date:   Wed Feb 28 15:53:06 2018 +
>>
>>  pinctrl: devicetree: Fix dt_to_map_one_config handling of hogs
>>
>>
>>
>> --Mika
>>
>
> I think you should check whether the bug is with the i.MX6 driver
> relying on the previous buggy behaviour of pinctrl. I haven't got
> i.MX6 hardware to test myself.
>
> The bug I fixed in that patch was that when pinctrl is probing a
> pinctrl driver it would try to apply all the pinctrl settings
> listed in a dt node to the pinctrl driver it is probing instead
> of the pinctrl drivers they actually refer to. This was a bug
> introduced by an earlier patch (which unfortunately I forgot to
> include a fixes line reference to)
>
>   pinctrl: core: Use delayed work for hogs
>
> So if a pinctrl driver "A" had a dependency on another pinctrl
> driver "B" those dependencies wouldn't be properly created because
> all the "B" pinctrl DT entries would be attempted against "A"
> instead of "B". This caused failures if a pinctrl driver had a
> dependency on another pinctrl driver, of if creating a pinctrl
> driver that is a child of an MFD and that MFD has dependencies
> on another pinctrl driver.
>

Hard to say, but the kernel/dts has worked ok for 3+ years, from 3.17 until 
4.17-rc. Nothing fancy, just normal hogs, in two groups.
Can send you relevant pieces of DT if interested.

--Mika

Re: [REGRESSION][BISECTED] i.MX6 pinctrl hogs stopped working

2018-04-10 Thread Mika Penttilä



On 10.04.2018 13:21, Richard Fitzgerald wrote:
> On 04/04/18 06:33, Mika Penttilä wrote:
>> Hi!
>>
>> Reverting this made the hogs on a i.MX6 board work again. :
>>
>>
>> commit b89405b6102fcc3746f43697b826028caa94c823
>> Author: Richard Fitzgerald 
>> Date:   Wed Feb 28 15:53:06 2018 +
>>
>>  pinctrl: devicetree: Fix dt_to_map_one_config handling of hogs
>>
>>
>>
>> --Mika
>>
>
> I think you should check whether the bug is with the i.MX6 driver
> relying on the previous buggy behaviour of pinctrl. I haven't got
> i.MX6 hardware to test myself.
>
> The bug I fixed in that patch was that when pinctrl is probing a
> pinctrl driver it would try to apply all the pinctrl settings
> listed in a dt node to the pinctrl driver it is probing instead
> of the pinctrl drivers they actually refer to. This was a bug
> introduced by an earlier patch (which unfortunately I forgot to
> include a fixes line reference to)
>
>   pinctrl: core: Use delayed work for hogs
>
> So if a pinctrl driver "A" had a dependency on another pinctrl
> driver "B" those dependencies wouldn't be properly created because
> all the "B" pinctrl DT entries would be attempted against "A"
> instead of "B". This caused failures if a pinctrl driver had a
> dependency on another pinctrl driver, of if creating a pinctrl
> driver that is a child of an MFD and that MFD has dependencies
> on another pinctrl driver.
>

Hard to say, but the kernel/dts has worked ok for 3+ years, from 3.17 until 
4.17-rc. Nothing fancy, just normal hogs, in two groups.
Can send you relevant pieces of DT if interested.

--Mika

Re: [PATCH] ASoC: fsl_ssi: Fix mode setting when changing channel number

2018-04-09 Thread Mika Penttilä

On 04/08/2018 07:40 AM, Nicolin Chen wrote:
> This is a partial revert (in a cleaner way) of commit ebf08ae3bc90
> ("ASoC: fsl_ssi: Keep ssi->i2s_net updated") to fix a regression
> at test cases when switching between mono and stereo audio.
> 
> The problem is that ssi->i2s_net is initialized in set_dai_fmt()
> only, while this set_dai_fmt() is only called during the dai-link
> probe(). The original patch assumed set_dai_fmt() would be called
> during every playback instance, so it failed at the overriding use
> cases.
> 
> This patch adds the local variable i2s_net back to let regular use
> cases still follow the mode settings from the set_dai_fmt().
> 
> Meanwhile, the original commit of keeping ssi->i2s_net updated was
> to make set_tdm_slot() clean by checking the ssi->i2s_net directly
> instead of reading SCR register. However, the change itself is not
> necessary (or even harmful) because the set_tdm_slot() might fail
> to check the slot number for Normal-Mode-None-Net settings while
> mono audio cases still need 2 slots. So this patch can also fix it.
> And it adds an extra line of comments to declare ssi->i2s_net does
> not reflect the register value but merely the initial setting from
> the set_dai_fmt().
> 
> Reported-by: Mika Penttilä <mika.pentt...@nextfour.com>
> Signed-off-by: Nicolin Chen <nicoleots...@gmail.com>
> Cc: Mika Penttilä <mika.pentt...@nextfour.com>
> ---
>  sound/soc/fsl/fsl_ssi.c | 14 +++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/sound/soc/fsl/fsl_ssi.c b/sound/soc/fsl/fsl_ssi.c
> index 0823b08..89df2d9 100644
> --- a/sound/soc/fsl/fsl_ssi.c
> +++ b/sound/soc/fsl/fsl_ssi.c
> @@ -217,6 +217,7 @@ struct fsl_ssi_soc_data {
>   * @dai_fmt: DAI configuration this device is currently used with
>   * @streams: Mask of current active streams: BIT(TX) and BIT(RX)
>   * @i2s_net: I2S and Network mode configurations of SCR register
> + *   (this is the initial settings based on the DAI format)
>   * @synchronous: Use synchronous mode - both of TX and RX use STCK and SFCK
>   * @use_dma: DMA is used or FIQ with stream filter
>   * @use_dual_fifo: DMA with support for dual FIFO mode
> @@ -829,16 +830,23 @@ static int fsl_ssi_hw_params(struct snd_pcm_substream 
> *substream,
>   }
>  
>   if (!fsl_ssi_is_ac97(ssi)) {
> + /*
> +  * Keep the ssi->i2s_net intact while having a local variable
> +  * to override settings for special use cases. Otherwise, the
> +  * ssi->i2s_net will lose the settings for regular use cases.
> +  */
> + u8 i2s_net = ssi->i2s_net;
> +
>   /* Normal + Network mode to send 16-bit data in 32-bit frames */
>   if (fsl_ssi_is_i2s_cbm_cfs(ssi) && sample_size == 16)
> - ssi->i2s_net = SSI_SCR_I2S_MODE_NORMAL | SSI_SCR_NET;
> + i2s_net = SSI_SCR_I2S_MODE_NORMAL | SSI_SCR_NET;
>  
>   /* Use Normal mode to send mono data at 1st slot of 2 slots */
>   if (channels == 1)
> - ssi->i2s_net = SSI_SCR_I2S_MODE_NORMAL;
> + i2s_net = SSI_SCR_I2S_MODE_NORMAL;
>  
>   regmap_update_bits(regs, REG_SSI_SCR,
> -    SSI_SCR_I2S_NET_MASK, ssi->i2s_net);
> +SSI_SCR_I2S_NET_MASK, i2s_net);
>   }
>  
>   /* In synchronous mode, the SSI uses STCCR for capture */
> 

This patch fixes my problems, so: 

Tested-by: Mika Penttilä <mika.pentt...@nextfour.com>


--Mika

Re: [PATCH] ASoC: fsl_ssi: Fix mode setting when changing channel number

2018-04-09 Thread Mika Penttilä

On 04/08/2018 07:40 AM, Nicolin Chen wrote:
> This is a partial revert (in a cleaner way) of commit ebf08ae3bc90
> ("ASoC: fsl_ssi: Keep ssi->i2s_net updated") to fix a regression
> at test cases when switching between mono and stereo audio.
> 
> The problem is that ssi->i2s_net is initialized in set_dai_fmt()
> only, while this set_dai_fmt() is only called during the dai-link
> probe(). The original patch assumed set_dai_fmt() would be called
> during every playback instance, so it failed at the overriding use
> cases.
> 
> This patch adds the local variable i2s_net back to let regular use
> cases still follow the mode settings from the set_dai_fmt().
> 
> Meanwhile, the original commit of keeping ssi->i2s_net updated was
> to make set_tdm_slot() clean by checking the ssi->i2s_net directly
> instead of reading SCR register. However, the change itself is not
> necessary (or even harmful) because the set_tdm_slot() might fail
> to check the slot number for Normal-Mode-None-Net settings while
> mono audio cases still need 2 slots. So this patch can also fix it.
> And it adds an extra line of comments to declare ssi->i2s_net does
> not reflect the register value but merely the initial setting from
> the set_dai_fmt().
> 
> Reported-by: Mika Penttilä 
> Signed-off-by: Nicolin Chen 
> Cc: Mika Penttilä 
> ---
>  sound/soc/fsl/fsl_ssi.c | 14 +++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/sound/soc/fsl/fsl_ssi.c b/sound/soc/fsl/fsl_ssi.c
> index 0823b08..89df2d9 100644
> --- a/sound/soc/fsl/fsl_ssi.c
> +++ b/sound/soc/fsl/fsl_ssi.c
> @@ -217,6 +217,7 @@ struct fsl_ssi_soc_data {
>   * @dai_fmt: DAI configuration this device is currently used with
>   * @streams: Mask of current active streams: BIT(TX) and BIT(RX)
>   * @i2s_net: I2S and Network mode configurations of SCR register
> + *   (this is the initial settings based on the DAI format)
>   * @synchronous: Use synchronous mode - both of TX and RX use STCK and SFCK
>   * @use_dma: DMA is used or FIQ with stream filter
>   * @use_dual_fifo: DMA with support for dual FIFO mode
> @@ -829,16 +830,23 @@ static int fsl_ssi_hw_params(struct snd_pcm_substream 
> *substream,
>   }
>  
>   if (!fsl_ssi_is_ac97(ssi)) {
> + /*
> +  * Keep the ssi->i2s_net intact while having a local variable
> +  * to override settings for special use cases. Otherwise, the
> +  * ssi->i2s_net will lose the settings for regular use cases.
> +  */
> + u8 i2s_net = ssi->i2s_net;
> +
>   /* Normal + Network mode to send 16-bit data in 32-bit frames */
>   if (fsl_ssi_is_i2s_cbm_cfs(ssi) && sample_size == 16)
> - ssi->i2s_net = SSI_SCR_I2S_MODE_NORMAL | SSI_SCR_NET;
> + i2s_net = SSI_SCR_I2S_MODE_NORMAL | SSI_SCR_NET;
>  
>   /* Use Normal mode to send mono data at 1st slot of 2 slots */
>   if (channels == 1)
> - ssi->i2s_net = SSI_SCR_I2S_MODE_NORMAL;
> + i2s_net = SSI_SCR_I2S_MODE_NORMAL;
>  
>   regmap_update_bits(regs, REG_SSI_SCR,
> -    SSI_SCR_I2S_NET_MASK, ssi->i2s_net);
> +SSI_SCR_I2S_NET_MASK, i2s_net);
>   }
>  
>   /* In synchronous mode, the SSI uses STCCR for capture */
> 

This patch fixes my problems, so: 

Tested-by: Mika Penttilä 


--Mika

Fwd: regression, imx6 and sgtl5000 sound problems

2018-04-06 Thread Mika Penttilä




> On Fri, Apr 06, 2018 at 07:46:37AM +0300, Mika Penttilä wrote:
>>
>> With recent merge to pre 4.17-rc, audio stopped workin (or it's hearable but 
>> way too slow).
>> imx6q + sgtl5000 codec.
> 
> Could you please be more specific at your test cases?
> 
> Which board? Whose is the DAI master? Which sample rate?
> 
>> Maybe some of the soc/fsl changes is causing this.
> 
> There are quite a few clean-up patches of SSI driver being merged.
> Would you please try to revert/bisect the changes of fsl_ssi driver
> so as to figure out which one breaks your test cases?
> 
> If there is a regression because of one of the changes, I will need
> to fix it.
> 
> Thanks
> Nicolin
> 


Hi,


bisected and with this reverted I get it working : 

ebf08ae3bc906fc5dd33d02977efa5d4b9831517 is the first bad commit
commit ebf08ae3bc906fc5dd33d02977efa5d4b9831517
Author: Nicolin Chen <nicoleots...@gmail.com>
Date:   Mon Feb 12 14:03:10 2018 -0800

ASoC: fsl_ssi: Keep ssi->i2s_net updated



--Mika

Fwd: regression, imx6 and sgtl5000 sound problems

2018-04-06 Thread Mika Penttilä




> On Fri, Apr 06, 2018 at 07:46:37AM +0300, Mika Penttilä wrote:
>>
>> With recent merge to pre 4.17-rc, audio stopped workin (or it's hearable but 
>> way too slow).
>> imx6q + sgtl5000 codec.
> 
> Could you please be more specific at your test cases?
> 
> Which board? Whose is the DAI master? Which sample rate?
> 
>> Maybe some of the soc/fsl changes is causing this.
> 
> There are quite a few clean-up patches of SSI driver being merged.
> Would you please try to revert/bisect the changes of fsl_ssi driver
> so as to figure out which one breaks your test cases?
> 
> If there is a regression because of one of the changes, I will need
> to fix it.
> 
> Thanks
> Nicolin
> 


Hi,


bisected and with this reverted I get it working : 

ebf08ae3bc906fc5dd33d02977efa5d4b9831517 is the first bad commit
commit ebf08ae3bc906fc5dd33d02977efa5d4b9831517
Author: Nicolin Chen 
Date:   Mon Feb 12 14:03:10 2018 -0800

ASoC: fsl_ssi: Keep ssi->i2s_net updated



--Mika

Re: regression, imx6 and sgtl5000 sound problems

2018-04-06 Thread Mika Penttilä

On 04/06/2018 10:23 AM, Nicolin Chen wrote:
> On Fri, Apr 06, 2018 at 07:46:37AM +0300, Mika Penttilä wrote:
>>
>> With recent merge to pre 4.17-rc, audio stopped workin (or it's hearable but 
>> way too slow).
>> imx6q + sgtl5000 codec.
> 
> Could you please be more specific at your test cases?
> 
> Which board? Whose is the DAI master? Which sample rate?
> 
>> Maybe some of the soc/fsl changes is causing this.
> 
> There are quite a few clean-up patches of SSI driver being merged.
> Would you please try to revert/bisect the changes of fsl_ssi driver
> so as to figure out which one breaks your test cases?
> 
> If there is a regression because of one of the changes, I will need
> to fix it.
> 
> Thanks
> Nicolin
> 

Hi,

We have a custom board (very near to Karo tx evkit). The test is simply aplay 
file.wav, at least sample rate 48kHz tested
and not working. Nothing special there hw wise.

I try to bisect it and report back to you.

--Mika

Re: regression, imx6 and sgtl5000 sound problems

2018-04-06 Thread Mika Penttilä

On 04/06/2018 10:23 AM, Nicolin Chen wrote:
> On Fri, Apr 06, 2018 at 07:46:37AM +0300, Mika Penttilä wrote:
>>
>> With recent merge to pre 4.17-rc, audio stopped workin (or it's hearable but 
>> way too slow).
>> imx6q + sgtl5000 codec.
> 
> Could you please be more specific at your test cases?
> 
> Which board? Whose is the DAI master? Which sample rate?
> 
>> Maybe some of the soc/fsl changes is causing this.
> 
> There are quite a few clean-up patches of SSI driver being merged.
> Would you please try to revert/bisect the changes of fsl_ssi driver
> so as to figure out which one breaks your test cases?
> 
> If there is a regression because of one of the changes, I will need
> to fix it.
> 
> Thanks
> Nicolin
> 

Hi,

We have a custom board (very near to Karo tx evkit). The test is simply aplay 
file.wav, at least sample rate 48kHz tested
and not working. Nothing special there hw wise.

I try to bisect it and report back to you.

--Mika

regression, imx6 and sgtl5000 sound problems

2018-04-05 Thread Mika Penttilä

Hi,

With recent merge to pre 4.17-rc, audio stopped workin (or it's hearable but 
way too slow).
imx6q + sgtl5000 codec.

Maybe some of the soc/fsl changes is causing this.

--Mika

regression, imx6 and sgtl5000 sound problems

2018-04-05 Thread Mika Penttilä

Hi,

With recent merge to pre 4.17-rc, audio stopped workin (or it's hearable but 
way too slow).
imx6q + sgtl5000 codec.

Maybe some of the soc/fsl changes is causing this.

--Mika

[REGRESSION][BISECTED] i.MX6 pinctrl hogs stopped working

2018-04-03 Thread Mika Penttilä

Hi!

Reverting this made the hogs on a i.MX6 board work again. : 


commit b89405b6102fcc3746f43697b826028caa94c823
Author: Richard Fitzgerald 
Date:   Wed Feb 28 15:53:06 2018 +

pinctrl: devicetree: Fix dt_to_map_one_config handling of hogs



--Mika

[REGRESSION][BISECTED] i.MX6 pinctrl hogs stopped working

2018-04-03 Thread Mika Penttilä

Hi!

Reverting this made the hogs on a i.MX6 board work again. : 


commit b89405b6102fcc3746f43697b826028caa94c823
Author: Richard Fitzgerald 
Date:   Wed Feb 28 15:53:06 2018 +

pinctrl: devicetree: Fix dt_to_map_one_config handling of hogs



--Mika

Re: [regression, bisected] 4.11+ imx uarts broken

2017-05-09 Thread Mika Penttilä

On 05/09/2017 10:14 AM, Uwe Kleine-König wrote:
> Hello Mika,
> 
> On Tue, May 09, 2017 at 07:18:09AM +0300, Mika Penttilä wrote:
>> The following commit e61c38d85b7 makes the uarts on i.MX6 nonfunctional (no 
>> data transmitted or received). 
>> With e61c38d85b7 reverted the uarts work ok.
>>
>> ---
>> commit e61c38d85b7392e033ee03bca46f1d6006156175
>> Author: Uwe Kleine-König <u.kleine-koe...@pengutronix.de>
>> Date:   Tue Apr 4 11:18:51 2017 +0200
>>
>> serial: imx: setup DCEDTE early and ensure DCD and RI irqs to be off
>>  
>> 
> 
> are you operating the UART in DTE or DCE mode? Does this affect all
> UARTs or only those that are not used in the bootloader?

I am operating in DCE mode. The debug/console uart works ok, but two others 
don't.

> 
> Looking at the patch I wonder if setting IMX21_UCR3_RXDMUXSEL |
> UCR3_ADNIMP is missing for you.
> 

Probably yes, but I can verify this later and get back to you.

> Can you please check which hunk of e61c38d85b73 is giving you problems?
> 
> Best regards
> Uwe
> 

--Mika

Re: [regression, bisected] 4.11+ imx uarts broken

2017-05-09 Thread Mika Penttilä

On 05/09/2017 10:14 AM, Uwe Kleine-König wrote:
> Hello Mika,
> 
> On Tue, May 09, 2017 at 07:18:09AM +0300, Mika Penttilä wrote:
>> The following commit e61c38d85b7 makes the uarts on i.MX6 nonfunctional (no 
>> data transmitted or received). 
>> With e61c38d85b7 reverted the uarts work ok.
>>
>> ---
>> commit e61c38d85b7392e033ee03bca46f1d6006156175
>> Author: Uwe Kleine-König 
>> Date:   Tue Apr 4 11:18:51 2017 +0200
>>
>> serial: imx: setup DCEDTE early and ensure DCD and RI irqs to be off
>>  
>> 
> 
> are you operating the UART in DTE or DCE mode? Does this affect all
> UARTs or only those that are not used in the bootloader?

I am operating in DCE mode. The debug/console uart works ok, but two others 
don't.

> 
> Looking at the patch I wonder if setting IMX21_UCR3_RXDMUXSEL |
> UCR3_ADNIMP is missing for you.
> 

Probably yes, but I can verify this later and get back to you.

> Can you please check which hunk of e61c38d85b73 is giving you problems?
> 
> Best regards
> Uwe
> 

--Mika

[regression, bisected] 4.11+ imx uarts broken

2017-05-08 Thread Mika Penttilä

Hi,

The following commit e61c38d85b7 makes the uarts on i.MX6 nonfunctional (no 
data transmitted or received). 
With e61c38d85b7 reverted the uarts work ok.

---
commit e61c38d85b7392e033ee03bca46f1d6006156175
Author: Uwe Kleine-König 
Date:   Tue Apr 4 11:18:51 2017 +0200

serial: imx: setup DCEDTE early and ensure DCD and RI irqs to be off
 



--Mika

[regression, bisected] 4.11+ imx uarts broken

2017-05-08 Thread Mika Penttilä

Hi,

The following commit e61c38d85b7 makes the uarts on i.MX6 nonfunctional (no 
data transmitted or received). 
With e61c38d85b7 reverted the uarts work ok.

---
commit e61c38d85b7392e033ee03bca46f1d6006156175
Author: Uwe Kleine-König 
Date:   Tue Apr 4 11:18:51 2017 +0200

serial: imx: setup DCEDTE early and ensure DCD and RI irqs to be off
 



--Mika

nvdimm/pmem device lifetime

2017-04-27 Thread Mika Penttilä

Hi,

Just wondering the pmem struct device vs gendisk lifetimes.. from 
pmem_attach_disk():

device_add_disk(dev, disk);
devm_add_action_or_reset(dev, pmem_release_disk, disk);


where:
static void pmem_release_disk(void *disk)
{
del_gendisk(disk);
put_disk(disk);
}


but device_add_disk() makes disk pin dev (as a parent), and it's unpinned by 
del_gendisk()
which is called when dev is released, but it's not because of this circular 
dependency?

--Mika

nvdimm/pmem device lifetime

2017-04-27 Thread Mika Penttilä

Hi,

Just wondering the pmem struct device vs gendisk lifetimes.. from 
pmem_attach_disk():

device_add_disk(dev, disk);
devm_add_action_or_reset(dev, pmem_release_disk, disk);


where:
static void pmem_release_disk(void *disk)
{
del_gendisk(disk);
put_disk(disk);
}


but device_add_disk() makes disk pin dev (as a parent), and it's unpinned by 
del_gendisk()
which is called when dev is released, but it's not because of this circular 
dependency?

--Mika

Re: [PATCH 2/5] zram: partial IO refactoring

2017-04-03 Thread Mika Penttilä

On 04/03/2017 09:12 AM, Minchan Kim wrote:
> Hi Mika,
> 
> On Mon, Apr 03, 2017 at 08:52:33AM +0300, Mika Penttilä wrote:
>>
>> Hi!
>>
>> On 04/03/2017 08:17 AM, Minchan Kim wrote:
>>> For architecture(PAGE_SIZE > 4K), zram have supported partial IO.
>>> However, the mixed code for handling normal/partial IO is too mess,
>>> error-prone to modify IO handler functions with upcoming feature
>>> so this patch aims for cleaning up zram's IO handling functions.
>>>
>>> Signed-off-by: Minchan Kim <minc...@kernel.org>
>>> ---
>>>  drivers/block/zram/zram_drv.c | 333 
>>> +++---
>>>  1 file changed, 184 insertions(+), 149 deletions(-)
>>>
>>> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
>>> index 28c2836f8c96..7938f4b98b01 100644
>>> --- a/drivers/block/zram/zram_drv.c
>>> +++ b/drivers/block/zram/zram_drv.c
>>> @@ -45,6 +45,8 @@ static const char *default_compressor = "lzo";
>>>  /* Module params (documentation at end) */
>>>  static unsigned int num_devices = 1;
>>>  
>>> +static void zram_free_page(struct zram *zram, size_t index);
>>> +
>>>  static inline bool init_done(struct zram *zram)
>>>  {
>>> return zram->disksize;
>>> @@ -98,10 +100,17 @@ static void zram_set_obj_size(struct zram_meta *meta,
>>> meta->table[index].value = (flags << ZRAM_FLAG_SHIFT) | size;
>>>  }
>>>  
>>> +#if PAGE_SIZE != 4096
>>>  static inline bool is_partial_io(struct bio_vec *bvec)
>>>  {
>>> return bvec->bv_len != PAGE_SIZE;
>>>  }
>>> +#else
>>
>> For page size of 4096 bv_len can still be < 4096 and partial pages should be 
>> supported 
>> (uncompress before write etc). ? 
> 
> zram declares this.
> 
> #define ZRAM_LOGICAL_BLOCK_SIZE (1<<12)
> 
>   blk_queue_physical_block_size(zram->disk->queue, PAGE_SIZE);
>   blk_queue_logical_block_size(zram->disk->queue,
>   ZRAM_LOGICAL_BLOCK_SIZE);
> 
> So, I thought there is no such partial IO in 4096 page architecture.
> Am I missing something? Could you tell the scenario if it happens?

I think you're right. At least swap operates with min 4096 sizes.

> 
> Thanks!
>

Re: [PATCH 2/5] zram: partial IO refactoring

2017-04-03 Thread Mika Penttilä

On 04/03/2017 09:12 AM, Minchan Kim wrote:
> Hi Mika,
> 
> On Mon, Apr 03, 2017 at 08:52:33AM +0300, Mika Penttilä wrote:
>>
>> Hi!
>>
>> On 04/03/2017 08:17 AM, Minchan Kim wrote:
>>> For architecture(PAGE_SIZE > 4K), zram have supported partial IO.
>>> However, the mixed code for handling normal/partial IO is too mess,
>>> error-prone to modify IO handler functions with upcoming feature
>>> so this patch aims for cleaning up zram's IO handling functions.
>>>
>>> Signed-off-by: Minchan Kim 
>>> ---
>>>  drivers/block/zram/zram_drv.c | 333 
>>> +++---
>>>  1 file changed, 184 insertions(+), 149 deletions(-)
>>>
>>> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
>>> index 28c2836f8c96..7938f4b98b01 100644
>>> --- a/drivers/block/zram/zram_drv.c
>>> +++ b/drivers/block/zram/zram_drv.c
>>> @@ -45,6 +45,8 @@ static const char *default_compressor = "lzo";
>>>  /* Module params (documentation at end) */
>>>  static unsigned int num_devices = 1;
>>>  
>>> +static void zram_free_page(struct zram *zram, size_t index);
>>> +
>>>  static inline bool init_done(struct zram *zram)
>>>  {
>>> return zram->disksize;
>>> @@ -98,10 +100,17 @@ static void zram_set_obj_size(struct zram_meta *meta,
>>> meta->table[index].value = (flags << ZRAM_FLAG_SHIFT) | size;
>>>  }
>>>  
>>> +#if PAGE_SIZE != 4096
>>>  static inline bool is_partial_io(struct bio_vec *bvec)
>>>  {
>>> return bvec->bv_len != PAGE_SIZE;
>>>  }
>>> +#else
>>
>> For page size of 4096 bv_len can still be < 4096 and partial pages should be 
>> supported 
>> (uncompress before write etc). ? 
> 
> zram declares this.
> 
> #define ZRAM_LOGICAL_BLOCK_SIZE (1<<12)
> 
>   blk_queue_physical_block_size(zram->disk->queue, PAGE_SIZE);
>   blk_queue_logical_block_size(zram->disk->queue,
>   ZRAM_LOGICAL_BLOCK_SIZE);
> 
> So, I thought there is no such partial IO in 4096 page architecture.
> Am I missing something? Could you tell the scenario if it happens?

I think you're right. At least swap operates with min 4096 sizes.

> 
> Thanks!
>

Re: [PATCH 2/5] zram: partial IO refactoring

2017-04-02 Thread Mika Penttilä


Hi!

On 04/03/2017 08:17 AM, Minchan Kim wrote:
> For architecture(PAGE_SIZE > 4K), zram have supported partial IO.
> However, the mixed code for handling normal/partial IO is too mess,
> error-prone to modify IO handler functions with upcoming feature
> so this patch aims for cleaning up zram's IO handling functions.
> 
> Signed-off-by: Minchan Kim 
> ---
>  drivers/block/zram/zram_drv.c | 333 
> +++---
>  1 file changed, 184 insertions(+), 149 deletions(-)
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 28c2836f8c96..7938f4b98b01 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -45,6 +45,8 @@ static const char *default_compressor = "lzo";
>  /* Module params (documentation at end) */
>  static unsigned int num_devices = 1;
>  
> +static void zram_free_page(struct zram *zram, size_t index);
> +
>  static inline bool init_done(struct zram *zram)
>  {
>   return zram->disksize;
> @@ -98,10 +100,17 @@ static void zram_set_obj_size(struct zram_meta *meta,
>   meta->table[index].value = (flags << ZRAM_FLAG_SHIFT) | size;
>  }
>  
> +#if PAGE_SIZE != 4096
>  static inline bool is_partial_io(struct bio_vec *bvec)
>  {
>   return bvec->bv_len != PAGE_SIZE;
>  }
> +#else

For page size of 4096 bv_len can still be < 4096 and partial pages should be 
supported 
(uncompress before write etc). ? 

> +static inline bool is_partial_io(struct bio_vec *bvec)
> +{
> + return false;
> +}
> +#endif
>  
>  static void zram_revalidate_disk(struct zram *zram)
>  {
> @@ -191,18 +200,6 @@ static bool page_same_filled(void *ptr, unsigned long 
> *element)
>   return true;
>  }
>  
> -static void handle_same_page(struct bio_vec *bvec, unsigned long element)
> -{
> - struct page *page = bvec->bv_page;
> - void *user_mem;
> -
> - user_mem = kmap_atomic(page);
> - zram_fill_page(user_mem + bvec->bv_offset, bvec->bv_len, element);
> - kunmap_atomic(user_mem);
> -
> - flush_dcache_page(page);
> -}
> -
>  static ssize_t initstate_show(struct device *dev,
>   struct device_attribute *attr, char *buf)
>  {
> @@ -418,6 +415,53 @@ static DEVICE_ATTR_RO(io_stat);
>  static DEVICE_ATTR_RO(mm_stat);
>  static DEVICE_ATTR_RO(debug_stat);
>  
> +static bool zram_special_page_read(struct zram *zram, u32 index,
> + struct page *page,
> + unsigned int offset, unsigned int len)
> +{
> + struct zram_meta *meta = zram->meta;
> +
> + bit_spin_lock(ZRAM_ACCESS, >table[index].value);
> + if (unlikely(!meta->table[index].handle) ||
> + zram_test_flag(meta, index, ZRAM_SAME)) {
> + void *mem;
> +
> + bit_spin_unlock(ZRAM_ACCESS, >table[index].value);
> + mem = kmap_atomic(page);
> + zram_fill_page(mem + offset, len, meta->table[index].element);
> + kunmap_atomic(mem);
> + return true;
> + }
> + bit_spin_unlock(ZRAM_ACCESS, >table[index].value);
> +
> + return false;
> +}
> +
> +static bool zram_special_page_write(struct zram *zram, u32 index,
> + struct page *page)
> +{
> + unsigned long element;
> + void *mem = kmap_atomic(page);
> +
> + if (page_same_filled(mem, )) {
> + struct zram_meta *meta = zram->meta;
> +
> + kunmap_atomic(mem);
> + /* Free memory associated with this sector now. */
> + bit_spin_lock(ZRAM_ACCESS, >table[index].value);
> + zram_free_page(zram, index);
> + zram_set_flag(meta, index, ZRAM_SAME);
> + zram_set_element(meta, index, element);
> + bit_spin_unlock(ZRAM_ACCESS, >table[index].value);
> +
> + atomic64_inc(>stats.same_pages);
> + return true;
> + }
> + kunmap_atomic(mem);
> +
> + return false;
> +}
> +
>  static void zram_meta_free(struct zram_meta *meta, u64 disksize)
>  {
>   size_t num_pages = disksize >> PAGE_SHIFT;
> @@ -504,169 +548,104 @@ static void zram_free_page(struct zram *zram, size_t 
> index)
>   zram_set_obj_size(meta, index, 0);
>  }
>  
> -static int zram_decompress_page(struct zram *zram, char *mem, u32 index)
> +static int zram_decompress_page(struct zram *zram, struct page *page, u32 
> index)
>  {
> - int ret = 0;
> - unsigned char *cmem;
> - struct zram_meta *meta = zram->meta;
> + int ret;
>   unsigned long handle;
>   unsigned int size;
> + void *src, *dst;
> + struct zram_meta *meta = zram->meta;
> +
> + if (zram_special_page_read(zram, index, page, 0, PAGE_SIZE))
> + return 0;
>  
>   bit_spin_lock(ZRAM_ACCESS, >table[index].value);
>   handle = meta->table[index].handle;
>   size = zram_get_obj_size(meta, index);
>  
> - if (!handle || zram_test_flag(meta, index, ZRAM_SAME)) {
> -

Re: [PATCH 2/5] zram: partial IO refactoring

2017-04-02 Thread Mika Penttilä


Hi!

On 04/03/2017 08:17 AM, Minchan Kim wrote:
> For architecture(PAGE_SIZE > 4K), zram have supported partial IO.
> However, the mixed code for handling normal/partial IO is too mess,
> error-prone to modify IO handler functions with upcoming feature
> so this patch aims for cleaning up zram's IO handling functions.
> 
> Signed-off-by: Minchan Kim 
> ---
>  drivers/block/zram/zram_drv.c | 333 
> +++---
>  1 file changed, 184 insertions(+), 149 deletions(-)
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 28c2836f8c96..7938f4b98b01 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -45,6 +45,8 @@ static const char *default_compressor = "lzo";
>  /* Module params (documentation at end) */
>  static unsigned int num_devices = 1;
>  
> +static void zram_free_page(struct zram *zram, size_t index);
> +
>  static inline bool init_done(struct zram *zram)
>  {
>   return zram->disksize;
> @@ -98,10 +100,17 @@ static void zram_set_obj_size(struct zram_meta *meta,
>   meta->table[index].value = (flags << ZRAM_FLAG_SHIFT) | size;
>  }
>  
> +#if PAGE_SIZE != 4096
>  static inline bool is_partial_io(struct bio_vec *bvec)
>  {
>   return bvec->bv_len != PAGE_SIZE;
>  }
> +#else

For page size of 4096 bv_len can still be < 4096 and partial pages should be 
supported 
(uncompress before write etc). ? 

> +static inline bool is_partial_io(struct bio_vec *bvec)
> +{
> + return false;
> +}
> +#endif
>  
>  static void zram_revalidate_disk(struct zram *zram)
>  {
> @@ -191,18 +200,6 @@ static bool page_same_filled(void *ptr, unsigned long 
> *element)
>   return true;
>  }
>  
> -static void handle_same_page(struct bio_vec *bvec, unsigned long element)
> -{
> - struct page *page = bvec->bv_page;
> - void *user_mem;
> -
> - user_mem = kmap_atomic(page);
> - zram_fill_page(user_mem + bvec->bv_offset, bvec->bv_len, element);
> - kunmap_atomic(user_mem);
> -
> - flush_dcache_page(page);
> -}
> -
>  static ssize_t initstate_show(struct device *dev,
>   struct device_attribute *attr, char *buf)
>  {
> @@ -418,6 +415,53 @@ static DEVICE_ATTR_RO(io_stat);
>  static DEVICE_ATTR_RO(mm_stat);
>  static DEVICE_ATTR_RO(debug_stat);
>  
> +static bool zram_special_page_read(struct zram *zram, u32 index,
> + struct page *page,
> + unsigned int offset, unsigned int len)
> +{
> + struct zram_meta *meta = zram->meta;
> +
> + bit_spin_lock(ZRAM_ACCESS, >table[index].value);
> + if (unlikely(!meta->table[index].handle) ||
> + zram_test_flag(meta, index, ZRAM_SAME)) {
> + void *mem;
> +
> + bit_spin_unlock(ZRAM_ACCESS, >table[index].value);
> + mem = kmap_atomic(page);
> + zram_fill_page(mem + offset, len, meta->table[index].element);
> + kunmap_atomic(mem);
> + return true;
> + }
> + bit_spin_unlock(ZRAM_ACCESS, >table[index].value);
> +
> + return false;
> +}
> +
> +static bool zram_special_page_write(struct zram *zram, u32 index,
> + struct page *page)
> +{
> + unsigned long element;
> + void *mem = kmap_atomic(page);
> +
> + if (page_same_filled(mem, )) {
> + struct zram_meta *meta = zram->meta;
> +
> + kunmap_atomic(mem);
> + /* Free memory associated with this sector now. */
> + bit_spin_lock(ZRAM_ACCESS, >table[index].value);
> + zram_free_page(zram, index);
> + zram_set_flag(meta, index, ZRAM_SAME);
> + zram_set_element(meta, index, element);
> + bit_spin_unlock(ZRAM_ACCESS, >table[index].value);
> +
> + atomic64_inc(>stats.same_pages);
> + return true;
> + }
> + kunmap_atomic(mem);
> +
> + return false;
> +}
> +
>  static void zram_meta_free(struct zram_meta *meta, u64 disksize)
>  {
>   size_t num_pages = disksize >> PAGE_SHIFT;
> @@ -504,169 +548,104 @@ static void zram_free_page(struct zram *zram, size_t 
> index)
>   zram_set_obj_size(meta, index, 0);
>  }
>  
> -static int zram_decompress_page(struct zram *zram, char *mem, u32 index)
> +static int zram_decompress_page(struct zram *zram, struct page *page, u32 
> index)
>  {
> - int ret = 0;
> - unsigned char *cmem;
> - struct zram_meta *meta = zram->meta;
> + int ret;
>   unsigned long handle;
>   unsigned int size;
> + void *src, *dst;
> + struct zram_meta *meta = zram->meta;
> +
> + if (zram_special_page_read(zram, index, page, 0, PAGE_SIZE))
> + return 0;
>  
>   bit_spin_lock(ZRAM_ACCESS, >table[index].value);
>   handle = meta->table[index].handle;
>   size = zram_get_obj_size(meta, index);
>  
> - if (!handle || zram_test_flag(meta, index, ZRAM_SAME)) {
> -

[PATCH] fix pinctrl setup for i.IMX6

2017-02-27 Thread Mika Penttilä

Hi!

Recent pulls for mainline pre 4.11 introduced pinctrl setup changes and moved 
pinctrl-imx to
use generic helpers. 

Net effect was that hog group could not be resolved. I made it work for myself
with a two stage setup with create and start separated, and dt probe in between 
them.


Signed-off-by: Mika Penttilä <mika.pentt...@nextfour.com>
---

diff --git a/drivers/pinctrl/core.c b/drivers/pinctrl/core.c
index d690465..33659c5a 100644
--- a/drivers/pinctrl/core.c
+++ b/drivers/pinctrl/core.c
@@ -2237,6 +2237,47 @@ int devm_pinctrl_register_and_init(struct device *dev,
 }
 EXPORT_SYMBOL_GPL(devm_pinctrl_register_and_init);
 
+int devm_pinctrl_register_and_init_nostart(struct device *dev,
+   struct pinctrl_desc *pctldesc,
+   void *driver_data,
+   struct pinctrl_dev **pctldev)
+{
+   struct pinctrl_dev **ptr;
+   struct pinctrl_dev *p;
+
+   ptr = devres_alloc(devm_pinctrl_dev_release, sizeof(*ptr), GFP_KERNEL);
+   if (!ptr)
+   return -ENOMEM;
+
+   p = pinctrl_init_controller(pctldesc, dev, driver_data);
+   if (IS_ERR(p)) {
+   devres_free(ptr);
+   return PTR_ERR(p);
+   }
+
+   *ptr = p;
+   devres_add(dev, ptr);
+   *pctldev = p;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(devm_pinctrl_register_and_init_nostart);
+
+int devm_pinctrl_start(struct device *dev,
+   struct pinctrl_dev *pctldev)
+{
+   int error = 0;
+
+   error = pinctrl_create_and_start(pctldev);
+   if (error) {
+   mutex_destroy(>mutex);
+   return error;
+   }
+
+   return error;
+}
+EXPORT_SYMBOL_GPL(devm_pinctrl_start);
+
 /**
  * devm_pinctrl_unregister() - Resource managed version of 
pinctrl_unregister().
  * @dev: device for which which resource was allocated
diff --git a/drivers/pinctrl/freescale/pinctrl-imx.c 
b/drivers/pinctrl/freescale/pinctrl-imx.c
index a7ace9e..3644aed 100644
--- a/drivers/pinctrl/freescale/pinctrl-imx.c
+++ b/drivers/pinctrl/freescale/pinctrl-imx.c
@@ -686,16 +686,6 @@ static int imx_pinctrl_probe_dt(struct platform_device 
*pdev,
return 0;
 }
 
-/*
- * imx_free_resources() - free memory used by this driver
- * @info: info driver instance
- */
-static void imx_free_resources(struct imx_pinctrl *ipctl)
-{
-   if (ipctl->pctl)
-   pinctrl_unregister(ipctl->pctl);
-}
-
 int imx_pinctrl_probe(struct platform_device *pdev,
  struct imx_pinctrl_soc_info *info)
 {
@@ -774,26 +764,30 @@ int imx_pinctrl_probe(struct platform_device *pdev,
ipctl->info = info;
ipctl->dev = info->dev;
platform_set_drvdata(pdev, ipctl);
-   ret = devm_pinctrl_register_and_init(>dev,
-imx_pinctrl_desc, ipctl,
->pctl);
+
+   ret = devm_pinctrl_register_and_init_nostart(>dev,
+   imx_pinctrl_desc, ipctl,
+   >pctl);
+
if (ret) {
dev_err(>dev, "could not register IMX pinctrl driver\n");
-   goto free;
+   return ret;
}
 
ret = imx_pinctrl_probe_dt(pdev, ipctl);
if (ret) {
dev_err(>dev, "fail to probe dt properties\n");
-   goto free;
+   return ret;
+   }
+
+   ret = devm_pinctrl_start(>dev, ipctl->pctl);
+   if (ret) {
+   dev_err(>dev, "could not start IMX pinctrl driver\n");
+   return ret;
}
 
dev_info(>dev, "initialized IMX pinctrl driver\n");
 
return 0;
 
-free:
-   imx_free_resources(ipctl);
-
-   return ret;
 }
diff --git a/include/linux/pinctrl/pinctrl.h b/include/linux/pinctrl/pinctrl.h
index 8ce2d87..e7020f0 100644
--- a/include/linux/pinctrl/pinctrl.h
+++ b/include/linux/pinctrl/pinctrl.h
@@ -153,9 +153,17 @@ extern struct pinctrl_dev *pinctrl_register(struct 
pinctrl_desc *pctldesc,
 extern void pinctrl_unregister(struct pinctrl_dev *pctldev);
 
 extern int devm_pinctrl_register_and_init(struct device *dev,
-   struct pinctrl_desc *pctldesc,
-   void *driver_data,
-   struct pinctrl_dev **pctldev);
+   struct pinctrl_desc *pctldesc,
+   void *driver_data,
+   struct pinctrl_dev **pctldev);
+
+extern int devm_pinctrl_register_and_init_nostart(struct device *dev,
+   struct pinctrl_desc *pctldesc,
+   void *driver_data,
+   struct pinctrl_dev *

[PATCH] fix pinctrl setup for i.IMX6

2017-02-27 Thread Mika Penttilä

Hi!

Recent pulls for mainline pre 4.11 introduced pinctrl setup changes and moved 
pinctrl-imx to
use generic helpers. 

Net effect was that hog group could not be resolved. I made it work for myself
with a two stage setup with create and start separated, and dt probe in between 
them.


Signed-off-by: Mika Penttilä 
---

diff --git a/drivers/pinctrl/core.c b/drivers/pinctrl/core.c
index d690465..33659c5a 100644
--- a/drivers/pinctrl/core.c
+++ b/drivers/pinctrl/core.c
@@ -2237,6 +2237,47 @@ int devm_pinctrl_register_and_init(struct device *dev,
 }
 EXPORT_SYMBOL_GPL(devm_pinctrl_register_and_init);
 
+int devm_pinctrl_register_and_init_nostart(struct device *dev,
+   struct pinctrl_desc *pctldesc,
+   void *driver_data,
+   struct pinctrl_dev **pctldev)
+{
+   struct pinctrl_dev **ptr;
+   struct pinctrl_dev *p;
+
+   ptr = devres_alloc(devm_pinctrl_dev_release, sizeof(*ptr), GFP_KERNEL);
+   if (!ptr)
+   return -ENOMEM;
+
+   p = pinctrl_init_controller(pctldesc, dev, driver_data);
+   if (IS_ERR(p)) {
+   devres_free(ptr);
+   return PTR_ERR(p);
+   }
+
+   *ptr = p;
+   devres_add(dev, ptr);
+   *pctldev = p;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(devm_pinctrl_register_and_init_nostart);
+
+int devm_pinctrl_start(struct device *dev,
+   struct pinctrl_dev *pctldev)
+{
+   int error = 0;
+
+   error = pinctrl_create_and_start(pctldev);
+   if (error) {
+   mutex_destroy(>mutex);
+   return error;
+   }
+
+   return error;
+}
+EXPORT_SYMBOL_GPL(devm_pinctrl_start);
+
 /**
  * devm_pinctrl_unregister() - Resource managed version of 
pinctrl_unregister().
  * @dev: device for which which resource was allocated
diff --git a/drivers/pinctrl/freescale/pinctrl-imx.c 
b/drivers/pinctrl/freescale/pinctrl-imx.c
index a7ace9e..3644aed 100644
--- a/drivers/pinctrl/freescale/pinctrl-imx.c
+++ b/drivers/pinctrl/freescale/pinctrl-imx.c
@@ -686,16 +686,6 @@ static int imx_pinctrl_probe_dt(struct platform_device 
*pdev,
return 0;
 }
 
-/*
- * imx_free_resources() - free memory used by this driver
- * @info: info driver instance
- */
-static void imx_free_resources(struct imx_pinctrl *ipctl)
-{
-   if (ipctl->pctl)
-   pinctrl_unregister(ipctl->pctl);
-}
-
 int imx_pinctrl_probe(struct platform_device *pdev,
  struct imx_pinctrl_soc_info *info)
 {
@@ -774,26 +764,30 @@ int imx_pinctrl_probe(struct platform_device *pdev,
ipctl->info = info;
ipctl->dev = info->dev;
platform_set_drvdata(pdev, ipctl);
-   ret = devm_pinctrl_register_and_init(>dev,
-imx_pinctrl_desc, ipctl,
->pctl);
+
+   ret = devm_pinctrl_register_and_init_nostart(>dev,
+   imx_pinctrl_desc, ipctl,
+   >pctl);
+
if (ret) {
dev_err(>dev, "could not register IMX pinctrl driver\n");
-   goto free;
+   return ret;
}
 
ret = imx_pinctrl_probe_dt(pdev, ipctl);
if (ret) {
dev_err(>dev, "fail to probe dt properties\n");
-   goto free;
+   return ret;
+   }
+
+   ret = devm_pinctrl_start(>dev, ipctl->pctl);
+   if (ret) {
+   dev_err(>dev, "could not start IMX pinctrl driver\n");
+   return ret;
}
 
dev_info(>dev, "initialized IMX pinctrl driver\n");
 
return 0;
 
-free:
-   imx_free_resources(ipctl);
-
-   return ret;
 }
diff --git a/include/linux/pinctrl/pinctrl.h b/include/linux/pinctrl/pinctrl.h
index 8ce2d87..e7020f0 100644
--- a/include/linux/pinctrl/pinctrl.h
+++ b/include/linux/pinctrl/pinctrl.h
@@ -153,9 +153,17 @@ extern struct pinctrl_dev *pinctrl_register(struct 
pinctrl_desc *pctldesc,
 extern void pinctrl_unregister(struct pinctrl_dev *pctldev);
 
 extern int devm_pinctrl_register_and_init(struct device *dev,
-   struct pinctrl_desc *pctldesc,
-   void *driver_data,
-   struct pinctrl_dev **pctldev);
+   struct pinctrl_desc *pctldesc,
+   void *driver_data,
+   struct pinctrl_dev **pctldev);
+
+extern int devm_pinctrl_register_and_init_nostart(struct device *dev,
+   struct pinctrl_desc *pctldesc,
+   void *driver_data,
+   struct pinctrl_dev **pctldev);
+
+extern int de

[REGRESSION] pinctrl, of, unable to find hogs

2017-02-26 Thread Mika Penttilä


With current linus git (pre 4.11), unable to find the pinctrl hogs :


 imx6q-pinctrl 20e.iomuxc: unable to find group for node hoggrp


Device is i.MX6 based.


--Mika

[REGRESSION] pinctrl, of, unable to find hogs

2017-02-26 Thread Mika Penttilä


With current linus git (pre 4.11), unable to find the pinctrl hogs :


 imx6q-pinctrl 20e.iomuxc: unable to find group for node hoggrp


Device is i.MX6 based.


--Mika

Re: [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held

2017-02-13 Thread Mika Penttilä


On 13.02.2017 16:15, Oleg Nesterov wrote:
> + retval = de_thread(current);
> + if (retval)
> + return retval;
>  
>   if (N_MAGIC(ex) == OMAGIC) {
>   unsigned long text_addr, map_size;
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 4223702..79508f7 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -855,13 +855,17 @@ static int load_elf_binary(struct linux_binprm *bprm)
>   setup_new_exec(bprm);
>   install_exec_creds(bprm);
>  
> + retval = de_thread(current);
> + if (retval)
> + goto out_free_dentry;
> +
>   /* Do this so that we can load the interpreter, if need be.  We will
>  change some of these later */
>   retval = setup_arg_pages(bprm, randomize_stack_top(STACK_TOP),
>executable_stack);
>   if (retval < 0)
>   goto out_free_dentry;
> - 
> +
>   current->mm->start_stack = bprm->p;
>  
>   /* Now we do a little grungy work by mmapping the ELF image into
> diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
> index d2e36f8..75fd6d8 100644
> --- a/fs/binfmt_elf_fdpic.c
> +++ b/fs/binfmt_elf_fdpic.c
> @@ -430,6 +430,10 @@ static int load_elf_fdpic_binary(struct linux_binprm 
> *bprm)
>  #endif
>  
>   install_exec_creds(bprm);
> + retval = de_thread(current);
> + if (retval)
> + goto error;
> +
>   if (create_elf_fdpic_tables(bprm, current->mm,
>   _params, _params) < 0)
>   goto error;
> diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
> index 9b2917a..a0ad9a3 100644
> --- a/fs/binfmt_flat.c
> +++ b/fs/binfmt_flat.c
> @@ -953,6 +953,9 @@ static int load_flat_binary(struct linux_binprm *bprm)
>   }
>  
>   install_exec_creds(bprm);
> + res = de_thread(current);
> + if (res)
> + return res;
>  
>   set_binfmt(_format);
>  
> diff --git a/fs/exec.c b/fs/exec.c
> index e579466..8591c56 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1036,13 +1036,62 @@ static int exec_mmap(struct mm_struct *mm)
>   return 0;
>  }
>  
> +static int wait_for_notify_count(struct task_struct *tsk, struct 
> signal_struct *sig)
> +{
> + for (;;) {
> + if (unlikely(__fatal_signal_pending(tsk)))
> + goto killed;
> + set_current_state(TASK_KILLABLE);
> + if (!sig->notify_count)
> + break;
> + schedule();
> + }
> + __set_current_state(TASK_RUNNING);
> + return 0;
> +
> +killed:
> + /* protects against exit_notify() and __exit_signal() */
> + read_lock(_lock);
> + sig->group_exit_task = NULL;
> + sig->notify_count = 0;
> + read_unlock(_lock);
> + return -EINTR;
> +}
> +
> +/*
> + * Kill all the sub-threads and wait until they all pass exit_notify().
> + */
> +static int kill_sub_threads(struct task_struct *tsk)
> +{
> + struct signal_struct *sig = tsk->signal;
> + int err = -EINTR;
> +
> + if (thread_group_empty(tsk))
> + return 0;
> +
> + read_lock(_lock);
> + spin_lock_irq(>sighand->siglock);
> + if (!signal_group_exit(sig)) {
> + sig->group_exit_task = tsk;
> + sig->notify_count = -zap_other_threads(tsk);
> + err = 0;
> + }
> + spin_unlock_irq(>sighand->siglock);
> + read_unlock(_lock);
> +
> + if (!err)
> + err = wait_for_notify_count(tsk, sig);
> + return err;
> +
> +}
> +
>  /*
> - * This function makes sure the current process has its own signal table,
> - * so that flush_signal_handlers can later reset the handlers without
> - * disturbing other processes.  (Other processes might share the signal
> - * table via the CLONE_SIGHAND option to clone().)
> + * This function makes sure the current process has no other threads and
> + * has a private signal table so that flush_signal_handlers() can reset
> + * the handlers without disturbing other processes which might share the
> + * signal table via the CLONE_SIGHAND option to clone().
>   */
> -static int de_thread(struct task_struct *tsk)
> +int de_thread(struct task_struct *tsk)
>  {
>   struct signal_struct *sig = tsk->signal;
>   struct sighand_struct *oldsighand = tsk->sighand;
> @@ -1051,60 +1100,24 @@ static int de_thread(struct task_struct *tsk)
>   if (thread_group_empty(tsk))
>   goto no_thread_group;
>  
> - /*
> -  * Kill all other threads in the thread group.
> -  */
>   spin_lock_irq(lock);
> - if (signal_group_exit(sig)) {
> - /*
> -  * Another group action in progress, just
> -  * return so that the signal is processed.
> -  */
> - spin_unlock_irq(lock);
> - return -EAGAIN;
> - }
> -
> - sig->group_exit_task = tsk;
> - sig->notify_count = zap_other_threads(tsk);
> + sig->notify_count = sig->nr_threads;


maybe nr_threads - 1 since

Re: [PATCH 1/2] exec: don't wait for zombie threads with cred_guard_mutex held

2017-02-13 Thread Mika Penttilä


On 13.02.2017 16:15, Oleg Nesterov wrote:
> + retval = de_thread(current);
> + if (retval)
> + return retval;
>  
>   if (N_MAGIC(ex) == OMAGIC) {
>   unsigned long text_addr, map_size;
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 4223702..79508f7 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -855,13 +855,17 @@ static int load_elf_binary(struct linux_binprm *bprm)
>   setup_new_exec(bprm);
>   install_exec_creds(bprm);
>  
> + retval = de_thread(current);
> + if (retval)
> + goto out_free_dentry;
> +
>   /* Do this so that we can load the interpreter, if need be.  We will
>  change some of these later */
>   retval = setup_arg_pages(bprm, randomize_stack_top(STACK_TOP),
>executable_stack);
>   if (retval < 0)
>   goto out_free_dentry;
> - 
> +
>   current->mm->start_stack = bprm->p;
>  
>   /* Now we do a little grungy work by mmapping the ELF image into
> diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
> index d2e36f8..75fd6d8 100644
> --- a/fs/binfmt_elf_fdpic.c
> +++ b/fs/binfmt_elf_fdpic.c
> @@ -430,6 +430,10 @@ static int load_elf_fdpic_binary(struct linux_binprm 
> *bprm)
>  #endif
>  
>   install_exec_creds(bprm);
> + retval = de_thread(current);
> + if (retval)
> + goto error;
> +
>   if (create_elf_fdpic_tables(bprm, current->mm,
>   _params, _params) < 0)
>   goto error;
> diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
> index 9b2917a..a0ad9a3 100644
> --- a/fs/binfmt_flat.c
> +++ b/fs/binfmt_flat.c
> @@ -953,6 +953,9 @@ static int load_flat_binary(struct linux_binprm *bprm)
>   }
>  
>   install_exec_creds(bprm);
> + res = de_thread(current);
> + if (res)
> + return res;
>  
>   set_binfmt(_format);
>  
> diff --git a/fs/exec.c b/fs/exec.c
> index e579466..8591c56 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1036,13 +1036,62 @@ static int exec_mmap(struct mm_struct *mm)
>   return 0;
>  }
>  
> +static int wait_for_notify_count(struct task_struct *tsk, struct 
> signal_struct *sig)
> +{
> + for (;;) {
> + if (unlikely(__fatal_signal_pending(tsk)))
> + goto killed;
> + set_current_state(TASK_KILLABLE);
> + if (!sig->notify_count)
> + break;
> + schedule();
> + }
> + __set_current_state(TASK_RUNNING);
> + return 0;
> +
> +killed:
> + /* protects against exit_notify() and __exit_signal() */
> + read_lock(_lock);
> + sig->group_exit_task = NULL;
> + sig->notify_count = 0;
> + read_unlock(_lock);
> + return -EINTR;
> +}
> +
> +/*
> + * Kill all the sub-threads and wait until they all pass exit_notify().
> + */
> +static int kill_sub_threads(struct task_struct *tsk)
> +{
> + struct signal_struct *sig = tsk->signal;
> + int err = -EINTR;
> +
> + if (thread_group_empty(tsk))
> + return 0;
> +
> + read_lock(_lock);
> + spin_lock_irq(>sighand->siglock);
> + if (!signal_group_exit(sig)) {
> + sig->group_exit_task = tsk;
> + sig->notify_count = -zap_other_threads(tsk);
> + err = 0;
> + }
> + spin_unlock_irq(>sighand->siglock);
> + read_unlock(_lock);
> +
> + if (!err)
> + err = wait_for_notify_count(tsk, sig);
> + return err;
> +
> +}
> +
>  /*
> - * This function makes sure the current process has its own signal table,
> - * so that flush_signal_handlers can later reset the handlers without
> - * disturbing other processes.  (Other processes might share the signal
> - * table via the CLONE_SIGHAND option to clone().)
> + * This function makes sure the current process has no other threads and
> + * has a private signal table so that flush_signal_handlers() can reset
> + * the handlers without disturbing other processes which might share the
> + * signal table via the CLONE_SIGHAND option to clone().
>   */
> -static int de_thread(struct task_struct *tsk)
> +int de_thread(struct task_struct *tsk)
>  {
>   struct signal_struct *sig = tsk->signal;
>   struct sighand_struct *oldsighand = tsk->sighand;
> @@ -1051,60 +1100,24 @@ static int de_thread(struct task_struct *tsk)
>   if (thread_group_empty(tsk))
>   goto no_thread_group;
>  
> - /*
> -  * Kill all other threads in the thread group.
> -  */
>   spin_lock_irq(lock);
> - if (signal_group_exit(sig)) {
> - /*
> -  * Another group action in progress, just
> -  * return so that the signal is processed.
> -  */
> - spin_unlock_irq(lock);
> - return -EAGAIN;
> - }
> -
> - sig->group_exit_task = tsk;
> - sig->notify_count = zap_other_threads(tsk);
> + sig->notify_count = sig->nr_threads;


maybe nr_threads - 1 since

Re: [PATCH 2/2] efi: efi_mem_reserve(): don't reserve through memblock after mm_init()

2016-12-21 Thread Mika Penttilä



On 21.12.2016 20:28, Nicolai Stange wrote:
> Before invoking the arch specific handler, efi_mem_reserve() reserves
> the given memory region through memblock.
>
> efi_mem_reserve() can get called after mm_init() though -- through
> efi_bgrt_init(), for example. After mm_init(), memblock is dead and should
> not be used anymore.
>
> Let efi_mem_reserve() check whether memblock is dead and not do the
> reservation if so. Emit a warning from the generic efi_arch mem_reserve()
> in this case: if the architecture doesn't provide any other means of
> registering the region as reserved, the operation would be a nop.
>
> Fixes: 4bc9f92e64c8 ("x86/efi-bgrt: Use efi_mem_reserve() to avoid copying 
> image data")
> Signed-off-by: Nicolai Stange 
> ---
>  drivers/firmware/efi/efi.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> index 92914801e388..12b2e3a6d73f 100644
> --- a/drivers/firmware/efi/efi.c
> +++ b/drivers/firmware/efi/efi.c
> @@ -403,7 +403,10 @@ u64 __init efi_mem_desc_end(efi_memory_desc_t *md)
>   return end;
>  }
>  
> -void __init __weak efi_arch_mem_reserve(phys_addr_t addr, u64 size) {}
> +void __init __weak efi_arch_mem_reserve(phys_addr_t addr, u64 size)
> +{
> + WARN(slab_is_available(), "efi_mem_reserve() has no effect");
> +}
>  
>  /**
>   * efi_mem_reserve - Reserve an EFI memory region
> @@ -419,7 +422,7 @@ void __init __weak efi_arch_mem_reserve(phys_addr_t addr, 
> u64 size) {}
>   */
>  void __init efi_mem_reserve(phys_addr_t addr, u64 size)
>  {
> - if (!memblock_is_region_reserved(addr, size))
> + if (slab_is_available() && !memblock_is_region_reserved(addr, size))
>   memblock_reserve(addr, size);
Maybe !slab_is_available() ?

>  
--Mika

Re: [PATCH 2/2] efi: efi_mem_reserve(): don't reserve through memblock after mm_init()

2016-12-21 Thread Mika Penttilä



On 21.12.2016 20:28, Nicolai Stange wrote:
> Before invoking the arch specific handler, efi_mem_reserve() reserves
> the given memory region through memblock.
>
> efi_mem_reserve() can get called after mm_init() though -- through
> efi_bgrt_init(), for example. After mm_init(), memblock is dead and should
> not be used anymore.
>
> Let efi_mem_reserve() check whether memblock is dead and not do the
> reservation if so. Emit a warning from the generic efi_arch mem_reserve()
> in this case: if the architecture doesn't provide any other means of
> registering the region as reserved, the operation would be a nop.
>
> Fixes: 4bc9f92e64c8 ("x86/efi-bgrt: Use efi_mem_reserve() to avoid copying 
> image data")
> Signed-off-by: Nicolai Stange 
> ---
>  drivers/firmware/efi/efi.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> index 92914801e388..12b2e3a6d73f 100644
> --- a/drivers/firmware/efi/efi.c
> +++ b/drivers/firmware/efi/efi.c
> @@ -403,7 +403,10 @@ u64 __init efi_mem_desc_end(efi_memory_desc_t *md)
>   return end;
>  }
>  
> -void __init __weak efi_arch_mem_reserve(phys_addr_t addr, u64 size) {}
> +void __init __weak efi_arch_mem_reserve(phys_addr_t addr, u64 size)
> +{
> + WARN(slab_is_available(), "efi_mem_reserve() has no effect");
> +}
>  
>  /**
>   * efi_mem_reserve - Reserve an EFI memory region
> @@ -419,7 +422,7 @@ void __init __weak efi_arch_mem_reserve(phys_addr_t addr, 
> u64 size) {}
>   */
>  void __init efi_mem_reserve(phys_addr_t addr, u64 size)
>  {
> - if (!memblock_is_region_reserved(addr, size))
> + if (slab_is_available() && !memblock_is_region_reserved(addr, size))
>   memblock_reserve(addr, size);
Maybe !slab_is_available() ?

>  
--Mika

Re: [PATCH] bdi flusher should not be throttled here when it fall into buddy slow path

2016-10-20 Thread Mika Penttilä



On 20.10.2016 15:38, zhouxianr...@huawei.com wrote:
> From: z00281421 
>
> The bdi flusher should be throttled only depends on 
> own bdi and is decoupled with others.
>
> separate PGDAT_WRITEBACK into PGDAT_ANON_WRITEBACK and
> PGDAT_FILE_WRITEBACK avoid scanning anon lru and it is ok 
> then throttled on file WRITEBACK.
>
> i think above may be not right.
>
> Signed-off-by: z00281421 
> ---
>  fs/fs-writeback.c  |8 ++--
>  include/linux/mmzone.h |7 +--
>  mm/vmscan.c|   20 
>  3 files changed, 23 insertions(+), 12 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 05713a5..ddcc70f 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1905,10 +1905,13 @@ void wb_workfn(struct work_struct *work)
>  {
>   struct bdi_writeback *wb = container_of(to_delayed_work(work),
>   struct bdi_writeback, dwork);
> + struct backing_dev_info *bdi = container_of(to_delayed_work(work),
> + struct backing_dev_info, 
> wb.dwork);
>   long pages_written;
>  
>   set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
> - current->flags |= PF_SWAPWRITE;
> + current->flags |= (PF_SWAPWRITE | PF_LESS_THROTTLE);
> + current->bdi = bdi;
>  
>   if (likely(!current_is_workqueue_rescuer() ||
>  !test_bit(WB_registered, >state))) {
> @@ -1938,7 +1941,8 @@ void wb_workfn(struct work_struct *work)
>   else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
>   wb_wakeup_delayed(wb);
>  
> - current->flags &= ~PF_SWAPWRITE;
> + current->bdi = NULL;
> + current->flags &= ~(PF_SWAPWRITE | PF_LESS_THROTTLE);
>  }
>  
>  /*
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 7f2ae99..fa602e9 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -528,8 +528,11 @@ enum pgdat_flags {
>* many dirty file pages at the tail
>* of the LRU.
>*/
> - PGDAT_WRITEBACK,/* reclaim scanning has recently found
> -  * many pages under writeback
> + PGDAT_ANON_WRITEBACK,   /* reclaim scanning has recently found
> +  * many anonymous pages under writeback
> +  */
> + PGDAT_FILE_WRITEBACK,   /* reclaim scanning has recently found
> +  * many file pages under writeback
>*/
>   PGDAT_RECLAIM_LOCKED,   /* prevents concurrent reclaim */

Nobody seems to be clearing those bits (same was with PGDAT_WRITEBACK) ?


--Mika

Re: [PATCH] bdi flusher should not be throttled here when it fall into buddy slow path

2016-10-20 Thread Mika Penttilä



On 20.10.2016 15:38, zhouxianr...@huawei.com wrote:
> From: z00281421 
>
> The bdi flusher should be throttled only depends on 
> own bdi and is decoupled with others.
>
> separate PGDAT_WRITEBACK into PGDAT_ANON_WRITEBACK and
> PGDAT_FILE_WRITEBACK avoid scanning anon lru and it is ok 
> then throttled on file WRITEBACK.
>
> i think above may be not right.
>
> Signed-off-by: z00281421 
> ---
>  fs/fs-writeback.c  |8 ++--
>  include/linux/mmzone.h |7 +--
>  mm/vmscan.c|   20 
>  3 files changed, 23 insertions(+), 12 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 05713a5..ddcc70f 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1905,10 +1905,13 @@ void wb_workfn(struct work_struct *work)
>  {
>   struct bdi_writeback *wb = container_of(to_delayed_work(work),
>   struct bdi_writeback, dwork);
> + struct backing_dev_info *bdi = container_of(to_delayed_work(work),
> + struct backing_dev_info, 
> wb.dwork);
>   long pages_written;
>  
>   set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
> - current->flags |= PF_SWAPWRITE;
> + current->flags |= (PF_SWAPWRITE | PF_LESS_THROTTLE);
> + current->bdi = bdi;
>  
>   if (likely(!current_is_workqueue_rescuer() ||
>  !test_bit(WB_registered, >state))) {
> @@ -1938,7 +1941,8 @@ void wb_workfn(struct work_struct *work)
>   else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
>   wb_wakeup_delayed(wb);
>  
> - current->flags &= ~PF_SWAPWRITE;
> + current->bdi = NULL;
> + current->flags &= ~(PF_SWAPWRITE | PF_LESS_THROTTLE);
>  }
>  
>  /*
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 7f2ae99..fa602e9 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -528,8 +528,11 @@ enum pgdat_flags {
>* many dirty file pages at the tail
>* of the LRU.
>*/
> - PGDAT_WRITEBACK,/* reclaim scanning has recently found
> -  * many pages under writeback
> + PGDAT_ANON_WRITEBACK,   /* reclaim scanning has recently found
> +  * many anonymous pages under writeback
> +  */
> + PGDAT_FILE_WRITEBACK,   /* reclaim scanning has recently found
> +  * many file pages under writeback
>*/
>   PGDAT_RECLAIM_LOCKED,   /* prevents concurrent reclaim */

Nobody seems to be clearing those bits (same was with PGDAT_WRITEBACK) ?


--Mika

Re: EXT: pre 4.9-rc dma regression, imx6 sound etc

2016-10-13 Thread Mika Penttilä

On 10/13/2016 08:25 AM, Han, Nandor (GE Healthcare) wrote:
> 
> 
> On 12/10/2016 16:25, Mika Penttilä wrote:
>> Hi!
>>
>> Bisected that commit 5881826 - "dmaengine: imx-sdma - update the residue 
>> calculation for cyclic channels" is first bad commit to cause audio 
>> regression on imx6q, sgtl5000 codec. It causes audible disturbing background 
>> noise when playing wavs.
>>
>> Unfortunately, reverting only 5881826 causes uarts to fail, so another fix 
>> is needed.
>>
>> Thanks,
>>
>> Mika
>>
>>
>>
>>
>>
> 
> Hi Mika,
>Thanks for info. I have already posted a patch that will fix this issue.
> 
> https://lkml.org/lkml/2016/10/11/319
> 
> It was already tested on imx6 and imx53. Let me know how it works.
> 

Hi,

Great, with this patch audio is ok again! This patch should go in for 4.9!

Thanks,
Mika

Re: EXT: pre 4.9-rc dma regression, imx6 sound etc

2016-10-13 Thread Mika Penttilä

On 10/13/2016 08:25 AM, Han, Nandor (GE Healthcare) wrote:
> 
> 
> On 12/10/2016 16:25, Mika Penttilä wrote:
>> Hi!
>>
>> Bisected that commit 5881826 - "dmaengine: imx-sdma - update the residue 
>> calculation for cyclic channels" is first bad commit to cause audio 
>> regression on imx6q, sgtl5000 codec. It causes audible disturbing background 
>> noise when playing wavs.
>>
>> Unfortunately, reverting only 5881826 causes uarts to fail, so another fix 
>> is needed.
>>
>> Thanks,
>>
>> Mika
>>
>>
>>
>>
>>
> 
> Hi Mika,
>Thanks for info. I have already posted a patch that will fix this issue.
> 
> https://lkml.org/lkml/2016/10/11/319
> 
> It was already tested on imx6 and imx53. Let me know how it works.
> 

Hi,

Great, with this patch audio is ok again! This patch should go in for 4.9!

Thanks,
Mika

pre 4.9-rc dma regression, imx6 sound etc

2016-10-12 Thread Mika Penttilä

Hi!

Bisected that commit 5881826 - "dmaengine: imx-sdma - update the residue 
calculation for cyclic channels" is first bad commit to cause audio regression 
on imx6q, sgtl5000 codec. It causes audible disturbing background noise when 
playing wavs.

Unfortunately, reverting only 5881826 causes uarts to fail, so another fix is 
needed.

Thanks,

Mika

pre 4.9-rc dma regression, imx6 sound etc

2016-10-12 Thread Mika Penttilä

Hi!

Bisected that commit 5881826 - "dmaengine: imx-sdma - update the residue 
calculation for cyclic channels" is first bad commit to cause audio regression 
on imx6q, sgtl5000 codec. It causes audible disturbing background noise when 
playing wavs.

Unfortunately, reverting only 5881826 causes uarts to fail, so another fix is 
needed.

Thanks,

Mika

Re: [PATCH] KVM: VMX: Enable MSR-BASED TPR shadow even if w/o APICv

2016-09-15 Thread Mika Penttilä

On 09/15/2016 07:25 AM, Wanpeng Li wrote:
> 2016-09-15 12:08 GMT+08:00 Mika Penttilä <mika.pentt...@nextfour.com>:
>> On 09/14/2016 10:58 AM, Wanpeng Li wrote:
>>> From: Wanpeng Li <wanpeng...@hotmail.com>
>>>
>>> I observed that kvmvapic(to optimize flexpriority=N or AMD) is used
>>> to boost TPR access when testing kvm-unit-test/eventinj.flat tpr case
>>> on my haswell desktop (w/ flexpriority, w/o APICv). Commit (8d14695f9542
>>> x86, apicv: add virtual x2apic support) disable virtual x2apic mode
>>> completely if w/o APICv, and the author also told me that windows guest
>>> can't enter into x2apic mode when he developed the APICv feature several
>>> years ago. However, it is not truth currently, Interrupt Remapping and
>>> vIOMMU is added to qemu and the developers from Intel test windows 8 can
>>> work in x2apic mode w/ Interrupt Remapping enabled recently.
>>>
>>> This patch enables TPR shadow for virtual x2apic mode to boost
>>> windows guest in x2apic mode even if w/o APICv.
>>>
>>> Can pass the kvm-unit-test.
>>>
>>
>> While at it, is the vmx flexpriotity stuff still valid code?
>> AFAICS it gets enabled iff TPR shadow is on. flexpriority
>> is on when :
>>
>> (flexpriority_enabled && lapic_in_kernel && cpu_has_vmx_tpr_shadow && 
>> cpu_has_vmx_virtualize_apic_accesses)
>>
>> But apic accesses to TPR mmio are not then trapped and TPR changes not 
>> reported because
>> the “use TPR shadow” VM-execution control is 1.
> 
> Please note the patch is for MSR-BASED TPR shadow w/o APICv, TPR
> shadow can work correctly in other configurations.
> 
> Regards,
> Wanpeng Li
> 

Hi,

Yes I see, this is slightly offtopic but while at flexpriority, how is it 
relevant in current kvm?
In other words I see it as dead code, because it is enabled only when TPR 
shadow is enabled,
and as such ineffective because TPR shadow disables the wmexits tpr reporting 
uses.

Thanks,
Mika

Re: [PATCH] KVM: VMX: Enable MSR-BASED TPR shadow even if w/o APICv

2016-09-15 Thread Mika Penttilä

On 09/15/2016 07:25 AM, Wanpeng Li wrote:
> 2016-09-15 12:08 GMT+08:00 Mika Penttilä :
>> On 09/14/2016 10:58 AM, Wanpeng Li wrote:
>>> From: Wanpeng Li 
>>>
>>> I observed that kvmvapic(to optimize flexpriority=N or AMD) is used
>>> to boost TPR access when testing kvm-unit-test/eventinj.flat tpr case
>>> on my haswell desktop (w/ flexpriority, w/o APICv). Commit (8d14695f9542
>>> x86, apicv: add virtual x2apic support) disable virtual x2apic mode
>>> completely if w/o APICv, and the author also told me that windows guest
>>> can't enter into x2apic mode when he developed the APICv feature several
>>> years ago. However, it is not truth currently, Interrupt Remapping and
>>> vIOMMU is added to qemu and the developers from Intel test windows 8 can
>>> work in x2apic mode w/ Interrupt Remapping enabled recently.
>>>
>>> This patch enables TPR shadow for virtual x2apic mode to boost
>>> windows guest in x2apic mode even if w/o APICv.
>>>
>>> Can pass the kvm-unit-test.
>>>
>>
>> While at it, is the vmx flexpriotity stuff still valid code?
>> AFAICS it gets enabled iff TPR shadow is on. flexpriority
>> is on when :
>>
>> (flexpriority_enabled && lapic_in_kernel && cpu_has_vmx_tpr_shadow && 
>> cpu_has_vmx_virtualize_apic_accesses)
>>
>> But apic accesses to TPR mmio are not then trapped and TPR changes not 
>> reported because
>> the “use TPR shadow” VM-execution control is 1.
> 
> Please note the patch is for MSR-BASED TPR shadow w/o APICv, TPR
> shadow can work correctly in other configurations.
> 
> Regards,
> Wanpeng Li
> 

Hi,

Yes I see, this is slightly offtopic but while at flexpriority, how is it 
relevant in current kvm?
In other words I see it as dead code, because it is enabled only when TPR 
shadow is enabled,
and as such ineffective because TPR shadow disables the wmexits tpr reporting 
uses.

Thanks,
Mika

Re: [PATCH] KVM: VMX: Enable MSR-BASED TPR shadow even if w/o APICv

2016-09-14 Thread Mika Penttilä

On 09/14/2016 10:58 AM, Wanpeng Li wrote:
> From: Wanpeng Li 
> 
> I observed that kvmvapic(to optimize flexpriority=N or AMD) is used 
> to boost TPR access when testing kvm-unit-test/eventinj.flat tpr case
> on my haswell desktop (w/ flexpriority, w/o APICv). Commit (8d14695f9542 
> x86, apicv: add virtual x2apic support) disable virtual x2apic mode 
> completely if w/o APICv, and the author also told me that windows guest
> can't enter into x2apic mode when he developed the APICv feature several 
> years ago. However, it is not truth currently, Interrupt Remapping and 
> vIOMMU is added to qemu and the developers from Intel test windows 8 can 
> work in x2apic mode w/ Interrupt Remapping enabled recently. 
> 
> This patch enables TPR shadow for virtual x2apic mode to boost 
> windows guest in x2apic mode even if w/o APICv.
> 
> Can pass the kvm-unit-test.
> 

While at it, is the vmx flexpriotity stuff still valid code?
AFAICS it gets enabled iff TPR shadow is on. flexpriority
is on when :

(flexpriority_enabled && lapic_in_kernel && cpu_has_vmx_tpr_shadow && 
cpu_has_vmx_virtualize_apic_accesses)

But apic accesses to TPR mmio are not then trapped and TPR changes not reported 
because
the “use TPR shadow” VM-execution control is 1.

Thanks,
Mika


> Suggested-by: Wincy Van 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: Wincy Van 
> Cc: Yang Zhang 
> Signed-off-by: Wanpeng Li 
> ---
>  arch/x86/kvm/vmx.c | 41 ++---
>  1 file changed, 22 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 5cede40..e703129 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -6336,7 +6336,7 @@ static void wakeup_handler(void)
>  
>  static __init int hardware_setup(void)
>  {
> - int r = -ENOMEM, i, msr;
> + int r = -ENOMEM, i;
>  
>   rdmsrl_safe(MSR_EFER, _efer);
>  
> @@ -6464,18 +6464,6 @@ static __init int hardware_setup(void)
>  
>   set_bit(0, vmx_vpid_bitmap); /* 0 is reserved for host */
>  
> - for (msr = 0x800; msr <= 0x8ff; msr++)
> - vmx_disable_intercept_msr_read_x2apic(msr);
> -
> - /* TMCCT */
> - vmx_enable_intercept_msr_read_x2apic(0x839);
> - /* TPR */
> - vmx_disable_intercept_msr_write_x2apic(0x808);
> - /* EOI */
> - vmx_disable_intercept_msr_write_x2apic(0x80b);
> - /* SELF-IPI */
> - vmx_disable_intercept_msr_write_x2apic(0x83f);
> -
>   if (enable_ept) {
>   kvm_mmu_set_mask_ptes(VMX_EPT_READABLE_MASK,
>   (enable_ept_ad_bits) ? VMX_EPT_ACCESS_BIT : 0ull,
> @@ -8435,12 +8423,7 @@ static void vmx_set_virtual_x2apic_mode(struct 
> kvm_vcpu *vcpu, bool set)
>   return;
>   }
>  
> - /*
> -  * There is not point to enable virtualize x2apic without enable
> -  * apicv
> -  */
> - if (!cpu_has_vmx_virtualize_x2apic_mode() ||
> - !kvm_vcpu_apicv_active(vcpu))
> + if (!cpu_has_vmx_virtualize_x2apic_mode())
>   return;
>  
>   if (!cpu_need_tpr_shadow(vcpu))
> @@ -8449,8 +8432,28 @@ static void vmx_set_virtual_x2apic_mode(struct 
> kvm_vcpu *vcpu, bool set)
>   sec_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
>  
>   if (set) {
> + int msr;
> +
>   sec_exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
>   sec_exec_control |= SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
> +
> + if (kvm_vcpu_apicv_active(vcpu)) {
> + for (msr = 0x800; msr <= 0x8ff; msr++)
> + vmx_disable_intercept_msr_read_x2apic(msr);
> +
> + /* TMCCT */
> + vmx_enable_intercept_msr_read_x2apic(0x839);
> + /* TPR */
> + vmx_disable_intercept_msr_write_x2apic(0x808);
> + /* EOI */
> + vmx_disable_intercept_msr_write_x2apic(0x80b);
> + /* SELF-IPI */
> + vmx_disable_intercept_msr_write_x2apic(0x83f);
> + } else if (vmx_exec_control(to_vmx(vcpu)) & 
> CPU_BASED_TPR_SHADOW) {
> + /* TPR */
> + vmx_disable_intercept_msr_read_x2apic(0x808);
> + vmx_disable_intercept_msr_write_x2apic(0x808);
> + }
>   } else {
>   sec_exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
>   sec_exec_control |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
>

Re: [PATCH] KVM: VMX: Enable MSR-BASED TPR shadow even if w/o APICv

2016-09-14 Thread Mika Penttilä

On 09/14/2016 10:58 AM, Wanpeng Li wrote:
> From: Wanpeng Li 
> 
> I observed that kvmvapic(to optimize flexpriority=N or AMD) is used 
> to boost TPR access when testing kvm-unit-test/eventinj.flat tpr case
> on my haswell desktop (w/ flexpriority, w/o APICv). Commit (8d14695f9542 
> x86, apicv: add virtual x2apic support) disable virtual x2apic mode 
> completely if w/o APICv, and the author also told me that windows guest
> can't enter into x2apic mode when he developed the APICv feature several 
> years ago. However, it is not truth currently, Interrupt Remapping and 
> vIOMMU is added to qemu and the developers from Intel test windows 8 can 
> work in x2apic mode w/ Interrupt Remapping enabled recently. 
> 
> This patch enables TPR shadow for virtual x2apic mode to boost 
> windows guest in x2apic mode even if w/o APICv.
> 
> Can pass the kvm-unit-test.
> 

While at it, is the vmx flexpriotity stuff still valid code?
AFAICS it gets enabled iff TPR shadow is on. flexpriority
is on when :

(flexpriority_enabled && lapic_in_kernel && cpu_has_vmx_tpr_shadow && 
cpu_has_vmx_virtualize_apic_accesses)

But apic accesses to TPR mmio are not then trapped and TPR changes not reported 
because
the “use TPR shadow” VM-execution control is 1.

Thanks,
Mika


> Suggested-by: Wincy Van 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: Wincy Van 
> Cc: Yang Zhang 
> Signed-off-by: Wanpeng Li 
> ---
>  arch/x86/kvm/vmx.c | 41 ++---
>  1 file changed, 22 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 5cede40..e703129 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -6336,7 +6336,7 @@ static void wakeup_handler(void)
>  
>  static __init int hardware_setup(void)
>  {
> - int r = -ENOMEM, i, msr;
> + int r = -ENOMEM, i;
>  
>   rdmsrl_safe(MSR_EFER, _efer);
>  
> @@ -6464,18 +6464,6 @@ static __init int hardware_setup(void)
>  
>   set_bit(0, vmx_vpid_bitmap); /* 0 is reserved for host */
>  
> - for (msr = 0x800; msr <= 0x8ff; msr++)
> - vmx_disable_intercept_msr_read_x2apic(msr);
> -
> - /* TMCCT */
> - vmx_enable_intercept_msr_read_x2apic(0x839);
> - /* TPR */
> - vmx_disable_intercept_msr_write_x2apic(0x808);
> - /* EOI */
> - vmx_disable_intercept_msr_write_x2apic(0x80b);
> - /* SELF-IPI */
> - vmx_disable_intercept_msr_write_x2apic(0x83f);
> -
>   if (enable_ept) {
>   kvm_mmu_set_mask_ptes(VMX_EPT_READABLE_MASK,
>   (enable_ept_ad_bits) ? VMX_EPT_ACCESS_BIT : 0ull,
> @@ -8435,12 +8423,7 @@ static void vmx_set_virtual_x2apic_mode(struct 
> kvm_vcpu *vcpu, bool set)
>   return;
>   }
>  
> - /*
> -  * There is not point to enable virtualize x2apic without enable
> -  * apicv
> -  */
> - if (!cpu_has_vmx_virtualize_x2apic_mode() ||
> - !kvm_vcpu_apicv_active(vcpu))
> + if (!cpu_has_vmx_virtualize_x2apic_mode())
>   return;
>  
>   if (!cpu_need_tpr_shadow(vcpu))
> @@ -8449,8 +8432,28 @@ static void vmx_set_virtual_x2apic_mode(struct 
> kvm_vcpu *vcpu, bool set)
>   sec_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
>  
>   if (set) {
> + int msr;
> +
>   sec_exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
>   sec_exec_control |= SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
> +
> + if (kvm_vcpu_apicv_active(vcpu)) {
> + for (msr = 0x800; msr <= 0x8ff; msr++)
> + vmx_disable_intercept_msr_read_x2apic(msr);
> +
> + /* TMCCT */
> + vmx_enable_intercept_msr_read_x2apic(0x839);
> + /* TPR */
> + vmx_disable_intercept_msr_write_x2apic(0x808);
> + /* EOI */
> + vmx_disable_intercept_msr_write_x2apic(0x80b);
> + /* SELF-IPI */
> + vmx_disable_intercept_msr_write_x2apic(0x83f);
> + } else if (vmx_exec_control(to_vmx(vcpu)) & 
> CPU_BASED_TPR_SHADOW) {
> + /* TPR */
> + vmx_disable_intercept_msr_read_x2apic(0x808);
> + vmx_disable_intercept_msr_write_x2apic(0x808);
> + }
>   } else {
>   sec_exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
>   sec_exec_control |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
>

[PATCH] bluetooth, regression: MSG_TRUNC fixes

2016-08-24 Thread Mika Penttilä

Recent 4.8-rc changes to bluetooth MSG_TRUNC handling introduced regression; 
pairing finishes
but connecting profiles not. 

With the below fixes to MSG_TRUNC handling the connection is established 
normally.

--Mika


Signed-off-by: Mika Penttilä <mika.pentt...@nextfour.com>
---

diff --git a/net/bluetooth/af_bluetooth.c b/net/bluetooth/af_bluetooth.c
index ece45e0..0b5f729 100644
--- a/net/bluetooth/af_bluetooth.c
+++ b/net/bluetooth/af_bluetooth.c
@@ -250,7 +250,7 @@ int bt_sock_recvmsg(struct socket *sock, struct msghdr 
*msg, siz
 
skb_free_datagram(sk, skb);
 
-   if (msg->msg_flags & MSG_TRUNC)
+   if (flags & MSG_TRUNC)
copied = skblen;
 
return err ? : copied;
diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c
index 6ef8a01..96f04b7 100644
--- a/net/bluetooth/hci_sock.c
+++ b/net/bluetooth/hci_sock.c
@@ -1091,7 +1091,7 @@ static int hci_sock_recvmsg(struct socket *sock, struct 
msghdr
 
skb_free_datagram(sk, skb);
 
-   if (msg->msg_flags & MSG_TRUNC)
+   if (flags & MSG_TRUNC)
copied = skblen;
 
return err ? : copied;

[PATCH] bluetooth, regression: MSG_TRUNC fixes

2016-08-24 Thread Mika Penttilä

Recent 4.8-rc changes to bluetooth MSG_TRUNC handling introduced regression; 
pairing finishes
but connecting profiles not. 

With the below fixes to MSG_TRUNC handling the connection is established 
normally.

--Mika


Signed-off-by: Mika Penttilä 
---

diff --git a/net/bluetooth/af_bluetooth.c b/net/bluetooth/af_bluetooth.c
index ece45e0..0b5f729 100644
--- a/net/bluetooth/af_bluetooth.c
+++ b/net/bluetooth/af_bluetooth.c
@@ -250,7 +250,7 @@ int bt_sock_recvmsg(struct socket *sock, struct msghdr 
*msg, siz
 
skb_free_datagram(sk, skb);
 
-   if (msg->msg_flags & MSG_TRUNC)
+   if (flags & MSG_TRUNC)
copied = skblen;
 
return err ? : copied;
diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c
index 6ef8a01..96f04b7 100644
--- a/net/bluetooth/hci_sock.c
+++ b/net/bluetooth/hci_sock.c
@@ -1091,7 +1091,7 @@ static int hci_sock_recvmsg(struct socket *sock, struct 
msghdr
 
skb_free_datagram(sk, skb);
 
-   if (msg->msg_flags & MSG_TRUNC)
+   if (flags & MSG_TRUNC)
copied = skblen;
 
return err ? : copied;

Re: kvm vmx shadow paging question

2016-08-15 Thread Mika Penttilä

On 13.08.2016 18:47, Mika Penttilä wrote:

> On 13.08.2016 17:38, Mika Penttilä wrote:
>
>> Hi,
>>
>> While studying the vmx code, and the shadow page tables usage (no ept 
>> involved),
>> I wondered the GUEST_CR3 usage. If no ept, GUEST_CR3 points to the shadow 
>> tables.
>> But the format of sptes is always 64 bit. How is that with 32 bit hosts, is 
>> the GUEST_CR3
>> similar to EPT format, 4 level 64 bit always or how is this working on 32 
>> bit?
>>
>> Thanks,
>> Mika
>>
> Hmm seems non PAE 32 bit host without ept enabled is not supported 
> combination, right ?
>
> --Mika
>
>

Ping, am I correct that for using shadow mmu with x86 kvm host you need at 
least 32bit PAE enabled host
(because shadow ptes are always 64 bit, in PAE or long format)?

--Mika

Re: kvm vmx shadow paging question

2016-08-15 Thread Mika Penttilä

On 13.08.2016 18:47, Mika Penttilä wrote:

> On 13.08.2016 17:38, Mika Penttilä wrote:
>
>> Hi,
>>
>> While studying the vmx code, and the shadow page tables usage (no ept 
>> involved),
>> I wondered the GUEST_CR3 usage. If no ept, GUEST_CR3 points to the shadow 
>> tables.
>> But the format of sptes is always 64 bit. How is that with 32 bit hosts, is 
>> the GUEST_CR3
>> similar to EPT format, 4 level 64 bit always or how is this working on 32 
>> bit?
>>
>> Thanks,
>> Mika
>>
> Hmm seems non PAE 32 bit host without ept enabled is not supported 
> combination, right ?
>
> --Mika
>
>

Ping, am I correct that for using shadow mmu with x86 kvm host you need at 
least 32bit PAE enabled host
(because shadow ptes are always 64 bit, in PAE or long format)?

--Mika

Re: kvm vmx shadow paging question

2016-08-14 Thread Mika Penttilä

On 13.08.2016 17:38, Mika Penttilä wrote:

> Hi,
>
> While studying the vmx code, and the shadow page tables usage (no ept 
> involved),
> I wondered the GUEST_CR3 usage. If no ept, GUEST_CR3 points to the shadow 
> tables.
> But the format of sptes is always 64 bit. How is that with 32 bit hosts, is 
> the GUEST_CR3
> similar to EPT format, 4 level 64 bit always or how is this working on 32 bit?
>
> Thanks,
> Mika
>
Hmm seems non PAE 32 bit host without ept enabled is not supported combination, 
right ?

--Mika

Re: kvm vmx shadow paging question

2016-08-14 Thread Mika Penttilä

On 13.08.2016 17:38, Mika Penttilä wrote:

> Hi,
>
> While studying the vmx code, and the shadow page tables usage (no ept 
> involved),
> I wondered the GUEST_CR3 usage. If no ept, GUEST_CR3 points to the shadow 
> tables.
> But the format of sptes is always 64 bit. How is that with 32 bit hosts, is 
> the GUEST_CR3
> similar to EPT format, 4 level 64 bit always or how is this working on 32 bit?
>
> Thanks,
> Mika
>
Hmm seems non PAE 32 bit host without ept enabled is not supported combination, 
right ?

--Mika

Re: [PATCH v1 2/2] x86/KASLR: Increase BRK pages for KASLR memory randomization

2016-08-08 Thread Mika Penttilä

On 08/08/2016 09:40 PM, Thomas Garnier wrote:
> Default implementation expects 6 pages maximum are needed for low page
> allocations. If KASLR memory randomization is enabled, the worse case
> of e820 layout would require 12 pages (no large pages). It is due to the
> PUD level randomization and the variable e820 memory layout.
> 
> This bug was found while doing extensive testing of KASLR memory
> randomization on different type of hardware.
> 
> Signed-off-by: Thomas Garnier 
> ---
> Based on next-20160805
> ---
>  arch/x86/mm/init.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 6209289..3a27e6a 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -130,6 +130,14 @@ void  __init early_alloc_pgt_buf(void)
>   unsigned long tables = INIT_PGT_BUF_SIZE;
>   phys_addr_t base;
>  
> + /*
> +  * Depending on the machine e860 memory layout and the PUD alignement.
> +  * We may need twice more pages when KASLR memoy randomization is
> +  * enabled.
> +  */
> + if (IS_ENABLED(CONFIG_RANDOMIZE_MEMORY))
> + tables *= 2;
> +
>   base = __pa(extend_brk(tables, PAGE_SIZE));
>  
>   pgt_buf_start = base >> PAGE_SHIFT;
> 

You should increase the reserve also :
RESERVE_BRK(early_pgt_alloc, INIT_PGT_BUF_SIZE);


--Mika

Re: [PATCH v1 2/2] x86/KASLR: Increase BRK pages for KASLR memory randomization

2016-08-08 Thread Mika Penttilä

On 08/08/2016 09:40 PM, Thomas Garnier wrote:
> Default implementation expects 6 pages maximum are needed for low page
> allocations. If KASLR memory randomization is enabled, the worse case
> of e820 layout would require 12 pages (no large pages). It is due to the
> PUD level randomization and the variable e820 memory layout.
> 
> This bug was found while doing extensive testing of KASLR memory
> randomization on different type of hardware.
> 
> Signed-off-by: Thomas Garnier 
> ---
> Based on next-20160805
> ---
>  arch/x86/mm/init.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 6209289..3a27e6a 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -130,6 +130,14 @@ void  __init early_alloc_pgt_buf(void)
>   unsigned long tables = INIT_PGT_BUF_SIZE;
>   phys_addr_t base;
>  
> + /*
> +  * Depending on the machine e860 memory layout and the PUD alignement.
> +  * We may need twice more pages when KASLR memoy randomization is
> +  * enabled.
> +  */
> + if (IS_ENABLED(CONFIG_RANDOMIZE_MEMORY))
> + tables *= 2;
> +
>   base = __pa(extend_brk(tables, PAGE_SIZE));
>  
>   pgt_buf_start = base >> PAGE_SHIFT;
> 

You should increase the reserve also :
RESERVE_BRK(early_pgt_alloc, INIT_PGT_BUF_SIZE);


--Mika

Re: [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup

2016-06-29 Thread Mika Penttilä



On 29.06.2016 10:06, Mika Penttilä wrote:
> On 06/27/2016 12:55 AM, Andy Lutomirski wrote:
>> Hi all-
>>
>> Known issues:
>>  - tcp md5, virtio_net, and virtio_console will have issues.  Eric Dumazet
>>has a patch for tcp md5, and Michael Tsirkin says he'll fix virtio_net
>>and virtio_console.
>>
>  How about PTRACE_SETREGS, it's using the child stack's vmapped address to 
> put regs?
>
> --Mika
>
PTRACE_SETREGS is ok of course.

--Mika

Re: [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup

2016-06-29 Thread Mika Penttilä



On 29.06.2016 10:06, Mika Penttilä wrote:
> On 06/27/2016 12:55 AM, Andy Lutomirski wrote:
>> Hi all-
>>
>> Known issues:
>>  - tcp md5, virtio_net, and virtio_console will have issues.  Eric Dumazet
>>has a patch for tcp md5, and Michael Tsirkin says he'll fix virtio_net
>>and virtio_console.
>>
>  How about PTRACE_SETREGS, it's using the child stack's vmapped address to 
> put regs?
>
> --Mika
>
PTRACE_SETREGS is ok of course.

--Mika

Re: [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup

2016-06-29 Thread Mika Penttilä

On 06/27/2016 12:55 AM, Andy Lutomirski wrote:
> Hi all-
> 

> 
> Known issues:
>  - tcp md5, virtio_net, and virtio_console will have issues.  Eric Dumazet
>has a patch for tcp md5, and Michael Tsirkin says he'll fix virtio_net
>and virtio_console.
> 
 How about PTRACE_SETREGS, it's using the child stack's vmapped address to put 
regs?

--Mika

Re: [PATCH v4 00/29] virtually mapped stacks and thread_info cleanup

2016-06-29 Thread Mika Penttilä

On 06/27/2016 12:55 AM, Andy Lutomirski wrote:
> Hi all-
> 

> 
> Known issues:
>  - tcp md5, virtio_net, and virtio_console will have issues.  Eric Dumazet
>has a patch for tcp md5, and Michael Tsirkin says he'll fix virtio_net
>and virtio_console.
> 
 How about PTRACE_SETREGS, it's using the child stack's vmapped address to put 
regs?

--Mika

Re: [PATCH 12/13] x86/mm/64: Enable vmapped stacks

2016-06-15 Thread Mika Penttilä

Hi,

On 06/16/2016 03:28 AM, Andy Lutomirski wrote:
> This allows x86_64 kernels to enable vmapped stacks.  There are a
> couple of interesting bits.
> 
> First, x86 lazily faults in top-level paging entries for the vmalloc
> area.  This won't work if we get a page fault while trying to access
> the stack: the CPU will promote it to a double-fault and we'll die.
> To avoid this problem, probe the new stack when switching stacks and
> forcibly populate the pgd entry for the stack when switching mms.
> 
> Second, once we have guard pages around the stack, we'll want to
> detect and handle stack overflow.
> 
> I didn't enable it on x86_32.  We'd need to rework the double-fault
> code a bit and I'm concerned about running out of vmalloc virtual
> addresses under some workloads.
> 
> This patch, by itself, will behave somewhat erratically when the
> stack overflows while RSP is still more than a few tens of bytes
> above the bottom of the stack.  Specifically, we'll get #PF and make
> it to no_context and an oops without triggering a double-fault, and
> no_context doesn't know about stack overflows.  The next patch will
> improve that case.
> 
> Signed-off-by: Andy Lutomirski 
> ---
>  arch/x86/Kconfig |  1 +
>  arch/x86/include/asm/switch_to.h | 28 +++-
>  arch/x86/kernel/traps.c  | 32 
>  arch/x86/mm/tlb.c| 15 +++
>  4 files changed, 75 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 0a7b885964ba..b624b24d1dc1 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -92,6 +92,7 @@ config X86
>   select HAVE_ARCH_TRACEHOOK
>   select HAVE_ARCH_TRANSPARENT_HUGEPAGE
>   select HAVE_EBPF_JITif X86_64
> + select HAVE_ARCH_VMAP_STACK if X86_64
>   select HAVE_CC_STACKPROTECTOR
>   select HAVE_CMPXCHG_DOUBLE
>   select HAVE_CMPXCHG_LOCAL
> diff --git a/arch/x86/include/asm/switch_to.h 
> b/arch/x86/include/asm/switch_to.h
> index 8f321a1b03a1..14e4b20f0aaf 100644
> --- a/arch/x86/include/asm/switch_to.h
> +++ b/arch/x86/include/asm/switch_to.h
> @@ -8,6 +8,28 @@ struct tss_struct;
>  void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
> struct tss_struct *tss);
>  
> +/* This runs runs on the previous thread's stack. */
> +static inline void prepare_switch_to(struct task_struct *prev,
> +  struct task_struct *next)
> +{
> +#ifdef CONFIG_VMAP_STACK
> + /*
> +  * If we switch to a stack that has a top-level paging entry
> +  * that is not present in the current mm, the resulting #PF will
> +  * will be promoted to a double-fault and we'll panic.  Probe
> +  * the new stack now so that vmalloc_fault can fix up the page
> +  * tables if needed.  This can only happen if we use a stack
> +  * in vmap space.
> +  *
> +  * We assume that the stack is aligned so that it never spans
> +  * more than one top-level paging entry.
> +  *
> +  * To minimize cache pollution, just follow the stack pointer.
> +  */
> + READ_ONCE(*(unsigned char *)next->thread.sp);
> +#endif
> +}
> +
>  #ifdef CONFIG_X86_32
>  
>  #ifdef CONFIG_CC_STACKPROTECTOR
> @@ -39,6 +61,8 @@ do {
> \
>*/ \
>   unsigned long ebx, ecx, edx, esi, edi;  \
>   \
> + prepare_switch_to(prev, next);  \
> + \
>   asm volatile("pushl %%ebp\n\t"  /* saveEBP   */ \
>"movl %%esp,%[prev_sp]\n\t"/* saveESP   */ \
>"movl %[next_sp],%%esp\n\t"/* restore ESP   */ \
> @@ -103,7 +127,9 @@ do {  
> \
>   * clean in kernel mode, with the possible exception of IOPL.  Kernel IOPL
>   * has no effect.
>   */
> -#define switch_to(prev, next, last) \
> +#define switch_to(prev, next, last)\
> + prepare_switch_to(prev, next);\
> +   \
>   asm volatile(SAVE_CONTEXT \
>"movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */   \
>"movq %P[threadrsp](%[next]),%%rsp\n\t" /* restore RSP */\
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 00f03d82e69a..9cb7ea781176 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -292,12 +292,30 @@ DO_ERROR(X86_TRAP_NP, SIGBUS,  "segment not 
> present",

Re: [PATCH 12/13] x86/mm/64: Enable vmapped stacks

2016-06-15 Thread Mika Penttilä

Hi,

On 06/16/2016 03:28 AM, Andy Lutomirski wrote:
> This allows x86_64 kernels to enable vmapped stacks.  There are a
> couple of interesting bits.
> 
> First, x86 lazily faults in top-level paging entries for the vmalloc
> area.  This won't work if we get a page fault while trying to access
> the stack: the CPU will promote it to a double-fault and we'll die.
> To avoid this problem, probe the new stack when switching stacks and
> forcibly populate the pgd entry for the stack when switching mms.
> 
> Second, once we have guard pages around the stack, we'll want to
> detect and handle stack overflow.
> 
> I didn't enable it on x86_32.  We'd need to rework the double-fault
> code a bit and I'm concerned about running out of vmalloc virtual
> addresses under some workloads.
> 
> This patch, by itself, will behave somewhat erratically when the
> stack overflows while RSP is still more than a few tens of bytes
> above the bottom of the stack.  Specifically, we'll get #PF and make
> it to no_context and an oops without triggering a double-fault, and
> no_context doesn't know about stack overflows.  The next patch will
> improve that case.
> 
> Signed-off-by: Andy Lutomirski 
> ---
>  arch/x86/Kconfig |  1 +
>  arch/x86/include/asm/switch_to.h | 28 +++-
>  arch/x86/kernel/traps.c  | 32 
>  arch/x86/mm/tlb.c| 15 +++
>  4 files changed, 75 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 0a7b885964ba..b624b24d1dc1 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -92,6 +92,7 @@ config X86
>   select HAVE_ARCH_TRACEHOOK
>   select HAVE_ARCH_TRANSPARENT_HUGEPAGE
>   select HAVE_EBPF_JITif X86_64
> + select HAVE_ARCH_VMAP_STACK if X86_64
>   select HAVE_CC_STACKPROTECTOR
>   select HAVE_CMPXCHG_DOUBLE
>   select HAVE_CMPXCHG_LOCAL
> diff --git a/arch/x86/include/asm/switch_to.h 
> b/arch/x86/include/asm/switch_to.h
> index 8f321a1b03a1..14e4b20f0aaf 100644
> --- a/arch/x86/include/asm/switch_to.h
> +++ b/arch/x86/include/asm/switch_to.h
> @@ -8,6 +8,28 @@ struct tss_struct;
>  void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
> struct tss_struct *tss);
>  
> +/* This runs runs on the previous thread's stack. */
> +static inline void prepare_switch_to(struct task_struct *prev,
> +  struct task_struct *next)
> +{
> +#ifdef CONFIG_VMAP_STACK
> + /*
> +  * If we switch to a stack that has a top-level paging entry
> +  * that is not present in the current mm, the resulting #PF will
> +  * will be promoted to a double-fault and we'll panic.  Probe
> +  * the new stack now so that vmalloc_fault can fix up the page
> +  * tables if needed.  This can only happen if we use a stack
> +  * in vmap space.
> +  *
> +  * We assume that the stack is aligned so that it never spans
> +  * more than one top-level paging entry.
> +  *
> +  * To minimize cache pollution, just follow the stack pointer.
> +  */
> + READ_ONCE(*(unsigned char *)next->thread.sp);
> +#endif
> +}
> +
>  #ifdef CONFIG_X86_32
>  
>  #ifdef CONFIG_CC_STACKPROTECTOR
> @@ -39,6 +61,8 @@ do {
> \
>*/ \
>   unsigned long ebx, ecx, edx, esi, edi;  \
>   \
> + prepare_switch_to(prev, next);  \
> + \
>   asm volatile("pushl %%ebp\n\t"  /* saveEBP   */ \
>"movl %%esp,%[prev_sp]\n\t"/* saveESP   */ \
>"movl %[next_sp],%%esp\n\t"/* restore ESP   */ \
> @@ -103,7 +127,9 @@ do {  
> \
>   * clean in kernel mode, with the possible exception of IOPL.  Kernel IOPL
>   * has no effect.
>   */
> -#define switch_to(prev, next, last) \
> +#define switch_to(prev, next, last)\
> + prepare_switch_to(prev, next);\
> +   \
>   asm volatile(SAVE_CONTEXT \
>"movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */   \
>"movq %P[threadrsp](%[next]),%%rsp\n\t" /* restore RSP */\
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 00f03d82e69a..9cb7ea781176 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -292,12 +292,30 @@ DO_ERROR(X86_TRAP_NP, SIGBUS,  "segment not 
> present",

Re: [PATCH v2] sched/cputime: add steal time support to full dynticks CPU time accounting

2016-05-18 Thread Mika Penttilä

On 05/18/2016 11:28 AM, Wanpeng Li wrote:
> From: Wanpeng Li 
> 
> This patch adds steal guest time support to full dynticks CPU 
> time accounting. After 'commit ff9a9b4c4334 ("sched, time: Switch 
> VIRT_CPU_ACCOUNTING_GEN to jiffy granularity")', time is jiffy 
> based sampling even if it's still listened to ring boundaries, so 
> steal_account_process_tick() is reused to account how much 'ticks' 
> are steal time after the last accumulation. 
> 
> Suggested-by: Rik van Riel 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra (Intel) 
> Cc: Rik van Riel 
> Cc: Thomas Gleixner 
> Cc: Frederic Weisbecker 
> Cc: Paolo Bonzini 
> Cc: Radim 
> Signed-off-by: Wanpeng Li 
> ---
> v1 -> v2:
>  * fix divide zero bug, thanks Rik
> 
>  kernel/sched/cputime.c | 13 +++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 75f98c5..bfa50a0 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -257,7 +257,7 @@ void account_idle_time(cputime_t cputime)
>   cpustat[CPUTIME_IDLE] += (__force u64) cputime;
>  }
>  
> -static __always_inline bool steal_account_process_tick(void)
> +static __always_inline unsigned long steal_account_process_tick(void)
>  {
>  #ifdef CONFIG_PARAVIRT
>   if (static_key_false(_steal_enabled)) {
> @@ -279,7 +279,7 @@ static __always_inline bool 
> steal_account_process_tick(void)
>   return steal_jiffies;
>   }
>  #endif
> - return false;
> + return 0;
>  }
>  
>  /*
> @@ -691,8 +691,12 @@ static cputime_t get_vtime_delta(struct task_struct *tsk)
>  
>  static void __vtime_account_system(struct task_struct *tsk)
>  {
> + unsigned long steal_time = steal_account_process_tick();
>   cputime_t delta_cpu = get_vtime_delta(tsk);
>  
> + if (steal_time >= delta_cpu)
> + return;
> + delta_cpu -= steal_time;
>   account_system_time(tsk, irq_count(), delta_cpu, 
> cputime_to_scaled(delta_cpu));
>  }
>  
> @@ -723,7 +727,12 @@ void vtime_account_user(struct task_struct *tsk)
>   write_seqcount_begin(>vtime_seqcount);
>   tsk->vtime_snap_whence = VTIME_SYS;
>   if (vtime_delta(tsk)) {
> + unsigned long steal_time = steal_account_process_tick();
>   delta_cpu = get_vtime_delta(tsk);


afaik steal_account_process_tick() returns jiffies and get_vtime_delta() 
cputime, so can't mix them like this : ?

> +
> + if (steal_time >= delta_cpu)
> + return;



> + delta_cpu -= steal_time;
>   account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu));
>   }
>   write_seqcount_end(>vtime_seqcount);
> 


--Mika

Re: [PATCH v2] sched/cputime: add steal time support to full dynticks CPU time accounting

2016-05-18 Thread Mika Penttilä

On 05/18/2016 11:28 AM, Wanpeng Li wrote:
> From: Wanpeng Li 
> 
> This patch adds steal guest time support to full dynticks CPU 
> time accounting. After 'commit ff9a9b4c4334 ("sched, time: Switch 
> VIRT_CPU_ACCOUNTING_GEN to jiffy granularity")', time is jiffy 
> based sampling even if it's still listened to ring boundaries, so 
> steal_account_process_tick() is reused to account how much 'ticks' 
> are steal time after the last accumulation. 
> 
> Suggested-by: Rik van Riel 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra (Intel) 
> Cc: Rik van Riel 
> Cc: Thomas Gleixner 
> Cc: Frederic Weisbecker 
> Cc: Paolo Bonzini 
> Cc: Radim 
> Signed-off-by: Wanpeng Li 
> ---
> v1 -> v2:
>  * fix divide zero bug, thanks Rik
> 
>  kernel/sched/cputime.c | 13 +++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 75f98c5..bfa50a0 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -257,7 +257,7 @@ void account_idle_time(cputime_t cputime)
>   cpustat[CPUTIME_IDLE] += (__force u64) cputime;
>  }
>  
> -static __always_inline bool steal_account_process_tick(void)
> +static __always_inline unsigned long steal_account_process_tick(void)
>  {
>  #ifdef CONFIG_PARAVIRT
>   if (static_key_false(_steal_enabled)) {
> @@ -279,7 +279,7 @@ static __always_inline bool 
> steal_account_process_tick(void)
>   return steal_jiffies;
>   }
>  #endif
> - return false;
> + return 0;
>  }
>  
>  /*
> @@ -691,8 +691,12 @@ static cputime_t get_vtime_delta(struct task_struct *tsk)
>  
>  static void __vtime_account_system(struct task_struct *tsk)
>  {
> + unsigned long steal_time = steal_account_process_tick();
>   cputime_t delta_cpu = get_vtime_delta(tsk);
>  
> + if (steal_time >= delta_cpu)
> + return;
> + delta_cpu -= steal_time;
>   account_system_time(tsk, irq_count(), delta_cpu, 
> cputime_to_scaled(delta_cpu));
>  }
>  
> @@ -723,7 +727,12 @@ void vtime_account_user(struct task_struct *tsk)
>   write_seqcount_begin(>vtime_seqcount);
>   tsk->vtime_snap_whence = VTIME_SYS;
>   if (vtime_delta(tsk)) {
> + unsigned long steal_time = steal_account_process_tick();
>   delta_cpu = get_vtime_delta(tsk);


afaik steal_account_process_tick() returns jiffies and get_vtime_delta() 
cputime, so can't mix them like this : ?

> +
> + if (steal_time >= delta_cpu)
> + return;



> + delta_cpu -= steal_time;
>   account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu));
>   }
>   write_seqcount_end(>vtime_seqcount);
> 


--Mika

[BUG] 4.4.7 af_unix, wakeup

2016-04-21 Thread Mika Penttilä

Have hit this same looking oops every now and then since at least 4.2 or so.. 
Not easy to reproduce systematically.

--Mika



[10973.891726] Unable to handle kernel NULL pointer dereference at virtual 
address 
[10973.899839] pgd = a8ce4000
[10973.902549] [] *pgd=38c38831, *pte=, *ppte=
[10973.908866] Internal error: Oops: 8007 [#1] PREEMPT SMP ARM
[10973.914787] Modules linked in: btwilink radio_quantek st_drv
[10973.920507] CPU: 1 PID: 310 Comm: compositor Not tainted 4.4.7 #1
[10973.926602] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[10973.933133] task: a82dc740 ti: a8dd2000 task.ti: a8dd2000
[10973.938534] PC is at 0x0
[10973.941080] LR is at __wake_up_common+0x4c/0x80
[10973.945615] pc : [<>]lr : [<80058584>]psr: 800f00b3
[10973.945615] sp : a8dd3d98  ip : a9077d48  fp : 0001
[10973.957093] r10: 0001  r9 : 0001  r8 : 00c3
[10973.962320] r7 : a9041c84  r6 : 011e6ebc  r5 : 0001  r4 : ab712074
[10973.968849] r3 : 00c3  r2 : 0001  r1 : 0001  r0 : a9077d48
[10973.975380] Flags: Nzcv  IRQs off  FIQs on  Mode SVC_32  ISA Thumb  Segment 
user
[10973.982777] Control: 10c5387d  Table: 38ce404a  DAC: 0055
[10973.988525] Process compositor (pid: 310, stack limit = 0xa8dd2210)
[10973.994793] Stack: (0xa8dd3d98 to 0xa8dd4000)
[10973.999153] 3d80:   
0001 a9041c80
[10974.007335] 3da0: 0001 00c3 0001 a00f0013 0030 a88dd080 
a88df64c 80058b68
[10974.015517] 3dc0: 00c3 806e28e0 a88df440  a88df440 a9c3ac00 
0001 8051b7ac
[10974.023698] 3de0: 0030 805c392c a8dd3e08 0003 7e9e25d8 0030 
a64d8c00 0001
[10974.031879] 3e00: a8dd3f74 a8dd3f6c     
 
[10974.040060] 3e20: a8dd3e54 a8dd3f6c 4040  a64d8c00  
a8dd3e58 
[10974.048240] 3e40: 0170 80518634 a8dd3f6c 80518fb8   
0005 757d2b98
[10974.056421] 3e60:    00c5e2dc 00c5d724 75750310 
00deeaac 0030
[10974.064602] 3e80:  00afd9c4 00c5e2dc  758057e0 7e9e22a8 
757fa4c8 757782e8
[10974.072783] 3ea0: 75805790  0068 00e364b8  00afd9c4 
00e364e0 0003
[10974.080963] 3ec0: 758057f0  00d4 75791b0c 04a6 0001 
00aff83c 00afd9c4
[10974.089145] 3ee0: 0034 00e2c374 00e364c4 00e1b634  0001 
00d4 0300
[10974.097325] 3f00: 04a6 75763790 00e364e0   0001 
3fff a84b3a98
[10974.105506] 3f20: a8dfc9c0 800fd2e8 a8dd3f68 a8dd3f64 4040 0128 
8000f6a4 a64d8c00
[10974.113686] 3f40: 7e9e25e8 4040 0128 8000f6a4 a8dd2000 80519c14 
 7e9e2730
[10974.121867] 3f60: 0107 0001 fff7   0001 
 
[10974.130047] 3f80: a8dd3e80    4040  
0001 00df0a30
[10974.138228] 3fa0: 00deea20 8000f500 0001 00df0a30 001e 7e9e25e8 
4040 
[10974.146409] 3fc0: 0001 00df0a30 00deea20 0128 00df1a18 000b308c 
7e9e25e8 0170
[10974.154590] 3fe0:  7e9e25d0 76ef54c0 75b568f8 800f0010 001e 
 
[10974.162783] [<80058584>] (__wake_up_common) from [<80058b68>] 
(__wake_up_sync_key+0x44/0x60)
[10974.171236] [<80058b68>] (__wake_up_sync_key) from [<8051b7ac>] 
(sock_def_readable+0x3c/0x6c)
[10974.179779] [<8051b7ac>] (sock_def_readable) from [<805c392c>] 
(unix_stream_sendmsg+0x154/0x340)
[10974.188572] [<805c392c>] (unix_stream_sendmsg) from [<80518634>] 
(sock_sendmsg+0x14/0x24)
[10974.196757] [<80518634>] (sock_sendmsg) from [<80518fb8>] 
(___sys_sendmsg+0x1d0/0x1d8)
[10974.204680] [<80518fb8>] (___sys_sendmsg) from [<80519c14>] 
(__sys_sendmsg+0x3c/0x6c)
[10974.212521] [<80519c14>] (__sys_sendmsg) from [<8000f500>] 
(ret_fast_syscall+0x0/0x34)
[10974.220441] Code: bad PC value
[10974.223502] ---[ end trace 7f5c4ba07462311f ]---

[BUG] 4.4.7 af_unix, wakeup

2016-04-21 Thread Mika Penttilä

Have hit this same looking oops every now and then since at least 4.2 or so.. 
Not easy to reproduce systematically.

--Mika



[10973.891726] Unable to handle kernel NULL pointer dereference at virtual 
address 
[10973.899839] pgd = a8ce4000
[10973.902549] [] *pgd=38c38831, *pte=, *ppte=
[10973.908866] Internal error: Oops: 8007 [#1] PREEMPT SMP ARM
[10973.914787] Modules linked in: btwilink radio_quantek st_drv
[10973.920507] CPU: 1 PID: 310 Comm: compositor Not tainted 4.4.7 #1
[10973.926602] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[10973.933133] task: a82dc740 ti: a8dd2000 task.ti: a8dd2000
[10973.938534] PC is at 0x0
[10973.941080] LR is at __wake_up_common+0x4c/0x80
[10973.945615] pc : [<>]lr : [<80058584>]psr: 800f00b3
[10973.945615] sp : a8dd3d98  ip : a9077d48  fp : 0001
[10973.957093] r10: 0001  r9 : 0001  r8 : 00c3
[10973.962320] r7 : a9041c84  r6 : 011e6ebc  r5 : 0001  r4 : ab712074
[10973.968849] r3 : 00c3  r2 : 0001  r1 : 0001  r0 : a9077d48
[10973.975380] Flags: Nzcv  IRQs off  FIQs on  Mode SVC_32  ISA Thumb  Segment 
user
[10973.982777] Control: 10c5387d  Table: 38ce404a  DAC: 0055
[10973.988525] Process compositor (pid: 310, stack limit = 0xa8dd2210)
[10973.994793] Stack: (0xa8dd3d98 to 0xa8dd4000)
[10973.999153] 3d80:   
0001 a9041c80
[10974.007335] 3da0: 0001 00c3 0001 a00f0013 0030 a88dd080 
a88df64c 80058b68
[10974.015517] 3dc0: 00c3 806e28e0 a88df440  a88df440 a9c3ac00 
0001 8051b7ac
[10974.023698] 3de0: 0030 805c392c a8dd3e08 0003 7e9e25d8 0030 
a64d8c00 0001
[10974.031879] 3e00: a8dd3f74 a8dd3f6c     
 
[10974.040060] 3e20: a8dd3e54 a8dd3f6c 4040  a64d8c00  
a8dd3e58 
[10974.048240] 3e40: 0170 80518634 a8dd3f6c 80518fb8   
0005 757d2b98
[10974.056421] 3e60:    00c5e2dc 00c5d724 75750310 
00deeaac 0030
[10974.064602] 3e80:  00afd9c4 00c5e2dc  758057e0 7e9e22a8 
757fa4c8 757782e8
[10974.072783] 3ea0: 75805790  0068 00e364b8  00afd9c4 
00e364e0 0003
[10974.080963] 3ec0: 758057f0  00d4 75791b0c 04a6 0001 
00aff83c 00afd9c4
[10974.089145] 3ee0: 0034 00e2c374 00e364c4 00e1b634  0001 
00d4 0300
[10974.097325] 3f00: 04a6 75763790 00e364e0   0001 
3fff a84b3a98
[10974.105506] 3f20: a8dfc9c0 800fd2e8 a8dd3f68 a8dd3f64 4040 0128 
8000f6a4 a64d8c00
[10974.113686] 3f40: 7e9e25e8 4040 0128 8000f6a4 a8dd2000 80519c14 
 7e9e2730
[10974.121867] 3f60: 0107 0001 fff7   0001 
 
[10974.130047] 3f80: a8dd3e80    4040  
0001 00df0a30
[10974.138228] 3fa0: 00deea20 8000f500 0001 00df0a30 001e 7e9e25e8 
4040 
[10974.146409] 3fc0: 0001 00df0a30 00deea20 0128 00df1a18 000b308c 
7e9e25e8 0170
[10974.154590] 3fe0:  7e9e25d0 76ef54c0 75b568f8 800f0010 001e 
 
[10974.162783] [<80058584>] (__wake_up_common) from [<80058b68>] 
(__wake_up_sync_key+0x44/0x60)
[10974.171236] [<80058b68>] (__wake_up_sync_key) from [<8051b7ac>] 
(sock_def_readable+0x3c/0x6c)
[10974.179779] [<8051b7ac>] (sock_def_readable) from [<805c392c>] 
(unix_stream_sendmsg+0x154/0x340)
[10974.188572] [<805c392c>] (unix_stream_sendmsg) from [<80518634>] 
(sock_sendmsg+0x14/0x24)
[10974.196757] [<80518634>] (sock_sendmsg) from [<80518fb8>] 
(___sys_sendmsg+0x1d0/0x1d8)
[10974.204680] [<80518fb8>] (___sys_sendmsg) from [<80519c14>] 
(__sys_sendmsg+0x3c/0x6c)
[10974.212521] [<80519c14>] (__sys_sendmsg) from [<8000f500>] 
(ret_fast_syscall+0x0/0x34)
[10974.220441] Code: bad PC value
[10974.223502] ---[ end trace 7f5c4ba07462311f ]---

Re: [PATCH 23/31] huge tmpfs recovery: framework for reconstituting huge pages

2016-04-06 Thread Mika Penttilä

On 04/06/2016 12:53 AM, Hugh Dickins wrote:



> +static void shmem_recovery_work(struct work_struct *work)
> +{
> + struct recovery *recovery;
> + struct shmem_inode_info *info;
> + struct address_space *mapping;
> + struct page *page;
> + struct page *head = NULL;
> + int error = -ENOENT;
> +
> + recovery = container_of(work, struct recovery, work);
> + info = SHMEM_I(recovery->inode);
> + if (!shmem_work_still_useful(recovery)) {
> + shr_stats(work_too_late);
> + goto out;
> + }
> +
> + /* Are we resuming from an earlier partially successful attempt? */
> + mapping = recovery->inode->i_mapping;
> + spin_lock_irq(>tree_lock);
> + page = shmem_clear_tag_hugehole(mapping, recovery->head_index);
> + if (page)
> + head = team_head(page);
> + spin_unlock_irq(>tree_lock);
> + if (head) {
> + /* Serialize with shrinker so it won't mess with our range */
> + spin_lock(_shrinklist_lock);
> + spin_unlock(_shrinklist_lock);
> + }
> +
> + /* If team is now complete, no tag and head would be found above */
> + page = recovery->page;
> + if (PageTeam(page))
> + head = team_head(page);
> +
> + /* Get a reference to the head of the team already being assembled */
> + if (head) {
> + if (!get_page_unless_zero(head))
> + head = NULL;
> + else if (!PageTeam(head) || head->mapping != mapping ||
> + head->index != recovery->head_index) {
> + put_page(head);
> + head = NULL;
> + }
> + }
> +
> + if (head) {
> + /* We are resuming work from a previous partial recovery */
> + if (PageTeam(page))
> + shr_stats(resume_teamed);
> + else
> + shr_stats(resume_tagged);
> + } else {
> + gfp_t gfp = mapping_gfp_mask(mapping);
> + /*
> +  * XXX: Note that with swapin readahead, page_to_nid(page) will
> +  * often choose an unsuitable NUMA node: something to fix soon,
> +  * but not an immediate blocker.
> +  */
> + head = __alloc_pages_node(page_to_nid(page),
> + gfp | __GFP_NOWARN | __GFP_THISNODE, HPAGE_PMD_ORDER);  
>  
> + if (!head) {
> + shr_stats(huge_failed);
> + error = -ENOMEM;
> + goto out;
> + }

Should this head marked PageTeam? Because in patch 27/31 when given as a hint 
to shmem_getpage_gfp() :

hugehint = NULL;
+   if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+   sgp == SGP_TEAM && *pagep) {
+   struct page *head;
+
+   if (!get_page_unless_zero(*pagep)) {
+   error = -ENOENT;
+   goto decused;
+   }
+   page = *pagep;
+   lock_page(page);
+   head = page - (index & (HPAGE_PMD_NR-1)); 

we fail always because :
+   if (!PageTeam(head)) {
+   error = -ENOENT;
+   goto decused;
+   }


> + if (!shmem_work_still_useful(recovery)) {
> + __free_pages(head, HPAGE_PMD_ORDER);
> + shr_stats(huge_too_late);
> + goto out;
> + }
> + split_page(head, HPAGE_PMD_ORDER);
> + get_page(head);
> + shr_stats(huge_alloced);
> + }


Thanks,
Mika

Re: [PATCH 23/31] huge tmpfs recovery: framework for reconstituting huge pages

2016-04-06 Thread Mika Penttilä

On 04/06/2016 12:53 AM, Hugh Dickins wrote:



> +static void shmem_recovery_work(struct work_struct *work)
> +{
> + struct recovery *recovery;
> + struct shmem_inode_info *info;
> + struct address_space *mapping;
> + struct page *page;
> + struct page *head = NULL;
> + int error = -ENOENT;
> +
> + recovery = container_of(work, struct recovery, work);
> + info = SHMEM_I(recovery->inode);
> + if (!shmem_work_still_useful(recovery)) {
> + shr_stats(work_too_late);
> + goto out;
> + }
> +
> + /* Are we resuming from an earlier partially successful attempt? */
> + mapping = recovery->inode->i_mapping;
> + spin_lock_irq(>tree_lock);
> + page = shmem_clear_tag_hugehole(mapping, recovery->head_index);
> + if (page)
> + head = team_head(page);
> + spin_unlock_irq(>tree_lock);
> + if (head) {
> + /* Serialize with shrinker so it won't mess with our range */
> + spin_lock(_shrinklist_lock);
> + spin_unlock(_shrinklist_lock);
> + }
> +
> + /* If team is now complete, no tag and head would be found above */
> + page = recovery->page;
> + if (PageTeam(page))
> + head = team_head(page);
> +
> + /* Get a reference to the head of the team already being assembled */
> + if (head) {
> + if (!get_page_unless_zero(head))
> + head = NULL;
> + else if (!PageTeam(head) || head->mapping != mapping ||
> + head->index != recovery->head_index) {
> + put_page(head);
> + head = NULL;
> + }
> + }
> +
> + if (head) {
> + /* We are resuming work from a previous partial recovery */
> + if (PageTeam(page))
> + shr_stats(resume_teamed);
> + else
> + shr_stats(resume_tagged);
> + } else {
> + gfp_t gfp = mapping_gfp_mask(mapping);
> + /*
> +  * XXX: Note that with swapin readahead, page_to_nid(page) will
> +  * often choose an unsuitable NUMA node: something to fix soon,
> +  * but not an immediate blocker.
> +  */
> + head = __alloc_pages_node(page_to_nid(page),
> + gfp | __GFP_NOWARN | __GFP_THISNODE, HPAGE_PMD_ORDER);  
>  
> + if (!head) {
> + shr_stats(huge_failed);
> + error = -ENOMEM;
> + goto out;
> + }

Should this head marked PageTeam? Because in patch 27/31 when given as a hint 
to shmem_getpage_gfp() :

hugehint = NULL;
+   if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+   sgp == SGP_TEAM && *pagep) {
+   struct page *head;
+
+   if (!get_page_unless_zero(*pagep)) {
+   error = -ENOENT;
+   goto decused;
+   }
+   page = *pagep;
+   lock_page(page);
+   head = page - (index & (HPAGE_PMD_NR-1)); 

we fail always because :
+   if (!PageTeam(head)) {
+   error = -ENOENT;
+   goto decused;
+   }


> + if (!shmem_work_still_useful(recovery)) {
> + __free_pages(head, HPAGE_PMD_ORDER);
> + shr_stats(huge_too_late);
> + goto out;
> + }
> + split_page(head, HPAGE_PMD_ORDER);
> + get_page(head);
> + shr_stats(huge_alloced);
> + }


Thanks,
Mika

Re: [PATCH v14] x86, mce: Add memcpy_mcsafe()

2016-03-10 Thread Mika Penttilä



On 18.02.2016 21:47, Tony Luck wrote:
> Make use of the EXTABLE_FAULT exception table entries to write
> a kernel copy routine that doesn't crash the system if it
> encounters a machine check. Prime use case for this is to copy
> from large arrays of non-volatile memory used as storage.
>
> We have to use an unrolled copy loop for now because current
> hardware implementations treat a machine check in "rep mov"
> as fatal. When that is fixed we can simplify.
>
> Signed-off-by: Tony Luck 
> ---
>
> Is this what we want now?  Return type is a "bool". True means
> that we copied OK, false means that it didn't (this is all that
> Dan says that he needs).  Dropped all the complex code to figure
> out how many bytes we didn't copy as Linus says this isn't the
> right place to do this (and besides we should just make "rep mov"
>
> +
> + /* Copy successful. Return true */
> +.L_done_memcpy_trap:
> + xorq %rax, %rax
> + ret
> +ENDPROC(memcpy_mcsafe)
> +
> + .section .fixup, "ax"
> + /* Return false for any failure */
> +.L_memcpy_mcsafe_fail:
> + mov $1, %rax
> + ret
> +
>
But you return 0 == false for success and 1 == true for failure.

--Mika

Re: [PATCH v14] x86, mce: Add memcpy_mcsafe()

2016-03-10 Thread Mika Penttilä



On 18.02.2016 21:47, Tony Luck wrote:
> Make use of the EXTABLE_FAULT exception table entries to write
> a kernel copy routine that doesn't crash the system if it
> encounters a machine check. Prime use case for this is to copy
> from large arrays of non-volatile memory used as storage.
>
> We have to use an unrolled copy loop for now because current
> hardware implementations treat a machine check in "rep mov"
> as fatal. When that is fixed we can simplify.
>
> Signed-off-by: Tony Luck 
> ---
>
> Is this what we want now?  Return type is a "bool". True means
> that we copied OK, false means that it didn't (this is all that
> Dan says that he needs).  Dropped all the complex code to figure
> out how many bytes we didn't copy as Linus says this isn't the
> right place to do this (and besides we should just make "rep mov"
>
> +
> + /* Copy successful. Return true */
> +.L_done_memcpy_trap:
> + xorq %rax, %rax
> + ret
> +ENDPROC(memcpy_mcsafe)
> +
> + .section .fixup, "ax"
> + /* Return false for any failure */
> +.L_memcpy_mcsafe_fail:
> + mov $1, %rax
> + ret
> +
>
But you return 0 == false for success and 1 == true for failure.

--Mika

[BUG] 4.5-rc7 af unix related

2016-03-08 Thread Mika Penttilä


Have got some of these same looking crashes after 4.4 (maybe before also, not 
sure). Very random, not easy to reproduce.

--Mika



[ 1999.948171] Unable to handle kernel NULL pointer dereference at virtual 
address 
[ 1999.955740] pgd = a8ba4000
[ 1999.958264] [] *pgd=38ca7831, *pte=, *ppte=
[ 1999.964118] Internal error: Oops: 8007 [#1] PREEMPT SMP ARM
[ 1999.969600] Modules linked in: btwilink radio_quantek st_drv
[ 1999.974895] CPU: 0 PID: 335 Comm: compositor Not tainted 4.5.0-rc7 #26
[ 1999.980939] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[ 1999.986983] task: a8da8fc0 ti: a8ce8000 task.ti: a8ce8000
[ 1999.991983] PC is at 0x0
[ 1999.994343] LR is at __wake_up_common+0x4c/0x80
[ 1999.998540] pc : [<>]lr : [<80058934>]psr: 800e00b3
[ 1999.998540] sp : a8ce9d98  ip : a8a31d28  fp : 0001
[ 2000.009164] r10: 0001  r9 : 0001  r8 : 00c3
[ 2000.014002] r7 : 9c641944  r6 : 0001  r5 : 0001  r4 : 80150007
[ 2000.020044] r3 : 00c3  r2 : 0001  r1 : 0001  r0 : a8a31d28
[ 2000.026088] Flags: Nzcv  IRQs off  FIQs on  Mode SVC_32  ISA Thumb  Segment 
user
[ 2000.032934] Control: 10c5387d  Table: 38ba404a  DAC: 0055
[ 2000.038254] Process compositor (pid: 335, stack limit = 0xa8ce8210)
[ 2000.044056] Stack: (0xa8ce9d98 to 0xa8cea000)
[ 2000.048091] 9d80:   
0001 9c641940
[ 2000.055663] 9da0: 0001 00c3 0001 a00e0013 0120 a9d9fc80 
a9d9fbcc 80058f1c
[ 2000.063236] 9dc0: 00c3 806f3630 a9d9f9c0  a9d9f9c0 a8a8e480 
0001 8052a4fc
[ 2000.070808] 9de0: 0120 805d3a6c a8ce9e08 0003 7eba18e0 0120 
a6047680 0001
[ 2000.078380] 9e00: a8ce9f74 a8ce9f6c     
 
[ 2000.085952] 9e20: a8ce9e54 a8ce9f6c 4040  a6047680  
a8ce9e58 
[ 2000.093524] 9e40:  8052724c a8ce9f6c 80527bd0 a8ce9eb4  
a80cd400 587c
[ 2000.101095] 9e60: 806f6b90 80282e2c a80cd400 0001 a80cd808 800e0193 
00b8879c 0120
[ 2000.108668] 9e80: a80cd400 80044708 80b85704 80b85704 0097c6d4 a84d1d5c 
80059020 
[ 2000.116240] 9ea0: a81c3e2c   0003 0001 8005902c 
a81c3e20 80058934
[ 2000.123811] 9ec0:  a81c3e28 600e0193  0003 0001 
a816ecc0 
[ 2000.131383] 9ee0: 0001 80b7e450 0002d6c8 802f67a4 a81c3c10 bb9c838b 
 0001
[ 2000.138955] 9f00: 0001  0125 a816ecc0 0001 600e0013 
a8ce9f2c 800458f4
[ 2000.146527] 9f20: 9c750c00 80102fbc a8ce9f68 a8ce9f64 4040 0128 
8000f7e4 a6047680
[ 2000.154099] 9f40: 7eba18f0 4040 0128 8000f7e4 a8ce8000 8052882c 
 00b8f1a0
[ 2000.161670] 9f60: 0107 0001 fff7   0001 
 
[ 2000.169241] 9f80: a8ce9e80    4040  
0001 00b8a040
[ 2000.176813] 9fa0: 00b88030 8000f640 0001 00b8a040 0024 7eba18f0 
4040 
[ 2000.184384] 9fc0: 0001 00b8a040 00b88030 0128 00b8b028 0002876c 
7eba18f0 
[ 2000.191956] 9fe0:  7eba18d8 76faa4c0 75c0e8f8 800e0010 0024 
3bf59811 3bf59c11
[ 2000.199541] [<80058934>] (__wake_up_common) from [<80058f1c>] 
(__wake_up_sync_key+0x44/0x60)
[ 2000.207363] [<80058f1c>] (__wake_up_sync_key) from [<8052a4fc>] 
(sock_def_readable+0x3c/0x6c)
[ 2000.215267] [<8052a4fc>] (sock_def_readable) from [<805d3a6c>] 
(unix_stream_sendmsg+0x154/0x340)
[ 2000.223414] [<805d3a6c>] (unix_stream_sendmsg) from [<8052724c>] 
(sock_sendmsg+0x14/0x24)
[ 2000.230989] [<8052724c>] (sock_sendmsg) from [<80527bd0>] 
(___sys_sendmsg+0x1d0/0x1d8)
[ 2000.238321] [<80527bd0>] (___sys_sendmsg) from [<8052882c>] 
(__sys_sendmsg+0x3c/0x6c)
[ 2000.245579] [<8052882c>] (__sys_sendmsg) from [<8000f640>] 
(ret_fast_syscall+0x0/0x34)
[ 2000.252911] Code: bad PC value
[ 2000.255744] ---[ end trace a385e81f19607805 ]---

[BUG] 4.5-rc7 af unix related

2016-03-08 Thread Mika Penttilä


Have got some of these same looking crashes after 4.4 (maybe before also, not 
sure). Very random, not easy to reproduce.

--Mika



[ 1999.948171] Unable to handle kernel NULL pointer dereference at virtual 
address 
[ 1999.955740] pgd = a8ba4000
[ 1999.958264] [] *pgd=38ca7831, *pte=, *ppte=
[ 1999.964118] Internal error: Oops: 8007 [#1] PREEMPT SMP ARM
[ 1999.969600] Modules linked in: btwilink radio_quantek st_drv
[ 1999.974895] CPU: 0 PID: 335 Comm: compositor Not tainted 4.5.0-rc7 #26
[ 1999.980939] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[ 1999.986983] task: a8da8fc0 ti: a8ce8000 task.ti: a8ce8000
[ 1999.991983] PC is at 0x0
[ 1999.994343] LR is at __wake_up_common+0x4c/0x80
[ 1999.998540] pc : [<>]lr : [<80058934>]psr: 800e00b3
[ 1999.998540] sp : a8ce9d98  ip : a8a31d28  fp : 0001
[ 2000.009164] r10: 0001  r9 : 0001  r8 : 00c3
[ 2000.014002] r7 : 9c641944  r6 : 0001  r5 : 0001  r4 : 80150007
[ 2000.020044] r3 : 00c3  r2 : 0001  r1 : 0001  r0 : a8a31d28
[ 2000.026088] Flags: Nzcv  IRQs off  FIQs on  Mode SVC_32  ISA Thumb  Segment 
user
[ 2000.032934] Control: 10c5387d  Table: 38ba404a  DAC: 0055
[ 2000.038254] Process compositor (pid: 335, stack limit = 0xa8ce8210)
[ 2000.044056] Stack: (0xa8ce9d98 to 0xa8cea000)
[ 2000.048091] 9d80:   
0001 9c641940
[ 2000.055663] 9da0: 0001 00c3 0001 a00e0013 0120 a9d9fc80 
a9d9fbcc 80058f1c
[ 2000.063236] 9dc0: 00c3 806f3630 a9d9f9c0  a9d9f9c0 a8a8e480 
0001 8052a4fc
[ 2000.070808] 9de0: 0120 805d3a6c a8ce9e08 0003 7eba18e0 0120 
a6047680 0001
[ 2000.078380] 9e00: a8ce9f74 a8ce9f6c     
 
[ 2000.085952] 9e20: a8ce9e54 a8ce9f6c 4040  a6047680  
a8ce9e58 
[ 2000.093524] 9e40:  8052724c a8ce9f6c 80527bd0 a8ce9eb4  
a80cd400 587c
[ 2000.101095] 9e60: 806f6b90 80282e2c a80cd400 0001 a80cd808 800e0193 
00b8879c 0120
[ 2000.108668] 9e80: a80cd400 80044708 80b85704 80b85704 0097c6d4 a84d1d5c 
80059020 
[ 2000.116240] 9ea0: a81c3e2c   0003 0001 8005902c 
a81c3e20 80058934
[ 2000.123811] 9ec0:  a81c3e28 600e0193  0003 0001 
a816ecc0 
[ 2000.131383] 9ee0: 0001 80b7e450 0002d6c8 802f67a4 a81c3c10 bb9c838b 
 0001
[ 2000.138955] 9f00: 0001  0125 a816ecc0 0001 600e0013 
a8ce9f2c 800458f4
[ 2000.146527] 9f20: 9c750c00 80102fbc a8ce9f68 a8ce9f64 4040 0128 
8000f7e4 a6047680
[ 2000.154099] 9f40: 7eba18f0 4040 0128 8000f7e4 a8ce8000 8052882c 
 00b8f1a0
[ 2000.161670] 9f60: 0107 0001 fff7   0001 
 
[ 2000.169241] 9f80: a8ce9e80    4040  
0001 00b8a040
[ 2000.176813] 9fa0: 00b88030 8000f640 0001 00b8a040 0024 7eba18f0 
4040 
[ 2000.184384] 9fc0: 0001 00b8a040 00b88030 0128 00b8b028 0002876c 
7eba18f0 
[ 2000.191956] 9fe0:  7eba18d8 76faa4c0 75c0e8f8 800e0010 0024 
3bf59811 3bf59c11
[ 2000.199541] [<80058934>] (__wake_up_common) from [<80058f1c>] 
(__wake_up_sync_key+0x44/0x60)
[ 2000.207363] [<80058f1c>] (__wake_up_sync_key) from [<8052a4fc>] 
(sock_def_readable+0x3c/0x6c)
[ 2000.215267] [<8052a4fc>] (sock_def_readable) from [<805d3a6c>] 
(unix_stream_sendmsg+0x154/0x340)
[ 2000.223414] [<805d3a6c>] (unix_stream_sendmsg) from [<8052724c>] 
(sock_sendmsg+0x14/0x24)
[ 2000.230989] [<8052724c>] (sock_sendmsg) from [<80527bd0>] 
(___sys_sendmsg+0x1d0/0x1d8)
[ 2000.238321] [<80527bd0>] (___sys_sendmsg) from [<8052882c>] 
(__sys_sendmsg+0x3c/0x6c)
[ 2000.245579] [<8052882c>] (__sys_sendmsg) from [<8000f640>] 
(ret_fast_syscall+0x0/0x34)
[ 2000.252911] Code: bad PC value
[ 2000.255744] ---[ end trace a385e81f19607805 ]---

Re: [RFC PATCH] x86: Make sure verify_cpu has a good stack

2016-03-02 Thread Mika Penttilä



On 02.03.2016 18:55, Borislav Petkov wrote:
> On Wed, Mar 02, 2016 at 06:38:15PM +0200, Mika Penttilä wrote:
>> I actually looked at it a while too...
>>
>> The
>>   movq stack_start - __START_KERNEL_map, %rsp
>>
>> turns into (objdump disassembly)
>>
>>   mov0x0,%rsp
>>
>> with relocation
>> 0004 R_X86_64_32S  stack_start+0x8000
>>
>> Now stack_start is at 81ef3380, so the relocation gives 1ef3380 
>> which would be correct, so why the
>> second subq ?
>>
>> You may explain :)
> Here it is :-)
>
> $ readelf -a vmlinux | grep stack_start
>  70526: 81cbabf8 0 NOTYPE  GLOBAL DEFAULT   14 stack_start
>
> 0x81cbabf8 - __START_KERNEL_map =
> 0x81cbabf8 - 0x8000 =
> 0x1cbabf8
>
> (gdb) x/x 0x1cbabf8
> 0x1cbabf8:  0x81c03ff8
>
> (You don't need gdb for that - you can hexdump or objdump vmlinux).
>
> Now stack_start is:
>
> GLOBAL(stack_start)
> .quad  init_thread_union+THREAD_SIZE-8
>
> which is
>
> $ readelf -a vmlinux | grep init_thread_union
>  82491: 81c0 16384 OBJECT  GLOBAL DEFAULT   14 init_thread_union
>
> so init_thread_union+THREAD_SIZE-8 = 0x81c0 + 4*4096-8 = 
> 0x81c03ff8
>
> So you have to subtract __START_KERNEL_map again because it has there a
> virtual address and we haven't enabled paging yet:
>
> 0x81c03ff8 - 0x8000 = 0x1c03ff8.
>
> Makes sense?
>
Ah missed completely that stack_start is effectively a pointer to stack..

Thanks,
Mika

Re: [RFC PATCH] x86: Make sure verify_cpu has a good stack

2016-03-02 Thread Mika Penttilä



On 02.03.2016 18:55, Borislav Petkov wrote:
> On Wed, Mar 02, 2016 at 06:38:15PM +0200, Mika Penttilä wrote:
>> I actually looked at it a while too...
>>
>> The
>>   movq stack_start - __START_KERNEL_map, %rsp
>>
>> turns into (objdump disassembly)
>>
>>   mov0x0,%rsp
>>
>> with relocation
>> 0004 R_X86_64_32S  stack_start+0x8000
>>
>> Now stack_start is at 81ef3380, so the relocation gives 1ef3380 
>> which would be correct, so why the
>> second subq ?
>>
>> You may explain :)
> Here it is :-)
>
> $ readelf -a vmlinux | grep stack_start
>  70526: 81cbabf8 0 NOTYPE  GLOBAL DEFAULT   14 stack_start
>
> 0x81cbabf8 - __START_KERNEL_map =
> 0x81cbabf8 - 0x8000 =
> 0x1cbabf8
>
> (gdb) x/x 0x1cbabf8
> 0x1cbabf8:  0x81c03ff8
>
> (You don't need gdb for that - you can hexdump or objdump vmlinux).
>
> Now stack_start is:
>
> GLOBAL(stack_start)
> .quad  init_thread_union+THREAD_SIZE-8
>
> which is
>
> $ readelf -a vmlinux | grep init_thread_union
>  82491: 81c0 16384 OBJECT  GLOBAL DEFAULT   14 init_thread_union
>
> so init_thread_union+THREAD_SIZE-8 = 0x81c0 + 4*4096-8 = 
> 0x81c03ff8
>
> So you have to subtract __START_KERNEL_map again because it has there a
> virtual address and we haven't enabled paging yet:
>
> 0x81c03ff8 - 0x8000 = 0x1c03ff8.
>
> Makes sense?
>
Ah missed completely that stack_start is effectively a pointer to stack..

Thanks,
Mika

Re: [RFC PATCH] x86: Make sure verify_cpu has a good stack

2016-03-02 Thread Mika Penttilä


On 02.03.2016 18:15, Borislav Petkov wrote:
> On Wed, Mar 02, 2016 at 05:55:14PM +0200, Mika Penttilä wrote:
>>> +   /* Setup a stack for verify_cpu */
>>> +   movqstack_start - __START_KERNEL_map, %rsp
>>> +   subq$__START_KERNEL_map, %rsp
>>> +
>> You subtract __START_KERNEL_map twice ?
> Yes. That's not very obvious and it took me a while. I probably should
> add a comment.
>
> Want to stare at it a little bit more and try to figure it out or should
> I explain?
>
> :-)
>

I actually looked at it a while too...

The
  movq stack_start - __START_KERNEL_map, %rsp

turns into (objdump disassembly)

  mov0x0,%rsp

with relocation
0004 R_X86_64_32S  stack_start+0x8000

Now stack_start is at 81ef3380, so the relocation gives 1ef3380 which 
would be correct, so why the
second subq ?

You may explain :)

--Mika

Re: [RFC PATCH] x86: Make sure verify_cpu has a good stack

2016-03-02 Thread Mika Penttilä


On 02.03.2016 18:15, Borislav Petkov wrote:
> On Wed, Mar 02, 2016 at 05:55:14PM +0200, Mika Penttilä wrote:
>>> +   /* Setup a stack for verify_cpu */
>>> +   movqstack_start - __START_KERNEL_map, %rsp
>>> +   subq$__START_KERNEL_map, %rsp
>>> +
>> You subtract __START_KERNEL_map twice ?
> Yes. That's not very obvious and it took me a while. I probably should
> add a comment.
>
> Want to stare at it a little bit more and try to figure it out or should
> I explain?
>
> :-)
>

I actually looked at it a while too...

The
  movq stack_start - __START_KERNEL_map, %rsp

turns into (objdump disassembly)

  mov0x0,%rsp

with relocation
0004 R_X86_64_32S  stack_start+0x8000

Now stack_start is at 81ef3380, so the relocation gives 1ef3380 which 
would be correct, so why the
second subq ?

You may explain :)

--Mika

Re: [RFC PATCH] x86: Make sure verify_cpu has a good stack

2016-03-02 Thread Mika Penttilä

On 02.03.2016 13:20, Borislav Petkov wrote:
> From: Borislav Petkov 
>
> 04633df0c43d ("x86/cpu: Call verify_cpu() after having entered long mode too")
> added the call to verify_cpu() for sanitizing CPU configuration.
>
> The latter uses the stack minimally and it can happen that we land in
> startup_64() directly from a 64-bit bootloader. Then we want to use our
> own, known good stack.
>
> Do that.
>
> APs don't need this as the trampoline sets up a stack for them.
>
> Reported-by: Tom Lendacky 
> Signed-off-by: Borislav Petkov 
> ---
>  arch/x86/kernel/head_64.S | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index 22fbf9df61bb..d60a044c2fdc 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -64,6 +64,10 @@ startup_64:
>* tables and then reload them.
>*/
>  
> + /* Setup a stack for verify_cpu */
> + movqstack_start - __START_KERNEL_map, %rsp
> + subq$__START_KERNEL_map, %rsp
> +

 
You subtract __START_KERNEL_map twice ?

--Mika

Re: [RFC PATCH] x86: Make sure verify_cpu has a good stack

2016-03-02 Thread Mika Penttilä

On 02.03.2016 13:20, Borislav Petkov wrote:
> From: Borislav Petkov 
>
> 04633df0c43d ("x86/cpu: Call verify_cpu() after having entered long mode too")
> added the call to verify_cpu() for sanitizing CPU configuration.
>
> The latter uses the stack minimally and it can happen that we land in
> startup_64() directly from a 64-bit bootloader. Then we want to use our
> own, known good stack.
>
> Do that.
>
> APs don't need this as the trampoline sets up a stack for them.
>
> Reported-by: Tom Lendacky 
> Signed-off-by: Borislav Petkov 
> ---
>  arch/x86/kernel/head_64.S | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index 22fbf9df61bb..d60a044c2fdc 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -64,6 +64,10 @@ startup_64:
>* tables and then reload them.
>*/
>  
> + /* Setup a stack for verify_cpu */
> + movqstack_start - __START_KERNEL_map, %rsp
> + subq$__START_KERNEL_map, %rsp
> +

 
You subtract __START_KERNEL_map twice ?

--Mika

Re: [REGRESSION, bisected] 4.5rc4 sound fsl-soc

2016-02-20 Thread Mika Penttilä

Mark,

I can confirm that the patch:

https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/drivers/base/regmap/regcache.c?id=3245d460a1eb55b5c3ca31dde7b5c5ac71546edf


solved the audio codec probing issue for me. Without it there's no sound on 
imx6 with 4.5-rc4.

Please apply.

Thanks,
Mika

On 20.02.2016 22:45, Fabio Estevam wrote:
> Hi Mika,
>
> Did it work for you?
>
> If so, please ask in the mailing list for Mark Brown to apply that patch.
>
> Thanks
>
> ____
> From: Mika Penttilä <mika.pentt...@nextfour.com>
> Sent: Monday, February 15, 2016 11:00 AM
> To: Fabio Estevam
> Subject: Re: [REGRESSION, bisected] 4.5rc4 sound fsl-soc
>
> On 02/15/2016 01:30 PM, Fabio Estevam wrote:
>> [Sorry for the top post, can't reply properly from this Inbox]
> Hi,
>
> will test that tomorrow morning EET.
>
>
> thanks,
> Mika
>
>
>> Could you please try applying this commit from linux-next?
>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/drivers/base/regmap/regcache.c?id=3245d460a1eb55b5c3ca31dde7b5c5ac71546edf
>>
>> Thanks,
>>
>> Fabio Estevam
>>
>> 
>> From: Mika Penttilä <mika.pentt...@nextfour.com>
>> Sent: Monday, February 15, 2016 9:25 AM
>> To: LKML; m...@maciej.szmigiero.name; Fabio Estevam
>> Subject: [REGRESSION, bisected] 4.5rc4 sound fsl-soc
>>
>> Hi,
>>
>> The following commit :
>>
>> 5c408fee254633a5be69505bc86c6b034f871ab4 is the first bad commit
>> commit 5c408fee254633a5be69505bc86c6b034f871ab4
>> Author: Maciej S. Szmigiero <m...@maciej.szmigiero.name>
>> Date:   Mon Jan 18 20:07:44 2016 +0100
>>
>> ASoC: fsl_ssi: remove explicit register defaults
>>
>> There is no guarantee that on fsl_ssi module load
>> SSI registers will have their power-on-reset values.
>>
>> In fact, if the driver is reloaded the values in
>> registers will be whatever they were set to previously.
>>
>> However, the cache needs to be fully populated at probe
>> time to avoid non-atomic allocations during register
>> access.
>>
>> Special case here is imx21-class SSI, since
>> according to datasheet it don't have SACC{ST,EN,DIS}
>> regs.
>>
>> This fixes hard lockup on fsl_ssi module reload,
>> at least in AC'97 mode.
>>
>> Fixes: 05cf237972fe ("ASoC: fsl_ssi: Add driver suspend and resume to 
>> support MEGA Fast")
>> Signed-off-by: Maciej S. Szmigiero <m...@maciej.szmigiero.name>
>> Tested-by: Fabio Estevam <fabio.este...@nxp.com>
>> Signed-off-by: Mark Brown <broo...@kernel.org>
>>
>>
>> causes regmap init failure when loading the sgtl5000 codec on imx6q, and
>> leads to no audio.
>>
>> With the mentioned patch reverted sound works ok.
>>
>> --Mika
>>

Re: [REGRESSION, bisected] 4.5rc4 sound fsl-soc

2016-02-20 Thread Mika Penttilä

Mark,

I can confirm that the patch:

https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/drivers/base/regmap/regcache.c?id=3245d460a1eb55b5c3ca31dde7b5c5ac71546edf


solved the audio codec probing issue for me. Without it there's no sound on 
imx6 with 4.5-rc4.

Please apply.

Thanks,
Mika

On 20.02.2016 22:45, Fabio Estevam wrote:
> Hi Mika,
>
> Did it work for you?
>
> If so, please ask in the mailing list for Mark Brown to apply that patch.
>
> Thanks
>
> ____
> From: Mika Penttilä 
> Sent: Monday, February 15, 2016 11:00 AM
> To: Fabio Estevam
> Subject: Re: [REGRESSION, bisected] 4.5rc4 sound fsl-soc
>
> On 02/15/2016 01:30 PM, Fabio Estevam wrote:
>> [Sorry for the top post, can't reply properly from this Inbox]
> Hi,
>
> will test that tomorrow morning EET.
>
>
> thanks,
> Mika
>
>
>> Could you please try applying this commit from linux-next?
>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/drivers/base/regmap/regcache.c?id=3245d460a1eb55b5c3ca31dde7b5c5ac71546edf
>>
>> Thanks,
>>
>> Fabio Estevam
>>
>> 
>> From: Mika Penttilä 
>> Sent: Monday, February 15, 2016 9:25 AM
>> To: LKML; m...@maciej.szmigiero.name; Fabio Estevam
>> Subject: [REGRESSION, bisected] 4.5rc4 sound fsl-soc
>>
>> Hi,
>>
>> The following commit :
>>
>> 5c408fee254633a5be69505bc86c6b034f871ab4 is the first bad commit
>> commit 5c408fee254633a5be69505bc86c6b034f871ab4
>> Author: Maciej S. Szmigiero 
>> Date:   Mon Jan 18 20:07:44 2016 +0100
>>
>> ASoC: fsl_ssi: remove explicit register defaults
>>
>> There is no guarantee that on fsl_ssi module load
>> SSI registers will have their power-on-reset values.
>>
>> In fact, if the driver is reloaded the values in
>> registers will be whatever they were set to previously.
>>
>> However, the cache needs to be fully populated at probe
>> time to avoid non-atomic allocations during register
>> access.
>>
>> Special case here is imx21-class SSI, since
>> according to datasheet it don't have SACC{ST,EN,DIS}
>> regs.
>>
>> This fixes hard lockup on fsl_ssi module reload,
>> at least in AC'97 mode.
>>
>> Fixes: 05cf237972fe ("ASoC: fsl_ssi: Add driver suspend and resume to 
>> support MEGA Fast")
>> Signed-off-by: Maciej S. Szmigiero 
>> Tested-by: Fabio Estevam 
>> Signed-off-by: Mark Brown 
>>
>>
>> causes regmap init failure when loading the sgtl5000 codec on imx6q, and
>> leads to no audio.
>>
>> With the mentioned patch reverted sound works ok.
>>
>> --Mika
>>

[REGRESSION, bisected] 4.5rc4 sound fsl-soc

2016-02-15 Thread Mika Penttilä

Hi,

The following commit :

5c408fee254633a5be69505bc86c6b034f871ab4 is the first bad commit
commit 5c408fee254633a5be69505bc86c6b034f871ab4
Author: Maciej S. Szmigiero 
Date:   Mon Jan 18 20:07:44 2016 +0100

ASoC: fsl_ssi: remove explicit register defaults

There is no guarantee that on fsl_ssi module load
SSI registers will have their power-on-reset values.

In fact, if the driver is reloaded the values in
registers will be whatever they were set to previously.

However, the cache needs to be fully populated at probe
time to avoid non-atomic allocations during register
access.

Special case here is imx21-class SSI, since
according to datasheet it don't have SACC{ST,EN,DIS}
regs.

This fixes hard lockup on fsl_ssi module reload,
at least in AC'97 mode.

Fixes: 05cf237972fe ("ASoC: fsl_ssi: Add driver suspend and resume to 
support MEGA Fast")
Signed-off-by: Maciej S. Szmigiero 
Tested-by: Fabio Estevam 
Signed-off-by: Mark Brown 


causes regmap init failure when loading the sgtl5000 codec on imx6q, and
leads to no audio.

With the mentioned patch reverted sound works ok.

--Mika

[REGRESSION, bisected] 4.5rc4 sound fsl-soc

2016-02-15 Thread Mika Penttilä

Hi,

The following commit :

5c408fee254633a5be69505bc86c6b034f871ab4 is the first bad commit
commit 5c408fee254633a5be69505bc86c6b034f871ab4
Author: Maciej S. Szmigiero 
Date:   Mon Jan 18 20:07:44 2016 +0100

ASoC: fsl_ssi: remove explicit register defaults

There is no guarantee that on fsl_ssi module load
SSI registers will have their power-on-reset values.

In fact, if the driver is reloaded the values in
registers will be whatever they were set to previously.

However, the cache needs to be fully populated at probe
time to avoid non-atomic allocations during register
access.

Special case here is imx21-class SSI, since
according to datasheet it don't have SACC{ST,EN,DIS}
regs.

This fixes hard lockup on fsl_ssi module reload,
at least in AC'97 mode.

Fixes: 05cf237972fe ("ASoC: fsl_ssi: Add driver suspend and resume to 
support MEGA Fast")
Signed-off-by: Maciej S. Szmigiero 
Tested-by: Fabio Estevam 
Signed-off-by: Mark Brown 


causes regmap init failure when loading the sgtl5000 codec on imx6q, and
leads to no audio.

With the mentioned patch reverted sound works ok.

--Mika

Re: [PATCHv2 2/2] x86: SROP mitigation: implement signal cookies

2016-02-06 Thread Mika Penttilä

Hi,


On 07.02.2016 01:39, Scott Bauer wrote:
> This patch adds SROP mitigation logic to the x86 signal delivery
> and sigreturn code. The cookie is placed in the unused alignment
> space above the saved FP state, if it exists. If there is no FP
> state to save then the cookie is placed in the alignment space above
> the sigframe.
>
> Cc: Abhiram Balasubramanian 
> Signed-off-by: Scott Bauer 
> ---
>  arch/x86/ia32/ia32_signal.c| 63 +---
>  arch/x86/include/asm/fpu/signal.h  |  1 +
>  arch/x86/include/asm/sighandling.h |  5 ++-
>  arch/x86/kernel/fpu/signal.c   | 10 +
>  arch/x86/kernel/signal.c   | 86 
> +-
>  5 files changed, 146 insertions(+), 19 deletions(-)
>
> diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
> index 0552884..2751f47 100644
> --- a/arch/x86/ia32/ia32_signal.c
> +++ b/arch/x86/ia32/ia32_signal.c
> @@ -68,7 +68,8 @@
>  }
>  
>  static int ia32_restore_sigcontext(struct pt_regs *regs,
> -struct sigcontext_32 __user *sc)
> +struct sigcontext_32 __user *sc,
> +void __user **user_cookie)
>  {
>   unsigned int tmpflags, err = 0;
>   void __user *buf;
> @@ -105,6 +106,16 @@ static int ia32_restore_sigcontext(struct pt_regs *regs,
>   buf = compat_ptr(tmp);
>   } get_user_catch(err);
>  
> + /*
> +  * If there is fp state get cookie from the top of the fp state,
> +  * else get it from the top of the sig frame.
> +  */
> +
> + if (tmp != 0)
> + *user_cookie = compat_ptr(tmp + fpu__getsize(1));
> + else
> + *user_cookie = NULL;
> +
>   err |= fpu__restore_sig(buf, 1);
>  
>   force_iret();
> @@ -117,6 +128,7 @@ asmlinkage long sys32_sigreturn(void)
>   struct pt_regs *regs = current_pt_regs();
>   struct sigframe_ia32 __user *frame = (struct sigframe_ia32 __user 
> *)(regs->sp-8);
>   sigset_t set;
> + void __user *user_cookie;
>  
>   if (!access_ok(VERIFY_READ, frame, sizeof(*frame)))
>   goto badframe;
> @@ -129,8 +141,15 @@ asmlinkage long sys32_sigreturn(void)
>  
>   set_current_blocked();
>  
> - if (ia32_restore_sigcontext(regs, >sc))
> + if (ia32_restore_sigcontext(regs, >sc, _cookie))
> + goto badframe;
> +
> + if (user_cookie == NULL)
> + user_cookie = compat_ptr((regs->sp - 8) + sizeof(*frame));
> +
> + if (verify_clear_sigcookie(user_cookie))
>   goto badframe;
> +
>   return regs->ax;
>  
>  badframe:
> @@ -142,6 +161,7 @@ asmlinkage long sys32_rt_sigreturn(void)
>  {
>   struct pt_regs *regs = current_pt_regs();
>   struct rt_sigframe_ia32 __user *frame;
> + void __user *user_cookie;
>   sigset_t set;
>  
>   frame = (struct rt_sigframe_ia32 __user *)(regs->sp - 4);
> @@ -153,7 +173,13 @@ asmlinkage long sys32_rt_sigreturn(void)
>  
>   set_current_blocked();
>  
> - if (ia32_restore_sigcontext(regs, >uc.uc_mcontext))
> + if (ia32_restore_sigcontext(regs, >uc.uc_mcontext, _cookie))
> + goto badframe;
> +
> + if (user_cookie == NULL)
> + user_cookie = (void __user *)((regs->sp - 4) + sizeof(*frame));
regs->sp is already restored so you should use frame instead.

--Mika

Re: [PATCHv2 2/2] x86: SROP mitigation: implement signal cookies

2016-02-06 Thread Mika Penttilä

Hi,


On 07.02.2016 01:39, Scott Bauer wrote:
> This patch adds SROP mitigation logic to the x86 signal delivery
> and sigreturn code. The cookie is placed in the unused alignment
> space above the saved FP state, if it exists. If there is no FP
> state to save then the cookie is placed in the alignment space above
> the sigframe.
>
> Cc: Abhiram Balasubramanian 
> Signed-off-by: Scott Bauer 
> ---
>  arch/x86/ia32/ia32_signal.c| 63 +---
>  arch/x86/include/asm/fpu/signal.h  |  1 +
>  arch/x86/include/asm/sighandling.h |  5 ++-
>  arch/x86/kernel/fpu/signal.c   | 10 +
>  arch/x86/kernel/signal.c   | 86 
> +-
>  5 files changed, 146 insertions(+), 19 deletions(-)
>
> diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
> index 0552884..2751f47 100644
> --- a/arch/x86/ia32/ia32_signal.c
> +++ b/arch/x86/ia32/ia32_signal.c
> @@ -68,7 +68,8 @@
>  }
>  
>  static int ia32_restore_sigcontext(struct pt_regs *regs,
> -struct sigcontext_32 __user *sc)
> +struct sigcontext_32 __user *sc,
> +void __user **user_cookie)
>  {
>   unsigned int tmpflags, err = 0;
>   void __user *buf;
> @@ -105,6 +106,16 @@ static int ia32_restore_sigcontext(struct pt_regs *regs,
>   buf = compat_ptr(tmp);
>   } get_user_catch(err);
>  
> + /*
> +  * If there is fp state get cookie from the top of the fp state,
> +  * else get it from the top of the sig frame.
> +  */
> +
> + if (tmp != 0)
> + *user_cookie = compat_ptr(tmp + fpu__getsize(1));
> + else
> + *user_cookie = NULL;
> +
>   err |= fpu__restore_sig(buf, 1);
>  
>   force_iret();
> @@ -117,6 +128,7 @@ asmlinkage long sys32_sigreturn(void)
>   struct pt_regs *regs = current_pt_regs();
>   struct sigframe_ia32 __user *frame = (struct sigframe_ia32 __user 
> *)(regs->sp-8);
>   sigset_t set;
> + void __user *user_cookie;
>  
>   if (!access_ok(VERIFY_READ, frame, sizeof(*frame)))
>   goto badframe;
> @@ -129,8 +141,15 @@ asmlinkage long sys32_sigreturn(void)
>  
>   set_current_blocked();
>  
> - if (ia32_restore_sigcontext(regs, >sc))
> + if (ia32_restore_sigcontext(regs, >sc, _cookie))
> + goto badframe;
> +
> + if (user_cookie == NULL)
> + user_cookie = compat_ptr((regs->sp - 8) + sizeof(*frame));
> +
> + if (verify_clear_sigcookie(user_cookie))
>   goto badframe;
> +
>   return regs->ax;
>  
>  badframe:
> @@ -142,6 +161,7 @@ asmlinkage long sys32_rt_sigreturn(void)
>  {
>   struct pt_regs *regs = current_pt_regs();
>   struct rt_sigframe_ia32 __user *frame;
> + void __user *user_cookie;
>   sigset_t set;
>  
>   frame = (struct rt_sigframe_ia32 __user *)(regs->sp - 4);
> @@ -153,7 +173,13 @@ asmlinkage long sys32_rt_sigreturn(void)
>  
>   set_current_blocked();
>  
> - if (ia32_restore_sigcontext(regs, >uc.uc_mcontext))
> + if (ia32_restore_sigcontext(regs, >uc.uc_mcontext, _cookie))
> + goto badframe;
> +
> + if (user_cookie == NULL)
> + user_cookie = (void __user *)((regs->sp - 4) + sizeof(*frame));
regs->sp is already restored so you should use frame instead.

--Mika

Re: [PATCH v2 RESEND 1/2] arm, arm64: change_memory_common with numpages == 0 should be no-op.

2016-01-27 Thread Mika Penttilä


Hi Will,

On 26.01.2016 17:59, Will Deacon wrote:
> Hi Mika,
>
> On Tue, Jan 26, 2016 at 04:59:52PM +0200, mika.pentt...@nextfour.com wrote:
>> From: Mika Penttilä 
>>
>> This makes the caller set_memory_xx() consistent with x86.
>>
>> arm64 part is rebased on 4.5.0-rc1 with Ard's patch
>>  
>> lkml.kernel.org/g/<1453125665-26627-1-git-send-email-ard.biesheu...@linaro.org>
>> applied.
>>
>> Signed-off-by: Mika Penttilä mika.pentt...@nextfour.com
>> Reviewed-by: Laura Abbott 
>> Acked-by: David Rientjes 
>>
>> ---
>>  arch/arm/mm/pageattr.c   | 3 +++
>>  arch/arm64/mm/pageattr.c | 3 +++
>>  2 files changed, 6 insertions(+)
>>
>> diff --git a/arch/arm/mm/pageattr.c b/arch/arm/mm/pageattr.c
>> index cf30daf..d19b1ad 100644
>> --- a/arch/arm/mm/pageattr.c
>> +++ b/arch/arm/mm/pageattr.c
>> @@ -49,6 +49,9 @@ static int change_memory_common(unsigned long addr, int 
>> numpages,
>>  WARN_ON_ONCE(1);
>>  }
>>  
>> +if (!numpages)
>> +return 0;
>> +
>>  if (start < MODULES_VADDR || start >= MODULES_END)
>>  return -EINVAL;
>>  
>> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
>> index 1360a02..b582fc2 100644
>> --- a/arch/arm64/mm/pageattr.c
>> +++ b/arch/arm64/mm/pageattr.c
>> @@ -53,6 +53,9 @@ static int change_memory_common(unsigned long addr, int 
>> numpages,
>>  WARN_ON_ONCE(1);
>>  }
>>  
>> +if (!numpages)
>> +return 0;
>> +
> Thanks for this. I can reproduce the failure on my Juno board, so I'd
> like to queue this for 4.5 since it fixes a real issue. I've taken the
> liberty of rebasing the arm64 part to my fixes branch and writing a
> commit message. Does the patch below look ok to you?
>
> Will
>
> --->8
>
> From 57adec866c0440976c96a4b8f5b59fb411b1cacb Mon Sep 17 00:00:00 2001
> From: =?UTF-8?q?Mika=20Penttil=C3=A4?= 
> Date: Tue, 26 Jan 2016 15:47:25 +
> Subject: [PATCH] arm64: mm: avoid calling apply_to_page_range on empty range
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> Calling apply_to_page_range with an empty range results in a BUG_ON
> from the core code. This can be triggered by trying to load the st_drv
> module with CONFIG_DEBUG_SET_MODULE_RONX enabled:
>
>   kernel BUG at mm/memory.c:1874!
>   Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
>   Modules linked in:
>   CPU: 3 PID: 1764 Comm: insmod Not tainted 4.5.0-rc1+ #2
>   Hardware name: ARM Juno development board (r0) (DT)
>   task: ffc9763b8000 ti: ffc975af8000 task.ti: ffc975af8000
>   PC is at apply_to_page_range+0x2cc/0x2d0
>   LR is at change_memory_common+0x80/0x108
>
> This patch fixes the issue by making change_memory_common (called by the
> set_memory_* functions) a NOP when numpages == 0, therefore avoiding the
> erroneous call to apply_to_page_range and bringing us into line with x86
> and s390.
>
> Cc: 
> Reviewed-by: Laura Abbott 
> Acked-by: David Rientjes 
> Signed-off-by: Mika Penttilä 
> Signed-off-by: Will Deacon 
> ---
>  arch/arm64/mm/pageattr.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
> index 3571c7309c5e..cf6240741134 100644
> --- a/arch/arm64/mm/pageattr.c
> +++ b/arch/arm64/mm/pageattr.c
> @@ -57,6 +57,9 @@ static int change_memory_common(unsigned long addr, int 
> numpages,
>   if (end < MODULES_VADDR || end >= MODULES_END)
>   return -EINVAL;
>  
> + if (!numpages)
> + return 0;
> +
>   data.set_mask = set_mask;
>   data.clear_mask = clear_mask;
>  

Yes I'm fine with that,
Thanks!
Mika

1 2 >

1 - 100 of 153 matches

Mail list logo