Re: [PATCH 1/3] arm64: ptrace: Add is_syscall_success to handle compat

2021-04-20 Thread He Zhe



On 4/19/21 8:19 PM, Will Deacon wrote:
> On Fri, Apr 16, 2021 at 02:34:41PM +0100, Mark Rutland wrote:
>> On Fri, Apr 16, 2021 at 01:33:22PM +0100, Catalin Marinas wrote:
>>> On Fri, Apr 16, 2021 at 03:55:31PM +0800, He Zhe wrote:
>>>> The general version of is_syscall_success does not handle 32-bit
>>>> compatible case, which would cause 32-bit negative return code to be
>>>> recoganized as a positive number later and seen as a "success".
>>>>
>>>> Since is_compat_thread is defined in compat.h, implementing
>>>> is_syscall_success in ptrace.h would introduce build failure due to
>>>> recursive inclusion of some basic headers like mutex.h. We put the
>>>> implementation to ptrace.c
>>>>
>>>> Signed-off-by: He Zhe 
>>>> ---
>>>>  arch/arm64/include/asm/ptrace.h |  3 +++
>>>>  arch/arm64/kernel/ptrace.c  | 10 ++
>>>>  2 files changed, 13 insertions(+)
>>>>
>>>> diff --git a/arch/arm64/include/asm/ptrace.h 
>>>> b/arch/arm64/include/asm/ptrace.h
>>>> index e58bca832dff..3c415e9e5d85 100644
>>>> --- a/arch/arm64/include/asm/ptrace.h
>>>> +++ b/arch/arm64/include/asm/ptrace.h
>>>> @@ -328,6 +328,9 @@ static inline void regs_set_return_value(struct 
>>>> pt_regs *regs, unsigned long rc)
>>>>regs->regs[0] = rc;
>>>>  }
>>>>  
>>>> +extern inline int is_syscall_success(struct pt_regs *regs);
>>>> +#define is_syscall_success(regs) is_syscall_success(regs)
>>>> +
>>>>  /**
>>>>   * regs_get_kernel_argument() - get Nth function argument in kernel
>>>>   * @regs: pt_regs of that context
>>>> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
>>>> index 170f42fd6101..3266201f8c60 100644
>>>> --- a/arch/arm64/kernel/ptrace.c
>>>> +++ b/arch/arm64/kernel/ptrace.c
>>>> @@ -1909,3 +1909,13 @@ int valid_user_regs(struct user_pt_regs *regs, 
>>>> struct task_struct *task)
>>>>else
>>>>return valid_native_regs(regs);
>>>>  }
>>>> +
>>>> +inline int is_syscall_success(struct pt_regs *regs)
>>>> +{
>>>> +  unsigned long val = regs->regs[0];
>>>> +
>>>> +  if (is_compat_thread(task_thread_info(current)))
>>>> +  val = sign_extend64(val, 31);
>>>> +
>>>> +  return !IS_ERR_VALUE(val);
>>>> +}
>>> It's better to use compat_user_mode(regs) here instead of
>>> is_compat_thread(). It saves us from worrying whether regs are for the
>>> current context.
>>>
>>> I think we should change regs_return_value() instead. This function
>>> seems to be called from several other places and it has the same
>>> potential problems if called on compat pt_regs.
>> I think this is a problem we created for ourselves back in commit:
>>
>>   15956689a0e60aa0 ("arm64: compat: Ensure upper 32 bits of x0 are zero on 
>> syscall return)
>>
>> AFAICT, the perf regs samples are the only place this matters, since for
>> ptrace the compat regs are implicitly truncated to compat_ulong_t, and
>> audit expects the non-truncated return value. Other architectures don't
>> truncate here, so I think we're setting ourselves up for a game of
>> whack-a-mole to truncate and extend wherever we need to.
>>
>> Given that, I suspect it'd be better to do something like the below.
>>
>> Will, thoughts?
> I think perf is one example, but this is also visible to userspace via the
> native ptrace interface and I distinctly remember needing this for some
> versions of arm64 strace to work correctly when tracing compat tasks.
>
> So I do think that clearing the upper bits on the return path is the right
> approach, but it sounds like we need some more work to handle syscall(-1)
> and audit (what exactly is the problem here after these patches have been
> applied?)

IIUC, IS_ERR_VALUE could handle -1, did I miss something? Thanks.

Regards,
Zhe

>
> Will



Re: [PATCH 1/3] arm64: ptrace: Add is_syscall_success to handle compat

2021-04-20 Thread He Zhe



On 4/16/21 8:33 PM, Catalin Marinas wrote:
> On Fri, Apr 16, 2021 at 03:55:31PM +0800, He Zhe wrote:
>> The general version of is_syscall_success does not handle 32-bit
>> compatible case, which would cause 32-bit negative return code to be
>> recoganized as a positive number later and seen as a "success".
>>
>> Since is_compat_thread is defined in compat.h, implementing
>> is_syscall_success in ptrace.h would introduce build failure due to
>> recursive inclusion of some basic headers like mutex.h. We put the
>> implementation to ptrace.c
>>
>> Signed-off-by: He Zhe 
>> ---
>>  arch/arm64/include/asm/ptrace.h |  3 +++
>>  arch/arm64/kernel/ptrace.c  | 10 ++
>>  2 files changed, 13 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/ptrace.h 
>> b/arch/arm64/include/asm/ptrace.h
>> index e58bca832dff..3c415e9e5d85 100644
>> --- a/arch/arm64/include/asm/ptrace.h
>> +++ b/arch/arm64/include/asm/ptrace.h
>> @@ -328,6 +328,9 @@ static inline void regs_set_return_value(struct pt_regs 
>> *regs, unsigned long rc)
>>  regs->regs[0] = rc;
>>  }
>>  
>> +extern inline int is_syscall_success(struct pt_regs *regs);
>> +#define is_syscall_success(regs) is_syscall_success(regs)
>> +
>>  /**
>>   * regs_get_kernel_argument() - get Nth function argument in kernel
>>   * @regs:   pt_regs of that context
>> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
>> index 170f42fd6101..3266201f8c60 100644
>> --- a/arch/arm64/kernel/ptrace.c
>> +++ b/arch/arm64/kernel/ptrace.c
>> @@ -1909,3 +1909,13 @@ int valid_user_regs(struct user_pt_regs *regs, struct 
>> task_struct *task)
>>  else
>>  return valid_native_regs(regs);
>>  }
>> +
>> +inline int is_syscall_success(struct pt_regs *regs)
>> +{
>> +unsigned long val = regs->regs[0];
>> +
>> +if (is_compat_thread(task_thread_info(current)))
>> +val = sign_extend64(val, 31);
>> +
>> +return !IS_ERR_VALUE(val);
>> +}
> It's better to use compat_user_mode(regs) here instead of
> is_compat_thread(). It saves us from worrying whether regs are for the
> current context.

Thanks. I'll use this for v2.

>
> I think we should change regs_return_value() instead. This function
> seems to be called from several other places and it has the same
> potential problems if called on compat pt_regs.

IMHO, now that we have had specific function, syscall_get_return_value, to get
syscall return code, we might as well use it. regs_return_value may be left for
where we want internal return code. I found such places below and haven't found
other places that syscall sign extension is concerned about.

kernel/test_kprobes.c
kernel/trace/trace_kprobe.c
samples/kprobes/kretprobe_example.c


Regards,
Zhe

>



Re: [PATCH 2/3] arm64: syscall.h: Add sign extension handling in syscall_get_return_value for compat

2021-04-20 Thread He Zhe



On 4/16/21 5:43 PM, Oleg Nesterov wrote:
> On 04/16, He Zhe wrote:
>> --- a/arch/arm64/include/asm/syscall.h
>> +++ b/arch/arm64/include/asm/syscall.h
>> @@ -44,7 +44,12 @@ static inline long syscall_get_error(struct task_struct 
>> *task,
>>  static inline long syscall_get_return_value(struct task_struct *task,
>>  struct pt_regs *regs)
>>  {
>> -return regs->regs[0];
>> +long val = regs->regs[0];
>> +
>> +if (is_compat_thread(task_thread_info(task)))
>> +val = sign_extend64(val, 31);
>> +
>> +return val;
>>  }
> I can't really review these arm-specific patches, but with this change both
> syscall_get_error() and is_syscall_success() can use 
> syscall_get_return_value()
> to avoid the code duplication.

Thanks. I'll improve this if v2 is needed.

Regards,
Zhe

>
> Oleg.
>



[PATCH 3/3] audit: Use syscall_get_return_value to get syscall return code in audit_syscall_exit

2021-04-16 Thread He Zhe
regs_return_value for some architectures like arm64 simply retrieve
register value from pt_regs without sign extension in 32-bit compatible
case and cause audit to have false syscall return code. For example,
32-bit -13 would be treated as 4294967283 below.

type=SYSCALL msg=audit(160715.887:582): arch=4028 syscall=322
success=yes exit=4294967283

We just added proper sign extension in syscall_get_return_value which
should be used instead.

Signed-off-by: He Zhe 
---
 include/linux/audit.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index 82b7c1116a85..135adbe22c19 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -334,7 +334,7 @@ static inline void audit_syscall_exit(void *pt_regs)
 {
if (unlikely(audit_context())) {
int success = is_syscall_success(pt_regs);
-   long return_code = regs_return_value(pt_regs);
+   long return_code = syscall_get_return_value(current, pt_regs);
 
__audit_syscall_exit(success, return_code);
}
-- 
2.17.1



[PATCH 2/3] arm64: syscall.h: Add sign extension handling in syscall_get_return_value for compat

2021-04-16 Thread He Zhe
Add sign extension handling in syscall_get_return_value so that it can
handle 32-bit compatible case and can be used by for example audit, just
like what syscall_get_error does.

Signed-off-by: He Zhe 
---
 arch/arm64/include/asm/syscall.h | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/syscall.h b/arch/arm64/include/asm/syscall.h
index cfc0672013f6..cd7a22787aeb 100644
--- a/arch/arm64/include/asm/syscall.h
+++ b/arch/arm64/include/asm/syscall.h
@@ -44,7 +44,12 @@ static inline long syscall_get_error(struct task_struct 
*task,
 static inline long syscall_get_return_value(struct task_struct *task,
struct pt_regs *regs)
 {
-   return regs->regs[0];
+   long val = regs->regs[0];
+
+   if (is_compat_thread(task_thread_info(task)))
+   val = sign_extend64(val, 31);
+
+   return val;
 }
 
 static inline void syscall_set_return_value(struct task_struct *task,
-- 
2.17.1



[PATCH 1/3] arm64: ptrace: Add is_syscall_success to handle compat

2021-04-16 Thread He Zhe
The general version of is_syscall_success does not handle 32-bit
compatible case, which would cause 32-bit negative return code to be
recoganized as a positive number later and seen as a "success".

Since is_compat_thread is defined in compat.h, implementing
is_syscall_success in ptrace.h would introduce build failure due to
recursive inclusion of some basic headers like mutex.h. We put the
implementation to ptrace.c

Signed-off-by: He Zhe 
---
 arch/arm64/include/asm/ptrace.h |  3 +++
 arch/arm64/kernel/ptrace.c  | 10 ++
 2 files changed, 13 insertions(+)

diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
index e58bca832dff..3c415e9e5d85 100644
--- a/arch/arm64/include/asm/ptrace.h
+++ b/arch/arm64/include/asm/ptrace.h
@@ -328,6 +328,9 @@ static inline void regs_set_return_value(struct pt_regs 
*regs, unsigned long rc)
regs->regs[0] = rc;
 }
 
+extern inline int is_syscall_success(struct pt_regs *regs);
+#define is_syscall_success(regs) is_syscall_success(regs)
+
 /**
  * regs_get_kernel_argument() - get Nth function argument in kernel
  * @regs:  pt_regs of that context
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 170f42fd6101..3266201f8c60 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -1909,3 +1909,13 @@ int valid_user_regs(struct user_pt_regs *regs, struct 
task_struct *task)
else
return valid_native_regs(regs);
 }
+
+inline int is_syscall_success(struct pt_regs *regs)
+{
+   unsigned long val = regs->regs[0];
+
+   if (is_compat_thread(task_thread_info(current)))
+   val = sign_extend64(val, 31);
+
+   return !IS_ERR_VALUE(val);
+}
-- 
2.17.1



Re: [PATCH 2/2] ptrace: is_syscall_success: Add syscall return code handling for compat task

2021-04-14 Thread He Zhe



On 4/15/21 12:55 AM, Oleg Nesterov wrote:
> On 04/14, David Laight wrote:
>> From: Oleg Nesterov
>>> Sent: 14 April 2021 16:08
>>>
>>> Add audit maintainers...
>>>
>>> On 04/14, He Zhe wrote:
>>>> When 32-bit userspace application is running on 64-bit kernel, the 32-bit
>>>> syscall return code would be changed from u32 to u64 in regs_return_value
>>>> and then changed to s64. Hence the negative return code would be treated
>>>> as a positive number and results in a non-error in, for example, audit
>>>> like below.
>>> Sorry, can understand. At least on x86_64 even the 32-bit syscall returns
>>> long, not u32.
>>>
>>> Hmm. And afaics on x86 is_compat_task() is only defined if !CONFIG_COMPAT,
>>> so this patch looks wrong anyway.
>> And, as with the other patch a x64_64 64bit process can make both types
>> of 32bit system call - so it needs to depend on the system call entry type
>> not any type of the task.
> I don't understand... but iirc is_compat_task() used to check TS_COMPAT and
> this is what we need to detect the 32-bit syscall. But it looks deprecated,
> I think in_compat_syscall() should be used instead.
>
> But this doesn't matter, I still can't understand the problem.

Sorry for not enough clarification.

This was found on an arm64 kernel running with 32-bit user-space application.
The arm64 version of regs_return_value returns unsigned long.

static inline unsigned long regs_return_value(struct pt_regs *regs)
{
    return regs->regs[0];
}

But when the syscall fails, with -13 in my case, the return code has been saved
as a 32 bit long negative number, 0xFFF3, in regs[0] by the time
regs_return_value gets called in audit_syscall_exit.

Then in audit_syscall_exit, the return value of regs_return_value is changed to
a 64 bit signed long, from when on it is treated as a positive number.

Similarly in is_syscall_success, 0xFFF3 would be out of error
number range, resulting in a "success".

These two patches are to do the sign extension.

David, thanks, is_compat_syscall should be the right one to use. I didn't notice
the difference between is_compat_syscall and is_compat_task and thought
is_compat_task would be harmless to other architectures.


Zhe

>
> Oleg.
>



[PATCH 2/2] ptrace: is_syscall_success: Add syscall return code handling for compat task

2021-04-14 Thread He Zhe
When 32-bit userspace application is running on 64-bit kernel, the 32-bit
syscall return code would be changed from u32 to u64 in regs_return_value
and then changed to s64. Hence the negative return code would be treated
as a positive number and results in a non-error in, for example, audit
like below.

type=SYSCALL msg=audit(160715.887:582): arch=4028 syscall=322
success=yes exit=4294967283

This patch forces the u32->s32->s64 for compat tasks.

Signed-off-by: He Zhe 
---
 include/linux/ptrace.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index b5ebf6c01292..bc3056fff8a6 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -260,7 +260,9 @@ static inline void ptrace_release_task(struct task_struct 
*task)
  * is an error value.  On some systems like ia64 and powerpc they have 
different
  * indicators of success/failure and must define their own.
  */
-#define is_syscall_success(regs) (!IS_ERR_VALUE((unsigned 
long)(regs_return_value(regs
+#define is_syscall_success(regs) (!IS_ERR_VALUE(is_compat_task() ? \
+   (unsigned long)(s64)(s32)(regs_return_value(regs)) : \
+   (unsigned long)(regs_return_value(regs
 #endif
 
 /*
-- 
2.17.1



[PATCH 1/2] audit: Add syscall return code handling for compat task

2021-04-14 Thread He Zhe
When 32-bit userspace application is running on 64-bit kernel, the 32-bit
syscall return code would be changed from u32 to u64 in regs_return_value
and then changed to s64. Hence the negative return code recorded by audit
would end up being a big positive number like below.

type=SYSCALL msg=audit(160715.887:582): arch=4028 syscall=322
success=yes exit=4294967283

This patch forces the u32->s32->s64 for compat tasks.

Signed-off-by: He Zhe 
---
 include/linux/audit.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index 82b7c1116a85..32cb853f3029 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -334,7 +334,9 @@ static inline void audit_syscall_exit(void *pt_regs)
 {
if (unlikely(audit_context())) {
int success = is_syscall_success(pt_regs);
-   long return_code = regs_return_value(pt_regs);
+   long return_code = is_compat_task() ?
+   (s64)(s32)regs_return_value(pt_regs) :
+   regs_return_value(pt_regs);
 
__audit_syscall_exit(success, return_code);
}
-- 
2.17.1



Re: [PATCH 2/2] perf tools: Improve EOPNOTSUPP error reporting

2021-03-03 Thread He Zhe
Oops, add some supporters from get_maintainer.pl

Zhe

On 2/23/21 4:25 PM, He Zhe wrote:
> There may be multiple reasons for EOPNOTSUPP. Sometimes we cannot determine
> which one it is.
>
> For example, when we set up uprobe with 32-bit perf and arm64 kernel on
> some hardware that does not support sampling/overflow-interrupts,
> $ perf probe -x /lib/libc.so.6 malloc
> $ perf record -e probe_libc:malloc -a ls
>
> Before this patch:
> probe_libc:malloc: PMU Hardware doesn't support sampling/overflow-interrupts. 
> Try 'perf stat'
>
> After this patch:
> probe_libc:malloc: PMU Hardware may not support sampling/overflow-interrupts. 
> Try 'perf stat'.
> Some 64-bit architectures may not support 32-bit instruction probing.
>
> Signed-off-by: He Zhe 
> ---
>  tools/perf/util/evsel.c | 18 ++
>  1 file changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index 1bf76864c4f2..aa56511ddf60 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -2697,22 +2697,24 @@ int evsel__open_strerror(struct evsel *evsel, struct 
> target *target,
>   break;
>   case EOPNOTSUPP:
>   if (evsel->core.attr.aux_output)
> - return scnprintf(msg, size,
> - "%s: PMU Hardware doesn't support 'aux_output' feature",
> + printed += scnprintf(msg + printed, size,
> + "%s: PMU Hardware may not support 'aux_output' feature.\n",
>evsel__name(evsel));
>   if (evsel->core.attr.sample_period != 0)
> - return scnprintf(msg, size,
> - "%s: PMU Hardware doesn't support sampling/overflow-interrupts. Try 
> 'perf stat'",
> + printed += scnprintf(msg + printed, size,
> + "%s: PMU Hardware doesn't support sampling/overflow-interrupts. Try 
> 'perf stat'.\n",
>evsel__name(evsel));
>   if (evsel->core.attr.precise_ip)
> - return scnprintf(msg, size, "%s",
> - "\'precise\' request may not be supported. Try removing 'p' modifier.");
> + printed += scnprintf(msg + printed, size, "%s",
> + "\'precise\' request may not be supported. Try removing 'p' 
> modifier.\n");
>  #if defined(__i386__) || defined(__x86_64__)
>   if (evsel->core.attr.type == PERF_TYPE_HARDWARE)
> - return scnprintf(msg, size, "%s",
> + printed += scnprintf(msg + printed, size, "%s",
>   "No hardware sampling interrupt available.\n");
>  #endif
> - break;
> + scnprintf(msg + printed, size, "%s",
> + "Some 64-bit architectures may not support 32-bit instruction 
> uprobe.\n");
> + return;
>   case EBUSY:
>   if (find_process("oprofiled"))
>   return scnprintf(msg, size,



[PATCH 2/2] perf tools: Improve EOPNOTSUPP error reporting

2021-02-23 Thread He Zhe
There may be multiple reasons for EOPNOTSUPP. Sometimes we cannot determine
which one it is.

For example, when we set up uprobe with 32-bit perf and arm64 kernel on
some hardware that does not support sampling/overflow-interrupts,
$ perf probe -x /lib/libc.so.6 malloc
$ perf record -e probe_libc:malloc -a ls

Before this patch:
probe_libc:malloc: PMU Hardware doesn't support sampling/overflow-interrupts. 
Try 'perf stat'

After this patch:
probe_libc:malloc: PMU Hardware may not support sampling/overflow-interrupts. 
Try 'perf stat'.
Some 64-bit architectures may not support 32-bit instruction probing.

Signed-off-by: He Zhe 
---
 tools/perf/util/evsel.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 1bf76864c4f2..aa56511ddf60 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2697,22 +2697,24 @@ int evsel__open_strerror(struct evsel *evsel, struct 
target *target,
break;
case EOPNOTSUPP:
if (evsel->core.attr.aux_output)
-   return scnprintf(msg, size,
-   "%s: PMU Hardware doesn't support 'aux_output' feature",
+   printed += scnprintf(msg + printed, size,
+   "%s: PMU Hardware may not support 'aux_output' feature.\n",
 evsel__name(evsel));
if (evsel->core.attr.sample_period != 0)
-   return scnprintf(msg, size,
-   "%s: PMU Hardware doesn't support sampling/overflow-interrupts. Try 
'perf stat'",
+   printed += scnprintf(msg + printed, size,
+   "%s: PMU Hardware doesn't support sampling/overflow-interrupts. Try 
'perf stat'.\n",
 evsel__name(evsel));
if (evsel->core.attr.precise_ip)
-   return scnprintf(msg, size, "%s",
-   "\'precise\' request may not be supported. Try removing 'p' modifier.");
+   printed += scnprintf(msg + printed, size, "%s",
+   "\'precise\' request may not be supported. Try removing 'p' 
modifier.\n");
 #if defined(__i386__) || defined(__x86_64__)
if (evsel->core.attr.type == PERF_TYPE_HARDWARE)
-   return scnprintf(msg, size, "%s",
+   printed += scnprintf(msg + printed, size, "%s",
"No hardware sampling interrupt available.\n");
 #endif
-   break;
+   scnprintf(msg + printed, size, "%s",
+   "Some 64-bit architectures may not support 32-bit instruction 
uprobe.\n");
+   return;
case EBUSY:
if (find_process("oprofiled"))
return scnprintf(msg, size,
-- 
2.17.1



[PATCH 1/2] arm64: uprobe: Return EOPNOTSUPP for AARCH32 instruction probing

2021-02-23 Thread He Zhe
As stated in linux/errno.h, ENOTSUPP should never be seen by user programs.
When we set up uprobe with 32-bit perf and arm64 kernel, we would see the
following vague error without useful hint.

The sys_perf_event_open() syscall returned with 524 (INTERNAL ERROR:
strerror_r(524, [buf], 128)=22)

Use EOPNOTSUPP instead to indicate such cases.

Signed-off-by: He Zhe 
---
 arch/arm64/kernel/probes/uprobes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/probes/uprobes.c 
b/arch/arm64/kernel/probes/uprobes.c
index a412d8edbcd2..2c247634552b 100644
--- a/arch/arm64/kernel/probes/uprobes.c
+++ b/arch/arm64/kernel/probes/uprobes.c
@@ -38,7 +38,7 @@ int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, 
struct mm_struct *mm,
 
/* TODO: Currently we do not support AARCH32 instruction probing */
if (mm->context.flags & MMCF_AARCH32)
-   return -ENOTSUPP;
+   return -EOPNOTSUPP;
else if (!IS_ALIGNED(addr, AARCH64_INSN_SIZE))
return -EINVAL;
 
-- 
2.17.1



[PATCH] pinctrl: core: Add missing #ifdef CONFIG_GPIOLIB

2020-10-28 Thread He Zhe
To fix the following build warnings when CONFIG_GPIOLIB=n.

drivers/pinctrl/core.c:1607:20: warning: unused variable 'chip' 
[-Wunused-variable]
 1608 |  struct gpio_chip *chip;
  |^~~~
drivers/pinctrl/core.c:1606:15: warning: unused variable 'gpio_num' 
[-Wunused-variable]
 1607 |  unsigned int gpio_num;
  |   ^~~~
drivers/pinctrl/core.c:1605:29: warning: unused variable 'range' 
[-Wunused-variable]
 1606 |  struct pinctrl_gpio_range *range;
  | ^

Fixes: f1b206cf7c57 ("pinctrl: core: print gpio in pins debugfs file")
Signed-off-by: He Zhe 
---
 drivers/pinctrl/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/pinctrl/core.c b/drivers/pinctrl/core.c
index 3663d87f51a0..9fc4433fece4 100644
--- a/drivers/pinctrl/core.c
+++ b/drivers/pinctrl/core.c
@@ -1602,9 +1602,11 @@ static int pinctrl_pins_show(struct seq_file *s, void 
*what)
struct pinctrl_dev *pctldev = s->private;
const struct pinctrl_ops *ops = pctldev->desc->pctlops;
unsigned i, pin;
+#ifdef CONFIG_GPIOLIB
struct pinctrl_gpio_range *range;
unsigned int gpio_num;
struct gpio_chip *chip;
+#endif
 
seq_printf(s, "registered pins: %d\n", pctldev->desc->npins);
 
-- 
2.17.1



Re: [PATCH] SUNRPC: Fix svc_flush_dcache()

2020-09-22 Thread He Zhe



On 9/22/20 10:14 PM, Chuck Lever wrote:
>
>> On Sep 22, 2020, at 3:13 AM, He Zhe  wrote:
>>
>>
>>
>> On 9/21/20 3:51 AM, Chuck Lever wrote:
>>> On platforms that implement flush_dcache_page(), a large NFS WRITE
>>> triggers the WARN_ONCE in bvec_iter_advance():
>>>
>>> Sep 20 14:01:05 klimt.1015granger.net kernel: Attempted to advance past end 
>>> of bvec iter
>>> Sep 20 14:01:05 klimt.1015granger.net kernel: WARNING: CPU: 0 PID: 1032 at 
>>> include/linux/bvec.h:101 bvec_iter_advance.isra.0+0xa7/0x158 [sunrpc]
>>>
>>> Sep 20 14:01:05 klimt.1015granger.net kernel: Call Trace:
>>> Sep 20 14:01:05 klimt.1015granger.net kernel:  
>>> svc_tcp_recvfrom+0x60c/0x12c7 [sunrpc]
>>> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? 
>>> bvec_iter_advance.isra.0+0x158/0x158 [sunrpc]
>>> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? del_timer_sync+0x4b/0x55
>>> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? test_bit+0x1d/0x27 [sunrpc]
>>> Sep 20 14:01:05 klimt.1015granger.net kernel:  svc_recv+0x1193/0x15e4 
>>> [sunrpc]
>>> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? 
>>> try_to_freeze.isra.0+0x6f/0x6f [sunrpc]
>>> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? 
>>> refcount_sub_and_test.constprop.0+0x13/0x40 [sunrpc]
>>> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? svc_xprt_put+0x1e/0x29f 
>>> [sunrpc]
>>> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? svc_send+0x39f/0x3c1 
>>> [sunrpc]
>>> Sep 20 14:01:05 klimt.1015granger.net kernel:  nfsd+0x282/0x345 [nfsd]
>>> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? __kthread_parkme+0x74/0xba
>>> Sep 20 14:01:05 klimt.1015granger.net kernel:  kthread+0x2ad/0x2bc
>>> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? nfsd_destroy+0x124/0x124 
>>> [nfsd]
>>> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? test_bit+0x1d/0x27
>>> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? 
>>> kthread_mod_delayed_work+0x115/0x115
>>> Sep 20 14:01:05 klimt.1015granger.net kernel:  ret_from_fork+0x22/0x30
>>>
>>> Reported-by: He Zhe 
>>> Fixes: ca07eda33e01 ("SUNRPC: Refactor svc_recvfrom()")
>>> Signed-off-by: Chuck Lever 
>>> ---
>>> net/sunrpc/svcsock.c |2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> Hi Zhe-
>>>
>>> If you confirm this fixes your issue and there are no other
>>> objections or regressions, I can submit this for v5.9-rc.
>> I don't quite get why we add "seek" to "size". It seems this action does not
>> reflect the actual scenario and forcedly neutralizes the WARN_ONCE check in
>> bvec_iter_advance, so that it may "advance past end of bvec iter" and thus
>> introduces overflow.
>> Why don't we avoid this problem at the very begginning like my v1? That is, 
>> call
>> svc_flush_bvec only when we have received more than we want to seek.
>>
>> len = sock_recvmsg(svsk->sk_sock, , MSG_DONTWAIT);
>> -   if (len > 0)
>> +   if (len > 0 && (size_t)len > (seek & PAGE_MASK))
>> svc_flush_bvec(bvec, len, seek);
> Because this doesn't fix the underlying bug that triggered the
> WARN_ONCE.
>
> svc_tcp_recvfrom() attempts to assemble a possibly large RPC Call
> from a sequence of sock_recvmsg's.
>
> @seek is the running number of bytes that has been received so
> far for the RPC Call we are assembling. @size is the number of
> bytes that was just received in the most recent sock_recvmsg.
>
> We want svc_flush_bvec to flush just the area of @bvec that
> hasn't been flushed yet.
>
> Thus: the current size of the partial Call message in @bvec is
> @seek + @size. The starting location of the flush is
> @seek & PAGE_MASK. This aligns the flush so it starts on a page
> boundary.
>
> This:
>
>  230 struct bvec_iter bi = {
>  231 .bi_size= size + seek,
>  232 };
>
>  235 bvec_iter_advance(bvec, , seek & PAGE_MASK);
>
> advances the bvec_iter to the part of @bvec that hasn't been
> flushed yet.
>
> This loop:
>
>  236 for_each_bvec(bv, bvec, bi, bi)
>  237 flush_dcache_page(bv.bv_page);
>
> flushes each page starting at that point to the end of the bytes
> that have been received so far
>
> In other words, ca07eda33e01 was wrong because it always flushed
> the first section of @bvec, never the later parts of it.

Thanks for clarification. I just tested the patch. It works well.

Zhe

>
>
>> Regards,
>> Zhe
>>
>>>
>>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>>> index d5805fa1d066..c2752e2b9ce3 100644
>>> --- a/net/sunrpc/svcsock.c
>>> +++ b/net/sunrpc/svcsock.c
>>> @@ -228,7 +228,7 @@ static int svc_one_sock_name(struct svc_sock *svsk, 
>>> char *buf, int remaining)
>>> static void svc_flush_bvec(const struct bio_vec *bvec, size_t size, size_t 
>>> seek)
>>> {
>>> struct bvec_iter bi = {
>>> -   .bi_size= size,
>>> +   .bi_size= size + seek,
>>> };
>>> struct bio_vec bv;
> --
> Chuck Lever
>
>
>



Re: [PATCH] SUNRPC: Fix svc_flush_dcache()

2020-09-22 Thread He Zhe



On 9/21/20 3:51 AM, Chuck Lever wrote:
> On platforms that implement flush_dcache_page(), a large NFS WRITE
> triggers the WARN_ONCE in bvec_iter_advance():
>
> Sep 20 14:01:05 klimt.1015granger.net kernel: Attempted to advance past end 
> of bvec iter
> Sep 20 14:01:05 klimt.1015granger.net kernel: WARNING: CPU: 0 PID: 1032 at 
> include/linux/bvec.h:101 bvec_iter_advance.isra.0+0xa7/0x158 [sunrpc]
>
> Sep 20 14:01:05 klimt.1015granger.net kernel: Call Trace:
> Sep 20 14:01:05 klimt.1015granger.net kernel:  svc_tcp_recvfrom+0x60c/0x12c7 
> [sunrpc]
> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? 
> bvec_iter_advance.isra.0+0x158/0x158 [sunrpc]
> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? del_timer_sync+0x4b/0x55
> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? test_bit+0x1d/0x27 [sunrpc]
> Sep 20 14:01:05 klimt.1015granger.net kernel:  svc_recv+0x1193/0x15e4 [sunrpc]
> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? 
> try_to_freeze.isra.0+0x6f/0x6f [sunrpc]
> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? 
> refcount_sub_and_test.constprop.0+0x13/0x40 [sunrpc]
> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? svc_xprt_put+0x1e/0x29f 
> [sunrpc]
> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? svc_send+0x39f/0x3c1 [sunrpc]
> Sep 20 14:01:05 klimt.1015granger.net kernel:  nfsd+0x282/0x345 [nfsd]
> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? __kthread_parkme+0x74/0xba
> Sep 20 14:01:05 klimt.1015granger.net kernel:  kthread+0x2ad/0x2bc
> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? nfsd_destroy+0x124/0x124 
> [nfsd]
> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? test_bit+0x1d/0x27
> Sep 20 14:01:05 klimt.1015granger.net kernel:  ? 
> kthread_mod_delayed_work+0x115/0x115
> Sep 20 14:01:05 klimt.1015granger.net kernel:  ret_from_fork+0x22/0x30
>
> Reported-by: He Zhe 
> Fixes: ca07eda33e01 ("SUNRPC: Refactor svc_recvfrom()")
> Signed-off-by: Chuck Lever 
> ---
>  net/sunrpc/svcsock.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> Hi Zhe-
>
> If you confirm this fixes your issue and there are no other
> objections or regressions, I can submit this for v5.9-rc.

I don't quite get why we add "seek" to "size". It seems this action does not
reflect the actual scenario and forcedly neutralizes the WARN_ONCE check in
bvec_iter_advance, so that it may "advance past end of bvec iter" and thus
introduces overflow.

Why don't we avoid this problem at the very begginning like my v1? That is, call
svc_flush_bvec only when we have received more than we want to seek.

    len = sock_recvmsg(svsk->sk_sock, , MSG_DONTWAIT);
-   if (len > 0)
+   if (len > 0 && (size_t)len > (seek & PAGE_MASK))
    svc_flush_bvec(bvec, len, seek);


Regards,
Zhe

>
>
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index d5805fa1d066..c2752e2b9ce3 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -228,7 +228,7 @@ static int svc_one_sock_name(struct svc_sock *svsk, char 
> *buf, int remaining)
>  static void svc_flush_bvec(const struct bio_vec *bvec, size_t size, size_t 
> seek)
>  {
>   struct bvec_iter bi = {
> - .bi_size= size,
> + .bi_size= size + seek,
>   };
>   struct bio_vec bv;
>  
>
>
>



Re: [PATCH] SUNRPC: Flush dcache only when receiving more seeking

2020-09-19 Thread He Zhe



On 9/19/20 12:23 AM, Chuck Lever wrote:
>
>> On Sep 18, 2020, at 8:50 AM, zhe...@windriver.com wrote:
>>
>> From: He Zhe 
>>
>> commit ca07eda33e01 ("SUNRPC: Refactor svc_recvfrom()") introduces
>> svc_flush_bvec to after sock_recvmsg, but sometimes we receive less than we
>> seek, which triggers the following warning.
>>
>> WARNING: CPU: 0 PID: 18266 at include/linux/bvec.h:101 
>> bvec_iter_advance+0x44/0xa8
>> Attempted to advance past end of bvec iter
>> Modules linked in: sch_fq_codel openvswitch nsh nf_conncount nf_nat
>> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
>> CPU: 1 PID: 18266 Comm: nfsd Not tainted 5.9.0-rc5 #1
>> Hardware name: Xilinx Zynq Platform
>> [<80112ec0>] (unwind_backtrace) from [<8010c3a8>] (show_stack+0x18/0x1c)
>> [<8010c3a8>] (show_stack) from [<80755214>] (dump_stack+0x9c/0xd0)
>> [<80755214>] (dump_stack) from [<80125e64>] (__warn+0xdc/0xf4)
>> [<80125e64>] (__warn) from [<80126244>] (warn_slowpath_fmt+0x84/0xac)
>> [<80126244>] (warn_slowpath_fmt) from [<80c88514>] 
>> (bvec_iter_advance+0x44/0xa8)
>> [<80c88514>] (bvec_iter_advance) from [<80c88940>] 
>> (svc_tcp_read_msg+0x10c/0x1bc)
>> [<80c88940>] (svc_tcp_read_msg) from [<80c895d4>] 
>> (svc_tcp_recvfrom+0x98/0x63c)
>> [<80c895d4>] (svc_tcp_recvfrom) from [<80c97bf4>] 
>> (svc_handle_xprt+0x48c/0x4f8)
>> [<80c97bf4>] (svc_handle_xprt) from [<80c98038>] (svc_recv+0x94/0x1e0)
>> [<80c98038>] (svc_recv) from [<804747cc>] (nfsd+0xf0/0x168)
>> [<804747cc>] (nfsd) from [<80148a0c>] (kthread+0x144/0x154)
>> [<80148a0c>] (kthread) from [<80100114>] (ret_from_fork+0x14/0x20)
>>
>> Fixes: ca07eda33e01 ("SUNRPC: Refactor svc_recvfrom()")
>> Cc:  # 5.8+
>> Signed-off-by: He Zhe 
>> ---
>> net/sunrpc/svcsock.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>> index d5805fa1d066..ea3bc9635448 100644
>> --- a/net/sunrpc/svcsock.c
>> +++ b/net/sunrpc/svcsock.c
>> @@ -277,7 +277,7 @@ static ssize_t svc_tcp_read_msg(struct svc_rqst *rqstp, 
>> size_t buflen,
>>  buflen -= seek;
>>  }
>>  len = sock_recvmsg(svsk->sk_sock, , MSG_DONTWAIT);
>> -if (len > 0)
>> +if (len > (seek & PAGE_MASK))
> I don't understand how this addresses the WARNING. Can you provide
> an example set of inputs that trigger the issue?

I was trying to meet the not warning condition in bvec_iter_advance to make
the flushing meaningful.

svc_flush_bvec
    bvec_iter_advance
        WARN_ONCE(bytes > iter->bi_size,...

Here are my steps:
mkdir /root/mount_point/
mount /dev/sda1 /root/mount_point/
systemctl restart nfs-server
exportfs
mount -vvv -t nfs 127.0.0.1:/root/mount_point/ /mnt
cp /bin/bash ./bash.tmp

>
> Also this change introduces a mixed-sign comparison, so NACK on
> this particular patch unless it can be demonstrated that the
> implicit type conversion here is benign (I don't think it is,
> but I haven't thought through it).

Thanks, I didn't notice the different types. What about this?
if (len > 0 && (size_t)len > (seek & PAGE_MASK))


Zhe

>
>
>>  svc_flush_bvec(bvec, len, seek);
>>
>>  /* If we read a full record, then assume there may be more
>> -- 
>> 2.17.1
>>
> --
> Chuck Lever
>
>
>



Re: [PATCH] mips/oprofile: Fix fallthrough placement

2020-08-21 Thread He Zhe



On 8/21/20 3:48 PM, Thomas Bogendoerfer wrote:
> On Thu, Aug 20, 2020 at 08:54:40PM +0800, zhe...@windriver.com wrote:
>> From: He Zhe 
>>
>> We want neither
>> "
>> include/linux/compiler_attributes.h:201:41: warning: statement will never
>> be executed [-Wswitch-unreachable]
>>   201 | # define fallthrough __attribute__((__fallthrough__))
>>   |  ^
>> "
>> nor
>> "
>> include/linux/compiler_attributes.h:201:41: warning: attribute
>> 'fallthrough' not preceding a case label or default label
>>   201 | # define fallthrough __attribute__((__fallthrough__))
>>   |  ^
>> "
>>
>> It's not worth adding one more macro. Let's simply place the fallthrough
>> in between the expansions.
>>
>> Signed-off-by: He Zhe 
> there is already another patch for the problem, which I've applied
> to mips-fixes.

You mean the below one?
https://git.kernel.org/pub/scm/linux/kernel/git/mips/linux.git/commit/?h=mips-fixes=5900acb374fe2f4f42bbcb2c84db64f582d917a1

That patch handles the first warning in my commit log but does not handle the
second one which is introduced since gcc v10.1.0 commit 6c80b1b56dec
("Make more bad uses of fallthrough attribute into pedwarns.").

Zhe

>
> Thomas.
>



Re: [PATCH] eventfd: Enlarge recursion limit to allow vhost to work

2020-08-20 Thread He Zhe



On 7/22/20 5:01 PM, Juri Lelli wrote:
> On 13/07/20 15:22, Juri Lelli wrote:
>
> [...]
>
>> Gentle ping about this issue (mainly addressing relevant maintainers and
>> potential reviewers). It's easily reproducible with PREEMPT_RT.
> Ping. Any comment at all? :-)

Hi Maintainer(s),

It's been 4 months. Can this be considered this round?

Thanks,
Zhe

>
> Thanks,
>
> Juri
>



Re: [PATCH] eventfd: Enlarge recursion limit to allow vhost to work

2020-07-03 Thread He Zhe



On 7/3/20 4:12 PM, Juri Lelli wrote:
> Hi,
>
> On 10/04/20 19:47, zhe...@windriver.com wrote:
>> From: He Zhe 
>>
>> commit b5e683d5cab8 ("eventfd: track eventfd_signal() recursion depth")
>> introduces a percpu counter that tracks the percpu recursion depth and
>> warn if it greater than zero, to avoid potential deadlock and stack
>> overflow.
>>
>> However sometimes different eventfds may be used in parallel. Specifically,
>> when heavy network load goes through kvm and vhost, working as below, it
>> would trigger the following call trace.
>>
>> -  100.00%
>>- 66.51%
>> ret_from_fork
>> kthread
>>   - vhost_worker
>>  - 33.47% handle_tx_kick
>>   handle_tx
>>   handle_tx_copy
>>   vhost_tx_batch.isra.0
>>   vhost_add_used_and_signal_n
>>   eventfd_signal
>>  - 33.05% handle_rx_net
>>   handle_rx
>>   vhost_add_used_and_signal_n
>>   eventfd_signal
>>- 33.49%
>> ioctl
>> entry_SYSCALL_64_after_hwframe
>> do_syscall_64
>> __x64_sys_ioctl
>> ksys_ioctl
>> do_vfs_ioctl
>> kvm_vcpu_ioctl
>> kvm_arch_vcpu_ioctl_run
>> vmx_handle_exit
>> handle_ept_misconfig
>> kvm_io_bus_write
>> __kvm_io_bus_write
>> eventfd_signal
>>
>> 001: WARNING: CPU: 1 PID: 1503 at fs/eventfd.c:73 eventfd_signal+0x85/0xa0
>>  snip 
>> 001: Call Trace:
>> 001:  vhost_signal+0x15e/0x1b0 [vhost]
>> 001:  vhost_add_used_and_signal_n+0x2b/0x40 [vhost]
>> 001:  handle_rx+0xb9/0x900 [vhost_net]
>> 001:  handle_rx_net+0x15/0x20 [vhost_net]
>> 001:  vhost_worker+0xbe/0x120 [vhost]
>> 001:  kthread+0x106/0x140
>> 001:  ? log_used.part.0+0x20/0x20 [vhost]
>> 001:  ? kthread_park+0x90/0x90
>> 001:  ret_from_fork+0x35/0x40
>> 001: ---[ end trace 0003 ]---
>>
>> This patch enlarges the limit to 1 which is the maximum recursion depth we
>> have found so far.
>>
>> Signed-off-by: He Zhe 
>> ---
> Not sure if this approch can fly, but I also encountered the same
> warning (which further caused hangs during VM install) and this change
> addresses that.
>
> I'd be interested in understanding what is the status of this problem/fix.

This is actually v2 of the patch and has not got any reply yet. Here is the v1. 
FYI.
https://lore.kernel.org/lkml/1586257192-58369-1-git-send-email-zhe...@windriver.com/

> On a side note, by looking at the code, I noticed that (apart from
> samples) all callers don't actually check eventfd_signal() return value
> and I'm wondering why is that the case and if is it safe to do so.

Checking the return value right after sending the signal can tell us if the
event counter has just overflowed, that is, exceeding ULLONG_MAX. I guess the
authors of the callers listed in the commit log just don't worry about that,
since they add only one to a dedicated eventfd.

Zhe

>
> Thanks,
>
> Juri
>



Re: [PATCH] eventfd: Enlarge recursion limit to allow vhost to work

2020-06-22 Thread He Zhe
Can this be considered for this moment?
This is actually v2 of
"[PATCH 1/2] eventfd: Make wake counter work for single fd instead of all".

Thanks,
Zhe



On 4/10/20 7:47 PM, zhe...@windriver.com wrote:
> From: He Zhe 
>
> commit b5e683d5cab8 ("eventfd: track eventfd_signal() recursion depth")
> introduces a percpu counter that tracks the percpu recursion depth and
> warn if it greater than zero, to avoid potential deadlock and stack
> overflow.
>
> However sometimes different eventfds may be used in parallel. Specifically,
> when heavy network load goes through kvm and vhost, working as below, it
> would trigger the following call trace.
>
> -  100.00%
>- 66.51%
> ret_from_fork
> kthread
>   - vhost_worker
>  - 33.47% handle_tx_kick
>   handle_tx
>   handle_tx_copy
>   vhost_tx_batch.isra.0
>   vhost_add_used_and_signal_n
>   eventfd_signal
>  - 33.05% handle_rx_net
>   handle_rx
>   vhost_add_used_and_signal_n
>   eventfd_signal
>- 33.49%
> ioctl
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> __x64_sys_ioctl
> ksys_ioctl
> do_vfs_ioctl
> kvm_vcpu_ioctl
> kvm_arch_vcpu_ioctl_run
> vmx_handle_exit
> handle_ept_misconfig
> kvm_io_bus_write
> __kvm_io_bus_write
> eventfd_signal
>
> 001: WARNING: CPU: 1 PID: 1503 at fs/eventfd.c:73 eventfd_signal+0x85/0xa0
>  snip 
> 001: Call Trace:
> 001:  vhost_signal+0x15e/0x1b0 [vhost]
> 001:  vhost_add_used_and_signal_n+0x2b/0x40 [vhost]
> 001:  handle_rx+0xb9/0x900 [vhost_net]
> 001:  handle_rx_net+0x15/0x20 [vhost_net]
> 001:  vhost_worker+0xbe/0x120 [vhost]
> 001:  kthread+0x106/0x140
> 001:  ? log_used.part.0+0x20/0x20 [vhost]
> 001:  ? kthread_park+0x90/0x90
> 001:  ret_from_fork+0x35/0x40
> 001: ---[ end trace 0003 ]---
>
> This patch enlarges the limit to 1 which is the maximum recursion depth we
> have found so far.
>
> Signed-off-by: He Zhe 
> ---
>  fs/eventfd.c| 3 ++-
>  include/linux/eventfd.h | 3 +++
>  2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/fs/eventfd.c b/fs/eventfd.c
> index 78e41c7c3d05..8b9bd6fb08cd 100644
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -70,7 +70,8 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
>* it returns true, the eventfd_signal() call should be deferred to a
>* safe context.
>*/
> - if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
> + if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count) >
> + EFD_WAKE_COUNT_MAX))
>   return 0;
>  
>   spin_lock_irqsave(>wqh.lock, flags);
> diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
> index dc4fd8a6644d..e7684d768e3f 100644
> --- a/include/linux/eventfd.h
> +++ b/include/linux/eventfd.h
> @@ -29,6 +29,9 @@
>  #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
>  #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
>  
> +/* This is the maximum recursion depth we find so far */
> +#define EFD_WAKE_COUNT_MAX 1
> +
>  struct eventfd_ctx;
>  struct file;
>  



Re: [PATCH] eventfd: Enlarge recursion limit to allow vhost to work

2020-05-12 Thread He Zhe
Can this be considered for this moment?
This is actually v2 of
"[PATCH 1/2] eventfd: Make wake counter work for single fd instead of all".

Thanks,
Zhe

On 4/10/20 7:47 PM, zhe...@windriver.com wrote:
> From: He Zhe 
>
> commit b5e683d5cab8 ("eventfd: track eventfd_signal() recursion depth")
> introduces a percpu counter that tracks the percpu recursion depth and
> warn if it greater than zero, to avoid potential deadlock and stack
> overflow.
>
> However sometimes different eventfds may be used in parallel. Specifically,
> when heavy network load goes through kvm and vhost, working as below, it
> would trigger the following call trace.
>
> -  100.00%
>- 66.51%
> ret_from_fork
> kthread
>   - vhost_worker
>  - 33.47% handle_tx_kick
>   handle_tx
>   handle_tx_copy
>   vhost_tx_batch.isra.0
>   vhost_add_used_and_signal_n
>   eventfd_signal
>  - 33.05% handle_rx_net
>   handle_rx
>   vhost_add_used_and_signal_n
>   eventfd_signal
>- 33.49%
> ioctl
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> __x64_sys_ioctl
> ksys_ioctl
> do_vfs_ioctl
> kvm_vcpu_ioctl
> kvm_arch_vcpu_ioctl_run
> vmx_handle_exit
> handle_ept_misconfig
> kvm_io_bus_write
> __kvm_io_bus_write
> eventfd_signal
>
> 001: WARNING: CPU: 1 PID: 1503 at fs/eventfd.c:73 eventfd_signal+0x85/0xa0
>  snip 
> 001: Call Trace:
> 001:  vhost_signal+0x15e/0x1b0 [vhost]
> 001:  vhost_add_used_and_signal_n+0x2b/0x40 [vhost]
> 001:  handle_rx+0xb9/0x900 [vhost_net]
> 001:  handle_rx_net+0x15/0x20 [vhost_net]
> 001:  vhost_worker+0xbe/0x120 [vhost]
> 001:  kthread+0x106/0x140
> 001:  ? log_used.part.0+0x20/0x20 [vhost]
> 001:  ? kthread_park+0x90/0x90
> 001:  ret_from_fork+0x35/0x40
> 001: ---[ end trace 0003 ]---
>
> This patch enlarges the limit to 1 which is the maximum recursion depth we
> have found so far.
>
> Signed-off-by: He Zhe 
> ---
>  fs/eventfd.c| 3 ++-
>  include/linux/eventfd.h | 3 +++
>  2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/fs/eventfd.c b/fs/eventfd.c
> index 78e41c7c3d05..8b9bd6fb08cd 100644
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -70,7 +70,8 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
>* it returns true, the eventfd_signal() call should be deferred to a
>* safe context.
>*/
> - if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
> + if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count) >
> + EFD_WAKE_COUNT_MAX))
>   return 0;
>  
>   spin_lock_irqsave(>wqh.lock, flags);
> diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
> index dc4fd8a6644d..e7684d768e3f 100644
> --- a/include/linux/eventfd.h
> +++ b/include/linux/eventfd.h
> @@ -29,6 +29,9 @@
>  #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
>  #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
>  
> +/* This is the maximum recursion depth we find so far */
> +#define EFD_WAKE_COUNT_MAX 1
> +
>  struct eventfd_ctx;
>  struct file;
>  



[tip: perf/core] libperf: Add NULL pointer check for cpu_map iteration and NULL assignment for all_cpus.

2020-05-08 Thread tip-bot2 for He Zhe
The following commit has been merged into the perf/core branch of tip:

Commit-ID: 44d041b7b2c11b6739501fd3763cc6fed62cf0ed
Gitweb:
https://git.kernel.org/tip/44d041b7b2c11b6739501fd3763cc6fed62cf0ed
Author:He Zhe 
AuthorDate:Sun, 08 Mar 2020 18:59:17 +08:00
Committer: Arnaldo Carvalho de Melo 
CommitterDate: Tue, 05 May 2020 16:35:29 -03:00

libperf: Add NULL pointer check for cpu_map iteration and NULL assignment for 
all_cpus.

A NULL pointer may be passed to perf_cpu_map__cpu and then cause a
crash, such as the one commit cb71f7d43ece ("libperf: Setup initial
evlist::all_cpus value") fix.

Signed-off-by: He Zhe 
Acked-by: Jiri Olsa 
Cc: Andi Kleen 
Cc: Kyle Meyer 
Link: 
http://lore.kernel.org/lkml/1583665157-349023-1-git-send-email-zhe...@windriver.com
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/lib/perf/cpumap.c | 2 +-
 tools/lib/perf/evlist.c | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/lib/perf/cpumap.c b/tools/lib/perf/cpumap.c
index f93f4e7..ca02150 100644
--- a/tools/lib/perf/cpumap.c
+++ b/tools/lib/perf/cpumap.c
@@ -247,7 +247,7 @@ out:
 
 int perf_cpu_map__cpu(const struct perf_cpu_map *cpus, int idx)
 {
-   if (idx < cpus->nr)
+   if (cpus && idx < cpus->nr)
return cpus->map[idx];
 
return -1;
diff --git a/tools/lib/perf/evlist.c b/tools/lib/perf/evlist.c
index def5505..c481b62 100644
--- a/tools/lib/perf/evlist.c
+++ b/tools/lib/perf/evlist.c
@@ -125,6 +125,7 @@ void perf_evlist__exit(struct perf_evlist *evlist)
perf_cpu_map__put(evlist->cpus);
perf_thread_map__put(evlist->threads);
evlist->cpus = NULL;
+   evlist->all_cpus = NULL;
evlist->threads = NULL;
fdarray__exit(>pollfd);
 }


[tip: ras/core] x86/mcelog: Add compat_ioctl for 32-bit mcelog support

2020-05-04 Thread tip-bot2 for He Zhe
The following commit has been merged into the ras/core branch of tip:

Commit-ID: 3b4ff4eb904fef04c36b39052ca8eb31fa41fad0
Gitweb:
https://git.kernel.org/tip/3b4ff4eb904fef04c36b39052ca8eb31fa41fad0
Author:He Zhe 
AuthorDate:Wed, 04 Mar 2020 14:39:07 +08:00
Committer: Borislav Petkov 
CommitterDate: Mon, 04 May 2020 10:07:04 +02:00

x86/mcelog: Add compat_ioctl for 32-bit mcelog support

A 32-bit version of mcelog issuing ioctls on /dev/mcelog causes errors
like the following:

  MCE_GET_RECORD_LEN: Inappropriate ioctl for device

This is due to a missing compat_ioctl callback.

Assign to it compat_ptr_ioctl() as a generic implementation of the
.compat_ioctl file operation to ioctl functions that either ignore the
argument or pass a pointer to a compatible data type.

 [ bp: Massage commit message. ]

Signed-off-by: He Zhe 
Signed-off-by: Borislav Petkov 
Acked-by: Tony Luck 
Link: 
https://lkml.kernel.org/r/1583303947-49858-1-git-send-email-zhe...@windriver.com
---
 arch/x86/kernel/cpu/mce/dev-mcelog.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/cpu/mce/dev-mcelog.c 
b/arch/x86/kernel/cpu/mce/dev-mcelog.c
index c033e7e..a4fd528 100644
--- a/arch/x86/kernel/cpu/mce/dev-mcelog.c
+++ b/arch/x86/kernel/cpu/mce/dev-mcelog.c
@@ -329,6 +329,7 @@ static const struct file_operations mce_chrdev_ops = {
.write  = mce_chrdev_write,
.poll   = mce_chrdev_poll,
.unlocked_ioctl = mce_chrdev_ioctl,
+   .compat_ioctl   = compat_ptr_ioctl,
.llseek = no_llseek,
 };
 


Re: [PATCH] x86/mce: Add compat_ioctl assignment to make it compatible with 32-bit system

2020-05-02 Thread He Zhe



On 4/28/20 2:19 AM, Luck, Tony wrote:
> On Thu, Apr 16, 2020 at 04:40:31PM +0800, He Zhe wrote:
>> Can this be considered for the moment?
>>
>> Thanks,
>> Zhe
>>
>> On 3/4/20 2:39 PM, zhe...@windriver.com wrote:
>>> From: He Zhe 
>>>
>>> 32-bit user-space program would get errors like the following from ioctl
>>> syscall due to missing compat_ioctl.
>>> MCE_GET_RECORD_LEN: Inappropriate ioctl for device
>>>
>>> compat_ptr_ioctl is provided as a generic implementation of .compat_ioctl
>>> file operation to ioctl functions that either ignore the argument or pass
>>> a pointer to a compatible data type.
> I'm not super-familiar with the compat ioctl bits.  But this looks plausible.
>
> All three of the ioctl's for this driver have a "pointer to integer" for the
> "return" value.  And "int" is a compatible type between i386 and x86_64.
>
> I don't have a system setup to build a 32-bit binary to test the theory,
> but I assume that you have built something that tests all three:
>
>   MCE_GET_RECORD_LEN
>   MCE_GET_LOG_LEN
>   MCE_GETCLEAR_FLAGS
>
> So I guess:
>
> Acked-by: Tony Luck 

Thanks, and yes, I had tested all three.

Zhe

>
>>> Signed-off-by: He Zhe 
>>> ---
>>>  arch/x86/kernel/cpu/mce/dev-mcelog.c | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/arch/x86/kernel/cpu/mce/dev-mcelog.c 
>>> b/arch/x86/kernel/cpu/mce/dev-mcelog.c
>>> index 7c8958d..6c9b91b7 100644
>>> --- a/arch/x86/kernel/cpu/mce/dev-mcelog.c
>>> +++ b/arch/x86/kernel/cpu/mce/dev-mcelog.c
>>> @@ -328,6 +328,7 @@ static const struct file_operations mce_chrdev_ops = {
>>> .write  = mce_chrdev_write,
>>> .poll   = mce_chrdev_poll,
>>> .unlocked_ioctl = mce_chrdev_ioctl,
>>> +   .compat_ioctl   = compat_ptr_ioctl,
>>> .llseek = no_llseek,
>>>  };
>>>  



Re: [PATCH] printk: Fix unnecessary returning broken pipe error from devkmsg_read

2019-09-23 Thread He Zhe



On 9/23/19 7:39 PM, Petr Mladek wrote:
> On Mon 2019-09-23 18:45:18, He Zhe wrote:
>>
>> On 9/23/19 6:05 PM, Sergey Senozhatsky wrote:
>>> On (09/18/19 21:31), zhe...@windriver.com wrote:
>>>> When users read the buffer from start, there is no need to return -EPIPE
>>>> since the possible overflows will not affect the output.
>>>>
>>> [..]
>>>> -  if (user->seq < log_first_seq) {
>>>> +  if (user->seq == 0) {
>>>> +  user->seq = log_first_seq;
>>>> +  } else if (user->seq < log_first_seq) {
>>>>/* our last seen message is gone, return error and reset */
>>>>user->idx = log_first_idx;
>>>>user->seq = log_first_seq;
>>> A tough call.
>>>
>>> User-space wants to read messages which are gone: log_first_seq > user->seq.
>>>
>>> I think returning EPIPE is sort of appropriate; user-space, possibly,
>>> can printf(stderr, "Some /dev/kmsg messages are gone\n"); or something
>>> like that.
>> Yes, a tough call. I was looking at the similar part of RT kernel and thought
>> that handling was better.
>> https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/kernel/printk/printk.c?h=linux-5.2.y-rt#n706
> Adding John into CC. He takes care of printk() in RT kernel.
>
>
>> IMHO, the EPIPE is used to inform user-space when the buffer has overflown 
>> and
>> there is a inconsistency/break between what has been read and what has not.
> What is the motivation for the change, please?
> Have you just noticed the inconsistency between normal and RT kernel?
> Or was there any bug report?

I mean when messages that are going to be read get lost, there is an
inconsistency between what we want and what are left.

This is all due to an LTP case which passes on mainline kernel but fails
(line 425) on RT kernel.

https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/logging/kmsg/kmsg01.c#L425

But after comparing how mainling and RT kernel handle the case, I thought it was
better not to return an EPIPE.

https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/kernel/printk/printk.c?h=linux-5.2.y-rt#n706



>
> We need to be careful to do not break user space. And this patch
> modifies the existing behavior.
>
>> When user-space just wants to read the buffer from sequence 0, there would 
>> not
>> be the inconsistency.
>>
>> I think it is NOT necessary to inform user-space, when it just wants to read
>> from the beginning of the buffer, that the buffer has changed since the time
>> point when it issues the action of reading.
> Who would set sequence 0, please?

When log_first_seq is 0 in  devkmsg_open.

>
> IMHO, the patch is wrong.
>
> devkmsg_open() initializes user->seq to a valid sequence number.
> We should return -EPIPE when user->seq == 0 during devkmsg_open()
> and the first messages got lost before the first read.
>
>> But if that IS what we want, the RT
>> kernel needs to be changed so that mainline and RT kernel act in the same 
>> way.
> I agree.

If this is the case, I'm going to submit a patch for RT kernel.

Thanks,
Zhe

>
> Best Regards,
> Petr
>



Re: [PATCH] printk: Fix unnecessary returning broken pipe error from devkmsg_read

2019-09-23 Thread He Zhe



On 9/23/19 6:05 PM, Sergey Senozhatsky wrote:
> On (09/18/19 21:31), zhe...@windriver.com wrote:
>> When users read the buffer from start, there is no need to return -EPIPE
>> since the possible overflows will not affect the output.
>>
> [..]
>> -if (user->seq < log_first_seq) {
>> +if (user->seq == 0) {
>> +user->seq = log_first_seq;
>> +} else if (user->seq < log_first_seq) {
>>  /* our last seen message is gone, return error and reset */
>>  user->idx = log_first_idx;
>>  user->seq = log_first_seq;
> A tough call.
>
> User-space wants to read messages which are gone: log_first_seq > user->seq.
>
> I think returning EPIPE is sort of appropriate; user-space, possibly,
> can printf(stderr, "Some /dev/kmsg messages are gone\n"); or something
> like that.

Yes, a tough call. I was looking at the similar part of RT kernel and thought
that handling was better.
https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/kernel/printk/printk.c?h=linux-5.2.y-rt#n706

IMHO, the EPIPE is used to inform user-space when the buffer has overflown and
there is a inconsistency/break between what has been read and what has not.

When user-space just wants to read the buffer from sequence 0, there would not
be the inconsistency.

I think it is NOT necessary to inform user-space, when it just wants to read
from the beginning of the buffer, that the buffer has changed since the time
point when it issues the action of reading. But if that IS what we want, the RT
kernel needs to be changed so that mainline and RT kernel act in the same way.


Zhe

>
>   -ss
>



Re: kernel/bpf/core.o: warning: objtool: ___bpf_prog_run.cold()+0x7: call without frame pointer save/setup

2019-08-26 Thread He Zhe



On 8/26/19 11:11 PM, Greg KH wrote:
> On Mon, Aug 26, 2019 at 10:42:53PM +0800, He Zhe wrote:
>> Hi All,
>>
>> Since 3193c0836f20 ("bpf: Disable GCC -fgcse optimization for 
>> ___bpf_prog_run()"),
>> We have got the following warning,
>> kernel/bpf/core.o: warning: objtool: ___bpf_prog_run.cold()+0x7: call 
>> without frame pointer save/setup
>>
>> If reverting the above commit, we will get the following warning,
>> kernel/bpf/core.o: warning: objtool: ___bpf_prog_run()+0x8b9: sibling call 
>> from callable instruction with modified stack frame
>> if CONFIG_RETPOLINE=n, and no warning if CONFIG_RETPOLINE=y
> Do you see this same problem on 5.3-rc6?
>
> And what version of gcc are you using?

Sorry for missing such information.
Here is the .config along with kernel and GCC version.

#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 5.3.0-rc6 Kernel Configuration
#

#
# Compiler: x86_64-wrs-linux-gcc (GCC) 9.1.0
#
CONFIG_CC_IS_GCC=y
CONFIG_GCC_VERSION=90100
CONFIG_CLANG_VERSION=0
CONFIG_CC_HAS_ASM_GOTO=y
CONFIG_CC_HAS_WARN_MAYBE_UNINITIALIZED=y
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
# CONFIG_COMPILE_TEST is not set
# CONFIG_HEADER_TEST is not set
CONFIG_LOCALVERSION="dev"
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_BUILD_SALT=""
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_USELIB=y
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_GENERIC_IRQ_MIGRATION=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y
CONFIG_GENERIC_IRQ_RESERVATION_MODE=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
# CONFIG_GENERIC_IRQ_DEBUGFS is not set
# end of IRQ subsystem

CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_ARCH_CLOCKSOURCE_INIT=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
CONFIG_CONTEXT_TRACKING=y
# CONFIG_CONTEXT_TRACKING_FORCE is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
# end of Timers subsystem

CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set

#
# CPU/Task time and stats accounting
#
CONFIG_VIRT_CPU_ACCOUNTING=y
CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
# CONFIG_IRQ_TIME_ACCOUNTING is not set
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
# CONFIG_PSI is not set
# end of CPU/Task time and stats accounting

CONFIG_CPU_ISOLATION=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_RCU_NOCB_CPU=y
# end of RCU Subsystem

CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
# CONFIG_IKHEADERS is not set
CONFIG_LOG_BUF_SHIFT=17
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y

#
# Scheduler features
#
# end of Scheduler features

CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y
CONFIG_ARCH_SUPPORTS_INT128=y
CONFIG_NUMA_BALANCING=y
CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
CONFIG_CGROUPS=y
CONFIG_PAGE_COUNTER=y
CONFIG_MEMCG=y
CONFIG_MEMCG_SWAP=y
CONFIG_MEMCG_SWAP_ENABLED=y
CONFIG_MEMCG_KMEM=y
CONFIG_BLK_CGROUP=y
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
CONFIG_RT_GROUP_SCHED=y
CONFIG_CGROUP_PIDS=y
# CONFIG_CGROUP_RDMA is not set
CONFIG_CGROUP_FREEZER=y
# CONFIG_CGROUP_HUGETLB is not set
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_BPF=y
CONFIG_CGROUP_DEBUG=y
CONFIG_SOCK_CGROUP_DATA=y
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
CONFIG_USER_NS=y
CONFIG_PID_NS=y
CONFIG_NET_NS=y
CONFIG_CHECKPOINT_RESTORE=y
# CONFIG_SCHED_AUTOGROUP is not set
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_RELAY=y
CONFIG_

kernel/bpf/core.o: warning: objtool: ___bpf_prog_run.cold()+0x7: call without frame pointer save/setup

2019-08-26 Thread He Zhe
Hi All,

Since 3193c0836f20 ("bpf: Disable GCC -fgcse optimization for 
___bpf_prog_run()"),
We have got the following warning,
kernel/bpf/core.o: warning: objtool: ___bpf_prog_run.cold()+0x7: call without 
frame pointer save/setup

If reverting the above commit, we will get the following warning,
kernel/bpf/core.o: warning: objtool: ___bpf_prog_run()+0x8b9: sibling call from 
callable instruction with modified stack frame
if CONFIG_RETPOLINE=n, and no warning if CONFIG_RETPOLINE=y


Zhe


Re: [PATCH] module: Fix load failure when CONFIG_STRICT_MODULE_RWX is diabled

2019-08-13 Thread He Zhe



On 8/14/19 1:59 AM, Jessica Yu wrote:
> +++ zhe...@windriver.com [10/08/19 15:22 +0800]:
>> From: He Zhe 
>>
>> When loading modules with CONFIG_ARCH_HAS_STRICT_MODULE_RWX enabled and
>> CONFIG_STRICT_MODULE_RWX disabled, the memory allocated for modules would
>> not be page-aligned and cause the following BUG during frob_text.
>>
>> [ cut here ]
>> kernel BUG at kernel/module.c:1907!
>> Internal error: Oops - BUG: 0 [#1] ARM
>> Modules linked in:
>> CPU: 0 PID: 89 Comm: systemd-modules Not tainted 5.3.0-rc2 #1
>> Hardware name: ARM-Versatile (Device Tree Support)
>> PC is at frob_text.constprop.0+0x2c/0x40
>> LR is at load_module+0x14b4/0x1d28
>> pc : []    lr : []    psr: 2013
>> sp : ce44fe58  ip :   fp : 
>> r10:   r9 : ce44feb8  r8 : 
>> r7 : 0001  r6 : bf00032c  r5 : ce44ff40  r4 : bf000320
>> r3 : bf000400  r2 : 0fff  r1 : 0220  r0 : bf00
>> Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
>> Control: 00093177  Table: 0e4c  DAC: 0051
>> Process systemd-modules (pid: 89, stack limit = 0x9fccc8dc)
>> Stack: (0xce44fe58 to 0xce45)
>> fe40:    cf1b05b8
>> fe60: 0001 ce47cf08 bf002754 c07ae5d8 d0a2a484 bf002060 bf0004f8 
>> fe80: b6d17910 c017cf1c ce47cf00 d0a29000 ce47cf00 ce44ff34 14fc 
>> fea0:   bf00025c 0001   6e72656b 6c65
>> fec0:        
>> fee0:      c0ac9048 7fff 
>> ff00: b6d17910 0005 017b c0009208 ce44e000  b6ebfe54 c008562c
>> ff20: 7fff  0003 cefd28f8 0001 d0a29000 14fc 
>> ff40: d0a292cb d0a29380 d0a29000 14fc d0a29f0c d0a29d90 d0a29a60 0520
>> ff60: 0710 0718 0826    0708 0023
>> ff80: 0024 001c  0016  c0ac9048 0041c620 
>> ffa0:  c0009000 0041c620  0005 b6d17910  
>> ffc0: 0041c620   017b 0041f078  004098b0 b6ebfe54
>> ffe0: bedb6bc8 bedb6bb8 b6d0f91c b6c945a0 6010 0005  
>> [] (frob_text.constprop.0) from [] 
>> (load_module+0x14b4/0x1d28)
>> [] (load_module) from [] (sys_finit_module+0xa0/0xc4)
>> [] (sys_finit_module) from [] (ret_fast_syscall+0x0/0x50)
>> Exception stack(0xce44ffa8 to 0xce44fff0)
>> ffa0:   0041c620  0005 b6d17910  
>> ffc0: 0041c620   017b 0041f078  004098b0 b6ebfe54
>> ffe0: bedb6bc8 bedb6bb8 b6d0f91c b6c945a0
>> Code: e7f001f2 e5931008 e1110002 0a01 (e7f001f2)
>> ---[ end trace e904557128d9aed5 ]---
>>
>> This patch enables page-aligned allocation when
>> CONFIG_ARCH_HAS_STRICT_MODULE_RWX is enabled.
>>
>> Fixes: 93651f80dcb6 ("modules: fix compile error if don't have strict module 
>> rwx")
>> Signed-off-by: He Zhe 
>
> Hi!
>
> I have already committed a fix for this to modules-next and plan to
> send a pull request next week.

Thanks for pointing out :)

https://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux.git/commit/?h=modules-next=38f054d549a869f22a02224cd276a27bf14b6171

But I'd suggest we should keep the case of "define debug_align(X) (X)" for all
the rest arches without CONFIG_HAS_STRICT_MODULE_RWX ability, which would save
people who are sensitive to system size a lot of memory when using modules,
especially for embedded systems, as this patch did. This seems the original
intention of this #ifdef... statement and still valid for now.

Zhe

>
> Thanks!
>
> Jessica
>



[tip:perf/urgent] perf cpumap: Fix writing to illegal memory in handling cpumap mask

2019-08-08 Thread tip-bot for He Zhe
Commit-ID:  5f5e25f1c7933a6e1673515c0b1d5acd82fea1ed
Gitweb: https://git.kernel.org/tip/5f5e25f1c7933a6e1673515c0b1d5acd82fea1ed
Author: He Zhe 
AuthorDate: Fri, 2 Aug 2019 16:29:52 +0800
Committer:  Arnaldo Carvalho de Melo 
CommitDate: Thu, 8 Aug 2019 15:41:10 -0300

perf cpumap: Fix writing to illegal memory in handling cpumap mask

cpu_map__snprint_mask() would write to illegal memory pointed by
zalloc(0) when there is only one cpu.

This patch fixes the calculation and adds sanity check against the input
parameters.

Signed-off-by: He Zhe 
Cc: Alexander Shishkin 
Cc: Alexey Budankov 
Cc: Jiri Olsa 
Cc: Kan Liang 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Cc: Stephane Eranian 
Fixes: 4400ac8a9a90 ("perf cpumap: Introduce cpu_map__snprint_mask()")
Link: 
http://lkml.kernel.org/r/1564734592-15624-2-git-send-email-zhe...@windriver.com
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/util/cpumap.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/cpumap.c b/tools/perf/util/cpumap.c
index 3acfbe34ebaf..39cce66b4ebc 100644
--- a/tools/perf/util/cpumap.c
+++ b/tools/perf/util/cpumap.c
@@ -751,7 +751,10 @@ size_t cpu_map__snprint_mask(struct cpu_map *map, char 
*buf, size_t size)
unsigned char *bitmap;
int last_cpu = cpu_map__cpu(map, map->nr - 1);
 
-   bitmap = zalloc((last_cpu + 7) / 8);
+   if (buf == NULL)
+   return 0;
+
+   bitmap = zalloc(last_cpu / 8 + 1);
if (bitmap == NULL) {
buf[0] = '\0';
return 0;


[tip:perf/urgent] perf ftrace: Fix failure to set cpumask when only one cpu is present

2019-08-08 Thread tip-bot for He Zhe
Commit-ID:  cf30ae726c011e0372fd4c2d588466c8b50a8907
Gitweb: https://git.kernel.org/tip/cf30ae726c011e0372fd4c2d588466c8b50a8907
Author: He Zhe 
AuthorDate: Fri, 2 Aug 2019 16:29:51 +0800
Committer:  Arnaldo Carvalho de Melo 
CommitDate: Thu, 8 Aug 2019 15:41:10 -0300

perf ftrace: Fix failure to set cpumask when only one cpu is present

The buffer containing the string used to set cpumask is overwritten at
the end of the string later in cpu_map__snprint_mask due to not enough
memory space, when there is only one cpu.

And thus causes the following failure:

  $ perf ftrace ls
  failed to reset ftrace
  $

This patch fixes the calculation of the cpumask string size.

Signed-off-by: He Zhe 
Tested-by: Arnaldo Carvalho de Melo 
Cc: Alexander Shishkin 
Cc: Alexey Budankov 
Cc: Jiri Olsa 
Cc: Kan Liang 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Cc: Stephane Eranian 
Fixes: dc23103278c5 ("perf ftrace: Add support for -a and -C option")
Link: 
http://lkml.kernel.org/r/1564734592-15624-1-git-send-email-zhe...@windriver.com
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/builtin-ftrace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/builtin-ftrace.c b/tools/perf/builtin-ftrace.c
index 66d5a6658daf..019312810405 100644
--- a/tools/perf/builtin-ftrace.c
+++ b/tools/perf/builtin-ftrace.c
@@ -173,7 +173,7 @@ static int set_tracing_cpumask(struct cpu_map *cpumap)
int last_cpu;
 
last_cpu = cpu_map__cpu(cpumap, cpumap->nr - 1);
-   mask_size = (last_cpu + 3) / 4 + 1;
+   mask_size = last_cpu / 4 + 2; /* one more byte for EOS */
mask_size += last_cpu / 32; /* ',' is needed for every 32th cpus */
 
cpumask = malloc(mask_size);


Re: [PATCH] nfsd4: Fix kernel crash when reading proc file reply_cache_stats

2019-08-06 Thread He Zhe
Please ignore this one. Resent.

Thanks,
Zhe

On 8/6/19 5:38 PM, zhe...@windriver.com wrote:
> From: He Zhe 
>
> reply_cache_stats uses wrong parameter as seq file private structure and
> thus causes the following kernel crash when users read
> /proc/fs/nfsd/reply_cache_stats
>
> m=a2ec03f7 v=f5777155
> BUG: kernel NULL pointer dereference, address: 01f9
> PGD 0 P4D 0
> Oops:  [#3] SMP PTI
> CPU: 6 PID: 1502 Comm: cat Tainted: G  D   5.3.0-rc3+ #1
> Hardware name: Intel Corporation Broadwell Client platform/Basking Ridge, 
> BIOS BDW-E2R1.86C.0118.R01.1503110618 03/11/2015
> RIP: 0010:nfsd_reply_cache_stats_show+0x3b/0x2d0
> Code: 41 54 49 89 f4 48 89 fe 48 c7 c7 b3 10 33 88 53 bb e8 03 00 00 e8 88 82 
> d1 ff bf 58 89 41 00 e8 eb c5 85 00 48 83 eb 01 75 f0 <41> 8b 94 24 f8 01 00 
> 00 48 c7 c6 be 10 33 88 4c 89 ef bb e8 03 00
> RSP: 0018:aa520106fe08 EFLAGS: 00010246
> RAX: 00cfe1a77123 RBX:  RCX: 00291b46
> RDX: 00cf RSI: 0006 RDI: 00291b28
> RBP: aa520106fe20 R08: 0006 R09: 00cfe17e55dd
> R10: a424e47c R11: 030b R12: 0001
> R13: a424e5697000 R14: 0001 R15: a424e5697000
> FS:  7f805735f580() GS:a424f8f8() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 01f9 CR3: 655ce005 CR4: 003606e0
> Call Trace:
>  seq_read+0x194/0x3e0
>  __vfs_read+0x1b/0x40
>  vfs_read+0x95/0x140
>  ksys_read+0x61/0xe0
>  __x64_sys_read+0x1a/0x20
>  do_syscall_64+0x4d/0x120
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x7f805728b861
> Code: fe ff ff 50 48 8d 3d 86 b4 09 00 e8 79 e0 01 00 66 0f 1f 84 00 00 00 00 
> 00 48 8d 05 d9 19 0d 00 8b 00 85 c0 75 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 
> 57 c3 66 0f 1f 44 00 00 48 83 ec 28 48 89 54
> RSP: 002b:7ffea1ce3c38 EFLAGS: 0246 ORIG_RAX: 
> RAX: ffda RBX: 0002 RCX: 7f805728b861
> RDX: 0002 RSI: 7f8057183000 RDI: 0003
> RBP: 7f8057183000 R08: 7f8057182010 R09: 
> R10: 0022 R11: 0246 R12: 559a60e8ff10
> R13: 0003 R14: 0002 R15: 0002
> Modules linked in:
> CR2: 01f9
> ---[ end trace 01613595153f0cba ]---
> RIP: 0010:nfsd_reply_cache_stats_show+0x3b/0x2d0
> Code: 41 54 49 89 f4 48 89 fe 48 c7 c7 b3 10 33 88 53 bb e8 03 00 00 e8 88 82 
> d1 ff bf 58 89 41 00 e8 eb c5 85 00 48 83 eb 01 75 f0 <41> 8b 94 24 f8 01 00 
> 00 48 c7 c6 be 10 33 88 4c 89 ef bb e8 03 00
> RSP: 0018:aa52004b3e08 EFLAGS: 00010246
> RAX: 002bab45a7c6 RBX:  RCX: 00291b4c
> RDX: 002b RSI: 0004 RDI: 00291b28
> RBP: aa52004b3e20 R08: 0004 R09: 002bab1c8c7a
> R10: a424e550 R11: 02a9 R12: 0001
> R13: a424e4475000 R14: 0001 R15: a424e4475000
> FS:  7f805735f580() GS:a424f8f8() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 01f9 CR3: 655ce005 CR4: 003606e0
> Killed
>
> Fixes: 3ba75830ce17 ("nfsd4: drc containerization")
> Signed-off-by: He Zhe 
> ---
>  fs/nfsd/nfscache.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
> index 26ad75a..96352ab 100644
> --- a/fs/nfsd/nfscache.c
> +++ b/fs/nfsd/nfscache.c
> @@ -571,7 +571,7 @@ nfsd_cache_append(struct svc_rqst *rqstp, struct kvec 
> *data)
>   */
>  static int nfsd_reply_cache_stats_show(struct seq_file *m, void *v)
>  {
> - struct nfsd_net *nn = v;
> + struct nfsd_net *nn = m->private;
>  
>   seq_printf(m, "max entries:   %u\n", nn->max_drc_entries);
>   seq_printf(m, "num entries:   %u\n",



Re: [PATCH 1/2] perf: Fix failure to set cpumask when only one cpu

2019-08-02 Thread He Zhe



On 8/2/19 9:06 PM, Jiri Olsa wrote:
> On Fri, Aug 02, 2019 at 04:29:51PM +0800, zhe...@windriver.com wrote:
>> From: He Zhe 
>>
>> The buffer containing string used to set cpumask is overwritten by end of
>> string later in cpu_map__snprint_mask due to not enough memory space, when
>> there is only one cpu. And thus causes the following failure.
>>
>> $ perf ftrace ls
>> failed to reset ftrace
>>
>> This patch fixes the calculation of cpumask string size.
>>
>> Signed-off-by: He Zhe 
>> ---
>>  tools/perf/builtin-ftrace.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/tools/perf/builtin-ftrace.c b/tools/perf/builtin-ftrace.c
>> index 66d5a66..0193128 100644
>> --- a/tools/perf/builtin-ftrace.c
>> +++ b/tools/perf/builtin-ftrace.c
>> @@ -173,7 +173,7 @@ static int set_tracing_cpumask(struct cpu_map *cpumap)
>>  int last_cpu;
>>  
>>  last_cpu = cpu_map__cpu(cpumap, cpumap->nr - 1);
>> -mask_size = (last_cpu + 3) / 4 + 1;
>> +mask_size = last_cpu / 4 + 2; /* one more byte for EOS */
>>  mask_size += last_cpu / 32; /* ',' is needed for every 32th cpus */
> ugh..  why do we care about last_cpu value in here at all?
>
> feels like using static buffer would be more reasonable

Thanks, and yes, a static buffer would be easy to handle. A 2KB buffer is enough
to cover 8196 cpus, the maximum numbers of cpus we can run with for now.

Let's see if there is any other concerns.

Zhe

>
> jirka
>
>>  
>>  cpumask = malloc(mask_size);
>> -- 
>> 2.7.4
>>



Re: [PATCH] netfilter: Fix remainder of pseudo-header protocol 0

2019-06-28 Thread He Zhe



On 6/28/19 2:49 AM, Pablo Neira Ayuso wrote:
> On Mon, Jun 24, 2019 at 11:17:38AM +0800, zhe...@windriver.com wrote:
>> From: He Zhe 
>>
>> Since v5.1-rc1, some types of packets do not get unreachable reply with the
>> following iptables setting. Fox example,
>>
>> $ iptables -A INPUT -p icmp --icmp-type 8 -j REJECT
>> $ ping 127.0.0.1 -c 1
>> PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
>> — 127.0.0.1 ping statistics —
>> 1 packets transmitted, 0 received, 100% packet loss, time 0ms
>>
>> We should have got the following reply from command line, but we did not.
>> From 127.0.0.1 icmp_seq=1 Destination Port Unreachable
>>
>> Yi Zhao reported it and narrowed it down to:
>> 7fc38225363d ("netfilter: reject: skip csum verification for protocols that 
>> don't support it"),
>>
>> This is because nf_ip_checksum still expects pseudo-header protocol type 0 
>> for
>> packets that are of neither TCP or UDP, and thus ICMP packets are mistakenly
>> treated as TCP/UDP.
>>
>> This patch corrects the conditions in nf_ip_checksum and all other places 
>> that
>> still call it with protocol 0.
> Looking at 7fc38225363dd8f19e667ad7c77b63bc4a5c065d, I wonder this can
> be fixed while simplifying it...
>
> I think nf_reject_verify_csum() is useless?
>
> In your patch, now you explicitly check for IPPROTO_TCP and
> IPPROTO_UDP to validate the checksum.

Thanks for your review.

I suppose the two main points of 7fc38225363d are valid and I was trying to
align with them and fix them:
1) Skip csum verification for protocols that don't support it.
2) Remove the protocol 0 used to indicate non-TCP/UDP packets, and use actual
   types instead to be clear.

1) uses nf_reject_verify_csum to skip those that should be skipped and leaves
the protocols that support csum to the rest of the logic including
nf_ip_checksum. But 2) removes the "0" transition from the rest of the
logic and thus causes this issue. So I add the explicit check against TCP/UDP to
nf_ip_checksum. And nf_reject_verify_csum is still useful.

Zhe

>



User Stack Tracer Causes Crash 2

2019-05-30 Thread He Zhe
Hi,

https://lore.kernel.org/lkml/20190320221534.165ab...@oasis.local.home/ didn't 
get merged. And the crash it was trying to fix still happens on the latest 
master branch. If rebasing the patch on the latest top, the following new crash 
come up.

Sometimes,

root@intel-x86-64:~# echo 1 > /sys/kernel/debug/tracing/options/userstacktrace
root@intel-x86-64:~# echo 1 > 
/sys/kernel/debug/tracing/events/preemptirq/irq_disable/enable
root@intel-x86-64:~# echo 1 > /proc/sys/kernel/stack_tracer_enabled
hangs...

Sometimes,

root@intel-x86-64:~# echo 1 > /sys/kernel/debug/tracing/options/userstacktrace
root@intel-x86-64:~# echo 1 > 
/sys/kernel/debug/tracing/events/preemptirq/irq_disable/enable
root@intel-x86-64:~# echo 1 > /proc/sys/kernel/stack_tracer_enabled
PANIC: double fault, error_code: 0x0
CPU: 0 PID: 492 Comm: sh Not tainted 5.2.0-rc2 #1
Hardware name: Intel Corporation Broadwell Client platform/Basking Ridge, BIOS 
BDW-E2R1.86C.0118.R01.1503110618 03/11/2015
RIP: 0010:error_entry+0x32/0x100
Code: 89 7c 24 08 52 31 d2 51 31 c9 50 41 50 45 31 c0 41 51 45 31 c9 41 52 45 
31 d2 41 53 45 31 db 53 31 db 55 31 ed 41 54 45 31 e4 <41> 55 45 31 ed 41 56 45 
31 f6 41 57 45 31 ff 56 48 8d 6c 24 09 f6
RSP: 0018:9ab68000 EFLAGS: 00010046
RAX: ae200a97 RBX:  RCX: 
RDX:  RSI: ae200ec9 RDI: ae201176
RBP:  R08:  R09: 
R10:  R11:  R12: 
R13:  R14:  R15: 
FS:  7f2c078a4740() GS:988fb8a0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 9ab67ff8 CR3: 5b8ee003 CR4: 003606f0
Call Trace:
 
 ? native_iret+0x7/0x7
 ? int3+0x29/0x40
 ? error_entry+0x86/0x100
 ? trace_hardirqs_off_caller_cr2+0x1/0x20
 ? trace_hardirqs_off_caller_cr2+0x1/0x20
 ? trace_hardirqs_off_thunk_cr2+0x1a/0x1c
 ? native_iret+0x7/0x7
 ? int3+0x29/0x40
 ? error_entry+0x86/0x100
 ? error_entry+0x86/0x100
 ? int3+0x29/0x40
 ? native_iret+0x7/0x7
 ? int3+0x29/0x40
 ? error_entry+0x86/0x100
 ? trace_hardirqs_off_caller_cr2+0x1/0x20
 ? trace_hardirqs_off_caller_cr2+0x1/0x20
 ? trace_hardirqs_off_thunk_cr2+0x1a/0x1c
 ? native_iret+0x7/0x7
 ? int3+0x29/0x40
 ? error_entry+0x86/0x100
 ? error_entry+0x86/0x100
 ? int3+0x29/0x40
 ? native_iret+0x7/0x7
 ? int3+0x29/0x40
 ? error_entry+0x86/0x100
 ? trace_hardirqs_off_caller_cr2+0x1/0x20
 ? trace_hardirqs_off_caller_cr2+0x1/0x20
 ? trace_hardirqs_off_thunk_cr2+0x1a/0x1c
 ? native_iret+0x7/0x7
 ? int3+0x29/0x40
 ? error_entry+0x86/0x100
 ? error_entry+0x86/0x100
 ? int3+0x29/0x40
 ? native_iret+0x7/0x7
 ? int3+0x29/0x40
 ? error_entry+0x86/0x100
 ? trace_hardirqs_off_caller_cr2+0x1/0x20
 ? trace_hardirqs_off_caller_cr2+0x1/0x20
 ? trace_hardirqs_off_thunk_cr2+0x1a/0x1c
 ? native_iret+0x7/0x7
 ? int3+0x29/0x40
 ? error_entry+0x86/0x100
 ? error_entry+0x86/0x100
 ? int3+0x29/0x40
 ? native_iret+0x7/0x7
 ? int3+0x29/0x40
 ? error_entry+0x86/0x100
 ? trace_hardirqs_off_caller_cr2+0x1/0x20
 ? trace_hardirqs_off_caller_cr2+0x1/0x20
 ? trace_hardirqs_off_thunk_cr2+0x1a/0x1c
 ? native_iret+0x7/0x7
 ? int3+0x29/0x40
 ? error_entry+0x86/0x100
 ? error_entry+0x86/0x100
 ? int3+0x29/0x40
 ? native_iret+0x7/0x7
 ? int3+0x29/0x40
 ? error_entry+0x86/0x100
 ? trace_hardirqs_off_caller_cr2+0x1/0x20
 ? trace_hardirqs_off_caller_cr2+0x1/0x20
 ? trace_hardirqs_off_thunk_cr2+0x1a/0x1c
 ? native_iret+0x7/0x7
 ? int3+0x29/0x40
 ? error_entry+0x86/0x100
 ? error_entry+0x86/0x100
 ? int3+0x29/0x40
 ? native_iret+0x7/0x7
 ? int3+0x29/0x40
 ? error_entry+0x86/0x100
 ? trace_hardirqs_off_caller_cr2+0x1/0x20
 ? trace_hardirqs_off_caller_cr2+0x1/0x20
 ? trace_hardirqs_off_thunk_cr2+0x1a/0x1c
 ? native_iret+0x7/0x7
 ? int3+0x29/0x40
 ? error_entry+0x86/0x100
 ? error_entry+0x86/0x100
 ? int3+0x29/0x40
 ? native_iret+0x7/0x7
 ? int3+0x29/0x40
 ? error_entry+0x86/0x100
 ? trace_hardirqs_off_caller_cr2+0x1/0x20
 ? trace_hardirqs_off_caller_cr2+0x1/0x20
 ? trace_hardirqs_off_thunk_cr2+0x1a/0x1c
 ? native_iret+0x7/0x7
 ? int3+0x29/0x40
 ? error_entry+0x86/0x100
 ? error_entry+0x86/0x100
 ? int3+0x29/0x40
 ? native_iret+0x7/0x7
 ? int3+0x29/0x40
 ? error_entry+0x86/0x100
 ? trace_hardirqs_off_caller_cr2+0x1/0x20
 ? trace_hardirqs_off_caller_cr2+0x1/0x20
 ? trace_hardirqs_off_thunk_cr2+0x1a/0x1c
 ? native_iret+0x7/0x7
 ? int3+0x29/0x40
 ? error_entry+0x86/0x100
 ? error_entry+0x86/0x100
 ? int3+0x29/0x40
 ? native_iret+0x7/0x7
 ? int3+0x29/0x40
 ? error_entry+0x86/0x100
 ? trace_hardirqs_off_caller_cr2+0x1/0x20
 ? trace_hardirqs_off_caller_cr2+0x1/0x20
 ? trace_hardirqs_off_thunk_cr2+0x1a/0x1c
 ? native_iret+0x7/0x7
 ? int3+0x29/0x40
 ? error_entry+0x86/0x100
 ? error_entry+0x86/0x100
 ? int3+0x29/0x40
 ? native_iret+0x7/0x7
 ? int3+0x29/0x40
 ? error_entry+0x86/0x100
 ? trace_hardirqs_off_caller_cr2+0x1/0x20
 ? trace_hardirqs_off_caller_cr2+0x1/0x20
 ? 

Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry

2019-04-16 Thread He Zhe
>From the last replies in the thread, it seems some work is going on. May I ask
when we can possibly roughly have a fix or workaround?

Thanks,
Zhe


On 3/21/19 10:15 AM, Steven Rostedt wrote:
> From: Steven Rostedt (VMware) 
>
> He Zhe reported a crash by enabling trace events and selecting
> "userstacktrace" which will read the stack of userspace for every trace
> event recorded. Zhe narrowed it down to:
>
>  c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and unify their 
> usage")
>
> With the problem config, I was able to also reproduce the error. I
> narrowed it down to just having to do the following:
>
>  # cd /sys/kernel/tracing
>  # echo 1 > options/userstacktrace
>  # echo 1 > events/preemptirq/irq_disable/enable
>
> And sure enough, I triggered a crash. Well, it was systemd crashing
> with a bad memory access??
>
>  systemd-journal[537]: segfault at ed8cb8 ip 7f7fffc9fef5 sp 
> 7ffc4062cb10 error 7
>
> And it would crash similarly each time I tried it, but always at a
> different place. After spending the day on this, I finally figured it
> out. The bug is happening in entry_64.S right after error_entry.
> There's two TRACE_IRQS_OFF in that code path, which if I comment out,
> the bug goes away. Then it dawned on me that the crash always happens
> when systemd does a normal page fault. We had this bug before, and it
> was with the exception trace points.
>
> The issue is that a tracepoint can fault (reading vmalloc or whatever).
> And doing a userspace stack trace most definitely will fault. But if we
> are coming from a legitimate page fault, the address of that fault (in
> the CR2 register) will be lost if we fault before we get to the page
> fault handler. That's exactly what is happening.
>
> To solve this, a TRACE_IRQS_OFF_CR2 (and ON for consistency) was added
> that saves the CR2 register. A new trace_hardirqs_off_thunk_cr2 is
> created that stores the cr2 register, calls the
> trace_hardirqs_off_caller, then on return restores the cr2 register if
> it changed, before returning.
>
> On my tests this fixes the issue. I just want to know if this is a
> legitimate fix or if someone can come up with a better fix?
>
> Note: this also saves the exception context just like the
> do_page_fault() function does.
>
> Note2: This only gets enabled when lockdep or irq tracing is enabled,
> which is not recommended for production environments.
>
> Link: 
> http://lkml.kernel.org/r/897cf5cf-fc24-8a64-cb28-847f2d2e6...@windriver.com
>
> Fixes: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and unify 
> their usage")
> Signed-off-by: Steven Rostedt (VMware) 
> ---
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 7bc105f47d21..7edffec328e2 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -292,6 +292,32 @@ __visible void do_syscall_64(unsigned long nr, struct 
> pt_regs *regs)
>  
>   syscall_return_slowpath(regs);
>  }
> +
> +extern void trace_hardirqs_on_caller(unsigned long caller_addr);
> +__visible void trace_hardirqs_on_caller_cr2(unsigned long caller_addr)
> +{
> + unsigned long address = read_cr2(); /* Get the faulting address */
> + enum ctx_state prev_state;
> +
> + prev_state = exception_enter();
> + trace_hardirqs_on_caller(caller_addr);
> + if (address != read_cr2())
> + write_cr2(address);
> + exception_exit(prev_state);
> +}
> +
> +extern void trace_hardirqs_off_caller(unsigned long caller_addr);
> +__visible void trace_hardirqs_off_caller_cr2(unsigned long caller_addr)
> +{
> + unsigned long address = read_cr2(); /* Get the faulting address */
> + enum ctx_state prev_state;
> +
> + prev_state = exception_enter();
> + trace_hardirqs_off_caller(caller_addr);
> + if (address != read_cr2())
> + write_cr2(address);
> + exception_exit(prev_state);
> +}
>  #endif
>  
>  #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 1f0efdb7b629..73ddf24fed3e 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1248,12 +1248,12 @@ ENTRY(error_entry)
>* we fix gsbase, and we should do it before enter_from_user_mode
>* (which can take locks).
>*/
> - TRACE_IRQS_OFF
> + TRACE_IRQS_OFF_CR2
>   CALL_enter_from_user_mode
>   ret
>  
>  .Lerror_entry_done:
> - TRACE_IRQS_OFF
> + TRACE_IRQS_OFF_CR2
>   ret
>  
>   /*
> diff --git a/arch/x86/entry/thunk_64.S b/arch/x86/entry/thunk_64.S
> index be36bf4e095

User Stack Tracer Causes Crash

2019-03-19 Thread He Zhe
Hi,

User stack tracer causes crash and hang since the following commit till 
now(5.1-rc1).

c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and unify their 
usage")

echo 1 > /proc/sys/kernel/stack_tracer_enabled
echo userstacktrace > /sys/kernel/debug/tracing/trace_options
echo 1 > /sys/kernel/debug/tracing/events/enable
dmesg

login[269]: segfault at 80 ip 7f7e847edc19 sp 7ffcc8cefdc0 error 7 in 
libc-2.29.so[7f7e8478e000+142000]
Code: ff ff 0f 1f 80 00 00 00 00 4a 8d 0c e0 48 8b 51 40 48 85 d2 0f 84 2a ff 
ff ff 48 81 fb ff 03 00 00 0f 87 ba 01 00 00 48 8b 32 <48> 89 71 40 42 80 2c 20 
01 48 c7 42 08 00 00 00 00 48 83 c4 08 48
systemd[1]: segfault at b ip 7ff15b8a8420 sp 7ffc6eaab890 error 7 in 
libc-2.29.so[7ff15b7a1000+142000]
Code: b6 8f 08 00 ff 25 a0 71 08 00 48 83 ec 08 be 01 00 00 00 31 c0 83 3d 1e 
de 08 00 00 74 0c f0 0f b1 35 1c df 08 00 75 0b eb 23 <0f> b1 35 11 df 08 00 74 
1a 48 8d 3d 08 df 08 00 48 81 ec 80 00 00
systemd[1]: segfault at 0 ip 7ff15bab40db sp 7ffc6eaaa7f0 error 7 in 
libsystemd-shared-241.so[7ff15ba2c000+12f000]
Code: cb d4 f7 ff 48 83 c4 20 44 8b 54 24 0c eb b3 41 57 41 56 4d 89 ce 41 55 
4d 89 c5 41 54 55 89 f5 53 89 fb 48 81 ec 38 08 00 00 <48> 89 54 24 08 4c 8b bc 
24 70 08 00 00 89 4c 24 18 64 48 8b 04 25
printk: systemd: 30 output lines suppressed due to ratelimiting
Kernel panic - not syncing: Attempted to kill init! exitcode=0x000b
CPU: 5 PID: 1 Comm: systemd Not tainted 5.1.0-rc1-yocto-standard+ #2
Hardware name: Intel Corporation Broadwell Client platform/Basking Ridge, BIOS 
BDW-E2R1.86C.0118.R01.1503110618 03/11/2015
Call Trace:
 dump_stack+0x67/0x95
 panic+0xfd/0x282
 ? do_exit+0xe34/0xf30
 do_exit+0xf24/0xf30
 ? do_exit+0x5/0xf30
 do_group_exit+0x5c/0xd0
 get_signal+0x18e/0xa40
 do_signal+0x37/0x830
 exit_to_usermode_loop+0x78/0xf0
 prepare_exit_to_usermode+0xa0/0x100
 ? page_fault+0x8/0x30
 retint_user+0x8/0x18
RIP: 0033:0x7ff15bab40db
Code: Bad RIP value.
RSP: 002b:7ffc6eaaa7f0 EFLAGS: 00010206
RAX: 7ffc6eaab070 RBX:  RCX: 00d8
RDX: 559989da74d5 RSI:  RDI: 
RBP:  R08: 559989daa4b5 R09: 559989da8150
R10: 0004 R11: 0246 R12: 000b
R13: 559989daa4b5 R14: 559989da8150 R15: 000b
Kernel Offset: 0x3320 from 0x8100 (relocation range: 
0x8000-0xbfff)
---[ end Kernel panic - not syncing: Attempted to kill init! 
exitcode=0x000b ]---
[ cut here ]
sched: Unexpected reschedule of offline CPU#1!
WARNING: CPU: 5 PID: 1 at arch/x86/kernel/smp.c:128 
native_smp_send_reschedule+0x95/0xc0
Modules linked in:
CPU: 5 PID: 1 Comm: systemd Not tainted 5.1.0-rc1-yocto-standard+ #2
Hardware name: Intel Corporation Broadwell Client platform/Basking Ridge, BIOS 
BDW-E2R1.86C.0118.R01.1503110618 03/11/2015
RIP: 0010:native_smp_send_reschedule+0x95/0xc0
Code: 5d 5d c3 b9 01 00 00 00 31 d2 be 01 00 00 00 48 c7 c7 b8 bc b6 b5 e8 0a 
85 13 00 44 89 e6 48 c7 c7 c8 7f 78 b5 e8 eb c5 02 00 <0f> 0b b9 01 00 00 00 31 
d2 be 01 00 00 00 48 c7 c7 88 bc b6 b5 e8
RSP: 0018:9434f8d43c38 EFLAGS: 00010082
RAX:  RBX: 0001 RCX: 
RDX:  RSI:  RDI: b5b83e98
RBP: 9434f8d43c50 R08: 00010004 R09: 02bf
R10: 9434f8d43b70 R11: 02be R12: 0001
R13: 0001 R14: 9434f8d43d28 R15: 9434f057
FS:  7ff15b6b6840() GS:9434f8d4() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7ff15bab40b1 CR3: 7220e005 CR4: 003606e0
Call Trace:
 
 resched_curr+0xac/0x180
 check_preempt_curr+0x56/0xb0
 ttwu_do_wakeup.isra.17+0x1e/0x270
 ttwu_do_activate+0x78/0x90
 try_to_wake_up+0x243/0x5c0
 ? default_wake_function+0x5/0x20
 default_wake_function+0x12/0x20
 autoremove_wake_function+0x12/0x40
 __wake_up_common+0x8c/0x130
 __wake_up_common_lock+0x80/0xc0
 __wake_up+0x13/0x20
 wake_up_klogd_work_func+0x4c/0x80
 irq_work_run_list+0x6c/0x90
 ? tick_sched_handle.isra.5+0x50/0x50
 irq_work_tick+0x55/0x60
 update_process_times+0x42/0x60
 tick_sched_handle.isra.5+0x34/0x50
 tick_sched_timer+0x40/0xa0
 __hrtimer_run_queues+0x175/0x450
 hrtimer_interrupt+0x141/0x290
 smp_apic_timer_interrupt+0x8f/0x260
 apic_timer_interrupt+0xf/0x20
 
RIP: 0010:panic+0x242/0x282
Code: b0 83 3d 8a e4 bf 01 00 74 05 e8 4b c6 02 00 48 c7 c6 00 91 e7 b5 48 c7 
c7 68 f7 78 b5 e8 ab 74 07 00 e8 e3 60 10 00 fb 31 db <4c> 39 eb 7c 1d 41 83 f4 
01 48 8b 05 30 e4 bf 01 44 89 e7 e8 78 64
RSP: 0018:b306c038fc58 EFLAGS: 0246 ORIG_RAX: ff13
RAX:  RBX:  RCX: 
RDX:  RSI: b42815d4 RDI: b427ac5d
RBP: b306c038fcd0 R08: 8000 R09: b7cc
R10: 0944 R11: 02bc R12: 

Re: LTP case read_all_proc fails on qemux86-64 since 5.0-rc1

2019-01-25 Thread He Zhe



On 1/26/19 3:32 AM, Jens Axboe wrote:
> On 1/24/19 8:20 PM, Jens Axboe wrote:
>> On 1/24/19 8:14 PM, He Zhe wrote:
>>> On 1/25/19 10:26 AM, Jens Axboe wrote:
>>>> On 1/24/19 7:22 PM, He Zhe wrote:
>>>>> On 1/25/19 10:05 AM, Jens Axboe wrote:
>>>>>> On 1/24/19 6:40 PM, He Zhe wrote:
>>>>>>> On 1/24/19 11:40 PM, Jens Axboe wrote:
>>>>>>>> On 1/24/19 1:23 AM, He Zhe wrote:
>>>>>>>>> On 1/24/19 12:05 AM, Jens Axboe wrote:
>>>>>>>>>> On 1/22/19 8:39 PM, Jens Axboe wrote:
>>>>>>>>>>> On Jan 22, 2019, at 8:13 PM, He Zhe  wrote:
>>>>>>>>>>>> LTP case read_all_proc(read_all -d /proc -q -r 10) often, but not 
>>>>>>>>>>>> every time, fails with the following call traces, since 
>>>>>>>>>>>> 600335205b8d "ide: convert to blk-mq"(5.0-rc1) till now(5.0-rc3).
>>>>>>>>>>>>
>>>>>>>>>>>> qemu-system-x86_64 -drive file=rootfs.ext4,if=virtio,format=raw 
>>>>>>>>>>>> -object rng-random,filename=/dev/urandom,id=rng0 -device 
>>>>>>>>>>>> virtio-rng-pci,rng=rng0 -nographic -m 16192 -smp cpus=12 -cpu 
>>>>>>>>>>>> core2duo -enable-kvm -serial mon:stdio -serial null -kernel 
>>>>>>>>>>>> bzImage -append 'root=/dev/vda rw highres=off console=ttyS0 
>>>>>>>>>>>> mem=16192M'
>>>>>>>>>>>>
>>>>>>>>>>>> tst_test.c:1085: INFO: Timeout per run is 0h 05m 00s
>>>>>>>>>>>> [   47.080156] Warning: /proc/ide/hd?/settings interface is 
>>>>>>>>>>>> obsolete, and will be removed soon!
>>>>>>>>>>>> [   47.085330] [ cut here ]
>>>>>>>>>>>> [   47.085810] kernel BUG at block/blk-mq.c:767!
>>>>>>>>>>>> [   47.086498] invalid opcode:  [#1] PREEMPT SMP PTI
>>>>>>>>>>>> [   47.087022] CPU: 5 PID: 146 Comm: kworker/5:1H Not tainted 
>>>>>>>>>>>> 5.0.0-rc3 #1
>>>>>>>>>>>> [   47.087858] Hardware name: QEMU Standard PC (i440FX + PIIX, 
>>>>>>>>>>>> 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
>>>>>>>>>>>> [   47.088992] Workqueue: kblockd blk_mq_run_work_fn
>>>>>>>>>>>> [   47.089469] RIP: 0010:blk_mq_add_to_requeue_list+0xc1/0xd0
>>>>>>>>>>>> [   47.090035] Code: 48 8d 53 48 49 8b 8c 24 b8 04 00 00 48 89 51 
>>>>>>>>>>>> 08 48 89 4b 48 49 8d 8c 24 b8 04 00 00 48 89 4b 50 49 89 94 24 b8 
>>>>>>>>>>>> 04 00 008
>>>>>>>>>>>> [   47.091930] RSP: 0018:9e1ea4b43e40 EFLAGS: 00010002
>>>>>>>>>>>> [   47.092458] RAX: 9e1ea13c0048 RBX: 9e1ea13c RCX: 
>>>>>>>>>>>> 0006
>>>>>>>>>>>> [   47.093181] RDX:  RSI: 0001 RDI: 
>>>>>>>>>>>> 9e1ea13c
>>>>>>>>>>>> [   47.093906] RBP: 9e1ea4b43e68 R08: eb5bcf630680 R09: 
>>>>>>>>>>>> 
>>>>>>>>>>>> [   47.094626] R10: 0001 R11: 0012 R12: 
>>>>>>>>>>>> 9e1ea1033a40
>>>>>>>>>>>> [   47.095347] R13: 9e1ea13a8d00 R14: 9e1ea13a9000 R15: 
>>>>>>>>>>>> 0046
>>>>>>>>>>>> [   47.096071] FS:  () 
>>>>>>>>>>>> GS:9e1ea4b4() knlGS:
>>>>>>>>>>>> [   47.096898] CS:  0010 DS:  ES:  CR0: 80050033
>>>>>>>>>>>> [   47.097477] CR2: 003fda41fda0 CR3: 0003d8e6a000 CR4: 
>>>>>>>>>>>> 06e0
>>>>>>>>>>>> [   47.098203] DR0:  DR1:  DR2: 
>>>>>>>>>>>> 
>>>>

Re: LTP case read_all_proc fails on qemux86-64 since 5.0-rc1

2019-01-24 Thread He Zhe



On 1/25/19 10:26 AM, Jens Axboe wrote:
> On 1/24/19 7:22 PM, He Zhe wrote:
>>
>> On 1/25/19 10:05 AM, Jens Axboe wrote:
>>> On 1/24/19 6:40 PM, He Zhe wrote:
>>>> On 1/24/19 11:40 PM, Jens Axboe wrote:
>>>>> On 1/24/19 1:23 AM, He Zhe wrote:
>>>>>> On 1/24/19 12:05 AM, Jens Axboe wrote:
>>>>>>> On 1/22/19 8:39 PM, Jens Axboe wrote:
>>>>>>>> On Jan 22, 2019, at 8:13 PM, He Zhe  wrote:
>>>>>>>>> LTP case read_all_proc(read_all -d /proc -q -r 10) often, but not 
>>>>>>>>> every time, fails with the following call traces, since 600335205b8d 
>>>>>>>>> "ide: convert to blk-mq"(5.0-rc1) till now(5.0-rc3).
>>>>>>>>>
>>>>>>>>> qemu-system-x86_64 -drive file=rootfs.ext4,if=virtio,format=raw 
>>>>>>>>> -object rng-random,filename=/dev/urandom,id=rng0 -device 
>>>>>>>>> virtio-rng-pci,rng=rng0 -nographic -m 16192 -smp cpus=12 -cpu 
>>>>>>>>> core2duo -enable-kvm -serial mon:stdio -serial null -kernel bzImage 
>>>>>>>>> -append 'root=/dev/vda rw highres=off console=ttyS0 mem=16192M'
>>>>>>>>>
>>>>>>>>> tst_test.c:1085: INFO: Timeout per run is 0h 05m 00s
>>>>>>>>> [   47.080156] Warning: /proc/ide/hd?/settings interface is obsolete, 
>>>>>>>>> and will be removed soon!
>>>>>>>>> [   47.085330] [ cut here ]
>>>>>>>>> [   47.085810] kernel BUG at block/blk-mq.c:767!
>>>>>>>>> [   47.086498] invalid opcode:  [#1] PREEMPT SMP PTI
>>>>>>>>> [   47.087022] CPU: 5 PID: 146 Comm: kworker/5:1H Not tainted 
>>>>>>>>> 5.0.0-rc3 #1
>>>>>>>>> [   47.087858] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
>>>>>>>>> BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
>>>>>>>>> [   47.088992] Workqueue: kblockd blk_mq_run_work_fn
>>>>>>>>> [   47.089469] RIP: 0010:blk_mq_add_to_requeue_list+0xc1/0xd0
>>>>>>>>> [   47.090035] Code: 48 8d 53 48 49 8b 8c 24 b8 04 00 00 48 89 51 08 
>>>>>>>>> 48 89 4b 48 49 8d 8c 24 b8 04 00 00 48 89 4b 50 49 89 94 24 b8 04 00 
>>>>>>>>> 008
>>>>>>>>> [   47.091930] RSP: 0018:9e1ea4b43e40 EFLAGS: 00010002
>>>>>>>>> [   47.092458] RAX: 9e1ea13c0048 RBX: 9e1ea13c RCX: 
>>>>>>>>> 0006
>>>>>>>>> [   47.093181] RDX:  RSI: 0001 RDI: 
>>>>>>>>> 9e1ea13c
>>>>>>>>> [   47.093906] RBP: 9e1ea4b43e68 R08: eb5bcf630680 R09: 
>>>>>>>>> 
>>>>>>>>> [   47.094626] R10: 0001 R11: 0012 R12: 
>>>>>>>>> 9e1ea1033a40
>>>>>>>>> [   47.095347] R13: 9e1ea13a8d00 R14: 9e1ea13a9000 R15: 
>>>>>>>>> 0046
>>>>>>>>> [   47.096071] FS:  () GS:9e1ea4b4() 
>>>>>>>>> knlGS:
>>>>>>>>> [   47.096898] CS:  0010 DS:  ES:  CR0: 80050033
>>>>>>>>> [   47.097477] CR2: 003fda41fda0 CR3: 0003d8e6a000 CR4: 
>>>>>>>>> 06e0
>>>>>>>>> [   47.098203] DR0:  DR1:  DR2: 
>>>>>>>>> 
>>>>>>>>> [   47.098929] DR3:  DR6: fffe0ff0 DR7: 
>>>>>>>>> 0400
>>>>>>>>> [   47.099650] Call Trace:
>>>>>>>>> [   47.099910]  
>>>>>>>>> [   47.100128]  blk_mq_requeue_request+0x58/0x60
>>>>>>>>> [   47.100576]  ide_requeue_and_plug+0x20/0x50
>>>>>>>>> [   47.101014]  ide_intr+0x21a/0x230
>>>>>>>>> [   47.101362]  ? idecd_open+0xc0/0xc0
>>>>>>>>> [   47.101735]  __handle_irq_event_percpu+0x43/0x1e0
>>>>>>>>> [   47.102214]  handle_irq_event_percpu+0x32/0x80
>>>>>>>>&

Re: LTP case read_all_proc fails on qemux86-64 since 5.0-rc1

2019-01-24 Thread He Zhe



On 1/25/19 10:05 AM, Jens Axboe wrote:
> On 1/24/19 6:40 PM, He Zhe wrote:
>>
>> On 1/24/19 11:40 PM, Jens Axboe wrote:
>>> On 1/24/19 1:23 AM, He Zhe wrote:
>>>> On 1/24/19 12:05 AM, Jens Axboe wrote:
>>>>> On 1/22/19 8:39 PM, Jens Axboe wrote:
>>>>>> On Jan 22, 2019, at 8:13 PM, He Zhe  wrote:
>>>>>>> LTP case read_all_proc(read_all -d /proc -q -r 10) often, but not every 
>>>>>>> time, fails with the following call traces, since 600335205b8d "ide: 
>>>>>>> convert to blk-mq"(5.0-rc1) till now(5.0-rc3).
>>>>>>>
>>>>>>> qemu-system-x86_64 -drive file=rootfs.ext4,if=virtio,format=raw -object 
>>>>>>> rng-random,filename=/dev/urandom,id=rng0 -device 
>>>>>>> virtio-rng-pci,rng=rng0 -nographic -m 16192 -smp cpus=12 -cpu core2duo 
>>>>>>> -enable-kvm -serial mon:stdio -serial null -kernel bzImage -append 
>>>>>>> 'root=/dev/vda rw highres=off console=ttyS0 mem=16192M'
>>>>>>>
>>>>>>> tst_test.c:1085: INFO: Timeout per run is 0h 05m 00s
>>>>>>> [   47.080156] Warning: /proc/ide/hd?/settings interface is obsolete, 
>>>>>>> and will be removed soon!
>>>>>>> [   47.085330] [ cut here ]
>>>>>>> [   47.085810] kernel BUG at block/blk-mq.c:767!
>>>>>>> [   47.086498] invalid opcode:  [#1] PREEMPT SMP PTI
>>>>>>> [   47.087022] CPU: 5 PID: 146 Comm: kworker/5:1H Not tainted 5.0.0-rc3 
>>>>>>> #1
>>>>>>> [   47.087858] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
>>>>>>> BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
>>>>>>> [   47.088992] Workqueue: kblockd blk_mq_run_work_fn
>>>>>>> [   47.089469] RIP: 0010:blk_mq_add_to_requeue_list+0xc1/0xd0
>>>>>>> [   47.090035] Code: 48 8d 53 48 49 8b 8c 24 b8 04 00 00 48 89 51 08 48 
>>>>>>> 89 4b 48 49 8d 8c 24 b8 04 00 00 48 89 4b 50 49 89 94 24 b8 04 00 008
>>>>>>> [   47.091930] RSP: 0018:9e1ea4b43e40 EFLAGS: 00010002
>>>>>>> [   47.092458] RAX: 9e1ea13c0048 RBX: 9e1ea13c RCX: 
>>>>>>> 0006
>>>>>>> [   47.093181] RDX:  RSI: 0001 RDI: 
>>>>>>> 9e1ea13c
>>>>>>> [   47.093906] RBP: 9e1ea4b43e68 R08: eb5bcf630680 R09: 
>>>>>>> 
>>>>>>> [   47.094626] R10: 0001 R11: 0012 R12: 
>>>>>>> 9e1ea1033a40
>>>>>>> [   47.095347] R13: 9e1ea13a8d00 R14: 9e1ea13a9000 R15: 
>>>>>>> 0046
>>>>>>> [   47.096071] FS:  () GS:9e1ea4b4() 
>>>>>>> knlGS:
>>>>>>> [   47.096898] CS:  0010 DS:  ES:  CR0: 80050033
>>>>>>> [   47.097477] CR2: 003fda41fda0 CR3: 0003d8e6a000 CR4: 
>>>>>>> 06e0
>>>>>>> [   47.098203] DR0:  DR1:  DR2: 
>>>>>>> 
>>>>>>> [   47.098929] DR3:  DR6: fffe0ff0 DR7: 
>>>>>>> 0400
>>>>>>> [   47.099650] Call Trace:
>>>>>>> [   47.099910]  
>>>>>>> [   47.100128]  blk_mq_requeue_request+0x58/0x60
>>>>>>> [   47.100576]  ide_requeue_and_plug+0x20/0x50
>>>>>>> [   47.101014]  ide_intr+0x21a/0x230
>>>>>>> [   47.101362]  ? idecd_open+0xc0/0xc0
>>>>>>> [   47.101735]  __handle_irq_event_percpu+0x43/0x1e0
>>>>>>> [   47.102214]  handle_irq_event_percpu+0x32/0x80
>>>>>>> [   47.102668]  handle_irq_event+0x39/0x60
>>>>>>> [   47.103074]  handle_edge_irq+0xe8/0x1c0
>>>>>>> [   47.103470]  handle_irq+0x20/0x30
>>>>>>> [   47.103819]  do_IRQ+0x46/0xe0
>>>>>>> [   47.104128]  common_interrupt+0xf/0xf
>>>>>>> [   47.104505]  
>>>>>>> [   47.104731] RIP: 0010:ide_output_data+0xbc/0x100
>>>>>>> [   47.105201] Code: 74 22 8d 41 ff 85 c9 74 24 49 8d 54 40 02 41 0f b7 
>>>>>>> 00 66 41 89 01 49

Re: LTP case read_all_proc fails on qemux86-64 since 5.0-rc1

2019-01-24 Thread He Zhe



On 1/24/19 12:05 AM, Jens Axboe wrote:
> On 1/22/19 8:39 PM, Jens Axboe wrote:
>> On Jan 22, 2019, at 8:13 PM, He Zhe  wrote:
>>> LTP case read_all_proc(read_all -d /proc -q -r 10) often, but not every 
>>> time, fails with the following call traces, since 600335205b8d "ide: 
>>> convert to blk-mq"(5.0-rc1) till now(5.0-rc3).
>>>
>>> qemu-system-x86_64 -drive file=rootfs.ext4,if=virtio,format=raw -object 
>>> rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 
>>> -nographic -m 16192 -smp cpus=12 -cpu core2duo -enable-kvm -serial 
>>> mon:stdio -serial null -kernel bzImage -append 'root=/dev/vda rw 
>>> highres=off console=ttyS0 mem=16192M'
>>>
>>> tst_test.c:1085: INFO: Timeout per run is 0h 05m 00s
>>> [   47.080156] Warning: /proc/ide/hd?/settings interface is obsolete, and 
>>> will be removed soon!
>>> [   47.085330] [ cut here ]
>>> [   47.085810] kernel BUG at block/blk-mq.c:767!
>>> [   47.086498] invalid opcode:  [#1] PREEMPT SMP PTI
>>> [   47.087022] CPU: 5 PID: 146 Comm: kworker/5:1H Not tainted 5.0.0-rc3 #1
>>> [   47.087858] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>>> rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
>>> [   47.088992] Workqueue: kblockd blk_mq_run_work_fn
>>> [   47.089469] RIP: 0010:blk_mq_add_to_requeue_list+0xc1/0xd0
>>> [   47.090035] Code: 48 8d 53 48 49 8b 8c 24 b8 04 00 00 48 89 51 08 48 89 
>>> 4b 48 49 8d 8c 24 b8 04 00 00 48 89 4b 50 49 89 94 24 b8 04 00 008
>>> [   47.091930] RSP: 0018:9e1ea4b43e40 EFLAGS: 00010002
>>> [   47.092458] RAX: 9e1ea13c0048 RBX: 9e1ea13c RCX: 
>>> 0006
>>> [   47.093181] RDX:  RSI: 0001 RDI: 
>>> 9e1ea13c
>>> [   47.093906] RBP: 9e1ea4b43e68 R08: eb5bcf630680 R09: 
>>> 
>>> [   47.094626] R10: 0001 R11: 0012 R12: 
>>> 9e1ea1033a40
>>> [   47.095347] R13: 9e1ea13a8d00 R14: 9e1ea13a9000 R15: 
>>> 0046
>>> [   47.096071] FS:  () GS:9e1ea4b4() 
>>> knlGS:
>>> [   47.096898] CS:  0010 DS:  ES:  CR0: 80050033
>>> [   47.097477] CR2: 003fda41fda0 CR3: 0003d8e6a000 CR4: 
>>> 06e0
>>> [   47.098203] DR0:  DR1:  DR2: 
>>> 
>>> [   47.098929] DR3:  DR6: fffe0ff0 DR7: 
>>> 0400
>>> [   47.099650] Call Trace:
>>> [   47.099910]  
>>> [   47.100128]  blk_mq_requeue_request+0x58/0x60
>>> [   47.100576]  ide_requeue_and_plug+0x20/0x50
>>> [   47.101014]  ide_intr+0x21a/0x230
>>> [   47.101362]  ? idecd_open+0xc0/0xc0
>>> [   47.101735]  __handle_irq_event_percpu+0x43/0x1e0
>>> [   47.102214]  handle_irq_event_percpu+0x32/0x80
>>> [   47.102668]  handle_irq_event+0x39/0x60
>>> [   47.103074]  handle_edge_irq+0xe8/0x1c0
>>> [   47.103470]  handle_irq+0x20/0x30
>>> [   47.103819]  do_IRQ+0x46/0xe0
>>> [   47.104128]  common_interrupt+0xf/0xf
>>> [   47.104505]  
>>> [   47.104731] RIP: 0010:ide_output_data+0xbc/0x100
>>> [   47.105201] Code: 74 22 8d 41 ff 85 c9 74 24 49 8d 54 40 02 41 0f b7 00 
>>> 66 41 89 01 49 83 c0 02 49 39 d0 75 ef 5b 41 5c 5d c3 4c 89 c6 445
>>> [   47.107092] RSP: 0018:bd508059bb18 EFLAGS: 00010246 ORIG_RAX: 
>>> ffdd
>>> [   47.107862] RAX: 9e1ea13a8800 RBX: 9e1ea13a9000 RCX: 
>>> 
>>> [   47.108581] RDX: 0170 RSI: 9e1ea13c012c RDI: 
>>> 
>>> [   47.109293] RBP: bd508059bb28 R08: 9e1ea13c0120 R09: 
>>> 0170
>>> [   47.110016] R10: 000d R11: 000c R12: 
>>> 9e1ea13a8800
>>> [   47.110731] R13: 000c R14: 9e1ea13c R15: 
>>> 7530
>>> [   47.111446]  ide_transfer_pc+0x216/0x310
>>> [   47.111848]  ? __const_udelay+0x3d/0x40
>>> [   47.112236]  ? ide_execute_command+0x85/0xb0
>>> [   47.112668]  ? ide_pc_intr+0x3f0/0x3f0
>>> [   47.113051]  ? ide_check_atapi_device+0x110/0x110
>>> [   47.113524]  ide_issue_pc+0x178/0x240
>>> [   47.113901]  ide_cd_do_request+0x15c/0x350
>>> [   47.114314]  ide_queue_rq+0x180/0x6b0
>>> [   47.114686]  ? blk_mq_get_driver_tag+0xa1/0

LTP case read_all_proc fails on qemux86-64 since 5.0-rc1

2019-01-22 Thread He Zhe


LTP case read_all_proc(read_all -d /proc -q -r 10) often, but not every time, 
fails with the following call traces, since 600335205b8d "ide: convert to 
blk-mq"(5.0-rc1) till now(5.0-rc3).

qemu-system-x86_64 -drive file=rootfs.ext4,if=virtio,format=raw -object 
rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 
-nographic -m 16192 -smp cpus=12 -cpu core2duo -enable-kvm -serial mon:stdio 
-serial null -kernel bzImage -append 'root=/dev/vda rw highres=off 
console=ttyS0 mem=16192M'

tst_test.c:1085: INFO: Timeout per run is 0h 05m 00s
[   47.080156] Warning: /proc/ide/hd?/settings interface is obsolete, and will 
be removed soon!
[   47.085330] [ cut here ]
[   47.085810] kernel BUG at block/blk-mq.c:767!
[   47.086498] invalid opcode:  [#1] PREEMPT SMP PTI
[   47.087022] CPU: 5 PID: 146 Comm: kworker/5:1H Not tainted 5.0.0-rc3 #1
[   47.087858] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
[   47.088992] Workqueue: kblockd blk_mq_run_work_fn
[   47.089469] RIP: 0010:blk_mq_add_to_requeue_list+0xc1/0xd0
[   47.090035] Code: 48 8d 53 48 49 8b 8c 24 b8 04 00 00 48 89 51 08 48 89 4b 
48 49 8d 8c 24 b8 04 00 00 48 89 4b 50 49 89 94 24 b8 04 00 008
[   47.091930] RSP: 0018:9e1ea4b43e40 EFLAGS: 00010002
[   47.092458] RAX: 9e1ea13c0048 RBX: 9e1ea13c RCX: 0006
[   47.093181] RDX:  RSI: 0001 RDI: 9e1ea13c
[   47.093906] RBP: 9e1ea4b43e68 R08: eb5bcf630680 R09: 
[   47.094626] R10: 0001 R11: 0012 R12: 9e1ea1033a40
[   47.095347] R13: 9e1ea13a8d00 R14: 9e1ea13a9000 R15: 0046
[   47.096071] FS:  () GS:9e1ea4b4() 
knlGS:
[   47.096898] CS:  0010 DS:  ES:  CR0: 80050033
[   47.097477] CR2: 003fda41fda0 CR3: 0003d8e6a000 CR4: 06e0
[   47.098203] DR0:  DR1:  DR2: 
[   47.098929] DR3:  DR6: fffe0ff0 DR7: 0400
[   47.099650] Call Trace:
[   47.099910]  
[   47.100128]  blk_mq_requeue_request+0x58/0x60
[   47.100576]  ide_requeue_and_plug+0x20/0x50
[   47.101014]  ide_intr+0x21a/0x230
[   47.101362]  ? idecd_open+0xc0/0xc0
[   47.101735]  __handle_irq_event_percpu+0x43/0x1e0
[   47.102214]  handle_irq_event_percpu+0x32/0x80
[   47.102668]  handle_irq_event+0x39/0x60
[   47.103074]  handle_edge_irq+0xe8/0x1c0
[   47.103470]  handle_irq+0x20/0x30
[   47.103819]  do_IRQ+0x46/0xe0
[   47.104128]  common_interrupt+0xf/0xf
[   47.104505]  
[   47.104731] RIP: 0010:ide_output_data+0xbc/0x100
[   47.105201] Code: 74 22 8d 41 ff 85 c9 74 24 49 8d 54 40 02 41 0f b7 00 66 
41 89 01 49 83 c0 02 49 39 d0 75 ef 5b 41 5c 5d c3 4c 89 c6 445
[   47.107092] RSP: 0018:bd508059bb18 EFLAGS: 00010246 ORIG_RAX: 
ffdd
[   47.107862] RAX: 9e1ea13a8800 RBX: 9e1ea13a9000 RCX: 
[   47.108581] RDX: 0170 RSI: 9e1ea13c012c RDI: 
[   47.109293] RBP: bd508059bb28 R08: 9e1ea13c0120 R09: 0170
[   47.110016] R10: 000d R11: 000c R12: 9e1ea13a8800
[   47.110731] R13: 000c R14: 9e1ea13c R15: 7530
[   47.111446]  ide_transfer_pc+0x216/0x310
[   47.111848]  ? __const_udelay+0x3d/0x40
[   47.112236]  ? ide_execute_command+0x85/0xb0
[   47.112668]  ? ide_pc_intr+0x3f0/0x3f0
[   47.113051]  ? ide_check_atapi_device+0x110/0x110
[   47.113524]  ide_issue_pc+0x178/0x240
[   47.113901]  ide_cd_do_request+0x15c/0x350
[   47.114314]  ide_queue_rq+0x180/0x6b0
[   47.114686]  ? blk_mq_get_driver_tag+0xa1/0x110
[   47.115153]  blk_mq_dispatch_rq_list+0x90/0x550
[   47.115606]  ? __queue_delayed_work+0x63/0x90
[   47.116054]  ? deadline_fifo_request+0x41/0x90
[   47.116506]  blk_mq_do_dispatch_sched+0x80/0x100
[   47.116976]  blk_mq_sched_dispatch_requests+0xfc/0x170
[   47.117491]  __blk_mq_run_hw_queue+0x6f/0xd0
[   47.117941]  blk_mq_run_work_fn+0x1b/0x20
[   47.118342]  process_one_work+0x14c/0x450
[   47.118747]  worker_thread+0x4a/0x440
[   47.119125]  kthread+0x105/0x140
[   47.119456]  ? process_one_work+0x450/0x450
[   47.119880]  ? kthread_park+0x90/0x90
[   47.120251]  ret_from_fork+0x35/0x40
[   47.120619] Modules linked in:
[   47.120952] ---[ end trace 4562f716e88fdefe ]---
[   47.121423] RIP: 0010:blk_mq_add_to_requeue_list+0xc1/0xd0
[   47.121981] Code: 48 8d 53 48 49 8b 8c 24 b8 04 00 00 48 89 51 08 48 89 4b 
48 49 8d 8c 24 b8 04 00 00 48 89 4b 50 49 89 94 24 b8 04 00 008
[   47.123851] RSP: 0018:9e1ea4b43e40 EFLAGS: 00010002
[   47.124393] RAX: 9e1ea13c0048 RBX: 9e1ea13c RCX: 0006
[   47.125108] RDX:  RSI: 0001 RDI: 9e1ea13c
[   47.125819] RBP: 9e1ea4b43e68 R08: eb5bcf630680 R09: 
[   47.126539] R10: 

Re: [PATCH v2] kernel/dma: Fix panic caused by passing swiotlb to command line

2019-01-10 Thread He Zhe



On 1/11/19 9:46 AM, He Zhe wrote:
>
> On 1/11/19 1:27 AM, Konrad Rzeszutek Wilk wrote:
>> Let's skip it. There was another patch that would allocate a default 4MB 
>> size if it there was an misue of swiotlb parameters.
> But this patch mainly fixes a crash. Could you please point me to the patch 
> you mentioned?

And the v2 is modified according to your suggestion.

Zhe

>
> Thanks,
> Zhe
>
>>
>>
>> On Mon, Jan 7, 2019, 4:07 AM Christoph Hellwig > <mailto:h...@lst.de> wrote:
>>
>> On Mon, Jan 07, 2019 at 04:46:51PM +0800, He Zhe wrote:
>> > Kindly ping.
>>
>> Konrad, I'll pick this up through the DMA mapping tree unless you
>> protest in the next few days.
>>
>



Re: [PATCH v2] kernel/dma: Fix panic caused by passing swiotlb to command line

2019-01-10 Thread He Zhe



On 1/11/19 1:27 AM, Konrad Rzeszutek Wilk wrote:
> Let's skip it. There was another patch that would allocate a default 4MB size 
> if it there was an misue of swiotlb parameters.

But this patch mainly fixes a crash. Could you please point me to the patch you 
mentioned?

Thanks,
Zhe

>
>
>
> On Mon, Jan 7, 2019, 4:07 AM Christoph Hellwig  <mailto:h...@lst.de> wrote:
>
> On Mon, Jan 07, 2019 at 04:46:51PM +0800, He Zhe wrote:
> > Kindly ping.
>
> Konrad, I'll pick this up through the DMA mapping tree unless you
> protest in the next few days.
>



Re: [PATCH v2] kernel/dma: Fix panic caused by passing swiotlb to command line

2019-01-07 Thread He Zhe
Kindly ping.

Zhe

On 12/3/18 6:22 PM, zhe...@windriver.com wrote:
> From: He Zhe 
>
> setup_io_tlb_npages does not check input argument before passing it
> to isdigit. The argument would be a NULL pointer if "swiotlb", without
> its value, is set in command line and thus causes the following panic.
>
> PANIC: early exception 0xe3 IP 10:bb9b8e9f error 0 cr2 0x0
> [0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 
> 4.19.0-rc3-yocto-standard+ #9
> [0.00] RIP: 0010:setup_io_tlb_npages+0xf/0x95
> ...
> [0.00] Call Trace:
> [0.00]  do_early_param+0x57/0x8e
> [0.00]  parse_args+0x208/0x320
> [0.00]  ? rdinit_setup+0x30/0x30
> [0.00]  parse_early_options+0x29/0x2d
> [0.00]  ? rdinit_setup+0x30/0x30
> [0.00]  parse_early_param+0x36/0x4d
> [0.00]  setup_arch+0x336/0x99e
> [0.00]  start_kernel+0x6f/0x4e6
> [0.00]  x86_64_start_reservations+0x24/0x26
> [0.00]  x86_64_start_kernel+0x6f/0x72
> [0.00]  secondary_startup_64+0xa4/0xb0
>
> This patch adds a check to prevent the panic and sets it for 4MB by default.
>
> Signed-off-by: He Zhe 
> Cc: sta...@vger.kernel.org
> Cc: konrad.w...@oracle.com
> Cc: h...@lst.de
> Cc: m.szyprow...@samsung.com
> Cc: robin.mur...@arm.com
>
> Signed-off-by: He Zhe 
> ---
> v2: Set swiotlb for 4MB by default
>
>  kernel/dma/swiotlb.c | 11 +++
>  1 file changed, 11 insertions(+)
>
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 045930e..0e18cd4 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -58,6 +58,9 @@
>   */
>  #define IO_TLB_MIN_SLABS ((1<<20) >> IO_TLB_SHIFT)
>  
> +#define IO_TLB_DEFAULT_MB 4
> +#define IO_TLB_DEFAULT_SLABS ((IO_TLB_DEFAULT_MB<<20) >> IO_TLB_SHIFT)
> +
>  enum swiotlb_force swiotlb_force;
>  
>  /*
> @@ -103,6 +106,14 @@ static int late_alloc;
>  static int __init
>  setup_io_tlb_npages(char *str)
>  {
> + if (!str) {
> + pr_err("No value provided for swiotlb, "
> +"set it to %d slabs for %dMB by default.\n",
> +IO_TLB_DEFAULT_SLABS, IO_TLB_DEFAULT_MB);
> + io_tlb_nslabs = IO_TLB_DEFAULT_SLABS;
> + return 0;
> + }
> +
>   if (isdigit(*str)) {
>   io_tlb_nslabs = simple_strtoul(str, , 0);
>   /* avoid tail segment of size < IO_TLB_SEGSIZE */



Re: [PATCH] mm: kmemleak: Turn kmemleak_lock to spin lock and RCU primitives

2019-01-06 Thread He Zhe



On 1/5/19 2:37 AM, Catalin Marinas wrote:
> On Fri, Jan 04, 2019 at 10:29:13PM +0800, zhe...@windriver.com wrote:
>> It's not necessary to keep consistency between readers and writers of
>> kmemleak_lock. RCU is more proper for this case. And in order to gain better
>> performance, we turn the reader locks to RCU read locks and writer locks to
>> normal spin locks.
> This won't work.
>
>> @@ -515,9 +515,7 @@ static struct kmemleak_object 
>> *find_and_get_object(unsigned long ptr, int alias)
>>  struct kmemleak_object *object;
>>  
>>  rcu_read_lock();
>> -read_lock_irqsave(_lock, flags);
>>  object = lookup_object(ptr, alias);
>> -read_unlock_irqrestore(_lock, flags);
> The comment on lookup_object() states that the kmemleak_lock must be
> held. That's because we don't have an RCU-like mechanism for removing
> removing objects from the object_tree_root:
>
>>  
>>  /* check whether the object is still available */
>>  if (object && !get_object(object))
>> @@ -537,13 +535,13 @@ static struct kmemleak_object 
>> *find_and_remove_object(unsigned long ptr, int ali
>>  unsigned long flags;
>>  struct kmemleak_object *object;
>>  
>> -write_lock_irqsave(_lock, flags);
>> +spin_lock_irqsave(_lock, flags);
>>  object = lookup_object(ptr, alias);
>>  if (object) {
>>  rb_erase(>rb_node, _tree_root);
>>  list_del_rcu(>object_list);
>>  }
>> -write_unlock_irqrestore(_lock, flags);
>> +spin_unlock_irqrestore(_lock, flags);
> So here, while list removal is RCU-safe, rb_erase() is not.
>
> If you have time to implement an rb_erase_rcu(), than we could reduce
> the locking in kmemleak.

Thanks, I really neglected that rb_erase is not RCU-safe here.

I'm not sure if it is practically possible to implement rb_erase_rcu. Here
is my concern:
In the code paths starting from rb_erase, the tree is tweaked at many
places, in both __rb_erase_augmented and rb_erase_color. To my
understanding, there are many intermediate versions of the tree
during the erasion. In some of the versions, the tree is incomplete, i.e.
some nodes(not the one to be deleted) are invisible to readers. I'm not
sure if this is acceptable as an RCU implementation. Does it mean we
need to form a rb_erase_rcu from scratch?

And are there any other concerns about this attempt?

Let me add RCU supporters Paul and Josh here. Your advice would be
highly appreciated.

Thanks,
Zhe


>



Re: [PATCH v2] kmemleak: Turn kmemleak_lock to raw spinlock on RT

2018-12-19 Thread He Zhe



On 2018/12/19 23:30, Sebastian Andrzej Siewior wrote:
> On 2018-12-18 15:07:45 [+], Catalin Marinas wrote:
> …
>> It may be worth running some performance/latency tests during kmemleak
>> scanning (echo scan > /sys/kernel/debug/kmemleak) but at a quick look,
>> I don't think we'd see any difference with a raw_spin_lock_t.
>>
>> With a bit more thinking (though I'll be off until the new year), we
>> could probably get rid of the kmemleak_lock entirely in scan_block() and
>> make lookup_object() and the related rbtree code in kmemleak RCU-safe.
> Okay. So let me apply that patch into my RT tree with your ack (from the
> other email). And then I hope that it either shows up upstream or gets
> replaced with RCU in the ende :)

I'd like to do the upstreaming or replacing. Thanks.

Zhe

>
> Thanks.
>
> Sebastian
>



Re: [PATCH] perf/x86/intel: Avoid unnecessary reallocations of memory allocated in cpu hotplug prepare state

2018-12-18 Thread He Zhe



On 2018/12/18 19:47, Sebastian Andrzej Siewior wrote:
> On 2018-12-18 12:37:00 [+0100], Peter Zijlstra wrote:
>> On Tue, Dec 18, 2018 at 12:31:19PM +0100, Peter Zijlstra wrote:
>>> On Tue, Dec 18, 2018 at 12:16:37PM +0100, Sebastian Andrzej Siewior wrote:
 On 2018-12-18 12:02:09 [+0100], Peter Zijlstra wrote:
> On Tue, Dec 18, 2018 at 06:30:33PM +0800, zhe...@windriver.com wrote:
>> Besides, in preempt-rt full mode, the freeing can happen in atomic 
>> context and
>> thus cause the following BUG.
> Hurm, I though we fixed all those long ago..
>
> And no, the patch is horrible; that's what we have things like
> x86_pmu::cpu_dead() for.
 ehm, you say we keep memory allocation +free on CPU up/down?
>>> Sure, why not?
> It does not seem to be useful to allocate & free memory which you need
> anyway. So you could avoid the refcnt for instance.
> Also I doubt the memory will remain unallocated for a longer period of
> time (like you would remove the CPU for good and not boot it again a
> minute later).
> *Maybe* it is different in cloud environment where you attach vcpus
> depending on guest load at runtime but still…
>
>> I suspect the below is all we really need.
> Zhe, could you please have a look?

This works in my environment, taking CPUs offline and online many times.

Zhe

>
> Sebastian
>



Re: [PATCH v2] kmemleak: Turn kmemleak_lock to raw spinlock on RT

2018-12-18 Thread He Zhe



On 2018/12/6 03:14, Sebastian Andrzej Siewior wrote:
> On 2018-12-05 21:53:37 [+0800], He Zhe wrote:
>> For call trace 1:
> …
>> Since kmemleak would most likely be used to debug in environments where
>> we would not expect as great performance as without it, and kfree() has raw 
>> locks
>> in its main path and other debug function paths, I suppose it wouldn't hurt 
>> that
>> we change to raw locks.
> okay.
>
>>>> >From what I reached above, this is RT-only and happens on v4.18 and v4.19.
>>>>
>>>> The call trace above is caused by grabbing kmemleak_lock and then getting
>>>> scheduled and then re-grabbing kmemleak_lock. Using raw lock can also solve
>>>> this problem.
>>> But this is a reader / writer lock. And if I understand the other part
>>> of the thread then it needs multiple readers.
>> For call trace 2:
>>
>> I don't get what "it needs multiple readers" exactly means here.
>>
>> In this call trace, the kmemleak_lock is grabbed as write lock, and then 
>> scheduled
>> away, and then grabbed again as write lock from another path. It's a
>> write->write locking, compared to the discussion in the other part of the 
>> thread.
>>
>> This is essentially because kmemleak hooks on the very low level memory
>> allocation and free operations. After scheduled away, it can easily re-enter 
>> itself.
>> We need raw locks to prevent this from happening.
> With raw locks you wouldn't have multiple readers at the same time.
> Maybe you wouldn't have recursion but since you can't have multiple
> readers you would add lock contention where was none (because you could
> have two readers at the same time).

Sorry for slow reply.

OK. I understand your concern finally. At the commit log said, I wanted to use 
raw
rwlock but didn't find the DEFINE helper for it. Thinking it would not be 
expected to
have great performance, I turn to use raw spinlock instead. And yes, this would
introduce worse performance.

Maybe I miss the reason, but why don't we have rwlock_types_raw.h to define raw
rwlock helper for RT? With that, we can cleanly replace kmemleak_lock with a raw
rwlock.

Or should we just define a raw rwlock using basic type, like arch_rwlock_t, 
only in
kmemleak?

>
>>> Couldn't we just get rid of that kfree() or move it somewhere else?
>>> I mean if the free() memory on CPU-down and allocate it again CPU-up
>>> then we could skip that, rigth? Just allocate it and don't free it
>>> because the CPU will likely get up again.
>> For call trace 1:
>>
>> I went through the CPU hotplug code and found that the allocation of the
>> problematic data, cpuc->shared_regs, is done in intel_pmu_cpu_prepare. And
>> the free is done in intel_pmu_cpu_dying. They are handlers triggered by two
>> different perf events.
>>
>> It seems we can hardly form a convincing method that holds the data while
>> CPUs are off and then uses it again. raw locks would be easy and good enough.
> Why not allocate the memory in intel_pmu_cpu_prepare() if it is not
> already there (otherwise skip the allocation) and in
> intel_pmu_cpu_dying() not free it. It looks easy.

Thanks for your suggestion. I've sent the change for call trace 1 to mainline
mailing list. Hopefully it can be accepted.

Zhe

>
>> Thanks,
>> Zhe
> Sebastian
>



Re: [PATCH v2] kmemleak: Turn kmemleak_lock to raw spinlock on RT

2018-12-05 Thread He Zhe



On 2018/12/1 02:19, Sebastian Andrzej Siewior wrote:
> On 2018-11-24 22:26:46 [+0800], He Zhe wrote:
>> On latest v4.19.1-rt3, both of the call traces can be reproduced with 
>> kmemleak
>> enabied. And none can be reproduced with kmemleak disabled.
> okay. So it needs attention.
>
>> On latest mainline tree, none can be reproduced no matter kmemleak is enabled
>> or disabled.
>>
>> I don't get why kfree from a preempt-disabled section should cause a warning
>> without kmemleak, since kfree can't sleep.
> it might. It will acquire a sleeping lock if it has go down to the
> memory allocator to actually give memory back.

Got it. Thanks.

>
>> If I understand correctly, the call trace above is caused by trying to 
>> schedule
>> after preemption is disabled, which cannot be reached in mainline kernel. So
>> we might need to turn to use raw lock to keep preemption disabled.
> The buddy-allocator runs with spin locks so it is okay on !RT. So you
> can use kfree() with disabled preemption or disabled interrupts.
> I don't think that we want to use raw-locks in the buddy-allocator.

For call trace 1:

I went through the calling paths inside kfree() and found that there have 
already
been things using raw lock in it as follow.

1) in the path of kfree() itself
kfree -> slab_free -> do_slab_free -> __slab_free -> raw_spin_lock_irqsave

2) in the path of CONFIG_DEBUG_OBJECTS_FREE
kfree -> slab_free -> slab_free_freelist_hook -> slab_free_hook -> 
debug_check_no_obj_freed -> __debug_check_no_obj_freed -> raw_spin_lock_irqsave

3) in the path of CONFIG_LOCKDEP
kfree -> __free_pages -> free_unref_page -> free_unref_page_prepare -> 
free_pcp_prepare -> free_pages_prepare -> debug_check_no_locks_freed -> 
debug_check_no_locks_freed -> raw_local_irq_save

Since kmemleak would most likely be used to debug in environments where
we would not expect as great performance as without it, and kfree() has raw 
locks
in its main path and other debug function paths, I suppose it wouldn't hurt that
we change to raw locks.

>
>> >From what I reached above, this is RT-only and happens on v4.18 and v4.19.
>>
>> The call trace above is caused by grabbing kmemleak_lock and then getting
>> scheduled and then re-grabbing kmemleak_lock. Using raw lock can also solve
>> this problem.
> But this is a reader / writer lock. And if I understand the other part
> of the thread then it needs multiple readers.

For call trace 2:

I don't get what "it needs multiple readers" exactly means here.

In this call trace, the kmemleak_lock is grabbed as write lock, and then 
scheduled
away, and then grabbed again as write lock from another path. It's a
write->write locking, compared to the discussion in the other part of the 
thread.

This is essentially because kmemleak hooks on the very low level memory
allocation and free operations. After scheduled away, it can easily re-enter 
itself.
We need raw locks to prevent this from happening.

> Couldn't we just get rid of that kfree() or move it somewhere else?
> I mean if the free() memory on CPU-down and allocate it again CPU-up
> then we could skip that, rigth? Just allocate it and don't free it
> because the CPU will likely get up again.

For call trace 1:

I went through the CPU hotplug code and found that the allocation of the
problematic data, cpuc->shared_regs, is done in intel_pmu_cpu_prepare. And
the free is done in intel_pmu_cpu_dying. They are handlers triggered by two
different perf events.

It seems we can hardly form a convincing method that holds the data while
CPUs are off and then uses it again. raw locks would be easy and good enough.

Thanks,
Zhe

>
>> Thanks,
>> Zhe
> Sebastian
>



Re: [PATCH v2] kmemleak: Turn kmemleak_lock to raw spinlock on RT

2018-12-05 Thread He Zhe



On 2018/12/1 02:19, Sebastian Andrzej Siewior wrote:
> On 2018-11-24 22:26:46 [+0800], He Zhe wrote:
>> On latest v4.19.1-rt3, both of the call traces can be reproduced with 
>> kmemleak
>> enabied. And none can be reproduced with kmemleak disabled.
> okay. So it needs attention.
>
>> On latest mainline tree, none can be reproduced no matter kmemleak is enabled
>> or disabled.
>>
>> I don't get why kfree from a preempt-disabled section should cause a warning
>> without kmemleak, since kfree can't sleep.
> it might. It will acquire a sleeping lock if it has go down to the
> memory allocator to actually give memory back.

Got it. Thanks.

>
>> If I understand correctly, the call trace above is caused by trying to 
>> schedule
>> after preemption is disabled, which cannot be reached in mainline kernel. So
>> we might need to turn to use raw lock to keep preemption disabled.
> The buddy-allocator runs with spin locks so it is okay on !RT. So you
> can use kfree() with disabled preemption or disabled interrupts.
> I don't think that we want to use raw-locks in the buddy-allocator.

For call trace 1:

I went through the calling paths inside kfree() and found that there have 
already
been things using raw lock in it as follow.

1) in the path of kfree() itself
kfree -> slab_free -> do_slab_free -> __slab_free -> raw_spin_lock_irqsave

2) in the path of CONFIG_DEBUG_OBJECTS_FREE
kfree -> slab_free -> slab_free_freelist_hook -> slab_free_hook -> 
debug_check_no_obj_freed -> __debug_check_no_obj_freed -> raw_spin_lock_irqsave

3) in the path of CONFIG_LOCKDEP
kfree -> __free_pages -> free_unref_page -> free_unref_page_prepare -> 
free_pcp_prepare -> free_pages_prepare -> debug_check_no_locks_freed -> 
debug_check_no_locks_freed -> raw_local_irq_save

Since kmemleak would most likely be used to debug in environments where
we would not expect as great performance as without it, and kfree() has raw 
locks
in its main path and other debug function paths, I suppose it wouldn't hurt that
we change to raw locks.

>
>> >From what I reached above, this is RT-only and happens on v4.18 and v4.19.
>>
>> The call trace above is caused by grabbing kmemleak_lock and then getting
>> scheduled and then re-grabbing kmemleak_lock. Using raw lock can also solve
>> this problem.
> But this is a reader / writer lock. And if I understand the other part
> of the thread then it needs multiple readers.

For call trace 2:

I don't get what "it needs multiple readers" exactly means here.

In this call trace, the kmemleak_lock is grabbed as write lock, and then 
scheduled
away, and then grabbed again as write lock from another path. It's a
write->write locking, compared to the discussion in the other part of the 
thread.

This is essentially because kmemleak hooks on the very low level memory
allocation and free operations. After scheduled away, it can easily re-enter 
itself.
We need raw locks to prevent this from happening.

> Couldn't we just get rid of that kfree() or move it somewhere else?
> I mean if the free() memory on CPU-down and allocate it again CPU-up
> then we could skip that, rigth? Just allocate it and don't free it
> because the CPU will likely get up again.

For call trace 1:

I went through the CPU hotplug code and found that the allocation of the
problematic data, cpuc->shared_regs, is done in intel_pmu_cpu_prepare. And
the free is done in intel_pmu_cpu_dying. They are handlers triggered by two
different perf events.

It seems we can hardly form a convincing method that holds the data while
CPUs are off and then uses it again. raw locks would be easy and good enough.

Thanks,
Zhe

>
>> Thanks,
>> Zhe
> Sebastian
>



Re: [PATCH v2] kmemleak: Turn kmemleak_lock to raw spinlock on RT

2018-11-24 Thread He Zhe



On 2018/11/23 17:53, Sebastian Andrzej Siewior wrote:
> On 2018-11-22 17:04:19 [+0800], zhe...@windriver.com wrote:
>> From: He Zhe 
>>
>> kmemleak_lock, as a rwlock on RT, can possibly be held in atomic context and
>> causes the follow BUG.
>>
>> BUG: scheduling while atomic: migration/15/132/0x0002
> …
>> Preemption disabled at:
>> [] cpu_stopper_thread+0x71/0x100
>> CPU: 15 PID: 132 Comm: migration/15 Not tainted 4.19.0-rt1-preempt-rt #1
>> Hardware name: Intel Corp. Harcuvar/Server, BIOS 
>> HAVLCRB1.X64.0015.D62.1708310404 08/31/2017
>> Call Trace:
>>  dump_stack+0x4f/0x6a
>>  ? cpu_stopper_thread+0x71/0x100
>>  __schedule_bug.cold.16+0x38/0x55
>>  __schedule+0x484/0x6c0
>>  schedule+0x3d/0xe0
>>  rt_spin_lock_slowlock_locked+0x118/0x2a0
>>  rt_spin_lock_slowlock+0x57/0x90
>>  __rt_spin_lock+0x26/0x30
>>  __write_rt_lock+0x23/0x1a0
>>  ? intel_pmu_cpu_dying+0x67/0x70
>>  rt_write_lock+0x2a/0x30
>>  find_and_remove_object+0x1e/0x80
>>  delete_object_full+0x10/0x20
>>  kmemleak_free+0x32/0x50
>>  kfree+0x104/0x1f0
>>  ? x86_pmu_starting_cpu+0x30/0x30
>>  intel_pmu_cpu_dying+0x67/0x70
>>  x86_pmu_dying_cpu+0x1a/0x30
>>  cpuhp_invoke_callback+0x92/0x700
>>  take_cpu_down+0x70/0xa0
>>  multi_cpu_stop+0x62/0xc0
>>  ? cpu_stop_queue_work+0x130/0x130
>>  cpu_stopper_thread+0x79/0x100
>>  smpboot_thread_fn+0x20f/0x2d0
>>  kthread+0x121/0x140
>>  ? sort_range+0x30/0x30
>>  ? kthread_park+0x90/0x90
>>  ret_from_fork+0x35/0x40
> If this is the only problem? kfree() from a preempt-disabled section
> should cause a warning even without kmemleak.

Thanks for your review. I just did some tests aginst the latest code.

On latest v4.19.1-rt3, both of the call traces can be reproduced with kmemleak
enabied. And none can be reproduced with kmemleak disabled.

On latest mainline tree, none can be reproduced no matter kmemleak is enabled
or disabled.

I don't get why kfree from a preempt-disabled section should cause a warning
without kmemleak, since kfree can't sleep.

If I understand correctly, the call trace above is caused by trying to schedule
after preemption is disabled, which cannot be reached in mainline kernel. So
we might need to turn to use raw lock to keep preemption disabled.

>
>> And on v4.18 stable tree the following call trace, caused by grabbing
>> kmemleak_lock again, is also observed.
>>
>> kernel BUG at kernel/locking/rtmutex.c:1048! 
>> invalid opcode:  [#1] PREEMPT SMP PTI 
>> CPU: 5 PID: 689 Comm: mkfs.ext4 Not tainted 4.18.16-rt9-preempt-rt #1 
> …
>> Call Trace: 
>>  ? preempt_count_add+0x74/0xc0 
>>  rt_spin_lock_slowlock+0x57/0x90 
>>  ? __kernel_text_address+0x12/0x40 
>>  ? __save_stack_trace+0x75/0x100 
>>  __rt_spin_lock+0x26/0x30 
>>  __write_rt_lock+0x23/0x1a0 
>>  rt_write_lock+0x2a/0x30 
>>  create_object+0x17d/0x2b0 
> …
>
> is this an RT-only problem? Because mainline should not allow read->read
> locking or read->write locking for reader-writer locks. If this only
> happens on v4.18 and not on v4.19 then something must have fixed it.

>From what I reached above, this is RT-only and happens on v4.18 and v4.19.

The call trace above is caused by grabbing kmemleak_lock and then getting
scheduled and then re-grabbing kmemleak_lock. Using raw lock can also solve
this problem.

Thanks,
Zhe

>  
>
> Sebastian
>



Re: [PATCH v2] kmemleak: Turn kmemleak_lock to raw spinlock on RT

2018-11-24 Thread He Zhe



On 2018/11/23 17:53, Sebastian Andrzej Siewior wrote:
> On 2018-11-22 17:04:19 [+0800], zhe...@windriver.com wrote:
>> From: He Zhe 
>>
>> kmemleak_lock, as a rwlock on RT, can possibly be held in atomic context and
>> causes the follow BUG.
>>
>> BUG: scheduling while atomic: migration/15/132/0x0002
> …
>> Preemption disabled at:
>> [] cpu_stopper_thread+0x71/0x100
>> CPU: 15 PID: 132 Comm: migration/15 Not tainted 4.19.0-rt1-preempt-rt #1
>> Hardware name: Intel Corp. Harcuvar/Server, BIOS 
>> HAVLCRB1.X64.0015.D62.1708310404 08/31/2017
>> Call Trace:
>>  dump_stack+0x4f/0x6a
>>  ? cpu_stopper_thread+0x71/0x100
>>  __schedule_bug.cold.16+0x38/0x55
>>  __schedule+0x484/0x6c0
>>  schedule+0x3d/0xe0
>>  rt_spin_lock_slowlock_locked+0x118/0x2a0
>>  rt_spin_lock_slowlock+0x57/0x90
>>  __rt_spin_lock+0x26/0x30
>>  __write_rt_lock+0x23/0x1a0
>>  ? intel_pmu_cpu_dying+0x67/0x70
>>  rt_write_lock+0x2a/0x30
>>  find_and_remove_object+0x1e/0x80
>>  delete_object_full+0x10/0x20
>>  kmemleak_free+0x32/0x50
>>  kfree+0x104/0x1f0
>>  ? x86_pmu_starting_cpu+0x30/0x30
>>  intel_pmu_cpu_dying+0x67/0x70
>>  x86_pmu_dying_cpu+0x1a/0x30
>>  cpuhp_invoke_callback+0x92/0x700
>>  take_cpu_down+0x70/0xa0
>>  multi_cpu_stop+0x62/0xc0
>>  ? cpu_stop_queue_work+0x130/0x130
>>  cpu_stopper_thread+0x79/0x100
>>  smpboot_thread_fn+0x20f/0x2d0
>>  kthread+0x121/0x140
>>  ? sort_range+0x30/0x30
>>  ? kthread_park+0x90/0x90
>>  ret_from_fork+0x35/0x40
> If this is the only problem? kfree() from a preempt-disabled section
> should cause a warning even without kmemleak.

Thanks for your review. I just did some tests aginst the latest code.

On latest v4.19.1-rt3, both of the call traces can be reproduced with kmemleak
enabied. And none can be reproduced with kmemleak disabled.

On latest mainline tree, none can be reproduced no matter kmemleak is enabled
or disabled.

I don't get why kfree from a preempt-disabled section should cause a warning
without kmemleak, since kfree can't sleep.

If I understand correctly, the call trace above is caused by trying to schedule
after preemption is disabled, which cannot be reached in mainline kernel. So
we might need to turn to use raw lock to keep preemption disabled.

>
>> And on v4.18 stable tree the following call trace, caused by grabbing
>> kmemleak_lock again, is also observed.
>>
>> kernel BUG at kernel/locking/rtmutex.c:1048! 
>> invalid opcode:  [#1] PREEMPT SMP PTI 
>> CPU: 5 PID: 689 Comm: mkfs.ext4 Not tainted 4.18.16-rt9-preempt-rt #1 
> …
>> Call Trace: 
>>  ? preempt_count_add+0x74/0xc0 
>>  rt_spin_lock_slowlock+0x57/0x90 
>>  ? __kernel_text_address+0x12/0x40 
>>  ? __save_stack_trace+0x75/0x100 
>>  __rt_spin_lock+0x26/0x30 
>>  __write_rt_lock+0x23/0x1a0 
>>  rt_write_lock+0x2a/0x30 
>>  create_object+0x17d/0x2b0 
> …
>
> is this an RT-only problem? Because mainline should not allow read->read
> locking or read->write locking for reader-writer locks. If this only
> happens on v4.18 and not on v4.19 then something must have fixed it.

>From what I reached above, this is RT-only and happens on v4.18 and v4.19.

The call trace above is caused by grabbing kmemleak_lock and then getting
scheduled and then re-grabbing kmemleak_lock. Using raw lock can also solve
this problem.

Thanks,
Zhe

>  
>
> Sebastian
>



Re: [PATCH v5 4/4] printk: Give error on attempt to set log buffer length to over 4G

2018-10-08 Thread He Zhe



On 2018年10月08日 21:59, Petr Mladek wrote:
> I tried this patch with log_buf_len=5G. The kernel did not crash
> but dmesg shown some mess. There are several 32-bit variables
> to store the size, for example:
>
> static u32 log_buf_len = __LOG_BUF_LEN;
> u32 log_buf_len_get(void)
> static u32 log_first_idx;
> static u32 log_next_idx;
>
> I guess that the code somewhere does not detect an overflows
> correctly.
>
> I am not motivated enought to add support for such huge message
> buffer. Therefore I suggest to limit it to 32G for now.

I'm also fine with this suggestion. Thanks.

Zhe



Re: [PATCH v5 4/4] printk: Give error on attempt to set log buffer length to over 4G

2018-10-08 Thread He Zhe



On 2018年10月08日 21:59, Petr Mladek wrote:
> I tried this patch with log_buf_len=5G. The kernel did not crash
> but dmesg shown some mess. There are several 32-bit variables
> to store the size, for example:
>
> static u32 log_buf_len = __LOG_BUF_LEN;
> u32 log_buf_len_get(void)
> static u32 log_first_idx;
> static u32 log_next_idx;
>
> I guess that the code somewhere does not detect an overflows
> correctly.
>
> I am not motivated enought to add support for such huge message
> buffer. Therefore I suggest to limit it to 32G for now.

I'm also fine with this suggestion. Thanks.

Zhe



Re: [PATCH v2 1/2] printk: Fix panic caused by passing log_buf_len to command line

2018-09-28 Thread He Zhe



On 2018年09月28日 15:35, Sergey Senozhatsky wrote:
> On (09/26/18 13:05), Petr Mladek wrote:
>> First, I would move the check into log_buf_len_update() so that
>> we catch the overflow also in other situations.
> OK.
>
>> Second, please, keep only the first line.
> OK.
>
>> It is enough to describe the problem.
> OK.
> He Zhe, will you pick it up?

Yes, I agree, I'll send v3 ASAP.

Thanks,
Zhe

>
>   -ss
>



Re: [PATCH v2 1/2] printk: Fix panic caused by passing log_buf_len to command line

2018-09-28 Thread He Zhe



On 2018年09月28日 15:35, Sergey Senozhatsky wrote:
> On (09/26/18 13:05), Petr Mladek wrote:
>> First, I would move the check into log_buf_len_update() so that
>> we catch the overflow also in other situations.
> OK.
>
>> Second, please, keep only the first line.
> OK.
>
>> It is enough to describe the problem.
> OK.
> He Zhe, will you pick it up?

Yes, I agree, I'll send v3 ASAP.

Thanks,
Zhe

>
>   -ss
>



Re: [PATCH v2 1/2] printk: Fix panic caused by passing log_buf_len to command line

2018-09-25 Thread He Zhe



On 2018年09月25日 21:31, Sergey Senozhatsky wrote:
> On (09/25/18 14:23), Petr Mladek wrote:
>> The 32GB was mentioned as an example one year ego. This is not enough
>> for a new syscall from my point of view.
> I agree. I didn't think of syslog(); was merely thinking about logbuf
> and flushing it to the consoles. syslog() stuff is a bit complex. We
> sort of don't expect user space to allocate 64G to read all log_buf
> messages, do we.
>
> I'm wondering if we can do something like this
>
> ---
>
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index cf275f4d7912..1b48b61da8fe 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -1110,9 +1110,15 @@ static void __init log_buf_len_update(unsigned size)
>  /* save requested log_buf_len since it's too early to process it */
>  static int __init log_buf_len_setup(char *str)
>  {
> - unsigned size = memparse(str, );
> + u64 size = memparse(str, );
>  
> - log_buf_len_update(size);
> + if (size > UINT_MAX) {
> + size = UINT_MAX;
> + pr_err("log_buf over 4G is not supported. "
> + "Please contact printk maintainers.\n");
> + }
> +
> + log_buf_len_update((unsigned int)size);
>  
>   return 0;
>  }
>
> ---
>
> So we could know that "the day has come".

I agree on this idea, a good way to collect the potential requirement.
I can send v4 with it if no objection from other people.

Thanks,
Zhe

>
>   -ss
>



Re: [PATCH v2 1/2] printk: Fix panic caused by passing log_buf_len to command line

2018-09-25 Thread He Zhe



On 2018年09月25日 21:31, Sergey Senozhatsky wrote:
> On (09/25/18 14:23), Petr Mladek wrote:
>> The 32GB was mentioned as an example one year ego. This is not enough
>> for a new syscall from my point of view.
> I agree. I didn't think of syslog(); was merely thinking about logbuf
> and flushing it to the consoles. syslog() stuff is a bit complex. We
> sort of don't expect user space to allocate 64G to read all log_buf
> messages, do we.
>
> I'm wondering if we can do something like this
>
> ---
>
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index cf275f4d7912..1b48b61da8fe 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -1110,9 +1110,15 @@ static void __init log_buf_len_update(unsigned size)
>  /* save requested log_buf_len since it's too early to process it */
>  static int __init log_buf_len_setup(char *str)
>  {
> - unsigned size = memparse(str, );
> + u64 size = memparse(str, );
>  
> - log_buf_len_update(size);
> + if (size > UINT_MAX) {
> + size = UINT_MAX;
> + pr_err("log_buf over 4G is not supported. "
> + "Please contact printk maintainers.\n");
> + }
> +
> + log_buf_len_update((unsigned int)size);
>  
>   return 0;
>  }
>
> ---
>
> So we could know that "the day has come".

I agree on this idea, a good way to collect the potential requirement.
I can send v4 with it if no objection from other people.

Thanks,
Zhe

>
>   -ss
>



Re: [PATCH v3 1/2] printk: Fix panic caused by passing log_buf_len to command line

2018-09-23 Thread He Zhe



On 2018年09月23日 00:19, Steven Rostedt wrote:
> On Sat, 22 Sep 2018 23:40:51 +0800
>  wrote:
>
>> From: He Zhe 
>>
>> log_buf_len_setup does not check input argument before passing it to
>> simple_strtoull. The argument would be a NULL pointer if "log_buf_len",
>> without its value, is set in command line and thus causes the following
>> panic.
>>
>> PANIC: early exception 0xe3 IP 10:aaeacd0d error 0 cr2 0x0
>> [0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 
>> 4.19.0-rc4-yocto-standard+ #1
>> [0.00] RIP: 0010:_parse_integer_fixup_radix+0xd/0x70
>> ...
>> [0.00] Call Trace:
>> [0.00]  simple_strtoull+0x29/0x70
>> [0.00]  memparse+0x26/0x90
>> [0.00]  log_buf_len_setup+0x17/0x22
>> [0.00]  do_early_param+0x57/0x8e
>> [0.00]  parse_args+0x208/0x320
>> [0.00]  ? rdinit_setup+0x30/0x30
>> [0.00]  parse_early_options+0x29/0x2d
>> [0.00]  ? rdinit_setup+0x30/0x30
>> [0.00]  parse_early_param+0x36/0x4d
>> [0.00]  setup_arch+0x336/0x99e
>> [0.00]  start_kernel+0x6f/0x4ee
>> [0.00]  x86_64_start_reservations+0x24/0x26
>> [0.00]  x86_64_start_kernel+0x6f/0x72
>> [0.00]  secondary_startup_64+0xa4/0xb0
>>
>> This patch adds a check to prevent the panic.
>>
>> Signed-off-by: He Zhe 
>> Cc: sta...@vger.kernel.org
> I just tried this on a 2.6.32 kernel, and it crashes there. I guess
> this goes farther back than git history goes.
>
> Perhaps it should be commented that this bug has been here since
> creation of (git) time.

I did a try on 2.6.32. It passed. Actually this bug only happens on
early_param(not __setup) which is introduced since v3.0. The oldest
LTS version is 3.16 now. Should I send v4 and add a statement about
the supported version range in commit log?

>
>
>> Cc: pmla...@suse.com
>> Cc: sergey.senozhat...@gmail.com
>> Cc: rost...@goodmis.org
>> ---
>> v2:
>> Split out the addition of pr_fmt and the unsigned update
> Which unsigned update? As it does switch to unsigned to "unsigned int",
> but that change is fine to me with this.

No problem. It's the history of v2.

In v1 you suggested "unsigned int size" should be in a separate patch and
I did that in v2. Then Sergey suggested "unsigned int size" should be in the
1/2 patch to avoid checkpatch.pl warning. With your conformation, I add it
back here in v3.

Thanks,
Zhe

>
>> v3:
>> Use more clear error info
>> Change unsigned to unsigned in to avoid checkpatch.pl warning
>>
>>  kernel/printk/printk.c | 9 -
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
>> index 9bf5404..d9821c0 100644
>> --- a/kernel/printk/printk.c
>> +++ b/kernel/printk/printk.c
>> @@ -1048,7 +1048,14 @@ static void __init log_buf_len_update(unsigned size)
>>  /* save requested log_buf_len since it's too early to process it */
>>  static int __init log_buf_len_setup(char *str)
>>  {
>> -unsigned size = memparse(str, );
>> +unsigned int size;
> I'm OK with the int update too, as its low risk.
>
> Acked-by: Steven Rostedt (VMware) 
>
> -- Steve
>
>> +
>> +if (!str) {
>> +pr_err("boot command line parameter value not provided\n");
>> +return -EINVAL;
>> +}
>> +
>> +size = memparse(str, );
>>  
>>  log_buf_len_update(size);
>>  
>



Re: [PATCH v3 1/2] printk: Fix panic caused by passing log_buf_len to command line

2018-09-23 Thread He Zhe



On 2018年09月23日 00:19, Steven Rostedt wrote:
> On Sat, 22 Sep 2018 23:40:51 +0800
>  wrote:
>
>> From: He Zhe 
>>
>> log_buf_len_setup does not check input argument before passing it to
>> simple_strtoull. The argument would be a NULL pointer if "log_buf_len",
>> without its value, is set in command line and thus causes the following
>> panic.
>>
>> PANIC: early exception 0xe3 IP 10:aaeacd0d error 0 cr2 0x0
>> [0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 
>> 4.19.0-rc4-yocto-standard+ #1
>> [0.00] RIP: 0010:_parse_integer_fixup_radix+0xd/0x70
>> ...
>> [0.00] Call Trace:
>> [0.00]  simple_strtoull+0x29/0x70
>> [0.00]  memparse+0x26/0x90
>> [0.00]  log_buf_len_setup+0x17/0x22
>> [0.00]  do_early_param+0x57/0x8e
>> [0.00]  parse_args+0x208/0x320
>> [0.00]  ? rdinit_setup+0x30/0x30
>> [0.00]  parse_early_options+0x29/0x2d
>> [0.00]  ? rdinit_setup+0x30/0x30
>> [0.00]  parse_early_param+0x36/0x4d
>> [0.00]  setup_arch+0x336/0x99e
>> [0.00]  start_kernel+0x6f/0x4ee
>> [0.00]  x86_64_start_reservations+0x24/0x26
>> [0.00]  x86_64_start_kernel+0x6f/0x72
>> [0.00]  secondary_startup_64+0xa4/0xb0
>>
>> This patch adds a check to prevent the panic.
>>
>> Signed-off-by: He Zhe 
>> Cc: sta...@vger.kernel.org
> I just tried this on a 2.6.32 kernel, and it crashes there. I guess
> this goes farther back than git history goes.
>
> Perhaps it should be commented that this bug has been here since
> creation of (git) time.

I did a try on 2.6.32. It passed. Actually this bug only happens on
early_param(not __setup) which is introduced since v3.0. The oldest
LTS version is 3.16 now. Should I send v4 and add a statement about
the supported version range in commit log?

>
>
>> Cc: pmla...@suse.com
>> Cc: sergey.senozhat...@gmail.com
>> Cc: rost...@goodmis.org
>> ---
>> v2:
>> Split out the addition of pr_fmt and the unsigned update
> Which unsigned update? As it does switch to unsigned to "unsigned int",
> but that change is fine to me with this.

No problem. It's the history of v2.

In v1 you suggested "unsigned int size" should be in a separate patch and
I did that in v2. Then Sergey suggested "unsigned int size" should be in the
1/2 patch to avoid checkpatch.pl warning. With your conformation, I add it
back here in v3.

Thanks,
Zhe

>
>> v3:
>> Use more clear error info
>> Change unsigned to unsigned in to avoid checkpatch.pl warning
>>
>>  kernel/printk/printk.c | 9 -
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
>> index 9bf5404..d9821c0 100644
>> --- a/kernel/printk/printk.c
>> +++ b/kernel/printk/printk.c
>> @@ -1048,7 +1048,14 @@ static void __init log_buf_len_update(unsigned size)
>>  /* save requested log_buf_len since it's too early to process it */
>>  static int __init log_buf_len_setup(char *str)
>>  {
>> -unsigned size = memparse(str, );
>> +unsigned int size;
> I'm OK with the int update too, as its low risk.
>
> Acked-by: Steven Rostedt (VMware) 
>
> -- Steve
>
>> +
>> +if (!str) {
>> +pr_err("boot command line parameter value not provided\n");
>> +return -EINVAL;
>> +}
>> +
>> +size = memparse(str, );
>>  
>>  log_buf_len_update(size);
>>  
>



Re: [PATCH v2 1/2] printk: Fix panic caused by passing log_buf_len to command line

2018-09-22 Thread He Zhe



On 2018年09月21日 15:37, Petr Mladek wrote:
> On Thu 2018-09-20 12:30:56, Steven Rostedt wrote:
>> On Fri, 21 Sep 2018 00:16:50 +0800
>> He Zhe  wrote:
>>
>>> On 2018年09月19日 10:43, Steven Rostedt wrote:
>>>> On Wed, 19 Sep 2018 11:39:32 +0900
>>>> Sergey Senozhatsky  wrote:
>>>>  
>>>>> On (09/19/18 10:27), He Zhe wrote:  
>>>>>> On 2018年09月19日 09:50, Sergey Senozhatsky wrote:
>>>>>>> On (09/19/18 01:17), zhe...@windriver.com wrote:
>>>>>>>> @@ -1048,7 +1048,14 @@ static void __init log_buf_len_update(unsigned 
>>>>>>>> size)
>>>>>>>>  /* save requested log_buf_len since it's too early to process it */
>>>>>>>>  static int __init log_buf_len_setup(char *str)
>>>>>>>>  {
>>>>>>>> -  unsigned size = memparse(str, );
>>>>>>>> +  unsigned size;
>>>>>>> unsigned int size;
>>>>>> This is in v1 but then Steven suggested that it should be split out
>>>>>> and only keep the pure fix part here.
>>>>> Ah, I see.
>>>>>
>>>>> Hmm... memparse() returns u64 value. A user *probably* can ask the kernel
>>>>> to allocate log_buf larger than 'unsigned int'.
>>>>>
>>>>> So may be I'd do two fixes here:
>>>>>
>>>>>  First  - switch to u64 size.
>>>>>  Second - check for NULL str.
>>>>>
>>>>>
>>>>> Steven, Petr, what do you think?
>>>>>  
>>>> I think I would switch it around. Check for NULL first, and then switch
>>>> to u64. It was always an int, do we need to backport converting it to
>>>> u64 to stable? The NULL check is a definite, the overflow of int
>>>> shouldn't crash anything.  
>> Hi Zhe,
>>
>>> To switch to u64, several variables need to be adjusted to new type to 
>>> aligned
>>> with new_log_buf_len. And currently new_log_buf_len is passed to
>>> memblock_virt_alloc(phys_addr_t, phys_addr_t). So we can't simply define
>>> new_log_buf_len as u64. We need to define it as phys_addr_t tomake it work
>>> well for both 32bit and 64bit arches, since a 32-bit architecture can set
>>> ARCH_PHYS_ADDR_T_64BIT if it needs a 64-bit phys_addr_t.
>> The above explanation verifies more that the NULL pointer check needs
>> to be first, and that the change in size should not be backported to
>> stable because it has a high risk to doing the change as compared to it
>> being a problem for older kernels.
> I agree that the NULL check should go first.
>
> I would personally keep the size as unsigned int. IMHO, a support
> for a log buffer bigger than 4GB is not worth the complexity.

Thank you, Petr, Steven and Sergey. I'll send v3 soon for the NULL check
and cast correction.



For 64 bit log buffer length, the variable changes should be OK. The main
obstacle might be that we need to modify the syscall below to make the
64 bit length returned back to user space. "error" is not long enough.

int do_syslog(int type, char __user *buf, int len, int source)
...
    case SYSLOG_ACTION_SIZE_BUFFER: 
    error = log_buf_len;
    break;
...
    return error;
...

SYSCALL_DEFINE3(syslog, int, type, char __user *, buf, int, len)
{   


    return do_syslog(type, buf, len, SYSLOG_FROM_READER);   

}



Besides, in terms of variable type alignment, there has already been a
similar problematic case in do_syslog as below. i.e. int = u64 - u64. It seems
this needs to be fixed.

int do_syslog(int type, char __user *buf, int len, int source)
...
    case SYSLOG_ACTION_SIZE_UNREAD:
        if (source == SYSLOG_FROM_PROC) {
            error = log_next_seq - syslog_seq;
...



I'd like to fix the above issues. But can I have your opinions about how to
handle the syscall?


Thanks,
Zhe

>
> Best Regards,
> Petr
>



Re: [PATCH v2 1/2] printk: Fix panic caused by passing log_buf_len to command line

2018-09-22 Thread He Zhe



On 2018年09月21日 15:37, Petr Mladek wrote:
> On Thu 2018-09-20 12:30:56, Steven Rostedt wrote:
>> On Fri, 21 Sep 2018 00:16:50 +0800
>> He Zhe  wrote:
>>
>>> On 2018年09月19日 10:43, Steven Rostedt wrote:
>>>> On Wed, 19 Sep 2018 11:39:32 +0900
>>>> Sergey Senozhatsky  wrote:
>>>>  
>>>>> On (09/19/18 10:27), He Zhe wrote:  
>>>>>> On 2018年09月19日 09:50, Sergey Senozhatsky wrote:
>>>>>>> On (09/19/18 01:17), zhe...@windriver.com wrote:
>>>>>>>> @@ -1048,7 +1048,14 @@ static void __init log_buf_len_update(unsigned 
>>>>>>>> size)
>>>>>>>>  /* save requested log_buf_len since it's too early to process it */
>>>>>>>>  static int __init log_buf_len_setup(char *str)
>>>>>>>>  {
>>>>>>>> -  unsigned size = memparse(str, );
>>>>>>>> +  unsigned size;
>>>>>>> unsigned int size;
>>>>>> This is in v1 but then Steven suggested that it should be split out
>>>>>> and only keep the pure fix part here.
>>>>> Ah, I see.
>>>>>
>>>>> Hmm... memparse() returns u64 value. A user *probably* can ask the kernel
>>>>> to allocate log_buf larger than 'unsigned int'.
>>>>>
>>>>> So may be I'd do two fixes here:
>>>>>
>>>>>  First  - switch to u64 size.
>>>>>  Second - check for NULL str.
>>>>>
>>>>>
>>>>> Steven, Petr, what do you think?
>>>>>  
>>>> I think I would switch it around. Check for NULL first, and then switch
>>>> to u64. It was always an int, do we need to backport converting it to
>>>> u64 to stable? The NULL check is a definite, the overflow of int
>>>> shouldn't crash anything.  
>> Hi Zhe,
>>
>>> To switch to u64, several variables need to be adjusted to new type to 
>>> aligned
>>> with new_log_buf_len. And currently new_log_buf_len is passed to
>>> memblock_virt_alloc(phys_addr_t, phys_addr_t). So we can't simply define
>>> new_log_buf_len as u64. We need to define it as phys_addr_t tomake it work
>>> well for both 32bit and 64bit arches, since a 32-bit architecture can set
>>> ARCH_PHYS_ADDR_T_64BIT if it needs a 64-bit phys_addr_t.
>> The above explanation verifies more that the NULL pointer check needs
>> to be first, and that the change in size should not be backported to
>> stable because it has a high risk to doing the change as compared to it
>> being a problem for older kernels.
> I agree that the NULL check should go first.
>
> I would personally keep the size as unsigned int. IMHO, a support
> for a log buffer bigger than 4GB is not worth the complexity.

Thank you, Petr, Steven and Sergey. I'll send v3 soon for the NULL check
and cast correction.



For 64 bit log buffer length, the variable changes should be OK. The main
obstacle might be that we need to modify the syscall below to make the
64 bit length returned back to user space. "error" is not long enough.

int do_syslog(int type, char __user *buf, int len, int source)
...
    case SYSLOG_ACTION_SIZE_BUFFER: 
    error = log_buf_len;
    break;
...
    return error;
...

SYSCALL_DEFINE3(syslog, int, type, char __user *, buf, int, len)
{   


    return do_syslog(type, buf, len, SYSLOG_FROM_READER);   

}



Besides, in terms of variable type alignment, there has already been a
similar problematic case in do_syslog as below. i.e. int = u64 - u64. It seems
this needs to be fixed.

int do_syslog(int type, char __user *buf, int len, int source)
...
    case SYSLOG_ACTION_SIZE_UNREAD:
        if (source == SYSLOG_FROM_PROC) {
            error = log_next_seq - syslog_seq;
...



I'd like to fix the above issues. But can I have your opinions about how to
handle the syscall?


Thanks,
Zhe

>
> Best Regards,
> Petr
>



Re: [PATCH v2 1/2] printk: Fix panic caused by passing log_buf_len to command line

2018-09-20 Thread He Zhe



On 2018年09月19日 10:43, Steven Rostedt wrote:
> On Wed, 19 Sep 2018 11:39:32 +0900
> Sergey Senozhatsky  wrote:
>
>> On (09/19/18 10:27), He Zhe wrote:
>>> On 2018年09月19日 09:50, Sergey Senozhatsky wrote:  
>>>> On (09/19/18 01:17), zhe...@windriver.com wrote:  
>>>>> @@ -1048,7 +1048,14 @@ static void __init log_buf_len_update(unsigned 
>>>>> size)
>>>>>  /* save requested log_buf_len since it's too early to process it */
>>>>>  static int __init log_buf_len_setup(char *str)
>>>>>  {
>>>>> - unsigned size = memparse(str, );
>>>>> + unsigned size;  
>>>>unsigned int size;  
>>> This is in v1 but then Steven suggested that it should be split out
>>> and only keep the pure fix part here.  
>> Ah, I see.
>>
>> Hmm... memparse() returns u64 value. A user *probably* can ask the kernel
>> to allocate log_buf larger than 'unsigned int'.
>>
>> So may be I'd do two fixes here:
>>
>>  First  - switch to u64 size.
>>  Second - check for NULL str.
>>
>>
>> Steven, Petr, what do you think?
>>
> I think I would switch it around. Check for NULL first, and then switch
> to u64. It was always an int, do we need to backport converting it to
> u64 to stable? The NULL check is a definite, the overflow of int
> shouldn't crash anything.

To switch to u64, several variables need to be adjusted to new type to aligned
with new_log_buf_len. And currently new_log_buf_len is passed to
memblock_virt_alloc(phys_addr_t, phys_addr_t). So we can't simply define
new_log_buf_len as u64. We need to define it as phys_addr_t tomake it work
well for both 32bit and 64bit arches, since a 32-bit architecture can set
ARCH_PHYS_ADDR_T_64BIT if it needs a 64-bit phys_addr_t.

What do you think?

#ifdef CONFIG_PHYS_ADDR_T_64BIT
typedef u64 phys_addr_t;
#else
typedef u32 phys_addr_t;
#endif

Thanks,
Zhe


> -- Steve
>



Re: [PATCH v2 1/2] printk: Fix panic caused by passing log_buf_len to command line

2018-09-20 Thread He Zhe



On 2018年09月19日 10:43, Steven Rostedt wrote:
> On Wed, 19 Sep 2018 11:39:32 +0900
> Sergey Senozhatsky  wrote:
>
>> On (09/19/18 10:27), He Zhe wrote:
>>> On 2018年09月19日 09:50, Sergey Senozhatsky wrote:  
>>>> On (09/19/18 01:17), zhe...@windriver.com wrote:  
>>>>> @@ -1048,7 +1048,14 @@ static void __init log_buf_len_update(unsigned 
>>>>> size)
>>>>>  /* save requested log_buf_len since it's too early to process it */
>>>>>  static int __init log_buf_len_setup(char *str)
>>>>>  {
>>>>> - unsigned size = memparse(str, );
>>>>> + unsigned size;  
>>>>unsigned int size;  
>>> This is in v1 but then Steven suggested that it should be split out
>>> and only keep the pure fix part here.  
>> Ah, I see.
>>
>> Hmm... memparse() returns u64 value. A user *probably* can ask the kernel
>> to allocate log_buf larger than 'unsigned int'.
>>
>> So may be I'd do two fixes here:
>>
>>  First  - switch to u64 size.
>>  Second - check for NULL str.
>>
>>
>> Steven, Petr, what do you think?
>>
> I think I would switch it around. Check for NULL first, and then switch
> to u64. It was always an int, do we need to backport converting it to
> u64 to stable? The NULL check is a definite, the overflow of int
> shouldn't crash anything.

To switch to u64, several variables need to be adjusted to new type to aligned
with new_log_buf_len. And currently new_log_buf_len is passed to
memblock_virt_alloc(phys_addr_t, phys_addr_t). So we can't simply define
new_log_buf_len as u64. We need to define it as phys_addr_t tomake it work
well for both 32bit and 64bit arches, since a 32-bit architecture can set
ARCH_PHYS_ADDR_T_64BIT if it needs a 64-bit phys_addr_t.

What do you think?

#ifdef CONFIG_PHYS_ADDR_T_64BIT
typedef u64 phys_addr_t;
#else
typedef u32 phys_addr_t;
#endif

Thanks,
Zhe


> -- Steve
>



Re: [PATCH v2 1/2] printk: Fix panic caused by passing log_buf_len to command line

2018-09-19 Thread He Zhe



On 2018年09月19日 14:44, Sergey Senozhatsky wrote:
> On (09/19/18 10:27), He Zhe wrote:
>> On 2018年09月19日 09:50, Sergey Senozhatsky wrote:
>>> On (09/19/18 01:17), zhe...@windriver.com wrote:
>>>> @@ -1048,7 +1048,14 @@ static void __init log_buf_len_update(unsigned size)
>>>>  /* save requested log_buf_len since it's too early to process it */
>>>>  static int __init log_buf_len_setup(char *str)
>>>>  {
>>>> -  unsigned size = memparse(str, );
>>>> +  unsigned size;
>>> unsigned int size;
>> This is in v1 but then Steven suggested that it should be split out
>> and only keep the pure fix part here.
> JFI
>
> Seems there are more patches like this. I found this one in MM list:
> lkml.kernel.org/r/1537284788-428784-1-git-send-email-zhe...@windriver.com
>
> Andrew's response:
> lkml.kernel.org/r/20180918141917.2cb16b01c122dbe1ead2f...@linux-foundation.org

I just replied. Let's see what Andrew would say.
https://lore.kernel.org/lkml/1c32c1d2-a54a-30f7-1afb-ad6d3282f...@windriver.com/

Thanks,
Zhe

>
>   -ss
>



Re: [PATCH v2 1/2] printk: Fix panic caused by passing log_buf_len to command line

2018-09-19 Thread He Zhe



On 2018年09月19日 14:44, Sergey Senozhatsky wrote:
> On (09/19/18 10:27), He Zhe wrote:
>> On 2018年09月19日 09:50, Sergey Senozhatsky wrote:
>>> On (09/19/18 01:17), zhe...@windriver.com wrote:
>>>> @@ -1048,7 +1048,14 @@ static void __init log_buf_len_update(unsigned size)
>>>>  /* save requested log_buf_len since it's too early to process it */
>>>>  static int __init log_buf_len_setup(char *str)
>>>>  {
>>>> -  unsigned size = memparse(str, );
>>>> +  unsigned size;
>>> unsigned int size;
>> This is in v1 but then Steven suggested that it should be split out
>> and only keep the pure fix part here.
> JFI
>
> Seems there are more patches like this. I found this one in MM list:
> lkml.kernel.org/r/1537284788-428784-1-git-send-email-zhe...@windriver.com
>
> Andrew's response:
> lkml.kernel.org/r/20180918141917.2cb16b01c122dbe1ead2f...@linux-foundation.org

I just replied. Let's see what Andrew would say.
https://lore.kernel.org/lkml/1c32c1d2-a54a-30f7-1afb-ad6d3282f...@windriver.com/

Thanks,
Zhe

>
>   -ss
>



Re: [PATCH] mm/page_alloc: Fix panic caused by passing debug_guardpage_minorder or kernelcore to command line

2018-09-19 Thread He Zhe



On 2018年09月19日 05:19, Andrew Morton wrote:
> On Tue, 18 Sep 2018 23:33:08 +0800  wrote:
>
>> From: He Zhe 
>>
>> debug_guardpage_minorder_setup and cmdline_parse_kernelcore do not check
>> input argument before using it. The argument would be a NULL pointer if
>> "debug_guardpage_minorder" or "kernelcore", without its value, is set in
>> command line and thus causes the following panic.
>>
>> PANIC: early exception 0xe3 IP 10:a08146f1 error 0 cr2 0x0
>> [0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 
>> 4.19.0-rc4-yocto-standard+ #1
>> [0.00] RIP: 0010:parse_option_str+0x11/0x90
>> ...
>> [0.00] Call Trace:
>> [0.00]  cmdline_parse_kernelcore+0x19/0x41
>> [0.00]  do_early_param+0x57/0x8e
>> [0.00]  parse_args+0x208/0x320
>> [0.00]  ? rdinit_setup+0x30/0x30
>> [0.00]  parse_early_options+0x29/0x2d
>> [0.00]  ? rdinit_setup+0x30/0x30
>> [0.00]  parse_early_param+0x36/0x4d
>> [0.00]  setup_arch+0x336/0x99e
>> [0.00]  start_kernel+0x6f/0x4ee
>> [0.00]  x86_64_start_reservations+0x24/0x26
>> [0.00]  x86_64_start_kernel+0x6f/0x72
>> [0.00]  secondary_startup_64+0xa4/0xb0
> >From my quick reading, more than half of the __setup handlers in mm/
> will crash in the same way if misused in this fashion.
>
>> This patch adds a check to prevent the panic and adds KBUILD_MODNAME to
>> prints.
> So a better solution might be to add a check into the calling code
> (presumably in init/main.c) to print a warning if we have kernel
> command line arguments such as "kernelcore=".  That way, users will see
> the warning immediately before the oops and will know how to fix things
> up.

Thank you for your suggestion.

"kernelcore=" would not cause crash, "kernelcore' would. Andmany users of
early_param, e.g. the following two, depend on the validity of the "xxx"
format. If we fixed in the calling code, those parameters would become
invalid and need to be changed to a new format. That might affect too much.
Soit might be better to correct the users who misuse it.


static int __init cmdline_parse_movable_node(char *p)   
{   

#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
    movable_node_enabled = true;
#else   
    pr_warn("movable_node parameter depends on 
CONFIG_HAVE_MEMBLOCK_NODE_MAP to work properly\n");
#endif  
    return 0;   
}   
early_param("movable_node", cmdline_parse_movable_node);


static int __init parse_alloc_mptable_opt(char *p)  
   
{   
    enable_update_mptable = 1;  
   
#ifdef CONFIG_PCI   
    pci_routeirq = 1;   
#endif  
    alloc_mptable = 1;  
    if (!p) 
    return 0;   
    mpc_new_length = memparse(p, );   
   
    return 0;   
}   
early_param("alloc_mptable", parse_alloc_mptable_opt);

>
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -14,6 +14,8 @@
>>   *  (lots of bits borrowed from Ingo Molnar & Andrew Morton)
>>   */
>>  
>> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>> +
>>  #include 
>>  #include 
>>  #include 
>> @@ -630,6 +632,11 @@ static int __init debug_guardpage_minorder_setup(char 
>> *buf)
>>  {
>>  unsigned long res;
>>  
>> +if (!buf) {
>> +pr_err("Config string not provided\n");
> If were going to do it this way, we should tell the operator which
> argument was bad.  pr_err("kernel option debug_guardpage_minorder
> requires a

Re: [PATCH] mm/page_alloc: Fix panic caused by passing debug_guardpage_minorder or kernelcore to command line

2018-09-19 Thread He Zhe



On 2018年09月19日 05:19, Andrew Morton wrote:
> On Tue, 18 Sep 2018 23:33:08 +0800  wrote:
>
>> From: He Zhe 
>>
>> debug_guardpage_minorder_setup and cmdline_parse_kernelcore do not check
>> input argument before using it. The argument would be a NULL pointer if
>> "debug_guardpage_minorder" or "kernelcore", without its value, is set in
>> command line and thus causes the following panic.
>>
>> PANIC: early exception 0xe3 IP 10:a08146f1 error 0 cr2 0x0
>> [0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 
>> 4.19.0-rc4-yocto-standard+ #1
>> [0.00] RIP: 0010:parse_option_str+0x11/0x90
>> ...
>> [0.00] Call Trace:
>> [0.00]  cmdline_parse_kernelcore+0x19/0x41
>> [0.00]  do_early_param+0x57/0x8e
>> [0.00]  parse_args+0x208/0x320
>> [0.00]  ? rdinit_setup+0x30/0x30
>> [0.00]  parse_early_options+0x29/0x2d
>> [0.00]  ? rdinit_setup+0x30/0x30
>> [0.00]  parse_early_param+0x36/0x4d
>> [0.00]  setup_arch+0x336/0x99e
>> [0.00]  start_kernel+0x6f/0x4ee
>> [0.00]  x86_64_start_reservations+0x24/0x26
>> [0.00]  x86_64_start_kernel+0x6f/0x72
>> [0.00]  secondary_startup_64+0xa4/0xb0
> >From my quick reading, more than half of the __setup handlers in mm/
> will crash in the same way if misused in this fashion.
>
>> This patch adds a check to prevent the panic and adds KBUILD_MODNAME to
>> prints.
> So a better solution might be to add a check into the calling code
> (presumably in init/main.c) to print a warning if we have kernel
> command line arguments such as "kernelcore=".  That way, users will see
> the warning immediately before the oops and will know how to fix things
> up.

Thank you for your suggestion.

"kernelcore=" would not cause crash, "kernelcore' would. Andmany users of
early_param, e.g. the following two, depend on the validity of the "xxx"
format. If we fixed in the calling code, those parameters would become
invalid and need to be changed to a new format. That might affect too much.
Soit might be better to correct the users who misuse it.


static int __init cmdline_parse_movable_node(char *p)   
{   

#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
    movable_node_enabled = true;
#else   
    pr_warn("movable_node parameter depends on 
CONFIG_HAVE_MEMBLOCK_NODE_MAP to work properly\n");
#endif  
    return 0;   
}   
early_param("movable_node", cmdline_parse_movable_node);


static int __init parse_alloc_mptable_opt(char *p)  
   
{   
    enable_update_mptable = 1;  
   
#ifdef CONFIG_PCI   
    pci_routeirq = 1;   
#endif  
    alloc_mptable = 1;  
    if (!p) 
    return 0;   
    mpc_new_length = memparse(p, );   
   
    return 0;   
}   
early_param("alloc_mptable", parse_alloc_mptable_opt);

>
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -14,6 +14,8 @@
>>   *  (lots of bits borrowed from Ingo Molnar & Andrew Morton)
>>   */
>>  
>> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>> +
>>  #include 
>>  #include 
>>  #include 
>> @@ -630,6 +632,11 @@ static int __init debug_guardpage_minorder_setup(char 
>> *buf)
>>  {
>>  unsigned long res;
>>  
>> +if (!buf) {
>> +pr_err("Config string not provided\n");
> If were going to do it this way, we should tell the operator which
> argument was bad.  pr_err("kernel option debug_guardpage_minorder
> requires a

Re: [PATCH v2 1/2] printk: Fix panic caused by passing log_buf_len to command line

2018-09-18 Thread He Zhe



On 2018年09月19日 09:50, Sergey Senozhatsky wrote:
> On (09/19/18 01:17), zhe...@windriver.com wrote:
>> @@ -1048,7 +1048,14 @@ static void __init log_buf_len_update(unsigned size)
>>  /* save requested log_buf_len since it's too early to process it */
>>  static int __init log_buf_len_setup(char *str)
>>  {
>> -unsigned size = memparse(str, );
>> +unsigned size;
>   unsigned int size;

This is in v1 but then Steven suggested that it should be split out
and only keep the pure fix part here.

>
>> +if (!str) {
>> +pr_err("Config string not provided\n");
> pr_debug() may be?
>
> It's not clear from this error message which one of the "config strings"
> was not provided. Besides, I think "config string" is misleading and in
> fact it's a "boot command line parameter". But, dunno, I guess I'd just
> drop that print out.

I'm OK with dropping it. Anyway it'd get "Malformed early
option 'log_buf_len'" from early_param later.

I'll send v3. Thank you.

Zhe

>
> Otherwise,
>
> Acked-by: Sergey Senozhatsky 
>
>   -ss
>



Re: [PATCH v2 1/2] printk: Fix panic caused by passing log_buf_len to command line

2018-09-18 Thread He Zhe



On 2018年09月19日 09:50, Sergey Senozhatsky wrote:
> On (09/19/18 01:17), zhe...@windriver.com wrote:
>> @@ -1048,7 +1048,14 @@ static void __init log_buf_len_update(unsigned size)
>>  /* save requested log_buf_len since it's too early to process it */
>>  static int __init log_buf_len_setup(char *str)
>>  {
>> -unsigned size = memparse(str, );
>> +unsigned size;
>   unsigned int size;

This is in v1 but then Steven suggested that it should be split out
and only keep the pure fix part here.

>
>> +if (!str) {
>> +pr_err("Config string not provided\n");
> pr_debug() may be?
>
> It's not clear from this error message which one of the "config strings"
> was not provided. Besides, I think "config string" is misleading and in
> fact it's a "boot command line parameter". But, dunno, I guess I'd just
> drop that print out.

I'm OK with dropping it. Anyway it'd get "Malformed early
option 'log_buf_len'" from early_param later.

I'll send v3. Thank you.

Zhe

>
> Otherwise,
>
> Acked-by: Sergey Senozhatsky 
>
>   -ss
>



Re: [PATCH] scripts/gcc-goto.sh: Show stdout and stderr for potential errors

2018-09-17 Thread He Zhe
May I have your input?

Thanks,
Zhe

On 2018年08月09日 16:34, zhe...@windriver.com wrote:
> From: He Zhe 
>
> The check may fail not only because ${CC} does not support the asm
> feature, but also due to potential defects of ${CC} itself like what
> we experienced below or even it's missing.
>
> Assembler messages:
> Fatal error: The input and output files must be distinct
> (introduced by binutils-2.31 "Stop the assembler from overwriting its
> output file.")
>
> This patch enables stdout and stderr to give user direct error
> information for those cases, while not polluting the output for normal
> cases.
>
> Signed-off-by: He Zhe 
> ---
>  scripts/gcc-goto.sh | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/scripts/gcc-goto.sh b/scripts/gcc-goto.sh
> index 083c526..bd7fc8eb 100755
> --- a/scripts/gcc-goto.sh
> +++ b/scripts/gcc-goto.sh
> @@ -3,7 +3,7 @@
>  # Test for gcc 'asm goto' support
>  # Copyright (C) 2010, Jason Baron 
>  
> -cat << "END" | $@ -x c - -c -o /dev/null >/dev/null 2>&1 && echo "y"
> +cat << "END" | $@ -x c - -c -o /dev/null && echo "y"
>  int main(void)
>  {
>  #if defined(__arm__) || defined(__aarch64__)



Re: [PATCH] scripts/gcc-goto.sh: Show stdout and stderr for potential errors

2018-09-17 Thread He Zhe
May I have your input?

Thanks,
Zhe

On 2018年08月09日 16:34, zhe...@windriver.com wrote:
> From: He Zhe 
>
> The check may fail not only because ${CC} does not support the asm
> feature, but also due to potential defects of ${CC} itself like what
> we experienced below or even it's missing.
>
> Assembler messages:
> Fatal error: The input and output files must be distinct
> (introduced by binutils-2.31 "Stop the assembler from overwriting its
> output file.")
>
> This patch enables stdout and stderr to give user direct error
> information for those cases, while not polluting the output for normal
> cases.
>
> Signed-off-by: He Zhe 
> ---
>  scripts/gcc-goto.sh | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/scripts/gcc-goto.sh b/scripts/gcc-goto.sh
> index 083c526..bd7fc8eb 100755
> --- a/scripts/gcc-goto.sh
> +++ b/scripts/gcc-goto.sh
> @@ -3,7 +3,7 @@
>  # Test for gcc 'asm goto' support
>  # Copyright (C) 2010, Jason Baron 
>  
> -cat << "END" | $@ -x c - -c -o /dev/null >/dev/null 2>&1 && echo "y"
> +cat << "END" | $@ -x c - -c -o /dev/null && echo "y"
>  int main(void)
>  {
>  #if defined(__arm__) || defined(__aarch64__)



[tip:x86/boot] x86/corruption-check: Use pr_*() instead of printk()

2018-09-11 Thread tip-bot for He Zhe
Commit-ID:  b1e3a25f5879017fc50ca17f03118b26a19df49a
Gitweb: https://git.kernel.org/tip/b1e3a25f5879017fc50ca17f03118b26a19df49a
Author: He Zhe 
AuthorDate: Tue, 14 Aug 2018 23:33:43 +0800
Committer:  Ingo Molnar 
CommitDate: Mon, 10 Sep 2018 14:47:33 +0200

x86/corruption-check: Use pr_*() instead of printk()

pr_*() is the preferred style.

Signed-off-by: He Zhe 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: gre...@linuxfoundation.org
Cc: kstew...@linuxfoundation.org
Cc: pombreda...@nexb.com
Link: 
http://lkml.kernel.org/r/1534260823-87917-2-git-send-email-zhe...@windriver.com
[ Moved all console output into a single line. ]
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/check.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/check.c b/arch/x86/kernel/check.c
index cc8258a5378b..1979a76bfadd 100644
--- a/arch/x86/kernel/check.c
+++ b/arch/x86/kernel/check.c
@@ -1,4 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include 
 #include 
 #include 
@@ -41,6 +44,7 @@ static __init int set_corruption_check(char *arg)
return ret;
 
memory_corruption_check = val;
+
return 0;
 }
 early_param("memory_corruption_check", set_corruption_check);
@@ -128,7 +132,7 @@ void __init setup_bios_corruption_check(void)
}
 
if (num_scan_areas)
-   printk(KERN_INFO "Scanning %d areas for low memory 
corruption\n", num_scan_areas);
+   pr_info("Scanning %d areas for low memory corruption\n", 
num_scan_areas);
 }
 
 
@@ -147,8 +151,7 @@ void check_for_bios_corruption(void)
for (; size; addr++, size -= sizeof(unsigned long)) {
if (!*addr)
continue;
-   printk(KERN_ERR "Corrupted low memory at %p (%lx phys) 
= %08lx\n",
-  addr, __pa(addr), *addr);
+   pr_err("Corrupted low memory at %p (%lx phys) = 
%08lx\n", addr, __pa(addr), *addr);
corruption = 1;
*addr = 0;
}
@@ -172,11 +175,11 @@ static int start_periodic_check_for_corruption(void)
if (!num_scan_areas || !memory_corruption_check || 
corruption_check_period == 0)
return 0;
 
-   printk(KERN_INFO "Scanning for low memory corruption every %d 
seconds\n",
-  corruption_check_period);
+   pr_info("Scanning for low memory corruption every %d seconds\n", 
corruption_check_period);
 
/* First time we run the checks right away */
schedule_delayed_work(_check_work, 0);
+
return 0;
 }
 device_initcall(start_periodic_check_for_corruption);


[tip:x86/boot] x86/corruption-check: Use pr_*() instead of printk()

2018-09-11 Thread tip-bot for He Zhe
Commit-ID:  b1e3a25f5879017fc50ca17f03118b26a19df49a
Gitweb: https://git.kernel.org/tip/b1e3a25f5879017fc50ca17f03118b26a19df49a
Author: He Zhe 
AuthorDate: Tue, 14 Aug 2018 23:33:43 +0800
Committer:  Ingo Molnar 
CommitDate: Mon, 10 Sep 2018 14:47:33 +0200

x86/corruption-check: Use pr_*() instead of printk()

pr_*() is the preferred style.

Signed-off-by: He Zhe 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: gre...@linuxfoundation.org
Cc: kstew...@linuxfoundation.org
Cc: pombreda...@nexb.com
Link: 
http://lkml.kernel.org/r/1534260823-87917-2-git-send-email-zhe...@windriver.com
[ Moved all console output into a single line. ]
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/check.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/check.c b/arch/x86/kernel/check.c
index cc8258a5378b..1979a76bfadd 100644
--- a/arch/x86/kernel/check.c
+++ b/arch/x86/kernel/check.c
@@ -1,4 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include 
 #include 
 #include 
@@ -41,6 +44,7 @@ static __init int set_corruption_check(char *arg)
return ret;
 
memory_corruption_check = val;
+
return 0;
 }
 early_param("memory_corruption_check", set_corruption_check);
@@ -128,7 +132,7 @@ void __init setup_bios_corruption_check(void)
}
 
if (num_scan_areas)
-   printk(KERN_INFO "Scanning %d areas for low memory 
corruption\n", num_scan_areas);
+   pr_info("Scanning %d areas for low memory corruption\n", 
num_scan_areas);
 }
 
 
@@ -147,8 +151,7 @@ void check_for_bios_corruption(void)
for (; size; addr++, size -= sizeof(unsigned long)) {
if (!*addr)
continue;
-   printk(KERN_ERR "Corrupted low memory at %p (%lx phys) 
= %08lx\n",
-  addr, __pa(addr), *addr);
+   pr_err("Corrupted low memory at %p (%lx phys) = 
%08lx\n", addr, __pa(addr), *addr);
corruption = 1;
*addr = 0;
}
@@ -172,11 +175,11 @@ static int start_periodic_check_for_corruption(void)
if (!num_scan_areas || !memory_corruption_check || 
corruption_check_period == 0)
return 0;
 
-   printk(KERN_INFO "Scanning for low memory corruption every %d 
seconds\n",
-  corruption_check_period);
+   pr_info("Scanning for low memory corruption every %d seconds\n", 
corruption_check_period);
 
/* First time we run the checks right away */
schedule_delayed_work(_check_work, 0);
+
return 0;
 }
 device_initcall(start_periodic_check_for_corruption);


[tip:x86/boot] x86/corruption-check: Fix panic in memory_corruption_check() when boot option without value is provided

2018-09-11 Thread tip-bot for He Zhe
Commit-ID:  ccde460b9ae5c2bd5e4742af0a7f623c2daad566
Gitweb: https://git.kernel.org/tip/ccde460b9ae5c2bd5e4742af0a7f623c2daad566
Author: He Zhe 
AuthorDate: Tue, 14 Aug 2018 23:33:42 +0800
Committer:  Ingo Molnar 
CommitDate: Mon, 10 Sep 2018 14:47:32 +0200

x86/corruption-check: Fix panic in memory_corruption_check() when boot option 
without value is provided

memory_corruption_check[{_period|_size}]()'s handlers do not check input
argument before passing it to kstrtoul() or simple_strtoull(). The argument
would be a NULL pointer if each of the kernel parameters, without its
value, is set in command line and thus cause the following panic.

PANIC: early exception 0xe3 IP 10:73587c22 error 0 cr2 0x0
[0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 4.18-rc8+ #2
[0.00] RIP: 0010:kstrtoull+0x2/0x10
...
[0.00] Call Trace
[0.00]  ? set_corruption_check+0x21/0x49
[0.00]  ? do_early_param+0x4d/0x82
[0.00]  ? parse_args+0x212/0x330
[0.00]  ? rdinit_setup+0x26/0x26
[0.00]  ? parse_early_options+0x20/0x23
[0.00]  ? rdinit_setup+0x26/0x26
[0.00]  ? parse_early_param+0x2d/0x39
[0.00]  ? setup_arch+0x2f7/0xbf4
[0.00]  ? start_kernel+0x5e/0x4c2
[0.00]  ? load_ucode_bsp+0x113/0x12f
[0.00]  ? secondary_startup_64+0xa5/0xb0

This patch adds checks to prevent the panic.

Signed-off-by: He Zhe 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: gre...@linuxfoundation.org
Cc: kstew...@linuxfoundation.org
Cc: pombreda...@nexb.com
Cc: sta...@vger.kernel.org
Link: 
http://lkml.kernel.org/r/1534260823-87917-1-git-send-email-zhe...@windriver.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/check.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/kernel/check.c b/arch/x86/kernel/check.c
index 33399426793e..cc8258a5378b 100644
--- a/arch/x86/kernel/check.c
+++ b/arch/x86/kernel/check.c
@@ -31,6 +31,11 @@ static __init int set_corruption_check(char *arg)
ssize_t ret;
unsigned long val;
 
+   if (!arg) {
+   pr_err("memory_corruption_check config string not provided\n");
+   return -EINVAL;
+   }
+
ret = kstrtoul(arg, 10, );
if (ret)
return ret;
@@ -45,6 +50,11 @@ static __init int set_corruption_check_period(char *arg)
ssize_t ret;
unsigned long val;
 
+   if (!arg) {
+   pr_err("memory_corruption_check_period config string not 
provided\n");
+   return -EINVAL;
+   }
+
ret = kstrtoul(arg, 10, );
if (ret)
return ret;
@@ -59,6 +69,11 @@ static __init int set_corruption_check_size(char *arg)
char *end;
unsigned size;
 
+   if (!arg) {
+   pr_err("memory_corruption_check_size config string not 
provided\n");
+   return -EINVAL;
+   }
+
size = memparse(arg, );
 
if (*end == '\0')


[tip:x86/boot] x86/corruption-check: Fix panic in memory_corruption_check() when boot option without value is provided

2018-09-11 Thread tip-bot for He Zhe
Commit-ID:  ccde460b9ae5c2bd5e4742af0a7f623c2daad566
Gitweb: https://git.kernel.org/tip/ccde460b9ae5c2bd5e4742af0a7f623c2daad566
Author: He Zhe 
AuthorDate: Tue, 14 Aug 2018 23:33:42 +0800
Committer:  Ingo Molnar 
CommitDate: Mon, 10 Sep 2018 14:47:32 +0200

x86/corruption-check: Fix panic in memory_corruption_check() when boot option 
without value is provided

memory_corruption_check[{_period|_size}]()'s handlers do not check input
argument before passing it to kstrtoul() or simple_strtoull(). The argument
would be a NULL pointer if each of the kernel parameters, without its
value, is set in command line and thus cause the following panic.

PANIC: early exception 0xe3 IP 10:73587c22 error 0 cr2 0x0
[0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 4.18-rc8+ #2
[0.00] RIP: 0010:kstrtoull+0x2/0x10
...
[0.00] Call Trace
[0.00]  ? set_corruption_check+0x21/0x49
[0.00]  ? do_early_param+0x4d/0x82
[0.00]  ? parse_args+0x212/0x330
[0.00]  ? rdinit_setup+0x26/0x26
[0.00]  ? parse_early_options+0x20/0x23
[0.00]  ? rdinit_setup+0x26/0x26
[0.00]  ? parse_early_param+0x2d/0x39
[0.00]  ? setup_arch+0x2f7/0xbf4
[0.00]  ? start_kernel+0x5e/0x4c2
[0.00]  ? load_ucode_bsp+0x113/0x12f
[0.00]  ? secondary_startup_64+0xa5/0xb0

This patch adds checks to prevent the panic.

Signed-off-by: He Zhe 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: gre...@linuxfoundation.org
Cc: kstew...@linuxfoundation.org
Cc: pombreda...@nexb.com
Cc: sta...@vger.kernel.org
Link: 
http://lkml.kernel.org/r/1534260823-87917-1-git-send-email-zhe...@windriver.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/check.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/kernel/check.c b/arch/x86/kernel/check.c
index 33399426793e..cc8258a5378b 100644
--- a/arch/x86/kernel/check.c
+++ b/arch/x86/kernel/check.c
@@ -31,6 +31,11 @@ static __init int set_corruption_check(char *arg)
ssize_t ret;
unsigned long val;
 
+   if (!arg) {
+   pr_err("memory_corruption_check config string not provided\n");
+   return -EINVAL;
+   }
+
ret = kstrtoul(arg, 10, );
if (ret)
return ret;
@@ -45,6 +50,11 @@ static __init int set_corruption_check_period(char *arg)
ssize_t ret;
unsigned long val;
 
+   if (!arg) {
+   pr_err("memory_corruption_check_period config string not 
provided\n");
+   return -EINVAL;
+   }
+
ret = kstrtoul(arg, 10, );
if (ret)
return ret;
@@ -59,6 +69,11 @@ static __init int set_corruption_check_size(char *arg)
char *end;
unsigned size;
 
+   if (!arg) {
+   pr_err("memory_corruption_check_size config string not 
provided\n");
+   return -EINVAL;
+   }
+
size = memparse(arg, );
 
if (*end == '\0')


Re: [PATCH v2 1/2] x86: corruption-check: Passing memory_corruption_check to command line causes panic

2018-08-20 Thread He Zhe
Could you please provide your input? Thanks.

Zhe

On 2018年08月14日 23:33, zhe...@windriver.com wrote:
> From: He Zhe 
>
> memory_corruption_check[{_period|_size}]'s handlers do not check input
> argument before passing it to kstrtoul or simple_strtoull. The argument
> would be a NULL pointer if each of the kernel parameters, without its
> value, is set in command line and thus cause the following panic.
>
> PANIC: early exception 0xe3 IP 10:73587c22 error 0 cr2 0x0
> [0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 4.18-rc8+ #2
> [0.00] RIP: 0010:kstrtoull+0x2/0x10
> ...
> [0.00] Call Trace
> [0.00]  ? set_corruption_check+0x21/0x49
> [0.00]  ? do_early_param+0x4d/0x82
> [0.00]  ? parse_args+0x212/0x330
> [0.00]  ? rdinit_setup+0x26/0x26
> [0.00]  ? parse_early_options+0x20/0x23
> [0.00]  ? rdinit_setup+0x26/0x26
> [0.00]  ? parse_early_param+0x2d/0x39
> [0.00]  ? setup_arch+0x2f7/0xbf4
> [0.00]  ? start_kernel+0x5e/0x4c2
> [0.00]  ? load_ucode_bsp+0x113/0x12f
> [0.00]  ? secondary_startup_64+0xa5/0xb0
>
> This patch adds checks to prevent the panic.
>
> Cc: sta...@vger.kernel.org
> Signed-off-by: He Zhe 
> ---
> v2:
> - Split out printk cleanups
> - Add cc to sta...@vger.kernel.org
> - Use more meaningful error message
>
>  arch/x86/kernel/check.c | 15 +++
>  1 file changed, 15 insertions(+)
>
> diff --git a/arch/x86/kernel/check.c b/arch/x86/kernel/check.c
> index 3339942..cc8258a 100644
> --- a/arch/x86/kernel/check.c
> +++ b/arch/x86/kernel/check.c
> @@ -31,6 +31,11 @@ static __init int set_corruption_check(char *arg)
>   ssize_t ret;
>   unsigned long val;
>  
> + if (!arg) {
> + pr_err("memory_corruption_check config string not provided\n");
> + return -EINVAL;
> + }
> +
>   ret = kstrtoul(arg, 10, );
>   if (ret)
>   return ret;
> @@ -45,6 +50,11 @@ static __init int set_corruption_check_period(char *arg)
>   ssize_t ret;
>   unsigned long val;
>  
> + if (!arg) {
> + pr_err("memory_corruption_check_period config string not 
> provided\n");
> + return -EINVAL;
> + }
> +
>   ret = kstrtoul(arg, 10, );
>   if (ret)
>   return ret;
> @@ -59,6 +69,11 @@ static __init int set_corruption_check_size(char *arg)
>   char *end;
>   unsigned size;
>  
> + if (!arg) {
> + pr_err("memory_corruption_check_size config string not 
> provided\n");
> + return -EINVAL;
> + }
> +
>   size = memparse(arg, );
>  
>   if (*end == '\0')



Re: [PATCH v2 1/2] x86: corruption-check: Passing memory_corruption_check to command line causes panic

2018-08-20 Thread He Zhe
Could you please provide your input? Thanks.

Zhe

On 2018年08月14日 23:33, zhe...@windriver.com wrote:
> From: He Zhe 
>
> memory_corruption_check[{_period|_size}]'s handlers do not check input
> argument before passing it to kstrtoul or simple_strtoull. The argument
> would be a NULL pointer if each of the kernel parameters, without its
> value, is set in command line and thus cause the following panic.
>
> PANIC: early exception 0xe3 IP 10:73587c22 error 0 cr2 0x0
> [0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 4.18-rc8+ #2
> [0.00] RIP: 0010:kstrtoull+0x2/0x10
> ...
> [0.00] Call Trace
> [0.00]  ? set_corruption_check+0x21/0x49
> [0.00]  ? do_early_param+0x4d/0x82
> [0.00]  ? parse_args+0x212/0x330
> [0.00]  ? rdinit_setup+0x26/0x26
> [0.00]  ? parse_early_options+0x20/0x23
> [0.00]  ? rdinit_setup+0x26/0x26
> [0.00]  ? parse_early_param+0x2d/0x39
> [0.00]  ? setup_arch+0x2f7/0xbf4
> [0.00]  ? start_kernel+0x5e/0x4c2
> [0.00]  ? load_ucode_bsp+0x113/0x12f
> [0.00]  ? secondary_startup_64+0xa5/0xb0
>
> This patch adds checks to prevent the panic.
>
> Cc: sta...@vger.kernel.org
> Signed-off-by: He Zhe 
> ---
> v2:
> - Split out printk cleanups
> - Add cc to sta...@vger.kernel.org
> - Use more meaningful error message
>
>  arch/x86/kernel/check.c | 15 +++
>  1 file changed, 15 insertions(+)
>
> diff --git a/arch/x86/kernel/check.c b/arch/x86/kernel/check.c
> index 3339942..cc8258a 100644
> --- a/arch/x86/kernel/check.c
> +++ b/arch/x86/kernel/check.c
> @@ -31,6 +31,11 @@ static __init int set_corruption_check(char *arg)
>   ssize_t ret;
>   unsigned long val;
>  
> + if (!arg) {
> + pr_err("memory_corruption_check config string not provided\n");
> + return -EINVAL;
> + }
> +
>   ret = kstrtoul(arg, 10, );
>   if (ret)
>   return ret;
> @@ -45,6 +50,11 @@ static __init int set_corruption_check_period(char *arg)
>   ssize_t ret;
>   unsigned long val;
>  
> + if (!arg) {
> + pr_err("memory_corruption_check_period config string not 
> provided\n");
> + return -EINVAL;
> + }
> +
>   ret = kstrtoul(arg, 10, );
>   if (ret)
>   return ret;
> @@ -59,6 +69,11 @@ static __init int set_corruption_check_size(char *arg)
>   char *end;
>   unsigned size;
>  
> + if (!arg) {
> + pr_err("memory_corruption_check_size config string not 
> provided\n");
> + return -EINVAL;
> + }
> +
>   size = memparse(arg, );
>  
>   if (*end == '\0')



[PATCH] mips: Fix build error by disabling attribute-alias warning

2018-06-15 Thread He Zhe
This patch fixes the following error caused by building arch/mips with
GCC 8.1.0.

In file included from arch/mips/kernel/signal32.c:15:
include/linux/syscalls.h:233:18: error: 'sys_32_sigaction' alias between 
functions of incompatible types 'long int(long int,  const struct 
compat_sigaction *, struct compat_sigaction *)' and 'long int(long int,  long 
int,  long int)' [-Werror=attribute-alias]
  asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \
  ^~~

Signed-off-by: He Zhe 
---
 arch/mips/kernel/signal32.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/mips/kernel/signal32.c b/arch/mips/kernel/signal32.c
index c4db910a8794..95cb406e220d 100644
--- a/arch/mips/kernel/signal32.c
+++ b/arch/mips/kernel/signal32.c
@@ -35,6 +35,9 @@ asmlinkage int sys32_sigsuspend(compat_sigset_t __user *uset)
return compat_sys_rt_sigsuspend(uset, sizeof(compat_sigset_t));
 }
 
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wpragmas"
+#pragma GCC diagnostic ignored "-Wattribute-alias"
 SYSCALL_DEFINE3(32_sigaction, long, sig, const struct compat_sigaction __user 
*, act,
struct compat_sigaction __user *, oact)
 {
@@ -76,3 +79,4 @@ SYSCALL_DEFINE3(32_sigaction, long, sig, const struct 
compat_sigaction __user *,
 
return ret;
 }
+#pragma GCC diagnostic pop
-- 
2.11.0



[PATCH] mips: Fix build error by disabling attribute-alias warning

2018-06-15 Thread He Zhe
This patch fixes the following error caused by building arch/mips with
GCC 8.1.0.

In file included from arch/mips/kernel/signal32.c:15:
include/linux/syscalls.h:233:18: error: 'sys_32_sigaction' alias between 
functions of incompatible types 'long int(long int,  const struct 
compat_sigaction *, struct compat_sigaction *)' and 'long int(long int,  long 
int,  long int)' [-Werror=attribute-alias]
  asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \
  ^~~

Signed-off-by: He Zhe 
---
 arch/mips/kernel/signal32.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/mips/kernel/signal32.c b/arch/mips/kernel/signal32.c
index c4db910a8794..95cb406e220d 100644
--- a/arch/mips/kernel/signal32.c
+++ b/arch/mips/kernel/signal32.c
@@ -35,6 +35,9 @@ asmlinkage int sys32_sigsuspend(compat_sigset_t __user *uset)
return compat_sys_rt_sigsuspend(uset, sizeof(compat_sigset_t));
 }
 
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wpragmas"
+#pragma GCC diagnostic ignored "-Wattribute-alias"
 SYSCALL_DEFINE3(32_sigaction, long, sig, const struct compat_sigaction __user 
*, act,
struct compat_sigaction __user *, oact)
 {
@@ -76,3 +79,4 @@ SYSCALL_DEFINE3(32_sigaction, long, sig, const struct 
compat_sigaction __user *,
 
return ret;
 }
+#pragma GCC diagnostic pop
-- 
2.11.0