Re: [v2 1/1] i2c: dev: prevent ZERO_SIZE_PTR deref in i2cdev_ioctl_rdwr()

2018-04-19 Thread Uwe Kleine-König
Hello,

On Thu, Apr 19, 2018 at 08:01:46PM +0300, Alexander Popov wrote:
> On 19.04.2018 16:49, Uwe Kleine-König wrote:
> >> @@ -280,6 +280,7 @@ static noinline int i2cdev_ioctl_rdwr(struct i2c_client *client,
> >> */
> >>if (msgs[i].flags & I2C_M_RECV_LEN) {
> >>if (!(msgs[i].flags & I2C_M_RD) ||
> >> +  !msgs[i].len ||
> > 
> > I'd prefer
> > 
> > msgs[i].len > 0
> 
> Excuse me, it will be wrong. We stop if len is 0 to avoid the following
> ZERO_SIZE_PTR dereference.

right you are. I missed the negation.
 
> > here instead of
> > 
> > !msgs[i].len
> 
> I can change it to "msgs[i].len == 0". But is it really important?
> 
> I've carefully tested the current version with the original repro.
> It works correctly.

I don't doubt it, and the code generated is maybe even the same. The
point I wanted to make is that

!len

is harder to read for a human than

len < 1

(or another suitable arithmetic expression). But feel free to disagree
and keep the code as is.
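For reference, the condition being debated can be modeled in plain C. This is a simplified userspace sketch, not the kernel code: the flag values match uapi/linux/i2c.h, but msg_len_valid() is a hypothetical helper that captures only the zero-length check under discussion (the real i2cdev_ioctl_rdwr() also bounds-checks against the SMBus block maximum).

```c
#include <assert.h>
#include <stdint.h>

#define I2C_M_RD       0x0001  /* flag values as in uapi/linux/i2c.h */
#define I2C_M_RECV_LEN 0x0400

/* Hypothetical model of the check: a zero-length I2C_M_RECV_LEN
 * message must be rejected, because the kernel's zero-byte buffer
 * allocation yields ZERO_SIZE_PTR and the later read of buf[0]
 * would dereference it. Whether the condition reads better as
 * "!len", "len == 0", or "len < 1" is the style question above. */
static int msg_len_valid(uint16_t flags, uint16_t len)
{
	if (flags & I2C_M_RECV_LEN) {
		if (!(flags & I2C_M_RD) || !len)
			return 0;	/* invalid: kernel returns -EINVAL */
	}
	return 1;
}
```

All three spellings compile to the same check; the patch's behavior, not its spelling, is what closes the hole.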

Best regards
Uwe

-- 
Pengutronix e.K.   | Uwe Kleine-König|
Industrial Linux Solutions | http://www.pengutronix.de/  |


Re: [RESEND PATCH 1/1] drm/i915/glk: Add MODULE_FIRMWARE for Geminilake

2018-04-19 Thread Ian W MORRISON
On 18 April 2018 at 00:14, Joonas Lahtinen
 wrote:
> Quoting Jani Nikula (2018-04-17 12:02:52)
>> On Mon, 16 Apr 2018, "Srivatsa, Anusha"  wrote:
>> >>-Original Message-
>> >>From: Jani Nikula [mailto:jani.nik...@linux.intel.com]
>> >>Sent: Wednesday, April 11, 2018 5:27 AM
>> >>To: Ian W MORRISON 
>> >>Cc: Vivi, Rodrigo ; Srivatsa, Anusha
>> >>; Wajdeczko, Michal
>> >>; Greg KH ;
>> >>airl...@linux.ie; joonas.lahti...@linux.intel.com; 
>> >>linux-kernel@vger.kernel.org;
>> >>sta...@vger.kernel.org; intel-...@lists.freedesktop.org; dri-
>> >>de...@lists.freedesktop.org
>> >>Subject: Re: [RESEND PATCH 1/1] drm/i915/glk: Add MODULE_FIRMWARE for
>> >>Geminilake



In summary so far:

Jani:
> NAK on indiscriminate Cc: stable. There are zero guarantees that
> older kernels will work with whatever firmware you throw at them.
> Who tested the firmware with v4.12 and later? We only have the CI
> results against *current* drm-tip. We don't even know about v4.16.
> I'm not going to ack and take responsibility for the stable backports
> unless someone actually comes forward with credible Tested-bys.

Anusha:
> The stable kernel version is 4.12 and beyond.
> It is appropriate to add the CC: stable in my opinion

Joonas:
> And even then, some distros will be surprised of the new MODULE_FIRMWARE
> and will need to update the linux-firmware package, too.

I've performed backport testing and some additional analysis as follows:

The DMC firmware for GLK was initially included in 4.11
  (commit: dbb28b5c3d3cb945a63030fab8d3894cf335ce19).
Then the firmware version was upgraded to 1.03 in 4.12
  (commit: f4a791819ed00a749a90387aa139706a507aa690).
However MODULE_FIRMWARE for the GLK DMC firmware
was also removed in 4.12
  (commit: d9321a03efcda867b3a8c6327e01808516f0acd7)
together with the firmware version being bumped to 1.04
  (commit: aebfd1d37194e00d4c417e7be97efeb736cd9c04).

The patch below effectively reverts commit d9321a03 because the GLK
firmware is now available in the linux-firmware repository.

To test stable backports I've used Ubuntu 18.04 (Beta 2) userspace with
both Ubuntu (generic) and self-compiled mainline (patched) kernels.
The conclusion was that the patch works across 4.12 to 4.17-rc1 kernels,
additionally displaying a 'Possible missing firmware' message when
installing a kernel with the expected firmware missing.

The following are abridged backport test results:

Scenario: No DMC (glk_dmc_ver1_04.bin) firmware installed in '/lib/firmware/i915'
  Test: Kernel installation ('grep -i dmc' output from 'apt install'):
    4.12-generic and 4.15-generic:
      No output # as expected
    4.12 to 4.17-rc1-patched:
      W: Possible missing firmware /lib/firmware/i915/glk_dmc_ver1_04.bin for module i915
  Result: The effect of the patch is to add a 'Possible missing firmware' message.
  Test: Booting ('grep -i dmc' output from 'dmesg'):
    4.12-generic:
      No output # as expected
    4.15-generic:
      i915 0000:00:02.0: Direct firmware load for i915/glk_dmc_ver1_04.bin failed with error -2
      i915 0000:00:02.0: Failed to load DMC firmware i915/glk_dmc_ver1_04.bin. Disabling runtime power management.
      i915 0000:00:02.0: DMC firmware homepage: https://01.org/linuxgraphics/downloads/firmware
    4.12-patched:
      No output # as expected
    4.13 to 4.14-patched:
      i915 0000:00:02.0: Direct firmware load for i915/glk_dmc_ver1_04.bin failed with error -2
      i915 0000:00:02.0: Failed to load DMC firmware [https://01.org/linuxgraphics/downloads/firmware], disabling runtime power management.
    4.15 to 4.17-rc1-patched:
      i915 0000:00:02.0: Direct firmware load for i915/glk_dmc_ver1_04.bin failed with error -2
      i915 0000:00:02.0: Failed to load DMC firmware i915/glk_dmc_ver1_04.bin. Disabling runtime power management.
      i915 0000:00:02.0: DMC firmware homepage: https://01.org/linuxgraphics/downloads/firmware
  Result: The effect of the patch does not change existing (non-patched kernel) messages.

Scenario: DMC (glk_dmc_ver1_04.bin) firmware installed in '/lib/firmware/i915'
  Test: Kernel installation ('grep -i dmc' output from 'apt install'):
    All kernels:
      No messages # as expected
  Result: The effect of the patch does not change existing messages.
  Test: Booting ('grep -i dmc' output from 'dmesg'):
    4.12-generic:
      No output # as expected
    4.15-generic:
      i915 0000:00:02.0: Direct firmware load for i915/glk_dmc_ver1_04.bin failed with error -2
      i915 0000:00:02.0: Failed to load DMC firmware i915/glk_dmc_ver1_04.bin. Disabling runtime power management.
      i915 0000:00:02.0: DMC firmware homepage: https://01.org/linuxgraphics/downloads/firmware
    4.12-patched:
      No output # as expected
    4.13 to 4.17-rc1-patched:
      [drm] Finished loading DMC firmware i915/glk_dmc_ver1_04.bin (v1.4)
  Result: The effect of the patch is to remove the 'Failed to load' message.

Regards,
Ian

[PATCH] perf: update to new syscall stub naming convention

2018-04-19 Thread Dominik Brodowski
For v4.17-rc1, the naming of syscall stubs changed. Update the
perf scripts/utils/tests which need to be aware of the syscall
stub naming accordingly.

Signed-off-by: Dominik Brodowski 

diff --git a/tools/perf/arch/powerpc/util/sym-handling.c b/tools/perf/arch/powerpc/util/sym-handling.c
index 53d83d7e6a09..9a970e334cea 100644
--- a/tools/perf/arch/powerpc/util/sym-handling.c
+++ b/tools/perf/arch/powerpc/util/sym-handling.c
@@ -32,10 +32,10 @@ int arch__choose_best_symbol(struct symbol *syma,
if (*sym == '.')
sym++;
 
-   /* Avoid "SyS" kernel syscall aliases */
-   if (strlen(sym) >= 3 && !strncmp(sym, "SyS", 3))
+   /* Avoid "__se_sys" kernel syscall aliases */
+   if (strlen(sym) >= 8 && !strncmp(sym,  "__se_sys", 8))
return SYMBOL_B;
-   if (strlen(sym) >= 10 && !strncmp(sym, "compat_SyS", 10))
+   if (strlen(sym) >= 15 && !strncmp(sym, "__se_compat_sys", 15))
return SYMBOL_B;
 
return SYMBOL_A;
diff --git a/tools/perf/tests/bpf-script-example.c b/tools/perf/tests/bpf-script-example.c
index e4123c1b0e88..5839baa3d766 100644
--- a/tools/perf/tests/bpf-script-example.c
+++ b/tools/perf/tests/bpf-script-example.c
@@ -31,8 +31,8 @@ struct bpf_map_def SEC("maps") flip_table = {
.max_entries = 1,
 };
 
-SEC("func=SyS_epoll_pwait")
-int bpf_func__SyS_epoll_pwait(void *ctx)
+SEC("func=__se_sys_epoll_pwait")
+int bpf_func__se_sys_epoll_pwait(void *ctx)
 {
int ind =0;
int *flag = bpf_map_lookup_elem(&flip_table, &ind);
diff --git a/tools/perf/util/c++/clang-test.cpp b/tools/perf/util/c++/clang-test.cpp
index 7b042a5ebc68..67a39ac8626d 100644
--- a/tools/perf/util/c++/clang-test.cpp
+++ b/tools/perf/util/c++/clang-test.cpp
@@ -41,7 +41,7 @@ int test__clang_to_IR(void)
if (!M)
return -1;
for (llvm::Function& F : *M)
-   if (F.getName() == "bpf_func__SyS_epoll_pwait")
+   if (F.getName() == "bpf_func__se_sys_epoll_pwait")
return 0;
return -1;
 }
diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index 62b2dd2253eb..32e156992dfc 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -113,10 +113,11 @@ int __weak arch__compare_symbol_names_n(const char *namea, const char *nameb,
 int __weak arch__choose_best_symbol(struct symbol *syma,
struct symbol *symb __maybe_unused)
 {
-   /* Avoid "SyS" kernel syscall aliases */
-   if (strlen(syma->name) >= 3 && !strncmp(syma->name, "SyS", 3))
+   /* Avoid "__se_sys" kernel syscall aliases */
+   if (strlen(syma->name) >= 8 && !strncmp(syma->name,  "__se_sys", 8))
return SYMBOL_B;
-   if (strlen(syma->name) >= 10 && !strncmp(syma->name, "compat_SyS", 10))
+   if (strlen(syma->name) >= 15 &&
+   !strncmp(syma->name, "__se_compat_sys", 15))
return SYMBOL_B;
 
return SYMBOL_A;
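The filter above can be modeled in a few lines. This is a simplified sketch, not perf's exact code: is_syscall_stub() and choose_best_symbol() are illustrative names, and real perf compares two symbol structs rather than bare strings, but the prefix test against the post-v4.17-rc1 stub names is the same.

```c
#include <assert.h>
#include <string.h>

enum which { SYMBOL_A, SYMBOL_B };

/* A symbol is a syscall wrapper stub if it carries one of the
 * sign-extension stub prefixes introduced for v4.17-rc1. */
static int is_syscall_stub(const char *sym)
{
	return !strncmp(sym, "__se_sys", 8) ||
	       !strncmp(sym, "__se_compat_sys", 15);
}

/* Prefer the non-stub name, mirroring arch__choose_best_symbol():
 * return SYMBOL_B (the other candidate) when syma is a stub alias. */
static enum which choose_best_symbol(const char *syma, const char *symb)
{
	(void)symb; /* symb is only the fallback choice here */
	return is_syscall_stub(syma) ? SYMBOL_B : SYMBOL_A;
}
```

Note that strncmp() stops at the terminating NUL, so it is safe even on names shorter than the prefix; the strlen() guard in the patch keeps the original code's shape.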


[PATCH] Documentation: updates for new syscall stub naming convention

2018-04-19 Thread Dominik Brodowski
For v4.17-rc1, the naming of syscall stubs changed. Update stack
traces and similar instances in the documentation to avoid sources
of confusion.

Signed-off-by: Dominik Brodowski 

diff --git a/Documentation/admin-guide/bug-hunting.rst b/Documentation/admin-guide/bug-hunting.rst
index f278b289e260..cebff8e5c59f 100644
--- a/Documentation/admin-guide/bug-hunting.rst
+++ b/Documentation/admin-guide/bug-hunting.rst
@@ -30,7 +30,7 @@ Kernel bug reports often come with a stack dump like the one below::
 [] ? driver_detach+0x87/0x90
 [] ? bus_remove_driver+0x38/0x90
 [] ? usb_deregister+0x58/0xb0
-[] ? SyS_delete_module+0x130/0x1f0
+[] ? __se_sys_delete_module+0x130/0x1f0
 [] ? task_work_run+0x64/0x80
 [] ? exit_to_usermode_loop+0x85/0x90
 [] ? do_fast_syscall_32+0x80/0x130
diff --git a/Documentation/dev-tools/kasan.rst b/Documentation/dev-tools/kasan.rst
index f7a18f274357..0fe231401ae9 100644
--- a/Documentation/dev-tools/kasan.rst
+++ b/Documentation/dev-tools/kasan.rst
@@ -60,7 +60,7 @@ A typical out of bounds access report looks like this::
  init_module+0x9/0x47 [test_kasan]
  do_one_initcall+0x99/0x200
  load_module+0x2cb3/0x3b20
- SyS_finit_module+0x76/0x80
+ __se_sys_finit_module+0x76/0x80
  system_call_fastpath+0x12/0x17
 INFO: Slab 0xea0001a4ef00 objects=17 used=7 fp=0x8800693bd728 flags=0x1004080
 INFO: Object 0x8800693bc558 @offset=1368 fp=0x8800693bc720
@@ -101,7 +101,7 @@ A typical out of bounds access report looks like this::
  [] ? __vunmap+0xec/0x160
  [] load_module+0x2cb3/0x3b20
  [] ? m_show+0x240/0x240
- [] SyS_finit_module+0x76/0x80
+ [] __se_sys_finit_module+0x76/0x80
  [] system_call_fastpath+0x12/0x17
 Memory state around the buggy address:
  8800693bc300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
diff --git a/Documentation/dev-tools/kcov.rst b/Documentation/dev-tools/kcov.rst
index c2f6452e38ed..df3f4016137a 100644
--- a/Documentation/dev-tools/kcov.rst
+++ b/Documentation/dev-tools/kcov.rst
@@ -103,7 +103,7 @@ program using kcov:
 
 After piping through addr2line output of the program looks as follows::
 
-SyS_read
+__se_sys_read
 fs/read_write.c:562
 __fdget_pos
 fs/file.c:774
@@ -115,7 +115,7 @@ After piping through addr2line output of the program looks as follows::
 fs/file.c:760
 __fdget_pos
 fs/file.c:784
-SyS_read
+__se_sys_read
 fs/read_write.c:562
 
 If a program needs to collect coverage from several threads (independently),
diff --git a/Documentation/locking/lockstat.txt b/Documentation/locking/lockstat.txt
index 5786ad2cd5e6..346a67e72671 100644
--- a/Documentation/locking/lockstat.txt
+++ b/Documentation/locking/lockstat.txt
@@ -96,7 +96,7 @@ Look at the current lock statistics:
 12   &mm->mmap_sem 17  [] vm_munmap+0x41/0x80
 13 ---
 14   &mm->mmap_sem  1  [] dup_mmap+0x2a/0x3f0
-15   &mm->mmap_sem 60  [] SyS_mprotect+0xe9/0x250
+15   &mm->mmap_sem 60  [] __se_sys_mprotect+0xe9/0x250
 16   &mm->mmap_sem 41  [] __do_page_fault+0x1d4/0x510
 17   &mm->mmap_sem 68  [] vm_mmap_pgoff+0x87/0xd0
 18
diff --git a/Documentation/trace/histogram.txt b/Documentation/trace/histogram.txt
index 6e05510afc28..f36784deae99 100644
--- a/Documentation/trace/histogram.txt
+++ b/Documentation/trace/histogram.txt
@@ -598,7 +598,7 @@
  apparmor_cred_prepare+0x1f/0x50
  security_prepare_creds+0x16/0x20
  prepare_creds+0xdf/0x1a0
- SyS_capset+0xb5/0x200
+ __se_sys_capset+0xb5/0x200
  system_call_fastpath+0x12/0x6a
 } hitcount:  1  bytes_req: 32  bytes_alloc: 32
 .
@@ -609,7 +609,7 @@
  i915_gem_execbuffer2+0x6c/0x2c0 [i915]
  drm_ioctl+0x349/0x670 [drm]
  do_vfs_ioctl+0x2f0/0x4f0
- SyS_ioctl+0x81/0xa0
+ __se_sys_ioctl+0x81/0xa0
  system_call_fastpath+0x12/0x6a
 } hitcount:  17726  bytes_req:   13944120  bytes_alloc:   19593808
 { stacktrace:
@@ -618,7 +618,7 @@
  load_elf_binary+0x102/0x1650
  search_binary_handler+0x97/0x1d0
  do_execveat_common.isra.34+0x551/0x6e0
- SyS_execve+0x3a/0x50
+ __se_sys_execve+0x3a/0x50
  return_from_execve+0x0/0x23
 } hitcount:  33348  bytes_req:   17152128  bytes_alloc:   20226048
 { stacktrace:
@@ -629,7 +629,7 @@
  path_openat+0x31/0x5f0
  do_filp_open+0x3a/0x90
  do_sys_open+0x128/0x220
- SyS_open+0x1e/0x20
+ __se_sys_open+0x1e/0x20
  system_call_fastpath+0x12/0x6a
} hitcount:4766422  bytes_req:9532844  bytes_alloc:

Re: [RFC/RFT patch 0/7] timekeeping: Unify clock MONOTONIC and clock BOOTTIME

2018-04-19 Thread Sergey Senozhatsky
On (04/20/18 06:37), David Herrmann wrote:
>
> I get lots of timer-errors on Arch-Linux booting current master, after
> a suspend/resume cycle. Just a selection of errors I see on resume:

Hello David,
Any chance you can revert the patches in question and test? I'm running
ARCH (4.17.0-rc1-dbg-00042-gaa03ddd9c434) and suspend/resume cycle does
not trigger any errors. Except for this one

kernel: do_IRQ: 0.55 No irq handler for vector

> systemd[1]: systemd-journald.service: Main process exited,
> code=dumped, status=6/ABRT
> rtkit-daemon[742]: The canary thread is apparently starving. Taking action.
> systemd[1]: systemd-udevd.service: Watchdog timeout (limit 3min)!
> systemd[1]: systemd-journald.service: Watchdog timeout (limit 3min)!
> kernel: e1000e 0000:00:1f.6: Failed to restore TIMINCA clock rate delta: -22
> 
> Lots of crashes with SIGABRT due to these.
> 
> I did not bisect it, but it sounds related to me. Also, user-space
> uses CLOCK_MONOTONIC for watchdog timers. That is, a process is
> required to respond to a watchdog-request in a given MONOTONIC
> time-frame. If this jumps during suspend/resume, watchdogs will fire
> immediately. I don't see how this can work with the new MONOTONIC
> behavior?

-ss



Re: [PATCH V1 4/4] qcom: spmi-wled: Add auto-calibration logic support

2018-04-19 Thread kgunda

On 2018-04-19 21:28, Bjorn Andersson wrote:

On Thu 19 Apr 03:45 PDT 2018, kgu...@codeaurora.org wrote:



On 2017-12-05 11:10, Bjorn Andersson wrote:
> On Thu 16 Nov 04:18 PST 2017, Kiran Gunda wrote:
>
> > The auto-calibration algorithm checks if the current WLED sink
> > configuration is valid. It tries enabling every sink and checks
> > if the OVP fault is observed. Based on this information it
> > detects and enables the valid sink configuration. Auto calibration
> > will be triggered when the OVP fault interrupts are seen frequently
> > thereby it tries to fix the sink configuration.
> >
>
> So it's not auto "calibration" it's auto "detection" of strings?
>
Hi Bjorn,
Sorry for late response. Please find my answers.



No worries, happy to hear back from you!


Thanks!

Correct. This is the auto detection, This is the name given by the
HW/systems team.


I think the name should be considered a "hardware bug", that we can work
around in software (give it a useful name and document what the original
name was).

I don't think this is a "hardware bug"; rather, we can say the HW doesn't
support it. Hence, we are implementing it as a SW feature to detect the
strings present on the display panel, if the user fails to give the
correct strings. As you suggested, I will rename this to "auto detection"
instead of "auto calibration".
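The strategy described in the commit message ("tries enabling every sink and checks if the OVP fault is observed") can be sketched as a small loop. This is a toy model, not the driver code: detect_sink_config(), probe_fn, and mock_probe() are all hypothetical names standing in for "write the sink-enable register, then read the OVP fault status bit".

```c
#include <assert.h>
#include <stdint.h>

/* probe(mask) enables the sinks in `mask` and returns nonzero if
 * the OVP fault fires; a real driver would touch registers here. */
typedef int (*probe_fn)(uint8_t sink_mask);

/* Try each candidate sink configuration in order and return the
 * first one that does not trip OVP, or -1 if none is valid. */
static int detect_sink_config(const uint8_t *candidates, int n,
			      probe_fn probe)
{
	int i;

	for (i = 0; i < n; i++)
		if (probe(candidates[i]) == 0)	/* 0: no OVP fault */
			return candidates[i];
	return -1;
}

/* Mock hardware: pretend only strings 0 and 2 (mask 0x05) are wired
 * up, so enabling any absent string raises the fault. */
static int mock_probe(uint8_t mask)
{
	return (mask & ~0x05) ? 1 : 0;
}
```

The real logic is triggered from the OVP interrupt path when faults repeat, but the detection step itself reduces to this probe-and-check loop.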


> When is this feature needed?
>
This feature is needed if the string configuration is given wrong in
the DT node by the user.


DT describes the hardware and for all other nodes it must do so
accurately.

But the user may not be aware of the strings present on the display panel,
or may be using the same software on different devices which have
different strings present.

For cases where the hardware supports auto detection of functionality we
remove information from DT and rely on that logic to figure out the
hardware. We do not use it to reconfigure the hardware once we detect an
error. So when auto-detection is enabled it should always be used to
probe the hardware.

The auto string detection is not supported in any qcom hardware, and I
don't think there is a plan to introduce it in new hardware either.


Regards,
Bjorn


> > Signed-off-by: Kiran Gunda 
> > ---
> >  .../bindings/leds/backlight/qcom-spmi-wled.txt |   5 +
> >  drivers/video/backlight/qcom-spmi-wled.c   | 304
> > -
> >  2 files changed, 306 insertions(+), 3 deletions(-)
> >
> > diff --git
> > a/Documentation/devicetree/bindings/leds/backlight/qcom-spmi-wled.txt
> > b/Documentation/devicetree/bindings/leds/backlight/qcom-spmi-wled.txt
> > index d39ee93..f06c0cd 100644
> > ---
> > a/Documentation/devicetree/bindings/leds/backlight/qcom-spmi-wled.txt
> > +++
> > b/Documentation/devicetree/bindings/leds/backlight/qcom-spmi-wled.txt
> > @@ -94,6 +94,11 @@ The PMIC is connected to the host processor via
> > SPMI bus.
> >   Definition: Interrupt names associated with the interrupts.
> >   Currently supported interrupts are "sc-irq" and "ovp-irq".
> >
> > +- qcom,auto-calibration
>
> qcom,auto-string-detect?
>
ok. Will address in the next patch.
> > + Usage:  optional
> > + Value type: 
> > + Definition: Enables auto-calibration of the WLED sink configuration.
> > +
> >  Example:
> >
> >  qcom-wled@d800 {
> > diff --git a/drivers/video/backlight/qcom-spmi-wled.c
> > b/drivers/video/backlight/qcom-spmi-wled.c
> > index 8b2a77a..aee5c56 100644
> > --- a/drivers/video/backlight/qcom-spmi-wled.c
> > +++ b/drivers/video/backlight/qcom-spmi-wled.c
> > @@ -38,11 +38,14 @@
> >  #define  QCOM_WLED_CTRL_SC_FAULT_BIT BIT(2)
> >
> >  #define QCOM_WLED_CTRL_INT_RT_STS0x10
> > +#define  QCOM_WLED_CTRL_OVP_FLT_RT_STS_BIT   BIT(1)
>
> The use of BIT() makes this a mask and not a bit number, so if you just
> drop that you can afford to spell out the "FAULT" like the data sheet
> does. Perhaps even making it QCOM_WLED_CTRL_OVP_FAULT_STATUS ?
>
ok. Will change it in the next series.
> >
> >  #define QCOM_WLED_CTRL_MOD_ENABLE0x46
> >  #define  QCOM_WLED_CTRL_MOD_EN_MASK  BIT(7)
> >  #define  QCOM_WLED_CTRL_MODULE_EN_SHIFT  7
> >
> > +#define QCOM_WLED_CTRL_FDBK_OP   0x48
>
> This is called WLED_CTRL_FEEDBACK_CONTROL, why the need to make it
> unreadable?
>
Ok. Will address it in next series.
> > +
> >  #define QCOM_WLED_CTRL_SWITCH_FREQ   0x4c
> >  #define  QCOM_WLED_CTRL_SWITCH_FREQ_MASK GENMASK(3, 0)
> >
> > @@ -99,6 +102,7 @@ struct qcom_wled_config {
> >   int ovp_irq;
> >   bool en_cabc;
> >   bool ext_pfet_sc_pro_en;
> > + bool auto_calib_enabled;
> >  };
> >
> >  struct qcom_wled {
> > @@ -108,18 +112,25 @@ struct qcom_wled {
> >   struct mutex lock;
> >   struct qcom_wled_config cfg;
> >   ktime_t last_sc_event_time;
> > + ktime_t start_ovp_fault_time;
> >   u16 sink_addr;
> >   u16 ctrl_addr;
> > + u16 

Re: [PATCH V1 4/4] qcom: spmi-wled: Add auto-calibration logic support

2018-04-19 Thread kgunda

On 2018-04-19 21:28, Bjorn Andersson wrote:

On Thu 19 Apr 03:45 PDT 2018, kgu...@codeaurora.org wrote:



On 2017-12-05 11:10, Bjorn Andersson wrote:
> On Thu 16 Nov 04:18 PST 2017, Kiran Gunda wrote:
>
> > The auto-calibration algorithm checks if the current WLED sink
> > configuration is valid. It tries enabling every sink and checks
> > if the OVP fault is observed. Based on this information it
> > detects and enables the valid sink configuration. Auto calibration
> > will be triggered when the OVP fault interrupts are seen frequently,
> > whereby it tries to fix the sink configuration.
> >
>
> So it's not auto "calibration" it's auto "detection" of strings?
>
Hi Bjorn,
Sorry for late response. Please find my answers.



No worries, happy to hear back from you!


Thanks!

Correct. This is auto detection; that is the name given by the
HW/systems team.


I think the name should be considered a "hardware bug" that we can work
around in software (give it a useful name and document what the original
name was).

I don't think this is a "hardware bug". Rather, we can say the HW doesn't
support it. Hence, we are implementing it as a SW feature to detect the
strings present on the display panel when the user fails to give the
correct strings. As you suggested, I will rename this to "auto detection"
instead of "auto calibration".


> When is this feature needed?
>
This feature is needed if the string configuration is given wrong in
the DT node by the user.


DT describes the hardware and for all other nodes it must do so
accurately.

But the user may not be aware of the strings present on the display
panel, or may be using the same software on different devices which have
different strings present.
For cases where the hardware supports auto detection of functionality we
remove information from DT and rely on that logic to figure out the
hardware. We do not use it to reconfigure the hardware once we detect an
error. So when auto-detection is enabled it should always be used to
probe the hardware.

Auto string detection is not supported in any qcom hardware, and I don't
think there is a plan to introduce it in new hardware either.


Regards,
Bjorn


> > Signed-off-by: Kiran Gunda 
> > ---
> >  .../bindings/leds/backlight/qcom-spmi-wled.txt |   5 +
> >  drivers/video/backlight/qcom-spmi-wled.c   | 304
> > -
> >  2 files changed, 306 insertions(+), 3 deletions(-)
> >
> > diff --git
> > a/Documentation/devicetree/bindings/leds/backlight/qcom-spmi-wled.txt
> > b/Documentation/devicetree/bindings/leds/backlight/qcom-spmi-wled.txt
> > index d39ee93..f06c0cd 100644
> > ---
> > a/Documentation/devicetree/bindings/leds/backlight/qcom-spmi-wled.txt
> > +++
> > b/Documentation/devicetree/bindings/leds/backlight/qcom-spmi-wled.txt
> > @@ -94,6 +94,11 @@ The PMIC is connected to the host processor via
> > SPMI bus.
> >   Definition: Interrupt names associated with the interrupts.
> >   Currently supported interrupts are "sc-irq" and "ovp-irq".
> >
> > +- qcom,auto-calibration
>
> qcom,auto-string-detect?
>
ok. Will address in the next patch.
> > + Usage:  optional
> > + Value type: 
> > + Definition: Enables auto-calibration of the WLED sink configuration.
> > +
> >  Example:
> >
> >  qcom-wled@d800 {
> > diff --git a/drivers/video/backlight/qcom-spmi-wled.c
> > b/drivers/video/backlight/qcom-spmi-wled.c
> > index 8b2a77a..aee5c56 100644
> > --- a/drivers/video/backlight/qcom-spmi-wled.c
> > +++ b/drivers/video/backlight/qcom-spmi-wled.c
> > @@ -38,11 +38,14 @@
> >  #define  QCOM_WLED_CTRL_SC_FAULT_BIT BIT(2)
> >
> >  #define QCOM_WLED_CTRL_INT_RT_STS0x10
> > +#define  QCOM_WLED_CTRL_OVP_FLT_RT_STS_BIT   BIT(1)
>
> The use of BIT() makes this a mask and not a bit number, so if you just
> drop that you can afford to spell out the "FAULT" like the data sheet
> does. Perhaps even making it QCOM_WLED_CTRL_OVP_FAULT_STATUS ?
>
ok. Will change it in the next series.
> >
> >  #define QCOM_WLED_CTRL_MOD_ENABLE0x46
> >  #define  QCOM_WLED_CTRL_MOD_EN_MASK  BIT(7)
> >  #define  QCOM_WLED_CTRL_MODULE_EN_SHIFT  7
> >
> > +#define QCOM_WLED_CTRL_FDBK_OP   0x48
>
> This is called WLED_CTRL_FEEDBACK_CONTROL, why the need to make it
> unreadable?
>
Ok. Will address it in next series.
> > +
> >  #define QCOM_WLED_CTRL_SWITCH_FREQ   0x4c
> >  #define  QCOM_WLED_CTRL_SWITCH_FREQ_MASK GENMASK(3, 0)
> >
> > @@ -99,6 +102,7 @@ struct qcom_wled_config {
> >   int ovp_irq;
> >   bool en_cabc;
> >   bool ext_pfet_sc_pro_en;
> > + bool auto_calib_enabled;
> >  };
> >
> >  struct qcom_wled {
> > @@ -108,18 +112,25 @@ struct qcom_wled {
> >   struct mutex lock;
> >   struct qcom_wled_config cfg;
> >   ktime_t last_sc_event_time;
> > + ktime_t start_ovp_fault_time;
> >   u16 sink_addr;
> >   u16 ctrl_addr;
> > + u16 auto_calibration_ovp_count;
> >   

Re: [PATCH] nvme: fc: provide a descriptive error

2018-04-19 Thread Hannes Reinecke
On 04/19/2018 07:43 PM, Johannes Thumshirn wrote:
> Provide a descriptive error in case an lport to rport association
> isn't found when creating the FC-NVME controller.
> 
> Currently it's very hard to debug the reason for a failed connect
> attempt without a look at the source.
> 
> Signed-off-by: Johannes Thumshirn 
> 
> ---
> This actually happened to Hannes and me because of a typo in a
> customer demo today, so yes things like this happen until we have a
> proper way to do auto-connect.
> ---
>  drivers/nvme/host/fc.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> index 6cb26bcf6ec0..8b66879b4ebf 100644
> --- a/drivers/nvme/host/fc.c
> +++ b/drivers/nvme/host/fc.c
> @@ -3284,6 +3284,8 @@ nvme_fc_create_ctrl(struct device *dev, struct 
> nvmf_ctrl_options *opts)
>   }
>   spin_unlock_irqrestore(_fc_lock, flags);
>  
> + pr_warn("%s: %s - %s combination not found\n",
> + __func__, opts->traddr, opts->host_traddr);
>   return ERR_PTR(-ENOENT);
>  }
>  
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)




Re: [PATCH] mm:memcg: add __GFP_NOWARN in __memcg_schedule_kmem_cache_create

2018-04-19 Thread Minchan Kim
On Thu, Apr 19, 2018 at 08:40:05AM +0200, Michal Hocko wrote:
> On Wed 18-04-18 11:58:00, David Rientjes wrote:
> > On Wed, 18 Apr 2018, Michal Hocko wrote:
> > 
> > > > Okay, no problem. However, I don't feel we need ratelimit at this 
> > > > moment.
> > > > We can do when we got real report. Let's add just one line warning.
> > > > However, I have no talent to write a poem to express with one line.
> > > > Could you help me?
> > > 
> > > What about
> > >   pr_info("Failed to create memcg slab cache. Report if you see floods of 
> > > these\n");
> > >  

Thank you, Michal. However, hmm, "floods" is very vague to me. 100 times per
sec? 10 times per hour? I guess we need a clearer guideline to trigger users'
reporting if we really want to do this.


> > 
> > Um, there's nothing actionable here for the user.  Even if the message 
> > directed them to a specific email address, what would you ask the user for 
> > in response if they show a kernel log with 100 of these?
> 
> We would have to think of a better way to create shadow memcg caches.
> 
> > Probably ask 
> > them to use sysrq at the time it happens to get meminfo.  But any user 
> > initiated sysrq is going to reveal very different state of memory compared 
> > to when the kmalloc() actually failed.
> 
> Not really.
> 
> > If this really needs a warning, I think it only needs to be done once and 
> > reveal the state of memory similar to how slub emits oom warnings.  But as 
> > the changelog indicates, the system is oom and we couldn't reclaim.  We 
> > can expect this happens a lot on systems with memory pressure.  What is 
> > the warning revealing that would be actionable?
> 
> That it actually happens in real workloads and we want to know what
> those workloads are. This code is quite old and yet this is the first
> time somebody complains. So it is most probably rare. Maybe because most
> workloads don't create many memcgs dynamically while low on memory.
> And maybe that will change in future. In any case, having a large splat
> of meminfo for GFP_NOWAIT is not really helpful. It will tell us what we
> know already - the memory is low and the reclaim was prohibited. We just
> need to know that this happens out there.

The workload was an experiment creating a memcg per app on an embedded
device, but at this moment I don't consider kmemcg, so I can even live
with disabling it. Based on that, I cannot say whether it's a real
workload or not.

Looking at the replies in this thread, it's arguable whether to add such a
one-line warning, so if you want it strongly, could you handle it yourself?
Sorry, but I don't have any interest in arguing.

Thanks.




Re: [PATCH v2 4/4] tpm: Move eventlog declarations to its own header

2018-04-19 Thread Jarkko Sakkinen
On Thu, Apr 12, 2018 at 12:13:50PM +0200, Thiebaud Weksteen wrote:
> Reduce the size of tpm.h by moving eventlog declarations to a separate
> header.
> 
> Signed-off-by: Thiebaud Weksteen 
> Suggested-by: Jarkko Sakkinen 

Reviewed-by: Jarkko Sakkinen 
Tested-by: Jarkko Sakkinen 

/Jarkko




Re: [PATCH v2 3/4] tpm: Move shared eventlog functions to common.c

2018-04-19 Thread Jarkko Sakkinen
On Thu, Apr 12, 2018 at 12:13:49PM +0200, Thiebaud Weksteen wrote:
> Functions and structures specific to TPM1 are renamed from tpm* to tpm1*.
> 
> Signed-off-by: Thiebaud Weksteen 
> Suggested-by: Jarkko Sakkinen 

Reviewed-by: Jarkko Sakkinen 
Tested-by: Jarkko Sakkinen 

/Jarkko




[PATCH] iommu/vt-d: fix shift-out-of-bounds in bug checking

2018-04-19 Thread changbin . du
From: Changbin Du 

The hardware allows flushing more than 4GB of device TLBs, so the mask
should be 64 bits wide. UBSAN captured this fault as below.

[3.760024] 

[3.768440] UBSAN: Undefined behaviour in drivers/iommu/dmar.c:1348:3
[3.774864] shift exponent 64 is too large for 32-bit type 'int'
[3.780853] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G U
4.17.0-rc1+ #89
[3.788661] Hardware name: Dell Inc. OptiPlex 7040/0Y7WYT, BIOS 1.2.8 
01/26/2016
[3.796034] Call Trace:
[3.798472]  
[3.800479]  dump_stack+0x90/0xfb
[3.803787]  ubsan_epilogue+0x9/0x40
[3.807353]  __ubsan_handle_shift_out_of_bounds+0x10e/0x170
[3.812916]  ? qi_flush_dev_iotlb+0x124/0x180
[3.817261]  qi_flush_dev_iotlb+0x124/0x180
[3.821437]  iommu_flush_dev_iotlb+0x94/0xf0
[3.825698]  iommu_flush_iova+0x10b/0x1c0
[3.829699]  ? fq_ring_free+0x1d0/0x1d0
[3.833527]  iova_domain_flush+0x25/0x40
[3.837448]  fq_flush_timeout+0x55/0x160
[3.841368]  ? fq_ring_free+0x1d0/0x1d0
[3.845200]  ? fq_ring_free+0x1d0/0x1d0
[3.849034]  call_timer_fn+0xbe/0x310
[3.852696]  ? fq_ring_free+0x1d0/0x1d0
[3.856530]  run_timer_softirq+0x223/0x6e0
[3.860625]  ? sched_clock+0x5/0x10
[3.864108]  ? sched_clock+0x5/0x10
[3.867594]  __do_softirq+0x1b5/0x6f5
[3.871250]  irq_exit+0xd4/0x130
[3.874470]  smp_apic_timer_interrupt+0xb8/0x2f0
[3.879075]  apic_timer_interrupt+0xf/0x20
[3.883159]  
[3.885255] RIP: 0010:poll_idle+0x60/0xe7
[3.889252] RSP: 0018:b1b201943e30 EFLAGS: 0246 ORIG_RAX: 
ff13
[3.896802] RAX: 8020 RBX: 008e RCX: 001f
[3.903918] RDX:  RSI: 2819aa06 RDI: 
[3.911031] RBP: 9e93c6b33280 R08: 0010f717d567 R09: 0010d205
[3.918146] R10: b1b201943df8 R11: 0001 R12: e01b169d
[3.925260] R13:  R14: b12aa400 R15: 
[3.932382]  cpuidle_enter_state+0xb4/0x470
[3.936558]  do_idle+0x222/0x310
[3.939779]  cpu_startup_entry+0x78/0x90
[3.943693]  start_secondary+0x205/0x2e0
[3.947607]  secondary_startup_64+0xa5/0xb0
[3.951783] 


Signed-off-by: Changbin Du 
---
 drivers/iommu/dmar.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index accf5838..e4ae600 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -1345,7 +1345,7 @@ void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 
sid, u16 qdep,
struct qi_desc desc;
 
if (mask) {
-   BUG_ON(addr & ((1 << (VTD_PAGE_SHIFT + mask)) - 1));
+   BUG_ON(addr & ((1ULL << (VTD_PAGE_SHIFT + mask)) - 1));
addr |= (1ULL << (VTD_PAGE_SHIFT + mask - 1)) - 1;
desc.high = QI_DEV_IOTLB_ADDR(addr) | QI_DEV_IOTLB_SIZE;
} else
-- 
2.7.4



Re: [PATCH v2 2/4] tpm: Move eventlog files to a subdirectory

2018-04-19 Thread Jarkko Sakkinen
On Thu, Apr 12, 2018 at 12:13:48PM +0200, Thiebaud Weksteen wrote:
> Signed-off-by: Thiebaud Weksteen 
> Suggested-by: Jarkko Sakkinen 

Reviewed-by: Jarkko Sakkinen 
Tested-by: Jarkko Sakkinen 

/Jarkko






Re: [PATCH v2 1/4] tpm: Add explicit endianness cast

2018-04-19 Thread Jarkko Sakkinen
On Thu, Apr 12, 2018 at 12:13:47PM +0200, Thiebaud Weksteen wrote:
> Signed-off-by: Thiebaud Weksteen 

Reviewed-by: Jarkko Sakkinen 
Tested-by: Jarkko Sakkinen 

/Jarkko




[PATCH] arm64: avoid potential infinity loop in dump_backtrace

2018-04-19 Thread Ji Zhang
When we dump the backtrace of some tasks there is a potential infinite
loop if the content of the stack changes, whether the change is because
the task is running or due to other unexpected cases.

This patch adds a stronger check on the frame pointer and sets the max
number of stacks spanned to avoid an infinite loop.

Signed-off-by: Ji Zhang 
---
 arch/arm64/include/asm/stacktrace.h | 25 +
 arch/arm64/kernel/stacktrace.c  |  8 
 arch/arm64/kernel/traps.c   |  1 +
 3 files changed, 34 insertions(+)

diff --git a/arch/arm64/include/asm/stacktrace.h 
b/arch/arm64/include/asm/stacktrace.h
index 902f9ed..f235b86 100644
--- a/arch/arm64/include/asm/stacktrace.h
+++ b/arch/arm64/include/asm/stacktrace.h
@@ -24,9 +24,18 @@
 #include 
 #include 
 
+#ifndef CONFIG_VMAP_STACK
+#define MAX_NR_STACKS  2
+#elif !defined(CONFIG_ARM_SDE_INTERFACE)
+#define MAX_NR_STACKS  3
+#else
+#define MAX_NR_STACKS  4
+#endif
+
 struct stackframe {
unsigned long fp;
unsigned long pc;
+   int nr_stacks;
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
int graph;
 #endif
@@ -92,4 +101,20 @@ static inline bool on_accessible_stack(struct task_struct 
*tsk, unsigned long sp
return false;
 }
 
+
+static inline bool on_same_stack(struct task_struct *tsk,
+   unsigned long sp1, unsigned long sp2)
+{
+   if (on_task_stack(tsk, sp1) && on_task_stack(tsk, sp2))
+   return true;
+   if (on_irq_stack(sp1) && on_irq_stack(sp2))
+   return true;
+   if (on_overflow_stack(sp1) && on_overflow_stack(sp2))
+   return true;
+   if (on_sdei_stack(sp1) && on_sdei_stack(sp2))
+   return true;
+
+   return false;
+}
+
 #endif /* __ASM_STACKTRACE_H */
diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
index d5718a0..d75f59d 100644
--- a/arch/arm64/kernel/stacktrace.c
+++ b/arch/arm64/kernel/stacktrace.c
@@ -43,6 +43,7 @@
 int notrace unwind_frame(struct task_struct *tsk, struct stackframe *frame)
 {
unsigned long fp = frame->fp;
+   bool same_stack;
 
if (fp & 0xf)
return -EINVAL;
@@ -56,6 +57,13 @@ int notrace unwind_frame(struct task_struct *tsk, struct 
stackframe *frame)
frame->fp = READ_ONCE_NOCHECK(*(unsigned long *)(fp));
frame->pc = READ_ONCE_NOCHECK(*(unsigned long *)(fp + 8));
 
+   same_stack = on_same_stack(tsk, fp, frame->fp);
+
+   if (fp <= frame->fp && same_stack)
+   return -EINVAL;
+   if (!same_stack && ++frame->nr_stacks > MAX_NR_STACKS)
+   return -EINVAL;
+
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
if (tsk->ret_stack &&
(frame->pc == (unsigned long)return_to_handler)) {
diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
index ba964da..ee0403d 100644
--- a/arch/arm64/kernel/traps.c
+++ b/arch/arm64/kernel/traps.c
@@ -121,6 +121,7 @@ void dump_backtrace(struct pt_regs *regs, struct 
task_struct *tsk)
frame.fp = thread_saved_fp(tsk);
frame.pc = thread_saved_pc(tsk);
}
+   frame.nr_stacks = 1;
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
frame.graph = tsk->curr_ret_stack;
 #endif
-- 
1.9.1






Re: [PATCH 2/7] i2c: i2c-mux-gpio: move header to platform_data

2018-04-19 Thread Peter Korsgaard
> "WS" == Wolfram Sang  writes:

WS> This header only contains platform_data. Move it to the proper directory.
WS> Signed-off-by: Wolfram Sang 

Thanks,

Acked-by: Peter Korsgaard 

--
Bye, Peter Korsgaard




[PATCH v6 0/2] PCI: mediatek: Fixups for the IRQ handle routine and MT7622's class code

2018-04-19 Thread honghui.zhang
From: Honghui Zhang 

Two fixups for mediatek's host bridge:
The first patch fixes up the class type and vendor ID for MT7622.
The second patch fixes up the IRQ handling routine by using the irq_chip
solution to avoid IRQ reentry, which may occur on both MT2712 and MT7622.

Change since v5:
 - Make the comments consistent with the code modification in the first patch.
 - Use writew to perform a 16-bit write.
 - Use the irq_chip solution to fix the IRQ issue.

The v5 patchset could be found in:
 https://patchwork.kernel.org/patch/10133303
 https://patchwork.kernel.org/patch/10133305

Change since v4:
 - Only set up the vendor ID for MT7622, ignoring the device ID since
   mediatek's host bridge driver does not care about the device ID.

Change since v3:
 - Setup the class type and vendor ID at the beginning of startup instead
   of in a quirk.
 - Add mediatek's vendor ID, it could be found in:
   https://pcisig.com/membership/member-companies?combine==4

Change since v2:
 - Move the initialization of the iterator before the loop to fix an
   INTx IRQ issue in the first patch

Change since v1:
 - Add the second patch.
 - Make the first patch's commit message more standard.
Honghui Zhang (2):
  PCI: mediatek: Set up vendor ID and class type for MT7622
  PCI: mediatek: Using chained IRQ to setup IRQ handle

 drivers/pci/host/pcie-mediatek.c | 220 +++
 include/linux/pci_ids.h  |   2 +
 2 files changed, 133 insertions(+), 89 deletions(-)

-- 
2.6.4



[PATCH v6 0/2] PCI: mediatek: Fixups for the IRQ handle routine and MT7622's class code

2018-04-19 Thread honghui.zhang
From: Honghui Zhang 

Two fixups for mediatek's host bridge:
The first patch fixup class type and vendor ID for MT7622.
The second patch fixup the IRQ handle routine by using irq_chip solution
to avoid IRQ reentry which may exist for both MT2712 and MT7622.

Change since v5:
 - Make the comments consistend with the code modification in the first patch.
 - Using writew to performing a 16-bit write.
 - Using irq_chip solution to fix the IRQ issue.

The v5 patchset could be found in:
 https://patchwork.kernel.org/patch/10133303
 https://patchwork.kernel.org/patch/10133305

Change since v4:
 - Only setup vendor ID for MT7622, igorning the device ID since mediatek's
   host bridge driver does not cares about the device ID.

Change since v3:
 - Setup the class type and vendor ID at the beginning of startup instead
   of in a quirk.
 - Add mediatek's vendor ID; it can be found at:
   https://pcisig.com/membership/member-companies?combine==4

Change since v2:
 - Move the initialization of the iterator before the loop to fix an
   INTx IRQ issue in the first patch

Change since v1:
 - Add the second patch.
 - Make the first patch's commit message more standard.

Honghui Zhang (2):
  PCI: mediatek: Set up vendor ID and class type for MT7622
  PCI: mediatek: Using chained IRQ to setup IRQ handle

 drivers/pci/host/pcie-mediatek.c | 220 +++
 include/linux/pci_ids.h  |   2 +
 2 files changed, 133 insertions(+), 89 deletions(-)

-- 
2.6.4



[PATCH 2/2] PCI: mediatek: Using chained IRQ to setup IRQ handle

2018-04-19 Thread honghui.zhang
From: Honghui Zhang 

Use an irq_chip solution to set up IRQs for consistency with the IRQ framework.

Signed-off-by: Honghui Zhang 
---
 drivers/pci/host/pcie-mediatek.c | 192 +--
 1 file changed, 105 insertions(+), 87 deletions(-)

diff --git a/drivers/pci/host/pcie-mediatek.c b/drivers/pci/host/pcie-mediatek.c
index c3dc549..1d9c6f1 100644
--- a/drivers/pci/host/pcie-mediatek.c
+++ b/drivers/pci/host/pcie-mediatek.c
@@ -11,8 +11,10 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -130,14 +132,12 @@ struct mtk_pcie_port;
 /**
  * struct mtk_pcie_soc - differentiate between host generations
  * @need_fix_class_id: whether this host's class ID needed to be fixed or not
- * @has_msi: whether this host supports MSI interrupts or not
  * @ops: pointer to configuration access functions
  * @startup: pointer to controller setting functions
  * @setup_irq: pointer to initialize IRQ functions
  */
 struct mtk_pcie_soc {
bool need_fix_class_id;
-   bool has_msi;
struct pci_ops *ops;
int (*startup)(struct mtk_pcie_port *port);
int (*setup_irq)(struct mtk_pcie_port *port, struct device_node *node);
@@ -161,7 +161,9 @@ struct mtk_pcie_soc {
  * @lane: lane count
  * @slot: port slot
  * @irq_domain: legacy INTx IRQ domain
+ * @inner_domain: inner IRQ domain
  * @msi_domain: MSI IRQ domain
+ * @lock: protect the msi_irq_in_use bitmap
  * @msi_irq_in_use: bit map for assigned MSI IRQ
  */
 struct mtk_pcie_port {
@@ -179,7 +181,9 @@ struct mtk_pcie_port {
u32 lane;
u32 slot;
struct irq_domain *irq_domain;
+   struct irq_domain *inner_domain;
struct irq_domain *msi_domain;
+   struct mutex lock;
DECLARE_BITMAP(msi_irq_in_use, MTK_MSI_IRQS_NUM);
 };
 
@@ -446,103 +450,122 @@ static int mtk_pcie_startup_port_v2(struct mtk_pcie_port *port)
return 0;
 }
 
-static int mtk_pcie_msi_alloc(struct mtk_pcie_port *port)
+static void mtk_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 {
-   int msi;
+   struct mtk_pcie_port *port = irq_data_get_irq_chip_data(data);
+   phys_addr_t addr;
 
-   msi = find_first_zero_bit(port->msi_irq_in_use, MTK_MSI_IRQS_NUM);
-   if (msi < MTK_MSI_IRQS_NUM)
-   set_bit(msi, port->msi_irq_in_use);
-   else
-   return -ENOSPC;
+   /* MT2712/MT7622 only support 32-bit MSI addresses */
+   addr = virt_to_phys(port->base + PCIE_MSI_VECTOR);
+   msg->address_hi = 0;
+   msg->address_lo = lower_32_bits(addr);
 
-   return msi;
+   msg->data = data->hwirq;
+
+   dev_dbg(port->pcie->dev, "msi#%d address_hi %#x address_lo %#x\n",
+   (int)data->hwirq, msg->address_hi, msg->address_lo);
 }
 
-static void mtk_pcie_msi_free(struct mtk_pcie_port *port, unsigned long hwirq)
+static int mtk_msi_set_affinity(struct irq_data *irq_data,
+  const struct cpumask *mask, bool force)
 {
-   clear_bit(hwirq, port->msi_irq_in_use);
+   return -EINVAL;
 }
 
-static int mtk_pcie_msi_setup_irq(struct msi_controller *chip,
- struct pci_dev *pdev, struct msi_desc *desc)
-{
-   struct mtk_pcie_port *port;
-   struct msi_msg msg;
-   unsigned int irq;
-   int hwirq;
-   phys_addr_t msg_addr;
+static struct irq_chip mtk_msi_bottom_irq_chip = {
+   .name   = "MTK MSI",
+   .irq_compose_msi_msg= mtk_compose_msi_msg,
+   .irq_set_affinity   = mtk_msi_set_affinity,
+   .irq_mask   = pci_msi_mask_irq,
+   .irq_unmask = pci_msi_unmask_irq,
+};
 
-   port = mtk_pcie_find_port(pdev->bus, pdev->devfn);
-   if (!port)
-   return -EINVAL;
+static int mtk_pcie_irq_domain_alloc(struct irq_domain *domain, unsigned int virq,
+unsigned int nr_irqs, void *args)
+{
+   struct mtk_pcie_port *port = domain->host_data;
+   unsigned long bit;
 
-   hwirq = mtk_pcie_msi_alloc(port);
-   if (hwirq < 0)
-   return hwirq;
+   WARN_ON(nr_irqs != 1);
+   mutex_lock(&port->lock);
 
-   irq = irq_create_mapping(port->msi_domain, hwirq);
-   if (!irq) {
-   mtk_pcie_msi_free(port, hwirq);
-   return -EINVAL;
+   bit = find_first_zero_bit(port->msi_irq_in_use, MTK_MSI_IRQS_NUM);
+   if (bit >= MTK_MSI_IRQS_NUM) {
+   mutex_unlock(&port->lock);
+   return -ENOSPC;
}
 
-   chip->dev = &pdev->dev;
-
-   irq_set_msi_desc(irq, desc);
+   __set_bit(bit, port->msi_irq_in_use);
 
-   /* MT2712/MT7622 only support 32-bit MSI addresses */
-   msg_addr = virt_to_phys(port->base + PCIE_MSI_VECTOR);
-   msg.address_hi = 0;
-   msg.address_lo = lower_32_bits(msg_addr);
-   msg.data = hwirq;
+   mutex_unlock(&port->lock);
 
-   pci_write_msi_msg(irq, &msg);
+   

[PATCH v6 1/2] PCI: mediatek: Set up vendor ID and class type for MT7622

2018-04-19 Thread honghui.zhang
From: Honghui Zhang 

MT7622's hardware default values for the vendor ID and class type are not
correct; fix that by setting up the correct values before link-up with the
Endpoint.

Signed-off-by: Honghui Zhang 
---
 drivers/pci/host/pcie-mediatek.c | 30 +++---
 include/linux/pci_ids.h  |  2 ++
 2 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/host/pcie-mediatek.c b/drivers/pci/host/pcie-mediatek.c
index a8b20c5..c3dc549 100644
--- a/drivers/pci/host/pcie-mediatek.c
+++ b/drivers/pci/host/pcie-mediatek.c
@@ -66,6 +66,10 @@
 
 /* PCIe V2 per-port registers */
 #define PCIE_MSI_VECTOR0x0c0
+
+#define PCIE_CONF_VEND_ID  0x100
+#define PCIE_CONF_CLASS_ID 0x106
+
 #define PCIE_INT_MASK  0x420
 #define INTX_MASK  GENMASK(19, 16)
 #define INTX_SHIFT 16
@@ -125,12 +129,14 @@ struct mtk_pcie_port;
 
 /**
  * struct mtk_pcie_soc - differentiate between host generations
+ * @need_fix_class_id: whether this host's class ID needed to be fixed or not
  * @has_msi: whether this host supports MSI interrupts or not
  * @ops: pointer to configuration access functions
  * @startup: pointer to controller setting functions
  * @setup_irq: pointer to initialize IRQ functions
  */
 struct mtk_pcie_soc {
+   bool need_fix_class_id;
bool has_msi;
struct pci_ops *ops;
int (*startup)(struct mtk_pcie_port *port);
@@ -375,6 +381,7 @@ static int mtk_pcie_startup_port_v2(struct mtk_pcie_port *port)
 {
struct mtk_pcie *pcie = port->pcie;
struct resource *mem = &pcie->mem;
+   const struct mtk_pcie_soc *soc = port->pcie->soc;
u32 val;
size_t size;
int err;
@@ -403,6 +410,15 @@ static int mtk_pcie_startup_port_v2(struct mtk_pcie_port *port)
   PCIE_MAC_SRSTB | PCIE_CRSTB;
writel(val, port->base + PCIE_RST_CTRL);
 
+   /* Set up vendor ID and class code */
+   if (soc->need_fix_class_id) {
+   val = PCI_VENDOR_ID_MEDIATEK;
+   writew(val, port->base + PCIE_CONF_VEND_ID);
+
+   val = PCI_CLASS_BRIDGE_PCI;
+   writew(val, port->base + PCIE_CONF_CLASS_ID);
+   }
+
/* 100ms timeout value should be enough for Gen1/2 training */
err = readl_poll_timeout(port->base + PCIE_LINK_STATUS_V2, val,
 !!(val & PCIE_PORT_LINKUP_V2), 20,
@@ -1142,7 +1158,15 @@ static const struct mtk_pcie_soc mtk_pcie_soc_v1 = {
.startup = mtk_pcie_startup_port,
 };
 
-static const struct mtk_pcie_soc mtk_pcie_soc_v2 = {
+static const struct mtk_pcie_soc mtk_pcie_soc_mt2712 = {
+   .has_msi = true,
+   .ops = &mtk_pcie_ops_v2,
+   .startup = mtk_pcie_startup_port_v2,
+   .setup_irq = mtk_pcie_setup_irq,
+};
+
+static const struct mtk_pcie_soc mtk_pcie_soc_mt7622 = {
+   .need_fix_class_id = true,
.has_msi = true,
.ops = &mtk_pcie_ops_v2,
.startup = mtk_pcie_startup_port_v2,
@@ -1152,8 +1176,8 @@ static const struct mtk_pcie_soc mtk_pcie_soc_v2 = {
 static const struct of_device_id mtk_pcie_ids[] = {
{ .compatible = "mediatek,mt2701-pcie", .data = &mtk_pcie_soc_v1 },
{ .compatible = "mediatek,mt7623-pcie", .data = &mtk_pcie_soc_v1 },
-   { .compatible = "mediatek,mt2712-pcie", .data = &mtk_pcie_soc_v2 },
-   { .compatible = "mediatek,mt7622-pcie", .data = &mtk_pcie_soc_v2 },
+   { .compatible = "mediatek,mt2712-pcie", .data = &mtk_pcie_soc_mt2712 },
+   { .compatible = "mediatek,mt7622-pcie", .data = &mtk_pcie_soc_mt7622 },
{},
 };
 
diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
index a6b3066..9d4fca5 100644
--- a/include/linux/pci_ids.h
+++ b/include/linux/pci_ids.h
@@ -2115,6 +2115,8 @@
 
 #define PCI_VENDOR_ID_MYRICOM  0x14c1
 
+#define PCI_VENDOR_ID_MEDIATEK 0x14c3
+
 #define PCI_VENDOR_ID_TITAN0x14D2
 #define PCI_DEVICE_ID_TITAN_010L   0x8001
 #define PCI_DEVICE_ID_TITAN_100L   0x8010
-- 
2.6.4






Re: DOS by unprivileged user

2018-04-19 Thread Mike Galbraith
On Thu, 2018-04-19 at 21:13 +0200, Ferry Toth wrote:
> It appears any ordinary user can easily create a DOS on linux.
> 
> One sure way to reproduce this is to open gitk on the linux kernel repo 
> (SIC) on a machine with 8GB RAM 16 GB swap on a HDD with btrfs and quad core 
> + hyperthreading. But it will be easy enough to get the same effect with more 
> RAM, other fs etc.
> 
> In this case gitk allocates more and more memory (until my system freezes 
> 6.5GB of 7.5GB available), the system starts swapping or writing to tmp files 
> (can't investigate as there is no time until it freezes) and the io wait 
> goes to 100% on all cores. At this point it is impossible to login from 
> remote and local keyboard and mouse are frozen. Hard reset is the only way 
> out at this point.

datapoint: my i4790/ext4 box running master.yesterday booted mem=8G
became highly unpleasant to use, but I retained control, and the all
cores going to 100% thing did not happen at any time.

I didn't try constraining on the gitk user, just turned it loose a few
times to see if it managed to render box effectively dead.  It failed
to kill my box, but (expectedly) did make it suck rocks.

-Mike




Re: [PATCH] tpm: moves the delay_msec increment after sleep in tpm_transmit()

2018-04-19 Thread Jarkko Sakkinen
On Tue, Apr 10, 2018 at 03:31:09PM +0300, Jarkko Sakkinen wrote:
> On Mon, 2018-04-09 at 10:29 -0400, Mimi Zohar wrote:
> > If this change is acceptable, do you want to make the change or should Nayna
> > repost the patch?
> 
> No need. I'll move on to testing.

Tested-by: Jarkko Sakkinen 
Reviewed-by: Jarkko Sakkinen 

/Jarkko




Re: [PATCH 2/2] cpufreq: brcmstb-avs-cpufreq: prefer SCMI cpufreq if supported

2018-04-19 Thread Viresh Kumar
On 19-04-18, 11:37, Sudeep Holla wrote:
> 
> 
> On 19/04/18 05:16, Viresh Kumar wrote:
> > On 18-04-18, 08:56, Markus Mayer wrote:
> >> From: Jim Quinlan 
> >>
> >> If the SCMI cpufreq driver is supported, we bail, so that the new
> >> approach can be used.
> >>
> >> Signed-off-by: Jim Quinlan 
> >> Signed-off-by: Markus Mayer 
> >> ---
> >>  drivers/cpufreq/brcmstb-avs-cpufreq.c | 16 
> >>  1 file changed, 16 insertions(+)
> >>
> >> diff --git a/drivers/cpufreq/brcmstb-avs-cpufreq.c 
> >> b/drivers/cpufreq/brcmstb-avs-cpufreq.c
> >> index b07559b9ed99..b4861a730162 100644
> >> --- a/drivers/cpufreq/brcmstb-avs-cpufreq.c
> >> +++ b/drivers/cpufreq/brcmstb-avs-cpufreq.c
> >> @@ -164,6 +164,8 @@
> >>  #define BRCM_AVS_CPU_INTR "brcm,avs-cpu-l2-intr"
> >>  #define BRCM_AVS_HOST_INTR"sw_intr"
> >>  
> >> +#define ARM_SCMI_COMPAT   "arm,scmi"
> >> +
> >>  struct pmap {
> >>unsigned int mode;
> >>unsigned int p1;
> >> @@ -511,6 +513,20 @@ static int brcm_avs_prepare_init(struct 
> >> platform_device *pdev)
> >>struct device *dev;
> >>int host_irq, ret;
> >>  
> >> +  /*
> >> +   * If the SCMI cpufreq driver is supported, we bail, so that the more
> >> +   * modern approach can be used.
> >> +   */
> >> +  if (IS_ENABLED(CONFIG_ARM_SCMI_PROTOCOL)) {
> >> +  struct device_node *np;
> >> +
> >> +  np = of_find_compatible_node(NULL, NULL, ARM_SCMI_COMPAT);
> >> +  if (np) {
> >> +  of_node_put(np);
> >> +  return -ENXIO;
> >> +  }
> >> +  }
> >> +
> > 
> > What about adding !CONFIG_ARM_SCMI_PROTOCOL in Kconfig dependency and don't
> > compile the driver at all ?
> > 
> 
> Unfortunately, that may not be a good idea with a single image needing both
> configs to be enabled.

Sure, but looking at the above code, it looked like they don't need the other
config if SCMI is enabled.

-- 
viresh




Re: [RFC/RFT patch 0/7] timekeeping: Unify clock MONOTONIC and clock BOOTTIME

2018-04-19 Thread David Herrmann
Hey

On Tue, Mar 13, 2018 at 7:11 PM, John Stultz  wrote:
> On Mon, Mar 12, 2018 at 11:36 PM, Ingo Molnar  wrote:
>> Ok, I have edited all the changelogs accordingly (and also flipped around the
>> 'clock MONOTONIC' language to the more readable 'the MONOTONIC clock' 
>> variant),
>> the resulting titles are (in order):
>>
>>  72199320d49d: timekeeping: Add the new CLOCK_MONOTONIC_ACTIVE clock
>>  d6ed449afdb3: timekeeping: Make the MONOTONIC clock behave like the 
>> BOOTTIME clock
>>  f2d6fdbfd238: Input: Evdev - unify MONOTONIC and BOOTTIME clock behavior
>>  d6c7270e913d: timekeeping: Remove boot time specific code
>>  7250a4047aa6: posix-timers: Unify MONOTONIC and BOOTTIME clock behavior
>>  127bfa5f4342: hrtimer: Unify MONOTONIC and BOOTTIME clock behavior
>>  92af4dcb4e1c: tracing: Unify the "boot" and "mono" tracing clocks
>>
>> I'll push these out after testing.
>
> I'm still anxious about userspace effects given how much I've seen the
> current behavior documented, and wouldn't pushed for this myself (I'm
> a worrier), but at least I'm not seeing any failures in initial
> testing w/ kselftest so far.

I get lots of timer-errors on Arch-Linux booting current master, after
a suspend/resume cycle. Just a selection of errors I see on resume:

systemd[1]: systemd-journald.service: Main process exited,
code=dumped, status=6/ABRT
rtkit-daemon[742]: The canary thread is apparently starving. Taking action.
systemd[1]: systemd-udevd.service: Watchdog timeout (limit 3min)!
systemd[1]: systemd-journald.service: Watchdog timeout (limit 3min)!
kernel: e1000e :00:1f.6: Failed to restore TIMINCA clock rate delta: -22

Lots of crashes with SIGABRT due to these.

I did not bisect it, but it sounds related to me. Also, user-space
uses CLOCK_MONOTONIC for watchdog timers. That is, a process is
required to respond to a watchdog-request in a given MONOTONIC
time-frame. If this jumps during suspend/resume, watchdogs will fire
immediately. I don't see how this can work with the new MONOTONIC
behavior?

Thanks
David




Re: [greybus-dev] [PATCH 47/61] staging: greybus: simplify getting .drvdata

2018-04-19 Thread Viresh Kumar
On 19-04-18, 16:06, Wolfram Sang wrote:
> We should get drvdata from struct device directly. Going via
> platform_device is an unneeded step back and forth.
> 
> Signed-off-by: Wolfram Sang 
> ---
> 
> Build tested only. buildbot is happy. Please apply individually.
> 
>  drivers/staging/greybus/arche-platform.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)

Acked-by: Viresh Kumar 

-- 
viresh




Re: [PATCH 08/61] dmaengine: dw: simplify getting .drvdata

2018-04-19 Thread Viresh Kumar
On 19-04-18, 16:05, Wolfram Sang wrote:
> We should get drvdata from struct device directly. Going via
> platform_device is an unneeded step back and forth.
> 
> Signed-off-by: Wolfram Sang 
> ---
> 
> Build tested only. buildbot is happy. Please apply individually.
> 
>  drivers/dma/dw/platform.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)

Acked-by: Viresh Kumar 

-- 
viresh




Re: [PATCH] f2fs: separate hot/cold in free nid

2018-04-19 Thread Chao Yu
On 2018/4/20 11:37, Jaegeuk Kim wrote:
> On 04/20, Chao Yu wrote:
>> Most indirect nodes, double-indirect nodes, and xattr nodes won't be updated
>> after they are created, while inode nodes and other direct nodes change
>> more frequently, so storing their nat entries mixed together in the whole
>> nat table will suffer from:
>> - fragmenting the nat table quickly due to the different update rates
>> - more nat block updates due to the fragmented nat table
>>
>> In order to solve above issue, we're trying to separate whole nat table to
>> two part:
>> a. Hot free nid area:
>>  - range: [nid #0, nid #x)
>>  - store node block address for
>>* inode node
>>* other direct node
>> b. Cold free nid area:
>>  - range: [nid #x, max nid)
>>  - store node block address for
>>* indirect node
>>* dindirect node
>>* xattr node
>>
>> Allocation strategy example:
>>
>> Free nid: '-'
>> Used nid: '='
>>
>> 1. Initial status:
>> Free Nids:   
>> |---|
>>  ^   ^   ^   
>> ^
>> Alloc Range: |---|   
>> |---|
>>  hot_start   hot_end 
>> cold_start  cold_end
>>
>> 2. Free nids have run out:
>> Free Nids:   
>> |===-===|
>>  ^   ^   ^   
>> ^
>> Alloc Range: |===|   
>> |===|
>>  hot_start   hot_end 
>> cold_start  cold_end
>>
>> 3. Expand hot/cold area range:
>> Free Nids:   
>> |===-===|
>>  ^   ^   ^   
>> ^
>> Alloc Range: |===|   
>> |===|
>>  hot_start   hot_end cold_start  
>> cold_end
>>
>> 4. Hot free nids have ran out:
>> Free Nids:   
>> |===-===|
>>  ^   ^   ^   
>> ^
>> Alloc Range: |===|   
>> |===|
>>  hot_start   hot_end cold_start  
>> cold_end
>>
>> 5. Expand hot area range, hot/cold area boundary has been fixed:
>> Free Nids:   
>> |===-===|
>>  ^   ^   
>> ^
>> Alloc Range: 
>> |===|===|
>>  hot_start   hot_end(cold_start) 
>> cold_end
>>
>> Run xfstests with generic/*:
>>
>> before
>> node_write:  169660
>> cp_count:60118
>> node/cp  2.82
>>
>> after:
>> node_write:  159145
>> cp_count:84501
>> node/cp: 2.64
> 
> Nice trial, though I don't see much benefit in this huge patch. I guess we may
> be able to find an efficient way to address this issue rather than changing a
> lot of stable code.

IMO, based on this, we can later add more allocation policies to manage the free
nid resource for more benefit.

If you worry about code stability, we can queue this patch in the dev-test
branch to test it for a longer time.

> 
> How about getting a free nid in the list from head or tail separately?

I don't think this can benefit a long-used image, since the nat table will be
fragmented anyway; then we won't know whether a free nid at the head or the tail
comes from a hot nat block or a cold nat block.

Anyway, I will have a try.

Thanks,
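The split-range allocation described above can be sketched roughly as follows. This is a hedged illustration with made-up names (`nid_ranges`, `alloc_nid`), not the patch's actual API: inode and direct nodes draw from the hot range at the front of the nat table, all other node types from the cold range.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative node types; f2fs distinguishes more cases internally. */
enum node_type { NT_INODE, NT_DIRECT, NT_INDIRECT, NT_DINDIRECT, NT_XATTR };

/* hot area is [hot_next, hot_end), cold area is [cold_next, max_nid) */
struct nid_ranges {
	unsigned int hot_next, hot_end;
	unsigned int cold_next, max_nid;
};

static bool is_hot(enum node_type t)
{
	return t == NT_INODE || t == NT_DIRECT;
}

/* returns 0 and stores a nid on success, -1 when the area ran out
 * (the real patch would then expand the exhausted area instead) */
static int alloc_nid(struct nid_ranges *r, enum node_type t, unsigned int *nid)
{
	if (is_hot(t)) {
		if (r->hot_next >= r->hot_end)
			return -1;
		*nid = r->hot_next++;
	} else {
		if (r->cold_next >= r->max_nid)
			return -1;
		*nid = r->cold_next++;
	}
	return 0;
}
```

Keeping hot and cold nids in disjoint ranges clusters the frequently updated nat entries into the same nat blocks, which is where the node_write/cp_count improvement quoted above would come from.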

> 
>>
>> Signed-off-by: Chao Yu 
>> ---
>>  fs/f2fs/checkpoint.c |   4 -
>>  fs/f2fs/debug.c  |   6 +-
>>  fs/f2fs/f2fs.h   |  19 +++-
>>  fs/f2fs/inode.c  |   2 +-
>>  fs/f2fs/namei.c  |   2 +-
>>  fs/f2fs/node.c   | 302 
>> ---
>>  fs/f2fs/node.h   |  17 +--
>>  fs/f2fs/segment.c|   8 +-
>>  fs/f2fs/shrinker.c   |   3 +-
>>  fs/f2fs/xattr.c  |  10 +-
>>  10 files changed, 221 insertions(+), 152 deletions(-)
>>
>> diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
>> index 96785ffc6181..c17feec72c74 100644
>> --- a/fs/f2fs/checkpoint.c
>> +++ b/fs/f2fs/checkpoint.c
>> @@ -1029,14 +1029,10 @@ int f2fs_sync_inode_meta(struct f2fs_sb_info *sbi)
>>  static void __prepare_cp_block(struct f2fs_sb_info *sbi)
>>  {
>>  struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
>> -struct f2fs_nm_info *nm_i = NM_I(sbi);
>> -nid_t last_nid = nm_i->next_scan_nid;
>>  
>> -next_free_nid(sbi, &last_nid);
>>  ckpt->valid_block_count = cpu_to_le64(valid_user_blocks(sbi));
>>  

Re: [PATCH] IB/core: Make ib_mad_client_id atomic

2018-04-19 Thread Doug Ledford
On Wed, 2018-04-18 at 16:24 +0200, Håkon Bugge wrote:
> Two kernel threads may get the same value for agent.hi_tid, if the
> agents are registered for different ports. As of now, this works, as
> the agent list is per port.
> 
> It is however confusing and not future robust. Hence, making it
> atomic.
> 

People sometimes underestimate the performance penalty of atomic ops. 
Every atomic op is the equivalent of a spin_lock/spin_unlock pair.  This
is why two atomics are worse than taking a spin_lock, doing what you
have to do, and releasing the spin_lock.  Is this really what you want
for a "confusing, let's make it robust" issue?
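For illustration only, with C11 atomics standing in for the kernel's atomic_t and hypothetical names: the pattern under discussion is a shared client-id counter bumped at agent registration. A single atomic fetch-add gives each caller a unique id; Doug's point is that a path needing two atomic ops would already be more expensive than one spinlock-protected section.

```c
#include <assert.h>
#include <stdatomic.h>

/* shared across registration paths; starts at 1 like ib_mad_client_id */
static atomic_uint client_id = 1;

/* each caller gets a distinct id even under concurrent registration,
 * at the cost of one full read-modify-write atomic per call */
static unsigned int alloc_hi_tid(void)
{
	/* atomic_fetch_add returns the previous value */
	return atomic_fetch_add(&client_id, 1);
}
```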

-- 
Doug Ledford 
GPG KeyID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

signature.asc
Description: This is a digitally signed message part



Re: [PATCH 5/5] f2fs: fix to avoid race during access gc_thread pointer

2018-04-19 Thread Jaegeuk Kim
On 04/20, Chao Yu wrote:
> On 2018/4/20 11:19, Jaegeuk Kim wrote:
> > On 04/18, Chao Yu wrote:
> >> Thread A                Thread B                Thread C
> >> - f2fs_remount
> >>  - stop_gc_thread
> >>                         - f2fs_sbi_store
> >>                                                 - issue_discard_thread
> >>    sbi->gc_thread = NULL;
> >>                           sbi->gc_thread->gc_wake = 1
> >>                                                   access sbi->gc_thread->gc_urgent
> > 
> > Do we simply need a lock for this?
> 
> The code will be more complicated for handling existing and newly added fields
> behind the sbi->gc_thread pointer, and it would cause unneeded lock overhead,
> right?
> 
> So let's just allocate memory during fill_super?

No, the case is when stopping the thread. We can keep the gc_thread and indicate
its state as "disabled". Then, we need to handle other paths with the state?
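A minimal model of the two designs under discussion (hypothetical field names, not the actual f2fs code): with the alloc/free scheme, readers can race with stop_gc_thread() freeing and NULLing sbi->gc_thread; the patch instead allocates the structure once at mount and only toggles the kthread, so the pointer is always valid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct gc_kthread {
	bool task_running;      /* stands in for f2fs_gc_task != NULL */
	unsigned int gc_urgent; /* a field the sysfs handlers poke at */
};

struct sb_info {
	struct gc_kthread *gc_thread;
	struct gc_kthread gc_ctx; /* embedded: lives as long as the sb */
};

/* called once from fill_super in the proposed design */
static void init_gc_context(struct sb_info *sbi)
{
	sbi->gc_ctx.task_running = false;
	sbi->gc_ctx.gc_urgent = 0;
	sbi->gc_thread = &sbi->gc_ctx;
}

/* remount with background_gc=off: stop the thread, keep the struct */
static void stop_gc_thread(struct sb_info *sbi)
{
	sbi->gc_thread->task_running = false; /* no kfree, no NULL */
}

/* a sysfs store handler can now dereference unconditionally */
static void set_gc_urgent(struct sb_info *sbi, unsigned int v)
{
	sbi->gc_thread->gc_urgent = v;
}
```

This trades a little always-resident memory for the guarantee that no path ever sees a dangling gc_thread pointer, which is the trade-off the thread is debating.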

> 
> Thanks,
> 
> > 
> >>
> >> Previously, we allocate memory for sbi->gc_thread based on background
> >> gc thread mount option, the memory can be released if we turn off
> >> that mount option, but still there are several places access gc_thread
> >> pointer without considering race conditions, resulting in a NULL pointer
> >> dereference.
> >>
> >> In order to fix this issue, keep gc_thread structure valid in sbi all
> >> the time instead of alloc/free it dynamically.
> >>
> >> Signed-off-by: Chao Yu 
> >> ---
> >>  fs/f2fs/debug.c   |  3 +--
> >>  fs/f2fs/f2fs.h|  7 +++
> >>  fs/f2fs/gc.c  | 58 
> >> +--
> >>  fs/f2fs/segment.c |  4 ++--
> >>  fs/f2fs/super.c   | 13 +++--
> >>  fs/f2fs/sysfs.c   |  8 
> >>  6 files changed, 60 insertions(+), 33 deletions(-)
> >>
> >> diff --git a/fs/f2fs/debug.c b/fs/f2fs/debug.c
> >> index 715beb85e9db..7bb036a3bb81 100644
> >> --- a/fs/f2fs/debug.c
> >> +++ b/fs/f2fs/debug.c
> >> @@ -223,8 +223,7 @@ static void update_mem_info(struct f2fs_sb_info *sbi)
> >>si->cache_mem = 0;
> >>  
> >>/* build gc */
> >> -  if (sbi->gc_thread)
> >> -  si->cache_mem += sizeof(struct f2fs_gc_kthread);
> >> +  si->cache_mem += sizeof(struct f2fs_gc_kthread);
> >>  
> >>/* build merge flush thread */
> >>if (SM_I(sbi)->fcc_info)
> >> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> >> index 567c6bb57ae3..c553f63199e8 100644
> >> --- a/fs/f2fs/f2fs.h
> >> +++ b/fs/f2fs/f2fs.h
> >> @@ -1412,6 +1412,11 @@ static inline struct sit_info *SIT_I(struct 
> >> f2fs_sb_info *sbi)
> >>return (struct sit_info *)(SM_I(sbi)->sit_info);
> >>  }
> >>  
> >> +static inline struct f2fs_gc_kthread *GC_I(struct f2fs_sb_info *sbi)
> >> +{
> >> +  return (struct f2fs_gc_kthread *)(sbi->gc_thread);
> >> +}
> >> +
> >>  static inline struct free_segmap_info *FREE_I(struct f2fs_sb_info *sbi)
> >>  {
> >>return (struct free_segmap_info *)(SM_I(sbi)->free_info);
> >> @@ -2954,6 +2959,8 @@ bool f2fs_overwrite_io(struct inode *inode, loff_t 
> >> pos, size_t len);
> >>  /*
> >>   * gc.c
> >>   */
> >> +int init_gc_context(struct f2fs_sb_info *sbi);
> >> +void destroy_gc_context(struct f2fs_sb_info * sbi);
> >>  int start_gc_thread(struct f2fs_sb_info *sbi);
> >>  void stop_gc_thread(struct f2fs_sb_info *sbi);
> >>  block_t start_bidx_of_node(unsigned int node_ofs, struct inode *inode);
> >> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
> >> index da89ca16a55d..7d310e454b77 100644
> >> --- a/fs/f2fs/gc.c
> >> +++ b/fs/f2fs/gc.c
> >> @@ -26,8 +26,8 @@
> >>  static int gc_thread_func(void *data)
> >>  {
> >>struct f2fs_sb_info *sbi = data;
> >> -  struct f2fs_gc_kthread *gc_th = sbi->gc_thread;
> >> -  wait_queue_head_t *wq = &sbi->gc_thread->gc_wait_queue_head;
> >> +  struct f2fs_gc_kthread *gc_th = GC_I(sbi);
> >> +  wait_queue_head_t *wq = &gc_th->gc_wait_queue_head;
> >>unsigned int wait_ms;
> >>  
> >>wait_ms = gc_th->min_sleep_time;
> >> @@ -114,17 +114,15 @@ static int gc_thread_func(void *data)
> >>return 0;
> >>  }
> >>  
> >> -int start_gc_thread(struct f2fs_sb_info *sbi)
> >> +int init_gc_context(struct f2fs_sb_info *sbi)
> >>  {
> >>struct f2fs_gc_kthread *gc_th;
> >> -  dev_t dev = sbi->sb->s_bdev->bd_dev;
> >> -  int err = 0;
> >>  
> >>gc_th = f2fs_kmalloc(sbi, sizeof(struct f2fs_gc_kthread), GFP_KERNEL);
> >> -  if (!gc_th) {
> >> -  err = -ENOMEM;
> >> -  goto out;
> >> -  }
> >> +  if (!gc_th)
> >> +  return -ENOMEM;
> >> +
> >> +  gc_th->f2fs_gc_task = NULL;
> >>  
> >>gc_th->urgent_sleep_time = DEF_GC_THREAD_URGENT_SLEEP_TIME;
> >>gc_th->min_sleep_time = DEF_GC_THREAD_MIN_SLEEP_TIME;
> >> @@ -139,26 +137,41 @@ int start_gc_thread(struct f2fs_sb_info *sbi)
> >>gc_th->atomic_file[FG_GC] = 0;
> >>  
> >>sbi->gc_thread = gc_th;
> >> -  init_waitqueue_head(&sbi->gc_thread->gc_wait_queue_head);
> >> -  sbi->gc_thread->f2fs_gc_task = kthread_run(gc_thread_func, sbi,
> >> +
> >> +  return 0;
> >> +}
> >> +
> >> +void 

Re: [RFC] vhost: introduce mdev based hardware vhost backend

2018-04-19 Thread Jason Wang



On 2018-04-20 02:40, Michael S. Tsirkin wrote:

On Tue, Apr 10, 2018 at 03:25:45PM +0800, Jason Wang wrote:

One problem is that, different virtio ring compatible devices
may have different device interfaces. That is to say, we will
need different drivers in QEMU. It could be troublesome. And
that's what this patch trying to fix. The idea behind this
patch is very simple: mdev is a standard way to emulate device
in kernel.

So you just move the abstraction layer from qemu to kernel, and you still
need different drivers in kernel for different device interfaces of
accelerators. This looks even more complex than leaving it in qemu. As you
said, another idea is to implement userspace vhost backend for accelerators
which seems easier and could co-work with other parts of qemu without
inventing new type of messages.

I'm not quite sure. Do you think it's acceptable to
add various vendor specific hardware drivers in QEMU?


I don't object but we need to figure out the advantages of doing it in qemu
too.

Thanks

To be frank kernel is exactly where device drivers belong.  DPDK did
move them to userspace but that's merely a requirement for data path.
*If* you can have them in kernel that is best:
- update kernel and there's no need to rebuild userspace


Well, you still need to rebuild userspace, since a new vhost backend is 
required which carries the vhost protocol over the mdev API. And I believe 
upgrading a userspace package is considered more lightweight than 
upgrading the kernel. With mdev, we're likely to repeat the story of the vhost 
API, dealing with features/versions and endlessly inventing new APIs for 
new features. And you will still need to rebuild the userspace.



- apps can be written in any language no need to maintain multiple
   libraries or add wrappers


This is not a big issue, considering it's not a generic network driver but an 
mdev driver; the only possible user is a VM.



- security concerns are much smaller (ok people are trying to
   raise the bar with IOMMUs and such, but it's already pretty
   good even without)


Well, I think not; kernel bugs are much more serious than userspace 
ones. And I bet the kernel driver itself won't be small.




The biggest issue is that you let userspace poke at the
device which is also allowed by the IOMMU to poke at
kernel memory (needed for kernel driver to work).


I don't quite get it. The userspace driver could be built on top of VFIO 
for sure, so kernel memory would be perfectly isolated in this case.




Yes, maybe if device is not buggy it's all fine, but
it's better if we do not have to trust the device
otherwise the security picture becomes more murky.

I suggested attaching a PASID to (some) queues - see my old post "using
PASIDs to enable a safe variant of direct ring access".

Then using IOMMU with VFIO to limit access through the queue to correct
ranges of memory.


Well, a userspace driver could benefit from this too. And we can even go 
further by using nested IO page tables to share an IOVA address space 
between devices and a VM.


Thanks



RE: [RFC] vhost: introduce mdev based hardware vhost backend

2018-04-19 Thread Liang, Cunming


> -Original Message-
> From: Bie, Tiwei
> Sent: Friday, April 20, 2018 11:28 AM
> To: Michael S. Tsirkin 
> Cc: Jason Wang ; alex.william...@redhat.com;
> ddut...@redhat.com; Duyck, Alexander H ;
> virtio-...@lists.oasis-open.org; linux-kernel@vger.kernel.org;
> k...@vger.kernel.org; virtualizat...@lists.linux-foundation.org;
> net...@vger.kernel.org; Daly, Dan ; Liang, Cunming
> ; Wang, Zhihong ; Tan,
> Jianfeng ; Wang, Xiao W ;
> Tian, Kevin 
> Subject: Re: [RFC] vhost: introduce mdev based hardware vhost backend
> 
> On Thu, Apr 19, 2018 at 09:40:23PM +0300, Michael S. Tsirkin wrote:
> > On Tue, Apr 10, 2018 at 03:25:45PM +0800, Jason Wang wrote:
> > > > > > One problem is that, different virtio ring compatible devices
> > > > > > may have different device interfaces. That is to say, we will
> > > > > > need different drivers in QEMU. It could be troublesome. And
> > > > > > that's what this patch trying to fix. The idea behind this
> > > > > > patch is very simple: mdev is a standard way to emulate device
> > > > > > in kernel.
> > > > > So you just move the abstraction layer from qemu to kernel, and
> > > > > you still need different drivers in kernel for different device
> > > > > interfaces of accelerators. This looks even more complex than
> > > > > leaving it in qemu. As you said, another idea is to implement
> > > > > userspace vhost backend for accelerators which seems easier and
> > > > > could co-work with other parts of qemu without inventing new type of
> messages.
> > > > I'm not quite sure. Do you think it's acceptable to add various
> > > > vendor specific hardware drivers in QEMU?
> > > >
> > >
> > > I don't object but we need to figure out the advantages of doing it
> > > in qemu too.
> > >
> > > Thanks
> >
> > To be frank kernel is exactly where device drivers belong.  DPDK did
> > move them to userspace but that's merely a requirement for data path.
> > *If* you can have them in kernel that is best:
> > - update kernel and there's no need to rebuild userspace
> > - apps can be written in any language no need to maintain multiple
> >   libraries or add wrappers
> > - security concerns are much smaller (ok people are trying to
> >   raise the bar with IOMMUs and such, but it's already pretty
> >   good even without)
> >
> > The biggest issue is that you let userspace poke at the device which
> > is also allowed by the IOMMU to poke at kernel memory (needed for
> > kernel driver to work).
> 
> I think the device won't and shouldn't be allowed to poke at kernel memory. 
> Its
> kernel driver needs some kernel memory to work. But the device doesn't have
> the access to them. Instead, the device only has the access to:
> 
> (1) the entire memory of the VM (if vIOMMU isn't used) or
> (2) the memory belongs to the guest virtio device (if
> vIOMMU is being used).
> 
> Below is the reason:
> 
> For the first case, we should program the IOMMU for the hardware device based
> on the info in the memory table which is the entire memory of the VM.
> 
> For the second case, we should program the IOMMU for the hardware device
> based on the info in the shadow page table of the vIOMMU.
> 
> So the memory that can be accessed by the device is limited, so it should be safe
> especially for the second case.
> 
> My concern is that, in this RFC, we don't program the IOMMU for the mdev
> device in the userspace via the VFIO API directly. Instead, we pass the memory
> table to the kernel driver via the mdev device (BAR0) and ask the driver to
> do the IOMMU programming. Some people may not like it. The main reason why we don't
> program IOMMU via VFIO API in userspace directly is that, currently IOMMU
> drivers don't support mdev bus.
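The memory-table step being debated can be modeled roughly like this. It is a pure bookkeeping sketch with illustrative names; in the real flow each region would additionally be handed to the IOMMU (for example via the VFIO_IOMMU_MAP_DMA ioctl when programming from userspace). The vhost memory table maps guest-physical ranges to host addresses, and only addresses covered by some region are reachable by the device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* one entry of a vhost-style memory table (illustrative layout) */
struct mem_region {
	uint64_t gpa;  /* guest-physical start */
	uint64_t hva;  /* host address backing it */
	uint64_t size;
};

/* returns the host address a device access to 'gpa' would reach,
 * or 0 when the address is outside every region (i.e. blocked) */
static uint64_t table_translate(const struct mem_region *tbl, size_t n,
				uint64_t gpa)
{
	for (size_t i = 0; i < n; i++)
		if (gpa >= tbl[i].gpa && gpa - tbl[i].gpa < tbl[i].size)
			return tbl[i].hva + (gpa - tbl[i].gpa);
	return 0;
}
```

This is exactly why case (1) bounds the device to the whole VM and case (2) to the vIOMMU-mapped subset: the table (and hence the IOMMU programming derived from it) contains nothing else.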
> 
> >
> > Yes, maybe if device is not buggy it's all fine, but it's better if we
> > do not have to trust the device otherwise the security picture becomes
> > more murky.
> >
> > I suggested attaching a PASID to (some) queues - see my old post
> > "using PASIDs to enable a safe variant of direct ring access".
> 
Ideally we can have a device bound to a normal driver in the host, while also 
supporting on-demand allocation of a few queues, each attached to a PASID. Via 
the vhost mdev transport channel, the data-path capability of those queues (as 
a device) can be exposed to the qemu vhost adaptor as a vDPA instance. Then we 
can avoid the VF number limitation, providing vhost data-path acceleration at a 
finer granularity.

> It's pretty cool. We also have some similar ideas.
> Cunming will talk more about this.
> 
> Best regards,
> Tiwei Bie
> 
> >
> > Then using IOMMU with VFIO to limit access through the queue to correct
> > ranges of memory.
> >
> >
> > --
> > MST


Re: [RFC] vhost: introduce mdev based hardware vhost backend

2018-04-19 Thread Michael S. Tsirkin
On Fri, Apr 20, 2018 at 11:28:07AM +0800, Tiwei Bie wrote:
> On Thu, Apr 19, 2018 at 09:40:23PM +0300, Michael S. Tsirkin wrote:
> > On Tue, Apr 10, 2018 at 03:25:45PM +0800, Jason Wang wrote:
> > > > > > One problem is that, different virtio ring compatible devices
> > > > > > may have different device interfaces. That is to say, we will
> > > > > > need different drivers in QEMU. It could be troublesome. And
> > > > > > that's what this patch trying to fix. The idea behind this
> > > > > > patch is very simple: mdev is a standard way to emulate device
> > > > > > in kernel.
> > > > > So you just move the abstraction layer from qemu to kernel, and you 
> > > > > still
> > > > > need different drivers in kernel for different device interfaces of
> > > > > accelerators. This looks even more complex than leaving it in qemu. 
> > > > > As you
> > > > > said, another idea is to implement userspace vhost backend for 
> > > > > accelerators
> > > > > which seems easier and could co-work with other parts of qemu without
> > > > > inventing new type of messages.
> > > > I'm not quite sure. Do you think it's acceptable to
> > > > add various vendor specific hardware drivers in QEMU?
> > > > 
> > > 
> > > I don't object but we need to figure out the advantages of doing it in 
> > > qemu
> > > too.
> > > 
> > > Thanks
> > 
> > To be frank kernel is exactly where device drivers belong.  DPDK did
> > move them to userspace but that's merely a requirement for data path.
> > *If* you can have them in kernel that is best:
> > - update kernel and there's no need to rebuild userspace
> > - apps can be written in any language no need to maintain multiple
> >   libraries or add wrappers
> > - security concerns are much smaller (ok people are trying to
> >   raise the bar with IOMMUs and such, but it's already pretty
> >   good even without)
> > 
> > The biggest issue is that you let userspace poke at the
> > device which is also allowed by the IOMMU to poke at
> > kernel memory (needed for kernel driver to work).
> 
> I think the device won't and shouldn't be allowed to
> poke at kernel memory. Its kernel driver needs some
> kernel memory to work. But the device doesn't have
> the access to them. Instead, the device only has the
> access to:
> 
> (1) the entire memory of the VM (if vIOMMU isn't used)
> or
> (2) the memory belongs to the guest virtio device (if
> vIOMMU is being used).
> 
> Below is the reason:
> 
> For the first case, we should program the IOMMU for
> the hardware device based on the info in the memory
> table which is the entire memory of the VM.
> 
> For the second case, we should program the IOMMU for
> the hardware device based on the info in the shadow
> page table of the vIOMMU.
> 
> So the memory that can be accessed by the device is limited,
> it should be safe especially for the second case.
> 
> My concern is that, in this RFC, we don't program the
> IOMMU for the mdev device in the userspace via the VFIO
> API directly. Instead, we pass the memory table to the
> kernel driver via the mdev device (BAR0) and ask the
> driver to do the IOMMU programming. Some people may not
> like it. The main reason why we don't program IOMMU via
> VFIO API in userspace directly is that, currently IOMMU
> drivers don't support mdev bus.

But it is a pci device after all, isn't it?
IOMMU drivers certainly support that ...

Another issue with this approach is that internal
kernel issues leak out to the interface.

> > 
> > Yes, maybe if device is not buggy it's all fine, but
> > it's better if we do not have to trust the device
> > otherwise the security picture becomes more murky.
> > 
> > I suggested attaching a PASID to (some) queues - see my old post "using
> > PASIDs to enable a safe variant of direct ring access".
> 
> It's pretty cool. We also have some similar ideas.
> Cunming will talk more about this.
> 
> Best regards,
> Tiwei Bie

An extra benefit to this could be that requests with PASID
undergo an extra level of translation.
We could use it to avoid the need for shadowing on intel.



Something like this:
- expose to guest a standard virtio device (no pasid support)
- back it by virtio device with pasid support on the host
  by attaching same pasid to all queues

now - guest will build 1 level of page tables

we build first level page tables for requests with pasid
and point the IOMMU to use the guest supplied page tables
for the second level of translation.

Now we do need to forward invalidations but we no
longer need to set the CM bit and shadow valid entries.



> > 
> > Then using IOMMU with VFIO to limit access through queue to corrent
> > ranges of memory.
> > 
> > 
> > -- 
> > MST


RE: [RFC] vhost: introduce mdev based hardware vhost backend

2018-04-19 Thread Liang, Cunming


> -Original Message-
> From: Bie, Tiwei
> Sent: Friday, April 20, 2018 11:28 AM
> To: Michael S. Tsirkin 
> Cc: Jason Wang ; alex.william...@redhat.com;
> ddut...@redhat.com; Duyck, Alexander H ;
> virtio-...@lists.oasis-open.org; linux-kernel@vger.kernel.org;
> k...@vger.kernel.org; virtualizat...@lists.linux-foundation.org;
> net...@vger.kernel.org; Daly, Dan ; Liang, Cunming
> ; Wang, Zhihong ; Tan,
> Jianfeng ; Wang, Xiao W ;
> Tian, Kevin 
> Subject: Re: [RFC] vhost: introduce mdev based hardware vhost backend
> 
> On Thu, Apr 19, 2018 at 09:40:23PM +0300, Michael S. Tsirkin wrote:
> > On Tue, Apr 10, 2018 at 03:25:45PM +0800, Jason Wang wrote:
> > > > > > One problem is that, different virtio ring compatible devices
> > > > > > may have different device interfaces. That is to say, we will
> > > > > > need different drivers in QEMU. It could be troublesome. And
> > > > > > that's what this patch trying to fix. The idea behind this
> > > > > > patch is very simple: mdev is a standard way to emulate device
> > > > > > in kernel.
> > > > > So you just move the abstraction layer from qemu to kernel, and
> > > > > you still need different drivers in kernel for different device
> > > > > interfaces of accelerators. This looks even more complex than
> > > > > leaving it in qemu. As you said, another idea is to implement
> > > > > userspace vhost backend for accelerators which seems easier and
> > > > > could co-work with other parts of qemu without inventing new type of
> messages.
> > > > I'm not quite sure. Do you think it's acceptable to add various
> > > > vendor specific hardware drivers in QEMU?
> > > >
> > >
> > > I don't object but we need to figure out the advantages of doing it
> > > in qemu too.
> > >
> > > Thanks
> >
> > To be frank kernel is exactly where device drivers belong.  DPDK did
> > move them to userspace but that's merely a requirement for data path.
> > *If* you can have them in kernel that is best:
> > - update kernel and there's no need to rebuild userspace
> > - apps can be written in any language no need to maintain multiple
> >   libraries or add wrappers
> > - security concerns are much smaller (ok people are trying to
> >   raise the bar with IOMMUs and such, but it's already pretty
> >   good even without)
> >
> > The biggest issue is that you let userspace poke at the device which
> > is also allowed by the IOMMU to poke at kernel memory (needed for
> > kernel driver to work).
> 
> I think the device won't and shouldn't be allowed to poke at kernel memory.
> Its kernel driver needs some kernel memory to work. But the device doesn't
> have access to it. Instead, the device only has access to:
> 
> (1) the entire memory of the VM (if vIOMMU isn't used) or
> (2) the memory that belongs to the guest virtio device (if
> vIOMMU is being used).
> 
> Below is the reason:
> 
> For the first case, we should program the IOMMU for the hardware device based
> on the info in the memory table which is the entire memory of the VM.
> 
> For the second case, we should program the IOMMU for the hardware device
> based on the info in the shadow page table of the vIOMMU.
> 
> So the memory that can be accessed by the device is limited; it should be
> safe, especially in the second case.
> 
> My concern is that, in this RFC, we don't program the IOMMU for the mdev
> device in the userspace via the VFIO API directly. Instead, we pass the memory
> table to the kernel driver via the mdev device (BAR0) and ask the driver to 
> do the
> IOMMU programming. Someone may not like it. The main reason why we don't
> program IOMMU via VFIO API in userspace directly is that, currently IOMMU
> drivers don't support mdev bus.
> 
> >
> > Yes, maybe if device is not buggy it's all fine, but it's better if we
> > do not have to trust the device otherwise the security picture becomes
> > more murky.
> >
> > I suggested attaching a PASID to (some) queues - see my old post
> > "using PASIDs to enable a safe variant of direct ring access".
> 
Ideally, we can have the device bound to a normal driver in the host, while 
also supporting on-demand allocation of a few queues attached to a PASID. 
Through the vhost mdev transport channel, the data path capability of those 
queues (as a device) can be exposed to the qemu vhost adaptor as a vDPA 
instance. That way we can avoid the VF number limitation and provide vhost 
data path acceleration at a finer granularity.

> It's pretty cool. We also have some similar ideas.
> Cunming will talk more about this.
> 
> Best regards,
> Tiwei Bie
> 
> >
> > Then using IOMMU with VFIO to limit access through queue to correct
> > ranges of memory.
> >
> >
> > --
> > MST


Re: [RFC] vhost: introduce mdev based hardware vhost backend

2018-04-19 Thread Michael S. Tsirkin
On Fri, Apr 20, 2018 at 11:28:07AM +0800, Tiwei Bie wrote:
> On Thu, Apr 19, 2018 at 09:40:23PM +0300, Michael S. Tsirkin wrote:
> > On Tue, Apr 10, 2018 at 03:25:45PM +0800, Jason Wang wrote:
> > > > > > One problem is that, different virtio ring compatible devices
> > > > > > may have different device interfaces. That is to say, we will
> > > > > > need different drivers in QEMU. It could be troublesome. And
> > > > > > that's what this patch trying to fix. The idea behind this
> > > > > > patch is very simple: mdev is a standard way to emulate device
> > > > > > in kernel.
> > > > > So you just move the abstraction layer from qemu to kernel, and you 
> > > > > still
> > > > > need different drivers in kernel for different device interfaces of
> > > > > accelerators. This looks even more complex than leaving it in qemu. 
> > > > > As you
> > > > > said, another idea is to implement userspace vhost backend for 
> > > > > accelerators
> > > > > which seems easier and could co-work with other parts of qemu without
> > > > > inventing new type of messages.
> > > > I'm not quite sure. Do you think it's acceptable to
> > > > add various vendor specific hardware drivers in QEMU?
> > > > 
> > > 
> > > I don't object but we need to figure out the advantages of doing it in 
> > > qemu
> > > too.
> > > 
> > > Thanks
> > 
> > To be frank kernel is exactly where device drivers belong.  DPDK did
> > move them to userspace but that's merely a requirement for data path.
> > *If* you can have them in kernel that is best:
> > - update kernel and there's no need to rebuild userspace
> > - apps can be written in any language no need to maintain multiple
> >   libraries or add wrappers
> > - security concerns are much smaller (ok people are trying to
> >   raise the bar with IOMMUs and such, but it's already pretty
> >   good even without)
> > 
> > The biggest issue is that you let userspace poke at the
> > device which is also allowed by the IOMMU to poke at
> > kernel memory (needed for kernel driver to work).
> 
> I think the device won't and shouldn't be allowed to
> poke at kernel memory. Its kernel driver needs some
> kernel memory to work. But the device doesn't have
> access to it. Instead, the device only has the
> access to:
> 
> (1) the entire memory of the VM (if vIOMMU isn't used)
> or
> (2) the memory that belongs to the guest virtio device (if
> vIOMMU is being used).
> 
> Below is the reason:
> 
> For the first case, we should program the IOMMU for
> the hardware device based on the info in the memory
> table which is the entire memory of the VM.
> 
> For the second case, we should program the IOMMU for
> the hardware device based on the info in the shadow
> page table of the vIOMMU.
> 
> So the memory that can be accessed by the device is limited;
> it should be safe, especially in the second case.
> 
> My concern is that, in this RFC, we don't program the
> IOMMU for the mdev device in the userspace via the VFIO
> API directly. Instead, we pass the memory table to the
> kernel driver via the mdev device (BAR0) and ask the
> driver to do the IOMMU programming. Someone may not
> like it. The main reason why we don't program IOMMU via
> VFIO API in userspace directly is that, currently IOMMU
> drivers don't support mdev bus.

But it is a pci device after all, isn't it?
IOMMU drivers certainly support that ...

Another issue with this approach is that internal
kernel issues leak out to the interface.

> > 
> > Yes, maybe if device is not buggy it's all fine, but
> > it's better if we do not have to trust the device
> > otherwise the security picture becomes more murky.
> > 
> > I suggested attaching a PASID to (some) queues - see my old post "using
> > PASIDs to enable a safe variant of direct ring access".
> 
> It's pretty cool. We also have some similar ideas.
> Cunming will talk more about this.
> 
> Best regards,
> Tiwei Bie

An extra benefit to this could be that requests with PASID
undergo an extra level of translation.
We could use it to avoid the need for shadowing on intel.



Something like this:
- expose to guest a standard virtio device (no pasid support)
- back it by virtio device with pasid support on the host
  by attaching same pasid to all queues

now - guest will build 1 level of page tables

we build first level page tables for requests with pasid
and point the IOMMU to use the guest supplied page tables
for the second level of translation.

Now we do need to forward invalidations but we no
longer need to set the CM bit and shadow valid entries.



> > 
> > Then using IOMMU with VFIO to limit access through queue to correct
> > ranges of memory.
> > 
> > 
> > -- 
> > MST


Re: [PATCH v1 5/7] soc: mediatek: add a fixed wait for SRAM stable

2018-04-19 Thread Sean Wang
On Thu, 2018-04-19 at 12:33 +0200, Matthias Brugger wrote:
> 
> On 04/03/2018 09:15 AM, sean.w...@mediatek.com wrote:
> > From: Sean Wang 
> > 
> > MT7622_POWER_DOMAIN_WB doesn't send an ACK when its managed SRAM becomes
> > stable, unlike the other power domains. Therefore, such a power domain
> > needs a fixed, well-defined wait time until its managed SRAM can be
> > accessed by all functions running on top of it.
> > 
> > Signed-off-by: Sean Wang 
> > Cc: Matthias Brugger 
> > Cc: Ulf Hansson 
> > Cc: Weiyi Lu 
> > ---
> >  drivers/soc/mediatek/mtk-scpsys.c | 17 -
> >  1 file changed, 12 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/soc/mediatek/mtk-scpsys.c 
> > b/drivers/soc/mediatek/mtk-scpsys.c
> > index f9b7248..19aceb8 100644
> > --- a/drivers/soc/mediatek/mtk-scpsys.c
> > +++ b/drivers/soc/mediatek/mtk-scpsys.c
> > @@ -121,6 +121,7 @@ struct scp_domain_data {
> > u32 bus_prot_mask;
> > enum clk_id clk_id[MAX_CLKS];
> > bool active_wakeup;
> > +   u32 us_sram_fwait;
> 
> Before adding more and more fields to scp_domain_data which get checked in 
> if's,
> I'd prefer to add a caps field used for bus_prot_mask, active_wakeup in a 
> first
> patch and add the cap FORCE_WAIT in a second patch.
> 
> Can you help to implement this Sean, or shall I give it a try?
> 

Sure, I'm willing to do it, and then we can see whether you like it.

thanks!

> Regards,
> Matthias
> 
> >  };
> >  
> >  struct scp;
> > @@ -234,11 +235,16 @@ static int scpsys_power_on(struct generic_pm_domain 
> > *genpd)
> > val &= ~scpd->data->sram_pdn_bits;
> > writel(val, ctl_addr);
> >  
> > -   /* wait until SRAM_PDN_ACK all 0 */
> > -   ret = readl_poll_timeout(ctl_addr, tmp, (tmp & pdn_ack) == 0,
> > -MTK_POLL_DELAY_US, MTK_POLL_TIMEOUT);
> > -   if (ret < 0)
> > -   goto err_pwr_ack;
> > +   /* Either wait until SRAM_PDN_ACK all 0 or have a force wait */
> > +   if (!scpd->data->us_sram_fwait) {
> > +   ret = readl_poll_timeout(ctl_addr, tmp, (tmp & pdn_ack) == 0,
> > +MTK_POLL_DELAY_US, MTK_POLL_TIMEOUT);
> > +   if (ret < 0)
> > +   goto err_pwr_ack;
> > +   } else {
> > +   usleep_range(scpd->data->us_sram_fwait,
> > +scpd->data->us_sram_fwait + 100);
> > +   };
> >  
> > if (scpd->data->bus_prot_mask) {
> > ret = mtk_infracfg_clear_bus_protection(scp->infracfg,
> > @@ -783,6 +789,7 @@ static const struct scp_domain_data 
> > scp_domain_data_mt7622[] = {
> > .clk_id = {CLK_NONE},
> > .bus_prot_mask = MT7622_TOP_AXI_PROT_EN_WB,
> > .active_wakeup = true,
> > +   .us_sram_fwait = 12000,
> > },
> >  };
> >  
> > 





Re: [PATCH v1 4/7] soc: mediatek: reuse regmap_read_poll_timeout helpers

2018-04-19 Thread Sean Wang
On Thu, 2018-04-19 at 12:23 +0200, Matthias Brugger wrote:
> 
> On 04/03/2018 09:15 AM, sean.w...@mediatek.com wrote:
> > From: Sean Wang 
> > 
> > Reuse the common helpers regmap_read_poll_timeout provided by Linux core
> > instead of an open-coded handling.
> > 
> > Signed-off-by: Sean Wang 
> > Cc: Matthias Brugger 
> > Cc: Ulf Hansson 
> > Cc: Weiyi Lu 
> > ---
> >  drivers/soc/mediatek/mtk-infracfg.c | 45 
> > +
> >  1 file changed, 10 insertions(+), 35 deletions(-)
> > 
> > diff --git a/drivers/soc/mediatek/mtk-infracfg.c 
> > b/drivers/soc/mediatek/mtk-infracfg.c
> > index 8c310de..b849aa5 100644
> > --- a/drivers/soc/mediatek/mtk-infracfg.c
> > +++ b/drivers/soc/mediatek/mtk-infracfg.c
> > @@ -12,6 +12,7 @@
> >   */
> >  
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -37,7 +38,6 @@
> >  int mtk_infracfg_set_bus_protection(struct regmap *infracfg, u32 mask,
> > bool reg_update)
> >  {
> > -   unsigned long expired;
> > u32 val;
> > int ret;
> >  
> > @@ -47,22 +47,11 @@ int mtk_infracfg_set_bus_protection(struct regmap 
> > *infracfg, u32 mask,
> > else
> > regmap_write(infracfg, INFRA_TOPAXI_PROTECTEN_SET, mask);
> >  
> > -   expired = jiffies + HZ;
> > +   ret = regmap_read_poll_timeout(infracfg, INFRA_TOPAXI_PROTECTSTA1,
> > +  val, (val & mask) == mask, 10,
> > +  jiffies_to_usecs(HZ));
> 
> To align with the changes in scpsys, please define MTK_POLL_DELAY_US and
> MTK_POLL_TIMEOUT. I'm not really fan of passing macros as function arguments.
> 

Agreed. I'll work on an improvement for it.

thanks!

> Other than that, the patch looks good.
> 
> Thanks a lot,
> Matthias
> 
> >  
> > -   while (1) {
> > -   ret = regmap_read(infracfg, INFRA_TOPAXI_PROTECTSTA1, &val);
> > -   if (ret)
> > -   return ret;
> > -
> > -   if ((val & mask) == mask)
> > -   break;
> > -
> > -   cpu_relax();
> > -   if (time_after(jiffies, expired))
> > -   return -EIO;
> > -   }
> > -
> > -   return 0;
> > +   return ret;
> >  }
> >  
> >  /**
> > @@ -80,30 +69,16 @@ int mtk_infracfg_set_bus_protection(struct regmap 
> > *infracfg, u32 mask,
> >  int mtk_infracfg_clear_bus_protection(struct regmap *infracfg, u32 mask,
> > bool reg_update)
> >  {
> > -   unsigned long expired;
> > int ret;
> > +   u32 val;
> >  
> > if (reg_update)
> > regmap_update_bits(infracfg, INFRA_TOPAXI_PROTECTEN, mask, 0);
> > else
> > regmap_write(infracfg, INFRA_TOPAXI_PROTECTEN_CLR, mask);
> >  
> > -   expired = jiffies + HZ;
> > -
> > -   while (1) {
> > -   u32 val;
> > -
> > -   ret = regmap_read(infracfg, INFRA_TOPAXI_PROTECTSTA1, &val);
> > -   if (ret)
> > -   return ret;
> > -
> > -   if (!(val & mask))
> > -   break;
> > -
> > -   cpu_relax();
> > -   if (time_after(jiffies, expired))
> > -   return -EIO;
> > -   }
> > -
> > -   return 0;
> > +   ret = regmap_read_poll_timeout(infracfg, INFRA_TOPAXI_PROTECTSTA1,
> > +  val, !(val & mask), 10,
> > +  jiffies_to_usecs(HZ));
> > +   return ret;
> >  }
> > 





Re: [PATCH] f2fs: sepearte hot/cold in free nid

2018-04-19 Thread Jaegeuk Kim
On 04/20, Chao Yu wrote:
> As most indirect node, dindirect node, and xattr node won't be updated
> after they are created, but inode node and other direct node will change
> more frequently, so store their nat entries mixedly in whole nat table
> will suffer:
> - fragment nat table soon due to different update rate
> - more nat block update due to fragmented nat table
> 
> In order to solve above issue, we're trying to separate whole nat table to
> two part:
> a. Hot free nid area:
>  - range: [nid #0, nid #x)
>  - store node block address for
>* inode node
>* other direct node
> b. Cold free nid area:
>  - range: [nid #x, max nid)
>  - store node block address for
>* indirect node
>* dindirect node
>* xattr node
> 
> Allocation strategy example:
> 
> Free nid: '-'
> Used nid: '='
> 
> 1. Initial status:
> Free Nids:
> |---|
>   ^   ^   ^   
> ^
> Alloc Range:  |---|   
> |---|
>   hot_start   hot_end 
> cold_start  cold_end
> 
> 2. Free nids have ran out:
> Free Nids:
> |===-===|
>   ^   ^   ^   
> ^
> Alloc Range:  |===|   
> |===|
>   hot_start   hot_end 
> cold_start  cold_end
> 
> 3. Expand hot/cold area range:
> Free Nids:
> |===-===|
>   ^   ^   ^   
> ^
> Alloc Range:  |===|   
> |===|
>   hot_start   hot_end cold_start  
> cold_end
> 
> 4. Hot free nids have ran out:
> Free Nids:
> |===-===|
>   ^   ^   ^   
> ^
> Alloc Range:  |===|   
> |===|
>   hot_start   hot_end cold_start  
> cold_end
> 
> 5. Expand hot area range, hot/cold area boundary has been fixed:
> Free Nids:
> |===-===|
>   ^   ^   
> ^
> Alloc Range:  
> |===|===|
>   hot_start   hot_end(cold_start) 
> cold_end
> 
> Run xfstests with generic/*:
> 
> before
> node_write:   169660
> cp_count: 60118
> node/cp   2.82
> 
> after:
> node_write:   159145
> cp_count: 84501
> node/cp:  2.64

Nice trial, though I don't see much benefit from this huge patch. I guess we may
be able to find an efficient way to address this issue rather than changing this
much stable code.

How about getting a free nid in the list from head or tail separately?

> 
> Signed-off-by: Chao Yu 
> ---
>  fs/f2fs/checkpoint.c |   4 -
>  fs/f2fs/debug.c  |   6 +-
>  fs/f2fs/f2fs.h   |  19 +++-
>  fs/f2fs/inode.c  |   2 +-
>  fs/f2fs/namei.c  |   2 +-
>  fs/f2fs/node.c   | 302 
> ---
>  fs/f2fs/node.h   |  17 +--
>  fs/f2fs/segment.c|   8 +-
>  fs/f2fs/shrinker.c   |   3 +-
>  fs/f2fs/xattr.c  |  10 +-
>  10 files changed, 221 insertions(+), 152 deletions(-)
> 
> diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
> index 96785ffc6181..c17feec72c74 100644
> --- a/fs/f2fs/checkpoint.c
> +++ b/fs/f2fs/checkpoint.c
> @@ -1029,14 +1029,10 @@ int f2fs_sync_inode_meta(struct f2fs_sb_info *sbi)
>  static void __prepare_cp_block(struct f2fs_sb_info *sbi)
>  {
>   struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
> - struct f2fs_nm_info *nm_i = NM_I(sbi);
> - nid_t last_nid = nm_i->next_scan_nid;
>  
> - next_free_nid(sbi, &last_nid);
>   ckpt->valid_block_count = cpu_to_le64(valid_user_blocks(sbi));
>   ckpt->valid_node_count = cpu_to_le32(valid_node_count(sbi));
>   ckpt->valid_inode_count = cpu_to_le32(valid_inode_count(sbi));
> - ckpt->next_free_nid = cpu_to_le32(last_nid);
>  }
>  
>  /*
> diff --git a/fs/f2fs/debug.c b/fs/f2fs/debug.c
> index 7bb036a3bb81..b13c1d4f110f 100644
> --- a/fs/f2fs/debug.c
> +++ b/fs/f2fs/debug.c
> @@ -100,7 +100,8 @@ static void update_general_status(struct f2fs_sb_info 
> *sbi)
>   si->dirty_nats = NM_I(sbi)->dirty_nat_cnt;
>   si->sits = MAIN_SEGS(sbi);
>   si->dirty_sits = SIT_I(sbi)->dirty_sentries;
> - 


Re: general protection fault in kernfs_kill_sb

2018-04-19 Thread Eric Biggers
On Thu, Apr 19, 2018 at 07:44:40PM -0700, Eric Biggers wrote:
> On Mon, Apr 02, 2018 at 03:34:15PM +0100, Al Viro wrote:
> > On Mon, Apr 02, 2018 at 07:40:22PM +0900, Tetsuo Handa wrote:
> > 
> > > That commit assumes that calling kill_sb() from deactivate_locked_super(s)
> > > without corresponding fill_super() is safe. We have so far crashed with
> > > rpc_mount() and kernfs_mount_ns(). Is that really safe?
> > 
> > Consider the case when fill_super() returns an error immediately.
> > It is exactly the same situation.  And ->kill_sb() *is* called in cases
> > when fill_super() has failed.  Always had been - it's much less boilerplate
> > that way.
> > 
> > deactivate_locked_super() on that failure exit is the least painful
> > variant, unfortunately.
> > 
> > Filesystems with ->kill_sb() instances that rely upon something
> > done between sget() and the first failure exit after it need to be fixed.
> > And yes, that should've been spotted back then.  Sorry.
> > 
> > Fortunately, we don't have many of those - kill_{block,litter,anon}_super()
> > are safe and those are the majority.  Looking through the rest uncovers
> > some bugs; so far all I've seen were already there.  Note that normally
> > we have something like
> > static void affs_kill_sb(struct super_block *sb)
> > {
> > struct affs_sb_info *sbi = AFFS_SB(sb);
> > kill_block_super(sb);
> > if (sbi) {
> > affs_free_bitmap(sb);
> > affs_brelse(sbi->s_root_bh);
> > kfree(sbi->s_prefix);
> > mutex_destroy(&sbi->s_bmlock);
> > kfree(sbi);
> > }
> > }
> > which basically does one of the safe ones augmented with something that
> > takes care *not* to assume that e.g. ->s_fs_info has been allocated.
> > Not everyone does, though:
> > 
> > jffs2_fill_super():
> > c = kzalloc(sizeof(*c), GFP_KERNEL);
> > if (!c)
> > return -ENOMEM;
> > in the very beginning.  So we can return from it with NULL ->s_fs_info.
> > Now, consider
> > struct jffs2_sb_info *c = JFFS2_SB_INFO(sb);
> > if (!(sb->s_flags & MS_RDONLY))
> > jffs2_stop_garbage_collect_thread(c);
> > in jffs2_kill_sb() and
> > void jffs2_stop_garbage_collect_thread(struct jffs2_sb_info *c)
> > {
> > int wait = 0;
> > spin_lock(&c->erase_completion_lock);
> > if (c->gc_task) {
> > 
> > IOW, fail that kzalloc() (or, indeed, an allocation in register_shrinker())
> > and eat an oops.  Always had been there, always hard to hit without
> > fault injectors and fortunately trivial to fix.
> > 
> > Similar in nfs_kill_super() calling nfs_free_server().
> > Similar in v9fs_kill_super() with 
> > v9fs_session_cancel()/v9fs_session_close() calls.
> > Similar in hypfs_kill_super(), afs_kill_super(), btrfs_kill_super(), 
> > cifs_kill_sb()
> > (all trivial to fix)
> > 
> > Aha... nfsd_umount() is a new regression.
> > 
> > orangefs: old, trivial to fix.
> > 
> > cgroup_kill_sb(): old, hopefully easy to fix.  Note that 
> > kernfs_root_from_sb()
> > can bloody well return NULL, making cgroup_root_from_kf() oops.  Always had 
> > been
> > there.
> > 
> > AFAICS, after discarding the instances that do the right thing we are left 
> > with:
> > hypfs_kill_super, rdt_kill_sb, v9fs_kill_super, afs_kill_super, 
> > btrfs_kill_super,
> > cifs_kill_sb, jffs2_kill_sb, nfs_kill_super, nfsd_umount, orangefs_kill_sb,
> > proc_kill_sb, sysfs_kill_sb, cgroup_kill_sb, rpc_kill_sb.
> > 
> > Out of those, nfsd_umount(), proc_kill_sb() and rpc_kill_sb() are 
> > regressions.
> > So are rdt_kill_sb() and sysfs_kill_sb() (victims of the issue you've 
> > spotted
> > in kernfs_kill_sb()).  The rest are old (and I wonder if syzbot had been
> > catching those - they are also dependent upon a specific allocation failing
> > at the right time).
> > 
> 
> Fix for the kernfs bug is now queued in vfs/for-linus:
> 
> #syz fix: kernfs: deal with early sget() failures
> 

But, there is still a related bug: when mounting sysfs, if register_shrinker()
fails in sget_userns(), then kernfs_kill_sb() gets called, which frees the
'struct kernfs_super_info'.  But, the 'struct kernfs_super_info' is also freed
in kernfs_mount_ns() by:

sb = sget_userns(fs_type, kernfs_test_super, kernfs_set_super, flags,
 _user_ns, info);
if (IS_ERR(sb) || sb->s_fs_info != info)
kfree(info);
if (IS_ERR(sb))
return ERR_CAST(sb);

I guess the problem is that sget_userns() shouldn't take ownership of the 'info'
if it returns an error -- but, it actually does if register_shrinker() fails,
resulting in a double free.

Here is a reproducer and the KASAN splat.  This is on Linus' tree (87ef12027b9b)
with vfs/for-linus merged in.

#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 
#include 
#include 

int main()
{
int fd, i;
char buf[16];

unshare(CLONE_NEWNET);
 

Re: general protection fault in kernfs_kill_sb

2018-04-19 Thread Eric Biggers
On Thu, Apr 19, 2018 at 07:44:40PM -0700, Eric Biggers wrote:
> On Mon, Apr 02, 2018 at 03:34:15PM +0100, Al Viro wrote:
> > On Mon, Apr 02, 2018 at 07:40:22PM +0900, Tetsuo Handa wrote:
> > 
> > > That commit assumes that calling kill_sb() from deactivate_locked_super(s)
> > > without corresponding fill_super() is safe. We have so far crashed with
> > > rpc_mount() and kernfs_mount_ns(). Is that really safe?
> > 
> > Consider the case when fill_super() returns an error immediately.
> > It is exactly the same situation.  And ->kill_sb() *is* called in cases
> > when fill_super() has failed.  Always had been - it's much less boilerplate
> > that way.
> > 
> > deactivate_locked_super() on that failure exit is the least painful
> > variant, unfortunately.
> > 
> > Filesystems with ->kill_sb() instances that rely upon something
> > done between sget() and the first failure exit after it need to be fixed.
> > And yes, that should've been spotted back then.  Sorry.
> > 
> > Fortunately, we don't have many of those - kill_{block,litter,anon}_super()
> > are safe and those are the majority.  Looking through the rest uncovers
> > some bugs; so far all I've seen were already there.  Note that normally
> > we have something like
> > static void affs_kill_sb(struct super_block *sb)
> > {
> > struct affs_sb_info *sbi = AFFS_SB(sb);
> > kill_block_super(sb);
> > if (sbi) {
> > affs_free_bitmap(sb);
> > affs_brelse(sbi->s_root_bh);
> > kfree(sbi->s_prefix);
> > mutex_destroy(&sbi->s_bmlock);
> > kfree(sbi);
> > }
> > }
> > which basically does one of the safe ones augmented with something that
> > takes care *not* to assume that e.g. ->s_fs_info has been allocated.
> > Not everyone does, though:
> > 
> > jffs2_fill_super():
> > c = kzalloc(sizeof(*c), GFP_KERNEL);
> > if (!c)
> > return -ENOMEM;
> > in the very beginning.  So we can return from it with NULL ->s_fs_info.
> > Now, consider
> > struct jffs2_sb_info *c = JFFS2_SB_INFO(sb);
> > if (!(sb->s_flags & MS_RDONLY))
> > jffs2_stop_garbage_collect_thread(c);
> > in jffs2_kill_sb() and
> > void jffs2_stop_garbage_collect_thread(struct jffs2_sb_info *c)
> > {
> > int wait = 0;
> > spin_lock(&c->erase_completion_lock);
> > if (c->gc_task) {
> > 
> > IOW, fail that kzalloc() (or, indeed, an allocation in register_shrinker())
> > and eat an oops.  Always had been there, always hard to hit without
> > fault injectors and fortunately trivial to fix.
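A minimal userspace sketch of the defensive teardown described above (hypothetical names, plain C model, not the actual jffs2 code): every step of ->kill_sb() must tolerate running before fill_super() allocated anything.

```c
#include <assert.h>
#include <stdlib.h>
#include <stddef.h>

/* Model: ->kill_sb() may run with a NULL ->s_fs_info when fill_super()
 * (or even sget() itself) failed early.  All names are hypothetical. */

struct super_block { void *s_fs_info; };

static int stopped_gc;                  /* records whether teardown ran */

static void stop_gc_thread(void *c)
{
    if (!c)                             /* the check jffs2_kill_sb() lacks */
        return;
    stopped_gc = 1;
}

static void kill_sb(struct super_block *sb)
{
    stop_gc_thread(sb->s_fs_info);      /* safe even on early failure */
    free(sb->s_fs_info);                /* free(NULL) is a no-op */
    sb->s_fs_info = NULL;
}
```

The affs_kill_sb() pattern quoted above is exactly this: check the private pointer before touching anything hanging off it.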
> > 
> > Similar in nfs_kill_super() calling nfs_free_server().
> > Similar in v9fs_kill_super() with 
> > v9fs_session_cancel()/v9fs_session_close() calls.
> > Similar in hypfs_kill_super(), afs_kill_super(), btrfs_kill_super(), 
> > cifs_kill_sb()
> > (all trivial to fix)
> > 
> > Aha... nfsd_umount() is a new regression.
> > 
> > orangefs: old, trivial to fix.
> > 
> > cgroup_kill_sb(): old, hopefully easy to fix.  Note that 
> > kernfs_root_from_sb()
> > can bloody well return NULL, making cgroup_root_from_kf() oops.  Always had 
> > been
> > there.
> > 
> > AFAICS, after discarding the instances that do the right thing we are left 
> > with:
> > hypfs_kill_super, rdt_kill_sb, v9fs_kill_super, afs_kill_super, 
> > btrfs_kill_super,
> > cifs_kill_sb, jffs2_kill_sb, nfs_kill_super, nfsd_umount, orangefs_kill_sb,
> > proc_kill_sb, sysfs_kill_sb, cgroup_kill_sb, rpc_kill_sb.
> > 
> > Out of those, nfsd_umount(), proc_kill_sb() and rpc_kill_sb() are 
> > regressions.
> > So are rdt_kill_sb() and sysfs_kill_sb() (victims of the issue you've 
> > spotted
> > in kernfs_kill_sb()).  The rest are old (and I wonder if syzbot had been
> > catching those - they are also dependent upon a specific allocation failing
> > at the right time).
> > 
> 
> Fix for the kernfs bug is now queued in vfs/for-linus:
> 
> #syz fix: kernfs: deal with early sget() failures
> 

But, there is still a related bug: when mounting sysfs, if register_shrinker()
fails in sget_userns(), then kernfs_kill_sb() gets called, which frees the
'struct kernfs_super_info'.  But, the 'struct kernfs_super_info' is also freed
in kernfs_mount_ns() by:

sb = sget_userns(fs_type, kernfs_test_super, kernfs_set_super, flags,
		 &init_user_ns, info);
if (IS_ERR(sb) || sb->s_fs_info != info)
kfree(info);
if (IS_ERR(sb))
return ERR_CAST(sb);

I guess the problem is that sget_userns() shouldn't take ownership of the 'info'
if it returns an error -- but, it actually does if register_shrinker() fails,
resulting in a double free.
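That ownership rule can be modelled in a few lines of plain C (hypothetical names, not the real sget_userns()/kernfs code): on error, the callee must leave 'info' owned by the caller, otherwise both sides free it.

```c
#include <assert.h>
#include <stddef.h>

/* Model of the contract: kfree() calls are only counted, never executed,
 * so the buggy path can be demonstrated without real undefined behavior. */

enum { OK = 0, ERR = -1 };

static int freed;                          /* counts model kfree(info) calls */
static void kfree_info(void *p) { (void)p; freed++; }

static int dummy;                          /* stands in for the info object */

/* Buggy contract: consumes 'info' on its internal failure path
 * (what happens when register_shrinker() fails inside sget_userns()). */
static int sget_buggy(void *info) { kfree_info(info); return ERR; }

/* Fixed contract: never consumes 'info' when returning an error. */
static int sget_fixed(void *info) { (void)info; return ERR; }

static int mount_ns(int (*sget)(void *))
{
    void *info = &dummy;
    int err = sget(info);
    if (err)
        kfree_info(info);                  /* caller-side free on error */
    return err;
}
```

With the fixed contract 'info' is freed exactly once per outcome; with the buggy one the caller's error path frees it a second time, which is the double free KASAN reports below.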

Here is a reproducer and the KASAN splat.  This is on Linus' tree (87ef12027b9b)
with vfs/for-linus merged in.

#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 
#include 
#include 

int main()
{
int fd, i;
char buf[16];

unshare(CLONE_NEWNET);
 

[PATCH] ACPI / scan: Fix regression related to X-Gene UARTs

2018-04-19 Thread Mark Salter
Commit e361d1f85855 ("ACPI / scan: Fix enumeration for special UART
devices") caused a regression with some X-Gene based platforms (Mustang
and M400) with an invalid DSDT. The DSDT makes it appear that the UART
device is also a slave device attached to itself. With the above commit
the UART won't be enumerated by the ACPI scan (slave serial devices shouldn't
be). So check for the X-Gene UART device and skip the slave device check on it.

Signed-off-by: Mark Salter 
---
 drivers/acpi/scan.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
index cc234e6a6297..1dcdd0122862 100644
--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -1551,6 +1551,14 @@ static bool acpi_device_enumeration_by_parent(struct 
acpi_device *device)
 fwnode_property_present(&device->fwnode, "baud")))
return true;
 
+   /*
+* Firmware on some arm64 X-Gene platforms will make the UART
+* device appear as both a UART and a slave of that UART. Just
+* bail out here for X-Gene UARTs.
+*/
+   if (!strcmp(acpi_device_hid(device), "APMC0D08"))
+   return false;
+
	INIT_LIST_HEAD(&resource_list);
	acpi_dev_get_resources(device, &resource_list,
			       acpi_check_serial_bus_slave,
-- 
2.14.3




Re: [PATCH 3/5] f2fs: avoid stucking GC due to atomic write

2018-04-19 Thread Jaegeuk Kim
On 04/20, Chao Yu wrote:
> On 2018/4/20 11:12, Jaegeuk Kim wrote:
> > On 04/18, Chao Yu wrote:
> >> f2fs doesn't allow abuse of the atomic write interface, so besides
> >> limiting the total memory usage of in-mem pages, we need to limit
> >> atomic-write usage as well when the filesystem is seriously fragmented;
> >> otherwise we may run into an infinite loop during foreground GC because
> >> the target blocks in the victim segment belong to an atomically opened
> >> file for a long time.
> >>
> >> Now, we detect failures due to atomic writes in foreground GC; if
> >> the count exceeds a threshold, we drop all atomically written data in
> >> the cache. I expect this can keep our system running safely and
> >> prevent a DoS attack.
> >>
> >> Signed-off-by: Chao Yu 
> >> ---
> >>  fs/f2fs/f2fs.h|  1 +
> >>  fs/f2fs/file.c|  5 +
> >>  fs/f2fs/gc.c  | 27 +++
> >>  fs/f2fs/gc.h  |  3 +++
> >>  fs/f2fs/segment.c |  1 +
> >>  fs/f2fs/segment.h |  2 ++
> >>  6 files changed, 35 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> >> index c1c3a1d11186..3453288d6a71 100644
> >> --- a/fs/f2fs/f2fs.h
> >> +++ b/fs/f2fs/f2fs.h
> >> @@ -2249,6 +2249,7 @@ enum {
> >>FI_EXTRA_ATTR,  /* indicate file has extra attribute */
> >>FI_PROJ_INHERIT,/* indicate file inherits projectid */
> >>FI_PIN_FILE,/* indicate file should not be gced */
> >> +  FI_ATOMIC_REVOKE_REQUEST,/* indicate atomic committed data has been 
> >> dropped */
> >>  };
> >>  
> >>  static inline void __mark_inode_dirty_flag(struct inode *inode,
> >> diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> >> index 7c90ded5a431..cddd9aee1bb2 100644
> >> --- a/fs/f2fs/file.c
> >> +++ b/fs/f2fs/file.c
> >> @@ -1698,6 +1698,7 @@ static int f2fs_ioc_start_atomic_write(struct file 
> >> *filp)
> >>  skip_flush:
> >>set_inode_flag(inode, FI_HOT_DATA);
> >>set_inode_flag(inode, FI_ATOMIC_FILE);
> >> +  clear_inode_flag(inode, FI_ATOMIC_REVOKE_REQUEST);
> >>f2fs_update_time(F2FS_I_SB(inode), REQ_TIME);
> >>  
> >>F2FS_I(inode)->inmem_task = current;
> >> @@ -1746,6 +1747,10 @@ static int f2fs_ioc_commit_atomic_write(struct file 
> >> *filp)
> >>ret = f2fs_do_sync_file(filp, 0, LLONG_MAX, 1, false);
> >>}
> >>  err_out:
> >> +  if (is_inode_flag_set(inode, FI_ATOMIC_REVOKE_REQUEST)) {
> >> +  clear_inode_flag(inode, FI_ATOMIC_REVOKE_REQUEST);
> >> +  ret = -EINVAL;
> >> +  }
> >>	up_write(&F2FS_I(inode)->dio_rwsem[WRITE]);
> >>inode_unlock(inode);
> >>mnt_drop_write_file(filp);
> >> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
> >> index bfb7a4a3a929..495876ca62b6 100644
> >> --- a/fs/f2fs/gc.c
> >> +++ b/fs/f2fs/gc.c
> >> @@ -135,6 +135,8 @@ int start_gc_thread(struct f2fs_sb_info *sbi)
> >>gc_th->gc_urgent = 0;
> >>gc_th->gc_wake= 0;
> >>  
> >> +  gc_th->atomic_file = 0;
> >> +
> >>sbi->gc_thread = gc_th;
> >>	init_waitqueue_head(&sbi->gc_thread->gc_wait_queue_head);
> >>sbi->gc_thread->f2fs_gc_task = kthread_run(gc_thread_func, sbi,
> >> @@ -603,7 +605,7 @@ static bool is_alive(struct f2fs_sb_info *sbi, struct 
> >> f2fs_summary *sum,
> >>   * This can be used to move blocks, aka LBAs, directly on disk.
> >>   */
> >>  static void move_data_block(struct inode *inode, block_t bidx,
> >> -  unsigned int segno, int off)
> >> +  int gc_type, unsigned int segno, int off)
> >>  {
> >>struct f2fs_io_info fio = {
> >>.sbi = F2FS_I_SB(inode),
> >> @@ -630,8 +632,10 @@ static void move_data_block(struct inode *inode, 
> >> block_t bidx,
> >>if (!check_valid_map(F2FS_I_SB(inode), segno, off))
> >>goto out;
> >>  
> >> -  if (f2fs_is_atomic_file(inode))
> >> +  if (f2fs_is_atomic_file(inode)) {
> >> +  F2FS_I_SB(inode)->gc_thread->atomic_file++;
> >>goto out;
> >> +  }
> >>  
> >>if (f2fs_is_pinned_file(inode)) {
> >>f2fs_pin_file_control(inode, true);
> >> @@ -737,8 +741,10 @@ static void move_data_page(struct inode *inode, 
> >> block_t bidx, int gc_type,
> >>if (!check_valid_map(F2FS_I_SB(inode), segno, off))
> >>goto out;
> >>  
> >> -  if (f2fs_is_atomic_file(inode))
> >> +  if (f2fs_is_atomic_file(inode)) {
> >> +  F2FS_I_SB(inode)->gc_thread->atomic_file++;
> >>goto out;
> >> +  }
> >>if (f2fs_is_pinned_file(inode)) {
> >>if (gc_type == FG_GC)
> >>f2fs_pin_file_control(inode, true);
> >> @@ -900,7 +906,8 @@ static void gc_data_segment(struct f2fs_sb_info *sbi, 
> >> struct f2fs_summary *sum,
> >>start_bidx = start_bidx_of_node(nofs, inode)
> >>+ ofs_in_node;
> >>if (f2fs_encrypted_file(inode))
> >> -  move_data_block(inode, start_bidx, segno, off);
> >> +   

Re: [PATCH 5/5] f2fs: fix to avoid race during access gc_thread pointer

2018-04-19 Thread Chao Yu
On 2018/4/20 11:19, Jaegeuk Kim wrote:
> On 04/18, Chao Yu wrote:
>> Thread A                     Thread B                    Thread C
>> - f2fs_remount
>>  - stop_gc_thread            - f2fs_sbi_store            - issue_discard_thread
>>    sbi->gc_thread = NULL;
>>                              sbi->gc_thread->gc_wake = 1
>>                                                          access
>>                                                          sbi->gc_thread->gc_urgent
> 
> Do we simply need a lock for this?

The code would become more complicated, since both existing and newly added
fields hanging off the sbi->gc_thread pointer would have to be handled, and it
would cause unneeded locking overhead, right?

So let's just allocate memory during fill_super?
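A minimal userspace sketch of that allocate-once lifetime (hypothetical names, not the actual f2fs code): the context is allocated at fill_super() time and freed only at put_super(), so stopping the GC thread no longer invalidates the pointer that other readers dereference.

```c
#include <assert.h>
#include <stdlib.h>

/* Model: the gc context lives for the whole mount; stop only parks
 * the thread and does not free the structure. */

struct gc_ctx { int running; int gc_urgent; };
struct sbi_model { struct gc_ctx *gc; };

static int fill_super(struct sbi_model *s)
{
    s->gc = calloc(1, sizeof(*s->gc));   /* allocated once per mount */
    return s->gc ? 0 : -1;
}

static void start_gc(struct sbi_model *s) { s->gc->running = 1; }
static void stop_gc(struct sbi_model *s)  { s->gc->running = 0; }  /* no free */

static void put_super(struct sbi_model *s)
{
    free(s->gc);                         /* the only place it is freed */
    s->gc = NULL;
}
```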

Thanks,

> 
>>
>> Previously, we allocated memory for sbi->gc_thread based on the background
>> GC thread mount option; the memory can be released if we turn off
>> that mount option. But there are still several places that access the
>> gc_thread pointer without considering this race condition, resulting in a
>> NULL pointer dereference.
>>
>> In order to fix this issue, keep gc_thread structure valid in sbi all
>> the time instead of alloc/free it dynamically.
>>
>> Signed-off-by: Chao Yu 
>> ---
>>  fs/f2fs/debug.c   |  3 +--
>>  fs/f2fs/f2fs.h|  7 +++
>>  fs/f2fs/gc.c  | 58 
>> +--
>>  fs/f2fs/segment.c |  4 ++--
>>  fs/f2fs/super.c   | 13 +++--
>>  fs/f2fs/sysfs.c   |  8 
>>  6 files changed, 60 insertions(+), 33 deletions(-)
>>
>> diff --git a/fs/f2fs/debug.c b/fs/f2fs/debug.c
>> index 715beb85e9db..7bb036a3bb81 100644
>> --- a/fs/f2fs/debug.c
>> +++ b/fs/f2fs/debug.c
>> @@ -223,8 +223,7 @@ static void update_mem_info(struct f2fs_sb_info *sbi)
>>  si->cache_mem = 0;
>>  
>>  /* build gc */
>> -if (sbi->gc_thread)
>> -si->cache_mem += sizeof(struct f2fs_gc_kthread);
>> +si->cache_mem += sizeof(struct f2fs_gc_kthread);
>>  
>>  /* build merge flush thread */
>>  if (SM_I(sbi)->fcc_info)
>> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
>> index 567c6bb57ae3..c553f63199e8 100644
>> --- a/fs/f2fs/f2fs.h
>> +++ b/fs/f2fs/f2fs.h
>> @@ -1412,6 +1412,11 @@ static inline struct sit_info *SIT_I(struct 
>> f2fs_sb_info *sbi)
>>  return (struct sit_info *)(SM_I(sbi)->sit_info);
>>  }
>>  
>> +static inline struct f2fs_gc_kthread *GC_I(struct f2fs_sb_info *sbi)
>> +{
>> +return (struct f2fs_gc_kthread *)(sbi->gc_thread);
>> +}
>> +
>>  static inline struct free_segmap_info *FREE_I(struct f2fs_sb_info *sbi)
>>  {
>>  return (struct free_segmap_info *)(SM_I(sbi)->free_info);
>> @@ -2954,6 +2959,8 @@ bool f2fs_overwrite_io(struct inode *inode, loff_t 
>> pos, size_t len);
>>  /*
>>   * gc.c
>>   */
>> +int init_gc_context(struct f2fs_sb_info *sbi);
>> +void destroy_gc_context(struct f2fs_sb_info * sbi);
>>  int start_gc_thread(struct f2fs_sb_info *sbi);
>>  void stop_gc_thread(struct f2fs_sb_info *sbi);
>>  block_t start_bidx_of_node(unsigned int node_ofs, struct inode *inode);
>> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
>> index da89ca16a55d..7d310e454b77 100644
>> --- a/fs/f2fs/gc.c
>> +++ b/fs/f2fs/gc.c
>> @@ -26,8 +26,8 @@
>>  static int gc_thread_func(void *data)
>>  {
>>  struct f2fs_sb_info *sbi = data;
>> -struct f2fs_gc_kthread *gc_th = sbi->gc_thread;
>> -wait_queue_head_t *wq = &sbi->gc_thread->gc_wait_queue_head;
>> +struct f2fs_gc_kthread *gc_th = GC_I(sbi);
>> +wait_queue_head_t *wq = &gc_th->gc_wait_queue_head;
>>  unsigned int wait_ms;
>>  
>>  wait_ms = gc_th->min_sleep_time;
>> @@ -114,17 +114,15 @@ static int gc_thread_func(void *data)
>>  return 0;
>>  }
>>  
>> -int start_gc_thread(struct f2fs_sb_info *sbi)
>> +int init_gc_context(struct f2fs_sb_info *sbi)
>>  {
>>  struct f2fs_gc_kthread *gc_th;
>> -dev_t dev = sbi->sb->s_bdev->bd_dev;
>> -int err = 0;
>>  
>>  gc_th = f2fs_kmalloc(sbi, sizeof(struct f2fs_gc_kthread), GFP_KERNEL);
>> -if (!gc_th) {
>> -err = -ENOMEM;
>> -goto out;
>> -}
>> +if (!gc_th)
>> +return -ENOMEM;
>> +
>> +gc_th->f2fs_gc_task = NULL;
>>  
>>  gc_th->urgent_sleep_time = DEF_GC_THREAD_URGENT_SLEEP_TIME;
>>  gc_th->min_sleep_time = DEF_GC_THREAD_MIN_SLEEP_TIME;
>> @@ -139,26 +137,41 @@ int start_gc_thread(struct f2fs_sb_info *sbi)
>>  gc_th->atomic_file[FG_GC] = 0;
>>  
>>  sbi->gc_thread = gc_th;
>> -init_waitqueue_head(&sbi->gc_thread->gc_wait_queue_head);
>> -sbi->gc_thread->f2fs_gc_task = kthread_run(gc_thread_func, sbi,
>> +
>> +return 0;
>> +}
>> +
>> +void destroy_gc_context(struct f2fs_sb_info *sbi)
>> +{
>> +kfree(GC_I(sbi));
>> +sbi->gc_thread = NULL;
>> +}
>> +
>> +int start_gc_thread(struct f2fs_sb_info *sbi)
>> +{
>> +struct f2fs_gc_kthread *gc_th = GC_I(sbi);
>> +dev_t dev = sbi->sb->s_bdev->bd_dev;
>> +int err = 0;
>> +
>> +init_waitqueue_head(&gc_th->gc_wait_queue_head);
>> +

Re: [RFC] vhost: introduce mdev based hardware vhost backend

2018-04-19 Thread Tiwei Bie
On Thu, Apr 19, 2018 at 09:40:23PM +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 10, 2018 at 03:25:45PM +0800, Jason Wang wrote:
> > > > > One problem is that, different virtio ring compatible devices
> > > > > may have different device interfaces. That is to say, we will
> > > > > need different drivers in QEMU. It could be troublesome. And
> > > > > that's what this patch trying to fix. The idea behind this
> > > > > patch is very simple: mdev is a standard way to emulate device
> > > > > in kernel.
> > > > So you just move the abstraction layer from qemu to kernel, and you 
> > > > still
> > > > need different drivers in kernel for different device interfaces of
> > > > accelerators. This looks even more complex than leaving it in qemu. As 
> > > > you
> > > > said, another idea is to implement userspace vhost backend for 
> > > > accelerators
> > > > which seems easier and could co-work with other parts of qemu without
> > > > inventing new type of messages.
> > > I'm not quite sure. Do you think it's acceptable to
> > > add various vendor specific hardware drivers in QEMU?
> > > 
> > 
> > I don't object but we need to figure out the advantages of doing it in qemu
> > too.
> > 
> > Thanks
> 
> To be frank kernel is exactly where device drivers belong.  DPDK did
> move them to userspace but that's merely a requirement for data path.
> *If* you can have them in kernel that is best:
> - update kernel and there's no need to rebuild userspace
> - apps can be written in any language no need to maintain multiple
>   libraries or add wrappers
> - security concerns are much smaller (ok people are trying to
>   raise the bar with IOMMUs and such, but it's already pretty
>   good even without)
> 
> The biggest issue is that you let userspace poke at the
> device which is also allowed by the IOMMU to poke at
> kernel memory (needed for kernel driver to work).

I think the device won't and shouldn't be allowed to
poke at kernel memory. Its kernel driver needs some
kernel memory to work, but the device doesn't have
access to it. Instead, the device only has access to:

(1) the entire memory of the VM (if a vIOMMU isn't used)
or
(2) the memory that belongs to the guest virtio device
(if a vIOMMU is being used).

Below is the reason:

For the first case, we should program the IOMMU for
the hardware device based on the info in the memory
table which is the entire memory of the VM.

For the second case, we should program the IOMMU for
the hardware device based on the info in the shadow
page table of the vIOMMU.

So the memory that can be accessed by the device is limited,
and it should be safe, especially in the second case.
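As a toy model of those two cases (hypothetical structures, not the real VFIO/IOMMU API): the device's reachable memory is exactly the set of ranges programmed into the host IOMMU, whether those ranges come from the VM's whole memory table or from the vIOMMU's shadow page table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* One programmed IOMMU mapping: a contiguous guest range the device
 * is allowed to DMA into.  Hypothetical layout. */
struct range { uint64_t start, len; };

/* True iff [addr, addr+len) falls entirely inside one programmed range;
 * anything outside (e.g. kernel memory) is unreachable by the device. */
static bool device_can_access(const struct range *map, size_t n,
                              uint64_t addr, uint64_t len)
{
    for (size_t i = 0; i < n; i++)
        if (addr >= map[i].start &&
            addr + len <= map[i].start + map[i].len)
            return true;
    return false;
}
```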

My concern is that, in this RFC, we don't program the
IOMMU for the mdev device in the userspace via the VFIO
API directly. Instead, we pass the memory table to the
kernel driver via the mdev device (BAR0) and ask the
driver to do the IOMMU programming. Someone may not
like that. The main reason why we don't program the IOMMU
via the VFIO API in userspace directly is that, currently,
IOMMU drivers don't support the mdev bus.

> 
> Yes, maybe if device is not buggy it's all fine, but
> it's better if we do not have to trust the device
> otherwise the security picture becomes more murky.
> 
> I suggested attaching a PASID to (some) queues - see my old post "using
> PASIDs to enable a safe variant of direct ring access".

It's pretty cool. We also have some similar ideas.
Cunming will talk more about this.

Best regards,
Tiwei Bie

> 
> Then using IOMMU with VFIO to limit access through queue to corrent
> ranges of memory.
> 
> 
> -- 
> MST


Re: [PATCH 4/5] f2fs: show GC failure info in debugfs

2018-04-19 Thread Chao Yu
On 2018/4/20 11:15, Jaegeuk Kim wrote:
> On 04/18, Chao Yu wrote:
>> This patch adds showing GC failure information in debugfs; for now it just
>> shows the count of failures caused by atomic writes.
>>
>> Signed-off-by: Chao Yu 
>> ---
>>  fs/f2fs/debug.c |  5 +
>>  fs/f2fs/f2fs.h  |  1 +
>>  fs/f2fs/gc.c| 13 +++--
>>  fs/f2fs/gc.h|  2 +-
>>  4 files changed, 14 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/f2fs/debug.c b/fs/f2fs/debug.c
>> index a66107b5cfff..715beb85e9db 100644
>> --- a/fs/f2fs/debug.c
>> +++ b/fs/f2fs/debug.c
>> @@ -104,6 +104,8 @@ static void update_general_status(struct f2fs_sb_info 
>> *sbi)
>>  si->avail_nids = NM_I(sbi)->available_nids;
>>  si->alloc_nids = NM_I(sbi)->nid_cnt[PREALLOC_NID];
>>  si->bg_gc = sbi->bg_gc;
>> +si->bg_atomic = sbi->gc_thread->atomic_file[BG_GC];
>> +si->fg_atomic = sbi->gc_thread->atomic_file[FG_GC];
> 
> Need to change the naming like skipped_atomic_files?

OK

> 
>>  si->util_free = (int)(free_user_blocks(sbi) >> sbi->log_blocks_per_seg)
>>  * 100 / (int)(sbi->user_block_count >> sbi->log_blocks_per_seg)
>>  / 2;
>> @@ -342,6 +344,9 @@ static int stat_show(struct seq_file *s, void *v)
>>  si->bg_data_blks);
>>  seq_printf(s, "  - node blocks : %d (%d)\n", si->node_blks,
>>  si->bg_node_blks);
>> +seq_printf(s, "Failure : atomic write %d (%d)\n",
> 
> It's not failure.

Alright... just skip..

> 
>> +si->bg_atomic + si->fg_atomic,
>> +si->bg_atomic);
>>  seq_puts(s, "\nExtent Cache:\n");
>>  seq_printf(s, "  - Hit Count: L1-1:%llu L1-2:%llu L2:%llu\n",
>>  si->hit_largest, si->hit_cached,
>> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
>> index 3453288d6a71..567c6bb57ae3 100644
>> --- a/fs/f2fs/f2fs.h
>> +++ b/fs/f2fs/f2fs.h
>> @@ -3003,6 +3003,7 @@ struct f2fs_stat_info {
>>  int bg_node_segs, bg_data_segs;
>>  int tot_blks, data_blks, node_blks;
>>  int bg_data_blks, bg_node_blks;
>> +unsigned int bg_atomic, fg_atomic;
>>  int curseg[NR_CURSEG_TYPE];
>>  int cursec[NR_CURSEG_TYPE];
>>  int curzone[NR_CURSEG_TYPE];
>> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
>> index 495876ca62b6..da89ca16a55d 100644
>> --- a/fs/f2fs/gc.c
>> +++ b/fs/f2fs/gc.c
>> @@ -135,7 +135,8 @@ int start_gc_thread(struct f2fs_sb_info *sbi)
>>  gc_th->gc_urgent = 0;
>>  gc_th->gc_wake= 0;
>>  
>> -gc_th->atomic_file = 0;
>> +gc_th->atomic_file[BG_GC] = 0;
>> +gc_th->atomic_file[FG_GC] = 0;
> 
> Need to merge the previous patch with this.

Let me merge them. :)

Thanks,

> 
>>  
>>  sbi->gc_thread = gc_th;
>>  init_waitqueue_head(&sbi->gc_thread->gc_wait_queue_head);
>> @@ -633,7 +634,7 @@ static void move_data_block(struct inode *inode, block_t 
>> bidx,
>>  goto out;
>>  
>>  if (f2fs_is_atomic_file(inode)) {
>> -F2FS_I_SB(inode)->gc_thread->atomic_file++;
>> +F2FS_I_SB(inode)->gc_thread->atomic_file[gc_type]++;
>>  goto out;
>>  }
>>  
>> @@ -742,7 +743,7 @@ static void move_data_page(struct inode *inode, block_t 
>> bidx, int gc_type,
>>  goto out;
>>  
>>  if (f2fs_is_atomic_file(inode)) {
>> -F2FS_I_SB(inode)->gc_thread->atomic_file++;
>> +F2FS_I_SB(inode)->gc_thread->atomic_file[gc_type]++;
>>  goto out;
>>  }
>>  if (f2fs_is_pinned_file(inode)) {
>> @@ -1024,7 +1025,7 @@ int f2fs_gc(struct f2fs_sb_info *sbi, bool sync,
>>  .ilist = LIST_HEAD_INIT(gc_list.ilist),
>>  .iroot = RADIX_TREE_INIT(GFP_NOFS),
>>  };
>> -unsigned int last_atomic_file = sbi->gc_thread->atomic_file;
>> +unsigned int last_atomic_file = sbi->gc_thread->atomic_file[FG_GC];
>>  unsigned int skipped_round = 0, round = 0;
>>  
>>  trace_f2fs_gc_begin(sbi->sb, sync, background,
>> @@ -1078,9 +1079,9 @@ int f2fs_gc(struct f2fs_sb_info *sbi, bool sync,
>>  total_freed += seg_freed;
>>  
>>  if (gc_type == FG_GC) {
>> -if (sbi->gc_thread->atomic_file > last_atomic_file)
>> +if (sbi->gc_thread->atomic_file[FG_GC] > last_atomic_file)
>>  skipped_round++;
>> -last_atomic_file = sbi->gc_thread->atomic_file;
>> +last_atomic_file = sbi->gc_thread->atomic_file[FG_GC];
>>  round++;
>>  }
>>  
>> diff --git a/fs/f2fs/gc.h b/fs/f2fs/gc.h
>> index bc1d21d46ae7..a6cffe6b249b 100644
>> --- a/fs/f2fs/gc.h
>> +++ b/fs/f2fs/gc.h
>> @@ -41,7 +41,7 @@ struct f2fs_gc_kthread {
>>  unsigned int gc_wake;
>>  
>>  /* for stuck statistic */
>> -unsigned int atomic_file;
>> +unsigned int atomic_file[2];
>>  };
>>  
>>  struct gc_inode_list {
>> -- 
>> 2.15.0.55.gc2ece9dc4de6
> 
> .
> 




Re: [PATCH 3/5] f2fs: avoid stucking GC due to atomic write

2018-04-19 Thread Chao Yu
On 2018/4/20 11:12, Jaegeuk Kim wrote:
> On 04/18, Chao Yu wrote:
>> f2fs doesn't allow abuse of the atomic write class interface, so besides
>> limiting the total memory usage of in-mem pages, we need to limit
>> atomic-write usage as well when the filesystem is seriously fragmented;
>> otherwise we may run into an infinite loop during foreground GC, because
>> the target blocks in the victim segment belong to an atomically opened
>> file for a long time.
>>
>> Now we detect failures due to atomic writes in foreground GC; if the
>> count exceeds a threshold, we drop all atomic written data in the cache.
>> By this, I expect it can keep our system running safely and prevent a
>> DoS attack.
>>
>> Signed-off-by: Chao Yu 
>> ---
>>  fs/f2fs/f2fs.h|  1 +
>>  fs/f2fs/file.c|  5 +
>>  fs/f2fs/gc.c  | 27 +++
>>  fs/f2fs/gc.h  |  3 +++
>>  fs/f2fs/segment.c |  1 +
>>  fs/f2fs/segment.h |  2 ++
>>  6 files changed, 35 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
>> index c1c3a1d11186..3453288d6a71 100644
>> --- a/fs/f2fs/f2fs.h
>> +++ b/fs/f2fs/f2fs.h
>> @@ -2249,6 +2249,7 @@ enum {
>>  FI_EXTRA_ATTR,  /* indicate file has extra attribute */
>>  FI_PROJ_INHERIT,/* indicate file inherits projectid */
>>  FI_PIN_FILE,/* indicate file should not be gced */
>> +FI_ATOMIC_REVOKE_REQUEST,/* indicate atomic committed data has been 
>> dropped */
>>  };
>>  
>>  static inline void __mark_inode_dirty_flag(struct inode *inode,
>> diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
>> index 7c90ded5a431..cddd9aee1bb2 100644
>> --- a/fs/f2fs/file.c
>> +++ b/fs/f2fs/file.c
>> @@ -1698,6 +1698,7 @@ static int f2fs_ioc_start_atomic_write(struct file 
>> *filp)
>>  skip_flush:
>>  set_inode_flag(inode, FI_HOT_DATA);
>>  set_inode_flag(inode, FI_ATOMIC_FILE);
>> +clear_inode_flag(inode, FI_ATOMIC_REVOKE_REQUEST);
>>  f2fs_update_time(F2FS_I_SB(inode), REQ_TIME);
>>  
>>  F2FS_I(inode)->inmem_task = current;
>> @@ -1746,6 +1747,10 @@ static int f2fs_ioc_commit_atomic_write(struct file 
>> *filp)
>>  ret = f2fs_do_sync_file(filp, 0, LLONG_MAX, 1, false);
>>  }
>>  err_out:
>> +if (is_inode_flag_set(inode, FI_ATOMIC_REVOKE_REQUEST)) {
>> +clear_inode_flag(inode, FI_ATOMIC_REVOKE_REQUEST);
>> +ret = -EINVAL;
>> +}
>>  up_write(&F2FS_I(inode)->dio_rwsem[WRITE]);
>>  inode_unlock(inode);
>>  mnt_drop_write_file(filp);
>> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
>> index bfb7a4a3a929..495876ca62b6 100644
>> --- a/fs/f2fs/gc.c
>> +++ b/fs/f2fs/gc.c
>> @@ -135,6 +135,8 @@ int start_gc_thread(struct f2fs_sb_info *sbi)
>>  gc_th->gc_urgent = 0;
>>  gc_th->gc_wake= 0;
>>  
>> +gc_th->atomic_file = 0;
>> +
>>  sbi->gc_thread = gc_th;
>>  init_waitqueue_head(&sbi->gc_thread->gc_wait_queue_head);
>>  sbi->gc_thread->f2fs_gc_task = kthread_run(gc_thread_func, sbi,
>> @@ -603,7 +605,7 @@ static bool is_alive(struct f2fs_sb_info *sbi, struct 
>> f2fs_summary *sum,
>>   * This can be used to move blocks, aka LBAs, directly on disk.
>>   */
>>  static void move_data_block(struct inode *inode, block_t bidx,
>> -unsigned int segno, int off)
>> +int gc_type, unsigned int segno, int off)
>>  {
>>  struct f2fs_io_info fio = {
>>  .sbi = F2FS_I_SB(inode),
>> @@ -630,8 +632,10 @@ static void move_data_block(struct inode *inode, 
>> block_t bidx,
>>  if (!check_valid_map(F2FS_I_SB(inode), segno, off))
>>  goto out;
>>  
>> -if (f2fs_is_atomic_file(inode))
>> +if (f2fs_is_atomic_file(inode)) {
>> +F2FS_I_SB(inode)->gc_thread->atomic_file++;
>>  goto out;
>> +}
>>  
>>  if (f2fs_is_pinned_file(inode)) {
>>  f2fs_pin_file_control(inode, true);
>> @@ -737,8 +741,10 @@ static void move_data_page(struct inode *inode, block_t 
>> bidx, int gc_type,
>>  if (!check_valid_map(F2FS_I_SB(inode), segno, off))
>>  goto out;
>>  
>> -if (f2fs_is_atomic_file(inode))
>> +if (f2fs_is_atomic_file(inode)) {
>> +F2FS_I_SB(inode)->gc_thread->atomic_file++;
>>  goto out;
>> +}
>>  if (f2fs_is_pinned_file(inode)) {
>>  if (gc_type == FG_GC)
>>  f2fs_pin_file_control(inode, true);
>> @@ -900,7 +906,8 @@ static void gc_data_segment(struct f2fs_sb_info *sbi, 
>> struct f2fs_summary *sum,
>>  start_bidx = start_bidx_of_node(nofs, inode)
>>  + ofs_in_node;
>>  if (f2fs_encrypted_file(inode))
>> -move_data_block(inode, start_bidx, segno, off);
>> +move_data_block(inode, start_bidx, gc_type,
>> +segno, off);
>>  else
>>  


Re: [PATCH 5/5] f2fs: fix to avoid race during access gc_thread pointer

2018-04-19 Thread Jaegeuk Kim
On 04/18, Chao Yu wrote:
> Thread A                  Thread B                  Thread C
> - f2fs_remount
>  - stop_gc_thread
>                           - f2fs_sbi_store
>                                                     - issue_discard_thread
>    sbi->gc_thread = NULL;
>                             sbi->gc_thread->gc_wake = 1
>                                                       access
>                                                       sbi->gc_thread->gc_urgent

Do we simply need a lock for this?

> 
> Previously, we allocated memory for sbi->gc_thread based on the background
> gc thread mount option; the memory can be released if we turn off that
> mount option. But there are still several places that access the gc_thread
> pointer without considering the race condition, resulting in a NULL pointer
> dereference.
> 
> In order to fix this issue, keep the gc_thread structure valid in sbi all
> the time instead of allocating/freeing it dynamically.
> 
> Signed-off-by: Chao Yu 
> ---
>  fs/f2fs/debug.c   |  3 +--
>  fs/f2fs/f2fs.h|  7 +++
>  fs/f2fs/gc.c  | 58 
> +--
>  fs/f2fs/segment.c |  4 ++--
>  fs/f2fs/super.c   | 13 +++--
>  fs/f2fs/sysfs.c   |  8 
>  6 files changed, 60 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/f2fs/debug.c b/fs/f2fs/debug.c
> index 715beb85e9db..7bb036a3bb81 100644
> --- a/fs/f2fs/debug.c
> +++ b/fs/f2fs/debug.c
> @@ -223,8 +223,7 @@ static void update_mem_info(struct f2fs_sb_info *sbi)
>   si->cache_mem = 0;
>  
>   /* build gc */
> - if (sbi->gc_thread)
> - si->cache_mem += sizeof(struct f2fs_gc_kthread);
> + si->cache_mem += sizeof(struct f2fs_gc_kthread);
>  
>   /* build merge flush thread */
>   if (SM_I(sbi)->fcc_info)
> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> index 567c6bb57ae3..c553f63199e8 100644
> --- a/fs/f2fs/f2fs.h
> +++ b/fs/f2fs/f2fs.h
> @@ -1412,6 +1412,11 @@ static inline struct sit_info *SIT_I(struct 
> f2fs_sb_info *sbi)
>   return (struct sit_info *)(SM_I(sbi)->sit_info);
>  }
>  
> +static inline struct f2fs_gc_kthread *GC_I(struct f2fs_sb_info *sbi)
> +{
> + return (struct f2fs_gc_kthread *)(sbi->gc_thread);
> +}
> +
>  static inline struct free_segmap_info *FREE_I(struct f2fs_sb_info *sbi)
>  {
>   return (struct free_segmap_info *)(SM_I(sbi)->free_info);
> @@ -2954,6 +2959,8 @@ bool f2fs_overwrite_io(struct inode *inode, loff_t pos, 
> size_t len);
>  /*
>   * gc.c
>   */
> +int init_gc_context(struct f2fs_sb_info *sbi);
> +void destroy_gc_context(struct f2fs_sb_info * sbi);
>  int start_gc_thread(struct f2fs_sb_info *sbi);
>  void stop_gc_thread(struct f2fs_sb_info *sbi);
>  block_t start_bidx_of_node(unsigned int node_ofs, struct inode *inode);
> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
> index da89ca16a55d..7d310e454b77 100644
> --- a/fs/f2fs/gc.c
> +++ b/fs/f2fs/gc.c
> @@ -26,8 +26,8 @@
>  static int gc_thread_func(void *data)
>  {
>   struct f2fs_sb_info *sbi = data;
> - struct f2fs_gc_kthread *gc_th = sbi->gc_thread;
> - wait_queue_head_t *wq = &sbi->gc_thread->gc_wait_queue_head;
> + struct f2fs_gc_kthread *gc_th = GC_I(sbi);
> + wait_queue_head_t *wq = &gc_th->gc_wait_queue_head;
>   unsigned int wait_ms;
>  
>   wait_ms = gc_th->min_sleep_time;
> @@ -114,17 +114,15 @@ static int gc_thread_func(void *data)
>   return 0;
>  }
>  
> -int start_gc_thread(struct f2fs_sb_info *sbi)
> +int init_gc_context(struct f2fs_sb_info *sbi)
>  {
>   struct f2fs_gc_kthread *gc_th;
> - dev_t dev = sbi->sb->s_bdev->bd_dev;
> - int err = 0;
>  
>   gc_th = f2fs_kmalloc(sbi, sizeof(struct f2fs_gc_kthread), GFP_KERNEL);
> - if (!gc_th) {
> - err = -ENOMEM;
> - goto out;
> - }
> + if (!gc_th)
> + return -ENOMEM;
> +
> + gc_th->f2fs_gc_task = NULL;
>  
>   gc_th->urgent_sleep_time = DEF_GC_THREAD_URGENT_SLEEP_TIME;
>   gc_th->min_sleep_time = DEF_GC_THREAD_MIN_SLEEP_TIME;
> @@ -139,26 +137,41 @@ int start_gc_thread(struct f2fs_sb_info *sbi)
>   gc_th->atomic_file[FG_GC] = 0;
>  
>   sbi->gc_thread = gc_th;
> - init_waitqueue_head(&sbi->gc_thread->gc_wait_queue_head);
> - sbi->gc_thread->f2fs_gc_task = kthread_run(gc_thread_func, sbi,
> +
> + return 0;
> +}
> +
> +void destroy_gc_context(struct f2fs_sb_info *sbi)
> +{
> + kfree(GC_I(sbi));
> + sbi->gc_thread = NULL;
> +}
> +
> +int start_gc_thread(struct f2fs_sb_info *sbi)
> +{
> + struct f2fs_gc_kthread *gc_th = GC_I(sbi);
> + dev_t dev = sbi->sb->s_bdev->bd_dev;
> + int err = 0;
> +
> + init_waitqueue_head(&gc_th->gc_wait_queue_head);
> + gc_th->f2fs_gc_task = kthread_run(gc_thread_func, sbi,
>   "f2fs_gc-%u:%u", MAJOR(dev), MINOR(dev));
>   if (IS_ERR(gc_th->f2fs_gc_task)) {
>   err = PTR_ERR(gc_th->f2fs_gc_task);
> - kfree(gc_th);
> - sbi->gc_thread = NULL;
> + gc_th->f2fs_gc_task = NULL;
>   }
> -out:
> +
>  


Re: [PATCH 4/5] f2fs: show GC failure info in debugfs

2018-04-19 Thread Jaegeuk Kim
On 04/18, Chao Yu wrote:
> This patch adds showing GC failure information in debugfs; for now it just
> shows the count of failures caused by atomic writes.
> 
> Signed-off-by: Chao Yu 
> ---
>  fs/f2fs/debug.c |  5 +
>  fs/f2fs/f2fs.h  |  1 +
>  fs/f2fs/gc.c| 13 +++--
>  fs/f2fs/gc.h|  2 +-
Re: [f2fs-dev] [PATCH] f2fs: sepearte hot/cold in free nid

2018-04-19 Thread Chao Yu
On 2018/4/20 10:30, heyunlei wrote:
> 
> 
>> -Original Message-
>> From: Chao Yu [mailto:yuch...@huawei.com]
>> Sent: Friday, April 20, 2018 9:53 AM
>> To: jaeg...@kernel.org
>> Cc: linux-kernel@vger.kernel.org; linux-f2fs-de...@lists.sourceforge.net
>> Subject: [f2fs-dev] [PATCH] f2fs: sepearte hot/cold in free nid
>>
>> As most indirect node, dindirect node, and xattr node won't be updated
>> after they are created, but inode node and other direct node will change
>> more frequently, so store their nat entries mixedly in whole nat table
>> will suffer:
>> - fragment nat table soon due to different update rate
>> - more nat block update due to fragmented nat table
>>
> 
> BTW, should we enable this patch:  f2fs: reuse nids more aggressively?
> 
> I think it will decrease nat table fragmentation and reduce nat I/O?

For a fragmented nat table, there will be no difference between reusing an
obsolete nid and allocating a nid from the next nat block.

IMO, in order to decrease nat block writes, it needs a smarter allocation
algorithm, like other filesystems use; but first, I'd like to separate hot
entries from cold ones.

Thanks,



Re: [PATCH 4/5] f2fs: show GC failure info in debugfs

2018-04-19 Thread Jaegeuk Kim
On 04/18, Chao Yu wrote:
> This patch adds to show GC failure information in debugfs, now it just
> shows count of failure caused by atomic write.
> 
> Signed-off-by: Chao Yu 
> ---
>  fs/f2fs/debug.c |  5 +
>  fs/f2fs/f2fs.h  |  1 +
>  fs/f2fs/gc.c| 13 +++--
>  fs/f2fs/gc.h|  2 +-
>  4 files changed, 14 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/f2fs/debug.c b/fs/f2fs/debug.c
> index a66107b5cfff..715beb85e9db 100644
> --- a/fs/f2fs/debug.c
> +++ b/fs/f2fs/debug.c
> @@ -104,6 +104,8 @@ static void update_general_status(struct f2fs_sb_info 
> *sbi)
>   si->avail_nids = NM_I(sbi)->available_nids;
>   si->alloc_nids = NM_I(sbi)->nid_cnt[PREALLOC_NID];
>   si->bg_gc = sbi->bg_gc;
> + si->bg_atomic = sbi->gc_thread->atomic_file[BG_GC];
> + si->fg_atomic = sbi->gc_thread->atomic_file[FG_GC];

Need to change the naming like skipped_atomic_files?

>   si->util_free = (int)(free_user_blocks(sbi) >> sbi->log_blocks_per_seg)
>   * 100 / (int)(sbi->user_block_count >> sbi->log_blocks_per_seg)
>   / 2;
> @@ -342,6 +344,9 @@ static int stat_show(struct seq_file *s, void *v)
>   si->bg_data_blks);
>   seq_printf(s, "  - node blocks : %d (%d)\n", si->node_blks,
>   si->bg_node_blks);
> + seq_printf(s, "Failure : atomic write %d (%d)\n",

It's not failure.

> + si->bg_atomic + si->fg_atomic,
> + si->bg_atomic);
>   seq_puts(s, "\nExtent Cache:\n");
>   seq_printf(s, "  - Hit Count: L1-1:%llu L1-2:%llu L2:%llu\n",
>   si->hit_largest, si->hit_cached,
> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> index 3453288d6a71..567c6bb57ae3 100644
> --- a/fs/f2fs/f2fs.h
> +++ b/fs/f2fs/f2fs.h
> @@ -3003,6 +3003,7 @@ struct f2fs_stat_info {
>   int bg_node_segs, bg_data_segs;
>   int tot_blks, data_blks, node_blks;
>   int bg_data_blks, bg_node_blks;
> + unsigned int bg_atomic, fg_atomic;
>   int curseg[NR_CURSEG_TYPE];
>   int cursec[NR_CURSEG_TYPE];
>   int curzone[NR_CURSEG_TYPE];
> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
> index 495876ca62b6..da89ca16a55d 100644
> --- a/fs/f2fs/gc.c
> +++ b/fs/f2fs/gc.c
> @@ -135,7 +135,8 @@ int start_gc_thread(struct f2fs_sb_info *sbi)
>   gc_th->gc_urgent = 0;
>   gc_th->gc_wake= 0;
>  
> - gc_th->atomic_file = 0;
> + gc_th->atomic_file[BG_GC] = 0;
> + gc_th->atomic_file[FG_GC] = 0;

Need to merge the previous patch with this.

>  
>   sbi->gc_thread = gc_th;
>   init_waitqueue_head(&sbi->gc_thread->gc_wait_queue_head);
> @@ -633,7 +634,7 @@ static void move_data_block(struct inode *inode, block_t 
> bidx,
>   goto out;
>  
>   if (f2fs_is_atomic_file(inode)) {
> - F2FS_I_SB(inode)->gc_thread->atomic_file++;
> + F2FS_I_SB(inode)->gc_thread->atomic_file[gc_type]++;
>   goto out;
>   }
>  
> @@ -742,7 +743,7 @@ static void move_data_page(struct inode *inode, block_t 
> bidx, int gc_type,
>   goto out;
>  
>   if (f2fs_is_atomic_file(inode)) {
> - F2FS_I_SB(inode)->gc_thread->atomic_file++;
> + F2FS_I_SB(inode)->gc_thread->atomic_file[gc_type]++;
>   goto out;
>   }
>   if (f2fs_is_pinned_file(inode)) {
> @@ -1024,7 +1025,7 @@ int f2fs_gc(struct f2fs_sb_info *sbi, bool sync,
>   .ilist = LIST_HEAD_INIT(gc_list.ilist),
>   .iroot = RADIX_TREE_INIT(GFP_NOFS),
>   };
> - unsigned int last_atomic_file = sbi->gc_thread->atomic_file;
> + unsigned int last_atomic_file = sbi->gc_thread->atomic_file[FG_GC];
>   unsigned int skipped_round = 0, round = 0;
>  
>   trace_f2fs_gc_begin(sbi->sb, sync, background,
> @@ -1078,9 +1079,9 @@ int f2fs_gc(struct f2fs_sb_info *sbi, bool sync,
>   total_freed += seg_freed;
>  
>   if (gc_type == FG_GC) {
> - if (sbi->gc_thread->atomic_file > last_atomic_file)
> + if (sbi->gc_thread->atomic_file[FG_GC] > last_atomic_file)
>   skipped_round++;
> - last_atomic_file = sbi->gc_thread->atomic_file;
> + last_atomic_file = sbi->gc_thread->atomic_file[FG_GC];
>   round++;
>   }
>  
> diff --git a/fs/f2fs/gc.h b/fs/f2fs/gc.h
> index bc1d21d46ae7..a6cffe6b249b 100644
> --- a/fs/f2fs/gc.h
> +++ b/fs/f2fs/gc.h
> @@ -41,7 +41,7 @@ struct f2fs_gc_kthread {
>   unsigned int gc_wake;
>  
>   /* for stuck statistic */
> - unsigned int atomic_file;
> + unsigned int atomic_file[2];
>  };
>  
>  struct gc_inode_list {
> -- 
> 2.15.0.55.gc2ece9dc4de6



Re: [PATCH 3/5] f2fs: avoid stucking GC due to atomic write

2018-04-19 Thread Jaegeuk Kim
On 04/18, Chao Yu wrote:
> f2fs doesn't allow abuse of the atomic write class interface, so besides
> limiting in-mem pages' total memory usage, we also need to limit
> atomic-write usage when the filesystem is seriously fragmented;
> otherwise we may run into an infinite loop during foreground GC because
> target blocks in the victim segment belong to an atomic-opened file for
> a long time.
> 
> Now we detect failures due to atomic writes in foreground GC; if the
> count exceeds a threshold, we drop all atomic written data in the
> cache. By this, I expect we can keep the system running safely and
> prevent a DoS attack.
> 
> Signed-off-by: Chao Yu 
> ---
>  fs/f2fs/f2fs.h|  1 +
>  fs/f2fs/file.c|  5 +
>  fs/f2fs/gc.c  | 27 +++
>  fs/f2fs/gc.h  |  3 +++
>  fs/f2fs/segment.c |  1 +
>  fs/f2fs/segment.h |  2 ++
>  6 files changed, 35 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> index c1c3a1d11186..3453288d6a71 100644
> --- a/fs/f2fs/f2fs.h
> +++ b/fs/f2fs/f2fs.h
> @@ -2249,6 +2249,7 @@ enum {
>   FI_EXTRA_ATTR,  /* indicate file has extra attribute */
>   FI_PROJ_INHERIT,/* indicate file inherits projectid */
>   FI_PIN_FILE,/* indicate file should not be gced */
> + FI_ATOMIC_REVOKE_REQUEST,/* indicate atomic committed data has been 
> dropped */
>  };
>  
>  static inline void __mark_inode_dirty_flag(struct inode *inode,
> diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> index 7c90ded5a431..cddd9aee1bb2 100644
> --- a/fs/f2fs/file.c
> +++ b/fs/f2fs/file.c
> @@ -1698,6 +1698,7 @@ static int f2fs_ioc_start_atomic_write(struct file 
> *filp)
>  skip_flush:
>   set_inode_flag(inode, FI_HOT_DATA);
>   set_inode_flag(inode, FI_ATOMIC_FILE);
> + clear_inode_flag(inode, FI_ATOMIC_REVOKE_REQUEST);
>   f2fs_update_time(F2FS_I_SB(inode), REQ_TIME);
>  
>   F2FS_I(inode)->inmem_task = current;
> @@ -1746,6 +1747,10 @@ static int f2fs_ioc_commit_atomic_write(struct file 
> *filp)
>   ret = f2fs_do_sync_file(filp, 0, LLONG_MAX, 1, false);
>   }
>  err_out:
> + if (is_inode_flag_set(inode, FI_ATOMIC_REVOKE_REQUEST)) {
> + clear_inode_flag(inode, FI_ATOMIC_REVOKE_REQUEST);
> + ret = -EINVAL;
> + }
>   up_write(&F2FS_I(inode)->dio_rwsem[WRITE]);
>   inode_unlock(inode);
>   mnt_drop_write_file(filp);
> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
> index bfb7a4a3a929..495876ca62b6 100644
> --- a/fs/f2fs/gc.c
> +++ b/fs/f2fs/gc.c
> @@ -135,6 +135,8 @@ int start_gc_thread(struct f2fs_sb_info *sbi)
>   gc_th->gc_urgent = 0;
>   gc_th->gc_wake= 0;
>  
> + gc_th->atomic_file = 0;
> +
>   sbi->gc_thread = gc_th;
>   init_waitqueue_head(&sbi->gc_thread->gc_wait_queue_head);
>   sbi->gc_thread->f2fs_gc_task = kthread_run(gc_thread_func, sbi,
> @@ -603,7 +605,7 @@ static bool is_alive(struct f2fs_sb_info *sbi, struct 
> f2fs_summary *sum,
>   * This can be used to move blocks, aka LBAs, directly on disk.
>   */
>  static void move_data_block(struct inode *inode, block_t bidx,
> - unsigned int segno, int off)
> + int gc_type, unsigned int segno, int off)
>  {
>   struct f2fs_io_info fio = {
>   .sbi = F2FS_I_SB(inode),
> @@ -630,8 +632,10 @@ static void move_data_block(struct inode *inode, block_t 
> bidx,
>   if (!check_valid_map(F2FS_I_SB(inode), segno, off))
>   goto out;
>  
> - if (f2fs_is_atomic_file(inode))
> + if (f2fs_is_atomic_file(inode)) {
> + F2FS_I_SB(inode)->gc_thread->atomic_file++;
>   goto out;
> + }
>  
>   if (f2fs_is_pinned_file(inode)) {
>   f2fs_pin_file_control(inode, true);
> @@ -737,8 +741,10 @@ static void move_data_page(struct inode *inode, block_t 
> bidx, int gc_type,
>   if (!check_valid_map(F2FS_I_SB(inode), segno, off))
>   goto out;
>  
> - if (f2fs_is_atomic_file(inode))
> + if (f2fs_is_atomic_file(inode)) {
> + F2FS_I_SB(inode)->gc_thread->atomic_file++;
>   goto out;
> + }
>   if (f2fs_is_pinned_file(inode)) {
>   if (gc_type == FG_GC)
>   f2fs_pin_file_control(inode, true);
> @@ -900,7 +906,8 @@ static void gc_data_segment(struct f2fs_sb_info *sbi, 
> struct f2fs_summary *sum,
>   start_bidx = start_bidx_of_node(nofs, inode)
>   + ofs_in_node;
>   if (f2fs_encrypted_file(inode))
> - move_data_block(inode, start_bidx, segno, off);
> + move_data_block(inode, start_bidx, gc_type,
> + segno, off);
>   else
>   move_data_page(inode, start_bidx, gc_type,
>  

Re: [PATCH] virtio_ring: switch to dma_XX barriers for rpmsg

2018-04-19 Thread Jason Wang



On 2018-04-20 01:35, Michael S. Tsirkin wrote:

virtio is using barriers to order memory accesses, thus
dma_wmb/rmb is a good match.

Build-tested on x86: Before

[mst@tuck linux]$ size drivers/virtio/virtio_ring.o
   text    data     bss     dec     hex filename
  11392     820       0   12212    2fb4 drivers/virtio/virtio_ring.o

After
[mst@tuck linux]$ size drivers/virtio/virtio_ring.o
   text    data     bss     dec     hex filename
  11284     820       0   12104    2f48 drivers/virtio/virtio_ring.o

Cc: Ohad Ben-Cohen 
Cc: Bjorn Andersson 
Cc: linux-remotep...@vger.kernel.org
Signed-off-by: Michael S. Tsirkin 
---

It's good in theory, but could one of RPMSG maintainers please review
and ack this patch? Or even better test it?

All these barriers are useless on Intel anyway ...

  include/linux/virtio_ring.h | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
index bbf3252..fab0213 100644
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -35,7 +35,7 @@ static inline void virtio_rmb(bool weak_barriers)
if (weak_barriers)
virt_rmb();
else
-   rmb();
+   dma_rmb();
  }
  
  static inline void virtio_wmb(bool weak_barriers)

@@ -43,7 +43,7 @@ static inline void virtio_wmb(bool weak_barriers)
if (weak_barriers)
virt_wmb();
else
-   wmb();
+   dma_wmb();
  }
  
  static inline void virtio_store_mb(bool weak_barriers,


Acked-by: Jason Wang 




Re: [PATCH v8 15/18] mm, fs, dax: handle layout changes to pinned dax mappings

2018-04-19 Thread Dan Williams
On Thu, Apr 19, 2018 at 3:44 AM, Jan Kara  wrote:
> On Fri 13-04-18 15:03:51, Dan Williams wrote:
>> On Mon, Apr 9, 2018 at 9:51 AM, Dan Williams  
>> wrote:
>> > On Mon, Apr 9, 2018 at 9:49 AM, Jan Kara  wrote:
>> >> On Sat 07-04-18 12:38:24, Dan Williams wrote:
>> > [..]
>> >>> I wonder if this can be trivially solved by using srcu. I.e. we don't
>> >>> need to wait for a global quiescent state, just a
>> >>> get_user_pages_fast() quiescent state. ...or is that an abuse of the
>> >>> srcu api?
>> >>
>> >> Well, I'd rather use the percpu rwsemaphore (linux/percpu-rwsem.h) than
>> >> SRCU. It is a more-or-less standard locking mechanism rather than relying
>> >> on implementation properties of SRCU which is a data structure protection
>> >> method. And the overhead of percpu rwsemaphore for your use case should be
>> >> about the same as that of SRCU.
>> >
>> > I was just about to ask that. Yes, it seems they would share similar
>> > properties and it would be better to use the explicit implementation
>> > rather than a side effect of srcu.
>>
>> ...unfortunately:
>>
>>  BUG: sleeping function called from invalid context at
>> ./include/linux/percpu-rwsem.h:34
>>  [..]
>>  Call Trace:
>>   dump_stack+0x85/0xcb
>>   ___might_sleep+0x15b/0x240
>>   dax_layout_lock+0x18/0x80
>>   get_user_pages_fast+0xf8/0x140
>>
>> ...and thinking about it more srcu is a better fit. We don't need the
>> 100% exclusion provided by an rwsem we only need the guarantee that
>> all cpus that might have been running get_user_pages_fast() have
>> finished it at least once.
>>
>> In my tests synchronize_srcu is a bit slower than unpatched for the
>> trivial 100 truncate test, but certainly not the 200x latency you were
>> seeing with synchronize_rcu.
>>
>> Elapsed time:
>> 0.006149178 unpatched
>> 0.009426360 srcu
>
> Hum, right. Yesterday I was looking into KSM for a different reason and
> I've noticed it also does writeprotect pages and deals with races with GUP.
> And what KSM relies on is:
>
> write_protect_page()
>   ...
>   entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
>   /*
>* Check that no O_DIRECT or similar I/O is in progress on the
>* page
>*/
>   if (page_mapcount(page) + 1 + swapped != page_count(page)) {
> page used -> bail

Slick.

>   }
>
> And this really works because gup_pte_range() does:
>
>   page = pte_page(pte);
>   head = compound_head(page);
>
>   if (!page_cache_get_speculative(head))
> goto pte_unmap;
>
>   if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> bail

Need to add a similar check to __gup_device_huge_pmd.

>   }
>
> So either write_protect_page() page sees the elevated reference or
> gup_pte_range() bails because it will see the pte changed.
>
> In the truncate path things are a bit different but in principle the same
> should work - once truncate blocks page faults and unmaps pages from page
> tables, we can be sure GUP will not grab the page anymore or we'll see
> elevated page count. So IMO there's no need for any additional locking
> against the GUP path (but a comment explaining this is highly desirable I
> guess).

Yes, those "pte_val(pte) != pte_val(*ptep)" checks should be
documented for the same reason we require comments on rmb/wmb pairs.
I'll take a look, thanks Jan.



Re: general protection fault in kernfs_kill_sb

2018-04-19 Thread Eric Biggers
On Mon, Apr 02, 2018 at 03:34:15PM +0100, Al Viro wrote:
> On Mon, Apr 02, 2018 at 07:40:22PM +0900, Tetsuo Handa wrote:
> 
> > That commit assumes that calling kill_sb() from deactivate_locked_super(s)
> > without corresponding fill_super() is safe. We have so far crashed with
> > rpc_mount() and kernfs_mount_ns(). Is that really safe?
> 
>   Consider the case when fill_super() returns an error immediately.
> It is exactly the same situation.  And ->kill_sb() *is* called in cases
> when fill_super() has failed.  Always had been - it's much less boilerplate
> that way.
> 
>   deactivate_locked_super() on that failure exit is the least painful
> variant, unfortunately.
> 
>   Filesystems with ->kill_sb() instances that rely upon something
> done between sget() and the first failure exit after it need to be fixed.
> And yes, that should've been spotted back then.  Sorry.
> 
> Fortunately, we don't have many of those - kill_{block,litter,anon}_super()
> are safe and those are the majority.  Looking through the rest uncovers
> some bugs; so far all I've seen were already there.  Note that normally
> we have something like
> static void affs_kill_sb(struct super_block *sb)
> {
> struct affs_sb_info *sbi = AFFS_SB(sb);
> kill_block_super(sb);
> if (sbi) {
> affs_free_bitmap(sb);
> affs_brelse(sbi->s_root_bh);
> kfree(sbi->s_prefix);
> mutex_destroy(&sbi->s_bmlock);
> kfree(sbi);
> }
> }
> which basically does one of the safe ones augmented with something that
> takes care *not* to assume that e.g. ->s_fs_info has been allocated.
> Not everyone does, though:
> 
> jffs2_fill_super():
> c = kzalloc(sizeof(*c), GFP_KERNEL);
> if (!c)
> return -ENOMEM;
> in the very beginning.  So we can return from it with NULL ->s_fs_info.
> Now, consider
> struct jffs2_sb_info *c = JFFS2_SB_INFO(sb);
> if (!(sb->s_flags & MS_RDONLY))
> jffs2_stop_garbage_collect_thread(c);
> in jffs2_kill_sb() and
> void jffs2_stop_garbage_collect_thread(struct jffs2_sb_info *c)
> {
> int wait = 0;
> spin_lock(&c->erase_completion_lock);
> if (c->gc_task) {
> 
> IOW, fail that kzalloc() (or, indeed, an allocation in register_shrinker())
> and eat an oops.  Always had been there, always hard to hit without
> fault injectors and fortunately trivial to fix.
> 
> Similar in nfs_kill_super() calling nfs_free_server().
> Similar in v9fs_kill_super() with v9fs_session_cancel()/v9fs_session_close() 
> calls.
> Similar in hypfs_kill_super(), afs_kill_super(), btrfs_kill_super(), 
> cifs_kill_sb()
> (all trivial to fix)
> 
> Aha... nfsd_umount() is a new regression.
> 
> orangefs: old, trivial to fix.
> 
> cgroup_kill_sb(): old, hopefully easy to fix.  Note that kernfs_root_from_sb()
> can bloody well return NULL, making cgroup_root_from_kf() oops.  Always had 
> been
> there.
> 
> AFAICS, after discarding the instances that do the right thing we are left 
> with:
> hypfs_kill_super, rdt_kill_sb, v9fs_kill_super, afs_kill_super, 
> btrfs_kill_super,
> cifs_kill_sb, jffs2_kill_sb, nfs_kill_super, nfsd_umount, orangefs_kill_sb,
> proc_kill_sb, sysfs_kill_sb, cgroup_kill_sb, rpc_kill_sb.
> 
> Out of those, nfsd_umount(), proc_kill_sb() and rpc_kill_sb() are regressions.
> So are rdt_kill_sb() and sysfs_kill_sb() (victims of the issue you've spotted
> in kernfs_kill_sb()).  The rest are old (and I wonder if syzbot had been
> catching those - they are also dependent upon a specific allocation failing
> at the right time).
> 

Fix for the kernfs bug is now queued in vfs/for-linus:

#syz fix: kernfs: deal with early sget() failures

syzkaller just recently (3 weeks ago) gained the ability to mount filesystem
images, so that's the main reason for the increase in filesystem bug reports.
Each time syzkaller is updated to cover more code, bugs are found.

- Eric



Re: [v2] prctl: Deprecate non PR_SET_MM_MAP operations

2018-04-19 Thread Sergey Senozhatsky
On (04/05/18 21:26), Cyrill Gorcunov wrote:
[..]
> -
>  #ifdef CONFIG_CHECKPOINT_RESTORE
>   if (opt == PR_SET_MM_MAP || opt == PR_SET_MM_MAP_SIZE)
>   return prctl_set_mm_map(opt, (const void __user *)addr, arg4);
>  #endif
>  
> - if (!capable(CAP_SYS_RESOURCE))
> - return -EPERM;
> -
> - if (opt == PR_SET_MM_EXE_FILE)
> - return prctl_set_mm_exe_file(mm, (unsigned int)addr);
> -
> - if (opt == PR_SET_MM_AUXV)
> - return prctl_set_auxv(mm, addr, arg4);

Then validate_prctl_map() and prctl_set_mm_exe_file() can be moved
under the CONFIG_CHECKPOINT_RESTORE ifdef.

---

 kernel/sys.c | 126 +--
 1 file changed, 63 insertions(+), 63 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 6bdffe264303..86e5ef1a5612 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1815,68 +1815,7 @@ SYSCALL_DEFINE1(umask, int, mask)
return mask;
 }
 
-static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
-{
-   struct fd exe;
-   struct file *old_exe, *exe_file;
-   struct inode *inode;
-   int err;
-
-   exe = fdget(fd);
-   if (!exe.file)
-   return -EBADF;
-
-   inode = file_inode(exe.file);
-
-   /*
-* Because the original mm->exe_file points to executable file, make
-* sure that this one is executable as well, to avoid breaking an
-* overall picture.
-*/
-   err = -EACCES;
-   if (!S_ISREG(inode->i_mode) || path_noexec(&exe.file->f_path))
-   goto exit;
-
-   err = inode_permission(inode, MAY_EXEC);
-   if (err)
-   goto exit;
-
-   /*
-* Forbid mm->exe_file change if old file still mapped.
-*/
-   exe_file = get_mm_exe_file(mm);
-   err = -EBUSY;
-   if (exe_file) {
-   struct vm_area_struct *vma;
-
-   down_read(&mm->mmap_sem);
-   for (vma = mm->mmap; vma; vma = vma->vm_next) {
-   if (!vma->vm_file)
-   continue;
-   if (path_equal(&vma->vm_file->f_path,
-  &exe_file->f_path))
-   goto exit_err;
-   }
-
-   up_read(&mm->mmap_sem);
-   fput(exe_file);
-   }
-
-   err = 0;
-   /* set the new file, lockless */
-   get_file(exe.file);
-   old_exe = xchg(&mm->exe_file, exe.file);
-   if (old_exe)
-   fput(old_exe);
-exit:
-   fdput(exe);
-   return err;
-exit_err:
-   up_read(&mm->mmap_sem);
-   fput(exe_file);
-   goto exit;
-}
-
+#ifdef CONFIG_CHECKPOINT_RESTORE
 /*
  * WARNING: we don't require any capability here so be very careful
  * in what is allowed for modification from userspace.
@@ -1968,7 +1907,68 @@ static int validate_prctl_map(struct prctl_mm_map *prctl_map)
return error;
 }
 
-#ifdef CONFIG_CHECKPOINT_RESTORE
+static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
+{
+   struct fd exe;
+   struct file *old_exe, *exe_file;
+   struct inode *inode;
+   int err;
+
+   exe = fdget(fd);
+   if (!exe.file)
+   return -EBADF;
+
+   inode = file_inode(exe.file);
+
+   /*
+* Because the original mm->exe_file points to executable file, make
+* sure that this one is executable as well, to avoid breaking an
+* overall picture.
+*/
+   err = -EACCES;
+   if (!S_ISREG(inode->i_mode) || path_noexec(&exe.file->f_path))
+   goto exit;
+
+   err = inode_permission(inode, MAY_EXEC);
+   if (err)
+   goto exit;
+
+   /*
+* Forbid mm->exe_file change if old file still mapped.
+*/
+   exe_file = get_mm_exe_file(mm);
+   err = -EBUSY;
+   if (exe_file) {
+   struct vm_area_struct *vma;
+
+   down_read(&mm->mmap_sem);
+   for (vma = mm->mmap; vma; vma = vma->vm_next) {
+   if (!vma->vm_file)
+   continue;
+   if (path_equal(&vma->vm_file->f_path,
+  &exe_file->f_path))
+   goto exit_err;
+   }
+
+   up_read(&mm->mmap_sem);
+   fput(exe_file);
+   }
+
+   err = 0;
+   /* set the new file, lockless */
+   get_file(exe.file);
+   old_exe = xchg(&mm->exe_file, exe.file);
+   if (old_exe)
+   fput(old_exe);
+exit:
+   fdput(exe);
+   return err;
+exit_err:
+   up_read(&mm->mmap_sem);
+   fput(exe_file);
+   goto exit;
+}
+
 static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data_size)
 {
struct prctl_mm_map prctl_map = { .exe_fd = (u32)-1, };





Re: [PATCH v5 4/4] zram: introduce zram memory tracking

2018-04-19 Thread Minchan Kim
On Fri, Apr 20, 2018 at 11:18:34AM +0900, Sergey Senozhatsky wrote:
> On (04/20/18 11:09), Minchan Kim wrote:
> [..]
> > > hm, OK, can we get this info into the changelog?  
> > 
> > No problem. I will add as follows,
> > 
> > "I used the feature a few years ago to find memory hoggers in userspace
> > and let them know what memory they had left untouched for a long time.
> > With it, they could reclaim unnecessary memory. However, at that time,
> > I hacked up zram for the feature; now I need it again, so I decided it
> > would be better to upstream it rather than keep it private.
> > I hope to submit the userspace tool that uses the feature soon."
> 
> Shall we then just wait until you resubmit the "complete" patch set: zram
> tracking + the user space tool which would parse the tracking output?

tl;dr: I think the userspace tool is ancillary, not a must.

Although my main purpose is to find idle memory hoggers, I don't think a
userspace tool is a prerequisite for merging this feature, because someone
might want to do other things with it regardless of the tool.

Examples off the top of my head: seeing how the swap write pattern evolves
over time, how sparse swap writes are, and so on. :)




RE: [f2fs-dev] [PATCH] f2fs: separate hot/cold in free nid

2018-04-19 Thread heyunlei


>-Original Message-
>From: Chao Yu [mailto:yuch...@huawei.com]
>Sent: Friday, April 20, 2018 9:53 AM
>To: jaeg...@kernel.org
>Cc: linux-kernel@vger.kernel.org; linux-f2fs-de...@lists.sourceforge.net
>Subject: [f2fs-dev] [PATCH] f2fs: separate hot/cold in free nid
>
>Most indirect nodes, dindirect nodes, and xattr nodes won't be updated
>after they are created, while inode nodes and other direct nodes change
>much more frequently, so storing their nat entries mixed together in one
>nat table will suffer from:
>- the nat table fragmenting quickly due to the different update rates
>- more nat block updates due to the fragmented nat table
>

BTW, should we enable this patch: "f2fs: reuse nids more aggressively"?

I think it would decrease nat area fragmentation and reduce nat I/O?

>In order to solve the above issue, we're trying to separate the whole nat
>table into two parts:
>a. Hot free nid area:
> - range: [nid #0, nid #x)
> - store node block address for
>   * inode node
>   * other direct node
>b. Cold free nid area:
> - range: [nid #x, max nid)
> - store node block address for
>   * indirect node
>   * dindirect node
>   * xattr node
>
>Allocation strategy example:
>
>Free nid: '-'
>Used nid: '='
>
>1. Initial status:
>
>Free Nids:   |------------------------------------------------|
>Alloc Range: |-----------|            |-----------|
>             hot_start   hot_end      cold_start  cold_end
>
>2. Free nids have run out:
>
>Free Nids:   |===========-------------===========-------------|
>Alloc Range: |===========|            |===========|
>             hot_start   hot_end      cold_start  cold_end
>
>3. Expand hot/cold area range:
>
>Free Nids:   |===========-------------===========-------------|
>Alloc Range: |===========------|      |===========------|
>             hot_start        hot_end cold_start       cold_end
>
>4. Hot free nids have run out:
>
>Free Nids:   |=================-------===========-------------|
>Alloc Range: |=================|      |===========------|
>             hot_start        hot_end cold_start       cold_end
>
>5. Expand hot area range, hot/cold area boundary has been fixed:
>
>Free Nids:   |=================-------===========-------------|
>Alloc Range: |=================-------|=================------|
>             hot_start     hot_end(cold_start)        cold_end
>
>Run xfstests with generic/*:
>
>before:
>node_write:169660
>cp_count:  60118
>node/cp:   2.82
>
>after:
>node_write:159145
>cp_count:  84501
>node/cp:   2.64
>
>Signed-off-by: Chao Yu 
>---
> fs/f2fs/checkpoint.c |   4 -
> fs/f2fs/debug.c  |   6 +-
> fs/f2fs/f2fs.h   |  19 +++-
> fs/f2fs/inode.c  |   2 +-
> fs/f2fs/namei.c  |   2 +-
> fs/f2fs/node.c   | 302 ---
> fs/f2fs/node.h   |  17 +--
> fs/f2fs/segment.c|   8 +-
> fs/f2fs/shrinker.c   |   3 +-
> fs/f2fs/xattr.c  |  10 +-
> 10 files changed, 221 insertions(+), 152 deletions(-)
>
>diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
>index 96785ffc6181..c17feec72c74 100644
>--- a/fs/f2fs/checkpoint.c
>+++ b/fs/f2fs/checkpoint.c
>@@ -1029,14 +1029,10 @@ int f2fs_sync_inode_meta(struct f2fs_sb_info *sbi)
> static void __prepare_cp_block(struct f2fs_sb_info *sbi)
> {
>   struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
>-  struct f2fs_nm_info *nm_i = NM_I(sbi);
>-  nid_t last_nid = nm_i->next_scan_nid;
>
>-  next_free_nid(sbi, &last_nid);
>   ckpt->valid_block_count = cpu_to_le64(valid_user_blocks(sbi));
>   ckpt->valid_node_count = cpu_to_le32(valid_node_count(sbi));
>   ckpt->valid_inode_count = cpu_to_le32(valid_inode_count(sbi));
>-  ckpt->next_free_nid = cpu_to_le32(last_nid);
> }
>
> /*
>diff --git a/fs/f2fs/debug.c b/fs/f2fs/debug.c
>index 7bb036a3bb81..b13c1d4f110f 100644
>--- a/fs/f2fs/debug.c
>+++ b/fs/f2fs/debug.c
>@@ -100,7 +100,8 @@ static void update_general_status(struct f2fs_sb_info *sbi)
>   si->dirty_nats = NM_I(sbi)->dirty_nat_cnt;
>   si->sits = MAIN_SEGS(sbi);
>   
