Re: [PATCH v2 01/12] docs: path-lookup: update follow_managed() part
On Tue, Apr 20, 2021 at 3:22 AM Jonathan Corbet wrote: > > Fox Chen writes: > > > On Mon, Apr 19, 2021 at 11:25 AM Matthew Wilcox wrote: > >> > >> On Mon, Apr 19, 2021 at 10:33:00AM +0800, Fox Chen wrote: > >> > On Mon, Apr 19, 2021 at 10:17 AM Matthew Wilcox > >> > wrote: > >> > > You can drop ``..`` from around function named which are followed with > >> > > (). d74b0d31ddde ("Docs: An initial automarkup extension for sphinx") > >> > > marks them up automatically. > >> > > > >> > > >> > Got it, thanks for letting me know. But I will still use them in this > >> > patch series to keep consistency with the remaining parts of the > >> > document. > >> > >> Well, you weren't. For example: > >> > >> +As the last step of ``walk_component()``, ``step_into()`` will be called > >> either > >> +directly from walk_component() or from handle_dots(). It calls > >> +``handle_mount()``, to check and handle mount points, in which a new > >> > >> Neither of the functions on the second line were using ``. > > > > Oh, That was a mistake, They should've been wrapped with ``. > > Thanks for pointing it out. I will go through the whole patch set and > > fix this type of inconsistency in V3. > > Please, if possible, go toward the bare function() form rather than > using literals...it's easier to read and the docs system will > automatically create cross references for you. > > Thanks, > > jon Ok, If you have no problem with that inconsistency, I will go with the bare one in v3. thanks, fox
RE: [PATCH 5.11 000/122] 5.11.16-rc1 review
On Mon, 19 Apr 2021 15:04:40 +0200, Greg Kroah-Hartman wrote: > This is the start of the stable review cycle for the 5.11.16 release. > There are 122 patches in this series, all will be posted as a response > to this one. If anyone has any issues with these being applied, please > let me know. > > Responses should be made by Wed, 21 Apr 2021 13:05:09 +. > Anything received after that time might be too late. > > The whole patch series can be found in one patch at: > > https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.11.16-rc1.gz > or in the git tree and branch at: > > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git > linux-5.11.y > and the diffstat can be found below. > > thanks, > > greg k-h > 5.11.16-rc1 Successfully Compiled and booted on my Raspberry PI 4b (8g) (bcm2711) Tested-by: Fox Chen
RE: [PATCH 5.10 000/103] 5.10.32-rc1 review
On Mon, 19 Apr 2021 15:05:11 +0200, Greg Kroah-Hartman wrote: > This is the start of the stable review cycle for the 5.10.32 release. > There are 103 patches in this series, all will be posted as a response > to this one. If anyone has any issues with these being applied, please > let me know. > > Responses should be made by Wed, 21 Apr 2021 13:05:09 +. > Anything received after that time might be too late. > > The whole patch series can be found in one patch at: > > https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.10.32-rc1.gz > or in the git tree and branch at: > > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git > linux-5.10.y > and the diffstat can be found below. > > thanks, > > greg k-h > 5.10.32-rc1 Successfully Compiled and booted on my Raspberry PI 4b (8g) (bcm2711) Tested-by: Fox Chen
Re: [PATCH v2 01/12] docs: path-lookup: update follow_managed() part
On Mon, Apr 19, 2021 at 11:25 AM Matthew Wilcox wrote: > > On Mon, Apr 19, 2021 at 10:33:00AM +0800, Fox Chen wrote: > > On Mon, Apr 19, 2021 at 10:17 AM Matthew Wilcox wrote: > > > > > > On Tue, Mar 16, 2021 at 01:47:16PM +0800, Fox Chen wrote: > > > > -In the absence of symbolic links, ``walk_component()`` creates a new > > > > +As the last step of ``walk_component()``, ``step_into()`` will be > > > > called either > > > > > > You can drop ``..`` from around function named which are followed with > > > (). d74b0d31ddde ("Docs: An initial automarkup extension for sphinx") > > > marks them up automatically. > > > > > > > Got it, thanks for letting me know. But I will still use them in this > > patch series to keep consistency with the remaining parts of the > > document. > > Well, you weren't. For example: > > +As the last step of ``walk_component()``, ``step_into()`` will be called > either > +directly from walk_component() or from handle_dots(). It calls > +``handle_mount()``, to check and handle mount points, in which a new > > Neither of the functions on the second line were using ``. Oh, That was a mistake, They should've been wrapped with ``. Thanks for pointing it out. I will go through the whole patch set and fix this type of inconsistency in V3. thanks, fox
Re: [PATCH v2 12/12] docs: path-lookup: update symlink description
On Mon, Apr 19, 2021 at 9:59 AM NeilBrown wrote: > > On Tue, Mar 16 2021, Fox Chen wrote: > > > instead of lookup_real()/vfs_create(), i_op->lookup() and > > i_op->create() will be called directly. > > > > update vfs_open() logic > > > > should_follow_link is merged into lookup_last() or open_last_lookup() > > which returns symlink name instead of an integer. > > > > Signed-off-by: Fox Chen > > --- > > Documentation/filesystems/path-lookup.rst | 13 ++--- > > 1 file changed, 6 insertions(+), 7 deletions(-) > > > > diff --git a/Documentation/filesystems/path-lookup.rst > > b/Documentation/filesystems/path-lookup.rst > > index eef6e9f68fba..adbc714740c2 100644 > > --- a/Documentation/filesystems/path-lookup.rst > > +++ b/Documentation/filesystems/path-lookup.rst > > @@ -1202,16 +1202,15 @@ the code. > > it. If the file was found in the dcache, then ``vfs_open()`` is used > > for > > this. If not, then ``lookup_open()`` will either call > > ``atomic_open()`` (if > > the filesystem provides it) to combine the final lookup with the open, > > or > > - will perform the separate ``lookup_real()`` and ``vfs_create()`` steps > > + will perform the separate ``i_op->lookup()`` and ``i_op->create()`` > > steps > > directly. In the later case the actual "open" of this newly found or > > created file will be performed by ``vfs_open()``, just as if the name > > were found in the dcache. > > > > 2. ``vfs_open()`` can fail with ``-EOPENSTALE`` if the cached information > > - wasn't quite current enough. Rather than restarting the lookup from > > - the top with ``LOOKUP_REVAL`` set, ``lookup_open()`` is called instead, > > - giving the filesystem a chance to resolve small inconsistencies. > > - If that doesn't work, only then is the lookup restarted from the top. > > + wasn't quite current enough. If it's in RCU-walk -ECHILD will be > > returned > > + otherwise will return -ESTALE. When -ESTALE is returned, the caller may > > "otherwise -ESTALE is returned". > If you don't like repeating "is returned", then maybe: > "... -ECHILD will be returned, otherwise the result is -ESTALE". > > > > + retry with LOOKUP_REVAL flag set. > > > > 3. An open with O_CREAT **does** follow a symlink in the final component, > > unlike other creation system calls (like ``mkdir``). So the sequence:: > > @@ -1221,8 +1220,8 @@ the code. > > > > will create a file called ``/tmp/bar``. This is not permitted if > > ``O_EXCL`` is set but otherwise is handled for an O_CREAT open much > > - like for a non-creating open: ``should_follow_link()`` returns ``1``, > > and > > - so does ``do_last()`` so that ``trailing_symlink()`` gets called and the > > + like for a non-creating open: ``lookup_last()`` or > > ``open_last_lookup()`` > > + returns a non ``Null`` value, and ``link_path_walk()`` gets called and > > the > > "NULL", not "Null". > > This those changes, > Reviewed-by: NeilBrown > > Thanks for a lot of all these improvements!! and apologies for the delay > in the review. Thanks for the review, I will fix them and send the next version back. > Thanks, > NeilBrown > > > > open process continues on the symlink that was found. > > > > Updating the access time > > -- > > 2.30.2 thanks, fox
Re: [PATCH v2 01/12] docs: path-lookup: update follow_managed() part
On Mon, Apr 19, 2021 at 10:17 AM Matthew Wilcox wrote: > > On Tue, Mar 16, 2021 at 01:47:16PM +0800, Fox Chen wrote: > > -In the absence of symbolic links, ``walk_component()`` creates a new > > +As the last step of ``walk_component()``, ``step_into()`` will be called > > either > > You can drop ``..`` from around function named which are followed with > (). d74b0d31ddde ("Docs: An initial automarkup extension for sphinx") > marks them up automatically. > Got it, thanks for letting me know. But I will still use them in this patch series to keep consistency with the remaining parts of the document. thanks, fox
RE: [PATCH 5.10 00/25] 5.10.31-rc1 review
On Thu, 15 Apr 2021 16:47:54 +0200, Greg Kroah-Hartman wrote: > This is the start of the stable review cycle for the 5.10.31 release. > There are 25 patches in this series, all will be posted as a response > to this one. If anyone has any issues with these being applied, please > let me know. > > Responses should be made by Sat, 17 Apr 2021 14:44:01 +. > Anything received after that time might be too late. > > The whole patch series can be found in one patch at: > > https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.10.31-rc1.gz > or in the git tree and branch at: > > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git > linux-5.10.y > and the diffstat can be found below. > > thanks, > > greg k-h > 5.10.31-rc1 Successfully Compiled and booted on my Raspberry PI 4b (8g) (bcm2711) Tested-by: Fox Chen
RE: [PATCH 5.11 00/23] 5.11.15-rc1 review
On Thu, 15 Apr 2021 16:48:07 +0200, Greg Kroah-Hartman wrote: > This is the start of the stable review cycle for the 5.11.15 release. > There are 23 patches in this series, all will be posted as a response > to this one. If anyone has any issues with these being applied, please > let me know. > > Responses should be made by Sat, 17 Apr 2021 14:44:01 +. > Anything received after that time might be too late. > > The whole patch series can be found in one patch at: > > https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.11.15-rc1.gz > or in the git tree and branch at: > > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git > linux-5.11.y > and the diffstat can be found below. > > thanks, > > greg k-h > 5.11.15-rc1 Successfully Compiled and booted on my Raspberry PI 4b (8g) (bcm2711) Tested-by: Fox Chen
[PATCH v2 12/12] docs: path-lookup: update symlink description
instead of lookup_real()/vfs_create(), i_op->lookup() and i_op->create() will be called directly. update vfs_open() logic should_follow_link is merged into lookup_last() or open_last_lookup() which returns symlink name instead of an integer. Signed-off-by: Fox Chen --- Documentation/filesystems/path-lookup.rst | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index eef6e9f68fba..adbc714740c2 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -1202,16 +1202,15 @@ the code. it. If the file was found in the dcache, then ``vfs_open()`` is used for this. If not, then ``lookup_open()`` will either call ``atomic_open()`` (if the filesystem provides it) to combine the final lookup with the open, or - will perform the separate ``lookup_real()`` and ``vfs_create()`` steps + will perform the separate ``i_op->lookup()`` and ``i_op->create()`` steps directly. In the later case the actual "open" of this newly found or created file will be performed by ``vfs_open()``, just as if the name were found in the dcache. 2. ``vfs_open()`` can fail with ``-EOPENSTALE`` if the cached information - wasn't quite current enough. Rather than restarting the lookup from - the top with ``LOOKUP_REVAL`` set, ``lookup_open()`` is called instead, - giving the filesystem a chance to resolve small inconsistencies. - If that doesn't work, only then is the lookup restarted from the top. + wasn't quite current enough. If it's in RCU-walk -ECHILD will be returned + otherwise will return -ESTALE. When -ESTALE is returned, the caller may + retry with LOOKUP_REVAL flag set. 3. An open with O_CREAT **does** follow a symlink in the final component, unlike other creation system calls (like ``mkdir``). So the sequence:: @@ -1221,8 +1220,8 @@ the code. will create a file called ``/tmp/bar``. This is not permitted if ``O_EXCL`` is set but otherwise is handled for an O_CREAT open much - like for a non-creating open: ``should_follow_link()`` returns ``1``, and - so does ``do_last()`` so that ``trailing_symlink()`` gets called and the + like for a non-creating open: ``lookup_last()`` or ``open_last_lookup()`` + returns a non ``Null`` value, and ``link_path_walk()`` gets called and the open process continues on the symlink that was found. Updating the access time -- 2.30.2
[PATCH v2 10/12] docs: path-lookup: update WALK_GET, WALK_PUT desc
WALK_GET is changed to WALK_TRAILING with a different meaning. Here it should be WALK_NOFOLLOW. WALK_PUT dosn't exist, we have WALK_MORE. WALK_PUT == !WALK_MORE And there is not should_follow_link(). Related commits: commit 8c4efe22e7c4 ("namei: invert the meaning of WALK_FOLLOW") commit 1c4ff1a87e46 ("namei: invert WALK_PUT logics") Signed-off-by: Fox Chen --- Documentation/filesystems/path-lookup.rst | 12 +--- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index 0d41c61f7e4f..abd0153e2415 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -1123,13 +1123,11 @@ stack in ``walk_component()`` immediately when the symlink is found; old symlink as it walks that last component. So it is quite convenient for ``walk_component()`` to release the old symlink and pop the references just before pushing the reference information for the -new symlink. It is guided in this by two flags; ``WALK_GET``, which -gives it permission to follow a symlink if it finds one, and -``WALK_PUT``, which tells it to release the current symlink after it has been -followed. ``WALK_PUT`` is tested first, leading to a call to -``put_link()``. ``WALK_GET`` is tested subsequently (by -``should_follow_link()``) leading to a call to ``pick_link()`` which sets -up the stack frame. +new symlink. It is guided in this by two flags; ``WALK_NOFOLLOW``, which +suggests whether to follow a symlink if it finds one, and +``WALK_MORE``, which tells whether to release the current symlink after it has +been followed. ``WALK_MORE`` is tested first, leading to a call to +``put_link()``. Symlinks with no final component -- 2.30.2
[PATCH v2 09/12] docs: path-lookup: no get_link()
no get_link() anymore. we have step_into() and pick_link(). walk_component() will call step_into(), in turn call pick_link, and return symlink name. Signed-off-by: Fox Chen --- Documentation/filesystems/path-lookup.rst | 10 -- 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index 8ab95dd9046e..0d41c61f7e4f 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -1103,12 +1103,10 @@ doesn't need to notice. Getting this ``name`` variable on and off the stack is very straightforward; pushing and popping the references is a little more complex. -When a symlink is found, ``walk_component()`` returns the value ``1`` -(``0`` is returned for any other sort of success, and a negative number -is, as usual, an error indicator). This causes ``get_link()`` to be -called; it then gets the link from the filesystem. Providing that -operation is successful, the old path ``name`` is placed on the stack, -and the new value is used as the ``name`` for a while. When the end of +When a symlink is found, ``walk_component()`` calls ``pick_link()``, +it then gets the link from the filesystem returning new path ``name``. +Providing that operation is successful, the old path ``name`` is placed on the +stack, and the new value is used as the ``name`` for a while. When the end of the path is found (i.e. ``*name`` is ``'\0'``) the old ``name`` is restored off the stack and path walking continues. -- 2.30.2
[PATCH v2 11/12] docs: path-lookup: update get_link() ->follow_link description
get_link() is merged into pick_link(). i_op->follow_link is replaced with i_op->get_link(). get_link() can return ERR_PTR(0) which equals NULL. Signed-off-by: Fox Chen --- Documentation/filesystems/path-lookup.rst | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index abd0153e2415..eef6e9f68fba 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -1134,10 +1134,10 @@ Symlinks with no final component A pair of special-case symlinks deserve a little further explanation. Both result in a new ``struct path`` (with mount and dentry) being set -up in the ``nameidata``, and result in ``get_link()`` returning ``NULL``. +up in the ``nameidata``, and result in ``pick_link()`` returning ``NULL``. The more obvious case is a symlink to "``/``". All symlinks starting -with "``/``" are detected in ``get_link()`` which resets the ``nameidata`` +with "``/``" are detected in ``pick_link()`` which resets the ``nameidata`` to point to the effective filesystem root. If the symlink only contains "``/``" then there is nothing more to do, no components at all, so ``NULL`` is returned to indicate that the symlink can be released and @@ -1154,12 +1154,11 @@ something that looks like a symlink. It is really a reference to the target file, not just the name of it. When you ``readlink`` these objects you get a name that might refer to the same file - unless it has been unlinked or mounted over. When ``walk_component()`` follows -one of these, the ``->follow_link()`` method in "procfs" doesn't return +one of these, the ``->get_link()`` method in "procfs" doesn't return a string name, but instead calls ``nd_jump_link()`` which updates the -``nameidata`` in place to point to that target. ``->follow_link()`` then -returns ``NULL``. Again there is no final component and ``get_link()`` -reports this by leaving the ``last_type`` field of ``nameidata`` as -``LAST_BIND``. +``nameidata`` in place to point to that target. ``->get_link()`` then +returns ``0``. Again there is no final component and ``pick_link()`` +returns NULL. Following the symlink in the final component -- 2.30.2
[PATCH v2 06/12] docs: path-lookup: Add macro name to symlink limit description
Add macro name MAXSYMLINKS to the symlink limit description, so that it is consistent with path name length description above. Signed-off-by: Fox Chen --- Documentation/filesystems/path-lookup.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index 66697db74955..af5c20fecfef 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -992,8 +992,8 @@ is 4096. There are a number of reasons for this limit; not letting the kernel spend too much time on just one path is one of them. With symbolic links you can effectively generate much longer paths so some sort of limit is needed for the same reason. Linux imposes a limit of -at most 40 symlinks in any one path lookup. It previously imposed a -further limit of eight on the maximum depth of recursion, but that was +at most 40 (MAXSYMLINKS) symlinks in any one path lookup. It previously imposed +a further limit of eight on the maximum depth of recursion, but that was raised to 40 when a separate stack was implemented, so there is now just the one limit. -- 2.30.2
[PATCH v2 07/12] docs: path-lookup: i_op->follow_link replaced with i_op->get_link
follow_link has been replaced by get_link() which can be called in RCU mode. see commit: commit 6b2553918d8b ("replace ->follow_link() with new method that could stay in RCU mode") Signed-off-by: Fox Chen --- Documentation/filesystems/path-lookup.rst | 12 +--- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index af5c20fecfef..e6b6c43ff0f6 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -1060,13 +1060,11 @@ filesystem cannot successfully get a reference in RCU-walk mode, it must return ``-ECHILD`` and ``unlazy_walk()`` will be called to return to REF-walk mode in which the filesystem is allowed to sleep. -The place for all this to happen is the ``i_op->follow_link()`` inode -method. In the present mainline code this is never actually called in -RCU-walk mode as the rewrite is not quite complete. It is likely that -in a future release this method will be passed an ``inode`` pointer when -called in RCU-walk mode so it both (1) knows to be careful, and (2) has the -validated pointer. Much like the ``i_op->permission()`` method we -looked at previously, ``->follow_link()`` would need to be careful that +The place for all this to happen is the ``i_op->get_link()`` inode +method. This is called both in RCU-walk and REF-walk. In RCU-walk the +``dentry*`` argument is NULL, ``->get_link()`` can return -ECHILD to drop out of +RCU-walk. Much like the ``i_op->permission()`` method we +looked at previously, ``->get_link()`` would need to be careful that all the data structures it references are safe to be accessed while holding no counted reference, only the RCU lock. Though getting a reference with ``->follow_link()`` is not yet done in RCU-walk mode, the -- 2.30.2
[PATCH v2 08/12] docs: path-lookup: update i_op->put_link and cookie description
No inode->put_link operation anymore. We use delayed_call to deal with link destruction. Cookie has been replaced with struct delayed_call. Related commit: commit fceef393a538 ("switch ->get_link() to delayed_call, kill ->put_link()") Signed-off-by: Fox Chen --- Documentation/filesystems/path-lookup.rst | 30 ++- 1 file changed, 8 insertions(+), 22 deletions(-) diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index e6b6c43ff0f6..8ab95dd9046e 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -1066,34 +1066,20 @@ method. This is called both in RCU-walk and REF-walk. In RCU-walk the RCU-walk. Much like the ``i_op->permission()`` method we looked at previously, ``->get_link()`` would need to be careful that all the data structures it references are safe to be accessed while -holding no counted reference, only the RCU lock. Though getting a -reference with ``->follow_link()`` is not yet done in RCU-walk mode, the -code is ready to release the reference when that does happen. - -This need to drop the reference to a symlink adds significant -complexity. It requires a reference to the inode so that the -``i_op->put_link()`` inode operation can be called. In REF-walk, that -reference is kept implicitly through a reference to the dentry, so -keeping the ``struct path`` of the symlink is easiest. For RCU-walk, -the pointer to the inode is kept separately. To allow switching from -RCU-walk back to REF-walk in the middle of processing nested symlinks -we also need the seq number for the dentry so we can confirm that -switching back was safe. - -Finally, when providing a reference to a symlink, the filesystem also -provides an opaque "cookie" that must be passed to ``->put_link()`` so that it -knows what to free. This might be the allocated memory area, or a -pointer to the ``struct page`` in the page cache, or something else -completely. Only the filesystem knows what it is. +holding no counted reference, only the RCU lock. A callback +``struct delayed_called`` will be passed to get_link, +file systems can set their own put_link function and argument through +``set_delayed_call``. Later on, when vfs wants to put link, it will call +``do_delayed_call`` to invoke that callback function with the argument. In order for the reference to each symlink to be dropped when the walk completes, whether in RCU-walk or REF-walk, the symlink stack needs to contain, along with the path remnants: -- the ``struct path`` to provide a reference to the inode in REF-walk -- the ``struct inode *`` to provide a reference to the inode in RCU-walk +- the ``struct path`` to provide a reference to the previous path +- the ``const char *`` to provide a reference to the to previous name - the ``seq`` to allow the path to be safely switched from RCU-walk to REF-walk -- the ``cookie`` that tells ``->put_path()`` what to put. +- the ``struct delayed_call`` for later invocation. This means that each entry in the symlink stack needs to hold five pointers and an integer instead of just one pointer (the path -- 2.30.2
[PATCH v2 05/12] docs: path-lookup: remove filename_mountpoint
No filename_mountpoint any more see commit: commit 161aff1d93ab ("LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat()") Without filename_mountpoint and path_mountpoint(), the numbers should be four & three: "These four correspond roughly to the three path_*() functions" Signed-off-by: Fox Chen --- Documentation/filesystems/path-lookup.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index a65cb477d524..66697db74955 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -652,9 +652,9 @@ restarts from the top with REF-walk. This pattern of "try RCU-walk, if that fails try REF-walk" can be clearly seen in functions like ``filename_lookup()``, -``filename_parentat()``, ``filename_mountpoint()``, -``do_filp_open()``, and ``do_file_open_root()``. These five -correspond roughly to the four ``path_*()`` functions we met earlier, +``filename_parentat()``, +``do_filp_open()``, and ``do_file_open_root()``. These four +correspond roughly to the three ``path_*()`` functions we met earlier, each of which calls ``link_path_walk()``. The ``path_*()`` functions are called using different mode flags until a mode is found which works. They are first called with ``LOOKUP_RCU`` set to request "RCU-walk". If -- 2.30.2
[PATCH v2 04/12] docs: path-lookup: update do_last() part
traling_symlink() was merged into lookup_last, do_last(). do_last() has later been split into open_last_lookups() and do_open(). see related commit: commit c5971b8c6354 ("take post-lookup part of do_last() out of loop") Signed-off-by: Fox Chen --- Documentation/filesystems/path-lookup.rst | 35 --- 1 file changed, 18 insertions(+), 17 deletions(-) diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index b6a301b78121..a65cb477d524 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -495,11 +495,11 @@ This is important when unmounting a filesystem that is inaccessible, such as one provided by a dead NFS server. Finally ``path_openat()`` is used for the ``open()`` system call; it -contains, in support functions starting with "``do_last()``", all the +contains, in support functions starting with "``open_last_lookups()``", all the complexity needed to handle the different subtleties of O_CREAT (with or without O_EXCL), final "``/``" characters, and trailing symbolic links. We will revisit this in the final part of this series, which -focuses on those symbolic links. "``do_last()``" will sometimes, but +focuses on those symbolic links. "``open_last_lookups()``" will sometimes, but not always, take ``i_rwsem``, depending on what it finds. Each of these, or the functions which call them, need to be alert to @@ -1199,26 +1199,27 @@ symlink. This case is handled by the relevant caller of ``link_path_walk()``, such as ``path_lookupat()`` using a loop that calls ``link_path_walk()``, and then handles the final component. If the final component is a symlink -that needs to be followed, then ``trailing_symlink()`` is called to set -things up properly and the loop repeats, calling ``link_path_walk()`` -again. This could loop as many as 40 times if the last component of -each symlink is another symlink. +that needs to be followed, then ``open_last_lookups()`` is +called to set things up properly and the loop repeats, calling +``link_path_walk()`` again. This could loop as many as 40 times if the last +component of each symlink is another symlink. The various functions that examine the final component and possibly -report that it is a symlink are ``lookup_last()``, ``mountpoint_last()`` -and ``do_last()``, each of which use the same convention as -``walk_component()`` of returning ``1`` if a symlink was found that needs -to be followed. +report that it is a symlink are ``lookup_last()``, ``open_last_lookups()`` +, each of which use the same convention as +``walk_component()`` of returning ``char *name`` if a symlink was found that +needs to be followed. -Of these, ``do_last()`` is the most interesting as it is used for -opening a file. Part of ``do_last()`` runs with ``i_rwsem`` held and this -part is in a separate function: ``lookup_open()``. +Of these, ``open_last_lookups()`` is the most interesting as it works in tandem +with ``do_open()`` for opening a file. Part of ``open_last_lookups()`` runs +with ``i_rwsem`` held and this part is in a separate function: ``lookup_open()``. -Explaining ``do_last()`` completely is beyond the scope of this article, -but a few highlights should help those interested in exploring the -code. +Explaining ``open_last_lookups()`` and ``do_open()`` completely is beyond the scope +of this article, but a few highlights should help those interested in exploring +the code. -1. Rather than just finding the target file, ``do_last()`` needs to open +1. Rather than just finding the target file, ``do_open()`` is used after + ``open_last_lookup()`` to open it. If the file was found in the dcache, then ``vfs_open()`` is used for this. If not, then ``lookup_open()`` will either call ``atomic_open()`` (if the filesystem provides it) to combine the final lookup with the open, or -- 2.30.2
[PATCH v2 02/12] docs: path-lookup: update path_to_nameidata() part
No path_to_namei() anymore, step_into() will be called. Related commit: commit c99687a03a78 ("fold path_to_nameidata() into its only remaining caller") Signed-off-by: Fox Chen --- Documentation/filesystems/path-lookup.rst | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index d07766375e13..a29d714431a3 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -455,9 +455,10 @@ directly from walk_component() or from handle_dots(). It calls ``handle_mount()``, to check and handle mount points, in which a new ``struct path`` containing a counted reference to the new dentry and a reference to the new ``vfsmount`` which is only counted if it is -different from the previous ``vfsmount``. It then calls -``path_to_nameidata()`` to install the new ``struct path`` in the -``struct nameidata`` and drop the unneeded references. +different from the previous ``vfsmount`` will be created. Then if there is +symbolic link, ``step_into()`` calls ``pick_link()`` to deal with it, otherwise +installs the new ``struct path`` in the ``struct nameidata`` and drop the +unneeded references. This "hand-over-hand" sequencing of getting a reference to the new dentry before dropping the reference to the previous dentry may -- 2.30.2
[PATCH v2 01/12] docs: path-lookup: update follow_managed() part
No follow_managed() anymore, handle_mounts(), traverse_mounts(), will do the job. see commit 9deed3ebca24 ("new helper: traverse_mounts()") Signed-off-by: Fox Chen --- Documentation/filesystems/path-lookup.rst | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index c482e1619e77..d07766375e13 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -448,10 +448,11 @@ described. If it finds a ``LAST_NORM`` component it first calls filesystem to revalidate the result if it is that sort of filesystem. If that doesn't get a good result, it calls "``lookup_slow()``" which takes ``i_rwsem``, rechecks the cache, and then asks the filesystem -to find a definitive answer. Each of these will call -``follow_managed()`` (as described below) to handle any mount points. +to find a definitive answer. -In the absence of symbolic links, ``walk_component()`` creates a new +As the last step of ``walk_component()``, ``step_into()`` will be called either +directly from walk_component() or from handle_dots(). It calls +``handle_mount()``, to check and handle mount points, in which a new ``struct path`` containing a counted reference to the new dentry and a reference to the new ``vfsmount`` which is only counted if it is different from the previous ``vfsmount``. It then calls @@ -535,8 +536,7 @@ covered in greater detail in autofs.txt in the Linux documentation tree, but a few notes specifically related to path lookup are in order here. -The Linux VFS has a concept of "managed" dentries which is reflected -in function names such as "``follow_managed()``". There are three +The Linux VFS has a concept of "managed" dentries. There are three potentially interesting things about these dentries corresponding to three different flags that might be set in ``dentry->d_flags``: -- 2.30.2
[PATCH v2 03/12] docs: path-lookup: update path_mountpoint() part
path_mountpoint() doesn't exist anymore. Have been folded into path_lookup_at when flag is set with LOOKUP_MOUNTPOINT. Check commit: commit 161aff1d93abf0e ("LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat()") Signed-off-by: Fox Chen --- Documentation/filesystems/path-lookup.rst | 12 +--- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index a29d714431a3..b6a301b78121 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -472,7 +472,7 @@ Handling the final component ``nd->last_type`` to refer to the final component of the path. It does not call ``walk_component()`` that last time. Handling that final component remains for the caller to sort out. Those callers are -``path_lookupat()``, ``path_parentat()``, ``path_mountpoint()`` and +``path_lookupat()``, ``path_parentat()`` and ``path_openat()`` each of which handles the differing requirements of different system calls. @@ -488,12 +488,10 @@ perform their operation. object is wanted such as by ``stat()`` or ``chmod()``. It essentially just calls ``walk_component()`` on the final component through a call to ``lookup_last()``. ``path_lookupat()`` returns just the final dentry. - -``path_mountpoint()`` handles the special case of unmounting which must -not try to revalidate the mounted filesystem. It effectively -contains, through a call to ``mountpoint_last()``, an alternate -implementation of ``lookup_slow()`` which skips that step. This is -important when unmounting a filesystem that is inaccessible, such as +It is worth noting that when flag ``LOOKUP_MOUNTPOINT`` is set, +``path_lookupat()`` will unset LOOKUP_JUMPED in nameidata so that in the further +path traversal ``d_weak_revalidate()`` won't be called. +This is important when unmounting a filesystem that is inaccessible, such as one provided by a dead NFS server. Finally ``path_openat()`` is used for the ``open()`` system call; it -- 2.30.2
[PATCH v2 00/12] docs: path-lookup: Update pathlookup docs
The Path lookup is a very complex subject in VFS. The path-lookup document provides a very detailed guidance to help people understand how path lookup works in the kernel. This document was originally written based on three lwn articles five years ago. As times goes by, some of the content is outdated. This patchset is intended to update the document to make it more relevant to current codebase. --- v1: https://lore.kernel.org/lkml/20210126072443.33066-1-foxhlc...@gmail.com/ v2: - Fix problems in v1 reviewed by Neil: 1. In Patch 01 and 02 rewrite a new paragrah to describe step_into() 2. In Patch 01 instead of changing it to traverse_mounts, remove follow_managed() 3. In Patch 03 re-telling the story rather than adding notes 4. In Patch 04 do_open() should be outside of loop, fix it and fix other problems in following paragrah 5. In Patch 07 use "drop out of RCU-walk" 6. In Patch 08 "latter" should be "later", fix it and restructure the next paragrah removing "Finally" To help review, I've put a compiled html version here: http://linux-docs.54fox.com/linux_docs/filesystems/path-lookup-v2.html Fox Chen (12): docs: path-lookup: update follow_managed() part docs: path-lookup: update path_to_nameidata() part docs: path-lookup: update path_mountpoint() part docs: path-lookup: update do_last() part docs: path-lookup: remove filename_mountpoint docs: path-lookup: Add macro name to symlink limit description docs: path-lookup: i_op->follow_link replaced with i_op->get_link docs: path-lookup: update i_op->put_link and cookie description docs: path-lookup: no get_link() docs: path-lookup: update WALK_GET, WALK_PUT desc docs: path-lookup: update get_link() ->follow_link description docs: path-lookup: update symlink description Documentation/filesystems/path-lookup.rst | 164 ++ 1 file changed, 71 insertions(+), 93 deletions(-) -- 2.30.2
Re: [PATCH 00/12] docs: path-lookup: Update pathlookup docs
On Thu, Jan 28, 2021 at 11:58 AM NeilBrown wrote: > > On Tue, Jan 26 2021, Fox Chen wrote: > > > The Path lookup is a very complex subject in VFS. The path-lookup > > document provides a very detailed guidance to help people understand > > how path lookup works in the kernel.This document was originally > > written based on three lwn articles five years ago. As times goes by, > > some of the content was outdated. This patchset is intended to update > > the document to make it more relevant to current codebase. > > > > > > Fox Chen (12): > > docs: path-lookup: update follow_managed() part > > docs: path-lookup: update path_to_nameidata() parth > > docs: path-lookup: update path_mountpoint() part > > docs: path-lookup: update do_last() part > > docs: path-lookup: remove filename_mountpoint > > docs: path-lookup: Add macro name to symlink limit description > > docs: path-lookup: i_op->follow_link replaced with i_op->get_link > > docs: path-lookup: update i_op->put_link and cookie description > > docs: path-lookup: no get_link() > > docs: path-lookup: update WALK_GET, WALK_PUT desc > > docs: path-lookup: update get_link() ->follow_link description > > docs: path-lookup: update symlink description > > > > Thanks for doing this. I've responded individually to several of the > patches. As you can see from my comments, there is often more to it > than just changing function names. In some places you have capture the > more general nature of the change fairly well. In other places the > result is incoherent or confusion. > Making small updates to this sort of documentation is not easy. You > need to step have and see the "big picture", to overall story-arc. > Sometimes you can fit changes into that arc, sometimes you might need to > restructure or re-tell the story. > > This is part of why I haven't put much effort into the document - > re-telling a story can be a lot of work. > Thanks for reviewing my patchset. Yeah, sometimes it's difficult to fit in the context and match the writing style. Most part of the article is still valid, adding annotations may be better than rewrite. Let me revise it and send in the v2, though It may take a while. thanks, fox > NeilBrown
Re: [PATCH 00/12] docs: path-lookup: Update pathlookup docs
On Wed, Jan 27, 2021 at 4:31 AM Jonathan Corbet wrote: > > On Tue, 26 Jan 2021 15:24:31 +0800 > Fox Chen wrote: > > > The Path lookup is a very complex subject in VFS. The path-lookup > > document provides a very detailed guidance to help people understand > > how path lookup works in the kernel.This document was originally > > written based on three lwn articles five years ago. As times goes by, > > some of the content was outdated. This patchset is intended to update > > the document to make it more relevant to current codebase. > > > > > > Fox Chen (12): > > docs: path-lookup: update follow_managed() part > > docs: path-lookup: update path_to_nameidata() parth > > docs: path-lookup: update path_mountpoint() part > > docs: path-lookup: update do_last() part > > docs: path-lookup: remove filename_mountpoint > > docs: path-lookup: Add macro name to symlink limit description > > docs: path-lookup: i_op->follow_link replaced with i_op->get_link > > docs: path-lookup: update i_op->put_link and cookie description > > docs: path-lookup: no get_link() > > docs: path-lookup: update WALK_GET, WALK_PUT desc > > docs: path-lookup: update get_link() ->follow_link description > > docs: path-lookup: update symlink description > > > > Documentation/filesystems/path-lookup.rst | 146 ++ > > 1 file changed, 63 insertions(+), 83 deletions(-) > > Neil Brown (copied) is the original author of this document; I'd really > like his feedback on these changes. Neil, the full set is at: > > > https://lore.kernel.org/lkml/20210126072443.33066-1-foxhlc...@gmail.com/ > > Thanks, > > jon FYI, To help review the patches I marked & annotated all the places I've changed. you can check the link: https://hyp.is/go?url=https%3A%2F%2Fwww.kernel.org%2Fdoc%2Fhtml%2Flatest%2Ffilesystems%2Fpath-lookup.html=__world__ thanks, fox
Re: [PATCH 01/12] docs: path-lookup: update follow_managed() part
On Tue, Jan 26, 2021 at 10:03 PM Greg KH wrote: > > On Tue, Jan 26, 2021 at 03:24:32PM +0800, Fox Chen wrote: > > No follow_managed() anymore, handle_mounts(), > > traverse_mounts(), will do the job. > > see commit: 9deed3ebca244663530782631834e706a86a8c8f > > When referencing commits in changelogs, please use the documented way, > which for this one would be 9deed3ebca24 ("new helper: > traverse_mounts()"). That enables us to read and understand them > better. > > thanks, > > greg k-h Got it, I will wait to see any other things that need to be changed, then send in v2 together. thanks, fox
[PATCH 10/12] docs: path-lookup: update WALK_GET, WALK_PUT desc
WALK_GET is changed to WALK_TRAILING with a different meaning. Here it should be WALK_NOFOLLOW. WALK_PUT dosn't exist, we have WALK_MORE. WALK_PUT == !WALK_MORE And there is not should_follow_link(). related commits: 8c4efe22e7c4de1d44f624120a979e31e3a584b8 1c4ff1a87e46a06fc00a83da2fbbc3ac0298f221 Signed-off-by: Fox Chen --- Documentation/filesystems/path-lookup.rst | 12 +--- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index 915c9ffe22c1..921779a4636f 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -1126,13 +1126,11 @@ stack in ``walk_component()`` immediately when the symlink is found; old symlink as it walks that last component. So it is quite convenient for ``walk_component()`` to release the old symlink and pop the references just before pushing the reference information for the -new symlink. It is guided in this by two flags; ``WALK_GET``, which -gives it permission to follow a symlink if it finds one, and -``WALK_PUT``, which tells it to release the current symlink after it has been -followed. ``WALK_PUT`` is tested first, leading to a call to -``put_link()``. ``WALK_GET`` is tested subsequently (by -``should_follow_link()``) leading to a call to ``pick_link()`` which sets -up the stack frame. +new symlink. It is guided in this by two flags; ``WALK_NOFOLLOW``, which +suggests whether to follow a symlink if it finds one, and +``WALK_MORE``, which tells whether to release the current symlink after it has +been followed. ``WALK_MORE`` is tested first, leading to a call to +``put_link()``. Symlinks with no final component -- 2.30.0
[PATCH 08/12] docs: path-lookup: update i_op->put_link and cookie description
No inode->put_link operation anymore. We use delayed_call to deal with link destruction. Cookie has been replaced with struct delayed_call. Related commit: fceef393a538134f03b778c5d2519e670269342f Signed-off-by: Fox Chen --- Documentation/filesystems/path-lookup.rst | 31 +++ 1 file changed, 9 insertions(+), 22 deletions(-) diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index 0a362849b26f..8572300b5405 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -1068,34 +1068,21 @@ method. This is called both in RCU-walk and REF-walk. In RCU-walk the RCU-walk. Much like the ``i_op->permission()`` method we looked at previously, ``->get_link()`` would need to be careful that all the data structures it references are safe to be accessed while -holding no counted reference, only the RCU lock. Though getting a -reference with ``->follow_link()`` is not yet done in RCU-walk mode, the -code is ready to release the reference when that does happen. - -This need to drop the reference to a symlink adds significant -complexity. It requires a reference to the inode so that the -``i_op->put_link()`` inode operation can be called. In REF-walk, that -reference is kept implicitly through a reference to the dentry, so -keeping the ``struct path`` of the symlink is easiest. For RCU-walk, -the pointer to the inode is kept separately. To allow switching from -RCU-walk back to REF-walk in the middle of processing nested symlinks -we also need the seq number for the dentry so we can confirm that -switching back was safe. - -Finally, when providing a reference to a symlink, the filesystem also -provides an opaque "cookie" that must be passed to ``->put_link()`` so that it -knows what to free. This might be the allocated memory area, or a -pointer to the ``struct page`` in the page cache, or something else -completely. Only the filesystem knows what it is. +holding no counted reference, only the RCU lock. + +Finally, a callback ``struct delayed_called`` will be passed to get_link, +file systems can set their own put_link function and argument through +``set_delayed_call``. Latter on, when vfs wants to put link, it will call +``do_delayed_call`` to invoke that callback function with the argument. In order for the reference to each symlink to be dropped when the walk completes, whether in RCU-walk or REF-walk, the symlink stack needs to contain, along with the path remnants: -- the ``struct path`` to provide a reference to the inode in REF-walk -- the ``struct inode *`` to provide a reference to the inode in RCU-walk +- the ``struct path`` to provide a reference to the previous path +- the ``const char *`` to provide a reference to the to previous name - the ``seq`` to allow the path to be safely switched from RCU-walk to REF-walk -- the ``cookie`` that tells ``->put_path()`` what to put. +- the ``struct delayed_call`` for later invocation. This means that each entry in the symlink stack needs to hold five pointers and an integer instead of just one pointer (the path -- 2.30.0
[PATCH 09/12] docs: path-lookup: no get_link()
no get_link() anymore. we have step_into() and pick_link(). walk_component() will call step_into(), in turn call pick_link, and return symlink name. Signed-off-by: Fox Chen --- Documentation/filesystems/path-lookup.rst | 10 -- 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index 8572300b5405..915c9ffe22c1 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -1106,12 +1106,10 @@ doesn't need to notice. Getting this ``name`` variable on and off the stack is very straightforward; pushing and popping the references is a little more complex. -When a symlink is found, ``walk_component()`` returns the value ``1`` -(``0`` is returned for any other sort of success, and a negative number -is, as usual, an error indicator). This causes ``get_link()`` to be -called; it then gets the link from the filesystem. Providing that -operation is successful, the old path ``name`` is placed on the stack, -and the new value is used as the ``name`` for a while. When the end of +When a symlink is found, ``walk_component()`` calls ``pick_link()``, +it then gets the link from the filesystem returning new path ``name``. +Providing that operation is successful, the old path ``name`` is placed on the +stack, and the new value is used as the ``name`` for a while. When the end of the path is found (i.e. ``*name`` is ``'\0'``) the old ``name`` is restored off the stack and path walking continues. -- 2.30.0
Re: [PATCH 00/12] docs: path-lookup: Update pathlookup docs
On Tue, Jan 26, 2021 at 3:25 PM Fox Chen wrote: > > The Path lookup is a very complex subject in VFS. The path-lookup > document provides a very detailed guidance to help people understand > how path lookup works in the kernel.This document was originally > written based on three lwn articles five years ago. As times goes by, > some of the content was outdated. This patchset is intended to update > the document to make it more relevant to current codebase. > > > Fox Chen (12): > docs: path-lookup: update follow_managed() part > docs: path-lookup: update path_to_nameidata() parth > docs: path-lookup: update path_mountpoint() part > docs: path-lookup: update do_last() part > docs: path-lookup: remove filename_mountpoint > docs: path-lookup: Add macro name to symlink limit description > docs: path-lookup: i_op->follow_link replaced with i_op->get_link > docs: path-lookup: update i_op->put_link and cookie description > docs: path-lookup: no get_link() > docs: path-lookup: update WALK_GET, WALK_PUT desc > docs: path-lookup: update get_link() ->follow_link description > docs: path-lookup: update symlink description > > Documentation/filesystems/path-lookup.rst | 146 ++ > 1 file changed, 63 insertions(+), 83 deletions(-) > > -- > 2.30.0 > To help review the patches, I annotated and highlighted changes using hypothesis you can check the current docs: https://hyp.is/go?url=https%3A%2F%2Fwww.kernel.org%2Fdoc%2Fhtml%2Flatest%2Ffilesystems%2Fpath-lookup.html=__world__ the docs after patches applied: https://hyp.is/go?url=http%3A%2F%2Flinux-docs.54fox.com%2Flinux_docs%2Ffilesystems%2Fpath-lookup.html=__world__
[PATCH 07/12] docs: path-lookup: i_op->follow_link replaced with i_op->get_link
follow_link has been replaced by get_link() which can be called in RCU mode. see commit: 6b2553918d8b4e6de9853fd6315bec7271a2e592 Signed-off-by: Fox Chen --- Documentation/filesystems/path-lookup.rst | 12 +--- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index 25d2a5a59f45..0a362849b26f 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -1062,13 +1062,11 @@ filesystem cannot successfully get a reference in RCU-walk mode, it must return ``-ECHILD`` and ``unlazy_walk()`` will be called to return to REF-walk mode in which the filesystem is allowed to sleep. -The place for all this to happen is the ``i_op->follow_link()`` inode -method. In the present mainline code this is never actually called in -RCU-walk mode as the rewrite is not quite complete. It is likely that -in a future release this method will be passed an ``inode`` pointer when -called in RCU-walk mode so it both (1) knows to be careful, and (2) has the -validated pointer. Much like the ``i_op->permission()`` method we -looked at previously, ``->follow_link()`` would need to be careful that +The place for all this to happen is the ``i_op->get_link()`` inode +method. This is called both in RCU-walk and REF-walk. In RCU-walk the +``dentry*`` argument is NULL, ``->get_link()`` can return -ECHILD to drop +RCU-walk. Much like the ``i_op->permission()`` method we +looked at previously, ``->get_link()`` would need to be careful that all the data structures it references are safe to be accessed while holding no counted reference, only the RCU lock. Though getting a reference with ``->follow_link()`` is not yet done in RCU-walk mode, the -- 2.30.0
[PATCH 12/12] docs: path-lookup: update symlink description
instead of lookup_real()/vfs_create(), i_op->lookup() and i_op->create() will be called directly. update vfs_open() logic should_follow_link is merged into lookup_last() or open_last_lookup() which returns symlink name instead of an integer. Signed-off-by: Fox Chen --- Documentation/filesystems/path-lookup.rst | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index 2bb3ca486acd..0c6fc296056c 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -1204,16 +1204,15 @@ the code. it. If the file was found in the dcache, then ``vfs_open()`` is used for this. If not, then ``lookup_open()`` will either call ``atomic_open()`` (if the filesystem provides it) to combine the final lookup with the open, or - will perform the separate ``lookup_real()`` and ``vfs_create()`` steps + will perform the separate ``i_op->lookup()`` and ``i_op->create()`` steps directly. In the later case the actual "open" of this newly found or created file will be performed by ``vfs_open()``, just as if the name were found in the dcache. 2. ``vfs_open()`` can fail with ``-EOPENSTALE`` if the cached information - wasn't quite current enough. Rather than restarting the lookup from - the top with ``LOOKUP_REVAL`` set, ``lookup_open()`` is called instead, - giving the filesystem a chance to resolve small inconsistencies. - If that doesn't work, only then is the lookup restarted from the top. + wasn't quite current enough. If it's in RCU-walk -ECHILD will be returned + otherwise will return -ESTALE. When -ESTALE is returned, the caller may + retry with LOOKUP_REVAL flag set. 3. An open with O_CREAT **does** follow a symlink in the final component, unlike other creation system calls (like ``mkdir``). So the sequence:: @@ -1223,8 +1222,8 @@ the code. will create a file called ``/tmp/bar``. This is not permitted if ``O_EXCL`` is set but otherwise is handled for an O_CREAT open much - like for a non-creating open: ``should_follow_link()`` returns ``1``, and - so does ``do_last()`` so that ``trailing_symlink()`` gets called and the + like for a non-creating open: ``lookup_last()`` or ``open_last_lookup()`` + returns a non ``Null`` value, and ``link_path_walk()`` gets called and the open process continues on the symlink that was found. Updating the access time -- 2.30.0
[PATCH 11/12] docs: path-lookup: update get_link() ->follow_link description
get_link() is merged into pick_link(). i_op->follow_link is replaced with i_op->get_link(). get_link() can return ERR_PTR(0) which equals NULL. Signed-off-by: Fox Chen --- Documentation/filesystems/path-lookup.rst | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index 921779a4636f..2bb3ca486acd 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -1137,10 +1137,10 @@ Symlinks with no final component A pair of special-case symlinks deserve a little further explanation. Both result in a new ``struct path`` (with mount and dentry) being set -up in the ``nameidata``, and result in ``get_link()`` returning ``NULL``. +up in the ``nameidata``, and result in ``pick_link()`` returning ``NULL``. The more obvious case is a symlink to "``/``". All symlinks starting -with "``/``" are detected in ``get_link()`` which resets the ``nameidata`` +with "``/``" are detected in ``pick_link()`` which resets the ``nameidata`` to point to the effective filesystem root. If the symlink only contains "``/``" then there is nothing more to do, no components at all, so ``NULL`` is returned to indicate that the symlink can be released and @@ -1157,12 +1157,11 @@ something that looks like a symlink. It is really a reference to the target file, not just the name of it. When you ``readlink`` these objects you get a name that might refer to the same file - unless it has been unlinked or mounted over. When ``walk_component()`` follows -one of these, the ``->follow_link()`` method in "procfs" doesn't return +one of these, the ``->get_link()`` method in "procfs" doesn't return a string name, but instead calls ``nd_jump_link()`` which updates the -``nameidata`` in place to point to that target. ``->follow_link()`` then -returns ``NULL``. Again there is no final component and ``get_link()`` -reports this by leaving the ``last_type`` field of ``nameidata`` as -``LAST_BIND``. +``nameidata`` in place to point to that target. ``->get_link()`` then +returns ``0``. Again there is no final component and ``pick_link()`` +returns NULL. Following the symlink in the final component -- 2.30.0
[PATCH 04/12] docs: path-lookup: update do_last() part
traling_symlink() was merged into lookup_last, do_last(). do_last() has later been split into open_last_lookups() and do_open(). see related commit: c5971b8c6354a95c9ee7eb722928af5000bac247 Signed-off-by: Fox Chen --- Documentation/filesystems/path-lookup.rst | 34 +++ 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index 4e77c8520fa9..1f05b1417a55 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -496,11 +496,11 @@ one provided by a dead NFS server. In the current kernel, path_mountpoint has been merged into ``path_lookup_at()`` with a new flag LOOKUP_MOUNTPOINT. Finally ``path_openat()`` is used for the ``open()`` system call; it -contains, in support functions starting with "``do_last()``", all the +contains, in support functions starting with "``open_last_lookups()``", all the complexity needed to handle the different subtleties of O_CREAT (with or without O_EXCL), final "``/``" characters, and trailing symbolic links. We will revisit this in the final part of this series, which -focuses on those symbolic links. "``do_last()``" will sometimes, but +focuses on those symbolic links. "``open_last_lookups()``" will sometimes, but not always, take ``i_rwsem``, depending on what it finds. Each of these, or the functions which call them, need to be alert to @@ -1201,26 +1201,26 @@ symlink. This case is handled by the relevant caller of ``link_path_walk()``, such as ``path_lookupat()`` using a loop that calls ``link_path_walk()``, and then handles the final component. If the final component is a symlink -that needs to be followed, then ``trailing_symlink()`` is called to set -things up properly and the loop repeats, calling ``link_path_walk()`` -again. This could loop as many as 40 times if the last component of -each symlink is another symlink. +that needs to be followed, then ``open_last_lookups()`` and ``do_open()`` is +called to set things up properly and the loop repeats, calling +``link_path_walk()`` again. This could loop as many as 40 times if the last +component of each symlink is another symlink. The various functions that examine the final component and possibly -report that it is a symlink are ``lookup_last()``, ``mountpoint_last()`` -and ``do_last()``, each of which use the same convention as -``walk_component()`` of returning ``1`` if a symlink was found that needs -to be followed. +report that it is a symlink are ``lookup_last()``, ``open_last_lookups()`` +, each of which use the same convention as +``walk_component()`` of returning ``char *name`` if a symlink was found that +needs to be followed. -Of these, ``do_last()`` is the most interesting as it is used for -opening a file. Part of ``do_last()`` runs with ``i_rwsem`` held and this -part is in a separate function: ``lookup_open()``. +Of these, ``open_last_lookups()``, ``do_open()`` is the most interesting as it is +used for opening a file. Part of ``open_last_lookups()`` runs with ``i_rwsem`` +held and this part is in a separate function: ``lookup_open()``. -Explaining ``do_last()`` completely is beyond the scope of this article, -but a few highlights should help those interested in exploring the -code. +Explaining ``open_last_lookups()``, ``do_open()`` completely is beyond the scope +of this article, but a few highlights should help those interested in exploring +the code. -1. Rather than just finding the target file, ``do_last()`` needs to open +1. Rather than just finding the target file, ``do_open()`` needs to open it. If the file was found in the dcache, then ``vfs_open()`` is used for this. If not, then ``lookup_open()`` will either call ``atomic_open()`` (if the filesystem provides it) to combine the final lookup with the open, or -- 2.30.0
[PATCH 06/12] docs: path-lookup: Add macro name to symlink limit description
Add macro name MAXSYMLINKS to the symlink limit description, so that it is consistent with path name length description above. Signed-off-by: Fox Chen --- Documentation/filesystems/path-lookup.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index bc450e0864d6..25d2a5a59f45 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -994,8 +994,8 @@ is 4096. There are a number of reasons for this limit; not letting the kernel spend too much time on just one path is one of them. With symbolic links you can effectively generate much longer paths so some sort of limit is needed for the same reason. Linux imposes a limit of -at most 40 symlinks in any one path lookup. It previously imposed a -further limit of eight on the maximum depth of recursion, but that was +at most 40 (MAXSYMLINKS) symlinks in any one path lookup. It previously imposed +a further limit of eight on the maximum depth of recursion, but that was raised to 40 when a separate stack was implemented, so there is now just the one limit. -- 2.30.0
[PATCH 00/12] docs: path-lookup: Update pathlookup docs
The Path lookup is a very complex subject in VFS. The path-lookup document provides a very detailed guidance to help people understand how path lookup works in the kernel.This document was originally written based on three lwn articles five years ago. As times goes by, some of the content was outdated. This patchset is intended to update the document to make it more relevant to current codebase. Fox Chen (12): docs: path-lookup: update follow_managed() part docs: path-lookup: update path_to_nameidata() parth docs: path-lookup: update path_mountpoint() part docs: path-lookup: update do_last() part docs: path-lookup: remove filename_mountpoint docs: path-lookup: Add macro name to symlink limit description docs: path-lookup: i_op->follow_link replaced with i_op->get_link docs: path-lookup: update i_op->put_link and cookie description docs: path-lookup: no get_link() docs: path-lookup: update WALK_GET, WALK_PUT desc docs: path-lookup: update get_link() ->follow_link description docs: path-lookup: update symlink description Documentation/filesystems/path-lookup.rst | 146 ++ 1 file changed, 63 insertions(+), 83 deletions(-) -- 2.30.0
[PATCH 02/12] docs: path-lookup: update path_to_nameidata() parth
No path_to_namei() anymore, step_into() will be called. Related commit: c99687a03a78775f77d57fe9b07af4c8ec3dd03c Signed-off-by: Fox Chen --- Documentation/filesystems/path-lookup.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index e778db767120..2ad96e1e3c49 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -455,7 +455,7 @@ In the absence of symbolic links, ``walk_component()`` creates a new ``struct path`` containing a counted reference to the new dentry and a reference to the new ``vfsmount`` which is only counted if it is different from the previous ``vfsmount``. It then calls -``path_to_nameidata()`` to install the new ``struct path`` in the +``step_into()`` to install the new ``struct path`` in the ``struct nameidata`` and drop the unneeded references. This "hand-over-hand" sequencing of getting a reference to the new -- 2.30.0
[PATCH 03/12] docs: path-lookup: update path_mountpoint() part
path_mountpoint() doesn't exist anymore. Have been folded into path_lookup_at when flag is set with LOOKUP_MOUNTPOINT. check out commit:161aff1d93abf0e5b5e9dbca88928998c155f677 Signed-off-by: Fox Chen --- Documentation/filesystems/path-lookup.rst | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index 2ad96e1e3c49..4e77c8520fa9 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -492,7 +492,8 @@ not try to revalidate the mounted filesystem. It effectively contains, through a call to ``mountpoint_last()``, an alternate implementation of ``lookup_slow()`` which skips that step. This is important when unmounting a filesystem that is inaccessible, such as -one provided by a dead NFS server. +one provided by a dead NFS server. In the current kernel, path_mountpoint +has been merged into ``path_lookup_at()`` with a new flag LOOKUP_MOUNTPOINT. Finally ``path_openat()`` is used for the ``open()`` system call; it contains, in support functions starting with "``do_last()``", all the -- 2.30.0
[PATCH 05/12] docs: path-lookup: remove filename_mountpoint
No filename_mountpoint any more see commit: 161aff1d93abf0e5b5e9dbca88928998c155f677 Without filename_mountpoint and path_mountpoint(), the numbers should be four & three: "These four correspond roughly to the three path_*() functions" Signed-off-by: Fox Chen --- Documentation/filesystems/path-lookup.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index 1f05b1417a55..bc450e0864d6 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -654,9 +654,9 @@ restarts from the top with REF-walk. This pattern of "try RCU-walk, if that fails try REF-walk" can be clearly seen in functions like ``filename_lookup()``, -``filename_parentat()``, ``filename_mountpoint()``, -``do_filp_open()``, and ``do_file_open_root()``. These five -correspond roughly to the four ``path_*()`` functions we met earlier, +``filename_parentat()``, +``do_filp_open()``, and ``do_file_open_root()``. These four +correspond roughly to the three ``path_*()`` functions we met earlier, each of which calls ``link_path_walk()``. The ``path_*()`` functions are called using different mode flags until a mode is found which works. They are first called with ``LOOKUP_RCU`` set to request "RCU-walk". If -- 2.30.0
[PATCH 01/12] docs: path-lookup: update follow_managed() part
No follow_managed() anymore, handle_mounts(), traverse_mounts(), will do the job. see commit: 9deed3ebca244663530782631834e706a86a8c8f Signed-off-by: Fox Chen --- Documentation/filesystems/path-lookup.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index c482e1619e77..e778db767120 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -448,8 +448,8 @@ described. If it finds a ``LAST_NORM`` component it first calls filesystem to revalidate the result if it is that sort of filesystem. If that doesn't get a good result, it calls "``lookup_slow()``" which takes ``i_rwsem``, rechecks the cache, and then asks the filesystem -to find a definitive answer. Each of these will call -``follow_managed()`` (as described below) to handle any mount points. +to find a definitive answer. In ``step_into()``, ``handle_mount()`` will be +called to handle any mount point. In the absence of symbolic links, ``walk_component()`` creates a new ``struct path`` containing a counted reference to the new dentry and a @@ -536,7 +536,7 @@ tree, but a few notes specifically related to path lookup are in order here. The Linux VFS has a concept of "managed" dentries which is reflected -in function names such as "``follow_managed()``". There are three +in function names such as "``traverse_mounts()``". There are three potentially interesting things about these dentries corresponding to three different flags that might be set in ``dentry->d_flags``: -- 2.30.0
Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement
On Sun, Dec 20, 2020 at 7:52 AM Ian Kent wrote: > > On Sat, 2020-12-19 at 11:23 -0500, Tejun Heo wrote: > > Hello, > > > > On Sat, Dec 19, 2020 at 03:08:13PM +0800, Ian Kent wrote: > > > And looking further I see there's a race that kernfs can't do > > > anything > > > about between kernfs_refresh_inode() and fs/inode.c:update_times(). > > > > Do kernfs files end up calling into that path tho? Doesn't look like > > it to > > me but if so yeah we'd need to override the update_time for kernfs. > > Sorry, the below was very hastily done and not what I would actually > propose. > > The main point of it was the question > > + /* Which kernfs node attributes should be updated from > +* time? > +*/ > > but looking at it again this morning I think the node iattr fields > that might need to be updated would be atime, ctime and mtime only, > maybe not ctime ... not sure. > > What do you think? > > Also, if kn->attr == NULL it should fall back to what the VFS > currently does. > > The update_times() function is one of the few places where the > VFS updates the inode times. > > The idea is that the reason kernfs needs to overwrite the inode > attributes is to reset what the VFS might have done but if kernfs > has this inode operation they won't need to be overwritten since > they won't have changed. > > There may be other places where the attributes (or an attribute) > are set by the VFS, I haven't finished checking that yet so my > suggestion might not be entirely valid. > > What I need to do is work out what kernfs node attributes, if any, > should be updated by .update_times(). If I go by what > kernfs_refresh_inode() does now then that would be none but shouldn't > atime at least be updated in the node iattr. > > > > +static int kernfs_iop_update_time(struct inode *inode, struct > > > timespec64 *time, int flags) > > > { > > > - struct inode *inode = d_inode(path->dentry); > > > struct kernfs_node *kn = inode->i_private; > > > + struct kernfs_iattrs *attrs; > > > > > > mutex_lock(_mutex); > > > + attrs = kernfs_iattrs(kn); > > > + if (!attrs) { > > > + mutex_unlock(_mutex); > > > + return -ENOMEM; > > > + } > > > + > > > + /* Which kernfs node attributes should be updated from > > > +* time? > > > +*/ > > > + > > > kernfs_refresh_inode(kn, inode); > > > mutex_unlock(_mutex); > > > > I don't see how this would reflect the changes from kernfs_setattr() > > into > > the attached inode. This would actually make the attr updates > > obviously racy > > - the userland visible attrs would be stale until the inode gets > > reclaimed > > and then when it gets reinstantiated it'd show the latest > > information. > > Right, I will have to think about that, but as I say above this > isn't really what I would propose. > > If .update_times() sticks strictly to what kernfs_refresh_inode() > does now then it would set the inode attributes from the node iattr > only. > > > > > That said, if you wanna take the direction where attr updates are > > reflected > > to the associated inode when the change occurs, which makes sense, > > the right > > thing to do would be making kernfs_setattr() update the associated > > inode if > > existent. > > Mmm, that's a good point but it looks like the inode isn't available > there. > Is it possible to embed super block somewhere, then we can call kernfs_get_inode to get inode in kernfs_setattr??? thanks, fox
Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement
On Sat, Dec 19, 2020 at 8:53 AM Ian Kent wrote: > > On Fri, 2020-12-18 at 21:20 +0800, Fox Chen wrote: > > On Fri, Dec 18, 2020 at 7:21 PM Ian Kent wrote: > > > On Fri, 2020-12-18 at 16:01 +0800, Fox Chen wrote: > > > > On Fri, Dec 18, 2020 at 3:36 PM Ian Kent > > > > wrote: > > > > > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote: > > > > > > Hello, > > > > > > > > > > > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote: > > > > > > > > What could be done is to make the kernfs node attr_mutex > > > > > > > > a pointer and dynamically allocate it but even that is > > > > > > > > too > > > > > > > > costly a size addition to the kernfs node structure as > > > > > > > > Tejun has said. > > > > > > > > > > > > > > I guess the question to ask is, is there really a need to > > > > > > > call kernfs_refresh_inode() from functions that are usually > > > > > > > reading/checking functions. > > > > > > > > > > > > > > Would it be sufficient to refresh the inode in the > > > > > > > write/set > > > > > > > operations in (if there's any) places where things like > > > > > > > setattr_copy() is not already called? > > > > > > > > > > > > > > Perhaps GKH or Tejun could comment on this? > > > > > > > > > > > > My memory is a bit hazy but invalidations on reads is how > > > > > > sysfs > > > > > > namespace is > > > > > > implemented, so I don't think there's an easy around that. > > > > > > The > > > > > > only > > > > > > thing I > > > > > > can think of is embedding the lock into attrs and doing xchg > > > > > > dance > > > > > > when > > > > > > attaching it. > > > > > > > > > > Sounds like your saying it would be ok to add a lock to the > > > > > attrs structure, am I correct? > > > > > > > > > > Assuming it is then, to keep things simple, use two locks. > > > > > > > > > > One global lock for the allocation and an attrs lock for all > > > > > the > > > > > attrs field updates including the kernfs_refresh_inode() > > > > > update. > > > > > > > > > > The critical section for the global lock could be reduced and > > > > > it > > > > > changed to a spin lock. > > > > > > > > > > In __kernfs_iattrs() we would have something like: > > > > > > > > > > take the allocation lock > > > > > do the allocated checks > > > > > assign if existing attrs > > > > > release the allocation lock > > > > > return existing if found > > > > > othewise > > > > > release the allocation lock > > > > > > > > > > allocate and initialize attrs > > > > > > > > > > take the allocation lock > > > > > check if someone beat us to it > > > > > free and grab exiting attrs > > > > > otherwise > > > > > assign the new attrs > > > > > release the allocation lock > > > > > return attrs > > > > > > > > > > Add a spinlock to the attrs struct and use it everywhere for > > > > > field updates. > > > > > > > > > > Am I on the right track or can you see problems with this? > > > > > > > > > > Ian > > > > > > > > > > > > > umm, we update the inode in kernfs_refresh_inode, right?? So I > > > > guess > > > > the problem is how can we protect the inode when > > > > kernfs_refresh_inode > > > > is called, not the attrs?? > > > > > > But the attrs (which is what's copied from) were protected by the > > > mutex lock (IIUC) so dealing with the inode attributes implies > > > dealing with the kernfs node attrs too. > > > > > > For example in kernfs_iop_setattr() the call to setattr_copy() > > > copies > > > the node attrs to the inode under the same mutex lock. So, if a > > > read > > > lock is us
Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement
On Fri, Dec 18, 2020 at 7:21 PM Ian Kent wrote: > > On Fri, 2020-12-18 at 16:01 +0800, Fox Chen wrote: > > On Fri, Dec 18, 2020 at 3:36 PM Ian Kent wrote: > > > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote: > > > > Hello, > > > > > > > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote: > > > > > > What could be done is to make the kernfs node attr_mutex > > > > > > a pointer and dynamically allocate it but even that is too > > > > > > costly a size addition to the kernfs node structure as > > > > > > Tejun has said. > > > > > > > > > > I guess the question to ask is, is there really a need to > > > > > call kernfs_refresh_inode() from functions that are usually > > > > > reading/checking functions. > > > > > > > > > > Would it be sufficient to refresh the inode in the write/set > > > > > operations in (if there's any) places where things like > > > > > setattr_copy() is not already called? > > > > > > > > > > Perhaps GKH or Tejun could comment on this? > > > > > > > > My memory is a bit hazy but invalidations on reads is how sysfs > > > > namespace is > > > > implemented, so I don't think there's an easy around that. The > > > > only > > > > thing I > > > > can think of is embedding the lock into attrs and doing xchg > > > > dance > > > > when > > > > attaching it. > > > > > > Sounds like your saying it would be ok to add a lock to the > > > attrs structure, am I correct? > > > > > > Assuming it is then, to keep things simple, use two locks. > > > > > > One global lock for the allocation and an attrs lock for all the > > > attrs field updates including the kernfs_refresh_inode() update. > > > > > > The critical section for the global lock could be reduced and it > > > changed to a spin lock. > > > > > > In __kernfs_iattrs() we would have something like: > > > > > > take the allocation lock > > > do the allocated checks > > > assign if existing attrs > > > release the allocation lock > > > return existing if found > > > othewise > > > release the allocation lock > > > > > > allocate and initialize attrs > > > > > > take the allocation lock > > > check if someone beat us to it > > > free and grab exiting attrs > > > otherwise > > > assign the new attrs > > > release the allocation lock > > > return attrs > > > > > > Add a spinlock to the attrs struct and use it everywhere for > > > field updates. > > > > > > Am I on the right track or can you see problems with this? > > > > > > Ian > > > > > > > umm, we update the inode in kernfs_refresh_inode, right?? So I guess > > the problem is how can we protect the inode when kernfs_refresh_inode > > is called, not the attrs?? > > But the attrs (which is what's copied from) were protected by the > mutex lock (IIUC) so dealing with the inode attributes implies > dealing with the kernfs node attrs too. > > For example in kernfs_iop_setattr() the call to setattr_copy() copies > the node attrs to the inode under the same mutex lock. So, if a read > lock is used the copy in kernfs_refresh_inode() is no longer protected, > it needs to be protected in a different way. > Ok, I'm actually wondering why the VFS holds exclusive i_rwsem for .setattr but no lock for .getattr (misdocumented?? sometimes they have as you've found out)? What does it protect against?? Because .permission does a similar thing here -- updating inode attributes, the goal is to provide the same protection level for .permission as for .setattr, am I right??? thanks, fox
Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement
On Fri, Dec 18, 2020 at 3:36 PM Ian Kent wrote: > > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote: > > Hello, > > > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote: > > > > What could be done is to make the kernfs node attr_mutex > > > > a pointer and dynamically allocate it but even that is too > > > > costly a size addition to the kernfs node structure as > > > > Tejun has said. > > > > > > I guess the question to ask is, is there really a need to > > > call kernfs_refresh_inode() from functions that are usually > > > reading/checking functions. > > > > > > Would it be sufficient to refresh the inode in the write/set > > > operations in (if there's any) places where things like > > > setattr_copy() is not already called? > > > > > > Perhaps GKH or Tejun could comment on this? > > > > My memory is a bit hazy but invalidations on reads is how sysfs > > namespace is > > implemented, so I don't think there's an easy around that. The only > > thing I > > can think of is embedding the lock into attrs and doing xchg dance > > when > > attaching it. > > Sounds like your saying it would be ok to add a lock to the > attrs structure, am I correct? > > Assuming it is then, to keep things simple, use two locks. > > One global lock for the allocation and an attrs lock for all the > attrs field updates including the kernfs_refresh_inode() update. > > The critical section for the global lock could be reduced and it > changed to a spin lock. > > In __kernfs_iattrs() we would have something like: > > take the allocation lock > do the allocated checks > assign if existing attrs > release the allocation lock > return existing if found > othewise > release the allocation lock > > allocate and initialize attrs > > take the allocation lock > check if someone beat us to it > free and grab exiting attrs > otherwise > assign the new attrs > release the allocation lock > return attrs > > Add a spinlock to the attrs struct and use it everywhere for > field updates. > > Am I on the right track or can you see problems with this? > > Ian > umm, we update the inode in kernfs_refresh_inode, right?? So I guess the problem is how can we protect the inode when kernfs_refresh_inode is called, not the attrs?? thanks, fox
Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement
On Thu, Dec 17, 2020 at 12:46 PM Ian Kent wrote: > > On Tue, 2020-12-15 at 20:59 +0800, Ian Kent wrote: > > On Tue, 2020-12-15 at 16:33 +0800, Fox Chen wrote: > > > On Mon, Dec 14, 2020 at 9:30 PM Ian Kent wrote: > > > > On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote: > > > > > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent > > > > > wrote: > > > > > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote: > > > > > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote: > > > > > > > > > For the patches, there is a mutex_lock in kn- > > > > > > > > > >attr_mutex, > > > > > > > > > as > > > > > > > > > Tejun > > > > > > > > > mentioned here > > > > > > > > > ( > > > > > > > > > https://lore.kernel.org/lkml/x8fe0cmu+aq1g...@mtj.duckdns.org/ > > > > > > > > > ), > > > > > > > > > maybe a global > > > > > > > > > rwsem for kn->iattr will be better?? > > > > > > > > > > > > > > > > I wasn't sure about that, IIRC a spin lock could be used > > > > > > > > around > > > > > > > > the > > > > > > > > initial check and checked again at the end which would > > > > > > > > probably > > > > > > > > have > > > > > > > > been much faster but much less conservative and a bit > > > > > > > > more > > > > > > > > ugly > > > > > > > > so > > > > > > > > I just went the conservative path since there was so much > > > > > > > > change > > > > > > > > already. > > > > > > > > > > > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH didn't > > > > > > > remember > > > > > > > it. > > > > > > > > > > > > > > Based on what Tejun said it sounds like that needs work. > > > > > > > > > > > > Those attribute handling patches were meant to allow taking > > > > > > the > > > > > > rw > > > > > > sem read lock instead of the write lock for > > > > > > kernfs_refresh_inode() > > > > > > updates, with the added locking to protect the inode > > > > > > attributes > > > > > > update since it's called from the VFS both with and without > > > > > > the > > > > > > inode lock. > > > > > > > > > > Oh, understood. I was asking also because lock on kn- > > > > > >attr_mutex > > > > > drags > > > > > concurrent performance. > > > > > > > > > > > Looking around it looks like kernfs_iattrs() is called from > > > > > > multiple > > > > > > places without a node database lock at all. > > > > > > > > > > > > I'm thinking that, to keep my proposed change straight > > > > > > forward > > > > > > and on topic, I should just leave kernfs_refresh_inode() > > > > > > taking > > > > > > the node db write lock for now and consider the attributes > > > > > > handling > > > > > > as a separate change. Once that's done we could reconsider > > > > > > what's > > > > > > needed to use the node db read lock in > > > > > > kernfs_refresh_inode(). > > > > > > > > > > You meant taking write lock of kernfs_rwsem for > > > > > kernfs_refresh_inode()?? > > > > > It may be a lot slower in my benchmark, let me test it. > > > > > > > > Yes, but make sure the write lock of kernfs_rwsem is being taken > > > > not the read lock. > > > > > > > > That's a mistake I had initially? > > > > > > > > Still, that attributes handling is, I think, sufficient to > > > > warrant > > > > a separate change since it looks like it might need work, the > > > > kernfs > > > > node db probably should be kept stable for those attribute > > > > updates > > > > but equally the existence of an instantiated dentry might &
Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement
On Mon, Dec 14, 2020 at 9:30 PM Ian Kent wrote: > > On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote: > > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent wrote: > > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote: > > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote: > > > > > > For the patches, there is a mutex_lock in kn->attr_mutex, as > > > > > > Tejun > > > > > > mentioned here > > > > > > ( > > > > > > https://lore.kernel.org/lkml/x8fe0cmu+aq1g...@mtj.duckdns.org/ > > > > > > ), > > > > > > maybe a global > > > > > > rwsem for kn->iattr will be better?? > > > > > > > > > > I wasn't sure about that, IIRC a spin lock could be used around > > > > > the > > > > > initial check and checked again at the end which would probably > > > > > have > > > > > been much faster but much less conservative and a bit more ugly > > > > > so > > > > > I just went the conservative path since there was so much > > > > > change > > > > > already. > > > > > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH didn't > > > > remember > > > > it. > > > > > > > > Based on what Tejun said it sounds like that needs work. > > > > > > Those attribute handling patches were meant to allow taking the rw > > > sem read lock instead of the write lock for kernfs_refresh_inode() > > > updates, with the added locking to protect the inode attributes > > > update since it's called from the VFS both with and without the > > > inode lock. > > > > Oh, understood. I was asking also because lock on kn->attr_mutex > > drags > > concurrent performance. > > > > > Looking around it looks like kernfs_iattrs() is called from > > > multiple > > > places without a node database lock at all. > > > > > > I'm thinking that, to keep my proposed change straight forward > > > and on topic, I should just leave kernfs_refresh_inode() taking > > > the node db write lock for now and consider the attributes handling > > > as a separate change. Once that's done we could reconsider what's > > > needed to use the node db read lock in kernfs_refresh_inode(). > > > > You meant taking write lock of kernfs_rwsem for > > kernfs_refresh_inode()?? > > It may be a lot slower in my benchmark, let me test it. > > Yes, but make sure the write lock of kernfs_rwsem is being taken > not the read lock. > > That's a mistake I had initially? > > Still, that attributes handling is, I think, sufficient to warrant > a separate change since it looks like it might need work, the kernfs > node db probably should be kept stable for those attribute updates > but equally the existence of an instantiated dentry might mitigate > the it. > > Some people might just know whether it's ok or not but I would like > to check the callers to work out what's going on. > > In any case it's academic if GCH isn't willing to consider the series > for review and possible merge. > Hi Ian I removed kn->attr_mutex and changed read lock to write lock for kernfs_refresh_inode down_write(_rwsem); kernfs_refresh_inode(kn, inode); up_write(_rwsem); Unfortunate, changes in this way make things worse, my benchmark runs 100% slower than upstream sysfs. :( open+read+close a sysfs file concurrently took 1000us. (Currently, sysfs with a big mutex kernfs_mutex only takes ~500us for one open+read+close operation concurrently) |--45.93%--kernfs_iop_permission || | | | | || | | | |--22.55%--down_write || | | | | | || | | | | --20.69%--rwsem_down_write_slowpath || | | | | | || | | | | |--8.89%--schedule perf showed most of the time had been spent on kernfs_iop_permission thanks, fox
Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement
On Sun, Dec 13, 2020 at 11:46 AM Ian Kent wrote: > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote: > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote: > > > > For the patches, there is a mutex_lock in kn->attr_mutex, as > > > > Tejun > > > > mentioned here > > > > (https://lore.kernel.org/lkml/x8fe0cmu+aq1g...@mtj.duckdns.org/), > > > > maybe a global > > > > rwsem for kn->iattr will be better?? > > > > > > I wasn't sure about that, IIRC a spin lock could be used around the > > > initial check and checked again at the end which would probably > > > have > > > been much faster but much less conservative and a bit more ugly so > > > I just went the conservative path since there was so much change > > > already. > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH didn't remember > > it. > > > > Based on what Tejun said it sounds like that needs work. > > Those attribute handling patches were meant to allow taking the rw > sem read lock instead of the write lock for kernfs_refresh_inode() > updates, with the added locking to protect the inode attributes > update since it's called from the VFS both with and without the > inode lock. Oh, understood. I was asking also because lock on kn->attr_mutex drags concurrent performance. > Looking around it looks like kernfs_iattrs() is called from multiple > places without a node database lock at all. > > I'm thinking that, to keep my proposed change straight forward > and on topic, I should just leave kernfs_refresh_inode() taking > the node db write lock for now and consider the attributes handling > as a separate change. Once that's done we could reconsider what's > needed to use the node db read lock in kernfs_refresh_inode(). You meant taking write lock of kernfs_rwsem for kernfs_refresh_inode()?? It may be a lot slower in my benchmark, let me test it. > It will reduce the effectiveness of the series but it would make > this change much more complicated, and is somewhat off-topic, and > could hamper the chances of reviewers spotting problem with it. > > Ian > thanks, fox
[tip: core/rcu] docs/memory-barriers.txt: Fix a typo in CPU MEMORY BARRIERS section
The following commit has been merged into the core/rcu branch of tip: Commit-ID: d8566f15da9b1e51fd35f24321ec133095e02d06 Gitweb: https://git.kernel.org/tip/d8566f15da9b1e51fd35f24321ec133095e02d06 Author:Fox Chen AuthorDate:Wed, 09 Sep 2020 14:53:40 +08:00 Committer: Paul E. McKenney CommitterDate: Fri, 06 Nov 2020 17:24:51 -08:00 docs/memory-barriers.txt: Fix a typo in CPU MEMORY BARRIERS section Commit 39323c6 ("smp_mb__{before,after}_atomic(): update Documentation") has a typo in CPU MEORY BARRIERS section: "RMW functions that do not imply are memory barrier are ..." should be "RMW functions that do not imply a memory barrier are ...". This patch fixes this typo. Signed-off-by: Fox Chen Acked-by: Will Deacon Signed-off-by: Paul E. McKenney --- Documentation/memory-barriers.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index 17c8e0c..7367ada 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -1870,7 +1870,7 @@ There are some more advanced barrier functions: These are for use with atomic RMW functions that do not imply memory barriers, but where the code needs a memory barrier. Examples for atomic - RMW functions that do not imply are memory barrier are e.g. add, + RMW functions that do not imply a memory barrier are e.g. add, subtract, (failed) conditional operations, _relaxed functions, but not atomic_read or atomic_set. A common example where a memory barrier may be required is when atomic ops are used for reference
RE:[PATCH v2 0/6] kernfs: proposed locking and concurrency improvement
Hi, I found this series of patches solves exact the problem I am trying to solve. https://lore.kernel.org/lkml/20201202145837.48040-1-foxhlc...@gmail.com/ The problem is reported by Brice Goglin on thread: Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface https://lore.kernel.org/lkml/x60dvjot4furc...@kroah.com/ I independently comfirmed that on a 96-core AWS c5.metal server. Do open+read+write on /sys/devices/system/cpu/cpu15/topology/core_id 1000 times. With a single thread it takes ~2.5 us for each open+read+close. With one thread per core, 96 threads running simultaneously takes 540 us for each of the same operation (without much variation) -- 200x slower than the single thread one. My Benchmark code is here: https://github.com/foxhlchen/sysfs_benchmark The problem can only be observed in large machines (>=16 cores). The more cores you have the slower it can be. Perf shows that CPUs spend most of the time (>80%) waiting on mutex locks in kernfs_iop_permission and kernfs_dop_revalidate. After applying this, performance gets huge boost -- with the fastest one at ~30 us to the worst at ~180 us (most of on spin_locks, the delay just stacking up, very similar to the performance on ext4). I hope this problem can justifies this series of patches. A big mutex in kernfs is really not nice. Due to this BIG LOCK, concurrency in kernfs is almost NONE, even though you do operations on different files, they are contentious. As we get more and more cores on normal machines and because sysfs provides such important information, this problem should be fix. So please reconsider accepting the patches. For the patches, there is a mutex_lock in kn->attr_mutex, as Tejun mentioned here (https://lore.kernel.org/lkml/x8fe0cmu+aq1g...@mtj.duckdns.org/), maybe a global rwsem for kn->iattr will be better?? thanks, fox
[PATCH V2] kernfs: replace the mutex in kernfs_iop_permission with a rwlock
A big global mutex in kernfs_iop_permission will significanly drag system performance when processes concurrently open files on kernfs in Big machines(with >= 16 cpu cores). This patch replace the big mutex with a global rwsem lock. So that kernfs_iop_permission can perform concurrently. In a 96-core AMD EPYC ROME server, I can observe 50% boost on a open+read+close cycle when I call open+read+close one thread per core concurrently 1000 times after applying the patch. Signed-off-by: Fox Chen --- fs/kernfs/inode.c | 19 +++ 1 file changed, 11 insertions(+), 8 deletions(-) diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c index fc2469a20fed..ea65da176cfa 100644 --- a/fs/kernfs/inode.c +++ b/fs/kernfs/inode.c @@ -14,9 +14,12 @@ #include #include #include +#include #include "kernfs-internal.h" +static DECLARE_RWSEM(kernfs_iattr_rwsem); + static const struct address_space_operations kernfs_aops = { .readpage = simple_readpage, .write_begin= simple_write_begin, @@ -106,9 +109,9 @@ int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr) { int ret; - mutex_lock(_mutex); + down_write(_iattr_rwsem); ret = __kernfs_setattr(kn, iattr); - mutex_unlock(_mutex); + up_write(_iattr_rwsem); return ret; } @@ -121,7 +124,7 @@ int kernfs_iop_setattr(struct dentry *dentry, struct iattr *iattr) if (!kn) return -EINVAL; - mutex_lock(_mutex); + down_write(_iattr_rwsem); error = setattr_prepare(dentry, iattr); if (error) goto out; @@ -134,7 +137,7 @@ int kernfs_iop_setattr(struct dentry *dentry, struct iattr *iattr) setattr_copy(inode, iattr); out: - mutex_unlock(_mutex); + up_write(_iattr_rwsem); return error; } @@ -189,9 +192,9 @@ int kernfs_iop_getattr(const struct path *path, struct kstat *stat, struct inode *inode = d_inode(path->dentry); struct kernfs_node *kn = inode->i_private; - mutex_lock(_mutex); + down_read(_iattr_rwsem); kernfs_refresh_inode(kn, inode); - mutex_unlock(_mutex); + up_read(_iattr_rwsem); generic_fillattr(inode, stat); return 0; @@ -281,9 +284,9 @@ int kernfs_iop_permission(struct inode *inode, int mask) kn = inode->i_private; - mutex_lock(_mutex); + down_read(_iattr_rwsem); kernfs_refresh_inode(kn, inode); - mutex_unlock(_mutex); + up_read(_iattr_rwsem); return generic_permission(inode, mask); } -- 2.29.2 Differences from V1: * Use rwsem instead of rwlock so we can sleep when kernfs_iattrs calls GFP_KERNEL type memory allocation. * Use a global lock instead of a per-node lock to reduce memory consumption. It's still slow, a open+read+close cycle spends ~260us compared to ~3us of single thread one. After applying this, the mutex in kernfs_dop_revalidate becomes the top time-consuming operation on concurrent open+read+close. However That's harder to solve than this one and it's near the merge window and holiday season, I don't want to add up work load to you guys during that time so I decided to turn in this separately. Hopefully, I can bring in kernfs_dop_revalidate patch after holiday. And hope this patch can help. thanks, fox
Re: [PATCH 2/2] kernfs: remove mutex in kernfs_dop_revalidate
On Thu, Dec 3, 2020 at 2:46 AM Tejun Heo wrote: > > Hello, > > On Wed, Dec 02, 2020 at 10:58:37PM +0800, Fox Chen wrote: > > There is a big mutex in kernfs_dop_revalidate which slows down the > > concurrent performance of kernfs. > > > > Since kernfs_dop_revalidate only does some checks, the lock is > > largely unnecessary. Also, according to kernel filesystem locking > > document: > > https://www.kernel.org/doc/html/latest/filesystems/locking.html > > locking is not in the protocal for d_revalidate operation. > > That's just describing the rules seen from vfs side. It doesn't say anything > about locking rules internal to each file system implementation. Oh, Ok, I got it. > > This patch remove this mutex from > > kernfs_dop_revalidate, so kernfs_dop_revalidate > > can run concurrently. > > > > Signed-off-by: Fox Chen > > --- > > fs/kernfs/dir.c | 9 +++-- > > 1 file changed, 3 insertions(+), 6 deletions(-) > > > > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c > > index 9aec80b9d7c6..c2267c93f546 100644 > > --- a/fs/kernfs/dir.c > > +++ b/fs/kernfs/dir.c > > @@ -26,7 +26,6 @@ static DEFINE_SPINLOCK(kernfs_idr_lock);/* > > root->ino_idr */ > > > > static bool kernfs_active(struct kernfs_node *kn) > > { > > - lockdep_assert_held(_mutex); > > return atomic_read(>active) >= 0; > > } > > > > @@ -557,10 +556,9 @@ static int kernfs_dop_revalidate(struct dentry > > *dentry, unsigned int flags) > > > > /* Always perform fresh lookup for negatives */ > > if (d_really_is_negative(dentry)) > > - goto out_bad_unlocked; > > + goto out_bad; > > > > kn = kernfs_dentry_node(dentry); > > - mutex_lock(_mutex); > > > > /* The kernfs node has been deactivated */ > > if (!kernfs_active(kn)) > > @@ -579,11 +577,8 @@ static int kernfs_dop_revalidate(struct dentry > > *dentry, unsigned int flags) > > kernfs_info(dentry->d_sb)->ns != kn->ns) > > goto out_bad; > > > > - mutex_unlock(_mutex); > > return 1; > > out_bad: > > - mutex_unlock(_mutex); > > -out_bad_unlocked: > > return 0; > > } > > I don't see how this can be safe. Nothing even protects the dentry from > turning negative in the middle and it may end up trying to deref NULL. I'm > sure we can make this not need kernfs_mutex but that'd have to be a lot more > careful. > Sorry Tejun, I don't get it. Even before the patch if (d_really_is_negative(dentry)) goto out_bad_unlocked; kn = kernfs_dentry_node(dentry); mutex_lock(_mutex); < we lock here status of d_really_is_negative is not preserved by the mutex. It could turn negative between we checked it and we lock kernfs_mutex. Is it a bug here?? thanks, fox
Re: [PATCH 0/2] kernfs: speed up concurrency performance
On Thu, Dec 3, 2020 at 2:28 AM Greg KH wrote: > > On Wed, Dec 02, 2020 at 10:58:35PM +0800, Fox Chen wrote: > > Hello, > > > > kernfs is an important facillity to support pseudo file systems and cgroup. > > Currently, with a global mutex, reading files concurrently from kernfs > > (e.g. /sys) > > is very slow. > > > > This problem is reported by Brice Goglin on thread: > > Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface > > https://lore.kernel.org/lkml/x60dvjot4furc...@kroah.com/ > > > > I independently comfirmed this on a 96-core AWS c5.metal server. > > Do open+read+write on /sys/devices/system/cpu/cpu15/topology/core_id 1000 > > times. > > With a single thread it takes ~2.5 us for each open+read+close. > > With one thread per core, 96 threads running simultaneously takes 540 us > > for each of the same operation (without much variation) -- 200x slower than > > the > > single thread one. > > > > The problem can only be observed in large machines (>=16 cores). > > The more cores you have the slower it can be. > > > > Perf shows that CPUs spend most of the time (>80%) waiting on mutex locks in > > kernfs_iop_permission and kernfs_dop_revalidate. > > > > This patchset contains the following 2 patches: > > 0001-kernfs-replace-the-mutex-in-kernfs_iop_permission-wi.patch > > 0002-kernfs-remove-mutex-in-kernfs_dop_revalidate.patch > > > > 0001 replace the mutex lock in kernfs_iop_permission with a new rwlock and > > 0002 removes the mutex lock in kernfs_dop_revalidate. > > > > After applying this patchset, the multi-thread performance becomes linear > > with > > the fastest one at ~30 us to the worst at ~150 us, very similar as I tested > > it > > on a normal ext4 file system with fastest one at ~20 us to slowest at ~100 > > us. > > And I believe that is largely due to spin_locks in filesystems which are > > normal. > > > > Although it's still slower than single thread, users can benefit from this > > patchset, especially ones working on HPC realm with lots of cpu cores and > > want to > > fetch system information from sysfs. > > Does this mean that the changes slow down the single-threaded case? Or > that it's just not as good as the speed of a single-threaded access? No, It won't influence the single-threaded case. I meant multi-threaded case is still not as good as single-threaded one. > But anyway, thanks so much for looking into this, it should help the > crazy systems out today, which means the normal systems in 5 years will > really appreciate this :) thanks :)
Re: [PATCH 2/2] kernfs: remove mutex in kernfs_dop_revalidate
On Thu, Dec 3, 2020 at 2:26 AM Greg KH wrote: > > On Wed, Dec 02, 2020 at 10:58:37PM +0800, Fox Chen wrote: > > There is a big mutex in kernfs_dop_revalidate which slows down the > > concurrent performance of kernfs. > > > > Since kernfs_dop_revalidate only does some checks, the lock is > > largely unnecessary. Also, according to kernel filesystem locking > > document: > > https://www.kernel.org/doc/html/latest/filesystems/locking.html > > locking is not in the protocal for d_revalidate operation. > > > > This patch remove this mutex from > > kernfs_dop_revalidate, so kernfs_dop_revalidate > > can run concurrently. > > > > Signed-off-by: Fox Chen > > --- > > fs/kernfs/dir.c | 9 +++-- > > 1 file changed, 3 insertions(+), 6 deletions(-) > > > > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c > > index 9aec80b9d7c6..c2267c93f546 100644 > > --- a/fs/kernfs/dir.c > > +++ b/fs/kernfs/dir.c > > @@ -26,7 +26,6 @@ static DEFINE_SPINLOCK(kernfs_idr_lock);/* > > root->ino_idr */ > > > > static bool kernfs_active(struct kernfs_node *kn) > > { > > - lockdep_assert_held(_mutex); > > return atomic_read(>active) >= 0; > > } > > > > @@ -557,10 +556,9 @@ static int kernfs_dop_revalidate(struct dentry > > *dentry, unsigned int flags) > > > > /* Always perform fresh lookup for negatives */ > > if (d_really_is_negative(dentry)) > > - goto out_bad_unlocked; > > + goto out_bad; > > > > kn = kernfs_dentry_node(dentry); > > - mutex_lock(_mutex); > > > > /* The kernfs node has been deactivated */ > > if (!kernfs_active(kn)) > > @@ -579,11 +577,8 @@ static int kernfs_dop_revalidate(struct dentry > > *dentry, unsigned int flags) > > kernfs_info(dentry->d_sb)->ns != kn->ns) > > goto out_bad; > > > > - mutex_unlock(_mutex); > > return 1; > > out_bad: > > - mutex_unlock(_mutex); > > -out_bad_unlocked: > > return 0; > > } > > > > @@ -650,6 +645,8 @@ static struct kernfs_node *__kernfs_new_node(struct > > kernfs_root *root, > > kn->mode = mode; > > kn->flags = flags; > > > > + rwlock_init(>iattr_rwlock); > > Ah, now you initialize this, it should go into patch 1, right? :) > Yes, it's my fault. It should be in patch 1. Sorry.
Re: [PATCH 1/2] kernfs: replace the mutex in kernfs_iop_permission with a rwlock
Hi, Thanks for your comments. > On Wed, Dec 02, 2020 at 10:58:36PM +0800, Fox Chen wrote: > > @@ -121,7 +121,7 @@ int kernfs_iop_setattr(struct dentry *dentry, struct > > iattr *iattr) > > if (!kn) > > return -EINVAL; > > > > - mutex_lock(_mutex); > > + write_lock(>iattr_rwlock); > > error = setattr_prepare(dentry, iattr); > > if (error) > > goto out; > > @@ -134,7 +134,7 @@ int kernfs_iop_setattr(struct dentry *dentry, struct > > iattr *iattr) > > setattr_copy(inode, iattr); > > > > out: > > - mutex_unlock(_mutex); > > + write_unlock(>iattr_rwlock); > > return error; > > } > > This is putting GFP_KERNEL allocation inside a rwlock. Can you please test > with debug options including LOCKDEP and DEBUG_ATOMIC_SLEEP turned on? > Ok, I will try that. Allocation is protected by the write_lock, only one thread can enter this at a time. It should give the same protection as a mutex, right?? Or am I missing something here?? Any caveat? On Thu, Dec 3, 2020 at 2:37 AM Tejun Heo wrote: > > On Wed, Dec 02, 2020 at 10:58:36PM +0800, Fox Chen wrote: > > diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h > > index 89f6a4214a70..545cdb39b34b 100644 > > --- a/include/linux/kernfs.h > > +++ b/include/linux/kernfs.h > > @@ -156,6 +156,7 @@ struct kernfs_node { > > unsigned short flags; > > umode_t mode; > > struct kernfs_iattrs*iattr; > > + rwlock_tiattr_rwlock; > > }; > > Also, while this might not look like much, kernfs_node is very size > sensitive. There are systems with huge number of these nodes, so I don't > think putting a per-node lock like this is a good idea. Either we can use a > shared iattr protecting lock or play some cmpxchg games when allocating and > setting ->iattr and put the lock there. > Initially, I tried to put rwlock in kn->iattr, but __kernfs_setattr(kn, iattr) needs lock protection and kn->iattr may not exist before calling __kernfs_setattr. It's a chicken-egg paradox. :) It's hard to solve. cmpxchg can help, but who sets kn->iattr first should be clearly defined. What about I used a global shared rwlock to protect all kn->iattr. It's easier to implement and I think we read sysfs more than write to it, I guess it won't be that slow compared to one kn per lock? thanks, fox
[PATCH 1/2] kernfs: replace the mutex in kernfs_iop_permission with a rwlock
A big global mutex in kernfs_iop_permission will significanly drag system performance when processes concurrently open files on kernfs in Big machines(with >= 16 cpu cores). This patch replace the big mutex with a rwlock specifically for protecting kernfs_node->iattribute. So that kernfs_iop_permission can perform concurrently. Signed-off-by: Fox Chen --- fs/kernfs/inode.c | 16 include/linux/kernfs.h | 1 + 2 files changed, 9 insertions(+), 8 deletions(-) diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c index fc2469a20fed..c8c2ea669e6d 100644 --- a/fs/kernfs/inode.c +++ b/fs/kernfs/inode.c @@ -106,9 +106,9 @@ int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr) { int ret; - mutex_lock(_mutex); + write_lock(>iattr_rwlock); ret = __kernfs_setattr(kn, iattr); - mutex_unlock(_mutex); + write_unlock(>iattr_rwlock); return ret; } @@ -121,7 +121,7 @@ int kernfs_iop_setattr(struct dentry *dentry, struct iattr *iattr) if (!kn) return -EINVAL; - mutex_lock(_mutex); + write_lock(>iattr_rwlock); error = setattr_prepare(dentry, iattr); if (error) goto out; @@ -134,7 +134,7 @@ int kernfs_iop_setattr(struct dentry *dentry, struct iattr *iattr) setattr_copy(inode, iattr); out: - mutex_unlock(_mutex); + write_unlock(>iattr_rwlock); return error; } @@ -189,9 +189,9 @@ int kernfs_iop_getattr(const struct path *path, struct kstat *stat, struct inode *inode = d_inode(path->dentry); struct kernfs_node *kn = inode->i_private; - mutex_lock(_mutex); + read_lock(>iattr_rwlock); kernfs_refresh_inode(kn, inode); - mutex_unlock(_mutex); + read_unlock(>iattr_rwlock); generic_fillattr(inode, stat); return 0; @@ -281,9 +281,9 @@ int kernfs_iop_permission(struct inode *inode, int mask) kn = inode->i_private; - mutex_lock(_mutex); + read_lock(>iattr_rwlock); kernfs_refresh_inode(kn, inode); - mutex_unlock(_mutex); + read_unlock(>iattr_rwlock); return generic_permission(inode, mask); } diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h index 89f6a4214a70..545cdb39b34b 100644 --- a/include/linux/kernfs.h +++ b/include/linux/kernfs.h @@ -156,6 +156,7 @@ struct kernfs_node { unsigned short flags; umode_t mode; struct kernfs_iattrs*iattr; + rwlock_tiattr_rwlock; }; /* -- 2.29.2
[PATCH 0/2] kernfs: speed up concurrency performance
Hello, kernfs is an important facillity to support pseudo file systems and cgroup. Currently, with a global mutex, reading files concurrently from kernfs (e.g. /sys) is very slow. This problem is reported by Brice Goglin on thread: Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface https://lore.kernel.org/lkml/x60dvjot4furc...@kroah.com/ I independently comfirmed this on a 96-core AWS c5.metal server. Do open+read+write on /sys/devices/system/cpu/cpu15/topology/core_id 1000 times. With a single thread it takes ~2.5 us for each open+read+close. With one thread per core, 96 threads running simultaneously takes 540 us for each of the same operation (without much variation) -- 200x slower than the single thread one. The problem can only be observed in large machines (>=16 cores). The more cores you have the slower it can be. Perf shows that CPUs spend most of the time (>80%) waiting on mutex locks in kernfs_iop_permission and kernfs_dop_revalidate. This patchset contains the following 2 patches: 0001-kernfs-replace-the-mutex-in-kernfs_iop_permission-wi.patch 0002-kernfs-remove-mutex-in-kernfs_dop_revalidate.patch 0001 replace the mutex lock in kernfs_iop_permission with a new rwlock and 0002 removes the mutex lock in kernfs_dop_revalidate. After applying this patchset, the multi-thread performance becomes linear with the fastest one at ~30 us to the worst at ~150 us, very similar as I tested it on a normal ext4 file system with fastest one at ~20 us to slowest at ~100 us. And I believe that is largely due to spin_locks in filesystems which are normal. Although it's still slower than single thread, users can benefit from this patchset, especially ones working on HPC realm with lots of cpu cores and want to fetch system information from sysfs. I tried my best to solve this problem. If there is stupid mistake, please kindly point out. I would appreciate it greatly. Fox fs/kernfs/dir.c| 9 +++-- fs/kernfs/inode.c | 16 include/linux/kernfs.h | 1 + 3 files changed, 12 insertions(+), 14 deletions(-) -- 2.29.2
[PATCH 2/2] kernfs: remove mutex in kernfs_dop_revalidate
There is a big mutex in kernfs_dop_revalidate which slows down the concurrent performance of kernfs. Since kernfs_dop_revalidate only does some checks, the lock is largely unnecessary. Also, according to kernel filesystem locking document: https://www.kernel.org/doc/html/latest/filesystems/locking.html locking is not in the protocal for d_revalidate operation. This patch remove this mutex from kernfs_dop_revalidate, so kernfs_dop_revalidate can run concurrently. Signed-off-by: Fox Chen --- fs/kernfs/dir.c | 9 +++-- 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c index 9aec80b9d7c6..c2267c93f546 100644 --- a/fs/kernfs/dir.c +++ b/fs/kernfs/dir.c @@ -26,7 +26,6 @@ static DEFINE_SPINLOCK(kernfs_idr_lock); /* root->ino_idr */ static bool kernfs_active(struct kernfs_node *kn) { - lockdep_assert_held(_mutex); return atomic_read(>active) >= 0; } @@ -557,10 +556,9 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags) /* Always perform fresh lookup for negatives */ if (d_really_is_negative(dentry)) - goto out_bad_unlocked; + goto out_bad; kn = kernfs_dentry_node(dentry); - mutex_lock(_mutex); /* The kernfs node has been deactivated */ if (!kernfs_active(kn)) @@ -579,11 +577,8 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags) kernfs_info(dentry->d_sb)->ns != kn->ns) goto out_bad; - mutex_unlock(_mutex); return 1; out_bad: - mutex_unlock(_mutex); -out_bad_unlocked: return 0; } @@ -650,6 +645,8 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root, kn->mode = mode; kn->flags = flags; + rwlock_init(>iattr_rwlock); + if (!uid_eq(uid, GLOBAL_ROOT_UID) || !gid_eq(gid, GLOBAL_ROOT_GID)) { struct iattr iattr = { .ia_valid = ATTR_UID | ATTR_GID, -- 2.29.2
Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface
On Fri, Nov 13, 2020 at 2:15 PM Brice Goglin wrote: > > > Le 12/11/2020 à 11:49, Greg Kroah-Hartman a écrit : > > On Thu, Nov 12, 2020 at 10:10:57AM +0100, Brice Goglin wrote: > > Le 12/11/2020 à 07:42, Greg Kroah-Hartman a écrit : > > On Thu, Nov 12, 2020 at 07:19:48AM +0100, Brice Goglin wrote: > > Le 07/10/2020 à 07:15, Greg Kroah-Hartman a écrit : > > On Tue, Oct 06, 2020 at 08:14:47PM -0700, Ricardo Neri wrote: > > On Tue, Oct 06, 2020 at 09:37:44AM +0200, Greg Kroah-Hartman wrote: > > On Mon, Oct 05, 2020 at 05:57:36PM -0700, Ricardo Neri wrote: > > On Sat, Oct 03, 2020 at 10:53:45AM +0200, Greg Kroah-Hartman wrote: > > On Fri, Oct 02, 2020 at 06:17:42PM -0700, Ricardo Neri wrote: > > Hybrid CPU topologies combine CPUs of different microarchitectures in the > same die. Thus, even though the instruction set is compatible among all > CPUs, there may still be differences in features (e.g., some CPUs may > have counters that others CPU do not). There may be applications > interested in knowing the type of micro-architecture topology of the > system to make decisions about process affinity. > > While the existing sysfs for capacity (/sys/devices/system/cpu/cpuX/ > cpu_capacity) may be used to infer the types of micro-architecture of the > CPUs in the platform, it may not be entirely accurate. For instance, two > subsets of CPUs with different types of micro-architecture may have the > same capacity due to power or thermal constraints. > > Create the new directory /sys/devices/system/cpu/types. Under such > directory, create individual subdirectories for each type of CPU micro- > architecture. Each subdirectory will have cpulist and cpumap files. This > makes it convenient for user space to read all the CPUs of the same type > at once without having to inspect each CPU individually. > > Implement a generic interface using weak functions that architectures can > override to indicate a) support for CPU types, b) the CPU type number, and > c) a string to identify the CPU vendor and type. > > For example, an x86 system with one Intel Core and four Intel Atom CPUs > would look like this (other architectures have the hooks to use whatever > directory naming convention below "types" that meets their needs): > > user@host:~$: ls /sys/devices/system/cpu/types > intel_atom_0 intel_core_0 > > user@host:~$ ls /sys/devices/system/cpu/types/intel_atom_0 > cpulist cpumap > > user@host:~$ ls /sys/devices/system/cpu/types/intel_core_0 > cpulist cpumap > > user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpumap > 0f > > user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpulist > 0-3 > > user@ihost:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpumap > 10 > > user@host:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpulist > 4 > > Thank you for the quick and detailed Greg! > > The output of 'tree' sometimes makes it easier to see here, or: > grep -R . * > also works well. > > Indeed, this would definitely make it more readable. > > On non-hybrid systems, the /sys/devices/system/cpu/types directory is not > created. Add a hook for this purpose. > > Why should these not show up if the system is not "hybrid"? > > My thinking was that on a non-hybrid system, it does not make sense to > create this interface, as all the CPUs will be of the same type. > > Why not just have this an attribute type in the existing cpuX directory? > Why do this have to be a totally separate directory and userspace has to > figure out to look in two different spots for the same cpu to determine > what it is? > > But if the type is located under cpuX, usespace would need to traverse > all the CPUs and create its own cpu masks. Under the types directory it > would only need to look once for each type of CPU, IMHO. > > What does a "mask" do? What does userspace care about this? You would > have to create it by traversing the directories you are creating anyway, > so it's not much different, right? > > Hello > > Sorry for the late reply. As the first userspace consumer of this > interface [1], I can confirm that reading a single file to get the mask > would be better, at least for performance reason. On large platforms, we > already have to read thousands of sysfs files to get CPU topology and > cache information, I'd be happy not to read one more file per cpu. > > Reading these sysfs files is slow, and it does not scale well when > multiple processes read them in parallel. > > Really? Where is the slowdown? Would something like readfile() work > better for you for that? > https://lore.kernel.org/linux-api/20200704140250.423345-1-gre...@linuxfoundation.org/ > > I guess readfile would improve the sequential case by avoiding syscalls > but it would not improve the parallel case since syscalls shouldn't have > any parallel issue? > > syscalls should not have parallel issues at all. > > We've been watching the status of readfile() since it was posted on LKML > 6 months ago, but we were actually wondering if it would end up
[PATCH] squashfs: Add id_table sanity check to squashfs_get_id
When uid/gid info in superblocks or id_index_table is corrupted, The uid/gid index can be larger than the size of msblk->id_table. This is reported by syzkaller. This patch adds a sanity check to squashfs_get_id which calculates the max available room for uid/gid table by doing msblk->xattr_table - msblk->id_table[0] and check if index is larger than this. While this provides some sort of check, it is imperfect because id_table can be smaller than that. Reported-by: syzbot+8e28bba73ed1772a6...@syzkaller.appspotmail.com Signed-off-by: Fox Chen --- fs/squashfs/id.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/fs/squashfs/id.c b/fs/squashfs/id.c index 6be5afe7287d..81bd67c0f649 100644 --- a/fs/squashfs/id.c +++ b/fs/squashfs/id.c @@ -35,10 +35,16 @@ int squashfs_get_id(struct super_block *sb, unsigned int index, struct squashfs_sb_info *msblk = sb->s_fs_info; int block = SQUASHFS_ID_BLOCK(index); int offset = SQUASHFS_ID_BLOCK_OFFSET(index); - u64 start_block = le64_to_cpu(msblk->id_table[block]); + u64 start_block; __le32 disk_id; int err; + // sanity check + if (le64_to_cpu(msblk->id_table[0]) + block >= msblk->xattr_table) + return -EINVAL; + + start_block = le64_to_cpu(msblk->id_table[block]); + err = squashfs_read_metadata(sb, _id, _block, , sizeof(disk_id)); if (err < 0) -- 2.25.1
[PATCH] NTFS: Add name sanity check to ntfs_attr_find
When mounting, if Attribute data is correupted, doing named attribute lookup can lead to invalid memory access. This is reported by syzkaller. This patch adds a sanity check prior to attribute name lookup. If attribute's name_offset is invalid, It will mark volume error and return -EIO. Reported-by: syzbot+ecbcf37464c627253...@syzkaller.appspotmail.com Signed-off-by: Fox Chen --- fs/ntfs/attrib.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/fs/ntfs/attrib.c b/fs/ntfs/attrib.c index d563abc3e136..e7366f74ff62 100644 --- a/fs/ntfs/attrib.c +++ b/fs/ntfs/attrib.c @@ -607,6 +607,16 @@ static int ntfs_attr_find(const ATTR_TYPE type, const ntfschar *name, * If @name is present, compare the two names. If @name is * missing, assume we want an unnamed attribute. */ + + /* +* Sanity check, a->name_offset should be within the range of a->lengh, +*/ + if (name && ((u8*)a + le16_to_cpu(a->name_offset)) > ((u8*)a + le32_to_cpu(a->length))) { + ntfs_error(vol->sb, "Invalid Attribute Name. Inode is corrupt. Run chkdsk."); + NVolSetErrors(vol); + return -EIO; + } + if (!name) { /* The search failed if the found attribute is named. */ if (a->name_length) -- 2.25.1
Re: [Cluster-devel] [PATCH] fs: gfs2: prevent OOB access in gfs2_read_sb()
Hi Andrew, On Wed, Oct 14, 2020 at 9:04 PM Andrew Price wrote: > Just a heads-up to avoid duplication of effort: Fox Chen (CCed) has > attempted to fix this also[1], but I don't know if they plan to send > another patch. Oh, I thought it was solved by someone else as you've pointed out the correct solution the other day. https://lkml.org/lkml/2020/10/5/538 I'm solving another bug right now, I will leave this to Anant. :) thanks, fox
Re: [Cluster-devel] KASAN: slab-out-of-bounds Write in,gfs2_fill_super
Yes, it's the same bug, https://lkml.org/lkml/2020/10/5/538 this may help thanks,
Re: KASAN: slab-out-of-bounds Read in squashfs_get_id
Hi, I found this bug was caused by either uid/gid info in superblocks or id_index_table is corrupted. The uid/gid index is larger than the size of msblk->id_table. Should I add a sanity check to squashfs_get_id?? The complete solution is to record the size of msblk->id_table in msblk and check uid/gid index each time in squashfs_get_id. However, this requires a change to msblk struct. A simple solution is to calculate the max available room for uid/gid table by doing msblk->xattr_table - msblk->id_table[0] and check if index is larger than this. While this provides some sort of check, it is imperfect because id_table can be smaller than that. Both of them work out for this bug. thanks, fox On Friday, September 25, 2020 at 10:48:18 PM UTC+8 syzbot wrote: Hello, syzbot found the following issue on: HEAD commit: 171d4ff7 Merge tag 'mmc-v5.9-rc4-2' of git://git.kernel.or.. git tree: upstream console output: https://syzkaller.appspot.com/x/log.txt?x=1597ead390 kernel config: https://syzkaller.appspot.com/x/.config?x=af502ec9a451c9fc dashboard link: https://syzkaller.appspot.com/bug?extid=8e28bba73ed1772a6802 compiler: clang version 10.0.0 (https://github.com/llvm/llvm-project/ c2443155a0fb245c8f17f2c1c72b6ea391e86e81) syz repro: https://syzkaller.appspot.com/x/repro.syz?x=172ff48190 C reproducer: https://syzkaller.appspot.com/x/repro.c?x=17c3e6c590 IMPORTANT: if you fix the issue, please add the following tag to the commit: Reported-by: syzbot+8e28bb...@syzkaller.appspotmail.com == BUG: KASAN: slab-out-of-bounds in squashfs_get_id+0xb9/0x1c0 fs/squashfs/id.c:38 Read of size 8 at addr 8880a9684b98 by task syz-executor329/6836 CPU: 1 PID: 6836 Comm: syz-executor329 Not tainted 5.9.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1d6/0x29e lib/dump_stack.c:118 print_address_description+0x66/0x620 mm/kasan/report.c:383 __kasan_report mm/kasan/report.c:513 [inline] kasan_report+0x132/0x1d0 mm/kasan/report.c:530 squashfs_get_id+0xb9/0x1c0 fs/squashfs/id.c:38 squashfs_new_inode fs/squashfs/inode.c:51 [inline] squashfs_read_inode+0x155/0x2170 fs/squashfs/inode.c:120 squashfs_fill_super+0x1478/0x1790 fs/squashfs/super.c:310 get_tree_bdev+0x3e9/0x5f0 fs/super.c:1342 vfs_get_tree+0x88/0x270 fs/super.c:1547 do_new_mount fs/namespace.c:2875 [inline] path_mount+0x179d/0x29e0 fs/namespace.c:3192 do_mount fs/namespace.c:3205 [inline] __do_sys_mount fs/namespace.c:3413 [inline] __se_sys_mount+0x126/0x180 fs/namespace.c:3390 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x446d1a Code: b8 08 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 fd ad fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 da ad fb ff c3 66 0f 1f 84 00 00 00 00 00 RSP: 002b:7ffd7dd4f8b8 EFLAGS: 0293 ORIG_RAX: 00a5 RAX: ffda RBX: 7ffd7dd4f910 RCX: 00446d1a RDX: 2000 RSI: 2100 RDI: 7ffd7dd4f8d0 RBP: 7ffd7dd4f8d0 R08: 7ffd7dd4f910 R09: 7ffd0015 R10: R11: 0293 R12: 0001 R13: 0004 R14: 0003 R15: 0003 Allocated by task 3913: kasan_save_stack mm/kasan/common.c:48 [inline] kasan_set_track mm/kasan/common.c:56 [inline] __kasan_kmalloc+0x100/0x130 mm/kasan/common.c:461 kmalloc_node include/linux/slab.h:577 [inline] __vmalloc_area_node mm/vmalloc.c:2429 [inline] __vmalloc_node_range+0x2c7/0x870 mm/vmalloc.c:2511 module_alloc+0x7e/0x90 arch/x86/kernel/module.c:75 bpf_jit_binary_alloc+0x123/0x230 kernel/bpf/core.c:871 bpf_int_jit_compile+0x7995/0x8920 arch/x86/net/bpf_jit_comp.c:1911 bpf_prog_select_runtime+0x76d/0xa60 kernel/bpf/core.c:1807 bpf_migrate_filter net/core/filter.c:1290 [inline] bpf_prepare_filter+0xec2/0x1140 net/core/filter.c:1338 bpf_prog_create_from_user+0x2ad/0x3e0 net/core/filter.c:1432 seccomp_prepare_filter kernel/seccomp.c:567 [inline] seccomp_prepare_user_filter kernel/seccomp.c:604 [inline] seccomp_set_mode_filter kernel/seccomp.c:1546 [inline] do_seccomp+0x852/0x20b0 kernel/seccomp.c:1661 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The buggy address belongs to the object at 8880a9684b80 which belongs to the cache kmalloc-32 of size 32 The buggy address is located 24 bytes inside of 32-byte region [8880a9684b80, 8880a9684ba0) The buggy address belongs to the page: page:f697ca3d refcount:1 mapcount:0 mapping: index:0x8880a9684fc1 pfn:0xa9684 flags: 0xfffe000200(slab) raw: 00fffe000200 ea0002a5d5c8 ea0002a98588 8880aa440100 raw: 8880a9684fc1 8880a9684000 0001003f page
Re: [Cluster-devel] [PATCH] gfs2: gfs2_read_sb: put gfs2_assert inside the loop
On Mon, Oct 5, 2020 at 9:23 PM Andrew Price wrote: > > On 03/10/2020 07:31, Fox Chen wrote: > > for (x = 2;; x++) { > > ... > > gfs2_assert(sdp, x <= GFS2_MAX_META_HEIGHT); <--- after > > ... > > if (d != sdp->sd_heightsize[x - 1] || m) > > break; > > sdp->sd_heightsize[x] = space; > > } > > > > sdp->sd_max_height = x > > gfs2_assert(sdp, sdp->sd_max_height <= GFS2_MAX_META_HEIGHT) <--- before > > > > Before this patch, gfs2_assert is put outside of the loop of > > sdp->sd_heightsize[x] calculation. When something goes wrong, > > So this looks related to one of the recent syzbot reports, where the > "something goes wrong" is the block size in the on-disk superblock was > zeroed and that leads eventually to this out-of-bounds write. The > correct fix in that case would be to add a validity check for the block > size in gfs2_check_sb(). > Yes, I saw this bug from the syzbot report and I though instead of KASAN gfs2_assert should be able to catch it so I proposed this patch. :) thank you both for your comments. fox
[PATCH] gfs2: gfs2_read_sb: put gfs2_assert inside the loop
for (x = 2;; x++) { ... gfs2_assert(sdp, x <= GFS2_MAX_META_HEIGHT); <--- after ... if (d != sdp->sd_heightsize[x - 1] || m) break; sdp->sd_heightsize[x] = space; } sdp->sd_max_height = x gfs2_assert(sdp, sdp->sd_max_height <= GFS2_MAX_META_HEIGHT) <--- before Before this patch, gfs2_assert is put outside of the loop of sdp->sd_heightsize[x] calculation. When something goes wrong, x exceeds the size of GFS2_MAX_META_HEIGHT, it may already crash inside the loop when sdp->sd_heightsize[x] = space tries to reach the out-of-bound location, gfs2_assert won't help here. This patch fixes this by moving gfs2_assert into the loop. We will check x value each time to see if it exceeds GFS2_MAX_META_HEIGHT. Signed-off-by: Fox Chen --- fs/gfs2/ops_fstype.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c index 6d18d2c91add..6cc32e3010f2 100644 --- a/fs/gfs2/ops_fstype.c +++ b/fs/gfs2/ops_fstype.c @@ -333,6 +333,7 @@ static int gfs2_read_sb(struct gfs2_sbd *sdp, int silent) u64 space, d; u32 m; + gfs2_assert(sdp, x <= GFS2_MAX_META_HEIGHT); space = sdp->sd_heightsize[x - 1] * sdp->sd_inptrs; d = space; m = do_div(d, sdp->sd_inptrs); @@ -343,7 +344,6 @@ static int gfs2_read_sb(struct gfs2_sbd *sdp, int silent) } sdp->sd_max_height = x; sdp->sd_heightsize[x] = ~0; - gfs2_assert(sdp, sdp->sd_max_height <= GFS2_MAX_META_HEIGHT); sdp->sd_max_dents_per_leaf = (sdp->sd_sb.sb_bsize - sizeof(struct gfs2_leaf)) / -- 2.25.1
[PATCH] docs/memory-barriers.txt: Fix a typo in CPU MEMORY BARRIERS section
Commit 39323c6 smp_mb__{before,after}_atomic(): update Documentation has a typo in CPU MEORY BARRIERS section: "RMW functions that do not imply are memory barrier are ..." should be "RMW functions that do not imply a memory barrier are ...". This patch fixes this typo. Signed-off-by: Fox Chen --- Documentation/memory-barriers.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index 96186332e5f4..20b8a7b30320 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -1870,7 +1870,7 @@ There are some more advanced barrier functions: These are for use with atomic RMW functions that do not imply memory barriers, but where the code needs a memory barrier. Examples for atomic - RMW functions that do not imply are memory barrier are e.g. add, + RMW functions that do not imply a memory barrier are e.g. add, subtract, (failed) conditional operations, _relaxed functions, but not atomic_read or atomic_set. A common example where a memory barrier may be required is when atomic ops are used for reference -- 2.25.1
HCI_EV_PHY_LINK_COMPLETE in net/bluetooth/hci_event.c
Dear all, I have a question about static void hci_phy_link_complete_evt(struct hci_dev *hdev, struct sk_buff *skb) -- HCI_EV_PHY_LINK_COMPLETE event packet handler in hci_even.c:4940 if (ev->status) { hci_conn_del(hcon); <-- hci_dev_unlock(hdev); return; } Is it correct to del hcon here?? Because later on, when we close the socket fd, socket_close will call sco_chan_del which will eventually call hci_conn_drop. With hcon already deleted by this handler, it will crash. This bug is reported by syzbot in https://syzkaller.appspot.com/bug?id=57e98513afbe427bbd65ac295130bcf5bc860dd8 I'm trying to fix that, but I don't know the design nature of HCI_EV_PHY_LINK_COMPLETE. Will this scenario happen in real life?? Can I remove hci_conn_del(hcon) here (I tested it, which fixes this bug) ?? Thanks, Fox
[PATCH] staging: rtl8723bs: Cleanup open brace issues
This cleans up open brace issues reported by checkpatch.pl Signed-off-by: Fox Chen --- .../staging/rtl8723bs/include/HalPwrSeqCmd.h | 6 +-- drivers/staging/rtl8723bs/include/HalVerDef.h | 18 +++ drivers/staging/rtl8723bs/include/drv_types.h | 6 +-- .../rtl8723bs/include/drv_types_sdio.h| 3 +- .../staging/rtl8723bs/include/hal_btcoex.h| 3 +- drivers/staging/rtl8723bs/include/hal_com.h | 3 +- drivers/staging/rtl8723bs/include/ieee80211.h | 3 +- .../rtl8723bs/include/ioctl_cfg80211.h| 3 +- .../staging/rtl8723bs/include/rtl8192c_recv.h | 3 +- .../staging/rtl8723bs/include/rtl8723b_recv.h | 6 +-- .../staging/rtl8723bs/include/rtl8723b_xmit.h | 3 +- drivers/staging/rtl8723bs/include/rtw_cmd.h | 30 --- .../staging/rtl8723bs/include/rtw_eeprom.h| 6 +-- drivers/staging/rtl8723bs/include/rtw_event.h | 6 +-- drivers/staging/rtl8723bs/include/rtw_ht.h| 3 +- drivers/staging/rtl8723bs/include/rtw_mlme.h | 6 +-- .../staging/rtl8723bs/include/rtw_mlme_ext.h | 51 +++ drivers/staging/rtl8723bs/include/rtw_mp.h| 15 ++ .../staging/rtl8723bs/include/rtw_pwrctrl.h | 12 ++--- drivers/staging/rtl8723bs/include/rtw_recv.h | 12 ++--- .../staging/rtl8723bs/include/rtw_security.h | 9 ++-- drivers/staging/rtl8723bs/include/rtw_xmit.h | 12 ++--- drivers/staging/rtl8723bs/include/wifi.h | 24 +++-- .../staging/rtl8723bs/include/wlan_bssdef.h | 9 ++-- 24 files changed, 84 insertions(+), 168 deletions(-) diff --git a/drivers/staging/rtl8723bs/include/HalPwrSeqCmd.h b/drivers/staging/rtl8723bs/include/HalPwrSeqCmd.h index 7040cfc507d8..459f2f9d4bbb 100644 --- a/drivers/staging/rtl8723bs/include/HalPwrSeqCmd.h +++ b/drivers/staging/rtl8723bs/include/HalPwrSeqCmd.h @@ -82,14 +82,12 @@ #definePWR_CUT_ALL_MSK 0xFF -typedef enum _PWRSEQ_CMD_DELAY_UNIT_ -{ +typedef enum _PWRSEQ_CMD_DELAY_UNIT_ { PWRSEQ_DELAY_US, PWRSEQ_DELAY_MS, } PWRSEQ_DELAY_UNIT; -typedef struct _WL_PWR_CFG_ -{ +typedef struct _WL_PWR_CFG_ { u16 offset; u8 cut_msk; u8 fab_msk:4; diff --git a/drivers/staging/rtl8723bs/include/HalVerDef.h b/drivers/staging/rtl8723bs/include/HalVerDef.h index c548fb126683..b4744be2cbe1 100644 --- a/drivers/staging/rtl8723bs/include/HalVerDef.h +++ b/drivers/staging/rtl8723bs/include/HalVerDef.h @@ -8,8 +8,7 @@ #define __HAL_VERSION_DEF_H__ /* HAL_IC_TYPE_E */ -typedef enum tag_HAL_IC_Type_Definition -{ +typedef enum tag_HAL_IC_Type_Definition { CHIP_8192S = 0, CHIP_8188C = 1, CHIP_8192C = 2, @@ -23,16 +22,14 @@ typedef enum tag_HAL_IC_Type_Definition } HAL_IC_TYPE_E; /* HAL_CHIP_TYPE_E */ -typedef enum tag_HAL_CHIP_Type_Definition -{ +typedef enum tag_HAL_CHIP_Type_Definition { TEST_CHIP = 0, NORMAL_CHIP = 1, FPGA= 2, } HAL_CHIP_TYPE_E; /* HAL_CUT_VERSION_E */ -typedef enum tag_HAL_Cut_Version_Definition -{ +typedef enum tag_HAL_Cut_Version_Definition { A_CUT_VERSION = 0, B_CUT_VERSION = 1, C_CUT_VERSION = 2, @@ -47,15 +44,13 @@ typedef enum tag_HAL_Cut_Version_Definition } HAL_CUT_VERSION_E; /* HAL_Manufacturer */ -typedef enum tag_HAL_Manufacturer_Version_Definition -{ +typedef enum tag_HAL_Manufacturer_Version_Definition { CHIP_VENDOR_TSMC= 0, CHIP_VENDOR_UMC = 1, CHIP_VENDOR_SMIC= 2, } HAL_VENDOR_E; -typedef enum tag_HAL_RF_Type_Definition -{ +typedef enum tag_HAL_RF_Type_Definition { RF_TYPE_1T1R= 0, RF_TYPE_1T2R= 1, RF_TYPE_2T2R= 2, @@ -66,8 +61,7 @@ typedef enum tag_HAL_RF_Type_Definition RF_TYPE_4T4R= 7, } HAL_RF_TYPE_E; -typedefstruct tag_HAL_VERSION -{ +typedefstruct tag_HAL_VERSION { HAL_IC_TYPE_E ICType; HAL_CHIP_TYPE_E ChipType; HAL_CUT_VERSION_E CUTVersion; diff --git a/drivers/staging/rtl8723bs/include/drv_types.h b/drivers/staging/rtl8723bs/include/drv_types.h index dba75216cbfe..c73f581aea06 100644 --- a/drivers/staging/rtl8723bs/include/drv_types.h +++ b/drivers/staging/rtl8723bs/include/drv_types.h @@ -86,8 +86,7 @@ struct specific_device_id { }; -struct registry_priv -{ +struct registry_priv { u8 chip_version; u8 rfintfs; u8 lbkmode; @@ -418,8 +417,7 @@ struct cam_entry_cache { ((u8 *)(x))[6], ((u8 *)(x))[7], ((u8 *)(x))[8], ((u8 *)(x))[9], ((u8 *)(x))[10], ((u8 *)(x))[11], \ ((u8 *)(x))[12], ((u8 *)(x))[13], ((u8 *)(x))[14], ((u8 *)(x))[15] -struct dvobj_priv -{ +struct dvobj_priv { /* below is common data */ struct adapter *if1; /* PRIMARY_ADAPTER */ struct adapter *if2; /* SECONDARY_ADAPTER */ diff --git