Re: [PATCH v2 0/3] Directed kmem charging
On Wed, 21 Feb 2018, Andrew Morton wrote:

> What do others think?

I think the changes to the hotpaths of the slab allocators, increasing register pressure in some of the hottest paths of the kernel, are problematic. It's better to do the allocation properly in the task context to which it is finally charged. There may be other restrictions that emerge from other fields in the task_struct that may also influence allocation and reclaim behavior. It is cleanest to do this in the proper task context instead of taking a piece (like the cgroup) out of context.
Re: [PATCH v2 0/3] Directed kmem charging
On Thu, 22 Feb 2018, Jan Kara wrote:

> I don't see how task work can be used here. Firstly I don't know of a case
> where task work would be used for something else than the current task -
> and that is substantial because otherwise you have to deal with lots of
> problems like races with task exit, when work gets executed (normally it
> gets executed once task exits to userspace) etc. Or do you mean that you'd
> queue task work for current task and then somehow magically switch memcg
> there? In that case this magic switching isn't clear to me...

That's surprising, since one can specify the task. If it's only for current, then why do you need a parameter?

I think a capability of executing a function in the context of another running task could simplify a lot, in particular when something in the kernel triggers behavior related to another task, like what's happening here. It should not be that difficult to do proper synchronization on the list of work and then set some flag (maybe SIGPENDING) to make the task execute what's on the task list. Signal delivery is, after all, similar.
Re: [PATCH v2 0/3] Directed kmem charging
On Wed 21-02-18 11:57:47, Christopher Lameter wrote:
> On Wed, 21 Feb 2018, Shakeel Butt wrote:
>
> > On Wed, Feb 21, 2018 at 8:09 AM, Christopher Lameter wrote:
> > > Another way to solve this is to switch the user context right?
> > >
> > > Isn't it possible to avoid these patches if we do the allocation in
> > > another task context instead?
> >
> > Sorry, can you please explain what you mean by 'switch the user
> > context'. Is there any example in kernel which does something similar?
>
> See include/linux/task_work.h. One use case is in mntput_no_expire() in
> linux/fs/namespace.c
>
> > > Are there really any other use cases beyond fsnotify?
> >
> > Another use case I have in mind and plan to upstream is to bind a
> > filesystem mount with a memcg. So, all the file pages (or anon pages
> > for shmem) and kmem (like inodes and dentry) will be charged to that
> > memcg.
>
> The mount logic already uses task_work.h. That may be the approach to
> expand there.

I don't see how task work can be used here. Firstly, I don't know of a case where task work would be used for something other than the current task - and that is substantial, because otherwise you have to deal with lots of problems like races with task exit, when the work gets executed (normally it gets executed once the task exits to userspace), etc. Or do you mean that you'd queue task work for the current task and then somehow magically switch memcg there? In that case this magic switching isn't clear to me...

								Honza
--
Jan Kara
SUSE Labs, CR
Re: [PATCH v2 0/3] Directed kmem charging
On Wed 21-02-18 12:54:26, Andrew Morton wrote:
> On Wed, 21 Feb 2018 09:18:35 -0800 Shakeel Butt wrote:
>
> > On Wed, Feb 21, 2018 at 8:09 AM, Christopher Lameter wrote:
> > > Another way to solve this is to switch the user context right?
> > >
> > > Isn't it possible to avoid these patches if we do the allocation in
> > > another task context instead?
> >
> > Sorry, can you please explain what you mean by 'switch the user
> > context'. Is there any example in kernel which does something similar?
> >
> > Another way is by adding a field 'remote_memcg_to_charge' in
> > task_struct and setting it before the allocation; in memcontrol.c,
> > first check if current->remote_memcg_to_charge is set, otherwise use
> > the memcg of current. Also, if we provide a wrapper to do that for the
> > user, there will be a lot less plumbing.
> >
> > Please let me know if you prefer this approach.
>
> That would be a lot simpler. Passing function arguments via
> task_struct is a bit dirty but is sometimes sooo effective. You
> should've seen how much mess task_struct.journal_info avoided! And
> reclaim_state.

Agreed, although from time to time people try to be too creative, e.g. with journal_info, and surprising bugs come out of that :).

> And one always wonders whether we should do a local save/restore before
> modifying the task_struct field, so it nests.
>
> What do others think?

Sounds nice to me.

> Maybe we can rename task_struct.reclaim_state to "struct task_mm_state
> *task_mm_state", add remote_memcg_to_charge to struct task_mm_state and
> avoid bloating the task_struct?

Yeah, even better, but then we really need to make sure these things stack properly.

								Honza
--
Jan Kara
SUSE Labs, CR
Re: [PATCH v2 0/3] Directed kmem charging
On Wed, 21 Feb 2018 09:18:35 -0800 Shakeel Butt wrote:

> On Wed, Feb 21, 2018 at 8:09 AM, Christopher Lameter wrote:
> > Another way to solve this is to switch the user context right?
> >
> > Isn't it possible to avoid these patches if we do the allocation in
> > another task context instead?
>
> Sorry, can you please explain what you mean by 'switch the user
> context'. Is there any example in kernel which does something similar?
>
> Another way is by adding a field 'remote_memcg_to_charge' in
> task_struct and setting it before the allocation; in memcontrol.c,
> first check if current->remote_memcg_to_charge is set, otherwise use
> the memcg of current. Also, if we provide a wrapper to do that for the
> user, there will be a lot less plumbing.
>
> Please let me know if you prefer this approach.

That would be a lot simpler. Passing function arguments via task_struct is a bit dirty but is sometimes sooo effective. You should've seen how much mess task_struct.journal_info avoided! And reclaim_state.

And one always wonders whether we should do a local save/restore before modifying the task_struct field, so it nests.

What do others think?

Maybe we can rename task_struct.reclaim_state to "struct task_mm_state *task_mm_state", add remote_memcg_to_charge to struct task_mm_state and avoid bloating the task_struct?
Re: [PATCH v2 0/3] Directed kmem charging
On Wed, Feb 21, 2018 at 9:57 AM, Christopher Lameter wrote:
> On Wed, 21 Feb 2018, Shakeel Butt wrote:
>
>> On Wed, Feb 21, 2018 at 8:09 AM, Christopher Lameter wrote:
>> > Another way to solve this is to switch the user context right?
>> >
>> > Isn't it possible to avoid these patches if we do the allocation in
>> > another task context instead?
>>
>> Sorry, can you please explain what you mean by 'switch the user
>> context'. Is there any example in kernel which does something similar?
>
> See include/linux/task_work.h. One use case is in mntput_no_expire() in
> linux/fs/namespace.c

From what I understand, using task_work would require the fanotify/inotify event handler to allocate memory asynchronously. IMHO the code will be much more complex if we go that route.

>> Another way is by adding a field 'remote_memcg_to_charge' in
>> task_struct and setting it before the allocation; in memcontrol.c,
>> first check if current->remote_memcg_to_charge is set, otherwise use
>> the memcg of current. Also, if we provide a wrapper to do that for the
>> user, there will be a lot less plumbing.
>>
>> Please let me know if you prefer this approach.

What do you think of the above approach? I think the amount and complexity of code will be much less.

>> > Are there really any other use cases beyond fsnotify?
>>
>> Another use case I have in mind and plan to upstream is to bind a
>> filesystem mount with a memcg. So, all the file pages (or anon pages
>> for shmem) and kmem (like inodes and dentry) will be charged to that
>> memcg.
>
> The mount logic already uses task_work.h. That may be the approach to
> expand there.

The task_work approach would require that the job is already running at the time of the mount operation. Usually mount operations are done by either the admin or the control task starting the job, as part of setting up the environment. So there might not be any process of the job running at the time of the mount operation.
Re: [PATCH v2 0/3] Directed kmem charging
On Wed, 21 Feb 2018, Shakeel Butt wrote:

> On Wed, Feb 21, 2018 at 8:09 AM, Christopher Lameter wrote:
> > Another way to solve this is to switch the user context right?
> >
> > Isn't it possible to avoid these patches if we do the allocation in
> > another task context instead?
>
> Sorry, can you please explain what you mean by 'switch the user
> context'. Is there any example in kernel which does something similar?

See include/linux/task_work.h. One use case is in mntput_no_expire() in linux/fs/namespace.c

> > Are there really any other use cases beyond fsnotify?
>
> Another use case I have in mind and plan to upstream is to bind a
> filesystem mount with a memcg. So, all the file pages (or anon pages
> for shmem) and kmem (like inodes and dentry) will be charged to that
> memcg.

The mount logic already uses task_work.h. That may be the approach to expand there.
Re: [PATCH v2 0/3] Directed kmem charging
On Wed, Feb 21, 2018 at 8:09 AM, Christopher Lameter wrote:
> Another way to solve this is to switch the user context right?
>
> Isn't it possible to avoid these patches if we do the allocation in
> another task context instead?

Sorry, can you please explain what you mean by 'switch the user context'. Is there any example in kernel which does something similar?

Another way is by adding a field 'remote_memcg_to_charge' in task_struct and setting it before the allocation; in memcontrol.c, first check if current->remote_memcg_to_charge is set, otherwise use the memcg of current. Also, if we provide a wrapper to do that for the user, there will be a lot less plumbing.

Please let me know if you prefer this approach.

> Are there really any other use cases beyond fsnotify?

Another use case I have in mind and plan to upstream is to bind a filesystem mount with a memcg. So, all the file pages (or anon pages for shmem) and kmem (like inodes and dentry) will be charged to that memcg.

> The charging of the memory works on a per-page level but the allocations
> occur from the same page for multiple tasks that may be running on a
> system. So how relevant is this for other small objects?
>
> Seems that if you do a large amount of allocations for the same purpose
> your chance of accounting it to the right memcg increases. But this is a
> game of chance.
Re: [PATCH v2 0/3] Directed kmem charging
Another way to solve this is to switch the user context, right?

Isn't it possible to avoid these patches if we do the allocation in another task context instead?

Are there really any other use cases beyond fsnotify?

The charging of the memory works on a per-page level, but the allocations occur from the same page for multiple tasks that may be running on a system. So how relevant is this for other small objects?

It seems that if you do a large number of allocations for the same purpose, your chance of accounting it to the right memcg increases. But this is a game of chance.