Re: [PATCHv9 1/3] rdmacg: Added rdma cgroup controller

2016-03-19 Thread Parav Pandit
Hi Tejun,

On Thu, Mar 17, 2016 at 2:10 AM, Tejun Heo  wrote:
>
>> If this is ok. I will keep the code as it is, because it uses common
>> helper functions for max and current files.
>
> Hmmm... can you please try to refactor the common part to helpers?
> It's not a big thing but there were both styles across different
> controllers and helper based ones tend to be easier to follow.
>
In this v9 patch, the entry point for reading both the max and current
values is the common function rdmacg_resource_read(), because reading
max and current requires the same steps. The differences are handled in
two small helper functions:
1. get_cg_rpool_values
2. print_rpool_values
Can I continue with this approach?
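The split described above can be sketched roughly as follows. This is a simplified, hypothetical user-space model (the names mirror the patch discussion, but the fields and signatures are illustrative, not the kernel code): one common read entry point for both interface files, with the max/current difference confined to the helpers.

```c
#include <stdio.h>
#include <assert.h>

/* Hypothetical user-space sketch of the read path being discussed: a
 * common entry point for both the max and current files, with the
 * per-file difference isolated in two small helpers. */

enum rdmacg_file_type { RDMACG_RESOURCE_MAX, RDMACG_RESOURCE_STAT };

struct rdmacg_resource { int max; int usage; };

/* helper 1: pick the right value set depending on which file is read */
static void get_cg_rpool_values(const struct rdmacg_resource *res, int n,
                                enum rdmacg_file_type type, int *out)
{
    for (int i = 0; i < n; i++)
        out[i] = (type == RDMACG_RESOURCE_MAX) ? res[i].max : res[i].usage;
}

/* helper 2: print "name=value" pairs the way cgroup interface files do */
static void print_rpool_values(const char **names, const int *vals, int n)
{
    for (int i = 0; i < n; i++)
        printf("%s=%d%c", names[i], vals[i], i == n - 1 ? '\n' : ' ');
}

/* common entry point shared by the max and current files */
static void rdmacg_resource_read(const char **names,
                                 const struct rdmacg_resource *res, int n,
                                 enum rdmacg_file_type type)
{
    int vals[16];

    assert(n <= 16);
    get_cg_rpool_values(res, n, type, vals);
    print_rpool_values(names, vals, n);
}
```

The point of the structure is that adding a third file type would touch only `get_cg_rpool_values`, not the entry point.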

> I see, but if that's the case, please drop the fine locking.  The fine
> locking doesn't make much sense - as implemented it's slower and the
> whole thing is not hot.

o.k.

Also, can you please confirm that once the patch looks good to you, you
are ok with merging it through Doug's linux-rdma tree, since the merge
is clean from that tree?


Re: [PATCHv9 1/3] rdmacg: Added rdma cgroup controller

2016-03-19 Thread Tejun Heo
Hello, Parav.

Sorry about the delay.

On Sat, Mar 12, 2016 at 11:49:03AM +0530, Parav Pandit wrote:
> > For (1) shall I have one spin lock that is uses across multiple
> > hierarchy and multiple cgroup.
> > Essentially one global lock among all cgroup. During hierarchical
> > charging, continue to use same lock it at each level.
> > Would that work in this first release?
> 
> I am waiting for your reply.
> Shall one lock for all cgroup is ok with you?

Yes, when you're locking up to the root each time, splitting locks at
the bottom doesn't really achieve anything.  It just makes things more
expensive.
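Tejun's point can be illustrated with a minimal user-space model (pthread mutexes standing in for kernel spinlocks; all names and fields are hypothetical, not the patch code): because every charge walks from the charging cgroup up to the root, any two concurrent charges conflict at the root anyway, so per-cgroup locks add lock/unlock cost without adding parallelism, and one global lock suffices.

```c
#include <pthread.h>
#include <stddef.h>
#include <assert.h>

/* User-space model of hierarchical charging under a single global lock.
 * Every charge touches every ancestor up to the root, so fine-grained
 * per-cgroup locks would not allow two charges to proceed in parallel
 * anyway -- they would still serialize on the root. */

struct cgroup {
    struct cgroup *parent;
    int usage;
    int max;
};

static pthread_mutex_t rdmacg_lock = PTHREAD_MUTEX_INITIALIZER;

/* charge `num` against `cg` and every ancestor; roll back on failure */
static int rdmacg_try_charge(struct cgroup *cg, int num)
{
    struct cgroup *p, *q;
    int ret = 0;

    pthread_mutex_lock(&rdmacg_lock);
    for (p = cg; p; p = p->parent) {
        if (p->usage + num > p->max) {
            ret = -1;
            /* undo the partial charge on the levels already charged */
            for (q = cg; q != p; q = q->parent)
                q->usage -= num;
            break;
        }
        p->usage += num;
    }
    pthread_mutex_unlock(&rdmacg_lock);
    return ret;
}

static void rdmacg_uncharge(struct cgroup *cg, int num)
{
    pthread_mutex_lock(&rdmacg_lock);
    for (struct cgroup *p = cg; p; p = p->parent)
        p->usage -= num;
    pthread_mutex_unlock(&rdmacg_lock);
}
```

With one lock, the rollback path also needs no lock juggling: the whole walk, including the undo loop, happens atomically.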

> If this is ok. I will keep the code as it is, because it uses common
> helper functions for max and current files.

Hmmm... can you please try to refactor the common part to helpers?
It's not a big thing but there were both styles across different
controllers and helper based ones tend to be easier to follow.

> >> 3 is fine but resource [un]charging is not hot path?
> > charge/uncharge is hot path from cgroup perspective.
> > Considering 1 to 4 devices in system rpool list would grow upto 4
> > entry deep at each cgroup level.
> > I believe this is good enough to start with. O complexity wise its
> > O(N). where N is number of devices in system.

I see, but if that's the case, please drop the fine locking.  The fine
locking doesn't make much sense - as implemented it's slower and the
whole thing is not hot.

Thanks.

-- 
tejun


Re: [PATCHv9 1/3] rdmacg: Added rdma cgroup controller

2016-03-11 Thread Parav Pandit
Hi Tejun,

On Sat, Mar 5, 2016 at 10:50 PM, Parav Pandit  wrote:
> Hi Tejun,
>
> On Sat, Mar 5, 2016 at 6:22 PM, Tejun Heo  wrote:
>> Hello, Parav.
>>
>> On Sat, Mar 05, 2016 at 04:45:09PM +0530, Parav Pandit wrote:
>>> Design that remains same from v6 to v10.
>>>   * spin lock is still fine grained at cgroup level instead of one
>>> global shared lock among all cgroups.
>>>  In future it can be optimized further to do per cpu or using
>>> single lock if required.
>>>   * file type enums are still present for max and current, as
>>> read/write call to those files is already taken care by common
>>> functions with required if/else.
>>>   * Resource limit setting is as it is, because number of devices are
>>> in range of 1 to 4 count in most use cases (as explained in
>>> documentation), and its not hot path.
>>
>> 1 and 2 are not okay.
> For (1) shall I have one spin lock that is uses across multiple
> hierarchy and multiple cgroup.
> Essentially one global lock among all cgroup. During hierarchical
> charging, continue to use same lock it at each level.
> Would that work in this first release?
>

I am waiting for your reply.
Is one lock for all cgroups ok with you?

> Can you please review the code for (2), I cannot think of any further
> helper functions that I can write.
> For both the file types, all the code is already common.
> file types are used only to find out whether to reference max variable
> or usage variable in structure.
> Which can also be made as array, but I do not want to lose the code
> readability for that little gain.
> What exactly is the issue in current implementation? You just
> mentioned that "its not good sign".
> Its readable, simple and serves the purpose, what am I missing?
>
If this is ok, I will keep the code as it is, because it uses common
helper functions for the max and current files.


>> 3 is fine but resource [un]charging is not hot path?
> charge/uncharge is hot path from cgroup perspective.
> Considering 1 to 4 devices in system rpool list would grow upto 4
> entry deep at each cgroup level.
> I believe this is good enough to start with. O complexity wise its
> O(N). where N is number of devices in system.
>
>
>>
>> Thanks.
>>
>> --
>> tejun


Re: [PATCHv9 1/3] rdmacg: Added rdma cgroup controller

2016-03-06 Thread Haggai Eran
On 05/03/2016 19:20, Parav Pandit wrote:
>> > 3 is fine but resource [un]charging is not hot path?
> charge/uncharge is hot path from cgroup perspective.

Most of the resources the RDMA cgroup handles are only allocated at
the beginning of the application. The RDMA subsystem allows direct
user-space access to the devices, so most of the hot path operations
don't go through the kernel at all.

It is true though that for some applications MR registration and 
de-registration is in the hot path.

Haggai


Re: [PATCHv9 1/3] rdmacg: Added rdma cgroup controller

2016-03-05 Thread Parav Pandit
Hi Tejun,

On Sat, Mar 5, 2016 at 6:22 PM, Tejun Heo  wrote:
> Hello, Parav.
>
> On Sat, Mar 05, 2016 at 04:45:09PM +0530, Parav Pandit wrote:
>> Design that remains same from v6 to v10.
>>   * spin lock is still fine grained at cgroup level instead of one
>> global shared lock among all cgroups.
>>  In future it can be optimized further to do per cpu or using
>> single lock if required.
>>   * file type enums are still present for max and current, as
>> read/write call to those files is already taken care by common
>> functions with required if/else.
>>   * Resource limit setting is as it is, because number of devices are
>> in range of 1 to 4 count in most use cases (as explained in
>> documentation), and its not hot path.
>
> 1 and 2 are not okay.
For (1), shall I have one spin lock that is used across multiple
hierarchies and multiple cgroups?
Essentially one global lock among all cgroups; during hierarchical
charging, the same lock would be used at each level.
Would that work for this first release?

Can you please review the code for (2)? I cannot think of any further
helper functions that I could write.
For both file types, all the code is already common.
The file type is used only to decide whether to reference the max
variable or the usage variable in the structure.
That could also be made an array, but I do not want to lose code
readability for that little gain.
What exactly is the issue with the current implementation? You just
mentioned that "its not good sign".
It is readable, simple, and serves the purpose; what am I missing?

> 3 is fine but resource [un]charging is not hot path?
charge/uncharge is a hot path from the cgroup perspective.
Considering 1 to 4 devices in the system, the rpool list would grow up
to 4 entries deep at each cgroup level.
I believe this is good enough to start with; complexity-wise it is
O(N), where N is the number of devices in the system.
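The per-cgroup, per-device pool list described here can be sketched as follows (a hypothetical user-space model; the member names echo the patch discussion but the structures are illustrative). With 1 to 4 devices, the linear search visits at most 4 nodes.

```c
#include <stddef.h>
#include <assert.h>

/* User-space sketch of the per-cgroup resource-pool list under
 * discussion: one rpool node per RDMA device, linked off the cgroup,
 * found by an O(N) walk where N is the number of devices. */

struct rdmacg_device { const char *name; };

struct rdmacg_resource_pool {
    struct rdmacg_device *device;
    struct rdmacg_resource_pool *next; /* cg_list in the real patch */
};

struct rdma_cgroup {
    struct rdmacg_resource_pool *rpool_head;
};

/* O(N) lookup of the pool a given device's charges are accounted to */
static struct rdmacg_resource_pool *
find_cg_rpool(struct rdma_cgroup *cg, struct rdmacg_device *dev)
{
    for (struct rdmacg_resource_pool *rp = cg->rpool_head; rp; rp = rp->next)
        if (rp->device == dev)
            return rp;
    return NULL;
}
```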


>
> Thanks.
>
> --
> tejun


Re: [PATCHv9 1/3] rdmacg: Added rdma cgroup controller

2016-03-05 Thread Tejun Heo
Hello, Parav.

On Sat, Mar 05, 2016 at 04:45:09PM +0530, Parav Pandit wrote:
> Design that remains same from v6 to v10.
>   * spin lock is still fine grained at cgroup level instead of one
> global shared lock among all cgroups.
>  In future it can be optimized further to do per cpu or using
> single lock if required.
>   * file type enums are still present for max and current, as
> read/write call to those files is already taken care by common
> functions with required if/else.
>   * Resource limit setting is as it is, because number of devices are
> in range of 1 to 4 count in most use cases (as explained in
> documentation), and its not hot path.

1 and 2 are not okay.  3 is fine but resource [un]charging is not hot
path?

Thanks.

-- 
tejun


Re: [PATCHv9 1/3] rdmacg: Added rdma cgroup controller

2016-03-05 Thread Parav Pandit
Hi Tejun,

I would like to submit patch v10.
Can you please confirm whether you are ok with the current design and
whether the changes below are good enough?
I am also ok if you want to jump directly to reviewing v10.

Changes from v9:
  * Included documentation of the resources in v2.txt and v1.txt.
  * Fixed a race condition with process migration during the charging stage.
  * Fixed comments and code to adhere to CodingStyle.
  * Simplified the API by removing support for charging/uncharging
multiple resources in one call.
  * Replaced refcnt with usage_num, which tracks how many resources are
unused, to trigger freeing the object.
  * Removed inline from query_limit and other functions, as it is not
necessary.

Design that remains the same from v6 to v10:
  * The spin lock is still fine grained at the cgroup level instead of
one global lock shared among all cgroups.
 In the future it can be optimized further, to per-cpu locking or
to a single lock, if required.
  * The file type enums are still present for max and current, as
read/write calls to those files are already handled by common
functions with the required if/else.
  * Resource limit setting is as it is, because the number of devices
is in the range of 1 to 4 in most use cases (as explained in the
documentation), and it is not a hot path.

Parav



On Thu, Mar 3, 2016 at 8:19 AM, Parav Pandit  wrote:
> Hi Tejun, Haggai,
>
> On Thu, Mar 3, 2016 at 1:28 AM, Parav Pandit  wrote:
 + rpool->refcnt--;
 + if (rpool->refcnt == 0 && rpool->num_max_cnt == 
 pool_info->table_len) {
>>>
>>> If the caller charges 2 and then uncharges 1 two times, the refcnt
>>> underflows?  Why not just track how many usages are zero?
>>>
>> This is certainly must fix bug. Changed refcnt to usage_sum and changed to do
>> usage_sum -= num during uncharging
>> and
>> usage_sum += num during charing.
>
> This is not sufficient as css_get() and put are done only once per
> call, which leads to similar problem as of refcnt.
> As I think more, I realised that this particular test is missing that
> resulted in this related bug, I realize that we don't have use case to
> have "num" field from the IB stack side.
> For bulk free IB stack will have to keep track of different or same
> rdmacg returned values to call uncharge() with right number of
> resources, all of that complexity just doesn't make sense and not
> required.
> So as first step to further simplify this, I am removing "num" input
> field from charge and uncharge API.


Re: [PATCHv9 1/3] rdmacg: Added rdma cgroup controller

2016-03-03 Thread Parav Pandit
On Thu, Mar 3, 2016 at 1:44 PM, Haggai Eran  wrote:
> On 03/03/2016 05:18, Parav Pandit wrote:
>> On Thu, Mar 3, 2016 at 1:28 AM, Parav Pandit  wrote:
>>> On Wed, Mar 2, 2016 at 11:09 PM, Tejun Heo  wrote:
 Nothing seems to prevent @cg from going away if this races with
 @current being migrated to a different cgroup.  Have you run this with
 lockdep and rcu debugging enabled?  This should have triggered a
 warning.
>> I am able to reproduce this race. Looking into how to address it.
>
> If I understand correctly, task_css() requires rcu read lock being held.
> Is task_get_css() suitable here?

Yes, it is suitable if we continue with the newer API that drops the
"num" entries. Otherwise I will have to make a minor modification in
cgroup.h to support that; there is no task_get_css_many variant, but
that is a minor change anyway.
I slept off last night and saw this today morning.
I will test over the weekend and wait for Tejun's opinion, if there is any.
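The difference Haggai is pointing at can be sketched in kernel-style pseudocode (not runnable on its own; based on the cgroup API of that era, with css_rdmacg() standing in for the patch's container_of-style helper):

```c
/* Option 1: RCU only.  task_css() is valid only under rcu_read_lock(),
 * and the css may go away as soon as the lock is dropped if current is
 * migrated to another cgroup -- the race Parav reproduced. */
rcu_read_lock();
cg = css_rdmacg(task_css(current, rdma_cgrp_id));
/* must finish using cg before unlocking, and must not sleep */
rcu_read_unlock();

/* Option 2: pin the css across migration, as Haggai suggests.
 * task_get_css() takes a reference, so cg stays valid for the whole
 * charging sequence even if current migrates meanwhile. */
css = task_get_css(current, rdma_cgrp_id);
cg = css_rdmacg(css);
/* ... charge resources; safe to sleep and to race with migration ... */
css_put(&cg->css);
```

The reference from task_get_css() is what makes the charge/uncharge pairing safe: the uncharge can drop the same reference the charge took.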


Re: [PATCHv9 1/3] rdmacg: Added rdma cgroup controller

2016-03-03 Thread Haggai Eran
On 03/03/2016 05:18, Parav Pandit wrote:
> On Thu, Mar 3, 2016 at 1:28 AM, Parav Pandit  wrote:
>> On Wed, Mar 2, 2016 at 11:09 PM, Tejun Heo  wrote:
>>> Nothing seems to prevent @cg from going away if this races with
>>> @current being migrated to a different cgroup.  Have you run this with
>>> lockdep and rcu debugging enabled?  This should have triggered a
>>> warning.
> I am able to reproduce this race. Looking into how to address it.

If I understand correctly, task_css() requires rcu read lock being held.
Is task_get_css() suitable here?


Re: [PATCHv9 1/3] rdmacg: Added rdma cgroup controller

2016-03-02 Thread Haggai Eran
On 03/03/2016 04:49, Parav Pandit wrote:
> Hi Tejun, Haggai,
> 
> On Thu, Mar 3, 2016 at 1:28 AM, Parav Pandit  wrote:
 + rpool->refcnt--;
 + if (rpool->refcnt == 0 && rpool->num_max_cnt == 
 pool_info->table_len) {
>>>
>>> If the caller charges 2 and then uncharges 1 two times, the refcnt
>>> underflows?  Why not just track how many usages are zero?
>>>
>> This is certainly must fix bug. Changed refcnt to usage_sum and changed to do
>> usage_sum -= num during uncharging
>> and
>> usage_sum += num during charing.
> 
> This is not sufficient as css_get() and put are done only once per
> call, which leads to similar problem as of refcnt.
Are css_get_many() and css_put_many() relevant here?

> As I think more, I realised that this particular test is missing that
> resulted in this related bug, I realize that we don't have use case to
> have "num" field from the IB stack side.
> For bulk free IB stack will have to keep track of different or same
> rdmacg returned values to call uncharge() with right number of
> resources, all of that complexity just doesn't make sense and not
> required.
> So as first step to further simplify this, I am removing "num" input
> field from charge and uncharge API.

IIRC there are no instances in the RDMA subsystem today where userspace
allocates more than one resource at a time.

Yishai, in your proposed RSS patchset did you have a verb to allocate
multiple work queues at once?

Haggai



Re: [PATCHv9 1/3] rdmacg: Added rdma cgroup controller

2016-03-02 Thread Parav Pandit
On Thu, Mar 3, 2016 at 1:28 AM, Parav Pandit  wrote:
> On Wed, Mar 2, 2016 at 11:09 PM, Tejun Heo  wrote:
>> Nothing seems to prevent @cg from going away if this races with
>> @current being migrated to a different cgroup.  Have you run this with
>> lockdep and rcu debugging enabled?  This should have triggered a
>> warning.
I am able to reproduce this race. Looking into how to address it.


Re: [PATCHv9 1/3] rdmacg: Added rdma cgroup controller

2016-03-02 Thread Parav Pandit
Hi Tejun, Haggai,

On Thu, Mar 3, 2016 at 1:28 AM, Parav Pandit  wrote:
>>> + rpool->refcnt--;
>>> + if (rpool->refcnt == 0 && rpool->num_max_cnt == pool_info->table_len) 
>>> {
>>
>> If the caller charges 2 and then uncharges 1 two times, the refcnt
>> underflows?  Why not just track how many usages are zero?
>>
> This is certainly must fix bug. Changed refcnt to usage_sum and changed to do
> usage_sum -= num during uncharging
> and
> usage_sum += num during charing.

This is not sufficient, as css_get() and css_put() are done only once
per call, which leads to a problem similar to the refcnt one.
Thinking about it more, I realised that this particular test was
missing, which resulted in this bug, and that we don't have a use case
for the "num" field on the IB stack side.
For bulk free, the IB stack would have to keep track of the different
(or same) rdmacg return values in order to call uncharge() with the
right number of resources; all of that complexity just doesn't make
sense and is not required.
So, as a first step to simplify this further, I am removing the "num"
input field from the charge and uncharge API.
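The underflow Tejun describes, and the fix of counting non-zero usages instead of calls, can be modeled in a few lines (a hypothetical user-space model; field names are illustrative, loosely following the usage_num idea from the thread):

```c
#include <assert.h>

/* Model of the refcnt bug: a per-call refcnt goes out of sync as soon
 * as one charge of num=2 is undone by two uncharges of num=1.  Counting
 * how many resource usages are non-zero instead stays consistent,
 * because it is derived from the resource values themselves. */

#define NUM_RESOURCES 2

struct rpool {
    int usage[NUM_RESOURCES];
    int num_used; /* how many entries in usage[] are non-zero */
};

static void rpool_charge(struct rpool *rp, int idx, int num)
{
    if (rp->usage[idx] == 0)
        rp->num_used++;           /* this resource just became used */
    rp->usage[idx] += num;
}

static void rpool_uncharge(struct rpool *rp, int idx, int num)
{
    rp->usage[idx] -= num;
    if (rp->usage[idx] == 0)
        rp->num_used--;           /* this resource just became unused */
}

/* the pool may be freed once nothing uses it (the real check also
 * requires that no max limits are configured, omitted here) */
static int rpool_unused(const struct rpool *rp)
{
    return rp->num_used == 0;
}
```

A per-call refcnt would read 1 after the charge and -1 after the two uncharges; `num_used` correctly reads 1 and then 0.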


Re: [PATCHv9 1/3] rdmacg: Added rdma cgroup controller

2016-03-02 Thread Parav Pandit
On Wed, Mar 2, 2016 at 11:09 PM, Tejun Heo  wrote:
> Hello,
>
>> +struct rdma_cgroup {
>> + struct cgroup_subsys_state  css;
>> +
>> + spinlock_t rpool_list_lock; /* protects resource pool list */
>> + struct list_head rpool_head;/* head to keep track of all resource
>> +  * pools that belongs to this cgroup.
>> +  */
>
> I think we usually don't tail wing these comments.

OK. I will put the comments on a separate line.

>
>> +};
>> +
>> +struct rdmacg_pool_info {
>> + const char **resource_name_table;
>> + int table_len;
>
> Align fields consistently?  I've said this multiple times now but
> please make the available resources constant and document them in
> Documentation/cgroup-v2.txt.

OK, I will align them.
OK, I will document the resource constants defined by the IB stack in
Documentation/cgroup-v2.txt.

>> +
>> +/* APIs for RDMA/IB stack to publish when a device wants to
>> + * participate in resource accounting
>> + */
>
> Please follow CodingStyle.  Wasn't this pointed out a couple times
> already?
Yes, you did. I fixed it in a few places but missed this one. Sorry; I am fixing it now.

>
>> +enum rdmacg_file_type {
>> + RDMACG_RESOURCE_MAX,
>> + RDMACG_RESOURCE_STAT,
>> +};
>
> Heh, usually not a good sign.  If you're using this to call into a
> common function and then switch out on the file type, just switch out
> at the method level and factor out common part into helpers.
>
The methods for both constants are the same; the code that differs between
the two file types is hardly 4 lines. So there is no need for additional
helper functions. The currently defined functions rdmacg_resource_read() and
rdmacg_resource_set_max() simply switch on the file type.
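A userspace sketch of that shared entry point (assumed structure only; the real read path also walks the per-device resource pools, and the helper names here are illustrative): the per-file-type difference really is just which field gets reported.

```c
#include <assert.h>
#include <stdio.h>

enum rdmacg_file_type {
	RDMACG_RESOURCE_MAX,
	RDMACG_RESOURCE_STAT,
};

struct rdmacg_resource {
	int max;
	int usage;
};

/* The only file-type-specific part of the common read path. */
static int resource_value(const struct rdmacg_resource *res,
			  enum rdmacg_file_type type)
{
	return type == RDMACG_RESOURCE_MAX ? res->max : res->usage;
}

/* Shared printer serving both the "max" and the "current" file. */
static void print_rpool_value(const char *name,
			      const struct rdmacg_resource *res,
			      enum rdmacg_file_type type)
{
	printf("%s %d\n", name, resource_value(res, type));
}
```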

>> +/* resource tracker per resource for rdma cgroup */
>> +struct rdmacg_resource {
>> + int max;
>> + int usage;
>> +};
>
> Align fields?

The above one seems to be aligned; I am not sure what I am missing.
I am aligning all the structures anyway.

>
>> +/**
>
> The above indicates kerneldoc comments, which this isn't.
Fixed for this comment.

>
>> + * resource pool object which represents, per cgroup, per device
>> + * resources. There are multiple instance
>> + * of this object per cgroup, therefore it cannot be embedded within
>> + * rdma_cgroup structure. It is maintained as list.
>
> Consistent paragraph fill?
Fixed.

>

>> + */
>> +struct rdmacg_resource_pool {
>> + struct list_head cg_list;
>> + struct list_head dev_list;
>> +
>> + struct rdmacg_device *device;
>> + struct rdmacg_resource *resources;
>> + struct rdma_cgroup *cg; /* owner cg used during device cleanup */
>> +
>> + int refcnt; /* count active user tasks of this pool */
>> + int num_max_cnt;/* total number counts which are set to max */
>> +};
>
> Formatting.

I am aligning all the fields in the next patch.

>
>> +static inline void set_resource_limit(struct rdmacg_resource_pool *rpool,
>> +   int index, int new_max)
>
> Is inline necessary?  Compiler can figure out these.
Yes. Removed.

>
>> +static struct rdmacg_resource_pool*
>  ^
>  space
>
>> +find_cg_rpool_locked(struct rdma_cgroup *cg,
>> +  struct rdmacg_device *device)
> ...
>> +static void uncharge_cg_resource(struct rdma_cgroup *cg,
>> +  struct rdmacg_device *device,
>> +  int index, int num)
>> +{
> ...
>> + rpool->refcnt--;
>> + if (rpool->refcnt == 0 && rpool->num_max_cnt == pool_info->table_len) {
>
> If the caller charges 2 and then uncharges 1 two times, the refcnt
> underflows?  Why not just track how many usages are zero?
>
This certainly is a must-fix bug. I changed refcnt to usage_sum, and now do
usage_sum -= num during uncharging
and
usage_sum += num during charging.

> ...
>> +void rdmacg_uncharge(struct rdma_cgroup *cg,
>> +  struct rdmacg_device *device,
>> +  int index, int num)
>> +{
>> + struct rdma_cgroup *p;
>> +
>> + for (p = cg; p; p = parent_rdmacg(p))
>> + uncharge_cg_resource(p, device, index, num);
>> +
>> + css_put(&cg->css);
>> +}
>> +EXPORT_SYMBOL(rdmacg_uncharge);
>
> So, I suppose the code is trying to save lock contention with
> fine-grained locking;
Yes.
> however, as the code is currently structured,
> it's just increasing the number of lock ops and it'd be highly likely
> to cheaper to simply use a single lock.

A single lock per subsystem? I understood that you were OK with a
per-cgroup fine-grained lock.

> If you're working up the tree
> grabbing lock at each level, per-node locking doesn't save you
> anything.  Also, it introduces conditions where charges are spuriously
> denied when there are racing requestors.  If scalability becomes an
> issue, the right way to address is adding percpu front cache.
>
>> +void 
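Tejun's point can be illustrated with a small userspace model (hypothetical types; the lock stubs stand in for one subsystem-wide spinlock): take the lock once, charge up the hierarchy, and roll back the partial charges on failure, so racing requestors never see spuriously denied charges.

```c
#include <assert.h>
#include <stddef.h>

struct cg {
	struct cg *parent;
	int max;
	int usage;
};

/* Stand-ins for a single subsystem-wide spinlock (single-threaded demo). */
static void rdmacg_lock(void)   { }
static void rdmacg_unlock(void) { }

/* Charge one resource in @cg and every ancestor under one lock;
 * on failure, undo the charges already applied before returning. */
static int cg_try_charge(struct cg *cg)
{
	struct cg *p, *q;
	int ret = 0;

	rdmacg_lock();
	for (p = cg; p; p = p->parent) {
		if (p->usage >= p->max) {
			ret = -1;	/* the kernel would return -EAGAIN */
			break;
		}
		p->usage++;
	}
	if (ret)
		for (q = cg; q != p; q = q->parent)	/* roll back */
			q->usage--;
	rdmacg_unlock();
	return ret;
}
```

Because the whole walk happens under one critical section, a concurrent requestor either sees the full charge or none of it, which is what per-node locking cannot guarantee.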

Re: [PATCHv9 1/3] rdmacg: Added rdma cgroup controller

2016-03-02 Thread Tejun Heo
Hello,

On Wed, Mar 02, 2016 at 12:35:35AM +0530, Parav Pandit wrote:
> diff --git a/include/linux/cgroup_rdma.h b/include/linux/cgroup_rdma.h
> new file mode 100644
> index 000..2da3d6c
> --- /dev/null
> +++ b/include/linux/cgroup_rdma.h
> @@ -0,0 +1,50 @@
> +#ifndef _CGROUP_RDMA_H
> +#define _CGROUP_RDMA_H
> +
> +#include 
> +
> +#ifdef CONFIG_CGROUP_RDMA
> +
> +struct rdma_cgroup {
> + struct cgroup_subsys_state  css;
> +
> + spinlock_t rpool_list_lock; /* protects resource pool list */
> + struct list_head rpool_head;/* head to keep track of all resource
> +  * pools that belongs to this cgroup.
> +  */

I think we usually don't tail wing these comments.

> +};
> +
> +struct rdmacg_pool_info {
> + const char **resource_name_table;
> + int table_len;

Align fields consistently?  I've said this multiple times now but
please make the available resources constant and document them in
Documentation/cgroup-v2.txt.

> +};
> +
> +struct rdmacg_device {
> + struct rdmacg_pool_info pool_info;
> + struct list_head rdmacg_list;
> + struct list_head rpool_head;
> + /* protects resource pool list */
> + spinlock_t rpool_lock;
> + char *name;
> +};
> +
> +/* APIs for RDMA/IB stack to publish when a device wants to
> + * participate in resource accounting
> + */

Please follow CodingStyle.  Wasn't this pointed out a couple times
already?

> +enum rdmacg_file_type {
> + RDMACG_RESOURCE_MAX,
> + RDMACG_RESOURCE_STAT,
> +};

Heh, usually not a good sign.  If you're using this to call into a
common function and then switch out on the file type, just switch out
at the method level and factor out common part into helpers.

> +/* resource tracker per resource for rdma cgroup */
> +struct rdmacg_resource {
> + int max;
> + int usage;
> +};

Align fields?

> +/**

The above indicates kerneldoc comments, which this isn't.

> + * resource pool object which represents, per cgroup, per device
> + * resources. There are multiple instance
> + * of this object per cgroup, therefore it cannot be embedded within
> + * rdma_cgroup structure. It is maintained as list.

Consistent paragraph fill?

> + */
> +struct rdmacg_resource_pool {
> + struct list_head cg_list;
> + struct list_head dev_list;
> +
> + struct rdmacg_device *device;
> + struct rdmacg_resource *resources;
> + struct rdma_cgroup *cg; /* owner cg used during device cleanup */
> +
> + int refcnt; /* count active user tasks of this pool */
> + int num_max_cnt;/* total number counts which are set to max */
> +};

Formatting.

> +static inline void set_resource_limit(struct rdmacg_resource_pool *rpool,
> +   int index, int new_max)

Is inline necessary?  Compiler can figure out these.

> +static struct rdmacg_resource_pool*
 ^
 space

> +find_cg_rpool_locked(struct rdma_cgroup *cg,
> +  struct rdmacg_device *device)
...
> +static void uncharge_cg_resource(struct rdma_cgroup *cg,
> +  struct rdmacg_device *device,
> +  int index, int num)
> +{
...
> + rpool->refcnt--;
> + if (rpool->refcnt == 0 && rpool->num_max_cnt == pool_info->table_len) {

If the caller charges 2 and then uncharges 1 two times, the refcnt
underflows?  Why not just track how many usages are zero?

...
> +void rdmacg_uncharge(struct rdma_cgroup *cg,
> +  struct rdmacg_device *device,
> +  int index, int num)
> +{
> + struct rdma_cgroup *p;
> +
> + for (p = cg; p; p = parent_rdmacg(p))
> + uncharge_cg_resource(p, device, index, num);
> +
> + css_put(&cg->css);
> +}
> +EXPORT_SYMBOL(rdmacg_uncharge);

So, I suppose the code is trying to save lock contention with
fine-grained locking; however, as the code is currently structured,
it's just increasing the number of lock ops and it'd be highly likely
to cheaper to simply use a single lock.  If you're working up the tree
grabbing lock at each level, per-node locking doesn't save you
anything.  Also, it introduces conditions where charges are spuriously
denied when there are racing requestors.  If scalability becomes an
issue, the right way to address is adding percpu front cache.

> +void rdmacg_query_limit(struct rdmacg_device *device,
> + int *limits)
> +{
> + struct rdma_cgroup *cg, *p;
> + struct rdmacg_resource_pool *rpool;
> + struct rdmacg_pool_info *pool_info;
> + int i;
> +
> + cg = task_rdmacg(current);
> + pool_info = &device->pool_info;

Nothing seems to prevent @cg from going away if this races with
@current being migrated to a different cgroup.  Have you run this with
lockdep and rcu debugging enabled?  This should have triggered a
warning.

...
> 
