Re: [PATCH v4 0/5] vfs: Use dlock list for SB's s_inodes list

2016-08-08 Thread Waiman Long

On 07/27/2016 11:12 AM, Christoph Lameter wrote:

> On Mon, 25 Jul 2016, Tejun Heo wrote:
>
>> I don't get it.  What's the harm of using percpu memory here?  Other
>> percpu data structures have remote access too.  They're to a lower
>> degree but I don't see a clear demarcation line and making additions
>> per-cpu seems to have significant benefits here.  If there's a better
>> way of splitting the list and locking, sure, let's try that but short
>> of that I don't see anything wrong with doing this per-cpu.
>
> For the regular global declarations we have separate areas for "SHARED"
> per cpu data like this. See DECLARE_PER_CPU_SHARED* and friends.
>
> Even if you align a percpu_alloc() there is still the possibility that
> other percpu variables defined after this will suffer from aliasing.
> The aligning causes space to be wasted for performance critical areas
> where you want to minimize cache line usage. The variables cannot be
> packed as densely as before. I think allocations like this need to be
> separate. Simply allocate an array of these structs using
>
> kcalloc(nr_cpu_ids, sizeof(my_struct), GFP_KERNEL)?
>
> Why bother with percpu_alloc() if it's not per cpu data?
>
> Well, if we do not care about that detail that much then let's continue
> going down this path.



I think that makes sense. The various lists don't really need to be in
the percpu area. Allocating the heads as a plain array may increase
contention a bit when multiple CPUs access list heads that happen to
share a cacheline. However, it can also speed up dlock-list iteration,
since fewer cachelines need to be traversed. I will change the code to
allocate the head array with kcalloc() instead of percpu_alloc().


Thanks for the suggestion.

Cheers,
Longman
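
A minimal sketch of what the kcalloc-based allocation could look like.
The names below are illustrative placeholders, not the actual dlock-list
API from the patch series; the point is simply that the per-CPU list
heads live in one ordinary kernel allocation rather than in the percpu
area.

#include <linux/cpumask.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

/*
 * Hypothetical head structure; one array slot per possible CPU.
 * Packed densely: adjacent heads may share a cacheline, which is the
 * small contention cost mentioned in the reply above, but iterating
 * over all heads then touches fewer cachelines.  Adding
 * ____cacheline_aligned_in_smp here would trade that back the other way.
 */
struct example_list_head {
	struct list_head list;
	spinlock_t lock;
};

struct example_lists {
	struct example_list_head *heads;	/* array of nr_cpu_ids heads */
};

static int example_lists_init(struct example_lists *el)
{
	int idx;

	/* One ordinary allocation instead of percpu memory. */
	el->heads = kcalloc(nr_cpu_ids, sizeof(*el->heads), GFP_KERNEL);
	if (!el->heads)
		return -ENOMEM;

	for (idx = 0; idx < nr_cpu_ids; idx++) {
		INIT_LIST_HEAD(&el->heads[idx].list);
		spin_lock_init(&el->heads[idx].lock);
	}
	return 0;
}

static void example_lists_free(struct example_lists *el)
{
	kfree(el->heads);
	el->heads = NULL;
}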


Re: [PATCH v4 0/5] vfs: Use dlock list for SB's s_inodes list

2016-07-27 Thread Christoph Lameter
On Mon, 25 Jul 2016, Tejun Heo wrote:

> I don't get it.  What's the harm of using percpu memory here?  Other
> percpu data structures have remote access too.  They're to a lower
> degree but I don't see a clear demarcation line and making additions
> per-cpu seems to have significant benefits here.  If there's a better
> way of splitting the list and locking, sure, let's try that but short
> of that I don't see anything wrong with doing this per-cpu.

For the regular global declarations we have separate areas for "SHARED"
per cpu data like this. See DECLARE_PER_CPU_SHARED* and friends.

Even if you align a percpu_alloc() there is still the possibility that
other percpu variables defined after this will suffer from aliasing.
The aligning causes space to be wasted for performance critical areas
where you want to minimize cache line usage. The variables cannot be
packed as densely as before. I think allocations like this need to be
separate. Simply allocate an array of these structs using

kcalloc(nr_cpu_ids, sizeof(my_struct), GFP_KERNEL)?

Why bother with percpu_alloc() if it's not per cpu data?

Well, if we do not care about that detail that much then let's continue
going down this path.
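
For reference, the "SHARED" static percpu area mentioned above is used
roughly like this (a sketch with made-up variable names).
DEFINE_PER_CPU_SHARED_ALIGNED places a variable in a cacheline-aligned
percpu section meant for data that other CPUs also touch, so it does not
disturb the dense packing of ordinary percpu variables.

#include <linux/list.h>
#include <linux/percpu-defs.h>
#include <linux/spinlock.h>

struct example_pcpu_head {
	struct list_head list;
	spinlock_t lock;
};

/* Ordinary percpu section: packed densely, fine for CPU-local data. */
static DEFINE_PER_CPU(struct example_pcpu_head, example_local_head);

/*
 * Cacheline-aligned "shared" percpu section: for per-CPU data that is
 * also read or written by remote CPUs, so its cacheline bouncing does
 * not hurt neighbouring percpu variables.
 */
static DEFINE_PER_CPU_SHARED_ALIGNED(struct example_pcpu_head, example_shared_head);

There is no comparable section split for dynamically allocated percpu
memory, which is part of the objection above to using percpu_alloc() for
data that is not really per-cpu.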



Re: [PATCH v4 0/5] vfs: Use dlock list for SB's s_inodes list

2016-07-25 Thread Tejun Heo
Hello, Christoph.

On Mon, Jul 25, 2016 at 08:48:25AM -0500, Christoph Lameter wrote:
> On Fri, 22 Jul 2016, Waiman Long wrote:
> 
> >  - Add a new patch to make the percpu head structure cacheline aligned
> >    to prevent cacheline contention from disrupting the performance
> >    of nearby percpu variables.
> 
> It would be better not to use the percpu allocation etc. for this.
> Given the frequency of off-node data access, I would say that the data
> structure does not qualify as per cpu data. You have per cpu data items,
> yes, but the structure is not used as per cpu data.

I don't get it.  What's the harm of using percpu memory here?  Other
percpu data structures have remote access too.  They're to a lower
percpu data structures have remote access too.  They're to a lower
degree but I don't see a clear demarcation line and making additions
per-cpu seems to have significant benefits here.  If there's a better
way of splitting the list and locking, sure, let's try that but short
of that I don't see anything wrong with doing this per-cpu.

Thanks.

-- 
tejun
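
As a rough illustration of the point about remote access (hypothetical
names, not the dlock-list API): with dynamically allocated percpu memory
each CPU normally touches only its own copy through this_cpu_ptr(), but
nothing prevents another CPU from reaching every copy with per_cpu_ptr(),
which is exactly what a full-list iteration would do.

#include <linux/list.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>

struct example_pcpu_head {
	struct list_head list;
	spinlock_t lock;
};

static struct example_pcpu_head __percpu *example_heads;

static int example_init(void)
{
	int cpu;

	example_heads = alloc_percpu(struct example_pcpu_head);
	if (!example_heads)
		return -ENOMEM;

	for_each_possible_cpu(cpu) {
		struct example_pcpu_head *h = per_cpu_ptr(example_heads, cpu);

		INIT_LIST_HEAD(&h->list);
		spin_lock_init(&h->lock);
	}
	return 0;
}

/* Fast path: a CPU adds a node to (normally) its own list. */
static void example_add(struct list_head *node)
{
	struct example_pcpu_head *h = this_cpu_ptr(example_heads);

	/*
	 * Migration between this_cpu_ptr() and the lock is harmless in a
	 * dlock-list-style design: the node just lands on some CPU's list.
	 */
	spin_lock(&h->lock);
	list_add(node, &h->list);
	spin_unlock(&h->lock);
}

/* Remote access: one CPU iterates every CPU's list. */
static void example_for_each(void (*fn)(struct list_head *))
{
	int cpu;

	for_each_possible_cpu(cpu) {
		struct example_pcpu_head *h = per_cpu_ptr(example_heads, cpu);
		struct list_head *pos;

		spin_lock(&h->lock);
		list_for_each(pos, &h->list)
			fn(pos);
		spin_unlock(&h->lock);
	}
}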


Re: [PATCH v4 0/5] vfs: Use dlock list for SB's s_inodes list

2016-07-25 Thread Christoph Lameter
On Fri, 22 Jul 2016, Waiman Long wrote:

>  - Add a new patch to make the percpu head structure cacheline aligned
>    to prevent cacheline contention from disrupting the performance
>    of nearby percpu variables.

It would be better not to use the percpu allocation etc. for this.
Given the frequency of off-node data access, I would say that the data
structure does not qualify as per cpu data. You have per cpu data items,
yes, but the structure is not used as per cpu data.
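
The v4 change being objected to looks roughly like the sketch below
(illustrative names; the actual patch may differ): the per-CPU head
structure is padded to a cacheline and the percpu allocation is requested
at cacheline alignment, so lock contention on a head cannot bounce a
cacheline shared with unrelated percpu variables.

#include <linux/cache.h>
#include <linux/list.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>

/*
 * Padded to a full cacheline so that contention on ->lock does not
 * bounce a line shared with neighbouring percpu data.
 */
struct example_pcpu_head {
	struct list_head list;
	spinlock_t lock;
} ____cacheline_aligned_in_smp;

static struct example_pcpu_head __percpu *example_heads;

static int example_init(void)
{
	/* Ask the percpu allocator for cacheline alignment explicitly. */
	example_heads = __alloc_percpu(sizeof(struct example_pcpu_head),
				       L1_CACHE_BYTES);
	if (!example_heads)
		return -ENOMEM;
	/* Per-CPU list/lock initialization omitted for brevity. */
	return 0;
}

The padding and explicit alignment are the cost at issue here: an aligned
percpu allocation cannot be packed as densely into the percpu area, which
is the space and aliasing concern raised in the replies above.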

