Re: [PATCH 0/6] Enable parallel page migration

2017-03-10 Thread Michal Hocko
On Fri 10-03-17 14:07:16, Mel Gorman wrote:
> On Thu, Mar 09, 2017 at 05:46:16PM -0600, Zi Yan wrote:
[...]
> > I understand your concern about the CPU utilization impact. I think
> > checking CPU utilization and only using idle CPUs could potentially
> > avoid this problem.
> > 
> 
> That will be costly to detect actually. It would require poking into the
> scheduler core and incurring a number of cache misses for a race-prone
> operation that may not succeed. Even if you do it, it'll still be
> brought up that the serialised case should be optimised first.

Do not forget that seeing idle CPUs is not a sufficient criterion to use
them for parallel migration. There might be other policies, invisible
from the MM code, that keep them idle (power saving and who knows what
else). Developing a reasonable strategy for spreading the load across
different CPUs is really hard, much harder than you can imagine, I
suspect (just look at how hard it was, and how long it took, to get to
reasonable scheduler-driven frequency scaling/power governors).
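
To make the race concrete: even the simplest probe of the kind being
discussed is stale by the time it returns, and it says nothing about why
a CPU is idle. A sketch (pick_idle_copy_cpu is a made-up helper;
idle_cpu(), for_each_cpu() and cpumask_of_node() are the real kernel
primitives such a check would have to poke at):

	/* Racy by construction: a CPU reported idle here may be running
	 * something else before the caller can queue work on it, and
	 * "idle" ignores power-saving and other policy decisions. */
	static int pick_idle_copy_cpu(int nid)
	{
		int cpu;

		for_each_cpu(cpu, cpumask_of_node(nid)) {
			if (idle_cpu(cpu))
				return cpu;
		}
		return -1;
	}
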
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 0/6] Enable parallel page migration

2017-03-10 Thread Mel Gorman
On Thu, Mar 09, 2017 at 05:46:16PM -0600, Zi Yan wrote:
> Hi Mel,
> 
> Thanks for pointing out the problems in this patchset.
> 
> It was my intern project at NVIDIA last summer. I only used
> micro-benchmarks to demonstrate the big memory bandwidth utilization gap
> between base page migration and THP migration, and between serialized
> and parallel page migration.
> 

The measurement itself is not a problem. It clearly shows why you were
doing it and indicates that it's possible.

> 
> > This big increase in BW utilization is the motivation for pushing this
> > patchset.
> 

As before, I have no problem with the motivation; my problem is with the
approach, and in particular that the serialised case was not optimised first.

> > 
> > So the key potential issue here in my mind is that THP migration is too slow
> > in some cases. What I object to is improving that using a high priority
> > workqueue that potentially starves other CPUs and pollutes their cache
> > which is generally very expensive.
> 
> I might not completely agree with this. Using a high priority workqueue
> can guarantee page migration work is done ASAP.

Yes, but at the cost of stalling other operations that are happening at
the same time. The series assumes that the migration is definitely the
most important operation going on at the moment.

> Otherwise, if the data copy threads have to wait, we completely lose
> the speedup brought by parallel page migration.
> 

And conversely, if important threads were running on the other CPUs at
the time the migration started then they might be equally unhappy.

> I understand your concern about the CPU utilization impact. I think
> checking CPU utilization and only using idle CPUs could potentially
> avoid this problem.
> 

That will be costly to detect actually. It would require poking into the
scheduler core and incurring a number of cache misses for a race-prone
operation that may not succeed. Even if you do it, it'll still be
brought up that the serialised case should be optimised first.

> > The function takes a huge page, splits it into PAGE_SIZE chunks,
> > kmap_atomics the source and destination for each PAGE_SIZE chunk and
> > copies it. The parallelised version does one kmap and copies it in
> > chunks assuming the THP is fully mapped and accessible. Fundamentally,
> > this is broken in the generic sense as the kmap is not guaranteed to
> > map the whole page as necessary, but it happens to work on !highmem
> > systems. What is more important to note is the multiple preempt and
> > pagefault enables and disables on a per-page basis, which happen 512
> > times (for THP on x86-64 at least), all of which are expensive
> > operations depending on the kernel config, and I suspect that the
> > parallelisation is actually masking that stupid overhead.
> 
> You are right on kmap, I think making this patchset depend on !HIGHMEM
> can avoid the problem. It might not make sense to kmap potentially 512
> base pages to migrate a THP in a system with highmem.
> 

One concern I have is that the series benefitted the most by simply batching
all those operations even if it was not intended.

> > At the very least, I would have expected an initial attempt of one
> > patch that optimised for !highmem systems to ignore kmap, simply
> > disable preempt (if that is even necessary, I didn't check) and copy a
> > pinned physical->physical page as a single copy without looping on a
> > PAGE_SIZE basis and see how much that gained. Do it initially for THP
> > only and worry about gigantic pages when or if that is a problem.
> 
> I can try this out to show how much improvement we can obtain over the
> existing THP migration, whose numbers are shown in the data above.
> 

It would be important to do so. There would need to be absolute proof
that parallelisation is required, and even then the concerns about
interfering with workloads on other CPUs are not going to be easy to
handle.

> > That would be patch 1 of a series.  Maybe that'll be enough, maybe not but
> > I feel it's important to optimise the serialised case as much as possible
> > before considering parallelisation to highlight and justify why it's
> > necessary[1]. If nothing else, what if two CPUs both parallelise a migration
> > at the same time and end up preempting each other? Between that and the
> > workqueue setup, it's potentially much slower than an optimised serial copy.
> > 
> > It would be tempting to experiment but the test case was not even included
> > with the series (maybe it's somewhere else)[2]. While it's obvious how
> > such a test case could be constructed, it feels unnecessary to construct
> > it when it should be in the changelog.
> 
> Do you mean performing multiple parallel page migrations at the same
> time and showing the total page migration time?

I mean that the test case that was used to generate the bandwidth
utilisation figures should be included.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 0/6] Enable parallel page migration

2017-03-10 Thread Michal Hocko
On Thu 09-03-17 15:09:04, Mel Gorman wrote:
> On Wed, Mar 08, 2017 at 09:34:27PM +0530, Anshuman Khandual wrote:
> > > Any comments, suggestions are welcome.
> > 
> > Hello Vlastimil/Michal/Minchan/Mel/Dave,
> > 
> > Apart from the comments from Naoya on a different thread posted by Zi
> > Yan, I did not get any more review comments on this series. Could you
> > please kindly have a look at the overall design and its benefits from a
> > page migration performance point of view and let me know your views.
> > Thank you.
> > 
> 
> I didn't look into the patches in detail except to get a general feel
> for how it works and I'm not convinced that it's a good idea at all.
> 
> I accept that memory bandwidth utilisation may be higher as a result but
> consider the impact. THP migrations are relatively rare and when they
> occur, it's in the context of a single thread. To parallelise the copy,
> an allocation, kmap and workqueue invocation are required. There may be a
> long delay before the workqueue item can start which may exceed the time
> to do a single copy if the CPUs on a node are saturated. Furthermore, a
> single thread can preempt operations of other unrelated threads and incur
> CPU cache pollution and future misses on unrelated CPUs. It's compounded by
> the fact that a high priority system workqueue is used to do the operation,
> one that is used for CPU hotplug operations and rolling back when a netdevice
> fails to be registered. It treats a hugepage copy as an essential operation
> that can preempt all other work, which is very questionable.
> 
> The series leader has no details on a workload that is bottlenecked by
> THP migrations and, even if one exists, the primary question should be
> *why* THP migrations are so frequent, and to alleviate that instead of
> preempting multiple CPUs to do the work.

FWIW I very much agree, here and with the follow-up reply. Making the
migration itself parallel is a hard task. You should start simple,
optimize the current code first, and accompany each step with numbers.
Parallel migration should be the very last step - if it is needed at all,
of course. I am quite skeptical that reasonable parallel load balancing
is achievable without a large maintenance cost and/or unpredictable
behavior.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 0/6] Enable parallel page migration

2017-03-09 Thread Zi Yan
Hi Mel,

Thanks for pointing out the problems in this patchset.

It was my intern project at NVIDIA last summer. I only used
micro-benchmarks to demonstrate the big memory bandwidth utilization gap
between base page migration and THP migration, and between serialized
and parallel page migration.

Here are cross-socket serialized page migration results from calling
move_pages() syscall:

On x86_64, an Intel two-socket E5-2640v3 box:
a single 4KB base page migration takes 62.47 us, using 0.06 GB/s BW;
a single 2MB THP migration takes 658.54 us, using 2.97 GB/s BW;
512 4KB base page migrations take 1987.38 us, using 0.98 GB/s BW.

On ppc64, a two-socket Power8 box:
a single 64KB base page migration takes 49.3 us, using 1.24 GB/s BW;
a single 16MB THP migration takes 2202.17 us, using 7.10 GB/s BW;
256 64KB base page migrations take 2543.65 us, using 6.14 GB/s BW.

THP migration is not slow at all when compared to a group of equivalent
base page migrations.
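
For reference, a minimal sketch of the kind of move_pages() measurement
behind these numbers (an assumed reconstruction, not the actual NVIDIA
test case; link with -lnuma, and the address passed is simply the start
of the THP-backed buffer):

	#include <numaif.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <time.h>

	#define THP_SIZE (2UL << 20)	/* 2MB THP on x86_64 */

	int main(void)
	{
		void *buf = aligned_alloc(THP_SIZE, THP_SIZE);
		int node = 1, status = -1;	/* destination socket */
		struct timespec t0, t1;

		if (!buf)
			return 1;
		madvise(buf, THP_SIZE, MADV_HUGEPAGE);
		memset(buf, 1, THP_SIZE);	/* fault in on the local node */

		clock_gettime(CLOCK_MONOTONIC, &t0);
		if (move_pages(0, 1, &buf, &node, &status, MPOL_MF_MOVE))
			perror("move_pages");
		clock_gettime(CLOCK_MONOTONIC, &t1);

		printf("moved 1 THP in %.2f us, status %d\n",
		       (t1.tv_sec - t0.tv_sec) * 1e6 +
		       (t1.tv_nsec - t0.tv_nsec) / 1e3, status);
		return 0;
	}
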

For 1-thread vs 8-thread THP migration:
On x86_64:
1-thread 2MB THP migration takes 658.54 us, using 2.97 GB/s BW;
8-thread 2MB THP migration takes 227.76 us, using 8.58 GB/s BW.

On ppc64:
1-thread 16MB THP migration takes 2202.17 us, using 7.10 GB/s BW;
8-thread 16MB THP migration takes 1223.87 us, using 12.77 GB/s BW.

This big increase in BW utilization is the motivation for pushing this
patchset.

> 
> So the key potential issue here in my mind is that THP migration is too slow
> in some cases. What I object to is improving that using a high priority
> workqueue that potentially starves other CPUs and pollutes their cache
> which is generally very expensive.

I might not completely agree with this. Using a high priority workqueue
can guarantee page migration work is done ASAP. Otherwise, if the data
copy threads have to wait, we completely lose the speedup brought by
parallel page migration.

I understand your concern about the CPU utilization impact. I think
checking CPU utilization and only using idle CPUs could potentially
avoid this problem.

> 
> Let's look at the core of what copy_huge_page does in mm/migrate.c, which
> is the function that gets parallelised by the series in question. For
> a !HIGHMEM system, it's woefully inefficient. Historically, it was an
> implementation that would work generically, which was fine but maybe not
> for future systems. It was also fine back when hugetlbfs was the only huge
> page implementation and COW operations were incredibly rare due to the
> risk that they could terminate the process with prejudice.
> 
> The function takes a huge page, splits it into PAGE_SIZE chunks,
> kmap_atomics the source and destination for each PAGE_SIZE chunk and
> copies it. The parallelised version does one kmap and copies it in
> chunks assuming the THP is fully mapped and accessible. Fundamentally,
> this is broken in the generic sense as the kmap is not guaranteed to
> map the whole page as necessary, but it happens to work on !highmem
> systems. What is more important to note is the multiple preempt and
> pagefault enables and disables on a per-page basis, which happen 512
> times (for THP on x86-64 at least), all of which are expensive
> operations depending on the kernel config, and I suspect that the
> parallelisation is actually masking that stupid overhead.

You are right about kmap; I think making this patchset depend on !HIGHMEM
can avoid the problem. It might not make sense to kmap potentially 512
base pages to migrate a THP on a system with highmem.
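
A sketch of what that dependency could look like at the call site
(CONFIG_HIGHMEM is the real config symbol; the parallel-path helper
name is made up):

	#ifdef CONFIG_HIGHMEM
		copy_huge_page(dst, src);		/* existing kmap_atomic loop */
	#else
		copy_huge_page_mthread(dst, src);	/* flat-mapped parallel copy */
	#endif
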

> 
> At the very least, I would have expected an initial attempt of one patch that
> optimised for !highmem systems to ignore kmap, simply disable preempt (if
> that is even necessary, I didn't check) and copy a pinned physical->physical
> page as a single copy without looping on a PAGE_SIZE basis and see how
> much that gained. Do it initially for THP only and worry about gigantic
> pages when or if that is a problem.

I can try this out to show how much improvement we can obtain over the
existing THP migration, whose numbers are shown in the data above.
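
As a concrete reading of the suggestion above, the serialised !highmem
fast path might be as small as this (a sketch under those assumptions,
not the eventual patch; copy_huge_page_flat is a made-up name, and both
compound pages are assumed pinned for the duration of the copy):

	static void copy_huge_page_flat(struct page *dst, struct page *src,
					int nr_pages)
	{
		/* On !HIGHMEM, page_address() is valid for the whole
		 * compound page, so one bulk copy replaces 512
		 * kmap_atomic/kunmap_atomic round trips and their
		 * preempt/pagefault toggles. */
		memcpy(page_address(dst), page_address(src),
		       (size_t)nr_pages * PAGE_SIZE);
	}
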

> 
> That would be patch 1 of a series.  Maybe that'll be enough, maybe not but
> I feel it's important to optimise the serialised case as much as possible
> before considering parallelisation to highlight and justify why it's
> necessary[1]. If nothing else, what if two CPUs both parallelise a migration
> at the same time and end up preempting each other? Between that and the
> workqueue setup, it's potentially much slower than an optimised serial copy.
> 
> It would be tempting to experiment but the test case was not even included
> with the series (maybe it's somewhere else)[2]. While it's obvious how
> such a test case could be constructed, it feels unnecessary to construct
> it when it should be in the changelog.

Do you mean performing multiple parallel page migrations at the same
time and showing the total page migration time?


-- 
Best Regards,
Yan Zi


Re: [PATCH 0/6] Enable parallel page migration

2017-03-09 Thread Mel Gorman
On Thu, Mar 09, 2017 at 11:38:00AM -0600, David Nellans wrote:
> On 03/09/2017 09:09 AM, Mel Gorman wrote:
> > I didn't look into the patches in detail except to get a general feel
> > for how it works and I'm not convinced that it's a good idea at all.
> >
> > I accept that memory bandwidth utilisation may be higher as a result but
> > consider the impact. THP migrations are relatively rare and when they
> > occur, it's in the context of a single thread. To parallelise the copy,
> > an allocation, kmap and workqueue invocation are required. There may be a
> > long delay before the workqueue item can start which may exceed the time
> > to do a single copy if the CPUs on a node are saturated. Furthermore, a
> > single thread can preempt operations of other unrelated threads and incur
> > CPU cache pollution and future misses on unrelated CPUs. It's compounded
> > by the fact that a high priority system workqueue is used to do the
> > operation, one that is used for CPU hotplug operations and rolling back
> > when a netdevice fails to be registered. It treats a hugepage copy as an
> > essential operation that can preempt all other work, which is very
> > questionable.
> >
> > The series leader has no details on a workload that is bottlenecked by
> > THP migrations and, even if one exists, the primary question should be
> > *why* THP migrations are so frequent, and to alleviate that instead of
> > preempting multiple CPUs to do the work.
> >
> >
>
> Mel - I sense ongoing frustration around some of the THP migration,
> migration acceleration, CDM, and other patches.  Here is a 10k-foot
> description that I hope adds to what John & Anshuman have said in other
> threads.
> 

Hi David,

I recognise the motivation for some of these patches but disagree on the
mechanisms used; more on this later.

> Vendors are currently providing systems that have both traditional
> DDR3/4 memory (let's call it 100GB/s) and high bandwidth memory (HBM)
> (let's call it 1TB/s) within a single system.  GPUs have been doing this
> with HBM on the GPU and DDR on the CPU complex, but they've been
> attached via PCIe and thus HBM has been GPU private memory. 

I completely understand, although I'd point out that HBM is slightly
different in that it could be expressed in terms of a hierarchical node
system whereby some nodes migrate to each other -- from faster to slower
by a "migrate on LRU reclaim" and from slower to faster with automatic
NUMA balancing using sampling. However, HBM is extremely specific and
dealing with that is not necessarily compatible with devices that are not
coherent.

> 

Again, I understand the motivation and have no further comment to make.
In the interest of trying to be helpful, I'll propose an alternative to
this series and expand upon why I think it's problematic.

> the HBM node from the DDR node. The expectation is that on such systems
> either the user, a daemon, or kernel/autonuma is going to be migrating
> (TH)pages between the NUMA zones to optimize overall system
> bandwidth/throughput.  Because of the 10x discrepancy in memory
> bandwidth, despite the best paging policies to optimize for page
> locality in the HBM nodes, pages will often still be moving at a high
> rate between zones.  This differs from a traditional NUMA system where
> moving a page from one 100GB/s node to the other 100GB/s node has
> dubious value, like you say.
> 
> To your specific question - what workloads benefit from this improved
> migration throughput and why THPs? 

So the key potential issue here in my mind is that THP migration is too slow
in some cases. What I object to is improving that using a high priority
workqueue that potentially starves other CPUs and pollutes their cache,
which is generally very expensive.

Let's look at the core of what copy_huge_page does in mm/migrate.c, which
is the function that gets parallelised by the series in question. For
a !HIGHMEM system, it's woefully inefficient. Historically, it was an
implementation that would work generically, which was fine but maybe not
for future systems. It was also fine back when hugetlbfs was the only huge
page implementation and COW operations were incredibly rare due to the
risk that they could terminate the process with prejudice.

The function takes a huge page, splits it into PAGE_SIZE chunks,
kmap_atomics the source and destination for each PAGE_SIZE chunk and
copies it. The parallelised version does one kmap and copies it in
chunks assuming the THP is fully mapped and accessible. Fundamentally,
this is broken in the generic sense as the kmap is not guaranteed to
map the whole page as necessary, but it happens to work on !highmem
systems. What is more important to note is the multiple preempt and
pagefault enables and disables on a per-page basis, which happen 512
times (for THP on x86-64 at least), all of which are expensive
operations depending on the kernel config, and I suspect that the
parallelisation is actually masking that stupid overhead.
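
For reference, the loop in question has roughly this shape (condensed
for illustration from the mm/migrate.c of this era, not a verbatim
copy):

	for (i = 0; i < nr_pages; i++) {
		cond_resched();
		/* copy_highpage() kmap_atomics the source and the
		 * destination, copies PAGE_SIZE bytes and kunmap_atomics
		 * both, so every iteration toggles preempt and pagefault
		 * state on both mappings. */
		copy_highpage(dst + i, src + i);
	}
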


Re: [PATCH 0/6] Enable parallel page migration

2017-03-09 Thread David Nellans
On 03/09/2017 09:09 AM, Mel Gorman wrote:
> I didn't look into the patches in detail except to get a general feel
> for how it works and I'm not convinced that it's a good idea at all.
>
> I accept that memory bandwidth utilisation may be higher as a result but
> consider the impact. THP migrations are relatively rare and when they
> occur, it's in the context of a single thread. To parallelise the copy,
> an allocation, kmap and workqueue invocation are required. There may be a
> long delay before the workqueue item can start which may exceed the time
> to do a single copy if the CPUs on a node are saturated. Furthermore, a
> single thread can preempt operations of other unrelated threads and incur
> CPU cache pollution and future misses on unrelated CPUs. It's compounded by
> the fact that a high priority system workqueue is used to do the operation,
> one that is used for CPU hotplug operations and rolling back when a netdevice
> fails to be registered. It treats a hugepage copy as an essential operation
> that can preempt all other work, which is very questionable.
>
> The series leader has no details on a workload that is bottlenecked by
> THP migrations and, even if one exists, the primary question should be
> *why* THP migrations are so frequent, and to alleviate that instead of
> preempting multiple CPUs to do the work.
>
>
Mel - I sense ongoing frustration around some of the THP migration,
migration acceleration, CDM, and other patches.  Here is a 10k-foot
description that I hope adds to what John & Anshuman have said in other
threads.

Vendors are currently providing systems that have both traditional
DDR3/4 memory (let's call it 100GB/s) and high bandwidth memory (HBM)
(let's call it 1TB/s) within a single system.  GPUs have been doing this
with HBM on the GPU and DDR on the CPU complex, but they've been
attached via PCIe and thus HBM has been GPU private memory.  The GPU has
managed this memory by effectively mlocking pages on the CPU and copying
the data into the GPU while it's being computed on, and then copying it
back to the CPU when the CPU faults trying to touch it or the GPU is
done.  Because HBM is limited in capacity (10's of GB max) versus DDR3
(100's+ GB), runtimes like Nvidia's unified memory dynamically page
memory in and out of the GPU to get the benefits of high bandwidth,
while still allowing access to the total footprint of system memory.
It's effectively page-protection-based CPU/GPU memory coherence.

PCIe attached GPUs+HBM are the bulk of what's out there today and will
continue to be, so there are efforts to try and improve how GPUs (and
other devices in the same PCIe boat) interact with -mm given the
limitations of PCIe (see HMM).

Jumping to what is essentially a different platform - there will be
systems where that same GPU HBM memory is now part of the OS controlled
memory (aka a NUMA node) because these systems have a cache coherent link
attaching them (could be NVLINK, QPI, CAPI, HT, or something else).  This
HBM zone might have CPU cores in it, it might have GPU cores in it, or
an FPGA; it's not necessarily GPU specific.  NVIDIA has talked about
systems that look like this, as has Intel (KNL with flat memory), and
there are likely others. Systems like this can be thought of (just for
example) as a 2-NUMA-node box where you've got 100GB/s of bandwidth on
one node, 1TB/s on the other, connected via some cache coherent link.
That link is probably on the order of 100GB/s max too (maybe lower, but
certainly not 1TB/s yet).

Cores (CPU/GPU/FPGA) can access either NUMA node via the coherent link
(just like a multi-socket CPU box) but you also want to be able to
optimize page placement so that hot pages physically get migrated into
the HBM node from the DDR node. The expectation is that on such systems
either the user, a daemon, or kernel/autonuma is going to be migrating
(TH)pages between the NUMA zones to optimize overall system
bandwidth/throughput.  Because of the 10x discrepancy in memory
bandwidth, despite the best paging policies to optimize for page
locality in the HBM nodes, pages will often still be moving at a high
rate between zones.  This differs from a traditional NUMA system where
moving a page from one 100GB/s node to the other 100GB/s node has
dubious value, like you say.

To your specific question - what workloads benefit from this improved
migration throughput and why THPs?  We have seen that there can be a
1.7x improvement in GPU perf by improving NVLink page migration
bandwidth from 6GB/s->32.5GB/s.  In comparison, 4KB page migration on
x86 over QPI (today) gets < 100MB/s of throughput even though QPI and
NVLink can provide 32GB/s+.  We couldn't cripple the link enough to get
down to ~100MB/s, but obviously using small base page sizes at 100MB/s
of migration throughput would kill performance.  So good THP
functionality + good migration throughput appear critical to us (and
maybe KNL too?).

https://devblogs.nvidia.com/parallelforall/beyond-gpu-memory-limits-unified-memory-pascal/


Re: [PATCH 0/6] Enable parallel page migration

2017-03-09 Thread Mel Gorman
On Wed, Mar 08, 2017 at 09:34:27PM +0530, Anshuman Khandual wrote:
> > Any comments, suggestions are welcome.
> 
> Hello Vlastimil/Michal/Minchan/Mel/Dave,
> 
> Apart from the comments from Naoya on a different thread posted by Zi
> Yan, I did not get any more review comments on this series. Could you
> please kindly have a look at the overall design and its benefits from a
> page migration performance point of view and let me know your views.
> Thank you.
> 

I didn't look into the patches in detail except to get a general feel
for how it works and I'm not convinced that it's a good idea at all.

I accept that memory bandwidth utilisation may be higher as a result but
consider the impact. THP migrations are relatively rare and when they
occur, it's in the context of a single thread. To parallelise the copy,
an allocation, kmap and workqueue invocation are required. There may be a
long delay before the workqueue item can start which may exceed the time
to do a single copy if the CPUs on a node are saturated. Furthermore, a
single thread can preempt operations of other unrelated threads and incur
CPU cache pollution and future misses on unrelated CPUs. It's compounded by
the fact that a high priority system workqueue is used to do the operation,
one that is used for CPU hotplug operations and rolling back when a netdevice
fails to be registered. It treats a hugepage copy as an essential operation
that can preempt all other work, which is very questionable.
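
The pattern in question is roughly the following (illustrative only;
system_highpri_wq is the real shared high-priority workqueue, while the
work item and callback names are made up):

	/* each copy chunk jumps the queue ahead of unrelated system work */
	INIT_WORK(&item->work, copy_page_chunk_worker);
	queue_work_on(cpu, system_highpri_wq, &item->work);
	...
	flush_work(&item->work);	/* migration then stalls until the
					   high-priority copy completes */
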

The series leader has no details on a workload that is bottlenecked by
THP migrations and, even if one exists, the primary question should be
*why* THP migrations are so frequent, and to alleviate that instead of
preempting multiple CPUs to do the work.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 0/6] Enable parallel page migration

2017-03-08 Thread Anshuman Khandual
On 02/17/2017 04:54 PM, Anshuman Khandual wrote:
>   This patch series is based on the work posted by Zi Yan back in
> November 2016 (https://lkml.org/lkml/2016/11/22/457) but includes some
> amount of clean-up and re-organization. This series depends on the THP
> migration optimization patch series posted by Naoya Horiguchi on 8th
> November 2016 (https://lwn.net/Articles/705879/). Though Zi Yan has
> recently reposted V3 of the THP migration patch series
> (https://lwn.net/Articles/713667/), this series is yet to be rebased.
> 
>   The primary motivation behind this patch series is to achieve higher
> memory migration bandwidth whenever possible by using a multi threaded
> instead of a single threaded copy. All the experiments were done on a
> two-socket x86 system (Intel(R) Xeon(R) CPU E5-2650). All the experiments
> here have the same allocation size of 4K * 100000 (which did not split
> evenly into 2MB huge pages). Here are the results.
> 
> Vanilla:
> 
> Moved 100000 normal pages in 247.00 msecs, 1.544412 GB/s
> Moved 100000 normal pages in 238.00 msecs, 1.602814 GB/s
> Moved 195 huge pages in 252.00 msecs, 1.513769 GB/s
> Moved 195 huge pages in 257.00 msecs, 1.484318 GB/s
> 
> THP migration improvements:
> 
> Moved 100000 normal pages in 302.00 msecs, 1.263145 GB/s
> Moved 100000 normal pages in 262.00 msecs, 1.455991 GB/s
> Moved 195 huge pages in 120.00 msecs, 3.178914 GB/s
> Moved 195 huge pages in 129.00 msecs, 2.957130 GB/s
> 
> THP migration improvements + Multi threaded page copy:
> 
> Moved 100000 normal pages in 1589.00 msecs, 0.240069 GB/s **
> Moved 100000 normal pages in 1932.00 msecs, 0.197448 GB/s **
> Moved 195 huge pages in 54.00 msecs, 7.064254 GB/s ***
> Moved 195 huge pages in 86.00 msecs, 4.435694 GB/s ***
> 
> 
> **  Using a multi threaded copy can be detrimental to performance if
>   used for regular pages, which are way too small. But the framework
>   provides the means to use it if some kernel/driver caller or user
>   application wants to.
> 
> *** These applications used the new MPOL_MF_MOVE_MT flag while calling
>   system calls like mbind() and move_pages().
> 
> On POWER8 the improvements are similar when tested with a draft patch
> which enables migration at the PMD level. We are not putting out the
> results here as the kernel is not stable with that draft patch and
> crashes sometimes. We are working on enabling PMD level migration on
> POWER8 and will test this series out thoroughly when it's ready.
> 
> Patch Series Description::
> 
> Patch 1: Add a new parameter to migrate_page_copy and copy_huge_page so
>          that they can differentiate between when to use the single
>          threaded version (MIGRATE_ST) or the multi threaded version
>          (MIGRATE_MT).
> 
> Patch 2: Make migrate_mode types non-exclusive.
> 
> Patch 3: Add the copy_pages_mthread function which does the actual multi
>          threaded copy. This involves splitting the copy work into
>          chunks, selecting threads and submitting copy jobs in the
>          work queues.
> 
> Patch 4: Add the new migrate mode MIGRATE_MT to be used by higher level
>          migration functions.
> 
> Patch 5: Add the new migration flag MPOL_MF_MOVE_MT for migration system
>          calls to be used from user space (a usage sketch follows the
>          quoted cover letter below).
> 
> Patch 6: Define the global mt_page_copy tunable which turns on the multi
>          threaded page copy unconditionally for all migrations on the
>          system.
> 
> Outstanding Issues::
> 
> Issue 1: The usefulness of the global multi threaded copy tunable, i.e.
>          vm.mt_page_copy. It makes sense and helps in validating the
>          framework. Should this be moved to debugfs instead?
> 
> Issue 2: We chose nr_copythreads = 8 because the maximum number of
>          threads on a node is 8 on any architecture we know of (POWER8,
>          unless I am missing an arch with an equal or greater number of
>          threads per node). It just denotes the maximum number of
>          threads; the actual number is adjusted based on the
>          cpumask_weight value of the destination node. Can we do
>          better? Suggestions?
> 
> Issue 3: Multi threaded page migration works best with threads allocated
>          on different physical cores, not all in the same hyper-threaded
>          core. Jobs submitted to work queues consume scheduler slots on
>          the given thread to execute the copy. This can interfere with
>          scheduling and affect already running tasks on the system.
>          Should we be looking into arch topology information and
>          scheduler CPU idle details to decide which threads to use
>          before going for a multi threaded copy? Abort the multi
>          threaded copy and fall back to a regular copy when the
>          parameters are not good?
> 
> Any comments, suggestions are welcome.
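
For reference, the opt-in from patch 5 would look like this from user
space (a sketch of the proposed interface; MPOL_MF_MOVE_MT is defined
by this series, not by an upstream kernel):

	/* ask for the multi threaded copy path when migrating pages */
	long rc = move_pages(0 /* self */, count, pages, nodes, status,
			     MPOL_MF_MOVE | MPOL_MF_MOVE_MT);
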

Hello Vlastimil/Michal/Minchan/Mel/Dave,

Apart from the comments from Naoya on a different thread posted by Zi
Yan, I did not get any more review comments on this series. Could you
please kindly have a look at the overall design and its benefits from a
page migration performance point of view and let me know your views.
Thank you.


Re: [PATCH 0/6] Enable parallel page migration

2017-02-22 Thread Balbir Singh


On 22/02/17 16:55, Anshuman Khandual wrote:
> On 02/22/2017 10:34 AM, Balbir Singh wrote:
>> On Fri, Feb 17, 2017 at 04:54:47PM +0530, Anshuman Khandual wrote:
>>> This patch series is based on the work posted by Zi Yan back in
>>> November 2016 (https://lkml.org/lkml/2016/11/22/457) but includes some
>>> amount of clean up and re-organization. This series depends on the THP
>>> migration optimization patch series posted by Naoya Horiguchi on 8th
>>> November 2016 (https://lwn.net/Articles/705879/). Though Zi Yan has
>>> recently reposted V3 of the THP migration patch series
>>> (https://lwn.net/Articles/713667/), this series is yet to be rebased.
>>>
>>> Primary motivation behind this patch series is to achieve higher
>>> bandwidth of memory migration whenever possible by using a multi
>>> threaded instead of a single threaded copy. All the experiments were
>>> done on a two socket X86 system (Intel(R) Xeon(R) CPU E5-2650). All
>>> the experiments here have the same allocation size, 4K * 100000 (which
>>> did not split evenly for the 2MB huge pages). Here are the results.
>>>
>>> Vanilla:
>>>
>>> Moved 100000 normal pages in 247.00 msecs 1.544412 GBs
>>> Moved 100000 normal pages in 238.00 msecs 1.602814 GBs
>>> Moved 195 huge pages in 252.00 msecs 1.513769 GBs
>>> Moved 195 huge pages in 257.00 msecs 1.484318 GBs
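
(Sanity check on the figures: 100000 4K pages is 409600000 bytes, i.e.
0.38147 GiB, and 0.38147 / 0.247 s = 1.5444, matching the first line
above; so "GBs" in this output reads as GiB/s.)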
>>>
>>> THP migration improvements:
>>>
>>> Moved 100000 normal pages in 302.00 msecs 1.263145 GBs
>>
>> Is there a decrease here for normal pages?
> 
> Yeah.
> 
>>
>>> Moved 100000 normal pages in 262.00 msecs 1.455991 GBs
>>> Moved 195 huge pages in 120.00 msecs 3.178914 GBs
>>> Moved 195 huge pages in 129.00 msecs 2.957130 GBs
>>>
>>> THP migration improvements + Multi threaded page copy:
>>>
>>> Moved 100000 normal pages in 1589.00 msecs 0.240069 GBs **
>>
>> Ditto?
> 
> Yeah, I have already mentioned this after the data in the cover
> letter. This new flag is controlled from user space while invoking
> the system calls. Users should be careful to use it in scenarios
> where it is useful and avoid it in cases where it hurts.

Fair enough. I wonder if _MT should be disabled for normal pages
and allowed only for THP migration. I think it might be worth
evaluating the overheads.
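
If we went that way, the gate could be as simple as the sketch below
(illustrative only; MIGRATE_MT is the mode bit from Patch 4, and the
page-size test is the entire policy):

/*
 * Sketch: take the multi threaded path only for huge pages, since
 * the numbers above show base 4K pages regress badly.
 */
static bool use_mt_copy(struct page *page, enum migrate_mode mode)
{
	if (!(mode & MIGRATE_MT))
		return false;

	return PageHuge(page) || PageTransHuge(page);
}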

> 
>>
>>> Moved 100000 normal pages in 1932.00 msecs 0.197448 GBs **
>>> Moved 195 huge pages in 54.00 msecs 7.064254 GBs ***
>>> Moved 195 huge pages in 86.00 msecs 4.435694 GBs ***
>>>
>>
>> Could you also comment on the CPU utilization impact of these
>> patches. 
> 
> Yeah, it really makes sense to analyze this impact. I have mentioned
> this in the outstanding issues section of the series. But what exactly
> do we need to analyze from the CPU utilization point of view? For
> example, what is the probability that jobs submitted to the work
> queues will push some tasks off the run queue and make them starve
> for longer? Could you please give some details on this?
> 

I wonder if the CPU utilization is so high that it is hurting the
system (in system time) for the sake of the increased migration speed.
We may need a trade-off (see my comment above).
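
One way to put numbers on that (a sketch, not part of the series) is to
sample the system-wide counters in /proc/stat around a migration run.
getrusage() alone would miss most of the cost, since the multi threaded
copy runs in kernel worker threads rather than in the calling task.

#include <stdio.h>

/* Sketch: read the aggregate "cpu" line of /proc/stat (USER_HZ ticks). */
static void read_cpu_times(long long *user, long long *sys, long long *idle)
{
	long long nice, iowait;
	FILE *f = fopen("/proc/stat", "r");

	fscanf(f, "cpu %lld %lld %lld %lld %lld",
	       user, &nice, sys, idle, &iowait);
	fclose(f);
}

void report_migration_cpu_cost(void (*do_migration)(void))
{
	long long u0, s0, i0, u1, s1, i1;

	read_cpu_times(&u0, &s0, &i0);
	do_migration();			/* e.g. a move_pages() wrapper */
	read_cpu_times(&u1, &s1, &i1);

	printf("delta: user %lld, system %lld, idle %lld ticks\n",
	       u1 - u0, s1 - s0, i1 - i0);
}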

Balbir Singh.


Re: [PATCH 0/6] Enable parallel page migration

2017-02-21 Thread Anshuman Khandual
On 02/22/2017 10:34 AM, Balbir Singh wrote:
> On Fri, Feb 17, 2017 at 04:54:47PM +0530, Anshuman Khandual wrote:
>>  This patch series is based on the work posted by Zi Yan back in
>> November 2016 (https://lkml.org/lkml/2016/11/22/457) but includes some
>> amount of clean up and re-organization. This series depends on the THP
>> migration optimization patch series posted by Naoya Horiguchi on 8th
>> November 2016 (https://lwn.net/Articles/705879/). Though Zi Yan has
>> recently reposted V3 of the THP migration patch series
>> (https://lwn.net/Articles/713667/), this series is yet to be rebased.
>>
>>  Primary motivation behind this patch series is to achieve higher
>> bandwidth of memory migration whenever possible by using a multi
>> threaded instead of a single threaded copy. All the experiments were
>> done on a two socket X86 system (Intel(R) Xeon(R) CPU E5-2650). All
>> the experiments here have the same allocation size, 4K * 100000 (which
>> did not split evenly for the 2MB huge pages). Here are the results.
>>
>> Vanilla:
>>
>> Moved 100000 normal pages in 247.00 msecs 1.544412 GBs
>> Moved 100000 normal pages in 238.00 msecs 1.602814 GBs
>> Moved 195 huge pages in 252.00 msecs 1.513769 GBs
>> Moved 195 huge pages in 257.00 msecs 1.484318 GBs
>>
>> THP migration improvements:
>>
>> Moved 100000 normal pages in 302.00 msecs 1.263145 GBs
> 
> Is there a decrease here for normal pages?

Yeah.

> 
>> Moved 100000 normal pages in 262.00 msecs 1.455991 GBs
>> Moved 195 huge pages in 120.00 msecs 3.178914 GBs
>> Moved 195 huge pages in 129.00 msecs 2.957130 GBs
>>
>> THP migration improvements + Multi threaded page copy:
>>
>> Moved 100000 normal pages in 1589.00 msecs 0.240069 GBs **
> 
> Ditto?

Yeah, I have already mentioned this after the data in the cover
letter. This new flag is controlled from user space while invoking
the system calls. Users should be careful to use it in scenarios
where it is useful and avoid it in cases where it hurts.

> 
>> Moved 100000 normal pages in 1932.00 msecs 0.197448 GBs **
>> Moved 195 huge pages in 54.00 msecs 7.064254 GBs ***
>> Moved 195 huge pages in 86.00 msecs 4.435694 GBs ***
>>
> 
> Could you also comment on the CPU utilization impact of these
> patches. 

Yeah, it really makes sense to analyze this impact. I have mentioned
this in the outstanding issues section of the series. But what exactly
do we need to analyze from the CPU utilization point of view? For
example, what is the probability that jobs submitted to the work
queues will push some tasks off the run queue and make them starve
for longer? Could you please give some details on this?



Re: [PATCH 0/6] Enable parallel page migration

2017-02-21 Thread Balbir Singh
On Fri, Feb 17, 2017 at 04:54:47PM +0530, Anshuman Khandual wrote:
>   This patch series is based on the work posted by Zi Yan back in
> November 2016 (https://lkml.org/lkml/2016/11/22/457) but includes some
> amount of clean up and re-organization. This series depends on the THP
> migration optimization patch series posted by Naoya Horiguchi on 8th
> November 2016 (https://lwn.net/Articles/705879/). Though Zi Yan has
> recently reposted V3 of the THP migration patch series
> (https://lwn.net/Articles/713667/), this series is yet to be rebased.
> 
>   Primary motivation behind this patch series is to achieve higher
> bandwidth of memory migration whenever possible by using a multi
> threaded instead of a single threaded copy. All the experiments were
> done on a two socket X86 system (Intel(R) Xeon(R) CPU E5-2650). All
> the experiments here have the same allocation size, 4K * 100000 (which
> did not split evenly for the 2MB huge pages). Here are the results.
> 
> Vanilla:
> 
> Moved 100000 normal pages in 247.00 msecs 1.544412 GBs
> Moved 100000 normal pages in 238.00 msecs 1.602814 GBs
> Moved 195 huge pages in 252.00 msecs 1.513769 GBs
> Moved 195 huge pages in 257.00 msecs 1.484318 GBs
> 
> THP migration improvements:
> 
> Moved 100000 normal pages in 302.00 msecs 1.263145 GBs

Is there a decrease here for normal pages?

> Moved 100000 normal pages in 262.00 msecs 1.455991 GBs
> Moved 195 huge pages in 120.00 msecs 3.178914 GBs
> Moved 195 huge pages in 129.00 msecs 2.957130 GBs
> 
> THP migration improvements + Multi threaded page copy:
> 
> Moved 100000 normal pages in 1589.00 msecs 0.240069 GBs **

Ditto?

> Moved 100000 normal pages in 1932.00 msecs 0.197448 GBs **
> Moved 195 huge pages in 54.00 msecs 7.064254 GBs ***
> Moved 195 huge pages in 86.00 msecs 4.435694 GBs ***
>

Could you also comment on the CPU utilization impact of these
patches. 

Balbir Singh.

