Re: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-06-20 Thread Michael S. Tsirkin
On Wed, Jun 20, 2018 at 09:11:39AM +, Wang, Wei W wrote:
> On Tuesday, June 19, 2018 10:43 PM, Michael S. Tsirk wrote:
> > On Tue, Jun 19, 2018 at 08:13:37PM +0800, Wei Wang wrote:
> > > On 06/19/2018 11:05 AM, Michael S. Tsirkin wrote:
> > > > On Tue, Jun 19, 2018 at 01:06:48AM +, Wang, Wei W wrote:
> > > > > On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> > > > > > On Sat, Jun 16, 2018 at 01:09:44AM +, Wang, Wei W wrote:
> > > > > > > Not necessarily, I think. We have min(4m_page_blocks / 512,
> > > > > > > 1024) above,
> > > > > > so the maximum memory that can be reported is 2TB. For larger
> > guests, e.g.
> > > > > > 4TB, the optimization can still offer 2TB free memory (better
> > > > > > than no optimization).
> > > > > >
> > > > > > Maybe it's better, maybe it isn't. It certainly muddies the waters 
> > > > > > even
> > more.
> > > > > > I'd rather we had a better plan. From that POV I like what
> > > > > > Matthew Wilcox suggested for this which is to steal the necessary # 
> > > > > > of
> > entries off the list.
> > > > > Actually what Matthew suggested doesn't make a difference here.
> > > > > That method always steal the first free page blocks, and sure can
> > > > > be changed to take more. But all these can be achieved via kmalloc
> > > > I'd do get_user_pages really. You don't want pages split, etc.
> > 
> > Oops sorry. I meant get_free_pages .
> 
> Yes, we can use __get_free_pages, and the max allocation is MAX_ORDER - 1, 
> which can report up to 2TB free memory. 
> 
> "getting two pages isn't harder", do you mean passing two arrays (two 
> allocations by get_free_pages(,MAX_ORDER -1)) to the mm API?

Yes, or generally a list of pages with as many as needed.


> Please see if the following logic aligns to what you think:
> 
> uint32_t i, max_hints, hints_per_page, hints_per_array, total_arrays;
> unsigned long *arrays;
>  
>  /*
>  * Each array size is MAX_ORDER_NR_PAGES. If one array is not enough 
> to
>  * store all the hints, we need to allocate multiple arrays.
>  * max_hints: the max number of 4MB free page blocks
>  * hints_per_page: the number of hints each page can store
>  * hints_per_array: the number of hints an array can store
>  * total_arrays: the number of arrays we need
>  */
> max_hints = totalram_pages / MAX_ORDER_NR_PAGES;
> hints_per_page = PAGE_SIZE / sizeof(__le64);
> hints_per_array = hints_per_page * MAX_ORDER_NR_PAGES;
> total_arrays = max_hints /  hints_per_array +
>!!(max_hints % hints_per_array);
> arrays = kmalloc(total_arrays * sizeof(unsigned long), GFP_KERNEL);
> for (i = 0; i < total_arrays; i++) {
> arrays[i] = __get_free_pages(__GFP_ATOMIC | __GFP_NOMEMALLOC, 
> MAX_ORDER - 1);
> 
>   if (!arrays[i])
>   goto out;
> }
> 
> 
> - the mm API needs to be changed to support storing hints to multiple 
> separated arrays offered by the caller.
> 
> Best,
> Wei

Yes. And add an API to just count entries so we know how many arrays to 
allocate.

-- 
MST
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


RE: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-06-20 Thread Wang, Wei W
On Tuesday, June 19, 2018 10:43 PM, Michael S. Tsirk wrote:
> On Tue, Jun 19, 2018 at 08:13:37PM +0800, Wei Wang wrote:
> > On 06/19/2018 11:05 AM, Michael S. Tsirkin wrote:
> > > On Tue, Jun 19, 2018 at 01:06:48AM +, Wang, Wei W wrote:
> > > > On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> > > > > On Sat, Jun 16, 2018 at 01:09:44AM +, Wang, Wei W wrote:
> > > > > > Not necessarily, I think. We have min(4m_page_blocks / 512,
> > > > > > 1024) above,
> > > > > so the maximum memory that can be reported is 2TB. For larger
> guests, e.g.
> > > > > 4TB, the optimization can still offer 2TB free memory (better
> > > > > than no optimization).
> > > > >
> > > > > Maybe it's better, maybe it isn't. It certainly muddies the waters 
> > > > > even
> more.
> > > > > I'd rather we had a better plan. From that POV I like what
> > > > > Matthew Wilcox suggested for this which is to steal the necessary # of
> entries off the list.
> > > > Actually what Matthew suggested doesn't make a difference here.
> > > > That method always steal the first free page blocks, and sure can
> > > > be changed to take more. But all these can be achieved via kmalloc
> > > I'd do get_user_pages really. You don't want pages split, etc.
> 
> Oops sorry. I meant get_free_pages .

Yes, we can use __get_free_pages, and the max allocation is MAX_ORDER - 1, 
which can report up to 2TB free memory. 

"getting two pages isn't harder", do you mean passing two arrays (two 
allocations by get_free_pages(,MAX_ORDER -1)) to the mm API?

Please see if the following logic aligns to what you think:

uint32_t i, max_hints, hints_per_page, hints_per_array, total_arrays;
unsigned long *arrays;
 
 /*
 * Each array size is MAX_ORDER_NR_PAGES. If one array is not enough to
 * store all the hints, we need to allocate multiple arrays.
 * max_hints: the max number of 4MB free page blocks
 * hints_per_page: the number of hints each page can store
 * hints_per_array: the number of hints an array can store
 * total_arrays: the number of arrays we need
 */
max_hints = totalram_pages / MAX_ORDER_NR_PAGES;
hints_per_page = PAGE_SIZE / sizeof(__le64);
hints_per_array = hints_per_page * MAX_ORDER_NR_PAGES;
total_arrays = max_hints /  hints_per_array +
   !!(max_hints % hints_per_array);
arrays = kmalloc(total_arrays * sizeof(unsigned long), GFP_KERNEL);
for (i = 0; i < total_arrays; i++) {
arrays[i] = __get_free_pages(__GFP_ATOMIC | __GFP_NOMEMALLOC, 
MAX_ORDER - 1);

  if (!arrays[i])
  goto out;
}


- the mm API needs to be changed to support storing hints to multiple separated 
arrays offered by the caller.

Best,
Wei
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-06-19 Thread Michael S. Tsirkin
On Tue, Jun 19, 2018 at 08:13:37PM +0800, Wei Wang wrote:
> On 06/19/2018 11:05 AM, Michael S. Tsirkin wrote:
> > On Tue, Jun 19, 2018 at 01:06:48AM +, Wang, Wei W wrote:
> > > On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> > > > On Sat, Jun 16, 2018 at 01:09:44AM +, Wang, Wei W wrote:
> > > > > Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) 
> > > > > above,
> > > > so the maximum memory that can be reported is 2TB. For larger guests, 
> > > > e.g.
> > > > 4TB, the optimization can still offer 2TB free memory (better than no
> > > > optimization).
> > > > 
> > > > Maybe it's better, maybe it isn't. It certainly muddies the waters even 
> > > > more.
> > > > I'd rather we had a better plan. From that POV I like what Matthew 
> > > > Wilcox
> > > > suggested for this which is to steal the necessary # of entries off the 
> > > > list.
> > > Actually what Matthew suggested doesn't make a difference here. That 
> > > method always steal the first free page blocks, and sure can be changed 
> > > to take more. But all these can be achieved via kmalloc
> > I'd do get_user_pages really. You don't want pages split, etc.

Oops sorry. I meant get_free_pages .

> 
> > > by the caller which is more prudent and makes the code more 
> > > straightforward. I think we don't need to take that risk unless the MM 
> > > folks strongly endorse that approach.
> > > 
> > > The max size of the kmalloc-ed memory is 4MB, which gives us the 
> > > limitation that the max free memory to report is 2TB. Back to the 
> > > motivation of this work, the cloud guys want to use this optimization to 
> > > accelerate their guest live migration. 2TB guests are not common in 
> > > today's clouds. When huge guests become common in the future, we can 
> > > easily tweak this API to fill hints into scattered buffer (e.g. several 
> > > 4MB arrays passed to this API) instead of one as in this version.
> > > 
> > > This limitation doesn't cause any issue from functionality perspective. 
> > > For the extreme case like a 100TB guest live migration which is 
> > > theoretically possible today, this optimization helps skip 2TB of its 
> > > free memory. This result is that it may reduce only 2% live migration 
> > > time, but still better than not skipping the 2TB (if not using the 
> > > feature).
> > Not clearly better, no, since you are slowing the guest.
> 
> Not really. Live migration slows down the guest itself. It seems that the
> guest spends a little extra time reporting free pages, but in return the
> live migration time gets reduced a lot, which makes the guest endure less
> from live migration. (there is no drop of the workload performance when
> using the optimization in the tests)

My point was you can't say what is better without measuring.
Without special limitations you have hint overhead vs migration
overhead. I think we need to  build to scale to huge guests.
We might discover scalability problems down the road,
but no sense in building in limitations straight away.

> 
> 
> > 
> > 
> > > So, for the first release of this feature, I think it is better to have 
> > > the simpler and more straightforward solution as we have now, and clearly 
> > > document why it can report up to 2TB free memory.
> > No one has the time to read documentation about how an internal flag
> > within a device works. Come on, getting two pages isn't much harder
> > than a single one.
> 
> > > > If that doesn't fly, we can allocate out of the loop and just retry 
> > > > with more
> > > > pages.
> > > > 
> > > > > On the other hand, large guests being large mostly because the guests 
> > > > > need
> > > > to use large memory. In that case, they usually won't have that much 
> > > > free
> > > > memory to report.
> > > > 
> > > > And following this logic small guests don't have a lot of memory to 
> > > > report at
> > > > all.
> > > > Could you remind me why are we considering this optimization then?
> > > If there is a 3TB guest, it is 3TB not 2TB mostly because it would need 
> > > to use e.g. 2.5TB memory from time to time. In the worst case, it only 
> > > has 0.5TB free memory to report, but reporting 0.5TB with this 
> > > optimization is better than no optimization. (and the current 2TB 
> > > limitation isn't a limitation for the 3TB guest in this case)
> > I'd rather not spend time writing up random limitations.
> 
> This is not a random limitation. It would be more clear to see the code.

Users don't see code though, that's the point.

Exporting internal limitations from code to users isn't great.


> Also I'm not sure how get_user_pages could be used in our case, and what you
> meant by "getting two pages". I'll post out a new version, and we can
> discuss on the code.

Sorry, I meant get_free_pages.

> 
> Best,
> Wei
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-06-19 Thread Wei Wang

On 06/19/2018 11:05 AM, Michael S. Tsirkin wrote:

On Tue, Jun 19, 2018 at 01:06:48AM +, Wang, Wei W wrote:

On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:

On Sat, Jun 16, 2018 at 01:09:44AM +, Wang, Wei W wrote:

Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above,

so the maximum memory that can be reported is 2TB. For larger guests, e.g.
4TB, the optimization can still offer 2TB free memory (better than no
optimization).

Maybe it's better, maybe it isn't. It certainly muddies the waters even more.
I'd rather we had a better plan. From that POV I like what Matthew Wilcox
suggested for this which is to steal the necessary # of entries off the list.

Actually what Matthew suggested doesn't make a difference here. That method 
always steal the first free page blocks, and sure can be changed to take more. 
But all these can be achieved via kmalloc

I'd do get_user_pages really. You don't want pages split, etc.



by the caller which is more prudent and makes the code more straightforward. I 
think we don't need to take that risk unless the MM folks strongly endorse that 
approach.

The max size of the kmalloc-ed memory is 4MB, which gives us the limitation 
that the max free memory to report is 2TB. Back to the motivation of this work, 
the cloud guys want to use this optimization to accelerate their guest live 
migration. 2TB guests are not common in today's clouds. When huge guests become 
common in the future, we can easily tweak this API to fill hints into scattered 
buffer (e.g. several 4MB arrays passed to this API) instead of one as in this 
version.

This limitation doesn't cause any issue from functionality perspective. For the 
extreme case like a 100TB guest live migration which is theoretically possible 
today, this optimization helps skip 2TB of its free memory. This result is that 
it may reduce only 2% live migration time, but still better than not skipping 
the 2TB (if not using the feature).

Not clearly better, no, since you are slowing the guest.


Not really. Live migration slows down the guest itself. It seems that 
the guest spends a little extra time reporting free pages, but in return 
the live migration time gets reduced a lot, which makes the guest endure 
less from live migration. (there is no drop of the workload performance 
when using the optimization in the tests)








So, for the first release of this feature, I think it is better to have the 
simpler and more straightforward solution as we have now, and clearly document 
why it can report up to 2TB free memory.

No one has the time to read documentation about how an internal flag
within a device works. Come on, getting two pages isn't much harder
than a single one.


  

If that doesn't fly, we can allocate out of the loop and just retry with more
pages.


On the other hand, large guests being large mostly because the guests need

to use large memory. In that case, they usually won't have that much free
memory to report.

And following this logic small guests don't have a lot of memory to report at
all.
Could you remind me why are we considering this optimization then?

If there is a 3TB guest, it is 3TB not 2TB mostly because it would need to use 
e.g. 2.5TB memory from time to time. In the worst case, it only has 0.5TB free 
memory to report, but reporting 0.5TB with this optimization is better than no 
optimization. (and the current 2TB limitation isn't a limitation for the 3TB 
guest in this case)

I'd rather not spend time writing up random limitations.


This is not a random limitation. It would be more clear to see the code. 
Also I'm not sure how get_user_pages could be used in our case, and what 
you meant by "getting two pages". I'll post out a new version, and we 
can discuss on the code.



Best,
Wei
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-06-18 Thread Michael S. Tsirkin
On Tue, Jun 19, 2018 at 01:06:48AM +, Wang, Wei W wrote:
> On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> > On Sat, Jun 16, 2018 at 01:09:44AM +, Wang, Wei W wrote:
> > > Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above,
> > so the maximum memory that can be reported is 2TB. For larger guests, e.g.
> > 4TB, the optimization can still offer 2TB free memory (better than no
> > optimization).
> > 
> > Maybe it's better, maybe it isn't. It certainly muddies the waters even 
> > more.
> > I'd rather we had a better plan. From that POV I like what Matthew Wilcox
> > suggested for this which is to steal the necessary # of entries off the 
> > list.
> 
> Actually what Matthew suggested doesn't make a difference here. That method 
> always steal the first free page blocks, and sure can be changed to take 
> more. But all these can be achieved via kmalloc

I'd do get_user_pages really. You don't want pages split, etc.

> by the caller which is more prudent and makes the code more straightforward. 
> I think we don't need to take that risk unless the MM folks strongly endorse 
> that approach.
> 
> The max size of the kmalloc-ed memory is 4MB, which gives us the limitation 
> that the max free memory to report is 2TB. Back to the motivation of this 
> work, the cloud guys want to use this optimization to accelerate their guest 
> live migration. 2TB guests are not common in today's clouds. When huge guests 
> become common in the future, we can easily tweak this API to fill hints into 
> scattered buffer (e.g. several 4MB arrays passed to this API) instead of one 
> as in this version.
> 
> This limitation doesn't cause any issue from functionality perspective. For 
> the extreme case like a 100TB guest live migration which is theoretically 
> possible today, this optimization helps skip 2TB of its free memory. This 
> result is that it may reduce only 2% live migration time, but still better 
> than not skipping the 2TB (if not using the feature).

Not clearly better, no, since you are slowing the guest.


> So, for the first release of this feature, I think it is better to have the 
> simpler and more straightforward solution as we have now, and clearly 
> document why it can report up to 2TB free memory.

No one has the time to read documentation about how an internal flag
within a device works. Come on, getting two pages isn't much harder
than a single one.

> 
>  
> > If that doesn't fly, we can allocate out of the loop and just retry with 
> > more
> > pages.
> > 
> > > On the other hand, large guests being large mostly because the guests need
> > to use large memory. In that case, they usually won't have that much free
> > memory to report.
> > 
> > And following this logic small guests don't have a lot of memory to report 
> > at
> > all.
> > Could you remind me why are we considering this optimization then?
> 
> If there is a 3TB guest, it is 3TB not 2TB mostly because it would need to 
> use e.g. 2.5TB memory from time to time. In the worst case, it only has 0.5TB 
> free memory to report, but reporting 0.5TB with this optimization is better 
> than no optimization. (and the current 2TB limitation isn't a limitation for 
> the 3TB guest in this case)

I'd rather not spend time writing up random limitations.


> Best,
> Wei
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


RE: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-06-18 Thread Wang, Wei W
On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> On Sat, Jun 16, 2018 at 01:09:44AM +, Wang, Wei W wrote:
> > Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above,
> so the maximum memory that can be reported is 2TB. For larger guests, e.g.
> 4TB, the optimization can still offer 2TB free memory (better than no
> optimization).
> 
> Maybe it's better, maybe it isn't. It certainly muddies the waters even more.
> I'd rather we had a better plan. From that POV I like what Matthew Wilcox
> suggested for this which is to steal the necessary # of entries off the list.

Actually what Matthew suggested doesn't make a difference here. That method 
always steal the first free page blocks, and sure can be changed to take more. 
But all these can be achieved via kmalloc by the caller which is more prudent 
and makes the code more straightforward. I think we don't need to take that 
risk unless the MM folks strongly endorse that approach.

The max size of the kmalloc-ed memory is 4MB, which gives us the limitation 
that the max free memory to report is 2TB. Back to the motivation of this work, 
the cloud guys want to use this optimization to accelerate their guest live 
migration. 2TB guests are not common in today's clouds. When huge guests become 
common in the future, we can easily tweak this API to fill hints into scattered 
buffer (e.g. several 4MB arrays passed to this API) instead of one as in this 
version.

This limitation doesn't cause any issue from functionality perspective. For the 
extreme case like a 100TB guest live migration which is theoretically possible 
today, this optimization helps skip 2TB of its free memory. This result is that 
it may reduce only 2% live migration time, but still better than not skipping 
the 2TB (if not using the feature).

So, for the first release of this feature, I think it is better to have the 
simpler and more straightforward solution as we have now, and clearly document 
why it can report up to 2TB free memory.


 
> If that doesn't fly, we can allocate out of the loop and just retry with more
> pages.
> 
> > On the other hand, large guests being large mostly because the guests need
> to use large memory. In that case, they usually won't have that much free
> memory to report.
> 
> And following this logic small guests don't have a lot of memory to report at
> all.
> Could you remind me why are we considering this optimization then?

If there is a 3TB guest, it is 3TB not 2TB mostly because it would need to use 
e.g. 2.5TB memory from time to time. In the worst case, it only has 0.5TB free 
memory to report, but reporting 0.5TB with this optimization is better than no 
optimization. (and the current 2TB limitation isn't a limitation for the 3TB 
guest in this case)

Best,
Wei
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization