On Tue, Jun 19, 2018 at 08:13:37PM +0800, Wei Wang wrote:
> On 06/19/2018 11:05 AM, Michael S. Tsirkin wrote:
> > On Tue, Jun 19, 2018 at 01:06:48AM +0000, Wang, Wei W wrote:
> > > On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> > > > On Sat, Jun 16, 2018 at 01:09:44AM +0000, Wang, Wei W wrote:
> > > > > Not necessarily, I think. We have min(4m_page_blocks / 512, 1024)
> > > > > above, so the maximum memory that can be reported is 2TB. For larger
> > > > > guests, e.g. 4TB, the optimization can still offer 2TB free memory
> > > > > (better than no optimization).
> > > > 
> > > > Maybe it's better, maybe it isn't. It certainly muddies the waters even
> > > > more. I'd rather we had a better plan. From that POV I like what Matthew
> > > > Wilcox suggested for this which is to steal the necessary # of entries
> > > > off the list.
> > > Actually what Matthew suggested doesn't make a difference here. That
> > > method always steals the first free page blocks, and sure can be changed
> > > to take more. But all these can be achieved via kmalloc
> > I'd do get_user_pages really. You don't want pages split, etc.

Oops, sorry. I meant get_free_pages.

> 
> > > by the caller which is more prudent and makes the code more 
> > > straightforward. I think we don't need to take that risk unless the MM 
> > > folks strongly endorse that approach.
> > > 
> > > The max size of the kmalloc-ed memory is 4MB, which gives us the 
> > > limitation that the max free memory to report is 2TB. Back to the 
> > > motivation of this work, the cloud guys want to use this optimization to 
> > > accelerate their guest live migration. 2TB guests are not common in 
> > > today's clouds. When huge guests become common in the future, we can 
> > > easily tweak this API to fill hints into scattered buffer (e.g. several 
> > > 4MB arrays passed to this API) instead of one as in this version.
> > > 
> > > This limitation doesn't cause any issue from a functionality
> > > perspective. For an extreme case like a 100TB guest live migration,
> > > which is theoretically possible today, this optimization helps skip 2TB
> > > of its free memory. The result is that it may reduce live migration
> > > time by only 2%, but that is still better than not skipping the 2TB (if
> > > not using the feature).
> > Not clearly better, no, since you are slowing the guest.
> 
> Not really. Live migration slows down the guest itself. It seems that the
> guest spends a little extra time reporting free pages, but in return the
> live migration time gets reduced a lot, which makes the guest endure less
> from live migration. (there is no drop of the workload performance when
> using the optimization in the tests)

My point was you can't say what is better without measuring.
Without special limitations you have hint overhead vs migration
overhead. I think we need to build to scale to huge guests.
We might discover scalability problems down the road,
but no sense in building in limitations straight away.
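
To make the numbers in this thread concrete, here is a rough sketch of
where the 2TB cap comes from and why it only covers 2% of a 100TB guest.
It assumes 8-byte hint entries and 4KB base pages, neither of which is
spelled out above, and the helper names are made up for illustration.

```python
# Back-of-the-envelope check of the 2TB figure quoted above.
# Assumptions (not stated in this thread): each hint is an 8-byte entry
# describing one 4MB free page block, and base pages are 4KB.
KB, MB, TB = 1 << 10, 1 << 20, 1 << 40

ENTRY_SIZE = 8            # assumed bytes per hint entry
PAGE_SIZE = 4 * KB        # assumed base page size
BLOCK_SIZE = 4 * MB       # the "4m page block" from the thread
MAX_ARRAY_PAGES = 1024    # cap from min(4m_page_blocks / 512, 1024)


def array_pages(guest_bytes):
    """Pages used for the hint array, capped as in the quoted min()."""
    blocks = guest_bytes // BLOCK_SIZE
    entries_per_page = PAGE_SIZE // ENTRY_SIZE  # 512 under the assumptions
    return min(blocks // entries_per_page, MAX_ARRAY_PAGES)


def max_reportable(guest_bytes):
    """Upper bound on the free memory a single array can describe."""
    entries = array_pages(guest_bytes) * (PAGE_SIZE // ENTRY_SIZE)
    return entries * BLOCK_SIZE


if __name__ == "__main__":
    for size_tb in (1, 3, 4, 100):
        cap = max_reportable(size_tb * TB)
        share = cap / (size_tb * TB)
        print(f"{size_tb}TB guest: cap {cap // TB}TB ({share:.0%} of guest)")
```

Under these assumptions the array tops out at 1024 pages (4MB), so any
guest above 2TB hits the same cap regardless of how much memory is free.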

> 
> 
> > 
> > 
> > > So, for the first release of this feature, I think it is better to have 
> > > the simpler and more straightforward solution as we have now, and clearly 
> > > document why it can report up to 2TB free memory.
> > No one has the time to read documentation about how an internal flag
> > within a device works. Come on, getting two pages isn't much harder
> > than a single one.
> 
> > > > If that doesn't fly, we can allocate out of the loop and just retry
> > > > with more pages.
> > > > 
> > > > > On the other hand, large guests are large mostly because the guests
> > > > > need to use large memory. In that case, they usually won't have that
> > > > > much free memory to report.
> > > > 
> > > > And following this logic small guests don't have a lot of memory to
> > > > report at all.
> > > > Could you remind me why are we considering this optimization then?
> > > If there is a 3TB guest, it is 3TB not 2TB mostly because it would need 
> > > to use e.g. 2.5TB memory from time to time. In the worst case, it only 
> > > has 0.5TB free memory to report, but reporting 0.5TB with this 
> > > optimization is better than no optimization. (and the current 2TB 
> > > limitation isn't a limitation for the 3TB guest in this case)
> > I'd rather not spend time writing up random limitations.
> 
> This is not a random limitation. It would be more clear to see the code.

Users don't see code though, that's the point.

Exporting internal limitations from code to users isn't great.


> Also I'm not sure how get_user_pages could be used in our case, and what you
> meant by "getting two pages". I'll post out a new version, and we can
> discuss on the code.

Sorry, I meant get_free_pages.

> 
> Best,
> Wei