On Tue, Jun 19, 2018 at 08:13:37PM +0800, Wei Wang wrote: > On 06/19/2018 11:05 AM, Michael S. Tsirkin wrote: > > On Tue, Jun 19, 2018 at 01:06:48AM +0000, Wang, Wei W wrote: > > > On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote: > > > > On Sat, Jun 16, 2018 at 01:09:44AM +0000, Wang, Wei W wrote: > > > > > Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) > > > > > above, > > > > so the maximum memory that can be reported is 2TB. For larger guests, > > > > e.g. > > > > 4TB, the optimization can still offer 2TB free memory (better than no > > > > optimization). > > > > > > > > Maybe it's better, maybe it isn't. It certainly muddies the waters even > > > > more. > > > > I'd rather we had a better plan. From that POV I like what Matthew > > > > Wilcox > > > > suggested for this which is to steal the necessary # of entries off the > > > > list. > > > Actually what Matthew suggested doesn't make a difference here. That > > > method always steal the first free page blocks, and sure can be changed > > > to take more. But all these can be achieved via kmalloc > > I'd do get_user_pages really. You don't want pages split, etc.
Oops sorry. I meant get_free_pages . > > > > by the caller which is more prudent and makes the code more > > > straightforward. I think we don't need to take that risk unless the MM > > > folks strongly endorse that approach. > > > > > > The max size of the kmalloc-ed memory is 4MB, which gives us the > > > limitation that the max free memory to report is 2TB. Back to the > > > motivation of this work, the cloud guys want to use this optimization to > > > accelerate their guest live migration. 2TB guests are not common in > > > today's clouds. When huge guests become common in the future, we can > > > easily tweak this API to fill hints into scattered buffer (e.g. several > > > 4MB arrays passed to this API) instead of one as in this version. > > > > > > This limitation doesn't cause any issue from functionality perspective. > > > For the extreme case like a 100TB guest live migration which is > > > theoretically possible today, this optimization helps skip 2TB of its > > > free memory. This result is that it may reduce only 2% live migration > > > time, but still better than not skipping the 2TB (if not using the > > > feature). > > Not clearly better, no, since you are slowing the guest. > > Not really. Live migration slows down the guest itself. It seems that the > guest spends a little extra time reporting free pages, but in return the > live migration time gets reduced a lot, which makes the guest endure less > from live migration. (there is no drop of the workload performance when > using the optimization in the tests) My point was you can't say what is better without measuring. Without special limitations you have hint overhead vs migration overhead. I think we need to build to scale to huge guests. We might discover scalability problems down the road, but no sense in building in limitations straight away. > > > > > > > > > So, for the first release of this feature, I think it is better to have > > > the simpler and more straightforward solution as we have now, and clearly > > > document why it can report up to 2TB free memory. > > No one has the time to read documentation about how an internal flag > > within a device works. Come on, getting two pages isn't much harder > > than a single one. > > > > > If that doesn't fly, we can allocate out of the loop and just retry > > > > with more > > > > pages. > > > > > > > > > On the other hand, large guests being large mostly because the guests > > > > > need > > > > to use large memory. In that case, they usually won't have that much > > > > free > > > > memory to report. > > > > > > > > And following this logic small guests don't have a lot of memory to > > > > report at > > > > all. > > > > Could you remind me why are we considering this optimization then? > > > If there is a 3TB guest, it is 3TB not 2TB mostly because it would need > > > to use e.g. 2.5TB memory from time to time. In the worst case, it only > > > has 0.5TB free memory to report, but reporting 0.5TB with this > > > optimization is better than no optimization. (and the current 2TB > > > limitation isn't a limitation for the 3TB guest in this case) > > I'd rather not spend time writing up random limitations. > > This is not a random limitation. It would be more clear to see the code. Users don't see code though, that's the point. Exporting internal limitations from code to users isn't great. > Also I'm not sure how get_user_pages could be used in our case, and what you > meant by "getting two pages". I'll post out a new version, and we can > discuss on the code. Sorry, I meant get_free_pages. > > Best, > Wei _______________________________________________ Virtualization mailing list [email protected] https://lists.linuxfoundation.org/mailman/listinfo/virtualization
