Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for
> fast (de)inflating & fast live migration
>
> On Thu, Dec 15, 2016 at 05:40:45PM -0800, Dave Hansen wrote:
> > On 12/15/2016 05:38 PM, Li, Liang Z wrote:
> > > Use 52 bits for 'pfn', 12 bits for 'length'; when the 12 bits are not
> > > long enough for the 'length', set the 'length' to a special value to
> > > indicate "actual length in next 8 bytes".
> > >
> > > That will be much more simple. Right?
> >
> > Sounds fine to me.
>
> Sounds fine to me too indeed.
>
> I'm only wondering what is the major point of compressing gpfn+len into
> 8 bytes in the common case: you already use sg_init_table to send down two
> pages, we could send three as well and avoid all the math and bit shifts
> and ors, or not?

Yes, we can use more pages for that.

> I agree with the above because from a performance perspective I tend to
> think the above proposal will run at least theoretically faster, because
> the other way is to waste double the amount of CPU cache, and bit mangling
> in the encoding and the later decoding on the qemu side should be faster
> than accessing an array of double size. But then I'm not sure if it's a
> measurable optimization, so I'd be curious to know the exact motivation,
> and whether it is to reduce the CPU cache usage or there's some other
> fundamental reason to compress it.
>
> The header already tells qemu how big the array payload is; couldn't we
> just add more pages if one isn't enough?

The original intention of compressing the PFN and length was to reduce the
memory required. Even though the code has changed a lot since the previous
versions, I think this is still true. Currently we allocate a buffer of a
specified size to save the 'PFN|length' records; when that buffer is not big
enough to hold all the page info for a given order, a buffer of double the
size has to be allocated.
This is what we want to avoid, because the allocation may fail and the
allocation takes time. For fast live migration, time is a critical factor:
the more time it takes, the more unnecessary pages are sent, because live
migration starts before the request for unused pages gets a response.

Thanks!
Liang

> Thanks,
> Andrea
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
> On Fri, Dec 16, 2016 at 01:12:21AM +, Li, Liang Z wrote:
> > There still exists the case where MAX_ORDER is configured to a large
> > value, e.g. 36 for a system with a huge amount of memory; then there are
> > only 28 bits left for the pfn, which is not enough.
>
> Not related to the balloon, but how would it help to set MAX_ORDER to 36?

My point here is that MAX_ORDER may be configured to a big value.

> What MAX_ORDER affects is that you won't be able to ask the kernel
> page allocator for contiguous memory bigger than 1<<(MAX_ORDER-1) pages,
> but that's a driver issue not relevant to the amount of RAM. Drivers won't
> suddenly start to ask the kernel allocator to allocate compound pages at
> orders >= 11 just because more RAM was added.
>
> The higher the MAX_ORDER, the slower the kernel runs, so quite simply the
> smaller the MAX_ORDER the better.
>
> > Should we limit the MAX_ORDER? I don't think so.
>
> We shouldn't strictly depend on the MAX_ORDER value, but it's mostly
> limited already even if configurable at build time.

I didn't know that and will take a look, thanks for your information.

Liang

> We definitely need it to reach at least the hugepage size; beyond that it's
> mostly a driver issue, but drivers requiring large contiguous allocations
> should rely on CMA only, or vmalloc if they only require it virtually
> contiguous, and not rely on a larger MAX_ORDER that would slow down all
> kernel allocations/freeing.
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
On Thu, Dec 15, 2016 at 05:40:45PM -0800, Dave Hansen wrote:
> On 12/15/2016 05:38 PM, Li, Liang Z wrote:
> > Use 52 bits for 'pfn', 12 bits for 'length'; when the 12 bits are not
> > long enough for the 'length', set the 'length' to a special value to
> > indicate "actual length in next 8 bytes".
> >
> > That will be much more simple. Right?
>
> Sounds fine to me.

Sounds fine to me too indeed.

I'm only wondering what is the major point of compressing gpfn+len into
8 bytes in the common case: you already use sg_init_table to send down two
pages, we could send three as well and avoid all the math and bit shifts and
ors, or not?

I agree with the above because from a performance perspective I tend to
think the above proposal will run at least theoretically faster, because the
other way is to waste double the amount of CPU cache, and bit mangling in the
encoding and the later decoding on the qemu side should be faster than
accessing an array of double size. But then I'm not sure if it's a measurable
optimization, so I'd be curious to know the exact motivation, and whether it
is to reduce the CPU cache usage or there's some other fundamental reason to
compress it.

The header already tells qemu how big the array payload is; couldn't we just
add more pages if one isn't enough?

Thanks,
Andrea
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
On Fri, Dec 16, 2016 at 01:12:21AM +, Li, Liang Z wrote:
> There still exists the case where MAX_ORDER is configured to a large value,
> e.g. 36 for a system with a huge amount of memory; then there are only 28
> bits left for the pfn, which is not enough.

Not related to the balloon, but how would it help to set MAX_ORDER to 36?

What MAX_ORDER affects is that you won't be able to ask the kernel
page allocator for contiguous memory bigger than 1<<(MAX_ORDER-1) pages, but
that's a driver issue not relevant to the amount of RAM. Drivers won't
suddenly start to ask the kernel allocator to allocate compound pages at
orders >= 11 just because more RAM was added.

The higher the MAX_ORDER, the slower the kernel runs, so quite simply the
smaller the MAX_ORDER the better.

> Should we limit the MAX_ORDER? I don't think so.

We shouldn't strictly depend on the MAX_ORDER value, but it's mostly limited
already even if configurable at build time.

We definitely need it to reach at least the hugepage size; beyond that it's
mostly a driver issue, but drivers requiring large contiguous allocations
should rely on CMA only, or vmalloc if they only require it virtually
contiguous, and not rely on a larger MAX_ORDER that would slow down all
kernel allocations/freeing.
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
> On 12/15/2016 05:38 PM, Li, Liang Z wrote: > > > > Use 52 bits for 'pfn', 12 bits for 'length', when the 12 bits is not long > > enough > for the 'length' > > Set the 'length' to a special value to indicate the "actual length in next 8 > bytes". > > > > That will be much more simple. Right? > > Sounds fine to me. Thanks for your inspiration! Liang
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
On 12/15/2016 05:38 PM, Li, Liang Z wrote:
> Use 52 bits for 'pfn', 12 bits for 'length'; when the 12 bits are not long
> enough for the 'length', set the 'length' to a special value to indicate
> "actual length in next 8 bytes".
>
> That will be much more simple. Right?

Sounds fine to me.
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
> On 12/15/2016 04:48 PM, Li, Liang Z wrote:
> >>> It seems we leave too many bits for the pfn, and the bits left for
> >>> length are not enough. How about keeping 45 bits for the pfn and 19
> >>> bits for length? 45 bits for pfn can cover a 57-bit physical address,
> >>> that should be enough in the near future.
> >>> What's your opinion?
> >> I still think 'order' makes a lot of sense. But, as you say, 57 bits
> >> is enough for x86 for a while. Other architectures, who knows?
>
> Thinking about this some more... There are really only two cases that
> matter: 4k pages and "much bigger" ones.
>
> Squeezing each 4k page into 8 bytes of metadata helps guarantee that this
> scheme won't regress over the old scheme in any cases. For bigger ranges, 8
> vs 16 bytes means *nothing*. And 16 bytes will be as good or better than
> the old scheme for everything which is >4k.
>
> How about this:
> * 52 bits of 'pfn', 5 bits of 'order', 7 bits of 'length'
> * One special 'length' value to mean "actual length in next 8 bytes"
>
> That should be pretty simple to produce and decode. We have two record
> sizes, but I think it is manageable.

It works. Now that we intend to use another 8 bytes for the length anyway,
why not:

Use 52 bits for 'pfn' and 12 bits for 'length'; when the 12 bits are not
long enough for the 'length', set the 'length' to a special value to
indicate "actual length in next 8 bytes".

That will be much more simple. Right?

Liang
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
> On Thu, Dec 15, 2016 at 07:34:33AM -0800, Dave Hansen wrote:
> > On 12/14/2016 12:59 AM, Li, Liang Z wrote:
> > >> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend
> > >> virtio-balloon for fast (de)inflating & fast live migration
> > >>
> > >> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> > >>> What's the conclusion of your discussion? It seems you want some
> > >>> statistics before deciding whether to rip the bitmap from the
> > >>> ABI, am I right?
> > >>
> > >> I think Andrea and David feel pretty strongly that we should remove
> > >> the bitmap, unless we have some data to support keeping it. I
> > >> don't feel as strongly about it, but I think their critique of it
> > >> is pretty valid. I think the consensus is that the bitmap needs to go.
> > >>
> > >> The only real question IMNHO is whether we should do a power-of-2
> > >> or a length. But, if we have 12 bits, then the argument for doing
> > >> length is pretty strong. We don't need anywhere near 12 bits if
> > >> doing power-of-2.
> > >
> > > Just found that MAX_ORDER should be limited to 12 if we use length
> > > instead of order; if MAX_ORDER is configured to a value bigger
> > > than 12, it will make things more complex to handle this case.
> > >
> > > If we use order, we need to break a large memory range whose length
> > > is not a power of 2 into several small ranges; it also makes the
> > > code complex.
> >
> > I can't imagine it makes the code that much more complex. It adds a
> > for loop. Right?
> >
> > > It seems we leave too many bits for the pfn, and the bits left for
> > > length are not enough. How about keeping 45 bits for the pfn and 19
> > > bits for length? 45 bits for pfn can cover a 57-bit physical
> > > address, that should be enough in the near future.
> > >
> > > What's your opinion?
> >
> > I still think 'order' makes a lot of sense. But, as you say, 57 bits
> > is enough for x86 for a while. Other architectures, who knows?
>
> I think you can probably assume page size >= 4K. But I would not want to
> make any other assumptions. E.g. there are systems that absolutely require
> you to set high bits for DMA.
>
> I think we really want both length and order.
>
> I understand how you are trying to pack them as tightly as possible.
>
> However, I thought of a trick: we don't need to encode all possible orders.
> For example, with 2 bits of order, we can make them mean:
> 00 - 4K pages
> 01 - 2M pages
> 02 - 1G pages
>
> guest can program the sizes for each order through config space.
>
> We will have 10 bits left for length.

Please don't, we just got rid of the bitmap for simplification. :)

> It might make sense to also allow the guest to program the number of bits
> used for order, this will make it easy to extend without host changes.

There still exists the case where MAX_ORDER is configured to a large value,
e.g. 36 for a system with a huge amount of memory; then there are only 28
bits left for the pfn, which is not enough.

Should we limit the MAX_ORDER? I don't think so.

It seems using order is better.

Thanks!
Liang

> --
> MST
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
On 12/15/2016 04:48 PM, Li, Liang Z wrote:
>>> It seems we leave too many bits for the pfn, and the bits left for
>>> length are not enough. How about keeping 45 bits for the pfn and 19 bits
>>> for length? 45 bits for pfn can cover a 57-bit physical address, that
>>> should be enough in the near future.
>>> What's your opinion?
>> I still think 'order' makes a lot of sense. But, as you say, 57 bits is
>> enough for x86 for a while. Other architectures, who knows?

Thinking about this some more... There are really only two cases that
matter: 4k pages and "much bigger" ones.

Squeezing each 4k page into 8 bytes of metadata helps guarantee that this
scheme won't regress over the old scheme in any cases. For bigger ranges, 8
vs 16 bytes means *nothing*. And 16 bytes will be as good or better than
the old scheme for everything which is >4k.

How about this:
 * 52 bits of 'pfn', 5 bits of 'order', 7 bits of 'length'
 * One special 'length' value to mean "actual length in next 8 bytes"

That should be pretty simple to produce and decode. We have two record
sizes, but I think it is manageable.
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for > fast (de)inflating & fast live migration > > On 12/14/2016 12:59 AM, Li, Liang Z wrote: > >> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon > >> for fast (de)inflating & fast live migration > >> > >> On 12/08/2016 08:45 PM, Li, Liang Z wrote: > >>> What's the conclusion of your discussion? It seems you want some > >>> statistic before deciding whether to ripping the bitmap from the > >>> ABI, am I right? > >> > >> I think Andrea and David feel pretty strongly that we should remove > >> the bitmap, unless we have some data to support keeping it. I don't > >> feel as strongly about it, but I think their critique of it is pretty > >> valid. I think the consensus is that the bitmap needs to go. > >> > >> The only real question IMNHO is whether we should do a power-of-2 or > >> a length. But, if we have 12 bits, then the argument for doing > >> length is pretty strong. We don't need anywhere near 12 bits if doing > power-of-2. > > > > Just found the MAX_ORDER should be limited to 12 if use length instead > > of order, If the MAX_ORDER is configured to a value bigger than 12, it > > will make things more complex to handle this case. > > > > If use order, we need to break a large memory range whose length is > > not the power of 2 into several small ranges, it also make the code complex. > > I can't imagine it makes the code that much more complex. It adds a for loop. > Right? > Yes, just a little. :) > > It seems we leave too many bit for the pfn, and the bits leave for > > length is not enough, How about keep 45 bits for the pfn and 19 bits > > for length, 45 bits for pfn can cover 57 bits physical address, that should > > be > enough in the near feature. > > > > What's your opinion? > > I still think 'order' makes a lot of sense. But, as you say, 57 bits is > enough for > x86 for a while. Other architectures who knows? Yes.
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
On Thu, Dec 15, 2016 at 07:34:33AM -0800, Dave Hansen wrote: > On 12/14/2016 12:59 AM, Li, Liang Z wrote: > >> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for > >> fast (de)inflating & fast live migration > >> > >> On 12/08/2016 08:45 PM, Li, Liang Z wrote: > >>> What's the conclusion of your discussion? It seems you want some > >>> statistic before deciding whether to ripping the bitmap from the ABI, > >>> am I right? > >> > >> I think Andrea and David feel pretty strongly that we should remove the > >> bitmap, unless we have some data to support keeping it. I don't feel as > >> strongly about it, but I think their critique of it is pretty valid. I > >> think the > >> consensus is that the bitmap needs to go. > >> > >> The only real question IMNHO is whether we should do a power-of-2 or a > >> length. But, if we have 12 bits, then the argument for doing length is > >> pretty > >> strong. We don't need anywhere near 12 bits if doing power-of-2. > > > > Just found the MAX_ORDER should be limited to 12 if use length instead of > > order, > > If the MAX_ORDER is configured to a value bigger than 12, it will make > > things more > > complex to handle this case. > > > > If use order, we need to break a large memory range whose length is not the > > power of 2 into several > > small ranges, it also make the code complex. > > I can't imagine it makes the code that much more complex. It adds a for > loop. Right? > > > It seems we leave too many bit for the pfn, and the bits leave for length > > is not enough, > > How about keep 45 bits for the pfn and 19 bits for length, 45 bits for pfn > > can cover 57 bits > > physical address, that should be enough in the near feature. > > > > What's your opinion? > > I still think 'order' makes a lot of sense. But, as you say, 57 bits is > enough for x86 for a while. Other architectures who knows? I think you can probably assume page size >= 4K. But I would not want to make any other assumptions. E.g. 
there are systems that absolutely require you to set high bits for DMA.

I think we really want both length and order.

I understand how you are trying to pack them as tightly as possible.

However, I thought of a trick: we don't need to encode all possible orders.
For example, with 2 bits of order, we can make them mean:
00 - 4K pages
01 - 2M pages
02 - 1G pages

guest can program the sizes for each order through config space.

We will have 10 bits left for length.

It might make sense to also allow the guest to program the number of bits
used for order; this will make it easy to extend without host changes.

--
MST
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
On 12/14/2016 12:59 AM, Li, Liang Z wrote:
>> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for
>> fast (de)inflating & fast live migration
>>
>> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
>>> What's the conclusion of your discussion? It seems you want some
>>> statistics before deciding whether to rip the bitmap from the ABI,
>>> am I right?
>>
>> I think Andrea and David feel pretty strongly that we should remove the
>> bitmap, unless we have some data to support keeping it. I don't feel as
>> strongly about it, but I think their critique of it is pretty valid. I
>> think the consensus is that the bitmap needs to go.
>>
>> The only real question IMNHO is whether we should do a power-of-2 or a
>> length. But, if we have 12 bits, then the argument for doing length is
>> pretty strong. We don't need anywhere near 12 bits if doing power-of-2.
>
> Just found that MAX_ORDER should be limited to 12 if we use length instead
> of order; if MAX_ORDER is configured to a value bigger than 12, it will
> make things more complex to handle this case.
>
> If we use order, we need to break a large memory range whose length is not
> a power of 2 into several small ranges; it also makes the code complex.

I can't imagine it makes the code that much more complex. It adds a for
loop. Right?

> It seems we leave too many bits for the pfn, and the bits left for length
> are not enough. How about keeping 45 bits for the pfn and 19 bits for
> length? 45 bits for pfn can cover a 57-bit physical address, that should be
> enough in the near future.
>
> What's your opinion?

I still think 'order' makes a lot of sense. But, as you say, 57 bits is
enough for x86 for a while. Other architectures, who knows?
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for
> fast (de)inflating & fast live migration
>
> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> > What's the conclusion of your discussion? It seems you want some
> > statistics before deciding whether to rip the bitmap from the ABI,
> > am I right?
>
> I think Andrea and David feel pretty strongly that we should remove the
> bitmap, unless we have some data to support keeping it. I don't feel as
> strongly about it, but I think their critique of it is pretty valid. I
> think the consensus is that the bitmap needs to go.
>
> The only real question IMNHO is whether we should do a power-of-2 or a
> length. But, if we have 12 bits, then the argument for doing length is
> pretty strong. We don't need anywhere near 12 bits if doing power-of-2.

Just found that MAX_ORDER should be limited to 12 if we use length instead
of order; if MAX_ORDER is configured to a value bigger than 12, it will make
things more complex to handle this case.

If we use order, we need to break a large memory range whose length is not a
power of 2 into several small ranges; it also makes the code complex.

It seems we leave too many bits for the pfn, and the bits left for length
are not enough. How about keeping 45 bits for the pfn and 19 bits for
length? 45 bits for pfn can cover a 57-bit physical address, that should be
enough in the near future.

What's your opinion?

Thanks!
Liang
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
> fast (de)inflating & fast live migration > > Hello, > > On Fri, Dec 09, 2016 at 05:35:45AM +, Li, Liang Z wrote: > > > On 12/08/2016 08:45 PM, Li, Liang Z wrote: > > > > What's the conclusion of your discussion? It seems you want some > > > > statistic before deciding whether to ripping the bitmap from the > > > > ABI, am I right? > > > > > > I think Andrea and David feel pretty strongly that we should remove > > > the bitmap, unless we have some data to support keeping it. I don't > > > feel as strongly about it, but I think their critique of it is > > > pretty valid. I think the consensus is that the bitmap needs to go. > > > > > > > Thanks for you clarification. > > > > > The only real question IMNHO is whether we should do a power-of-2 or > > > a length. But, if we have 12 bits, then the argument for doing > > > length is pretty strong. We don't need anywhere near 12 bits if doing > power-of-2. > > > > > So each item can max represent 16MB Bytes, seems not big enough, but > > enough for most case. > > Things became much more simple without the bitmap, and I like simple > > solution too. :) > > > > I will prepare the v6 and remove all the bitmap related stuffs. Thank you > > all! > > Sounds great! > > I suggested to check the statistics, because collecting those stats looked > simpler and quicker than removing all bitmap related stuff from the patchset. > However if you prefer to prepare a v6 without the bitmap another perhaps > more interesting way to evaluate the usefulness of the bitmap is to just run > the same benchmark and verify that there is no regression compared to the > bitmap enabled code. > > The other issue with the bitmap is, the best case for the bitmap is ever less > likely to materialize the more RAM is added to the guest. 
It won't regress > linearly because after all there can be some locality bias in the buddy > splits, > but if sync compaction is used in the large order allocations tried before > reaching order 0, the bitmap payoff will regress close to linearly with the > increase of RAM. > > So it'd be good to check the stats or the benchmark on large guests, at least > one hundred gigabytes or so. > > Changing topic but still about the ABI features needed, so it may be relevant > for this discussion: > > 1) vNUMA locality: i.e. allowing host to specify which vNODEs to take >memory from, using alloc_pages_node in guest. So you can ask to >take X pages from vnode A, Y pages from vnode B, in one vmenter. > > 2) allowing qemu to tell the guest to stop inflating the balloon and >report a fragmentation limit being hit, when sync compaction >powered allocations fails at certain power-of-two order granularity >passed by qemu to the guest. This order constraint will be passed >by default for hugetlbfs guests with 2MB hpage size, while it can >be used optionally on THP backed guests. This option with THP >guests would allow a highlevel management software to provide a >"don't reduce guest performance" while shrinking the memory size of >the guest from the GUI. If you deselect the option, you can shrink >down to the last freeable 4k guest page, but doing so may have to >split THP in the host (you don't know for sure if they were really >THP but they could have been), and it may regress >performance. Inflating the balloon while passing a minimum >granularity "order" of the pages being zapped, will guarantee >inflating the balloon cannot decrease guest performance >instead. Plus it's needed for hugetlbfs anyway as far as I can >tell. hugetlbfs would not be host enforceable even if the idea is >not to free memory but only reduce the available memory of the >guest (not without major changes that maps a hugetlb page with 4k >ptes at least). 
While for a more cooperative usage of hugetlbfs >guests, it's simply not useful to inflate the balloon at anything >less than the "HPAGE_SIZE" hugetlbfs granularity. > > We also plan to use userfaultfd to make the balloon driver host enforced (will > work fine on hugetlbfs 2M and tmpfs too) but that's going to be invisible to > the ABI so it's not strictly relevant for this discussion. > > On a side note, registering userfaultfd on the ballooned range, will keep > khugepaged at bay so it won't risk to re-inflating the MADV_DONTNEED > zapped sub-THP fragments no matter the sysfs tunings. > Thanks for your elaboration! > Thanks! > Andrea
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
Hello, On Fri, Dec 09, 2016 at 05:35:45AM +, Li, Liang Z wrote: > > On 12/08/2016 08:45 PM, Li, Liang Z wrote: > > > What's the conclusion of your discussion? It seems you want some > > > statistic before deciding whether to ripping the bitmap from the ABI, > > > am I right? > > > > I think Andrea and David feel pretty strongly that we should remove the > > bitmap, unless we have some data to support keeping it. I don't feel as > > strongly about it, but I think their critique of it is pretty valid. I > > think the > > consensus is that the bitmap needs to go. > > > > Thanks for you clarification. > > > The only real question IMNHO is whether we should do a power-of-2 or a > > length. But, if we have 12 bits, then the argument for doing length is > > pretty > > strong. We don't need anywhere near 12 bits if doing power-of-2. > > > So each item can max represent 16MB Bytes, seems not big enough, > but enough for most case. > Things became much more simple without the bitmap, and I like simple solution > too. :) > > I will prepare the v6 and remove all the bitmap related stuffs. Thank you all! Sounds great! I suggested to check the statistics, because collecting those stats looked simpler and quicker than removing all bitmap related stuff from the patchset. However if you prefer to prepare a v6 without the bitmap another perhaps more interesting way to evaluate the usefulness of the bitmap is to just run the same benchmark and verify that there is no regression compared to the bitmap enabled code. The other issue with the bitmap is, the best case for the bitmap is ever less likely to materialize the more RAM is added to the guest. It won't regress linearly because after all there can be some locality bias in the buddy splits, but if sync compaction is used in the large order allocations tried before reaching order 0, the bitmap payoff will regress close to linearly with the increase of RAM. 
So it'd be good to check the stats or the benchmark on large guests, at least one hundred gigabytes or so. Changing topic but still about the ABI features needed, so it may be relevant for this discussion: 1) vNUMA locality: i.e. allowing host to specify which vNODEs to take memory from, using alloc_pages_node in guest. So you can ask to take X pages from vnode A, Y pages from vnode B, in one vmenter. 2) allowing qemu to tell the guest to stop inflating the balloon and report a fragmentation limit being hit, when sync compaction powered allocations fails at certain power-of-two order granularity passed by qemu to the guest. This order constraint will be passed by default for hugetlbfs guests with 2MB hpage size, while it can be used optionally on THP backed guests. This option with THP guests would allow a highlevel management software to provide a "don't reduce guest performance" while shrinking the memory size of the guest from the GUI. If you deselect the option, you can shrink down to the last freeable 4k guest page, but doing so may have to split THP in the host (you don't know for sure if they were really THP but they could have been), and it may regress performance. Inflating the balloon while passing a minimum granularity "order" of the pages being zapped, will guarantee inflating the balloon cannot decrease guest performance instead. Plus it's needed for hugetlbfs anyway as far as I can tell. hugetlbfs would not be host enforceable even if the idea is not to free memory but only reduce the available memory of the guest (not without major changes that maps a hugetlb page with 4k ptes at least). While for a more cooperative usage of hugetlbfs guests, it's simply not useful to inflate the balloon at anything less than the "HPAGE_SIZE" hugetlbfs granularity. 
We also plan to use userfaultfd to make the balloon driver host enforced (will work fine on hugetlbfs 2M and tmpfs too) but that's going to be invisible to the ABI so it's not strictly relevant for this discussion. On a side note, registering userfaultfd on the ballooned range, will keep khugepaged at bay so it won't risk to re-inflating the MADV_DONTNEED zapped sub-THP fragments no matter the sysfs tunings. Thanks! Andrea
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> > What's the conclusion of your discussion? It seems you want some
> > statistics before deciding whether to rip the bitmap from the ABI,
> > am I right?
>
> I think Andrea and David feel pretty strongly that we should remove the
> bitmap, unless we have some data to support keeping it. I don't feel as
> strongly about it, but I think their critique of it is pretty valid. I
> think the consensus is that the bitmap needs to go.

Thanks for your clarification.

> The only real question IMNHO is whether we should do a power-of-2 or a
> length. But, if we have 12 bits, then the argument for doing length is
> pretty strong. We don't need anywhere near 12 bits if doing power-of-2.

So each item can represent at most 16MB, which seems not that big, but is
enough for most cases.
Things became much simpler without the bitmap, and I like a simple solution
too. :)

I will prepare the v6 and remove all the bitmap related stuff. Thank you all!

Liang
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> What's the conclusion of your discussion? It seems you want some
> statistics before deciding whether to rip the bitmap from the ABI,
> am I right?

I think Andrea and David feel pretty strongly that we should remove the
bitmap, unless we have some data to support keeping it. I don't feel as
strongly about it, but I think their critique of it is pretty valid. I think
the consensus is that the bitmap needs to go.

The only real question IMNHO is whether we should do a power-of-2 or a
length. But, if we have 12 bits, then the argument for doing length is
pretty strong. We don't need anywhere near 12 bits if doing power-of-2.
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
> > 1. Current patches do a hypercall for each order in the allocator. This is inefficient, but independent from the underlying data structure in the ABI, unless bitmaps are in play, which they aren't. > > 2. Should we have bitmaps in the ABI, even if they are not in use by the guest implementation today? Andrea says they have zero benefits over a pfn/len scheme. Dave doesn't think they have zero benefits but isn't that attached to them. QEMU's handling gets more complicated when using a bitmap. > > 3. Should the ABI contain records each with a pfn/len pair or a pfn/order pair? > > 3a. 'len' is more flexible, but will always be a power-of-two anyway for high-order pages (the common case) > Len wouldn't be a power of two practically only if we detect adjacent pages of smaller order that may merge into larger orders we already allocated (or the other way around). > [addr=2M, len=2M] allocated at order 9 pass, [addr=4M, len=1M] allocated at order 8 pass -> merge as [addr=2M, len=3M] > Not sure if it would be worth it, but unless we do this, page-order or len won't make much difference. > > 3b. if we decide not to have a bitmap, then we basically have plenty of space for 'len' and should just do it > > 3c. It's easiest for the hypervisor to turn pfn/len into the madvise() calls that it needs. > > Did I miss anything? > I think you summarized all my arguments fine. > > FWIW, I don't feel that strongly about the bitmap. Li had one originally, but I think the code thus far has demonstrated a huge benefit without even having a bitmap. > > I've got no objections to ripping the bitmap out of the ABI.
> I think we need to see a statistic showing the number of bits set in each bitmap on average, after some uptime and lru churn, like running a stress-test app for a while with I/O and then inflating the balloon and counting: > 1) how many bits were set vs total number of bits used in bitmaps > 2) how many times bitmaps were used vs the bitmap_len = 0 case of a single page > My guess would be a very low percentage for both points. > So there is a connection with the MAX_ORDER..0 allocation loop and the ABI change, but I agree any of the ABIs proposed would still allow this logic to be used. Bitmap or not bitmap, the loop would still work. Hi guys, What's the conclusion of your discussion? It seems you want some statistics before deciding whether to rip the bitmap from the ABI, am I right? Thanks! Liang
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
> Subject: Re: [PATCH kernel v5 0/5] Extend virtio-balloon for fast > (de)inflating > & fast live migration > > On 12/07/2016 05:35 AM, Li, Liang Z wrote: > >> Am 30.11.2016 um 09:43 schrieb Liang Li: > >> IOW in real examples, do we have really large consecutive areas or > >> are all pages just completely distributed over our memory? > > > > The buddy system of Linux kernel memory management shows there > should > > be quite a lot of consecutive pages as long as there are a portion of > > free memory in the guest. > ... > > If all pages just completely distributed over our memory, it means the > > memory fragmentation is very serious, the kernel has the mechanism to > > avoid this happened. > > While it is correct that the kernel has anti-fragmentation mechanisms, I don't > think it invalidates the question as to whether a bitmap would be too sparse > to be effective. > > > In the other hand, the inflating should not happen at this time > > because the guest is almost 'out of memory'. > > I don't think this is correct. Most systems try to run with relatively > little free > memory all the time, using the bulk of it as page cache. We have no reason > to expect that ballooning will only occur when there is lots of actual free > memory and that it will not occur when that same memory is in use as page > cache. > Yes. > In these patches, you're effectively still sending pfns. You're just sending > one pfn per high-order page which is giving a really nice speedup. IMNHO, > you're avoiding doing a real bitmap because creating a bitmap means either > have a really big bitmap, or you would have to do some sorting (or multiple > passes) of the free lists before populating a smaller bitmap. > > Like David, I would still like to see some data on whether the choice between > bitmaps and pfn lists is ever clearly in favor of bitmaps. You haven't > convinced me, at least, that the data isn't even worth collecting. 
I will try to get some data with a real workload and share it with you guys. Thanks! Liang
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
On Wed, Dec 07, 2016 at 11:54:34AM -0800, Dave Hansen wrote: > We're talking about a bunch of different stuff which is all being conflated. There are 3 issues here that I can see. I'll attempt to summarize what I think is going on: > 1. Current patches do a hypercall for each order in the allocator. This is inefficient, but independent from the underlying data structure in the ABI, unless bitmaps are in play, which they aren't. > 2. Should we have bitmaps in the ABI, even if they are not in use by the guest implementation today? Andrea says they have zero benefits over a pfn/len scheme. Dave doesn't think they have zero benefits but isn't that attached to them. QEMU's handling gets more complicated when using a bitmap. > 3. Should the ABI contain records each with a pfn/len pair or a pfn/order pair? > 3a. 'len' is more flexible, but will always be a power-of-two anyway for high-order pages (the common case) Len wouldn't be a power of two practically only if we detect adjacent pages of smaller order that may merge into larger orders we already allocated (or the other way around). [addr=2M, len=2M] allocated at order 9 pass, [addr=4M, len=1M] allocated at order 8 pass -> merge as [addr=2M, len=3M] Not sure if it would be worth it, but unless we do this, page-order or len won't make much difference. > 3b. if we decide not to have a bitmap, then we basically have plenty of space for 'len' and should just do it > 3c. It's easiest for the hypervisor to turn pfn/len into the madvise() calls that it needs. > Did I miss anything? I think you summarized all my arguments fine. > FWIW, I don't feel that strongly about the bitmap. Li had one originally, but I think the code thus far has demonstrated a huge benefit without even having a bitmap. > I've got no objections to ripping the bitmap out of the ABI.
I think we need to see a statistic showing the number of bits set in each bitmap on average, after some uptime and lru churn, like running a stress-test app for a while with I/O and then inflating the balloon and counting: 1) how many bits were set vs total number of bits used in bitmaps 2) how many times bitmaps were used vs the bitmap_len = 0 case of a single page My guess would be a very low percentage for both points. > Surely we can think of a few ways... > A bitmap is 64x more dense if the lists are unordered. It means being able to store ~32k*2M=64G worth of 2M pages in one data page vs. ~1G. That's 64x fewer cachelines to touch, 64x fewer pages to move to the hypervisor and lets us allocate 1/64th the memory. Given a maximum allocation that we're allowed, it lets us do 64x more per-pass. > Now, are those benefits worth it? Maybe not, but let's not pretend they don't exist. ;) In the best case there are benefits obviously; the question is how common the best case is. The best case, if I understand correctly, is that no high order is available, but plenty of order-0 pages are available at phys address X, X+8k, X+16k, X+(8k*nr_bits_in_bitmap). How common is it that order-0 pages exist but they're not at an address < X or > X+(8k*nr_bits_in_bitmap)? > Yes, the current code sends one batch of pages up to the hypervisor per order. But, this has nothing to do with the underlying data structure, or the choice to have an order vs. len in the ABI. > What you describe here is obviously more efficient. And it isn't possible with the current ABI. So there is a connection with the MAX_ORDER..0 allocation loop and the ABI change, but I agree any of the ABIs proposed would still allow this logic to be used. Bitmap or not bitmap, the loop would still work.
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
We're talking about a bunch of different stuff which is all being conflated. There are 3 issues here that I can see. I'll attempt to summarize what I think is going on: 1. Current patches do a hypercall for each order in the allocator. This is inefficient, but independent from the underlying data structure in the ABI, unless bitmaps are in play, which they aren't. 2. Should we have bitmaps in the ABI, even if they are not in use by the guest implementation today? Andrea says they have zero benefits over a pfn/len scheme. Dave doesn't think they have zero benefits but isn't that attached to them. QEMU's handling gets more complicated when using a bitmap. 3. Should the ABI contain records each with a pfn/len pair or a pfn/order pair? 3a. 'len' is more flexible, but will always be a power-of-two anyway for high-order pages (the common case) 3b. if we decide not to have a bitmap, then we basically have plenty of space for 'len' and should just do it 3c. It's easiest for the hypervisor to turn pfn/len into the madvise() calls that it needs. Did I miss anything? On 12/07/2016 10:38 AM, Andrea Arcangeli wrote: > On Wed, Dec 07, 2016 at 08:57:01AM -0800, Dave Hansen wrote: >> It is more space-efficient. We're fitting the order into 6 bits, which >> would allows the full 2^64 address space to be represented in one entry, > > Very large order is the same as very large len, 6 bits of order or 8 > bytes of len won't really move the needle here, simpler code is > preferable. Agreed. But without seeing them side-by-side I'm not sure we can say which is simpler. > The main benefit of "len" is that it can be more granular, plus it's > simpler than the bitmap too. Eventually all this stuff has to end up > into a madvisev (not yet upstream but somebody posted it for jemalloc > and should get merged eventually). 
> > So the bitmap shall be demuxed to a addr,len array anyway, the bitmap > won't ever be sent to the madvise syscall, which makes the > intermediate representation with the bitmap a complication with > basically no benefits compared to a (N, [addr1,len1], .., [addrN, > lenN]) representation. FWIW, I don't feel that strongly about the bitmap. Li had one originally, but I think the code thus far has demonstrated a huge benefit without even having a bitmap. I've got no objections to ripping the bitmap out of the ABI. >> and leaves room for the bitmap size to be encoded as well, if we decide >> we need a bitmap in the future. > > How would a bitmap ever be useful with very large page-order? Surely we can think of a few ways... A bitmap is 64x more dense if the lists are unordered. It means being able to store ~32k*2M=64G worth of 2M pages in one data page vs. ~1G. That's 64x fewer cachelines to touch, 64x fewer pages to move to the hypervisor and lets us allocate 1/64th the memory. Given a maximum allocation that we're allowed, it lets us do 64x more per-pass. Now, are those benefits worth it? Maybe not, but let's not pretend they don't exist. ;) >> If that was purely a length, we'd be limited to 64*4k pages per entry, >> which isn't even a full large page. > > I don't follow here. > > What we suggest is to send the data down represented as (N, > [addr1,len1], ..., [addrN, lenN]) which allows infinite ranges each > one of maximum length 2^64, so 2^64 multiplied infinite times if you > wish. Simplifying the code and not having any bitmap at all and no :6 > :6 bits either. > > The high order to low order loop of allocations is the interesting part > here, not the bitmap, and the fact of doing a single vmexit to send > the large ranges. Yes, the current code sends one batch of pages up to the hypervisor per order. But, this has nothing to do with the underlying data structure, or the choice to have an order vs. len in the ABI. 
What you describe here is obviously more efficient. > Considering the loop that allocates starting from MAX_ORDER..1, the > chance the bitmap is actually getting filled with more than one bit at > page_shift of PAGE_SHIFT should be very low after some uptime. Yes, if bitmaps were in use, this is true. I think a guest populating bitmaps would probably not use the same algorithm. > By the very nature of this loop, if we have already exhausted all high > order buddies, the page-order 0 pages obtained are going to be fairly > fragmented, reducing the usefulness of the bitmap and potentially only > wasting CPU/memory.
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
On Wed, Dec 07, 2016 at 10:44:31AM -0800, Dave Hansen wrote: > On 12/07/2016 10:38 AM, Andrea Arcangeli wrote: > > > and leaves room for the bitmap size to be encoded as well, if we decide > > > we need a bitmap in the future. > > How would a bitmap ever be useful with very large page-order? > Please, guys. Read the patches. *Please*. I did read the code, but you didn't answer my question. Why should a feature exist in the code that will never be useful? Why do you think we could ever decide we'll need the bitmap in the future for high-order pages? > The current code doesn't even _use_ a bitmap. It's not using it right now; my question is exactly when it will ever use it. Leaving the bitmap only for order-0 allocations, after you have already wiped all higher page orders from the buddy, doesn't seem like a very good idea overall, as the chance you got order-0 pages with close physical addresses doesn't seem very high. It would be high if the loop that eats into every possible higher order didn't run first, but that loop has just run and already wiped everything. Also note, we need to call compaction very aggressively before falling back from order 9 down to order 8. Ideally we should never use the page_shift = PAGE_SHIFT case at all! Which leaves the bitmap at best as an optimization for a case that is already suboptimal. If the bitmap starts to pay off, it means the admin made a mistake and shrunk the guest too much.
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
On 12/07/2016 10:38 AM, Andrea Arcangeli wrote: >> > and leaves room for the bitmap size to be encoded as well, if we decide >> > we need a bitmap in the future. > How would a bitmap ever be useful with very large page-order? Please, guys. Read the patches. *Please*. The current code doesn't even _use_ a bitmap.
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
Hello, On Wed, Dec 07, 2016 at 08:57:01AM -0800, Dave Hansen wrote: > It is more space-efficient. We're fitting the order into 6 bits, which > would allow the full 2^64 address space to be represented in one entry, Very large order is the same as very large len; 6 bits of order or 8 bytes of len won't really move the needle here, and simpler code is preferable. The main benefit of "len" is that it can be more granular, plus it's simpler than the bitmap too. Eventually all this stuff has to end up in a madvisev (not yet upstream, but somebody posted it for jemalloc and it should get merged eventually). So the bitmap shall be demuxed to an addr,len array anyway; the bitmap won't ever be sent to the madvise syscall, which makes the intermediate representation with the bitmap a complication with basically no benefits compared to a (N, [addr1,len1], .., [addrN, lenN]) representation. If you prefer 1 byte of order (not just 6 bits) instead of 8 bytes of len, that's possible too; I wouldn't be against that, and the conversion before calling madvise would be pretty efficient too. > and leaves room for the bitmap size to be encoded as well, if we decide > we need a bitmap in the future. How would a bitmap ever be useful with very large page-order? > If that was purely a length, we'd be limited to 64*4k pages per entry, > which isn't even a full large page. I don't follow here. What we suggest is to send the data down represented as (N, [addr1,len1], ..., [addrN, lenN]) which allows infinite ranges each one of maximum length 2^64, so 2^64 multiplied infinite times if you wish. Simplifying the code and not having any bitmap at all and no :6 :6 bits either. The high order to low order loop of allocations is the interesting part here, not the bitmap, and the fact of doing a single vmexit to send the large ranges. Once we pull out the largest order regions, we just add them to the array as [addr, 1UL<<order].
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
Removing silly virtio-dev@ list because it's bouncing mail... On 12/07/2016 08:21 AM, David Hildenbrand wrote: >> Li's current patches do that. Well, maybe not pfn/length, but they do >> take a pfn and page-order, which fits perfectly with the kernel's >> concept of high-order pages. > > So we can send length in powers of two. Still, I don't see any benefit > over a simple pfn/len schema. But I'll have a more detailed look at the > implementation first, maybe that will enlighten me :) It is more space-efficient. We're fitting the order into 6 bits, which would allow the full 2^64 address space to be represented in one entry, and leaves room for the bitmap size to be encoded as well, if we decide we need a bitmap in the future. If that was purely a length, we'd be limited to 64*4k pages per entry, which isn't even a full large page.
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
> > > I did something similar. Filled the balloon with 15GB for a 16GB idle guest, by using bitmap, the madvise count was reduced to 605. when using the PFNs, the madvise count was 3932160. It means there are quite a lot consecutive bits in the bitmap. I didn't test for a guest with heavy memory workload. > > Would it then even make sense to go one step further and report {pfn, length} combinations? So simply send over an array of {pfn, length}? > Li's current patches do that. Well, maybe not pfn/length, but they do take a pfn and page-order, which fits perfectly with the kernel's concept of high-order pages. So we can send length in powers of two. Still, I don't see any benefit over a simple pfn/len schema. But I'll have a more detailed look at the implementation first, maybe that will enlighten me :) > > And it makes sense if you think about: > > a) hugetlb backing: The host may only be able to free huge pages (we might want to communicate that to the guest later, that's another story). Still we would have to send bitmaps full of 4k frames (512 bits for 2mb frames). Of course, we could add a way to communicate that we are using a different bitmap-granularity. > Yeah, please read the patches. If they're not clear, then the descriptions need work, but this is done already. I missed the page_shift, thanks for the hint. -- David
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
On 12/07/2016 07:42 AM, David Hildenbrand wrote: > Am 07.12.2016 um 14:35 schrieb Li, Liang Z: >>> Am 30.11.2016 um 09:43 schrieb Liang Li: This patch set contains two parts of changes to the virtio-balloon. One is the change for speeding up the inflating & deflating process, the main idea of this optimization is to use bitmap to send the page information to host instead of the PFNs, to reduce the overhead of virtio data transmission, address translation and madvise(). This can help to improve the performance by about 85%. >>> >>> Do you have some statistics/some rough feeling how many consecutive >>> bits are >>> usually set in the bitmaps? Is it really just purely random or is >>> there some >>> granularity that is usually consecutive? >>> >> >> I did something similar. Filled the balloon with 15GB for a 16GB idle >> guest, by >> using bitmap, the madvise count was reduced to 605. when using the >> PFNs, the madvise count >> was 3932160. It means there are quite a lot consecutive bits in the >> bitmap. >> I didn't test for a guest with heavy memory workload. > > Would it then even make sense to go one step further and report {pfn, > length} combinations? > > So simply send over an array of {pfn, length}? Li's current patches do that. Well, maybe not pfn/length, but they do take a pfn and page-order, which fits perfectly with the kernel's concept of high-order pages. > And it makes sense if you think about: > > a) hugetlb backing: The host may only be able to free huge pages (we > might want to communicate that to the guest later, that's another > story). Still we would have to send bitmaps full of 4k frames (512 bits > for 2mb frames). Of course, we could add a way to communicate that we > are using a different bitmap-granularity. Yeah, please read the patches. If they're not clear, then the descriptions need work, but this is done already.
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
Am 07.12.2016 um 14:35 schrieb Li, Liang Z: > > Am 30.11.2016 um 09:43 schrieb Liang Li: > > > This patch set contains two parts of changes to the virtio-balloon. One is the change for speeding up the inflating & deflating process, the main idea of this optimization is to use bitmap to send the page information to host instead of the PFNs, to reduce the overhead of virtio data transmission, address translation and madvise(). This can help to improve the performance by about 85%. > > Do you have some statistics/some rough feeling how many consecutive bits are usually set in the bitmaps? Is it really just purely random or is there some granularity that is usually consecutive? > I did something similar. Filled the balloon with 15GB for a 16GB idle guest, by using bitmap, the madvise count was reduced to 605. when using the PFNs, the madvise count was 3932160. It means there are quite a lot consecutive bits in the bitmap. I didn't test for a guest with heavy memory workload. Would it then even make sense to go one step further and report {pfn, length} combinations? So simply send over an array of {pfn, length}? This idea came up when talking to Andrea Arcangeli (put him on cc). And it makes sense if you think about: a) hugetlb backing: The host may only be able to free huge pages (we might want to communicate that to the guest later, that's another story). Still we would have to send bitmaps full of 4k frames (512 bits for 2mb frames). Of course, we could add a way to communicate that we are using a different bitmap-granularity. b) if we really inflate huge memory regions (and it sounds like that according to your measurements), we can minimize the communication to the hypervisor and therefore the madvise calls. c) we don't want to optimize for inflating guests with almost full memory (and therefore little consecutive memory areas) - my opinion :) Thanks for the explanation! -- David
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
On 12/07/2016 05:35 AM, Li, Liang Z wrote: >> Am 30.11.2016 um 09:43 schrieb Liang Li: >> IOW in real examples, do we have really large consecutive areas or are all >> pages just completely distributed over our memory? > > The buddy system of Linux kernel memory management shows there should > be quite a lot of consecutive pages as long as there are a portion of > free memory in the guest. ... > If all pages just completely distributed over our memory, it means > the memory fragmentation is very serious, the kernel has the > mechanism to avoid this happened. While it is correct that the kernel has anti-fragmentation mechanisms, I don't think it invalidates the question as to whether a bitmap would be too sparse to be effective. > In the other hand, the inflating should not happen at this time because the > guest is almost > 'out of memory'. I don't think this is correct. Most systems try to run with relatively little free memory all the time, using the bulk of it as page cache. We have no reason to expect that ballooning will only occur when there is lots of actual free memory and that it will not occur when that same memory is in use as page cache. In these patches, you're effectively still sending pfns. You're just sending one pfn per high-order page which is giving a really nice speedup. IMNHO, you're avoiding doing a real bitmap because creating a bitmap means either have a really big bitmap, or you would have to do some sorting (or multiple passes) of the free lists before populating a smaller bitmap. Like David, I would still like to see some data on whether the choice between bitmaps and pfn lists is ever clearly in favor of bitmaps. You haven't convinced me, at least, that the data isn't even worth collecting.
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
> Am 30.11.2016 um 09:43 schrieb Liang Li: > > This patch set contains two parts of changes to the virtio-balloon. > > > > One is the change for speeding up the inflating & deflating process, > > the main idea of this optimization is to use bitmap to send the page > > information to host instead of the PFNs, to reduce the overhead of > > virtio data transmission, address translation and madvise(). This can > > help to improve the performance by about 85%. > > Do you have some statistics/some rough feeling how many consecutive bits are > usually set in the bitmaps? Is it really just purely random or is there some > granularity that is usually consecutive? > I did something similar. Filled the balloon with 15GB for a 16GB idle guest: by using the bitmap, the madvise count was reduced to 605; when using the PFNs, the madvise count was 3932160. It means there are quite a lot of consecutive bits in the bitmap. I didn't test for a guest with a heavy memory workload. > IOW in real examples, do we have really large consecutive areas or are all > pages just completely distributed over our memory? > The buddy system of the Linux kernel memory management suggests there should be quite a lot of consecutive pages as long as a portion of the guest's memory is free. If all pages were completely distributed over our memory, it would mean the memory fragmentation is very serious; the kernel has mechanisms to avoid that happening. On the other hand, inflating should not happen at that point, because the guest is almost 'out of memory'. Liang > Thanks! > > -- > > David
Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
Am 30.11.2016 um 09:43 schrieb Liang Li: > This patch set contains two parts of changes to the virtio-balloon. > One is the change for speeding up the inflating & deflating process, the main idea of this optimization is to use bitmap to send the page information to host instead of the PFNs, to reduce the overhead of virtio data transmission, address translation and madvise(). This can help to improve the performance by about 85%. Do you have some statistics/some rough feeling how many consecutive bits are usually set in the bitmaps? Is it really just purely random or is there some granularity that is usually consecutive? IOW in real examples, do we have really large consecutive areas or are all pages just completely distributed over our memory? Thanks! -- David