Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-17 Thread Li, Liang Z
> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for
> fast (de)inflating & fast live migration
> 
> On Thu, Dec 15, 2016 at 05:40:45PM -0800, Dave Hansen wrote:
> > On 12/15/2016 05:38 PM, Li, Liang Z wrote:
> > >
> > > Use 52 bits for 'pfn' and 12 bits for 'length'; when the 12 bits are not
> > > enough for the 'length', set the 'length' to a special value to indicate
> > > the "actual length in next 8 bytes".
> > >
> > > That will be much simpler. Right?
> >
> > Sounds fine to me.
> >
> 
> Sounds fine to me too indeed.
> 
> I'm only wondering what the major point is of compressing gpfn+len into
> 8 bytes in the common case; you already use sg_init_table to send down two
> pages, we could send three as well and avoid all the math and bit shifts
> and ors, or not?
> 

Yes, we can use more pages for that.

> I agree with the above because from a performance perspective I tend to
> think the above proposal will run at least theoretically faster: the
> other way wastes double the amount of CPU cache, and bit mangling in the
> encoding and the later decoding on the qemu side should be faster than
> accessing an array of double size. But then I'm not sure if it's a
> measurable optimization. So I'd be curious to know the exact motivation
> and if it is to reduce the CPU cache usage or if there's some other
> fundamental reason to compress it.
> The header already tells qemu how big the array payload is, couldn't we just
> add more pages if one isn't enough?
> 

The original intention of compressing the PFN and length was to reduce the
memory required. Even though the code has changed a lot since the previous
versions, I think this is still true.

Now we allocate a buffer of a specified size to hold the 'PFN|length'
records; when the buffer is not big enough to hold all the page info for a
given order, a buffer of double the size will be allocated. This is what we
want to avoid, because the allocation may fail and allocation takes some
time. For fast live migration, time is a critical factor we have to
consider: the more time it takes, the more unnecessary pages are sent,
because live migration starts before the request for unused pages gets a
response.

Thanks

Liang

> Thanks,
> Andrea



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-17 Thread Li, Liang Z
> On Fri, Dec 16, 2016 at 01:12:21AM +0000, Li, Liang Z wrote:
> > There still exists the case where MAX_ORDER is configured to a large
> > value, e.g. 36 for a system with a huge amount of memory; then there are
> > only 28 bits left for the pfn, which is not enough.
> 
> Not related to the balloon but how would it help to set MAX_ORDER to 36?
> 

My point here is that MAX_ORDER may be configured to a big value.

> What the MAX_ORDER affects is that you won't be able to ask the kernel
> page allocator for contiguous memory bigger than 1<<(MAX_ORDER-1), but
> that's a driver issue not relevant to the amount of RAM. Drivers won't
> suddenly start to ask the kernel allocator to allocate compound pages at
> orders >= 11 just because more RAM was added.
> 
> The higher the MAX_ORDER the slower the kernel runs, so quite simply the
> smaller the MAX_ORDER the better.
> 
> > Should we limit MAX_ORDER? I don't think so.
> 
> We shouldn't strictly depend on the MAX_ORDER value, but it's mostly limited
> already even if configurable at build time.
> 

I didn't know that and will take a look, thanks for your information.


Liang
> We definitely need it to reach at least the hugepage size, then it's mostly
> a driver issue, but drivers requiring large contiguous allocations should
> rely on CMA only, or vmalloc if they only require it virtually contiguous,
> and not rely on a larger MAX_ORDER that would slow down all kernel
> allocations/freeing.



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-16 Thread Andrea Arcangeli
On Thu, Dec 15, 2016 at 05:40:45PM -0800, Dave Hansen wrote:
> On 12/15/2016 05:38 PM, Li, Liang Z wrote:
> > 
> > Use 52 bits for 'pfn' and 12 bits for 'length'; when the 12 bits are not
> > enough for the 'length', set the 'length' to a special value to indicate
> > the "actual length in next 8 bytes".
> > 
> > That will be much simpler. Right?
> 
> Sounds fine to me.
> 

Sounds fine to me too indeed.

I'm only wondering what the major point is of compressing gpfn+len into
8 bytes in the common case; you already use sg_init_table to send down
two pages, we could send three as well and avoid all the math and bit
shifts and ors, or not?

I agree with the above because from a performance perspective I tend
to think the above proposal will run at least theoretically faster:
the other way wastes double the amount of CPU cache, and bit
mangling in the encoding and the later decoding on the qemu side
should be faster than accessing an array of double size. But then I'm
not sure if it's a measurable optimization. So I'd be curious to know
the exact motivation and if it is to reduce the CPU cache usage or if
there's some other fundamental reason to compress it.

The header already tells qemu how big the array payload is, couldn't
we just add more pages if one isn't enough?

Thanks,
Andrea



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-16 Thread Andrea Arcangeli
On Fri, Dec 16, 2016 at 01:12:21AM +0000, Li, Liang Z wrote:
> There still exists the case where MAX_ORDER is configured to a large value,
> e.g. 36 for a system with a huge amount of memory; then there are only
> 28 bits left for the pfn, which is not enough.

Not related to the balloon but how would it help to set MAX_ORDER to
36?

What the MAX_ORDER affects is that you won't be able to ask the kernel
page allocator for contiguous memory bigger than 1<<(MAX_ORDER-1), but
that's a driver issue not relevant to the amount of RAM. Drivers won't
suddenly start to ask the kernel allocator to allocate compound pages
at orders >= 11 just because more RAM was added.

The higher the MAX_ORDER the slower the kernel runs, so quite simply
the smaller the MAX_ORDER the better.

> Should we limit MAX_ORDER? I don't think so.

We shouldn't strictly depend on the MAX_ORDER value, but it's mostly
limited already even if configurable at build time.

We definitely need it to reach at least the hugepage size, then it's
mostly a driver issue, but drivers requiring large contiguous
allocations should rely on CMA only, or vmalloc if they only require it
virtually contiguous, and not rely on a larger MAX_ORDER that would
slow down all kernel allocations/freeing.
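
For concreteness, the limit described above can be computed directly. A
minimal sketch of the arithmetic, assuming the common x86 defaults of 4K
pages and MAX_ORDER = 11 (MAX_ORDER is build-time configurable):

#include <stdio.h>

#define PAGE_SIZE 4096UL	/* assumed 4K base pages */
#define MAX_ORDER 11		/* typical default, build-time configurable */

int main(void)
{
	/* The buddy allocator serves at most order MAX_ORDER-1, i.e.
	 * 1 << (MAX_ORDER-1) contiguous base pages per allocation. */
	unsigned long pages = 1UL << (MAX_ORDER - 1);

	printf("largest contiguous allocation: %lu pages = %lu MiB\n",
	       pages, (pages * PAGE_SIZE) >> 20);	/* 1024 pages = 4 MiB */
	return 0;
}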



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-15 Thread Li, Liang Z
> On 12/15/2016 05:38 PM, Li, Liang Z wrote:
> >
> > Use 52 bits for 'pfn' and 12 bits for 'length'; when the 12 bits are not
> > enough for the 'length', set the 'length' to a special value to indicate
> > the "actual length in next 8 bytes".
> >
> > That will be much simpler. Right?
> 
> Sounds fine to me.

Thanks for your inspiration!

Liang



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-15 Thread Dave Hansen
On 12/15/2016 05:38 PM, Li, Liang Z wrote:
> 
> Use 52 bits for 'pfn' and 12 bits for 'length'; when the 12 bits are not
> enough for the 'length', set the 'length' to a special value to indicate
> the "actual length in next 8 bytes".
> 
> That will be much simpler. Right?

Sounds fine to me.



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-15 Thread Li, Liang Z
> On 12/15/2016 04:48 PM, Li, Liang Z wrote:
> >>> It seems we leave too many bits for the pfn, and the bits left for the
> >>> length are not enough. How about keeping 45 bits for the pfn and 19 bits
> >>> for the length? 45 bits for the pfn can cover a 57-bit physical address,
> >>> and that should be enough in the near future.
> >>> What's your opinion?
> >> I still think 'order' makes a lot of sense.  But, as you say, 57 bits
> >> is enough for x86 for a while.  Other architectures, who knows?
> 
> Thinking about this some more...  There are really only two cases that
> matter: 4k pages and "much bigger" ones.
> 
> Squeezing each 4k page into 8 bytes of metadata helps guarantee that this
> scheme won't regress over the old scheme in any case.  For bigger ranges, 8
> vs 16 bytes means *nothing*.  And 16 bytes will be as good or better than
> the old scheme for everything which is >4k.
> 
> How about this:
>  * 52 bits of 'pfn', 5 bits of 'order', 7 bits of 'length'
>  * One special 'length' value to mean "actual length in next 8 bytes"
> 
> That should be pretty simple to produce and decode.  We have two record
> sizes, but I think it is manageable.

It works. Now that we intend to use another 8 bytes for the length,
why not:

Use 52 bits for 'pfn' and 12 bits for 'length'; when the 12 bits are not
enough for the 'length', set the 'length' to a special value to indicate
the "actual length in next 8 bytes".

That will be much simpler. Right?

Liang
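
A minimal sketch of the record format proposed above. The 52/12 split is
from this thread; the helper names and the choice of the all-ones length as
the escape value are illustrative assumptions, not the final ABI:

#include <stdint.h>

#define LEN_BITS	12
#define LEN_MASK	((1ULL << LEN_BITS) - 1)
#define LEN_ESCAPE	LEN_MASK	/* "actual length in next 8 bytes" */

/* One record covers up to (2^12 - 2) 4K pages, i.e. just under 16MB;
 * longer runs spill the real length into a second 8-byte word.
 * Returns the number of 8-byte words written (1 or 2). */
static inline int encode_range(uint64_t *out, uint64_t pfn, uint64_t len)
{
	if (len < LEN_ESCAPE) {
		out[0] = (pfn << LEN_BITS) | len;
		return 1;
	}
	out[0] = (pfn << LEN_BITS) | LEN_ESCAPE;
	out[1] = len;			/* real length in the next word */
	return 2;
}

static inline int decode_range(const uint64_t *in, uint64_t *pfn,
			       uint64_t *len)
{
	*pfn = in[0] >> LEN_BITS;
	*len = in[0] & LEN_MASK;
	if (*len != LEN_ESCAPE)
		return 1;
	*len = in[1];			/* escape: length follows */
	return 2;
}

This keeps the common case at 8 bytes per free range while still allowing
arbitrarily long runs.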



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-15 Thread Li, Liang Z
> On Thu, Dec 15, 2016 at 07:34:33AM -0800, Dave Hansen wrote:
> > On 12/14/2016 12:59 AM, Li, Liang Z wrote:
> > >> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend
> > >> virtio-balloon for fast (de)inflating & fast live migration
> > >>
> > >> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> > >>> What's the conclusion of your discussion? It seems you want some
> > >>> statistics before deciding whether to rip the bitmap from the
> > >>> ABI, am I right?
> > >>
> > >> I think Andrea and David feel pretty strongly that we should remove
> > >> the bitmap, unless we have some data to support keeping it.  I
> > >> don't feel as strongly about it, but I think their critique of it
> > >> is pretty valid.  I think the consensus is that the bitmap needs to go.
> > >>
> > >> The only real question IMNHO is whether we should do a power-of-2
> > >> or a length.  But, if we have 12 bits, then the argument for doing
> > >> length is pretty strong.  We don't need anywhere near 12 bits if doing
> power-of-2.
> > >
> > > Just found that MAX_ORDER should be limited to 12 if we use a length
> > > instead of an order; if MAX_ORDER is configured to a value bigger
> > > than 12, it will make things more complex to handle this case.
> > >
> > > If we use an order, we need to break a large memory range whose length
> > > is not a power of 2 into several small ranges, which also makes the
> > > code complex.
> >
> > I can't imagine it makes the code that much more complex.  It adds a
> > for loop.  Right?
> >
> > > It seems we leave too many bits for the pfn, and the bits left for the
> > > length are not enough. How about keeping 45 bits for the pfn and 19 bits
> > > for the length? 45 bits for the pfn can cover a 57-bit physical address,
> > > and that should be enough in the near future.
> > >
> > > What's your opinion?
> >
> > I still think 'order' makes a lot of sense.  But, as you say, 57 bits
> > is enough for x86 for a while.  Other architectures, who knows?
> 
> I think you can probably assume page size >= 4K. But I would not want to
> make any other assumptions. E.g. there are systems that absolutely require
> you to set high bits for DMA.
> 
> I think we really want both length and order.
> 
> I understand how you are trying to pack them as tightly as possible.
> 
> However, I thought of a trick: we don't need to encode all possible orders.
> For example, with 2 bits of order, we can make them mean:
> 00 - 4K pages
> 01 - 2M pages
> 02 - 1G pages
> 
> The guest can program the sizes for each order through config space.
> 
> We will have 10 bits left for length.
> 

Please don't, we just got rid of the bitmap for simplification. :)

> It might make sense to also allow the guest to program the number of bits
> used for the order; this will make it easy to extend without host changes.
> 

There still exists the case where MAX_ORDER is configured to a large value,
e.g. 36 for a system with a huge amount of memory; then there are only
28 bits left for the pfn, which is not enough.
Should we limit MAX_ORDER? I don't think so.

It seems using 'order' is better.

Thanks!
Liang
> --
> MST



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-15 Thread Dave Hansen
On 12/15/2016 04:48 PM, Li, Liang Z wrote:
>>> It seems we leave too many bits for the pfn, and the bits left for the
>>> length are not enough. How about keeping 45 bits for the pfn and 19 bits
>>> for the length? 45 bits for the pfn can cover a 57-bit physical address,
>>> and that should be enough in the near future.
>>> What's your opinion?
>> I still think 'order' makes a lot of sense.  But, as you say, 57 bits is
>> enough for x86 for a while.  Other architectures, who knows?

Thinking about this some more...  There are really only two cases that
matter: 4k pages and "much bigger" ones.

Squeezing each 4k page into 8 bytes of metadata helps guarantee that
this scheme won't regress over the old scheme in any case.  For bigger
ranges, 8 vs 16 bytes means *nothing*.  And 16 bytes will be as good or
better than the old scheme for everything which is >4k.

How about this:
 * 52 bits of 'pfn', 5 bits of 'order', 7 bits of 'length'
 * One special 'length' value to mean "actual length in next 8 bytes"

That should be pretty simple to produce and decode.  We have two record
sizes, but I think it is manageable.
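
A sketch of how the two-record layout above could be packed. The 52/5/7
split comes from the mail; the bit positions, the escape value, and the
reading that 'length' counts pages of the given order are assumptions:

#include <stdint.h>

/* 52 bits of pfn | 5 bits of order | 7 bits of length, in one 64-bit word. */
#define REC_LEN_BITS	7
#define REC_ORDER_BITS	5
#define REC_LEN_MASK	((1ULL << REC_LEN_BITS) - 1)
#define REC_ORDER_MASK	((1ULL << REC_ORDER_BITS) - 1)
#define REC_LEN_ESCAPE	REC_LEN_MASK	/* "actual length in next 8 bytes" */

/* Build one record; 'len' counts pages of size 2^order (assumed).
 * Returns the number of 8-byte words written (1 or 2). */
static inline int mk_record(uint64_t *out, uint64_t pfn,
			    unsigned int order, uint64_t len)
{
	uint64_t hdr = (pfn << (REC_ORDER_BITS + REC_LEN_BITS)) |
		       ((uint64_t)(order & REC_ORDER_MASK) << REC_LEN_BITS);

	if (len < REC_LEN_ESCAPE) {
		out[0] = hdr | len;	/* 4k pages squeeze into 8 bytes */
		return 1;
	}
	out[0] = hdr | REC_LEN_ESCAPE;
	out[1] = len;			/* 16-byte record for big ranges */
	return 2;
}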



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-15 Thread Li, Liang Z
> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for
> fast (de)inflating & fast live migration
> 
> On 12/14/2016 12:59 AM, Li, Liang Z wrote:
> >> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon
> >> for fast (de)inflating & fast live migration
> >>
> >> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> >>> What's the conclusion of your discussion? It seems you want some
> >>> statistics before deciding whether to rip the bitmap from the
> >>> ABI, am I right?
> >>
> >> I think Andrea and David feel pretty strongly that we should remove
> >> the bitmap, unless we have some data to support keeping it.  I don't
> >> feel as strongly about it, but I think their critique of it is pretty
> >> valid.  I think the consensus is that the bitmap needs to go.
> >>
> >> The only real question IMNHO is whether we should do a power-of-2 or
> >> a length.  But, if we have 12 bits, then the argument for doing
> >> length is pretty strong.  We don't need anywhere near 12 bits if doing
> power-of-2.
> >
> > Just found that MAX_ORDER should be limited to 12 if we use a length
> > instead of an order; if MAX_ORDER is configured to a value bigger than 12,
> > it will make things more complex to handle this case.
> >
> > If we use an order, we need to break a large memory range whose length is
> > not a power of 2 into several small ranges, which also makes the code
> > complex.
> 
> I can't imagine it makes the code that much more complex.  It adds a for loop.
> Right?
> 

Yes, just a little. :)

> > It seems we leave too many bits for the pfn, and the bits left for the
> > length are not enough. How about keeping 45 bits for the pfn and 19 bits
> > for the length? 45 bits for the pfn can cover a 57-bit physical address,
> > and that should be enough in the near future.
> >
> > What's your opinion?
> 
> I still think 'order' makes a lot of sense.  But, as you say, 57 bits is
> enough for x86 for a while.  Other architectures, who knows?

Yes. 



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-15 Thread Michael S. Tsirkin
On Thu, Dec 15, 2016 at 07:34:33AM -0800, Dave Hansen wrote:
> On 12/14/2016 12:59 AM, Li, Liang Z wrote:
> >> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for
> >> fast (de)inflating & fast live migration
> >>
> >> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> >>> What's the conclusion of your discussion? It seems you want some
> >>> statistics before deciding whether to rip the bitmap from the ABI,
> >>> am I right?
> >>
> >> I think Andrea and David feel pretty strongly that we should remove the
> >> bitmap, unless we have some data to support keeping it.  I don't feel as
> >> strongly about it, but I think their critique of it is pretty valid.  I 
> >> think the
> >> consensus is that the bitmap needs to go.
> >>
> >> The only real question IMNHO is whether we should do a power-of-2 or a
> >> length.  But, if we have 12 bits, then the argument for doing length is 
> >> pretty
> >> strong.  We don't need anywhere near 12 bits if doing power-of-2.
> > 
> > Just found that MAX_ORDER should be limited to 12 if we use a length
> > instead of an order; if MAX_ORDER is configured to a value bigger than 12,
> > it will make things more complex to handle this case.
> > 
> > If we use an order, we need to break a large memory range whose length is
> > not a power of 2 into several small ranges, which also makes the code
> > complex.
> 
> I can't imagine it makes the code that much more complex.  It adds a for
> loop.  Right?
> 
> > It seems we leave too many bits for the pfn, and the bits left for the
> > length are not enough. How about keeping 45 bits for the pfn and 19 bits
> > for the length? 45 bits for the pfn can cover a 57-bit physical address,
> > and that should be enough in the near future.
> > 
> > What's your opinion?
> 
> I still think 'order' makes a lot of sense.  But, as you say, 57 bits is
> enough for x86 for a while.  Other architectures, who knows?

I think you can probably assume page size >= 4K. But I would not want
to make any other assumptions. E.g. there are systems that absolutely
require you to set high bits for DMA.

I think we really want both length and order.

I understand how you are trying to pack them as tightly as possible.

However, I thought of a trick: we don't need to encode all
possible orders. For example, with 2 bits of order,
we can make them mean:
00 - 4K pages
01 - 2M pages
02 - 1G pages

The guest can program the sizes for each order through config space.

We will have 10 bits left for length.

It might make sense to also allow the guest to program the number of bits
used for the order; this will make it easy to extend without
host changes.

-- 
MST
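
A sketch of the trick described above, with the order field as an index
into a table of page sizes that the guest would program through config
space. The virtio config plumbing is omitted, and the field layout and
names are assumptions:

#include <stdint.h>

#define ORDER_BITS	2
#define LEN_BITS	10
#define LEN_MASK	((1ULL << LEN_BITS) - 1)
#define ORDER_MASK	((1ULL << ORDER_BITS) - 1)

/* Page size per encoded order, programmed by the guest via config space;
 * e.g. 00 -> 4K, 01 -> 2M, 10 -> 1G on x86. */
static const uint64_t order_size[1 << ORDER_BITS] = {
	4096, 2UL << 20, 1UL << 30, 0,
};

/* 52-bit pfn | 2-bit order index | 10-bit length, in one 64-bit word;
 * the pfn is assumed to be in 4K units. */
static inline void decode_record(uint64_t rec, uint64_t *gpa, uint64_t *bytes)
{
	uint64_t len = rec & LEN_MASK;
	unsigned int order = (rec >> LEN_BITS) & ORDER_MASK;

	*gpa = (rec >> (LEN_BITS + ORDER_BITS)) * 4096;
	*bytes = len * order_size[order];
}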



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-15 Thread Dave Hansen
On 12/14/2016 12:59 AM, Li, Liang Z wrote:
>> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for
>> fast (de)inflating & fast live migration
>>
>> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
>>> What's the conclusion of your discussion? It seems you want some
>>> statistics before deciding whether to rip the bitmap from the ABI,
>>> am I right?
>>
>> I think Andrea and David feel pretty strongly that we should remove the
>> bitmap, unless we have some data to support keeping it.  I don't feel as
>> strongly about it, but I think their critique of it is pretty valid.  I 
>> think the
>> consensus is that the bitmap needs to go.
>>
>> The only real question IMNHO is whether we should do a power-of-2 or a
>> length.  But, if we have 12 bits, then the argument for doing length is 
>> pretty
>> strong.  We don't need anywhere near 12 bits if doing power-of-2.
> 
> Just found that MAX_ORDER should be limited to 12 if we use a length
> instead of an order; if MAX_ORDER is configured to a value bigger than 12,
> it will make things more complex to handle this case.
> 
> If we use an order, we need to break a large memory range whose length is
> not a power of 2 into several small ranges, which also makes the code
> complex.

I can't imagine it makes the code that much more complex.  It adds a for
loop.  Right?

> It seems we leave too many bits for the pfn, and the bits left for the
> length are not enough. How about keeping 45 bits for the pfn and 19 bits
> for the length? 45 bits for the pfn can cover a 57-bit physical address,
> and that should be enough in the near future.
> 
> What's your opinion?

I still think 'order' makes a lot of sense.  But, as you say, 57 bits is
enough for x86 for a while.  Other architectures, who knows?



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-14 Thread Li, Liang Z
> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for
> fast (de)inflating & fast live migration
> 
> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> What's the conclusion of your discussion? It seems you want some
> statistics before deciding whether to rip the bitmap from the ABI,
> am I right?
> 
> I think Andrea and David feel pretty strongly that we should remove the
> bitmap, unless we have some data to support keeping it.  I don't feel as
> strongly about it, but I think their critique of it is pretty valid.  I think 
> the
> consensus is that the bitmap needs to go.
> 
> The only real question IMNHO is whether we should do a power-of-2 or a
> length.  But, if we have 12 bits, then the argument for doing length is pretty
> strong.  We don't need anywhere near 12 bits if doing power-of-2.

Just found that MAX_ORDER should be limited to 12 if we use a length instead
of an order; if MAX_ORDER is configured to a value bigger than 12, it will
make things more complex to handle this case.

If we use an order, we need to break a large memory range whose length is
not a power of 2 into several small ranges, which also makes the code
complex.

It seems we leave too many bits for the pfn, and the bits left for the
length are not enough. How about keeping 45 bits for the pfn and 19 bits
for the length? 45 bits for the pfn can cover a 57-bit physical address,
and that should be enough in the near future.

What's your opinion?


thanks!
Liang

 



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-14 Thread Li, Liang Z
> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for
> fast (de)inflating & fast live migration
> 
> Hello,
> 
> On Fri, Dec 09, 2016 at 05:35:45AM +0000, Li, Liang Z wrote:
> > > On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> > > > What's the conclusion of your discussion? It seems you want some
> > > > statistics before deciding whether to rip the bitmap from the
> > > > ABI, am I right?
> > >
> > > I think Andrea and David feel pretty strongly that we should remove
> > > the bitmap, unless we have some data to support keeping it.  I don't
> > > feel as strongly about it, but I think their critique of it is
> > > pretty valid.  I think the consensus is that the bitmap needs to go.
> > >
> >
> > Thanks for your clarification.
> >
> > > The only real question IMNHO is whether we should do a power-of-2 or
> > > a length.  But, if we have 12 bits, then the argument for doing
> > > length is pretty strong.  We don't need anywhere near 12 bits if doing
> power-of-2.
> > >
> > So each item can represent at most 16MB, which seems not very big, but
> > is enough for most cases.
> > Things become much simpler without the bitmap, and I like simple
> > solutions too. :)
> >
> > I will prepare the v6 and remove all the bitmap-related stuff. Thank you
> > all!
> 
> Sounds great!
> 
> I suggested to check the statistics, because collecting those stats looked
> simpler and quicker than removing all bitmap related stuff from the patchset.
> However if you prefer to prepare a v6 without the bitmap another perhaps
> more interesting way to evaluate the usefulness of the bitmap is to just run
> the same benchmark and verify that there is no regression compared to the
> bitmap enabled code.
> 
> The other issue with the bitmap is, the best case for the bitmap is ever less
> likely to materialize the more RAM is added to the guest. It won't regress
> linearly because after all there can be some locality bias in the buddy 
> splits,
> but if sync compaction is used in the large order allocations tried before
> reaching order 0, the bitmap payoff will regress close to linearly with the
> increase of RAM.
> 
> So it'd be good to check the stats or the benchmark on large guests, at least
> one hundred gigabytes or so.
> 
> Changing topic but still about the ABI features needed, so it may be relevant
> for this discussion:
> 
> 1) vNUMA locality: i.e. allowing host to specify which vNODEs to take
>memory from, using alloc_pages_node in guest. So you can ask to
>take X pages from vnode A, Y pages from vnode B, in one vmenter.
> 
> 2) allowing qemu to tell the guest to stop inflating the balloon and
>report a fragmentation limit being hit, when sync compaction
>powered allocations fail at certain power-of-two order granularity
>passed by qemu to the guest. This order constraint will be passed
>by default for hugetlbfs guests with 2MB hpage size, while it can
>be used optionally on THP backed guests. This option with THP
>guests would allow a highlevel management software to provide a
>"don't reduce guest performance" while shrinking the memory size of
>the guest from the GUI. If you deselect the option, you can shrink
>down to the last freeable 4k guest page, but doing so may have to
>split THP in the host (you don't know for sure if they were really
>THP but they could have been), and it may regress
>performance. Inflating the balloon while passing a minimum
>granularity "order" of the pages being zapped, will guarantee
>inflating the balloon cannot decrease guest performance
>instead. Plus it's needed for hugetlbfs anyway as far as I can
>tell. hugetlbfs would not be host enforceable even if the idea is
>not to free memory but only reduce the available memory of the
>guest (not without major changes that maps a hugetlb page with 4k
>ptes at least). While for a more cooperative usage of hugetlbfs
>guests, it's simply not useful to inflate the balloon at anything
>less than the "HPAGE_SIZE" hugetlbfs granularity.
> 
> We also plan to use userfaultfd to make the balloon driver host enforced (will
> work fine on hugetlbfs 2M and tmpfs too) but that's going to be invisible to
> the ABI so it's not strictly relevant for this discussion.
> 
> On a side note, registering userfaultfd on the ballooned range, will keep
> khugepaged at bay so it won't risk re-inflating the MADV_DONTNEED
> zapped sub-THP fragments no matter the sysfs tunings.
> 

Thanks for your elaboration!

> Thanks!
> Andrea



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-09 Thread Andrea Arcangeli
Hello,

On Fri, Dec 09, 2016 at 05:35:45AM +0000, Li, Liang Z wrote:
> > On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> > > What's the conclusion of your discussion? It seems you want some
> > > statistics before deciding whether to rip the bitmap from the ABI,
> > > am I right?
> > 
> > I think Andrea and David feel pretty strongly that we should remove the
> > bitmap, unless we have some data to support keeping it.  I don't feel as
> > strongly about it, but I think their critique of it is pretty valid.  I 
> > think the
> > consensus is that the bitmap needs to go.
> > 
> 
> Thanks for your clarification.
> 
> > The only real question IMNHO is whether we should do a power-of-2 or a
> > length.  But, if we have 12 bits, then the argument for doing length is 
> > pretty
> > strong.  We don't need anywhere near 12 bits if doing power-of-2.
> > 
> So each item can represent at most 16MB, which seems not very big,
> but is enough for most cases.
> Things become much simpler without the bitmap, and I like simple solutions
> too. :)
> 
> I will prepare the v6 and remove all the bitmap-related stuff. Thank you all!

Sounds great!

I suggested to check the statistics, because collecting those stats
looked simpler and quicker than removing all bitmap related stuff from
the patchset. However if you prefer to prepare a v6 without the bitmap
another perhaps more interesting way to evaluate the usefulness of the
bitmap is to just run the same benchmark and verify that there is no
regression compared to the bitmap enabled code.

The other issue with the bitmap is, the best case for the bitmap is
ever less likely to materialize the more RAM is added to the guest. It
won't regress linearly because after all there can be some locality
bias in the buddy splits, but if sync compaction is used in the large
order allocations tried before reaching order 0, the bitmap payoff
will regress close to linearly with the increase of RAM.

So it'd be good to check the stats or the benchmark on large guests,
at least one hundred gigabytes or so.

Changing topic but still about the ABI features needed, so it may be
relevant for this discussion:

1) vNUMA locality: i.e. allowing host to specify which vNODEs to take
   memory from, using alloc_pages_node in guest. So you can ask to
   take X pages from vnode A, Y pages from vnode B, in one vmenter.

2) allowing qemu to tell the guest to stop inflating the balloon and
   report a fragmentation limit being hit, when sync compaction
   powered allocations fail at certain power-of-two order granularity
   passed by qemu to the guest. This order constraint will be passed
   by default for hugetlbfs guests with 2MB hpage size, while it can
   be used optionally on THP backed guests. This option with THP
   guests would allow a highlevel management software to provide a
   "don't reduce guest performance" while shrinking the memory size of
   the guest from the GUI. If you deselect the option, you can shrink
   down to the last freeable 4k guest page, but doing so may have to
   split THP in the host (you don't know for sure if they were really
   THP but they could have been), and it may regress
   performance. Inflating the balloon while passing a minimum
   granularity "order" of the pages being zapped, will guarantee
   inflating the balloon cannot decrease guest performance
   instead. Plus it's needed for hugetlbfs anyway as far as I can
   tell. hugetlbfs would not be host enforceable even if the idea is
   not to free memory but only reduce the available memory of the
   guest (not without major changes that maps a hugetlb page with 4k
   ptes at least). While for a more cooperative usage of hugetlbfs
   guests, it's simply not useful to inflate the balloon at anything
   less than the "HPAGE_SIZE" hugetlbfs granularity.

We also plan to use userfaultfd to make the balloon driver host
enforced (will work fine on hugetlbfs 2M and tmpfs too) but that's
going to be invisible to the ABI so it's not strictly relevant for
this discussion.

On a side note, registering userfaultfd on the ballooned range, will
keep khugepaged at bay so it won't risk re-inflating the
MADV_DONTNEED zapped sub-THP fragments no matter the sysfs tunings.

Thanks!
Andrea



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-08 Thread Li, Liang Z
> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> > What's the conclusion of your discussion? It seems you want some
> > statistics before deciding whether to rip the bitmap from the ABI,
> > am I right?
> 
> I think Andrea and David feel pretty strongly that we should remove the
> bitmap, unless we have some data to support keeping it.  I don't feel as
> strongly about it, but I think their critique of it is pretty valid.  I think 
> the
> consensus is that the bitmap needs to go.
> 

Thanks for your clarification.

> The only real question IMNHO is whether we should do a power-of-2 or a
> length.  But, if we have 12 bits, then the argument for doing length is pretty
> strong.  We don't need anywhere near 12 bits if doing power-of-2.
> 
So each item can represent at most 16MB, which seems not very big,
but is enough for most cases.
Things become much simpler without the bitmap, and I like simple solutions
too. :)

I will prepare the v6 and remove all the bitmap-related stuff. Thank you all!

Liang





Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-08 Thread Dave Hansen
On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> What's the conclusion of your discussion? It seems you want some
> statistics before deciding whether to rip the bitmap from the
> ABI, am I right?

I think Andrea and David feel pretty strongly that we should remove the
bitmap, unless we have some data to support keeping it.  I don't feel as
strongly about it, but I think their critique of it is pretty valid.  I
think the consensus is that the bitmap needs to go.

The only real question IMNHO is whether we should do a power-of-2 or a
length.  But, if we have 12 bits, then the argument for doing length is
pretty strong.  We don't need anywhere near 12 bits if doing power-of-2.






Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-08 Thread Li, Liang Z
> > 1. Current patches do a hypercall for each order in the allocator.
> >This is inefficient, but independent from the underlying data
> >structure in the ABI, unless bitmaps are in play, which they aren't.
> > 2. Should we have bitmaps in the ABI, even if they are not in use by the
> >guest implementation today?  Andrea says they have zero benefits
> >over a pfn/len scheme.  Dave doesn't think they have zero benefits
> >but isn't that attached to them.  QEMU's handling gets more
> >complicated when using a bitmap.
> > 3. Should the ABI contain records each with a pfn/len pair or a
> >pfn/order pair?
> >3a. 'len' is more flexible, but will always be a power-of-two anyway
> > for high-order pages (the common case)
> 
> Len wouldn't be a power of two practically only if we detect adjacent pages
> of smaller order that may merge into larger orders we already allocated (or
> the other way around).
> 
> [addr=2M, len=2M] allocated at order 9 pass
> [addr=4M, len=1M] allocated at order 8 pass -> merge as [addr=2M, len=3M]
> 
> Not sure if it would be worth it, but unless we do this, page-order or len
> won't make much difference.
> 
> >3b. if we decide not to have a bitmap, then we basically have plenty
> > of space for 'len' and should just do it
> >3c. It's easiest for the hypervisor to turn pfn/len into the
> >madvise() calls that it needs.
> >
> > Did I miss anything?
> 
> I think you summarized fine all my arguments in your summary.
> 
> > FWIW, I don't feel that strongly about the bitmap.  Li had one
> > originally, but I think the code thus far has demonstrated a huge
> > benefit without even having a bitmap.
> >
> > I've got no objections to ripping the bitmap out of the ABI.
> 
> I think we need to see a statistic showing the number of bits set in each
> bitmap in average, after some uptime and lru churn, like running stresstest
> app for a while with I/O and then inflate the balloon and
> count:
> 
> 1) how many bits were set vs total number of bits used in bitmaps
> 
> 2) how many times bitmaps were used vs bitmap_len = 0 case of single
>page
> 
> My guess would be like very low percentage for both points.
> 

> So there is a connection with the MAX_ORDER..0 allocation loop and the ABI
> change, but I agree any of the ABIs proposed would still allow for this
> logic to be used. Bitmap or not bitmap, the loop would still work.

Hi guys,

What's the conclusion of your discussion?
It seems you want some statistics before deciding whether to rip the bitmap
from the ABI, am I right?

Thanks!
Liang 



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-08 Thread Li, Liang Z
> Subject: Re: [PATCH kernel v5 0/5] Extend virtio-balloon for fast 
> (de)inflating
> & fast live migration
> 
> On 12/07/2016 05:35 AM, Li, Liang Z wrote:
> >> Am 30.11.2016 um 09:43 schrieb Liang Li:
> >> IOW in real examples, do we have really large consecutive areas or
> >> are all pages just completely distributed over our memory?
> >
> > The buddy system of Linux kernel memory management shows there should
> > be quite a lot of consecutive pages as long as there is a portion of
> > free memory in the guest.
> ...
> > If all pages were just completely distributed over our memory, it would
> > mean the memory fragmentation is very serious; the kernel has mechanisms
> > to avoid that happening.
> 
> While it is correct that the kernel has anti-fragmentation mechanisms, I don't
> think it invalidates the question as to whether a bitmap would be too sparse
> to be effective.
> 
> > On the other hand, inflating should not happen at this time
> > because the guest is almost 'out of memory'.
> 
> I don't think this is correct.  Most systems try to run with relatively 
> little free
> memory all the time, using the bulk of it as page cache.  We have no reason
> to expect that ballooning will only occur when there is lots of actual free
> memory and that it will not occur when that same memory is in use as page
> cache.
> 

Yes.
> In these patches, you're effectively still sending pfns.  You're just sending
> one pfn per high-order page which is giving a really nice speedup.  IMNHO,
> you're avoiding doing a real bitmap because creating a bitmap means either
> have a really big bitmap, or you would have to do some sorting (or multiple
> passes) of the free lists before populating a smaller bitmap.
> 
> Like David, I would still like to see some data on whether the choice between
> bitmaps and pfn lists is ever clearly in favor of bitmaps.  You haven't
> convinced me, at least, that the data isn't even worth collecting.

I will try to get some data with the real workload and share it with you guys.

Thanks!
Liang



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-07 Thread Andrea Arcangeli
On Wed, Dec 07, 2016 at 11:54:34AM -0800, Dave Hansen wrote:
> We're talking about a bunch of different stuff which is all being
> conflated.  There are 3 issues here that I can see.  I'll attempt to
> summarize what I think is going on:
> 
> 1. Current patches do a hypercall for each order in the allocator.
>This is inefficient, but independent from the underlying data
>structure in the ABI, unless bitmaps are in play, which they aren't.
> 2. Should we have bitmaps in the ABI, even if they are not in use by the
>guest implementation today?  Andrea says they have zero benefits
>over a pfn/len scheme.  Dave doesn't think they have zero benefits
>but isn't that attached to them.  QEMU's handling gets more
>complicated when using a bitmap.
> 3. Should the ABI contain records each with a pfn/len pair or a
>pfn/order pair?
>3a. 'len' is more flexible, but will always be a power-of-two anyway
>   for high-order pages (the common case)

Len wouldn't be a power of two practically only if we detect adjacent
pages of smaller order that may merge into larger orders we already
allocated (or the other way around).

[addr=2M, len=2M] allocated at order 9 pass
[addr=4M, len=1M] allocated at order 8 pass -> merge as [addr=2M, len=3M]

Not sure if it would be worth it, but unless we do this, page-order or
len won't make much difference.

>3b. if we decide not to have a bitmap, then we basically have plenty
>   of space for 'len' and should just do it
>3c. It's easiest for the hypervisor to turn pfn/len into the
>madvise() calls that it needs.
> 
> Did I miss anything?

I think you summarized fine all my arguments in your summary.

> FWIW, I don't feel that strongly about the bitmap.  Li had one
> originally, but I think the code thus far has demonstrated a huge
> benefit without even having a bitmap.
> 
> I've got no objections to ripping the bitmap out of the ABI.

I think we need to see a statistic showing the number of bits set in
each bitmap in average, after some uptime and lru churn, like running
stresstest app for a while with I/O and then inflate the balloon and
count:

1) how many bits were set vs total number of bits used in bitmaps

2) how many times bitmaps were used vs bitmap_len = 0 case of single
   page

My guess would be like very low percentage for both points.

> Surely we can think of a few ways...
> 
> A bitmap is 64x more dense if the lists are unordered.  It means being
> able to store ~32k*2M=64G worth of 2M pages in one data page vs. ~1G.
> That's 64x fewer cachelines to touch, 64x fewer pages to move to the
> hypervisor and lets us allocate 1/64th the memory.  Given a maximum
> allocation that we're allowed, it lets us do 64x more per-pass.
> 
> Now, are those benefits worth it?  Maybe not, but let's not pretend they
> don't exist. ;)

In the best case there are benefits obviously, the question is how
common the best case is.

The best case, if I understand correctly, is no high order pages
available, but plenty of order 0 pages available at phys address X,
X+8k, X+16k, X+(8k*nr_bits_in_bitmap). How common is it that order 0
pages exist but not at an address < X or > X+(8k*nr_bits_in_bitmap)?

> Yes, the current code sends one batch of pages up to the hypervisor per
> order.  But, this has nothing to do with the underlying data structure,
> or the choice to have an order vs. len in the ABI.
> 
> What you describe here is obviously more efficient.

And it isn't possible with the current ABI.

So there is a connection with the MAX_ORDER..0 allocation loop and the
ABI change, but I agree any of the ABIs proposed would still allow for
this logic to be used. Bitmap or not bitmap, the loop would still
work.
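
A sketch of the adjacency merge in this example, where [addr=2M, len=2M]
from the order 9 pass and [addr=4M, len=1M] from the order 8 pass collapse
into [addr=2M, len=3M]. The struct and helper name are illustrative:

#include <stddef.h>
#include <stdint.h>

struct range {
	uint64_t addr;	/* guest-physical start address, in bytes */
	uint64_t len;	/* length in bytes */
};

/* Append addr/len to the array, merging with the previous entry when the
 * two ranges are physically adjacent; returns the new entry count. */
static size_t add_range(struct range *r, size_t n, uint64_t addr,
			uint64_t len)
{
	if (n && r[n - 1].addr + r[n - 1].len == addr) {
		r[n - 1].len += len;	/* [2M,2M] + [4M,1M] -> [2M,3M] */
		return n;
	}
	r[n].addr = addr;
	r[n].len = len;
	return n + 1;
}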




Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-07 Thread Dave Hansen
We're talking about a bunch of different stuff which is all being
conflated.  There are 3 issues here that I can see.  I'll attempt to
summarize what I think is going on:

1. Current patches do a hypercall for each order in the allocator.
   This is inefficient, but independent from the underlying data
   structure in the ABI, unless bitmaps are in play, which they aren't.
2. Should we have bitmaps in the ABI, even if they are not in use by the
   guest implementation today?  Andrea says they have zero benefits
   over a pfn/len scheme.  Dave doesn't think they have zero benefits
   but isn't that attached to them.  QEMU's handling gets more
   complicated when using a bitmap.
3. Should the ABI contain records each with a pfn/len pair or a
   pfn/order pair?
   3a. 'len' is more flexible, but will always be a power-of-two anyway
for high-order pages (the common case)
   3b. if we decide not to have a bitmap, then we basically have plenty
of space for 'len' and should just do it
   3c. It's easiest for the hypervisor to turn pfn/len into the
   madvise() calls that it needs.

Did I miss anything?

On 12/07/2016 10:38 AM, Andrea Arcangeli wrote:
> On Wed, Dec 07, 2016 at 08:57:01AM -0800, Dave Hansen wrote:
>> It is more space-efficient.  We're fitting the order into 6 bits, which
>> would allow the full 2^64 address space to be represented in one entry,
> 
> Very large order is the same as very large len, 6 bits of order or 8
> bytes of len won't really move the needle here, simpler code is
> preferable.

Agreed.  But without seeing them side-by-side I'm not sure we can say
which is simpler.

> The main benefit of "len" is that it can be more granular, plus it's
> simpler than the bitmap too. Eventually all this stuff has to end up
> into a madvisev (not yet upstream but somebody posted it for jemalloc
> and should get merged eventually).
> 
> So the bitmap shall be demuxed to a addr,len array anyway, the bitmap
> won't ever be sent to the madvise syscall, which makes the
> intermediate representation with the bitmap a complication with
> basically no benefits compared to a (N, [addr1,len1], .., [addrN,
> lenN]) representation.

FWIW, I don't feel that strongly about the bitmap.  Li had one
originally, but I think the code thus far has demonstrated a huge
benefit without even having a bitmap.

I've got no objections to ripping the bitmap out of the ABI.

>> and leaves room for the bitmap size to be encoded as well, if we decide
>> we need a bitmap in the future.
> 
> How would a bitmap ever be useful with very large page-order?

Surely we can think of a few ways...

A bitmap is 64x more dense if the lists are unordered.  It means being
able to store ~32k*2M=64G worth of 2M pages in one data page vs. ~1G.
That's 64x fewer cachelines to touch, 64x fewer pages to move to the
hypervisor and lets us allocate 1/64th the memory.  Given a maximum
allocation that we're allowed, it lets us do 64x more per-pass.

Now, are those benefits worth it?  Maybe not, but let's not pretend they
don't exist. ;)

>> If that was purely a length, we'd be limited to 64*4k pages per entry,
>> which isn't even a full large page.
> 
> I don't follow here.
> 
> What we suggest is to send the data down represented as (N,
> [addr1,len1], ..., [addrN, lenN]) which allows infinite ranges each
> one of maximum length 2^64, so 2^64 multiplied infinite times if you
> wish. Simplifying the code and not having any bitmap at all and no :6
> :6 bits either.
> 
> The high order to low order loop of allocations is the interesting part
> here, not the bitmap, and the fact of doing a single vmexit to send
> the large ranges.

Yes, the current code sends one batch of pages up to the hypervisor per
order.  But, this has nothing to do with the underlying data structure,
or the choice to have an order vs. len in the ABI.

What you describe here is obviously more efficient.

> Considering the loop that allocates starting from MAX_ORDER..1, the
> chance the bitmap is actually getting filled with more than one bit at
> page_shift of PAGE_SHIFT should be very low after some uptime.

Yes, if bitmaps were in use, this is true.  I think a guest populating
bitmaps would probably not use the same algorithm.

> By the very nature of this loop, if we already exhausted all high
> order buddies, the page-order 0 pages obtained are going to be fairly
> fragmented reducing the usefulness of the bitmap and potentially only
> wasting CPU/memory.
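
The 64x figure quoted above works out as follows; a sketch of the
arithmetic under the stated assumptions (one 4K data page, 2M granularity,
8-byte pfn entries):

#include <stdio.h>

int main(void)
{
	unsigned long long page = 4096;		/* one data page */
	unsigned long long gran = 2ULL << 20;	/* 2M per bit / per entry */

	/* Bitmap: 4096 * 8 = 32768 bits, each marking one 2M page. */
	unsigned long long bitmap_cover = page * 8 * gran;	/* 64 GiB */
	/* Unordered pfn list: 4096 / 8 = 512 eight-byte entries. */
	unsigned long long list_cover = (page / 8) * gran;	/*  1 GiB */

	printf("bitmap %lluG vs pfn list %lluG: %llux denser\n",
	       bitmap_cover >> 30, list_cover >> 30,
	       bitmap_cover / list_cover);
	return 0;
}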




Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-07 Thread Andrea Arcangeli
On Wed, Dec 07, 2016 at 10:44:31AM -0800, Dave Hansen wrote:
> On 12/07/2016 10:38 AM, Andrea Arcangeli wrote:
> >> > and leaves room for the bitmap size to be encoded as well, if we decide
> >> > we need a bitmap in the future.
> > How would a bitmap ever be useful with very large page-order?
> 
> Please, guys.  Read the patches.  *Please*.

I did read the code but you didn't answer my question.

Why should a feature exist in the code that will never be useful? Why
do you think we could ever decide we'll need the bitmap in the future
for high order pages?

> The current code doesn't even _use_ a bitmap.

It's not using it right now, my question is exactly when it will ever
use it?

Leaving the bitmap only for order 0 allocations, when you already wiped
all higher page orders from the buddy, doesn't seem like a very good idea
overall, as the chance you got order 0 pages with close physical
addresses doesn't seem very high. It would be high if the loop that eats
into every possible higher order didn't run first, but that loop just
ran and already wiped everything.

Also note, we need to call compaction very aggressively before falling
back from order 9 down to order 8. Ideally we should never use the
page_shift = PAGE_SHIFT case at all! Which leaves the bitmap at best
an optimization for a case that is already suboptimal. If the bitmap
starts to pay off, it means the admin made a mistake and shrunk the
guest too much.



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-07 Thread Dave Hansen
On 12/07/2016 10:38 AM, Andrea Arcangeli wrote:
>> > and leaves room for the bitmap size to be encoded as well, if we decide
>> > we need a bitmap in the future.
> How would a bitmap ever be useful with very large page-order?

Please, guys.  Read the patches.  *Please*.

The current code doesn't even _use_ a bitmap.




Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-07 Thread Andrea Arcangeli
Hello,

On Wed, Dec 07, 2016 at 08:57:01AM -0800, Dave Hansen wrote:
> It is more space-efficient.  We're fitting the order into 6 bits, which
> would allow the full 2^64 address space to be represented in one entry,

Very large order is the same as very large len, 6 bits of order or 8
bytes of len won't really move the needle here, simpler code is
preferable.

The main benefit of "len" is that it can be more granular, plus it's
simpler than the bitmap too. Eventually all this stuff has to end up
into a madvisev (not yet upstream but somebody posted it for jemalloc
and should get merged eventually).

So the bitmap shall be demuxed to a addr,len array anyway, the bitmap
won't ever be sent to the madvise syscall, which makes the
intermediate representation with the bitmap a complication with
basically no benefits compared to a (N, [addr1,len1], .., [addrN,
lenN]) representation.

If you prefer 1 byte of order (not just 6 bits) instead of 8 bytes of len
that's possible too, I wouldn't be against that; the conversion before
calling madvise would be pretty efficient too.

> and leaves room for the bitmap size to be encoded as well, if we decide
> we need a bitmap in the future.

How would a bitmap ever be useful with very large page-order?

> If that was purely a length, we'd be limited to 64*4k pages per entry,
> which isn't even a full large page.

I don't follow here.

What we suggest is to send the data down represented as (N,
[addr1,len1], ..., [addrN, lenN]) which allows infinite ranges each
one of maximum length 2^64, so 2^64 multiplied infinite times if you
wish. Simplifying the code and not having any bitmap at all and no :6
:6 bits either.

The high order to low order loop of allocations is the interesting part
here, not the bitmap, and the fact of doing a single vmexit to send
the large ranges.

Once we pull out the largest order regions, we just add them to the
array as [addr,1UL<

Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-07 Thread Dave Hansen
Removing silly virtio-dev@ list because it's bouncing mail...

On 12/07/2016 08:21 AM, David Hildenbrand wrote:
>> Li's current patches do that.  Well, maybe not pfn/length, but they do
>> take a pfn and page-order, which fits perfectly with the kernel's
>> concept of high-order pages.
> 
> So we can send length in powers of two. Still, I don't see any benefit
> over a simple pfn/len schema. But I'll have a more detailed look at the
> implementation first, maybe that will enlighten me :)

It is more space-efficient.  We're fitting the order into 6 bits, which
would allow the full 2^64 address space to be represented in one entry,
and leaves room for the bitmap size to be encoded as well, if we decide
we need a bitmap in the future.

If that was purely a length, we'd be limited to 64*4k pages per entry,
which isn't even a full large page.




Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-07 Thread David Hildenbrand


>>> I did something similar. Filled the balloon with 15GB for a 16GB idle
>>> guest, by using bitmap, the madvise count was reduced to 605. When using
>>> the PFNs, the madvise count was 3932160. It means there are quite a lot
>>> of consecutive bits in the bitmap.
>>> I didn't test for a guest with heavy memory workload.
>>
>> Would it then even make sense to go one step further and report {pfn,
>> length} combinations?
>>
>> So simply send over an array of {pfn, length}?
>
> Li's current patches do that.  Well, maybe not pfn/length, but they do
> take a pfn and page-order, which fits perfectly with the kernel's
> concept of high-order pages.

So we can send length in powers of two. Still, I don't see any benefit
over a simple pfn/len schema. But I'll have a more detailed look at the
implementation first, maybe that will enlighten me :)

>> And it makes sense if you think about:
>>
>> a) hugetlb backing: The host may only be able to free huge pages (we
>> might want to communicate that to the guest later, that's another
>> story). Still we would have to send bitmaps full of 4k frames (512 bits
>> for 2mb frames). Of course, we could add a way to communicate that we
>> are using a different bitmap-granularity.
>
> Yeah, please read the patches.  If they're not clear, then the
> descriptions need work, but this is done already.

I missed the page_shift, thanks for the hint.

--

David



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-07 Thread Dave Hansen
On 12/07/2016 07:42 AM, David Hildenbrand wrote:
> Am 07.12.2016 um 14:35 schrieb Li, Liang Z:
>>> Am 30.11.2016 um 09:43 schrieb Liang Li:
 This patch set contains two parts of changes to the virtio-balloon.

 One is the change for speeding up the inflating & deflating process,
 the main idea of this optimization is to use bitmap to send the page
 information to host instead of the PFNs, to reduce the overhead of
 virtio data transmission, address translation and madvise(). This can
 help to improve the performance by about 85%.
>>>
>>> Do you have some statistics/some rough feeling how many consecutive
>>> bits are
>>> usually set in the bitmaps? Is it really just purely random or is
>>> there some
>>> granularity that is usually consecutive?
>>>
>>
>> I did something similar. Filled the balloon with 15GB for a 16GB idle
>> guest, by using bitmap, the madvise count was reduced to 605. When using
>> the PFNs, the madvise count was 3932160. It means there are quite a lot
>> of consecutive bits in the bitmap.
>> I didn't test for a guest with heavy memory workload.
> 
> Would it then even make sense to go one step further and report {pfn,
> length} combinations?
> 
> So simply send over an array of {pfn, length}?

Li's current patches do that.  Well, maybe not pfn/length, but they do
take a pfn and page-order, which fits perfectly with the kernel's
concept of high-order pages.

> And it makes sense if you think about:
> 
> a) hugetlb backing: The host may only be able to free huge pages (we
> might want to communicate that to the guest later, that's another
> story). Still we would have to send bitmaps full of 4k frames (512 bits
> for 2mb frames). Of course, we could add a way to communicate that we
> are using a different bitmap-granularity.

Yeah, please read the patches.  If they're not clear, then the
descriptions need work, but this is done already.




Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-07 Thread David Hildenbrand

Am 07.12.2016 um 14:35 schrieb Li, Liang Z:
>> Am 30.11.2016 um 09:43 schrieb Liang Li:
>>> This patch set contains two parts of changes to the virtio-balloon.
>>>
>>> One is the change for speeding up the inflating & deflating process,
>>> the main idea of this optimization is to use bitmap to send the page
>>> information to host instead of the PFNs, to reduce the overhead of
>>> virtio data transmission, address translation and madvise(). This can
>>> help to improve the performance by about 85%.
>>
>> Do you have some statistics/some rough feeling how many consecutive bits
>> are usually set in the bitmaps? Is it really just purely random or is
>> there some granularity that is usually consecutive?
>
> I did something similar. Filled the balloon with 15GB for a 16GB idle
> guest, by using bitmap, the madvise count was reduced to 605. When using
> the PFNs, the madvise count was 3932160. It means there are quite a lot
> of consecutive bits in the bitmap.
> I didn't test for a guest with heavy memory workload.

Would it then even make sense to go one step further and report {pfn,
length} combinations?

So simply send over an array of {pfn, length}?

This idea came up when talking to Andrea Arcangeli (put him on cc).

And it makes sense if you think about:

a) hugetlb backing: The host may only be able to free huge pages (we
might want to communicate that to the guest later, that's another
story). Still we would have to send bitmaps full of 4k frames (512 bits
for 2mb frames). Of course, we could add a way to communicate that we
are using a different bitmap-granularity.

b) if we really inflate huge memory regions (and it sounds like that
according to your measurements), we can minimize the communication to
the hypervisor and therefore the madvise calls.

c) we don't want to optimize for inflating guests with almost full
memory (and therefore little consecutive memory areas) - my opinion :)

Thanks for the explanation!

--

David



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-07 Thread Dave Hansen
On 12/07/2016 05:35 AM, Li, Liang Z wrote:
>> Am 30.11.2016 um 09:43 schrieb Liang Li:
>> IOW in real examples, do we have really large consecutive areas or are all
>> pages just completely distributed over our memory?
> 
> The buddy system of Linux kernel memory management shows there should
> be quite a lot of consecutive pages as long as there is a portion of
> free memory in the guest.
...
> If all pages were just completely distributed over our memory, it would
> mean the memory fragmentation is very serious; the kernel has
> mechanisms to avoid that happening.

While it is correct that the kernel has anti-fragmentation mechanisms, I
don't think it invalidates the question as to whether a bitmap would be
too sparse to be effective.

> On the other hand, inflating should not happen at this time because the
> guest is almost 'out of memory'.

I don't think this is correct.  Most systems try to run with relatively
little free memory all the time, using the bulk of it as page cache.  We
have no reason to expect that ballooning will only occur when there is
lots of actual free memory and that it will not occur when that same
memory is in use as page cache.

In these patches, you're effectively still sending pfns.  You're just
sending one pfn per high-order page which is giving a really nice
speedup.  IMNHO, you're avoiding doing a real bitmap because creating a
bitmap means either have a really big bitmap, or you would have to do
some sorting (or multiple passes) of the free lists before populating a
smaller bitmap.

Like David, I would still like to see some data on whether the choice
between bitmaps and pfn lists is ever clearly in favor of bitmaps.  You
haven't convinced me, at least, that the data isn't even worth collecting.



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-07 Thread Li, Liang Z
> Am 30.11.2016 um 09:43 schrieb Liang Li:
> > This patch set contains two parts of changes to the virtio-balloon.
> >
> > One is the change for speeding up the inflating & deflating process,
> > the main idea of this optimization is to use bitmap to send the page
> > information to host instead of the PFNs, to reduce the overhead of
> > virtio data transmission, address translation and madvise(). This can
> > help to improve the performance by about 85%.
> 
> Do you have some statistics/some rough feeling how many consecutive bits are
> usually set in the bitmaps? Is it really just purely random or is there some
> granularity that is usually consecutive?
> 

I did something similar. Filled the balloon with 15GB for a 16GB idle guest;
by using a bitmap, the madvise count was reduced to 605. When using the PFNs,
the madvise count was 3932160. It means there are quite a lot of consecutive
bits in the bitmap.
I didn't test for a guest with heavy memory workload.

> IOW in real examples, do we have really large consecutive areas or are all
> pages just completely distributed over our memory?
> 

The buddy system of Linux kernel memory management shows there should be
quite a lot of consecutive pages as long as there is a portion of free memory
in the guest.
If all pages were just completely distributed over our memory, it would mean
the memory fragmentation is very serious; the kernel has mechanisms to avoid
that happening.
On the other hand, inflating should not happen at this time because the guest
is almost 'out of memory'.

Liang

> Thanks!
> 
> --
> 
> David



Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-06 Thread David Hildenbrand

Am 30.11.2016 um 09:43 schrieb Liang Li:
> This patch set contains two parts of changes to the virtio-balloon.
>
> One is the change for speeding up the inflating & deflating process,
> the main idea of this optimization is to use bitmap to send the page
> information to host instead of the PFNs, to reduce the overhead of
> virtio data transmission, address translation and madvise(). This can
> help to improve the performance by about 85%.

Do you have some statistics/some rough feeling how many consecutive bits
are usually set in the bitmaps? Is it really just purely random or is
there some granularity that is usually consecutive?

IOW in real examples, do we have really large consecutive areas or are
all pages just completely distributed over our memory?

Thanks!

--

David