Re: Unmapping KVM Guest Memory from Host Kernel

2024-03-08 Thread Matthew Wilcox
On Fri, Mar 08, 2024 at 03:50:05PM +, Gowans, James wrote:
> Currently when using anonymous memory for KVM guest RAM, the memory all
> remains mapped into the kernel direct map. We are looking at options to
> get KVM guest memory out of the kernel’s direct map as a principled
> approach to mitigating speculative execution issues in the host kernel.
> Our goal is to more completely address the class of issues whose leak
> origin is categorized as "Mapped memory" [1].

One of the things that is holding Linux back is the inability to do I/O
to memory which is not part of memmap.  _So Much_ of our infrastructure
is based on having a struct page available to stick into an sglist, bio,
skb_frag, or whatever.  The solution to this is to move to a (phys_addr,
length) tuple instead of a (page, offset, len) tuple.  I call this "phyr"
and I've written about it before.  I'm not working on this as I have
quite enough to do with the folio work, but I hope somebody works on it
before I get time to.
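
For illustration, a minimal sketch of what such a tuple might look like --
the name "phyr" and the exact fields are assumptions drawn from the
paragraph above, not a settled API:

	/* Sketch: a physical-address range, replacing (page, offset, len). */
	struct phyr {
		phys_addr_t	addr;	/* start of the range */
		size_t		len;	/* length of the range in bytes */
	};

An sglist, bio or skb_frag built from an array of these would no longer
need a struct page, so I/O to memory outside the memmap becomes possible.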



Re: [PATCH] block: Remove special-casing of compound pages

2024-02-29 Thread Matthew Wilcox
On Thu, Feb 29, 2024 at 11:25:13AM -0700, Greg Edwards wrote:
> > [1/1] block: Remove special-casing of compound pages
> >   commit: 1b151e2435fc3a9b10c8946c6aebe9f3e1938c55
> 
> This commit results in a change of behavior for QEMU VMs backed by hugepages
> that open their VM disk image file with O_DIRECT (QEMU cache=none or
> cache.direct=on options).  When the VM shuts down and the QEMU process exits,
> one or two hugepages may fail to free correctly.  It appears to be a race, as
> it doesn't happen every time.

Hi Greg,

By sheer coincidence the very next email after this one was:

https://lore.kernel.org/linux-mm/86e592a9-98d4-4cff-a646-0c0084328...@cybernetics.com/T/#u

Can you try Tony's patch and see if it fixes your problem?
I haven't even begun to analyse either your email or his patch,
but there's a strong likelihood that they're the same thing.



Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-08-21 Thread Matthew Wilcox
On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
> tmpfs and hugetlbfs and page cache are designed around sharing memory:
> TDX is designed around absolutely not sharing memory; and the further
> uses which Sean foresees appear not to need it as page cache either.
> 
> Except perhaps for page migration reasons.  It's somewhat incidental,  
> but of course page migration knows how to migrate page cache, so
> masquerading as page cache will give a short cut to page migration,
> when page migration becomes at all possible.

I haven't read the patch series, and I'm not taking a position one way
or the other on whether this is better implemented as a shmem addition
or a shim that asks shmem for memory.  Page migration can be done for
driver memory by using PageMovable.  I just rewrote how it works, so
the details are top of my mind at the moment if anyone wants something
explained.  Commit 68f2736a8583 is the key one to look at.
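
For reference, the callbacks that commit introduced look roughly like this
(see include/linux/migrate.h for the authoritative definition):

	struct movable_operations {
		bool (*isolate_page)(struct page *, isolate_mode_t);
		int (*migrate_page)(struct page *dst, struct page *src,
				enum migrate_mode);
		void (*putback_page)(struct page *);
	};

A driver marks its pages movable and supplies these three operations;
page migration then calls them rather than assuming page cache semantics.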



Re: [Qemu-devel] [PATCH v3 0/5] kvm "virtio pmem" device

2019-01-13 Thread Matthew Wilcox
On Mon, Jan 14, 2019 at 10:29:02AM +1100, Dave Chinner wrote:
> Until you have images (and hence host page cache) shared between
> multiple guests. People will want to do this, because it means they
> only need a single set of pages in host memory for executable
> binaries rather than a set of pages per guest. Then you have
> multiple guests being able to detect residency of the same set of
> pages. If the guests can then, in any way, control eviction of the
> pages from the host cache, then we have a guest-to-guest information
> leak channel.

I don't think we should ever be considering something that would allow a
guest to evict pages from the host's pagecache [1].  The guest should
be able to kick its own references to the host's pagecache out of its
own pagecache, but not be able to influence whether the host or another
guest has a read-only mapping cached.

[1] Unless the guest is allowed to modify the host's file; obviously
truncation, holepunching, etc are going to evict pages from the host's
page cache.



Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation

2018-12-28 Thread Matthew Wilcox
On Sat, Dec 29, 2018 at 12:12:27AM +, Peter Maydell wrote:
> > On Fri, 28 Dec 2018 at 23:16, Andreas Dilger wrote:
> > > On Dec 28, 2018, at 4:18 AM, Peter Maydell wrote:
> > > The problem is that there is no 32-bit API in some cases
> > > (unless I have misunderstood the kernel code) -- not all
> > > host architectures implement compat syscalls or allow them
> > > to be called from 64-bit processes or implement all the older
> > > syscall variants that had smaller offsets. If there was a guaranteed
> > > "this syscall always exists and always gives me 32-bit offsets"
> > > we could use it.
> >
> > The "32bitapi" mount option would use 32-bit hash for seekdir
> > and telldir, regardless of what kernel API was used.  That would
> > just set the FMODE_32BITHASH flag in the file->f_mode for all files.
> 
> A mount option wouldn't be much use to QEMU -- we can't tell
> our users how to mount their filesystems, which they're
> often doing lots of other things with besides running QEMU.
> (Otherwise we could just tell them "don't use ext4", which
> would also solve the problem :-)) We need something we can
> use at the individual-syscall level.

Could you use a prctl to set whether you were running in 32 or 64 bit
mode?  Or do you change which kind of task you're emulating too often
to make this a good idea?
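
Purely as a sketch of the idea -- PR_SET_DIRENT_32BIT is a made-up name,
and no such prctl exists today:

	/* Hypothetical prctl: ask for 32-bit-safe d_off values from now on.
	 * The constant below does not exist; this only illustrates the shape. */
	if (prctl(PR_SET_DIRENT_32BIT, 1, 0, 0, 0) == 0) {
		/* getdents64() would now return offsets that fit in 32 bits */
	}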



Re: [Qemu-devel] [PATCH v21 1/5] xbitmap: Introduce xbitmap

2018-02-16 Thread Matthew Wilcox
On Fri, Feb 16, 2018 at 11:45:51PM +0200, Andy Shevchenko wrote:
> Now, the question about test case. Why do you heavily use BUG_ON?
> Isn't resulting statistics enough?

No.  If any of those tests fail, we want to stop dead.  They'll lead to
horrendous bugs throughout the kernel if they're wrong.  I think more of
the in-kernel test suite should stop dead instead of printing a warning.
Would you want to boot a machine which has a known bug in the page cache,
for example?



Re: [Qemu-devel] [PATCH v21 1/5] xbitmap: Introduce xbitmap

2018-02-16 Thread Matthew Wilcox
On Fri, Feb 16, 2018 at 07:44:50PM +0200, Andy Shevchenko wrote:
> On Tue, Jan 9, 2018 at 1:10 PM, Wei Wang <wei.w.w...@intel.com> wrote:
> > From: Matthew Wilcox <mawil...@microsoft.com>
> >
> > The eXtensible Bitmap is a sparse bitmap representation which is
> > efficient for set bits which tend to cluster. It supports up to
> > 'unsigned long' worth of bits.
> 
> >  lib/xbitmap.c | 444 +++
> 
> Please, split tests to a separate module.

Hah, I just did this two days ago!  I didn't publish it yet, but I also made
it compile both in userspace and as a kernel module.  

It's the top two commits here:

http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-2018-02-12

Note this is a complete rewrite compared to the version presented here; it
sits on top of the XArray and no longer has a preload interface.  It has a
superset of the IDA functionality.



Re: [Qemu-devel] [PATCH v20 3/7 RESEND] xbitmap: add more operations

2018-01-02 Thread Matthew Wilcox
On Fri, Dec 22, 2017 at 04:49:11PM +0800, Wei Wang wrote:
> Thanks for the improvement. I also found a small bug in xb_zero. With the
> following changes, it has passed the current test cases and tested with the
> virtio-balloon usage without any issue.

Thanks; I applied the change.  Can you supply a test-case for testing
xb_zero please?

> > @@ -25,8 +26,11 @@ idr-test: idr-test.o $(CORE_OFILES)
> >   multiorder: multiorder.o $(CORE_OFILES)
> > +xbitmap: xbitmap.o $(CORE_OFILES)
> > +   $(CC) $(CFLAGS) $(LDFLAGS) $^ -o xbitmap
> > +
> 
> I think the "$(CC).." line should be removed, which will fix the build error
> when adding "xbitmap" to TARGET.

Not sure what build error you're talking about, but yes that CC line
should go.  Thanks, deleted.



Re: [Qemu-devel] [PATCH v20 4/7] virtio-balloon: VIRTIO_BALLOON_F_SG

2018-01-02 Thread Matthew Wilcox
On Sun, Dec 24, 2017 at 03:42:02PM +0800, Wei Wang wrote:
> On 12/24/2017 12:45 PM, Tetsuo Handa wrote:
> > Matthew Wilcox wrote:
> > > If you can't preload with anything better than that, I think that
> > > xb_set_bit() should attempt an allocation with GFP_NOWAIT | __GFP_NOWARN,
> > > and then you can skip the preload; it has no value for you.
> > Yes, that's why I suggest directly using kzalloc() at xb_set_bit().
> 
> It has some possibilities to remove that preload if we also do the bitmap
> allocation in the xb_set_bit():
> 
> bitmap = rcu_dereference_raw(*slot);
> if (!bitmap) {
> bitmap = this_cpu_xchg(ida_bitmap, NULL);
> if (!bitmap) {
> bitmap = kmalloc(sizeof(*bitmap), gfp);
> if (!bitmap)
> return -ENOMEM;
> }
> }
> 
> But why not just follow the radix tree implementation style that puts the
> allocation in preload, which would be invoked with a more relaxed gfp in
> other use cases?

Actually, the radix tree does attempt allocation, and falls back to the
preload.  The IDA was the odd one out that doesn't attempt allocation,
and that was where I copied the code from.  For other users, the preload
API is beneficial, so I will leave it in.

> Its usage in virtio_balloon is just a little special that we need to put the
> allocation within the balloon_lock, which doesn't give us the benefit of
> using a relaxed gfp in preload, but it doesn't prevent us from living with
> the current APIs (i.e. the preload + xb_set pattern).
> On the other side, if we do it as above, we have more things that need to
> consider. For example, what if a use case just wants the radix tree
> implementation style, which means it doesn't want allocation within
> xb_set(), then would we be troubled with how to avoid the allocation path in
> that case?
> 
> So, I think it is better to stick with the convention by putting the
> allocation in preload. Breaking the convention should show obvious
> advantages, IMHO.

The radix tree convention is objectively awful, which is why I'm working
to change it.  Specifying the GFP flags at radix tree initialisation time
rather than allocation time leads to all kinds of confusion.  The preload
API is a pretty awful workaround, and it will go away once the XArray
is working correctly.  That said, there's no alternative to it without
making XBitmap depend on XArray, and I don't want to hold you up there.
So there's an xb_preload for the moment.
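
The difference in a nutshell (a sketch; the three-argument xb_set_bit()
is the planned XArray-era interface, not the current one):

	/* Today: GFP flags are fixed when the tree is initialised ... */
	INIT_RADIX_TREE(&tree, GFP_ATOMIC);
	/* ... and silently govern every later allocation. */
	radix_tree_insert(&tree, index, item);

	/* Planned: flags supplied where the allocation actually happens. */
	xb_set_bit(xb, bit, GFP_KERNEL);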




Re: [Qemu-devel] [PATCH v20 4/7] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-12-23 Thread Matthew Wilcox
On Tue, Dec 19, 2017 at 08:17:56PM +0800, Wei Wang wrote:
> +/*
> + * Send balloon pages in sgs to host. The balloon pages are recorded in the
> + * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> + * The page xbitmap is searched for continuous "1" bits, which correspond
> + * to continuous pages, to chunk into sgs.
> + *
> + * @page_xb_start and @page_xb_end form the range of bits in the xbitmap that
> + * need to be searched.
> + */
> +static void tell_host_sgs(struct virtio_balloon *vb,
> +   struct virtqueue *vq,
> +   unsigned long page_xb_start,
> +   unsigned long page_xb_end)

I'm not crazy about the naming here.  I'd use pfn_min and pfn_max like
you use in the caller.

> +{
> + unsigned long pfn_start, pfn_end;
> + uint32_t max_len = round_down(UINT_MAX, PAGE_SIZE);
> + uint64_t len;
> +
> + pfn_start = page_xb_start;

And I think pfn_start is actually just 'pfn'.

'pfn_end' is perhaps just 'end'.  Or 'gap'.

> + while (pfn_start < page_xb_end) {
> + pfn_start = xb_find_set(&vb->page_xb, page_xb_end + 1,
> + pfn_start);
> + if (pfn_start == page_xb_end + 1)
> + break;
> + pfn_end = xb_find_zero(&vb->page_xb, page_xb_end + 1,
> +pfn_start);
> + len = (pfn_end - pfn_start) << PAGE_SHIFT;

> +static inline int xb_set_page(struct virtio_balloon *vb,
> +struct page *page,
> +unsigned long *pfn_min,
> +unsigned long *pfn_max)
> +{

I really don't like it that you're naming things after the 'xb'.
Things should be named by something that makes sense to the user, not
after the particular implementation.  If you changed the underlying
representation from an xbitmap to, say, a BTree, you wouldn't want to
rename this function to 'btree_set_page'.  Maybe this function is really
"vb_set_page".  Or "record_page".  Or something.  Someone who understands
this driver better than I do can probably weigh in with a better name.

> + unsigned long pfn = page_to_pfn(page);
> + int ret;
> +
> + *pfn_min = min(pfn, *pfn_min);
> + *pfn_max = max(pfn, *pfn_max);
> +
> + do {
> + if (xb_preload(GFP_NOWAIT | __GFP_NOWARN) < 0)
> + return -ENOMEM;
> +
> + ret = xb_set_bit(&vb->page_xb, pfn);
> + xb_preload_end();
> + } while (unlikely(ret == -EAGAIN));

OK, so you don't need a spinlock because you're under a mutex?  But you
can't allocate memory because you're in the balloon driver, and so a
GFP_KERNEL allocation might recurse into your driver?  Would GFP_NOIO
do the job?  I'm a little hazy on exactly how the balloon driver works.

If you can't preload with anything better than that, I think that
xb_set_bit() should attempt an allocation with GFP_NOWAIT | __GFP_NOWARN,
and then you can skip the preload; it has no value for you.
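
A sketch of what that would look like inside xb_set_bit(), along the
lines of the snippet Wei posted elsewhere in this thread:

	bitmap = rcu_dereference_raw(*slot);
	if (!bitmap) {
		/* Opportunistic, non-sleeping allocation; safe under a lock. */
		bitmap = kzalloc(sizeof(*bitmap), GFP_NOWAIT | __GFP_NOWARN);
		if (!bitmap)
			return -ENOMEM;
	}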

> @@ -173,8 +292,15 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  
>   while ((page = balloon_page_pop())) {
>   balloon_page_enqueue(&vb->vb_dev_info, page);
> + if (use_sg) {
> + if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
> + __free_page(page);
> + continue;
> + }
> + } else {
> + set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> + }

Is this the right behaviour?  If we can't record the page in the xb,
wouldn't we rather send it across as a single page?




Re: [Qemu-devel] [PATCH v20 3/7 RESEND] xbitmap: add more operations

2017-12-23 Thread Matthew Wilcox
On Sat, Dec 23, 2017 at 11:33:45PM +0900, Tetsuo Handa wrote:
> Matthew Wilcox wrote:
> > On Sat, Dec 23, 2017 at 11:59:54AM +0900, Tetsuo Handa wrote:
> > > Matthew Wilcox wrote:
> > > > +   bit %= IDA_BITMAP_BITS;
> > > > +   radix_tree_iter_init(&iter, index);
> > > > +   slot = idr_get_free_cmn(root, &iter, GFP_NOWAIT | __GFP_NOWARN, index);
> > > > +   if (IS_ERR(slot)) {
> > > > +   if (slot == ERR_PTR(-ENOSPC))
> > > > +   return 0;   /* Already set */
> > > 
> > > Why already set? I guess something is there, but is it guaranteed that
> > > there is a bitmap with the "bit" set?
> > 
> > Yes.  For radix trees tagged with IDR_RT_MARKER, newly created slots
> > have the IDR_FREE tag set.  We only clear the IDR_FREE tag once the
> > bitmap is full.  So if we try to find a free slot and the tag is clear,
> > we know the bitmap is full.
> > 
> 
> OK. But does using IDR_FREE tag have more benefit than cost?
> You are doing
> 
>   if (bitmap_full(bitmap->bitmap, IDA_BITMAP_BITS))
>   radix_tree_iter_tag_clear(root, , IDR_FREE);
> 
> for each xb_set_bit() call. How likely do we hit ERR_PTR(-ENOSPC) path?
> Isn't removing both bitmap_full() and ERR_PTR(-ENOSPC) better?

You're assuming that the purpose of using IDR_FREE is to save xb_set_bit
from walking the tree unnecessarily.  It isn't; that's just a happy
side-effect.  Its main purpose is to make xb_find_zero() efficient.  If
we have large ranges of set bits, xb_find_zero() will be able to skip them.
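
In other words, xb_find_zero() only needs to visit slots still tagged
IDR_FREE; roughly:

	/* Slots whose IDR_FREE tag was cleared are known-full: skipped wholesale. */
	radix_tree_for_each_tagged(slot, root, &iter, index, IDR_FREE) {
		struct ida_bitmap *bitmap = radix_tree_deref_slot(slot);
		/* ... search this bitmap for a zero bit ... */
	}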

> > This is just a lazy test.  We "know" that the bits in the range 1024-2047
> > will all land in the same bitmap, so there's no need to preload for each
> > of them.
> 
> Testcases also serves as how to use that API.
> Assuming such thing leads to incorrect usage.

Sure.  Would you like to submit a patch?

> > > If bitmap == NULL at this_cpu_xchg(ida_bitmap, NULL) is allowed,
> > > you can use kzalloc(sizeof(*bitmap), GFP_NOWAIT | __GFP_NOWARN)
> > > and get rid of xb_preload()/xb_preload_end().
> > 
> > No, we can't.  GFP_NOWAIT | __GFP_NOWARN won't try very hard to allocate
> > memory.  There's no reason to fail the call if the user is in a context
> > where they can try harder to free memory.
> 
> But there is no reason to use GFP_NOWAIT at idr_get_free_cmn() if it is
> safe to use GFP_KERNEL. If we don't require xb_preload() which forces
> idr_get_free_cmn() to use GFP_NOWAIT due to possibility of preemption
> disabled by xb_preload(), we can allow passing gfp flags to xb_set_bit().

The assumption is that the user has done:

xb_preload(GFP_KERNEL);
spin_lock(my_lock);
xb_set_bit(xb, bit);
spin_unlock(my_lock);
xb_preload_end();

This is not the world's greatest interface.  Once I have the XArray
finished, we'll be able to ditch the external spinlock and the preload
interface and be able to call:

xb_set_bit(xb, bit, GFP_KERNEL);

> > xb_preload also preloads radix tree nodes.
> 
> But it after all forces idr_get_free_cmn() to use GFP_NOWAIT, doesn't it?

I think you don't understand how the radix tree allocates nodes.  preloading
means that it will be able to access the nodes which were allocated earlier.

> Speak of initial user (i.e. virtio-balloon), xb_preload() won't be able to
> use GFP_KERNEL in order to avoid OOM lockup. Therefore, I don't see
> advantages with using xb_preload(). If xb_set_bit() receives gfp flags,
> the caller can pass GFP_KERNEL if it is safe to use GFP_KERNEL.

I haven't reviewed how virtio-balloon is using the interfaces.



Re: [Qemu-devel] [PATCH v20 3/7 RESEND] xbitmap: add more operations

2017-12-22 Thread Matthew Wilcox
On Sat, Dec 23, 2017 at 11:59:54AM +0900, Tetsuo Handa wrote:
> Matthew Wilcox wrote:
> > +   bit %= IDA_BITMAP_BITS;
> > +   radix_tree_iter_init(&iter, index);
> > +   slot = idr_get_free_cmn(root, &iter, GFP_NOWAIT | __GFP_NOWARN, index);
> > +   if (IS_ERR(slot)) {
> > +   if (slot == ERR_PTR(-ENOSPC))
> > +   return 0;   /* Already set */
> 
> Why already set? I guess something is there, but is it guaranteed that
> there is a bitmap with the "bit" set?

Yes.  For radix trees tagged with IDR_RT_MARKER, newly created slots
have the IDR_FREE tag set.  We only clear the IDR_FREE tag once the
bitmap is full.  So if we try to find a free slot and the tag is clear,
we know the bitmap is full.

> > +   bitmap = rcu_dereference_raw(*slot);
> > +   if (!bitmap) {
> > +   bitmap = this_cpu_xchg(ida_bitmap, NULL);
> > +   if (!bitmap)
> > +   return -ENOMEM;
> 
> I can't understand this. I can understand if it were
> 
>   BUG_ON(!bitmap);
> 
> because you called xb_preload().
> 
> But
> 
>   /*
>* Regular test 2
>* set bit 2000, 2001, 2040
>* Next 1 in [0, 2048)  --> 2000
>* Next 1 in [2000, 2002)   --> 2000
>* Next 1 in [2002, 2041)   --> 2040
>* Next 1 in [2002, 2040)   --> none
>* Next 0 in [2000, 2048)   --> 2002
>* Next 0 in [2048, 2060)   --> 2048
>*/
>   xb_preload(GFP_KERNEL);
>   assert(!xb_set_bit(&xb1, 2000));
>   assert(!xb_set_bit(&xb1, 2001));
>   assert(!xb_set_bit(&xb1, 2040));
[...]
>   xb_preload_end();
> 
> you are not calling xb_preload() prior to each xb_set_bit() call.
> This means that, if each xb_set_bit() is not surrounded with
> xb_preload()/xb_preload_end(), there is possibility of hitting
> this_cpu_xchg(ida_bitmap, NULL) == NULL.

This is just a lazy test.  We "know" that the bits in the range 1024-2047
will all land in the same bitmap, so there's no need to preload for each
of them.
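
The arithmetic behind that "know": each ida_bitmap covers
IDA_BITMAP_BITS (1024) bits, so all three test bits map to the same
radix tree slot and a single preload covers them:

	2000 / 1024 == 2001 / 1024 == 2040 / 1024 == 1	/* same slot */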

> If bitmap == NULL at this_cpu_xchg(ida_bitmap, NULL) is allowed,
> you can use kzalloc(sizeof(*bitmap), GFP_NOWAIT | __GFP_NOWARN)
> and get rid of xb_preload()/xb_preload_end().

No, we can't.  GFP_NOWAIT | __GFP_NOWARN won't try very hard to allocate
memory.  There's no reason to fail the call if the user is in a context
where they can try harder to free memory.

> You are using idr_get_free_cmn(GFP_NOWAIT | __GFP_NOWARN), which
> means that the caller has to be prepared for allocation failure
> when calling xb_set_bit(). Thus, there is no need to use preload
> in order to avoid failing to allocate "bitmap".

xb_preload also preloads radix tree nodes.

> Also, please clarify why it is OK to just return here.
> I don't know what
> 
>   radix_tree_iter_replace(root, , slot, bitmap);
> 
> is doing. If you created a slot but did not assign "bitmap",
> what the caller of xb_test_bit() etc. will find? If there is an
> assumption about this slot, won't this cause a problem?

xb_test_bit will find NULL if bitmap wasn't assigned.  That doesn't
harm anything.




Re: [Qemu-devel] [PATCH v20 3/7 RESEND] xbitmap: add more operations

2017-12-21 Thread Matthew Wilcox

OK, here's a rewrite of xbitmap.

Compared to the version you sent:
 - xb_find_set() is the rewrite I sent out yesterday.
 - xb_find_clear() is a new implementation.  I use the IDR_FREE tag to find
   clear bits.  This led to me finding a bug in radix_tree_for_each_tagged().
 - xb_zero() is also a new implementation (replacing xb_clear_bit_range).
   It should also be more efficient in deep trees.
 - Did not accept your change to xb_set_bit(); I think it's important to
   leave the radix tree node in place, so we're guaranteed to make progress
   in low-memory situations.  We're expecting the caller to retry, so the
   memory should not leak.
 - Redid a lot of the kernel-doc.  Thanks for adding it!  I removed mentions
   of implementation details, leaving only information useful to those who
   are using it.

Compared to the version I put out back in February:
 - Accepted your change to xb_preload(); I think it's an improvement.
 - Rewrote xb_set_bit() to use the radix_tree_iter_ family of functions.
   Should improve performance for deep trees.
 - Rewrote xb_clear_bit() for the same reason.
 - Left out the use of radix tree exceptional entries.  Thanks for taking
   them out!  Keeps it clearer for now; if they prove useful, we can put
   them back in.
 - Removed the use of __radix_tree_delete and __radix_tree_create, so drop
   the changes to those functions.

Other miscellaneous notes
 - I left xb_fill() in the header file, even though there's no implementation
   yet.  Wouldn't be hard to add once we have a user.
 - Used SPDX tags instead of a license notice.

I think we need more test cases ... in reviewing this to send out,
I found a bug in xb_zero(), which I've only fixed because I don't have
time to write a test case for it.

diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
new file mode 100644
index 000000000000..c008309a9494
--- /dev/null
+++ b/include/linux/xbitmap.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+/*
+ * eXtensible Bitmaps
+ * Copyright (c) 2017 Microsoft Corporation
+ * Author: Matthew Wilcox <mawil...@microsoft.com>
+ *
+ * eXtensible Bitmaps provide an unlimited-size sparse bitmap facility.
+ * All bits are initially zero.
+ *
+ * Locking is to be provided by the user.  No xb_ function is safe to
+ * call concurrently with any other xb_ function.
+ */
+
+#include <linux/radix-tree.h>
+
+struct xb {
+   struct radix_tree_root xbrt;
+};
+
+#define XB_INIT {  \
+   .xbrt = RADIX_TREE_INIT(IDR_RT_MARKER | GFP_NOWAIT),\
+}
+#define DEFINE_XB(name) struct xb name = XB_INIT
+
+static inline void xb_init(struct xb *xb)
+{
+   INIT_RADIX_TREE(&xb->xbrt, IDR_RT_MARKER | GFP_NOWAIT);
+}
+
+int xb_set_bit(struct xb *xb, unsigned long bit);
+bool xb_test_bit(const struct xb *xb, unsigned long bit);
+void xb_clear_bit(struct xb *xb, unsigned long bit);
+void xb_zero(struct xb *xb, unsigned long min, unsigned long max);
+void xb_fill(struct xb *xb, unsigned long min, unsigned long max);
+bool xb_find_set(const struct xb *xb, unsigned long max, unsigned long *bit);
+bool xb_find_zero(const struct xb *xb, unsigned long max, unsigned long *bit);
+
+static inline bool xb_empty(const struct xb *xb)
+{
+   return radix_tree_empty(&xb->xbrt);
+}
+
+int __must_check xb_preload(gfp_t);
+
+static inline void xb_preload_end(void)
+{
+   preempt_enable();
+}
diff --git a/lib/Makefile b/lib/Makefile
index d11c48ec8ffd..08a8183c390b 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -19,7 +19,7 @@ KCOV_INSTRUMENT_dynamic_debug.o := n
 
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 rbtree.o radix-tree.o dump_stack.o timerqueue.o\
-idr.o int_sqrt.o extable.o \
+idr.o xbitmap.o int_sqrt.o extable.o \
 sha1.o chacha20.o irq_regs.o argv_split.o \
 flex_proportions.o ratelimit.o show_mem.o \
 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index c8d55565fafa..d2bd8feb7b85 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -37,7 +37,7 @@
 #include 
 #include 
 #include 
-
+#include 
 
 /* Number of nodes in fully populated tree of given height */
 static unsigned long height_to_maxnodes[RADIX_TREE_MAX_PATH + 1] __read_mostly;
@@ -77,6 +77,11 @@ static struct kmem_cache *radix_tree_node_cachep;
RADIX_TREE_MAP_SHIFT))
 #define IDA_PRELOAD_SIZE   (IDA_MAX_PATH * 2 - 1)
 
+#define XB_INDEX_BITS  (BITS_PER_LONG - ilog2(IDA_BITMAP_BITS))
+#define XB_MAX_PATH (DIV_ROUND_UP(XB_INDEX_BITS, \
+   RADIX_TREE_MAP_SHIFT))
+#define XB_PRELOAD_SIZE (XB_MAX_PATH * 2 - 1)
+
 /*
  * Per-cpu pool of preloaded nodes
  */
@@ -1781,7 +1786,7 @@ void __rcu **radix_tree_next_chunk(const struct radix_tree_root *root,
child = rcu_dereference_raw(node-&

Re: [Qemu-devel] [PATCH v20 3/7 RESEND] xbitmap: add more operations

2017-12-21 Thread Matthew Wilcox

First of all, the test-suite doesn't build, so I don't know whether you
ran it or not.  Then I added the xb_find_set() call below, and it fails
the assert, so you should probably fix that.

diff --git a/lib/xbitmap.c b/lib/xbitmap.c
index f03a0f9f9e29..b29af08a7597 100644
--- a/lib/xbitmap.c
+++ b/lib/xbitmap.c
@@ -249,11 +249,12 @@ void xbitmap_check_bit(unsigned long bit)
assert(!xb_test_bit(&xb1, bit));
assert(xb_set_bit(&xb1, bit) == 0);
assert(xb_test_bit(&xb1, bit));
-   assert(xb_clear_bit(&xb1, bit) == 0);
+   xb_clear_bit(&xb1, bit);
assert(xb_empty(&xb1));
-   assert(xb_clear_bit(&xb1, bit) == 0);
+   xb_clear_bit(&xb1, bit);
assert(xb_empty(&xb1));
xb_preload_end();
+   assert(xb_find_set(&xb1, ULONG_MAX, 0) == bit);
 }
 
 static void xbitmap_check_bit_range(void)
diff --git a/tools/testing/radix-tree/Makefile b/tools/testing/radix-tree/Makefile
index 34ece7883629..adf36e34dd77 100644
--- a/tools/testing/radix-tree/Makefile
+++ b/tools/testing/radix-tree/Makefile
@@ -1,9 +1,9 @@
 # SPDX-License-Identifier: GPL-2.0
 
 CFLAGS += -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address
-LDFLAGS += -fsanitize=address
-LDLIBS+= -lpthread -lurcu
-TARGETS = main idr-test multiorder
+LDFLAGS += -fsanitize=address $(LDLIBS)
+LDLIBS := -lpthread -lurcu
+TARGETS = main idr-test multiorder xbitmap
 CORE_OFILES := radix-tree.o idr.o linux.o test.o find_bit.o
 OFILES = main.o $(CORE_OFILES) regression1.o regression2.o regression3.o \
 tag_check.o multiorder.o idr-test.o iteration_check.o benchmark.o \

On Thu, Dec 21, 2017 at 10:30:06AM +0800, Wei Wang wrote:
> v20 RESEND Changes:
>   - fixed the !node path
>   - added the test cases for the !node path
>   - change __builtin_constant_p(start & 7) to __builtin_constant_p(start) 

Why would you do such a thing?  Just copy the kernel definitions.

> diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
> index 108f929..ede1029 100644
> --- a/include/linux/xbitmap.h
> +++ b/include/linux/xbitmap.h
> @@ -35,6 +35,12 @@ static inline void xb_init(struct xb *xb)
>  int xb_set_bit(struct xb *xb, unsigned long bit);
>  bool xb_test_bit(const struct xb *xb, unsigned long bit);
>  void xb_clear_bit(struct xb *xb, unsigned long bit);
> +void xb_clear_bit_range(struct xb *xb, unsigned long start,
> + unsigned long nbits);

This is xb_zero().  I thought we talked about this before?

> +unsigned long xb_find_set(struct xb *xb, unsigned long size,
> +   unsigned long offset);
> +unsigned long xb_find_zero(struct xb *xb, unsigned long size,
> +unsigned long offset);

Since you're using xb_find_zero(), I think we need the tags from the IDR.
At that point, I'm not sure there's a point in keeping the xbitmap and the
IDR as separate data structures.

> +/**
> + * xb_find_set - find the next set bit in a range of bits
> + * @xb: the xbitmap to search from
> + * @offset: the offset in the range to start searching
> + * @size: the size of the range
> + *
> + * Returns: the found bit, or @size if no set bit is found.
> + */
> +unsigned long xb_find_set(struct xb *xb, unsigned long size,
> +   unsigned long offset)
> +{
> + struct radix_tree_root *root = &xb->xbrt;
> + struct radix_tree_node *node;
> + void __rcu **slot;
> + struct ida_bitmap *bitmap;
> + unsigned long index = offset / IDA_BITMAP_BITS;
> + unsigned long index_end = size / IDA_BITMAP_BITS;
> + unsigned long bit = offset % IDA_BITMAP_BITS;
> +
> + if (unlikely(offset >= size))
> + return size;
> +
> + while (index <= index_end) {
> + unsigned long ret;
> + unsigned int nbits = size - index * IDA_BITMAP_BITS;
> +
> + bitmap = __radix_tree_lookup(root, index, &node, &slot);
> +
> + if (!node && !bitmap)
> + return size;
> +
> + if (bitmap) {
> + if (nbits > IDA_BITMAP_BITS)
> + nbits = IDA_BITMAP_BITS;
> +
> + ret = find_next_bit(bitmap->bitmap, nbits, bit);
> + if (ret != nbits)
> + return ret + index * IDA_BITMAP_BITS;
> + }
> + bit = 0;
> + index++;
> + }
> +
> + return size;
> +}
> +EXPORT_SYMBOL(xb_find_set);

This is going to be slower than the implementation I sent yesterday.  If I
call:
xb_init(xb);
xb_set_bit(xb, ULONG_MAX);
xb_find_set(xb, ULONG_MAX, 0);

it's going to call __radix_tree_lookup() 16 quadrillion times.
My implementation will walk the tree precisely once.
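
(To put a number on it: ULONG_MAX / IDA_BITMAP_BITS is 2^64 / 2^10 =
2^54, roughly 1.8 * 10^16 iterations -- the "16 quadrillion" above --
for a bitmap containing a single set bit.)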




Re: [Qemu-devel] [PATCH v20 0/7] Virtio-balloon Enhancement

2017-12-21 Thread Matthew Wilcox
On Thu, Dec 21, 2017 at 10:49:44AM +0800, Wei Wang wrote:
> On 12/21/2017 01:10 AM, Matthew Wilcox wrote:
> One more question is about the return value, why would it be ambiguous? I
> think it is the same as find_next_bit() which returns the found bit or size
> if not found.

Because find_next_bit doesn't reasonably support a bitmap which is
ULONG_MAX in size.  The point of XBitmap is to support a bitmap which
is ULONG_MAX in size, so every possible return value is a legitimate
"we found a bit here".  There's no value which can possibly be used for
"no bit was found".



Re: [Qemu-devel] [PATCH v20 0/7] Virtio-balloon Enhancement

2017-12-20 Thread Matthew Wilcox
On Wed, Dec 20, 2017 at 04:13:16PM +, Wang, Wei W wrote:
> On Wednesday, December 20, 2017 8:26 PM, Matthew Wilcox wrote:
> > unsigned long bit;
> > xb_preload(GFP_KERNEL);
> > xb_set_bit(xb, 700);
> > xb_preload_end();
> > bit = xb_find_set(xb, ULONG_MAX, 0);
> > assert(bit == 700);
> 
> The above test will result in "!node with bitmap != NULL", and it goes to the
> regular "if (bitmap)" path, which finds 700.
> 
> A better test would be
> ...
> xb_set_bit(xb, 700);
> assert(xb_find_set(xb, ULONG_MAX, 800) == ULONG_MAX);
> ...

I decided to write a test case to show you what I meant, then I discovered
the test suite didn't build, then the test I wrote took forever to run, so
I rewrote xb_find_set() using the radix tree iterators.  So I have no idea
what bugs may be in your implementation, but at least this function passes
the current test suite.  Of course, there may be gaps in the test suite.
And since I changed the API to not have the ambiguous return value, I
also changed the test suite, and maybe I introduced a bug.

diff --git a/include/linux/xbitmap.h b/include/linux/xbitmap.h
index ede1029b8a27..96e7e3560a0e 100644
--- a/include/linux/xbitmap.h
+++ b/include/linux/xbitmap.h
@@ -37,8 +37,7 @@ bool xb_test_bit(const struct xb *xb, unsigned long bit);
 void xb_clear_bit(struct xb *xb, unsigned long bit);
 void xb_clear_bit_range(struct xb *xb, unsigned long start,
unsigned long nbits);
-unsigned long xb_find_set(struct xb *xb, unsigned long size,
- unsigned long offset);
+bool xb_find_set(struct xb *xb, unsigned long max, unsigned long *bit);
 unsigned long xb_find_zero(struct xb *xb, unsigned long size,
   unsigned long offset);
 
diff --git a/lib/xbitmap.c b/lib/xbitmap.c
index 0bd3027b082d..58c26c8dd595 100644
--- a/lib/xbitmap.c
+++ b/lib/xbitmap.c
@@ -150,48 +150,39 @@ EXPORT_SYMBOL(xb_test_bit);
 /**
  * xb_find_set - find the next set bit in a range of bits
  * @xb: the xbitmap to search from
- * @offset: the offset in the range to start searching
- * @size: the size of the range
+ * @max: the maximum position to search to
+ * @bit: the first bit to examine, and on exit, the found bit
  *
- * Returns: the found bit or, @size if no set bit is found.
+ * Returns: %true if a set bit was found.  @bit will be updated.
  */
-unsigned long xb_find_set(struct xb *xb, unsigned long size,
- unsigned long offset)
+bool xb_find_set(struct xb *xb, unsigned long max, unsigned long *bit)
 {
-   struct radix_tree_root *root = &xb->xbrt;
-   struct radix_tree_node *node;
+   struct radix_tree_iter iter;
void __rcu **slot;
struct ida_bitmap *bitmap;
-   unsigned long index = offset / IDA_BITMAP_BITS;
-   unsigned long index_end = size / IDA_BITMAP_BITS;
-   unsigned long bit = offset % IDA_BITMAP_BITS;
-
-   if (unlikely(offset >= size))
-   return size;
-
-   while (index <= index_end) {
-   unsigned long ret;
-   unsigned int nbits = size - index * IDA_BITMAP_BITS;
-
-   bitmap = __radix_tree_lookup(root, index, &node, &slot);
-   if (!node) {
-   index = (index | RADIX_TREE_MAP_MASK) + 1;
-   continue;
-   }
-
+   unsigned long index = *bit / IDA_BITMAP_BITS;
+   unsigned int first = *bit % IDA_BITMAP_BITS;
+   unsigned long index_end = max / IDA_BITMAP_BITS;
+
+   radix_tree_for_each_slot(slot, &xb->xbrt, &iter, index) {
+   if (iter.index > index_end)
+   break;
+   bitmap = radix_tree_deref_slot(slot);
if (bitmap) {
-   if (nbits > IDA_BITMAP_BITS)
-   nbits = IDA_BITMAP_BITS;
-
-   ret = find_next_bit(bitmap->bitmap, nbits, bit);
-   if (ret != nbits)
-   return ret + index * IDA_BITMAP_BITS;
+   unsigned int nbits = IDA_BITMAP_BITS;
+   if (iter.index == index_end)
+   nbits = max % IDA_BITMAP_BITS + 1;
+
+   first = find_next_bit(bitmap->bitmap, nbits, first);
+   if (first != nbits) {
+   *bit = first + iter.index * IDA_BITMAP_BITS;
+   return true;
+   }
}
-   bit = 0;
-   index++;
+   first = 0;
}
 
-   return size;
+   return false;
 }
 EXPORT_SYMBOL(xb_find_set);
 
@@ -246,19 +237,30 @@ static DEFINE_XB(xb1);
 
 void xbitmap_check_bit(unsigned long bit)
 {
+   unsigned long nbit = 0;
+
xb_preload(GFP_KERNEL);
assert(!xb_test_bit(&xb1, bit));
assert(xb_set_bit(&xb1, bit) =

Re: [Qemu-devel] [PATCH v20 0/7] Virtio-balloon Enhancement

2017-12-20 Thread Matthew Wilcox
On Wed, Dec 20, 2017 at 06:34:36PM +0800, Wei Wang wrote:
> On 12/19/2017 10:05 PM, Tetsuo Handa wrote:
> > I think xb_find_set() has a bug in !node path.
> 
> I think we can probably remove the "!node" path for now. It would be good to
> get the fundamental part in first, and leave optimization to come as
> separate patches with corresponding test cases in the future.

You can't remove the !node path.  You'll see !node when the highest set
bit is less than 1024.  So do something like this:

unsigned long bit;
xb_preload(GFP_KERNEL);
xb_set_bit(xb, 700);
xb_preload_end();
bit = xb_find_set(xb, ULONG_MAX, 0);
assert(bit == 700);




Re: [Qemu-devel] [PATCH v20 0/7] Virtio-balloon Enhancement

2017-12-19 Thread Matthew Wilcox
On Tue, Dec 19, 2017 at 11:05:11PM +0900, Tetsuo Handa wrote:
> Removing exceptional path made this patch easier to read.
> But what I meant is
> 
>   Can you eliminate exception path and fold all xbitmap patches into one, and
>   post only one xbitmap patch without virtio-balloon changes? 
> 
> .
> 
> I still think we don't need xb_preload()/xb_preload_end().
> I think xb_find_set() has a bug in !node path.

Don't think.  Write a test-case.  Please.  If it shows a bug, then great,
Wei has an example of what the bug is to fix.  If it doesn't show a bug,
then we can add it to the test suite anyway, to ensure that case continues
to work in the future.

> Also, please avoid unconditionally adding to builtin modules.
> There are users who want to save even a few KB.



Re: [Qemu-devel] [PATCH v19 3/7] xbitmap: add more operations

2017-12-17 Thread Matthew Wilcox
On Mon, Dec 18, 2017 at 10:33:00AM +0800, Wei Wang wrote:
> > My only qualm is that I've been considering optimising the memory
> > consumption when an entire 1024-bit chunk is full; instead of keeping a
> > pointer to a 128-byte entry full of ones, store a special value in the
> > radix tree which means "every bit is set".
> > 
> > The downside is that we then have to pass GFP flags to xbit_clear() and
> > xbit_zero(), and they can fail.  It's not clear to me whether that's a
> > good tradeoff.
> 
> Yes, this will sacrifice performance. In many usages, users may set bits one
> by one, and each time when a bit is set, it needs to scan the whole
> ida_bitmap to see if all other bits are set, if so, it can free the
> ida_bitmap. I think this extra scanning of the ida_bitmap would add a lot
> of overhead.

Not a huge amount of overhead.  An ida_bitmap is only two cachelines,
and the loop is simply 'check each word against ~0ul', so up to 16
load/test/loop instructions.  Plus we have to do that anyway to maintain
the free tag for IDAs.
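
That scan is essentially the following (a sketch; the kernel's
bitmap_full() already does the equivalent):

	/* IDA_BITMAP_LONGS is 16 on 64-bit: two cachelines' worth. */
	static bool ida_bitmap_all_ones(const struct ida_bitmap *bitmap)
	{
		unsigned int i;

		for (i = 0; i < IDA_BITMAP_LONGS; i++)
			if (bitmap->bitmap[i] != ~0UL)
				return false;
		return true;
	}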

> > But I need to get the XArray (which replaces the radix tree) finished first.
> 
> OK. It seems the new implementation wouldn't be done shortly.
> Other parts of this patch series are close to the end of review, and we hope
> to make some progress soon. Would it be acceptable that we continue with the
> basic xb_ implementation (e.g. as xbitmap 1.0) for this patch series? and
> xbit_ implementation can come as xbitmap 2.0 in the future?

Yes, absolutely, I don't want to hold you up behind the XArray.



Re: [Qemu-devel] [PATCH v19 3/7] xbitmap: add more operations

2017-12-17 Thread Matthew Wilcox
On Sun, Dec 17, 2017 at 01:47:21PM +, Wang, Wei W wrote:
> On Saturday, December 16, 2017 3:22 AM, Matthew Wilcox wrote:
> > On Fri, Dec 15, 2017 at 10:49:15AM -0800, Matthew Wilcox wrote:
> > > Here's the API I'm looking at right now.  The user need take no lock;
> > > the locking (spinlock) is handled internally to the implementation.
> 
> Another place I saw your comment "The xb_ API requires you to handle your
> own locking", which seems to conflict with the above "the user need take no lock".
> Doesn't the caller need a lock to avoid concurrent accesses to the ida bitmap?

Yes, the xb_ implementation requires you to handle your own locking.
The xbit_ API that I'm proposing will take care of the locking for you.
There's also no preallocation in the API.

> We'll change it to "bool xb_find_set(.., unsigned long *result)", returning 
> false indicates no "1" bit is found.

I put a replacement proposal in the next paragraph:
bool xbit_find_set(struct xbitmap *, unsigned long *start, unsigned long max);

Maybe 'start' is the wrong name for that parameter.  Let's call it 'bit'.
It's both "where to start" and "first bit found".
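
So a caller walking every set bit would do something like this (a sketch;
handle() stands in for the caller's work):

	unsigned long bit = 0;

	while (xbit_find_set(xb, &bit, ULONG_MAX)) {
		handle(bit);		/* 'bit' now holds the found position */
		if (bit == ULONG_MAX)
			break;		/* avoid wrapping past the top */
		bit++;			/* resume just after the found bit */
	}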

> >  - xbit_clear() can't return an error.  Neither can xbit_zero().
> 
> I found the current xbit_clear implementation only returns 0, and there isn't 
> an error to be returned from this function. In this case, is it better to 
> make the function "void"?

Yes, I think so.

My only qualm is that I've been considering optimising the memory
consumption when an entire 1024-bit chunk is full; instead of keeping a
pointer to a 128-byte entry full of ones, store a special value in the
radix tree which means "every bit is set".

The downside is that we then have to pass GFP flags to xbit_clear() and
xbit_zero(), and they can fail.  It's not clear to me whether that's a
good tradeoff.

> Are you suggesting to rename the current xb_ APIs to the above xbit_ names 
> (with parameter changes)? 
> 
> Why would we need xbit_alloc, which looks like ida_get_new, I think set/clear 
> should be adequate to the current usages.

I'm intending on replacing the xb_ and ida_ implementations with this one.
It removes the preload API which makes it easier to use, and it handles
the locking for you.

But I need to get the XArray (which replaces the radix tree) finished first.



Re: [Qemu-devel] [PATCH v19 3/7] xbitmap: add more operations

2017-12-15 Thread Matthew Wilcox
On Sat, Dec 16, 2017 at 01:31:24PM +0900, Tetsuo Handa wrote:
> Michael S. Tsirkin wrote:
> > On Sat, Dec 16, 2017 at 01:21:52AM +0900, Tetsuo Handa wrote:
> > > My understanding is that virtio-balloon wants to handle sparsely spread
> > > unsigned long values (which is PATCH 4/7) and wants to find all chunks of
> > > consecutive "1" bits efficiently. Therefore, I guess that holding the 
> > > values
> > > in ascending order at store time is faster than sorting the values at read
> > > time.

What makes you think that the radix tree (also xbitmap, also idr) doesn't
sort the values at store time?

> I'm asking whether we really need to invent a new library module (i.e.
> PATCH 1/7 + PATCH 2/7 + PATCH 3/7) for virtio-balloon compared to mine.
> 
> What virtio-balloon needs is ability to
> 
>   (1) record any integer value in [0, ULONG_MAX] range
> 
>   (2) fetch all recorded values, with consecutive values combined in
>   min,max (or start,count) form, for efficiency
> 
> and I wonder whether we need to invent the complete API set which
> Matthew Wilcox and Wei Wang are planning for generic purposes.

The xbitmap absolutely has that ability.  And making it generic code
means more people see it, use it, debug it, optimise it.  I originally
wrote the implementation for bcache, when Kent was complaining we didn't
have such a thing.  His needs weren't as complex as Wei's, which is why
I hadn't implemented everything that Wei needed.



Re: [Qemu-devel] [PATCH v19 3/7] xbitmap: add more operations

2017-12-15 Thread Matthew Wilcox
On Fri, Dec 15, 2017 at 10:49:15AM -0800, Matthew Wilcox wrote:
> Here's the API I'm looking at right now.  The user need take no lock;
> the locking (spinlock) is handled internally to the implementation.

I looked at the API some more and found some flaws:
 - how does xbit_alloc communicate back which bit it allocated?
 - What if xbit_find_set() is called on a completely empty array with
   a range of 0, ULONG_MAX -- there's no invalid number to return.
 - xbit_clear() can't return an error.  Neither can xbit_zero().
 - Need to add __must_check to various return values to discourage sloppy
   programming

So I modify the proposed API we compete with thusly:

bool xbit_test(struct xbitmap *, unsigned long bit);
int __must_check xbit_set(struct xbitmap *, unsigned long bit, gfp_t);
void xbit_clear(struct xbitmap *, unsigned long bit);
int __must_check xbit_alloc(struct xbitmap *, unsigned long *bit, gfp_t);

int __must_check xbit_fill(struct xbitmap *, unsigned long start,
unsigned long nbits, gfp_t);
void xbit_zero(struct xbitmap *, unsigned long start, unsigned long nbits);
int __must_check xbit_alloc_range(struct xbitmap *, unsigned long *bit,
unsigned long nbits, gfp_t);

bool xbit_find_clear(struct xbitmap *, unsigned long *start, unsigned long max);
bool xbit_find_set(struct xbitmap *, unsigned long *start, unsigned long max);

(I'm a little sceptical about the API accepting 'max' for the find
functions and 'nbits' in the fill/zero/alloc_range functions, but I think
that matches how people want to use it, and it matches how bitmap.h works)
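
As a usage sketch, the virtio-balloon pattern of emitting runs of
consecutive set bits would come out roughly like this (emit_range() is
a stand-in for the caller's consumer):

	unsigned long start = 0, end;

	while (xbit_find_set(xb, &start, ULONG_MAX)) {
		end = start;
		if (xbit_find_clear(xb, &end, ULONG_MAX)) {
			emit_range(start, end - 1);	/* run is [start, end - 1] */
		} else {
			emit_range(start, ULONG_MAX);	/* set all the way up */
			break;
		}
		if (end == ULONG_MAX)
			break;
		start = end + 1;
	}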



Re: [Qemu-devel] [PATCH v19 3/7] xbitmap: add more operations

2017-12-15 Thread Matthew Wilcox
On Sat, Dec 16, 2017 at 01:21:52AM +0900, Tetsuo Handa wrote:
> My understanding is that virtio-balloon wants to handle sparsely spread
> unsigned long values (which is PATCH 4/7) and wants to find all chunks of
> consecutive "1" bits efficiently. Therefore, I guess that holding the values
> in ascending order at store time is faster than sorting the values at read
> time. I don't know how to use radix tree API, but I think that B+ tree API
> suits for holding the values in ascending order.
> 
> We wait for Wei to post radix tree version combined into one patch and then
> compare performance between radix tree version and B+ tree version (shown
> below)?

Sure.  We all benefit from some friendly competition.  Even if a
competition between trees might remind one of the Entmoot ;-)

But let's not hold back -- let's figure out some good workloads to use
in our competition.  And we should also decide on the API / locking
constraints.  And of course we should compete based on not just speed,
but also memory consumption (both as a runtime overhead for a given set
of bits and as code size).  If you can replace the IDR, you get to count
that savings against the cost of your implementation.

Here's the API I'm looking at right now.  The user need take no lock;
the locking (spinlock) is handled internally to the implementation.

void xbit_init(struct xbitmap *xb);
int xbit_alloc(struct xbitmap *, unsigned long bit, gfp_t);
int xbit_alloc_range(struct xbitmap *, unsigned long start,
unsigned long nbits, gfp_t);
int xbit_set(struct xbitmap *, unsigned long bit, gfp_t);
bool xbit_test(struct xbitmap *, unsigned long bit);
int xbit_clear(struct xbitmap *, unsigned long bit);
int xbit_zero(struct xbitmap *, unsigned long start, unsigned long nbits);
int xbit_fill(struct xbitmap *, unsigned long start, unsigned long nbits,
gfp_t);
unsigned long xbit_find_clear(struct xbitmap *, unsigned long start,
unsigned long max);
unsigned long xbit_find_set(struct xbitmap *, unsigned long start,
unsigned long max);

> static bool set_ulong(struct ulong_list_head *head, const unsigned long value)
> {
>   if (!ptr) {
>   ptr = kzalloc(sizeof(*ptr), GFP_NOWAIT | __GFP_NOWARN);
>   if (!ptr)
>   goto out1;
>   ptr->bitmap = kzalloc(BITMAP_LEN / 8,
> GFP_NOWAIT | __GFP_NOWARN);
>   if (!ptr->bitmap)
>   goto out2;
>   if (btree_insertl(&head->btree, ~segment, ptr,
>  GFP_NOWAIT | __GFP_NOWARN))
>   goto out3;
> out3:
>   kfree(ptr->bitmap);
> out2:
>   kfree(ptr);
> out1:
>   return false;
> }

And what is the user supposed to do if this returns false?  How do they
make headway?  The xb_ API is clear -- you call xb_preload() and that
ensures forward progress.




Re: [Qemu-devel] [PATCH v19 3/7] xbitmap: add more operations

2017-12-15 Thread Matthew Wilcox
On Tue, Dec 12, 2017 at 07:55:55PM +0800, Wei Wang wrote:
> +int xb_preload_and_set_bit(struct xb *xb, unsigned long bit, gfp_t gfp);

I'm struggling to understand when one would use this.  The xb_ API
requires you to handle your own locking.  But specifying GFP flags
here implies you can sleep.  So ... um ... there's no locking?

> +void xb_clear_bit_range(struct xb *xb, unsigned long start, unsigned long 
> end);

That's xb_zero() which you deleted with the previous patch ... remember,
keep things as close as possible to the bitmap API.




Re: [Qemu-devel] [PATCH v19 1/7] xbitmap: Introduce xbitmap

2017-12-15 Thread Matthew Wilcox
On Fri, Dec 15, 2017 at 07:05:07PM +0800, kbuild test robot wrote:
> 21struct radix_tree_node *node;
> 22void **slot;
 ^^^
missing __rcu annotation here.

Wei, could you fold that change into your next round?  Thanks!




Re: [Qemu-devel] [PATCH v19 3/7] xbitmap: add more operations

2017-12-14 Thread Matthew Wilcox
On Fri, Dec 15, 2017 at 01:29:45AM +0900, Tetsuo Handa wrote:
> > > Also, one more thing you need to check. Have you checked how long does
> > > xb_find_next_set_bit(xb, 0, ULONG_MAX) on an empty xbitmap takes?
> > > If it causes soft lockup warning, should we add cond_resched() ?
> > > If yes, you have to document that this API might sleep. If no, you
> > > have to document that the caller of this API is responsible for
> > > not to pass such a large value range.
> > 
> > Yes, that will take too long time. Probably we can document some 
> > comments as a reminder for the callers.
> 
> Then, I feel that API is poorly implemented. There is no need to brute-force
> when scanning [0, ULONG_MAX] range. If you eliminate exception path and
> redesign the data structure, xbitmap will become as simple as a sample
> implementation shown below. Not tested yet, but I think that this will be
> sufficient for what virtio-balloon wants to do; i.e. find consecutive "1" bits
> quickly. I didn't test whether finding "struct ulong_list_data" using radix
> tree can improve performance.

find_next_set_bit() is just badly implemented.  There is no need to
redesign the data structure.  It should be a simple matter of:

 - look at ->head, see it is NULL, return false.

If bit 100 is set and you call find_next_set_bit(101, ULONG_MAX), it
should look at block 0, see there is a pointer to it, scan the block,
see there are no bits set above 100, then realise we're at the end of
the tree and stop.

If bit 2000 is set, and you call find_next_set_bit(2001, ULONG_MAX)
it should look at block 1, see there's no bit set after bit 2001, then
look at the other blocks in the node, see that all the pointers are NULL
and stop.

This isn't rocket science, we already do something like this in the radix
tree and it'll be even easier to do in the XArray.  Which I'm going back
to working on now.



Re: [Qemu-devel] [PATCH v19 3/7] xbitmap: add more operations

2017-12-14 Thread Matthew Wilcox
On Wed, Dec 13, 2017 at 08:26:06PM +0800, Wei Wang wrote:
> On 12/12/2017 09:20 PM, Tetsuo Handa wrote:
> > Can you eliminate exception path and fold all xbitmap patches into one, and
> > post only one xbitmap patch without virtio-balloon changes? If exception path
> > is valuable, you can add exception path after minimum version is merged.
> > This series is too difficult for me to close corner cases.
> 
> That exception path is claimed to save memory, and I don't have a strong
> reason to remove that part.
> Matthew, could we get your feedback on this?

Sure.  This code is derived from the IDA code in lib/idr.c.  Eventually,
I intend to reunite them.  For IDA, it clearly makes sense; the first 62
entries result in allocating no memory at all, which is going to be 99%
of users.  After that, we allocate 128 bytes which will serve the first
1024 users.

The xbitmap, as used by Wei's patches here is going to be used somewhat
differently from that.  I understand why Tetsuo wants the exceptional
path removed; I'm not sure the gains will be as important.  But if we're
going to rebuild the IDA on top of the xbitmap, we need to keep them.

I really want to pay more attention to this, but I need to focus on
getting the XArray finished.



Re: [Qemu-devel] [PATCH v18 05/10] xbitmap: add more operations

2017-12-01 Thread Matthew Wilcox
On Fri, Dec 01, 2017 at 03:09:08PM +, Wang, Wei W wrote:
> On Friday, December 1, 2017 9:02 PM, Tetsuo Handa wrote:
> > If start == end is legal,
> > 
> >for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
> > 
> > makes this loop do nothing because 10 < 10 is false.
> 
> How about "start <= end "?

Don't ask Tetsuo for his opinion, write some userspace code that uses it.




Re: [Qemu-devel] [PATCH v18 05/10] xbitmap: add more operations

2017-12-01 Thread Matthew Wilcox
On Fri, Dec 01, 2017 at 10:02:01PM +0900, Tetsuo Handa wrote:
> If start == end is legal,
> 
>for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
> 
> makes this loop do nothing because 10 < 10 is false.

... and this is why we add tests to the test-suite!



Re: [Qemu-devel] [PATCH v18 05/10] xbitmap: add more operations

2017-11-30 Thread Matthew Wilcox
On Thu, Nov 30, 2017 at 10:35:03PM +0900, Tetsuo Handa wrote:
> According to xb_set_bit(), it seems to me that we are trying to avoid memory 
> allocation
> for "struct ida_bitmap" when all set bits within a 1024-bits bitmap reside in 
> the first
> 61 bits.
> 
> But does such saving help? Is there characteristic bias that majority of set 
> bits resides
> in the first 61 bits, for "bit" is "unsigned long" which holds a page number 
> (isn't it)?
> If no such bias, wouldn't eliminating radix_tree_exception() case and always 
> storing
> "struct ida_bitmap" simplifies the code (and make the processing faster)?

It happens all the time.  The vast majority of users of the IDA set
low bits.  Also, it's the first 62 bits -- going up to 63 bits with the
XArray rewrite.
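
The 62 falls out of the encoding: a small bitmap is stored directly in
the radix tree slot as an exceptional entry, with the low two bits of
the word reserved for tagging; a sketch of the idea:

	/* Sketch: bits 0..61 live in the slot itself, no allocation needed. */
	void *encode_small(unsigned long bits)
	{
		return (void *)((bits << 2) | RADIX_TREE_EXCEPTIONAL_ENTRY);
	}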

I do plan to redo the xbitmap on top of the XArray; I'm just trying to
get the XArray merged first.  The IDA and xbitmap code will share much
more code when that happens.



Re: [Qemu-devel] [PATCH v18 01/10] idr: add #include <linux/bug.h>

2017-11-29 Thread Matthew Wilcox
On Wed, Nov 29, 2017 at 09:55:17PM +0800, Wei Wang wrote:
> The #include <linux/bug.h> was removed from radix-tree.h by the following commit:
> f5bba9d11a256ad2a1c2f8e7fc6aabe6416b7890.
> 
> Since that commit, tools/testing/radix-tree/ couldn't pass compilation
> due to: tools/testing/radix-tree/idr.c:17: undefined reference to
> WARN_ON_ONCE. This patch adds the bug.h header to idr.h to solve the
> issue.

Thanks; I sent this same patch out yesterday.

Unfortunately, you didn't cc the author of this breakage, Masahiro Yamada.
I want to highlight that these kinds of header cleanups are risky,
and very low reward.  I really don't want to see patches going all over
the tree randomly touching header files.  If we've got a real problem
to solve, then sure.  But I want to see a strong justification for any
more header file cleanups.




Re: [Qemu-devel] [PATCH v17 2/6] radix tree test suite: add tests for xbitmap

2017-11-06 Thread Matthew Wilcox
On Fri, Nov 03, 2017 at 04:13:02PM +0800, Wei Wang wrote:
> From: Matthew Wilcox <mawil...@microsoft.com>
> 
> Add the following tests for xbitmap:
> 1) single bit test: single bit set/clear/find;
> 2) bit range test: set/clear a range of bits and find a 0 or 1 bit in
> the range.
> 
> Signed-off-by: Wei Wang <wei.w.w...@intel.com>
> Cc: Matthew Wilcox <mawil...@microsoft.com>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Cc: Michael S. Tsirkin <m...@redhat.com>
> ---
>  tools/include/linux/bitmap.h|  34 
>  tools/include/linux/kernel.h|   2 +
>  tools/testing/radix-tree/Makefile   |   7 +-
>  tools/testing/radix-tree/linux/kernel.h |   2 -
>  tools/testing/radix-tree/main.c |   5 +
>  tools/testing/radix-tree/test.h |   1 +
>  tools/testing/radix-tree/xbitmap.c  | 278 
> 

Umm.  No.  You've duplicated xbitmap.c into the test framework, so now it can
slowly get out of sync with the one in lib/.  That's not OK.

Put it back the way it was, with the patch I gave you as patch 1/n
(relocating xbitmap.c from tools/testing/radix-tree to lib/).
Then add your enhancements as patch 2/n.  All you should need to
change in your 1/n from
http://git.infradead.org/users/willy/linux-dax.git/commit/727e401bee5ad7d37e0077291d90cc17475c6392
is a bit of Makefile tooling.  Leave the test suite embedded in the file;
that way people might remember to update the test suite when adding
new functionality.




Re: [Qemu-devel] [PATCH v15 2/5] lib/xbitmap: add xb_find_next_bit() and xb_zero()

2017-09-11 Thread Matthew Wilcox
On Mon, Aug 28, 2017 at 06:08:30PM +0800, Wei Wang wrote:
> +/**
> + *  xb_zero - zero a range of bits in the xbitmap
> + *  @xb: the xbitmap that the bits reside in
> + *  @start: the start of the range, inclusive
> + *  @end: the end of the range, inclusive
> + */
> +void xb_zero(struct xb *xb, unsigned long start, unsigned long end)
> +{
> + unsigned long i;
> +
> + for (i = start; i <= end; i++)
> + xb_clear_bit(xb, i);
> +}
> +EXPORT_SYMBOL(xb_zero);

Um.  This is not exactly going to be quick if you're clearing a range of bits.
I think it needs to be more along the lines of this:

void xb_clear(struct xb *xb, unsigned long start, unsigned long end) 
{ 
struct radix_tree_root *root = &xb->xbrt; 
struct radix_tree_node *node; 
void **slot; 
struct ida_bitmap *bitmap; 
 
for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) { 
unsigned long index = start / IDA_BITMAP_BITS; 
unsigned long bit = start % IDA_BITMAP_BITS; 

bitmap = __radix_tree_lookup(root, index, &node, &slot);
if (radix_tree_exception(bitmap)) {
unsigned long ebit = bit + 2;
unsigned long tmp = (unsigned long)bitmap;
if (ebit >= BITS_PER_LONG)
continue;
tmp &= ... something ...;
if (tmp == RADIX_TREE_EXCEPTIONAL_ENTRY)
__radix_tree_delete(root, node, slot);
else
rcu_assign_pointer(*slot, (void *)tmp);
} else if (bitmap) {
unsigned int nbits = end - start + 1;
if (nbits + bit > IDA_BITMAP_BITS)
nbits = IDA_BITMAP_BITS - bit;
bitmap_clear(bitmap->bitmap, bit, nbits);
if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
kfree(bitmap);
__radix_tree_delete(root, node, slot);
}
}
}
}

Also note that this should be called xb_clear(), not xb_zero() to fit
in with bitmap_clear().  And this needs a thorough test suite testing
all values for 'start' and 'end' between 0 and at least 1024; probably
much higher.  And a variable number of bits need to be set before calling
xb_clear() in the test suite.
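Something like this, for concreteness (a sketch only: it assumes the
userspace radix-tree test harness and elides xb_preload(); the seeding
pattern is arbitrary):

#include <assert.h>

static void xb_clear_range_test(struct xb *xb)
{
	unsigned long start, end, i;

	for (start = 0; start < 1024; start++) {
		for (end = start; end < 1024; end++) {
			/* seed a variable number of bits first */
			for (i = 0; i < 1024; i += (start % 7) + 1)
				xb_set_bit(xb, i);
			xb_clear(xb, start, end);
			/* nothing inside [start, end] may survive */
			assert(xb_find_next_bit(xb, start, end, 1) > end);
			/* reset for the next round */
			xb_clear(xb, 0, 1023);
		}
	}
}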

Also, this implementation above is missing a few tricks.  For example,
if 'bit' is 0 and 'nbits' == IDA_BITMAP_BITS, we can simply call kfree
without first zeroing out the bits and then checking if the whole thing
is zero.
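That fast path would slot into the "else if (bitmap)" arm above, roughly
like this (a sketch reusing the variables from xb_clear() above):

		if (bit == 0 && nbits == IDA_BITMAP_BITS) {
			/* the whole bitmap goes: no need to zero it first */
			kfree(bitmap);
			__radix_tree_delete(root, node, slot);
		} else {
			bitmap_clear(bitmap->bitmap, bit, nbits);
			if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
				kfree(bitmap);
				__radix_tree_delete(root, node, slot);
			}
		}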

Another missing optimisation above is that we always restart the radix
tree walk from the top instead of simply moving on to the next bitmap.
This is still a thousand times faster than the implementation you posted,
but I'd be keen to add that optimisation too.
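With the iterator API, that walk looks roughly like this (a sketch; the
per-bitmap clearing is the same as above):

	struct radix_tree_iter iter;
	void **slot;

	radix_tree_for_each_slot(slot, root, &iter, start / IDA_BITMAP_BITS) {
		if (iter.index > end / IDA_BITMAP_BITS)
			break;
		/* clear the covered bits in this bitmap as above, without
		 * restarting the lookup from the root for the next one */
	}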

> +/**
> + * xb_find_next_bit - find next 1 or 0 in the give range of bits
> + * @xb: the xbitmap that the bits reside in
> + * @start: the start of the range, inclusive
> + * @end: the end of the range, inclusive
> + * @set: the polarity (1 or 0) of the next bit to find
> + *
> + * Return the index of the found bit in the xbitmap. If the returned index
> + * exceeds @end, it indicates that no such bit is found in the given range.
> + */
> +unsigned long xb_find_next_bit(struct xb *xb, unsigned long start,
> +unsigned long end, bool set)
> +{
> + unsigned long i;
> +
> + for (i = start; i <= end; i++) {
> + if (xb_test_bit(xb, i) == set)
> + break;
> + }
> +
> + return i;
> +}
> +EXPORT_SYMBOL(xb_find_next_bit);

Similar comments ... this is going to be very slow.  You can use the
tags in the tree to help you find set and clear bits so performance
doesn't fall apart in big trees.

I'd like to see this be two functions, xb_find_next_zero_bit() and
xb_find_next_set_bit().
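Even before wiring up tags, scanning a whole ida_bitmap at a time with
find_next_bit() beats one radix tree lookup per bit.  A sketch of the
set-bit half (exceptional entries and tag-assisted skipping omitted for
brevity):

unsigned long xb_find_next_set_bit(struct xb *xb, unsigned long start,
				   unsigned long end)
{
	while (start <= end) {
		unsigned long index = start / IDA_BITMAP_BITS;
		unsigned long bit = start % IDA_BITMAP_BITS;
		struct ida_bitmap *bitmap;

		bitmap = radix_tree_lookup(&xb->xbrt, index);
		if (bitmap) {
			bit = find_next_bit(bitmap->bitmap,
					    IDA_BITMAP_BITS, bit);
			if (bit < IDA_BITMAP_BITS)
				return index * IDA_BITMAP_BITS + bit;
		}
		/* jump to the start of the next bitmap */
		start = (start | (IDA_BITMAP_BITS - 1)) + 1;
	}
	return end + 1;		/* nothing found in [start, end] */
}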



Re: [Qemu-devel] [PATCH v15 1/5] lib/xbitmap: Introduce xbitmap

2017-09-11 Thread Matthew Wilcox
On Mon, Aug 28, 2017 at 06:08:29PM +0800, Wei Wang wrote:
> From: Matthew Wilcox <mawil...@microsoft.com>
> 
> The eXtensible Bitmap is a sparse bitmap representation which is
> efficient for set bits which tend to cluster.  It supports up to
> 'unsigned long' worth of bits, and this commit adds the bare bones --
> xb_set_bit(), xb_clear_bit() and xb_test_bit().
> 
> Signed-off-by: Matthew Wilcox <mawil...@microsoft.com>
> Signed-off-by: Wei Wang <wei.w.w...@intel.com>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Cc: Michal Hocko <mho...@kernel.org>
> Cc: Michael S. Tsirkin <m...@redhat.com>

This is quite naughty of you.  You've modified the xbitmap implementation
without any indication in the changelog that you did so.  I don't
think the modifications you made are an improvement, but without any
argumentation from you I don't know why you think they're an improvement.

> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index 898e879..ee72e2c 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -496,6 +496,7 @@ static int __radix_tree_preload(gfp_t gfp_mask, unsigned nr)
>  out:
>   return ret;
>  }
> +EXPORT_SYMBOL(__radix_tree_preload);
>  
>  /*
>   * Load up this CPU's radix_tree_node buffer with sufficient objects to

You exported this to modules for some reason.  Why?

> @@ -2003,6 +2018,7 @@ static bool __radix_tree_delete(struct radix_tree_root *root,
>   replace_slot(slot, NULL, node, -1, exceptional);
>   return node && delete_node(root, node, NULL, NULL);
>  }
> +EXPORT_SYMBOL(__radix_tree_delete);
>  
>  /**
>   * radix_tree_iter_delete - delete the entry at this iterator position

Ditto?

> diff --git a/lib/xbitmap.c b/lib/xbitmap.c
> new file mode 100644
> index 000..8c55296
> --- /dev/null
> +++ b/lib/xbitmap.c
> @@ -0,0 +1,176 @@
> +#include 
> +#include 
> +
> +/*
> + * The xbitmap implementation supports up to ULONG_MAX bits, and it is
> + * implemented based on ida bitmaps. So, given an unsigned long index,
> + * the high order XB_INDEX_BITS bits of the index is used to find the
> + * corresponding item (i.e. ida bitmap) from the radix tree, and the low
> + * order (i.e. ilog2(IDA_BITMAP_BITS)) bits of the index are indexed into
> + * the ida bitmap to find the bit.
> + */
> +#define XB_INDEX_BITS  (BITS_PER_LONG - ilog2(IDA_BITMAP_BITS))
> +#define XB_MAX_PATH  (DIV_ROUND_UP(XB_INDEX_BITS, \
> +   RADIX_TREE_MAP_SHIFT))
> +#define XB_PRELOAD_SIZE  (XB_MAX_PATH * 2 - 1)

I don't understand why you moved the xb_preload code here from the
radix tree.  I want all the code which touches the preload implementation
together in one place, which is the radix tree.

> +enum xb_ops {
> + XB_SET,
> + XB_CLEAR,
> + XB_TEST
> +};
> +
> +static int xb_bit_ops(struct xb *xb, unsigned long bit, enum xb_ops ops)
> +{
> + int ret = 0;
> + unsigned long index = bit / IDA_BITMAP_BITS;
> + struct radix_tree_root *root = &xb->xbrt;
> + struct radix_tree_node *node;
> + void **slot;
> + struct ida_bitmap *bitmap;
> + unsigned long ebit, tmp;
> +
> + bit %= IDA_BITMAP_BITS;
> + ebit = bit + RADIX_TREE_EXCEPTIONAL_SHIFT;
> +
> + switch (ops) {
> + case XB_SET:
> + ret = __radix_tree_create(root, index, 0, &node, &slot);
> + if (ret)
> + return ret;
> + bitmap = rcu_dereference_raw(*slot);
> + if (radix_tree_exception(bitmap)) {
> + tmp = (unsigned long)bitmap;
> + if (ebit < BITS_PER_LONG) {
> + tmp |= 1UL << ebit;
> + rcu_assign_pointer(*slot, (void *)tmp);
> + return 0;
> + }
> + bitmap = this_cpu_xchg(ida_bitmap, NULL);
> + if (!bitmap)
> + return -EAGAIN;
> + memset(bitmap, 0, sizeof(*bitmap));
> + bitmap->bitmap[0] =
> + tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
> + rcu_assign_pointer(*slot, bitmap);
> + }
> + if (!bitmap) {
> + if (ebit < BITS_PER_LONG) {
> + bitmap = (void *)((1UL << ebit) |
> + RADIX_TREE_EXCEPTIONAL_ENTRY);
> + __radix_tree_replace(root, node, slot, bitmap,
> +  NULL, NULL);
> +  

Re: [Qemu-devel] [virtio-dev] Re: [PATCH v11 3/6] virtio-balloon: VIRTIO_BALLOON_F_PAGE_CHUNKS

2017-06-28 Thread Matthew Wilcox
On Thu, Jun 15, 2017 at 04:10:17PM +0800, Wei Wang wrote:
> > So you still have a home-grown bitmap. I'd like to know why
> > isn't xbitmap suggested for this purpose by Matthew Wilcox
> > appropriate. Please add a comment explaining the requirements
> > from the data structure.
> 
> I didn't find his xbitmap being upstreamed, did you?

It doesn't have any users in the tree yet.  Can't add code with no users.
You should be the first!



Re: [Qemu-devel] [PATCH v9 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2017-04-14 Thread Matthew Wilcox
On Fri, Apr 14, 2017 at 04:50:48AM +0300, Michael S. Tsirkin wrote:
> On Thu, Apr 13, 2017 at 01:44:11PM -0700, Matthew Wilcox wrote:
> > On Thu, Apr 13, 2017 at 05:35:03PM +0800, Wei Wang wrote:
> > > 2) transfer the guest's unused pages to the host so that they
> > > can be skipped during live migration.
> > 
> > I don't understand this second bit.  You leave the pages on the free list,
> > and tell the host they're free.  What's preventing somebody else from
> > allocating them and using them for something?  Is the guest semi-frozen
> > at this point with just enough of it running to ask the balloon driver
> > to do things?
> 
> There's missing documentation here.
> 
> The way things actually work is host sends to guest
> a request for unused pages and then write-protects all memory.

... hopefully you mean "write protects all memory, then sends a request
for unused pages", otherwise there's a race condition.

And I see the utility of this, but does this functionality belong in
the balloon driver?  It seems like it's something you might want even
if you don't have the balloon driver loaded.  Or something you might
not want if you do have the balloon driver loaded.



Re: [Qemu-devel] [PATCH v9 3/5] mm: function to offer a page block on the free list

2017-04-13 Thread Matthew Wilcox
On Fri, Apr 14, 2017 at 10:30:27AM +0800, Wei Wang wrote:
> OK. What do you think if we add this:
> 
> #if defined(CONFIG_VIRTIO_BALLOON) || defined(CONFIG_VIRTIO_BALLOON_MODULE)

That's spelled "IS_ENABLED(CONFIG_VIRTIO_BALLOON)", FYI.
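That is (a trivial illustration, not from the patch itself):

/* IS_ENABLED() is true for both =y and =m, replacing the open-coded
 * defined(CONFIG_FOO) || defined(CONFIG_FOO_MODULE) pair.
 */
#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
/* code wanted only when the balloon driver is built */
#endif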



Re: [Qemu-devel] [PATCH v9 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2017-04-13 Thread Matthew Wilcox
On Thu, Apr 13, 2017 at 05:35:03PM +0800, Wei Wang wrote:
> 2) transfer the guest's unused pages to the host so that they
> can be skipped during live migration.

I don't understand this second bit.  You leave the pages on the free list,
and tell the host they're free.  What's preventing somebody else from
allocating them and using them for something?  Is the guest semi-frozen
at this point with just enough of it running to ask the balloon driver
to do things?




Re: [Qemu-devel] [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS

2017-04-13 Thread Matthew Wilcox
On Thu, Apr 13, 2017 at 07:34:19PM +0300, Michael S. Tsirkin wrote:
> So we don't need the bitmap to talk to host, it is just
> a data structure we chose to maintain lists of pages, right?
> OK as far as it goes but you need much better isolation for it.
> Build a data structure with APIs such as _init, _cleanup, _add, _clear,
> _find_first, _find_next.
> Completely unrelated to pages, it just maintains bits.
> Then use it here.

That sounds an awful lot like the xbitmap I wrote a few months ago ...

http://git.infradead.org/users/willy/linux-dax.git/commit/727e401bee5ad7d37e0077291d90cc17475c6392



Re: [Qemu-devel] [PATCH v7 kernel 3/5] virtio-balloon: implementation of VIRTIO_BALLOON_F_CHUNK_TRANSFER

2017-03-11 Thread Matthew Wilcox
On Sat, Mar 11, 2017 at 07:59:31PM +0800, Wei Wang wrote:
> I'm thinking: what if the guest needs to transfer this much physically
> contiguous memory to the host: 1GB+2MB+64KB+32KB+16KB+4KB.
> Is it going to use six 64-bit chunks? Would it be simpler if we just
> used the 128-bit chunk format (we could drop the previous normal 64-bit
> format)?

Is that a likely thing for the guest to need to do though?  Freeing a
1GB page is much more likely, IMO.



Re: [Qemu-devel] [PATCH v7 kernel 3/5] virtio-balloon: implementation of VIRTIO_BALLOON_F_CHUNK_TRANSFER

2017-03-10 Thread Matthew Wilcox
On Fri, Mar 10, 2017 at 09:35:21PM +0200, Michael S. Tsirkin wrote:
> > bit 0 clear => bits 1-11 encode a page count, bits 12-63 encode a PFN, page size 4k.
> > bit 0 set, bit 1 clear => bits 2-12 encode a page count, bits 13-63 encode a PFN, page size 8k.
> > bits 0+1 set, bit 2 clear => bits 3-13 for page count, bits 14-63 for PFN, page size 16k.
> > bits 0-2 set, bit 3 clear => bits 4-14 for page count, bits 15-63 for PFN, page size 32k.
> > bits 0-3 set, bit 4 clear => bits 5-15 for page count, bits 16-63 for PFN, page size 64k.
> > That means we can always pass 2048 pages (of whatever page size) in a
> > single chunk.  And we support arbitrary power of two page sizes.
> > I suggest something like this:
> > 
> > u64 page_to_chunk(struct page *page)
> > {
> > 	u64 chunk = page_to_pfn(page) << PAGE_SHIFT;
> > 	chunk |= (1UL << compound_order(page)) - 1;
> > 	return chunk;
> > }
> 
> You need to fill in the size, do you not?

I think I did ... (1UL << compound_order(page)) - 1 sets the bottom
N bits.  Bit N+1 will already be clear.  What am I missing?

> > > - host should pass its base page size to guest
> > >   this can be a separate patch and for now we can fall back on 12 bit
> > >   if not there
> > 
> > With this encoding scheme, I don't think we need to do this?  As long as
> > it's *at least* 12 bit, then we're fine.
> 
> I think we will still need something like this down the road.  The point
> is that not all hosts are able to use 4k pages in a balloon.
> So it's pointless for guest to pass 4k pages to such a host,
> and we need host to tell guest the page size it needs.
> 
> However that's a separate feature that can wait until
> another day.

Ah, the TRIM/DISCARD debate all over again ... should the guest batch
up or should the host do that work ... probably easier to account it in
the guest.  Might be better to frame it as 'balloon chunk size' rather than
host page size as it might have nothing to do with the host page size.

> > What per-chunk flags are you thinking would be useful?
> 
> Not entirely sure but I think would have been prudent to leave some free
> if possible. Your encoding seems to use them all up, so be it.

We don't necessarily have to support 2048 pages in a single chunk.
If it's worth reserving some bits, we can do that at the expense of
reducing the maximum number of pages per chunk.



Re: [Qemu-devel] [PATCH v7 kernel 3/5] virtio-balloon: implementation of VIRTIO_BALLOON_F_CHUNK_TRANSFER

2017-03-10 Thread Matthew Wilcox
On Fri, Mar 10, 2017 at 09:10:53PM +0200, Michael S. Tsirkin wrote:
> > I completely agree with you that we should be able to pass a hugepage
> > as a single chunk.  Also we shouldn't assume that host and guest have
> > the same page size.  I think we can come up with a scheme that actually
> > lets us encode that into a 64-bit word, something like this:
> > 
> > bit 0 clear => bits 1-11 encode a page count, bits 12-63 encode a PFN, page size 4k.
> > bit 0 set, bit 1 clear => bits 2-12 encode a page count, bits 13-63 encode a PFN, page size 8k.
> > bits 0+1 set, bit 2 clear => bits 3-13 for page count, bits 14-63 for PFN, page size 16k.
> > bits 0-2 set, bit 3 clear => bits 4-14 for page count, bits 15-63 for PFN, page size 32k.
> > bits 0-3 set, bit 4 clear => bits 5-15 for page count, bits 16-63 for PFN, page size 64k.
> 
> huge page sizes go up to gigabytes.

There was supposed to be a '...' there.  For a 16GB hugepage (largest
size I know of today), that'd be:

bits 0-21 set, 22 clear, 23-33 page count, 34-63 PFN, page size 16G




Re: [Qemu-devel] [PATCH v7 kernel 3/5] virtio-balloon: implementation of VIRTIO_BALLOON_F_CHUNK_TRANSFER

2017-03-10 Thread Matthew Wilcox
On Fri, Mar 10, 2017 at 05:58:28PM +0200, Michael S. Tsirkin wrote:
> One of the issues of current balloon is the 4k page size
> assumption. For example if you free a huge page you
> have to split it up and pass 4k chunks to host.
> Quite often host can't free these 4k chunks at all (e.g.
> when it's using huge tlb fs).
> It's even sillier for architectures with base page size >4k.

I completely agree with you that we should be able to pass a hugepage
as a single chunk.  Also we shouldn't assume that host and guest have
the same page size.  I think we can come up with a scheme that actually
lets us encode that into a 64-bit word, something like this:

bit 0 clear => bits 1-11 encode a page count, bits 12-63 encode a PFN, page size 4k.
bit 0 set, bit 1 clear => bits 2-12 encode a page count, bits 13-63 encode a PFN, page size 8k.
bits 0+1 set, bit 2 clear => bits 3-13 for page count, bits 14-63 for PFN, page size 16k.
bits 0-2 set, bit 3 clear => bits 4-14 for page count, bits 15-63 for PFN, page size 32k.
bits 0-3 set, bit 4 clear => bits 5-15 for page count, bits 16-63 for PFN, page size 64k.

That means we can always pass 2048 pages (of whatever page size) in a single
chunk.  And we support arbitrary power of two page sizes.  I suggest something
like this:

u64 page_to_chunk(struct page *page)
{
	u64 chunk = page_to_pfn(page) << PAGE_SHIFT;
	chunk |= (1UL << compound_order(page)) - 1;
	return chunk;
}

(note this is a single page of order N, so we leave the page count bits
set to 0, meaning one page).
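Decoding is mechanical -- a sketch of the inverse, assuming a 64-bit
chunk (ffz() gives the index of the first zero bit, i.e. the number of
low set bits N):

void chunk_decode(u64 chunk, unsigned int *order, unsigned long *pfn,
		  unsigned long *npages)
{
	unsigned int n = ffz(chunk);		/* N low bits set */

	*order = n;				/* page size is 4k << N */
	*npages = ((chunk >> (n + 1)) & 0x7ff) + 1; /* field 0 == one page */
	*pfn = (chunk >> (n + 12)) << n;	/* PFN stored in 2^N-page units */
}

This also covers the 16GB case elsewhere in the thread: 22 low set bits
give a page size of 4k << 22 = 16G.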

> Two things to consider:
> - host should pass its base page size to guest
>   this can be a separate patch and for now we can fall back on 12 bit
>   if not there

With this encoding scheme, I don't think we need to do this?  As long as
it's *at least* 12 bit, then we're fine.

> - guest should pass full huge pages to host
>   this should be done correctly to avoid breaking up huge pages
>   I would say yes let's use a single format but drop the "normal chunk"
>   and always use the extended one.
>   Also, size is in units of 4k, right? Please document that low 12 bit
>   are reserved, they will be handy as e.g. flags.

What per-chunk flags are you thinking would be useful?



Re: [Qemu-devel] [PATCH v7 kernel 3/5] virtio-balloon: implementation of VIRTIO_BALLOON_F_CHUNK_TRANSFER

2017-03-09 Thread Matthew Wilcox
On Fri, Mar 03, 2017 at 01:40:28PM +0800, Wei Wang wrote:
> From: Liang Li 
> 1) allocating pages (6.5%)
> 2) sending PFNs to host (68.3%)
> 3) address translation (6.1%)
> 4) madvise (19%)
> 
> This patch optimizes step 2) by transferring pages to the host in
> chunks. A chunk consists of guest physically contiguous pages, and
> it is offered to the host via a base PFN (i.e. the start PFN of
> those physically contiguous pages) and the size (i.e. the total
> number of the pages). A normal chunk is formatted as below:
> ------------------------------------------
> |     Base (52 bit)      | Size (12 bit) |
> ------------------------------------------
> For large size chunks, an extended chunk format is used:
> ------------------------------------------
> |             Base (64 bit)              |
> ------------------------------------------
> ------------------------------------------
> |             Size (64 bit)              |
> ------------------------------------------

What's the advantage to extended chunks?  IOW, why is the added complexity
of having two chunk formats worth it?  You already reduced the overhead by
a factor of 4096 with normal chunks ... how often are extended chunks used
and how much more efficient are they than having several normal chunks?