[virtio-dev] Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure

2019-08-13 Thread Alexander Duyck
On Tue, Aug 13, 2019 at 3:34 AM David Hildenbrand  wrote:
>
>  +static int process_free_page(struct page *page,
>  +struct page_reporting_config *phconf, int 
>  count)
>  +{
>  +   int mt, order, ret = 0;
>  +
>  +   mt = get_pageblock_migratetype(page);
>  +   order = page_private(page);
>  +   ret = __isolate_free_page(page, order);
>  +
> >> I just started looking into the wonderful world of
> >> isolation/compaction/migration.
> >>
> >> I don't think saving/restoring the migratetype is correct here. AFAIK,
> >> MOVABLE/UNMOVABLE/RECLAIMABLE is just a hint, doesn't mean that e.g.,
> >> movable pages and up in UNMOVABLE or ordinary kernel allocations on
> >> MOVABLE. So that shouldn't be an issue - I guess.
> >>
> >> 1. You should never allocate something that is no
> >> MOVABLE/UNMOVABLE/RECLAIMABLE. Especially not, if you have ISOLATE or
> >> CMA here. There should at least be a !is_migrate_isolate_page() check
> >> somewhere
> >>
> >> 2. set_migratetype_isolate() takes the zone lock, so to avoid racing
> >> with isolation code, you have to hold the zone lock. Your code seems to
> >> do that, so at least you cannot race against isolation.
> >>
> >> 3. You could end up temporarily allocating something in the
> >> ZONE_MOVABLE. The pages you allocate are, however, not movable. There
> >> would have to be a way to make alloc_contig_range()/offlining code
> >> properly wait until the pages have been processed. Not sure about the
> >> real implications, though - too many details in the code (I wonder if
> >> Alex' series has a way of dealing with that)
> >>
> >> When you restore the migratetype, you could suddenly overwrite e.g.,
> >> ISOLATE, which feels wrong.
> >
> >
> > I was triggering an occasional CPU stall bug earlier, with saving and 
> > restoring
> > the migratetype I was able to fix it.
> > But I will further look into this to figure out if it is really required.
> >
>
> You should especially look into handling isolated/cma pages. Maybe that
> was the original issue. Alex seems to have added that in his latest
> series (skipping isolated/cma pageblocks completely) as well.

So as far as skipping isolated pageblocks, I get the reason for
skipping isolated, but why would we need to skip CMA? I had made the
change I did based on comments you had made earlier. But while working
on some of the changes to address isolation better and looking over
several spots in the code it seems like CMA is already being used as
an allocation fallback for MIGRATE_MOVABLE. If that is the case
wouldn't it make sense to allow pulling pages and reporting them while
they are in the free_list?

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] [PATCH] content: document console vq detach buffer

2019-08-13 Thread Stefan Hajnoczi
On Tue, Aug 13, 2019 at 06:21:33PM +0530, Pankaj Gupta wrote:
> This patch documents console special case where vq buffers
> are deleted at port hotunplug time. This behavior is different in
> other devices where vq buffers are deleted at device unplug time.
> 
> Signed-off-by: Pankaj Gupta 
> ---
>  content.tex | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/content.tex b/content.tex
> index ee0d7c9..33e8ccc 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -497,6 +497,8 @@ \section{Device Cleanup}\label{sec:General Initialization 
> And Device Operation /
>  of a live virtqueue.
>  
>  Thus a driver MUST ensure a virtqueue isn't live (by device reset) before 
> removing exposed buffers.
> +Console has a special property that when port is detached virtqueue is 
> considered stopped, device
> +must not use any buffers, and it is legal to take buffers out of the device.

The word "detach" is not defined or used in the spec, so it's unclear
what this is supposed to mean.  Is this related to a control message
(VIRTIO_CONSOLE_PORT_OPEN)?

How about:

  The device MUST NOT use buffers after it has sent
  VIRTIO_CONSOLE_PORT_OPEN with value=0 for a port.

I'm not sure what you mean by "take buffers out of the device"?  I guess
you mean the driver can modify the vring because the device isn't
looking.


signature.asc
Description: PGP signature


Re: [virtio-dev] [PATCH v5 1/2] content: add virtio file system device

2019-08-13 Thread Stefan Hajnoczi
On Tue, Aug 13, 2019 at 07:24:45AM -0400, Michael S. Tsirkin wrote:
> On Tue, Aug 13, 2019 at 10:07:11AM +0100, Stefan Hajnoczi wrote:
> > On Sat, Aug 03, 2019 at 05:12:15PM -0400, Michael S. Tsirkin wrote:
> > > On Fri, Jul 26, 2019 at 10:53:37AM +0100, Stefan Hajnoczi wrote:
> > > > +\subsubsection{Live migration considerations}\label{sec:Device Types / 
> > > > File System Device / Live Migration Considerations}
> > > > +
> > > > +When a guest is migrated to a new host
> > > 
> > > Please use driver and device here and elsewhere.
> > 
> > Will fix.
> > 
> > > > it is necessary to consider the FUSE
> > > > +session and its state.  The continuity of FUSE inode numbers (also 
> > > > known as
> > > > +nodeids) and fh values is necessary so the driver can continue 
> > > > operation
> > > > +without disruption.
> > > > +
> > > > +It is possible to maintain the FUSE session across live migration 
> > > > either by
> > > > +transferring the state or by redirecting requests from the new host to 
> > > > the old
> > > > +host where the state resides.  The details of how to achieve this are
> > > > +implementation-dependent and are not visible at the device interface 
> > > > level.
> > > > +
> > > > +Maintaining version and feature information negotiated by FUSE\_INIT is
> > > > +necessary so that no FUSE protocol feature changes are visible to the 
> > > > driver
> > > > +across live migration.  The FUSE\_INIT information forms part of the 
> > > > FUSE
> > > > +session state that needs to be transferred during live migration.
> > > 
> > > It bothers me that it's implicit, and is not exposed easily at the
> > > virtio level, so one has to bind a virtqueue and run buffers over it to
> > > even just check whether a given device supports a specific interface and
> > > can be migrated to.
> > > 
> > > How about exposing the version of the device in the config space?
> > > Spec can require that it matches the contents of the INIT command.
> > 
> > FUSE_INIT works like this:
> > 
> > 1. The client sends FUSE_INIT with the supported highest protocol
> >version.
> > 2. If the server supports the client's version, it replies with the same
> >version number and fills in the FUSE_INIT parameters.
> > 
> >Otherwise the server lowers the version number and replies to the
> >client, and the client will resend FUSE_INIT with the lower version
> >number to acknowledge that it supports the lower version.  Go to step
> >1.
> > 
> > With regards to live migration the server's FUSE_INIT reply structure
> > needs to be sent to the destination if a FUSE session has been
> > established.  If the destination does not support this protocol version
> > then live migration fails.
> > 
> > If management tools want to ensure that migration always succeeds then I
> > think the way to do that is not via virtio config space (they would
> > still need to start a guest in order to access it!).
> > 
> > I don't understand the benefit of exposing the device's version in the
> > config space?  Do you mean the negotiated version of the current FUSE
> > session or the highest FUSE protocol version supported by the device?
> 
> The highest version.
> The point is to make it crystal clear that for cross version migration
> the highest version must be consistent.
> Otherwise this is something implementers might easily overlook.

For migration the current FUSE session's negotiated version must be
supported by the destination.  That's all.  The highest version
supported by the device doesn't matter.

> Another option is to add non-normative text explaining what must
> be kept consistent to allow migration between devices.
> 
> Generally I think that it is unfortunate that pass-through
> devices like this one are growing extensions for
> feature negotiation outside the spec.
> This hurts attempts at generic transport pass-through like vhost-user.

I'm not sure how we'll implement vhost-user live migration.  Currently
the vhost-user protocol doesn't allow the slave to provide migration
state.  But we'll need to do this in order to transfer all the state
associated with the virtio-fs device (FUSE_INIT struct, inodes, file
locks, etc).  It's important to recognize that this state goes way
beyond the FUSE_INIT struct, there is a lot of implementation-specific
state associated with a FUSE session!

Stefan


signature.asc
Description: PGP signature


[virtio-dev] [PATCH v6 0/2] virtio-fs: add virtio file system device

2019-08-13 Thread Stefan Hajnoczi
v6:
 * Clarify that num_queues only counts request queues [Cornelia]
 * State that only high priority requests go on the hiprio queue [Cornelia]
 * Expand on how endianness works [Cornelia]
 * Use "driver" and "device" instead of "guest" and "host" [Michael]
 * Explain how setuid files and device nodes can be a security issue [Michael]
 * Clarify that security issues with shared file systems involve multiple 
machines [Michael]
 * Document timing side-channel attacks [Michael]

v5:
 * Explain multiqueue semantics: no ordering, identical functionality on each 
queue, one FUSE session state shared between all queues
 * Explain how the FUSE session is started with a FUSE_INIT request
 * Consistently use "submit" vs "made available" and "complete" vs "used" 
terminology [Michael]
 * Explain endianness [Michael]
 * Clarify hiprio vs normal queue usage [Michael]
 * Move SHOULD, MUST, etc wording into normative sections [Michael]
 * Mention that FUSE_INIT negotiated state needs to be transferred during live 
migration [Michael]
 * State that the DAX window is mapped with writeback caching like RAM [Michael]
 * Mention DAX window mapping alignment constraints (they are communicated via 
the FUSE protocol) [Michael]
 * Explain that FUSE_SETUPMAPPING fails when device resources are exhausted and 
that splitting mappings consumes resources too [Michael]
 * Clarify access rules to DAX window - only touch memory that has a mapping 
establised
 * Document that DAX data persistence is achieved via FUSE_FSYNC

v4:
 * Clarify that there are no request ordering guarantees between requests in a
   single queue [vgoyal]
 * Add explanation of FUSE session endianness detection [dgilbert]

v3:
 * Remove notifications virtqueue, it's unimplemented and can be added when
   needed [Miklos]
 * Add Security Considerations and Live Migration Considerations sections
   [Michael]
v2:
 * Clean up core virtio file system device spec
 * Add DAX window

These patches add the virtio file system device, which is based on Linux FUSE
but includes the DAX window extension.  Similar to virtio-scsi, which
transports SCSI commands, virtio-fs transports FUSE requests and the protocol
documentation is not duplicated here.

The DAX window allows file contents to be accessed directly from shared memory.
This eliminates copying of data, reduces the number of vmexits, and reduces the
guest's memory footprint.  It also allows coherent mmap MAP_SHARED semantics
between guests on the same host.

Stefan Hajnoczi (2):
  content: add virtio file system device
  virtio-fs: add DAX window

 content.tex  |   1 +
 introduction.tex |   3 +
 virtio-fs.tex| 279 +++
 3 files changed, 283 insertions(+)
 create mode 100644 virtio-fs.tex

-- 
2.21.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH v6 2/2] virtio-fs: add DAX window

2019-08-13 Thread Stefan Hajnoczi
Describe how shared memory region ID 0 is the DAX window and how
FUSE_SETUPMAPPING maps file ranges into the window.

Signed-off-by: Stefan Hajnoczi 
---
The FUSE_SETUPMAPPING message is part of the virtio-fs Linux patches:
https://gitlab.com/virtio-fs/linux/blob/virtio-fs/include/uapi/linux/fuse.h

v6:
 * Document timing side-channel attacks [Michael]
---
 virtio-fs.tex | 55 +++
 1 file changed, 55 insertions(+)

diff --git a/virtio-fs.tex b/virtio-fs.tex
index 81adf85..154b043 100644
--- a/virtio-fs.tex
+++ b/virtio-fs.tex
@@ -178,6 +178,51 @@ \subsubsection{Device Operation: High Priority 
Queue}\label{sec:Device Types / F
 
 The driver MUST anticipate that request queues are processed concurrently with 
the hiprio queue.
 
+\subsubsection{Device Operation: DAX Window}\label{sec:Device Types / File 
System Device / Device Operation / Device Operation: DAX Window}
+
+FUSE\_READ and FUSE\_WRITE requests transfer file contents between the
+driver-provided buffer and the device.  In cases where data transfer is
+undesirable, the device can map file contents into the DAX window shared memory
+region.  The driver then accesses file contents directly in device-owned memory
+without a data transfer.
+
+Shared memory region ID 0 is called the DAX window.  Drivers map this shared
+memory region with writeback caching as if it were regular RAM.  The contents
+of the DAX window are undefined unless a mapping exists for that range.
+
+The driver maps a file range into the DAX window using the FUSE\_SETUPMAPPING
+request.  Alignment constraints for FUSE\_SETUPMAPPING and FUSE\_REMOVEMAPPING
+requests are communicated during FUSE\_INIT negotiation.
+
+When a FUSE\_SETUPMAPPING request perfectly overlaps a previous mapping, the
+previous mapping is replaced.  When a mapping partially overlaps a previous
+mapping, the previous mapping is split into one or two smaller mappings.  When
+a mapping is partially unmapped it is also split into one or two smaller
+mappings.
+
+Establishing new mappings or splitting existing mappings consumes resources.
+If the device runs out of resources the FUSE\_SETUPMAPPING request fails until
+resources are available again following FUSE\_REMOVEMAPPING.
+
+After FUSE\_SETUPMAPPING has completed successfully the file range is
+accessible from the DAX window at the offset provided by the driver in the
+request.  A mapping is removed using the FUSE\_REMOVEMAPPING request.
+
+Data is only guaranteed to be persistent when a FUSE\_FSYNC request is used by
+the device after having been made available by the driver following the write.
+
+\devicenormative{\paragraph}{Device Operation: DAX Window}{Device Types / File 
System Device / Device Operation / Device Operation: DAX Window}
+
+The device MUST allow mappings that completely or partially overlap existing 
mappings within the DAX window.
+
+The device MUST reject mappings that would go beyond the end of the DAX window.
+
+\drivernormative{\paragraph}{Device Operation: DAX Window}{Device Types / File 
System Device / Device Operation / Device Operation: DAX Window}
+
+The driver SHOULD be prepared to find shared memory region ID 0 absent and 
fall back to FUSE\_READ and FUSE\_WRITE requests.
+
+The driver MUST NOT access DAX window areas that have not been mapped.
+
 \subsubsection{Security Considerations}\label{sec:Device Types / File System 
Device / Security Considerations}
 
 The device provides access to a file system containing files owned by one or
@@ -206,6 +251,16 @@ \subsubsection{Security Considerations}\label{sec:Device 
Types / File System Dev
 virtio-fs.  They are typically managed at the file system administration level
 by providing shared access only to mutually trusted users.
 
+Multiple machines sharing access to a file system are susceptible to timing
+side-channel attacks.  By measuring the latency of accesses to file contents or
+file system metadata it is possible to infer whether other machines also
+accessed the same information.  Short latencies indicate that the information
+was cached due to a previous access.  This can reveal sensitive information,
+such as whether certain code paths were taken.  The DAX Window provides direct
+access to file contents and is therefore a likely target of such attacks.
+These attacks are also possible with traditional FUSE requests.  The safest
+approach is to avoid sharing file systems between untrusted machines.
+
 \subsubsection{Live migration considerations}\label{sec:Device Types / File 
System Device / Live Migration Considerations}
 
 When a driver is migrated to a new device it is necessary to consider the FUSE
-- 
2.21.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH v6 1/2] content: add virtio file system device

2019-08-13 Thread Stefan Hajnoczi
The virtio file system device transports Linux FUSE requests between a
FUSE daemon running on the host and the FUSE driver inside the guest.

The actual FUSE request definitions are not duplicated in the virtio
specification, similar to how virtio-scsi does not document SCSI
command details.  FUSE request definitions are available here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/fuse.h

This patch documents the core virtio file system device, which is
functional but lacks the DAX feature introduced in the next patch.

Signed-off-by: Stefan Hajnoczi 
---
v6:
 * Clarify that num_queues only counts request queues [Cornelia]
 * State that only high priority requests go on the hiprio queue [Cornelia]
 * Expand on how endianness works [Cornelia]
 * Use "driver" and "device" instead of "guest" and "host" [Michael]
 * Explain how setuid files and device nodes can be a security issue [Michael]
 * Clarify that security issues with shared file systems involve multiple 
machines [Michael]
---
 content.tex  |   1 +
 introduction.tex |   3 +
 virtio-fs.tex| 224 +++
 3 files changed, 228 insertions(+)
 create mode 100644 virtio-fs.tex

diff --git a/content.tex b/content.tex
index ee0d7c9..c14e953 100644
--- a/content.tex
+++ b/content.tex
@@ -5677,6 +5677,7 @@ \subsubsection{Legacy Interface: Framing 
Requirements}\label{sec:Device
 \input{virtio-input.tex}
 \input{virtio-crypto.tex}
 \input{virtio-vsock.tex}
+\input{virtio-fs.tex}
 
 \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
 
diff --git a/introduction.tex b/introduction.tex
index c96acf9..40f16f8 100644
--- a/introduction.tex
+++ b/introduction.tex
@@ -60,6 +60,9 @@ \section{Normative References}\label{sec:Normative References}
\phantomsection\label{intro:SCSI MMC}\textbf{[SCSI MMC]} &
 SCSI Multimedia Commands,
 \newline\url{http://www.t10.org/cgi-bin/ac.pl?t=f=mmc6r00.pdf}\\
+   \phantomsection\label{intro:FUSE}\textbf{[FUSE]} &
+   Linux FUSE interface,
+   
\newline\url{https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/fuse.h}\\
 
 \end{longtable}
 
diff --git a/virtio-fs.tex b/virtio-fs.tex
new file mode 100644
index 000..81adf85
--- /dev/null
+++ b/virtio-fs.tex
@@ -0,0 +1,224 @@
+\section{File System Device}\label{sec:Device Types / File System Device}
+
+The virtio file system device provides file system access.  The device either
+directly manages a file system or it acts as a gateway to a remote file system.
+The details of how the device implementation accesses files are hidden by the
+device interface, allowing for a range of use cases.
+
+Unlike block-level storage devices such as virtio block and SCSI, the virtio
+file system device provides file-level access to data.  The device interface is
+based on the Linux Filesystem in Userspace (FUSE) protocol.  This consists of
+requests for file system traversal and access the files and directories within
+it.  The protocol details are defined by \hyperref[intro:FUSE]{FUSE}.
+
+The device acts as the FUSE file system daemon and the driver acts as the FUSE
+client mounting the file system.  The virtio file system device provides the
+mechanism for transporting FUSE requests, much like /dev/fuse in a traditional
+FUSE application.
+
+This section relies on definitions from \hyperref[intro:FUSE]{FUSE}.
+
+\subsection{Device ID}\label{sec:Device Types / File System Device / Device ID}
+  26
+
+\subsection{Virtqueues}\label{sec:Device Types / File System Device / 
Virtqueues}
+
+\begin{description}
+\item[0] hiprio
+\item[1\ldots n] request queues
+\end{description}
+
+\subsection{Feature bits}\label{sec:Device Types / File System Device / 
Feature bits}
+
+There are currently no feature bits defined.
+
+\subsection{Device configuration layout}\label{sec:Device Types / File System 
Device / Device configuration layout}
+
+All fields of this configuration are always available.
+
+\begin{lstlisting}
+struct virtio_fs_config {
+char tag[36];
+le32 num_queues;
+};
+\end{lstlisting}
+
+\begin{description}
+\item[\field{tag}] is the name associated with this file system.  The tag is
+encoded in UTF-8 and padded with NUL bytes if shorter than the
+available space.  This field is not NUL-terminated if the encoded bytes
+take up the entire field.
+\item[\field{num_queues}] is the total number of request virtqueues exposed by
+the device.  Each virtqueue offers identical functionality and there are no
+ordering guarantees between requests made available on different queues.
+Use of multiple queues is intended to increase performance.
+\end{description}
+
+\drivernormative{\subsubsection}{Device configuration layout}{Device Types / 
File System Device / Device configuration layout}
+
+The driver MUST NOT write to device configuration fields.
+
+The driver MAY use from one up to 

[virtio-dev] [PATCH] content: document console vq detach buffer

2019-08-13 Thread Pankaj Gupta
This patch documents console special case where vq buffers
are deleted at port hotunplug time. This behavior is different in
other devices where vq buffers are deleted at device unplug time.

Signed-off-by: Pankaj Gupta 
---
 content.tex | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/content.tex b/content.tex
index ee0d7c9..33e8ccc 100644
--- a/content.tex
+++ b/content.tex
@@ -497,6 +497,8 @@ \section{Device Cleanup}\label{sec:General Initialization 
And Device Operation /
 of a live virtqueue.
 
 Thus a driver MUST ensure a virtqueue isn't live (by device reset) before 
removing exposed buffers.
+Console has a special property that when port is detached virtqueue is 
considered stopped, device
+must not use any buffers, and it is legal to take buffers out of the device.
 
 \chapter{Virtio Transport Options}\label{sec:Virtio Transport Options}
 
-- 
2.21.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] [PATCH v5 1/2] content: add virtio file system device

2019-08-13 Thread Michael S. Tsirkin
On Tue, Aug 13, 2019 at 10:07:11AM +0100, Stefan Hajnoczi wrote:
> On Sat, Aug 03, 2019 at 05:12:15PM -0400, Michael S. Tsirkin wrote:
> > On Fri, Jul 26, 2019 at 10:53:37AM +0100, Stefan Hajnoczi wrote:
> > > +\subsubsection{Live migration considerations}\label{sec:Device Types / 
> > > File System Device / Live Migration Considerations}
> > > +
> > > +When a guest is migrated to a new host
> > 
> > Please use driver and device here and elsewhere.
> 
> Will fix.
> 
> > > it is necessary to consider the FUSE
> > > +session and its state.  The continuity of FUSE inode numbers (also known 
> > > as
> > > +nodeids) and fh values is necessary so the driver can continue operation
> > > +without disruption.
> > > +
> > > +It is possible to maintain the FUSE session across live migration either 
> > > by
> > > +transferring the state or by redirecting requests from the new host to 
> > > the old
> > > +host where the state resides.  The details of how to achieve this are
> > > +implementation-dependent and are not visible at the device interface 
> > > level.
> > > +
> > > +Maintaining version and feature information negotiated by FUSE\_INIT is
> > > +necessary so that no FUSE protocol feature changes are visible to the 
> > > driver
> > > +across live migration.  The FUSE\_INIT information forms part of the FUSE
> > > +session state that needs to be transferred during live migration.
> > 
> > It bothers me that it's implicit, and is not exposed easily at the
> > virtio level, so one has to bind a virtqueue and run buffers over it to
> > even just check whether a given device supports a specific interface and
> > can be migrated to.
> > 
> > How about exposing the version of the device in the config space?
> > Spec can require that it matches the contents of the INIT command.
> 
> FUSE_INIT works like this:
> 
> 1. The client sends FUSE_INIT with the supported highest protocol
>version.
> 2. If the server supports the client's version, it replies with the same
>version number and fills in the FUSE_INIT parameters.
> 
>Otherwise the server lowers the version number and replies to the
>client, and the client will resend FUSE_INIT with the lower version
>number to acknowledge that it supports the lower version.  Go to step
>1.
> 
> With regards to live migration the server's FUSE_INIT reply structure
> needs to be sent to the destination if a FUSE session has been
> established.  If the destination does not support this protocol version
> then live migration fails.
> 
> If management tools want to ensure that migration always succeeds then I
> think the way to do that is not via virtio config space (they would
> still need to start a guest in order to access it!).
> 
> I don't understand the benefit of exposing the device's version in the
> config space?  Do you mean the negotiated version of the current FUSE
> session or the highest FUSE protocol version supported by the device?

The highest version.
The point is to make it crystal clear that for cross version migration
the highest version must be consistent.
Otherwise this is something implementers might easily overlook.

Another option is to add non-normative text explaining what must
be kept consistent to allow migration between devices.

Generally I think that it is unfortunate that pass-through
devices like this one are growing extensions for
feature negotiation outside the spec.
This hurts attempts at generic transport pass-through like vhost-user.

-- 
MST

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure

2019-08-13 Thread David Hildenbrand
On 13.08.19 12:42, Nitesh Narayan Lal wrote:
> 
> On 8/13/19 6:34 AM, David Hildenbrand wrote:
>> +static int process_free_page(struct page *page,
>> +struct page_reporting_config *phconf, int 
>> count)
>> +{
>> +   int mt, order, ret = 0;
> [...]
>> +/**
>> + * zone_reporting_init - For each zone initializes the page reporting 
>> fields
>> + * and allocates the respective bitmap.
>> + *
>> + * This function returns 0 on successful initialization, -ENOMEM 
>> otherwise.
>> + */
>> +static int zone_reporting_init(void)
>> +{
>> +   struct zone *zone;
>> +   int ret;
>> +
>> +   for_each_populated_zone(zone) {
>> +#ifdef CONFIG_ZONE_DEVICE
>> +   /* we can not report pages which are not in the buddy */
>> +   if (zone_idx(zone) == ZONE_DEVICE)
>> +   continue;
>> +#endif
> I'm pretty sure this isn't needed since I don't think the ZONE_DEVICE
> zone will be considered "populated".
>
 I think you are right (although it's confusing, we will have present
 sections part of a zone but the zone has no present_pages - screams like
 a re factoring - leftover from ZONE_DEVICE introduction).
>>>
>>> I think in that case it is safe to have this check here.
>>> What do you guys suggest?
>> If it's not needed, I'd say drop it (eventually add a comment).
> 
> 
> Comment to mention that we do not expect a zone with non-buddy page to be
> initialized here?

Something along these lines, or something like

/* ZONE_DEVICE is never considered populated */

-- 

Thanks,

David / dhildenb

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure

2019-08-13 Thread David Hildenbrand
 +static int process_free_page(struct page *page,
 +struct page_reporting_config *phconf, int 
 count)
 +{
 +   int mt, order, ret = 0;
 +
 +   mt = get_pageblock_migratetype(page);
 +   order = page_private(page);
 +   ret = __isolate_free_page(page, order);
 +
>> I just started looking into the wonderful world of
>> isolation/compaction/migration.
>>
>> I don't think saving/restoring the migratetype is correct here. AFAIK,
>> MOVABLE/UNMOVABLE/RECLAIMABLE is just a hint, doesn't mean that e.g.,
>> movable pages and up in UNMOVABLE or ordinary kernel allocations on
>> MOVABLE. So that shouldn't be an issue - I guess.
>>
>> 1. You should never allocate something that is no
>> MOVABLE/UNMOVABLE/RECLAIMABLE. Especially not, if you have ISOLATE or
>> CMA here. There should at least be a !is_migrate_isolate_page() check
>> somewhere
>>
>> 2. set_migratetype_isolate() takes the zone lock, so to avoid racing
>> with isolation code, you have to hold the zone lock. Your code seems to
>> do that, so at least you cannot race against isolation.
>>
>> 3. You could end up temporarily allocating something in the
>> ZONE_MOVABLE. The pages you allocate are, however, not movable. There
>> would have to be a way to make alloc_contig_range()/offlining code
>> properly wait until the pages have been processed. Not sure about the
>> real implications, though - too many details in the code (I wonder if
>> Alex' series has a way of dealing with that)
>>
>> When you restore the migratetype, you could suddenly overwrite e.g.,
>> ISOLATE, which feels wrong.
> 
> 
> I was triggering an occasional CPU stall bug earlier, with saving and 
> restoring
> the migratetype I was able to fix it.
> But I will further look into this to figure out if it is really required.
> 

You should especially look into handling isolated/cma pages. Maybe that
was the original issue. Alex seems to have added that in his latest
series (skipping isolated/cma pageblocks completely) as well.

>> [...]
>>> So as per your comments in the cover page, the two functions above
>>> should also probably be plugged into the zone resizing logic somewhere
>>> so if a zone is resized the bitmap is adjusted.
>>>
 +/**
 + * zone_reporting_init - For each zone initializes the page reporting 
 fields
 + * and allocates the respective bitmap.
 + *
 + * This function returns 0 on successful initialization, -ENOMEM 
 otherwise.
 + */
 +static int zone_reporting_init(void)
 +{
 +   struct zone *zone;
 +   int ret;
 +
 +   for_each_populated_zone(zone) {
 +#ifdef CONFIG_ZONE_DEVICE
 +   /* we can not report pages which are not in the buddy */
 +   if (zone_idx(zone) == ZONE_DEVICE)
 +   continue;
 +#endif
>>> I'm pretty sure this isn't needed since I don't think the ZONE_DEVICE
>>> zone will be considered "populated".
>>>
>> I think you are right (although it's confusing, we will have present
>> sections part of a zone but the zone has no present_pages - screams like
>> a re factoring - leftover from ZONE_DEVICE introduction).
> 
> 
> I think in that case it is safe to have this check here.
> What do you guys suggest?

If it's not needed, I'd say drop it (eventually add a comment).


-- 

Thanks,

David / dhildenb

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure

2019-08-13 Thread Nitesh Narayan Lal


On 8/12/19 4:05 PM, David Hildenbrand wrote:
>>> ---
>>>  include/linux/mmzone.h |  11 ++
>>>  include/linux/page_reporting.h |  63 +++
>>>  mm/Kconfig |   6 +
>>>  mm/Makefile|   1 +
>>>  mm/page_alloc.c|  42 -
>>>  mm/page_reporting.c| 332 +
>>>  6 files changed, 448 insertions(+), 7 deletions(-)
>>>  create mode 100644 include/linux/page_reporting.h
>>>  create mode 100644 mm/page_reporting.c
>>>
>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>> index d77d717c620c..ba5f5b508f25 100644
>>> --- a/include/linux/mmzone.h
>>> +++ b/include/linux/mmzone.h
>>> @@ -559,6 +559,17 @@ struct zone {
>>> /* Zone statistics */
>>> atomic_long_t   vm_stat[NR_VM_ZONE_STAT_ITEMS];
>>> atomic_long_t   vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
>>> +#ifdef CONFIG_PAGE_REPORTING
>>> +   /* Pointer to the bitmap in PAGE_REPORTING_MIN_ORDER granularity */
>>> +   unsigned long *bitmap;
>>> +   /* Preserve start and end PFN in case they change due to hotplug */
>>> +   unsigned long base_pfn;
>>> +   unsigned long end_pfn;
>>> +   /* Free pages of granularity PAGE_REPORTING_MIN_ORDER */
>>> +   atomic_t free_pages;
>>> +   /* Number of bits required in the bitmap */
>>> +   unsigned long nbits;
>>> +#endif
>>>  } cacheline_internodealigned_in_smp;
>> Okay, so the original thing this patch set had going for it was that
>> it was non-invasive. However, now you are adding a bunch of stuff to
>> the zone. That kind of loses the non-invasive argument for this patch
>> set compared to mine.
>>
> Adding something to "struct zone" is certainly less invasive than core
> buddy modifications, just saying (I agree that this is suboptimal. I
> would have guessed that all that's needed is a pointer to some private
> structure here). 


I think having just a pointer to a private structure makes sense here.
If I am not wrong then I can probably make an allocation for it for each
populated zone at the time I enable page reporting.

> However, the migratetype thingy below looks fishy to me.
>
>> If we are going to continue further with this patch set then it might
>> be worth looking into dynamically allocating the space you need for
>> this block. At a minimum you could probably look at making the bitmap
>> an RCU based setup so you could define the base and end along with the
>> bitmap. It would probably help to resolve the hotplug issues you still
>> need to address.
> Yeah, I guess that makes sense.
>
> [...]
>>> +
>>> +static int process_free_page(struct page *page,
>>> +struct page_reporting_config *phconf, int 
>>> count)
>>> +{
>>> +   int mt, order, ret = 0;
>>> +
>>> +   mt = get_pageblock_migratetype(page);
>>> +   order = page_private(page);
>>> +   ret = __isolate_free_page(page, order);
>>> +
> I just started looking into the wonderful world of
> isolation/compaction/migration.
>
> I don't think saving/restoring the migratetype is correct here. AFAIK,
> MOVABLE/UNMOVABLE/RECLAIMABLE is just a hint, doesn't mean that e.g.,
> movable pages and up in UNMOVABLE or ordinary kernel allocations on
> MOVABLE. So that shouldn't be an issue - I guess.
>
> 1. You should never allocate something that is no
> MOVABLE/UNMOVABLE/RECLAIMABLE. Especially not, if you have ISOLATE or
> CMA here. There should at least be a !is_migrate_isolate_page() check
> somewhere
>
> 2. set_migratetype_isolate() takes the zone lock, so to avoid racing
> with isolation code, you have to hold the zone lock. Your code seems to
> do that, so at least you cannot race against isolation.
>
> 3. You could end up temporarily allocating something in the
> ZONE_MOVABLE. The pages you allocate are, however, not movable. There
> would have to be a way to make alloc_contig_range()/offlining code
> properly wait until the pages have been processed. Not sure about the
> real implications, though - too many details in the code (I wonder if
> Alex' series has a way of dealing with that)
>
> When you restore the migratetype, you could suddenly overwrite e.g.,
> ISOLATE, which feels wrong.


I was triggering an occasional CPU stall bug earlier, with saving and restoring
the migratetype I was able to fix it.
But I will further look into this to figure out if it is really required.

> [...]
>> So as per your comments in the cover page, the two functions above
>> should also probably be plugged into the zone resizing logic somewhere
>> so if a zone is resized the bitmap is adjusted.
>>
>>> +/**
>>> + * zone_reporting_init - For each zone initializes the page reporting 
>>> fields
>>> + * and allocates the respective bitmap.
>>> + *
>>> + * This function returns 0 on successful initialization, -ENOMEM otherwise.
>>> + */
>>> +static int zone_reporting_init(void)
>>> +{
>>> +   struct zone *zone;
>>> +   int ret;
>>> +
>>> +

Re: [virtio-dev] [PATCH v5 1/2] content: add virtio file system device

2019-08-13 Thread Stefan Hajnoczi
On Sat, Aug 03, 2019 at 05:12:15PM -0400, Michael S. Tsirkin wrote:
> On Fri, Jul 26, 2019 at 10:53:37AM +0100, Stefan Hajnoczi wrote:
> > +\subsubsection{Live migration considerations}\label{sec:Device Types / 
> > File System Device / Live Migration Considerations}
> > +
> > +When a guest is migrated to a new host
> 
> Please use driver and device here and elsewhere.

Will fix.

> > it is necessary to consider the FUSE
> > +session and its state.  The continuity of FUSE inode numbers (also known as
> > +nodeids) and fh values is necessary so the driver can continue operation
> > +without disruption.
> > +
> > +It is possible to maintain the FUSE session across live migration either by
> > +transferring the state or by redirecting requests from the new host to the 
> > old
> > +host where the state resides.  The details of how to achieve this are
> > +implementation-dependent and are not visible at the device interface level.
> > +
> > +Maintaining version and feature information negotiated by FUSE\_INIT is
> > +necessary so that no FUSE protocol feature changes are visible to the 
> > driver
> > +across live migration.  The FUSE\_INIT information forms part of the FUSE
> > +session state that needs to be transferred during live migration.
> 
> It bothers me that it's implicit, and is not exposed easily at the
> virtio level, so one has to bind a virtqueue and run buffers over it to
> even just check whether a given device supports a specific interface and
> can be migrated to.
> 
> How about exposing the version of the device in the config space?
> Spec can require that it matches the contents of the INIT command.

FUSE_INIT works like this:

1. The client sends FUSE_INIT with the supported highest protocol
   version.
2. If the server supports the client's version, it replies with the same
   version number and fills in the FUSE_INIT parameters.

   Otherwise the server lowers the version number and replies to the
   client, and the client will resend FUSE_INIT with the lower version
   number to acknowledge that it supports the lower version.  Go to step
   1.

With regards to live migration the server's FUSE_INIT reply structure
needs to be sent to the destination if a FUSE session has been
established.  If the destination does not support this protocol version
then live migration fails.

If management tools want to ensure that migration always succeeds then I
think the way to do that is not via virtio config space (they would
still need to start a guest in order to access it!).

I don't understand the benefit of exposing the device's version in the
config space?  Do you mean the negotiated version of the current FUSE
session or the highest FUSE protocol version supported by the device?


signature.asc
Description: PGP signature


[virtio-dev] Re: [PATCH v5 4/6] mm: Introduce Reported pages

2019-08-13 Thread David Hildenbrand
On 12.08.19 23:33, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> In order to pave the way for free page reporting in virtualized
> environments we will need a way to get pages out of the free lists and
> identify those pages after they have been returned. To accomplish this,
> this patch adds the concept of a Reported Buddy, which is essentially
> meant to just be the Uptodate flag used in conjunction with the Buddy
> page type.
> 
> It adds a set of pointers we shall call "boundary" which represents the
> upper boundary between the unreported and reported pages. The general idea
> is that in order for a page to cross from one side of the boundary to the
> other it will need to go through the reporting process. Ultimately a
> free_list has been fully processed when the boundary has been moved from
> the tail all they way up to occupying the first entry in the list.
> 
> Doing this we should be able to make certain that we keep the reported
> pages as one contiguous block in each free list. This will allow us to
> efficiently manipulate the free lists whenever we need to go in and start
> sending reports to the hypervisor that there are new pages that have been
> freed and are no longer in use.
> 
> An added advantage to this approach is that we should be reducing the
> overall memory footprint of the guest as it will be more likely to recycle
> warm pages versus trying to allocate the reported pages that were likely
> evicted from the guest memory.
> 
> Since we will only be reporting one zone at a time we keep the boundary
> limited to being defined for just the zone we are currently reporting pages
> from. Doing this we can keep the number of additional pointers needed quite
> small. To flag that the boundaries are in place we use a single bit
> in the zone to indicate that reporting and the boundaries are active.
> 
> The determination of when to start reporting is based on the tracking of
> the number of free pages in a given area versus the number of reported
> pages in that area. We keep track of the number of reported pages per
> free_area in a separate zone specific area. We do this to avoid modifying
> the free_area structure as this can lead to false sharing for the highest
> order with the zone lock which leads to a noticeable performance
> degradation.
> 
> Signed-off-by: Alexander Duyck 
> ---
>  include/linux/mmzone.h |   40 +
>  include/linux/page-flags.h |   11 +
>  include/linux/page_reporting.h |  138 ++
>  mm/Kconfig |5 +
>  mm/Makefile|1 
>  mm/memory_hotplug.c|1 
>  mm/page_alloc.c|  136 +-
>  mm/page_reporting.c|  308 
> 
>  8 files changed, 632 insertions(+), 8 deletions(-)
>  create mode 100644 include/linux/page_reporting.h
>  create mode 100644 mm/page_reporting.c
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 2f2b6f968ed3..b8ed926552b1 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -462,6 +462,14 @@ struct zone {
>   seqlock_t   span_seqlock;
>  #endif
>  
> +#ifdef CONFIG_PAGE_REPORTING
> + /*
> +  * Pointer to reported page tracking statistics array. The size of
> +  * the array is MAX_ORDER - PAGE_REPORTING_MIN_ORDER. NULL when
> +  * unused page reporting is not present.
> +  */
> + unsigned long   *reported_pages;
> +#endif
>   int initialized;
>  
>   /* Write-intensive fields used from the page allocator */
> @@ -537,6 +545,14 @@ enum zone_flags {
>   ZONE_BOOSTED_WATERMARK, /* zone recently boosted watermarks.
>* Cleared when kswapd is woken.
>*/
> + ZONE_PAGE_REPORTING_REQUESTED,  /* zone enabled page reporting and has
> +  * requested flushing the data out of
> +  * higher order pages.
> +  */
> + ZONE_PAGE_REPORTING_ACTIVE, /* zone enabled page reporting and is
> +  * activly flushing the data out of
> +  * higher order pages.
> +  */
>  };
>  
>  static inline unsigned long zone_managed_pages(struct zone *zone)
> @@ -757,6 +773,8 @@ static inline bool pgdat_is_empty(pg_data_t *pgdat)
>   return !pgdat->node_start_pfn && !pgdat->node_spanned_pages;
>  }
>  
> +#include 
> +
>  /* Used for pages not on another list */
>  static inline void add_to_free_list(struct page *page, struct zone *zone,
>   unsigned int order, int migratetype)
> @@ -771,10 +789,16 @@ static inline void add_to_free_list(struct page *page, 
> struct zone *zone,
>  static inline void add_to_free_list_tail(struct page *page, struct zone 
> *zone,
>