Re: [ndctl PATCH] libdaxctl: fix memory leaks with daxctl_memory objects

2019-10-02 Thread Verma, Vishal L
On Tue, 2019-10-01 at 16:19 -0700, Ira Weiny wrote:
> 
> > diff --git a/daxctl/lib/libdaxctl.c b/daxctl/lib/libdaxctl.c
> > index 8abfd64..639224c 100644
> > --- a/daxctl/lib/libdaxctl.c
> > +++ b/daxctl/lib/libdaxctl.c
> > @@ -204,8 +204,9 @@ DAXCTL_EXPORT void daxctl_region_get_uuid(struct 
> > daxctl_region *region, uuid_t u
> >  
> >  static void free_mem(struct daxctl_dev *dev)
> >  {
> > -   if (dev && dev->mem) {
> > +   if (dev->mem) {
> 
> There is a comment in daxctl_dev_disable() which says:
> 
> /* If there is a memory object, first free that */
> 
> I'm 100% sure that dev can't be NULL there.  So that comment no longer 
> applies.
> 
> May want to remove that comment.
> 
Hm, I'm not sure I follow.

Yes, dev can't be null there, but 'mem' can me. Hence the comment saying
if /mem/ is present, free it.

The check in free_mem() only checks for non-NULL mem too.

___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[ndctl PATCH 06/10] daxctl: show a 'movable' attribute in device listings

2019-10-02 Thread Vishal Verma
For dax devices in 'system-ram' mode, display a 'movable' attribute in
device listings. This helps a user easily determine if memory was not
onlined in the expected way for any reason.

Link: https://github.com/pmem/ndctl/issues/110
Reported-by: Ben Olson 
Cc: Dan Williams 
Signed-off-by: Vishal Verma 
---
 util/json.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/util/json.c b/util/json.c
index 5e6a32a..497c52b 100644
--- a/util/json.c
+++ b/util/json.c
@@ -278,7 +278,7 @@ struct json_object *util_daxctl_dev_to_json(struct 
daxctl_dev *dev,
struct daxctl_memory *mem = daxctl_dev_get_memory(dev);
const char *devname = daxctl_dev_get_devname(dev);
struct json_object *jdev, *jobj;
-   int node;
+   int node, movable;
 
jdev = json_object_new_object();
if (!devname || !jdev)
@@ -306,6 +306,18 @@ struct json_object *util_daxctl_dev_to_json(struct 
daxctl_dev *dev,
if (jobj)
json_object_object_add(jdev, "mode", jobj);
 
+   if (mem) {
+   movable = daxctl_memory_is_movable(mem);
+   if (movable == 1)
+   jobj = json_object_new_boolean(true);
+   else if (movable == 0)
+   jobj = json_object_new_boolean(false);
+   else
+   jobj = NULL;
+   if (jobj)
+   json_object_object_add(jdev, "movable", jobj);
+   }
+
if (!daxctl_dev_is_enabled(dev)) {
jobj = json_object_new_string("disabled");
if (jobj)
-- 
2.20.1
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[ndctl PATCH 00/10] fixes and movability for system-ram mode

2019-10-02 Thread Vishal Verma
This patchset improves the user experience around memory onlining,
specifically the state it is onlined in - movable vs non-movable. It
also adds an option to make the memory non-movable when onlining.

Patches 1-3 perform some preparatory and clean up steps.
Patches 4-6 add a way to determine and display the 'movable' vs
'non-movable' state of memory.
Patches 7-8 attempt to detect a race with memory onlining, and add a
Documentation clarification
Patches 9-10 Add the new --no-movable option to commands that may
online memory.

Vishal Verma (10):
  libdaxctl: refactor path construction in op_for_one_memblock()
  libdaxctl: refactor memblock_is_online() checks
  daxctl/device.c: fix json output omission for reconfigure-device
  libdaxctl: add an API to determine if memory is movable
  libdaxctl: allow memblock_in_dev() to return an error
  daxctl: show a 'movable' attribute in device listings
  daxctl: detect races when onlining memory blocks
  Documentation: clarify memory movablity for reconfigure-device
  libdaxctl: add an API to online memory in a non-movable state
  daxctl: add --no-movable option for onlining memory

 Documentation/daxctl/daxctl-online-memory.txt |   2 +
 .../daxctl/daxctl-reconfigure-device.txt  |  24 +-
 Documentation/daxctl/movable-options.txt  |  10 +
 daxctl/device.c   |  45 ++-
 daxctl/lib/libdaxctl-private.h|  26 ++
 daxctl/lib/libdaxctl.c| 280 +-
 daxctl/lib/libdaxctl.sym  |   6 +
 daxctl/libdaxctl.h|   2 +
 util/json.c   |  14 +-
 9 files changed, 324 insertions(+), 85 deletions(-)
 create mode 100644 Documentation/daxctl/movable-options.txt

-- 
2.20.1
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[ndctl PATCH 05/10] libdaxctl: allow memblock_in_dev() to return an error

2019-10-02 Thread Vishal Verma
With the MEM_FIND_ZONE operation, and the expectation that it will be
called from 'daxctl list' listings, it is possible that memblock_in_dev()
gets called without sufficient privileges. If this happens, currently,
the above simply returns a 'false'. This was acceptable when the only
operations were onlining/offlining (as there would be an actual failure
later). However, it is not acceptable in the MEM_FIND_ZONE case, as it
could yeild a different answer based on the privilege level.

Change memblock_in_dev() to return an 'int' instead of a 'bool' so that
error cases can be distinguished from actual address range test.

Cc: Dan Williams 
Signed-off-by: Vishal Verma 
---
 daxctl/lib/libdaxctl.c | 32 +++-
 1 file changed, 19 insertions(+), 13 deletions(-)

diff --git a/daxctl/lib/libdaxctl.c b/daxctl/lib/libdaxctl.c
index 4460174..617887c 100644
--- a/daxctl/lib/libdaxctl.c
+++ b/daxctl/lib/libdaxctl.c
@@ -1211,40 +1211,42 @@ static int memblock_find_zone(struct daxctl_memory 
*mem, char *memblock,
return 0;
 }
 
-static bool memblock_in_dev(struct daxctl_memory *mem, const char *memblock)
+static int memblock_in_dev(struct daxctl_memory *mem, const char *memblock)
 {
const char *mem_base = "/sys/devices/system/memory/";
struct daxctl_dev *dev = daxctl_memory_get_dev(mem);
unsigned long long memblock_res, dev_start, dev_end;
const char *devname = daxctl_dev_get_devname(dev);
struct daxctl_ctx *ctx = daxctl_dev_get_ctx(dev);
+   int rc, path_len = mem->buf_len;
unsigned long memblock_size;
-   int path_len = mem->buf_len;
char buf[SYSFS_ATTR_SIZE];
unsigned long phys_index;
char *path = mem->mem_buf;
 
if (snprintf(path, path_len, "%s/%s/phys_index",
mem_base, memblock) < 0)
-   return false;
+   return -ENXIO;
 
-   if (sysfs_read_attr(ctx, path, buf) == 0) {
+   rc = sysfs_read_attr(ctx, path, buf);
+   if (rc == 0) {
phys_index = strtoul(buf, NULL, 16);
if (phys_index == 0 || phys_index == ULONG_MAX) {
+   rc = -errno;
err(ctx, "%s: %s: Unable to determine phys_index: %s\n",
-   devname, memblock, strerror(errno));
-   return false;
+   devname, memblock, strerror(-rc));
+   return rc;
}
} else {
err(ctx, "%s: %s: Unable to determine phys_index: %s\n",
-   devname, memblock, strerror(errno));
-   return false;
+   devname, memblock, strerror(-rc));
+   return rc;
}
 
dev_start = daxctl_dev_get_resource(dev);
if (!dev_start) {
err(ctx, "%s: Unable to determine resource\n", devname);
-   return false;
+   return -EACCES;
}
dev_end = dev_start + daxctl_dev_get_size(dev);
 
@@ -1252,14 +1254,14 @@ static bool memblock_in_dev(struct daxctl_memory *mem, 
const char *memblock)
if (!memblock_size) {
err(ctx, "%s: Unable to determine memory block size\n",
devname);
-   return false;
+   return -ENXIO;
}
memblock_res = phys_index * memblock_size;
 
if (memblock_res >= dev_start && memblock_res <= dev_end)
-   return true;
+   return 1;
 
-   return false;
+   return 0;
 }
 
 static int op_for_one_memblock(struct daxctl_memory *mem, char *memblock,
@@ -1317,8 +1319,12 @@ static int daxctl_memory_op(struct daxctl_memory *mem, 
enum memory_op op)
errno = 0;
while ((de = readdir(node_dir)) != NULL) {
if (strncmp(de->d_name, "memory", 6) == 0) {
-   if (!memblock_in_dev(mem, de->d_name))
+   rc = memblock_in_dev(mem, de->d_name);
+   if (rc < 0)
+   goto out_dir;
+   if (rc == 0) /* memblock not in dev */
continue;
+   /* memblock is in dev, perform op */
rc = op_for_one_memblock(mem, de->d_name, op,
_flags);
if (rc < 0)
-- 
2.20.1
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[ndctl PATCH 03/10] daxctl/device.c: fix json output omission for reconfigure-device

2019-10-02 Thread Vishal Verma
The reconfig_mode_{devdax,system_ram}() function can have a positive
return status from memory online/offline operations, indicating skipped
memory blocks. Don't omit printing the device listing json in these
cases; the reconfiguration has succeeded, and its results should be
printed.

Link: https://github.com/pmem/ndctl/issues/115
Reported-by: Michal Biesek 
Cc: Dan Williams 
Signed-off-by: Vishal Verma 
---
 daxctl/device.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/daxctl/device.c b/daxctl/device.c
index 4887ccf..920efc6 100644
--- a/daxctl/device.c
+++ b/daxctl/device.c
@@ -398,7 +398,7 @@ static int do_reconfig(struct daxctl_dev *dev, enum 
dev_mode mode,
rc = -EINVAL;
}
 
-   if (rc)
+   if (rc < 0)
return rc;
 
*jdevs = json_object_new_array();
-- 
2.20.1
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[ndctl PATCH 08/10] Documentation: clarify memory movablity for reconfigure-device

2019-10-02 Thread Vishal Verma
Move the note about potential races in memory onlining into the
'Description' section to make it more prominent. Reword the note to talk
about the sources of such races, and talk about the new 'race detection'
semantics in daxctl.

Link: https://github.com/pmem/ndctl/issues/110
Reported-by: Ben Olson 
Cc: Dan Williams 
Signed-off-by: Vishal Verma 
---
 .../daxctl/daxctl-reconfigure-device.txt  | 24 +++
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/Documentation/daxctl/daxctl-reconfigure-device.txt 
b/Documentation/daxctl/daxctl-reconfigure-device.txt
index 196d692..4663529 100644
--- a/Documentation/daxctl/daxctl-reconfigure-device.txt
+++ b/Documentation/daxctl/daxctl-reconfigure-device.txt
@@ -97,6 +97,20 @@ error reconfiguring devices: Operation not supported
 reconfigured 0 devices
 
 
+'daxctl-reconfigure-device' nominally expects that it will online new memory
+blocks as 'movable', so that kernel data doesn't make it into this memory.
+However, there are other potential agents that may be configured to
+automatically online new hot-plugged memory as it appears. Most notably,
+these are the '/sys/devices/system/memory/auto_online_blocks' configuration,
+or system udev rules. If such an agent races to online memory sections, daxctl
+checks if the blocks were onlined as 'movable' memory. If this was not the
+case, and the memory blocks are found to be in a different zone, then a
+warning is displayed. If it is desired that a different agent control the
+onlining of memory blocks, and the associated memory zone, then it is
+recommended to use the --no-online option described below. This will abridge
+the device reconfiguration operation to just hotplugging the memory, and
+refrain from then onlining it.
+
 OPTIONS
 ---
 -r::
@@ -121,16 +135,6 @@ OPTIONS
brought online automatically and immediately with the 'online_movable'
policy. Use this option to disable the automatic onlining behavior.
 
-   NOTE: While this option prevents daxctl from automatically onlining
-   the memory sections, there may be other agents, notably system udev
-   rules, that online new memory sections as they appear. Coordinating
-   with such rules is out of scope of this utility, and the system
-   administrator is expected to remove them if they are undesirable.
-   If such an agent races to online memory sections, daxctl is prepared
-   to lose the race, and not fail the onlining operation as it only
-   cares that the memory section was onlined, not that it was the one
-   to do so.
-
 -f::
 --force::
When converting from "system-ram" mode to "devdax", it is expected
-- 
2.20.1
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[ndctl PATCH 10/10] daxctl: add --no-movable option for onlining memory

2019-10-02 Thread Vishal Verma
Add a new '--no-movable' option to daxctl commands that may online
memory - i.e. daxctl-reconfigure-device and daxctl-online-memory.

Users may wish additional control over the state of the newly added
memory. Retain the daxctl default for onlining memory as 'movable', but
allow it to be overridden using the above option.

Link: https://github.com/pmem/ndctl/issues/110
Cc: Ben Olson 
Cc: Dave Hansen 
Cc: Dan Williams 
Signed-off-by: Vishal Verma 
---
 Documentation/daxctl/daxctl-online-memory.txt |  2 ++
 .../daxctl/daxctl-reconfigure-device.txt  |  2 ++
 Documentation/daxctl/movable-options.txt  | 10 ++
 daxctl/device.c   | 34 ---
 4 files changed, 44 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/daxctl/movable-options.txt

diff --git a/Documentation/daxctl/daxctl-online-memory.txt 
b/Documentation/daxctl/daxctl-online-memory.txt
index 5ac1cbf..08b45cc 100644
--- a/Documentation/daxctl/daxctl-online-memory.txt
+++ b/Documentation/daxctl/daxctl-online-memory.txt
@@ -62,6 +62,8 @@ OPTIONS
more /dev/daxX.Y devices, where X is the region id and Y is the device
instance id.
 
+include::movable-options.txt[]
+
 -u::
 --human::
By default the command will output machine-friendly raw-integer
diff --git a/Documentation/daxctl/daxctl-reconfigure-device.txt 
b/Documentation/daxctl/daxctl-reconfigure-device.txt
index 4663529..cb28fed 100644
--- a/Documentation/daxctl/daxctl-reconfigure-device.txt
+++ b/Documentation/daxctl/daxctl-reconfigure-device.txt
@@ -135,6 +135,8 @@ OPTIONS
brought online automatically and immediately with the 'online_movable'
policy. Use this option to disable the automatic onlining behavior.
 
+include::movable-options.txt[]
+
 -f::
 --force::
When converting from "system-ram" mode to "devdax", it is expected
diff --git a/Documentation/daxctl/movable-options.txt 
b/Documentation/daxctl/movable-options.txt
new file mode 100644
index 000..0ddd7b6
--- /dev/null
+++ b/Documentation/daxctl/movable-options.txt
@@ -0,0 +1,10 @@
+// SPDX-License-Identifier: GPL-2.0
+
+-M::
+--no-movable::
+   '--movable' is the default. This can be overridden to online new
+   memory such that is is not 'movable'. This allows any allocation
+   to potentially be served from this memory. This may preclude subsequent
+   removal. With the '--movable' behavior (which is default), kernel
+   allocations will not consider this memory, and it will be reserved
+   for application use.
diff --git a/daxctl/device.c b/daxctl/device.c
index 28698bf..0695a08 100644
--- a/daxctl/device.c
+++ b/daxctl/device.c
@@ -21,6 +21,7 @@ static struct {
const char *mode;
int region_id;
bool no_online;
+   bool no_movable;
bool force;
bool human;
bool verbose;
@@ -37,6 +38,12 @@ enum dev_mode {
 static enum dev_mode reconfig_mode = DAXCTL_DEV_MODE_UNKNOWN;
 static unsigned long flags;
 
+enum memory_zone {
+   MEM_ZONE_MOVABLE,
+   MEM_ZONE_NORMAL,
+};
+static enum memory_zone mem_zone = MEM_ZONE_MOVABLE;
+
 enum device_action {
ACTION_RECONFIG,
ACTION_ONLINE,
@@ -55,13 +62,24 @@ OPT_BOOLEAN('N', "no-online", _online, \
 OPT_BOOLEAN('f', "force", , \
"attempt to offline memory sections before reconfiguration")
 
+#define ZONE_OPTIONS() \
+OPT_BOOLEAN('M', "no-movable", _movable, \
+   "online memory in ZONE_NORMAL")
+
 static const struct option reconfig_options[] = {
BASE_OPTIONS(),
RECONFIG_OPTIONS(),
+   ZONE_OPTIONS(),
+   OPT_END(),
+};
+
+static const struct option online_options[] = {
+   BASE_OPTIONS(),
+   ZONE_OPTIONS(),
OPT_END(),
 };
 
-static const struct option memory_options[] = {
+static const struct option offline_options[] = {
BASE_OPTIONS(),
OPT_END(),
 };
@@ -126,6 +144,8 @@ static const char *parse_device_options(int argc, const 
char **argv,
}
if (strcmp(param.mode, "system-ram") == 0) {
reconfig_mode = DAXCTL_DEV_MODE_RAM;
+   if (param.no_movable)
+   mem_zone = MEM_ZONE_NORMAL;
} else if (strcmp(param.mode, "devdax") == 0) {
reconfig_mode = DAXCTL_DEV_MODE_DEVDAX;
if (param.no_online) {
@@ -136,6 +156,9 @@ static const char *parse_device_options(int argc, const 
char **argv,
}
break;
case ACTION_ONLINE:
+   if (param.no_movable)
+   mem_zone = MEM_ZONE_NORMAL;
+   /* fall through */
case ACTION_OFFLINE:
/* nothing special */
break;
@@ -194,7 +217,10 @@ static int dev_online_memory(struct daxctl_dev *dev)
num_on == 1 ? "" : "s");
 
/* online the remaining sections */
-   rc = 

[ndctl PATCH 02/10] libdaxctl: refactor memblock_is_online() checks

2019-10-02 Thread Vishal Verma
The {online,offline}_one_memblock() helpers both open-coded the check
for whether a block is online. There is already a function to perform
this check - memblock_is_online(). Consolidate the checking using this
helper everywhere it is applicable.

Cc: Dan Williams 
Signed-off-by: Vishal Verma 
---
 daxctl/lib/libdaxctl.c | 90 +-
 1 file changed, 37 insertions(+), 53 deletions(-)

diff --git a/daxctl/lib/libdaxctl.c b/daxctl/lib/libdaxctl.c
index a828644..6243857 100644
--- a/daxctl/lib/libdaxctl.c
+++ b/daxctl/lib/libdaxctl.c
@@ -1047,12 +1047,11 @@ DAXCTL_EXPORT unsigned long 
daxctl_memory_get_block_size(struct daxctl_memory *m
return mem->block_size;
 }
 
-static int online_one_memblock(struct daxctl_memory *mem, char *memblock)
+static int memblock_is_online(struct daxctl_memory *mem, char *memblock)
 {
struct daxctl_dev *dev = daxctl_memory_get_dev(mem);
const char *devname = daxctl_dev_get_devname(dev);
struct daxctl_ctx *ctx = daxctl_dev_get_ctx(dev);
-   const char *mode = "online_movable";
int len = mem->buf_len, rc;
char buf[SYSFS_ATTR_SIZE];
char *path = mem->mem_buf;
@@ -1073,41 +1072,20 @@ static int online_one_memblock(struct daxctl_memory 
*mem, char *memblock)
return rc;
}
 
-   /*
-* if already online, possibly due to kernel config or a udev rule,
-* there is nothing to do and we can skip over the memblock
-*/
if (strncmp(buf, "online", 6) == 0)
return 1;
 
-   rc = sysfs_write_attr_quiet(ctx, path, mode);
-   if (rc) {
-   /*
-* While we performed an already-online check above, there
-* is still a TOCTOU hole where someone (such as a udev rule)
-* may have raced to online the memory. In such a case,
-* the sysfs store will fail, however we can check for this
-* by simply reading the state again. If it changed to the
-* desired state, then we don't have to error out.
-*/
-   if (sysfs_read_attr(ctx, path, buf) == 0) {
-   if (strncmp(buf, "online", 6) == 0)
-   return 1;
-   }
-   err(ctx, "%s: Failed to online %s: %s\n",
-   devname, path, strerror(-rc));
-   }
-   return rc;
+   /* offline */
+   return 0;
 }
 
-static int offline_one_memblock(struct daxctl_memory *mem, char *memblock)
+static int online_one_memblock(struct daxctl_memory *mem, char *memblock)
 {
struct daxctl_dev *dev = daxctl_memory_get_dev(mem);
const char *devname = daxctl_dev_get_devname(dev);
struct daxctl_ctx *ctx = daxctl_dev_get_ctx(dev);
-   const char *mode = "offline";
+   const char *mode = "online_movable";
int len = mem->buf_len, rc;
-   char buf[SYSFS_ATTR_SIZE];
char *path = mem->mem_buf;
const char *node_path;
 
@@ -1119,37 +1097,39 @@ static int offline_one_memblock(struct daxctl_memory 
*mem, char *memblock)
if (rc < 0)
return -ENOMEM;
 
-   rc = sysfs_read_attr(ctx, path, buf);
-   if (rc) {
-   err(ctx, "%s: Failed to read %s: %s\n",
-   devname, path, strerror(-rc));
+   /*
+* if already online, possibly due to kernel config or a udev rule,
+* there is nothing to do and we can skip over the memblock
+*/
+   rc = memblock_is_online(mem, memblock);
+   if (rc)
return rc;
-   }
-
-   /* if already offline, there is nothing to do */
-   if (strncmp(buf, "offline", 7) == 0)
-   return 1;
 
rc = sysfs_write_attr_quiet(ctx, path, mode);
if (rc) {
-   /* Close the TOCTOU hole like in online_one_memblock() above */
-   if (sysfs_read_attr(ctx, path, buf) == 0) {
-   if (strncmp(buf, "offline", 7) == 0)
-   return 1;
-   }
-   err(ctx, "%s: Failed to offline %s: %s\n",
+   /*
+* While we performed an already-online check above, there
+* is still a TOCTOU hole where someone (such as a udev rule)
+* may have raced to online the memory. In such a case,
+* the sysfs store will fail, however we can check for this
+* by simply reading the state again. If it changed to the
+* desired state, then we don't have to error out.
+*/
+   if (memblock_is_online(mem, memblock))
+   return 1;
+   err(ctx, "%s: Failed to online %s: %s\n",
devname, path, strerror(-rc));
}
return rc;
 }
 
-static int memblock_is_online(struct daxctl_memory *mem, char *memblock)
+static int 

[ndctl PATCH 04/10] libdaxctl: add an API to determine if memory is movable

2019-10-02 Thread Vishal Verma
By default, daxctl always attempts to online new memory sections as
'movable' so that routine kernel allocations aren't serviced from this
memory, and the memory is later removable via hot-unplug.

System configuration, or other agents (such as udev rules) may race
'daxctl' to online memory, and this may result in the memory not being
'movable'. Add an interface to query the movability of a memory object
associated with a dax device.

This is in preparation to both display a 'movable' attribute in device
listings, as well as optionally allowing memory to be onlined as
non-movable.

Cc: Dan Williams 
Reported-by: Ben Olson 
Signed-off-by: Vishal Verma 
---
 daxctl/lib/libdaxctl-private.h | 20 +
 daxctl/lib/libdaxctl.c | 77 --
 daxctl/lib/libdaxctl.sym   |  5 +++
 daxctl/libdaxctl.h |  1 +
 4 files changed, 100 insertions(+), 3 deletions(-)

diff --git a/daxctl/lib/libdaxctl-private.h b/daxctl/lib/libdaxctl-private.h
index 7ba3c46..82939bb 100644
--- a/daxctl/lib/libdaxctl-private.h
+++ b/daxctl/lib/libdaxctl-private.h
@@ -44,6 +44,25 @@ enum memory_op {
MEM_SET_ONLINE,
MEM_IS_ONLINE,
MEM_COUNT,
+   MEM_FIND_ZONE,
+};
+
+/* OR-able flags, 1, 2, 4, 8 etc */
+enum memory_op_status {
+   MEM_ST_OK = 0,
+   MEM_ST_ZONE_INCONSISTENT = 1,
+};
+
+enum memory_zones {
+   MEM_ZONE_UNKNOWN = 1,
+   MEM_ZONE_MOVABLE,
+   MEM_ZONE_NORMAL,
+};
+
+static const char *zone_strings[] = {
+   [MEM_ZONE_UNKNOWN] = "mixed",
+   [MEM_ZONE_NORMAL] = "Normal",
+   [MEM_ZONE_MOVABLE] = "Movable",
 };
 
 /**
@@ -86,6 +105,7 @@ struct daxctl_memory {
size_t buf_len;
char *node_path;
unsigned long block_size;
+   enum memory_zones zone;
 };
 
 
diff --git a/daxctl/lib/libdaxctl.c b/daxctl/lib/libdaxctl.c
index 6243857..4460174 100644
--- a/daxctl/lib/libdaxctl.c
+++ b/daxctl/lib/libdaxctl.c
@@ -1159,6 +1159,58 @@ static int offline_one_memblock(struct daxctl_memory 
*mem, char *memblock)
return rc;
 }
 
+static int memblock_find_zone(struct daxctl_memory *mem, char *memblock,
+   int *status)
+{
+   struct daxctl_dev *dev = daxctl_memory_get_dev(mem);
+   const char *devname = daxctl_dev_get_devname(dev);
+   struct daxctl_ctx *ctx = daxctl_dev_get_ctx(dev);
+   enum memory_zones cur_zone;
+   int len = mem->buf_len, rc;
+   char buf[SYSFS_ATTR_SIZE];
+   char *path = mem->mem_buf;
+   const char *node_path;
+
+   rc = memblock_is_online(mem, memblock);
+   if (rc < 0)
+   return rc;
+   if (rc == 0)
+   return -ENXIO;
+
+   node_path = daxctl_memory_get_node_path(mem);
+   if (!node_path)
+   return -ENXIO;
+
+   rc = snprintf(path, len, "%s/%s/valid_zones", node_path, memblock);
+   if (rc < 0)
+   return -ENOMEM;
+
+   rc = sysfs_read_attr(ctx, path, buf);
+   if (rc) {
+   err(ctx, "%s: Failed to read %s: %s\n",
+   devname, path, strerror(-rc));
+   return rc;
+   }
+
+   if (strcmp(buf, zone_strings[MEM_ZONE_MOVABLE]) == 0)
+   cur_zone = MEM_ZONE_MOVABLE;
+   else if (strcmp(buf, zone_strings[MEM_ZONE_NORMAL]) == 0)
+   cur_zone = MEM_ZONE_NORMAL;
+   else
+   cur_zone = MEM_ZONE_UNKNOWN;
+
+   if (mem->zone) {
+   if (mem->zone == cur_zone)
+   return 0;
+   else
+   *status |= MEM_ST_ZONE_INCONSISTENT;
+   } else {
+   mem->zone = cur_zone;
+   }
+
+   return 0;
+}
+
 static bool memblock_in_dev(struct daxctl_memory *mem, const char *memblock)
 {
const char *mem_base = "/sys/devices/system/memory/";
@@ -1211,7 +1263,7 @@ static bool memblock_in_dev(struct daxctl_memory *mem, 
const char *memblock)
 }
 
 static int op_for_one_memblock(struct daxctl_memory *mem, char *memblock,
-   enum memory_op op)
+   enum memory_op op, int *status)
 {
struct daxctl_dev *dev = daxctl_memory_get_dev(mem);
const char *devname = daxctl_dev_get_devname(dev);
@@ -1234,6 +1286,8 @@ static int op_for_one_memblock(struct daxctl_memory *mem, 
char *memblock,
return !rc;
case MEM_COUNT:
return 0;
+   case MEM_FIND_ZONE:
+   return memblock_find_zone(mem, memblock, status);
}
 
err(ctx, "%s: BUG: unknown op: %d\n", devname, op);
@@ -1245,8 +1299,8 @@ static int daxctl_memory_op(struct daxctl_memory *mem, 
enum memory_op op)
struct daxctl_dev *dev = daxctl_memory_get_dev(mem);
const char *devname = daxctl_dev_get_devname(dev);
struct daxctl_ctx *ctx = daxctl_dev_get_ctx(dev);
+   int rc, count = 0, status_flags = 0;
const char *node_path;
-   int rc, count = 0;
struct dirent *de;
DIR *node_dir;
 
@@ -1265,7 

[ndctl PATCH 07/10] daxctl: detect races when onlining memory blocks

2019-10-02 Thread Vishal Verma
Sometimes, system configuration can result in new memory blocks getting
onlined automatically. Often, these auto-onlining mechanisms don't
provide a choice or configurability in the matter of which zone is used
for these new blocks, and they just end up in ZONE_NORMAL.

Usually, for hot-plugged memory, ZONE_NORMAL is undesirable because:
 - An application might want total control over this memory
 - ZONE_NORMAL precludes hot-removal of this memory
 - Having kernel data structures in this memory, especially performance
   sensitive ones, such as page tables, may be undesirable.

Thus report if a race condition is encountered while onlining memory,
and provide the user options to remedy it.

Clarify the default zone expectations, and the race detection behavior
in the daxctl-reconfigure-device man page, and move the relevant section
under the 'Description' header, instead of hidden away under the
'--no-online' option.

Cc: Ben Olson 
Cc: Dave Hansen 
Cc: Dan Williams 
Signed-off-by: Vishal Verma 
---
 daxctl/device.c|  9 ++
 daxctl/lib/libdaxctl.c | 62 +-
 2 files changed, 52 insertions(+), 19 deletions(-)

diff --git a/daxctl/device.c b/daxctl/device.c
index 920efc6..28698bf 100644
--- a/daxctl/device.c
+++ b/daxctl/device.c
@@ -174,6 +174,15 @@ static int dev_online_memory(struct daxctl_dev *dev)
devname, strerror(-num_on));
return num_on;
}
+   if (num_on)
+   fprintf(stderr,
+   "%s:\n  WARNING: detected a race while onlining memory\n"
+   "  Some memory may not be in the expected zone. It is\n"
+   "  recommended to disable any other onlining mechanisms,\n"
+   "  and retry. If onlining is to be left to other agents,\n"
+   "  use the --no-online option to suppress this warning\n",
+   devname);
+
if (num_on == num_sections) {
fprintf(stderr, "%s: all memory sections (%d) already online\n",
devname, num_on);
diff --git a/daxctl/lib/libdaxctl.c b/daxctl/lib/libdaxctl.c
index 617887c..5a7e37c 100644
--- a/daxctl/lib/libdaxctl.c
+++ b/daxctl/lib/libdaxctl.c
@@ -1079,10 +1079,10 @@ static int memblock_is_online(struct daxctl_memory 
*mem, char *memblock)
return 0;
 }
 
-static int online_one_memblock(struct daxctl_memory *mem, char *memblock)
+static int online_one_memblock(struct daxctl_memory *mem, char *memblock,
+   int *status)
 {
struct daxctl_dev *dev = daxctl_memory_get_dev(mem);
-   const char *devname = daxctl_dev_get_devname(dev);
struct daxctl_ctx *ctx = daxctl_dev_get_ctx(dev);
const char *mode = "online_movable";
int len = mem->buf_len, rc;
@@ -1097,10 +1097,6 @@ static int online_one_memblock(struct daxctl_memory 
*mem, char *memblock)
if (rc < 0)
return -ENOMEM;
 
-   /*
-* if already online, possibly due to kernel config or a udev rule,
-* there is nothing to do and we can skip over the memblock
-*/
rc = memblock_is_online(mem, memblock);
if (rc)
return rc;
@@ -1108,18 +1104,14 @@ static int online_one_memblock(struct daxctl_memory 
*mem, char *memblock)
rc = sysfs_write_attr_quiet(ctx, path, mode);
if (rc) {
/*
-* While we performed an already-online check above, there
-* is still a TOCTOU hole where someone (such as a udev rule)
-* may have raced to online the memory. In such a case,
-* the sysfs store will fail, however we can check for this
-* by simply reading the state again. If it changed to the
-* desired state, then we don't have to error out.
+* If the block got onlined, potentially by some other agent,
+* do nothing for now. There will be a full scan for zone
+* correctness later.
 */
-   if (memblock_is_online(mem, memblock))
-   return 1;
-   err(ctx, "%s: Failed to online %s: %s\n",
-   devname, path, strerror(-rc));
+   if (memblock_is_online(mem, memblock) == 1)
+   return 0;
}
+
return rc;
 }
 
@@ -1150,7 +1142,7 @@ static int offline_one_memblock(struct daxctl_memory 
*mem, char *memblock)
 
rc = sysfs_write_attr_quiet(ctx, path, mode);
if (rc) {
-   /* Close the TOCTOU hole like in online_one_memblock() above */
+   /* check if something raced us to offline (unlikely) */
if (!memblock_is_online(mem, memblock))
return 1;
err(ctx, "%s: Failed to offline %s: %s\n",
@@ -1274,7 +1266,7 @@ static int op_for_one_memblock(struct daxctl_memory *mem, 
char *memblock,
 
switch (op) {
  

[ndctl PATCH 09/10] libdaxctl: add an API to online memory in a non-movable state

2019-10-02 Thread Vishal Verma
Some users may want additional control over the state in which memory
gets onlined - specifically, controlling the 'movable' aspect of the
resulting memory blocks.

Until now, libdaxctl only brought up memory in the 'movable' state. Add
a new interface to online memory in the non-movable state.

Link: https://github.com/pmem/ndctl/issues/110
Cc: Ben Olson 
Cc: Dave Hansen 
Cc: Dan Williams 
Signed-off-by: Vishal Verma 
---
 daxctl/lib/libdaxctl-private.h |  6 +
 daxctl/lib/libdaxctl.c | 45 --
 daxctl/lib/libdaxctl.sym   |  1 +
 daxctl/libdaxctl.h |  1 +
 4 files changed, 46 insertions(+), 7 deletions(-)

diff --git a/daxctl/lib/libdaxctl-private.h b/daxctl/lib/libdaxctl-private.h
index 82939bb..3e0457f 100644
--- a/daxctl/lib/libdaxctl-private.h
+++ b/daxctl/lib/libdaxctl-private.h
@@ -42,6 +42,7 @@ static const char *dax_modules[] = {
 enum memory_op {
MEM_SET_OFFLINE,
MEM_SET_ONLINE,
+   MEM_SET_ONLINE_NO_MOVABLE,
MEM_IS_ONLINE,
MEM_COUNT,
MEM_FIND_ZONE,
@@ -65,6 +66,11 @@ static const char *zone_strings[] = {
[MEM_ZONE_MOVABLE] = "Movable",
 };
 
+static const char *state_strings[] = {
+   [MEM_ZONE_NORMAL] = "online",
+   [MEM_ZONE_MOVABLE] = "online_movable",
+};
+
 /**
  * struct daxctl_region - container for dax_devices
  */
diff --git a/daxctl/lib/libdaxctl.c b/daxctl/lib/libdaxctl.c
index 5a7e37c..d913807 100644
--- a/daxctl/lib/libdaxctl.c
+++ b/daxctl/lib/libdaxctl.c
@@ -1080,11 +1080,10 @@ static int memblock_is_online(struct daxctl_memory 
*mem, char *memblock)
 }
 
 static int online_one_memblock(struct daxctl_memory *mem, char *memblock,
-   int *status)
+   enum memory_zones zone, int *status)
 {
struct daxctl_dev *dev = daxctl_memory_get_dev(mem);
struct daxctl_ctx *ctx = daxctl_dev_get_ctx(dev);
-   const char *mode = "online_movable";
int len = mem->buf_len, rc;
char *path = mem->mem_buf;
const char *node_path;
@@ -1101,7 +1100,14 @@ static int online_one_memblock(struct daxctl_memory 
*mem, char *memblock,
if (rc)
return rc;
 
-   rc = sysfs_write_attr_quiet(ctx, path, mode);
+   switch (zone) {
+   case MEM_ZONE_MOVABLE:
+   case MEM_ZONE_NORMAL:
+   rc = sysfs_write_attr_quiet(ctx, path, state_strings[zone]);
+   break;
+   default:
+   rc = -EINVAL;
+   }
if (rc) {
/*
 * If the block got onlined, potentially by some other agent,
@@ -1266,7 +1272,11 @@ static int op_for_one_memblock(struct daxctl_memory 
*mem, char *memblock,
 
switch (op) {
case MEM_SET_ONLINE:
-   return online_one_memblock(mem, memblock, status);
+   return online_one_memblock(mem, memblock, MEM_ZONE_MOVABLE,
+   status);
+   case MEM_SET_ONLINE_NO_MOVABLE:
+   return online_one_memblock(mem, memblock, MEM_ZONE_NORMAL,
+   status);
case MEM_SET_OFFLINE:
return offline_one_memblock(mem, memblock);
case MEM_IS_ONLINE:
@@ -1344,14 +1354,25 @@ out_dir:
 /*
  * daxctl_memory_online() will online to ZONE_MOVABLE by default
  */
-DAXCTL_EXPORT int daxctl_memory_online(struct daxctl_memory *mem)
+static int daxctl_memory_online_with_zone(struct daxctl_memory *mem,
+   enum memory_zones zone)
 {
struct daxctl_dev *dev = daxctl_memory_get_dev(mem);
const char *devname = daxctl_dev_get_devname(dev);
struct daxctl_ctx *ctx = daxctl_dev_get_ctx(dev);
int rc;
 
-   rc = daxctl_memory_op(mem, MEM_SET_ONLINE);
+   switch (zone) {
+   case MEM_ZONE_MOVABLE:
+   rc = daxctl_memory_op(mem, MEM_SET_ONLINE);
+   break;
+   case MEM_ZONE_NORMAL:
+   rc = daxctl_memory_op(mem, MEM_SET_ONLINE_NO_MOVABLE);
+   break;
+   default:
+   err(ctx, "%s: BUG: invalid zone for onlining\n", devname);
+   rc = -EINVAL;
+   }
if (rc)
return rc;
 
@@ -1364,7 +1385,7 @@ DAXCTL_EXPORT int daxctl_memory_online(struct 
daxctl_memory *mem)
rc = daxctl_memory_op(mem, MEM_FIND_ZONE);
if (rc)
return rc;
-   if (mem->zone != MEM_ZONE_MOVABLE) {
+   if (mem->zone != zone) {
err(ctx,
"%s:\n  WARNING: detected a race while onlining memory\n"
"  Some memory may not be in the expected zone. It is\n"
@@ -1378,6 +1399,16 @@ DAXCTL_EXPORT int daxctl_memory_online(struct 
daxctl_memory *mem)
return rc;
 }
 
+DAXCTL_EXPORT int daxctl_memory_online(struct daxctl_memory *mem)
+{
+   return daxctl_memory_online_with_zone(mem, MEM_ZONE_MOVABLE);
+}
+
+DAXCTL_EXPORT int daxctl_memory_online_no_movable(struct daxctl_memory *mem)
+{
+   return 

[ndctl PATCH 01/10] libdaxctl: refactor path construction in op_for_one_memblock()

2019-10-02 Thread Vishal Verma
In preparation for memblock operations to check additional sysfs
attributes in the memoryXXX block, 'path' can't be prematurely set
to the memoryXXX/state file.

Push down path construction into each memory op helper, so that each
helper gets an opportunity to reconstruct and act upon multiple paths.

Cc: Dan Williams 
Signed-off-by: Vishal Verma 
---
 daxctl/lib/libdaxctl.c | 64 +-
 1 file changed, 44 insertions(+), 20 deletions(-)

diff --git a/daxctl/lib/libdaxctl.c b/daxctl/lib/libdaxctl.c
index 4dfc524..a828644 100644
--- a/daxctl/lib/libdaxctl.c
+++ b/daxctl/lib/libdaxctl.c
@@ -1047,13 +1047,24 @@ DAXCTL_EXPORT unsigned long 
daxctl_memory_get_block_size(struct daxctl_memory *m
return mem->block_size;
 }
 
-static int online_one_memblock(struct daxctl_dev *dev, char *path)
+static int online_one_memblock(struct daxctl_memory *mem, char *memblock)
 {
+   struct daxctl_dev *dev = daxctl_memory_get_dev(mem);
const char *devname = daxctl_dev_get_devname(dev);
struct daxctl_ctx *ctx = daxctl_dev_get_ctx(dev);
const char *mode = "online_movable";
+   int len = mem->buf_len, rc;
char buf[SYSFS_ATTR_SIZE];
-   int rc;
+   char *path = mem->mem_buf;
+   const char *node_path;
+
+   node_path = daxctl_memory_get_node_path(mem);
+   if (!node_path)
+   return -ENXIO;
+
+   rc = snprintf(path, len, "%s/%s/state", node_path, memblock);
+   if (rc < 0)
+   return -ENOMEM;
 
rc = sysfs_read_attr(ctx, path, buf);
if (rc) {
@@ -1089,13 +1100,24 @@ static int online_one_memblock(struct daxctl_dev *dev, 
char *path)
return rc;
 }
 
-static int offline_one_memblock(struct daxctl_dev *dev, char *path)
+static int offline_one_memblock(struct daxctl_memory *mem, char *memblock)
 {
+   struct daxctl_dev *dev = daxctl_memory_get_dev(mem);
const char *devname = daxctl_dev_get_devname(dev);
struct daxctl_ctx *ctx = daxctl_dev_get_ctx(dev);
const char *mode = "offline";
+   int len = mem->buf_len, rc;
char buf[SYSFS_ATTR_SIZE];
-   int rc;
+   char *path = mem->mem_buf;
+   const char *node_path;
+
+   node_path = daxctl_memory_get_node_path(mem);
+   if (!node_path)
+   return -ENXIO;
+
+   rc = snprintf(path, len, "%s/%s/state", node_path, memblock);
+   if (rc < 0)
+   return -ENOMEM;
 
rc = sysfs_read_attr(ctx, path, buf);
if (rc) {
@@ -1121,12 +1143,23 @@ static int offline_one_memblock(struct daxctl_dev *dev, 
char *path)
return rc;
 }
 
-static int memblock_is_online(struct daxctl_dev *dev, char *path)
+static int memblock_is_online(struct daxctl_memory *mem, char *memblock)
 {
+   struct daxctl_dev *dev = daxctl_memory_get_dev(mem);
const char *devname = daxctl_dev_get_devname(dev);
struct daxctl_ctx *ctx = daxctl_dev_get_ctx(dev);
+   int len = mem->buf_len, rc;
char buf[SYSFS_ATTR_SIZE];
-   int rc;
+   char *path = mem->mem_buf;
+   const char *node_path;
+
+   node_path = daxctl_memory_get_node_path(mem);
+   if (!node_path)
+   return -ENXIO;
+
+   rc = snprintf(path, len, "%s/%s/state", node_path, memblock);
+   if (rc < 0)
+   return -ENOMEM;
 
rc = sysfs_read_attr(ctx, path, buf);
if (rc) {
@@ -1193,7 +1226,7 @@ static bool memblock_in_dev(struct daxctl_memory *mem, 
const char *memblock)
return false;
 }
 
-static int op_for_one_memblock(struct daxctl_memory *mem, char *path,
+static int op_for_one_memblock(struct daxctl_memory *mem, char *memblock,
enum memory_op op)
 {
struct daxctl_dev *dev = daxctl_memory_get_dev(mem);
@@ -1203,11 +1236,11 @@ static int op_for_one_memblock(struct daxctl_memory 
*mem, char *path,
 
switch (op) {
case MEM_SET_ONLINE:
-   return online_one_memblock(dev, path);
+   return online_one_memblock(mem, memblock);
case MEM_SET_OFFLINE:
-   return offline_one_memblock(dev, path);
+   return offline_one_memblock(mem, memblock);
case MEM_IS_ONLINE:
-   rc = memblock_is_online(dev, path);
+   rc = memblock_is_online(mem, memblock);
if (rc < 0)
return rc;
/*
@@ -1245,19 +1278,10 @@ static int daxctl_memory_op(struct daxctl_memory *mem, 
enum memory_op op)
 
errno = 0;
while ((de = readdir(node_dir)) != NULL) {
-   char *path = mem->mem_buf;
-   int len = mem->buf_len;
-
if (strncmp(de->d_name, "memory", 6) == 0) {
if (!memblock_in_dev(mem, de->d_name))
continue;
-   rc = snprintf(path, len, "%s/%s/state",
-   node_path, de->d_name);
-   if (rc < 0) {

Re: Lease semantic proposal

2019-10-02 Thread Jeff Layton
On Wed, 2019-10-02 at 15:27 -0400, J. Bruce Fields wrote:
> On Wed, Oct 02, 2019 at 08:28:40AM -0400, Jeff Layton wrote:
> > On Tue, 2019-10-01 at 11:17 -0700, Ira Weiny wrote:
> > > On Mon, Sep 23, 2019 at 04:17:59PM -0400, Jeff Layton wrote:
> > > > On Mon, 2019-09-23 at 12:08 -0700, Ira Weiny wrote:
> > > > > Since the last RFC patch set[1] much of the discussion of supporting 
> > > > > RDMA with
> > > > > FS DAX has been around the semantics of the lease mechanism.[2]  
> > > > > Within that
> > > > > thread it was suggested I try and write some documentation and/or 
> > > > > tests for the
> > > > > new mechanism being proposed.  I have created a foundation to test 
> > > > > lease
> > > > > functionality within xfstests.[3] This should be close to being 
> > > > > accepted.
> > > > > Before writing additional lease tests, or changing lots of kernel 
> > > > > code, this
> > > > > email presents documentation for the new proposed "layout lease" 
> > > > > semantic.
> > > > > 
> > > > > At Linux Plumbers[4] just over a week ago, I presented the current 
> > > > > state of the
> > > > > patch set and the outstanding issues.  Based on the discussion there, 
> > > > > well as
> > > > > follow up emails, I propose the following addition to the fcntl() man 
> > > > > page.
> > > > > 
> > > > > Thank you,
> > > > > Ira
> > > > > 
> > > > > [1] https://lkml.org/lkml/2019/8/9/1043
> > > > > [2] https://lkml.org/lkml/2019/8/9/1062
> > > > > [3] https://www.spinics.net/lists/fstests/msg12620.html
> > > > > [4] https://linuxplumbersconf.org/event/4/contributions/368/
> > > > > 
> > > > > 
> > > > 
> > > > Thank you so much for doing this, Ira. This allows us to debate the
> > > > user-visible behavior semantics without getting bogged down in the
> > > > implementation details. More comments below:
> > > 
> > > Thanks.  Sorry for the delay in response.  Turns out this email was in my
> > > spam...  :-/  I'll need to work out why.
> > > 
> > > > > 
> > > > > Layout Leases
> > > > > -
> > > > > 
> > > > > Layout (F_LAYOUT) leases are special leases which can be used to 
> > > > > control and/or
> > > > > be informed about the manipulation of the underlying layout of a file.
> > > > > 
> > > > > A layout is defined as the logical file block -> physical file block 
> > > > > mapping
> > > > > including the file size and sharing of physical blocks among files.  
> > > > > Note that
> > > > > the unwritten state of a block is not considered part of file layout.
> > > > > 
> > > > > **Read layout lease F_RDLCK | F_LAYOUT**
> > > > > 
> > > > > Read layout leases can be used to be informed of layout changes by the
> > > > > system or other users.  This lease is similar to the standard read 
> > > > > (F_RDLCK)
> > > > > lease in that any attempt to change the _layout_ of the file will be 
> > > > > reported to
> > > > > the process through the lease break process.  But this lease is 
> > > > > different
> > > > > because the file can be opened for write and data can be read and/or 
> > > > > written to
> > > > > the file as long as the underlying layout of the file does not change.
> > > > > Therefore, the lease is not broken if the file is simply open for 
> > > > > write, but
> > > > > _may_ be broken if an operation such as, truncate(), fallocate() or 
> > > > > write()
> > > > > results in changing the underlying layout.
> > > > > 
> > > > > **Write layout lease (F_WRLCK | F_LAYOUT)**
> > > > > 
> > > > > Write Layout leases can be used to break read layout leases to 
> > > > > indicate that
> > > > > the process intends to change the underlying layout lease of the file.
> > > > > 
> > > > > A process which has taken a write layout lease has exclusive 
> > > > > ownership of the
> > > > > file layout and can modify that layout as long as the lease is held.
> > > > > Operations which change the layout are allowed by that process.  But 
> > > > > operations
> > > > > from other file descriptors which attempt to change the layout will 
> > > > > break the
> > > > > lease through the standard lease break process.  The F_LAYOUT flag is 
> > > > > used to
> > > > > indicate a difference between a regular F_WRLCK and F_WRLCK with 
> > > > > F_LAYOUT.  In
> > > > > the F_LAYOUT case opens for write do not break the lease.  But some 
> > > > > operations,
> > > > > if they change the underlying layout, may.
> > > > > 
> > > > > The distinction between read layout leases and write layout leases is 
> > > > > that
> > > > > write layout leases can change the layout without breaking the lease 
> > > > > within the
> > > > > owning process.  This is useful to guarantee a layout prior to 
> > > > > specifying the
> > > > > unbreakable flag described below.
> > > > > 
> > > > > 
> > > > 
> > > > The above sounds totally reasonable. You're essentially exposing the
> > > > behavior of nfsd's layout leases to userland. To be clear, will F_LAYOUT
> > > > leases work the same way as "normal" leases, wrt signals and 

Re: Lease semantic proposal

2019-10-02 Thread J. Bruce Fields
On Wed, Oct 02, 2019 at 08:28:40AM -0400, Jeff Layton wrote:
> On Tue, 2019-10-01 at 11:17 -0700, Ira Weiny wrote:
> > On Mon, Sep 23, 2019 at 04:17:59PM -0400, Jeff Layton wrote:
> > > On Mon, 2019-09-23 at 12:08 -0700, Ira Weiny wrote:
> > > > Since the last RFC patch set[1] much of the discussion of supporting 
> > > > RDMA with
> > > > FS DAX has been around the semantics of the lease mechanism.[2]  Within 
> > > > that
> > > > thread it was suggested I try and write some documentation and/or tests 
> > > > for the
> > > > new mechanism being proposed.  I have created a foundation to test lease
> > > > functionality within xfstests.[3] This should be close to being 
> > > > accepted.
> > > > Before writing additional lease tests, or changing lots of kernel code, 
> > > > this
> > > > email presents documentation for the new proposed "layout lease" 
> > > > semantic.
> > > > 
> > > > At Linux Plumbers[4] just over a week ago, I presented the current 
> > > > state of the
> > > > patch set and the outstanding issues.  Based on the discussion there, 
> > > > well as
> > > > follow up emails, I propose the following addition to the fcntl() man 
> > > > page.
> > > > 
> > > > Thank you,
> > > > Ira
> > > > 
> > > > [1] https://lkml.org/lkml/2019/8/9/1043
> > > > [2] https://lkml.org/lkml/2019/8/9/1062
> > > > [3] https://www.spinics.net/lists/fstests/msg12620.html
> > > > [4] https://linuxplumbersconf.org/event/4/contributions/368/
> > > > 
> > > > 
> > > 
> > > Thank you so much for doing this, Ira. This allows us to debate the
> > > user-visible behavior semantics without getting bogged down in the
> > > implementation details. More comments below:
> > 
> > Thanks.  Sorry for the delay in response.  Turns out this email was in my
> > spam...  :-/  I'll need to work out why.
> > 
> > > > 
> > > > Layout Leases
> > > > -
> > > > 
> > > > Layout (F_LAYOUT) leases are special leases which can be used to 
> > > > control and/or
> > > > be informed about the manipulation of the underlying layout of a file.
> > > > 
> > > > A layout is defined as the logical file block -> physical file block 
> > > > mapping
> > > > including the file size and sharing of physical blocks among files.  
> > > > Note that
> > > > the unwritten state of a block is not considered part of file layout.
> > > > 
> > > > **Read layout lease F_RDLCK | F_LAYOUT**
> > > > 
> > > > Read layout leases can be used to be informed of layout changes by the
> > > > system or other users.  This lease is similar to the standard read 
> > > > (F_RDLCK)
> > > > lease in that any attempt to change the _layout_ of the file will be 
> > > > reported to
> > > > the process through the lease break process.  But this lease is 
> > > > different
> > > > because the file can be opened for write and data can be read and/or 
> > > > written to
> > > > the file as long as the underlying layout of the file does not change.
> > > > Therefore, the lease is not broken if the file is simply open for 
> > > > write, but
> > > > _may_ be broken if an operation such as, truncate(), fallocate() or 
> > > > write()
> > > > results in changing the underlying layout.
> > > > 
> > > > **Write layout lease (F_WRLCK | F_LAYOUT)**
> > > > 
> > > > Write Layout leases can be used to break read layout leases to indicate 
> > > > that
> > > > the process intends to change the underlying layout lease of the file.
> > > > 
> > > > A process which has taken a write layout lease has exclusive ownership 
> > > > of the
> > > > file layout and can modify that layout as long as the lease is held.
> > > > Operations which change the layout are allowed by that process.  But 
> > > > operations
> > > > from other file descriptors which attempt to change the layout will 
> > > > break the
> > > > lease through the standard lease break process.  The F_LAYOUT flag is 
> > > > used to
> > > > indicate a difference between a regular F_WRLCK and F_WRLCK with 
> > > > F_LAYOUT.  In
> > > > the F_LAYOUT case opens for write do not break the lease.  But some 
> > > > operations,
> > > > if they change the underlying layout, may.
> > > > 
> > > > The distinction between read layout leases and write layout leases is 
> > > > that
> > > > write layout leases can change the layout without breaking the lease 
> > > > within the
> > > > owning process.  This is useful to guarantee a layout prior to 
> > > > specifying the
> > > > unbreakable flag described below.
> > > > 
> > > > 
> > > 
> > > The above sounds totally reasonable. You're essentially exposing the
> > > behavior of nfsd's layout leases to userland. To be clear, will F_LAYOUT
> > > leases work the same way as "normal" leases, wrt signals and timeouts?
> > 
> > That was my intention, yes.
> >
> > > I do wonder if we're better off not trying to "or" in flags for this,
> > > and instead have a separate set of commands (maybe F_RDLAYOUT,
> > > F_WRLAYOUT, F_UNLAYOUT). Maybe I'm just bikeshedding though -- I don't
> > > feel terribly 

Re: Lease semantic proposal

2019-10-02 Thread Dan Williams
On Tue, Oct 1, 2019 at 2:02 PM Ira Weiny  wrote:
>
> On Mon, Sep 30, 2019 at 06:42:33PM +1000, Dave Chinner wrote:
> > On Wed, Sep 25, 2019 at 04:46:03PM -0700, Ira Weiny wrote:
> > > On Tue, Sep 24, 2019 at 08:26:20AM +1000, Dave Chinner wrote:
> > > > Hence, AFIACT, the above definition of a F_RDLCK|F_LAYOUT lease
> > > > doesn't appear to be compatible with the semantics required by
> > > > existing users of layout leases.
> > >
> > > I disagree.  Other than the addition of F_UNBREAK, I think this is 
> > > consistent
> > > with what is currently implemented.  Also, by exporting all this to user 
> > > space
> > > we can now write tests for it independent of the RDMA pinning.
> >
> > The current usage of F_RDLCK | F_LAYOUT by the pNFS code allows
> > layout changes to occur to the file while the layout lease is held.
>
> This was not my understanding.

I think you guys are talking past each other. F_RDLCK | F_LAYOUT can
be broken to allow writes to the file / layout. The new unbreakable
case would require explicit SIGKILL as "revocation method of last
resort", but that's the new incremental extension being proposed. No
changes to the current behavior of F_RDLCK | F_LAYOUT.

Dave, the question at hand is whether this new layout lease mode being
proposed is going to respond to BREAK_WRITE, or just BREAK_UNMAP. It
seems longterm page pinning conflicts really only care about
BREAK_UNMAP where pages that were part of the file are being removed
from the file. The unbreakable case can tolerate layout changes that
keep pinned pages mapped / allocated to the file.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: Lease semantic proposal

2019-10-02 Thread Jeff Layton
On Tue, 2019-10-01 at 11:17 -0700, Ira Weiny wrote:
> On Mon, Sep 23, 2019 at 04:17:59PM -0400, Jeff Layton wrote:
> > On Mon, 2019-09-23 at 12:08 -0700, Ira Weiny wrote:
> > > Since the last RFC patch set[1] much of the discussion of supporting RDMA 
> > > with
> > > FS DAX has been around the semantics of the lease mechanism.[2]  Within 
> > > that
> > > thread it was suggested I try and write some documentation and/or tests 
> > > for the
> > > new mechanism being proposed.  I have created a foundation to test lease
> > > functionality within xfstests.[3] This should be close to being accepted.
> > > Before writing additional lease tests, or changing lots of kernel code, 
> > > this
> > > email presents documentation for the new proposed "layout lease" semantic.
> > > 
> > > At Linux Plumbers[4] just over a week ago, I presented the current state 
> > > of the
> > > patch set and the outstanding issues.  Based on the discussion there, 
> > > well as
> > > follow up emails, I propose the following addition to the fcntl() man 
> > > page.
> > > 
> > > Thank you,
> > > Ira
> > > 
> > > [1] https://lkml.org/lkml/2019/8/9/1043
> > > [2] https://lkml.org/lkml/2019/8/9/1062
> > > [3] https://www.spinics.net/lists/fstests/msg12620.html
> > > [4] https://linuxplumbersconf.org/event/4/contributions/368/
> > > 
> > > 
> > 
> > Thank you so much for doing this, Ira. This allows us to debate the
> > user-visible behavior semantics without getting bogged down in the
> > implementation details. More comments below:
> 
> Thanks.  Sorry for the delay in response.  Turns out this email was in my
> spam...  :-/  I'll need to work out why.
> 
> > > 
> > > Layout Leases
> > > -
> > > 
> > > Layout (F_LAYOUT) leases are special leases which can be used to control 
> > > and/or
> > > be informed about the manipulation of the underlying layout of a file.
> > > 
> > > A layout is defined as the logical file block -> physical file block 
> > > mapping
> > > including the file size and sharing of physical blocks among files.  Note 
> > > that
> > > the unwritten state of a block is not considered part of file layout.
> > > 
> > > **Read layout lease F_RDLCK | F_LAYOUT**
> > > 
> > > Read layout leases can be used to be informed of layout changes by the
> > > system or other users.  This lease is similar to the standard read 
> > > (F_RDLCK)
> > > lease in that any attempt to change the _layout_ of the file will be 
> > > reported to
> > > the process through the lease break process.  But this lease is different
> > > because the file can be opened for write and data can be read and/or 
> > > written to
> > > the file as long as the underlying layout of the file does not change.
> > > Therefore, the lease is not broken if the file is simply open for write, 
> > > but
> > > _may_ be broken if an operation such as, truncate(), fallocate() or 
> > > write()
> > > results in changing the underlying layout.
> > > 
> > > **Write layout lease (F_WRLCK | F_LAYOUT)**
> > > 
> > > Write Layout leases can be used to break read layout leases to indicate 
> > > that
> > > the process intends to change the underlying layout lease of the file.
> > > 
> > > A process which has taken a write layout lease has exclusive ownership of 
> > > the
> > > file layout and can modify that layout as long as the lease is held.
> > > Operations which change the layout are allowed by that process.  But 
> > > operations
> > > from other file descriptors which attempt to change the layout will break 
> > > the
> > > lease through the standard lease break process.  The F_LAYOUT flag is 
> > > used to
> > > indicate a difference between a regular F_WRLCK and F_WRLCK with 
> > > F_LAYOUT.  In
> > > the F_LAYOUT case opens for write do not break the lease.  But some 
> > > operations,
> > > if they change the underlying layout, may.
> > > 
> > > The distinction between read layout leases and write layout leases is that
> > > write layout leases can change the layout without breaking the lease 
> > > within the
> > > owning process.  This is useful to guarantee a layout prior to specifying 
> > > the
> > > unbreakable flag described below.
> > > 
> > > 
> > 
> > The above sounds totally reasonable. You're essentially exposing the
> > behavior of nfsd's layout leases to userland. To be clear, will F_LAYOUT
> > leases work the same way as "normal" leases, wrt signals and timeouts?
> 
> That was my intention, yes.
>
> > I do wonder if we're better off not trying to "or" in flags for this,
> > and instead have a separate set of commands (maybe F_RDLAYOUT,
> > F_WRLAYOUT, F_UNLAYOUT). Maybe I'm just bikeshedding though -- I don't
> > feel terribly strongly about it.
> 
> I'm leaning that was as well.  To make these even more distinct from
> F_SETLEASE.
> 
> > Also, at least in NFSv4, layouts are handed out for a particular byte
> > range in a file. Should we consider doing this with an API that allows
> > for that in the future? Is this something that would