Re: [PATCH] brd: Allow ramdisk to be allocated on selected NUMA node

2018-06-14 Thread Adam Manzanares


On 6/14/18 1:47 PM, Jens Axboe wrote:
> On 6/14/18 2:41 PM, Adam Manzanares wrote:
>>
>>
>> On 6/14/18 1:37 PM, Jens Axboe wrote:
>>> On 6/14/18 2:32 PM, Adam Manzanares wrote:


 On 6/14/18 9:09 AM, Hannes Reinecke wrote:
> On Thu, 14 Jun 2018 09:33:35 -0600
> Jens Axboe  wrote:
>
>> On 6/14/18 9:29 AM, Hannes Reinecke wrote:
>>> On Thu, 14 Jun 2018 08:47:33 -0600
>>> Jens Axboe  wrote:
>>>  
 On 6/14/18 7:38 AM, Hannes Reinecke wrote:
> For performance reasons we should be able to allocate all memory
> from a given NUMA node, so this patch adds a new parameter
> 'rd_numa_node' to allow the user to specify the NUMA node id.
> When restricting fio to use the same NUMA node I'm seeing a
> performance boost of more than 200%.

 Looks fine to me. One comment.
 
> @@ -342,6 +343,10 @@ static int max_part = 1;
> module_param(max_part, int, 0444);
> MODULE_PARM_DESC(max_part, "Num Minors to reserve between
> devices");
> +static int rd_numa_node = NUMA_NO_NODE;
> +module_param(rd_numa_node, int, 0444);
> +MODULE_PARM_DESC(rd_numa_node, "NUMA node number to allocate RAM
> disk on.");

 This could feasibly be 0644, as there would be nothing wrong with
 altering this at runtime.
 
>>>
>>> While we could, it would not change the allocation of _existing_ ram
>>> devices, making behaviour rather unpredictable.
>>> Hence I did decide against it (and yes, I actually thought about
>>> it).
>>>
>>> But if you insist ...
>>
>> Right, it would just change new allocations. Probably not a common use
>> case, but there's really nothing that prevents it from being feasible.
>>
>> Next question - what does the memory allocator do if we run out of
>> memory on the given node? Should we punt to a different node if that
>> happens? Slower, but functional, seems preferable to not being able
>> to get memory.
>>
>
> Hmm. That I haven't considered; yes, that really sounds like an idea.
> Will be sending an updated patch.

 Will numactl ... modprobe brd ... solve this problem?
>>>
>>> It won't, pages are allocated as needed.
>>>
>>
>> Then how about a numactl ... dd /dev/ram ... after the modprobe.
> 
> Yes of course, or you could do that for every application that ends
> up in the path of doing IO to it. The point of the option is to
> just make it explicit, and not have to either NUMA pin each task,
> or prefill all possible pages.
> 

Makes sense. I have done some similar benchmarking and had to worry
about NUMA awareness; the numactl + dd approach worked for me because
I did not want to take a performance hit for page allocation during
the benchmarking.

Would anyone be interested in forcing the allocations to occur during 
module initialization?
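For reference, the prefill approach above boils down to something like the
following sketch (device name, ramdisk size and node number are illustrative;
rd_size is in KiB):

  # load brd, then touch every page from node 1 so that no page allocations
  # (and no cross-node pages) show up once the benchmark starts
  modprobe brd rd_nr=1 rd_size=1048576
  numactl --membind=1 dd if=/dev/zero of=/dev/ram0 bs=1M count=1024 oflag=direct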



Re: [PATCH] brd: Allow ramdisk to be allocated on selected NUMA node

2018-06-14 Thread Jens Axboe
On 6/14/18 2:41 PM, Adam Manzanares wrote:
> 
> 
> On 6/14/18 1:37 PM, Jens Axboe wrote:
>> On 6/14/18 2:32 PM, Adam Manzanares wrote:
>>>
>>>
>>> On 6/14/18 9:09 AM, Hannes Reinecke wrote:
 On Thu, 14 Jun 2018 09:33:35 -0600
 Jens Axboe  wrote:

> On 6/14/18 9:29 AM, Hannes Reinecke wrote:
>> On Thu, 14 Jun 2018 08:47:33 -0600
>> Jens Axboe  wrote:
>> 
>>> On 6/14/18 7:38 AM, Hannes Reinecke wrote:
 For performance reasons we should be able to allocate all memory
 from a given NUMA node, so this patch adds a new parameter
 'rd_numa_node' to allow the user to specify the NUMA node id.
 When restricting fio to use the same NUMA node I'm seeing a
 performance boost of more than 200%.
>>>
>>> Looks fine to me. One comment.
>>>
 @@ -342,6 +343,10 @@ static int max_part = 1;
module_param(max_part, int, 0444);
MODULE_PARM_DESC(max_part, "Num Minors to reserve between
 devices");
 +static int rd_numa_node = NUMA_NO_NODE;
 +module_param(rd_numa_node, int, 0444);
 +MODULE_PARM_DESC(rd_numa_node, "NUMA node number to allocate RAM
 disk on.");
>>>
>>> This could feasibly be 0644, as there would be nothing wrong with
>>> altering this at runtime.
>>>
>>
>> While we could, it would not change the allocation of _existing_ ram
>> devices, making behaviour rather unpredictable.
>> Hence I did decide against it (and yes, I actually thought about
>> it).
>>
>> But if you insist ...
>
> Right, it would just change new allocations. Probably not a common use
> case, but there's really nothing that prevents it from being feasible.
>
> Next question - what does the memory allocator do if we run out of
> memory on the given node? Should we punt to a different node if that
> happens? Slower, but functional, seems preferable to not being able
> to get memory.
>

 Hmm. That I haven't considered; yes, that really sounds like an idea.
 Will be sending an updated patch.
>>>
>>> Will numactl ... modprobe brd ... solve this problem?
>>
>> It won't, pages are allocated as needed.
>>
> 
> Then how about a numactl ... dd /dev/ram ... after the modprobe.

Yes of course, or you could do that for every application that ends
up in the path of doing IO to it. The point of the option is to
just make it explicit, and not have to either NUMA pin each task,
or prefill all possible pages.

-- 
Jens Axboe



Re: [PATCH] brd: Allow ramdisk to be allocated on selected NUMA node

2018-06-14 Thread Adam Manzanares


On 6/14/18 1:37 PM, Jens Axboe wrote:
> On 6/14/18 2:32 PM, Adam Manzanares wrote:
>>
>>
>> On 6/14/18 9:09 AM, Hannes Reinecke wrote:
>>> On Thu, 14 Jun 2018 09:33:35 -0600
>>> Jens Axboe  wrote:
>>>
 On 6/14/18 9:29 AM, Hannes Reinecke wrote:
> On Thu, 14 Jun 2018 08:47:33 -0600
> Jens Axboe  wrote:
> 
>> On 6/14/18 7:38 AM, Hannes Reinecke wrote:
>>> For performance reasons we should be able to allocate all memory
>>> from a given NUMA node, so this patch adds a new parameter
>>> 'rd_numa_node' to allow the user to specify the NUMA node id.
>>> When restricting fio to use the same NUMA node I'm seeing a
>>> performance boost of more than 200%.
>>
>> Looks fine to me. One comment.
>>
>>> @@ -342,6 +343,10 @@ static int max_part = 1;
>>>module_param(max_part, int, 0444);
>>>MODULE_PARM_DESC(max_part, "Num Minors to reserve between
>>> devices");
>>> +static int rd_numa_node = NUMA_NO_NODE;
>>> +module_param(rd_numa_node, int, 0444);
>>> +MODULE_PARM_DESC(rd_numa_node, "NUMA node number to allocate RAM
>>> disk on.");
>>
>> This could feasibly be 0644, as there would be nothing wrong with
>> altering this at runtime.
>>
>
> While we could, it would not change the allocation of _existing_ ram
> devices, making behaviour rather unpredictable.
> Hence I did decide against it (and yes, I actually thought about
> it).
>
> But if you insist ...

 Right, it would just change new allocations. Probably not a common use
 case, but there's really nothing that prevents it from being feasible.

 Next question - what does the memory allocator do if we run out of
 memory on the given node? Should we punt to a different node if that
 happens? Slower, but functional, seems preferable to not being able
 to get memory.

>>>
>>> Hmm. That I haven't considered; yes, that really sounds like an idea.
>>> Will be sending an updated patch.
>>
>> Will numactl ... modprobe brd ... solve this problem?
> 
> It won't, pages are allocated as needed.
> 

Then how about a numactl ... dd /dev/ram ... after the modprobe.

Re: [PATCH] brd: Allow ramdisk to be allocated on selected NUMA node

2018-06-14 Thread Jens Axboe
On 6/14/18 2:32 PM, Adam Manzanares wrote:
> 
> 
> On 6/14/18 9:09 AM, Hannes Reinecke wrote:
>> On Thu, 14 Jun 2018 09:33:35 -0600
>> Jens Axboe  wrote:
>>
>>> On 6/14/18 9:29 AM, Hannes Reinecke wrote:
 On Thu, 14 Jun 2018 08:47:33 -0600
 Jens Axboe  wrote:

> On 6/14/18 7:38 AM, Hannes Reinecke wrote:
>> For performance reasons we should be able to allocate all memory
>> from a given NUMA node, so this patch adds a new parameter
>> 'rd_numa_node' to allow the user to specify the NUMA node id.
>> When restricting fio to use the same NUMA node I'm seeing a
>> performance boost of more than 200%.
>
> Looks fine to me. One comment.
>   
>> @@ -342,6 +343,10 @@ static int max_part = 1;
>>   module_param(max_part, int, 0444);
>>   MODULE_PARM_DESC(max_part, "Num Minors to reserve between
>> devices");
>> +static int rd_numa_node = NUMA_NO_NODE;
>> +module_param(rd_numa_node, int, 0444);
>> +MODULE_PARM_DESC(rd_numa_node, "NUMA node number to allocate RAM
>> disk on.");
>
> This could feasibly be 0644, as there would be nothing wrong with
> altering this at runtime.
>   

 While we could, it would not change the allocation of _existing_ ram
 devices, making behaviour rather unpredictable.
 Hence I did decide against it (and yes, I actually thought about
 it).

 But if you insist ...
>>>
>>> Right, it would just change new allocations. Probably not a common use
>>> case, but there's really nothing that prevents it from being feasible.
>>>
>>> Next question - what does the memory allocator do if we run out of
>>> memory on the given node? Should we punt to a different node if that
>>> happens? Slower, but functional, seems preferable to not being able
>>> to get memory.
>>>
>>
>> Hmm. That I haven't considered; yes, that really sounds like an idea.
>> Will be sending an updated patch.
> 
> Will numactl ... modprobe brd ... solve this problem?

It won't, pages are allocated as needed.

-- 
Jens Axboe



Re: [PATCH] brd: Allow ramdisk to be allocated on selected NUMA node

2018-06-14 Thread Adam Manzanares


On 6/14/18 9:09 AM, Hannes Reinecke wrote:
> On Thu, 14 Jun 2018 09:33:35 -0600
> Jens Axboe  wrote:
> 
>> On 6/14/18 9:29 AM, Hannes Reinecke wrote:
>>> On Thu, 14 Jun 2018 08:47:33 -0600
>>> Jens Axboe  wrote:
>>>
 On 6/14/18 7:38 AM, Hannes Reinecke wrote:
> For performance reasons we should be able to allocate all memory
> from a given NUMA node, so this patch adds a new parameter
> 'rd_numa_node' to allow the user to specify the NUMA node id.
> When restricting fio to use the same NUMA node I'm seeing a
> performance boost of more than 200%.

 Looks fine to me. One comment.
   
> @@ -342,6 +343,10 @@ static int max_part = 1;
>   module_param(max_part, int, 0444);
>   MODULE_PARM_DESC(max_part, "Num Minors to reserve between
> devices");
> +static int rd_numa_node = NUMA_NO_NODE;
> +module_param(rd_numa_node, int, 0444);
> +MODULE_PARM_DESC(rd_numa_node, "NUMA node number to allocate RAM
> disk on.");

 This could feasibly be 0644, as there would be nothing wrong with
 altering this at runtime.
   
>>>
>>> While we could, it would not change the allocation of _existing_ ram
>>> devices, making behaviour rather unpredictable.
>>> Hence I did decide against it (and yes, I actually thought about
>>> it).
>>>
>>> But if you insist ...
>>
>> Right, it would just change new allocations. Probably not a common use
>> case, but there's really nothing that prevents it from being feasible.
>>
>> Next question - what does the memory allocator do if we run out of
>> memory on the given node? Should we punt to a different node if that
>> happens? Slower, but functional, seems preferable to not being able
>> to get memory.
>>
> 
> Hmm. That I haven't considered; yes, that really sounds like an idea.
> Will be sending an updated patch.

Will numactl ... modprobe brd ... solve this problem?

> 
> Cheers,
> 
> Hannes
> 

Re: dm-zoned-tools: add zoned disk udev rules for scheduler / dmsetup

2018-06-14 Thread Mike Snitzer
On Thu, Jun 14 2018 at  1:37pm -0400,
Luis R. Rodriguez  wrote:

> On Thu, Jun 14, 2018 at 08:38:06AM -0400, Mike Snitzer wrote:
> > On Wed, Jun 13 2018 at  8:11pm -0400,
> > Luis R. Rodriguez  wrote:
> > 
> > > Setting up zoned disks in a generic form is not so trivial. There
> > > is also quite a bit of tribal knowledge with these devices which is not
> > > easy to find.
> > > 
> > > The currently supplied demo script works but it is not generic enough to be
> > > practical for Linux distributions or even developers who often move
> > > from one kernel to another.
> > > 
> > > This tries to put a bit of this tribal knowledge into an initial udev
> > > rule for development, in the hope that Linux distributions can later
> > > deploy it. Three rules are added. One rule is optional for now; it should be
> > > extended later to be more distribution-friendly, and then I think this
> > > may be ready to be considered for integration into distributions.
> > > 
> > > 1) scheduler setup
> > 
> > This is wrong.. if zoned devices are so dependent on deadline or
> > mq-deadline then the kernel should allow them to be hardcoded.  I know
> > Jens removed the API to do so but the fact that drivers need to rely on
> > hacks like this udev rule to get a functional device is proof we need to
> > allow drivers to impose the scheduler used.
> 
> This is the point of the patch as well. I actually tend to agree with you,
> and I had tried to draw up a patch to do just that; however it's *not* possible
> today to do this and would require some consensus. So from what I can tell
> we *have* to live with this one or a form of it, i.e. a file describing which
> disk serial gets deadline and which one gets mq-deadline.
> 
> Jens?
> 
> Anyway, let's assume this is done in the kernel: which ones would use deadline,
> and which would use mq-deadline?

The zoned storage driver needs to make that call based on what mode it
is in.  If it is using blk-mq then it selects mq-deadline, otherwise
deadline.
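A quick way to tell which mode a given disk is in, and hence which of the two
names such a driver-side default would pick (sdX is a placeholder; the mq
directory only exists for blk-mq devices):

  if [ -d /sys/block/sdX/mq ]; then
      echo "blk-mq: mq-deadline"
  else
      echo "legacy: deadline"
  fi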
 
> > > 2) blacklist f2fs devices
> > 
> > There should probably be support in dm-zoned for detecting whether a
> > zoned device was formatted with f2fs (assuming there is a known f2fs
> > superblock)?
> 
> Not sure what you mean. Are you suggesting we always set up dm-zoned for
> all zoned disks and just make an exception in the dm-zoned code to somehow
> use the disk directly if a filesystem supports zoned disks natively?

No, I'm saying that a udev rule wouldn't be needed if dm-zoned just
errored out if asked to consume disks that already have an f2fs
superblock.  And existing filesystems should get conflicting superblock
awareness "for free" if blkid or whatever is trained to be aware of
f2fs's superblock.
 
> f2fs does not require dm-zoned. What would be required is a bit more complex
> given one could dedicate portions of the disk to f2fs and other portions to
> another filesystem, which would require dm-zoned.
> 
> Also filesystems which *do not* support zoned disks should *not* be allowing
> direct setup. Today that's all filesystems other than f2fs; in the future
> that may change. Those are loaded guns we are leaving around for users just
> waiting to shoot themselves in the foot.
> 
> So who's going to work on all the above?

It should take care of itself if existing tools are trained to be aware
of new signatures.  E.g. ext4 and xfs already are aware of one another
so that you cannot reformat a device with the other unless force is
given.

The same kind of mutual exclusion needs to happen for zoned devices.

So the zoned device tools, dm-zoned, f2fs, whatever.. they need to be
updated to not step on each other's toes.  And other filesystems' tools
need to be updated to be zoned device aware.
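As an illustration of that kind of check, the setup path could probe for an
existing signature before handing a disk to dmsetup; a minimal sketch, with
/dev/sdX as a placeholder:

  # blkid -p probes the device directly and reports any known superblock
  # (f2fs included); skip the disk if one is found
  if fstype=$(blkid -p -o value -s TYPE /dev/sdX 2>/dev/null); then
      echo "/dev/sdX already carries a $fstype superblock, leaving it alone"
  else
      echo "/dev/sdX has no known superblock, safe to hand to dmsetup"
  fi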

> The point of the udev script is to illustrate the pain of properly deploying
> zoned disks on distributions today. Without a roadmap... this is what
> at least I need on my systems today to reasonably deploy these disks for
> my own development.
> 
> Consensus is indeed needed for a broader picture.

Yeap.

> > > 3) run dmsetup for the rest of devices
> > 
> > automagically running dmsetup directly from udev to create a dm-zoned
> > target is very much wrong.  It just gets in the way of proper support
> > that should be added to appropriate tools that admins use to set up their
> > zoned devices.  For instance, persistent use of the dm-zoned target should
> > be made reliable with a volume manager..
> 
> Ah yes, but who's working on that? How long will it take?

No idea. As is (from my vantage point), there is close to zero demand for
zoned devices.  It won't be a priority until enough customers are asking
for it.

> I agree it is odd to expect one to use dmsetup and then use a volume manager on
> top of it; if we can just add proper support to the volume manager... then
> that's a reasonable way to go.
> 
> But *we're not there* yet, and as-is today, what is described in the udev
> script is the best we can do for a generic setup.

Just 

Re: dm-zoned-tools: add zoned disk udev rules for scheduler / dmsetup

2018-06-14 Thread Luis R. Rodriguez
On Thu, Jun 14, 2018 at 07:37:19PM +0200, Luis R. Rodriguez wrote:
> 
> I.e. a file describing which
> disk serial gets deadline and which one gets mq-deadline.
> Anyway, let's assume this is done in the kernel: which ones would use deadline,
> and which would use mq-deadline?

Never mind this; deadline will work, Bart already stated why.

 Luis


Re: [PATCH] dm-zoned-tools: add zoned disk udev rules for scheduler / dmsetup

2018-06-14 Thread Luis R. Rodriguez
On Thu, Jun 14, 2018 at 04:19:23PM +, Bart Van Assche wrote:
> On Wed, 2018-06-13 at 17:11 -0700, Luis R. Rodriguez wrote:
> > This tries to put a bit of this tribal knowledge into an initial udev
> > rule for development, in the hope that Linux distributions can later
> > deploy it. Three rules are added. One rule is optional for now; it should be
> > extended later to be more distribution-friendly, and then I think this
> > may be ready to be considered for integration into distributions.
> > 
> > 1) scheduler setup
> > 2) blacklist f2fs devices
> > 3) run dmsetup for the rest of devices
> 
> Hello Luis,
> 
> I think it is wrong to package the zoned block device scheduler rule in the
> dm-zoned-tools package. That udev rule should be activated whether or not the
> dm-zoned-tools package has been installed. Have you considered submitting the
> zoned block device scheduler rule to the systemd project since today that
> project includes all base udev rules?

Nope, this is a udev rule intended for developers wishing to brush up on our
state of affairs. All I wanted to do was to get my own drive to work at home in
a reasonably reliable and generic form, which also allowed me to hop around
kernels without a fatal issue.

Clearly there is still much to discuss and a roadmap to illustrate. We should
update the documentation to reflect this first, and based on that discussion
then enable distributions. We should have short-term and long-term plans.
That discussion may have taken place already, so forgive me for not knowing
about it; I did try to read as much as I could from the code and existing
tools and just didn't find anything else to clarify it.

> > +# Zoned disks can only work with the deadline or mq-deadline scheduler. This is
> > +# mandated for all SMR drives since v4.16. It has been determined this must be
> > +# done through a udev rule, and the kernel should not set this up for disks.
> > +# This magic will have to live for *all* zoned disks.
> > +# XXX: what about distributions that want mq-deadline ? Probably easy for now
> > +#  to assume deadline and later have a mapping file to enable
> > +#  mq-deadline for specific serial devices?
> > +ACTION=="add|change", KERNEL=="sd*[!0-9]", ATTRS{queue/zoned}=="host-managed", \
> > +   ATTR{queue/scheduler}="deadline"
> 
> I think it is wrong to limit this rule to SCSI disks only. Work is ongoing to
> add zoned block device support to the null_blk driver. That is a block driver
> and not a SCSI driver. I think the above udev rule should apply to that block
> driver too.

Sure, patches welcome :)

> Regarding blk-mq, from the mq-deadline source code:
>   .elevator_alias = "deadline",
> 
> In other words, the name "deadline" should work both for legacy and for blk-mq
> block devices.

Groovy, that helps.

  Luis


Re: dm-zoned-tools: add zoned disk udev rules for scheduler / dmsetup

2018-06-14 Thread Luis R. Rodriguez
On Thu, Jun 14, 2018 at 08:38:06AM -0400, Mike Snitzer wrote:
> On Wed, Jun 13 2018 at  8:11pm -0400,
> Luis R. Rodriguez  wrote:
> 
> > Setting up zoned disks in a generic form is not so trivial. There
> > is also quite a bit of tribal knowledge with these devices which is not
> > easy to find.
> > 
> > The currently supplied demo script works but it is not generic enough to be
> > practical for Linux distributions or even developers who often move
> > from one kernel to another.
> > 
> > This tries to put a bit of this tribal knowledge into an initial udev
> > rule for development, in the hope that Linux distributions can later
> > deploy it. Three rules are added. One rule is optional for now; it should be
> > extended later to be more distribution-friendly, and then I think this
> > may be ready to be considered for integration into distributions.
> > 
> > 1) scheduler setup
> 
> This is wrong.. if zoned devices are so dependent on deadline or
> mq-deadline then the kernel should allow them to be hardcoded.  I know
> Jens removed the API to do so but the fact that drivers need to rely on
> hacks like this udev rule to get a functional device is proof we need to
> allow drivers to impose the scheduler used.

This is the point of the patch as well. I actually tend to agree with you,
and I had tried to draw up a patch to do just that; however it's *not* possible
today to do this and would require some consensus. So from what I can tell
we *have* to live with this one or a form of it, i.e. a file describing which
disk serial gets deadline and which one gets mq-deadline.

Jens?

Anyway, let's assume this is done in the kernel: which ones would use deadline,
and which would use mq-deadline?

> > 2) blacklist f2fs devices
> 
> There should probably be support in dm-zoned for detecting whether a
> zoned device was formatted with f2fs (assuming there is a known f2fs
> superblock)?

Not sure what you mean. Are you suggesting we always set up dm-zoned for
all zoned disks and just make an exception in the dm-zoned code to somehow
use the disk directly if a filesystem supports zoned disks natively?

f2fs does not require dm-zoned. What would be required is a bit more complex
given one could dedicate portions of the disk to f2fs and other portions to
another filesystem, which would require dm-zoned.

Also filesystems which *do not* support zoned disks should *not* be allowing
direct setup. Today that's all filesystems other than f2fs; in the future
that may change. Those are loaded guns we are leaving around for users just
waiting to shoot themselves in the foot.

So who's going to work on all the above?

The point of the udev script is to illustrate the pain of properly deploying
zoned disks on distributions today. Without a roadmap... this is what
at least I need on my systems today to reasonably deploy these disks for
my own development.

Consensus is indeed needed for a broader picture.

> > 3) run dmsetup for the rest of devices
> 
> automagically running dmsetup directly from udev to create a dm-zoned
> target is very much wrong.  It just gets in the way of proper support
that should be added to appropriate tools that admins use to set up their
zoned devices.  For instance, persistent use of the dm-zoned target should
> be made reliable with a volume manager..

Ah yes, but who's working on that? How long will it take?

I agree it is odd to expect one to use dmsetup and then use a volume manager on
top of it; if we can just add proper support to the volume manager... then
that's a reasonable way to go.

But *we're not there* yet, and as-is today, what is described in the udev
script is the best we can do for a generic setup.

> In general this udev script is unwelcome and makes things way worse for
> the long-term success of zoned devices.

dm-zoned-tools does not acknowledge a roadmap in any way, and just provides
a script, which IMHO is less generic and less distribution-friendly. Having
a udev rule in place to demonstrate the current state of affairs IMHO is
more scalable and demonstrates the issues better than the script.

If we have an agreed-upon long-term strategy, let's document that. But from
what I gather we are not even in consensus with regard to the scheduler
stuff. If we have consensus on the other stuff, let's document that, as
dm-zoned-tools is the only place I think folks could find to reasonably
deploy these things.

> I don't dispute there is an obvious void for how to properly setup zoned
> devices, but this script is _not_ what should fill that void.

Good to know! Again, consider it as an alternative to the script.

I'm happy to adapt the language and supply it only as an example script
developers can use, but we can't leave users hanging either. Let's at
least come up with a plan which we seem to agree on and document that.

  Luis


Re: dm-zoned-tools: add zoned disk udev rules for scheduler / dmsetup

2018-06-14 Thread Bart Van Assche
On Thu, 2018-06-14 at 08:38 -0400, Mike Snitzer wrote:
> On Wed, Jun 13 2018 at  8:11pm -0400,
> Luis R. Rodriguez  wrote:
> > 1) scheduler setup
> 
> This is wrong.. if zoned devices are so dependent on deadline or
> mq-deadline then the kernel should allow them to be hardcoded.  I know
> Jens removed the API to do so but the fact that drivers need to rely on
> hacks like this udev rule to get a functional device is proof we need to
> allow drivers to impose the scheduler used.

Hello Mike,

As you know the Linux kernel block layer stack can reorder requests. However,
for zoned block devices it is essential that the block device receives write
requests in the same order in which they were submitted by the (user space)
application or by the (kernel) filesystem. After a long debate the choice was
made to make I/O schedulers responsible for preserving the write
order. Today only the deadline scheduler guarantees that the write order is
preserved. Hence the udev rule that sets the deadline scheduler for zoned
block devices.

Bart.
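For concreteness, the attributes the proposed rule keys off can be inspected
by hand; a minimal sketch of what the rule automates (sdX is a placeholder):

  # list host-managed zoned disks and their current scheduler setting
  for d in /sys/block/sd*; do
      [ "$(cat "$d/queue/zoned")" = "host-managed" ] || continue
      echo "$d: $(cat "$d/queue/scheduler")"
  done

  # what the rule does for a single matching disk
  echo deadline > /sys/block/sdX/queue/scheduler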






Re: [PATCH] brd: Allow ramdisk to be allocated on selected NUMA node

2018-06-14 Thread Hannes Reinecke
On Thu, 14 Jun 2018 09:33:35 -0600
Jens Axboe  wrote:

> On 6/14/18 9:29 AM, Hannes Reinecke wrote:
> > On Thu, 14 Jun 2018 08:47:33 -0600
> > Jens Axboe  wrote:
> >   
> >> On 6/14/18 7:38 AM, Hannes Reinecke wrote:  
> >>> For performance reasons we should be able to allocate all memory
> >>> from a given NUMA node, so this patch adds a new parameter
> >>> 'rd_numa_node' to allow the user to specify the NUMA node id.
> >>> When restricting fio to use the same NUMA node I'm seeing a
> >>> performance boost of more than 200%.
> >>
> >> Looks fine to me. One comment.
> >>  
> >>> @@ -342,6 +343,10 @@ static int max_part = 1;
> >>>  module_param(max_part, int, 0444);
> >>>  MODULE_PARM_DESC(max_part, "Num Minors to reserve between
> >>> devices"); 
> >>> +static int rd_numa_node = NUMA_NO_NODE;
> >>> +module_param(rd_numa_node, int, 0444);
> >>> +MODULE_PARM_DESC(rd_numa_node, "NUMA node number to allocate RAM
> >>> disk on.");
> >>
> >> This could feasibly be 0644, as there would be nothing wrong with
> >> altering this at runtime.
> >>  
> > 
> > While we could, it would not change the allocation of _existing_ ram
> > devices, making behaviour rather unpredictable.
> > Hence I did decide against it (and yes, I actually thought about
> > it).
> > 
> > But if you insist ...  
> 
> Right, it would just change new allocations. Probably not a common use
> case, but there's really nothing that prevents it from being feasible.
> 
> Next question - what does the memory allocator do if we run out of
> memory on the given node? Should we punt to a different node if that
> happens? Slower, but functional, seems preferable to not being able
> to get memory.
> 

Hmm. That I haven't considered; yes, that really sounds like an idea.
Will be sending an updated patch.

Cheers,

Hannes


Re: [PATCH] brd: Allow ramdisk to be allocated on selected NUMA node

2018-06-14 Thread Jens Axboe
On 6/14/18 9:29 AM, Hannes Reinecke wrote:
> On Thu, 14 Jun 2018 08:47:33 -0600
> Jens Axboe  wrote:
> 
>> On 6/14/18 7:38 AM, Hannes Reinecke wrote:
>>> For performance reasons we should be able to allocate all memory
>>> from a given NUMA node, so this patch adds a new parameter
>>> 'rd_numa_node' to allow the user to specify the NUMA node id.
>>> When restricting fio to use the same NUMA node I'm seeing a
>>> performance boost of more than 200%.  
>>
>> Looks fine to me. One comment.
>>
>>> @@ -342,6 +343,10 @@ static int max_part = 1;
>>>  module_param(max_part, int, 0444);
>>>  MODULE_PARM_DESC(max_part, "Num Minors to reserve between
>>> devices"); 
>>> +static int rd_numa_node = NUMA_NO_NODE;
>>> +module_param(rd_numa_node, int, 0444);
>>> +MODULE_PARM_DESC(rd_numa_node, "NUMA node number to allocate RAM
>>> disk on.");  
>>
>> This could feasibly be 0644, as there would be nothing wrong with
>> altering this at runtime.
>>
> 
> While we could, it would not change the allocation of _existing_ ram
> devices, making behaviour rather unpredictable.
> Hence I did decide against it (and yes, I actually thought about it).
> 
> But if you insist ...

Right, it would just change new allocations. Probably not a common use
case, but there's really nothing that prevents it from being feasible.

Next question - what does the memory allocator do if we run out of
memory on the given node? Should we punt to a different node if that
happens? Slower, but functional, seems preferable to not being able
to get memory.

-- 
Jens Axboe



Re: [PATCH] brd: Allow ramdisk to be allocated on selected NUMA node

2018-06-14 Thread Hannes Reinecke
On Thu, 14 Jun 2018 08:47:33 -0600
Jens Axboe  wrote:

> On 6/14/18 7:38 AM, Hannes Reinecke wrote:
> > For performance reasons we should be able to allocate all memory
> > from a given NUMA node, so this patch adds a new parameter
> > 'rd_numa_node' to allow the user to specify the NUMA node id.
> > When restricting fio to use the same NUMA node I'm seeing a
> > performance boost of more than 200%.  
> 
> Looks fine to me. One comment.
> 
> > @@ -342,6 +343,10 @@ static int max_part = 1;
> >  module_param(max_part, int, 0444);
> >  MODULE_PARM_DESC(max_part, "Num Minors to reserve between
> > devices"); 
> > +static int rd_numa_node = NUMA_NO_NODE;
> > +module_param(rd_numa_node, int, 0444);
> > +MODULE_PARM_DESC(rd_numa_node, "NUMA node number to allocate RAM
> > disk on.");  
> 
> This could feasibly be 0644, as there would be nothing wrong with
> altering this at runtime.
> 

While we could, it would not change the allocation of _existing_ ram
devices, making behaviour rather unpredictable.
Hence I did decide against it (and yes, I actually thought about it).

But if you insist ...

Cheers,

Hannes


Re: remove nvme_reinit_tagset and blk_mq_tagset_iter

2018-06-14 Thread Jens Axboe
On 6/14/18 8:54 AM, Christoph Hellwig wrote:
> On Thu, Jun 14, 2018 at 08:45:07AM -0600, Jens Axboe wrote:
>> On 6/14/18 6:30 AM, Christoph Hellwig wrote:
>>> Now that nvme-fc moved off the reinit request concept both the core nvme
>>> and blk-mq code for it becomes unused.
>>>
>>> Jens, are you ok with queuing up the blk-mq side with the pending nvme
>>> changes?
>>
>> Are you asking if you can queue it up, presumably because the nvme-fc
>> change isn't upstream yet? That's fine with me, you can add my
>> reviewed-by to the patches.
> 
> Yes, that is the plan.  The nvme changes are fixes I'm going to send
> you ASAP, and I think we can just send the dead code removal along.

OK, that sounds fine then. Please do send them asap, so I can ship
what I have for Linus before he cuts -rc1 on Sunday.

-- 
Jens Axboe



Re: remove nvme_reinit_tagset and blk_mq_tagset_iter

2018-06-14 Thread Christoph Hellwig
On Thu, Jun 14, 2018 at 08:45:07AM -0600, Jens Axboe wrote:
> On 6/14/18 6:30 AM, Christoph Hellwig wrote:
> > Now that nvme-fc moved off the reinit request concept both the core nvme
> > and blk-mq code for it becomes unused.
> > 
> > Jens, are you ok with queuing up the blk-mq side with the pending nvme
> > changes?
> 
> Are you asking if you can queue it up, presumably because the nvme-fc
> change isn't upstream yet? That's fine with me, you can add my
> reviewed-by to the patches.

Yes, that is the plan.  The nvme changes are fixes I'm going to send
you ASAP, and I think we can just send the dead code removal along.


Re: remove nvme_reinit_tagset and blk_mq_tagset_iter

2018-06-14 Thread Jens Axboe
On 6/14/18 6:30 AM, Christoph Hellwig wrote:
> Now that nvme-fc moved off the reinit request concept both the core nvme
> and blk-mq code for it becomes unused.
> 
> Jens, are you ok with queuing up the blk-mq side with the pending nvme
> changes?

Are you asking if you can queue it up, presumably because the nvme-fc
change isn't upstream yet? That's fine with me, you can add my
reviewed-by to the patches.

-- 
Jens Axboe



Re: [PATCH] dm-zoned-tools: add zoned disk udev rules for scheduler / dmsetup

2018-06-14 Thread Christoph Hellwig
On Thu, Jun 14, 2018 at 01:39:50PM +, Bart Van Assche wrote:
> On Thu, 2018-06-14 at 10:01 +, Damien Le Moal wrote:
> > Applied. Thanks Luis !
> 
> Hello Damien,
> 
> Can this still be undone? I agree with Mike that it's wrong to invoke
> "/sbin/dmsetup create ... zoned ..." from a udev rule.

Yes.  We'll really need to verify that the device has dm-zoned metadata
first.  Preferably including a uuid for stable device naming.
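As a sketch of the stable-naming side, the serial that the proposed rule uses
for its zone-$serial mapper name can be read back from udev's own database
(/dev/sdX is a placeholder):

  serial=$(udevadm info --query=property --name=/dev/sdX | sed -n 's/^ID_SERIAL=//p')
  echo "the mapped device would be /dev/mapper/zone-$serial"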


[PATCH] brd: Allow ramdisk to be allocated on selected NUMA node

2018-06-14 Thread Hannes Reinecke
For performance reasons we should be able to allocate all memory
from a given NUMA node, so this patch adds a new parameter
'rd_numa_node' to allow the user to specify the NUMA node id.
When restricting fio to use the same NUMA node I'm seeing a performance
boost of more than 200%.

Signed-off-by: Hannes Reinecke 
---
 drivers/block/brd.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index bb976598ee43..7142d836539e 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -36,6 +36,7 @@
  */
 struct brd_device {
int brd_number;
+   int brd_numa_node;
 
struct request_queue*brd_queue;
struct gendisk  *brd_disk;
@@ -103,7 +104,7 @@ static struct page *brd_insert_page(struct brd_device *brd, sector_t sector)
 * restriction might be able to be lifted.
 */
gfp_flags = GFP_NOIO | __GFP_ZERO;
-   page = alloc_page(gfp_flags);
+   page = alloc_pages_node(brd->brd_numa_node, gfp_flags, 0);
if (!page)
return NULL;
 
@@ -342,6 +343,10 @@ static int max_part = 1;
 module_param(max_part, int, 0444);
 MODULE_PARM_DESC(max_part, "Num Minors to reserve between devices");
 
+static int rd_numa_node = NUMA_NO_NODE;
+module_param(rd_numa_node, int, 0444);
+MODULE_PARM_DESC(rd_numa_node, "NUMA node number to allocate RAM disk on.");
+
 MODULE_LICENSE("GPL");
 MODULE_ALIAS_BLOCKDEV_MAJOR(RAMDISK_MAJOR);
 MODULE_ALIAS("rd");
@@ -363,7 +368,7 @@ __setup("ramdisk_size=", ramdisk_size);
 static LIST_HEAD(brd_devices);
 static DEFINE_MUTEX(brd_devices_mutex);
 
-static struct brd_device *brd_alloc(int i)
+static struct brd_device *brd_alloc(int i, int node)
 {
struct brd_device *brd;
struct gendisk *disk;
@@ -372,10 +377,11 @@ static struct brd_device *brd_alloc(int i)
if (!brd)
goto out;
brd->brd_number = i;
+   brd->brd_numa_node = node;
	spin_lock_init(&brd->brd_lock);
	INIT_RADIX_TREE(&brd->brd_pages, GFP_ATOMIC);
 
-   brd->brd_queue = blk_alloc_queue(GFP_KERNEL);
+   brd->brd_queue = blk_alloc_queue_node(GFP_KERNEL, node, NULL);
if (!brd->brd_queue)
goto out_free_dev;
 
@@ -434,7 +440,7 @@ static struct brd_device *brd_init_one(int i, bool *new)
goto out;
}
 
-   brd = brd_alloc(i);
+   brd = brd_alloc(i, rd_numa_node);
if (brd) {
add_disk(brd->brd_disk);
	list_add_tail(&brd->brd_list, &brd_devices);
@@ -495,7 +501,7 @@ static int __init brd_init(void)
max_part = 1;
 
for (i = 0; i < rd_nr; i++) {
-   brd = brd_alloc(i);
+   brd = brd_alloc(i, rd_numa_node);
if (!brd)
goto out_free;
	list_add_tail(&brd->brd_list, &brd_devices);
-- 
2.12.3
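Usage would look something like the following (size and node values are
illustrative; rd_size is in KiB):

  # one 4 GiB ramdisk with all of its pages allocated from NUMA node 0
  modprobe brd rd_nr=1 rd_size=4194304 rd_numa_node=0
  cat /sys/module/brd/parameters/rd_numa_node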



Re: [PATCH] dm-zoned-tools: add zoned disk udev rules for scheduler / dmsetup

2018-06-14 Thread Bart Van Assche
On Thu, 2018-06-14 at 10:01 +, Damien Le Moal wrote:
> Applied. Thanks Luis !

Hello Damien,

Can this still be undone? I agree with Mike that it's wrong to invoke
"/sbin/dmsetup create ... zoned ..." from a udev rule.

Thanks,

Bart.





Re: dm-zoned-tools: add zoned disk udev rules for scheduler / dmsetup

2018-06-14 Thread Mike Snitzer
On Wed, Jun 13 2018 at  8:11pm -0400,
Luis R. Rodriguez  wrote:

> Setting up zoned disks in a generic form is not so trivial. There
> is also quite a bit of tribal knowledge with these devices which is not
> easy to find.
> 
> The currently supplied demo script works but it is not generic enough to be
> practical for Linux distributions or even developers who often move
> from one kernel to another.
> 
> This tries to put a bit of this tribal knowledge into an initial udev
> rule for development, in the hope that Linux distributions can later
> deploy it. Three rules are added. One rule is optional for now; it should be
> extended later to be more distribution-friendly, and then I think this
> may be ready to be considered for integration into distributions.
> 
> 1) scheduler setup

This is wrong.. if zoned devices are so dependent on deadline or
mq-deadline then the kernel should allow them to be hardcoded.  I know
Jens removed the API to do so but the fact that drivers need to rely on
hacks like this udev rule to get a functional device is proof we need to
allow drivers to impose the scheduler used.

> 2) blacklist f2fs devices

There should probably be support in dm-zoned for detecting whether a
zoned device was formatted with f2fs (assuming there is a known f2fs
superblock)?

> 3) run dmsetup for the rest of devices

automagically running dmsetup directly from udev to create a dm-zoned
target is very much wrong.  It just gets in the way of proper support
that should be added to appropriate tools that admins use to set up their
zoned devices.  For instance, persistent use of the dm-zoned target should
be made reliable with a volume manager..

In general this udev script is unwelcome and makes things way worse for
the long-term success of zoned devices.

I don't dispute there is an obvious void for how to properly setup zoned
devices, but this script is _not_ what should fill that void.

So a heartfelt:

Nacked-by: Mike Snitzer 


[PATCH 2/2] blk-mq: remove blk_mq_tagset_iter

2018-06-14 Thread Christoph Hellwig
Unused now that nvme stopped using it.

Signed-off-by: Christoph Hellwig 
---
 block/blk-mq-tag.c | 29 -
 include/linux/blk-mq.h |  2 --
 2 files changed, 31 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 70356a2a11ab..09b2ee6694fb 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -311,35 +311,6 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
 
-int blk_mq_tagset_iter(struct blk_mq_tag_set *set, void *data,
-int (fn)(void *, struct request *))
-{
-   int i, j, ret = 0;
-
-   if (WARN_ON_ONCE(!fn))
-   goto out;
-
-   for (i = 0; i < set->nr_hw_queues; i++) {
-   struct blk_mq_tags *tags = set->tags[i];
-
-   if (!tags)
-   continue;
-
-   for (j = 0; j < tags->nr_tags; j++) {
-   if (!tags->static_rqs[j])
-   continue;
-
-   ret = fn(data, tags->static_rqs[j]);
-   if (ret)
-   goto out;
-   }
-   }
-
-out:
-   return ret;
-}
-EXPORT_SYMBOL_GPL(blk_mq_tagset_iter);
-
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
void *priv)
 {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index fb355173f3c7..e3147eb74222 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -281,8 +281,6 @@ void blk_freeze_queue_start(struct request_queue *q);
 void blk_mq_freeze_queue_wait(struct request_queue *q);
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 unsigned long timeout);
-int blk_mq_tagset_iter(struct blk_mq_tag_set *set, void *data,
-   int (reinit_request)(void *, struct request *));
 
 int blk_mq_map_queues(struct blk_mq_tag_set *set);
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
-- 
2.17.1



Re: [PATCH] dm-zoned-tools: add zoned disk udev rules for scheduler / dmsetup

2018-06-14 Thread Damien Le Moal


On 6/14/18 09:11, Luis R. Rodriguez wrote:
> Setting up zoned disks in a generic form is not so trivial. There
> is also quite a bit of tribal knowledge with these devices which is not
> easy to find.
> 
> The currently supplied demo script works but it is not generic enough to be
> practical for Linux distributions or even developers who often move
> from one kernel to another.
> 
> This tries to put a bit of this tribal knowledge into an initial udev
> rule for development, in the hope that Linux distributions can later
> deploy it. Three rules are added. One rule is optional for now; it should be
> extended later to be more distribution-friendly, and then I think this
> may be ready to be considered for integration into distributions.
> 
> 1) scheduler setup
> 2) blacklist f2fs devices
> 3) run dmsetup for the rest of devices
> 
> Note that this udev rule will not work well if you want to use a disk
> with f2fs on part of the disk and another filesystem on another part of
> the disk. That setup will require manual love, so such setups can use
> the same blacklist in rule 2).
> 
> It's not widely known, for instance, that as of v4.16 it is mandated to use
> either deadline or the mq-deadline scheduler for *all* SMR drives. It's
> also been determined that the Linux kernel is not the place to set this up,
> so a udev rule *is required* as per latest discussions. This is the
> first rule we add.
> 
> Furthermore if you are *not* using f2fs you always have to run dmsetup.
> dmsetups do not persist, so you currently *always* have to run a custom
> sort of script, which is not ideal for Linux distributions. We can invert
> this logic into a udev rule to enable users to blacklist disks they know they
> want to use f2fs for. This is the second, optional rule. This blacklisting
> can be generalized further in the future with an exception list file, for
> instance using INPUT{db} or the like.
> 
> The third and final rule added then runs dmsetup for the rest of the disks
> using the disk serial number for the new device mapper name.
> 
> Note that it is currently easy for users to make a mistake and run mkfs
> on the original disk, not the /dev/mapper/ device, for non-f2fs
> arrangements. If that is done, experience shows things can easily fall
> apart with alignment *eventually*. We have no generic way today to
> error out on this condition and proactively prevent this.
> 
> Signed-off-by: Luis R. Rodriguez 
> ---
>  README| 10 +-
>  udev/99-zoned-disks.rules | 78 +++
>  2 files changed, 87 insertions(+), 1 deletion(-)
>  create mode 100644 udev/99-zoned-disks.rules
> 
> diff --git a/README b/README
> index 65e96c34fd04..f49541eaabc8 100644
> --- a/README
> +++ b/README
> @@ -168,7 +168,15 @@ Options:
>   reclaiming random zones if the percentage of
>   free random data zones falls below .
>  
> -V. Example scripts
> +V. Udev zone disk deployment
> +
> +
> +A udev rule is provided which enables you to set the IO scheduler, blacklist
> +drives from dmsetup, and run dmsetup for the rest of the zoned drives.
> +If you use this udev rule the below script is not needed. Be sure to mkfs only
> +on the resulting /dev/mapper/zone-$serial device you end up with.
> +
> +VI. Example scripts
>  ==
>  
>  [[
> diff --git a/udev/99-zoned-disks.rules b/udev/99-zoned-disks.rules
> new file mode 100644
> index ..e19b738dcc0e
> --- /dev/null
> +++ b/udev/99-zoned-disks.rules
> @@ -0,0 +1,78 @@
> +# To use a zoned disk, first you need to:
> +#
> +# 1) Enable zone disk support in your kernel
> +# 2) Use the deadline or mq-deadline scheduler for it - mandated as of v4.16
> +# 3) Blacklist devices dedicated for f2fs as of v4.10
> +# 4) Run dmsetup on the other disks
> +# 5) Create the filesystem -- NOTE: use mkfs /dev/mapper/zone-serial if
> +#    you used dmsetup on the disk.
> +# 6) Consider using the nofail mount option in case you run an unsupported kernel
> +#
> +# You can use this udev rules file for 2) 3) and 4). Further details below.
> +#
> +# 1) Enable zone disk support in your kernel
> +#
> +#    o CONFIG_BLK_DEV_ZONED
> +#    o CONFIG_DM_ZONED
> +#
> +# This will let the kernel actually see these devices, i.e., via fdisk /dev/sda
> +# for instance. Run:
> +#
> +#    dmzadm --format /dev/sda
> +
> +# 2) Set deadline or mq-deadline for all disks which are zoned
> +#
> +# Zoned disks can only work with the deadline or mq-deadline scheduler. This is
> +# mandated for all SMR drives since v4.16. It has been determined this must be
> +# done through a udev rule, and the kernel should not set this up for disks.
> +# This magic will have to live for *all* zoned disks.
> +# XXX: what about distributions that want mq-deadline ? Probably easy for now
> +#  to assume deadline and later have a mapping file to enable
> +#  mq-deadline for specific serial devices?

Re: [RFC] cleanup bcache bio handling

2018-06-14 Thread Christoph Hellwig
On Thu, Jun 14, 2018 at 09:55:32AM +0800, Ming Lei wrote:
> > Agreed about bio_for_each_chunk_all(), but I just looked at the patch that
> > introduces them and it looks to me like there's no need, they should just be
> > bio_for_each_segment_all().
> 
> Now we can't change the vector with bio_for_each_segment_all(), so
> bio_for_each_chunk_all() has to be used. So looks it makes sense to use
> bio_add_page() to remove bio_for_each_chunk_all().

For now I'd suggest we just open code bio_for_each_segment_all in bcache
as a first step to get the multipage bio vec work rolling.  The code
already pokes deep into bio internals, so having that iteration right
next to it isn't really an issue.  In the long run I'd still like to
see it cleaned up, though.