linux-next: BUG: KASAN: use-after-free in bt_iter+0x29b/0x310

2018-12-03 Thread Andrei Vagin
Hi,

We run CRIU tests on linux-next. Today we found this bug in a kernel log:

https://travis-ci.org/avagin/linux/jobs/462912976

[2.516900] random: fast init done
[2.591491] sd 0:0:1:0: [sda] 146800640 512-byte logical blocks:
(75.2 GB/70.0 GiB)
[2.591688] sd 0:0:1:0: Attached scsi generic sg0 type 0
[2.591703] sd 0:0:1:0: [sda] 4096-byte physical blocks
[2.592085] sd 0:0:1:0: [sda] Write Protect is off
[2.592245] sd 0:0:1:0: [sda] Mode Sense: 1f 00 00 08
[2.592390] sd 0:0:1:0: [sda] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA
[2.597534] ==================================================================
[2.597694] BUG: KASAN: use-after-free in bt_iter+0x29b/0x310
[2.597813] Read of size 8 at addr 8881d44a1780 by task kworker/u4:0/7
[2.597929]
[2.598042] CPU: 1 PID: 7 Comm: kworker/u4:0 Not tainted
4.20.0-rc5-next-20181203+ #1
[2.598170] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[2.598308] Workqueue: events_unbound async_run_entry_fn
[2.598424] Call Trace:
[2.598549]  dump_stack+0x5b/0x8b
[2.598666]  print_address_description+0x6a/0x270
[2.598796]  ? bt_iter+0x29b/0x310
[2.598910]  kasan_report+0x133/0x1ae
[2.599024]  ? bt_iter+0x29b/0x310
[2.599152]  ? bt_iter+0x29b/0x310
[2.599285]  bt_iter+0x29b/0x310
[2.599402]  blk_mq_queue_tag_busy_iter+0x481/0x8f0
[2.599525]  ? blk_mq_stop_hw_queues+0x100/0x100
[2.599644]  ? blk_mq_put_tag+0x150/0x150
[2.599760]  ? do_raw_spin_unlock+0x54/0x220
[2.599879]  ? blk_mq_stop_hw_queues+0x100/0x100
[2.58]  ? __sbitmap_get_word+0x2a/0x80
[2.600116]  blk_mq_in_flight+0xd2/0x130
[2.600232]  ? blk_mq_end_request+0x430/0x430
[2.600353]  ? blk_account_io_start+0x602/0x760
[2.600469]  ? find_held_lock+0x32/0x1c0
[2.600597]  part_round_stats+0x11c/0x690
[2.600715]  ? blk_get_request+0xa0/0xa0
[2.600831]  ? lock_acquire+0xfe/0x290
[2.600949]  blk_account_io_start+0x404/0x760
[2.601065]  ? kvm_clock_get_cycles+0xd/0x10
[2.601180]  ? ktime_get+0x9c/0x120
[2.601323]  ? blk_account_io_done+0x750/0x750
[2.601439]  ? blk_mq_get_request+0xd54/0x1720
[2.601562]  ? dd_request_merge+0x220/0x220
[2.601681]  blk_mq_make_request+0x825/0xf70
[2.601808]  ? blk_mq_try_issue_directly+0x130/0x130
[2.601925]  ? generic_make_request_checks+0xa89/0x18f0
[2.602042]  ? blk_cleanup_queue+0x1b0/0x1b0
[2.602158]  ? blk_dump_rq_flags+0x3b0/0x3b0
[2.602277]  ? kthread+0x2e9/0x3a0
[2.602392]  ? kasan_unpoison_shadow+0x35/0x40
[2.602512]  ? kasan_kmalloc+0xa5/0xd0
[2.602629]  generic_make_request+0x541/0xd60
[2.602746]  ? mempool_alloc+0xf7/0x2c0
[2.602862]  ? blk_queue_enter+0x840/0x840
[2.602981]  ? guard_bio_eod+0x151/0x4c0
[2.603096]  ? find_held_lock+0x32/0x1c0
[2.603234]  ? submit_bio+0x142/0x3f0
[2.603354]  submit_bio+0x142/0x3f0
[2.603469]  ? lock_downgrade+0x5d0/0x5d0
[2.603589]  ? lock_acquire+0xfe/0x290
[2.603704]  ? generic_make_request+0xd60/0xd60
[2.603821]  ? bvec_alloc+0x270/0x270
[2.603937]  ? guard_bio_eod+0x169/0x4c0
[2.604055]  submit_bh_wbc+0x4d0/0x710
[2.604172]  ? _raw_spin_unlock+0x24/0x30
[2.604291]  block_read_full_page+0x3e6/0x830
[2.604408]  ? I_BDEV+0x10/0x10
[2.604527]  ? __bread_gfp+0x1f0/0x1f0
[2.604653]  ? add_to_page_cache_lru+0x112/0x1c0
[2.604770]  ? add_to_page_cache_locked+0x10/0x10
[2.604892]  ? alloc_pages_current+0xb3/0x2b0
[2.605009]  do_read_cache_page+0x658/0x10f0
[2.605127]  ? blkdev_writepages+0x10/0x10
[2.605243]  ? pagecache_get_page+0x6a0/0x6a0
[2.605361]  ? __device_add_disk+0xc9e/0xf40
[2.605476]  ? sd_probe_async+0x42d/0x720
[2.605596]  ? async_run_entry_fn+0xc3/0x5d0
[2.605711]  ? process_one_work+0x96c/0x16c0
[2.605828]  ? worker_thread+0x87/0xe80
[2.605941]  ? kthread+0x2e9/0x3a0
[2.606054]  ? ret_from_fork+0x35/0x40
[2.606171]  ? __save_stack_trace+0x5e/0x100
[2.606291]  ? deref_stack_reg+0xad/0xe0
[2.606406]  ? __read_once_size_nocheck.constprop.6+0x10/0x10
[2.606533]  ? depot_save_stack+0x2d9/0x460
[2.606650]  ? fs_reclaim_release.part.90+0x5/0x20
[2.606766]  ? find_held_lock+0x32/0x1c0
[2.606885]  read_dev_sector+0xbb/0x380
[2.607002]  read_lba+0x34d/0x620
[2.607118]  ? ultrix_partition+0x7a0/0x7a0
[2.607233]  ? kasan_unpoison_shadow+0x35/0x40
[2.607354]  efi_partition+0x2f2/0x1690
[2.607468]  ? get_page_from_freelist+0x7dc/0x4120
[2.607595]  ? vzalloc+0x8c/0xb0
[2.607708]  ? check_partition+0xe6/0x680
[2.607826]  ? is_gpt_valid.part.5+0xd80/0xd80
[2.607941]  ? get_page_from_freelist+0x70e/0x4120
[2.608062]  ? string+0x14c/0x220
[2.608178]  ? string+0x14c/0x220
[2.608296]  ? format_decode+0x3be/0x760
[2.608417]  ? memcpy+0x39/0x50
[2.608536]  ? vsnprintf+0x204/0

Re: [Xen-devel] [PATCH] xen-blkfront: use old rinfo after enomem during migration

2018-12-03 Thread Juergen Gross
On 04/12/2018 02:14, Dongli Zhang wrote:
> Hi Boris,
> 
> On 12/04/2018 12:07 AM, Boris Ostrovsky wrote:
>> On 12/2/18 3:31 PM, Manjunath Patil wrote:
>>> On 11/30/2018 2:33 PM, Boris Ostrovsky wrote:
>>>
 On 11/30/18 4:49 PM, Manjunath Patil wrote:
> Thank you Boris for your comments. I removed faulty email of mine.
>
> replies inline.
> On 11/30/2018 12:42 PM, Boris Ostrovsky wrote:
>> On 11/29/18 12:17 AM, Manjunath Patil wrote:
>>> Hi,
>>> Feel free to suggest/comment on this.
>>>
>>> I am trying to do the following at dst during the migration now.
>>> 1. Don't clear the old rinfo in blkif_free(). Instead just clean it.
>>> 2. Store the old rinfo and nr_rings into temp variables in
>>> negotiate_mq()
>>> 3. let nr_rings get re-calculated based on backend data
>>> 4. try allocating new memory based on new nr_rings
>> Since I suspect number of rings will likely be the same why not reuse
>> the rings in the common case?
> I thought attaching devices will happen more often than migration. Hence
> did not want to add an extra check for
>- whether I am inside the migration code path and
>- whether the new nr_rings is equal to the old nr_rings
>
> Sure addition of such a thing would avoid the memory allocation
> altogether in migration path,
> but it would add a little overhead for normal device addition.
>
> Do you think it's worth adding that change?

 IMO a couple of extra checks are not going to make much difference.
>>> I will add this change

 I wonder though --- have you actually seen the case where you did fail
 allocation and changes provided in this patch made things work? I am
 asking because right after negotiate_mq() we will call setup_blkring()
 and it will want to allocate bunch of memory. A failure there is fatal
 (to ring setup). So it seems to me that you will survive negotiate_mq()
 but then will likely fail soon after.
>>> I have noticed the ENOMEM inside negotiate_mq() on a ct machine. When I
>>> included my patch, I manually triggered the ENOMEM using a debug flag.
>>> The patch works for ENOMEM inside negotiate_mq().
>>>
>>> As you mentioned, if we really hit the ENOMEM in negotiate_mq(), we
>>> might hit it in setup_blkring() as well.
>>> We should add the similar change to blkif_sring struct as well.
>>
>>
>> Won't you have a similar issue with other frontends, say, netfront?
> 
> I think the kmalloc failed not because of OOM.
> 
> In fact, the size of "blkfront_ring_info" is large. When a domU has 4
> queues/rings, the size of 4 blkfront_ring_info structures can be about 300+ KB.
> 
> There is a chance that a kmalloc() of 300+ KB would fail.

So kmalloc() might not be the best choice. Any reason why you don't
change it to vmalloc()? This should address the problem in a much
simpler way.
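
A minimal sketch of that approach (kvcalloc() first tries kmalloc() and
transparently falls back to vmalloc() for large allocations; the helper
name below is illustrative, not the actual patch):

    #include <linux/mm.h>	/* kvcalloc(), kvfree() */

    static int blkfront_alloc_rinfo(struct blkfront_info *info)
    {
    	/* ~75 KB per ring; a high-order kmalloc() of 4 rings can fail */
    	info->rinfo = kvcalloc(info->nr_rings,
    			       sizeof(struct blkfront_ring_info),
    			       GFP_KERNEL);
    	return info->rinfo ? 0 : -ENOMEM;
    }

The matching free path would then use kvfree(info->rinfo) instead of
kfree().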


Juergen


Re: DIF/DIX issue related to config CONFIG_SCSI_MQ_DEFAULT

2018-12-03 Thread Martin K. Petersen


Hi John,

> We have also noticed that if we just enable DIF in hisi_sas (with MQ),
> and not DIX, then there is no issue.

Enabling DIF doesn't really do anything on the kernel side other than
setting PROTECT=1 in the READ/WRITE CDB and telling the driver which DIX
protection operation the HBA should use. Since protection information is
invisible to the kernel and only sent on the wire between initiator and
target, enabling DIF doesn't really have the ability to interfere with
anything on the kernel side. We're basically just setting flags asking
HBA and storage to enable protected transfers.

> I also noticed the mail "[PATCH v2 01/23] zfcp: make DIX experimental,
> disabled, and independent of DIF", where DIX is made experimental.

...for the zfcp driver on zSeries.

Just nitpicking on terminology here:

T10 Protection Information (formerly known as DIF) describes how to
generate and verify 8 bytes of extra information that's sent trailing
each logical block on the wire between an initiator and target. The T10
PI spec is focused on the target device implementation of this and
largely ignores the initiator side.

DIX tries to remedy this deficiency. It is a spec that describes a set
of logical operations an initiator must implement to facilitate sending
and receiving the T10 protection information to/from host memory instead
of terminating it at the HBA. The DIX spec isn't experimental, it's
about a decade old and hasn't changed in years.

The Linux kernel support for data integrity passthrough in the block
layer and SCSI isn't experimental either. It's also a decade old and
used extensively in production.

So I object to the notion of "DIX being made experimental". An
ASIC/firmware/driver implementation of DIX may be experimental. And of
course I can't rule out regressions in the kernel block integrity
implementation as a result of some of the recent MQ changes (will be
happy to work with you guys to figure those out).

But DIX isn't experimental, nor is the kernel support for passing
protection information to an HBA.

> For now we may not support DIX. It seems to have issues. We wanted to
> try the 3008 card on our system, but it does not seem to support DIX 0-3.

For some reason Broadcom have not upstreamed their DIX support. It's
supposedly available in their outbox driver.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [Xen-devel] [PATCH] xen-blkfront: use old rinfo after enomem during migration

2018-12-03 Thread Dongli Zhang
Hi Manjunath,

On 12/04/2018 10:49 AM, Manjunath Patil wrote:
> On 12/3/2018 6:16 PM, Boris Ostrovsky wrote:
> 
>> On 12/3/18 8:14 PM, Dongli Zhang wrote:
>>> Hi Boris,
>>>
>>> On 12/04/2018 12:07 AM, Boris Ostrovsky wrote:
 On 12/2/18 3:31 PM, Manjunath Patil wrote:
> On 11/30/2018 2:33 PM, Boris Ostrovsky wrote:
>
>> On 11/30/18 4:49 PM, Manjunath Patil wrote:
>>> Thank you Boris for your comments. I removed faulty email of mine.
>>>
>>> replies inline.
>>> On 11/30/2018 12:42 PM, Boris Ostrovsky wrote:
 On 11/29/18 12:17 AM, Manjunath Patil wrote:
> Hi,
> Feel free to suggest/comment on this.
>
> I am trying to do the following at dst during the migration now.
> 1. Don't clear the old rinfo in blkif_free(). Instead just clean it.
> 2. Store the old rinfo and nr_rings into temp variables in
> negotiate_mq()
> 3. let nr_rings get re-calculated based on backend data
> 4. try allocating new memory based on new nr_rings
 Since I suspect number of rings will likely be the same why not reuse
 the rings in the common case?
>>> I thought attaching devices will happen more often than migration. Hence
>>> did not want to add an extra check for
>>> - whether I am inside the migration code path and
>>> - whether the new nr_rings is equal to the old nr_rings
>>>
>>> Sure addition of such a thing would avoid the memory allocation
>>> altogether in migration path,
>>> but it would add a little overhead for normal device addition.
>>>
>>> Do you think it's worth adding that change?
>> IMO a couple of extra checks are not going to make much difference.
> I will add this change
>> I wonder though --- have you actually seen the case where you did fail
>> allocation and changes provided in this patch made things work? I am
>> asking because right after negotiate_mq() we will call setup_blkring()
>> and it will want to allocate bunch of memory. A failure there is fatal
>> (to ring setup). So it seems to me that you will survive negotiate_mq()
>> but then will likely fail soon after.
> I have noticed the ENOMEM inside negotiate_mq() on a ct machine. When I
> included my patch, I manually triggered the ENOMEM using a debug flag.
> The patch works for ENOMEM inside negotiate_mq().
>
> As you mentioned, if we really hit the ENOMEM in negotiate_mq(), we
> might hit it in setup_blkring() as well.
> We should add the similar change to blkif_sring struct as well.
 Won't you have a similar issue with other frontends, say, netfront?
>>> I think the kmalloc failed not because of OOM.
>>>
>>> In fact, the size of "blkfront_ring_info" is large. When a domU has 4
>>> queues/rings, the size of 4 blkfront_ring_info structures can be about 300+ KB.
>>>
>>> There is a chance that a kmalloc() of 300+ KB would fail.
>>>
>>>
>>> About netfront, to kmalloc() 8 'struct netfront_queue' seems to consume
>>> <70 KB?
>> TBH these look like comparable sizes to me.  I am not convinced that
>> these changes will make a difference. If the number of rings on source
>> and destination were the same I'd absolutely agree with this patch but
>> since you are trying to handle different sizes the code becomes somewhat
>> more complex, and I am not sure it's worth it. (Can you actually give me
>> an example of when we can expect number of rings to change during
>> migration?)
>>
>> But others may think differently.
> Hi Boris,
> I think allocation of a 300 KB chunk [order-7 allocation] is more likely to
> fail than a 70 KB one [order-5], especially under memory pressure.
> If it comes to that, I think we should fix this too.
> 
> The number of rings in most cases remains 4 thanks to the
> xen_blkif_max_queues module parameter.
> If the src host has allocated fewer than 4 [maybe the vcpus given to this
> dom0 were fewer than 4], then we can expect the dst to allocate more than
> the src side and vice versa.

xen_blkif_max_queues is tunable, so the size to kmalloc() would be larger when
both xen_blkif_max_queues and the dom0 vcpu count are large.

Dongli Zhang
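
(For reference, with the thread's own numbers: 4 rings x ~75 KB per
blkfront_ring_info is ~300 KB; kmalloc() rounds that request up to the next
power-of-two size, 512 KB, i.e. an order-7 contiguous allocation with 4 KB
pages, whereas ~70 KB rounds up to 128 KB, order 5. The higher the order,
the more likely the allocation fails under fragmentation, which is why a
vmalloc()-based fallback sidesteps the problem.)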


Re: [Xen-devel] [PATCH] xen-blkfront: use old rinfo after enomem during migration

2018-12-03 Thread Manjunath Patil

On 12/3/2018 6:16 PM, Boris Ostrovsky wrote:


On 12/3/18 8:14 PM, Dongli Zhang wrote:

Hi Boris,

On 12/04/2018 12:07 AM, Boris Ostrovsky wrote:

On 12/2/18 3:31 PM, Manjunath Patil wrote:

On 11/30/2018 2:33 PM, Boris Ostrovsky wrote:


On 11/30/18 4:49 PM, Manjunath Patil wrote:

Thank you Boris for your comments. I removed faulty email of mine.

replies inline.
On 11/30/2018 12:42 PM, Boris Ostrovsky wrote:

On 11/29/18 12:17 AM, Manjunath Patil wrote:

Hi,
Feel free to suggest/comment on this.

I am trying to do the following at dst during the migration now.
1. Don't clear the old rinfo in blkif_free(). Instead just clean it.
2. Store the old rinfo and nr_rings into temp variables in
negotiate_mq()
3. let nr_rings get re-calculated based on backend data
4. try allocating new memory based on new nr_rings

Since I suspect number of rings will likely be the same why not reuse
the rings in the common case?

I thought attaching devices will happen more often than migration. Hence
did not want to add an extra check for
- whether I am inside the migration code path and
- whether the new nr_rings is equal to the old nr_rings

Sure addition of such a thing would avoid the memory allocation
altogether in migration path,
but it would add a little overhead for normal device addition.

Do you think it's worth adding that change?

IMO a couple of extra checks are not going to make much difference.

I will add this change

I wonder though --- have you actually seen the case where you did fail
allocation and changes provided in this patch made things work? I am
asking because right after negotiate_mq() we will call setup_blkring()
and it will want to allocate bunch of memory. A failure there is fatal
(to ring setup). So it seems to me that you will survive negotiate_mq()
but then will likely fail soon after.

I have noticed the ENOMEM inside negotiate_mq() on a ct machine. When I
included my patch, I manually triggered the ENOMEM using a debug flag.
The patch works for ENOMEM inside negotiate_mq().

As you mentioned, if we really hit the ENOMEM in negotiate_mq(), we
might hit it in setup_blkring() as well.
We should add the similar change to blkif_sring struct as well.

Won't you have a similar issue with other frontends, say, netfront?

I think the kmalloc failed not because of OOM.

In fact, the size of "blkfront_ring_info" is large. When a domU has 4
queues/rings, the size of 4 blkfront_ring_info structures can be about 300+ KB.

There is a chance that a kmalloc() of 300+ KB would fail.


About netfront, to kmalloc() 8 'struct netfront_queue' seems to consume <70 KB?

TBH these look like comparable sizes to me.  I am not convinced that
these changes will make a difference. If the number of rings on source
and destination were the same I'd absolutely agree with this patch but
since you are trying to handle different sizes the code becomes somewhat
more complex, and I am not sure it's worth it. (Can you actually give me
an example of when we can expect number of rings to change during
migration?)

But others may think differently.

Hi Boris,
I think allocation of a 300 KB chunk [order-7 allocation] is more likely to
fail than a 70 KB one [order-5], especially under memory pressure.

If it comes to that, I think we should fix this too.

The number of rings in most cases remains 4 thanks to the
xen_blkif_max_queues module parameter.
If the src host has allocated fewer than 4 [maybe the vcpus given to this
dom0 were fewer than 4], then we can expect the dst to allocate more than
the src side and vice versa.
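
To make the proposed flow concrete, a sketch of the dst-side
negotiate_mq() handling (hypothetical helper names, not the actual patch):

    static int negotiate_mq(struct blkfront_info *info)
    {
    	struct blkfront_ring_info *old_rinfo = info->rinfo;
    	unsigned int old_nr_rings = info->nr_rings;

    	/* step 3: nr_rings is re-calculated from backend data */
    	info->nr_rings = blkfront_backend_max_queues(info);

    	if (info->nr_rings == old_nr_rings) {
    		info->rinfo = old_rinfo;	/* common case: reuse */
    		return 0;
    	}

    	/* step 4: try allocating new memory based on new nr_rings */
    	info->rinfo = kvcalloc(info->nr_rings,
    			       sizeof(struct blkfront_ring_info),
    			       GFP_KERNEL);
    	if (!info->rinfo) {
    		/* ENOMEM during migration: survive on the old rings */
    		info->rinfo = old_rinfo;
    		info->nr_rings = old_nr_rings;
    		return 0;
    	}

    	kvfree(old_rinfo);
    	return 0;
    }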


-Thanks,
Manjunath


-boris


_______________________________________________
Xen-devel mailing list
xen-de...@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel




Re: [Xen-devel] [PATCH] xen-blkfront: use old rinfo after enomem during migration

2018-12-03 Thread Boris Ostrovsky
On 12/3/18 8:14 PM, Dongli Zhang wrote:
> Hi Boris,
>
> On 12/04/2018 12:07 AM, Boris Ostrovsky wrote:
>> On 12/2/18 3:31 PM, Manjunath Patil wrote:
>>> On 11/30/2018 2:33 PM, Boris Ostrovsky wrote:
>>>
 On 11/30/18 4:49 PM, Manjunath Patil wrote:
> Thank you Boris for your comments. I removed faulty email of mine.
>
> replies inline.
> On 11/30/2018 12:42 PM, Boris Ostrovsky wrote:
>> On 11/29/18 12:17 AM, Manjunath Patil wrote:
>>> Hi,
>>> Feel free to suggest/comment on this.
>>>
>>> I am trying to do the following at dst during the migration now.
>>> 1. Don't clear the old rinfo in blkif_free(). Instead just clean it.
>>> 2. Store the old rinfo and nr_rings into temp variables in
>>> negotiate_mq()
>>> 3. let nr_rings get re-calculated based on backend data
>>> 4. try allocating new memory based on new nr_rings
>> Since I suspect number of rings will likely be the same why not reuse
>> the rings in the common case?
> I thought attaching devices will happen more often than migration. Hence
> did not want to add an extra check for
>- whether I am inside the migration code path and
>- whether the new nr_rings is equal to the old nr_rings
>
> Sure addition of such a thing would avoid the memory allocation
> altogether in migration path,
> but it would add a little overhead for normal device addition.
>
> Do you think it's worth adding that change?
 IMO a couple of extra checks are not going to make much difference.
>>> I will add this change
 I wonder though --- have you actually seen the case where you did fail
 allocation and changes provided in this patch made things work? I am
 asking because right after negotiate_mq() we will call setup_blkring()
 and it will want to allocate bunch of memory. A failure there is fatal
 (to ring setup). So it seems to me that you will survive negotiate_mq()
 but then will likely fail soon after.
>>> I have noticed the ENOMEM inside negotiate_mq() on a ct machine. When I
>>> included my patch, I manually triggered the ENOMEM using a debug flag.
>>> The patch works for ENOMEM inside negotiate_mq().
>>>
>>> As you mentioned, if we really hit the ENOMEM in negotiate_mq(), we
>>> might hit it in setup_blkring() as well.
>>> We should add the similar change to blkif_sring struct as well.
>>
>> Won't you have a similar issue with other frontends, say, netfront?
> I think the kmalloc failed not because of OOM.
>
> In fact, the size of "blkfront_ring_info" is large. When a domU has 4
> queues/rings, the size of 4 blkfront_ring_info structures can be about 300+ KB.
>
> There is a chance that a kmalloc() of 300+ KB would fail.
>
>
> About netfront, to kmalloc() 8 'struct netfront_queue' seems to consume <70 KB?

TBH these look like comparable sizes to me.  I am not convinced that
these changes will make a difference. If the number of rings on source
and destination were the same I'd absolutely agree with this patch but
since you are trying to handle different sizes the code becomes somewhat
more complex, and I am not sure it's worth it. (Can you actually give me
an example of when we can expect number of rings to change during
migration?)

But others may think differently.


-boris



[PATCH v5 04/13] datagram: consolidate datagram copy to iter helpers

2018-12-03 Thread Sagi Grimberg
From: Sagi Grimberg 

skb_copy_datagram_iter and skb_copy_and_csum_datagram are essentially
the same but with a couple of differences: the first is the copy
operation used, which is either a simple copy or a csum_and_copy, and
the second is the behavior on the "short copy" path, where a simple copy
needs to return the number of bytes successfully copied while csum_and_copy
needs to fault immediately as the checksum is partial.

Introduce __skb_datagram_iter that additionally accepts:
1. copy operation function pointer
2. private data that goes with the copy operation
3. fault_short flag to indicate the action on short copy
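
With those hooks in place, the checksum variant reduces to a thin wrapper
around __skb_datagram_iter (a sketch consistent with how this series wires
things up; the function below is illustrative):

    static int skb_copy_and_csum_datagram(const struct sk_buff *skb, int offset,
    				      struct iov_iter *to, int len,
    				      __wsum *csump)
    {
    	return __skb_datagram_iter(skb, offset, to, len, true,
    				   csum_and_copy_to_iter, csump);
    }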

Suggested-by: David S. Miller 
Acked-by: David S. Miller 
Signed-off-by: Sagi Grimberg 
---
 net/core/datagram.c | 136 ++--
 1 file changed, 42 insertions(+), 94 deletions(-)

diff --git a/net/core/datagram.c b/net/core/datagram.c
index abe642181b64..382543302ae5 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -408,27 +408,20 @@ int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags)
 }
 EXPORT_SYMBOL(skb_kill_datagram);
 
-/**
- * skb_copy_datagram_iter - Copy a datagram to an iovec iterator.
- * @skb: buffer to copy
- * @offset: offset in the buffer to start copying from
- * @to: iovec iterator to copy to
- * @len: amount of data to copy from buffer to iovec
- */
-int skb_copy_datagram_iter(const struct sk_buff *skb, int offset,
-  struct iov_iter *to, int len)
+int __skb_datagram_iter(const struct sk_buff *skb, int offset,
+   struct iov_iter *to, int len, bool fault_short,
+   size_t (*cb)(const void *, size_t, void *, struct iov_iter *),
+   void *data)
 {
int start = skb_headlen(skb);
int i, copy = start - offset, start_off = offset, n;
struct sk_buff *frag_iter;
 
-   trace_skb_copy_datagram_iovec(skb, len);
-
/* Copy header. */
if (copy > 0) {
if (copy > len)
copy = len;
-   n = copy_to_iter(skb->data + offset, copy, to);
+   n = cb(skb->data + offset, copy, data, to);
offset += n;
if (n != copy)
goto short_copy;
@@ -450,8 +443,8 @@ int skb_copy_datagram_iter(const struct sk_buff *skb, int offset,
 
if (copy > len)
copy = len;
-   n = copy_to_iter(vaddr + frag->page_offset +
-offset - start, copy, to);
+   n = cb(vaddr + frag->page_offset +
+   offset - start, copy, data, to);
kunmap(page);
offset += n;
if (n != copy)
@@ -471,8 +464,8 @@ int skb_copy_datagram_iter(const struct sk_buff *skb, int offset,
if ((copy = end - offset) > 0) {
if (copy > len)
copy = len;
-   if (skb_copy_datagram_iter(frag_iter, offset - start,
-  to, copy))
+   if (__skb_datagram_iter(frag_iter, offset - start,
+   to, copy, short_copy, cb, data))
goto fault;
if ((len -= copy) == 0)
return 0;
@@ -493,11 +486,32 @@ int skb_copy_datagram_iter(const struct sk_buff *skb, int offset,
return -EFAULT;
 
 short_copy:
-   if (iov_iter_count(to))
+   if (fault_short || iov_iter_count(to))
goto fault;
 
return 0;
 }
+
+static size_t simple_copy_to_iter(const void *addr, size_t bytes,
+   void *data __always_unused, struct iov_iter *i)
+{
+   return copy_to_iter(addr, bytes, i);
+}
+
+/**
+ * skb_copy_datagram_iter - Copy a datagram to an iovec iterator.
+ * @skb: buffer to copy
+ * @offset: offset in the buffer to start copying from
+ * @to: iovec iterator to copy to
+ * @len: amount of data to copy from buffer to iovec
+ */
+int skb_copy_datagram_iter(const struct sk_buff *skb, int offset,
+  struct iov_iter *to, int len)
+{
+   trace_skb_copy_datagram_iovec(skb, len);
+   return __skb_datagram_iter(skb, offset, to, len, false,
+   simple_copy_to_iter, NULL);
+}
 EXPORT_SYMBOL(skb_copy_datagram_iter);
 
 /**
@@ -648,87 +662,21 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
 }
 EXPORT_SYMBOL(zerocopy_sg_from_iter);
 
+/**
+ * skb_copy_and_csum_datagram_iter - Copy datagram to an iovec iterator
+ *  and update a checksum.
+ * @skb: buffer to copy
+ * @offset: offset in the buffer to start copying from
+ * @to: iovec iterator to copy to
+ * @len: amount of data to copy from buffer 

[PATCH v5 06/13] datagram: introduce skb_copy_and_hash_datagram_iter helper

2018-12-03 Thread Sagi Grimberg
From: Sagi Grimberg 

Introduce a helper to copy a datagram into an iovec iterator while also
updating a predefined hash. This is useful for consumers of
skb_copy_datagram_iter that also need an inflight data digest, without
having to finish the copy and only then traverse the iovec and calculate
the digest hash.
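
For example, a consumer receiving protected data can copy into its iovec
and fold the bytes into the digest in one pass (sketch; req->iter and
queue->rcv_hash are illustrative names following the later nvme-tcp host
patch, with rcv_hash an initialized ahash_request):

    ret = skb_copy_and_hash_datagram_iter(skb, offset, &req->iter,
    				      recv_len, queue->rcv_hash);
    if (ret)
    	return ret;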

Acked-by: David S. Miller 
Signed-off-by: Sagi Grimberg 
---
 include/linux/skbuff.h |  3 +++
 net/core/datagram.c| 20 +++-
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0ba687454267..b0b8d5653f0d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3309,6 +3309,9 @@ static inline int skb_copy_datagram_msg(const struct sk_buff *from, int offset,
 }
 int skb_copy_and_csum_datagram_msg(struct sk_buff *skb, int hlen,
   struct msghdr *msg);
+int skb_copy_and_hash_datagram_iter(const struct sk_buff *skb, int offset,
+  struct iov_iter *to, int len,
+  struct ahash_request *hash);
 int skb_copy_datagram_from_iter(struct sk_buff *skb, int offset,
 struct iov_iter *from, int len);
 int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm);
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 382543302ae5..ef262282c8be 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -465,7 +465,7 @@ int __skb_datagram_iter(const struct sk_buff *skb, int offset,
if (copy > len)
copy = len;
if (__skb_datagram_iter(frag_iter, offset - start,
-   to, copy, short_copy, cb, data))
+   to, copy, fault_short, cb, data))
goto fault;
if ((len -= copy) == 0)
return 0;
@@ -492,6 +492,24 @@ int __skb_datagram_iter(const struct sk_buff *skb, int offset,
return 0;
 }
 
+/**
+ * skb_copy_and_hash_datagram_iter - Copy datagram to an iovec iterator
+ *  and update a hash.
+ * @skb: buffer to copy
+ * @offset: offset in the buffer to start copying from
+ * @to: iovec iterator to copy to
+ * @len: amount of data to copy from buffer to iovec
+ *  @hash: hash request to update
+ */
+int skb_copy_and_hash_datagram_iter(const struct sk_buff *skb, int offset,
+  struct iov_iter *to, int len,
+  struct ahash_request *hash)
+{
+   return __skb_datagram_iter(skb, offset, to, len, true,
+   hash_and_copy_to_iter, hash);
+}
+EXPORT_SYMBOL(skb_copy_and_hash_datagram_iter);
+
 static size_t simple_copy_to_iter(const void *addr, size_t bytes,
void *data __always_unused, struct iov_iter *i)
 {
-- 
2.17.1



[PATCH v5 05/13] iov_iter: introduce hash_and_copy_to_iter helper

2018-12-03 Thread Sagi Grimberg
From: Sagi Grimberg 

Allow consumers that use iov iterator helpers to also update a predefined
hash calculation online when copying data. This is useful when copying
incoming network buffers to a local iterator and calculating a digest on
the incoming stream. The nvme-tcp host driver that will be introduced in
following patches is the first consumer, via
skb_copy_and_hash_datagram_iter.
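
The hash argument is an ahash_request that the caller sets up once and
finalizes after the last chunk, e.g. (sketch, error handling omitted,
assuming a crc32c transform is available):

    struct crypto_ahash *tfm = crypto_alloc_ahash("crc32c", 0, 0);
    struct ahash_request *req = ahash_request_alloc(tfm, GFP_KERNEL);
    __le32 digest;

    ahash_request_set_callback(req, 0, NULL, NULL);
    crypto_ahash_init(req);

    /* per chunk: copy to the iterator and update the hash */
    hash_and_copy_to_iter(addr, bytes, req, &iter);

    /* after the last chunk: extract the digest */
    ahash_request_set_crypt(req, NULL, (u8 *)&digest, 0);
    crypto_ahash_final(req);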

Acked-by: David S. Miller 
Signed-off-by: Sagi Grimberg 
---
 include/linux/uio.h |  3 +++
 lib/iov_iter.c  | 16 
 2 files changed, 19 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 41d1f8d3313d..ecf584f6b82d 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -11,6 +11,7 @@
 
 #include 
 #include 
+#include 
 #include 
 
 struct page;
@@ -269,6 +270,8 @@ static inline void iov_iter_reexpand(struct iov_iter *i, size_t count)
 size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *csump, struct iov_iter *i);
 size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum, struct iov_iter *i);
 bool csum_and_copy_from_iter_full(void *addr, size_t bytes, __wsum *csum, struct iov_iter *i);
+size_t hash_and_copy_to_iter(const void *addr, size_t bytes, void *hashp,
+   struct iov_iter *i);
 
 int import_iovec(int type, const struct iovec __user * uvector,
 unsigned nr_segs, unsigned fast_segs,
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index db93531ca3e3..8a5f7b2ae346 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define PIPE_PARANOIA /* for now */
 
@@ -1475,6 +1476,21 @@ size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *csump,
 }
 EXPORT_SYMBOL(csum_and_copy_to_iter);
 
+size_t hash_and_copy_to_iter(const void *addr, size_t bytes, void *hashp,
+   struct iov_iter *i)
+{
+   struct ahash_request *hash = hashp;
+   struct scatterlist sg;
+   size_t copied;
+
+   copied = copy_to_iter(addr, bytes, i);
+   sg_init_one(&sg, addr, copied);
+   ahash_request_set_crypt(hash, &sg, NULL, copied);
+   crypto_ahash_update(hash);
+   return copied;
+}
+EXPORT_SYMBOL(hash_and_copy_to_iter);
+
 int iov_iter_npages(const struct iov_iter *i, int maxpages)
 {
size_t size = i->count;
-- 
2.17.1



[PATCH v5 00/13] TCP transport binding for NVMe over Fabrics

2018-12-03 Thread Sagi Grimberg
This patch set implements the NVMe over Fabrics TCP host and target
drivers. Now NVMe over Fabrics can run on every Ethernet port in the world.
The implementation conforms to the NVMe over Fabrics 1.1 specification
(which will include the already publicly available NVMe/TCP transport
binding, TP 8000).

The host driver hooks into the NVMe host stack and implements the TCP
transport binding for NVMe over Fabrics. The NVMe over Fabrics TCP host
driver is responsible for establishing a NVMe/TCP connection, TCP event
and error handling and data-plane messaging and stream processing.

The target driver hooks into the NVMe target core stack and implements
the TCP transport binding. The NVMe over Fabrics target driver is
responsible for accepting and establishing NVMe/TCP connections, TCP
event and error handling, and data-plane messaging and stream processing.

The implementations of both the host and target are fairly simple and
straightforward. Every NVMe queue is backed by a TCP socket that provides
us reliable, in-order delivery of fabrics capsules and/or data.

All NVMe queues are sharded over a private bound workqueue such that we
always have a single context handling the byte stream and we don't need
to worry about any locking/serialization. In addition, close attention
was paid to a completely non-blocking data plane to minimize context
switching and/or unforced scheduling.
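
In code terms the sharding boils down to pinning each queue's io_work to a
single CPU and always queueing it there (a sketch along the lines of the
host driver in patch 13; exact field names are illustrative):

    /* one bound worker per queue: a single context owns the byte stream */
    queue->io_cpu = qid % num_online_cpus();
    queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);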

Also, @netdev mailing list is cc'd as this patch set contains generic
helpers for online digest calculation (patches 1-3).

The patchset structure:
- patches 1-6 are prep to add a helper for digest calculation online
  with data placement
- patches 7-9 are preparatory patches for NVMe/TCP
- patches 10-13 implements NVMe/TCP

Thanks to the members of the Fabrics Linux Driver team that helped
development, testing and benchmarking this work.

Gitweb code is available at:

git://git.infradead.org/nvme.git nvme-tcp

Changes from v4:
- Added acks from Dave Miller for relevant patches
- Fixed possible memory leak in nvmet-tcp error flow

Changes from v3:
- various changes based on comments from christoph
  - removed unused variables
  - united send/recv iter initialization
  - removed unneeded void * casting
  - fixed long lines
  - removed unneeded wrappers (nvme_tcp_free_tagset and friends)
  - remove null sgl setting
  - fixed socket callbacks naming
  - reworked nvmet-tcp send_list processing
- omitted nvme-cli patches as no changes were made to them and no negative
  feedback was received since v3

Changes from v2:
- fixed stupid missing symbol export for skb_copy_and_hash_datagram_iter 
- dropped patch that moved err_work and connect_work to nvme_ctrl
- fixed maxr2t icreq validation
- got rid of host and target send/recv context structures by moving
  the members directly to their parent structure along with some struct
  documentation
- removed bh disable when locking the queue lock
- moved definition in nvme-tcp.h to appropriate patch
- added patch to rework nvme-cli trtype handling for discovery log entries
  a bit
- rebased on top of nvme-4.21 branch
- cleaned up some checkpatch warnings
- collected review tags

Changes from v1:
- unified skb_copy_datagram_iter and skb_copy_and_csum_datagram (and the
  new skb_hash_and_copy_datagram_iter) to a single code path
- removed nvmet modparam budgets (made them a define set to their default
  values)
- fixed nvme-tcp host chained r2t transfers reported off-list
- made .install_queue callout return nvme status code
- Added some review tags
- rebased on top of nvme-4.21 branch (nvme tree) + sqflow disable patches

Sagi Grimberg (13):
  ath6kl: add ath6kl_ prefix to crypto_type
  datagram: open-code copy_page_to_iter
  iov_iter: pass void csum pointer to csum_and_copy_to_iter
  datagram: consolidate datagram copy to iter helpers
  iov_iter: introduce hash_and_copy_to_iter helper
  datagram: introduce skb_copy_and_hash_datagram_iter helper
  nvmet: Add install_queue callout
  nvme-fabrics: allow user passing header digest
  nvme-fabrics: allow user passing data digest
  nvme-tcp: Add protocol header
  nvmet-tcp: add NVMe over TCP target driver
  nvmet: allow configfs tcp trtype configuration
  nvme-tcp: add NVMe over TCP host driver

 drivers/net/wireless/ath/ath6kl/cfg80211.c |2 +-
 drivers/net/wireless/ath/ath6kl/common.h   |2 +-
 drivers/net/wireless/ath/ath6kl/wmi.c  |6 +-
 drivers/net/wireless/ath/ath6kl/wmi.h  |6 +-
 drivers/nvme/host/Kconfig  |   15 +
 drivers/nvme/host/Makefile |3 +
 drivers/nvme/host/fabrics.c|   10 +
 drivers/nvme/host/fabrics.h|4 +
 drivers/nvme/host/tcp.c| 2242 
 drivers/nvme/target/Kconfig|   10 +
 drivers/nvme/target/Makefile   |2 +
 drivers/nvme/target/configfs.c |1 +
 drivers/nvme/target/fabrics-cmd.c  |   10 +
 drivers/nvme/target/nvmet.h|1 +
 dri

[PATCH v5 03/13] iov_iter: pass void csum pointer to csum_and_copy_to_iter

2018-12-03 Thread Sagi Grimberg
From: Sagi Grimberg 

The single caller to csum_and_copy_to_iter is skb_copy_and_csum_datagram,
and we are trying to unite its logic with skb_copy_datagram_iter by passing
the copy function that we want to apply as a callback. Thus, we need
to make the checksum pointer private to the function.

Acked-by: David S. Miller 
Signed-off-by: Sagi Grimberg 
---
 include/linux/uio.h | 2 +-
 lib/iov_iter.c  | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 55ce99ddb912..41d1f8d3313d 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -266,7 +266,7 @@ static inline void iov_iter_reexpand(struct iov_iter *i, size_t count)
 {
i->count = count;
 }
-size_t csum_and_copy_to_iter(const void *addr, size_t bytes, __wsum *csum, struct iov_iter *i);
+size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *csump, struct iov_iter *i);
 size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum, struct iov_iter *i);
 bool csum_and_copy_from_iter_full(void *addr, size_t bytes, __wsum *csum, struct iov_iter *i);
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 7ebccb5c1637..db93531ca3e3 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1432,10 +1432,11 @@ bool csum_and_copy_from_iter_full(void *addr, size_t bytes, __wsum *csum,
 }
 EXPORT_SYMBOL(csum_and_copy_from_iter_full);
 
-size_t csum_and_copy_to_iter(const void *addr, size_t bytes, __wsum *csum,
+size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *csump,
 struct iov_iter *i)
 {
const char *from = addr;
+   __wsum *csum = csump;
__wsum sum, next;
size_t off = 0;
sum = *csum;
-- 
2.17.1



[PATCH v5 01/13] ath6kl: add ath6kl_ prefix to crypto_type

2018-12-03 Thread Sagi Grimberg
From: Sagi Grimberg 

Prevent a namespace conflict: in following patches skbuff.h will
include the crypto API.

Acked-by: David S. Miller 
Cc: Kalle Valo 
Signed-off-by: Sagi Grimberg 
---
 drivers/net/wireless/ath/ath6kl/cfg80211.c | 2 +-
 drivers/net/wireless/ath/ath6kl/common.h   | 2 +-
 drivers/net/wireless/ath/ath6kl/wmi.c  | 6 +++---
 drivers/net/wireless/ath/ath6kl/wmi.h  | 6 +++---
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/net/wireless/ath/ath6kl/cfg80211.c b/drivers/net/wireless/ath/ath6kl/cfg80211.c
index e121187f371f..fa049c4ae315 100644
--- a/drivers/net/wireless/ath/ath6kl/cfg80211.c
+++ b/drivers/net/wireless/ath/ath6kl/cfg80211.c
@@ -1322,7 +1322,7 @@ static int ath6kl_cfg80211_set_default_key(struct wiphy *wiphy,
struct ath6kl_vif *vif = netdev_priv(ndev);
struct ath6kl_key *key = NULL;
u8 key_usage;
-   enum crypto_type key_type = NONE_CRYPT;
+   enum ath6kl_crypto_type key_type = NONE_CRYPT;
 
ath6kl_dbg(ATH6KL_DBG_WLAN_CFG, "%s: index %d\n", __func__, key_index);
 
diff --git a/drivers/net/wireless/ath/ath6kl/common.h b/drivers/net/wireless/ath/ath6kl/common.h
index 4f82e8632d37..d6e5234f67a1 100644
--- a/drivers/net/wireless/ath/ath6kl/common.h
+++ b/drivers/net/wireless/ath/ath6kl/common.h
@@ -67,7 +67,7 @@ struct ath6kl_llc_snap_hdr {
__be16 eth_type;
 } __packed;
 
-enum crypto_type {
+enum ath6kl_crypto_type {
NONE_CRYPT  = 0x01,
WEP_CRYPT   = 0x02,
TKIP_CRYPT  = 0x04,
diff --git a/drivers/net/wireless/ath/ath6kl/wmi.c b/drivers/net/wireless/ath/ath6kl/wmi.c
index 777acc564ac9..9d7ac1ab2d02 100644
--- a/drivers/net/wireless/ath/ath6kl/wmi.c
+++ b/drivers/net/wireless/ath/ath6kl/wmi.c
@@ -1849,9 +1849,9 @@ int ath6kl_wmi_connect_cmd(struct wmi *wmi, u8 if_idx,
   enum network_type nw_type,
   enum dot11_auth_mode dot11_auth_mode,
   enum auth_mode auth_mode,
-  enum crypto_type pairwise_crypto,
+  enum ath6kl_crypto_type pairwise_crypto,
   u8 pairwise_crypto_len,
-  enum crypto_type group_crypto,
+  enum ath6kl_crypto_type group_crypto,
   u8 group_crypto_len, int ssid_len, u8 *ssid,
   u8 *bssid, u16 channel, u32 ctrl_flags,
   u8 nw_subtype)
@@ -2301,7 +2301,7 @@ int ath6kl_wmi_disctimeout_cmd(struct wmi *wmi, u8 if_idx, u8 timeout)
 }
 
 int ath6kl_wmi_addkey_cmd(struct wmi *wmi, u8 if_idx, u8 key_index,
- enum crypto_type key_type,
+ enum ath6kl_crypto_type key_type,
  u8 key_usage, u8 key_len,
  u8 *key_rsc, unsigned int key_rsc_len,
  u8 *key_material,
diff --git a/drivers/net/wireless/ath/ath6kl/wmi.h b/drivers/net/wireless/ath/ath6kl/wmi.h
index a60bb49fe920..784940ba4c90 100644
--- a/drivers/net/wireless/ath/ath6kl/wmi.h
+++ b/drivers/net/wireless/ath/ath6kl/wmi.h
@@ -2556,9 +2556,9 @@ int ath6kl_wmi_connect_cmd(struct wmi *wmi, u8 if_idx,
   enum network_type nw_type,
   enum dot11_auth_mode dot11_auth_mode,
   enum auth_mode auth_mode,
-  enum crypto_type pairwise_crypto,
+  enum ath6kl_crypto_type pairwise_crypto,
   u8 pairwise_crypto_len,
-  enum crypto_type group_crypto,
+  enum ath6kl_crypto_type group_crypto,
   u8 group_crypto_len, int ssid_len, u8 *ssid,
   u8 *bssid, u16 channel, u32 ctrl_flags,
   u8 nw_subtype);
@@ -2610,7 +2610,7 @@ int ath6kl_wmi_config_debug_module_cmd(struct wmi *wmi, u32 valid, u32 config);
 
 int ath6kl_wmi_get_stats_cmd(struct wmi *wmi, u8 if_idx);
 int ath6kl_wmi_addkey_cmd(struct wmi *wmi, u8 if_idx, u8 key_index,
- enum crypto_type key_type,
+ enum ath6kl_crypto_type key_type,
  u8 key_usage, u8 key_len,
  u8 *key_rsc, unsigned int key_rsc_len,
  u8 *key_material,
-- 
2.17.1



[PATCH v5 07/13] nvmet: Add install_queue callout

2018-12-03 Thread Sagi Grimberg
From: Sagi Grimberg 

nvmet-tcp will implement it to allocate queue commands, whose number
is only known at nvmf connect time (sq size).

Reviewed-by: Christoph Hellwig 
Signed-off-by: Sagi Grimberg 
---
 drivers/nvme/target/fabrics-cmd.c | 10 ++
 drivers/nvme/target/nvmet.h   |  1 +
 2 files changed, 11 insertions(+)

diff --git a/drivers/nvme/target/fabrics-cmd.c b/drivers/nvme/target/fabrics-cmd.c
index 328ae46d8344..ee7d84621d65 100644
--- a/drivers/nvme/target/fabrics-cmd.c
+++ b/drivers/nvme/target/fabrics-cmd.c
@@ -121,6 +121,16 @@ static u16 nvmet_install_queue(struct nvmet_ctrl *ctrl, struct nvmet_req *req)
req->rsp->sq_head = cpu_to_le16(0x);
}
 
+   if (ctrl->ops->install_queue) {
+   u16 ret = ctrl->ops->install_queue(req->sq);
+
+   if (ret) {
+   pr_err("failed to install queue %d cntlid %d ret %x\n",
+   qid, ctrl->cntlid, ret);
+   return ret;
+   }
+   }
+
return 0;
 }
 
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 7d8b7a7d572a..89df51ee5bdf 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -279,6 +279,7 @@ struct nvmet_fabrics_ops {
void (*delete_ctrl)(struct nvmet_ctrl *ctrl);
void (*disc_traddr)(struct nvmet_req *req,
struct nvmet_port *port, char *traddr);
+   u16 (*install_queue)(struct nvmet_sq *nvme_sq);
 };
 
 #define NVMET_MAX_INLINE_BIOVEC8
-- 
2.17.1
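
For context, a sketch of how a transport provides this callout (modeled on
what the nvmet-tcp patch later in the series does; helper names here are
illustrative):

    static u16 nvmet_tcp_install_queue(struct nvmet_sq *sq)
    {
    	struct nvmet_tcp_queue *queue =
    		container_of(sq, struct nvmet_tcp_queue, nvme_sq);

    	/* sq->size is only known now, at nvmf connect time */
    	if (nvmet_tcp_alloc_cmds(queue))
    		return NVME_SC_INTERNAL;
    	return 0;
    }

    static const struct nvmet_fabrics_ops nvmet_tcp_ops = {
    	/* ...other callouts... */
    	.install_queue	= nvmet_tcp_install_queue,
    };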



[PATCH v5 08/13] nvme-fabrics: allow user passing header digest

2018-12-03 Thread Sagi Grimberg
From: Sagi Grimberg 

Header digest is an nvme-tcp-specific feature, but nothing prevents other
transports from reusing the concept, so do not associate it with the tcp
transport solely.
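
With this in place the flag can be passed as a bare token in the fabrics
options string, e.g. (illustrative address and NQN):

    echo "transport=tcp,traddr=192.168.0.10,trsvcid=4420,nqn=testnqn,hdr_digest" > /dev/nvme-fabrics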

Reviewed-by: Christoph Hellwig 
Signed-off-by: Sagi Grimberg 
---
 drivers/nvme/host/fabrics.c | 5 +
 drivers/nvme/host/fabrics.h | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 10074ac7731b..4272f8a95db3 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -614,6 +614,7 @@ static const match_table_t opt_tokens = {
{ NVMF_OPT_HOST_ID, "hostid=%s" },
{ NVMF_OPT_DUP_CONNECT, "duplicate_connect" },
{ NVMF_OPT_DISABLE_SQFLOW,  "disable_sqflow"},
+   { NVMF_OPT_HDR_DIGEST,  "hdr_digest"},
{ NVMF_OPT_ERR, NULL}
 };
 
@@ -633,6 +634,7 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
opts->reconnect_delay = NVMF_DEF_RECONNECT_DELAY;
opts->kato = NVME_DEFAULT_KATO;
opts->duplicate_connect = false;
+   opts->hdr_digest = false;
 
options = o = kstrdup(buf, GFP_KERNEL);
if (!options)
@@ -827,6 +829,9 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
case NVMF_OPT_DISABLE_SQFLOW:
opts->disable_sqflow = true;
break;
+   case NVMF_OPT_HDR_DIGEST:
+   opts->hdr_digest = true;
+   break;
default:
pr_warn("unknown parameter or missing value '%s' in 
ctrl creation request\n",
p);
diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index ecd9a006a091..a6127f1a9e8e 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -59,6 +59,7 @@ enum {
NVMF_OPT_HOST_ID= 1 << 12,
NVMF_OPT_DUP_CONNECT= 1 << 13,
NVMF_OPT_DISABLE_SQFLOW = 1 << 14,
+   NVMF_OPT_HDR_DIGEST = 1 << 15,
 };
 
 /**
@@ -103,6 +104,7 @@ struct nvmf_ctrl_options {
struct nvmf_host*host;
int max_reconnects;
booldisable_sqflow;
+   boolhdr_digest;
 };
 
 /*
-- 
2.17.1



[PATCH v5 10/13] nvme-tcp: Add protocol header

2018-12-03 Thread Sagi Grimberg
From: Sagi Grimberg 

Signed-off-by: Sagi Grimberg 
---
 include/linux/nvme-tcp.h | 189 +++
 include/linux/nvme.h |   1 +
 2 files changed, 190 insertions(+)
 create mode 100644 include/linux/nvme-tcp.h

diff --git a/include/linux/nvme-tcp.h b/include/linux/nvme-tcp.h
new file mode 100644
index ..03d87c0550a9
--- /dev/null
+++ b/include/linux/nvme-tcp.h
@@ -0,0 +1,189 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * NVMe over Fabrics TCP protocol header.
+ * Copyright (c) 2018 Lightbits Labs. All rights reserved.
+ */
+
+#ifndef _LINUX_NVME_TCP_H
+#define _LINUX_NVME_TCP_H
+
+#include 
+
+#define NVME_TCP_DISC_PORT 8009
#define NVME_TCP_ADMIN_CCSZSZ_8K
+#define NVME_TCP_ADMIN_CCSZ	SZ_8K
+#define NVME_TCP_DIGEST_LENGTH 4
+
+enum nvme_tcp_pfv {
+   NVME_TCP_PFV_1_0 = 0x0,
+};
+
+enum nvme_tcp_fatal_error_status {
+   NVME_TCP_FES_INVALID_PDU_HDR= 0x01,
+   NVME_TCP_FES_PDU_SEQ_ERR= 0x02,
+   NVME_TCP_FES_HDR_DIGEST_ERR = 0x03,
+   NVME_TCP_FES_DATA_OUT_OF_RANGE  = 0x04,
+   NVME_TCP_FES_R2T_LIMIT_EXCEEDED = 0x05,
+   NVME_TCP_FES_DATA_LIMIT_EXCEEDED= 0x05,
+   NVME_TCP_FES_UNSUPPORTED_PARAM  = 0x06,
+};
+
+enum nvme_tcp_digest_option {
+   NVME_TCP_HDR_DIGEST_ENABLE  = (1 << 0),
+   NVME_TCP_DATA_DIGEST_ENABLE = (1 << 1),
+};
+
+enum nvme_tcp_pdu_type {
+   nvme_tcp_icreq  = 0x0,
+   nvme_tcp_icresp = 0x1,
+   nvme_tcp_h2c_term   = 0x2,
+   nvme_tcp_c2h_term   = 0x3,
+   nvme_tcp_cmd= 0x4,
+   nvme_tcp_rsp= 0x5,
+   nvme_tcp_h2c_data   = 0x6,
+   nvme_tcp_c2h_data   = 0x7,
+   nvme_tcp_r2t= 0x9,
+};
+
+enum nvme_tcp_pdu_flags {
+   NVME_TCP_F_HDGST= (1 << 0),
+   NVME_TCP_F_DDGST= (1 << 1),
+   NVME_TCP_F_DATA_LAST= (1 << 2),
+   NVME_TCP_F_DATA_SUCCESS = (1 << 3),
+};
+
+/**
+ * struct nvme_tcp_hdr - nvme tcp pdu common header
+ *
+ * @type:  pdu type
+ * @flags: pdu specific flags
+ * @hlen:  pdu header length
+ * @pdo:   pdu data offset
+ * @plen:  pdu wire byte length
+ */
+struct nvme_tcp_hdr {
+   __u8type;
+   __u8flags;
+   __u8hlen;
+   __u8pdo;
+   __le32  plen;
+};
+
+/**
+ * struct nvme_tcp_icreq_pdu - nvme tcp initialize connection request pdu
+ *
+ * @hdr:   pdu generic header
+ * @pfv:   pdu version format
+ * @hpda:  host pdu data alignment (dwords, 0's based)
+ * @digest:digest types enabled
+ * @maxr2t:maximum r2ts per request supported
+ */
+struct nvme_tcp_icreq_pdu {
+   struct nvme_tcp_hdr hdr;
+   __le16  pfv;
+   __u8hpda;
+   __u8digest;
+   __le32  maxr2t;
+   __u8rsvd2[112];
+};
+
+/**
+ * struct nvme_tcp_icresp_pdu - nvme tcp initialize connection response pdu
+ *
+ * @hdr:   pdu common header
+ * @pfv:   pdu version format
+ * @cpda:  controller pdu data alignment (dwords, 0's based)
+ * @digest:digest types enabled
+ * @maxdata:   maximum data capsules per r2t supported
+ */
+struct nvme_tcp_icresp_pdu {
+   struct nvme_tcp_hdr hdr;
+   __le16  pfv;
+   __u8cpda;
+   __u8digest;
+   __le32  maxdata;
+   __u8rsvd[112];
+};
+
+/**
+ * struct nvme_tcp_term_pdu - nvme tcp terminate connection pdu
+ *
+ * @hdr:   pdu common header
+ * @fes:   fatal error status
+ * @fei:   fatal error information
+ */
+struct nvme_tcp_term_pdu {
+   struct nvme_tcp_hdr hdr;
+   __le16  fes;
+   __le32  fei;
+   __u8rsvd[8];
+};
+
+/**
+ * struct nvme_tcp_cmd_pdu - nvme tcp command capsule pdu
+ *
+ * @hdr:   pdu common header
+ * @cmd:   nvme command
+ */
+struct nvme_tcp_cmd_pdu {
+   struct nvme_tcp_hdr hdr;
+   struct nvme_command cmd;
+};
+
+/**
+ * struct nvme_tcp_rsp_pdu - nvme tcp response capsule pdu
+ *
+ * @hdr:   pdu common header
+ * @cqe:   nvme completion queue entry
+ */
+struct nvme_tcp_rsp_pdu {
+   struct nvme_tcp_hdr hdr;
+   struct nvme_completion  cqe;
+};
+
+/**
+ * struct nvme_tcp_r2t_pdu - nvme tcp ready-to-transfer pdu
+ *
+ * @hdr:   pdu common header
+ * @command_id:nvme command identifier which this relates to
+ * @ttag:  transfer tag (controller generated)
+ * @r2t_offset:offset from the start of the command data
+ * @r2t_length:length the host is allowed to send
+ */
+struct nvme_tcp_r2t_pdu {
+   struct nvme_tcp_hdr hdr;
+   __u16 

[PATCH v5 13/13] nvme-tcp: add NVMe over TCP host driver

2018-12-03 Thread Sagi Grimberg
From: Sagi Grimberg 

This patch implements the NVMe over TCP host driver. It can be used to
connect to remote NVMe over Fabrics subsystems over good old TCP/IP.

The driver implements TP 8000, which defines how nvme over fabrics capsules
and data are encapsulated in nvme-tcp pdus and exchanged on top of a TCP
byte stream. nvme-tcp header and data digest are supported as well.

To connect to all NVMe over Fabrics controllers reachable on a given target
port over TCP use the following command:

nvme connect-all -t tcp -a $IPADDR

This requires the latest version of nvme-cli with TCP support.

Signed-off-by: Sagi Grimberg 
Signed-off-by: Roy Shterman 
Signed-off-by: Solganik Alexander 
---
 drivers/nvme/host/Kconfig  |   15 +
 drivers/nvme/host/Makefile |3 +
 drivers/nvme/host/tcp.c| 2242 
 3 files changed, 2260 insertions(+)
 create mode 100644 drivers/nvme/host/tcp.c

diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index 88a8b5916624..0f345e207675 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -57,3 +57,18 @@ config NVME_FC
  from https://github.com/linux-nvme/nvme-cli.
 
  If unsure, say N.
+
+config NVME_TCP
+   tristate "NVM Express over Fabrics TCP host driver"
+   depends on INET
+   depends on BLK_DEV_NVME
+   select NVME_FABRICS
+   help
+ This provides support for the NVMe over Fabrics protocol using
+ the TCP transport.  This allows you to use remote block devices
+ exported using the NVMe protocol set.
+
+ To configure a NVMe over Fabrics controller use the nvme-cli tool
+ from https://github.com/linux-nvme/nvme-cli.
+
+ If unsure, say N.
diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile
index aea459c65ae1..8a4b671c5f0c 100644
--- a/drivers/nvme/host/Makefile
+++ b/drivers/nvme/host/Makefile
@@ -7,6 +7,7 @@ obj-$(CONFIG_BLK_DEV_NVME)  += nvme.o
 obj-$(CONFIG_NVME_FABRICS) += nvme-fabrics.o
 obj-$(CONFIG_NVME_RDMA)+= nvme-rdma.o
 obj-$(CONFIG_NVME_FC)  += nvme-fc.o
+obj-$(CONFIG_NVME_TCP) += nvme-tcp.o
 
 nvme-core-y:= core.o
 nvme-core-$(CONFIG_TRACING)+= trace.o
@@ -21,3 +22,5 @@ nvme-fabrics-y+= fabrics.o
 nvme-rdma-y+= rdma.o
 
 nvme-fc-y  += fc.o
+
+nvme-tcp-y += tcp.o
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
new file mode 100644
index ..15543358e245
--- /dev/null
+++ b/drivers/nvme/host/tcp.c
@@ -0,0 +1,2242 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * NVMe over Fabrics TCP host.
+ * Copyright (c) 2018 Lightbits Labs. All rights reserved.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "nvme.h"
+#include "fabrics.h"
+
+struct nvme_tcp_queue;
+
+enum nvme_tcp_send_state {
+   NVME_TCP_SEND_CMD_PDU = 0,
+   NVME_TCP_SEND_H2C_PDU,
+   NVME_TCP_SEND_DATA,
+   NVME_TCP_SEND_DDGST,
+};
+
+struct nvme_tcp_request {
+   struct nvme_request req;
+   void*pdu;
+   struct nvme_tcp_queue   *queue;
+   u32 data_len;
+   u32 pdu_len;
+   u32 pdu_sent;
+   u16 ttag;
+   struct list_headentry;
+   u32 ddgst;
+
+   struct bio  *curr_bio;
+   struct iov_iter iter;
+
+   /* send state */
+   size_t  offset;
+   size_t  data_sent;
+   enum nvme_tcp_send_state state;
+};
+
+enum nvme_tcp_queue_flags {
+   NVME_TCP_Q_ALLOCATED= 0,
+   NVME_TCP_Q_LIVE = 1,
+};
+
+enum nvme_tcp_recv_state {
+   NVME_TCP_RECV_PDU = 0,
+   NVME_TCP_RECV_DATA,
+   NVME_TCP_RECV_DDGST,
+};
+
+struct nvme_tcp_ctrl;
+struct nvme_tcp_queue {
+   struct socket   *sock;
+   struct work_struct  io_work;
+   int io_cpu;
+
+   spinlock_t  lock;
+   struct list_headsend_list;
+
+   /* recv state */
+   void*pdu;
+   int pdu_remaining;
+   int pdu_offset;
+   size_t  data_remaining;
+   size_t  ddgst_remaining;
+
+   /* send state */
+   struct nvme_tcp_request *request;
+
+   int queue_size;
+   size_t  cmnd_capsule_len;
+   struct nvme_tcp_ctrl*ctrl;
+   unsigned long   flags;
+   boolrd_enabled;
+
+   boolhdr_digest;
+   booldata_digest;
+   struct ahash_reques

[PATCH v5 09/13] nvme-fabrics: allow user passing data digest

2018-12-03 Thread Sagi Grimberg
From: Sagi Grimberg 

Data digest is an nvme-tcp-specific feature, but nothing prevents other
transports from reusing the concept, so do not associate it with the tcp
transport solely.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Sagi Grimberg 
---
 drivers/nvme/host/fabrics.c | 5 +
 drivers/nvme/host/fabrics.h | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 4272f8a95db3..9c62c6838b76 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -615,6 +615,7 @@ static const match_table_t opt_tokens = {
{ NVMF_OPT_DUP_CONNECT, "duplicate_connect" },
{ NVMF_OPT_DISABLE_SQFLOW,  "disable_sqflow"},
{ NVMF_OPT_HDR_DIGEST,  "hdr_digest"},
+   { NVMF_OPT_DATA_DIGEST, "data_digest"   },
{ NVMF_OPT_ERR, NULL}
 };
 
@@ -635,6 +636,7 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
opts->kato = NVME_DEFAULT_KATO;
opts->duplicate_connect = false;
opts->hdr_digest = false;
+   opts->data_digest = false;
 
options = o = kstrdup(buf, GFP_KERNEL);
if (!options)
@@ -832,6 +834,9 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
case NVMF_OPT_HDR_DIGEST:
opts->hdr_digest = true;
break;
+   case NVMF_OPT_DATA_DIGEST:
+   opts->data_digest = true;
+   break;
default:
pr_warn("unknown parameter or missing value '%s' in 
ctrl creation request\n",
p);
diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index a6127f1a9e8e..524a02a67817 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -60,6 +60,7 @@ enum {
NVMF_OPT_DUP_CONNECT= 1 << 13,
NVMF_OPT_DISABLE_SQFLOW = 1 << 14,
NVMF_OPT_HDR_DIGEST = 1 << 15,
+   NVMF_OPT_DATA_DIGEST= 1 << 16,
 };
 
 /**
@@ -105,6 +106,7 @@ struct nvmf_ctrl_options {
int max_reconnects;
booldisable_sqflow;
boolhdr_digest;
+   booldata_digest;
 };
 
 /*
-- 
2.17.1



[PATCH v5 12/13] nvmet: allow configfs tcp trtype configuration

2018-12-03 Thread Sagi Grimberg
From: Sagi Grimberg 

Reviewed-by: Max Gurtovoy 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Sagi Grimberg 
---
 drivers/nvme/target/configfs.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index db2cb64be7ba..618bbd006544 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -34,6 +34,7 @@ static const struct nvmet_transport_name {
 } nvmet_transport_names[] = {
{ NVMF_TRTYPE_RDMA, "rdma" },
{ NVMF_TRTYPE_FC,   "fc" },
+   { NVMF_TRTYPE_TCP,  "tcp" },
{ NVMF_TRTYPE_LOOP, "loop" },
 };
 
-- 
2.17.1
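
Once the tcp trtype is known to configfs, a TCP port can be set up through
the standard nvmet configfs layout, e.g. (illustrative addresses):

    cd /sys/kernel/config/nvmet/ports
    mkdir 1 && cd 1
    echo tcp > addr_trtype
    echo ipv4 > addr_adrfam
    echo 192.168.0.10 > addr_traddr
    echo 4420 > addr_trsvcid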



[PATCH v5 11/13] nvmet-tcp: add NVMe over TCP target driver

2018-12-03 Thread Sagi Grimberg
From: Sagi Grimberg 

This patch implements the TCP transport driver for the NVMe over Fabrics
target stack. This allows exporting NVMe over Fabrics functionality over
good old TCP/IP.

The driver implements TP 8000, which defines how nvme over fabrics capsules
and data are encapsulated in nvme-tcp pdus and exchanged on top of a TCP
byte stream. nvme-tcp header and data digest are supported as well.

Signed-off-by: Sagi Grimberg 
Signed-off-by: Roy Shterman 
Signed-off-by: Solganik Alexander 
---
 drivers/nvme/target/Kconfig  |   10 +
 drivers/nvme/target/Makefile |2 +
 drivers/nvme/target/tcp.c| 1737 ++
 3 files changed, 1749 insertions(+)
 create mode 100644 drivers/nvme/target/tcp.c

diff --git a/drivers/nvme/target/Kconfig b/drivers/nvme/target/Kconfig
index 3c7b61ddb0d1..d94f25cde019 100644
--- a/drivers/nvme/target/Kconfig
+++ b/drivers/nvme/target/Kconfig
@@ -60,3 +60,13 @@ config NVME_TARGET_FCLOOP
  to test NVMe-FC transport interfaces.
 
  If unsure, say N.
+
+config NVME_TARGET_TCP
+   tristate "NVMe over Fabrics TCP target support"
+   depends on INET
+   depends on NVME_TARGET
+   help
+ This enables the NVMe TCP target support, which allows exporting NVMe
+ devices over TCP.
+
+ If unsure, say N.
diff --git a/drivers/nvme/target/Makefile b/drivers/nvme/target/Makefile
index 8118c93391c6..8c3ad0fb6860 100644
--- a/drivers/nvme/target/Makefile
+++ b/drivers/nvme/target/Makefile
@@ -5,6 +5,7 @@ obj-$(CONFIG_NVME_TARGET_LOOP)  += nvme-loop.o
 obj-$(CONFIG_NVME_TARGET_RDMA) += nvmet-rdma.o
 obj-$(CONFIG_NVME_TARGET_FC)   += nvmet-fc.o
 obj-$(CONFIG_NVME_TARGET_FCLOOP)   += nvme-fcloop.o
+obj-$(CONFIG_NVME_TARGET_TCP)  += nvmet-tcp.o
 
 nvmet-y+= core.o configfs.o admin-cmd.o fabrics-cmd.o \
discovery.o io-cmd-file.o io-cmd-bdev.o
@@ -12,3 +13,4 @@ nvme-loop-y   += loop.o
 nvmet-rdma-y   += rdma.o
 nvmet-fc-y += fc.o
 nvme-fcloop-y  += fcloop.o
+nvmet-tcp-y+= tcp.o
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
new file mode 100644
index ..d31bec260160
--- /dev/null
+++ b/drivers/nvme/target/tcp.c
@@ -0,0 +1,1737 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * NVMe over Fabrics TCP target.
+ * Copyright (c) 2018 Lightbits Labs. All rights reserved.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/nvme-tcp.h>
+#include <net/sock.h>
+#include <net/tcp.h>
+#include <linux/inet.h>
+#include <linux/llist.h>
+#include <crypto/hash.h>
+
+#include "nvmet.h"
+
+#define NVMET_TCP_DEF_INLINE_DATA_SIZE (4 * PAGE_SIZE)
+
+#define NVMET_TCP_RECV_BUDGET  8
+#define NVMET_TCP_SEND_BUDGET  8
+#define NVMET_TCP_IO_WORK_BUDGET   64
+
+enum nvmet_tcp_send_state {
+   NVMET_TCP_SEND_DATA_PDU,
+   NVMET_TCP_SEND_DATA,
+   NVMET_TCP_SEND_R2T,
+   NVMET_TCP_SEND_DDGST,
+   NVMET_TCP_SEND_RESPONSE
+};
+
+enum nvmet_tcp_recv_state {
+   NVMET_TCP_RECV_PDU,
+   NVMET_TCP_RECV_DATA,
+   NVMET_TCP_RECV_DDGST,
+   NVMET_TCP_RECV_ERR,
+};
+
+enum {
+   NVMET_TCP_F_INIT_FAILED = (1 << 0),
+};
+
+struct nvmet_tcp_cmd {
+   struct nvmet_tcp_queue  *queue;
+   struct nvmet_reqreq;
+
+   struct nvme_tcp_cmd_pdu *cmd_pdu;
+   struct nvme_tcp_rsp_pdu *rsp_pdu;
+   struct nvme_tcp_data_pdu*data_pdu;
+   struct nvme_tcp_r2t_pdu *r2t_pdu;
+
+   u32 rbytes_done;
+   u32 wbytes_done;
+
+   u32 pdu_len;
+   u32 pdu_recv;
+   int sg_idx;
+   int nr_mapped;
+   struct msghdr   recv_msg;
+   struct kvec *iov;
+   u32 flags;
+
+   struct list_headentry;
+   struct llist_node   lentry;
+
+   /* send state */
+   u32 offset;
+   struct scatterlist  *cur_sg;
+   enum nvmet_tcp_send_state   state;
+
+   __le32  exp_ddgst;
+   __le32  recv_ddgst;
+};
+
+enum nvmet_tcp_queue_state {
+   NVMET_TCP_Q_CONNECTING,
+   NVMET_TCP_Q_LIVE,
+   NVMET_TCP_Q_DISCONNECTING,
+};
+
+struct nvmet_tcp_queue {
+   struct socket   *sock;
+   struct nvmet_tcp_port   *port;
+   struct work_struct  io_work;
+   int cpu;
+   struct nvmet_cq nvme_cq;
+   struct nvmet_sq nvme_sq;
+
+   /* send state */
+   struct nvmet_tcp_cmd*cmds;
+   unsigned intnr_cmds;
+   struct list_headfree_list;
+   struct llist_head   resp_list;
+   struct list_headresp_send_list;
+   int send_list_len;
+   

[PATCH v5 02/13] datagram: open-code copy_page_to_iter

2018-12-03 Thread Sagi Grimberg
From: Sagi Grimberg 

This will be useful to consolidate skb_copy_and_hash_datagram_iter and
skb_copy_and_csum_datagram into a single code path.
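
A sketch of where this is headed (hypothetical shape and names, not the
merged interface): one skb walker that hands each linear chunk to a
callback, so that plain copy, copy-and-csum and copy-and-hash become thin
wrappers around the same loop, with the plain copy reduced to
copy_to_iter():

#include <linux/uio.h>

/* Assumed callback shape for a common __skb_datagram_iter() walker. */
typedef size_t (*skb_chunk_cb_t)(const void *addr, size_t bytes,
				 void *data, struct iov_iter *i);

/* The plain-copy case is then just copy_to_iter() behind the callback. */
static size_t simple_copy_to_iter(const void *addr, size_t bytes,
				  void *data, struct iov_iter *i)
{
	return copy_to_iter(addr, bytes, i);
}

Open-coding copy_page_to_iter() below already puts the frag path through
kmap()/copy_to_iter()/kunmap(), i.e. the linear-chunk shape such a
callback needs.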

Acked-by: David S. Miller 
Signed-off-by: Sagi Grimberg 
---
 net/core/datagram.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/core/datagram.c b/net/core/datagram.c
index 57f3a6fcfc1e..abe642181b64 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -445,11 +445,14 @@ int skb_copy_datagram_iter(const struct sk_buff *skb, int offset,
 
end = start + skb_frag_size(frag);
if ((copy = end - offset) > 0) {
+   struct page *page = skb_frag_page(frag);
+   u8 *vaddr = kmap(page);
+
if (copy > len)
copy = len;
-   n = copy_page_to_iter(skb_frag_page(frag),
- frag->page_offset + offset -
- start, copy, to);
+   n = copy_to_iter(vaddr + frag->page_offset +
+offset - start, copy, to);
+   kunmap(page);
offset += n;
if (n != copy)
goto short_copy;
-- 
2.17.1



[PATCH v6 2/2] arm64: crypto: add NEON accelerated XOR implementation

2018-12-03 Thread Jackie Liu
This is a NEON acceleration method that can improve
performance by approximately 20%. I got the following
data from CentOS 7.5 running on Huawei's HISI1616 chip:

[ 93.837726] xor: measuring software checksum speed
[ 93.874039]   8regs  : 7123.200 MB/sec
[ 93.914038]   32regs : 7180.300 MB/sec
[ 93.954043]   arm64_neon: 9856.000 MB/sec
[ 93.954047] xor: using function: arm64_neon (9856.000 MB/sec)

I believe this code can bring some optimization for
all arm64 platforms. Thanks to Ard Biesheuvel for his suggestions.
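
For context, a sketch of how the new template ends up being exercised
(illustrative only; xor_blocks() is the existing crypto/xor.c entry point
used by RAID, and the buffer names here are made up). At boot,
XOR_TRY_TEMPLATES benchmarks each registered implementation via
xor_speed(), and the fastest one backs xor_blocks() from then on:

#include <linux/raid/xor.h>

static void xor_example(void *dst, void *s1, void *s2, unsigned int len)
{
	void *srcs[2] = { s1, s2 };

	/* dst ^= s1 ^ s2; len must be a multiple of the word size.
	 * Routed to arm64_neon once it wins the boot-time benchmark. */
	xor_blocks(2, len, dst, srcs);
}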

Signed-off-by: Jackie Liu 
Reviewed-by: Ard Biesheuvel 
---
 arch/arm64/include/asm/Kbuild |   1 -
 arch/arm64/include/asm/xor.h  |  73 +
 arch/arm64/lib/Makefile   |   6 ++
 arch/arm64/lib/xor-neon.c | 184 ++
 4 files changed, 263 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm64/include/asm/xor.h
 create mode 100644 arch/arm64/lib/xor-neon.c

diff --git a/arch/arm64/include/asm/Kbuild b/arch/arm64/include/asm/Kbuild
index 6cd5d77..1877f29 100644
--- a/arch/arm64/include/asm/Kbuild
+++ b/arch/arm64/include/asm/Kbuild
@@ -27,4 +27,3 @@ generic-y += trace_clock.h
 generic-y += unaligned.h
 generic-y += user.h
 generic-y += vga.h
-generic-y += xor.h
diff --git a/arch/arm64/include/asm/xor.h b/arch/arm64/include/asm/xor.h
new file mode 100644
index 000..856386a
--- /dev/null
+++ b/arch/arm64/include/asm/xor.h
@@ -0,0 +1,73 @@
+/*
+ * arch/arm64/include/asm/xor.h
+ *
+ * Authors: Jackie Liu 
+ * Copyright (C) 2018,Tianjin KYLIN Information Technology Co., Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/hardirq.h>
+#include <asm-generic/xor.h>
+#include <asm/hwcap.h>
+#include <asm/neon.h>
+
+#ifdef CONFIG_KERNEL_MODE_NEON
+
+extern struct xor_block_template const xor_block_inner_neon;
+
+static void
+xor_neon_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
+{
+   kernel_neon_begin();
+   xor_block_inner_neon.do_2(bytes, p1, p2);
+   kernel_neon_end();
+}
+
+static void
+xor_neon_3(unsigned long bytes, unsigned long *p1, unsigned long *p2,
+   unsigned long *p3)
+{
+   kernel_neon_begin();
+   xor_block_inner_neon.do_3(bytes, p1, p2, p3);
+   kernel_neon_end();
+}
+
+static void
+xor_neon_4(unsigned long bytes, unsigned long *p1, unsigned long *p2,
+   unsigned long *p3, unsigned long *p4)
+{
+   kernel_neon_begin();
+   xor_block_inner_neon.do_4(bytes, p1, p2, p3, p4);
+   kernel_neon_end();
+}
+
+static void
+xor_neon_5(unsigned long bytes, unsigned long *p1, unsigned long *p2,
+   unsigned long *p3, unsigned long *p4, unsigned long *p5)
+{
+   kernel_neon_begin();
+   xor_block_inner_neon.do_5(bytes, p1, p2, p3, p4, p5);
+   kernel_neon_end();
+}
+
+static struct xor_block_template xor_block_arm64 = {
+   .name   = "arm64_neon",
+   .do_2   = xor_neon_2,
+   .do_3   = xor_neon_3,
+   .do_4   = xor_neon_4,
+   .do_5   = xor_neon_5
+};
+#undef XOR_TRY_TEMPLATES
+#define XOR_TRY_TEMPLATES   \
+   do {\
+   xor_speed(&xor_block_8regs);\
+   xor_speed(&xor_block_32regs);\
+   if (cpu_has_neon()) { \
+   xor_speed(&xor_block_arm64);\
+   } \
+   } while (0)
+
+#endif /* CONFIG_KERNEL_MODE_NEON */
diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
index 69ff988..5540a16 100644
--- a/arch/arm64/lib/Makefile
+++ b/arch/arm64/lib/Makefile
@@ -5,6 +5,12 @@ lib-y  := clear_user.o delay.o copy_from_user.o \
   memcmp.o strcmp.o strncmp.o strlen.o strnlen.o   \
   strchr.o strrchr.o tishift.o
 
+ifeq ($(CONFIG_KERNEL_MODE_NEON), y)
+obj-$(CONFIG_XOR_BLOCKS)   += xor-neon.o
+CFLAGS_REMOVE_xor-neon.o   += -mgeneral-regs-only
+CFLAGS_xor-neon.o  += -ffreestanding
+endif
+
 # Tell the compiler to treat all general purpose registers (with the
 # exception of the IP registers, which are already handled by the caller
 # in case of a PLT) as callee-saved, which allows for efficient runtime
diff --git a/arch/arm64/lib/xor-neon.c b/arch/arm64/lib/xor-neon.c
new file mode 100644
index 000..131c60c2
--- /dev/null
+++ b/arch/arm64/lib/xor-neon.c
@@ -0,0 +1,184 @@
+/*
+ * arch/arm64/lib/xor-neon.c
+ *
+ * Authors: Jackie Liu 
+ * Copyright (C) 2018,Tianjin KYLIN Information Technology Co., Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/raid/xor.h>
+#include <linux/module.h>
+#include <asm/neon-intrinsics.h>
+
+void xor_arm64_neon_2(unsigned long bytes, unsigned long *p1,
+   unsigned long *p2)
+{
+   uint64_t *dp1 = (uint64_t *)p1;
+   uint64_t *dp2 = (uint64_t *)p2;
+
+   register uint64x2_t v0, v1, v2,

[PATCH v6 1/2] arm64/neon: add workaround for ambiguous C99 stdint.h types

2018-12-03 Thread Jackie Liu
In a way similar to ARM commit 09096f6a0ee2 ("ARM: 7822/1: add workaround
for ambiguous C99 stdint.h types"), this patch redefines the macros that
are used in stdint.h so its definitions of uint64_t and int64_t are
compatible with those of the kernel.

This patch comes from: https://patchwork.kernel.org/patch/3540001/
Written by: Ard Biesheuvel 

We mark this file as a private file and don't have to override asm/types.h
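
To make the ambiguity concrete, here is a hypothetical fragment (not from
the patch) of what breaks without the override: on arm64, GCC's stdint.h
types uint64_t as 'unsigned long', while the kernel's u64 is 'unsigned
long long'. They are the same size, but distinct types, so pointers do
not mix:

#include <linux/types.h>	/* kernel u64: unsigned long long */
#include <arm_neon.h>		/* pulls in GCC's stdint.h */

static u64 kernel_buf[2];

static void example(void)
{
	/* Without the __UINT64_TYPE__ override this warns about
	 * passing 'u64 *' where 'const uint64_t *' is expected. */
	uint64x2_t v = vld1q_u64(kernel_buf);

	(void)v;
}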

Signed-off-by: Jackie Liu 
Reviewed-by: Ard Biesheuvel 
---
 arch/arm64/include/asm/neon-intrinsics.h | 39 
 1 file changed, 39 insertions(+)
 create mode 100644 arch/arm64/include/asm/neon-intrinsics.h

diff --git a/arch/arm64/include/asm/neon-intrinsics.h b/arch/arm64/include/asm/neon-intrinsics.h
new file mode 100644
index 000..2ba6c6b
--- /dev/null
+++ b/arch/arm64/include/asm/neon-intrinsics.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright (C) 2018 Linaro, Ltd. 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef __ASM_NEON_INTRINSICS_H
+#define __ASM_NEON_INTRINSICS_H
+
+#include <asm-generic/int-ll64.h>
+
+/*
+ * In the kernel, u64/s64 are [un]signed long long, not [un]signed long.
+ * So by redefining these macros to the former, we can force gcc-stdint.h
+ * to define uint64_t / int64_t in a compatible manner.
+ */
+
+#ifdef __INT64_TYPE__
+#undef __INT64_TYPE__
+#define __INT64_TYPE__ long long
+#endif
+
+#ifdef __UINT64_TYPE__
+#undef __UINT64_TYPE__
+#define __UINT64_TYPE__unsigned long long
+#endif
+
+/*
+ * genksyms chokes on the ARM NEON intrinsics system header, but we
+ * don't export anything it defines anyway, so just disregard it when
+ * genksyms executes.
+ */
+#ifndef __GENKSYMS__
+#include <arm_neon.h>
+#endif
+
+#endif /* __ASM_NEON_INTRINSICS_H */
-- 
2.7.4





[PATCH v6 0/2] arm64: provide a NEON-accelerated XOR algorithm extension

2018-12-03 Thread Jackie Liu
v6:
  neon-intrinsics.h: change _NEON_INTRINSICS_H to __ASM_NEON_INTRINSICS_H.
  neon-intrinsics.h: add header LICENSE declaration
  neon-intrinsics.h: explain part of the code with comments

Jackie Liu (2):
  arm64/neon: add workaround for ambiguous C99 stdint.h types
  arm64: crypto: add NEON accelerated XOR implementation

 arch/arm64/include/asm/Kbuild|   1 -
 arch/arm64/include/asm/neon-intrinsics.h |  39 +++
 arch/arm64/include/asm/xor.h |  73 
 arch/arm64/lib/Makefile  |   6 +
 arch/arm64/lib/xor-neon.c| 184 +++
 5 files changed, 302 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm64/include/asm/neon-intrinsics.h
 create mode 100644 arch/arm64/include/asm/xor.h
 create mode 100644 arch/arm64/lib/xor-neon.c

-- 
2.7.4





Re: [PATCH 7/7] blk-mq: use plug for devices that implement ->commits_rqs()

2018-12-03 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 6/7] blk-mq: use bd->last == true for list inserts

2018-12-03 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 4/7] virtio_blk: implement mq_ops->commit_rqs() hook

2018-12-03 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 5/7] ataflop: implement mq_ops->commit_rqs() hook

2018-12-03 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 2/7] blk-mq: add mq_ops->commit_rqs()

2018-12-03 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 1/7] block: improve logic around when to sort a plug list

2018-12-03 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests

2018-12-03 Thread Sagi Grimberg




A driver may wish to iterate every tagged request, not just ones that
satisfy blk_mq_request_started(). The intended use is so a driver may
terminate entered requests on quiesced queues.


How about we just move the started check into the handler passed in for
those that care about it? Much saner to make the interface iterate
everything, and leave whatever state check to the callback.


So we used to do that, and I changed it back in May to test for
MQ_RQ_IN_FLIGHT, and then Ming changed it to check
blk_mq_request_started.  So this is clearly a minefield of sorts..

Note that at least mtip32xx, nbd, skd and the various nvme transports
want to use the function to terminate all requests in the error
path, and it would be great to have one single understood, documented
and debugged helper for that in the core, so this is a vote for moving
more of the logic in your second helper into the core code.  skd
will actually need to use ->complete to release resources for that, though,
and mtip plays some odd abort bits.  If it weren't for the interesting
abort behavior in nvme-fc that means we could even unexport the
low-level interface.


Yes, I'm very much in favour of this, too.
We always have this IMO slightly weird notion of stopping the queue,
setting some error flags in the driver, then _restarting_ the queue, just so
that the driver then sees the error flag and terminates the requests.

Which I always found quite counter-intuitive.


What about requests that come in after the iteration runs? How are those
terminated?
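
A minimal sketch of the direction discussed above (hypothetical code,
modeled on the existing nvme_cancel_request() callback; the callback
signature is assumed from the blk-mq tag iterators of this period, and
moving the started check into it is the new part):

static void nvme_cancel_request(struct request *req, void *data,
				bool reserved)
{
	/* The iterator would now visit every tagged request; callbacks
	 * that only care about started ones filter here. */
	if (!blk_mq_request_started(req))
		return;

	nvme_req(req)->status = NVME_SC_ABORT_REQ;
	blk_mq_complete_request(req);
}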


Re: [PATCH 13/13] block: enable polling by default if a poll map is initalized

2018-12-03 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 12/13] block: only allow polling if a poll queue_map exists

2018-12-03 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 11/13] block: remove ->poll_fn

2018-12-03 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [Xen-devel] [PATCH] xen-blkfront: use old rinfo after enomem during migration

2018-12-03 Thread Dongli Zhang
Hi Boris,

On 12/04/2018 12:07 AM, Boris Ostrovsky wrote:
> On 12/2/18 3:31 PM, Manjunath Patil wrote:
>> On 11/30/2018 2:33 PM, Boris Ostrovsky wrote:
>>
>>> On 11/30/18 4:49 PM, Manjunath Patil wrote:
 Thank you Boris for your comments. I removed faulty email of mine.

 replies inline.
 On 11/30/2018 12:42 PM, Boris Ostrovsky wrote:
> On 11/29/18 12:17 AM, Manjunath Patil wrote:
>> Hi,
>> Feel free to suggest/comment on this.
>>
>> I am trying to do the following at dst during the migration now.
>> 1. Don't clear the old rinfo in blkif_free(). Instead just clean it.
>> 2. Store the old rinfo and nr_rings into temp variables in
>> negotiate_mq()
>> 3. let nr_rings get re-calculated based on backend data
>> 4. try allocating new memory based on new nr_rings
> Since I suspect number of rings will likely be the same why not reuse
> the rings in the common case?
 I thought attaching devices would happen more often than migration. Hence
 did not want to add an extra check for
- if I am inside the migration code path and
- if the new nr_rings is equal to the old nr_rings or not

 Sure, the addition of such a thing would avoid the memory allocation
 altogether in the migration path,
 but it would add a little overhead for normal device addition.

 Do you think its worth adding that change?
>>>
>>> IMO a couple of extra checks are not going to make much difference.
>> I will add this change
>>>
>>> I wonder though --- have you actually seen the case where you did fail
>>> allocation and changes provided in this patch made things work? I am
>>> asking because right after negotiate_mq() we will call setup_blkring()
>>> and it will want to allocate bunch of memory. A failure there is fatal
>>> (to ring setup). So it seems to me that you will survive negotiate_mq()
>>> but then will likely fail soon after.
>> I have noticed the ENOMEM inside negotiate_mq() on ct machine. When I
>> included my patch, I manually triggered the ENOMEM using a debug flag.
>> The patch works for ENOMEM inside negotiate_mq().
>>
>> As you mentioned, if we really hit the ENOMEM in negotiate_mq(), we
>> might hit it in setup_blkring() as well.
>> We should add the similar change to blkif_sring struct as well.
> 
> 
> Won't you have a similar issue with other frontends, say, netfront?

I think the kmalloc failed not because of OOM.

In fact, the size of "blkfront_ring_info" is large. When a domU has 4
queues/rings, the size of 4 blkfront_ring_info can be about 300+ KB.

There is a chance that a kmalloc() of 300+ KB would fail: an allocation
that large needs physically contiguous pages, which fragmentation can
deny even when plenty of memory is free.

About netfront: a kmalloc() of 8 'struct netfront_queue' seems to consume <70 KB?

Dongli Zhang


Re: [PATCH 10/13] nvme-mpath: remove I/O polling support

2018-12-03 Thread Sagi Grimberg

If it really becomes an issue we
should rework the nvme code to also skip the multipath code for any
private namespace, even if that could mean some trouble when rescanning.



This requires some explanation? skip the multipath code how?

Other than that,
Reviewed-by: Sagi Grimberg 


Re: [PATCH 08/13] nvme-pci: remove the CQ lock for interrupt driven queues

2018-12-03 Thread Sagi Grimberg




Now that we can't poll regular, interrupt driven I/O queues there
is almost nothing that can race with an interrupt.  The only
possible other contexts polling a CQ are the error handler and
queue shutdown, and both are so far off in the slow path that
we can simply use the big hammer of disabling interrupts.

With that we can stop taking the cq_lock for normal queues.


Nice,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 07/13] nvme-pci: don't poll from irq context when deleting queues

2018-12-03 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 06/13] nvme-pci: refactor nvme_disable_io_queues

2018-12-03 Thread Sagi Grimberg




@@ -2428,7 +2426,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	nvme_stop_queues(&dev->ctrl);
 
 	if (!dead && dev->ctrl.queue_count > 0) {
-		nvme_disable_io_queues(dev);
+		if (nvme_disable_io_queues(dev, nvme_admin_delete_sq))
+			nvme_disable_io_queues(dev, nvme_admin_delete_cq);


Would be nice if the opcode change would be kept inside but still
split like:

static void nvme_disable_io_queues(struct nvme_dev *dev)
{
if (__nvme_disable_io_queues(dev, nvme_admin_delete_sq))
__nvme_disable_io_queues(dev, nvme_admin_delete_cq);
}


Re: [PATCH 05/13] nvme-pci: consolidate code for polling non-dedicated queues

2018-12-03 Thread Sagi Grimberg




+static int nvme_poll_irqdisable(struct nvme_queue *nvmeq, unsigned int tag)


Do we still need to carry the tag around?

Other than that,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 04/13] nvme-pci: only allow polling with separate poll queues

2018-12-03 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 03/13] nvme-pci: cleanup SQ allocation a bit

2018-12-03 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 02/13] nvme-pci: use atomic bitops to mark a queue enabled

2018-12-03 Thread Sagi Grimberg




@@ -2173,6 +2157,8 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 
 	if (nr_io_queues == 0)
 		return 0;
+
+	clear_bit(NVMEQ_ENABLED, &adminq->flags);
 


This is a change of behavior; it looks correct though, as we can fail
nvme_setup_irqs after we freed the admin vector. Needs documentation,
though...


Re: block: sbitmap related lockdep warning

2018-12-03 Thread Jens Axboe
On 12/3/18 5:31 PM, Bart Van Assche wrote:
> On Mon, 2018-12-03 at 15:24 -0700, Jens Axboe wrote:
>> On 12/3/18 3:02 AM, Ming Lei wrote:
>>> Hi,
>>>
>>> Just found there is an sbitmap-related lockdep warning. I have not taken
>>> a close look yet; maybe it is caused by the recent sbitmap changes.
>>>
>>> [1] test
>>> - modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1
>>> submit_queues=1 hw_queue_depth=1
>>> - then run fio on the 4 null_blk devices
>>
>> This is a false positive - lockdep thinks that ->swap_lock needs to be
>> IRQ safe since it's called with IRQs disabled from the
>> blk_mq_mark_tag_wait() path. But we never grab the lock from IRQ
>> context. I wonder how to teach lockdep about that...
> 
> There is probably a better approach, but one possible solution is to disable
> lockdep checking for swap_lock by using lockdep_set_novalidate_class().

That does seem like a sledge hammer, but I don't see anything that does
what we need directly. Surely this isn't a unique situation? Maybe
marking it novalidate is just the way to do it...

-- 
Jens Axboe
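
A minimal sketch of the novalidate approach (hypothetical, not a posted
patch; it assumes swap_lock is the per-word lock initialized by
sbitmap_init_node()):

#include <linux/sbitmap.h>

/* Opt every per-word swap_lock out of lockdep validation: the lock is
 * never taken from IRQ context even though some callers hold it with
 * IRQs disabled, which is what trips the false positive. */
static void sbitmap_lockdep_novalidate(struct sbitmap *sb)
{
	unsigned int i;

	for (i = 0; i < sb->map_nr; i++)
		lockdep_set_novalidate_class(&sb->map[i].swap_lock);
}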



Re: [PATCH 01/13] block: move queues types to the block layer

2018-12-03 Thread Sagi Grimberg




On 12/2/18 8:46 AM, Christoph Hellwig wrote:

Having another indirect call in the fast path doesn't really help
in our post-spectre world.  Also having too many queue types is just
going to create confusion, so I'd rather manage them centrally.

Note that the queue type naming and ordering changes a bit - the
first index now is the default queue for everything not explicitly
marked, the optional ones are read and poll queues.

Signed-off-by: Christoph Hellwig 
---
  block/blk-mq-sysfs.c|  9 +-
  block/blk-mq.h  | 21 +++--
  drivers/nvme/host/pci.c | 68 +++--
  include/linux/blk-mq.h  | 15 -
  4 files changed, 51 insertions(+), 62 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 6efef1f679f0..9c2df137256a 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -173,9 +173,16 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
 	return ret;
 }
 
+static const char *const hctx_types[] = {
+	[HCTX_TYPE_DEFAULT]	= "default",
+	[HCTX_TYPE_READ]	= "read",
+	[HCTX_TYPE_POLL]	= "poll",
+};
+
 static ssize_t blk_mq_hw_sysfs_type_show(struct blk_mq_hw_ctx *hctx, char *page)
 {
-	return sprintf(page, "%u\n", hctx->type);
+	BUILD_BUG_ON(ARRAY_SIZE(hctx_types) != HCTX_MAX_TYPES);
+	return sprintf(page, "%s\n", hctx_types[hctx->type]);
 }
 
 static struct attribute *default_ctx_attrs[] = {

diff --git a/block/blk-mq.h b/block/blk-mq.h
index 7291e5379358..a664ea44ffd4 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -81,16 +81,14 @@ extern int blk_mq_hw_queue_to_node(struct blk_mq_queue_map *qmap, unsigned int);
 /*
  * blk_mq_map_queue_type() - map (hctx_type,cpu) to hardware queue
  * @q: request queue
- * @hctx_type: the hctx type index
+ * @type: the hctx type index
  * @cpu: CPU
  */
 static inline struct blk_mq_hw_ctx *blk_mq_map_queue_type(struct request_queue *q,
-							  unsigned int hctx_type,
+							  enum hctx_type type,
 							  unsigned int cpu)
 {
-	struct blk_mq_tag_set *set = q->tag_set;
-
-	return q->queue_hw_ctx[set->map[hctx_type].mq_map[cpu]];
+	return q->queue_hw_ctx[q->tag_set->map[type].mq_map[cpu]];
 }
  
  /*

@@ -103,12 +101,17 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
 						     unsigned int flags,
 						     unsigned int cpu)
 {
-	int hctx_type = 0;
+	enum hctx_type type = HCTX_TYPE_DEFAULT;
+
+	if (q->tag_set->nr_maps > HCTX_TYPE_POLL &&
+	    ((flags & REQ_HIPRI) && test_bit(QUEUE_FLAG_POLL, &q->queue_flags)))
+		type = HCTX_TYPE_POLL;
 
-	if (q->mq_ops->rq_flags_to_type)
-		hctx_type = q->mq_ops->rq_flags_to_type(q, flags);
+	else if (q->tag_set->nr_maps > HCTX_TYPE_READ &&
+	    ((flags & REQ_OP_MASK) == REQ_OP_READ))
+		type = HCTX_TYPE_READ;


Nit, there seems to be an extra newline that can be omitted here before
the else if statement (if I'm reading this correctly)...

Otherwise looks good,

Reviewed-by: Sagi Grimberg 


Re: block: sbitmap related lockdep warning

2018-12-03 Thread Bart Van Assche
On Mon, 2018-12-03 at 15:24 -0700, Jens Axboe wrote:
> On 12/3/18 3:02 AM, Ming Lei wrote:
> > Hi,
> > 
> > Just found there is an sbitmap-related lockdep warning. I have not taken
> > a close look yet; maybe it is caused by the recent sbitmap changes.
> > 
> > [1] test
> > - modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1
> > submit_queues=1 hw_queue_depth=1
> > - then run fio on the 4 null_blk devices
> 
> This is a false positive - lockdep thinks that ->swap_lock needs to be
> IRQ safe since it's called with IRQs disabled from the
> blk_mq_mark_tag_wait() path. But we never grab the lock from IRQ
> context. I wonder how to teach lockdep about that...

There is probably a better approach, but one possible solution is to disable
lockdep checking for swap_lock by using lockdep_set_novalidate_class().

Bart.


Re: [PATCH] sbitmap: fix sbitmap_for_each_set()

2018-12-03 Thread Jens Axboe
On 12/3/18 3:45 PM, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> We need to ignore bits in the cleared mask when iterating over all set
> bits.

Thanks Omar, applied.

-- 
Jens Axboe



Re: [PATCH 05/13] blkcg: associate blkg when associating a device

2018-12-03 Thread Dennis Zhou
On Fri, Nov 30, 2018 at 01:54:26AM -0800, Christoph Hellwig wrote:
> > diff --git a/include/linux/bio.h b/include/linux/bio.h
> > index 62715a5a4f32..8bc9d9b29fd3 100644
> > --- a/include/linux/bio.h
> > +++ b/include/linux/bio.h
> > @@ -486,6 +486,12 @@ extern unsigned int bvec_nr_vecs(unsigned short idx);
> >  extern const char *bio_devname(struct bio *bio, char *buffer);
> >  
> >  #define bio_set_dev(bio, bdev) \
> > +do {   \
> > +   bio_set_dev_only(bio, bdev);\
> > +   bio_associate_blkg(bio);\
> > +} while (0)
> > +
> > +#define bio_set_dev_only(bio, bdev)\
> 
> This lacks any explanation on when you would use bio_set_dev_only or
> bio_set_dev.  Please document why we need both and why you'd choose or
> the other.

I realized after thinking about this more and checking more use cases
that it isn't as simple as swapping macro uses because many of the
callers share common bio allocation paths. I think the simplest way
forward is to have writeback and swap do reassociation and split out bio
init code in a future series. So in v5, there is only bio_set_dev().

Thanks,
Dennis


[PATCH] sbitmap: fix sbitmap_for_each_set()

2018-12-03 Thread Omar Sandoval
From: Omar Sandoval 

We need to ignore bits in the cleared mask when iterating over all set
bits.

Fixes: ea86ea2cdced ("sbitmap: ammortize cost of clearing bits")
Signed-off-by: Omar Sandoval 
---
 include/linux/sbitmap.h | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 92806a2dbab7..03f50fcedc79 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -265,12 +265,14 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
nr = SB_NR_TO_BIT(sb, start);
 
while (scanned < sb->depth) {
-   struct sbitmap_word *word = &sb->map[index];
-   unsigned int depth = min_t(unsigned int, word->depth - nr,
+   unsigned long word;
+   unsigned int depth = min_t(unsigned int,
+  sb->map[index].depth - nr,
   sb->depth - scanned);
 
scanned += depth;
-   if (!word->word)
+   word = sb->map[index].word & ~sb->map[index].cleared;
+   if (!word)
goto next;
 
/*
@@ -280,7 +282,7 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
 */
depth += nr;
while (1) {
-   nr = find_next_bit(&word->word, depth, nr);
+   nr = find_next_bit(&word, depth, nr);
if (nr >= depth)
break;
if (!fn(sb, (index << sb->shift) + nr, data))
-- 
2.19.2



Re: block: sbitmap related lockdep warning

2018-12-03 Thread Jens Axboe
On 12/3/18 3:02 AM, Ming Lei wrote:
> Hi,
> 
> Just found there is an sbitmap-related lockdep warning. I have not taken
> a close look yet; maybe it is caused by the recent sbitmap changes.
> 
> [1] test
> - modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1
> submit_queues=1 hw_queue_depth=1
> - then run fio on the 4 null_blk devices

This is a false positive - lockdep thinks that ->swap_lock needs to be
IRQ safe since it's called with IRQs disabled from the
blk_mq_mark_tag_wait() path. But we never grab the lock from IRQ
context. I wonder how to teach lockdep about that...

-- 
Jens Axboe



Re: sbitmap: check cleared bits when iterating busy bits

2018-12-03 Thread Jens Axboe
On 12/3/18 3:05 PM, Omar Sandoval wrote:
> On Mon, Dec 03, 2018 at 02:56:17PM -0700, Jens Axboe wrote:
>> When we are iterating the set bits in a word, we also need to factor in
>> the cleared bits. Don't call fn() unless the bit is also not set in
>> the cleared word.
>>
>> Fixes: ea86ea2cdced ("sbitmap: ammortize cost of clearing bits")
>> Signed-off-by: Jens Axboe 
>>
>> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
>> index 92806a2dbab7..9f374fbcdba6 100644
>> --- a/include/linux/sbitmap.h
>> +++ b/include/linux/sbitmap.h
>> @@ -283,6 +283,11 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
>>  nr = find_next_bit(&word->word, depth, nr);
>>  if (nr >= depth)
>>  break;
>> +/* if set in cleared, it's actually free */
>> +if (test_bit(nr, &word->cleared)) {
>> +nr++;
>> +continue;
>> +}
>>  if (!fn(sb, (index << sb->shift) + nr, data))
>>  return;
>>  
>> -- 
>> Jens Axboe
>>
> 
> How about something like this:
> 
> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
> index f0f49bbb2617..fe9122386255 100644
> --- a/include/linux/sbitmap.h
> +++ b/include/linux/sbitmap.h
> @@ -265,12 +265,14 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
>   nr = SB_NR_TO_BIT(sb, start);
>  
>   while (scanned < sb->depth) {
> - struct sbitmap_word *word = &sb->map[index];
> - unsigned int depth = min_t(unsigned int, word->depth - nr,
> + unsigned long word;
> + unsigned int depth = min_t(unsigned int,
> +sb->map[index].depth - nr,
>  sb->depth - scanned);
>  
>   scanned += depth;
> - if (!word->word)
> + word = sb->map[index].word & ~sb->map[index].cleared;
> + if (!word)
>   goto next;
>  
>   /*
> @@ -280,7 +282,7 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
>*/
>   depth += nr;
>   while (1) {
> - nr = find_next_bit(&word->word, depth, nr);
> + nr = find_next_bit(&word, depth, nr);
>   if (nr >= depth)
>   break;
>   if (!fn(sb, (index << sb->shift) + nr, data))
> 
> Might be marginally faster.

Yeah that looks fine as well, tests out good too.

-- 
Jens Axboe



Re: sbitmap: check cleared bits when iterating busy bits

2018-12-03 Thread Omar Sandoval
On Mon, Dec 03, 2018 at 02:56:17PM -0700, Jens Axboe wrote:
> When we are iterating the set bits in a word, we also need to factor in
> the cleared bits. Don't call fn() unless the bit is also not set in
> the cleared word.
> 
> Fixes: ea86ea2cdced ("sbitmap: ammortize cost of clearing bits")
> Signed-off-by: Jens Axboe 
> 
> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
> index 92806a2dbab7..9f374fbcdba6 100644
> --- a/include/linux/sbitmap.h
> +++ b/include/linux/sbitmap.h
> @@ -283,6 +283,11 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
>   nr = find_next_bit(&word->word, depth, nr);
>   if (nr >= depth)
>   break;
> + /* if set in cleared, it's actually free */
> + if (test_bit(nr, &word->cleared)) {
> + nr++;
> + continue;
> + }
>   if (!fn(sb, (index << sb->shift) + nr, data))
>   return;
>  
> -- 
> Jens Axboe
> 

How about something like this:

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index f0f49bbb2617..fe9122386255 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -265,12 +265,14 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
nr = SB_NR_TO_BIT(sb, start);
 
while (scanned < sb->depth) {
-   struct sbitmap_word *word = &sb->map[index];
-   unsigned int depth = min_t(unsigned int, word->depth - nr,
+   unsigned long word;
+   unsigned int depth = min_t(unsigned int,
+  sb->map[index].depth - nr,
   sb->depth - scanned);
 
scanned += depth;
-   if (!word->word)
+   word = sb->map[index].word & ~sb->map[index].cleared;
+   if (!word)
goto next;
 
/*
@@ -280,7 +282,7 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
 */
depth += nr;
while (1) {
-   nr = find_next_bit(&word->word, depth, nr);
+   nr = find_next_bit(&word, depth, nr);
if (nr >= depth)
break;
if (!fn(sb, (index << sb->shift) + nr, data))

Might be marginally faster.


sbitmap: check cleared bits when iterating busy bits

2018-12-03 Thread Jens Axboe
When we are iterating the set bits in a word, we also need to factor in
the cleared bits. Don't call fn() unless the bit is also not set in
the cleared word.

Fixes: ea86ea2cdced ("sbitmap: ammortize cost of clearing bits")
Signed-off-by: Jens Axboe 

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 92806a2dbab7..9f374fbcdba6 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -283,6 +283,11 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
nr = find_next_bit(&word->word, depth, nr);
if (nr >= depth)
break;
+   /* if set in cleared, it's actually free */
+   if (test_bit(nr, &word->cleared)) {
+   nr++;
+   continue;
+   }
if (!fn(sb, (index << sb->shift) + nr, data))
return;
 
-- 
Jens Axboe



Re: [PATCH 04/13] blkcg: introduce common blkg association logic

2018-12-03 Thread Dennis Zhou
Hi Christoph,

On Fri, Nov 30, 2018 at 01:52:09AM -0800, Christoph Hellwig wrote:
> >  EXPORT_SYMBOL_GPL(bio_associate_blkcg);
> >  
> >  /**
> > - * bio_associate_blkg - associate a bio with the a blkg
> > + * bio_has_queue - required check for blkg association
> > + * @bio: target bio
> > + *
> > + * A blkg represents the relationship between a blkcg and a request_queue.
> > + * If there is no request_queue, there is no blkg and therefore nothing to
> > + * associate with.
> > + */
> > +static inline bool bio_has_queue(struct bio *bio)
> > +{
> > +   return bio->bi_disk && bio->bi_disk->queue;
> > +}
> 
> How do you ever see a bio without a queue?  We can't even do I/O in
> that case.

The case I found was with the flush bio in dm, which is statically
allocated in dm_alloc(). The issue is that bio_set_dev() is called
on a bdev that isn't opened. So, the bdev wasn't pointing to a genhd.
I've fixed the issue with the patch below, which will be added in v5.

I think I was being overly cautious with the change and have taken this
out in v5. It seems that this should be a one-off case which should work
with the patch below.

Thanks,
Dennis

---
From 3ee13402af369ee8618549b63593d68ffca574ca Mon Sep 17 00:00:00 2001
From: Dennis Zhou 
Date: Mon, 3 Dec 2018 10:56:34 -0800
Subject: [PATCH 05/14] dm: set flush bio device on demand

The next patch changes the macro bio_set_dev() to associate a bio with a
blkg based on the device set. However, dm creates a static bio to be
used as the basis for cloning empty flush bios on creation. This
association is with a not-opened bdev so bd_disk is %NULL. To easily get
around this, we will set the device on the static bio every time and use
that to copy to the other bios.

Signed-off-by: Dennis Zhou 
---
 drivers/md/dm.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index a733e4c920af..b5e996c5c709 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1417,10 +1417,14 @@ static int __send_empty_flush(struct clone_info *ci)
unsigned target_nr = 0;
struct dm_target *ti;
 
+   bio_set_dev(ci->bio, ci->io->md->bdev);
+
BUG_ON(bio_has_data(ci->bio));
while ((ti = dm_table_get_target(ci->map, target_nr++)))
__send_duplicate_bios(ci, ti, ti->num_flush_bios, NULL);
 
+   bio_disassociate_blkg(ci->bio);
+
return 0;
 }
 
@@ -1939,7 +1943,6 @@ static struct mapped_device *alloc_dev(int minor)
goto bad;
 
bio_init(&md->flush_bio, NULL, 0);
-   bio_set_dev(&md->flush_bio, md->bdev);
md->flush_bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;
 
dm_stats_init(&md->stats);
-- 
2.17.1



Re: [PATCH v2] blk-mq: don't call ktime_get_ns() if we don't need it

2018-12-03 Thread Omar Sandoval
On Fri, Nov 30, 2018 at 02:13:54PM -0700, Jens Axboe wrote:
> We only need the request fields and the end_io time if we have
> stats enabled, or if we have a scheduler attached as those may
> use it for completion time stats.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Jens Axboe 
> 
> ---
> 
> v2: add helper, use it in both spots. also clear ->start_time_ns
> so merging doesn't read garbage.
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 7dcef565dc0f..e09d7f500077 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -281,6 +281,15 @@ bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
>  }
>  EXPORT_SYMBOL(blk_mq_can_queue);
>  
> +/*
> + * Only need start/end time stamping if we have stats enabled, or using
> + * an IO scheduler.
> + */
> +static inline bool blk_mq_need_time_stamp(struct request *rq)
> +{
> + return (rq->rq_flags & RQF_IO_STAT) || rq->q->elevator;
> +}
> +
>  static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
>   unsigned int tag, unsigned int op)
>  {
> @@ -316,7 +325,10 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
>   RB_CLEAR_NODE(&rq->rb_node);
>   rq->rq_disk = NULL;
>   rq->part = NULL;
> - rq->start_time_ns = ktime_get_ns();
> + if (blk_mq_need_time_stamp(rq))
> + rq->start_time_ns = ktime_get_ns();
> + else
> + rq->start_time_ns = 0;
>   rq->io_start_time_ns = 0;
>   rq->nr_phys_segments = 0;
>  #if defined(CONFIG_BLK_DEV_INTEGRITY)
> @@ -522,7 +534,10 @@ EXPORT_SYMBOL_GPL(blk_mq_free_request);
>  
>  inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
>  {
> - u64 now = ktime_get_ns();
> + u64 now = 0;
> +
> + if (blk_mq_need_time_stamp(rq))
> + now = ktime_get_ns();
>  
>   if (rq->rq_flags & RQF_STATS) {
>   blk_mq_poll_stats_start(rq->q);
> -- 
> Jens Axboe
> 


Re: [PATCH v5 1/2] arm64/neon: add workaround for ambiguous C99 stdint.h types

2018-12-03 Thread Ard Biesheuvel
On Mon, 3 Dec 2018 at 20:22, Will Deacon  wrote:
>
> On Wed, Nov 28, 2018 at 09:09:00AM +0800, Jackie Liu wrote:
> > In a way similar to ARM commit 09096f6a0ee2 ("ARM: 7822/1: add workaround
> > for ambiguous C99 stdint.h types"), this patch redefines the macros that
> > are used in stdint.h so its definitions of uint64_t and int64_t are
> > compatible with those of the kernel.
> >
> > This patch comes from: https://patchwork.kernel.org/patch/3540001/
> > Written by: Ard Biesheuvel 
> >
> > We mark this file as a private file and don't have to override asm/types.h
> >
> > Reviewed-by: Ard Biesheuvel 
> > Signed-off-by: Jackie Liu 
> > ---
> >  arch/arm64/include/asm/neon-intrinsics.h | 34 
> > 
> >  1 file changed, 34 insertions(+)
> >  create mode 100644 arch/arm64/include/asm/neon-intrinsics.h
> >
> > diff --git a/arch/arm64/include/asm/neon-intrinsics.h b/arch/arm64/include/asm/neon-intrinsics.h
> > new file mode 100644
> > index 000..e378766
> > --- /dev/null
> > +++ b/arch/arm64/include/asm/neon-intrinsics.h
> > @@ -0,0 +1,34 @@
> > +#ifndef _NEON_INTRINSICS_H
> > +#define _NEON_INTRINSICS_H
>
> We tend to name these with an __ASM_ prefix, so it should be:
>
> #ifndef __ASM_NEON_INTRINSICS_H
>
> That said, I notice that the commit you refer to for arch/arm/ actually
> places this stuff under uapi/. Is that needed?
>

No, it doesn't. It creates asm/types.h, which has been moved into uapi/
at a later date (which I guess means we're stuck with it). In
hindsight, it would have been better for ARM to create a NEON
intrinsics header file such as this one, since the override is only
needed when you include <arm_neon.h>.

> > +#include <asm-generic/int-ll64.h>
> > +
> > +/*
> > + * For Aarch64, there is some ambiguity in the definition of the types below
> > + * between the kernel and GCC itself. This is usually not a big deal, but it
> > + * causes trouble when including GCC's version of 'stdint.h' (this is the file
> > + * that gets included when you #include <stdint.h> on a -ffreestanding build).
> > + * As this file also gets included implicitly when including 'arm_neon.h' (the
> > + * NEON intrinsics support header), we need the following to work around the
> > + * issue if we want to use NEON intrinsics in the kernel.
> > + */
>
> Could you elaborate on what the ambiguities / conflicts in the types are
> please? I think you can also remove the sentence about directly including
> stdint on a freestanding build, since it doesn't seem relevant to the
> kernel afaict (we only pull it in via arm_neon.h).
>

In the kernel, u64/s64 are [un]signed long long, not [un]signed long.
So by redefining these macros to the former, we can force gcc-stdint.h
to define uint64_t / int64_t in a compatible manner.

> > +
> > +#ifdef __INT64_TYPE__
> > +#undef __INT64_TYPE__
> > +#define __INT64_TYPE__   __signed__ long long
>
> Do we need this __signed__ part?
>

No that seems redundant to me.


Re: [PATCH v5 1/2] arm64/neon: add workaround for ambiguous C99 stdint.h types

2018-12-03 Thread Will Deacon
On Wed, Nov 28, 2018 at 09:09:00AM +0800, Jackie Liu wrote:
> In a way similar to ARM commit 09096f6a0ee2 ("ARM: 7822/1: add workaround
> for ambiguous C99 stdint.h types"), this patch redefines the macros that
> are used in stdint.h so its definitions of uint64_t and int64_t are
> compatible with those of the kernel.
> 
> This patch comes from: https://patchwork.kernel.org/patch/3540001/
> Written by: Ard Biesheuvel 
> 
> We mark this file as a private file and don't have to override asm/types.h
> 
> Reviewed-by: Ard Biesheuvel 
> Signed-off-by: Jackie Liu 
> ---
>  arch/arm64/include/asm/neon-intrinsics.h | 34 
> 
>  1 file changed, 34 insertions(+)
>  create mode 100644 arch/arm64/include/asm/neon-intrinsics.h
> 
> diff --git a/arch/arm64/include/asm/neon-intrinsics.h b/arch/arm64/include/asm/neon-intrinsics.h
> new file mode 100644
> index 000..e378766
> --- /dev/null
> +++ b/arch/arm64/include/asm/neon-intrinsics.h
> @@ -0,0 +1,34 @@
> +#ifndef _NEON_INTRINSICS_H
> +#define _NEON_INTRINSICS_H

We tend to name these with an __ASM_ prefix, so it should be:

#ifndef __ASM_NEON_INTRINSICS_H

That said, I notice that the commit you refer to for arch/arm/ actually
places this stuff under uapi/. Is that needed?

> +#include <asm-generic/int-ll64.h>
> +
> +/*
> + * For Aarch64, there is some ambiguity in the definition of the types below
> + * between the kernel and GCC itself. This is usually not a big deal, but it
> + * causes trouble when including GCC's version of 'stdint.h' (this is the file
> + * that gets included when you #include <stdint.h> on a -ffreestanding build).
> + * As this file also gets included implicitly when including 'arm_neon.h' (the
> + * NEON intrinsics support header), we need the following to work around the
> + * issue if we want to use NEON intrinsics in the kernel.
> + */

Could you elaborate on what the ambiguities / conflicts in the types are
please? I think you can also remove the sentence about directly including
stdint on a freestanding build, since it doesn't seem relevant to the
kernel afaict (we only pull it in via arm_neon.h).

> +
> +#ifdef __INT64_TYPE__
> +#undef __INT64_TYPE__
> +#define __INT64_TYPE__   __signed__ long long

Do we need this __signed__ part?

Will


Re: [PATCH V2] mm: Replace all open encodings for NUMA_NO_NODE

2018-12-03 Thread Doug Ledford
On Mon, 2018-11-26 at 17:56 +0530, Anshuman Khandual wrote:
> At present there are multiple places where an invalid node number is encoded
> as -1. Even though implicitly understood, it is always better to have macros
> there. Replace these open encodings for an invalid node number with the
> global macro NUMA_NO_NODE. This helps remove NUMA related assumptions like
> 'invalid node' from various places redirecting them to a common definition.
> 
> Signed-off-by: Anshuman Khandual 
> ---
> Changes in V2:
> 
> - Added inclusion of 'numa.h' header at various places per Andrew
> - Updated 'dev_to_node' to use NUMA_NO_NODE instead per Vinod
> 
> Changes in V1: (https://lkml.org/lkml/2018/11/23/485)
> 
> - Dropped OCFS2 changes per Joseph
> - Dropped media/video drivers changes per Hans
> 
> RFC - https://patchwork.kernel.org/patch/10678035/
> 
> Build tested this with multiple cross compiler options like alpha, sparc,
> arm64, x86, powerpc, powerpc64le etc with their default config which might
> not have compiled tested all driver related changes. I will appreciate
> folks giving this a test in their respective build environment.
> 
> All these places for replacement were found by running the following grep
> patterns on the entire kernel code. Please let me know if this might have
> missed some instances. This might also have replaced some false positives.
> I will appreciate suggestions, inputs and review.
> 
> 1. git grep "nid == -1"
> 2. git grep "node == -1"
> 3. git grep "nid = -1"
> 4. git grep "node = -1"
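
The conversion itself is mechanical; an illustrative hunk (made up for
illustration, not a specific change from the patch) looks like:

-	int nid = -1;
+	int nid = NUMA_NO_NODE;	/* from <linux/numa.h> */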
> 
>  drivers/infiniband/hw/hfi1/affinity.c |  3 ++-
>  drivers/infiniband/hw/hfi1/init.c |  3 ++-

For the drivers/infiniband changes,

Acked-by: Doug Ledford 

-- 
Doug Ledford 
GPG KeyID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD


signature.asc
Description: This is a digitally signed message part


Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests

2018-12-03 Thread James Smart




On 12/1/2018 10:32 AM, Bart Van Assche wrote:

On 12/1/18 9:11 AM, Hannes Reinecke wrote:


Yes, I'm very much in favour of this, too.
We always have this IMO slightly weird notion of stopping the queue,
setting some error flags in the driver, then _restarting_ the queue, just
so that the driver then sees the error flag and terminates the requests.

Which I always found quite counter-intuitive.
So having a common helper for terminating requests on queue errors
would be very welcome here.


But when we have that we really should audit all drivers to ensure
they do the right thing (tm).


Would calling blk_abort_request() for all outstanding requests be 
sufficient to avoid that the queue has to be stopped and restarted in 
the nvme-fc driver?


What nvme-fc does is the same as what is done in all the other
transports - for the same reasons.  If we're eliminating those
synchronization reasons, and now that we've plugged the request_queue
path into the transports to check state appropriately, I don't think
there are reasons to block the queue.  In some respects, it is nice to
stop new io while the work to terminate everything else happens, but I
don't know that it's required.  I would hope that the bounced work due
to the controller state (returned BLK_STS_RESOURCE) is actually pausing
for a short while. I've seen some circumstances where it didn't and was
infinitely polling, which would be a change in behavior vs the queue stops.


-- james



Re: [PATCH 04/13] nvme-pci: only allow polling with separate poll queues

2018-12-03 Thread Keith Busch
On Sun, Dec 02, 2018 at 08:46:19AM -0800, Christoph Hellwig wrote:
> This will allow us to simplify both the regular NVMe interrupt handler
> and the upcoming aio poll code.  In addition to that the separate
> queues are generally a good idea for performance reasons.
> 
> Signed-off-by: Christoph Hellwig 

Looks good.

Reviewed-by: Keith Busch 


Re: [PATCH 10/13] nvme-mpath: remove I/O polling support

2018-12-03 Thread Keith Busch
On Sun, Dec 02, 2018 at 08:46:25AM -0800, Christoph Hellwig wrote:
> The ->poll_fn has been stale for a while, as a lot of places check for mq
> ops.  But there is no real point in it anyway, as we don't even use
> the multipath code for subsystems without multiple ports, which is usually
> what we do high performance I/O to.  If it really becomes an issue we
> should rework the nvme code to also skip the multipath code for any
> private namespace, even if that could mean some trouble when rescanning.
> 
> Signed-off-by: Christoph Hellwig 

This was a bit flawed anyway since the head's current path could change,
and you end up polling the wrong request_queue. Not really harmful other
than some wasted CPU cycles, but might be worth thinking about if we
want to bring mpath polling back.

Reviewed-by: Keith Busch 


Re: [PATCH 07/13] nvme-pci: don't poll from irq context when deleting queues

2018-12-03 Thread Keith Busch
On Sun, Dec 02, 2018 at 08:46:22AM -0800, Christoph Hellwig wrote:
> This is the last place outside of nvme_irq that handles CQEs from
> interrupt context, and thus is in the way of removing the cq_lock for
> normal queues, and avoiding lockdep warnings on the poll queues, for
> which we already take it without IRQ disabling.
> 
> Signed-off-by: Christoph Hellwig 

Looks good.

Reviewed-by: Keith Busch 


Re: [Xen-devel] [PATCH] xen-blkfront: use old rinfo after enomem during migration

2018-12-03 Thread Boris Ostrovsky
On 12/2/18 3:31 PM, Manjunath Patil wrote:
> On 11/30/2018 2:33 PM, Boris Ostrovsky wrote:
>
>> On 11/30/18 4:49 PM, Manjunath Patil wrote:
>>> Thank you Boris for your comments. I removed faulty email of mine.
>>>
>>> replies inline.
>>> On 11/30/2018 12:42 PM, Boris Ostrovsky wrote:
 On 11/29/18 12:17 AM, Manjunath Patil wrote:
> Hi,
> Feel free to suggest/comment on this.
>
> I am trying to do the following at dst during the migration now.
> 1. Don't clear the old rinfo in blkif_free(). Instead just clean it.
> 2. Store the old rinfo and nr_rings into temp variables in
> negotiate_mq()
> 3. let nr_rings get re-calculated based on backend data
> 4. try allocating new memory based on new nr_rings
 Since I suspect number of rings will likely be the same why not reuse
 the rings in the common case?
>>> I thought attaching devices would happen more often than migration. Hence
>>> did not want to add an extra check for
>>>    - if I am inside the migration code path and
>>>    - if the new nr_rings is equal to the old nr_rings or not
>>>
>>> Sure, the addition of such a thing would avoid the memory allocation
>>> altogether in the migration path,
>>> but it would add a little overhead for normal device addition.
>>>
>>> Do you think its worth adding that change?
>>
>> IMO a couple of extra checks are not going to make much difference.
> I will add this change
>>
>> I wonder though --- have you actually seen the case where you did fail
>> allocation and changes provided in this patch made things work? I am
>> asking because right after negotiate_mq() we will call setup_blkring()
>> and it will want to allocate bunch of memory. A failure there is fatal
>> (to ring setup). So it seems to me that you will survive negotiate_mq()
>> but then will likely fail soon after.
> I have noticed the ENOMEM inside negotiate_mq() on ct machine. When I
> included my patch, I manually triggered the ENOMEM using a debug flag.
> The patch works for ENOMEM inside negotiate_mq().
>
> As you mentioned, if we really hit the ENOMEM in negotiate_mq(), we
> might hit it in setup_blkring() as well.
> We should add the similar change to blkif_sring struct as well.


Won't you have a similar issue with other frontends, say, netfront?


-boris


block: sbitmap related lockdep warning

2018-12-03 Thread Ming Lei
Hi,

Just found there is an sbitmap-related lockdep warning. I have not taken
a close look yet; maybe it is caused by the recent sbitmap changes.

[1] test
- modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1
submit_queues=1 hw_queue_depth=1
- then run fio on the 4 null_blk devices

[2] lockdep warning
[  100.967642] start test sanity/001
[  101.238280] null: module loaded
[  106.093735]
[  106.094012] =
[  106.094854] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
[  106.095759] 4.20.0-rc3_5d2ee7122c73_for-next+ #1 Not tainted
[  106.096551] -
[  106.097386] fio/1043 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[  106.098231] 4c43fa71
(&(&sb->map[i].swap_lock)->rlock){+.+.}, at: sbitmap_get+0xd5/0x22c
[  106.099431]
[  106.099431] and this task is already holding:
[  106.100229] 7eec8b2f
(&(&hctx->dispatch_wait_lock)->rlock){}, at:
blk_mq_dispatch_rq_list+0x4c1/0xd7c
[  106.101630] which would create a new lock dependency:
[  106.102326]  (&(&hctx->dispatch_wait_lock)->rlock){} ->
(&(&sb->map[i].swap_lock)->rlock){+.+.}
[  106.103553]
[  106.103553] but this new dependency connects a SOFTIRQ-irq-safe lock:
[  106.104580]  (&sbq->ws[i].wait){..-.}
[  106.104582]
[  106.104582] ... which became SOFTIRQ-irq-safe at:
[  106.105751]   _raw_spin_lock_irqsave+0x4b/0x82
[  106.106284]   __wake_up_common_lock+0x119/0x1b9
[  106.106825]   sbitmap_queue_wake_up+0x33f/0x383
[  106.107456]   sbitmap_queue_clear+0x4c/0x9a
[  106.108046]   __blk_mq_free_request+0x188/0x1d3
[  106.108581]   blk_mq_free_request+0x23b/0x26b
[  106.109102]   scsi_end_request+0x345/0x5d7
[  106.109587]   scsi_io_completion+0x4b5/0x8f0
[  106.110099]   scsi_finish_command+0x412/0x456
[  106.110615]   scsi_softirq_done+0x23f/0x29b
[  106.15]   blk_done_softirq+0x2a7/0x2e6
[  106.111608]   __do_softirq+0x360/0x6ad
[  106.112062]   run_ksoftirqd+0x2f/0x5b
[  106.112499]   smpboot_thread_fn+0x3a5/0x3db
[  106.113000]   kthread+0x1d4/0x1e4
[  106.113457]   ret_from_fork+0x3a/0x50
[  106.113969]
[  106.113969] to a SOFTIRQ-irq-unsafe lock:
[  106.114672]  (&(&sb->map[i].swap_lock)->rlock){+.+.}
[  106.114674]
[  106.114674] ... which became SOFTIRQ-irq-unsafe at:
[  106.116000] ...
[  106.116003]   _raw_spin_lock+0x33/0x64
[  106.116676]   sbitmap_get+0xd5/0x22c
[  106.117134]   __sbitmap_queue_get+0xe8/0x177
[  106.117731]   __blk_mq_get_tag+0x1e6/0x22d
[  106.118286]   blk_mq_get_tag+0x1db/0x6e4
[  106.118756]   blk_mq_get_driver_tag+0x161/0x258
[  106.119383]   blk_mq_dispatch_rq_list+0x28e/0xd7c
[  106.120043]   blk_mq_do_dispatch_sched+0x23a/0x287
[  106.120607]   blk_mq_sched_dispatch_requests+0x379/0x3fc
[  106.121234]   __blk_mq_run_hw_queue+0x137/0x17e
[  106.121781]   __blk_mq_delay_run_hw_queue+0x80/0x25f
[  106.122366]   blk_mq_run_hw_queue+0x151/0x187
[  106.122887]   blk_mq_sched_insert_requests+0x13f/0x175
[  106.123492]   blk_mq_flush_plug_list+0x7d6/0x81b
[  106.124042]   blk_flush_plug_list+0x392/0x3d7
[  106.124557]   blk_finish_plug+0x37/0x4f
[  106.125019]   read_pages+0x3ef/0x430
[  106.125446]   __do_page_cache_readahead+0x18e/0x2fc
[  106.126027]   force_page_cache_readahead+0x121/0x133
[  106.126621]   page_cache_sync_readahead+0x35f/0x3bb
[  106.127229]   generic_file_buffered_read+0x410/0x1860
[  106.127932]   __vfs_read+0x319/0x38f
[  106.128415]   vfs_read+0xd2/0x19a
[  106.128817]   ksys_read+0xb9/0x135
[  106.129225]   do_syscall_64+0x140/0x385
[  106.129684]   entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  106.130292]
[  106.130292] other info that might help us debug this:
[  106.130292]
[  106.131226] Chain exists of:
[  106.131226]   &sbq->ws[i].wait -->
&(&hctx->dispatch_wait_lock)->rlock -->
&(&sb->map[i].swap_lock)->rlock
[  106.131226]
[  106.132865]  Possible interrupt unsafe locking scenario:
[  106.132865]
[  106.133659]        CPU0                    CPU1
[  106.134194]        ----                    ----
[  106.134733]   lock(&(&sb->map[i].swap_lock)->rlock);
[  106.135318]                                local_irq_disable();
[  106.136014]                                lock(&sbq->ws[i].wait);
[  106.136747]                                lock(&(&hctx->dispatch_wait_lock)->rlock);
[  106.137742]   <Interrupt>
[  106.138110]     lock(&sbq->ws[i].wait);
[  106.138625]
[  106.138625]  *** DEADLOCK ***
[  106.138625]
[  106.139430] 3 locks held by fio/1043:
[  106.139947]  #0: 76ff0fd9 (rcu_read_lock){}, at:
hctx_lock+0x29/0xe8
[  106.140813]  #1: 2feb1016 (&sbq->ws[i].wait){..-.}, at:
blk_mq_dispatch_rq_list+0x4ad/0xd7c
[  106.141877]  #2: 7eec8b2f
(&(&hctx->dispatch_wait_lock)->rlock){}, at:
blk_mq_dispatch_rq_list+0x4c1/0xd7c
[  106.143267]
[  106.143267] the dependencies between SOFTIRQ-irq-safe lock and the
holding lock:
[  106.144351]  -> (&sbq->ws[i].wait){..-.} ops: 82 {
[  106.144926] IN-SOFTIRQ-W at:
[  106.145314]   _raw

Re: [PATCH v5 0/5] lightnvm: Flexible metadata

2018-12-03 Thread Hans Holmberg
Great! The tests (rocksdb, pblk recovery and the generic xfs suite)
completed successfully on one of our disks, so feel free to add:

Tested-by: Hans Holmberg 

Thanks,
Hans
On Fri, Nov 30, 2018 at 2:03 PM Hans Holmberg  wrote:
>
> I just started a regression test on this patch set that'll run over
> the weekend. I'll add a tested-by if everything checks out.
>
> All the best,
> Hans
> On Fri, Nov 30, 2018 at 12:49 PM Igor Konopko  wrote:
> >
> > This series of patches extends the way pblk can
> > store L2P sector metadata. After this set of changes,
> > any size of NVMe metadata is supported in pblk.
> > There is also support for the case without NVMe metadata.
> >
> > Changes v4 --> v5:
> > -rebase on top of ocssd/for-4.21/core
> >
> > Changes v3 --> v4:
> > -rename nvm_alloc_dma_pool() to nvm_create_dma_pool()
> > -split pblk_get_meta() calls and lba setting into
> > two operations for better core readability
> > -fixing compilation with CONFIG_NVM disabled
> > -getting rid of unnecessary memcpy for packed metadata
> > on write path
> > -support for drives with oob size >0 and <16B in packed
> > metadata mode
> > -minor commit message updates
> >
> > Changes v2 --> v3:
> > -Rebase on top of ocssd/for-4.21/core
> > -get/set_meta_lba helpers were removed
> > -dma reallocation was replaced with single allocation
> > -oob metadata size was added to pblk structure
> > -proper checks on pblk creation were added
> >
> > Changes v1 --> v2:
> > -Revert sector meta size back to 16b for pblk
> > -Dma pool for larger oob meta are handled in core instead of pblk
> > -Pblk oob meta helpers use __le64 as input/output instead of u64
> > -Other minor fixes based on v1 patch review
> >
> > Igor Konopko (5):
> >   lightnvm: pblk: Move lba list to partial read context
> >   lightnvm: pblk: Helpers for OOB metadata
> >   lightnvm: Flexible DMA pool entry size
> >   lightnvm: Disable interleaved metadata
> >   lightnvm: pblk: Support for packed metadata
> >
> >  drivers/lightnvm/core.c  |  9 --
> >  drivers/lightnvm/pblk-core.c | 61 +++--
> >  drivers/lightnvm/pblk-init.c | 44 +--
> >  drivers/lightnvm/pblk-map.c  | 20 +++-
> >  drivers/lightnvm/pblk-rb.c   |  3 ++
> >  drivers/lightnvm/pblk-read.c | 66 +++-
> >  drivers/lightnvm/pblk-recovery.c | 25 +--
> >  drivers/lightnvm/pblk-sysfs.c|  7 +
> >  drivers/lightnvm/pblk-write.c|  9 +++---
> >  drivers/lightnvm/pblk.h  | 24 +--
> >  drivers/nvme/host/lightnvm.c |  6 ++--
> >  include/linux/lightnvm.h |  3 +-
> >  12 files changed, 209 insertions(+), 68 deletions(-)
> >
> > --
> > 2.14.5
> >